Page MenuHomePhabricator

Zuul: Gerrit's ssh event stream unavailable
Closed, ResolvedPublic

Description

Since at 2013-04-05 04:29:46 UTC, Zuul no more receives events from Gerrit. It uses a CLI command such as:

ssh -p 29418  jenkins-bot@manganese.wikimedia.org gerrit stream-events

I did stream-events using both the jenkins-bot user and my own account 'hashar'. While submitting a new patchset or posting a comment: nothing is reported.

The end effect is that no jobs are triggered in Jenkins.


Summary (13/12/2017)

The sshd daemon (Mina) would lost connection on one way connections and hitting some kind of timeout. That affected Zuul which idles with a gerrit stream-events.

Gerrit sshd.idleTimeout got set to 10 days as a workaround https://gerrit.wikimedia.org/r/#/c/58126/

Gerrit patch:

https://gerrit-review.googlesource.com/c/gerrit/+/44472 (79b91bfe) is the Gerrit patch.

Details

Reference
bz46917

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 1:15 AM
bzimport added a project: Gerrit.
bzimport set Reference to bz46917.
bzimport added a subscriber: Unknown Object (MLST).

This is a major regression in our development workflow.

Gerrit has been upgraded with https://gerrit.wikimedia.org/r/#/c/57508/
Which got merged on Apr 4, 2013 20:29 (UTC).

Apparently Gerrit got restarted at 20:44 (UTC), the Zuul log shows errors 500 at that time.

The last event received by Zuul was on 2013-04-05 04:26:56 (UTC) then the JSON events cant be decoded / are not received:

2013-04-05 04:29:46,092 ERROR gerrit.GerritWatcher: Exception on ssh event stream:
Traceback (most recent call last):

File "/usr/local/lib/python2.7/dist-packages/zuul-0.0.151.e9d929a-py2.7.egg/zuul/lib/gerrit.py", line 68, in _run
  self._listen(stdout, stderr)
File "/usr/local/lib/python2.7/dist-packages/zuul-0.0.151.e9d929a-py2.7.egg/zuul/lib/gerrit.py", line 52, in _listen
  self._read(stdout)
File "/usr/local/lib/python2.7/dist-packages/zuul-0.0.151.e9d929a-py2.7.egg/zuul/lib/gerrit.py", line 39, in _read
  data = json.loads(l)
File "/usr/lib/python2.7/json/__init__.py", line 326, in loads
  return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
  obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
  raise ValueError("No JSON object could be decoded")

ValueError: No JSON object could be decoded

Restarted Zuul just in case: nothing. Definitely on Gerrit side.

Ariel restarted Gerrit on manganese. That unbroken the stream-events thus cleaning the issue.

I am leaving this bug open for monitoring / investigation. There must be some regression in Gerrit since we upgraded it yesterday.

(In reply to comment #4)

Ariel restarted Gerrit on manganese. That unbroken the stream-events thus
cleaning the issue.

Thanks. Lowering priority.

Restarted again, since stram-events hung. The output from the queue is kind of weird & concerning:

$ ssh gerrit.wikimedia.org gerrit show-queue -w

Task State Command

9c0dc448 com.google.gerrit.sshd.commands.StreamEvents$3@6435207f
85d6cd95 com.google.gerrit.sshd.commands.StreamEvents$3@5be8a53e
c54125b8 com.google.gerrit.sshd.commands.StreamEvents$3@6b8b896
a0d428db com.google.gerrit.sshd.commands.StreamEvents$3@495e834c
0323c3c0 com.google.gerrit.sshd.commands.StreamEvents$3@5bfe1835
49cc90ac com.google.gerrit.sshd.commands.StreamEvents$3@6e76ec2
e9bf84a4 com.google.gerrit.sshd.commands.StreamEvents$3@44188618
4617a0ff com.google.gerrit.sshd.commands.StreamEvents$3@260bb495
5e8f7686 com.google.gerrit.sshd.commands.StreamEvents$3@4a7cb5b6
fe9f4ab9 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@40811f5d
3e416208 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@7194d12c
1e449e17 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@5d3233c0
de9a86cc waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@7efe402f
9ea40e12 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@ae3b20d
bea9d2e8 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@6abe7910
be3472ba waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@6130d9f5
5e4a1628 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@4e3595ac
1915a152 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@417de5ee
3943c55a waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@1ff57751
9938b1f3 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@5a31e6f7
f911ad68 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@2329064
593eb9dd waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@f04a851
9982f1fc waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@30828dd9
665afa9e waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@2c6c773b
066e9e41 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@507d946
23f714d9 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@4b2eba70
0ff474d2 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@68d2359f
6fc0906b waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@40be8f7c
835cc5a8 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@1ebe3719
63619163 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@5a135a13
c3563d8b waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@2c7ec398
a35b09b3 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@365b46fa
0370d533 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@2290e2f7
e36ca160 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@5624d7db
436a4d46 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@cd7970c
23671980 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@1dfd9b6a
63fd11f8 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@74e65b87
83fda551 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@3f833806
5d722e74 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@3fd3327d
bd77aa62 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@4297cd47
9d7c2648 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@7c261278
bdab2a74 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@2786c3f5
baa810b1 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@ab1e19
7f44cb17 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@6244fba6
bf4ec337 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@194923d6
df493731 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@10cfc392
ff2f5b61 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@4649853e
1f334fd3 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@4cacd7f
7f160b2c waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@1ee581b1
046d445a waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@6da59b4f
64b8c066 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@137797c6
848db42c waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@108d5226
4926f573 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@2e0c4547
8bfdc551 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@1e794f4e
2b8db900 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@2e459983
6b02d146 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@27984fef
cb073d54 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@5b617f30
ab0c4979 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@5de5572e
eb0d6177 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@45a5965c
0b11d594 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@7a3295bc
783069a2 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@a7e838
46810952 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@f7eb4de
267e5542 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@7bbebd2a
868b8174 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@27754fde
c694199b waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@6ce8c0f
6688cd79 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@624d3046
c6e3b962 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@479462c4
06f1d196 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@5ee05692
554535c0 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@3830aedd
b235a762 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@4284bfd
f23f9f40 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@399c7615
923ae350 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@4f0c7a3f
d244dbcb waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@7ac6658a
3249b7e2 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@773485fe
7253af98 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@3c2de6a4
1246f3d4 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@698137d9
5250eb8b waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@5a53bbe7
5288eb78 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@2f556905
cc3d20e4 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@1061df8a
2c32bc16 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@2279bdac
0c2fb8ac waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@669378e2
0c5a59a0 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@66280ecb
6c4e75e4 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@4fe365d2
4c4b71d4 waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@99ae56d
f650f82e waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@184c5c08
e71cc87a waiting .... com.google.gerrit.sshd.commands.StreamEvents$3@3d2e1dae
d36dcb8e 12:18:22.186 push gerritslave@formey.wikimedia.org:/var/lib/gerrit2/review_site/git/mediawiki/extensions/FundraisingEmailUnsubscribe.git

da7faee1 19:42:45.214 Log File Compressor

88 tasks

The Zuul log is full of the following:

2013-04-06 17:02:55,441 ERROR gerrit.GerritWatcher: Exception on ssh event stream:
Traceback (most recent call last):

File "/usr/local/lib/python2.7/dist-packages/zuul-0.0.151.e9d929a-py2.7.egg/zuul/lib/gerrit.py", line 68, in _run
  self._listen(stdout, stderr)
File "/usr/local/lib/python2.7/dist-packages/zuul-0.0.151.e9d929a-py2.7.egg/zuul/lib/gerrit.py", line 52, in _listen
  self._read(stdout)
File "/usr/local/lib/python2.7/dist-packages/zuul-0.0.151.e9d929a-py2.7.egg/zuul/lib/gerrit.py", line 39, in _read
  data = json.loads(l)
File "/usr/lib/python2.7/json/__init__.py", line 326, in loads
  return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
  obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
  raise ValueError("No JSON object could be decoded")

ValueError: No JSON object could be decoded

Changing back to git/gerrit. Zuul has no issue, it is Gerrit holding the events and not emitting them.

Stuck again but this time show-queue gives me:

Task State Command

5229a073 git-upload-pack '/mediawiki/extensions/VisualEditor' (shirayuki)

aad42f27 07:18:57.201 inside-wmf/mediawiki/extensions/FundraisingEmailUnsubscribe

Logged a message on IRC about it and pinged ops list.

Although I cannot reliably reproduce the problem, the culprit seems to
be gerrit's version bump of Apache SSHD to 0.6.0, which comes with
closing of idle ssh connections.

Upon 10 minutes of inactivity, gerrit disconnects Zuul, and Zuul has
to reconnect. Gerrit accepts the connection, but sometimes refuses to
stream events over the new connection. There seems to be some race
condition in gerrit when Zuul reconnects.

We can mitigate the problem by having at least 1 commit every 10
minutes... Go, programmers go :-))

To also have a solution until we get that commit density 24/7,
https://gerrit-review.googlesource.com/#/c/44351/
got submitted to upstream gerrit.

(We cannot simply downgrade gerrit to the last known good version, as
that would require database schema downgrading, which currently is not
supported by gerrit)

Raising importance, this needs fixing ASAP. We need testing. Not just to save work, but because we (and by design and intention) rely on it automatically happen and performing the merge action on approval.

Also the lack of postmerge events is blocking (auto-deploy for certain projects).

(In reply to comment #10)

https://gerrit-review.googlesource.com/#/c/44351/
got submitted to upstream gerrit.

It hasn't been reviewed / merged yet however. How long are we willing to wait? Can we apply it to our fork for now?

Jenkins being down is causing lots of workflow problems throughout the different projects.

For the context:

Zuul is a python daemon running on gallium. It establish a ssh connection as jenkins-bot on manganese.wikimedia.org and runs the 'gerrit stream-events' command. It expect JSON object to be returned.

Looking at the Zuul log (fgrep GerritWatcher /var/log/zuul/zuul.log), it does emit a stacktrace every 10 minutes:

2013-04-08 05:50:48,028 ERROR gerrit.GerritWatcher: Exception on ssh event stream:
2013-04-08 06:00:53,176 ERROR gerrit.GerritWatcher: Exception on ssh event stream:
2013-04-08 06:10:58,266 ERROR gerrit.GerritWatcher: Exception on ssh event stream:
2013-04-08 06:21:03,383 ERROR gerrit.GerritWatcher: Exception on ssh event stream:
2013-04-08 06:31:08,472 ERROR gerrit.GerritWatcher: Exception on ssh event stream:
2013-04-08 06:41:13,594 ERROR gerrit.GerritWatcher: Exception on ssh event stream:
2013-04-08 06:51:18,687 ERROR gerrit.GerritWatcher: Exception on ssh event stream:
2013-04-08 07:01:23,782 ERROR gerrit.GerritWatcher: Exception on ssh event stream:

A full trace is:

2013-04-08 00:12:10,016 ERROR gerrit.GerritWatcher: Exception on ssh event stream:
Traceback (most recent call last):

File "/usr/local/lib/python2.7/dist-packages/zuul-0.0.151.e9d929a-py2.7.egg/zuul/lib/gerrit.py", line 68, in _run
  self._listen(stdout, stderr)
File "/usr/local/lib/python2.7/dist-packages/zuul-0.0.151.e9d929a-py2.7.egg/zuul/lib/gerrit.py", line 52, in _listen
  self._read(stdout)
File "/usr/local/lib/python2.7/dist-packages/zuul-0.0.151.e9d929a-py2.7.egg/zuul/lib/gerrit.py", line 39, in _read
  data = json.loads(l)
File "/usr/lib/python2.7/json/__init__.py", line 326, in loads
  return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
  obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
  raise ValueError("No JSON object could be decoded")

ValueError: No JSON object could be decoded

Which is stream-events sending some invalid JSON, most probably something like "Connection timed out after 10 minutes". It seems Zuul properly reconnect the ssh connection.

Whenever the stream-events for Zuul, it is also blocked for any other user even on a fresh new connection. For example, I could run the stream-events command using my own account, fill a comment in Gerrit and I get no event on the channel.

Related URL: https://gerrit.wikimedia.org/r/58059 (Gerrit Change Ie0b3b0a5d77a2192abd09a934e780b8e622a1edd)

Related URL: https://gerrit.wikimedia.org/r/58060 (Gerrit Change Ie0b3b0a5d77a2192abd09a934e780b8e622a1edd)

Related URL: https://gerrit.wikimedia.org/r/58063 (Gerrit Change I431ff970596aae03b3ab5276303ed18574892b13)

Until we can deploy a new gerrit build, we prevent timeouts by forcing a
recheck on
https://gerrit.wikimedia.org/r/#/c/58060/
every 4 minutes.

(In reply to comment #12)

(In reply to comment #10)

https://gerrit-review.googlesource.com/#/c/44351/
got submitted to upstream gerrit.

It hasn't been reviewed / merged yet however. How long are we willing to
wait?
Can we apply it to our fork for now?

Jenkins being down is causing lots of workflow problems throughout the
different projects.

We don't have a fork, we use vanilla Gerrit :)
But yes, I'm looking into deploying this now rather than waiting on upstream to merge.

Deployed the upstream patch just now.

  • Bug 47004 has been marked as a duplicate of this bug. ***

Lowering priority and severity since we have a workaround now.

(In reply to comment #21)

Lowering priority and severity since we have a workaround now.

Now actually doing so for priority too.

  • Bug 49330 has been marked as a duplicate of this bug. ***

We previously rolled back to a gerrit that exposed the problem again :-(

In case we ever do this again: To keep the zuul<->gerrit connection
alive, it suffices to have at least one event fed to zuul each 10
minutes.

So for example clone our gerrit-ping repo.
Restore

https://gerrit.wikimedia.org/r/#/c/58060/

Fetch patch set 1.
And run

while true ; do date ; ssh gerrit.wikimedia.org gerrit review -m "'recheck'" --verified 0 75be73703978e75a25d5e3bb5685c51580af3060 ; sleep 240 ; done

in a shell (I had ssh configured to automatically supply username and gerrit port).

Gerrit 2.6-rc0-154-gfcdb34b is the first version that we deployed and
contained the fix.

(In reply to comment #24)

We previously rolled back to a gerrit that exposed the problem again :-(

To expand on this, last week when we tried upgrading to 2.7-rc1-424-gef469ac we hit a bunch of problems so we rolled back. However, I rolled too far back to 2.6-rc0-144-gb1dadd2.war instead of 2.6-rc0-154-gfcdb34b which lacked the SSH timeout fix mentioned above.