
Zuul: Gerrit's ssh event stream unavailable
Closed, Duplicate (Public)

Description

After 30 minutes Zuul still shows "Queue lengths: 0 events, 0 results", where I had hoped [1] and [2] would show some signs of life, so I can probably assume that Jenkins died again (after yesterday's incident, bug 49294).

I'll leave the severity to be decided by someone else, but it is getting a bit awkward to have issues with Jenkins two days in a row.

[1] https://gerrit.wikimedia.org/r/#/c/61171/

[2] https://gerrit.wikimedia.org/r/#/c/60092/


Version: wmf-deployment
Severity: critical

Details

Reference
bz49330

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 2:01 AM
bzimport set Reference to bz49330.
bzimport added a subscriber: Unknown Object (MLST).

Yesterday's issue was related to Gerrit having a full queue. Most probably the same applies today.

Seems we just need to restart Gerrit. I have mailed the operations team about it.

Until it is restarted no tests are going to be triggered. That is surely annoying to people submitting patches meanwhile, but I do not think it is worth paging the whole ops team overnight. After all, sites are still up :)

I will follow up with Chad next week to set up proper monitoring for the Gerrit/Zuul/Jenkins processing chain. We will also want to fix the root cause in Gerrit.
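As a rough illustration of what such monitoring might check, here is a minimal sketch of an idle-time watchdog for the event stream. The class name `EventStreamWatchdog`, its methods, and the 30-minute threshold are all hypothetical and not part of Zuul, Gerrit, or Jenkins; a real check would also need to allow for genuinely quiet periods when nobody uploads patches.

```python
import time


class EventStreamWatchdog:
    """Hypothetical helper: tracks when the last event arrived."""

    def __init__(self, max_idle_seconds, clock=time.monotonic):
        self.max_idle_seconds = max_idle_seconds
        self._clock = clock                # injectable so it can be tested
        self._last_event = clock()         # treat start-up as the first event

    def record_event(self):
        """Call this whenever an event is read from the stream."""
        self._last_event = self._clock()

    def idle_seconds(self):
        return self._clock() - self._last_event

    def is_stale(self):
        """True when no event has been seen for longer than the threshold."""
        return self.idle_seconds() > self.max_idle_seconds


# Usage with a fake clock (values in seconds; 1800 s = 30 min is an assumption).
fake_now = [0.0]
watchdog = EventStreamWatchdog(max_idle_seconds=1800, clock=lambda: fake_now[0])

fake_now[0] = 600.0
fresh = watchdog.is_stale()       # 10 minutes idle: under the threshold

fake_now[0] = 2400.0
stale = watchdog.is_stale()       # 40 minutes idle: over the threshold

watchdog.record_event()
recovered = watchdog.is_stale()   # an event just arrived
```

An alerting job could poll `is_stale()` and notify someone, rather than relying on patch submitters to notice that tests are not being triggered.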

Alexandros restarted Gerrit a few minutes after I sent the email to ops and confirmed the service is back up.

Unknown Object (User) added a comment. Jun 9 2013, 1:41 AM

Sorry, but tests are not running again, and this time I can't verify and merge because of the missing state/gate process.

For the record (I noticed it wasn't recorded in Bugzilla yet), the following is what Zuul reports in the log:

2013-06-09 04:49:18,781 ERROR gerrit.GerritWatcher: Exception on ssh event stream:
Traceback (most recent call last):
  File "./zuul/lib/gerrit.py", line 68, in _run
    self._listen(stdout, stderr)
  File "./zuul/lib/gerrit.py", line 52, in _listen
    self._read(stdout)
  File "./zuul/lib/gerrit.py", line 39, in _read
    data = json.loads(l)
  File "/usr/lib/python2.7/json/__init__.py", line 326, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
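The traceback shows `json.loads` raising `ValueError` on a line read from the ssh event stream. The following is a minimal sketch of how a reader could skip undecodable lines instead of letting the exception propagate. It is not Zuul's actual code: `parse_event_lines` is a hypothetical helper, the sample input is made up, and tolerating bad lines would not by itself fix a stream that has stopped delivering events.

```python
import io
import json


def parse_event_lines(lines):
    """Decode one JSON object per line, skipping lines that do not parse.

    Returns (events, skipped): decoded objects and the raw text of any
    non-empty line that could not be decoded.
    """
    events, skipped = [], []
    for line in lines:
        text = line.strip()
        if not text:
            continue                      # ignore blank lines
        try:
            events.append(json.loads(text))
        except ValueError:                # also covers json.JSONDecodeError
            skipped.append(text)
    return events, skipped


# Illustrative input only: two well-formed events, a blank line, and garbage.
sample = io.StringIO(
    '{"type": "patchset-created"}\n'
    '\n'
    'not json\n'
    '{"type": "comment-added"}\n'
)
events, skipped = parse_event_lines(sample)
```

Recording the skipped lines, rather than discarding them silently, keeps the evidence needed to work out what the stream actually sent.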

(In reply to comment #6)

And seems broken again...

Restarted again.

(In reply to comment #2)

Seems we just need to restart Gerrit.

That's only a temporary solution (about 12 hours at most). Chad has said on the engineering list he'll work on this tomorrow.

*** Bug 49294 has been marked as a duplicate of this bug. ***

I assume the fact that no Bugzilla notifications of new changesets are being published is related to this as well?

(In reply to comment #10)

I assume that no Bugzilla notifications of new changesets are published is
related to this as well?

That seems unrelated. I filed a separate bug for it: bug 49388

This ends up being a dupe of bug 46917 "Gerrit no more emit events when using stream-events", where the ssh connection between Zuul and Gerrit goes down because of a timeout and no events are ever sent again for new connections.

*** This bug has been marked as a duplicate of bug 46917 ***