
[Worked around] Zuul slow to report back to Gerrit
Closed, Resolved (Public)

Description

For a few days now, Zuul has been slow to report completed builds back to Gerrit. There are most probably several root causes:

  • When a change is being submitted, Zuul is locked. If Gerrit is slow to merge, the whole process stays locked until the change is merged.
  • Zuul does not seem to handle LOST builds properly, especially when the LOST build is the last of a set of jobs. It seems to consider the change still in progress but never reports it, since the result is neither FAIL nor SUCCESS.
  • Zuul ran a lot of git remote update commands. I reverted that patch an hour ago.
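
The second suspicion (LOST builds) can be illustrated with a minimal sketch. This is not Zuul's actual code: `Build`, `TERMINAL_RESULTS` and `is_reportable` are hypothetical names, and the result strings are illustrative. It assumes a reporter that only treats success and failure as final, so a change whose last build is LOST never becomes reportable.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set

# Hypothetical: the results the reporter treats as final.
# LOST is deliberately absent to mimic the suspected behaviour.
TERMINAL_RESULTS: Set[str] = {"SUCCESS", "FAILURE"}


@dataclass
class Build:
    job: str
    result: Optional[str] = None  # None while the build is still running


def is_reportable(builds: Iterable[Build], terminal: Set[str] = TERMINAL_RESULTS) -> bool:
    """A change is reportable once every build has a result considered final."""
    return all(b.result in terminal for b in builds)


builds = [Build("lint", "SUCCESS"), Build("phpunit", "LOST")]

# With LOST not counted as final, the change stays in the queue forever.
stuck = is_reportable(builds)

# Counting LOST as final lets the change be reported.
fixed = is_reportable(builds, TERMINAL_RESULTS | {"LOST"})
```

If something like this is what happens, treating LOST as a final result (or retrying the lost build) would let the change leave the queue.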

Zuul usually becomes stuck between 8pm and 11pm GMT, which are the busy hours: European volunteers are very active, the i18n bot is sending a lot of patches, and San Francisco is having a productive morning.

The signs of slowness are:

  • https://integration.wikimedia.org/zuul/status shows a lot of changes with all builds completed
  • Jenkins takes a long time to report back to Gerrit, even for very simple checks (such as the ones on operations/puppet.git or translatewiki.net).

I have no idea what the fix is, but upgrading Zuul will probably help. The new version of Zuul depends on a Python module which is not available in Ubuntu Precise. I have packaged it, and it is pending review/merge/deployment (see bug 44061).


Version: unspecified
Severity: major

Details

Reference
bz46176

Event Timeline

bzimport raised the priority of this task to Lowest. (Nov 22 2014, 1:20 AM)
bzimport set Reference to bz46176.

On March 18th, it took roughly 1 hour and 10 minutes to get a build report for https://gerrit.wikimedia.org/r/#/c/54513/ . The jobs completed successfully a few minutes after patch submission, but then stayed in the status queue until they were reported.

I will cherry-pick that patch from upstream and deploy it when Zuul/Gerrit is quiet (i.e. during the European morning).

git cherry-pick 263fba9
git push wikimedia HEAD:master

I have deployed the new Zuul version which is upstream ff79197 + the patch "Give the result event queue priority.". The current sha1 is e9d929a.
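
For readers unfamiliar with the idea behind the patch title, here is a minimal sketch of what giving a result event queue priority could look like. It is an illustration under assumptions, not the upstream change itself: `drain` and the event names are hypothetical, and the model is simply two queues where pending result events are always served before trigger events.

```python
from collections import deque
from typing import Deque, List


def drain(trigger_events: Deque[str], result_events: Deque[str]) -> List[str]:
    """Process all events, always serving pending result events first."""
    order = []
    while trigger_events or result_events:
        if result_events:
            # Completed builds are handled before any new trigger,
            # so reports are not starved by incoming patches.
            order.append(result_events.popleft())
        else:
            order.append(trigger_events.popleft())
    return order


triggers = deque(["patchset-1", "patchset-2", "patchset-3"])
results = deque(["build-done-A", "build-done-B"])
order = drain(triggers, results)
```

In this model a burst of new patchsets during rush hours cannot delay the handling of builds that have already finished.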

That will most probably fix the issue. I will monitor it during the next rush hours.

The issue has been worked around with an upstream cherry-pick. I am thus lowering the priority.

This bug will be closed whenever Zuul is upgraded (bug 46354)

Lowest priority since we have a workaround.

The issue has been solved by the workaround (cherry-picked an upstream change).

I upgraded Zuul (bug 46354) a few minutes ago and it includes the change, so there is nothing left to do :)

This happened again on Thursday April 25th. Example change https://gerrit.wikimedia.org/r/#/c/60765/

Took a good 20 minutes for gate-and-submit to report back.

One possibility is that other jobs were running at that time, such as the parser tests, which take a good 5 minutes to run. So if a lot of changes have been submitted ahead of yours (say X), the most recent change would only finish running about (X+1) * 5 minutes after submission.
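
The estimate above can be written out as a small sketch. It assumes the simple model from the comment: changes are tested strictly one after another and each takes about 5 minutes. The function name `minutes_until_report` is hypothetical.

```python
def minutes_until_report(changes_ahead: int, job_minutes: float = 5.0) -> float:
    """Estimated wait for the newest change: (X + 1) * job duration.

    Assumes changes are processed serially, one at a time.
    """
    return (changes_ahead + 1) * job_minutes


# Estimated wait for a few queue depths.
for x in (0, 3, 10):
    print(f"{x} changes ahead: about {minutes_until_report(x):.0f} minutes")
```

Under this model, 3 changes ahead already gives an estimate of 20 minutes.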

Zuul has been upgraded and got some performance improvements. We had some issues, such as over-querying Jenkins from time to time.