
mw/core gating jobs are delayed during peak hours
Closed, Resolved (Public)

Description

On Thursday, April 25th, the mw/core.git Jenkins jobs were slow to report back in Gerrit. Example change: https://gerrit.wikimedia.org/r/#/c/60765/ for which gate-and-submit took a good 20 minutes to report back.

I suspect other jobs were running at that time, such as the parser tests, which take a good 5 minutes to run. So if a lot of changes have been submitted (say X of them), the most recent change would finish its run about (X+1) * 5 minutes after submission.
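The estimate above can be sketched as a small shell helper. This is only an illustration: `estimate_wait_minutes` is a hypothetical name, not part of Zuul or Jenkins, and it assumes a single execution slot, a strict first-in-first-out queue, and a constant job duration (5 minutes by default).

```shell
#!/bin/sh
# Hypothetical helper, not part of Zuul or Jenkins.
# Assumes one execution slot, FIFO ordering and a fixed job duration.
# $1: number of changes already queued ahead (the X in the text)
# $2: minutes per job run (optional, defaults to 5)
estimate_wait_minutes() {
    queued=$1
    job_minutes=${2:-5}
    echo $(( (queued + 1) * job_minutes ))
}

estimate_wait_minutes 3    # prints 20
estimate_wait_minutes 10   # prints 55
```

With three changes queued ahead, the estimate comes out at 20 minutes, the same order of magnitude as the delay seen on change 60765. The actual queue depth at that moment is not known from this report.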

This is due to the mediawiki-core-phpunit-parser job being shared among all pipelines. The patchsets being sent (test pipeline) and the ones asked to be merged (gate-and-submit pipeline) end up racing for an execution slot in Jenkins.


Version: wmf-deployment
Severity: normal

Details

Reference
bz47724

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 22 2014, 1:42 AM)
bzimport set Reference to bz47724.
bzimport added a subscriber: Unknown Object (MLST).

Change 60765 got reported at 21:49 UTC.

The build start times around that time:

$ cd /var/lib/jenkins/jobs/mediawiki-core-phpunit-parser/builds
$ grep -o -P 'ZUUL_PIPELINE=[\w-]+' 2013-04-25_2*/log
2013-04-25_20-01-46/log:ZUUL_PIPELINE=gate-and-submit
2013-04-25_20-05-11/log:ZUUL_PIPELINE=gate-and-submit
2013-04-25_20-16-14/log:ZUUL_PIPELINE=gate-and-submit
2013-04-25_20-30-57/log:ZUUL_PIPELINE=test
2013-04-25_20-35-53/log:ZUUL_PIPELINE=gate-and-submit
2013-04-25_20-39-39/log:ZUUL_PIPELINE=test
2013-04-25_20-43-45/log:ZUUL_PIPELINE=test
2013-04-25_20-53-27/log:ZUUL_PIPELINE=gate-and-submit
2013-04-25_21-06-31/log:ZUUL_PIPELINE=gate-and-submit
2013-04-25_21-12-16/log:ZUUL_PIPELINE=gate-and-submit
2013-04-25_21-27-54/log:ZUUL_PIPELINE=test
2013-04-25_21-34-54/log:ZUUL_PIPELINE=gate-and-submit
2013-04-25_21-41-44/log:ZUUL_PIPELINE=gate-and-submit
2013-04-25_21-46-01/log:ZUUL_PIPELINE=gate-and-submit
2013-04-25_21-50-01/log:ZUUL_PIPELINE=gate-and-submit
2013-04-25_21-54-25/log:ZUUL_PIPELINE=gate-and-submit
2013-04-25_21-59-02/log:ZUUL_PIPELINE=test
2013-04-25_22-05-25/log:ZUUL_PIPELINE=gate-and-submit
2013-04-25_22-09-11/log:ZUUL_PIPELINE=test
2013-04-25_22-25-30/log:ZUUL_PIPELINE=test
2013-04-25_22-42-05/log:ZUUL_PIPELINE=test
2013-04-25_23-26-14/log:ZUUL_PIPELINE=test
2013-04-25_23-48-50/log:ZUUL_PIPELINE=gate-and-submit
$
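To see how the slot was split between the two pipelines, the grep output can be tallied. A minimal sketch, assuming the output format shown in the listing; `count_by_pipeline` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: tally builds per Zuul pipeline.
# Reads lines such as
#   2013-04-25_20-01-46/log:ZUUL_PIPELINE=gate-and-submit
# on stdin and prints "<pipeline> <count>" per pipeline.
count_by_pipeline() {
    sed -n 's/.*ZUUL_PIPELINE=\([A-Za-z0-9_-]*\).*/\1/p' \
        | sort | uniq -c | awk '{print $2, $1}'
}

# Demo on three lines taken from the listing.
count_by_pipeline <<'EOF'
2013-04-25_20-16-14/log:ZUUL_PIPELINE=gate-and-submit
2013-04-25_20-30-57/log:ZUUL_PIPELINE=test
2013-04-25_20-35-53/log:ZUUL_PIPELINE=gate-and-submit
EOF
# prints:
# gate-and-submit 2
# test 1
```

Piping the original `grep -o -P 'ZUUL_PIPELINE=[\w-]+' 2013-04-25_2*/log` command into `count_by_pipeline` gives, for the 23 builds listed, 14 gate-and-submit builds and 9 test builds.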

As we can see, a lot of builds were run in a short amount of time. The gate-and-submit builds ran on both the master and wmf branches.
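The build density can be made explicit by grouping the same grep output per hour. A sketch under the assumption that each line starts with the build directory name in the `YYYY-MM-DD_HH-MM-SS` form shown in the listing (that is, grep was run from inside the `builds` directory); `count_by_hour` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: count build starts per hour.
# Assumes each stdin line begins with a build directory named
# YYYY-MM-DD_HH-MM-SS, as in the listing; prints "<HH> <count>".
count_by_hour() {
    sed -n 's/^[0-9-]*_\([0-9][0-9]\)-.*/\1/p' \
        | sort | uniq -c | awk '{print $2, $1}'
}

# Demo on three lines taken from the listing.
count_by_hour <<'EOF'
2013-04-25_20-53-27/log:ZUUL_PIPELINE=gate-and-submit
2013-04-25_21-06-31/log:ZUUL_PIPELINE=gate-and-submit
2013-04-25_21-12-16/log:ZUUL_PIPELINE=gate-and-submit
EOF
# prints:
# 20 1
# 21 2
```

Applied to the full listing, this gives 8 builds starting in the 20:00 hour, 9 in the 21:00 hour, 4 in the 22:00 hour and 2 in the 23:00 hour. At roughly 5 minutes per parser run, 8 to 9 builds take up most of an hour on a single slot.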

Maybe I should switch Zuul to use a DependentPipelineManager for gate-and-submit. That would only run tests for the most recent gated change and merge them all if the tests succeed.

The gate-and-submit pipeline should be made a DependentPipeline, which is tracked as bug 48419.

Also, Zuul is spamming Gerrit with change requests: https://review.openstack.org/#/c/27411/

This is still happening, especially around 10pm (CET) when l10nbot sends half a thousand changes.

This is less of an issue right now: the merge jobs are no longer triggered, which helped with the load spike.

This got fixed by several changes, namely:

  • l10nbot no longer triggers jobs
  • jobs now run concurrently
  • we have more slaves in Jenkins
  • Zuul got upgraded and reports faster