
Zuul is ultra slow post Jenkins upgrade
Closed, Resolved · Public

Authored By
hashar, May 2 2013, 10:19 PM
Referenced Files
F10772: threadDump2.txt (Nov 22 2014, 1:34 AM)
F10771: threadDump1.txt (Nov 22 2014, 1:34 AM)
F10769: jstack-f-m-l.txt (Nov 22 2014, 1:34 AM)
F10770: file_48025.txt (Nov 22 2014, 1:34 AM)
F10768: zuul-submit-description.txt (Nov 22 2014, 1:34 AM)

Description

Jenkins was slow to restart (bug 47120) and got upgraded. Since then, Zuul has been very slow to trigger builds.

After investigating with upstream (James E. Blair from OpenStack), it turns out Zuul waits for the results of JSON API calls which are meant to verify whether a job exists before triggering it.

We have found a few API calls that took 2 to 4 minutes, which would lock Zuul's internal scheduler. In the end, Zuul would spend most of its time waiting.

This is a major issue.

A workaround is to skip the API call entirely. James advised faking the ExtendedJenkins class in Zuul so it pretends jobs always exist.

https://gerrit.wikimedia.org/r/#/c/62095/
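
A minimal sketch of that workaround, assuming Zuul's ExtendedJenkins wraps the python-jenkins Jenkins class (the subclass below is illustrative, not the literal content of the change):

import jenkins  # python-jenkins

class AlwaysExistsJenkins(jenkins.Jenkins):
    # Skip the slow JSON API round-trip: claim every job exists and let
    # Jenkins itself reject the build request if the job is really missing.
    def job_exists(self, name):
        return True

The trade-off is that a genuinely missing job only surfaces when the build request itself fails.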


Version: wmf-deployment
Severity: major

Details

Reference
bz48025

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 1:34 AM
bzimport set Reference to bz48025.

Yay for working overnight :-]

The workaround did get rid of the calls to the slow API.

Looking at the Apache logs in /var/log/apache2/integration_access.log, I noticed it took up to 8 seconds to update a job description. Zuul does that very often, and it prevents Zuul from doing anything else.

Jenkins had at least one thread at 100% CPU, and the main java process was reporting 100% CPU too. I have restarted Jenkins.

Tips to investigate:

tail -f /var/log/apache2/integration_access.log | grep Python

That will show the Zuul requests to Jenkins. The timestamp is when the query started, so any backward jump in time indicates a query that took a long time. On completion, the POST receives a 302, which in turn triggers a GET of the job page.

Example:

02/May/2013:23:17:53
"POST /ci/job/parsoid-parsertests-run-harder/65/submitDescription 302"
02/May/2013:23:17:53
"GET /ci/job/parsoid-parsertests-run-harder/65/ 200"

That one was fast.
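
That start-time quirk can be turned into a quick detector. A small sketch, assuming every request line begins with a 02/May/2013:23:17:53-style timestamp as above:

import sys
from datetime import datetime

# Log entries are ordered by completion time but stamped with their start
# time, so a timestamp that jumps backwards marks a long-running request.
previous = None
for line in sys.stdin:
    try:
        started = datetime.strptime(line.split()[0], "%d/%b/%Y:%H:%M:%S")
    except (IndexError, ValueError):
        continue  # not a request line
    if previous is not None and started < previous:
        print("slow request:", line.strip())
    previous = started if previous is None else max(previous, started)

Piping the tail | grep output from above through it flags the offenders as they happen.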

Before triggering a job, Zuul attempts to look up whether the job exists. It does that through the python-jenkins module, which hits an API call that is very slow on our setup. I have overridden the existence check with Gerrit Change #62095.

For now it seems Jenkins is more or less working after the last restart.

This got fixed for now after a looong night of tweaking.

The API calls were fetching the full console log, which is definitely slow. That one got fixed by Ori and merged upstream. https://review.openstack.org/#/c/28128/

The Jenkins API call was hitting a URL that triggered a full parse of the build history for that job. The workaround, as a live hack, was to let Zuul pretend the job always exists. https://gerrit.wikimedia.org/r/#/c/62095/

Two threads in Jenkins were waiting for each other (as I understand it), causing overall slowness. That got resolved after yet another restart.

For now it seems to be stable. We will see how bad it is after a restart and a slow start, maybe something to try out in the morning.

Lowering priority for now.

There is another issue that has triggered a few times this week: updating the job description is suddenly slow.

Ganglia shows a CPU at 100%; top -i followed by pressing H shows a Jenkins thread at 100%.

Looking at the apache proxy that handles Jenkins access, one can monitor what Zuul is doing with Jenkins:

tail -f /var/log/apache2/integration_access.log | grep Python-urllib

It shows Zuul updating the build descriptions with requests such as:

POST /ci/job/mediawiki-core-qunit/3359/submitDescription  302
GET /ci/job/mediawiki-core-qunit/3359/  200

They take from 3 to 12 seconds, and sometimes even more. That is the root cause of the slowness.

Whenever a build starts or emits a result, Zuul updates the description of all the builds that have been triggered by that change. So given a pipeline of 10 jobs, that is potentially a lot of description updates. Zuul is also locked waiting for Jenkins to answer, which in turn makes it very slow to process events.
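
Back of the envelope, with illustrative numbers matching the figures above:

# Assumes one description rewrite of every build per build event.
jobs = 10                 # jobs triggered for one change
events = 2 * jobs         # each job emits a start and a completion event
updates = events * jobs   # every event rewrites all ten descriptions
print(updates)                                    # 200 blocking POSTs
print(updates * 3, "-", updates * 12, "seconds")  # 600 - 2400 seconds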

Created attachment 12281
Apache log showing Zuul updating build descriptions

The attached log file shows Zuul updating the job descriptions in Jenkins. That is quite a lot of HTTP requests. In this case they are fast, though.

Related URL: https://gerrit.wikimedia.org/r/63064 (Gerrit Change Ie33e1adb8d9facab8c57376cdd418baf0d8a208a)

https://gerrit.wikimedia.org/r/63068 enables the Jenkins access log to track the requests made by Zuul.

https://gerrit.wikimedia.org/r/63069 makes Zuul access Jenkins via http://127.0.0.1 instead of the public interface. That saves the SSL negotiation and the Apache overhead.
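
For reference, the loopback switch is just a change of the Jenkins URL in zuul.conf. A sketch, assuming the [jenkins] section layout of the Zuul version of that era; the key names, port, path, and user are from memory / illustrative, not copied from the change:

[jenkins]
# Loopback access skips the SSL negotiation and the Apache proxy.
server=http://127.0.0.1:8080/ci
user=zuul-bot
apikey=...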

We now have the Jenkins access.log, and Zuul querying localhost directly.

I have pinged wikitech-l about the two crashes that happened today related to that issue: http://lists.wikimedia.org/pipermail/wikitech-l/2013-May/069261.html

This happened again overnight. Will attach some debugging output.

Created attachment 12286
jstack -F -m -l using the PID of the stalled thread

Created attachment 12287
jmap using the PID of the stalled thread

Created attachment 12288
Jenkins threaddump 1

Created attachment 12289
Jenkins threaddump 2

Jenkins has an internal thread dumper, available via the web interface at https://integration.wikimedia.org/ci/threadDump and dumped into /var/log/jenkins/jenkins.log. The stalled thread is still 24524.

The two attached thread dumps were taken ten minutes apart.
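
The dump can also be grabbed from the command line. A sketch, assuming anonymous read access to the instance (otherwise an API token would be needed):

import urllib.request

# Fetch Jenkins' built-in thread dump page (HTML; the raw dump also
# lands in /var/log/jenkins/jenkins.log as noted above).
url = "https://integration.wikimedia.org/ci/threadDump"
print(urllib.request.urlopen(url).read().decode("utf-8", "replace"))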

The stalled thread has the id 24524 (0x5fcc in hex). It shows up in both thread dumps as:

"VM Thread" prio=10 tid=0x00007ff398079000 nid=0x5fcc runnable

Last week Timo had the Java Melody plugin installed on Jenkins, which lets us monitor the java process. The entry page is https://integration.wikimedia.org/ci/monitoring

We can get a heap histogram via https://integration.wikimedia.org/ci/monitoring?part=heaphisto

After a bit more investigation: our build.xml files contain references to org.jvnet.hudson.plugins.DownstreamBuildViewAction, a plugin that displayed the dependencies between builds. When upgrading Jenkins I removed that plugin, but the build files still contain references to it, such as:

<org.jvnet.hudson.plugins.DownstreamBuildViewAction>
  <build class="build" reference="../../.."/>
</org.jvnet.hudson.plugins.DownstreamBuildViewAction>

When Jenkins parses the build history, it detects that this class is no longer used and logs a message which is kept in memory. The data can be discarded manually via 'Manage old data': https://integration.wikimedia.org/ci/administrativeMonitor/OldData/manage

With thousands of builds in the history, the old data store ends up holding too many items for the Java heap size, which most probably caused the memory exhaustion.

The slow way to solve it is to ask Jenkins to parse the history files for each job and then manually discard all the data. The fast / hard way is to parse all the build.xml files and remove the XML snippet directly.

And the super long filtering command:

# For every build.xml that mentions DownstreamBuild, delete the whole
# <org.jvnet.hudson.plugins.DownstreamBuildViewAction> element in place,
# keeping a '-back' backup of each modified file.
find . -name build.xml \
  -exec fgrep -q 'DownstreamBuild' {} \; \
  -exec sed -i-back \
  '/ <org.jvnet.hudson.plugins/,/\/org.jvnet.hudson.plugins.DownstreamBuildViewAction>/d' {} \;

Running the above command on mediawiki-core-lint.

Launched for all jobs using find . -wholename '*/builds/*/build.xml'

That is running in a tmux window on gallium.

All build.xml files now have the org.jvnet.hudson.plugins.DownstreamBuildViewAction snippet removed.

Sent a summary on wikitech-l http://lists.wikimedia.org/pipermail/wikitech-l/2013-May/069303.html

I will monitor this bug until Monday (included), then probably close it.

That issue is solved for now. The build.xml rewrite fixed it.

Change 86812 had a related patch set uploaded by Matanya:
Removed pstack package since bug 48025 is resolved.

https://gerrit.wikimedia.org/r/86812

Change 86812 merged by Dzahn:
Removed pstack package since bug 48025 is resolved.

https://gerrit.wikimedia.org/r/86812

Change 102465 had a related patch set uploaded by Hashar:
WMF: less build description updates

https://gerrit.wikimedia.org/r/102465

Change 102465 merged by Hashar:
WMF: less build description updates

https://gerrit.wikimedia.org/r/102465

Gerrit change 102465 has been deployed in production and apparently solved the issue for good.

Summary sent on wikitech-l http://lists.wikimedia.org/pipermail/wikitech-l/2013-December/073703.html