
[upstream] Jobs are sometime no more being triggered by Zuul / Jenkins
Closed, ResolvedPublic

Description

From time to time, some subset of the jobs stops being executed. Zuul enqueues them properly, as can be seen on https://integration.wikimedia.org/zuul/ when the issue occurs.

The Jenkins queue is idling with target hosts not running any tests.

An example of a stuck job is:

$ echo status|nc -q 2 localhost 4730|grep integration-jjb-config-test
build:integration-jjb-config-test	2	0	14
build:integration-jjb-config-test:contintLabsSlave	0	0	14
$

Where the numbers are Total, Running, Workers. The status page shows two jobs being stuck.
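The stuck state can be spotted mechanically from the `status` output. A minimal sketch, using sample lines copied from above (the function name and numbers are from this task; in practice, pipe in the live `echo status | nc -q 2 localhost 4730` output instead):

```shell
#!/bin/sh
# Flag Gearman functions that look stuck: jobs queued, none running,
# yet workers are registered. Status columns are:
#   function <TAB> total queued <TAB> currently running <TAB> workers
sample() {
  printf 'build:integration-jjb-config-test\t2\t0\t14\n'
  printf 'build:integration-jjb-config-test:contintLabsSlave\t0\t0\t14\n'
}

sample | awk -F'\t' '$2 > 0 && $3 == 0 && $4 > 0 {
  print $1 " looks stuck (" $2 " queued, " $4 " workers)"
}'
# → build:integration-jjb-config-test looks stuck (2 queued, 14 workers)
```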

Another occurrence:

$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8	17	0	14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave	0	0	14
$

And there are indeed 17 such jobs stuck.

Suspicion: both jobs are tied to the node label contintLabsSlave. Zuul apparently asked for the label-less function, which got properly enqueued by the Gearman server; but since the job is tied to a label, the label-less function is never picked up by the Jenkins Gearman plugin.


Version: wmf-deployment
Severity: normal
See Also:
https://launchpad.net/bugs/1381565

Details

Reference
bz63760

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 3:12 AM
bzimport set Reference to bz63760.

Once slaves are disconnected I get:

$ echo status|nc -q 2 localhost 4730|grep integration-jjb-config-test
build:integration-jjb-config-test:contintLabsSlave 0 0 0
build:integration-jjb-config-test 2 0 0

$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8 22 0 0
build:apps-android-wikipedia-tox-flake8:contintLabsSlave 0 0 0

It did process a few jobs but got stuck again:

$ echo status|nc -q 2 localhost 4730|grep integration-jjb-config-test
build:integration-jjb-config-test:contintLabsSlave 0 0 14
build:integration-jjb-config-test 2 0 14

$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8 16 0 14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave 0 0 14

Disconnecting and reconnecting the gearman client does unleash a few jobs.

Disconnecting and reconnecting a slave does unleash them as well.

Here is the output while I disconnected and reconnected integration-slave1002.eqiad.wmflabs:

hashar@gallium:~$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8 12 2 14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave 0 0 14
hashar@gallium:~$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8 11 1 14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave 0 0 14
hashar@gallium:~$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8 10 0 14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave 0 0 14
hashar@gallium:~$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8 10 0 14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave 0 0 14
hashar@gallium:~$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8 9 2 14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave 0 0 14
hashar@gallium:~$

It eventually managed to run them all.

I have upgraded Zuul from wmf-deploy-20140122 to wmf-deploy-20140416-3. That might fix it.

We got python-gear upgraded from 0.4.0 to 0.5.4, which fixes a number of function registration errors in Gearman. That might solve the issue.

It seems to no longer be occurring.

That occurred again today around noon UTC. Jenkins/Zuul restarted at around 14:17 UTC :-(

Crashed again on May 28th during the European afternoon.

Jobs meant to run on labs instances ended up no longer being registered with the Zuul Gearman server. That must be a bug in the Jenkins Gearman plugin :-( {{bug|63760}}

Another occurrence:

hashar@gallium:~$ echo status|nc -q 2 localhost 4730|fgrep apps-android-wikipedia-maven-checkstyle
build:apps-android-wikipedia-maven-checkstyle:contintLabsSlave 0 0 10
build:apps-android-wikipedia-maven-checkstyle 10 0 10

numbers are Total, Running, Workers.

And there are indeed workers registered for the function:

hashar@gallium:~$ echo workers|nc -q 2 localhost 4730|fgrep apps-android-wikipedia-maven-checkstyle|cut -b1-50
54 127.0.0.1 integration-slave1002_exec-3 : build:
53 127.0.0.1 integration-slave1002_exec-1 : build:
55 127.0.0.1 integration-slave1002_exec-4 : build:
56 127.0.0.1 integration-slave1002_exec-0 : build:
57 127.0.0.1 integration-slave1002_exec-2 : build:
14 127.0.0.1 integration-slave1001_exec-0 : build:
19 127.0.0.1 integration-slave1001_exec-3 : build:
21 127.0.0.1 integration-slave1001_exec-4 : build:
22 127.0.0.1 integration-slave1001_exec-2 : build:
28 127.0.0.1 integration-slave1001_exec-1 : build:
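The `workers` listing can likewise be summarised to count registered executors per slave. A sketch over sample lines abbreviated from the listing above (offline; in practice, feed it the live `echo workers | nc -q 2 localhost 4730` output):

```shell
#!/bin/sh
# Summarise Gearman "workers" output: count executor connections per slave.
# Worker lines look like: "<fd> <ip> <client id> : build:<function> ..."
sample() {
  printf '54 127.0.0.1 integration-slave1002_exec-3 : build:\n'
  printf '53 127.0.0.1 integration-slave1002_exec-1 : build:\n'
  printf '14 127.0.0.1 integration-slave1001_exec-0 : build:\n'
}

# Strip the "_exec-N" suffix from the client id and tally per slave.
sample | awk '{ sub(/_exec-[0-9]+$/, "", $3); count[$3]++ }
              END { for (s in count) print s, count[s] }' | sort
# → integration-slave1001 1
# → integration-slave1002 2
```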

The functions registered:

build:apps-android-wikipedia-maven-checkstyle
build:apps-android-wikipedia-maven-checkstyle:contintLabsSlave

WORKAROUND: disconnect and reconnect the labs slaves.

Created attachment 15589
Zuul events spike

I noticed earlier this week that Zuul was trapped in some loop. Upstream has noticed it as well from time to time but never managed to track it down. Attached is a graph showing the spike of events on June 6th caused by the death loop.

Attached:

render-3.png (180×400 px, 7 KB)

  • Bug 69045 has been marked as a duplicate of this bug.
  • Bug 70256 has been marked as a duplicate of this bug.

Documented a workaround on https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Known_issues

The Gearman server sometimes deadlocks when a job is created in Jenkins. The Gearman process is still around, but TCP connections time out completely and it does not process anything. The workaround is to disconnect Jenkins from the Gearman server:

1. Head to https://integration.wikimedia.org/ci/configure, logged in with a WMF LDAP account
2. Search for "Gearman"
3. Uncheck "Enable Gearman"
4. Save at the bottom
5. Search for "Gearman" again
6. Check "Enable Gearman"
7. Save at the bottom
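Before resorting to the UI toggle, the deadlock can be detected from the shell: a healthy server answers the `status` query, while a deadlocked one accepts the TCP connection but never sends data back. A minimal probe, assuming the admin port is 4730 as in the examples above and that `nc` and `timeout` are available:

```shell
#!/bin/sh
# Probe the Gearman admin port and report whether it answers a status query.
check_gearman() {
  host=$1 port=$2
  # "timeout" catches the deadlock case where the connection is accepted
  # but the server never writes anything back.
  if echo status | timeout 5 nc -q 2 "$host" "$port" 2>/dev/null | grep -q .; then
    echo "gearman OK"
  else
    echo "gearman not responding"
  fi
}

check_gearman localhost 4730
```

A cron job running such a probe could alert before the Jenkins queue silently idles.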

That is related to bug 63758 (JJB created jobs not registering).

I have upgraded the Jenkins Gearman plugin to fix job registrations.

That bumps the plugin to a version with support for the Jenkins LTS release we are using, which is probably going to help.

I found another issue that causes the Gearman server to lock up completely while waiting for data to be received on a socket. Filed upstream as https://bugs.launchpad.net/gear/+bug/1381565

The root cause is that the Gearman server no longer responds, for an unknown reason.

When reconnecting it (see comment #12), the jobs were still stuck in the queue due to a bug in Zuul. That is bug 72113; the patch I wrote is applied on our Zuul and confirmed to work (merge functions are now properly retriggered when Gearman comes back).

hashar lowered the priority of this task from High to Low.Nov 24 2014, 3:44 PM

Switching to low priority since the patch is applied in our production and we are just waiting for upstream to merge the change.

hashar renamed this task from Jobs are sometime no more being triggered by Zuul / Jenkins to [upstream] Jobs are sometime no more being triggered by Zuul / Jenkins.Nov 24 2014, 3:44 PM
hashar added a project: Upstream.
hashar set Security to None.

I think it is closely related to T74113: [upstream] Zuul prepareRef does not handle failure to connect to Gearman, where jobs are considered merged although the merge process never triggered or completed. The end result is that the Zuul scheduler waits for the merge to happen and keeps the jobs around.

I don't think I have seen the issue occur since I deployed the patch back in October 2014. Our task is still open pending a merge by upstream.

I haven't seen this one in ages. Since I got python-gear and the Jenkins Gearman plugin upgraded, that probably solved it.