JobQueue not working: no jobs run except for high-priority ones like enotif
Closed, Resolved, Public

Description

On Translatewiki.net we are using the ReplaceText extension for mass changes of message content or page moves of MediaWiki messages when their name/key changed.

Currently, at least page moves are no longer added to the JobQueue and are therefore not executed.


Version: 1.21.x
Severity: blocker
URL: https://commons.wikimedia.org/wiki/File:Job_queue_breakage_October_2012.png

Details

Reference
bz41656

Event Timeline

bzimport raised the priority of this task to Unbreak Now! (Nov 22 2014, 1:04 AM)
bzimport set Reference to bz41656.

It works now after the code was updated, so I think it was the same as the complete breakage of the job queue on 1.21wmf2/1.21wmf3, fixed by Tim and Aaron with some investigations by Reedy, Ariel and others.
Cf. commits I92f538f4, I257d6809, I98533ed5, I326b767c, Iaea96ff8, If612f8e2, I09b3faa7, Ie2b4abab.

Yesterday it worked on translatewiki.net, today it does not work :-(

Raimond, do you all have a stack trace or any debugging hints? We did coincidentally have a problem with our job queue today, which turned out to be a problem with our Puppet scripts and not with the MediaWiki core version of the job queue code. So, this seems to be operating fine as of 1.21wmf3.

There were no notices or backtraces or anything.

You may need to manually run the job runner to get debugging information. The WMF cluster almost certainly has different parameters and wrapper scripts than what you're running, but you may want to look at http://bots.wmflabs.org/~petrb/logs/%23wikimedia-tech/20121031.txt starting at [21:34:28] for an example of us debugging this.

I tried a mass message move again a few minutes ago. It seems that the replacements are not added to the job table.

I took a peek at the table bw_job on translatewiki and at the error_php and job logs.

Manually running with "b php runJobs.php" has no effect either.

Assigning to Aaron and cc'ing Reedy. We may just have to pay special attention to the test2/mediawiki.org job queues on Monday (Tuesday?...possible delay due to Veterans Day here in the US).

(In reply to comment #7)

> Assigning to Aaron and cc'ing Reedy. We may just have to pay special attention
> to the test2/mediawiki.org job queues on Monday

1.21wmf4 was deployed yesterday on test2/mw.org - any updates?

I may have some insight into this: yesterday I was trying out my Data Transfer extension, which uses jobs, on my wiki with MW 1.21alpha, and discovered (like Raimond) that it wasn't adding anything to the "job" table. I looked into it, and found that the issue was the call to $dbw->onTransactionIdle() in the method JobQueueDB::doBatchPush(). Everything inside that set of code was never called. When I commented out the onTransactionIdle() line (and its closing tag about 30 lines down), everything worked perfectly again.
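The symptom described above, where code wrapped in a deferred callback never runs, can be illustrated with a small stand-alone sketch. This is not MediaWiki's implementation: `FakeDb` and its `begin()`/`commit()` methods are hypothetical stand-ins, and the sketch only assumes that a callback passed to `onTransactionIdle()` is parked while a transaction is open. Under that assumption, nothing reaches the job table until the transaction ends, and nothing ever does if it never ends.

```php
<?php
// Hypothetical stand-in for a database handle; NOT MediaWiki's real class.
class FakeDb {
    /** @var callable[] Callbacks waiting for the open transaction to end */
    private $idleCallbacks = [];
    /** @var bool Whether a transaction is currently open */
    private $inTransaction = false;

    public function begin() {
        $this->inTransaction = true;
    }

    // Run the callback immediately when idle, otherwise park it until commit().
    public function onTransactionIdle( callable $callback ) {
        if ( $this->inTransaction ) {
            $this->idleCallbacks[] = $callback;
        } else {
            $callback();
        }
    }

    // End the transaction and run every parked callback once.
    public function commit() {
        $this->inTransaction = false;
        $callbacks = $this->idleCallbacks;
        $this->idleCallbacks = [];
        foreach ( $callbacks as $callback ) {
            $callback();
        }
    }
}

$jobTable = []; // stands in for the "job" table
$db = new FakeDb();

$db->begin();
$db->onTransactionIdle( function () use ( &$jobTable ) {
    $jobTable[] = 'replaceText'; // the deferred insert
} );
$rowsBeforeCommit = count( $jobTable ); // callback is still parked

$db->commit();
$rowsAfterCommit = count( $jobTable ); // callback has now run
```

If `commit()` were never reached in this sketch, `$jobTable` would stay empty, which matches the "nothing is added to the job table" observation without implying that this is the actual cause in core.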

By the way, I have PHP 5.3.10 and MySQL 5.5.22 on that server.

Raising priority because the buggy code was deployed to most projects yesterday.
The number of queued jobs and the job runners' activity have halved, starting a few minutes after the deployment.
https://gdash.wikimedia.org/dashboards/jobq/deploys
http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Jobrunners+pmtpa&m=cpu_report&s=by+name&mc=2&g=network_report

On a randomly chosen jobrunner, running

root@mw16:/usr/local/apache/common-local/multiversion# php MWScript.php nextJobDB.php --wiki=aawiki

gives the empty string, although e.g. enwiki has 18313 refreshLinks2 jobs in the table, some from Nov 5.

running

echo 'print_r ( $wgMemc->get( "jobqueue:dbs:v3" ) );' | php MWScript.php eval.php --wiki=aawiki

gives output

Array
(
    [refreshLinks2] => Array
        (
            [0] => afwiki
            [1] => alswiki
            [2] => anwiki
            [3] => arwiki

...

    [createPdfThumbnailsJob] => Array
        (
            [0] => sqwiki
        )

)

There's no 'pendingDBs' key in there anywhere.

Here's the relevant code in nextJobDB.php:

		$pendingDbInfo = $wgMemc->get( $memcKey );
		if ( !$pendingDbInfo || mt_rand( 0, 100 ) == 0 ) {
			// ... (regenerate 1/100 of the time)
		}
		if ( !$pendingDbInfo || !$pendingDbInfo['pendingDBs'] ) {
			return; // no DBs with jobs or cache is both empty and locked
		}
		$pendingDBs = $pendingDbInfo['pendingDBs'];

So that's going to return empty-handed every time.

Those references to 'pendingDBs' come from the Nov 2 change.

I guess that we get some jobs run 1/100 of the time when we regenerate the memcache entry.
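To make the mismatch concrete, here is a minimal stand-alone sketch of the check quoted above, run against the two cache shapes. `pendingDbsOrNull()` is a hypothetical helper written for illustration, not code from nextJobDB.php, and the "new" shape (the same map nested under a 'pendingDBs' key) is an assumption inferred from the quoted snippet.

```php
<?php
// Hypothetical helper mirroring the quoted check; not the real nextJobDB.php.
function pendingDbsOrNull( $pendingDbInfo ) {
    // empty() also covers a missing 'pendingDBs' key without raising a notice.
    if ( !$pendingDbInfo || empty( $pendingDbInfo['pendingDBs'] ) ) {
        return null; // caller gives up: no DBs reported as having jobs
    }
    return $pendingDbInfo['pendingDBs'];
}

// Shape actually seen in memcached: job types at the top level.
$cachedOldShape = [
    'refreshLinks2' => [ 'afwiki', 'alswiki', 'anwiki', 'arwiki' ],
    'createPdfThumbnailsJob' => [ 'sqwiki' ],
];

// Shape the newer code appears to expect (assumption: same map under 'pendingDBs').
$cachedNewShape = [ 'pendingDBs' => $cachedOldShape ];

$fromOld = pendingDbsOrNull( $cachedOldShape ); // key missing, so null
$fromNew = pendingDbsOrNull( $cachedNewShape ); // returns the nested map
```

With the old-shape entry still cached, the helper returns null on every call, so jobs would only be picked up on the occasional run that regenerates the entry.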

I just tried out Aaron's patch (https://gerrit.wikimedia.org/r/#/c/33411/) and it fixed this problem on my wiki.

(In reply to comment #12)

> on a chosen random jobrunner, running
>
> root@mw16:/usr/local/apache/common-local/multiversion# php MWScript.php nextJobDB.php --wiki=aawiki
>
> gives the empty string, although eg enwiki has 18313 refreshLinks2 in the
> table, some from Nov 5.

https://gerrit.wikimedia.org/r/#/c/33537/

(In reply to comment #14)

> https://gerrit.wikimedia.org/r/#/c/33537/

(That was merged a while ago, and I90911083 has now been merged as well.)