
Job queue items from 1.20 get lost – need exit strategy for upgrade
Closed, Declined (Public)

Description

See http://lists.wikimedia.org/pipermail/mediawiki-l/2013-April/040970.html

I remember that WMF had a similar problem, is there a solution apart from dropping the old jobs?


Version: 1.21.x
Severity: critical
URL: http://lists.wikimedia.org/pipermail/mediawiki-l/2013-April/040970.html
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=60719

Details

Reference
bz46934

Event Timeline

bzimport raised the priority of this task to High. (Nov 22 2014, 1:16 AM)
bzimport set Reference to bz46934.
bzimport added a subscriber: Unknown Object (MLST).
*** Bug 46971 has been marked as a duplicate of this bug. ***

Quoting from the duped bug:

(Quoting bug 46971 comment #0)

After I upgraded from 1.20.3, in my database (MySQL) some
pre-upgrade jobs have job_random set to 0 and do not seem to be
picked up, not even when I try to run them by providing their type
as an option: php runJobs.php --type=replaceText.

Do we seriously have no way to fix this? Should we just tell people not to upgrade if they have something in the job queue?

Aaron: Could you comment on this, please?

Aaron: Could you comment on this please, as the 1.21 tarball release is imminent?

(In reply to comment #3)

Do we seriously have no way to fix this? Should we just tell people not to
upgrade if they have something in the job queue?

I'd like to include something in the installation information telling users to clear their job queue before upgrading, but I don't know a lot about this. What would be appropriate?

I would like to get this fixed ASAP for a point release.

(In reply to comment #6)

(In reply to comment #3)

Do we seriously have no way to fix this? Should we just tell people not to
upgrade if they have something in the job queue?

I'd like to include something in the installation information telling users to
clear their job queue before upgrading, but I don't know a lot about this.
What would be appropriate?

I would like to get this fixed ASAP for a point release.

Running maintenance/runJobs.php should clear the job queue. But depending on how long upgrading takes, one or two jobs might still be lost. Maybe put the wiki into read-only mode once the job queue is cleared?

Could maintenance/runJobs.php be run with the wiki in read-only mode?

(In reply to comment #8)

Could maintenance/runJobs.php be run with the wiki in read-only mode?

Nope, I meant making it read-only after clearing the job queue. Still not a complete solution, but I can't think of anything else.

(In reply to comment #0)

See http://lists.wikimedia.org/pipermail/mediawiki-l/2013-April/040970.html

I remember that WMF had a similar problem, is there a solution apart from
dropping the old jobs?

Did we? The only problem is that they have a harder time getting picked if there are always a bunch of other new jobs. In master and REL1_21 I've tried adding a bunch of jobs and setting the token to 0 for all of them, and runJobs.php works just fine. They weren't lost for me.

(In reply to comment #10)

In master and REL1_21 I've tried adding a bunch of jobs and setting the token
to 0 for all of them, and runJobs.php works just fine. They weren't lost for me.

I mean job_random of course, not the token.

(In reply to comment #11)

I mean job_random of course, not the token.

Ah. :) Well, isn't setting job_random exactly what the user in comment 0 didn't do and should have done?

(In reply to comment #12)

(In reply to comment #11)

I mean job_random of course, not the token.

Ah. :) Well, isn't setting job_random exactly what the user in comment 0
didn't do and should have done?

The complaint was that 0-valued ones didn't work, so I set all of mine to that value and they still worked.

hexmode said this wasn't deemed worth fixing for 1.21 release. I don't know the reasons, but better a partial update than nothing.

(In reply to comment #14)

hexmode said this wasn't deemed worth fixing for 1.21 release. I don't know
the reasons, but better a partial update than nothing.

What I meant is that this isn't going to stop 1.21.0 from being released. It is still a valid bug that should be fixed at some point. If we can get it fixed in a 1.21 point release that would be great.

I don't know enough about the problem to fix it, though.

WikiApiary still had hundreds of lingering jobs since October, till they dropped them from the DB yesterday because it was impossible to run them. We're still waiting for a general solution to the migration problems.
http://lists.thingelstad.com/pipermail/wikiapiary-l/2014-February/000104.html
http://lists.thingelstad.com/pipermail/wikiapiary-l/attachments/20140202/fd838fb7/attachment-0002.png
http://lists.thingelstad.com/pipermail/wikiapiary-l/attachments/20140202/fd838fb7/attachment-0003.png

I had a similar thing happen. We were running 1.16.3 (yeah!) and upgraded to 1.23. Made a copy of the 1.16.3 database, pointed new 1.23wmf11 installation to new database. Ran update.php, everything is hunky-dory. Notice the next day that job queue is backed up. A few old jobs existed from the day of the upgrade (in 1.16.3). Tried clearing job_token and a few would run. Tried clearing out old pre 1.23 jobs, still not running. showJobs.php spits out 0, while api query and db shows jobs in the queue.

(In reply to comment #17)

I had a similar thing happen. We were running 1.16.3 (yeah!) and upgraded to
1.23. Made a copy of the 1.16.3 database, pointed new 1.23wmf11 installation to
new database. Ran update.php, everything is hunky-dory. Notice the next day
that job queue is backed up. A few old jobs existed from the day of the upgrade
(in 1.16.3). Tried clearing job_token and a few would run. Tried clearing out
old pre 1.23 jobs, still not running. showJobs.php spits out 0, while api query
and db shows jobs in the queue.

What types of jobs? What do some of the rows look like?

Here's the job table from one of our wikis (totally separate installations, but configured identically):

http://pastebin.com/i4Qatgpa

Note, this is after I removed some of the older jobs in an attempt to 'kick start' the queue.

What does <<php showJobs.php --group>> show? What about <<php showJobs.php --list>>? Does <<php runJobs.php --type refreshLinks>> actually run anything?

I should note that certain actions on the site, such as modifying a template or running refreshLinks.php, not only add new jobs to the queue (as they should); afterwards I can also run runJobs.php and a number of jobs will process. I don't see a pattern or commonality in what is run, however.

Another note (and someone tell me to shut up if this isn't proper etiquette): it appears that running php refreshLinks.php will queue up a number of jobs. Running runJobs.php afterward will kick off many jobs (1500 or so in a queue of 2000) but does not complete all of the jobs queued by refreshLinks.php, or any of the other queued jobs.

What is $wgJobTypeConf set to? All the jobs you posted had at least 1 attempt. They won't run again until the claim TTL is reached. I don't know what you set that to. By default, jobs that fail are never retried and get deleted after a week.

You can try using:
$wgJobTypeConf['default']['claimTTL'] = 3600; // 1 hour

...this will let the jobs be retried (after 1 hour of failure).

You can also set:
$wgDebugLogGroups['runJobs'] = "<some path>";

...this will log all jobs run, and may show some failures (fatal errors will not show here though).
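The claim TTL behaviour described above can be sketched as a small state function. This is a simplified model written for illustration, not MediaWiki's actual code: the names job_state and Job are hypothetical, the one-week prune window is taken from the comment above, and details such as a maximum number of tries are left out.

```python
from dataclasses import dataclass
from typing import Optional

WEEK = 7 * 24 * 3600  # "deleted after a week", per the comment above


@dataclass
class Job:
    # Unix time at which a runner last claimed the job; None if never claimed.
    claimed_at: Optional[int]


def job_state(job: Job, now: int, claim_ttl: int = 0) -> str:
    """Classify a job under a simplified model of the claim TTL.

    claim_ttl <= 0 models the default described above (failed jobs are
    never retried); a positive value models claimTTL in seconds.
    """
    if job.claimed_at is None:
        return "runnable"
    age = now - job.claimed_at
    if claim_ttl > 0 and age >= claim_ttl:
        return "retry"    # claim expired, the job may be picked up again
    if claim_ttl <= 0 and age >= WEEK:
        return "delete"   # never retried, pruned after a week
    return "waiting"      # still claimed, runners skip it


now = 1_400_000_000
failed = Job(claimed_at=now - 2 * 3600)        # claimed two hours ago
print(job_state(failed, now))                  # default: no claim TTL
print(job_state(failed, now, claim_ttl=3600))  # claimTTL of one hour
```

Under this model, a job that failed two hours ago just sits in the queue by default, while a one-hour claimTTL makes it eligible again, which matches the symptom of jobs that are visible in the table but never run.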

Aaron, can you explain a little more about the claim TTL? I don't recall setting that anywhere.

I'm trying to pinpoint the cause as much as I can without touching my prod environment. We didn't have this issue in test (berating me for not having more sophisticated QA is justified!) and in order to make changes to prod I have many hoops I must jump through now.

One interesting note is that trying to specify --memory-limit when running runJobs.php throws an error. I'm beginning to think this might be related to available RAM. (The server has 2 GB; php.ini allows 128 MB.)

php runJobs.php --memory-limit 1024
PHP Fatal error: Allowed memory size of 262144 bytes exhausted (tried to allocate 122880 bytes) in /var/www/html/w/includes/AutoLoader.php on line 290
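One possible reading of this error, assuming the --memory-limit value is handed to PHP's memory_limit setting unchanged (an assumption, not confirmed in this thread): PHP treats a bare integer as a byte count, so 1024 would request about 1 KB rather than 1 GB, and a value such as 1024M may have been intended. The helper below, php_shorthand_to_bytes, is a hypothetical illustration of PHP's K/M/G shorthand and is not part of MediaWiki.

```python
def php_shorthand_to_bytes(value: str) -> int:
    """Convert a PHP-style size such as '128M' or '1024' to bytes."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    value = value.strip()
    suffix = value[-1:].lower()
    if suffix in units:
        return int(value[:-1]) * units[suffix]
    return int(value)  # no suffix: the number is already a byte count


for arg in ("1024", "1024M", "128M"):
    print(arg, "->", php_shorthand_to_bytes(arg), "bytes")
```

If that reading is right, the error would come from the tiny limit itself rather than from the server running short of RAM.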

If the only jobs that linger have job_attempts set to something other than zero, then this is just a problem of failed jobs. You'd want to set $wgDebugLogGroups as above to possibly get more insight into why jobs sometimes fail.

Jobs can fail for any number of reasons, mostly specific to the code the job classes run. That in itself doesn't indicate any problem with the job queue itself.
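To check whether the lingering jobs are all previously attempted ones, the rows can be grouped by type and by whether job_attempts is zero. The sketch below runs against an in-memory SQLite table so that it is self-contained; the cut-down schema and the sample rows are invented for illustration, and only the column names job_cmd, job_attempts and job_token are taken from the job table discussed here. Against a real wiki, the same SELECT would be run on the actual job table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Cut-down stand-in for the job table; the real table has more columns.
conn.execute(
    "CREATE TABLE job (job_id INTEGER PRIMARY KEY, job_cmd TEXT,"
    " job_attempts INTEGER, job_token TEXT)"
)
# Invented sample rows, for illustration only.
conn.executemany(
    "INSERT INTO job (job_cmd, job_attempts, job_token) VALUES (?, ?, ?)",
    [
        ("refreshLinks", 0, ""),
        ("refreshLinks", 1, "abc"),
        ("refreshLinks", 2, "def"),
        ("htmlCacheUpdate", 0, ""),
    ],
)


def summarize(conn):
    """Count never-attempted and already-attempted jobs per job type."""
    return conn.execute(
        "SELECT job_cmd,"
        " SUM(CASE WHEN job_attempts = 0 THEN 1 ELSE 0 END) AS never_tried,"
        " SUM(CASE WHEN job_attempts > 0 THEN 1 ELSE 0 END) AS tried"
        " FROM job GROUP BY job_cmd ORDER BY job_cmd"
    ).fetchall()


for cmd, never_tried, tried in summarize(conn):
    print(f"{cmd}: {never_tried} never attempted, {tried} attempted at least once")
```

If every stuck row falls in the "attempted" column, that points to failing jobs rather than to jobs the queue never sees.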

Removing target milestone that was in the past.

If you want this in a specific release, have a good reason, AND are willing to find resources to fix this bug, feel free to change it to something appropriate.

Does this affect 1.26?

How to test?

Krinkle added a subscriber: Krinkle.

Closing for now. MediaWiki 1.20 is quite far in the past, and there aren't enough details to debug this problem without actually going back to 1.20 and trying it afresh.

If someone is upgrading from 1.20 to a current version today and finds some jobs are lost in the process, or that pre-1.20 jobs are stuck and not run, please file a new task and we can figure out the details and find a solution.