
Update special pages more frequently to account for bad runs
Closed, ResolvedPublic

Description

Recently, the special pages update jobs have had trouble finishing their work.
About 50% of the time, the jobs are terminated by a fatal error (with no pattern, from what I've been told), either because a stubborn wiki's database tables grew too big or because a bad update went live while the jobs were still running.
To account for these bad runs, I would like to suggest running the jobs every 1, 1.5 or 2 days instead of the current 3. If a specific wiki is needed for reference, my case applies to pt.wiktionary.
Thank you.


Version: wmf-deployment
Severity: enhancement
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=53227

Details

Reference
bz45007

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 1:28 AM
bzimport set Reference to bz45007.
bzimport added a subscriber: Unknown Object (MLST).

MaxSem said the job ran yesterday (13th), although according to the 3-day schedule it should only have run today (14th; the last run was on the 11th at 00:00 UTC).
It finished with an error:

/home/wikipedia/logs/norotate/updateSpecialPages.log:

Fatal error: Call to a member function getText() on a non-object in /usr/local/apache/common-local/php-1.21wmf9/extensions/MobileFrontend/includes/MobileContext.php on line 273

Log file modified: 2013-02-13 05:17:24.374378000 +0000

The job should automatically report this in the server admin log, so that people can see the errors and maybe fix them. LocalisationUpdate reports its success; maybe this job can do that too.
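To illustrate the reporting idea, here is a minimal sketch that turns the contents of the update log into a one-line status message. The message format and the function names are hypothetical, and how the line would actually be posted to the server admin log is left out, since that mechanism is not described here.

```python
import re
from typing import Optional

# Matches a PHP fatal error line such as the one quoted above.
FATAL_PATTERN = re.compile(r"^(?:PHP )?Fatal error: (.+)$", re.MULTILINE)


def first_fatal_error(log_text: str) -> Optional[str]:
    """Return the text of the first fatal error in the log, or None."""
    match = FATAL_PATTERN.search(log_text)
    return match.group(1).strip() if match else None


def admin_log_message(log_text: str) -> str:
    """Build a one-line status message (hypothetical format)."""
    error = first_fatal_error(log_text)
    if error is None:
        return "updateSpecialPages: completed without fatal errors"
    return f"updateSpecialPages: FAILED: {error}"


sample_log = (
    "Fatal error: Call to a member function getText() on a non-object in "
    "/usr/local/apache/common-local/php-1.21wmf9/extensions/MobileFrontend/"
    "includes/MobileContext.php on line 273\n"
)
print(admin_log_message(sample_log))
```

A wrapper around the cron command could run something like this after each job and pass the resulting line to whatever tool writes admin log entries.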

The underlying problem is usually fixed shortly after the error is thrown. But that doesn't rerun the special pages update, which has to wait at least until the next run (3 more days, assuming that run succeeds).

< Danny_B> update of special pages is off now?
< Danny_B> or the periods have been prolonged?
..
< mutante> monthday => "*/3"
< mutante> hour => 5
..
< mutante> ./manifests/misc/maintenance.pp
< mutante> class misc::maintenance::update_special_pages
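For reference, a small sketch of what a `*/3` step in cron's day-of-month field expands to, since that determines when the job actually fires. It assumes standard cron step semantics (every third value of the 1-31 range, starting from day 1); the helper names are made up for illustration.

```python
import calendar
from datetime import date
from typing import List


def step_monthdays(step: int) -> List[int]:
    """Days matched by '*/step' in cron's day-of-month field (range 1-31)."""
    return list(range(1, 32, step))


def run_dates(year: int, month: int, step: int) -> List[date]:
    """Dates in one month on which a '*/step' monthday job would fire."""
    last_day = calendar.monthrange(year, month)[1]
    return [date(year, month, d) for d in step_monthdays(step) if d <= last_day]


january = run_dates(2013, 1, 3)
february = run_dates(2013, 2, 3)
print([d.day for d in february])          # [1, 4, 7, 10, 13, 16, 19, 22, 25, 28]
print((february[0] - january[-1]).days)   # 1 (Jan 31 is followed by Feb 1)
```

Under this reading the job fires on the 13th rather than the 14th, which would match the 2013-02-13 05:17 log timestamp above if the same schedule was already in place then. It also means the schedule is not a strict 3-day interval: after a 31-day month the next run comes one day later.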

< Danny_B> so it doesn't run obviously
< Danny_B> last update: 13. 8. 2013, 14:15

< mutante> command => "flock -n /var/lock/update-special-pages /usr/local/bin/update-special-pages > /home/wikipedia/logs/norotate/updateSpecialPages.log 2>&1",
< mutante> uhm, yeah, i don't know about the commandline

< Reedy> Never happy
< Danny_B> anyway, in case it would be helpful to track down the issue - cs wikis lack the update

site.pp

< mutante> 1178 # Wrong log file location
< mutante> 1179 class { misc::maintenance::update_special_pages: enabled => true }
< mutante> 2762 # Broken cron jobs moved back to hume:
< mutante> 2765 class { misc::maintenance::update_special_pages: enabled => false }

< mutante> so, the enabled one is on hume in site.pp
< mutante> not on the new host terbium

< mutante> !createbug

< mutante> cat: /home/wikipedia/logs/norotate/updateSpecialPages.log: No such file or directory

reedy@hume:/home/wikipedia/log/norotate$ flock -n /var/lock/update-special-pages /usr/local/bin/update-special-pages > /home/wikipedia/logs/norotate/updateSpecialPages.log 2>&1
reedy@hume:/home/wikipedia/log/norotate$

Change 80560 had a related patch set uploaded by Reedy:
Maintenance scripts should be run as Apache

https://gerrit.wikimedia.org/r/80560

Change 80560 abandoned by Reedy:
Maintenance scripts should be run as Apache

Reason:
user => "apache",

https://gerrit.wikimedia.org/r/80560

I'm not sure how just running it more frequently would make it any more likely to complete successfully. You're just going to make more fails more frequently.

Ideally, if it dies doing one wiki, this shouldn't stop execution on every other subsequent wiki (which has been an issue in the past)
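A minimal sketch of that isolation, assuming the per-wiki work can be wrapped in a callable that raises when it fails. Because a PHP fatal error kills the whole PHP process, `update_one` would in practice have to start a separate process for each wiki and raise on a non-zero exit status; the names `update_all_wikis` and `update_one`, and the wiki names in the demo, are hypothetical.

```python
from typing import Callable, Dict, Iterable


def update_all_wikis(
    wikis: Iterable[str], update_one: Callable[[str], None]
) -> Dict[str, str]:
    """Run update_one for every wiki; record failures instead of aborting."""
    failures: Dict[str, str] = {}
    for wiki in wikis:
        try:
            update_one(wiki)
        except Exception as exc:  # one bad wiki must not stop the others
            failures[wiki] = str(exc)
    return failures


def fake_update(wiki: str) -> None:
    """Stand-in for the real per-wiki update; fails for one wiki."""
    if wiki == "brokenwiki":
        raise RuntimeError("simulated fatal error")


failures = update_all_wikis(["cswiki", "brokenwiki", "ptwiktionary"], fake_update)
print(failures)
```

The returned mapping lists only the wikis that failed, so the remaining wikis still get their special pages refreshed in the same run.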

(In reply to comment #8)

I'm not sure how just running it more frequently would make it any more likely to complete successfully. You're just going to make more fails more frequently.

More fails are likely. But with a run only every 3 days, it is quite common to go 6 or 9 days without a special page update. Right now it's been 6 days and the Wanted Categories page hasn't been updated at pt.wiktionary (I think the last update was your manual run). And I'm betting it won't be today either. So, another 3 days will have to pass for another go.
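As a rough back-of-the-envelope check of this argument: if each run succeeds independently with probability p, the number of runs until the first success is geometrically distributed with mean 1/p, so the expected time between successful updates is the interval divided by p. The independence assumption and the 50% figure (taken from the description) are assumptions, and they may not hold if failures are consistent rather than occasional.

```python
def expected_gap_days(interval_days: float, success_probability: float) -> float:
    """Expected days between successful runs, assuming independent runs."""
    if not 0 < success_probability <= 1:
        raise ValueError("success_probability must be in (0, 1]")
    return interval_days / success_probability


for interval in (3, 2, 1.5, 1):
    print(interval, expected_gap_days(interval, 0.5))
```

Under these assumptions a 3-day schedule averages 6 days between successful updates, while a daily schedule averages 2 days.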

(In reply to comment #8)

I'm not sure how just running it more frequently would make it any more likely to complete successfully. You're just going to make more fails more frequently.

So, closing this as a duplicate of bug 53227: let's keep one bug per issue, not one per proposed way to address it. Bug 53227 also shows that the diagnosis behind this proposal is probably wrong, because failures seem consistent rather than occasional, when there are failures.

Ideally, if it dies doing one wiki, this shouldn't stop execution on every other subsequent wiki (which has been an issue in the past)

This is maybe worth a separate bug? If the scripts can't be improved easily, it should be rather easy to make the cronjobs more atomic.

*** This bug has been marked as a duplicate of bug 53227 ***

If the runs become more reliable than in the past then surely this doesn't make much sense anymore. Let's go with bug 53227 for now.

P.S.:

Bug 53227 also shows that the diagnosis behind this proposal is probably wrong...

When I submitted this bug in February, bug 53227 was not yet an issue. The constant (bad) live updates were the problem then (see Comment #1).