
Wikimedia wikis' job queues need better monitoring
Closed, Resolved (Public)

Description

Currently, the job queues for Wikimedia wikis can become heavily backlogged without anyone noticing. This is bad. Sometimes the cause is too few job runners being assigned; other times it is a software problem. The job queue is important to MediaWiki, so it needs to keep running, and someone needs to be notified when it becomes too backlogged or breaks.

A better monitoring and notification system (using mailing lists, IRC, Nagios, or whatever works) needs to be implemented for the job queue. This may relate to bug 27724, though adding a timestamp column is only one way to implement better monitoring.


Version: unspecified
Severity: normal

Details

Reference
bz27851

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 11:33 PM)
bzimport set Reference to bz27851.
bzimport added a subscriber: Unknown Object (MLST).

Raising this bug's priority. This is a real issue.

This is fixed now. A Nagios check monitors the job queue length on all wikis (and as of today, the check actually works). The check is at http://nagios.wikimedia.org/nagios/cgi-bin/extinfo.cgi?type=2&host=spence&service=check_job_queue and Ganglia also measures the enwiki job queue length: http://ganglia.wikimedia.org/?m=cpu_report&r=hour&s=descending&c=Miscellaneous+pmtpa&h=spence.wikimedia.org&sh=1
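
For readers unfamiliar with how a check like this is usually structured, here is a minimal sketch in Python. It is not the actual check_job_queue implementation, which is not shown in this task. The function name, the thresholds, and the message format are all assumptions made for illustration. The only established convention used is the Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). How the per-wiki queue lengths are collected is left out on purpose.

```python
from typing import Dict, Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def _describe(queues: Dict[str, int]) -> str:
    """Format wiki=length pairs, sorted by wiki name for stable output."""
    return ", ".join(f"{wiki}={length}" for wiki, length in sorted(queues.items()))


def check_job_queues(lengths: Dict[str, int], warn: int, crit: int) -> Tuple[int, str]:
    """Hypothetical helper: map per-wiki queue lengths to a Nagios result.

    `lengths` maps a wiki name to its current job queue length.
    `warn` and `crit` are illustrative thresholds chosen by the operator.
    Returns (exit_code, status_line).
    """
    if not lengths:
        # No data at all is reported as UNKNOWN rather than OK.
        return UNKNOWN, "JOBQUEUE UNKNOWN - no queue data"

    critical = {w: n for w, n in lengths.items() if n >= crit}
    warning = {w: n for w, n in lengths.items() if warn <= n < crit}

    if critical:
        return CRITICAL, "JOBQUEUE CRITICAL - " + _describe(critical)
    if warning:
        return WARNING, "JOBQUEUE WARNING - " + _describe(warning)

    longest = max(lengths.values())
    return OK, f"JOBQUEUE OK - {len(lengths)} wikis checked, longest queue {longest}"
```

A wrapper script would print the returned status line and exit with the returned code, which is how Nagios reads a plugin's result. Reporting an empty input as UNKNOWN rather than OK is a deliberate choice here, since a check that silently returns nothing looks healthy when it is in fact broken.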