
Indicate current JobQueue delay by exposing oldest job_timestamp through API
Closed, Duplicate · Public

Description

Author: richardg_uk

Description:
At present, the number of jobs in the job queue is exposed through the API, but the delay caused by queuing is not. The age of the oldest job would often be more helpful to know (at least for editors, who can be concerned or confused if they see categories unchanged for some time after pages are edited).

The API, through ApiQuerySiteinfo::appendStatistics() and SiteStats::jobs(), already exposes the estimated number of queued jobs:

http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=statistics
-> <api><query><statistics ... jobs="918518" /></query></api>

Since MediaWiki version 1.19, the job table has included a job_timestamp field. The field is already indexed. Therefore exposing MIN(job.job_timestamp) as an additional API output should be easy and efficient.
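
For illustration, a minimal sketch of that lookup using MediaWiki's wfGetDB()/selectField() database helpers (the surrounding call site is hypothetical, not existing code):

$dbr = wfGetDB( DB_SLAVE );
// MIN() over the indexed job_timestamp column is a single cheap lookup.
$oldest = $dbr->selectField( 'job', 'MIN(job_timestamp)', '', __METHOD__ );
// $oldest is a timestamp string such as "20121219105959",
// or null/false when the queue is empty.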

An alternative or additional new statistic would be the queue duration, i.e.:
time() - MIN(job.job_timestamp)
This relative measure would be more suitable for graphing, especially if the site statistics are aggressively cached (since it would typically be more stable than the absolute timestamp during a caching interval).

The API could then return something like:

http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=statistics
-> <api><query><statistics ... jobs="918518" joboldesttime="2012-12-19T10:59:59Z" joboldestseconds="86412" /></query></api>
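
A sketch of how ApiQuerySiteinfo::appendStatistics() might populate those two fields; the field names are only the suggestion above, so treat this as illustrative rather than existing code:

// Hypothetical addition next to the existing $data['jobs'] assignment:
$dbr = wfGetDB( DB_SLAVE );
$oldest = $dbr->selectField( 'job', 'MIN(job_timestamp)', '', __METHOD__ );
if ( $oldest ) {
    // Absolute timestamp of the oldest queued job, as ISO 8601.
    $data['joboldesttime'] = wfTimestamp( TS_ISO_8601, $oldest );
    // Relative queue duration in seconds, more stable under caching.
    $data['joboldestseconds'] = time() - intval( wfTimestamp( TS_UNIX, $oldest ) );
}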

(Incidentally, the queue duration might be a useful or at least interesting additional metric for Ganglia, since it would help to distinguish a pathological backlog from high throughput.)


Version: 1.21.x
Severity: enhancement

Details

Reference
bz43287

Event Timeline

bzimport raised the priority of this task to Low. · Nov 22 2014, 12:58 AM
bzimport set Reference to bz43287.
bzimport added a subscriber: Unknown Object (MLST).

Just exposing the oldest job_timestamp isn't much use. You could have old jobs that haven't been run/picked for some reason, while most of the queue is much newer...

richardg_uk wrote:

(In reply to comment #1)

Just exposing the oldest job_timestamp isn't much use. You could have old jobs that haven't been run/picked for some reason, while most of the queue is much newer...

I have assumed that jobs are run either FIFO or (in effect) randomly, so that either way the timestamp of the oldest job would be a meaningful indicator of the maximum expected delay before the link table is updated after a recent edit.

Is there a non-pathological reason why some jobs would be left in the queue unrun/unpicked for an exceptional length of time?

Jobs are pulled from the queue randomly these days, so this metric may be less meaningful.

richardg_uk wrote:

(In reply to comment #3)

Jobs are pulled from the queue randomly these days, so this metric may be less meaningful.

Even in the worst such case, a median time could be reported to at least indicate the "typical" delay.

But frankly, if old jobs remain stuck in the queue for so long that MIN(job_timestamp) becomes effectively meaningless, then the picking method itself is dubious, which is a bad reason for not exposing the metric!

Though the queue age might seem unrelated to operational concerns (since the JobQueue is a documented and controlled breach of database consistency), the duration is relevant to the general issue of database integrity (because, like cached pages, links should not be out of date indefinitely).

More practically, a disclosed statistic would reassure editors, who frequently and understandably ask: "Will my template edit or category change propagate, even though I saved the edit ages ago and nothing has happened?"

If editors knew that pages would be updated in a reasonable and foreseeable length of time, they would be less inclined to resort to the current practice of purging and saving null edits to bypass the inscrutable job queue, a practice that imposes extra load on the servers as well as on the editors concerned!

Hopefully, all the work Aaron has put into overhauling the job queue code should mean it's much better from now on.

But it is somewhat arbitrary. The job queue count was removed from user view (Special:Statistics? I think) because it was misleading. It was left in the API for developers etc.

richardg_uk wrote:

(In reply to comment #5)

Hopefully, all the work Aaron has put into overhauling the job queue code should mean it's much better from now on.

Even more reason to have a way to measure the result of Aaron's work! Not knowing the detail, I can't think why FIFO would not be the preferred picking algorithm, given that the timestamp is indexed; but whatever the method, you'd surely not want any jobs to remain unprocessed for days, weeks or months at a time?

But it is somewhat arbitrary. The job queue count was removed from user view (Special:Statistics? I think) because it was misleading. It was left in the API for developers etc.

I agree that the size of the queue was not directly relevant to editors. But turning your comment on its head, I would say that editors deserve to know about likely propagation delays (per comment #4), and that the obscurity of the API (from most editors' perspective) means that it would have been far better if [[Special:Statistics]] had added this additional information instead of removing the little information that it used to expose about the queue.

Readers expect modern websites to be more-or-less up to date, and logged-in editors are used to seeing wiki pages that are current in all other respects. So editors, admins, tech ops and the servers would all benefit if durations were easily identifiable, distinguishing routine high I/O from exceptional delays and so avoiding enquiries and manual purging.

(In reply to comment #6)

I agree that the size of the queue was not directly relevant to editors. But turning your comment on its head, I would say that editors deserve to know about likely propagation delays (per comment #4), and that the obscurity of the API (from most editors' perspective) means that it would have been far better if [[Special:Statistics]] had added this additional information instead of removing the little information that it used to expose about the queue.

One of the reasons this number was considered misleading was that, for performance reasons, the queue size is estimated rather than counted, and these estimates can be wildly inaccurate. This will remain an issue at least as long as the queue is kept in MySQL.

richardg_uk wrote:

(In reply to comment #7)

One of the reasons this number was considered misleading was that, for performance reasons, the queue size is estimated rather than counted, and these estimates can be wildly inaccurate. This will remain an issue at least as long as the queue is kept in MySQL.

Understood (per bug 27584). I only mentioned the Special:Statistics removal in response to reedy's aside. The API jobs value seems to oscillate disconcertingly between 3 disparate values at any one time (presumably because of aggressive caching), but that oddity is outside the scope of this request.

Focusing again on reporting the queue duration:

The proposed information would be easy to calculate accurately, as well as being more useful to know than the queue size. Even if value caching were needed (doubtful), the cached value would only need refreshing each time the oldest job is run. In any case, the minimum value of an indexed field is already a well-optimised database lookup.
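
If caching did prove necessary, a minimal sketch along those lines, assuming an arbitrary 60-second TTL rather than refresh-on-run (the cache key is made up):

global $wgMemc;
$key = wfMemcKey( 'siteinfo', 'oldestjobtimestamp' );
$oldest = $wgMemc->get( $key );
if ( $oldest === false ) {
    $dbr = wfGetDB( DB_SLAVE );
    $oldest = $dbr->selectField( 'job', 'MIN(job_timestamp)', '', __METHOD__ );
    // Empty-queue handling omitted for brevity.
    $wgMemc->set( $key, $oldest, 60 );
}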

When this enhancement was previously requested (bug 13786, which I have only just found), it was rejected because the job_timestamp column did not exist at the time. So the main hurdle has already been overcome.

This bug seems like a duplicate of bug 9518 or bug 13786.