Job queue estimate often woefully inaccurate; need a better strategy
Closed, Resolved (Public)

Description

Something seems to be filling the job queue in a loop.
We see, in pretty quick succession:

5723
5726

3  (sometimes)
2

5723
... etc.

If this is caused by recursive behaviour in template
calls, it was likely introduced, or stirred into the job
queue, by changes made in recent days, most likely on or
after April 4th, when I edited a system of interdependent
templates which I mostly did not write or design, and at
best partially understand. I tried to simplify them by
using parser functions instead of the highly complicated
condition-evaluating templates of the pre-parser-function
era.
I remember inspecting the job queue after one such change
and seeing it go down to zero, but frankly cannot recall when.


Version: unspecified
Severity: enhancement
Platform: Other

Details

Reference
bz9518

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 9:40 PM
bzimport set Reference to bz9518.
bzimport added a subscriber: Unknown Object (MLST).

robchur wrote:

The job queue is a queue of link updates waiting to be processed. We have an
external script running on a continuous cycle to perform these updates. Don't
worry if the queue fills up; it will be processed in turn.

A high job queue count after editing a heavily-used template is normal. That the
job queue might not fall to zero on well-established projects with frequent
edits is also normal.

robchur wrote:

*** Bug 9520 has been marked as a duplicate of this bug. ***

robchur wrote:

*** Bug 9521 has been marked as a duplicate of this bug. ***

There have been zero edits for more than 2 hours according
to both [[Special:Recentchanges]] and the rc IRC channel.

The job queue count appears to be cyclic with a
repetition period of considerably less than a minute.

I have never seen a nonzero job queue count for
extended periods of time on this project, which I
monitor closely, i.e. more frequently than daily on
average.

I suggest that someone with server access should have a
look at the job queue and report which pages are
inserted, and/or always kept. I am pretty certain there
is some loop there. Knowing the pages involved, I might
be able to guess at its origin.

If there is indeed an endless loop feeding the job
queue, precautions should be taken in the wiki software
so as not to waste server resources.

robchur wrote:

The job runner works on a cycle across all wikis. It may take some time to reach
yours.

I know that it usually takes much more time until a job
queue length change is shown in [[Special:Statistics]]
for this wiki.

Please point a browser to
http://ksh.wikipedia.org/wiki/Spezial:Statistik and make it
reload the page some 60 times in succession over ~2 minutes.

If you then believe what you see is healthy, I shall
give up reopening this bug.

Keep in mind, this has been going on for at least two and
a half hours, and there have been no edits at all, including
no bot edits, yet the job queue size keeps changing quickly
and repetitively, all the time.

Is this a long-term issue?
Since this morning the job queue count is approximate, and different servers may return different approximations.

If a server has a long-open transaction, purged rows will still be included in the estimate, hence showing a big number, whereas there will be fewer actual job entries.

This is a good hint, at least explaining the grossly
altered "general behaviour" since this morning.

We see rapid changes in the displayed job queue length,
about a dozen or more cycles per minute, while
experience teaches that there should be no job queue.

If the approximation may lead to a nonzero figure when
the queue is or has been empty for a long time, then, yes,
that would possibly explain what one sees. From a user's
point of view, it may be bewildering, of course.

If a constantly empty queue (at least after some
time less than 2 hours) should also be reflected on
[[Special:Statistics]] as zero, then we still have no
obvious explanation for the figures shown.

ayg wrote:

After the one-sentence explanation of how it works that _syphilis_ gave me on IRC, I would guess
that it should show 0 if the length is 0. If it were more than 0, though, the number would probably
fluctuate randomly every time you refresh the page, which may be what you're seeing.

Remember, the job queue is shared, and the statistic given on ksh-wiki's page reflects job queue
items created by many other wikis, so it's not surprising that it should be nonzero all the time.

Sounds normal to me, but I'll let a server admin decide whether to close.

Ok, fluctuation explained.

Nonzero for hours of no edit activity is at least
extremely unlikely, and there should be no cause for it
if everything were as it should be.

Observed quick fluctuation following a predictable
scheme for more than 26 hours now (assuming it did not
change back and forth while no one was watching) is
either an error in the wiki, e.g. circular definitions
in templates (which imho would not explain the speed of
change), or probably something pretty strange in the
display of the figures themselves.

I would like the actual job queue entries on the server
inspected so as to get better clues.

I did inspect and follow the job queue figures on some
other Wikipedia wikis, observing similar patterns at:

be: 146 - 257 - 366 - 967
qu: 5 - 5 - 5 - 1
es: 286 - 24 - 24 - 286

but not everywhere. Of course, I only followed them for
a few minutes. Yet:

ksh: 5723 - 5726 - 3 - 2

seems to have been happening for 26 hours, pretty much unaltered.

daniel wrote:

The job queue on my testwiki is never empty either. After a new installation of mw the job queue shows "1". After editing a template the queue was at 2500. With runJobs.php I can reduce this to 109 but not further. Now there are always at least 109 jobs in the queue, even after running runJobs.php.

This error only appears on mw 1.10. On 1.9 the job count works.

In bug #10417 comment #4, Brion Vibber writes something that seems contradictory to several explanatory statements given above. Maybe this only reflects that I have not understood enough.

At least the job queue on kshwiki has been showing seemingly random figures for months, which do not say anything. There is no longer any way to tell when one's template changes have been propagated, even if there is only one editor and one edit for extended periods of time, as evident from recent changes.

robchur wrote:

The job queue is a processing queue for operations which can be deferred. This queue is stored in the database and can be processed via a couple of methods.

The default method is to execute one job per page view provided that a job can be obtained without causing deadlocks on the job table. This execution rate can be tweaked in the site configuration. We do not use this method on Wikimedia wikis.

The second method is to execute a maintenance script which processes all jobs in the queue at once. On Wikimedia wikis, we use this method in a specific manner, that is, we have multiple application servers which are allocated a pool of wikis, and these run a script which loops through each wiki in turn, completes the job queue for that wiki, and moves on. I suspect a previous incomplete explanation of this has caused the confusion.
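To make the cycle described above concrete, here is a minimal sketch, not the actual Wikimedia runner script. The wiki names, the job strings and the `run_cycle` function are all hypothetical; the only point is that a wiki's queue stays non-empty until the loop reaches it.

```python
from collections import deque
from typing import Deque, Dict, List


def run_cycle(queues: Dict[str, Deque[str]]) -> List[str]:
    """Visit each wiki in turn, drain its queue completely, then move on.

    Returns the jobs in the order they were processed, as "wiki:job".
    """
    processed: List[str] = []
    for wiki, queue in queues.items():
        # A wiki later in the cycle keeps its jobs until we get here.
        while queue:
            job = queue.popleft()
            processed.append(f"{wiki}:{job}")
    return processed


# Hypothetical pool of wikis allocated to one application server.
pool = {
    "kshwiki": deque(["refreshLinks A", "refreshLinks B"]),
    "eswiki": deque(["refreshLinks C"]),
    "bewiki": deque(),
}
order = run_cycle(pool)
print(order)
```

With this shape, jobs added to a wiki just after the loop has passed it wait for a full cycle, which is one way a queue can look "stuck" for a while without anything being wrong.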

The job queue count shown on Special:Statistics is generated using a clever trick which avoids an expensive COUNT(*) on the job table. This trick means that the value will fluctuate, often to the point where it's downright incorrect. Our replicated environment also doesn't help much in this regard.

There have been previous discussions regarding this count. There are those in favour of removing it, since it provides a means for some clueless users to spread FUD. There are suggestions that we could generate an accurate count and cache this for a period of time, although this has drawbacks too, since the cached data may be wholly inaccurate. There are further suggestions to graph the data in some manner, providing a more useful visual overview of what's going on.
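The "accurate count, cached for a period of time" suggestion could look roughly like the sketch below. It is only an illustration: the `CachedJobCount` class, the 60-second lifetime and the in-memory SQLite table standing in for the job table are assumptions, not anything MediaWiki provides.

```python
import sqlite3
import time
from typing import Callable, Optional


class CachedJobCount:
    """Run the exact COUNT(*) at most once per ttl_seconds."""

    def __init__(self, conn: sqlite3.Connection, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._conn = conn
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[int] = None
        self._fetched_at = 0.0

    def get(self) -> int:
        now = self._clock()
        if self._value is None or now - self._fetched_at >= self._ttl:
            (count,) = self._conn.execute("SELECT COUNT(*) FROM job").fetchone()
            self._value = count
            self._fetched_at = now
        return self._value


# Demo with a fake clock so the expiry is deterministic.
fake_now = [0.0]
cache_db = sqlite3.connect(":memory:")
cache_db.execute("CREATE TABLE job (job_id INTEGER PRIMARY KEY, job_cmd TEXT)")
cache_db.executemany("INSERT INTO job (job_cmd) VALUES (?)", [("refreshLinks",)] * 2)

counter = CachedJobCount(cache_db, ttl_seconds=60.0, clock=lambda: fake_now[0])
first = counter.get()    # exact count, freshly computed
cache_db.execute("INSERT INTO job (job_cmd) VALUES ('refreshLinks')")
stale = counter.get()    # still the cached value, now out of date
fake_now[0] = 61.0
fresh = counter.get()    # cache expired, recounted
print(first, stale, fresh)
```

The `stale` read shows the drawback mentioned above: between refreshes the displayed number can be arbitrarily far from the real queue length.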

Due to the nature of a large number of Wikimedia wikis (that is, that they are often prone to heavy editing), and due to the uses we make of the job queue, any given wiki may, for a short (for some definition of "short") time, have a large job queue. This is not a problem, it will be dealt with when the cycle returns to process that particular wiki.

Third parties who are still worried about their job queue sizes are encouraged to:

  • run maintenance/showJobs.php or execute a SELECT COUNT(*) FROM job against their database, to determine what's left on the queue
  • run maintenance/runJobs.php on a periodic basis (or set up a cron job) to ensure that larger queues are processed

There is no actual bug here, just a common misperception about what is being shown and what is happening behind the scenes.

Reopening. The clever index trick shows wholly useless incorrect numbers in many cases, so I would indeed recommend improvements.

robchur wrote:

Repurposing to focus on this.

Just to ensure the suggestion is documented: we could run the estimate, and then run the actual COUNT(*) if the estimate returns a number of rows below some threshold.
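A minimal sketch of that suggestion, under stated assumptions: the estimator is passed in as a plain function (here a stand-in lambda, since the real index-based estimate is not shown in this thread), the threshold of 1000 is arbitrary, and an in-memory SQLite table stands in for the job table.

```python
import sqlite3
from typing import Callable, Tuple


def job_queue_size(conn: sqlite3.Connection,
                   estimate: Callable[[sqlite3.Connection], int],
                   threshold: int = 1000) -> Tuple[int, bool]:
    """Return (count, is_exact).

    Use the cheap estimate when it is large; fall back to an exact
    COUNT(*) only when the estimate is below the threshold.
    """
    approx = estimate(conn)
    if approx >= threshold:
        return approx, False
    (exact,) = conn.execute("SELECT COUNT(*) FROM job").fetchone()
    return exact, True


count_db = sqlite3.connect(":memory:")
count_db.execute("CREATE TABLE job (job_id INTEGER PRIMARY KEY, job_cmd TEXT)")
count_db.executemany("INSERT INTO job (job_cmd) VALUES (?)", [("refreshLinks",)] * 3)

# Stand-in estimators returning the kinds of figures reported above.
small = job_queue_size(count_db, lambda c: 2)      # low estimate: recount exactly
large = job_queue_size(count_db, lambda c: 5723)   # high estimate: trusted as-is
print(small, large)
```

Note the limitation this exposes: if the estimate is wrongly high while the queue is actually empty, the fallback never triggers, so the threshold alone would not fix the 5723-versus-0 case described earlier.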

lars wrote:

Currently, [[Special:Statistics]] only shows the length of the job queue. I don't mind 1000 jobs in queue if they are executed in a millisecond each, but if they take two seconds each I might as well take a lunch break.

I suggest the age of the job at the head of the queue should also be reported. If the job to be executed next is one hour old, I can expect my newly added job to be executed within roughly one hour. This is not an expensive computation, just the retrieval of a single attribute from a single object.
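As a rough illustration of reporting the age of the oldest job, assuming each row carries an enqueue time: the `enqueued_at` column, the index on it and the `head_of_queue_age` helper are hypothetical names for this sketch, and SQLite stands in for the real database.

```python
import sqlite3
from typing import Optional


def head_of_queue_age(conn: sqlite3.Connection, now: float) -> Optional[float]:
    """Seconds since the oldest queued job was added, or None if empty."""
    (oldest,) = conn.execute("SELECT MIN(enqueued_at) FROM job").fetchone()
    if oldest is None:
        return None
    return now - oldest


age_db = sqlite3.connect(":memory:")
age_db.execute(
    "CREATE TABLE job (job_id INTEGER PRIMARY KEY, job_cmd TEXT, enqueued_at REAL)"
)
# An index lets MIN() be answered from one end of the index
# rather than by scanning the whole table.
age_db.execute("CREATE INDEX job_enqueued_at ON job (enqueued_at)")

empty_age = head_of_queue_age(age_db, now=4600.0)

age_db.executemany(
    "INSERT INTO job (job_cmd, enqueued_at) VALUES (?, ?)",
    [("refreshLinks", 1000.0), ("refreshLinks", 2500.0)],
)
age = head_of_queue_age(age_db, now=4600.0)
print(empty_age, age)
```

Here the oldest job is 3600 seconds old, so a newly added job could expect to wait on the order of an hour if jobs are processed in insertion order.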

One possible option: split the line out of Special:Statistics and make a Special:JobQueue. The page would show the total number of jobs and the bottom ten jobs in a table, and have a mechanism to run the jobs for wikis that do it manually (so that site admins do not have to do it from the command line).

Why is this number shown at all? Under what circumstances would it be useful to know how many jobs there are? Would a simple "there are jobs" / "there are no jobs" be sufficient?

matthew.britton wrote:

(In reply to comment #18)

Would simple "there are jobs" "there are no jobs"
be sufficient?

On a large wiki there are always jobs anyway. But yes, the number is meaningless to users and there's no real reason to show it.

mike.lifeguard+bugs wrote:

Domas had some idea about this the other day.

I removed them from the user interface in r65059, since it was just confusing users. The number is still accessible, so the issue is not solved. It's just smaller priority now.

(In reply to comment #21)

I removed them from the user interface in r65059, since it was just confusing
users. The number is still accessible, so the issue is not solved. It's just
smaller priority now.

That number will be useless for WMF sites, since the queue is not in that table.

Change 71966 had a related patch set uploaded by Aaron Schulz:
jobqueue: improved performance of JobQueueGroup::getQueuesWithJobs()

https://gerrit.wikimedia.org/r/71966

Change 71966 merged by jenkins-bot:
jobqueue: improved performance of JobQueueGroup::getQueuesWithJobs()

https://gerrit.wikimedia.org/r/71966