
not all jobs are processed (webVideoTranscode)
Closed, Invalid · Public

Description

Author: jgerber

Description:
Videoscalers run "jobs-loop.sh ... webVideoTranscode" to process video transcoding jobs. There are unprocessed jobs in the queue that never run.


Version: unspecified
Severity: major

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 1:43 AM
bzimport set Reference to bz47312.
bzimport added a subscriber: Unknown Object (MLST).

Seems to be jobs failing rather than never getting run.

Jan, can you clarify what you think the next steps on this one should be?

jgerber wrote:

http://commons.wikimedia.org/wiki/File:Turtle_at_Mississippi_River_Park_and_Museum_Tunica_Resorts_MS.oggtheora.ogv is another case where it's happening; I'm trying to get more data on it now to see what might cause it.

(In reply to comment #0 by jgerber)

Videoscalers run "jobs-loop.sh ... webVideoTranscode" to process video
transcoding jobs. There are unprocessed jobs in the queue that never run.

(In reply to comment #1 by aschulz4587)

Seems to be jobs failing rather than never getting run.

Is this still a problem, and how can we find out?

(In reply to comment #4 by jgerber)

trying to get more data on it now to see what might cause it.

jgerber: Did this ever happen?

(In reply to Andre Klapper from comment #6)
Currently we have 6723 queued transcodes as per https://commons.wikimedia.org/wiki/Special:TimedMediaHandler
It seems most (if not all) of them were "Added to Job queue 37 days, .. hours, .. minutes, .. seconds ago" as part of the 160p-ogg-transcode batch.

However, the jobs do not seem to fail when resubmitted, so I don't know whether this is related to this bug report specifically. Do you think I should open another bug report requesting that all queued transcodes be re-queued?

(In reply to Marco from comment #7)

(In reply to Andre Klapper from comment #6)
Currently we have 6723 queued transcodes as per
https://commons.wikimedia.org/wiki/Special:TimedMediaHandler
It seems most (if not all) of them were "Added to Job queue 37 days, .. hours, ..
minutes, .. seconds ago" as part of the 160p-ogg-transcode batch.

However, the jobs do not seem to fail when resubmitted, so I don't know
whether this is related to this bug report specifically. Do you think I should
open another bug report requesting that all queued transcodes be re-queued?

It's separate. (I think it's related to how we temporarily stopped making 160p files. When we re-enabled it, I don't think jobs got re-added, despite what that page says.) Anyhow, see bug 61690.

(I think it's related to how we temporarily stopped making 160p
files. When we re-enabled it, I don't think jobs got re-added, despite what
that page says.)

Err, actually we didn't do that. So never mind....

Anyway, it looks like a bunch of jobs disappeared, and now there's an inconsistent state in which TimedMediaHandler thinks they are just pending, not gone. That may or may not be the same bug as this one; I'm not sure.

(In reply to Bawolff (Brian Wolff) from comment #9)

(I think it's related to how we temporarily stopped making 160p
files. When we re-enabled it, I don't think jobs got re-added, despite what
that page says.)

Err, actually we didn't do that. So never mind....

Anyway, it looks like a bunch of jobs disappeared, and now there's an
inconsistent state in which TimedMediaHandler thinks they are just pending,
not gone. That may or may not be the same bug as this one; I'm not sure.

Ugh, sorry. Cannot read. Thought this was bug 61401. Ignore everything I said. This is quite likely the right bug.

(In reply to Andre Klapper from comment #6)

(In reply to comment #0 by jgerber)

Videoscalers run "jobs-loop.sh ... webVideoTranscode" to process video
transcoding jobs. There are unprocessed jobs in the queue that never run.

(In reply to comment #1 by aschulz4587)

Seems to be jobs failing rather than never getting run.

Is this still a problem, and how can we find out?

Someone with access to the job queue log (I believe that's basically the folks with shell access) can find out whether there are failing jobs that do not have their "failed" status reflected in the transcode table.

It would be interesting to see if those 6723 queued transcodes ever ran and failed, or if they just never ran.
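
As a rough illustration of that check, here is a minimal query sketch someone with database access could adapt. It assumes the column names of the TimedMediaHandler transcode table (transcode_time_addjob, transcode_time_startwork, transcode_time_success, transcode_time_error); the connection details are placeholders, not real hosts or credentials.

```
# Hypothetical sketch: list transcode rows that were queued but apparently
# never picked up by a job runner. Column names follow the TimedMediaHandler
# transcode table as I understand it; connection details are placeholders.
import pymysql

conn = pymysql.connect(host="db-replica.example", user="reader",
                       password="secret", database="commonswiki")

QUERY = """
SELECT transcode_image_name, transcode_key, transcode_time_addjob
FROM transcode
WHERE transcode_time_addjob IS NOT NULL        -- a job was queued
  AND transcode_time_startwork IS NULL         -- but it never started
  AND transcode_time_success IS NULL
  AND transcode_time_error IS NULL
ORDER BY transcode_time_addjob
"""

with conn.cursor() as cur:
    cur.execute(QUERY)
    stuck = cur.fetchall()

print(f"{len(stuck)} transcodes queued but never started")
for name, key, added in stuck[:20]:
    print(name, key, added)
```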


It might be good to do something like this: if there are transcodes that have been pending for more than 10 days, and the job queue for webVideoTranscode is empty, then automatically re-add the jobs for those videos.
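
A minimal sketch of that idea, in Python rather than the PHP a real TimedMediaHandler maintenance script would use; get_pending_transcodes(), get_job_queue_size() and requeue_transcode() are hypothetical stand-ins, and the 10-day threshold is the one suggested above.

```
# Hypothetical sketch of the "re-add stale transcode jobs" idea.
# The three callables are stand-ins for whatever the real maintenance
# script would use to read pending transcodes, check the job queue,
# and re-queue a transcode job.
import time

PENDING_THRESHOLD = 10 * 24 * 3600  # 10 days, as suggested above


def maybe_requeue_stale_transcodes(get_pending_transcodes,
                                   get_job_queue_size,
                                   requeue_transcode):
    """Re-add transcode jobs that have sat in the "queued" state too long.

    Only acts when the webVideoTranscode queue is empty, so work that is
    merely waiting its turn never gets double-queued.
    """
    if get_job_queue_size("webVideoTranscode") > 0:
        return 0  # jobs are still flowing; nothing to do

    now = time.time()
    requeued = 0
    for transcode in get_pending_transcodes():
        # Each transcode is assumed to carry the file title, the transcode
        # key, and the timestamp when the job was originally added.
        if now - transcode.added_at > PENDING_THRESHOLD:
            requeue_transcode(transcode.file_title, transcode.key)
            requeued += 1
    return requeued
```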

(In reply to comment #10)
Someone with access to job queue log (I believe that's basically the folks
with shell access) can find out if there are failing jobs that do not have
their "failed" status reflected in the transcode table.

It would be interesting to see if those 6723 queued transcodes ever ran and
failed, or if they just never ran.

Is there any progress on this analysis?

It might be good to do something like this: if there are transcodes that
have been pending for more than 10 days, and the job queue for
webVideoTranscode is empty, then automatically re-add the jobs for those
videos.

Is there any progress in implementing this? If not, I may code a bot to fulfill the request to reset those transcodes made at https://commons.wikimedia.org/w/index.php?title=Commons%3ABots%2FWork_requests&diff=122148976&oldid=122026535.

Change 133994 had a related patch set uploaded by Brian Wolff:
Automatically re-add transcode jobs if transcode pending for 72h

https://gerrit.wikimedia.org/r/133994

(In reply to Gerrit Notification Bot from comment #12)

Change 133994 had a related patch set uploaded by Brian Wolff:
Automatically re-add transcode jobs if transcode pending for 72h

https://gerrit.wikimedia.org/r/133994

This would obviously be a band-aid solution. We should figure out why they aren't working in the first place.

(In reply to Bawolff (Brian Wolff) from comment #13)

https://gerrit.wikimedia.org/r/133994

This would obviously be a band-aid solution. We should figure out why they
aren't working in the first place.

Workaround patch has two -1s. What's the plan here?

(In reply to Andre Klapper from comment #14)

(In reply to Bawolff (Brian Wolff) from comment #13)

https://gerrit.wikimedia.org/r/133994

This would obviously be a band-aid solution. We should figure out why they
aren't working in the first place.

Workaround patch has two -1s. What's the plan here?

Good question. I need to figure out a new plan, investigate further, or talk it over with Aaron. Gilles' -1 is just cosmetic, so that -1 is trivial.

tomasz set Security to None.
brion claimed this task.
brion subscribed.

I don't seem to see this exact behavior anymore. Things fail in expected ways now, like crashing or hanging ffmpeg processes, or other weird timeouts. :)

Change 133994 had a related patch set uploaded (by Paladox):
Automatically re-add transcode jobs if transcode pending for 72h

https://gerrit.wikimedia.org/r/133994

Change 133994 abandoned by TheDJ:
Automatically re-add transcode jobs if transcode pending for 72h

Reason:
Because brion states in T49312 that he is no longer seeing these types of failures.

https://gerrit.wikimedia.org/r/133994