
OOM on getting metadata for some OGG files (metadata reading hits memory_limit)
Closed, ResolvedPublic

Description

Test case attached.

Steps to reproduce:

  1. Open eval.php, and create an OggHandler object.
  2. Set your memory_limit below 50M.
  3. Call $OggHandler->getMetadata( null, '/path/to/test/case' ); a minimal sketch of such a session is shown below.
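
A minimal sketch of such an eval.php session (the getMetadata() signature is as used in this report; the file path is the same placeholder as above):

  ini_set( 'memory_limit', '50M' );  // anything below roughly 50M triggers the failure

  $handler = new OggHandler();
  // The first argument (the File object) is not needed here, so null is passed,
  // as in the calls quoted further down.
  $metadata = $handler->getMetadata( null, '/path/to/test/case' );

  echo 'Memory used: ' . memory_get_usage() . "\n";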

Result:
PHP dies with OOM.

This is occurring on Wikimedia sites for *some* files with uncached metadata.

I did some research and debugging, and it always seems to die in _decodePageHeader in File/Ogg.php. It tries to list the streams (of which, in theory, there should only be 5 or 6), storing the data as it goes, and then runs through the streams to generate aggregate data.

Using COUNT_RECURSIVE and no memory_limit, I counted the number of pieces of stream information stored in _streamList for the test case and for the featured media of the day, which happened to be [[File:Eichmann_trial_news_story.ogg]]:

$h = new OggHandler; $m = $h->getMetadata( null, '/Users/andrew/En-The_Raven-wikisource.ogg' )

Class File_Ogg not found; skipped loading
Memory used: 50356180
Size of _streamList is 398175

$h = new OggHandler; $m = $h->getMetadata( null, '/Users/andrew/Eichmann_trial_news_story.ogg' );

Class File_Ogg not found; skipped loading
Memory used: 7901476
Size of _streamList is 10662
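
For reference, a sketch of how these figures can be gathered (this assumes File_Ogg's internal _streamList array has been exposed for debugging; $streamList below stands in for it and is not normally accessible from outside the class):

  $h = new OggHandler;
  $m = $h->getMetadata( null, '/Users/andrew/En-The_Raven-wikisource.ogg' );
  echo 'Memory used: ' . memory_get_usage() . "\n";

  // COUNT_RECURSIVE counts every nested element of the array, which is what
  // the "Size of _streamList" figures above report.
  echo 'Size of _streamList is ' . count( $streamList, COUNT_RECURSIVE ) . "\n";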

RECOMMENDED RESOLUTION:

It makes the most sense to resolve this by aggregating whatever data is needed to be aggregated as the stream list is generated, rather than at the end.
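
A rough sketch of that shape of fix, using illustrative names rather than the actual File_Ogg internals: fold each decoded page header into per-stream running totals and keep nothing per-page.

  // Hypothetical helper: the $streams/$page array keys are illustrative only.
  function foldPageIntoStream( array &$streams, array $page ) {
      $serial = $page['stream_serial'];
      if ( !isset( $streams[$serial] ) ) {
          $streams[$serial] = array(
              'first_granule' => $page['granule_pos'],
              'last_granule'  => $page['granule_pos'],
              'data_length'   => 0,
              'pages'         => 0,
          );
      }
      $streams[$serial]['last_granule'] = $page['granule_pos'];
      $streams[$serial]['data_length'] += $page['data_length'];
      $streams[$serial]['pages']++;
      // Nothing per-page is retained, so memory stays proportional to the
      // number of streams (5 or 6) rather than the number of pages.
  }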


Version: unspecified
Severity: major

Details

Reference
bz19476

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:41 PM
bzimport set Reference to bz19476.

OOM can also happen within exif_read_data for JPEGs with lengthy EXIF data.

  • Bug 19870 has been marked as a duplicate of this bug.
  • Bug 20801 has been marked as a duplicate of this bug.

Bumping this up from an enhancement...

  • Bug 20811 has been marked as a duplicate of this bug.

I'm still experiencing the same problem described in bug 20811, also with a DjVu file (it's 40 MB, this one: http://www.archive.org/details/VocabolarioAccademiciCruscaEdi3Vol3).

mike.lifeguard+bugs wrote:

(In reply to comment #4)

> *** Bug 20801 has been marked as a duplicate of this bug. ***

On this bug, note that even Special:WhatLinksHere/File:... fails:

http://meta.wikimedia.org/wiki/Special:WhatLinksHere/Image:Screencast_-_Spam_blacklist_introduction_and_COIBot_reports_-_small.ogg

No metadata should need to be loaded here at all, not even the duration, which is apparently "needed" for the image description page. The same goes for pages that large files are linked from: they don't need file metadata, so they shouldn't try to fetch it.

As well, if this metadata is so expensive to get that we run out of memory, then it should be stored so it only needs to be done once, on upload.

(In reply to comment #8)

> As well, if this metadata is so expensive to get that we run out of memory, then it should be stored so it only needs to be done once, on upload.

It is stored, but it obviously can't be if the processing failed.

mike.lifeguard+bugs wrote:

+mdale in case he can help :)

What about using an external program for that, if one is available?
That would give more fine-grained control over memory use, and it wouldn't kill the whole page.

mdale wrote:

I recommend we use the ffmpeg2theora --info command. It outputs the data in JSON and seeks to the end of the file to get the duration, so it is much faster than an oggz-info type command that does a linear scan of the file and outputs non-structured data that would have to be parsed. Also, ffmpeg2theora is a static binary, so it should be easier to deploy. I will create a patch.
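
As a sketch of what that call-out might look like on the PHP side (this is not the actual r57933 patch; the JSON key names are assumptions that would need checking against real ffmpeg2theora --info output):

  $cmd = wfEscapeShellArg( 'ffmpeg2theora' ) . ' --info ' . wfEscapeShellArg( $path );
  $retval = 0;
  $json = wfShellExec( $cmd, $retval );  // runs under MediaWiki's shell time/memory limits
  if ( $retval === 0 ) {
      $info = json_decode( $json, true );
      // 'duration' is an assumed key name in the JSON output.
      $duration = isset( $info['duration'] ) ? floatval( $info['duration'] ) : 0.0;
  }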

mdale wrote:

I created the patch to call out to ffmpeg2theora in r57933. But ffmpeg2theora does not list the offset time, so we have to "fix" the ffmpeg ogg demuxer to know about stream offsets, or use a different tool.

Regardless, we should fix the PHP fallback solution to be less memory-heavy.

mdale wrote:

Jan has patched ffmpeg2theora, freed has deployed it, and I will shortly push the updated ffmpeg2theora time-grabbing code to deployment.

mdale wrote:

patch to use ffmpeg2theora for metadata

Here is a patch for the wmf-deployment branch. I never got clarity from anyone on whether we can push this out or not.

Attached:

Ehm, can you apply the patch? I haven't been able to upload a file on Commons for two months now...

mdale wrote:

Yeah, it would be good to get this applied, and/or review it and let me know what has to be changed.

We are at r61846 (https://wikitech.wikimedia.org/?diff=24985), but I still have the same problem described in bug 20811#c0.

(In reply to comment #19)

> We are at r61846

This is the version of /branches/wmf-deployment, not /trunk/phase3; this doesn't mean that r60492 has been deployed yet.

(In reply to comment #20)

> (In reply to comment #19)
> > We are at r61846
>
> This is the version of /branches/wmf-deployment, not /trunk/phase3; this doesn't mean that r60492 has been deployed yet.

Thank you. Sorry.
Anyway, the bug for DjVu seems resolved, at least for some files; see bug 20811#c6.