
Memory limit hit while uploading DjVu file with embedded text
Closed, Resolved · Public

Description

Hello,

I got a crash twice while uploading a file to Commons:
PHP fatal error in /usr/local/apache/common-local/php-1.17/includes/normal/UtfNormal.php line 285:
Allowed memory size of 125829120 bytes exhausted (tried to allocate 56 bytes)

The file is http://ia600301.us.archive.org/11/items/MN40239ucmf_2/MN40239ucmf_2.djvu
(a DjVu file from the Internet Archive).

Thanks, Yann


Version: unspecified
Severity: normal

Details

Reference
bz28146

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 21 2014, 11:26 PM
bzimport set Reference to bz28146.

Zyephyrus tried 3 times: same error.
File is available from http://www.archive.org/details/MN40239ucmf_2

Here is the message that I got when trying:

PHP fatal error in /usr/local/apache/common-local/php-1.17/includes/normal/UtfNormal.php line 285:
Allowed memory size of 125829120 bytes exhausted (tried to allocate 24 bytes)

Zeph (Zyephyrus)

That would be the preg_match_all() in quickIsNFCVerify(). It could easily be rewritten to use preg_match() with an offset.
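
For illustration only, a minimal sketch of that approach; the pattern, the scanChunks() name, and the per-chunk callback are placeholders for this comment, not the real UtfNormal internals. The idea is to walk the string with preg_match() plus PREG_OFFSET_CAPTURE instead of materializing every chunk up front with preg_match_all():

function scanChunks( $string, $pattern, $handleChunk ) {
    // Assumes $pattern never matches an empty string, so the offset
    // always advances and the loop terminates.
    $offset = 0;
    $len = strlen( $string );
    while ( $offset < $len &&
        preg_match( $pattern, $string, $m, PREG_OFFSET_CAPTURE, $offset )
    ) {
        $chunk = $m[0][0];                      // matched text
        $offset = $m[0][1] + strlen( $chunk );  // advance past this match
        call_user_func( $handleChunk, $chunk ); // validate/normalize one chunk
    }
}

Only the current chunk is held in memory at any time, instead of an array of every chunk in the document.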

It might also be wise to divide up the giant DjVu data set better. It looks like the *entire* page text metadata for all pages in the file gets read in as a single batch in DjVuImage::retrieveMetaData().

This entire set of output is run through UtfNormal::cleanUp() in one piece -- which is where the above error occurs -- then divided up into pages, and then put back into a giant XML string that gets saved as the file's metadata. That giant string later gets read back in and parsed into an XML DOM for access, but in the meantime it sits around bloating up the image table record, memcached, and the responses sent to anybody fetching the document info via InstantCommons.

Sure, but that doesn't conflict with the simple improvement I'm proposing for quickIsNFCVerify(). There's no reason that function can't work with large strings.

Agreed, putting it on my fun queue.

Created attachment 8362
Work in progress test patch (requires PHP 5.3)

I did a quick try of serially running preg_match() and bumping the offset, and found it far too slow: it ran for at least several minutes on a large German test set without ever completing.

Redoing it to use preg_replace_callback(), with the loop body moved into an anonymous function for convenience, works, but still with a major performance regression on the German test set (from 14 MB/sec to 0.5 MB/sec).
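
Very roughly, the shape of that variant, with placeholder names rather than the attached patch itself; normalizeChunk() stands in for the real per-chunk work, and the closure is why PHP 5.3 is required:

function normalizeChunk( $chunk ) {
    // Stand-in for the real per-chunk NFC check/cleanup.
    return $chunk;
}

function cleanUpByCallback( $string, $pattern ) {
    // Chunks are handled one at a time as the regex engine finds them,
    // instead of being collected into one giant array first.
    return preg_replace_callback(
        $pattern,
        function ( $m ) {
            return normalizeChunk( $m[0] );
        },
        $string
    );
}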

Russian, Japanese, and Korean are slowed down much less, from about 2.2 MB/sec to about 1.9 MB/sec.

This is likely due to splitting up the ASCII and non-ASCII sections being much more expensive for German, which like most European languages mixes ASCII and non-ASCII Latin characters together. The other scripts are mostly large non-ASCII blocks, so there are fewer pieces to split apart.

Per-loop overhead seems to be a lot higher with preg_replace_callback() (and much more so with serial preg_match()) than with the preg_match_all() + foreach approach... but the giant array is also very inefficient for European languages, because many of the chunks end up as very short strings, which probably contributes to running out of memory.

Attached:

UtfNormalMemStress.php test script added in r85155 so the tests can be reproduced. The times above were measured with the existing UtfNormalBench.php.

As a workaround, in r85377 I've changed DjVuImage::retrieveMetaData() so it runs individual page texts through UtfNormal::cleanUp() rather than the entire dumped document.
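
The idea of the workaround, sketched here with invented names (the real method also assembles the per-page XML; this assumes MediaWiki's UtfNormal class is loaded):

function cleanPageTexts( array $pageTexts ) {
    $clean = array();
    foreach ( $pageTexts as $i => $text ) {
        // Normalize each page's text on its own, so peak memory tracks a
        // single page instead of the entire multi-megabyte dump.
        $clean[$i] = UtfNormal::cleanUp( $text );
    }
    return $clean;
}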

Verified that without the fix, I run out of memory uploading the sample file at 128M memory_limit, and with the fix I can upload it just fine.

Still should be fixed in UtfNormal; languages with heavy mixes of ASCII and non-ASCII use a LOT of memory due to being split into so many short strings, which makes the preg_match_all() much worse in terms of memory usage than just a copy of the string.

Very long page texts may also hit limits in these situations (the dump data for this DjVu file is about 3 megabytes of French text, which is not inconceivable for a really long wiki page), and it would be nice to fix that.

Could you update the bug summary to reflect the new (non-preg?) target? Lowering priority since it sounds like a big part of the problem has been fixed.

The current summary reflects the as-yet unsolved problem (which is why I've left it open).

I've broken out the UtfNormal general issue (really big string of mixed Latin text -> fails) to bug 28427, and updated the summary here to be specific to the original issue with DjVu files, as that's now worked around.

I suppose this error is related to this bug?

PHP fatal error in /usr/local/apache/common-local/php-1.17/includes/normal/UtfNormal.php line 285:
Allowed memory size of 125829120 bytes exhausted (tried to allocate 71 bytes)

http://fr.wikisource.org/w/index.php?title=Fichier:Port_-_Dictionnaire_historique,_g%C3%A9ographique_et_biographique_du_Maine-et-Loire,_tome_1.djvu&action=purge

This is a big file: 882 pages (85.71 MB).

(In reply to comment #11)

The current summary reflects the as-yet unsolved problem (which is why I've left it open).

Looks like you closed it, though. Reopening since you apparently didn't intend to do that.

That's the same bug. The fix should be merged to 1.17.

(In reply to comment #14)

(In reply to comment #11)

The current summary reflects the as-yet unsolved problem (which is why I've left it open).

Looks like you closed it, though. Reopening since you apparently didn't intend to do that.

No, I did intend that -- that's why I broke out the unresolved parts to a separate bug and changed the summary on this bug to the specific issue that was reported. Please leave closed unless there's a regression in the specific issue. :)