
rewrite quickIsNFCVerify() to use preg_match() with an offset to accommodate larger files
Open, Lowest, Public, Feature Request

Description

Broken out from T30146, which started with a narrower focus that was solved by a narrower fix.

Per notes & patches on that bug, the preg_match_all() in UtfNormal::quickIsNFCVerify uses a lot of memory for mixed ASCII/non-ASCII strings, such as those found in languages that use Latin script with accented or other non-ASCII letters.

This results in hitting memory limits on fairly large input strings, much sooner than it ought to.

Rewriting the function so that it works through the string in chunks as it splits should avoid that huge memory bump, but my initial tests were too slow using preg_match() and an offset, and still somewhat slow using preg_replace_callback().
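
As a rough illustration of the chunked approach described above, here is a minimal sketch (the function name, callback, and pattern are placeholders, not the actual UtfNormal code): step through the subject with preg_match() and an explicit byte offset, handling one segment per iteration instead of letting preg_match_all() build a $matches array covering the entire string.

```php
<?php
// Hypothetical sketch only -- not the real quickIsNFCVerify() implementation.
// Walks the string segment by segment with preg_match() + an offset, so peak
// memory is bounded by the longest single segment rather than the whole input.
function forEachSegment( $string, $pattern, $callback ) {
	$offset = 0;
	$length = strlen( $string );
	while ( $offset < $length ) {
		if ( !preg_match( $pattern, $string, $m, PREG_OFFSET_CAPTURE, $offset ) ) {
			break;
		}
		list( $segment, $pos ) = $m[0];
		if ( $segment === '' ) {
			break; // guard against zero-length matches
		}
		call_user_func( $callback, $segment ); // process one chunk at a time
		$offset = $pos + strlen( $segment );   // advance past this match
	}
}

// Example: treat runs of ASCII bytes and runs of high bytes separately,
// so only the non-ASCII runs would need further NFC verification.
forEachSegment(
	"caf\xc3\xa9 au lait",
	'/[\x00-\x7f]+|[\x80-\xff]+/',
	function ( $segment ) {
		// inspect or normalize $segment here
	}
);
```

The trade-off, as noted above, is speed: one preg_match() call per segment adds per-call overhead that preg_match_all() amortizes across the whole string, which is why the initial tests with this approach were slow.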

includes/normal/UtfNormalMemStress.php can be used to stress-test this.


Version: 1.18.x
Severity: enhancement

Details

Reference
bz28427

Event Timeline

bzimport raised the priority of this task to Lowest. Nov 21 2014, 11:33 PM
bzimport set Reference to bz28427.
bzimport added a subscriber: Unknown Object (MLST).

I suppose that this error is related to this bug?

PHP fatal error in
/usr/local/apache/common-local/php-1.17/includes/normal/UtfNormal.php line 285:
Allowed memory size of 125829120 bytes exhausted (tried to allocate 71 bytes)

http://fr.wikisource.org/w/index.php?title=Fichier:Port_-_Dictionnaire_historique,_g%C3%A9ographique_et_biographique_du_Maine-et-Loire,_tome_1.djvu&action=purge

This is a big file: 882 pages (85.71 MB).

That'll be another instance of bug 28146 with the DjVu text extraction; merging the fix for that to 1.17 and deploying it should resolve it.

Hi veteran contributors. Is this problem still valid? Is General/Unknown its best location?

Marking as Lowest, since nobody seems to be working or planning to work on this currently.

Aklapper changed the subtype of this task from "Task" to "Feature Request".Feb 4 2022, 11:02 AM

The code still exists with the same issue, but it's extremely unlikely to cause errors in production. I don't see any such errors in Logstash.