Author: simon.lipp
Description:
The bug was encountered on fr.wikisource:
MediaWiki 1.16alpha-wmf (r58524)
PHP 5.2.4-2ubuntu5.7wm1 (apache2handler)
MySQL 4.0.40-wikimedia-log
When the text layer of the DjVu file contains « ") », the MediaWiki parser produces an empty page, and the text layer is then shifted by one page relative to the image. An example of a problematic DjVu file can be found here:
In particular, page 80 contains the following text (from a poor-quality scan): « La quatrième année (.\"),*)()) ». The problem is visible in the proofread version of this scan:
http://fr.wikisource.org/w/index.php?title=Page:Sima_qian_chavannes_memoires_historiques_v4.djvu/80&action=edit : the end of the text is missing
http://fr.wikisource.org/w/index.php?title=Page:Sima_qian_chavannes_memoires_historiques_v4.djvu/81&action=edit : no text layer
http://fr.wikisource.org/w/index.php?title=Page:Sima_qian_chavannes_memoires_historiques_v4.djvu/82&action=edit : the text layer and the image no longer match
I was able to track down and fix the bug in my local MediaWiki installation (same branch and same revision as fr.wikisource). The problem is located in DjvuImage::retrieveMetadata (includes/DjvuImage.php:257): the regular expression treats any ") as the end-of-page marker, but a backslash before the double quote means the quote is escaped, so it should not be interpreted that way.
I replaced the current regular expression with the following one, and the problem is now fixed:
$reg = "/\(page\s[\d-]*\s[\d-]*\s[\d-]*\s[\d-]*\s*\"((?>\\\\.|(?:(?!\\\\|\").)++)*?)\"\s*\)/s";
$txt = preg_replace( $reg, "<PAGE value=\"$1\" />", $txt );
Note on the regular expression: it is an adaptation of the usual expression for matching text between double quotes with backslash as the escape character, which in Perl would be:
"((?>\\.|[^"\\]++)*?)". The rather ugly (but working) (?:(?!\\\\|\").) corresponds to the trivial [^"\\], but the problem is that [^\"] and [^"] are not really the same thing…
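To illustrate the difference, here is a minimal standalone sketch that can be run outside MediaWiki. The sample text and its page coordinates are made up for the illustration, and $naive is only a hypothetical stand-in for a pattern that stops at the first ") it sees; it is not the actual expression from DjvuImage.php. $reg is the expression proposed above, copied as is.

```php
<?php
// Hypothetical two-page sample in the "(page x y w h "text")" form.
// The first page contains the escaped quote from the bug report: (.\"),*)())
$txt = '(page 0 0 100 200 "La quatrième année (.\\"),*)())")' . "\n"
     . '(page 0 0 100 200 "next page")';

// Stand-in for the buggy behaviour (assumption, not the real MediaWiki code):
// the lazy .*? stops at the first double quote followed by ")".
$naive = "/\(page\s[\d-]*\s[\d-]*\s[\d-]*\s[\d-]*\s*\"(.*?)\"\s*\)/s";

// Proposed expression: either consume a backslash plus the next character,
// or a run of characters that are neither a backslash nor a double quote.
$reg = "/\(page\s[\d-]*\s[\d-]*\s[\d-]*\s[\d-]*\s*\"((?>\\\\.|(?:(?!\\\\|\").)++)*?)\"\s*\)/s";

preg_match_all( $naive, $txt, $naiveMatches );
preg_match_all( $reg, $txt, $fixedMatches );

// Naive pattern: the first page is cut off right after the backslash.
var_dump( $naiveMatches[1][0] );

// Proposed pattern: both pages are captured in full.
var_dump( $fixedMatches[1] );
```

With the stand-in pattern the first capture ends at « La quatrième année (.\ », whereas the proposed expression returns the whole page text, including the escaped quote, followed by « next page ».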
Version: 1.16.x
Severity: normal