Page MenuHomePhabricator

SMW mb unsafe regexps can produce garbled UTF-8 strings
Closed, ResolvedPublic

Description

Author: info

Description:
In my install (MediaWiki: 1.11.0 (r26292), PHP: 5.2.1 (apache2handler) on Windows XAMPPlite) , just the annotation

[[Test à stuff::Main Page]]

(note the à -- a accent grave) on a page leads to:

a) No wiki text appearing in page display and preview (only the surrounding skin appears).

b) In Special:Browse it displays as a garbled string

Test �  stuff  Main Page

I spent 6 hours debugging this 8-).

SMW_Factbox.php and SMW_SpecialBrowse.php both do

preg_replace('/[\s]/', ...

This is not multibyte safe, so the string can get garbled.
In this annotation, I think the agrave accented character is C3A0 in UTF-8 , and the code point A0 in ISO8859 is NBSP.

The result *on certain PHP configurations* (?) is a garbled string where the agrave is partly replaced.

a) Special:Browse displays the garbled string.

b) Many of MediaWiki's own preg_* functions use the /u PCRE modifier, so that "Pattern strings are treated as UTF-8." But as someone commented in http://php.net/manual/en/reference.pcre.pattern.modifiers.php#54805 ,
Regarding the validity of a UTF-8 string when using the /u pattern modifier,
...

  1. When the subject string contains invalid UTF-8 sequences / codepoints, it basically result in a "quiet death" for the preg_* functions, where nothing is matched but without indication that the string is invalid UTF-8

In this example the Factbox code garbles this string in its generated HTML, and when MediaWiki's parser calls MagicWords which does some replacements with /u, BOOM, the first one (stripToc) returns nothing.

In these two cases I found a fix is for SMW to use the /u PCRE modifier in its own preg* functions.
a) SMW_SpecialBrowse.php around line 178 change

$html .=  '<strong>' . $skin->makeKnownLinkObj($att, preg_replace('/[\s]/', '&nbsp;', smwfT($att), 2)) . "</strong>&nbsp; \n";

to

$html .=  '<strong>' . $skin->makeKnownLinkObj($att, preg_replace('/[\s]/u', '&nbsp;', smwfT($att), 2)) . "</strong>&nbsp; \n";

b) SMW_Factbox.php around line 273 change
$text .= '<tr><td class="smwpropname">[[' . $property->getPrefixedText() . '|' . preg_replace('/[\s]/','&nbsp;',$property->getText(),2) . ']] </td><td class="smwprops">';

to

$text .= '<tr><td class="smwpropname">[[' . $property->getPrefixedText() . '|' . preg_replace('/[\s]/u','&nbsp;',$property->getText(),2) . ']] </td><td class="smwprops">';

What's confounding is this doesn't happen on ontoworld.org, or even on my ISP (see http://www.skierpage.com/tmp/mb_bug.php , works OK). Maybe it's only an issue on Windows or with XAMPPLITE.


Version: unspecified
Severity: major
URL: http://www.skierpage.com/tmp/mb_bug.txt

Details

Reference
bz13321

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 10:05 PM
bzimport set Reference to bz13321.

I did implement the suggested changes in SVN, but I did not find an explicit documentation that this should really fix the problem. The PHP-docu states that the u-modifier only affects UTF-8 compatibility of the pattern, not the matched/replaced string. So it might still be an insecure fix.

Also, I see a large number of preg_* functions in MediaWiki code. Are they all operating on ASCII strings only (or in a UTF-aware way?). Some comment in the PHP online docu suggests to use multi-byte extensions for ereg such as mb_ereg(), mb_eregi(), pb_ereg_replace(). Do we need that to really ensure UTF compatibility? Is this extension present in standard PHP distributions?

The /u option on PCRE patterns tells it to match based on UTF-8 characters rather than raw bytes. Most of the time this only makes a difference if your pattern uses specific lengths (that might split characters) or character classes (where byte-based matches won't match correctly).

The above example uses the whitespace character class (\s); you thus need to add the /u modifier there or you're going to get bad matches, as some UTF-8 component bytes might match an 8-bit whitespace (such as the Latin-1 'non-breaking space').

If your pattern very explicitly doesn't use anything that might trip problems, then you don't really need to add the /u. But if in doubt, add it in.

(The above comments do note that in some versions of PHP / PCRE, the /u mode will break if the input string is not valid UTF-8. Make sure your input is not invalid. :)

Thanks, I have now updated all our preg_s accordingly (there were quite some more associated to particular functions, especially input value parsing). This should then close that bug.