
Invalid UTF-8 in percent-encoded links causes page rendering error
Closed, Resolved (Public)

Description

Author: EN.WP.ST47

Description:
When one links to a page such as:
Special:Allpages/%CE%8Α
where the last character is not a Latin A but the Greek capital letter alpha (U+0391), from the Greek Unicode block:

od -c
Α
0000000 316 221 \n
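
The od output above is in octal. As a quick cross-check, the following sketch (Python is used here purely for illustration; MediaWiki itself is PHP) suggests that the pasted character is U+0391, whose UTF-8 encoding is the two bytes 0xCE 0x91:

```python
import unicodedata

ch = "\u0391"  # the character pasted at the end of the link above

name = unicodedata.name(ch)              # official Unicode character name
utf8 = ch.encode("utf-8")                # raw UTF-8 bytes of the character
octal = [format(b, "o") for b in utf8]   # same octal notation that od prints

print(name, utf8.hex(), octal)
```

The octal values should match the "316 221" shown by od, which is why a correctly encoded link to this character would read %CE%91 rather than %CE%8Α.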

The page on which the link is displayed will, intermittently, appear blank. See http://en.wikipedia.org/wiki/User:ST47/UniLink and try purging the cache or making a null edit, if you can see the page correctly. It appears correctly in previews and diffs, and seems to happen more often on longer pages.

This allows anyone to insert a special link and cause the page to appear blank. It seems to me that the best solution would be to sanitize such links, but I can't tell where this problem is occurring, or if it is a symptom of a bigger problem.

We had an issue with this on enwiki, at [[RMS Titanic]], and there was a discussion at [[VP/T]].


Version: 1.11.x
Severity: minor
URL: http://en.wikipedia.org/wiki/User:ST47/UniLink

Details

Reference
bz11143

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 9:54 PM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz11143.

The basic problem is that the PCRE library in PHP 5.2.x is a lot more strict about input in UTF-8 mode. It now rejects an input string which isn't 100% valid, which has a nasty habit of breaking everything.

Further complicating things is that we need in certain circumstances to allow (urlencoded) non-valid-UTF-8 titles for legacy interwiki links, so blithely validating all urldecoded titles could break those. I'm not 100% sure what's the best way to handle the combination.

mrry.dmlo wrote:

So sugar coat it with wonder why and not who he makes microsoft alot of $ i was told to buy another ph. i have went thru 5 guess nobody but me can speak it outloud no downloads with bugs right thanx anyway it was cool to see what i.ve been going thru in the proper language when entering my credit card info to purchase online even on a pc it erases or no i enter so yeah this goes on everyway all the time when on pc internet explorer goes nuts i am tired and there.s so much more that this tomcat sicko does i wish he wud get over it i knew in upoc told him off nice guy thou only bad thing he said was go play in traffic 200 items i didnt write all down to interestd in how intelegent u all are and want to read it again it makes me feel normal its a hard and difficult situation i shud be in a mental institution but nope i know everything

(In reply to comment #2)

Wouldn't "Α" result in http://en.wikipedia.org/wiki/Special:Allpages/%CE%91 ?
i.e. [[Special:Allpages/Α]]

Yes it would. However this is about this case:

Special:Allpages/%CE%8Α

which may as well be:

Special:Allpages/%CE%8xxx

where "xxx" is anything that does not start with a hex digit, i.e. does not match /[0-9a-f]/i.

Our link normalization sees the "%"s in the link and decodes each /%[0-9a-f]{2}/i sequence, producing something like this:

Special:Allpages/y%8xxx

where "y" is the single byte 0xCE, which, by itself, does *not* make up a valid UTF-8 sequence.
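
The transformation described above can be reproduced outside MediaWiki. The following is a minimal Python sketch, not the actual MediaWiki code (which is PHP); `decode_percent_escapes` and `is_valid_utf8` are hypothetical names invented for this illustration. It decodes only well-formed two-digit escapes and leaves everything else alone, which is enough to show how the stray 0xCE byte ends up in the string:

```python
import re

# Hypothetical stand-in for the link normalization described above:
# decode "%" followed by exactly two ASCII hex digits, leave the rest as-is.
PERCENT_ESCAPE = re.compile(rb"%([0-9A-Fa-f]{2})")

def decode_percent_escapes(link: str) -> bytes:
    """Return the link as bytes with well-formed %XX escapes decoded."""
    raw = link.encode("utf-8")
    return PERCENT_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)

def is_valid_utf8(data: bytes) -> bool:
    """True if the byte string decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# "%CE" is decoded, but "%8" + Greek alpha is not a valid escape and stays.
bad = decode_percent_escapes("Special:Allpages/%CE%8\u0391")
# A correctly encoded alpha decodes to the complete two-byte sequence.
good = decode_percent_escapes("Special:Allpages/%CE%91")

print(bad, is_valid_utf8(bad))
print(good, is_valid_utf8(good))
```

In the bad case the lone 0xCE lead byte is followed by "%" rather than a continuation byte, so the string as a whole is no longer valid UTF-8.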

The result is we have invalid UTF-8 in our internal parser strings, and eventually it goes through the newer, stricter PCRE which barfs and silently destroys the entire string instead of processing it the way we'd expect.

The proper way to deal with this is probably for the title normalization to detect the bad UTF-8 and reject it, so we don't create a bogus link in the first place.
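
A rough sketch of that idea follows, again in Python for illustration only. `normalize_title` is a hypothetical helper, not a MediaWiki function, and it assumes that returning nothing is an acceptable way to signal "do not create a link":

```python
from typing import Optional
from urllib.parse import unquote_to_bytes

def normalize_title(link: str) -> Optional[str]:
    """Hypothetical helper: percent-decode a link target and return it,
    or return None if the decoded bytes are not valid UTF-8."""
    decoded = unquote_to_bytes(link)  # malformed escapes are left untouched
    try:
        return decoded.decode("utf-8")  # strict decoding by default
    except UnicodeDecodeError:
        return None  # caller would treat the text as a non-link

print(normalize_title("Special:Allpages/%CE%8\u0391"))
print(normalize_title("Special:Allpages/%CE%91"))
```

As noted earlier in this thread, legacy interwiki links may legitimately carry non-UTF-8 titles, so a check like this could probably not be applied blindly to every urldecoded title.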

Downgrading severity/priority: no data loss, no features broken, not relevant to security.

*** Bug 20346 has been marked as a duplicate of this bug. ***

r55382 adds a Unicode-enabled regex check for whitespace (for bug 15248) which has the happy side effect of eliminating this bug.

I've added a comment to this effect in r55514.