
Illegal Unicode characters are allowed in pages
Open, Low, Public

Description

Take a good look at this diff from the Russian Wikipedia: http://ru.wikipedia.org/w/index.php?title=%D0%91%D0%B8%D1%82%D0%B2%D0%B0_%D0%B7%D0%B0_%D0%9A%D0%B0%D0%B2%D0%BA%D0%B0%D0%B7_(1942%E2%80%941943)&diff=9212389&oldid=9112075

That edit fixed four instances of the Unicode character U+FDD3.

I stumbled upon it when I ran a Perl script that analyzed a dump of the Russian Wikipedia. I ran several pattern matches on every page, and on this page the Perl regular expression engine issued this warning: "Unicode character is illegal" (see http://perldoc.perl.org/perldiag.html ). The code chart in which this character appears indeed says this: "These codes are intended for process-internal uses, but are not permitted for interchange." (Search for FDD3 here: http://www.unicode.org/charts/About.html )

My Unicode expertise ends here. I don't know exactly which characters are illegal. I can guess that characters with the Noncharacter_Code_Point property are illegal, and maybe there are more. I also don't know exactly what damage these characters cause if they are saved in the MediaWiki database, but I can guess that they may cause interoperability trouble with external tools: browsers, bots, search engines, future versions of the database engine, etc. They may also cause security breaches. So I suppose there is a warning sign here, and most probably it shouldn't be possible to save pages that include such characters.
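
For reference, the Noncharacter_Code_Point property covers 66 code points: U+FDD0 through U+FDEF, plus the last two code points of each of the 17 planes (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, and so on up to U+10FFFF). The following is only a rough sketch of how a page's text could be scanned for them. It is written in Python rather than Perl, the function names `is_noncharacter` and `find_noncharacters` are made up for illustration, and it is not how MediaWiki validates input.

```python
def is_noncharacter(cp):
    """Return True if code point cp has the Noncharacter_Code_Point property."""
    # Contiguous block of 32 noncharacters in the Arabic Presentation Forms-A area.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # Last two code points of every plane: U+nFFFE and U+nFFFF for n = 0..16.
    return cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


def find_noncharacters(text):
    """Return (index, 'U+XXXX') pairs for every noncharacter found in text."""
    return [
        (i, "U+%04X" % ord(ch))
        for i, ch in enumerate(text)
        if is_noncharacter(ord(ch))
    ]


if __name__ == "__main__":
    # Hypothetical sample text containing one U+FDD3, like the page in this report.
    sample = "abc\ufdd3def"
    print(find_noncharacters(sample))
```

A check along these lines could presumably either reject the save or strip the offending characters. Which of the two is preferable is a policy question that this sketch does not try to answer.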


Version: unspecified
Severity: normal
URL: http://ru.wikipedia.org/w/index.php?title=%D0%91%D0%B8%D1%82%D0%B2%D0%B0_%D0%B7%D0%B0_%D0%9A%D0%B0%D0%B2%D0%BA%D0%B0%D0%B7_(1942%E2%80%941943)&diff=9212389&oldid=9112075
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=42807

Details

Reference
bz14600