
Character reference link can generate unreachable non-NFC title
Closed, ResolvedPublic

Description

MediaWiki converts input to Unicode "normal form C" (NFC), but Sanitizer::decodeCharReferences() does not necessarily return NFC. A link like "[[&#x2126;]]" generates a Title object which points to the non-NFC character in question (U+2126 OHM SIGN), and will be a red link; but due to comprehensive NFC conversion on input, clicking the red link will take you to the edit page of its NFC equivalent, U+03A9.
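For illustration, a minimal Python sketch (showing the Unicode behaviour, not MediaWiki's PHP code) of why the decoded reference and the editable title end up as different code points:

```python
import unicodedata

# Decoding the character reference &#x2126; yields U+2126 OHM SIGN,
# which is not in Normalization Form C.
decoded = "\u2126"
assert not unicodedata.is_normalized("NFC", decoded)  # Python 3.8+

# MediaWiki normalizes all input to NFC, so the page you actually reach
# when editing is U+03A9 GREEK CAPITAL LETTER OMEGA, a different code point.
normalized = unicodedata.normalize("NFC", decoded)
print(f"U+{ord(normalized):04X}")  # U+03A9
```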

I suggest normalising the output of Sanitizer::decodeCharReferences(), assuming that can be done efficiently. Note that Title::newFromText() is quite hot, performance-wise, for some callers.

This was reported on the English Wikipedia's village pump by [[User:Caerwine]], who does not wish to create a Bugzilla account.


Version: 1.14.x
Severity: minor

Details

Reference
bz14952

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 21 2014, 10:10 PM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz14952.
bzimport added a subscriber: Unknown Object (MLST).

My impression is that sticking normalization on all decodes could be pretty slow; however, if we only need to normalize *when something gets expanded*, it could be made relatively efficient...

In theory we could optimize by only applying normalization on the individual bits that are expanded -- but we also need the preceding char(s) to deal with combining characters, which doesn't play nicely with the way it's currently implemented (preg_replace callbacks on individual char reference sequences).
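The combining-character problem can be seen in a small Python example (illustrative only, not the parser code): normalizing just the expanded piece is not enough, because the expansion may compose with the preceding, unexpanded character:

```python
import unicodedata

# Suppose the wikitext is "e&#x301;": the reference expands to
# U+0301 COMBINING ACUTE ACCENT.
expanded = "\u0301"

# Normalizing only the expanded piece changes nothing -- a lone
# combining mark is already in NFC.
assert unicodedata.normalize("NFC", expanded) == expanded

# But normalizing it together with the preceding character composes
# the pair into a single code point, U+00E9 (e with acute).
assert unicodedata.normalize("NFC", "e" + expanded) == "\u00e9"
```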

The ASCII fast paths in the normalizer mean that an unoptimized call would still be relatively cheap for English, but could be *enormously* slow for non-Latin languages, especially Korean. (Korean is extra pain because every hangul syllable has to be unpacked into jamo and repacked.)

Adding Unicode tracking bug 3969.

ayg wrote:

Doesn't the normalizer fall back to php_normal.so if available? If that's acceptably fast, then as a quick fix, decodeCharReferences() could do normalization if that's available and not otherwise. (Does Wikimedia use that? It's mentioned in includes/normal/README.)

conrad.irwin wrote:

*** Bug 19451 has been marked as a duplicate of this bug. ***

conrad.irwin wrote:

Ampersands in links are incredibly rare (838/11,370,705 on enwiktionary, 248/7,144,150 on kowiktionary; approximately 0.005%). This is a naive count which includes anything inside [[ ]], i.e. categories, images and interwikis, while excluding anything that a template might add.
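The naive count described above might look like the following Python sketch (hypothetical; the actual counting script is not part of this task):

```python
import re

# Match [[...]] link targets (the text before any '|'), and a naive
# character-reference pattern within them.
LINK = re.compile(r"\[\[([^\]|]*)")
CHAR_REF = re.compile(r"&[#A-Za-z0-9]+;")

def count_links(wikitext):
    """Return (total links, links containing a character reference)."""
    total = with_ref = 0
    for match in LINK.finditer(wikitext):
        total += 1
        if CHAR_REF.search(match.group(1)):
            with_ref += 1
    return total, with_ref

print(count_links("[[&#x2126;]] [[Foo]] [[Category:Bar]]"))  # (3, 1)
```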

I have thus implemented (in r64283) the "only normalize *when something gets expanded*" option from Brion above. It is possible that additional checks could be added, but it seems likely they would slow down the 99.99% of cases where no expansion is needed.
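A Python sketch of that approach (assumed shape only; the real code is PHP in Sanitizer.php): track whether any reference was actually expanded, and pay the normalization cost only in that rare case:

```python
import html
import re
import unicodedata

# Naive pattern for decimal, hex, and named character references.
CHAR_REF = re.compile(r"&(#\d+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")

def decode_char_references(text):
    expanded = False

    def replace(match):
        nonlocal expanded
        decoded = html.unescape(match.group(0))
        if decoded != match.group(0):
            expanded = True
        return decoded

    result = CHAR_REF.sub(replace, text)
    # Normalize only when something was expanded, keeping the
    # overwhelmingly common no-entity path fast.
    return unicodedata.normalize("NFC", result) if expanded else result

assert decode_char_references("&#x2126;") == "\u03a9"   # normalized to NFC
assert decode_char_references("plain text") == "plain text"
```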