
Character reference link can generate unreachable non-NFC title
Closed, ResolvedPublic

Description

MediaWiki converts input to Unicode "normal form C" (NFC), but Sanitizer::decodeCharReferences() does not necessarily return NFC. A link like "[[&#x2126;]]" generates a Title object which points to the non-NFC character in question (U+2126 OHM SIGN), and will be a red link; but due to comprehensive NFC conversion on input, clicking the red link will take you to the edit page of its NFC equivalent, U+03A9.
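For illustration, a minimal Python sketch (showing the Unicode behaviour, not MediaWiki's PHP code) of why the decoded reference and the editable title end up as different code points:

```python
import unicodedata

# Decoding the character reference &#x2126; yields U+2126 OHM SIGN,
# which is not in Normalization Form C.
decoded = "\u2126"
assert not unicodedata.is_normalized("NFC", decoded)  # Python 3.8+

# MediaWiki normalizes all input to NFC, so the page you actually reach
# when editing is U+03A9 GREEK CAPITAL LETTER OMEGA, a different code point.
normalized = unicodedata.normalize("NFC", decoded)
print(f"U+{ord(normalized):04X}")  # U+03A9
```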

I suggest normalising the output of Sanitizer::decodeCharReferences(), assuming that can be done efficiently. Note that Title::newFromText() is quite hot, performance-wise, for some callers.

This was reported on the English Wikipedia's village pump by [[User:Caerwine]], who does not wish to create a Bugzilla account.


Version: 1.14.x
Severity: minor

Details

Reference
bz14952

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 21 2014, 10:10 PM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz14952.
bzimport added a subscriber: Unknown Object (MLST).

My impression is that sticking normalization on all decodes could be pretty slow; however, if we only need to normalize *when something gets expanded*, it could be made relatively efficient...

In theory we could optimize by only applying normalization on the individual bits that are expanded -- but we also need the preceding char(s) to deal with combining characters, which doesn't play nicely with the way it's currently implemented (preg_replace callbacks on individual char reference sequences).
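The combining-character problem can be seen in a small Python example (illustrative only, not the parser code): normalizing just the expanded piece is not enough, because the expansion may compose with the preceding, unexpanded character:

```python
import unicodedata

# Suppose the wikitext is "e&#x301;": the reference expands to
# U+0301 COMBINING ACUTE ACCENT.
expanded = "\u0301"

# Normalizing only the expanded piece changes nothing -- a lone
# combining mark is already in NFC.
assert unicodedata.normalize("NFC", expanded) == expanded

# But normalizing it together with the preceding character composes
# the pair into a single code point, U+00E9 (e with acute).
assert unicodedata.normalize("NFC", "e" + expanded) == "\u00e9"
```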

The ASCII fast paths in the normalizer mean that an unoptimized call would still be relatively cheap for English, but could be *enormously* slow for non-Latin languages, especially Korean. (Korean is extra pain because every hangul syllable has to be unpacked into jamo and repacked.)

Adding Unicode tracking bug 3969.

ayg wrote:

Doesn't the normalizer fall back to php_normal.so if available? If that's acceptably fast, then as a quick fix, decodeCharReferences() could do normalization if that's available and not otherwise. (Does Wikimedia use that? It's mentioned in includes/normal/README.)

conrad.irwin wrote:

*** Bug 19451 has been marked as a duplicate of this bug. ***

conrad.irwin wrote:

Ampersands in links are incredibly rare (838/11,370,705 on enwiktionary, 248/7,144,150 on kowiktionary; approximately 0.005%). This is a naive count which includes anything inside [[ ]], i.e. categories, images and interwikis, while excluding anything that a template might add.
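The naive count described above might look like the following Python sketch (hypothetical; the actual counting script is not part of this task):

```python
import re

# Match [[...]] link targets (the text before any '|'), and a naive
# character-reference pattern within them.
LINK = re.compile(r"\[\[([^\]|]*)")
CHAR_REF = re.compile(r"&[#A-Za-z0-9]+;")

def count_links(wikitext):
    """Return (total links, links containing a character reference)."""
    total = with_ref = 0
    for match in LINK.finditer(wikitext):
        total += 1
        if CHAR_REF.search(match.group(1)):
            with_ref += 1
    return total, with_ref

print(count_links("[[&#x2126;]] [[Foo]] [[Category:Bar]]"))  # (3, 1)
```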

I have thus implemented (in r64283) the "only normalize *when something gets expanded*" option from Brion above. It is possible that additional checks could be added, but it seems likely they would slow down the 99.99% of cases where no expansion is needed.
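A Python sketch of that approach (assumed shape only; the real code is PHP in Sanitizer.php): track whether any reference was actually expanded, and pay the normalization cost only in that rare case:

```python
import html
import re
import unicodedata

# Naive pattern for decimal, hex, and named character references.
CHAR_REF = re.compile(r"&(#\d+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")

def decode_char_references(text):
    expanded = False

    def replace(match):
        nonlocal expanded
        decoded = html.unescape(match.group(0))
        if decoded != match.group(0):
            expanded = True
        return decoded

    result = CHAR_REF.sub(replace, text)
    # Normalize only when something was expanded, keeping the
    # overwhelmingly common no-entity path fast.
    return unicodedata.normalize("NFC", result) if expanded else result

assert decode_char_references("&#x2126;") == "\u03a9"   # normalized to NFC
assert decode_char_references("plain text") == "plain text"
```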