
Unicode Byte Order Mark appearing in wikitext causes page to be cut off
Closed, Resolved (Public)

Description

Author: alphasigmax

Description:
As noted on
(http://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29),
occurrences of the Unicode Byte Order Mark (0xFEFF) cause pages to be cut off.
This has occurred on [[User_talk:Alphax]]
(http://en.wikipedia.org/w/index.php?title=User_talk%3AAlphax&diff=0&oldid=12566960),
[[User_talk:Larry_Sanger]]
(http://en.wikipedia.org/w/index.php?title=User_talk%3ALarry_Sanger&diff=12607963&oldid=12594206)
and [[User_talk:Duncharris]]
(http://en.wikipedia.org/w/index.php?title=User_talk%3ADuncharris&diff=12588770&oldid=12567290).

What is the cause of these, how can they be found, and when will it be fixed?


Version: unspecified
Severity: critical
OS: Windows 2000
Platform: PC

Details

Reference
bz1938

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 8:21 PM)
bzimport set Reference to bz1938.
bzimport added a subscriber: Unknown Object (MLST).

Seems to be a problem with tidy. Investigating...

The problem seems to be triggered by illegal entity references such as &#0xfeff;

This is not valid HTML/XML; the allowed numeric character references are decimal &#[0-9]+; and hexadecimal
&#[Xx][0-9A-Fa-f]+;. Putting a 0 _before_ the x is nicely invalid. Tidy looks at this and assumes you had meant
to type &#0;xfeff;... and turns the &#0; reference into a _literal_ null character in the output.
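To make the two allowed patterns concrete, here is a minimal Python sketch that escapes the ampersand of any "&#" sequence that does not start a valid numeric character reference. The function name is hypothetical and this is not the MediaWiki code, only an illustration of the rule quoted above.

```python
import re

# "&#" NOT followed by a valid decimal (&#[0-9]+;) or hexadecimal
# (&#[Xx][0-9A-Fa-f]+;) reference body. The lookahead checks the rest of
# the reference without consuming it.
INVALID_NUMERIC_REF = re.compile(r"&#(?!(?:[0-9]+|[Xx][0-9A-Fa-f]+);)")


def escape_invalid_numeric_refs(text: str) -> str:
    """Turn the '&' of an invalid numeric reference into '&amp;'.

    Hypothetical helper for illustration; valid references pass through.
    """
    return INVALID_NUMERIC_REF.sub("&amp;#", text)


if __name__ == "__main__":
    for sample in ["&#0xfeff;", "&#xfeff;", "&#65279;"]:
        print(sample, "->", escape_invalid_numeric_refs(sample))
```

With this rule, &#0xfeff; is neutralised before a tool like tidy can reinterpret it, while &#xfeff; and &#65279; are left alone.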

A null character is actually OK in a PHP string, but the internal library interface to tidy seems to be treating tidy's
output as a null-terminated string when copying it back to PHP, so the output ends at that point.
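The truncation effect can be reproduced outside PHP. The sketch below uses Python's ctypes to contrast a copy that stops at the first null byte with a copy that uses an explicit length. It only illustrates the general failure mode; it is an assumption that the tidy binding behaves analogously, and the sample markup is made up.

```python
import ctypes


def copy_as_c_string(data: bytes) -> bytes:
    """Copy the way a null-terminated (C string) interface would."""
    buf = ctypes.create_string_buffer(data)  # len(data) + 1 bytes, NUL appended
    return buf.value  # reads only up to the first NUL byte


def copy_with_length(data: bytes) -> bytes:
    """Copy using the known length, so embedded NUL bytes survive."""
    buf = ctypes.create_string_buffer(data)
    return buf.raw[:len(data)]  # raw ignores NUL bytes; drop the appended one


# Made-up example of output containing a literal null character.
rendered = b"<p>before\x00xfeff; after</p>"

if __name__ == "__main__":
    print(copy_as_c_string(rendered))
    print(copy_with_length(rendered))
```

Everything after the null byte is lost in the first copy, which matches the symptom of pages being cut off.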

Sigh... Ideally, I could tell tidy not to do this kind of 'correction', which causes more trouble than it saves.

Our preexisting escaping would have fixed this in body text, but it was applied too early and so was not correcting the
link text. I've moved the escaping down to after link replacement and it's working now.
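The ordering issue can be shown with a toy pipeline in Python. Everything here is hypothetical (the placeholder format, the function names, the sample link); it is a sketch of why escaping before link replacement misses the link text, not the actual parser code.

```python
import re

# Same rule as for body text: "&#" that does not start a valid decimal or
# hexadecimal numeric character reference.
INVALID_NUMERIC_REF = re.compile(r"&#(?!(?:[0-9]+|[Xx][0-9A-Fa-f]+);)")


def escape_refs(text: str) -> str:
    """Escape the '&' of invalid numeric references."""
    return INVALID_NUMERIC_REF.sub("&amp;#", text)


def replace_links(text: str, links: dict) -> str:
    """Substitute each (hypothetical) placeholder with its rendered link."""
    for placeholder, html in links.items():
        text = text.replace(placeholder, html)
    return text


def render_escape_first(text: str, links: dict) -> str:
    # Old order: link text is inserted after escaping, so it is never checked.
    return replace_links(escape_refs(text), links)


def render_escape_last(text: str, links: dict) -> str:
    # Fixed order: escaping sees the final text, including link text.
    return escape_refs(replace_links(text, links))


if __name__ == "__main__":
    links = {"<!--LINK 0-->": '<a href="/wiki/X">&#0xfeff;X</a>'}
    body = "See <!--LINK 0-->."
    print(render_escape_first(body, links))
    print(render_escape_last(body, links))
```

In the first ordering the invalid reference inside the link survives untouched; in the second it is escaped along with the rest of the output.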

Fixed in CVS HEAD and REL1_4 and live on site. Will be included in 1.4.3 release.

Use action=purge if necessary on affected pages with cached broken rendering.
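For convenience, a purge URL for an affected page can be built as in this small Python sketch. The base URL follows the index.php links quoted in the report; the helper name is hypothetical.

```python
from urllib.parse import urlencode

# Base URL taken from the index.php links in the original report.
BASE_URL = "http://en.wikipedia.org/w/index.php"


def purge_url(title: str, base_url: str = BASE_URL) -> str:
    """Build a URL that asks the wiki to re-render the given page."""
    return base_url + "?" + urlencode({"title": title, "action": "purge"})


if __name__ == "__main__":
    for page in ["User_talk:Alphax", "User_talk:Larry_Sanger", "User_talk:Duncharris"]:
        print(purge_url(page))
```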