
invisible character in url leads to "no article" page
Closed, DeclinedPublic

Description

Author: fx12345

Description:
example of invisible character in wikipedia url

An example:
http://en.wikipedia.org/wiki/Horsesho%E2%80%8Be_orbit

If this link is pasted into Wikipedia, it fails and a "no article" page is shown. The sequence that is percent-encoded here as "E2 80 8B" is normally invisible. It appears to be the UTF-8 encoding of the Unicode character U+200B, "zero width space". Since it has no syntactic value, shouldn't such characters simply be removed from pasted URLs? I don't know how the spurious character got into the URL in the first place, but surely any invisible characters ought to be removed by the parser, right?
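
For anyone who wants to check what is hiding in a URL like the one above, here is a minimal Python sketch. It only uses the standard library; the helper name describe_invisible is made up for this illustration, and treating Unicode general category "Cf" (format characters) as "invisible" is an assumption that covers zero width space but is not a complete definition of invisibility.

```python
import unicodedata
from urllib.parse import unquote, urlsplit

url = "http://en.wikipedia.org/wiki/Horsesho%E2%80%8Be_orbit"


def describe_invisible(url):
    """List (index, code point, name) for each format character in the decoded path.

    Hypothetical helper for illustration only; "Cf" is used as a rough
    stand-in for "invisible".
    """
    # Percent-decode the path, interpreting the bytes as UTF-8.
    path = unquote(urlsplit(url).path)
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(path)
        if unicodedata.category(ch) == "Cf"
    ]


for index, code_point, name in describe_invisible(url):
    print(index, code_point, name)
```

Running this on the example URL reports a single ZERO WIDTH SPACE between the "o" and the final "e" of "Horseshoe".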


Version: unspecified
Severity: normal

Attached:

invisible-character-in-wikipedia-url.jpg (536×757 px, 44 KB)

Details

Reference
bz28460

Event Timeline

bzimport raised the priority of this task to Lowest. Nov 21 2014, 11:35 PM
bzimport set Reference to bz28460.
bzimport added a subscriber: Unknown Object (MLST).

We already make it impossible to create pages with invisible spaces: http://en.wikipedia.org/w/index.php?title=Horsesho%E2%80%8Be_orbit&action=edit

Is there any reason to do more than this?

(In reply to comment #1)
> We already make it impossible to create pages with invisible spaces:
> http://en.wikipedia.org/w/index.php?title=Horsesho%E2%80%8Be_orbit&action=edit
>
> Is there any reason to do more than this?

That is some customization on the Wikipedia side (abuse filter presumably). We allow creating such pages in general. For example: http://test.wikipedia.org/wiki/Horsesho%E2%80%8Be_orbit

"Invisible" characters are sometimes needed for typographical reasons. I'm not sure whether that is the case for zero width space, but it certainly is for other zero-width characters (ZWNJ, etc.), so we should be careful before excluding any such characters.

From the spec - "Zero width space indicates a word break or line break opportunity, except that it has no width. Zero-width space characters are intended to be used in languages that have no visible word spacing to represent word break or line break opportunities, such as Thai, Myanmar, Khmer, and Japanese."

Thus, such characters might be useful in a very long title, in a language that doesn't use spaces, to indicate where a line break may go. (Of course, I don't speak any such languages, so I could be wrong about that.)

Anyway, I don't think we should disallow it without being very sure it's unneeded. (OTOH, we do disallow left-to-right mark characters, which are similar, and we normalize things like narrow no-break space to normal spaces, so we already do things similar to what is proposed here.)
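
To make the distinction in the parenthetical above concrete, here is a rough Python sketch of that kind of policy. This is not MediaWiki's actual title normalization code: the function name normalize_title and the two character sets are assumptions chosen only to mirror the examples mentioned in this comment (left-to-right mark dropped, narrow no-break space turned into a normal space, zero width space and ZWNJ left alone).

```python
# Illustrative only; NOT the real MediaWiki implementation.
# Both sets are assumptions based on the examples named in the comment.
STRIPPED = {"\u200e"}    # LEFT-TO-RIGHT MARK: removed entirely
SPACE_LIKE = {"\u202f"}  # NARROW NO-BREAK SPACE: becomes a normal space


def normalize_title(title):
    """Drop stripped characters, map space-like ones to ' ', keep the rest."""
    out = []
    for ch in title:
        if ch in STRIPPED:
            continue
        out.append(" " if ch in SPACE_LIKE else ch)
    return "".join(out)


# Zero width space (U+200B) and ZWNJ (U+200C) pass through unchanged.
print(repr(normalize_title("Horsesho\u200be orbit")))
print(repr(normalize_title("Horseshoe\u202forbit\u200e")))
```

Under a policy like this, the title from the original report would keep its zero width space, which is the behaviour being defended here.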

This seems to have been intentionally allowed back in titles in r56918. The commit summary on that revision seems to be a pretty good reason to not strip such characters, so I'm going to go ahead and mark this wontfix.