
Unicode Control Characters should be restricted in title text (RLM, LRM, RLO, LRO, . . .)
Closed, Resolved (Public)

Description

Author: tietew-mediazilla

Description:
Unicode Control Characters such as "RIGHT-TO-LEFT OVERRIDE" (U+202E)
or *all unprintable characters* should be restricted in title text.

A title that includes control characters breaks RC, history, contributions, etc.
A username that includes control characters breaks the page text after the signature!
Such titles are also hard to link to.

... on the Japanese Wikipedia, a username containing U+202E has been used for vandalism.
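
To illustrate the kind of check being requested, here is a minimal sketch in Python. It is not MediaWiki code; the function name `find_suspect_characters` and the choice to treat the Unicode general categories Cc (control) and Cf (format) as "unprintable" are assumptions made for this example only.

```python
import unicodedata

# Assumption for this sketch: "unprintable" means general category
# Cc (control) or Cf (format). Cf covers the bidi marks and overrides
# (LRM, RLM, LRO, RLO) as well as SOFT HYPHEN.
SUSPECT_CATEGORIES = {"Cc", "Cf"}


def find_suspect_characters(title):
    """Return (index, code point, name) for each control/format character."""
    hits = []
    for index, char in enumerate(title):
        if unicodedata.category(char) in SUSPECT_CATEGORIES:
            # Cc characters have no Unicode name, so fall back to a placeholder.
            name = unicodedata.name(char, "<unnamed>")
            hits.append((index, "U+%04X" % ord(char), name))
    return hits


# A title that starts with RIGHT-TO-LEFT OVERRIDE, as in the report above.
print(find_suspect_characters("\u202etest"))
```

A title or username could be rejected whenever this list is non-empty. Whether every Cf character should be rejected is a policy question that the sketch does not settle.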


Version: unspecified
Severity: normal

Details

Reference
bz3696

Event Timeline

bzimport raised the priority of this task to High. (Nov 21 2014, 8:52 PM)
bzimport set Reference to bz3696.
bzimport added a subscriber: Unknown Object (MLST).

gangleri wrote:

compare with

bug 1524: usernames should use unicode whitelist

tietew-mediazilla wrote:

*** Bug 3888 has been marked as a duplicate of this bug. ***

gangleri wrote:

screen dump

added new screen dump:

This becomes a real mess. It happened because
http://yi.wikipedia.org/wiki/%E2%80%AEtest was edited; see the history:
http://yi.wikipedia.org/w/index.php?title=%E2%80%AEtest&action=history

Attached:

bugzilla_03696_-_03888_mirrord_latin_characters_01.jpg (648×673 px, 137 KB)

gangleri wrote:

Hello!

Last week I found
http://uncyclopedia.org/wiki/User:%C2%AD%C2%AD%C2%AD%C2%AD
which contains several SOFT HYPHEN (U+00AD) characters.

http://yi.wiktionary.org/wiki/category:bugzilla/02042 contains more tests. Some
of them relate to Unicode whitespace characters (see bug 02042).

regards reinhardt [[user:gangleri]]

*** Bug 5736 has been marked as a duplicate of this bug. ***
*** Bug 5735 has been marked as a duplicate of this bug. ***

pablo wrote:

The final patch on bug 6100 includes fixes for that problem in recent changes.

*** Bug 7939 has been marked as a duplicate of this bug. ***
*** Bug 7414 has been marked as a duplicate of this bug. ***
*** Bug 8312 has been marked as a duplicate of this bug. ***

As of r18513, the LRM and RLM marks are stripped from titles on normalization.
This avoids creating broken links and broken titles when text is cut and pasted
from the list pages, where those marks sometimes creep in.

Running cleanup on live wikis for titles where this has crept in.
Such pages can be found with prefix search on 'Broken/'.
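
For readers who want to see what stripping these marks amounts to, here is a minimal sketch in Python. It is not the code from r18513 (MediaWiki is written in PHP, and that change is not reproduced here); the function name `strip_direction_marks` is hypothetical.

```python
# LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK, the two characters
# that the comment above says are stripped on normalization.
LRM = "\u200e"
RLM = "\u200f"

# Translation table that maps both marks to None, i.e. deletes them.
_STRIP_TABLE = {ord(LRM): None, ord(RLM): None}


def strip_direction_marks(title):
    """Return the title with every LRM and RLM character removed."""
    return title.translate(_STRIP_TABLE)


# A title pasted from a list page with stray marks on both ends.
pasted = LRM + "Main Page" + RLM
print(strip_direction_marks(pasted) == "Main Page")
```

This sketch removes only LRM and RLM. Other characters named in this report, such as RIGHT-TO-LEFT OVERRIDE (U+202E), pass through unchanged.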

gangleri wrote:

The comment at
https://bugzilla.wikimedia.org/show_bug.cgi?id=7414#c6
should have been posted here.


This report is part of a more general one:
bug 4185 feature request: provide a notification for irregular Unicode characters

gangleri wrote:

*** Bug 3819 has been marked as a duplicate of this bug. ***

I think r44000 fixed this back in 2008. Closing.