
Illegal/unusual UTF-8 characters can make 2 pages appear to have same name
Closed, Declined
Public

Description

Author: sebastien.thebault2

Description:
There is a worrying bug: two pages can have the same name, one with some text, the other without text; see this link on French
Wikipedia: http://fr.wikipedia.org/wiki/Wikipédia:Le_Bistro#Disparition sans trace enregistrée à nouveau (semble-t-il).


Version: unspecified
Severity: normal
OS: Mac OS X 10.3
Platform: Macintosh
URL: http://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Le_Bistro#Disparition_sans_trace_enregistr.C3.A9e_.C3.A0_nouveau_.28semble-t-il.29

Details

Reference
bz821

Revisions and Commits

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 7:00 PM
bzimport set Reference to bz821.
bzimport added a subscriber: Unknown Object (MLST).

rowan.collins wrote:

My French is not brilliant, so I'm not quite sure what the behaviour is that is
being observed. But as far as I can make out the "two articles with the same
name" are "Trois Royaumes de Cor�ée" and "Trois Royaumes de Corée", where the
first contains an extra character between the r and the e-acute. In
URL-encoding, it is %EF%BF%BD [note that at some point it has become corrupted
in the example into a "?", I dug the original out of the page history].
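The difference between the two titles can be made visible by comparing them code point by code point. The following is a minimal sketch, assuming the titles are exactly as described above (the first with U+FFFD between the "r" and the "é"); the helper name `first_difference` is hypothetical.

```python
import unicodedata

# The two titles as described in this comment, written with escapes so the
# extra character cannot be lost in transit. "\ufffd" is the extra character.
title_a = "Trois Royaumes de Cor\ufffd\u00e9e"
title_b = "Trois Royaumes de Cor\u00e9e"


def first_difference(a: str, b: str):
    """Return the index of the first position where a and b differ, or None if equal."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i
    # No mismatch in the common prefix: either equal, or one is longer.
    return None if len(a) == len(b) else min(len(a), len(b))


i = first_difference(title_a, title_b)
print("identical:", title_a == title_b)
if i is not None:
    ch = title_a[i]
    print(f"first difference at index {i}: U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```

Two strings that render identically in a given browser can still be distinct page names, which is what this comparison checks.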

So obviously these *aren't* pages with identical names, but before we can
consider the case closed, somebody needs to work out:

  • what is the character %EF%BF%BD supposed to be?
  • how did it get there? (what did the user *think* they were inputting?)
  • why did the user affected think the two links were identical? (they seem to be
    saying the extra character was invisible in their browser - why should that be?
    perhaps it's some kind of combining, zero-width character?)

The UTF-8 sequence EF BF BD represents U+FFFD REPLACEMENT CHARACTER. Various software uses it as a placeholder/replacement for
illegal or corrupt characters; it is typically displayed as a question mark, or as a black diamond with a white question mark in it, but
it is sometimes blank, depending on the font, the software, etc.
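The mapping between the URL-encoded form and the code point can be checked with a few lines of Python. This is only an illustrative sketch using the standard library; the variable name `decoded` is arbitrary.

```python
import unicodedata
from urllib.parse import unquote

# unquote() interprets percent-escapes as UTF-8 bytes by default.
decoded = unquote("%EF%BF%BD")

# Show the code point, its Unicode name, and the UTF-8 bytes it encodes to.
print(f"U+{ord(decoded):04X}", unicodedata.name(decoded))
print(decoded.encode("utf-8").hex(" ").upper())
```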

One way it might show up in a wiki page is by editing with a browser that doesn't do UTF-8 correctly.
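As an illustration of that mechanism (not a reconstruction of what actually happened on this page), the sketch below assumes a client sent "Corée" encoded as ISO-8859-1 and the receiving side decoded it as UTF-8 while substituting for malformed input.

```python
# "Corée" as ISO-8859-1 (Latin-1) bytes: é is the single byte 0xE9.
latin1_bytes = "Cor\u00e9e".encode("latin-1")

# A strict UTF-8 decoder rejects these bytes, because 0xE9 opens a
# three-byte sequence that the following "e" does not continue.
try:
    latin1_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decoder substitutes U+FFFD for the malformed byte instead.
lenient = latin1_bytes.decode("utf-8", errors="replace")

print("strict decode succeeded:", strict_ok)
print([f"U+{ord(c):04X}" for c in lenient])
```

In this scenario the é is lost and replaced; the title in this report keeps its é alongside the U+FFFD, so the real cause may have been different.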

However the link I see listed presently has a _literal_ question mark (3F). It's possible it's been replaced by some browser during subsequent
editing of the page.

rowan.collins wrote:

[The following is additional info from the reporter, received by e-mail; thanks
for responding - for future reference, you need to add additional comments using
the web interface rather than replying to e-mails]

Hello: you see right, it is a bug with Safari and Opera. With them,
I can't see the extra character in the page title, in the URL,
or in the page text.

  • I don't know what this character is.
  • I think I typed the text into the URL bar (Trois Royaumes de Corée), or it is a
    bug in Safari (a cached URL that was stored incorrectly).

Two main browsers (Opera and Safari, on Mac OS X) display the two
titles as identical to me, but Explorer displays them correctly. It was the
same with the two URLs and the two ''articles'': one empty, one not.

And sorry, I didn't understand the last question.

Archeos

rowan.collins wrote:

[updating summary: the 2 pages *don't* have the same name, and we now know why
they seemed to]

Meanwhile, can anyone think of a way of verifying that this bug is still
present, or does anyone know of any code changes that should have fixed it?

sebastien.thebault2 wrote:

I haven't seen this bug since November.

sebastien.thebault2 wrote:

I have not seen this bug any more since November.

Seems to be fine since Nov 2004... if it shows up again, this can be reopened.

gangleri wrote:

Links containing the Unicode REPLACEMENT CHARACTER (U+FFFD)
http://www.fileformat.info/info/unicode/char/fffd/index.htm

that is, links including "%EF%BF%BD", generate "Bad title" errors:
http://yi.wiktionary.org/w/index.php?title=project:bugzilla/00821/%EF%BF%BD&action=edit

This should be acceptable for everyone.
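The kind of check that rejects such links could look roughly like the sketch below. It is not MediaWiki's actual title validation; the function name `is_bad_title` and the rule it applies are assumptions made purely for illustration.

```python
from urllib.parse import unquote

REPLACEMENT_CHARACTER = "\ufffd"


def is_bad_title(raw_title: str) -> bool:
    """Hypothetical check: True if the percent-decoded title contains U+FFFD."""
    # unquote() decodes as UTF-8 with errors="replace" by default, so a
    # malformed escape such as %E9 also turns into U+FFFD and is rejected.
    return REPLACEMENT_CHARACTER in unquote(raw_title)


for candidate in (
    "project:bugzilla/00821/%EF%BF%BD",
    "Trois_Royaumes_de_Cor%C3%A9e",
):
    print(candidate, "->", "bad title" if is_bad_title(candidate) else "ok")
```

Rejecting U+FFFD at this stage means a corrupted title cannot create a second page that looks the same as an existing one.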

regards reinhardt [[user:gangleri]]

P.S. Is this a solved issue for
bug 3985: character conversion (tracking) ?

epriestley changed the task status from Declined to Resolved by committing Unknown Object (Diffusion Commit). Mar 4 2015, 8:20 AM
epriestley added a commit: Unknown Object (Diffusion Commit).
Aklapper changed the task status from Resolved to Declined. Mar 4 2015, 11:38 AM
Aklapper claimed this task.