strip phantom general punctuation characters from page titles
Closed, Resolved, Public

Description

Author: gangleri

Description:
Sorry for this!

Hello!

a) I tested character normalisation, which seems to be part of title normalisation.
For precomposed versus non-precomposed characters this works fine:
[[User:Gangleri/tests/אָ]]
http://yi.wikipedia.org/wiki/User:Gangleri/tests/%EF%AC%AF
http://yi.wikipedia.org/wiki/User:Gangleri/tests/%D7%90%D6%B8
point to the same page despite the different encoding.
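
The equivalence in (a) can be reproduced outside the wiki. The following is only a sketch: it assumes the title normalisation behaves like Unicode NFC, which this report does not confirm, and `normalise_title` is a hypothetical stand-in, not a MediaWiki function. It decodes the two URL fragments above and compares them before and after normalisation.

```python
import unicodedata
from urllib.parse import unquote

# Percent-encoded title fragments taken from the two URLs above.
precomposed = unquote("%EF%AC%AF")    # U+FB2F HEBREW LETTER ALEF WITH QAMATS
decomposed = unquote("%D7%90%D6%B8")  # U+05D0 ALEF + U+05B8 POINT QAMATS


def normalise_title(title: str) -> str:
    """Assumed stand-in for the wiki's title normalisation: plain NFC."""
    return unicodedata.normalize("NFC", title)


# The raw strings differ, the normalised ones do not.
same_before = precomposed == decomposed
same_after = normalise_title(precomposed) == normalise_title(decomposed)
print(same_before, same_after)
```

The point of the comparison is that the two spellings are canonically equivalent, so any canonical normalisation form maps them to one title.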

b) The bug's URL lists four different pages whose titles look identical.
"Phantom" trailing general punctuation characters generate the different
URLs. Compare:
http://www.fileformat.info/info/unicode/char/202b/index.htm
Unicode Character 'RIGHT-TO-LEFT EMBEDDING' (U+202B)

UTF-8 (hex) 0xE2 0x80 0xAB (e280ab)

http://homepage1.nifty.com/nomenclator/unicode/data/punct.htm

The generated URLs are:
http://yi.wikipedia.org/wiki/User:Gangleri/tests/%E2%80%AB%D7%B0%D7%99%D7%A5
http://yi.wikipedia.org/wiki/User:Gangleri/tests/%E2%80%AB%D7%B0%D7%99%D7%A5%E2%80%AB
http://yi.wikipedia.org/wiki/User:Gangleri/tests/%E2%80%AB%D7%B0%D7%99%D7%A5%E2%80%AB%E2%80%AB
http://yi.wikipedia.org/wiki/User:Gangleri/tests/%E2%80%AB%D7%B0%D7%99%D7%A5%E2%80%AB%E2%80%AB%E2%80%AB
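
To make the difference between these four URLs visible, here is a small sketch that decodes the title parts and strips U+202B from both ends. The variable names are illustrative only; nothing here is taken from the wiki software.

```python
from urllib.parse import unquote

RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING

# Title parts of the four URLs above (everything after "tests/").
encoded = [
    "%E2%80%AB%D7%B0%D7%99%D7%A5",
    "%E2%80%AB%D7%B0%D7%99%D7%A5%E2%80%AB",
    "%E2%80%AB%D7%B0%D7%99%D7%A5%E2%80%AB%E2%80%AB",
    "%E2%80%AB%D7%B0%D7%99%D7%A5%E2%80%AB%E2%80%AB%E2%80%AB",
]
titles = [unquote(e) for e in encoded]

# Four distinct strings ...
distinct = len(set(titles))
# ... that collapse to a single title once leading/trailing U+202B is removed.
stripped = {t.strip(RLE) for t in titles}
print(distinct, [len(t) for t in titles], len(stripped))
```

Each title carries one leading U+202B and zero to three trailing ones, which is why they render identically but are stored as separate pages.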

There are several aspects to this:
a) possible vandalism - suggestion: Please evaluate whether "phantom" (that is,
unnecessary) leading or trailing punctuation should be stripped from database
titles ++ this looks like a normalisation
b) garbage in - garbage out

Regards Reinhardt [[user:gangleri]]

P.S. I ran into this because of textual ambiguities at the Yiddish Wikipedia
relating to the usage of "tsvey vovn" versus "vov + vov", "tsvey-yudn" versus
"yud + yud" etc.

example 1: There is an article [[yi:וויץ]] but not [[yi:װיץ]] .

example 2: http://www.yiddishdictionaryonline.com/ contains "vey iz (tsu) mir"
which is written *there* both with "vov + vov" and "yud + yud". Nevertheless
http://www.cs.engr.uky.edu/~raphael/yiddish/makeyiddish.html translates with
"tsvey vovn" and "tsvey-yudn": ‫װײ איז (צו) מיר!

It seems that automatic character substitution is not possible because of
ambiguities when three characters meet, as in
http://www.yiddishdictionaryonline.com/ at
farvunderung - פֿאַרווונדערונג , "farvundert" - פֿאַרווונדערט
and the other way around at
oyspruvn - אויספּרווון
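
The ambiguity with three adjacent characters can be sketched as follows. This is purely illustrative: `ligature_candidates` is a hypothetical helper, and it only enumerates the possible spellings; which one is correct depends on the word, so it cannot be decided mechanically.

```python
VOV = "\u05d5"         # HEBREW LETTER VAV
TSVEY_VOVN = "\u05f0"  # HEBREW LIGATURE YIDDISH DOUBLE VAV


def ligature_candidates(word: str) -> set[str]:
    """Return every spelling obtained by replacing exactly one 'vov + vov' pair."""
    pair = VOV + VOV
    out = set()
    start = word.find(pair)
    while start != -1:
        out.add(word[:start] + TSVEY_VOVN + word[start + 2:])
        # Continue one position later so overlapping pairs are found too.
        start = word.find(pair, start + 1)
    return out


# A run of three vovs can be read as "tsvey vovn + vov" or "vov + tsvey vovn".
triple = VOV * 3
candidates = ligature_candidates(triple)
print(len(candidates))
```

A greedy left-to-right replacement would always pick the first candidate, which is why a blind substitution is unsafe.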


Version: unspecified
Severity: trivial
URL: http://yi.wikipedia.org/w/index.php?title=Special%3APrefixindex&from=Gangleri%2Ftests%2F%E2%80%AB%D7%B0%D7%99&namespace=2

Details

Reference
bz3819

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 8:54 PM
bzimport set Reference to bz3819.
bzimport added a subscriber: Unknown Object (MLST).

gangleri wrote:

You will find typical examples at the end of
http://yi.wiktionary.org/wiki/Special:Allpages and at
http://yi.wiktionary.org/w/index.php?title=Category:Bugzilla .

Summary is available at http://yi.wiktionary.org/wiki/%E2%80%AB .

These pages were created because I "compiled" the titles by "copy and
paste" (of Hebrew characters) between different Firefox browsers on Windows.

A workaround is to use a suitable keyboard layout as described at http://www.uyip.org/
and avoid this silly "copy and paste".
See http://www.geocities.com/fontboard/yiddish.html : Yiddish Pasekh and Keyman
keyboard for Windows

Regards Reinhardt [[user:gangleri]]

gangleri wrote:

Note:

This bug can cause some confusion in a wiki. I assume that many contributors
use "copy and paste" to insert a few Hebrew characters.

As you can see from
http://yi.wikipedia.org/wiki/User:Gangleri/tests/%E2%80%AB%D7%B0%D7%99%D7%A5%E2%80%AB
%E2%80%AB can be

  • at the beginning of a title
  • at the end of a title
  • (I assume also inside the title)

There would be different things to do:

  • avoid generation of such titles during editing, linking etc.
  • clear the database - this is a maintenance issue
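
The two tasks above could look roughly like this. It is a minimal sketch under stated assumptions: `clean_title` and `titles_needing_cleanup` are hypothetical names, the list of titles is made up, and only U+202B is handled because that is the character discussed in this comment.

```python
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING, the only character handled here


def clean_title(title: str) -> str:
    """Remove U+202B wherever it occurs: at the start, at the end, or inside."""
    return title.replace(RLE, "")


def titles_needing_cleanup(titles):
    """Maintenance pass: yield (old, new) pairs for titles that would change."""
    for title in titles:
        new = clean_title(title)
        if new != title:
            yield title, new


# Made-up sample: one affected Hebrew title, its clean form, one inner U+202B.
sample = ["\u202b\u05f0\u05d9\u05e5\u202b", "\u05f0\u05d9\u05e5", "a\u202bb"]
renames = list(titles_needing_cleanup(sample))
print(len(renames))
```

Note that a cleaned title can collide with a page that already exists (the first two sample entries do), so a real maintenance run would need a merge or rename policy.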

Regards Reinhardt [[user:gangleri]]

gangleri wrote:

additions:

I found more incorrect titles (only with a leading RIGHT-TO-LEFT_EMBEDDING) in
other projects with
http://yi.wikipedia.org/wiki/Special:Prefixindex/%E2%80%AB
beside
http://yi.wiktionary.org/wiki/Special:Prefixindex/%E2%80%AB

Besides the RTL wikis [[ar:]] [[fa:]] [[he:]] [[ur:]] [[yi:]], their wiktionaries etc.,
all other projects can be affected.

These wrong titles at [[yi:]] were created by 5 contributors. This shows
that it is a general problem. If contributors "copy" text from a web page and
paste it (as Hebrew characters) into the browser's URL field (I mainly use
Firefox myself), they may copy and paste leading or trailing punctuation
characters, and the browser will *generate* these URLs.

Of course this is not the proper way to generate titles (one should use a
keyboard), and it may or may not be a Firefox issue (I do not know whether it
has been reported at bugzilla.org; if not, please do so), but it is common
practice among a significant number of contributors to RTL projects.

You will find the affected titles at:
[[yi:Category:Bugzilla/Unicode_character_RIGHT-TO-LEFT_EMBEDDING_-_U_202B]]
http://yi.wiktionary.org/wiki/Category:Bugzilla/Unicode_character_RIGHT-TO-LEFT_EMBEDDING_-_U_202B

Best regards Reinhardt [[user:gangleri]]

gangleri wrote:

more characters:

I found
http://yi.wikipedia.org/w/index.php?title=%E2%80%AB%D7%A7%D7%94%D7%9C_%D7%A4%D6%BF%D7%95%D7%9F_%E2%80%AB%D7%96%D7%A2%D7%9C%D7%91%D7%A9%D7%98%D7%A2%D7%A0%D7%93%D7%99%D7%A7%D7%A2%D7%A8_%D7%A9%D7%98%D7%90%D6%B7%D7%98%D7%9F%E2%80%AC&redirect=no
which originally contained a trailing %E2%80%AC

Beside
http://www.fileformat.info/info/unicode/char/202b/index.htm
Unicode Character 'RIGHT-TO-LEFT EMBEDDING' (U+202B)
UTF-8 (hex) 0xE2 0x80 0xAB (e280ab)

Compare also:
http://www.fileformat.info/info/unicode/char/202a/index.htm
Unicode Character 'LEFT-TO-RIGHT EMBEDDING' (U+202A)
UTF-8 (hex) 0xE2 0x80 0xAA (e280aa)

http://www.fileformat.info/info/unicode/char/202c/index.htm
Unicode Character 'POP DIRECTIONAL FORMATTING' (U+202C)
UTF-8 (hex) 0xE2 0x80 0xAC (e280ac)

http://www.fileformat.info/info/unicode/char/202d/index.htm
Unicode Character 'LEFT-TO-RIGHT OVERRIDE' (U+202D)
UTF-8 (hex) 0xE2 0x80 0xAD (e280ad)

http://www.fileformat.info/info/unicode/char/202e/index.htm
Unicode Character 'RIGHT-TO-LEFT OVERRIDE' (U+202E)
UTF-8 (hex) 0xE2 0x80 0xAE (e280ae)

Variations / modifications of
http://yi.wikipedia.org/wiki/Special:Prefixindex/%E2%80%AB
as
http://yi.wikipedia.org/wiki/Special:Prefixindex/%E2%80%AA
http://yi.wikipedia.org/wiki/Special:Prefixindex/%E2%80%AC
http://yi.wikipedia.org/wiki/Special:Prefixindex/%E2%80%AD
http://yi.wikipedia.org/wiki/Special:Prefixindex/%E2%80%AE
are of limited use only, because (theoretically) these characters can be included
anywhere in a title.
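
What a prefix search cannot do, a substring search over the UTF-8 bytes can. The sketch below is hypothetical: `titles_containing` is not an existing special page or API, and the sample titles are invented.

```python
from urllib.parse import unquote_to_bytes

# The five %nn sequences listed above (U+202A .. U+202E).
PATTERNS = ("%E2%80%AA", "%E2%80%AB", "%E2%80%AC", "%E2%80%AD", "%E2%80%AE")


def titles_containing(titles, percent_pattern: str):
    """Return the titles whose UTF-8 bytes contain the %nn pattern anywhere."""
    needle = unquote_to_bytes(percent_pattern)
    return [t for t in titles if needle in t.encode("utf-8")]


# Invented titles: control character at the start, inside, at the end, absent.
sample = ["\u202etest", "te\u202dst", "test\u202c", "test"]

hits_202e = titles_containing(sample, "%E2%80%AE")
hits_any = [t for t in sample
            if any(titles_containing([t], p) for p in PATTERNS)]
print(len(hits_202e), len(hits_any))
```

Unlike Special:Prefixindex, this finds the character regardless of its position in the title.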

I will open another enhancement request for a special page that allows a
substring search of titles by specifying %nn values.

gangleri wrote:

(In reply to comment #4)

I will open another enhancement request for a special page that allows a
substring search of titles by specifying %nn values.

bug 3887: create a special page for instring search of titles specifying %nn values

gangleri wrote:

sorry for this

see
http://yi.wikipedia.org/wiki/%E2%80%AEtest
http://yi.wiktionary.org/wiki/%E2%80%AEtest

You may say: "garbage in, garbage out"

But this seems to be a follow-on error. It seems to interfere with the setup
for case-sensitive / case-insensitive titles. The earlier this bug gets
fixed, the fewer follow-on errors we get.

gangleri wrote:

sorry for this

http://yi.wiktionary.org/wiki/Special:Whatlinkshere/%E2%80%AB%D7%B0%D7%90%D6%B8%D7%9B%D7%A0%D7%98%D7%90%D6%B8%D7%92
this title is invalid because it starts with %E2%80%AB = Unicode Character
'RIGHT-TO-LEFT EMBEDDING' (U+202B)

However, it is a mess to edit BiDi text and generate pages like
http://yi.wiktionary.org/wiki/%D7%B0%D7%90%D6%B8%D7%9A
http://yi.wiktionary.org/wiki/%D7%98%D7%90%D6%B8%D7%92
and also taking care of all these !*%$$€@*# bugs.

These pages look fine, but the titles they link to should be invalid and the
links should not show red. It would be best to leave them displayed with their
[[ and ]] brackets, the same as other invalid links.

Best regards Reinhardt [[user:gangleri]]

gangleri wrote:

(In reply to comment #7)

sorry for this
and also taking care of all these !*%$$€@*# bugs.

I fixed the links involved, so the Whatlinkshere list above is no longer current. Compare:
http://yi.wiktionary.org/w/index.php?title=%D7%B0%D7%90%D6%B8%D7%9A&diff=4483&oldid=4477
http://yi.wiktionary.org/w/index.php?title=%D7%95%D7%95%D7%90%D6%B8%D7%9B%D7%A0%D7%98%D7%90%D6%B8%D7%92&diff=4482&oldid=4463
and
bug 3894 white space characters, BiDi control characters should show up in diff

gangleri wrote:

Fixing this would later require a validation according to
bug 3904 disallow user pages and user_talk pages starting with lower case on
case sensitive wikis

adding blocks bug 3904

gangleri wrote:

Hi! The code on FiverAlpha is changing.
See http://test.leuksman.com/view/Category:Mimic
and bug 3888 comment 3

The category http://test.leuksman.com/view/Category:Mimic illustrates that the
punctuation characters can be used for fraud and vandalism.

If you are not familiar with these punctuation characters you may *not* notice that
http://test.leuksman.com/edit/User:Brion%E2%80%AD%E2%80%AC?oldid=9812
(the edit of this *false account*) contains punctuation characters in
[[User:Brion‭‬|Brion]].

  • one way to see these characters is to check the URL; this is simple if most
    of the contained characters are 7-bit ASCII;
  • another way to see these characters is to place the cursor in the text and
    move it through the text area;
  • another way to see these characters is to select the text with the mouse.

Because these characters cause more trouble than benefit, I suggest suppressing
the punctuation characters in titles until a generally accepted solution can be
provided. As it stands, mimic accounts can be created.
This opens the door to fraud and vandalism.
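
A check for such mimic accounts could be sketched like this. It assumes that dropping every character of Unicode category Cf (format characters) is acceptable, which is a simplification: that category also contains characters some scripts legitimately need, such as ZERO WIDTH NON-JOINER in Persian. Both function names are hypothetical.

```python
import unicodedata


def visible_skeleton(name: str) -> str:
    """Drop format characters (Unicode category Cf), e.g. U+202A..U+202E."""
    return "".join(c for c in name if unicodedata.category(c) != "Cf")


def is_mimic(candidate: str, existing: set[str]) -> bool:
    """True if candidate differs from an existing name only by Cf characters."""
    return candidate not in existing and visible_skeleton(candidate) in existing


existing_users = {"Brion"}
# The false account from the URL above: "Brion" + U+202D + U+202C.
fake = "Brion\u202d\u202c"
print(is_mimic(fake, existing_users), is_mimic("Brion", existing_users))
```

Such a check could run at account creation time, so that a name is rejected when its visible form already belongs to someone else.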

Regards Reinhardt [[user:gangleri]]

gangleri wrote:

(In reply to comment #4)

more characters:

I also found

http://www.fileformat.info/info/unicode/char/200e/index.htm
Unicode Character 'LEFT-TO-RIGHT MARK' (U+200E)
UTF-8 (hex) 0xE2 0x80 0x8E (e2808e)

http://www.fileformat.info/info/unicode/char/200f/index.htm
Unicode Character 'RIGHT-TO-LEFT MARK' (U+200F)
UTF-8 (hex) 0xE2 0x80 0x8F (e2808f)

source:
http://www.fileformat.info/info/unicode/block/general_punctuation/list.htm
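
The code points, names and UTF-8 byte sequences collected in this report can be regenerated from Python's bundled Unicode database, which is a convenient way to double-check the hex values. This is only a sketch; `CODE_POINTS` simply lists the seven characters mentioned in these comments.

```python
import unicodedata

# The seven directional characters mentioned in this report.
CODE_POINTS = [0x200E, 0x200F, 0x202A, 0x202B, 0x202C, 0x202D, 0x202E]


def describe(cp: int) -> tuple[str, str, str]:
    """Return (U+XXXX label, Unicode name, UTF-8 bytes as hex) for a code point."""
    ch = chr(cp)
    return (f"U+{cp:04X}", unicodedata.name(ch), ch.encode("utf-8").hex())


table = [describe(cp) for cp in CODE_POINTS]
for row in table:
    print(*row, sep="  ")
```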

gangleri wrote:

Hello!

I would like to CANCEL this request / withdraw it. (There is no such MediaZilla
resolution.)

The request is too restrictive in my view, and other methods to avoid the
problem / to fix affected pages should be found.

Such tools are requested at

  • Bug 4012: feature request: add a flexible magic character conversion to the
    built-in editor, which would allow these characters to be identified in the
    editor
  • Bug 4185: feature request: provide a notification for irregular links, which
    would warn users before they submit such links / such pages (either new or
    changed).

gangleri wrote:

As the status is now, this is more of a DUPLICATE of

bug 3696 Unicode Control Characters should be restricted in title text (RLM, LRM, RLO, LRO, . . .)

*** This bug has been marked as a duplicate of bug 3696 ***