Unicode whitespaces allowed in article title
Closed, ResolvedPublic

Description

Author: foenyx

Description:
<FoeNyx> the « » (U+2003 em space) should be an invalid article name, no?
<zwitter> all whitespace other than U+0020 should be


Version: unspecified
Severity: normal

Details

Reference
bz1414

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 8:10 PM)
bzimport set Reference to bz1414.
bzimport added a subscriber: Unknown Object (MLST).

comment from bug 1971:

Moving a page to a title like [[« Pour l'Ukraine unie ! »]] creates a page with
non-breaking spaces in the title. The page move was done here:
http://fr.wikipedia.org/w/index.php?title=Pour_une_Ukraine_unie_%21&action=history
The resulting page is:
http://fr.wikipedia.org/wiki/%C2%AB%C2%A0Pour_l%27Ukraine_unie%C2%A0%21%C2%A0%C2%BB

Regarding the non-breaking space (U+00A0) specifically, it's generally transformed silently into U+0020 spaces when it goes
through the <textarea>->submit edit cycle and is not preserved, making it extra annoying.

*** Bug 1971 has been marked as a duplicate of this bug. ***

I've just done a little research on Unicode whitespace handling; the Zs, Zl, and Zp character classes seem to be relevant, and the
set of them or some variant is what's counted by e.g. Java's Character.isSpaceChar() and .NET's Char.IsWhiteSpace().

It might make sense to explicitly disallow the Zl and Zp chars (line separator and paragraph separator), and normalize all the Zs
chars to spaces (well, underscores) in title processing.
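
The proposal above can be sketched in a few lines of Python. This is only an illustration of the idea, not MediaWiki's title code: the names `normalize_title_whitespace` and `InvalidTitle` are hypothetical, and the category data comes from the Unicode version bundled with the Python interpreter.

```python
import unicodedata


class InvalidTitle(ValueError):
    """Hypothetical error for titles containing a line or paragraph separator."""


def normalize_title_whitespace(title: str) -> str:
    """Reject Zl/Zp characters and map every Zs character to an underscore."""
    out = []
    for ch in title:
        category = unicodedata.category(ch)
        if category in ("Zl", "Zp"):
            # Line separator (U+2028) and paragraph separator (U+2029).
            raise InvalidTitle(f"U+{ord(ch):04X} is not allowed in a title")
        # Any space separator, including U+0020 itself, becomes "_".
        out.append("_" if category == "Zs" else ch)
    return "".join(out)
```

With this sketch, a title containing U+2003 or U+00A0 ends up with the same underscores as one typed with plain U+0020 spaces.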

A quick grep of the current UnicodeData.txt database lists:

0020;SPACE;Zs;0;WS;;;;;N;;;;;
00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;
1680;OGHAM SPACE MARK;Zs;0;WS;;;;;N;;;;;
180E;MONGOLIAN VOWEL SEPARATOR;Zs;0;WS;;;;;N;;;;;
2000;EN QUAD;Zs;0;WS;2002;;;;N;;;;;
2001;EM QUAD;Zs;0;WS;2003;;;;N;;;;;
2002;EN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2003;EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2004;THREE-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2005;FOUR-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2006;SIX-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2007;FIGURE SPACE;Zs;0;WS;<noBreak> 0020;;;;N;;;;;
2008;PUNCTUATION SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2009;THIN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
200A;HAIR SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
202F;NARROW NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;;;;;
205F;MEDIUM MATHEMATICAL SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
3000;IDEOGRAPHIC SPACE;Zs;0;WS;<wide> 0020;;;;N;;;;;

2028;LINE SEPARATOR;Zl;0;WS;;;;;N;;;;;

2029;PARAGRAPH SEPARATOR;Zp;0;B;;;;;N;;;;;
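
For anyone who wants to reproduce a list like this without grepping UnicodeData.txt, here is a minimal sketch using Python's `unicodedata` module. The function name `separator_chars` is made up for illustration, and the output depends on the Unicode version shipped with the interpreter, so it may not match the list above exactly.

```python
import sys
import unicodedata


def separator_chars():
    """Return the code points in categories Zs, Zl and Zp, grouped by category."""
    result = {"Zs": [], "Zl": [], "Zp": []}
    for cp in range(sys.maxunicode + 1):
        category = unicodedata.category(chr(cp))
        if category in result:
            result[category].append(cp)
    return result


if __name__ == "__main__":
    for category, code_points in separator_chars().items():
        for cp in code_points:
            # Some code points have no name in the database; use a placeholder.
            name = unicodedata.name(chr(cp), "<unnamed>")
            print(f"{cp:04X};{name};{category}")
```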

foenyx wrote:

*** Bug 2042 has been marked as a duplicate of this bug. ***

There is another problem with UTF-8 titles. The representation of a character in
a foreign codepage looks like a normal character in our codepage.

You may find examples in
http://de.wikipedia.org/w/index.php?title=Spezial:Log&type=delete&user=&page=&limit=500&offset=50
Look for the entries from 1 May 2005, 3:45 to 3:55 ("K.D.St.V. CarοIus Маgnus").
Please view this text in the HTML source. Examples:

<a
href="/w/index.php?title=%CE%9A.D.%D0%85t.V._%D0%A1arolu%D1%95_%D0%9Ca%C9%A1nu%D1%95&amp;action=edit"

<a
href="/w/index.php?title=%CE%9A.D.%D0%85t.V._%D0%A1arolu%D1%95_Magnus&amp;action=edit"
<a
href="/w/index.php?title=%CE%9A.D.%D0%85t.V._%D0%A1arolus_%CE%9Cagnu%D1%95&amp;action=edit"

tsor (administrator of the German WP)
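
To see which characters in such a link are look-alikes, the percent-encoded title can be decoded and every non-ASCII character listed with its Unicode name. This is a small sketch using only the Python standard library; the helper name `describe_non_ascii` is hypothetical.

```python
import unicodedata
from urllib.parse import unquote


def describe_non_ascii(encoded_title: str):
    """Decode a percent-encoded UTF-8 title and describe its non-ASCII characters."""
    title = unquote(encoded_title)  # unquote() decodes as UTF-8 by default
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in title
        if ord(ch) > 0x7F
    ]


if __name__ == "__main__":
    # Title taken from the second example link above.
    example = "%CE%9A.D.%D0%85t.V._%D0%A1arolu%D1%95_Magnus"
    for code_point, name in describe_non_ascii(example):
        print(code_point, name)
```

For that title the listing shows a Greek capital kappa and three Cyrillic letters standing in for Latin "K", "S", "C" and "s".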

foenyx wrote:

(In reply to comment #5)

There is another problem with UTF-8 titles. The representation of a character in
a foreign codepage looks like a normal character in our codepage.

I reopened bug 2042, as it is not exactly the same issue.
(This bug is a subset of bug 2042 that covers only homograph whitespace characters.)

rickblock wrote:

Curly vs. straight quotes have been causing confusion at en lately as well.

foenyx wrote:

(In reply to comment #7)

Curly vs. straight quotes have been causing confusion at en lately as well.

This bug is about whitespace characters; the quote confusion is probably better
suited to bug 2042.

ayg wrote:

Before anything is done on this, obviously a check needs to be run on the various wikis to see if they use these. It seems probable that IDEOGRAPHIC SPACE, for instance, should not be blacklisted. In general, there are various reasons to use various types of spaces, and I think it would be best if these were normalized for storage but not blacklisted, so you can't have two article names that differ only in the type or number of spaces used but you can still have unusual spaces in character titles. This should be part of the eventual move to case-insensitivity for titles (bug 453).
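
The check suggested above could look roughly like the following sketch: compute a storage key that ignores the type and number of spaces, then group existing titles by that key to find the ones that would collide. The names `storage_key` and `find_collisions` are hypothetical, the titles would have to come from a real dump of each wiki, and the sketch assumes that leading and trailing spaces are dropped as well.

```python
import re
import unicodedata
from collections import defaultdict


def storage_key(title: str) -> str:
    """Map every Zs character to a space, collapse runs, and use underscores."""
    spaced = "".join(
        " " if unicodedata.category(ch) == "Zs" else ch for ch in title
    )
    collapsed = re.sub(r" +", " ", spaced).strip()
    return collapsed.replace(" ", "_")


def find_collisions(titles):
    """Return {key: [titles]} for titles that differ only in their spaces."""
    groups = defaultdict(list)
    for title in titles:
        groups[storage_key(title)].append(title)
    return {key: group for key, group in groups.items() if len(group) > 1}
```

Running `find_collisions` over a wiki's title list would show how many existing pages depend on unusual spaces before any normalization is switched on.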

*** Bug 12080 has been marked as a duplicate of this bug. ***

I think this was fixed by r55382 (and the follow-ups to it) back in 2009. Closing.