
Cannot create a username containing a Zero width joiner on languages where a ZWJ makes a visible difference and is required
Open, LowPublicFeature

Description

Author: rathne

Description:
Hi,

We are having some issues with creating users in Sinhala Wikipedia. We are not allowed to create user names like "සසීන්ද්‍ර" and "නන්දිමිත්‍ර".

This looks like it has something to do with modifiers on Sinhala letters. Maybe the zero width joiner (ZWJ) needs to be allowed?

rakaranshaya (්‍ර) is written as: hal kereema + zero width joiner(ZWJ) + ra
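The sequence described above can be inspected with a minimal Python sketch. The code points are taken from the Unicode Sinhala block (U+0DCA for hal kirima, U+200D for ZWJ, U+0DBB for ra); the variable name `rakaransaya` is only illustrative.

```python
import unicodedata

# hal kirima (al-lakuna) + ZWJ + ra, written with explicit escapes so the
# invisible joiner cannot get lost in copy and paste.
rakaransaya = "\u0dca\u200d\u0dbb"

for ch in rakaransaya:
    # Print each code point with its official Unicode character name.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Removing the middle code point leaves a valid but differently rendered sequence, which is why the joiner cannot simply be stripped from these names.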

Thanks in advance,
/Lee


Version: unspecified
Severity: enhancement

Details

Reference
bz24999

Event Timeline

bzimport raised the priority of this task from to Low.Nov 21 2014, 11:10 PM
bzimport set Reference to bz24999.

The zero width joiner has been forbidden in usernames since r13007.

Perhaps we could allow it if surrounded by Sinhala characters? :s
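A rough Python sketch of that idea, assuming "Sinhala characters" means anything in the Unicode Sinhala block (U+0D80 to U+0DFF). The function names are hypothetical and this is not MediaWiki code.

```python
ZWJ = "\u200d"


def is_sinhala(ch: str) -> bool:
    # Assumption: membership in the Sinhala block U+0D80..U+0DFF.
    return "\u0d80" <= ch <= "\u0dff"


def zwj_surrounded_by_sinhala(name: str) -> bool:
    """Return True if every ZWJ in name has a Sinhala character on both sides."""
    for i, ch in enumerate(name):
        if ch != ZWJ:
            continue
        # A ZWJ at the very start or end has no neighbour on one side.
        if i == 0 or i == len(name) - 1:
            return False
        if not (is_sinhala(name[i - 1]) and is_sinhala(name[i + 1])):
            return False
    return True


print(zwj_surrounded_by_sinhala("\u0dad\u0dca\u200d\u0dbb"))  # Sinhala on both sides
print(zwj_surrounded_by_sinhala("Cat\u200drope"))             # Latin on both sides
```

This only covers Sinhala; other scripts that need ZWJ would each need their own range or a more general rule.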

rathne wrote:

Is there any reason why these characters are black listed?

If you have a user called "Some admin", having another account called "Some admin" but using a non-default space is confusing. Moreover, when trying to block the vandal you are likely to block the legitimate user instead (or be unable to, if the account with the normal space didn't exist).

I think that's what Brion referred to as 'troublemaker characters'.

On the other hand, the request to use නන්දිමිත්‍ර is perfectly reasonable.

(In reply to comment #3)

If you have a user called "Some admin", having another account called "Some
admin" but using a non-default space is confusing. Moreover, when trying to block
the vandal you are likely to block the legitimate user instead (or be unable to,
if the account with the normal space didn't exist).

Don't we have Extension:AntiSpoof for this?

I think that's what Brion referred to as 'troublemaker characters'.

On the other hand, the request to use නන්දිමිත්‍ර is perfectly reasonable.

Most of the banned characters (whitespace, nbsp, control chars) do look like troublemakers, but ZWJ seems perfectly reasonable to me.

Don't we have Extension:AntiSpoof for this?

AntiSpoof is more powerful: it checks for similar-looking characters, blocks mixed scripts...

Most of the banned characters (whitespace, nbsp, control chars) do look like
troublemakers, but ZWJ seems perfectly reasonable to me.

Are you sure? Please compare in your browser [[User:Catrope]] vs [[User:Cat‍rope]]. There's no visual difference in mine.

(In reply to comment #5)

Are you sure? Please compare in your browser [[User:Catrope]] vs
[[User:Cat‍rope]]. There's no visual difference in mine.

That's what we have AntiSpoof for, right? I'm sure there's plenty of characters that look very much like an ASCII 'C'.

You failed. That C is the normal one.
What I did was insert a ZWJ between Cat and rope.
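To make the difference visible, here is a small Python sketch comparing the two strings. The ZWJ is written as the escape `\u200d`; the variable names are only illustrative.

```python
plain = "Catrope"
with_zwj = "Cat\u200drope"  # ZWJ inserted between "Cat" and "rope"

# The two names render the same in many fonts but are different strings.
print(plain == with_zwj)           # False
print(len(plain), len(with_zwj))   # 7 8

# Stripping the joiner recovers the plain name.
print(with_zwj.replace("\u200d", "") == plain)  # True
```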

(In reply to comment #7)

You failed. That C is the normal one.
What I did was insert a ZWJ between Cat and rope.

I knew that; I was just pointing out that there are other ways to construct a username that looks just like 'Catrope' without using ZWJs or other characters currently forbidden in usernames.

Sure you could use [[С]] for writing [[User:Сatrope]], and that would be blocked by AntiSpoof.
The point is, ZWJ should not be allowed in usernames unless the bad usage stays blocked.
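As a rough illustration of the mixed-script case, the following Python sketch flags names whose letters come from more than one script. It uses the first word of each Unicode character name as a crude stand-in for the script; this is an assumption made for brevity and is not how AntiSpoof itself works. The function name `scripts_used` is hypothetical.

```python
import unicodedata


def scripts_used(name: str) -> set:
    """Collect a crude script label for every letter in name."""
    scripts = set()
    for ch in name:
        if ch.isalpha():
            # e.g. "LATIN CAPITAL LETTER C" -> "LATIN"
            label = unicodedata.name(ch, "UNKNOWN").split()[0]
            scripts.add(label)
    return scripts


latin_name = "Catrope"
spoofed_name = "\u0421atrope"  # starts with CYRILLIC CAPITAL LETTER ES

print(scripts_used(latin_name))
print(scripts_used(spoofed_name))
print(len(scripts_used(spoofed_name)) > 1)  # mixed scripts detected
```

A check like this catches the Cyrillic lookalike but says nothing about ZWJ, which has no script of its own and is not a letter.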

rathne wrote:

Do we have any update on this?

Lee, can you figure out in which cases a ZWJ makes a visual difference?
I think that's the blocker here. If we can isolate some unambiguous instances of ZWJ, we could try whitelisting them.

According to Wikipedia, Arabic and most Indic scripts have at least some characters where it makes a visual difference.

Googling, http://www.unicode.org/reports/tr31/ (section 2.3) seems to have some advice on when and when not to ban ZWJ. (It even gives Perl regexes, but using the fancy stuff that I don't think is supported by PCRE.)

http://unicode.org/review/pr-96.html also seems to have some advice (and seems more to the point), but it's unclear what the status of that document is.

rathne wrote:

It looks like I'm going to need some help to answer that question. I'm not much of an expert in the language. I'll ask around so someone with the proper knowledge can help here.

According to Unicode Annex 31(http://www.unicode.org/reports/tr31/), Identifier patterns, as an exception to the usual exclusion of ZWJ is not allowed for certain scripts. That includes Sinhala. But the policy is strict about where and how one can use ZWJ.
Sinhala, many Indian languages and Arabic require ZWJ, which makes a visual difference.
We need to implement UAX31 on top of r13007.

(In reply to comment #14)

According to Unicode Annex 31(http://www.unicode.org/reports/t0/), Identifier
patterns, as an exception to the usual exclusion of ZWJ is not allowed for
certain scripts.

Sorry. Read it as :

According to Unicode Annex 31 (http://www.unicode.org/reports/t0/), Identifier patterns, as an exception to the usual exclusion, ZWJ *is allowed* for certain scripts.

The right url seems to be http://unicode.org/reports/tr31/
There are some regular expressions given there; I think they are based on \p{} (Unicode properties). Luckily, we can afford to do some slow things on this code path.

(In reply to comment #16)

The right url seems to be http://unicode.org/reports/t0/
There are some regular expressions reported, I think they are based on \p{}
(Unicode properties). Luckily, we can do some slow things on this path.

I think the rXXX in the url is screwing it up with magic revision auto-linking. Let's try http://www.unicode.org/reports/t%7231/

Last time I looked at that page, the regexes used the more complex Unicode properties supported by Perl but not PCRE. However, it was still very doable; one just needed to create a fairly large (not huge though) character class by hand.
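For illustration, here is a simplified Python sketch of a context check in the spirit of the UAX31 advice: a ZWJ is accepted only when it directly follows a virama (Canonical_Combining_Class 9) that itself follows a letter. This is an assumption-laden simplification, not the full rule from the annex (it ignores intervening marks and the following character), and `zwj_usage_ok` is a hypothetical name. It relies on Python's `unicodedata` instead of a hand-built character class.

```python
import unicodedata

ZWJ = "\u200d"
VIRAMA_CCC = 9  # Canonical_Combining_Class value assigned to viramas


def zwj_usage_ok(name: str) -> bool:
    """Return True if every ZWJ in name follows <letter><virama>."""
    for i, ch in enumerate(name):
        if ch != ZWJ:
            continue
        # Need at least a letter and a virama before the joiner.
        if i < 2:
            return False
        if unicodedata.combining(name[i - 1]) != VIRAMA_CCC:
            return False
        if not unicodedata.category(name[i - 2]).startswith("L"):
            return False
    return True


print(zwj_usage_ok("\u0dad\u0dca\u200d\u0dbb"))  # Sinhala letter + al-lakuna + ZWJ + ra
print(zwj_usage_ok("Cat\u200drope"))             # ZWJ after a plain Latin letter
```

A check of this shape would accept the Sinhala sequences from the original request while still rejecting a joiner hidden inside a Latin name.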

santhosh set Security to None.
Aklapper changed the subtype of this task from "Task" to "Feature Request".Feb 4 2022, 11:00 AM