
Usernames should use unicode whitelist
Closed, Resolved (Public)

Description

Author: river

Description:
usernames should be restricted to a whitelist of characters which includes only
valid alphanumeric characters in each language, and punctuation. otherwise,
creating usernames (and page titles) with invalid characters will make it hard
to block vandals.


Version: 1.5.x
Severity: major

Details

Reference
bz1524

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 8:13 PM
bzimport set Reference to bz1524.
bzimport added a subscriber: Unknown Object (MLST).

*Invalid* characters (those that are illegal in XML or don't reliably cut and paste) need to be outright
blocked in titles.

Characters that some people are simply unable to type should not be a real problem, as either there
should be a direct 'block' link or cut-and-paste will always be available.

I'm not really inclined to proclaim what characters are appropriate for each language, as this will make
interoperability, writing on foreign topics, shared data, shared user accounts, global user accounts etc
very hard and will require a lot of manual mucking about as people whine for whitelists to be updated.

p_simoons wrote:

Agreed. There are to my knowledge no legit users on the English wiki that use
non-ASCII characters in their name, but it's a favorite trick of vandals and
impersonators.

avarab wrote:

*** Bug 2290 has been marked as a duplicate of this bug. ***

gangleri wrote:

(In reply to comment #0)

usernames should be restricted to a whitelist of characters which includes only
valid alphanumeric characters in each language, and punctuation.

This requirement and single user login will conflict with the wish to use
*native* (non-Latin) alphabets in user names.

river wrote:

(In reply to comment #4)

(In reply to comment #0)

usernames should be restricted to a whitelist of characters which includes only
valid alphanumeric characters in each language, and punctuation.

This requirement and single user login will conflict with the wish to use
*native* (non-Latin) alphabets in user names.

why?

lowzl wrote:

Usernames shouldn't be stored in a normalised form, however, users should not be
permitted to register names which would conflict with existing usernames, when
normalised.

Perhaps this could be achieved by adding a new field to the user table -
'username_normal' - and storing the normalised username there. Add a unique
constraint to the field, and then attempts to register a username which will
result in a collision when normalised will... well, result in a database error.
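
The idea above can be sketched roughly as follows. This is only an illustration, not MediaWiki code: it uses Python and an in-memory SQLite database, the `normalise` function is a stand-in assumption (NFKC plus case folding, which does not cover visually confusable characters), and the `user_name` column and `register` helper are hypothetical names. Only `username_normal` comes from the comment above.

```python
import sqlite3
import unicodedata


def normalise(name: str) -> str:
    """Stand-in normalisation (an assumption): NFKC plus case folding."""
    return unicodedata.normalize("NFKC", name).casefold()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user ("
    " user_name TEXT NOT NULL,"               # the name exactly as typed
    " username_normal TEXT NOT NULL UNIQUE)"  # normalised form, must be unique
)


def register(name: str) -> bool:
    """Store the name as typed; refuse it if its normalised form is taken."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO user (user_name, username_normal) VALUES (?, ?)",
                (name, normalise(name)),
            )
        return True
    except sqlite3.IntegrityError:
        # The unique constraint fired: another name normalises to the same value.
        return False
```

With this arrangement the displayed name is never altered, and the collision check is left to the database rather than to application code.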

Now the question is, where do we get a reasonable map of confusable characters?
http://www.unicode.org/draft/reports/tr36/Attic/confusables.txt isn't
particularly extensive, but should work for most malicious cases. Perhaps we
should try to get a copy of the IDN normalisation map. The Unicode Consortium
has a long document about visual spoofing:
http://www.unicode.org/draft/reports/tr36/tr36.html

gangleri wrote:

(In reply to comment #5)

why?

There are many opinions about the restriction of usernames:
"Since this is the English Wikipedia, usernames ought to be constructed using
English characters, with allowances for scripts from other languages ..." from
[[en:Wikipedia_talk:Username#On_Unicode_and_other_odd_characters_in_usernames]]

Nevertheless the community's decision about this should be more tolerant. With
regard to single user login it should be allowed to use Arabic, Cyrillic, Hebrew,
Hindu, Georgian whatsoever alphabets.

I would not object to usernames as [[user:۞]], [[user:░]], [[User:–]] etc. The
usernames are part of personality and creativity. Whatever opinion we have on
this / how we deal with this it is *reality* that there are also usernames like
[[en:user:god]] - see [[en:user talk:god]], [[en:user:satan]],
[[en:user:antichrist]] etc.

gangleri wrote:

http://en.wikipedia.org/wiki/User:%C2%A0

is a "construct" based on
bug 2173: Fatal error when removing an article with an whitespace title from the
watchlist

gangleri wrote:

compare with

bug 3696: Unicode Control Characters should be restricted in title text

gangleri wrote:

see also

bug 2593: Non-printing characters allowed in registration

gangleri wrote:

(In reply to comment #6)

Usernames shouldn't be stored in a normalised form, however, users should not be
permitted to register names which would conflict with existing usernames, when
normalised.

Depending on the used font two "ו" characters can look like one "װ" character:
[[yi:user:גאַװיאַל]] and [[yi:user:גאַוויאַל]]
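
To make the difference concrete, here is a small Python sketch that lists the code points behind the two spellings of the double vav. The `describe` helper is a hypothetical name introduced only for this illustration; the character names come from Python's `unicodedata` module.

```python
import unicodedata


def describe(text: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


ligature = "\u05F0"        # the single "double vav" character
two_vavs = "\u05D5\u05D5"  # two separate vav letters

for label, text in (("ligature", ligature), ("two vavs", two_vavs)):
    print(label, describe(text))
```

The two strings compare as different even though they may render alike, so a map of confusable characters would presumably need an explicit entry for this pair.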

lowzl wrote:

Hmm, you could say similar things about vv and w (though generally w is
narrower)...

gangleri wrote:

compare with
bug 3982: Maybe...

gangleri wrote:

*** Bug 4312 has been marked as a duplicate of this bug. ***

gangleri wrote:

Is this FIXED already?

I could create a user page
http://test.wikipedia.org/wiki/User:%E2%80%AEresu_ladnav_%E2%80%AD%E2%80%AC
but I could not create such an *account*.

Please see
http://mail.wikipedia.org/pipermail/mediawiki-cvs/2006-February/013973.html
User.php,1.212,1.213 by Brion
"Blocking some Unicode whitespace characters in usernames. Should check if some
or all should be blocked from all page titles."

A block list is equivalent to a whitelist.

It might be a good idea to give feedback on why the user name entered when creating a
new account is invalid / show which Unicode character was used.

For "transparency" of wiki configuration the list of blocked characters should
be displayed.

best regards reinhardt [[user:gangleri]]

robchur wrote:

(sigh)

Blocking != Whitelisting

The list of blocked characters is available if you look at the code and also the
relevant commit message in the mediawiki-cvs archives.

neil wrote:

Here's a good way of filtering names:

  1. first, do Nameprep
  2. only allow the use of characters specific to one particular writing system in
     the resulting string, and a few carefully selected non-alphabetic characters
     (such as space, apostrophe, and any others you want to add to the whitelist).

This is being used in IDN at the moment, and it's very successful at preventing
a very wide variety of potential abuses, such as mixed-script spoofing and the
use of exotic Unicode characters to break rendering engines.

I happen to have some nice compact table-driven C code for doing this: mail me
if you want it.

We should file the within-script character spoofing problem as a separate bug:
as stated above, this is easily dealt with by storing a normalized form of each
name alongside the real name, and checking that no normalized form is ever
duplicated: given this, the only problem is working out the ruleset for
normalizing these strings.
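
The two filtering steps listed in this comment can be sketched in Python as below. This is a rough approximation under stated assumptions, not the C code mentioned above and not what IDN implementations do: NFKC plus case folding stands in for Nameprep, the script of a letter is guessed from the first word of its Unicode name (the standard library has no script property), and `ALLOWED_EXTRA`, `rough_script` and `is_acceptable` are hypothetical names.

```python
import unicodedata

# Hypothetical whitelist of non-alphabetic characters (space and apostrophe).
ALLOWED_EXTRA = {" ", "'"}


def rough_script(ch: str) -> str:
    """Guess a script from the first word of the Unicode name (a heuristic)."""
    return unicodedata.name(ch, "UNKNOWN").split(" ")[0]


def is_acceptable(name: str) -> bool:
    # Step 1: stand-in for Nameprep (assumption: NFKC plus case folding).
    prepared = unicodedata.normalize("NFKC", name).casefold()
    scripts = set()
    for ch in prepared:
        if ch in ALLOWED_EXTRA:
            continue
        category = unicodedata.category(ch)
        if category.startswith("L"):
            scripts.add(rough_script(ch))  # letters determine the script
        elif category[0] in ("M", "N"):
            continue  # combining marks and digits are tolerated here
        else:
            return False  # controls, format characters, symbols, etc.
    # Step 2: all letters must come from a single writing system.
    return len(scripts) <= 1
```

A name mixing Latin and Cyrillic letters is rejected by step 2, while a directional override character such as U+202E is rejected in the loop because it is neither a letter, a mark, a digit, nor on the whitelist.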

*** Bug 7463 has been marked as a duplicate of this bug. ***

dodgy wrote:

I emailed Neil and he told me that there is a MediaWiki extension out to block Unicode in usernames.
Can anyone confirm or deny this?

We will never "block out Unicode" as that doesn't make sense.
*Every* username is Unicode, with *no exceptions*.

What we will do is enforce restrictions on some characters
and mixed-script names. Please see the code in AntiSpoof extension.

dodgy wrote:

I downloaded the files, and AntiSpoof has no docs or explanations that I could find on mediawiki,
Google, or in the code. I had to read through the code of the six files to determine which one to include.

First, is AntiSpoof still in testing and not working correctly yet?

Also, is patch-antispoof.sql.txt needed, or does some SQL work need to be done before using
AntiSpoof?

As for its log file, is that saved to something like debug.log, stored only in MySQL, or
viewable in MediaWiki itself?

This bug entry is not a discussion forum. If you want to ask general
questions about how to operate software, please do it separately.

Done reasonably with AntiSpoof