Disallow usernames that are too similar to existing names (confusables, impersonation)
Closed, ResolvedPublic

Assigned To
None
Authored By
bzimport
Jun 2 2005, 1:24 PM
Referenced Files
F2094: secondpass.py
Nov 21 2014, 8:33 PM
F2092: extra_confusables.txt
Nov 21 2014, 8:33 PM
F2093: confpairs.txt
Nov 21 2014, 8:33 PM
F2091: antispoof.py
Nov 21 2014, 8:33 PM
F2090: script_data.txt
Nov 21 2014, 8:33 PM

Description

Author: usenet

Description:
A more and more common form of abuse consists of vandals and trolls registering
new accounts that "look like" other users' accounts, by using characters that
look like other characters. For example, "l" may be used instead of "I", or an
acute-accented 'i' used instead of an ordinary one. These accounts can cause no
end of trouble by being used to conceal other kinds of mischief, or to get the
impersonated user into trouble. It is very difficult to tell these apart without
detailed inspection, and the software at present has no idea of visual
similarity between usernames.

Proposed solution:

Keep a homograph character table, and for each new username, canonicalize it by
applying the homograph table to it. Then compare this canonicalized version of
the name with a pre-existing list of canonicalized usernames, and block it if it
occurs in that list. In this way, registering a username will block the
registration of other "confusingly similar" usernames.
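
The proposal above could be sketched roughly as follows. This is only an illustration: the HOMOGRAPHS table, the canonicalize function and the Registry class are made-up names, and the table holds just a few entries instead of a real confusables list.

```python
import unicodedata

# Hypothetical, deliberately tiny homograph table; a real one would be
# generated from a full confusables list.
HOMOGRAPHS = {
    "l": "I",  # lowercase L looks like uppercase i
    "1": "I",
    "0": "O",
}


def canonicalize(name: str) -> str:
    """Fold a username to a canonical form for similarity checks."""
    # Decompose accented letters and drop combining marks, so that an
    # acute-accented "i" folds to an ordinary one.
    decomposed = unicodedata.normalize("NFD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace each character by its table entry, if it has one.
    return "".join(HOMOGRAPHS.get(ch, ch) for ch in stripped)


class Registry:
    """In-memory stand-in for the list of canonicalized usernames."""

    def __init__(self):
        self._taken = {}  # canonical form -> name that registered it

    def register(self, name: str) -> bool:
        key = canonicalize(name)
        if key in self._taken:
            return False  # confusingly similar to an existing name
        self._taken[key] = name
        return True
```

With this sketch, registering "Ilya" first would cause a later "llya" (two lowercase Ls) to be refused, because both fold to the same canonical string.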

The good news is that the heavy lifting for this work has already been
performed as part of trying to close the same spoofing hole for
internationalized domain names, and homograph lists have already been compiled
as part of this work. E-mail me if you want me to dig out the lists; I don't
have links to them to hand on this machine.


Version: unspecified
Severity: enhancement

Details

Reference
bz2290

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 8:33 PM
bzimport set Reference to bz2290.

usenet wrote:

See the references towards the end of http://unicode.org/reports/tr36/ for a
very simple example of confusables data file; but I know that much more complete
ones have been compiled elsewhere...

usenet wrote:

Here is the URL for the very nicely compiled multilingual confusables file, in
what I hope is a sufficiently self-documenting format:

http://unicode.org/reports/tr36/draft/confusables.txt

Presumably the "official" TR36 file, and any updates, will also be in a similar
format.

lowzl wrote:

During a vandal attack on a MediaWiki installation I run, the vandal used
Cyrillic lookalikes to impersonate an administrator. No amount of visual
scrutiny would have revealed anything, since typically Cyrillic glyphs are
copied from the Latin lookalikes. Fortunately this is also covered in the
confusables table.

zigger wrote:

*** Bug 3313 has been marked as a duplicate of this bug. ***

I just want to add that many Cyrillic letters look the same as letters in Latin
script, so confusion is possible. The letters are "A B C E H J K M O P T X a c
e j o p x" as opposed to "%D0%90 %D0%92 %D0%A1 %D0%95 %D0%9D %D0%88 %D0%9A
%D0%9C %D0%9E %D0%A0 %D0%A2 %D0%A5 %D0%90 %D1%81 %D0%B5 %D1%98 %D0%BE %D1%80
%D1%85" (as shown in the nav-bar). They are all the same, except for one pair,
which is extremely similar.
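
For anyone who wants to see which characters those percent-encoded sequences stand for, here is a small sketch that decodes them with the standard library and looks up their Unicode names. The two lists are copied from the comment above; decode_pairs is just an illustrative helper name.

```python
import unicodedata
from urllib.parse import unquote

# Both lists are copied verbatim from the comment above.
LATIN = "A B C E H J K M O P T X a c e j o p x".split()
ENCODED = (
    "%D0%90 %D0%92 %D0%A1 %D0%95 %D0%9D %D0%88 %D0%9A %D0%9C %D0%9E "
    "%D0%A0 %D0%A2 %D0%A5 %D0%90 %D1%81 %D0%B5 %D1%98 %D0%BE %D1%80 "
    "%D1%85"
).split()


def decode_pairs():
    """Return (latin, cyrillic, code point, Unicode name) for each pair."""
    pairs = []
    for latin, enc in zip(LATIN, ENCODED):
        cyr = unquote(enc)  # percent-decoding assumes UTF-8 by default
        pairs.append((latin, cyr, f"U+{ord(cyr):04X}", unicodedata.name(cyr)))
    return pairs


if __name__ == "__main__":
    for latin, cyr, code, name in decode_pairs():
        print(latin, cyr, code, name)
```

Note that, as written in the comment, the sequence paired with lowercase "a" is %D0%90, which decodes to the capital letter U+0410; the lowercase Cyrillic letter is U+0430 (%D0%B0).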

avarab wrote:

*** This bug has been marked as a duplicate of 1524 ***

ayg wrote:

People are/were discussing this at bug 1524, but this remains a separate issue.
It took me forever to find this by searching, since it was closed.

ayg wrote:

*** Bug 3982 has been marked as a duplicate of this bug. ***

neil wrote:

Python code for filtering usernames

Here's some Python code to canonicalize user names to reject most spoofing
attacks. The program also returns an error status if the username is malformed,
for example by containing non-script characters, or mixing two incompatible
scripts.
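
As a rough illustration of the mixed-script check (not the attached code), the sketch below guesses a character's script from the first word of its Unicode character name, because the Python standard library does not expose the Unicode Script property. The function names are invented for this example, and it flags any mix of scripts, without modelling which combinations are compatible.

```python
import unicodedata


def rough_script(ch: str) -> str:
    """Approximate a character's script by the first word of its name."""
    # e.g. "LATIN SMALL LETTER A" -> "LATIN". This is only a stand-in for
    # the real Unicode Script property.
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def scripts_used(name: str) -> set:
    """Collect the approximate scripts of the letters in a username."""
    # Digits, spaces and punctuation are ignored; only letters count.
    return {rough_script(ch) for ch in name if ch.isalpha()}


def is_mixed_script(name: str) -> bool:
    """True if the letters of the name come from more than one script."""
    return len(scripts_used(name)) > 1
```

An all-Latin or all-Cyrillic name passes this check, while a name that swaps a single Latin letter for its Cyrillic lookalike is reported as mixed.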

The general idea is to keep a canonicalized version of each username in another
table, and, when registering a new username, look up the canonicalized username
to see if it is already registered. If it is, the user should be told that
their username is too similar to an existing username, and prompted to try
again.

For example:

"SOME USERNAME" canonicalizes to v1:50MEU5EMAME (the v1: is a version tag, in
case the canonicalization code ever changes). The same canonical string will be
generated for "some username", "SOME USERNAME!!!!!", "S0ME U5ERNAME", and so
on...
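
To make the example concrete, here is a toy function that reproduces those four results. The folding rules in it (uppercasing, dropping non-alphanumerics, RN to M, S to 5, O to 0) are guesses inferred only from the example strings above; they are not taken from the attached antispoof.py, and canonical_key is an invented name.

```python
VERSION_TAG = "v1"  # bumped whenever the canonicalization rules change

# Hypothetical single-character folds, chosen only so that the example
# strings in the comment come out right.
_FOLD = str.maketrans({"S": "5", "O": "0"})


def canonical_key(name: str) -> str:
    """Return a version-tagged canonical form of a username."""
    upper = name.upper()
    # Drop spaces and punctuation.
    kept = "".join(ch for ch in upper if ch.isalnum())
    # Treat the two-letter sequence "RN" as a lookalike of "M" (assumed).
    merged = kept.replace("RN", "M")
    return f"{VERSION_TAG}:{merged.translate(_FOLD)}"
```

Keeping the tag inside the stored string means that keys produced by an older rule set can be recognized and regenerated if the rules ever change.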

I can easily add other filters, so that, for example, "Some Username5"
canonicalizes to the same string as "Some Username 4", and "Bad, bad user"
would canonicalize to the same string as "Bad, bad, bad user".

This version of the code is a bit aggressive, as it assumes that labels can be
in any one script, so E, H, and N are currently considered equivalent because
of the need for transitivity between different cases of different scripts: if
usernames can be restricted to a small subset of possible scripts, some of the
more aggressive canonicalization can be relaxed, and E, H, and N can again be
distinguished.

Preliminary testing shows that this code appears to have a false-positive rate
of under 1% on random plausible names, which is probably acceptable.

attachment antispoof.py ignored as obsolete

neil wrote:

Oh, and I should mention, just in case you're not reading the code, that it
works on a vast number of scripts.

neil wrote:

Python code for filtering usernames

Murphy's law in action: the example I gave in the attachment comment is an edge
case that didn't get tested properly: now fixed.

attachment antispoof.py ignored as obsolete

neil wrote:

Experimental language-code-to-script-code mapping

This file attempts to map languages to sets of possible scripts. Where a
language can be written in more than one script, all known script codes are
included.

Where an example character does not have a script code, it is output as U+XXXX.

attachment lang_to_script.txt ignored as obsolete

neil wrote:

Experimental language-code-to-script-code mapping;

Now with 79 more script repertoires, based on analyzing the wikipedia.org front
page

Attached:

neil wrote:

Python code for filtering usernames, v0.3

Now uses stdin/stdout for input and output, thus allowing for batch conversion
and freeing the command line up for later addition of option flags.

attachment antispoof.py ignored as obsolete

neil wrote:

Python code for filtering usernames, v0.4

Now with exception handling, just in case of nasty attacks (eg. BiDi
violations) intended to blow up the low-level Unicode-processing code.

Attached:

I've translated Neil's code to PHP, committed in r16555.

Can build an extension around that to check on account creation.

Currently there are some lazy and inefficient bits; it runs about 30% slower than the Python
version on the set of usernames from meta.wikimedia.org, but that's plenty fast for the individual
checking, a smidge under a millisecond per name on a 2 GHz G5. (Live check will just be a single
name munging and a DB lookup.)
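
The "single name munging and a DB lookup" step might look something like the sketch below, written in Python with sqlite3 purely for illustration (the committed code is PHP). The table name spoof, its columns, and the placeholder canonicalize function are all assumptions.

```python
import sqlite3


def canonicalize(name: str) -> str:
    # Placeholder munging; the real step would apply the full
    # confusables-based canonicalization.
    return name.casefold().replace("l", "i").replace("1", "i")


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per registered name, keyed by its
    # canonical form so that the check is a single indexed lookup.
    db.execute(
        "CREATE TABLE spoof (canonical TEXT PRIMARY KEY, username TEXT NOT NULL)"
    )
    return db


def find_conflict(db: sqlite3.Connection, name: str):
    """Return the existing username that this name collides with, or None."""
    row = db.execute(
        "SELECT username FROM spoof WHERE canonical = ?", (canonicalize(name),)
    ).fetchone()
    return row[0] if row else None


def register(db: sqlite3.Connection, name: str) -> bool:
    if find_conflict(db, name) is not None:
        return False
    db.execute(
        "INSERT INTO spoof (canonical, username) VALUES (?, ?)",
        (canonicalize(name), name),
    )
    return True
```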

neil wrote:

There are false positive problems with the existing code which need a more
careful second pass to check strings which match the initial checks. Code to
follow...

neil wrote:

New confusables equivalence sets file, generated from UTR#39 confusables.txt

Note: this file is encoded in UTF-8, and contains exotic characters, many of
which may display as spaces or not at all: beware!

This is a transitive closure of the single-character to single-character
mappings within UTR #39's confusables.txt file. Remember to normalize strings
before applying these mappings...
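
One way to compute such a transitive closure from a list of character pairs is a union-find structure, sketched below. This is not the attached script; the function names are invented, the choice of the smallest code point as each set's representative is arbitrary, and NFKC is only an assumed normalization form, since the comment does not say which one to use.

```python
import unicodedata
from collections import defaultdict


def equivalence_sets(pairs):
    """Group characters into sets under the transitive closure of pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # merge the two sets

    groups = defaultdict(set)
    for ch in list(parent):
        groups[find(ch)].add(ch)
    return list(groups.values())


def build_mapping(pairs):
    """Map every character to one representative of its equivalence set."""
    mapping = {}
    for group in equivalence_sets(pairs):
        rep = min(group)  # smallest code point; an arbitrary choice
        for ch in group:
            mapping[ch] = rep
    return mapping


def canonicalize(name, mapping):
    # Normalize first (NFKC is an assumption), then apply the mapping.
    normalized = unicodedata.normalize("NFKC", name)
    return "".join(mapping.get(ch, ch) for ch in normalized)
```

For instance, given pairs linking Latin "o" to Cyrillic "о" and Cyrillic "о" to Greek "ο", all three end up in one set even though the Latin and Greek letters were never paired directly.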

attachment confpairs.txt ignored as obsolete

neil wrote:

Some extra confusables (UTF-8 format text file)

Some extra confusables that are not in UTR#39, spotted by eye.

attachment extra_confusables.txt ignored as obsolete

neil wrote:

Some extra confusables, v2 (UTF-8 format text file)

A second version of the above...

Attached:

neil wrote:

New confusables equivalence sets file v2, generated from UTR#39 confusables.txt + extras

Note: this file is encoded in UTF-8, and contains exotic characters, many of
which may display as spaces or not at all: beware!

This is a transitive closure of the single-character to single-character
mappings within UTR #39's confusables.txt file, combined with my
extra_confusables.txt file. Remember to normalize strings
before applying these mappings...

Attached:

neil wrote:

Note: some letterforms are confusable with more than one other letterform, but
these other letterforms are not confusable with each other. This should be taken
into account in later, more sophisticated, versions of this code.
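
A possible shape for such a later version is a pairwise comparison that does not take the transitive closure at all. The sketch below is only an illustration with a made-up pair table: "I" and "l" are listed as confusable, and so are "l" and "1", but "I" and "1" are not.

```python
def make_pair_table(pairs):
    """Store each confusable pair in both directions."""
    table = set()
    for a, b in pairs:
        table.add((a, b))
        table.add((b, a))
    return table


def chars_confusable(a, b, table):
    return a == b or (a, b) in table


def names_confusable(x, y, table):
    """Compare two names position by position, without any closure."""
    return len(x) == len(y) and all(
        chars_confusable(a, b, table) for a, b in zip(x, y)
    )
```

With that table, "Iar" conflicts with "lar" and "lar" conflicts with "1ar", but "Iar" and "1ar" do not conflict with each other. The cost is that a new name can no longer be checked with a single canonical-key lookup; it has to be compared against candidate names.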

neil wrote:

Python code for creating equivalence sets of characters

Attached:

robchur wrote:

This was poked, prodded, converted and ported into the AntiSpoof extension,
available in Subversion.

lar wrote:

Not sure if this comment belongs against this bug, but I have userid "Lar" on
many WMF wikis. I recently started having trouble registering this userid on new
wikis as a conflict with user "Iar"... based on discussion on MediaWiki-General it was
suggested that this is because the software sees uppercase I and lowercase L as
similar, and that's tripping me up. I'm not sure how to get around that best,
but it's a nuisance to have to contact each wiki admin separately. See Neil
Harris's comment of 11-14 01:29 which perhaps alludes to this... presumably
once WMF wikis have SUL this goes away?

ayg wrote:

It would no longer be a problem for existing users, but it would still be a
problem for people signing up for a WMF account for the first time, so it's
still undesirable.

See bug 8257.

a.koppad wrote:

I have a question here: in different languages, can the same characters be identified by different names? Does the Python code take care of this?

Can this thread be closed?

Anu: This report/"thread" was already closed as RESOLVED FIXED six years ago, and MediaWiki does not use Python code here.
Please refrain from commenting on this ticket - thanks. :)