Augment our AntiSpoof normalization data with Unicode/CLDR data
Open, Medium, Public

Description

Background

Our list is "based on one by Neil Harris, which was derived by unknown methods". The Unicode Consortium has their own list, which is much less fuzzy than ours, but includes a larger set of characters. We may want to augment our list with theirs or use their list as another layer of normalization.


Requested deliverables

  • Decide on & implement a system that augments our AntiSpoof normalization data with Unicode Consortium data
  • The solution should fail gracefully if unicode.org changes their API or contents.

See Also: T65216: Accept CAPTCHA responses with diacritics removed

Details

Reference
bz63217

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 2:53 AM
bzimport added projects: AntiSpoof, I18n.
bzimport set Reference to bz63217.
bzimport added a subscriber: Unknown Object (MLST).

I've added bug 63242 as a dependency: it seems that the standard ICU API can easily solve a concrete problem (in AbuseFilter) that has been intractable for years.

Perhaps the old and new data sources can coexist for a while, with the new ones used first for user-invisible parts like username creation and for new functions/interfaces like the one proposed in bug 63242. When we're confident enough about the data quality (possibly after feeding CLDR some of our own), and/or the old interfaces are used less, we can consider dropping the custom data sources.

(In reply to Nemo from comment #0)

Our list is "based on one by Neil Harris, which was derived by unknown methods".

At some point it will get easier to rely on CLDR, probably via the cldr MediaWiki extension.

It doesn't contain zh-hans / zh-hant pairs, which are included in the current AntiSpoof equivsets.

(In reply to Liangent from comment #4)

It doesn't contain zh-hans / zh-hant pairs, which are included in the current AntiSpoof equivsets.

Can you file a CLDR bug then please?

(In reply to Nemo from comment #5)

Can you file a CLDR bug then please?

I can't find the CLDR bug tracker for confusable data...?

Forgot to say -- there are also [[Variant Chinese character]]s, which create more confusion than simple traditional / simplified Chinese differences.

I haven't looked at this in depth, but it seems the best solution here would be to surface some of the functions from the PHP Spoofchecker class within AntiSpoof, perhaps with some overrides for edge cases like zh-hans / zh-hant pairs. In other words, provide methods like: AntiSpoof::areConfusable(), AntiSpoof::isSuspicious(), AntiSpoof::setChecks(), etc. (and eventually surface similar functions within AbuseFilter as well).

It looks like the Spoofchecker / Unicode confusables data only handles very close similarities. For example, it doesn't consider any of the following combinations confusable:

  • 5 -> S
  • £ -> L
  • ß -> B
  • $ -> S
  • ¢ -> c

So this is definitely not an adequate replacement for our existing tables. It could, however, be used to augment our tables or act as an additional layer of normalization. Interestingly, the format of the Unicode data is very similar to the format of our normalization data (equivset.in).
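The pairs above could be covered by a small custom equivalence table layered in front of any Unicode-based check. A minimal sketch in Python (the table and function names are hypothetical; the real data lives in equivset.in and the extension itself is PHP):

```python
# Illustrative only: a custom fold covering pairs that the Unicode
# confusables data does not treat as confusable (per the list above).
CUSTOM_EQUIV = {
    "5": "S",
    "£": "L",
    "ß": "B",
    "$": "S",
    "¢": "c",
}

def custom_fold(text: str) -> str:
    """Replace each character via the custom table, leaving others alone."""
    return "".join(CUSTOM_EQUIV.get(ch, ch) for ch in text)
```

Running a fold like this before a Unicode-based comparison would make, e.g., "£5" and "LS" collide, which Spoofchecker alone would miss.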

kaldari renamed this task from Consider using Unicode/CLDR data instead of custom tables to Augmenting our normalization data with Unicode/CLDR data. Jul 3 2017, 10:35 PM
kaldari updated the task description.

One option: implement our own wrapper function around Spoofchecker::areConfusable(), maybe called areSimilar() or something, but augment it with our own (new) list of confusable characters that aren't in the CLDR list (See T65217#3402681 for example). Also, make sure we are setting Spoofchecker::setChecks( Spoofchecker::ANY_CASE ), so that it is case insensitive. This would be the simplest thing to implement, but I'm not sure how useful it would actually be since most of the AntiSpoof comparisons are done with normalizeString(), which doesn't have an equivalent in Spoofchecker.
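To make the wrapper idea concrete, here is a hedged Python sketch of what an areSimilar() might do. PHP's Spoofchecker is not available here, so a toy skeleton map stands in for the CLDR confusables data, and ANY_CASE is approximated with casefold(); every name and mapping below is illustrative, not the actual implementation:

```python
# Stand-in for mappings derived from Unicode/CLDR confusables data.
CLDR_SKELETON = {"\u0435": "e", "\u03bf": "o"}  # Cyrillic е, Greek ο → Latin

# Custom additions absent from the CLDR list (see earlier examples).
CUSTOM_SKELETON = {"£": "l", "$": "s"}

def skeleton(text: str) -> str:
    """Case-insensitive confusable skeleton; custom entries win over CLDR."""
    merged = {**CLDR_SKELETON, **CUSTOM_SKELETON}
    return "".join(merged.get(ch, ch) for ch in text.casefold())

def are_similar(a: str, b: str) -> bool:
    """Rough analogue of the proposed AntiSpoof::areSimilar() wrapper."""
    return skeleton(a) == skeleton(b)
```

The real version would call Spoofchecker::areConfusable() first and only fall back to the custom table for the pairs the CLDR data misses.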

Another option: Start mass importing useful mappings from http://www.unicode.org/Public/security/latest/confusables.txt to equivset.in. By "useful" I mean stuff like:

1D47A ;	0053 ;	MA	# ( 𝑺 → S ) MATHEMATICAL BOLD ITALIC CAPITAL S → LATIN CAPITAL LETTER S	#

but not:

1F319 ;	263D ;	MA	#* ( 🌙 → ☽ ) CRESCENT MOON → FIRST QUARTER MOON	#
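A first-pass importer could filter the file along exactly these lines. A minimal, hypothetical Python sketch that keeps MA entries and skips the '#*'-flagged ones (the asterisk appears to mark entries involving non-identifier characters, like the CRESCENT MOON line above):

```python
def parse_confusable(line: str):
    """Parse one confusables.txt data line into (source, target) strings.

    Returns None for comments, blank lines, and entries whose comment is
    flagged '#*' (non-identifier characters). A sketch only; a real
    importer would need more validation.
    """
    body, _, comment = line.partition("#")
    if not body.strip() or comment.startswith("*"):
        return None
    fields = [f.strip() for f in body.split(";")]
    if len(fields) < 3 or fields[2] != "MA":
        return None
    # Each field is a space-separated list of hex code points.
    source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
    target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
    return source, target
```

Applied to the two lines above, the MATHEMATICAL BOLD ITALIC CAPITAL S entry yields a usable ("𝑺", "S") mapping, while the CRESCENT MOON entry is dropped.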
TBolliger renamed this task from Augmenting our normalization data with Unicode/CLDR data to Augment our AntiSpoof normalization data with Unicode/CLDR data. Sep 6 2017, 10:39 PM
TBolliger updated the task description.

@Huji, @MusikAnimal: Do you have any opinions on which of the options mentioned above would be most useful to AbuseFilter maintainers?

I think mass importing Unicode's confusables is the best option @kaldari

Reedy added a subscriber: Gaon12.

I think mass importing Unicode's confusables is the best option @kaldari

That was declined; see T246353#6533626.