Background
Our list is "based on one by Neil Harris, which was derived by unknown methods". The Unicode Consortium maintains its own list, which is much less fuzzy than ours but covers a larger set of characters. We may want to augment our list with theirs, or use their list as an additional layer of normalization.
Documents:
- http://www.unicode.org/reports/tr36/#visual_spoofing
- http://www.unicode.org/reports/tr39/#Confusable_Detection
Toy:
Data:
- http://www.unicode.org/Public/security/latest/confusablesSummary.txt
- http://www.unicode.org/Public/security/latest/confusables.txt
Requested deliverables
- Decide on & implement a system that augments our AntiSpoof normalization data with Unicode Consortium data
- The solution should fail gracefully if unicode.org changes the location, format, or contents of these files.
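
As a rough illustration of the two deliverables, here is a minimal Python sketch (not AntiSpoof code) that parses text in the `confusables.txt` layout (`source ; target ; type # comment`, with code points in hex) and merges it into an existing mapping. The function names `parse_confusables` and `merge_with_fallback`, the `min_entries` threshold, and the choice to let our own entries win on conflict are all assumptions made for this example. Fetching the file is deliberately left out; the sketch only covers what happens once some text has (or has not) been retrieved.

```python
def parse_confusables(text):
    """Parse confusables.txt-style text into {source_char: target_string}.

    Raises ValueError if a data line does not match the expected layout.
    """
    mapping = {}
    # The published file starts with a byte order mark; drop it if present.
    for raw in text.lstrip("\ufeff").splitlines():
        # Everything after '#' is a comment.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 2:
            raise ValueError("unexpected line: %r" % raw)
        source = fields[0].split()
        target = fields[1].split()
        if len(source) != 1 or not target:
            raise ValueError("unexpected code points: %r" % raw)
        # int(..., 16) and chr() raise ValueError on malformed input.
        key = chr(int(source[0], 16))
        mapping[key] = "".join(chr(int(cp, 16)) for cp in target)
    return mapping


def merge_with_fallback(local, fetched_text, min_entries=1):
    """Augment `local` with Unicode data, or return `local` unchanged.

    Hypothetical policy: if the fetched text cannot be parsed, or yields
    fewer than `min_entries` mappings, keep using our own data only.
    """
    try:
        remote = parse_confusables(fetched_text)
    except (ValueError, OverflowError):
        return dict(local)
    if len(remote) < min_entries:
        return dict(local)
    merged = dict(remote)
    merged.update(local)  # assumption: our existing entries take precedence
    return merged
```

In practice `min_entries` would probably be set to some large fraction of the last known-good entry count, so that a truncated or emptied file is rejected rather than silently shrinking the equivalence set.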
See Also: T65216: Accept CAPTCHA responses with diacritics removed