Add more characters to ccnorm
Closed, ResolvedPublic

Description

Currently only some characters are normalized to a "canonical" form. For example, although ccnorm("α") results in "A", ccnorm("ά") doesn't change anything.

The function should support the conversion of more characters.
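To illustrate the gap described above, here is a minimal Python sketch of a lookup-table normalizer. The table `EQUIV` and the function `normalize` are hypothetical stand-ins and not the actual ccnorm implementation; only the α → A behavior is taken from this description.

```python
# Hypothetical, tiny equivalence table; the real table is far larger.
EQUIV = {
    "a": "A",
    "\u03b1": "A",  # α GREEK SMALL LETTER ALPHA
}


def normalize(text: str, table: dict[str, str] = EQUIV) -> str:
    """Replace each character by its canonical form; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)


# Adding one entry is enough to cover the accented variant.
EXTENDED = {**EQUIV, "\u03ac": "A"}  # ά GREEK SMALL LETTER ALPHA WITH TONOS
```

With `EQUIV`, `normalize("ά")` returns the input unchanged, which mirrors the current behavior; with `EXTENDED` it returns `"A"`.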

The following list is based on what was available at en:MediaWiki:Titleblacklist, but maybe it is better to have different sets of characters depending on the case of the letter. For example, for D, but for d.

a: aαąăãàāάạậảấầẩắằẵẳẫặḁǟǡȁᾳὰᾀἁᾁἄᾄἂᾂἆᾆἅᾅἃᾃἇᾇáâäæåǻ٩4
b: bßβбв฿
c: cċĉ¢сćĉçč
d: dďḍðⅆ
e: éèëeęěĕėẻẹếềễểȨȩḝēḗȅȇệḙḛ3عڠẽə
f: fғ₣
g: gĝģğġɠǥǧǵḡԌ
h: hήĥħȞʰʱḣḥḧḩḫнңӈӉηἠἡἢἣἤἥἦἧὴᾐћⱧԋњһ
i: iìíîïĩļǐīĭḷŀιїɨ!łľį
k: kķкќқҝҡҟӄ
l: l₤ĺľḷłŀλлљ
m: mɯḿṁṃмӍμ₥
n: n₦ńñņňṇν
o: oóòôöõǒōŏǫőœøəόοωὸὀὁὄὂὅὃоөӧӫδσʘǿọ
p: pƥṕṗǷ₧þρр
q: qɊʠ
r: rŕŗřȑȓƦʳʴʵʶṙṛṝṟя®
s: s$śŝşšṣσѕ
t: tţťṭτтŧ
u: uúùûüũůǔūǖǘǚǜŭųű
w: wŵẁẃẅẇẉ₩
x: xҳχ
y: yýÿŷƴȲʸẏỳỵỷỹʊύυϋὑὓὕὗὺῠῡуϓ
z: zźžż

Event Timeline

Both patches are still open. The first one got some reviews and now looks like it is waiting for a new upload from Kaldari. The second one, with the simpler version, got no reviews at all.

(In reply to Ryan Kaldari from comment #15)

Since I haven't had any luck getting code review on https://gerrit.wikimedia.org/r/92154 I submitted https://gerrit.wikimedia.org/r/97304 as a simpler version. It only adds ! and $ and nothing else.

I'm not sure whether sending a request to wikitech-l would help get reviews for these two patches, but pinging on the patches and here doesn't seem to be enough... Any ideas?

Change 184850 had a related patch set uploaded (by Whym):
Add Japanese normalization pairs

https://gerrit.wikimedia.org/r/184850

Patch-For-Review

I also see that the "editable" sets on mediawiki.org have some more changes that are included in none of the gerrit changes mentioned here.

EDIT: added link

Change 184850 merged by jenkins-bot:
Add Japanese normalization pairs

https://gerrit.wikimedia.org/r/184850

Change 311310 had a related patch set uploaded (by MusikAnimal):
Adding missing equivalents for I, L, O, and S.

https://gerrit.wikimedia.org/r/311310

Change 311310 merged by jenkins-bot:
Adding missing equivalents for I, L, O, and S.

https://gerrit.wikimedia.org/r/311310

Looks like this ticket can be closed, correct? Is this released to production?

Saw two more O's from ntsamr spambot this does not handle: & ߋ (the latter is hard to copy, but it's unicode 0x07cb)

Also 1 and I are in the same equivalence set but l is not. Those two sets should probably be merged.

If you take a look at the code I maintain on this edit filter on en-wiki, you'll find many more variants of each letter of the alphabet that aren't listed here and that should also be added to the ccnorm function. We definitely need to expand this function and get the additional variants added. I have LTA users who are using different font text in their usernames and abusive edits in order to bypass edit filters. Keeping the filter I created up to date is becoming a long and daunting task, and it's a cat-and-mouse game in which we'll always be one step behind if we don't do this...

This webpage is what the LTA user is using to quickly generate text in different fonts and use it to create accounts, get around edit filters, and make abusive edits to pages...

In T27619#3970962, @Tgr wrote:

Also 1 and I are in the same equivalence set but l is not. Those two sets should probably be merged.

This is a result of the 2 competing use cases of Equivset:

  • To prevent spoofing of usernames (AntiSpoof)
  • To create "bad word" filters in AbuseFilter

I and L being in the same equivalence set makes sense for AntiSpoof, but not for AbuseFilter (as it would make construction of the filters unintuitive). The ultimate solution to this problem is probably to adopt confusables.txt (or a derivative) for AntiSpoof, and tailor Equivset for AbuseFilter.
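A small Python sketch may make the trade-off concrete. Both tables below are hypothetical and contain only a handful of entries; they are not the real Equivset data, and `normalize` is an illustrative helper rather than an AntiSpoof or AbuseFilter API.

```python
# Hypothetical tables: the only difference is where "L" and "l" end up.
ANTISPOOF_STYLE = {
    "B": "B", "b": "B",
    "I": "I", "i": "I", "1": "I",
    "L": "I", "l": "I",  # merged with I because the glyphs look alike
}
ABUSEFILTER_STYLE = {**ANTISPOOF_STYLE, "L": "L", "l": "L"}


def normalize(text: str, table: dict[str, str]) -> str:
    """Map every character to its set representative; unknown ones pass through."""
    return "".join(table.get(ch, ch) for ch in text)


# Spoofing check: "BiII" (two capital i's) collides with "Bill" when I and L are merged.
spoof_detected = normalize("Bill", ANTISPOOF_STYLE) == normalize("BiII", ANTISPOOF_STYLE)

# Filter authoring: with merged sets the author has to match "BIII" instead of "BILL".
merged_form = normalize("Bill", ANTISPOOF_STYLE)
separate_form = normalize("Bill", ABUSEFILTER_STYLE)
```

In this sketch the merged table catches the look-alike username, but a filter author would have to write the pattern as `BIII`; the separate table lets the author write `BILL` at the cost of missing the `BiII` variant.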

The characters "ª" (A), "º" (O) and "°" (O) should also be included.

Billinghurst added a subscriber: Billinghurst.

@Proc AbuseFilter leverages the character set from the AntiSpoof extension, so this task belongs to that project, not to AbuseFilter.

Due to recent abuse that can be seen for example at https://simple.wikipedia.org/wiki/Special:Contributions/46.134.191.203, please also add the mathematical alphanumeric characters (see https://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symbols for character tables).

You all should check out edit filters 51 and 53 on the English Wikipedia... I've been working to stay on top of abusive usernames and edits for years now, and I've compiled a large list of letters that ccnorm doesn't catch. I have appended these letters to the table of letters in the task description. See below:

a: aαąăãàāάạậảấầẩắằẵẳẫặḁǟǡȁᾳὰᾀἁᾁἄᾄἂᾂἆᾆἅᾅἃᾃἇᾇáâäæåǻ٩4AÅaà🇦Ꭿ4ㅂ月A͜͡𝓐𝓪𝒜𝒶𝔸𝕒Aa𝘈𝘢Ꭺαᴀ∀ɐᴬᵃₐ
b: bßβбв฿Bbß𝓑𝓫𝐵𝒷𝔹𝕓Bb𝘉𝘣ʙᙠqᴮᵇ
c: cċĉ¢сćĉçčCcㄷс𝓒𝓬𝒞𝒸ℂ𝕔Cc𝘊𝘤ᴄɔᶜ
d: dďḍðⅆDd🇩𝓓𝓭𝒟𝒹𝔻𝕕Dd𝘋𝘥შᗡᴅpᴰᵈ
e: éèëeęěĕėẻẹếềễểȨȩḝēḗȅȇệḙḛ3عڠẽəEe3🇪Ꮛㅌ€三𝓔𝓮𝐸𝑒𝔼𝕖Ee𝘌𝘦ƎǝᴇєӛᎬᴱᵉₑ
f: fғ₣Ff🇫ㅋ𝓕𝓯𝐹𝒻𝔽𝕗Ff𝘍𝘧ƒҒℲꜰɟᶠ
g: gĝģğġɠǥǧǵḡԌGg巨ġ𝓖𝓰𝒢𝑔𝔾𝕘Gg𝘎𝘨𝔤⅁ɢɓᴳᵍ
h: hήĥħȞʰʱḣḥḧḩḫнңӈӉηἠἡἢἣἤἥἦἧὴᾐћⱧԋњһHh𝓗𝓱𝐻𝒽ℍ𝕙Hh𝘏𝘩ʜɥᴴʰₕ🇭н
i: iìíîïĩļǐīĭḷŀιїɨ!łľįIi工1í🇮!𝘐𝘪ㅣ|𝕝l𝒾𝓁𝓘𝓲𝐼𝕀𝕚IiᏆɪıЇ𝘭ᴵⁱᵢ
j:Jj𝓙𝓳𝒥𝒿𝕁𝕛Jj𝘑𝘫ᴊɾᴶʲⱼ
k: kķкќқҝҡҟӄKkкㅈᏦ𝓚𝓴𝒦𝓀𝕂𝕜KkϏ𝘒𝘬⋊ᴋκʞᴷᵏₖ
l: l₤ĺľḷłŀλлљLlㄴㅣ|1𝕀𝕚𝒾𝓛𝓵𝐿𝓁𝕃𝕝Ll𝘓𝘭Ꮮ˥ʟí🇮!IiЇ𝘐𝘪ᴸˡₗ
m: mɯḿṁṃмӍμ₥Mmм𝓜𝓶𝑀𝓂𝕄𝕞Mm𝘔𝘮რლWᴍɯᴹᵐₘ
n: n₦ńñņňṇνNnпŊ冂Ꮑ𝓝𝓷𝒩𝓃ℕ𝕟Nn𝘕𝘯ɴῃИuᴺⁿₙ
o: oóòôöõǒōŏǫőœøəόοωὸὀὁὄὂὅὃоөӧӫδσʘǿọOo0ÒÔÓóQㅇøᎾ𝓞𝓸𝒪𝑜𝕆𝕠Oo𝘖οი𝘰ᴏσᴼᵒₒ
p: pƥṕṗǷ₧þρрPp尸𝓟𝓹𝒫𝓅ℙ𝕡Pp𝘗𝘱Ԁᴘρᴾᵖₚ
q: qɊʠQq𝓠𝓺𝒬𝓆ℚ𝕢Qqϙ𝐐𝐪𝘘𝘲
r: rŕŗřȑȓƦʳʴʵʶṙṛṝṟя®Rrг🇷ㄱᏒ民®ʁ𝓡𝓻𝑅𝓇ℝ𝕣Rr𝘙𝘳ᴚяʀɹᎡᴿʳᵣ
s: s$śŝşšṣσѕSsşŠš$Ꭶ𝓢𝓼𝒮𝓈𝕊𝕤Ss𝘚𝘴ꜱ🇸Տˢₛ
t: tţťṭτтŧTt七ㅜ𝓣𝓽𝒯𝓉𝕋𝕥ㄒTt𝘛𝘵⊥ᴛτʇᵀᵗₜ
u: uúùûüũůǔūǖǘǚǜŭųű[Uuü∪🇺니心𝓤𝓾𝒰𝓊𝕌𝕦Uu𝘜𝘶∩ᴜnʉɄᵁᵘᵤ
v:Vv𝓥𝓿𝒱𝓋𝕍𝕧Vv𝘝𝘷ᴠΛʌνⱽᵛᵥ
w: wŵẁẃẅẇẉ₩WwᏔ𝓦𝔀𝒲𝓌𝕎𝕨Wwᴡ𝘞𝘸ᴡMʍᵂʷᵚ
x: xҳχXx𝓧𝔁𝒳𝓍𝕏𝕩Xx𝘟𝘹ˣₓ
y: yýÿŷƴȲʸẏỳỵỷỹʊύυϋὑὓὕὗὺῠῡуϓYy𝓨𝔂𝒴𝓎𝕐𝕪Yy𝘠𝘺უყʏ⅄ʎʸ
z: zźžżZzㄹ리𝓩𝔃𝒵𝓏ℤ𝕫Zz𝘡𝘻ᴢ弓ᶻ

Some of the characters I added might repeat ones that are already in the table; I simply appended the letters that I've been keeping track of from the edit filters I maintain (51 and 53). This should prove helpful, and I really hope that the table I supply here is used when ccnorm is updated.

I find it quite weird that you often list only parts of quite specific character sets. For example, you have the regional indicator symbols for A D E F H I L R S U (🇦 🇩 🇪 🇫 🇭 🇮 🇮 🇷 🇸 🇺) but not for the rest of the alphabet (🇧 🇨 🇬 🇯 🇰 🇲 🇳 🇴 🇵 🇶 🇹 🇻 🇼 🇽 🇾 🇿). I also suggest adding other (circled/parenthesised) symbols from the Enclosed Alphanumerics and Enclosed Alphanumeric Supplement Unicode blocks, namely the following, where each row resembles an alphabet:

ⓐⓑⓒⓓⓔⓕⓖⓗⓘⓙⓚⓛⓜⓝⓞⓟⓠⓡⓢⓣⓤⓥⓦⓧⓨⓩ
ⒶⒷⒸⒹⒺⒻⒼⒽⒾⒿⓀⓁⓂⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏ //first one is in fact already covered for some reason
⒜⒝⒞⒟⒠⒡⒢⒣⒤⒥⒦⒧⒨⒩⒪⒫⒬⒭⒮⒯⒰⒱⒲⒳⒴⒵
🄐🄑🄒🄓🄔🄕🄖🄗🄘🄙🄚🄛🄜🄝🄞🄟🄠🄡🄢🄣🄤🄥🄦🄧🄨🄩
🄰🄱🄲🄳🄴🄵🄶🄷🄸🄹🄺🄻🄼🄽🄾🄿🅀🅁🅂🅃🅄🅅🅆🅇🅈🅉
🅐🅑🅒🅓🅔🅕🅖🅗🅘🅙🅚🅛🅜🅝🅞🅟🅠🅡🅢🅣🅤🅥🅦🅧🅨🅩
🅰🅱🅲🅳🅴🅵🅶🅷🅸🅹🅺🅻🅼🅽🅾🅿🆀🆁🆂🆃🆄🆅🆆🆇🆈🆉

From the same blocks, we should also add 🄪 for S, 🄫 for C, 🄬 for R, 🆥 for D, and 🆭 for M. Adding the enclosed digits,

⓪①②③④⑤⑥⑦⑧⑨
⑴⑵⑶⑷⑸⑹⑺⑻⑼ //no 0
⓵⓶⓷⓸⓹⓺⓻⓼⓽ //no 0
⓿❶❷❸❹❺❻❼❽❾
🄋➀➁➂➃➄➅➆➇➈
🄌➊➋➌➍➎➏➐➑➒

might also be helpful.
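Because each of the alphabet rows above is a contiguous run of 26 code points, the corresponding pairs can be generated rather than typed by hand. The following Python sketch assumes that approach; the names `ENCLOSED_STARTS`, `enclosed_letter_map` and `names_agree` are made up for illustration, and the output is a plain dictionary rather than whatever format the real equivalence data uses.

```python
import unicodedata

# First code point of each 26-letter run, in A..Z (or a..z) order.
ENCLOSED_STARTS = [
    0x249C,   # parenthesized small letters
    0x24B6,   # circled capital letters
    0x24D0,   # circled small letters
    0x1F110,  # parenthesized capital letters
    0x1F130,  # squared capital letters
    0x1F150,  # negative circled capital letters
    0x1F170,  # negative squared capital letters
]


def enclosed_letter_map() -> dict[str, str]:
    """Map every enclosed letter in the runs above to its plain capital letter."""
    mapping = {}
    for start in ENCLOSED_STARTS:
        for offset in range(26):
            mapping[chr(start + offset)] = chr(ord("A") + offset)
    return mapping


def names_agree(mapping: dict[str, str]) -> bool:
    """Sanity check: each Unicode character name must end in 'LETTER <target>'."""
    return all(
        unicodedata.name(src).endswith("LETTER " + dst)
        for src, dst in mapping.items()
    )
```

The `names_agree` check guards against an off-by-one in a start value, since the official character names already spell out the target letter.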

Inserting the lines from your list at Special:AbuseFilter/tools in ccnorm() shows that the majority of the symbols are already implemented, though a few of them, as well as my findings, are not.

But even from the already covered ones, you only seem to have the bolded mathematical characters for one letter (𝐐𝐪) but not the rest (𝐀𝐚𝐁𝐛𝐂𝐜𝐃𝐝𝐄𝐞𝐅𝐟𝐆𝐠𝐇𝐡𝐈𝐢𝐉𝐣𝐊𝐤𝐋𝐥𝐌𝐦𝐍𝐧𝐎𝐨𝐏𝐩𝐑𝐫𝐒𝐬𝐓𝐭𝐔𝐮𝐕𝐯𝐖𝐰𝐗𝐱𝐘𝐲𝐙𝐳), the italic ones only for a couple letters (𝐵𝐶𝐸𝑒𝐹𝑔𝐻𝐼𝐿𝑀𝑜𝑅) but not the rest (𝐴𝑎𝑏𝑐𝐷𝑑𝑓𝐺ℎ𝑖𝐽𝑗𝐾𝑘𝑙𝑚𝑁𝑛𝑂𝑃𝑝𝑄𝑞𝑟𝑆𝑠𝑇𝑡𝑈𝑢𝑉𝑣𝑊𝑤𝑋𝑥𝑌𝑦𝑍𝑧 — by the way, ℎ is still to be implemented); lots of other mathematical letter variants are missing as well.

UPD: The following mathematical alphabets remain to be implemented:

𝔄𝔅ℭ𝔇𝔈𝔉𝔊ℌℑ𝔍𝔎𝔏𝔐𝔑𝔒𝔓𝔔ℜ𝔖𝔗𝔘𝔙𝔚𝔛𝔜ℨ
𝕬𝕭𝕮𝕯𝕰𝕱𝕲𝕳𝕴𝕵𝕶𝕷𝕸𝕹𝕺𝕻𝕼𝕽𝕾𝕿𝖀𝖁𝖂𝖃𝖄𝖅
𝔞𝔟𝔠𝔡𝔢𝔣𝔤𝔥𝔦𝔧𝔨𝔩𝔪𝔫𝔬𝔭𝔮𝔯𝔰𝔱𝔲𝔳𝔴𝔵𝔶𝔷
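As a side note, the mathematical letter variants carry Unicode compatibility decompositions, so standard NFKC normalization already folds them to plain letters. The Python sketch below shows that alternative technique; it is not how ccnorm works, and `fold_compat` is a hypothetical helper name.

```python
import unicodedata


def fold_compat(text: str) -> str:
    """Fold font variants (bold, italic, Fraktur, ...) via NFKC, then upper-case."""
    return unicodedata.normalize("NFKC", text).upper()


# Fraktur capital A and B (U+1D504, U+1D505) plus the letterlike black-letter C (U+212D).
fraktur_abc = fold_compat("\U0001D504\U0001D505\u212D")

# Cross-script look-alikes are not covered: Greek alpha stays Greek.
greek_alpha = fold_compat("\u03b1")
```

NFKC only removes font-style distinctions within a script, so it could complement an equivalence table for cross-script look-alikes but would not replace one.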

Hi @1234qwer1234qwer4! I added the characters that I personally found that LTA accounts were trying to abuse in order to get around the edit filters. I acknowledge that they're not the complete alphabet. I appreciate you for taking the time to locate them and add them to the list. :-)

Please extend this ASAP. ü is normalised to U but Ü isn't. This was recently abused.
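Gaps of this kind, where only one case variant of a letter is covered, can be found mechanically. The following Python sketch assumes the mapping is available as a plain dictionary; `case_gaps` and the three-entry sample table are hypothetical and only reproduce the ü/Ü situation reported here.

```python
def case_gaps(table: dict[str, str]) -> list[tuple[str, str]]:
    """Return (mapped, unmapped) pairs where only one case variant is in the table."""
    gaps = []
    for ch in table:
        for other in (ch.upper(), ch.lower()):
            # Skip identical results and multi-character case mappings (e.g. ß -> SS).
            if other != ch and len(other) == 1 and other not in table:
                gaps.append((ch, other))
    return gaps


# Sample table reproducing the report: ü is mapped, Ü is not.
SAMPLE = {"u": "U", "U": "U", "\u00fc": "U"}
missing = case_gaps(SAMPLE)
```

Here `missing` contains the single pair (ü, Ü), meaning that Ü would need an entry of its own.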

Made some additions to T27619#7220525 above; I would appreciate it if somebody could take a look at this.

Change 818287 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/libs/Equivset@master] Expand set for lower/upper case characters which are alone in the set

https://gerrit.wikimedia.org/r/818287

Change 818287 merged by jenkins-bot:

[mediawiki/libs/Equivset@master] Expand set for lower/upper case characters which are alone in the set

https://gerrit.wikimedia.org/r/818287

Change 906658 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/libs/Equivset@master] Add more number signs

https://gerrit.wikimedia.org/r/906658

Change 906678 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/libs/Equivset@master] Add more letterlike symbols

https://gerrit.wikimedia.org/r/906678

Change 906658 merged by jenkins-bot:

[mediawiki/libs/Equivset@master] Add more number signs

https://gerrit.wikimedia.org/r/906658

Change 906709 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/libs/Equivset@master] Add °¹²³º

https://gerrit.wikimedia.org/r/906709

Change 906709 merged by jenkins-bot:

[mediawiki/libs/Equivset@master] Add °¹²³º

https://gerrit.wikimedia.org/r/906709

Change 906756 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/libs/Equivset@master] Add more letterlike symbols

https://gerrit.wikimedia.org/r/906756

Change 906757 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/libs/Equivset@master] Add more letterlike symbols

https://gerrit.wikimedia.org/r/906757

Change 906678 merged by jenkins-bot:

[mediawiki/libs/Equivset@master] Add more letterlike symbols (Mathematical Alpha Symbols 1D400-1D7FF)

https://gerrit.wikimedia.org/r/906678

Change 906756 merged by jenkins-bot:

[mediawiki/libs/Equivset@master] Add more letterlike symbols (Enclosed Alpha Supplement 1F100–1F1FF)

https://gerrit.wikimedia.org/r/906756

Change 906757 merged by jenkins-bot:

[mediawiki/libs/Equivset@master] Add more letterlike symbols

https://gerrit.wikimedia.org/r/906757

Many letters mentioned in this task are now part of Equivset; a new release is needed to get them working in AbuseFilter on WMF wikis.

The following characters mentioned in this task are not in Equivset (or have their own group):

A: ÆᎯㅂ月͜͡∀
B: ᙠ
C: ㄷ
D: შᗡ
E: Ꮛㅌ三
F: ㅋ
G: 巨
I: !工ㅣᏆ
K: ㅈ⋊
L: Λљㄴ˥
M: რლ
N: Π冂ᏁИ
O: ŒΔㅇიⲟߋ
P: 尸
Q: ϙ
R: ㄱ民
S: Ꭶտ
T: 七ㅜㄒ⊥
U: [∪니心∩
Y: უ
Z: ㄹ리弓

Linking to a private abuse filter is not helpful.
Some of the suggested characters are in two groups.
Some of the suggested characters are now in a different group.

The following characters from https://en.wikipedia.org/w/index.php?title=MediaWiki:Titleblacklist&oldid=392108862 do not seem to be part of Equivset:

℃℄ɕʥ℈‼℺⅂⅃ʖ∟∳⊂⋕⋲∞Ҥѓҥҩӷᓂᑫᓈ٨٣ץױוזשלטּפב١¦§№™☀`ٯٯҐمٲٱٲٱم’–₯ѤӬဣϵ

I'm not sure whether all of these characters could or should be added, or which mapping would be best.

Maybe it is better to treat this task as fixed and use a new task to discuss individual characters.

Thanks a lot for your help @Umherirrender! The ɕ character should be in the C equivset for sure since it is literally the letter c with a diacritic, and ѓӷ (plus their capital forms) should be in the R equivset considering that the letter г is. The rest probably needs some more investigation (ligatures appear to be sort of a difficult case in general).

Change 907958 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/libs/Equivset@master] Add ɕЀЁЂЃѐёђѓӐӑӒӓӖӗӶӷḸḹ₡₫₭₮₲₳₵₽₿℮∀∁∆∑∘

https://gerrit.wikimedia.org/r/907958

Added. Some of the mappings look rather centered on the Latin script.
I'm not sure whether all the mappings between scripts are helpful for wikis in other scripts, but I have no knowledge of other scripts.

Cyrillic and Greek seem reasonable since they are quite closely related (though indeed it may be confusing for letters like Г or Ђ which have a completely different etymology -- for Г I've at least seen it being abused though). In fact, Cyrillic has quite a few more additions worth looking into, such as Ѻѻ for O or Ҏҏ for P. Also, it looks like ґ is normalised to Ґ; I wonder why this isn't leading to the R set the letter Г is pointing at.

Added ѺѻѾѿҎҏԚԛԜԝԞԟԨԩ and Ґ to the patch set.

Thanks! I think Ӽӽ is still missing too. Also since you added ѿ you might as well add ѡ, and Ѽ should at least be mapped to Ѡ if that should not be mapped to W as well.

Added ӺӻӼӽӾӿѠѡѼ

Change 907958 merged by jenkins-bot:

[mediawiki/libs/Equivset@master] Add various letterlike symbols

https://gerrit.wikimedia.org/r/907958

Umherirrender removed a project: Patch-For-Review.
Umherirrender edited subscribers, added: Umherirrender; removed: adnanck919.

Any idea what's happening with the latest patch? Why is it not on the wikis? ccnorm("ČĆŽŠčćžš") is only partly working.

Change 934549 had a related patch set uploaded (by Reedy; author: Reedy):

[mediawiki/vendor@master] Upgrading wikimedia/equivset (1.4.3 => 1.5.0)

https://gerrit.wikimedia.org/r/934549

Change 934549 merged by jenkins-bot:

[mediawiki/vendor@master] Upgrading wikimedia/equivset (1.4.3 => 1.5.1)

https://gerrit.wikimedia.org/r/934549

matej_suchanek added a subscriber: matej_suchanek.

This is what ccnorm currently returns in production for the character sets in the description:

A: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAÆAA9A
B: BBBBBB
C: CCCCCCCCC
D: DDDDD
E: EEEEEEEEEEEEEEEEEEEEEEEEEEEEE
F: FFF
G: GGGGGGGGGGG
H: HNHHHHHHHHHHHHHHNNNNNNNNNNNHHHHH
I: IIIIIILIIILLIII!LLI
K: KKKKKKKKK
L: LLLLLLLΛΠљ
M: MWMMMMMMM
N: NNNNNNNN
O: OOOOOOOOOQOŒOEOOWOOOOOOOOOOOΔOOOO
P: PPPPPPPPP
Q: QQQ
R: RRRRRRRRRRRRRRRRR
S: SSSSSSSOS
T: TTTTTTT
U: UUUUUUUUUUUUUUUU
W: WWWWWWWW
X: XXX
Y: YYYYYYYYYYYYUYYYYYYYYYYYY
Z: ZZZZ

ccnorm("ČĆŽŠčćžš") -> CCZSCCZS
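Rows like these can be compared against the description mechanically to spot characters that land in a different set. The Python sketch below assumes one output character per input character, which does not hold for every row (some inputs expand to two letters); `outliers` is a hypothetical helper, and the sample inputs are the s and m rows from the description paired with the S and M rows above.

```python
def outliers(expected: str, inputs: str, outputs: str) -> list[tuple[str, str]]:
    """Pair each input with its output and keep those not mapped to `expected`."""
    if len(inputs) != len(outputs):
        raise ValueError("inputs and outputs must align one-to-one")
    return [(src, out) for src, out in zip(inputs, outputs) if out != expected]


# s row of the description, written with escapes to avoid combining-mark ambiguity:
# s $ ś ŝ ş š ṣ σ ѕ
S_INPUT = "s$\u015b\u015d\u015f\u0161\u1e63\u03c3\u0455"
S_OUTPUT = "SSSSSSSOS"

# m row of the description: m ɯ ḿ ṁ ṃ м Ӎ μ ₥
M_INPUT = "m\u026f\u1e3f\u1e41\u1e43\u043c\u04cd\u03bc\u20a5"
M_OUTPUT = "MWMMMMMMM"

s_outliers = outliers("S", S_INPUT, S_OUTPUT)  # σ is normalized to O
m_outliers = outliers("M", M_INPUT, M_OUTPUT)  # ɯ is normalized to W
```

Outliers are not necessarily bugs: σ and ɯ were suggested for more than one letter in this task, and a character can only be normalized to one of them.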

Shall we close this task or is it okay to hijack it occasionally for new requests?

I'd prefer to close, otherwise this becomes a neverending ticket without any clear scope.

matej_suchanek assigned this task to Umherirrender.
matej_suchanek removed a project: TestMe.
matej_suchanek moved this task from Backlog to Done on the Equivset board.