Add more characters to ccnorm
Closed, ResolvedPublic

Authored By

	He7d3r
	Oct 22 2010, 8:48 PM

Description

Currently only some characters are normalized to a "canonical" form. For example, although ccnorm("α") results in "A", ccnorm("ά") doesn't change anything.

The function should support the conversion of more characters.
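A minimal sketch of the behaviour described above, in Python. The table `EQUIV` and the function `ccnorm_sketch` are hypothetical stand-ins for illustration, not the real Equivset data or the AbuseFilter implementation; only the two results quoted above are taken from this report.

```python
# Hypothetical, tiny equivalence table: confusable character -> canonical letter.
# The real table is far larger; this one only mirrors the example in the report.
EQUIV = {
    "a": "A",
    "\u03b1": "A",  # Greek small alpha: already normalized, per the report
    # "\u03ac" (alpha with tonos) is deliberately absent, mirroring the reported gap
}

def ccnorm_sketch(text: str) -> str:
    """Replace each character by its canonical form; leave unknown ones alone."""
    return "".join(EQUIV.get(ch, ch) for ch in text)

print(ccnorm_sketch("\u03b1"))                # A
print(ccnorm_sketch("\u03ac") == "\u03ac")    # True: the character is left unchanged
```

Supporting more characters then amounts to adding entries to the table, which is why the rest of this task is mostly a list of proposed entries.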

The following list is based on what was available at en:MediaWiki:Titleblacklist, but maybe it is better to have different sets of characters depending on the case of the letter. For example, ⅅ for D, but ⅆ for d.

a: aαąăãàāάạậảấầẩắằẵẳẫặḁǟǡȁᾳὰᾀἁᾁἄᾄἂᾂἆᾆἅᾅἃᾃἇᾇáâäæåǻ٩4
b: bßβбв฿
c: cċĉ¢сćĉçč
d: dďḍðⅆ
e: éèëeęěĕėẻẹếềễểȨȩḝēḗȅȇệḙḛ3عڠẽə
f: fғ₣
g: gĝģğġɠǥǧǵḡԌ
h: hήĥħȞʰʱḣḥḧḩḫнңӈӉηἠἡἢἣἤἥἦἧὴᾐћⱧԋњһ
i: iìíîïĩļǐīĭḷŀιїɨ!łľį
k: kķкќқҝҡҟӄ
l: l₤ĺľḷłŀλлљ
m: mɯḿṁṃмӍμ₥
n: n₦ńñņňṇν
o: oóòôöõǒōŏǫőœøəόοωὸὀὁὄὂὅὃоөӧӫδσʘǿọ
p: pƥṕṗǷ₧þρр
q: qɊʠ
r: rŕŗřȑȓƦʳʴʵʶṙṛṝṟя®
s: s$śŝşšṣσѕ
t: tţťṭτтŧ
u: uúùûüũůǔūǖǘǚǜŭųű
w: wŵẁẃẅẇẉ₩
x: xҳχ
y: yýÿŷƴȲʸẏỳỵỷỹʊύυϋὑὓὕὗὺῠῡуϓ
z: zźžż

Details

Reference: bz25619

Subject	Repo	Branch	Lines +/-
Upgrading wikimedia/equivset (1.4.3 => 1.5.1)	mediawiki/vendor	master	+22 K -10 K
Add various letterlike symbols	mediawiki/libs/Equivset	master	+138 -24
Add more letterlike symbols	mediawiki/libs/Equivset	master	+397 -70
Add °¹²³º	mediawiki/libs/Equivset	master	+15 -5
Add more letterlike symbols (Enclosed Alpha Supplement 1F100–1F1FF)	mediawiki/libs/Equivset	master	+82 -28
Add more letterlike symbols (Mathematical Alpha Symbols 1D400-1D7FF)	mediawiki/libs/Equivset	master	+835 -62
Add more number signs	mediawiki/libs/Equivset	master	+288 -11
Expand set for lower/upper case characters which are alone in the set	mediawiki/libs/Equivset	master	+549 -27
Adding missing equivalents for I, L, O, and S.	mediawiki/extensions/AntiSpoof	master	+68 -33
Add Japanese normalization pairs	mediawiki/extensions/AntiSpoof	master	+106 -29

Related Objects

Status	Assigned	Task
Resolved	Umherirrender	T27619 Add more characters to ccnorm
Resolved	Umherirrender	T164180 Equivset should normalize some diacriticals
Resolved	Umherirrender	T178010 missing character equivalencies: ÈÉÊẼÌÍÏÓÒÔÕ∅Q̃ÚŰÜŨ
Resolved	None	T212061 Enhance Equivset with regard to Persian/Arabic characters
Resolved	Umherirrender	T305781 ĆČŽ missing, ćčž present

Event Timeline

There are a very large number of changes, so older changes are hidden.

Both patches are still open. The first one got some reviews and now looks like it is waiting for a new upload from Kaldari. The second one, the simpler version, got no reviews at all.

(In reply to Ryan Kaldari from comment #15)

Since I haven't had any luck getting code review on
https://gerrit.wikimedia.org/r/92154 I submitted
https://gerrit.wikimedia.org/r/97304 as a simpler version. It only adds !
and $ and nothing else.

I'm not sure whether sending a request to wikitech-l could help get reviews for these two patches, but pinging on the patches and here doesn't seem to be enough... Any ideas?

• demon unsubscribed. Dec 16 2014, 7:57 PM

Change 184850 had a related patch set uploaded (by Whym):
Add Japanese normalization pairs

https://gerrit.wikimedia.org/r/184850

Patch-For-Review

I also see that the "editable" sets on mediawiki.org have some more changes that are included in none of the gerrit changes mentioned here.

EDIT: added link

Change 184850 merged by jenkins-bot:
Add Japanese normalization pairs

https://gerrit.wikimedia.org/r/184850

Restricted Application added a subscriber: Aklapper. Jan 16 2016, 4:49 AM

ReleaseTaggerBot added a project: MW-1.27-release (WMF-deploy-2016-01-19_(1.27.0-wmf.11)). Jan 16 2016, 5:00 AM

MusikAnimal subscribed. Sep 15 2016, 4:54 PM

Change 311310 had a related patch set uploaded (by MusikAnimal):
Adding missing equivalents for I, L, O, and S.

https://gerrit.wikimedia.org/r/311310

Qgil unsubscribed. Sep 19 2016, 8:58 AM

Change 311310 merged by jenkins-bot:
Adding missing equivalents for I, L, O, and S.

https://gerrit.wikimedia.org/r/311310

ReleaseTaggerBot added a project: MW-1.28-release (WMF-deploy-2016-10-04_(1.28.0-wmf.21)). Sep 21 2016, 10:00 PM

He7d3r awarded a token. Nov 20 2016, 3:19 PM

He7d3r updated the task description. Nov 30 2016, 11:32 AM

Xqt added a subtask: T164180: Equivset should normalize some diacriticals. May 8 2017, 4:44 PM

zhuyifei1999 subscribed. Aug 8 2017, 10:52 PM

Looks like this ticket can be closed, correct? Is this released to production?

matej_suchanek removed a project: Patch-For-Review. Oct 21 2017, 10:11 AM

matej_suchanek added a project: Equivset. Nov 24 2017, 5:57 PM

Saw two more O's from ntsamr spambot this does not handle: ⲟ & ߋ (the latter is hard to copy, but it's unicode 0x07cb)

@zhuyifei1999: Thanks for the report!

Also 1 and I are in the same equivalence set but l is not. Those two sets should probably be merged.

If you take a look at the code I maintain in this edit filter on en-wiki, you'll find many more variants of each letter of the alphabet that aren't listed here and should also be added to the ccnorm function. We definitely need to expand this function and get the additional variants added. I have LTA users who use text in different fonts in their usernames and abusive edits in order to bypass edit filters. Keeping the filter I created up to date is becoming a long and daunting task, and it's a cat-and-mouse game in which we'll always be one step behind if we don't do this...

This webpage is what the LTA user is using to quickly generate text in different fonts, which is then used to create accounts, get around edit filters, and make abusive edits to pages...

GeneralNotability subscribed. Sep 23 2020, 6:04 PM

kaldari removed kaldari as the assignee of this task. May 26 2021, 1:56 PM

In T27619#3970962, @Tgr wrote:

Also 1 and I are in the same equivalence set but l is not. Those two sets should probably be merged.

This is a result of the 2 competing use cases of Equivset:

To prevent spoofing of usernames (AntiSpoof)
To create "bad word" filters in AbuseFilter

I and L being in the same equivalence set makes sense for AntiSpoof, but not for AbuseFilter (as it would make construction of the filters unintuitive). The ultimate solution to this problem is probably to adopt confusables.txt (or a derivative) for AntiSpoof, and tailor Equivset for AbuseFilter.
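To make the conflict between the two use cases concrete, here is a small Python sketch. Both tables and the `normalize` helper are assumptions made up for illustration; they are not the real Equivset data or either extension's code.

```python
# Toy tables, assumptions for illustration only (not the real Equivset data).
ANTISPOOF = {"1": "I", "i": "I", "I": "I", "l": "I", "L": "I"}    # I, L and 1 merged
ABUSEFILTER = {"1": "I", "i": "I", "I": "I", "l": "L", "L": "L"}  # L kept separate

def normalize(text: str, table: dict) -> str:
    """Map each character through the table; fall back to plain upper-casing."""
    return "".join(table.get(ch, ch.upper()) for ch in text)

# Username spoofing: look-alike names should collapse to the same string.
assert normalize("Bill", ANTISPOOF) == normalize("BiII", ANTISPOOF)

# Bad-word filters: the filter author compares against ordinary letters.
print(normalize("lol", ABUSEFILTER))  # LOL
print(normalize("lol", ANTISPOOF))    # IOI
```

With the merged table, a filter author would have to write `IOI` to match "lol", which is the kind of unintuitive filter construction described above.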

The characters "ª" (A), "º" (O) and "°" (O) should also be included.

MusikAnimal unsubscribed. Jun 10 2021, 10:30 PM

Proc subscribed. Jun 24 2021, 11:55 AM

Proc added a project: AbuseFilter. Jun 24 2021, 11:57 AM

@Proc AbuseFilter leverages the character set from the AntiSpoof extension, so it belongs against that project, not against AbuseFilter

Proc mentioned this in T285468: Add a regex version of str_replace. Jun 24 2021, 12:04 PM

Due to recent abuse that can be seen for example at https://simple.wikipedia.org/wiki/Special:Contributions/46.134.191.203, please also add the mathematical alphanumeric characters (see https://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symbols for character tables).

You all should check out edit filters 51 and 53 on the English Wikipedia... I've been working to stay on top of abusive usernames and edits for years now, and I've compiled a large list of letters that ccnorm doesn't catch. I have appended these letters to the table of letters from the task description. See below:

a: aαąăãàāάạậảấầẩắằẵẳẫặḁǟǡȁᾳὰᾀἁᾁἄᾄἂᾂἆᾆἅᾅἃᾃἇᾇáâäæåǻ٩4AÅaà🇦Ꭿ4ㅂ月A͜͡𝓐𝓪𝒜𝒶𝔸𝕒Ａａ𝘈𝘢Ꭺαᴀ∀ɐᴬᵃₐ
b: bßβбв฿Bbß𝓑𝓫𝐵𝒷𝔹𝕓Ｂｂ𝘉𝘣ʙᙠqᴮᵇ
c: cċĉ¢сćĉçčCcㄷс𝓒𝓬𝒞𝒸ℂ𝕔Ｃｃ𝘊𝘤ᴄɔᶜ
d: dďḍðⅆDd🇩𝓓𝓭𝒟𝒹𝔻𝕕Ｄｄ𝘋𝘥შᗡᴅpᴰᵈ
e: éèëeęěĕėẻẹếềễểȨȩḝēḗȅȇệḙḛ3عڠẽəEe3🇪Ꮛㅌ€三𝓔𝓮𝐸𝑒𝔼𝕖Ｅｅ𝘌𝘦ƎǝᴇєӛᎬᴱᵉₑ
f: fғ₣Ff🇫ㅋ𝓕𝓯𝐹𝒻𝔽𝕗Ｆｆ𝘍𝘧ƒҒℲꜰɟᶠ
g: gĝģğġɠǥǧǵḡԌGg巨ġ𝓖𝓰𝒢𝑔𝔾𝕘Ｇｇ𝘎𝘨𝔤⅁ɢɓᴳᵍ
h: hήĥħȞʰʱḣḥḧḩḫнңӈӉηἠἡἢἣἤἥἦἧὴᾐћⱧԋњһHh𝓗𝓱𝐻𝒽ℍ𝕙Ｈｈ𝘏𝘩ʜɥᴴʰₕ🇭н
i: iìíîïĩļǐīĭḷŀιїɨ!łľįIi工1í🇮!𝘐𝘪ㅣ|𝕝l𝒾𝓁𝓘𝓲𝐼𝕀𝕚ＩｉᏆɪıЇ𝘭ᴵⁱᵢ
j:Jj𝓙𝓳𝒥𝒿𝕁𝕛Ｊｊ𝘑𝘫ᴊɾᴶʲⱼ
k: kķкќқҝҡҟӄKkкㅈᏦ𝓚𝓴𝒦𝓀𝕂𝕜ＫｋϏ𝘒𝘬⋊ᴋκʞᴷᵏₖ
l: l₤ĺľḷłŀλлљLlㄴㅣ|1𝕀𝕚𝒾𝓛𝓵𝐿𝓁𝕃𝕝Ｌｌ𝘓𝘭Ꮮ˥ʟí🇮!IiЇ𝘐𝘪ᴸˡₗ
m: mɯḿṁṃмӍμ₥Mmм𝓜𝓶𝑀𝓂𝕄𝕞Ｍｍ𝘔𝘮რლWᴍɯᴹᵐₘ
n: n₦ńñņňṇνNnпŊ冂Ꮑ𝓝𝓷𝒩𝓃ℕ𝕟Ｎｎ𝘕𝘯ɴῃИuᴺⁿₙ
o: oóòôöõǒōŏǫőœøəόοωὸὀὁὄὂὅὃоөӧӫδσʘǿọOo0ÒÔÓóQㅇøᎾ𝓞𝓸𝒪𝑜𝕆𝕠Ｏｏ𝘖οი𝘰ᴏσᴼᵒₒ
p: pƥṕṗǷ₧þρрPp尸𝓟𝓹𝒫𝓅ℙ𝕡Ｐｐ𝘗𝘱Ԁᴘρᴾᵖₚ
q: qɊʠQq𝓠𝓺𝒬𝓆ℚ𝕢Ｑｑϙ𝐐𝐪𝘘𝘲
r: rŕŗřȑȓƦʳʴʵʶṙṛṝṟя®Rrг🇷ㄱᏒ民®ʁ𝓡𝓻𝑅𝓇ℝ𝕣Ｒｒ𝘙𝘳ᴚяʀɹᎡᴿʳᵣ
s: s$śŝşšṣσѕSsşŠš$Ꭶ𝓢𝓼𝒮𝓈𝕊𝕤Ｓｓ𝘚𝘴ꜱ🇸Տˢₛ
t: tţťṭτтŧTt七ㅜ𝓣𝓽𝒯𝓉𝕋𝕥ㄒＴｔ𝘛𝘵⊥ᴛτʇᵀᵗₜ
u: uúùûüũůǔūǖǘǚǜŭųű[Uuü∪🇺니心𝓤𝓾𝒰𝓊𝕌𝕦Ｕｕ𝘜𝘶∩ᴜnʉɄᵁᵘᵤ
v:Vv𝓥𝓿𝒱𝓋𝕍𝕧Ｖｖ𝘝𝘷ᴠΛʌνⱽᵛᵥ
w: wŵẁẃẅẇẉ₩WwᏔ𝓦𝔀𝒲𝓌𝕎𝕨Ｗｗᴡ𝘞𝘸ᴡMʍᵂʷᵚ
x: xҳχXx𝓧𝔁𝒳𝓍𝕏𝕩Ｘｘ𝘟𝘹ˣₓ
y: yýÿŷƴȲʸẏỳỵỷỹʊύυϋὑὓὕὗὺῠῡуϓYy𝓨𝔂𝒴𝓎𝕐𝕪Ｙｙ𝘠𝘺უყʏ⅄ʎʸ
z: zźžżZzㄹ리𝓩𝔃𝒵𝓏ℤ𝕫Ｚｚ𝘡𝘻ᴢ弓ᶻ

Some of the characters I added might duplicate ones that are already in the table; I simply appended the letters that I've been keeping track of in the edit filters I maintain (51 and 53). This should prove helpful, and I really hope that the table I supply here is used when ccnorm is updated.

I find it quite weird that you often only list parts of quite specific character sets. For example, you have the regional indicator symbols for A D E F H I L R S U (🇦 🇩 🇪 🇫 🇭 🇮 🇮 🇷 🇸 🇺) but not for the rest of the alphabet (🇧 🇨 🇬 🇯 🇰 🇲 🇳 🇴 🇵 🇶 🇹 🇻 🇼 🇽 🇾 🇿). I also suggest adding other (circled/parenthesised) symbols from the Enclosed Alphanumerics and Enclosed Alphanumerics Supplement Unicode blocks, namely the following, each row resembling an alphabet:

ⓐⓑⓒⓓⓔⓕⓖⓗⓘⓙⓚⓛⓜⓝⓞⓟⓠⓡⓢⓣⓤⓥⓦⓧⓨⓩ
ⒶⒷⒸⒹⒺⒻⒼⒽⒾⒿⓀⓁⓂⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏ //first one is in fact already covered for some reason
⒜⒝⒞⒟⒠⒡⒢⒣⒤⒥⒦⒧⒨⒩⒪⒫⒬⒭⒮⒯⒰⒱⒲⒳⒴⒵
🄐🄑🄒🄓🄔🄕🄖🄗🄘🄙🄚🄛🄜🄝🄞🄟🄠🄡🄢🄣🄤🄥🄦🄧🄨🄩
🄰🄱🄲🄳🄴🄵🄶🄷🄸🄹🄺🄻🄼🄽🄾🄿🅀🅁🅂🅃🅄🅅🅆🅇🅈🅉
🅐🅑🅒🅓🅔🅕🅖🅗🅘🅙🅚🅛🅜🅝🅞🅟🅠🅡🅢🅣🅤🅥🅦🅧🅨🅩
🅰🅱🅲🅳🅴🅵🅶🅷🅸🅹🅺🅻🅼🅽🅾🅿🆀🆁🆂🆃🆄🆅🆆🆇🆈🆉

From the same blocks, we should also add 🄪 for S, 🄫 for C, 🄬 for R, 🆥 for D, and 🆭 for M. Adding the enclosed digits,

⓪①②③④⑤⑥⑦⑧⑨
⑴⑵⑶⑷⑸⑹⑺⑻⑼ //no 0
⓵⓶⓷⓸⓹⓺⓻⓼⓽ //no 0
⓿❶❷❸❹❺❻❼❽❾
🄋➀➁➂➃➄➅➆➇➈
🄌➊➋➌➍➎➏➐➑➒

might also be helpful.

Inserting the lines from your list into ccnorm() at Special:AbuseFilter/tools shows that the majority of the symbols are already implemented, though a few of them, as well as my findings, are not.

But even from the already covered ones, you only seem to have the bolded mathematical characters for one letter (𝐐𝐪) but not the rest (𝐀𝐚𝐁𝐛𝐂𝐜𝐃𝐝𝐄𝐞𝐅𝐟𝐆𝐠𝐇𝐡𝐈𝐢𝐉𝐣𝐊𝐤𝐋𝐥𝐌𝐦𝐍𝐧𝐎𝐨𝐏𝐩𝐑𝐫𝐒𝐬𝐓𝐭𝐔𝐮𝐕𝐯𝐖𝐰𝐗𝐱𝐘𝐲𝐙𝐳), the italic ones only for a couple letters (𝐵𝐶𝐸𝑒𝐹𝑔𝐻𝐼𝐿𝑀𝑜𝑅) but not the rest (𝐴𝑎𝑏𝑐𝐷𝑑𝑓𝐺ℎ𝑖𝐽𝑗𝐾𝑘𝑙𝑚𝑁𝑛𝑂𝑃𝑝𝑄𝑞𝑟𝑆𝑠𝑇𝑡𝑈𝑢𝑉𝑣𝑊𝑤𝑋𝑥𝑌𝑦𝑍𝑧 — by the way, ℎ is still to be implemented); lots of other mathematical letter variants are missing as well.

UPD: The following mathematical alphabets remain to be implemented:

𝔄𝔅ℭ𝔇𝔈𝔉𝔊ℌℑ𝔍𝔎𝔏𝔐𝔑𝔒𝔓𝔔ℜ𝔖𝔗𝔘𝔙𝔚𝔛𝔜ℨ
𝕬𝕭𝕮𝕯𝕰𝕱𝕲𝕳𝕴𝕵𝕶𝕷𝕸𝕹𝕺𝕻𝕼𝕽𝕾𝕿𝖀𝖁𝖂𝖃𝖄𝖅
𝔞𝔟𝔠𝔡𝔢𝔣𝔤𝔥𝔦𝔧𝔨𝔩𝔪𝔫𝔬𝔭𝔮𝔯𝔰𝔱𝔲𝔳𝔴𝔵𝔶𝔷
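Gaps of the kind pointed out in this comment (only part of a Unicode block being listed) can be found mechanically, because such blocks are laid out in alphabetical order. The Python sketch below does this for the regional indicator symbols, which occupy U+1F1E6 to U+1F1FF. The function name `missing_regional_indicators` is hypothetical, and the input is just the nine distinct indicators that appear in the table above.

```python
# Regional indicator symbols run from U+1F1E6 (A) to U+1F1FF (Z).
REGIONAL_A = 0x1F1E6

def missing_regional_indicators(listed: str) -> dict:
    """Map each Latin letter to its regional indicator if absent from `listed`."""
    present = set(listed)
    missing = {}
    for offset in range(26):
        symbol = chr(REGIONAL_A + offset)
        if symbol not in present:
            missing[chr(ord("A") + offset)] = symbol
    return missing

# The nine distinct indicators found in the table above: A D E F H I R S U.
listed = "".join(chr(REGIONAL_A + ord(c) - ord("A")) for c in "ADEFHIRSU")
gaps = missing_regional_indicators(listed)
print("".join(gaps))  # BCGJKLMNOPQTVWXYZ
```

The same offset trick would apply to other alphabetically ordered runs, although some blocks have holes (for example ℎ, which sits outside the italic mathematical range), so results for those would need a manual check.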

Hi @1234qwer1234qwer4! I added the characters that I personally found that LTA accounts were trying to abuse in order to get around the edit filters. I acknowledge that they're not the complete alphabet. I appreciate you for taking the time to locate them and add them to the list. :-)

Please extend this ASAP. ü is normalised to U but Ü isn't. This was recently abused.
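Asymmetries like this one, where one case of a letter is mapped and the other is not, can be detected with a simple consistency check. The following Python sketch uses a hypothetical three-entry table and a hypothetical helper `case_gaps`; it is not the real Equivset data or tooling.

```python
# Hypothetical excerpt of an equivalence table (not the real Equivset data).
# "\u00fc" is ü; its upper-case form "\u00dc" (Ü) is missing, as reported.
TABLE = {"u": "U", "U": "U", "\u00fc": "U"}

def case_gaps(table: dict) -> list:
    """Return (char, other_case) pairs where only one case form is mapped."""
    gaps = []
    for ch in table:
        for other in (ch.upper(), ch.lower()):
            # Skip multi-character case mappings such as "ß".upper() == "SS".
            if len(other) == 1 and other != ch and other not in table:
                gaps.append((ch, other))
    return gaps

print(case_gaps(TABLE) == [("\u00fc", "\u00dc")])  # True
```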

• adnanck919 subscribed. Jan 24 2022, 12:49 PM

This comment was removed by Aklapper.

Made some extension to T27619#7220525 above; would appreciate if somebody could take a look at this.

matej_suchanek added subtasks: T178010: missing character equivalencies: ÈÉÊẼÌÍÏÓÒÔÕ∅Q̃ÚŰÜŨ, T194310: Add "I" as equivalent of "l", T212061: Enhance Equivset with regard to Persian/Arabic characters, T305781: ĆČŽ missing, ćčž present.Apr 10 2022, 12:24 PM

Change 818287 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/libs/Equivset@master] Expand set for lower/upper case characters which are alone in the set

https://gerrit.wikimedia.org/r/818287

gerritbot added a project: Patch-For-Review. Jul 29 2022, 3:06 AM

Change 818287 merged by jenkins-bot:

[mediawiki/libs/Equivset@master] Expand set for lower/upper case characters which are alone in the set

https://gerrit.wikimedia.org/r/818287

Umherirrender closed subtask T305781: ĆČŽ missing, ćčž present as Resolved.Mar 31 2023, 2:11 PM

Umherirrender closed subtask T178010: missing character equivalencies: ÈÉÊẼÌÍÏÓÒÔÕ∅Q̃ÚŰÜŨ as Resolved.Apr 6 2023, 6:44 PM

Change 906658 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/libs/Equivset@master] Add more number signs

https://gerrit.wikimedia.org/r/906658

Change 906678 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/libs/Equivset@master] Add more letterlike symbols

https://gerrit.wikimedia.org/r/906678

Change 906658 merged by jenkins-bot:

[mediawiki/libs/Equivset@master] Add more number signs

https://gerrit.wikimedia.org/r/906658

Change 906709 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/libs/Equivset@master] Add °¹²³º

https://gerrit.wikimedia.org/r/906709

Change 906709 merged by jenkins-bot:

[mediawiki/libs/Equivset@master] Add °¹²³º

https://gerrit.wikimedia.org/r/906709

Change 906756 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/libs/Equivset@master] Add more letterlike symbols

https://gerrit.wikimedia.org/r/906756

Change 906757 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/libs/Equivset@master] Add more letterlike symbols

https://gerrit.wikimedia.org/r/906757

Umherirrender claimed this task. Apr 7 2023, 10:33 PM

Umherirrender closed subtask T212061: Enhance Equivset with regard to Persian/Arabic characters as Resolved.

Umherirrender removed a subtask: T194310: Add "I" as equivalent of "l".

Change 906678 merged by jenkins-bot:

[mediawiki/libs/Equivset@master] Add more letterlike symbols (Mathematical Alpha Symbols 1D400-1D7FF)

https://gerrit.wikimedia.org/r/906678

Change 906756 merged by jenkins-bot:

[mediawiki/libs/Equivset@master] Add more letterlike symbols (Enclosed Alpha Supplement 1F100–1F1FF)

https://gerrit.wikimedia.org/r/906756

Umherirrender closed subtask T164180: Equivset should normalize some diacriticals as Resolved. Apr 11 2023, 3:48 PM

Change 906757 merged by jenkins-bot:

[mediawiki/libs/Equivset@master] Add more letterlike symbols

https://gerrit.wikimedia.org/r/906757

Many of the letters mentioned in this task are now part of Equivset; it needs a new release to get them working in AbuseFilter on WMF wikis.

The following characters mentioned in this task are not in Equivset (or have their own group):

A: ÆᎯㅂ月͜͡∀
B: ᙠ
C: ㄷ
D: შᗡ
E: Ꮛㅌ三
F: ㅋ
G: 巨
I: !工ㅣᏆ
K: ㅈ⋊
L: Λљㄴ˥
M: რლ
N: Π冂ᏁИ
O: ŒΔㅇიⲟߋ
P: 尸
Q: ϙ
R: ㄱ民
S: Ꭶտ
T: 七ㅜㄒ⊥
U: [∪니心∩
Y: უ
Z: ㄹ리弓

Linking to a private abuse filter is not helpful.
Some of the suggested characters are in two groups.
Some of the suggested characters are now in a different group.

The following characters from https://en.wikipedia.org/w/index.php?title=MediaWiki:Titleblacklist&oldid=392108862 do not seem to be part of Equivset:

℃℄ɕʥ℈‼℺⅂⅃ʖ∟∳⊂⋕⋲∞Ҥѓҥҩӷᓂᑫᓈ٨٣ץױוזשלטּפב١¦§№™☀`ٯٯҐمٲٱٲٱم’–₯ѤӬဣϵ

Not sure whether all of these characters could or should be added, or which mapping is best.

Maybe it is better to treat the task as fixed and use new tasks for discussion of some characters.

thanks a lot for your help @Umherirrender! The ɕ character should be in the C equivset for sure since it is literally the letter c with a diacritic, and ѓӷ (plus their capital forms) should be in the R equivset considering that the letter г is. The rest probably needs some more investigation (ligatures appear to be sort of a difficult case in general).

Change 907958 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/libs/Equivset@master] Add ɕЀЁЂЃѐёђѓӐӑӒӓӖӗӶӷḸḹ₡₫₭₮₲₳₵₽₿℮∀∁∆∑∘

https://gerrit.wikimedia.org/r/907958

In T27619#8773151, @1234qwer1234qwer4 wrote:

thanks a lot for your help @Umherirrender! The ɕ character should be in the C equivset for sure since it is literally the letter c with a diacritic, and ѓӷ (plus their capital forms) should be in the R equivset considering that the letter г is. The rest probably needs some more investigation (ligatures appear to be sort of a difficult case in general).

Added. Some of the mappings look a bit centric to the Latin script.
I'm not sure whether all the mappings between scripts are helpful for wikis in other scripts, but I have no knowledge of other scripts.

Cyrillic and Greek seem reasonable since they are quite closely related (though indeed it may be confusing for letters like Г or Ђ which have a completely different etymology -- for Г I've at least seen it being abused though). In fact, Cyrillic has quite a few more additions worth looking into, such as Ѻѻ for O or Ҏҏ for P. Also, it looks like ґ is normalised to Ґ; I wonder why this isn't leading to the R set the letter Г is pointing at.

In T27619#8773282, @1234qwer1234qwer4 wrote:

Cyrillic and Greek seem reasonable since they are quite closely related (though indeed it may be confusing for letters like Г or Ђ which have a completely different etymology -- for Г I've at least seen it being abused though). In fact, Cyrillic has quite a few more additions worth looking into, such as Ѻѻ for O or Ҏҏ for P. Also, it looks like ґ is normalised to Ґ; I wonder why this isn't leading to the R set the letter Г is pointing at.

Added ѺѻѾѿҎҏԚԛԜԝԞԟԨԩ and Ґ to the patch set.

Thanks! I think Ӽӽ is still missing too. Also since you added ѿ you might as well add ѡ, and Ѽ should at least be mapped to Ѡ if that should not be mapped to W as well.

In T27619#8773483, @1234qwer1234qwer4 wrote:

Thanks! I think Ӽӽ is still missing too. Also since you added ѿ you might as well add ѡ, and Ѽ should at least be mapped to Ѡ if that should not be mapped to W as well.

Added ӺӻӼӽӾӿѠѡѼ

Change 907958 merged by jenkins-bot:

[mediawiki/libs/Equivset@master] Add various letterlike symbols

https://gerrit.wikimedia.org/r/907958

Umherirrender removed Umherirrender as the assignee of this task. Apr 12 2023, 4:44 PM

Umherirrender removed a project: Patch-For-Review.

Umherirrender edited subscribers, added: Umherirrender; removed: • adnanck919.

Any idea what's happening with the latest patch? Why is it not on the wikis yet? ccnorm("ČĆŽŠčćžš") is only partially working.

Equivset hasn't been brought into MediaWiki-Vendor

Change 934549 had a related patch set uploaded (by Reedy; author: Reedy):

[mediawiki/vendor@master] Upgrading wikimedia/equivset (1.4.3 => 1.5.0)

https://gerrit.wikimedia.org/r/934549

gerritbot added a project: Patch-For-Review. Jun 30 2023, 2:17 PM

Change 934549 merged by jenkins-bot:

[mediawiki/vendor@master] Upgrading wikimedia/equivset (1.4.3 => 1.5.1)

https://gerrit.wikimedia.org/r/934549

This is what ccnorm currently returns in production for the character sets in the description:

A: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAÆAA9A
B: BBBBBB
C: CCCCCCCCC
D: DDDDD
E: EEEEEEEEEEEEEEEEEEEEEEEEEEEEE
F: FFF
G: GGGGGGGGGGG
H: HNHHHHHHHHHHHHHHNNNNNNNNNNNHHHHH
I: IIIIIILIIILLIII!LLI
K: KKKKKKKKK
L: LLLLLLLΛΠљ
M: MWMMMMMMM
N: NNNNNNNN
O: OOOOOOOOOQOŒOEOOWOOOOOOOOOOOΔOOOO
P: PPPPPPPPP
Q: QQQ
R: RRRRRRRRRRRRRRRRR
S: SSSSSSSOS
T: TTTTTTT
U: UUUUUUUUUUUUUUUU
W: WWWWWWWW
X: XXX
Y: YYYYYYYYYYYYUYYYYYYYYYYYY
Z: ZZZZ

ccnorm("ČĆŽŠčćžš") -> CCZSCCZS

Shall we close this task or is it okay to hijack it occasionally for new requests?

In T27619#9149563, @matej_suchanek wrote:

Shall we close this task or is it okay to hijack it occasionally for new requests?

I'd prefer to close, otherwise this becomes a neverending ticket without any clear scope.

matej_suchanek closed this task as Resolved. Sep 24 2023, 9:54 AM

matej_suchanek assigned this task to Umherirrender.

matej_suchanek removed a project: TestMe.

matej_suchanek moved this task from Backlog to Done on the Equivset board.

1234qwer1234qwer4 mentioned this in T357855: Further extensions to ccnorm. Feb 18 2024, 1:51 AM
