
Set $wgCategoryCollation to 'uca-fi' on Finnish wikis and rebuild category sort keys
Closed, Resolved (Public)

Description

Please set $wgCategoryCollation to 'uca-fi' and rebuild category sort keys on all Finnish wikis except Wiktionary, i.e. fi.wikipedia, fi.wikisource, fi.wikibooks, fi.wikiversity, fi.wikiquote and fi.wikinews.

Is this feature already mature enough to be just deployed, or should it be tested in advance?

Community discussions/notifications:

http://fi.wikipedia.org/wiki/Wikipedia:Kahvihuone_(tekniikka)#.C3.84.C3.A4kk.C3.B6set_vihdoin_oikeaan_j.C3.A4rjestykseen
http://fi.wikisource.org/wiki/Wikiaineisto:Kahvihuone#Suomalainen_aakkosj.C3.A4rjestys_k.C3.A4ytt.C3.B6.C3.B6n_.28.C3.84.C3.85.C3.96_po._.C3.85.C3.84.C3.96.29
http://fi.wikibooks.org/wiki/Wikikirjasto:Kahvihuone#Suomalainen_aakkosj.C3.A4rjestys_k.C3.A4ytt.C3.B6.C3.B6n_.28.C3.84.C3.85.C3.96_po._.C3.85.C3.84.C3.96.29
http://fi.wikiversity.org/wiki/Wikiopisto:Kahvihuone#Suomalainen_aakkosj.C3.A4rjestys_k.C3.A4ytt.C3.B6.C3.B6n_.28.C3.84.C3.85.C3.96_po._.C3.85.C3.84.C3.96.29
http://fi.wikiquote.org/wiki/Wikisitaatit:Kahvihuone#Suomalainen_aakkosj.C3.A4rjestys_k.C3.A4ytt.C3.B6.C3.B6n_.28.C3.84.C3.85.C3.96_po._.C3.85.C3.84.C3.96.29
http://fi.wikinews.org/wiki/Wikiuutiset:Kahvihuone#Suomalainen_aakkosj.C3.A4rjestys_k.C3.A4ytt.C3.B6.C3.B6n_.28.C3.84.C3.85.C3.96_po._.C3.85.C3.84.C3.96.29

The smaller projects are pretty quiet at the moment, so I may not receive any responses to such a no-brainer bug fix proposal, but the Wikipedia community is already becoming impatient and asking why this wasn't fixed years ago. :) Thank you in advance!


Version: unspecified
Severity: enhancement

Details

Reference
bz46330

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 1:30 AM
bzimport set Reference to bz46330.

This will probably have to wait a few days, since there are a couple of such configuration changes in progress or queued right now, and processing the pages for a semi-large wiki like fi.wikipedia can take multiple hours.

(In reply to comment #0)

Is this feature already mature enough to be just deployed, or should it be
tested in advance?

The ICU library used for the actual sorting here is mature and stable. However, strange interactions with the code in MediaWiki are not impossible, as seen in bug 45446 comment 6 (although they are unlikely). At any rate, I created a testwiki with these settings for you at http://users.v-lo.krakow.pl/~matmarex/testwiki-fi/ , feel free to link it on the wikis and edit there to see how it behaves (but be aware that the wiki won't stay up forever after this bug is closed).

Thanks! There is a grouping problem in Finnish, too: Words starting with T are shown under the Northern Sami letter "Ŧ" instead of "T". This must be fixed before the deployment.

http://users.v-lo.krakow.pl/~matmarex/testwiki-fi/index.php?title=Luokka:Aakkosj%C3%A4rjestys

I'll create a more comprehensive test suite to check if there are any other problems. It would also be nice to know which standard the ICU implementation is supposed to comply with (my guess: SFS-EN 13710). There are a couple of slightly different standards.

Two more problems: the test word "Žukov" is incorrectly shown under "Ʒ" and the word "Nguyen" under "Ŋ". Ž should be equivalent to Z, and Ng should of course be sorted under N.

I wonder if there is some fundamental flaw with the grouping of letters under these one-letter headers?

(In reply to comment #2)

It would also be nice to know which standard the ICU implementation
is supposed to comply with (my guess: SFS-EN 13710). There are a couple of
slightly different standards.

I have no idea, to be honest. Wikimedia wikis are currently running ICU 4.8 (per bug 46036); that's all the information I can give you :)

The data used to "partition" the sorted list into headers is probably not standardised at all and somehow based on the information about primary-level collation data. For details you should probably look at the code that generates it, maintenance/language/generateCollationData.php.

(In reply to comment #3)

I wonder if there is some fundamental flaw with the grouping of letters under
these one-letter headers?

I don't think there's such a "fundamental flaw" in it; the list is generated using generalised data that's reasonably correct for most languages, and thus needs such modifications for some specific ones. For example, no modifications were needed for Portuguese, and Polish only required adding the appropriate letters with diacritics.

You and Swedes are just unlucky, I suppose :) It's interesting how those characters are sorted among Latin letters in Finnish, and at the end of the Latin alphabet in Polish or Portuguese.

I automatically created a category with all two-letter combinations of ASCII letters + Å, Ä, Ö: http://users.v-lo.krakow.pl/~matmarex/testwiki-fi/index.php?title=Luokka:Autotest . It seems like we need to exclude those four characters: Ǥ, Ŋ, Ŧ, Ʒ. I'll submit a patch to do this later today.
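For anyone who wants to build a similar test category, here is a minimal sketch of how such a title list could be generated. The use of uppercase letters only and the function name `two_letter_titles` are assumptions for illustration; the script actually used for the test wiki is not shown in this thread.

```python
import unicodedata
from itertools import product

# ASCII letters plus Å, Ä, Ö, as described in the comment above.
# Normalised to NFC so that each letter is a single code point.
LETTERS = unicodedata.normalize("NFC", "ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ")


def two_letter_titles(letters=LETTERS):
    """Return every ordered two-letter combination of the given letters."""
    return [a + b for a, b in product(letters, repeat=2)]


if __name__ == "__main__":
    titles = two_letter_titles()
    print(len(titles), titles[:3], titles[-3:])
```

With 29 letters this yields 29 × 29 = 841 titles, which is enough to exercise every first-letter heading at least once.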

Submitted the patch: I976dedfd and deployed it on my test wiki (you might need to action=purge the category pages to see it).

That's kind of weird. "Ŧ" should be primary different from T (according to a chart for ICU 4.2 [1]; maybe it changed in later versions), which means that they should each have their own section, with things starting with T being labelled under T.

In comparison, in Swedish the issue was with expansions - note the dark grey background of thorn in [2].

[1] http://collation-charts.org/icu442/icu442-fi.html

[2] http://collation-charts.org/icu442/icu442-fi.html

(Note that Wikimedia wikis are currently running ICU 4.8 per bug 46036.)

Thank you! The grouping looks good in the test categories, and I haven't seen any problems with the underlying sort order.

According to SFS-EN 13710 (derived from EN 13710:2011), the first-level Latin letters are A...ZÞÅÄÖ in Finnish. Ŧ is defined as a second-level letter equivalent to T.
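To make the expected order concrete, here is a rough sketch of a first-level-only sort key built from the alphabet quoted above. It is a toy model and not ICU's algorithm or the text of the standard: it ignores second- and third-level tie-breaking, and the `SECOND_LEVEL` map and the handling of non-letters are assumptions for illustration.

```python
import unicodedata

# First-level Latin letters for Finnish as quoted above: A...Z, Þ, Å, Ä, Ö.
ALPHABET = unicodedata.normalize("NFC", "ABCDEFGHIJKLMNOPQRSTUVWXYZÞÅÄÖ")
RANK = {ch: i for i, ch in enumerate(ALPHABET)}

# Letters with no Unicode decomposition that are treated as variants of a
# base letter. Only Ŧ is listed here; the map is illustrative, not complete.
SECOND_LEVEL = {"Ŧ": "T"}


def finnish_key(text):
    """Toy first-level sort key: one (class, rank) pair per character."""
    key = []
    for ch in unicodedata.normalize("NFC", text).upper():
        ch = SECOND_LEVEL.get(ch, ch)
        if ch not in RANK:
            # Strip diacritics, e.g. Ž becomes Z. Å, Ä and Ö never reach
            # this branch because they are first-level letters themselves.
            ch = unicodedata.normalize("NFD", ch)[0]
        if ch in RANK:
            key.append((1, RANK[ch]))
        else:
            # Assumption: non-letters sort before letters, by code point.
            key.append((0, ord(ch)))
    return key


if __name__ == "__main__":
    words = ["Örebro", "Åre", "Zagreb", "Älmhult", "Abisko", "Žukov"]
    print(sorted(words, key=finnish_key))
```

In this model the sample words come out as Abisko, Zagreb, Žukov, Åre, Älmhult, Örebro, whereas a plain code-point sort would place Ä before Å.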

Some "exotic" characters (e.g. Ƕ, Ə and Ƭ) are still treated as first-level letters, but this could be a feature of the ICU library. EN 13710:2011 defines these three characters as second-level letters equivalent to HV, E and T.

I don't see this as a release blocker.

(In reply to comment #6)

That's kind of weird. "Ŧ" should be primary different from T (according to a
chart for ICU 4.2 [1]; maybe it changed in later versions), which means that
they should each have their own section, with things starting with T being
labelled under T.

In comparison, in Swedish the issue was with expansions - note the dark grey
background of thorn in [2].

[1] http://collation-charts.org/icu442/icu442-fi.html

[2] http://collation-charts.org/icu442/icu442-fi.html

I think I figured out what was happening.

Ŧ is tailored to be secondary different from T̵ (aka T plus U+0335 COMBINING SHORT STROKE OVERLAY). The U+0335 should be primary ignorable, so in essence Ŧ is secondary different from plain T. Since T̵ is two characters, it behaves like an expansion, which our primary collision code doesn't handle properly.
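The effect can be reproduced with a toy model. The sketch below is not MediaWiki's or ICU's code: the two-level keys, the `TAILORING` table and the lookup by "last heading that sorts at or before the title" are simplifying assumptions, chosen only to show why a heading that is primary-equal to T captures every title starting with T.

```python
from bisect import bisect_right

# Hypothetical tailoring: Ŧ shares T's primary weight and differs only at
# the secondary level, mirroring the explanation above.
TAILORING = {"Ŧ": ("T", 1)}


def sort_key(text):
    """Toy two-level key: all primary weights first, then secondary ones."""
    primary, secondary = [], []
    for ch in text.upper():
        base, level2 = TAILORING.get(ch, (ch, 0))
        primary.append(base)
        secondary.append(level2)
    return (tuple(primary), tuple(secondary))


def first_letter(title, headers):
    """Pick the last heading whose key sorts at or before the title's key."""
    ordered = sorted(headers, key=sort_key)
    keys = [sort_key(h) for h in ordered]
    idx = bisect_right(keys, sort_key(title)) - 1
    return ordered[idx] if idx >= 0 else title[:1].upper()


if __name__ == "__main__":
    # With Ŧ among the candidate headings, T < Ŧ < "TAMPERE" in key order,
    # so the title is filed under Ŧ.
    print(first_letter("Tampere", ["S", "T", "Ŧ", "U"]))
    # With Ŧ excluded from the candidates, the title is filed under T.
    print(first_letter("Tampere", ["S", "T", "U"]))
```

In this model, dropping Ŧ from the candidate headings is enough to restore the T heading, which matches the approach of excluding Ǥ, Ŋ, Ŧ and Ʒ described in comment 5.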

(In reply to comment #5)

Submitted the patch: I976dedfd and deployed it on my test wiki (you might
need to action=purge the category pages to see it).

btw, now merged.

When can we deploy this? I'd like to notify the Finnish community about the schedule.

I just noticed there is also https://fi.wikimedia.org/wiki/Etusivu - I assume it should be covered by the change as well?

Submitted a patch including fiwikimedia as Ia40f5b89. I'm a volunteer myself, so I can't tell you when it will be deployed - likely within a week or so, probably quicker.

Thank you! Yes, Wikimedia Finland should be included, although this particular site might never have content affected by this bug. (Swedish names starting with Å have been the biggest problem with the old sort order.)

Was the patch mentioned in comment 5 included?

When I view the page http://fi.wikipedia.org/wiki/Luokka:Ruotsin_kaupungit , the letters Å, Ä and Ö are now in the correct order, but the G, N and T sections are incorrectly labelled as Ǥ, Ŋ and Ŧ.

Reopening until the single-letter headings are displayed correctly.

(In reply to comment #17)

Was the patch mentioned in comment 5 included?

... it wasn't. Sorry, that was a stupid oversight :) The backport to 1.21wmf12 is I976dedfd, Reedy is working to get it deployed.

Looks like this is done now, marking as resolved fixed.

I thank you, good people. The categories look good, and I haven't seen any complaints from any project (just checked the discussion threads). Marking as verified.

The Finnish Wiktionary community is still discussing their sorting needs and might submit a new request later:
http://fi.wiktionary.org/wiki/Wikisanakirja:Kahvihuone#Wikisanakirjan_aakkosj.C3.A4rjestys