
Set $wgCategoryCollation to 'uca-uk' on Ukrainian Wikipedia and rebuild category sort keys
Closed, Resolved, Public

Description

Set $wgCategoryCollation to 'uca-uk' on Ukrainian Wikipedia and rebuild category sort keys.

Needs community notification and discussion.


Version: unspecified
Severity: enhancement
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=45776

Details

Reference
bz45444

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 1:27 AM
bzimport set Reference to bz45444.

(In reply to comment #0)

Needs community notification and discussion.

Who should start that?

I thought Dmytro Dziuma would be interested in this? (He cc'd himself on this bug, and he was the one who filed bug 41040.)

Anybody from uk.wiki, actually.

What kind of discussion do you expect from uk.wiki? I will notify the community but I don't see any reason why anybody could be against this fix.

Dmytro: pretty much just a short discussion or vote at your wiki's equivalent of the village pump.

Here's what I did for the very same issue you're having here on pl.wiki: https://pl.wikipedia.org/wiki/Wikipedia:Kawiarenka/Propozycje#Zmiana_konfiguracji_.E2.80.93_w.C5.82.C4.85czenie_poprawnego_sortowania_artyku.C5.82.C3.B3w_na_stronach_kategorii

Is it possible to test the 'uca-uk' collation somewhere? It would be great if a publicly accessible test wiki could easily be set up, like the one you made for pl.wiki at http://users.v-lo.krakow.pl/~matmarex/testwiki

I set up an open wiki for you at http://users.v-lo.krakow.pl/~matmarex/testwiki-uk/ . Feel free to use it however you wish, but be aware that it won't stay up forever, and that the server I'm running it on might have occasional hiccups.

Since I anticipate that I'm going to be setting up a lot of such testwikis :), I attached the script I used to do this at parent bug 45443.

Also, please note that this also changes how non-Ukrainian characters will be sorted – accented letters such as Ä will be considered the same as their non-accented counterparts for the purposes of sorting (including being shown under one heading) – you can see this on the Polish testwiki. This is probably minor, but worth noting in the discussion. (I'd post there about this myself, but sadly I do not speak Ukrainian.)
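To illustrate the accent behaviour described above, here is a minimal sketch using only the Python standard library. It is not how ICU or MediaWiki build sort keys; `fold_accents` and `heading` are hypothetical helper names, and the sample titles are made up. It only shows the visible effect: a letter such as Ä ends up sorted and grouped together with A.

```python
import unicodedata


def fold_accents(text):
    """Drop combining marks so that 'Ä' compares like 'A' (illustration only)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def heading(title):
    """First letter with accents removed, as it would appear as a heading."""
    return fold_accents(title)[0].upper()


# Made-up sample titles, not taken from any wiki.
titles = ["Zebra", "Ärger", "Apfel", "Öl", "Ost"]
ordered = sorted(titles, key=lambda t: fold_accents(t).casefold())
headings = [heading(t) for t in ordered]

print(ordered)
print(headings)
```

Note that this naive folding would also merge Й with И and Ї with І, which are separate letters in Ukrainian, so it should not be read as a description of what the 'uca-uk' collation does with Cyrillic.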

Just as a side note. Is this collation used only for sorting in categories? I'm asking because as far as I can see from http://users.v-lo.krakow.pl/~matmarex/testwiki-uk/index.php?title=Спеціальна:Усі_сторінки, in other places some other collation is used for sorting.

I doubt that this is important, but it could be nice to have consistent sorting across the whole wiki including API.

As far as I can see from the test, this sorting treats an apostrophe (') as a regular letter. However, in Ukrainian the apostrophe is not a letter and it should have no impact on the sorting order (for example, the words в'яз and вяз should have the same key). Is it possible to take this into account as well?

(In reply to comment #9)

Just as a side note. Is this collation used only for sorting in categories? I'm asking because as far as I can see from http://users.v-lo.krakow.pl/~matmarex/testwiki-uk/index.php?title=Спеціальна:Усі_сторінки, in other places some other collation is used for sorting.

I doubt that this is important, but it could be nice to have consistent sorting across the whole wiki including API.

Collation only affects categories. There are other bugs about sorting in other places. Most come to the conclusion that, while nice, it's mostly not worth the effort.

(In reply to comment #10)

As far as I can see from the test, this sorting treats an apostrophe (') as a regular letter. However, in Ukrainian the apostrophe is not a letter and it should have no impact on the sorting order (for example, the words в'яз and вяз should have the same key). Is it possible to take this into account as well?

Hmm. Sounds like it should be primary ignorable (i.e. only used as a tie breaker). This may be an upstream bug, but there are also some options related to such characters (variable characters), so it may just be a configuration issue on our end.

The category currently shows this order:

  • В'язь
  • В’язь
  • В`язь
  • Воліючиневолію
  • Вязь

so the apostrophe counts as a separate letter.

Otherwise (if the apostrophe is not counted) we would have

  • Воліючиневолію
  • В'язь
  • В’язь
  • В`язь
  • Вязь

or at least

  • В’язь
  • В`язь
  • Воліючиневолію
  • Вязь
  • В'язь
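The difference between the orderings above can be sketched in a few lines of Python. This is only an illustration of "ignore the apostrophe at the primary level, use it as a tie breaker", not the ICU algorithm; `UK_ALPHABET`, `letter_key` and `sort_key` are names made up for this example, and the tie breaker here is simply the raw title, which is an assumption rather than what UCA specifies.

```python
# The 33 letters of the Ukrainian alphabet in dictionary order.
UK_ALPHABET = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя"


def letter_key(title):
    """Primary key: positions of the letters only; apostrophes are skipped."""
    return [UK_ALPHABET.index(ch) for ch in title.lower() if ch in UK_ALPHABET]


def sort_key(title):
    """Letters decide the order; the raw title only breaks ties (assumption)."""
    return (letter_key(title), title)


titles = ["В'язь", "В’язь", "В`язь", "Воліючиневолію", "Вязь"]
ordered = sorted(titles, key=sort_key)

# With apostrophes ignored, "Воліючиневолію" (в, о, ...) sorts before every
# spelling of "вязь" (в, я, ...), whichever apostrophe variant is used.
print(ordered)
```

With this key, в'яз and вяз get the same primary key, which is the behaviour asked for earlier in the thread.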

I think the current level of support from the local ukwiki community is sufficient. I guess you can proceed with the deployment of this fix.

Thanks. This will have to wait for the deployment of 1.21wmf11 for the Wikipedias, due on March 13 [https://www.mediawiki.org/wiki/MediaWiki_1.21/Roadmap]. I'll propose a configuration change afterwards.

I'll try to look into the behavior of the apostrophes.


I scanned through the uk.wiki discussion with the help of Google Translate:

  • If I got it right, someone mentioned that other Ukrainian-language projects should have their category pages sorted in the same way. Please feel free to open similar "mini-votes" on them, and link those discussions here once we're sure there is consensus.
  • If I got it right, someone said that Ё and Ў should be sorted separately from Е and У. Not sure if this comment has any merit (I don't speak the language, but neither of these is even mentioned on https://en.wikipedia.org/wiki/Ukrainian_alphabet); however, if it does, it's certainly an upstream issue in the ICU library.

How long should the voting last to be acceptable?

On the apostrophe question, see http://www.unicode.org/reports/tr10/#Variable_Weighting for some background. Try using a locale identifier of uk-u-ka-shifted (I have not tested this; in theory there should be per-locale defaults that are the most correct, so this may be an upstream bug).

(In reply to comment #15)

How long should the voting last to be acceptable?

A week or so, I suppose. There is no hard and fast rule, as long as your average interested party would have a chance to object if they so desired. The main reason for such votes is to make sure such a change is wanted. In this case it seems fairly obvious it would be wanted, but sometimes people request things that the relevant communities don't want, which causes drama. Votes (or really any demonstration of community consensus) are good just to make sure everyone is on the same page and the change is actually wanted.

(In reply to comment #16)

"Mini votes" are here:
<snip>

Actually, I split those to bug 45776, for clarity. :) Let's keep this one only about the Wikipedia.

Will it also change sorting in sortable tables, AllPages, API view of Categories and in other lists available via special pages and API?

(In reply to comment #19)

Will it also change sorting in sortable tables, AllPages, API view of Categories and in other lists available via special pages and API?

Just for the record, this has been replied to in bug 45776 comment 2. The answer is no, except for the API view of the categories (which is the same as "user view"), but there are suggestions (and maybe even bugs, I'd have to look) to implement the same for them.

(In reply to comment #20)

(In reply to comment #19)

Will it also change sorting in sortable tables, AllPages, API view of Categories and in other lists available via special pages and API?

Just for the record, this has been replied to in bug 45776 comment 2. The answer is no, except for the API view of the categories (which is the same as "user view"), but there are suggestions (and maybe even bugs, I'd have to look) to implement the same for them.

Well, in the CategoryViewer and ApiQueryCategoryMembers classes we use the collation for the 'cl_sortkey' field in the 'categorylinks' table. What is the problem with using the collation for the 'page_title' field in the 'page' table for other purposes (e.g. ApiQueryAllPages)?

(In reply to comment #21)

(In reply to comment #20)

(In reply to comment #19)

Will it also change sorting in sortable tables, AllPages, API view of Categories and in other lists available via special pages and API?

Just for the record, this has been replied to in bug 45776 comment 2. The answer is no, except for the API view of the categories (which is the same as "user view"), but there are suggestions (and maybe even bugs, I'd have to look) to implement the same for them.

Well, in the CategoryViewer and ApiQueryCategoryMembers classes we use the collation for the 'cl_sortkey' field in the 'categorylinks' table. What is the problem with using the collation for the 'page_title' field in the 'page' table for other purposes (e.g. ApiQueryAllPages)?

That's a bit of a simplification. There's a bit more overhead than that.

There's concern that the overhead is not worth it, given how few places people get a list of all articles (see related comments like bug 24574 comment 3 about the user list). I also imagine we would want to see how well this entire system works out for categories first before moving to other lists.

For the time being, let's get the category collations deployed, and once this works, we can think about how to go further. (I submitted a configuration change proposal as Ifd9b1dfe.)

Done

mysql:wikiadmin@db1041 [ukwiki]> select count(cl_collation), cl_collation from categorylinks group by cl_collation ;
+---------------------+--------------+
| count(cl_collation) | cl_collation |
+---------------------+--------------+
|             2313095 | uca-uk       |
+---------------------+--------------+
1 row in set (1.23 sec)

2312046 rows processed

real 1130m5.081s
user 9m5.834s
sys 0m50.911s
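The check above (group the rows by cl_collation and confirm only 'uca-uk' remains) can be reproduced as a small sketch against an in-memory SQLite database. This is an assumption-laden illustration, not the production procedure: the table has only three columns, the rows are invented sample data, and `collation_counts` is a hypothetical helper name.

```python
import sqlite3


def collation_counts(conn):
    """Return {collation name: number of categorylinks rows using it}."""
    cur = conn.execute(
        "SELECT cl_collation, COUNT(*) FROM categorylinks GROUP BY cl_collation"
    )
    return dict(cur.fetchall())


conn = sqlite3.connect(":memory:")
# Simplified stand-in for the real table; only cl_collation matters here.
conn.execute(
    "CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT, cl_collation TEXT)"
)
conn.executemany(
    "INSERT INTO categorylinks VALUES (?, ?, ?)",
    [(1, "A", "uca-uk"), (2, "A", "uca-uk"), (3, "B", "uppercase")],  # sample rows
)

counts = collation_counts(conn)
# Any collation other than the target means the rebuild has not finished.
leftover = {name: n for name, n in counts.items() if name != "uca-uk"}
print(counts, leftover)
```

An empty `leftover` corresponds to the single-row result shown above, where every row already carries 'uca-uk'.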

Sorting looks good, but category navigation is broken.

For example here
http://uk.wikipedia.org/wiki/Категорія:Футбольні_клуби_України
when I click 'Next 200' I move only 2 items forward instead of 200, and clicking 'Next 200' again takes me to the same page.

(In reply to comment #25)

Sorting looks good, but category navigation is broken.

For example here
http://uk.wikipedia.org/wiki/Категорія:Футбольні_клуби_України
when I click 'Next 200' I move only 2 items forward instead of 200, and clicking 'Next 200' again takes me to the same page.

Is this still the case? It looks fine now to me.

Sorting may have been a little screwed up during the process of switching sorting orders.

It was broken for me 30 minutes ago and I even started digging in the code, but it seems okay for me as well right now.

May have something to do with updateCollation.php's re-run per bug 46036. No idea if that's the case.

If this persists for more than 24 hours, please reopen :)

(In reply to comment #27)

May have something to do with updateCollation.php's re-run per bug 46036. No idea if that's the case.

Actually that would make sense. The paging code assumes that cl_sortkey is encoded with the same version of ICU as is currently on the server. If that's not the case, the 'next 200' link could generate an SQL query where the paging part doesn't correspond to the last element of the previous query. That is because the 'next 200' link has the last page name in the URL, not its cl_sortkey, which would be full of binary data and possibly quite long. (Also, using cl_sortkey in the URL would break the "skip to letter foo" templates that people make.)
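A toy model of that failure mode, as a hedged sketch: the names `stored_key`, `upgraded_key` and `fetch_page` are hypothetical, the "collations" are trivial string transforms rather than ICU sort keys, and a sorted Python list stands in for the indexed cl_sortkey column. The only point is that the paging cursor is a page title whose key is recomputed at request time, so it must be computed the same way as the stored keys.

```python
from bisect import bisect_left


def stored_key(title):
    """Stand-in for the collation version used when the rows were written."""
    return title.lower()


def upgraded_key(title):
    """Stand-in for a different collation version: reverses the a-z order."""
    return "".join(chr(ord("a") + ord("z") - ord(c)) for c in title.lower())


def fetch_page(rows, from_title, key_func, limit):
    """Mimic 'WHERE sortkey >= key(from_title) ORDER BY sortkey LIMIT n'."""
    start = 0 if from_title is None else bisect_left(rows, (key_func(from_title), ""))
    chunk = rows[start:start + limit]
    has_more = start + limit < len(rows)
    next_from = rows[start + limit][1] if has_more else None
    return [title for _, title in chunk], next_from


titles = ["a", "b", "c", "d", "e", "f"]
rows = sorted((stored_key(t), t) for t in titles)  # (sortkey, title) pairs

first, next_from = fetch_page(rows, None, stored_key, 2)
consistent, _ = fetch_page(rows, next_from, stored_key, 2)
mismatched, _ = fetch_page(rows, next_from, upgraded_key, 2)

print(first, next_from, consistent, mismatched)
```

When the key recomputed for the cursor title comes from a different collation version than the stored keys, the second page starts at the wrong position (here it comes back empty), which is the same kind of misalignment as the broken 'Next 200' link reported above.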