Change $wgCategoryCollation values to appropriate one for each Wikimedia wiki
Open, Medium, Public

Description

According to Roan in T32287: Implement uca-fa collation comment 9, actually enabling the uca-default collation stuff that was "fixed" for T2164: Support collation by a certain locale (sorting order of characters) is waiting on an Ubuntu upgrade on the apache cluster (T31915: Upgrade the WMF-cluster >= Ubuntu 10.04?).

There are a few bugs which look like they should be resolved (for Categories at least) by enabling this -- e.g. T32287: Implement uca-fa collation (Farsi sorting problems); others require further work (T31788: Swedish-language wikis should use Swedish-locale sorting (ie. ÅÄÖ should sort correctly) needs a Swedish-specific collation setting).

See Also T32673: Implement central locale-specific, or tailored, sorting framework (tracking)

Details

Reference
bz30996

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 11:51 PM
bzimport set Reference to bz30996.
bzimport added a subscriber: Unknown Object (MLST).

Closing LATER until apaches are all upgraded

Relevant dependencies as RT tickets:

http://rt.wikimedia.org/Ticket/Display.html?id=22 full update to Lucid (bug 29915)

http://rt.wikimedia.org/Ticket/Display.html?id=652 install icu & php5-intl (depends on the above)

py wrote:

RT 22 and 652 are done. This can probably be closed.

(In reply to comment #3)

RT 22 and 652 are done. This can probably be closed.

Well this still needs someone to make the changes to MediaWiki's config file and run the maintenance script.

The first letter identification code (maintenance/language/generateCollationData.php) won't work for all languages, so some wikis will have their category pages broken terribly by this change. Also, the default collation tables sort a lot of languages incorrectly, and the amount of breakage that causes will depend on the language in question. So I recommend doing this change on a language-by-language basis, after checking each language for correct collation and first-letter behaviour on a test wiki.

Also, it would be nice to know in advance what percentage of sort keys will be larger than the 230 bytes allowed by the database field, and if that percentage is significant, whether there are categories on the target wikis where the order will be changed by truncation after 230 bytes.
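The question above could be answered by measuring a sample of sort keys. A minimal sketch, assuming the full-length sort keys are available as byte strings grouped by category; the function name and the sample data are hypothetical, and only the 230-byte limit comes from the comment above.

```python
from collections import defaultdict

SORTKEY_LIMIT = 230  # bytes allowed by the database field, per the comment above


def truncation_report(sortkeys_by_category):
    """Return (fraction of keys over the limit, categories whose order may change).

    A category is flagged when two distinct keys become identical after
    truncation, because their relative order is then no longer decided
    by the stored key.
    """
    total = 0
    too_long = 0
    affected = []
    for category, keys in sortkeys_by_category.items():
        by_prefix = defaultdict(set)
        for key in keys:
            total += 1
            if len(key) > SORTKEY_LIMIT:
                too_long += 1
            by_prefix[key[:SORTKEY_LIMIT]].add(key)
        if any(len(full_keys) > 1 for full_keys in by_prefix.values()):
            affected.append(category)
    fraction = too_long / total if total else 0.0
    return fraction, affected


# Hypothetical sample: two keys differ only after byte 230.
sample = {
    "Short titles": [b"A", b"B"],
    "Long titles": [b"x" * 230 + b"1", b"x" * 230 + b"2", b"y"],
}
fraction, affected = truncation_report(sample)
```

Keys that still differ within the first 230 bytes keep their relative order, so only collisions after truncation are reported.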

Any progress on this?

On Portuguese Wikipedia we still need to use
{{DEFAULTSORT: Page Name without accents }}
on any article whose title has an accent if we want it to be sorted appropriately in the categories. E.g.:
https://pt.wikipedia.org/w/index.php?title=%C3%81gua_Boa&oldid=28441112&action=edit

Maybe adding a note to [[mw:Roadmap]] would be appropriate?
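To illustrate why the DEFAULTSORT workaround described above is needed, here is a rough sketch. It assumes that the 'uppercase' collation can be approximated by comparing uppercased code points; the helper names and the sample titles other than "Água Boa" are hypothetical.

```python
import unicodedata


def strip_accents(title):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", title)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def uppercase_key(title):
    """Rough stand-in for 'uppercase' collation: uppercased code point order."""
    return title.upper()


titles = ["Zebra", "Água Boa", "Abacate"]

# Without DEFAULTSORT, "Á" (U+00C1) compares greater than "Z".
plain_order = sorted(titles, key=uppercase_key)

# With a manual "page name without accents" sort key.
manual_order = sorted(titles, key=lambda t: uppercase_key(strip_accents(t)))
```

Under these assumptions `plain_order` places "Água Boa" after "Zebra", while `manual_order` places it between "Abacate" and "Zebra".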

Some related info:

I created some collations for Chinese, which are expected to be used on zhwiki. This code requires ICU 4.8+ to run. The current php5-intl in WMF's APT repo uses libicu42, and existing wikis with uca-default (ptwiki) have sort keys generated with libicu42. Once libicu is updated, all existing uca-default sort keys need to be rebuilt.

Btw, Meta and especially Commons may be good next targets for deploying uca-default to. Both are multilingual, so using the root collation seems ideal.

'wgCategoryCollation' => array(
    'default' => 'uppercase',
    'ptwiki' => 'uca-default', # bug 35632
    'iswiktionary' => 'identity', # bug 30722
),

I'm presuming this is fixed now...

(In reply to comment #9)

'wgCategoryCollation' => array(
    'default' => 'uppercase',
    'ptwiki' => 'uca-default', # bug 35632
    'iswiktionary' => 'identity', # bug 30722
),

I'm presuming this is fixed now...

Umm only for ptwiki.

Just to clarify this bug: we probably should *not* do this for all wikis. As Tim said above, more MediaWiki code is needed to make it work properly.

However, this can (and should, imo) be done on all English, Portuguese, and multilingual (Meta and Commons) wikis.

I guess, a rough list for this would be:

reedy@fenari:/home/wikipedia/common$ grep enw all.dblist
arbcom_enwiki
enwiki
enwikibooks
enwikinews
enwikiquote
enwikisource
enwikiversity
enwikivoyage
enwiktionary
tenwiki
wg_enwiki
reedy@fenari:/home/wikipedia/common$ grep ptw all.dblist
ptwiki
ptwikibooks
ptwikinews
ptwikiquote
ptwikisource
ptwikiversity
ptwikivoyage
ptwiktionary

+brwikimedia

reedy@fenari:/home/wikipedia/common$ cat special.dblist
advisorywiki
arbcom_dewiki
arbcom_enwiki
arbcom_fiwiki
arbcom_nlwiki
auditcomwiki
boardgovcomwiki
boardwiki
chairwiki
chapcomwiki
checkuserwiki
collabwiki
commonswiki
donatewiki
execwiki
fdcwiki
foundationwiki
grantswiki
incubatorwiki
internalwiki
mediawikiwiki
metawiki
movementroleswiki
nostalgiawiki
officewiki
otrs_wikiwiki
outreachwiki
qualitywiki
searchcomwiki
sourceswiki
spcomwiki
specieswiki
stewardwiki
strategywiki
tenwiki
test2wiki
testwiki
usabilitywiki
wg_enwiki
wikimania2005wiki
wikimania2006wiki
wikimania2007wiki
wikimania2008wiki
wikimania2009wiki
wikimania2010wiki
wikimania2011wiki
wikimania2012wiki
wikimania2013wiki
wikimaniateamwiki
wikidatawiki

Do the rest of the "is" (Icelandic) projects want to become identity too?

reedy@fenari:/home/wikipedia/common$ grep isw all.dblist
iswiki
iswikibooks
iswikiquote
iswikisource
iswiktionary

(In reply to comment #13)

Do the rest of the "is" (Icelandic) projects want to become identity too?

reedy@fenari:/home/wikipedia/common$ grep isw all.dblist
iswiki
iswikibooks
iswikiquote
iswikisource
iswiktionary

I would imagine so. The language is case sensitive from what I understand. I guess we should ask.

Realistically it doesn't matter that much for a wiki like wikimania2006, since nobody is using them. Although it certainly wouldn't hurt anything.

For larger wikis (where it would take more than a couple of hours to run the script) we would probably want to talk to the local community, as categories will behave somewhat weirdly while the script is running (pages will be out of order). It's too bad the script doesn't go in order of cl_to instead of cl_from, as that would minimize disruption somewhat.
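The point about cl_to versus cl_from ordering can be illustrated with a toy model. This is only a sketch under simplifying assumptions (one row rewritten per step, no batching); the function name and the sample rows are hypothetical and do not reflect how the maintenance script is actually implemented.

```python
def partially_updated_steps(rows, order_by):
    """Sum, over all categories, the number of steps each one spends half-updated.

    rows are (cl_from, cl_to) pairs. A category counts as half-updated from
    the step its first row is rewritten until the step its last row is.
    """
    ordered = sorted(rows, key=order_by)
    first, last = {}, {}
    for step, (_page, category) in enumerate(ordered):
        first.setdefault(category, step)
        last[category] = step
    return sum(last[c] - first[c] for c in first)


# Three pages, each in the same two categories.
rows = [(1, "Cats"), (1, "Dogs"), (2, "Cats"), (2, "Dogs"), (3, "Cats"), (3, "Dogs")]

by_page = partially_updated_steps(rows, order_by=lambda r: r[0])      # cl_from order
by_category = partially_updated_steps(rows, order_by=lambda r: r[1])  # cl_to order
```

In this model, walking rows page by page keeps every category in a mixed state for most of the run, whereas walking them category by category finishes one category before touching the next.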

Adjusting the summary: "Set $wgCategoryCollation to 'uca-default' and rebuild category sort keys on Wikimedia wikis deployment" -> "Change $wgCategoryCollation values to appropriate one for each Wikimedia wiki".

Per bug 45443, we don't really want uca-default anywhere anymore (apart from multi-language projects like Commons or Meta), but language-specific collations.

At https://en.wikipedia.org/wiki/Wikipedia_talk:Categorization#OK_to_switch_English_Wikipedia.27s_category_collation_to_uca-default.3F @kaldari has proposed using "uca-default"; are you saying we should be using "uca-en" or something?

uca-en and uca-default are the same.

I think MediaWiki has some code where it tries to force you to use uca-default over uca-en.

Hmm, "uca-en" might be a bit neater, but it is indeed probably the exact same thing as "uca-default" (I haven't tried to check this).

...we don't really want uca-default anywhere anymore (apart from multi-language projects like Commons or Meta)

@matmarex: Why is that? It seems that most languages do not have language-specific uca-collations yet. Wouldn't it be better to switch them to uca-default rather than uppercase collation?

Maybe? Probably not? You can't tell without researching each language a little bit (or asking a native speaker). At least for languages using the Latin or Cyrillic scripts with additional letters with diacritics it is not always a good idea, since letters with diacritics might need to be ordered differently than the basic versions.

For example in Polish, ordering "L" and "Ł" as if they were the same letter is just as wrong as ordering "Ł" at the end of the alphabet, and in my opinion more confusing (as people are already familiar with the usual broken ordering). [All Polish-language wikis already have the correct uca-pl ordering deployed, this is just an example.]
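The three behaviours contrasted in the Polish example can be sketched with toy sort keys. This is an illustration only: the key functions and sample words are hypothetical, the alphabet table is hand-written, and none of this is the actual uca-pl implementation.

```python
# Polish alphabet plus Q, V, X, listed in dictionary order.
POLISH_ALPHABET = "AĄBCĆDEĘFGHIJKLŁMNŃOÓPQRSŚTUVWXYZŹŻ"
POLISH_RANK = {letter: index for index, letter in enumerate(POLISH_ALPHABET)}
FOLD = str.maketrans("ĄĆĘŁŃÓŚŹŻ", "ACELNOSZZ")


def uppercase_key(word):
    """Code point order after uppercasing: 'Ł' lands after 'Z'."""
    return word.upper()


def folded_key(word):
    """Diacritics ignored: 'Ł' is ordered as if it were 'L'."""
    return word.upper().translate(FOLD)


def tailored_key(word):
    """Alphabet order: 'Ł' is its own letter between 'L' and 'M'."""
    # Characters outside the table sort after all letters (toy fallback).
    return [POLISH_RANK.get(ch, len(POLISH_RANK) + ord(ch)) for ch in word.upper()]


words = ["Łódź", "Lublin", "Luzino", "Mława", "Zakopane"]
by_uppercase = sorted(words, key=uppercase_key)
by_folded = sorted(words, key=folded_key)
by_tailored = sorted(words, key=tailored_key)
```

Here `by_uppercase` puts "Łódź" last, `by_folded` puts it before "Lublin", and only `by_tailored` puts it after the "L" words and before "Mława".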

(To provide another entertaining example with no diacritics involved: "CH" in the Czech alphabet is sorted between "H" and "I".)
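The digraph case can be sketched the same way: the sort key has to consume "CH" as a single unit before ranking it. A minimal sketch, assuming a simplified alphabet without the Czech letters that carry diacritics; the names and sample words are hypothetical.

```python
# Simplified alphabet: "CH" is one letter, placed between "H" and "I".
CZECH_LETTERS = [
    "A", "B", "C", "D", "E", "F", "G", "H", "CH", "I", "J", "K", "L", "M",
    "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z",
]
CZECH_RANK = {letter: index for index, letter in enumerate(CZECH_LETTERS)}


def czech_key(word):
    """Build a list of letter ranks, treating 'CH' as a single letter."""
    upper = word.upper()
    key = []
    i = 0
    while i < len(upper):
        if upper.startswith("CH", i):
            key.append(CZECH_RANK["CH"])
            i += 2
        else:
            # Characters outside the table sort after all letters (toy fallback).
            key.append(CZECH_RANK.get(upper[i], len(CZECH_RANK) + ord(upper[i])))
            i += 1
    return key


words = ["Cibule", "Chata", "Hrad", "Igelit", "Dub"]
by_code_point = sorted(words)
by_czech = sorted(words, key=czech_key)
```

With plain code point order "Chata" comes first; with the digraph-aware key it moves between "Hrad" and "Igelit".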

What's actually "entertaining" on that?

w:Ch (digraph) states that it is treated as a letter of its own but is no longer commonly used for collation purposes.

Well, the proper adjective would perhaps be "interesting" or "important to bear in mind" then...
I wouldn't dare to say that the German alphabet is "entertaining" because of having ß, or any other alphabet for whatever reason. Alphabets are long-existing parts of national cultures and have reasons why they have developed into the forms they are in nowadays. (Cf. the Czech alphabet having "ú" & "ů" both for marking IPA [u:], and it has its reasons.)
Please weigh your words in such cases next time, thank you.

Indeed, random other alphabets also have entertaining aspects. (I consider human languages highly entertaining in general). My point was merely that it's not only about the sort order of single letters, but also digraphs making things more complex.

@matmarex: I'm sure there are lots of cases where uca-default isn't an accurate collation for the language, but are there any cases where it's actually worse than uppercase collation? At least with uca-default you can have numeric sorting (T8948), which is a highly requested feature from the community. Regarding Cyrillic, it looks like a lot of the Cyrillic languages have already switched over: Belarusian, Serbian, Russian, Ukrainian. Bulgarian isn't switched over, but there is a uca-bg collation available.

I honestly don't know. I'm not a linguist. My personal opinion about Polish is that 'uca-default' is worse than the simple 'uppercase' collation (and objectively, it is definitely very wrong, but which kind of wrongness one prefers might differ). I would be wary of just switching everyone to uca-default, since all the accents/diacritics it ignores are sometimes part of cultural identity and this could rub people the wrong way (I have no specific examples at the moment, sorry).

If it is just about numeric sorting, that would be easy enough to implement on top of 'uppercase'. I've been thinking about this on-and-off for weeks already, so right now I could write a proof-of-concept in a couple hours ;) But I agree that it would be much better to couple this with switching to an appropriate UCA collation.
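One way numeric sorting could be layered on top of an uppercase-style key is to zero-pad every run of digits, so that plain string comparison orders numbers by value. This is a hedged sketch and not the proof-of-concept mentioned above; the function name, the padding width, and the sample titles are assumptions.

```python
import re


def numeric_uppercase_key(title, width=10):
    """Uppercase the title and zero-pad each ASCII digit run to a fixed width.

    Numbers with more than `width` digits are left unpadded and are not
    ordered correctly by this toy key.
    """
    return re.sub(r"[0-9]+", lambda m: m.group().zfill(width), title.upper())


titles = ["Apollo 11", "Apollo 9", "Apollo 10"]
plain_order = sorted(titles, key=str.upper)
numeric_order = sorted(titles, key=numeric_uppercase_key)
```

The padded key remains a plain string, so it could in principle be stored and compared like any other sort key, at the cost of making keys containing numbers longer.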

Phabricator_maintenance renamed this task from Change $wgCategoryCollation values to appropriate one for each Wikimedia wiki (tracking) to Change $wgCategoryCollation values to appropriate one for each Wikimedia wiki. Aug 13 2016, 10:11 PM