Use allkeys_CLDR.txt - the CLDR tailored DUCET instead of allkeys.txt
Open, Medium, Public

Description

Split off from bug 164. We should use allkeys_CLDR.txt instead of allkeys.txt in (I think) maintenance/language/generateCollationData.php


Version: unspecified
Severity: normal

Details

Reference
bz30675

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 21 2014, 11:51 PM
bzimport set Reference to bz30675.
bzimport added a subscriber: Unknown Object (MLST).

More info: Unicode defines the Default Unicode Collation Element Table (DUCET) and asks vendors and application developers to "tailor" it to meet their exact requirements. See http://www.unicode.org/reports/tr10/tr10-23.html#Tailoring

One such tailoring is the CLDR collation data, which is apparently more accurate than the DUCET for many languages.
In MediaWiki, User:Simetrical recently added DUCET-based collation data generation; see
http://www.mediawiki.org/wiki/User:Simetrical/Collation
But that code uses the plain DUCET, not the CLDR-tailored DUCET: see maintenance/language/generateCollationData.php, which uses http://www.unicode.org/Public/UCA/latest/allkeys.txt

Unicode provides an alternate version of allkeys.txt named allkeys_CLDR.txt; see
http://unicode.org/Public/UCA/6.0.0/CollationAuxiliary.html
The variations depend on the language. We will need to look closely at this data set to see the differences and whether it can make collation more accurate.

I also agree, simply because the most recent definition of CLDR no longer uses the DUCET in its root locale. It now defines an alternate collation, initially derived from the DUCET but with many rearrangements, notably in the variable collation elements, which are now subgrouped more logically while preserving, within each subgroup, the relative order they had in the DUCET.

The CLDR still allows using the default DUCET, but the standard DUCET is now considered a tailoring of the new CLDR root version.

The differences mostly concern non-letters, but there are a few other rearrangements (notably within letter-like symbols, currency symbols, and some format controls). These also make language-specific tailorings easier to define, including the relative reordering of distinct scripts within a language that is written using multiple scripts.

The CLDR version of the DUCET is then much better, as it requires much less maintenance work for each language-specific tailoring.

To make this work, the CLDR version of the DUCET used in the root locale adds pseudo collation elements. These are not based on standard characters; they serve only as markers separating subgroups of collation elements, and the table defines specific primary collation weights for them.

The CLDR version also defines new pseudo collation elements usable as separators for sorting rows of data structured in separate fields, so that all fields will first sort in parallel at primary level, before comparing all fields to the next level (that's something you can't do simply by using a stable sort starting by fields of lower importance up to the field with first importance).
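To illustrate why "all fields in parallel at primary level" differs from sorting field by field, here is a minimal sketch. It does not use real UCA weights: `primary` and `secondary` are toy stand-ins (base letters only, then letters with accents), and all function names are hypothetical.

```python
import unicodedata

def primary(s):
    # Toy primary key: base letters only, case-folded (accents and case ignored).
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()

def secondary(s):
    # Toy secondary key: accents are kept, case is still ignored.
    return unicodedata.normalize("NFD", s).casefold()

def per_field_key(row):
    # Each field is compared completely (both levels) before the next field.
    return tuple((primary(f), secondary(f)) for f in row)

def parallel_key(row):
    # All fields at the primary level first, then all fields at the secondary level.
    return (tuple(primary(f) for f in row), tuple(secondary(f) for f in row))

rows = [("Cote", "Zed"), ("C\u00f4te", "Adam")]
by_field = sorted(rows, key=per_field_key)     # the accent in field 1 decides
in_parallel = sorted(rows, key=parallel_key)   # "Adam" < "Zed" decides first
```

With the per-field key, the accent difference in the first field settles the order before the second field is ever looked at. With the parallel key, the primary difference in the second field wins, which is the behaviour the field separator elements are meant to give.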

For MediaWiki itself, there's nothing to change if it uses ICU, on the server side, except just upgrading it.

But if MediaWiki uses its own code, it may not be able to process the pseudo collation elements defined as markers between subgroups of collation elements (notably between whitespace, symbols, and punctuation in the variable elements, then at the start of the group of numbers, then for the group of letters, which is now split by script with one marker per script). These markers are only needed to define tailorings, so as long as this specific code cannot instantiate the language-specific tailorings, the pseudo-markers may simply be skipped (ignored). You can easily detect them: they are either defined using a specific syntax between square brackets, with a marker type followed by a value, such as "[script Arab]", or remapped to code points that are non-characters (so they are NOT encoded with Unicode, but displayed using an escape syntax such as \uFFFF in the parsable text formats used by the CLDR data). This syntax is not visible in the new binary format now documented in the UCA specification and, more precisely, in the LDML specifications used by the CLDR.
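A rough sketch of what skipping those markers could look like when reading an allkeys-style table. This is not what generateCollationData.php does; the function names, the sample lines, and every weight in them are made up for illustration, and the exact layout of the real file should be checked before relying on the pattern.

```python
import re

# One collation element: "[.PPPP.SSSS.TTTT]" or "[*PPPP.SSSS.TTTT]" (variable),
# optionally followed by further levels, which are ignored here.
ELEMENT = re.compile(r"\[([.*])([0-9A-F]+)\.([0-9A-F]+)\.([0-9A-F]+)[^\]]*\]")

def is_noncharacter(cp):
    # U+FDD0..U+FDEF, plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def parse_table(text):
    """Map each string to a list of (is_variable, primary, secondary, tertiary)."""
    entries = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or line.startswith("@"):
            continue  # blank line, comment, or directive such as @version
        if line.startswith("["):
            continue  # bracketed pseudo-marker such as "[script Arab]"
        chars, sep, weights = line.partition(";")
        if not sep:
            continue  # not a mapping line
        cps = [int(h, 16) for h in chars.split()]
        if any(is_noncharacter(cp) for cp in cps):
            continue  # marker remapped to a noncharacter such as U+FFFE
        entries["".join(map(chr, cps))] = [
            (flag == "*", int(p, 16), int(s, 16), int(t, 16))
            for flag, p, s, t in ELEMENT.findall(weights)
        ]
    return entries

# Hypothetical sample; the weights are illustrative only.
SAMPLE = """\
@version 6.0.0
[script Arab]
FFFE  ; [*0001.0020.0002] # hypothetical marker on a noncharacter
0061  ; [.15EF.0020.0002.0061] # LATIN SMALL LETTER A
0041  ; [.15EF.0020.0008.0041] # LATIN CAPITAL LETTER A
"""
table = parse_table(SAMPLE)
```

Only the two real letters survive in `table`; both kinds of marker are dropped before any "first letter" logic sees them.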

This wouldn't affect the sorting of stuff (because we use intl for that, not our own code as Verdy said). However, we do use our own code to determine which "first letter" to use for all the first letter headers on categories. (that's what generateCollationData.php does)

Changing allkeys.txt to allkeys_CLDR.txt results in very few changes. The only change is that two currency symbols, ₨ (U+20A8 'RUPEE SIGN') and ﷼ (U+FDFC 'RIAL SIGN'), would now be used in the first letter headers on category pages.

The change is probably because allkeys.txt treats ₨ (Rupee sign) like plain ole' Rs (Latin R followed by Latin s), whereas allkeys_CLDR.txt seems to treat it as a currency symbol and sorts it far away from Latin R followed by Latin s. (I think, anyway; I'm new to some of this fancy Unicode sorting stuff.) I assume the Arabic currency symbol is similar, but I didn't check.
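One quick way to check both symbols is to look at their compatibility decompositions with Python's standard unicodedata module. That allkeys.txt derives its expansions for these characters from the decompositions is an assumption here, but it fits the behaviour described above. The helper name is hypothetical.

```python
import unicodedata

def compat_decomposition(ch):
    # NFKD applies compatibility decompositions, e.g. RUPEE SIGN -> "Rs".
    return unicodedata.normalize("NFKD", ch)

for ch in ("\u20a8", "\ufdfc"):
    parts = " + ".join(unicodedata.name(c) for c in compat_decomposition(ch))
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {parts}")
```

The Rupee sign decomposes to Latin "R" followed by "s", and the Rial sign decomposes to a sequence of Arabic letters, so the two cases do look alike.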

But again, we'd probably want to use whichever one the intl extension is using. On my local test server that seems to be allkeys.txt. If you mix and match things, you'd end up with the article "Rss feed" being sorted under the Rupee sign, which is probably not good.

The CLDR-modified DUCET basically changes only the relative order of primary weights. But yes, it includes some notable differences for things like currency symbols.
In the CLDR version, the Rupee sign will no longer sort with Latin letters, meaning that it will no longer be decomposed and that its first primary weight will now be distinct from the primary weight given to Latin letter R. This also means that the "first letter" will need to be different.
To implement the "first letter", you need to do it consistently with the collation order, so the Rupee sign will need to use the Rupee sign itself as the "first letter", instead of Latin small letter r.
You can infer the "first letter" from the DUCET by looking for the first collation element that has the same primary weight and the smallest weights for the next levels. But to get the fully ordered list needed to make that determination, you first need to decide what to do with variable elements: should they all sort with primary weights, or as ignorables? This radically changes the ordered sequence of collation elements and which "first letter" you'll get (note that variable elements do not interleave in the DUCET, at least for the first primary weight when they are expansions, but this is not necessarily the case with locale-specific tailorings).
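A minimal sketch of that inference, assuming the table has already been parsed into a mapping from each string to a list of (is_variable, primary, secondary, tertiary) tuples. The function name, the two toy tables, and all weights are hypothetical; the `variable_ignorable` switch stands for the decision about variable elements mentioned above.

```python
def first_letter_map(table, variable_ignorable=False):
    """Map each string to the single-element entry sharing its leading primary
    weight that has the smallest secondary and tertiary weights."""
    def leading_primary(elements):
        for is_variable, primary, _secondary, _tertiary in elements:
            if primary == 0 or (is_variable and variable_ignorable):
                continue  # ignorable at the primary level
            return primary
        return None

    representative = {}  # primary weight -> (rank, string)
    for string, elements in table.items():
        if len(elements) != 1:
            continue  # expansions cannot head a group
        primary = leading_primary(elements)
        if primary is None:
            continue
        rank = (elements[0][2], elements[0][3], string)
        if primary not in representative or rank < representative[primary][0]:
            representative[primary] = (rank, string)

    result = {}
    for string, elements in table.items():
        primary = leading_primary(elements)
        if primary in representative:
            result[string] = representative[primary][1]
    return result

# Toy tables with made-up weights: the Rupee sign expands to R + s in the
# first one and has a primary weight of its own in the second one.
DUCET_LIKE = {
    "r": [(False, 0x1700, 0x20, 0x02)],
    "R": [(False, 0x1700, 0x20, 0x08)],
    "s": [(False, 0x1710, 0x20, 0x02)],
    "`": [(True, 0x0300, 0x20, 0x02)],
    "\u20a8": [(False, 0x1700, 0x20, 0x04), (False, 0x1710, 0x20, 0x04)],
}
CLDR_LIKE = {**DUCET_LIKE, "\u20a8": [(False, 0x0500, 0x20, 0x02)]}

ducet_first = first_letter_map(DUCET_LIKE)
cldr_first = first_letter_map(CLDR_LIKE)
```

In the first toy table the Rupee sign lands under "r"; in the second it becomes its own first letter. Calling the function with `variable_ignorable=True` drops the grave accent from the result entirely, which shows how much the choice about variable elements matters.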

One example: U+0060 (the ASCII "GRAVE ACCENT") has a possible tailored decomposition as SPACE + COMBINING GRAVE ACCENT, in which case it would sort with SPACE, with only a secondary (accent) difference, using an expansion. In that case, its "first letter" would become SPACE, not itself. There are more complex cases of "variable collation elements" that need special handling in tailorings: "Modifier Letters", Hebrew and Tibetan "cantillation marks", and Braille patterns. For these cases, you must be extremely careful about how you compute the "first letter", or it will be completely out of sync with the collation order.

Moving component from i18n -> category. While it is true this is an i18n issue, it's primarily used with categories, and that's where most of the other collation bugs are located.