Page MenuHomePhabricator

IcuCollation doesn't prune first letter elements that duplicate a prefix of another first letter's sortkey
Closed, ResolvedPublic

Description

Issue: The collation algorithm changes slightly between different versions of php's intl library. The first letters' algorithm we use to display the first letters on category pages is sensitive to these changes. Specifically the first-letters-root.ser file goes with a specific version of the icu library used with php intl extension

Affect: Some of the category headers will be incorrect. Things may switch back and forth between different headers. For example "Aa" was sorted under a U+214D instead of the letter A in MatmaRex's test wiki.


What needs to be done:

*We need to get the version of php intl's extension in use. This will probably have to be done by grepping for ICU version in phpinfo()'s output

*We need to have versions of first-letters-root.ser that correspond to different versions of icu library. These are fairly easy to make (I believe). Simply need to feed maintinance/languages/generateCollationData.php different data files corresponding to different versions of unicode. (Need to double check that icu library is using the unicode files we think it is. I have a vague memory of it changing to some file from the CLDR stuff a couple versions ago)

*Some of the docs about which data files to download for generateCollationData.php need to be updated. Right now its telling people to download the latest version of one file, and the version corresponding to unicode 6.0 for another.


Version: 1.21.x
Severity: major

Details

Reference
bz43740

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 1:28 AM
bzimport set Reference to bz43740.
bzimport added a subscriber: Unknown Object (MLST).

I've split three sub-bugs based on your three points as piecemeal tasks: bug 43801, bug 43802 and bug 43804.

I'm hoping to work on them, starting with the easiest – doco one ;)

It looks kind of like the code in Collation.php

// Primary collision
// Keep whichever one sorts first in the main collator
if ( $this->mainCollator->compare( $letter, $letterMap[$key] ) < 0 ) {
         $letterMap[$key] = $letter;
}

Was meant to handle this situation (or I misunderstand the intention of the code)

(In reply to comment #2)

It looks kind of like the code in Collation.php

// Primary collision
// Keep whichever one sorts first in the main collator
if ( $this->mainCollator->compare( $letter, $letterMap[$key] ) < 0 ) {
         $letterMap[$key] = $letter;
}

Was meant to handle this situation (or I misunderstand the intention of the
code)

So reason that block of code didnt prevent the appearant primary key collisions caused by the icu version mismatch was because they weren't collisions. The two cases I saw were a rupee header in the middle of the r section and an A/S symbol in the middle of the a's. Both cases the symbols would be expanded to have more than one primary weight-so they weren't exact collisions, only a prefix was equal

(In reply to comment #3)

(In reply to comment #2)

It looks kind of like the code in Collation.php

// Primary collision
// Keep whichever one sorts first in the main collator
if ( $this->mainCollator->compare( $letter, $letterMap[$key] ) < 0 ) {
         $letterMap[$key] = $letter;
}

Was meant to handle this situation (or I misunderstand the intention of the
code)

So reason that block of code didnt prevent the appearant primary key
collisions
caused by the icu version mismatch was because they weren't collisions. The
two cases I saw were a rupee header in the middle of the r section and an A/S
symbol in the middle of the a's. Both cases the symbols would be expanded to
have more than one primary weight-so they weren't exact collisions, only a
prefix was equal

Yes, the preprocessing done on the first letter array in getFirstLetterData() was meant to defend against changes in ICU library version. I think the best solution to this bug would be to fix this particular case of symbols expanded to multiple primary weights, rather than to just abandon the whole design idea per the request on bug 43802.

For the record, Matma Rex was using Gentoo Hardened 4.5.4, ICU 49.1.1 and intl 1.1.0.

(This can now be seen live at http://users.v-lo.krakow.pl/~matmarex/testwiki/index.php?title=Kategoria:Test , but won't stay there forever. Feel free to edit that wiki, and please disregard the spambots.)

(The same also happens on ICU 50.1, which is using Unicode 6.2. 49.1.1 is using Unicode 6.1.)

My I5d2a4e7e that simply removed the file was -2'd by Tim: "Unless someone can tell me why the bug Bawolff describes in comment 3 really is impossible to fix, this is a -2 for me.".

The summary of this should probably be reworded, but I'll leave that to someone who knows what they are doing.

(Raising severity to major, as this kind of breaks the feature of using ICU to collate category elements.)

Ok, I read up on icu, after quite a bit of googling, this actually looks not that complicated. From what I gather (if I read the docs right, which is a very big if), there should be no two primary collation elements where one collation element in its entirety is a prefix of some other collation. See https://ssl.icu-project.org/repos/icu/icuhtml/trunk/design/collation/ICU_collation_design.htm specificly:
R2. A fractional weight cannot exactly match the initial bytes of another fractional weight at the same level.

So assuming nothing funky is done to compress the sort keys (which I don't happens on the primary level, at least not currently), just looking for matching prefixes should work.

Anyhow gerrit change 55503. I still need to double check that unsetting an element of an array doesn't modify its sorted order (It doesn't seem to, but should double check).

It also might be prettier if the check for duplicate prefix was merged into the general duplicate check, but I didn't see an easy way of doing.

Anyhow, feedback appreciated.

Yay, I4bd3d39e merged. Needs backporting to 1.21 now - CC-ing Mark.