Kurdish Wikipedia: Alphabetical order in the categories (collation)
Open, MediumPublicFeature

Description

Hello. Is it possible to change alphabetical order in the categories for all Kurdish projects (ku.wikipedia, ku.wiktionary, ku.wikiquote and ku.wikibooks)?

The Kurdish alphabet (Kurmanci) uses Latin letters (31 letters). Alphabetical order:

Aa · Bb · Cc · Çç · Dd · Ee · Êê · Ff · Gg · Hh · Ii · Îî · Jj · Kk · Ll · Mm · Nn · Oo · Pp · Qq · Rr · Ss · Şş · Tt · Uu · Ûû · Vv · Ww · Xx · Yy · Zz

26 letters like the English language + 5 diacritical letters (Çç, Êê, Îî, Şş, Ûû): http://en.wikipedia.org/wiki/Kurdish_alphabets#Hawar_alphabet

The problem is that the diacritic letters (Çç, Êê, Îî, Şş, Ûû) do not follow the Kurdish alphabetical order (see above) in the categories: they are placed at the end.
For example here (Ç, Î, Û placed at the end): http://ku.wikipedia.org/wiki/Kategor%C3%AE:Dewlet%C3%AAn_Asyay%C3%AA
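To make the difference concrete, here is a minimal sketch (not MediaWiki code) that compares a plain code-point sort with a sort keyed on the 31-letter order listed above. The names `KURDISH_ALPHABET`, `RANK` and `kurdish_key` and the sample titles are made up for illustration only.

```python
# The 31 letters of the Kurmanci alphabet, in the order given above.
KURDISH_ALPHABET = "abcçdeêfghiîjklmnopqrsştuûvwxyz"
RANK = {letter: index for index, letter in enumerate(KURDISH_ALPHABET)}


def kurdish_key(title):
    # Each character maps to its position in the alphabet; anything that is
    # not one of the 31 letters sorts after them, ordered by code point.
    return [(RANK.get(ch, len(RANK)), ch) for ch in title.lower()]


# Sample titles, chosen only so that several first letters carry diacritics.
titles = ["Çîn", "Iraq", "Ûrdun", "Tirkiye", "Îran", "Cezayîr", "Zimbabwe"]

by_code_point = sorted(titles)                 # Ç, Î, Û end up after Z
by_kurdish = sorted(titles, key=kurdish_key)   # Ç after C, Î after I, Û after U

print(by_code_point)
print(by_kurdish)
```

With a code-point sort the titles starting with Ç, Î and Û land after Z, which matches what the category page shows; with the alphabet-based key they land next to C, I and U.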

More generally, can all sorted lists be ordered by the Kurdish alphabet? Can we do something?

Sorry if I'm not at the right place but I do not know who else to ask.
Thank you in advance.


Version: unspecified
Severity: enhancement
URL: http://ku.wikipedia.org/wiki/Destp%C3%AAk

Details

Reference
bz46235

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 1:24 AM
bzimport set Reference to bz46235.
bzimport added a subscriber: Unknown Object (MLST).

This is the right place.


At first glance I didn't see Kurdish on the list of supported collations for ICU that I found on Google, which would be a problem. But maybe I just missed it. Will have to investigate further.

Yup, it seems to be unsupported; trying to use the uca-ku collation gives the same results as if it was uca-default: http://users.v-lo.krakow.pl/~matmarex/testwiki-ku/index.php?title=Kategor%C3%AE:Test

(So no separate headings for the letters with diacritical marks, and e.g. C and Ç are considered the same letter when sorting - this isn't visible on that page right now, since if there's a conflict, the diacritical version is placed after the default one.)

ghybu, does that seem like an improvement over the current state?

No, I don't see any improvement.

I mean, is the category sorting on the testwiki I linked better than the one currently visible on ku.wikipedia?

(Also, feel free to create or modify pages there to test the behavior.)

Yes, it is much better, this is what I wanted. But, it's necessary to also have sections (Ç, Ê, Î, Ş, Û); for example to use this type of template: http://ku.wikipedia.org/wiki/%C5%9Eablon:TOC_Kategor%C3%AE

(In reply to comment #5)

In the test wiki linked above, the accents are secondary differences (used as tie breakers), whereas in Kurdish they should be considered different letters. (Look at the section for C in that link, where I just added more examples.) Thus it can't have separate section headers, as they aren't sorted separately.
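A rough sketch of the two behaviours, in case it helps. This is a simplification and not how ICU actually implements its comparison levels: `tie_breaker_key` ignores diacritics first and only uses them to break ties, while `distinct_letter_key` treats Ç as its own letter after C. All names and sample strings here are hypothetical.

```python
import unicodedata

ALPHABET = "abcçdeêfghiîjklmnopqrsştuûvwxyz"
RANK = {letter: index for index, letter in enumerate(ALPHABET)}


def strip_marks(text):
    # Decompose (Ç -> C + combining cedilla) and drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tie_breaker_key(title):
    # Base letters decide the order; the accented form only breaks ties.
    return (strip_marks(title).lower(), title)


def distinct_letter_key(title):
    # Every accented letter has its own rank in the alphabet.
    composed = unicodedata.normalize("NFC", title).lower()
    return [RANK.get(ch, len(RANK)) for ch in composed]


titles = ["Cc", "Çb", "Ca", "Ça"]

as_tie_breaker = sorted(titles, key=tie_breaker_key)
as_distinct_letter = sorted(titles, key=distinct_letter_key)

print(as_tie_breaker)      # Ç entries are interleaved with C entries
print(as_distinct_letter)  # all C entries come first, then all Ç entries
```

With the first key the Ç titles are mixed into the C section, so a separate Ç heading has nowhere to go; with the second key the Ç titles form one contiguous block.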

Comment 3 is asking whether, even though the behaviour is wrong, it is more or less wrong than the current behaviour on ku wikis.

I have also done a test; the behavior is not good. The current version of ku.wiki is better. I also think separate sections are required. Thank you for trying!

A solution was found here: bug 30287 and bug 50311

Can we do the same?

(In reply to comment #8)

Not really. fa is in the list of collations supported by CLDR at http://www.unicode.org/repos/cldr/trunk/common/collation/ . ku is not. I think the next step in this bug would be to get CLDR to add ku as a collation ( http://cldr.unicode.org/index/cldr-spec/collation-guidelines )

[Or i suppose making php's intl bindings to the icu library suck a little less so we could make our own collation]

(In reply to comment #0)
Although little used and not included in the alphabet, the letters "Ḧḧ" and "Ẍẍ" must be included in the sorting, so we now have (26 letters as in English + 7 diacritical letters):

Aa · Bb · Cc · Çç · Dd · Ee · Êê · Ff · Gg · Hh · Ḧḧ · Ii · Îî · Jj · Kk · Ll · Mm · Nn · Oo · Pp · Qq · Rr · Ss · Şş · Tt · Uu · Ûû · Vv · Ww · Xx · Ẍẍ · Yy · Zz

(In reply to comment #9)
I made a request for Kurdish alphabet.

I made a request to CLDR, and it seems that the problem is fixed: http://unicode.org/cldr/trac/ticket/6527

(In reply to ghybu.w from comment #11)

Cool, thanks. (For reference, upstream revisions are http://unicode.org/cldr/trac/changeset/9765 http://unicode.org/cldr/trac/changeset/9761).

So open question: Does PHP's intl extension use ICU data in the "seed" directory?

If so, this is now pending us upgrading to CLDR 25, which will happen at ?? (Probably not for a little while, not sure when)

Any answers to these questions? How can we know which version of CLDR is currently in use on the cluster?

In T48235#1534218, @TTO wrote:

Any answers to these questions? How can we know which version of CLDR is currently in use on the cluster?

1.26wmf17 and 1.26wmf18 are the current MW branches on the cluster:
https://git.wikimedia.org/blob/mediawiki%2Fextensions%2Fcldr.git/refs%2Fheads%2Fwmf%2F1.26wmf17/cldr.php
https://git.wikimedia.org/blob/mediawiki%2Fextensions%2Fcldr.git/refs%2Fheads%2Fwmf%2F1.26wmf18/cldr.php

Both claim CLDR 27?

CLDR extension != libicu, unfortunately

reedy@mw1101:~$ dpkg -l | grep icu
ii  libicu48:amd64                       4.8.1.1-14+trusty2                   amd64        International Components for Unicode
ii  libicu52:amd64                       52.1-3ubuntu0.3                      amd64        International Components for Unicode
ii  ploticus                             2.42-1                               amd64        script driven business graphics package
reedy@mw1101:~$ apt-cache rdepends libicu48
libicu48
Reverse Depends:
  hhvm
  utfnormal
  libicu48-dbg
  libicu-dev
  icu-devtools
  hhvm
  libqtcore4
  libqtcore4
reedy@mw1101:~$

Our current hhvm is dependent on libicu48. Ubuntu 15.10 has libicu55 http://packages.ubuntu.com/wily/libicu55

php5-intl (5.5.9) is built against libicu52

As the mw appservers aren't going to be upgraded to 16.04 (or else a debian version depending on how ops decide to do it) till probably a year from now...

We should look at asking ops to probably backport libicu55, and rebuild hhvm (and php?) against it so that we can use the features

IIRC, from previous things like this, we will need to rebuild category collation everywhere afterwards

The CLDR ticket was closed as fixed 2 years ago, but there's still no ku.xml file in CLDR. I'm not sure what happened, but I've opened a new ticket: http://unicode.org/cldr/trac/ticket/9460.

It's in the seed directory ( http://www.unicode.org/repos/cldr/trunk/seed/collation/ ), which I think is where they put their beta collations or something. I have no idea how useful that is to us.

Hmm, it looks like they haven't finalized the collation due to some uncertainties caused by the various competing Kurdish alphabets and variations.

@Ghybu: One outstanding question is how to handle the ⟨'⟩ character (apparently used as a non-standard letter for ع sounds). What glyph is properly used for this character? U+02BB? U+02BC? U+0027? Where should it sort in the alphabet? Is there a reference for this?

Also, does ⟨Hʿ⟩ sort with ⟨H⟩? Does ⟨Gh⟩ sort with ⟨G⟩? If not, where should those characters be sorted?

I do not understand; there is no uncertainty about the alphabet used. This alphabet is also used by Google Translate (look at the Kurdish keyboard):

https://translate.google.com/?hl=ku

Hmm, it looks like they haven't finalized the collation due to some uncertainties caused by the various competing Kurdish alphabets and variations.

Are you basing that on comment 1 of http://unicode.org/cldr/trac/ticket/9460 ? Because my reading of that comment is not so much about the uncertainties mentioned in http://unicode.org/cldr/trac/ticket/6527#comment:2 but in uncertainties of the general locale details (i.e. Not the sorting stuff, but things like what the names of the months are. I guess the stuff in http://www.unicode.org/repos/cldr/trunk/seed/main/ku.xml that's marked unconfirmed? Or maybe more stuff that should go in that file but isn't there yet?)

@Ghybu: Can you provide answers to any of the questions in T48235#2325460? I would assume that ⟨Hʿ⟩ sorts with ⟨H⟩, ⟨Gh⟩ sorts with ⟨G⟩, and ⟨'⟩ sorts as whatever character it is rendered with (as they are not official letters according to the Wikipedia article), but it would be good to have confirmation.

Also, when you say "this alphabet", which alphabet are you referring to? The Universal Kurdish Alphabet or Kurdish Unified Alphabet? The current seed collation in cldr seems to be an awkward mixture of the two (that isn't complete for either alphabet).

@Bawolff: I'm basing that statement on my own analysis of the seed collation (which seems to be a confused mess). For example, if the collation was based on the Kurdish Unified Alphabet, it should have 3 separate U letters, not 2. If it was based on the Universal Kurdish Alphabet (which only has 2 separate U letters) it should have 2 separate N letters (not 1).

@kaldari:
I refer to the "Universal Kurdish Alphabet" (Bedirxan alphabet or Hawar alphabet). As I said, this is the alphabet used by Google Translate, Wikipedia, and virtually all Kurdish authors and websites.

The "Kurdish Unified Alphabet" (Yekgirtú) is very marginal; it belongs to the field of academic research.

For the sorting and for this alphabet (Universal Kurdish Alphabet), see reference (in English):

Michael L. Chyet: https://en.wikipedia.org/wiki/Michael_L._Chyet

Michael L. Chyet (2003). Kurdish-English Dictionary. Ferhenga Kurmancî-Inglizî. New Haven & London: Yale University Press. Standard Kurdish Orthography Table: A, B, C, Ç, Ç', D, E, 'E (E'), Ê, F, G, H, Ḧ (H'), I, Î, J, K, K', L, Ł, M, N, O, P, P', Q, R, Ř, S, Ş, T, T', U, Û, V, W, X, Ẍ, Y, Z, '.

The character used is the U+0027.

But these characters: Ḧ (H'), Ẍ, Ł, Ř, 'E (E'), Ç', P', T', K', ' are used as special letters alongside the "Universal Kurdish Alphabet"; they are not part of this alphabet. They are very rarely used, but must be included in the sorting.

I remind you that there are only 31 letters in the Kurdish alphabet (Universal Kurdish Alphabet):
Aa · Bb · Cc · Çç · Dd · Ee · Êê · Ff · Gg · Hh · Ii · Îî · Jj · Kk · Ll · Mm · Nn · Oo · Pp · Qq · Rr · Ss · Şş · Tt · Uu · Ûû · Vv · Ww · Xx · Yy · Zz

@Ghybu: According to the English Wikipedia article, the Universal Kurdish Alphabet also includes the letters Ň and Ü, which are not in the list you provide above. The article, however, does not provide any reference. Are the letters Ň and Ü used in the Kurdish Wikipedia, and if so, should they be sorted with N and U respectively, or should they be considered separate letters? Or maybe Ü is an alternate form of Û?

@kaldari

Indeed, these two letters (Ň and Ü) are used for Southern Kurdish (see https://en.wikipedia.org/wiki/Southern_Kurdish and see also Southern Kurdish alphabet: https://en.wikipedia.org/wiki/Southern_Kurdish_alphabet#Kirma.C5.9Fan.C3.AE_lat.C3.AEn_alphabet) by some authors who write in Latin letters:

Muhamadreza Bahadur : https://ku.wikipedia.org/wiki/Muhamadreza_Bahadur

For example, "Kuře Pełeň" by Muhamadreza Bahadur :
https://books.google.fr/books?id=hG7mCQAAQBAJ&printsec=frontcover&source=gbs_ge_summary_r&cad=0#v=onepage&q&f=false

See also Kirmaşanî Alphabet and Pronunciation Guide (Muhamadreza Bahadur) : https://www.academia.edu/12911202/Kirma%C5%9Fan%C3%AE_Alphabet_and_Pronunciation_Guide?auto=download

We can include them in the sorting.

kaldari renamed this task from ku.wikipedia: Alphabetical order in the categories to Kurdish Wikipedia: Alphabetical order in the categories. Aug 29 2016, 11:51 PM

I proposed a new set of Kurdish collation data (based on @Ghybu's comments and the relevant Wikipedia articles) at http://unicode.org/cldr/trac/ticket/9748. Let me know if anything needs to be changed.

@kaldari: Thank you ! Nothing to add for now...

kaldari renamed this task from Kurdish Wikipedia: Alphabetical order in the categories to Kurdish Wikipedia: Alphabetical order in the categories (collation). Oct 22 2016, 1:40 AM
kaldari set Security to None.

Looking at the CLDR ticket, this doesn't seem to be proceeding very quickly…

By the way, we now have precedent (and a framework) for implementing collations locally in MediaWiki – see T162823 which added an entirely custom collation for Bashkir. Note that the custom collation code is very simple – it doesn't support any more interesting things like collation of digraphs or accents. If Kurdish doesn't require that for reasonably correct ordering, we could probably use that approach for it too.
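For illustration only, a simple alphabet-based collation of that kind could look roughly like the sketch below. It is written in Python rather than as MediaWiki code, it is not the Bashkir implementation, and the names `sort_key`, `heading` and `category_sections` are hypothetical. It orders titles letter by letter and derives one section heading per first letter, so Ç, Ê, Î, Ş and Û get their own sections; it does not handle digraphs or accents as tie breakers.

```python
from itertools import groupby

# Uppercase form of the 31-letter alphabet from the task description.
ALPHABET = "ABCÇDEÊFGHIÎJKLMNOPQRSŞTUÛVWXYZ"
RANK = {letter: index for index, letter in enumerate(ALPHABET)}


def sort_key(title):
    # Letters outside the alphabet sort after Z, ordered by code point.
    return [(RANK.get(ch, len(RANK)), ch) for ch in title.upper()]


def heading(title):
    # The section heading is simply the uppercased first character.
    return title[:1].upper()


def category_sections(titles):
    # Sort with the custom key, then group neighbours that share a heading.
    ordered = sorted(titles, key=sort_key)
    return [(head, list(group)) for head, group in groupby(ordered, key=heading)]


for head, members in category_sections(["Şam", "Sûrî", "Çîn", "Cem", "şev"]):
    print(head, members)
```

Because every letter is a single character with a fixed rank, this approach covers the 31 official letters; the apostrophe forms such as Ç' or K' would need extra handling.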

Looking at the CLDR ticket, this doesn't seem to be proceeding very quickly…

@matmarex: Indeed, more than 4 years :)) But I am not in a hurry, I just hope to live long enough to see it :)

Unfortunately, it appears to have many other uses. For example, to have interwiki in Kurdish and not in English.

Can someone here tell me why it's so slow? Is this the usual procedure? I notice that they do not even want to discuss it or tell me what's wrong...
As far as I am concerned, I am giving up on this CLDR request and I expect nothing more.

Unfortunately yes, sometimes CLDR is very slow :(

However, not all is lost. It recently became possible to create custom collation inside MediaWiki. It should be removed once it's implemented upstream in CLDR, but there's no reason to be blocked on that any longer.

Here's an example: T162823: Changing the alphabetical sorting (collation) @ ba.wikipedia.org. (Also a blog post.)

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 12:24 PM
Aklapper removed a subscriber: wikibugs-l-list.

Hello! "ku" is now in the list of collations supported by CLDR: http://www.unicode.org/repos/cldr/trunk/common/collation/

It was first included in CLDR 34: https://cldr.unicode.org/index/downloads/cldr-34

…with collation data: https://www.unicode.org/cldr/cldr-aux/charts/34/delta/ku.html

…CLDR 34 was included in ICU 63: https://icu.unicode.org/download

…and we upgraded to ICU 63 in late 2020: T264991: Upgrade the MediaWiki servers to ICU 63

So it should be possible to implement this now.