Allow collation to be specified per category
Open, Medium, Public

Description

It was suggested on IRC by Skalman12 (as well as in some other places; presumably T2164, but I don't feel like reading the 5 billion comments on it) that it should be possible to override the collation per category. Specifically, the Wiktionaries would find it useful.

It's not entirely clear whether this is a wontfix, but I think it's possible.

As a further aside, this also won't be very useful until some future time when we actually have multiple useful collations.

Suggested ways of doing this that I've heard so far:
*Global config variable that lists the specific categories that use a different collation.

Downside: That's something very specific to put in LocalSettings.php. It'd probably be a long list, and it would probably change often enough that the number of shell requests would be annoying. A maintenance script would also have to be run each time this is done.

Downside: While the categories are being re-sorted, the category page becomes kind of borked. It would be a good vandalism target to do this to [[category:Living people]]. Thus we probably want to limit collation changes to admins, so people don't abuse it.

*Put config info in a system message (MediaWiki namespace page).
Upside: Wiki users can configure it themselves. Limited to admins so people don't be stupid.

Downside: Not as easy to trigger the re-sort category jobs. Not as clear to the end user why category x is sorted differently from category y. And configuration in system messages is kind of evil.

*Use a special page. (Similar to how we do page protection). Could have a link in the toolbox for sufficiently privileged users to "Change category sorting".

Upside: Kind of a nicer UI. Could present a list of valid choices to user, instead of expecting them to know, along with help info about the various choices. Can have a separate right for changing collations.

Downside: Slightly more complex to implement; it would require a new db table to manage the info. Also, to the average user wondering why this category sorts differently than others, it's not as obvious as the parser func method, since nothing is different in the page source (although there could be a notice similar to that of page protection perhaps, not sure if that'd entirely make sense).


I personally think the special page approach is the best way to do this (assuming that we do do that).

From a backend perspective, what would need to be done (I think anyways):

*Collation::singleton would have to be changed to accept an argument for what category it is. It would probably need a more appropriate name if it's no longer a singleton.
*Collation would probably need a static method to map category names to collation names, so we can fill out the cl_collation field of the categorylinks table properly.
*We would need to implement support in the job queue to fix the cl_sortkey field when we change the collation for a category. Probably not that hard, since we have a maintenance script that does something close to that already. The relevant maintenance script expects everything to use the same collation name, if I recall, so that would also have to be changed.
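
The three backend steps above can be sketched roughly as follows. This is only an illustration in Python, not MediaWiki code: every name in it (`COLLATIONS`, `collation_name_for`, `collation_for`, `CategoryLink`, `resort_category`) is hypothetical, the two toy collations are placeholders, and the in-memory list stands in for the categorylinks table.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical names throughout; none of these are MediaWiki APIs.
SortKeyFn = Callable[[str], str]

# Toy collations: each maps raw text to a sort key.
COLLATIONS: Dict[str, SortKeyFn] = {
    "uppercase": lambda text: text.upper(),
    "identity": lambda text: text,
}
DEFAULT_COLLATION = "uppercase"

# Per-category overrides (stand-in for wherever the setting is stored).
category_collation: Dict[str, str] = {}


def collation_name_for(category: str) -> str:
    """Step 2: map a category name to a collation name (for cl_collation)."""
    return category_collation.get(category, DEFAULT_COLLATION)


def collation_for(category: str) -> SortKeyFn:
    """Step 1: a per-category factory instead of a single global instance."""
    return COLLATIONS[collation_name_for(category)]


@dataclass
class CategoryLink:
    category: str
    title: str            # raw text the sort key is derived from
    cl_sortkey: str = ""
    cl_collation: str = ""


def resort_category(links: List[CategoryLink], category: str,
                    new_collation: str) -> int:
    """Step 3: recompute sort keys for one category after its collation changes.

    Returns the number of rows that were updated.
    """
    if new_collation not in COLLATIONS:
        raise ValueError(f"unknown collation: {new_collation}")
    category_collation[category] = new_collation
    to_key = collation_for(category)
    updated = 0
    for link in links:
        # Only touch rows of this category that are still on an old collation.
        if link.category == category and link.cl_collation != new_collation:
            link.cl_sortkey = to_key(link.title)
            link.cl_collation = new_collation
            updated += 1
    return updated
```

Because rows already carrying the new collation name are skipped, the re-sort could be run in batches and resumed, which is roughly what a job-queue version would need.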

Thoughts?

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 11:30 PM
bzimport set Reference to bz28397.

ayg wrote:

The system was designed so that this would be possible without schema changes, by the request of Wiktionary users, although the feature itself was left out of the initial version. I'd say you want to have a magic word that only works on category pages and sets a page_props row. To prevent DoS, very large categories can be protected or FlaggedRevs-ed, like very large templates are. The steps you describe otherwise seem basically right.

I'm going to do it, after my multiple collation support is done. I plan to add a {{DEFAULTCOLLATION:}} magic word for category pages.

Change-Id: I2836aa4a63c146c2d40a0495a1fd58f0575196ff
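
As a rough illustration of the magic-word idea (not the code in the patch above), the sketch below pulls a `{{DEFAULTCOLLATION:}}` value out of page text and records it only for category pages. It is plain Python under stated assumptions: the property key `defaultcollation`, the `page_props` dict standing in for the page_props table, and the functions `extract_default_collation` and `save_page` are all hypothetical.

```python
import re
from typing import Dict, Optional

# Hypothetical sketch; not MediaWiki's parser and not the actual patch.
MAGIC_WORD = re.compile(r"\{\{DEFAULTCOLLATION:\s*([^{}|]*?)\s*\}\}")
CATEGORY_PREFIX = "Category:"

# Stand-in for the page_props table: title -> {property name: value}.
page_props: Dict[str, Dict[str, str]] = {}


def extract_default_collation(wikitext: str) -> Optional[str]:
    """Return the last non-empty DEFAULTCOLLATION value in the text, if any."""
    values = [value for value in MAGIC_WORD.findall(wikitext) if value]
    return values[-1] if values else None


def save_page(title: str, wikitext: str) -> None:
    """Record the collation override, but only when the page is a category."""
    props = page_props.setdefault(title, {})
    props.pop("defaultcollation", None)  # drop any stale value first
    if title.startswith(CATEGORY_PREFIX):
        name = extract_default_collation(wikitext)
        if name:
            props["defaultcollation"] = name
```

For example, saving `Category:Swedish nouns` with `{{DEFAULTCOLLATION: uca-sv }}` in its text would store `uca-sv` (used here purely as an example value), while the same text on a non-category page would store nothing.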

codecat42 wrote:

I'm reopening this because the above patch hasn't actually been implemented.

This is absolutely a must-have feature for multilingual projects like Wiktionary. Currently, we have to use rather hackish methods like sort keys to handle collation, but this method is flawed. It doesn't account for the different orders of letters in different languages (like ö in Swedish versus Turkish), nor does it handle languages where sequences of several characters are treated as distinct collation headings (like the digraphs of Hungarian).
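
To make the two problems concrete, here is a small, deliberately simplified Python sketch. It ranks words by position in a language's alphabet and matches multi-character letters greedily. The function `sort_key` and the alphabet lists are illustrative assumptions, input is assumed to be lowercase, and real collation (for example CLDR/ICU tailorings) has several comparison levels and handles cases, such as digraph look-alikes at compound-word boundaries, that this ignores.

```python
from typing import List, Sequence, Tuple

# Simplified alphabets, lowercase only; illustrative, not a full collation.
SWEDISH: List[str] = list("abcdefghijklmnopqrstuvwxyzåäö")
TURKISH: List[str] = [
    "a", "b", "c", "ç", "d", "e", "f", "g", "ğ", "h", "ı", "i", "j", "k",
    "l", "m", "n", "o", "ö", "p", "r", "s", "ş", "t", "u", "ü", "v", "y", "z",
]
HUNGARIAN: List[str] = [
    "a", "á", "b", "c", "cs", "d", "dz", "dzs", "e", "é", "f", "g", "gy",
    "h", "i", "í", "j", "k", "l", "ly", "m", "n", "ny", "o", "ó", "ö", "ő",
    "p", "q", "r", "s", "sz", "t", "ty", "u", "ú", "ü", "ű", "v", "w", "x",
    "y", "z", "zs",
]


def sort_key(word: str, alphabet: Sequence[str]) -> Tuple[int, ...]:
    """Turn a lowercase word into a tuple of alphabet positions.

    The longest matching letter wins, so "cs" counts as one Hungarian letter.
    """
    rank = {letter: index for index, letter in enumerate(alphabet)}
    longest = max(len(letter) for letter in alphabet)
    key: List[int] = []
    pos = 0
    while pos < len(word):
        for size in range(min(longest, len(word) - pos), 0, -1):
            chunk = word[pos:pos + size]
            if chunk in rank:
                key.append(rank[chunk])
                pos += size
                break
        else:
            # Unknown character: sort it after every known letter.
            key.append(len(alphabet) + ord(word[pos]))
            pos += 1
    return tuple(key)


words = ["öl", "zon", "ost"]
swedish_order = sorted(words, key=lambda w: sort_key(w, SWEDISH))
turkish_order = sorted(words, key=lambda w: sort_key(w, TURKISH))
hungarian_order = sorted(["csak", "cukor"], key=lambda w: sort_key(w, HUNGARIAN))
```

With these alphabets, `swedish_order` puts "öl" last (ö follows z), `turkish_order` puts "öl" between "ost" and "zon" (ö follows o), and `hungarian_order` puts "cukor" before "csak" because "cs" is a separate letter after "c". A single wiki-wide collation cannot give all three results at once.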

Change 132437 had a related patch set uploaded by Reedy:
(bug 28397) New magic word "{{DEFAULTCOLLATION:}}" to specify the default collation to use for a category

https://gerrit.wikimedia.org/r/132437

Change 27526 restored by Reedy:
(bug 28397) New magic word "{{DEFAULTCOLLATION:}}" to specify the default collation to use for a category

https://gerrit.wikimedia.org/r/27526

Change 132437 abandoned by Reedy:
(bug 28397) New magic word "{{DEFAULTCOLLATION:}}" to specify the default collation to use for a category

https://gerrit.wikimedia.org/r/132437

Change 27526 had a related patch set uploaded by Bartosz Dziewoński:
New magic word "{{DEFAULTCOLLATION:}}" to specify the default collation to use for a category

https://gerrit.wikimedia.org/r/27526

@liangent, this is one of the oldest tasks assigned to someone. Are you planning to work on it, and is its current priority correct?

@Qgil he did complete work on it. Nobody reviewed his code :(

According to https://gerrit.wikimedia.org/r/#/c/27526/ @Reedy and @matmarex brought @liangent's initial patch further last year. I wonder what is missing in the latest version from Matmarex, apart from a rebase.

Qgil set Security to None.

Same as a year ago :) Sorry :(

@liangent: Hi, I'm resetting the task assignee due to inactivity. Please feel free to reclaim this task if you plan to work on this - it would be welcome! Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for more information - thanks!