
Sort German umlaut characters correctly in jquery.tablesorter
Closed, Resolved · Public

Description

Author: icvav

Description:
I'm a user of the Spanish Wikipedia and the sorting script doesn't properly sort
strings with Spanish characters: á, é, í, ó, ú, ü, ñ, Á, É, Í, Ó, Ú, Ü, Ñ.

Example: ñu, águila, barco, nada, obra
Expected Results:
1.- águila
2.- barco
3.- nada
4.- ñu
5.- obra

Actual Results:
1.- barco
2.- nada
3.- obra
4.- águila
5.- ñu

Solution could be:
function ts_sort_caseinsensitive(a, b) {
	// localeCompare uses the browser's locale-aware comparison,
	// so accented characters sort into their expected positions
	var aa = ts_getInnerText(a.cells[SORT_COLUMN_INDEX]).toLowerCase();
	var bb = ts_getInnerText(b.cells[SORT_COLUMN_INDEX]).toLowerCase();
	return aa.localeCompare(bb);
}

Now this function is:
function ts_sort_caseinsensitive(a,b) {
	aa = ts_getInnerText(a.cells[SORT_COLUMN_INDEX]).toLowerCase();
	bb = ts_getInnerText(b.cells[SORT_COLUMN_INDEX]).toLowerCase();
	if (aa == bb) {
		return 0;
	}
	if (aa < bb) {
		return -1;
	}
	return 1;
}

I've tried it on my PC and it seems to work.

Another possible solution would be to use the replace function to map á->a,
é->e, í->i, ó->o, ú->u, ü->u, ñ->nz, ... and then sort the results, as sketched below.
Other languages could need similar changes.
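
A rough illustration of this substitution idea (the character list and mappings here are only examples; a real implementation would need the full set defined by each language's collation rules):

// Sketch: build a comparison key by folding accented characters before sorting.
var sortSubstitutions = { 'á': 'a', 'é': 'e', 'í': 'i', 'ó': 'o', 'ú': 'u', 'ü': 'u', 'ñ': 'nz' };

function makeSortKey(text) {
	return text.toLowerCase().replace(/[áéíóúüñ]/g, function (ch) {
		return sortSubstitutions[ch];
	});
}

// makeSortKey('ñu') yields 'nzu', which sorts after 'nada' and before 'obra',
// matching the expected order in the example above.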


Version: unspecified
Severity: normal

Details

Reference
bz8732

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 9:29 PM
bzimport set Reference to bz8732.
bzimport added a subscriber: Unknown Object (MLST).

Created attachment 3121
The patch for comment #1

Attached:

p.org wrote:

The same problem occurs in tables using German umlauts.

Example:

  1. Go to: http://de.wikipedia.org/wiki/Liste_der_größten_Städte_der_EU
  2. Sort the table by "Staat" (German for "State/Country").
  3. "Österreich" (German for "Austria") will be on the bottom, instead of being sorted as an "O".

Patch does not apply. Code moved into wikibits.js and contains:
function ts_sort_generic(a, b) {
	// compare the preprocessed sort values in a[1]/b[1]; ties fall back to a[2] - b[2]
	return a[1] < b[1] ? -1 : a[1] > b[1] ? 1 : a[2] - b[2];
}

Can someone make a new patch?
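
Not an actual patch, but a minimal sketch of how the earlier localeCompare suggestion could carry over to this comparator, assuming a[1]/b[1] still hold the cells' text values (which may not match the real data layout in wikibits.js):

// Hypothetical locale-aware variant of ts_sort_generic; untested sketch only.
function ts_sort_locale(a, b) {
	var cmp = String(a[1]).localeCompare(String(b[1]));
	// fall back to a[2] - b[2] on ties, as the generic comparator does
	return cmp !== 0 ? cmp : a[2] - b[2];
}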

All this looks like part of the more general bug about collation (currently in bug #64, except that it just covers the server-side aspects of collation for sorting and navigating the content of categories).

Here, the dynamic sort of tables is a client-side issue that has to be solved in the JavaScript plugins, almost completely outside of MediaWiki itself: these customized scripts and standard gadgets have to be localized in JavaScript on each wiki according to the languages they support. Good UCA implementations in JavaScript that also support language-dependent tailorings, including in multilingual tables, are still hard if not impossible to find, except in some browsers that have added such methods to their core string and text packages.

Is there an implementation of the ICU algorithms ported to JavaScript that allows easy customization with tailored tables (possibly fetching language-dependent scripts through AJAX rather than all the many collation tables), and that is not too slow on most browsers?
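
For reference, a sketch of the kind of browser-provided collator interface alluded to above; availability and tailoring quality vary by browser, so feature detection is needed (the function name is illustrative):

// Sketch: return the best locale-aware comparator the browser offers.
function makeComparator(lang) {
	if (typeof Intl !== 'undefined' && Intl.Collator) {
		// Collator object with language-dependent tailoring built into the browser
		return new Intl.Collator(lang, { sensitivity: 'base' }).compare;
	}
	// Fallback: localeCompare with the browser's default locale, no tailoring control
	return function (a, b) { return a.localeCompare(b); };
}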

p.org wrote:

(In reply to comment #2)

  1. Go to: http://de.wikipedia.org/wiki/Liste_der_größten_Städte_der_EU
  2. Sort the table by "Staat" (German for "State/Country").
  3. "Österreich" (German for "Austria") will be on the bottom, instead of being sorted as an "O".

Issue still unsolved.

This works on the listed German page, provided that you feed the sort keys (see the example filled in on that page).

p.org wrote:

You did not mention that you yourself fixed the page I gave as an example:
http://de.wikipedia.org/w/index.php?oldid=73005824
Edited by Verdy P! Thanks!!!

The status quo of sortable tables is clearly an improvement over before!
Nevertheless, too much human effort is still necessary!

I'd prefer a less labor-intensive solution:

  1. The script is intelligent enough to sort ANY table according to a CENTRALLY defined ruleset, regardless of whether it uses basic Latin only, extended Latin, or even non-Latin characters if sorting rules can be defined.
  2. Human intervention is only required if you want an EXCEPTION to the general ruleset for a particular table cell, where your individually generated SortKey makes more sense.

You're right; I am still waiting for a builtin parser function that can convert any string (purged of HTML/XML formatting) into a binary-sortable collation key (even if it's not readable, it would still parse as a valid Unicode plain-text string).

Performing true collation on the client side would require much more extensive work and would create many more compatibility and performance problems when implemented in a JavaScript function (it is possible, but it would be horribly slow for site navigation, whereas such collation keys can be computed and cached on the server side).

This would also help fix the collation order of entries added in categories. One problem is to have a builtin parser function that can create collation keys according to specific languages (not necessarily the main language of the wiki): the collation order in categories should, by default, reflect the collation order expected for categories specific to the language they index (notably in Wiktionary: you won't sort the English/French categories like German, Swedish, Hindi, Chinese, Korean categories). Finally, there should also exist a way to:

  • either index a category by multiple collations (for Chinese notably: by radical/stroke, or by pinyin: two separate categories?)
  • or offer users a way to change the presentation order of the same categories (for languages that need multiple ones, like Chinese), by setting up the category page with a list of additional collation orders that can be queried directly from the server, using the user's preference or a user selection via a link containing an additional HTTP query parameter, such as:
    • "http://.../wiki/Category:...?collate=zh-Latn" for Pinyin,
    • "http://.../wiki/Category:...?collate=zh-Hani" for traditional radical/stroke order

(This should however be cacheable, and for performance reasons the server should honor it only if the category page was prepared with a list of possible collation orders. The default collation would be based on the Unicode DUCET, or possibly on the collation rules for the main language of the wiki if it is a localized wiki and not an international multilingual wiki like Commons, which should still use the DUCET by default, i.e. the collation order of the "root" locale in Unicode's CLDR database.)

Anyway, the effort is not so dramatic (at least for Latin languages, and even
for Cyrillic or Greek).

It is however dramatic for Hebrew, Arabic, Chinese or Korean, where sort keys
are extremely difficult to create or infer correctly and because they need to
be specified absolutely everywhere: the simple binary order of Unicode code
point values means absolutely nothing for these scripts. This should be
automated as much as possible.

A builtin parser function for computing collation keys is just scratching the
surface, but this effort will pay off. In the MediaWiki software, this means
integrating the open-source ICU library and its now-standardized interface
layer for PHP.

Other solutions, based on the underlying SQL engine, will not work as
universally (in addition, this causes severe maintenance problems for the SQL
engine: let's keep the binary sort order in SQL and instead allow the
preparation of separate indexes for the same categories; this will be
SQL-agnostic and will work with the various SQL engines that can already be used
with MediaWiki, not just MySQL, whose Unicode support is minimalist and really
not portable).

This situation will persist as long as there is no international and
vendor-neutral ISO standard for SQL engines, support for this standard in
all major SQL engines, and compatible collation data across SQL engines
that can also create collation keys and an ordering consistent with PHP server-side
or perhaps client-side implementations (much later, if a similar standard is
adopted in ECMAScript).

p.org wrote:

Honestly, your writing, Philippe Verdy, is beyond the scope of my knowledge, as I know little about databases, collation, etc.

From the little I understood, sorting seems to be a really complicated issue, especially for some alphabet systems, and even more so if you mix them.

I therefore suggest starting simple with a short-term solution and then progressing to the more sustainable solution.

SHORT TERM SOLUTION:

I guess the extended Latin collation rules could really be handled client-side without slowing the script down too much, right? I really know little; maybe the Slavic or Scandinavian special characters are already hard, but at least French accents and German umlauts could be easily fixed in a first patch.

LONG TERM SOLUTION:

  1. The sorting code base (a mix of client/server-side scripts, database tables, etc.) is written CENTRALLY for all MediaWikis, simply for reasons of code sharing, as many functions/objects are likely to be universally used.

2)a) The collation rulesets shall be SEPARATELY defined, CENTRALLY PER LANGUAGE wiki (de, fr, en, he, ru, ...), as languages have their own different sorting rules for their native words and for foreign words.

b) It is designed with an intelligent plug-in approach. A ruleset may define only a limited number of Unicode characters (its own language's, plus maybe the characters of historically related cultures (pre-globalisation) for which it has developed sorting rules, e.g. an Austrian lexicographical order aware of French accents and Czechoslovakian háčeks), handing over responsibility/trust for the Unicode ranges of languages it doesn't know how to handle (e.g. Hebrew) by running their ruleset plug-in.

I guess 2)a) and b) are already pretty well developed in database applications; it's rather just a question of how to integrate them properly to satisfy the concept described above.

CONCERNING PERFORMANCE:

I advocate that the SortKeys are already calculated server-side, and that the client side script then only needs to sort numerically.

(Offtopic remark: By this we could also offer to sort tables by multiple keys, with very little client processing power. My search for "multiple, many, search, keys" in the BugTracker did not show any results, but it's possible that people would like it.)

As agreed: at best automatically, without the need for human effort, with human-added exceptions only where necessary.

I imagine it as shown in this ASCII diagram table:

Name    | SK | Einwohner | SK | Staat                    | SK
London  |  1 | 7.554.236 |  1 | Vereinigtes Königreich   |  3
München |  2 | 1.365.052 |  3 | Deutschland              |  1
Wien    |  3 | 1.697.982 |  2 | Österreich {Oesterreich} |  2

In the wiki markup we have the 3 columns "Name, Einwohner, Staat". The users only write the Unicode words as they are used to, such as "München, Österreich", knowing that MediaWiki takes care of the SortKeys.

Only if they know that the default sort algorithm will conflict or make no sense do they add a SortKey attribute, shown in my example as {Oesterreich} (the approach Ö becomes Oe), instead of the expected Ö becomes O as defined in the German collation ruleset.

In the HTML/JavaScript served to the browser, those additional SK value columns are supplied: invisible to the user, but used by the client-side script, as sketched below.
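
A rough sketch of how the client side could consume such precomputed keys; the data-sort-key attribute and the function name are illustrative, not the actual MediaWiki markup:

// Sketch: sort table rows by a numeric sort key stored on each cell,
// e.g. <td data-sort-key="2">Österreich</td>.
function sortRowsByKey(tbody, columnIndex) {
	var rows = Array.prototype.slice.call(tbody.rows);
	rows.sort(function (a, b) {
		var ka = parseInt(a.cells[columnIndex].getAttribute('data-sort-key'), 10);
		var kb = parseInt(b.cells[columnIndex].getAttribute('data-sort-key'), 10);
		return ka - kb;
	});
	// re-append rows in sorted order
	for (var i = 0; i < rows.length; i++) {
		tbody.appendChild(rows[i]);
	}
}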

In bug 8028, I have suggested replacing our sorting code with a jQuery plugin: http://tablesorter.com/

This plugin also allows defining custom sorting filters, so specific collations can probably be added to it quite easily (though they might be computationally intensive, especially on IE, I'm guessing).
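
A minimal sketch of such a custom filter using the plugin's documented addParser hook; the parser id and the character mapping are only examples:

// Sketch: a tablesorter parser that folds German umlauts before comparison.
$.tablesorter.addParser({
	id: 'deUmlauts',                   // illustrative name
	is: function () { return false; }, // never auto-detect; columns opt in explicitly
	format: function (s) {
		return s.toLowerCase()
			.replace(/ä/g, 'ae').replace(/ö/g, 'oe')
			.replace(/ü/g, 'ue').replace(/ß/g, 'ss');
	},
	type: 'text'
});

// Opt a column in via the headers option:
// $('table.sortable').tablesorter({ headers: { 2: { sorter: 'deUmlauts' } } });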

OK, jQuery makes this cleaner to manage, but it does not solve the general problem of correct locale-specific collation orders, because jQuery still does not provide any tailored collation comparator.

The main problem is still there: if it's computationally expensive to create such a collator in IE, the alternative will be to have the collation keys computed on the server side using AJAX, even if the set of strings (from which the collation keys will be computed) has to be collected from the DOM and submitted via AJAX.

The server-side script could either return the collation keys as an array computed from an array of source strings (these keys could then be stored in a hidden data field of the sortable cells, so that they are AJAX-requested only once), or return a compact index of integers saying how to reposition the elements of the AJAX-submitted list.

Anyway, I still think that AJAX would just make the page much slower to respond to sort requests. Instead, the tables could be generated with hidden data already containing all the collation keys where appropriate. But AJAX could still be a good fallback option if client-side support is not possible in some browsers (jQuery could help detect when AJAX is required, and when the collator can be fully written in JavaScript to run on the client side).

For now, the current jQuery solution assumes that the sort key is computed from the visible strings in each cell, but it should allow taking the sort keys from a hidden data field in the same cell (using the "custom sort" solution described in the article, this is not really possible, as the source is only the cell's inner text content).
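
For the hidden-field idea, the plugin's textExtraction option could in principle be pointed at such a field; a sketch, with a hypothetical span.sortkey markup convention:

// Sketch: prefer a precomputed sort key hidden inside the cell, otherwise
// fall back to the visible cell text.
$('table.sortable').tablesorter({
	textExtraction: function (node) {
		var key = $(node).find('span.sortkey');
		return key.length ? key.text() : $(node).text();
	}
});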

I think that the jQuery team should start a project to create a portable client-side collator (with a fallback to a server-side AJAX+JSON script, the server-side collator being able to use any kind of server language: PHP, Perl, C with ICU, Ruby... and not necessarily the same server as the one serving the web pages using jQuery: Google, for example, has developed a reusable Google API to do that, just like it has developed an open webfont server API usable in Google Documents or on any other site).

To be able to generate the collation sort keys directly within the HTML of the table, we would still need a MediaWiki builtin function that can be called when appropriate to generate a binary-sortable string that can be used directly with the standard "<=" operator of JavaScript. jQuery would then just have to use this comparator (and this would not be computationally intensive in any browser, and could work on smartphones).

The builtin MediaWiki parser function to compute the collation keys would just need two parameters: the source text and a locale id. The needed code already exists in ICU4C, is already integrable in PHP (look for ICU4PHP), and can be used in a MediaWiki extension to create the necessary parser function with minimal effort.

I even bet that native integration of ICU will become standard in the PHP core distribution, because it will allow better interoperability between heterogeneous database engines for sorting data, and will remove the interoperability problems caused by the implicit need of support within the native C libraries used on each system for which PHP was compiled.

Note that HTML5 has standardized the "dataset" feature, i.e. a normative naming scheme that allows hiding one or several data elements as additional attributes of any kind of elements.

The "dataset" works for classic HTML, as long as you don't need a strict DTD where all attribute names must be declared; but given that the DTD declaration is completely deprecated in HTML5 (and the support of the dataset feature is mandatory for HTML5 compliance), this is not an issue.

For XML/XHTML, you may use an external DTD or XML Schema to declare these "data-*" attribute names, but this would not work with MediaWiki, which won't allow you to add such a declaration. However, MediaWiki does not use the HTML Strict model, and we could still be compatible with XHTML by adding a standard default "data-collation-key" attribute name declaration in the generated document if pages are not served as HTML5; if pages are served as HTML5, nothing is required and we don't need any declaration.
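
As a small illustration, reading such a "data-collation-key" attribute from a cell could look like this (the attribute name follows the suggestion above; dataset is the HTML5 API, with getAttribute as a fallback for older browsers):

// Sketch: fetch a collation key stored as data-collation-key on a table cell.
function getCollationKey(cell) {
	if (cell.dataset && cell.dataset.collationKey !== undefined) {
		return cell.dataset.collationKey; // dataset maps data-collation-key to collationKey
	}
	return cell.getAttribute('data-collation-key');
}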

(I don't know if MediaWiki can selectively generate HTML5, XHTML with XML Schema, XHTML with DTD, or HTML4 depending on the capabilities of the requesting browser; my opinion is that it should be able to, just as it can create HTML specially tuned for limited smartphones, pages rendered in WAP or i-mode format for older phones, pages for accessibility devices like aural/Braille renderers, or pages in other future standard formats or less open formats with other user-interactivity types such as PDF, Flash, Silverlight..., which may come up and would require specific development for each of them, even if this is not the current work target for MediaWiki, whose primary focus is still HTML4 and its optional newer extensions.)

data-* is already allowed and the sorttable code reads data-sort-value.

Fixed in r86088
E.g. tableSorterCollation = {'ä':'ae', 'ß':'ss'};
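
For a German-language wiki, the same pattern could be extended in the site's JavaScript; the extra mappings below simply follow the example from the commit and are illustrative (the exact substitution set depends on the wiki's collation rules):

// Illustrative extension of the r86088 example configuration.
tableSorterCollation = {
	'ä': 'ae',
	'ö': 'oe',
	'ü': 'ue',
	'ß': 'ss'
};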

p.org wrote:

(In reply to comment #14)
Great!!! Thanks @DieBuche!

When does this bugfix go live?
I made a test table, and as of now, it does not work yet:
http://de.wikipedia.org/wiki/Benutzer:PutzfetzenORG/Sortierbare_Tabelle

It will probably be in the 1.19 release, so it will take at least ~4-5 months.

p.org wrote:

Oh dear! That's a long time to wait for such a good feature improvement! :-(
Any chances for an earlier activation?

(In reply to comment #16)

It will probably be in the 1.19 release, so it will take at least ~4-5 months.

I personally think we should be deploying more often than we release, but that's a discussion we're still having. So I guess 4 months is worst-case.

r86088 is seriously broken and may get reverted; reopening. Needs unit tests (see initial basic unit tests in r90595 which show up errors; will also need tests for the particular case this bug is about.)

Marking as duplicate of bug 30674.

There is no point in having lots of separate bugs for adding support for language X in our JavaScript tablesorter. When we do it, we'll keep the sorting collation central (i.e. not hacked up in JavaScript separately); that way we can use the sorting for other applications as well.

The bug to expose the collation of the current wiki's language or config to JavaScript (and thereby to jquery.tablesorter) is bug 30674.

The sorting collations themselves are bug 30673.

*** This bug has been marked as a duplicate of bug 30674 ***