
Tablesorter sorts all numbers as dates in Czech
Closed, Resolved (Public)

Description

The table sorting feature is currently quite broken in Czech, apparently because the tablesorter considers (almost) any number to be a calendar date. (See the linked URL and click e.g. the “Sčítání 2001” column.)

The sorter constructs the date-matching regex as

'^\\s*\\d{1,2}[\\,\\.\\-\\/\'\\s]*(' + regex_for_months + ')' + '[\\,\\.\\-\\/\'\\s]*\\d{2,4}\\s*$'

The problem in Czech is triggered by the fact that we do not use alphabetic abbreviations for short month names, but just month numbers, i.e. our final regex looks like

^\s*(\d{1,2})[\,\.\-\/'\s]*(leden|1|únor|2|březen|3|duben|4|květen|5|červen|6|červenec|7|srpen|8|září|9|říjen|10|listopad|11|prosinec|12)[\,\.\-\/'\s]*(\d{2,4})\s*$

And because the separators are qualified with *, i.e. optional, the regex considers e.g. “123456” to be, in fact, something like “12/3/456”, i.e. “March 12, 456”. Which is obviously silly.
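The misdetection can be reproduced outside MediaWiki. The following is a minimal sketch that rebuilds the regex quoted above from the month alternation given in this report; the variable names (`months`, `sep`, `looseDateRegex`) are only illustrative and are not taken from the tablesorter source.

```javascript
// Month alternation copied from the Czech regex quoted in the report.
const months = 'leden|1|únor|2|březen|3|duben|4|květen|5|červen|6|' +
    'červenec|7|srpen|8|září|9|říjen|10|listopad|11|prosinec|12';

// Separator class from the report; the trailing * makes it optional.
const sep = "[\\,\\.\\-\\/'\\s]*";

const looseDateRegex = new RegExp(
    '^\\s*(\\d{1,2})' + sep + '(' + months + ')' + sep + '(\\d{2,4})\\s*$'
);

const match = looseDateRegex.exec('123456');
// match[1] is read as the day, match[2] as the month, match[3] as the year.
console.log(match.slice(1)); // [ '12', '3', '456' ]
```

The plain integer is split into day 12, month 3 and year 456 without any separator being present.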

Do we really want to write dates as e.g. “1Dec2012”, so that we would need the * there? Wouldn’t + be better? (Or, at the very least, check whether wgMonthNamesShort is not numeric, and do not add it into the regex if it is.)
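Both suggestions can be compared in a small sketch. `buildDateRegex` and its options are hypothetical names invented for this illustration, and the month lists are typed in by hand rather than read from `wgMonthNamesShort`; this is not the tablesorter's actual code.

```javascript
// Hypothetical helper: build the date regex with optional (*) or required (+)
// separators, optionally dropping purely numeric month names.
function buildDateRegex(monthNames, { quantifier = '+', skipNumeric = false } = {}) {
    const names = skipNumeric
        ? monthNames.filter((name) => !/^\d+$/.test(name))
        : monthNames;
    const sep = "[\\,\\.\\-\\/'\\s]" + quantifier;
    return new RegExp(
        '^\\s*(\\d{1,2})' + sep + '(' + names.join('|') + ')' + sep + '(\\d{2,4})\\s*$',
        'i'
    );
}

// Czech names as they appear in the regex quoted in the report.
const czechMonths = ['leden', '1', 'únor', '2', 'březen', '3', 'duben', '4',
    'květen', '5', 'červen', '6', 'červenec', '7', 'srpen', '8', 'září', '9',
    'říjen', '10', 'listopad', '11', 'prosinec', '12'];
// English short names, assumed here only for the "1Dec2012" case.
const englishShort = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
    'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'];

const optionalSeps = buildDateRegex(czechMonths, { quantifier: '*' });
const requiredSeps = buildDateRegex(czechMonths);
const namesOnly = buildDateRegex(czechMonths, { quantifier: '*', skipNumeric: true });

console.log(optionalSeps.test('123456'));      // true: the behaviour reported here
console.log(requiredSeps.test('123456'));      // false
console.log(requiredSeps.test('1. 12. 2012')); // true
console.log(namesOnly.test('123456'));         // false
```

The trade-off of required separators is that compact forms such as “1Dec2012” stop being recognised: `buildDateRegex(englishShort).test('1Dec2012')` is false, while the `*` variant accepts it.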

On the other hand, the date recognition does not really work for Czech, anyway. In dates, we use the genitive month names (e.g. “1. prosince 2012”, not “1. prosinec 2012”), see Language::mMonthGenMsgs.
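The genitive problem can be illustrated the same way. In this sketch the two name lists are typed in by hand (they stand in for the data behind `Language::mMonthGenMsgs` and the nominative month names), and `dateRegexFor` is a hypothetical helper that uses the separator class from the regex quoted above.

```javascript
// Nominative and genitive Czech month names, typed in for this illustration.
const nominative = ['leden', 'únor', 'březen', 'duben', 'květen', 'červen',
    'červenec', 'srpen', 'září', 'říjen', 'listopad', 'prosinec'];
const genitive = ['ledna', 'února', 'března', 'dubna', 'května', 'června',
    'července', 'srpna', 'září', 'října', 'listopadu', 'prosince'];

// Hypothetical helper mirroring the shape of the regex quoted in the report.
function dateRegexFor(names) {
    const sep = "[\\,\\.\\-\\/'\\s]*";
    return new RegExp(
        '^\\s*(\\d{1,2})' + sep + '(' + names.join('|') + ')' + sep + '(\\d{2,4})\\s*$',
        'i'
    );
}

const nominativeOnly = dateRegexFor(nominative);
const withGenitive = dateRegexFor(nominative.concat(genitive));

console.log(nominativeOnly.test('1. prosinec 2012')); // true
console.log(nominativeOnly.test('1. prosince 2012')); // false: the usual written form
console.log(withGenitive.test('1. prosince 2012'));   // true
```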


Version: 1.20.x
Severity: normal
URL: http://cs.wikipedia.org/wiki/?oldid=9352755

Details

Reference
bz42607

Event Timeline

bzimport raised the priority of this task from to High. Nov 22 2014, 1:10 AM
bzimport set Reference to bz42607.

Table sorting is really seriously broken. Dates are not sorted in the Czech language at all.
The following code should produce a sortable table:

{| class="wikitable sortable"
! "1. 1. 2000" !! "01.01.2000" !! "1.1.2000" !! "1. ledna 2000" !! "1. leden 2000"
|-
| 1. 1. 2000 || 01.01.2000 || 1.1.2000 || 1. ledna 2000 || 1. leden 2000
|-
| 1. 10. 1999 || 01.10.1999 || 1.10.1999 || 1. října 1999 || 1. říjen 2000
|-
| 1. 10. 2010 || 01.10.2010 || 1.10.2010 || 1. října 2010 || 1. říjen 2010
|}

The first two columns use the short formats prescribed by the norms, the third is a common mistake, the fourth is the correct long format, and the last uses a nominative month name.

Sorting does not work at all for the short formats. The only working column in my example is the last one, but as Mormegil has explained, this is not how dates are written. The long format in the fourth column is not sorted correctly (Mormegil explained the reason as well).

I propose setting the required regexp in the MediaWiki configuration instead of assembling it on the client side; that would also be faster.

Just create your custom sorting parser in Common.js.
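A custom parser of this kind would need a function that turns a Czech date string into a sortable key. The following is only a sketch of such a function: `czechDateKey` and the `CZECH_MONTHS` table are hypothetical names, the accepted formats are the ones from the example table above, and the registration with the tablesorter is deliberately left out.

```javascript
// Hypothetical lookup: nominative and genitive month names -> month number.
const CZECH_MONTHS = new Map([
    ['leden', 1], ['ledna', 1], ['únor', 2], ['února', 2],
    ['březen', 3], ['března', 3], ['duben', 4], ['dubna', 4],
    ['květen', 5], ['května', 5], ['červen', 6], ['června', 6],
    ['červenec', 7], ['července', 7], ['srpen', 8], ['srpna', 8],
    ['září', 9], ['říjen', 10], ['října', 10],
    ['listopad', 11], ['listopadu', 11], ['prosinec', 12], ['prosince', 12]
]);

// Returns a number of the form YYYYMMDD, or null if the text is not a date.
// Accepts "1. 1. 2000", "01.01.2000", "1.1.2000" and "1. ledna 2000".
function czechDateKey(text) {
    const m = /^\s*(\d{1,2})\.\s*(?:(\d{1,2})\.|(\p{L}+))\s*(\d{4})\s*$/u.exec(text);
    if (m === null) {
        return null;
    }
    const day = Number(m[1]);
    const month = m[2] !== undefined
        ? Number(m[2])
        : CZECH_MONTHS.get(m[3].toLowerCase());
    const year = Number(m[4]);
    if (!(month >= 1 && month <= 12)) {
        return null; // unknown month name or out-of-range number
    }
    return year * 10000 + month * 100 + day;
}

const dates = ['1. ledna 2000', '1. října 1999', '1. 10. 2010'];
const sorted = dates.slice().sort((a, b) => czechDateKey(a) - czechDateKey(b));
console.log(sorted); // [ '1. října 1999', '1. ledna 2000', '1. 10. 2010' ]
```

Because the day must be followed by a dot, a plain integer such as “123456” is not treated as a date by this function.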

Danny B: How is that a solution if the default does not work?

Reopening; the code is clearly broken, not handling languages other than English correctly; worse, not just not detecting the dates, but detecting them wrong. There were similar (although even worse) issues with date handling in tablesorter before (see bug 42097).

If you ask me, everything in that code that attempts to detect and sort dates in formats other than YYYY-MM-DD (which should sort just fine anyway) should be taken behind the barn and shot.

No, the code is not broken. It sorts properly by default. Any site that wants to sort differently from the predefined sorts for the predefined data types and formats is supposed to provide its own custom sorting parser, and to mark the column headers so that they sort via that parser. That is the correct way to handle it.

Eh? Don’t be silly. Create a new vanilla MediaWiki installation with $wgLanguageCode = "en". Create a new page

{| class="wikitable sortable"
! Data
|-
| 123456
|-
| 98765
|-
| 333555
|-
| 2468
|}

Click on the header, it sorts correctly. Change $wgLanguageCode to "cs", try the same, it sorts incorrectly. MediaWiki does not “sort properly in default”. There are no custom data formats, no custom sorting, nothing nonstandard, just plain old integers here. The tablesorter is just plain broken.
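The difference between the two languages can be shown without a wiki. This sketch builds the regex quoted in comment 0 once with English month names and once with the Czech list; the English list is an assumption typed in for the comparison, and `looseDateRegex` is an illustrative name, not the tablesorter's.

```javascript
// Build the date regex from comment 0 for a given list of month names.
function looseDateRegex(monthNames) {
    const sep = "[\\,\\.\\-\\/'\\s]*";
    return new RegExp(
        '^\\s*(\\d{1,2})' + sep + '(' + monthNames.join('|') + ')' + sep + '(\\d{2,4})\\s*$',
        'i'
    );
}

// Assumed English full and short month names (no digits anywhere).
const en = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
    'August', 'September', 'October', 'November', 'December',
    'Jan', 'Feb', 'Mar', 'Apr', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'];
// Czech names as they appear in the regex quoted in comment 0.
const cs = ['leden', '1', 'únor', '2', 'březen', '3', 'duben', '4',
    'květen', '5', 'červen', '6', 'červenec', '7', 'srpen', '8', 'září', '9',
    'říjen', '10', 'listopad', '11', 'prosinec', '12'];

const values = ['123456', '98765', '333555', '2468'];
const enHits = values.filter((v) => looseDateRegex(en).test(v));
const csHits = values.filter((v) => looseDateRegex(cs).test(v));

console.log(enHits); // []
console.log(csHits); // [ '123456', '98765', '333555', '2468' ]
```

With the English names none of the integers looks like a date; with the Czech names all four do.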

I'm sure the problem is that on Czech Wikipedia the "mw.config" variable 'wgMonthNamesShort' contains digits. This variable should contain only names!

(In reply to comment #7)

I'm sure that the problem is that in Czech Wikipedia "mw.config" variable
'wgMonthNamesShort' contain digits. This variable should contain only names!

Says who and why? As I said in the original report, “we do not use alphabetic
abbreviations for short month names, but just month numbers”. As we say in Czech, “se stim smiř” [learn to live with it]. (You might also want to check Korean (ko) and also possibly bxr, Chinese, Japanese, and other languages.)

You cannot change the world to fit a broken regex. The variable never had any such documented restriction (and such a restriction would be silly anyway), and the numeric values have been there since r4534 (!).

This will fix itself if bug 45161 is solved.

I submitted a patch to hopefully fix the issue as I3a37acf1. As suggested in comment 0, it makes the separators required.

This won't fix the date parsing properly for Czech (no genitive month names support), but it should unbreak number parsing at least.

(In reply to comment #11)

This won't fix the date parsing properly for Czech (no genitive month names
support)

I opened bug 46496 for that.

(Disclaimer: I am a typical American chauvinist and don't know nearly as much as I should about date conventions in other languages. I'm sorry if I'm totally wrong.)

It looks to me like the Czech messages may be incorrect. The message descriptions specify that the short month name messages are abbreviations of the full month names. So why are the Czech messages numeric? Shouldn't they be led., ún., břez., dub., and so on?

(In reply to comment #13)

It looks to me like the Czech messages may be incorrect. The message
descriptions specify that the short month name messages are abbreviations of
the full month names. So why are the Czech messages numeric? Shouldn't they
be led., ún., břez., dub., and so on?

See comment #0 and comment #8

(In reply to comment #14)

See comment #0 and comment #8

Right. This seems like a delicate mess. Bartosz's change I3a37acf1 seems like a simple and straightforward improvement over the status quo, though, so I merged it.

https://gerrit.wikimedia.org/r/55494 (Gerrit Change I3a37acf1985eddf922e69e2c2a1cf541fc00e97e) | change APPROVED and MERGED [by jenkins-bot]

Marking as RESOLVED, then. Needs backporting, I guess?

I've read all the comments here and this doesn't seem a regression, nor a catastrophic bug, so no – I wouldn't think it needs to be backported.

(In reply to comment #18)

I've read all the comments here and this doesn't seem a regression, nor a
catastrophic bug, so no – I wouldn't think it needs to be backported.

Well, this is definitely a regression, even though not in 1.21: I just tested the example from comment #6 above in MediaWiki 1.19.5 (previous/LTS), and it works fine, while in the current MediaWiki 1.20.4, it is broken. Even though this is probably not a “catastrophic bug”, it means “sortable” tables just do not sort numbers in Czech at all, which they used to do correctly in MW 1.19. Just saying.