
#time parser function can't read local language month names
Open, Low, Public

Description

Author: folengo

Description:
On the English Wikipedia, one may use {{#time: Y-F-d | {{{date}}} }}, with {{{date}}} being a string such as "15 may 1998".

The same programming cannot be used on the French or any other language Wikipedia, because a French date like "15 mai 1998" can't be read by the #time parser function and results in "Error: invalid time".

According to the help page on Mediawiki.org (1),

« The date/time object can be in any format accepted by PHP's strtotime() function. ».

So I guess that this strtotime() function should be internationalized into strtotime/fr, strtotime/de, strtotime/es, etc.

(1) http://www.mediawiki.org/wiki/Help:Extension:ParserFunctions#.23time:

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 10:43 PM
bzimport set Reference to bz19412.
bzimport added a subscriber: Unknown Object (MLST).

We are facing the same issue with the #time parser function on the Bengali Wikipedia (1) when we use {{#time: Y-F-d | {{{date}}} }}. The output is "Error: invalid time", because a Bengali date like "15 মার্চ 2010" can't be read by the #time parser function.

(1) http://bn.wikipedia.org/

*** Bug 24674 has been marked as a duplicate of this bug. ***

strtotime() is a PHP function: http://www.php.net/strtotime
There's no hook inside PHP to change timelib_lookup_month (see ext/date/lib/parse_date.c).

We would need to roll our own in order to support this.

p.selitskas wrote:

Yes, as long as #time is processed by PHP's internal date-time functions, the only way to implement this is to carry out a backward conversion. I'm not actually sure it's MediaWiki that should do this. A PHP module would be great. It would be quicker as well.

Perhaps a PHP bug should be filed as well then. http://bugs.php.net/

Nah, PHP probably just got it from somewhere else. They actually do have http://fi.php.net/manual/en/function.strptime.php but that looks inadequate. I've been playing with the idea a bit: http://translatewiki.net/wiki/LocalTime

*** Bug 28203 has been marked as a duplicate of this bug. ***

Just a note: maybe someone could conjure up a wrapper for the current #time in MediaWiki itself that would accept the localized month names in place of the default ones. Such a wrapper (if possible) may not be an ideal solution, but if there's no ideal-ish solution in sight, I think even an untidy or inefficient MediaWiki implementation would be much more efficient than the template structures that might need to be created just because of this bug. For example, http://hi.wikipedia.org/wiki/%E0%A4%B8%E0%A4%BE%E0%A4%81%E0%A4%9A%E0%A4%BE:%E0%A4%A4%E0%A4%BF%E0%A4%A5%E0%A4%BF_%E0%A4%9C%E0%A4%BE%E0%A4%81%E0%A4%9A is a template I had to create just for verifying the "monthname yyyy" date format, and every other format will probably require more templates or subtemplates. So maybe this should be resolved before template behemoths to fix it spring up on wikis, instead of the other way round (which is what usually happens AFAIK). :)
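A minimal sketch of such a wrapper, written in Python rather than PHP purely for illustration. The table `MONTHS_TO_ENGLISH` and the function `anglicize_months` are hypothetical names, and a real implementation would presumably load month names from localisation data instead of hard-coding them. The idea is only to swap whole-word localized month names for English ones before the string reaches an English-only parser such as strtotime().

```python
import re

# Hypothetical per-language table (French only here); not taken from MediaWiki.
MONTHS_TO_ENGLISH = {
    "fr": {
        "janvier": "January", "février": "February", "mars": "March",
        "avril": "April", "mai": "May", "juin": "June",
        "juillet": "July", "août": "August", "septembre": "September",
        "octobre": "October", "novembre": "November", "décembre": "December",
    },
}


def anglicize_months(text, lang):
    """Replace whole-word localized month names with their English names."""
    table = MONTHS_TO_ENGLISH.get(lang)
    if not table:
        return text  # unknown language: leave the input untouched
    # Try longer names first so a short name never shadows a longer one.
    names = sorted(table, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(n) for n in names) + r")(?!\w)",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: table[m.group(1).lower()], text)
```

Relative expressions such as "-1 days" pass through unchanged, so the result can still be handed to the existing parser.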

Why don't you pass {{CURRENTMONTHNAME}} and {{CURRENTYEAR}} as different parameters?

(In reply to comment #10)

Why don't you pass {{CURRENTMONTHNAME}} and {{CURRENTYEAR}} as different
parameters?

How would we compute expressions such as

{{#time:Y F j|{{{1|{{CURRENTYEAR}}}}}-{{{2|{{CURRENTMONTH}}}}}-{{{3|{{CURRENTDAY}}}}}}}

and

{{#time:Y F j|{{{1|{{CURRENTYEAR}}}}}-{{{2|{{CURRENTMONTH}}}}}-{{{3|{{CURRENTDAY}}}}} -1 days}}

on wikis other than the English one?

The template is used at http://en.wikipedia.org/w/index.php?title=Portal:Current_events/Inclusion

(In reply to comment #10)

Why don't you pass {{CURRENTMONTHNAME}} and {{CURRENTYEAR}} as different
parameters?

Well, firstly, I'd have to fix up http://hi.wikipedia.org/wiki/Template:Multiple_issues to accept two parameters (this isn't a big deal), then I'd have to fix up Twinkle to pass two parameters to the template (a big deal for me), and then I'd have to get users who input the template manually to use two parameters instead of one (a really, really big deal). Even after that, I'd still need a date-check mechanism for the current transclusions (or I'd have to fix them manually). And I find doing all that much more difficult than creating a template like this.

And this particular format aside, using different parameters doesn't really solve the problem (as is evident from Jayant's example above). I really think a solution for this should be found before more templates start springing up.

(In reply to comment #11)

(In reply to comment #10)

Why don't you pass {{CURRENTMONTHNAME}} and {{CURRENTYEAR}} as different
parameters?

How do we calculate this type of calculation in all wiki's other than English wiki.

I don't remember a wiki where they do calculations with them, yet they don't use different parameters:
http://commons.wikimedia.org/wiki/Template:Nsd
http://fr.wikipedia.org/wiki/Mod%C3%A8le:Source_inconnue_dat%C3%A9e
http://es.wikipedia.org/wiki/Plantilla:Sin_relevancia

If {{CURRENTMONTHNAME}} is in its own parameter, a MONTH2NUMBER template is trivial and future-proof.

Bisrin wrote:

Please solve this issue as soon as possible, and make all numbers and months available in the Assamese language. I have tried some parser functions in infoboxes on the As wiki; please have a look. The Assamese wiki is at a beginning stage, so it would be better to solve this now. Otherwise it will cause a lot of problems later on.

(Bishnu Saikia)

This doesn't seem like a tracking bug, so removing bug 2007 as depending on this one (and updating title and removing "tracking" keyword).

dr.trigon wrote:

In my opinion (vote +1 ;) this bug should really (!) be solved, and soon would be best, since it has been open for quite a while now...

First, it is VERY inconsistent, since 5 tildes (~~~~~) return, e.g. on dewiki, a date string that is not compatible with #time! So either this bug is solved, or 5 tildes should at least return something compatible (e.g. en locale, ISO format or something else...).

Second, this is inefficient: because of the 5-tilde issue we would have to use something like {{subst:#time:Y-m-d}}, which launches parser functions without any need for it.

So I think it would really be worth considering solving this bug in the near future.

Thanks a lot and Greetings
(DrTrigon)

p.selitskas wrote:

By the way, can CLDR data be used to implement this?

p.selitskas wrote:

(In reply to comment #18)

Any progress for this bug??

I wish there were any. Anyway, now with Lua enabled ([[mw:Scribunto]]) you can parse strings, match their pieces against month names, and convert the original timestamp to a normalized one (ISO format, for instance).
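The approach described here (split the string, match the month piece against a table, emit ISO) can be sketched as follows. This is Python rather than Lua, and it is only an illustration: the French month list, the "D month YYYY" pattern, and the function name `to_iso` are assumptions, not part of any existing module.

```python
import re
from datetime import date

# Hypothetical month list for one wiki (French); each wiki would carry its own.
FRENCH_MONTHS = [
    "janvier", "février", "mars", "avril", "mai", "juin",
    "juillet", "août", "septembre", "octobre", "novembre", "décembre",
]
MONTH_NUMBERS = {name: i for i, name in enumerate(FRENCH_MONTHS, start=1)}

# Only the "D month YYYY" layout is handled in this sketch.
DAY_MONTH_YEAR = re.compile(r"^\s*(\d{1,2})\s+(\S+)\s+(\d{4})\s*$")


def to_iso(text, months=MONTH_NUMBERS):
    """Return 'YYYY-MM-DD' for a 'D month YYYY' string, or None if unparseable."""
    match = DAY_MONTH_YEAR.match(text)
    if not match:
        return None
    day, name, year = match.groups()
    month = months.get(name.lower())
    if month is None:
        return None
    try:
        # date() rejects impossible days such as 31 June.
        return date(int(year), month, int(day)).isoformat()
    except ValueError:
        return None
```

The ISO string it returns is a format that #time already accepts, so the module's output could be fed straight into the parser function.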

p.selitskas wrote:

*** Bug 43714 has been marked as a duplicate of this bug. ***

I beg to disagree with marking Bug 43714 as a duplicate of this one. There are two issues at hand:

  1. When the month name is in a non-English language (such as "mai" in French, which is the same as "may" in English), the bug can be solved by finding a way to replace month names with their English counterparts and then using PHP's strtotime. This is what the current bug is all about.
  2. When a different calendar is involved (as in Bug 43714, which talks about the Jalali calendar), a simple conversion of month names to English is not enough. As explained in PHP's documentation, strtotime only accepts certain date formats -- which are described on http://www.php.net/manual/en/datetime.formats.date.php -- and it ONLY works with the Gregorian calendar. So converting "آذر" to "Azar" (a Jalali month name) won't help anything in this case.

The possible solution for Bug 43714 is to take the date string (which is normally in Persian or English), figure out the Jalali date, use existing MediaWiki code to convert it to a Gregorian date, pass that to the wrapper function (such as the function that handles #time), take the output of that wrapper function, use existing MediaWiki code to convert it back to Jalali, then localize it to the user's language of choice and return the localized string.
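The first step of that pipeline (Jalali to Gregorian) can be sketched outside MediaWiki as follows. This is not the MediaWiki code referred to above: it is a Python transcription of a commonly circulated integer-arithmetic routine that approximates Jalali leap years with a 33-year cycle, the function name is hypothetical, and it has only been spot-checked against a handful of recent dates, so treat it as an illustration of why month-name substitution alone cannot work.

```python
from datetime import date, timedelta


def jalali_to_gregorian(jy, jm, jd):
    """Convert a Jalali (Solar Hijri) year, month, day to a Gregorian date.

    Arithmetic approximation; not validated across the full calendar range.
    """
    jy += 1595
    # Days elapsed since the routine's internal epoch.
    days = -355668 + 365 * jy + (jy // 33) * 8 + ((jy % 33) + 3) // 4 + jd
    # The first six Jalali months have 31 days, the following ones 30.
    days += (jm - 1) * 31 if jm < 7 else (jm - 7) * 30 + 186

    gy = 400 * (days // 146097)        # whole 400-year Gregorian cycles
    days %= 146097
    if days > 36524:                   # past the first (leap) century
        days -= 1
        gy += 100 * (days // 36524)
        days %= 36524
        if days >= 365:
            days += 1
    gy += 4 * (days // 1461)           # whole 4-year cycles
    days %= 1461
    if days > 365:                     # past the leap year of the cycle
        gy += (days - 1) // 365
        days = (days - 1) % 365
    # 'days' is now a zero-based offset into Gregorian year gy.
    return date(gy, 1, 1) + timedelta(days=days)
```

With a Gregorian date in hand, the existing #time machinery can do its work; the reverse conversion and the localisation step would be needed afterwards, as described above.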

p.selitskas wrote:

(In reply to comment #21)

I was watching bug 43714 from a quite more distant point. Currently, all datetime-related stuff is processed by PHP internals. On the other hand, we have CLDR (it stores all those intl'ed "+1 week") and MediaWiki calendar conversion functions, which have been used here... never, I guess.

Adding work-arounds into ParserFunctions would be dirty, and this code would demand not just refactoring, but rethinking and redesign. Moreover, ParserFunctions is not the right place for "low-level" language manipulations.

We can forward prerequisite datetime conversion into Language, as well as calendar detection, although if made straightforward, such piece of code would not be ... beautiful, assuming that, given a certain locale (wgLanguage), {{#time}} must accept every possible date format (in all terms: different month names, different calendars, time spans, etc.).

Otherwise, you can remove the duplicate sign and make the bug dependent on this one if it helps the community track the status of _their_ issue, of course!

This isn't a trivial problem to fix, especially since, as Huji mentions, some month names are associated with different calendars (or even multiple calendars). Take "Nisan" for example. In the Hebrew calendar, Nisan is sometimes month 7 and sometimes month 8. However, Nisan is month 1 in the Assyrian calendar (and the Hebrew ecclesiastic calendar). We would need to add explicit flags for not only language, but also calendar, which would then necessitate considering all the different possible ways of writing dates and times in each language and calendar, as well as how to deal with the almost infinite number of ambiguous cases. Even the relatively "simple" case that we currently support, English-only Gregorian calendar, is fraught with bugs and inconsistencies. Just imagine multiplying that by 300 languages and then multiplying again by the 50 or so calendars currently in use around the world (some of which are quite difficult to convert to the Gregorian calendar). Not to mention that some languages support multiple number systems.
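To make the ambiguity concrete, here is a small Python sketch of a lookup that refuses to guess when a month name maps to more than one position. The month positions are taken from the comment above; the calendar labels, the `AmbiguousMonth` exception, and the function name are hypothetical and do not correspond to identifiers in MediaWiki or CLDR.

```python
# Positions as stated in the comment above; keys are illustrative labels only.
MONTH_POSITIONS = {
    ("hebrew", "nisan"): (7, 8),                 # month 7 or 8, depending on the year
    ("hebrew-ecclesiastical", "nisan"): (1,),
    ("assyrian", "nisan"): (1,),
}


class AmbiguousMonth(ValueError):
    """Raised when a month name cannot be resolved to a single position."""


def month_number(name, calendar=None):
    """Resolve a month name to its position, requiring a calendar if needed."""
    key = name.lower()
    if calendar is None:
        # No calendar flag: collect every position the name can have.
        candidates = {
            number
            for (cal, month), numbers in MONTH_POSITIONS.items()
            if month == key
            for number in numbers
        }
    else:
        candidates = set(MONTH_POSITIONS.get((calendar, key), ()))
    if not candidates:
        raise KeyError(name)
    if len(candidates) > 1:
        raise AmbiguousMonth(f"{name!r} could be month {sorted(candidates)}")
    return candidates.pop()
```

Even with an explicit calendar flag, the Hebrew case stays ambiguous without knowing the year, which is the kind of case the comment warns about.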

In the history of computer programming, no one has ever solved this problem, and honestly I doubt anyone ever will. Even a modest attempt would probably need its own open source project separate from MediaWiki.

My suggestion for a practical solution would be to distribute the problem and have each wiki write their own Lua module for translating their particular language and calendars into ISO 8601 compatible date-time formats. Each module could then be used to provide the input for #time.

Recommend WONTFIX.

mgharish wrote:

We have the same problem in KN:WP.

In many of the calendar templates which we imported from EN:WP, there are expressions like this:

{{#ifexpr:{{#time:t|{{{year|2000}}}-{{{month|jan}}}}}>28|29}}

which promptly gets expanded to

{{#ifexpr:31>28|29}}

which promptly returns 29.

However, the same expression in KN:WP will result in this:

{{#ifexpr:{{#time:t|{{{year|2000}}}-{{{month|jan}}}}}>28|29}}

{{#ifexpr:೩೧>28|29}}

which will throw an error:
Expression error: Unrecognized punctuation character "�"

Is there any workaround/solution for this?

p.selitskas wrote:

(In reply to comment #24)

Your issue is not directly related to this one, but I can see it. You need to convert the digits to Arabic numerals (0-9).

As long as {{#time}} supplies you with Kannada digits, you need to use {{formatnum:KANNADA_INTEGER|R}} (note the "R" argument). For example, you can use this to fix the issue:

{{#ifexpr:{{formatnum:{{#time:t|{{{year|2000}}}-{{{month|jan}}}}}|R}}>28|29}}

Try it and report back if it works.
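What the "R" argument does here amounts to mapping localized digits back to 0-9 before the expression is evaluated. A minimal Python sketch of that mapping, assuming only the standard `unicodedata` module (the function name `to_ascii_digits` is hypothetical and this is not how formatnum is implemented):

```python
import unicodedata


def to_ascii_digits(text):
    """Replace every Unicode decimal digit in text with its ASCII equivalent."""
    out = []
    for ch in text:
        # decimal() returns the digit value for characters such as Kannada or
        # Persian digits, and the default (None) for everything else.
        value = unicodedata.decimal(ch, None)
        out.append(ch if value is None else str(value))
    return "".join(out)
```

Characters that are numerals but not decimal digits, such as the Chinese numerals 二 and 〇, are left unchanged by this approach and would need separate handling.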

@Opraco created a hack on Portuguese Wikipedia to work around this bug:
https://pt.wikipedia.org/wiki/Module:Datas

p.selitskas wrote:

For example, you can use this to fix the issue:

{{#ifexpr:{{formatnum:{{#time:t|{{{year|2000}}}-{{{month|jan}}}}}|R}}>28|29}}

Try it and report back if it works.

Yes, this worked for us on Kn:Wiki. Thank you! I wish Kannada numerals were recognized by default. Finding the templates, variables, etc. has been a Herculean task.

@Jony Why did you remove this from Bengali Sites? You are also removing many other tasks from Bengali Sites. Has this been discussed anywhere?

My suggestion for a practical solution would be to distribute the problem and have each wiki write their own Lua module for translating their particular language and calendars into ISO 8601 compatible date-time formats. Each module could then be used to provide the input for #time.

Well, it would be good to have an interface to feed such conversions for a particular language/calendar/digit set and then have concrete implementations built on request. There is no need to perfectionistically support all the possible combinations of languages, calendars, scripts and digit systems from the start.

Similarly to how we have LanguageXX files we could have TimeXXYYZZ files or something along those lines.

Not saying it is simple to design a good interface for this though.

Another option is to add hooks in the {{#time|...}} code to allow an extension to preprocess the input before passing it back to time. Then those projects that use non-ISO month names (e.g. fawiki) can have the extension enabled and the so called TimeXXYYZZ files can be part of the extension.

I am happy to work on making such an extension. I think we can call it Extension:TimeParserHelper or something like that. If you have a more creative name in mind, please let me know.
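The hook idea can be pictured as a small registry of preprocessors that run on the input before it reaches the date parser. The Python sketch below is only a picture of that control flow: `TimeInputHooks`, `register`, `run`, and the toy French handler are all hypothetical, and a real MediaWiki hook would be declared and invoked through MediaWiki's own hook system instead.

```python
class TimeInputHooks:
    """Registry of callables that may rewrite #time input before parsing."""

    def __init__(self):
        self._handlers = []

    def register(self, handler):
        """Add a handler taking (text, lang) and returning the rewritten text."""
        self._handlers.append(handler)
        return handler  # lets register() be used as a decorator

    def run(self, text, lang):
        """Pass the text through every registered handler in order."""
        for handler in self._handlers:
            text = handler(text, lang)
        return text


hooks = TimeInputHooks()


@hooks.register
def french_months(text, lang):
    # Toy handler: a real extension would carry complete per-language tables.
    return text.replace("juillet", "July") if lang == "fr" else text
```

With no handlers registered, `run` returns its input unchanged, so wikis that do not enable the extension would keep the current behaviour.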

The hook I mentioned in the last comment should be defined here.

Change 614582 had a related patch set uploaded (by Huji; owner: Huji):
[mediawiki/extensions/ParserFunctions@master] Let #time parser function accept absolute dates in localized form

https://gerrit.wikimedia.org/r/614582

I made a patch for this that gets the job done using MediaWiki's own Language class: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ParserFunctions/+/614582/

How to test it

To test it, create a page on your wiki with the following content:

Fr: {{#time: Y|19 juillet 2020}}

Fa-gen: {{#time: Y|۱۹ ژوئیهٔ ۲۰۲۰}}

Fa-non-gen: {{#time: Y|۱۹ ژوئیه ۲۰۲۰}}

En: {{#time: Y|19 July 2020}}

TS: {{#time: Y|{{CURRENTTIMESTAMP}}}}

If your wiki's content language is English, only the last two rows should produce a valid output ("2020") and the rest should show a warning.

Now change your wiki's content language to French ('fr') and purge the page (or use "Preview" to force it to be parsed again); this time, the first row should also produce "2020".

Lastly, change your wiki's content language to Persian ('fa') and purge/preview again; this time, the first row should cause a warning again, but the other four lines should produce "۲۰۲۰" (which is 2020 in Persian).

Unit test

It would be nice to add a unit test for this function. However, I have no idea how to set up unit tests in which content language is modified. If anyone can give me advice on that, I would truly appreciate it.

So I've set up the following page:

{| class="wikitable"
|+ Table to test [[Wikipedia:phab:T21412]]
|-
! Code !! Output
|-
| <code><nowiki>{{#time: Y|19 July 2020}}</nowiki></code> || {{#time: Y|19 July 2020}}
|-
| <code><nowiki>{{#time: Y|19 липня 2020}}</nowiki></code> || {{#time: Y|19 липня 2020}}
|-
| <code><nowiki>{{#time: Y|19 июля 2020}}</nowiki></code> || {{#time: Y|19 июля 2020}}
|-
| <code><nowiki>{{#time: Y|2020年7月20日}}</nowiki></code> || {{#time: Y|2020年7月20日}}
|}

and flipped my $wgLanguageCode between en, uk, ru, and ja. The first three work and the last doesn't, so @Ladsgroup's comment on Gerrit currently stands: the patch only assumes English-like date formats. Also, to me, a MW-ignorant person, it sounds like it would be nice to check this with a per-page content language change rather than a wiki-wide one, though hopefully it would work just as well.

For me personally it still looks like something that is better than nothing, though what would be nice is a definite way to know which languages are supported and then a way to provide a support to other languages some other way.

Per-page content language does not work :(

I've set $wgPageLanguageUseDB = true; while keeping $wgLanguageCode = 'uk'; and flipped the page content language around. While the {{#time: Y xg d H:i:s}} that I put on the page as an indicator did change its value depending on the page content language, the reverse parsing still assumed uk or en input.

I think the per-page content language functionality should be deferred to a later patch. To my knowledge, this feature is not widely used, so it is a lower priority.

As for the "which languages are supported" question, I am completely in agreement that we should clarify this, and that we should strive to expand support to as many languages as possible. The current patch should cover many languages. The Japanese example is not working because (and correct me if I am wrong) in Japanese, no spaces are placed before or after the month name. We can use a special case for 'ja' in the code, but before doing so, I would rather know if this is only Japanese, or a group of languages (maybe Chinese and Korean too?) that use this format. @Base can you help identify the answer to this?

Ultimately, we should update mw:Help:Extension:ParserFunctions to explain how ParserFunctions now has support beyond PHP's strtotime() and also explain which new formats are supported.

I think the per-page content language functionality should be deferred to a later patch.

Well, if the iterative approach makes sense, then it is fine, but basically, if you implement it to use the page's content language rather than the global wiki content language, then I think the latter will not be needed (I do not think we have pages with no content language, so you would not have to inherit it; I might be wrong).

To my knowledge, this feature is not widely used so is a lower priority.

Well, it depends. It is at least used on wikis where Translate is used. Sometimes it is used on non-Translate related pages too, such as localised VPs on WD (I know since I have set content language for them myself :P) , so all the multilingual wikis, some chapter wikis, and so forth (for example there is Translate on wmua: and wmru:). But when it comes to non-multilingual content wikis, I do not think it is used (and I think it is not even enabled). Unfortunately I have no idea how much it is possibly used outside of Wikimedia world.

(maybe Chinese and Korean too?)

Yeah, and not only those, here is an updated example page with vi, zh, ko, and lzh added:

{{#time: Y xg d H:i:s}} 
{| class="wikitable"
|+ Table to test [[Wikipedia:phab:T21412]]
|-
! Code !! Output
|-
| <code><nowiki>{{#time: Y|19 July 2020}}</nowiki></code> || {{#time: Y|19 July 2020}}
|-
| <code><nowiki>{{#time: Y|19 липня 2020}}</nowiki></code> || {{#time: Y|19 липня 2020}}
|-
| <code><nowiki>{{#time: Y|19 июля 2020}}</nowiki></code> || {{#time: Y|19 июля 2020}}
|-
| <code><nowiki>{{#time: Y|2020年7月20日}}</nowiki></code> || {{#time: Y|2020年7月20日}}
|-
| <code><nowiki>{{#time: Y|ngày 20 tháng 7 năm 2020}}</nowiki></code> || {{#time: Y|ngày 20 tháng 7 năm 2020}}
|-
| <code><nowiki>{{#time: Y|2017年4月9日}}</nowiki></code> || {{#time: Y|2017年4月9日}}
|-
| <code><nowiki>{{#time: Y|2020년 7월 20일}}</nowiki></code> || {{#time: Y|2020년 7월 20일}}
|-
| <code><nowiki>{{#time: Y|二〇二〇年七月二〇日}}</nowiki></code> || {{#time: Y|二〇二〇年七月二〇日}}
|}

All of the new ones fail, although vi and ko do have spaces. Beyond those, it is definitely at least all of the Chinese variants (lzh also uses Chinese characters instead of Arabic digits, for more fun) and some other Sino-influenced languages, but I am pretty sure there can be more. (In case you are not familiar: 年 is basically year, 月 is month and 日 is day. So 7月 (or 七月), shichigatsu, "7th month", is the actual name of July in Japanese; and although vi and ko ditched Chinese characters a while back, they basically just write the same thing with an alphabet instead.) I am using ~~~~~ to get the date in a given language; perhaps we have some configuration for that (we must have, since it works), and we can use it to draw at least one of the common date formats for each language, at least for testing purposes.

What also needs to be tested is how those behave when a time is specified as well. And here the good old Sino-influenced languages make it even more fun by also adding the day of the week to the full date.

Well, CJK is always fun, it can be dealt with separately, but we should locate other cases where it does not work. I am only familiar with this bit. IANAL, L for linguist in this case ;)
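For the year/month/day marker layout shown in the table, a pattern keyed on the markers rather than on spaces is one possible starting point. The Python sketch below is an assumption-laden illustration, not part of the patch: it covers only the 年/月/日 and 년/월/일 markers from the examples above, the function name `cjk_to_iso` is hypothetical, and it deliberately does not handle Chinese numerals such as 二〇二〇 or the Vietnamese "ngày ... tháng ... năm" form.

```python
import re

# Year, month and day numbers followed by their markers; optional spaces
# between the parts cover the Korean example, which is written with spaces.
CJK_DATE = re.compile(
    r"(\d{1,4})\s*[年년]\s*(\d{1,2})\s*[月월]\s*(\d{1,2})\s*[日일]"
)


def cjk_to_iso(text):
    """Return 'YYYY-MM-DD' for a year/month/day marker date, else None."""
    match = CJK_DATE.search(text)
    if not match:
        return None
    year, month, day = (int(group) for group in match.groups())
    return f"{year:04d}-{month:02d}-{day:02d}"
```

No range checking is done on the month or day in this sketch; the normalized string would still go through the regular parser, which rejects impossible dates.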

It seems Language->getDateFormatString( or something in signature

Aha, so there are actually some descriptions in $dateFormats in MessagesXX files

You are providing me with a beautiful set of unit tests, so thanks for that! (Now, if only I knew how to properly create unit tests ...)

I will consult the MessagesXX files too. This suddenly became a much larger effort than I hoped, but it is for a good cause, so I'll try to spend more time on it during the week.

I updated the patch to include a unit test, but I need help with it. When I run phpunit, I get the error "Error: Class 'MediaWiki\Extensions\ParserFunctions' not found", even though the unit test file contains "use MediaWiki\Extensions\ParserFunctions;".


UPDATE:

I realized it is MediaWiki\Extensions\ParserFunctions\ParserFunctions that needs to be referenced. The duplicative ParserFunctions was confusing but necessary.

I got the unit tests to work too.

mediawiki/core has only a DateFormatter or Language::sprintfDate to format a given date

extensions/Wikibase has a DateFormatParser, but that does more than only provide English names. It also does not work for relative timestamps like "2 weeks", which #time supports.

The first is not what we want. The second is; I will take a look at it to see if I can borrow from it.

I think it is okay to start with just absolute dates, before expanding it to relative dates.

I would argue that if our code ever gets to the level of maturity needed to handle relative dates gracefully for all or most languages, then it should be moved to core and used in other places too (e.g. the free-text box for block duration). I would also argue that that would be an ambitious goal, and out of scope for the current task.

I reviewed the Wikibase::DateFormatParser class. It is impressive work. The question it raises is, how do we want to make it reusable to ParserFunctions?

On one hand, we could rely on the fact that WMF wikis have both extensions installed, and just instantiate a Wikibase::DateFormatParser in ParserFunctions if the class is available, and use its methods to parse dates. This kind of co-dependency between extensions is technically not difficult to pull off. On the other hand, the better solution is to take all of this code out of Wikibase and put it in core, and use it for core features too (like free-text entries for block expiry). I think I can pull that off with some hand-holding, but I don't think it would be easy to get it past code review unless a WMF-affiliated person would tag-team with me on this (or take the lead). I'm saying this based on my experience with some other drastic changes to core/extensions which have been stuck in code review for months or years for the same reason.

If anyone else is willing to partner with me on this, now would be the time to say so :)

matej_suchanek subscribed.

how do we want to make it reusable to ParserFunctions?

Probably a new library both could make use of / depend on.

@matej_suchanek in response to T258879#6335738 I agree that a better approach is to tokenize the input, identify all digit characters (which vary by language) and only pass those through Language::parseFormattedNumber(). I think we can use Language class to fetch a list of the digit characters for the content language, and then make up a regex from it (or just loop through them and do a str_replace for each digit). Do you have any thoughts on this approach?
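The regex variant of this idea can be sketched as below, in Python for illustration. The sketch assumes the ten digit characters of the content language are available as a string in 0..9 order (how to fetch them from the Language class is left open, as in the comment above); `make_digit_normalizer` is a hypothetical name. Only runs of those digits are rewritten, so separators and month names are never touched.

```python
import re


def make_digit_normalizer(local_digits):
    """Build a function that maps a language's digits to ASCII digits.

    local_digits: the ten digit characters of the language, in 0..9 order.
    """
    if len(local_digits) != 10:
        raise ValueError("expected exactly ten digit characters")
    table = {ch: str(i) for i, ch in enumerate(local_digits)}
    # Character class built from the digit list; matches runs of local digits.
    pattern = re.compile("[" + re.escape(local_digits) + "]+")

    def normalize(text):
        return pattern.sub(
            lambda m: "".join(table[ch] for ch in m.group(0)), text
        )

    return normalize


# Persian digits occupy U+06F0..U+06F9.
persian_digits = "".join(chr(0x06F0 + i) for i in range(10))
normalize_fa = make_digit_normalizer(persian_digits)
```

Because the separator handling of parseFormattedNumber is never involved, a string like "1,2" written with local digits keeps its comma exactly where it was.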

Do you have any thoughts on this approach?

Another way could be doing just the first part of parseFormattedNumber (i.e. skipping separator replacements). It may be tempting to add an optional argument to prevent that but I am not sure if it's appropriate (given the scope of the function as discussed in the other task).