
Use existing $dateFormats to format dates on Wikidata
Open, Medium, Public

Assigned To
None
Authored By
He7d3r
Feb 26 2014, 8:36 PM
Referenced Files
None
Tokens
"Stroopwafel" token, awarded by mxn. "Like" token, awarded by Shizhao. "Like" token, awarded by Capankajsmilyo. "Like" token, awarded by Liuxinyu970226. "The World Burns" token, awarded by revi. "Like" token, awarded by deryckchan.

Description

Since Wikidata's early days it has been possible to use Wikidata's interface in languages other than English. However, dates have so far been only half-localized: the month name is replaced with the month name in the target language, but the date format string itself is not localized.

This results in major inconvenience to users of languages where the date format string is not "d Mmm yyyy" or "Mmm d yyyy". In many cases the partially localized dates make no sense to a native reader.

This task requests that language-specific format strings be applied whenever Wikidata displays a date. Until that is implemented, incomplete localizations should be reverted to an international date format (e.g. 2012-10-29) for languages that do not use a "d Mmm yyyy" format string.


For example, per this thread, the Portuguese interface displays "junho 12 1990" for this item, which is not correct Portuguese; the correct date format is "12 de junho de 1990".

In other languages where month names are simply numbers, such as Chinese (all variants), Japanese, and Korean, this kind of localisation produces a mangled string of numbers. For example, "22 五月 2017" makes little sense to a Chinese reader; the correct rendering is "2017年5月22日".
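To illustrate the difference between the two approaches, here is a minimal Python sketch (not Wikibase code). The month lists and the PATTERNS table are hypothetical stand-ins made up for this example; MediaWiki's real $dateFormats use a different syntax.

```python
from datetime import date

# Hypothetical illustration data, not copied from MediaWiki's message files.
MONTH_NAMES = {
    "pt": ["janeiro", "fevereiro", "março", "abril", "maio", "junho",
           "julho", "agosto", "setembro", "outubro", "novembro", "dezembro"],
    "zh": ["一月", "二月", "三月", "四月", "五月", "六月",
           "七月", "八月", "九月", "十月", "十一月", "十二月"],
}

# Hypothetical per-language patterns in Python str.format syntax.
PATTERNS = {
    "pt": "{day} de {month_name} de {year}",
    "zh": "{year}年{month}月{day}日",
}


def half_localized(d: date, lang: str) -> str:
    """Fixed day-month-year order; only the month name is translated."""
    return f"{d.day} {MONTH_NAMES[lang][d.month - 1]} {d.year}"


def fully_localized(d: date, lang: str) -> str:
    """Order, separators and connecting words all come from the language's pattern."""
    return PATTERNS[lang].format(
        day=d.day,
        month=d.month,
        month_name=MONTH_NAMES[lang][d.month - 1],
        year=d.year,
    )


print(half_localized(date(2017, 5, 22), "zh"))
print(fully_localized(date(2017, 5, 22), "zh"))
print(fully_localized(date(1990, 6, 12), "pt"))
```

The point of the sketch is only that the pattern, not just the month name, has to vary per language.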

The date formats for many more languages should also be changed.
See https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2015/01#Date_format


Whiteboard: papercut u=dev c=backend p=3


Event Timeline

Lydia_Pintscher removed a subscriber: Unknown Object (MLST). Dec 1 2014, 2:53 PM
He7d3r added a project: I18n.
He7d3r set Security to None.
Stryn renamed this task from Change formatting of dates in Portuguese on Wikidata to Change formatting of dates on Wikidata. Jan 28 2015, 7:02 PM
Stryn updated the task description.
He7d3r renamed this task from Change formatting of dates on Wikidata to Use existing $dateFormats to format dates on Wikidata. Jan 29 2015, 5:47 PM

As with many of these time related tickets, the MwDateFormatParser will solve a lot of these cases, see https://gerrit.wikimedia.org/r/153211 and https://github.com/DataValues/Time/pull/83, both still work in progress.

Change 153211 had a related patch set uploaded (by Thiemo Mättig (WMDE)):
[WIP] Add MwDateFormatParser

https://gerrit.wikimedia.org/r/153211

The Hungarian date format (and the Hungarian month names) should also be implemented. I think the proposed patch covers this case too. Is that correct?

@Samat, I'm sorry, can you please describe in more detail what you mean? What is the current situation, what is not correct, and how should it be instead?

When I go to a page like https://www.wikidata.org/wiki/Q159?uselang=hu I see "12 junho 1990". This is, as far as I can tell, the "name of the month in Hungarian".

When I look at https://phabricator.wikimedia.org/source/mediawiki/browse/master/languages/messages/MessagesHu.php;9e8355c87d35345bab5de10cab6c42832f33917d$145 I see that the Hungarian language is set to use a YMD-ordered date format by default. Wikibase currently does not use this, but the "dmy" format. Unfortunately the Hungarian language definitions do not specify a dmy format. That's why the dot is missing in "12 junho 1990".

The patch I mentioned above is not about formatting but about parsing. Being able to parse all date formats is a prerequisite to change the formatting.

@thiemowmde, thank you for your answer. I thought that Wikibase currently used the English month names, but I was wrong: the names themselves are fine (for example "12 június 1990").

But the date formatting is incorrect. As you mentioned, the order should be YMD, and there should be a dot after the year and after the day number. (For the example above: "1990. június 12.")

If you say this ticket won't solve the formatting issue, I will open a separate ticket for that. (Or if you know of a ticket that is already open for the same problem, please point me to it.)

Oh, please do not create more tickets. This one here is about the exact formatting issue you asked for. It's just that the patch I linked above does not fully solve the issue.

To clarify, this ticket is about displaying and outputting dates in the language-appropriate format.

At the moment Wikidata's web interface and {{#Property:}} return "dd mmm yyyy" or "mmm dd yyyy" with mmm replaced by the name of the month in the desired language. As earlier discussion has shown, this is not useful at all for languages whose date formatting isn't a direct application of one of these two formats.

@thiemowmde - Would you please explain how a patch about parsing dates is a prerequisite of a solution about displaying dates?

Central modules will need several date formats within a single module:
for content, page and user languages (see T135845),
and to display categories in the user language while linking them in the wiki language (see T68051).

To make these needs easy to meet, I suggest structuring the code change with that in mind.

@deryckchan, simply because the software must understand itself. The formatted date is what appears in the edit field. We do not want to show the unformatted YYYY-MM-DD there as this would be even more confusing, so we show it formatted. You want to edit this, and expect the software to accept the format it was outputting before.

With no parser that is able to understand all formats (and not confuse them!) we can't output all formats.

@Rical, you are right, this is closely connected. But for now this ticket is about PHP backend rendering only, not about possible future Lua modules.


I must argue that it is actually more confusing to show partially localised "dd Mmm yyyy". Most users of the internet understand yyyy-mm-dd regardless of mother language and Wikidata users in particular are used to language fallback chains.

@thiemowmde : Imagine your software displays "2017Jahr5Monat22Tag" (which is the Chinese format string with German words substituted in). This is how users of non-"dd Mmm yyyy" languages currently feel when we use Wikidata. It's worse than defaulting to "2017-05-22" or even "May 22 2017".

May I suggest that we actually display "yyyy-mm-dd" until language-specific date formats are implemented?


I agree and I would suggest the same if the implementation needs longer time.

There is no reason to do the exact opposite of what this ticket asks for. No matter what the user's language is, everybody can distinguish day, month and year in "10 November 2017". But we cannot assume everybody understands what the month in "2017-11-10" is. This is actually the 11th of October in certain regions of the world.


The reason is that, I'm afraid, it is not correct to assume that "everybody can distinguish day, month and year in [dd Mmm yyyy]". As Samat and I have strongly argued in this thread, translated month names + wrong date formatting string is not comprehensible in many languages. It is better to default to a correct foreign language than to use an incomprehensibly wrong attempt to localise.

@thiemowmde You can accept the idea that "yyyy-mm-dd" may not make sense to some people in the world, but not a native reader telling you that "dd Mmm yyyy" makes even less sense? When doing localisation, if native readers tell you that what you have makes no sense in their language, stop and listen.

Either do full localisation of a string, or don't do it at all.

ISO 8601 format is a well understood international standard designed "to provide an unambiguous and well-defined method of representing dates and times, so as to avoid misinterpretation of numeric representations of dates and times, particularly when data are transferred between countries with different conventions for writing numeric dates and times" (from English Wikipedia). Why on earth would you invent a partially localised system that makes no sense at all in many languages?

@thiemowmde : Imagine your software displays "2017Jahr5Monat22Tag" (which is the Chinese format string with German words substituted in). This is how users of non-"dd Mmm yyyy" languages currently feel when we use Wikidata. It's worse than defaulting to "2017-05-22" or even "May 22 2017".

I'm not a native or even fluent speaker of Chinese (or Japanese or Korean), so maybe you would disagree, but I think a better analogy is: Imagine being presented with "30 10 minutes 3" as a length of time in English.

English speakers might eventually figure out that it's supposed to mean "3 hours 10 minutes 30 seconds" but the parts are in the wrong order and two of the expected words are missing, which results in something that looks like complete nonsense. Writing "2017Jahr5Monat22Tag" in German is definitely weird, but I don't think it has the same effect on the comprehensibility.

Nikki's analogy is spot on!

In general, it is a bad idea to assume that users will be able to understand something non-obvious, and this is an even worse idea when it comes to multiple languages. As a native English speaker, with a moderate level of Mandarin, and as a software developer and computational linguist, when I first looked at "22 五月 2017", my first thought was that there was some mistake, because it just looks garbled. For instance, was "22五" supposed to be one number? Of course it's *possible* to work it out, but that's beside the point. This is a failed attempt at localisation, needs to be fixed, and cannot be written off as something that's incorrect but understandable.

This ticket asks for full localization as supported by MediaWiki core. I, personally, love to work on date parsing and formatting and have already spent weeks (!) working on code required to fully solve this ticket some day. I will not throw away everything we did in the past four (!) years just because some people start yelling at me with no scientific arguments given.

I see the possibility for a few smaller improvements we could make:

  • 12 of the 423 languages and language-variants MediaWiki currently supports name their default date format "ymd". These languages are namely crh (including variants), hu, kaa, and kk (including variants). We could disable the localization for these languages and display raw ISO dates instead. Users will not get better localization by doing so. But the order will be the same as the users expect. We might assume users being used to any kind of "ymd" ordering are less confused by "2017-11-10", even if it will be entirely unlocalized until we support full localization.
  • About 20 more languages specify default date formats that start with the year, but are not named "ymd". Most notably gan, ko, and zh, including all their variants. We might add these to a blacklist and display raw ISO dates as well.
  • We might add other languages to the same blacklist if requested and ISO is proven to be less confusing for native speakers.

I will not discuss globally disabling the mostly working localizations for the 368 languages (87%) that name their default date format "dmy".
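A rough Python sketch of the list-based fallback described above (not actual Wikibase code). The set of language codes is taken from the two bullet points; the function names and the way variants are reduced to a base code are assumptions made for this illustration.

```python
from datetime import date
from typing import Callable

# Language codes named in the proposal above. Variants are handled by
# stripping everything after the first hyphen (an assumption of this sketch).
ISO_FALLBACK_LANGUAGES = {"crh", "hu", "kaa", "kk", "gan", "ko", "zh"}


def base_language(code: str) -> str:
    """Reduce a variant code such as 'zh-hant' to its base code 'zh'."""
    return code.split("-", 1)[0].lower()


def format_date(d: date, lang: str, format_dmy: Callable[[date, str], str]) -> str:
    """Use raw ISO output for listed languages, the existing DMY formatter otherwise."""
    if base_language(lang) in ISO_FALLBACK_LANGUAGES:
        return d.isoformat()
    return format_dmy(d, lang)


def numeric_dmy(d: date, lang: str) -> str:
    """Stand-in for the existing day-month-year formatter."""
    return f"{d.day} {d.month} {d.year}"


print(format_date(date(2017, 11, 10), "hu", numeric_dmy))
print(format_date(date(2017, 11, 10), "de", numeric_dmy))
```

Adding a language on request then only means adding its code to the set.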


This is a good plan. Thank you for your hard work on date formatting for Wikidata!

This sounds reasonable. Using the raw ISO format "yyyy-mm-dd" would be a better localisation for Mandarin than using "dd mm月 yyyy". I cannot be as certain for the other ymd languages, without a closer look at the linguistic data. I'm not sure what kind of scientific argument is being asked for, but on the subject of not assuming that users can work things out, I would recommend the following paper:

http://www.oecd-ilibrary.org/education/skills-matter_9789264258051-en

It's long but eye-opening. A summary is also available here:

https://www.nngroup.com/articles/computer-skill-levels/

Relevant Patch-For-Review that adds a simple TimeFormatter that can output ISO-like YMD-ordered dates in all relevant precisions: https://github.com/DataValues/Time/pull/49. We might use this basic YMD-formatter instead of the current (DMY-) MwTimeIsoFormatter for the non-DMY languages listed above.
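As a rough idea of what such a YMD formatter might do, here is a Python sketch. It is not the code from the pull request; the Precision enum and the format_ymd function are hypothetical names, and only year, month and day precision are covered.

```python
from enum import Enum


class Precision(Enum):
    """Hypothetical precision levels; coarser ones (decade, century, ...) are omitted."""
    YEAR = "year"
    MONTH = "month"
    DAY = "day"


def format_ymd(year: int, month: int, day: int, precision: Precision) -> str:
    """Return an ISO-like, YMD-ordered string truncated to the given precision."""
    parts = [str(year)]
    if precision in (Precision.MONTH, Precision.DAY):
        parts.append(f"{month:02d}")
    if precision is Precision.DAY:
        parts.append(f"{day:02d}")
    return "-".join(parts)


print(format_ymd(2017, 11, 10, Precision.DAY))
print(format_ymd(2017, 11, 10, Precision.MONTH))
print(format_ymd(2017, 11, 10, Precision.YEAR))
```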

Restricted Application added a subscriber: revi. Jun 8 2017, 5:12 PM

Scribunto modules also need to extract any part of the date (and/or time).
This is very difficult if the only available date is language-formatted.
I therefore suggest also giving modules the ISO 8601 format.
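To show why an ISO-style value is easier for a module to take apart than a language-formatted string, here is a small sketch. It is written in Python rather than Lua, and it assumes the timestamp looks like "+2017-05-22T00:00:00Z"; the names DateParts and split_iso_date are hypothetical.

```python
import re
from typing import NamedTuple


class DateParts(NamedTuple):
    year: int
    month: int
    day: int


# Optional sign, year digits, two-digit month and day, optional time part.
_TIMESTAMP = re.compile(r"^([+-]?\d+)-(\d{2})-(\d{2})(?:T.*)?$")


def split_iso_date(timestamp: str) -> DateParts:
    """Extract year, month and day from an ISO-8601-like timestamp."""
    match = _TIMESTAMP.match(timestamp)
    if match is None:
        raise ValueError(f"not an ISO-like date: {timestamp!r}")
    year, month, day = match.groups()
    return DateParts(int(year), int(month), int(day))


print(split_iso_date("+2017-05-22T00:00:00Z"))
```

No knowledge of the display language is needed for this, which is the difficulty with a string like "22 五月 2017".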

I noticed that the date format has recently changed from "25 九月 1997" to "25 9 1997". Is work being done on the date formatting? And are we close to getting ISO dates or language-formatted dates?

I am wondering what the state of this task is. Is there any progress?
I checked the date format for Hungarian and saw a small change since May: there is now a dot after the day, for example "12. november 1918".

This is really quite close to the correct format:

  • we would need one more dot after the year number,
  • should be ordered as YYYY.MM.DD.

I am not a programmer, but I don't see why this change would be so complicated.
Can we expect that this change will happen (soon)? :)

Is this task complete? Can someone please update this 3 year old task's status?


No change since August 2017. Date format displays in some languages (e.g. Hungarian and Cantonese above) have changed but are still wrong. It appears that the underlying software remains unable to handle date formats that don't follow d-m-y word order.

Relevant Patch-For-Review that adds a simple TimeFormatter that can output ISO-like YMD-ordered dates in all relevant precisions: https://github.com/DataValues/Time/pull/49. We might use this basic YMD-formatter instead of the current (DMY-) MwTimeIsoFormatter for the non-DMY languages listed above.

Why has this patch for review still not been accepted? Where is the pending discussion? How can we move forward on this one?

If there's no way to fix the internationalized format now, then please change the format to the ISO date format as a temporary fix. There's currently no way for me to tell which day a date value actually represents without trying to edit it and seeing the calendar pop up.


Agreed - we've been sitting here for a year and date fields remain unusable in non-dmy languages. If we switch back to ISO dates until language-specific date formatting strings can be rolled out, at least people can use it without confusion.

We solved that issue on Commons a decade ago by writing templates, which now take the form of Module:Date and Module:ISOdate. Both modules exist on both Commons and Wikidata. Maybe we can just pipe the date through those modules, or capture their logic in MediaWiki code.

Change 153211 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add DateFormatParser and MwDateFormatParserFactory

https://gerrit.wikimedia.org/r/153211

This issue still affects any language that numbers its months instead of naming them. Chinese, Japanese, Korean, and Portuguese have been mentioned above, to which I’d add Vietnamese. These languages have narrow month “names” that are just bare numbers and instead rely on date formats to append a prefix or suffix to the month name.

No matter what the user's language is, everybody can distinguish day, month and year in "10 November 2017". But we cannot assume everybody understands what the month in "2017-11-10" is. This is actually the 11th of October in certain regions of the world.

As things stand, dates formatted by Wikibase aren’t necessarily recognizable as dates, let alone the correct dates. For example, Wikibase formats January 25, 2002, as “25 1 2002”. By contrast, ever since T8910, the rest of MediaWiki has correctly formatted the same date as “ngày 25 tháng 1 năm 2002” based on $dateFormats['vi normal date'] (source). One might deduce that “25 1 2002” is a date in day-month-year format, but that’s hardly assured for every day of the year. It’s even worse with dates outside this millennium, like “22 11 874”. I wholeheartedly agree with commenters above that ISO 8601 format would’ve been preferable, even at the expense of temporarily regressing localized date formats in some other languages.

The root cause seems to be this method, which makes some language-centric assumptions. It correctly calls getDateFormatString() to get the localized date format, but then it scans the format string for a number followed by a period or comma for the day component and a word followed by a period or comma for the month component. It inserts these “formats” into a hard-coded string format %s %s Y then passes it into sprintf(). Effectively, it extracts a choice of the “Month” and “Day of the month” formatting codes in this table but discards any other information from the date format.
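The behaviour described above can be approximated in a few lines of Python. This is a re-creation from the description, not the Wikibase source: the two regular expressions, the function names, and the format strings passed in are assumptions, and the renderer only understands the j, n and Y codes plus quoted literals.

```python
import re

# Assumed approximations of the "day" and "month" scans described above.
DAY_CODE = re.compile(r"(?:d|(?<!x)j)[.,]?")
MONTH_CODE = re.compile(r"(?:[FMn]|(?<!x)m|xg)[.,]?")


def reduce_to_day_month_year(date_format: str) -> str:
    """Keep only one day code and one month code; drop everything else."""
    # Note: this scan does not skip quoted literals in the format string.
    day = DAY_CODE.search(date_format)
    month = MONTH_CODE.search(date_format)
    day_code = day.group(0) if day else "j"
    month_code = month.group(0) if month else "F"
    return "%s %s Y" % (day_code, month_code)


def render(date_format: str, year: int, month: int, day: int) -> str:
    """Tiny renderer: j = day, n = month number, Y = year, "..." = literal text."""
    values = {"j": str(day), "n": str(month), "Y": str(year)}
    out, in_literal = [], False
    for ch in date_format:
        if ch == '"':
            in_literal = not in_literal
        elif in_literal:
            out.append(ch)
        else:
            out.append(values.get(ch, ch))
    return "".join(out)


# Assumed to resemble the 'vi normal date' entry mentioned above.
VI_FORMAT = '"ngày" j "tháng" n "năm" Y'

print(render(VI_FORMAT, 2002, 1, 25))
print(render(reduce_to_day_month_year(VI_FORMAT), 2002, 1, 25))
```

With these assumptions, the full format string yields "ngày 25 tháng 1 năm 2002", while the reduced one yields "25 1 2002": the literals and the original order are lost.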

@deryckchan, simply because the software must understand itself. The formatted date is what appears in the edit field. We do not want to show the unformatted YYYY-MM-DD there as this would be even more confusing, so we show it formatted. You want to edit this, and expect the software to accept the format it was outputting before.

With no parser that is able to understand all formats (and not confuse them!) we can't output all formats.

It is possible to solve this problem without mangling date formats. For example, the Comments in local time gadget accepts a list of date format strings to parse out of talk pages (parseFormat). On a wiki whose system date format has changed over the years, it’s no big deal to accept multiple formats. As it happens, MediaWiki’s $dateFormats['vi normal date'] setting provides these formats in a sprintf()-compatible syntax.

If the goal is to accept a date in any format, that’s laudable, but it shouldn’t preclude outputting a date in the correct format. parseDate() doesn’t have to call the same function as getLocalizedDate(). getLocalizedDate() can consult $dateFormats while parseDate() continues to sniff formatting codes out of a format string.
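A minimal sketch of that separation, in Python and under stated assumptions: the Hungarian output pattern and the two accepted input patterns are made up for illustration, and parse_date / get_localized_date are stand-ins that only borrow the names used above, not the real Wikibase methods.

```python
import re
from datetime import date

MONTHS = ["január", "február", "március", "április", "május", "június",
          "július", "augusztus", "szeptember", "október", "november", "december"]

# One pattern for output (hypothetical Hungarian long date).
OUTPUT_PATTERN = "{year}. {month_name} {day}."

# Several patterns accepted on input, tried in order.
ACCEPTED_INPUTS = [
    re.compile(r"^(?P<year>\d+)\. (?P<month_name>\w+) (?P<day>\d+)\.$"),  # correct format
    re.compile(r"^(?P<day>\d+)\.? (?P<month_name>\w+) (?P<year>\d+)$"),   # older DMY output
]


def get_localized_date(d: date) -> str:
    """Format using the single, language-appropriate output pattern."""
    return OUTPUT_PATTERN.format(year=d.year, month_name=MONTHS[d.month - 1], day=d.day)


def parse_date(text: str) -> date:
    """Accept any of the known input patterns, independent of the output pattern."""
    for pattern in ACCEPTED_INPUTS:
        match = pattern.match(text)
        if match and match["month_name"] in MONTHS:
            month = MONTHS.index(match["month_name"]) + 1
            return date(int(match["year"]), month, int(match["day"]))
    raise ValueError(f"unrecognised date: {text!r}")


original = date(1990, 6, 12)
print(get_localized_date(original))
print(parse_date(get_localized_date(original)) == original)
print(parse_date("12 június 1990") == original)
```

The round trip still works because the output pattern is one of the accepted inputs; the parser simply accepts more than the formatter emits.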

DiscussionTools can parse (correctly) formatted dates in signatures. The Talk pages project started 3.5 years ago. Wikibase can’t parse correctly formatted dates. Wikidata started over a decade ago. What’s the difference? The signatures were already there, so the DiscussionTools developers could not tell users “you can understand these screwed-up dates, even if you loudly and unambiguously say you can’t”. Using Hungarian interface, I’ve probably never entered a single “localized” date, only ISO 8601-formatted ones, because what you call localized is not localized and unnatural (I not only write dates YMD, I also think about them YMD, writing down the day before the year takes extra effort). On display, while for most dates, one can finally figure out that they’re screwed up, 1st century AD dates are even worse: when I see 12. január 25, I immediately know it means the 25th of January of the year 12 AD – except that it doesn’t, because it means the 12th of January of the year 25 AD.

There will be a Wikidata bug triage hour on the 13th of March relating to dates which may be of interest to people subscribed to this ticket: https://www.wikidata.org/wiki/Wikidata:Events#Wikidata_bug_triage_hour

In T63958#8141945, @mxn wrote:

As things stand, dates formatted by Wikibase aren’t necessarily recognizable as dates, let alone the correct dates. For example, Wikibase formats January 25, 2002, as “25 1 2002”. [...] One might deduce that “25 1 2002” is a date in day-month-year format, but that’s hardly assured for every day of the year.

Yeah, things like 2 7 1908 are clearly ambiguous.

But we can not assume everybody understands what the month in "2017-11-10" is. This is actually the 11th of October in certain regions of the world.

According to https://en.wikipedia.org/wiki/Date_format_by_country, the only people who use yyyy-dd-mm are Uyghur speakers in China, who, it says, also use yyyy-mm-dd. CLDR does not list any yyyy-dd-mm date formats.

CLDR also defaults to yyyy-mm-dd for any language where the date format isn't defined (see the entry with ·all·others·) and yyyy-mm-dd is an international standard, so many people are likely to have come across it at some point, even if it's not the format they normally use.

I think it should either use yyyy-mm-dd as the default if it doesn't know how to format the date, or it should display the date in the next fallback language that it does know how to format... but it seems it doesn't actually know how to format dates in any language, it just uses "dd month yyyy" for everything and hopes for the best?
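Roughly what I have in mind, as a Python sketch. The formatter table and the fallback chains here are invented for illustration and are not MediaWiki's real fallback data.

```python
from datetime import date

# Hypothetical: languages for which a correct formatter is known.
FORMATTERS = {
    "zh": lambda d: f"{d.year}年{d.month}月{d.day}日",
    "hu-numeric": lambda d: f"{d.year}. {d.month:02d}. {d.day:02d}.",
}

# Hypothetical fallback chains, nearest language first.
FALLBACKS = {
    "zh-hk": ["zh-hant", "zh"],
    "hu-formal": ["hu-numeric"],
}


def format_with_fallback(d: date, lang: str) -> str:
    """Try the language, then its fallbacks, then yyyy-mm-dd as the last resort."""
    for code in [lang, *FALLBACKS.get(lang, [])]:
        formatter = FORMATTERS.get(code)
        if formatter is not None:
            return formatter(d)
    return d.isoformat()


print(format_with_fallback(date(2017, 5, 22), "zh-hk"))
print(format_with_fallback(date(2017, 5, 22), "xx"))
```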