PLURAL broken: always returns singular in some languages
Closed, ResolvedPublic

Description

The PLURAL magic word has not been working on huwiki and in Hungarian messages on translatewiki since 12 September 2012. You can see this in the page size shown in the history of a [[:hu:Special:Random random page]].


Version: 1.20.x
Severity: major
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=40250

Details

Reference
bz40251

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 12:53 AM
bzimport set Reference to bz40251.
bzimport added a subscriber: Unknown Object (MLST).

Adding bug 38781 as tracking bug.

So, this is about the history always showing "egy bájt" (a byte) for [[MediaWiki:Rc-change-size-new]]. We had similar reports about categories' summaries (number of pages) in ja and ko, so I'm changing the summary to see if it's actually related.

Bug 40250 addressed this issue for Vietnamese, but Niklas thinks CLDR should be fixed. I disagree: MediaWiki’s use of the plural: magic word is incompatible with CLDR. According to the CLDR spec [1]:

“Note that these categories may be different from the forms used for pronouns or other parts of speech. In particular, they are solely concerned with changes that would need to be made if different numbers, expressed with decimal digits, are used with a sentence. If there is a dual form in the language, but it isn't used with decimal numbers, it should not be reflected in the categories.”

In the case of Vietnamese (and most likely other languages), there are plural forms, but not in the example they give (“Duration: 1 hour” → “Duration: 3.2 hours”). Instead, these plural forms occur where there is no decimal number (“the following user” → “the following 5 users”; “this page” → “these pages”). MediaWiki’s English localization has long used the plural: magic word for avoiding legalese like “page(s)”, and plenty of localizations at TranslateWiki have followed suit. In cases where a decimal number is displayed unconditionally, the Vietnamese localization simply omits the plural: magic word.

In short, I think overriding the plural rules is the right approach, and we should do the same for the other languages in the same boat. Just take a look at the translations of [[MediaWiki:Category-subcat-count]] for supposedly plural-less languages like Indonesian or Chinese.

Currently all of these languages’ localizations are severely broken: numbers are not showing up in many of the places they should, such as category counts and list totals in special pages.

[1] http://cldr.unicode.org/index/cldr-spec/plural-rules

I'm not sold on the idea that adding local plural rule overrides is the solution.

In a translatewiki.net thread I proposed an alternative solution that allows an inline override in affected messages.

http://translatewiki.net/wiki/Thread:Support/PLURAL_keyword_for_languages_without_grammatical_plural_forms

*** Bug 40252 has been marked as a duplicate of this bug. ***
*** Bug 40250 has been marked as a duplicate of this bug. ***

We might as well expand the override to include all of CLDR’s plural-less languages. [[id:Kategori:Wikipedia]] will still lack subcategory and page counts, after all.

Supporting explicit numeric arguments to plural: sounds like a good idea. That would add enough flexibility so that, for instance, the English localization could have a message with “{{PLURAL:$1|1=the user|2=both users|all $1 users}}”. A bot at Translatewiki could automatically add 1= to any invocations of plural: with more than one argument and factor out plural: where all the arguments are identical (due to inexperienced translators or translation memory).
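To make the proposal concrete, here is a minimal sketch in Python of how explicit `N=` forms could take precedence over rule-based forms. It is not MediaWiki's implementation: the function `select_form` and the `category_index` callback are hypothetical names used only for illustration, and the "first form for 1, second form otherwise" rule stands in for English.

```python
def select_form(count, forms, category_index):
    """Pick a PLURAL form; an explicit 'N=text' form wins over rule-based forms.

    count          -- the integer passed to PLURAL
    forms          -- forms as written in the message, e.g. ["1=the user", "all $1 users"]
    category_index -- hypothetical callback mapping count to a rule-based form index
    """
    positional = []
    for form in forms:
        key, sep, text = form.partition("=")
        if sep and key.strip().isdecimal():
            # Explicit numeric form: use it only on an exact match.
            if int(key) == count:
                return text
        else:
            # Anything else (including text with a non-numeric "=") is positional.
            positional.append(form)
    if not positional:
        return ""
    # Clamp so that a message with fewer forms than categories still resolves.
    index = min(category_index(count), len(positional) - 1)
    return positional[index]


def english_index(n):
    """Assumed English-style rule: index 0 for exactly one, index 1 otherwise."""
    return 0 if n == 1 else 1


forms = ["1=the user", "2=both users", "all $1 users"]
for n in (1, 2, 5):
    print(n, select_form(n, forms, english_index))
```

In this sketch a form is treated as explicit only when the text before the first `=` is a plain decimal number, so most literal equals signs stay untouched.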

There are caveats:

Adding a |n=... syntax will likely break all existing messages having literal ='s inside {{PLURAL: ...}}. Such cases are likely rare, though. Using <nowiki>=</nowiki> should solve the issue but is a performance eater.

PLURAL rules are generally not bound to single numbers, but rather to expressions describing sets of numbers, such as (n mod 10 == 1) and the like.
Unless we can be sure that we will never need them, we should provide a way to use expressions as well. While this is not hard programmatically, the need to have "="s inside those expressions increases the overall complexity of the PLURAL syntax. Having to surround expressions with brackets to make the distinction seems fair.
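As a rough illustration of rules that are expressions over sets of numbers, the following Python sketch models each rule as a predicate and picks the first one that matches. The names `plural_index`, `choose` and `example_rules` are hypothetical, and the two predicates are only modelled on Slavic-style integer rules for the sake of the example; they are not taken from any rule file.

```python
def plural_index(n, rules):
    """Return the index of the first rule whose predicate matches n.

    If no rule matches, return len(rules), i.e. the catch-all last form.
    """
    for index, predicate in enumerate(rules):
        if predicate(n):
            return index
    return len(rules)


# Illustrative rules only: each one describes a set of numbers, not one figure.
example_rules = [
    lambda n: n % 10 == 1 and n % 100 != 11,
    lambda n: 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14,
]


def choose(n, forms, rules):
    """Select the form for n, clamping to the last form that was supplied."""
    return forms[min(plural_index(n, rules), len(forms) - 1)]


for n in (1, 3, 5, 11, 21):
    print(n, choose(n, ["form-a", "form-b", "form-c"], example_rules))
```

Note that with an empty rule list `plural_index` always returns 0, so `choose` always yields the first form, which matches the symptom in this bug's summary.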

I was redirected here from bug 40252 and after reading the above comments quickly, my understanding is that this bug arises only for languages which are "plural-less".

I don't completely agree with this concept. As an example, Persian is plural-less in the sense that nouns are not pluralized if preceded by numbers (e.g. "1 book", "2 book") but nouns are pluralized if not preceded by numbers (e.g. "the book is there", "the bookS ARE there"), and also the verb is pluralized all the time (last example).

Up until now, we have been able to use the PLURAL magic word to take care of the pluralization of verbs, etc. Now, this functionality is completely gone.

Respectfully, I suggest that the change to the functionality of the PLURAL magic word be reverted IMMEDIATELY (as it has affected many projects). Only THEN can we discuss what the correct way to change the code again is, and make sense of it.

(In reply to comment #11)

Adding a |n=... syntax will likely break all existing messages having literal ='s inside {{PLURAL: ...}}. Such cases are likely rare, though. Using <nowiki>=</nowiki> should solve the issue but is a performance eater.

This is a drawback, but an unavoidable one. The CLDR expression syntax doesn't use any = signs, so we don't have problems with ambiguity.

(In reply to comment #12)

I don't completely agree with this concept. As an example, Persian is
plural-less in the sense that nouns are not pluralized if preceded by numbers
(e.g. "1 book", "2 book") but nouns are pluralized if not preceded by numbers
(e.g. "the book is there", "the bookS ARE there"), and also the verb is
pluralized all the time (last example).

It's arguable whether these languages should by default have two plural forms or not.

Respectfully, I suggest the change to the functionality of PLURAL magic word to
be reverted IMMEDIATELY (as it has affected many projects). Only THEN, we can
discuss what is the correct way to change the code again, and make sense of it.

Let's not throw the baby out with the bathwater. We can (and already did for some languages) apply an effective workaround while we sort out this problem.

(In reply to comment #13)

It's arguable whether these languages should by default have two plural forms
or not.

It is arguable whether the CLDR representation of the various modes of handling plurals is fair and comprehensive or not. Based on [1], CLDR assumes there are only these modes: (a) to have two forms, like English; (b) to have one form only; (c) to have more than two forms.

The problem is that a language with two forms doesn't have to work exactly like English (where determiners like "the" and counters like "two" both cause the subsequent noun to be pluralized). In other words, category (a) above is not comprehensive enough to support languages like Persian or Mazani (while it supports English, Spanish or Turkish).

You might argue this is a limitation of CLDR and should be reported there, not in this bug. I will counter-argue that the implication for MediaWiki is that we can't adopt a standard which is not comprehensive, hence the point about reverting the change.

[1] http://cldr.unicode.org/index/cldr-spec/plural-rules

I am adding Roozbeh Pournader to this discussion; he is a native Persian speaker, and works for Unicode.org and may be able to shed some light here.

Alternate fix that tries to restore old MW behavior for languages without
defined plural rules - gerrit I345c3051
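The idea of falling back to the old behaviour when a language has no defined plural rules can be sketched as follows. This Python sketch is not the code in the gerrit change; it assumes that "old MW behavior" means "first form for exactly 1, second form otherwise", and the names `plural_index` and `convert_plural` are hypothetical.

```python
def plural_index(n, rules):
    """Index of the first matching rule, or len(rules) if none matches."""
    for index, predicate in enumerate(rules):
        if predicate(n):
            return index
    return len(rules)


def convert_plural(n, forms, rules):
    """Choose a form, with an assumed legacy fallback for rule-less languages."""
    if not forms:
        return ""
    if not rules and len(forms) > 1:
        # Assumption: the legacy default was singular for 1, plural otherwise.
        return forms[0] if n == 1 else forms[1]
    # Rule-based selection, clamped to the forms actually supplied.
    return forms[min(plural_index(n, rules), len(forms) - 1)]


# A language with no rules: without the fallback every count would get forms[0].
forms = ["egy bájt", "$1 bájt"]
print(convert_plural(1, forms, []))
print(convert_plural(5, forms, []))
```

The fallback only triggers when a translator supplied more than one form, so messages that deliberately use a single form are unaffected in this sketch.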

It boggles the mind that someone just made a breaking change in one of the most-used magic functions without bothering to ask for the opinion of translators or even notifying them afterwards. Are there no communication protocols in place at WMF at all? This is not the first time that users have to find out from obscure bug reports or commit summaries that they are supposed to use something differently.

Even languages which do not use the plural with numbers might want to use PLURAL for aesthetic reasons (such as writing "one" or "a" instead of 1), or might want to phrase the message differently when it is about multiple things. (For example, "you have [a message/X messages]; you can read [it/them] by..." - the bracketed parts might differ even in languages which do not use the plural with numbers; such is the case with Hungarian.)

(In reply to comment #17)

It boggles the mind that someone just made a breaking change in one of the
most-used magic functions without bothering to ask for the opinion of
translators or even notifying them afterwards. Are there no communication
protocols in place at WMF at all? This is not the first time that users have to
find out from obscure bug reports or commit summaries that they are supposed to
use something differently.

Hold your horses[1], Tisza. We are in the business of improving software, and in that process an intended improvement (in this case, using upstream standardised plural definitions for both PHP and JavaScript instead of manually maintained ones used in MediaWiki only) had an unintended side effect. Tests were added[2] with the initial change, but as no test was present for this particular case, this particular breakage went unnoticed.

Luckily we have very well educated users who will report an issue (this bug report is proof of that) and a very responsive internationalisation development team that created a fix within 20 hours of the report being made. Gerrit 23900 has now been merged, including a test, so that future breakage will be prevented.

A final note: Our software will keep breaking as we continue to improve it. It's not done on purpose, but it's part of the process.