Preferences and lang codes should distinguish "English" from "American English"/"U.S. English"
Open, Low, Public

Description

Right now our preferences list shows "en - English", "en-CA - Canadian English", and "en-GB - British English". In reality, however, the "en - English" entry is en-US ("American English" or "U.S. English").

We should update the preferences system and lang output to accurately reflect state:

  • Special:Preferences should list en-US instead of 'en' and call it by a proper name.
  • When en is used as the user language, lang="" should output en-US, as our 'en' i18n is en-US.
  • When the content lang is 'en' we should respect this as we don't know what locale the wiki's content actually uses, and lang="" for content should output 'en'.
  • When a user's preference is set to flat 'en', the preferences list should have the 'en-US' entry as the selected entry.
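
The four rules above can be sketched roughly as follows. This is only an illustration in Python, not MediaWiki code: the function names `lang_attribute` and `selected_preference` and the `"user"`/`"content"` context labels are hypothetical and do not appear anywhere in this task.

```python
def lang_attribute(code: str, context: str) -> str:
    """Return the value to emit in lang="" for a language code.

    context is "user" for interface text or "content" for wiki content
    (both labels are hypothetical, for illustration only).
    """
    if context not in ("user", "content"):
        raise ValueError(f"unknown context: {context!r}")
    # Interface messages for bare 'en' are written in American English,
    # so user-language output is labelled en-US.
    if context == "user" and code.lower() == "en":
        return "en-US"
    # Content language is left alone: we do not know which locale the
    # wiki's content actually uses.
    return code


def selected_preference(stored_code: str) -> str:
    """Map a stored user preference to the entry shown as selected."""
    # A stored bare 'en' selects the 'en-US' entry in the list.
    return "en-US" if stored_code.lower() == "en" else stored_code
```

Under these assumptions, `lang_attribute("en", "user")` yields `en-US` while `lang_attribute("en", "content")` stays `en`, and a stored preference of `en` selects the `en-US` entry.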

See also:
T154589: evaluate creation of en-us for Wikidata monolingual strings

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 11:49 PM
bzimport set Reference to bz31874.
bzimport added a subscriber: Unknown Object (MLST).
Bug 32889 has been marked as a duplicate of this bug.

We shouldn't be so quick to throw away "en". There is such a thing as International English, after all, so "en" doesn't necessarily have to refer to American English. Also, if we only have en-US, en-GB and en-CA, it doesn't leave any other category for other Englishes, of which there are quite a few. I imagine quite a few Australians may prefer "en" over "en-GB", for example, even though the spelling may be closer in the latter. Also, we shouldn't forget dialects like Indian English and Singlish. Perhaps English speakers of those dialects could get by with en-GB, but perhaps not; more investigation is needed, I think.

Our i18n files' en is not international English, it is written specifically in American English.

en - English isn't really being thrown out at all. We'd probably have a quiet alias so $wgLanguageCode = 'en'; will still work.

And for other English variations, no-one said we had to have "only" en-US, en-GB, and en-CA. In fact, originally we didn't even have en-CA, I had it created.

If anyone wants Australian English, Indian English, Singlish, or any other English dialect all they need to do is find someone willing to write the message changes and have the new dialect created on TWN.

How about creating "en-US" in addition to "en", instead of replacing it?

en is already en-US, there's no point confusing people by having them both in preferences.

FYI: there seems to be an analogous situation for Portuguese, as discussed on
https://pt.wikipedia.org/wiki/Project:Esplanada/propostas/Uso_do_portugu%C3%AAs_de_Portugal,_pt-PT_%284mar2012%29
In that context, my understanding is that we have:

  • pt-BR for Portuguese from Brazil
  • pt (in theory) for Portuguese from Portugal

However, the content language of Portuguese Wikipedia is set to pt and it seems to be common to have Brazilian expressions in the local pt translations in that wiki (i.e., replacing the ones from Translatewiki). So, for ptwiki readers, while pt-BR contains only Brazilian Portuguese translations, pt is a mix of pt-PT and pt-BR translations (which is probably unwanted by readers from Portugal).

> Our i18n files' en is not international English, it is written specifically in American English.

> en is already en-US, there's no point confusing people by having them both in preferences.

That's not true. en i18n should be written in international English and avoid locale-specific variation as much as possible.

The footnote at https://en.wikipedia.org/w/index.php?title=Metre&oldid=817655372#cite_note-3 states:

Thus, the spelling metre is referred to as the "international spelling"; the spelling meter, as the "American spelling".

Currently the system message exif-subjectdistance-value uses:

en-ca.json:	"exif-subjectdistance-value": "$1 metres",
en-gb.json:	"exif-subjectdistance-value": "$1 metres",
en.json:	"exif-subjectdistance-value": "$1 meters",

An international spelling would be:

en-ca.json:	"exif-subjectdistance-value": "$1 metres",
en-gb.json:	"exif-subjectdistance-value": "$1 metres",
en-us.json:	"exif-subjectdistance-value": "$1 meters",
en.json:	"exif-subjectdistance-value": "$1 metres",
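
One way to read the proposed layout is as a fallback lookup: a variant file supplies its own spelling, and anything it does not define is taken from en. The sketch below assumes that every English variant falls back to bare en; the `MESSAGES` and `FALLBACKS` tables and the `get_message` helper are hypothetical names for illustration, not MediaWiki's actual API.

```python
# Message tables mirroring the proposed files above (one key only).
MESSAGES = {
    "en": {"exif-subjectdistance-value": "$1 metres"},
    "en-us": {"exif-subjectdistance-value": "$1 meters"},
    "en-gb": {"exif-subjectdistance-value": "$1 metres"},
    "en-ca": {"exif-subjectdistance-value": "$1 metres"},
}

# Assumed fallback chains: every variant falls back to bare 'en'.
FALLBACKS = {"en-us": ["en"], "en-gb": ["en"], "en-ca": ["en"], "en": []}


def get_message(key: str, lang: str, *params: str) -> str:
    """Look up key in lang, then in its fallbacks, and fill $1, $2, ..."""
    for code in [lang, *FALLBACKS.get(lang, ["en"])]:
        text = MESSAGES.get(code, {}).get(key)
        if text is not None:
            # Replace higher-numbered placeholders first so that "$1"
            # does not clobber the start of "$10".
            for i in range(len(params), 0, -1):
                text = text.replace(f"${i}", params[i - 1])
            return text
    raise KeyError(key)
```

With these tables, a lookup for `en-us` returns the "meters" spelling, while `en` and any variant without its own entry return "metres".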

Change 412337 had a related patch set uploaded (by Fomafix; owner: Fomafix):
[mediawiki/core@master] Distinguish between International English (en) and American English (en-us)

https://gerrit.wikimedia.org/r/412337

Do the language codes in MediaWiki's list match ICU locale codes? They certainly appear to, but then we are overriding things like date formatting, so perhaps they shouldn't be thought of as strictly the same thing.

If these are actual locale names, then en without any country or variant code looks very much the same as en_US (e.g. short dates are M/d/yy). I'm also not sure whether en_001 "English (World)" is the same as International English.

Change 412337 abandoned by Fomafix:
Distinguish between International English (en) and American English (en-us)

Reason:
The exif-* messages no longer exist, and the words meter/metre do not currently appear anywhere else.

https://gerrit.wikimedia.org/r/412337

> Do the language codes in MediaWiki's list match ICU locale codes? They certainly appear to, but then we are overriding things like date formatting, so perhaps they shouldn't be thought of as strictly the same thing.

The language codes match, but their content doesn't. en refers to International English (also known as Oxford English), which is the variety used by international organizations such as the UN, ISO, IEC, BIPM, NATO, etc. and millions of people around the world. Appropriating that to mean US English is misleading and incorrect, especially when all of the other locales are distinguished with a modifier (e.g., en-GB: "British English", en-CA: "Canadian English", etc.).

I'd be happy to take this task up if someone can point me in a general direction of what the next steps are, especially given that https://gerrit.wikimedia.org/r/412337 was abandoned.

Change 698599 had a related patch set uploaded (by Jforrester; author: Jforrester):

[mediawiki/core@master] Add language support for American English (en-US)

https://gerrit.wikimedia.org/r/698599

@Jdforrester-WMF Re: the actual code changes: Having en as separate from both en-GB and en-US means that en is international (Oxford English), which means it uses the -ize suffixes, but essentially all other British spellings. Instead of changing those, we'd want to change places where it would say color, center, etc. to colour, centre, etc. Also, we'd want to change all date formats to ISO 8601 for numeric ones and RFC 2822 ("DD Month YYYY") for spelled-out ones.

> @Jdforrester-WMF Re: the actual code changes: Having en as separate from both en-GB and en-US means that en is international (Oxford English), which means it uses the -ize suffixes, but essentially all other British spellings. Instead of changing those, we'd want to change places where it would say color, center, etc. to colour, centre, etc.

[Cross-posted comments from gerrit; let's keep the conversation here.]

I don't think going with Oxford spelling serves users as well as the normal spellings.

Having Oxford spelling serves most everyone in the world, as this is what the UN, ISO, IEC, WTO, BIPM, and myriad other international organizations, along with millions of people use and learn as "the standard".

That's an interesting position to take; I appreciate that many international organisations have taken it as a sop to US support.

However, does it actually serve our users well to give them 'en-GB-oxendict' when we say it's 'en'? Is Oxford Press's attempt to fuse British English with some (but not all) American English spellings actually closer to what normal people use in India, or Australia, or Nigeria? The purpose of this is to greet our readers in an interface that is most familiar to them, and especially with a top-level generic language code to pick something that is as neutral as possible. (Note that we don't historically do this very well; 'fr' is almost exclusively French French, and so on; we should probably fix those, too.)

Also, in a tactical sense, not using Oxford Spelling for the 'en' locale wouldn't make much sense since otherwise, 'en' and 'en-GB' would be identical locales.

en and en-GB would primarily differ on idiomatic expression and not spelling if we choose to go with actual international English rather than Oxford's approach, yes, but they would not be identical.

> Also, we'd want to change all date formats to ISO 8601 for numeric ones and RFC 2822 ("DD Month YYYY") for spelled-out ones.

I believe those already are the standards for 'en', FWIW.

> That's an interesting position to take; I appreciate that many international organisations have taken it as a sop to US support.
> However, does it actually serve our users well to give them 'en-GB-oxendict' when we say it's 'en'? Is Oxford Press's attempt to fuse British English with some (but not all) American English spellings actually closer to what normal people use in India, or Australia, or Nigeria? The purpose of this is to greet our readers in an interface that is most familiar to them, and especially with a top-level generic language code to pick something that is as neutral as possible. (Note that we don't historically do this very well; 'fr' is almost exclusively French French, and so on; we should probably fix those, too.)

It seems like you're misinformed about Oxford spelling and its raison d'être. It was not created as a "sop" for US support, nor does it fuse US spelling with British spelling; it predates US spelling by many years, in fact. It's a common misconception that -ize endings are exclusive to US spelling; British spelling allows either -ise or -ize endings. I should also mention that the difference between -ize and -ise endings is not the only difference between Oxford and British spelling; there's among vs. amongst, inquiry vs. enquiry, fetus vs. foetus, etc. Oxford has always maintained the -ize endings and its other spellings because of their etymological basis, which is the reasoning that Oxford uses in every case. That is the reason most (if not all) countries where English is learned as a second language teach Oxford English as simply "English"; i.e., it is the standard worldwide. The Oxford English Dictionary is the desk dictionary of myriad publications around the world as well.

This is not to mention that Oxford spelling is indeed the most neutral and international version of English (which would correspond to the en locale); this is the reason entities like the ones I mentioned and even forgot to mention (e.g., NATO, FIFA, Olympic committee, Red Cross, WWF, Nature, etc.) all use it. It doesn't get any more neutral than the UN, a body composed of practically all the national governments of the world, using it for all communications.

> I believe those already are the standards for 'en', FWIW.

Since it seems like the software has been using en to be synonymous with en-US, I think it's been using MDY dates (Month DD, YYYY) in many places IIRC.
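
For comparison, the three date styles mentioned in this thread can be sketched side by side. This is a hedged illustration in Python rather than MediaWiki's date-formatting code; the helper names and the hard-coded `MONTHS` list are assumptions made to keep the example independent of the system locale, and the choice not to zero-pad the day is this sketch's own.

```python
from datetime import date

# English month names, hard-coded so the output does not depend on
# the system locale.
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def iso_8601(d: date) -> str:
    """Numeric form: YYYY-MM-DD."""
    return d.isoformat()


def day_month_year(d: date) -> str:
    """Spelled-out form in the "DD Month YYYY" pattern."""
    return f"{d.day} {MONTHS[d.month - 1]} {d.year}"


def month_day_year(d: date) -> str:
    """Spelled-out MDY form: "Month DD, YYYY"."""
    return f"{MONTHS[d.month - 1]} {d.day}, {d.year}"
```

For the date on which this task was imported, the helpers give `2014-11-21`, `21 November 2014`, and `November 21, 2014` respectively.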