Update non-standard language codes in the projects
Closed, Resolved, Public, Feature

Description

Author: Gerard.meijssen

Description:
Hoi, Aryeh Gregor asked me to make these changes: http://meta.wikimedia.org/w/index.php?title=Www.wikipedia.org_template&diff=1632626&oldid=1630733 They fix errors reported by a validator: http://validator.nu/?doc=http%3A%2F%2Fwww.wikipedia.org&profile=permissive

We can make our content comply with the standards when the language code is changed on the projects as well. I know these changes to be correct.

Please make these changes. They demonstrate that we are good Internet citizens. :)
Thanks,

GerardM

Version: unspecified
Severity: enhancement
See Also:
T10217

Details

Reference
bz20547

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 10:56 PM
bzimport set Reference to bz20547.
bzimport added a subscriber: Unknown Object (MLST).

What exactly is meant by this bug? What code should be changed?

The request is to change "zh-hak" to "hak" and others, but the list of "some language codes" is missing here so it's very unclear when this report would be "fixed". Gerard, could you clarify which language codes are affected?

http://www-01.sil.org/iso639-3/codes.asp

The request is to change "zh-hak" to "hak" and others

Assuming this refers to the www.wikipedia.org portal (which is my best guess), I should point out that this page is editable by Meta admins - so this is not a matter for Bugzilla.

Some of your "corrections" are not real corrections.

  • The validator just complains about new HTML5 attributes (like srcset on images) or elements (like bdi) which do not cause any problem. They are not really errors.
  • You corrected codes that are perfectly valid (note that it is NOT ISO 639-3 which is used in HTML, but BCP 47; many valid BCP 47 codes do not exist in ISO 639-3, and many codes valid in ISO 639-3 are invalid in BCP 47!)

Do not mix the (unstable) ISO 639 language codes with the standard BCP 47 language tags, which have always been normative in HTML (including HTML4) and have been stable for decades!

Note that BCP 47 uses *some* codes from ISO 639-1 (not all), *some* codes from ISO 639-2 (not all), and only then *some* codes from ISO 639-3. It also appends *some* codes from ISO 3166-1, *some* codes from UN M.49, *some* codes from ISO 15924, and *some* codes whose origin is the BCP 47 standard track itself.

The reference database for BCP 47 is *not* held by any ISO MA; it is the IANA database for language subtags. BCP 47 documents which ISO codes may be imported into the IANA database as subtags and how supplementary extension subtags may be registered (for language variants, or for locales, such as the Unicode locale extension subtags).
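To make the layering described above concrete, here is a minimal Python sketch that splits a tag into the positions BCP 47 assigns to each source standard. It is a simplified reading of the grammar, not a full parser: it accepts at most one extlang subtag and ignores extension singletons and grandfathered tags, and the names `_TAG` and `split_tag` are made up for this illustration. It only checks the shape of a tag; whether each subtag is actually registered still has to be looked up in the IANA database.

```python
import re

# Simplified BCP 47 shape:
#   language[-extlang][-script][-region][-variant...][-x-private...]
# Assumption: extension singletons and grandfathered tags are left out.
_TAG = re.compile(
    r"(?P<language>[a-z]{2,3}|[a-z]{5,8})"   # ISO 639 code, or a 5-8 letter registered subtag
    r"(?:-(?P<extlang>[a-z]{3}))?"           # extended language subtag, e.g. "hak" in "zh-hak"
    r"(?:-(?P<script>[a-z]{4}))?"            # ISO 15924 script position, e.g. "Latn"
    r"(?:-(?P<region>[a-z]{2}|[0-9]{3}))?"   # ISO 3166-1 alpha-2 or UN M.49 number
    r"(?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)"  # variant subtags
    r"(?:-x(?P<private>(?:-[a-z0-9]{1,8})+))?",                # private-use subtags
    re.IGNORECASE,
)


def split_tag(tag):
    """Return a dict of subtag positions, or None if the shape does not match."""
    m = _TAG.fullmatch(tag)
    if m is None:
        return None
    parts = m.groupdict()
    # Turn the "-a-b" style captures into plain lists of subtags.
    parts["variants"] = [v for v in parts["variants"].split("-") if v]
    parts["private"] = [p for p in (parts["private"] or "").split("-") if p]
    return parts


if __name__ == "__main__":
    for example in ("sr-Latn-RS", "es-419", "zh-hak"):
        print(example, split_tag(example))
```

A shape check like this cannot tell a registered subtag from an unregistered one, which is why the registry lookup matters for the cases discussed in this task.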

As I said in comment 2, if there are problems with the www.wikipedia.org portal page, please take the matter to [[m:Talk:www.wikipedia.org template]]. If you are concerned about the language codes somewhere else, please tell us exactly what you are referring to!

Yes, but your comment 2 is restricted to ISO 639-3, which is plain wrong!

And no, your comment 2 (or any other one) did NOT point to the talk page you suggest now.

For example, the change from "zh-hak" to "hak" alone is NOT required for conformance to the HTML standard; "zh-hak" remains fully conforming to BCP 47, even if it now has a "preferred" value and is deprecated (but not obsolete).

The real language tags that are violating BCP 47 are for example:

  • "nrm" (it also violates ISO 639-3)
  • "roa-tara" (it also violates ISO 15924)
  • "simple"

The language tag "pa-Guru" you "corrected" by replacing it by "pa" was perfectly correct; now it is more ambiguous (and breaks some renderers unable to choose the appropriate font to use for this language written in multiple scripts).

(In reply to Philippe Verdy from comment #7)

And no, your comment 2 (or any other one) did NOT point to the talk page you suggest now.

My apologies, I meant to point to comment 3. Sorry for the incorrect reference.

So I think I now understand the scope of this bug: you are stating that incorrect HTML lang attributes are being generated on the projects with language codes "nrm", "roa-tara", and "simple".

The real language tags that are violating BCP 47 are for example:

  • "nrm" (it also violates ISO 639-3)

"nrm" refers to Narom language. However, IANA have not provided a language code for Norman, so I don't know what we're meant to do here. I notice that www.wikipedia.org uses the made-up code "roa-x-nrm" for this language.

  • "roa-tara" (it also violates ISO 15924)

[[roa-tara:]] has the nonsensical lang attribute value "roa-Tara", as if Tara is a script. Again, the Tarantino dialect lacks a unique code and will probably never get one. The www.wikipedia.org portal just uses "roa" for this language.

  • "simple"

[[simple:]] has the correct lang attribute value "en".

The language tag "pa-Guru" you "corrected" by replacing it by "pa"

Not sure who you're talking to here, but it certainly wasn't me who did this. I doubt it was Gerard either.
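The "roa-Tara" remark above can be illustrated with a short Python sketch: a four-letter subtag directly after the language is read as a script, so it has to be a real ISO 15924 code. The set `KNOWN_SCRIPTS` is only a hand-picked excerpt of real script codes, and `script_subtag` is a hypothetical helper that looks at the second subtag only; a real check would consult the full IANA subtag registry.

```python
# Illustrative excerpt only: a few real ISO 15924 codes, not the full registry.
KNOWN_SCRIPTS = {"Arab", "Cyrl", "Guru", "Hans", "Hant", "Latn"}


def script_subtag(tag):
    """Return the subtag that sits in the script position, title-cased, or None.

    Simplification: only the second subtag is inspected, so tags that
    carry an extlang before the script are not handled.
    """
    subtags = tag.split("-")
    if len(subtags) > 1:
        candidate = subtags[1]
        # Four ASCII letters right after the language are read as a script.
        if len(candidate) == 4 and candidate.isascii() and candidate.isalpha():
            return candidate.title()
    return None


def has_known_script(tag):
    """True if the tag has no script subtag or one from the excerpt above."""
    script = script_subtag(tag)
    return script is None or script in KNOWN_SCRIPTS


if __name__ == "__main__":
    for example in ("pa-Guru", "roa-tara", "en"):
        print(example, script_subtag(example), has_known_script(example))
```

Under these assumptions "pa-Guru" passes because "Guru" is a genuine script code, while "roa-tara" is flagged because "Tara" lands in the script position without being one.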

The real language tags that are violating BCP 47 are for example:

  • "nrm" (it also violates ISO 639-3)

"nrm" refers to Narom language. However, IANA have not provided a language code for Norman, so I don't know what we're meant to do here. I notice that www.wikipedia.org uses the made-up code "roa-x-nrm" for this language.

The IANA database cannot reference this language if it is not even encoded in ISO 639 (so that one of the ISO 639 codes can be imported to the IANA database), and as long as there has not been any specific registration for the language in the IANA database.

"roa-x-nrm" would be conforming, but linguists still consider Norman to be a regional variant of French. "fr-x-norman" or just "fr-x-nrm" would be conforming and would make more sense than using the "roa" language family code.

(In BCP 47, the use of language family codes is not invalid but it is highly discouraged, as opposed to codes of macrolanguages like zh/Chinese or sh/Serbo-Croatian, which group several individual languages that have a large common base for mutual understanding, even if they are written with distinct scripts, because transliterators work quite well within the same individual language.)
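As a rough illustration of why tags such as "roa-x-nrm", "fr-x-norman" and "fr-x-nrm" are conforming: in BCP 47 everything after the "x" singleton is private use, and each of those subtags only needs to be 1 to 8 ASCII letters or digits. The Python sketch below checks just that part; the function names are hypothetical, and the subtags before "x" are not validated here.

```python
import re

# One private-use subtag: 1 to 8 ASCII letters or digits.
_PRIVATE_SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")


def private_use_subtags(tag):
    """Return the subtags after the first 'x' singleton, or [] if there is none.

    Tags that start with 'x' (entirely private use) are not handled here.
    """
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if "x" not in lowered[1:]:
        return []
    return subtags[lowered.index("x", 1) + 1:]


def private_use_ok(tag):
    """True if the tag has a private-use part and every subtag in it is well formed."""
    parts = private_use_subtags(tag)
    return bool(parts) and all(_PRIVATE_SUBTAG.fullmatch(p) for p in parts)


if __name__ == "__main__":
    for example in ("roa-x-nrm", "fr-x-norman", "fr-x-nrm", "fr"):
        print(example, private_use_subtags(example), private_use_ok(example))
```

Because private-use subtags carry no registered meaning, software outside the projects that agreed on them will typically fall back to the part before "x".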

Other examples:

"be-x-old" is perfectly conforming to BCP 47 (and so is also conforming to HTML or XML), even if this orthography now has a preferred language tag (but the association between "be-x-old" and "be-tarask" is private to Wikimedia projects, and not found in the IANA database), so for most software "be-x-old" and "be" alone cannot be distinguished.

By contrast, "zh-gan", "zh-hak" and "zh-yue" are also conforming, but they now have a documented preferred value in the IANA database without the "zh-" prefix of the macrolanguage.
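The preferred-value idea can be sketched in Python as a small lookup. The table `PREFERRED` holds only the three pairs named in this comment; the real IANA registry lists many more, and `canonicalize` is a hypothetical helper rather than an implementation of the full BCP 47 canonicalization rules.

```python
# Deprecated form -> preferred value, limited to the pairs named in this thread.
PREFERRED = {
    "zh-gan": "gan",
    "zh-hak": "hak",
    "zh-yue": "yue",
}


def canonicalize(tag):
    """Replace a known deprecated prefix with its preferred value.

    Tags that are not in the table are returned unchanged, since the old
    forms remain conforming and nothing else is known about them here.
    """
    lowered = tag.lower()
    for old, new in PREFERRED.items():
        if lowered == old:
            return new
        if lowered.startswith(old + "-"):
            # Keep any trailing subtags (script, region, ...) as written.
            return new + tag[len(old):]
    return tag


if __name__ == "__main__":
    for example in ("zh-hak", "zh-yue-HK", "pa-Guru"):
        print(example, "->", canonicalize(example))
```

Mapping in this direction is optional for conformance; it mainly helps software that matches on the shorter preferred tags.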

"zh-cmn" is also conforming, just like "cmn" alone, but both have a preferred value which is "zh" (the code "zh" of the macrolanguage, because Mandarin is the default language assumed in many applications for the Chinese macrolanguage).

One of the purposes of BCP 47 tags is to allow easy mapping of language/locale fallbacks (fallbacks are definitely not a goal of ISO 639), and also to preserve backward compatibility of tagged content (which ISO 639 codes do not guarantee).

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 11:01 AM
Aklapper removed a subscriber: Purodha.

We added a lot of new language codes to DNS today from old tickets, but they were all specific about one language code. This ticket is too open-ended in this form.

Can you add a specific list of names that are still missing at this point?

Pppery subscribed.

14 years after this was filed, is this still an issue?

There has been recent movement on T172035, but I don’t think this ticket is relevant anymore. The portals long ago migrated from me manually copy-pasting HTML around to Meta sysops generating the HTML using a Lua module to finally automating the whole process without Meta’s involvement (T128546).

Pppery changed the task status from Invalid to Resolved.