
duplicate/invalid language codes
Open, Medium, Public

Details

Reference
bz42396

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 12:55 AM
bzimport added projects: Wikidata.org, I18n.
bzimport set Reference to bz42396.
bzimport added a subscriber: Unknown Object (MLST).

This bug is probably too general to be useful (perhaps transform into a tracking bug?), but as we have another equally general report let me copy it here:


Small update: I went through the language list at

https://github.com/mkroetzsch/wda/blob/master/includes/epTurtleFileWriter.py#L472

and added a number of TODOs to the most obvious problematic cases. Typical problems are:

  • Malformed language codes ('tokipona')
  • Correctly formed language codes without any official meaning (e.g., 'cbk-zam')
  • Correctly formed codes with the wrong meaning (e.g., 'sr-ec': Serbian from Ecuador?!)
  • Language codes with redundant information (e.g., 'kk-cyrl' should be the same as 'kk' according to IANA, but we have both)
  • Use of macrolanguages instead of languages (e.g., "zh" is not "Mandarin" but just "Chinese"; I guess we mean Mandarin; less sure about Kurdish ...)
  • Language codes with incomplete information (e.g., "sr" should be "sr-Cyrl" or "sr-Latn", both of which already exist; same for "zh" and "zh-Hans"/"zh-Hant", but also for "zh-HK" [is this simplified or traditional?]).
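
To make the categories above concrete, here is a minimal Python sketch that splits a code into subtags purely by shape. It assumes a simplified subset of the RFC 5646 (BCP 47) grammar (language, optional extended language, script, region) and does no registry lookup, so it can only show how a code would be read, not whether it is valid. The name `parse_tag` is hypothetical. Note that the sketch accepts only two- or three-letter primary subtags; the full grammar also reserves longer primary subtags for registered languages.

```python
import re

# Simplified subset of the RFC 5646 grammar (an assumption of this sketch):
# primary language, optional extlang, optional script, optional region.
# Variants, extensions and private-use subtags are deliberately ignored.
_TAG = re.compile(
    r"^(?P<language>[a-z]{2,3})"
    r"(?:-(?P<extlang>[a-z]{3}))?"
    r"(?:-(?P<script>[a-z]{4}))?"
    r"(?:-(?P<region>[a-z]{2}|[0-9]{3}))?$",
    re.IGNORECASE,
)


def parse_tag(code):
    """Return the subtags of `code` by shape, or None if it does not fit."""
    match = _TAG.match(code)
    if match is None:
        return None
    parts = {k: v for k, v in match.groupdict().items() if v is not None}
    # Conventional casing: language lower, script Title, region UPPER.
    parts["language"] = parts["language"].lower()
    if "extlang" in parts:
        parts["extlang"] = parts["extlang"].lower()
    if "script" in parts:
        parts["script"] = parts["script"].title()
    if "region" in parts:
        parts["region"] = parts["region"].upper()
    return parts


for code in ["tokipona", "cbk-zam", "sr-ec", "kk-cyrl", "zh-HK"]:
    print(code, "->", parse_tag(code))
```

Read this way, 'sr-ec' comes out as language 'sr' with region 'EC', which is why it looks like "Serbian from Ecuador", and 'kk-cyrl' comes out as 'kk' with script 'Cyrl'.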

Fomafix subscribed.

Reopened. It is not fixed. It is still possible to add unwanted values via API:

It seems fixed to me. I just made this edit with uselang=be-x-old: https://www.wikidata.org/w/index.php?title=Q1&diff=112330190&oldid=112313552

Here you got the correct language code be-tarask because MediaWiki core converts the URL parameter uselang=be-x-old to the user interface language be-tarask.

No. T39459 requests restricting the user interface to known languages. The database is already restricted to known languages, except for duplicate language codes like als/gsw, be-x-old/be-tarask, ...

This task requests restricting the database, including for API requests, so that the unwanted language codes defined in wgDummyLanguageCodes are disallowed.

T39459 requests restricting the user interface to known languages

No, it doesn't.

It is not possible to add a label/description/alias with language code xyzzy. Neither via GUI nor via API. When you try to do it, you get an error message.

When you open the GUI with uselang=xyzzy you get a UI which gives you input elements for label/description/alias in language xyzzy. You showed me exactly this in T39459#1468881.

Can we close this and just make new tasks for anything that is still outstanding?

Can we close this and just make new tasks for anything that is still outstanding?

You cannot close this task as resolved, because it is not solved. But you can merge it with a similar task, for example T102533: [Bug] Disallow (or resolve) dummy language codes.

Solving this task requires several subtasks. The first should be to disallow adding new entries with deprecated language codes.

In T44396#2787104, @Fomafix wrote:

You cannot close this task as resolved, because it is not solved. But you can merge it with a similar task, for example T102533: [Bug] Disallow (or resolve) dummy language codes.

Solving this task requires several subtasks. The first should be to disallow adding new entries with deprecated language codes.

Good point! But

You can not create a relationship to object "PHID-TASK-hubexo6f7fq5spgvmjqd" because objects can not be related to themselves.

...

<del>Then what's MLST here?</del>

NOTE: This is tracked at: T122677

From "bzimport added a subscriber: Unknown Object (MLST)."? Probably wikibugs-l; IIRC they removed this functionality from Phabricator.

So there are currently more than 30,000 invalid terms in Wikidata, mostly in als, es-formal, no and simple. Doing cleanup again and again is pointless.

Language codes in question:

als
bat-smg
bh
de-formal
es-formal
fiu-vro
hu-formal
nl-informal
no
roa-rup
simple
zh-classical
zh-min-nan
zh-yue

They are all supported by MediaWiki but should be blacklisted in Wikibase.

I have recently migrated all uses of "bat-smg", "bh", "fiu-vro", "roa-rup", "zh-classical", "zh-min-nan", and "zh-yue" on labels, descriptions, and aliases to "sgs", "bho", "vro", "rup", "lzh", "nan", and "yue" respectively, so now would be a great time to at least disallow those language codes.

Since my original comment, I have been semi-regularly migrating occasional additions of labels/descriptions/aliases in the language codes noted there, in addition to a "no" to "nb" migration with @jhsoby's approval. The sooner these codes can be disallowed, the less work this will be for everyone.
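
For illustration, the migration described above can be sketched in Python roughly as follows. The mapping is taken from the two comments above; the function name `migrate_terms`, the plain `{language_code: text}` data shape, and the policy of reporting conflicts instead of overwriting are assumptions of this sketch, not how the actual migration was run.

```python
# Deprecated code -> replacement, as listed in the comments above.
REPLACEMENTS = {
    "bat-smg": "sgs",
    "bh": "bho",
    "fiu-vro": "vro",
    "roa-rup": "rup",
    "zh-classical": "lzh",
    "zh-min-nan": "nan",
    "zh-yue": "yue",
    "no": "nb",
}


def migrate_terms(terms):
    """Return (migrated, conflicts) for a {language_code: text} mapping.

    A value stored under a deprecated code moves to its replacement.
    If the replacement already holds a different text, the deprecated
    entry is reported in `conflicts` instead of overwriting anything
    (a hypothetical policy chosen for this sketch).
    """
    migrated = {c: t for c, t in terms.items() if c not in REPLACEMENTS}
    conflicts = {}
    for code, text in terms.items():
        target = REPLACEMENTS.get(code)
        if target is None:
            continue  # not a deprecated code, already copied
        if migrated.get(target, text) == text:
            migrated[target] = text
        else:
            conflicts[code] = text
    return migrated, conflicts


# Hypothetical sample labels, for illustration only.
labels = {"en": "universe", "no": "universet", "nb": "universet", "bh": "x"}
print(migrate_terms(labels))
```

Without a check at write time, new terms in the deprecated codes can keep arriving, so a pass like this would have to be repeated.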

Is there a blacklist feature or would that need to be created?

Is there a blacklist feature or would that need to be created?

In T44396#7150849, @Esc3300 wrote:

@Mbch331 what do you think?

That's more for WMDE. This goes deeper into the codebase.

The request is from 2012, but maybe it was designed since. If you haven't come across it, I suppose it doesn't exist and needs to be requested: see T284808 for the termbox.

Is there a blacklist feature or would that need to be created?

The DifferenceContentLanguages thing. It is already in place for DefaultMonolingualTextLanguages, but not for DefaultTermsLanguages.
I think all of this has been waiting for T66649 to be fixed.
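
As a rough illustration of the "difference" idea, here is a Python stand-in: the allowed term languages are a base set minus an excluded set. The class and method names below are hypothetical and are not the Wikibase PHP API; the excluded codes are the ones listed earlier in this thread, and the base set is a tiny made-up sample.

```python
class StaticLanguages:
    """A fixed set of language codes (hypothetical helper)."""

    def __init__(self, codes):
        self._codes = frozenset(codes)

    def get_languages(self):
        return sorted(self._codes)

    def has_language(self, code):
        return code in self._codes


class DifferenceLanguages:
    """Languages of `base` that are not in `excluded`."""

    def __init__(self, base, excluded):
        self._base = base
        self._excluded = excluded

    def get_languages(self):
        return [
            code
            for code in self._base.get_languages()
            if not self._excluded.has_language(code)
        ]

    def has_language(self, code):
        return self._base.has_language(code) and not self._excluded.has_language(code)


# Tiny made-up sample of supported codes, for illustration only.
supported = StaticLanguages(["als", "gsw", "no", "nb", "simple", "en"])
# Codes listed in this thread as ones that should be disallowed.
excluded = StaticLanguages([
    "als", "bat-smg", "bh", "de-formal", "es-formal", "fiu-vro", "hu-formal",
    "nl-informal", "no", "roa-rup", "simple", "zh-classical", "zh-min-nan",
    "zh-yue",
])
term_languages = DifferenceLanguages(supported, excluded)
print(term_languages.get_languages())
```

With a check like `has_language` applied when terms are saved, a code such as "no" would be rejected while "nb" is still accepted.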

So there are currently more than 30,000 invalid terms in Wikidata.

Over 500,000 right now.