[Task] Don't try to add labels in non-existing languages: restrict to Language::isKnownLanguageTag
Open, Low, Public

Description

If you go to a page and specify a non-existent content language, e.g. https://test.wikidata.org/w/index.php?title=Q12345&uselang=xyzzy, the interface will offer to let you enter the label in that language. This fails with an "Unrecognized value for parameter 'language': xyzzy" error, but I believe the interface should not even offer the possibility of entering the label, perhaps displaying an error right away.


See Also:
T41623: Invalid language codes via uselang are used for lang HTML tag et al.
T44396: duplicate/invalid language codes
T51024: [Task] Removal of de-formal from allowed language for labels
T38430: Specify language fallback
T66649: 'Add link' function in Wikipedia creates items with wrong language code

Details

Reference
bz37459

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 12:25 AM
bzimport set Reference to bz37459.
bzimport added a subscriber: Unknown Object (MLST).

It should not be possible to set an illegal code through the user interface, but it can be done by directly manipulating the url.

Normal "page" behavior is to fall back and end up with English messages if the language is unknown. This makes it seem like the site uses English as the user language when this happens, but the user language code is wrong for the whole page.

It seems likely that either the user language code should fall back in a similar way if the code itself is illegal (i.e. the bug scenario), or this should be solved through the normal language fallback once that is implemented, and then only if the code itself is legal.

Note that the language fall back mechanism in Wikidata might only be available if the user is logged in.

The official language switcher will not be through the URL query parameter, but through a widget to change the language. Tinkering with the uselang query parameter is something that should be captured at the core, not in the extension, if at all, I think. Not sure.

How will the language switcher change the language if not through the URL? Through a cookie? I don't think that's a good idea, for multiple reasons.

Anyway, the extension doesn't need to tinker with uselang, it could simply write an error message, or fallback to English like the core.

OK, this sounds like a good solution.

This should be fixed in the ULS. We could possibly add (or ask whether one already exists) a test for languages that lack message files (Language::isValidBuiltInCode), or a more general test for defined codes (Language::isValidAndDefined). Currently the ULS seems to use RequestContext::sanitizeLangCode, which calls Language::isValidCode, which only checks whether the code is well-formed. We probably need something that checks whether the language in fact exists. This is at line 93 in UniversalLanguageSelector.php and at line 217 in RequestContext.php.

The easiest solution, it seems to me, is to add a config setting that makes the ULS use Language::isValidBuiltInCode instead of Language::isValidCode, by adding additional sanitizing in UniversalLanguageSelector::getLanguage and skipping the call to RequestContext::sanitizeLangCode.
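The gap described in the two comments above, a code that is well-formed but does not correspond to any existing language, can be illustrated with a minimal sketch. It is not MediaWiki code: the regular expression, the sample language set and all function names are hypothetical stand-ins, and the real rules in core may differ.

```python
import re

# Hypothetical stand-in for a "well-formed" check: lowercase ASCII
# letters, digits and hyphens. The real rules in core may differ.
WELL_FORMED = re.compile(r"[a-z0-9-]{2,}")

# Hypothetical sample of languages the wiki actually knows about.
KNOWN_LANGUAGES = {"en", "de", "fr", "nb"}


def is_well_formed(code: str) -> bool:
    """Syntax-only check: says nothing about whether the language exists."""
    return WELL_FORMED.fullmatch(code) is not None


def is_known(code: str) -> bool:
    """Stricter check: the code must also be in the known set."""
    return is_well_formed(code) and code in KNOWN_LANGUAGES


def sanitize_user_language(code: str, default: str = "en") -> str:
    """Fall back to a default when the requested code is not known."""
    return code if is_known(code) else default
```

With these stand-ins, `xyzzy` passes the syntax-only check but is rejected by the stricter one, so `sanitize_user_language("xyzzy")` falls back to the default.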

(In reply to comment #5)

This should be fixed in the ULS.

ULS doesn't use uselang.

(In reply to comment #3)

Anyway, the extension doesn't need to tinker with uselang, it could simply
write an error message, or fallback to English like the core.

See bug 39623 comment 9 for validation, but this can likely end up being a duplicate of bug 46455.

I'd rather say that bug 46455 is a duplicate of this bug ;) But yes, they are similar: 46455 is about recognized but unwanted codes, this one is about completely unrecognized codes; otherwise to me they seem the same.

I think that allowing any valid BCP 47 might suffice. Adding Niklas to CC for an opinion.

I'd restrict data input to Language::isKnownLanguageTag so that we can at least display the name... otherwise we can get all kinds of weird stuff. We can extend our known language tag coverage by resurrecting https://gerrit.wikimedia.org/r/#/c/11829/ - perhaps as an extension.

With regards to ULS, interface language selection should be limited to supported languages (Language::isSupportedLanguage) for setlang. uselang should still be able to use pretty much anything (Language::isValidCode).
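The three levels of strictness proposed in the comment above could be sketched roughly as follows. This is only an illustration: the `Context` enum, the sample sets and the helper functions are hypothetical stand-ins for Language::isKnownLanguageTag, Language::isSupportedLanguage and Language::isValidCode, and the sample data is invented.

```python
from enum import Enum


class Context(Enum):
    TERM_INPUT = "term input"  # labels, descriptions, aliases
    SETLANG = "setlang"        # persistent interface language choice
    USELANG = "uselang"        # one-off URL parameter


# Invented sample data: languages with interface messages, and a larger
# set of languages whose name can at least be displayed.
SUPPORTED = {"en", "de"}
KNOWN = SUPPORTED | {"gsw"}


def is_valid_code(code: str) -> bool:
    """Loose syntax check (stand-in): non-empty ASCII letters, digits, hyphens."""
    return bool(code) and all(
        c.isascii() and (c.isalnum() or c == "-") for c in code
    )


def is_allowed(code: str, context: Context) -> bool:
    """Apply the strictness level that matches the context."""
    if context is Context.TERM_INPUT:
        return code in KNOWN
    if context is Context.SETLANG:
        return code in SUPPORTED
    return is_valid_code(code)
```

Under these assumptions `xyzzy` would still be accepted for uselang, but not for setlang or for entering terms.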

*** Bug 46455 has been marked as a duplicate of this bug. ***

So is this something that needs to be done in Wikibase actually? Or Core?

(In reply to comment #11)

So is this something that needs to be done in Wikibase actually? Or Core?

Wikibase. Core already offers the language code validation function you need, which is mentioned by Niklas above and in the summary.

Akkakk produced helpful stats and lists of labels, descriptions and aliases in invalid language codes (plus de-formal, which is another story, bug 49024). Over 400 thousand... They'll be fixed by bots, but this warrants a "major".

https://www.wikidata.org/w/index.php?title=Wikidata:Project_chat&oldid=124218030#.27als.27.2F.27gsw.27.3F_.27de-formal.27.3F
https://www.wikidata.org/wiki/User:Akkakk/issues/deprecated-languages

(In reply to Nemo from comment #14)

Related: https://gerrit.wikimedia.org/r/#/c/164725/3
I asked there to explain better here.

This bug still exists: The patch you linked only affects languages referenced via the Babel parserfunction on user pages.

(In reply to Marius Hoch from comment #15)

This bug still exists: The patch you linked only affects languages
referenced via the Babel parserfunction on user pages.

Ah. So that was a consequence of (ab)using internal Babel functions? Can you please also file a bug against Babel requesting an API to provide the information that (at least) Wikidata needs? Thanks.

(In reply to Nemo from comment #16)

(In reply to Marius Hoch from comment #15)

This bug still exists: The patch you linked only affects languages
referenced via the Babel parserfunction on user pages.

Ah. So that was a consequence of (ab)using internal Babel functions? Can you
please also file a bug against Babel requesting an API to provide the
information that (at least) Wikidata needs? Thanks.

No, I don't think Babel should have such an API: The API it has is fine and anything we do beyond that doesn't belong in Babel.

(In reply to Marius Hoch from comment #17)

No, I don't think Babel should have such an API: The API it has is fine and
anything we do beyond that doesn't belong in Babel.

Oh. 3 out of 3 people I asked so far thought the opposite. You may want to join https://www.mediawiki.org/wiki/Thread:Extension_talk:Babel/Babel_API as we're off topic here.

Lydia_Pintscher removed a subscriber: Unknown Object (MLST).

Due to language fallbacks, this looks pretty good now (https://www.wikidata.org/wiki/Q2?uselang=xyzzy), but it still asks you to enter label, description and aliases in xyzzy. EntityViewFactory::newEntityTermsView could be replaced with an EntityTermsViewFactory that got the ContentLanguages for terms passed in and gets a LanguageFallbackChain, yielding an EntityTermsView in the first actually supported language in the fallback chain.

Or we could sanitize the language as early as possible, for example in EntityParserOutputGeneratorFactory. That would not fix this issue for languages which are allowed UI languages, but forbidden as content languages (de-formal for example).

Maybe we should do both.
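The first option above, choosing the first actually supported term language from the fallback chain, amounts to something like the following sketch. The function name and the sample chains are hypothetical and only illustrate the selection step, not the Wikibase classes themselves.

```python
from typing import Iterable, Optional, Set


def first_term_language(
    fallback_chain: Iterable[str], term_languages: Set[str]
) -> Optional[str]:
    """Return the first code in the chain that is allowed for terms."""
    for code in fallback_chain:
        if code in term_languages:
            return code
    return None  # nothing in the chain is usable as a term language


# Invented sample data for illustration.
TERM_LANGUAGES = {"en", "de"}

# An unknown UI language whose chain ends in English.
unknown_ui = first_term_language(["xyzzy", "en"], TERM_LANGUAGES)

# A UI language that is valid for the interface but not for terms.
formal_ui = first_term_language(["de-formal", "de", "en"], TERM_LANGUAGES)
```

This would cover both cases mentioned above: a completely unknown code such as xyzzy, and a code like de-formal that is allowed as a UI language but not as a content language.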

Fomafix claimed this task.
Fomafix subscribed.

The API and special pages like https://www.wikidata.org/wiki/Special:SetLabelDescriptionAliases check that the language code exists. A label, description or alias with the language code xyzzy is no longer possible. This is not done by Language::isKnownLanguageTag but by the special function hasLanguage in WikibaseContentLanguages.

Unwanted duplicate language codes like als/gsw or be-x-old/be-tarask are still possible. This is tracked in T44396. I am closing this task.

but it still asks you to enter label, description and aliases in xyzzy.

Still so:

wikibase-fake-language.png (273×538 px, 23 KB)

Unwanted duplicate language codes like als/gsw or be-x-old/be-tarask are still possible.

Language::isKnownLanguageTag returns true for them because they are listed in Names.php, but they should probably be removed (once T99059 is fixed).

Reopen. The task description is still not fixed.

Addshore renamed this task from Don't try to add labels in non-existing languages: restrict to Language::isKnownLanguageTag to [Task] Don't try to add labels in non-existing languages: restrict to Language::isKnownLanguageTag. Dec 4 2015, 1:00 PM

Change 282905 had a related patch set uploaded (by Adrian Heine):
[WIP] Don't use UI language for terms if it's not a term language

https://gerrit.wikimedia.org/r/282905

Addshore raised the priority of this task from Medium to High. Jun 26 2016, 3:18 PM
Addshore subscribed.

People brought up this task on "contact the dev team".

https://test.wikidata.org/w/api.php?action=wbgetentities&ids=Q2528

Looks like bad things are happening through Special:new* now: any string can be entered as a language code.
This also results in T138724 in the case of bad strings.

As the tickets don't quite line up I have created a new ticket T138725.

Addshore lowered the priority of this task from Medium to Low.

It looks like this is still an issue:

image.png (326×1 px, 47 KB)