
Babel language codes should be normalised to lower case when used in categories
Closed, Resolved (Public)

Description

On enwiki the Babel categories are in the format of "Category:User xx", where xx is always in lower case. However, if a user enters codes with a different capitalisation in their #babel invocation, the page is categorised with that different capitalisation.

For example, on enwiki, the code {{#babel: En}} will add the page to the category "Category:User En", when it should be "Category:User en".

This also has the consequence that [[User:Babel AutoCreate]] creates duplicate categories for each different code capitalisation that someone has used. I see it has created both [[:Category:User Zh]] and [[:Category:User zh]], for instance. (I have blocked the Babel AutoCreate account on enwiki until we can find a way to fix this.)

I assume that lower case is the preferred code format for other wikis, and that seemed to be the case when I spot-checked a few of them. However, if any wikis use a different system, code capitalisation might need to be made configurable, rather than lower case being hard-coded.
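
As a rough illustration of the behaviour requested here, the following Python sketch builds a category title from a user-supplied code and lowercases it by default, with a switch for wikis that prefer another convention. The function name, the `lowercase` flag and the `template` parameter are hypothetical and are not part of the Babel extension.

```python
def babel_category(code: str, lowercase: bool = True,
                   template: str = "User {code}") -> str:
    """Build a Babel category title from a user-supplied language code.

    `lowercase` and `template` are hypothetical configuration knobs,
    standing in for whatever per-wiki setting might be chosen.
    """
    code = code.strip()
    if lowercase:
        # Normalise so that "En", "EN" and "en" share one category.
        code = code.lower()
    return "Category:" + template.format(code=code)


print(babel_category("En"))                   # Category:User en
print(babel_category("En", lowercase=False))  # Category:User En
```

With the default setting, {{#babel: En}} and {{#babel: en}} would resolve to the same category, so no duplicate would need to be created.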

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 3:00 AM
bzimport set Reference to bz61993.
bzimport added a subscriber: Unknown Object (MLST).
MarcoAurelio raised the priority of this task from Medium to Needs Triage. Aug 13 2015, 6:30 PM
MarcoAurelio subscribed.

I've just blocked the extension at Meta for doing this too. The "bot" is creating a lot of empty categories, categories for languages that don't exist or that no user added to their page, and duplicate categories (e.g. Category:User es [correct], Category:User Es [bad], Category:User ES [bad]).

(Retriaging: old bug imported from BZ that needs a proper assessment of severity)

@MarcoAurelio @MrStradivarius Would it be reasonable to implement a fix where the extension always uses the lowercase language code?

Priority set to high, per T112868.

Change 289604 had a related patch set uploaded (by Ricordisamoa):
Normalise language codes to lower case when used in categories

https://gerrit.wikimedia.org/r/289604

Change 289604 merged by jenkins-bot:
Normalise language codes to lower case when used in categories

https://gerrit.wikimedia.org/r/289604

Nikerabbit removed a project: Patch-For-Review.
Nikerabbit updated the task description. (Show Details)

I'm reopening this because the "fix" appears to have broken it even more. Parts of the code are now always capitalised when they shouldn't be: most wikis I've seen use lowercase, but Babel now force-uppercases countries and scripts, so the existing categories are empty and people's user pages now point to non-existent categories (e.g. https://www.wikidata.org/wiki/User:Addshore). See the most recent comment on T112868 too.

As far as I can tell, the preferred format is all lowercase, including countries and scripts, with the exception of the letter "N" to indicate native speaker level (or alternatives in other languages, like "M" in German):

This query attempts to list codes with capitalised countries or scripts
This one attempts to list codes using lowercased ones.

While there are probably a few missing from those queries because they use a different syntax, aren't linked to Wikidata, or have a Wikidata item that isn't marked as a user language category, it's quite clear that lowercase is predominant. I don't know how many of the capitalised ones are the preferred style for their wiki, but force-lowercasing the country and script would cause much less disruption than force-capitalising them.

Please consider disabling this "bot" until it can be fixed. At en.wikiquote it is not only creating these spurious categories (miscapitalized and redundant to correctly capitalized ones), but it is REcreating them when they have been deleted.

Please consider disabling this "bot" until it can be fixed.

I see that this was already requested at T132296, but it was closed as "Declined" for some reason. Does this mean that we must continue blocking the pseudo-account at individual wikis one by one to limit the damage?

@Ningauble: You could perhaps request a temporary global lock at https://meta.wikimedia.org/wiki/Steward_requests/Global - that should have the same effect as stopping the bot by preventing the bot from editing.

Another problem with the current behaviour: It is turning "roa-tara" (a special non-standard code - https://meta.wikimedia.org/wiki/Special_language_codes) into "roa-Tara" as if "tara" were a script (e.g. https://it.wikinews.org/wiki/Categoria:Utenti_roa-Tara)

I don't think global locks affect system accounts. Maybe they do in this case.

I have blocked the bot account at en.wikiquote because this is still happening. Other wikis will have to fend for themselves.

Please notify wikis where it is blocked (https://meta.wikimedia.org/wiki/Special:CentralAuth/Babel_AutoCreate) at their Administrators' Noticeboards (or equivalent) when (if) this is fixed (really fixed).

The Babel extension normalises language codes according to the internet standard BCP-47 (https://en.wikipedia.org/wiki/IETF_language_tag). Languages like pt-BR and zh-Hans are capitalised as such. Obviously roa-tara falls through the cracks, but it's a bit of a naughty, norm-defying language code, so it might need to be special-cased in the Babel code.
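
A minimal Python sketch of the conventional BCP 47 casing described here, with an exception list for codes such as roa-tara. It is a simplification and not the extension's code: it only looks at subtag length, it ignores private-use and extension subtags, and `SPECIAL_CODES` is a hypothetical name.

```python
# Hypothetical exception list: codes used on Wikimedia wikis that are
# not real BCP 47 tags and should be left in lowercase.
SPECIAL_CODES = frozenset({"roa-tara"})


def bcp47_case(code: str, special=SPECIAL_CODES) -> str:
    """Apply conventional BCP 47 casing to a language code."""
    lowered = code.lower()
    if lowered in special:
        return lowered
    parts = lowered.split("-")
    cased = [parts[0]]  # the primary language subtag stays lowercase
    for sub in parts[1:]:
        if len(sub) == 4 and sub.isalpha():
            cased.append(sub.title())  # four letters: treated as a script
        elif len(sub) == 2 and sub.isalpha():
            cased.append(sub.upper())  # two letters: treated as a region
        else:
            cased.append(sub)
    return "-".join(cased)


print(bcp47_case("pt-br"))                           # pt-BR
print(bcp47_case("zh-hans"))                         # zh-Hans
print(bcp47_case("roa-tara"))                        # roa-tara
print(bcp47_case("roa-tara", special=frozenset()))   # roa-Tara
```

The last call shows how a length-based rule mistakes "tara" for a script when no exception is made.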

I think the expectation that Babel should output lowercase category names is misguided. The fact that MediaWiki internally uses that style doesn't make it right. Language tags on HTML pages, the list in your MediaWiki preferences, and just about anywhere else you care to look, all use the BCP-47 style.

There's no doubt that Babel AutoCreate has made a complete mess of the babel category system on various wikis over the years... It would be useful to have a script that goes through and moves categories like "User pt-br" to the proper name, along with T62162 and any other tasks desired by the wiki community.
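
A first step for such a clean-up script could be to find categories whose titles differ only in capitalisation. The Python sketch below does only that grouping, on a list of titles supplied by the caller; fetching the titles and moving pages are left out, and the function name is hypothetical.

```python
from collections import defaultdict


def find_case_duplicates(titles):
    """Group category titles that differ only by capitalisation.

    Returns a dict mapping the case-folded title to the list of
    variants seen (in input order), for groups with several variants.
    """
    groups = defaultdict(list)
    for title in titles:
        groups[title.casefold()].append(title)
    return {key: names for key, names in groups.items() if len(names) > 1}


titles = [
    "Category:User es",
    "Category:User Es",
    "Category:User ES",
    "Category:User zh",
]
for key, variants in find_case_duplicates(titles).items():
    print(key, variants)
```

Which variant in each group is the "proper" name would still be a per-wiki decision.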

Those language tags are not case-sensitive, so "pt-br" is still perfectly valid, and all lowercase is what Wikimedia projects have used for years (including in pre-Babel templates, which are still widely used). Trying to force a new style on everyone is really disruptive: wikis already have a well-established system of categories, which has now been completely messed up by Babel suddenly switching to a different capitalisation. It's no wonder Babel AutoCreate keeps being indefinitely blocked, and there are even people who want it turned off or globally blocked because it's so disruptive.

I have now blocked the bot account on svwiki. Any progress?

In https://gerrit.wikimedia.org/r/446766 I introduced BabelLanguageCodes::getCategoryCode(), which maps MediaWiki-internal language codes to appropriate category names. The current algorithm is to use the (lowercased) MediaWiki-internal code if it doesn't contain a hyphen (e.g. en, simple, de), and otherwise to use the properly capitalized BCP 47 code (zh-Hans, etc.). This matches previous expectations as canonized in the extension's PHPUnit tests. If we wanted some other behavior for category codes, it ought to be straightforward to patch getCategoryCode() for whatever is desired.
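
The rule described in this comment can be sketched in Python as follows. This is an illustration of the stated algorithm and not the PHP implementation; `to_bcp47` is a hypothetical, simplified stand-in for whatever BCP 47 formatter the extension actually calls.

```python
def to_bcp47(code: str) -> str:
    """Simplified stand-in for a BCP 47 formatter (assumption):
    four-character subtags are title-cased, two-character ones upper-cased."""
    head, *rest = code.lower().split("-")
    cased = [
        s.title() if len(s) == 4 else s.upper() if len(s) == 2 else s
        for s in rest
    ]
    return "-".join([head, *cased])


def get_category_code(internal_code: str) -> str:
    """Map an internal language code to the code used in category names."""
    code = internal_code.lower()
    if "-" not in code:
        return code        # e.g. en, simple, de
    return to_bcp47(code)  # e.g. zh-Hans, pt-BR


for code in ("EN", "simple", "zh-hans", "pt-br"):
    print(code, "->", get_category_code(code))
```

Changing the category style for hyphenated codes, for instance to all lowercase, would then amount to replacing the final `return` in `get_category_code`.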

Aklapper lowered the priority of this task from High to Medium. May 23 2019, 6:06 PM

Babel AutoCreate is still blocked on many wikis. This task is likely in need of movement-wide clarification or re-evaluation.

The current version of the bot appears to be working properly. What are the exceptions that cause the bot to be banned?

IN changed the task status from Open to Stalled. Jun 18 2022, 3:59 AM

At least I think the problem has been fixed, but I don't know if anyone else can offer a rebuttal. So I'm going to mark this one as stalled.

Change 881944 had a related patch set uploaded (by Pppery; author: Pppery):

[mediawiki/extensions/Babel@master] Add capability to override babel categories on wiki

https://gerrit.wikimedia.org/r/881944

(The above patch won't fully resolve this, I just linked it here for visibility)

Pppery changed the task status from Stalled to Open. Jan 21 2023, 2:25 AM

Also, this is not stalled - there's nothing stopping someone from working on it. But I'm not going to be that someone any more than the above patch.

It might be interesting to add the basic category ($wgBabelMainCategory and $wgBabelCategoryNames) settings that are currently in the config file to the community config: T323811.

Agreed. But that's pie-in-the-sky right now and not very related to this task.

Change 881944 merged by jenkins-bot:

[mediawiki/extensions/Babel@master] Add capability to override babel categories on wiki

https://gerrit.wikimedia.org/r/881944

Iniquity claimed this task.
Iniquity added a subscriber: Pppery.

It seems to me that this task can be closed as resolved, since the main problem is solved. The remaining problem with capital letters after hyphens is addressed by @Pppery's patch for T33074: Use on-wiki messages for category configuration.