
{{#language:code1|code2}} should fail gracefully when Language::isValidCode or stricter is false for code2
Closed, ResolvedPublic

Description

Summary: we get a Server HTTP error 500 instantly with

{{#language:code1|code2}}

if code2 contains a single or double quote, or an ampersand.

So,

{{#language:en|'}} or
{{#language:en|"}} or
{{#language:en|&}}

DO crash. As these three characters are not valid in BCP47 language/locale codes (or in the few legacy non-standard codes used on Wikimedia sites and remaining in various historic pages), the "codes" in the parameter are returned verbatim without mapping them to a native language name.

But,

{{#language:'}} or {{#language:'|en}} or
{{#language:"}} or {{#language:"|en}} or
{{#language:&}} or {{#language:&|en}}

DO NOT crash: they are returned verbatim (in fact, only as decimal numeric character entities).

Details follow.


No language code should ever contain these three characters. Some local extensions may want to use other characters such as spaces/underscores, colons, slashes, at signs, dots, and so on, but these don't crash the #language function, not even if we attempt to feed it non-ASCII characters. Any occurrence of these characters in parameter 1 will make #language return the input string verbatim without translating it, so:

"{{#language:français}}" returns "français"
"{{#language:Slovopedia}}" returns "Slovopedia"

Now let's use a valid language code in the first parameter and feed the second parameter (to indicate that we want the language name translated into another target language, if possible):

"{{#language:fr|en}}" returns "French"
"{{#language:fr|fr}}" returns "français"
"{{#language:fr|de}}" returns "Französisch"

OK now with missing translations (and no fallback):

"{{#language:pdc|ckb}}" returns "Pennsylvany German":

both codes are valid, there's no other fallback than English

"{{#language:pdc|ckb-brai}}" returns "Pennsylvany German":

both codes are valid BCP47 codes, but the Braille script variant of language code "ckb" is still undefined (this would require implementing the transliteration scheme to Braille for this language); the server may retry using BCP47 rules, looking for a translation in "ckb" only; it does not find one and, after looking for defined fallbacks of "ckb", finally selects the default and gives the name of "pdc" in English.
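The lookup order described above (the requested code, then its truncated BCP47 forms, then any defined fallbacks, then English) can be sketched roughly as follows. This is a Python illustration only, not MediaWiki's PHP; the function names and the `names` and `fallbacks` tables are hypothetical stand-ins for the real data.

```python
# Hypothetical sketch of the lookup order; tables are illustrative only.

def candidate_targets(code, fallbacks, default="en"):
    """Return target codes in the order they would be tried."""
    seen = []
    parts = code.lower().split("-")
    # BCP47-style truncation: drop trailing subtags one at a time.
    for end in range(len(parts), 0, -1):
        tag = "-".join(parts[:end])
        for item in [tag, *fallbacks.get(tag, [])]:
            if item not in seen:
                seen.append(item)
    # The default language is always the last resort.
    if default not in seen:
        seen.append(default)
    return seen


def translated_name(code, target, names, fallbacks):
    """Name of `code` in the first candidate target that has a translation."""
    for candidate in candidate_targets(target, fallbacks):
        if (code, candidate) in names:
            return names[(code, candidate)]
    return code  # nothing found: return the input verbatim
```

With an empty fallback table, `candidate_targets("ckb-brai", {})` yields `["ckb-brai", "ckb", "en"]`, which matches the case above where only the English name is found.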

Now with invalid codes:
"{{#language:pdc|ckb+brai(1)}}" returns "Plattdütsch":

the second code is invalid under all rules, so it is ignored. No fallback chain can be determined, so the server will try to find the native name (all supported languages in MediaWiki have a native name or "autonym").

Now with invalid codes including the apostrophe-quote:
"{{#language:pdc|ckb it's failing}}": the server crashes with HTTP 500.

This is a serious issue which could allow a DoS attack on the server if the following very simple code:
"{{#language:en|'}}"
is inserted in a widely used template, so that it blocks navigation over lots of pages (and many server errors 500 may drain a lot of resources, if this error 500 comes from a PHP instance crash that must be restarted).

This code could be generated by feeding the second parameter with a subpagename (coming from {{SUBPAGENAME}} where it is HTML-encoded, or from {{SUBPAGENAMEE}} where it is URL-encoded with the legacy "WIKI" style).

To correct this:

The 2nd parameter of #language must be checked like the 1st one: if the string is longer than allowed language codes (you could accept up to the max length of a page name), or if it contains characters in ['"&], treat this parameter as an invalid language code and ignore it (but you can still use the 1st code to return the autonym mapped to it).
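The proposed check could look roughly like this. It is a minimal Python sketch, not MediaWiki's PHP: the function names, the 255-character limit (standing in for the maximum page-name length), and the tiny name table are all assumptions made for the example, and the fallback to English for missing translations is left out.

```python
# Hypothetical sketch of the proposed check; not MediaWiki's actual code.
FORBIDDEN = set("'\"&")
MAX_LENGTH = 255  # assumed bound, standing in for the max page-name length

# Illustrative table: (code, target) -> name; a target of None means autonym.
NAMES = {
    ("fr", None): "français",
    ("fr", "en"): "French",
    ("fr", "de"): "Französisch",
}


def is_acceptable_code(code):
    """Reject empty, over-long, or quote/ampersand-bearing strings."""
    if not code or len(code) > MAX_LENGTH:
        return False
    return not (FORBIDDEN & set(code))


def language_name(code1, code2=""):
    """Mimic {{#language:code1|code2}}, ignoring an invalid code2."""
    target = code2 if is_acceptable_code(code2) else None
    name = NAMES.get((code1, target)) or NAMES.get((code1, None))
    # An unknown code1 is returned verbatim, as described in this report.
    return name if name is not None else code1
```

Under these assumptions, `language_name("fr", "'")` returns the autonym instead of raising an error.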

For now, on MediaWiki-Wiki, I completed the following article about the issues and tricky details (and other related bugs/inconsistencies I discovered):

[[mw:Manual:PAGENAMEE encoding]]

Look at the table in this page showing the effects of the various encodings used in pagenames or for the three styles of urlencodings and anchorencode.

But the real issue in this bug report is in #language.

To avoid this bug in pages that attempt to detect whether a page is a translation or the source page of translations by checking the content of their last subpagename, I also performed many tests to make sure that

[[m:Template:Pagelang]] on Meta-Wiki and on MediaWiki-Wiki will now NEVER return any subpage name that:

  • matches the full page name (this is not a subpage of another base page, so it is not a translation produced by the Translate extension).
  • is not idempotent through {{lc:{{PAGENAME|...}}}} (this excludes subpagenames containing capital letters and any characters forbidden or transformed in pagenames).
  • contains any character that remains HTML-encoded after calling {{titleparts}} (these are the three characters ['"&]).
  • contains any characters other than [a-z0-9-.], i.e. the only characters that are idempotent in all encodings, including URL-encoding in its most restrictive style ("QUERY" style since MediaWiki 1.17).
  • does not start with a letter (this can be tested by comparing "lc:" to "ucfirst:lc:", as they MUST be different given that only ASCII letters are allowed).
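The filters listed above can be approximated outside wikitext as follows. This Python sketch is only an illustration of the rules, not the template's implementation; the function name is hypothetical and the HTML-encoding rule is covered implicitly by the character whitelist.

```python
import re

# Only the characters the list above treats as stable under every encoding.
ALLOWED = re.compile(r"[a-z0-9.-]+")


def looks_like_language_subpage(full_page_name):
    """Apply the listed filters to the last subpage segment of a page name."""
    base, sep, last = full_page_name.rpartition("/")
    if not sep or not base:
        return False  # not a subpage of another base page
    if last != last.lower():
        return False  # capital letters are excluded
    if not ALLOWED.fullmatch(last):
        return False  # quotes, ampersands, spaces, non-ASCII, ...
    return last[0].isalpha()  # must start with a letter
```

A page titled "Help:Rock 'n' roll" is rejected at the first step, and "Main/it's" at the whitelist step, so neither would ever reach #language as a second parameter.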

We could add other filters against some subpagename codes passing this test, such as "doc", "layout", "testcases", or "sandbox", used in templates (they are not valid BCP47 language codes, except "doc"; unfortunately documentation subpages of templates on English or multilingual wikis use "/doc", but for now we have never encountered the need to translate to this encoded language).

We could also apply stricter rules (to make sure that they are also valid domain name labels, i.e. at most 63 ASCII characters, no double hyphens, no trailing hyphens, if we exclude IDNA labels from interlanguage prefixes).
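A sketch of such a stricter check, in Python for illustration only. The function name is hypothetical; the length limit used is the 63-octet DNS label limit from RFC 1035, and the lowercase-only alphabet is an assumption taken from the other rules in this report.

```python
import string

MAX_LABEL_LENGTH = 63  # DNS label limit (RFC 1035)
LABEL_CHARS = set(string.ascii_lowercase + string.digits + "-")


def is_label_safe(code):
    """True if `code` could also serve as a plain (non-IDNA) DNS label."""
    if not 1 <= len(code) <= MAX_LABEL_LENGTH:
        return False
    # Double hyphens would admit IDNA-style "xn--" labels.
    if "--" in code or code.endswith("-"):
        return False
    return set(code) <= LABEL_CHARS
```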

This means that all codes will be lowercase only (even if BCP47 codes are case-insensitive, this gives fewer false positives with accidental subpages that could be created starting with a capital ASCII letter), such as:

"User:Kennedy/Bob"

But the following page name will accidentally match Indonesian when "id" is a subtemplate returning a numeric id and is not a translation of "Template:Page":

"Template:Page/id"

We can hope that users trying to use common templates on their user subpages will avoid naming them using sequences that could match valid language codes. These few pages could be moved/redirected if needed: here it could be renamed:

"Template:Page/Id"

so that it will no longer match a language code detected by the rules above.

Also, independently of the language codes supported in MediaWiki and in the new Translate extension, there are still lots of legacy codes used in subpages that denote specific variants of languages (they don't always match the BCP47 rules, but at least they should only use ASCII lowercase letters, hyphens, and digits, and no spaces/underscores or quotation marks; the few existing pages depending on these codes could be reworked to change their codes to private codes conforming to BCP47 rules).


Version: 1.23.0
Severity: normal

Details

Reference
bz60629

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 2:57 AM
bzimport set Reference to bz60629.
bzimport added a subscriber: Unknown Object (MLST).

#language parser function is part of core -> moving

"Normal priority" ?

As the Translate extension starts being used more widely, along with TNT and similar templates that detect subpages to decide whether a page is the source (untranslated) page or a translated subpage, we get cases where the source page to translate contains apostrophe-quotes.

And then we get the HTTP 500 server error. Apostrophe-quotes, double quotes, and ampersands are not uncommon in page titles, and such pages, on multilingual sites like MetaWiki or Commons, will cause this failure.

It is also very easy to reproduce, and on wikis that use a lot of utility templates to display various notice banners (which may be translated into the detected content language of the page), #language will crash in those utility templates.

If these utility templates are widely used, a user may insert the malicious code with #template, and could cause LOTS of pages to generate HTTP 500.

These templates will not be easily editable by users who have set a preference to immediately display a preview on the first edit. Most users visit the template page to view it along with its "noinclude" documentation containing some examples of rendering of the template. Such visits will crash before the user can click on the "Edit" tab.

Advanced users could also try editing it by entering the URL with the "?action=edit" parameter, but they will also fail if their user preferences include the preview on the first load of the editor.

It may not be easy to detect where the crash occurs, because it happens before the page is fully expanded and the dependencies are computed and saved: the page will never be rendered, or could only contain the outdated references to the previous state before the change in some deeply hidden sub-sub-template.

Maybe the server could implement a crash handler for pages so that they are at least autocategorized: we could explore the list to determine which transcluded template or subpage containing #language with bad parameters causes the crash.

The crash is so easy to reproduce that I fear it will now be exploited to generate DoS attacks against servers constantly trying to recover from HTTP 500 crashes by relaunching new instances (with empty caches in memory, each crash generates lots of IO on disk and on the database).

Needs CLDR extension to reproduce, was added with bug 16699

It is deployed on all Wikimedia wikis, notably multilingual wikis like Commons, Meta, and MediaWiki-Wiki itself, which make heavy use of #language to create language navigation bars.

The language bars also detect translations of pages in various ways, whether those pages are translated with the Translate extension or manually created with expected differences between languages (e.g. for pages containing user groups per language, or different contact addresses, or specific items or issues in specific languages), or because they need to support some extra languages not supported by the Translate extension, or because they want to disable some existing translations that have been blocked from edits and are no longer maintained (but not deleted and kept as historic).

Detecting existing translations will frequently test for the presence of subpages and, if one is found, will attempt to reference it in the navigation bar using a language name returned by #language, but displayed either in the user's own language or in the page-content language (both could be different from the language autonym returned by #language with only 1 parameter).

But the bug is critical when translations are derived from a base page, because the base page must know that it's the original and not a translation in a subpage.

As the base page does not have its last subpagename segment matching any language code, but could contain any character authorized in pagenames (including quotes and ampersands), it is difficult to avoid the case where #language will be called with the second parameter matching the original article title.

However it is expected that if code2 does not match any valid language code, it will be considered as not being a translation but the source language (usually English on multilingual wikis of Wikimedia, but not necessarily).

So #language should ignore code2 in this case and use only code1, i.e. return the autonym, or it could use another fallback such as the user's language, or the page content language if #language can detect it from another source (which one if this cannot be deduced from the current page name alone, if there's no language metadata stored for the current page itself?), or the wiki default language (not always English).

I think that the simplest fix is to discard code2 in this case and return the autonym (it's not the job of #language to determine another fallback chain, unless #language accepts more than 2 codes, as a list of target language codes to look for; if scanning this list terminates without finding a valid language, and none of them provides applicable fallbacks, use code1 as the final target, so return an autonym only).
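The multi-code variant mentioned in the parenthesis could be sketched like this. It is a hypothetical Python illustration: #language does not accept a list of targets today, and the pattern used for "looks like a language code" is a loose assumption, not MediaWiki's rule.

```python
import re

# Loose, assumed pattern for a plausible language code.
CODE_PATTERN = re.compile(r"[a-z][a-z0-9]*(-[a-z0-9]+)*")


def pick_name(code1, targets, names):
    """Name of code1 in the first usable target, else its autonym."""
    for target in targets:
        if CODE_PATTERN.fullmatch(target) and (code1, target) in names:
            return names[(code1, target)]
    # No usable target: fall back to the autonym, or the raw input.
    return names.get((code1, code1), code1)
```

Here the autonym is stored under `(code1, code1)`, mirroring the fact that "{{#language:fr|fr}}" returns "français".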

Verdy, it's not necessary to repeat everything you've already said.

*** Bug 67241 has been marked as a duplicate of this bug. ***

From bug 67241:

It throws a fatal exception of type MWException when the second parameter contains certain special characters.

These are the characters I've tested so far that produce the exception:

< > ' " : [ ] ( ) / &

Example input that produces the error:

{{#language:es|<}}

Crashes just doing a preview on WMF wikis.

I've installed CLDR extension on current master, and it doesn't throw that error for me. I've tried without caching, with memcache and with database cache, but I was unable to reproduce the problem.

This seems to be happening on WMF because of some combination of other extensions or configuration options

2014-06-29 21:18:24 mw1036 mediawikiwiki: [5f513816] /wiki/Special:ExpandTemplates Exception from line 182 of /usr/local/apache/common-local/php-1.24wmf11/languages/Language.php: Invalid language code ">"
#0 /usr/local/apache/common-local/php-1.24wmf11/languages/Language.php(161): Language::newFromCode('>')
#1 /usr/local/apache/common-local/php-1.24wmf11/includes/Message.php(540): Language::factory('>')
#2 /usr/local/apache/common-local/php-1.24wmf11/extensions/Translate/TranslateHooks.php(349): Message->inLanguage('>')
#3 [internal function]: TranslateHooks::translateMessageDocumentationLanguage(Array, '>')
#4 /usr/local/apache/common-local/php-1.24wmf11/includes/Hooks.php(206): call_user_func_array('TranslateHooks:...', Array)
#5 /usr/local/apache/common-local/php-1.24wmf11/includes/GlobalFunctions.php(4038): Hooks::run('LanguageGetTran...', Array, NULL)
#6 /usr/local/apache/common-local/php-1.24wmf11/languages/Language.php(864): wfRunHooks('LanguageGetTran...', Array)
#7 /usr/local/apache/common-local/php-1.24wmf11/languages/Language.php(914): Language::fetchLanguageNames('>', 'all')
#8 /usr/local/apache/common-local/php-1.24wmf11/includes/parser/CoreParserFunctions.php(770): Language::fetchLanguageName('es', '>')
#9 [internal function]: CoreParserFunctions::language(Object(Parser), 'es', '>')
#10 /usr/local/apache/common-local/php-1.24wmf11/includes/parser/Parser.php(3713): call_user_func_array(Array, Array)
#11 /usr/local/apache/common-local/php-1.24wmf11/includes/parser/Parser.php(3431): Parser->callParserFunction(Object(PPFrame_DOM), '#language', Array)
#12 /usr/local/apache/common-local/php-1.24wmf11/includes/parser/Preprocessor_DOM.php(1175): Parser->braceSubstitution(Array, Object(PPFrame_DOM))
#13 /usr/local/apache/common-local/php-1.24wmf11/includes/parser/Parser.php(3241): PPFrame_DOM->expand(Object(PPNode_DOM), 0)
#14 /usr/local/apache/common-local/php-1.24wmf11/includes/parser/Parser.php(637): Parser->replaceVariables('{{#language:es|...', false)
#15 /usr/local/apache/common-local/php-1.24wmf11/includes/specials/SpecialExpandTemplates.php(89): Parser->preprocess('{{#language:es|...', Object(Title), Object(ParserOptions))
#16 /usr/local/apache/common-local/php-1.24wmf11/includes/specialpage/SpecialPage.php(382): SpecialExpandTemplates->execute(NULL)
#17 /usr/local/apache/common-local/php-1.24wmf11/includes/specialpage/SpecialPageFactory.php(510): SpecialPage->run(NULL)
#18 /usr/local/apache/common-local/php-1.24wmf11/includes/Wiki.php(288): SpecialPageFactory::executePath(Object(Title), Object(RequestContext))
#19 /usr/local/apache/common-local/php-1.24wmf11/includes/Wiki.php(603): MediaWiki->performRequest()
#20 /usr/local/apache/common-local/php-1.24wmf11/includes/Wiki.php(452): MediaWiki->main()
#21 /usr/local/apache/common-local/php-1.24wmf11/index.php(46): MediaWiki->run()
#22 /usr/local/apache/common-local/w/index.php(3): require('/usr/local/apac...')
#23 {main}
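Reading the trace, the exception is raised when a Translate hook, called from Language::fetchLanguageNames, builds a language object from the unvalidated second parameter. The following Python sketch shows the general shape of a guard that would avoid this. It is an illustration under assumptions, not the merged patch: every name, the validity pattern, and the sample data are hypothetical.

```python
import re

# Assumed validity rule for this sketch; the real check lives in PHP.
VALID_CODE = re.compile(r"[a-z0-9-]+")


class InvalidLanguageCode(Exception):
    pass


def strict_factory(code):
    """Stand-in for a constructor that throws on a bad code."""
    if not VALID_CODE.fullmatch(code):
        raise InvalidLanguageCode(f'Invalid language code "{code}"')
    return code


def translation_hook(names, in_language):
    """Stand-in for an extension hook that builds a language object."""
    strict_factory(in_language)  # raises for '>'
    names["qqq"] = "Message documentation"


def fetch_language_names(in_language, hooks, guard=True):
    """Collect language names, optionally discarding an invalid target first."""
    names = {"es": "español"}
    if guard and in_language is not None and not VALID_CODE.fullmatch(in_language):
        in_language = None  # treat as "no target language": autonyms only
    if in_language is not None:
        for hook in hooks:
            hook(names, in_language)
    return names
```

With the guard enabled, a target of ">" never reaches the hook, so only autonyms are returned; with `guard=False` the sketch reproduces the uncaught exception. This is also consistent with the failure only appearing when such a hook is installed.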

Change 142974 had a related patch set uploaded by Brian Wolff:
Handle invalid language code gracefully in Language::fetchLanguageNames

https://gerrit.wikimedia.org/r/142974

Change 142974 merged by jenkins-bot:
Handle invalid language code gracefully in Language::fetchLanguageNames

https://gerrit.wikimedia.org/r/142974

Change 145409 had a related patch set uploaded by Martineznovo:
Handle invalid language code gracefully in Language::fetchLanguageNames

https://gerrit.wikimedia.org/r/145409

Backport to 1.23 in https://gerrit.wikimedia.org/r/#/c/145409/ still awaiting review, and Backport_to_stable flag set 5 weeks ago - someone please decide.

Change 145409 merged by jenkins-bot:
Handle invalid language code gracefully in Language::fetchLanguageNames

https://gerrit.wikimedia.org/r/145409

Backported to the REL1_23 branch.