
PAGESINCATEGORY should decode HTML entities of input - if {{PAGENAME}} contains ' or " it will display 0
Open, MediumPublic

Description

From the linked URL:


{{PAGESINCATEGORY:{{PAGENAME}}}} doesn't work if {{PAGENAME}} contains ' or " or other characters that are accepted in MediaWiki page names but are unexpectedly returned HTML-encoded (in that case it displays 0).

The cause of the problem is {{PAGENAME}}: it works when you replace it with the title in clear text.


This can be tested for https://www.mediawiki.org/wiki/Category:Chris_G%27s_botclasses

Indeed, if I use {{subst:PAGENAME}} and hit "show changes", I see it's being substituted as "Chris G&#39;s botclasses".

I don't know why {{subst:PAGENAME}} gives HTML-encoded entities as output, but that's odd. Since fixing this may break things, PAGESINCATEGORY should check for HTML entities and decode them before looking up the page name, just as was done in bug 35628.
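The mismatch can be illustrated with a minimal sketch, written in Python rather than MediaWiki's PHP. The category table, the count of 12, and both helper names are hypothetical; the only thing taken from this report is the assumption that {{PAGENAME}} returns the apostrophe as the numeric entity &#39;.

```python
import html

# Hypothetical category table: decoded title -> number of member pages.
CATEGORY_COUNTS = {"Chris G's botclasses": 12}


def pages_in_category_raw(name: str) -> int:
    """Mimics the reported behaviour: the parameter is looked up as-is."""
    return CATEGORY_COUNTS.get(name, 0)


def pages_in_category_decoded(name: str) -> int:
    """Proposed behaviour: HTML-decode the parameter before the lookup."""
    return CATEGORY_COUNTS.get(html.unescape(name), 0)


# What {{PAGENAME}} is assumed to return on the category page.
encoded_pagename = "Chris G&#39;s botclasses"

print(pages_in_category_raw(encoded_pagename))      # encoded name matches no title
print(pages_in_category_decoded(encoded_pagename))  # decoded name matches the real title
```

Decoding an already-decoded title leaves it unchanged here, so existing calls that pass clear-text titles would behave as before in this sketch.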


This bug was already reported several years ago (long before Phabricator emerged) and never corrected. It has caused various bugs, notably with templates used to detect redirected categories or other tracking categories: PAGESINCATEGORY is used to count the number of pages remaining in those tracked categories and recategorize them appropriately depending on their state (this is used for maintenance).

For now the workaround is to pass the result of {{PAGENAME}} (or {{SUBPAGENAME}} and similar) as the only parameter of {{#titleparts: }} in order to HTML-decode those entities before passing the result to {{PAGESINCATEGORY: }}. This has its own issues: sometimes a category name (or subpage name) also starts with an existing namespace name, and {{#titleparts: }} will transform such titles by "canonicalizing" them, for example by replacing namespace names, substituting some initial sequences with . and .., or changing the capitalization.

There is currently no other builtin MediaWiki function that HTML-decodes a string without performing other transforms. Before using PAGESINCATEGORY, we must first make sure that we are effectively in the Category namespace, then use {{FULLPAGENAME}} before HTML-decoding it, and finally drop the initial namespace prefix, which {{PAGESINCATEGORY:}} rejects in its parameter even if it is "Category:". However, there is no simple way in MediaWiki to drop the namespace without once again using the PAGENAME parser function, which reconverts the HTML-decoded characters to their HTML-encoded form.

The simplest solution is then to change the PAGESINCATEGORY parser function to HTML-decode its parameter. Another solution would be to fix the PAGENAME parser function so that it never HTML-encodes its returned value, something it should never have done; but it has behaved this way for so many years, in so many MediaWiki versions, that it will be difficult to reverse. Fixing PAGESINCATEGORY will be much simpler, simply because no valid MediaWiki category name can contain literal ampersands.

In summary, fix {{PAGESINCATEGORY:}} to process its parameter exactly like what is done for the {{#ifeq:}} or {{#switch:}} parser functions.

  • If after tests this solution causes compatibility problems, then provide a new parser function that HTML-decodes an input parameter, so that we can safely fix the templates that need to use {{PAGESINCATEGORY:{{PAGENAME}}}} or {{PAGESINCATEGORY:{{PAGENAME: Some category name}}}}.

For the list of characters to decode, look at the documentation of the {{PAGENAMEE}} encoding on mediawiki.org, which compares the various encodings in use (and has documented this issue for several years):

https://www.mediawiki.org/wiki/Manual:PAGENAMEE_encoding


Version: 1.24rc
Severity: normal
URL: https://www.mediawiki.org/wiki/Thread:Project:Support_desk/0_doesnt_work_%28allways%29
See Also: T37628: #switch or #ifeq: checks should first HTML-unescape the strings they compare

Details

Reference
bz67196

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 3:29 AM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz67196.
bzimport added a subscriber: Unknown Object (MLST).

The problem is not limited to PAGESINCATEGORY: all other parser functions that take a title have this problem; see also bug 16474.

There's a workaround which is to redecode the parameter of PAGESINCATEGORY with #titleparts.

But I'm still convinced that we should not have to use this trick in wikicode, given that there should not exist any valid category name containing verbatim character entities. It is still possible that they exist, because we have allowed literal ampersands in page names without requiring them to be HTML-encoded with named entities, so this causes an ambiguity. But I'm not convinced that we have any valid page name containing verbatim named entities; and note that it's impossible to include verbatim sharp signs "#", so you cannot include verbatim numeric entities.

So what we could do is HTML-encode quotes, ampersands, and less-than/greater-than signs using numeric entities instead of the named entities for these characters (" ' < >), so that they can safely be HTML-decoded by PAGESINCATEGORY (which would continue to treat named entities as verbatim text, without decoding them automatically as it would numeric entities).
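The numeric-only scheme proposed above can be sketched as follows. This is an illustration in Python, not MediaWiki code: the function names and the escape table are hypothetical, and the choice to leave surrogate and out-of-range references untouched is an assumption.

```python
import re

# Characters the proposal would emit as numeric entities (hypothetical table).
NUMERIC_ESCAPES = {
    '"': "&#34;",
    "'": "&#39;",
    "&": "&#38;",
    "<": "&#60;",
    ">": "&#62;",
}

# Matches decimal (&#39;) or hexadecimal (&#x27;) references only;
# named references such as &amp; are deliberately not matched.
NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def encode_numeric(title: str) -> str:
    """Encode the special characters of a title as numeric entities."""
    return "".join(NUMERIC_ESCAPES.get(ch, ch) for ch in title)


def decode_numeric_only(text: str) -> str:
    """Decode numeric entities in one pass; named entities stay verbatim."""
    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Assumption: NUL, surrogates and out-of-range values are left as-is.
        if code == 0 or 0xD800 <= code <= 0xDFFF or code > 0x10FFFF:
            return match.group(0)
        return chr(code)

    return NUMERIC_ENTITY.sub(replace, text)


title = "Tom & Jerry's \"best\" <clips>"
print(encode_numeric(title))
print(decode_numeric_only(encode_numeric(title)) == title)
```

Because the substitution is a single pass, a title that literally contains the text of a named entity survives the round trip unchanged, which is the ambiguity this scheme is meant to avoid.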

Change 145724 had a related patch set uploaded by Brian Wolff:
Have Title::makeTitleSafe decode html entities.

https://gerrit.wikimedia.org/r/145724

Aklapper renamed this task from PAGESINCATEGORY should decode HTML entities of input - if {{PAGENAME}} contains ' it will display 0 to PAGESINCATEGORY should decode HTML entities of input - if {{PAGENAME}} contains ' or " it will display 0. Jun 12 2015, 3:42 PM
Aklapper set Security to None.
Verdy_p updated the task description.

As a workaround, you can use a Lua module to decode the HTML.
https://en.wikipedia.org/wiki/Module:HTMLDecode

And then use {{PAGESINCATEGORY:{{#invoke:HTMLDecode | HTMLDecode | text={{PAGENAME}} }}}}

Using a wiki-dependent Lua module is not a good workaround (Lua modules are also not deployed on all wikis), and it is slower and more costly than using the builtin parser function {{#titleparts:}} to decode these named or numbered character entities (&amp;, &lt;, &gt;, &quot;, &nbsp;, and so on) when they are valid (so a & that stands alone, or that lacks the correct syntax and the required trailing semicolon, must remain intact).

For a character entity to be valid and replaceable during the decoding of the input, there should be nothing between & and ; other than ASCII letters, hyphens, and digits for named character entities, or, for numbered character entities, a leading # followed by ASCII decimal digits, or a leading # followed by an x and hexadecimal digits. The named character entities *must* belong to the restricted set mandated by the HTML5 standard (not every name defined in various non-standard HTML or SGML dialects). At a minimum, the entities generated in the output of {{PAGENAME}} must be decoded, plus any valid numeric character entity (for every character which is valid in HTML5, excluding most C0 and C1 controls).
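The validity rule described above can be sketched in Python as a regular expression plus a lookup in the standard library's HTML5 entity table. The function names are hypothetical, and the exact set of permitted control characters (tab, line feed, and carriage return only) is an assumption, since the rule above says only "most" C0 and C1 controls are excluded.

```python
import re
from html.entities import html5  # HTML5 named references, e.g. "amp;" -> "&"

# Candidate references: "&name;", "&#123;" or "&#x1F;".
# The trailing semicolon is mandatory, so a bare "&" or "&amp" never matches.
CANDIDATE = re.compile(r"&(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z0-9-]+);")


def is_valid_code_point(code: int) -> bool:
    """Assumed policy: allow tab/LF/CR, reject other C0/C1 controls and DEL."""
    if code in (0x09, 0x0A, 0x0D):
        return True
    if code < 0x20 or 0x7F <= code <= 0x9F:
        return False
    if 0xD800 <= code <= 0xDFFF or code > 0x10FFFF:
        return False
    return True


def decode_valid_entities(text: str) -> str:
    """Replace only well-formed, known entities; leave everything else intact."""
    def replace(match: re.Match) -> str:
        body = match.group(1)
        if body.startswith("#"):
            digits = body[1:]
            code = int(digits[1:], 16) if digits[0] in "xX" else int(digits)
            return chr(code) if is_valid_code_point(code) else match.group(0)
        # Named reference: decode only names listed in the HTML5 table.
        return html5.get(body + ";", match.group(0))

    return CANDIDATE.sub(replace, text)


print(decode_valid_entities("Chris G&#39;s botclasses"))
print(decode_valid_entities("AT&T, a &amp b, &bogus;"))
```

The second call prints its input unchanged: none of its ampersands starts a well-formed reference to a known entity.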

After this replacement of valid character entities, the standard HTML5 compression and stripping of whitespace may be applied (as defined in CSS and used by default for rendering most HTML5 elements, unless these elements are styled explicitly). Note, however, that such compression and stripping is not part of HTML5 parsing but only a question of style in the renderer: whitespace is not compressed or stripped in the HTML5 DOM created by the HTML5 parser, and is left intact in every text element and every attribute value.

Whitespace and HTML comments between < and >, surrounding element names, attribute names, or the = sign between an attribute name and its value, are discarded from the HTML5 DOM, so stripping/compression is not relevant for them. They are only visible in some non-HTML parsers, such as the XML and SGML parsers used by some debugging tools, editors, or data processors working in advanced mode, which can detect them as unfiltered lexical items to be treated specially in an early parsing step, like the document declaration and embedded DTD. Such special parsing exists in HTML5, but it is an internal preparation/adaptation step that occurs just after decoding any transport syntax (such as Base64/Quoted-Printable encapsulation, or MIME and HTTP multipart attachments) and after decoding the binary character encoding, which may be sniffed from the first bytes if it is not provided by meta information from the transport layer or when instantiating the parser itself.

So the best workaround is still:

{{PAGESINCATEGORY:{{#titleparts: {{PAGENAME<!--: Optional page name here from some parameter or template-generated-->}} }}}}

because {{#titleparts: ...}} already parses and decodes the character entities in its own input parameter. By normalizing its parameter to a canonical full page name, it also strips the extra leading colons and whitespace before a full page name (which can also break PAGESINCATEGORY) and normalizes the whitespace/underscores in page names, including category page names. The Lua module you suggest still does not do that.

Ideally, {{PAGESINCATEGORY: ...}} should really parse and normalize its parameter before using it, just as {{#titleparts: ...}} or {{#ifexist: ...}} do. If this is done, none of the workarounds above are needed any longer, and there's no need to change the behavior of the {{...PAGENAME[:...]}} functions, which can keep the HTML-encoding of their output.

It could also potentially accept a parameter containing the full page name of the category, including one of its known namespace prefixes: it would check whether the namespace is the category namespace and, if so, drop it along with the whitespace that follows, but otherwise preserve the namespace, for example when we want to check the number of pages in "Category: User: Someone". However, this assumes that there should not exist any category like Category: Category: Some name.
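The prefix handling suggested above might look like the following Python sketch. The function name is hypothetical, and the namespace check is deliberately simplified: a real wiki also has localized namespace names and aliases, which this sketch ignores.

```python
# Simplified assumption: only the canonical English namespace name is known.
CATEGORY_NAMESPACE = "category"


def category_parameter(name: str) -> str:
    """Return the category title to count, dropping one leading 'Category:'."""
    # Normalize underscores, then strip leading colons and outer whitespace.
    cleaned = name.replace("_", " ").lstrip(": ").strip()
    prefix, separator, rest = cleaned.partition(":")
    if separator and prefix.strip().lower() == CATEGORY_NAMESPACE:
        # Drop the category namespace together with the whitespace after it.
        return rest.strip()
    # Any other prefix (or none at all) is preserved as part of the title.
    return cleaned


print(category_parameter("Category: User: Someone"))
print(category_parameter(":Category:Chris_G's_botclasses"))
print(category_parameter("User: Someone"))
```

Only one prefix is removed, so "Category: Category: Some name" would be counted as the category "Category: Some name", which matches the assumption stated above that no such doubled category exists.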

This preparation of the input parameter (parsing and normalization) is also done by {{PAGENAME: <!--Some value here...-->}} and the related parser functions {{BASEPAGENAME: ...}} or {{FULLPAGENAME: ...}}, except that these {{...PAGENAME: ...}} functions unexpectedly re-encode their return value using HTML entities (and this is the origin of this bug), while {{#titleparts: ...}} correctly keeps the return value unencoded.

Only the {{PAGESINCATEGORY: ...}} function unexpectedly fails to prepare its input parameter correctly before using it (and this is the subject of this bug).

This comment was removed by Verdy_p.