Scribunto/Lua should have a built-in method for retrieving categories used on a page
Open, Low, Public, Feature

Description

It would be helpful to have a built-in method in Scribunto/Lua that allows retrieving a list of the categories used on a page.

For example, https://en.wikipedia.org/wiki/Einar_Schleef would return a list that includes:

  • German dramatists and playwrights
  • 1944 births
  • 2001 deaths

As these categories are included on this page.


Version: unspecified
Severity: enhancement
See Also: T20596: parser function to detect if the current page is in a given category

Event Timeline

bzimport raised the priority of this task from to Low. Nov 22 2014, 1:44 AM
bzimport added a project: Scribunto.
bzimport set Reference to bz48175.
bzimport added a subscriber: Unknown Object (MLST).

I'd just like to make the explicit suggestion that the categories be exposed on title objects (see http://www.mediawiki.org/wiki/Extension:Scribunto/Lua_reference_manual#Title_objects ), as they contain other information about a given page.

*** Bug 59111 has been marked as a duplicate of this bug. ***

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 11:13 AM

Change 919459 had a related patch set uploaded (by SD0001; author: SD0001):

[mediawiki/extensions/Scribunto@master] Expose page categories to Lua

https://gerrit.wikimedia.org/r/919459

This change strikes me as enabling communication between separate templates, something which has been a no-no in this extension.

It doesn't enable communication between templates any more than getContent() already allows. Besides, it's often useful - such as for citation modules to decide if dates should be formatted as dmy or mdy.

It doesn't enable communication between templates any more than getContent() already allows.

getContent does not expand templates. I'm pretty sure the PHP implementation of getCategories

$page = MediaWikiServices::getInstance()->getWikiPageFactory()->newFromTitle( $title );
...
$categoryTitles = $page->getCategories();

gets all categories, which may be in templates. So I am pretty sure your statement is not a true statement.

Besides, it's often useful - such as for citation modules to decide if dates should be formatted as dmy or mdy.

I am not concerned about its utility though -- I think the utility is pretty clear, enabling one path of cross-communication between templates, exactly as you say. I am instead raising a point that previous maintainers of Scribunto have drawn a line in the sand about, and which may have particular implications regarding this extension's support of Parsoid. The maintainers (whoever that may be since Anomie stopped participating in MediaWiki development) should decide whether it's a legitimate objection or not and whether I've identified the issue correctly or not, though if you don't think that's what those two lines are doing, I'm happy to be informed otherwise.

I don't think it falls afoul of that as long as it reflects DB state (the current contents of categorylinks) and not the page currently being parsed.

Although maybe it shouldn't work for the current page, as that would be confusing.

I see a slightly different problem – this feature may produce "unstable" parses. Consider the scenario:

  • Page A has a template that adds category X if another page B has a category Y
  • Page B has a template that adds category Y if page A does not have the category X

Every time one of these pages is parsed, the state of its category will flip. This seems bad. As one side effect of this, I think everyone watching the pages will see a message about category changes on their watchlist. There may be other bad side effects.
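The flip-flop above can be reproduced in a toy model. This is only a sketch: the `stored` dictionary stands in for whatever category state a lookup would read, and `parse_a`, `parse_b` and `simulate` are hypothetical names, not MediaWiki or Scribunto code. It assumes each parse sees the other page's categories as last stored, and that the two pages are reparsed alternately.

```python
# Toy model of the scenario above. Each parse recomputes a page's
# categories from the *stored* categories of the other page.

def parse_a(stored):
    # Page A adds category X if page B has category Y.
    return {"X"} if "Y" in stored["B"] else set()


def parse_b(stored):
    # Page B adds category Y if page A does NOT have category X.
    return {"Y"} if "X" not in stored["A"] else set()


def simulate(rounds):
    """Reparse B, then A, for the given number of rounds."""
    stored = {"A": set(), "B": set()}
    history = []
    for _ in range(rounds):
        stored["B"] = parse_b(stored)
        stored["A"] = parse_a(stored)
        history.append((sorted(stored["A"]), sorted(stored["B"])))
    return history


history = simulate(4)
for step, (cats_a, cats_b) in enumerate(history, 1):
    print(step, "A:", cats_a, "B:", cats_b)
```

In this model the pair never settles: every round undoes the result of the previous one, which is the kind of instability described above.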

Even when it doesn't produce an infinite loop like this, you may end up with pages that may need to be parsed multiple times until they reach a stable state. This actually seems even worse. Imagine a vandalism that requires multiple pages to be purged in exactly correct order to be reverted.

I don't know if that's necessarily a deal breaker, but I know that I wouldn't want to be responsible for this :)

You can definitely already do this - e.g. make the category used depend on the current time or something. It does seem like this might make the problem worse though.

Every time one of these pages is parsed, the state of its category will flip. This seems bad. As one side effect of this, I think everyone watching the pages will see a message about category changes on their watchlist.

... which should motivate one of those users to fix up the ungodly mess! This should only happen when the parse is a result of an edit though, as LinksUpdate isn't run otherwise.

There are also opportunistic link updates if the page is "dynamic", but yes, it is limited.

getContent does not expand templates. I'm pretty sure the PHP implementation of getCategories

frame:expandTemplate allows this though. In fact, just using e.g. {{#invoke:string|find|{{:{{FULLPAGENAME}}}}|Category:X1}}, you can check if the categories of the current (or any) page include X1 without even using Lua to get the page content. So this actually adds zero additional functionality (it just makes things faster, since transcluding the page is very slow to parse) and I don't think it actually needs any restrictions. The infinite loops mentioned should already be possible through this method.

Did you test these? Unless you have countermeasures (is it possible at all?), or unless {{FULLPAGENAME}} isn’t actually the title of the current page (which is the case e.g. in edit notices), both frame:expandTemplate and {{:{{FULLPAGENAME}}}} would definitely cause infinite template recursion, not only potentially cause unstable parses.

Yes, I tested it. It works in editnotices; it did cause a template loop on the page itself yeah. But the main complaint from Izno seems to be about cross-template communication, not about accessing info about the current page; and this definitely works if the page given is not the current page.

Although @Izno only mentions “cross-communication between templates”, I’m pretty sure he meant specifically templates transcluded on a page accessing output of templates transcluded at other parts of the same page – otherwise, it’s just a simple template transclusion, which was, is, and is going to be supported, of course. Templates transcluded on a page accessing output of templates transcluded at other parts of the same page is where issues come up: previews of parts of the page (e.g. sections or – in case of VisualEditor and other Parsoid-based things – individual templates) may be different from what a full-page parse produces, and parallel parsing of different parts of the page may cause race conditions.

I can also see, to some extent, the issue that Matmarex has put forth. It is the classical paradox of self-reference. If the page populates Category:Foo when it is not in Category:Bar, and the page populates Category:Bar when it is in Category:Foo, then which category would be populated? It would require us to know the final execution state in order to further populate. Here is a solution: only fill getCategories() for the page all the way up to the point of execution.

But then for two pages, we have the problem of timing. The order in which categories are populated would then depend on when the page was parsed. I can see how it can lead to "unstable parses" regarding categories, but then we can also get that in the following scenario. Suppose we have the page "Foo" check to see if "Foobar" is in the page "Bar" and if so does nothing, and if not writes "Foobar". Now say "Bar" does the same for "Foo". Now the order of execution determines which page has "Foobar" and which one does not.

There might actually be a better solution than this. If a module accesses category information, just prohibit population of categories by that module period. That is, if the module is like the first two scenarios, it would just do nothing and maybe add the module to a special page. This is something that can be checked after the module is done executing. For now, this might be a good solution until some way to detect these two scenarios is reliably determined. It would allow for category editnotices to work while preventing the two scenarios above from arising.

There might actually be a better solution than this. If a module accesses category information, just prohibit population of categories by that module period. That is, if the module is like the first two scenarios, it would just do nothing and maybe add the module to a special page. This is something that can be checked after the module is done executing. For now, this might be a good solution until some way to detect these two scenarios is reliably determined. It would allow for category editnotices to work while preventing the two scenarios above from arising.

Categories are added by literal category links in the wikitext emitted by a module, unfortunately. Disabling categories might be fairly easy to do for any invoked module function that generates a complete category link that goes straight into the wikitext of the article, because it could be as easy as a search and replace in the parser code that receives the output of the module function, but a module can be used to generate only part of a category link, like [[Category:{{#invoke:example|emit_category_that_page_is_not_in}}]]. There are also often multiple layers of expansion past the module invocation, for instance when a template contains {{#invoke:module name|function name|...}}. Text generated by a module invocation could also be put into a template that then puts the text in a category link.

I'm not familiar with the parsing infrastructure, but I imagine that to solve some of these problems, the parser would have to track which "contaminated" text ultimately had something to do with invocations of modules that looked up category information for the current page, and in the final expanded text disable any category links that contain any of this contaminated text. That could be quite an invasive change to the parser because of all the levels of expansion that wikitext goes through. There would also be false positives, because a template or module can take in contaminated text but totally ignore it and just generate the same thing whether the contaminated text is there or not. This change could lead to annoying results: suppose a module that is widely used suddenly starts looking up categories when it doesn't need to; then any templates or modules that depend on the module would stop generating category links even though they didn't do anything wrong with the category information. Then editors would have to look through the dependency tree and figure out which module was the culprit for the suddenly disappearing category links.
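To make the "contaminated text" idea concrete, here is a rough sketch of conservative taint tracking. Everything in it is hypothetical: `Text`, `expand` and `emit_categories` are illustrative names, and the real parser has no such mechanism. It assumes the simplest possible rule, namely that the output of an expansion is tainted whenever any of its inputs was.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Text:
    """A piece of wikitext plus a flag saying whether it is 'contaminated'."""
    value: str
    tainted: bool = False  # True if derived from a category lookup


def expand(template, *args):
    """Conservative expansion: the output is tainted if any input was."""
    result = template(*[arg.value for arg in args])
    return Text(result, any(arg.tainted for arg in args))


def emit_categories(text):
    """Return category names, but drop all of them for tainted text."""
    if text.tainted:
        return []
    return re.findall(r"\[\[Category:([^\]|]+)", text.value)


# Output of a hypothetical module that looked up category information.
lookup_result = Text("yes", tainted=True)

# A template that ignores its argument and always emits the same link.
ignores_input = lambda _arg: "[[Category:Stubs]]"

clean = emit_categories(Text("[[Category:Physics]]"))
false_positive = emit_categories(expand(ignores_input, lookup_result))
print(clean, false_positive)
```

The second call shows the false positive mentioned above: the template never used the contaminated input, yet its category link is suppressed anyway.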

Here is a solution: only fill getCategories() for the page all the way up to the point of execution.

The problem is that the page may not be parsed in the same order each time, so this “all the way up to the point of execution” may change from parse to parse, without any edits.

then we can also get that in the following scenario. Suppose we have the page "Foo" check to see if "Foobar" is in the page "Bar" and if so does nothing, and if not writes "Foobar". Now say "Bar" does the same for "Foo". Now the order of execution determines which page has "Foobar" and which one does not.

Actually, not:

  • mw.title:getContent() returns the raw wikitext, without any parsing. This either contains or doesn’t contain Foobar; this fact can be changed only through an edit.
  • frame:expandTemplate() uses the parser to get the preprocessed content, and the parser errors out, adding the page to Category:Pages with template loops, when it notices this kind of recursion.

I like the idea of forbidding Scribunto output from adding new categories once categories have been read, but the implementation of that could be fraught. Parsoid will parse the scribunto code as a separate "top level document" so in theory we could zero-out ParserOutput::$mCategories() after the Scribunto code has returned. This still works even if the categories were placed in ParserOutput by parsing [[Category:Foo]] as wikitext. Seems like a pretty big hammer, though, for a problem we'd hope wouldn't arise in practice.

The proposed patch reads the categories from the db, sidestepping any "cross-communication" concerns. The tradeoff is that the categories reflect the ones produced by the last parsed version of the page, so if someone adds some categories and invokes this function on the current page in the same edit, the page is temporarily saved with Lua returning an outdated result, which gets fixed on the next purge or edit[0].
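A toy model of that tradeoff, for illustration only. `ToyWiki`, its methods and `render` are hypothetical names; the dictionary merely stands in for the categorylinks table, and the assumption that a purge reparses without rewriting the stored rows is taken from the discussion in this task rather than verified against MediaWiki.

```python
class ToyWiki:
    """Toy stand-in for a wiki whose category table lags one save behind."""

    def __init__(self):
        # Stand-in for the categorylinks table: title -> categories
        # recorded by the most recent save.
        self.categorylinks = {}

    def get_categories(self, title):
        # Hypothetical accessor: reads stored rows, never parses the page.
        return sorted(self.categorylinks.get(title, set()))

    def save(self, title, render):
        # The render runs against the table as it was before this edit.
        emitted, lua_saw = render(self, title)
        # Only afterwards are the stored rows replaced.
        self.categorylinks[title] = set(emitted)
        return lua_saw

    def purge(self, title, render):
        # Assumption from the thread: a purge reparses the page but
        # does not rewrite the stored rows.
        _emitted, lua_saw = render(self, title)
        return lua_saw


def render(wiki, title):
    # The page emits Category:Science and also reports what a category
    # lookup on itself returned during this parse.
    return {"Science"}, wiki.get_categories(title)


wiki = ToyWiki()
on_save = wiki.save("Page", render)
on_purge = wiki.purge("Page", render)
print(on_save, on_purge)
```

In this model the lookup made during the save sees no categories, while the same lookup after a purge sees the one added by that save.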

However, per @aaron on gerrit:

It's strange for this to use the categories of the previous parsed revision. It's also not well defined to use the currently-being-parsed revision either.

It's one thing to depend on page existence, current revision text, or current revision metadata, but depending on the parsed current revision of another page (let alone the same page) to generate the parsed output feels overly "higher-order".

This is something which does seem unprecedented in MediaWiki, although I'm not certain if it's a bad thing. As it stands, there are a lot of Scribunto feature requests where the only straightforward implementation is to use db data generated by the last parsed revision.

| Feature | Ticket | Workaround |
| ----- | ----- | ----- |
| Reading categories (this ticket) | T50175 | Module:Mainspace_editnotice, function blp_notice |
| Reading local short description | T216356 | Module:GetShortDescription |
| Checking if page is a disambiguation page | T71441 | Module:Disambiguation |

Current workarounds generally try processing the raw getContent(), which is worse.
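As a rough illustration of why scanning raw wikitext is a weak substitute, the sketch below extracts category links with a regular expression. The pattern, the function name and the sample text are all made up for this example, and `{{Infobox person}}` is simply assumed to add a category once expanded; the listed modules may work quite differently.

```python
import re

# Matches literal [[Category:Name]] and [[Category:Name|sort key]] links.
CATEGORY_LINK = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def categories_in_raw_wikitext(wikitext):
    """Return the names of category links written literally in the text."""
    return CATEGORY_LINK.findall(wikitext)


sample = (
    "{{Infobox person}}\n"  # assumed to add a category when expanded
    "Some article text.\n"
    "[[Category:1944 births]]\n"
    "[[category:2001 deaths|Schleef]]\n"
)

found = categories_in_raw_wikitext(sample)
print(found)
```

Only the two literal links are found. Anything a template or module would add during expansion is invisible to this approach, which a lookup against stored category data would not miss.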


[0]: Even this minor annoyance is not unfixable. I earlier tried fixing it by using VARY_REVISION output flag. This doesn't quite work as it causes a re-purge before the link tables are updated. Instead, we could use a custom parser output flag which triggers the LinksUpdateComplete hook to re-purge once the link table is up to date.

@SD0001 IMHO I think any effort to expose categories to Lua is better than nothing at all.

A more robust editnotice load module I worked on at testwiki, https://test.wikipedia.org/wiki/Module:Editnotice_load, uses some hackery to get the categories from the (preprocessed) wikitext. It takes forever to load on large pages, like on this fork of Taylor Swift from enwiki: https://test.wikipedia.org/w/index.php?title=Taylor_Swift&action=edit

I don't think the result is even cached when used in interface messages, is it...

I am also going to backtrack from what I was saying about the template loop issue. Instead of forbidding categorization... why not just add the page to the Category:Pages with template loops category?

The proposed patch reads the categories from the db, sidestepping any "cross-communication" concerns. The tradeoff is that the categories reflect the ones produced by the last parsed version of the page, so if someone adds some categories and invokes this function on the current page in the same edit, the page is temporarily saved with Lua returning an outdated result, which gets fixed on the next purge or edit[0].

I view this as a feature, not a bug -- it breaks a possible cycle and avoids doing a full parse of the page just to extract categories.

Might we be able to get this on a test wiki so we can ensure that there aren't any problems with the patch?

Test wiki created on Patch demo by Novem Linguae using patch(es) linked to this task:
https://patchdemo.wmflabs.org/wikis/878c4bb105/w

Thank you. I just tested out the categories and specifically the undefined behavior which would result in a template loop for any other parser function.

https://patchdemo.wmflabs.org/wikis/878c4bb105/wiki/A says that the page is in A, even though it clearly is not from the category bar. This might be the "unstable parses" that matmarex was alluding to.

Can we maybe find a way to detect these template loops and throw an error in the Lua code if one occurs?

https://patchdemo.wmflabs.org/wikis/878c4bb105/wiki/A says that the page is in A, even though it clearly is not from the category bar. This might be the "unstable parses" that matmarex was alluding to.

I don't see an issue. By the logic you have, page should be categorised to Category:A if it wasn't already in it. Once this is done, the page is now in Category:A and is indicated as such.

I think we have established already that there are no "unstable parses". Purging the page any number of times doesn't change its categorisation.

There is very much an error with this. The page is not in A, so it gets put in A; but now that the page is in A, the parser function says that we are in A and does not add the category link. This is a less obvious self-reference loop. The page should detect that it is getting categorization information about itself, do a parse, and after recursing one level down say "template loop detected: A" and throw the page into "pages with template loops". Something similar to what happens when you construct an actual template loop like this:

image.png (716×1 px, 85 KB)

@Novem_Linguae I created a test account called "Test" on the test wiki you created on patch demo. I want to know if you can do a couple of things. First, can you enable proxying articles from en.wiki? And second, can you give the test account admin and interface-admin rights so I can test a specific use case (category-based editnotices) with the Lua categories?