Page MenuHomePhabricator

Sections generated by tag extension do not show up in TOC
Open, MediumPublicFeature

Description

Author: lord_farin

Description:
I have created an extension that uses a tag to generate new sections. This tag basically takes some parameters and displays as a section heading of the desired depth. The section content is listed as the inner HTML of the tag.

The problem that I am encountering is that the sections, so generated, do not show up in the table of contents (hereafter TOC).

Some relevant information is at http://www.mediawiki.org/wiki/Project:Support_desk#Sections_added_with_tag_extension_do_not_show_up_in_TOC_24342. A transcript:

"I have written a tag extension that allows for parameters to be given with a new section, which is processed by a section custom tag. I have dealt with the nesting issue bugzilla:1310 by implementing the suggested amendments to Preprocessor_DOM.php.

Everything works just fine, the sections are displayed with their correct header levels etc.etc.

There is just one problem. Namely, that the sections written using this section tag do not show up in the table of contents. Some var_dumps revealed to me that these headers do get their UNIQ modifiers. But apparently the TOC is incapable of taking these extra sections into account.

Please note that I am not looking for a solution that involves replacing the tag with a parser function; due to extensive presence of $\LaTeX$ the inability to use double closing braces anywhere is too crippling. Furthermore, entering whole sections as an argument to a parser function is obviously not a good and readable coding style.

Thank you for looking into this."


Version: 1.22.0

Details

Reference
bz45317

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 1:19 AM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz45317.
bzimport added a subscriber: Unknown Object (MLST).

Hi, without a minimal testcase it's hard to debug or to find out if there is a bug in the MediaWiki codebase, or in your extension. Also, against which MediaWiki version do you try this?

lord_farin wrote:

Code used for the tag extension

The literal PHP code used, for debugging purposes

Attached:

lord_farin wrote:

Minimal example exhibiting the issue

Attached:

lord_farin wrote:

The MediaWiki version used is the latest, 1.20.2.

In addition to the minimal example I added as an attachment, a live example is at http://www.proofwiki.org/wiki/User:Lord_Farin/Sandbox/Help:Wiki_Editing.

lord_farin wrote:

Is there anything else that I can do to speed up the debugging process? The issue is reasonably urgent to ProofWiki because it limits the deployment of the extension, which is critical for proper execution of some long-overdue maintenance projects.

(This may be considered a bump.)

I am not sure. Given your pretty deep analysis I think it requires attention from somebody much better versed in parser code. I've seen similar issues with counting sections for a longer time, so I am not sure if this is something easy to fix without breaking some other ways to handle wikitext; I guess it is also not a recent breakage (feel free to try if it ever worked properly :/).

lord_farin wrote:

I've performed an exhaustive search once again, throwing var_dumps wherever it seemed that something useful was going on.

I am led to conclude that the Parser->formatHeadings function is messing things up when fed with the third parameter $isMain set to false (at the very least, it's performing a lot of redundant operations every time a call to recursiveTagParse comes in).

I would like to ask someone who knows their way around this function to go through it and ensure that only the part of formatHeadings that is essential for recursiveTagParse (or any other things calling formatHeadings with $isMain set to false) is executed when said parameter is not set to true.

lord_farin wrote:

I've done further testing. There are two relevant conclusions:

1: Running formatHeadings with $isMain set to false is completely superfluous as far as I've been able to determine (using the shipped PHPUnit tests). If this is tested and investigated further, it can improve page load times (expensive but useless preg_match() calls are applied on the entire wiki page).

2: The sections that do not show up in TOC are hidden by a UNIQ identifier for the extension tag, which is not resolved until $this->mStripState->unstripGeneral() is called from the main parse() -- which is way after the TOC is determined using the main call to formatHeadings() in internalParse().

This state of affairs (which I suspect to tie in with the problems concerning section editing and templates) naturally means that some unstripping should be done before TOC is generated.

Therefore, my proposal is as follows:

To create a designated third unstrip type e.g. entitled "sections" (the first two being "nowiki" and "general", defined and used in StripState.php), which is called before determining TOC, expanding those UNIQ markers whose content may hide sections. Upon creation of a new extension, one can indicate whether or not it may contain new sections using a flag.

This course of action occurs to me as the least intrusive way to resolve this bug (and hopefully, related ones). Thoughts are welcomed.

Created attachment 13111
Modified version of minimal example that works around the issue.

Logically speaking, I would expect that $parser>recursiveTagParse to create headings like normal. On the other hand, that would probably screw things up with edit section links, if treated precisely as normal.

You can kind of work around this issue by making the headers you create be the actual html tags, and everything else you do be manually made strip items. (The section edit links seem to get borked with nested uses of the tag, but nested tag uses don't work anyways since the first closing tag closes the outer nesting, and not the inner nesting, so that's probably not that big a deal (So if what you want to do, is suround the section and all its contents in an extension tag, you may be out of luck for an entirely other reason). You could maybe partially work around the nestedness issue by regexing it out, and not passing all of it to recursiveTagParse. Anyhow, see attachment for crappy work around (To be clear, I still think this is definitely a bug, just showing one way to work around it).


To create a designated third unstrip type e.g. entitled "sections" (the first
two being "nowiki" and "general", defined and used in StripState.php), which is
called before determining TOC, expanding those UNIQ markers whose content may
hide sections. Upon creation of a new extension, one can indicate whether or
not it may contain new sections using a flag

I wonder if it would just make sense to call formatHeadings some point after unstripGeneral. I don't think anyone is really using general strip markers to hide headings from the TOC. (This thought is just an initial reaction. Would need to do more research to see if that's remotely sane).

Attached:

lord_farin wrote:

Having a need for nested tag use, I've already implemented an according patch in the DOM preprocessor. Accordingly, I've decided to try and follow up on your suggestion:

I wonder if it would just make sense to call formatHeadings some point after
unstripGeneral. I don't think anyone is really using general strip markers to
hide headings from the TOC. (This thought is just an initial reaction. Would
need to do more research to see if that's remotely sane).

I've just tried it, and it works -- the sections now appear in the ToC. This ties in with my observation that there is no actual use for the $isMain parameter to formatHeadings (the function just shouldn't be called if this parameter is false).

It also makes sense conceptually to create a ToC only after the content (or at least, the structure) of the entire page has been determined (but of course, before nowiki strip markers are replaced), and as such, it needn't be processed by recursiveTagParse/internalParse; a call in the main parse routine suffices. My attempt added the formatHeadings call directly after unstripGeneral.

Some testing does reveal, however, that moving the formatHeadings call spoils the edit section links (PHPUnit tests show that they do not get rendered any more).

This needs to be dealt with, but for our wiki it is not a problem since section-wise editing is discouraged anyway.

In the nearly three years since the last activity in this thread, does anyone know what the status of this is?

I have some time to spare so I would like to look into it again. The original problem with the ToC persists in our current wiki which uses MW version 1.25.3.

In the nearly three years since the last activity in this thread, does anyone know what the status of this is?

Thanks for your interest in this task! I'm afraid that nothing has happened in the last three years.
The next step might be to discuss T47317#492636 and trying to get more expertise involved (maybe via an email to wikitech-l?) to make a decision on acceptable approaches

I observed it, too, with older and a slightly newer version.

Although unaware of most details, I believe, toc creation,, if not disabled, ought to be moved closer to the end of the parse, to a place in the code where we know that there is nothing left that could be expaned to a toc entry.

@Lord_Farin, thanks for the ping on wikitech-l

I don't have as much time as I'd like to delve into this right now, but I have some thoughts that may or may not relate to the problem you're hoping to solve. Ping me later if this task seems to be stalled (again)

@RobLa-WMF Unfortunately, nothing happened in the past three weeks.

I found some more time to look into things, and it appears to me that the main issue here is the many-faceted monolith that is formatHeadings. While this name is innocuous enough, the PHPdoc already says:

This function accomplishes several tasks:

  1. Auto-number headings if that option is enabled
  2. Add an [edit] link to sections for users who have enabled the option and can edit the page
  3. Add a Table of contents on the top for users who have enabled the option
  4. Auto-anchor headings

However, in the code, there are also sections providing the Parser with the section structure of the document (complete with hooks upon section creation).

I didn't manage to get a good feeling for the entire ~400 lines of code that make up formatHeadings, but it seems to me that it is trying to be too many things at the same time, and this results in efficiency loss and unpredictable behaviour. Especially the line:

$root = $this->preprocessToDom( $origText );

indicates to me that something is wrong here. Reprocessing the entire DOM tree of the raw wikitext indicates that things are happening at the wrong level or wrong place in the sequence altogether.

In my opinion we should reconsider the handling of sections by the Parser in a thorough way, postponing the building of the ToC to the last possible moment. I'd be happy to help, but sadly I'm seriously out of my depth regarding the Parser handling and expected behaviour which need to be considered as part of a viable solution.

Does anyone else have any thoughts?

$root = $this->preprocessToDom( $origText );

indicates to me that something is wrong here. Reprocessing the entire DOM tree of the raw wikitext indicates that things are happening at the wrong level or wrong place in the sequence altogether.

This is part of the bigger problem that we're trying to tease apart in our discussion in T142803. A big difference between MediaWiki-PHP Wikitext (which I'll refer to as "Traditional-Wikitext" in this comment) and Parsoid-Wikitext is that Traditional-Wikitext relies on string manipulation and does string transclusion, whereas Parsoid-Wikitext parsing is done by building up a DOM.

Disclaimer: most of my knowledge of our parsing is secondhand. It may be worth reviving the conversation you started on wikitech-l.

$root = $this->preprocessToDom( $origText );

indicates to me that something is wrong here. Reprocessing the entire DOM tree of the raw wikitext indicates that things are happening at the wrong level or wrong place in the sequence altogether.

What's going on with that specifically is that it's taking the header markers extracted from $text and finding the corresponding node in the dom structure from which it can determine the "byte" (really codepoint) offset of each section within the original wikitext. This was added in rMW5097164dce83: Add 'index', 'fromtitle', and 'byteoffset' fields to ParserOutput::getSections… with the comment that it would be needed for a "NavigableTOC" extension, and seems to have been in use by at least a few bots or user scripts considering bugs have been reported for it over the years.

It might be possible to pass in an existing dom structure instead of reprocessing it, although the logistics of getting the dom from where it's generated into formatHeadings in all code paths might be complex.

I see you already found (in T47317#492649) that moving the formatHeadings call to after unstripGeneral wouldn't work out because unstripGeneral also removes the strip markers that formatHeadings uses to determine the section indexes. Adding an unstripSections might have to have the result after unstripping be the same as what @Bawolff did in T47317#492636 to fulfill whatever need is making you hide that content behind strip markers in the first place, making the actual unstripSections rather pointless.

As for the assertion that calling formatHeadings with $isMain = false is pointless, I do note it has some effects:

  • It transforms the raw <h#>...</h#> into the HTML with the anchors, the <span class="mw-headline">, the edit link, number (if auto-numbering is on), and so on. Although the edit link itself will have a wrong section number, and the auto-numbering and anchor deduplication will be local to the fragment.
  • The 'ParserSectionCreate' hook is called for every section, which can affect the return value of the method.
  • OutputPage::setTOCHTML() is called with whatever partial TOC would be generated for the non-main fragment. I don't know if any extension might be using that for some purpose.
  • A __TOC__ magic word inside the fragment forces that partial TOC to be embedded in the non-main fragment. That seems likely to be considered a bug, although the easy alternative would result in __TOC__ inside a fragment being effectively equivalent to __NOTOC__ which would probably also be considered a bug.

Parsoid's take on TOC creation is to do this as a final DOM processing step, so it can consistently pick up all headings it should. I added the Parsing-Team--ARCHIVED to this task, so that they can weigh in on what their view on addressing this in the PHP parser is.

@Anomie Thanks for your comment.

Adding an unstripSections might have to have the result after unstripping be the same as what @Bawolff did in T47317#492636 to fulfill whatever need is making you hide that content behind strip markers in the first place, making the actual unstripSections rather pointless.

I'm sorry for failing to understand this sentence. Are you implying that there is some way to prevent the result from the tag from being stripped, while still retaining fully-parsed wikitext output?

I have tried approaches similar to those described by Bawolff, but I ended up in one of these two situations:

  • not removing the strip tag (which is seemingly added automatically) and thus not updating TOC;
  • losing processing of whatever was inside $input to begin with, like internal links (if using 'markerType' => 'none').

See attached image for an example of the latter (one of the titles was also an internal link).

As for the assertion that calling formatHeadings with $isMain = false is pointless, I do note it has some effects:

  • It transforms the raw <h#>...</h#> into the HTML with the anchors, the <span class="mw-headline">, the edit link, number (if auto-numbering is on), and so on. Although the edit link itself will have a wrong section number, and the auto-numbering and anchor deduplication will be local to the fragment.
  • The 'ParserSectionCreate' hook is called for every section, which can affect the return value of the method.
  • OutputPage::setTOCHTML() is called with whatever partial TOC would be generated for the non-main fragment. I don't know if any extension might be using that for some purpose.
  • A __TOC__ magic word inside the fragment forces that partial TOC to be embedded in the non-main fragment. That seems likely to be considered a bug, although the easy alternative would result in __TOC__ inside a fragment being effectively equivalent to __NOTOC__ which would probably also be considered a bug.

This is an interesting list, showing that there are some nontrivial effects. However, each of the bullets mentioned doesn't strike me as intentional behaviour invented at design-time.

sections.PNG (629×831 px, 49 KB)

It has been two weeks since the last activity, and I would be delighted if someone could update me on the latest status. Also if this status is that nobody besides me cares for this behaviour of the TOC and the only way forward is to make a crappy localised workaround that I will need to maintain with every MediaWiki upgrade. (Although obviously the software developer/architect in me would not like that.)

Thanks in advance.

I'm afraid that this indeed is not a priority for anyone at the moment. :( But, if you were to work on a proper solution, I'm sure it would be appreciated by many. See also T114057 for a big list of other table-of-contents woes.

I think this is best tackled in Parsoid since one sane way to do it is as a post-processing pass once all content has been generated. It is a straightforward process there. Since we are anyway working to make Parsoid the default parser, this is best done after that point.

Aklapper changed the subtype of this task from "Task" to "Feature Request".Feb 4 2022, 12:24 PM
Aklapper removed a subscriber: GWicke.