Page MenuHomePhabricator

Recursive tags in extensions.
Open, LowestPublicFeature

Description

Author: andy753421

Description:
I made an extension to allow for easier discussions (an example is here http://moacad.com/wiki/index.php?title=Talk:Changelog) but in doing so I noticed that the current extractTags function in Parser.php would extract the tag starting at the beginning of the tag (<tag>) and would stop at the 'first' end tag (</tag>). For example if someone edited a page and included the text <tag><tag>foo</tag>bar</tag> it would think the tag was only <tag><tag>foo</tag> and would leave off the extra </tag> on the end.


Version: 1.4.x

Details

Event Timeline

bzimport raised the priority of this task from to Lowest.Nov 21 2014, 8:11 PM
bzimport set Reference to bz1310.
bzimport added a subscriber: Unknown Object (MLST).

andy753421 wrote:

patch for the Parser.php file, with line numbers

I'm not sure how to include the code for html comments with it so I just put
that separate with an if tag. If anyone would like to change it go ahead. It
may also be sloppy formating or code, this is my first patch so I don't know
much about the style it should be in.

attachment parser_patch.php ignored as obsolete

andy753421 wrote:

more recent patch

I think this should be more usefull. It replaces lines 234-285 in the 1.4beta4
version of Parser.php

attachment parser_patch.php ignored as obsolete

Not sure this would be desireable; may have side effects.

Anyway the patch is very out of date...

avarab wrote:

Not a patch, removing patch keyword.

shardsofmetal wrote:

Isn't this bug fixed in 1.9 and maybe earlier with Parser::recursiveTagParse()?

aki.ikgw wrote:

diff file from mediawiki 1.11.0

attachment Parser.php.diff ignored as obsolete

ssanbeg wrote:

*** Bug 11528 has been marked as a duplicate of this bug. ***

  • Bug 20350 has been marked as a duplicate of this bug. ***
  • Bug 21426 has been marked as a duplicate of this bug. ***

Removing need-review keyword. Patches are ancient and not very useful, so I've marked them obsolete.

sharon.dagan wrote:

This bug still exist in 1.16.2 - any plans to fix it?

Probably exists in 1.17 (about to be released), too. Can you check out a copy of HEAD from subversion to check?

svn checkout http://svn.wikimedia.org/svnroot/mediawiki/trunk/phase3 wiki

Also, if you're running into this, let us know the particulars of your case.

/me wonders if the proposed change in this bug is really the desired behaviour. Disallowing nested start <tag>'s seems sane to me.

(In reply to comment #13)

/me wonders if the proposed change in this bug is really the desired behaviour.
Disallowing nested start <tag>'s seems sane to me.

That depends on semantics and the code handling them.

sharon.dagan wrote:

Working with the latest code from trunk/phase3, as suggested.
My test case extension is a very basic tag hook:

File: Bug1310_TestCase.php

<?php

$wgHooks['ParserFirstCallInit'][] = 'onParserFirstCallInit';

function onParserFirstCallInit( &$parser ) {

$parser->setHook( 'foo', 'onTag' );
return true;

}

function onTag( $input, $args, $parser, $frame ) {

wfDebug( $input );

return 'xxx';
}

?>

in LocalSettings.php the extension is loaded the normal way.

The input for the test case is:
'<foo>Begin1... <foo>Begin2... ...End2</foo> ...End2</foo>'

The $input that gets into onTag() should be:
'Begin1... <foo>Begin2... ...End2</foo> ...End2'

However,
In wfDebug I get: 'Begin1... <foo>Begin2... ...End2'
And in the browser I get: 'xxx...End1</foo>'

sharon.dagan wrote:

OOPS! (why can't I edit my comment?)

The input for the test case is:
'<foo>Begin1... <foo>Begin2... ...End2</foo> ...End1</foo>'

The $input that gets into onTag() should be:
'Begin1... <foo>Begin2... ...End2</foo> ...End1'

However,
In wfDebug I get: 'Begin1... <foo>Begin2... ...End2'
And in the browser I get: 'xxx...End1</foo>'

*Bulk BZ Change: +Patch to open bugs with patches attached that are missing the keyword*

john wrote:

-patch, no non-obselete patches.

theom3ga wrote:

Same bug here. I'm developing an extension to semantically tag the document, so there are going to be recursive tags, and I'm getting this bug too, test case is the same as in Comment 15.

Any alternative to tag extensions for this?

Any alternative to tag extensions for this?

Well parser function style things perhaps ({{#foo:...}})

sharon.dagan wrote:

Allows extension tags to be nested

This patch allows tag extensions to be nested. Only the most outer tags are parsed, everything in between is passed to the callback.

Given the wiki text "<foo>123<foo>456</foo>789</foo>", foo's callback will be called with the text "123<foo>456</foo>789".

Attached:

Thanks for this patch! We've been in a code slush (not quite a freeze) for a few weeks so we're just getting around to looking at these.

We're also doing a lot of parser work, so I'm not sure how relevant this is, but I'll ask them to take a look.

wicke wrote:

From an implementation standpoint, simply matching up the closest start/end tag is definitely easier than building a stack to enable nested tag pairs. I am also not convinced that nested tag pairs would be a good UI design, as it seems to make the distinction of regular wiki content and input to an extension harder than necessary.

Could you present a compelling use case that demonstrates the need to use the same tags both to delimit the extension inputs and the input itself?

wicke wrote:

The last sentence should naturally end with *and in the input itself*. An edit button would be handy sometimes.

(In reply to comment #23)

From an implementation standpoint, simply matching up the closest start/end tag
is definitely easier than building a stack to enable nested tag pairs. I am
also not convinced that nested tag pairs would be a good UI design, as it seems
to make the distinction of regular wiki content and input to an extension
harder than necessary.

Could you present a compelling use case that demonstrates the need to use the
same tags both to delimit the extension inputs and the input itself?

There are two examples I could see where this may be wanted

  • <ref> tags so people could do nested ref stuff without {{#tag:ref hackery.
  • <source> tag's for when highlighting xml-ish things that have a <source> inside them (since they would usually have a closing source tag as well, but that's more like accidentally fixing an issue then actually fixing an issue).

But I also tend to agree that it may not be worth the effort.

That as a <source> fix sounds to me more like a hack to fix a non-issue to me. The <source> isn't written to explicitly do anything special with any <source> tags inside of it so that does not sound like the thing we should be aiming for. (And sounds like it would break if someone used <source> to document example arguments to the opening <source> tag)

Switching over to a complete even tag matching could change the behaviour of existing content -- ie: <foo><foo></foo> suddenly having different behaviour -- so I'd reject the patch we have on those grounds alone.

We probably also want to write a test to make sure that <nowiki><nowiki></nowiki> doesn't suddenly start turning everything after it into nowiki content when it was written expecting it to display a "<nowiki>" tag verbatim in the page for documentation.

I think that if we do implement recursive tags, it's going to have to be an explicit op-in by extensions feature. ie: Only tags with a specific option will be parsed recursively. We can enable it on <ref> but we may not want to enable it for <source>, and definitely don't want it on <nowiki>.

wicke wrote:

I also share Daniel's concerns about changing the behavior of existing content.

In the longer term, I think that it would be desirable to add a fully parsed input mode for extension tag contents. This could take the form of a token stream or a DOM fragment built from those. Extensions could choose between plain-text input and tokens, so this would be opt-in. This is also very close to how this is currently handled in the Parsoid parser (http://mediawiki.org/wiki/Parsoid), although there still are some issues to solve (e.g. an unclosed html comment in the extension content). Nested nowiki is covered by parser tests already, and works as expected.

Are there extensions that need nested extension tags, but otherwise unparsed input?

  • Bug 35173 has been marked as a duplicate of this bug. ***

This discussion has been stalled during more than two years. Has there been any change helping to its resolution in a way or another?

(In reply to Quim Gil from comment #29)

This discussion has been stalled during more than two years. Has there been
any change helping to its resolution in a way or another?

Had there been, this bug would likely already reflect such a change. I'm not sure what you're asking.

This was a deliberate design choice due to the fact that many extensions (math, timeline, score, etc.) do not use XML-style markup and have no concept of nesting.

This was a deliberate design choice due to the fact that many extensions (math, timeline, score, etc.) do not use XML-style markup and have no concept of nesting.

I don't understand this comment. The Math extension uses <math>, the EasyTimeline extension uses <timeline>, and the Score extension uses <score>. These three examples all seem to use XML-style markup instead of {{#parser functions}}. Could you please clarify?

From what I understand, the current behaviour is not to find the actual closing tag but the first closing tag encountered irrespective of whether another tag has been opened and/or closed since the original opening tag. I'd consider that to be incorrect behaviour which if fixed might allow extensions to handle nested tags if they so choose, for example if something like TemplateData elected to use XML markup instead of JSON.

XML markup is actually possible in the current system (or has been in the past) as Wikia's PortableInfobox demonstrates: https://github.com/Wikia/app/tree/dev/extensions/wikia/PortableInfobox Obviously it's not a help to the original issue, but it does show XML type extensions do exist and do use the extension tag system.

In T3310#1424173, @Qgil wrote:

Declined, then?

No, I don't think so. I don't think my reasoning was especially solid or universal. Nesting requires a more complicated parsing algorithm, and I was worried that it would change from O(N) to O(N^2), but it's already O(N^2) for an extension like Cite that uses recursiveTagParse() with #tag. You could have an O(N) algorithm that scans forward for the matching closing tag, accounting for nesting, and then the extension could do another O(N) pass to convert the string to a nested data structure, which would actually be a performance improvement for an extension like Cite. But the ideal arrangement for Cite from a performance perspective would be to allow the preprocessor to treat the inside of the <ref> tags as wikitext and to pass the preprocessor node object to the extension, like we do with parser functions.

The theory may have been that extension tag contents are not XML, but we still don't allow things like quoted </math> end tags embedded in LaTeX. We only allow quoted <math> start tags. So it doesn't seem like we're taking the theoretical high ground.

As for backwards compatibility, the Parsoid folks have tools for scanning wiki dumps for a regex, and we could do that to check for compatibility issues in practice. The b/c concern would be the presence of quoted XML start tags that are not meant to create nesting.

I don't understand this comment. The Math extension uses <math>, the EasyTimeline extension uses <timeline>, and the Score extension uses <score>. These three examples all seem to use XML-style markup instead of {{#parser functions}}. Could you please clarify?

They don't use XML-style markup internally, they use Latex, Timeline's DSL and Lilypond. XML-style markup is only used to delimit them.

I guess, with everyone seemingly indifferent and/or having moved on, it is safe to assume that this enhancement of the parser will not be implemented?

That is to say, if I want functionality like this without modifying MW core, I need to register just a collection of tags with the same callback function? E.g. ref, reff, refff, etc., and use the one after the other? This would work, but it seems grotesque from a DOM perspective.

If people are still willing to consider it, it appears to me that an intelligent nested lookahead algorithm would only add the overhead of extra preg_match calls (but only matching every character once) to the existing performance.

Should I assume it's not going to be possible or are there still people willing to make it happen?

Hi! I need this functionality for an extension I'm developing for Wikiversity. So I went ahead and submitted the patch created by sharon.dagan (see https://gerrit.wikimedia.org/r/#/c/349016/)

I updated the parser test for it and also made extensive local testing. Seems to work fine!

Change 349016 abandoned by Sophivorus:
Properly handle nested extension tags

Reason:
No more use case

https://gerrit.wikimedia.org/r/349016

Aklapper changed the subtype of this task from "Task" to "Feature Request".Feb 4 2022, 11:02 AM
Pppery closed this task as a duplicate of T120477: Nested tag support.
Pppery added subscribers: DannyH, bzimport, Izno and 2 others.

Change 981193 had a related patch set uploaded (by Thomas-topway-it; author: Thomas-topway-it):

[mediawiki/core@REL1_39] parsing of nested tabs with 'parse-wikitext' attribute, see phabricator T4700 and T3310

https://gerrit.wikimedia.org/r/981193