Page MenuHomePhabricator

When nesting refs using cite.php extension, the ordering is wrong
Closed, DeclinedPublic

Description

Author: ncw33

Description:
There are some problems with how to do this correctly, so the extension disallows 'native' nesting at the moment. Trying to do this the #tag magic word does not really word either (see the document linked). In any case, working out the correct numbering is non-trivial. If anyone comments favourably, or acknowledges this bug at all, then I will fling together the work I have done on this to get it into the form of a patch.


Version: unspecified
Severity: minor
URL: http://en.wikipedia.org/wiki/User:Kan8eDie/nestedreforderbug

Details

Reference
bz16330

Event Timeline

bzimport raised the priority of this task from to Low.Nov 21 2014, 10:27 PM
bzimport added a project: Cite.
bzimport set Reference to bz16330.
bzimport added a subscriber: Unknown Object (MLST).

happy_melon wrote:

I can't see any advantage to allowing nested refs at all. Why would you want to cite a citation? :D

(In reply to comment #1)

I can't see any advantage to allowing nested refs at all. Why would you want
to cite a citation? :D

You might want to cite a footnote.

It seems part of the problem with allowing <ref>foo<ref>bar</ref></ref> is that the parser interprets it as one node "<ref>foo<ref>bar</ref>" instead of as two nested <ref>s; see http://en.wikipedia.org/w/api.php?action=query&format=yamlfm&action=expandtemplates&text=%3Cref%3Efoo%3Cref%3Ebar%3C/ref%3E%3C/ref%3E&generatexml=1

rupert wrote:

Nicholas,

I agree that not being able to reference footnotes is a real deficiency of Cite.php, and was about to fix this until I came across this bug. How close are you to a fix? (I don't want to reinvent the wheel.)

The reason you had to use the #tag magic word is to circumvent the regexp on Cite_body.php:161 which was added to fix bug 6199 - that unbalanced refs do not generate an error. The "fix" was not very good solution for two reasons:

  1. You were able to side-step it with the #tag magic word
  2. It doesn't work if the last <ref> isn't closed (bug 15712, which I've marked as a dependency of this bug)

Clearly if the Cite extension is not meant to support nesting <refs>, the regexp on Cite_body.php needs to be improved to resist the #tag magic word, and bug 15712 needs to be fixed.

However, I think it ought to support nesting refs in cases where this would make sense - ie. <ref group="g0"><ref group="g1">...<ref group="gn">...</ref>...</ref></ref> providing g0!=g1!=gn and the <references /> tags are used in the correct order <references group="g0" />...<references group="gn" />

In either case, if a </ref> tag is missing, output shouldn't be chomped as it is currently - instead the text between the unclosed ref tag and the next opening one should go to the appropriate reference/footnote. In the latter case, the reference numbing needs to be correct. I think the way to do this would be to defer parsing of the contents of <ref> tags until the <references /> tag that displays them.

This just leaves the problem that the parser passes the text between a <ref> and the first following </ref>, without regard for nesting, etc. (I'm sure you know this already.) I think improving extractTagsAndParams in Parser.php is a necessary prerequisite for enabling nested <ref>s, but I'd be interested to hear your opinion. One would have to be very careful about breaking current behaviour - I think it would be necessary in fact to differentiate between tags you can nest in, and tags that you can't - tags are opened all the time within <nowiki> elements, and the parser should not ignore the closing tag because of them. Of course what MW really needs is a proper parser, but that's too hard!

Best wishes,

Rupert

ayg wrote:

My understanding is that *no* XML-style parser functions can be nested. <poem>Foo<poem>bar</poem>baz</poem> doesn't work either. Nor <source>, etc. Tim would know more about this, and I'd advise anyone interested in fixing this (or related bugs) to discuss their ideas with him.

(In reply to comment #3)

The reason you had to use the #tag magic word is to circumvent the regexp on
Cite_body.php:161 which was added to fix bug 6199 - that unbalanced refs do not
generate an error. The "fix" was not very good solution for two reasons:

  1. You were able to side-step it with the #tag magic word

Which was added a year and a half after the patch was written.

It doesn't work if the last <ref> isn't closed (bug 15712, which I've marked

as a dependency of this bug)

Because Cite.php isn't given access to the text containing "</ref>" in that case. It would have to have an extra parser hook somehow to check the trailing text, and I don't know enough about the parser to get that to work without considerable research for a fairly minor thing.

Clearly if the Cite extension is not meant to support nesting <refs>, the
regexp on Cite_body.php needs to be improved to resist the #tag magic word, and
bug 15712 needs to be fixed.

Patches are welcome.

(In reply to comment #3)

Clearly if the Cite extension is not meant to support nesting <refs>, the
regexp on Cite_body.php needs to be improved to resist the #tag magic word,

IMO, this would be very bad. When an article has explanatory footnotes, it often is the case that the explanatory footnote will need a citation. If you disable #tag, we're back to the crap we had do to work around the lack of functionality before #tag and the group parameter were created.

(In reply to comment #4)

My understanding is that *no* XML-style parser functions can be nested.
<poem>Foo<poem>bar</poem>baz</poem> doesn't work either. Nor <source>, etc.

As far as I can tell, the parser doesn't even try. When it sees an extension tag, it just grabs text up to the next matching closing tag and passes it off to the hook function.

rupert wrote:

(In reply to comment #4)

Clearly if the Cite extension is not meant to support nesting <refs>, the
regexp on Cite_body.php needs to be improved to resist the #tag magic word, and
bug 15712 needs to be fixed.

Patches are welcome.

Agreed. Apologies if I came across unhelpful or hypercritical.

(In reply to comment #5)

(In reply to comment #3)

Clearly if the Cite extension is not meant to support nesting <refs>, the
regexp on Cite_body.php needs to be improved to resist the #tag magic word,

IMO, this would be very bad. When an article has explanatory footnotes, it
often is the case that the explanatory footnote will need a citation. If you
disable #tag, we're back to the crap we had do to work around the lack of
functionality before #tag and the group parameter were created.

Agreed - I'm not going to go down this line. My point was that at the moment it's not designed to allow nested refs

(In reply to comment #4)

My understanding is that *no* XML-style parser functions can be nested.
<poem>Foo<poem>bar</poem>baz</poem> doesn't work either. Nor <source>, etc.

As far as I can tell, the parser doesn't even try. When it sees an extension
tag, it just grabs text up to the next matching closing tag and passes it off
to the hook function.

That is also my understanding.

ncw33 wrote:

Right, I should clarify what I was on about. The real underlying problem, as has been noted, is the inconsistent use of proper parsers (a parser, in theory, is a recursive, not regex-based, function). The current first 'parser' stage is not a full-on parser, but certain aspects of the processing do use proper parsing; templates can be nested correctly, and the magic words work fine. Either we do a huge upgrade and put in some sort of tree-based parsing everywhere, but that is too idealist and not pragmatic enough to be feasible, at least I suspect without a major backward-compatibility-breaking release.

The #tag magic word 'circumventing' the anti-nesting code is not a bug per se; it actually shows the way the fix the problem. What we are tying to do is fix a parser with regexs, which does not work, when we already have a recursive parser stage (templates and magic words). Exploit the #tag method, don't do more regexp kludging to avoid it. As we can see by the use of reflist in preference to references in so many articles, there is nothing against wrapping naked (badly parsed) XML tags in (properly parsed) templates, so that is what we should do: say as policy to users "If you don't want to nest, the old XML syntax works and is good, but if you want to nest, use this template instead of putting low-level magic words in your wikitext". The template then wraps the XML tag. That is a 'better' solution, without re-inventing the wheel and changing the parser. That still leaves the problem of numbering though.

The way we currently number up is to assign a number straight away, sequentially. This does not work with recursion, because

(1) the inner elements get assigned their numbering first (part of the problem in the first example in the URL I originally gave above); and
(2) the inner element may be in a footnote, so should in fact be numbered way later (better shown by my second example).

The only way we can fix this is to delay the numbering to a later stage of processing (it is logically impossible to assign numbers immediately and not get these problems). The question is at what stage to do the numbering.

Now, I have made a patch with a bit of a rework of the processing of refs at bug 16294. I had not had this particular problem in mind at the time, but I think the approach is relevant here. The idea in that fix is that, instead of the full link (with a sup and [] etc.), only a dummy tag with an internal id is written out initially; we then run through the page with a regexp stage later and put in the correct text. This is relevant to the current bug because relatively small changes to my bug 16294 patch are needed to fix this numbering issues here.

What it would involve is writing out a 'fixed' number in the second pass instead of the initially allocated one. We correct the numbering in the references tags by writing out the refs 'out of order', checking for each ref whether its id tag has already been written out (which will not be the case for ones in footnotes that have not been written out); these we place to the end of the list. The regexp code copies this behaviour, placing the same fixed numbers when it updates the dummy code written out by the parser stage. (Note: again for want of a full parser, and since some pages put ols and uls in refs, we must write out the references tag contents at the first pass.)

I think this approach could work, and tinkered with it for about an hour, but did not get it done. I now don't have enough time to look into it fully, and I see Rupert has assigned the bug to himself, so good luck. If these thoughts are not helpful, ignore them. However, I do think that kludging a way to disable #tag is a bad idea (even the nasty regexp detecting unclosed refs is just to help usability and avoid bad wikitext, not to actually process good wikitext).

My ideas in summary: exploit the recursive nature of the proper parser handling magic words and templates, not kludge to eliminate it; delay writing out final numbering to page until have made a pass at the whole thing and know what the correct order is; possibly do this using a regexp stage in the later parsing, as at bug 16924.

rupert wrote:

Simply too hard for me.

I can't rewrite the parser to correctly nest refs.
And using #tag causes the inner refs to be parsed before the outer ref, incrementing the counters prematurely. I can't change this behaviour either.

Just a note, the best solution for bug 20707 would likely be to solve the general problem of nested refs.

matmarex subscribed.

This has been fixed by 944b24542827f827cdfb8646f1c2e87aeb929b65 / https://gerrit.wikimedia.org/r/#/c/181112/. The change will be deployed to all wikis on May 19/20, with MW 1.26wmf6, according to https://www.mediawiki.org/wiki/MediaWiki_1.26/Roadmap. You can test the new behavior on http://en.wikipedia.beta.wmflabs.org/ already.

Reopening because the patch was reverted.

Aklapper added a subscriber: Jackmcbarn.
thiemowmde subscribed.

As far as I can tell not even the (reverted) patch https://gerrit.wikimedia.org/r/181112 from 2015 made nesting possible. The ParserAfterUnstrip hook is gone since 1.35. The idea seems to be to rewrite a larger part of the parser – which is now covered and/or obsolete via the Parsoid project. The other issue I can see in the discussion are miscalculations in the way references are counted/numbered. Such issues should be and already are described in separate tickets.

In any case, this ticket here doesn't seem to be helpful.