Page MenuHomePhabricator

Update to cite.php refs: ids fixed, footnotes, globbed links
Closed, InvalidPublicFeature

Description

Author: ncw33

Description:
The diff

  • There were a couple of spots in the code with slight problems on the id-choosing scheme. There were indeed tricksy things you could do to break the old code by choosing group names with strategically placed hyphens. I have altered this, so that all ids are valid XHTML, and map entered names properly injectively to generated ids to ensure uniqueness.
  • Previously, refs used once were treated separately and the default group is treated slightly oddly. I think some of this was a holdover from an earlier coding of the groups functionality. I have homogenised treatment for these cases a little.

Now, the above make it a bit harder to separate out the changes in my local version to individual patches, so this patch encompasses everthing I did on Friday evening, including the following new exciting features:

  • Support for nice footnote symbols. If you create a group called "footnote", you get a nice progression of *†‡ symbols etc. instead of numbers. This is neat and works well, and should be pretty un-controversial.
  • Secondly, and more majorly, I have built a parser stage that globs links to notes together. It maps [1][2][nb 3][5][nb 5][6][nb 4][7] ↦ [1,2,5-7,nb 3-nb 4] (an obviously extreme example to show the scheme). In general, the most common case is [13][14][15][16] ↦ [13-16], an improvement that has been requested many times. All links and things work nicely.

Version: unspecified
Severity: enhancement

attachment patch.diff ignored as obsolete

Details

Reference
bz16294

Event Timeline

bzimport raised the priority of this task from to Low.Nov 21 2014, 10:24 PM
bzimport added a project: Cite.
bzimport set Reference to bz16294.
bzimport added a subscriber: Unknown Object (MLST).

ncw33 wrote:

the diff

I have update the patch to fix another annoying bug present in the upstream version: a ref like [Note 1] (as used copiously in some WP articles) should in fact be [Note 1] for obvious reasons.

attachment patch ignored as obsolete

This'll need some parser test cases that cover:

a) the previously broken cases, to confirm they don't regress in future

b) any new functionality, to confirm it works as expected and doesn't regress in future

A couple quick notes...

+ preg_match("#{$this->mPre}<span id=\"(cite_ref-(\w*)::(\d*)::.*)\"><a href=\"(.*)\".*>(.*)</a>$#U", $strings[$i], $match);
^ I have to admit I'm pretty leery about this kind of regex on HTML code. These can be fragile, and a minor change in markup could result in breakage, even assuming the regexes have no flaws with the original markup.

Adding regression tests helps. ;)

+ 'cite_references_footnote_labels' => '* † ⁑ ‡ § ‖ ¶',
^ Is this really an appropriate progression for footnote labels? § and ¶ refer to section and paragraph numbers, normally; do they get reused for footnotes in this fashion?

+ 'cite_reference_bra' => '[',
+ 'cite_reference_ket' => ']',
^ This naming scheme is cute, but a bit unclear. :) I'd recommend using clear, legible names such as "open_bracket" and "close_bracket", or more simply a single message with the value '[$1]'.

ncw33 wrote:

Thanks for the feedback. I will add the parser tests soon. The change in naming scheme is causing the current parser tests to all fail, as expected — the correct answers need to be updated for the existing tests (the naming scheme had to be changed because there were a couple of characters not correctly escaped by the old scheme, while the new one is robust).

So, for the time being, I have checked and updated the old parser tests, which now work, and I will update the new set of tests when I have added suitably tricksy ones to thoroughly test the new functionality. In fact, a minor syntax change to one of regexps has been needed to properly pass the tests, so it was worth showing them to me.

Notes:

  1. I agree that regexps are nasty (and slow), but this one is as simple as I think it can get. I am making nearly all my regexps non-greedy, which helps, and the expression is literally just grabbing the tags out of the wikitext which is set above. I can see problems with maintainability, but it should be able to take any wikitext correctly.

Ultimately, I am a bit unhappy about the need for this stage, but there seems no way around it, since tag hooks only work on one tag at a time. Implementing ref merging therefore does need a parser stage, and, short of parsing the whole text into a tree, regexps seem the only semi-efficient way of picking out the tags.

If the grouping of refs is not optional, then there is a slightly more efficient way of managing this, including simpler regexps, but the great advantage of the way I have done it is that simply commenting out the line where the parser hook is registered (l. 641) removes the whole processing stage, and links are output as normal. (With this line commented out, the patch still works and becomes rather less controversial, with referencesMerge and callbackMerge never used.)

  1. This is good. I have some books using this style, and [[Footnote#cite_note-0]] agrees with this use of § and ¶.
  1. Sorry; I have just essentially scratch-written the [[probability amplitude]] article, so I have things like [[bra-ket notation]] on the brain. It may look 'cute', but this terminology is actually used by serious physicists!

Ah, quantum mechanics... For the sanity of translators, I would recommend using non-physics jargon here. ;)

ncw33 wrote:

above, but with parser tests updated and expanded (plus other small fixes)

Right. The latest patch now has updated parser tests (which are valid, display correctly, etc.), including some new ones covering the new functionality. I have also (while tweaking) removed a bizarre legacy restriction on using numbers as names for refs. Most importantly, I have seriously cut down the size of the regexps by adding another article-wide array storing tag data, instead of writing it to the page and then separating it back out later. The expressions are now much simpler and quicker.

Overall, the patch now comprises:

  1. A change in the name -> HTML id scheme to tighten up some corner cases, simultaneously cutting down on redundant legacy cases in the code, and clearing up a few other 'fixme' comments.
  1. A special case added for the group name "footnote", which now uses symbols like * † ‡ §.
  1. A fix to the old error when numeric ids are entered, removing it, since this case is now dealt with correctly.
  1. Most majorly, a new stage has been added which globs togther adjacent <ref> tags into one bracketed superscript.
  1. Corresponding updates and expansions to the parser tests to ensure everything works.

attachment patch ignored as obsolete

How odd. Although the headers in the patch claim it to be against r43263, I had to go back to r38094 to get it to apply cleanly. Anyway, I have a few criticisms (all testing done on a local MW installation at r44443, except the Cite extension at r38094 + patch):

Since the patch is against r38094 any changes since then are lost, until someone goes to the trouble of merging by hand. In particular, I notice:

  • A fix so {{#tag:ref||name=foo}} won't blow up.
  • Code to detect unclosed refs.
  • $wgAllowCiteGroups.
  • Use of Sanitizer::escapeId instead of validateName.

Something like "$wgMergeCitations" would probably be preferable to "Comment out this line to disable merging".

If I enter something like '<span class="ref_DB">1</span>' into the wikitext, it gets magically changed into a reference (or "[]"!). And if I enter '<span class="ref_DB"></span>', I get an internal error. You should probably (somehow) pick an identifier that cannot occur in wikitext.

Customized MediaWiki:Cite_references_link_one and friends need to be updated if the "footnote" group is used, as they are unlikely to contain the needed $4 otherwise. This should be noted somewhere prominent.

Output like "[†–‡]" makes little sense to me, as there is not really a natural ordering to the symbols as there is with numbers. OTOH, if someone customized MediaWiki:cite_references_footnote_labels to use "a b c" instead of "* † ⁑" then it would make sense to merge there.

ncw33 wrote:

patch against r44396

Thanks for the useful feedback. It seems I was indeed working from a version of cite quite a bit older than my svn MW. I have manually merged the recent improvements into my code. In particular, great minds must think alike, as I was surprised to find that after switching to Sanitizer::escapeId from my validateName, I still passed the parser tests, because my carefully chose regexps were identical to the ones in the Sanitizer class! (Also, I had removed $wgAllowCiteGroups per the comment to 'remove when groups are fully tested', and given that they are now on by default and used on the main WP.)

The way the new tags are written however makes it a little problematic to just turn off merging. If a $wgMergeCitations feature is wanted, then it can be done easily, but it would be only for aesthetic reasons, not reliability, as nearly the same amount of new code would still be run in each case.

I have cunningly fixed the problem of entering identifiers which can occur in wikitext by re-using the <ref> tag itself, which is neat, because it is the only one we can count on being parsed the way the want. It also happens to work well with the 'unclosed ref' code to make sure that any dangerous <ref> or </ref> still in the wikitext is not a full tag which would mess things up. It looks a spot fragile, but theoretically should work, and I have tested both the cases of too many <ref>s and too many </ref>s, neither of which cause problems.

Thanks for the reminder to update all the documentation, which I can definitely remember to do if/when this goes through.

I have shortened and improved the dashing function now, and it adds footnotes as a special case. Thanks for pointing this out.

attachment patch.diff ignored as obsolete

(Line numbers are after applying the patch)

I see you added a 'cite_error_references_nested' error! Shouldn't the "ref was" part be localizable, though? Also, can the old check (lines 158-161) be removed, or is it still useful?

{{#tag:ref||name=0}} fails. At line 143, it should be is_null($key) instead of $key==false.

When the ref tag has unknown parameters (e.g. <ref name="foo" value="bar">), false is returned from refArg into $key. Instead of generating an error, these are now all being treated as <ref name="" group="0"> would be were name="" allowed. Either restore the check for $key===false in guardedRef, or change refArg to silently ignore unknown parameters.

You're not escaping the group when outputting the inline ref tags: try <ref group="&lt;script&gt;alert('Pwned!')&lt;/script&gt;">Uh-oh</ref>. Passing $text through htmlspecialchars in referencePlaceText seems to take care of it.

For the 'cite_error_references_no_text' error, you are not properly escaping the key. But here it seems you can only inject wikitext, so it's just aesthetic rather than a security hole.

ncw33 wrote:

more parsertests, cleared up return values from refArg

(In reply to comment #8)

(Line numbers are after applying the patch)

I see you added a 'cite_error_references_nested' error! Shouldn't the "ref was"
part be localizable, though? Also, can the old check (lines 158-161) be
removed, or is it still useful?

The old check was there to warn the user against the annoying problem of unbalanced tags, which the parser just chews up silently. There is another check to prevent the XML-valid nesting of tags (for various reasons). Before, this failed fairly silently, so the new error is there to alert the user that something is wrong. Both pieces of code are meant to be in use.

{{#tag:ref||name=0}} fails. At line 143, it should be is_null($key) instead of
$key==false.

Thanks. I have thoroughly rewritten the parser tests now to test nearly exhaustively all possible ways of producing a single error, and some ways of producing two. This should catch more of this sort of thing (but not so many false negatives) in future. Perhaps another set of tests for 'pathological positives' would be helpful sometime.

When the ref tag has unknown parameters (e.g. <ref name="foo" value="bar">),
false is returned from refArg into $key. Instead of generating an error, these
are now all being treated as <ref name="" group="0"> would be were name=""
allowed. Either restore the check for $key===false in guardedRef, or change
refArg to silently ignore unknown parameters.

The problem was just a slightly sloppy merge of a line from the stuff that had changed since the version I started working from. I have removed a few cases from refArg to neaten up the possible return values and fixed up guardedRef.

You're not escaping the group when outputting the inline ref tags: try <ref
group="&lt;script&gt;alert('Pwned!')&lt;/script&gt;">Uh-oh</ref>. Passing $text
through htmlspecialchars in referencePlaceText seems to take care of it.

Yikes! This one really is my bad. mRefs now holds dangerous text, is documented as such, and everything is escaped when written out to the page (I hope!). I think everthing is safe now, but I should not have slipped up there. Thanks for double-checking my 'late night' code!

For the 'cite_error_references_no_text' error, you are not properly escaping
the key. But here it seems you can only inject wikitext, so it's just aesthetic
rather than a security hole.

This is fine. I had thought that anything the user types, they are happy with having echoed back at them as wikitext.

Attached:

wicke wrote:

Nicholas, I'd like to help with getting this patch finally reviewed and checked in. Cite has changed in the meantime, so the patch will need to be updated to latest trunk. Do you have a more up-to-date version? If you do, then this would save the duplicate effort of forward-porting it.

sumanah wrote:

Also, Nicholas, now that we have moved to Git as a source control system, you're welcome to get developer access https://www.mediawiki.org/wiki/Developer_access to put this & future patches directly into Git. That will get them reviewed faster, too!

Aklapper changed the subtype of this task from "Task" to "Feature Request".Feb 4 2022, 11:01 AM
thiemowmde subscribed.

Looks like this was never really a ticket but more an extended commit message of a very large, ambivalent patch. The discussion appears to be about several bugs as well as multiple new features that might or might not have been discussed in other tickets before. The last diff we have archived here is 15 years old and 120 kb big. This would need to be redone anyway, as well as discussed in individual tickets before.