Page MenuHomePhabricator

Generate error for unbalanced <ref></ref>s
Closed, ResolvedPublic

Description

Author: ayg

Description:
If a page has more <ref>s than </ref>s or vice versa, that's a clue that
something's wrong. If something *is* wrong, you're probably going to get either
a huge chunk of text stuck in a reference or a reference stuck in the middle of
a page, which are both bad (see the URL for a typical case of the former).
Furthermore, people who aren't technically-minded will probably have not the
slightest clue how to fix the problem--I've seen this happen twice. Since
there's no legitimate reason anyone should ever want to save unbalanced <ref>s,
I suggest an error message be presented when they try rather than allowing them
to screw up the page.


Version: unspecified
Severity: enhancement
URL: http://en.wikipedia.org/w/index.php?title=Alcohol_and_cancer&oldid=56552675

Details

Reference
bz6199

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 9:21 PM
bzimport added a project: Cite.
bzimport set Reference to bz6199.
bzimport added a subscriber: Unknown Object (MLST).

ayg wrote:

Let me revise slightly. Not saving is maybe a bad idea, so how about this: if I
enter foo<ref>bar baz<ref>whatever comes next</ref>, it outputs HTML like

foo&lt;ref&gt;<span style="font-size: large; font-weight: bold; color:
red;">WARNING: No closing &lt;/ref&gt; detected for this tag!</span>bar
baz["whatever comes next" refs correctly]

So in other words, instead of eating the page, ignore the unclosed tag and make
it clear up-front that there's an error of some kind.

(Title: "Prevent page save with unbalanced <ref></ref>s" -> "Fix behavior for
unbalanced <ref></ref>s")

ayg wrote:

Note that this second suggestion appears to be how <ref> handles certain errors
already, see for instance
http://en.wikipedia.org/wiki/User:Simetrical/Let%27s_break_refs%21.

ayg wrote:

I've been informed that with the current parser this is impossible to implement
fully. It could, however, be implemented partially by checking whether the
reference contains the string "<ref>", which would indicate an error. For
"text1<ref>ref1 text2<ref>ref2</ref>", desired output would probably be
"text1[ERROR MESSAGE]ref1 text2[1]...1. ref2". I.e., text at the beginning of
the ref until the *last* occurrence of "<ref>" is printed literally preceded by
an error message, text following the <ref> is printed as a reference to minimize
page-screwed-up-ness. (It should indeed be last "<ref>", if you think about it:
<ref>s can't be legally nested, so any <ref>s previous to the last must also be
unclosed.)

Since the original reason for wontfixing (undesirability of proposed
consequence) has been resolved, and the proposal as stands may be infeasible but
is still ultimately possible and presumably desired, I'm reopening this (neither
brion nor Rob cared enough to reopen it themselves, but they don't mind if I
reopen it either).

ayg wrote:

Modified Cite.php

I have not the foggiest clue what format exactly the patch is supposed to be
in, and I don't see anything that tells me that fact, so I'm just putting up
the whole PHP file.

Attached patch will generate an error if $str contains the string '<ref>', but
will work correctly otherwise. Ideally, this should print the contents of the
bad <ref> rather than just sending an error, but this should be good enough for
now (since it's unlikely it will go undetected for long, I hope, or be hard to
fix).

attachment Extensions/Cite/Cite.php ignored as obsolete

ayg wrote:

An attempt at diff format, otherwise same as above.

attachment Cite.php patch diff.txt ignored as obsolete

ayg wrote:

Proposed patch #3

Is there a page somewhere on how to format diffs?

attachment Cite.php patch diff.txt ignored as obsolete

ayg wrote:

Attempt #4

attachment Cite.php patch diff.txt ignored as obsolete

ayg wrote:

Final attempt before waiting for help

attachment Cite.php patch diff.txt ignored as obsolete

robchur wrote:

Use the diff feature from your Subversion client.

ayg wrote:

Same patch as before

I was trying to use the svn diff, but it previously didn't work because my
command-line window was too small. After fidgeting with settings (these things
are more obvious at 7 PM than 1 AM), I could copy the whole thing. It was too
long before due to addition and removal of identical lines due to reformatting
issues; I've trimmed these out.

attachment Cite.php patch diff.txt ignored as obsolete

ayg wrote:

Same patch as before

Remove extraneous linebreaks that were causing parsing errors due to incidental
hyphen at the beginning of a line.

attachment Cite.php patch diff.txt ignored as obsolete

ayg wrote:

Note: this patch will likely handle <nowiki> incorrectly, interpreting
<nowiki><ref></nowiki> as an error if contained within a ref. A simple
workaround would be to use &lt;ref&gt; instead, so the proposed patch is still
probably better than nothing, but could be improved.

That would be clearly unacceptable, breaking thousands of pages.

ayg wrote:

Really? It would be that major? I can't see why people would want to put the
string "<ref>" inside a ref except, possibly, at [[m:Cite.php]]. I'll fix it, then.

<nowiki><ref></nowiki> needs to work.

Uhhh, but if it's inside a <ref>, who knows. Never mind. :D

ayg wrote:

New patch

New patch fixes the problem with nowiki.

attachment 6199b.txt ignored as obsolete

That regex looks pretty fragile. What about <!-- <ref> --> etc?

ayg wrote:

True. Actually, I can't have tested that; '/<nowiki>.*?</' is obviously going
to break. I'll test this and submit a revised version. Probably
'%<nowiki>.*?</nowiki>|<!--.*?-->|<pre>.*?</pre>%i' or something is what I want.
Or maybe I could exclude a <ref> that's included between *any* tags: that won't
muck with parser functions, for instance, that for God knows what unholy reason
might want a literal "<ref>", but it will still catch common errors.

Plus, the new version can benefit from brilliant programming techniques I've
picked up in the last two months, such as the existence of preg_match. :D

ayg wrote:

Cautious proposed patch

This patch ignores any <ref> inside either a comment or any SGML-like tag.
This is perhaps overcautious, but it will still catch a large majority of
errors like this without any chance of false positives that I can think of.

Attached:

paul wrote:

See also bug 8693 for a related issue.

webboy wrote:

*** Bug 12757 has been marked as a duplicate of this bug. ***

webboy wrote:

Duplicate bug 12757 also contains some comments and a patch.

ayg wrote:

Committed essentially the patch from comment 19 in r40998. Ugly, but better than leaving this around for another two years . . .

I think I found a small bug: If the missing </ref>-tag is the last ref tag, no error message is displayed.
Works correct: <ref>broken <ref>correct</ref> <references />
Doesn't work : <ref>broken <references />
See http://en.wikipedia.org/w/index.php?oldid=239749551 http://en.wikipedia.org/w/index.php?oldid=239749592 for examples.

(In reply to comment #25)

I think I found a small bug: If the missing </ref>-tag is the last ref tag, no
error message is displayed.
Works correct: <ref>broken <ref>correct</ref> <references />
Doesn't work : <ref>broken <references />

That's because parser doesn't send unclosed tags to extensions.

ayg wrote:

(In reply to comment #25)

I think I found a small bug: If the missing </ref>-tag is the last ref tag, no
error message is displayed.
Works correct: <ref>broken <ref>correct</ref> <references />
Doesn't work : <ref>broken <references />
See http://en.wikipedia.org/w/index.php?oldid=239749551
http://en.wikipedia.org/w/index.php?oldid=239749592 for examples.

Please file a new bug, and link to it from here.

(In reply to comment #26)

That's because parser doesn't send unclosed tags to extensions.

It looks like it's sending the tag, just implicitly closing it at the end of the article. Either way, I don't see an obvious workaround for this.