
Fix fragment identifiers in links to sections
Open, LowPublic

Description

Author: gangleri

Description:
Hallo!

http://test.wikipedia.org/wiki/User:Gangleri/tests/bugzilla/section_links/wiki_syntax provides a wide range of various sections generated using wiki syntax.

The scope of this report is to identify cases where "return to section" does not work properly *and* wrong autocomment links are generated, and how they could be improved.

The set of characters which are not allowed in page titles *differs* from the set of characters which don't make sense in sections (such as "#"). Besides this, there is a range of characters which require *escaping* when used in anchors (such as "&").

However, the page provides many examples where the "text" rendered / displayed in the TOC contains only characters which can be used in "return" anchors and in "autocomment" links. This is mainly T4831: Links in autogenerated summary in page histories may point to wrong section or to nowhere, but there are many examples using wiki syntax, TeX etc. where the TOC text is more suitable as a return-to anchor and autocomment than what is generated now. T6987 is only one of them.

Another "discrepancy" I want to address is multiple identical TOC texts. A solution which would make sense is to render the TOC text the *same* as the anchors, *automatically*. This would make the problem more visible while editing, and users could easily fix such sections immediately. This is T2111: Return to the correct section after section edit for a heading that exists multiple times. The algorithm should also work for new sections inserted "out of sequence": "first in" -> "first number"; "next in" -> "last number++" (because of possible gaps). This would not generate duplicate TOC texts any more. Existing pages should render according to this algorithm at the next change or when purged. The algorithm would require (sorry Brion) identical handling of spaces and underscores, as addressed in T4339.
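The numbering scheme suggested here (first occurrence keeps the bare name, later duplicates get the next number in sequence) can be sketched as follows. This is an illustrative sketch only; the function name and `_N` suffix format are invented, not MediaWiki's actual implementation:

```python
def number_duplicates(headings):
    """Append a numeric suffix to repeated heading texts.

    The first occurrence keeps the bare text; each later duplicate
    gets the next number in sequence ("_2", "_3", ...).
    (Illustrative sketch only -- not MediaWiki's real algorithm.)
    """
    seen = {}      # heading text -> how many times it has appeared
    result = []
    for text in headings:
        count = seen.get(text, 0)
        seen[text] = count + 1
        result.append(text if count == 0 else f"{text}_{count + 1}")
    return result
```

For example, `number_duplicates(["Intro", "Notes", "Notes"])` yields `["Intro", "Notes", "Notes_2"]`, so two sections named "Notes" no longer collide.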

The examples from the URL show many complications of section headers related to known bugs or implementation-dependent issues:
[[foo|]] does not render to [[foo]] today - T3034
[[foo bar#foobar|]] does not generate a valid link today but it might generate [[foo bar#foobar|foobar]] in the future - T2845

The main / primary focus should not be on mentioned or documented complications or other curiosities but on *efficiency* and unifying user interface functionality.

There will still be situations (conflicts) where characters used inside the TOC are not valid in anchors. These characters need to be escaped anyway.

Wikis live. After "changing the rules" it will be possible to "simplify" the stored autocomments in existing summary fields. This will not be a manipulation but an improvement / useful fix for users.

best regards reinhardt [[user:gangleri]]


Version: unspecified
Severity: minor
URL: http://test.wikipedia.org/w/index.php?title=User:Gangleri/tests/bugzilla/section_links/wiki_syntax&action=history

Details

Reference
bz5019

Revisions and Commits

Event Timeline

bzimport raised the priority of this task from to Low.Nov 21 2014, 9:08 PM
bzimport set Reference to bz5019.
bzimport added a subscriber: Unknown Object (MLST).

gangleri wrote:

(In reply to comment #0)

... *differs* from the
set of characters which don't make sense in sections (as "#").

A "#" section generates
http://test.wikipedia.org/wiki/User:Gangleri/tests/bugzilla/section_links/wiki_syntax#.23

" " generates the return link
http://test.wikipedia.org/wiki/User:Gangleri/tests/bugzilla/section_links/wiki_syntax#
This is equivalent to an anchor at the start / top of the page.
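The ".23" above comes from MediaWiki's legacy anchor encoding: the heading has its spaces mapped to underscores, then unsafe bytes are percent-encoded as UTF-8 with '%' replaced by '.'. A minimal sketch of that scheme (not the actual Sanitizer code, which keeps a slightly different set of characters):

```python
from urllib.parse import quote

def legacy_anchor(heading):
    """Sketch of MediaWiki's legacy dot-escaped anchor encoding.

    Spaces become underscores; other unsafe characters are
    percent-encoded as UTF-8, with '%' turned into '.'.
    (Illustrative only -- the real Sanitizer differs in detail.)
    """
    text = heading.strip().replace(" ", "_")
    # quote() percent-encodes everything outside alphanumerics and
    # "_.-~"; swapping '%' for '.' yields the dot-escaped form.
    return quote(text, safe="_.-~").replace("%", ".")
```

Under this scheme `legacy_anchor("#")` gives ".23" (the %23 escape with a dot), which is exactly the anchor shown in the URL above.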

ayg wrote:

Why is this request nontrivial to implement? The TOC works perfectly for every
case I've seen, so why can't that code be plugged in to the post-edit section
link-generator and the page history link-generator in an hour or whatever?

It's nontrivial because the multipass parser is nontrivial.
The set of text that is present at some stage partway through
parsing isn't necessarily the same as what you have at some
other stage.

ayg wrote:

Hmm. Generating the section-edit link could be rather simpler than it is, since
it doesn't actually have to accord *that* closely with the markup used. It
could potentially be arbitrary, in fact, although we don't really want that. An
algorithm like:

  • Identify internal links and replace them with their display text.
  • Identify external links and replace them with their display text.
  • Strip the contents of <ref>s, if Cite.php is installed (ideally; this may not be reasonable for an extension, but I did once see someone include a very lengthy ref in a header).
  • Strip all SGML-style tags (but not their contents).
  • Replace all template parameters with their defaults, if they have any.
  • Replace all accented Latin characters with their unaccented versions.
  • Strip all apostrophes.
  • For all sequences of nonalphabetic symbols that aren't printable in URLs:
    • If the sequence has a printable punctuation mark (e.g., _ or -) to one side, delete the sequence.
    • Otherwise, replace the sequence with an underscore.
  • See if the array passed to the function contains a string that's identical to the current string. If it does, append an appropriate number.
  • Return the string.
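The steps above can be sketched roughly as follows. All regexes, the helper name, and the collision-numbering format are invented for illustration; this is a sketch of the proposed algorithm, not the patch's actual code:

```python
import re
import unicodedata

def proposed_anchor(heading, existing):
    """Rough sketch of the anchor-sanitization steps proposed above.

    `existing` is the list of anchors already used on the page.
    (Illustrative only -- regexes and numbering format are guesses.)
    """
    # Internal links: keep the display text if given, else the target.
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", heading)
    # Bracketed external links: keep the description only.
    text = re.sub(r"\[\S+(?:\s+([^\]]*))?\]", r"\1", text)
    # Strip the contents of <ref>...</ref>, then all remaining tags.
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.S)
    text = re.sub(r"<[^>]*>", "", text)
    # Accented Latin characters -> unaccented; apostrophes stripped.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = text.replace("'", "")
    # Runs of characters unsafe in URLs collapse to one underscore,
    # and leading/trailing runs are deleted.
    text = re.sub(r"[^A-Za-z0-9_.:-]+", "_", text.strip())
    text = re.sub(r"_+", "_", text).strip("_")
    # Collision handling: append a number if the anchor is taken.
    if text in existing:
        text = f"{text}_{existing.count(text) + 1}"
    return text
```

On the examples quoted later in this thread, this sketch turns "spéçïål çhärâçtêrs" into "special_characters" rather than the dot-escaped legacy form.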

could be implemented as an Article method. Anything that wanted to know what
the title should be for a given heading would request the section heading from
the Article object. This general algorithm would probably be simpler and
cheaper to implement than what I believe is being generated now, because it
doesn't transclude templates or generally rely on anything recursive or
otherwise tiresome. Furthermore, the id wouldn't change if an included template
changed, or in fact if anything whatsoever changed other than the actual text of
the header. And finally, on Latin-alphabet wikis it will provide entirely
readable section ids, because all reserved characters are just dropped.

ayg wrote:

Proposed patch

  • Creates basically consistent anchors no matter where you come from (TOC, Linker.php, EditPage.php).
  • Certain things in headers, such as variables and seemingly extension hooks (at least <ref>), will remain inconsistent. No regression: they'll still work from the TOC, but they will not work from history or return-on-edit.
  • Duplicate header names will still work from the TOC, but can't conceivably work from elsewhere with the current setup. Still no regression.
  • Legacy anchors are preserved, so old links don't break.
  • Anchors are more readable and more easily typed where possible (e.g., "One (Two), Three" -> "One_Two_Three" not "One_.28Two.29.2C_Three", "spéçïål çhärâçtêrs" -> "special_characters" not "sp.C3.A9.C3.A7.C3.AF.C3.A5l_.C3.A7h.C3.A4r.C3.A2.C3.A7t.C3.AArs").
  • I noticed no performance hit in some quick benchmarking.

attachment 5019c.patch ignored as obsolete

ayg wrote:

Test cases and output for proposed patch

Thanks mainly to Gangleri. Note: Cite.php is enabled, but Inputbox and math
(among other things) are not.

Attached:

ayg wrote:

Patch with no debugging code

The patch isn't great, so I hope to improve upon it, but in case someone *does*
want to use this one, I'd better remove the debugging code. :D

attachment 5019d.patch ignored as obsolete

ayg wrote:

Patch against r15884

It finally occurred to me that templates don't work with this either; I didn't
think to add that as a test case, stupidly enough. This isn't a great
solution, then, although still no regression, and it should be
forward-compatible, so this is really more "incomplete" than "bad". Might
still be useful to commit the patch in its current form; to handle templates
and variables correctly would require fairly intricate changes to
Parser::replaceVariables or Parser::replace_callback, as far as I can see,
albeit probably a lot simpler if you're familiar with such things. I can't
figure out how to do it, sadly.

Attached:

ayg wrote:

*** Bug 4987 has been marked as a duplicate of this bug. ***

wmahan_04 wrote:

Thanks for making the patch. Here are some comments:

  • As described, the patch introduces a new mapping of section names to anchors. The old name is kept as a "legacy" name, but why make this change at all? To me it seems orthogonal to the bug. The anchor names are not arbitrary; they are supposed to be ASCII-only, human-readable when possible, and stable enough for use in URLs. Both the existing method and the new one will be unreadable at times, but at least the old one was established and unambiguous.
  • I noticed that you add an 'x' before anchor names that don't begin with a letter. Technically you are correct in that standards require this, but is there any practical reason for it? Section links like [[Example#1990s]] already work fine, and changing the anchor to "x1990s" seems ugly and arbitrary.
  • The base64_encode() / base64_decode() process seems unnecessary; why not just save the section titles in an array and store an index into the array in the placeholder instead of the encoded string? To be fair, this isn't specific to your patch.
  • Why make wfCreateAnchor a global function?
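The index-based alternative suggested here is the usual strip-marker pattern in the MediaWiki parser: replace each heading with a unique placeholder, keep the real text in an array, and substitute it back after later passes. A toy illustration in Python (the marker format and heading regex are invented, not the parser's):

```python
import re

def strip_headings(wikitext, store):
    """Replace each '== ... ==' heading line with an indexed marker.

    The heading wikitext goes into `store`; the placeholder carries
    only the array index, so no base64 round-trip is needed.
    (Marker format and heading regex are invented for illustration.)
    """
    def replace(match):
        store.append(match.group(0))   # keep the original heading text
        return f"\x7fHEADING-{len(store) - 1}\x7f"
    return re.sub(r"^==.*==$", replace, wikitext, flags=re.M)

def unstrip_headings(text, store):
    """Substitute the stored heading text back for each placeholder."""
    return re.sub(r"\x7fHEADING-(\d+)\x7f",
                  lambda m: store[int(m.group(1))], text)
```

The round trip `unstrip_headings(strip_headings(src, store), store)` restores the original text, and intermediate passes only ever see the opaque markers.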

ayg wrote:

(In reply to comment #10)

  • As described, the patch introduces a new mapping of section names to anchors. The old name is kept as a "legacy" name, but why make this change at all?

I couldn't use the mapping currently used by the parser, because that's based
off rendered HTML, not wikitext, and parsing each fragment on a history page is
unacceptable. I could have used the one used by EditPage and Linker (which I
suspect are slightly different), but that would break past links anyway, because
everyone makes links that point to the parser output, not the EditPage/Linker
output. So I figured I might as well make the output more readable while I was
at it. Note that the system I used accords with the "legacy" system much more
often than the EditPage/Linker system does, too.

  • I noticed that you add an 'x' before anchor names that don't begin with a letter. Technically you are correct in that standards require this, but is there any practical reason for it?

No, but MediaWiki likes to maintain XHTML compliance, all things being equal.

  • The base64_encode() / base64_decode() process seems unnecessary; why not just save the section titles in an array and store an index into the array in the placeholder instead of the encoded string?

Good question. I'm kind of new at programming, and the Base64 idea seemed more
obvious at the time to me. Of course, everything else that I saw in the parser
is done the way you say, and in retrospect I should have done it that way.

  • Why make wfCreateAnchor a global function?

Because it needs to be accessible to Parser, EditPage, and Linker, which don't
all talk to each other, and I didn't know about static methods at the time. :P

(Tim pointed out to me on IRC, by the way, that EditPage would be very simple to
fix: just have Parser pass on the correctly-formatted anchor to EditPage. This
tactic wouldn't solve the history-page problem, but it's better than nothing.)

wmahan_04 wrote:

(In reply to comment #11)

I couldn't use the mapping currently used by the parser, because that's based
off rendered HTML, not wikitext, and parsing each fragment on a history page is
unacceptable.

But would it really be that expensive, considering that we
already parse wikilinks on history pages? The rendered HTML is
stripped out when making anchors, so you wouldn't need to
generate it on history pages or for section edits. I wonder how
much it would slow things down to replace wikilinks by their text,
remove all single-quote (bold/italic) formatting, and maybe
replace bracketed external links with their descriptions. Is
there something I missed?

I see now why you chose to create a new format, and maybe that
is indeed the way to go. I like your idea of unifying the three
different places that generate anchor/fragment names.

(Tim pointed out to me on IRC, by the way, that EditPage would be very simple to
fix: just have Parser pass on the correctly-formatted anchor to EditPage. This
tactic wouldn't solve the history-page problem, but it's better than nothing.)

I believe we have to do that anyway if we want to fix bug 111.

ayg wrote:

(In reply to comment #12)

But would it really be that expensive, considering that we
already parse wikilinks on history pages?

Well, not if you wrote a mini-parser, but that would be quite complicated (the
parser is rather intricate) and almost certainly not worth the effort. Even if
it's somewhat faster, there can be a maximum of 5000 items on a history page.
The current parser takes about 800 ms to render a typical page, which obviously
is totally unacceptable when multiplied by even ten, let alone a thousand.

Is there something I missed?

Templates. Which require database queries, not infrequently multiple database
queries. Parser extensions can also potentially be expensive, because they can
do absolutely anything. Tim confirmed what I suspected when I chatted with him
about this on IRC: running the parser for every section edit link would cause
the WMF servers to die. That's why I tried to make a wikitext-based format in
preference to a rendered text-based format in the first place.

I believe we have to do that anyway if we want to fix bug 111.

Correct.

. . . hmm. I just had an idea. What if we took the finished anchor from the
parser, stuck it onto something, used it for the return value of EditPage, *and
then put it into the history entry*? Nobody said the history entry's link
target *has* to be determined on loading the history page, right? It can be
determined when adding it to the edit history in the first place.

Of course, this would likely require database schema changes (store the actual
fragment separately from the text displayed to the right of the arrow), but it
seems like a flawless solution. Everything is inherently centralized, and we
don't have to deal with the headache of adding anchor stuff into the template pass.

wmahan_04 wrote:

(In reply to comment #13)

Well, not if you wrote a mini-parser, but that would be quite complicated (the
parser is rather intricate) and almost certainly not worth the effort. Even if

My point was that I don't see anything intricate or complicated about
converting wikitext into plain text for the important cases: one regular
expression search-and-replace to remove wikilinks, one to remove quotes,
and one to remove external links. This doesn't have to happen for every
history entry, only those corresponding to section edits, and even then
the replacement only takes place on section headings, not the whole edit
summary.
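The three replacements described above can be sketched like this; the exact patterns are guesses for illustration, not the ones wmahan had in mind:

```python
import re

def summary_plaintext(heading):
    """Cheap wikitext -> plain text for section headings in summaries.

    Three regex passes, as described above: wikilinks reduced to
    their display text, ''/''' quote formatting stripped, bracketed
    external links reduced to their description.
    (Patterns are illustrative guesses, not MediaWiki code.)
    """
    # [[target|display]] or [[target]] -> display (or target)
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", heading)
    # '' and ''' (italic / bold markers) stripped
    text = re.sub(r"'{2,}", "", text)
    # [url description] -> description
    text = re.sub(r"\[\S+\s+([^\]]*)\]", r"\1", text)
    return text
```

Each pass is a single linear scan, so applying this only to the section-heading part of each section-edit summary would indeed be far cheaper than a full parse.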

The problem with templates is that they are expanded in the parser before
headings are parsed, and as you say expanding templates on history pages
is not feasible. Your patch doesn't fix templates-within-headings either,
and I'm not convinced it's worth supporting anyway.

. . . hmm. I just had an idea. What if we took the finished anchor from the
parser, stuck it onto something, used it for the return value of EditPage, *and
then put it into the history entry*?

Not a bad idea; the only objection I can think of is that it seems
like overkill to change the schema for what's ultimately a minor
annoyance.

nightstallion wrote:

Does this bug cover the problems with linking to sections which include
apostrophes for bolding and italicising? See Talk:€2 commemorative coins at
en.wikipedia.org.

Many of these issues have been fixed by TimLaqua in r25445, r25446, r25450 and r25573 (see also bug 2831 and bug 10836). This should reduce the list of problematic section headers significantly.

conrad.irwin wrote:

*** Bug 17786 has been marked as a duplicate of this bug. ***

conrad.irwin wrote:

*** Bug 12414 has been marked as a duplicate of this bug. ***

conrad.irwin wrote:

*** Bug 24262 has been marked as a duplicate of this bug. ***

*Bulk BZ Change: +Patch to open bugs with patches attached that are missing the keyword*

sumanah wrote:

+reviewed since comment 16 implies some obsoletion of Aryeh's patches

epriestley added a commit: Unknown Object (Diffusion Commit).Mar 4 2015, 8:16 AM

Accidental clash. Known issue. Sorry for the noise.