
Anchor links are created based on different methods causing broken links
Open, Low, Public

Description

Author: Amalthea.wikimedia

Description:
Currently, anchor ids are created in four different ways across the five places where they are used. As a test case, try "_ +.3A%3A]]"

TOC (Parser.php): "__.2B.3A.253A.5D.5D"
Link (Title.php): "_.3A:.5D.5D"
redirectToFragment (Article.php): "_.3A:.5D.5D"
History (Linker.php): "_.2B.3A.253A"
Anchorencode (CoreParserFunctions.php): "__.2B:.253A.5D.5D"
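
For readers who want to reproduce one of these values outside MediaWiki, here is a minimal sketch of a dot-encoding. It assumes the legacy scheme is "spaces to underscores, percent-encode, then swap `%` for `.`"; `legacy_anchor` is a hypothetical name, not a MediaWiki function, and Python's `quote` may differ from PHP's encoder on a few characters such as `~`. For the test case it yields the same string as the TOC row above.

```python
from urllib.parse import quote


def legacy_anchor(heading: str) -> str:
    """Hypothetical helper: dot-encode a heading into a legacy-style anchor id."""
    # Spaces become underscores before encoding.
    underscored = heading.replace(" ", "_")
    # Percent-encode everything except letters, digits and "_.-~".
    encoded = quote(underscored, safe="")
    # Legacy ids use "." where URLs use "%".
    return encoded.replace("%", ".")


print(legacy_anchor("_ +.3A%3A]]"))  # prints __.2B.3A.253A.5D.5D
```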

See [[User:Amalthea/test10]] for a demonstration

This regularly breaks the link from history/contributions/RC to the section, makes it hard or impossible to duplicate the functionality in tools (NAVPOP just now), and can break normal section links.

I presume this could easily be fixed by having all five places call the same static function, Title::escapeFragmentForURL, without any additional superfluous logic (in particular the stripping of "[[", "[[:", "]]" in Linker.php).

The only thing that will still be necessary is ensuring unique ids in the TOC of course. This can still make those links point to unintended sections, but in a controlled way.
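
A rough sketch of that remaining uniqueness step, assuming duplicates are disambiguated with a numeric suffix (`_2`, `_3`, ...). The function name `unique_ids` and the exact suffix format are assumptions made for illustration, not taken from Parser.php.

```python
from typing import List


def unique_ids(anchors: List[str]) -> List[str]:
    """Hypothetical helper: make already-encoded anchor ids unique, in page order."""
    seen = set()
    result = []
    for anchor in anchors:
        candidate = anchor
        counter = 1
        # Keep bumping the suffix until the id is unused on this page.
        while candidate in seen:
            counter += 1
            candidate = f"{anchor}_{counter}"
        seen.add(candidate)
        result.append(candidate)
    return result


print(unique_ids(["Notes", "Notes", "Notes_2"]))
```

A link built from the heading text alone can only ever produce the unsuffixed id, which is why it may land on an earlier section with the same title.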

See also Bug 17857 and Bug 2831.


Version: unspecified
Severity: normal

Details

Reference
bz18431

Event Timeline

bzimport raised the priority of this task from to Low. Nov 21 2014, 10:34 PM
bzimport set Reference to bz18431.
bzimport added a subscriber: Unknown Object (MLST).

Amalthea.wikimedia wrote:

To clarify, the test case above is contrived, but this is a very common problem. The section links in the history of [[:en:Wikipedia:Requests for page protection]] for example *never* work.

Amalthea.wikimedia wrote:

Similarly, on Commons, where many section headers are generated by templates or messages, for localization, the automatic section links in the page histories also do not work, e.g.
http://commons.wikimedia.org/wiki/User_talk:Gerald_Troy?action=history

There are some other inconsistencies with anchors:

  1. The anchor added in http://de.wikipedia.org/w/index.php?diff=prev&oldid=73068456 contains an invisible character. The autocomment skips it, but the TOC and the anchor of that section keep it, so the autocomment link does not lead to the section.
  2. When you edit a section whose headline has a double space between two words, you are not taken back to the section after saving, because the anchor added to the URL has two underscores and so does not match the anchor of the section/TOC.
  3. bug 22784

Thanks.

At least partially solved in r68272, by correcting the autosummaries.

This should fix at least items 1 and 2 from Umherirrender's list, and I think it also deals with the most important problems of Bug 2831.

conrad.irwin wrote:

As of r68343:

Umherirrender's problems are fixed.

{{anchorencode}} works the same as the redirect from edit-section to article. Both using Parser::guessSectionNameFromWikiText.

The /* autocomments */ that we auto-generate are also the same, except that non-linking [[ and ]] are removed. The /* comments */ that are provided by the user do not have the HTML stripped from them, but all [[ and ]] are removed. I think the best way forward here is to pass user-generated /* comments */ through Parser::stripSectionName on save, and to drop the [[ and ]] removal from Linker::formatAutocomments.

#links are still just wrong: they are urldecoded() before handling, and do not have whitespace normalised (nor do they have HTML or links stripped, but they can't contain those anyway).

I think the way forward is to add the whitespace normalization to Title::escapeFragmentForURL. I am less sure about stopping the urldecode() of the anchor; that could also be done, but it may cause problems for interwikis that aren't wikis.
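
As an illustration of the whitespace normalization being proposed, here is a minimal sketch. It assumes "normalization" means collapsing each run of whitespace to a single space and trimming the ends before spaces are turned into underscores; `normalize_fragment` is a hypothetical name and this is not the logic of Title::escapeFragmentForURL itself.

```python
import re


def normalize_fragment(fragment: str) -> str:
    """Hypothetical helper: collapse whitespace runs, then use underscores."""
    # Collapse any run of whitespace into one space and trim both ends.
    collapsed = re.sub(r"\s+", " ", fragment).strip()
    # Anchors use underscores in place of spaces.
    return collapsed.replace(" ", "_")


print(normalize_fragment("Two  words"))  # prints Two_words
```

With this step applied on both sides, a headline containing a double space and a link to it would produce the same anchor instead of differing by one underscore.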

ayg wrote:

There's no way for anything outside the parser to match Parser.php's behavior here -- it generates the link after parsing, so you'd need to parse, and you can't. It's not obvious how to avoid this -- if the parser generates the ids before preprocessing, it will miss any that come from templates, but if it generates them after, you can have stupid stuff like {{CURRENTTIME}}. You could also have *really* stupid stuff like

"""

{{foo}}

"""

where {{foo}} expands to

"""
text ==

more text

"""

Hard to say what to do in these cases. Ideally we should have the parser and non-parser code agreeing on section ids at least if they don't have any curly braces/parser functions in them, which is the common case.

conrad.irwin wrote:

Re comment 6: this bug isn't about the template expansion (that's bug 5019), just problems with the various encoding functions.

As far as I'm aware the only change that now needs to be made is to stop the urldecoding() of the #-fragment by the parser. Whether that change actually wants to be made, I'm less confident, but I think so. (As %-encoding is not used there, it seems weird to un-%-encode it.)

ayg wrote:

I don't see how "encoding" is logically separate from "stripping wikitext" when all these functions fundamentally operate on wikitext input. But anyway, improvements are good, even if this is really a sub-bug of bug 5019.

Dunno why percent-decoding is done. Probably best to dig around in the history to see if it was added due to an actual bug or just some random decision. Either percent-decode in all these cases or none. I'd think none is better than all here, but who knows what might have come up to cause someone to do the decoding there.
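
To make the "all or none" point concrete, here is a small sketch comparing the two orders of operation on the test string from the description. The dot-encoder `legacy_anchor` is a hypothetical stand-in (spaces to underscores, percent-encode, `%` to `.`), and `unquote_plus` is used on the assumption that it behaves like PHP's urldecode() for this input. The sketch does not try to reproduce the exact Title.php value reported above, only to show that decoding first changes the result.

```python
from urllib.parse import quote, unquote_plus


def legacy_anchor(text: str) -> str:
    """Hypothetical dot-encoder: spaces to "_", percent-encode, "%" to "."."""
    return quote(text.replace(" ", "_"), safe="").replace("%", ".")


raw = "_ +.3A%3A]]"

# Encode the fragment exactly as written.
without_decode = legacy_anchor(raw)

# Percent-decode first ("+" becomes a space, "%3A" becomes ":"), then encode.
with_decode = legacy_anchor(unquote_plus(raw))

print(without_decode)
print(with_decode)
```

Any caller that decodes first will disagree with any caller that does not, whenever the text contains a literal `+` or `%`.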

*** Bug 24412 has been marked as a duplicate of this bug. ***
*** Bug 36333 has been marked as a duplicate of this bug. ***

Change 113943 had a related patch set uploaded by Burthsceh:
fix escaping fragment of Title

https://gerrit.wikimedia.org/r/113943

Ricordisamoa changed the task status from Open to Stalled. May 14 2015, 12:02 AM
Ricordisamoa subscribed.

https://gerrit.wikimedia.org/r/113943 has not been updated since Feb 18, 2014.
Changing status accordingly.

Extensions also may or may not be using any of these, or implementing their own. Echo, for example, doesn't seem to end up with the same ids for its email notification links as are on the page (T138384), and it's hardly the only extension that needs to use anchor links.

This needs to be consolidated somewhere reusable.

This keeps catching me when I'm editing sections of pages such as [[:en:WP:Help desk]] and [[:en:WP:Teahouse]]. If the section title has no special characters it is fine, and the GET URL generated at the end of the edit has a valid anchor, and returns to the section. But if there are any non-alphanumeric characters in the title, the URL does not match a section anchor, and it doesn't return to the right section.

Example: I just edited the Teahouse https://en.wikipedia.org/w/index.php?title=Wikipedia:Teahouse&diff=851820375&oldid=851819214, and the following URL was

 https://en.wikipedia.org/wiki/Wikipedia:Teahouse#Biography%2C_familiy_lines

But the anchor (I believe) is 

<span id="Biography.2C_familiy_lines">

@ColinFine The anchor is Biography,_familiy_lines (see snippet below), and its URL-encoded form is #Biography%2C_familiy_lines. This should be supported by all modern browsers. The other id (with the . escaping) is our old-style (pre-HTML5) id, which is kept so that old links, such as those in old revisions of the wikicode, keep working.

This ticket is about something different (namely that we had 4 generators for these old anchor ids). Not sure if this ticket is still valid actually.

<h2>
  <span id="Biography.2C_familiy_lines"></span>
  <span class="mw-headline" id="Biography,_familiy_lines">Biography, familiy lines</span>
  <span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Wikipedia:Teahouse&amp;action=edit&amp;section=72" title="Edit section: Biography, familiy lines">edit source</a><span class="mw-editsection-bracket">]</span></span>
</h2>

You should probably file a separate ticket with further details on which browser you use etc.
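
The relationship between the two ids in the HTML snippet above and the URL fragment can be sketched as follows. This is an illustration under the assumption that the HTML5-style id only replaces spaces with underscores and that the old-style id is its percent-encoding with `%` swapped for `.`; the variable names are made up for this example.

```python
from urllib.parse import quote

heading = "Biography, familiy lines"

# HTML5-style id: spaces become underscores, the comma is kept.
html5_id = heading.replace(" ", "_")

# What ends up after "#" in the URL: the same id, percent-encoded.
url_fragment = quote(html5_id, safe="")

# Old-style id: same encoding, but with "." in place of "%".
legacy_id = url_fragment.replace("%", ".")

print(html5_id)      # prints Biography,_familiy_lines
print(url_fragment)  # prints Biography%2C_familiy_lines
print(legacy_id)     # prints Biography.2C_familiy_lines
```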

@TheDJ This ticket is definitely still valid because various parts of MediaWiki code still use invalid anchor ids. The introduction of additional unescaped anchors (like Biography,_familiy_lines) into the HTML did not change anything.

Apparently since MediaWiki 1.30.0 there is a $wgFragmentMode setting, but I still don't think that all issues raised in the OP have been updated accordingly (at least for the legacy mode). Could somebody review it again?

Aklapper changed the task status from Stalled to Open. May 19 2020, 12:45 PM
Aklapper removed a project: Patch-For-Review.
Aklapper subscribed.

Patch abandoned, plus tasks should not be "stalled" on a patch awaiting review. Hence reopening.