User-specified HTML IDs can be the same as interface IDs
Open, Low, Public

Description

If any header or subheader is written as == content ==, Firefox 1.5.0.7 draws a partially complete dashed box next to it.

Repro:
create a page with the following text:

==content==

Preview or save, and observe the result.

http://en.wikipedia.org/wiki/User:Simetrical/7356

Details

Reference
bz7356

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 9:24 PM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz7356.
bzimport added a subscriber: Unknown Object (MLST).

ayg wrote:

I don't see anything. Does it happen if you log out? Does it happen at the URL
I just added to this bug?

That's because you capitalized the word "Content". It must be all lower case.

dto wrote:

The heading generates an anchor with name=id=content, which collides with the
id=content div. :(

ayg wrote:

Ouch. That's nasty. The only solution I can see would be to move all header
id's to stuff like #h-content instead of #content. (You could also special-case
the few bad id's, but that will a) lead to confusion and b) be hard to maintain.)

dto wrote:

*** Bug 7662 has been marked as a duplicate of this bug. ***

ayg wrote:

(In reply to comment #4)

> Ouch. That's nasty. The only solution I can see would be to move all header
> id's to stuff like #h-content instead of #content. (You could also special-case
> the few bad id's, but that will a) lead to confusion and b) be hard to maintain.)

Better solution: prefix all interface id's with "mw-" and then ban that from
non-interface id's. Should be pretty simple to fix, although it will
unfortunately be slightly disruptive.

david.sledge wrote:

Even if the aforementioned solutions are applied, someone could just as easily edit/create a page with the following:

==content==

<span id="content">text</span>

and the same problem would exist. Also, if you don't allow user-supplied ids/anchor names (or derived ids/anchor names from user-supplied content) to have the prefix "mw-", how would you deal with the following:

==mw-content==

Let's not forget templates. If a page includes a template, it's possible that both pages use the same id/anchor name, even though within each page individually, the ids/anchor names are unique. And I've found a similar problem with extensions that generate their own ids/anchor names like Cite. (see bug #11625)

One thing I've noticed is that if a tag is created with an ID that has characters not allowed, the parser is smart enough to single out the id and swap out the invalid characters with valid ones.

What if the parser kept a running list of all the ids and anchor names already in use? When it replaces the invalid id/anchor name characters, it can check against the list to make sure the id/anchor name in question is not already in use. Duplicates would be resolved the same way headers with the same text are resolved.

The only issue I can see at the moment is when extensions create links to destination anchors yet to be rendered. Let's take Cite for example. Given the following:

I like cheese<ref>It's true!</ref>.

...

<references/>

when the "ref" tag gets rendered, a link must be created to a destination anchor that doesn't yet exist, so two things have to happen: (a) an id/anchor name must be created on the spot, so it can be linked to the footnote (even though the footnote itself has not been created yet), and (b) all other destination anchors must be prevented from using the generated id/anchor name, without preventing the "references" tag from using it, too.
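The running list and the two Cite requirements described above can be sketched roughly as follows. This is only an illustration, not MediaWiki code: the class `AnchorRegistry`, its methods, and the sample id `cite_note-1` are hypothetical names chosen for the example, and the `_2` suffix scheme is assumed to mirror how duplicate headers are resolved.

```python
class AnchorRegistry:
    """Hypothetical running list of ids/anchor names used on one page."""

    def __init__(self):
        self._used = set()      # ids already emitted
        self._reserved = {}     # id -> owner that may emit it later

    def allocate(self, candidate):
        """Return candidate, or candidate_N for the first free N >= 2."""
        new_id, n = candidate, 1
        while new_id in self._used or new_id in self._reserved:
            n += 1
            new_id = f"{candidate}_{n}"
        self._used.add(new_id)
        return new_id

    def reserve(self, candidate, owner):
        """(a) Create an id on the spot for an anchor rendered later."""
        new_id = self.allocate(candidate)
        self._used.discard(new_id)
        self._reserved[new_id] = owner
        return new_id

    def claim(self, reserved_id, owner):
        """(b) Only the reserving owner may finally emit the id."""
        if self._reserved.get(reserved_id) != owner:
            raise KeyError(f"{reserved_id!r} is not reserved for {owner!r}")
        del self._reserved[reserved_id]
        self._used.add(reserved_id)
        return reserved_id


registry = AnchorRegistry()
note_id = registry.reserve("cite_note-1", owner="cite")   # <ref> is rendered
clash = registry.allocate("cite_note-1")                  # user content
final = registry.claim(note_id, owner="cite")             # <references/>
print(note_id, clash, final)
```

Here the user-supplied anchor is the one that gets renamed, while the extension keeps the id it linked to earlier.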

*** Bug 11625 has been marked as a duplicate of this bug. ***

ayg wrote:

(In reply to comment #7)

> What if the parser kept a running list of all the ids and anchor names already
> in use? When it replaces the invalid id/anchor name characters, it can check
> against the list to make sure the id/anchor name in question is not already in
> use. Duplicates would be resolved the same way headers with the same text are
> resolved.

Something broadly like that is, of course, the only way to fix this bug. To begin with, though, much of the interface isn't run through the Sanitizer, so we'd have to manually (!) keep track of every single one of the hundreds of id's used in the software, which tend not to follow any rhyme or reason. It's still doable, certainly.

david.sledge wrote:

Sounds like it might be a tedious task, but not necessarily a difficult one. The worst-case scenario is that all the IDs and anchor names outside the actual article body are hard-coded into the list. A better option is to have the surrounding HTML completely assembled before the article body is, and pass it into a method that extracts every id and anchor name and adds it to the list.
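A method that extracts every id and anchor name from already-assembled HTML could look something like the sketch below. It uses Python's standard `html.parser` purely for illustration; the class name `IdCollector`, the helper `collect_ids`, and the sample markup are assumptions, not anything taken from MediaWiki.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collects id attributes and <a name="..."> anchors from HTML."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            # id on any element, or the older name attribute on <a>
            if name == "id" or (name == "name" and tag == "a"):
                self.ids.add(value)


def collect_ids(html):
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return collector.ids


# Hypothetical skin fragment, assembled before the article body.
SKIN_HTML = (
    '<div id="content"><a name="top"></a>'
    '<div id="bodyContent"></div></div>'
    '<div id="footer"></div>'
)
print(sorted(collect_ids(SKIN_HTML)))
```

The resulting set could then seed the list that user-supplied ids are checked against.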

ayg wrote:

Patches are appreciated.

david.sledge wrote:

*** Bug 13926 has been marked as a duplicate of this bug. ***

*** Bug 17650 has been marked as a duplicate of this bug. ***

*** Bug 21440 has been marked as a duplicate of this bug. ***

*** Bug 21856 has been marked as a duplicate of this bug. ***

Because a heading can start with a non-ASCII letter, an invalid id can be created that starts with a period.
According to the XHTML 1.0 specification, an id has to start with [A-Za-z]. Digits and some other characters (e.g. the period) are only allowed in subsequent positions.

Überschrift

creates
<span class="mw-headline" id=".C3.9Cberschrift">Überschrift</span>

So a prefix to the id should solve this problem because mw-.C3.9Cberschrift would be a valid id.
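The encoding shown above can be approximated in a few lines. This is a sketch that assumes the id is produced by percent-encoding the UTF-8 heading text and swapping "%" for "."; it reproduces the example in this comment but is not claimed to match the real Sanitizer in every case. The function names are made up for the illustration.

```python
import re
from urllib.parse import quote


def legacy_anchor_id(heading):
    """Approximate the dot-encoded id (assumed scheme, see lead-in)."""
    encoded = quote(heading.strip().replace(" ", "_"), safe=":")
    return encoded.replace("%", ".")


def starts_like_xhtml1_id(value):
    """XHTML 1.0 requires the first character to be in [A-Za-z]."""
    return re.match(r"[A-Za-z]", value) is not None


raw = legacy_anchor_id("Überschrift")
print(raw, starts_like_xhtml1_id(raw))
print("mw-" + raw, starts_like_xhtml1_id("mw-" + raw))
```

Only the first-character rule is checked here; the prefixed form passes it because it begins with the letter "m".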

ayg wrote:

MediaWiki no longer outputs XHTML1 by default, but HTML5. id's in HTML5 can be any nonempty string that doesn't contain whitespace:

http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#the-id-attribute
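For comparison, the HTML5 rule quoted above is much looser. A minimal check, assuming "whitespace" means the ASCII whitespace characters (space, tab, line feed, form feed, carriage return), might read:

```python
ASCII_WHITESPACE = " \t\n\f\r"


def is_valid_html5_id(value):
    """Nonempty and free of ASCII whitespace, per the rule quoted above."""
    return len(value) > 0 and not any(ch in ASCII_WHITESPACE for ch in value)


for candidate in [".C3.9Cberschrift", "Überschrift", "", "two words"]:
    print(repr(candidate), is_valid_html5_id(candidate))
```

The function name is illustrative only.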

(In reply to comment #17)

> MediaWiki no longer outputs XHTML1 by default, but HTML5. id's in HTML5 can be
> any nonempty string that doesn't contain whitespace:
>
> http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#the-id-attribute

But it still can (and on WMF wikis does) output XHTML1, so the solution must account for that DTD.

*** Bug 22587 has been marked as a duplicate of this bug. ***
*** Bug 24285 has been marked as a duplicate of this bug. ***

theevilipaddress wrote:

Can't we do it here the way we do it with duplicate sections? For example,

Heading

bla bla...

Heading

bla bla...

becomes

id="Heading"
bla bla bla...
id="Heading_2"
bla bla bla...

In this case,

content

should simply become id="content_2".

ayg wrote:

Basically, yes. What we have to do is make a list of all the id's used by the software and blacklist them for section titles and other user-provided id's. This is feasible to maintain if we adopt a strict policy of prefixing all software-generated id's with "mw-", which we often do already, although we're not very strict about it. Then we can just blacklist the "mw-" prefix, in addition to a hopefully-not-expanding list of legacy unprefixed id's.

We can't feasibly check the list of interface id's used on the current page on the fly, while parsing. This works for things the parser generates, but parser output can't depend on UI output. The same cached parser output is stuck into a variety of skins, plus no skin at all (action=raw, API output, etc.). So we need to get a list of all id's used anywhere in the software and ban them in all pages.

Both sound necessary (the "mw-" interface prefix, and upcounting in the headings).

By upcounting I mean what The Evil IP address mentioned above: "mw-content" would be treated like a duplicate heading.

So that the following

something

something

content

mw-content

would become

id="something"
id="something_2"
id="content_2"
id="mw-content_2"
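The upcounting described above can be sketched as follows. The sketch assumes a static list of legacy interface ids (only `content` is needed for this example; the other entries are illustrative) and treats anything starting with "mw-" as if it were already taken. The names `LEGACY_INTERFACE_IDS` and `assign_heading_ids` are hypothetical.

```python
# Assumed legacy interface ids; a real list would be much longer.
LEGACY_INTERFACE_IDS = {"content", "footer", "top", "toc"}
RESERVED_PREFIX = "mw-"


def assign_heading_ids(headings):
    """Give each heading an id, upcounting duplicates and reserved names."""
    used = set(LEGACY_INTERFACE_IDS)
    result = []
    for text in headings:
        base = text.strip().replace(" ", "_")
        candidate, n = base, 1
        # A reserved prefix counts as "already used" for the bare name.
        while candidate in used or (n == 1 and base.startswith(RESERVED_PREFIX)):
            n += 1
            candidate = f"{base}_{n}"
        used.add(candidate)
        result.append(candidate)
    return result


print(assign_heading_ids(["something", "something", "content", "mw-content"]))
# ['something', 'something_2', 'content_2', 'mw-content_2']
```

Because the list is static, the same output is produced regardless of which skin later wraps the parsed page.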

*** Bug 29049 has been marked as a duplicate of this bug. ***

We also have the problem that with section editing, we get ids in previews which differ from the ids in the full page. That is at least bewildering, and at worst may lead to wrong ids being copied and used elsewhere.

Editing a page closer to the beginning may lead to ids further down being renumbered. References to ids from elsewhere, e.g. via links having a fragment identifier, should ideally not break in such cases.

In bug 29049, it has been suggested that editors be warned when a page is saved with duplicate id values, and that duplicates simply be accepted on a second save, as is done for empty "Summary" fields. A toggle in Special:Preferences, similar to the one for the handling of empty "Summary" fields, might even be considered for the id= value checking.

A warning on Save does not seem like the right approach. The ID problem is an internal, technical shortcoming of MediaWiki. Exposing this to non-technical editors would just be confusing to them.

*** Bug 29480 has been marked as a duplicate of this bug. ***

(In reply to comment #18)

> (In reply to comment #17)
>
> > MediaWiki no longer outputs XHTML1 by default, but HTML5. id's in HTML5 can be
> > any nonempty string that doesn't contain whitespace:
> >
> > http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#the-id-attribute
>
> But it still can (and on WMF wikis does) output XHTML1, so the solution must
> account for that DTD.

WMF only uses XHTML because of some bots and scripts that haven't been updated yet. Eventually WMF WILL be using HTML5. And as this is purely a validation matter (browsers are not going to care if you use an XHTML doctype but actually follow HTML5's rules), we don't care about XHTML rules.

This is coming up again in the context of T252467, which recommends that editors make the following changes:

MediaWiki Gadget .js
- $('div#p-personal').hide();
+ $('#p-personal').hide();

- $('div#p-navigation').on('click', … );
+ $('#p-navigation').on('click', … );

MediaWiki Gadget-minimal.css
- div#footer,
+ #footer {
  display: none;
 }

User vector.css
- div#p-personal,
+ #p-personal {
  outline: 1px solid purple;
 }

This of course causes random discussions to become inaccessible or act in surprising ways, due to matching the section heading (or TOC anchor) for == footer == and such.
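To make the clash concrete, here is a small model of CSS-style matching for `div#footer` versus `#footer`. It is only a sketch: the page fragment is invented (a heading span carrying the id, as in the `mw-headline` example earlier in this task, plus a skin `div`), and `match` is a hypothetical helper, not a real selector engine.

```python
from html.parser import HTMLParser


class IdSelectorMatcher(HTMLParser):
    """Records tags matching '#id' or, if a tag is given, 'tag#id'."""

    def __init__(self, wanted_id, wanted_tag=None):
        super().__init__()
        self.wanted_id = wanted_id
        self.wanted_tag = wanted_tag
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") != self.wanted_id:
            return
        if self.wanted_tag is None or tag == self.wanted_tag:
            self.matches.append(tag)


def match(html, wanted_id, wanted_tag=None):
    matcher = IdSelectorMatcher(wanted_id, wanted_tag)
    matcher.feed(html)
    return matcher.matches


# Invented page: a "== footer ==" heading followed by the skin's footer.
PAGE = (
    '<h2><span class="mw-headline" id="footer">footer</span></h2>'
    '<p>Discussion text</p>'
    '<div id="footer">Site footer</div>'
)
print(match(PAGE, "footer", "div"))   # ['div']
print(match(PAGE, "footer"))          # ['span', 'div']
```

With the element qualifier the rule only reaches the skin's element; without it, the user's heading is matched as well.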

I think it's high time that the Parser/Linker maintain a list of interface-reserved prefixes (like n-, p-, and mw-), as well as a (short) list of legacy IDs (such as footer), that are automatically mapped to a different name to avoid clashes with interface styles.

For example, by prepending it with h- for heading, or something like that. For compatibility this would of course be limited only to where it is causing potential conflicts. Doing this for the other 99.9% of headings is out of scope for this task.
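A minimal sketch of that mapping, assuming the prefixes named above, a legacy list containing only the ids mentioned in this task, and "h-" as the replacement prefix. All names in the code are hypothetical.

```python
RESERVED_PREFIXES = ("n-", "p-", "mw-")   # prefixes named in this comment
LEGACY_IDS = {"footer", "content"}        # assumed short legacy list
HEADING_PREFIX = "h-"


def map_heading_id(raw_id):
    """Prefix only ids that would clash; leave all others untouched."""
    if raw_id in LEGACY_IDS or raw_id.startswith(RESERVED_PREFIXES):
        return HEADING_PREFIX + raw_id
    return raw_id


for heading_id in ["footer", "p-personal", "mw-content", "History"]:
    print(heading_id, "->", map_heading_id(heading_id))
```

A heading literally named "h-footer" would still map to the same id as a remapped "footer", so the existing duplicate-id handling would presumably need to run after this step.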

We do need user-defined so-called anchors, and there are millions of occurrences of them across the wiki projects.

Some use cases:

  • A long headline.
    • A section headline may contain words, punctuation, abbreviations, typography, etc.
    • As soon as someone changes the wording or punctuation, or introduces or expands abbreviations, all external links pointing to this section break.
    • Article authors introduce mnemonic section identifiers, e.g. id="LongHeadlineAbout".
    • Such an identifier is robust against rewording of the visible headline. It is also short and readable, and will be used within the same page as well as from outside.
    • When an important headline changes, smart authors keep the old headline text via id= and so maintain internal and external links.
  • Positions in page without headline.
    • Many, many cases; a particular table, a quotation, a kind of explaining footnote other than <ref>, whatever.
    • To be linked within the page and from other pages.

Therefore, free ID specification needs to be maintained.

  • External users need to link to page positions by self-explaining constant fragments.

The major problem for decades now is that wiki software

  • does not consistently use a prefix like mw-, which page authors could easily respect:
    • class="error"
    • id="toc"
    • id="top"
    • etc.
  • There is no documentation of all non-prefixed MediaWiki selectors to be respected.
  • It is the interface (the developers) that has to keep to a naming scheme, such as prefixing, to ensure that no clashes with text authors occur.
  • Everything outside a few prefixes is the responsibility of text authors, who must ensure unique definitions within a page.
  • It is hard to understand how extension developers avoid conflicts with core IDs or with other extensions' IDs without a global registry of MediaWiki reserved selectors.
    • Even when all are prefixed by mw-, the string that follows might be used in two extensions without either knowing of the other.
  • In fact, everything belonging to MediaWiki could be preceded by mw-, and then no conflict with page content or site configuration could occur.
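The kind of global registry asked for here might be as simple as a map from selector name to owning component. The sketch below is an illustration only; `SelectorRegistry`, the selector `mw-example-box`, and the extension names are all invented.

```python
class SelectorRegistry:
    """Hypothetical registry: which component owns which mw- selector."""

    def __init__(self):
        self._owners = {}

    def register(self, name, owner):
        if not name.startswith("mw-"):
            raise ValueError(f"{name!r} lacks the reserved mw- prefix")
        existing = self._owners.setdefault(name, owner)
        if existing != owner:
            raise ValueError(f"{name!r} already registered by {existing!r}")

    def owner_of(self, name):
        return self._owners.get(name)


registry = SelectorRegistry()
registry.register("mw-example-box", "ExtensionA")
try:
    registry.register("mw-example-box", "ExtensionB")
except ValueError as err:
    print("conflict:", err)
```

Two extensions picking the same suffix after mw- would then be caught at registration time rather than on a rendered page.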

Gadgets (see T117540), local TemplateStyles, “global modules” do need collision-free class names.

  • See #registry of German Wikipedia for TemplateStyles. We encourage using German words as identifiers, since we do not expect German in global selectors.

> I think it's high time that the Parser/Linker maintain a list of interface-reserved prefixes (like n-, p-, and mw-), as well as a (short) list of legacy IDs (such as footer), that are automatically mapped to a different name to avoid clashes with interface styles.
>
> For example, by prepending it with h- for heading, or something like that. For compatibility this would of course be limited only to where it is causing potential conflicts. Doing this for the other 99.9% of headings is out of scope for this task.

Both Parsoid and the Parser/Sanitizer already do this. I think the issue is that it's not complete enough (yet), and we have legacy IDs that don't conform to the modern guidance. Historically I added some of this code to support OOUI infusion, specifically to avoid the case where the javascript for OOUI widgets could get confused by user IDs and classes (which could be a security issue in the limit case).

The sanitizer reserves all attributes whose names start with data-(ooui|mw|parsoid). Parsoid actually does this 'reversibly', in that it renames any attributes from user content with conflicting names to avoid the conflict, and undoes that translation on html2wt. In addition, id attributes are uniquified, so that ids in user content which conflict -- usually with other ids defined in user content, but this could be generalized -- are renamed. Finally, all Parsoid-generated class and typeof attributes are prefixed by mw- or mw: respectively. I believe Parsoid also renames values from user-generated content to avoid conflicts, in a similar way to how data-ooui is handled.
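A reversible rename of this kind can be sketched as below. The escape marker "x-" and both function names are assumptions made for the illustration; the point is only that the mapping has an exact inverse, so user attributes survive a round trip.

```python
import re

# Reserved names per the comment above; "(x-)*" also catches names that
# have already been escaped, which keeps the mapping invertible.
RESERVED = re.compile(r"^data-(x-)*(ooui|mw|parsoid)")


def escape_user_attribute(name):
    """Rename a conflicting user attribute (assumed 'x-' marker)."""
    if RESERVED.match(name):
        return "data-x-" + name[len("data-"):]
    return name


def unescape_user_attribute(name):
    """Undo escape_user_attribute when converting back to wikitext."""
    if name.startswith("data-x-") and RESERVED.match(name):
        return "data-" + name[len("data-x-"):]
    return name


for attr in ["data-mw", "data-x-mw", "data-parsoid-foo", "data-foo", "title"]:
    escaped = escape_user_attribute(attr)
    print(attr, "->", escaped, "->", unescape_user_attribute(escaped))
```

An attribute the user deliberately wrote in the escaped form is escaped once more, so it is restored unchanged on the way back.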

I'd like to deprecate and eventually remove the legacy IDs and class names that do not have an appropriate prefix, like content and footer. Then Parsoid's uniform system of renaming attributes and uniquifying IDs will suffice. Users should be free to generate any IDs and class values in their code that they please, so long as they don't start with the reserved mw prefix.

> For example, by prepending it with h- for heading,

That is no help at all.

Imagine an article with two sections:

  • == This ==
  • == That ==

Now both sections need to be merged:

  • == This and that ==

Good authors do maintain links within the same page, from other wikipages and the outer wwworld by:

  • == {{ Anchor | This | That }} This and that ==

How does it help to prefix those fragment IDs now? And how would billions of URLs within wikis and outside be fixed?

Same goes for changing text:

  • == {{Anchor|This}} These ==
  • == {{Anchor|That}} Those ==

Sanitizer might swallow any id=mw- in user text, but there are use cases where it is legal to utilize such fragments. At least it is legal to link to them, but may be forbidden to define mw-.

  • I might want to offer a jump to top of page: [[#top|Go Up!]]
  • See that tool in [[#p-tb|toolbar]] at left hand side of your desktop; or mobile or timeless.
  • Go to TOC.

The one and only task is to avoid clashes between selectors (both class or id) which are used by MediaWiki interface (not content) and those defined by text authors or TemplateStyles or site configuration or extension developers or gadget programming.

The rule which IDs are forbidden to be defined needs to be as easy as possible. For just mw- I can teach authors, but no more prefixes.

What about those:

This task is about creation of HTML IDs. It is not about CSS classes, and not about anchor links. You can link to anything just fine, same as before. It may in fact be rather intentional to link to something in the interface, for example.

From what I can tell, you have not shown any fundamental problem that would happen if we were to prefix heading IDs or even all IDs inside content.

There is indeed a need to support and migrate existing content, for which many options can be considered. That will be thought about once planning work starts on the migration.

A change in strategy is needed. Apparently that has been started some years ago already.

  • id= should not be used for anything else than to enable the reader to jump to a certain point in document.
  • Elements to be decorated or manipulated shall be accessed by mw- classes only.
  • If necessary, the mw- class might have one member only. That is the replacement for elements formerly addressed by id=.
  • If this is implemented consistently, no conflict between headlines and MW system activities can occur any longer.

Migration to this paradigm is under way, as can be seen here:

<div class="mw-body" id="content">
<div class="mw-body-content" id="bodyContent">
<div id="mw-content-text">
<footer class="mw-footer" id="footer">

The #content element which initiated this task in 2006 is no longer confused with any headline now that CSS and other code select .mw-body instead.

  • However, not all relevant elements were equipped with a class yet.

Conclusions

  • id= must not be used for anything else than to enable the reader to jump to a certain point in document.
  • Elements to be decorated or manipulated shall be accessed by mw- classes only; to be removed or modified or as position to insert any other element. A class with one member only is the replacement of former id=mw-.
  • No TemplateStyles nor gadgets nor 3rd party PHP extensions are permitted to introduce a class name beginning with mw-.
  • User defined elements may utilize mw- classes in the same way as they are supposed to be used, e.g. for inserted elements.
  • All MW core or consolidated extensions shall migrate to access MW elements via class selectors only. This goes for content modification as well as CSS decoration.
  • Within content and user written wikitext clashes by multiple defined elements with the same id= might occur, inside content specification itself or with declarations provided by MediaWiki. This will make the document invalid in sense of HTML, but it does no harm. Jumping to a fragment will lead to the first occurrence. Gadgets and CSS decoration and MediaWiki software are not affected since none is using any id= for their purposes.
  • A central registry for all mw- classes needs to be built, telling at least which class name is born in which programming unit.
  • TemplateStyles, gadgets and any other 3rd party software may establish any class they like, with exception of mw- prefix. They are responsible to find idiosyncratic names probably never colliding.
  • A dictionary of traditional class names and id= which shall be supported in future will be published by MediaWiki. This will provide widely used legacy selectors, like #top #toc .error. See e.g. deWP for some frequently used selectors mostly not beginning with mw-.
  • A migration guide for developers of templates and gadgets etc. as well as core developers needs to be published.
  • Class names within MW core or maintained extensions not yet preceded by mw- will be equipped with an additional mw- class. Later class names which are not utilized in the field may be dropped, after first round of migration.
  • The TOC algorithm might learn a dozen popular identifiers, like top, toc, etc. These shall be regarded as “existing headlines”. When the algorithm encounters an already known “headline”, the generated id is incremented, e.g. to top_2, just as individual IDs are generated for repeated headline text. Just try ==top== at the end of a page and see how the MW TOC works nowadays.

> This task is about creation of HTML IDs. It is not about CSS classes, and not about anchor links. You can link to anything just fine, same as before. It may in fact be rather intentional to link to something in the interface, for example.
>
> From what I can tell, you have not shown any fundamental problem that would happen if we were to prefix heading IDs or even all IDs inside content.

As I demonstrated above, classes are the crucial key to solve the problem.
It does not help to break billions of links to sections by introducing newly generated fragment identifiers. It requires only a few modifications to MediaWiki so that MW software stops using any id=. Then the task scope has been resolved: User-specified HTML IDs can be the same as interface IDs – in the end there is no interface ID any longer, or if one is still provided, it is never used to identify any element for decoration or modification.

Addendum

BTW, I have created two applications dealing with conflicting id= specifications:

  1. #L-17 in Template:Anchor support which refuses to create MW fragments without explicit permission.
  2. fragmentAnchors@PerfektesChaos

Both are addressing name conflicts with identifiers.

Preceding all headline identifiers with h- won’t help at all.

  • They break billions of URLs inside and outside wiki pages linking to a particular section.

Even once the prefix is introduced, editor-specified anchors remain necessary to keep links to a section working after its headline text has been changed:
Previously: == This ==

  • Renamed to == {{anchor|h-This}} That ==
  • It is not possible to sanitize ID specification by page editors.
  • h-integral
  • mw-plusminus @ ResourceLoader/Migration guide (users).
    • Imagine headline text has been changed to Classes for numerical prefix signs.
    • Then it is necessary to introduce a id="mw-..." to support existing links to this section: == {{anchor|mw-plusminus}} Classes for numerical prefix signs ==

@Krinkle I'm not sure why you think the solution is to prefix IDs in *user* content, rather than to reserve a prefix for *system* content. After all, (a) links to IDs in user content are already all over the web and in archives, it's somewhat rude to change them even if we provide compatibility anchors for a short time (and if we keep them forever, we haven't actually solved the problem), and (b) URLs to wiki content are intended to be "human readable", which suggests that "machine" content like "mw-" should be omitted if possible -- adding a prefix to user IDs also doesn't play nicely with our i18n (consider RTL anchors).

I agree this task should be focused on IDs. I think the solution is just to migrate our *skins* and other gadgets to the uniform use of a prefix (mw- seems fine), and then in the existing code to make user-generated IDs unique we would also adjust any user-generated prefixes which happened to start with the same prefix. (For example, we will reserve mw-user- for that purpose). As you suggest, we can add compatibility anchors in our skins during a transition, but these can be eventually removed using the usual deprecation process; for example, by the next LTS.
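One possible reading of that adjustment, sketched under the assumption that the skins have already moved to mw- ids and that user ids beginning with "mw-" are shifted into the reserved "mw-user-" space. The function name and constants are hypothetical.

```python
SYSTEM_PREFIX = "mw-"
USER_ESCAPE_PREFIX = "mw-user-"   # reserved for adjusted user ids (assumed)


def adjust_user_id(raw_id):
    """Move user ids out of the system namespace; leave others alone."""
    if raw_id.startswith(SYSTEM_PREFIX):
        return USER_ESCAPE_PREFIX + raw_id[len(SYSTEM_PREFIX):]
    return raw_id


for user_id in ["content", "mw-content", "mw-user-note", "Überschrift"]:
    print(user_id, "->", adjust_user_id(user_id))
```

Ordinary ids such as "content" stay human readable and unchanged, which is only safe once the skins no longer use the unprefixed legacy ids. Ids already written with the escape prefix are shifted again, so two different user ids never map to the same result.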