
Stop adding xml:lang attributes to HTML5 pages
Closed, Resolved (Public)

Description

Author: michael

Description:
Wikipedia and Wiktionary pages now have the HTML5 doctype <!doctype html>, and a root <html> element with only a lang attribute. HTML5 doesn’t require the xml:lang attribute. According to the spec,

“The attribute in no namespace with no prefix and with the literal local name "xml:lang" has no effect on language processing.”

But if you enter, e.g., <span lang="fr">fou</span> into a wiki page, the wikitext parser will add a redundant and vestigial xml:lang attribute.

The parser should stop adding the xml:lang attribute in pages that are HTML5 and not XML.


Version: 1.21.x
Severity: trivial

Details

Reference
bz44609

Related Objects

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 1:30 AM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz44609.
bzimport added a subscriber: Unknown Object (MLST).

michael wrote:

[I wish I could edit my bug, or at least preview. Have you guys heard of this “wiki” thing? Here’s a better-formatted version of my bug report.]

Wikipedia and Wiktionary pages now have the HTML5 doctype <!doctype html>, and a root <html> tag with only a lang attribute. HTML5 doesn’t require the xml:lang attribute. According to the spec, “The attribute in no namespace with no prefix and with the literal local name "xml:lang" has no effect on language processing.”

Source: http://www.w3.org/TR/2011/WD-html5-20110525/elements.html#the-lang-and-xml:lang-attributes

But if you enter, e.g., <span lang="fr">fou</span> into a wiki page, the wikitext parser will add a redundant and vestigial xml:lang attribute.

The parser should stop adding the xml:lang attribute in pages that are HTML5 and not XML.

I agree. Let's keep things nice and tidy.

This is caused by the "output-xhtml" option in includes/tidy.conf. Unfortunately, disabling it seems to break things such as the conversion from <hr> to <hr />, so many pages would no longer be well-formed XML as configured by $wgWellFormedXml.

Note that adding the extra attribute is legal according to HTML5 section 3.2.3.3:

"Authors must not use the lang attribute in the XML namespace on HTML elements in HTML documents. To ease migration to and from XHTML, authors may specify an attribute in no namespace with no prefix and with the literal localname "xml:lang" on HTML elements in HTML documents, but such attributes must only be specified if a lang attribute in no namespace is also specified, and both attributes must have the same value when compared in an ASCII case-insensitive manner."
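The rule quoted above can be written as a small check. This is only an illustrative Python sketch: the function names and the dict-based attribute representation are assumptions made for the example, not anything from MediaWiki or HTML Tidy.

```python
def _ascii_lower(value: str) -> str:
    """Lowercase only the ASCII letters A-Z, leaving other characters alone."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in value)


def xml_lang_is_conforming(attrs: dict[str, str]) -> bool:
    """Check the no-namespace "xml:lang" rule for one HTML element.

    `attrs` maps attribute names to values as written in the document
    (a hypothetical representation chosen for this sketch).
    """
    if "xml:lang" not in attrs:
        return True  # nothing to check
    if "lang" not in attrs:
        return False  # xml:lang is only allowed alongside lang
    # Both values must match when compared ASCII case-insensitively.
    return _ascii_lower(attrs["lang"]) == _ascii_lower(attrs["xml:lang"])
```

Under this reading, an element with lang="cy" and xml:lang="cy" passes the check, while an element carrying only xml:lang does not.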

ran.arigur wrote:

(In reply to comment #3)

Unfortunately, [...] many pages would no longer be well-formed XML [...]

Why is that unfortunate? The pages are HTML, not XHTML (we're serving them as text/html, not as e.g. application/xhtml+xml), so there's no reason they *should* be well-formed XML. (See HTML5 section 1.6, or section 8.) The spec says that in the HTML syntax, the use of '/' on void elements (br, hr, img, etc.) is optional and has no effect. (See HTML5 section 8.1.2.1, clause 6.)

(That's as far as the standard is concerned. Obviously we also care about browser support, but personally I find it impossible to believe that any real-world browser would stumble over '<hr>' in an HTML document.)
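As a rough illustration of the point about void elements, the sketch below feeds the same fragment with and without the trailing slash to Python's standard-library html.parser. That module is a lenient tokenizer rather than a conforming HTML5 parser, and the class and function names here are invented for the example, so treat it as a demonstration and not as proof of browser behaviour.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Record the name of every start tag seen (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing syntax such as <hr />, because the
        # default handle_startendtag() delegates to handle_starttag().
        self.seen.append(tag)


def collect_start_tags(markup: str) -> list:
    parser = StartTagCollector()
    parser.feed(markup)
    parser.close()
    return parser.seen


plain = collect_start_tags("<p>a</p><hr><p>b</p>")
slashed = collect_start_tags("<p>a</p><hr /><p>b</p>")
print(plain == slashed)
```

Both inputs produce the same sequence of start tags, which is consistent with the slash having no effect in the HTML syntax.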

(In reply to comment #1)

[I wish I could edit my bug, or at least preview. Have you guys heard of this
“wiki” thing. Here’s a better-formatted version of my bug report.]

Offtopic: The "you guys" that you want to talk with can be reached here:
https://bugzilla.mozilla.org/show_bug.cgi?id=40896

TheDJ claimed this task.
TheDJ subscribed.

Not sure when this got fixed, but our pages no longer emit xml:lang in HTML5 mode.

He7d3r set Security to None.

Not sure when this got fixed, but our pages no longer emit xml:lang in HTML5 mode.

@TheDJ, there's at least one report at User talk:Redrose64 that this functionality is still enabled, for some reason.

Redrose64 subscribed.

Yes, I tested it recently, and it's happening when the lang= attribute is used on both span and div elements, so I assume that all other elements are also affected.

The relevant section of the HTML5 spec is [[https://www.w3.org/TR/html5/dom.html#the-lang-and-xml:lang-attributes|3.2.5.3 The lang and xml:lang attributes]], in particular the paragraph beginning "Authors must not use the lang attribute in the XML namespace on HTML elements in HTML documents." In this context, "the lang attribute in the XML namespace" means xml:lang=something. Therefore, as MediaWiki serves HTML, we must not add the xml:lang= attribute to any element that already bears the lang= attribute.

@Redrose64 where are you still seeing this, though? Maybe it's a template generating it, instead of core?

It's not a template. Have a look at https://en.wikipedia.org/w/index.php?title=Wikipedia:Sandbox&oldid=801426457 where the wikisource contains the line

*<span lang=cy>Wicipedia Cymraeg</span>

whereas the page as served contains the HTML

<ul>
<li><span lang="cy" xml:lang="cy">Wicipedia Cymraeg</span></li>
</ul>

Notice the presence of the attribute xml:lang="cy" which I did not add.
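For anyone who wants to repeat this check on other pages, here is a minimal Python sketch that scans served HTML for elements carrying xml:lang. It uses only the standard-library html.parser; the class and function names are made up for this example, and fetching the page is left out.

```python
from html.parser import HTMLParser


class XmlLangFinder(HTMLParser):
    """Collect (tag, value) pairs for every xml:lang attribute found."""

    def __init__(self):
        super().__init__()
        self.hits = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases attribute names before calling this method.
        for name, value in attrs:
            if name == "xml:lang":
                self.hits.append((tag, value))


def find_xml_lang(markup: str) -> list:
    finder = XmlLangFinder()
    finder.feed(markup)
    finder.close()
    return finder.hits


served = '<ul>\n<li><span lang="cy" xml:lang="cy">Wicipedia Cymraeg</span></li>\n</ul>'
print(find_xml_lang(served))
```

Run against the served HTML above, it reports the span with the value cy; run against the original wikitext line, it reports nothing.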

TheDJ removed TheDJ as the assignee of this task. Nov 8 2017, 1:06 PM

Unassigning, as I do not have plans to work on this right now.

Izno claimed this task.

I'm going to re-close this as resolved because of the switch to Remex. Tested on my user page. Others are welcome to confirm for themselves.