Page MenuHomePhabricator

Parser inserts invalid   in the middle of style attribute (French spaces)
Closed, ResolvedPublic

Description

Author: Marc Meurrens

Description:
The bug can be easely reproduced.

Just type in an article something like :
one: un - two : deux ; three : trois ! (1)
(mind the spaces!!!)

Look at the HTML source code generated : you'll see something like :
one: un - two : deux ; three  : trois ! (2)
which, in general, is fine.
Observe that spaces before punctuation marks ( : ; ! )
have been replaced by the htmlEntity.

What's the problem with that?

If you want to make sure an image will not overlap from one section to another,
you'll probably use the syntax :
<br style="clear:both;" /> (3)
Everything works fine...

But if you write it as :
<br style=" clear : both ; " /> (4)
which is perfectly legal (and much readable, specially if you have a large style
statement)
you'll unfortunately generate :
<br style=" clear &nbsp;: both &nbsp;; " /> (5)
and, of course, the style specification is invalid
and will be ignored.

The workaround is evident : use format (3)
...but it remains an annoying problem
for those who ignore the bug.


Version: unspecified
Severity: minor
URL: http://test.wikipedia.org/wiki/Nbsp_in_style
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=11874

Event Timeline

bzimport raised the priority of this task from to Low.Nov 21 2014, 8:46 PM
bzimport set Reference to bz3158.
bzimport added a subscriber: Unknown Object (MLST).

Rephrasing the summary (and updating version and severity) – this is a general problem that non-breaking spaces inserted because of French typographic rules are inserted even into XHTML attributes, including style. There have been a similar bug report before (see bug #11874), but it has been “solved” by hardcoding that specific case of “!important”, not taking anything else into account.

Note that the same thing happens if the semicolon is inserted using a parser function, see http://test.wikipedia.org/wiki/Nbsp_in_style#Broken_because_of_parser_function (and also bug #12974).

See also bug #12752 for a more general objection to this feature.

rockmfr wrote:

See http://en.wikipedia.org/wiki/User:RockMFR/style-nbsp-bug for a variant of this that occurs when using spaces before template parameters.

  • Bug 19290 has been marked as a duplicate of this bug. ***

Worked like this for a while and the new parser is in the wings.

  • Bug 67092 has been marked as a duplicate of this bug. ***

(Cite Krinkle from 67092 comment #0)

When the parser strips a /* comment */ from a style attribute it inserts a
&nbsp; in its place. This causes the stylesheet to be invalidated by the
browser and the relevant styles are not applied when the page is renders.

Wikitext input:

<blockquote style="border: 1px solid #aaa /* foo */;"></blockquote>

Expected output:

<blockquote style="border: 1px solid #aaa ;"></blockquote>
or
<blockquote style="border: 1px solid #aaa;"></blockquote>

Actual output:

<blockquote style="border: 1px solid #aaa &#160;;"></blockquote>

A nbsp; is illegal in css in that position and results in a parse error by
the browser, causing the 'border' rule in this case to not be applied.

Change 142042 had a related patch set uploaded by Krinkle:
[WIP] Parser: Don't insert &nbsp; inside style attributes

https://gerrit.wikimedia.org/r/142042

It looks like CSS is just what makes this bug visible, rather than anything to do with its cause. At includes/parser/Parser.php:410-419:

  1. Clean up special characters, only run once, next-to-last before doBlockLevels
		$fixtags = array(
			# french spaces, last one Guillemet-left
			# only if there is something before the space
			'/(.) (?=\\?|:|;|!|%|\\302\\273)/' => '\\1&#160;',
			# french spaces, Guillemet-right
			'/(\\302\\253) /' => '\\1&#160;',
			'/&#160;(!\s*important)/' => ' \\1', # Beware of CSS magic word !important, bug #11874.
		);
		$text = preg_replace( array_keys( $fixtags ), array_values( $fixtags ), $text );

This is doing a very aggressive replacement of spaces with &#160; throughout the entire content of the page, and it's apparently already caused at least one other bug. I'm not sure why this is doing this, but our CSS handling is totally innocent and is just a red herring.

Krinkle removed a subscriber: Unknown Object (MLST).
Krinkle removed a project: Patch-For-Review.
Krinkle unsubscribed.
matmarex renamed this task from Parser inserts invalid &nbsp; in the middle of style attribute to Parser inserts invalid &nbsp; in the middle of style attribute (French spaces).May 14 2015, 5:30 PM

Change 393256 had a related patch set uploaded (by Fomafix; owner: Fomafix):
[mediawiki/core@master] Armor against French spaces detection in HTML attributes

https://gerrit.wikimedia.org/r/393256

Change 441410 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@master] WIP: Don't armor french spaces before punctuation followed by word characters

https://gerrit.wikimedia.org/r/441410

Change 393256 merged by jenkins-bot:
[mediawiki/core@master] Armor against French spaces detection in HTML attributes

https://gerrit.wikimedia.org/r/393256

I think this is now completely resolved with https://gerrit.wikimedia.org/r/393256. With this change, "problematic" spaces in HTML attributes are now replaced with &#32; (the entity for a regular space character), so that the code which would incorrectly replace them with &#160; (the entity for non-breaking space) will not process them.

The other patch (https://gerrit.wikimedia.org/r/441410) changes the logic for this replacement slightly to avoid it in some cases where it is unnecessary, but the underlying issue is already fixed in the general way.