Page MenuHomePhabricator

quoted and unquoted attributes are not handled the same in Sanitizer.php
Closed, ResolvedPublic

Description

change the attribs regex to handle the quoted and unquoted attributes same

The regex MW_ATTRIBS_REGEX in Sanitizer.php should change for some cases, because the quoted and unquoted attributes are not handled the same way:

  • attributes with empty content
  • attributes with < as content

I have create a wikipage linked as url to make that visible.

Thanks.


Version: 1.18.x
Severity: enhancement
URL: http://test.wikipedia.org/wiki/MW_ATTRIBS_REGEX

attachment attribs_regex.patch ignored as obsolete

Details

Reference
bz26441

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 11:16 PM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz26441.
bzimport added a subscriber: Unknown Object (MLST).

presumably the not reading of < in attributes of tag extensions is some sort of paranoia against XSS. It would perhaps make sense to make it not recognize <tag att=foo<bar > for consistency's sake.

not recognizing <tag someAttribute= > Seems sane to me. I expect to be required to do <tag someAttribute=""> if i want to pass it the empty string.

Not reading < in attributes is b/c that is how the spec is written, IIRC. I don't think <> are allowed in attributes.

<elem attr=> is a bit weird, not sure if that should be supported.

However <elem attr> should render as <elem attr=""> imho

I thought in html, <elem attr> was equivelent to <elem attr="attr">. It would be weird to do the opposite of html imho.

(In reply to comment #4)

I thought in html, <elem attr> was equivelent to <elem attr="attr">.

I don't think so. But see http://www.w3.org/TR/html-markup/syntax.html#syntax-attributes for more info.

Hmm, maybe its an xhtml thing. I was reading http://www.w3.org/TR/xhtml1/#h-4.5 html is confusing.

Need to add test cases for the behavior that it's supposed to be changing, and clarify what is supposed to change and why.

Patch seems to be forbidding quoted empty elements, which is definitely wrong.....?

Appears to also remove '<' and '>' from the list of accepted chars for unquoted attribs. Not sure how those chars actually interact with the rest of the sanitizer stuff, but note the HTML 5 parser rules explicitly specify that '>' should close out the tag, while '<' is technically bogus but should be treated as part of the attribute value for consistent fallback behavior:

http://dev.w3.org/html5/spec/Overview.html#attribute-value-unquoted-state

sumanah wrote:

Umherirrender, I am adding the "reviewed" keyword to this bug since you received a review from Brion in comment 7 in February. Do you have time and interest in revising the patch in accordance with those suggestions? Thanks for the patch!

Comment on attachment 7932
change the attribs regex to handle the quoted and unquoted attributes same

I am not able to provide a new patch with parser tests or which have the right regex for the specifition.

Marking patch as obsolete.

Echoing comment 7: parser tests need to be added, so that we can ensure that the PHP parser and Parsoid have the same behavior.

matmarex subscribed.

In the given example page (https://test.wikipedia.org/wiki/MW_ATTRIBS_REGEX), quoted and unquoted examples are now parsed identically.

I think this has been fixed as part of T108134: Parse bare attributes according to the html5 spec.