Page MenuHomePhabricator

Sanitizer rejects valid xhtml attributes
Closed, ResolvedPublic

Assigned To
None
Authored By
bzimport
Jan 14 2009, 11:57 PM
Referenced Files
F5309: parser_test_results.txt
Nov 21 2014, 10:25 PM
F5308: parserBugTester.php
Nov 21 2014, 10:25 PM
F5306: Sanitizer.patch
Nov 21 2014, 10:25 PM

Description

Author: gbruin

Description:
Sanitizer will reject attributes found in tags as invalid even though they are technically valid.

Example:
<tag hyphen-name="value"></tag>

When parserHook() is called for <tag>, the $args parameter will be an empty array because 'hyphen-name' is rejected.

To my knowledge, attribute names must start with [a-zA-Z] followed by zero or more from the set [a-zA-Z0-9._-] (is this accurate?). Currently, only attributes of 1 or more [a-zA-Z0-9] are accepted. This is also a minor problem, because attribute names cannot start with a number.


Version: 1.15.x
Severity: enhancement

Details

Reference
bz17031

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 10:25 PM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz17031.

gbruin wrote:

The following patch to Sanitizer.php might resolve this issue.

Index: Sanitizer.php

  • Sanitizer.php (revision 45716)

+++ Sanitizer.php (working copy)
@@ -40,10 +40,11 @@

  • Allows some... latitude.
  • Used in Sanitizer::fixTagAttributes and Sanitizer::decodeTagAttributes */

-$attrib = '[A-Za-z0-9]';
+$attrib_first = '[A-Za-z]';
+$attrib = '[A-Za-z0-9._-]';
$space = '[\x09\x0a\x0d\x20]';
define( 'MW_ATTRIBS_REGEX',

  • "/(?:^|$space)($attrib+)

+ "/(?:^|$space)({$attrib_first}{$attrib}*)

	  ($space*=$space*
		(?:
		 # The attribute value: quoted or alone

On a quick peek at the XML 1.0 spec, I believe attributes are limited to the 'Name' type:

http://www.w3.org/TR/REC-xml/#NT-Name

NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]

NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]

Name ::= NameStartChar (NameChar)*

gbruin wrote:

Thanks for the quick reply, it looks like indeed more attribute names are valid than are accepted by sanitizer. Should this be rectified?

Probably yes. :)

While the original approximation is sufficient for *HTML* sanitization, going ahead and accepting a larger set could be nice for extensions which might want to be more flexible.

gbruin wrote:

Here is an example of how an extension could benefit from accepting a larger set of attributes:

The extension FBConnect (http://www.mediawiki.org/wiki/Extension:FBConnect) enables wiki users to insert XFBML, an extension to XHTML designed by Facebook, by reconstructing the tags and attributes as they are received from the Sanitizer. With the current narrower set [A-Za-z0-9], attributes like facebook-logo (http://wiki.developers.facebook.com/index.php/Fb:profile-pic#Attributes) are dropped from the attributes array in the parser hook.

To fix this issue, the following code

$attrib = '[A-Za-z0-9]';
$space = '[\x09\x0a\x0d\x20]';
define( 'MW_ATTRIBS_REGEX',

"/(?:^|$space)($attrib+)

...

should be modified to be

$attrib_first = '[:A-Z_a-z]';
$attrib = '[:A-Z_a-z-.0-9]';
$space = '[\x09\x0a\x0d\x20]';
define( 'MW_ATTRIBS_REGEX',
"/(?:^|$space)({$attrib_first}{$attrib}*)
...

I've attached this difference in patch form.

gbruin wrote:

Sanitizer.php patch for bug 17031

Attached:

Looks like it'll work; be good to have some parser test cases for it though. (Would need a test extension I guess.)

gbruin wrote:

Test Extension (stest)

Test extension for bug 17031. Adds a test parser hook that simply rebuilds the original tag with the attribute pairs and inner text and passes it through htmlspecialchars.

attachment stest.php ignored as obsolete

gbruin wrote:

Some test cases:

<stest facebook-logo="true" a:b="c" z_-.Z="val" A:B-c.1_2:3="val" _1="2">These attribs should be passed through</stest>

<stest -a="no" .b="no" 0c="no" 9d="no" don't="no" a=b="c" a%="no" hi"="no" a$="no" a@="no" ^a="no" a*="no" a(b)="no">Denied</stest>

Output:

<stest facebook-logo="true" a:b="c" z_-.z="val" a:b-c.1_2:3="val" _1="2">These attribs should be passed through</stest>

<stest>Denied</stest>

gbruin wrote:

Any word on this? I noticed today that Sanitizer still contains this bug.

gbruin wrote:

*** Bug 24749 has been marked as a duplicate of this bug. ***

sean wrote:

Maintenance-script test-driver for parser test.

Script based on Garrett's test-extension which runs several test-cases.

This was run both before and after the fix was applied.

Attached:

sean wrote:

Output of the parser testing.

This output shows exactly what we would expect (assuming that we expected the fix to work ;) ).

Attached:

gbruin wrote:

Finally lol. you guys rock!