
SVG JavaScript detection bypass
Closed, Resolved · Public

Description

Author: jan-bugreport

When an SVG file is uploaded, MediaWiki checks it to ensure that it does not contain JavaScript. Using UTF-7 (which is interpreted by the parser MediaWiki uses for validation, but not by browsers), it is possible to create a document where the script part is considered harmless text by the MediaWiki parser, but interpreted as a script tag by regular browsers (tested with Chrome and Firefox).

The proof-of-concept at http://upload.wikimedia.org/wikipedia/test/archive/4/4f/20130415163012!Test-active.svg does this by including a CDATA section with a UTF-7-encoded start marker (treated as text and thus ignored by non-UTF-7 parsers). As can be seen from the URL, I was able to upload this on the test wiki. At least Chrome and Firefox will happily execute the JavaScript when the button is clicked (giving feedback on the JavaScript console), since they don't see the CDATA start marker and thus render the button with its onclick handler.
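
For concreteness, here is a minimal sketch of the dual decoding (assuming PHP's iconv knows UTF-7, as glibc's does). The encoded run below is my own illustration, not the contents of the uploaded file:

<?php
// A UTF-7 "+...-" run that decodes to the CDATA start marker.
$marker = '+ADwAIQBbAEMARABBAFQAQQBb-';

// A parser honouring the declared UTF-7 sees a CDATA section start...
echo iconv( 'UTF-7', 'UTF-8', $marker ), "\n"; // <![CDATA[

// ...while a browser that ignores the declaration and assumes UTF-8
// sees only the literal ASCII in $marker. Any <script> element (or
// onclick attribute) that follows is therefore mere character data to
// the validator, but live markup to the browser.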

UTF-7 will certainly not be the only way to make MediaWiki's parser and browsers interpret the same document differently. http://www.whatwg.org/specs/web-apps/current-work/multipage/semantics.html#charset has a nice overview of some other problematic charsets. Blacklisting them would certainly be a good start; however, given the complexity of XML, I am not sure it will reliably solve the issue.

Since Wikipedia serves its images from a separate origin, it is probably not seriously affected. For smaller wikis that do not have this separation, this results in an XSS attack.

Please attribute this to "Jan Schejbal / Hatforce.com" in the release announcement.

Kind regards,
Jan Schejbal


Version: unspecified
Severity: normal

Details

Reference
bz47304

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 1:42 AM
bzimport set Reference to bz47304.
bzimport added a subscriber: Unknown Object (MLST).

Nice... as I recall, browsers took out UTF-7 support for security reasons; now it's biting us that libxml2 still understands it. :)

Good catch.

Thanks for the report Jan, I've verified the issue.

You are right, the root cause seems to be that xml_parser_create_ns doesn't allow us to force an encoding any more ("Starting from PHP 5, the input encoding is automatically detected, so that the encoding parameter specifies only the output encoding."). I'm working on a couple of ways to get around this.
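
To make that concrete, a minimal sketch (the UTF-7 payload is my own illustration; whether the declaration is honoured depends on the parser backend, which evidently does honour it here):

<?php
// Since PHP 5, the first argument only sets the encoding of the data
// passed to the handlers; the *input* encoding is autodetected from
// the document itself (BOM / XML declaration) and cannot be forced.
$parser = xml_parser_create_ns( 'UTF-8' );
xml_set_character_data_handler( $parser, function ( $p, $data ) {
    echo $data, "\n"; // already decoded from whatever was detected
} );
$xml = '<?xml version="1.0" encoding="UTF-7"?><root>+AF8AXw-</root>';
xml_parse( $parser, $xml, true ); // prints "__" if UTF-7 is honoured
xml_parser_free( $parser );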

Created attachment 12132
Workaround patch

This is a hacky patch that just rewrites the declared encoding to UTF-8 if something else is given.

PHP doesn't seem to offer a way to check which encoding was detected when using expat, but the XML declaration has to follow pretty strict formatting, so it may not be too crazy to use a regex like this to pull out the specified encoding.
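
Roughly along these lines (an illustrative sketch, not the attached patch; the helper name is made up, and as the comments below show, getting the pattern right is subtle):

<?php
// Hypothetical sketch: pull the declared encoding out of the XML
// declaration with a regex before the file reaches expat.
function getDeclaredEncoding( $head ) {
    // EncName per XML 1.0: [A-Za-z] ([A-Za-z0-9._] | '-')*
    if ( preg_match(
        '/^<\?xml[^>]*\bencoding=(["\'])([A-Za-z][A-Za-z0-9._-]*)\1/',
        $head, $m
    ) ) {
        return strtoupper( $m[2] );
    }
    return null; // no declaration: UTF-8 (or UTF-16 with a BOM)
}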

XMLReader seems to have the same issue.

attachment workaround.patch ignored as obsolete

jan-bugreport wrote:

(In reply to comment #3)

This is a hacky patch that just rewrites the declared encoding to UTF-8 if something else is given.

This will not solve the problem. The current problem is "file is treated as UTF-8 by browser and UTF-7 by filter". This patch would just turn the attack into "file is treated as $SpecifiedWeirdEncoding by browser, but as UTF-8 by filter" - the attack will certainly look different, but I would assume at least some of the encodings supported by current browsers will make it possible to create some kind of creative mess.

It took me a while to find an example, but the "hz" encoding (supported at least by Chrome) skips the sequence "~\n" (tilde followed by newline). Thus, "<![CDATA[ ]~\n]>" is treated as "CDATA start, random boring text" by a parser using UTF-8, while a parser using HZ (supported e.g. by Chrome 25) will treat it as "CDATA start, CDATA end". This results in basically the same attack, just that instead of hiding the CDATA start from the browser, the CDATA end is hidden from MediaWiki.

Additionally, I'm not sure if this can be reliably hacked in via RegExp. For the suggested patch, I strongly suspect that if the word "encoding" is split across chunk boundaries, the regexp will not find it and the attack will still work. In general, getting a regexp right so it cannot be bypassed is hard. For example, this string should bypass the regexp:

<?xml version="1.0" encoding="UTF-7" standalone="no"?><!-- encoding="UTF-8" ?> -->

The greedy regexp will happily grab "UTF-8" as the second group, even though it just sits in a comment.
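
Illustrated with a made-up greedy pattern of the general shape the workaround might use (not the actual patch):

<?php
$head = '<?xml version="1.0" encoding="UTF-7" standalone="no"?>'
      . '<!-- encoding="UTF-8" ?> -->';

// The greedy .* pushes the match to the *last* encoding="...", i.e.
// the one inside the comment, so the second group comes back "UTF-8"
// even though the document really declares UTF-7.
preg_match( '/(<\?xml.*encoding=")([A-Za-z0-9._-]+)(")/', $head, $m );
echo $m[2], "\n"; // UTF-8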

Could someone with access to the file repository maybe evaluate what charsets are actually used? I suspect that a whitelist approach may be the way to go here...

I have about 10k random svgs from commons, and only one had an encoding defined that wasn't utf-8 or iso-8859-1, and that was iso-8859-2. So yes, a whitelist of acceptable encodings seems like it would be fine.

jan-bugreport wrote:

Since this is actually browsers misinterpreting the file (the file clearly states that it is UTF-7, yet they treat it as UTF-8), I have filed security bugs with Chromium and Mozilla.

jan-bugreport wrote:

Chromium decided to WONTFIX - and set the issue to public.

https://code.google.com/p/chromium/issues/detail?id=233355

Sorry. I have requested that they hide it again, but so far they haven't. MediaWiki wasn't mentioned explicitly.

(In reply to comment #5)

I have about 10k random svgs from commons, and only one had an encoding defined that wasn't utf-8 or iso-8859-1, and that was iso-8859-2.

Would it be possible to run such a check over *all* 670k SVGs in the repo? It may provide valuable data not only for us, but for others (e.g. it could help Mozilla decide if breaking certain exotic encodings is OK) and be interesting in general.

I'm working on getting a way to access our full dataset to check. In the meantime, I may whitelist utf-8, iso-8859-1, and iso-8859-2 with a patch on the cluster, just to make sure no one tries to run the PoC you gave them.

Created attachment 12148
Whitelist known good xml encodings in svgs

This checks that the XML encoding is one of 'UTF-8', 'ISO-8859-1', or 'ISO-8859-2', if an encoding is specified in the optional XML declaration.
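
In outline, the check is something like this (my sketch of the described behaviour, not the attachment itself; getDeclaredEncoding() is the hypothetical helper sketched earlier):

<?php
// Reject the upload unless any declared encoding is known to be safe.
function isEncodingAllowed( $head ) {
    static $whitelist = array( 'UTF-8', 'ISO-8859-1', 'ISO-8859-2' );

    $enc = getDeclaredEncoding( $head );
    // No declaration is fine: the parser then defaults to UTF-8/UTF-16.
    return $enc === null || in_array( $enc, $whitelist, true );
}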

attachment whitelistEncodings.patch ignored as obsolete

I've been watching the incoming SVGs since last week, and there have been no attempts to upload any SVGs with exotic encodings, so I think we're safe to whitelist the encodings.

A couple of files came in with encoding="utf-16" and were themselves encoded in UTF-16, so the encoding regex did not match. It looks like expat detects and converts files from ISO-8859-1, US-ASCII, UTF-8, and UTF-16 (BE/LE), so we should probably detect if those are coming in at the PHP layer. New patch coming soon.
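
A detection sketch along those lines (illustrative, not the patch; the byte patterns follow Appendix F of the XML 1.0 spec):

<?php
// Spot UTF-16/UTF-32 input by BOM or by the byte pattern of "<?"
// before running the byte-oriented encoding regex. Longer prefixes
// come first so UTF-32LE's BOM isn't mistaken for UTF-16LE's.
function detectWideEncoding( $head ) {
    $prefixes = array(
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "<\x00\x00\x00"    => 'UTF-32LE',
        "\x00\x00\x00<"    => 'UTF-32BE',
        "\xFF\xFE"         => 'UTF-16LE',
        "\xFE\xFF"         => 'UTF-16BE',
        "<\x00?\x00"       => 'UTF-16LE',
        "\x00<\x00?"       => 'UTF-16BE',
    );
    foreach ( $prefixes as $prefix => $enc ) {
        if ( strncmp( $head, $prefix, strlen( $prefix ) ) === 0 ) {
            return $enc;
        }
    }
    return null; // treat as an ASCII-compatible byte stream
}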

Created attachment 12163
Whitelist and detect utf16/32 encodings

This patch also attempts to convert the input from UTF-16/32 before looking for an encoding declaration. This check may not be necessary, as it looks like expat's detection of a UTF-16/32 encoding overrides the encoding in the XML declaration. However, it seems prudent to check in case other versions use a different approach.
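
Usage is then roughly (sketch, reusing the hypothetical detectWideEncoding() above):

<?php
// Normalise a wide-encoded head to UTF-8 before the regex runs.
$wide = detectWideEncoding( $head );
if ( $wide !== null ) {
    $head = iconv( $wide, 'UTF-8', $head );
}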

attachment whitelistEncodings2.patch ignored as obsolete

(In reply to comment #11)

Created attachment 12163 [details]
Whitelist and detect utf16/32 encodings

If an XML declaration can be constructed which doesn't match your regex but does satisfy libxml's idea of a valid XML declaration, then a UTF-7 encoding could sneak through. I notice that libxml allows a space between "encoding" and "=".

php > print simplexml_load_string( '<?xml version="1.0" encoding="utf-7" ?><root>+-</root>' ) . "\n";
+
php > print simplexml_load_string( '<?xml version="1.0" encoding ="utf-7" ?><root>+-</root>' ) . "\n";
+

Other XML whitespace is allowed as well; the XML 1.0 spec says so:

EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" )
Eq ::= S? '=' S?

I think adding [ \t\n\r]* on either side of the equals sign will fix that.
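
i.e., something like this in the declaration pattern (sketch):

<?php
// Tolerate XML whitespace around '=' (Eq ::= S? '=' S?):
$declRegex = '/^<\?xml[^>]*\bencoding[ \t\n\r]*=[ \t\n\r]*(["\'])([A-Za-z][A-Za-z0-9._-]*)\1/';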

The "m" modifier in your regexes is not needed, since you have no start or end anchors.

You missed a wfRestoreWarnings() in the case where the function returns early. I suggest wrapping it around the iconv() only.

$wgSVGMetadataCutoff may be a vulnerability. The upload should be rejected if no end to the XML declaration is found before truncation by $wgSVGMetadataCutoff.
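
A sketch of that check (illustrative; the function name is made up, and a possible BOM is ignored for brevity):

<?php
// If the head was truncated at $wgSVGMetadataCutoff bytes before the
// XML declaration closed, any encoding decision based on it is
// unreliable, so reject rather than guess.
function xmlDeclarationComplete( $head, $fileSize, $cutoff ) {
    if ( $fileSize <= $cutoff ) {
        return true;  // nothing was cut off
    }
    if ( substr( $head, 0, 5 ) !== '<?xml' ) {
        return true;  // no declaration at all; spec defaults apply
    }
    return strpos( $head, '?>' ) !== false; // declaration must close
}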

I don't think it is possible to make a file which is valid XML simultaneously in both UTF-7 and UTF-16 or UTF-32. The document has to start with <?xml, and that string is different in those three encodings. But I suppose the code you wrote is harmless.


(In reply to comment #12)

I don't think it is possible to make a file which is valid XML simultaneously in both UTF-7 and UTF-16 or UTF-32. The document has to start with <?xml, and that string is different in those three encodings. But I suppose the code you wrote is harmless.

I don't think it's possible, but I wanted to make sure we could detect (and limit via whitelist if needed) multi-byte encodings.

Thanks for the input! New patch to address the other items to follow.

Created attachment 12170
Whitelist and detect utf16/32 encodings

Fixed the issues that Tim pointed out. Also, on reading the spec more, I realized that the XML could be EBCDIC-encoded, so I added an explicit test for that, which I'm currently not adding to the whitelist.


Related URL: https://gerrit.wikimedia.org/r/61632 (Gerrit Change I3b311a7078d977ae89c51e95e625d79fba183cfc)

Related URL: https://gerrit.wikimedia.org/r/61640 (Gerrit Change I3b311a7078d977ae89c51e95e625d79fba183cfc)

Related URL: https://gerrit.wikimedia.org/r/61643 (Gerrit Change I3b311a7078d977ae89c51e95e625d79fba183cfc)

Related URL: https://gerrit.wikimedia.org/r/62215 (Gerrit Change I3b311a7078d977ae89c51e95e625d79fba183cfc)