Page MenuHomePhabricator

SVG handler uses an "uber-crappy hack" instead of a real XML parser
Closed, ResolvedPublic

Description

wfGetSVGsize (in ImageFunctions.php) tries to detect the dimensions of a SVG file by looking for /<svg\s*([^>]*)\s*>/ in the first 4096 bytes of the file, acknowledging “Uber-crappy hack! Run through a real XML parser.” If it does not find the header by this method, it assumes the file is not a valid SVG file, and therefore, refuses to render its thumb on the image description page, replacing it with an icon.

This can break in many ways, especially with regards to XML comments: if the file starts with a big comment, the <svg> header might not fit into the first 4096 bytes of the file, or, if the comment contains <svg> it is mistaken for the real header.

Specifically, this happenned on http://commons.wikimedia.org/wiki/Image:Stub-icon_linguistics.svg (see version 16:57, 23. 5. 2008, marked 0×0) – the file begins with the full text of the GNU Free Documentation License, moving the SVG header outside the 4096 B limit.


Version: unspecified
Severity: minor

Details

Reference
bz14268

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 10:13 PM
bzimport set Reference to bz14268.

Created attachment 4925
Proposed patch

attachment SvgInfo.patch ignored as obsolete

Looks good!

A couple quick tweaks -- the current patch version will give false positives to invalid files using upper- or mixed-case elements and attribute names. Use xml_parser_set_option() to disable case-folding; you can then check against the correct, lowercase element & attribute names.

It may also get confused by embedded <svg> elements (I'm not 100% sure that's legal or not but have the vague impression it may be); to be safe, disable the element handler once it's run for the first element.

See XmlTypeCheck.php for a somewhat similar bit of code... It may actually make some sense to merge them in some tricky way, say by having the type check object retain the attributes of the root element.

Created attachment 4935
New patch

(In reply to comment #2)

Looks good!

A couple quick tweaks -- the current patch version will give false positives to
invalid files using upper- or mixed-case elements and attribute names. Use
xml_parser_set_option() to disable case-folding; you can then check against the
correct, lowercase element & attribute names.

Yup, Opera, FF and Inkscape don't accept such deviations, my bad.

It may also get confused by embedded <svg> elements (I'm not 100% sure that's
legal or not but have the vague impression it may be); to be safe, disable the
element handler once it's run for the first element.

Fixed in new patch.

See XmlTypeCheck.php for a somewhat similar bit of code... It may actually make
some sense to merge them in some tricky way, say by having the type check
object retain the attributes of the root element.

Later, we may need to add more options to SvgInfo, so it shouldn't be forced by design to parse only root element.

attachment SvgInfo.patch ignored as obsolete

Couple remaining problems...

First, PHP spews notices due to use of undefined variables in wfGetSVGsize():

<b>Notice</b>: Undefined variable: width in <b>/Library/WebServer/Documents/trunk/includes/ImageFunctions.php</b> on line <b>93</b><br />

<b>Notice</b>: Undefined variable: height in <b>/Library/WebServer/Documents/trunk/includes/ImageFunctions.php</b> on line <b>93</b><br />

Be sure to test your code with error_reporting pegged up to at least E_ALL!

Second, it looks like the current code will incorrectly pass through a document whose root element isn't a <svg> at all, continuing on until it finds one later in the document. In most cases such a document won't reach this point (it should have gotten rejected on upload), but it would be better to work correctly.

Third, it doesn't appear to be doing any namespace checking, so will fail entirely on SVG files which use an XML namespace prefix on the root element.

Since XmlTypeCheck already handles those two cases, and there currently exists no other requirement which the SvgInfo class actually handles, my recommendation still stands to add a root attribute accessor to XmlTypeCheck and use that in place of the SvgInfo class. This would accomplish all required goals with less code and less code duplication, increasing reliability and maintainability of the code.

And one last issue; it's not actually required for an SVG to specify width and height attributes, in which case "100%" is assumed for each when not provided.

The original code would default to a 256x256 pixel square; the patch returns 0x0, which won't provide sensible rendering.

Created attachment 4990
Updated patch

Updated the old patch against the current trunk. Also added accessor methods getHeight() and getWidth() that return 100% if no $svg->height or $svg->width exist.

Attached:

(In reply to comment #4)

Since XmlTypeCheck already handles those two cases, and there currently exists
no other requirement which the SvgInfo class actually handles, my
recommendation still stands to add a root attribute accessor to XmlTypeCheck
and use that in place of the SvgInfo class. This would accomplish all required
goals with less code and less code duplication, increasing reliability and
maintainability of the code.

An accessor was added in r41024, so we're no longer accessing $rootElement directly (needed updates to MimeMagic too). One more step in the right direction.

claus.colloseus wrote:

I'll add another case for you to consider. I can only follow part of your discussion, so maybe it's simply that I don't understand which code version you are running. Is it still the 'crappy hack'?
http://commons.wikimedia.org/w/index.php?title=Image:Arbeitslosenentwicklung_1933.svg
is encoded as UTF-16, and this currently leads to no PNGs being rendered.

(In reply to comment #9)

I'll add another case for you to consider. I can only follow part of your
discussion, so maybe it's simply that I don't understand which code version you
are running. Is it still the 'crappy hack'?
http://commons.wikimedia.org/w/index.php?title=Image:Arbeitslosenentwicklung_1933.svg
is encoded as UTF-16, and this currently leads to no PNGs being rendered.

Yeah, I think so -- the original checks would get confused by UTF-16 since it's not ASCII-compatible. A proper XML parse should work fine as long as the file's legit. (Unfortunately the referenced file has been deleted so I can't test on it.)

(In reply to comment #8)

(In reply to comment #4)

Since XmlTypeCheck already handles those two cases, and there currently exists
no other requirement which the SvgInfo class actually handles, my
recommendation still stands to add a root attribute accessor to XmlTypeCheck
and use that in place of the SvgInfo class. This would accomplish all required
goals with less code and less code duplication, increasing reliability and
maintainability of the code.

An accessor was added in r41024, so we're no longer accessing $rootElement
directly (needed updates to MimeMagic too). One more step in the right
direction.

My recent changes to XmlTypeCheck to accept an element filter (used for more thorough scripting security checks on upload) should make it real easy to plug this into the existing class now.

claus.colloseus wrote:

(In reply to comment #10)

(In reply to comment #9)

I'll add another case for you to consider. I can only follow part of your
discussion, so maybe it's simply that I don't understand which code version you
are running. Is it still the 'crappy hack'?
http://commons.wikimedia.org/w/index.php?title=Image:Arbeitslosenentwicklung_1933.svg
is encoded as UTF-16, and this currently leads to no PNGs being rendered.

Yeah, I think so -- the original checks would get confused by UTF-16 since it's
not ASCII-compatible. A proper XML parse should work fine as long as the file's
legit. (Unfortunately the referenced file has been deleted so I can't test on
it.)

I re-uploaded the file. Hope it stays there now for some days longer...

Created attachment 5565
Sample UTF-16 SVG

Attached a copy of the sample UTF-16 file noted above. (Note that this file contains only an embedded PNG image, base-64 encoded into the XML source, making it about the least efficient possible way to encode this image. :)

Attached:

Fixed in r44322, using a filter callback on XmlTypeCheck.