Page MenuHomePhabricator

SVG images that are invalid XML no longer rendered
Closed, DeclinedPublic

Description

Author: Bryan.TongMinh

Description:
As of r75968, SVG images that are invalid XML no longer render. Previously those images where allowed by MediaWiki and it appears that rsvg happily accepts them.

I agree that we should not accept SVG images which are not valid XML, however I am not convinced that breaking images which used to render fine is a good approach.

People in Commons suggest that a few hundreds of SVGs on Commons are broken. My analysis on a random set suggests that less than 0.1% of the SVG images in use on Wikimedia is affected. (Checked a random set of 1000 images, found none)

The question is whether we should fall back to the old behavior if the file does not pass the new file checks, or whether we should just refuse rendering them.


Version: unspecified
Severity: major

Details

Reference
bz27544

Event Timeline

bzimport raised the priority of this task from to Lowest.Nov 21 2014, 11:23 PM
bzimport set Reference to bz27544.
bzimport added a subscriber: Unknown Object (MLST).

I would recommend falling back, but it would be good to have an example of those that are a problem before making a final decision.

This should be fixed. I understand why that one doesn't display, but errors like that (undeclared, but well-known namespaces), shouldn't cause a problem with display.

Would it be possible to let users know at upload time that they have provided invalid XML and that it should be fixed? Could we do that *and* display already-uploaded images like that one?

This one may be a little beyond our control, unless we want to maintain a custom, non-compliant XML parser. The XML specification forbids compliant implementations from doing anything but failing spectacularly, and most decent XML parsers follow the rules.

I don't know exactly what change triggered the new behavior (did we switch implementations?), but I think we're better off encouraging a cleanup of the existing images rather than implementing a fallback strategy.

we didn't use the xmlreader pull parser before. I swithed the whole thing from XmlTypeCheck to SVGReader in r75968

i don't have the problem at home btw, so I guess the XMLReader version installed on wikimedia servers behaves different from the one I have installed.

Seems to me that the WMF version of XMLReader is validating the file, while as far as I know, that is not the default mode of XMLReader.

Created attachment 8170
example file

Attached:

Bryan.TongMinh wrote:

(In reply to comment #7)

Seems to me that the WMF version of XMLReader is validating the file, while as
far as I know, that is not the default mode of XMLReader.

I'm not too familiar with XML, but does that mean that the DTD is fetched from w3.org every time an SVG file is viewed?

Bryan.TongMinh wrote:

$this->reader->setParserProperty( XMLReader::VALIDATE, false ); would presumably fix it.

like i said, false should be the default value anyways. At least from what I understand. but we could run a test on wmf servers i guess to make sure.

/me has no access.

summary, on wmf servers, guessMimeType returns 'application/xml' instead of 'image/svg+xml'. guess mimetype calls doGuessMimeType(), which will call XmlTypeCheck()

$xml = new XmlTypeCheck( 'http://bug-attachment.wikimedia.org/attachment.cgi?id=8170' );
var_dump( $xml->wellFormed );

shows the wmf servers as evaluating wellFormed to false. This is not correct btw, for most of these files, since most are only invalid but are wellformed, this is a bug in XmlTypeCheck().

Actually, on my own server it even always evaluates to false, but later parts apparently pick up the slack.

MimeMagic::doGuessMimeType: analyzing head and tail of /private/var/tmp/phpwMVYQ7 for magic numbers.
DjVuImage::getInfo: not a DjVu file
MimeMagic::guessMimeType: internal type detection failed for /private/var/tmp/phpwMVYQ7 (.)...
MimeMagic::detectMimeType: magic mime type of /private/var/tmp/phpwMVYQ7: image/svg+xml
MimeMagic::guessMimeType: guessed mime type of /private/var/tmp/phpwMVYQ7: image/svg+xml
MimeMagic::improveTypeFromExtension: improved mime type for .svg: image/svg+xml
File::getPropsFromPath: /private/var/tmp/phpwMVYQ7 loaded, 5266 bytes, image/svg+xml.

Internal type detection failed, indicates that doGuessMimeType() has returned false and detectScript() takes over and selects image/svg+xml for me, but apparently not on the WMF installation.

I guess in 1.16 something was correcting for this issue for wmf as well, but now no more...

Bryan.TongMinh wrote:

detectMimeType uses one of the mime detection PHP extensions. I don't see any changes between 1.16 and 1.17 that would explain the difference.

The bug is with XmlTypeCheck though. xml_parse() returns an error for both an invalid as an unwellformed document, whereas we only want an error on an unwellformed document.

two options, rewrite to XMLReader, or set XmlTypeCheck to ignore namespaces all together. Implications of the latter not fully understood yet. Best consult Brion or Tim on this, esp. since brion originally added

http://svn.wikimedia.org/viewvc/mediawiki?view=revision&revision=30603

and

http://www.mediawiki.org/wiki/Special:Code/MediaWiki/43627

(In reply to comment #9)

I'm not too familiar with XML, but does that mean that the DTD is fetched from
w3.org every time an SVG file is viewed?

No. The URIs in xmlns attributes are not meant to be resolved and fetched. Often, they can be, since the "http" schema is used and people DTD at that address. But it is only meant to serve as a unique identifier, not a resource to be retrieved.

In practice, though, some (poorly written) validators do attempt to fetch the resource every time.

Bryan.TongMinh wrote:

To clarify, for people who have not followed the IRC discussion:

  • The original assumption that the non-rendering of certain SVGs was caused by the new SvgMetadataExtrator was false.
  • MimeMagic::doGuessMimeType uses XmlTypeCheck to detect the mime type; this fails because XmlTypeCheck::wellFormed is set to false when the SVG is non-validating instead of non-wellformed
  • For me and Derk-Jan, the fileinfo pecl extension then detects the file as image/svg+xml anyway, but at Wikimedia it is detected as application/xml

XmlTypeCheck must be fixed to only check wellformedness.
Additionally, it would be interesting why at Wikimedia there is a difference between 1.16 and 1.17.

A suggested approach for fixing XmlTypeCheck would be to change xml_parse_create_ns() into xml_parse_create(), but somebody who understands Xml properly needs to look at that.

See r43627 for where Brion forces the use of xml_parser_create_ns() instead of allowing xml_parser_create(). Thinking this is too picky per the bug.

I got a little lost reading the backscroll here, but let me just summarize the expected behavior:

  • invalid[1] SVG file MUST be rejected on upload
  • invalid[1] SVG file that's already existing SHOULD be rejected from rendering, but if it does render it's not the end of the world.

[1] by invalid I mean one or more of the following is true:

  • not well-formed XML (ie, any error is thrown by XML parser)
  • not well-formed namespaces (ie, an error is thrown by namespace-aware XML parser about inconsistent or undeclared namespaces)
  • root element isn't <svg> with expected namespace
  • anything in the file violates our generic or SVG-specific safety checks

Non-rendering of broken files is not a bug; it's expected behavior. Affected files should be fixed.

New files being uploaded that do not pass the validation checks *is* a bug, if it's happening. Failure to live through our XML/SVG-specific checks MUST NOT be overridden by some other generic check.

Bryan.TongMinh wrote:

(In reply to comment #18)

Failure to live through our XML/SVG-specific checks MUST NOT be
overridden by some other generic check.

The best way to fix this is to split the valid and wellformedness check. Wellformed but invalid XML should be marked application/xml.

Unwellformed and invalid XML should never be marked application/xml or any XML subtype, even of if the PHP file info extension recognizes it as such.

Since the only identified files are missing namespace declarations, and taking what seems to be the consensus of developers here into consideration, I'm closing this as WONTFIX -- the identified files should be fixed and re-uploaded.

Just a terminology clarification: what we're checking for well-formedness here including the namespaces is called "namespace-well-formed"ness, and is again not related to what XML terms "validity", which we never check for.

http://www.w3.org/TR/REC-xml-names/#Conformance

ayg wrote:

Although maybe we should define a subset of SVG using our own DTD, excluding script elements and attributes, and test for validity using that DTD. That would be a very easy way to do whitelist-based security, which makes me feel happier than the current blacklist-based approach. Are we really sure that all JS-activating attributes start with "on", and that no element will allow script if it's not named "script"? (But that's a separate issue.)

rainald.koch wrote:

I had problems uploading an svg which passed the W3C validator without any comment and also the validator on the toolserver parsed it correctly and gave no hints. The error message on upload was "wrong MIME type" (in german).

It took me two evenings to learn that the problem had to do with namespaces (i had given svn but failed to add xlink).

The validator on the toolserver should be in line with the checks done on upload and the error message should be more informative.

File:FMCW Doppler radar.svg

We already have bug #27537 for the issue with the error message. I don't see it happening that we are gonna support files anytime soon that are not namespace-well-formed, so reclosing.

rainald.koch wrote:

Reclosing is ok, I reopened it only to be heard and appreciating the decision of Brian to exclude files which are not namespace-well-formed.

Thank you for the hint at bug #27537.

Do you also know, whether the two bugs of the png renderer which show up with my svg are already registered? As svg, it is displayed as expected, but all the pngs of different size ...

  • do not show the indices (<tspan style="baseline-shift:sub;font-size:0.85em">D</tspan>)
  • ignore the positioning baseline-shift="-30%"

Yes, I believe both of those are known issues, and filed somewhere under the "depends" of ticket #8901