
SVG upload gets blocked on correct encoding (windows-1252, with wrong/unspecific warning)
Closed, ResolvedPublic

Description

I got an unspecific error: "This file contains HTML or script code that may be erroneously interpreted by a web browser."

For example, this file would normally declare encoding="ISO-8859-1" (or the standard encoding="UTF-8"), but the W3C validator says it should use "windows-1252" instead: [[File:Milch.svg]].


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=67044

Details

Event Timeline

bzimport raised the priority of this task to Low. (Nov 22 2014, 3:49 AM)
bzimport set Reference to bz70937.
bzimport added a subscriber: Unknown Object (MLST).

I am not sure I understand the role of encoding in this bug. Do you get the error with some encoding but not with another one?

Currently, we only allow encodings:

$safeXmlEncodings = array(
    'UTF-8',
    'ISO-8859-1',
    'ISO-8859-2',
    'UTF-16',
    'UTF-32'
);

We had specific issues with UTF-7 (bug 47304), so we whitelisted encodings that were well supported. We would need to verify that XML parsing of windows-1252 is done the same way on the server and on clients before we can open that up.
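To illustrate the kind of whitelist check described above, here is a minimal sketch in Python (not MediaWiki's actual PHP code). The function names, the regular expression, and the case-insensitive comparison are assumptions made for this example. It also assumes the XML declaration can be read as ASCII bytes, so it does not handle UTF-16 or UTF-32 input.

```python
import re

# Encodings from the list quoted above. Comparing in upper case is an
# assumption of this sketch, not something stated in the thread.
SAFE_XML_ENCODINGS = {"UTF-8", "ISO-8859-1", "ISO-8859-2", "UTF-16", "UTF-32"}

# Captures the encoding pseudo-attribute of an XML declaration at the
# start of the data (hypothetical pattern, ASCII-compatible input only).
_XML_DECL_ENCODING = re.compile(
    rb"""^\s*<\?xml\b[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def declared_encoding(head: bytes):
    """Return the declared encoding in upper case, or None if absent."""
    match = _XML_DECL_ENCODING.match(head)
    return match.group(1).decode("ascii").upper() if match else None


def is_allowed(head: bytes) -> bool:
    """Accept files with no declared encoding or a whitelisted one."""
    encoding = declared_encoding(head)
    return encoding is None or encoding in SAFE_XML_ENCODINGS


if __name__ == "__main__":
    for sample in (
        b'<?xml version="1.0" encoding="ISO-8859-1"?><svg/>',
        b'<?xml version="1.0" encoding="windows-1252"?><svg/>',
    ):
        print(declared_encoding(sample), is_allowed(sample))
```

With this list, a file declaring windows-1252 is rejected even though it is well formed, which matches the behaviour reported in this task.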

(In reply to PRO from comment #3)

(In reply to Tisza Gergő from comment #1)
Yes, the [[Windows-1252]] encoding is preferred by the W3C (validator) to
encoding="ISO-8859-1", also in SVG, see at a test-file, the second warning:
http://validator.w3.org/check?uri=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Farchive%2Fb%2Fbd%2F20140917155924%21Test.svg&charset=%28detect+automatically%29&doctype=Inline&ss=1&group=0&user-agent=W3C_Validator%2F1.3+http%3A%2F%2Fvalidator.w3.org%2Fservices#line-1

See also http://lists.w3.org/Archives/Public/www-validator/2013Mar/0054.html

So MediaWiki accepts ISO-8859-1 but not windows-1252.

It's not like one is better than the other really. If the file is windows-1252, you should mark it as such....

If anything is preferred, realistically it would be to have documents be in UTF-8, the most sane of all the encodings.


windows-1252 is almost the same as ISO-8859-1 (just c0 and c1 control characters are different), and both are ASCII compatible. Anything ASCII compatible should not have bug 47304 type issues, so it should be safe to whitelist windows-1252.
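The comparison can be checked with Python's built-in codecs. This is only a sketch, and it assumes those codec tables are representative of what servers and browsers do. In these tables the two encodings decode identically below 0x80 and differ only for bytes 0x80 to 0x9F. The last part shows why a non-ASCII-compatible encoding such as UTF-7 is a different matter: a payload made purely of ASCII bytes decodes to markup. The helper name `decode_byte` is made up for this example.

```python
def decode_byte(value: int, encoding: str) -> str:
    """Decode a single byte; undefined bytes become U+FFFD."""
    return bytes([value]).decode(encoding, errors="replace")


# Byte values whose decoded character differs between the two encodings.
differing = [
    b for b in range(256)
    if decode_byte(b, "latin-1") != decode_byte(b, "cp1252")
]

# ASCII compatibility: bytes 0x00-0x7F decode to the same code points.
cp1252_ascii_compatible = all(
    decode_byte(b, "cp1252") == chr(b) for b in range(128)
)

# UTF-7 is not ASCII compatible: these plain ASCII bytes hide a tag.
utf7_payload = b"+ADw-script+AD4-"
decoded_utf7 = utf7_payload.decode("utf-7")

if __name__ == "__main__":
    print(f"differing bytes: {min(differing):#04x}..{max(differing):#04x}")
    print("cp1252 is ASCII compatible:", cp1252_ascii_compatible)
    print("UTF-7 payload decodes to:", decoded_utf7)
```

A filter that scans the raw bytes for "<script" would miss the UTF-7 payload, whereas in an ASCII-compatible encoding the bytes of any tag are visible as-is.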

(In reply to Bawolff (Brian Wolff) from comment #4)

Its not like one is better than the other really. If the file is
windows-1252, you should mark it as such....

If anything is preferred, realistically it would be to have documents be in
utf-8, the most sane of all the encodings.


windows-1252 is almost the same as ISO-8859-1 (just c0 and c1 control
characters are different), and both are ascii compatible. Anything ascii
compatible should not have bug 47304 type issues, so it should be safe to
whitelist windows-1252

Here is a temporary example showing why using windows-1252 is preferred (partly because its use is not very intuitive):
https://upload.wikimedia.org/wikipedia/commons/thumb/archive/b/bd/20140917155924%21Test.svg/800px-Test.svg.png

(In reply to Bawolff (Brian Wolff) from comment #4)

windows-1252 is almost the same as ISO-8859-1 (just c0 and c1 control
characters are different), and both are ascii compatible. Anything ascii
compatible should not have bug 47304 type issues, so it should be safe to
whitelist windows-1252

I agree. If someone wants to submit a patch to add that to the whitelist, I will +1.

Change 302220 had a related patch set uploaded (by Brian Wolff):
Allow SVGs encoded as WINDOWS-125[0-8].

https://gerrit.wikimedia.org/r/302220

Change 302220 merged by jenkins-bot:
Allow SVGs encoded as WINDOWS-125[0-8].

https://gerrit.wikimedia.org/r/302220
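As an illustration of what the change title describes, a whitelist extended with WINDOWS-1250 through WINDOWS-1258 might look like the Python sketch below. The contents of the actual patch are not shown in this thread, so the names, the pattern, and the upper-case normalization are assumptions made for this example.

```python
import re

# Encodings from the list quoted earlier in this task.
BASE_SAFE_ENCODINGS = {"UTF-8", "ISO-8859-1", "ISO-8859-2", "UTF-16", "UTF-32"}

# Mirrors the "WINDOWS-125[0-8]" wording of the change title (hypothetical).
WINDOWS_125X = re.compile(r"WINDOWS-125[0-8]")


def is_safe_xml_encoding(name: str) -> bool:
    """Return True if the declared encoding name is on the extended list."""
    normalized = name.strip().upper()
    return (
        normalized in BASE_SAFE_ENCODINGS
        or WINDOWS_125X.fullmatch(normalized) is not None
    )


if __name__ == "__main__":
    for name in ("windows-1252", "WINDOWS-1259", "UTF-7", "utf-8"):
        print(name, is_safe_xml_encoding(name))
```

UTF-7 stays rejected under this sketch, which keeps the protection introduced for bug 47304.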

matmarex assigned this task to Bawolff.