Spurious XML encoding declaration
Closed, Resolved (Public)

Description

Author: giecrilj

Description:
Steps to reproduce:
Load the result into an HTML SCRIPT element, as follows:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">

<HTML>
<HEAD>
<TITLE >MediaWiki XML encoding switching problem</TITLE>
<STYLE TYPE="TEXT/CSS">
<!-- .ERROR { COLOR: RED } --></STYLE>
<SCRIPT ID="MWX"

TYPE="text/xml"
SRC="http://en.wikipedia.org/w/api.php?action=query&amp;titles=Albert%20Einstein&amp;prop=info&amp;format=xml"
>
</SCRIPT ><SCRIPT TYPE="text/vbscript" ><!--

OPTION EXPLICIT

SUB WINDOW_ONLOAD
DIM A3DOC, A1X3DOC, L4ELTS, A4PARS3ERR
SET A3DOC = WINDOW. DOCUMENT
SET A1X3DOC = A3DOC. GETELEMENTBYID("MWX")
SET L4ELTS = A3DOC. FORMS. NAMEDITEM("MAIN"). ELEMENTS
L4ELTS. NAMEDITEM("FURL"). SETATTRIBUTE "value", A1X3DOC. SRC
SET A1X3DOC = A1X3DOC. XMLDOCUMENT
L4ELTS. NAMEDITEM("FXML"). SETATTRIBUTE "value", A1X3DOC. XML
SET A4PARS3ERR = A1X3DOC. PARSEERROR
L4ELTS. NAMEDITEM("FWHY"). SETATTRIBUTE "value", A4PARS3ERR. REASON
L4ELTS. NAMEDITEM("FWHERE"). SETATTRIBUTE "value", A4PARS3ERR. SRCTEXT
IF A4PARS3ERR THEN WINDOW. LOCATION. HREF = "#FWHY"
END SUB

REM --></SCRIPT ></HEAD>

<BODY>
<FORM ID="MAIN" ACTION="#MAIN">
<FIELDSET CLASS="RESULT">
<LEGEND >XML loaded</LEGEND>
<P>
The document

loaded from the <LABEL >URL <INPUT TYPE=TEXT ID=FURL READONLY >
contains the following code:
<TEXTAREA ID=FXML COLS=80 ROWS=25 READONLY ></TEXTAREA ></FIELDSET>

<FIELDSET CLASS="ERROR" ><LEGEND >XML not loaded</LEGEND>
<P >REASON: <BR ><TEXTAREA ID=FWHY COLS=80 READONLY ></TEXTAREA>
<P >SOURCE: <BR ><TEXTAREA ID=FWHERE COLS=80 READONLY ></TEXTAREA>
</FORM ></BODY>
</HTML >

Expected results:
XML returned should display in a TEXTAREA.

Actual results:
Error:
"Switch from current encoding to specified encoding not supported."
at the XML declaration "<?xml version="1.0" encoding="utf-8"?>".

Affected systems:
Microsoft HTML engine.

Diagnosis:
The error is explained at http://msdn.microsoft.com/en-us/library/aa468560.aspx#xmlencod_topic3. When the XML processor does not load the XML text itself but relies on an external mechanism to fetch it (MSHTML in this case), the downloading agent is allowed to recode the text but is not obliged to convert or strip the encoding declaration. As a result, the text presented to the XML engine may be in a different encoding than the one declared, which causes the parser to fail.
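
The same failure mode can be reproduced outside MSXML. The following is a minimal sketch in Python, using the standard library's expat-based parser as a stand-in (an assumption of this example; it is not what MSHTML uses). It recodes a small response to UTF-16 while leaving the encoding="utf-8" declaration in place, the way a downloading agent might. The <api> payload is a made-up placeholder, not the real API output.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the API response, declared as UTF-8.
served = '<?xml version="1.0" encoding="utf-8"?><api><page title="Albert Einstein"/></api>'


def parses(data: bytes) -> bool:
    """Return True if the parser accepts the byte stream, False otherwise."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# Bytes that match the declaration.
as_declared = served.encode("utf-8")

# An intermediary recodes the text to UTF-16 (with BOM) but keeps the declaration.
recoded = served.encode("utf-16")

# The same recoding, applied to a document whose declaration names no encoding.
recoded_no_decl = served.replace(' encoding="utf-8"', "").encode("utf-16")

print(parses(as_declared), parses(recoded), parses(recoded_no_decl))
```

Only the middle case is rejected: the parser detects UTF-16 from the byte order mark and then finds a declaration that contradicts it.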

Background:
The encoding declaration is necessary only for documents whose encoding cannot be determined by other means. Documents transported via HTTP can have their encoding declared in the HTTP headers.
Since the default encoding of XML is UTF-8, declaring this encoding either has no effect or causes parsing errors such as the one above. It offers no advantage.
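
To illustrate the point about HTTP headers, here is a rough Python sketch of a client that takes the encoding from the Content-Type header instead of from the XML declaration. The header value and the payload are assumed for illustration, and the helper name charset_from_content_type is hypothetical. Falling back to UTF-8 when no charset parameter is present is a simplification made by this sketch.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or default


# Assumed header value and body, for illustration only.
header = "text/xml; charset=utf-8"
body = '<api><page title="Albert Einstein"/></api>'.encode("utf-8")

# Decode using the transport-level information; no XML declaration is needed.
text = body.decode(charset_from_content_type(header))
print(text)
```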

Recommendation:
Remove the encoding declaration.
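
As a text transformation, the recommendation amounts to something like the following Python sketch. It is not MediaWiki code; the helper name strip_encoding_declaration is hypothetical, and the regular expression only handles a declaration at the very start of the text.

```python
import re

# Matches '<?xml version="..."' followed by an encoding pseudo-attribute.
_DECL_ENCODING = re.compile(
    r'''^(<\?xml\s+version\s*=\s*(?:"[^"]*"|'[^']*'))\s+encoding\s*=\s*(?:"[^"]*"|'[^']*')'''
)


def strip_encoding_declaration(xml_text: str) -> str:
    """Drop the encoding pseudo-attribute from a leading XML declaration, if any."""
    return _DECL_ENCODING.sub(r"\1", xml_text, count=1)


print(strip_encoding_declaration('<?xml version="1.0" encoding="utf-8"?><api/>'))
```

Text without an XML declaration, or with a declaration that names no encoding, passes through unchanged.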

Workarounds:

  1. Use the XML extension element instead.
  2. Use MSXML.DOMDocument directly from script.

Version: unspecified
Severity: normal
OS: Windows XP
Platform: PC
URL: http://en.wikipedia.org/w/api.php?action=query&titles=Albert%20Einstein&prop=info&format=xml

Details

Reference
bz15497

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 10:22 PM)
bzimport set Reference to bz15497.

giecrilj wrote:

Oops, I got the LABEL wrong. Here is a correction:

<BODY>
<FORM ID="MAIN" ACTION="#MAIN">
<FIELDSET CLASS="RESULT">
<LEGEND >XML loaded</LEGEND>
<P>
The document

loaded from the <LABEL >URL <INPUT TYPE=TEXT ID=FURL READONLY ></LABEL >
contains <LABEL >the following code:
<TEXTAREA ID=FXML COLS=80 ROWS=25 READONLY ></TEXTAREA ></LABEL ></FIELDSET>

<FIELDSET CLASS="ERROR" ><LEGEND >XML not loaded</LEGEND>
<P>
<LABEL >REASON: <BR ><TEXTAREA ID=FWHY COLS=80 READONLY ></TEXTAREA ></LABEL>
<P>
<LABEL >SOURCE: <BR ><TEXTAREA ID=FWHERE COLS=80 READONLY ></TEXTAREA ></LABEL>
</FORM ></BODY>

Can someone who actually knows this stuff confirm that changing

<?xml version="1.0" encoding="utf-8"?>

to

<?xml version="1.0" ?>

is OK?

(In reply to comment #2)

> Can someone who actually knows this stuff confirm that changing
>
> <?xml version="1.0" encoding="utf-8"?>
>
> to
>
> <?xml version="1.0" ?>
>
> is OK?

Seems to be the case. According to http://www.w3.org/TR/REC-xml/#charencoding (last couple of paragraphs of that section, really), the encoding declaration is _only_ required if you're not presenting utf-8, as utf-8 is the fallback.
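
A quick way to check this is the Python sketch below, which uses the standard library's parser and a made-up payload containing a non-ASCII character. It parses the same UTF-8 bytes with and without the encoding declaration and compares what comes out.

```python
import xml.etree.ElementTree as ET

# Made-up payload with a non-ASCII character (o with diaeresis).
payload = '<api><page title="G\u00f6del"/></api>'

with_decl = ('<?xml version="1.0" encoding="utf-8"?>' + payload).encode("utf-8")
without_decl = ('<?xml version="1.0" ?>' + payload).encode("utf-8")

# Both byte streams are UTF-8; only the declaration differs.
title_with = ET.fromstring(with_decl).find("page").get("title")
title_without = ET.fromstring(without_decl).find("page").get("title")

print(title_with == title_without)
```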

jpatokal wrote:

As per http://www.w3.org/TR/REC-xml/#sec-TextDecl 4.3.1, "External parsed entities SHOULD each begin with a text declaration." MediaWiki should follow W3C recommendations instead of bending over for Microsoft bugs.