Special:Export output missing XML encoding
Open, Low, Public

Description

Author: jpatokal

Description:
XML files produced by Special:Export are UTF-8, but this encoding is not indicated in the file. This violates the XML spec, and causes many XML manipulation programs like xmlstarlet to trash any special characters inside.

Trivially fixed by prepending this to all exported files:

<?xml version="1.0" encoding="UTF-8" ?>
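As an illustration of the proposed fix, here is a minimal sketch in Python of a post-processing step that prepends the declaration to an already exported file. This is not MediaWiki code; the function name `ensure_xml_declaration` and the constant `XML_DECLARATION` are hypothetical, and the sketch assumes the export bytes are UTF-8, as the report states.

```python
XML_DECLARATION = b'<?xml version="1.0" encoding="UTF-8" ?>\n'

# Prefixes that mark a real XML declaration. Checking for "<?xml" followed by
# whitespace avoids mistaking e.g. "<?xml-stylesheet ...?>" for a declaration.
_DECLARATION_PREFIXES = (b"<?xml ", b"<?xml\t", b"<?xml\r", b"<?xml\n")


def ensure_xml_declaration(data: bytes) -> bytes:
    """Return data with a UTF-8 XML declaration prepended if it has none.

    Hypothetical helper; assumes data is a UTF-8 encoded XML document.
    """
    # Drop a UTF-8 byte order mark so the declaration ends up first.
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]
    # A declaration, if present, must be the very first thing in the document.
    if data.startswith(_DECLARATION_PREFIXES):
        return data
    return XML_DECLARATION + data


if __name__ == "__main__":
    exported = "<mediawiki><title>Zürich</title></mediawiki>".encode("utf-8")
    print(ensure_xml_declaration(exported).decode("utf-8"))
```

Calling the function on data that already starts with a declaration returns it unchanged, so the step can be applied repeatedly without stacking declarations.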


Version: 1.13.x
Severity: minor

Details

Reference
bz15914

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 10:20 PM
bzimport set Reference to bz15914.
bzimport added a subscriber: Unknown Object (MLST).

Cf. bug 15497.

We removed it from the API's output because the W3C seems to say that the encoding doesn't need to be declared on UTF-8 output, as UTF-8 is the default. Rather, it only needs to be added when outputting non-UTF-8 data.

The link in question: http://www.w3.org/TR/REC-xml/#charencoding
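The default described in this comment can be checked against a conforming parser. The sketch below uses Python's standard `xml.etree.ElementTree` (backed by expat) purely as an example of such a parser; the sample documents and the helper name `title_text` are made up for illustration and have nothing to do with Special:Export itself.

```python
import xml.etree.ElementTree as ET


def title_text(document: bytes) -> str:
    """Parse raw XML bytes and return the text of the root element."""
    return ET.fromstring(document).text


# UTF-8 bytes with no declaration: the parser falls back to UTF-8.
undeclared = "<title>Zürich</title>".encode("utf-8")

# The same content with an explicit UTF-8 declaration.
declared = b'<?xml version="1.0" encoding="UTF-8" ?>\n' + undeclared

# A non-UTF-8 encoding is the case where the declaration is required.
latin1 = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    + "<title>Zürich</title>".encode("iso-8859-1")
)

if __name__ == "__main__":
    for doc in (undeclared, declared, latin1):
        print(title_text(doc))
```

All three documents should yield the same title with this parser, which suggests that tools mangling the undeclared form are applying a different default rather than following the one in the specification.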

jpatokal wrote:

It may not be mandatory, but according to 4.3.1, "External parsed entities SHOULD each begin with a text declaration." Why would you want to remove it?

(In reply to comment #2)

It may not be mandatory, but according to 4.3.1, "External parsed entities
SHOULD each begin with a text declaration." Why would you want to remove it?

Someone complained about issues caused by the encoding declaration and pointed out that it could be removed safely, quoting the w3c link in comment #1. I didn't see any harm in it, so I removed it.

What are the issues that warranted removal?

(In reply to comment #4)

What are the issues that warranted removal?

I have no idea, this was ages ago; I searched through the mailing list archives, but found nothing (in topic titles that is; can't search the message contents themselves in a convenient way).

(In reply to comment #0)

causes many XML
manipulation programs like xmlstarlet to trash any special characters inside.

To me it looks like many XML manipulation programs are broken and should be fixed, regardless of whether we output something that merely states the obvious (UTF-8 is the default encoding if nothing else is specified).

(In reply to comment #6)

(In reply to comment #0)

causes many XML
manipulation programs like xmlstarlet to trash any special characters inside.

To me it looks like many XML manipulation programs are broken and should be
fixed, regardless of whether we output something that merely states the obvious
(UTF-8 is the default encoding if nothing else is specified).

It seems that there are both XML parsers that mess up stuff when utf-8 is *not* specified, and those that mess up stuff when it *is* specified (which was probably why I was asked to remove it in the first place). If that is the case, there's not really anything we can do to appease both compliant and non-compliant parsers, like we could with the xml:space="preserve" issue.

jpatokal wrote:

IMHO, a parser that incorrectly handles an explicitly declared encoding is more broken than one that uses an incorrect default for a file with no encoding. As quoted above, the XML spec says "External parsed entities SHOULD each begin with a text declaration", so declaring the encoding is the correct thing to do.

(In reply to comment #8)

IMHO, a parser that incorrectly handles an explicitly declared encoding is more
broken than one that uses an incorrect default for a file with no encoding. As
quoted above, the XML spec says "External parsed entities SHOULD each begin
with a text declaration", so declaring the encoding is the correct thing to do.

I fail to see why files produced by Special:Export should be considered as external parsed entities.

See also Bug 22881 - Greatly improved Export and Import for 1.14.1 (with support for advanced page selection, exporting and importing file uploads, and detection of "conflicts" during import). There's a patch written by me which is related to or fixes your issue.

TTO set Security to None.
TTO removed a subscriber: wikibugs-l-list.