
Special:Import rejects valid XML and gives incorrect error
Closed, Resolved (Public)

Description

Author: jpatokal

Description:
Sample export showcasing the problem

Exports from Wikitravel (1.11.2, export-0.3) contain a "realname" tag as follows:

<mediawiki>

<page>
  <revision>
    <contributor>
      <realname>David</realname>
    </contributor>
    <comment> ... </comment>
    ...

(Sample export attached.) This makes Special:Import go batshit insane:

IMPORT: FAILURE: Invalid tag <realname> in <contributor> WikiImporter XML error: Invalid tag <realname> in <contributor>
IMPORT: out_contributor realname
IMPORT: POP contributor
IMPORT: FAILURE: Expected </contributor>, got </realname> WikiImporter XML error: Expected </contributor>, got </realname>
IMPORT: out_contributor contributor
IMPORT: POP revision
IMPORT: PARENT page
IMPORT: in_page comment
IMPORT: FAILURE: Element <comment> not allowed in a <page>. WikiImporter XML error: Element <comment> not allowed in a <page>.

And it bails out with the (totally incorrect) error "All revisions were previously imported". Removing the offending line fixes the problem, but there are still quite a few WTFs here:

  1. XML specs mandate that the parser should ignore unknown tags, not "FAIL" on them.
  2. Having that unknown tag should not cause it to incorrectly pop out of <revision> and then fail to read the rest of the file.
  3. At the very least, it should abort on the first error and display it properly (fail-fast), instead of blindly proceeding and then giving the user the wrong error. (In includes/specials/SpecialImport.php, any result with successCount = 0, that is, any failure at all, is logged as 'import-nonewrevisions'!)
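
To illustrate points 2 and 3, here is a minimal sketch of a tolerant streaming reader. It is written in Python rather than MediaWiki's PHP, and it is not the WikiImporter code: the KNOWN_CHILDREN table and the parse_revisions function are made up for this example. An unknown element is skipped together with its whole subtree by counting nesting depth, so the element stack never pops out of <revision> early. Genuinely malformed XML still raises on the first error, because expat stops there.

```python
import xml.parsers.expat

# Hypothetical whitelist of child elements per parent (not the real export schema).
KNOWN_CHILDREN = {
    "mediawiki": {"page"},
    "page": {"title", "revision"},
    "revision": {"contributor", "comment", "text"},
    "contributor": {"username", "id", "ip"},
}


def parse_revisions(xml_text):
    """Return (revisions, skipped); unknown elements are skipped as a whole."""
    stack = []        # currently open known elements, innermost last
    skip_depth = 0    # > 0 while inside the subtree of an unknown element
    skipped = []      # human-readable notes about what was ignored
    revisions = []
    current = None    # dict for the revision being read, if any
    text = []         # character data of the innermost known element

    def start(name, attrs):
        nonlocal skip_depth, current
        if skip_depth:
            skip_depth += 1           # nested inside something already skipped
            return
        parent = stack[-1] if stack else None
        if parent is not None and name not in KNOWN_CHILDREN.get(parent, set()):
            skipped.append(f"<{name}> in <{parent}>")
            skip_depth = 1            # start skipping; the stack is left untouched
            return
        stack.append(name)
        text.clear()
        if name == "revision":
            current = {}

    def end(name):
        nonlocal skip_depth, current
        if skip_depth:
            skip_depth -= 1           # leaving (part of) a skipped subtree
            return
        stack.pop()
        if name == "revision":
            revisions.append(current)
            current = None
        elif current is not None and name in ("username", "comment"):
            current[name] = "".join(text).strip()

    def chars(data):
        if not skip_depth:
            text.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)      # raises ExpatError on malformed XML
    return revisions, skipped


sample = (
    "<mediawiki><page><title>Test</title><revision>"
    "<contributor><realname>David</realname></contributor>"
    "<comment>first edit</comment>"
    "</revision></page></mediawiki>"
)
revisions, skipped = parse_revisions(sample)
print(revisions)
print(skipped)
```

With this shape, <realname> is only recorded as skipped, and the following <comment> is still read as a child of <revision>.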

The same problem also affects importDump.php, which is even worse, as it just cheerily reports "Done!" even though the import failed.
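
The reporting side of this can be sketched the same way (again Python, with made-up names; ImportResult and report are not part of MediaWiki). The idea is only that "nothing new to import" and "the import failed" are different outcomes, and that a command-line tool should signal failure through a non-zero exit status instead of printing "Done!".

```python
from dataclasses import dataclass, field


@dataclass
class ImportResult:
    """Hypothetical summary of one import run."""
    imported: int = 0                           # revisions actually written
    errors: list = field(default_factory=list)  # error messages, in order


def report(result):
    """Return (message, exit_code) for an import run."""
    if result.errors:
        # Any error wins, even if the success count happens to be zero.
        return f"Import failed: {result.errors[0]}", 1
    if result.imported == 0:
        # Only a clean run with nothing new gets this message.
        return "All revisions were previously imported.", 0
    return f"Done! Imported {result.imported} revision(s).", 0


failed = ImportResult(errors=["Invalid tag <realname> in <contributor>"])
message, code = report(failed)
print(message, code)
```

A script would then pass the second value to its exit call, so that wrappers and cron jobs can tell a failed import from a successful one.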


Version: 1.13.x
Severity: major

Details

Reference
bz15913

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 10:20 PM)
bzimport set Reference to bz15913.
> 1. XML specs mandate that the parser should ignore unknown tags, not "FAIL" on them.

Totally untrue. :) XML doesn't specify any such thing; that's up to the domain-specific markup language. (XML is a meta-language, not a language per se.)

However... it really ought to just ignore unknown tags, since that's the nice thing to do.

The failure mode sounds wrong too, and should get fixed if it's still doing that.

I was able to import the attachment successfully using the latest code from git. Perhaps this has been fixed sometime in the last 5 years?

TTO set Security to None.
TTO subscribed.

Per Ejegg, import was revamped some time after October 2008.