
ERROR 1064: SQL syntax error near ''{{Infobox military person\n|name=Alexander Holle\n|birth_date=27 February 1898\' at line 1 (mwdumper fails to import English WP dump)
Closed, Resolved (Public)

Description

Author: piotr.jagielski

Description:
Hello

I'm trying to use mwdumper to import the latest English Wikipedia dump (enwiki-20131104-pages-articles.xml). It fails with the following error:

10 045 000 pages (1 658,325/sec), 10 045 000 revs (1 658,325/sec)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048

at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)

at javax.xml.parsers.SAXParser.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(Unknown Source)
at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
at org.mediawiki.dumper.Dumper.main(Dumper.java:142)

ERROR 1064 (42000) at line 79047: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ''{{Infobox military person\n|name=Alexander Holle\n|birth_date=27 February 1898\' at line 1


Version: unspecified
Severity: critical

Details

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 2:22 AM
bzimport set Reference to bz57236.

piotr.jagielski wrote:

Why is it unconfirmed? I ran into it again with the latest dump. Do you need additional information to reproduce it?

It'll be confirmed when a second person has reproduced it.

piotr.jagielski wrote:

Was anyone here able to import the latest dump (20140402) with mwdumper? If there is a chance that it's an issue with my local environment I'd be glad to know.

piotr.jagielski wrote:

Is there anyone here who uses mwdumper to import the English Wikipedia XML dump? I have tried several dumps from the past few months and I always run into some blocking issue.

mad2one48 wrote:

I have the same problem with the file enwiki-20140502-pages-articles.xml

mad2one48 wrote:

13,200,000 pages (5,538.948/sec), 13,200,000 revs (5,538.948/sec)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 8192
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:546)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1753)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.arrangeCapacity(XMLEntityScanner.java:1629)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipString(XMLEntityScanner.java:1667)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1747)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2957)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:333)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:96)
at org.mediawiki.dumper.Dumper.main(Dumper.java:142)

mad2one48 wrote:

piotr, did you find the problem?

chris.padfield wrote:

This is a Xerces bug, documented at https://issues.apache.org/jira/browse/XERCESJ-1257

The workaround suggested is to use the JVM's UTF-8 reader instead of the Xerces UTF8Reader.
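
As a rough illustration of that workaround, the sketch below decodes the bytes with the JVM's own `InputStreamReader` and hands the SAX parser a character `Reader`, so the parser does not have to choose a byte decoder itself. The class name `Utf8ReaderWorkaround` and the method `countPages` are hypothetical names made up for this example. This is not mwdumper's actual code, and it is an assumption that the same approach would slot cleanly into `XmlDumpReader.readDump`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only (not part of mwdumper).
class Utf8ReaderWorkaround {

    /** Counts <page> elements in a UTF-8 XML stream decoded by the JVM. */
    static int countPages(InputStream in)
            throws IOException, SAXException, ParserConfigurationException {
        // Decode bytes to characters with the JVM's UTF-8 decoder.
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        // An InputSource built from a Reader gives the parser characters,
        // so it has no byte stream to decode on its own.
        InputSource source = new InputSource(reader);

        final int[] pages = {0};
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(source, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if ("page".equals(qName)) {
                    pages[0]++;
                }
            }
        });
        return pages[0];
    }

    public static void main(String[] args) throws Exception {
        // Tiny stand-in for a dump; the title holds a character outside the BMP,
        // written as a surrogate-pair escape.
        String xml = "<mediawiki><page><title>\uD83D\uDE00</title></page>"
                   + "<page><title>B</title></page></mediawiki>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(countPages(in)); // prints 2
    }
}
```

When the parser is given a `Reader`, any encoding named in the XML declaration is not used for decoding, so this only makes sense for input known to be UTF-8.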

chris.padfield wrote:

And definitely confirmed:

649,000 pages (1,281.975/sec), 649,000 revs (1,281.975/sec)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048

at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
at org.mediawiki.dumper.Dumper.main(Dumper.java:142)

piotr.jagielski wrote:

The only workaround I came up with is trying a different dump. I was able to import enwiki-20140707-pages-articles.xml.

brion set Security to None.
Aklapper renamed this task from mwdumper fails to import English wikipedia dump: ArrayIndexOutOfBoundsException; error in SQL syntax to ERROR 1064: SQL syntax error near ''{{Infobox military person\n|name=Alexander Holle\n|birth_date=27 February 1898\' at line 1 (mwdumper fails to import English WP dump). Apr 23 2016, 9:06 AM
Aklapper removed a project: Upstream.

Change 285004 had a related patch set uploaded (by Brion VIBBER):
Update Xerxes to 2.11.0

https://gerrit.wikimedia.org/r/285004

Ok, if you build from source things should no longer encounter this error. I'll file a new task about fixing up the download links, as it seems to have fallen off jenkins ci.

Ok, if you build from source things should no longer encounter this error.

Thanks Brion, great to see progress in this area. :)