
GCJ library bug: mwdumper dies with "not a name start character: "U+26"" error
Closed, Declined (Public)

Assigned To
Authored By
Kelson
Jan 18 2010, 8:36 AM
Referenced Files
F5955: Test.java
Nov 21 2014, 10:47 PM
F5953: sample.xml
Nov 21 2014, 10:47 PM
F5954: sample-d.xml
Nov 21 2014, 10:47 PM

Description

$ mwdumper --format=sql:1.5 itwiki-20100108-pages-articles.xml.bz2 | lzma -c > itwiki-20100108-pages-articles.sql.lzma
1000 pages (88,755/sec), 1000 revs (88,755/sec)
2000 pages (65,935/sec), 2000 revs (65,935/sec)
3000 pages (67,621/sec), 3000 revs (67,621/sec)
4000 pages (80,336/sec), 4000 revs (80,336/sec)
5000 pages (80,457/sec), 5000 revs (80,457/sec)
Exception in thread "main" java.io.IOException: not a name start character: "U+26"

at org.mediawiki.importer.XmlDumpReader.readDump(mwdumper)
at org.mediawiki.dumper.Dumper.main(mwdumper)

Caused by: org.xml.sax.SAXParseException: not a name start character: "U+26"

at gnu.xml.stream.SAXParser.parse(libgcj.so.81)
at javax.xml.parsers.SAXParser.parse(libgcj.so.81)
at javax.xml.parsers.SAXParser.parse(libgcj.so.81)
at org.mediawiki.importer.XmlDumpReader.readDump(mwdumper)
...1 more

Caused by: javax.xml.stream.XMLStreamException: not a name start character: "U+26"

at gnu.xml.stream.XMLParser.error(libgcj.so.81)
at gnu.xml.stream.XMLParser.readNmtoken(libgcj.so.81)
at gnu.xml.stream.XMLParser.readNmtoken(libgcj.so.81)
at gnu.xml.stream.XMLParser.readCharData(libgcj.so.81)
at gnu.xml.stream.XMLParser.next(libgcj.so.81)
at gnu.xml.stream.XMLParser.hasNext(libgcj.so.81)
at gnu.xml.stream.SAXParser.parse(libgcj.so.81)
...4 more

Version: unspecified
Severity: critical
OS: Linux
Platform: PC
URL: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43138
See Also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43138

Details

Reference
bz22137

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:47 PM
bzimport set Reference to bz22137.

Here is a diff that adds line and column information to the exception message:

--- src/org/mediawiki/importer/XmlDumpReader.java (revision 61197)
+++ src/org/mediawiki/importer/XmlDumpReader.java (working copy)
@@ -36,6 +36,7 @@
 import javax.xml.parsers.ParserConfigurationException;
 import javax.xml.parsers.SAXParser;
 import javax.xml.parsers.SAXParserFactory;
+import org.xml.sax.SAXParseException;
 
 import org.xml.sax.Attributes;
 import org.xml.sax.SAXException;
@@ -82,15 +83,17 @@
 	 */
 	public void readDump() throws IOException {
 		try {
-			SAXParserFactory factory = SAXParserFactory.newInstance();
-			SAXParser parser = factory.newSAXParser();
+			SAXParserFactory factory = SAXParserFactory.newInstance();
+			SAXParser parser = factory.newSAXParser();
 
 			parser.parse(input, this);
 		} catch (ParserConfigurationException e) {
 			throw (IOException)new IOException(e.getMessage()).initCause(e);
+		} catch (SAXParseException e) {
+			throw (IOException)new IOException(e.getMessage() + " (line: " + e.getLineNumber() + " column: " + e.getColumnNumber() + ")").initCause(e);
 		} catch (SAXException e) {
-			throw (IOException)new IOException(e.getMessage()).initCause(e);
-		}
+			throw (IOException)new IOException(e.getMessage()).initCause(e);
+		}
 		writer.close();
 	}
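
For readers who want to try the idea outside mwdumper, here is a minimal standalone sketch of the same technique: catch SAXParseException before the more general SAXException and append the parser's reported position to the message. The class name LocatedParse and its parse method are hypothetical and are not part of mwdumper. The message format mirrors the patch above.

```java
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not mwdumper code: parses a stream and reports
// the position of any parse error in the IOException message.
class LocatedParse {
    static void parse(InputStream in) throws IOException {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in, new DefaultHandler());
        } catch (ParserConfigurationException e) {
            throw new IOException(e.getMessage(), e);
        } catch (SAXParseException e) {
            // SAXParseException is a subclass of SAXException, so it must be caught first.
            throw new IOException(e.getMessage()
                    + " (line: " + e.getLineNumber()
                    + " column: " + e.getColumnNumber() + ")", e);
        } catch (SAXException e) {
            throw new IOException(e.getMessage(), e);
        }
    }
}
```

The exact line and column values depend on the parser implementation, so they are best treated as a hint for locating the offending input rather than as an exact offset.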

Created attachment 6965
Problematic part of the XML dump

I have extracted the problematic part of the dump, see attachment.

$ mwdumper --format=sql:1.5 sample.xml.bz2 | lzma -c -d > sample.sql.lzma
Exception in thread "main" java.io.IOException: not a name start character: "U+26" (line: 82 column: 1)

at org.mediawiki.importer.XmlDumpReader.readDump(mwdumper)
at org.mediawiki.dumper.Dumper.main(mwdumper)

Caused by: org.xml.sax.SAXParseException: not a name start character: "U+26"

at gnu.xml.stream.SAXParser.parse(libgcj.so.81)
at javax.xml.parsers.SAXParser.parse(libgcj.so.81)
at javax.xml.parsers.SAXParser.parse(libgcj.so.81)
at org.mediawiki.importer.XmlDumpReader.readDump(mwdumper)
...1 more

Caused by: javax.xml.stream.XMLStreamException: not a name start character: "U+26"

at gnu.xml.stream.XMLParser.error(libgcj.so.81)
at gnu.xml.stream.XMLParser.readNmtoken(libgcj.so.81)
at gnu.xml.stream.XMLParser.readNmtoken(libgcj.so.81)
at gnu.xml.stream.XMLParser.readCharData(libgcj.so.81)
at gnu.xml.stream.XMLParser.next(libgcj.so.81)
at gnu.xml.stream.XMLParser.hasNext(libgcj.so.81)
at gnu.xml.stream.SAXParser.parse(libgcj.so.81)
...4 more


Created attachment 7114
Much simpler case that demonstrates the error

This is a unicode issue. If you remove the


Bugzilla screwed up my comment:

This is a unicode issue. If you remove the

Ok, apparently bugzilla suffers from the same issue as mwdumper ;)

This is a Unicode issue. If you remove the <Unicode character removed from comment, lest bugzilla hate me> (U+1D59F, MATHEMATICAL BOLD FRAKTUR SMALL Z; the article claims it to be U+1D537, which is MATHEMATICAL FRAKTUR SMALL Z, but that's not the character that is actually in the text), everything works fine. Since it's not choking on more ordinary Unicode characters, I imagine it has something to do with that character being a 4-byte character.

It also appears that this interacts with other stuff in the file, as it doesn't cause the error by itself.

Specifically, entity references seem to be what causes it to die after encountering the Unicode character. I think it interprets the & character as if it were at the start of a tag name (hence starting a new tag), but & (aka U+0026) cannot start a tag name. Newline characters may also have something to do with it, as removing the newline between the Unicode character and the & changes the error message.

Changing summary to more adequately reflect what I think the problem is.

Attaching simpler test case.

Note also that if you replace the Unicode character with its entity reference (&#x1D59F;), everything works fine.
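
As a rough illustration of the two variants, the sketch below parses one document containing the literal character and one containing the numeric character reference, each followed by a newline and an entity reference, and compares the character data the parser delivers. The class name CharRefDemo and the tiny inline documents are made up for this example; they are not the attached sample.xml. On a parser without the bug both variants should produce identical text, so this shows the expected behaviour rather than reproducing the GCJ failure.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: collects the character data of a small XML document.
class CharRefDemo {
    /** Parses the XML (encoded as UTF-8) and returns the concatenated character data. */
    static String textOf(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // U+1D59F written as a literal character (a surrogate pair in Java source).
        String literal = "<a>\uD835\uDD9F\n&amp;</a>";
        // The same character written as a numeric character reference.
        String reference = "<a>&#x1D59F;\n&amp;</a>";
        System.out.println(textOf(literal).equals(textOf(reference)));
    }
}
```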

Java internally uses UTF-16

"The native coded character set of the Java programming language is that of the first seventeen planes of the Unicode version 3.0 character set; that is, it consists in the basic multilingual plane (BMP) of Unicode version 1 plus the next sixteen planes of Unicode version 3. This is because the language's internal representation of characters uses the UTF-16 encoding, which encodes the BMP directly and uses surrogate pairs, a simple escape mechanism, to encode the other planes. Hence a charset in the Java platform defines a mapping between sequences of sixteen-bit values in UTF-16 and sequences of bytes."
http://java.sun.com/j2se/1.4.2/docs/api/java/nio/charset/Charset.html
http://java.sun.com/javase/6/docs/api/java/nio/charset/Charset.html
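
To make the surrogate-pair mechanism from the quoted Javadoc concrete, here is a small sketch showing how the character from this bug is held in a Java String. The class name SurrogateDemo is made up for the example; Character.toChars and String.codePointCount are standard library methods.

```java
// Hypothetical demo class: shows the UTF-16 code units Java uses for a code point.
class SurrogateDemo {
    /** Returns the UTF-16 code units of a code point as space-separated hex. */
    static String utf16Units(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int z = 0x1D59F; // MATHEMATICAL BOLD FRAKTUR SMALL Z, outside the BMP
        String s = new String(Character.toChars(z));
        System.out.println(utf16Units(z));                   // D835 DD9F
        System.out.println(s.length());                      // 2 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // 1 (code point)
    }
}
```

Code that walks a String one char at a time therefore sees two units for this character, which is one place where a parser could plausibly go wrong.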

The file contains U+01D59F in UTF-8, thus F0 9D 96 9F. In binary 11110000 10011101 10010110 10011111
I don't see why it is reading a U+26 (100110).
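
The byte sequence can be double-checked with a few lines of Java. This is only a verification sketch; the class name Utf8Demo is made up for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: prints the UTF-8 bytes of a code point as hex.
class Utf8Demo {
    /** Returns the UTF-8 encoding of a code point as space-separated hex bytes. */
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex(0x1D59F)); // F0 9D 96 9F
        System.out.println(utf8Hex('&'));     // 26
    }
}
```

None of the four bytes is 0x26, so the U+26 in the error message cannot come from the encoded character itself.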

PS: Maybe Bugzilla is using MySQL as UTF-8 instead of binary? MySQL's Unicode support currently only covers the BMP.

Java internally uses UTF-16

Yes it does, but I think the file is interpreted as UTF-8; otherwise it wouldn't be able to make sense of it at all, as UTF-8 and UTF-16 look fairly different for your average English text (I'm under the impression that UTF-16 is not compatible with ASCII, thus nothing would work at all if it was using UTF-16).
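
The ASCII-compatibility point can be checked directly. The sketch below compares the bytes of an ASCII-only tag under US-ASCII, UTF-8 and UTF-16; the class name AsciiCompat is made up, and big-endian UTF-16 without a byte order mark is chosen only to keep the comparison simple.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: compares encodings of ASCII-only text.
class AsciiCompat {
    /** True if encoding ASCII-only text with cs yields the same bytes as US-ASCII. */
    static boolean sameAsAscii(String text, Charset cs) {
        return Arrays.equals(text.getBytes(StandardCharsets.US_ASCII), text.getBytes(cs));
    }

    public static void main(String[] args) {
        String tag = "<page>";
        System.out.println(sameAsAscii(tag, StandardCharsets.UTF_8));       // true
        System.out.println(sameAsAscii(tag, StandardCharsets.UTF_16BE));    // false
        System.out.println(tag.getBytes(StandardCharsets.UTF_16BE).length); // 12
    }
}
```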

I don't see why it is reading a U+26 (100110).

The entity references that come after the problematic Unicode character are where the U+26 (&) comes from. It's not considered a valid (tag) start character in XML. The question is why Java, after failing to interpret the fancy Unicode character, would think that the document was starting a new tag. If you interpret F0 9D 96 9F as UTF-16, you get:

   U+F09D:   No name (Private Use Area)
隟   U+969F:   Han ideograph   (CJK Unified Ideographs)

Which theoretically shouldn't cause any problems. (Of course the rest of the file wouldn't make sense, and there are no guarantees that that is where the word boundaries would fall.)
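
That reading of the bytes can be reproduced in Java by decoding them as UTF-16. The sketch assumes big-endian byte order, which is what the two code points listed above imply; the class name MisdecodeDemo is made up for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: decodes the UTF-8 bytes of U+1D59F as if they were UTF-16BE.
class MisdecodeDemo {
    static String decodeAsUtf16BE(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xF0, (byte) 0x9D, (byte) 0x96, (byte) 0x9F};
        for (char c : decodeAsUtf16BE(utf8).toCharArray()) {
            System.out.printf("U+%04X%n", (int) c); // U+F09D, then U+969F
        }
    }
}
```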

I'm thinking this is a bug with the underlying Java libraries, as opposed to mwdumper.

(In reply to comment #7)

Java internally uses UTF-16

yes it does, but I think the file is interpreted as utf-8, otherwise it
wouldn't be able to make sense of it at all, as utf-8 and utf-16 look fairly
different for your average english text (I'm under the impression that utf-16
is not compatible with ASCII thus nothing would work at all if it was using
utf-16).

Right. But it could be overflowing a 16-bit value, or some other failure.

I don't see why it is reading a U+26 (100110).

The entity references that come after the problematic unicode character is
where the U+26 (&) comes from.

Interesting. Saving from Firefox produced a literal " in the output.

I'm thinking this is a bug with the underlying java libraries, as opposed to
mwdumper

I also think so.

Created attachment 7115
Much simpler Java code that demonstrates the error

Compile with:
gcj -o test --main=Test Test.java

Run with the demo XML saved as test.xml.


Sun jdk / OpenJdk is not affected.

Seems to be a bug in gcj or libgcj. See my email to the java gcc ML:
http://gcc.gnu.org/ml/java/2010-02/msg00000.html

In the meantime, Platonides (or anyone with SVN write access), could you please apply the patch from my comment #1 (https://bugzilla.wikimedia.org/show_bug.cgi?id=22137#c1)?

Without it, it is impossible to know on which line a SAX parsing error occurs.

No problem with test case and sample code w/ Apple Java 1.6 on Mac OS X 10.6.2:

java version "1.6.0_17"
Java(TM) SE Runtime Environment (build 1.6.0_17-b04-248-10M3025)
Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01-101, mixed mode)

Since this is reported above as working on OpenJDK and appears to be a GCJ-specific problem, I have marked this as upstream and noted the GCJ relation in the summary. Add an upstream bug reference once it gets handled a little more upstream.

I'm just going to close this one out, since with OpenJDK nobody's too worried about GCJ. :)