
mwdumper java.lang.IllegalArgumentException: Invalid contributor
Closed, Declined (Public)

Description

Author: robertb

Description:
Trying to convert the Simple English Wikipedia XML dump to an SQL file (i.e. without on-the-fly insertion into a database), I get a Java exception after a partially successful conversion. Here's what is displayed:

...
8,740 pages (41.603/sec), 376,000 revs (1,789.777/sec)
8,778 pages (41.713/sec), 377,000 revs (1,791.493/sec)
8,801 pages (41.713/sec), 378,000 revs (1,791.554/sec)
Exception in thread "main" java.lang.IllegalArgumentException: Invalid contributor
at org.mediawiki.importer.XmlDumpReader.closeContributor(Unknown Source)
at org.mediawiki.importer.XmlDumpReader.endElement(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
at org.apache.xerces.parsers.AbstractXMLDocumentParser.emptyElement(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanStartElement(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(Unknown Source)
at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
at org.mediawiki.dumper.Dumper.main(Unknown Source)

Versions:

OS: Linux 2.6.17-1.2142 (Fedora Core 4)
Java: 1.6.0_13-b03
mwdumper: 2008-04-13
Data: Simple English Wikipedia dump of 2009-03-30

Invocation:

java -Xmx512m -Xms128m -XX:NewSize=32m -XX:MaxNewSize=64m -XX:SurvivorRatio=6 -XX:+UseParallelGC -XX:GCTimeRatio=9 -XX:AdaptiveSizeDecrementScaleFactor=1 -server -jar mwdumper.jar --format=sql:1.5 simplewiki-20090330-pages-meta-history.xml > simplewiki-20090330-pages-meta-history.sql &

What's going on, and how can this problem be fixed?


Version: unspecified
Severity: normal

Details

Reference: bz18328

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 10:34 PM)
bzimport set Reference to bz18328.

robertb wrote:

Having looked further into this issue, I think I have isolated the problem. As of recent versions of MediaWiki (from around the end of 2008), it is possible for the text, comment and/or contributor of a revision to be completely deleted for legal reasons (copyright infringement, libel, etc.). This is described in more detail here: http://www.mediawiki.org/wiki/Bitfields_for_rev_deleted

An example of a revision with a deleted contributor in the 2009-03-30 dump of Simple English Wikipedia looks like this:

<revision>
  <id>1460119</id>
  <timestamp>2009-03-30T11:34:51Z</timestamp>
  <contributor deleted="deleted" />
  <comment>Replaced content with 'Majorly is heaps shit'</comment>
  <text xml:space="preserve">Majorly is heaps shit</text>
</revision>

When mwdumper encounters the contributor element <contributor deleted="deleted" />, it chokes on it.

So it looks like the code needs to be fixed to handle deleted contributors. The question is: what should be put in place of the contributor name?
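
To make the failure mode concrete: with a SAX parser, the empty element <contributor deleted="deleted" /> still fires a startElement/endElement pair, but no <id>, <username> or <ip> children ever arrive in between, so the reader reaches the end of the element with nothing to build a Contributor from. A minimal, self-contained sketch of that behaviour (illustrative only, not mwdumper's actual code):

import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class DeletedContributorDemo {
    public static void main(String[] args) throws Exception {
        String xml = "<revision><id>1460119</id>"
                + "<contributor deleted=\"deleted\" /></revision>";
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    public void startElement(String uri, String localName,
                            String qName, Attributes attribs) {
                        if (qName.equals("contributor")) {
                            // The only clue that attribution was suppressed:
                            System.out.println("deleted = "
                                    + attribs.getValue("deleted"));
                        }
                    }
                    public void endElement(String uri, String localName,
                            String qName) {
                        // For the empty element this fires immediately, with
                        // no child elements seen in between -- exactly the
                        // state that closeContributor() rejects.
                    }
                });
    }
}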

kurzum wrote:

test file to reproduce

Attached:

kurzum wrote:

I fixed the bug for myself.
It's probably not the nicest code, but it'll work.
The user IP is set to 127.0.0.1.
Hope it will help.

In XmlDumpReader.java (I attached my version):

line 152: else if (qName.equals("contributor")) openContributor(attributes);

and at about line 333:

void openContributor(Attributes attribs) {
    // If the <contributor> element carries deleted="deleted", substitute a
    // placeholder IP so that closeContributor() does not reject the revision.
    String deleted = attribs.getValue("deleted");
    if (deleted != null && deleted.equals("deleted")) {
        contrib = new Contributor("127.0.0.1");
    } else {
        contrib = null;
    }
}
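
One side effect of this choice, worth flagging for anyone reusing the patch: in the generated SQL, the substituted 127.0.0.1 is indistinguishable from a genuine anonymous edit made from localhost, so any downstream analysis that counts anonymous contributors will treat the suppressed revisions as real edits.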

kurzum wrote:

proposed fix

Attached:

Please attach all patches as a unified diff against trunk, rather than the complete file.

srini wrote:

Can you post the jar somewhere, so that people who want to use a working version of mwdumper can get it? I made the code changes, but I don't have the required libs to compile the package.

(In reply to comment #3)


ts77 wrote:

I'd appreciate that too, as I couldn't find a link to even download the source to apply this change.

kurzum wrote:

Diff for patching.

I used:
svn diff > invalid.contibutor.patch

Hope this is correct.

@Chad: if not, please tell me how to create a unified diff, as this is the first time I have tried to create one.

Attached:

The patch format is correct, yes, but I'm not sure I like your proposed fix. Correct me if I'm wrong, but basically you're saying that if the contributor has been deleted from a revision, we set it to 127.0.0.1?

kurzum wrote:

Yes. I was not sure what to put there.
127.0.0.1 seemed a reasonable choice, because I was sure MediaWiki could handle it.
Other options I considered were a user named 'deleted', or leaving it blank (where again I wasn't sure whether MediaWiki or the MySQL database would choke on it).

I can change it again, but I'm not sure what would be best.

if (contrib == null) {
    throw new IllegalArgumentException("Invalid contributor");
}

This code says that the contributor must not be null.

So if it is set to null, I'm quite sure the program will break at another point, throwing a NullPointerException.

Basically, it is a hack, but still better than not being able to import Wikipedia dumps at all.
(And sorry for not answering for such a long time; I was on holiday for a month.)
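
For comparison, a hypothetical alternative to the placeholder (sketched here against the snippets above; contributorDeleted is an invented field, not something in the existing source, and this is a fragment, not a drop-in patch) would be to remember the suppression in openContributor() and let closeContributor() pass the null through deliberately:

// Hypothetical sketch only; 'contributorDeleted' is an invented field,
// and whatever attaches 'contrib' to the revision downstream would
// need to accept null as well.
private boolean contributorDeleted = false;

void openContributor(Attributes attribs) {
    contributorDeleted = "deleted".equals(attribs.getValue("deleted"));
    contrib = null;
}

void closeContributor() {
    // Tolerate a suppressed contributor instead of throwing:
    if (contrib == null && !contributorDeleted) {
        throw new IllegalArgumentException("Invalid contributor");
    }
    // ...existing code would attach 'contrib' (possibly null) here...
    contrib = null;
}

The trade-off: a null keeps the data honest but pushes the burden onto every consumer of the dump, while the 127.0.0.1 placeholder keeps the rest of the pipeline working unmodified at the cost of fabricated attribution.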

martin wrote:

(In reply to comment #8)

here is a link for the jar file that fixes the bug

http://downloads.dbpedia.org/mwdumper_invalid_contributor.zip

I downloaded the jar file specified above, and there seem to be additional changes to the file. I cannot get it to output SQL for schema 1.4 or 1.5. I downloaded the file twice and ran it against the enwikipedia dump, and it completed successfully; however, when I looked at the output file it was XML, not SQL. I then downloaded the production version of mwdumper.jar, http://download.wikimedia.org/tools/mwdumper.jar, with the same command line, and it died due to the bug, but it also put out SQL as requested. For clarity, the command line was: java -jar mwdumper.jar --format=sql1.4 --output=file:test.sql enwikipedia-20090708.xml. Am I missing something, or is there an issue in the program with processing flags for SQL output?

agrozny wrote:

(In reply to comment #12)

I've just tested this version of mwdumper and it correctly produced an SQL file (I tried both the 1.4 and 1.5 formats).
Perhaps there's a syntax error in your invocation: it should be --format=sql:1.5, not --format=sql1.5.

But I ran into another problem.
After generating an SQL file from the XML one this way:

java -jar mwdumper.jar --format=sql:1.5 enwiki-20090713-pages-articles.xml > import20090713.sql

I imported import20090713.sql into a MySQL database, but I only get 2,700,000 rows in the page and revision tables and 2,700,937 rows in the text table, while it should be 8,801,763 pages according to http://download.wikimedia.org/enwiki/20090713/
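
One thing worth checking here (an assumption on my part, not a confirmed diagnosis): the mysql client stops at the first error by default, and mwdumper batches many rows into each multi-row INSERT, so a single oversized statement or duplicate-key error partway through the file can cap the row count at wherever the failure occurred. Raising max_allowed_packet in the server's my.cnf and re-running the import while watching for error output (or with the client's --force option, which continues past errors) is a cheap first test.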


rainman wrote:

Cannot reproduce using the test file with the latest mwdumper from SVN. I also ran the conversion on the latest simplewiki history snapshot (20090817) and it went clean. So, did someone fix this, or what?

Closing worksforme.

arjunmeht wrote:

Is there any way someone could provide a mirror of http://downloads.dbpedia.org/mwdumper_invalid_contributor.zip?
The link seems to point to a server that is down or unavailable.

kurzum wrote:

It will be up again on Tuesday; we are doing server maintenance...
BTW: I think it should be fixed in the original code by now.
Did you try?

arjunmeht wrote:

Thanks Sebastian,

I've tried to compile it, but I don't have the gcj compiler (OS X), so I hit a roadblock there. And I'm not that savvy with working with source packages. :)

I'm sure there are many others like me... it would be great if the latest compiled JAR file were available to the general public at all times! The latest one linked from the MediaWiki site is from 2007 and, as we know, can't really handle the more recent XML dumps.

kurzum wrote:

OK, I deleted my version. But I think Daniel Kinzler fixed it in the code back then, so I just compiled it again and uploaded it to the newly set up server.

Here you go:
http://downloads.dbpedia.org/

http://downloads.dbpedia.org/mwdumpedr.jar

To compile:

svn co http://svn.wikimedia.org/svnroot/mediawiki/trunk/mwdumper mwdumper
cd mwdumper
ant jar
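
In case it saves the next person a roadblock: this build needs a standard JDK plus Apache Ant, not gcj, and (assuming the build places mwdumper.jar in the project directory, which may vary by revision) the result can be run exactly like the release jar:

java -jar mwdumper.jar --format=sql:1.5 simplewiki-20090330-pages-meta-history.xml > simplewiki.sql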

BTW, the goal of the DBpedia project is to provide structured data extracted from Wikipedia in a machine-readable format (see http://dbpedia.org). I think that for the most common use cases (like getting a list of all article titles in Wikipedia, or all geocoordinates), the data in DBpedia should be quite sufficient. We plan to include provenance data as well. Just to mention an alternative to getting mwdumper and extracting the information yourself...

Hope I could help,
Sebastian

arjunmeht wrote:

Sebastian, thank you so much for this!
Hopefully this will be useful for others in the same position as me down the line.

I will try compiling it using your instructions, but this should help in the interim.

I've certainly looked into DBpedia, and you provide a really great alternative for getting at the data. Amazing work!

The Wikipedia databases can be a bit unwieldy, and I feel like MediaWiki needs a bit more capability in its Special:Export (e.g. access to detailed category info through GET).

Anyway, thank you so much! Huge help.
Arjun

arjunmeht wrote:

The compile worked, and the latest source seems to have this issue resolved. MediaWiki should have these instructions on the mwdumper.jar page. I'll try to add them now. :)

Thanks again
Arjun

I've had the same problem. A fix should be merged, together with the other fixes.

Bean49: Could you please elaborate on exactly what "the same problem" means, by providing exact steps and URLs to reproduce it? Thanks!

(In reply to comment #23)
Sorry! I used the jar from http://download.wikimedia.org/tools/mwdumper.jar, and I should not have. Thanks for your attention.

harry1357931 wrote:

Hey guys,

I am getting the same problem: IllegalArgumentException: Invalid contributor.

Can someone suggest which mwdumper file to use, and where it is available?
Also, can someone please post a link to the text, revision and page table definitions required to transfer the data into MySQL? I don't know the format of these tables.

Singh: The first result in an internet search engine was http://www.mediawiki.org/wiki/Manual:MWDumper for me... Please ask follow-up questions at https://www.mediawiki.org/wiki/Project:Support_desk . Thanks!