
mwdumper uses too much memory
Closed, Declined (Public, Feature Request)

Description

I tried to run the GUI version of the newest revision (r60229) of mwdumper under Java 6 update 17 on an Intel Core i7 with 3.25 GB RAM and WinXP SP3, and it gave this error:

Exception in thread "Thread-8" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Unknown Source)
at java.lang.StringCoding.safeTrim(Unknown Source)
at java.lang.StringCoding.access$300(Unknown Source)
at java.lang.StringCoding$StringEncoder.encode(Unknown Source)
at java.lang.StringCoding.encode(Unknown Source)
at java.lang.String.getBytes(Unknown Source)
at com.mysql.jdbc.StringUtils.getBytes(StringUtils.java:493)
at com.mysql.jdbc.StringUtils.getBytes(StringUtils.java:603)
at com.mysql.jdbc.ByteArrayBuffer.writeStringNoNull(ByteArrayBuffer.java:544)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:1638)
at com.mysql.jdbc.Connection.execSQL(Connection.java:2972)
at com.mysql.jdbc.Connection.execSQL(Connection.java:2902)
at com.mysql.jdbc.Statement.execute(Statement.java:529)
at org.mediawiki.importer.SqlServerStream.writeStatement(SqlServerStream.java:25)
at org.mediawiki.importer.SqlWriter.flushInsertBuffer(SqlWriter.java:195)
at org.mediawiki.importer.SqlWriter.bufferInsertRow(SqlWriter.java:184)
at org.mediawiki.importer.SqlWriter15.writeRevision(SqlWriter15.java:68)
at org.mediawiki.importer.PageFilter.writeRevision(PageFilter.java:67)
at org.mediawiki.dumper.ProgressFilter.writeRevision(ProgressFilter.java:56)
at org.mediawiki.importer.XmlDumpReader.closeRevision(XmlDumpReader.java:346)
at org.mediawiki.importer.XmlDumpReader.endElement(XmlDumpReader.java:204)
at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(Unknown Source)

According to the Java docs, the default max heap size is 1/4 of physical memory, that is, around 800 MB here. Since a single revision is at most 2 MB, there is no reason for mwdumper to require that much space. (This was a run over the huwiki full-history dump, writing directly to the database.)
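
A quick way to confirm what ceiling the JVM actually picked on a given machine (a minimal sketch; the exact value depends on JVM version and on 32- vs 64-bit mode):

  // HeapCheck.java: print the JVM's effective max heap size.
  public class HeapCheck {
      public static void main(String[] args) {
          long max = Runtime.getRuntime().maxMemory(); // heap ceiling in bytes
          System.out.printf("Max heap: %d MB%n", max / (1024 * 1024));
      }
  }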


Version: unspecified
Severity: enhancement
OS: Windows XP
Platform: PC

Details

Reference
bz21937

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:46 PM
bzimport set Reference to bz21937.

After manually raising the max heap size, it ran smoothly, unlike the older versions available from download.wikimedia.org, which didn't even start. Is there any reason to recommend the broken old versions instead of a current one? ([[mw:MWDumper]] points to a third version, attached to a bug report, which also didn't seem to work.)

The solution seems to be to increase the size of the heap as explained on http://www.mediawiki.org/wiki/Manual:MWDumper#Troubleshooting
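
For reference, raising the heap amounts to passing a larger -Xmx value when launching the JVM. A sketch of the invocation (the jar name, dump file, and database credentials are placeholders; the --format option follows the MWDumper manual):

  java -Xmx1024m -jar mwdumper.jar --format=sql:1.5 pages_full.xml.bz2 | mysql -u wikiuser -p wikidb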

I'll mark this bug as Resolved and Worksforme; if the bug reporter feels that this is still an issue, then please reopen the bug.

As a bigger question though - why does it need so much memory? Doesn't it interpret the dump a little at a time, and thus shouldn't it need all that much memory?

(In reply to comment #2)

> The solution seems to be to increase the size of the heap as explained on
> http://www.mediawiki.org/wiki/Manual:MWDumper#Troubleshooting

Yeah, I'm probably aware of that, since I was the one who added it there :)

The point, as Bawolff said, is that MWDumper should not need a heap of ~1 GB when the largest revision is below 2 MB. Either there is a memory leak, or something is done really inefficiently.
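
For illustration, this is the streaming pattern a SAX-based importer is expected to follow, so memory stays bounded by a single revision plus the insert buffer. It is a minimal sketch assuming a bare <revision>/<text> element layout, not mwdumper's actual code:

  import java.io.File;
  import javax.xml.parsers.SAXParser;
  import javax.xml.parsers.SAXParserFactory;
  import org.xml.sax.Attributes;
  import org.xml.sax.helpers.DefaultHandler;

  public class StreamingDumpReader extends DefaultHandler {
      private final StringBuilder text = new StringBuilder();
      private boolean inText = false;

      @Override
      public void startElement(String uri, String local, String qName, Attributes atts) {
          if (qName.equals("text")) {
              inText = true;
              text.setLength(0); // reuse one buffer; never accumulate across revisions
          }
      }

      @Override
      public void characters(char[] ch, int start, int length) {
          if (inText) {
              text.append(ch, start, length);
          }
      }

      @Override
      public void endElement(String uri, String local, String qName) {
          if (qName.equals("text")) {
              inText = false;
          } else if (qName.equals("revision")) {
              writeRevision(text.toString()); // flush immediately, then discard
          }
      }

      private void writeRevision(String wikitext) {
          // The real tool buffers a bounded batch of INSERT rows here; this
          // stub just reports the size to show memory use stays constant.
          System.out.println("revision of " + wikitext.length() + " chars");
      }

      public static void main(String[] args) throws Exception {
          SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
          parser.parse(new File(args[0]), new StreamingDumpReader());
      }
  }

If memory still grows without bound with a handler shaped like this, the leak is in whatever writeRevision() hands the text to, which would match the insert-buffer frames visible in the trace above.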

brion set Security to None.
Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 12:24 PM
hashar subscribed.

mwdumper is no longer able to process dumps generated since MediaWiki 1.31 (released in June 2018). The tool was started in 2005 and is no longer maintained; it is thus being archived. See T351228 for reference.