Author: andras.fabian
Description:
I was experimenting with importing the latest en.wikipedia XML dump into my MySQL
database, but it always failed (at around 800,000 pages - which is only about 1/3
of the current db). The reason was very simple: the importer always ate up my
complete memory (2 GB RAM + 1 GB swap = 3 GB) within hours, and PHP bailed out.
I was looking for the reason for days, and finally I found it (I should have
thought of it earlier). It is the CacheManager. It puts every Title into the
cache and never frees it. But somehow I don't see why the importer needs the
cache at all, when one only looks up an ArticleID. I did find a workaround:
In SpecialImport.php there are the following lines:

    $article = new Article( $this->title );
    $pageId = $article->getId();
    if( $pageId == 0 ) {
        # must create the page...
        $pageId = $article->insertOn( $dbw );
    }
Now the $article->getId() call is the culprit, because at this point the Title
is put into the cache ($article->getId() calls $this->mTitle->getArticleID(),
and getArticleID() puts the new Title object into the cache). But this check
for an existing article ID is not necessary at all if one only imports the
current pages (and not the full history), because there the page IDs are
distinct and the DB should be empty (no existing articles).
The solution: comment out $pageId = $article->getId(); and replace it with $pageId = 0;
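Applied to the snippet above, the workaround would look roughly like this (a sketch against the 1.5 SpecialImport.php, not a tested patch; it assumes a fresh, empty wiki database so no existing page ID can collide):

```php
$article = new Article( $this->title );
# Workaround: skip the getId() lookup, because it pushes every Title
# into the cache, which is never freed during the import run.
# $pageId = $article->getId();
$pageId = 0;
if( $pageId == 0 ) {
    # must create the page...
    $pageId = $article->insertOn( $dbw );
}
```

With $pageId forced to 0, every imported page takes the insertOn() path, so this is only safe when the target database contains no articles yet.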
Now, this is a hack and should be made configurable (or switchable from the
command line), because people who import the full-history XML will need the ID
lookup (though I can't imagine how one could succeed with that, as it would
need many, many GB of RAM).
Version: 1.5.x
Severity: normal