
Not enough memory for ImportDump.php
Closed, ResolvedPublic

Description

Author: andras.fabian

Description:
I was experimenting with importing the latest en.wikipedia XML dump into my MySQL
database, but it always failed (at around 800,000 pages - which is only about 1/3
of the current db). The reason was very simple: the importer always ate up my
complete memory (2 GB RAM + 1 GB swap = 3 GB) within hours, and PHP bailed out.
I was looking for the reason for days, and finally I found it (I should have
thought about it earlier). It is the CacheManager. It puts every Title in the
cache and never frees it up. But somehow I don't see the point of why the
importer needs the cache at all, when one only looks up an ArticleID. But I
found a workaround.
In SpecialImport.php there are the following lines:

		$article = new Article( $this->title );
		$pageId = $article->getId();
		if( $pageId == 0 ) {
			# must create the page...
			$pageId = $article->insertOn( $dbw );
		}

Now the $article->getId() call is the culprit, because at this point the Title
is put into the cache ($article->getId() calls $this->mTitle->getArticleID(),
and getArticleID() puts the new Title object into the cache). But this check
for an existing article ID is not necessary at all if one only imports the current
pages (and not all revisions), because there the page IDs are distinct and the DB
should be empty (no existing articles).

The solution: comment out $pageId = $article->getId() and replace it with $pageId = 0.
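
A minimal sketch of the workaround, applied to the snippet quoted above (only the
getId() lookup is skipped; everything else is unchanged):

		$article = new Article( $this->title );
		// $pageId = $article->getId();  // skipped: goes through Title::getArticleID(),
		//                               // which adds every Title to the link cache
		$pageId = 0;                     // treat every imported page as new
		if( $pageId == 0 ) {
			# must create the page...
			$pageId = $article->insertOn( $dbw );
		}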

Now this is a hack and should be made configurable (or turned on/off from the
command line), because people who import the full XML (all revisions) will need
this check (nevertheless, I can't imagine how one could succeed with it, as it
will need many, many GB of RAM).
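
As a rough illustration of how such a switch could look - $wgImportSkipIdLookup is a
hypothetical setting invented here for the example, not an existing MediaWiki option:

		# hypothetical global, set in LocalSettings.php or by a command-line flag
		global $wgImportSkipIdLookup;

		$article = new Article( $this->title );
		if( $wgImportSkipIdLookup ) {
			# "current pages only" import: assume every page is new
			$pageId = 0;
		} else {
			# full-history import: keep the existing-page check
			$pageId = $article->getId();
		}
		if( $pageId == 0 ) {
			# must create the page...
			$pageId = $article->insertOn( $dbw );
		}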


Version: 1.5.x
Severity: normal

Details

Reference
bz3182

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 8:45 PM
bzimport set Reference to bz3182.
bzimport added a subscriber: Unknown Object (MLST).

I would not expect CacheManager to be invoked during this process at all; it's used
when viewing pages if the file cache is on. Further, a CacheManager shouldn't
actually keep any data around or grow, as I understand it.

Can you clarify what you are referring to?

andras.fabian wrote:

You can reproduce this behaviour very easily. Take, for example, the big (1 GB)
http://download.wikipedia.org/wikipedia/en/pages_current.xml.gz and feed it to
maintenance/importDump.php. Then, just looking at "top", you will see how fast the
memory consumption of the PHP process grows (and consequently how rapidly free
memory disappears). Now, if you comment out the $article->getId() line in
SpecialImport.php (which should be superfluous if you are importing a "current"
dump, because if I understand it correctly it contains every page only once - the
current/latest version), then you will see the big difference. The PHP process
will not grow in memory (only 27-30 MB on my computer) and the import runs
well until the end.
The reason for the big memory consumption is (as far as I understand from
reading the code):
$article->getId() calls $this->mTitle->getArticleID(). Then if I look into
Title.php at "function getArticleID", I see in both branches of the "if" clause:
$this->mArticleID = $wgLinkCache->addLinkObj( $this )
And I suspect that addLinkObj is the one consuming the memory, because if it
happens for every Title object during the import process without any memory
cleanup, at some point you run out of memory.
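
To make the path concrete, here is a rough reconstruction of Title::getArticleID()
based on the description above (a simplified approximation, not the verbatim 1.5
source); the point is that both branches end in $wgLinkCache->addLinkObj(), and
nothing ever evicts those entries during the import loop:

	function getArticleID( $flags = 0 ) {
		global $wgLinkCache;
		if ( $flags ) {
			# "for update" branch: also goes through addLinkObj()
			$this->mArticleID = $wgLinkCache->addLinkObj( $this );
		} else {
			if ( -1 == $this->mArticleID ) {
				# every new Title is added to the link cache here
				$this->mArticleID = $wgLinkCache->addLinkObj( $this );
			}
		}
		return $this->mArticleID;
	}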

So when you claimed that CacheManager was at fault, you meant LinkCache?
That at least makes sense. :)

Checking for existing pages is absolutely required, since not all imports will be conflict-free.
However, disabling the link cache for these checks is probably in order.

Toss this into WikiRevision::importOldRevision() in SpecialImport.php:

+ // avoid memory leak...?
+ global $wgLinkCache;
+ $wgLinkCache->clear();
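
In context, the idea is just to clear the link cache once per imported revision so
cached Title entries never accumulate. A sketch of how the patched method reads
(the elided parts stand for the existing code; placement is illustrative):

	function importOldRevision() {
		// ... (setup as before)

		// avoid memory leak...?
		global $wgLinkCache;
		$wgLinkCache->clear();

		$article = new Article( $this->title );
		$pageId = $article->getId();
		if( $pageId == 0 ) {
			# must create the page...
			$pageId = $article->insertOn( $dbw );
		}

		// ... (insert the revision etc., as before)
	}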

Done in CVS HEAD and REL1_5; will be in next 1.5 release.