Page MenuHomePhabricator

l10n cache doesn't actually cache (due to specific extension localization pair not in hash, marking the cache as invalid)
Closed, DeclinedPublic

Description

From http://permalink.gmane.org/gmane.org.wikimedia.mediawiki/41444

I have a strange problem on my (clustered and load-balanced) Mediawiki instance: the 'l10n_cache' table is getting written to *very* excessively (and constantly.) Mediawiki seems to write the same localization information to the database every 3-5 seconds. This is causing MySQL to constantly write binlog files (for replication to the slaves) which fills the partition after a number of hours. Removing all my extensions does not stop the problem. I have APC installed for opt-code caching, and Memcached installed for caching other things (sessions, parser-stuff, etc.) I assume that this is all due to the constant requests made by the load-balancer to the wiki which in turn cause MW to recalculate the cache.

Why is the l10n_cache table getting completely rewritten every few seconds? The cache is not functioning as a cache.

I found one thing that will stop this behavior: $wgLocalisationCacheConf['manualRecache'] = true;
But this feels wrong because it makes me manually recomputed the l10n cache whenever I update/release. (Yet another thing to tack onto the release process.) There are also some configuration variables which will slow the process ($wgUpdateRowsPerQuery) or cache it to disk ($wgLocalisationCacheConf), but this merely masks the problem. Does anyone have an insight into why the l10n_cache table is getting written to so much? Or why Mediawiki isn't actually *caching* the localization messages in the database?


Version: 1.20.x
Severity: normal
Whiteboard: aklapper-moreinfo

Details

Reference
bz47289

Event Timeline

bzimport raised the priority of this task from to Low.Nov 22 2014, 1:41 AM
bzimport set Reference to bz47289.
bzimport added a subscriber: Unknown Object (MLST).

This doesn't happen when using $wgCacheDirectory (and the path has the appropriate ownership/permissions.) It seems to happen only when the localization messages are "cached" in the database (in the l10n_cache table.)

Can you enable the debug log for an individual request that writes to the l10n_cache table, say with a config hack like this:

if ( isset( $_GET['forcedebuglog'] ) ) {

$wgDebugLogFile = '/tmp/mw.log';

}

Then attach the log to this bug, with session cookies redacted. The log should have LocalisationCache::recache() in it somewhere -- if not, the request probably didn't trigger any cache writes.

Also, if possible, can you get on IRC, on MediaWiki-General on irc.freenode.net?

vdanilchenko wrote:

I have just done that. There's no 'recache' entry when caching to file, but there is when caching to DB:

LocalisationCache::isExpired(en): cache for en expired due to GlobalDependency
LocalisationCache::recache: got localisation for en from source
DatabaseBase::query: Writes done: DELETE FROM vpw_l10n_cache WHERE lc_lang = 'en'
Fully initialised
...
LocalisationCache::recache 1 914.617 914.617 37.637% 3030028 ( 914.617 - 914.617) [114]

Full request log is attached.

vdanilchenko wrote:

The log of a sample pathological request doing full localisation recache

The recache to DB happens on every request; but when caching to file, it doesn't.

Attached:

"expired due to GlobalDependency" implies that the set of extensions in use has changed, or more specifically, $wgExtensionMessagesFiles has changed. If this problem doesn't happen when you use local file caching, then maybe the set of enabled extensions depends on what server MediaWiki runs on, say due to LocalSettings.php not being properly synchronised.

In this server setup, the MediaWiki file tree (including LocalSettings.php) is on a single disk partition, which is mounted via NFS by multiple webservers which are load-balanced.

Is there anything about this setup that might contribute to "LocalSettings.php not being properly synchronised"?

If LocalSettings.php is on NFS then it probably isn't the problem. But there are other ways $wgExtensionMessagesFiles can change. You should patch GlobalDependency::isExpired() in includes/cache/CacheDependency.php so that it starts with:

wfDebug( METHOD . ": {$this->name} old value: " . var_export( $this->value, true ) . "\n" );
wfDebug( METHOD . ": {$this->name} new value: " . var_export( $GLOBALS[$this->name], true ) . "\n" );

Then extract the relevant log lines from the debug log, work out what has changed between the two arrays, and examine the relevant inclusion in LocalSettings.php and the relevant extension setup file.

Thanks for the help, Tim. I added some wfDebugLog() calls in GlobalDependency::isExpired(), and came up with this:

GlobalDependency::isExpired: array (
  'count($GLOBALS[$this->name])' => 120,
  'count($this->value)' => 120,
  'count(array_intersect($GLOBALS[$this->name], $this->value))' => 120,
  'count(array_diff($GLOBALS[$this->name], $this->value))' => 0,
  '$GLOBALS[$this->name] == $this->value' => true,
  '$GLOBALS[$this->name] === $this->value' => true,
)

Obviously $GLOBALS[$this->name] is set and the arrays are the same. GlobalDependency::isExpired() then should return false, but sometimes it doesn't and recache() is called. I am kinda stuck and handed it off to my colleague Victor to look at.

vdanilchenko wrote:

We found it.

The problem turned out to be an extension that was loading another extension, thus affecting the extension execution scope. When this happened, the localization key/value pair for this extension ended up being present in $GLOBALS[$this->name] hash (in GlobalDependency::isExpired method in includes/cache/CacheDependency.php, line 410 or thereabouts), but absent in $this->value hash in the same code. Therefore, the cache would ALWAYS be flagged as invalid in such a case, because the two hashes would ALWAYS mismatch.

This seems to be clearly incorrect behavior. While including extensions from other extensions may or may not be a good idea, they should not be treated differently WRT caching; i.e. you shouldn't get the extension localization key included in the GLOBALS table but not the $this->value one.

I am not sure how DB cache store plays into it though.

Thanks for all your help guys, this was invaluable!

Victor, if you still think this is a MediaWiki bug, can you give links to the source code of the relevant two extensions?

Victor: Could you answer comment 10, please?

Victor Danilchenko: Could you answer comment 10, please?

vdanilchenko wrote:

I will need to get authorization to do so, after my manager returns from vacation. Give me a little time.

Victor: Did you find time to ask?

Hi Andre,

Sorry for the seemingly abandoned ticket -- Victor himself went on vacation! I came across this ticket in my backlog of issues and am going to have another look at it.

To review:

  • Our Mediawiki installation is quite complex, which does not ease the debugging. We have three load-balanced web nodes which share a codebase mounted on an NFS partition. We also have three database (MySQL) nodes: one master, two slaves.
  • We have a plethora of extensions, both third-party and ones developed in-house for various enterprise functionality.
  • The issue here is that the localization cache constantly updates. When requests come in MediaWiki writes a bunch of l10n stuff (I'm assuming these are l10n-messages) to the database. Aside from unnecessarily using resources, in a replicated environment the DB_MASTER writes these INSERTS to the binlog file[1]. These files get executed on the SLAVES, but in the meantime they slowly grow to fill the disk. This is bad.
  • From my cursory debugging, it seems that this behavior only occurs on the DB_MASTER (which makes sense.) When $wgCacheDirectory is set this behavior stops (I'm assuming because MediaWiki writes the cache a single time.) This is the current workaround we have in place.

I'll use the stuff Tim posted above to have another look. Thanks for keeping up with this issue -- I'll post back here as soon as I find something!

--Daniel (User:AlephNull)

Refernces:
[1] https://dev.mysql.com/doc/refman/5.0/en/binary-log.html

(In reply to comment #9)

The problem turned out to be an extension that was loading another extension,

(In reply to comment #10)

can you give links to the source code of the relevant two extensions?

(In reply to comment #15)

I'll post back here as soon as I find something!

Hi Daniel, any news here?

For the life of me I cannot reproduce this. It seems to be caused by sunspots or maybe the phase of the moon.

For a while there the localization cache was convinced it was out of date and re-cached itself (constantly.) I dumped some log files and compared them to the one Victor uploaded a while back (in comment 4.) I couldn't find any substantive differences.

I have a vague feeling this was due to our MySQL replication and/or lag. I know the LoadBalancer/ChronologyProtector is supposed to guard against things like this, but I can't help but feel that this was the problem. (No real reason -- I definitely could be wrong.)

Feel free to mark this bug as "Not fixed"/"Not reproducible".

Alright then. Thanks for investigating and the update here!