
avoid fetching SiteList object from memcached
Closed, Resolved (Public)

Description

Since I4e71671c8, WikibaseClient's OutputPageParserOutput & ParserAfterParse hook handlers call WikibaseClient::getDefaultInstance()->getLangLinkSiteGroup(). Since $wgWBClientSettings['languageLinkSiteGroup'] is unset, it falls back to WikibaseClient::getSite, which calls SiteSQLStore::getSites, which requires retrieving a huge object from memcached, with a predictable impact on the cluster: see http://noc.wikimedia.org/~ori/SiteSQLStore.html -- you can guess when I4e71671c8 was deployed.
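The fallback described above can be illustrated with a minimal sketch. It is written in Python rather than Wikibase's PHP, and `SiteStore`, `get_lang_link_site_group`, and the dictionary layout are hypothetical stand-ins, not the actual Wikibase API. It only shows why an explicit setting avoids the expensive fetch.

```python
class SiteStore:
    """Hypothetical stand-in for SiteSQLStore; counts full-list fetches."""

    def __init__(self, sites):
        self._sites = sites
        self.fetches = 0

    def get_sites(self):
        # In production this is the large memcached read.
        self.fetches += 1
        return self._sites


def get_lang_link_site_group(settings, store, site_id):
    """Return the configured group; look the site up only if it is unset."""
    group = settings.get("languageLinkSiteGroup")
    if group is not None:
        return group
    return store.get_sites()[site_id]["group"]


store = SiteStore({"enwiki": {"group": "wikipedia"}})

# Setting unset: the whole site list has to be fetched to learn the group.
unset = get_lang_link_site_group({}, store, "enwiki")

# Setting present: the answer comes from configuration, with no fetch.
configured = get_lang_link_site_group(
    {"languageLinkSiteGroup": "wikipedia"}, store, "enwiki"
)
```

Both calls return the same group, but only the first one touches the store.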

Event Timeline

bzimport set the priority of this task to Needs Triage. Nov 22 2014, 2:39 AM
bzimport set Reference to bz56602.
bzimport added a subscriber: Unknown Object (MLST).

Change 93648 had a related patch set uploaded by Ori.livneh:
Set enwiki's languageLinkSiteGroup to 'wikipedia'

https://gerrit.wikimedia.org/r/93648

Change 93648 merged by Ori.livneh:
Set enwiki's languageLinkSiteGroup to 'wikipedia'

https://gerrit.wikimedia.org/r/93648

Didn't improve things a whole lot, since the call to getSite in WikibaseClient.hooks.php's onSkinTemplateOutputPageBeforeExec hook handler is executed much more frequently.

(In reply to comment #3)

Didn't improve things a whole lot, since the call to getSite in
WikibaseClient.hooks.php's onSkinTemplateOutputPageBeforeExec hook handler is
executed much more frequently.

That's Ie17f2af09, to be specific.

Change 93661 merged by jenkins-bot:
Re-introduce siteGroup setting for performance reasons

https://gerrit.wikimedia.org/r/93661

Change 93767 had a related patch set uploaded by Aude:
Re-introduce siteGroup setting for performance reasons

https://gerrit.wikimedia.org/r/93767

Change 93767 merged by jenkins-bot:
Re-introduce siteGroup setting for performance reasons

https://gerrit.wikimedia.org/r/93767

Change 93769 had a related patch set uploaded by Aude:
Update Wikibase, use siteGroup setting instead of doing lookup

https://gerrit.wikimedia.org/r/93769

Change 93772 had a related patch set uploaded by Aude:
Update Wikibase, use siteGroup setting instead of doing lookup

https://gerrit.wikimedia.org/r/93772

Change 93769 merged by jenkins-bot:
Update Wikibase, use siteGroup setting instead of doing lookup

https://gerrit.wikimedia.org/r/93769

Change 93772 merged by jenkins-bot:
Update Wikibase, use siteGroup setting instead of doing lookup

https://gerrit.wikimedia.org/r/93772

Change 93773 had a related patch set uploaded by Aude:
Add siteGroup setting for Wikibase

https://gerrit.wikimedia.org/r/93773

Change 93773 merged by Ori.livneh:
Add siteGroup setting for Wikibase

https://gerrit.wikimedia.org/r/93773

This is once again an issue; it is loading on every request.

Impact: http://i.imgur.com/v9ebld6.png

This wasn't fixed, so I'm not sure why the bug was closed.

This causes:
https://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&h=mc1005.eqiad.wmnet&m=network_report&s=by+name&mc=2&g=network_report&c=Memcached+eqiad
https://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&h=mc1011.eqiad.wmnet&m=network_report&s=by+name&mc=2&g=network_report&c=Memcached+eqiad
https://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&h=mc1014.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=network_report&c=Memcached+eqiad

Top keys are:
enwiki:sites/SiteList#2014-03-17+Site:2013-01-23 (54MB/s)
wikidatawiki:sites/SiteList#2014-03-17+Site:2013-01-23 (20MB/s)
commonswiki:sites/SiteList#2014-03-17+Site:2013-01-23 (18MB/s)

This was pointed out as early as September 2013, again in March 2014, and again in September 2014, and it is still happening. Having e.g. 80% of mc1005's total network bandwidth consumed by a single wikidata key, or SiteList keys consistently topping memcached bandwidth output by multiple factors compared to the rest, is frankly indicative of a serious design failure and is unacceptable. I don't understand why this bug was closed either.

Can we just have a current version hash stored in memcached and used to validate server-local CDB caches (made on the fly, with a special key holding the hash of the other key/values)? This would reduce the memcached I/O to a minuscule amount.
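A rough sketch of the hash-validation idea, under stated assumptions: it is Python rather than PHP, an in-process dictionary stands in for the on-disk CDB file, and `SharedCache`, `LocalSiteCache`, `publish`, and the key names `sites:hash` / `sites:blob` are all hypothetical. The point is only that each request reads a short hash from the shared cache instead of the full blob.

```python
import hashlib
import json


class SharedCache:
    """Stand-in for memcached that tallies bytes sent to clients."""

    def __init__(self):
        self._data = {}
        self.bytes_out = 0

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        value = self._data.get(key)
        if value is not None:
            self.bytes_out += len(value)
        return value


def publish(shared, sites):
    """Store the serialized site list and a hash identifying its version."""
    blob = json.dumps(sites, sort_keys=True).encode()
    shared.set("sites:blob", blob)
    shared.set("sites:hash", hashlib.sha1(blob).hexdigest().encode())


class LocalSiteCache:
    """Server-local copy, refreshed only when the shared hash changes."""

    def __init__(self, shared):
        self._shared = shared
        self._hash = None
        self._sites = None

    def get_sites(self):
        current = self._shared.get("sites:hash")  # small read per request
        if current != self._hash or self._sites is None:
            blob = self._shared.get("sites:blob")  # large read, rare
            if blob is None:
                return {}
            self._sites = json.loads(blob)
            self._hash = hashlib.sha1(blob).hexdigest().encode()
        return self._sites


shared = SharedCache()
publish(shared, {"enwiki": {"group": "wikipedia"}, "dewiki": {"group": "wikipedia"}})

local = LocalSiteCache(shared)
local.get_sites()  # first call pays for the full blob
after_first = shared.bytes_out

for _ in range(100):
    local.get_sites()

# Bytes pulled from the shared cache per request once the local copy is warm.
per_request = (shared.bytes_out - after_first) / 100
```

Once warm, each request costs one 40-byte hash read; the blob is fetched again only after `publish` stores a new version.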

Change 174113 had a related patch set uploaded by Aude:
Lazy initialize OtherProjectsSidebarGenerator in hook handlers

https://gerrit.wikimedia.org/r/174113

My patch (https://gerrit.wikimedia.org/r/#/c/174113/) ensures the memcached lookup of SiteList is confined to users with the other projects beta feature enabled. This should considerably reduce memcached access for the SiteList.
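Lazy initialization in a hook handler might look roughly like the sketch below. It is Python, not the merged PHP patch, and `SidebarGenerator`, `HookHandlers`, and `on_sidebar` are hypothetical names. The assumption is that constructing the generator is what triggers the site list lookup, so deferring construction skips the lookup for everyone without the beta feature.

```python
class SidebarGenerator:
    """Stand-in for OtherProjectsSidebarGenerator; building it is the costly part."""

    built = 0

    def __init__(self):
        # In the real extension, this is where the site list would be loaded.
        SidebarGenerator.built += 1

    def build_sidebar(self, title):
        return [f"wikiquote:{title}"]


class HookHandlers:
    def __init__(self, factory):
        self._factory = factory
        self._generator = None  # not built until a request needs it

    def _get_generator(self):
        if self._generator is None:
            self._generator = self._factory()
        return self._generator

    def on_sidebar(self, title, beta_enabled):
        if not beta_enabled:
            return []  # generator, and the site list behind it, never touched
        return self._get_generator().build_sidebar(title)


handlers = HookHandlers(SidebarGenerator)

without_feature = handlers.on_sidebar("Foo", beta_enabled=False)
built_before = SidebarGenerator.built

with_feature = handlers.on_sidebar("Foo", beta_enabled=True)
handlers.on_sidebar("Bar", beta_enabled=True)
built_after = SidebarGenerator.built
```

The generator is built once, on the first request that actually needs it, and reused afterwards.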

The SiteList serves a similar purpose to the interwiki data: it is used to add links to related sister projects in the sidebar.

To roll out the feature more widely, we should cache the site list data locally (JSON, like i18n?) and may want memcached to store only the hash (as is done for i18n), per Aaron's suggestion.

faidon triaged this task as High priority. Nov 24 2014, 2:16 PM
faidon updated the task description.
faidon set Security to None.

Change 174113 merged by jenkins-bot:
Lazy initialize OtherProjectsSidebarGenerator in hook handlers

https://gerrit.wikimedia.org/r/174113

hoo claimed this task.

See T47532, which addresses this issue more generally: avoid memcached entirely for the SiteList and use a file-based cache for it instead.

The specific issue of languageLinkSiteGroup was resolved some time ago, and https://gerrit.wikimedia.org/r/174113 (now merged) addressed a related but different issue.

Change 174874 merged by jenkins-bot:
Implement SiteListFileCache and rebuild script

https://gerrit.wikimedia.org/r/174874

r174113, part of wmf10, was deployed across all Wikipedias today and had no effect whatsoever.

JanZerebecki renamed this task from "Set languageLinkSiteGroup in $wgWBClientSettings to avoid fetching SiteList object from memcached" to "avoid fetching SiteList object from memcached". Dec 3 2014, 9:39 PM
JanZerebecki reopened this task as Open.
JanZerebecki reassigned this task from hoo to Wikidata-bugs.
JanZerebecki removed a project: Patch-For-Review.

r174113, part of wmf10, was deployed across all Wikipedias today and had no effect whatsoever.

I poked at this some more and have been able to actually reduce the traffic. See https://lists.wikimedia.org/pipermail/wikidata-tech/2014-December/000682.html

{$wgDBname}:SiteList:sites/SiteList#2014-03-17+Site:2013-01-23 items are at or near the top of memcached keys sorted by bandwidth utilization on the production memcached cluster. This really needs to be fixed and stay fixed.

Fetching SiteList from memcached does not seem to happen on the page view code path. It does happen once on the edit code path. So this is not a regression to the old behavior; it's just that Wikidata is now used so much that even doing this on edit is an issue.

So it seems that we have to tackle T76706: Design caching infrastructure for SiteStore, probably going for T47532: Add file-based cached implementation of SiteStore.

This is going to take a while. I can't think of a good quick fix for this.

We could cache a separate SiteList for each wiki, with the members of the wiki's family plus sister wikis (that is, all the Site entries relevant for sidebars on that wiki). That would give us one cache entry per wiki, and just as many requests to memcached, but the SiteLists returned would be smaller (say, 300 entries instead of 800). Not sure if that's worth the trouble.
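To make the per-wiki idea concrete, here is a small Python sketch. The selection rule is an assumption made for illustration: "family" is taken to mean sites in the same group, and "sister wikis" to mean sites in other groups that share the wiki's language. The function name `sites_for_wiki` and the record fields are hypothetical.

```python
def sites_for_wiki(all_sites, wiki_id):
    """Keep the wiki's own family plus same-language sites in other groups."""
    home = all_sites[wiki_id]
    return {
        site_id: site
        for site_id, site in all_sites.items()
        if site["group"] == home["group"] or site["language"] == home["language"]
    }


ALL_SITES = {
    "enwiki": {"group": "wikipedia", "language": "en"},
    "dewiki": {"group": "wikipedia", "language": "de"},
    "enwikiquote": {"group": "wikiquote", "language": "en"},
    "dewikiquote": {"group": "wikiquote", "language": "de"},
    "frwiktionary": {"group": "wiktionary", "language": "fr"},
}

# One smaller list per wiki, to be stored under a per-wiki cache key.
subset = sites_for_wiki(ALL_SITES, "enwiki")
```

The number of cache reads stays the same; only the size of each returned list shrinks, which is the trade-off weighed above.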

Yes, the graph "Memcached eqiad aggregated bytes_out" didn't visibly increase over what it was after T58602#809009. If we go for T47532, which would remove the use of memcached for this, then splitting the memcached use by wiki as an intermediate step is not necessary.

ori raised the priority of this task from High to Unbreak Now!. Jul 19 2015, 4:25 PM

Every so often I crunch some statistics about memcached usage. I can't tell you how demoralizing it is to find sites/SiteList#2014-03-17+Site:2013-01-23 near the top again and again. If it's still there the next time I check I'm going to start disabling extensions. Changing priority to UBN! for that reason.

Please fix this by making the list a static array in a PHP file in operations/mediawiki-config and then include it in CommonSettings.php. This way HHVM will compile it to byte-code and the OS will keep it in memory.

@aude is going to poke at this when she's back from Wikimania.

In T58602#1463722, @hoo wrote:

@aude is going to poke at this when she's back from Wikimania.

Cool, thanks.

Daniel suggests that CACHE_ACCEL (APC) could be used here. I don't know how big the blob is, but it's probably not too big for that.
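The effect of a per-server cache such as APC can be sketched as follows. This is Python with hypothetical names (`AccelCache`, `get_site_list`, `load_from_backend`) and an assumed one-hour TTL; it is not MediaWiki's CACHE_ACCEL implementation. It only shows that the expensive load then happens once per server per TTL rather than once per request.

```python
import time


class AccelCache:
    """Stand-in for a per-server cache such as APC: a dict with expiry times."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires = entry
        if self._clock() >= expires:
            del self._entries[key]
            return None
        return value

    def set(self, key, value, ttl):
        self._entries[key] = (value, self._clock() + ttl)


def get_site_list(accel, load_from_backend, ttl=3600):
    """Serve from the local cache; hit the backend only on a miss."""
    sites = accel.get("sites/SiteList")
    if sites is None:
        sites = load_from_backend()
        accel.set("sites/SiteList", sites, ttl)
    return sites


# A fake clock keeps the example deterministic.
now = [0.0]
accel = AccelCache(clock=lambda: now[0])
loads = []


def load_from_backend():
    loads.append(1)  # each entry represents one expensive load
    return {"enwiki": {"group": "wikipedia"}}


for _ in range(1000):
    get_site_list(accel, load_from_backend)

backend_loads = len(loads)
```

The cost is that each server holds its own copy and may serve a stale list until the entry expires.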

Change 225719 had a related patch set uploaded (by Hoo man):
Use CACHE_ACCEL for SiteLists if on HHVM

https://gerrit.wikimedia.org/r/225719

Change 225726 had a related patch set uploaded (by Ori.livneh):
Use CACHE_ACCEL for SiteLists if on HHVM

https://gerrit.wikimedia.org/r/225726

Change 225726 merged by Ori.livneh:
Use CACHE_ACCEL for SiteLists if on HHVM

https://gerrit.wikimedia.org/r/225726

Change 225719 merged by jenkins-bot:
Use CACHE_ACCEL for SiteLists if on HHVM

https://gerrit.wikimedia.org/r/225719

Solved by moving the sites cache into CACHE_ACCEL. As you can see, the traffic of some memcached servers dropped considerably at around 18:18 (UTC).

[Attached graph: Memcached sites.png]

Thanks for the quick response.