
"Pool queue is full" and related errors on search api
Closed, Declined · Public

Description

Author: simon

For Wikipedia Text (the SMS & USSD part of Wikipedia Zero) in Kenya we're occasionally seeing the following errors:

{u'servedby': u'mw1205', u'error': {u'info': u'Pool queue is full', u'code': u'srsearch-error'}}
{u'servedby': u'mw1200', u'error': {u'info': u'The search backend returned an error: ', u'code': u'srsearch-error'}}
{u'servedby': u'mw1123', u'error': {u'info': u'HTTP request timed out.', u'code': u'srsearch-error'}}

Please advise if anything can be done about these (other than trying to handle them as gracefully as possible on our application side of things).
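
As a point of reference, here is a minimal sketch of one way to handle these transient srsearch-error responses on the client side: retry with a short backoff, then fail gracefully. This is illustrative only, not the Vumi Wikipedia code; the endpoint and query parameters are the standard MediaWiki search API, and the retry count and delay are assumptions.

```python
import time
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def search(term, retries=3, delay=1.0):
    """Search English Wikipedia, retrying on transient srsearch-error responses."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,
        "format": "json",
    }
    for attempt in range(retries):
        data = requests.get(API_URL, params=params, timeout=10).json()
        error = data.get("error")
        if error is None:
            return data["query"]["search"]
        # "Pool queue is full" and backend timeouts are transient, so back off and retry.
        if error.get("code") == "srsearch-error" and attempt < retries - 1:
            time.sleep(delay * (attempt + 1))
            continue
        raise RuntimeError("%s: %s" % (error.get("code"), error.get("info")))

print([r["title"] for r in search("Nairobi")])
```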


Version: unspecified
Severity: normal
OS: other
Platform: Other

Details

Reference
bz60032

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 2:52 AM
bzimport set Reference to bz60032.
bzimport added a subscriber: Unknown Object (MLST).

simon wrote:

Graphite screenshot showing timeouts

Screenshot of the Wikipedia Text service traffic. The yellow line is the response time for the search API. Times shown are in UTC, on the morning of Tuesday 14 January.

Attached:

Screenshot_2014_01_14_11_13_AM.png (881×1 px, 197 KB)

Thanks for taking the time to report this!

Is this a recent problem, or has this been ongoing for a while?

Bug 59993 (which got fixed yesterday) might be related.

simon wrote:

Bug 59993 looks related, but these errors are from this morning, after it had been resolved.

We see it happening fairly regularly when the Kenyan mobile network operator does a big SMS-based announcement of Wikipedia Text, resulting in increased traffic volumes.

Which Wikis are generating these messages?

simon wrote:

This is from our API calls to the search API on the English Wikipedia. The API calls are generated by the Vumi Wikipedia app, which handles the SMS & USSD component of Wikipedia Zero.

I don't see any load spikes on the search backends, but I do see latency spikes on the pool counters:
https://gdash.wikimedia.org/dashboards/poolcounter/
Not 60 seconds, though. My 99th-percentile times are about 6 seconds. I suppose if you are searching for 10 items and trigger the pool counter for each one _and_ all of them hit the 99th-percentile case, then you could see this (a quick back-of-the-envelope sketch follows below).

The backends keep a log of all the searches they've performed and how long they take. Could you send me some examples of slow searches so I can rule out that the backends are taking forever on them?
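
To make the arithmetic in that scenario concrete, here is a back-of-the-envelope sketch; the per-search latency and search count are the figures quoted above, not new measurements.

```python
# Ten sequential searches, each passing through the pool counter at its
# ~6 s 99th-percentile latency, are enough to explain a 60 s request timeout.
p99_latency_s = 6.0        # 99th-percentile pool counter latency quoted above
searches_per_session = 10  # "searching for 10 items" scenario from the comment

worst_case_s = p99_latency_s * searches_per_session
print(worst_case_s)        # 60.0
```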

Simon, would it make sense to test the other search backend? In http://en.wikipedia.org/w/api.php?action=help&modules=query+search I see the srbackend parameter, which currently defaults to "LuceneSearch" but also supports "CirrusSearch". I presume that CirrusSearch is our new backend that will become the default shortly.
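
As a sketch, pinning a search request to CirrusSearch would just mean adding the srbackend parameter to the same call. The parameter name and values are those listed in the api.php help page linked above; the rest is illustrative.

```python
import requests

params = {
    "action": "query",
    "list": "search",
    "srsearch": "Nairobi",
    "srbackend": "CirrusSearch",  # instead of the current default "LuceneSearch"
    "format": "json",
}
response = requests.get("https://en.wikipedia.org/w/api.php", params=params, timeout=10)
print(response.json())
```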

Simon: Could you answer comment 7, please?

"Shortly" is relative. You can try the new search backend if you like. Let me know if you want to set it as the default so I can see how we handle the extra traffic.

simon wrote:

Created https://github.com/praekelt/vumi-wikipedia/issues/40, so we'll switch over to the new search backend. Will re-open this if the issue re-occurs.