
Translation memory (solr) is failing
Closed, Resolved, Public

Description

I noticed that I no longer get suggestions for many messages while translating TechNews. Some suggestions, especially for short messages, still work.

It is unknown to me whether there has been a gradual slowdown and we have now passed the 10s timeout limit, or whether a sudden slowdown is making everything slower.

The ElasticSearch backend for translation memory is not yet ready. Since I do not know the cause, I do not know whether there is anything we can do before migrating to ElasticSearch.


Version: wmf-deployment
Severity: normal

Details

Reference
bz69613

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 3:34 AM
bzimport set Reference to bz69613.

I kept some logs of whether a simple one-word query was failing on Meta, but nothing was logged after 2013-05-08; too bad I didn't keep any more details.

No access to icinga for me and nothing remarkable on https://ganglia.wikimedia.org/latest/?r=year&cs=&ce=&c=Miscellaneous+eqiad&h=zinc.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=ALLGROUPS

Preliminarily assigning to Nik; feel free to reassign (also wondering if this might somehow touch deployment systems territory. Probably not).

Created an RT ticket to get access to the machine running the translation memory: zinc.

Icinga reports all is well with zinc.

Seeing a whole lot of this in the exception log:
2014-08-18 10:10:33 mw1016 mediawikiwiki: [64d4ab30] [no req] Exception from line 46 of /usr/local/apache/common-local/php-1.24wmf17/extensions/Translate/tag/TranslateRenderJob.php: Oops, this should not happen!

Probably not good.

Saw this:
2014-08-18 10:37:52 mw1053 commonswiki: [e97ae14a] /rpc/RunJobs.php?wiki=commonswiki&type=MessageUpdateJob&maxtime=30&maxmem=300M Exception from line 1192 of /usr/local/apache/common-local/php-1.24wmf16/includes/db/Database.php: A database error has occurred. Did you forget to run maintenance/update.php after upgrading? See: https://www.mediawiki.org/wiki/Manual:Upgrading#Run_the_update_script
Query: INSERT IGNORE INTO translate_groupstats (tgs_group,tgs_lang,tgs_total,tgs_translated,tgs_fuzzy,tgs_proofread) VALUES ('page-File:The GLAM-Wiki Revolution.webm/srt','ab','188','0','0','0')
Function: MessageGroupStats::forItemInternal
Error: 1213 Deadlock found when trying to get lock; try restarting transaction (10.64.16.29)

#0 /usr/local/apache/common-local/php-1.24wmf16/includes/db/Database.php(1158): DatabaseBase->reportQueryError()
#1 /usr/local/apache/common-local/php-1.24wmf16/includes/db/Database.php(1956): DatabaseBase->query()
#2 /usr/local/apache/common-local/php-1.24wmf16/extensions/Translate/utils/MessageGroupStats.php(384): DatabaseBase->insert()
#3 /usr/local/apache/common-local/php-1.24wmf16/extensions/Translate/utils/MessageGroupStats.php(295): MessageGroupStats::forItemInternal()
#4 /usr/local/apache/common-local/php-1.24wmf16/extensions/Translate/utils/MessageGroupStats.php(114): MessageGroupStats::forGroupInternal()
#5 /usr/local/apache/common-local/php-1.24wmf16/extensions/Translate/tag/TranslatablePage.php(751): MessageGroupStats::forGroup()
#6 /usr/local/apache/common-local/php-1.24wmf16/extensions/Translate/tag/PageTranslationHooks.php(159): TranslatablePage->getTranslationPercentages()
#7 /usr/local/apache/common-local/php-1.24wmf16/extensions/Translate/tag/PageTranslationHooks.php(136): PageTranslationHooks::updateTranslationPage()
#8 /usr/local/apache/common-local/php-1.24wmf16/includes/Hooks.php(206): PageTranslationHooks::onSectionSave()
#9 /usr/local/apache/common-local/php-1.24wmf16/includes/GlobalFunctions.php(4034): Hooks::run()
#10 /usr/local/apache/common-local/php-1.24wmf16/includes/page/WikiPage.php(2047): wfRunHooks()
#11 /usr/local/apache/common-local/php-1.24wmf16/extensions/Translate/utils/MessageUpdateJob.php(44): WikiPage->doEditContent()
#12 /usr/local/apache/common-local/php-1.24wmf16/includes/jobqueue/JobRunner.php(109): MessageUpdateJob->run()
#13 /usr/local/apache/common-local/rpc/RunJobs.php(42): JobRunner->run()
#14 {main}

Just once so far.

More frequent now - some deadlock. Great.

And some of this in fatal:
[2014-08-18 11:45:30] Catchable fatal error: Argument 1 passed to MessageGroups::getSharedGroups() must implement interface MessageGroup, null given, called in /usr/local/apache/common-local/php-1.24wmf16/extensions/Translate/tag/SpecialPageTranslation.php on line 93 and defined at /usr/local/apache/common-local/php-1.24wmf16/extensions/Translate/MessageGroups.php on line 305
Server: mw1091
Method: GET
URL: http://meta.wikimedia.org/w/index.php?title=Special:PageTranslation&do=discourage&target=-redacted-&revision=-1&token=ac62e5547a4bae78516b834da1a7779d%2B%5C
Cookie: -redacted-
Backtrace:
#0 /usr/local/apache/common-local/php-1.24wmf16/extensions/Translate/MessageGroups.php(305): MessageGroups::getSharedGroups()
#1 /usr/local/apache/common-local/php-1.24wmf16/extensions/Translate/tag/SpecialPageTranslation.php(93): MessageGroups::getSharedGroups(NULL)
#2 /usr/local/apache/common-local/php-1.24wmf16/includes/specialpage/SpecialPage.php(382): SpecialPageTranslation->execute(NULL)
#3 /usr/local/apache/common-local/php-1.24wmf16/includes/specialpage/SpecialPageFactory.php(516): SpecialPage->run(NULL)
#4 /usr/local/apache/common-local/php-1.24wmf16/includes/MediaWiki.php(294): SpecialPageFactory::executePath(Object(Title), Object(RequestContext))
#5 /usr/local/apache/common-local/php-1.24wmf16/includes/MediaWiki.php(609): MediaWiki->performRequest()
#6 /usr/local/apache/common-local/php-1.24wmf16/includes/MediaWiki.php(458): MediaWiki->main()
#7 /usr/local/apache/common-local/php-1.24wmf16/index.php(46): MediaWiki->run()
#8 /usr/local/apache/common-local/w/index.php(3): require('/usr/local/apac...')
#9 {main}

Was able to reproduce the problem. It didn't log anything incriminating. I'll need access to zinc before I can proceed.

Update now that I have access. Some requests are taking a long, long time. For example:
manybubbles@zinc:~$ time curl -s 'localhost:8983/solr/select?wt=json&q=_val_%3A%22strdist%28%5C%22%3A%2A+%E2%80%9C%24upload%E2%80%9D+displays+a+special+page+that+allows+logged-in+users+to+upload+images+and+other+files+to+the+wiki.+Uploaded+files+can+be+linked+from+or+embedded+in+wiki+pages.+This+link+is+not+displayed+if+file+uploading+is+not+enabled.%3Cbr+%2F%3E%27%27Uploading+files%2C+viewing+files+on+the+server%2C+including+them+in+wiki+pages+and+managing+the+uploaded+files+is+discussed+in+the+%5B%5BHelp%3AManaging+files%7Cmanaging+files%5D%5D+section+of+this+manual.%27%27%5C%22%2Ccontent%2Cedit%29%22&start=0&rows=25&fl=globalid%2Ccontent%2Cscore&fq=language%3A%22en%22'
...
real 0m22.406s
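
For readability, here is roughly the same query with the URL encoding undone (a sketch: the long source text is abbreviated as a placeholder, and curl's -G/--data-urlencode is left to handle the escaping). The "score" is nothing more than strdist() with the edit (Levenshtein) measure between the message text and each candidate's content field:

# Sketch of the decoded query; <source text of the message> stands for the
# full message text quoted inside the _val_ function query above
curl -s -G 'localhost:8983/solr/select' \
  --data-urlencode 'wt=json' \
  --data-urlencode 'q=_val_:"strdist(\"<source text of the message>\",content,edit)"' \
  --data-urlencode 'fq=language:"en"' \
  --data-urlencode 'fl=globalid,content,score' \
  --data-urlencode 'start=0' \
  --data-urlencode 'rows=25'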

Rerunning the query doesn't speed it up.

My mad Linux skillz (top) tell me that this query pegs one CPU for about half of the time, then shifts to another CPU and pegs it for the other half, then returns.

I believe the slowness is caused by computing the string distance against lots and lots and lots of matches, but I'm not sure. I'm also not sure what changed.

So I'm not sure where to go from here - this strdist is exactly what the Elasticsearch translation memory doesn't support. And I'm pretty sure it's super duper slow. Anyway - here is an example of the number of matches (not returned results) that can cause the result to be slow:

"numFound": 81067,

22 seconds is far above the ~10 second maximum that was (more or less) originally envisioned, but it is not unrealistic.

(In reply to Nik Everett from comment #11)

So I'm not sure where to go from here - this strdist is exactly what the
Elasticsearch translation memory doesn't support.

Any idea whether this was "just" a gradual slowdown that has now put us in this situation?
If not, I'm wondering who else could continue investigating (and how)... :-/

I don't think we have logs that would show whether this was gradual or not. My recommendation for a quick fix is to raise the timeout. The permanent fix is to rethink using string distance as a measure of relevance; it's a very slow measure. Beyond that, I'm not sure.

Given the info provided by Nik above, it looks like gradual slowdown. What worries me is that the slow queries seem to block the MessageUpdateJobs.

Increasing the timeout even more might not be a good idea in that light, or it might not make a difference. We could also try to limit the set of potential matches even further in the code.
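
One hypothetical way to limit the candidates on the Solr side (nothing like this exists in the current schema, so the wordcount field below is purely an assumption) would be to index a length or word-count field and add a filter query, so that candidates nowhere near the source text's length are dropped before strdist is ever computed:

# Hypothetical: only score candidates within a plausible length window,
# e.g. a source text of ~80 words with a generous margin either way
curl -s -G 'localhost:8983/solr/select' \
  --data-urlencode 'wt=json' \
  --data-urlencode 'q=_val_:"strdist(\"<source text of the message>\",content,edit)"' \
  --data-urlencode 'fq=language:"en"' \
  --data-urlencode 'fq=wordcount:[40 TO 160]' \
  --data-urlencode 'rows=25'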

We should be very cautious about deploying the same algorithm to ElasticSearch (as per existing plans to stop using Solr). Nik knows better whether it might cause the same problem there, or whether that setup can make the problem smaller.

I do have initial plans to work on translation memory (on my research time), but the ETA for that is more like the end of this year. What I would like to avoid is not being able to run the TM service at all until then.

I'm pretty sure the timeouts we have now are client side only - so the server will keep working anyway. Given that, I think we should raise them for now and think more about replacing string distance later. It's important that we think about it soon, though. Is the feature useful at all if it takes 40 seconds to return?
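
That client-side behaviour is easy to mimic from the shell (a sketch: --max-time makes curl give up after 10 seconds, while top on zinc should still show Solr grinding through the query until it finishes, as described above):

time curl -s -G --max-time 10 'localhost:8983/solr/select' \
  --data-urlencode 'wt=json' \
  --data-urlencode 'q=_val_:"strdist(\"<source text of the message>\",content,edit)"' \
  --data-urlencode 'fq=language:"en"' \
  --data-urlencode 'rows=25'
# curl exits with code 28 (operation timed out) after ~10 s; the client
# going away does not cancel the request Solr is still computing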

We have preloading that is able to hide lots of delay in the common case of translating messages sequentially.

It would still be interesting if someone with access could take a look at icinga (which also has history and acknowledgements, right?): the check_solr / check_replicated_solr check is still enabled in role::solr.

Or if you already did, sorry and ignore me:

2014-08-20 22.41 < Nemo_bis> manybubbles: for zinc/ttm, did you manage to look into the icinga alert and see if any meaningful history can be extracted from it?
2014-08-21 1.56 < manybubbles> Nemo_bis: regarding zinc/ttm - I'll check icinga now. I didn't see anything before but I wasn't being super thorough because I wanted to look at logs on the box

I only found:

$ grep "Solr on zinc" irclogs/FreeNode/#wikimedia-operations.log
2014-01-12 18.15 <+icinga-wm> PROBLEM - Solr on zinc is CRITICAL: Average request time is 1000.0105 (gt 1000)
2014-02-21 20.04 <+icinga-wm> RECOVERY - Solr on zinc is OK: All OK

It just occurred to me that perhaps the fix for bug 67921 is related to this, as it causes more candidates to exist and thus slows down the queries.

We have switched to ES now. The ES cluster is so much faster that, with the same logic, a query which was previously timing out now takes less than 4 seconds.