
JobQueue does not scale for TTMServer
Closed, ResolvedPublic

Description

So yesterday we got a report that Special:Translations is not working on meta. I looked at a few Ganglia stats but could not figure out anything specific.

Today, with Max's help, we have a theory:
Many concurrent updates slow Solr down to a crawl (about one update every 2 seconds, each taking almost 20 s on average to finish). Search requests also take longer than 10 seconds and time out.

The JobQueue seems to actually work against us here. The lack of accessible statistics to examine JobQueue behavior is problematic.

Max proposed that we store the list of changes and use a cron job to push them in batches with a single thread.
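
A very rough sketch of what that proposal could look like as a MediaWiki maintenance script run from cron. Everything here is illustrative: the ttmserver_changes table, its columns and the batch size are made up, and the Translate calls only indicate the idea, not an actual patch.

```php
<?php
// Sketch only: a single-threaded script, run from cron, that drains a hypothetical
// ttmserver_changes queue table and pushes the pending updates to Solr in one batch.
require_once __DIR__ . '/Maintenance.php';

class PushTtmServerChanges extends Maintenance {
	public function execute() {
		$server = TTMServer::primary(); // Translate's primary (writable) TTM server
		$dbw = wfGetDB( DB_MASTER );

		// Hypothetical queue table that page save hooks would write to.
		$res = $dbw->select( 'ttmserver_changes', '*', array(), __METHOD__, array( 'LIMIT' => 500 ) );
		foreach ( $res as $row ) {
			$handle = new MessageHandle( Title::newFromText( $row->tc_title ) );
			$server->update( $handle, $row->tc_text );
			$dbw->delete( 'ttmserver_changes', array( 'tc_id' => $row->tc_id ), __METHOD__ );
		}
	}
}

$maintClass = 'PushTtmServerChanges';
require_once RUN_MAINTENANCE_IF_MAIN;
```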

No other solutions have been proposed.

RT-5085 should be related to this.


Version: master
Severity: critical
URL: https://meta.wikimedia.org/wiki/Special:SearchTranslations?query=amen&uselang=en
See Also:
https://rt.wikimedia.org/Ticket/Display.html?id=5085
https://bugzilla.wikimedia.org/show_bug.cgi?id=48266

Details

Reference
bz48164

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 1:44 AM
bzimport set Reference to bz48164.

(In reply to comment #0)

RT-5085 should be related to this.

It's the ops ticket about this bug.

The JobQueue seems to actually work against us here. The lack of accessible
statistics to examine JobQueue behavior is problematic.

Statistics exist now: https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=hume.wikimedia.org&v=823574&m=Global_JobQueue_length&r=hour&z=default&jr=&js=&st=1365625056&z=large

However, the good old https://gdash.wikimedia.org/dashboards/jobq/ gives a 503 now, and I assume *executed* job items are what matters?

After updates were disabled per bug 48266 comment 0, vanadium is back at normal load, and since 2013-05-08 17:20 Special:SearchTranslations "works".

*** Bug 48266 has been marked as a duplicate of this bug. ***

Is the problem *frequent* updates and/or *concurrent* updates?

Anyway, one can simulate cron jobs with de-duplication + delayed jobs in the queue (Wikidata will do something similar soon). So if a job were inserted for each event that requires a TTM update, always posted to metawiki, to check the lists of changes for each wiki and do the updates, and that job had ignoreDuplicates() return true and had a delay set, it would basically act like a cron job that any of the job runner servers can service.
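
A rough sketch of what such a job could look like, assuming the standard MediaWiki Job API; the TtmServerBatchUpdateJob name, the five-minute delay and the batch logic are hypothetical, not an existing job class:

```php
<?php
// Hypothetical job that behaves like a cron entry: identical pending jobs are
// de-duplicated, and the release timestamp delays execution, so any job runner
// server can eventually service a single batch run.
class TtmServerBatchUpdateJob extends Job {
	public function __construct( Title $title, array $params ) {
		// Delay execution (requires a job queue backend that supports delayed jobs).
		$params['jobReleaseTimestamp'] = time() + 300;
		parent::__construct( 'ttmServerBatchUpdate', $title, $params );
	}

	public function ignoreDuplicates() {
		// Collapse identical pending jobs so repeated edits queue only one batch run.
		return true;
	}

	public function run() {
		// Check the lists of changes for each wiki and push the updates to Solr
		// in batches (omitted here).
		return true;
	}
}
```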

Here is some profiling info from trying to use a single runJobs.php process.

  • 99% of time is in SolrTTMServer::update (no surprise)
  • It can be quite fast for some jobs (hundreds of them) but very slow for others
  • The slow ones *seem* to involve empty translations

Output dump below:

aaron@mw1001:~$ sudo -u apache php /usr/local/apache/common/multiversion/MWScript.php runJobs.php metawiki --type TTMServerMessageUpdateJob --maxjobs 100 --profiler text
2013-05-29 23:02:14 TTMServerMessageUpdateJob Translations:Mobile_projects/Mobile_Gateway/Mobile_homepage_formatting/27 STARTING
2013-05-29 23:03:11 TTMServerMessageUpdateJob Translations:Mobile_projects/Mobile_Gateway/Mobile_homepage_formatting/27 t=56467 good
2013-05-29 23:03:11 TTMServerMessageUpdateJob Translations:Fundraising_2012/Translation/FAQ_short/1 STARTING
2013-05-29 23:03:24 TTMServerMessageUpdateJob Translations:Fundraising_2012/Translation/FAQ_short/1 t=13436 good
2013-05-29 23:03:24 TTMServerMessageUpdateJob Translations:Fundraising_2012/Translation/FAQ_short/2 STARTING
2013-05-29 23:03:38 TTMServerMessageUpdateJob Translations:Fundraising_2012/Translation/FAQ_short/2 t=13499 good
2013-05-29 23:03:38 TTMServerMessageUpdateJob CNBanner:B13_051120_plaintext_staycurious-lp-appeal STARTING
2013-05-29 23:03:39 TTMServerMessageUpdateJob CNBanner:B13_051120_plaintext_staycurious-lp-appeal t=799 good
2013-05-29 23:03:39 TTMServerMessageUpdateJob CNBanner:B13_051120_plaintext_staycurious-lp-appeal-template STARTING
2013-05-29 23:03:40 TTMServerMessageUpdateJob CNBanner:B13_051120_plaintext_staycurious-lp-appeal-template t=996 good
2013-05-29 23:03:40 TTMServerMessageUpdateJob CNBanner:B13_051120_plaintext_staycurious-lp-form-template STARTING
2013-05-29 23:03:40 TTMServerMessageUpdateJob CNBanner:B13_051120_plaintext_staycurious-lp-form-template t=556 good
2013-05-29 23:03:40 TTMServerMessageUpdateJob CNBanner:B13_051120_plaintext_staycurious-lp-form-countryspecific STARTING
2013-05-29 23:03:41 TTMServerMessageUpdateJob CNBanner:B13_051120_plaintext_staycurious-lp-form-countryspecific t=649 good
2013-05-29 23:03:41 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_frYY_autohide-hide-cookie-name STARTING
2013-05-29 23:03:42 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_frYY_autohide-hide-cookie-name t=1070 good
2013-05-29 23:03:42 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_frYY_autohide-hide-cookie-max-count STARTING
2013-05-29 23:03:43 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_frYY_autohide-hide-cookie-max-count t=898 good
2013-05-29 23:03:43 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_frYY_autohide-hide-die-size STARTING
2013-05-29 23:03:44 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_frYY_autohide-hide-die-size t=1237 good
2013-05-29 23:03:44 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_itYY_autohide-hide-cookie-name STARTING
2013-05-29 23:03:44 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_itYY_autohide-hide-cookie-name t=414 good
2013-05-29 23:03:44 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_itYY_autohide-hide-cookie-max-count STARTING
2013-05-29 23:03:45 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_itYY_autohide-hide-cookie-max-count t=720 good
2013-05-29 23:03:45 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_itYY_autohide-hide-die-size STARTING
2013-05-29 23:03:47 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_itYY_autohide-hide-die-size t=1497 good
2013-05-29 23:03:47 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_jaYY_autohide-hide-cookie-name STARTING
2013-05-29 23:03:47 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_jaYY_autohide-hide-cookie-name t=603 good
2013-05-29 23:03:47 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_jaYY_autohide-hide-cookie-max-count STARTING
2013-05-29 23:03:48 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_jaYY_autohide-hide-cookie-max-count t=556 good
2013-05-29 23:03:48 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_jaYY_autohide-hide-die-size STARTING
2013-05-29 23:03:49 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_jaYY_autohide-hide-die-size t=831 good
2013-05-29 23:03:49 TTMServerMessageUpdateJob CNBanner:B13_text_2013_dr_en-hide-cookie-name STARTING
2013-05-29 23:03:49 TTMServerMessageUpdateJob CNBanner:B13_text_2013_dr_en-hide-cookie-name t=609 good
2013-05-29 23:03:49 TTMServerMessageUpdateJob CNBanner:B13_text_2013_dr_en-hide-cookie-max-count STARTING
2013-05-29 23:03:50 TTMServerMessageUpdateJob CNBanner:B13_text_2013_dr_en-hide-cookie-max-count t=587 good
2013-05-29 23:03:50 TTMServerMessageUpdateJob CNBanner:B13_text_2009_dr_en-hide-cookie-name STARTING
2013-05-29 23:03:51 TTMServerMessageUpdateJob CNBanner:B13_text_2009_dr_en-hide-cookie-name t=741 good
2013-05-29 23:03:51 TTMServerMessageUpdateJob CNBanner:B13_text_2009_dr_en-hide-cookie-max-count STARTING
2013-05-29 23:03:51 TTMServerMessageUpdateJob CNBanner:B13_text_2009_dr_en-hide-cookie-max-count t=632 good
2013-05-29 23:03:51 TTMServerMessageUpdateJob Translations:Wikimedia_Foundation_elections/Board_elections/2013/Candidates/51 STARTING
2013-05-29 23:08:18 TTMServerMessageUpdateJob Translations:Wikimedia_Foundation_elections/Board_elections/2013/Candidates/51 t=267215 good

Further runs had several similarly slow Translations:Wikimedia_Foundation_elections/Board_elections/2013/Candidates instances.

Is there a way to find the slow functions via profiling?

(In reply to comment #7)

Is there a way to find the slow functions via profiling?

You mean methods inside of SolrTTMServer::update? It would help if there were profiling calls in there, I suppose (for all I/O at least).

strace might help here too...
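
For example, the I/O-heavy parts of SolrTTMServer::update() could be bracketed with profiling sections so that runJobs.php --profiler text shows where the time goes. This is only a sketch: the method internals and the createSelectQuery()/createUpdateQuery() helpers are illustrative; the wfProfileIn()/wfProfileOut() calls are the point.

```php
public function update( MessageHandle $handle, $targetText ) {
	wfProfileIn( __METHOD__ );

	wfProfileIn( __METHOD__ . '-select' );
	// Ask Solr for the existing documents of this message (network I/O).
	$existing = $this->client->select( $this->createSelectQuery( $handle ) );
	wfProfileOut( __METHOD__ . '-select' );

	wfProfileIn( __METHOD__ . '-update' );
	// Send the delete/add commands to Solr (network I/O).
	$this->client->update( $this->createUpdateQuery( $handle, $targetText, $existing ) );
	wfProfileOut( __METHOD__ . '-update' );

	wfProfileOut( __METHOD__ );
	return true;
}
```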

Related URL: https://gerrit.wikimedia.org/r/67249 (Gerrit Change Ie230f2e40e97fa1fb00f056d4631e64cbcbf9eae)

Related URL: https://gerrit.wikimedia.org/r/67250 (Gerrit Change I8a54e5dd00d6e6298e59000b8bdeaf4135d3158a)

Related URL: https://gerrit.wikimedia.org/r/67252 (Gerrit Change I61d03705a80f67762091ea3bde8fc2bbcc6ce520)

I submitted 3 patch sets that I expect will resolve all the issues. The main problem was that Solr was not optimised.

Gerrit 67249 enables autoCommit in Solr, which makes it possible for Solr to batch a bunch of index updates and commit them in one go. All open updates will be committed after at most 5 seconds.
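
In solrconfig.xml that corresponds roughly to the following; this is a sketch of the setting the patch enables, not the exact committed hunk:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Commit buffered index updates automatically after at most 5 seconds,
       so Solr can batch many small updates into a single commit. -->
  <autoCommit>
    <maxTime>5000</maxTime>
  </autoCommit>
</updateHandler>
```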

Gerrit 67250 ensures that Translate no longer requires immediate commits, but demands that Solr commits within 5 seconds using the commitWithin feature.
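
On the client side this means the update request no longer forces an explicit commit but passes a commitWithin time instead, roughly as below (assuming the Solarium client; the variable names are illustrative):

```php
$update = $client->createUpdate();
// Instead of $update->addDocuments( $documents ); $update->addCommit();
// let Solr commit on its own within 5 seconds (commitWithin is in milliseconds).
$update->addDocuments( $documents, /* overwrite */ true, /* commitWithin */ 5000 );
$client->update( $update );
```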

Gerrit 67252 increases ramBufferSizeMB from 32 MB to 100 MB, which is the new default since Solr 4.1 [1]. This reduces the number of segment merges at the cost of additional memory usage.
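
In solrconfig.xml this is a one-line change along these lines (sketch; the surrounding indexConfig block is shown only for context):

```xml
<indexConfig>
  <!-- Buffer up to 100 MB of pending index updates in RAM before flushing a new
       segment to disk; 100 is the default from Solr 4.1 onwards. -->
  <ramBufferSizeMB>100</ramBufferSizeMB>
</indexConfig>
```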

One possible easy optimisation remains: the mergeFactor. As I wasn't really able to choose between two evils at this point in time, and without access to the Solr admin interface to check what's actually happening, I think these changes should go live and we should see whether any additional optimisation is needed before doing another round of updates.

[1] https://issues.apache.org/jira/browse/SOLR-4074

Aaron says that TTMServerMessageUpdateJob jobs have been running since yesterday and there doesn't seem to be a big backlog; there have never been many, so he also hasn't dropped any such jobs from the queue. Vanadium seems happy and Special:SearchTranslations works.
Can this bug be closed?

I believe this has been fixed now.