
metrics.wikimedia.org (Wikimetrics) unresponsive
Closed, Resolved, Public

Description

https://metrics.wmflabs.org/

is currently (2014-07-28 15:29) very unresponsive (and may appear down).
Some pages (like uploading a new cohort) temporarily gave me

Wikimetrics is experiencing problems

errors in the browser.

Load is somewhere around 25-35.

Of the processes, it stands out that there are ~100 queue processes and
~130 mysqld processes.
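
A tally like the one above can be reproduced with a small script. This is only a sketch: it assumes the process list comes from `ps -eo comm=` (one command name per line, no header), and `count_processes` is a hypothetical helper, not part of Wikimetrics. The actual command names of the queue workers on the host are not given in this report.

```python
import os
from collections import Counter


def count_processes(ps_output: str) -> Counter:
    """Count processes per command name.

    Expects text shaped like the output of `ps -eo comm=`:
    one command name per line, blank lines ignored.
    """
    names = (line.strip() for line in ps_output.splitlines())
    return Counter(name for name in names if name)


def load_averages() -> tuple:
    """Return the 1-, 5- and 15-minute load averages (Unix only)."""
    return os.getloadavg()
```

Feeding it the captured `ps` output and calling `.most_common(5)` on the result would list the heaviest process groups first, which is how a pile-up of queue and mysqld processes would stand out.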


Version: unspecified
Severity: normal
Whiteboard: u=Community c=Wikimetrics p=0 s=2014-07-24

Details

Reference
bz68743

Event Timeline

bzimport raised the priority of this task to Needs Triage. (Nov 22 2014, 3:28 AM)
bzimport set Reference to bz68743.

Assigning to milimetric, as he is about to kill the relevant jobs.

This is due to recurring reports I ran to test wikimetrics and see if it could handle back-filling lots of data. It back-filled 2 large wikis at a time all the way to 2007. However, when running 5 wikis at a time, the system became unstable and basically everything that could have possibly gone wrong went wrong. Further optimization work is clearly needed. For now, cleaning up after the mess:

  • killed the queue and scheduler
  • deleted the bot's reports: delete from report where user_id = 461; -- this is the WikimetricsBot user
  • copied the relevant queue logs to: /data/project/wikimetrics/backup/bug-68743-logs/
  • restarted the whole system
  • purged the leftover messages from celery
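
The report-deletion step in the list above can be sketched as a parameterized delete that reports how many rows it removed. This is an illustration only: it uses an in-memory SQLite database as a stand-in for the real MySQL database, the `report` table here has just the two columns needed for the demo, and `delete_reports_for_user` and `demo` are hypothetical names.

```python
import sqlite3


def delete_reports_for_user(conn: sqlite3.Connection, user_id: int) -> int:
    """Delete all rows in `report` owned by `user_id`; return the row count."""
    with conn:  # commits on success, rolls back if the delete raises
        cur = conn.execute("DELETE FROM report WHERE user_id = ?", (user_id,))
        return cur.rowcount


def demo() -> tuple:
    """Build a toy `report` table and delete the rows of user 461."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE report (id INTEGER PRIMARY KEY, user_id INTEGER)")
    conn.executemany(
        "INSERT INTO report (user_id) VALUES (?)", [(461,), (461,), (7,)]
    )
    deleted = delete_reports_for_user(conn, 461)
    remaining = conn.execute("SELECT COUNT(*) FROM report").fetchone()[0]
    conn.close()
    return deleted, remaining
```

Returning the affected row count makes it easy to sanity-check that only the bot's reports were removed before restarting the system.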

Also, I deleted the symlinks from the /var/lib/wikimetrics/public/datafiles folder. This leaves the system in a fairly clean state. I left the old report results there, as they may be interesting to compare with the manually generated data or useful for troubleshooting.
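
Removing only the symlinks while keeping the result files could look like the following sketch. It assumes the symlinks sit directly inside the folder (no recursion), and `remove_symlinks` is a hypothetical helper rather than anything shipped with Wikimetrics.

```python
import os


def remove_symlinks(directory: str) -> list:
    """Unlink every symlink directly inside `directory`.

    Regular files and subdirectories are left untouched. Returns the
    sorted names of the removed links.
    """
    # Collect first, then unlink, so the directory is not modified
    # while it is being scanned.
    with os.scandir(directory) as entries:
        links = [(entry.name, entry.path) for entry in entries if entry.is_symlink()]
    for _name, path in links:
        os.unlink(path)  # removes the link itself, never its target
    return sorted(name for name, _path in links)
```

Because `os.unlink` acts on the link and not on what it points to, the old report results stay on disk for later comparison.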