Replication checks disabled in Icinga for most analytics slaves
Closed, Resolved, Public

Description

Icinga's replication checks are disabled for 6/7 analytics slaves.

Let's get them turned on again, so Icinga alerts our team about replication lag.

Icinga shows the following relevant services disabled:

  • s1-analytics-slave.eqiad.wmnet (db1047.eqiad.wmnet)
    • MySQL Replication Heartbeat
    • MySQL Slave Delay
  • s2-analytics-slave.eqiad.wmnet (db69.pmtpa.wmnet)
    • MySQL Replication Heartbeat
    • MySQL Slave Delay
    • MySQL Slave Running
  • s3-analytics-slave.eqiad.wmnet (db71.pmtpa.wmnet)
    • MySQL Replication Heartbeat
    • MySQL Slave Delay
  • s4-analytics-slave.eqiad.wmnet (db72.pmtpa.wmnet)
    • <none>
  • s4-analytics-slave.eqiad.wmnet (db1017.eqiad.wmnet)
    • MySQL Replication Heartbeat
    • MySQL Slave Delay
  • s6-analytics-slave.eqiad.wmnet (db74.pmtpa.wmnet)
    • MySQL Replication Heartbeat
    • MySQL Slave Delay
  • s7-analytics-slave.eqiad.wmnet (db68.pmtpa.wmnet)
    • MySQL Replication Heartbeat
    • MySQL Slave Delay

Version: unspecified
Severity: normal

Details

Reference
bz64088

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 3:16 AM
bzimport set Reference to bz64088.
bzimport added a subscriber: Unknown Object (MLST).

It seems no one in our team knows why the alerts are disabled, so I
pinged springle about it.

bingle-admin wrote:

Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1555

Discussion with springle showed that the Icinga alerts are turned off
on purpose, as they go off too often (due to slow queries run by
analytics :-) ).

Since a separate machine for slow queries is already on the way,
springle suggested waiting for this machine; once slow queries have
been migrated over, we turn on Icinga alerts for the other machines
again.

Until then I'll keep an eye on the lag and send out alerts if it gets
too high.

Thanks for catching this, Christian!

-Toby

jcrespo claimed this task.
jcrespo subscribed.

This no longer applies: these hosts have been decommissioned, the suggested fix was applied a long time ago, and the analytics databases do indeed report lag.