
[toolforge.infra] Set up alerts for mail queue
Open, Low, Public, Feature

Description

We should add monitoring for a stuck mail queue. A quick search turns up various existing solutions for Nagios/Icinga. The thresholds (how many mails of what age may sit in the queue) should be fairly low: on Toolserver, I believe ACC sent mail to user-specified addresses and was therefore prone to typos ("hotmaiil.com", etc.), but most mail on Tools should be delivered successfully within minutes.
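A minimal sketch of the threshold idea, assuming the ages of the queued messages (in seconds) have already been collected somehow. The function name `queue_status` and the default values for `max_age`, `warn_count`, and `crit_count` are hypothetical and would need tuning; the return values follow the usual Nagios/Icinga plugin exit-code convention.

```python
# Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def queue_status(ages_seconds, max_age=600, warn_count=5, crit_count=20):
    """Classify a mail queue by how many messages exceed max_age.

    ages_seconds: iterable of message ages in seconds.
    All threshold defaults are placeholders, not recommended values.
    """
    old_messages = sum(1 for age in ages_seconds if age > max_age)
    if old_messages >= crit_count:
        return CRITICAL
    if old_messages >= warn_count:
        return WARNING
    return OK
```

With these placeholder thresholds, a handful of fresh messages stays OK, while five or more messages older than ten minutes would raise a warning.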


Version: unspecified
Severity: enhancement

Details

Reference
bz58871

Event Timeline

bzimport raised the priority of this task from to Needs Triage. Nov 22 2014, 2:20 AM
bzimport added a project: Toolforge.
bzimport set Reference to bz58871.

Some lessons learned from cleaning the stuck queues:

  • We need to check the queues on *every* host, not just tools-mail. We could use different thresholds for tools-mail and the rest as the latter only needs to talk to tools-mail, but I don't think that's necessary.
  • Sometimes there are leftovers in /var/spool/exim4/{input,msglog} from exim hiccups (OOM?) where only the -D or only the -H file exists. Correlating them with the queue is hard; an easier approach is to check for any files there that are older than the Icinga threshold + x days. This will not detect hiccups instantly, but not too late either.
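The second point could be sketched roughly as follows. This is only an illustration: the function names `find_stale_files` and `find_orphans` are made up, the age limit is left to the caller, and the orphan check simply assumes that spool files are named `<message-id>-H` and `<message-id>-D`.

```python
import time
from pathlib import Path


def find_stale_files(directories, max_age_seconds, now=None):
    """Return files under the given directories older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for directory in directories:
        root = Path(directory)
        if not root.is_dir():
            continue  # a missing spool directory is not an error here
        for path in root.rglob("*"):
            if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
                stale.append(path)
    return sorted(stale)


def find_orphans(input_dir):
    """Return message IDs that have only a -H or only a -D file."""
    headers, data = set(), set()
    root = Path(input_dir)
    if root.is_dir():
        for path in root.rglob("*"):
            if not path.is_file():
                continue
            if path.name.endswith("-H"):
                headers.add(path.name[:-2])
            elif path.name.endswith("-D"):
                data.add(path.name[:-2])
    # Symmetric difference: IDs present in exactly one of the two sets.
    return sorted(headers ^ data)
```

A check might call `find_stale_files(["/var/spool/exim4/input", "/var/spool/exim4/msglog"], limit)` with `limit` set to the Icinga threshold plus a few days, and alert if the result is non-empty.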

Also, /var/log/exim4/panic should either be empty or not exist.
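The panic log condition is simple enough to check directly. A minimal sketch, assuming the path given above; the function name `panic_log_is_clean` is hypothetical.

```python
import os


def panic_log_is_clean(path="/var/log/exim4/panic"):
    """Return True if the panic log is empty or does not exist."""
    try:
        return os.path.getsize(path) == 0
    except FileNotFoundError:
        return True
```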

Change 143111 had a related patch set uploaded by Yuvipanda:
toollabs: Send exim queue length to graphite

https://gerrit.wikimedia.org/r/143111

Change 143111 merged by coren:
toollabs: Send exim queue length to graphite

https://gerrit.wikimedia.org/r/143111

coren renamed this task from Set up Icinga monitoring for mail queue to Set up alerts for mail queue. Mar 25 2015, 6:44 PM
coren removed coren as the assignee of this task.
coren triaged this task as Medium priority.
coren set Security to None.
coren subscribed.

Queue length is now monitored, but (afaict) there is no alerting. Changed the task title accordingly.

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 12:23 PM
dcaro lowered the priority of this task from Medium to Low. Feb 21 2024, 1:29 PM
dcaro subscribed.

This would now be done on Prometheus + Alertmanager/metricsinfra.

dcaro renamed this task from Set up alerts for mail queue to [toolforge.infra] Set up alerts for mail queue. Feb 21 2024, 1:31 PM