We collect a lot of useful information in Ganglia, but currently can't set up alerts based on it. As a result, we often discover issues only after users report them, because manual checking does not scale.
Version: wmf-deployment
Severity: normal
See Also:
https://rt.wikimedia.org/Ticket/Display.html?id=6955