icinga warning:
icinga-wm: PROBLEM - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
The port responds when queried interactively, so it looks like the 10-second timeout is just a bit too short for what the health check is trying to do.
In particular, the health check runs du -s on several cache directories, one of which is now >6G. (The icinga limit for that directory is 40G.) The latest deploy included a change which partially serialized that du (Ie359701c6972cd49786ffde1e8be1cb64d356fa2), which might be why we have recently started brushing up against the timeout.
We should improve the speed of the health check. Probably the best way to do this is to cache the sizes of the directories, or to run the du step less frequently. Alternatively, we could add a quick check that skips the cache size step. Re-adding some of the parallelism to the du might help somewhat, but probably not enough to cover us when the cache directory climbs nearer its 40G limit.
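A rough sketch of the caching idea, in Python purely for illustration (this is not the service's actual health check code). The class name CachedDirSize, the ttl_seconds parameter and the injectable clock are all hypothetical. It also sums apparent file sizes rather than disk blocks, so its numbers will differ somewhat from du -s.

```python
import os
import time


class CachedDirSize:
    """Remember a directory's total size and rescan only after a TTL expires.

    Hypothetical helper: all names and defaults here are assumptions.
    """

    def __init__(self, path, ttl_seconds=300.0, clock=time.monotonic):
        self.path = path
        self.ttl_seconds = ttl_seconds
        self._clock = clock        # injectable so the TTL can be tested
        self._size = None          # last measured size in bytes
        self._measured_at = None   # clock value at the last scan
        self.scans = 0             # number of full scans performed

    def _scan(self):
        # Sum apparent file sizes; du -s counts disk blocks instead.
        total = 0
        for root, _dirs, files in os.walk(self.path):
            for name in files:
                try:
                    total += os.lstat(os.path.join(root, name)).st_size
                except OSError:
                    # The file vanished mid-scan (cache eviction); skip it.
                    pass
        return total

    def get(self):
        """Return the cached size, rescanning only if the cache is stale."""
        now = self._clock()
        stale = (
            self._size is None
            or now - self._measured_at >= self.ttl_seconds
        )
        if stale:
            self._size = self._scan()
            self._measured_at = now
            self.scans += 1
        return self._size
```

With this shape, most health check requests return the remembered value immediately. The request that lands just after the TTL expires still pays for a full scan, so a real fix would probably refresh the value from a background timer instead of inside the request.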
Version: unspecified
Severity: normal