Speed up health check.
Closed, Resolved, Public

Assigned To

Authored By

	cscott
	Sep 24 2014, 11:07 PM

Description

icinga warning:
icinga-wm: PROBLEM - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds

The port is responsive interactively; it looks like the timeout is just a bit too short for what the health check is trying to do.

In particular, the health check does a du -s of several cache directories, one of which is >6G now. (The icinga limit for that directory is 40G.) The latest deploy included a change which partially serialized that du (Ie359701c6972cd49786ffde1e8be1cb64d356fa2), which might be the cause of our recently starting to toe the timeout line.

We should improve the speed of the health check. Probably the best way to do this is to cache the sizes of the directories or to do the du step less frequently. Alternatively, we could add a quick check that skips the cache size step. Re-adding some of the parallelism to the du might help somewhat, but probably not enough to cover us when the cache directory climbs nearer its 40G limit.
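The caching idea above could look roughly like the following. This is only a sketch under assumptions, not the code from any OCG patch: it is written in Python for illustration, and the names `directory_size`, `CachedDirSizes`, and `ttl_seconds` are hypothetical. It sums apparent file sizes rather than disk blocks, so its numbers will differ somewhat from du -s.

```python
import os
import time


def directory_size(path):
    """Sum the apparent sizes (bytes) of all files under path.

    Roughly what `du -s` is used for, but counts st_size rather than
    allocated blocks, so results will not match du exactly.
    """
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # File disappeared between listing and stat; skip it.
                pass
    return total


class CachedDirSizes:
    """Hypothetical helper: remember each directory size for ttl_seconds."""

    def __init__(self, ttl_seconds=300.0, measure=directory_size,
                 clock=time.monotonic):
        self.ttl = ttl_seconds
        self.measure = measure
        self.clock = clock
        self._cache = {}  # path -> (time measured, size in bytes)

    def size(self, path):
        now = self.clock()
        hit = self._cache.get(path)
        if hit is not None and now - hit[0] < self.ttl:
            # Fresh enough: answer without touching the filesystem.
            return hit[1]
        value = self.measure(path)
        self._cache[path] = (now, value)
        return value
```

With something like this, only the first health check after the cached value expires pays for the directory walk; the rest return the stored number. The trade-off is that the reported size can be stale by up to `ttl_seconds`, which seems acceptable when the limit is 40G and the directory grows slowly.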

Version: unspecified
Severity: normal

Details

Reference: bz71260

Event Timeline

• bzimport raised the priority of this task from to Needs Triage. Nov 22 2014, 3:54 AM

• bzimport added a project: OfflineContentGenerator.

• bzimport set Reference to bz71260.

cscott created this task. Sep 24 2014, 11:07 PM

Change 162933 had a related patch set uploaded by Cscott:
Speed up/cache directory size computation in health check.

https://gerrit.wikimedia.org/r/162933

Change 162933 merged by jenkins-bot:
Speed up/cache directory size computation in health check.

https://gerrit.wikimedia.org/r/162933

Landed the above patch to cache the directory sizes. From local testing I expected it to cut the time taken for a (cached) health check to ~30ms, but I'm still seeing ~7s request times. So I think there's still an issue here.

Change 163186 had a related patch set uploaded by Cscott:
Increase OCG warning/critical space thresholds.

https://gerrit.wikimedia.org/r/163186

Change 163186 merged by BBlack:
Increase OCG warning/critical space thresholds.

https://gerrit.wikimedia.org/r/163186

Fixed with https://gerrit.wikimedia.org/r/163997
