
monitor that application servers are responding
Closed, Resolved, Public

Description

One of the application servers stopped responding even though the Apache process was still up (bug 52776). We need to monitor that the application servers are actually serving content.


Version: unspecified
Severity: enhancement

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 1:47 AM
bzimport set Reference to bz52867.
bzimport added a subscriber: Unknown Object (MLST).

That requires monitoring both that the Apache daemon is running AND that it is serving content.

Icinga has "Apache HTTP" monitoring in WMF production, but apparently only for hosts in PMTPA, not in EQIAD (another issue)

(In reply to comment #2)

Icinga has "Apache HTTP" monitoring in WMF production, but apparently only for hosts in PMTPA, not in EQIAD (another issue)

Ignore the EQIAD part - Apache isn't monitored on job runners

And this bug is about the beta cluster :)

(In reply to comment #4)

And this bug is about the beta cluster :)

I meant more that there should be Icinga config you can steal/hack/copy and paste or whatever ;)

Resetting severity. If it was really critical it would have been fixed long ago.

Yuvi Panda is working on integrating Shinken for labs, a drop-in replacement for Nagios/Icinga.

Hmm, so the 'ideal' way is for shinken to hit port 80 on those instances and check if they are serving content properly. This is complicated by firewall rules. We could theoretically just open up the web security group's port 80 to the shinken hosts, and that is probably the right thing to do here.
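To confirm whether the firewall change took effect, one could first probe plain TCP reachability of port 80 from the monitoring host. The snippet below is only a minimal sketch using the Python standard library; the function name `port_is_reachable` is made up for illustration and is not part of Shinken or of the patch discussed here.

```python
import socket


def port_is_reachable(host: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    A False result covers both "connection refused" and "silently dropped by a
    firewall / security group" (the latter shows up as a timeout).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

A reachable port only shows that something is listening; it says nothing about whether the server returns correct content, which is why the content checks are still needed.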

I'll work on this.

Change 181775 had a related patch set uploaded (by Yuvipanda):
beta: Add monitoring for mediawiki app servers

https://gerrit.wikimedia.org/r/181775

Patch-For-Review

Change 181775 merged by Yuvipanda:
beta: Add monitoring for mediawiki app servers

https://gerrit.wikimedia.org/r/181775

yuvipanda mentioned this in Unknown Object (Diffusion Commit). Dec 25 2014, 1:55 AM

http://shinken.wmflabs.org/host/deployment-mediawiki03 :D

So this adds monitoring for bits (requests a static image), and the enwiki main page (checks if the string 'Wikipedia' exists). This hits the individual mediawiki machines - specifically machines with the role role::beta::appserver applied, and reports errors if any.
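For readers who want to see what such a check amounts to, here is a rough sketch in Python of the same idea: send a GET to one specific machine, optionally with a virtual-host `Host` header, and verify the status code and an expected string. This is not the code from the merged patch (which configures Shinken); the function name, the return convention, and the hostnames and paths in the comments are assumptions for illustration only.

```python
import http.client


def check_http_contains(address, path, expected=None, host_header=None,
                        port=80, timeout=10.0):
    """Fetch `path` from one specific server and return (ok, message).

    address     -- the individual machine to hit (not a load balancer)
    expected    -- optional string that must appear in the response body
    host_header -- optional virtual host name to send as the Host header
    """
    headers = {"Host": host_header} if host_header else {}
    try:
        conn = http.client.HTTPConnection(address, port, timeout=timeout)
        try:
            conn.request("GET", path, headers=headers)
            resp = conn.getresponse()
            body = resp.read()
        finally:
            conn.close()
    except (OSError, http.client.HTTPException) as exc:
        # Server unreachable, hung past the timeout, or sent a malformed reply.
        return False, f"CRITICAL: {exc}"
    if resp.status != 200:
        return False, f"CRITICAL: HTTP {resp.status}"
    if expected is not None and expected.encode("utf-8") not in body:
        return False, f"CRITICAL: string {expected!r} not found"
    return True, "OK"


# Hypothetical usage (host names and paths are placeholders, not real config):
#   check_http_contains("deployment-mediawiki03", "/wiki/Main_Page",
#                       expected="Wikipedia", host_header="enwiki.example")
#   check_http_contains("deployment-mediawiki03", "/static/some-image.png",
#                       host_header="bits.example")
```

Hitting each machine directly rather than going through the front-end cache is what lets a check like this single out one misbehaving app server.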

As soon as it got merged, it told me that mediawiki03 was failing the regular main page check (but not bits!). Restarting hhvm seems to have fixed that.

Resolving since the app servers have monitoring now.

Change 181787 had a related patch set uploaded (by Yuvipanda):
beta: Add HHVM queue size monitoring

https://gerrit.wikimedia.org/r/181787

Patch-For-Review

Change 181787 merged by Yuvipanda:
beta: Add HHVM queue size monitoring

https://gerrit.wikimedia.org/r/181787

yuvipanda mentioned this in Unknown Object (Diffusion Commit). Dec 25 2014, 7:10 PM

That is excellent, @yuvipanda. Thank you very much!

Change 183454 had a related patch set uploaded (by Hashar):
beta: monitor mobile main page

https://gerrit.wikimedia.org/r/183454

Patch-For-Review

Change 183454 merged by Yuvipanda:
beta: monitor mobile main page

https://gerrit.wikimedia.org/r/183454