stat1001's apache not running (stats.wikimedia.org, datasets.wikimedia.org not available) on 2014-10-05
Closed, Resolved (Public)

Description

From 2014-10-05's SAL [1]

20:08 Nemo_bis: 22.03 < Ainali> It was just noticed on svwp village pump that http://stats.wikimedia.org is down

I checked, and apache is currently not running on stat1001 (although it should).
Hence, all its configured sites are unavailable.
This includes

stats.wikimedia.org
datasets.wikimedia.org
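
For anyone wanting to re-check the affected sites from outside, here is a minimal sketch. The helper names `http_status` and `check_sites` are made up for illustration, and this only tests whether a site answers over HTTP; it says nothing about why apache is down.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def check_sites(urls, fetch=http_status):
    """Map each URL to True if it answered with a 2xx or 3xx status."""
    result = {}
    for url in urls:
        status = fetch(url)
        result[url] = status is not None and 200 <= status < 400
    return result


# Hypothetical usage (needs network access):
# for url, ok in check_sites(["http://stats.wikimedia.org",
#                             "http://datasets.wikimedia.org"]).items():
#     print(url, "up" if ok else "DOWN")
```

The `fetch` parameter is only there so the check can be exercised without touching the network.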

stat1001's dmesg showed 6 messages about limn-reportcard respawning too fast
every 20 minutes (puppet run?) until 2014-10-04 17:45.

Might be that things broke around that time.

Icinga shows CRITICAL for the "puppet last run" service.
(But the service is currently muted. Anyone know why?)

[1] https://wikitech.wikimedia.org/wiki/Server_Admin_Log


Version: unspecified
Severity: normal

Details

Reference
bz71686

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:47 AM
bzimport set Reference to bz71686.
bzimport added a subscriber: Unknown Object (MLST).

This ticket needs Ops power. I filed RT 8554 for it.

https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&c=Miscellaneous+eqiad&h=stat1001.wikimedia.org&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=ALLGROUPS shows that at some point 1 GB memory was freed and then traffic dropped.

Hoo concluded that apache2 died and puppet doesn't configure the machine to restart it.

Change 164914 had a related patch set uploaded by QChris:
End stats.wikimedia.org certificate in newline

https://gerrit.wikimedia.org/r/164914

Change 164914 merged by Filippo Giunchedi:
End stats.wikimedia.org certificate in newline

https://gerrit.wikimedia.org/r/164914

godog restarted apache on stat1001.

https://stats.wikimedia.org/
https://datasets.wikimedia.org/

are working again.

It seems certificate chaining choked on stats.wikimedia.org's
certificate not ending in a newline. Stop-gap fix is in comment #3.
But godog and _joe_ said this setting should be caught by the
certificate chaining itself, which makes sense.
The RT ticket has been updated accordingly.
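
To illustrate the suspected failure mode, here is a minimal sketch. It assumes the chaining step is a plain concatenation of PEM files, which this ticket does not confirm, and the function names and placeholder certificate bodies are made up. It shows how a missing trailing newline fuses the END and BEGIN markers, and how a chaining step could guard against that itself.

```python
def naive_chain(parts):
    """Concatenate PEM blobs as-is (assumed behaviour of the chaining step)."""
    return "".join(parts)


def chain_pem(parts):
    """Concatenate PEM blobs, forcing each one to end in exactly one newline."""
    return "".join(part.rstrip("\r\n") + "\n" for part in parts)


# Placeholder certificate bodies, not real certificates.
leaf = (
    "-----BEGIN CERTIFICATE-----\n"
    "AAAA\n"
    "-----END CERTIFICATE-----"      # note: no trailing newline
)
intermediate = (
    "-----BEGIN CERTIFICATE-----\n"
    "BBBB\n"
    "-----END CERTIFICATE-----\n"
)

# With naive_chain, the END marker of the first certificate and the BEGIN
# marker of the second end up on the same line, so the result is no longer
# valid PEM. chain_pem keeps every marker on its own line.
broken = naive_chain([leaf, intermediate])
fixed = chain_pem([leaf, intermediate])
```

Normalizing inside the chaining step would make it robust regardless of how individual certificate files end, which matches what godog and _joe_ suggested.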

Thanks godog and _joe_!

Page requests that match "undefined" from the October sample logs so far.

attachment undefined-october.txt.gz ignored as private

Please ignore prior assignment.