
Packet loss alarm on oxygen on 2014-08-16
Closed, ResolvedPublic

Description

We had a packet loss alert on oxygen:

[23:03:09] <icinga-wm> PROBLEM - Packetloss_Average on oxygen is CRITICAL: packet_loss_average CRITICAL: 54.8310445455

(see http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20140816.txt )

It seems ottomata's restarting of udp2log [1] made the problem go away:

[04:34:54] <ottomata> !log restarted udp2log on oxygen
[...]
[04:51:08] <icinga-wm> RECOVERY - Packetloss_Average on oxygen is OK: packet_loss_average OKAY: 0.0

(see http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20140817.txt )

What happened?
How badly did it affect us?

(Was it related to bug 69661 ?)


Version: unspecified
Severity: normal
Whiteboard: u=Community c=General/Unknown p=0 s=2014-08-07
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=69661

Details

Reference
bz69663

Event Timeline

bzimport raised the priority of this task from to Needs Triage. Nov 22 2014, 3:37 AM
bzimport set Reference to bz69663.

Root cause of the traffic drop is that we had to restart udp2log for our removal of one of the filters to take effect. Alarms were raised for what seemed to be a longer event, but the traffic drop lasted only a few minutes.

See the drop in requests around 22:45 on the 16th for mobile traffic:

2014-08-16T22:39 4482
2014-08-16T22:40 4391
2014-08-16T22:41 4408
2014-08-16T22:42 4354
2014-08-16T22:43 1628
2014-08-16T22:44 4
2014-08-16T22:46 1
2014-08-16T22:47 3
2014-08-16T22:48 631
2014-08-16T22:49 4419
2014-08-16T22:50 4460
2014-08-16T22:51 4312

A graph that illustrates the same drop: http://i.imgur.com/7p96Wmh.png
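The per-minute counts above can also be checked mechanically instead of by eye. What follows is only a rough sketch, not the tooling that was actually used: the function name `find_loss`, the 50% threshold, and the use of the median as a baseline are all assumptions made for illustration. It assumes input lines of the form `timestamp count`, as pasted above, and a window in which the outage covers less than half of the minutes (otherwise the median is no longer a sane baseline).

```python
from datetime import datetime, timedelta
from statistics import median

# Per-minute counts copied from the comment above (mobile traffic).
SAMPLE = """\
2014-08-16T22:39 4482
2014-08-16T22:40 4391
2014-08-16T22:41 4408
2014-08-16T22:42 4354
2014-08-16T22:43 1628
2014-08-16T22:44 4
2014-08-16T22:46 1
2014-08-16T22:47 3
2014-08-16T22:48 631
2014-08-16T22:49 4419
2014-08-16T22:50 4460
2014-08-16T22:51 4312
"""


def find_loss(lines, threshold=0.5):
    """Return (low, missing) minutes for 'timestamp count' lines.

    low:     minutes whose count is below threshold * median count
    missing: minutes between first and last timestamp with no line at all
    The threshold and the median baseline are illustrative assumptions.
    """
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        minute = datetime.strptime(parts[0], "%Y-%m-%dT%H:%M")
        counts[minute] = int(parts[1])  # trailing markers like "(*)" are ignored

    baseline = median(counts.values())
    low = sorted(t for t, c in counts.items() if c < threshold * baseline)

    missing = []
    t, end = min(counts), max(counts)
    while t <= end:
        if t not in counts:
            missing.append(t)
        t += timedelta(minutes=1)
    return low, missing


if __name__ == "__main__":
    low, missing = find_loss(SAMPLE.splitlines())
    print("low:    ", [t.strftime("%H:%M") for t in low])
    print("missing:", [t.strftime("%H:%M") for t in missing])
```

On the sample above this flags 22:43, 22:44, 22:46, 22:47 and 22:48 as low, and 22:45 as missing entirely (there is no line for that minute in the pasted data).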

Resolving bug and updating log of events that affect feeds on wikitech.

Reopening, as it looks like other intervals were affected.

We had alarms about traffic drops on Aug 16th 23:03 (packetloss), Aug 17th 00:11, Aug 17th 05:51, Aug 17th 07:51 (oxygen).

Recovery was sent on Aug 17th 04:51 (packetloss) and Aug 17th 13:17 (oxygen).

I looked at the files for the 17th and 18th again, and the only event I can find with significant loss is shown below.

2014-08-17T06:24 3474
2014-08-17T06:25 3516
2014-08-17T06:26 1385 (*)
2014-08-17T06:29 3059
2014-08-17T06:30 3605

Assigning to Christian per his request to take all prod issues in the upcoming weeks.

Oxygen's alarms (see comment #3) around packet loss and udp2log from
2014-08-16 23:03:09 until 2014-08-17 07:51:08 were just artifacts of
bug 69661.

Loss in TSVs (from comment #1 and comment #3) is real though.

The loss on 2014-08-16 ~22:46 was due to the root mount effectively
getting full, which made services panic and CPU usage jump up.

The loss starting on 2014-08-17 ~06:25 was due to log rotation kicking
in and reshuffling some files on the root mount a bit. That freed up a
bit of disk space for <20mins, and services recovered a bit until the
root mount got full again. CPU usage then went up further.
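Since both losses trace back to the root mount filling up, a free-space check on that mount is the kind of thing that would have pointed at the cause earlier. The snippet below is a hypothetical sketch only, not the monitoring that runs on oxygen: the function names and the 5% free-space limit are assumptions picked for illustration. It uses Python's standard `shutil.disk_usage`.

```python
import shutil


def is_nearly_full(total, free, min_free_fraction=0.05):
    """True if less than min_free_fraction of the filesystem is free.

    The 5% default is an arbitrary illustrative limit, not a known setting.
    """
    return free < total * min_free_fraction


def check_mount(path="/", min_free_fraction=0.05):
    """Check the filesystem that holds `path` (root mount by default)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return is_nearly_full(usage.total, usage.free, min_free_fraction)


if __name__ == "__main__":
    if check_mount("/"):
        print("CRITICAL: root mount is nearly full")
    else:
        print("OK: root mount has free space")
```

A check like this only reports a snapshot. In the scenario described above, where log rotation freed space for under 20 minutes, it would have briefly reported OK before going critical again.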

The losses affected all the multicast udp2log filters on oxygen:

zero tsvs
edits tsvs
mobile-sampled-100 tsvs
5xx tsvs
webstatscollector