
"ulsfo <-> eqiad" network issue on 2014-10-21 affecting udp2log streams
Closed, Declined (Public)

Description

Ops reported [1] a network issue between ulsfo and eqiad. According to the
IRC logs [2], alerts started around 2014-10-21 ~10:30.

We did not see alerts on the udp2log pipeline.
However, we saw alerts from the tighter monitoring of the kafka pipeline.

Did the issue affect the udp2log pipeline too?

[1] https://lists.wikimedia.org/mailman/private/ops/2014-October/042427.html
[2] http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20141021.txt


Version: unspecified
Severity: normal

Details

Reference
bz72355

Event Timeline

bzimport raised the priority of this task from to Needs Triage. (Nov 22 2014, 3:56 AM)
bzimport set Reference to bz72355.
bzimport added a subscriber: Unknown Object (MLST).

The udp2log pipeline showed its first sporadic ulsfo drop-outs at
2014-10-21T10:58 and continued to show them until ulsfo was
depooled at 2014-10-21T11:43
(Ifc2a1f1abb7d532e01782b05df764bf4cd072014).

A per-host packet loss computation for the affected hour does not give a
meaningful result, because depooling ulsfo reduced the message volume from
the ulsfo hosts too much.
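
To illustrate why low volume makes the number unreliable, here is a minimal
sketch of a per-host loss estimate. It assumes each log line carries the
sending host and a per-host sequence number, and that loss is estimated from
gaps in those sequence numbers; the function name, the min_messages
threshold, and the host name in the usage note are hypothetical and not
taken from the actual udp2log tooling.

```python
from collections import defaultdict


def per_host_loss(records, min_messages=1000):
    """Estimate packet loss per host from sequence-number gaps.

    records: iterable of (host, sequence_number) pairs seen by the receiver.
    Returns {host: loss_fraction}, or None for a host with fewer than
    min_messages received lines, since the estimate is then too noisy.
    """
    seen_by_host = defaultdict(set)
    for host, seq in records:
        seen_by_host[host].add(seq)

    result = {}
    for host, seen in seen_by_host.items():
        received = len(seen)
        if received < min_messages:
            # Too few messages (e.g. the site was depooled): no usable value.
            result[host] = None
            continue
        # Every sequence number between the lowest and highest one seen
        # should have arrived; anything missing counts as lost.
        expected = max(seen) - min(seen) + 1
        result[host] = 1 - received / expected
    return result
```

With a hypothetical host "cp4001" that delivered sequence numbers 1, 2, 4
and 5, per_host_loss(records, min_messages=1) reports a loss of about 0.2.
With the default threshold the same host yields None, which is the
situation described above for the hour in which ulsfo was depooled.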

(In reply to christian from comment #0)

> We did not see alerts on the udp2log pipeline.

That's wrong.
There were alerts [1]:

[11:54:29] <icinga-wm>         PROBLEM - Packetloss_Average on erbium is CRITICAL: packet_loss_average CRITICAL: 9.11388505882
[12:02:12] <icinga-wm>         PROBLEM - Packetloss_Average on analytics1026 is CRITICAL: packet_loss_average CRITICAL: 23.0722363964
[12:06:06] <icinga-wm>         RECOVERY - Packetloss_Average on erbium is OK: packet_loss_average OKAY: 0.0
[12:21:25] <icinga-wm>         RECOVERY - Packetloss_Average on analytics1026 is OK: packet_loss_average OKAY: 2.49366398305
[12:27:01] <icinga-wm>         RECOVERY - Packetloss_Average on oxygen is OK: packet_loss_average OKAY: 1.85878847458

[1] http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20141021.txt
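
For anyone who wants to pull such alerts out of the channel log again, a
small sketch follows. It assumes the line format shown in the excerpt
above; the regular expression and the function name are my own and not part
of any existing tool.

```python
import re

# Matches icinga-wm Packetloss_Average lines as they appear in the IRC log.
ALERT_RE = re.compile(
    r"\[(?P<time>\d{2}:\d{2}:\d{2})\]\s+<icinga-wm>\s+"
    r"(?P<state>PROBLEM|RECOVERY) - Packetloss_Average on (?P<host>\S+) is "
    r"(?P<status>\w+): packet_loss_average \w+: (?P<value>[\d.]+)"
)


def parse_alerts(lines):
    """Return (time, state, host, value) tuples for matching log lines."""
    alerts = []
    for line in lines:
        match = ALERT_RE.search(line)
        if match:
            alerts.append(
                (
                    match["time"],
                    match["state"],
                    match["host"],
                    float(match["value"]),
                )
            )
    return alerts
```

Feeding it the lines of the log file yields one tuple per alert, which
makes it easy to list PROBLEM and RECOVERY events per host in time order.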