
"ulsfo <-> eqiad" network issue on 2014-10-20 affecting udp2log streams
Closed, Declined (Public)

Description

Ops reported [1] a network issue between ulsfo and eqiad (2014-10-20 ~13:07).

We did not see alerts on the udp2log pipeline.
However, we saw alerts from the tighter monitoring of the kafka pipeline.

Did the issue affect the udp2log pipeline too?

[1] https://lists.wikimedia.org/mailman/private/ops/2014-October/042274.html


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=72296

Details

Reference
bz72306

Event Timeline

bzimport set the priority of this task to Needs Triage (Nov 22 2014, 3:53 AM).
bzimport set Reference to bz72306.
bzimport added a subscriber: Unknown Object (MLST).

(In reply to christian from comment #0)

However, we saw alerts from the tighter monitoring of the kafka pipeline.

For the kafka pipeline, the bug is 72296.

The udp2log pipeline seems to have been affected between
2014-10-20T13:06 and 2014-10-20T13:27.

Per-host hourly packet loss ranges from 6% to 47% for the ulsfo caches
in the hour that covers the affected period.

+--------------------+--------------+
|                    |     Per hour |
|                    |   packetloss |
| Host               | (in percent) |
+--------------------+--------------+
| cp4005.ulsfo.wmnet |           46 |
| cp4006.ulsfo.wmnet |           12 |
| cp4007.ulsfo.wmnet |           47 |
| cp4008.ulsfo.wmnet |           42 |
| cp4009.ulsfo.wmnet |           38 |
| cp4010.ulsfo.wmnet |            8 |
| cp4011.ulsfo.wmnet |           36 |
| cp4012.ulsfo.wmnet |           37 |
| cp4013.ulsfo.wmnet |            6 |
| cp4014.ulsfo.wmnet |           44 |
| cp4015.ulsfo.wmnet |            7 |
| cp4016.ulsfo.wmnet |           22 |
| cp4017.ulsfo.wmnet |           40 |
| cp4018.ulsfo.wmnet |           12 |
| cp4019.ulsfo.wmnet |           45 |
| cp4020.ulsfo.wmnet |            9 |
+--------------------+--------------+

Non-ulsfo caches don't show a drop/rise.
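
As a quick cross-check of the 6-47% range quoted above, the table can be summarized with a few lines of Python. This is only a sketch: the numbers are copied by hand from the table, and the names LOSS_PERCENT and summarize are made up for this illustration, not part of any udp2log tooling.

```python
# Per-host hourly packet loss (in percent), copied from the table above.
# LOSS_PERCENT and summarize() are illustrative names, not existing tooling.
LOSS_PERCENT = {
    "cp4005.ulsfo.wmnet": 46,
    "cp4006.ulsfo.wmnet": 12,
    "cp4007.ulsfo.wmnet": 47,
    "cp4008.ulsfo.wmnet": 42,
    "cp4009.ulsfo.wmnet": 38,
    "cp4010.ulsfo.wmnet": 8,
    "cp4011.ulsfo.wmnet": 36,
    "cp4012.ulsfo.wmnet": 37,
    "cp4013.ulsfo.wmnet": 6,
    "cp4014.ulsfo.wmnet": 44,
    "cp4015.ulsfo.wmnet": 7,
    "cp4016.ulsfo.wmnet": 22,
    "cp4017.ulsfo.wmnet": 40,
    "cp4018.ulsfo.wmnet": 12,
    "cp4019.ulsfo.wmnet": 45,
    "cp4020.ulsfo.wmnet": 9,
}


def summarize(loss):
    """Return host count, min, max and mean of the per-host loss values."""
    values = list(loss.values())
    return {
        "hosts": len(values),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


SUMMARY = summarize(LOSS_PERCENT)
print(SUMMARY)
```

For the 16 hosts listed, this gives a minimum of 6, a maximum of 47 and a mean of about 28 percent for that hour.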

(In reply to christian from comment #0)

We did not see alerts on the udp2log pipeline.

That's wrong.
There have been alerts [1]:

[13:19:04] <icinga-wm>         PROBLEM - Packetloss_Average on erbium is CRITICAL: packet_loss_average CRITICAL: 13.2572885542
[13:27:37] <icinga-wm>         PROBLEM - Packetloss_Average on oxygen is CRITICAL: packet_loss_average CRITICAL: 25.0862913793
[13:29:40] <icinga-wm>         PROBLEM - Packetloss_Average on analytics1026 is CRITICAL: packet_loss_average CRITICAL: 14.6411538136
[13:32:00] <icinga-wm>         RECOVERY - Packetloss_Average on erbium is OK: packet_loss_average OKAY: 2.36820388235
[13:42:20] <icinga-wm>         RECOVERY - Packetloss_Average on analytics1026 is OK: packet_loss_average OKAY: 2.73679050847
[13:46:30] <icinga-wm>         RECOVERY - Packetloss_Average on oxygen is OK: packet_loss_average OKAY: 1.89986423729

[1] http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20141020.txt
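
To see how long each udp2log alert stayed open, the icinga-wm lines quoted above can be paired up per host. The following is a rough sketch only: the regular expression is fitted to the six lines in this comment (not to icinga output in general), it assumes all timestamps fall on the same day, and alert_durations is a name invented for this example.

```python
import re
from datetime import datetime

# The six icinga-wm lines quoted above, whitespace normalized.
LOG = """\
[13:19:04] <icinga-wm> PROBLEM - Packetloss_Average on erbium is CRITICAL: packet_loss_average CRITICAL: 13.2572885542
[13:27:37] <icinga-wm> PROBLEM - Packetloss_Average on oxygen is CRITICAL: packet_loss_average CRITICAL: 25.0862913793
[13:29:40] <icinga-wm> PROBLEM - Packetloss_Average on analytics1026 is CRITICAL: packet_loss_average CRITICAL: 14.6411538136
[13:32:00] <icinga-wm> RECOVERY - Packetloss_Average on erbium is OK: packet_loss_average OKAY: 2.36820388235
[13:42:20] <icinga-wm> RECOVERY - Packetloss_Average on analytics1026 is OK: packet_loss_average OKAY: 2.73679050847
[13:46:30] <icinga-wm> RECOVERY - Packetloss_Average on oxygen is OK: packet_loss_average OKAY: 1.89986423729
"""

# Pattern fitted to the lines above; an assumption, not a general icinga format.
LINE_RE = re.compile(
    r"^\[(?P<time>\d{2}:\d{2}:\d{2})\]\s+<icinga-wm>\s+"
    r"(?P<kind>PROBLEM|RECOVERY) - Packetloss_Average on (?P<host>\S+) "
    r"is \w+: packet_loss_average \w+: (?P<value>[\d.]+)$"
)


def alert_durations(log_text):
    """Map each host to the seconds between its PROBLEM and RECOVERY line."""
    opened = {}
    durations = {}
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a Packetloss_Average line
        stamp = datetime.strptime(match["time"], "%H:%M:%S")
        host = match["host"]
        if match["kind"] == "PROBLEM":
            opened[host] = stamp
        elif host in opened:
            delta = stamp - opened.pop(host)
            durations[host] = int(delta.total_seconds())
    return durations


DURATIONS = alert_durations(LOG)
for host, seconds in sorted(DURATIONS.items()):
    print(f"{host}: {seconds // 60}m{seconds % 60:02d}s")
```

On the lines above, the alerts stayed open for roughly 13 minutes on erbium and analytics1026 and roughly 19 minutes on oxygen.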