
Raw webrequest partitions for 2014-10-08T23:xx:xx not marked successful
Closed, Declined · Public

Description

For the hour 2014-10-08T23:xx:xx, the bits, text, and upload partitions [1]
were not marked successful.

What happened?

[1]


qchris@stat1002 jobs: 0 time: 10:55:54 // exit code: 0
cwd: ~/cluster-scripts
./dump_webrequest_status.sh

+---------------------+--------+--------+--------+--------+
| Date                |  bits  |  text  | mobile | upload |
+---------------------+--------+--------+--------+--------+

[...]

| 2014-10-08T21:xx:xx |    .   |    .   |    .   |    .   |    
| 2014-10-08T22:xx:xx |    .   |    .   |    .   |    .   |    
| 2014-10-08T23:xx:xx |    X   |    X   |    .   |    X   |    
| 2014-10-09T00:xx:xx |    .   |    .   |    .   |    .   |    
| 2014-10-09T01:xx:xx |    .   |    .   |    .   |    .   |

[...]

+---------------------+--------+--------+--------+--------+

Statuses:

. --> Partition is ok
X --> Partition is not ok (duplicates, missing, or nulls)
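
For reference, the ok/not-ok status boils down to a per-host sequence-number check: varnishkafka attaches an increasing sequence number to each request it sends, so duplicates and gaps can be counted per cache host. Below is a minimal Python sketch of that idea; the real check runs via dump_webrequest_status.sh against the raw data, so the record shape and function here are only illustrative.

from collections import defaultdict

def partition_status(records):
    """Classify one webrequest partition (hour) as ok / not ok.

    `records` is an iterable of (hostname, sequence) pairs, where
    `sequence` is the per-host counter that varnishkafka attaches to
    every request (None if the field could not be parsed).
    """
    seqs_by_host = defaultdict(list)
    nulls = 0
    for hostname, sequence in records:
        if sequence is None:
            nulls += 1
        else:
            seqs_by_host[hostname].append(sequence)

    duplicates = 0
    missing = 0
    for host, seqs in seqs_by_host.items():
        distinct = len(set(seqs))
        expected = max(seqs) - min(seqs) + 1   # sequences should be contiguous per host
        duplicates += len(seqs) - distinct      # same sequence seen more than once
        missing += expected - distinct          # gaps in the sequence range

    ok = duplicates == 0 and missing == 0 and nulls == 0
    return ok, duplicates, missing, nulls


# Example: host cp9999 delivered sequence 3 twice and never delivered 5.
ok, dup, miss, nulls = partition_status(
    [("cp9999", s) for s in (1, 2, 3, 3, 4, 6)]
)
print(ok, dup, miss, nulls)   # False 1 1 0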

Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=71879

Details

Reference
bz71876

Event Timeline

bzimport raised the priority of this task from to Needs Triage. Nov 22 2014, 3:59 AM
bzimport set Reference to bz71876.
bzimport added a subscriber: Unknown Object (MLST).

For bits and upload we saw both duplicates and missing.
For text we only saw duplicates.
The affected period was 23:02:00 -- 23:11:00.

It seems to have been a ulsfo glitch, as only ulsfo hosts were
affected.

In total ~5M duplicates and ~2M missing:

+---------+--------+-------------+-----------+
| cluster | host   | # duplicate | # missing |
+---------+--------+-------------+-----------+
| bits    | cp4001 |      183537 |         0 |
| bits    | cp4002 |      517220 |    218080 |
| bits    | cp4003 |      381150 |    143408 |
| bits    | cp4004 |      266275 |         0 |
| text    | cp4008 |       26116 |         0 |
| text    | cp4009 |      215291 |         0 |
| text    | cp4010 |      126667 |         0 |
| text    | cp4016 |        1904 |         0 |
| text    | cp4018 |      167577 |         0 |
| upload  | cp4005 |      592259 |    352364 |
| upload  | cp4006 |      581932 |    340563 |
| upload  | cp4007 |      507600 |    299497 |
| upload  | cp4013 |      592460 |    389971 |
| upload  | cp4014 |      408414 |     61688 |
| upload  | cp4015 |      560605 |    291017 |
+---------+--------+-------------+-----------+
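
The ~5M / ~2M figures follow from summing the two columns; a minimal Python sketch, with the per-host values copied from the table above:

# (cluster, host, duplicates, missing) copied from the table above
rows = [
    ("bits",   "cp4001", 183537,      0),
    ("bits",   "cp4002", 517220, 218080),
    ("bits",   "cp4003", 381150, 143408),
    ("bits",   "cp4004", 266275,      0),
    ("text",   "cp4008",  26116,      0),
    ("text",   "cp4009", 215291,      0),
    ("text",   "cp4010", 126667,      0),
    ("text",   "cp4016",   1904,      0),
    ("text",   "cp4018", 167577,      0),
    ("upload", "cp4005", 592259, 352364),
    ("upload", "cp4006", 581932, 340563),
    ("upload", "cp4007", 507600, 299497),
    ("upload", "cp4013", 592460, 389971),
    ("upload", "cp4014", 408414,  61688),
    ("upload", "cp4015", 560605, 291017),
]

duplicates = sum(r[2] for r in rows)
missing = sum(r[3] for r in rows)
print(f"duplicates: {duplicates:,}")            # ~5.1M
print(f"missing:    {missing:,}")               # ~2.1M
print(f"total off:  {duplicates + missing:,}")  # ~7.2M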

Since being ~7M off is within tolerance for our streams, I marked the
streams "ok" by hand.

On the Kafka brokers, the only thing that looked related was exceptions
like:

[2014-10-08 23:46:04,585] 1658280892 [kafka-request-handler-9] ERROR kafka.server.KafkaApis  - [KafkaApi-21] Error when processing fetch request for partition 
[webrequest_upload,5] offset 34174421023 from consumer with correlation id 2
kafka.common.OffsetOutOfRangeException: Request for offset 34174421023 but we only have log segments in the range 37690961900 to 40340044788.
        at kafka.log.Log.read(Log.scala:380)
        [...]

6 times on analytics1012 affecting each of

   webrequest_upload,3
   webrequest_upload,7
   webrequest_upload,11
twice.

6 times on analytics1018 affecting each of

   webrequest_upload,0
   webrequest_upload,4
   webrequest_upload,8
twice.

6 times on analytics1021 affecting each of

   webrequest_upload,1
   webrequest_upload,5
   webrequest_upload,9
twice.

6 times on analytics1022 affecting each of

   webrequest_upload,2
   webrequest_upload,6
   webrequest_upload,10
twice.

All those 24 exceptions were around 23:46.
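
That per-broker, per-partition tally comes from counting the error lines in each broker's log; below is a minimal Python sketch of such a count, assuming the message sits on a single line in the log file as in the excerpt above (the file name is illustrative):

import re
from collections import Counter

# Matches the message format from the excerpt above, e.g.
#   ... ERROR kafka.server.KafkaApis  - [KafkaApi-21] Error when processing
#   fetch request for partition [webrequest_upload,5] offset ...
LINE_RE = re.compile(
    r"Error when processing fetch request for partition\s+\[([^\]]+)\]"
)

def count_fetch_errors(path):
    """Count fetch-error lines (e.g. OffsetOutOfRangeException) per topic partition."""
    counts = Counter()
    with open(path) as log:
        for line in log:
            match = LINE_RE.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts

# Run once per broker host; the file name is illustrative.
for partition, n in sorted(count_fetch_errors("kafka-server.log").items()):
    print(f"{partition}: {n}")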

Checking the affected caches in Ganglia, I noticed that some readings
are missing around that time.

SAL did not show anything relevant, but the #wikimedia-operations
channel had:

[22:59:55] <icinga-wm> PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Transit: ! Telia [10Gbps DF]BR
[...]
[23:00:55] <icinga-wm> PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[...]
[23:03:04] <icinga-wm> PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0]
[23:03:55] <icinga-wm> PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[23:05:16] <icinga-wm> RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 9, down: 0, shutdown: 0
[...]
[23:07:04] <icinga-wm> PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Transit: ! Telia [10Gbps DF]BR
[...]
[23:10:14] <icinga-wm> PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[23:12:05] <icinga-wm> PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[...]
[23:15:05] <icinga-wm> PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Transit: ! Telia [10Gbps DF]BR
[23:15:05] <icinga-wm> RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 9, down: 0, shutdown: 0
[23:16:14] <icinga-wm> PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: puppet fail
[...]
[23:16:45] <icinga-wm> PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: puppet fail
[23:17:45] <icinga-wm> PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: puppet fail
[...]
[23:18:54] <icinga-wm> PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: puppet fail
[...]
[23:19:05] <icinga-wm> PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:19:26] <icinga-wm> PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 1 failures
[...]
[23:23:56] <icinga-wm> RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures

Afterwards, services recovered.
So it looks like a general ulsfo issue.