
Raw webrequest partitions for 2014-10-08T23:xx:xx not marked successful
Closed, Declined · Public

Description

For the hour 2014-10-08T23:xx:xx, the bits, text, and upload partitions [1]
were not marked successful.

What happened?

[1]


qchris@stat1002 jobs: 0 time: 10:55:54 // exit code: 0
cwd: ~/cluster-scripts
./dump_webrequest_status.sh

+---------------------+--------+--------+--------+--------+
| Date                |  bits  |  text  | mobile | upload |
+---------------------+--------+--------+--------+--------+

[...]

| 2014-10-08T21:xx:xx |    .   |    .   |    .   |    .   |    
| 2014-10-08T22:xx:xx |    .   |    .   |    .   |    .   |    
| 2014-10-08T23:xx:xx |    X   |    X   |    .   |    X   |    
| 2014-10-09T00:xx:xx |    .   |    .   |    .   |    .   |    
| 2014-10-09T01:xx:xx |    .   |    .   |    .   |    .   |

[...]

+---------------------+--------+--------+--------+--------+

Statuses:

. --> Partition is ok
X --> Partition is not ok (duplicates, missing, or nulls)
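
For reference, the ok/not-ok status boils down to a per-host sequence-number check: varnishkafka attaches an increasing sequence number to each request it sends, so duplicates and gaps can be counted per cache host. Below is a minimal Python sketch of that idea; the real check runs via dump_webrequest_status.sh against the raw data, so the record shape and function here are only illustrative.

from collections import defaultdict

def partition_status(records):
    """Classify one webrequest partition (hour) as ok / not ok.

    `records` is an iterable of (hostname, sequence) pairs, where
    `sequence` is the per-host counter that varnishkafka attaches to
    every request (None if the field could not be parsed).
    """
    seqs_by_host = defaultdict(list)
    nulls = 0
    for hostname, sequence in records:
        if sequence is None:
            nulls += 1
        else:
            seqs_by_host[hostname].append(sequence)

    duplicates = 0
    missing = 0
    for host, seqs in seqs_by_host.items():
        distinct = len(set(seqs))
        expected = max(seqs) - min(seqs) + 1   # sequences should be contiguous per host
        duplicates += len(seqs) - distinct      # same sequence seen more than once
        missing += expected - distinct          # gaps in the sequence range

    ok = duplicates == 0 and missing == 0 and nulls == 0
    return ok, duplicates, missing, nulls


# Example: host cp9999 delivered sequence 3 twice and never delivered 5.
ok, dup, miss, nulls = partition_status(
    [("cp9999", s) for s in (1, 2, 3, 3, 4, 6)]
)
print(ok, dup, miss, nulls)   # False 1 1 0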

Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=71879

Details

Reference
bz71876

Event Timeline

bzimport raised the priority of this task from to Needs Triage. Nov 22 2014, 3:59 AM
bzimport set Reference to bz71876.
bzimport added a subscriber: Unknown Object (MLST).

For bits and upload we saw both duplicates and missing.
For text we only saw duplicates.
The affected period was 23:02:00 -- 23:11:00.

It seems to have been a ulsfo glitch, as only ulsfo hosts were
affected.

In total ~5M duplicates and ~2M missing:

+---------+--------+-------------+-----------+
| cluster | host   | # duplicate | # missing |
+---------+--------+-------------+-----------+
| bits    | cp4001 |      183537 |         0 |
| bits    | cp4002 |      517220 |    218080 |
| bits    | cp4003 |      381150 |    143408 |
| bits    | cp4004 |      266275 |         0 |
| text    | cp4008 |       26116 |         0 |
| text    | cp4009 |      215291 |         0 |
| text    | cp4010 |      126667 |         0 |
| text    | cp4016 |        1904 |         0 |
| text    | cp4018 |      167577 |         0 |
| upload  | cp4005 |      592259 |    352364 |
| upload  | cp4006 |      581932 |    340563 |
| upload  | cp4007 |      507600 |    299497 |
| upload  | cp4013 |      592460 |    389971 |
| upload  | cp4014 |      408414 |     61688 |
| upload  | cp4015 |      560605 |    291017 |
+---------+--------+-------------+-----------+
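
The ~5M / ~2M figures follow from summing the two columns; a minimal Python sketch, with the per-host values copied from the table above:

# (cluster, host, duplicates, missing) copied from the table above
rows = [
    ("bits",   "cp4001", 183537,      0),
    ("bits",   "cp4002", 517220, 218080),
    ("bits",   "cp4003", 381150, 143408),
    ("bits",   "cp4004", 266275,      0),
    ("text",   "cp4008",  26116,      0),
    ("text",   "cp4009", 215291,      0),
    ("text",   "cp4010", 126667,      0),
    ("text",   "cp4016",   1904,      0),
    ("text",   "cp4018", 167577,      0),
    ("upload", "cp4005", 592259, 352364),
    ("upload", "cp4006", 581932, 340563),
    ("upload", "cp4007", 507600, 299497),
    ("upload", "cp4013", 592460, 389971),
    ("upload", "cp4014", 408414,  61688),
    ("upload", "cp4015", 560605, 291017),
]

duplicates = sum(r[2] for r in rows)
missing = sum(r[3] for r in rows)
print(f"duplicates: {duplicates:,}")            # ~5.1M
print(f"missing:    {missing:,}")               # ~2.1M
print(f"total off:  {duplicates + missing:,}")  # ~7.2M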

Since being ~7M off is within tolerance for our streams, I marked the
streams "ok" by hand.

On the Kafka brokers, the only thing that looked related was exceptions
like:

[2014-10-08 23:46:04,585] 1658280892 [kafka-request-handler-9] ERROR kafka.server.KafkaApis  - [KafkaApi-21] Error when processing fetch request for partition 
[webrequest_upload,5] offset 34174421023 from consumer with correlation id 2
kafka.common.OffsetOutOfRangeException: Request for offset 34174421023 but we only have log segments in the range 37690961900 to 40340044788.
        at kafka.log.Log.read(Log.scala:380)
        [...]

6 times on analytics1012 affecting each of

   webrequest_upload,3
   webrequest_upload,7
   webrequest_upload,11
twice.

6 times on analytics1018 affecting each of

   webrequest_upload,0
   webrequest_upload,4
   webrequest_upload,8
twice.

6 times on analytics1021 affecting each of

   webrequest_upload,1
   webrequest_upload,5
   webrequest_upload,9
twice.

6 times on analytics1022 affecting each of

   webrequest_upload,2
   webrequest_upload,6
   webrequest_upload,10
twice.

All those 24 exceptions were around 23:46.
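
That per-broker, per-partition tally comes from counting the error lines in each broker's log; below is a minimal Python sketch of such a count, assuming the message sits on a single line in the log file as in the excerpt above (the file name is illustrative):

import re
from collections import Counter

# Matches the message format from the excerpt above, e.g.
#   ... ERROR kafka.server.KafkaApis  - [KafkaApi-21] Error when processing
#   fetch request for partition [webrequest_upload,5] offset ...
LINE_RE = re.compile(
    r"Error when processing fetch request for partition\s+\[([^\]]+)\]"
)

def count_fetch_errors(path):
    """Count fetch-error lines (e.g. OffsetOutOfRangeException) per topic partition."""
    counts = Counter()
    with open(path) as log:
        for line in log:
            match = LINE_RE.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts

# Run once per broker host; the file name is illustrative.
for partition, n in sorted(count_fetch_errors("kafka-server.log").items()):
    print(f"{partition}: {n}")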

Checking the affected caches in Ganglia, I noticed that some readings
are missing around that time.

SAL did not show anything relevant, but the #wikimedia-operations
channel had:

[22:59:55] <icinga-wm> PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Transit: ! Telia [10Gbps DF]BR
[...]
[23:00:55] <icinga-wm> PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[...]
[23:03:04] <icinga-wm> PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0]
[23:03:55] <icinga-wm> PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[23:05:16] <icinga-wm> RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 9, down: 0, shutdown: 0
[...]
[23:07:04] <icinga-wm> PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Transit: ! Telia [10Gbps DF]BR
[...]
[23:10:14] <icinga-wm> PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[23:12:05] <icinga-wm> PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[...]
[23:15:05] <icinga-wm> PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Transit: ! Telia [10Gbps DF]BR
[23:15:05] <icinga-wm> RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 9, down: 0, shutdown: 0
[23:16:14] <icinga-wm> PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: puppet fail
[...]
[23:16:45] <icinga-wm> PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: puppet fail
[23:17:45] <icinga-wm> PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: puppet fail
[...]
[23:18:54] <icinga-wm> PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: puppet fail
[...]
[23:19:05] <icinga-wm> PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:19:26] <icinga-wm> PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 1 failures
[...]
[23:23:56] <icinga-wm> RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures

Afterwards, services recovered.
So it looks like a general ulsfo issue.