
Several raw webrequest partitions not marked successful between 2014-10-13T13:xx:xx and 2014-10-13T22:xx:xx
Closed, Declined
Public

Description

Between 2014-10-13T13:xx:xx and 2014-10-13T22:xx:xx, several
partitions were not marked successful [1]. It seems bits was most
affected, followed by upload and, to a lesser extent, text and mobile.

What happened?

[1]


qchris@stat1002 jobs: 0 time: 11:07:47 // exit code: 0
cwd: ~
cluster-scripts/dump_webrequest_status.sh

+---------------------+--------+--------+--------+--------+
| Date                |  bits  |  text  | mobile | upload |
+---------------------+--------+--------+--------+--------+

[...]

| 2014-10-13T11:xx:xx |    .   |    .   |    .   |    .   |    
| 2014-10-13T12:xx:xx |    .   |    .   |    .   |    .   |    
| 2014-10-13T13:xx:xx |    X   |    X   |    X   |    X   |    
| 2014-10-13T14:xx:xx |    .   |    .   |    .   |    .   |    
| 2014-10-13T15:xx:xx |    X   |    .   |    .   |    .   |    
| 2014-10-13T16:xx:xx |    X   |    .   |    .   |    .   |    
| 2014-10-13T17:xx:xx |    X   |    .   |    .   |    .   |    
| 2014-10-13T18:xx:xx |    X   |    .   |    .   |    .   |    
| 2014-10-13T19:xx:xx |    X   |    .   |    .   |    X   |    
| 2014-10-13T20:xx:xx |    X   |    .   |    .   |    X   |    
| 2014-10-13T21:xx:xx |    X   |    .   |    .   |    X   |    
| 2014-10-13T22:xx:xx |    .   |    .   |    .   |    .   |    
| 2014-10-13T23:xx:xx |    .   |    .   |    .   |    .   |

[...]

+---------------------+--------+--------+--------+--------+

Statuses:

. --> Partition is ok
X --> Partition is not ok (duplicates, missing, or nulls)

pass cluster-scripts/dump_webrequest_status.sh
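
As a rough cross-check of the "bits was most affected" reading, the rows shown above can be tallied per cache. This is only a sketch: the rows are copied from the table in this report, and the helper `failed_hours_per_cache` is a hypothetical illustration, not part of dump_webrequest_status.sh.

```python
# Rows copied from the status table in this report (elided "[...]" rows
# are not included, so the counts only cover the hours shown).
TABLE = """\
| Date                |  bits  |  text  | mobile | upload |
| 2014-10-13T11:xx:xx |    .   |    .   |    .   |    .   |
| 2014-10-13T12:xx:xx |    .   |    .   |    .   |    .   |
| 2014-10-13T13:xx:xx |    X   |    X   |    X   |    X   |
| 2014-10-13T14:xx:xx |    .   |    .   |    .   |    .   |
| 2014-10-13T15:xx:xx |    X   |    .   |    .   |    .   |
| 2014-10-13T16:xx:xx |    X   |    .   |    .   |    .   |
| 2014-10-13T17:xx:xx |    X   |    .   |    .   |    .   |
| 2014-10-13T18:xx:xx |    X   |    .   |    .   |    .   |
| 2014-10-13T19:xx:xx |    X   |    .   |    .   |    X   |
| 2014-10-13T20:xx:xx |    X   |    .   |    .   |    X   |
| 2014-10-13T21:xx:xx |    X   |    .   |    .   |    X   |
| 2014-10-13T22:xx:xx |    .   |    .   |    .   |    .   |
| 2014-10-13T23:xx:xx |    .   |    .   |    .   |    .   |
"""


def failed_hours_per_cache(table):
    """Count the hours marked 'X' (not ok) for each cache column."""
    rows = [line for line in table.splitlines() if line.startswith("|")]
    # The first row is the header: Date, then one column per cache.
    header = [cell.strip() for cell in rows[0].strip().strip("|").split("|")]
    caches = header[1:]
    counts = {name: 0 for name in caches}
    for row in rows[1:]:
        cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
        for name, status in zip(caches, cells[1:]):
            if status == "X":
                counts[name] += 1
    return counts


print(failed_hours_per_cache(TABLE))
```

For the hours shown, this gives 8 failed hours for bits, 4 for upload, and 1 each for text and mobile.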


Version: unspecified
Severity: normal

Details

Reference
bz72028

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:52 AM
bzimport set Reference to bz72028.
bzimport added a subscriber: Unknown Object (MLST).

For 2014-10-13T13:xx:xx, it affected all caches except

cp1056.eqiad.wmnet (bits)
cp1057.eqiad.wmnet (bits)
cp3019.esams.wikimedia.org (bits)
cp3020.esams.wikimedia.org (bits)

(These are exactly the machines that saw the ACK experiments [1],
and we did not see missing log lines for any of them.)

For that hour, we saw no duplicates, but intermittent loss between
2014-10-13T13:37:15 and 2014-10-13T13:38:16. Per cache, the loss is worth:

bits <1 second
text <2 seconds
mobile <2 seconds
upload <1 second

This nicely matches the dropout of analytics1021 from its partition leader role [2].

I marked the 2014-10-13T13:xx:xx partitions as ok.

[1] https://git.wikimedia.org/blob/operations%2Fpuppet.git/ccc17ce0780f6c56ddcac4f4dcd9f90b2dc0d346/manifests%2Frole%2Fcache.pp#L510
[2] https://bugzilla.wikimedia.org/show_bug.cgi?id=69667#c14

The failed partitions between 2014-10-13T15:xx:xx and 2014-10-13T21:xx:xx
were all exclusively from esams caches.
Hence, filing under the esams bug.

(Since this is also about analytics1021 dropping out of its leader role,
it also blocks on bug 69667.)

kevinator set Security to None.