
Raw webrequest partitions for 2014-10-20T02:xx:xx not marked successful
Closed, DeclinedPublic

Description

For the hour 2014-10-20T02:xx:xx, none [1] of the four sources'
buckets was marked successful.

What happened?

[1]


qchris@stat1002 jobs: 0 time: 10:29:22 // exit code: 0
cwd: ~
~/cluster-scripts/dump_webrequest_status.sh

+---------------------+--------+--------+--------+--------+
| Date                |  bits  | mobile |  text  | upload |
+---------------------+--------+--------+--------+--------+

[...]

| 2014-10-20T00:xx:xx |    .   |    .   |    .   |    .   |    
| 2014-10-20T01:xx:xx |    .   |    .   |    .   |    .   |    
| 2014-10-20T02:xx:xx |    X   |    X   |    X   |    X   |    
| 2014-10-20T03:xx:xx |    .   |    .   |    .   |    .   |    
| 2014-10-20T04:xx:xx |    .   |    .   |    .   |    .   |

[...]

+---------------------+--------+--------+--------+--------+

Statuses:

. --> Partition is ok
M --> Partition manually marked ok
X --> Partition is not ok (duplicates, missing, or nulls)

pass /home/qchris/cluster-scripts/dump_webrequest_status.sh
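
To pull the failing hours out of a dump like the one above automatically, something along these lines might work. It is only a minimal sketch: the function name `failed_partitions` is made up for illustration, and it assumes the table layout and the status legend shown above (an `X` cell means the partition is not ok). It is not part of dump_webrequest_status.sh.

```python
import re

# Status character taken from the legend above: X --> Partition is not ok.
NOT_OK = "X"

# Matches table rows such as "| 2014-10-20T02:xx:xx |    X   | ... |".
# Border lines ("+-----+") and "[...]" markers do not match and are skipped.
ROW = re.compile(r"^\|\s*(\S+)\s*\|(.*)\|\s*$")


def failed_partitions(table_text, sources=("bits", "mobile", "text", "upload")):
    """Return {hour: [sources not ok]} for every row with at least one X.

    Hypothetical helper; the column order is assumed to match `sources`.
    """
    failed = {}
    for line in table_text.splitlines():
        match = ROW.match(line)
        if not match or match.group(1) == "Date":
            continue  # not a data row (border, header, or elision marker)
        hour = match.group(1)
        cells = [cell.strip() for cell in match.group(2).split("|")]
        bad = [src for src, cell in zip(sources, cells) if cell == NOT_OK]
        if bad:
            failed[hour] = bad
    return failed


sample = """\
+---------------------+--------+--------+--------+--------+
| Date                |  bits  | mobile |  text  | upload |
+---------------------+--------+--------+--------+--------+
| 2014-10-20T01:xx:xx |    .   |    .   |    .   |    .   |
| 2014-10-20T02:xx:xx |    X   |    X   |    X   |    X   |
| 2014-10-20T03:xx:xx |    .   |    .   |    .   |    .   |
+---------------------+--------+--------+--------+--------+
"""

print(failed_partitions(sample))
# prints {'2014-10-20T02:xx:xx': ['bits', 'mobile', 'text', 'upload']}
```

Partitions marked `M` (manually marked ok) are treated as ok here, in line with the legend.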


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=72295

Details

Reference
bz72252

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:50 AM
bzimport set Reference to bz72252.
bzimport added a subscriber: Unknown Object (MLST).

It seems that somewhere between 2014-10-20T02:05:00 and
2014-10-20T02:12:00 analytics1021 again got kicked out of its
partition leader role.

I have now run leader elections, so analytics1021 is ready to help
with esams bits this evening.

Judging from the logs between 2014-10-20T02:05:08 and 2014-10-20T02:05:16,
less than 2 seconds' worth of data got lost.

It's noteworthy that we again did not see loss for the hosts that we
tuned the ACKs for. So I think we should move forward to roll out the
ACK experiment to more hosts, so we can get rid of issues when
analytics1021 drops out of its leader role again.

(In reply to christian from comment #2)

So I think we should move forward to roll out the
ACK experiment to more hosts, so we can get rid of issues when
analytics1021 drops out of its leader role again.

Patches to roll out the ACK experiment got uploaded to gerrit

https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:kafka-acks,n,z

(for not yet merged parts) and have been linked to bug 69667.