
Duplicates/missing logs from esams bits for 2014-09-28T{18,19,20}:xx:xx
Closed, Declined · Public

Description

Between 2014-09-28T18:31:10 and 2014-09-28T20:06:34, all esams bits
caches saw both duplicate and missing lines.

Looking at the Ganglia graphs, it seems we'll see the same issue again
today (2014-09-29).

While the issue was going on today, there was a discussion
about it in IRC [1].

It is not clear what happened.

The current theory is that, due to recent config changes around
varnishkafka, esams bits traffic can no longer be handled by 3
brokers (we're currently using only 3 of the 4 brokers).

[1] Starting at 19:04:03 at
http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-analytics/20140929.txt
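
For illustration of the failure mode behind that theory: varnishkafka hands
each log line to librdkafka, which buffers it locally and reports a delivery
error (counted as drerr) for anything it cannot deliver in time; when the
local buffer itself fills up, new lines are rejected. The Python sketch below
(using confluent-kafka, another librdkafka wrapper) only illustrates this
mechanism; the broker names, topic name, and limits are hypothetical, not the
production configuration.

```python
# Illustration only: how a librdkafka-based producer (varnishkafka embeds
# librdkafka) surfaces overload. Brokers, topic and limits are hypothetical.
from confluent_kafka import Producer

drerr = 0  # counterpart of varnishkafka's kafka_drerr counter in this sketch

def on_delivery(err, msg):
    """librdkafka calls this per message; err is set e.g. on delivery timeout."""
    global drerr
    if err is not None:
        drerr += 1

producer = Producer({
    'bootstrap.servers': 'broker1:9092,broker2:9092,broker3:9092',  # hypothetical
    'queue.buffering.max.messages': 100000,  # size of the local buffer
    'message.timeout.ms': 60000,             # undeliverable after 60s -> delivery error
})

def send(line):
    """Queue one log line; count it as lost if the local buffer is already full."""
    global drerr
    try:
        producer.produce('webrequest_bits', line, callback=on_delivery)  # topic name assumed
    except BufferError:
        # Brokers are not draining the buffer fast enough; this line is lost.
        drerr += 1
    producer.poll(0)  # serve pending delivery callbacks
```

If the three brokers cannot drain the buffer as fast as esams bits fills it,
both symptoms above follow: delivery errors accumulate and lines go missing.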


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=71882
https://bugzilla.wikimedia.org/show_bug.cgi?id=71881

Details

Reference
bz71435

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:48 AM
bzimport set Reference to bz71435.
bzimport added a subscriber: Unknown Object (MLST).

(In reply to christian from comment #0)

> Looking at the Ganglia graphs, it seems we'll see the same issue again
> today (2014-09-29).

Yes, we did.
The affected period is 2014-09-29T18:41:48--2014-09-29T19:55:21.
Again it affected all esams bits caches, and only those.
Again, both duplicate and missing lines.

Ottomata restarted varnishkafka on cp3019 at 19:41, and cp3019
recovered immediately: its queues went back to normal and no longer
turned critical, and there were no more losses on cp3019.
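
A quick way to check the "queues back to normal" part is to look at the
statistics varnishkafka writes out, assuming they contain librdkafka's
statistics JSON with its per-broker outbuf_cnt (messages still waiting to be
sent). The path and JSON layout in this Python sketch are assumptions, not a
verified description of the production setup.

```python
# Print each broker's outbuf_cnt from the most recent statistics line.
# Assumption: varnishkafka writes librdkafka's statistics JSON (which has a
# per-broker "outbuf_cnt") to this file, one JSON object per line.
import json

STATS_FILE = '/var/cache/varnishkafka/varnishkafka.stats.json'  # assumed path

def last_outbuf_counts(path=STATS_FILE):
    last = None
    with open(path) as f:
        for line in f:
            if line.strip():
                last = line
    stats = json.loads(last)
    brokers = stats.get('brokers', {})
    return {name: b.get('outbuf_cnt', 0) for name, b in brokers.items()}

if __name__ == '__main__':
    for broker, outbuf in sorted(last_outbuf_counts().items()):
        print('%s: outbuf_cnt=%d' % (broker, outbuf))
```

After a restart like the one on cp3019, outbuf_cnt should drop back to near
zero and stay there.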

This nicely matches yesterday's theory that esams bits traffic spikes are
above what 3 brokers can handle.

It happened again for the 5 bits partitions from
2014-10-14T16:xx:xx up to and including 2014-10-14T20:xx:xx.
Again only esams bits.

Since I was around when it happened, and historical Ganglia graphs
don't expose this: the Kafka drerrs were not constant. They came in
intervals roughly 25 minutes long, during which they grew, died off
again, and then stayed off for the rest of the interval.
(See attachment kafka.varnishkafka.kafka_drerr.per_second-2014-10-15.png)

All affected caches showed this ~25-minute pattern,
but the pattern was not synchronous across machines.

While the drerrs showed this pattern, outbuf_cnt did not: it was high
the whole time.
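
To make the on/off pattern concrete: the per_second metric is just the rate
of change of the cumulative drerr counter, so the bursts can be found by
turning counter samples into rates and grouping the above-zero stretches.
The samples in this Python sketch are made up; only the method is meant.

```python
# Derive per-second drerr rates from cumulative counter samples and group
# them into bursts, as a way to spot the ~25-minute on/off pattern.
# The sample values below are hypothetical.
from datetime import datetime

# (timestamp, cumulative drerr count) samples, e.g. scraped from Ganglia
samples = [
    (datetime(2014, 10, 15, 18, 0), 0),
    (datetime(2014, 10, 15, 18, 5), 12000),
    (datetime(2014, 10, 15, 18, 10), 30000),
    (datetime(2014, 10, 15, 18, 15), 30050),
    (datetime(2014, 10, 15, 18, 25), 30050),  # burst died off, counter stays flat
    (datetime(2014, 10, 15, 18, 30), 45000),  # next burst begins
]

def per_second_rates(samples):
    """Turn cumulative counter samples into (timestamp, rate per second) pairs."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        dt = (t1 - t0).total_seconds()
        rates.append((t1, max(c1 - c0, 0) / dt))
    return rates

def bursts(rates, threshold=1.0):
    """Group consecutive above-threshold rates into (start, end) bursts."""
    out, start = [], None
    for t, r in rates:
        if r > threshold and start is None:
            start = t
        elif r <= threshold and start is not None:
            out.append((start, t))
            start = None
    if start is not None:
        out.append((start, rates[-1][0]))
    return out

for start, end in bursts(per_second_rates(samples)):
    print(start, '->', end, 'duration', end - start)
```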

Created attachment 16773

Attached:
kafka.varnishkafka.kafka_drerr.per_second-2014-10-15.png (374×747 px, 41 KB)

It happened again from 2014-10-16T17:xx:xx up to and including 2014-10-16T19:xx:xx.

kevinator set Security to None.