
Kafka broker analytics1021 not receiving messages every now and then
Closed, Resolved · Public

Description

Every few days, analytics1021 gets kicked out of its partition leader
role.

The last few occurrences are:
2014-08-06: bug 69244
2014-08-13: bug 69665
2014-08-16: bug 69666


Version: unspecified
Severity: normal

Details

Reference
bz69667


Event Timeline

bzimport raised the priority of this task to Needs Triage. (Nov 22 2014, 3:38 AM)
bzimport set Reference to bz69667.
bzimport added a subscriber: Unknown Object (MLST).

Thanks for filing this bug, Christian.

IMHO, this has been a fairly regular problem with this host and attempts to reimage/fix have not been successful. Andrew/Gage -- can you update?

otto wrote:

The core of this issue is a timeout of the Zookeeper connection, which neither Gage nor I have been able to solve.

Quick summary: Kafka brokers need to maintain a live connection with Zookeeper in order to remain in the ISR. Brokers set a timeout. If a broker can't talk to Zookeeper within this timeout, it will close the connection it has and attempt to open a new one, most likely to a different Zookeeper host than it had before. Zookeeper notices when it closes this connection, and then tells all of the other brokers that this broker has left the ISR. Its leadership for any partitions is then demoted. It usually takes the broker less than a second to reconnect to another Zookeeper host and rejoin the cluster. Thus far, when this happens, Gage or I have logged in and manually started a preferred replica election. This brings the offending broker's leadership status back to normal.
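For reference, the manual step is roughly the following (a sketch assuming the stock Kafka 0.8 command-line tooling; the script path and zookeeper connect string are illustrative, though the /kafka/eqiad chroot does appear in the logs later in this task):

    # Trigger a preferred replica election for all partitions so the
    # recovered broker takes its partition leadership back.
    kafka-preferred-replica-election.sh \
        --zookeeper zookeeper.example.eqiad.wmnet:2181/kafka/eqiad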

This would only be an annoyance if it weren't for the small varnishkafka hiccups it causes. Ideally, when a broker loses partition leadership, producers would be notified of the metadata change quickly enough that their buffers don't fill up. We have noticed that for some higher volume partitions (upload and/or bits), some varnishkafkas drop messages during the short time that leadership metadata is being propagated.

Action items:

  • Solve zookeeper timeout. Not sure how to replicate or do this right now.
  • Keep varnishkafka from dropping messages on metadata change. There is likely some tuning we can do to make sure we don't drop messages when partition leadership changes.
  • Investigate auto partition balancing. This is a new feature in Kafka 0.8.1. This would eliminate the manual step of starting a preferred replica election. This won't solve either of the above problems, but would allow the cluster to rebalance itself without manual intervention when this happens (see the config sketch below).
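For the last item, the relevant broker settings would look roughly like this (a minimal sketch based on Kafka 0.8.1 server.properties options and their documented defaults, not on what is actually deployed):

    # Let the controller move leadership back to the preferred replica
    # automatically instead of requiring a manual election.
    auto.leader.rebalance.enable=true
    # How often the imbalance check runs, and how much skew is tolerated
    # before a rebalance is triggered.
    leader.imbalance.check.interval.seconds=300
    leader.imbalance.per.broker.percentage=10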

See also:
https://rt.wikimedia.org/Ticket/Display.html?id=6877

otto wrote:

Today I increased kafka.queue.buffering.max.ms from 1 second to 5 seconds. Leader election can take up to 3 seconds. Hopefully this will solve the second bullet point above. We'll have to wait until the next time a leader election happens and see if we miss any data from it.
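In varnishkafka.conf terms that change corresponds roughly to the following (a sketch; only the buffering setting is confirmed above, the retry lines are illustrative librdkafka options):

    # Hold messages in the producer queue for up to 5 seconds, so that a
    # ~3 second leader election can finish before the buffer overflows.
    kafka.queue.buffering.max.ms = 5000
    # Illustrative: retry failed produce requests instead of dropping them.
    kafka.message.send.max.retries = 3
    kafka.retry.backoff.ms = 100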

It happened again on:

2014-08-24 ~16:00 (with recovery for the next reading in ganglia. Since ganglia shows a decrease of volume for that time, it might align with the kafkatee changes on analytics1003. So it might actually be an artifact.)
2014-08-30 ~03:30 (with recovery ~06:00)
2014-09-02 ~10:50 (not yet recovered)

(I haven't yet checked if those instances also caused packet loss)

(In reply to christian from comment #4)

2014-08-24 ~16:00 (with recovery for the next reading in

Wrong day; that should have been 2014-08-25 ~16:00. Sorry.

(In reply to Andrew Otto from comment #3)

Today I increased kafka.queue.buffering.max.ms from 1 second to 5 seconds.
Leader election can take up to 3 seconds. Hopefully this will solve the
second bullet point above. We'll have to wait until the next time a leader
election happens and see if we miss any data from it.

The re-elections from

2014-08-30 ~03:30
2014-09-02 ~10:50

(see comment #5) happened after the buffer increase, but both still exhibit missing sequence numbers. So bumping the buffering time did not do the trick :-(

otto wrote:

I have done some more sleuthing on the zookeeper timeouts today, and I want to record my thoughts.

  • The timeout can happen when the broker is connected to any zookeeper.
  • Timeouts happen more often than timeout induced leader changes. That is, a Zookeeper session timeout does not necessarily mean produce errors.
  • When there are timeout induced leader changes, they seem to be caused by the zookeeper leader expiring a zookeeper session, rather than by a broker reconnecting to zookeeper because of a timeout. That is, both the zookeeper server and the kafka broker (zookeeper client) set a timeout of 16 seconds. (I believe they negotiate to the lower timeout if the settings don't match; see the config sketch below.) If kafka notices the timeout first, it will just close the connection and reconnect with the same session id. If the timeout happens to be noticed by zookeeper before kafka, then zookeeper expires the session. Kafka's own timeout will trigger after the session has been expired by zookeeper, and when it attempts to reconnect, it will be told it has an expired session, which causes it to have to reconnect a second time to ask for a new session.
  • This also seems related to timeouts occurring when the broker (usually analytics1021) is also the controller for the kafka cluster. (The Controller is the one in charge of intra-cluster things like ISRs and leaders...I think). As far as I can tell, all of the timeout induced leader changes that we see are also accompanied by (on broker)

    kafka.controller.KafkaController$SessionExpirationListener - [SessionExpirationListener on 21], ZK expired; shut down all controller components and try to re-elect

and (on zookeeper leader)

[ProcessThread:-1:PrepRequestProcessor@419] - Got user-level KeeperException when processing sessionid:0x46fd72a6d4243f type:create cxid:0x1 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:/kafka/eqiad/controller Error:KeeperErrorCode = NodeExists for /kafka/eqiad/controller

error messages. That is, when the timed-out broker finally is able to reconnect (with a new session id), it attempts to re-register its previous controller status with zookeeper, only to find that another broker has taken over as controller.
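For reference, the two timeouts being negotiated live in different places; roughly (a sketch using the 16 second value mentioned above, with the zookeeper-side bounds shown as upstream defaults rather than the deployed values):

    # Kafka broker (server.properties): the session timeout the client asks for.
    zookeeper.session.timeout.ms=16000

    # Zookeeper server (zoo.cfg): the requested timeout is clamped into
    # [minSessionTimeout, maxSessionTimeout], which default to 2*tickTime
    # and 20*tickTime respectively.
    tickTime=2000
    #minSessionTimeout=4000
    #maxSessionTimeout=40000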

I just went back to my email thread to the Kafka Users group and noticed that someone had replied[1] to my sleuthing about broker GCs back in July! I never saw this email response! Gah! I will look into this suggestion tomorrow.

[1] http://mail-archives.apache.org/mod_mbox/kafka-users/201407.mbox/%3CCAFbh0Q2f71qgs5JDNFxkm7SSdZyYMH=ZpEOxotuEQfKqeXQHfw@mail.gmail.com%3E

Adding some more records:

To rule out temporary basic network problems, I started a script to
ping all zookeepers from analytics1021 in an endless loop.
That produced results already:

There was no output for ~10 seconds between 15:26:04 and 15:26:14

But RTT times were low before and afterwards. Since we hardly have any
data to compare against, I collected the analytics1021 ganglia graphs
that showed an exceptional increase or decrease during that period, and
added the relevant parts of the kafka log and kafka gc log.
(See attachment analytics1021-2014-09-04T15-26-03.tar.gz)

It seems most disks spiked.
kafka log looks harmless.
kafka gc log looks harmless too.


Since both the pings and kafka gc stopped during that time, it might be
that this is not network related, but rather the machine itself
preempting processes and blocking them for some time.

So I started another monitor that (in an endless loop) logs the
timestamp, and then waits a second.
That already produced relevant results too:

There was no output for ~13 seconds between 20:47:18 and 20:47:31.

Again most disks spiked.
kafka log looks harmless.
kafka gc log looks harmless too.
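A minimal sketch of the two monitors described above (bash; the host names and log paths are hypothetical):

    # Monitor 1: ping every zookeeper host once a second to catch network stalls.
    while true; do
        for zk in zookeeper1001 zookeeper1002 zookeeper1003; do  # hypothetical hosts
            ping -c 1 -W 1 "$zk" | grep 'time='
        done
        sleep 1
    done >> /tmp/zk-ping.log

    # Monitor 2: log a timestamp once a second; a gap in the log means this
    # process was not scheduled for that long.
    while true; do date '+%F %T'; sleep 1; done >> /tmp/timestamp.log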


Since I lack access to the system's logs, do they show anything
interesting around

15:26:04 -- 15:26:14, or
20:47:18 -- 20:47:31

?

Created attachment 16376
analytics1021-2014-09-04T15-26-03.tar.gz


Trying to understand why it only happens on analytics1021 and not the
other brokers, the only difference that I could find is that
analytics1021 is the only broker that does not have $ganglia_aggregator
set to true. I am not sure if that is relevant, but seeing the quoted
email from comment #7 ... could the $ganglia_aggregator setting change
the machine's characteristics enough to warrant analytics1021
exhibiting this strange behaviour?

otto wrote:

Yeah, strange indeed that this only happens on analytics1021. I *think* we have seen this elsewhere before, but not often. And, I think not since the cluster reinstall in July.

root@palladium:~# salt -E 'analytics10(12|18|21|22).eqiad.wmnet' cmd.run 'grep -i "timed out" /var/log/kafka/kafka.log | wc -l'
analytics1012.eqiad.wmnet:
    0
analytics1018.eqiad.wmnet:
    0
analytics1021.eqiad.wmnet:
    80
analytics1022.eqiad.wmnet:
    4

otto wrote:

This happened again on 2014-09-06 around 19:43:22. Here are some relevant sar captures:

https://gist.github.com/ottomata/ae372649afc914a6f606

Something is happening with paging stats, but it is unclear if it is a symptom or a cause.

Starting around 19:43:11, pages are being freed up really fast. Then, at 19:43:22, activity drops out, as if it is blocked by something. At 19:43:33, a ton more pages are freed all at once. Then it seems to go back to normal.

Seeing as the actual block starts at around :22, and there is weird behavior before that, maybe paging is part of the cause.
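The paging counters above come from sar; to watch the same counters live or pull the window around an incident, something like this works (a sketch assuming sysstat; the interval and time window are arbitrary):

    # Report paging statistics (including pgfree/s, pages freed per second)
    # once a second; a stall shows up as a missing or delayed interval.
    sar -B 1
    # Pull the historical record for the window around the incident:
    sar -B -s 19:42:00 -e 19:45:00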

Change 163744 had a related patch set uploaded by QChris:
Force ACKs from all in-sync kafka replicas

https://gerrit.wikimedia.org/r/163744
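In librdkafka terms, that change amounts to roughly the following (a sketch; the exact property name and prefix in the deployed varnishkafka configuration are not shown in this task):

    # -1 = the partition leader waits for acknowledgement from all in-sync
    # replicas before confirming a produce request, so a leader change cannot
    # silently drop messages that only the old leader had received.
    kafka.topic.request.required.acks = -1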

Happened again on 2014-10-13 around 13:37:31.

Hi Christian - would you say that the fix did not solve this problem?

thanks,

-Toby

(In reply to Toby Negrin from comment #15)

would you say that the fix did not solve this problem?

If you are referring to the change from comment 13, it totally did the trick.
For the machines it addresses, the issue is gone.
Those machines no longer lose data.

While the fix helps with the issue of losing data, it does not yet prevent
analytics1021 from dropping out.

Change 167550 had a related patch set uploaded by QChris:
Require 2 ACKs from kafka brokers for mobile caches

https://gerrit.wikimedia.org/r/167550

Change 167551 had a related patch set uploaded by QChris:
Require 2 ACKs from kafka brokers for text caches

https://gerrit.wikimedia.org/r/167551

Change 167552 had a related patch set uploaded by QChris:
Require 2 ACKs from kafka brokers for bits caches

https://gerrit.wikimedia.org/r/167552

Change 167553 had a related patch set uploaded by QChris:
Require 2 ACKs from kafka brokers per default

https://gerrit.wikimedia.org/r/167553

Change 167550 merged by Ottomata:
Require 2 ACKs from kafka brokers for mobile caches

https://gerrit.wikimedia.org/r/167550

otto wrote:

Set vm.dirty_writeback_centisecs = 200 (was 500)
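That tunable controls how often the kernel's writeback threads wake up to flush dirty pages; applying it looks roughly like this (a sketch; the persistent config file path is illustrative):

    # Wake the writeback threads every 2s instead of every 5s, so dirty pages
    # are flushed in smaller batches and any resulting stalls are shorter.
    sysctl -w vm.dirty_writeback_centisecs=200
    # Persist across reboots (illustrative location):
    echo 'vm.dirty_writeback_centisecs = 200' > /etc/sysctl.d/60-kafka-writeback.conf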

Change 167551 merged by Ottomata:
Require 2 ACKs from kafka brokers for text caches

https://gerrit.wikimedia.org/r/167551

Change 167552 merged by Ottomata:
Require 2 ACKs from kafka brokers for bits caches

https://gerrit.wikimedia.org/r/167552

Change 167553 merged by Ottomata:
Require 2 ACKs from kafka brokers per default

https://gerrit.wikimedia.org/r/167553

This happened again. I just ran an election to give an21 some leadership back.

Happened again around 2015-02-07T01:06. Ran leader-election and analytics1021 is back in the game.

Ottomata set Security to None.
Ottomata claimed this task.

analytics1021 is gone, and we haven't had problems with this since! Closing.