What to do / not do when connections to brokers are down? #785
Comments
I have a segmentation fault in Kafka, which I triggered by blocking and then re-enabling the connection to the brokers. Could it be caused by a call I shouldn't have made? The segfault is in rdkafka_partition.c at line 2483, and gdb reports:
Here's a stack trace. Any idea what happens in such a context? |
Do you still have that core file?
Thanks |
No, I don't.
I've also noticed that when I use iptables to disrupt the connection, the assignment() call sometimes returns an invalid TopicPartition (with an empty topic name and an invalid partition number, a large negative number). So now I filter vTopPart to remove any TopicPartition with an empty topic name, and I haven't experienced this since. |
You need to check the return value of both assignment() and committed(), it is possible that they will return an error when the client is not fully joined. |
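A minimal sketch of the two precautions discussed above: discard the list when the call reports an error, and drop entries with an empty topic or negative partition. `ErrorCode` and `PartitionInfo` here are hypothetical stand-ins so the snippet compiles on its own; in a real client the values would come from RdKafka::ErrorCode and RdKafka::TopicPartition, and `usable_partitions` is an illustrative name, not a librdkafka function.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins: they only mirror the parts of the RdKafka C++ types
// that this sketch needs (an error code, a topic name, a partition number).
enum class ErrorCode { NoError, Failure };

struct PartitionInfo {
    std::string topic;
    int partition;
};

// An entry is considered usable only with a non-empty topic name and a
// non-negative partition number.
inline bool is_valid(const PartitionInfo& p) {
    return !p.topic.empty() && p.partition >= 0;
}

// Returns the usable entries, or an empty list when the call itself failed.
inline std::vector<PartitionInfo> usable_partitions(
        ErrorCode err, const std::vector<PartitionInfo>& raw) {
    std::vector<PartitionInfo> out;
    if (err != ErrorCode::NoError) {
        return out;  // do not trust the list after an error
    }
    for (const auto& p : raw) {
        if (is_valid(p)) {
            out.push_back(p);
        }
    }
    return out;
}
```

The filter is only a safety net; checking the returned error code first is the part the maintainer asked for.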
Forget my previous comment; I don't think it's related, since the problem just happened again. From function rd_kafka_handle_OffsetFetch
From function rd_kafka_topic_partition_list_set_offsets
Here is the gdb core dump |
As for the ErrorCode returned by assignment() and committed(), I do check them all now. |
Thanks, I can't use the core file without the corresponding binary though. |
for the result of bt full I can also give you the binary if you need to browse the dump. |
I don't know whether the context helps to understand the issue, but in my stress test I start 3 instances of my Kafka client, then I block/unblock the connection with this shell script. When the segfault happens, it happens on all three separate instances at the same time. |
I finally decided to dig into the librdkafka code a bit. In rd_kafka_OffsetFetchRequest (from https://github.com/edenhill/librdkafka/blob/master/src/rdkafka_request.c) I see (and I read it elsewhere in the code too) that the list of topic partitions must be sorted. In that context, isn't it a problem that the variable last_topic is NULL and const? |
Thanks for your effort in finding the root cause of the issue. I haven't had time to look closer yet but will do so next week. |
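To illustrate the question above, here is a small sketch of the general "last topic" pattern for walking a list sorted by topic. It is not the librdkafka source, and `count_topic_groups` is a hypothetical name. It assumes last_topic is declared as a pointer to const char: in that case the characters are read-only but the pointer itself can start as null and be reassigned, so the declaration alone is not a problem. The sketch also shows why the list has to be sorted: an unsorted list makes the same topic start a new group more than once.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Counts how many topic groups a loop using the "last topic" pattern would
// emit. Each element is a (topic, partition) pair.
inline std::size_t count_topic_groups(
        const std::vector<std::pair<std::string, int>>& parts) {
    // Pointer to const char: the characters cannot be modified through it,
    // but the pointer can be null at first and reassigned later.
    const char* last_topic = nullptr;
    std::size_t groups = 0;
    for (const auto& tp : parts) {
        const char* topic = tp.first.c_str();
        if (last_topic == nullptr || std::strcmp(last_topic, topic) != 0) {
            ++groups;            // a new topic header would start here
            last_topic = topic;  // remember the topic for the next entry
        }
    }
    return groups;
}
```

With a sorted list such as (a,0), (a,1), (b,0) the loop sees two groups; with (a,0), (b,0), (a,1) it sees three, because topic a is opened twice.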
This issue is now fixed on master, please try to verify the fix in your environment. |
Cool! On Tue, Oct 11, 2016 at 5:09 PM, Magnus Edenhill notifications@github.com
|
So I just did a long stress test and haven't seen any SIGSEGV, so I'm comfortable assuming this is fixed. One note though: during my test I experienced another issue (maybe related to what was fixed, or maybe hidden before because the segfault would crash the program first). After 20 minutes of randomly blocking/unblocking my Kafka server with iptables, I got the following events when Kafka was unavailable:
As I have registered the eventCb, here is the log of the events I received around the time the revoke was triggered: |
The error/event log won't show when it reconnects to the brokers, so it is unfortunately not enough to troubleshoot this issue. Thanks |
Sorry for the delay, here is my test script:
Here is the test scenario with results explained:
(at 10:29:25Z) Note that we haven't seen this output while doing the test the first time.
(at 10:31:30Z) %3|1476873091.062|ERROR|rdkafka#producer-2| 127.0.0.1:9092/bootstrap: Connect to ipv4#127.0.0.1:9092 failed: Connection timed out
(at 10:33:39Z) %3|1476873219.318|ERROR|rdkafka#producer-2| 127.0.0.1:9092/bootstrap: Connect to ipv4#127.0.0.1:9092 failed: Connection timed out
Search for the word "triggered" to find where the 4 rebalance callbacks are called in the log. |
I'm not sure why nothing seems to happen after it reconnects to broker 0 at 10:33:39. 6: |
Here is my program output (no reassign bug2.txt.tar.gz) using debug=cgrp,topic,broker,protocol,metadata. At 06:51:16Z I started my test script (this is not in the txt file, which begins at 06:52:28Z). At 06:55:50Z I started the script again. |
In the following logs, isn't a reset of the state machine missing after these events?
Perhaps the state should change from wait-coord back to an earlier state such as query-coord?
|
Description
Hi, I have a question about using librdkafka (C++ wrapper) when brokers go down.
(I don't know how to set up the question flag here).
I'm trying to find out what to do when the connection to my Kafka brokers is down.
I simulate that by blocking the connection with:
sudo iptables -A INPUT -s 127.0.0.1 -p tcp --destination-port 9092 -j DROP
and re-enabling it with:
sudo iptables -D INPUT -s 127.0.0.1 -p tcp --destination-port 9092 -j DROP
First question: should my program be aware that the brokers are down, so that I don't ask librdkafka to commit offsets (or read the topic partition offsets of my consumer group) while the brokers are down?
Second question: how can I know whether the connection to the brokers is up or down? If I had that information, I would just stop calling consume(), commitSync() or committed() on my KafkaConsumer until I know at least one broker is up again.
Thank you.
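One possible way to approach the second question is sketched below, as an assumption rather than a recommended design. librdkafka reports an "all brokers down" error (RdKafka::ERR__ALL_BROKERS_DOWN) through the error/event callback, so a client can keep a best-effort flag that is cleared on that error and set again once an operation succeeds. `ClientError` and `BrokerHealth` are hypothetical names used so the snippet compiles without the library; in a real client the error value would come from the event passed to the registered event callback.

```cpp
#include <atomic>

// Hypothetical stand-in for the few error codes this sketch distinguishes.
enum class ClientError { NoError, AllBrokersDown, Other };

// Keeps a best-effort "brokers reachable" flag. It is only a hint: the flag
// stays false until some later operation is observed to succeed.
class BrokerHealth {
public:
    // Call from the event callback whenever an error event arrives.
    void on_error(ClientError err) {
        if (err == ClientError::AllBrokersDown) {
            up_.store(false);
        }
    }

    // Call after an operation that shows a broker answered, for example
    // a message returned by consume().
    void on_success() { up_.store(true); }

    // True while no "all brokers down" error is outstanding.
    bool brokers_up() const { return up_.load(); }

private:
    std::atomic<bool> up_{true};
};
```

With such a flag the program would probably keep calling consume(), since that call is what lets callbacks and new messages through, and only skip the blocking commitSync() or committed() calls while brokers_up() is false.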