
Kafka monitoring #4819

Open
ilyam8 opened this issue Dec 1, 2018 · 28 comments

@ilyam8
Member

ilyam8 commented Dec 1, 2018

We are willing to support monitoring Kafka with netdata.

To get started, we need a user who will act as a sponsor during the implementation.
They will assist us with the following:

  • Specify the metrics required and how they should be presented in charts.
  • Specify the alarms that would make sense for each metric.
  • When the implementation passes QA, test the implementation in production.
  • Use the charts and alarms in their day to day work and provide additional feedback after the collector is delivered.

We would also appreciate getting votes 👍 for this issue from the community, to understand the interest in the particular collector.

@ilyam8 ilyam8 self-assigned this Dec 1, 2018
@ilyam8 ilyam8 added the area/collectors and help wanted labels Dec 1, 2018
@ilyam8 ilyam8 removed their assignment Dec 1, 2018
@cakrit cakrit added the new collector label Dec 2, 2018
@PanosJee
Collaborator

PanosJee commented Jul 12, 2019

@andmarios could you share some insight on the following?

Specify the metrics required and how they should be presented in charts.
Specify the alarms that would make sense for each metric.

@cakrit
Contributor

cakrit commented Feb 22, 2020

We have a lot of interest here, but no sponsor. Can someone assume the role so we can move forward?

@OneCricketeer

  1. What are the responsibilities of a sponsor?

  2. Kafka is a subset of Java-based application monitoring, so is this issue just a matter of determining the important metrics to track?

@cakrit
Contributor

cakrit commented Feb 24, 2020

Hi @Cricket007, the responsibilities are mentioned in the OP. Would you be interested in helping?

I received the following from @PanosJee, regarding other vendors monitoring Kafka (I think it's enough to get us started at least):

@OneCricketeer

I've limited familiarity with Go, but I currently use Datadog and New Relic and plan on taking the Confluent admin certification sometime soon, so I'm happy to help where I can.

Each of these links is primarily about server-side monitoring. Is that a good starting point, or should client-side metrics be tracked as well?

@ilyam8
Member Author

ilyam8 commented Apr 1, 2020

I see all of them are using a self-written JMX fetcher:

New Relic: https://github.com/newrelic/nrjmx
Datadog: https://github.com/DataDog/jmxfetch

@fzyzcjy

fzyzcjy commented May 27, 2020

Interested to see monitoring here! I'm not a Kafka expert, though, so I may not have much insight... To me, it would make sense to monitor the rate of message production, the rate of consumption, and the count of not-yet-consumed messages. :)

@OneCricketeer

Those all already exist as JMX metrics

@fzyzcjy

fzyzcjy commented May 27, 2020

@OneCricketeer Thanks!

@rsm3aaq

rsm3aaq commented Jun 18, 2020

Additional metrics of interest would be under-replicated partitions, offline partitions, follower time, fetch time, leader election rate, consumer lag, etc.

@OneCricketeer

Consumer lag would be measured in the consumers themselves, unless interacting with a tool like Burrow.

The rest are available in JMX.
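
For reference, a minimal Go sketch of how the metrics listed above roughly map to broker-side JMX object names; the MBean names below are assumptions based on the standard Kafka JMX registry and should be verified against the broker version in use:

```go
package main

import "fmt"

// Rough, illustrative mapping of the metrics mentioned above to the standard
// Kafka broker MBeans; verify these object names against your Kafka version.
var kafkaMBeans = map[string]string{
	"under-replicated partitions": "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions",
	"offline partitions":          "kafka.controller:type=KafkaController,name=OfflinePartitionsCount",
	"leader election rate":        "kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs",
	"consumer fetch time":         "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer",
	"follower fetch lag":          "kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica",
}

func main() {
	for metric, mbean := range kafkaMBeans {
		fmt.Printf("%-28s -> %s\n", metric, mbean)
	}
}
```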

@jrevillard

Hi all,

JMX is one thing, but you may know that there are now some monitoring tools which are quite "standard" to use for monitoring different parts of your Kafka cluster. I am mainly speaking about the LinkedIn tools Burrow and Cruise-control.

Both of them expose very nice APIs which can provide all that we need to properly monitor a Kafka cluster (https://github.com/linkedin/cruise-control/wiki/REST-APIs#get-requests, https://github.com/linkedin/Burrow/wiki/HTTP-Endpoint)

I would like to see netdata automatically monitor our Kafka clusters through these APIs. I don't have time to work on it myself at the moment, but if somebody is interested I can help and test.

Best,
Jerome
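
As a rough illustration of the Burrow side, a minimal Go sketch that pulls a consumer group's status evaluation from Burrow's HTTP endpoint; the address, cluster name, and group name are placeholders, and the path follows the v3 HTTP API documented in the wiki linked above:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Placeholders: Burrow address, cluster name and consumer group name.
	base, cluster, group := "http://localhost:8000", "local", "example-group"
	url := fmt.Sprintf("%s/v3/kafka/%s/consumer/%s/status", base, cluster, group)

	resp, err := http.Get(url)
	if err != nil {
		log.Fatalf("request failed: %v", err)
	}
	defer resp.Body.Close()

	// Decode generically; the exact response schema should be taken from the
	// Burrow HTTP endpoint documentation rather than hard-coded here.
	var body map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		log.Fatalf("decode failed: %v", err)
	}
	fmt.Printf("consumer group status: %v\n", body["status"])
}
```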

@odyslam
Contributor

odyslam commented Jan 8, 2021

Hey @jrevillard,

Thanks for chiming in! If you could guide us in structuring the available metrics into charts that make sense, that would be a huge help for us. We can handle the development, although @ilyam8 will have more information regarding our priorities. The hard part is finding people such as yourself, who are willing to guide us through the available metrics, select the ones we care about, and structure them into sensible charts.

What do you think?

@jrevillard

jrevillard commented Jan 11, 2021

Hi @odyslam,

As I said, I'm willing to help if I can... but apologies in advance if I'm not always very responsive, because I really have a lot going on at work.

So, concerning cruise-control, you can see here the different metrics that they already have:

  • Anomaly detection, alerting, and self-healing for the Kafka cluster, including:
    • Goal violation
    • Broker failure detection
    • Metric anomaly detection
    • Disk failure detection
  • Resource utilization tracking for brokers, topics, and partitions
  • Current Kafka cluster state to see the online and offline partitions, in-sync and out-of-sync replicas, replicas under min.insync.replicas, online and offline logDirs, and distribution of replicas in the cluster.

So here are the endpoints that need to be queried:

All those endpoints can return JSON data if you specify the json=true parameter.

What would you need then? Some sample output of the different endpoints?

Best,
Jerome
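
To make the json=true idea concrete, a minimal Go sketch that fetches the cluster state as JSON; the host, port, and the kafka_cluster_state endpoint name are assumptions to be checked against the Cruise Control REST API wiki linked above:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/url"
)

func main() {
	// Placeholder Cruise Control address and cluster-state endpoint.
	endpoint := "http://localhost:9090/kafkacruisecontrol/kafka_cluster_state"

	// json=true asks the endpoint for JSON instead of the plain-text report.
	q := url.Values{}
	q.Set("json", "true")

	resp, err := http.Get(endpoint + "?" + q.Encode())
	if err != nil {
		log.Fatalf("request failed: %v", err)
	}
	defer resp.Body.Close()

	// The schema differs between Cruise Control versions, so decode generically
	// and just list the top-level sections that came back.
	var state map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&state); err != nil {
		log.Fatalf("decode failed: %v", err)
	}
	for section := range state {
		fmt.Println("section:", section)
	}
}
```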

@odyslam
Contributor

odyslam commented Jan 14, 2021

Thanks @jrevillard for this detailed message. We completely understand; please take your time.

Ilya is aware of this and should reply shortly in this issue; he is our integration engineer and will have more information regarding prioritisation and bandwidth. In any case, it's super helpful and reassuring that we have a user who is willing to guide us towards the correct data sources and charts. Intimate knowledge of a collector's subject has always been the toughest challenge.

I will continue monitoring this issue. See you soon and a happy new year :)

Best,
Odysseas

@ilyam8
Member Author

ilyam8 commented Jan 18, 2021

Hi @jrevillard, thanks for your willingness to help ❤️

Monitoring an Apache Kafka cluster via cruise-control looks very good.

What would you need then? Some sample output of the different endpoints?

Yes, sample output would be very good to have. Ideally, we need to set up a local Apache Kafka cluster with cruise-control for developing the collector.
Is it possible to provide a docker-compose config or a guide on how to do it? (see for example the help with supervisord)

@OneCricketeer

Worth mentioning (again) that much of the data exposed via Burrow/Cruise Control can be gathered via JMX

@odyslam
Contributor

odyslam commented Jan 18, 2021

@OneCricketeer Perhaps, @ilyam8, we could approach this by creating a JMX collector/helper function and use that to gather Apache Kafka metrics, a two-birds-with-one-stone sort of thing. Thoughts?

@ilyam8
Member Author

ilyam8 commented Jan 18, 2021

I suspect a JMX collector would need to be written in Java, in which I have zero experience (I am not 100% sure, because I have no clear understanding of what having a JMX collector actually means).

There is netdata-java-plugin (written by a contributor, not really maintained); perhaps it is related to this issue.

@PanosJee
Collaborator

PanosJee commented Jan 18, 2021 via email

@ilyam8
Member Author

ilyam8 commented Jan 18, 2021

@PanosJee thanks for sharing it.

If I understand it right, both Datadog and New Relic have a JMX fetcher agent (a tool for extracting data out of any application exposing a JMX interface) written in Java:

The Golang collector then gathers data from it (and perhaps does some additional work, like filtering).
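
A minimal Go sketch of that architecture, assuming a hypothetical helper binary named jmx-fetch-helper that stands in for a Java JMX fetcher such as nrjmx/jmxfetch and prints one metric per line on stdout:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os/exec"
)

// Illustrative only: "jmx-fetch-helper" is a placeholder for a separate Java
// process that speaks JMX to the broker and prints one "name value" line per
// metric; the Go side just launches it and consumes the stream.
func main() {
	cmd := exec.Command("jmx-fetch-helper", "--host", "localhost", "--port", "9999")

	stdout, err := cmd.StdoutPipe()
	if err != nil {
		log.Fatalf("pipe failed: %v", err)
	}
	if err := cmd.Start(); err != nil {
		log.Fatalf("start failed: %v", err)
	}

	scanner := bufio.NewScanner(stdout)
	for scanner.Scan() {
		// Here the collector would parse and filter the metric lines.
		fmt.Println("metric:", scanner.Text())
	}
	if err := cmd.Wait(); err != nil {
		log.Fatalf("helper exited with error: %v", err)
	}
}
```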

@OneCricketeer

Let me clarify that Cruise Control is nice for monitoring the external state of Kafka, but it can't give internal state like heap usage, or the granularity of data that JMX can provide (depending on the polling interval).

If you're limited to a REST API, then Jolokia is another option, but it sounds like you want to steer away from that (#364).
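
For the Jolokia route, a minimal Go sketch that reads a broker MBean over Jolokia's REST bridge; the agent port 8778 is the Jolokia default and an assumption here, and the MBean is the standard broker MessagesInPerSec metric:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Placeholder host/port; the path is Jolokia's generic "read" operation
	// applied to a standard Kafka broker MBean.
	url := "http://localhost:8778/jolokia/read/kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec"

	resp, err := http.Get(url)
	if err != nil {
		log.Fatalf("request failed: %v", err)
	}
	defer resp.Body.Close()

	// Jolokia wraps the attribute values in a top-level "value" object.
	var out struct {
		Value map[string]interface{} `json:"value"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		log.Fatalf("decode failed: %v", err)
	}
	fmt.Printf("MessagesInPerSec attributes: %+v\n", out.Value)
}
```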

@jrevillard

So, @odyslam, @ilyam8, what do we do? Do you go for cruise-control or pure JMX (or both)?

@odyslam
Contributor

odyslam commented Feb 10, 2021

Hey @jrevillard,

Thanks for coming back to this. We are in the process of defining the Roadmap and new features. When we have new information regarding our bandwidth and priorities, we will come back to this thread.

We believe in developing alongside users, but due to the vastness of this project, we have to be ruthless in our prioritization.

Hopefully, we will have more information soon!

@netdata-community-bot

This issue has been mentioned on the Netdata Community. There might be relevant details there:

https://community.netdata.cloud/t/creating-dynamic-charts-within-plugin-get-data-method/957/4

@cpipilas cpipilas self-assigned this Jun 15, 2022
@thiagoftsm
Contributor

Hello guys,

Sorry for the delay in bringing updates. After doing some initial research, I am glad to report that we have looked at different possibilities and are starting to work on Kafka again.

Software | Configuration | Metrics
Kafka | kafka_cfg.txt | kafka.txt

My current conclusion is that we will first work on a collector that monitors Kafka using JMX exporter metrics, so if there is any objection or suggestion, please let us know.

Best regards!
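
A minimal Go sketch of what consuming the JMX exporter's output could look like; the port 7071 and the kafka_server_ metric prefix depend entirely on the exporter's configuration and are placeholders:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Placeholder scrape address for the Prometheus JMX exporter attached to
	// a Kafka broker.
	resp, err := http.Get("http://localhost:7071/metrics")
	if err != nil {
		log.Fatalf("scrape failed: %v", err)
	}
	defer resp.Body.Close()

	// The exporter returns plain Prometheus text format: one
	// "metric_name{labels} value" line per sample.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "kafka_server_") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatalf("read failed: %v", err)
	}
}
```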

@zyxep

zyxep commented Jun 8, 2023

I currently have metrics fetched into Prometheus and a dashboard in Grafana to show us the information.
It has taken me a bit of time to get that running; there are so many examples of config files.

I am happy to share what I can if anyone needs it.

@mamutuberalles

Will we be able to monitor topic sizes and disk utilization by topic using this plugin?
