
Kafka monitoring #4819

Open
ilyam8 opened this issue Dec 1, 2018 · 28 comments

@ilyam8
Member

ilyam8 commented Dec 1, 2018

We are willing to support monitoring Kafka with netdata.

To get started, we need a user who will act as a sponsor during the implementation.
They will assist us with the following:

  • Specify the metrics required and how they should be presented in charts.
  • Specify the alarms that would make sense for each metric.
  • When the implementation passes QA, test the implementation in production.
  • Use the charts and alarms in their day to day work and provide additional feedback after the collector is delivered.

We would also appreciate getting votes 👍 for this issue from the community, to understand the interest in the particular collector.

@ilyam8 ilyam8 self-assigned this Dec 1, 2018
@ilyam8 ilyam8 added the area/collectors and help wanted labels Dec 1, 2018
@ilyam8 ilyam8 removed their assignment Dec 1, 2018
@cakrit cakrit added the new collector label Dec 2, 2018
@PanosJee
Collaborator

PanosJee commented Jul 12, 2019

@andmarios could you share some insight on the following?

Specify the metrics required and how they should be presented in charts.
Specify the alarms that would make sense for each metric.

@cakrit
Contributor

cakrit commented Feb 22, 2020

We have a lot of interest here, but no sponsor. Can someone assume the role so we can move forward?

@OneCricketeer

  1. What are the responsibilities of a sponsor?

  2. Kafka is a subset of Java-based application monitoring, so is this issue just a matter of determining the important metrics to track?

@cakrit
Contributor

cakrit commented Feb 24, 2020

Hi @Cricket007, the responsibilities are mentioned in the OP. Would you be interested in helping?

I received the following from @PanosJee, regarding other vendors monitoring Kafka (I think it's enough to get us started at least):

@OneCricketeer

I've limited familiarity with Go, but I currently use Datadog and New Relic and plan on taking the Confluent admin certification sometime soon, so I'm happy to help where I can.

Each of these links is primarily about server-side monitoring. Is that a good starting point, or should client-side metrics be tracked as well?

@ilyam8
Member Author

ilyam8 commented Apr 1, 2020

I see all of them are using a self-written JMX fetcher:

New Relic: https://github.com/newrelic/nrjmx
Datadog: https://github.com/DataDog/jmxfetch

@fzyzcjy

fzyzcjy commented May 27, 2020

Interested to see monitoring here! I'm not a Kafka expert, though, so I may not have much insight... To me, it would make sense to monitor the rate of message production, the rate of consumption, and the count of not-yet-consumed messages. :)

@OneCricketeer

Those all already exist as JMX metrics

@fzyzcjy

fzyzcjy commented May 27, 2020

@OneCricketeer Thanks!

@rsm3aaq

rsm3aaq commented Jun 18, 2020

Additional metrics of interest would be under-replicated partitions, offline partitions, follower time, fetch time, leader election rate, consumer lag, etc.

@OneCricketeer

Consumer lag would be measured in the consumers themselves, unless interacting with a tool like Burrow.

The rest are available in JMX.
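
For reference, a minimal Go sketch of how the metrics listed above roughly map to broker-side JMX object names; the MBean names below are assumptions based on the standard Kafka JMX registry and should be verified against the broker version in use:

```go
package main

import "fmt"

// Rough, illustrative mapping of the metrics mentioned above to the standard
// Kafka broker MBeans; verify these object names against your Kafka version.
var kafkaMBeans = map[string]string{
	"under-replicated partitions": "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions",
	"offline partitions":          "kafka.controller:type=KafkaController,name=OfflinePartitionsCount",
	"leader election rate":        "kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs",
	"consumer fetch time":         "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer",
	"follower fetch lag":          "kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica",
}

func main() {
	for metric, mbean := range kafkaMBeans {
		fmt.Printf("%-28s -> %s\n", metric, mbean)
	}
}
```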

@jrevillard

Hi all,

JMX is one thing, but you may know that there are now some monitoring tools which are quite "standard" to use for monitoring different parts of your Kafka cluster. I am mainly speaking about the LinkedIn tools Burrow and Cruise-control.

Both of them expose very nice APIs which can provide all that we need to properly monitor a Kafka cluster (https://github.com/linkedin/cruise-control/wiki/REST-APIs#get-requests, https://github.com/linkedin/Burrow/wiki/HTTP-Endpoint)

I would like to see netdata automatically monitor our Kafka clusters through these APIs. I don't have time to work on it myself at the moment, but if somebody is interested I can help and test.

Best,
Jerome
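
As a rough illustration of the Burrow side, a minimal Go sketch that pulls a consumer group's status evaluation from Burrow's HTTP endpoint; the address, cluster name, and group name are placeholders, and the path follows the v3 HTTP API documented in the wiki linked above:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Placeholders: Burrow address, cluster name and consumer group name.
	base, cluster, group := "http://localhost:8000", "local", "example-group"
	url := fmt.Sprintf("%s/v3/kafka/%s/consumer/%s/status", base, cluster, group)

	resp, err := http.Get(url)
	if err != nil {
		log.Fatalf("request failed: %v", err)
	}
	defer resp.Body.Close()

	// Decode generically; the exact response schema should be taken from the
	// Burrow HTTP endpoint documentation rather than hard-coded here.
	var body map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		log.Fatalf("decode failed: %v", err)
	}
	fmt.Printf("consumer group status: %v\n", body["status"])
}
```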

@odyslam
Contributor

odyslam commented Jan 8, 2021

Hey @jrevillard,

Thanks for chiming in! If you could guide us in structuring the available metrics into charts that make sense, that would be a huge help for us. We can handle the development, although @ilyam8 will have more information regarding our priorities. The hard part is finding people such as yourself, who are willing to guide us through the available metrics, select the ones we care about, and structure them into sensible charts.

What do you think?

@jrevillard

jrevillard commented Jan 11, 2021

Hi @odyslam,

As I said, I'm willing to help if I can... but apologies in advance if I'm not always very responsive, because I really have a lot going on at work.

So, concerning cruise-control, you can see here the different metrics that they already have:

  • Anomaly detection, alerting, and self-healing for the Kafka cluster, including:
    • Goal violation
    • Broker failure detection
    • Metric anomaly detection
    • Disk failure detection
  • Resource utilization tracking for brokers, topics, and partitions
  • Current Kafka cluster state to see the online and offline partitions, in-sync and out-of-sync replicas, replicas under min.insync.replicas, online and offline logDirs, and distribution of replicas in the cluster.

So here are the endpoints that need to be queried:

All those endpoints can return JSON data if you specify the json=true parameter.

What would you need then? Some sample output of the different endpoints?

Best,
Jerome
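
To make the json=true idea concrete, a minimal Go sketch that fetches the cluster state as JSON; the host, port, and the kafka_cluster_state endpoint name are assumptions to be checked against the Cruise Control REST API wiki linked above:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/url"
)

func main() {
	// Placeholder Cruise Control address and cluster-state endpoint.
	endpoint := "http://localhost:9090/kafkacruisecontrol/kafka_cluster_state"

	// json=true asks the endpoint for JSON instead of the plain-text report.
	q := url.Values{}
	q.Set("json", "true")

	resp, err := http.Get(endpoint + "?" + q.Encode())
	if err != nil {
		log.Fatalf("request failed: %v", err)
	}
	defer resp.Body.Close()

	// The schema differs between Cruise Control versions, so decode generically
	// and just list the top-level sections that came back.
	var state map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&state); err != nil {
		log.Fatalf("decode failed: %v", err)
	}
	for section := range state {
		fmt.Println("section:", section)
	}
}
```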

@odyslam
Contributor

odyslam commented Jan 14, 2021

Thanks @jrevillard for this detailed message. We completely understand; please take your time.

Ilya is aware of this and should reply shortly in this issue; he is our integration engineer and will have more information regarding prioritisation and bandwidth. In any case, it's super helpful and reassuring that we have a user who is willing to guide us towards the correct data sources and charts. Intimate knowledge of a collector's subject has always been the toughest challenge.

I will continue monitoring this issue. See you soon and a happy new year :)

Best,
Odysseas

@ilyam8
Member Author

ilyam8 commented Jan 18, 2021

Hi @jrevillard, thanks for your willingness to help ❤️

Monitoring an Apache Kafka cluster via cruise-control looks very good.

What would you need then? Some sample output of the different endpoints?

Yes, sample output would be very good to have. Ideally, we need to set up a local Apache Kafka cluster with cruise-control for developing the collector.
Is it possible to provide a docker-compose config or a guide on how to do it? (see for example the help with supervisord)

@OneCricketeer

Worth mentioning (again) that much of the data exposed via Burrow/Cruise Control can be gathered via JMX

@odyslam
Contributor

odyslam commented Jan 18, 2021

@OneCricketeer Perhaps, @ilyam8, we could approach this by creating a JMX collector/helper function and use that to gather Apache Kafka metrics, a two-birds-with-one-stone sort of thing. Thoughts?

@ilyam8
Member Author

ilyam8 commented Jan 18, 2021

I suspect a JMX collector would need to be written in Java, in which I have zero experience (I am not 100% sure, because I have no clear understanding of what having a JMX collector actually means).

There is netdata-java-plugin (written by a contributor, not really maintained); perhaps it is related to this issue.

@PanosJee
Collaborator

PanosJee commented Jan 18, 2021 via email

@ilyam8
Member Author

ilyam8 commented Jan 18, 2021

@PanosJee thanks for sharing it.

If I understand it right, both Datadog and New Relic have a JMX fetcher agent (a tool for extracting data out of any application exposing a JMX interface) written in Java:

The Golang collector then gathers data from it (and perhaps does some additional work, like filtering).
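
A minimal Go sketch of that architecture, assuming a hypothetical helper binary named jmx-fetch-helper that stands in for a Java JMX fetcher such as nrjmx/jmxfetch and prints one metric per line on stdout:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os/exec"
)

// Illustrative only: "jmx-fetch-helper" is a placeholder for a separate Java
// process that speaks JMX to the broker and prints one "name value" line per
// metric; the Go side just launches it and consumes the stream.
func main() {
	cmd := exec.Command("jmx-fetch-helper", "--host", "localhost", "--port", "9999")

	stdout, err := cmd.StdoutPipe()
	if err != nil {
		log.Fatalf("pipe failed: %v", err)
	}
	if err := cmd.Start(); err != nil {
		log.Fatalf("start failed: %v", err)
	}

	scanner := bufio.NewScanner(stdout)
	for scanner.Scan() {
		// Here the collector would parse and filter the metric lines.
		fmt.Println("metric:", scanner.Text())
	}
	if err := cmd.Wait(); err != nil {
		log.Fatalf("helper exited with error: %v", err)
	}
}
```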

@OneCricketeer

Let me clarify that Cruise Control is nice for monitoring the external state of Kafka, but it can't give internal state like heap usage, or the granularity of data that JMX can provide (depending on the polling interval).

If you're limited to a REST API, then Jolokia is another option, but it sounds like you want to steer away from that (#364).
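
For the Jolokia route, a minimal Go sketch that reads a broker MBean over Jolokia's REST bridge; the agent port 8778 is the Jolokia default and an assumption here, and the MBean is the standard broker MessagesInPerSec metric:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Placeholder host/port; the path is Jolokia's generic "read" operation
	// applied to a standard Kafka broker MBean.
	url := "http://localhost:8778/jolokia/read/kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec"

	resp, err := http.Get(url)
	if err != nil {
		log.Fatalf("request failed: %v", err)
	}
	defer resp.Body.Close()

	// Jolokia wraps the attribute values in a top-level "value" object.
	var out struct {
		Value map[string]interface{} `json:"value"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		log.Fatalf("decode failed: %v", err)
	}
	fmt.Printf("MessagesInPerSec attributes: %+v\n", out.Value)
}
```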

@jrevillard

So, @odyslam, @ilyam8, what do we do? Do you go for cruise-control or pure JMX (or both)?

@odyslam
Contributor

odyslam commented Feb 10, 2021

Hey @jrevillard,

Thanks for coming back to this. We are in the process of defining the Roadmap and new features. When we have new information regarding our bandwidth and priorities, we will come back to this thread.

We believe in developing alongside users, but due to the vastness of this project, we have to be ruthless in our prioritization.

Hopefully, we will have more information soon!

@netdata-community-bot

This issue has been mentioned on the Netdata Community. There might be relevant details there:

https://community.netdata.cloud/t/creating-dynamic-charts-within-plugin-get-data-method/957/4

@cpipilas cpipilas self-assigned this Jun 15, 2022
@thiagoftsm
Contributor

Hello guys,

Sorry for the delay in bringing updates. After doing some initial research, I am glad to report that we have looked at different possibilities and are starting to work on Kafka again.

Software | Configuration | Metrics
Kafka | kafka_cfg.txt | kafka.txt

My current conclusion is that we will first work on a collector that monitors Kafka using JMX exporter metrics, so if there is any objection or suggestion, please let us know.

Best regards!
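
A minimal Go sketch of what consuming the JMX exporter's output could look like; the port 7071 and the kafka_server_ metric prefix depend entirely on the exporter's configuration and are placeholders:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Placeholder scrape address for the Prometheus JMX exporter attached to
	// a Kafka broker.
	resp, err := http.Get("http://localhost:7071/metrics")
	if err != nil {
		log.Fatalf("scrape failed: %v", err)
	}
	defer resp.Body.Close()

	// The exporter returns plain Prometheus text format: one
	// "metric_name{labels} value" line per sample.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "kafka_server_") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatalf("read failed: %v", err)
	}
}
```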

@zyxep

zyxep commented Jun 8, 2023

I currently have metrics fetched into Prometheus and a dashboard in Grafana to show us the information.
It has taken me a bit of time to get that running; there are so many examples of config files.

I am happy to share what I can if anyone needs it.

@mamutuberalles

Will we be able to monitor topic sizes and disk utilization by topic using this plugin?
