-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cgroups path being parsed as metric #1724
Comments
I believe that change was made for the purpose of limiting cardinality of the metrics here: #1457 That being said, it appears that we are actually going to be sending many duplicate and overwritten metrics without having the path as a tag. What do you think @sebito91? I understand the need to limit cardinality but it seems that without the path as a tag the metrics are more-or-less useless. |
Hello, I have submitted PR #1796 to revert this change. As spotted by sparrc this is leading to duplicates and metrics loss with the influxdb output plugin. But also it is preventing the use of "GROUP BY" queries which are very convenient to compute percentile of top CPU consumers on a system for instance. The plugin allows to select a restricted set of path to monitor: it is possible to limit the cardinality by choosing only the services in which you are interested in. Cardinality is an influxdb issue (being addressed through influxdata/influxdb#7151), which is one among other output plugin for telegraf. The issue is on influxdb side, not telegraf cgroup plugin. It should be addressed there instead I believe. Thank you very much in advance for your understanding ! |
@ririsoft, @deric the decision to go down the field route and not tag was purely due to cardinality issues for earlier versions of influxdb (v0.11+). Although I share the sentiment that influxdata/influxdb#7151 will fix the world, it simply is not there yet. The issues we saw with high cardinality on this series (plus procstat) literally caused our influxdb to topple over. We had to remove the 'problematic' shards, aka remove data completely, in order to get the database back up. It's entirely possible that our collection was too granular, and effectively pulling the same path for every PID ended up blowing up our indices and if you can avoid that in your configuration it could work. I would STRONGLY encourage you to limit the unique path names you're collecting if that can be avoided at all. |
Out of curiosity what kind of system and setup leads to cardinality issue on your side, @sebito91 ? For a single host how many distinct path do you have ? |
@ririsoft we have systems that put a single job in a cgroup, so we ended up with an extremely large number of these groups. That use of cgroups is not entirely unusual IMHO. |
Thank you @daviesalex, I understand your situation better. However the fact is that moving "path" from a tag to a field breaks the measurements with missing values. It also removes any possibility to use InfluxDB aggregation functions. In my case i had to setup a monitoring for each of the paths in a hierarchy of cgroups (depth ~5) to get back the aggregation metrics and drill down dashboards i had before, but it leads to missing values because at each collection of data by telegraf the metrics have the same unicity: same timestamp, same measurement name and tag set. Only one of my paths is saved on InfluxDB side, as the result of the union of all values for all paths (see duplicate points for more on this). The cgroup plugin is simply broken in its current state unfortunately. My proposal is to get this plugin back to its previous, not broken state, while addressing the InfluxDB cardinality issue on InfluxDB side, since the issue is not limited to cgroup plugin (procstat for instance as already raised above in the discussion). |
I really defer to @sparrc on the timing of when this should be put back to full functionality in telegraf ; we also want to use this data, but the issue with this plugin (and procstat, which was also fixed in a similar way) is that you can very easily inject sufficient data to really screw up your InfluxDB as things stand today. That is why this change was made I believe. From our POV, we have deployed a better restriction of what cgroups we record, so this is not a huge issue for us any more. I would understand if the InfluxData guys didnt want to fix this just yet because it does provide users with a way to really blow up their InfluxDB. |
I'm testing telegraf 1.0.0-rc1 with following configuration:
which generates following:
I would expect the
path="/sys/fs/cgroup/cpu/system.slice/systemd-setup-dgram-qlen.service"
included between tags, not metrics.Also I was getting a type error when parsing memory usage:
The text was updated successfully, but these errors were encountered: