Prometheus metrics: Move zonestats from metric name to label#483
Conversation
I am going to move the zonestats zone name out of the metric name and into a label. As a preparation for that, introduce a helper struct for formatting and printing metrics, to make it a bit easier to handle a variable number of labels from print_stat_block. This commit changes only what is needed for the zonestats, there are still metrics in metrics_print_stats that format the metrics directly. These metrics don't need to be parametrized over the zone, so I am leaving them alone in this commit to keep the changes reviewable and testable individually. I'll change this in a follow-up commit for consistency.
For consistency, this uses the new metrics helper for all remaining metrics. This removes print_longnum in favor of a PRIu64 format string. I don't know why previously long numbers needed to be split, PRIu64 is used in the codebase already (in one place, though PRIu32 and others have more usage). Maybe because that code (at least the original in remote.c) is very old? Tested by querying the metrics from a binary built against master, vs. a binary built from this commit, and then diffing the output. There was no difference except in nsd_time_elapsed_seconds, which is expected. Most counters were 0 though so this test would not catch accidentally wiring up the wrong field.
This makes metrics more useful and more discoverable, for example you can use a regex filter to filter on particular zones, you can use `sum` etc., and graphing multiple zones in a single query becomes much easier. As an additional benefit, dot is now an allowed character in zonestats names. Fixes NLnetLabs#482.
This disallows 'zonestats' values which contains characters that need to be escaped to be a valid Prometheus label name, which are anyway not characters that you'd want to use in zone names. This does not yet validate in nsd-checkconf, which is a shame. Validating properly would be a bit tricky because the zonestat name can be a template, so we'd need to evaluate that first.
|
Thanks for the PR. I'll look at it the during the coming weeks, but cannot promise a timeline due to upcoming events |
mozzieongit
left a comment
There was a problem hiding this comment.
Great work! Just the one note about the disallowed characters.
I don't know why previously long numbers needed to be split, PRIu64 is
used in the codebase already (in one place, though PRIu32 and others
have more usage). Maybe because that code (at least the original in
remote.c) is very old?
Could be. I don't know either, but indeed replicated the behaviour from remote.c
|
|
||
| const char* statname_error = metrics_validate_label_value(statname); | ||
| if (statname_error) { | ||
| log_msg(LOG_ERR, "invalid zonestats name \"%s\": %s", | ||
| statname, statname_error); | ||
| exit(1); | ||
| } |
There was a problem hiding this comment.
I see the appeal to verify on startup, but I don't think that it is a good idea. It will break existing configurations using zonestats when updating NSD even if they don't use Prometheus actively.
I would prefer to have the replacement of disallowed strings over the exit as before. However, you could still log a warning on startup when the zonestats name contains a disallowed string (and that is being replaced in the metrics output).
There was a problem hiding this comment.
It does now log the warning twice, I suppose that happens because that code path is somehow called twice.
Such zonestat names were allowed in the past, and although the disallowed characters are unlikely to be part of zonestat names anyway, people who used would no longer be able to start the new version. So instead of failing loudly, rewrite the name at runtime. To not make this pass silently, do log a warning at startup when this rewrite applies.
This implements the change proposed in #482, moving the zonestats name out of the metric name and into a label.
This pull request is structured in separate commits that can be reviewed one at a time.
print_longnum, I use aPRIu64format string, which is used in the codebase already, but only in one place. Doesprint_longnumexist only for historical reasons, or is it important to only use%luin format strings?/metricsresponse on a binary built frommastervs. this commit.nsd-checkconfcould catch it already, but this is a bit tricky because the zonestats name can be a template like%sand we have to validate after applying the template, which looks like it would be a larger change tonsd-checkconf, so I’m not sure if that’s worth it.I tested locally with Prometheus 3.11.2 that it can still parse the
/metricsresponse, and it can.