Skip to content

Prometheus metrics: Move zonestats from metric name to label#483

Open
ruuda wants to merge 5 commits into
NLnetLabs:masterfrom
ruuda:metrics-label
Open

Prometheus metrics: Move zonestats from metric name to label#483
ruuda wants to merge 5 commits into
NLnetLabs:masterfrom
ruuda:metrics-label

Conversation

@ruuda

@ruuda ruuda commented May 2, 2026

Copy link
Copy Markdown

This implements the change proposed in #482, moving the zonestats name out of the metric name and into a label.

This pull request is structured in separate commits that can be reviewed one at a time.

  • The first commit introduces a helper struct to format metrics with labels. This is only a refactor, no behavioral changes.
  • It’s not really needed to use the helper everywhere because some metrics don’t need labels, but I thought it would be nicer for consistency, so in the second commit I switch all Prometheus metrics to use the helper.
    • Instead of print_longnum, I use a PRIu64 format string, which is used in the codebase already, but only in one place. Does print_longnum exist only for historical reasons, or is it important to only use %lu in format strings?
    • Up to this point the output should not change, and I verified this by diffing the /metrics response on a binary built from master vs. this commit.
  • Then move the zonestats name out of the prefix and into a label, which is now a very small change.
  • I thought that instead of silently replacing disallowed characters in the zonestats name, it would be nicer to fail on it. I implemented that in the final commit, but it’s only validated at startup. It would be nice if nsd-checkconf could catch it already, but this is a bit tricky because the zonestats name can be a template like %s and we have to validate after applying the template, which looks like it would be a larger change to nsd-checkconf, so I’m not sure if that’s worth it.

I tested locally with Prometheus 3.11.2 that it can still parse the /metrics response, and it can.

ruuda added 4 commits May 2, 2026 16:22
I am going to move the zonestats zone name out of the metric name and
into a label. As a preparation for that, introduce a helper struct for
formatting and printing metrics, to make it a bit easier to handle a
variable number of labels from print_stat_block.

This commit changes only what is needed for the zonestats, there are
still metrics in metrics_print_stats that format the metrics directly.
These metrics don't need to be parametrized over the zone, so I am
leaving them alone in this commit to keep the changes reviewable and
testable individually. I'll change this in a follow-up commit for
consistency.
For consistency, this uses the new metrics helper for all remaining
metrics. This removes print_longnum in favor of a PRIu64 format string.
I don't know why previously long numbers needed to be split, PRIu64 is
used in the codebase already (in one place, though PRIu32 and others
have more usage). Maybe because that code (at least the original in
remote.c) is very old?

Tested by querying the metrics from a binary built against master, vs. a
binary built from this commit, and then diffing the output. There was no
difference except in nsd_time_elapsed_seconds, which is expected. Most
counters were 0 though so this test would not catch accidentally wiring
up the wrong field.
This makes metrics more useful and more discoverable, for example you
can use a regex filter to filter on particular zones, you can use `sum`
etc., and graphing multiple zones in a single query becomes much easier.
As an additional benefit, dot is now an allowed character in zonestats
names.

Fixes NLnetLabs#482.
This disallows 'zonestats' values which contains characters that need
to be escaped to be a valid Prometheus label name, which are anyway not
characters that you'd want to use in zone names.

This does not yet validate in nsd-checkconf, which is a shame.
Validating properly would be a bit tricky because the zonestat
name can be a template, so we'd need to evaluate that first.
@mozzieongit mozzieongit self-assigned this May 11, 2026
@mozzieongit

Copy link
Copy Markdown
Member

Thanks for the PR. I'll look at it the during the coming weeks, but cannot promise a timeline due to upcoming events

@mozzieongit mozzieongit left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Just the one note about the disallowed characters.

I don't know why previously long numbers needed to be split, PRIu64 is
used in the codebase already (in one place, though PRIu32 and others
have more usage). Maybe because that code (at least the original in
remote.c) is very old?

Could be. I don't know either, but indeed replicated the behaviour from remote.c

Comment thread options.c
Comment on lines +2951 to +2957

const char* statname_error = metrics_validate_label_value(statname);
if (statname_error) {
log_msg(LOG_ERR, "invalid zonestats name \"%s\": %s",
statname, statname_error);
exit(1);
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the appeal to verify on startup, but I don't think that it is a good idea. It will break existing configurations using zonestats when updating NSD even if they don't use Prometheus actively.

I would prefer to have the replacement of disallowed strings over the exit as before. However, you could still log a warning on startup when the zonestats name contains a disallowed string (and that is being replaced in the metrics output).

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I implemented both in 910fa77.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does now log the warning twice, I suppose that happens because that code path is somehow called twice.

Such zonestat names were allowed in the past, and although the
disallowed characters are unlikely to be part of zonestat names anyway,
people who used would no longer be able to start the new version. So
instead of failing loudly, rewrite the name at runtime. To not make this
pass silently, do log a warning at startup when this rewrite applies.
@ruuda ruuda requested a review from mozzieongit May 28, 2026 17:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants