Skip to content

Fix race in wireguard metrics collection.#12781

Open
fasaxc wants to merge 2 commits into
projectcalico:masterfrom
fasaxc:fix-wg-metrics
Open

Fix race in wireguard metrics collection.#12781
fasaxc wants to merge 2 commits into
projectcalico:masterfrom
fasaxc:fix-wg-metrics

Conversation

@fasaxc
Copy link
Copy Markdown
Member

@fasaxc fasaxc commented May 15, 2026

Prometheus can call Gather() and Collect() concurrently. Fix is different for each:

  • Gather() was incorrect: it returned the current list of metrics, but that was dynamic. For a dynamic collector, the Collector interface specs that we should return no metrics instead.

  • Collect() was concurrently updating the rate limit fields and, when the rate limit fired, it incorrectly returned no metrics at all. Cache the most recent metrics and re-use those if the rate limit fires instead.

Drive-by cleanups: remove unused fields, address missing err check.

Race detector hit:

WARNING: DATA RACE
Write at 0x00c0012b5240 by goroutine 894:
  github.com/projectcalico/calico/felix/wireguard.(*Metrics).refreshStats()
      /go/src/github.com/projectcalico/calico/felix/wireguard/metrics_linux.go:192 +0x664
  github.com/projectcalico/calico/felix/wireguard.(*Metrics).Collect()
      /go/src/github.com/projectcalico/calico/felix/wireguard/metrics_linux.go:111 +0x33
  github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
      /go/pkg/mod/github.com/prometheus/client_golang@v1.23.2/prometheus/registry.go:458 +0x10a

Previous read at 0x00c0012b5240 by goroutine 899:
  github.com/projectcalico/calico/felix/wireguard.(*Metrics).refreshStats()
      /go/src/github.com/projectcalico/calico/felix/wireguard/metrics_linux.go:176 +0x59
  github.com/projectcalico/calico/felix/wireguard.(*Metrics).Collect()
      /go/src/github.com/projectcalico/calico/felix/wireguard/metrics_linux.go:111 +0x33
  github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
      /go/pkg/mod/github.com/prometheus/client_golang@v1.23.2/prometheus/registry.go:458 +0x10a

Release note:

Fix a race in wireguard metrics collection that could result in spuriously missing metrics when metrics were collected faster than the rate limit.

Prometheus can call Gather() and Collect() concurrently. Fix is
different for each:

- Gather() was incorrect: it returned the current list of metrics,
  but that was dynamic.  For a dynamic collector, the Collector
  interface specs that we should return no metrics instead.

- Collect() was concurrently updating the rate limit fields and, when
  the rate limit fired, it incorrectly returned no metrics at all.
  Cache the most recent metrics and re-use those if the rate limit
  fires instead.

Crive-by cleanups: remove unused fields, address missing err check.
Copilot AI review requested due to automatic review settings May 15, 2026 14:13
@fasaxc fasaxc requested a review from a team as a code owner May 15, 2026 14:13
@marvin-tigera marvin-tigera added this to the Calico v3.33.0 milestone May 15, 2026
@marvin-tigera marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels May 15, 2026
@fasaxc fasaxc added docs-not-required Docs not required for this change and removed docs-pr-required Change is not yet documented labels May 15, 2026
@marvin-tigera marvin-tigera added docs-pr-required Change is not yet documented and removed docs-not-required Docs not required for this change labels May 15, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a concurrency bug in the wireguard Prometheus metrics collector where Gather() and Collect() could race, and where rate-limited collections returned no metrics at all.

Changes:

  • Make Describe() send nothing (marking the collector as "unchecked") since the collector is dynamic.
  • Add a mutex around Collect(), cache the most recent metrics, and return the cache when within the rate-limit interval instead of returning nothing.
  • Drive-by: remove unused peerRx/peerTx fields and add a missing error check on wgClient.Close().

- Remove Unregister() test; can't unregister a dynamic collector.
- Fix rate limiter test to expect cached results.
@fasaxc fasaxc added docs-not-required Docs not required for this change and removed docs-pr-required Change is not yet documented labels May 15, 2026
Copy link
Copy Markdown
Member

@electricjesus electricjesus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs-not-required Docs not required for this change release-note-required Change has user-facing impact (no matter how small)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants