High-cardinality delta Sum metrics show 30-70% data loss compared to native Datadog Agent #24386

@lightme16

Description

Problem

Summary

When sending high-cardinality counter metrics through Vector's StatsD source to the Datadog metrics sink, we observe consistent 30-70% data loss compared to the same metrics sent through the native Datadog Agent.

What happens

  • Counter metrics with high-cardinality tags (~1000 unique values) show only 30-50% of their expected sums in Datadog
  • The same application code sending to DD Agent on a different UDP port shows correct values
  • No errors in Vector logs, no 429s from Datadog API (always 202)
  • Low-cardinality counters appear accurate
  • Distributions/histograms work correctly

Expected behavior

Counter values should match between Vector pipeline and native DD Agent ingestion paths.

Reproduction steps

  1. Run Vector with minimal StatsD → Datadog config (see Configuration below)
  2. Send high-cardinality counter metrics via the StatsD protocol to port 8125 (a minimal sender sketch follows these steps)
  3. Compare values in Datadog with the same metrics sent to DD Agent
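For step 2, a minimal load-generator sketch in Python; the metric and tag names mirror the example data below, everything else (destination, send count, pacing) is hypothetical:

#!/usr/bin/env python3
# Minimal high-cardinality StatsD load generator (sketch).
# Point DEST at 8125 (Vector) or 8126 (DD Agent) to compare the two
# ingestion paths; metric/tag names mirror this report.
import random
import socket
import time

DEST = ("127.0.0.1", 8125)  # Vector StatsD source; 8126 for the DD Agent
N_CONFIG_IDS = 1000         # ~1000 unique config_id values

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

for _ in range(10_000):
    config_id = f"config_{random.randrange(N_CONFIG_IDS):03d}"
    cache_hit = random.choice(["true", "false"])
    payload = (
        f"agent.retrieval.redis.count:1|c"
        f"|#config_id:{config_id},region:us-east-1,cache_hit:{cache_hit}"
    )
    sock.sendto(payload.encode("ascii"), DEST)
    time.sleep(0.001)  # pace sends so UDP buffer pressure is not a factor

With exactly 10,000 sends, the expected sum:agent.retrieval.redis.count{*}.as_count() over the run is 10,000, which makes the loss measurable without needing the DD Agent baseline.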

Example comparison

Same Java application, same metric; the only difference is the target UDP port (a query sketch to reproduce this comparison follows the numbers):

  • DD Agent (port 8126): sum:metric.count{*}.as_count() = ~10.5k ✓
  • Vector StatsD (port 8125): sum:metric.count{*}.as_count() = ~5.5k ✗ (~50% loss)
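The two totals can be pulled programmatically from Datadog's v1 timeseries query endpoint. A sketch, assuming a hypothetical pipeline tag distinguishes the two ingestion paths and that DD_API_KEY/DD_APP_KEY are set in the environment:

#!/usr/bin/env python3
# Compare the same counter sum across the two ingestion paths via
# Datadog's v1 query API (requires an API key and an application key).
# The pipeline:agent / pipeline:vector tags are placeholders; adjust
# the queries to whatever actually distinguishes the two paths.
import os
import time
import requests

API = "https://api.datadoghq.com/api/v1/query"
HEADERS = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
}

def total(query: str, window_s: int = 3600) -> float:
    now = int(time.time())
    resp = requests.get(
        API,
        headers=HEADERS,
        params={"from": now - window_s, "to": now, "query": query},
    )
    resp.raise_for_status()
    series = resp.json().get("series", [])
    # Sum every datapoint value across all returned series.
    return sum(p[1] for s in series for p in s["pointlist"] if p[1] is not None)

agent = total("sum:metric.count{pipeline:agent}.as_count()")
vector = total("sum:metric.count{pipeline:vector}.as_count()")
print(f"agent={agent:.0f} vector={vector:.0f} loss={1 - vector / agent:.0%}")

Summing every point of an .as_count() query over a fixed window yields one comparable total per path.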

What we've ruled out

  • Datadog API throttling (no 429s, always 202)
  • Batch size issues (tested 2MB-10MB, 500-1000 events)
  • Concurrency (tested with concurrency=1)
  • Network drops (Vector internal metrics show no drops)
  • Upstream issues (also tested with OTel StatsD receiver → Vector OTLP source - same loss)

Configuration

# Minimal Vector config: StatsD → Datadog
# Purpose: Reproduce counter metric drops for bug report
# Usage: vector --config statsd-to-datadog-minimal.yaml

data_dir: "/tmp/vector-data"

sources:
  statsd:
    type: statsd
    address: "0.0.0.0:8125"
    mode: udp

sinks:
  datadog:
    type: datadog_metrics
    inputs: ["statsd"]
    default_api_key: "${DD_API_KEY}"

Version

timberio/vector:0.51.1-debian

Debug Output

No errors in debug output. Vector reports successful sends (a cross-check sketch follows this list):
  - `component_sent_events_total` increases normally
  - `component_sent_event_bytes_total` increases normally
  - No `component_errors_total` increments
  - Datadog API returns 202 for all requests
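These counters can be cross-checked from outside the Datadog UI by adding an internal_metrics source feeding a prometheus_exporter sink to the config (its default listen address is 0.0.0.0:9598) and scraping it. A sketch; the component_id value is assumed to match the sink name datadog from the config above, and the vector_ prefix is the exporter's default namespace:

#!/usr/bin/env python3
# Scrape Vector's Prometheus exporter and total the events the
# datadog sink reports as sent. Assumes an internal_metrics source ->
# prometheus_exporter sink has been added to the config (default
# address 0.0.0.0:9598).
import urllib.request

METRICS_URL = "http://127.0.0.1:9598/metrics"

body = urllib.request.urlopen(METRICS_URL).read().decode()
total = 0.0
for line in body.splitlines():
    # Sample line:
    # vector_component_sent_events_total{component_id="datadog",...} 1234
    if (
        line.startswith("vector_component_sent_events_total")
        and 'component_id="datadog"' in line
    ):
        total += float(line.rsplit(" ", 1)[1])
print(f"events sent by datadog sink: {total:.0f}")

Note that the StatsD source aggregates increments into per-flush counter events, so this counts events rather than raw increments; it still shows whether Vector itself believes everything was delivered.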

Example Data

StatsD input format

Sending counters like:
agent.retrieval.redis.count:1|c|#config_id:config_001,region:us-east-1,cache_hit:true
agent.retrieval.redis.count:1|c|#config_id:config_002,region:us-east-1,cache_hit:false
... (~1000 unique config_id values)

Observed in Datadog

Query: sum:agent.retrieval.redis.count{*} by {region,cache_hit}.as_count()

Expected (based on DD Agent baseline): ~10,500
Actual (through Vector): ~5,500

Additional Context

Environment

  • Running in AWS ECS as a sidecar container
  • Multiple ECS tasks (4-100) sending metrics concurrently
  • Each task sends to its own Vector sidecar
  • No unusual command line options

Key observations

  1. Low-cardinality counters work fine - only high-cardinality metrics show loss
  2. Distributions/histograms are accurate - only counters affected
  3. Loss is consistent - always 30-70%, not random/intermittent
  4. No visible errors - makes debugging difficult

Possibly related

This may be related to how Vector's StatsD source aggregates metrics before sending to Datadog, particularly around the points below (a numeric sketch of the last point follows the list):

  • Flush interval alignment
  • Tag cardinality handling
  • Delta-to-rate conversion for the Datadog API
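On the last point, a back-of-the-envelope sketch of how an interval mismatch in delta-to-rate conversion would produce exactly this kind of proportional loss. All numbers here are hypothetical, and whether Vector actually mismatches these intervals is unverified:

# Hypothetical numbers only: as_count() reconstructs a count from a
# submitted rate as rate * interval, so if the interval attached to a
# point differs from the flush window the delta was computed over,
# the reconstructed count scales by interval_tagged / flush_window.
delta = 100            # increments accumulated over one flush window
flush_window = 10.0    # seconds the delta actually covers
interval_tagged = 5.0  # interval attached to the submitted point

rate = delta / flush_window        # value submitted to the API
as_count = rate * interval_tagged  # what as_count() reconstructs
print(f"as_count={as_count:.0f} of {delta} ({1 - as_count / delta:.0%} loss)")
# -> as_count=50 of 100 (50% loss), in the range observed here

A loss that tracks the ratio of two fixed intervals would also explain why it is consistent rather than intermittent, though it does not by itself explain the cardinality dependence.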

Workaround attempts (none worked)

  • Adjusting batch sizes
  • Adjusting flush intervals
  • Adding explicit host tags to prevent aggregation
  • Setting sink concurrency to 1

References

I am observing the same behavior with the OpenTelemetry Datadog exporter. open-telemetry/opentelemetry-collector-contrib#44907
