Problem
Summary
When sending high-cardinality counter metrics through Vector's StatsD source to the Datadog metrics sink, we observe consistent 30-70% data loss compared to the same metrics sent through the native Datadog Agent.
What happens
- Counter metrics with high cardinality tags (~1000 unique values) show only 30-50% of expected values in Datadog
- The same application code sending to DD Agent on a different UDP port shows correct values
- No errors in Vector logs, no 429s from Datadog API (always 202)
- Low-cardinality counters appear accurate
- Distributions/histograms work correctly
Expected behavior
Counter values should match between Vector pipeline and native DD Agent ingestion paths.
Reproduction steps
- Run Vector with minimal StatsD → Datadog config (see Configuration below)
- Send high-cardinality counter metrics via the StatsD protocol to port 8125 (a sender sketch follows this list)
- Compare values in Datadog with the same metrics sent to DD Agent
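For convenience, here is a rough Python sender sketch (a hypothetical helper, not our production code; the real traffic comes from a Java StatsD client). It emits the same kind of high-cardinality counters shown under Example Data below and prints the true total sent, so either port (8125 for Vector, 8126 for the DD Agent) can be compared against a known value:

```python
# reproduce_sender.py -- hypothetical helper for this report, not production code.
# Sends high-cardinality StatsD counters to a target port and prints the true
# total, so the value reported in Datadog can be compared against it.
import random
import socket
import sys
import time

TARGET_PORT = int(sys.argv[1]) if len(sys.argv) > 1 else 8125  # 8125 = Vector, 8126 = DD Agent
N_CONFIG_IDS = 1000   # ~1000 unique config_id values, as in our workload
ITERATIONS = 10       # passes over the tag space

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
total_sent = 0

for _ in range(ITERATIONS):
    for i in range(N_CONFIG_IDS):
        tags = (
            f"config_id:config_{i:03d},"
            f"region:us-east-1,"
            f"cache_hit:{str(random.random() < 0.5).lower()}"
        )
        payload = f"agent.retrieval.redis.count:1|c|#{tags}"
        sock.sendto(payload.encode("utf-8"), ("127.0.0.1", TARGET_PORT))
        total_sent += 1
    time.sleep(1)  # spread sends across several StatsD flush windows

print(f"sent {total_sent} increments to port {TARGET_PORT}")
```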
Example comparison
Same Java application, same metric; the only difference is the UDP port:
- DD Agent (port 8126): `sum:metric.count{*}.as_count()` = ~10.5k ✓
- Vector StatsD (port 8125): `sum:metric.count{*}.as_count()` = ~5.5k ✗ (~50% loss)
What we've ruled out
- Datadog API throttling (no 429s, always 202)
- Batch size issues (tested 2MB-10MB, 500-1000 events)
- Concurrency (tested with concurrency=1)
- Network drops (Vector internal metrics show no drops)
- Upstream issues (also tested with OTel StatsD receiver → Vector OTLP source - same loss)
Configuration
```yaml
# Minimal Vector config: StatsD → Datadog
# Purpose: Reproduce counter metric drops for bug report
# Usage: vector --config statsd-to-datadog-minimal.yaml
data_dir: "/tmp/vector-data"

sources:
  statsd:
    type: statsd
    address: "0.0.0.0:8125"
    mode: udp

sinks:
  datadog:
    type: datadog_metrics
    inputs: ["statsd"]
    default_api_key: "${DD_API_KEY}"
```
Version
timberio/vector:0.51.1-debian
Debug Output
No errors in debug output. Vector reports successful sends:
- `component_sent_events_total` increases normally
- `component_sent_event_bytes_total` increases normally
- No `component_errors_total` increments
- Datadog API returns 202 for all requests
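One way to spot-check those counters is to scrape them directly. The sketch below is only illustrative: it assumes an `internal_metrics` source wired to a `prometheus_exporter` sink on 127.0.0.1:9598, which is not part of the minimal config above:

```python
# check_internal_metrics.py -- hypothetical helper; assumes Vector's internal
# metrics are exposed via a prometheus_exporter sink on 127.0.0.1:9598.
# Prints the counters referenced above so they can be compared with the number
# of increments actually sent.
from urllib.request import urlopen

WANTED = (
    "component_sent_events_total",
    "component_sent_event_bytes_total",
    "component_errors_total",
)

body = urlopen("http://127.0.0.1:9598/metrics").read().decode("utf-8")
for line in body.splitlines():
    # Skip Prometheus comment/help lines, keep only the counters we care about.
    if not line.startswith("#") and any(name in line for name in WANTED):
        print(line)
```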
Example Data
StatsD input format
Sending counters like:
```
agent.retrieval.redis.count:1|c|#config_id:config_001,region:us-east-1,cache_hit:true
agent.retrieval.redis.count:1|c|#config_id:config_002,region:us-east-1,cache_hit:false
... (~1000 unique config_id values)
```
Observed in Datadog
Query: `sum:agent.retrieval.redis.count{*} by {region,cache_hit}.as_count()`
- Expected (based on DD Agent baseline): ~10,500
- Actual (through Vector): ~5,500
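For reference, the comparison can also be pulled programmatically through Datadog's v1 metrics query API. This is only a sketch: `DD_APP_KEY` is an assumed environment variable for the application key, and summing every returned point is an approximation, but the same approximation applies to both ingestion paths:

```python
# compare_in_datadog.py -- hypothetical helper for pulling the query result.
# Assumes DD_API_KEY and DD_APP_KEY are set in the environment.
import json
import os
import time
import urllib.parse
import urllib.request

QUERY = "sum:agent.retrieval.redis.count{*} by {region,cache_hit}.as_count()"
now = int(time.time())
params = urllib.parse.urlencode({"from": now - 3600, "to": now, "query": QUERY})

req = urllib.request.Request(
    f"https://api.datadoghq.com/api/v1/query?{params}",
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    },
)
data = json.loads(urllib.request.urlopen(req).read())

# Sum every returned point so the result can be compared with the sender's total.
total = sum(
    point[1]
    for series in data.get("series", [])
    for point in series.get("pointlist", [])
    if point[1] is not None
)
print(f"total over the last hour: {total:.0f}")
```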
Additional Context
Environment
- Running in AWS ECS as a sidecar container
- Multiple pods (4-100) sending metrics concurrently
- Each pod sends to its own Vector sidecar
- No unusual command line options
Key observations
- Low-cardinality counters work fine - only high-cardinality metrics show loss
- Distributions/histograms are accurate - only counters affected
- Loss is consistent - always 30-70%, not random/intermittent
- No visible errors - makes debugging difficult
Possibly related
This may be related to how Vector's StatsD source aggregates metrics before sending them to Datadog (see the toy sketch after this list), particularly around:
- Flush interval alignment
- Tag cardinality handling
- Delta-to-rate conversion for the Datadog API
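To make the suspicion concrete, here is a toy Python model of the kind of failure mode that would fit the symptoms. This is explicitly not Vector's code; it only shows how a last-write-wins merge of delta counters sharing the same series and timestamp would silently undercount, with the lost fraction depending on how many increments land in the same flush window:

```python
# toy_collision.py -- NOT Vector's code; a toy model of the suspected failure mode.
# If delta counter samples that share (metric, tags, timestamp) are merged with
# last-write-wins instead of being summed, the reported total silently shrinks by
# however many increments collided -- and no error is ever raised.

increments = [
    ("agent.retrieval.redis.count", f"config_id:config_{i % 1000:03d}", 1)
    for i in range(10_500)
]

summed = {}      # correct behaviour: deltas for the same series add up
clobbered = {}   # suspected behaviour: a later delta replaces the earlier one
for name, tags, value in increments:
    key = (name, tags, 0)  # collapse everything into one flush timestamp to force collisions
    summed[key] = summed.get(key, 0) + value
    clobbered[key] = value

print("true total:     ", sum(summed.values()))     # 10500, matches the DD Agent baseline
print("clobbered total:", sum(clobbered.values()))  # 1000 -- one increment per unique tag set
```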
Workaround attempts (none worked)
- Adjusting batch sizes
- Adjusting flush intervals
- Adding explicit `host` tags to prevent aggregation
- Setting sink concurrency to 1
References
I am observing the same behavior with the OpenTelemetry Datadog exporter: open-telemetry/opentelemetry-collector-contrib#44907