Problem
Summary
When sending high-cardinality counter metrics through Vector's StatsD source to the Datadog metrics sink, we observe consistent 30-70% data loss compared to the same metrics sent through the native Datadog Agent.
What happens
- Counter metrics with high cardinality tags (~1000 unique values) show only 30-50% of expected values in Datadog
- The same application code sending to DD Agent on a different UDP port shows correct values
- No errors in Vector logs, no 429s from Datadog API (always 202)
- Low-cardinality counters appear accurate
- Distributions/histograms work correctly
Expected behavior
Counter values should match between Vector pipeline and native DD Agent ingestion paths.
Reproduction steps
- Run Vector with minimal StatsD → Datadog config (see Configuration below)
- Send high-cardinality counter metrics via the StatsD protocol to port 8125 (a sender sketch follows this list)
- Compare values in Datadog with the same metrics sent to DD Agent
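For convenience, here is a rough Python sender sketch (a hypothetical helper, not our production code; the real traffic comes from a Java StatsD client). It emits the same kind of high-cardinality counters shown under Example Data below and prints the true total sent, so either port (8125 for Vector, 8126 for the DD Agent) can be compared against a known value:

```python
# reproduce_sender.py -- hypothetical helper for this report, not production code.
# Sends high-cardinality StatsD counters to a target port and prints the true
# total, so the value reported in Datadog can be compared against it.
import random
import socket
import sys
import time

TARGET_PORT = int(sys.argv[1]) if len(sys.argv) > 1 else 8125  # 8125 = Vector, 8126 = DD Agent
N_CONFIG_IDS = 1000   # ~1000 unique config_id values, as in our workload
ITERATIONS = 10       # passes over the tag space

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
total_sent = 0

for _ in range(ITERATIONS):
    for i in range(N_CONFIG_IDS):
        tags = (
            f"config_id:config_{i:03d},"
            f"region:us-east-1,"
            f"cache_hit:{str(random.random() < 0.5).lower()}"
        )
        payload = f"agent.retrieval.redis.count:1|c|#{tags}"
        sock.sendto(payload.encode("utf-8"), ("127.0.0.1", TARGET_PORT))
        total_sent += 1
    time.sleep(1)  # spread sends across several StatsD flush windows

print(f"sent {total_sent} increments to port {TARGET_PORT}")
```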
Example comparison
Same Java application, same metric; the only difference is the UDP port:
- DD Agent (port 8126): `sum:metric.count{*}.as_count()` = ~10.5k ✓
- Vector StatsD (port 8125): `sum:metric.count{*}.as_count()` = ~5.5k ✗ (~50% loss)
What we've ruled out
- Datadog API throttling (no 429s, always 202)
- Batch size issues (tested 2MB-10MB, 500-1000 events)
- Concurrency (tested with concurrency=1)
- Network drops (Vector internal metrics show no drops)
- Upstream issues (also tested with OTel StatsD receiver → Vector OTLP source - same loss)
Configuration
```yaml
# Minimal Vector config: StatsD → Datadog
# Purpose: Reproduce counter metric drops for bug report
# Usage: vector --config statsd-to-datadog-minimal.yaml
data_dir: "/tmp/vector-data"

sources:
  statsd:
    type: statsd
    address: "0.0.0.0:8125"
    mode: udp

sinks:
  datadog:
    type: datadog_metrics
    inputs: ["statsd"]
    default_api_key: "${DD_API_KEY}"
```
Version
timberio/vector:0.51.1-debian
Debug Output
No errors in debug output. Vector reports successful sends:
- `component_sent_events_total` increases normally
- `component_sent_event_bytes_total` increases normally
- No `component_errors_total` increments
- Datadog API returns 202 for all requests
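One way to spot-check those counters is to scrape them directly. The sketch below is only illustrative: it assumes an `internal_metrics` source wired to a `prometheus_exporter` sink on 127.0.0.1:9598, which is not part of the minimal config above:

```python
# check_internal_metrics.py -- hypothetical helper; assumes Vector's internal
# metrics are exposed via a prometheus_exporter sink on 127.0.0.1:9598.
# Prints the counters referenced above so they can be compared with the number
# of increments actually sent.
from urllib.request import urlopen

WANTED = (
    "component_sent_events_total",
    "component_sent_event_bytes_total",
    "component_errors_total",
)

body = urlopen("http://127.0.0.1:9598/metrics").read().decode("utf-8")
for line in body.splitlines():
    # Skip Prometheus comment/help lines, keep only the counters we care about.
    if not line.startswith("#") and any(name in line for name in WANTED):
        print(line)
```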
Example Data
StatsD input format
Sending counters like:
```
agent.retrieval.redis.count:1|c|#config_id:config_001,region:us-east-1,cache_hit:true
agent.retrieval.redis.count:1|c|#config_id:config_002,region:us-east-1,cache_hit:false
... (~1000 unique config_id values)
```
Observed in Datadog
Query: `sum:agent.retrieval.redis.count{*} by {region,cache_hit}.as_count()`
- Expected (based on DD Agent baseline): ~10,500
- Actual (through Vector): ~5,500
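For reference, the comparison can also be pulled programmatically through Datadog's v1 metrics query API. This is only a sketch: `DD_APP_KEY` is an assumed environment variable for the application key, and summing every returned point is an approximation, but the same approximation applies to both ingestion paths:

```python
# compare_in_datadog.py -- hypothetical helper for pulling the query result.
# Assumes DD_API_KEY and DD_APP_KEY are set in the environment.
import json
import os
import time
import urllib.parse
import urllib.request

QUERY = "sum:agent.retrieval.redis.count{*} by {region,cache_hit}.as_count()"
now = int(time.time())
params = urllib.parse.urlencode({"from": now - 3600, "to": now, "query": QUERY})

req = urllib.request.Request(
    f"https://api.datadoghq.com/api/v1/query?{params}",
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    },
)
data = json.loads(urllib.request.urlopen(req).read())

# Sum every returned point so the result can be compared with the sender's total.
total = sum(
    point[1]
    for series in data.get("series", [])
    for point in series.get("pointlist", [])
    if point[1] is not None
)
print(f"total over the last hour: {total:.0f}")
```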
Additional Context
Environment
- Running in AWS ECS as a sidecar container
- Multiple pods (4-100) sending metrics concurrently
- Each pod sends to its own Vector sidecar
- No unusual command line options
Key observations
- Low-cardinality counters work fine - only high-cardinality metrics show loss
- Distributions/histograms are accurate - only counters affected
- Loss is consistent - always 30-70%, not random/intermittent
- No visible errors - makes debugging difficult
Possibly related
This may be related to how Vector's StatsD source aggregates metrics before sending them to Datadog (see the toy sketch after this list), particularly around:
- Flush interval alignment
- Tag cardinality handling
- Delta-to-rate conversion for the Datadog API
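To make the suspicion concrete, here is a toy Python model of the kind of failure mode that would fit the symptoms. This is explicitly not Vector's code; it only shows how a last-write-wins merge of delta counters sharing the same series and timestamp would silently undercount, with the lost fraction depending on how many increments land in the same flush window:

```python
# toy_collision.py -- NOT Vector's code; a toy model of the suspected failure mode.
# If delta counter samples that share (metric, tags, timestamp) are merged with
# last-write-wins instead of being summed, the reported total silently shrinks by
# however many increments collided -- and no error is ever raised.

increments = [
    ("agent.retrieval.redis.count", f"config_id:config_{i % 1000:03d}", 1)
    for i in range(10_500)
]

summed = {}      # correct behaviour: deltas for the same series add up
clobbered = {}   # suspected behaviour: a later delta replaces the earlier one
for name, tags, value in increments:
    key = (name, tags, 0)  # collapse everything into one flush timestamp to force collisions
    summed[key] = summed.get(key, 0) + value
    clobbered[key] = value

print("true total:     ", sum(summed.values()))     # 10500, matches the DD Agent baseline
print("clobbered total:", sum(clobbered.values()))  # 1000 -- one increment per unique tag set
```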
Workaround attempts (none worked)
- Adjusting batch sizes
- Adjusting flush intervals
- Adding explicit `host` tags to prevent aggregation
- Setting sink concurrency to 1
References
I am observing the same behavior with the OpenTelemetry Datadog exporter: open-telemetry/opentelemetry-collector-contrib#44907