What did you do?
We are using remote_write to Mimir, currently with RW2.0. Baseline throughput is around 500k samples/sec, usually running with 4 shards (that's our min_shards setting). Sometimes the queue gets backed up, and even significant upsharding to 50 barely increases throughput; at best we saw 650k samples/sec. To test the maximum throughput I restarted the remote_write queue by changing its parameters: it usually picked up 5-10 minutes of backlog while rereading the WAL, so it had a lot to churn through.
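Some back-of-the-envelope numbers for reference (my own arithmetic, assuming each shard sends one batch at a time): 500k samples/sec across 4 shards is about 125k samples/sec per shard, which at max_samples_per_send: 10000 is roughly 12.5 requests per second per shard, i.e. around 80ms per request round trip.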
What did you expect to see?
I would expect near-linear growth in samples/sec with the number of shards, provided we are not capped by CPU or network bandwidth. CPU usage grew a little, but came nowhere near what is available to the process. Network usage is much higher with RW1.0, so we can be certain RW2.0 is not network-bound. To make sure we are using multiple TCP connections, we disabled HTTP/2.
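For context, this is roughly what disabling HTTP/2 amounts to on the Go client side (a minimal sketch of my own, not the actual Prometheus client construction): with HTTP/2 all requests would be multiplexed as streams over a single TCP connection, while with HTTP/1.1 each in-flight request gets its own pooled connection.

```go
package example

import (
	"crypto/tls"
	"net/http"
	"time"
)

// Minimal sketch: build an HTTP client with HTTP/2 disabled so that
// concurrent shards use separate TCP connections instead of multiplexed
// streams. The fields are standard net/http; the values are illustrative.
func newHTTP1Client() *http.Client {
	transport := &http.Transport{
		// A non-nil, empty TLSNextProto map disables the HTTP/2 upgrade.
		TLSNextProto:        map[string]func(string, *tls.Conn) http.RoundTripper{},
		MaxIdleConnsPerHost: 50, // room for one pooled connection per shard
	}
	return &http.Client{Transport: transport, Timeout: 30 * time.Second}
}
```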
What did you see instead? Under which circumstances?
Originally we were running with this config
queue_config:
  capacity: 30000
  max_shards: 50
  min_shards: 4
  max_samples_per_send: 10000
  batch_send_deadline: 5s
  min_backoff: 30ms
  max_backoff: 5s
Max throughput peaked at 570k samples/sec.
I started suspecting contention between the shards on the WAL, because of this part of the remote_write documentation:
capacity
Capacity controls how many samples are queued in memory per shard before blocking reading from the WAL. Once the WAL is blocked, samples cannot be appended to any shards and all throughput will cease.
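To make that concrete, here is a minimal sketch (my own model, not the actual Prometheus code) of the behaviour the documentation describes: a single WAL reader fans samples out to per-shard bounded queues, and one full queue stalls the reader, and therefore every shard.

```go
package example

// sample stands in for a WAL sample; ref picks the shard.
type sample struct {
	ref   uint64
	value float64
}

// shards holds one bounded queue per shard; the queue length plays the
// role of the capacity setting.
type shards struct {
	queues []chan sample
}

func newShards(n, capacity int) *shards {
	s := &shards{queues: make([]chan sample, n)}
	for i := range s.queues {
		s.queues[i] = make(chan sample, capacity)
	}
	return s
}

// enqueue is called by the single WAL reader. The channel send blocks as
// soon as the target shard's queue is at capacity, which is the "all
// throughput will cease" case from the documentation.
func (s *shards) enqueue(smp sample) {
	s.queues[smp.ref%uint64(len(s.queues))] <- smp
}
```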
So I increased capacity to 10x max_samples_per_send, which improved our throughput:
queue_config:
  capacity: 100000
  max_shards: 50
  min_shards: 4
  max_samples_per_send: 10000
  batch_send_deadline: 5s
  min_backoff: 30ms
  max_backoff: 5s
Max throughput with RW2.0 was around 610k samples/sec
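For scale, my own arithmetic on these settings: capacity: 100000 with max_samples_per_send: 10000 allows up to 10 full batches to be buffered per shard, so with 50 shards up to 5M samples can sit in memory before the WAL reader blocks, versus 1.5M with the original capacity: 30000.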
To cross-check, I tested the same queue config with RW1.0, and it peaked at 880k samples/sec.
Tried reducing the batch size while keeping capacity:
queue_config:
  capacity: 100000
  max_shards: 50
  min_shards: 4
  max_samples_per_send: 5000
  batch_send_deadline: 5s
  min_backoff: 30ms
  max_backoff: 5s
RW2.0: 520k samples/sec
Increased both batch size and capacity:
queue_config:
  capacity: 200000
  max_shards: 50
  min_shards: 4
  max_samples_per_send: 20000
  batch_send_deadline: 5s
  min_backoff: 30ms
  max_backoff: 5s
RW2.0: 580k samples/sec
Kept the increased capacity, but went back to the original batch size:
queue_config:
  capacity: 200000
  max_shards: 50
  min_shards: 4
  max_samples_per_send: 10000
  batch_send_deadline: 5s
  min_backoff: 30ms
  max_backoff: 5s
RW2.0: 610k samples/sec
It looks like there is some serious contention between the shards reading the WAL. It is present with RW1.0 as well, but due to the higher CPU usage of RW2.0 the effect is bigger there.
When we run multiple shards, do we run them in different threads?
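My mental model, continuing the sketch above and reusing its shards/sample types (please correct me if the real implementation differs): one goroutine per shard, each draining its own queue and sending one batch at a time, so per-shard throughput is bounded by batch size divided by request latency.

```go
package example

import "time"

// runShard is a sketch of what I imagine each shard does: accumulate up
// to maxSamplesPerSend samples from its queue, send them as one request,
// and flush a partial batch when the deadline expires. While send is in
// flight, nothing else is read from this shard's queue.
func (s *shards) runShard(i, maxSamplesPerSend int, deadline time.Duration) {
	batch := make([]sample, 0, maxSamplesPerSend)
	timer := time.NewTimer(deadline)
	defer timer.Stop()
	for {
		select {
		case smp := <-s.queues[i]:
			batch = append(batch, smp)
			if len(batch) == maxSamplesPerSend {
				s.send(batch)
				batch = batch[:0]
			}
		case <-timer.C:
			if len(batch) > 0 {
				s.send(batch)
				batch = batch[:0]
			}
			timer.Reset(deadline)
		}
	}
}

// send stands in for the blocking remote-write HTTP request.
func (s *shards) send(batch []sample) { _ = batch }
```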
System information
Linux 6.8.0-1019-aws aarch64
Prometheus version
prometheus, version 3.5.0 (branch: HEAD, revision: 8be3a9560fbdd18a94dedec4b747c35178177202)
build user: root@4451b64cb451
build date: 20250714-16:18:17
go version: go1.24.5
platform: linux/arm64
tags: netgo,builtinassets
Prometheus configuration file
remote_write:
  - url: https://redacted/api/v1/push
    remote_timeout: 30s
    write_relabel_configs:
      - source_labels: [__name__]
        separator: ;
        regex: redacted
        replacement: $1
        action: keep
    protobuf_message: io.prometheus.write.v2.Request
    follow_redirects: true
    enable_http2: false
    queue_config:
      capacity: 200000
      max_shards: 50
      min_shards: 4
      max_samples_per_send: 10000
      batch_send_deadline: 5s
      min_backoff: 30ms
      max_backoff: 5s
    metadata_config:
      send: true
      send_interval: 1m
      max_samples_per_send: 2000
Alertmanager version
Alertmanager configuration file
Logs