
Remote write queue contention causing performance bottlenecks #17277

@madaraszg-tulip

Description

What did you do?

We are using remote_write to Mimir, currently with RW2.0. Baseline performance is around 500k samples/sec, usually running with 4 shards (our min_shards setting). Sometimes the queue gets backed up, and even significant upsharding to 50 barely increases throughput; at best we saw 650k samples/sec. To test the maximum throughput, I restarted the remote_write queue by changing its parameters; it then picked up 5-10 minutes of backlog while rereading the WAL, so it had a lot to churn through.
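For reference, the samples/sec figures quoted here can be read off the remote write counters. Below is a minimal sketch of one way to query that rate over the HTTP API; the address and the 5m window are placeholders, not necessarily what we used:

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Placeholder address; point it at the Prometheus instance doing the remote write.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Per-second rate of samples sent over remote write, summed across shards.
	query := `sum(rate(prometheus_remote_storage_samples_total[5m]))`
	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Printf("samples/sec: %v\n", result)
}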

What did you expect to see?

I would expect near-linear growth in samples/sec with the number of shards, provided we are not capped on CPU or network bandwidth. CPU usage grew a little, but came nowhere near what is available to the process. Network usage is much higher with RW1.0, so we can be certain RW2.0 is not network-capped. To make sure we are using multiple TCP connections, we disabled HTTP/2.

What did you see instead? Under which circumstances?

Originally we were running with this config

  queue_config:
    capacity: 30000
    max_shards: 50
    min_shards: 4
    max_samples_per_send: 10000
    batch_send_deadline: 5s
    min_backoff: 30ms
    max_backoff: 5s

Max throughput peaked at 570k samples/sec.

I started to suspect contention between the shards on the WAL because of this part of the remote_write documentation:

capacity

Capacity controls how many samples are queued in memory per shard before blocking reading from the WAL. Once the WAL is blocked, samples cannot be appended to any shards and all throughput will cease.
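To make that behaviour concrete, here is a minimal, simplified model, not the actual Prometheus code (routing is round-robin purely for illustration): one WAL reader fanning out to bounded per-shard queues, where a single full queue stalls the reader and starves every other shard.

package main

import (
	"fmt"
	"time"
)

// sample stands in for one (ref, timestamp, value) triple read from the WAL.
type sample int

func main() {
	const (
		numShards = 4
		capacity  = 8 // per-shard buffer, analogous to queue_config.capacity
	)

	queues := make([]chan sample, numShards)
	for i := range queues {
		queues[i] = make(chan sample, capacity)
		go func(id int, q <-chan sample) {
			for s := range q {
				if id == 0 {
					time.Sleep(50 * time.Millisecond) // shard 0 talks to a slow remote
				}
				_ = s
			}
		}(i, queues[i])
	}

	// The single reader: each sample goes to one shard with a blocking send.
	start := time.Now()
	for s := sample(0); s < 200; s++ {
		queues[int(s)%numShards] <- s // blocks as soon as the target shard is full
	}
	fmt.Printf("enqueued 200 samples in %v (throttled by the slowest shard)\n", time.Since(start))
}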

So I increased capacity to 10x max_samples_per_send and that increased our performance:

  queue_config:
    capacity: 100000
    max_shards: 50
    min_shards: 4
    max_samples_per_send: 10000
    batch_send_deadline: 5s
    min_backoff: 30ms
    max_backoff: 5s

Max throughput with RW2.0 was around 610k samples/sec.

To cross-check, I tested the same queue config with RW1.0, and it peaked at 880k samples/sec.

I tried reducing the batch size while keeping the increased capacity:

  queue_config:
    capacity: 100000
    max_shards: 50
    min_shards: 4
    max_samples_per_send: 5000
    batch_send_deadline: 5s
    min_backoff: 30ms
    max_backoff: 5s

RW2.0: 520k samples/sec

I increased both the batch size and the capacity:

  queue_config:
    capacity: 200000
    max_shards: 50
    min_shards: 4
    max_samples_per_send: 20000
    batch_send_deadline: 5s
    min_backoff: 30ms
    max_backoff: 5s

RW2.0: 580k samples/sec

Keeping the increased capacity, I went back to the original batch size:

  queue_config:
    capacity: 200000
    max_shards: 50
    min_shards: 4
    max_samples_per_send: 10000
    batch_send_deadline: 5s
    min_backoff: 30ms
    max_backoff: 5s

RW2.0: 610k samples/sec
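As a back-of-the-envelope check on these numbers (using only the figures reported above, and assuming all 50 shards were active at the peak), each shard ships on the order of one 10k batch per second, so the senders themselves should be mostly idle and the limit would have to sit in front of them:

package main

import "fmt"

func main() {
	const (
		observedSamplesPerSec = 610_000.0 // best RW2.0 throughput seen above
		shards                = 50.0      // max_shards reached during the test
		maxSamplesPerSend     = 10_000.0  // batch size
		capacity              = 200_000.0 // per-shard queue capacity
	)

	// Batches each shard sends per second at the observed aggregate rate.
	fmt.Printf("per-shard send rate: %.2f batches/sec\n",
		observedSamplesPerSec/shards/maxSamplesPerSend) // ~1.22
	// Full batches one shard can buffer before the WAL reader blocks.
	fmt.Printf("batches buffered per shard before blocking: %.0f\n",
		capacity/maxSamplesPerSend) // 20
}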

It looks like there is serious contention between the shards when reading the WAL. It is present with RW1.0 as well, but due to the increased CPU usage of RW2.0 the effect is bigger there.

When we run multiple shards, do we run them in different threads?

System information

Linux 6.8.0-1019-aws aarch64

Prometheus version

prometheus, version 3.5.0 (branch: HEAD, revision: 8be3a9560fbdd18a94dedec4b747c35178177202)
  build user:       root@4451b64cb451
  build date:       20250714-16:18:17
  go version:       go1.24.5
  platform:         linux/arm64
  tags:             netgo,builtinassets

Prometheus configuration file

remote_write:
- url: https://redacted/api/v1/push
  remote_timeout: 30s
  write_relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: redacted
    replacement: $1
    action: keep
  protobuf_message: io.prometheus.write.v2.Request
  follow_redirects: true
  enable_http2: false
  queue_config:
    capacity: 200000
    max_shards: 50
    min_shards: 4
    max_samples_per_send: 10000
    batch_send_deadline: 5s
    min_backoff: 30ms
    max_backoff: 5s
  metadata_config:
    send: true
    send_interval: 1m
    max_samples_per_send: 2000

Alertmanager version


Alertmanager configuration file

Logs

