
Remote write queue contention causing performance bottlenecks #17277

@madaraszg-tulip

Description

What did you do?

We are using remote_write to Mimir, currently with RW2.0. Baseline performance is around 500k samples/sec, usually running with 4 shards (our min_shards setting). Sometimes the queue gets backed up, and even significant upsharding to 50 barely increases throughput; at best we saw 650k samples/sec. To test the maximum throughput, I restarted the remote_write queue by changing its parameters; it then picked up 5-10 minutes of backlog while rereading the WAL, so it had a lot to churn through.
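For reference, the samples/sec figures quoted here can be read off the remote write counters. Below is a minimal sketch of one way to query that rate over the HTTP API; the address and the 5m window are placeholders, not necessarily what we used:

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Placeholder address; point it at the Prometheus instance doing the remote write.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Per-second rate of samples sent over remote write, summed across shards.
	query := `sum(rate(prometheus_remote_storage_samples_total[5m]))`
	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Printf("samples/sec: %v\n", result)
}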

What did you expect to see?

I would expect near-linear growth in samples/sec with the number of shards, provided we are not capped on CPU or network bandwidth. CPU usage grew a little, but came nowhere near what is available to the process. Network usage is much higher with RW1.0, so we can be certain RW2.0 is not network-capped. To make sure we are using multiple TCP connections, we disabled HTTP/2.

What did you see instead? Under which circumstances?

Originally we were running with this config

  queue_config:
    capacity: 30000
    max_shards: 50
    min_shards: 4
    max_samples_per_send: 10000
    batch_send_deadline: 5s
    min_backoff: 30ms
    max_backoff: 5s

Max throughput peaked at 570k samples/sec.

I started to suspect contention between the shards on the WAL because of this part of the remote_write documentation:

capacity

Capacity controls how many samples are queued in memory per shard before blocking reading from the WAL. Once the WAL is blocked, samples cannot be appended to any shards and all throughput will cease.
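To make that behaviour concrete, here is a minimal, simplified model, not the actual Prometheus code (routing is round-robin purely for illustration): one WAL reader fanning out to bounded per-shard queues, where a single full queue stalls the reader and starves every other shard.

package main

import (
	"fmt"
	"time"
)

// sample stands in for one (ref, timestamp, value) triple read from the WAL.
type sample int

func main() {
	const (
		numShards = 4
		capacity  = 8 // per-shard buffer, analogous to queue_config.capacity
	)

	queues := make([]chan sample, numShards)
	for i := range queues {
		queues[i] = make(chan sample, capacity)
		go func(id int, q <-chan sample) {
			for s := range q {
				if id == 0 {
					time.Sleep(50 * time.Millisecond) // shard 0 talks to a slow remote
				}
				_ = s
			}
		}(i, queues[i])
	}

	// The single reader: each sample goes to one shard with a blocking send.
	start := time.Now()
	for s := sample(0); s < 200; s++ {
		queues[int(s)%numShards] <- s // blocks as soon as the target shard is full
	}
	fmt.Printf("enqueued 200 samples in %v (throttled by the slowest shard)\n", time.Since(start))
}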

So I increased capacity to 10x max_samples_per_send and that increased our performance:

  queue_config:
    capacity: 100000
    max_shards: 50
    min_shards: 4
    max_samples_per_send: 10000
    batch_send_deadline: 5s
    min_backoff: 30ms
    max_backoff: 5s

Max throughput with RW2.0 was around 610k samples/sec.

To cross-check, I tested the same queue config with RW1.0, and it peaked at 880k samples/sec.

I tried reducing the batch size while keeping the increased capacity:

  queue_config:
    capacity: 100000
    max_shards: 50
    min_shards: 4
    max_samples_per_send: 5000
    batch_send_deadline: 5s
    min_backoff: 30ms
    max_backoff: 5s

RW2.0: 520k samples/sec

I increased both the batch size and the capacity:

  queue_config:
    capacity: 200000
    max_shards: 50
    min_shards: 4
    max_samples_per_send: 20000
    batch_send_deadline: 5s
    min_backoff: 30ms
    max_backoff: 5s

RW2.0: 580k samples/sec

Keeping the increased capacity, I went back to the original batch size:

  queue_config:
    capacity: 200000
    max_shards: 50
    min_shards: 4
    max_samples_per_send: 10000
    batch_send_deadline: 5s
    min_backoff: 30ms
    max_backoff: 5s

RW2.0: 610k samples/sec
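As a back-of-the-envelope check on these numbers (using only the figures reported above, and assuming all 50 shards were active at the peak), each shard ships on the order of one 10k batch per second, so the senders themselves should be mostly idle and the limit would have to sit in front of them:

package main

import "fmt"

func main() {
	const (
		observedSamplesPerSec = 610_000.0 // best RW2.0 throughput seen above
		shards                = 50.0      // max_shards reached during the test
		maxSamplesPerSend     = 10_000.0  // batch size
		capacity              = 200_000.0 // per-shard queue capacity
	)

	// Batches each shard sends per second at the observed aggregate rate.
	fmt.Printf("per-shard send rate: %.2f batches/sec\n",
		observedSamplesPerSec/shards/maxSamplesPerSend) // ~1.22
	// Full batches one shard can buffer before the WAL reader blocks.
	fmt.Printf("batches buffered per shard before blocking: %.0f\n",
		capacity/maxSamplesPerSend) // 20
}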

It looks like there is serious contention between the shards when reading the WAL. It is present with RW1.0 as well, but due to the increased CPU usage of RW2.0 the effect is bigger there.

When we run multiple shards, do we run them in different threads?

System information

Linux 6.8.0-1019-aws aarch64

Prometheus version

prometheus, version 3.5.0 (branch: HEAD, revision: 8be3a9560fbdd18a94dedec4b747c35178177202)
  build user:       root@4451b64cb451
  build date:       20250714-16:18:17
  go version:       go1.24.5
  platform:         linux/arm64
  tags:             netgo,builtinassets

Prometheus configuration file

remote_write:
- url: https://redacted/api/v1/push
  remote_timeout: 30s
  write_relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: redacted
    replacement: $1
    action: keep
  protobuf_message: io.prometheus.write.v2.Request
  follow_redirects: true
  enable_http2: false
  queue_config:
    capacity: 200000
    max_shards: 50
    min_shards: 4
    max_samples_per_send: 10000
    batch_send_deadline: 5s
    min_backoff: 30ms
    max_backoff: 5s
  metadata_config:
    send: true
    send_interval: 1m
    max_samples_per_send: 2000

Alertmanager version


Alertmanager configuration file

Logs

