feat: pace-aware throttling to land exactly on reset boundaries

## Problem / motivation

Today the dashboard can estimate weekly pace and show how far usage is above schedule, but codex-lb cannot actively shape traffic so usage lands on reset boundaries instead of burning too fast and idling later.

For operators trying to fully utilize account capacity without crossing the limits early:

- weekly usage can run hot and deplete long before weekly reset
- 5h usage can spike and cause earlier short-window exhaustion even when weekly headroom remains
- sticky-thread mode changes fairness requirements because one hot thread can drain its assigned account while other accounts remain underused

This creates a practical gap between monitoring pace and enforcing pace.

## Proposed change

Add an optional pacing / throttling mode that computes allowable pace from remaining budget and time until reset. When traffic needs to slow down, codex-lb returns a retryable upstream-style response so the client waits on its side instead of codex-lb holding a proxy-side queue.

Concrete behavior:

- add a configurable throttle mode that computes allowable pace from remaining budget and time until reset
- take both the short window (~5h / primary) and weekly window (~7d / secondary) into account
- apply the stricter of the two limits at any given time
- when sticky threads is **on**, throttle on a **per-account** basis so each sticky-assigned account stays on pace for its own resets
- when sticky threads is **off**, throttle **pool-wide** so aggregate routing pace stays on schedule across the routable pool
- when pacing requires slowdown, prefer returning a retryable response to the client instead of introducing a long-lived queue inside codex-lb
- expose enough dashboard / logs context to show why a request was throttled and which window was binding
- make the feature optional and disabled by default

Recommended wire behavior for Codex compatibility:

- return HTTP `503 Service Unavailable`
- return JSON body with `error.code = "slow_down"`
- optionally include standard retry hints such as `Retry-After`, but treat them as best-effort only

Based on current Codex source, this is more compatible than HTTP `429` because Codex retries `5xx` by default but does not retry `429` by default. I did not find evidence that Codex reliably honors HTTP `Retry-After` for normal HTTP retries, so this mechanism should be treated as advisory backoff rather than precise client-side wait control.
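A minimal sketch of that wire shape, for illustration only. The function name `build_pacing_response`, the `binding_window` parameter, and the `error.message` field are assumptions and not part of codex-lb; only the `503` status, `error.code = "slow_down"`, and the optional `Retry-After` header come from the recommendation above.

```python
import json
from typing import Dict, Optional, Tuple


def build_pacing_response(
    retry_after_s: Optional[int] = None,
    binding_window: str = "weekly",  # hypothetical: which window forced the slowdown
) -> Tuple[int, Dict[str, str], str]:
    """Return (status, headers, body) for a pacing rejection."""
    headers = {"Content-Type": "application/json"}
    if retry_after_s is not None:
        # Retry-After is advisory only; the client may ignore it.
        headers["Retry-After"] = str(max(0, int(retry_after_s)))
    body = {
        "error": {
            "code": "slow_down",
            # The message text is an assumption, included for log readability.
            "message": f"Paced by {binding_window} window; retry later.",
        }
    }
    return 503, headers, json.dumps(body)
```

Carrying the binding window in the body would also support the dashboard / logs requirement above, though the exact field layout is an open design choice.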

Suggested config shape (illustrative, not prescriptive):

- `pacing_enabled: false`
- `pacing_mode: off | advisory_retry`
- `pacing_target: reset_boundary`
- optional guardrails for any minimal in-proxy waiting that still remains
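As a sketch only, the config shape above could be modeled like this. `PacingConfig`, the `active` property, and the guardrail field `max_local_wait_ms` are hypothetical names chosen for illustration; the issue does not define the guardrail options.

```python
from dataclasses import dataclass

VALID_MODES = ("off", "advisory_retry")
VALID_TARGETS = ("reset_boundary",)


@dataclass(frozen=True)
class PacingConfig:
    pacing_enabled: bool = False          # disabled by default
    pacing_mode: str = "off"
    pacing_target: str = "reset_boundary"
    max_local_wait_ms: int = 0            # hypothetical guardrail: 0 means no in-proxy waiting

    def __post_init__(self) -> None:
        if self.pacing_mode not in VALID_MODES:
            raise ValueError(f"unknown pacing_mode: {self.pacing_mode!r}")
        if self.pacing_target not in VALID_TARGETS:
            raise ValueError(f"unknown pacing_target: {self.pacing_target!r}")
        if self.max_local_wait_ms < 0:
            raise ValueError("max_local_wait_ms must be >= 0")

    @property
    def active(self) -> bool:
        """Pacing applies only when enabled and the mode is not 'off'."""
        return self.pacing_enabled and self.pacing_mode != "off"
```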

High-level algorithm intent:

1. For each enforcement scope, compute remaining short-window budget and time until short reset.
2. Compute remaining weekly budget and time until weekly reset.
3. Convert both to an allowed credits-per-hour burn rate.
4. Use the tighter rate as the current budget ceiling.
5. If incoming traffic exceeds that ceiling, reject with a retryable pacing response that causes the client to back off and retry later.
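The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: `Window`, `allowed_rate_per_hour`, `binding_limit`, and `should_throttle` are hypothetical names, and the observed burn rate is assumed to be measured elsewhere.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Window:
    name: str                   # e.g. "short" or "weekly"
    remaining_credits: float
    seconds_until_reset: float


def allowed_rate_per_hour(w: Window) -> float:
    """Credits per hour that would spend the remaining budget exactly at reset."""
    if w.remaining_credits <= 0:
        return 0.0
    # Clamp to one second so a window about to reset does not divide by zero.
    hours = max(w.seconds_until_reset, 1.0) / 3600.0
    return w.remaining_credits / hours


def binding_limit(short: Window, weekly: Window) -> Tuple[float, str]:
    """Return (ceiling, window_name) for the tighter of the two windows."""
    return min((allowed_rate_per_hour(w), w.name) for w in (short, weekly))


def should_throttle(observed_rate_per_hour: float, short: Window, weekly: Window) -> Tuple[bool, str]:
    ceiling, name = binding_limit(short, weekly)
    return observed_rate_per_hour > ceiling, name
```

For example, 50 credits left with 2.5 hours to the short reset allows 20 credits/hour, while 1200 credits left with 96 hours to the weekly reset allows 12.5 credits/hour, so the weekly window is binding.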

Expected outcome:

- usage should approach 100% near reset instead of exhausting far earlier
- short-window and weekly-window headroom should both be respected
- sticky and non-sticky routing should each preserve the correct fairness boundary
- codex-lb should avoid becoming a hidden request buffer when pacing delays are large

## Concrete defaults / recommended design

These are suggested defaults so implementation can start from a clear behavior model instead of only open questions.

### 1. Pace by credits, not request count

Use credits as the control variable.

Reason:
- upstream limits are credit-based
- request sizes vary too much for request-count pacing to be reliable
- codex-lb already surfaces credit-based usage, remaining budget, and pace math

If exact cost is only known after completion, estimate before dispatch and reconcile afterward using observed cost.
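One way to picture estimate-then-reconcile is a small ledger that reserves the estimate at dispatch and corrects it on completion. `CreditLedger` and its methods are hypothetical and only illustrate the bookkeeping; how the estimate is produced is left open.

```python
from typing import Dict


class CreditLedger:
    """Tracks remaining credits for one enforcement scope."""

    def __init__(self, remaining: float) -> None:
        self.remaining = remaining
        self._reserved: Dict[str, float] = {}

    def reserve(self, request_id: str, estimated_cost: float) -> None:
        """Deduct the estimated cost before dispatch."""
        self._reserved[request_id] = estimated_cost
        self.remaining -= estimated_cost

    def reconcile(self, request_id: str, observed_cost: float) -> None:
        """Replace the estimate with the observed cost after completion."""
        estimated = self._reserved.pop(request_id)
        self.remaining += estimated - observed_cost
```

Reserving up front keeps concurrent requests from all seeing the same untouched budget.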

### 2. Prefer client-side waiting over proxy-side queuing

Do not make codex-lb hold a large delayed queue if the client can wait instead. However, current Codex behavior suggests normal HTTP retries use generic exponential backoff rather than a precisely honored server-provided delay.

Recommended behavior:
- when a request arrives before its next admissible dispatch time, return a retryable pacing response immediately
- for Codex clients, prefer HTTP `503` with JSON `error.code = "slow_down"`
- optionally include `Retry-After`, but do not rely on it for exact pacing behavior
- avoid introducing a substantial proxy-side queue as the primary pacing mechanism
- if any small local queue exists for implementation convenience, keep it tightly bounded and secondary to the retryable-response path

Reason:
- current Codex source retries `5xx` by default
- current Codex source does not retry `429` by default
- client-side waiting avoids turning codex-lb into a backlog buffer

### 3. Non-sticky mode should still derive from per-account feasibility

Do not model non-sticky mode as one naive shared bucket.

Recommended behavior:
- compute admissible pace separately for each eligible account
- in non-sticky mode, decide whether the pool can admit the request using combined per-account feasibility
- when choosing an account for an admissible request, prefer the account with the best current headroom / earliest feasible dispatch profile

This avoids overdriving one constrained account just because aggregate pool capacity still looks healthy.
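A sketch of pool admission built from per-account state rather than one shared bucket. `AccountPace` and `pick_account` are hypothetical; `next_admissible_at` is assumed to come from each account's own pacing state, and "earliest feasible dispatch" is used here as a stand-in for "best headroom".

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class AccountPace:
    account_id: str
    next_admissible_at: float  # epoch seconds; derived per account, not from a pool total


def pick_account(
    accounts: Sequence[AccountPace], now: float
) -> Tuple[Optional[AccountPace], Optional[float]]:
    """Return (account, 0.0) if one is admissible now, else (None, seconds_to_wait).

    Returns (None, None) when the pool has no eligible accounts at all.
    """
    if not accounts:
        return None, None
    admissible = [a for a in accounts if a.next_admissible_at <= now]
    if admissible:
        # Prefer the account that has been admissible the longest.
        return min(admissible, key=lambda a: a.next_admissible_at), 0.0
    soonest = min(accounts, key=lambda a: a.next_admissible_at)
    return None, soonest.next_admissible_at - now
```

The wait value could feed a best-effort `Retry-After` hint when no account is admissible.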

### 4. Health filtering and pacing should be separate responsibilities

Recommended composition:
- health / quota / deactivation rules decide whether an account is eligible
- pacing decides whether an eligible account may receive traffic now or whether the request should get a retryable throttling response
- routing chooses the best eligible account among accounts that are currently admissible

In other words:
- health answers: "may this account be used?"
- pacing answers: "may it be used now?"
- routing answers: "which currently-admissible account should get this request?"
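The three responsibilities can be kept separate by passing each in as its own function. This is an illustrative sketch; `route` and the outcome strings other than `slow_down` are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

A = TypeVar("A")


def route(
    accounts: Sequence[A],
    is_eligible: Callable[[A], bool],        # health: may this account be used?
    is_admissible_now: Callable[[A], bool],  # pacing: may it be used now?
    score: Callable[[A], float],             # routing: higher is better
) -> Tuple[str, Optional[A]]:
    eligible = [a for a in accounts if is_eligible(a)]
    if not eligible:
        return "no_eligible_account", None
    admissible = [a for a in eligible if is_admissible_now(a)]
    if not admissible:
        # Eligible accounts exist but none may take traffic yet: pacing response.
        return "slow_down", None
    return "dispatch", max(admissible, key=score)
```

Keeping the two failure outcomes distinct lets logs show whether a rejection came from health or from pacing.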

### 5. Two-window pacing model per account

Recommended implementation model:
- maintain a short-window pacing constraint
- maintain a weekly-window pacing constraint
- the effective admissible dispatch time for a request is the later (stricter) of the times the two windows allow

A token-bucket or equivalent time-based scheduler would work well here as long as both windows are enforced and current admissibility is driven by the tighter constraint.
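A minimal two-bucket sketch of that model, assuming each window's refill rate and burst capacity have already been derived from its remaining budget and time until reset. `TokenBucket` and `TwoWindowPacer` are hypothetical names; time is passed in explicitly so the behavior is deterministic.

```python
import math
from typing import Optional, Tuple


class TokenBucket:
    def __init__(self, rate_per_s: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def wait_needed(self, cost: float, now: float) -> float:
        """Refill up to `now`, then return seconds until `cost` credits are available."""
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            return 0.0
        return math.inf if self.rate <= 0 else (cost - self.tokens) / self.rate


class TwoWindowPacer:
    def __init__(self, short: TokenBucket, weekly: TokenBucket) -> None:
        self.buckets = {"short": short, "weekly": weekly}

    def try_admit(self, cost: float, now: float) -> Tuple[bool, Optional[str], float]:
        """Return (admitted, binding_window, wait_seconds)."""
        waits = {name: b.wait_needed(cost, now) for name, b in self.buckets.items()}
        name, wait = max(waits.items(), key=lambda kv: kv[1])
        if wait > 0:
            return False, name, wait  # the later (stricter) window is binding
        for b in self.buckets.values():
            b.tokens -= cost          # charge both windows on admission
        return True, None, 0.0
```

A rejected call leaves both buckets uncharged, so the caller can return the retryable response and let the client try again later.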

## Alternatives considered

- Dashboard-only advice without enforcement.
  Useful, but still leaves operators to manually pace clients.

- Weekly-only pacing.
  Incomplete because 5h exhaustion still creates user-visible stalls even if weekly pace is fine.

- HTTP `429` for pacing.
  Rejected because current Codex client defaults do not retry `429` automatically.

- Large proxy-side delay queue.
  Less desirable because it turns codex-lb into a hidden waiting room and keeps pacing state server-side when the client can often wait instead.

- Pool-wide pacing even with sticky threads enabled.
  Incorrect fairness model because sticky assignment can concentrate load on one account.

## Area

- Proxy / upstream routing
- Account management & quota tracking
- Dashboard UI

## Additional context

This proposal is concrete enough for an issue, but final implementation details likely need an OpenSpec since it changes observable request latency and routing behavior.
