feat: pace-aware throttling to land exactly on reset boundaries #956

@cowwoc

Description

Problem / motivation

Today the dashboard can estimate weekly pace and show how far usage is ahead of schedule, but codex-lb cannot actively shape traffic so that usage lands on reset boundaries instead of burning too fast and idling later.

For operators trying to fully utilize account capacity without crossing the limits early:

  • weekly usage can run hot and deplete long before weekly reset
  • 5h usage can spike and cause earlier short-window exhaustion even when weekly headroom remains
  • sticky-thread mode changes fairness requirements because one hot thread can drain its assigned account while other accounts remain underused

This creates a practical gap between monitoring pace and enforcing pace.

Proposed change

Add an optional pacing / throttling mode that computes allowable pace from remaining budget and time until reset. When traffic needs to slow down, codex-lb returns a retryable upstream-style response so the client waits on its side, rather than codex-lb holding a proxy-side queue.

Concrete behavior:

  • add a configurable throttle mode that computes allowable pace from remaining budget and time until reset
  • take both the short window (~5h / primary) and weekly window (~7d / secondary) into account
  • apply the stricter of the two limits at any given time
  • when sticky threads is on, throttle on a per-account basis so each sticky-assigned account stays on pace for its own resets
  • when sticky threads is off, throttle pool-wide so aggregate routing pace stays on schedule across the routable pool
  • when pacing requires slowdown, prefer returning a retryable response to the client instead of introducing a long-lived queue inside codex-lb
  • expose enough dashboard / logs context to show why a request was throttled and which window was binding
  • make the feature optional and disabled by default

Recommended wire behavior for Codex compatibility:

  • return HTTP 503 Service Unavailable
  • return JSON body with error.code = "slow_down"
  • optionally include standard retry hints such as Retry-After, but treat them as best-effort only

Based on current Codex source, this is more compatible than HTTP 429 because Codex retries 5xx by default but does not retry 429 by default. I did not find evidence that Codex reliably honors HTTP Retry-After for normal HTTP retries, so this mechanism should be treated as advisory backoff rather than precise client-side wait control.
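A minimal sketch of the recommended wire response, to make the shape concrete. The function name `build_pacing_response`, the `message` text, and the `binding_window` field are hypothetical additions for illustration; only the 503 status, `error.code = "slow_down"`, and the optional `Retry-After` header come from the recommendation above.

```python
import json
import math
from typing import Dict, Optional, Tuple


def build_pacing_response(
    retry_after_seconds: Optional[float] = None,
    binding_window: Optional[str] = None,
) -> Tuple[int, Dict[str, str], str]:
    """Return (status, headers, body) for a retryable pacing rejection."""
    headers = {"Content-Type": "application/json"}
    if retry_after_seconds is not None:
        # Retry-After takes whole seconds; round up and never send 0.
        # Treated as a best-effort hint only, not precise wait control.
        headers["Retry-After"] = str(max(1, math.ceil(retry_after_seconds)))

    error = {
        "code": "slow_down",
        # Hypothetical human-readable message, not part of the proposal.
        "message": "Request deferred by pacing; retry later.",
    }
    if binding_window is not None:
        # Hypothetical field: which window ("short" or "weekly") was binding.
        error["binding_window"] = binding_window

    return 503, headers, json.dumps({"error": error})
```

The same values could feed the dashboard / log context, so an operator can see which window caused the rejection.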

Suggested config shape (illustrative, not prescriptive):

  • pacing_enabled: false
  • pacing_mode: off | advisory_retry
  • pacing_target: reset_boundary
  • optional guardrails for any minimal in-proxy waiting that still remains
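One possible way to model that config shape in code, as a sketch only. The class names, the `is_active` helper, and the `max_local_wait_seconds` guardrail are hypothetical; the three `pacing_*` keys and their values mirror the illustrative list above.

```python
from dataclasses import dataclass
from enum import Enum


class PacingMode(str, Enum):
    OFF = "off"
    ADVISORY_RETRY = "advisory_retry"


@dataclass(frozen=True)
class PacingConfig:
    pacing_enabled: bool = False              # disabled by default
    pacing_mode: PacingMode = PacingMode.OFF
    pacing_target: str = "reset_boundary"
    # Hypothetical guardrail: upper bound on any in-proxy waiting (0 = none).
    max_local_wait_seconds: float = 0.0

    def is_active(self) -> bool:
        """Pacing applies only when enabled and a non-off mode is selected."""
        return self.pacing_enabled and self.pacing_mode is not PacingMode.OFF
```

With these defaults, `PacingConfig().is_active()` is false, which matches the "optional and disabled by default" requirement.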

High-level algorithm intent:

  1. For each enforcement scope, compute remaining short-window budget and time until short reset.
  2. Compute remaining weekly budget and time until weekly reset.
  3. Convert both to an allowed credits-per-hour burn rate.
  4. Use the tighter rate as the current budget ceiling.
  5. If incoming traffic exceeds that ceiling, reject with a retryable pacing response that causes the client to back off and retry later.
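The steps above can be sketched roughly as follows. This is an illustration under assumptions, not a reference implementation: the names `Window`, `allowed_rate_per_hour`, `budget_ceiling`, and `should_throttle` are hypothetical, the 60-second floor on time-to-reset is an arbitrary guard against dividing by a near-zero interval, and the recent burn rate is assumed to be measured elsewhere.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Window:
    remaining_credits: float
    seconds_until_reset: float


def allowed_rate_per_hour(window: Window) -> float:
    """Credits per hour that would land exactly on the reset boundary."""
    if window.remaining_credits <= 0:
        return 0.0
    # Assumed guard: floor the interval so the rate does not explode
    # in the last moments before a reset.
    hours = max(window.seconds_until_reset, 60.0) / 3600.0
    return window.remaining_credits / hours


def budget_ceiling(short: Window, weekly: Window) -> Tuple[float, str]:
    """Return the tighter rate and the name of the binding window."""
    short_rate = allowed_rate_per_hour(short)
    weekly_rate = allowed_rate_per_hour(weekly)
    if short_rate <= weekly_rate:
        return short_rate, "short"
    return weekly_rate, "weekly"


def should_throttle(
    recent_rate_per_hour: float, short: Window, weekly: Window
) -> Tuple[bool, str]:
    """True when the observed burn rate exceeds the current ceiling."""
    ceiling, binding = budget_ceiling(short, weekly)
    return recent_rate_per_hour > ceiling, binding
```

For example, with 100 credits left and 2 hours to the short reset (50 credits/hour) and 2400 credits left with 96 hours to the weekly reset (25 credits/hour), the weekly window is binding and a recent rate of 30 credits/hour would be throttled.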

Expected outcome:

  • usage should approach 100% near reset instead of exhausting far earlier
  • short-window and weekly-window headroom should both be respected
  • sticky and non-sticky routing should each preserve the correct fairness boundary
  • codex-lb should avoid becoming a hidden request buffer when pacing delays are large

Concrete defaults / recommended design

These are suggested defaults so implementation can start from a clear behavior model instead of only open questions.

1. Pace by credits, not request count

Use credits as the control variable.

Reason:

  • upstream limits are credit-based
  • request sizes vary too much for request-count pacing to be reliable
  • codex-lb already surfaces credit-based usage, remaining budget, and pace math

If exact cost is only known after completion, estimate before dispatch and reconcile afterward using observed cost.
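A small sketch of the estimate-then-reconcile idea, assuming a per-scope ledger. `CreditLedger` and its method names are hypothetical, and how the pre-dispatch estimate is produced is left open.

```python
from typing import Dict


class CreditLedger:
    """Tracks remaining credits, holding estimates until real cost is known."""

    def __init__(self, remaining_credits: float) -> None:
        self.remaining = remaining_credits
        self._reserved: Dict[str, float] = {}

    def available(self) -> float:
        """Remaining credits minus estimates for in-flight requests."""
        return self.remaining - sum(self._reserved.values())

    def reserve(self, request_id: str, estimated_cost: float) -> bool:
        """Hold the estimated cost before dispatch; False if it does not fit."""
        if estimated_cost > self.available():
            return False
        self._reserved[request_id] = estimated_cost
        return True

    def reconcile(self, request_id: str, observed_cost: float) -> None:
        """Replace the estimate with the cost observed after completion."""
        self._reserved.pop(request_id, None)
        self.remaining -= observed_cost
```

Reconciling against observed cost keeps estimation errors from accumulating across the window, even when individual estimates are rough.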

2. Prefer client-side waiting over proxy-side queuing

Do not make codex-lb hold a large delayed queue if the client can wait instead. However, current Codex behavior suggests normal HTTP retries use generic exponential backoff rather than a precisely honored server-provided delay.

Recommended behavior:

  • when a request arrives before its next admissible dispatch time, return a retryable pacing response immediately
  • for Codex clients, prefer HTTP 503 with JSON error.code = "slow_down"
  • optionally include Retry-After, but do not rely on it for exact pacing behavior
  • avoid introducing a substantial proxy-side queue as the primary pacing mechanism
  • if any small local queue exists for implementation convenience, keep it tightly bounded and secondary to the retryable-response path

Reason:

  • current Codex source retries 5xx by default
  • current Codex source does not retry 429 by default
  • client-side waiting avoids turning codex-lb into a backlog buffer

3. Non-sticky mode should still derive from per-account feasibility

Do not model non-sticky mode as one naive shared bucket.

Recommended behavior:

  • compute admissible pace separately for each eligible account
  • in non-sticky mode, decide whether the pool can admit the request using combined per-account feasibility
  • when choosing an account for an admissible request, prefer the account with the best current headroom / earliest feasible dispatch profile

This avoids overdriving one constrained account just because aggregate pool capacity still looks healthy.
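As a sketch of per-account feasibility in non-sticky mode: the pool admits a request only if at least one account is admissible right now, and the account with the most headroom is preferred. `AccountPace`, `pick_account`, and `earliest_retry_delay` are hypothetical names, and "most headroom" is just one reading of "best current headroom / earliest feasible dispatch profile".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AccountPace:
    account_id: str
    next_admissible_at: float  # epoch seconds, derived from pacing state
    headroom_credits: float    # credits this account can still absorb on pace


def pick_account(accounts: List[AccountPace], now: float) -> Optional[AccountPace]:
    """Choose among accounts that are admissible now; None means throttle."""
    admissible = [a for a in accounts if a.next_admissible_at <= now]
    if not admissible:
        return None
    return max(admissible, key=lambda a: a.headroom_credits)


def earliest_retry_delay(accounts: List[AccountPace], now: float) -> Optional[float]:
    """Seconds until any account becomes admissible (a best-effort hint)."""
    if not accounts:
        return None
    return max(0.0, min(a.next_admissible_at for a in accounts) - now)
```

Because admissibility is evaluated per account, a constrained account is skipped even when the pool as a whole still has capacity.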

4. Health filtering and pacing should be separate responsibilities

Recommended composition:

  • health / quota / deactivation rules decide whether an account is eligible
  • pacing decides whether an eligible account may receive traffic now or whether the request should get a retryable throttling response
  • routing chooses the best eligible account among accounts that are currently admissible

In other words:

  • health answers: "may this account be used?"
  • pacing answers: "may it be used now?"
  • routing answers: "which currently-admissible account should get this request?"
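The three responsibilities could compose along these lines. This is a simplified sketch: the `Account` fields, the `decide` function, and the outcome labels `"unavailable"`, `"throttled"`, and `"routed"` are hypothetical, and real health rules would be richer than a single boolean.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Account:
    account_id: str
    healthy: bool              # outcome of health / quota / deactivation rules
    next_admissible_at: float  # epoch seconds, from pacing state
    headroom_credits: float


def decide(accounts: List[Account], now: float) -> Tuple[str, Optional[str]]:
    """Return (outcome, account_id) after health, pacing, then routing."""
    # Health: may this account be used?
    eligible = [a for a in accounts if a.healthy]
    if not eligible:
        return "unavailable", None

    # Pacing: may it be used now?
    admissible = [a for a in eligible if a.next_admissible_at <= now]
    if not admissible:
        return "throttled", None  # caller sends the retryable pacing response

    # Routing: which currently-admissible account should get this request?
    chosen = max(admissible, key=lambda a: a.headroom_credits)
    return "routed", chosen.account_id
```

Keeping the stages separate means a pacing rejection is never confused with an account being unhealthy, which also helps when explaining throttling in logs.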

5. Two-window pacing model per account

Recommended implementation model:

  • maintain a short-window pacing constraint
  • maintain a weekly-window pacing constraint
  • effective admissibility for a request is set by the stricter of the two constraints, i.e. the later of their next-admissible times

A token-bucket or equivalent time-based scheduler would work well here as long as both windows are enforced and current admissibility is driven by the tighter constraint.
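A minimal two-bucket sketch of that model, assuming each window is represented by a token bucket whose refill rate is the window's allowed burn rate. `Bucket` and `TwoWindowPacer` are hypothetical names, and time is passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    rate_per_second: float  # allowed credits per second for this window
    capacity: float         # maximum burst, in credits
    tokens: float
    updated_at: float

    def refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_second)
        self.updated_at = now

    def wait_for(self, cost: float, now: float) -> float:
        """Seconds until `cost` credits are available (0.0 if available now)."""
        self.refill(now)
        if self.tokens >= cost:
            return 0.0
        if self.rate_per_second <= 0:
            return float("inf")
        return (cost - self.tokens) / self.rate_per_second


class TwoWindowPacer:
    def __init__(self, short: Bucket, weekly: Bucket) -> None:
        self.short = short
        self.weekly = weekly

    def try_admit(self, cost: float, now: float) -> float:
        """Consume from both buckets and return 0.0 if admissible;
        otherwise return the wait implied by the stricter window."""
        wait = max(self.short.wait_for(cost, now), self.weekly.wait_for(cost, now))
        if wait == 0.0:
            self.short.tokens -= cost
            self.weekly.tokens -= cost
        return wait
```

A non-zero return value is the natural input for a best-effort `Retry-After` hint. Note that this sketch does not handle a cost larger than a bucket's capacity, which would never become admissible.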

Alternatives considered

  • Dashboard-only advice without enforcement.
    Useful, but still leaves operators to manually pace clients.

  • Weekly-only pacing.
    Incomplete because 5h exhaustion still creates user-visible stalls even if weekly pace is fine.

  • HTTP 429 for pacing.
    Rejected because current Codex client defaults do not retry 429 automatically.

  • Large proxy-side delay queue.
    Less desirable because it turns codex-lb into a hidden waiting room and keeps pacing state server-side when the client can often wait instead.

  • Pool-wide pacing even with sticky threads enabled.
    Incorrect fairness model because sticky assignment can concentrate load on one account.

Area

  • Proxy / upstream routing
  • Account management & quota tracking
  • Dashboard UI

Additional context

This proposal is concrete enough for an issue, but final implementation details likely need an OpenSpec since the feature changes observable request latency and routing behavior.
