Problem / motivation
Today the dashboard can estimate weekly pace and show how far usage is above schedule, but codex-lb cannot actively shape traffic so usage lands on reset boundaries instead of burning too fast and idling later.
For operators trying to fully utilize account capacity without crossing the limits early:
- weekly usage can run hot and deplete long before weekly reset
- 5h usage can spike and cause earlier short-window exhaustion even when weekly headroom remains
- sticky-thread mode changes fairness requirements because one hot thread can drain its assigned account while other accounts remain underused
This creates a practical gap between monitoring pace and enforcing pace.
Proposed change
Add an optional pacing / throttling mode that computes allowable pace from remaining budget and time until reset, and when traffic needs to slow down, have codex-lb return a retryable upstream-style response so the client waits on its side instead of codex-lb holding a proxy-side queue.
Concrete behavior:
- add a configurable throttle mode that computes allowable pace from remaining budget and time until reset
- take both the short window (~5h / primary) and weekly window (~7d / secondary) into account
- apply the stricter of the two limits at any given time
- when sticky threads is on, throttle on a per-account basis so each sticky-assigned account stays on pace for its own resets
- when sticky threads is off, throttle pool-wide so aggregate routing pace stays on schedule across the routable pool
- when pacing requires slowdown, prefer returning a retryable response to the client instead of introducing a long-lived queue inside codex-lb
- expose enough dashboard / logs context to show why a request was throttled and which window was binding
- make the feature optional and disabled by default
Recommended wire behavior for Codex compatibility:
- return HTTP 503 Service Unavailable
- return JSON body with error.code = "slow_down"
- optionally include standard retry hints such as Retry-After, but treat them as best-effort only
Based on current Codex source, this is more compatible than HTTP 429 because Codex retries 5xx by default but does not retry 429 by default. I did not find evidence that Codex reliably honors HTTP Retry-After for normal HTTP retries, so this mechanism should be treated as advisory backoff rather than precise client-side wait control.
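A minimal sketch of the pacing rejection described above, written as a framework-agnostic helper. The function name `build_pacing_response`, the `message` field, and the tuple return shape are assumptions for illustration; only the 503 status, `error.code = "slow_down"`, and the optional Retry-After header come from this proposal.

```python
import json
import math


def build_pacing_response(retry_after_seconds=None):
    """Return (status, headers, body) for a pacing rejection.

    Hypothetical helper: codex-lb's real response plumbing may differ.
    """
    headers = {"Content-Type": "application/json"}
    if retry_after_seconds is not None:
        # Retry-After carries whole seconds; round up so the hint is never
        # shorter than the computed wait. Clients may ignore it entirely.
        headers["Retry-After"] = str(max(1, math.ceil(retry_after_seconds)))
    body = json.dumps(
        {
            "error": {
                "code": "slow_down",
                # Assumed field, useful for logs; not required by the proposal.
                "message": "Request paced by codex-lb; retry later.",
            }
        }
    )
    return 503, headers, body
```

Because the hint is advisory, the proxy should make the same admit-or-reject decision on every retry rather than assume the client waited the suggested time.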
Suggested config shape (illustrative, not prescriptive):
- pacing_enabled: false
- pacing_mode: off | advisory_retry
- pacing_target: reset_boundary
- optional guardrails for any minimal in-proxy waiting that still remains
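One way the illustrative keys above could be modeled and validated, as a sketch only. The class name `PacingConfig`, the `active` property, and the validation rules are assumptions; the key names and values are the ones suggested in this issue.

```python
from dataclasses import dataclass

# Values taken from the suggested config shape; the tuples themselves are
# hypothetical names used only in this sketch.
VALID_MODES = ("off", "advisory_retry")
VALID_TARGETS = ("reset_boundary",)


@dataclass(frozen=True)
class PacingConfig:
    pacing_enabled: bool = False          # feature is off by default
    pacing_mode: str = "off"
    pacing_target: str = "reset_boundary"

    def __post_init__(self):
        if self.pacing_mode not in VALID_MODES:
            raise ValueError(f"unknown pacing_mode: {self.pacing_mode!r}")
        if self.pacing_target not in VALID_TARGETS:
            raise ValueError(f"unknown pacing_target: {self.pacing_target!r}")

    @property
    def active(self):
        """Pacing applies only when enabled and the mode is not 'off'."""
        return self.pacing_enabled and self.pacing_mode != "off"
```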
High-level algorithm intent:
- For each enforcement scope, compute remaining short-window budget and time until short reset.
- Compute remaining weekly budget and time until weekly reset.
- Convert both to an allowed credits-per-hour burn rate.
- Use the tighter rate as the current budget ceiling.
- If incoming traffic exceeds that ceiling, reject with a retryable pacing response that causes the client to back off and retry later.
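The rate computation in the steps above can be sketched as follows. The `Window` type, the function names, and the 60-second floor on time-until-reset are assumptions made for this example, not existing codex-lb code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    name: str                   # e.g. "short" (~5h) or "weekly" (~7d)
    remaining_credits: float
    seconds_until_reset: float


def allowed_rate_per_hour(window):
    """Credits per hour that would exhaust the budget exactly at reset."""
    if window.remaining_credits <= 0:
        return 0.0
    # Assumed guard: floor the horizon at 60s so the rate does not blow up
    # in the last moments before a reset.
    hours = max(window.seconds_until_reset, 60.0) / 3600.0
    return window.remaining_credits / hours


def binding_window(short, weekly):
    """Return (window, rate) for whichever window allows the lower rate."""
    rated = [(allowed_rate_per_hour(w), w) for w in (short, weekly)]
    rate, window = min(rated, key=lambda pair: pair[0])
    return window, rate
```

For example, with 100 credits left and 2 hours to the short reset (50 credits/hour), and 2400 credits left with 96 hours to the weekly reset (25 credits/hour), the weekly window is binding.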
Expected outcome:
- usage should approach 100% near reset instead of exhausting far earlier
- short-window and weekly-window headroom should both be respected
- sticky and non-sticky routing should each preserve the correct fairness boundary (per-account vs pool-wide)
- codex-lb should avoid becoming a hidden request buffer when pacing delays are large
Concrete defaults / recommended design
These are suggested defaults so implementation can start from a clear behavior model instead of only open questions.
1. Pace by credits, not request count
Use credits as the control variable.
Reason:
- upstream limits are credit-based
- request sizes vary too much for request-count pacing to be reliable
- codex-lb already surfaces credit-based usage, remaining budget, and pace math
If exact cost is only known after completion, estimate before dispatch and reconcile afterward using observed cost.
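A rough sketch of the estimate-then-reconcile idea, assuming a per-scope ledger keyed by request ID. `CreditLedger` and its methods are hypothetical names; how the estimate itself is produced is left open here.

```python
class CreditLedger:
    """Tracks remaining credits, charging an estimate up front."""

    def __init__(self, remaining_credits):
        self.remaining = remaining_credits
        self._reserved = {}  # request_id -> estimated cost charged at dispatch

    def reserve(self, request_id, estimated_cost):
        # Charge the estimate before dispatch so pacing sees in-flight work.
        self._reserved[request_id] = estimated_cost
        self.remaining -= estimated_cost

    def reconcile(self, request_id, observed_cost):
        # Replace the estimate with the observed cost once it is known.
        estimated = self._reserved.pop(request_id)
        self.remaining += estimated - observed_cost
```

If a request was estimated at 10 credits and actually cost 7.5, reconciliation returns 2.5 credits to the budget; an underestimate is corrected the same way in the other direction.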
2. Prefer client-side waiting over proxy-side queuing
Do not make codex-lb hold a large delayed queue if the client can wait instead. However, current Codex behavior suggests normal HTTP retries use generic exponential backoff rather than a precisely honored server-provided delay.
Recommended behavior:
- when a request arrives before its next admissible dispatch time, return a retryable pacing response immediately
- for Codex clients, prefer HTTP 503 with JSON error.code = "slow_down"
- optionally include Retry-After, but do not rely on it for exact pacing behavior
- avoid introducing a substantial proxy-side queue as the primary pacing mechanism
- if any small local queue exists for implementation convenience, keep it tightly bounded and secondary to the retryable-response path
Reason:
- current Codex source retries 5xx by default
- current Codex source does not retry 429 by default
- client-side waiting avoids turning codex-lb into a backlog buffer
3. Non-sticky mode should still derive from per-account feasibility
Do not model non-sticky mode as one naive shared bucket.
Recommended behavior:
- compute admissible pace separately for each eligible account
- in non-sticky mode, decide whether the pool can admit the request using combined per-account feasibility
- when choosing an account for an admissible request, prefer the account with the best current headroom / earliest feasible dispatch profile
This avoids overdriving one constrained account just because aggregate pool capacity still looks healthy.
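A sketch of non-sticky selection built on per-account state rather than one shared bucket. It interprets "combined per-account feasibility" as "at least one eligible account can take the request now", which is an assumption, as are the `AccountPace` fields and both function names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccountPace:
    account_id: str
    headroom_credits: float     # credits this account may spend now under its pace
    next_admissible_at: float   # epoch seconds when it may next take a request


def pick_account(accounts, estimated_cost, now):
    """Return the admissible account with the most headroom, or None."""
    admissible = [
        a for a in accounts
        if a.next_admissible_at <= now and a.headroom_credits >= estimated_cost
    ]
    if not admissible:
        return None  # pool cannot admit now: send the retryable response
    return max(admissible, key=lambda a: a.headroom_credits)


def earliest_admissible_at(accounts):
    """Earliest time any account reopens; usable as a best-effort retry hint."""
    return min((a.next_admissible_at for a in accounts), default=None)
```

An account with plenty of aggregate-looking capacity but a later `next_admissible_at` is simply skipped, so a constrained account is never driven past its own pace.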
4. Health filtering and pacing should be separate responsibilities
Recommended composition:
- health / quota / deactivation rules decide whether an account is eligible
- pacing decides whether an eligible account may receive traffic now or whether the request should get a retryable throttling response
- routing chooses the best eligible account among accounts that are currently admissible
In other words:
- health answers: "may this account be used?"
- pacing answers: "may it be used now?"
- routing answers: "which currently-admissible account should get this request?"
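The three questions can be kept as three separate, pluggable steps. This is a sketch under assumed names: `route`, the predicate parameters, and the string outcomes are all hypothetical and not taken from codex-lb.

```python
def route(accounts, is_eligible, is_admissible_now, score):
    """Compose health filtering, pacing, and routing as distinct stages.

    Returns (outcome, account) where outcome is one of
    "no_eligible_account", "throttled", or "dispatch".
    """
    # Health: may this account be used at all?
    eligible = [a for a in accounts if is_eligible(a)]
    if not eligible:
        return "no_eligible_account", None

    # Pacing: may it be used now?
    admissible = [a for a in eligible if is_admissible_now(a)]
    if not admissible:
        return "throttled", None  # caller returns the retryable pacing response

    # Routing: which currently-admissible account gets the request?
    return "dispatch", max(admissible, key=score)
```

Keeping the outcomes distinct also gives the dashboard and logs a direct way to tell a health rejection apart from a pacing rejection.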
5. Two-window pacing model per account
Recommended implementation model:
- maintain a short-window pacing constraint
- maintain a weekly-window pacing constraint
- a request is admissible only when both constraints allow it, so the stricter window (the one with the later next-dispatch time) governs
A token-bucket or equivalent time-based scheduler would work well here as long as both windows are enforced and current admissibility is driven by the tighter constraint.
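A minimal sketch of the two-window model using one token bucket per window, with the clock passed in explicitly so it is easy to test. Class names, the burst `capacity` parameter, and the return shape are assumptions; it also assumes capacity is at least as large as the biggest single request cost.

```python
class TokenBucket:
    def __init__(self, rate_per_second, capacity, now):
        self.rate = rate_per_second   # credits added per second
        self.capacity = capacity      # maximum burst, in credits
        self.tokens = capacity
        self.updated_at = now

    def _refill(self, now):
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now

    def seconds_until_available(self, cost, now):
        self._refill(now)
        if self.tokens >= cost:
            return 0.0
        if self.rate <= 0:
            return float("inf")
        return (cost - self.tokens) / self.rate

    def consume(self, cost, now):
        self._refill(now)
        self.tokens -= cost


class TwoWindowPacer:
    def __init__(self, short, weekly):
        self.buckets = {"short": short, "weekly": weekly}

    def try_admit(self, cost, now):
        """Return (admitted, wait_seconds, binding_window_name_or_None)."""
        waits = {
            name: bucket.seconds_until_available(cost, now)
            for name, bucket in self.buckets.items()
        }
        # The window with the longest wait is the binding constraint.
        binding = max(waits, key=waits.get)
        if waits[binding] > 0:
            return False, waits[binding], binding
        # Both windows allow the request: charge both.
        for bucket in self.buckets.values():
            bucket.consume(cost, now)
        return True, 0.0, None
```

The rejected case returns the binding window name, which is the piece of context the dashboard and logs would need to explain why a request was throttled. In practice the bucket rates would be recomputed from remaining budget and time until reset rather than fixed.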
Alternatives considered
- Dashboard-only advice without enforcement. Useful, but still leaves operators to manually pace clients.
- Weekly-only pacing. Incomplete because 5h exhaustion still creates user-visible stalls even if weekly pace is fine.
- HTTP 429 for pacing. Rejected because current Codex client defaults do not retry 429 automatically.
- Large proxy-side delay queue. Less desirable because it turns codex-lb into a hidden waiting room and keeps pacing state server-side when the client can often wait instead.
- Pool-wide pacing even with sticky threads enabled. Incorrect fairness model because sticky assignment can concentrate load on one account.
Area
- Proxy / upstream routing
- Account management & quota tracking
- Dashboard UI
Additional context
This proposal is concrete enough for an issue, but final implementation details likely need an OpenSpec since it changes observable request latency and routing behavior.