fit: honest, monotonic parity gate + OOD safety gate#50

Merged
adrida merged 7 commits into main from fix/parity-gate-holdout on Jun 14, 2026
@adrida adrida commented Jun 14, 2026

What

Makes the parity gate honest, stable, and safe, and adds a distance-based OOD safety net.

1. Honest held-out gate (was: in-sample)

The gate previously selected the accept threshold and measured its agreement on the same calibration data, so a policy could clear the target by in-sample luck and then break the contract on real traffic. It now certifies on a held-out slice using an exact Clopper-Pearson lower bound, not a point estimate.
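To illustrate why a lower bound is stricter than a point estimate, here is a minimal stdlib-only sketch of a one-sided exact Clopper-Pearson lower bound. The function names and the bisection approach are assumptions for illustration, not the code in this PR; `alpha` mirrors the confidence parameter the commits mention. The `comb`-based sum is only suitable for small `n`.

```python
from math import comb


def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term (small n only)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def cp_lower_bound(k, n, alpha=0.05):
    """One-sided exact Clopper-Pearson lower bound on a binomial proportion.

    Returns the p at which observing >= k agreements out of n has
    probability alpha, found by bisection (binom_sf is increasing in p).
    """
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    if k == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_sf(k, n, mid) < alpha:
            lo = mid  # p this small makes the observation too unlikely
        else:
            hi = mid
    return lo


# 95/100 agreements: the point estimate equals a 0.95 target,
# but the lower bound sits below it, so a lower-bound gate would defer.
point_estimate = 95 / 100
lower = cp_lower_bound(95, 100, alpha=0.05)
```

With the same observed rate, the bound tightens as the sample grows, which is why small calibration sets struggle to certify strict targets.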

2. Monotonic in target

A bug made coverage non-monotonic (e.g. 0% at TA=0.90 but 87% at 0.95): the gate kept only the single highest-coverage threshold and gave up if it failed verification. It now verifies candidates in coverage order and deploys the highest-coverage one that clears on the held-out split.
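The fix described above can be sketched as a simple loop. This is a hypothetical illustration: `pick_threshold`, the `verify` callback, and the toy numbers are assumptions, not the PR's `_calibrate_threshold`. Because a stricter target can only shrink the set of candidates that pass verification, the deployed coverage cannot increase as the target rises.

```python
def pick_threshold(candidates, verify):
    """Return the highest-coverage (threshold, coverage) pair that verifies.

    candidates: iterable of (threshold, coverage) pairs that already
                passed selection.
    verify:     callable threshold -> bool (e.g. a held-out check).
    Returns None when no candidate verifies.
    """
    for thr, cov in sorted(candidates, key=lambda c: c[1], reverse=True):
        if verify(thr):
            return thr, cov  # first hit is the highest-coverage survivor
    return None


# Toy data (made up for illustration): held-out agreement per threshold.
candidates = [(0.5, 0.97), (0.7, 0.87), (0.9, 0.50)]
held_out_agreement = {0.5: 0.91, 0.7: 0.96, 0.9: 0.99}


def deploy(target):
    """Deploy the best candidate whose held-out agreement meets the target."""
    return pick_threshold(candidates, lambda t: held_out_agreement[t] >= target)
```

The old behavior corresponds to checking only the first candidate in the sorted order and returning None if it fails.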

3. Hybrid select+verify (recovers coverage)

A pure 50/50 held-out split was provably honest but too data-hungry, certifying 0% at strict targets on small calibration sets even when the confidence ranking clearly supported partial coverage. The gate now selects on a 70% slice (CP lower bound) and verifies generalization on a held-out 30% (point estimate). Validated across seeds with zero held-out contract violations; recovers strong coverage at strict targets that the split discarded.
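A rough sketch of the hybrid scheme, under stated assumptions: rows are `(confidence, agrees_with_teacher)` pairs, the split is a seeded shuffle, and every name here (`hybrid_gate`, `cp_lower`, `agreement`) is hypothetical. The small-sample fallback mentioned in the commits is omitted.

```python
import random
from math import comb


def cp_lower(k, n, alpha):
    """One-sided Clopper-Pearson lower bound via bisection (small n only)."""
    if k == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        tail = sum(comb(n, i) * mid**i * (1 - mid) ** (n - i) for i in range(k, n + 1))
        lo, hi = (mid, hi) if tail < alpha else (lo, mid)
    return lo


def agreement(rows, thr):
    """(# agreeing, # accepted) among rows with confidence >= thr."""
    accepted = [agrees for conf, agrees in rows if conf >= thr]
    return sum(accepted), len(accepted)


def hybrid_gate(rows, target, alpha=0.05, select_frac=0.7, seed=0):
    """Select on a 70% slice by CP lower bound, verify on the rest by point estimate.

    Returns (threshold, held_out_agreement) or None if nothing certifies.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * select_frac)
    select, verify = rows[:cut], rows[cut:]
    # Ascending thresholds = descending coverage, so the first pass wins.
    for thr in sorted({conf for conf, _ in select}):
        k, n = agreement(select, thr)
        if n == 0 or cp_lower(k, n, alpha) < target:
            continue  # fails the selection-side lower bound
        kv, nv = agreement(verify, thr)
        if nv > 0 and kv / nv >= target:
            return thr, kv / nv  # also clears the held-out point estimate
    return None


# Synthetic data: low-confidence rows agree half the time, the rest always agree.
rows = [(i / 120, i >= 30 or i % 2 == 0) for i in range(120)]
result = hybrid_gate(rows, target=0.90)
```

The design trade-off is that the verify slice only needs to rule out selection flukes, so it can be small, while the statistical conservatism is paid for once on the larger selection slice.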

4. Distance-based OOD safety gate

The parity gate only guarantees agreement on traffic resembling the calibration data; off-distribution inputs (gibberish, off-domain, prompt-injection) still got a confident answer. A new commodity kNN-distance gate (global and per-predicted-label thresholds, where a per-label threshold can only loosen the global one) defers inputs far from the training distribution, regardless of surrogate confidence.
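A minimal sketch of such a gate, assuming Euclidean distance, leave-one-out scoring at fit time, and a nearest-rank percentile. The class and helper names are hypothetical; the defaults (mean distance to 10 nearest neighbors, 99.5th percentile) follow the commit message.

```python
import math


def mean_knn_distance(x, refs, k):
    """Mean Euclidean distance from x to its k nearest reference points."""
    nearest = sorted(math.dist(x, r) for r in refs)[:k]
    return sum(nearest) / len(nearest)


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


class KnnOodGate:
    def __init__(self, k=10, q=99.5):
        self.k, self.q = k, q

    def fit(self, embeddings, labels):
        self.refs = [tuple(e) for e in embeddings]
        # Leave-one-out: score each training point against all the others.
        scores = [
            mean_knn_distance(e, self.refs[:i] + self.refs[i + 1:], self.k)
            for i, e in enumerate(self.refs)
        ]
        self.global_thr = percentile(scores, self.q)
        by_label = {}
        for score, label in zip(scores, labels):
            by_label.setdefault(label, []).append(score)
        # Per-label thresholds may only loosen: never below the global one.
        self.label_thr = {
            label: max(self.global_thr, percentile(vals, self.q))
            for label, vals in by_label.items()
        }
        return self

    def defer(self, embedding, predicted_label):
        """True when the input is farther than the applicable threshold."""
        thr = self.label_thr.get(predicted_label, self.global_thr)
        return mean_knn_distance(tuple(embedding), self.refs, self.k) > thr


# Toy 5x5 grid of 2-D "embeddings" with two labels.
points = [(x, y) for x in range(5) for y in range(5)]
labels = ["a" if x < 3 else "b" for x, _ in points]
gate = KnnOodGate(k=3).fit(points, labels)
```

Note that the decision ignores surrogate confidence entirely: `gate.defer((100.0, 100.0), "a")` defers however confident the prediction was.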

Plus NaN-robustness in acceptor fitting and an OOD-robustness routing test.
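One plausible shape for the NaN guard is to drop rows with non-finite probabilities before fitting. This is an assumption: the PR may clip or impute instead, and `finite_rows` is a hypothetical helper.

```python
import math


def finite_rows(probs, targets):
    """Keep only rows whose probability vector is entirely finite.

    probs:   list of per-class probability lists from the surrogate.
    targets: list of per-row fit targets, aligned with probs.
    """
    kept = [
        (p, t)
        for p, t in zip(probs, targets)
        if all(math.isfinite(v) for v in p)  # rejects NaN and +/-inf
    ]
    return [p for p, _ in kept], [t for _, t in kept]


probs = [[0.9, 0.1], [float("nan"), 0.5], [0.2, float("inf")], [0.5, 0.5]]
targets = [1, 0, 1, 0]
clean_probs, clean_targets = finite_rows(probs, targets)
```

If filtering leaves too few rows, refusing to fit is consistent with the gate's "refuse rather than guess" stance described in the commits.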

Behavior change

At a given target, coverage now reflects an honest held-out lower bound. A surrogate whose true agreement sits just under the target will correctly defer where the old gate over-deployed. The frontier.json shows the achievable coverage per target.

Tests

tests/test_gate.py, tests/test_ood.py, updated tests/test_fit.py. Full suite green.

adrida and others added 6 commits June 13, 2026 15:47
…imate

The parity gate previously selected the accept threshold and reported its
teacher agreement on the same calibration set, and accepted accept-all
configurations on a raw point estimate. Both let a small or lucky set clear the
target and then break the contract on real traffic.

- build_global now certifies only when the Clopper-Pearson lower bound on cal
  agreement clears the target (was a raw point estimate).
- _calibrate_threshold selects the threshold on a selection split, then verifies
  it on a held-out split, deploying only if the held-out lower bound clears the
  target. Falls back to a single-set lower-bound gate when data is too small to
  split, and refuses rather than guessing.
- Guard NaN/inf surrogate probabilities before the acceptor fit (fixes a crash
  on degenerate datasets).
- Expose `alpha` (confidence level) through the builders and fit_frontier.
- Add regression tests for low-n refusal, the lower-bound certification, and the
  NaN guard.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

With the held-out lower-bound gate, a noisy 300-trace set correctly refuses to
certify a deployment (teacher noise caps agreement below target). Give this test
cleanly separable data so it still exercises the deploy and route path; the
refusal case is covered by test_gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Extreme out-of-distribution embeddings must route without crashing and produce
finite accept scores. Documents that the OSS library defers via the acceptor
threshold and does not ship a separate distance-based OOD gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The held-out calibration picked the single highest-coverage threshold that
cleared target on the selection split, then returned None if that one threshold
failed held-out verification. Because the highest-coverage threshold is also the
most aggressive (least likely to generalise), coverage came out non-monotonic in
target: banking77 deployed 0% at TA=0.90 but 87% at TA=0.95.

Now verify selection-passing thresholds in coverage order and deploy the
highest-coverage one that also clears target on the held-out split. Same
Clopper-Pearson maths, same held-out split; the gate just no longer quits after
one failed candidate. The banking77 frontier is now monotonic (97.5% @0.90 -> 87.1%
@0.95 -> 65.9% @0.97 -> 50.4% @0.98).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The 50/50 held-out split was provably honest but too data-hungry: on a few-hundred
calibration rows it certified 0% at strict targets even though the confidence
ranking clearly supported partial coverage (banking77 0% @0.98 despite an oracle
~80%). Replace it with a hybrid: select the highest-coverage threshold whose
Clopper-Pearson LOWER bound clears target on a 70% slice (removes in-sample
optimism at selection), then require it to also clear on a held-out 30% by the
point estimate (catches selection-bias flukes without demanding CP-tightness on
the small verify slice). Falls back to a single-set CP lower-bound gate when
n<40. Validated across seeds on banking77/obside at 0.90-0.98 with zero held-out
contract violations; recovers banking77 to 67% @0.98 (was 0%) and Ridge to 68%
@0.90 (was 0%).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…uts)

The parity gate only guarantees agreement on traffic that resembles calibration
data; off-distribution inputs (gibberish, off-domain, prompt-injection) still got
a confident surrogate answer and slipped through. Add a commodity kNN-distance
OOD gate: at fit, calibrate the 99.5th-percentile mean-10-NN distance globally and
per predicted label (per-label can only loosen, never tighten below global); at
inference, defer any input beyond that, regardless of surrogate confidence.

Keyed on input embeddings + predicted label only, NOT the partition cells, so it
carries none of the cell-construction IP. On the Obside 156-case battery it lifts
OOD-junk deferral from 0% to ~80% while keeping in-distribution coverage (clean
100% / holdout 90% handled). Semantic near-misses (entity-disentanglement) are
deliberately out of scope here; those need the advanced embeddings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@adrida adrida marked this pull request as ready for review June 14, 2026 15:47
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@adrida adrida merged commit 36abcfb into main Jun 14, 2026
5 checks passed
@adrida adrida deleted the fix/parity-gate-holdout branch June 14, 2026 17:20