Summary
Add Reasoning with Sampling (Harvard, Oct 2025) as an inference-time option in our patched llama.cpp: MCMC over logits during decoding so the model can backtrack and resample blocks when token-level confidence falls off, instead of committing to a bad trajectory for the full context budget.
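To make the accept/reject step concrete, here is a minimal sketch of a block-level Metropolis-Hastings decision. It assumes the target is the base model's distribution sharpened to p(x)^alpha and that proposals are resampled from the base model itself, in which case the acceptance ratio reduces to (p_new / p_old)^(alpha - 1). The names `mh_accept_prob`, `resample_block`, and `propose` are hypothetical and are not part of llama.cpp or ATLAS; the paper's exact procedure may differ.

```python
import math
import random


def mh_accept_prob(logp_new, logp_old, alpha):
    """Acceptance probability when the target is p(x)^alpha and the proposal
    is drawn from the base model p, so the ratio is (p_new / p_old)^(alpha - 1).
    Log-probs are summed over the resampled block only (the shared prefix cancels)."""
    log_ratio = (alpha - 1.0) * (logp_new - logp_old)
    # Clamp at 1.0 before exponentiating to avoid overflow on large positive ratios.
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)


def resample_block(block, logp, propose, alpha, rng):
    """Propose a replacement for `block`; keep it only if the MH test passes."""
    new_block, new_logp = propose()
    if rng.random() < mh_accept_prob(new_logp, logp, alpha):
        return new_block, new_logp, True
    return block, logp, False


# Toy usage: the stand-in "model" always proposes the same, more likely block.
rng = random.Random(0)
block, logp, accepted = resample_block(
    block=["bad", "step"], logp=-12.0,
    propose=lambda: (["good", "step"], -7.5),
    alpha=4.0, rng=rng,
)
```

A more likely proposal is always accepted; a less likely one is accepted with a probability that shrinks as alpha grows, which is what lets the sampler abandon a low-likelihood block instead of decoding through it.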
Bottleneck it solves
Today every PlanSearch candidate runs to the full 8K-token thinking budget before the Geometric Lens can score and reject it with G(x). A huge fraction of those tokens is wasted on trajectories that were already going off the rails at token 500 — we only learn they were bad after we've paid the full decode cost.
In-generation backtracking kills those candidates early. The saved compute buys more attempts at harder problems (more PlanSearch candidates, more refinement rounds, longer derivation chains), or it just lowers latency. Either way it moves score without changing any weights.
Complementarity with G(x)
Not a replacement — a layer above. Think of it as token-block pruning during generation vs whole-candidate selection after generation:
| Layer | When | Granularity | Decision |
| --- | --- | --- | --- |
| Reasoning with Sampling (MCMC) | During decode | ~100-500 token blocks | Accept/reject block |
| Geometric Lens (C(x), G(x)) | After decode | Whole 8K-token candidates | Rank and pick best |
DeltaNet catch
Qwen3.5-9B is a hybrid architecture (attention + DeltaNet layers). Backtracking is the hard part.
- Attention layers: easy — KV-cache truncate-by-N, already supported in llama.cpp.
- DeltaNet layers: non-trivial, since the recurrent state (S_t = S_{t-1} + k_t ⊗ v_t − k_t ⊗ (S_{t-1} k_t)) isn't naturally reversible. Can't just "subtract" a bad block without accumulating numerical error.
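A small numeric sketch of why the update cannot simply be undone. It transcribes the formula above in pure Python, reading the S_{t-1} k_t term as the state's current prediction for key k_t (computed here as Sᵀk, which is an assumption about the notation), with no gating or write-strength factor and a unit-norm key. Under those assumptions two different prior states collapse to the same next state, so the previous state is not recoverable from the new one. Real DeltaNet kernels add gating and scaling, which this sketch ignores.

```python
def delta_update(S, k, v):
    """Return S + k⊗v − k⊗pred, where pred = Sᵀk is what the state
    currently returns for key k (assumed reading of the formula)."""
    rows, cols = len(S), len(S[0])
    pred = [sum(k[i] * S[i][j] for i in range(rows)) for j in range(cols)]
    return [[S[i][j] + k[i] * (v[j] - pred[j]) for j in range(cols)]
            for i in range(rows)]


k = [1.0, 0.0]   # unit-norm key
v = [0.5, 0.25]  # value written for that key

# Two prior states that differ only in what they stored along k.
S_a = [[0.0, 0.0], [0.0, 1.0]]
S_b = [[3.0, -2.0], [0.0, 1.0]]

next_a = delta_update(S_a, k, v)
next_b = delta_update(S_b, k, v)

# Both priors land on the same state: the old content along k was overwritten.
print(next_a == next_b)
```

Because the overwritten component is gone, any "undo" has to come from a saved copy of the earlier state rather than from algebra on the current one.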
Proposed workaround (inference-only, no architecture change):
- Checkpoint the DeltaNet hidden state at each MCMC decision boundary (not every token).
- On reject, restore from snapshot and resample.
- Memory cost ≈ O(block_count × deltanet_layer_count × state_size_per_layer). Manageable at 16K context on 16GB VRAM.
- New llama.cpp patch exposing save_state/restore_state at the slot level. ATLAS already carries custom llama.cpp patches (draft-model embeddings), so the infrastructure for this exists.
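The checkpoint-and-restore flow above can be sketched as follows. This is a toy model of the idea, not the proposed patch: `SlotCheckpoints`, its `save_state`/`restore_state` methods, and `snapshot_bytes` are hypothetical names, the per-layer state is an opaque list of floats, and every dimension in the memory estimate is a made-up placeholder rather than Qwen3.5-9B's real configuration.

```python
import copy


class SlotCheckpoints:
    """Hypothetical slot-level snapshot store keyed by token position."""

    def __init__(self):
        self._snaps = {}  # token position -> deep copy of per-layer states

    def save_state(self, pos, layer_states):
        self._snaps[pos] = copy.deepcopy(layer_states)

    def restore_state(self, pos):
        # Snapshots taken after `pos` belong to the rejected continuation.
        for later in [p for p in self._snaps if p > pos]:
            del self._snaps[later]
        return copy.deepcopy(self._snaps[pos])

    def num_snapshots(self):
        return len(self._snaps)


def snapshot_bytes(block_count, n_recurrent_layers, state_elems_per_layer, bytes_per_elem):
    """Total bytes if one snapshot per block boundary is kept for every recurrent layer."""
    return block_count * n_recurrent_layers * state_elems_per_layer * bytes_per_elem


# Toy decode: two "layers", snapshot at each block boundary.
layers = [[0.0, 0.0], [1.0, 1.0]]
ckpt = SlotCheckpoints()
ckpt.save_state(0, layers)
for s in layers:          # block 1 mutates the recurrent state
    s[0] += 0.5
ckpt.save_state(256, layers)
for s in layers:          # block 2 mutates it further
    s[1] -= 0.25
layers = ckpt.restore_state(256)  # block 2 rejected: roll back to its start

# Placeholder dimensions: 64 boundaries (16K context / 256-token blocks),
# 24 recurrent layers, 16 heads x 128 x 128 state elements, 2-byte elements.
total = snapshot_bytes(64, 24, 16 * 128 * 128, 2)
print(total / 2**20, "MiB")
```

With these placeholder dimensions the snapshots alone come to 768 MiB. The real figure depends on the model's actual layer count, head layout, and state precision, and on how many block boundaries are kept alive at once.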
Scope
V3.2 exploratory — not blocking V3.1. Goal is to prototype on a small thinking benchmark (IFBench or GPQA Diamond), measure token savings vs quality delta vs baseline, and decide whether to fold into the mainline pipeline.
Help wanted
- Anyone with experience patching llama.cpp sampler chains and/or hybrid architectures (Mamba, RWKV, DeltaNet).
- Folks who've implemented MCMC-style inference on recurrent models.
- Prototypers interested in measuring the actual compute savings vs quality trade-off on DeltaNet-hybrid models.
References
- Reasoning with Sampling paper
- ATLAS patched llama.cpp: inference/Dockerfile.v31 carries a draft-model embeddings patch, a reference point for adding more inference-time patches.
- Original suggestion surfaced in community comments on a prior GitHub issue; attribution will go to the submitter when they confirm the handle.