[V3.2 Exploratory] Reasoning with Sampling: in-generation MCMC pruning to complement Geometric Lens

## Summary

Add [Reasoning with Sampling](https://arxiv.org/abs/2510.14901) (Harvard, Oct 2025) as an inference-time option in our patched llama.cpp. The method runs Metropolis-Hastings-style MCMC during decoding: it resamples blocks of tokens and accepts or rejects each resampled block based on its likelihood under the model. That lets generation back out of a low-likelihood block and resample it, instead of committing to a bad trajectory for the full thinking budget.
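A minimal sketch of the accept/reject decision for one resampled block. It assumes the target is the sharpened distribution p(x)^α and that the proposal resamples the block from the base model p at temperature 1; under that assumption the Metropolis-Hastings ratio reduces to (p(new)/p(old))^(α−1). The function name and arguments are hypothetical, and the paper's actual proposal distribution may differ.

```python
import math
import random


def accept_block(logp_new: float, logp_old: float, alpha: float,
                 rng: random.Random) -> bool:
    """Metropolis-Hastings accept/reject for one resampled block.

    logp_new / logp_old are the summed token log-probs of the resampled
    and the current block under the base model (same prefix for both).
    Assumption: target is p^alpha, proposal is the base model p itself.
    """
    log_ratio = (alpha - 1.0) * (logp_new - logp_old)
    if log_ratio >= 0.0:
        return True  # new block is at least as likely: always accept
    return rng.random() < math.exp(log_ratio)
```

With `alpha = 1` every proposal is accepted, which recovers plain sampling; larger `alpha` makes the chain increasingly reluctant to accept a less likely block.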

## Bottleneck it solves

Today every PlanSearch candidate runs to the full 8K-token thinking budget before the Geometric Lens can score and reject it with G(x). A huge fraction of those tokens is wasted on trajectories that were already going off the rails at token 500 — we only learn they were bad after we've paid the full decode cost.

In-generation backtracking kills those candidates early. The saved compute buys more attempts at harder problems (more PlanSearch candidates, more refinement rounds, longer derivation chains), or it just lowers latency. Either way it can improve scores without changing any weights, as long as the resampling overhead stays below the decode cost it saves.

## Complementarity with G(x)

Not a replacement but an extra layer that acts earlier in the pipeline. Think of it as **token-block pruning during generation** vs **whole-candidate selection after generation**:

| Layer                          | When          | Granularity               | Decision            |
|--------------------------------|---------------|---------------------------|---------------------|
| Reasoning with Sampling (MCMC) | During decode | ~100-500-token blocks     | Accept/reject block |
| Geometric Lens (C(x), G(x))    | After decode  | Whole 8K-token candidates | Rank and pick best  |

## DeltaNet catch

Qwen3.5-9B is a **hybrid architecture** (attention + DeltaNet layers). Backtracking is the hard part.

- **Attention layers:** easy — KV-cache truncate-by-N, already supported in llama.cpp.
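- **DeltaNet layers:** non-trivial. The recurrent state update (`S_t = S_{t-1} + β_t (v_t − S_{t-1} k_t) k_t^T`, the delta rule with write strength `β_t`) overwrites part of the previous state, so it isn't naturally reversible. Can't just "subtract" a bad block: inverting each step is ill-conditioned at best and accumulates numerical error.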
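To make the reversibility problem concrete, here is a small pure-Python sketch. It assumes the plain delta-rule form `S_t = S_{t-1} + β_t (v_t − S_{t-1} k_t) k_t^T` (no gating or decay, which the real layers may add) and uses a hypothetical `delta_step` helper. With `β = 1` and a unit-norm key, two different previous states collapse to the same next state, so the previous state cannot be recovered from the new one.

```python
def delta_step(S, k, v, beta=1.0):
    """One delta-rule update: S + beta * (v - S k) k^T, with S as a list of rows."""
    # S k: the value the state currently retrieves for key k
    Sk = [sum(row[j] * k[j] for j in range(len(k))) for row in S]
    return [
        [S[i][j] + beta * (v[i] - Sk[i]) * k[j] for j in range(len(k))]
        for i in range(len(S))
    ]


k = [1.0, 0.0]                      # unit-norm key
v = [5.0, 6.0]                      # value written under that key
S_a = [[1.0, 2.0], [3.0, 4.0]]
S_b = [[-7.0, 2.0], [9.0, 4.0]]     # differs from S_a only along k

# Both states map to the same successor: the old content along k is gone.
assert delta_step(S_a, k, v) == delta_step(S_b, k, v)
```

For `β < 1` the step is invertible in exact arithmetic, but the inverse amplifies rounding error, which is why snapshotting looks safer than algebraic undo.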

**Proposed workaround (inference-only, no architecture change):**
- Checkpoint the DeltaNet hidden state at each MCMC decision boundary (not every token).
- On reject, restore from snapshot and resample.
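- Memory cost ≈ O(live_checkpoints × deltanet_layer_count × state_size_per_layer). The recurrent state has a fixed size that does not grow with context length, so this should be manageable at 16K context on 16GB VRAM.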
- New llama.cpp patch exposing `save_state`/`restore_state` at the slot level — ATLAS already carries custom llama.cpp patches (draft-model embeddings) so the infrastructure for this exists.
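The checkpoint-and-restore flow above can be sketched as follows. `ToySlot` is a hypothetical stand-in for a llama.cpp slot, not the real API; its `save_state`/`restore_state` methods only mirror the names proposed in this issue, and the "recurrent" update is a toy lossy recurrence rather than DeltaNet.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToySlot:
    """Hypothetical stand-in for a llama.cpp slot (not the real API)."""
    kv_len: int = 0                                         # attention KV length
    recurrent: list = field(default_factory=lambda: [0.0])  # toy recurrent state

    def decode(self, token: int) -> None:
        self.kv_len += 1
        self.recurrent[0] = 0.5 * self.recurrent[0] + token  # lossy update

    def save_state(self):
        return (self.kv_len, copy.deepcopy(self.recurrent))

    def restore_state(self, snap) -> None:
        # Copy again so the same snapshot can be restored more than once.
        self.kv_len, recurrent = snap
        self.recurrent = copy.deepcopy(recurrent)


def decode_block_with_rollback(slot, propose_block, accept, max_tries=4):
    """Snapshot at the block boundary, decode a block, restore on reject."""
    snap = slot.save_state()
    for _ in range(max_tries):
        block = propose_block()
        for tok in block:
            slot.decode(tok)
        if accept(block):
            return block
        slot.restore_state(snap)
    return None  # all proposals rejected; slot is back at the boundary
```

Only one snapshot is live per decision boundary in this sketch, so the memory overhead is a single copy of the recurrent state until the block is accepted.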

## Scope

**V3.2 exploratory**, not blocking V3.1. Goal is to prototype on a small thinking benchmark (IFBench or GPQA Diamond), measure token savings and quality delta against the baseline, and decide whether to fold into the mainline pipeline.
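One way the two headline numbers could be computed, as a sketch only. `RunStats` and `compare` are hypothetical names, and the key assumption is that `tokens_decoded` counts every decoded token, including blocks that were later rejected, so the MCMC overhead is not hidden.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    tokens_decoded: int  # all decoded tokens, rejected blocks included
    correct: int         # problems solved
    total: int           # problems attempted


def compare(baseline: RunStats, candidate: RunStats) -> dict:
    """Relative token savings and accuracy delta of candidate vs baseline."""
    token_savings = 1.0 - candidate.tokens_decoded / baseline.tokens_decoded
    quality_delta = (candidate.correct / candidate.total
                     - baseline.correct / baseline.total)
    return {"token_savings": token_savings, "quality_delta": quality_delta}
```

A negative `token_savings` would mean the resampling overhead outweighed the pruning benefit on that benchmark.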

## Help wanted

- Anyone with experience patching llama.cpp sampler chains and/or hybrid architectures (Mamba, RWKV, DeltaNet).
- Folks who've implemented MCMC-style inference on recurrent models.
- Prototypers interested in measuring the actual compute savings vs quality trade-off on DeltaNet-based models.

## References

- [Reasoning with Sampling paper](https://arxiv.org/abs/2510.14901)
- ATLAS patched llama.cpp: `inference/Dockerfile.v31` carries a draft-model embeddings patch — reference point for adding more inference-time patches.
- Original suggestion surfaced in community comments on a prior GitHub issue; attribution will go to the submitter when they confirm the handle.
