[V3.2 Exploratory] Reasoning with Sampling: in-generation MCMC pruning to complement Geometric Lens #40

@itigges22

Summary

Add Reasoning with Sampling (Karan and Du, Harvard, Oct 2025) as an inference-time option in our patched llama.cpp. The method runs MCMC over the model's own token likelihoods during decoding, so generation can backtrack and resample a block when token-level confidence falls off, instead of committing to a bad trajectory for the full context budget.
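To make the accept/reject step concrete, here is a minimal Metropolis-Hastings sketch on a toy model. It assumes the target is the sharpened distribution p(x)^alpha and that proposals resample the suffix from the model itself, in which case the acceptance log-ratio reduces to (alpha - 1) × (logp_new - logp_old). The 3-token Markov chain, `alpha`, and every function name are illustrative stand-ins, not the paper's code and not anything in llama.cpp.

```python
import math
import random

VOCAB = [0, 1, 2]
# Toy stand-in for the LM: next-token probabilities depend only on the last token.
TRANSITIONS = {0: [0.7, 0.2, 0.1], 1: [0.1, 0.6, 0.3], 2: [0.3, 0.3, 0.4]}

def next_token_probs(prefix):
    return TRANSITIONS[prefix[-1] if prefix else 0]

def suffix_logprob(prefix, suffix):
    """Log-probability the toy model assigns to `suffix` given `prefix`."""
    seq, logp = list(prefix), 0.0
    for tok in suffix:
        logp += math.log(next_token_probs(seq)[tok])
        seq.append(tok)
    return logp

def sample_suffix(prefix, n, rng):
    """Sample n tokens autoregressively from the toy model."""
    seq = list(prefix)
    for _ in range(n):
        seq.append(rng.choices(VOCAB, weights=next_token_probs(seq))[0])
    return seq[len(prefix):]

def accept_logratio(logp_old, logp_new, alpha):
    """Log MH ratio for target p^alpha when proposals are drawn from p itself."""
    return (alpha - 1.0) * (logp_new - logp_old)

def mh_resample(seq, alpha, steps, rng):
    """Repeatedly pick a cut point, resample the suffix, and accept or reject it."""
    seq, accepted = list(seq), 0
    for _ in range(steps):
        cut = rng.randrange(len(seq))          # uniform cut point (symmetric choice)
        prefix, old = seq[:cut], seq[cut:]
        new = sample_suffix(prefix, len(old), rng)
        log_a = accept_logratio(suffix_logprob(prefix, old),
                                suffix_logprob(prefix, new), alpha)
        if rng.random() < math.exp(min(0.0, log_a)):
            seq, accepted = prefix + new, accepted + 1
    return seq, accepted

rng = random.Random(0)
start = sample_suffix([], 12, rng)
final, n_accepted = mh_resample(start, alpha=4.0, steps=200, rng=rng)
```

With alpha = 1 the ratio is always 1 and every proposal is accepted, which is plain sampling. Larger alpha makes the chain increasingly reluctant to move to a lower-likelihood suffix.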

Bottleneck it solves

Today every PlanSearch candidate runs to the full 8K-token thinking budget before the Geometric Lens can score and reject it with G(x). A huge fraction of those tokens is wasted on trajectories that were already going off the rails at token 500 — we only learn they were bad after we've paid the full decode cost.

In-generation backtracking kills those candidates early. The saved compute buys more attempts at harder problems (more PlanSearch candidates, more refinement rounds, longer derivation chains), or it just lowers latency. Either way it moves score without changing any weights.

Complementarity with G(x)

Not a replacement but an additional layer that runs earlier and at finer granularity: token-block pruning during generation, versus whole-candidate selection after generation.

| Layer | When | Granularity | Decision |
| --- | --- | --- | --- |
| Reasoning with Sampling (MCMC) | During decode | ~100-500 token blocks | Accept/reject block |
| Geometric Lens (C(x), G(x)) | After decode | Whole 8K-token candidates | Rank and pick best |
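A rough sketch of how the two layers could compose, and where the token savings would come from. Everything here is hypothetical: `block_conf` stands in for whatever per-block signal the MCMC step uses, `lens_score` stands in for G(x) (assumed higher-is-better), and the in-decode layer is simplified to abandoning a candidate at its first low-confidence block rather than resampling it.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

BLOCK_TOKENS = 256  # assumed block size, inside the ~100-500 token range

@dataclass
class Candidate:
    block_conf: List[float]  # hypothetical per-block confidence, in decode order
    lens_score: float        # hypothetical stand-in for the Geometric Lens score

def decode_with_pruning(c: Candidate, min_conf: float) -> Tuple[int, bool]:
    """In-decode layer: stop paying for tokens at the first low-confidence block."""
    spent = 0
    for conf in c.block_conf:
        spent += BLOCK_TOKENS
        if conf < min_conf:
            return spent, False
    return spent, True

def run_pipeline(cands: List[Candidate], min_conf: float) -> Tuple[Optional[Candidate], int]:
    """Prune during decode, then rank the survivors after decode."""
    total_tokens, survivors = 0, []
    for c in cands:
        spent, ok = decode_with_pruning(c, min_conf)
        total_tokens += spent
        if ok:
            survivors.append(c)
    best = max(survivors, key=lambda c: c.lens_score, default=None)
    return best, total_tokens

cands = [
    Candidate(block_conf=[0.9, 0.8, 0.9, 0.7], lens_score=0.6),
    Candidate(block_conf=[0.9, 0.2, 0.9, 0.9], lens_score=0.9),  # derails at block 2
]
best, tokens_with_pruning = run_pipeline(cands, min_conf=0.5)
tokens_without_pruning = sum(len(c.block_conf) for c in cands) * BLOCK_TOKENS
```

In this toy run the second candidate is cut after two blocks, so the pipeline spends 1536 tokens instead of 2048, and the post-decode ranking only ever sees the surviving candidate.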

DeltaNet catch

Qwen3.5-9B is a hybrid architecture (attention + DeltaNet layers). Backtracking is the hard part.

  • Attention layers: easy — KV-cache truncate-by-N, already supported in llama.cpp.
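  • DeltaNet layers: non-trivial. The recurrent state update, written here in simplified delta-rule form (learning rate and gating omitted) as S_t = S_{t-1} + (v_t − S_{t-1} k_t) ⊗ k_t, isn't naturally reversible: each step overwrites part of what the state stored along k_t, so we can't just "subtract" a bad block without losing information or accumulating numerical error.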
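The irreversibility can be shown on a 2×2 toy state. This sketch assumes the simplified delta rule with learning rate 1, a unit-norm key, and no gating; real DeltaNet layers use learned coefficients, so treat it as an illustration of the failure mode rather than the model's exact update. Two different previous states land on the same next state, so no inverse step can tell them apart.

```python
def delta_update(S, k, v):
    """One simplified delta-rule step: S + (v - S k) outer k, with S as d_v x d_k rows."""
    pred = [sum(S[i][j] * k[j] for j in range(len(k))) for i in range(len(S))]
    return [[S[i][j] + (v[i] - pred[i]) * k[j] for j in range(len(k))]
            for i in range(len(S))]

k = [1.0, 0.0]   # unit-norm key (assumption)
v = [2.0, 3.0]

S_a = [[1.0, 5.0], [4.0, 6.0]]
S_b = [[9.0, 5.0], [-7.0, 6.0]]   # differs from S_a only along the key direction

next_a = delta_update(S_a, k, v)
next_b = delta_update(S_b, k, v)
# Both results are [[2.0, 5.0], [3.0, 6.0]]: the old column along k is gone.
```

This is why the workaround below snapshots the state instead of trying to undo updates.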

Proposed workaround (inference-only, no architecture change):

  • Checkpoint the DeltaNet hidden state at each MCMC decision boundary (not every token).
  • On reject, restore from snapshot and resample.
  • Memory cost ≈ O(block_count × num_deltanet_layers × state_size_per_layer). Should be manageable at 16K context on 16GB VRAM.
  • New llama.cpp patch exposing save_state/restore_state at the slot level — ATLAS already carries custom llama.cpp patches (draft-model embeddings) so the infrastructure for this exists.
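The checkpoint-and-restore flow above can be sketched as follows. `ToyRecurrentSlot` is a hypothetical Python stand-in for a slot with a single recurrent state; its `save_state`/`restore_state` methods only mirror the names proposed for the patch and are not existing llama.cpp functions. The per-token update is the same simplified delta rule as above, with keys and values derived from the token id purely for illustration.

```python
import copy

class ToyRecurrentSlot:
    """Hypothetical slot holding one dim x dim recurrent state plus its token history."""

    def __init__(self, dim):
        self.dim = dim
        self.state = [[0.0] * dim for _ in range(dim)]
        self.tokens = []
        self._snapshots = []  # stack of (token_count, state copy), one per boundary

    def step(self, token):
        d = self.dim
        k = [1.0 if j == token % d else 0.0 for j in range(d)]  # toy one-hot key
        v = [float(token + i) for i in range(d)]                # toy value vector
        pred = [sum(self.state[i][j] * k[j] for j in range(d)) for i in range(d)]
        for i in range(d):
            for j in range(d):
                self.state[i][j] += (v[i] - pred[i]) * k[j]
        self.tokens.append(token)

    def save_state(self):
        """Snapshot at an MCMC decision boundary (not every token)."""
        self._snapshots.append((len(self.tokens), copy.deepcopy(self.state)))

    def restore_state(self):
        """Roll back to the most recent boundary after a rejected block."""
        n_tokens, state = self._snapshots[-1]
        self.tokens = self.tokens[:n_tokens]
        self.state = copy.deepcopy(state)

def checkpoint_bytes(block_count, n_layers, state_elems_per_layer, bytes_per_elem):
    """Snapshot memory if every boundary keeps a full copy of every layer's state."""
    return block_count * n_layers * state_elems_per_layer * bytes_per_elem

slot = ToyRecurrentSlot(dim=4)
for t in [3, 1, 4]:
    slot.step(t)
slot.save_state()
for t in [1, 5, 9]:      # block that gets rejected
    slot.step(t)
slot.restore_state()
for t in [2, 6, 5]:      # resampled block
    slot.step(t)
```

After the restore, the slot behaves as if the rejected block had never been decoded. For the memory bullet, `checkpoint_bytes(32, 24, 16 * 128 * 128, 4)` comes to 768 MiB; those four numbers are placeholders, not Qwen3.5-9B's real layer count or state dimensions, so the actual figure needs to be measured on the model.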

Scope

V3.2 exploratory — not blocking V3.1. Goal is to prototype on a small thinking benchmark (IFBench or GPQA Diamond), measure token savings vs quality delta vs baseline, and decide whether to fold into the mainline pipeline.

Help wanted

  • Anyone with experience patching llama.cpp sampler chains and/or hybrid architectures (Mamba, RWKV, DeltaNet).
  • Folks who've implemented MCMC-style inference on recurrent models.
  • Prototypers interested in measuring the actual compute savings vs quality trade-off on DeltaNet-hybrid models.

References

  • Reasoning with Sampling paper
  • ATLAS patched llama.cpp: inference/Dockerfile.v31 carries a draft-model embeddings patch — reference point for adding more inference-time patches.
  • Original suggestion surfaced in community comments on a prior GitHub issue; attribution will go to the submitter when they confirm the handle.

Labels: enhancement, help wanted
