Unify singleton and batch proving #10

Closed
RadNi wants to merge 46 commits into a16z:main from LayerZero-Labs:amir/batch-refactor
RadNi commented Apr 22, 2026

Summary

Moves every prover/verifier entry point onto the batched code path and drops the parallel singleton plumbing. Singleton openings are now just WitnessShape::singleton() through the same functions.

Unified prove/verify

The root-level variants that existed so far (four on the prover side) collapse into one per role:

| Before | After |
| --- | --- |
| prove_one_level, prove_batched_root_level[_with_points], prove_multipoint_batched_root_level, prove_same_point_batched | prove_root_level |
| verify_batched_root_level, verify_multipoint_batched_root_level, verify_same_point_batched | verify_root_level |
| prove_batched_recursive_suffix | prove_recursive_suffix |

QuadraticEquation::{new_batched_prover, new_multipoint_batched_prover} are removed — only new_prover remains.

Proof shape

HachiBatchedRootProof and HachiBatchedProofShape become enums with Fold { .. } and Direct { .. } variants (new HachiBatchedFoldRoot carries the fold-rooted payload). This cleans up the case where the root is a direct witness handoff (very small num_vars) vs a fold.
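
A minimal sketch of the Fold/Direct split described above. The type names carry a `Sketch` suffix and every field is hypothetical; the PR text only states that the enums have `Fold { .. }` and `Direct { .. }` variants and that a fold-rooted payload type exists.

```rust
/// Hypothetical stand-in for the fold-rooted payload (the PR calls it
/// `HachiBatchedFoldRoot`; its real fields are not shown in the description).
#[derive(Debug, Clone, PartialEq)]
pub struct FoldRootSketch {
    /// Illustrative field only.
    pub num_fold_rounds: usize,
}

/// Root proof that is either fold-rooted or a direct witness handoff.
#[derive(Debug, Clone, PartialEq)]
pub enum RootProofSketch {
    /// The root level runs a fold and carries its payload.
    Fold { root: FoldRootSketch },
    /// Very small `num_vars`: the witness is handed over directly.
    Direct { witness: Vec<u64> },
}

impl RootProofSketch {
    /// True when the root is a direct witness handoff.
    pub fn is_direct(&self) -> bool {
        matches!(self, RootProofSketch::Direct { .. })
    }

    /// Human-readable summary; callers match once instead of checking flags.
    pub fn describe(&self) -> String {
        match self {
            RootProofSketch::Fold { root } => {
                format!("fold root with {} rounds", root.num_fold_rounds)
            }
            RootProofSketch::Direct { witness } => {
                format!("direct handoff of {} witness elements", witness.len())
            }
        }
    }
}
```

With an enum, a verifier-side `match` has to handle both cases explicitly, which is presumably what makes the small-`num_vars` case cleaner than optional fields on a single struct.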

Schedule planner

  • WitnessShape = { num_claims, num_commitment_groups, num_points } is the single batch descriptor; the old BatchConfig alias is gone.
  • The planner's process-global DP cache is removed. find_optimal_schedule now consults the offline schedule tables (Cfg::schedule_plan) first — every (Cfg, num_vars, WitnessShape) case that ships with the crate is a keyed row — and only falls back to the DP for shapes without an entry.
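
A rough sketch of the table-first lookup described in the bullets above. Only the three `WitnessShape` field names and `singleton()` come from the PR text; the `Schedule` alias, the `ScheduleSource` enum, `find_schedule_sketch`, and the omission of the `Cfg` key component are assumptions made to keep the example self-contained.

```rust
use std::collections::HashMap;

/// Batch descriptor with the three fields named in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct WitnessShape {
    pub num_claims: usize,
    pub num_commitment_groups: usize,
    pub num_points: usize,
}

impl WitnessShape {
    /// Assumed meaning of `singleton()`: one claim, one group, one point.
    pub const fn singleton() -> Self {
        Self { num_claims: 1, num_commitment_groups: 1, num_points: 1 }
    }
}

/// Hypothetical schedule representation (per-level parameters).
pub type Schedule = Vec<usize>;

/// Where a schedule came from; hypothetical, used only for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ScheduleSource {
    Table,
    Dp,
}

/// Consult the offline table first, then fall back to the DP planner.
/// The real key also includes `Cfg`; it is dropped here for brevity.
pub fn find_schedule_sketch(
    table: &HashMap<(usize, WitnessShape), Schedule>,
    num_vars: usize,
    shape: WitnessShape,
    dp_fallback: impl Fn(usize, WitnessShape) -> Schedule,
) -> (Schedule, ScheduleSource) {
    match table.get(&(num_vars, shape)) {
        Some(row) => (row.clone(), ScheduleSource::Table),
        None => (dp_fallback(num_vars, shape), ScheduleSource::Dp),
    }
}
```

Because the table is passed in rather than cached process-globally, two calls with the same key always see the same row, and shapes without a row still get a schedule through the fallback.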

omibo and others added 30 commits February 27, 2026 17:18
* chore: add toolchain and formatting config

Pin Rust 1.88 with minimal profile (cargo, rustc, clippy, rustfmt).

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(ci): switch to actions-rust-lang/setup-rust-toolchain

Respects rust-toolchain.toml automatically. Also normalize clippy
flags to use --all --all-targets consistently.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(primitives): add u128/i128 serialization support

Required by the Fp128 field backend.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(algebra): add prime fields, extensions, and modules

Introduces the algebra module with:
- Fp32/Fp64/Fp128 prime field backends with branchless constant-time
  add/sub/neg and rejection-sampled random
- U256 helper for Fp128 wide multiplication
- Fp2/Fp4 tower extensions with Karatsuba-ready structure
- VectorModule<F, N> fixed-length vector module over any field
- Poly<F, D> fixed-size polynomial container

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(algebra): add NTT small-prime arithmetic and CRT helpers

Adds the ntt submodule with:
- NttPrime: per-prime Montgomery-like fpmul, Barrett-like fpred,
  branchless csubq/caddq/center
- LimbQ/QData: radix-2^14 limb arithmetic for big-q coefficients
- logq=32 parameter preset (six NTT-friendly primes, CRT constants)

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(algebra): add comprehensive algebra test suite

24 tests covering:
- Field arithmetic, identities, and distributivity (Fp32/Fp64/Fp128)
- Zero inversion returns None
- Serialization round-trips (all field types, extensions, VectorModule)
- Fp2 conjugate, norm, and distributivity
- U256 wide multiply and bit access
- LimbQ round-trip and add/sub inverse
- QData consistency with preset constants
- NTT normalize range and fpmul commutativity
- Poly add/sub/neg

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs: add and update progress tracking document

Records Phase 0 status: all field types, extensions, NTT scaffolding,
constant-time arithmetic, and 24-test suite. Reflects the
fields/ntt/module/poly directory layout.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(ntt): Rust-ify NTT/CRT port from C

Overhaul the NTT small-prime arithmetic and CRT modules:

- Add MontCoeff newtype (#[repr(transparent)] i16 wrapper) to enforce
  Montgomery-domain vs canonical-domain separation at the type level
- NttPrime methods now take/return MontCoeff instead of bare i16:
  fpmul→mul, fpred→reduce, csubq→csubp, caddq→caddp
- Add domain conversion: from_canonical (i16→Mont), to_canonical (Mont→i16)
- Delete free functions (pointwise_mul etc), replaced by methods on NttPrime
- LimbQ: replace add_limbs/sub_limbs/less_than with std Add/Sub/Ord impls
- LimbQ: replace from_u128/to_u128 with From<u128>/TryFrom for u128
- LimbQ: add Display impl, branchless csub_mod
- Rename all LABRADOR* constants to project-native Q32_* names
- Add #[cfg(test)] verification that re-derives pinv/v/mont/montsq from p
- Add MontCoeff round-trip and LimbQ ordering tests (28 total)

Co-authored-by: Cursor <cursoragent@cursor.com>
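
A minimal sketch of the domain separation that a `MontCoeff` newtype gives, assuming a single illustrative prime (12289) and R = 2^16. The constant names, the unsigned internal reduction, and the restriction to non-negative canonical inputs are assumptions; only the `MontCoeff`, `from_canonical`, `to_canonical`, and `mul` names come from the commit message.

```rust
/// Illustrative NTT-friendly prime that fits in i16 (not a project preset).
const P: u32 = 12289;

/// -P^{-1} mod 2^16, re-derived from P by Newton iteration.
const fn neg_p_inv() -> u32 {
    let mut inv: u32 = P; // P * P = 1 (mod 8) for odd P: 3 correct low bits
    let mut i = 0;
    while i < 4 {
        // Each step doubles the number of correct low bits.
        inv = inv.wrapping_mul(2u32.wrapping_sub(P.wrapping_mul(inv)));
        i += 1;
    }
    inv.wrapping_neg() & 0xFFFF
}
const NEG_P_INV: u32 = neg_p_inv();

/// R^2 mod P with R = 2^16, used to enter the Montgomery domain.
const R2: u32 = {
    let r = (1u32 << 16) % P;
    (r * r) % P
};

/// Montgomery reduction: returns t * R^{-1} mod P for t < P * 2^16.
fn redc(t: u32) -> u32 {
    let m = ((t & 0xFFFF) * NEG_P_INV) & 0xFFFF;
    let u = (t + m * P) >> 16; // exact division: low 16 bits cancel
    if u >= P { u - P } else { u }
}

/// Value in Montgomery form; a bare i16 cannot be passed where this is expected.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(transparent)]
pub struct MontCoeff(i16);

impl MontCoeff {
    /// Canonical value in [0, P) to Montgomery form (x * R mod P).
    pub fn from_canonical(x: i16) -> Self {
        debug_assert!(x >= 0 && (x as u32) < P);
        MontCoeff(redc(x as u32 * R2) as i16)
    }

    /// Montgomery form back to the canonical value in [0, P).
    pub fn to_canonical(self) -> i16 {
        redc(self.0 as u32) as i16
    }

    /// Product of two Montgomery-form values, still in Montgomery form.
    pub fn mul(self, rhs: Self) -> Self {
        MontCoeff(redc(self.0 as u32 * rhs.0 as u32) as i16)
    }
}
```

The point of the wrapper is that mixing domains becomes a type error: `mul` only accepts `MontCoeff`, so a canonical `i16` has to go through `from_canonical` first.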

* chore: remove section banners, update progress doc

Remove // ---- Section ---- banner comments from prime.rs and crt.rs.
Add non-negotiable rules to HACHI_PROGRESS.md:
- No section-banner comments
- No commit/push without explicit user approval

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(ring): add CyclotomicRing, CyclotomicNtt, and NTT butterfly

Milestone 1 - CyclotomicRing<F, D> (coefficient form):
- Schoolbook negacyclic convolution Mul (X^D = -1)
- Add/Sub/Neg/AddAssign/SubAssign/MulAssign, scale, zero/one/x
- HachiSerialize/HachiDeserialize

Milestone 2 - NTT butterfly + CyclotomicNtt<K, D>:
- Merged negacyclic Cooley-Tukey forward NTT (twist folded into twiddles)
- Gentleman-Sande inverse NTT with D^{-1} scaling
- Runtime primitive-root finder and twiddle table computation
  (TODO: migrate to compile-time const tables)
- CyclotomicNtt with per-prime pointwise Add/Sub/Neg/Mul
- Ring<->Ntt transforms with CRT reconstruction

Co-authored-by: Cursor <cursoragent@cursor.com>
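
For reference, schoolbook negacyclic convolution (X^D = -1) can be sketched as below. This is a standalone illustration over `u64` coefficients with a runtime modulus `q`, not the `CyclotomicRing<F, D>` implementation; the function name and signature are assumptions.

```rust
/// Multiply two polynomials in Z_q[X]/(X^D + 1) with the O(D^2) schoolbook
/// method. Inputs must already be reduced mod `q`, and `q` must be below 2^63.
pub fn negacyclic_mul<const D: usize>(a: &[u64; D], b: &[u64; D], q: u64) -> [u64; D] {
    let mut out = [0u64; D];
    for i in 0..D {
        for j in 0..D {
            let prod = ((a[i] as u128 * b[j] as u128) % q as u128) as u64;
            let k = i + j;
            if k < D {
                // Ordinary term: contributes to X^k.
                out[k] = (out[k] + prod) % q;
            } else {
                // Wrapped term: X^k = -X^(k - D), so subtract instead of add.
                out[k - D] = (out[k - D] + q - prod) % q;
            }
        }
    }
    out
}
```

For example, with D = 4 and q = 17, multiplying X^3 by X gives X^4 = -1, which is the coefficient vector `[16, 0, 0, 0]`.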

* test(algebra): add ring and NTT tests, wrap in mod tests

Add 12 new tests:
- CyclotomicRing: negacyclic X^D=-1, mul identity/zero, commutativity,
  distributivity, associativity, additive inverse, serde, degree-64
- NTT: forward/inverse round-trip (single prime + all primes),
  NTT mul matches schoolbook cross-check

Wrap all integration tests in a single mod tests block and remove
section-banner comments.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(algebra): harden ring-NTT conversion and field decoding

Constrain ring/NTT conversions to explicit field backends and replace fragile CRT reconstruction with deterministic modular lifting. Enforce canonical deserialization checks in validated field decoding paths to reject malformed encodings.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(algebra): add CRT round-trip and serialization guard coverage

Add end-to-end ring->NTT->ring CRT round-trip tests plus reduced-ops stability checks. Expand serialization coverage for Fp4/Poly and verify checked deserialization rejects non-canonical field encodings.

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(bench): add ring_ntt benchmark target and CT tracking docs

Add a dedicated ring/NTT benchmark harness and register it in Cargo metadata. Record current constant-time review status and sync the implementation progress board with new milestones and test coverage.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(field): split core, canonical, and sampling capabilities

Break the monolithic Field trait into FieldCore, CanonicalField, and FieldSampling, and update algebra primitives to depend on explicit capabilities for cleaner semantics and future backend integration.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(fields): add pow2-offset pseudo-mersenne registry and checks

Introduce the curated 2^k-offset prime registry and typed field aliases, then add dedicated Miller-Rabin regression tests to enforce probable primality for all enabled profiles.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(ring): introduce crt-ntt backend/domain layering

Rename the ring NTT representation to explicit CRT+NTT semantics and route conversions through backend traits, adding scalar backend and domain aliases for a cleaner representation-vs-execution boundary.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(algebra): cover backend parity and pow2-offset invariants

Expand algebra tests to validate default-vs-backend CRT+NTT equivalence, sampling bounds, and pow2-offset registry consistency under the new field and ring abstractions.

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs(algebra): update progress notes and add prime analysis references

Refresh progress and constant-time notes to match the new CRT+NTT naming and field scope, and add the NTT prime analysis document plus local NIST standards artifacts used for parameter rationale.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(algebra): harden fp128 reduction and CRT reconstruction arithmetic

Make Fp128 reduction and CRT inner accumulation paths more timing-stable with branchless modular operations, and refresh ring/docs/tests status after the hardening cleanup pass.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(protocol): add transcript and commitment scaffold

Introduce Hachi protocol-layer interfaces and placeholder types with Blake2b/Keccak transcript backends plus phase-aligned labels, while making transcript absorption label-directed at call sites.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(protocol): add transcript and commitment contract coverage

Add deterministic transcript schedule checks (including keccak) and protocol commitment contract tests so transcript ordering and challenge derivation behavior are locked down.

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs(protocol): align transcript spec and progress status

Document the protocol scaffold as in-progress, capture the commitment-focused transcript label vocabulary, and clarify deferred Jolt adapter expectations.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(protocol): add ring commitment core and seeded matrix derivation

Implement the ring-native commitment setup/commit core with config validation, utility modules, and seeded domain-separated public matrix derivation, while wiring prover/verifier stub modules for the next open-check phase.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(protocol): consolidate ring commitment and stub contract coverage

Unify ring commitment core and config validation checks in one test file and add explicit prover/verifier stub contract tests to lock current placeholder behavior before open-check implementation.

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs(progress): update phase 2 status after commitment core landing

Record that ring-native §4.1 commitment setup/commit and protocol wiring are in place, and clarify that open-check prove/verify remains the next unfinished protocol milestone.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(algebra): harden CT inversion path and CRT final projection

Add a constant-time inversion helper for prime fields and replace scalar CRT's final `% q` projection with a division-free fixed-iteration reducer, so secret-bearing arithmetic paths avoid variable-latency behavior.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(algebra): rename inversion helper API without ct suffix

Rename the secret-path inversion helper to `Invertible::inv_or_zero` while preserving constant-time semantics via doc contracts, and update CT tracking docs to match the new API names.

Co-authored-by: Cursor <cursoragent@cursor.com>
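
One common way to get an inversion that maps zero to zero with a fixed number of iterations is Fermat exponentiation, a^(p-2). The sketch below assumes that approach (the commit does not say which algorithm the helper uses) and uses the Mersenne prime 2^61 - 1 purely for illustration. It shows the fixed iteration count and the 0 -> 0 behavior only; it makes no claim about instruction-level timing, since `%` on `u128` may itself be variable-latency.

```rust
/// Illustrative modulus: the Mersenne prime 2^61 - 1 (not a project field).
const P: u64 = (1u64 << 61) - 1;

fn mul_mod(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}

/// Returns a^{-1} mod P for nonzero a, and 0 for a = 0.
/// The loop always runs 61 times; it branches only on bits of the public
/// exponent P - 2, never on the (possibly secret) input `a`.
pub fn inv_or_zero(a: u64) -> u64 {
    let exponent = P - 2;
    let mut acc = 1u64;
    let mut base = a % P;
    for i in 0..61 {
        let prod = mul_mod(acc, base);
        if (exponent >> i) & 1 == 1 {
            acc = prod;
        }
        base = mul_mod(base, base);
    }
    acc
}
```

Since 0^(p-2) = 0, no special case is needed for zero, which is what allows a name like `inv_or_zero` to describe the contract without a separate check.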

* test(algebra): clean inversion test naming and normalize formatting

Rename the inversion helper test to match the new API naming and keep the ring commitment test formatting consistent after linting.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(protocol): add sumcheck core module and tests

Introduce core sumcheck building blocks (univariate messages, compression, and transcript-driving prover/verifier driver) and add unit/integration tests. Update progress doc to reflect sumcheck core landing.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add reference PDF papers

* Add local agent instruction files

* Add Hachi and SuperNOVA digest docs

* Add general field, ring, and multilinear utilities

* Add sparse Fiat-Shamir challenge sampling

* Implement Polynomial Evaluation as Quadratic Equation

* Rename stub to prover and verifier

* Refactor code organization

* Replace decompose with balanced decompose

* Transform polynomial over Fq to ring

* Refactor function names

* Impl commitment_scheme API

* Add SolinasFp128 backend for sparse 128-bit primes

Introduce `SolinasFp128` with two-fold Solinas reduction for `p = 2^128 - c` (sparse `c`), plus `U256::sqr_u128`. Export descriptive prime aliases, add BigUint-backed correctness tests, and include a Criterion bench for mul/inv.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Tighten docs and minor clippy cleanups

Add missing rustdoc Errors/Panics sections and apply small simplifications suggested by clippy.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add reduction steps to iteration prover

* Optimize Solinas mul/add/sub: fused u64-limb schoolbook + csel canonicalize

Rewrite mul_raw as a fused 2×2 schoolbook multiply with two-fold Solinas
reduction using explicit u64 limbs and mac helper, bypassing U256.
Replace mask-based canonicalize with carry-flag-based pattern that compiles
to adds+adcs+csel+csel (4 insns) instead of 10 on AArch64.
Add pure-mul, sqr, and throughput microbenchmarks.

Made-with: Cursor

* Switch SolinasFp128 repr from u128 to [u64; 2] for 8-byte alignment

Storage is now [u64; 2] (lo, hi) which halves alignment from 16 to 8
bytes, improving struct packing. Arithmetic hot paths convert to u128
for LLVM-optimal codegen (adds/adcs pairs), so no perf regression.

Made-with: Cursor

* Fuse overflow correction with canonicalize in fold2_canonicalize

When fold-2 overflows, the wrapped value s < C², so s + C < C(C+1) < P —
meaning s + C is already canonical. This lets us replace the separate
overflow-correction + canonicalize (3 + 4 insns) with a single fused
`if (overflow | carry) { s + C } else { s }` select, saving 2 instructions
on the critical path. Add compile-time assertion enforcing C(C+1) < P.

Made-with: Cursor

* Unify Fp128 with Solinas-optimized arithmetic, delete SolinasFp128

Replace the generic Fp128<const MODULUS: u128> (binary-long-division via
U256) with the Solinas-optimized implementation. Fp128<const P: u128>
now uses [u64; 2] storage, fused schoolbook 2x2 + two-fold Solinas
reduction (~23 cycles/mul on AArch64/x86-64), and compile-time
validation that P = 2^128 - C with C < 2^64.

Delete SolinasFp128, SolinasParams, solinas128.rs, and u256.rs. All
call sites updated; prime type aliases (Prime128M13M4P0 etc.) are now
simple Fp128<...> aliases in fp128.rs. Blanket PseudoMersenneField impl
for all Fp128<P>.

Made-with: Cursor

* Use git deps for ark-bn254/ark-ff instead of local paths

Switch from local path dependencies to the a16z/arkworks-algebra
git repo (branch dev/twist-shout) so collaborators can compile
without needing a local checkout of arkworks-algebra-jolt.

Made-with: Cursor

* Add template for sumchecks

* Optimize Fp128 mul path and expand Rust field benchmarks.

Refine Fp128 multiply/fold carry handling for better generated code and add isolated, passthrough, independent, and long-chain Rust microbenches to separate latency and throughput effects when comparing against BN254.

Made-with: Cursor

* Add 2^a±1 Fp128 reduction specialization and benches.

Detect C = 2^a ± 1 at compile time and route fold multiplications through a specialized shift-based path with generic fallback, plus add benchmark coverage for sparse 128-bit primes using this shape.

Made-with: Cursor

* Add packed Fp128 field backend scaffolding and focused benchmarks.

This introduces AArch64-first packed field abstractions with a scalar fallback and adds dedicated field-only validation/benchmark coverage before any ring or protocol integration.

Made-with: Cursor

* Refactor packed Fp128 backend to true SoA layout and stabilize benchmarking.

This switches packed lane storage to SoA with NEON add/sub kernels and a SoA mul path, and updates packed-field APIs and benches so scalar-vs-packed latency/throughput comparisons are measured consistently.

Made-with: Cursor

* Optimize packed Fp128 mul throughput with array-backed SoA lanes.

This keeps mul in true SoA form while removing repeated vector transmute overhead and inlining the limb-level Solinas lane kernel, improving packed mul throughput and latency against scalar baselines.

Made-with: Cursor

* Add Fp128 widening multiply API and specialized Solinas reduction

Expose mul_wide_u64, mul_wide, mul_wide_u128, solinas_reduce, and
to_limbs for deferred-reduction patterns needed by jolt-hachi.
Hand-optimized reduce paths for 3/4/5 limbs avoid generic loop
overhead. Refactor mul_raw to reuse mul_wide + reduce_4 (zero
overhead). Add 9 unit tests and widening/accumulator benchmarks.

Made-with: Cursor

* Clean up fp128: remove section banners, hoist std::ops imports, rename mul_wide free fn

Rename free function mul_wide → mul64_wide to avoid shadowing
Fp128::mul_wide. Move reduce_4 next to fold2_canonicalize. Replace
fully qualified std::ops::{Add,Sub,Mul,Neg} with use imports.

Made-with: Cursor

* Constrain Fp32/Fp64 to pseudo-Mersenne primes with Solinas reduction

Rework fp32.rs and fp64.rs to require p = 2^k - c (small c), matching
fp128's design. Compile-time constants BITS/C/MASK derived from P with
static assertions. Replace bit-serial reduction with two-fold Solinas
reduction (reduce_product for hot path, loop-based reduce_u64/u128 for
arbitrary inputs). Add widening ops (mul_wide, square, solinas_reduce).
Fix FieldSampling to use direct modular reduction instead of rejection
sampling. Blanket-impl PseudoMersenneField, remove manual impls. Rename
const generic MODULUS -> P at all call sites. Add latency + throughput
benchmarks. Hoist mid-function imports in tests/algebra.rs.

Made-with: Cursor
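
A small sketch of two-fold Solinas reduction for p = 2^k - c, using the 31-bit prime 2^31 - 19 that appears elsewhere in this log. The constant names mirror the BITS/C/MASK naming in the commit message, but the function bodies are an independent illustration, not the code in fp32.rs.

```rust
const BITS: u32 = 31;
const C: u64 = 19;
const P: u64 = (1u64 << BITS) - C;
const MASK: u64 = (1u64 << BITS) - 1;

/// Reduce x < 2^62 (the product of two canonical elements) modulo P.
/// Uses 2^BITS = C (mod P): split x = hi * 2^BITS + lo and replace it by
/// lo + C * hi, twice, then subtract P at most once.
pub fn reduce_product(x: u64) -> u64 {
    // Fold 1: hi < 2^31, so f1 < 2^31 + 19 * 2^31 < 2^36.
    let f1 = (x & MASK) + C * (x >> BITS);
    // Fold 2: f1 >> 31 <= 19, so f2 < 2^31 + 361 < 2 * P.
    let f2 = (f1 & MASK) + C * (f1 >> BITS);
    // One conditional subtraction brings f2 into [0, P).
    if f2 >= P { f2 - P } else { f2 }
}

/// Field multiplication for canonical inputs a, b < P.
pub fn mul(a: u64, b: u64) -> u64 {
    reduce_product(a * b)
}
```

The bounds in the comments are what make exactly two folds enough for this modulus; a larger `C` or an unreduced input would need a different bound check.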

* Specialize Fp64 sub-word primes to u64-only arithmetic

For BITS < 64 (e.g. 2^40-195), avoid u128 intermediates in
reduce_product, add_raw, and sub_raw. Use mul_c_narrow which splits
C*high into u32x32->u64 widening multiplies (umaddl on AArch64),
preventing LLVM from promoting to u128. Brings 40-bit mul throughput
within 4% of 64-bit (690 vs 716 Melem/s), up from ~20% gap.

Made-with: Cursor

* Add 2^30 and 2^31 pseudo-Mersenne primes and expand benchmarks

Add Pow2Offset30Field (2^30-35) and Pow2Offset31Field (2^31-19) prime
definitions and type aliases. Refactor fp32/fp64 latency benchmarks with
chain_bench! macro, add throughput benchmarks for all new primes.

Made-with: Cursor

* Add NEON packed backends for Fp32 (4-wide) and Fp64 (2-wide)

PackedFp32Neon: 4 lanes in uint32x4_t with full NEON Solinas reduction
for mul (vmull_u32 + 2-fold reduce), umin trick for add/sub (BITS<=31),
overflow-aware paths for BITS==32. C_SHIFT_KIND optimization for
C=2^a+/-1.

PackedFp64Neon: 2 lanes in uint64x2_t with NEON add/sub (conditional
P for BITS<=62, carry-aware for BITS>=63), scalar-per-lane mul (no
native 64x64->128 on NEON).

Fp32 packed achieves 2.4-3.5x mul throughput and 3.5-5.0x add/sub
throughput over scalar. Includes HasPacking impls, type aliases,
NoPacking fallbacks, 7 correctness tests, and throughput benchmarks.

Made-with: Cursor

* Optimize packed Fp32/Fp64 Solinas multiply hot paths on NEON

For packed Fp32, remove the shift/add C-special-case in the Solinas fold and
always use vmull_u32 with a hoisted C broadcast, which improves stability and
removes the 24-bit mul regression. For packed Fp64, replace per-lane Fp64
wrapper multiplication with packed-local per-lane 64x64->128 products plus
specialized Solinas reduction (including the sub-word u64 fold path), reducing
mul overhead for both 40-bit and 64-bit packed variants.

Made-with: Cursor

* Tune packed Fp64 mul folding and add reducer/codegen probes

Switch packed Fp64 sub-word fold multiplication to direct `C*x`, which improves packed mul throughput in repeated A/B runs. Add dedicated reducer and codegen probe benches so we can compare 40-bit and 64-bit fold paths with instruction-level visibility.

Made-with: Cursor

* Optimize x86 BMI2 multiply paths for fp64/fp128 fields

Use BMI2 widening multiplies in scalar field hot paths and specialize x86 sub-word fold multiplication to a single 64-bit multiply, improving 40-bit fp64 throughput while keeping 64-bit and 128-bit paths stable.

Made-with: Cursor

* Optimize fp128 wide-limb multiply path for Jolt integration

Raise Hachi MSRV to 1.88, add specialized Fp128 mul_wide_limbs kernels for M={3,4} and OUT={4,5,6}, and add field_arith benches that track mul_wide_limbs-only and roundtrip costs to catch regressions.

Made-with: Cursor

* Specialize Fp128 CanonicalField small-int constructors

Make from_u64 use a direct canonical limb construction (no reduction path), fix from_i64 to use unsigned_abs to avoid i64::MIN overflow, and add a regression test for the min-value case.

Made-with: Cursor
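
The i64::MIN issue can be illustrated on a smaller field. The sketch below uses the Mersenne prime 2^61 - 1 as a stand-in modulus (an assumption; the commit concerns Fp128) and shows why `unsigned_abs` is the safe choice: `i64::MIN.abs()` overflows, while `unsigned_abs` returns 2^63 as a `u64`.

```rust
/// Stand-in modulus for illustration: the Mersenne prime 2^61 - 1.
const P: u64 = (1u64 << 61) - 1;

/// Map a signed integer to its canonical representative in [0, P).
pub fn from_i64(x: i64) -> u64 {
    // unsigned_abs never overflows, even for i64::MIN (which yields 2^63).
    let magnitude = x.unsigned_abs() % P;
    if x < 0 && magnitude != 0 {
        P - magnitude
    } else {
        magnitude
    }
}
```

Since 2^61 = 1 (mod P), the magnitude of `i64::MIN` reduces to 4, so `from_i64(i64::MIN)` is `P - 4`.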

* Impl sumchecks for hachi

* Add optimized one-hot commitment path for regular sparse witnesses

Exploits the structure of one-hot vectors (T chunks of K field elements,
each chunk with exactly one 1) to eliminate all inner ring multiplications.
Gadget decomposition of {0,1} coefficients is trivial (only level-0 digit
is nonzero), and the inner Ajtai t = A*s reduces to summing selected
columns of A with O(D) negacyclic rotations instead of O(D^2) ring muls.

Handles both K >= D and D >= K as long as one divides the other:
- K >= D: each nonzero ring element is a monomial X^j (single rotation)
- D >= K: each ring element is a sum of D/K monomials (multiple rotations)

Total inner cost: N_A * T * D coefficient additions (zero multiplications),
vs N_A * 2^M * delta * D^2 coefficient multiplications in the dense path.

Made-with: Cursor
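
The "rotation instead of ring multiplication" step can be sketched as follows: multiplying by a monomial X^j in Z_q[X]/(X^D + 1) only moves coefficients and flips the sign of those that wrap around. The function name, the `u64` coefficient type, and the runtime modulus are assumptions for illustration.

```rust
/// Multiply `a` by X^j in Z_q[X]/(X^D + 1) using O(D) moves and no
/// coefficient multiplications. Requires j < D and coefficients below q.
pub fn mul_by_monomial<const D: usize>(a: &[u64; D], j: usize, q: u64) -> [u64; D] {
    debug_assert!(j < D);
    let mut out = [0u64; D];
    for i in 0..D {
        let k = i + j;
        if k < D {
            // No wraparound: coefficient moves from X^i to X^(i + j).
            out[k] = a[i];
        } else {
            // Wraparound: X^D = -1, so the coefficient is negated.
            out[k - D] = (q - a[i]) % q;
        }
    }
    out
}
```

For a one-hot chunk whose single 1 sits at position j, the product with a column of A is exactly this rotation of the column, so summing the selected, rotated columns needs only additions.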

* Apply rustfmt formatting to fp128 and field_arith bench

Made-with: Cursor

* Inject sumchecks to Hachi prover

* Add commitment to w to transcript

* Add AVX2 and AVX-512 packed field backends for Fp32, Fp64, Fp128

Implement vectorized SIMD arithmetic for x86_64:
- AVX2: 8-wide Fp32, 4-wide Fp64, 2-wide Fp128 (scalar delegation)
- AVX-512: 16-wide Fp32, 8-wide Fp64, 4-wide Fp128 (scalar delegation)

Fp32 uses even/odd lane split with 2-fold Solinas reduction.
Fp64 uses vectorized 64×64→128 schoolbook multiply (adapted from
plonky3 Goldilocks) with custom Solinas reduction for pseudo-Mersenne
primes p = 2^k - c.

Also: extract NEON backend into packed_neon.rs, add cfg-gated module
selection (AVX-512 > AVX2 > NEON > NoPacking), enable nightly
stdarch_x86_avx512 feature, add sumcheck-mix benchmark, and fix minor
clippy lints in fp64/fp128.

Made-with: Cursor

* Vectorize Fp128 packed add/sub on AVX-512 (8-wide) and AVX2 (4-wide)

Convert Fp128 packed backends from scalar delegation (AoS) to SoA layout
with vectorized add/sub via __m512i / __m256i. Mul remains scalar per-lane.
Add FIELD_OPS_PERF.md with Zen 5 benchmark results.

Fp128 packed add: +114% (1.08 → 2.31 Gelem/s on Zen 5 AVX-512)
Fp128 packed sub: +137% (1.34 → 3.18 Gelem/s)

Made-with: Cursor

* Add M4 Pro NEON benchmarks, remove mul_add experiment

Populate FIELD_OPS_PERF.md with Apple M4 Pro (NEON) results for all
primes across scalar, packed, and sumcheck MACC workloads. Remove
the experimental mul_add trait method (vectorized add already optimal
after inlining; scalar fused approach was 16% slower).

Made-with: Cursor

* Change sumcheck API

* Separate ring switch logic

* Rename sumchecks to NormSumcheck and RelationSumcheck

* Remove iteration prover

* Eliminate O(D^2) schoolbook ring multiplication from protocol hot paths

At production parameters (D=256/1024), schoolbook CyclotomicRing
multiplication is catastrophically expensive. Every protocol hot path
has exploitable operand structure that avoids the full D^2 cost:

- Add CyclotomicRing::mul_by_sparse for O(omega*D) sparse challenge
  multiplication (90-140x speedup in compute_z_hat)
- Change RingOpeningPoint to store Vec<F> scalars; use scale() instead
  of ring mul in compute_w_hat (256-1024x speedup)
- Add kron_scalars, kron_row_scale, kron_sparse_scale; refactor
  generate_m to use scalar-aware Kronecker products
- Add zero-skip and scalar-detect in compute_r_via_poly_division
- Add sample_sparse_challenges, store Vec<SparseChallenge> in
  QuadraticEquation throughout prover and verifier paths

Made-with: Cursor
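
A sketch of the O(omega * D) idea behind sparse challenge multiplication: a challenge with omega nonzero terms, each +1 or -1 times a monomial, is applied as omega signed rotations. The `SparseTerm` type and the free function below are hypothetical; the real `SparseChallenge` and `CyclotomicRing::mul_by_sparse` may differ.

```rust
/// One nonzero term of a sparse challenge: sign * X^index (hypothetical type).
#[derive(Clone, Copy, Debug)]
pub struct SparseTerm {
    pub index: usize,
    pub negative: bool,
}

/// Multiply `a` by the sparse polynomial given by `terms` in
/// Z_q[X]/(X^D + 1). Cost is terms.len() * D additions.
/// Requires every index < D and coefficients of `a` below q < 2^63.
pub fn mul_by_sparse<const D: usize>(a: &[u64; D], terms: &[SparseTerm], q: u64) -> [u64; D] {
    let mut out = [0u64; D];
    for t in terms {
        for i in 0..D {
            let k = i + t.index;
            let (pos, wrapped) = if k < D { (k, false) } else { (k - D, true) };
            // Wraparound contributes a factor -1; so does a negative term.
            let negate = wrapped ^ t.negative;
            let v = if negate { (q - a[i]) % q } else { a[i] };
            out[pos] = (out[pos] + v) % q;
        }
    }
    out
}
```

With D = 4, q = 17, a = 1 + 2X + 3X^2 + 4X^3 and the challenge 1 - X^3, the result is 3 + 5X + 7X^2 + 3X^3.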

* lint: section banner removal, naming hoist, cfg(test) for test-only paths

- Remove section banner comments (----, =====) repo-wide in src, tests, benches
- commitment_scheme: hoist RingCommitment, RingOpeningPoint, transcript labels
  to top-level use; add #[cfg(test)] use for rederive_alpha_and_m_a body
  (Blake2bTranscript, eval_ring_matrix_at, expand_m_a, labels) so that
  function uses short names without polluting lib build
- Leave mod tests imports in place (no hoisting of test-module use blocks)

Made-with: Cursor

* Fix CI issues

---------

Co-authored-by: Quang Dao <quang.dao@layerzerolabs.org>
Co-authored-by: Cursor <cursoragent@cursor.com>
…umcheck (#3)

* Add rayon parallelism behind `parallel` feature flag (enabled by default)

- New src/parallel.rs with cfg_iter!/cfg_into_iter!/cfg_chunks! macros
  that dispatch to rayon parallel iterators when `parallel` is enabled
- Parallelize protocol hot paths: ring polynomial division, w_evals
  construction, M_alpha evaluation, ring vector evaluation, packed ring
  poly evaluation, coefficients-to-ring reduction, quadratic equation
  folding, and sumcheck round polynomial computation
- All 174 tests pass with and without the parallel feature

Made-with: Cursor
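
A feature-gated iterator macro of this kind is often written along the following lines. This is a guess at the shape, not the contents of src/parallel.rs; `sum_of_squares` is a made-up caller, and the parallel arm assumes the `rayon` crate is a dependency whenever the `parallel` feature is on.

```rust
/// With the `parallel` feature: expand to a rayon parallel iterator.
#[cfg(feature = "parallel")]
macro_rules! cfg_iter {
    ($e:expr) => {{
        use rayon::prelude::*;
        $e.par_iter()
    }};
}

/// Without the feature: expand to the ordinary sequential iterator.
#[cfg(not(feature = "parallel"))]
macro_rules! cfg_iter {
    ($e:expr) => {
        $e.iter()
    };
}

/// Example caller: the body is identical for both expansions because
/// `map` and `sum` exist on sequential and rayon iterators alike.
pub fn sum_of_squares(xs: &[u64]) -> u64 {
    cfg_iter!(xs).map(|x| x * x).sum()
}
```

Call sites stay free of `#[cfg]` attributes, which is what lets the same test suite run with and without the feature.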

* Add e2e benchmark and make HachiCommitmentScheme generic over config

- Make HachiCommitmentScheme generic over <const D, Cfg> so different
  configs (and thus num_vars) can be used without code duplication.
- Remove hardcoded DefaultCommitmentConfig::D from ring_switch.rs;
  WCommitmentConfig and commit_w now flow D generically.
- Add benches/hachi_e2e.rs with configs sweeping nv=10,14,18,20.

Made-with: Cursor

* Refactor CRT-NTT backend: generalize over PrimeWidth, add Q128 support

Make NTT primitives (NttPrime, NttTwiddles, MontCoeff, CyclotomicCrtNtt)
generic over PrimeWidth (i16/i32) instead of hardcoding i16. Replace the
monolithic QData struct with separate GarnerData and per-prime NttPrime
arrays. Add Q128 parameter set (5 × i32 primes, D ≤ 1024) alongside the
existing Q32 set. Simplify ScalarBackend by removing the const-generic
limb count from to_ring_with_backend.

Made-with: Cursor

* Add extension field arithmetic and refactor sumcheck trait bounds

Split CanonicalField into FromSmallInt (from_{u,i}{8,16,32,64} for all
fields) and CanonicalField (u128 repr, base fields only). Implement
FromSmallInt, Eq, Debug for Fp2/Fp4. Add ExtField<F> trait with
EXT_DEGREE and from_base_slice.

Optimize extension field arithmetic: Karatsuba multiplication for Fp2
and Fp4 (3 base muls instead of 4), specialized squaring (2 base muls
for Fp2), non-residue IS_NEG_ONE specialization. Add concrete configs
(TwoNr, NegOneNr, UnitNr) and type aliases Ext2<F>, Ext4<F>.

Add transpose-based packed extension fields (PackedFp2, PackedFp4)
for SIMD acceleration, following Plonky3's approach.

Relax sumcheck bounds from E: CanonicalField to E: FromSmallInt (or
E: FieldCore where spurious). Add sample_ext_challenge transcript
helper. Includes tests for extension field sumcheck execution.

Made-with: Cursor
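
The 3-multiplication Karatsuba product for a quadratic extension with u^2 = -1 (the case the IS_NEG_ONE specialization presumably targets) can be sketched as below. The base field 2^31 - 1 is chosen only because -1 is a non-residue modulo a prime congruent to 3 mod 4; the `Fp2Sketch` type and helper names are assumptions.

```rust
/// Illustrative base field: the Mersenne prime 2^31 - 1.
const P: u64 = (1u64 << 31) - 1;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

/// Element c0 + c1 * u with u^2 = -1 (hypothetical stand-in for Fp2).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Fp2Sketch {
    pub c0: u64,
    pub c1: u64,
}

/// Karatsuba: 3 base multiplications instead of 4.
pub fn mul_karatsuba(a: Fp2Sketch, b: Fp2Sketch) -> Fp2Sketch {
    let v0 = mul(a.c0, b.c0);
    let v1 = mul(a.c1, b.c1);
    let cross = mul(add(a.c0, a.c1), add(b.c0, b.c1));
    Fp2Sketch {
        c0: sub(v0, v1),             // a0*b0 + u^2 * a1*b1, with u^2 = -1
        c1: sub(sub(cross, v0), v1), // a0*b1 + a1*b0
    }
}

/// Schoolbook reference with 4 base multiplications, for cross-checking.
pub fn mul_schoolbook(a: Fp2Sketch, b: Fp2Sketch) -> Fp2Sketch {
    Fp2Sketch {
        c0: sub(mul(a.c0, b.c0), mul(a.c1, b.c1)),
        c1: add(mul(a.c0, b.c1), mul(a.c1, b.c0)),
    }
}
```

For example, (1 + 2u)(3 + 4u) = 3 + 10u + 8u^2 = -5 + 10u, so the result has `c0 = P - 5` and `c1 = 10`.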

* Fix CRT+NTT correctness and optimize negacyclic NTT pipeline

Correctness fixes:
- Rewrite negacyclic NTT as twist + cyclic DIF/DIT pair (no bit-reversal
  permutation), correctly diagonalizing X^D+1.
- Center coefficient→CRT mapping and Garner reconstruction to handle
  negacyclic sign wrapping consistently.
- Fix i32 Montgomery csubp/caddp overflow via branchless i64 widening.
- Fix q128 centering overflow in balanced_decompose_pow2 (avoid casting
  q≈2^128 into i128).
- Remove dense-protocol schoolbook fallback; all mat-vec now routes
  through CRT+NTT.

Performance optimizations:
- Precompute per-stage twiddle roots in NttTwiddles (eliminate runtime
  pow_mod per butterfly stage).
- Forward DIF butterfly skips reduce_range before Montgomery mul (safe
  because mul absorbs unreduced input).
- Hoist centered-coefficient computation out of per-prime loop in
  from_ring.
- Add fused pointwise multiply-accumulate for mat-vec inner loop.
- Add batched mat_vec_mul_crt_ntt_many that precomputes matrix NTT once
  and reuses across many input vectors.
- Wire commit_ring_blocks to batched A*s path.

Benchmarks (D=64, Q32/K=6):
- Single-prime forward+inverse NTT: 1.14µs → 0.43µs (2.7x)
- CRT round-trip: 10.7µs → 6.3µs (1.7x)
- Commit nv10: ~70% faster, nv20: ~47% faster

Made-with: Cursor

* Cache CRT+NTT matrix representations in setup to avoid repeated conversion

The dense mat-vec paths (commit_ring_blocks, commit_onehot B-mul, compute_v)
previously converted coefficient-form matrices to CRT+NTT on every call.
Now the setup eagerly converts A, B, D into an NttMatrixCache and all
dense operations use the pre-converted form. Coefficient-form matrices
are retained for the onehot inner-product path and ring-switch/generate_m.

Made-with: Cursor

* Remove dead code (HachiRoutines, domains/, redundant trait methods) and extract shared field utilities

- Delete unused HachiRoutines trait and dead algebra/domains/ module
- Remove redundant FieldCore::add/sub/mul and Module::add/neg (covered by ops traits)
- Extract is_pow2_u64, log2_pow2_u64, mul64_wide into fields/util.rs to deduplicate

Made-with: Cursor

* Unify Blake2b and Keccak transcript backends into generic HashTranscript

Replace separate blake2b.rs and keccak.rs with a single generic
HashTranscript<D: Digest> parameterized by hash function. Blake2bTranscript
and KeccakTranscript are now type aliases.

Made-with: Cursor

* Fix sumcheck degree bug, split types, in-place fold, CommitWitness, rename configs, add soundness test

- Fix CompressedUniPoly::degree() off-by-one that could let malformed proofs pass
- Split sumcheck/mod.rs: extract types into types.rs, relocate multilinear_eval
  and fold_evals to algebra/poly.rs
- Replace allocating fold_evals with in-place fold_evals_in_place
- Add debug_assert guards to multilinear_eval and fold_evals_in_place
- Introduce CommitWitness struct to replace error-prone 3-tuple returns
- Rename DefaultCommitmentConfig to SmallTestCommitmentConfig, add
  ProductionFp128CommitmentConfig
- Add verify_rejects_wrong_opening negative test for verifier soundness

Made-with: Cursor
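
An in-place fold of a multilinear evaluation table can be sketched as follows. The variable order (the lowest index bit is bound first), the `Vec<u64>` representation, the modulus 2^31 - 1, and the `String` error type are all assumptions; the project versions in algebra/poly.rs may bind variables in a different order.

```rust
/// Illustrative modulus: the Mersenne prime 2^31 - 1.
const P: u64 = (1u64 << 31) - 1;

/// Bind the lowest variable to `r`, halving the table without allocating.
/// Entry i becomes evals[2i] + r * (evals[2i + 1] - evals[2i]).
pub fn fold_evals_in_place(evals: &mut Vec<u64>, r: u64) {
    debug_assert!(evals.len() >= 2 && evals.len().is_power_of_two());
    debug_assert!(r < P);
    let half = evals.len() / 2;
    for i in 0..half {
        // Reads at 2i and 2i + 1 never precede the write at i, so
        // overwriting the front of the buffer is safe.
        let (lo, hi) = (evals[2 * i], evals[2 * i + 1]);
        evals[i] = (lo + r * ((hi + P - lo) % P)) % P;
    }
    evals.truncate(half);
}

/// Evaluate the multilinear extension at `point`, checking dimensions.
pub fn multilinear_eval(evals: &[u64], point: &[u64]) -> Result<u64, String> {
    if point.len() >= usize::BITS as usize || evals.len() != 1usize << point.len() {
        return Err(format!(
            "expected 2^{} evaluations, got {}",
            point.len(),
            evals.len()
        ));
    }
    let mut work = evals.to_vec();
    for &r in point {
        fold_evals_in_place(&mut work, r);
    }
    Ok(work[0])
}
```

With the table `[1, 2, 3, 4]` (that is, f(x0, x1) = 1 + x0 + 2*x1 under this index order), evaluating at (5, 7) folds to `[6, 8]` and then to 20.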

* fix(test): resolve clippy needless_range_loop in algebra tests

Use iter().enumerate() for schoolbook convolution loops and
array::from_fn for pointwise NTT operations.

Made-with: Cursor

* Refactor commitment setup to runtime layout and staged artifacts.

This removes compile-time commitment shape locks, derives beta from runtime layout, and threads layout-aware setup through commit/prove/verify with setup serialization roundtrip coverage.

Made-with: Cursor

* Soundness hardening: panic-free verifier, Fiat-Shamir binding, NTT overflow fix

- Verifier path never panics; all errors return HachiError
- Bind commitment, opening point, and y_ring in Fiat-Shamir transcript
- Fix i16 csubp/caddp overflow by widening to i32
- multilinear_eval returns Result with dimension checks
- build_w_evals validates w.len() is a multiple of d
- UniPoly::degree uses saturating_sub instead of expect
- Serialize usize as u64 for 32/64-bit portability
- Fix from_i64(i64::MIN) via unsigned_abs
- Remove Transcript::reset from public trait (move to inherent)
- Add batched_sumcheck verifier empty-input guard

Made-with: Cursor
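
The `from_i64(i64::MIN)` item is easy to illustrate in isolation. The sketch below assumes a toy modulus `P` and a `u64` representation; only the use of `unsigned_abs` reflects the commit message.

```rust
/// Toy prime modulus (2^61 - 1), an assumption for illustration only.
const P: u64 = (1 << 61) - 1;

/// Maps a signed integer to its canonical residue modulo `P`.
fn from_i64(x: i64) -> u64 {
    // `unsigned_abs` is defined for every i64, including i64::MIN,
    // whereas `abs()` overflows on that one input.
    let mag = x.unsigned_abs() % P;
    if x < 0 && mag != 0 {
        P - mag
    } else {
        mag
    }
}
```

Since 2^63 = 4 * (P + 1), the magnitude of `i64::MIN` reduces to 4 under this toy modulus, so the result is `P - 4`.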

* Hoist fully qualified paths to use statements in touched files

Replace inline crate::protocol::commitment::HachiCommitmentLayout,
hachi_pcs::algebra::backend::{CrtReconstruct, NttPrimeOps}, and
hachi_pcs::algebra::CyclotomicRing with top-level use imports.

Made-with: Cursor

* Dispatch norm sumcheck kernels by range size.

Route small-b rounds through the point-eval interpolation kernel and keep the affine-coefficient kernel for larger b, while adding deterministic baseline-vs-dispatched benchmarks and parity tests to validate correctness across both strategies.

Made-with: Cursor

* Format commitment-related files for readability.

Apply non-functional formatting and import ordering cleanups across commitment, ring-switch, and benchmark/test files to keep the codebase style consistent.

Made-with: Cursor

* Format: cargo fmt pass on commitment-related files

Made-with: Cursor

* feat: sequential coefficient ordering + streaming commitment

Change coefficient-to-ring packing from strided to sequential, enabling
true streaming where each trace chunk maps to exactly one inner Ajtai
block. Implement StreamingCommitmentScheme for HachiCommitmentScheme.

- reduce_coeffs_to_ring_elements: sequential packing (chunks_exact(D))
- prove/verify: opening point split flipped to (inner, outer)
- ring_opening_point_from_field: outer split flipped to (M first, R second)
- commit_coeffs: sequential block distribution
- map_onehot_to_sparse_blocks: sequential block distribution
- HachiChunkState + process_chunk / process_chunk_onehot / aggregate_chunks
- Streaming commit tests (matches non-streaming, prove/verify roundtrip)

Made-with: Cursor
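
The difference between the two packings can be sketched as follows. Both helpers are hypothetical: `pack_sequential` mirrors the `chunks_exact(D)` approach named above, while `pack_strided` is only a guess at what the previous strided layout looked like.

```rust
/// Sequential packing: ring element i takes coefficients [i*D, (i+1)*D).
/// Trailing coefficients that do not fill a whole element are ignored.
fn pack_sequential<const D: usize>(coeffs: &[u64]) -> Vec<[u64; D]> {
    coeffs
        .chunks_exact(D)
        .map(|chunk| {
            let mut ring = [0u64; D];
            ring.copy_from_slice(chunk);
            ring
        })
        .collect()
}

/// Strided packing (assumed shape): ring element i takes every n-th
/// coefficient starting at i, where n is the number of ring elements.
fn pack_strided<const D: usize>(coeffs: &[u64]) -> Vec<[u64; D]> {
    let n = coeffs.len() / D;
    (0..n)
        .map(|i| std::array::from_fn(|j| coeffs[i + j * n]))
        .collect()
}
```

With sequential packing a contiguous chunk of the input determines whole ring elements, which is what lets a streaming committer process one chunk at a time.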

* refactor: decompose verify_batched_sumcheck into composable steps

Split the monolithic verify_batched_sumcheck into three pieces:
- verify_batched_sumcheck_rounds: replay rounds, return intermediate state
- compute_batched_expected_output_claim: query verifier instances
- check_batched_output_claim: enforce equality

This enables callers (e.g. Greyhound) to intercept the intermediate
sumcheck state before the final oracle check. The original function
is preserved as a convenience wrapper.

Made-with: Cursor

* feat: accept Option<usize> in commit_onehot for sparse one-hot support

Allows None entries in one-hot index arrays to represent inactive cycles.
Adds public commit_onehot free function returning both commitment and hint.

Made-with: Cursor

* feat: submatrix commit for polynomials smaller than setup max

commit_coeffs now accepts ring coefficient vectors shorter than the
layout's full size, padding each block internally. prove/verify pad the
opening point with zeros so the transcript stays consistent. This avoids
materializing huge zero-padded field-element arrays.

Made-with: Cursor

* feat: add HachiSerialize impls for proof types

Implement HachiSerialize/HachiDeserialize for HachiProof,
HachiCommitmentHint, and SumcheckAux so they can be serialized
through the ArkBridge adapter in Jolt.

Made-with: Cursor

* fix: relax balanced_decompose_pow2 assertion for 128-bit fields

Allow levels * log_basis up to 128 + log_basis. For Fp128 with
LOG_BASIS=4, the decomposition needs 33 levels (132 bits total) because
32 levels can't represent the full signed range [-q/2, q/2). The extra
level's digit is at most ±1 and the i128 arithmetic remains safe since
the quotient shrinks monotonically.

Made-with: Cursor

* feat: add DynamicSmallTestCommitmentConfig

Same D=16 security parameters as SmallTestCommitmentConfig but derives
layout from max_num_vars instead of using a fixed (4,2) shape.

Made-with: Cursor

* perf: true submatrix in commit_coeffs — skip zero blocks

Short polynomials no longer pad to block_len. commit_coeffs accepts
fewer ring elements than num_blocks * block_len, decomposes only the
non-zero blocks, and fills remaining entries with zero s/t_hat without
allocation or mat-vec multiplication.

Also relax debug_assert in mat_vec_mul_precomputed to >= (zip handles
the shorter vector correctly).

Made-with: Cursor

* fix: use inner_width for zero_s in commit_coeffs/commit_onehot

prove expects s[i] to have inner_width entries. Use the correct
length for zero blocks to match the dense path's decompose_block
output size.

Made-with: Cursor

* fix: configure rayon with 64MB stack for D>=512 ring elements

CRT-NTT conversion puts ~28KB on the stack per ring element
([[MontCoeff; D]; K] + [i128; D]). With D=512 and the commit call
chain depth, rayon's default thread stack overflows.

ensure_large_thread_stack() is called from setup() and is safe to
call multiple times (only the first configures the pool).

Made-with: Cursor

* feat: add commit_mixed for mega-polynomial commitment

Exposes MegaPolyBlock enum (Dense/OneHot/Zero) and commit_mixed()
which processes heterogeneous blocks in a single commitment.
This lets Jolt pack all witness polynomials into one Hachi commitment
(one block per polynomial) instead of N independent commitments.

Also makes SparseBlockEntry and map_onehot_to_sparse_blocks public
so callers can construct one-hot block descriptors.

Made-with: Cursor

* perf: drop s vectors from CommitWitness and HachiCommitmentHint

The basis-decomposed s_i vectors (one per block, each block_len*delta
ring elements) were stored in both CommitWitness and HachiCommitmentHint.
At production parameters (D=512, block_len=2048, delta=32), each s_i is
512 MB — storing all 64 of them consumed ~32 GB.

Instead, recompute s_i on the fly in compute_w_hat and compute_z_hat
from ring_coeffs using decompose_block. Peak memory drops from O(blocks *
block_len * delta) to O(block_len * delta) per thread.

Also adds setup_with_layout for caller-specified HachiCommitmentLayout,
and makes decompose_block, SparseBlockEntry, map_onehot_to_sparse_blocks
public for downstream (Jolt) mega-polynomial integration.

Made-with: Cursor

* chore: untrack docs/ and paper/ from version control

Keep these files locally for reference but remove from the
committed tree. They can be selectively re-added later.

Made-with: Cursor

* perf: fused sumcheck, split-eq streaming, compact w_evals — 8x memory reduction

Refactor the Hachi proving pipeline to eliminate the 13 GB matrix M and
2.6 GB vector z from memory, reducing peak prover allocation from ~30 GB
to ~3.7 GB.

Key changes:

- QuadraticEquation: remove m/z fields; add compute_r_split_eq (split-eq
  factoring replaces full Kronecker materialization) and
  compute_m_a_streaming (row-at-a-time M·α evaluation).

- ring_switch: decompose z_pre on the fly in build_w_coeffs; add
  build_w_evals_compact returning Vec<i8> for round-0 storage (all
  entries fit in [-8, 7] from balanced_decompose_pow2 with LOG_BASIS=4).

- HachiSumcheckProver: fused norm+relation prover sharing a single
  w_table. Round 0 uses WTable::Compact(Vec<i8>), folding to
  WTable::Full(Vec<F>) at half size after the first challenge.

- HachiSumcheckVerifier: fused verifier combining both oracle checks
  with a batching_coeff sampled from the transcript.

- Remove dead batched mat-vec functions from linear.rs.

- Import hygiene: shorten crate::algebra::ring::X to crate::algebra::X;
  hoist mid-function use statements to top-level.

Made-with: Cursor
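
The split-eq idea can be sketched on a toy field. Everything here (the modulus `P`, `eq_table`, `eq_at`, and the convention that earlier variables are the high index bits) is an assumption for illustration, not the signature of `compute_r_split_eq`.

```rust
/// Toy prime modulus (2^61 - 1), an assumption for illustration only.
const P: u64 = (1 << 61) - 1;

fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}

/// Full table of eq(r, x) over the Boolean hypercube.
/// Earlier entries of `r` correspond to higher index bits.
fn eq_table(r: &[u64]) -> Vec<u64> {
    let mut table = vec![1u64];
    for &ri in r {
        let one_minus = (1 + P - ri) % P;
        let mut next = Vec::with_capacity(table.len() * 2);
        for &e in &table {
            next.push(mul(e, one_minus));
            next.push(mul(e, ri));
        }
        table = next;
    }
    table
}

/// Looks up one entry of the full table from two half-size tables,
/// so the full Kronecker product is never materialized.
fn eq_at(hi: &[u64], lo: &[u64], index: usize) -> u64 {
    mul(hi[index / lo.len()], lo[index % lo.len()])
}
```

For n variables the two half tables hold about 2 * 2^(n/2) entries instead of 2^n, at the cost of one extra multiplication per lookup.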

* revert: remove ensure_large_thread_stack rayon config

Stack sizing for D>=512 ring elements should be handled by the caller,
not baked into the library's setup path.

Made-with: Cursor
…ine, NTT acceleration (#5)

* perf: parallelize commit phase and reduce allocations

- Add block-level parallelism to commit_ring_blocks, commit_coeffs,
  commit_onehot, and commit_mixed via cfg_iter!/cfg_into_iter!
- Parallelize vector-to-NTT conversion in mat_vec_mul_precomputed_with_params
- Cache CRT+NTT params inside NttMatrixCache, eliminating redundant
  select_crt_ntt_params calls on every mat-vec multiply
- Add balanced_decompose_pow2_into for in-place decomposition, removing
  per-element Vec allocations in decompose_block/decompose_rows
- Add inner_ajtai_onehot_t_only that skips the 16MB s-vector allocation
  when the caller discards it (commit_onehot, commit_mixed)
- Add one-hot and mixed commitment benchmarks to hachi_e2e

Made-with: Cursor

* chore: remove stale #[allow(non_snake_case)] from setup structs

HachiSetupSeed, HachiProverSetup, and HachiVerifierSetup have no
uppercase fields — the allows were left over from earlier refactors.

Made-with: Cursor

* perf: hoist decomposition params to runtime, reduce allocations and cloning

Pre-existing change:
- Remove rows/cols from matrix domain separator so A matrix is reusable
  across poly/mega-poly layouts with the same m_vars.

New changes:

Move delta/tau/log_basis from CommitmentConfig associated constants into
HachiCommitmentLayout runtime fields. This decouples decomposition
parameters from the config type, allowing them to vary at runtime
without monomorphization. All ~50 call sites updated.

Eliminate redundant work in the prover hot path:
- Flatten w_hat once and reuse in both compute_v and compute_r_split_eq
  (was flattened separately in each).
- Stream z_hat decomposition directly in build_w_coeffs instead of
  collecting into a temporary Vec.
- Skip the unused w.to_vec() clone in ring_switch_verifier output.
- Take ownership of ring_opening_point and hint in QuadraticEquation
  constructors instead of cloning.

Reduce stack pressure for large ring elements (8KB at D=512, Fp128):
- Add CyclotomicRing::from_slice() to avoid std::array::from_fn
  intermediaries that create 8KB stack temporaries.
- Replace from_fn patterns in process_chunk, reduce_coeffs_to_ring_elements,
  commit_w, and compute_r_split_eq.

Made-with: Cursor

* feat: flexible decomposition depth and dual basis mode

Move DELTA/TAU/LOG_BASIS out of CommitmentConfig into runtime
DecompositionParams (log_basis, log_coeff_bound). Delta and tau are
now auto-derived from the coefficient bound, so small-coefficient
polynomials (0/1, already range-checked) get proportionally cheaper
commitments.

Add BasisMode enum (Lagrange / Monomial) as a prove/verify-time
parameter. Commitment is basis-agnostic; the mode only changes the
tensor-product weights in the opening relation.

Made-with: Cursor

* fix: compute_m_a_streaming no longer needs padding

* refactor: unify polynomial API via HachiPolyOps trait, remove dead code, fix config validation

HachiPolyOps trait and implementations:
- Add HachiPolyOps<F, D> trait with 4 operation methods (evaluate_ring,
  fold_blocks, decompose_fold, commit_inner) replacing raw coefficient access
- Add DensePoly<F, D> for dense ring coefficient vectors
- Add OneHotPoly<F, D> for sparse one-hot polynomials with optimized ops

CommitmentScheme refactor:
- Parameterize CommitmentScheme<F, D> (was CommitmentScheme<F>)
- Generic commit/prove over P: HachiPolyOps<F, D>
- Rename OpeningProofHint to CommitHint, remove Option wrapper from prove
- Remove batch_commit, combine_commitments, combine_hints
- Remove StreamingCommitmentScheme trait, HachiChunkState, process_chunk*

Dead code removal:
- Delete MegaPolyBlock enum and commit_mixed method
- Delete inner_ajtai_onehot (keep _t_only variant)
- Delete Polynomial trait, MultilinearLagrange trait
- Delete DenseMultilinearEvals and multilinear_evals module
- Remove all unnecessary #[allow(...)] attributes

Proof simplification:
- Remove ring_coeffs from HachiCommitmentHint (only t_hat remains)
- Update quadratic_equation to use HachiPolyOps methods

Config fix:
- Remove overly strict delta*log_basis > 128 check in config.rs;
  balanced_decompose_pow2 already enforces the correct bound
  (levels*log_basis <= 128+log_basis)

Documentation:
- Add docs to all public items in test_utils and packed_ext
- Remove #[allow(missing_docs)] from parallel, test_utils, packed_ext modules

Made-with: Cursor

* fix: remove test for deleted delta*log_basis validation

The setup_rejects_invalid_digit_budget test asserted the overly strict
delta*log_basis > 128 check that was intentionally removed in the
previous commit. Delete the test and its BadDigitBudgetConfig.

Made-with: Cursor

* style: fix formatting in ring_commitment_core.rs

Made-with: Cursor

* perf: parallelize proving hot paths, eliminate per-proof w-commitment setup

Parallelize the three proving bottlenecks (quad_eq, ring_switch, sumcheck)
and remove the per-proof matrix generation in commit_w by reusing the main
NTT cache.

Proving hot-path parallelism:
- Parallelize round-0 norm and relation sumcheck via cfg_fold_reduce! macro
- Parallelize DensePoly::decompose_fold with parallel fold-reduce over blocks
- Parallelize fold_evals_in_place and build_w_evals_compact with cfg_into_iter!
- Add cfg_fold_reduce! macro to unify parallel/sequential fold-reduce patterns
- Unify compute_round_{norm,relation}_{compact,full} into single generic fns

Sumcheck micro-optimizations:
- Unroll 3-point relation evaluation to avoid redundant from_u64 conversions
  and multiply-by-zero/one at evaluation points 0 and 1
- Hoist gadget_recompose_pow2 out of per-row loop in compute_r_split_eq

Eliminate per-proof w-commitment setup:
- Add w_ring_element_count() and w_commitment_layout() helpers to compute
  w-commitment matrix dimensions from the main layout
- Widen A/B matrices at setup time to max(main, w) column counts so the
  main NTT cache always covers the w-commitment (required when
  delta_commit=1, e.g. boolean polynomials)
- Rewrite commit_w to take &NttMatrixCache directly, inlining the commit
  logic with flat_map instead of intermediate Vec<Vec<...>>
- Remove w_setup field from HachiProverSetup
- Add ensure_matrix_shape_ge for >= column checks on widened matrices

Naming cleanup:
- Rename delta -> num_digits_commit, tau -> num_digits_fold,
  log_coeff_bound -> log_commit_bound throughout
- Add log_open_bound to DecompositionParams for recursive w commitments
- Hoist fully qualified paths (std::ops, std::mem, std::iter,
  crate::protocol::ring_switch::w_commitment_layout) to use statements

Made-with: Cursor

* perf: profile and accelerate opening proof hot paths

Replace D/B-row schoolbook quotient extraction with an NTT-based unreduced quotient path and add targeted tracing spans/timers plus a Perfetto profile example so prover bottlenecks are visible and cheaper to iterate on. Temporarily force the point-eval norm kernel to isolate fused-sumcheck behavior during profiling.

Made-with: Cursor

* perf: NTT-accelerate A-rows, reduce basis 16→8, fix saturation bug

Three optimizations to the proving pipeline:

1. NTT-accelerate A-rows in compute_r_split_eq: use
   unreduced_quotient_rows_ntt_cached for A*z_pre (O(D log D) instead
   of O(D^2) schoolbook). Also exploit sparse challenge structure in
   add_sparse_ring_product (O(weight*D) instead of O(D^2)).

2. Reduce decomposition basis from 16 to 8 (log_basis 4→3): halves the
   norm sumcheck range-check polynomial degree from 31 to 15, yielding
   ~4x speedup on the dominant prove-time bottleneck. Soundness is
   strictly improved (smaller MSIS norm bound).

3. Fix u128 saturation bug in compute_num_digits and r_decomp_levels
   that caused an incorrect extra decomposition level when b^levels
   overflows u128. Skip the balanced-range check when levels*log_basis
   > log_bound, since the digit range is mathematically guaranteed
   sufficient for b >= 4.

Also: replace hardcoded LOG_BASIS const with log_basis() function
derived from TinyConfig, fuse decompose+sparse-mul in decompose_fold
to i32 arithmetic, and add balanced_decompose_pow2_i8 variant.

Net result: prove time 4.76s → 1.57s (3.0x speedup) at num_vars=19.

Made-with: Cursor
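
The overflow hazard behind item 3 can be shown generically. The function below is a hypothetical stand-in, not the crate's `compute_num_digits`: it finds the smallest level count with b^levels >= bound and treats a u128 overflow as "already large enough" rather than letting the power saturate.

```rust
/// Smallest `levels` such that (2^log_basis)^levels >= bound.
/// Hypothetical helper; assumes 1 <= log_basis < 128.
fn num_levels(bound: u128, log_basis: u32) -> u32 {
    let b = 1u128 << log_basis;
    let mut levels = 1u32;
    loop {
        match b.checked_pow(levels) {
            // Capacity still below the bound: one more level is needed.
            Some(cap) if cap < bound => levels += 1,
            // Either cap >= bound, or b^levels no longer fits in u128,
            // in which case it certainly exceeds any u128 bound.
            _ => return levels,
        }
    }
}
```

By contrast, `16u128.saturating_pow(32)` silently returns `u128::MAX`, which is smaller than the true value 2^128, so comparisons built on the saturated value can come out wrong.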

* perf: i8 digit pipeline for w_hat — bypass Fp128 for small decomposed digits

Store w_hat/w_hat_flat as [i8; D] instead of CyclotomicRing<Fp128, D>,
eliminating redundant field arithmetic on values in [-b/2, b/2).

- Add balanced_decompose_pow2_i8 and gadget_recompose_pow2_i8
- Add CyclotomicCrtNtt::from_i8_with_params / from_i8_cyclic for
  direct i8 → CRT+NTT conversion (skips Fp128 centering)
- Add mat_vec_mul_ntt_cached_i8 and unreduced_quotient_rows_ntt_cached_i8
- Change QuadraticEquation w_hat/w_hat_flat types + all consumers
- Simplify build_w_coeffs to write i8 digits directly as field elements

Made-with: Cursor

* perf(poly): optimize range_check_eval and fold_evals_in_place

range_check_eval: precompute w² and use (w²−k²) instead of (w−k)(w+k),
saving one multiply per factor.

fold_evals_in_place: fold in-place with truncate() instead of allocating
a new Vec, removing the rayon dependency from this function.

Made-with: Cursor
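
A small sketch of the `range_check_eval` rewrite, under the assumption that the range-check polynomial has the symmetric form w * prod_{k=1..K} (w^2 - k^2). The real function works over a field and may use a different range; plain `i128` is used here only to keep the example self-contained.

```rust
/// Evaluates w * prod_{k=1..=k_max} (w^2 - k^2), which vanishes exactly
/// on the integers -k_max..=k_max. One multiply per factor after w^2.
fn range_check_eval(w: i128, k_max: i128) -> i128 {
    let w2 = w * w;
    let mut acc = w;
    for k in 1..=k_max {
        acc *= w2 - k * k;
    }
    acc
}

/// Reference version using (w - k)(w + k): two multiplies per factor.
fn range_check_eval_naive(w: i128, k_max: i128) -> i128 {
    let mut acc = w;
    for k in 1..=k_max {
        acc *= (w - k) * (w + k);
    }
    acc
}
```

In a field implementation the k^2 constants would be precomputed, so each factor costs one subtraction and one multiplication.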

* refactor(sumcheck): centralize and optimize norm sumcheck computation

Extract duplicated norm round polynomial logic from NormSumcheckProver
and HachiSumcheckProver into shared compute_norm_round_poly() and
compute_norm_round_poly_compact() functions.

Optimizations:
- Flat contiguous storage for RangeAffinePrecomp (coeff_mix_flat + row_offsets)
- Precomputed small-integer LUT (h_i(w_0)) for round-0 compact accumulation
- Native i128 range-check evaluation path for b <= 10
- Precomputed squared offsets in PointEvalPrecomp
- Make choose_round_kernel public with env var override and b-threshold dispatch

Made-with: Cursor

* feat(protocol): multi-level recursive folding proof

Replace single-shot proof with recursive multi-level folding. Instead of
sending the full w vector after one round of quad_eq → ring_switch →
sumcheck, the prover now recursively commits to w and opens it via the
same protocol until w is small enough to send directly.

Key changes:
- HachiProof now holds Vec<HachiLevelProof> + final_w instead of flat fields
- Remove SumcheckAux; each level carries a w_eval claim instead
- Extract prove_one_level / verify_one_level from monolithic prove/verify
- Folding stops via should_stop_folding heuristic (MIN_W_LEN_FOR_FOLDING,
  MIN_SHRINK_RATIO)
- QuadraticEquation takes explicit layout parameter for per-level configs
- ring_switch exports WCommitmentConfig for recursive w-openings
- D matrix widened to max(layout, w_layout) for shared setup
- HachiSumcheckVerifier gains w_val_override for intermediate levels

Made-with: Cursor

* chore(examples): update profile example for multi-level proofs and A/B kernel testing

- Extract run_prove() helper for reuse across kernel configs
- Add A/B test mode (HACHI_AB_TEST=1) to compare affine_coeff vs point_eval
- Update layout from (6,4) to (8,8)
- Report multi-level proof stats (levels, final_w length, proof size)
- Set 64 MiB rayon stack size

Made-with: Cursor

* style: remove section banners and hoist mid-function use statement

- Remove redundant section banner comments in proof.rs and commitment_scheme.rs
- Move choose_round_kernel import from function body to top-level in hachi_sumcheck.rs

Made-with: Cursor

* perf(algebra): use bitwise ops for balanced digit decomposition

Replace rem_euclid(b) with bitwise AND and division with right shift
in CyclotomicRing digit decomposition (decompose_balanced,
decompose_balanced_digit_planes, decompose_balanced_i8) and
DensePoly commit_with_setup. Valid since b is always a power of two.

Made-with: Cursor
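
The bitwise form of balanced decomposition can be sketched as below. The function names and the `i128` input type are assumptions; the point is only that for a power-of-two basis, `v & (b - 1)` equals `v.rem_euclid(b)` and an arithmetic right shift equals floor division.

```rust
/// Splits `v` into `levels` balanced base-2^log_basis digits in [-b/2, b/2),
/// least significant first. Assumes 1 <= log_basis <= 8 so digits fit in i8.
fn decompose_balanced(mut v: i128, log_basis: u32, levels: usize) -> Vec<i8> {
    debug_assert!((1..=8).contains(&log_basis));
    let b = 1i128 << log_basis;
    let mask = b - 1;
    let half = b >> 1;
    let mut digits = Vec::with_capacity(levels);
    for _ in 0..levels {
        let raw = v & mask; // same as v.rem_euclid(b)
        let carry = (raw >= half) as i128; // 1 when the digit wraps negative
        digits.push((raw - carry * b) as i8);
        v = (v >> log_basis) + carry; // floor division plus carry
    }
    digits
}

/// Inverse of `decompose_balanced` when `levels` was large enough.
fn recompose(digits: &[i8], log_basis: u32) -> i128 {
    digits
        .iter()
        .rev()
        .fold(0i128, |acc, &d| (acc << log_basis) + d as i128)
}
```

Adding the carry after the shift, rather than subtracting the digit before it, avoids an intermediate overflow when `v` is close to `i128::MAX`.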

* perf: store t_hat as i8 digit planes, cache w_folded to skip recompose

Switch t_hat storage from Vec<Vec<CyclotomicRing<F,D>>> to Vec<Vec<[i8;D]>>
throughout the commitment and proving pipeline. Decomposed digits lie in
[-b/2, b/2) for b = 2^log_basis (log_basis is typically 3), so i8 is sufficient
and avoids carrying full field-element ring elements through commit,
ring-switch, and serialization.

Key changes:
- CommitWitness and HachiCommitmentHint now hold [i8; D] digit planes
- New i8 variants: decompose_block_i8, decompose_rows_i8,
  mat_vec_mul_ntt_cached_i8, gadget_recompose_pow2_i8
- HachiPolyOps::commit_blocks returns [i8; D] digit planes
- QuadraticEquation caches w_folded (pre-decomposition folded ring
  elements) so compute_r_split_eq avoids a gadget_recompose roundtrip
- Precomputed idx/sign lookup tables for sparse challenge multiplication
- Custom i8 serialization for HachiCommitmentHint
- Remove bogus debug_assert constraining ring degree D<=128 in
  build_w_evals_compact (was checking log2(D) but message said log_basis)

Made-with: Cursor

* perf: optimize hot paths in commit/prove pipeline

- Hoist NTT conversions out of per-row quotient loops (crt_ntt, linear)
- Precompute c_alpha in compute_m_a_streaming (quadratic_equation)
- Compact alpha/m tables with variable-specific folding (sumcheck)
- Eliminate t_hat_flat rematerialization and zero_t_hat clones (commit, ring_switch, hachi_poly_ops)
- Merge duplicate w-eval passes (ring_switch, commitment_scheme)
- Clean up fully qualified paths (linear, relation_sumcheck, hachi_poly_ops)

Made-with: Cursor

* feat(algebra): add wide unreduced accumulators and fused shift-accumulate

Add Fp32x2i32, Fp64x4i32, Fp128x8i32 types that split field elements
into 16-bit limbs in i32 slots for carry-free SIMD-friendly addition.
Overflow budget ~32k signed adds before reduction.

Add shift_accumulate_into / shift_sub_into / mul_by_monomial_sum_into
on CyclotomicRing for fused negacyclic shift + accumulate without
temporary ring allocations. Make field offset constants C public.

Made-with: Cursor
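
To make the "16-bit limbs in i32 slots" idea concrete, here is a scalar sketch in the spirit of `Fp32x2i32`. The type `Wide`, its methods, and the modulus `P` are assumptions for illustration; the crate's types and their SIMD layout may differ.

```rust
/// Toy 32-bit prime modulus 2^32 - 5 (an assumption for illustration).
const P: u64 = 4_294_967_291;

/// A field element split into two 16-bit limbs, each stored in an i32 so
/// that many additions and subtractions can be applied before reducing.
#[derive(Clone, Copy, Default)]
struct Wide([i32; 2]);

impl Wide {
    fn from_canonical(x: u32) -> Self {
        Wide([(x & 0xFFFF) as i32, (x >> 16) as i32])
    }

    /// Limb-wise add with no carry propagation and no modular reduction.
    fn add_assign(&mut self, other: Wide) {
        self.0[0] += other.0[0];
        self.0[1] += other.0[1];
    }

    /// Limb-wise subtract; limbs may go negative, which is fine.
    fn sub_assign(&mut self, other: Wide) {
        self.0[0] -= other.0[0];
        self.0[1] -= other.0[1];
    }

    /// Recombines the limbs and performs the single deferred reduction.
    fn reduce(self) -> u32 {
        let v = self.0[0] as i64 + ((self.0[1] as i64) << 16);
        v.rem_euclid(P as i64) as u32
    }
}
```

Each limb is below 2^16 on entry and an i32 holds magnitudes up to about 2^31, which is where the budget of roughly 2^15 (about 32k) signed additions comes from.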

* refactor(protocol): per-matrix NttSlotCache, fused one-hot commit, bench stack fix

Replace monolithic NttMatrixCache with per-matrix NttSlotCache, removing
HachiPreparedSetup and MatrixSlot enum. HachiProverSetup now holds three
independent NttSlotCache instances (A, B, D). Simplify dispatch macros in
linear.rs to operate on a single slot.

Add CommitCache associated type to HachiPolyOps trait. Wire one-hot
commit path to use fused mul_by_monomial_sum_into, eliminating temporary
allocations.

Fix pre-existing benchmark stack overflow by configuring rayon with a
64MB thread stack (matching examples/profile.rs).

Made-with: Cursor

* feat(commit): column-tiled A matvec for cache-efficient commitment

Add mat_vec_mul_ntt_tiled_i8 and mat_vec_mul_ntt_tiled_single_i8 that
tile the NTT matrix columns into L2-sized chunks (~400 cols). Each
rayon thread owns one tile and iterates over all blocks, so the matrix
is loaded from DRAM exactly once. Ring coefficients are decomposed
on-the-fly per tile to avoid full digit materialization.

All call sites (commit, commit_coeffs, commit_onehot, ring_switch,
quadratic_equation, HachiPolyOps::commit_inner) updated to use the
tiled API. Reduces total DRAM traffic ~25x for large traces.

Made-with: Cursor

* refactor: promote TWO_INV and ZERO to const associated items on FieldCore

Hoists two_inv from a trait method to a compile-time constant, and adds
const ZERO so extension fields (Fp2, Fp4) can build their TWO_INV without
runtime calls. Deduplicates CrtNttParamSet computation across A/B/D caches.

Made-with: Cursor

* refactor: remove two_inv parameters now that TWO_INV is a const

Functions and macros no longer thread two_inv through call chains;
they reference F::TWO_INV directly. Also removes the runtime
computation in batched_sumcheck.

Made-with: Cursor

* feat(commit): stub HachiSerialize for HachiProverSetup

Add Valid + HachiSerialize impls for HachiProverSetup that return
an error on serialize (NTT caches are runtime artifacts). Needed
by downstream wrappers that require the trait bound.

Made-with: Cursor

* perf: fuse hot loops, eliminate allocations, cheaper CRT reduction

- mul_by_sparse: use shift_accumulate_into/shift_sub_into for ±1 coeffs
- inverse NTT: fuse d_inv and psi_inv trailing passes into one loop
- CRT conversion: replace __modti3 (i128 % i128) with split i64 arithmetic
- Fp128 sqr_raw: 3 widening muls instead of 4 via squaring symmetry
- decompose_block_i8: add _into variant, reuse buffer across tiles
- sumcheck: fuse norm+relation into single pass over w_table
- ring_switch: fuse expand_m_a+build_m_evals_x, rayon::join parallel phases
- ring_switch: build_w_evals_dual uses unzip instead of triple allocation
- quadratic_equation: hoist scratch allocations out of row loop

Made-with: Cursor

* feat: wide ring accumulators with NEON SIMD for one-hot commitment

Introduce carry-free wide accumulators (Fp32x2i32, Fp64x4i32,
Fp128x8i32) that defer modular reduction during one-hot commitment,
yielding 69x faster commit for sparse witnesses.

Key changes:
- AdditiveGroup trait decoupling additive ops from full FieldCore
- WideCyclotomicRing<W, D> for carry-free ring accumulation
- HasWide / ReduceTo traits for type-level wide ↔ canonical dispatch
- NEON SIMD backends for Fp64x4i32 and Fp128x8i32 with scalar fallback
- inner_ajtai_onehot_wide replaces inner_ajtai_onehot_t_only
- Profile example now covers both dense and one-hot paths

Made-with: Cursor

* refactor: drop "_tiled" suffix from mat-vec functions

Tiling is an internal optimization detail, not an API distinction.
The tiled versions are the only production path; non-tiled variants
exist only as #[cfg(test)] reference implementations.

Made-with: Cursor

* refactor: rename Fp128CommitmentConfig, hoist inline qualified path

- Drop "Production" prefix from ProductionFp128CommitmentConfig
- Hoist crate::algebra::fields::LiftBase to use statement in
  sparse_challenge.rs

Made-with: Cursor

* feat: pack final_w as balanced digits, use Vec<i8> throughout prover

Represent the prover's witness vector w as Vec<i8> instead of Vec<F>
throughout the folding pipeline. Introduces PackedDigits to bit-pack
the final-level w into log_basis bits per element, reducing proof size
by ~32x. Cleans up import hygiene in profile example and proof module.

Made-with: Cursor
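
Bit-packing balanced digits can be sketched as follows. `pack_digits` and `unpack_digits` are hypothetical helpers, not the `PackedDigits` API; they assume digits in [-b/2, b/2) with b = 2^log_basis and 1 <= log_basis <= 8.

```rust
/// Packs each digit into `log_basis` bits (two's-complement low bits),
/// least significant bit first within a little-endian bit stream.
fn pack_digits(digits: &[i8], log_basis: usize) -> Vec<u8> {
    let mask = (1u16 << log_basis) - 1;
    let mut out = vec![0u8; (digits.len() * log_basis + 7) / 8];
    let mut bit = 0usize;
    for &d in digits {
        let u = (d as u8 as u16) & mask;
        for k in 0..log_basis {
            if (u >> k) & 1 == 1 {
                out[(bit + k) / 8] |= 1u8 << ((bit + k) % 8);
            }
        }
        bit += log_basis;
    }
    out
}

/// Reads `count` digits back and sign-extends them from `log_basis` bits.
fn unpack_digits(bytes: &[u8], log_basis: usize, count: usize) -> Vec<i8> {
    let half = 1i16 << (log_basis - 1);
    (0..count)
        .map(|i| {
            let mut u = 0i16;
            for k in 0..log_basis {
                let bit = i * log_basis + k;
                u |= (((bytes[bit / 8] >> (bit % 8)) & 1) as i16) << k;
            }
            if u >= half { (u - 2 * half) as i8 } else { u as i8 }
        })
        .collect()
}
```

The digit count has to be stored or known separately, because the last byte may contain padding bits.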

* perf: use const digit lookup table for i8-to-field conversion

Add const fn digit_lut to Fp128 and FromSmallInt trait for
precomputing balanced-digit-to-field-element tables. Replaces
per-element from_i64 calls with indexed loads in the three hot
prover loops (commit_w, build_w_evals_dual, dense_poly_from_w).

Made-with: Cursor
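
A compile-time digit table can be sketched like this. The modulus `P`, the 16-entry range, and the names `DIGIT_LUT` and `digit_to_field` are assumptions; the crate's `digit_lut` on `Fp128` will differ in types and representation.

```rust
/// Toy prime modulus (2^61 - 1), an assumption for illustration only.
const P: u64 = (1 << 61) - 1;

/// Digits are assumed to lie in [-HALF, HALF).
const HALF: i8 = 8;

/// Builds the table at compile time: entry i holds (i - HALF) mod P.
const fn digit_lut() -> [u64; 16] {
    let mut table = [0u64; 16];
    let mut i = 0;
    while i < 16 {
        let d = i as i64 - HALF as i64;
        table[i] = if d < 0 { P - (-d) as u64 } else { d as u64 };
        i += 1;
    }
    table
}

const DIGIT_LUT: [u64; 16] = digit_lut();

/// Converts a balanced digit to a field element with one indexed load.
fn digit_to_field(d: i8) -> u64 {
    debug_assert!((-HALF..HALF).contains(&d));
    DIGIT_LUT[(d + HALF) as usize]
}
```

The hot loop then does no sign handling or modular arithmetic per element, only an offset and a load.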

* perf: add DigitMontLut for i8 mat-vec kernels, clean up imports

Add a precomputed Montgomery lookup table (DigitMontLut) for balanced
digit values {-8..7}, replacing per-coefficient from_canonical calls
in the i8→CRT+NTT conversion hot path. Wire it into mat_vec_mul_ntt_i8,
mat_vec_mul_ntt_single_i8, and unreduced_quotient_rows_ntt_cached_i8.

Also: merge duplicate NTT butterfly imports, remove duplicated doc
comment on from_ring_cyclic, export DigitMontLut through ring/algebra
modules, apply cargo fmt.

Made-with: Cursor

* perf: NEON SIMD kernels, decompose_fold optimization, explicit layout API

Add AArch64 NEON SIMD for NTT butterflies, pointwise multiply-accumulate,
and add-reduce (neon.rs). Dispatch from butterfly.rs and linear.rs with
runtime feature check and scalar fallback.

Optimize DensePoly::decompose_fold with two-phase restructure: K=3
interleaved carry chains for ILP on decomposition, then NEON rotate-and-add
scatter (decompose_fold_neon.rs). ~2x speedup on compute_z_pre.

Optimize OneHotPoly::decompose_fold by replacing O(omega*D) mul_by_sparse
with direct sparse scatter O(omega*|nonzero_coeffs|). ~22x speedup.

Thread explicit HachiCommitmentLayout through commit/prove/verify instead
of computing from setup internally. Add OneHotIndex trait for generic
onehot indices. Profile now uses OneHotPoly end-to-end for the onehot path.

Clean up imports: hoist qualified crate::algebra::ntt::neon paths, move
test-function use statements to module scope.

Made-with: Cursor

* perf: unreduced accumulation for sumcheck, fused compact round-0 loop

Introduce HasUnreducedOps trait with MulU64Accum / ProductAccum types
for Fp64, Fp128, and Fp2, enabling widening multiplies that defer
reduction until after accumulation.

Key changes:
- Fuse norm + relation computation into a single pass for compact
  (Round 0) via compute_round_compact_fused, using split pos/neg
  MulU64Accum for the relation and i128/LUT arithmetic for the norm.
- Sparse integer representation for affine-coeff precomputation
  (SparseCoeffEntry) with batched x4 kernel (compute_entry_coeffs_x4).
- Two-level inner/outer ProductAccum accumulation for affine-coeff
  kernel, both compact and full-field paths.
- Optimize fold_compact_to_full to use mul_u64_unreduced for r * delta.
- Parallelize OneHotPoly::evaluate_ring, fold_blocks, decompose_fold.
- Add FromSmallInt::from_i128 default method.

Made-with: Cursor

* perf: two-level ProductAccum for full-field affine-coeff kernel

Upgrade the WTable::Full + AffineCoeffComposition path in
HachiSumcheckProver to use two-level ProductAccum accumulation
(outer loop over e_second, inner mul_to_product_accum, single
reduction per j_high block), matching the standalone norm_sumcheck.rs
implementation.

Also fix multilinear_eval_small missing FromSmallInt bound, switch
commitment_scheme w_eval to use w_evals_field (w_evals is moved),
and add missing doc on ScaleI32 trait method.

Made-with: Cursor

* style: rustfmt formatting for poly.rs and hachi_sumcheck.rs

Made-with: Cursor

* fix(ci): use compound assignment operators to satisfy clippy

Made-with: Cursor

* chore: remove docs/ and paper/ from tracked files

Backed up to quang/temp-docs branch. Files remain on disk.

Made-with: Cursor

* fix(ci): implement assign traits and fix all clippy assign_op_pattern lints

Add MulAssign for Fp128, and AddAssign/SubAssign/MulAssign for all
PackedNeon types. Convert all x = x op y patterns to x op= y across
benches, tests, and lib.

Made-with: Cursor

* fix(ci): add assign traits to NoPacking, AVX2/AVX512 packed types, Fp32, Fp64

NoPacking<T> (x86_64 fallback) was missing AddAssign/SubAssign/MulAssign,
causing CI failures on the GitHub runner. Add assign traits uniformly
across all packed backends and scalar field types. Fix remaining
assign_op_pattern lints in benches and tests.

Made-with: Cursor

* fix(ci): fix no-default-features clippy — unused var, dead code, rayon gate

- Allow unused rel_combine (only used in parallel reduce combiner)
- Allow dead_code on add_ntt_into (only used in parallel + aarch64)
- Gate rayon::ThreadPoolBuilder behind cfg(feature = "parallel")
- Fix remaining assign_op_pattern in norm_sumcheck bench

Made-with: Cursor
… infrastructure (#7)

* fix: separate delta_commit and delta_open for t_hat decomposition

t = A * s produces full-field-size coefficients even when s has small
(delta_commit-digit) entries. The code was decomposing t_hat using
delta_commit instead of delta_open, causing lossy truncation and
breaking verification for onehot/logbasis commitment configs.

Split commit_inner's num_digits parameter into num_digits_commit (for s)
and num_digits_open (for t_hat), and propagate this distinction through
layout, commit, quadratic_equation, and ring_switch.

Also:
- Add Fp128FullCommitmentConfig, Fp128OneHotCommitmentConfig,
  Fp128LogBasisCommitmentConfig bounded commitment configs
- Add optimal_m_r_split for dynamic m/r layout selection
- Refactor profile example to be generic over CommitmentConfig and
  accept HACHI_NUM_VARS / HACHI_MODE env vars

Made-with: Cursor

* refactor(algebra): add repr(transparent) to CyclotomicRing types

Enables safe transmute between `[CyclotomicRing<F, D>]` and `[F]` for
the upcoming FlatMatrix D-agnostic storage layer.

Made-with: Cursor

* refactor(commitment): D-agnostic FlatMatrix storage + halving-D scaffolding

Replace `Vec<Vec<CyclotomicRing<F, D>>>` in HachiExpandedSetup with
`FlatMatrix<F>`, a D-agnostic flat field-element array that can be viewed
at any ring dimension via `.view::<D>()`. This decouples setup storage
from the const-generic D, enabling future varying-D prove loops.

Key changes:
- HachiExpandedSetup<F, D> → HachiExpandedSetup<F> (loses D)
- HachiVerifierSetup<F, D> → HachiVerifierSetup<F>
- NTT/CRT functions take RingMatrixView instead of &[Vec<CyclotomicRing>]
- New FlatMatrix, NttCache, and dispatch_ring_dim! infrastructure
- New CommitmentConfig::d_at_level / n_a_at_level trait methods
- New Fp128HalvingDCommitmentConfig (D=512→256→128→64)
- commit_w made pub for future varying-D usage

Made-with: Cursor
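
The idea of viewing one flat buffer at several ring dimensions can be sketched safely. This `FlatMatrix` is a hypothetical stand-in over `u64`: it uses checked slice-to-array conversion rather than the `repr(transparent)` reinterpretation the commits describe, and it omits row/column bookkeeping.

```rust
use std::convert::TryFrom;

/// Flat field-element storage with no ring dimension baked into the type.
struct FlatMatrix {
    data: Vec<u64>,
}

impl FlatMatrix {
    /// Views the buffer as consecutive ring elements of dimension `D`.
    /// Panics if the buffer length is not a multiple of `D`.
    fn view<const D: usize>(&self) -> Vec<&[u64; D]> {
        assert!(D > 0 && self.data.len() % D == 0);
        self.data
            .chunks_exact(D)
            // Cannot fail: chunks_exact yields slices of length exactly D.
            .map(|chunk| <&[u64; D]>::try_from(chunk).unwrap())
            .collect()
    }
}
```

The same storage can then be read as two elements at D = 4 or four elements at D = 2, which is the property a halving-D prove loop relies on.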

* refactor(bench): rewrite benchmarks with real configs and parameterized D

Replace hand-rolled bench_config! macro with real commitment configs
(Fp128FullCommitmentConfig, Fp128OneHotCommitmentConfig,
Fp128LogBasisCommitmentConfig). Parameterize D as const generic instead
of hardcoding. Use random evaluations, iter_batched for prove bench,
and add HACHI_PARALLEL=0 env var for single-threaded runs.

Made-with: Cursor

* fix: eliminate debug-build stack overflow via dispatch extraction and NTT cache boxing

Extract dispatch_ring_dim!/dispatch_with_ntt! macro expansions into
dedicated #[inline(never)] functions (dispatch_prove_level,
dispatch_verify_level, dispatch_commit) so monomorphized match arms
live in separate stack frames instead of bloating the caller.

Box NttSlotCache<D> fields inside MultiDNttCaches to avoid ~465KB
temporaries on the stack when constructing MultiDNttBundle.

Remove with_large_stack test wrappers and .cargo/config.toml —
all tests now pass with the default 2MB stack in debug builds.

Clean up import hygiene: hoist in-function use statements,
replace inline fully-qualified paths with top-level imports.

Made-with: Cursor

* fix: broken doc links and clippy needless_range_loop

- Use crate-qualified paths for MultiDNttBundle and HachiExpandedSetup
  doc links in dispatch_with_ntt macro
- Replace index loop with iterator in flat_matrix test

Made-with: Cursor
* Add rayon parallelism behind `parallel` feature flag (enabled by default)

- New src/parallel.rs with cfg_iter!/cfg_into_iter!/cfg_chunks! macros
  that dispatch to rayon parallel iterators when `parallel` is enabled
- Parallelize protocol hot paths: ring polynomial division, w_evals
  construction, M_alpha evaluation, ring vector evaluation, packed ring
  poly evaluation, coefficients-to-ring reduction, quadratic equation
  folding, and sumcheck round polynomial computation
- All 174 tests pass with and without the parallel feature

Made-with: Cursor

* Add e2e benchmark and make HachiCommitmentScheme generic over config

- Make HachiCommitmentScheme generic over <const D, Cfg> so different
  configs (and thus num_vars) can be used without code duplication.
- Remove hardcoded DefaultCommitmentConfig::D from ring_switch.rs;
  WCommitmentConfig and commit_w now flow D generically.
- Add benches/hachi_e2e.rs with configs sweeping nv=10,14,18,20.

Made-with: Cursor

* Refactor CRT-NTT backend: generalize over PrimeWidth, add Q128 support

Make NTT primitives (NttPrime, NttTwiddles, MontCoeff, CyclotomicCrtNtt)
generic over PrimeWidth (i16/i32) instead of hardcoding i16. Replace the
monolithic QData struct with separate GarnerData and per-prime NttPrime
arrays. Add Q128 parameter set (5 × i32 primes, D ≤ 1024) alongside the
existing Q32 set. Simplify ScalarBackend by removing the const-generic
limb count from to_ring_with_backend.

Made-with: Cursor

* Add extension field arithmetic and refactor sumcheck trait bounds

Split CanonicalField into FromSmallInt (from_{u,i}{8,16,32,64} for all
fields) and CanonicalField (u128 repr, base fields only). Implement
FromSmallInt, Eq, Debug for Fp2/Fp4. Add ExtField<F> trait with
EXT_DEGREE and from_base_slice.

Optimize extension field arithmetic: Karatsuba multiplication for Fp2
and Fp4 (3 base muls instead of 4), specialized squaring (2 base muls
for Fp2), non-residue IS_NEG_ONE specialization. Add concrete configs
(TwoNr, NegOneNr, UnitNr) and type aliases Ext2<F>, Ext4<F>.
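
The 3-versus-4 multiplication count can be checked on a toy quadratic extension. This is a sketch only: the prime `P`, the non-residue `NR`, and the tuple representation are illustrative assumptions and do not reflect the crate's `Fp2` type. Multiplication by the constant `NR` is not counted as a base multiplication.

```rust
const P: u64 = 65_537; // small prime, chosen only for illustration
const NR: u64 = 3; // assumed non-residue, so that u^2 = NR

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

/// Schoolbook product of (a0 + a1*u)(b0 + b1*u): four base multiplications.
/// Inputs are assumed to be reduced modulo P.
fn fp2_mul_schoolbook(a: (u64, u64), b: (u64, u64)) -> (u64, u64) {
    let c0 = add(mul(a.0, b.0), mul(NR, mul(a.1, b.1)));
    let c1 = add(mul(a.0, b.1), mul(a.1, b.0));
    (c0, c1)
}

/// Karatsuba product: three base multiplications (v0, v1, cross).
fn fp2_mul_karatsuba(a: (u64, u64), b: (u64, u64)) -> (u64, u64) {
    let v0 = mul(a.0, b.0);
    let v1 = mul(a.1, b.1);
    let cross = mul(add(a.0, a.1), add(b.0, b.1));
    // cross - v0 - v1 = a0*b1 + a1*b0
    (add(v0, mul(NR, v1)), sub(sub(cross, v0), v1))
}
```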

Add transpose-based packed extension fields (PackedFp2, PackedFp4)
for SIMD acceleration, following Plonky3's approach.

Relax sumcheck bounds from E: CanonicalField to E: FromSmallInt (or
E: FieldCore where spurious). Add sample_ext_challenge transcript
helper. Includes tests for extension field sumcheck execution.

Made-with: Cursor

* Fix CRT+NTT correctness and optimize negacyclic NTT pipeline

Correctness fixes:
- Rewrite negacyclic NTT as twist + cyclic DIF/DIT pair (no bit-reversal
  permutation), correctly diagonalizing X^D+1.
- Center coefficient→CRT mapping and Garner reconstruction to handle
  negacyclic sign wrapping consistently.
- Fix i32 Montgomery csubp/caddp overflow via branchless i64 widening.
- Fix q128 centering overflow in balanced_decompose_pow2 (avoid casting
  q≈2^128 into i128).
- Remove dense-protocol schoolbook fallback; all mat-vec now routes
  through CRT+NTT.
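
The twist identity behind the first fix can be checked on a tiny example. This sketch assumes `P = 17`, `D = 4`, and `PSI = 2` (so `PSI^D = -1 mod P`), and it uses a naive cyclic convolution where the real code would use the DIF/DIT NTT pair; all names are hypothetical.

```rust
const P: u64 = 17;
const D: usize = 4;
const PSI: u64 = 2; // PSI^D = 16 = -1 mod P: a primitive 2D-th root of unity

fn pow_mod(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1;
    base %= P;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % P;
        }
        base = base * base % P;
        exp >>= 1;
    }
    acc
}

/// Reference product modulo X^D + 1: wrapped terms pick up a minus sign.
fn negacyclic_schoolbook(a: &[u64; D], b: &[u64; D]) -> [u64; D] {
    let mut c = [0u64; D];
    for i in 0..D {
        for j in 0..D {
            let prod = a[i] * b[j] % P;
            let k = (i + j) % D;
            c[k] = if i + j < D { (c[k] + prod) % P } else { (c[k] + P - prod) % P };
        }
    }
    c
}

/// Plain cyclic convolution (product modulo X^D - 1).
fn cyclic(a: &[u64; D], b: &[u64; D]) -> [u64; D] {
    let mut c = [0u64; D];
    for i in 0..D {
        for j in 0..D {
            let k = (i + j) % D;
            c[k] = (c[k] + a[i] * b[j]) % P;
        }
    }
    c
}

/// Twist by PSI^i, convolve cyclically, then untwist by PSI^(-i).
fn negacyclic_via_twist(a: &[u64; D], b: &[u64; D]) -> [u64; D] {
    let twist = |v: &[u64; D]| -> [u64; D] {
        std::array::from_fn(|i| v[i] * pow_mod(PSI, i as u64) % P)
    };
    let c = cyclic(&twist(a), &twist(b));
    let psi_inv = pow_mod(PSI, P - 2); // Fermat inverse
    std::array::from_fn(|i| c[i] * pow_mod(psi_inv, i as u64) % P)
}
```

Substituting `X -> PSI * X` maps `X^D + 1` to `-(X^D - 1)`, which is why the cyclic product of the twisted inputs equals the twisted negacyclic product.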

Performance optimizations:
- Precompute per-stage twiddle roots in NttTwiddles (eliminate runtime
  pow_mod per butterfly stage).
- Forward DIF butterfly skips reduce_range before Montgomery mul (safe
  because mul absorbs unreduced input).
- Hoist centered-coefficient computation out of per-prime loop in
  from_ring.
- Add fused pointwise multiply-accumulate for mat-vec inner loop.
- Add batched mat_vec_mul_crt_ntt_many that precomputes matrix NTT once
  and reuses across many input vectors.
- Wire commit_ring_blocks to batched A*s path.

Benchmarks (D=64, Q32/K=6):
- Single-prime forward+inverse NTT: 1.14µs → 0.43µs (2.7x)
- CRT round-trip: 10.7µs → 6.3µs (1.7x)
- Commit nv10: ~70% faster, nv20: ~47% faster

Made-with: Cursor

* Cache CRT+NTT matrix representations in setup to avoid repeated conversion

The dense mat-vec paths (commit_ring_blocks, commit_onehot B-mul, compute_v)
previously converted coefficient-form matrices to CRT+NTT on every call.
Now the setup eagerly converts A, B, D into an NttMatrixCache and all
dense operations use the pre-converted form. Coefficient-form matrices
are retained for the onehot inner-product path and ring-switch/generate_m.

Made-with: Cursor

* Remove dead code (HachiRoutines, domains/, redundant trait methods) and extract shared field utilities

- Delete unused HachiRoutines trait and dead algebra/domains/ module
- Remove redundant FieldCore::add/sub/mul and Module::add/neg (covered by ops traits)
- Extract is_pow2_u64, log2_pow2_u64, mul64_wide into fields/util.rs to deduplicate

Made-with: Cursor

* Unify Blake2b and Keccak transcript backends into generic HashTranscript

Replace separate blake2b.rs and keccak.rs with a single generic
HashTranscript<D: Digest> parameterized by hash function. Blake2bTranscript
and KeccakTranscript are now type aliases.

Made-with: Cursor

* Fix sumcheck degree bug, split types, in-place fold, CommitWitness, rename configs, add soundness test

- Fix CompressedUniPoly::degree() off-by-one that could let malformed proofs pass
- Split sumcheck/mod.rs: extract types into types.rs, relocate multilinear_eval
  and fold_evals to algebra/poly.rs
- Replace allocating fold_evals with in-place fold_evals_in_place
- Add debug_assert guards to multilinear_eval and fold_evals_in_place
- Introduce CommitWitness struct to replace error-prone 3-tuple returns
- Rename DefaultCommitmentConfig to SmallTestCommitmentConfig, add
  ProductionFp128CommitmentConfig
- Add verify_rejects_wrong_opening negative test for verifier soundness
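
For the in-place fold item above, a minimal sketch of the idea follows. The toy modulus, the `Vec<u64>` representation, and the low-half/high-half pairing are assumptions made for illustration; the crate's `fold_evals_in_place` may pair entries differently.

```rust
const P: u64 = 97; // toy modulus

/// Bind one variable of a multilinear polynomial, given by its evaluations on
/// the Boolean hypercube, to the value `r`. The result overwrites the front
/// half of `evals` and the vector is truncated, so no new buffer is allocated.
/// Assumed layout: entries `i` and `i + half` differ only in the bound variable.
fn fold_evals_in_place(evals: &mut Vec<u64>, r: u64) {
    debug_assert!(evals.len().is_power_of_two() && evals.len() >= 2);
    let half = evals.len() / 2;
    for i in 0..half {
        let lo = evals[i];
        let hi = evals[i + half];
        // lo + r * (hi - lo) mod P
        evals[i] = (lo + r * ((hi + P - lo) % P)) % P;
    }
    evals.truncate(half);
}
```

Folding with `r = 0` keeps the low half and `r = 1` keeps the high half, which is a quick sanity check for the pairing.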

Made-with: Cursor

* fix(test): resolve clippy needless_range_loop in algebra tests

Use iter().enumerate() for schoolbook convolution loops and
array::from_fn for pointwise NTT operations.

Made-with: Cursor

* Refactor commitment setup to runtime layout and staged artifacts.

This removes compile-time commitment shape locks, derives beta from runtime layout, and threads layout-aware setup through commit/prove/verify with setup serialization roundtrip coverage.

Made-with: Cursor

* Soundness hardening: panic-free verifier, Fiat-Shamir binding, NTT overflow fix

- Verifier path never panics; all errors return HachiError
- Bind commitment, opening point, and y_ring in Fiat-Shamir transcript
- Fix i16 csubp/caddp overflow by widening to i32
- multilinear_eval returns Result with dimension checks
- build_w_evals validates w.len() is a multiple of d
- UniPoly::degree uses saturating_sub instead of expect
- Serialize usize as u64 for 32/64-bit portability
- Fix from_i64(i64::MIN) via unsigned_abs
- Remove Transcript::reset from public trait (move to inherent)
- Add batched_sumcheck verifier empty-input guard
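
The `from_i64(i64::MIN)` item relies on the fact that negating `i64::MIN` overflows, while `unsigned_abs` returns its magnitude as a `u64`. A small sketch, assuming a 64-bit prime modulus picked only for illustration (this is not the crate's field type):

```rust
const P: u64 = 0xFFFF_FFFF_0000_0001; // illustrative 64-bit prime

/// Map a signed integer to its canonical representative modulo P.
fn from_i64(x: i64) -> u64 {
    // `unsigned_abs` is well-defined for every input, including `i64::MIN`,
    // whereas `-x` or `x.abs()` would overflow there.
    let mag = x.unsigned_abs() % P;
    if x < 0 && mag != 0 { P - mag } else { mag }
}
```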

Made-with: Cursor

* Hoist fully qualified paths to use statements in touched files

Replace inline crate::protocol::commitment::HachiCommitmentLayout,
hachi_pcs::algebra::backend::{CrtReconstruct, NttPrimeOps}, and
hachi_pcs::algebra::CyclotomicRing with top-level use imports.

Made-with: Cursor

* Dispatch norm sumcheck kernels by range size.

Route small-b rounds through the point-eval interpolation kernel and keep the affine-coefficient kernel for larger b, while adding deterministic baseline-vs-dispatched benchmarks and parity tests to validate correctness across both strategies.

Made-with: Cursor

* Format commitment-related files for readability.

Apply non-functional formatting and import ordering cleanups across commitment, ring-switch, and benchmark/test files to keep the codebase style consistent.

Made-with: Cursor

* Format: cargo fmt pass on commitment-related files

Made-with: Cursor

* feat: sequential coefficient ordering + streaming commitment

Change coefficient-to-ring packing from strided to sequential, enabling
true streaming where each trace chunk maps to exactly one inner Ajtai
block. Implement StreamingCommitmentScheme for HachiCommitmentScheme.

- reduce_coeffs_to_ring_elements: sequential packing (chunks_exact(D))
- prove/verify: opening point split flipped to (inner, outer)
- ring_opening_point_from_field: outer split flipped to (M first, R second)
- commit_coeffs: sequential block distribution
- map_onehot_to_sparse_blocks: sequential block distribution
- HachiChunkState + process_chunk / process_chunk_onehot / aggregate_chunks
- Streaming commit tests (matches non-streaming, prove/verify roundtrip)
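
The difference between the two packings can be sketched as follows. Both functions are hypothetical, and `pack_strided` shows only one plausible strided layout; the point is that with sequential packing each ring element depends on a single contiguous chunk of the input, which is what allows chunk-by-chunk streaming.

```rust
/// Sequential packing: ring element `i` holds coefficients `i*D .. (i+1)*D`.
fn pack_sequential<const D: usize>(coeffs: &[u64]) -> Vec<[u64; D]> {
    coeffs
        .chunks_exact(D)
        .map(|c| c.try_into().expect("chunk has length D"))
        .collect()
}

/// One possible strided packing: ring element `i` holds every `n`-th
/// coefficient starting at `i`, where `n` is the number of ring elements.
fn pack_strided<const D: usize>(coeffs: &[u64]) -> Vec<[u64; D]> {
    let n = coeffs.len() / D;
    (0..n)
        .map(|i| std::array::from_fn(|k| coeffs[i + k * n]))
        .collect()
}
```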

Made-with: Cursor

* refactor: decompose verify_batched_sumcheck into composable steps

Split the monolithic verify_batched_sumcheck into three pieces:
- verify_batched_sumcheck_rounds: replay rounds, return intermediate state
- compute_batched_expected_output_claim: query verifier instances
- check_batched_output_claim: enforce equality

This enables callers (e.g. Greyhound) to intercept the intermediate
sumcheck state before the final oracle check. The original function
is preserved as a convenience wrapper.

Made-with: Cursor

* feat: Labrador/Greyhound recursive lattice proof protocol

Implements the full Labrador recursive amortization and Greyhound
evaluation reduction, ported from the C reference with Hachi-native
Fiat-Shamir transcript integration.

New modules:
- protocol::labrador — recursive proof (prover, verifier, fold, commit,
  challenge rejection sampler, JL projection, config/guardrails, types)
- protocol::greyhound — evaluation reduction (4-row witness, 5
  constraints, eval prover + verifier-side reduce)
- protocol::prg — pluggable PRG backends (SHAKE256, AES-128-CTR) for
  commitment key and JL matrix derivation

Hachi-core changes:
- algebra::ring — conjugation automorphism, coeff_norm_sq, ternary/
  quaternary samplers for Labrador challenges
- protocol::commitment — pre-derived setup matrices, PRG backend
  abstraction for matrix derivation
- protocol::proof — HachiProof restructured as composite of folds +
  GreyhoundEvalProof + LabradorProof
- protocol::ring_switch — externalized w_tilde(r) check for Greyhound
- protocol::transcript — ring-element challenge functions (dense +
  rejection-sampled), 16 new Fiat-Shamir labels
- protocol::commitment_scheme — integrated Greyhound/Labrador into
  prove/verify pipeline
- sumcheck tests decoupled from old proof structure

Made-with: Cursor

* Impl folded Labrador protocol

* Refactor Labrador Witness

* Refactor Labrador Constraints

* Change Greyhound to use Labrador scheme

* Update gitignore

* Fix CI issues

* Use constants instead of hardcoded values

* feat: integrate Greyhound/Labrador lattice proof protocol into main

Port the Greyhound evaluation-reduction and Labrador recursive lattice
proof modules from dev-labrador onto main's optimized proving pipeline.
Greyhound/Labrador is invoked as a final proof step after multi-level
folding when D >= 64, providing post-quantum security for the opening.

New modules: protocol/greyhound, protocol/labrador, protocol/prg.
Algebra extensions: coefficients_mut, coeff_norm_sq,
balanced_decompose_pow2_with_carry, conjugation_automorphism_ntt,
sample_ternary/quaternary.

Made-with: Cursor

* Remove integration to Hachi

* Fix CI issue

---------

Co-authored-by: Quang Dao <quang.dao@layerzerolabs.org>
Save HachiExpandedSetup (seed + matrices A, B, D) to an OS-specific
cache directory on first generation, and transparently load it on
subsequent calls to avoid re-deriving matrices from SHAKE. NTT caches
are rebuilt from the deserialized matrices.

Pattern follows Dory's disk-persistence approach but saves only the
expanded setup (not prover+verifier separately) since NTT caches are
not serializable and must be reconstructed.

Made-with: Cursor
* fix: harden CI workflow to resolve CodeQL security alerts

Pin all GitHub Actions to immutable commit SHAs and add least-privilege
permissions (contents: read) to address 9 medium-severity CodeQL alerts.

Made-with: Cursor

* chore: add .cursor/ to .gitignore

Made-with: Cursor

---------

Co-authored-by: Omid Bodaghi <42227752+omibo@users.noreply.github.com>
…outer_weights (#10)

Three performance fixes for the prove path:

1. Gate prove_level_diagnostic, prove_level_selfcheck, and the w_eval
   consistency check behind #[cfg(debug_assertions)]. These were running
   unconditionally in release builds, causing duplicate compute_m_a_streaming
   calls and full polynomial evaluations purely for debug verification.

2. Factor outer_weights in prove_one_level: instead of materializing the
   full 2^(m_vars + r_vars) basis weight vector (~2.1 GB for large traces),
   pass ring_opening_point.b (size 2^r_vars) and derive the evaluation
   from the fold result: eval = Σ_i b[i] * fold(a)[i].

3. Update HachiPolyOps::evaluate_and_fold signature to accept factored
   per-block outer scalars instead of the full tensor product.

Made-with: Cursor
* perf: gate debug diagnostics behind cfg(debug_assertions) and factor outer_weights

Made-with: Cursor

* perf: streamline recursive Hachi proving path

Keep recursive w witnesses in digit form to avoid rebuilding dense polynomials, and size setup and ring-switch work from exact runtime layouts to cut redundant work.

Made-with: Cursor

* fix: satisfy clippy on setup and ring-switch helpers

Address the current CI failures with minimal changes by allowing the internal layout helper's argument count and switching the fused m_evals_x loops to iterator-based indexing.

Made-with: Cursor
* perf: tighten and speed up norm sumcheck

Enforce the balanced digit range produced by decomposition and reduce round-zero norm sumcheck work with compact affine precomputation plus the centered balanced point-eval form.

Made-with: Cursor

* feat: parameterize recursive w basis and expand profile comparisons

Allow recursive w openings to use a different gadget basis from level 0 so we can explore decomposition and sumcheck tradeoffs directly. Add profile modes for comparing basis choices across the main dense and onehot workloads.

Made-with: Cursor

* perf: cache t rows and trim ring-switch witness overhead

Cache inner Ajtai t rows so A_row can reuse them directly and accumulate only the quotient high half instead of recomposing from t_hat on every block. Trim the ring-switch witness path by dropping the unused field w-table, reusing decomposition scratch, and reading the final w evaluation from the folded prover state.

Made-with: Cursor

* perf: skip padded x tails in fused sumcheck

Track the live x prefix from ring switch into the fused prover so x-rounds only accumulate and fold the physical witness region instead of explicit zero padding. Preserve the old semantics with round-by-round equivalence tests against the padded prover.

Made-with: Cursor

* test: bundle sumcheck test helper params for clippy

Collapse the test-only Hachi sumcheck prover helper arguments into a small params struct so clippy no longer rejects the PR on too-many-arguments.

Made-with: Cursor

* fix: allow no-default-features sumcheck lint path

Mark the parallel-only relation combiner as intentionally unused when the parallel feature is disabled so the CI clippy matrix stays green in both feature configurations.

Made-with: Cursor

* perf: specialize single-digit z_pre folds

Cache dense small-digit coefficients and add direct onehot and dense single-digit fold paths so quadratic-equation z_pre construction stops paying generic decomposition costs when the witness is already digit-sized.

Made-with: Cursor
* Add rayon support in Labrador

* Change labrador params and match with reference impl

* Impl Ajtai commitment scheme trait

* Add setup to Labrador prover

* Pass transcript to JL projection

* Fix the issue with JL matrix distribution

* Add benchmark for Labrador single level prover

* Update labrador single-level proof benchmark

* Refactor constraints in Labrador

* Add two level labrador prover benchmark

* Add docs for building next constraints functions

* Make Labrador benchmark more realistic based on Greyhound numbers

* Add NTT backend Ajtai commitment scheme

* Add tests to verify the verifier rejects malicious proofs

* Add more tracing info for level prover

* Use constants in tests/commitment

* Optimizing aggregation phase

* Fix recursive Labrador bug

* Integrate Greyhound and Hachi

* Integrate Labrador directly to Hachi

* Address CI issues

* Remove unused code

* Fix Labrador handoff binding and tail proof encoding

Bind Labrador tails to the carried Hachi commitment, harden verifier and JL metadata checks, and make Labrador-tail serialization and size accounting honest. Add regression coverage for spliced tails, malformed metadata, variable-D handoff selection, and proof-size accounting.

Made-with: Cursor

* Update hachi e2e test

* Use existing setup matrices

* perf: switch bounded Fp128 configs to D=256

Align the default and halving Fp128 presets around the 256-dimensional Labrador path so the baseline matches the supported challenge machinery. Increase the sparse challenge weight at the lower ring dimension to preserve the intended security margin.

Made-with: Cursor

* perf: speed up Labrador challenge sampling

Add a dedicated single-challenge fast path and reuse precomputed operator-norm tables for sparse challenges. This keeps the Fiat-Shamir distribution unchanged while removing repeated dense trigonometric work from the sampler hot path.

Made-with: Cursor

* perf: pack Labrador JL matrices and replay reduced statements

Store JL signs in a packed ternary layout and aggregate directly into ring-aligned phi blocks to cut the dominant collapse and projection bandwidth. Carry recursive Labrador state as reduced constraint plans so prover and verifier only materialize explicit sparse constraints when they are actually needed.

Made-with: Cursor

* chore: trace Labrador setup and commitment helpers

Label fold planning, setup derivation, and NTT commitment entry points so Perfetto traces attribute the remaining unlabeled setup and commit time to concrete Labrador phases.

Made-with: Cursor

* test: right-size Labrador e2e coverage

Keep the Labrador end-to-end checks aligned with the current D=256 configs while reducing the default test sizes and serializing the heavy cases so nextest stays stable in CI.

Made-with: Cursor

* test: align Labrador coverage with profile path

Use the standard onehot and full configs in the Labrador e2e checks, and benchmark the onehot prove path through OneHotPoly so the test and bench coverage matches the intended profile example behavior.

Made-with: Cursor

* style: format Labrador e2e imports

Apply rustfmt's import grouping for the updated Labrador e2e test so the CI format check matches the checked-in tree.

Made-with: Cursor

* fix: regenerate stale setup caches and clarify Labrador stream IDs

Avoid panicking on invalid cached setup files so local and CI runs can rebuild cleanly, and rename the deterministic challenge stream selector so CodeQL does not treat test vectors as hard-coded nonces.

Made-with: Cursor

---------

Co-authored-by: Quang Dao <quang.dao@layerzerolabs.org>
* Optimize aggregating jl projection functions

* Add lookup to JL aggregation

* Add more parallelism

* perf: cut Labrador verifier recursion rebuild costs

Split Labrador recursion state into prover and verifier setup shapes and compute reduced-plan verifier aggregation directly, so recursive verification stops rebuilding dense intermediate rows and unused NTT caches.

Made-with: Cursor

* perf: stream Labrador JL replay and tail checks

Replay JL rows from the accepted transcript seed and verify the tail round directly on decomposed payloads, so recursive verification avoids rebuilding dense JL matrices and recomposed witness side data.

Made-with: Cursor

* perf: cut Labrador recursion earlier and batch hot kernels

Prefer tail cutover as soon as it beats another standard fold, and batch the hottest aggregation, challenge replay, and linear-garbage kernels so large-nv Labrador stops dwarfing the Hachi path.

Made-with: Cursor

* perf: speed up Labrador aggregation and challenge replay

Exploit sparse Labrador coefficient structure and cheaper challenge bound checks to cut the remaining prover and verifier hotspots without changing transcript behavior.

Made-with: Cursor

* perf: accelerate Labrador JL replay and aggregation kernels

Reuse the in-memory JL collapse path on verifier replay, cut repeated JL scheduling overhead, and tighten dense ring accumulation so the remaining Labrador prover and verifier aggregation paths spend less time in repeated per-element work.

Authored by Cursor assistant (model: GPT-5.4) on behalf of Quang Dao.

Made-with: Cursor

* perf: tighten Labrador handoff accounting and profiling

Make profile runs fail fast outside --release and add the size diagnostics needed to compare direct and Labrador tails from real serialized cost. Reuse the handoff D-matrix NTT cache and compare recursive Labrador transitions against actual carried payload size so tail selection reflects what the proof will actually send.

Made-with: Cursor

* refactor: dedupe Labrador helper paths and quiet prover diagnostics

Share the repeated Labrador utility helpers in one place and move the prover's profiling prints onto structured tracing, so the review feedback is addressed without changing protocol behavior.

Made-with: Cursor

* fix: inline profile format args for clippy

Rewrite the remaining profile example format strings to use inline captures so the CI Clippy job passes again without changing the example's output.

Made-with: Cursor

* perf: cut allocation churn in folding helpers

Reuse flat output buffers in ring-switch and sumcheck prefix folding, and evaluate multilinears recursively over slices. This trims temporary Vec creation on hot prover paths without changing protocol behavior.

Made-with: Cursor

* refactor: hoist opening-point helpers and simplify profile example

Centralize basis and opening-point conversions so the profile example and protocol code reuse the same logic. Drop the setup-only profiling path so the example stays focused on end-to-end proving runs.

Made-with: Cursor

* fix: restore opening-point test helper imports

Keep the commitment-scheme tests compiling after hoisting opening-point helpers into their own module. Include the accompanying rustfmt cleanup in touched Rust call sites.

Made-with: Cursor

* refactor: cut over Labrador naming and wire labels

Replace the terse Labrador config and payload vocabulary with descriptive names across recursion, proofs, and transcript labels so the implementation is easier to follow and the wire format stays internally consistent. Guard the small-digit CRT/NTT fast path so deeper folds fall back safely once coefficients leave the lookup-table range.

Made-with: Cursor

* Refine Labrador handoff selection and tests

---------

Co-authored-by: Quang Dao <quang.dao@layerzerolabs.org>
* perf: add d64 partial-split NTT prototype

Isolate the q=2^128-5823 D=64 partial-split multiplication path, its packed cached-domain kernels, and a focused benchmark/test suite so it can be reviewed independently from the sumcheck work.

Made-with: Cursor

* fix: satisfy clippy in partial split benches

Clean up the benchmark and test scaffolding to avoid indexed iteration warnings and packed-width modulo warnings in CI.

Made-with: Cursor

* perf: tighten partial split NTT hot helpers

Inline the small hot wrappers, collapse duplicated scalar and packed helper kernels, and remove unused prototype-only APIs so the partial-split backend is leaner without changing behavior.

Made-with: Cursor

* perf: re-fuse single-product partial-split kernels

Restore direct-write split multiply kernels so single-product and packed batch workloads do not pay the zero-plus-accumulate cost introduced by the cleanup refactor.

Made-with: Cursor
* perf: split Hachi sumcheck into two stages

Separate the prefix-range pass from the fused relation scan so stage 2 can reuse the shared local w basis and avoid redundant work. This also completes the stage naming cutover and removes the obsolete standalone sumcheck modules.

Made-with: Cursor

* chore: fix doc placement on HachiLevelProof and use Prime128M8M4M1M0 in tests

Move the D-agnostic doc comment to HachiLevelProof where it belongs,
and replace Fp64<4294967197> with the named Prime128M8M4M1M0 alias in
ring_switch, hachi_stage1, and hachi_stage2 tests.

Made-with: Cursor

* fix: absorb s_claim into transcript before batching challenge + dedup trim_trailing_zeros

Absorb the prover-supplied s_claim into the Fiat-Shamir transcript
before sampling CHALLENGE_SUMCHECK_BATCH on both prover and verifier
sides. Without this, an adversary could choose among multiple valid
s_claim values after seeing the batching coefficient.
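
The ordering requirement can be illustrated with a toy transcript. This is a sketch only: `ToyTranscript` is hypothetical, it is built on the standard library's non-cryptographic `DefaultHasher`, and the labels are made up; it is not the crate's `Transcript` API and must not be used for real Fiat-Shamir.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Toy transcript: NOT cryptographically secure, for ordering illustration only.
struct ToyTranscript {
    state: DefaultHasher,
}

impl ToyTranscript {
    fn new() -> Self {
        Self { state: DefaultHasher::new() }
    }

    fn absorb(&mut self, label: &[u8], value: u64) {
        self.state.write(label);
        self.state.write_u64(value);
    }

    fn challenge(&mut self, label: &[u8]) -> u64 {
        self.state.write(label);
        let c = self.state.finish();
        self.state.write_u64(c); // chain the challenge back into the state
        c
    }
}

/// The prover-supplied claim is absorbed before the challenge is drawn, so
/// the challenge depends on it and cannot be seen before the claim is fixed.
fn batching_challenge(commitment: u64, s_claim: u64) -> u64 {
    let mut t = ToyTranscript::new();
    t.absorb(b"commitment", commitment);
    t.absorb(b"s_claim", s_claim);
    t.challenge(b"sumcheck_batch")
}
```

Prover and verifier must run the same absorb sequence; if either side skipped the `s_claim` absorb, the two would derive different challenges.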

Also extract the duplicated trim_trailing_zeros helper from
hachi_stage1 and hachi_stage2 into the parent sumcheck module.

Made-with: Cursor

* perf: apply split-eq e_in-inside/e_out-outside optimization to all prefix_x paths

Factor out the e_second multiplication from the inner loop in stage 1
and stage 2 prefix_x compute_round methods. Within each block of
consecutive pairs sharing the same j_high, accumulate contributions
weighted by e_first (e_in) only, then post-multiply the block result
by e_second (e_out) once. This eliminates one full field multiply per
pair per round in all prefix_x code paths.

Made-with: Cursor

* perf: optimize compact Hachi sumcheck folds

Use pair-fold lookup tables for compact stage-1 and stage-2 folds and absorb stage-2 batching into split-eq so the fused kernels do less repeated field work. Clarify the stage-2 relation docs to match the actual prover/verifier identity.

Made-with: Cursor

* perf: skip recoverable norm linear coefficients in Hachi sumcheck

Use split-eq claim recovery to omit norm-round linear q terms during accumulation while still reconstructing the full round polynomial when needed.
Track the prior norm claim in stage 2 and add split-eq recovery tests so the reduced-coefficient path stays equivalent to the full computation.

Made-with: Cursor

* perf: add bivariate-skip proofs for early Hachi sumcheck rounds

Build the first two stage-local bivariate-skip proofs directly, reconstruct the omitted round polynomials from compact algebraic state, and tighten the stage-2 prefix path so the skipped rounds stay cheap while the terminology matches the math.

Made-with: Cursor

* fix: keep full stage2 m table through sparse x rounds

Carry the full stage2 m multilinear table across sparse prefix-x folding so boundary pairs and quads still use the verifier's full relation data, and harden the prefix tests around nonzero tail entries so the compact prover path stays aligned with the padded reference.

Made-with: Cursor

* fix: count prefix fields in profile proof breakdown

Include both prefix option tag bytes and any serialized bivariate-skip payloads in the profile size accounting, and expose size/presence helpers on the staged proof payloads so the example can report those fields without reaching into private internals.

Made-with: Cursor

* test: clean up bivariate-skip reference helpers for CI clippy

Use assign-op and iterator forms in the two-round prefix reference helpers so the strict all-targets Clippy job stays green without changing the helper math.

Made-with: Cursor

* fix: keep sumcheck prefix prover-only

Bind the transcript only to canonical round messages and reject malformed proof shapes explicitly so verifier flow stays implementation-agnostic.

Made-with: Cursor
… estimator (#19)

* ci: add onehot nv32 benchmark reporting

Track onehot nv32 timing and RSS in CI with a sticky PR report so benchmark changes stay visible across commits without heavier profiling artifacts.

Made-with: Cursor

* ci: clarify onehot sparsity labels

Describe the nv32 benchmark and D=64 estimator as 1-of-256 one-hot so reviewers can read the sparsity assumptions directly from the check output and reports.

Made-with: Cursor

* docs/ci: add onehot analysis notes and harden benchmark reporting

Bundle the supporting one-hot and SIS analysis notes with the benchmark branch so the PR carries the rationale for the new parameter choices. Clean up the remaining benchmark-reporting edge cases so traces stay alive for the full run, partial baselines render correctly, and PR comment upserts fail softly instead of surfacing hidden job errors.

Made-with: Cursor

* docs: remove local-only analysis notes from branch

Keep the root analysis markdowns local-only so the benchmark PR only carries code and workflow changes. Preserve the local files via repo-local excludes instead of tracking them in git.

Made-with: Cursor

* ci: fix onehot timing fallback attribution

Attribute missing split timings to Hachi when the benchmark log only exposes total prove or verify time, so the report stays conservative instead of assigning the whole interval to Labrador.

Made-with: Cursor

* ci: compare onehot bench to main and previous run

Render the onehot benchmark report against both the main-branch split point and the previous successful PR update so regressions are visible against the branch base as well as the last iteration.

Made-with: Cursor
…20)

* Use scalar field randomness instead of ring randomness

* Use AggregationRandomness enum for two randomness cases

* Remove b computation from aggregation

* Make JL projection matrix generation thread-friendly

* Speed up computing h

* Fix clippy
…enge families (#21)

* feat: add D64 onehot scheduling infrastructure

* fix: add missing Cfg generic in disk-persistence tests and correct current_w_len on verifier paths

- commit.rs: supply TinyConfig to get_storage_path and load_expanded_setup
  in disk-persistence tests (fixes clippy/test CI)
- labrador_handoff.rs: derive current_w_len from w_layout instead of
  passing 0 in the legacy handoff verifier
- commitment_scheme.rs: derive initial current_w_len from the commitment
  layout (layout.num_blocks * layout.block_len * D) instead of raw
  1 << max_num_vars so prover and verifier always agree

Made-with: Cursor

* fix: prevent usize overflow in current_w_len for large max_num_vars

Use checked_shl or layout-derived values instead of raw 1usize << max_num_vars,
which panics when max_num_vars >= 64 (e.g. disk-persistence tests with TinyConfig).
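
A minimal sketch of the checked-shift pattern this fix describes. The helper name `table_len` is hypothetical and not taken from the crate; only `usize::checked_shl` is a real standard-library call.

```rust
/// Hypothetical helper: length of a `num_vars`-variate evaluation table,
/// or `None` when `2^num_vars` cannot be represented in `usize`.
fn table_len(num_vars: u32) -> Option<usize> {
    // `checked_shl` returns `None` when the shift amount is >= usize::BITS,
    // instead of panicking (debug builds) or masking the shift (release builds).
    1usize.checked_shl(num_vars)
}
```

A caller can then turn `None` into a setup error, or fall back to a layout-derived length as the commit suggests.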

Made-with: Cursor

* refactor: cut dense commitments over to D=128

Make the runtime scheduling and sparse-challenge redesign land on a single dense Fp128 profile so full and log-basis commitments no longer depend on a legacy D=256 halving path. Keep generic D=256 NTT plumbing available while updating proofs, tests, scripts, examples, and benchmarks to reflect the new D=128 and D=64 defaults.

Made-with: Cursor

* fix: align recursive layouts with sound basis-2 checks

Derive recursive witness layouts from the active level parameters so recursive openings, ring-switching, and Labrador handoff stay aligned after the D=128 cutover. Replace the basis-2 combined path with a direct W-only degree-5 sumcheck, remove the virtual S claim, and add end-to-end tamper coverage.

Made-with: Cursor

* fix: stabilize recursive onehot folding for D=64

Handle +/-2 sparse challenges correctly in the recursive z_pre path, restore the arm64 NEON fast path for small magnitudes, and cover the two-round-prefix edge cases. Clean up the temporary debug instrumentation and align the profiling and estimator tooling with the updated proof-size accounting.

Made-with: Cursor

* fix: apply rustfmt for CI

Normalize the touched Rust files to match the repository formatter so the PR checks run cleanly on GitHub Actions.

Made-with: Cursor

* feat: add adaptive fold-basis scheduling

Use a deterministic public-input schedule so setup, proving, verification, and cache reuse stay aligned across onehot, log, and full configs. Widen the digit LUT path through basis 5 and add mixed-basis regressions so adaptive schedules stay sound.

Made-with: Cursor

* fix: stabilize direct tail packing and drop dead config

Widen direct-tail packing so adaptive schedules do not panic when terminal witness digits exceed the planned basis, and remove the unused rank-2 bounded config to keep the commitment surface minimal.

Made-with: Cursor

* fix: align planner witness sizing with runtime recursion

Use the exact half-field bound so adaptive planning derives the same recursive witness sizes as runtime, and add sparse challenge sampling tracing to make these paths easier to diagnose.

Made-with: Cursor

* fix: reduce D64 recursion overhead

Shrink stage-1 compact tables, avoid redundant recursive hint reconstruction, and realign D-dependent challenge sizing so the lowered ring dimensions actually pay off in memory and prover work.

Made-with: Cursor

* perf: block-parallel mat_vec_mul_ntt_digits_i8 (12x speedup)

When n_a <= 2 and num_blocks >= 16, parallelize over blocks instead
of column tiles. The old tiling created only 5 tiles for Rayon while
the new path gives num_blocks-way parallelism (256 for onehot nv32).

commit_w level 0: 273ms → 23ms on onehot nv32.

Made-with: Cursor

* perf: position-parallel sparse onehot accumulation with precomputed rotation table

Replace per-block fold-reduce with per-thread chunked accumulation and
a dense rotation table (16 KB for D=64, fits in L1). Each entry becomes
a branchless vector addition instead of scatter-based random access.

Made-with: Cursor

* perf: parallelize balanced decomposition in decompose_w_hat

Made-with: Cursor

* fix: guard binomial_u64 against subtraction overflow when n < k
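
A sketch of the kind of guard the title refers to. The function below is a hypothetical stand-in, not the crate's `binomial_u64`; returning 0 for `n < k` is an assumption about the intended behaviour.

```rust
/// Hypothetical stand-in: C(n, k), defined as 0 when k > n.
/// No overflow handling for large inputs; illustration only.
fn binomial(n: u64, k: u64) -> u64 {
    // Guard first: without it, `n - k` below underflows when n < k.
    if k > n {
        return 0;
    }
    let k = k.min(n - k); // use the smaller of k and n - k
    let mut acc: u64 = 1;
    for i in 0..k {
        // After this step acc == C(n, i + 1); the division is exact.
        acc = acc * (n - i) / (i + 1);
    }
    acc
}
```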

Made-with: Cursor

* perf: optimize high-half quotient with loop trimming and parallel accumulation

Trim add_sparse_ring_product_high_half to skip zero-contribution
iterations (degree < D), parallelize A-row and challenge-fold quotient
accumulation via cfg_fold_reduce, and extract parallel_high_half_accumulate
helper.

Made-with: Cursor

* perf: parallelize z/r balanced decomposition in build_w_coeffs

Made-with: Cursor

* perf: column-sweep Ajtai commit for onehot — 2.2-2.5x at nv32, ~2x at nv36

Replace block-by-block inner_ajtai_onehot_wide (where each block
independently reads and widens A columns from L3) with a two-level
tiled column-sweep that reads each A column exactly once per tile.

Outer level: Rayon threads partition blocks evenly.
Inner level: blocks processed in L2-sized tiles (~1024 blocks, 2MB
accumulators). Entries bucketed by A-column, then swept sequentially
so each column is widened once and scattered into all referencing
block accumulators.

Falls back to the original block-by-block path when blocks_per_thread
is small (≤128), where the bucketing overhead exceeds its benefit.

Also includes position-partitioned BalancedDigitPoly::decompose_fold
and updated OPTIMIZATION_REPORT.md.

Made-with: Cursor

* perf: concurrent NTT rows, parallel quotient fold, batched challenge absorb

- Run D/B/A NTT row computations concurrently via rayon::join in
  compute_r_split_eq, overlapping independent matrix-vector products.
- Replace sequential challenge_fold_row and A_row accumulation with
  parallel cfg_fold_reduce over blocks (8.9x speedup each).
- Batch the four per-challenge append_bytes calls into one in
  sample_one, reducing hash update overhead for sample_sparse_challenges.
- Update OPTIMIZATION_REPORT.md with round 3-6 results.

Made-with: Cursor

* perf: replace per-challenge hash chain with seed-then-SHAKE256-expand

Derive a single 32-byte PRG seed from the transcript and expand all
challenge randomness via SHAKE256 XOF, replacing ~20K sequential
Blake2b512 chain operations with 1 chain + fast XOF squeeze.
2x speedup for sample_sparse_challenges (6.5ms → 3.2ms at 4096 challenges).

Made-with: Cursor

* perf: simplify norm checks and sparse challenge sampling

Always use the two-stage norm-check flow so every proof shares one layout, one verifier path, and a smaller code surface. Stream sparse Fiat-Shamir challenge expansion from SHAKE and tighten i8 decomposition bounds while pruning the obsolete combined-sumcheck code and stale optimization report.

Made-with: Cursor

* fix: align planner proof sizing with two-stage norm checks

Always model Hachi levels as stage1 plus stage2 so proof-size estimates match the serialized proof layout at every basis, including b=4. Add regressions for per-level byte estimation and direct-tail proof sizes to keep the planner in sync with runtime proofs.

Made-with: Cursor

* perf: inline level norm checks and gate compact onehot layout

Flatten stage-1 and stage-2 data directly onto `HachiLevelProof` and only switch onehot witnesses to the compact regular layout once the large-profile cache savings outweigh the nv32 costs. This keeps small witnesses on the legacy sparse path while preserving the nv36 performance win.

Posted by Cursor assistant (model: GPT-5.4) on behalf of the user (Quang Dao) with approval.

Made-with: Cursor

* Reduce D to 64: commitment schedule, ring switch, linear utils, poly ops

- Update commitment/commit, config, schedule, and linear utilities for d=64
- Adjust ring_switch, quadratic_equation, hachi_poly_ops
- Tweak examples/profile harness

Made-with: Cursor
* Remove Labrador implementation

* fix: remove stale profile tail tag accounting

---------

Co-authored-by: Quang Dao <quang.dao@layerzerolabs.org>
…28-bit field (#22)

* perf: use two-round prefix path for b=4 norm sumcheck

Specialize the b=4 skip proof to a smaller quadratic prefix grid so stage 1 can fuse its first two rounds like b=8 without changing the round polynomials. Extend the stage-1 regression coverage to keep the fused path aligned with the padded reference flow.

Made-with: Cursor

* perf: add b=4-specific LUT tables for stage 2 prefix and fused kernels

Stage 2 two-round prefix and fused compact-to-round2 kernels were
reusing b=8 LUT tables (4096 entries) even for b=4.  Add b=4-specific
tables (256 entries, 16x smaller) and dispatch on `b` at runtime.
Same treatment for stage 1's fused kernel (16 vs 256 entries).

Also mark all hot-path digit and lookup-index functions as
`#[inline(always)]` for consistency.

Made-with: Cursor

* fix: consistent polynomial representation between prefix and dense paths

Remove trailing-zero trimming from `finish_gruen_round_poly_from_q_coeffs`
and `coeff_array_to_poly` so both the two-round-prefix path and the dense
sumcheck path produce polynomials with the same number of coefficients
(degree_q + 2). Fixes the b=4 `stage1_round0_matches_dense_reference`
test failure and formatting issues.

Made-with: Cursor

* perf: optimize verifier hot paths (sparse challenges, m_evals_x, multilinear_eval)

- Buffer XOF reads (4 KB buffer), use tiered byte-width rejection
  sampling, and batch sign draws (8 per byte) in sparse challenge
  sampling (~4x speedup)
- Pre-scale alpha_pows by eq_tau1 weights and precompute block scalars
  in compute_m_evals_x to eliminate redundant per-column multiplies
- Add parallel multilinear_eval path (eq-table + par dot-product) for
  large tables (>2^14 entries)
- Move compute_m_evals_x into ring_switch_verifier; remove the separate
  compute_m_eval_at_point and verify_sumcheck_rounds_only functions
- Simplify HachiStage2Verifier: store m_evals_x directly instead of
  Stage2MOracle indirection; unify is_last/non-last verify paths
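
To illustrate the "8 sign draws per byte" idea from the first bullet above, here is a small sketch. The helper name and the bit-to-sign convention (0 maps to +1, 1 maps to -1, least-significant bit first) are assumptions, not the crate's actual encoding.

```rust
/// Hypothetical helper: expand one random byte into eight +/-1 signs.
/// Bit i of `byte` decides sign i: 0 -> +1, 1 -> -1.
fn signs_from_byte(byte: u8) -> [i8; 8] {
    let mut out = [0i8; 8];
    for (i, s) in out.iter_mut().enumerate() {
        let bit = (byte >> i) & 1;
        *s = 1 - 2 * (bit as i8);
    }
    out
}
```

One XOF byte therefore feeds eight challenge coefficients, rather than spending a full byte per sign.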

Made-with: Cursor

* refactor: split hachi_poly_ops/mod.rs into focused submodules

The 2041-line monolith is now:
- mod.rs: trait, shared types, re-exports, tests (~505 lines)
- dense.rs: DensePoly + HachiPolyOps impl (~329 lines)
- onehot.rs: OneHotIndex, OneHotPoly + HachiPolyOps impl (~604 lines)
- balanced_digit.rs: BalancedDigitPoly + HachiPolyOps impl (~238 lines)
- helpers.rs: decomposition, sparse mul-acc, accumulation internals (~440 lines)
- decompose_fold_neon.rs: unchanged NEON kernel (~165 lines)

No behavioral changes. All docstrings updated for the new layout.

Made-with: Cursor

* cleanup: remove dead code, fix unimplemented!(), deduplicate helpers

- Remove unused centered_abs, ring_inf_norm, vec_inf_norm from norm.rs
- Remove redundant #[allow(dead_code)] on add_ntt_into (function is used)
- Replace duplicate flatten_w_hat with existing flatten_i8_blocks
- Implement protocol_name() → b"Hachi" instead of unimplemented!()
- Remove commented-out ring-dimension check in prove path
- Remove duplicate #[allow(clippy::too_many_arguments)] annotation

Made-with: Cursor

* refactor: hoist algebra types out of protocol/sumcheck into algebra/

Move pure algebraic constructs from protocol/sumcheck/ to algebra/:
- EqPolynomial → algebra/eq_poly.rs (fixes backwards dep: algebra → protocol)
- GruenSplitEq → algebra/split_eq.rs
- UniPoly, CompressedUniPoly → algebra/uni_poly.rs
- trim_trailing_zeros → algebra/poly.rs

SumcheckProof stays in protocol/sumcheck/types.rs (uses Transcript).
All re-exports preserved for downstream compatibility.

Made-with: Cursor

* fix: resolve clippy (no-default-features) and rustdoc CI failures

Gate `add_ntt_into` and its neon helpers behind `#[cfg(feature =
"parallel")]` since they are only used in the reduce closure of
`cfg_fold_reduce!`, which is elided without rayon.  Replace intra-doc
links to private items with plain backtick references.

Made-with: Cursor

* chore: sort imports alphabetically and remove stray blank line

Made-with: Cursor

* refactor: split recursive witness runtime from root poly ops

Move recursive folding levels onto a flat digit witness so later rounds stop pretending to be caller-provided polynomials. This keeps `HachiPolyOps` root-only and cuts the recursive prover over to the dedicated witness view.

Made-with: Cursor

* perf: extend stage1 compact coefficient LUTs to b=16

Keeping b=16 on the compact lookup path avoids the dense coefficient fallback in stage-1 norm sumcheck. Add regression coverage so b=32 stays on the existing fallback until we optimize it separately.

Made-with: Cursor

* refactor: cut verifier over to proof-native recursive state

Carry recursive prover and verifier state through proof ring vectors and packed witnesses so level transitions stop rebuilding commitment-specific structures.

Made-with: Cursor

* perf: add b=32 stage1 field coefficient LUT

Precompute stage1 affine coefficients as field elements for b=32 so the compact round kernels can reuse them instead of rebuilding them per pair. This keeps the large-basis optimization isolated to the retained stage1 path.

Made-with: Cursor

* chore: deduplicate helpers, remove dead code, fix doc CI

- Deduplicate: try_centered_i8, absorb_len_prefixed, pow2_field,
  reduce_signed_accum, linear_eq_eval, stage1/stage2 digit helpers
- Remove dead: ring_switch_prover, expand_m_a, verify_single_level,
  build_next_constraints, and 9 unused public functions
- Inline trivial compute_v wrapper, refactor compute_z_pre pair
  into shared validate_decompose_fold
- Fix doc CI: replace intra-doc links to private items with backticks
- Add debug_assert for num_digits==1 in partitioned accumulation

Made-with: Cursor

* perf: unify A/B/D matrices into single shared-prefix matrix and NTT cache

Derive one max-sized public matrix with a shared label instead of three
role-specific matrices. This cuts setup NTT conversion work by ~3x in
production configs and halves memory. Runtime mat-vec performance is
unchanged: column bounds are driven by input vector length, and a
prerequisite inner_width clamp in mat_vec_mul_i8_with_params prevents
empty-tile dispatch for wider caches.

Security justification: SHARED_PREFIX_BINDING.md (every SIS extraction
targets a single role, so the marginal distribution is identical).

Made-with: Cursor

* fix: return error instead of panicking for unsupported ring dimensions in sparse challenge sampling

Stack-buffer sampling functions used debug_assert! guards that were
stripped in release builds. Add a fallible D > 128 check at the public
API boundary so the verifier returns Err rather than panicking on
out-of-bounds access. Also add heap-backed _general variants (unused)
for future large-D support.
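
A sketch of the difference between a `debug_assert!` guard and a fallible boundary check. All names here (`MAX_STACK_D`, `SampleError`, `first_positions`) are hypothetical, and the body is a placeholder for the real sampling logic.

```rust
/// Assumed capacity of the fixed-size stack buffer.
const MAX_STACK_D: usize = 128;

#[derive(Debug, PartialEq)]
enum SampleError {
    UnsupportedRingDim(usize),
}

/// Hypothetical sampler: fills a stack buffer of `d` slots and returns
/// the first `weight` of them.
fn first_positions(d: usize, weight: usize) -> Result<Vec<usize>, SampleError> {
    // A `debug_assert!(d <= MAX_STACK_D)` would be compiled out in release
    // builds, so the bound is checked explicitly and reported as an error.
    if d > MAX_STACK_D {
        return Err(SampleError::UnsupportedRingDim(d));
    }
    let mut buf = [0usize; MAX_STACK_D]; // fixed-size stack buffer
    for (i, slot) in buf[..d].iter_mut().enumerate() {
        *slot = i;
    }
    Ok(buf[..weight.min(d)].to_vec())
}
```

Without the explicit check, `buf[..d]` would panic for `d > 128`, which is the failure mode the commit moves behind an `Err`.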

Made-with: Cursor

* refactor: consolidate 128-bit primes to Prime128Offset275 and Prime128Offset5823

Delete 7 unused 128-bit prime aliases (Prime128M13M4P0, Prime128M37P3P0,
Prime128M52M3P0, Prime128M54P4P0, Prime128M18M0, Prime128M54P0, P_159).
Rename Prime128M8M4M1M0 to Prime128Offset275. Switch the default 128-bit
field from 2^128-275 to 2^128-5823 (the prime enabling 64-ring / 32-split)
across Q128_MODULUS, POW2_OFFSET_128, HandoffField, all tests, benchmarks,
and examples.

Made-with: Cursor

* perf: pass role-specific row counts to NTT mat-vec, remove dead NttRowView

Replace compute-all-then-truncate pattern with row-bounded dispatch.
Mat-vec functions now accept a num_rows parameter and slice the NTT
cache upfront, so A/B/D roles only compute the rows they need.
Remove unused NttRowView type, neg_rows, and cyc_rows methods.

Made-with: Cursor

---------

Co-authored-by: Omid Bodaghi <42227752+omibo@users.noreply.github.com>
)

Extract the hardcoded `1 << 21` cache budget into a named
`L2_TILE_BUDGET` constant with documentation explaining the 2 MB
choice and a TODO for future arch-specific benchmarking.

Two minor perf improvements in both `onehot_column_sweep_ajtai_regular`
and `onehot_column_sweep_ajtai`:

- Replace wasteful `vec![vec![CyclotomicRing::zero(); n_a]; my_count]`
  pre-allocation with `Vec::new()` per slot, since every entry is
  overwritten by the tile loop.

- Hoist `col_entries` outside the tile loop and `.clear()` between
  tiles so Vec capacities carry over, avoiding repeated heap growth.
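
The hoist-and-clear pattern from the second bullet can be sketched as follows. `tile_sums` and its doubling step are placeholders chosen for illustration; only the `Vec::clear` behaviour (contents dropped, capacity kept) is the point.

```rust
/// Hypothetical example: process `data` in tiles, reusing one scratch Vec.
/// `tile_len` must be non-zero (`chunks` panics on 0).
fn tile_sums(data: &[u32], tile_len: usize) -> Vec<u32> {
    let mut scratch: Vec<u32> = Vec::new(); // hoisted out of the tile loop
    let mut sums = Vec::new();
    for tile in data.chunks(tile_len) {
        scratch.clear(); // drops contents but keeps the allocation
        scratch.extend(tile.iter().map(|x| x * 2));
        sums.push(scratch.iter().sum());
    }
    sums
}
```

After the first tile, later tiles of the same size reuse the existing buffer instead of growing a fresh one.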

Made-with: Cursor
* chore: untrack stale design notes (moving to shared notes folder)

Remove CONSTANT_TIME_NOTES.md, HACHI_PROGRESS.md, and
NTT_PRIME_ANALYSIS.md from version control. These are being
consolidated into the central ~/Documents/Notes/ folder.

Made-with: Cursor

* chore: remove stale CHANGELOG.md placeholder

Made-with: Cursor

* chore: update AGENTS.md crate structure and clean up .gitignore

- Add missing protocol modules (quadratic_equation, recursive_runtime)
  and scripts/ to AGENTS.md crate structure listing
- Fix algebra description (domains → polynomial utilities)
- Remove stale PUBLISH_CHECKLIST.md entry from .gitignore
- Remove empty tests/.gitkeep (test files exist)

Made-with: Cursor
…aster) (#28)

- Add `derive_public_matrix_flat` that generates directly into FlatMatrix
  with entry-level parallelism (rows×cols rayon tasks) and zero-copy
  transmute, replacing the sequential derive + flatten pipeline
- Add `cfg_join!` macro and use it to run negacyclic/cyclic NTT
  conversions concurrently in `build_ntt_slot`
- Add `FlatMatrix::from_flat_data` constructor for pre-flattened storage

Onehot nv=32 setup: 780ms → 291ms (2.7x)
  - Matrix derivation: ~361ms → 71ms (5.1x)
  - NTT cache build: 419ms → 220ms (1.9x)

Made-with: Cursor
Expand the partial-split stage roots into per-position twiddle tables so the butterflies load twiddles directly instead of carrying a serial recurrence. This makes the D64 split path and packed inverse layout more SIMD-friendly and improves leopard x86 benchmarks.

Made-with: Cursor
* Port planner from Python code

* Fix cursor review

* Improve docs for sis_security.rs

* Address AI reviews

* Fix missing B matrix commitment bytes in root level of universal planner

run_universal_planner omitted ring_vec_bytes(root_nb, root_cfg.d) from
both the root level's total cost and its level_bytes field. Every
non-root level in best_from correctly includes this as entry_commit, but
the root level only accounted for the prefix (w_hat + D matrix +
sumcheck + evals), silently under-counting proof size.

The bug affects any root config where nb >= 1 (all of them), with larger
impact for D=32/D=16 roots that can require nb > 1 for SIS security.

Corrected proof sizes (bytes):
  onehot nv=32: 50,418 -> 51,442  (+1,024)
  full   nv=32: 52,866 -> 54,402  (+1,536)
  full   nv=25: 49,842 -> 50,866  (+1,024)
  onehot nv=44: 56,656 -> 58,704  (+2,048)

Made-with: Cursor

* Simplify digit decomposition: remove r_decomp_levels, tighten assertions

- Remove `r_decomp_levels` wrapper; call `compute_num_digits(128, lb)`
  directly everywhere, since the defensive half_field_bound re-check was
  redundant (compute_num_digits already covers 2^(field_bits-1) - 1).
- Drop `half_field_bound` from `PlannerOptions`, `LevelWitnessArgs`, and
  the per-modulus constants (`HALF_FIELD_BOUND_P275`, `HALF_FIELD_BOUND_P5823`).
- Replace unreachable fallback branches in `compute_num_digits` and
  `compute_num_digits_fold` with assertions (`log_bound <= 128`,
  `challenge_l1_mass > 0`, `shift < 127`).
- Correct balanced-digit doc comments to `[-b/2, b/2 - 1]` (asymmetric).
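
For reference, a small sketch of a balanced decomposition with the asymmetric digit range `[-b/2, b/2 - 1]` mentioned in the last bullet. The function is hypothetical (it is not the crate's decomposition routine) and works on `i64` rather than field elements.

```rust
/// Hypothetical sketch: little-endian balanced digits of `x` in base
/// b = 2^log_basis, each digit in [-b/2, b/2 - 1]. Returns `None` if `x`
/// does not fit in `num_digits` digits. Assumes 1 <= log_basis <= 62.
fn balanced_digits(x: i64, log_basis: u32, num_digits: usize) -> Option<Vec<i64>> {
    let b = 1i64 << log_basis;
    let half = b / 2;
    let mut rem = x;
    let mut digits = Vec::with_capacity(num_digits);
    for _ in 0..num_digits {
        let r = rem.rem_euclid(b); // residue in [0, b)
        let d = if r >= half { r - b } else { r }; // shift to [-b/2, b/2 - 1]
        digits.push(d);
        rem = (rem - d) / b; // exact: rem - d is a multiple of b
    }
    if rem == 0 { Some(digits) } else { None }
}
```

With `b = 8`, the value 4 needs two digits (`[-4, 1]`) while -4 fits in one, which is the asymmetry the corrected doc comments describe.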

Made-with: Cursor

* Fix baseline validation, remove header wrapper, add (m,r) search

Bug fixes:
- Restore baseline to match Rust codebase by using the existing
  optimal_m_r_split formula (delta_open + n_a*delta_commit). The
  corrected formula (1+n_a)*delta_open is used only in the optimized
  planner since the Rust code hasn't been updated yet.
- Fix baseline tail_bytes to use baseline_packed_digits_bytes (was
  incorrectly using the header-stripped version).
- Remove +4 wrapper from optimized planner total: header stripping
  removes the u32 num_levels prefix.

Enhancements:
- Enumerate (m,r) splits at the root level (+-4 around local optimum)
  to find better global schedules. Recursive levels still use the
  corrected optimal_m_r_split heuristic for speed.
- Restructure compute_level_witness to accept explicit (m,r) via
  WitnessArgs struct instead of calling optimal_m_r_split internally.
- Derive Clone for PlannerOptions, fix repository URL in Cargo.toml.

Baselines: 99,805 / 166,613 / 173,197 (match Rust profiler).
Optimized onehot nv=32: 51,438 B (48.5% reduction).

Made-with: Cursor

* Fix delta_commit bug: pass prev_lb instead of lb for recursive levels

The `best_from` function was passing `lb` as `log_commit_bound` to
`try_level` at recursive levels, causing `delta_commit` to always be 1
regardless of the previous level's base. When prev_lb > lb (e.g.,
lb=6 to lb=3), `compute_num_digits(prev_lb, lb)` should yield 3, not 1.

This fix correctly prices lb-decreasing transitions, causing the planner
to avoid them (they are expensive). Optimal schedules now have
monotonically non-decreasing lb sequences.

Also fixes doc typo: digit range is [-b/2, b/2-1] (asymmetric), not
[-(b/2-1), b/2-1] (symmetric).
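
One counting rule that reproduces the numbers quoted above (3 digits for `prev_lb = 6, lb = 3`, and 1 digit when both are 3) is sketched below. This is an assumption made for illustration: the function name is hypothetical and the crate's `compute_num_digits` may be defined differently.

```rust
/// Hypothetical sketch: smallest number of balanced base-2^log_basis digits
/// (range [-b/2, b/2 - 1]) whose positive reach covers 2^(log_bound - 1) - 1.
/// Restricted to small parameters to keep the u128 arithmetic overflow-free.
fn num_balanced_digits(log_bound: u32, log_basis: u32) -> u32 {
    assert!((1..=64).contains(&log_bound));
    assert!((2..=32).contains(&log_basis));
    let b = 1u128 << log_basis;
    let target = (1u128 << (log_bound - 1)) - 1;
    let max_digit = b / 2 - 1; // largest positive digit
    let mut reach = 0u128; // largest positive value with k digits
    let mut weight = 1u128; // b^k
    let mut k = 0u32;
    loop {
        reach += max_digit * weight;
        k += 1;
        if reach >= target {
            return k;
        }
        weight *= b;
    }
}
```

Under this rule, passing `lb` in place of `prev_lb` always gives 1, so a transition to a smaller basis is priced as if it were free.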

Made-with: Cursor

* Fix wrong delta in optimal_m_r_split n_a term

Both config.rs and baseline_optimal_m_r_split used delta_commit for the
n_a multiplier in the per-block opening cost, but the witness construction
(w_hat and t_hat) both use delta_open. This mismatch caused suboptimal
(m, r) splits when delta_open != delta_commit (onehot, and recursive
levels where log_commit_bound < 128).

Also deduplicates baseline_optimal_m_r_split as a thin wrapper around
optimal_m_r_split with num_ring=0.

Made-with: Cursor

---------

Co-authored-by: Quang Dao <quang.dao@layerzerolabs.org>
* Strip serialization headers from proof wire format

Redesign HachiDeserialize with an associated Context type so proof
types can be deserialized without embedded length prefixes. All headers
(u64 Vec counts, u32 num_levels, u8 bits_per_elem, etc.) are removed
from the proof byte stream; the verifier recovers shape information
from the public schedule via HachiProofShape / LevelProofShape.

Key changes:
- HachiDeserialize gains `type Context` — `()` for self-describing
  types, schedule-derived shapes for proof types.
- CompressedUniPoly, SumcheckProof, ProofRingVec, PackedDigits,
  HachiLevelProof, HachiProof all serialize bare (no length prefixes).
- RingSliceSerializer drops u64 count prefix; RingCommitment and
  QuadraticEquation prover paths updated to use RingSliceSerializer
  for transcript consistency.
- Schedule byte accounting updated to match stripped format.
- HachiSchedulePlan::to_proof_shape() produces the context needed
  for proof deserialization.
- FieldCore supertrait tightened to HachiDeserialize<Context = ()>.

This is a protocol-breaking change: Fiat-Shamir transcripts now
absorb headerless data, so proofs from the old format will not verify.
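
A simplified sketch of deserialization driven by an associated `Context` type. The trait and types below are hypothetical stand-ins, not the crate's `HachiDeserialize`; they only show how a length can come from caller-supplied shape information instead of a prefix in the byte stream.

```rust
/// Hypothetical trait: `Context` carries whatever the byte stream omits.
trait DeserializeWithContext: Sized {
    type Context;
    /// Returns the parsed value and the unread remainder of `bytes`.
    fn deserialize<'a>(bytes: &'a [u8], ctx: &Self::Context) -> Option<(Self, &'a [u8])>;
}

/// Self-describing type: needs no context.
impl DeserializeWithContext for u16 {
    type Context = ();
    fn deserialize<'a>(bytes: &'a [u8], _ctx: &()) -> Option<(Self, &'a [u8])> {
        if bytes.len() < 2 {
            return None;
        }
        let (head, rest) = bytes.split_at(2);
        Some((u16::from_le_bytes([head[0], head[1]]), rest))
    }
}

/// Headerless vector: the element count is not stored in the bytes.
#[derive(Debug, PartialEq)]
struct BareVec(Vec<u16>);

impl DeserializeWithContext for BareVec {
    type Context = usize; // expected length, taken from the public shape
    fn deserialize<'a>(mut bytes: &'a [u8], len: &usize) -> Option<(Self, &'a [u8])> {
        let mut out = Vec::new();
        for _ in 0..*len {
            let (x, rest) = <u16 as DeserializeWithContext>::deserialize(bytes, &())?;
            out.push(x);
            bytes = rest;
        }
        Some((BareVec(out), bytes))
    }
}
```

Because the count never appears on the wire, the same bytes parse differently under a different context, which is why both sides must derive the shape from the same public schedule.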

Made-with: Cursor

* Fix missing ctx argument in disk-persistence feature gate

The deserialize_compressed call in load_expanded_setup was missing the
&() context argument, only exposed under --all-features.

Made-with: Cursor

* chore: retrigger CI to pick up CodeQL default setup

Made-with: Cursor
cmd_validate hardcoded stale expected values (from before the
delta_commit → delta_open formula fix) that diverged from the
baseline.rs unit tests, causing `--validate` to always fail.

Extract a single BASELINE_CASES constant and baseline_params_for
helper in baseline.rs, used by both the tests and cmd_validate.
Add a "Planner validation" CI step so mismatches are caught on PRs.

Made-with: Cursor
quangvdao and others added 16 commits March 31, 2026 18:26
* feat: finish column-major tight z_pre cutover

* fix: honor active row count in recursive w commits

* fix: restore recursive commitment performance

* refactor: remove dead recursive layout helper

* refactor: make block order explicit

* fix: make recursive split planner 32-bit safe
* Correct planner A-role SIS bounds

* Run rustfmt on planner security changes

* Clarify A-role SIS collision helper
* Add batched commitment to Hachi

* Add batched prove/verification

* Optimize prover/verifier in batched mode

* Fix CI

* Address AI-reviews

* More ci fixes

* Fix issue with early prover stop

* Address cursor review

* Batch polys with detached commitments

* Add e2e tests for commitment scheme

* fix: resolve post-merge issues from PR #31 header-stripping

- Batched prover transcript: use RingSliceSerializer for ABSORB_PROVER_V
  (auto-merge missed this new call site, causing Fiat-Shamir mismatch)
- Add HachiProof::shape() for tests that lack a planner
- Fix single_poly_e2e deserialization to pass shape context
- Update batched_onehot_4x30 threshold for stripped-header byte costs

Made-with: Cursor

* Support multi-point batching in Hachi

* Unified batch commit functions

* Unified batch prove/verify functions

* Fix recursive onehot layout planning

Keep runtime recursive log-basis transitions aligned with the planner and setup sizing so single and batched onehot proofs use the intended layouts. Restore the open-digit witness cost model and clean up the CI clippy regressions from the batching refactor.

Made-with: Cursor

* Fix batched commit benchmark layout mismatch

Use hachi_batched_root_layout for the batch path so the layout's
(m_vars, r_vars) split matches what setup_prover computes internally,
and pass the layout into make_onehot_poly instead of deriving it from
num_vars (matching the pattern in onehot_batched_opening.rs).

Made-with: Cursor

* Remove layout from CommitmentScheme API; derive internally from setup

Remove HachiCommitmentLayout parameter from commit, prove, batched_prove,
verify, and batched_verify. Replace layout field in HachiSetupSeed with
max_inner_width, max_outer_width, max_d_matrix_width. Layout is now
derived internally via hachi_batched_root_layout(num_vars,
max_num_batched_polys), keeping the batch-optimized m_vars/r_vars split
that avoids the 3x regression on batched prove/verify.

Made-with: Cursor

* Remove unused setup_from_existing helper

That path is no longer used after moving layout derivation to runtime inputs, so removing it avoids maintaining dead setup-extension logic.

Made-with: Cursor

* Check that opening points have the same length

* Format commit.rs

* FIX CI issue

* Support mixed dense and one-hot multilinear batches

Expose a single public wrapper so batched commitments can combine dense and one-hot polynomials under one shared config without extra call-site branching.

Made-with: Cursor

* Fix scan_layout_chain passing max_num_batched_polys as num_points

The layout optimization in root_batched_layout (via optimal_root_batch_split)
hardcodes num_point_sets=1 for same-point batching, but scan_layout_chain
was passing max_num_batched_polys as both num_claims and num_points to
w_ring_element_count_with_num_claims_and_points. This inflated z_pre by
the batch size, producing an oversized matrix for the first recursive level.

Use w_ring_element_count_with_num_claims (which sets num_point_sets=1)
to match the layout's same-point assumption and the actual prover code.

Made-with: Cursor

* Fix hachi_batched_root_layout returning batched num_digits_fold in per-poly layout

The function claimed to return a per-polynomial layout but leaked the
batch-level num_digits_fold through an unnecessary scale/unscale
roundtrip via scale_batched_root_layout. Replace with direct
construction using compute_num_digits_fold (num_claims=1 equivalent)
and add a regression assertion in the matching unit test.

Made-with: Cursor

* Fix multipoint setup sizing and trim batched root bloat

* Align ring commitment layout with batched setup

* Unify batched root layout and recursion helpers

* Fix clippy warnings: simplify comparison and extract type alias

Made-with: Cursor

* Fix commit_onehot using singleton layout instead of batched layout

commit_onehot called Cfg::commitment_layout() which always returns the
singleton layout, while commit_ring_blocks and commit_coeffs use
Self::layout() which returns the batched root layout. When
max_num_batched_polys > 1, these produce different m_vars/r_vars splits,
causing incompatible block structure.

Made-with: Cursor

---------

Co-authored-by: Quang Dao <quang.dao@layerzerolabs.org>
* feat: add d16 d32 prime275 profile path

* feat: add d16 d32 prime275 profile path

* fix: enforce config field pairing

* fix: enforce config field pairing

* fix: align profile compare configs with prime275

* fix: align profile compare configs with prime275

* refactor: polish field-coupled commitment presets

* Remove D16 commitment plumbing

* Fix CRT NTT dispatch test coverage

* Remove unused setup reuse helper

* Add dynamic root-ring Hachi scheme scaffold

* Add root batch summary schedule scaffolding

* Add canonical root runtime-plan scaffold

* Make fmt clippy and tests green

* Trim aggregated batched test matrix

* Trim grouped batched test matrix

* Canonicalize root runtime schedule plan

* Use generated tables for fp128 schedules

* Select dynamic root D at commit time

* Optimize dynamic singleton root selection

* Fix clippy lint in dynamic batching test

* Restore singleton onehot profile all-mode

* Cut over fp128 defaults to dynamic schemes

* Audit D128 preset security floor

* Fix D128 adaptive audit bounds

* Switch partial split path to q128-159

* Fix static d128 audited root ranks

* Drop q128-5823 and fix CI

* Widen SIS width API to u64

* Lazy dynamic root setup materialization

* Optimize D32 block mat-vec paths

* Make flat digits canonical across Hachi proofs

* Optimize dense D32 commit and fold kernels

* Speed up D32 digit decomposition paths

* Speed up batched onehot column sweep

* Speed up dense D32 decompose-fold

* Speed up dense D32 matvec path

* Reduce dense D32 fold buffer overhead

* Fuse dense D32 full-challenge accumulation

* perf(d32): speed up dense root kernels

Add an x86 SSE4.1 CRT mul-accumulate/add-reduce path for the dense
D32 root commit hot loop, matching the existing NEON hook pattern
without changing fallback behavior.

Also precompute and share rotated full-challenge tables in the fused
dense multi-digit fold path so workers stop rebuilding the same D32
rotation tables independently.

This improves the current full nv=25 profile on this host by cutting
root commit from about 3.10s to about 2.71s with the SIMD path active,
and dense_multi_digit_accumulate from about 1.37s to about 1.19s.

* chore(ntt): drop unverified x86 path

Remove the x86 SSE fast path that was added in the previous perf
checkpoint. In this environment rustc is targeting aarch64-apple-darwin,
while the x86_64 target is not installed, so that code could not be
compiled or validated locally.

Keep the fully verified dense D32 improvement from the shared rotated
full-challenge tables, which still improves the current full nv=25
profile on this host.

* perf(commitment): speed up dense D32 prover

Keep the root-kernel microbench in-tree so dense D32 commit work can be measured directly.

Reuse CRT+NTT scratch storage on the single-row dense root path and fuse multi-row rotated-challenge accumulation in the dense decompose-fold helper. On HACHI_MODE=full HACHI_NUM_VARS=25 this brings the local dense profile to roughly setup 0.175s, commit 2.624s, prove 0.866s, verify 0.020s with proof size unchanged at 67,936 bytes.

* perf(commitment): hoist i8 decomposition params

* perf(commitment): specialize dense single-row root path

* perf(commitment): fuse scalar crt digit accumulation

* perf(commitment): fuse blockwise digit matvecs

* bench(commitment): compare flat and block digit matvecs

* perf(commitment): restore flat single-i8 hot path

* fix(commitment): address remaining bugbot findings

* refactor(commitment): introduce prime profile layer

* refactor(commitment): unify schedule authority

* refactor(profile): table-drive profile modes

* refactor(schedule): make direct handoff a plan step

* refactor(proof): make direct handoff a proof step

* refactor(proof): generalize direct witness payload

* feat(commitment): support zero-fold direct roots

* feat(commitment): table-drive tiny direct roots

Promote tiny root-direct openings to a first-class generated strategy.
Update the runtime and dynamic onehot path to use field-element direct
witnesses when the chosen typed root layout exceeds the public onehot
arity, and make profile.rs handle tiny-nv dynamic modes cleanly while
skipping impossible fixed onehot layouts.

* refactor(commitment): pin generated schedule params

Attach pinned family-level parameter specs to generated schedule tables and
validate runtime-derived level params against them when materializing exact
plans. Keep profile-backed schedule lookup on the richer table artifact so
planner-backed families fail closed on policy drift instead of silently
inheriting changed ranks or challenge families.

* docs(planner): add codegen cutover plan

* feat(planner): add step-based schedule codegen

* refactor(commitment): use generated schedule modules

* refactor(commitment): honor exact schedule plans

- reuse pinned planned next-level params during exact-size singleton
  prove/verify flows when the current state matches the generated plan
- regenerate schedule artifacts from the broader planner search space
  so shipped D32/D64 families keep the better-performing plans
- keep the exact-plan execution hook opportunistic for now so runtime
  falls back cleanly when a pinned root plan does not match the typed
  root setup path

This moves singleton execution closer to the generated artifact while
avoiding hand-edited table drift.

* refactor(commitment): split generated schedule policies

- make shipped D32 and D64 adaptive preset policies explicitly
  generated-backed instead of sharing one generic adaptive policy
- keep D128 adaptive presets on an explicit live-planned policy path
  so experimental planner use is visible in the type layer
- wire the fp128 profile and public preset surface to the new split so
  generated families fail closed when a pinned table is missing

This shrinks schedule authority for blessed presets and removes a
silent generated-vs-planned ambiguity from the public config layer.

* fix(commitment): enforce exact generated fold roots

- size generated-family setup envelopes from the pinned schedule entry so
  shipped D32 and D64 tables can demand higher ranks without underallocating
- use exact generated level params whenever a singleton state matches a
  pinned fold step instead of regenerating those params from fallback hooks
- make exact singleton fold proofs fail closed if the runtime root plan no
  longer matches the pinned schedule artifact

This removes another silent drift path between generated artifacts and
runtime execution for blessed families.

* docs(planner): drop stale d16 ladder note

* fix(commitment): honor exact generated root layouts

* refactor(commitment): make generated families artifact-driven

* perf(commitment): add fused na3 matvec kernels

* style(commitment): normalize planner codegen

Clean up rustfmt/codegen residue across the planner and commitment modules.

No intended behavior change; this just removes noisy local diffs before the next round of planner/codegen and kernel work.

* fix(planner): compare bit-lengths instead of element counts in fold pruning

The root-level pruning in try_level_mr rejected folds where next_w_len >= w_len,
but at the root the input elements are 128-bit field elements while the output
elements are lb-bit packed digits. This caused small-nv onehot schedules to
skip beneficial folds (e.g. nv=15 sent 524 KB raw instead of 29 KB folded).

Also caps MAX_LB at 6 to match the i8/digit_lut constraint enforced throughout
the codebase. Regenerates all schedule tables.
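
The pruning fix can be illustrated with a minimal sketch. The function names and the concrete lengths below are hypothetical and are not taken from `try_level_mr`; the sketch only shows why comparing element counts rejects a fold that comparing bit-lengths accepts.

```rust
/// Old (buggy) rule: compare raw element counts, ignoring element width.
fn fold_shrinks_by_count(w_len: u64, next_w_len: u64) -> bool {
    next_w_len < w_len
}

/// Fixed rule: compare total bit-lengths. At the root the input elements are
/// 128-bit field elements while the outputs are lb-bit packed digits.
fn fold_shrinks_by_bits(w_len: u64, elem_bits: u64, next_w_len: u64, next_elem_bits: u64) -> bool {
    // Widen to u128 so the products cannot overflow.
    let cur = w_len as u128 * elem_bits as u128;
    let next = next_w_len as u128 * next_elem_bits as u128;
    next < cur
}
```

With 1000 field elements folding into 4000 six-bit digits, the count rule rejects the fold (4000 >= 1000), while the bit rule accepts it (24,000 bits < 128,000 bits).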

Made-with: Cursor

* fix(planner): recompute SIS width table with 10^10 search cap

The old table had entries capped at 5M (D=32) and 20M (D=64), which
were binary-search limits rather than true security cutoffs. This
caused the planner to fail finding fold schedules for D=64 onehot at
nv >= 49, falling back to enormous FieldElements direct proofs.

Reran the lattice estimator (BDGL16 + lgsa, q = 2^128 - 275) with a
10^10 search cap. Key changes:

- D=32 rank 3-4: uncapped (e.g. (32,2) rank 3: 5M -> 414M)
- D=64 rank 2-4: uncapped (e.g. (64,7) rank 2: 20M -> 794M)
- D=64: added (64,1023) and (64,2047) collision buckets
- D=128: unchanged (rank 1 already exact, rank 2-4 at 50B cap)

Adds scripts/gen_sis_table.py for reproducible table regeneration.

Made-with: Cursor

* fix(profile): use split-eq evaluation to avoid 64 GB eq-table allocation

The profile binary's `opening_from_public_poly` unconditionally
materialized `EqPolynomial::evals(point)` of size 2^nv, causing OOM
at nv=32 (64 GB).

Add space-efficient `evaluate(point)` methods to the public polynomial
types in `root_poly.rs`:
- DenseMultilinear: split-eq factorization, O(2^{n/2}) space
- OneHotMultilinear: pointwise eq per hot position, O(1) space
- MultilinearPolynomial: dispatches to the above

Replace the ad-hoc helper with `poly.evaluate(&pt)`.
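
A minimal sketch of the two evaluation strategies, assuming a toy 61-bit prime field and little-endian index bits (both are assumptions for illustration; the crate works over Fp128 and its actual `evaluate` signatures are not shown here). All inputs are assumed to be already reduced mod `P`.

```rust
// Toy prime field P = 2^61 - 1 (illustrative only).
const P: u64 = (1u64 << 61) - 1;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { ((a as u128 * b as u128) % (P as u128)) as u64 }

/// eq(point, x) for every x in {0,1}^len; bit i of the index pairs with point[i].
fn eq_table(point: &[u64]) -> Vec<u64> {
    let mut table = vec![1u64];
    for &r in point {
        let one_minus_r = sub(1, r);
        let mut next = Vec::with_capacity(table.len() * 2);
        next.extend(table.iter().map(|&t| mul(t, one_minus_r))); // new high bit = 0
        next.extend(table.iter().map(|&t| mul(t, r)));           // new high bit = 1
        table = next;
    }
    table
}

/// Reference evaluation: materializes the full 2^n eq table.
fn evaluate_naive(evals: &[u64], point: &[u64]) -> u64 {
    eq_table(point).iter().zip(evals).fold(0, |acc, (&e, &v)| add(acc, mul(e, v)))
}

/// Split-eq evaluation: two half-size tables, O(2^{n/2}) extra space.
fn evaluate_split(evals: &[u64], point: &[u64]) -> u64 {
    assert_eq!(evals.len(), 1usize << point.len());
    let (lo, hi) = point.split_at(point.len() / 2);
    let (eq_lo, eq_hi) = (eq_table(lo), eq_table(hi));
    let mut acc = 0;
    // Each chunk fixes the high index bits; the low table weights entries inside it.
    for (chunk, &w_hi) in evals.chunks(eq_lo.len()).zip(&eq_hi) {
        let inner = chunk.iter().zip(&eq_lo).fold(0, |s, (&v, &e)| add(s, mul(v, e)));
        acc = add(acc, mul(w_hi, inner));
    }
    acc
}

/// One-hot evaluation: a single product over the bits of the hot index, O(1) extra space.
fn evaluate_onehot(hot_index: usize, point: &[u64]) -> u64 {
    point.iter().enumerate().fold(1, |acc, (i, &r)| {
        let factor = if (hot_index >> i) & 1 == 1 { r } else { sub(1, r) };
        mul(acc, factor)
    })
}
```

The split version touches each evaluation once, like the naive one, but never allocates more than the two half-size tables.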

Made-with: Cursor

* fix(ci): specify --bin for planner validation step

The hachi-planner crate now has two binaries (hachi-planner and
gen_schedule_tables), so cargo run needs an explicit --bin flag.

Made-with: Cursor

* fix(commitment): use envelope floor for generated policy fallback params

The GeneratedAdaptivePolicy fallback path (when exact_planned_level_execution
fails to match, e.g. in batched proofs) was using audited_root_outer_rank
directly, which returns 1 for D=32. This silently dropped n_b and n_d below
the planner-determined minimum of 2, producing shorter but insecure proofs.

Use the envelope (which incorporates the generated table maximum) as the
floor for all rank parameters in the fallback path.

Made-with: Cursor

* feat(commitment): derive exact SIS ranks in fallback instead of envelope

When exact_planned_level_execution misses (e.g. batch-divergent
recursive levels), compute the actual matrix widths from the layout
and look up the minimum Module-SIS rank from a generated threshold
table. This replaces the conservative CommitmentEnvelope fallback
with precise per-width security parameters.

- Add `sis_floor.rs` generated module with SIS width thresholds
- Add `max_abs_coeff()` to SparseChallengeConfig
- Add `sis_derived_recursive_params` helper in config.rs
- Update gen_schedule_tables to emit sis_floor.rs

Made-with: Cursor

* refactor(planner): deduplicate ring configs in gen_schedule_tables

Replace the manually-duplicated D128/D64/D32_RING_CONFIGS arrays with
a runtime filter over the single authoritative ALL_RING_CONFIGS from
search.rs. Change PlannerOptions.ring_configs from &'static to Vec
to support non-static slices cleanly.

Made-with: Cursor

* chore: remove stale planner codegen cutover plan doc

* refactor: delete DynamicCommitmentScheme layer and root_poly type erasure

The dynamic layer added ~2400 lines of complexity (traits, lazy init,
type-erased MultilinearPolynomial round-trips) solely to select the root
ring dimension D at runtime. That selection is a simple proof-size
comparison that the profile harness now does with two helper functions
(best_full_d, best_onehot_d) followed by a static match dispatch into
the existing HachiCommitmentScheme<D, Cfg>.

Deleted:
- src/protocol/dynamic_commitment_scheme.rs (1411 lines)
- src/protocol/root_poly.rs (519 lines)
- DynamicCommitmentScheme trait from scheme.rs
- CommitmentFieldProfileDynamic trait and helpers from profile.rs
- Dynamic type aliases from presets.rs
- HachiRootScheduleArtifact from schedule.rs
- All Dynamic* re-exports from mod.rs and lib.rs

Per-level D dispatch (D=64 root folding into D=32 recursive levels) is
unchanged; it was always handled by HachiLevelParams.d and the
dispatch_with_ntt! macro inside the core HachiCommitmentScheme.

Made-with: Cursor

* refactor: delete DynamicSmallTestCommitmentConfig and vestigial profile associated types

DynamicSmallTestCommitmentConfig was defined and re-exported but never
instantiated anywhere. The six FullCfg*/OneHotCfg* associated types on
CommitmentFieldProfile were scaffolding for the now-deleted dynamic
layer and had zero references via the trait.

Made-with: Cursor

* refactor: collapse CommitmentPolicy into CommitmentConfig, extract schedule planner

Three P1 cleanup items:

1. Delete CommitmentPolicy trait and blanket forwarding impl.  Each policy
   (Static, Generated, Planned) now implements CommitmentConfig directly
   on CommitmentPreset<F, Policy>, removing ~200 lines of trait ceremony
   and indirection.

2. Extract DP planner code (best_recursive_suffix, planned_schedule,
   PlannerConfig/State/Suffix) into schedule_planner.rs, reducing
   schedule.rs from 2482 to 2076 lines.

3. Factor the 4 inline debug cross-check blocks into two shared helpers
   (debug_check_dp_basis, debug_check_dp_suffix_bytes) in the new module.

Made-with: Cursor

* refactor: unify FlatRingVec/ProofRingVec, extract test helpers, minor cleanups

P2 + P3 cleanup items:

1. Merge ProofRingVec into FlatRingVec (ring_dim=0 for compact/proof-wire
   mode). Removes ~200 lines of duplicated ring-vector methods and
   serialization code. Serialization now always uses the compact format
   (raw coefficients, no ring_dim prefix) since the self-describing format
   was never used externally.

2. Extract shared Fp128 E2E test helpers (F, stack/rayon init,
   random_point, opening_from_poly, make_*_poly, cfg aliases) into
   tests/common/mod.rs, deduplicating ~350 lines across 5 test files.

3. Merge adjacent if-let-Some(plan) guards in batched verify.

4. Remove redundant FSmall type alias in hachi_e2e.rs.

Cancelled P2 items after investigation:
- GeneratedAdaptivePolicy + PlannedAdaptiveBoundedPolicy merge: split is
  architecturally fundamental (pre-generated table lookup vs live DP
  planner, with different level_params_with_log_basis fallback chains).
- Strip sentinel entries from generated tables: no sentinel entries exist;
  all table rows are real schedule data.

Made-with: Cursor

* refactor: delete PlannedAdaptiveBoundedPolicy, all presets use generated tables

Generate D128 logbasis (LCB=3) and D128 onehot (LCB=1) schedule tables,
switch all D=128 presets from PlannedAdaptiveBoundedPolicy to
GeneratedAdaptivePolicy, and delete the live DP planner entry point
(planned_schedule) along with PlannedAdaptiveBoundedPolicy and
planned_adaptive_bounded_schedule_source.

The runtime DP planner is no longer invoked by any adaptive preset.
dp_suffix_bytes remains for static configs (singleton basis range,
negligible cost) and debug cross-checks.

Made-with: Cursor

* chore: cap generated tables at nv=50, delete fp128_adaptive_bounded_table

- Set max_num_vars=50 for all schedule table families (removes degenerate
  Direct-only entries at nv>50 that produced multi-exabyte "proofs")
- Replace the generic fp128_adaptive_bounded_table<D,LCB,N_A,N_B,N_D>
  with direct fp128_d32_{full,logbasis,onehot}_table() accessors
- Delete obsolete d128_bounded_families_fall_back_to_runtime_planner test
- Update SIS audit test bounds from 63 to 50

Made-with: Cursor

* chore: rename fp128_adaptive_onehot_d64_table to fp128_d64_onehot_table

Consistent naming: all table accessors now follow fp128_d{D}_{family}_table().

Made-with: Cursor

* refactor: delete LogBasis presets, add D64Full table

Remove all LOG_COMMIT_BOUND=3 (logbasis) presets, generated tables,
benchmarks, and profile modes. Only full (LCB=128) and onehot (LCB=1)
remain. Add fp128_d64_full generated table and D64Full preset to
complete the D-by-LCB matrix across D={32,64,128}.

Made-with: Cursor

* refactor: flatten matrix storage from 2D envelope to 1D layout

Eliminates wasted space from the shared 2D max_rows × max_cols envelope
by storing all matrix data in a single flat 1D vector. Each role (A, B, D)
interprets a prefix of the flat buffer via ring_view with role-specific
(num_rows, num_cols) dimensions.

Key changes:
- FlatMatrix: remove 2D metadata (num_rows, cols_ring), add ring_view<D>()
  that provides typed RingMatrixView with zero-copy row access
- NttSlotCache: flatten from Vec<Vec<CyclotomicCrtNtt>> to flat Vec
- derive_public_matrix_flat: switch to 1D domain separation (seed, flat_index)
- HachiSetupSeed: add max_stride() returning the global maximum column width
  across all roles and recursion levels
- HachiPolyOps trait: add matrix_stride parameter to commit_inner/witness
- All mat-vec kernels and ring_view call sites use the uniform max_stride
  to ensure consistent row offsets in the shared NTT cache
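
As a rough sketch of the 1D layout idea, the code below lets several "roles" view a prefix of one flat buffer with their own `(num_rows, num_cols)` and a shared stride. The types `FlatBuffer` and `RoleView` are hypothetical stand-ins operating on `u64` values; they are not the crate's `FlatMatrix` or `RingMatrixView`.

```rust
/// Hypothetical flat storage: one 1D vector shared by every role.
struct FlatBuffer {
    data: Vec<u64>,
}

/// Hypothetical zero-copy view over a prefix of the flat buffer.
struct RoleView<'a> {
    data: &'a [u64],
    num_rows: usize,
    num_cols: usize,
    stride: usize,
}

impl FlatBuffer {
    /// Interpret a prefix as num_rows x num_cols with a uniform row stride.
    /// Returns None if the requested shape does not fit in the buffer.
    fn view(&self, num_rows: usize, num_cols: usize, stride: usize) -> Option<RoleView<'_>> {
        if num_cols > stride {
            return None;
        }
        // The last row only needs num_cols entries, not a full stride.
        let needed = match num_rows {
            0 => 0,
            n => (n - 1).checked_mul(stride)?.checked_add(num_cols)?,
        };
        if needed > self.data.len() {
            return None;
        }
        Some(RoleView { data: &self.data[..needed], num_rows, num_cols, stride })
    }
}

impl<'a> RoleView<'a> {
    /// Borrow row r without copying; the offset depends only on the shared stride.
    fn row(&self, r: usize) -> &'a [u64] {
        assert!(r < self.num_rows);
        let start = r * self.stride;
        &self.data[start..start + self.num_cols]
    }
}
```

Because every view uses the same stride, row `r` starts at the same offset for every role, which is the property the commit relies on for the shared cache.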

Made-with: Cursor

* chore: remove dead code left over from 1D matrix cutover

- FlatMatrix::raw_data(), is_empty() (no callers)
- RingMatrixView::rows(), to_vec_vec() (no callers)
- HachiCommitmentLayout::matrix_stride() (superseded by HachiSetupSeed::max_stride())
- Fix clippy format-string lint in recursive_suffix eprintln

Made-with: Cursor

* fix(proof): stop fixed-point batched folding

Prevent batched recursive proving from looping once the witness stops
shrinking, matching the single-proof recursion stop rule.

Store proof-owned ring vectors in compact proof form so serialized
proofs round-trip without depending on in-memory ring-dimension
metadata.

Made-with: Cursor

* fix(commitment-scheme): keep batched folding byte-driven

Batched recursive suffixes already consult the byte planner, so reusing the
single-proof shrink-ratio guard could stop folding while another recursive
level still reduced proof size. Use a batched-specific stop guard that only
blocks tiny or non-shrinking witnesses, and lock in the nv=32 onehot
regression with a focused test.

Made-with: Cursor

* fix(planner): use actual-state batched suffix DP

Replace singleton table fallbacks with memoized planning from the
actual recursive state so batched suffix estimates stay aligned with
runtime on off-table states.

Add regression and profile coverage for batch-4 onehot cases, and fix
the onehot test lint that was breaking CI.

Posted by Cursor assistant (model: GPT-5.4) on behalf of the user (Quang Dao) with approval.

Made-with: Cursor

* fix: add schedule_plan() to static/test configs for release-mode compatibility

The `planned_next_log_basis_with_current_basis_and_envelope` function
returns a hard error in release builds when `Cfg::schedule_plan()` is
`None`. Three config families hit this: TinyConfig, SmallTestCommitmentConfig,
and StaticBoundedPolicy.

Add a generic `build_schedule_plan_from_config` helper that walks the
level chain for any CommitmentConfig with deterministic basis choices,
then override `schedule_plan()` on each affected config so they return
`Some(plan)` and never reach the release-mode error branch.

Made-with: Cursor

* Assert num_vars equal in batched commit mode

* fix: guard schedule_plan() overflow and restore planner lb freedom

build_schedule_plan_from_config computes 1usize << max_num_vars which
overflows for values >= 64. Return Ok(None) early in TinyConfig,
SmallTestCommitmentConfig, and StaticBoundedPolicy so callers fall back
to runtime computation for absurdly large num_vars (used by
disk-persistence tests with max_num_vars 100+).
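
A minimal sketch of such a guard, using the standard `checked_shl` instead of a raw shift. The function name `plan_table_len` is hypothetical; the real helper returns `Ok(None)` from `schedule_plan()` rather than an `Option<usize>`.

```rust
/// Hypothetical guard: returns None instead of overflowing when
/// max_num_vars is at least the bit width of usize.
fn plan_table_len(max_num_vars: u32) -> Option<usize> {
    // checked_shl yields None when the shift amount is >= usize::BITS.
    1usize.checked_shl(max_num_vars)
}
```

A caller can treat `None` as "no precomputed plan" and fall back to runtime computation.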

Restore independent log-basis iteration in the planner's best_from():
the recursive folding level's lb was locked to current_lb, preventing
re-decomposition at a different basis. Revert to the original design
where each level freely iterates lb in MIN_LB..=MAX_LB while inheriting
the parent's lb as log_cb.

Made-with: Cursor

---------

Co-authored-by: Omid Bodaghi <42227752+omibo@users.noreply.github.com>
* fix(planner): measure recursive suffix costs

Score recursive suffix planning with exact serialized proof bytes instead of
formula-only estimates, and route recursive miss states through the measured
DP path.

Isolate the recursive DP caches by config type so adaptive presets do not
reuse stale suffix or basis choices across families.

Made-with: Cursor

* refactor(planner): inline exact schedule planner

Move the offline planner, generator, and validation CLI into hachi-pcs so
exact schedule lookup, table generation, and runtime planning share one
batch-aware implementation and key space.

Regenerate the shipped tables around exact root schedule keys, fix setup
envelope sizing for generated direct tails, and add coverage for singleton,
blessed-batch, and off-table planner paths.

Made-with: Cursor

* fix(planner): tighten generated schedule miss handling

Keep missing generated schedule tables as hard configuration errors while
still treating per-key schedule misses as soft fallbacks, and remove the
leftover planner wrappers and discarded search-range parameters that no
longer affect runtime behavior.

Made-with: Cursor

* fix(ci): emit rustfmt-clean generated imports

Teach the schedule table generator to emit import blocks in the same shape
that rustfmt expects so regenerated checked-in tables pass the format job
without manual cleanup.

Made-with: Cursor

* fix(bench): publish observed proof metrics

Report proof framing bytes explicitly and derive the published terminal
state from the observed final witness so benchmark comments distinguish
measured proof data from planner-only metadata.

Made-with: Cursor

* fix(planner): derive root ranks from layouts

Derive adaptive level-0 ranks from the actual root layout instead of the
audited envelope fallback so singleton D32 and D64 schedule generation
cannot freeze unsound rank-1 root rows into the checked-in tables.

Keep batched root row counts tied to the per-polynomial root layout,
propagate layout-aware root params through setup and commit helpers, and
regenerate the affected generated schedules with regression coverage for
the D32 onehot root and tiny direct-root path.

Made-with: Cursor

* fix(bench): drop dead terminal summary fields

Remove the unused planned terminal summary keys from the benchmark report
parser so the observed terminal-state reporting remains the single source
of truth and the script no longer carries dead stores.

Made-with: Cursor
* feat: asymmetric centering for power-of-2 digit depths

Use T_k = (b/2-1)(b^k-1)/(b-1) as the centering threshold for
full-field balanced decompositions instead of q/2. This eliminates the
+1 digit correction when lb divides 128, giving power-of-2 depths
(64 instead of 65 for lb=2, 32 instead of 33 for lb=4).

The key insight: k balanced base-b digits biject onto b^k consecutive
integers. For field elements in [0,q), asymmetric centering maps
c <= T_k to itself and c > T_k to c-q, covering all q values with
exactly ceil(128/lb) digits. This only applies to full-field
decompositions (depth_open, depth_commit when log_commit_bound=128,
r_decomp_levels). Fold digits remain symmetrically centered since they
decompose plain integers, not field elements mod q.
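
The centering argument can be checked on a toy instance. The sketch below assumes balanced digits in [-b/2, b/2 - 1] and uses q = 251, b = 4, k = 4 (so b^k = 256 >= q); these values and all function names are illustrative, not the crate's.

```rust
/// T_k = (b/2 - 1)(b^k - 1)/(b - 1): the largest value k balanced digits can hold.
fn threshold(b: i64, k: u32) -> i64 {
    (b / 2 - 1) * (b.pow(k) - 1) / (b - 1)
}

/// Asymmetric centering: c <= T_k stays, c > T_k maps to c - q.
fn center_asymmetric(c: i64, q: i64, t: i64) -> i64 {
    if c <= t { c } else { c - q }
}

/// Decompose v into exactly k balanced base-b digits (least significant first).
/// Returns None if v does not fit in k digits.
fn balanced_digits(mut v: i64, b: i64, k: u32) -> Option<Vec<i64>> {
    let mut digits = Vec::with_capacity(k as usize);
    for _ in 0..k {
        let mut d = v.rem_euclid(b);
        if d >= b / 2 {
            d -= b; // shift into [-b/2, b/2 - 1]
        }
        digits.push(d);
        v = (v - d) / b;
    }
    if v == 0 { Some(digits) } else { None }
}

/// Inverse of balanced_digits.
fn recompose(digits: &[i64], b: i64) -> i64 {
    digits.iter().rev().fold(0, |acc, &d| acc * b + d)
}
```

In this toy setting T_4 = 85, every c in [0, 251) fits in four digits after asymmetric centering, while the symmetric representative 125 (which is <= q/2) would need a fifth digit.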

Proof size reduction: ~1.1-1.2 KB across all configurations.

Made-with: Cursor

* fix(commitment): restore fold fallback symmetry

Keep the Python dense fold estimator aligned with Rust by using the
symmetric digit-count fallback for folded integers.

Remove the unused asymmetric threshold helper so the decomposition
threshold stays defined in one runtime path.

Made-with: Cursor

* perf(decompose): streamline asymmetric overflow paths

Reduce the overhead of exact-full-field asymmetric decomposition by
reusing the peeled top digit across the field and i8 kernels instead of
staging ring-wide scratch buffers.

Add deterministic fp128 boundary tests so the overflow edge cases stay
covered while we iterate on follow-up benchmarking.

Made-with: Cursor

* perf(commitment): align i8 tiles to digit boundaries

Keep tiled dense i8 mat-vec kernels on full digit groups so adjacent
tiles do not re-decompose the same ring when a boundary lands mid-pack.

Add multi-tile block and strided tests to lock in equivalence with the
direct pre-decomposed digit paths.

Made-with: Cursor
* feat(sumcheck): implement phase-1 y-first cutover

Switch Stage 2 to the y-first witness layout and compute ring-switch
m(x) on demand so the verifier no longer materializes m_evals_x.

Keep Stage 1 x-first behind compatibility shims, including compact
witness transposes, two-round-prefix updates, and wiring-layer
challenge reordering, so recursive proofs continue to chain
correctly during the Phase 1 transition.

Made-with: Cursor

* fix(sumcheck): restore stage2 prefix handoff fast path

Restore the fused stage-2 round-2 handoff so y-first prefix proofs stop
rescanning the compact witness after the two-round-prefix transition, and
clear the prefix state once that handoff completes to avoid stale-path
reentry.

Also narrow the temporary dead-code allowances introduced during the phase-1
y-first cutover by routing the verifier through the shared shifted-eq
dispatcher and dropping now-unused test helpers.

Made-with: Cursor

* feat(sumcheck): finish y-first cutover

Make stage 1 bind y-first and move the only coordinate reorder to
stage-1 input so stage 2 consumes r_stage1 directly.

Preserve the compact two-round prefix path and sparse-x handling
while removing the old stage1-to-stage2 compatibility bridge.

Made-with: Cursor

* refactor(planner): drop unused opt_sumcheck setter

Remove the dead planner option builder so PlannerOptions only exposes
configuration toggles that are still wired into the search flow.

Made-with: Cursor

* fix(ring-switch): drop dead m-eval helpers

Remove the unused shifted-eq evaluation helpers and the stale test that only exercised them so CI can keep treating dead code as an error.

Made-with: Cursor

* refactor(commitment-scheme): drop redundant opening-point reorder

After the y-first cutover, recursive stage transitions can carry the
sumcheck challenges directly as the next opening point. Remove the
identity helper and the dead width bookkeeping it forced the prover and
verifier to carry.

Made-with: Cursor

* perf(sumcheck): fuse sparse y stage1 handoff

Fuse the sparse y-stage full-table fold with next-round polynomial generation so the y-first Stage 1 cutover recovers the onehot regression without reverting semantics.

Made-with: Cursor

* refactor: drop y_first naming and deduplicate test helpers

Now that y-first is the only ordering, remove the _y_first suffix
from reorder_stage1_coords, build_compact_s_table, and all related
variable names. Deduplicate pad_compact_witness, advance_stage1_claim,
and reorder helpers that were copy-pasted across three test modules.
Delete the unused shifted_eq module.

Made-with: Cursor

* style: fix rustfmt on advance_stage1_claim generic bound

Made-with: Cursor
* perf(fp128): hand-written AArch64/x86-64 inline asm for add, sub, mul, sqr

Replace LLVM-generated codegen for Fp128 field arithmetic with
hand-written inline assembly on AArch64 and x86-64, falling back to
portable Rust on other targets.

AArch64 add_raw (8 instructions):
  Uses `ccmp` to fold the overflow predicate (carry from a+b)
  with the ≥p check (carry from s+C) into a single flag state,
  avoiding the GPR round-trip that LLVM's u128 lowering produces.
  Dispatches to immediate or register form based on C.

AArch64 sub_raw (6 instructions):
  Uses `csel` on the borrow flag to pick subtrahend 0 or C,
  then a final `subs`/`sbc` pair.  Avoids materializing the full
  128-bit prime P.

AArch64 mul_raw (35 instructions, was 41):
  Full schoolbook 2×2 → Solinas reduction in one asm block.
  Fold-1 carry chain uses direct adds/adcs/adc (5 insns vs
  LLVM's 8 with cset/cinc shuttling).  Fold-2 + canonicalize
  uses `ccmp` (8 insns vs LLVM's 10).
  Benchmarked at 1.22x throughput improvement on Apple M4.

AArch64 sqr_raw (31 instructions, was 37):
  3 widening multiplies with doubled cross term via shifted-
  register operands.  Same fold-1 + ccmp canonicalize savings.
  Benchmarked at 1.12x throughput improvement on Apple M4.

x86-64 add_raw (10 instructions):
  Uses `sbb reg,reg` to materialize carry as 0/-1 mask, then
  `adc mask,mask` after trial subtraction to encode the
  reduction predicate in ZF for `cmovne`.

x86-64 sub_raw (6 instructions):
  Uses `sbb reg,reg` + `and` to conditionally mask C, then
  a final `sub`/`sbb` pair.

Portable fallback (add_raw_portable):
  Replaces u128 arithmetic with explicit two-limb overflowing_add
  chains, which lowers to better code on targets without native
  128-bit support.

packed_neon Add:
  Rewrites the reduction logic to use `s + C` (add-based) instead
  of `s - P` (subtract-based), eliminating the need to broadcast
  the full 128-bit prime.  Removes unused `veorq_u64` import.
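
The add-based reduction and the borrow-selected subtrahend can be sketched in portable Rust for a prime of the form p = 2^128 - C. This sketch uses `u128` for brevity rather than the two-limb chains the commit describes, fixes C = 275 as an example, and assumes both inputs are already reduced; it is not the crate's `add_raw`/`sub_raw`.

```rust
const C: u128 = 275;
const P: u128 = u128::MAX - (C - 1); // 2^128 - C

/// (a + b) mod P for a, b < P, without ever materializing P in the hot path.
fn add_mod(a: u128, b: u128) -> u128 {
    let (s, carry_sum) = a.overflowing_add(b); // carry out of a + b
    let (t, carry_red) = s.overflowing_add(C); // carry out of s + C, i.e. s >= P
    // Either carry means the true sum is >= P, and t is the reduced value.
    if carry_sum | carry_red { t } else { s }
}

/// (a - b) mod P for a, b < P: subtract C only when the difference borrowed.
fn sub_mod(a: u128, b: u128) -> u128 {
    let (d, borrow) = a.overflowing_sub(b);
    d.wrapping_sub(if borrow { C } else { 0 })
}
```

The two carries can never both be set for reduced inputs, which is why a single combined predicate is enough.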

Made-with: Cursor

* fix(fp128): tighten C bound to < 2^32 for asm fold-2 correctness

The AArch64 asm fold-2 step computes C*t2 with a single `mul` (low 64
bits only). Since t2 <= C after fold-1, this requires C^2 < 2^64,
i.e. C < 2^32. Enforce this at compile time instead of the previous
C < 2^64 bound.
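
One way to enforce such a bound at compile time is an associated const assertion, sketched below. The type `SolinasOffset` and the method `fold2_low` are hypothetical names for illustration, not the crate's API.

```rust
/// Hypothetical marker type for a prime of the form 2^128 - C.
struct SolinasOffset<const C: u64>;

impl<const C: u64> SolinasOffset<C> {
    /// Fails compilation for any instantiation with C >= 2^32.
    const FOLD2_OK: () = assert!(C < (1u64 << 32), "C must be < 2^32 so C * t2 fits in 64 bits");

    /// Low-64-bit product used in the fold-2 step; exact when t2 <= C < 2^32.
    fn fold2_low(t2: u64) -> u64 {
        let () = Self::FOLD2_OK; // force evaluation of the compile-time check
        debug_assert!(t2 <= C);
        C * t2
    }
}
```

Instantiating `SolinasOffset::<{ 1u64 << 32 }>::fold2_low` would be rejected by the compiler, since the assertion is evaluated when the method is monomorphized.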

Made-with: Cursor

* perf(fp128): add fused mul-add fast path

Add a dedicated Fp128 fused multiply-add primitive that widens the
product, injects the addend before reduction, and finishes with one
Solinas reduction.

On AArch64 this keeps the addend on the carry chain inside the hand-
written multiply path, which improves both standalone mul-add and the
projective binding shape used in our benchmarks.

Made-with: Cursor

* style(fp128): cargo fmt

Made-with: Cursor
… RS encoding (#49)

* feat(algebra): add smooth-FFT prime p=2^128-2355

Add Prime128Offset2355 (p ≡ 5 mod 8) with smooth multiplicative subgroup
of order 14700 = 2² × 3 × 5² × 7², enabling mixed-radix FFT-based RS
encoding up to size 14700 without Bluestein or zero-padding.

- fp128: new type alias, asm dispatch for C=2355 on AArch64/x86-64
- crt_ntt: register new modulus in CRT+NTT param selection
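
The arithmetic claims in this message can be sanity-checked with plain `u128` arithmetic. The constant names below are illustrative; the check confirms the residue mod 8 and that 14700 divides p - 1 (a necessary condition for a subgroup of that order), and it does not test primality.

```rust
/// p = 2^128 - 2355, written via u128::MAX = 2^128 - 1.
const P: u128 = u128::MAX - 2354;

/// 2^2 * 3 * 5^2 * 7^2
const SMOOTH_ORDER: u128 = 4 * 3 * 25 * 49;

/// True when a multiplicative subgroup of order SMOOTH_ORDER can exist mod P
/// (assuming P is prime, so the unit group is cyclic of order P - 1).
fn smooth_order_divides_group_order() -> bool {
    (P - 1) % SMOOTH_ORDER == 0
}
```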

* perf(fft): optimize mixed-radix FFT with radix-7 butterfly and precomputed twiddles

Two optimizations to the smooth-domain mixed-radix FFT that together
yield ~2x throughput improvement across all benchmarked sizes:

- Specialized radix-7 butterfly: unrolled 7-point DFT with explicit
   root-of-unity powers, matching the existing radix-2/3/5 pattern.
   Critical for 14700 = 2²×3×5²×7² where two radix-7 stages previously
   fell through to the generic O(r²) DFT loop. (1.5-1.7x alone)
- Precomputed twiddle tables in SmoothDomain: per-stage omega_r powers
   and twiddle arrays are built once in SmoothDomain::new() for both
   forward and inverse transforms. Replaces the runtime field_pow calls
   and dependent tw_k multiply chain with table lookups and an
   ILP-friendly squaring pattern (tw2=tw*tw, tw4=tw2*tw2, tw6=tw3*tw3).
   (1.2-1.4x on top of optimization 1)
- Criterion benchmark suite (benches/fft_smooth.rs) covering forward,
   RS-extend, and RS-expand workloads

Made-with: Cursor

* bench(fft): add parallel 32768x RS-expand benchmark

Adds a Rayon-parallel benchmark that runs 32,768 independent
256→1024 RS expansions via the 1470-smooth domain, measuring
aggregate throughput under full core utilization.

Made-with: Cursor

* fix: resolve clippy warnings and update bench field to new prime

- Remove unused `vcgtq_u64` import in packed_neon.rs
- Remove unnecessary `as u64` cast in fp128.rs inline asm
- Use idiomatic iterators and `+=`/`*=` assign ops in fft.rs tests
- Update bench field type alias from p=2^128-275 to p=2^128-2355
  to match the current CommitmentPreset

Made-with: Cursor

* fix: clippy format

* fix: remove extra files

* fix: remove extra files

* fix(fft): update stale docs, add radix guard, clarify comments

- Update fp128.rs module doc to reference Prime128Offset2355
- Add debug_assert for omega_r_pow array bound in precompute_stages
- Fix misleading RS-expand comment in benchmark
- Add explanatory comment for 2's complement reduction in from_scalar_with_params

Made-with: Cursor

---------

Co-authored-by: Quang Dao <quang.dao@layerzerolabs.org>
* feat(protocol): γ-batch evaluation claims per point

Replace L_j per-claim evaluation rows with one γ-weighted row per
opening point, cutting proof size and verifier trace checks.

γ coefficients are Fiat-Shamir challenges sampled after absorbing
commitments and field-element openings. The matrix equation, ring
switch M-evaluation, and w-vector sizing all use num_points instead
of num_claims for the evaluation-row dimension.
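
As a rough illustration of the row-count change, the sketch below combines per-claim evaluations into one γ-weighted value per opening point. The `Claim` type, the toy modulus, and the function name are assumptions; in the protocol the γ values are Fiat-Shamir challenges and the rows are ring-valued, which this sketch does not model.

```rust
// Toy prime modulus (illustrative only); inputs are assumed reduced mod P.
const P: u64 = (1u64 << 61) - 1;

fn mul_mod(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % (P as u128)) as u64
}

/// Hypothetical claim: an evaluation value tied to one opening point.
struct Claim {
    point: usize,
    value: u64,
}

/// One combined value per point: rows[p] = sum of gamma_j * value_j over claims at p.
fn batch_per_point(claims: &[Claim], gammas: &[u64], num_points: usize) -> Vec<u64> {
    assert_eq!(claims.len(), gammas.len());
    let mut rows = vec![0u64; num_points];
    for (claim, &gamma) in claims.iter().zip(gammas) {
        rows[claim.point] = (rows[claim.point] + mul_mod(gamma, claim.value)) % P;
    }
    rows
}
```

Three claims over two points produce two rows instead of three, which is the dimension change from `num_claims` to `num_points`.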

Three blessed schedule tests temporarily ignored pending regeneration.

Made-with: Cursor

* fix: unify num_points in w_ring_element_count

Made-with: Cursor

* fix(verify): restore openings.len() == num_claims check in batched verifiers

The batched CWSS refactor replaced the `openings.len() != num_claims`
guard with the weaker `openings.is_empty()` in verify_batched_root_level,
and omitted it entirely in verify_multipoint_batched_root_level. This
allowed malformed proofs with mismatched opening counts to pass early
validation. Restore the exact-length check in both verifiers.
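
The difference between the two guards can be sketched as follows. The error type and function names are hypothetical and do not reflect the crate's verifier signatures.

```rust
/// Hypothetical error for early proof-shape validation.
#[derive(Debug, PartialEq)]
enum VerifyError {
    OpeningCountMismatch { expected: usize, got: usize },
}

/// Exact-length guard: every claim must come with exactly one opening.
fn check_openings<T>(openings: &[T], num_claims: usize) -> Result<(), VerifyError> {
    if openings.len() != num_claims {
        return Err(VerifyError::OpeningCountMismatch {
            expected: num_claims,
            got: openings.len(),
        });
    }
    Ok(())
}

/// The weaker guard that the fix replaces: only rejects an empty list.
fn weak_check<T>(openings: &[T]) -> bool {
    !openings.is_empty()
}
```

A proof carrying one opening for three claims passes the weak guard but is rejected by the exact-length check.
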
* Add refactored_scheduler which generates smaller proofs

* feat(planner): batched DP schedule planner with pre-computed tables

Add a DP-based schedule planner that optimizes proof size for both
singleton and batched polynomial openings. The planner searches over
(log_basis, m, r) triples at the root and uses exact memoised suffix
costs instead of estimates.

Key changes:
- Unified `compute_num_digits_fold` with `num_claims` parameter
- `find_optimal_batched_schedule` as the single entry point for both
  singleton (num_claims=1) and batched mode
- Replaced `optimal_root_batch_split` in commit.rs to first check
  pre-computed tables, falling back to the DP planner only on miss
- Generated `refactored/` (singleton) and `refactored_batched/`
  (4-poly batched) schedule tables for all 6 families (nv 1..50)
- Wired batched tables into the runtime via `generated_batched_schedule_table`
  fallback in `CommitmentFieldProfileSchedule`
- Added tracing at info/debug/warn levels for table hits, misses,
  and planner invocations

Singleton tables are never worse than existing generated tables
(181KB saved across D32 families). Batched tables eliminate all
runtime recomputation for the 4-poly case.

Made-with: Cursor

* fix(commitment): make batched root ranks width-aware

Align the batched root schedule and setup sizing with the actual
aggregated B and D matrix widths used at runtime. The root planner now
computes batched n_b/n_d from scaled widths, the runtime plan derives
batched root params from the scaled layout, and setup sizing carries the
maximum row counts through matrix allocation.

This also simplifies the batched commit path to use the planner split
directly without rebuilding unnecessary root plan state, keeps the
pre-computed batched tables authoritative when present, and regenerates
those tables so the written schedule data matches the new rank logic.

Additional cleanup:
- remove redundant per-poly fold recomputation at the commit caller
- restore split-based fit checks behind a setup helper
- fix existing clippy blockers in the algebra backends so CI is green

Made-with: Cursor

* refactor(commitment): inline root runtime setup in prove and verify

Derive root layout and params directly at the commitment-scheme entrypoints and remove the extra singleton runtime-plan test wrapper. Keep terminal witness packing anchored to the carried runtime basis, with an explicit panic if the final digits ever exceed it.

Made-with: Cursor

* refactor(planner): consolidate generated schedule tables

Use a single generated schedule source per fp128 family by merging singleton and batched entries into the top-level files, and rename the planner module to match its role in schedule parameter selection.

Made-with: Cursor

* fix(planner): align generated schedule bytes with runtime

Price each planned fold against its actual chosen successor and emit terminal commitment metadata from the direct step's runtime state. This restores the shipped D32 onehot singleton schedule to the correct proof size seen at runtime.

Made-with: Cursor

* refactor: unify HachiLevelParams + HachiCommitmentLayout into LevelParams

Replace the two separate parameter structs (HachiLevelParams for ring
dimension / matrix ranks / challenge config, and HachiCommitmentLayout
for block geometry / digit depths / matrix widths) with a single
LevelParams struct in src/protocol/params.rs.

Key changes:
- New LevelParams struct with AjtaiKeyParams sub-structs for each
  Ajtai matrix (A, B, D), combining row count + column width + basis
- All CommitmentConfig trait methods (level_params_with_log_basis,
  root_level_layout_with_log_basis, root_level_params_for_layout,
  commitment_layout, level_params) now return LevelParams
- HachiPlannedLevel stores a single `lp: LevelParams` field
- ring_switch, quadratic_equation, commitment_scheme, schedule_params,
  and all test/bench/example code use LevelParams exclusively
- Both old structs and their conversion bridges fully deleted
- Net reduction: -845 lines across 26 files

Made-with: Cursor

* fix(generated): use runtime_exact label to match main branch

Reduces diff noise against main by keeping the same fold-step label
that the main branch uses in generated schedule tables.

Made-with: Cursor

* fix(batched): use cached singleton schedule for recursive suffix in batched setup

The scan_layout_chain function previously fell through to an expensive
DP walk for batched mode (max_num_batched_polys > 1) even when a
pre-computed singleton schedule was available. Since recursive levels
are identical for singleton and batched openings, we can reuse the
singleton plan's recursive suffix to skip the DP recomputation.
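
A minimal sketch of the suffix reuse, assuming a plan is an ordered list of levels with the root at index 0. `Level` and `batched_plan_from_singleton` are hypothetical stand-ins, not the crate's types.

```rust
/// Hypothetical stand-in for one planned level; the real type is richer.
#[derive(Clone, Debug, PartialEq)]
pub struct Level {
    pub label: &'static str,
    pub width: usize,
}

/// Build a batched plan from a batched root plus the recursive levels of a
/// cached singleton plan (everything after the singleton root).
pub fn batched_plan_from_singleton(batched_root: Level, singleton_plan: &[Level]) -> Vec<Level> {
    let mut plan = Vec::with_capacity(singleton_plan.len().max(1));
    plan.push(batched_root);
    // Skip the singleton root at index 0; only the recursive suffix is shared.
    plan.extend(singleton_plan.iter().skip(1).cloned());
    plan
}
```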

Made-with: Cursor

* refactor: encapsulate AjtaiKeyParams fields behind constructor and getters

Make AjtaiKeyParams fields private and enforce construction through
AjtaiKeyParams::new(row_len, col_len, log_basis). Add row_len(),
col_len(), and log_basis() getter methods. Update all ~190 field
access sites across 13 files to use the new API.
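
For illustration, the encapsulated shape might look like the sketch below. The constructor and getter names come from this commit message; the field types (`usize`, `u32`) and the derives are assumptions.

```rust
/// Sketch of the encapsulated parameter bundle; field types are assumed.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct AjtaiKeyParams {
    row_len: usize,
    col_len: usize,
    log_basis: u32,
}

impl AjtaiKeyParams {
    /// The only way to build a value from outside the defining module.
    pub fn new(row_len: usize, col_len: usize, log_basis: u32) -> Self {
        Self { row_len, col_len, log_basis }
    }

    /// Number of rows of the Ajtai matrix.
    pub fn row_len(&self) -> usize {
        self.row_len
    }

    /// Number of columns of the Ajtai matrix.
    pub fn col_len(&self) -> usize {
        self.col_len
    }

    /// Log2 of the decomposition basis.
    pub fn log_basis(&self) -> u32 {
        self.log_basis
    }
}
```

Keeping the fields private means later invariants can be enforced in `new` without touching call sites again.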

Made-with: Cursor

* refactor(params): add SIS security check to AjtaiKeyParams

Replace log_basis field on AjtaiKeyParams with collision_inf (worst-case
L∞ collision bound) and add SIS floor validation:

- `new()` panics if row_len is below the 128-bit SIS security floor
- `new_unchecked()` logs a warning instead, for intermediate construction
  steps where ranks haven't converged yet (batched scaling, iterative
  SIS fixed-point loops)
- `Default` derives all-zero (skips SIS check since collision_inf=0)
- SIS-derived params (sis_derived_root_params_for_layout,
  sis_derived_recursive_params) now set collision_inf on each key

All existing call sites use new_unchecked. The checked new() is
available for future code that constructs finalized, security-verified
keys.
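
A rough sketch of the checked/unchecked split follows. The floor function here is a made-up placeholder and NOT a real SIS estimate; `hypothetical_sis_floor`, `meets_floor`, the field types, and the use of `eprintln!` for the warning are all assumptions.

```rust
/// Placeholder floor, NOT a real SIS bound: it only models the idea that a
/// larger collision bound needs more rows, and that 0 skips the check.
fn hypothetical_sis_floor(collision_inf: u64) -> usize {
    if collision_inf == 0 {
        0
    } else {
        1 + (64 - collision_inf.leading_zeros() as usize) / 8
    }
}

#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct AjtaiKeyParams {
    row_len: usize,
    col_len: usize,
    collision_inf: u64,
}

impl AjtaiKeyParams {
    /// Checked constructor: panics when `row_len` is below the floor.
    pub fn new(row_len: usize, col_len: usize, collision_inf: u64) -> Self {
        let floor = hypothetical_sis_floor(collision_inf);
        assert!(row_len >= floor, "row_len {} is below the SIS floor {}", row_len, floor);
        Self { row_len, col_len, collision_inf }
    }

    /// Unchecked constructor: warns instead of panicking, for intermediate
    /// steps where ranks have not converged yet.
    pub fn new_unchecked(row_len: usize, col_len: usize, collision_inf: u64) -> Self {
        let floor = hypothetical_sis_floor(collision_inf);
        if row_len < floor {
            eprintln!("warning: row_len {} is below the SIS floor {}", row_len, floor);
        }
        Self { row_len, col_len, collision_inf }
    }

    pub fn row_len(&self) -> usize {
        self.row_len
    }

    /// True when the stored rank satisfies the (placeholder) floor.
    pub fn meets_floor(&self) -> bool {
        self.row_len >= hypothetical_sis_floor(self.collision_inf)
    }
}
```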

Made-with: Cursor

* refactor(ring-switch): remove duplicate helpers and add schedule validation

- Remove identical `w_ring_element_count_with_point_claim_groups`,
  consolidate call sites to `w_ring_element_count_with_claim_groups`
- Remove dead `m_row_count` wrapper (callers use `lp.m_row_count()`)
- Add debug_assert checks in `schedule_plan_from_generated_entry` that
  recomputed digit depths match the table's pinned delta_* values

Made-with: Cursor

* refactor(protocol): cleanup dead code and unused params

- Extend SIS audit test to D32/D64 families (found real rank=1
  violations at high num_vars, capped ranges accordingly)
- Rename `batched_root_level_proof_bytes` to `level_proof_bytes`
- Remove dead `estimated_recursive_suffix_bytes` and both
  `ensure_batched_root_split_fits` methods
- Remove unused `_half_field_bound` param from
  `recursive_r_decomp_levels` and cascade through
  `planned_w_ring_element_count`, `planned_next_w_len`,
  `PlannerConfig`, and all call sites
- Remove now-dead `planner_half_field_bound()` trait method

Made-with: Cursor

* fix(planner): align batched root accounting

Unify planner and runtime batched-root derivation so B/D sizing,
root witness sizing, and root proof bytes all use the same
num_claims versus num_points semantics.

Regenerate the Rust schedule tables and update the end-to-end tests
to match the corrected runtime-exact rows.

Made-with: Cursor

* fix(planner): address batched bugbot issues

Follow the concrete batched runtime suffix when sizing setup matrices,
and keep standalone batched root proof sizing aligned with the runtime
root params and shared fold-digit math.

Made-with: Cursor

* refactor(planner): dedupe fold digit helpers

Route singleton fold-digit sizing through one shared batched helper,
so the public commitment API keeps a single source of truth while the
batched planner paths still pass explicit claim counts.

Made-with: Cursor

* fix(protocol): keep batched root splits per-poly

Return per-polynomial root widths from the DP and direct-only batched-root split paths so setup sizing only scales B/D once. Add regressions for folded and direct-only no-table batch helpers.

Made-with: Cursor

* refactor(commit): remove unnecessary root_lp/batched_lp clones in batched_commit

Access root_plan.root_lp and root_plan.level_lp fields directly
instead of cloning them into local variables.

Made-with: Cursor

---------

Co-authored-by: Quang Dao <quang.dao@layerzerolabs.org>

* refactor(sumcheck): decompose mod.rs into focused submodules

Break the monolithic mod.rs (956 lines) into:
- traits.rs: SumcheckInstance{Prover,Verifier} + EqFactored variants
- drivers.rs: prove/verify driver functions
- compact_fold.rs: CompactPairFoldLut
- accum.rs: reduce_signed_accum (breaks two_round_prefix <-> hachi_stage2 cycle)

Also deduplicates fold_full_prefix_pair (was copy-pasted in stage1 and stage2).

mod.rs is now a thin barrel (~350 lines including tests).

Made-with: Cursor

* refactor: rename num_u/num_l to col_bits/ring_bits

The old names were positional labels from the polynomial w(u, l) that
did not convey what the variables actually index:

  num_u -> col_bits:  number of bits indexing witness columns (ring elements)
  num_l -> ring_bits: number of bits indexing coefficients within a ring element (log2 D)
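
As a small illustration of the two index ranges, the sketch below splits a flat coefficient index into a column and a coefficient position. It assumes the ring coefficient occupies the low `ring_bits` bits of the index; the helper name `split_index` and that bit order are assumptions, not taken from the code.

```rust
/// Split a flat coefficient index into (column, coefficient within the ring
/// element). Assumes the coefficient index lives in the low `ring_bits` bits.
pub fn split_index(flat: usize, ring_bits: u32) -> (usize, usize) {
    let coeff = flat & ((1usize << ring_bits) - 1);
    let col = flat >> ring_bits;
    (col, coeff)
}

/// Total number of variables of w: column bits plus ring bits.
pub fn total_vars(col_bits: u32, ring_bits: u32) -> u32 {
    col_bits + ring_bits
}
```

For a ring of degree D = 32, `ring_bits` is 5, so flat index 70 lands in column 2 at coefficient 6.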

Made-with: Cursor

* refactor: reorder matrix M rows to consistency, public, D, B, A

Place the consistency row (folded-evaluation) and public row
(evaluation correctness) first in M, followed by D, B, A rows.
This layout groups the shorter-footprint rows at the top, enabling
future verifier optimizations for analytical MLE evaluation.

Updated all dependent row-index arithmetic in ring_switch,
quadratic_equation, hachi_stage2, commitment_scheme debug block,
and schedule doc comments.
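
The row-index arithmetic implied by the new order can be sketched as below. The counts of consistency and public rows are passed in because this message does not state them; `MRowOffsets` and `m_row_offsets` are hypothetical names.

```rust
/// Starting row of each group in M, in the order
/// consistency, public, D, B, A.
#[derive(Debug, PartialEq, Eq)]
pub struct MRowOffsets {
    pub consistency: usize,
    pub public: usize,
    pub d: usize,
    pub b: usize,
    pub a: usize,
    pub total: usize,
}

/// Compute group offsets as running sums over the group sizes.
pub fn m_row_offsets(
    n_consistency: usize,
    n_public: usize,
    n_d: usize,
    n_b: usize,
    n_a: usize,
) -> MRowOffsets {
    let consistency = 0;
    let public = consistency + n_consistency;
    let d = public + n_public;
    let b = d + n_d;
    let a = b + n_b;
    let total = a + n_a;
    MRowOffsets { consistency, public, d, b, a, total }
}
```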

Made-with: Cursor

* refactor: digit-major column reindexing for m_evals_x Kronecker structure

Reindex the committed witness polynomial and m_evals_x column vector so
that the power-of-2 block index is the fastest-varying (innermost)
dimension within each segment. This layout enables future O(m) and O(r)
MLE evaluation via Kronecker product factoring of the `a` and `b`
challenge contributions.

Changes:

- `build_w_coeffs`: emit ring elements in digit-major order using new
  helpers `emit_planes_block_inner` (transposes FlatDigitBlocks from
  block-major to digit-major) and `emit_z_pre_block_inner` (decomposes
  and transposes z_pre with multi-point support). Adaptive segment
  ordering places the segment with the larger block dimension first
  (z-hat first when m_vars >= r_vars, else e-hat/t-hat first).

- `compute_m_evals_x_with_claim_groups`: reindex w_segment, t_segment,
  and z_segment to match digit-major layout. D matrix access uses global
  block index (blk * depth_open + dig), B matrix access uses per-claim
  block index. Adaptive segment ordering matches build_w_coeffs.

- `compute_m_evals_x_with_opening_points_and_claim_groups`: same
  digit-major reindexing plus multi-point z_segment with
  z_total_blocks = num_points * block_len.

Docstring on build_w_coeffs notes the alternative of propagating
digit-major throughout FlatDigitBlocks.
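
To make the reindexing concrete, here is a minimal sketch of a block-major to digit-major transpose, assuming the block-major index is `blk * depth + dig` (as in the D-matrix access above) and the digit-major index keeps the block index innermost. The function names are illustrative and do not correspond to `emit_planes_block_inner`.

```rust
/// Digit-major position of (blk, dig): the block index varies fastest.
pub fn digit_major_index(blk: usize, dig: usize, num_blocks: usize) -> usize {
    dig * num_blocks + blk
}

/// Transpose a flat block-major buffer (index = blk * depth + dig)
/// into digit-major order (index = dig * num_blocks + blk).
pub fn block_major_to_digit_major<T: Clone>(src: &[T], num_blocks: usize, depth: usize) -> Vec<T> {
    assert_eq!(src.len(), num_blocks * depth);
    let mut out = Vec::with_capacity(src.len());
    for dig in 0..depth {
        for blk in 0..num_blocks {
            out.push(src[blk * depth + dig].clone());
        }
    }
    out
}
```

With the block index innermost and `num_blocks` a power of two, the low bits of the column index select the block, which is what lets a per-block factor be pulled out of the evaluation.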

Made-with: Cursor

* feat(verifier): direct MLE evaluation for m_evals_x via PreparedMEval

Replace the "materialize full m_evals_x vector then multilinear_eval"
flow with a deferred PreparedMEval struct that pre-computes only
challenge-derived scalars (c_alphas, eq_tau1) and evaluates the MLE
directly at the sumcheck challenge point by streaming through the
setup matrix.

This eliminates the dominant O(total_cols) allocation (~868K field
elements at nv=32) from the verifier path, replacing it with
O(total_blocks + rows) permanent storage (~2K + 16) and O(x_len)
transient eq-table during evaluation.

Key changes:
- PreparedMEval<F> struct with c_alphas, eq_tau1, and layout metadata
- prepare_m_eval() replaces compute_m_evals_x on the verifier path
- PreparedMEval::eval_at_point() streams matrix rows inline
- RingSwitchVerifyOutput now holds PreparedMEval instead of Vec<F>
- Stage2MEvalSource wraps PreparedMEval; HachiStage2Verifier borrows
  setup and opening_points at eval time (Option B)
- setup parameter removed from ring_switch_verifier functions
- Unit test confirms eval_at_point matches materialized multilinear_eval
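
The equivalence the unit test checks can be illustrated with a toy version: evaluating a multilinear extension by folding a materialized table versus streaming entries and weighting each by its eq factor. This sketch uses a small Mersenne-prime field (2^61 - 1) rather than the crate's field, computes eq weights on the fly instead of building an eq-table, and all names in it are hypothetical.

```rust
/// Toy modulus (2^61 - 1), NOT the field used by the crate.
const P: u64 = (1u64 << 61) - 1;

fn add(a: u64, b: u64) -> u64 {
    let s = a + b;
    if s >= P { s - P } else { s }
}

fn sub(a: u64, b: u64) -> u64 {
    if a >= b { a - b } else { a + P - b }
}

fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}

/// Materialized path: fold the full table one variable at a time,
/// binding the lowest index bit to `point[0]` first.
pub fn multilinear_eval(table: &[u64], point: &[u64]) -> u64 {
    assert_eq!(table.len(), 1usize << point.len());
    let mut cur = table.to_vec();
    for &r in point {
        let half = cur.len() / 2;
        cur = (0..half)
            .map(|i| add(cur[2 * i], mul(r, sub(cur[2 * i + 1], cur[2 * i]))))
            .collect();
    }
    cur[0]
}

/// Streaming path: never stores the table. `entry(i)` yields the i-th value
/// on demand; indices at or beyond `n` are treated as zero padding.
pub fn streamed_eval(n: usize, point: &[u64], entry: impl Fn(usize) -> u64) -> u64 {
    let mut acc = 0u64;
    for i in 0..n {
        // eq(point, bits(i)): bit j of i selects r_j or (1 - r_j).
        let mut w = 1u64;
        for (j, &r) in point.iter().enumerate() {
            w = mul(w, if (i >> j) & 1 == 1 { r } else { sub(1, r) });
        }
        acc = add(acc, mul(w, entry(i)));
    }
    acc
}
```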

Made-with: Cursor

* perf(verifier): parallelize PreparedMEval::eval_at_point with cfg_fold_reduce

Use cfg_fold_reduce! for w, t, z, and r_tail segment loops so Rayon
splits the work across threads.  Root-level m_eval drops from ~253 ms
(sequential) to ~47 ms; total verify from ~281 ms to ~60 ms.

Made-with: Cursor

* perf(verifier): use build-segments + multilinear_eval in eval_at_point

Replace fused cfg_fold_reduce loops with the build-then-eval pattern:
parallel-build each segment via cfg_into_iter, concatenate, and call
multilinear_eval.  This matches the old code's parallelism strategy
and achieves zero overhead vs the materialized path (43 ms verify at
nv32, same as the column-reindexing baseline).  Hoist self fields into
locals and mark eval_at_point #[inline] to help the compiler.

Made-with: Cursor

* perf(ring-switch): peel block axis in m-eval

Strip the power-of-two num_blocks axis out of the separable w and t terms
so batched verifier paths can keep using a succinct eq-weighted evaluation
even when the outer claim dimensions are ragged.

Also add the shared offset_eq helper for 2-adic peeled carry summaries and
clean up the remaining clippy issues in the sibling worktree.

Made-with: Cursor

* perf(ring-switch): inline matrix-backed m-eval

Replace the deferred verifier's matrix-backed D and B materialization
with direct offset-eq evaluation, and stop running a zero-padded full
multilinear_eval over the assembled M table.

This keeps the prepared m-eval path test-clean while reducing the
regression from the earlier peeled verifier branch and preserving the
batched D-column layout used by the real proof flow.

Made-with: Cursor

* fix(offset-eq): satisfy clippy assign-op lints

Use `*=` and `+=` in the offset-eq test helper so the
cleanup-sumchecks branch passes the Clippy CI check again.

Made-with: Cursor

* style(offset-eq): format multiline assign expression

Apply rustfmt's expected indentation in the offset-eq test helper
so the cleanup-sumchecks branch passes the format CI check.

Made-with: Cursor

* ci(test): bump RUST_MIN_STACK to 16 MiB for debug test runs

`cargo test --lib` on `layerzero/main` and branches off of it fails on
20-35% of runs with sporadic `thread '<unknown>' has overflowed its stack`
aborts originating on rayon worker threads. Repro and investigation:

- Tests pass deterministically under `cargo test --all -- --test-threads=1`,
  `RAYON_NUM_THREADS=1`, or `cargo test --all --release`; only debug
  parallel runs flake.
- The aborts fire in rayon workers (not test-runner threads), which
  default to a 2 MiB stack via `std::thread::Builder`. Heavy
  hachi_stage2 tests (`stage2_large_odd_*`) plus several parallel
  commitment/planner tests produce deep rayon-split call chains under
  debug-unoptimized frames and occasionally blow past 2 MiB.
- `commitment_scheme.rs` already acknowledges this for two
  `#[ignore]`-gated debug tests via `init_debug_rayon_pool` /
  `run_debug_on_large_stack` (64 MiB pool stack / 256 MiB test thread).

Cargo's `[env]` section propagates `RUST_MIN_STACK` to all binaries
cargo spawns (including `cargo test`), and `std::thread::Builder`
(which rayon uses internally) honors it for unset stack sizes. Setting
it to 16 MiB is enough headroom for the observed flake and still small
enough to be a drop in the bucket on modern systems.

Verified: 0 / 20 overflows on `cargo test --lib` and 0 / 5 on
`cargo test --all` with this config, versus 4 / 20 previously on the
same branch and 7 / 20 on `layerzero/main` at 7e79bde.
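
For comparison with the environment-variable approach, an explicit large-stack thread can be sketched as below, in the spirit of the `run_debug_on_large_stack` helper mentioned above. The helper `run_on_large_stack` and the `deep` recursion are illustrative assumptions, not the repository's code.

```rust
use std::thread;

/// Run `f` on a freshly spawned thread with an explicit stack size.
pub fn run_on_large_stack<T, F>(stack_bytes: usize, f: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(f)
        .expect("failed to spawn thread")
        .join()
        .expect("thread panicked")
}

/// Deliberately stack-hungry recursion: each frame keeps a 1 KiB buffer.
pub fn deep(n: usize) -> usize {
    let buf = [0u8; 1024];
    // Keep the buffer from being optimized away.
    std::hint::black_box(&buf);
    if n == 0 { 0 } else { 1 + deep(n - 1) }
}
```

An explicit `stack_size` only covers threads the caller spawns itself; `RUST_MIN_STACK` also reaches threads spawned by libraries that leave the size unset, which is why the commit uses it for rayon workers.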

Made-with: Cursor

* refactor(preprocessing): decouple setup sizing from layout derivation

Rework the setup/preprocessing layer so that setup sizing is computed
from conservative upper bounds on config parameters rather than a
layout chain. This fixes a bug where setup(max_num_vars) would fail
at commit time if the actual polynomial num_vars differed from
max_num_vars, and consolidates the setup types into a dedicated
module.

- New `src/protocol/preprocessing.rs` is the canonical home for
  `HachiSetupSeed`, `HachiExpandedSetup`, `HachiProverSetup`, and
  `HachiVerifierSetup` (plus their serialization impls).
- `HachiProverSetup::new()` owns setup expansion end-to-end.
- `HachiSetupSeed` is simplified to carry a single `max_stride`
  (max column width across all levels/roles) instead of separate
  inner/outer/D width fields.
- Add `max_ajtai_rank()` and `max_ajtai_width()` free functions that
  compute conservative row/column bounds from the config's static
  parameters, removing the need for `ensure_layout_fits` /
  `assert_layout_fits` API and their layout-chain probing.
- `src/protocol/commitment/commit.rs` shrinks substantially; setup
  structs and their impls move out.
- Add oversized-setup regression tests (setup with larger
  max_num_vars than the commit's actual num_vars).

Made-with: Cursor

* refactor(commitment): drop HachiCommitmentCore setup helpers and their dead support code

Remove the unused `setup_with_layout`/`setup_with_layouts` entry points and
their private helpers, along with the now-orphaned layout-scanning helpers
(`LayoutChainStats`, `scan_layout_chain`, `root_batched_layout`), the
`num_digits_fold_batched` field, the three tests that exercised them, and
the test-only config/ring-switch functions they relied on.

Made-with: Cursor

* refactor(commitment): hoist setup matrix sizing into CommitmentConfig

Introduce `CommitmentConfig::max_setup_matrix_size(max_num_vars,
max_num_batched_polys)` returning `(max_rows, max_stride)`, with a
default implementation that pins `max_rows` to `sis_security::MAX_RANK`
and derives the worst-case stride from the root `log_basis` search
range. `HachiProverSetup::new` now calls the trait method and just
multiplies to get `max_total`, so setup-sizing policy lives next to the
config abstraction instead of being duplicated in preprocessing.

Also tighten the two small test-double configs (`SmallTestCommitmentConfig`
and `BadDegreeConfig`) to `max_n_a = 4` so they match the new row
ceiling.

Made-with: Cursor

* refactor(commitment): remove SmallTestCommitmentConfig, retarget tests to fp128::D64Full

Drop the public `SmallTestCommitmentConfig` and migrate its dependent
tests to the existing `fp128::D64Full` preset. The end-to-end
prove/verify and batched roundtrip tests in `commitment_scheme.rs` and
the `commit_w_uses_active_level_row_count` regression test in
`ring_switch.rs` now run on the dense fp128 D=64 config, which is
already exercised elsewhere in the suite. The tiny shape-only sanity
test in `tests/ring_commitment_core.rs` is dropped along with the type.

Made-with: Cursor

* test(setup): add preset-capacity E2E tests and harden onehot shape checks

Add `tests/setup.rs` with a 5-scenario E2E suite per fp128 preset
(`D128Full`, `D64Full`, `D64OneHot`, `D32Full`, `D32OneHot`) covering
same-size, undersized, and oversized setup relative to the polynomial
and batch sizes used by commit/prove/verify. The undersized-nv case
pins an explicit `commit received a polynomial with ... variables but
setup supports at most ...` message, and the undersized-batch case
likewise pins the existing polynomial-count guard.

Surface those messages in `HachiCommitmentScheme::{commit, prove,
batched_prove, verify, batched_verify}` by adding `num_vars >
max_num_vars` guards that mirror the existing
`max_num_batched_polys` guard and return `HachiError::InvalidInput`
with a clear, actionable string (callers misusing the API, not
adversarial proofs).

Finally, harden `OneHotPoly`'s `HachiPolyOps` impl against a latent
shape-mismatch foot-gun exposed by the tests: onehot polys bake
their `(r_vars, m_vars)` block split in at construction time,
whereas dense polys reblock on every call. When a user built an
onehot poly with `Cfg::commitment_layout(nv)` and then tried to
commit it under a `max_num_batched_polys > 1` setup (where the
runtime uses `hachi_batched_root_layout(nv, batch)`), the prover
would panic deep in `parallel_high_half_accumulate` with an
`index out of bounds` from the sparse-ring accumulator. Add early
block-size checks (assertions on non-Result entry points,
`InvalidInput` on the Result-returning `commit_inner` /
`commit_inner_witness_batched`) so that misuse now surfaces as a
clear, actionable error pointing users at
`hachi_batched_root_layout`.

Made-with: Cursor

* refactor(commitment): size setup matrix from the planned schedule for adaptive configs

Remove the default body of `CommitmentConfig::max_setup_matrix_size`; each
config now supplies its own. `GeneratedAdaptivePolicy` walks the planned
schedule (cached plan or on-the-fly `find_optimal_batched_schedule`) with
the batch-effective root commitment layout as a seed, yielding tight
`(max_rows, max_stride)` bounds. Static and test configs use an inlined
loose upper bound: `MAX_RANK` rows and
`2^(max_num_vars - log2(D)) * 128 * MAX_RANK * max_num_batched_polys`
stride.

Made-with: Cursor

* refactor(commitment): fold batch scaling into `fallback_batched_root_split`

Teach `fallback_batched_root_split` to take `num_claims` and apply
`scale_batched_root_layout` internally; existing callers in
`optimal_root_batch_split` pass `1` to preserve the per-poly result.

Adaptive `max_setup_matrix_size` now seeds with the scaled fallback
layout in one step (instead of the separate
`hachi_batched_root_layout` + `scale_batched_root_layout` pair) and
uses a `(P, P, 1)` batch for the planner fallback. The seed is applied
unconditionally because commit's runtime `(m, r, log_basis)` may not
match the schedule plan's choice.

Made-with: Cursor

* refactor(commitment): move simple `max_setup_matrix_size` into the trait default

Collapse the near-identical `MAX_RANK`-rows / `2^outer_vars * 128 * MAX_RANK * P`-stride
formula that was inlined in `StaticBoundedPolicy`, `TinyConfig`, `BadDegreeConfig`, and
`WideEnvelopeD64Full` into the trait's default body. `GeneratedAdaptivePolicy` keeps its
schedule-plan-derived override, and `WCommitmentConfig` keeps the explicit delegation to
its inner `Cfg`.

Made-with: Cursor

* refactor(setup): drop always-erroring `HachiSerialize` impl on `HachiProverSetup`

`HachiProverSetup` holds runtime NTT caches and cannot be serialized; the
old impl just returned `SerializationError` at runtime. Nothing requires
`HachiProverSetup: HachiSerialize` (the `CommitmentScheme::ProverSetup`
bound is only `Clone + Send + Sync`), so the impl is safe to remove.
Callers who want to persist setup should serialize the inner
`HachiExpandedSetup` and rebuild caches via `setup_from_expanded`.

Also drop an outdated comment in the adaptive `max_setup_matrix_size`.

Made-with: Cursor

* refactor(setup): extract `HachiProverSetup::from_expanded` to de-duplicate NTT wrapping

The "wrap an expanded setup in Arc + rebuild NTT cache + return a prover
setup" block was inlined inside `HachiProverSetup::new` (disk load hit
path) and again in the free `setup_from_expanded` (disk-persistence
tests). Extract it to a single `HachiProverSetup::from_expanded`
associated function and call it from both sites. Gate the method with
the same `disk-persistence` feature that guards its only callers.

Made-with: Cursor

* refactor(setup): drop free `setup_from_expanded`; call `HachiProverSetup::from_expanded` directly

The free function was only used by one disk-persistence test that
ignored the verifier half of its tuple return. Replace its single call
site with a direct `HachiProverSetup::from_expanded` invocation and
delete the wrapper.

Made-with: Cursor

* refactor(ring-switch): drop unused `WCommitmentConfig::max_setup_matrix_size` override

`max_setup_matrix_size` is only invoked from `HachiProverSetup::new`,
which always uses the outer `Cfg`, never `WCommitmentConfig<_, Cfg>`.
The override was dead code. Drop it and let the trait default apply if
anything ever does call it through the wrapper.

Made-with: Cursor

* fix(setup): unblock `--all-features` CI

Three fixes for the `cargo clippy --all-targets --all-features` and
release-test run on the PR:

- Expose `get_storage_path` and `load_expanded_setup` as `pub(crate)` so
  the disk-persistence test module in `commit.rs` can name them. They
  had been module-private free functions.
- Import the two helpers explicitly at the top of the disk-persistence
  test submodule (`use crate::protocol::setup::{get_storage_path,
  load_expanded_setup}`).
- Give `TinyConfig` an explicit `max_setup_matrix_size` override that
  sizes off its (fixed) `commitment_layout` instead of the trait
  default. The default body raises `2^(max_num_vars - log2(D))` and
  overflows `usize` at the `MAX_VARS = 100..=102` used by the
  disk-persistence tests; the override returns tight widths that match
  the config's actual runtime use.
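
The overflow is easy to see with a checked version of the loose bound `2^(max_num_vars - log2(D)) * 128 * MAX_RANK * max_num_batched_polys`. The sketch below is an assumption-laden illustration: `loose_max_stride` is a hypothetical helper, `max_rank` is a parameter because the value of `MAX_RANK` is not given here, and clamping the exponent at zero is my choice.

```rust
/// Loose stride bound with overflow detection. Returns `None` when the
/// product does not fit in `usize`.
pub fn loose_max_stride(
    max_num_vars: u32,
    ring_degree: usize,
    max_rank: usize,
    max_num_batched_polys: usize,
) -> Option<usize> {
    assert!(ring_degree.is_power_of_two());
    let log_d = ring_degree.trailing_zeros();
    // Assumption: clamp at zero when max_num_vars < log2(D).
    let outer_vars = max_num_vars.saturating_sub(log_d);
    1usize
        .checked_shl(outer_vars)? // None once the shift reaches the bit width
        .checked_mul(128)?
        .checked_mul(max_rank)?
        .checked_mul(max_num_batched_polys)
}
```

With `max_num_vars` around 100 the shift alone exceeds the width of `usize`, so a config used at that size needs its own bound, as `TinyConfig` now provides.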

Made-with: Cursor

* fix(setup): restore non-zero invariants in `HachiSetupSeed::check`

After the refactor that collapsed the per-role width fields into a
single `max_stride`, `HachiSetupSeed::check` was left as an
unconditional `Ok(())`. A corrupt on-disk seed with `max_stride = 0`
would pass validation and then be used as the stride for every matrix
view at runtime. Re-add checks that `max_stride` is non-zero and that
`max_num_batched_polys >= 1`, matching the live construction-time
invariant in `Cfg::max_setup_matrix_size`.
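
A minimal sketch of the restored validation, assuming a dedicated error enum. `SeedError` is hypothetical, and the real `HachiSetupSeed` carries more fields and returns the crate's own error type.

```rust
/// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq, Eq)]
pub enum SeedError {
    ZeroStride,
    ZeroBatch,
}

/// Reduced stand-in for the setup seed: only the two validated fields.
pub struct HachiSetupSeed {
    pub max_stride: usize,
    pub max_num_batched_polys: usize,
}

impl HachiSetupSeed {
    /// Reject seeds that would yield a zero stride or an empty batch.
    pub fn check(&self) -> Result<(), SeedError> {
        if self.max_stride == 0 {
            return Err(SeedError::ZeroStride);
        }
        if self.max_num_batched_polys < 1 {
            return Err(SeedError::ZeroBatch);
        }
        Ok(())
    }
}
```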

Made-with: Cursor

* feat(setup): thread `max_num_points` through setup sizing

`GeneratedAdaptivePolicy::max_setup_matrix_size` previously hard-coded
`num_points = 1` in its `HachiRootBatchSummary`, while `batched_prove`/
`batched_verify` plan the recursive suffix with
`num_points = opening_points.len()`. Because `num_points` feeds both
`z_pre_count` and the `r_rows` contribution inside
`w_ring_element_count_with_counts`, multi-point batches could widen
recursive-level matrices past the computed `max_stride`; the fallback's
scaled root layout only bounds the root widths, not the recursive
suffix.

Add `max_num_points` as an explicit parameter to
`CommitmentConfig::max_setup_matrix_size`, `HachiProverSetup::new`, and
`CommitmentScheme::setup_prover`, and propagate it into the adaptive
policy's schedule lookup / `find_optimal_batched_schedule` fallback.
Single-point callers pass `1` (the dominant runtime shape); multi-point
batches pass an upper bound on `opening_points.len()`. The adaptive
impl validates `1 <= max_num_points <= max_num_batched_polys`.

Made-with: Cursor

* fix(tests): size blessed batched onehot setup for actual point count

Pass `group_sizes_by_point.len()` as `max_num_points` to `setup_prover`
in `assert_blessed_batched_onehot_exact` so the schedule planner sizes
the D-matrix for the real number of opening points instead of hardcoding
`1`, which was only coincidentally large enough for the current configs.

Made-with: Cursor

* fix(setup): store and enforce `max_num_points` on setup seed

Setup sizes the shared matrix stride using `max_num_points`, but the
value was never stored on `HachiSetupSeed`, leaving no runtime guard
against a batched opening with more distinct points than the setup was
sized for. Persist `max_num_points` on the seed (including through
serialization) and reject `batched_prove`/`batched_verify` calls whose
`opening_points.len()` exceeds it, mirroring the existing
`max_num_batched_polys` check.

Made-with: Cursor

* fix(setup): include `max_num_points` in disk cache key and load check

The disk-persistence cache keyed setup files only by `max_num_vars` and
`max_num_batched_polys`, but `max_num_points` affects `max_stride` via
`Cfg::max_setup_matrix_size`. Two setups with different `max_num_points`
could share a cache file, and the load-side check only verified
`total_ring_elements >= max_total` without comparing `seed.max_stride`.
For configs where `max_rows` varies inversely with `max_stride`, a
cached setup could pass the totals check while carrying the wrong
stride, causing `ring_view` to use an incorrect row layout.

Thread `max_num_points` through `cache_file_name`, `get_storage_path`,
`save_expanded_setup`, and `load_expanded_setup`, and additionally
require the cached `seed.max_stride` and `seed.max_num_points` to meet
the current request before accepting a cache hit.
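
Roughly, the widened cache key and load-side check could look like the sketch below. The file-name format, the `SeedMeta` struct, and `accept_cached` are hypothetical, and reading "meet the current request" as "at least as large" is my interpretation.

```rust
/// Reduced view of the cached seed metadata used by the load check.
pub struct SeedMeta {
    pub max_stride: usize,
    pub max_num_points: usize,
    pub total_ring_elements: usize,
}

/// Cache key covering every input that affects setup sizing.
/// The format string is invented for illustration.
pub fn cache_file_name(
    max_num_vars: usize,
    max_num_batched_polys: usize,
    max_num_points: usize,
) -> String {
    format!(
        "setup_nv{}_polys{}_points{}.bin",
        max_num_vars, max_num_batched_polys, max_num_points
    )
}

/// Accept a cache hit only if totals, stride, and point count all
/// cover the current request.
pub fn accept_cached(
    cached: &SeedMeta,
    want_total: usize,
    want_stride: usize,
    want_points: usize,
) -> bool {
    cached.total_ring_elements >= want_total
        && cached.max_stride >= want_stride
        && cached.max_num_points >= want_points
}
```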

Made-with: Cursor

* perf(commit): parallelize per-poly inner witness in batched commit

Replace the `commit_inner_witness_batched` dispatch with a direct
`cfg_iter!` parallel map over the input polynomials in the batched
commit path, calling the per-poly `commit_inner_witness` on each.

Benchmarked via `examples/profile` with `HACHI_MODE=onehot_d32
HACHI_NUM_VARS=32 HACHI_NUM_POLYS=4` (20 interleaved and order-reversed
runs). Mean commit time drops from ~707 ms to ~647 ms (~8% faster,
Welch's t = 6.0), with all 325 lib tests and integration tests passing.
The previously fused one-hot helper is no longer on the hot path but is
retained as a trait method for now.

Made-with: Cursor

* refactor: remove unused commit_inner_witness_batched

Now that the batched-commit call site uses a plain `cfg_iter!` parallel
map over `commit_inner_witness`, the fused `commit_inner_witness_batched`
trait method and its impls are dead code. Drop:

- the trait default in `HachiPolyOps`
- the `&P` forwarding impl
- the `MultilinearPolynomial` dispatcher
- the `OneHotPoly` fused implementation

Made-with: Cursor

* fix(clippy): gate parallel import on feature flag

`cargo clippy --no-default-features -- -D warnings` flagged
`use crate::parallel::*` in commitment_scheme.rs as unused when the
`parallel` feature is off (since the module re-exports nothing in that
config). Match the repo-wide pattern by guarding the import with
`#[cfg(feature = "parallel")]`.

Made-with: Cursor

Collapse the singleton prove/verify API onto the batched code path and
drop the schedule planner's process-global DP cache in favor of the
offline schedule tables.

Made-with: Cursor
@socket-security


Review the following changes in direct dependencies. Learn more about Socket for GitHub.

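| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
|---|---|---|---|---|---|---|
| Added | aes@0.8.4 | 100 | 100 | 93 | 100 | 100 |
| Added | blake2@0.10.6 | 100 | 100 | 93 | 100 | 100 |
| Added | ctr@0.9.2 | 100 | 100 | 93 | 100 | 100 |
| Added | num-bigint@0.4.6 | 100 | 100 | 93 | 100 | 100 |
| Added | sha3@0.10.8 | 100 | 100 | 93 | 100 | 100 |
| Added | tracing-chrome@0.7.2 | 100 | 100 | 93 | 100 | 100 |
| Added | tracing-subscriber@0.3.22 | 99 | 100 | 93 | 100 | 100 |
| Updated | tracing@0.1.41 ⏵ 0.1.44 | 100 | 100 | 93 | 100 | 100 |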

View full report

@RadNi RadNi closed this Apr 22, 2026