Unify singleton and batch proving by RadNi · Pull Request #10 · a16z/hachi

RadNi · 2026-04-22T18:48:17Z

Summary

Moves every prover/verifier entry point onto the batched code path and drops the parallel singleton plumbing. A singleton opening is now the batched call with WitnessShape::singleton(), going through the same functions as any other batch.

Unified prove/verify

The four root-level prover variants collapse into one; the verifier and recursive-suffix entry points are unified the same way:

| Before | After |
| --- | --- |
| `prove_one_level`, `prove_batched_root_level[_with_points]`, `prove_multipoint_batched_root_level`, `prove_same_point_batched` | `prove_root_level` |
| `verify_batched_root_level`, `verify_multipoint_batched_root_level`, `verify_same_point_batched` | `verify_root_level` |
| `prove_batched_recursive_suffix` | `prove_recursive_suffix` |

`QuadraticEquation::{new_batched_prover, new_multipoint_batched_prover}` are removed — only `new_prover` remains.
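As a rough illustration of what "one entry point" means for callers, the sketch below routes a singleton opening and a two-claim batch through the same function. Only the names `WitnessShape`, `singleton()`, and `prove_root_level` come from this description; the signature, the `Claim` type, the all-ones singleton shape, and the body (which only validates the batch length) are assumptions made for the example, not the crate's API.

```rust
/// Batch descriptor. The three fields are the ones listed in this PR;
/// the all-ones `singleton()` is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct WitnessShape {
    num_claims: usize,
    num_commitment_groups: usize,
    num_points: usize,
}

impl WitnessShape {
    fn singleton() -> Self {
        Self { num_claims: 1, num_commitment_groups: 1, num_points: 1 }
    }
}

/// Hypothetical placeholder for whatever an opening claim carries.
#[derive(Debug, Clone, PartialEq)]
struct Claim {
    point: Vec<u64>,
    value: u64,
}

/// Hypothetical stand-in for the unified entry point. It does no proving:
/// it checks the batch against its shape and returns the claim count.
fn prove_root_level(shape: WitnessShape, claims: &[Claim]) -> Result<usize, String> {
    if claims.len() != shape.num_claims {
        return Err(format!(
            "shape expects {} claims, got {}",
            shape.num_claims,
            claims.len()
        ));
    }
    Ok(claims.len())
}

fn main() {
    let claim = Claim { point: vec![1, 0, 1], value: 7 };

    // Singleton opening: same function, one-element slice.
    let single = prove_root_level(WitnessShape::singleton(), std::slice::from_ref(&claim));
    println!("singleton: {:?}", single);

    // Two claims at two points under one commitment group.
    let shape = WitnessShape { num_claims: 2, num_commitment_groups: 1, num_points: 2 };
    let batch = prove_root_level(shape, &[claim.clone(), claim]);
    println!("batch: {:?}", batch);
}
```

The point of the pattern is that the singleton case needs no wrapper of its own: it differs from a batch only in the shape value passed in.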

Proof shape

HachiBatchedRootProof and HachiBatchedProofShape become enums with Fold { .. } and Direct { .. } variants (new HachiBatchedFoldRoot carries the fold-rooted payload). This separates the case where the root is a direct witness handoff (very small num_vars) from the case where the root is a fold, so each is handled by its own variant.
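A minimal sketch of the two-variant layout follows. The type and variant names are taken from this description, but every field (`fold_levels`, `root`, `witness`) and both helper functions are hypothetical, chosen only to make the example compile.

```rust
/// Hypothetical payload for the fold-rooted case; the real fields are not
/// listed in this PR description.
#[derive(Debug, Clone, PartialEq)]
struct HachiBatchedFoldRoot {
    fold_levels: usize,
}

/// Root proof as a two-variant enum: either a fold, or a direct witness
/// handoff for very small `num_vars`. Field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
enum HachiBatchedRootProof {
    Fold { root: HachiBatchedFoldRoot },
    Direct { witness: Vec<u64> },
}

impl HachiBatchedRootProof {
    /// True when the root skips folding and hands the witness over directly.
    fn is_direct(&self) -> bool {
        matches!(self, Self::Direct { .. })
    }
}

/// Exhaustive match: the compiler rejects this function if a variant is
/// left unhandled.
fn describe(proof: &HachiBatchedRootProof) -> String {
    match proof {
        HachiBatchedRootProof::Fold { root } => {
            format!("fold root with {} level(s)", root.fold_levels)
        }
        HachiBatchedRootProof::Direct { witness } => {
            format!("direct handoff of {} element(s)", witness.len())
        }
    }
}

fn main() {
    let fold = HachiBatchedRootProof::Fold { root: HachiBatchedFoldRoot { fold_levels: 3 } };
    let direct = HachiBatchedRootProof::Direct { witness: vec![1, 2] };
    println!("{} / {}", describe(&fold), describe(&direct));
    println!("direct? {} {}", fold.is_direct(), direct.is_direct());
}
```

Modelling the two cases as enum variants means a verifier that matches on the proof has to handle both explicitly, instead of inferring the direct case from empty or optional fields.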

Schedule planner

WitnessShape = { num_claims, num_commitment_groups, num_points } is the single batch descriptor; the old BatchConfig alias is gone.
The planner's process-global DP cache is removed. find_optimal_schedule now consults the offline schedule tables (Cfg::schedule_plan) first — every (Cfg, num_vars, WitnessShape) case that ships with the crate is a keyed row — and only falls back to the DP for shapes without an entry.
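The lookup order described above (offline table first, DP only for shapes without a row) can be sketched as follows. `WitnessShape` and `find_optimal_schedule` are named in this PR; `ScheduleTable`, `ScheduleSource`, `example_table`, the `Vec<usize>` schedule representation, and the chunk-of-four `dp_fallback` are all hypothetical stand-ins, and the `Cfg` part of the key is omitted for brevity.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct WitnessShape {
    num_claims: usize,
    num_commitment_groups: usize,
    num_points: usize,
}

/// Hypothetical schedule representation: variables consumed per level.
type Schedule = Vec<usize>;

#[derive(Debug, PartialEq, Eq)]
enum ScheduleSource {
    OfflineTable,
    DpFallback,
}

/// Hypothetical offline table keyed by (num_vars, shape).
struct ScheduleTable {
    rows: HashMap<(usize, WitnessShape), Schedule>,
}

/// Stand-in for the real DP: greedily split into chunks of at most 4.
fn dp_fallback(num_vars: usize) -> Schedule {
    let mut remaining = num_vars;
    let mut plan = Vec::new();
    while remaining > 0 {
        let step = remaining.min(4);
        plan.push(step);
        remaining -= step;
    }
    plan
}

/// Consult the table first; compute a plan only when no row exists.
fn find_optimal_schedule(
    table: &ScheduleTable,
    num_vars: usize,
    shape: WitnessShape,
) -> (Schedule, ScheduleSource) {
    match table.rows.get(&(num_vars, shape)) {
        Some(plan) => (plan.clone(), ScheduleSource::OfflineTable),
        None => (dp_fallback(num_vars), ScheduleSource::DpFallback),
    }
}

/// One shipped row, for the singleton shape at num_vars = 10 (made-up data).
fn example_table() -> ScheduleTable {
    let singleton = WitnessShape { num_claims: 1, num_commitment_groups: 1, num_points: 1 };
    let mut rows = HashMap::new();
    rows.insert((10, singleton), vec![5, 5]);
    ScheduleTable { rows }
}

fn main() {
    let table = example_table();
    let singleton = WitnessShape { num_claims: 1, num_commitment_groups: 1, num_points: 1 };
    let batch = WitnessShape { num_claims: 3, num_commitment_groups: 2, num_points: 1 };
    println!("{:?}", find_optimal_schedule(&table, 10, singleton));
    println!("{:?}", find_optimal_schedule(&table, 10, batch));
}
```

Because the table is read-only data and the fallback is computed per call, nothing here needs the process-global mutable cache that the PR removes.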

* chore: add toolchain and formatting config Pin Rust 1.88 with minimal profile (cargo, rustc, clippy, rustfmt). Co-authored-by: Cursor <cursoragent@cursor.com> * chore(ci): switch to actions-rust-lang/setup-rust-toolchain Respects rust-toolchain.toml automatically. Also normalize clippy flags to use --all --all-targets consistently. Co-authored-by: Cursor <cursoragent@cursor.com> * feat(primitives): add u128/i128 serialization support Required by the Fp128 field backend. Co-authored-by: Cursor <cursoragent@cursor.com> * feat(algebra): add prime fields, extensions, and modules Introduces the algebra module with: - Fp32/Fp64/Fp128 prime field backends with branchless constant-time add/sub/neg and rejection-sampled random - U256 helper for Fp128 wide multiplication - Fp2/Fp4 tower extensions with Karatsuba-ready structure - VectorModule<F, N> fixed-length vector module over any field - Poly<F, D> fixed-size polynomial container Co-authored-by: Cursor <cursoragent@cursor.com> * feat(algebra): add NTT small-prime arithmetic and CRT helpers Adds the ntt submodule with: - NttPrime: per-prime Montgomery-like fpmul, Barrett-like fpred, branchless csubq/caddq/center - LimbQ/QData: radix-2^14 limb arithmetic for big-q coefficients - logq=32 parameter preset (six NTT-friendly primes, CRT constants) Co-authored-by: Cursor <cursoragent@cursor.com> * test(algebra): add comprehensive algebra test suite 24 tests covering: - Field arithmetic, identities, and distributivity (Fp32/Fp64/Fp128) - Zero inversion returns None - Serialization round-trips (all field types, extensions, VectorModule) - Fp2 conjugate, norm, and distributivity - U256 wide multiply and bit access - LimbQ round-trip and add/sub inverse - QData consistency with preset constants - NTT normalize range and fpmul commutativity - Poly add/sub/neg Co-authored-by: Cursor <cursoragent@cursor.com> * docs: add and update progress tracking document Records Phase 0 status: all field types, extensions, NTT scaffolding, 
constant-time arithmetic, and 24-test suite. Reflects the fields/ntt/module/poly directory layout. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(ntt): Rust-ify NTT/CRT port from C Overhaul the NTT small-prime arithmetic and CRT modules: - Add MontCoeff newtype (#[repr(transparent)] i16 wrapper) to enforce Montgomery-domain vs canonical-domain separation at the type level - NttPrime methods now take/return MontCoeff instead of bare i16: fpmul→mul, fpred→reduce, csubq→csubp, caddq→caddp - Add domain conversion: from_canonical (i16→Mont), to_canonical (Mont→i16) - Delete free functions (pointwise_mul etc), replaced by methods on NttPrime - LimbQ: replace add_limbs/sub_limbs/less_than with std Add/Sub/Ord impls - LimbQ: replace from_u128/to_u128 with From<u128>/TryFrom for u128 - LimbQ: add Display impl, branchless csub_mod - Rename all LABRADOR* constants to project-native Q32_* names - Add #[cfg(test)] verification that re-derives pinv/v/mont/montsq from p - Add MontCoeff round-trip and LimbQ ordering tests (28 total) Co-authored-by: Cursor <cursoragent@cursor.com> * chore: remove section banners, update progress doc Remove // ---- Section ---- banner comments from prime.rs and crt.rs. 
Add non-negotiable rules to HACHI_PROGRESS.md: - No section-banner comments - No commit/push without explicit user approval Co-authored-by: Cursor <cursoragent@cursor.com> * feat(ring): add CyclotomicRing, CyclotomicNtt, and NTT butterfly Milestone 1 - CyclotomicRing<F, D> (coefficient form): - Schoolbook negacyclic convolution Mul (X^D = -1) - Add/Sub/Neg/AddAssign/SubAssign/MulAssign, scale, zero/one/x - HachiSerialize/HachiDeserialize Milestone 2 - NTT butterfly + CyclotomicNtt<K, D>: - Merged negacyclic Cooley-Tukey forward NTT (twist folded into twiddles) - Gentleman-Sande inverse NTT with D^{-1} scaling - Runtime primitive-root finder and twiddle table computation (TODO: migrate to compile-time const tables) - CyclotomicNtt with per-prime pointwise Add/Sub/Neg/Mul - Ring<->Ntt transforms with CRT reconstruction Co-authored-by: Cursor <cursoragent@cursor.com> * test(algebra): add ring and NTT tests, wrap in mod tests Add 12 new tests: - CyclotomicRing: negacyclic X^D=-1, mul identity/zero, commutativity, distributivity, associativity, additive inverse, serde, degree-64 - NTT: forward/inverse round-trip (single prime + all primes), NTT mul matches schoolbook cross-check Wrap all integration tests in a single mod tests block and remove section-banner comments. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(algebra): harden ring-NTT conversion and field decoding Constrain ring/NTT conversions to explicit field backends and replace fragile CRT reconstruction with deterministic modular lifting. Enforce canonical deserialization checks in validated field decoding paths to reject malformed encodings. Co-authored-by: Cursor <cursoragent@cursor.com> * test(algebra): add CRT round-trip and serialization guard coverage Add end-to-end ring->NTT->ring CRT round-trip tests plus reduced-ops stability checks. Expand serialization coverage for Fp4/Poly and verify checked deserialization rejects non-canonical field encodings. 
Co-authored-by: Cursor <cursoragent@cursor.com> * chore(bench): add ring_ntt benchmark target and CT tracking docs Add a dedicated ring/NTT benchmark harness and register it in Cargo metadata. Record current constant-time review status and sync the implementation progress board with new milestones and test coverage. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(field): split core, canonical, and sampling capabilities Break the monolithic Field trait into FieldCore, CanonicalField, and FieldSampling, and update algebra primitives to depend on explicit capabilities for cleaner semantics and future backend integration. Co-authored-by: Cursor <cursoragent@cursor.com> * feat(fields): add pow2-offset pseudo-mersenne registry and checks Introduce the curated 2^k-offset prime registry and typed field aliases, then add dedicated Miller-Rabin regression tests to enforce probable primality for all enabled profiles. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(ring): introduce crt-ntt backend/domain layering Rename the ring NTT representation to explicit CRT+NTT semantics and route conversions through backend traits, adding scalar backend and domain aliases for a cleaner representation-vs-execution boundary. Co-authored-by: Cursor <cursoragent@cursor.com> * test(algebra): cover backend parity and pow2-offset invariants Expand algebra tests to validate default-vs-backend CRT+NTT equivalence, sampling bounds, and pow2-offset registry consistency under the new field and ring abstractions. Co-authored-by: Cursor <cursoragent@cursor.com> * docs(algebra): update progress notes and add prime analysis references Refresh progress and constant-time notes to match the new CRT+NTT naming and field scope, and add the NTT prime analysis document plus local NIST standards artifacts used for parameter rationale. 
Co-authored-by: Cursor <cursoragent@cursor.com> * fix(algebra): harden fp128 reduction and CRT reconstruction arithmetic Make Fp128 reduction and CRT inner accumulation paths more timing-stable with branchless modular operations, and refresh ring/docs/tests status after the hardening cleanup pass. Co-authored-by: Cursor <cursoragent@cursor.com> * feat(protocol): add transcript and commitment scaffold Introduce Hachi protocol-layer interfaces and placeholder types with Blake2b/Keccak transcript backends plus phase-aligned labels, while making transcript absorption label-directed at call sites. Co-authored-by: Cursor <cursoragent@cursor.com> * test(protocol): add transcript and commitment contract coverage Add deterministic transcript schedule checks (including keccak) and protocol commitment contract tests so transcript ordering and challenge derivation behavior are locked down. Co-authored-by: Cursor <cursoragent@cursor.com> * docs(protocol): align transcript spec and progress status Document the protocol scaffold as in-progress, capture the commitment-focused transcript label vocabulary, and clarify deferred Jolt adapter expectations. Co-authored-by: Cursor <cursoragent@cursor.com> * feat(protocol): add ring commitment core and seeded matrix derivation Implement the ring-native commitment setup/commit core with config validation, utility modules, and seeded domain-separated public matrix derivation, while wiring prover/verifier stub modules for the next open-check phase. Co-authored-by: Cursor <cursoragent@cursor.com> * test(protocol): consolidate ring commitment and stub contract coverage Unify ring commitment core and config validation checks in one test file and add explicit prover/verifier stub contract tests to lock current placeholder behavior before open-check implementation. 
Co-authored-by: Cursor <cursoragent@cursor.com> * docs(progress): update phase 2 status after commitment core landing Record that ring-native §4.1 commitment setup/commit and protocol wiring are in place, and clarify that open-check prove/verify remains the next unfinished protocol milestone. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(algebra): harden CT inversion path and CRT final projection Add a constant-time inversion helper for prime fields and replace scalar CRT's final `% q` projection with a division-free fixed-iteration reducer, so secret-bearing arithmetic paths avoid variable-latency behavior. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(algebra): rename inversion helper API without ct suffix Rename the secret-path inversion helper to `Invertible::inv_or_zero` while preserving constant-time semantics via doc contracts, and update CT tracking docs to match the new API names. Co-authored-by: Cursor <cursoragent@cursor.com> * test(algebra): clean inversion test naming and normalize formatting Rename the inversion helper test to match the new API naming and keep the ring commitment test formatting consistent after linting. Co-authored-by: Cursor <cursoragent@cursor.com> * feat(protocol): add sumcheck core module and tests Introduce core sumcheck building blocks (univariate messages, compression, and transcript-driving prover/verifier driver) and add unit/integration tests. Update progress doc to reflect sumcheck core landing. 
Co-authored-by: Cursor <cursoragent@cursor.com> * Add reference PDF papers * Add local agent instruction files * Add Hachi and SuperNOVA digest docs * Add general field, ring, and multilinear utilities * Add sparse Fiat-Shamir challenge sampling * Implement Polynomail Evaluation as Quadradic Equation * Rename stub to prover and verifier * Refactor code organization * Replace decopose with balanced decompose * Transform polynomial over Fq to ring * Refactor function names * Impl commitment_scheme API * Add SolinasFp128 backend for sparse 128-bit primes Introduce `SolinasFp128` with two-fold Solinas reduction for `p = 2^128 - c` (sparse `c`), plus `U256::sqr_u128`. Export descriptive prime aliases, add BigUint-backed correctness tests, and include a Criterion bench for mul/inv. Co-authored-by: Cursor <cursoragent@cursor.com> * Tighten docs and minor clippy cleanups Add missing rustdoc Errors/Panics sections and apply small simplifications suggested by clippy. Co-authored-by: Cursor <cursoragent@cursor.com> * Add reduction steps to iteration prover * Optimize Solinas mul/add/sub: fused u64-limb schoolbook + csel canonicalize Rewrite mul_raw as a fused 2×2 schoolbook multiply with two-fold Solinas reduction using explicit u64 limbs and mac helper, bypassing U256. Replace mask-based canonicalize with carry-flag-based pattern that compiles to adds+adcs+csel+csel (4 insns) instead of 10 on AArch64. Add pure-mul, sqr, and throughput microbenchmarks. Made-with: Cursor * Switch SolinasFp128 repr from u128 to [u64; 2] for 8-byte alignment Storage is now [u64; 2] (lo, hi) which halves alignment from 16 to 8 bytes, improving struct packing. Arithmetic hot paths convert to u128 for LLVM-optimal codegen (adds/adcs pairs), so no perf regression. Made-with: Cursor * Fuse overflow correction with canonicalize in fold2_canonicalize When fold-2 overflows, the wrapped value s < C², so s + C < C(C+1) < P — meaning s + C is already canonical. 
This lets us replace the separate overflow-correction + canonicalize (3 + 4 insns) with a single fused `if (overflow | carry) { s + C } else { s }` select, saving 2 instructions on the critical path. Add compile-time assertion enforcing C(C+1) < P. Made-with: Cursor * Unify Fp128 with Solinas-optimized arithmetic, delete SolinasFp128 Replace the generic Fp128<const MODULUS: u128> (binary-long-division via U256) with the Solinas-optimized implementation. Fp128<const P: u128> now uses [u64; 2] storage, fused schoolbook 2x2 + two-fold Solinas reduction (~23 cycles/mul on AArch64/x86-64), and compile-time validation that P = 2^128 - C with C < 2^64. Delete SolinasFp128, SolinasParams, solinas128.rs, and u256.rs. All call sites updated; prime type aliases (Prime128M13M4P0 etc.) are now simple Fp128<...> aliases in fp128.rs. Blanket PseudoMersenneField impl for all Fp128<P>. Made-with: Cursor * Use git deps for ark-bn254/ark-ff instead of local paths Switch from local path dependencies to the a16z/arkworks-algebra git repo (branch dev/twist-shout) so collaborators can compile without needing a local checkout of arkworks-algebra-jolt. Made-with: Cursor * Add template for sumchecks * Optimize Fp128 mul path and expand Rust field benchmarks. Refine Fp128 multiply/fold carry handling for better generated code and add isolated, passthrough, independent, and long-chain Rust microbenches to separate latency and throughput effects when comparing against BN254. Made-with: Cursor * Add 2^a±1 Fp128 reduction specialization and benches. Detect C = 2^a ± 1 at compile time and route fold multiplications through a specialized shift-based path with generic fallback, plus add benchmark coverage for sparse 128-bit primes using this shape. Made-with: Cursor * Add packed Fp128 field backend scaffolding and focused benchmarks. 
This introduces AArch64-first packed field abstractions with a scalar fallback and adds dedicated field-only validation/benchmark coverage before any ring or protocol integration. Made-with: Cursor * Refactor packed Fp128 backend to true SoA layout and stabilize benchmarking. This switches packed lane storage to SoA with NEON add/sub kernels and a SoA mul path, and updates packed-field APIs and benches so scalar-vs-packed latency/throughput comparisons are measured consistently. Made-with: Cursor * Optimize packed Fp128 mul throughput with array-backed SoA lanes. This keeps mul in true SoA form while removing repeated vector transmute overhead and inlining the limb-level Solinas lane kernel, improving packed mul throughput and latency against scalar baselines. Made-with: Cursor * Add Fp128 widening multiply API and specialized Solinas reduction Expose mul_wide_u64, mul_wide, mul_wide_u128, solinas_reduce, and to_limbs for deferred-reduction patterns needed by jolt-hachi. Hand-optimized reduce paths for 3/4/5 limbs avoid generic loop overhead. Refactor mul_raw to reuse mul_wide + reduce_4 (zero overhead). Add 9 unit tests and widening/accumulator benchmarks. Made-with: Cursor * Clean up fp128: remove section banners, hoist std::ops imports, rename mul_wide free fn Rename free function mul_wide → mul64_wide to avoid shadowing Fp128::mul_wide. Move reduce_4 next to fold2_canonicalize. Replace fully qualified std::ops::{Add,Sub,Mul,Neg} with use imports. Made-with: Cursor * Constrain Fp32/Fp64 to pseudo-Mersenne primes with Solinas reduction Rework fp32.rs and fp64.rs to require p = 2^k - c (small c), matching fp128's design. Compile-time constants BITS/C/MASK derived from P with static assertions. Replace bit-serial reduction with two-fold Solinas reduction (reduce_product for hot path, loop-based reduce_u64/u128 for arbitrary inputs). Add widening ops (mul_wide, square, solinas_reduce). Fix FieldSampling to use direct modular reduction instead of rejection sampling. 
Blanket-impl PseudoMersenneField, remove manual impls. Rename const generic MODULUS -> P at all call sites. Add latency + throughput benchmarks. Hoist mid-function imports in tests/algebra.rs. Made-with: Cursor * Specialize Fp64 sub-word primes to u64-only arithmetic For BITS < 64 (e.g. 2^40-195), avoid u128 intermediates in reduce_product, add_raw, and sub_raw. Use mul_c_narrow which splits C*high into u32x32->u64 widening multiplies (umaddl on AArch64), preventing LLVM from promoting to u128. Brings 40-bit mul throughput within 4% of 64-bit (690 vs 716 Melem/s), up from ~20% gap. Made-with: Cursor * Add 2^30 and 2^31 pseudo-Mersenne primes and expand benchmarks Add Pow2Offset30Field (2^30-35) and Pow2Offset31Field (2^31-19) prime definitions and type aliases. Refactor fp32/fp64 latency benchmarks with chain_bench! macro, add throughput benchmarks for all new primes. Made-with: Cursor * Add NEON packed backends for Fp32 (4-wide) and Fp64 (2-wide) PackedFp32Neon: 4 lanes in uint32x4_t with full NEON Solinas reduction for mul (vmull_u32 + 2-fold reduce), umin trick for add/sub (BITS<=31), overflow-aware paths for BITS==32. C_SHIFT_KIND optimization for C=2^a+/-1. PackedFp64Neon: 2 lanes in uint64x2_t with NEON add/sub (conditional P for BITS<=62, carry-aware for BITS>=63), scalar-per-lane mul (no native 64x64->128 on NEON). Fp32 packed achieves 2.4-3.5x mul throughput and 3.5-5.0x add/sub throughput over scalar. Includes HasPacking impls, type aliases, NoPacking fallbacks, 7 correctness tests, and throughput benchmarks. Made-with: Cursor * Optimize packed Fp32/Fp64 Solinas multiply hot paths on NEON For packed Fp32, remove the shift/add C-special-case in the Solinas fold and always use vmull_u32 with a hoisted C broadcast, which improves stability and removes the 24-bit mul regression. 
For packed Fp64, replace per-lane Fp64 wrapper multiplication with packed-local per-lane 64x64->128 products plus specialized Solinas reduction (including the sub-word u64 fold path), reducing mul overhead for both 40-bit and 64-bit packed variants. Made-with: Cursor * Tune packed Fp64 mul folding and add reducer/codegen probes Switch packed Fp64 sub-word fold multiplication to direct `C*x`, which improves packed mul throughput in repeated A/B runs. Add dedicated reducer and codegen probe benches so we can compare 40-bit and 64-bit fold paths with instruction-level visibility. Made-with: Cursor * Optimize x86 BMI2 multiply paths for fp64/fp128 fields Use BMI2 widening multiplies in scalar field hot paths and specialize x86 sub-word fold multiplication to a single 64-bit multiply, improving 40-bit fp64 throughput while keeping 64-bit and 128-bit paths stable. Made-with: Cursor * Optimize fp128 wide-limb multiply path for Jolt integration Raise Hachi MSRV to 1.88, add specialized Fp128 mul_wide_limbs kernels for M={3,4} and OUT={4,5,6}, and add field_arith benches that track mul_wide_limbs-only and roundtrip costs to catch regressions. Made-with: Cursor * Specialize Fp128 CanonicalField small-int constructors Make from_u64 use a direct canonical limb construction (no reduction path), fix from_i64 to use unsigned_abs to avoid i64::MIN overflow, and add a regression test for the min-value case. Made-with: Cursor * Impl sumchecks for hachi * Add optimized one-hot commitment path for regular sparse witnesses Exploits the structure of one-hot vectors (T chunks of K field elements, each chunk with exactly one 1) to eliminate all inner ring multiplications. Gadget decomposition of {0,1} coefficients is trivial (only level-0 digit is nonzero), and the inner Ajtai t = A*s reduces to summing selected columns of A with O(D) negacyclic rotations instead of O(D^2) ring muls. 
Handles both K >= D and D >= K as long as one divides the other: - K >= D: each nonzero ring element is a monomial X^j (single rotation) - D >= K: each ring element is a sum of D/K monomials (multiple rotations) Total inner cost: N_A * T * D coefficient additions (zero multiplications), vs N_A * 2^M * delta * D^2 coefficient multiplications in the dense path. Made-with: Cursor * Apply rustfmt formatting to fp128 and field_arith bench Made-with: Cursor * Inject sumchecks to Hachi prover * Add commitment to w to transcript * Add AVX2 and AVX-512 packed field backends for Fp32, Fp64, Fp128 Implement vectorized SIMD arithmetic for x86_64: - AVX2: 8-wide Fp32, 4-wide Fp64, 2-wide Fp128 (scalar delegation) - AVX-512: 16-wide Fp32, 8-wide Fp64, 4-wide Fp128 (scalar delegation) Fp32 uses even/odd lane split with 2-fold Solinas reduction. Fp64 uses vectorized 64×64→128 schoolbook multiply (adapted from plonky3 Goldilocks) with custom Solinas reduction for pseudo-Mersenne primes p = 2^k - c. Also: extract NEON backend into packed_neon.rs, add cfg-gated module selection (AVX-512 > AVX2 > NEON > NoPacking), enable nightly stdarch_x86_avx512 feature, add sumcheck-mix benchmark, and fix minor clippy lints in fp64/fp128. Made-with: Cursor * Vectorize Fp128 packed add/sub on AVX-512 (8-wide) and AVX2 (4-wide) Convert Fp128 packed backends from scalar delegation (AoS) to SoA layout with vectorized add/sub via __m512i / __m256i. Mul remains scalar per-lane. Add FIELD_OPS_PERF.md with Zen 5 benchmark results. Fp128 packed add: +114% (1.08 → 2.31 Gelem/s on Zen 5 AVX-512) Fp128 packed sub: +137% (1.34 → 3.18 Gelem/s) Made-with: Cursor * Add M4 Pro NEON benchmarks, remove mul_add experiment Populate FIELD_OPS_PERF.md with Apple M4 Pro (NEON) results for all primes across scalar, packed, and sumcheck MACC workloads. Remove the experimental mul_add trait method (vectorized add already optimal after inlining; scalar fused approach was 16% slower). 
Made-with: Cursor * Change sumcheck API * Separate ring switch logic * Rename sumchecks to NormSumcheck and RelationSumcheck * Remove iteration prover * Eliminate O(D^2) schoolbook ring multiplication from protocol hot paths At production parameters (D=256/1024), schoolbook CyclotomicRing multiplication is catastrophically expensive. Every protocol hot path has exploitable operand structure that avoids the full D^2 cost: - Add CyclotomicRing::mul_by_sparse for O(omega*D) sparse challenge multiplication (90-140x speedup in compute_z_hat) - Change RingOpeningPoint to store Vec<F> scalars; use scale() instead of ring mul in compute_w_hat (256-1024x speedup) - Add kron_scalars, kron_row_scale, kron_sparse_scale; refactor generate_m to use scalar-aware Kronecker products - Add zero-skip and scalar-detect in compute_r_via_poly_division - Add sample_sparse_challenges, store Vec<SparseChallenge> in QuadraticEquation throughout prover and verifier paths Made-with: Cursor * lint: section banner removal, naming hoist, cfg(test) for test-only paths - Remove section banner comments (----, =====) repo-wide in src, tests, benches - commitment_scheme: hoist RingCommitment, RingOpeningPoint, transcript labels to top-level use; add #[cfg(test)] use for rederive_alpha_and_m_a body (Blake2bTranscript, eval_ring_matrix_at, expand_m_a, labels) so that function uses short names without polluting lib build - Leave mod tests imports in place (no hoisting of test-module use blocks) Made-with: Cursor * Fix CI issues --------- Co-authored-by: Quang Dao <quang.dao@layerzerolabs.org> Co-authored-by: Cursor <cursoragent@cursor.com>

…umcheck (#3) * Add rayon parallelism behind `parallel` feature flag (enabled by default) - New src/parallel.rs with cfg_iter!/cfg_into_iter!/cfg_chunks! macros that dispatch to rayon parallel iterators when `parallel` is enabled - Parallelize protocol hot paths: ring polynomial division, w_evals construction, M_alpha evaluation, ring vector evaluation, packed ring poly evaluation, coefficients-to-ring reduction, quadratic equation folding, and sumcheck round polynomial computation - All 174 tests pass with and without the parallel feature Made-with: Cursor * Add e2e benchmark and make HachiCommitmentScheme generic over config - Make HachiCommitmentScheme generic over <const D, Cfg> so different configs (and thus num_vars) can be used without code duplication. - Remove hardcoded DefaultCommitmentConfig::D from ring_switch.rs; WCommitmentConfig and commit_w now flow D generically. - Add benches/hachi_e2e.rs with configs sweeping nv=10,14,18,20. Made-with: Cursor * Refactor CRT-NTT backend: generalize over PrimeWidth, add Q128 support Make NTT primitives (NttPrime, NttTwiddles, MontCoeff, CyclotomicCrtNtt) generic over PrimeWidth (i16/i32) instead of hardcoding i16. Replace the monolithic QData struct with separate GarnerData and per-prime NttPrime arrays. Add Q128 parameter set (5 × i32 primes, D ≤ 1024) alongside the existing Q32 set. Simplify ScalarBackend by removing the const-generic limb count from to_ring_with_backend. Made-with: Cursor * Add extension field arithmetic and refactor sumcheck trait bounds Split CanonicalField into FromSmallInt (from_{u,i}{8,16,32,64} for all fields) and CanonicalField (u128 repr, base fields only). Implement FromSmallInt, Eq, Debug for Fp2/Fp4. Add ExtField<F> trait with EXT_DEGREE and from_base_slice. Optimize extension field arithmetic: Karatsuba multiplication for Fp2 and Fp4 (3 base muls instead of 4), specialized squaring (2 base muls for Fp2), non-residue IS_NEG_ONE specialization. 
Add concrete configs (TwoNr, NegOneNr, UnitNr) and type aliases Ext2<F>, Ext4<F>. Add transpose-based packed extension fields (PackedFp2, PackedFp4) for SIMD acceleration, following Plonky3's approach. Relax sumcheck bounds from E: CanonicalField to E: FromSmallInt (or E: FieldCore where spurious). Add sample_ext_challenge transcript helper. Includes tests for extension field sumcheck execution. Made-with: Cursor * Fix CRT+NTT correctness and optimize negacyclic NTT pipeline Correctness fixes: - Rewrite negacyclic NTT as twist + cyclic DIF/DIT pair (no bit-reversal permutation), correctly diagonalizing X^D+1. - Center coefficient→CRT mapping and Garner reconstruction to handle negacyclic sign wrapping consistently. - Fix i32 Montgomery csubp/caddp overflow via branchless i64 widening. - Fix q128 centering overflow in balanced_decompose_pow2 (avoid casting q≈2^128 into i128). - Remove dense-protocol schoolbook fallback; all mat-vec now routes through CRT+NTT. Performance optimizations: - Precompute per-stage twiddle roots in NttTwiddles (eliminate runtime pow_mod per butterfly stage). - Forward DIF butterfly skips reduce_range before Montgomery mul (safe because mul absorbs unreduced input). - Hoist centered-coefficient computation out of per-prime loop in from_ring. - Add fused pointwise multiply-accumulate for mat-vec inner loop. - Add batched mat_vec_mul_crt_ntt_many that precomputes matrix NTT once and reuses across many input vectors. - Wire commit_ring_blocks to batched A*s path. Benchmarks (D=64, Q32/K=6): - Single-prime forward+inverse NTT: 1.14µs → 0.43µs (2.7x) - CRT round-trip: 10.7µs → 6.3µs (1.7x) - Commit nv10: ~70% faster, nv20: ~47% faster Made-with: Cursor * Cache CRT+NTT matrix representations in setup to avoid repeated conversion The dense mat-vec paths (commit_ring_blocks, commit_onehot B-mul, compute_v) previously converted coefficient-form matrices to CRT+NTT on every call. 
Now the setup eagerly converts A, B, D into an NttMatrixCache and all dense operations use the pre-converted form. Coefficient-form matrices are retained for the onehot inner-product path and ring-switch/generate_m. Made-with: Cursor * Remove dead code (HachiRoutines, domains/, redundant trait methods) and extract shared field utilities - Delete unused HachiRoutines trait and dead algebra/domains/ module - Remove redundant FieldCore::add/sub/mul and Module::add/neg (covered by ops traits) - Extract is_pow2_u64, log2_pow2_u64, mul64_wide into fields/util.rs to deduplicate Made-with: Cursor * Unify Blake2b and Keccak transcript backends into generic HashTranscript Replace separate blake2b.rs and keccak.rs with a single generic HashTranscript<D: Digest> parameterized by hash function. Blake2bTranscript and KeccakTranscript are now type aliases. Made-with: Cursor * Fix sumcheck degree bug, split types, in-place fold, CommitWitness, rename configs, add soundness test - Fix CompressedUniPoly::degree() off-by-one that could let malformed proofs pass - Split sumcheck/mod.rs: extract types into types.rs, relocate multilinear_eval and fold_evals to algebra/poly.rs - Replace allocating fold_evals with in-place fold_evals_in_place - Add debug_assert guards to multilinear_eval and fold_evals_in_place - Introduce CommitWitness struct to replace error-prone 3-tuple returns - Rename DefaultCommitmentConfig to SmallTestCommitmentConfig, add ProductionFp128CommitmentConfig - Add verify_rejects_wrong_opening negative test for verifier soundness Made-with: Cursor * fix(test): resolve clippy needless_range_loop in algebra tests Use iter().enumerate() for schoolbook convolution loops and array::from_fn for pointwise NTT operations. Made-with: Cursor * Refactor commitment setup to runtime layout and staged artifacts. 
This removes compile-time commitment shape locks, derives beta from runtime layout, and threads layout-aware setup through commit/prove/verify with setup serialization roundtrip coverage. Made-with: Cursor * Soundness hardening: panic-free verifier, Fiat-Shamir binding, NTT overflow fix - Verifier path never panics; all errors return HachiError - Bind commitment, opening point, and y_ring in Fiat-Shamir transcript - Fix i16 csubp/caddp overflow by widening to i32 - multilinear_eval returns Result with dimension checks - build_w_evals validates w.len() is a multiple of d - UniPoly::degree uses saturating_sub instead of expect - Serialize usize as u64 for 32/64-bit portability - Fix from_i64(i64::MIN) via unsigned_abs - Remove Transcript::reset from public trait (move to inherent) - Add batched_sumcheck verifier empty-input guard Made-with: Cursor * Hoist fully qualified paths to use statements in touched files Replace inline crate::protocol::commitment::HachiCommitmentLayout, hachi_pcs::algebra::backend::{CrtReconstruct, NttPrimeOps}, and hachi_pcs::algebra::CyclotomicRing with top-level use imports. Made-with: Cursor * Dispatch norm sumcheck kernels by range size. Route small-b rounds through the point-eval interpolation kernel and keep the affine-coefficient kernel for larger b, while adding deterministic baseline-vs-dispatched benchmarks and parity tests to validate correctness across both strategies. Made-with: Cursor * Format commitment-related files for readability. Apply non-functional formatting and import ordering cleanups across commitment, ring-switch, and benchmark/test files to keep the codebase style consistent. Made-with: Cursor * Format: cargo fmt pass on commitment-related files Made-with: Cursor * feat: sequential coefficient ordering + streaming commitment Change coefficient-to-ring packing from strided to sequential, enabling true streaming where each trace chunk maps to exactly one inner Ajtai block. 
Implement StreamingCommitmentScheme for HachiCommitmentScheme. - reduce_coeffs_to_ring_elements: sequential packing (chunks_exact(D)) - prove/verify: opening point split flipped to (inner, outer) - ring_opening_point_from_field: outer split flipped to (M first, R second) - commit_coeffs: sequential block distribution - map_onehot_to_sparse_blocks: sequential block distribution - HachiChunkState + process_chunk / process_chunk_onehot / aggregate_chunks - Streaming commit tests (matches non-streaming, prove/verify roundtrip) Made-with: Cursor * refactor: decompose verify_batched_sumcheck into composable steps Split the monolithic verify_batched_sumcheck into three pieces: - verify_batched_sumcheck_rounds: replay rounds, return intermediate state - compute_batched_expected_output_claim: query verifier instances - check_batched_output_claim: enforce equality This enables callers (e.g. Greyhound) to intercept the intermediate sumcheck state before the final oracle check. The original function is preserved as a convenience wrapper. Made-with: Cursor * feat: accept Option<usize> in commit_onehot for sparse one-hot support Allows None entries in one-hot index arrays to represent inactive cycles. Adds public commit_onehot free function returning both commitment and hint. Made-with: Cursor * feat: submatrix commit for polynomials smaller than setup max commit_coeffs now accepts ring coefficient vectors shorter than the layout's full size, padding each block internally. prove/verify pad the opening point with zeros so the transcript stays consistent. This avoids materializing huge zero-padded field-element arrays. Made-with: Cursor * feat: add HachiSerialize impls for proof types Implement HachiSerialize/HachiDeserialize for HachiProof, HachiCommitmentHint, and SumcheckAux so they can be serialized through the ArkBridge adapter in Jolt. Made-with: Cursor * fix: relax balanced_decompose_pow2 assertion for 128-bit fields Allow levels * log_basis up to 128 + log_basis. 
For Fp128 with LOG_BASIS=4, the decomposition needs 33 levels (132 bits total) because 32 levels can't represent the full signed range [-q/2, q/2). The extra level's digit is at most ±1 and the i128 arithmetic remains safe since the quotient shrinks monotonically. Made-with: Cursor * feat: add DynamicSmallTestCommitmentConfig Same D=16 security parameters as SmallTestCommitmentConfig but derives layout from max_num_vars instead of using a fixed (4,2) shape. Made-with: Cursor * perf: true submatrix in commit_coeffs — skip zero blocks Short polynomials no longer pad to block_len. commit_coeffs accepts fewer ring elements than num_blocks * block_len, decomposes only the non-zero blocks, and fills remaining entries with zero s/t_hat without allocation or mat-vec multiplication. Also relax debug_assert in mat_vec_mul_precomputed to >= (zip handles the shorter vector correctly). Made-with: Cursor * fix: use inner_width for zero_s in commit_coeffs/commit_onehot prove expects s[i] to have inner_width entries. Use the correct length for zero blocks to match the dense path's decompose_block output size. Made-with: Cursor * fix: configure rayon with 64MB stack for D>=512 ring elements CRT-NTT conversion puts ~28KB on the stack per ring element ([[MontCoeff; D]; K] + [i128; D]). With D=512 and the commit call chain depth, rayon's default thread stack overflows. ensure_large_thread_stack() is called from setup() and is safe to call multiple times (only the first configures the pool). Made-with: Cursor * feat: add commit_mixed for mega-polynomial commitment Exposes MegaPolyBlock enum (Dense/OneHot/Zero) and commit_mixed() which processes heterogeneous blocks in a single commitment. This lets Jolt pack all witness polynomials into one Hachi commitment (one block per polynomial) instead of N independent commitments. Also makes SparseBlockEntry and map_onehot_to_sparse_blocks public so callers can construct one-hot block descriptors. 
Made-with: Cursor * perf: drop s vectors from CommitWitness and HachiCommitmentHint The basis-decomposed s_i vectors (one per block, each block_len*delta ring elements) were stored in both CommitWitness and HachiCommitmentHint. At production parameters (D=512, block_len=2048, delta=32), each s_i is 512 MB — storing all 64 of them consumed ~32 GB. Instead, recompute s_i on the fly in compute_w_hat and compute_z_hat from ring_coeffs using decompose_block. Peak memory drops from O(blocks * block_len * delta) to O(block_len * delta) per thread. Also adds setup_with_layout for caller-specified HachiCommitmentLayout, and makes decompose_block, SparseBlockEntry, map_onehot_to_sparse_blocks public for downstream (Jolt) mega-polynomial integration. Made-with: Cursor * chore: untrack docs/ and paper/ from version control Keep these files locally for reference but remove from the committed tree. They can be selectively re-added later. Made-with: Cursor * perf: fused sumcheck, split-eq streaming, compact w_evals — 8x memory reduction Refactor the Hachi proving pipeline to eliminate the 13 GB matrix M and 2.6 GB vector z from memory, reducing peak prover allocation from ~30 GB to ~3.7 GB. Key changes: - QuadraticEquation: remove m/z fields; add compute_r_split_eq (split-eq factoring replaces full Kronecker materialization) and compute_m_a_streaming (row-at-a-time M·α evaluation). - ring_switch: decompose z_pre on the fly in build_w_coeffs; add build_w_evals_compact returning Vec<i8> for round-0 storage (all entries fit in [-8, 7] from balanced_decompose_pow2 with LOG_BASIS=4). - HachiSumcheckProver: fused norm+relation prover sharing a single w_table. Round 0 uses WTable::Compact(Vec<i8>), folding to WTable::Full(Vec<F>) at half size after the first challenge. - HachiSumcheckVerifier: fused verifier combining both oracle checks with a batching_coeff sampled from the transcript. - Remove dead batched mat-vec functions from linear.rs. 
- Import hygiene: shorten crate::algebra::ring::X to crate::algebra::X; hoist mid-function use statements to top-level. Made-with: Cursor * revert: remove ensure_large_thread_stack rayon config Stack sizing for D>=512 ring elements should be handled by the caller, not baked into the library's setup path. Made-with: Cursor
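The `balanced_decompose_pow2` entry in the log above says that Fp128 with LOG_BASIS=4 needs 33 levels because 32 cannot cover the full signed range. The same effect can be reproduced at toy scale: two balanced base-16 digits cover only [-136, 119], so an 8-bit signed value such as 127 needs a third digit. The sketch below is an illustration under that scaled-down assumption; `balanced_decompose` and `recompose` are hypothetical names, not the crate's functions, and the code works on `i64` rather than field elements.

```rust
/// Balanced base-2^log_basis decomposition (hypothetical helper, not the crate API).
/// Each digit lies in [-b/2, b/2) with b = 2^log_basis; assumes log_basis <= 7
/// so that every digit fits in an i8.
/// Returns None when `levels` digits cannot represent `x`.
fn balanced_decompose(mut x: i64, log_basis: u32, levels: usize) -> Option<Vec<i8>> {
    let b = 1i64 << log_basis;
    let half = b >> 1;
    let mut digits = Vec::with_capacity(levels);
    for _ in 0..levels {
        // Low log_basis bits of x, as a value in [0, b) (also for negative x).
        let mut d = x & (b - 1);
        // Shift into the balanced range [-b/2, b/2).
        if d >= half {
            d -= b;
        }
        digits.push(d as i8);
        // x - d is an exact multiple of b, so the arithmetic shift is exact.
        x = (x - d) >> log_basis;
    }
    // A non-zero remainder means more levels are required.
    if x == 0 { Some(digits) } else { None }
}

/// Inverse of `balanced_decompose`: sum of digits[i] * b^i (Horner form).
fn recompose(digits: &[i8], log_basis: u32) -> i64 {
    digits
        .iter()
        .rev()
        .fold(0i64, |acc, &d| (acc << log_basis) + i64::from(d))
}
```

With `log_basis = 4`, `balanced_decompose(127, 4, 2)` returns `None`, while three levels succeed and the top digit is only 1. This mirrors the log's remark that the extra level's digit is at most ±1.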

…ine, NTT acceleration (#5) * perf: parallelize commit phase and reduce allocations - Add block-level parallelism to commit_ring_blocks, commit_coeffs, commit_onehot, and commit_mixed via cfg_iter!/cfg_into_iter! - Parallelize vector-to-NTT conversion in mat_vec_mul_precomputed_with_params - Cache CRT+NTT params inside NttMatrixCache, eliminating redundant select_crt_ntt_params calls on every mat-vec multiply - Add balanced_decompose_pow2_into for in-place decomposition, removing per-element Vec allocations in decompose_block/decompose_rows - Add inner_ajtai_onehot_t_only that skips the 16MB s-vector allocation when the caller discards it (commit_onehot, commit_mixed) - Add one-hot and mixed commitment benchmarks to hachi_e2e Made-with: Cursor * chore: remove stale #[allow(non_snake_case)] from setup structs HachiSetupSeed, HachiProverSetup, and HachiVerifierSetup have no uppercase fields — the allows were left over from earlier refactors. Made-with: Cursor * perf: hoist decomposition params to runtime, reduce allocations and cloning Pre-existing change: - Remove rows/cols from matrix domain separator so A matrix is reusable across poly/mega-poly layouts with the same m_vars. New changes: Move delta/tau/log_basis from CommitmentConfig associated constants into HachiCommitmentLayout runtime fields. This decouples decomposition parameters from the config type, allowing them to vary at runtime without monomorphization. All ~50 call sites updated. Eliminate redundant work in the prover hot path: - Flatten w_hat once and reuse in both compute_v and compute_r_split_eq (was flattened separately in each). - Stream z_hat decomposition directly in build_w_coeffs instead of collecting into a temporary Vec. - Skip the unused w.to_vec() clone in ring_switch_verifier output. - Take ownership of ring_opening_point and hint in QuadraticEquation constructors instead of cloning. 
Reduce stack pressure for large ring elements (8KB at D=512, Fp128): - Add CyclotomicRing::from_slice() to avoid std::array::from_fn intermediaries that create 8KB stack temporaries. - Replace from_fn patterns in process_chunk, reduce_coeffs_to_ring_elements, commit_w, and compute_r_split_eq. Made-with: Cursor * feat: flexible decomposition depth and dual basis mode Move DELTA/TAU/LOG_BASIS out of CommitmentConfig into runtime DecompositionParams (log_basis, log_coeff_bound). Delta and tau are now auto-derived from the coefficient bound, so small-coefficient polynomials (0/1, already range-checked) get proportionally cheaper commitments. Add BasisMode enum (Lagrange / Monomial) as a prove/verify-time parameter. Commitment is basis-agnostic; the mode only changes the tensor-product weights in the opening relation. Made-with: Cursor * fix compute m a streaming to not need padding * refactor: unify polynomial API via HachiPolyOps trait, remove dead code, fix config validation HachiPolyOps trait and implementations: - Add HachiPolyOps<F, D> trait with 4 operation methods (evaluate_ring, fold_blocks, decompose_fold, commit_inner) replacing raw coefficient access - Add DensePoly<F, D> for dense ring coefficient vectors - Add OneHotPoly<F, D> for sparse one-hot polynomials with optimized ops CommitmentScheme refactor: - Parameterize CommitmentScheme<F, D> (was CommitmentScheme<F>) - Generic commit/prove over P: HachiPolyOps<F, D> - Rename OpeningProofHint to CommitHint, remove Option wrapper from prove - Remove batch_commit, combine_commitments, combine_hints - Remove StreamingCommitmentScheme trait, HachiChunkState, process_chunk* Dead code removal: - Delete MegaPolyBlock enum and commit_mixed method - Delete inner_ajtai_onehot (keep _t_only variant) - Delete Polynomial trait, MultilinearLagrange trait - Delete DenseMultilinearEvals and multilinear_evals module - Remove all unnecessary #[allow(...)] attributes Proof simplification: - Remove ring_coeffs from 
HachiCommitmentHint (only t_hat remains) - Update quadratic_equation to use HachiPolyOps methods Config fix: - Remove overly strict delta*log_basis > 128 check in config.rs; balanced_decompose_pow2 already enforces the correct bound (levels*log_basis <= 128+log_basis) Documentation: - Add docs to all public items in test_utils and packed_ext - Remove #[allow(missing_docs)] from parallel, test_utils, packed_ext modules Made-with: Cursor * fix: remove test for deleted delta*log_basis validation The setup_rejects_invalid_digit_budget test asserted the overly strict delta*log_basis > 128 check that was intentionally removed in the previous commit. Delete the test and its BadDigitBudgetConfig. Made-with: Cursor * style: fix formatting in ring_commitment_core.rs Made-with: Cursor * perf: parallelize proving hot paths, eliminate per-proof w-commitment setup Parallelize the three proving bottlenecks (quad_eq, ring_switch, sumcheck) and remove the per-proof matrix generation in commit_w by reusing the main NTT cache. Proving hot-path parallelism: - Parallelize round-0 norm and relation sumcheck via cfg_fold_reduce! macro - Parallelize DensePoly::decompose_fold with parallel fold-reduce over blocks - Parallelize fold_evals_in_place and build_w_evals_compact with cfg_into_iter! - Add cfg_fold_reduce! macro to unify parallel/sequential fold-reduce patterns - Unify compute_round_{norm,relation}_{compact,full} into single generic fns Sumcheck micro-optimizations: - Unroll 3-point relation evaluation to avoid redundant from_u64 conversions and multiply-by-zero/one at evaluation points 0 and 1 - Hoist gadget_recompose_pow2 out of per-row loop in compute_r_split_eq Eliminate per-proof w-commitment setup: - Add w_ring_element_count() and w_commitment_layout() helpers to compute w-commitment matrix dimensions from the main layout - Widen A/B matrices at setup time to max(main, w) column counts so the main NTT cache always covers the w-commitment (required when delta_commit=1, e.g. 
boolean polynomials) - Rewrite commit_w to take &NttMatrixCache directly, inlining the commit logic with flat_map instead of intermediate Vec<Vec<...>> - Remove w_setup field from HachiProverSetup - Add ensure_matrix_shape_ge for >= column checks on widened matrices Naming cleanup: - Rename delta -> num_digits_commit, tau -> num_digits_fold, log_coeff_bound -> log_commit_bound throughout - Add log_open_bound to DecompositionParams for recursive w commitments - Hoist fully qualified paths (std::ops, std::mem, std::iter, crate::protocol::ring_switch::w_commitment_layout) to use statements Made-with: Cursor * perf: profile and accelerate opening proof hot paths Replace D/B-row schoolbook quotient extraction with an NTT-based unreduced quotient path and add targeted tracing spans/timers plus a Perfetto profile example so prover bottlenecks are visible and cheaper to iterate on. Temporarily force the point-eval norm kernel to isolate fused-sumcheck behavior during profiling. Made-with: Cursor * perf: NTT-accelerate A-rows, reduce basis 16→8, fix saturation bug Three optimizations to the proving pipeline: 1. NTT-accelerate A-rows in compute_r_split_eq: use unreduced_quotient_rows_ntt_cached for A*z_pre (O(D log D) instead of O(D^2) schoolbook). Also exploit sparse challenge structure in add_sparse_ring_product (O(weight*D) instead of O(D^2)). 2. Reduce decomposition basis from 16 to 8 (log_basis 4→3): halves the norm sumcheck range-check polynomial degree from 31 to 15, yielding ~4x speedup on the dominant prove-time bottleneck. Soundness is strictly improved (smaller MSIS norm bound). 3. Fix u128 saturation bug in compute_num_digits and r_decomp_levels that caused an incorrect extra decomposition level when b^levels overflows u128. Skip the balanced-range check when levels*log_basis > log_bound, since the digit range is mathematically guaranteed sufficient for b >= 4. 
Also: replace hardcoded LOG_BASIS const with log_basis() function derived from TinyConfig, fuse decompose+sparse-mul in decompose_fold to i32 arithmetic, and add balanced_decompose_pow2_i8 variant. Net result: prove time 4.76s → 1.57s (3.0x speedup) at num_vars=19. Made-with: Cursor * perf: i8 digit pipeline for w_hat — bypass Fp128 for small decomposed digits Store w_hat/w_hat_flat as [i8; D] instead of CyclotomicRing<Fp128, D>, eliminating redundant field arithmetic on values in [-b/2, b/2). - Add balanced_decompose_pow2_i8 and gadget_recompose_pow2_i8 - Add CyclotomicCrtNtt::from_i8_with_params / from_i8_cyclic for direct i8 → CRT+NTT conversion (skips Fp128 centering) - Add mat_vec_mul_ntt_cached_i8 and unreduced_quotient_rows_ntt_cached_i8 - Change QuadraticEquation w_hat/w_hat_flat types + all consumers - Simplify build_w_coeffs to write i8 digits directly as field elements Made-with: Cursor * perf(poly): optimize range_check_eval and fold_evals_in_place range_check_eval: precompute w² and use (w²−k²) instead of (w−k)(w+k), saving one multiply per factor. fold_evals_in_place: fold in-place with truncate() instead of allocating a new Vec, removing the rayon dependency from this function. Made-with: Cursor * refactor(sumcheck): centralize and optimize norm sumcheck computation Extract duplicated norm round polynomial logic from NormSumcheckProver and HachiSumcheckProver into shared compute_norm_round_poly() and compute_norm_round_poly_compact() functions. Optimizations: - Flat contiguous storage for RangeAffinePrecomp (coeff_mix_flat + row_offsets) - Precomputed small-integer LUT (h_i(w_0)) for round-0 compact accumulation - Native i128 range-check evaluation path for b <= 10 - Precomputed squared offsets in PointEvalPrecomp - Make choose_round_kernel public with env var override and b-threshold dispatch Made-with: Cursor * feat(protocol): multi-level recursive folding proof Replace single-shot proof with recursive multi-level folding. 
Instead of sending the full w vector after one round of quad_eq → ring_switch → sumcheck, the prover now recursively commits to w and opens it via the same protocol until w is small enough to send directly. Key changes: - HachiProof now holds Vec<HachiLevelProof> + final_w instead of flat fields - Remove SumcheckAux; each level carries a w_eval claim instead - Extract prove_one_level / verify_one_level from monolithic prove/verify - Folding stops via should_stop_folding heuristic (MIN_W_LEN_FOR_FOLDING, MIN_SHRINK_RATIO) - QuadraticEquation takes explicit layout parameter for per-level configs - ring_switch exports WCommitmentConfig for recursive w-openings - D matrix widened to max(layout, w_layout) for shared setup - HachiSumcheckVerifier gains w_val_override for intermediate levels Made-with: Cursor * chore(examples): update profile example for multi-level proofs and A/B kernel testing - Extract run_prove() helper for reuse across kernel configs - Add A/B test mode (HACHI_AB_TEST=1) to compare affine_coeff vs point_eval - Update layout from (6,4) to (8,8) - Report multi-level proof stats (levels, final_w length, proof size) - Set 64 MiB rayon stack size Made-with: Cursor * style: remove section banners and hoist mid-function use statement - Remove redundant section banner comments in proof.rs and commitment_scheme.rs - Move choose_round_kernel import from function body to top-level in hachi_sumcheck.rs Made-with: Cursor * perf(algebra): use bitwise ops for balanced digit decomposition Replace rem_euclid(b) with bitwise AND and division with right shift in CyclotomicRing digit decomposition (decompose_balanced, decompose_balanced_digit_planes, decompose_balanced_i8) and DensePoly commit_with_setup. Valid since b is always a power of two. Made-with: Cursor * perf: store t_hat as i8 digit planes, cache w_folded to skip recompose Switch t_hat storage from Vec<Vec<CyclotomicRing<F,D>>> to Vec<Vec<[i8;D]>> throughout the commitment and proving pipeline. 
Decomposed digits are bounded by log_basis (typically 3), so i8 is sufficient and avoids carrying full field-element ring elements through commit, ring-switch, and serialization. Key changes: - CommitWitness and HachiCommitmentHint now hold [i8; D] digit planes - New i8 variants: decompose_block_i8, decompose_rows_i8, mat_vec_mul_ntt_cached_i8, gadget_recompose_pow2_i8 - HachiPolyOps::commit_blocks returns [i8; D] digit planes - QuadraticEquation caches w_folded (pre-decomposition folded ring elements) so compute_r_split_eq avoids a gadget_recompose roundtrip - Precomputed idx/sign lookup tables for sparse challenge multiplication - Custom i8 serialization for HachiCommitmentHint - Remove bogus debug_assert constraining ring degree D<=128 in build_w_evals_compact (was checking log2(D) but message said log_basis) Made-with: Cursor * perf: optimize hot paths in commit/prove pipeline - Hoist NTT conversions out of per-row quotient loops (crt_ntt, linear) - Precompute c_alpha in compute_m_a_streaming (quadratic_equation) - Compact alpha/m tables with variable-specific folding (sumcheck) - Eliminate t_hat_flat rematerialization and zero_t_hat clones (commit, ring_switch, hachi_poly_ops) - Merge duplicate w-eval passes (ring_switch, commitment_scheme) - Clean up fully qualified paths (linear, relation_sumcheck, hachi_poly_ops) Made-with: Cursor * feat(algebra): add wide unreduced accumulators and fused shift-accumulate Add Fp32x2i32, Fp64x4i32, Fp128x8i32 types that split field elements into 16-bit limbs in i32 slots for carry-free SIMD-friendly addition. Overflow budget ~32k signed adds before reduction. Add shift_accumulate_into / shift_sub_into / mul_by_monomial_sum_into on CyclotomicRing for fused negacyclic shift + accumulate without temporary ring allocations. Make field offset constants C public. 
Made-with: Cursor * refactor(protocol): per-matrix NttSlotCache, fused one-hot commit, bench stack fix Replace monolithic NttMatrixCache with per-matrix NttSlotCache, removing HachiPreparedSetup and MatrixSlot enum. HachiProverSetup now holds three independent NttSlotCache instances (A, B, D). Simplify dispatch macros in linear.rs to operate on a single slot. Add CommitCache associated type to HachiPolyOps trait. Wire one-hot commit path to use fused mul_by_monomial_sum_into, eliminating temporary allocations. Fix pre-existing benchmark stack overflow by configuring rayon with a 64MB thread stack (matching examples/profile.rs). Made-with: Cursor * feat(commit): column-tiled A matvec for cache-efficient commitment Add mat_vec_mul_ntt_tiled_i8 and mat_vec_mul_ntt_tiled_single_i8 that tile the NTT matrix columns into L2-sized chunks (~400 cols). Each rayon thread owns one tile and iterates over all blocks, so the matrix is loaded from DRAM exactly once. Ring coefficients are decomposed on-the-fly per tile to avoid full digit materialization. All call sites (commit, commit_coeffs, commit_onehot, ring_switch, quadratic_equation, HachiPolyOps::commit_inner) updated to use the tiled API. Reduces total DRAM traffic ~25x for large traces. Made-with: Cursor * refactor: promote TWO_INV and ZERO to const associated items on FieldCore Hoists two_inv from a trait method to a compile-time constant, and adds const ZERO so extension fields (Fp2, Fp4) can build their TWO_INV without runtime calls. Deduplicates CrtNttParamSet computation across A/B/D caches. Made-with: Cursor * refactor: remove two_inv parameters now that TWO_INV is a const Functions and macros no longer thread two_inv through call chains; they reference F::TWO_INV directly. Also removes the runtime computation in batched_sumcheck. 
Made-with: Cursor * feat(commit): stub HachiSerialize for HachiProverSetup Add Valid + HachiSerialize impls for HachiProverSetup that return an error on serialize (NTT caches are runtime artifacts). Needed by downstream wrappers that require the trait bound. Made-with: Cursor * perf: fuse hot loops, eliminate allocations, cheaper CRT reduction - mul_by_sparse: use shift_accumulate_into/shift_sub_into for ±1 coeffs - inverse NTT: fuse d_inv and psi_inv trailing passes into one loop - CRT conversion: replace __modti3 (i128 % i128) with split i64 arithmetic - Fp128 sqr_raw: 3 widening muls instead of 4 via squaring symmetry - decompose_block_i8: add _into variant, reuse buffer across tiles - sumcheck: fuse norm+relation into single pass over w_table - ring_switch: fuse expand_m_a+build_m_evals_x, rayon::join parallel phases - ring_switch: build_w_evals_dual uses unzip instead of triple allocation - quadratic_equation: hoist scratch allocations out of row loop Made-with: Cursor * feat: wide ring accumulators with NEON SIMD for one-hot commitment Introduce carry-free wide accumulators (Fp32x2i32, Fp64x4i32, Fp128x8i32) that defer modular reduction during one-hot commitment, yielding 69x faster commit for sparse witnesses. Key changes: - AdditiveGroup trait decoupling additive ops from full FieldCore - WideCyclotomicRing<W, D> for carry-free ring accumulation - HasWide / ReduceTo traits for type-level wide ↔ canonical dispatch - NEON SIMD backends for Fp64x4i32 and Fp128x8i32 with scalar fallback - inner_ajtai_onehot_wide replaces inner_ajtai_onehot_t_only - Profile example now covers both dense and one-hot paths Made-with: Cursor * refactor: drop "_tiled" suffix from mat-vec functions Tiling is an internal optimization detail, not an API distinction. The tiled versions are the only production path; non-tiled variants exist only as #[cfg(test)] reference implementations. 
Made-with: Cursor * refactor: rename Fp128CommitmentConfig, hoist inline qualified path - Drop "Production" prefix from ProductionFp128CommitmentConfig - Hoist crate::algebra::fields::LiftBase to use statement in sparse_challenge.rs Made-with: Cursor * feat: pack final_w as balanced digits, use Vec<i8> throughout prover Represent the prover's witness vector w as Vec<i8> instead of Vec<F> throughout the folding pipeline. Introduces PackedDigits to bit-pack the final-level w into log_basis bits per element, reducing proof size by ~32x. Cleans up import hygiene in profile example and proof module. Made-with: Cursor * perf: use const digit lookup table for i8-to-field conversion Add const fn digit_lut to Fp128 and FromSmallInt trait for precomputing balanced-digit-to-field-element tables. Replaces per-element from_i64 calls with indexed loads in the three hot prover loops (commit_w, build_w_evals_dual, dense_poly_from_w). Made-with: Cursor * perf: add DigitMontLut for i8 mat-vec kernels, clean up imports Add a precomputed Montgomery lookup table (DigitMontLut) for balanced digit values {-8..7}, replacing per-coefficient from_canonical calls in the i8→CRT+NTT conversion hot path. Wire it into mat_vec_mul_ntt_i8, mat_vec_mul_ntt_single_i8, and unreduced_quotient_rows_ntt_cached_i8. Also: merge duplicate NTT butterfly imports, remove duplicated doc comment on from_ring_cyclic, export DigitMontLut through ring/algebra modules, apply cargo fmt. Made-with: Cursor * perf: NEON SIMD kernels, decompose_fold optimization, explicit layout API Add AArch64 NEON SIMD for NTT butterflies, pointwise multiply-accumulate, and add-reduce (neon.rs). Dispatch from butterfly.rs and linear.rs with runtime feature check and scalar fallback. Optimize DensePoly::decompose_fold with two-phase restructure: K=3 interleaved carry chains for ILP on decomposition, then NEON rotate-and-add scatter (decompose_fold_neon.rs). ~2x speedup on compute_z_pre. 
Optimize OneHotPoly::decompose_fold by replacing O(omega*D) mul_by_sparse with direct sparse scatter O(omega*|nonzero_coeffs|). ~22x speedup. Thread explicit HachiCommitmentLayout through commit/prove/verify instead of computing from setup internally. Add OneHotIndex trait for generic onehot indices. Profile now uses OneHotPoly end-to-end for the onehot path. Clean up imports: hoist qualified crate::algebra::ntt::neon paths, move test-function use statements to module scope. Made-with: Cursor * perf: unreduced accumulation for sumcheck, fused compact round-0 loop Introduce HasUnreducedOps trait with MulU64Accum / ProductAccum types for Fp64, Fp128, and Fp2, enabling widening multiplies that defer reduction until after accumulation. Key changes: - Fuse norm + relation computation into a single pass for compact (Round 0) via compute_round_compact_fused, using split pos/neg MulU64Accum for the relation and i128/LUT arithmetic for the norm. - Sparse integer representation for affine-coeff precomputation (SparseCoeffEntry) with batched x4 kernel (compute_entry_coeffs_x4). - Two-level inner/outer ProductAccum accumulation for affine-coeff kernel, both compact and full-field paths. - Optimize fold_compact_to_full to use mul_u64_unreduced for r * delta. - Parallelize OneHotPoly::evaluate_ring, fold_blocks, decompose_fold. - Add FromSmallInt::from_i128 default method. Made-with: Cursor * perf: two-level ProductAccum for full-field affine-coeff kernel Upgrade the WTable::Full + AffineCoeffComposition path in HachiSumcheckProver to use two-level ProductAccum accumulation (outer loop over e_second, inner mul_to_product_accum, single reduction per j_high block), matching the standalone norm_sumcheck.rs implementation. Also fix multilinear_eval_small missing FromSmallInt bound, switch commitment_scheme w_eval to use w_evals_field (w_evals is moved), and add missing doc on ScaleI32 trait method. 
Made-with: Cursor * style: rustfmt formatting for poly.rs and hachi_sumcheck.rs Made-with: Cursor * fix(ci): use compound assignment operators to satisfy clippy Made-with: Cursor * chore: remove docs/ and paper/ from tracked files Backed up to quang/temp-docs branch. Files remain on disk. Made-with: Cursor * fix(ci): implement assign traits and fix all clippy assign_op_pattern lints Add MulAssign for Fp128, and AddAssign/SubAssign/MulAssign for all PackedNeon types. Convert all x = x op y patterns to x op= y across benches, tests, and lib. Made-with: Cursor * fix(ci): add assign traits to NoPacking, AVX2/AVX512 packed types, Fp32, Fp64 NoPacking<T> (x86_64 fallback) was missing AddAssign/SubAssign/MulAssign, causing CI failures on the GitHub runner. Add assign traits uniformly across all packed backends and scalar field types. Fix remaining assign_op_pattern lints in benches and tests. Made-with: Cursor * fix(ci): fix no-default-features clippy — unused var, dead code, rayon gate - Allow unused rel_combine (only used in parallel reduce combiner) - Allow dead_code on add_ntt_into (only used in parallel + aarch64) - Gate rayon::ThreadPoolBuilder behind cfg(feature = "parallel") - Fix remaining assign_op_pattern in norm_sumcheck bench Made-with: Cursor
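The `range_check_eval` entry in the log above replaces each `(w−k)(w+k)` pair with `(w²−k²)` after precomputing `w²`, which saves one multiplication per factor. A minimal sketch of that rewrite follows. It assumes the range-check polynomial has the form w · ∏_{k=1}^{b−1} (w² − k²), which matches the degree 2b−1 quoted in the log (31 for b=16, 15 for b=8) but is not confirmed by the source. The toy modulus `P` and every function name other than `range_check_eval` are hypothetical stand-ins for the crate's field types.

```rust
/// Toy prime modulus (2^31 - 1), standing in for the real field; an assumption.
const P: u64 = 2_147_483_647;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { a * b % P } // operands < 2^31, no u64 overflow

/// Map a small signed integer into [0, P).
fn from_i64(x: i64) -> u64 { x.rem_euclid(P as i64) as u64 }

/// Assumed form: w * prod_{k=1}^{b-1} (w^2 - k^2).
/// Vanishes exactly when w is one of -(b-1), ..., b-1 (mod P).
/// Uses one squaring up front, then one multiplication per factor.
fn range_check_eval(w: u64, b: u64) -> u64 {
    let w2 = mul(w, w);
    let mut acc = w;
    for k in 1..b {
        acc = mul(acc, sub(w2, mul(k, k)));
    }
    acc
}

/// Reference version with two multiplications per factor: (w - k)(w + k).
fn range_check_naive(w: u64, b: u64) -> u64 {
    let mut acc = w;
    for k in 1..b {
        acc = mul(acc, mul(sub(w, k), add(w, k)));
    }
    acc
}
```

Both functions return the same value for any `w`. For `b = 8` they return zero on `from_i64(-7)` and a non-zero value on `from_i64(8)`.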

… infrastructure (#7) * fix: separate delta_commit and delta_open for t_hat decomposition t = A * s produces full-field-size coefficients even when s has small (delta_commit-digit) entries. The code was decomposing t_hat using delta_commit instead of delta_open, causing lossy truncation and breaking verification for onehot/logbasis commitment configs. Split commit_inner's num_digits parameter into num_digits_commit (for s) and num_digits_open (for t_hat), and propagate this distinction through layout, commit, quadratic_equation, and ring_switch. Also: - Add Fp128FullCommitmentConfig, Fp128OneHotCommitmentConfig, Fp128LogBasisCommitmentConfig bounded commitment configs - Add optimal_m_r_split for dynamic m/r layout selection - Refactor profile example to be generic over CommitmentConfig and accept HACHI_NUM_VARS / HACHI_MODE env vars Made-with: Cursor * refactor(algebra): add repr(transparent) to CyclotomicRing types Enables safe transmute between `[CyclotomicRing<F, D>]` and `[F]` for the upcoming FlatMatrix D-agnostic storage layer. Made-with: Cursor * refactor(commitment): D-agnostic FlatMatrix storage + halving-D scaffolding Replace `Vec<Vec<CyclotomicRing<F, D>>>` in HachiExpandedSetup with `FlatMatrix<F>`, a D-agnostic flat field-element array that can be viewed at any ring dimension via `.view::<D>()`. This decouples setup storage from the const-generic D, enabling future varying-D prove loops. Key changes: - HachiExpandedSetup<F, D> → HachiExpandedSetup<F> (loses D) - HachiVerifierSetup<F, D> → HachiVerifierSetup<F> - NTT/CRT functions take RingMatrixView instead of &[Vec<CyclotomicRing>] - New FlatMatrix, NttCache, and dispatch_ring_dim! infrastructure - New CommitmentConfig::d_at_level / n_a_at_level trait methods - New Fp128HalvingDCommitmentConfig (D=512→256→128→64) - commit_w made pub for future varying-D usage Made-with: Cursor * refactor(bench): rewrite benchmarks with real configs and parameterized D Replace hand-rolled bench_config! 
macro with real commitment configs (Fp128FullCommitmentConfig, Fp128OneHotCommitmentConfig, Fp128LogBasisCommitmentConfig). Parameterize D as const generic instead of hardcoding. Use random evaluations, iter_batched for prove bench, and add HACHI_PARALLEL=0 env var for single-threaded runs. Made-with: Cursor * fix: eliminate debug-build stack overflow via dispatch extraction and NTT cache boxing Extract dispatch_ring_dim!/dispatch_with_ntt! macro expansions into dedicated #[inline(never)] functions (dispatch_prove_level, dispatch_verify_level, dispatch_commit) so monomorphized match arms live in separate stack frames instead of bloating the caller. Box NttSlotCache<D> fields inside MultiDNttCaches to avoid ~465KB temporaries on the stack when constructing MultiDNttBundle. Remove with_large_stack test wrappers and .cargo/config.toml — all tests now pass with the default 2MB stack in debug builds. Clean up import hygiene: hoist in-function use statements, replace inline fully-qualified paths with top-level imports. Made-with: Cursor * fix: broken doc links and clippy needless_range_loop - Use crate-qualified paths for MultiDNttBundle and HachiExpandedSetup doc links in dispatch_with_ntt macro - Replace index loop with iterator in flat_matrix test Made-with: Cursor

* Add rayon parallelism behind `parallel` feature flag (enabled by default) - New src/parallel.rs with cfg_iter!/cfg_into_iter!/cfg_chunks! macros that dispatch to rayon parallel iterators when `parallel` is enabled - Parallelize protocol hot paths: ring polynomial division, w_evals construction, M_alpha evaluation, ring vector evaluation, packed ring poly evaluation, coefficients-to-ring reduction, quadratic equation folding, and sumcheck round polynomial computation - All 174 tests pass with and without the parallel feature Made-with: Cursor * Add e2e benchmark and make HachiCommitmentScheme generic over config - Make HachiCommitmentScheme generic over <const D, Cfg> so different configs (and thus num_vars) can be used without code duplication. - Remove hardcoded DefaultCommitmentConfig::D from ring_switch.rs; WCommitmentConfig and commit_w now flow D generically. - Add benches/hachi_e2e.rs with configs sweeping nv=10,14,18,20. Made-with: Cursor * Refactor CRT-NTT backend: generalize over PrimeWidth, add Q128 support Make NTT primitives (NttPrime, NttTwiddles, MontCoeff, CyclotomicCrtNtt) generic over PrimeWidth (i16/i32) instead of hardcoding i16. Replace the monolithic QData struct with separate GarnerData and per-prime NttPrime arrays. Add Q128 parameter set (5 × i32 primes, D ≤ 1024) alongside the existing Q32 set. Simplify ScalarBackend by removing the const-generic limb count from to_ring_with_backend. Made-with: Cursor * Add extension field arithmetic and refactor sumcheck trait bounds Split CanonicalField into FromSmallInt (from_{u,i}{8,16,32,64} for all fields) and CanonicalField (u128 repr, base fields only). Implement FromSmallInt, Eq, Debug for Fp2/Fp4. Add ExtField<F> trait with EXT_DEGREE and from_base_slice. Optimize extension field arithmetic: Karatsuba multiplication for Fp2 and Fp4 (3 base muls instead of 4), specialized squaring (2 base muls for Fp2), non-residue IS_NEG_ONE specialization. 
Add concrete configs (TwoNr, NegOneNr, UnitNr) and type aliases Ext2<F>, Ext4<F>. Add transpose-based packed extension fields (PackedFp2, PackedFp4) for SIMD acceleration, following Plonky3's approach. Relax sumcheck bounds from E: CanonicalField to E: FromSmallInt (or E: FieldCore where spurious). Add sample_ext_challenge transcript helper. Includes tests for extension field sumcheck execution. Made-with: Cursor * Fix CRT+NTT correctness and optimize negacyclic NTT pipeline Correctness fixes: - Rewrite negacyclic NTT as twist + cyclic DIF/DIT pair (no bit-reversal permutation), correctly diagonalizing X^D+1. - Center coefficient→CRT mapping and Garner reconstruction to handle negacyclic sign wrapping consistently. - Fix i32 Montgomery csubp/caddp overflow via branchless i64 widening. - Fix q128 centering overflow in balanced_decompose_pow2 (avoid casting q≈2^128 into i128). - Remove dense-protocol schoolbook fallback; all mat-vec now routes through CRT+NTT. Performance optimizations: - Precompute per-stage twiddle roots in NttTwiddles (eliminate runtime pow_mod per butterfly stage). - Forward DIF butterfly skips reduce_range before Montgomery mul (safe because mul absorbs unreduced input). - Hoist centered-coefficient computation out of per-prime loop in from_ring. - Add fused pointwise multiply-accumulate for mat-vec inner loop. - Add batched mat_vec_mul_crt_ntt_many that precomputes matrix NTT once and reuses across many input vectors. - Wire commit_ring_blocks to batched A*s path. Benchmarks (D=64, Q32/K=6): - Single-prime forward+inverse NTT: 1.14µs → 0.43µs (2.7x) - CRT round-trip: 10.7µs → 6.3µs (1.7x) - Commit nv10: ~70% faster, nv20: ~47% faster Made-with: Cursor * Cache CRT+NTT matrix representations in setup to avoid repeated conversion The dense mat-vec paths (commit_ring_blocks, commit_onehot B-mul, compute_v) previously converted coefficient-form matrices to CRT+NTT on every call. 
Now the setup eagerly converts A, B, D into an NttMatrixCache and all dense operations use the pre-converted form. Coefficient-form matrices are retained for the onehot inner-product path and ring-switch/generate_m. Made-with: Cursor * Remove dead code (HachiRoutines, domains/, redundant trait methods) and extract shared field utilities - Delete unused HachiRoutines trait and dead algebra/domains/ module - Remove redundant FieldCore::add/sub/mul and Module::add/neg (covered by ops traits) - Extract is_pow2_u64, log2_pow2_u64, mul64_wide into fields/util.rs to deduplicate Made-with: Cursor * Unify Blake2b and Keccak transcript backends into generic HashTranscript Replace separate blake2b.rs and keccak.rs with a single generic HashTranscript<D: Digest> parameterized by hash function. Blake2bTranscript and KeccakTranscript are now type aliases. Made-with: Cursor * Fix sumcheck degree bug, split types, in-place fold, CommitWitness, rename configs, add soundness test - Fix CompressedUniPoly::degree() off-by-one that could let malformed proofs pass - Split sumcheck/mod.rs: extract types into types.rs, relocate multilinear_eval and fold_evals to algebra/poly.rs - Replace allocating fold_evals with in-place fold_evals_in_place - Add debug_assert guards to multilinear_eval and fold_evals_in_place - Introduce CommitWitness struct to replace error-prone 3-tuple returns - Rename DefaultCommitmentConfig to SmallTestCommitmentConfig, add ProductionFp128CommitmentConfig - Add verify_rejects_wrong_opening negative test for verifier soundness Made-with: Cursor * fix(test): resolve clippy needless_range_loop in algebra tests Use iter().enumerate() for schoolbook convolution loops and array::from_fn for pointwise NTT operations. Made-with: Cursor * Refactor commitment setup to runtime layout and staged artifacts. 
This removes compile-time commitment shape locks, derives beta from runtime layout, and threads layout-aware setup through commit/prove/verify with setup serialization roundtrip coverage. Made-with: Cursor * Soundness hardening: panic-free verifier, Fiat-Shamir binding, NTT overflow fix - Verifier path never panics; all errors return HachiError - Bind commitment, opening point, and y_ring in Fiat-Shamir transcript - Fix i16 csubp/caddp overflow by widening to i32 - multilinear_eval returns Result with dimension checks - build_w_evals validates w.len() is a multiple of d - UniPoly::degree uses saturating_sub instead of expect - Serialize usize as u64 for 32/64-bit portability - Fix from_i64(i64::MIN) via unsigned_abs - Remove Transcript::reset from public trait (move to inherent) - Add batched_sumcheck verifier empty-input guard Made-with: Cursor * Hoist fully qualified paths to use statements in touched files Replace inline crate::protocol::commitment::HachiCommitmentLayout, hachi_pcs::algebra::backend::{CrtReconstruct, NttPrimeOps}, and hachi_pcs::algebra::CyclotomicRing with top-level use imports. Made-with: Cursor * Dispatch norm sumcheck kernels by range size. Route small-b rounds through the point-eval interpolation kernel and keep the affine-coefficient kernel for larger b, while adding deterministic baseline-vs-dispatched benchmarks and parity tests to validate correctness across both strategies. Made-with: Cursor * Format commitment-related files for readability. Apply non-functional formatting and import ordering cleanups across commitment, ring-switch, and benchmark/test files to keep the codebase style consistent. Made-with: Cursor * Format: cargo fmt pass on commitment-related files Made-with: Cursor * feat: sequential coefficient ordering + streaming commitment Change coefficient-to-ring packing from strided to sequential, enabling true streaming where each trace chunk maps to exactly one inner Ajtai block. 
Implement StreamingCommitmentScheme for HachiCommitmentScheme. - reduce_coeffs_to_ring_elements: sequential packing (chunks_exact(D)) - prove/verify: opening point split flipped to (inner, outer) - ring_opening_point_from_field: outer split flipped to (M first, R second) - commit_coeffs: sequential block distribution - map_onehot_to_sparse_blocks: sequential block distribution - HachiChunkState + process_chunk / process_chunk_onehot / aggregate_chunks - Streaming commit tests (matches non-streaming, prove/verify roundtrip) Made-with: Cursor * refactor: decompose verify_batched_sumcheck into composable steps Split the monolithic verify_batched_sumcheck into three pieces: - verify_batched_sumcheck_rounds: replay rounds, return intermediate state - compute_batched_expected_output_claim: query verifier instances - check_batched_output_claim: enforce equality This enables callers (e.g. Greyhound) to intercept the intermediate sumcheck state before the final oracle check. The original function is preserved as a convenience wrapper. Made-with: Cursor * feat: Labrador/Greyhound recursive lattice proof protocol Implements the full Labrador recursive amortization and Greyhound evaluation reduction, ported from the C reference with Hachi-native Fiat-Shamir transcript integration. 
New modules: - protocol::labrador: recursive proof (prover, verifier, fold, commit, challenge rejection sampler, JL projection, config/guardrails, types) - protocol::greyhound: evaluation reduction (4-row witness, 5 constraints, eval prover + verifier-side reduce) - protocol::prg: pluggable PRG backends (SHAKE256, AES-128-CTR) for commitment key and JL matrix derivation Hachi-core changes: - algebra::ring: conjugation automorphism, coeff_norm_sq, ternary/quaternary samplers for Labrador challenges - protocol::commitment: pre-derived setup matrices, PRG backend abstraction for matrix derivation - protocol::proof: HachiProof restructured as composite of folds + GreyhoundEvalProof + LabradorProof - protocol::ring_switch: externalized w_tilde(r) check for Greyhound - protocol::transcript: ring-element challenge functions (dense + rejection-sampled), 16 new Fiat-Shamir labels - protocol::commitment_scheme: integrated Greyhound/Labrador into prove/verify pipeline - sumcheck tests decoupled from old proof structure Made-with: Cursor * Impl folded Labrador protocol * Refactor Labrador Witness * Refactor Labrador Constraints * Change Greyhound to use Labrador scheme * Update gitignore * Fix CI issues * Use constants instead of hardcoded values * feat: integrate Greyhound/Labrador lattice proof protocol into main Port the Greyhound evaluation-reduction and Labrador recursive lattice proof modules from dev-labrador onto main's optimized proving pipeline. Greyhound/Labrador is invoked as a final proof step after multi-level folding when D >= 64, providing post-quantum security for the opening. New modules: protocol/greyhound, protocol/labrador, protocol/prg. Algebra extensions: coefficients_mut, coeff_norm_sq, balanced_decompose_pow2_with_carry, conjugation_automorphism_ntt, sample_ternary/quaternary. Made-with: Cursor * Remove integration to Hachi * Fix CI issue --------- Co-authored-by: Quang Dao <quang.dao@layerzerolabs.org>
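The extension-field commit earlier in this list mentions Karatsuba multiplication for Fp2 (3 base muls instead of 4) and an IS_NEG_ONE non-residue specialization. The idea can be sketched as below, assuming a toy prime `P = 2^31 - 1` and non-residue `-1`; the type and function names are illustrative and are not the crate's API.

```rust
/// Toy prime modulus 2^31 - 1 (assumption). Since P ≡ 3 (mod 4),
/// -1 is a quadratic non-residue, so F_p[u]/(u^2 + 1) is a field.
const P: u64 = 2_147_483_647;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
// Inputs are < 2^31, so the product fits in u64 without overflow.
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

/// Element c0 + c1*u with u^2 = -1.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Fp2 { c0: u64, c1: u64 }

/// Schoolbook product: 4 base-field multiplications.
fn mul_schoolbook(a: Fp2, b: Fp2) -> Fp2 {
    Fp2 {
        c0: sub(mul(a.c0, b.c0), mul(a.c1, b.c1)),
        c1: add(mul(a.c0, b.c1), mul(a.c1, b.c0)),
    }
}

/// Karatsuba product: 3 base-field multiplications.
/// The cross term a0*b1 + a1*b0 equals (a0 + a1)(b0 + b1) - a0*b0 - a1*b1.
fn mul_karatsuba(a: Fp2, b: Fp2) -> Fp2 {
    let v0 = mul(a.c0, b.c0);
    let v1 = mul(a.c1, b.c1);
    let cross = mul(add(a.c0, a.c1), add(b.c0, b.c1));
    Fp2 {
        // Non-residue is -1, so multiplying v1 by it is just a subtraction.
        c0: sub(v0, v1),
        c1: sub(sub(cross, v0), v1),
    }
}
```

For example, (1 + 2u)(3 + 4u) = 3 + 10u + 8u^2 = -5 + 10u, so both functions return `c0 = P - 5`, `c1 = 10`.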

Save HachiExpandedSetup (seed + matrices A, B, D) to an OS-specific cache directory on first generation, and transparently load it on subsequent calls to avoid re-deriving matrices from SHAKE. NTT caches are rebuilt from the deserialized matrices. Pattern follows Dory's disk-persistence approach but saves only the expanded setup (not prover+verifier separately) since NTT caches are not serializable and must be reconstructed. Made-with: Cursor
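A load-or-generate cache of this kind might look like the sketch below. The file layout (magic bytes, seed, payload), the `derive_setup` stand-in for the SHAKE-based derivation, and the caller-supplied path are all assumptions made for illustration; the real code resolves an OS-specific cache directory and rebuilds the NTT caches after loading.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical file header used to recognize a valid cache file.
const MAGIC: &[u8; 4] = b"HSET";

/// Stand-in for the expensive matrix derivation (the real code expands
/// matrices from a seed with SHAKE; this is a simple LCG for illustration).
fn derive_setup(seed: u64, len: usize) -> Vec<u8> {
    let mut state = seed;
    (0..len)
        .map(|_| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            (state >> 56) as u8
        })
        .collect()
}

/// Return the cached payload if the file is present and well-formed;
/// otherwise regenerate it and overwrite the file. A stale or corrupt
/// file is never an error, it only triggers regeneration.
fn load_or_generate(path: &Path, seed: u64, len: usize) -> io::Result<Vec<u8>> {
    if let Ok(bytes) = fs::read(path) {
        let well_formed = bytes.len() == MAGIC.len() + 8 + len
            && bytes.starts_with(MAGIC)
            && bytes[4..12] == seed.to_le_bytes();
        if well_formed {
            return Ok(bytes[12..].to_vec());
        }
    }
    let payload = derive_setup(seed, len);
    let mut file = Vec::with_capacity(MAGIC.len() + 8 + len);
    file.extend_from_slice(MAGIC);
    file.extend_from_slice(&seed.to_le_bytes());
    file.extend_from_slice(&payload);
    fs::write(path, &file)?;
    Ok(payload)
}
```

Storing the seed in the header lets a lookup with a different seed fall through to regeneration instead of returning mismatched matrices.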

* fix: harden CI workflow to resolve CodeQL security alerts Pin all GitHub Actions to immutable commit SHAs and add least-privilege permissions (contents: read) to address 9 medium-severity CodeQL alerts. Made-with: Cursor * chore: add .cursor/ to .gitignore Made-with: Cursor --------- Co-authored-by: Omid Bodaghi <42227752+omibo@users.noreply.github.com>

…outer_weights (#10) Three performance fixes for the prove path: 1. Gate prove_level_diagnostic, prove_level_selfcheck, and the w_eval consistency check behind #[cfg(debug_assertions)]. These were running unconditionally in release builds, causing duplicate compute_m_a_streaming calls and full polynomial evaluations purely for debug verification. 2. Factor outer_weights in prove_one_level: instead of materializing the full 2^(m_vars + r_vars) basis weight vector (~2.1 GB for large traces), pass ring_opening_point.b (size 2^r_vars) and derive the evaluation from the fold result: eval = Σ_i b[i] * fold(a)[i]. 3. Update HachiPolyOps::evaluate_and_fold signature to accept factored per-block outer scalars instead of the full tensor product. Made-with: Cursor
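The factoring in item 2 relies on the identity Σ_{i,j} b[i]·a[j]·f[i][j] = Σ_i b[i]·fold(a)[i]. A small sketch of both forms over a toy prime field follows; the modulus, the block-major coefficient layout, and the function names are assumptions rather than the crate's actual `evaluate_and_fold` signature.

```rust
/// Toy prime modulus 2^31 - 1 (assumption); products of reduced values fit in u64.
const P: u64 = 2_147_483_647;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

/// Fold each block against the inner weights:
/// out[i] = sum_j a[j] * poly[i * a.len() + j].
fn fold_blocks(poly: &[u64], a: &[u64]) -> Vec<u64> {
    poly.chunks_exact(a.len())
        .map(|block| {
            block.iter().zip(a).fold(0, |acc, (&c, &w)| add(acc, mul(c, w)))
        })
        .collect()
}

/// Factored evaluation: only the small outer vector `b` is needed.
fn eval_factored(poly: &[u64], a: &[u64], b: &[u64]) -> u64 {
    fold_blocks(poly, a)
        .iter()
        .zip(b)
        .fold(0, |acc, (&f, &w)| add(acc, mul(f, w)))
}

/// Reference: materialize the full tensor product b ⊗ a, which has
/// b.len() * a.len() entries, then take the inner product with `poly`.
fn eval_full_tensor(poly: &[u64], a: &[u64], b: &[u64]) -> u64 {
    let weights: Vec<u64> = b
        .iter()
        .flat_map(|&bi| a.iter().map(move |&aj| mul(bi, aj)))
        .collect();
    poly.iter().zip(&weights).fold(0, |acc, (&c, &w)| add(acc, mul(c, w)))
}
```

The factored form needs memory proportional to `b.len()` plus the fold output, while the reference needs a weight vector as large as the polynomial itself.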

* perf: gate debug diagnostics behind cfg(debug_assertions) and factor outer_weights Three performance fixes for the prove path: 1. Gate prove_level_diagnostic, prove_level_selfcheck, and the w_eval consistency check behind #[cfg(debug_assertions)]. These were running unconditionally in release builds, causing duplicate compute_m_a_streaming calls and full polynomial evaluations purely for debug verification. 2. Factor outer_weights in prove_one_level: instead of materializing the full 2^(m_vars + r_vars) basis weight vector (~2.1 GB for large traces), pass ring_opening_point.b (size 2^r_vars) and derive the evaluation from the fold result: eval = Σ_i b[i] * fold(a)[i]. 3. Update HachiPolyOps::evaluate_and_fold signature to accept factored per-block outer scalars instead of the full tensor product. Made-with: Cursor * perf: streamline recursive Hachi proving path Keep recursive w witnesses in digit form to avoid rebuilding dense polynomials, and size setup and ring-switch work from exact runtime layouts to cut redundant work. Made-with: Cursor * fix: satisfy clippy on setup and ring-switch helpers Address the current CI failures with minimal changes by allowing the internal layout helper's argument count and switching the fused m_evals_x loops to iterator-based indexing. Made-with: Cursor

* perf: tighten and speed up norm sumcheck Enforce the balanced digit range produced by decomposition and reduce round-zero norm sumcheck work with compact affine precomputation plus the centered balanced point-eval form. Made-with: Cursor * feat: parameterize recursive w basis and expand profile comparisons Allow recursive w openings to use a different gadget basis from level 0 so we can explore decomposition and sumcheck tradeoffs directly. Add profile modes for comparing basis choices across the main dense and onehot workloads. Made-with: Cursor * perf: cache t rows and trim ring-switch witness overhead Cache inner Ajtai t rows so A_row can reuse them directly and accumulate only the quotient high half instead of recomposing from t_hat on every block. Trim the ring-switch witness path by dropping the unused field w-table, reusing decomposition scratch, and reading the final w evaluation from the folded prover state. Made-with: Cursor * perf: skip padded x tails in fused sumcheck Track the live x prefix from ring switch into the fused prover so x-rounds only accumulate and fold the physical witness region instead of explicit zero padding. Preserve the old semantics with round-by-round equivalence tests against the padded prover. Made-with: Cursor * test: bundle sumcheck test helper params for clippy Collapse the test-only Hachi sumcheck prover helper arguments into a small params struct so clippy no longer rejects the PR on too-many-arguments. Made-with: Cursor * fix: allow no-default-features sumcheck lint path Mark the parallel-only relation combiner as intentionally unused when the parallel feature is disabled so the CI clippy matrix stays green in both feature configurations. Made-with: Cursor * perf: specialize single-digit z_pre folds Cache dense small-digit coefficients and add direct onehot and dense single-digit fold paths so quadratic-equation z_pre construction stops paying generic decomposition costs when the witness is already digit-sized. 
Made-with: Cursor

* Add rayon support in Labrador * Change labrador params and match with reference impl * Impl Ajtai commitment scheme trait * Add setup to Labrador prover * Pass transcript to JL projection * Fix the issue with JL matrix distribution * Add benchmark for Labrador single level prover * Update labrador single-level proof benchmark * Refactor constraints in Labrador * Add two level labrador prover benchmark * Add docs for building next constraints functions * Make Labrador benchmark more realistic based on Greyhound numbers * Add NTT backend Ajtai commitment scheme * Add tests to verify the verifier rejects malicious proofs * Add more tracing info for level prover * Use constants in tests/commitment * Optimizing aggregation phase * Fix recursive Labrador bug * Integrate Greyhound and Hachi * Integrate Labrador directly to Hachi * Address CI issues * Remove unused code * Fix Labrador handoff binding and tail proof encoding Bind Labrador tails to the carried Hachi commitment, harden verifier and JL metadata checks, and make Labrador-tail serialization and size accounting honest. Add regression coverage for spliced tails, malformed metadata, variable-D handoff selection, and proof-size accounting. Made-with: Cursor * Update hachi e2e test * Use existing setup matrices * perf: switch bounded Fp128 configs to D=256 Align the default and halving Fp128 presets around the 256-dimensional Labrador path so the baseline matches the supported challenge machinery. Increase the sparse challenge weight at the lower ring dimension to preserve the intended security margin. Made-with: Cursor * perf: speed up Labrador challenge sampling Add a dedicated single-challenge fast path and reuse precomputed operator-norm tables for sparse challenges. This keeps the Fiat-Shamir distribution unchanged while removing repeated dense trigonometric work from the sampler hot path. 
Made-with: Cursor * perf: pack Labrador JL matrices and replay reduced statements Store JL signs in a packed ternary layout and aggregate directly into ring-aligned phi blocks to cut the dominant collapse and projection bandwidth. Carry recursive Labrador state as reduced constraint plans so prover and verifier only materialize explicit sparse constraints when they are actually needed. Made-with: Cursor * chore: trace Labrador setup and commitment helpers Label fold planning, setup derivation, and NTT commitment entry points so Perfetto traces attribute the remaining unlabeled setup and commit time to concrete Labrador phases. Made-with: Cursor * test: right-size Labrador e2e coverage Keep the Labrador end-to-end checks aligned with the current D=256 configs while reducing the default test sizes and serializing the heavy cases so nextest stays stable in CI. Made-with: Cursor * test: align Labrador coverage with profile path Use the standard onehot and full configs in the Labrador e2e checks, and benchmark the onehot prove path through OneHotPoly so the test and bench coverage matches the intended profile example behavior. Made-with: Cursor * style: format Labrador e2e imports Apply rustfmt's import grouping for the updated Labrador e2e test so the CI format check matches the checked-in tree. Made-with: Cursor * fix: regenerate stale setup caches and clarify Labrador stream IDs Avoid panicking on invalid cached setup files so local and CI runs can rebuild cleanly, and rename the deterministic challenge stream selector so CodeQL does not treat test vectors as hard-coded nonces. Made-with: Cursor --------- Co-authored-by: Quang Dao <quang.dao@layerzerolabs.org>

* Optimize aggregating jl projection functions * Add lookup to JL aggregation * Add more parallelism * perf: cut Labrador verifier recursion rebuild costs Split Labrador recursion state into prover and verifier setup shapes and compute reduced-plan verifier aggregation directly, so recursive verification stops rebuilding dense intermediate rows and unused NTT caches. Made-with: Cursor * perf: stream Labrador JL replay and tail checks Replay JL rows from the accepted transcript seed and verify the tail round directly on decomposed payloads, so recursive verification avoids rebuilding dense JL matrices and recomposed witness side data. Made-with: Cursor * perf: cut Labrador recursion earlier and batch hot kernels Prefer tail cutover as soon as it beats another standard fold, and batch the hottest aggregation, challenge replay, and linear-garbage kernels so large-nv Labrador stops dwarfing the Hachi path. Made-with: Cursor * perf: speed up Labrador aggregation and challenge replay Exploit sparse Labrador coefficient structure and cheaper challenge bound checks to cut the remaining prover and verifier hotspots without changing transcript behavior. Made-with: Cursor * perf: accelerate Labrador JL replay and aggregation kernels Reuse the in-memory JL collapse path on verifier replay, cut repeated JL scheduling overhead, and tighten dense ring accumulation so the remaining Labrador prover and verifier aggregation paths spend less time in repeated per-element work. Authored by Cursor assistant (model: GPT-5.4) on behalf of Quang Dao. Made-with: Cursor * perf: tighten Labrador handoff accounting and profiling Make profile runs fail fast outside --release and add the size diagnostics needed to compare direct and Labrador tails from real serialized cost. Reuse the handoff D-matrix NTT cache and compare recursive Labrador transitions against actual carried payload size so tail selection reflects what the proof will actually send. 
Made-with: Cursor * refactor: dedupe Labrador helper paths and quiet prover diagnostics Share the repeated Labrador utility helpers in one place and move the prover's profiling prints onto structured tracing, so the review feedback is addressed without changing protocol behavior. Made-with: Cursor * fix: inline profile format args for clippy Rewrite the remaining profile example format strings to use inline captures so the CI Clippy job passes again without changing the example's output. Made-with: Cursor * perf: cut allocation churn in folding helpers Reuse flat output buffers in ring-switch and sumcheck prefix folding, and evaluate multilinears recursively over slices. This trims temporary Vec creation on hot prover paths without changing protocol behavior. Made-with: Cursor * refactor: hoist opening-point helpers and simplify profile example Centralize basis and opening-point conversions so the profile example and protocol code reuse the same logic. Drop the setup-only profiling path so the example stays focused on end-to-end proving runs. Made-with: Cursor * fix: restore opening-point test helper imports Keep the commitment-scheme tests compiling after hoisting opening-point helpers into their own module. Include the accompanying rustfmt cleanup in touched Rust call sites. Made-with: Cursor * refactor: cut over Labrador naming and wire labels Replace the terse Labrador config and payload vocabulary with descriptive names across recursion, proofs, and transcript labels so the implementation is easier to follow and the wire format stays internally consistent. Guard the small-digit CRT/NTT fast path so deeper folds fall back safely once coefficients leave the lookup-table range. Made-with: Cursor * Refine Labrador handoff selection and tests --------- Co-authored-by: Quang Dao <quang.dao@layerzerolabs.org>

* perf: add d64 partial-split NTT prototype Isolate the q=2^128-5823 D=64 partial-split multiplication path, its packed cached-domain kernels, and a focused benchmark/test suite so it can be reviewed independently from the sumcheck work. Made-with: Cursor * fix: satisfy clippy in partial split benches Clean up the benchmark and test scaffolding to avoid indexed iteration warnings and packed-width modulo warnings in CI. Made-with: Cursor * perf: tighten partial split NTT hot helpers Inline the small hot wrappers, collapse duplicated scalar and packed helper kernels, and remove unused prototype-only APIs so the partial-split backend is leaner without changing behavior. Made-with: Cursor * perf: re-fuse single-product partial-split kernels Restore direct-write split multiply kernels so single-product and packed batch workloads do not pay the zero-plus-accumulate cost introduced by the cleanup refactor. Made-with: Cursor

* perf: split Hachi sumcheck into two stages Separate the prefix-range pass from the fused relation scan so stage 2 can reuse the shared local w basis and avoid redundant work. This also completes the stage naming cutover and removes the obsolete standalone sumcheck modules. Made-with: Cursor * chore: fix doc placement on HachiLevelProof and use Prime128M8M4M1M0 in tests Move the D-agnostic doc comment to HachiLevelProof where it belongs, and replace Fp64<4294967197> with the named Prime128M8M4M1M0 alias in ring_switch, hachi_stage1, and hachi_stage2 tests. Made-with: Cursor * fix: absorb s_claim into transcript before batching challenge + dedup trim_trailing_zeros Absorb the prover-supplied s_claim into the Fiat-Shamir transcript before sampling CHALLENGE_SUMCHECK_BATCH on both prover and verifier sides. Without this, an adversary could choose among multiple valid s_claim values after seeing the batching coefficient. Also extract the duplicated trim_trailing_zeros helper from hachi_stage1 and hachi_stage2 into the parent sumcheck module. Made-with: Cursor * perf: apply split-eq e_in-inside/e_out-outside optimization to all prefix_x paths Factor out the e_second multiplication from the inner loop in stage 1 and stage 2 prefix_x compute_round methods. Within each block of consecutive pairs sharing the same j_high, accumulate contributions weighted by e_first (e_in) only, then post-multiply the block result by e_second (e_out) once. This eliminates one full field multiply per pair per round in all prefix_x code paths. Made-with: Cursor * perf: optimize compact Hachi sumcheck folds Use pair-fold lookup tables for compact stage-1 and stage-2 folds and absorb stage-2 batching into split-eq so the fused kernels do less repeated field work. Clarify the stage-2 relation docs to match the actual prover/verifier identity. 
Made-with: Cursor * perf: skip recoverable norm linear coefficients in Hachi sumcheck Use split-eq claim recovery to omit norm-round linear q terms during accumulation while still reconstructing the full round polynomial when needed. Track the prior norm claim in stage 2 and add split-eq recovery tests so the reduced-coefficient path stays equivalent to the full computation. Made-with: Cursor * perf: add bivariate-skip proofs for early Hachi sumcheck rounds Build the first two stage-local bivariate-skip proofs directly, reconstruct the omitted round polynomials from compact algebraic state, and tighten the stage-2 prefix path so the skipped rounds stay cheap while the terminology matches the math. Made-with: Cursor * fix: keep full stage2 m table through sparse x rounds Carry the full stage2 m multilinear table across sparse prefix-x folding so boundary pairs and quads still use the verifier's full relation data, and harden the prefix tests around nonzero tail entries so the compact prover path stays aligned with the padded reference. Made-with: Cursor * fix: count prefix fields in profile proof breakdown Include both prefix option tag bytes and any serialized bivariate-skip payloads in the profile size accounting, and expose size/presence helpers on the staged proof payloads so the example can report those fields without reaching into private internals. Made-with: Cursor * test: clean up bivariate-skip reference helpers for CI clippy Use assign-op and iterator forms in the two-round prefix reference helpers so the strict all-targets Clippy job stays green without changing the helper math. Made-with: Cursor * fix: keep sumcheck prefix prover-only Bind the transcript only to canonical round messages and reject malformed proof shapes explicitly so verifier flow stays implementation-agnostic. Made-with: Cursor
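Several commits in this group omit round-polynomial data that the running claim already determines. The generic version of that idea: a sumcheck round polynomial g satisfies g(0) + g(1) = claim, so its linear coefficient can be dropped and recovered later. The sketch below shows only this textbook relation over a toy prime; it is not the crate's split-eq variant, and all names are illustrative.

```rust
/// Toy prime modulus 2^31 - 1 (assumption).
const P: u64 = 2_147_483_647;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

/// Horner evaluation; coefficients are ordered low degree to high degree.
fn eval(coeffs: &[u64], x: u64) -> u64 {
    coeffs.iter().rev().fold(0, |acc, &c| add(mul(acc, x), c))
}

/// Drop the linear coefficient c1 (expects at least two coefficients).
fn compress(coeffs: &[u64]) -> Vec<u64> {
    coeffs
        .iter()
        .enumerate()
        .filter(|(i, _)| *i != 1)
        .map(|(_, &c)| c)
        .collect()
}

/// Recover c1 from g(0) + g(1) = 2*c0 + c1 + c2 + ... + cd = claim.
fn decompress(compressed: &[u64], claim: u64) -> Vec<u64> {
    let c0 = compressed[0];
    let higher_sum = compressed[1..].iter().fold(0, |acc, &c| add(acc, c));
    let c1 = sub(sub(claim, add(c0, c0)), higher_sum);
    let mut full = vec![c0, c1];
    full.extend_from_slice(&compressed[1..]);
    full
}
```

Because the recovered coefficient depends on the claim, a verifier that decompresses with the claim it already holds enforces g(0) + g(1) = claim by construction.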

… estimator (#19) * ci: add onehot nv32 benchmark reporting Track onehot nv32 timing and RSS in CI with a sticky PR report so benchmark changes stay visible across commits without heavier profiling artifacts. Made-with: Cursor * ci: clarify onehot sparsity labels Describe the nv32 benchmark and D=64 estimator as 1-of-256 one-hot so reviewers can read the sparsity assumptions directly from the check output and reports. Made-with: Cursor * docs/ci: add onehot analysis notes and harden benchmark reporting Bundle the supporting one-hot and SIS analysis notes with the benchmark branch so the PR carries the rationale for the new parameter choices. Clean up the remaining benchmark-reporting edge cases so traces stay alive for the full run, partial baselines render correctly, and PR comment upserts fail softly instead of surfacing hidden job errors. Made-with: Cursor * docs: remove local-only analysis notes from branch Keep the root analysis markdowns local-only so the benchmark PR only carries code and workflow changes. Preserve the local files via repo-local excludes instead of tracking them in git. Made-with: Cursor * ci: fix onehot timing fallback attribution Attribute missing split timings to Hachi when the benchmark log only exposes total prove or verify time, so the report stays conservative instead of assigning the whole interval to Labrador. Made-with: Cursor * ci: compare onehot bench to main and previous run Render the onehot benchmark report against both the main-branch split point and the previous successful PR update so regressions are visible against the branch base as well as the last iteration. Made-with: Cursor

…20) * Use scalar field randomness instead of ring randomness * Use AggregationRandomness enum for two randomness cases * Remove b computation from aggregation * Make JL projection matrix generation thread-friendly * Speed up computing h * Fix clippy

…enge families (#21) * feat: add D64 onehot scheduling infrastructure * fix: add missing Cfg generic in disk-persistence tests and correct current_w_len on verifier paths - commit.rs: supply TinyConfig to get_storage_path and load_expanded_setup in disk-persistence tests (fixes clippy/test CI) - labrador_handoff.rs: derive current_w_len from w_layout instead of passing 0 in the legacy handoff verifier - commitment_scheme.rs: derive initial current_w_len from the commitment layout (layout.num_blocks * layout.block_len * D) instead of raw 1 << max_num_vars so prover and verifier always agree Made-with: Cursor * fix: prevent usize overflow in current_w_len for large max_num_vars Use checked_shl or layout-derived values instead of raw 1usize << max_num_vars, which panics when max_num_vars >= 64 (e.g. disk-persistence tests with TinyConfig). Made-with: Cursor * refactor: cut dense commitments over to D=128 Make the runtime scheduling and sparse-challenge redesign land on a single dense Fp128 profile so full and log-basis commitments no longer depend on a legacy D=256 halving path. Keep generic D=256 NTT plumbing available while updating proofs, tests, scripts, examples, and benchmarks to reflect the new D=128 and D=64 defaults. Made-with: Cursor * fix: align recursive layouts with sound basis-2 checks Derive recursive witness layouts from the active level parameters so recursive openings, ring-switching, and Labrador handoff stay aligned after the D=128 cutover. Replace the basis-2 combined path with a direct W-only degree-5 sumcheck, remove the virtual S claim, and add end-to-end tamper coverage. Made-with: Cursor * fix: stabilize recursive onehot folding for D=64 Handle +/-2 sparse challenges correctly in the recursive z_pre path, restore the arm64 NEON fast path for small magnitudes, and cover the two-round-prefix edge cases. Clean up the temporary debug instrumentation and align the profiling and estimator tooling with the updated proof-size accounting. 
Made-with: Cursor * fix: apply rustfmt for CI Normalize the touched Rust files to match the repository formatter so the PR checks run cleanly on GitHub Actions. Made-with: Cursor * feat: add adaptive fold-basis scheduling Use a deterministic public-input schedule so setup, proving, verification, and cache reuse stay aligned across onehot, log, and full configs. Widen the digit LUT path through basis 5 and add mixed-basis regressions so adaptive schedules stay sound. Made-with: Cursor * fix: stabilize direct tail packing and drop dead config Widen direct-tail packing so adaptive schedules do not panic when terminal witness digits exceed the planned basis, and remove the unused rank-2 bounded config to keep the commitment surface minimal. Made-with: Cursor * fix: align planner witness sizing with runtime recursion Use the exact half-field bound so adaptive planning derives the same recursive witness sizes as runtime, and add sparse challenge sampling tracing to make these paths easier to diagnose. Made-with: Cursor * fix: reduce D64 recursion overhead Shrink stage-1 compact tables, avoid redundant recursive hint reconstruction, and realign D-dependent challenge sizing so the lowered ring dimensions actually pay off in memory and prover work. Made-with: Cursor * perf: block-parallel mat_vec_mul_ntt_digits_i8 (12x speedup) When n_a <= 2 and num_blocks >= 16, parallelize over blocks instead of column tiles. The old tiling created only 5 tiles for Rayon while the new path gives num_blocks-way parallelism (256 for onehot nv32). commit_w level 0: 273ms → 23ms on onehot nv32. Made-with: Cursor * perf: position-parallel sparse onehot accumulation with precomputed rotation table Replace per-block fold-reduce with per-thread chunked accumulation and a dense rotation table (16 KB for D=64, fits in L1). Each entry becomes a branchless vector addition instead of scatter-based random access. 
Made-with: Cursor * perf: parallelize balanced decomposition in decompose_w_hat Made-with: Cursor * fix: guard binomial_u64 against subtraction overflow when n < k Made-with: Cursor * perf: optimize high-half quotient with loop trimming and parallel accumulation Trim add_sparse_ring_product_high_half to skip zero-contribution iterations (degree < D), parallelize A-row and challenge-fold quotient accumulation via cfg_fold_reduce, and extract parallel_high_half_accumulate helper. Made-with: Cursor * perf: parallelize z/r balanced decomposition in build_w_coeffs Made-with: Cursor * perf: column-sweep Ajtai commit for onehot — 2.2-2.5x at nv32, ~2x at nv36 Replace block-by-block inner_ajtai_onehot_wide (where each block independently reads and widens A columns from L3) with a two-level tiled column-sweep that reads each A column exactly once per tile. Outer level: Rayon threads partition blocks evenly. Inner level: blocks processed in L2-sized tiles (~1024 blocks, 2MB accumulators). Entries bucketed by A-column, then swept sequentially so each column is widened once and scattered into all referencing block accumulators. Falls back to the original block-by-block path when blocks_per_thread is small (≤128), where the bucketing overhead exceeds its benefit. Also includes position-partitioned BalancedDigitPoly::decompose_fold and updated OPTIMIZATION_REPORT.md. Made-with: Cursor * perf: concurrent NTT rows, parallel quotient fold, batched challenge absorb - Run D/B/A NTT row computations concurrently via rayon::join in compute_r_split_eq, overlapping independent matrix-vector products. - Replace sequential challenge_fold_row and A_row accumulation with parallel cfg_fold_reduce over blocks (8.9x speedup each). - Batch the four per-challenge append_bytes calls into one in sample_one, reducing hash update overhead for sample_sparse_challenges. - Update OPTIMIZATION_REPORT.md with round 3-6 results. 
Made-with: Cursor * perf: replace per-challenge hash chain with seed-then-SHAKE256-expand Derive a single 32-byte PRG seed from the transcript and expand all challenge randomness via SHAKE256 XOF, replacing ~20K sequential Blake2b512 chain operations with 1 chain + fast XOF squeeze. 2x speedup for sample_sparse_challenges (6.5ms → 3.2ms at 4096 challenges). Made-with: Cursor * perf: simplify norm checks and sparse challenge sampling Always use the two-stage norm-check flow so every proof shares one layout, one verifier path, and a smaller code surface. Stream sparse Fiat-Shamir challenge expansion from SHAKE and tighten i8 decomposition bounds while pruning the obsolete combined-sumcheck code and stale optimization report. Made-with: Cursor * fix: align planner proof sizing with two-stage norm checks Always model Hachi levels as stage1 plus stage2 so proof-size estimates match the serialized proof layout at every basis, including b=4. Add regressions for per-level byte estimation and direct-tail proof sizes to keep the planner in sync with runtime proofs. Made-with: Cursor * perf: inline level norm checks and gate compact onehot layout Flatten stage-1 and stage-2 data directly onto `HachiLevelProof` and only switch onehot witnesses to the compact regular layout once the large-profile cache savings outweigh the nv32 costs. This keeps small witnesses on the legacy sparse path while preserving the nv36 performance win. Posted by Cursor assistant (model: GPT-5.4) on behalf of the user (Quang Dao) with approval. Made-with: Cursor * Reduce D to 64: commitment schedule, ring switch, linear utils, poly ops - Update commitment/commit, config, schedule, and linear utilities for d=64 - Adjust ring_switch, quadratic_equation, hachi_poly_ops - Tweak examples/profile harness Made-with: Cursor
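
The `binomial_u64` fix listed above guards against `n - k` underflowing when `n < k`. The commit message does not show the function body, so the following is only a minimal sketch of one way such a guard can look. Returning `Some(0)` for `n < k` and `None` on overflow are assumptions made for illustration, not the repository's confirmed behavior.

```rust
/// Hypothetical sketch of a guarded binomial coefficient C(n, k).
/// Assumption: C(n, k) is defined as 0 when k > n, and `None` signals
/// that an intermediate product did not fit in a u64.
fn binomial_u64(n: u64, k: u64) -> Option<u64> {
    // Guard first: without this check, `n - k` below would underflow.
    if k > n {
        return Some(0);
    }
    // Use the symmetry C(n, k) = C(n, n - k) to shorten the loop.
    let k = k.min(n - k);
    let mut acc: u64 = 1;
    for i in 0..k {
        // After this step acc == C(n, i + 1); the division is exact
        // because C(n, i) * (n - i) == C(n, i + 1) * (i + 1).
        acc = acc.checked_mul(n - i)? / (i + 1);
    }
    Some(acc)
}
```

With this shape, a caller can treat `n < k` as an ordinary zero count rather than a panic in debug builds or a wrapped value in release builds.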

* Remove Labrador implementation * fix: remove stale profile tail tag accounting --------- Co-authored-by: Quang Dao <quang.dao@layerzerolabs.org>

…28-bit field (#22) * perf: use two-round prefix path for b=4 norm sumcheck Specialize the b=4 skip proof to a smaller quadratic prefix grid so stage 1 can fuse its first two rounds like b=8 without changing the round polynomials. Extend the stage-1 regression coverage to keep the fused path aligned with the padded reference flow. Made-with: Cursor * perf: add b=4-specific LUT tables for stage 2 prefix and fused kernels Stage 2 two-round prefix and fused compact-to-round2 kernels were reusing b=8 LUT tables (4096 entries) even for b=4. Add b=4-specific tables (256 entries, 16x smaller) and dispatch on `b` at runtime. Same treatment for stage 1's fused kernel (16 vs 256 entries). Also mark all hot-path digit and lookup-index functions as `#[inline(always)]` for consistency. Made-with: Cursor * fix: consistent polynomial representation between prefix and dense paths Remove trailing-zero trimming from `finish_gruen_round_poly_from_q_coeffs` and `coeff_array_to_poly` so both the two-round-prefix path and the dense sumcheck path produce polynomials with the same number of coefficients (degree_q + 2). Fixes the b=4 `stage1_round0_matches_dense_reference` test failure and formatting issues. 
Made-with: Cursor * perf: optimize verifier hot paths (sparse challenges, m_evals_x, multilinear_eval) - Buffer XOF reads (4 KB buffer), use tiered byte-width rejection sampling, and batch sign draws (8 per byte) in sparse challenge sampling (~4x speedup) - Pre-scale alpha_pows by eq_tau1 weights and precompute block scalars in compute_m_evals_x to eliminate redundant per-column multiplies - Add parallel multilinear_eval path (eq-table + par dot-product) for large tables (>2^14 entries) - Move compute_m_evals_x into ring_switch_verifier; remove the separate compute_m_eval_at_point and verify_sumcheck_rounds_only functions - Simplify HachiStage2Verifier: store m_evals_x directly instead of Stage2MOracle indirection; unify is_last/non-last verify paths Made-with: Cursor * refactor: split hachi_poly_ops/mod.rs into focused submodules The 2041-line monolith is now: - mod.rs: trait, shared types, re-exports, tests (~505 lines) - dense.rs: DensePoly + HachiPolyOps impl (~329 lines) - onehot.rs: OneHotIndex, OneHotPoly + HachiPolyOps impl (~604 lines) - balanced_digit.rs: BalancedDigitPoly + HachiPolyOps impl (~238 lines) - helpers.rs: decomposition, sparse mul-acc, accumulation internals (~440 lines) - decompose_fold_neon.rs: unchanged NEON kernel (~165 lines) No behavioral changes. All docstrings updated for the new layout. 
Made-with: Cursor * cleanup: remove dead code, fix unimplemented!(), deduplicate helpers - Remove unused centered_abs, ring_inf_norm, vec_inf_norm from norm.rs - Remove redundant #[allow(dead_code)] on add_ntt_into (function is used) - Replace duplicate flatten_w_hat with existing flatten_i8_blocks - Implement protocol_name() → b"Hachi" instead of unimplemented!() - Remove commented-out ring-dimension check in prove path - Remove duplicate #[allow(clippy::too_many_arguments)] annotation Made-with: Cursor * refactor: hoist algebra types out of protocol/sumcheck into algebra/ Move pure algebraic constructs from protocol/sumcheck/ to algebra/: - EqPolynomial → algebra/eq_poly.rs (fixes backwards dep: algebra → protocol) - GruenSplitEq → algebra/split_eq.rs - UniPoly, CompressedUniPoly → algebra/uni_poly.rs - trim_trailing_zeros → algebra/poly.rs SumcheckProof stays in protocol/sumcheck/types.rs (uses Transcript). All re-exports preserved for downstream compatibility. Made-with: Cursor * fix: resolve clippy (no-default-features) and rustdoc CI failures Gate `add_ntt_into` and its neon helpers behind `#[cfg(feature = "parallel")]` since they are only used in the reduce closure of `cfg_fold_reduce!`, which is elided without rayon. Replace intra-doc links to private items with plain backtick references. Made-with: Cursor * chore: sort imports alphabetically and remove stray blank line Made-with: Cursor * refactor: split recursive witness runtime from root poly ops Move recursive folding levels onto a flat digit witness so later rounds stop pretending to be caller-provided polynomials. This keeps `HachiPolyOps` root-only and cuts the recursive prover over to the dedicated witness view. Made-with: Cursor * perf: extend stage1 compact coefficient LUTs to b=16 Keeping b=16 on the compact lookup path avoids the dense coefficient fallback in stage-1 norm sumcheck. Add regression coverage so b=32 stays on the existing fallback until we optimize it separately. 
Made-with: Cursor * refactor: cut verifier over to proof-native recursive state Carry recursive prover and verifier state through proof ring vectors and packed witnesses so level transitions stop rebuilding commitment-specific structures. Made-with: Cursor * perf: add b=32 stage1 field coefficient LUT Precompute stage1 affine coefficients as field elements for b=32 so the compact round kernels can reuse them instead of rebuilding them per pair. This keeps the large-basis optimization isolated to the retained stage1 path. Made-with: Cursor * chore: deduplicate helpers, remove dead code, fix doc CI - Deduplicate: try_centered_i8, absorb_len_prefixed, pow2_field, reduce_signed_accum, linear_eq_eval, stage1/stage2 digit helpers - Remove dead: ring_switch_prover, expand_m_a, verify_single_level, build_next_constraints, and 9 unused public functions - Inline trivial compute_v wrapper, refactor compute_z_pre pair into shared validate_decompose_fold - Fix doc CI: replace intra-doc links to private items with backticks - Add debug_assert for num_digits==1 in partitioned accumulation Made-with: Cursor * perf: unify A/B/D matrices into single shared-prefix matrix and NTT cache Derive one max-sized public matrix with a shared label instead of three role-specific matrices. This cuts setup NTT conversion work by ~3x in production configs and halves memory. Runtime mat-vec performance is unchanged: column bounds are driven by input vector length, and a prerequisite inner_width clamp in mat_vec_mul_i8_with_params prevents empty-tile dispatch for wider caches. Security justification: SHARED_PREFIX_BINDING.md (every SIS extraction targets a single role, so the marginal distribution is identical). Made-with: Cursor * fix: return error instead of panicking for unsupported ring dimensions in sparse challenge sampling Stack-buffer sampling functions used debug_assert! guards that were stripped in release builds. 
Add a fallible D > 128 check at the public API boundary so the verifier returns Err rather than panicking on out-of-bounds access. Also add heap-backed _general variants (unused) for future large-D support. Made-with: Cursor * refactor: consolidate 128-bit primes to Prime128Offset275 and Prime128Offset5823 Delete 7 unused 128-bit prime aliases (Prime128M13M4P0, Prime128M37P3P0, Prime128M52M3P0, Prime128M54P4P0, Prime128M18M0, Prime128M54P0, P_159). Rename Prime128M8M4M1M0 to Prime128Offset275. Switch the default 128-bit field from 2^128-275 to 2^128-5823 (the prime enabling 64-ring / 32-split) across Q128_MODULUS, POW2_OFFSET_128, HandoffField, all tests, benchmarks, and examples. Made-with: Cursor * perf: pass role-specific row counts to NTT mat-vec, remove dead NttRowView Replace compute-all-then-truncate pattern with row-bounded dispatch. Mat-vec functions now accept a num_rows parameter and slice the NTT cache upfront, so A/B/D roles only compute the rows they need. Remove unused NttRowView type, neg_rows, and cyc_rows methods. Made-with: Cursor --------- Co-authored-by: Omid Bodaghi <42227752+omibo@users.noreply.github.com>
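
The sparse-challenge fix above replaces a `debug_assert!` guard, which is compiled out of release builds, with a fallible check at the API boundary. The sketch below illustrates that pattern in isolation. `SamplingError`, `MAX_STACK_RING_DIM`, and `checked_stack_buffer` are hypothetical names chosen for this example and are not taken from the crate.

```rust
/// Largest ring dimension the stack-buffer path supports (128 per the commit message).
const MAX_STACK_RING_DIM: usize = 128;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
enum SamplingError {
    UnsupportedRingDim { d: usize, max: usize },
}

/// Fills the first `d` slots of a fixed-size stack buffer.
/// The bound is checked unconditionally, so release builds return `Err`
/// instead of indexing out of bounds.
fn checked_stack_buffer(d: usize) -> Result<[i8; MAX_STACK_RING_DIM], SamplingError> {
    if d > MAX_STACK_RING_DIM {
        return Err(SamplingError::UnsupportedRingDim {
            d,
            max: MAX_STACK_RING_DIM,
        });
    }
    let mut buf = [0i8; MAX_STACK_RING_DIM];
    // Placeholder fill; a real sampler would write challenge coefficients here.
    for slot in buf.iter_mut().take(d) {
        *slot = 1;
    }
    Ok(buf)
}
```

The point of the design is that a verifier handed an unsupported dimension gets an error it can propagate, rather than a panic that depends on the build profile.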

) Extract the hardcoded `1 << 21` cache budget into a named `L2_TILE_BUDGET` constant with documentation explaining the 2 MB choice and a TODO for future arch-specific benchmarking. Two minor perf improvements in both `onehot_column_sweep_ajtai_regular` and `onehot_column_sweep_ajtai`: - Replace wasteful `vec![vec![CyclotomicRing::zero(); n_a]; my_count]` pre-allocation with `Vec::new()` per slot, since every entry is overwritten by the tile loop. - Hoist `col_entries` outside the tile loop and `.clear()` between tiles so Vec capacities carry over, avoiding repeated heap growth. Made-with: Cursor
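
The two changes described above (a named tile budget, and a scratch `Vec` hoisted out of the tile loop and cleared between tiles) can be sketched as follows. Only the constant name `L2_TILE_BUDGET` and its value `1 << 21` come from the commit message. `blocks_per_tile` and `tile_sums` are hypothetical helpers written for this illustration, and the per-tile work is a stand-in sum rather than the real column sweep.

```rust
/// Cache budget per tile in bytes (2 MB), named as in the commit message.
const L2_TILE_BUDGET: usize = 1 << 21;

/// Hypothetical helper: how many blocks fit in one tile for a given
/// per-block accumulator size. Always processes at least one block.
fn blocks_per_tile(bytes_per_block: usize) -> usize {
    assert!(bytes_per_block > 0, "block size must be non-zero");
    (L2_TILE_BUDGET / bytes_per_block).max(1)
}

/// Processes `entries` tile by tile, reusing one scratch buffer.
fn tile_sums(entries: &[u32], tile_len: usize) -> Vec<u64> {
    assert!(tile_len > 0, "tile length must be non-zero");
    // Hoisted outside the loop: allocated once, grown at most a few times.
    let mut col_entries: Vec<u32> = Vec::new();
    let mut out = Vec::new();
    for tile in entries.chunks(tile_len) {
        // `clear` drops the elements but keeps the capacity, so later
        // tiles do not trigger fresh heap growth.
        col_entries.clear();
        col_entries.extend_from_slice(tile);
        out.push(col_entries.iter().map(|&x| u64::from(x)).sum());
    }
    out
}
```

With 2 KB of accumulator state per block, `blocks_per_tile(2048)` gives 1024, which matches the "~1024 blocks, 2MB accumulators" figure quoted in the earlier column-sweep commit.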

* chore: untrack stale design notes (moving to shared notes folder) Remove CONSTANT_TIME_NOTES.md, HACHI_PROGRESS.md, and NTT_PRIME_ANALYSIS.md from version control. These are being consolidated into the central ~/Documents/Notes/ folder. Made-with: Cursor * chore: remove stale CHANGELOG.md placeholder Made-with: Cursor * chore: update AGENTS.md crate structure and clean up .gitignore - Add missing protocol modules (quadratic_equation, recursive_runtime) and scripts/ to AGENTS.md crate structure listing - Fix algebra description (domains → polynomial utilities) - Remove stale PUBLISH_CHECKLIST.md entry from .gitignore - Remove empty tests/.gitkeep (test files exist) Made-with: Cursor

…aster) (#28) - Add `derive_public_matrix_flat` that generates directly into FlatMatrix with entry-level parallelism (rows×cols rayon tasks) and zero-copy transmute, replacing the sequential derive + flatten pipeline - Add `cfg_join!` macro and use it to run negacyclic/cyclic NTT conversions concurrently in `build_ntt_slot` - Add `FlatMatrix::from_flat_data` constructor for pre-flattened storage Onehot nv=32 setup: 780ms → 291ms (2.7x) - Matrix derivation: ~361ms → 71ms (5.1x) - NTT cache build: 419ms → 220ms (1.9x) Made-with: Cursor
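
The commit above mentions a `cfg_join!` macro without showing it. A plausible shape, following the feature-gated macro pattern used elsewhere in this log (`cfg_fold_reduce!`), is sketched below. This is an assumption about the macro's form, not its actual definition: it forwards to `rayon::join` when a `parallel` feature is enabled and otherwise runs the two closures one after the other. `sum_and_max` is a hypothetical caller.

```rust
/// Sketch of a feature-gated join. With `parallel` enabled it calls
/// `rayon::join`; otherwise it evaluates both closures sequentially.
macro_rules! cfg_join {
    ($a:expr, $b:expr) => {{
        #[cfg(feature = "parallel")]
        let result = rayon::join($a, $b);
        #[cfg(not(feature = "parallel"))]
        let result = (($a)(), ($b)());
        result
    }};
}

/// Hypothetical caller: two independent computations over the same input,
/// standing in for the negacyclic and cyclic NTT conversions.
fn sum_and_max(xs: &[u64]) -> (u64, u64) {
    cfg_join!(
        || xs.iter().sum::<u64>(),
        || xs.iter().copied().max().unwrap_or(0)
    )
}
```

Because both branches yield a pair `(A, B)`, call sites stay identical whether or not the parallel feature is compiled in.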

Expand the partial-split stage roots into per-position twiddle tables so the butterflies load twiddles directly instead of carrying a serial recurrence. This makes the D64 split path and packed inverse layout more SIMD-friendly and improves leopard x86 benchmarks. Made-with: Cursor

* Port planner from Python code * Fix cursor review * Improve docs for sis_security.rs * Address AI reviews * Fix missing B matrix commitment bytes in root level of universal planner run_universal_planner omitted ring_vec_bytes(root_nb, root_cfg.d) from both the root level's total cost and its level_bytes field. Every non-root level in best_from correctly includes this as entry_commit, but the root level only accounted for the prefix (w_hat + D matrix + sumcheck + evals), silently under-counting proof size. The bug affects any root config where nb >= 1 (all of them), with larger impact for D=32/D=16 roots that can require nb > 1 for SIS security. Corrected proof sizes (bytes): onehot nv=32: 50,418 -> 51,442 (+1,024) full nv=32: 52,866 -> 54,402 (+1,536) full nv=25: 49,842 -> 50,866 (+1,024) onehot nv=44: 56,656 -> 58,704 (+2,048) Made-with: Cursor * Simplify digit decomposition: remove r_decomp_levels, tighten assertions - Remove `r_decomp_levels` wrapper; call `compute_num_digits(128, lb)` directly everywhere, since the defensive half_field_bound re-check was redundant (compute_num_digits already covers 2^(field_bits-1) - 1). - Drop `half_field_bound` from `PlannerOptions`, `LevelWitnessArgs`, and the per-modulus constants (`HALF_FIELD_BOUND_P275`, `HALF_FIELD_BOUND_P5823`). - Replace unreachable fallback branches in `compute_num_digits` and `compute_num_digits_fold` with assertions (`log_bound <= 128`, `challenge_l1_mass > 0`, `shift < 127`). - Correct balanced-digit doc comments to `[-b/2, b/2 - 1]` (asymmetric). Made-with: Cursor * Fix baseline validation, remove header wrapper, add (m,r) search Bug fixes: - Restore baseline to match Rust codebase by using the existing optimal_m_r_split formula (delta_open + n_a*delta_commit). The corrected formula (1+n_a)*delta_open is used only in the optimized planner since the Rust code hasn't been updated yet. - Fix baseline tail_bytes to use baseline_packed_digits_bytes (was incorrectly using the header-stripped version). 
- Remove +4 wrapper from optimized planner total: header stripping removes the u32 num_levels prefix. Enhancements: - Enumerate (m,r) splits at the root level (+-4 around local optimum) to find better global schedules. Recursive levels still use the corrected optimal_m_r_split heuristic for speed. - Restructure compute_level_witness to accept explicit (m,r) via WitnessArgs struct instead of calling optimal_m_r_split internally. - Derive Clone for PlannerOptions, fix repository URL in Cargo.toml. Baselines: 99,805 / 166,613 / 173,197 (match Rust profiler). Optimized onehot nv=32: 51,438 B (48.5% reduction). Made-with: Cursor * Fix delta_commit bug: pass prev_lb instead of lb for recursive levels The `best_from` function was passing `lb` as `log_commit_bound` to `try_level` at recursive levels, causing `delta_commit` to always be 1 regardless of the previous level's base. When prev_lb > lb (e.g., lb=6 to lb=3), `compute_num_digits(prev_lb, lb)` should yield 3, not 1. This fix correctly prices lb-decreasing transitions, causing the planner to avoid them (they are expensive). Optimal schedules now have monotonically non-decreasing lb sequences. Also fixes doc typo: digit range is [-b/2, b/2-1] (asymmetric), not [-(b/2-1), b/2-1] (symmetric). Made-with: Cursor * Fix wrong delta in optimal_m_r_split n_a term Both config.rs and baseline_optimal_m_r_split used delta_commit for the n_a multiplier in the per-block opening cost, but the witness construction (w_hat and t_hat) both use delta_open. This mismatch caused suboptimal (m, r) splits when delta_open != delta_commit (onehot, and recursive levels where log_commit_bound < 128). Also deduplicates baseline_optimal_m_r_split as a thin wrapper around optimal_m_r_split with num_ring=0. Made-with: Cursor --------- Co-authored-by: Quang Dao <quang.dao@layerzerolabs.org>
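
Two commits above correct the documented balanced-digit range to `[-b/2, b/2 - 1]`. The asymmetry is easy to see in a toy decomposition over plain integers. The sketch below is not the crate's `compute_num_digits` or its decomposition kernel: `balanced_digits` and `recompose` are hypothetical helpers, and the cap `log_basis <= 6` is taken from the `MAX_LB` note later in this log so that every digit fits in an `i8`.

```rust
/// Decomposes `x` into `num_digits` balanced base-2^log_basis digits,
/// each in [-b/2, b/2 - 1]. Returns `None` if `x` does not fit.
fn balanced_digits(mut x: i64, log_basis: u32, num_digits: usize) -> Option<Vec<i8>> {
    assert!((1..=6).contains(&log_basis), "digits must fit in i8");
    let b = 1i64 << log_basis;
    let half = b / 2;
    let mut digits = Vec::with_capacity(num_digits);
    for _ in 0..num_digits {
        let mut d = x.rem_euclid(b); // d in [0, b)
        if d >= half {
            d -= b; // shift the upper half down: d in [-b/2, b/2 - 1]
        }
        digits.push(d as i8);
        x = (x - d) >> log_basis; // exact, since x - d is a multiple of b
    }
    if x == 0 { Some(digits) } else { None }
}

/// Inverse of `balanced_digits`: sum of digit_i * b^i.
fn recompose(digits: &[i8], log_basis: u32) -> i64 {
    digits
        .iter()
        .rev()
        .fold(0i64, |acc, &d| (acc << log_basis) + i64::from(d))
}
```

For `b = 8`, the value `-4` fits in a single digit while `+4` needs two (`[-4, 1]`, since `-4 + 1 * 8 = 4`), which is exactly the asymmetry the doc fix describes.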

* Strip serialization headers from proof wire format Redesign HachiDeserialize with an associated Context type so proof types can be deserialized without embedded length prefixes. All headers (u64 Vec counts, u32 num_levels, u8 bits_per_elem, etc.) are removed from the proof byte stream; the verifier recovers shape information from the public schedule via HachiProofShape / LevelProofShape. Key changes: - HachiDeserialize gains `type Context` — `()` for self-describing types, schedule-derived shapes for proof types. - CompressedUniPoly, SumcheckProof, ProofRingVec, PackedDigits, HachiLevelProof, HachiProof all serialize bare (no length prefixes). - RingSliceSerializer drops u64 count prefix; RingCommitment and QuadraticEquation prover paths updated to use RingSliceSerializer for transcript consistency. - Schedule byte accounting updated to match stripped format. - HachiSchedulePlan::to_proof_shape() produces the context needed for proof deserialization. - FieldCore supertrait tightened to HachiDeserialize<Context = ()>. This is a protocol-breaking change: Fiat-Shamir transcripts now absorb headerless data, so proofs from the old format will not verify. Made-with: Cursor * Fix missing ctx argument in disk-persistence feature gate The deserialize_compressed call in load_expanded_setup was missing the &() context argument, only exposed under --all-features. Made-with: Cursor * chore: retrigger CI to pick up CodeQL default setup Made-with: Cursor
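
The header-stripping change above hinges on giving the deserialization trait an associated `Context` type, so that lengths come from the public schedule instead of from the byte stream. The sketch below shows the idea on a toy type. `ContextDeserialize`, `CoeffVec`, and `CoeffVecShape` are hypothetical stand-ins; the real `HachiDeserialize` trait and its shape types may have different method names and signatures.

```rust
use std::io::{self, Read};

/// Toy analogue of a deserialization trait with an associated context.
trait ContextDeserialize: Sized {
    /// `()` for self-describing types, a shape for headerless ones.
    type Context;
    fn deserialize<R: Read>(reader: &mut R, ctx: &Self::Context) -> io::Result<Self>;
}

/// Self-describing: a fixed-width integer needs no context.
impl ContextDeserialize for u32 {
    type Context = ();
    fn deserialize<R: Read>(reader: &mut R, _ctx: &()) -> io::Result<Self> {
        let mut buf = [0u8; 4];
        reader.read_exact(&mut buf)?;
        Ok(u32::from_le_bytes(buf))
    }
}

/// Headerless vector: the element count is NOT in the byte stream.
struct CoeffVec(Vec<u32>);

/// Shape information the verifier derives from public data.
struct CoeffVecShape {
    len: usize,
}

impl ContextDeserialize for CoeffVec {
    type Context = CoeffVecShape;
    fn deserialize<R: Read>(reader: &mut R, ctx: &CoeffVecShape) -> io::Result<Self> {
        let mut coeffs = Vec::with_capacity(ctx.len);
        for _ in 0..ctx.len {
            coeffs.push(u32::deserialize(&mut *reader, &())?);
        }
        Ok(CoeffVec(coeffs))
    }
}
```

Under this design a wrong shape surfaces as a read error (or trailing bytes) rather than being silently accepted from an attacker-controlled length prefix.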

cmd_validate hardcoded stale expected values (from before the delta_commit → delta_open formula fix) that diverged from the baseline.rs unit tests, causing `--validate` to always fail. Extract a single BASELINE_CASES constant and baseline_params_for helper in baseline.rs, used by both the tests and cmd_validate. Add a "Planner validation" CI step so mismatches are caught on PRs. Made-with: Cursor

* feat: finish column-major tight z_pre cutover * fix: honor active row count in recursive w commits * fix: restore recursive commitment performance * refactor: remove dead recursive layout helper * refactor: make block order explicit * fix: make recursive split planner 32-bit safe

* Correct planner A-role SIS bounds * Run rustfmt on planner security changes * Clarify A-role SIS collision helper

* Add batched commitment to Hachi * Add batched prove/verification * Optimize prover/verifier in batched mode * Fix CI * Address AI-reviews * More ci fixes * Fix issue with early prover stop * Address cursor review * Batch polys with detached commitments * Add e2e tests for commitment scheme * fix: resolve post-merge issues from PR #31 header-stripping - Batched prover transcript: use RingSliceSerializer for ABSORB_PROVER_V (auto-merge missed this new call site, causing Fiat-Shamir mismatch) - Add HachiProof::shape() for tests that lack a planner - Fix single_poly_e2e deserialization to pass shape context - Update batched_onehot_4x30 threshold for stripped-header byte costs Made-with: Cursor * Support multi-point batching in Hachi * Unified batch commit functions * Unified batch prove/verify functions * Fix recursive onehot layout planning Keep runtime recursive log-basis transitions aligned with the planner and setup sizing so single and batched onehot proofs use the intended layouts. Restore the open-digit witness cost model and clean up the CI clippy regressions from the batching refactor. Made-with: Cursor * Fix batched commit benchmark layout mismatch Use hachi_batched_root_layout for the batch path so the layout's (m_vars, r_vars) split matches what setup_prover computes internally, and pass the layout into make_onehot_poly instead of deriving it from num_vars (matching the pattern in onehot_batched_opening.rs). Made-with: Cursor * Remove layout from CommitmentScheme API; derive internally from setup Remove HachiCommitmentLayout parameter from commit, prove, batched_prove, verify, and batched_verify. Replace layout field in HachiSetupSeed with max_inner_width, max_outer_width, max_d_matrix_width. Layout is now derived internally via hachi_batched_root_layout(num_vars, max_num_batched_polys), keeping the batch-optimized m_vars/r_vars split that avoids the 3x regression on batched prove/verify. 
Made-with: Cursor * Remove unused setup_from_existing helper That path is no longer used after moving layout derivation to runtime inputs, so removing it avoids maintaining dead setup-extension logic. Made-with: Cursor * Check that opening points have the same length * Format commit.rs * Fix CI issue * Support mixed dense and one-hot multilinear batches Expose a single public wrapper so batched commitments can combine dense and one-hot polynomials under one shared config without extra call-site branching. Made-with: Cursor * Fix scan_layout_chain passing max_num_batched_polys as num_points The layout optimization in root_batched_layout (via optimal_root_batch_split) hardcodes num_point_sets=1 for same-point batching, but scan_layout_chain was passing max_num_batched_polys as both num_claims and num_points to w_ring_element_count_with_num_claims_and_points. This inflated z_pre by the batch size, producing an oversized matrix for the first recursive level. Use w_ring_element_count_with_num_claims (which sets num_point_sets=1) to match the layout's same-point assumption and the actual prover code. Made-with: Cursor * Fix hachi_batched_root_layout returning batched num_digits_fold in per-poly layout The function claimed to return a per-polynomial layout but leaked the batch-level num_digits_fold through an unnecessary scale/unscale roundtrip via scale_batched_root_layout. Replace with direct construction using compute_num_digits_fold (num_claims=1 equivalent) and add a regression assertion in the matching unit test. 
Made-with: Cursor * Fix multipoint setup sizing and trim batched root bloat * Align ring commitment layout with batched setup * Unify batched root layout and recursion helpers * Fix clippy warnings: simplify comparison and extract type alias Made-with: Cursor * Fix commit_onehot using singleton layout instead of batched layout commit_onehot called Cfg::commitment_layout() which always returns the singleton layout, while commit_ring_blocks and commit_coeffs use Self::layout() which returns the batched root layout. When max_num_batched_polys > 1, these produce different m_vars/r_vars splits, causing incompatible block structure. Made-with: Cursor --------- Co-authored-by: Quang Dao <quang.dao@layerzerolabs.org>

* feat: add d16 d32 prime275 profile path * feat: add d16 d32 prime275 profile path * fix: enforce config field pairing * fix: enforce config field pairing * fix: align profile compare configs with prime275 * fix: align profile compare configs with prime275 * refactor: polish field-coupled commitment presets * Remove D16 commitment plumbing * Fix CRT NTT dispatch test coverage * Remove unused setup reuse helper * Add dynamic root-ring Hachi scheme scaffold * Add root batch summary schedule scaffolding * Add canonical root runtime-plan scaffold * Make fmt clippy and tests green * Trim aggregated batched test matrix * Trim grouped batched test matrix * Canonicalize root runtime schedule plan * Use generated tables for fp128 schedules * Select dynamic root D at commit time * Optimize dynamic singleton root selection * Fix clippy lint in dynamic batching test * Restore singleton onehot profile all-mode * Cut over fp128 defaults to dynamic schemes * Audit D128 preset security floor * Fix D128 adaptive audit bounds * Switch partial split path to q128-159 * Fix static d128 audited root ranks * Drop q128-5823 and fix CI * Widen SIS width API to u64 * Lazy dynamic root setup materialization * Optimize D32 block mat-vec paths * Make flat digits canonical across Hachi proofs * Optimize dense D32 commit and fold kernels * Speed up D32 digit decomposition paths * Speed up batched onehot column sweep * Speed up dense D32 decompose-fold * Speed up dense D32 matvec path * Reduce dense D32 fold buffer overhead * Fuse dense D32 full-challenge accumulation * perf(d32): speed up dense root kernels Add an x86 SSE4.1 CRT mul-accumulate/add-reduce path for the dense D32 root commit hot loop, matching the existing NEON hook pattern without changing fallback behavior. Also precompute and share rotated full-challenge tables in the fused dense multi-digit fold path so workers stop rebuilding the same D32 rotation tables independently. This improves the current full nv=25 profile on 
this host by cutting root commit from about 3.10s to about 2.71s with the SIMD path active, and dense_multi_digit_accumulate from about 1.37s to about 1.19s. * chore(ntt): drop unverified x86 path Remove the x86 SSE fast path that was added in the previous perf checkpoint. In this environment rustc is targeting aarch64-apple-darwin, while the x86_64 target is not installed, so that code could not be compiled or validated locally. Keep the fully verified dense D32 improvement from the shared rotated full-challenge tables, which still improves the current full nv=25 profile on this host. * perf(commitment): speed up dense D32 prover Keep the root-kernel microbench in-tree so dense D32 commit work can be measured directly. Reuse CRT+NTT scratch storage on the single-row dense root path and fuse multi-row rotated-challenge accumulation in the dense decompose-fold helper. On HACHI_MODE=full HACHI_NUM_VARS=25 this brings the local dense profile to roughly setup 0.175s, commit 2.624s, prove 0.866s, verify 0.020s with proof size unchanged at 67,936 bytes. * perf(commitment): hoist i8 decomposition params * perf(commitment): specialize dense single-row root path * perf(commitment): fuse scalar crt digit accumulation * perf(commitment): fuse blockwise digit matvecs * bench(commitment): compare flat and block digit matvecs * perf(commitment): restore flat single-i8 hot path * fix(commitment): address remaining bugbot findings * refactor(commitment): introduce prime profile layer * refactor(commitment): unify schedule authority * refactor(profile): table-drive profile modes * refactor(schedule): make direct handoff a plan step * refactor(proof): make direct handoff a proof step * refactor(proof): generalize direct witness payload * feat(commitment): support zero-fold direct roots * feat(commitment): table-drive tiny direct roots Promote tiny root-direct openings to a first-class generated strategy. 
Update the runtime and dynamic onehot path to use field-element direct witnesses when the chosen typed root layout exceeds the public onehot arity, and make profile.rs handle tiny-nv dynamic modes cleanly while skipping impossible fixed onehot layouts. * refactor(commitment): pin generated schedule params Attach pinned family-level parameter specs to generated schedule tables and validate runtime-derived level params against them when materializing exact plans. Keep profile-backed schedule lookup on the richer table artifact so planner-backed families fail closed on policy drift instead of silently inheriting changed ranks or challenge families. * docs(planner): add codegen cutover plan * feat(planner): add step-based schedule codegen * refactor(commitment): use generated schedule modules * refactor(commitment): honor exact schedule plans - reuse pinned planned next-level params during exact-size singleton prove/verify flows when the current state matches the generated plan - regenerate schedule artifacts from the broader planner search space so shipped D32/D64 families keep the better-performing plans - keep the exact-plan execution hook opportunistic for now so runtime falls back cleanly when a pinned root plan does not match the typed root setup path This moves singleton execution closer to the generated artifact while avoiding hand-edited table drift. * refactor(commitment): split generated schedule policies - make shipped D32 and D64 adaptive preset policies explicitly generated-backed instead of sharing one generic adaptive policy - keep D128 adaptive presets on an explicit live-planned policy path so experimental planner use is visible in the type layer - wire the fp128 profile and public preset surface to the new split so generated families fail closed when a pinned table is missing This shrinks schedule authority for blessed presets and removes a silent generated-vs-planned ambiguity from the public config layer. 
* fix(commitment): enforce exact generated fold roots - size generated-family setup envelopes from the pinned schedule entry so shipped D32 and D64 tables can demand higher ranks without underallocating - use exact generated level params whenever a singleton state matches a pinned fold step instead of regenerating those params from fallback hooks - make exact singleton fold proofs fail closed if the runtime root plan no longer matches the pinned schedule artifact This removes another silent drift path between generated artifacts and runtime execution for blessed families. * docs(planner): drop stale d16 ladder note * fix(commitment): honor exact generated root layouts * refactor(commitment): make generated families artifact-driven * perf(commitment): add fused na3 matvec kernels * style(commitment): normalize planner codegen Clean up rustfmt/codegen residue across the planner and commitment modules. No intended behavior change; this just removes noisy local diffs before the next round of planner/codegen and kernel work. * fix(planner): compare bit-lengths instead of element counts in fold pruning The root-level pruning in try_level_mr rejected folds where next_w_len >= w_len, but at the root the input elements are 128-bit field elements while the output elements are lb-bit packed digits. This caused small-nv onehot schedules to skip beneficial folds (e.g. nv=15 sent 524 KB raw instead of 29 KB folded). Also caps MAX_LB at 6 to match the i8/digit_lut constraint enforced throughout the codebase. Regenerates all schedule tables. Made-with: Cursor * fix(planner): recompute SIS width table with 10^10 search cap The old table had entries capped at 5M (D=32) and 20M (D=64), which were binary-search limits rather than true security cutoffs. This caused the planner to fail finding fold schedules for D=64 onehot at nv >= 49, falling back to enormous FieldElements direct proofs. Reran the lattice estimator (BDGL16 + lgsa, q = 2^128 - 275) with a 10^10 search cap. 
Key changes: - D=32 rank 3-4: uncapped (e.g. (32,2) rank 3: 5M -> 414M) - D=64 rank 2-4: uncapped (e.g. (64,7) rank 2: 20M -> 794M) - D=64: added (64,1023) and (64,2047) collision buckets - D=128: unchanged (rank 1 already exact, rank 2-4 at 50B cap) Adds scripts/gen_sis_table.py for reproducible table regeneration. Made-with: Cursor * fix(profile): use split-eq evaluation to avoid 64 GB eq-table allocation The profile binary's `opening_from_public_poly` unconditionally materialized `EqPolynomial::evals(point)` of size 2^nv, causing OOM at nv=32 (64 GB). Add space-efficient `evaluate(point)` methods to the public polynomial types in `root_poly.rs`: - DenseMultilinear: split-eq factorization, O(2^{n/2}) space - OneHotMultilinear: pointwise eq per hot position, O(1) space - MultilinearPolynomial: dispatches to the above Replace the ad-hoc helper with `poly.evaluate(&pt)`. Made-with: Cursor * fix(ci): specify --bin for planner validation step The hachi-planner crate now has two binaries (hachi-planner and gen_schedule_tables), so cargo run needs an explicit --bin flag. Made-with: Cursor * fix(commitment): use envelope floor for generated policy fallback params The GeneratedAdaptivePolicy fallback path (when exact_planned_level_execution fails to match, e.g. in batched proofs) was using audited_root_outer_rank directly, which returns 1 for D=32. This silently dropped n_b and n_d below the planner-determined minimum of 2, producing shorter but insecure proofs. Use the envelope (which incorporates the generated table maximum) as the floor for all rank parameters in the fallback path. Made-with: Cursor * feat(commitment): derive exact SIS ranks in fallback instead of envelope When exact_planned_level_execution misses (e.g. batch-divergent recursive levels), compute the actual matrix widths from the layout and look up the minimum Module-SIS rank from a generated threshold table. 
This replaces the conservative CommitmentEnvelope fallback with precise per-width security parameters. - Add `sis_floor.rs` generated module with SIS width thresholds - Add `max_abs_coeff()` to SparseChallengeConfig - Add `sis_derived_recursive_params` helper in config.rs - Update gen_schedule_tables to emit sis_floor.rs Made-with: Cursor * refactor(planner): deduplicate ring configs in gen_schedule_tables Replace the manually-duplicated D128/D64/D32_RING_CONFIGS arrays with a runtime filter over the single authoritative ALL_RING_CONFIGS from search.rs. Change PlannerOptions.ring_configs from &'static to Vec to support non-static slices cleanly. Made-with: Cursor * chore: remove stale planner codegen cutover plan doc * refactor: delete DynamicCommitmentScheme layer and root_poly type erasure The dynamic layer added ~2400 lines of complexity (traits, lazy init, type-erased MultilinearPolynomial round-trips) solely to select the root ring dimension D at runtime. That selection is a simple proof-size comparison that the profile harness now does with two helper functions (best_full_d, best_onehot_d) followed by a static match dispatch into the existing HachiCommitmentScheme<D, Cfg>. Deleted: - src/protocol/dynamic_commitment_scheme.rs (1411 lines) - src/protocol/root_poly.rs (519 lines) - DynamicCommitmentScheme trait from scheme.rs - CommitmentFieldProfileDynamic trait and helpers from profile.rs - Dynamic type aliases from presets.rs - HachiRootScheduleArtifact from schedule.rs - All Dynamic* re-exports from mod.rs and lib.rs Per-level D dispatch (D=64 root folding into D=32 recursive levels) is unchanged; it was always handled by HachiLevelParams.d and the dispatch_with_ntt! macro inside the core HachiCommitmentScheme. Made-with: Cursor * refactor: delete DynamicSmallTestCommitmentConfig and vestigial profile associated types DynamicSmallTestCommitmentConfig was defined and re-exported but never instantiated anywhere. 
The six FullCfg*/OneHotCfg* associated types on CommitmentFieldProfile were scaffolding for the now-deleted dynamic layer and had zero references via the trait. Made-with: Cursor * refactor: collapse CommitmentPolicy into CommitmentConfig, extract schedule planner Three P1 cleanup items: 1. Delete CommitmentPolicy trait and blanket forwarding impl. Each policy (Static, Generated, Planned) now implements CommitmentConfig directly on CommitmentPreset<F, Policy>, removing ~200 lines of trait ceremony and indirection. 2. Extract DP planner code (best_recursive_suffix, planned_schedule, PlannerConfig/State/Suffix) into schedule_planner.rs, reducing schedule.rs from 2482 to 2076 lines. 3. Factor the 4 inline debug cross-check blocks into two shared helpers (debug_check_dp_basis, debug_check_dp_suffix_bytes) in the new module. Made-with: Cursor * refactor: unify FlatRingVec/ProofRingVec, extract test helpers, minor cleanups P2 + P3 cleanup items: 1. Merge ProofRingVec into FlatRingVec (ring_dim=0 for compact/proof-wire mode). Removes ~200 lines of duplicated ring-vector methods and serialization code. Serialization now always uses the compact format (raw coefficients, no ring_dim prefix) since the self-describing format was never used externally. 2. Extract shared Fp128 E2E test helpers (F, stack/rayon init, random_point, opening_from_poly, make_*_poly, cfg aliases) into tests/common/mod.rs, deduplicating ~350 lines across 5 test files. 3. Merge adjacent if-let-Some(plan) guards in batched verify. 4. Remove redundant FSmall type alias in hachi_e2e.rs. Cancelled P2 items after investigation: - GeneratedAdaptivePolicy + PlannedAdaptiveBoundedPolicy merge: split is architecturally fundamental (pre-generated table lookup vs live DP planner, with different level_params_with_log_basis fallback chains). - Strip sentinel entries from generated tables: no sentinel entries exist; all table rows are real schedule data. 
Made-with: Cursor * refactor: delete PlannedAdaptiveBoundedPolicy, all presets use generated tables Generate D128 logbasis (LCB=3) and D128 onehot (LCB=1) schedule tables, switch all D=128 presets from PlannedAdaptiveBoundedPolicy to GeneratedAdaptivePolicy, and delete the live DP planner entry point (planned_schedule) along with PlannedAdaptiveBoundedPolicy and planned_adaptive_bounded_schedule_source. The runtime DP planner is no longer invoked by any adaptive preset. dp_suffix_bytes remains for static configs (singleton basis range, negligible cost) and debug cross-checks. Made-with: Cursor * chore: cap generated tables at nv=50, delete fp128_adaptive_bounded_table - Set max_num_vars=50 for all schedule table families (removes degenerate Direct-only entries at nv>50 that produced multi-exabyte "proofs") - Replace the generic fp128_adaptive_bounded_table<D,LCB,N_A,N_B,N_D> with direct fp128_d32_{full,logbasis,onehot}_table() accessors - Delete obsolete d128_bounded_families_fall_back_to_runtime_planner test - Update SIS audit test bounds from 63 to 50 Made-with: Cursor * chore: rename fp128_adaptive_onehot_d64_table to fp128_d64_onehot_table Consistent naming: all table accessors now follow fp128_d{D}_{family}_table(). Made-with: Cursor * refactor: delete LogBasis presets, add D64Full table Remove all LOG_COMMIT_BOUND=3 (logbasis) presets, generated tables, benchmarks, and profile modes. Only full (LCB=128) and onehot (LCB=1) remain. Add fp128_d64_full generated table and D64Full preset to complete the D-by-LCB matrix across D={32,64,128}. Made-with: Cursor * refactor: flatten matrix storage from 2D envelope to 1D layout Eliminates wasted space from the shared 2D max_rows × max_cols envelope by storing all matrix data in a single flat 1D vector. Each role (A, B, D) interprets a prefix of the flat buffer via ring_view with role-specific (num_rows, num_cols) dimensions. 
Key changes: - FlatMatrix: remove 2D metadata (num_rows, cols_ring), add ring_view<D>() that provides typed RingMatrixView with zero-copy row access - NttSlotCache: flatten from Vec<Vec<CyclotomicCrtNtt>> to flat Vec - derive_public_matrix_flat: switch to 1D domain separation (seed, flat_index) - HachiSetupSeed: add max_stride() returning the global maximum column width across all roles and recursion levels - HachiPolyOps trait: add matrix_stride parameter to commit_inner/witness - All mat-vec kernels and ring_view call sites use the uniform max_stride to ensure consistent row offsets in the shared NTT cache Made-with: Cursor * chore: remove dead code left over from 1D matrix cutover - FlatMatrix::raw_data(), is_empty() (no callers) - RingMatrixView::rows(), to_vec_vec() (no callers) - HachiCommitmentLayout::matrix_stride() (superseded by HachiSetupSeed::max_stride()) - Fix clippy format-string lint in recursive_suffix eprintln Made-with: Cursor * fix(proof): stop fixed-point batched folding Prevent batched recursive proving from looping once the witness stops shrinking, matching the single-proof recursion stop rule. Store proof-owned ring vectors in compact proof form so serialized proofs round-trip without depending on in-memory ring-dimension metadata. Made-with: Cursor * fix(commitment-scheme): keep batched folding byte-driven Batched recursive suffixes already consult the byte planner, so reusing the single-proof shrink-ratio guard could stop folding while another recursive level still reduced proof size. Use a batched-specific stop guard that only blocks tiny or non-shrinking witnesses, and lock in the nv=32 onehot regression with a focused test. Made-with: Cursor * fix(planner): use actual-state batched suffix DP Replace singleton table fallbacks with memoized planning from the actual recursive state so batched suffix estimates stay aligned with runtime on off-table states. 
Add regression and profile coverage for batch-4 onehot cases, and fix the onehot test lint that was breaking CI. Posted by Cursor assistant (model: GPT-5.4) on behalf of the user (Quang Dao) with approval. Made-with: Cursor * fix: add schedule_plan() to static/test configs for release-mode compatibility The `planned_next_log_basis_with_current_basis_and_envelope` function returns a hard error in release builds when `Cfg::schedule_plan()` is `None`. Three config families hit this: TinyConfig, SmallTestCommitmentConfig, and StaticBoundedPolicy. Add a generic `build_schedule_plan_from_config` helper that walks the level chain for any CommitmentConfig with deterministic basis choices, then override `schedule_plan()` on each affected config so they return `Some(plan)` and never reach the release-mode error branch. Made-with: Cursor * Assert num_vars equal in batched commit mode * fix: guard schedule_plan() overflow and restore planner lb freedom build_schedule_plan_from_config computes 1usize << max_num_vars which overflows for values >= 64. Return Ok(None) early in TinyConfig, SmallTestCommitmentConfig, and StaticBoundedPolicy so callers fall back to runtime computation for absurdly large num_vars (used by disk-persistence tests with max_num_vars 100+). Restore independent log-basis iteration in the planner's best_from(): the recursive folding level's lb was locked to current_lb, preventing re-decomposition at a different basis. Revert to the original design where each level freely iterates lb in MIN_LB..=MAX_LB while inheriting the parent's lb as log_cb. Made-with: Cursor --------- Co-authored-by: Omid Bodaghi <42227752+omibo@users.noreply.github.com>
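The overflow guard in the last commit above can be sketched as follows. This is a minimal illustration, not the crate's code: `checked_pow2` and `toy_schedule_plan` are hypothetical names, and the "plan" is just a toy list of halving level sizes. The only point is that `1usize << n` is replaced by a checked shift, and an unrepresentable size yields `Ok(None)` so the caller can fall back to runtime computation.

```rust
/// Hypothetical helper: 2^max_num_vars as a usize, or None if it does not fit.
fn checked_pow2(max_num_vars: usize) -> Option<usize> {
    let shift = u32::try_from(max_num_vars).ok()?;
    // checked_shl returns None when shift >= usize::BITS instead of overflowing.
    1usize.checked_shl(shift)
}

/// Toy stand-in for a schedule-plan builder (assumption: the real builder
/// walks a level chain; here we only halve the table size until it reaches 1).
fn toy_schedule_plan(max_num_vars: usize) -> Result<Option<Vec<usize>>, String> {
    // Early return: sizes that cannot be represented produce no plan,
    // rather than an overflow panic or a wrapped value.
    let Some(len) = checked_pow2(max_num_vars) else {
        return Ok(None);
    };
    let mut levels = Vec::new();
    let mut cur = len;
    while cur > 1 {
        levels.push(cur);
        cur /= 2;
    }
    Ok(Some(levels))
}
```

With this shape, a disk-persistence test that passes `max_num_vars = 100` gets `Ok(None)` instead of an overflow, while ordinary sizes still receive a plan.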

* fix(planner): measure recursive suffix costs Score recursive suffix planning with exact serialized proof bytes instead of formula-only estimates, and route recursive miss states through the measured DP path. Isolate the recursive DP caches by config type so adaptive presets do not reuse stale suffix or basis choices across families. Made-with: Cursor * refactor(planner): inline exact schedule planner Move the offline planner, generator, and validation CLI into hachi-pcs so exact schedule lookup, table generation, and runtime planning share one batch-aware implementation and key space. Regenerate the shipped tables around exact root schedule keys, fix setup envelope sizing for generated direct tails, and add coverage for singleton, blessed-batch, and off-table planner paths. Made-with: Cursor * fix(planner): tighten generated schedule miss handling Keep missing generated schedule tables as hard configuration errors while still treating per-key schedule misses as soft fallbacks, and remove the leftover planner wrappers and discarded search-range parameters that no longer affect runtime behavior. Made-with: Cursor * fix(ci): emit rustfmt-clean generated imports Teach the schedule table generator to emit import blocks in the same shape that rustfmt expects so regenerated checked-in tables pass the format job without manual cleanup. Made-with: Cursor * fix(bench): publish observed proof metrics Report proof framing bytes explicitly and derive the published terminal state from the observed final witness so benchmark comments distinguish measured proof data from planner-only metadata. Made-with: Cursor * fix(planner): derive root ranks from layouts Derive adaptive level-0 ranks from the actual root layout instead of the audited envelope fallback so singleton D32 and D64 schedule generation cannot freeze unsound rank-1 root rows into the checked-in tables. 
Keep batched root row counts tied to the per-polynomial root layout, propagate layout-aware root params through setup and commit helpers, and regenerate the affected generated schedules with regression coverage for the D32 onehot root and tiny direct-root path. Made-with: Cursor * fix(bench): drop dead terminal summary fields Remove the unused planned terminal summary keys from the benchmark report parser so the observed terminal-state reporting remains the single source of truth and the script no longer carries dead stores. Made-with: Cursor
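The miss-handling policy described above (a missing table is a hard configuration error, a missing key is a soft fallback to the planner) can be sketched like this. All types here are hypothetical stand-ins: `ScheduleKey` mirrors the `(num_vars, WitnessShape)` key from the PR summary, and a schedule is modelled as a plain `Vec<u32>`, which is an assumption made only for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical key: num_vars plus the three WitnessShape fields.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct ScheduleKey {
    num_vars: usize,
    num_claims: usize,
    num_commitment_groups: usize,
    num_points: usize,
}

/// Records where a schedule came from (toy payload: one u32 per level).
#[derive(Clone, Debug, PartialEq, Eq)]
enum Schedule {
    FromTable(Vec<u32>),
    FromPlanner(Vec<u32>),
}

#[derive(Debug, PartialEq, Eq)]
enum ScheduleError {
    /// The whole generated table is absent: treated as a configuration error.
    MissingTable(&'static str),
}

fn lookup_schedule(
    family: &'static str,
    table: Option<&HashMap<ScheduleKey, Vec<u32>>>,
    key: ScheduleKey,
    planner: impl Fn(ScheduleKey) -> Vec<u32>,
) -> Result<Schedule, ScheduleError> {
    // Hard error: no table was generated for this family at all.
    let table = table.ok_or(ScheduleError::MissingTable(family))?;
    Ok(match table.get(&key) {
        // Table hit: use the pinned row as-is.
        Some(row) => Schedule::FromTable(row.clone()),
        // Soft miss: this particular shape has no row, so run the planner.
        None => Schedule::FromPlanner(planner(key)),
    })
}
```

Keeping the two failure modes in different types makes it harder for a missing artifact to be silently papered over by the fallback path.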

* feat: asymmetric centering for power-of-2 digit depths Use T_k = (b/2-1)(b^k-1)/(b-1) as the centering threshold for full-field balanced decompositions instead of q/2. This eliminates the +1 digit correction when lb divides 128, giving power-of-2 depths (64 instead of 65 for lb=2, 32 instead of 33 for lb=4). The key insight: k balanced base-b digits biject onto b^k consecutive integers. For field elements in [0,q), asymmetric centering maps c <= T_k to itself and c > T_k to c-q, covering all q values with exactly ceil(128/lb) digits. This only applies to full-field decompositions (depth_open, depth_commit when log_commit_bound=128, r_decomp_levels). Fold digits remain symmetrically centered since they decompose plain integers, not field elements mod q. Proof size reduction: ~1.1-1.2 KB across all configurations. Made-with: Cursor * fix(commitment): restore fold fallback symmetry Keep the Python dense fold estimator aligned with Rust by using the symmetric digit-count fallback for folded integers. Remove the unused asymmetric threshold helper so the decomposition threshold stays defined in one runtime path. Made-with: Cursor * perf(decompose): streamline asymmetric overflow paths Reduce the overhead of exact-full-field asymmetric decomposition by reusing the peeled top digit across the field and i8 kernels instead of staging ring-wide scratch buffers. Add deterministic fp128 boundary tests so the overflow edge cases stay covered while we iterate on follow-up benchmarking. Made-with: Cursor * perf(commitment): align i8 tiles to digit boundaries Keep tiled dense i8 mat-vec kernels on full digit groups so adjacent tiles do not re-decompose the same ring when a boundary lands mid-pack. Add multi-tile block and strided tests to lock in equivalence with the direct pre-decomposed digit paths. Made-with: Cursor
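The counting argument behind asymmetric centering can be checked on toy sizes. The sketch below assumes balanced digits live in [-b/2, b/2 - 1] (the range for which T_k is exactly the largest representable value) and uses a small stand-in modulus; the function names are illustrative and not taken from the crate.

```rust
/// T_k = (b/2 - 1)(b^k - 1)/(b - 1) for b = 2^lb. Toy sizes only:
/// b^k must fit in an i128.
fn centering_threshold(lb: u32, k: u32) -> i128 {
    let b = 1i128 << lb;
    (b / 2 - 1) * (b.pow(k) - 1) / (b - 1)
}

/// Asymmetric centering: keep c when c <= T_k, otherwise use c - q.
fn center_asymmetric(c: i128, q: i128, t_k: i128) -> i128 {
    if c <= t_k { c } else { c - q }
}

/// Decompose v into k balanced base-2^lb digits in [-b/2, b/2 - 1],
/// least significant first. Returns None if k digits are not enough.
fn balanced_digits(mut v: i128, lb: u32, k: u32) -> Option<Vec<i128>> {
    let b = 1i128 << lb;
    let mut digits = Vec::with_capacity(k as usize);
    for _ in 0..k {
        let mut d = v.rem_euclid(b); // in [0, b)
        if d >= b / 2 {
            d -= b; // shift into [-b/2, b/2 - 1]
        }
        digits.push(d);
        v = (v - d) / b; // exact division
    }
    (v == 0).then_some(digits)
}

/// Inverse of balanced_digits.
fn recompose(digits: &[i128], lb: u32) -> i128 {
    digits.iter().rev().fold(0, |acc, d| acc * (1i128 << lb) + d)
}

/// Digit count for a full 128-bit field element: ceil(128 / lb).
fn full_field_depth(lb: u32) -> u32 {
    128u32.div_ceil(lb)
}
```

For example, with lb = 2, k = 3 and a toy modulus q = 61 < 4^3, every residue centered this way fits in exactly three digits, whereas symmetric centering would leave c = 30 (which is at most q/2) needing a fourth digit.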

* feat(sumcheck): implement phase-1 y-first cutover Switch Stage 2 to the y-first witness layout and compute ring-switch m(x) on demand so the verifier no longer materializes m_evals_x. Keep Stage 1 x-first behind compatibility shims, including compact witness transposes, two-round-prefix updates, and wiring-layer challenge reordering, so recursive proofs continue to chain correctly during the Phase 1 transition. Made-with: Cursor * fix(sumcheck): restore stage2 prefix handoff fast path Restore the fused stage-2 round-2 handoff so y-first prefix proofs stop rescanning the compact witness after the two-round-prefix transition, and clear the prefix state once that handoff completes to avoid stale-path reentry. Also narrow the temporary dead-code allowances introduced during the phase-1 y-first cutover by routing the verifier through the shared shifted-eq dispatcher and dropping now-unused test helpers. Made-with: Cursor * feat(sumcheck): finish y-first cutover Make stage 1 bind y-first and move the only coordinate reorder to stage-1 input so stage 2 consumes r_stage1 directly. Preserve the compact two-round prefix path and sparse-x handling while removing the old stage1-to-stage2 compatibility bridge. Made-with: Cursor * refactor(planner): drop unused opt_sumcheck setter Remove the dead planner option builder so PlannerOptions only exposes configuration toggles that are still wired into the search flow. Made-with: Cursor * fix(ring-switch): drop dead m-eval helpers Remove the unused shifted-eq evaluation helpers and the stale test that only exercised them so CI can keep treating dead code as an error. Made-with: Cursor * refactor(commitment-scheme): drop redundant opening-point reorder After the y-first cutover, recursive stage transitions can carry the sumcheck challenges directly as the next opening point. Remove the identity helper and the dead width bookkeeping it forced the prover and verifier to carry. 
Made-with: Cursor * perf(sumcheck): fuse sparse y stage1 handoff Fuse the sparse y-stage full-table fold with next-round polynomial generation so the y-first Stage 1 cutover recovers the onehot regression without reverting semantics. Made-with: Cursor * refactor: drop y_first naming and deduplicate test helpers Now that y-first is the only ordering, remove the _y_first suffix from reorder_stage1_coords, build_compact_s_table, and all related variable names. Deduplicate pad_compact_witness, advance_stage1_claim, and reorder helpers that were copy-pasted across three test modules. Delete the unused shifted_eq module. Made-with: Cursor * style: fix rustfmt on advance_stage1_claim generic bound Made-with: Cursor

* perf(fp128): hand-written AArch64/x86-64 inline asm for add, sub, mul, sqr Replace LLVM-generated codegen for Fp128 field arithmetic with hand-written inline assembly on AArch64 and x86-64, falling back to portable Rust on other targets. AArch64 add_raw (8 instructions): Uses `ccmp` to fold the overflow predicate (carry from a+b) with the ≥p check (carry from s+C) into a single flag state, avoiding the GPR round-trip that LLVM's u128 lowering produces. Dispatches to immediate or register form based on C. AArch64 sub_raw (6 instructions): Uses `csel` on the borrow flag to pick subtrahend 0 or C, then a final `subs`/`sbc` pair. Avoids materializing the full 128-bit prime P. AArch64 mul_raw (35 instructions, was 41): Full schoolbook 2×2 → Solinas reduction in one asm block. Fold-1 carry chain uses direct adds/adcs/adc (5 insns vs LLVM's 8 with cset/cinc shuttling). Fold-2 + canonicalize uses `ccmp` (8 insns vs LLVM's 10). Benchmarked at 1.22x throughput improvement on Apple M4. AArch64 sqr_raw (31 instructions, was 37): 3 widening multiplies with doubled cross term via shifted- register operands. Same fold-1 + ccmp canonicalize savings. Benchmarked at 1.12x throughput improvement on Apple M4. x86-64 add_raw (10 instructions): Uses `sbb reg,reg` to materialize carry as 0/-1 mask, then `adc mask,mask` after trial subtraction to encode the reduction predicate in ZF for `cmovne`. x86-64 sub_raw (6 instructions): Uses `sbb reg,reg` + `and` to conditionally mask C, then a final `sub`/`sbb` pair. Portable fallback (add_raw_portable): Replaces u128 arithmetic with explicit two-limb overflowing_add chains, which lowers to better code on targets without native 128-bit support. packed_neon Add: Rewrites the reduction logic to use `s + C` (add-based) instead of `s - P` (subtract-based), eliminating the need to broadcast the full 128-bit prime. Removes unused `veorq_u64` import. 
Made-with: Cursor * fix(fp128): tighten C bound to < 2^32 for asm fold-2 correctness The AArch64 asm fold-2 step computes C*t2 with a single `mul` (low 64 bits only). Since t2 <= C after fold-1, this requires C^2 < 2^64, i.e. C < 2^32. Enforce this at compile time instead of the previous C < 2^64 bound. Made-with: Cursor * perf(fp128): add fused mul-add fast path Add a dedicated Fp128 fused multiply-add primitive that widens the product, injects the addend before reduction, and finishes with one Solinas reduction. On AArch64 this keeps the addend on the carry chain inside the hand- written multiply path, which improves both standalone mul-add and the projective binding shape used in our benchmarks. Made-with: Cursor * style(fp128): cargo fmt Made-with: Cursor
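For readers who do not want to decode the assembly, the add and sub reductions described above have a short portable equivalent. This is a sketch of the idea for p = 2^128 - C, not the crate's `add_raw`/`sub_raw`: the function names are hypothetical, inputs are assumed to be canonical (less than p), and the `if` here is an ordinary branch rather than the flag-based selects used in the asm.

```rust
/// (a + b) mod p for p = 2^128 - C, with a, b < p.
fn add_mod_solinas<const C: u64>(a: u128, b: u128) -> u128 {
    // carry is set when a + b >= 2^128.
    let (s, carry) = a.overflowing_add(b);
    // carry_c is set when s + C >= 2^128, i.e. when s >= p.
    let (t, carry_c) = s.overflowing_add(C as u128);
    // In either case the reduced value is s + C mod 2^128; otherwise s is
    // already canonical. The full prime p is never materialized.
    if carry | carry_c { t } else { s }
}

/// (a - b) mod p for p = 2^128 - C, with a, b < p.
fn sub_mod_solinas<const C: u64>(a: u128, b: u128) -> u128 {
    let (d, borrow) = a.overflowing_sub(b);
    // On borrow, d = a - b + 2^128, and adding p back equals subtracting C.
    d.wrapping_sub(if borrow { C as u128 } else { 0 })
}
```

The two carries in `add_mod_solinas` are the same pair of predicates that the AArch64 version folds into one flag state with `ccmp`, and the conditional subtrahend in `sub_mod_solinas` corresponds to the `csel` on the borrow flag.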

… RS encoding (#49) * feat(algebra): add smooth-FFT prime p=2^128-2355 Add Prime128Offset2355 (p ≡ 5 mod 8) with smooth multiplicative subgroup of order 14700 = 2² × 3 × 5² × 7², enabling mixed-radix FFT-based RS encoding up to size 14700 without Bluestein or zero-padding. - fp128: new type alias, asm dispatch for C=2355 on AArch64/x86-64 - crt_ntt: register new modulus in CRT+NTT param selection * perf(fft): optimize mixed-radix FFT with radix-7 butterfly and precomputed twiddles Two optimizations to the smooth-domain mixed-radix FFT that together yield ~2x throughput improvement across all benchmarked sizes: - Specialized radix-7 butterfly: unrolled 7-point DFT with explicit root-of-unity powers, matching the existing radix-2/3/5 pattern. Critical for 14700 = 2²×3×5²×7² where two radix-7 stages previously fell through to the generic O(r²) DFT loop. (1.5-1.7x alone) - Precomputed twiddle tables in SmoothDomain: per-stage omega_r powers and twiddle arrays are built once in SmoothDomain::new() for both forward and inverse transforms. Replaces the runtime field_pow calls and dependent tw_k multiply chain with table lookups and an ILP-friendly squaring pattern (tw2=tw*tw, tw4=tw2*tw2, tw6=tw3*tw3). (1.2-1.4x on top of optimization 1) - Criterion benchmark suite (benches/fft_smooth.rs) covering forward, RS-extend, and RS-expand workloads Made-with: Cursor * bench(fft): add parallel 32768x RS-expand benchmark Adds a Rayon-parallel benchmark that runs 32,768 independent 256→1024 RS expansions via the 1470-smooth domain, measuring aggregate throughput under full core utilization. 
Made-with: Cursor * fix: resolve clippy warnings and update bench field to new prime - Remove unused `vcgtq_u64` import in packed_neon.rs - Remove unnecessary `as u64` cast in fp128.rs inline asm - Use idiomatic iterators and `+=`/`*=` assign ops in fft.rs tests - Update bench field type alias from p=2^128-275 to p=2^128-2355 to match the current CommitmentPreset Made-with: Cursor * fix: clippy format * fix: remove extra files * fix: remove extra files * fix(fft): update stale docs, add radix guard, clarify comments - Update fp128.rs module doc to reference Prime128Offset2355 - Add debug_assert for omega_r_pow array bound in precompute_stages - Fix misleading RS-expand comment in benchmark - Add explanatory comment for 2's complement reduction in from_scalar_with_params Made-with: Cursor --------- Co-authored-by: Quang Dao <quang.dao@layerzerolabs.org>
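The arithmetic claims about the new prime are easy to sanity-check. The sketch below takes primality of p = 2^128 - 2355 from the commit message (it is not verified here) and only checks the congruence and the divisibility that a subgroup of order 14700 requires; the helper names are illustrative.

```rust
/// p = 2^128 - 2355, written without overflowing u128.
const P: u128 = u128::MAX - 2354;
const SMOOTH_ORDER: u128 = 14_700;

/// In the multiplicative group of a prime field (cyclic of order p - 1),
/// a subgroup of order n exists exactly when n divides p - 1.
fn has_subgroup_of_order(p: u128, n: u128) -> bool {
    (p - 1) % n == 0
}

/// Factor n over the given radices; None if n is not smooth over them.
fn factorize_smooth(mut n: u128, radices: &[u128]) -> Option<Vec<(u128, u32)>> {
    let mut factors = Vec::new();
    for &r in radices {
        let mut exp = 0;
        while n % r == 0 {
            n /= r;
            exp += 1;
        }
        if exp > 0 {
            factors.push((r, exp));
        }
    }
    (n == 1).then_some(factors)
}

/// Flatten the factorization into a list of per-stage radices.
fn stage_radices(factors: &[(u128, u32)]) -> Vec<u128> {
    factors
        .iter()
        .flat_map(|&(r, e)| std::iter::repeat(r).take(e as usize))
        .collect()
}
```

The stage list for 14700 is `[2, 2, 3, 5, 5, 7, 7]`, which is why a dedicated radix-7 butterfly matters: two of the seven stages would otherwise hit the generic loop.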

* feat(protocol): γ-batch evaluation claims per point Replace L_j per-claim evaluation rows with one γ-weighted row per opening point, cutting proof size and verifier trace checks. γ coefficients are Fiat-Shamir challenges sampled after absorbing commitments and field-element openings. The matrix equation, ring switch M-evaluation, and w-vector sizing all use num_points instead of num_claims for the evaluation-row dimension. Three blessed schedule tests temporarily ignored pending regeneration. Made-with: Cursor * fix: unify num_points in w_ring_element_count Made-with: Cursor * fix(verify): restore openings.len() == num_claims check in batched verifiers The batched CWSS refactor replaced the `openings.len() != num_claims` guard with the weaker `openings.is_empty()` in verify_batched_root_level, and omitted it entirely in verify_multipoint_batched_root_level. This allowed malformed proofs with mismatched opening counts to pass early validation. Restore the exact-length check in both verifiers.
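A rough picture of the per-point γ-batching, together with the restored exact-length check, is sketched below. Everything here is a toy: the field is the 61-bit Mersenne prime rather than Fp128, the γ values are passed in instead of being derived by Fiat-Shamir, and `batch_per_point` and `claim_point` are hypothetical names. How the real protocol assigns γ weights to claims is not spelled out in the commit message, so one weight per claim is an assumption.

```rust
/// Toy modulus (2^61 - 1); the real protocol works over a 128-bit field.
const Q: u64 = (1 << 61) - 1;

fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % Q as u128) as u64
}

fn add(a: u64, b: u64) -> u64 {
    ((a as u128 + b as u128) % Q as u128) as u64
}

/// Fold per-claim openings into one γ-weighted value per opening point.
/// claim_point[j] is the index of the point at which claim j is opened.
fn batch_per_point(
    openings: &[u64],
    claim_point: &[usize],
    gammas: &[u64],
    num_points: usize,
) -> Result<Vec<u64>, String> {
    let num_claims = claim_point.len();
    // Exact-length check: an is_empty() test alone would accept
    // a malformed opening list with the wrong number of entries.
    if openings.len() != num_claims {
        return Err(format!("expected {num_claims} openings, got {}", openings.len()));
    }
    if gammas.len() != num_claims {
        return Err(format!("expected {num_claims} gammas, got {}", gammas.len()));
    }
    let mut rows = vec![0u64; num_points];
    for ((&y, &pt), &g) in openings.iter().zip(claim_point).zip(gammas) {
        let row = rows
            .get_mut(pt)
            .ok_or_else(|| format!("point index {pt} out of range"))?;
        *row = add(*row, mul(g, y));
    }
    Ok(rows)
}
```

The output has `num_points` entries rather than `num_claims`, which is the dimension change the commit describes for the evaluation rows.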

* Add refactored_scheduler which generates smaller proofs * feat(planner): batched DP schedule planner with pre-computed tables Add a DP-based schedule planner that optimizes proof size for both singleton and batched polynomial openings. The planner searches over (log_basis, m, r) triples at the root and uses exact memoised suffix costs instead of estimates. Key changes: - Unified `compute_num_digits_fold` with `num_claims` parameter - `find_optimal_batched_schedule` as the single entry point for both singleton (num_claims=1) and batched mode - Replaced `optimal_root_batch_split` in commit.rs to first check pre-computed tables, falling back to the DP planner only on miss - Generated `refactored/` (singleton) and `refactored_batched/` (4-poly batched) schedule tables for all 6 families (nv 1..50) - Wired batched tables into the runtime via `generated_batched_schedule_table` fallback in `CommitmentFieldProfileSchedule` - Added tracing at info/debug/warn levels for table hits, misses, and planner invocations Singleton tables are never worse than existing generated tables (181KB saved across D32 families). Batched tables eliminate all runtime recomputation for the 4-poly case. Made-with: Cursor * fix(commitment): make batched root ranks width-aware Align the batched root schedule and setup sizing with the actual aggregated B and D matrix widths used at runtime. The root planner now computes batched n_b/n_d from scaled widths, the runtime plan derives batched root params from the scaled layout, and setup sizing carries the maximum row counts through matrix allocation. This also simplifies the batched commit path to use the planner split directly without rebuilding unnecessary root plan state, keeps the pre-computed batched tables authoritative when present, and regenerates those tables so the written schedule data matches the new rank logic. 
Additional cleanup: - remove redundant per-poly fold recomputation at the commit caller - restore split-based fit checks behind a setup helper - fix existing clippy blockers in the algebra backends so CI is green Made-with: Cursor * refactor(commitment): inline root runtime setup in prove and verify Derive root layout and params directly at the commitment-scheme entrypoints and remove the extra singleton runtime-plan test wrapper. Keep terminal witness packing anchored to the carried runtime basis, with an explicit panic if the final digits ever exceed it. Made-with: Cursor * refactor(planner): consolidate generated schedule tables Use a single generated schedule source per fp128 family by merging singleton and batched entries into the top-level files, and rename the planner module to match its role in schedule parameter selection. Made-with: Cursor * fix(planner): align generated schedule bytes with runtime Price each planned fold against its actual chosen successor and emit terminal commitment metadata from the direct step's runtime state. This restores the shipped D32 onehot singleton schedule to the correct proof size seen at runtime. Made-with: Cursor * refactor: unify HachiLevelParams + HachiCommitmentLayout into LevelParams Replace the two separate parameter structs (HachiLevelParams for ring dimension / matrix ranks / challenge config, and HachiCommitmentLayout for block geometry / digit depths / matrix widths) with a single LevelParams struct in src/protocol/params.rs. 
Key changes: - New LevelParams struct with AjtaiKeyParams sub-structs for each Ajtai matrix (A, B, D), combining row count + column width + basis - All CommitmentConfig trait methods (level_params_with_log_basis, root_level_layout_with_log_basis, root_level_params_for_layout, commitment_layout, level_params) now return LevelParams - HachiPlannedLevel stores a single `lp: LevelParams` field - ring_switch, quadratic_equation, commitment_scheme, schedule_params, and all test/bench/example code use LevelParams exclusively - Both old structs and their conversion bridges fully deleted - Net reduction: -845 lines across 26 files Made-with: Cursor * fix(generated): use runtime_exact label to match main branch Reduces diff noise against main by keeping the same fold-step label that the main branch uses in generated schedule tables. Made-with: Cursor * fix(batched): use cached singleton schedule for recursive suffix in batched setup The scan_layout_chain function previously fell through to an expensive DP walk for batched mode (max_num_batched_polys > 1) even when a pre-computed singleton schedule was available. Since recursive levels are identical for singleton and batched openings, we can reuse the singleton plan's recursive suffix to skip the DP recomputation. Made-with: Cursor * refactor: encapsulate AjtaiKeyParams fields behind constructor and getters Make AjtaiKeyParams fields private and enforce construction through AjtaiKeyParams::new(row_len, col_len, log_basis). Add row_len(), col_len(), and log_basis() getter methods. Update all ~190 field access sites across 13 files to use the new API. 
Made-with: Cursor * refactor(params): add SIS security check to AjtaiKeyParams Replace log_basis field on AjtaiKeyParams with collision_inf (worst-case L∞ collision bound) and add SIS floor validation: - `new()` panics if row_len is below the 128-bit SIS security floor - `new_unchecked()` logs a warning instead, for intermediate construction steps where ranks haven't converged yet (batched scaling, iterative SIS fixed-point loops) - `Default` derives all-zero (skips SIS check since collision_inf=0) - SIS-derived params (sis_derived_root_params_for_layout, sis_derived_recursive_params) now set collision_inf on each key All existing call sites use new_unchecked. The checked new() is available for future code that constructs finalized, security-verified keys. Made-with: Cursor * refactor(ring-switch): remove duplicate helpers and add schedule validation - Remove identical `w_ring_element_count_with_point_claim_groups`, consolidate call sites to `w_ring_element_count_with_claim_groups` - Remove dead `m_row_count` wrapper (callers use `lp.m_row_count()`) - Add debug_assert checks in `schedule_plan_from_generated_entry` that recomputed digit depths match the table's pinned delta_* values Made-with: Cursor * refactor(protocol): cleanup dead code and unused params - Extend SIS audit test to D32/D64 families (found real rank=1 violations at high num_vars, capped ranges accordingly) - Rename `batched_root_level_proof_bytes` to `level_proof_bytes` - Remove dead `estimated_recursive_suffix_bytes` and both `ensure_batched_root_split_fits` methods - Remove unused `_half_field_bound` param from `recursive_r_decomp_levels` and cascade through `planned_w_ring_element_count`, `planned_next_w_len`, `PlannerConfig`, and all call sites - Remove now-dead `planner_half_field_bound()` trait method Made-with: Cursor * fix(planner): align batched root accounting Unify planner and runtime batched-root derivation so B/D sizing, root witness sizing, and root proof bytes all use the same 
num_claims versus num_points semantics. Regenerate the Rust schedule tables and update the end-to-end tests to match the corrected runtime-exact rows. Made-with: Cursor * fix(planner): address batched bugbot issues Follow the concrete batched runtime suffix when sizing setup matrices, and keep standalone batched root proof sizing aligned with the runtime root params and shared fold-digit math. Made-with: Cursor * refactor(planner): dedupe fold digit helpers Route singleton fold-digit sizing through one shared batched helper, so the public commitment API keeps a single source of truth while the batched planner paths still pass explicit claim counts. Made-with: Cursor * fix(protocol): keep batched root splits per-poly Return per-polynomial root widths from the DP and direct-only batched-root split paths so setup sizing only scales B/D once. Add regressions for folded and direct-only no-table batch helpers. Made-with: Cursor * refactor(commit): remove unnecessary root_lp/batched_lp clones in batched_commit Access root_plan.root_lp and root_plan.level_lp fields directly instead of cloning them into local variables. Made-with: Cursor --------- Co-authored-by: Quang Dao <quang.dao@layerzerolabs.org>
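The checked and unchecked constructor split described for AjtaiKeyParams might look roughly like the sketch below. The struct name, the `u64` type for `collision_inf`, and especially `toy_sis_floor` are assumptions: the floor function returns made-up numbers and must not be read as security data. Only the behaviours stated in the commit message are modelled (private fields with getters, `new` panicking below the floor, `new_unchecked` warning instead, and an all-zero `Default`).

```rust
/// Placeholder for the generated SIS floor table. NOT real security data.
fn toy_sis_floor(col_len: usize, collision_inf: u64) -> usize {
    if collision_inf == 0 {
        return 0; // all-zero default skips the check
    }
    if col_len <= 1024 { 1 } else { 2 }
}

/// Sketch of a key-parameter struct with private fields.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct AjtaiKeyParamsSketch {
    row_len: usize,
    col_len: usize,
    collision_inf: u64,
}

impl AjtaiKeyParamsSketch {
    /// Checked constructor: panics if row_len is below the floor.
    pub fn new(row_len: usize, col_len: usize, collision_inf: u64) -> Self {
        let floor = toy_sis_floor(col_len, collision_inf);
        assert!(row_len >= floor, "row_len {row_len} below SIS floor {floor}");
        Self { row_len, col_len, collision_inf }
    }

    /// Unchecked constructor for intermediate states: warns instead of panicking.
    pub fn new_unchecked(row_len: usize, col_len: usize, collision_inf: u64) -> Self {
        let floor = toy_sis_floor(col_len, collision_inf);
        if row_len < floor {
            eprintln!("warning: row_len {row_len} below SIS floor {floor}");
        }
        Self { row_len, col_len, collision_inf }
    }

    pub fn row_len(&self) -> usize {
        self.row_len
    }

    pub fn col_len(&self) -> usize {
        self.col_len
    }

    pub fn collision_inf(&self) -> u64 {
        self.collision_inf
    }
}
```

Keeping the fields private means every construction site has to pick one of the two constructors, so the places that bypass the security check stay easy to find.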

* refactor(sumcheck): decompose mod.rs into focused submodules Break the monolithic mod.rs (956 lines) into: - traits.rs: SumcheckInstance{Prover,Verifier} + EqFactored variants - drivers.rs: prove/verify driver functions - compact_fold.rs: CompactPairFoldLut - accum.rs: reduce_signed_accum (breaks two_round_prefix <-> hachi_stage2 cycle) Also deduplicates fold_full_prefix_pair (was copy-pasted in stage1 and stage2). mod.rs is now a thin barrel (~350 lines including tests). Made-with: Cursor * refactor: rename num_u/num_l to col_bits/ring_bits The old names were positional labels from the polynomial w(u, l) that did not convey what the variables actually index: num_u -> col_bits: number of bits indexing witness columns (ring elements) num_l -> ring_bits: number of bits indexing coefficients within a ring element (log2 D) Made-with: Cursor * refactor: reorder matrix M rows to consistency, public, D, B, A Place the consistency row (folded-evaluation) and public row (evaluation correctness) first in M, followed by D, B, A rows. This layout groups the shorter-footprint rows at the top, enabling future verifier optimizations for analytical MLE evaluation. Updated all dependent row-index arithmetic in ring_switch, quadratic_equation, hachi_stage2, commitment_scheme debug block, and schedule doc comments. Made-with: Cursor * refactor: digit-major column reindexing for m_evals_x Kronecker structure Reindex the committed witness polynomial and m_evals_x column vector so that the power-of-2 block index is the fastest-varying (innermost) dimension within each segment. This layout enables future O(m) and O(r) MLE evaluation via Kronecker product factoring of the `a` and `b` challenge contributions. Changes: - `build_w_coeffs`: emit ring elements in digit-major order using new helpers `emit_planes_block_inner` (transposes FlatDigitBlocks from block-major to digit-major) and `emit_z_pre_block_inner` (decomposes and transposes z_pre with multi-point support). 
Adaptive segment ordering places the segment with the larger block dimension first (z-hat first when m_vars >= r_vars, else e-hat/t-hat first). - `compute_m_evals_x_with_claim_groups`: reindex w_segment, t_segment, and z_segment to match digit-major layout. D matrix access uses global block index (blk * depth_open + dig), B matrix access uses per-claim block index. Adaptive segment ordering matches build_w_coeffs. - `compute_m_evals_x_with_opening_points_and_claim_groups`: same digit-major reindexing plus multi-point z_segment with z_total_blocks = num_points * block_len. Docstring on build_w_coeffs notes the alternative of propagating digit-major throughout FlatDigitBlocks. Made-with: Cursor * feat(verifier): direct MLE evaluation for m_evals_x via PreparedMEval Replace the "materialize full m_evals_x vector then multilinear_eval" flow with a deferred PreparedMEval struct that pre-computes only challenge-derived scalars (c_alphas, eq_tau1) and evaluates the MLE directly at the sumcheck challenge point by streaming through the setup matrix. This eliminates the dominant O(total_cols) allocation (~868K field elements at nv=32) from the verifier path, replacing it with O(total_blocks + rows) permanent storage (~2K + 16) and O(x_len) transient eq-table during evaluation. Key changes: - PreparedMEval<F> struct with c_alphas, eq_tau1, and layout metadata - prepare_m_eval() replaces compute_m_evals_x on the verifier path - PreparedMEval::eval_at_point() streams matrix rows inline - RingSwitchVerifyOutput now holds PreparedMEval instead of Vec<F> - Stage2MEvalSource wraps PreparedMEval; HachiStage2Verifier borrows setup and opening_points at eval time (Option B) - setup parameter removed from ring_switch_verifier functions - Unit test confirms eval_at_point matches materialized multilinear_eval Made-with: Cursor * perf(verifier): parallelize PreparedMEval::eval_at_point with cfg_fold_reduce Use cfg_fold_reduce! 
for w, t, z, and r_tail segment loops so Rayon splits the work across threads. Root-level m_eval drops from ~253 ms (sequential) to ~47 ms; total verify from ~281 ms to ~60 ms. Made-with: Cursor * perf(verifier): use build-segments + multilinear_eval in eval_at_point Replace fused cfg_fold_reduce loops with the build-then-eval pattern: parallel-build each segment via cfg_into_iter, concatenate, and call multilinear_eval. This matches the old code's parallelism strategy and achieves zero overhead vs the materialized path (43 ms verify at nv32, same as the column-reindexing baseline). Hoist self fields into locals and mark eval_at_point #[inline] to help the compiler. Made-with: Cursor * perf(ring-switch): peel block axis in m-eval Strip the power-of-two num_blocks axis out of the separable w and t terms so batched verifier paths can keep using a succinct eq-weighted evaluation even when the outer claim dimensions are ragged. Also add the shared offset_eq helper for 2-adic peeled carry summaries and clean up the remaining clippy issues in the sibling worktree. Made-with: Cursor * perf(ring-switch): inline matrix-backed m-eval Replace the deferred verifier's matrix-backed D and B materialization with direct offset-eq evaluation, and stop running a zero-padded full multilinear_eval over the assembled M table. This keeps the prepared m-eval path test-clean while reducing the regression from the earlier peeled verifier branch and preserving the batched D-column layout used by the real proof flow. Made-with: Cursor * fix(offset-eq): satisfy clippy assign-op lints Use `*=` and `+=` in the offset-eq test helper so the cleanup-sumchecks branch passes the Clippy CI check again. Made-with: Cursor * style(offset-eq): format multiline assign expression Apply rustfmt's expected indentation in the offset-eq test helper so the cleanup-sumchecks branch passes the format CI check. 
Made-with: Cursor * ci(test): bump RUST_MIN_STACK to 16 MiB for debug test runs `cargo test --lib` on `layerzero/main` and branches off of it flakes at 20-35 % with sporadic `thread '<unknown>' has overflowed its stack` aborts originating on rayon worker threads. Repro and investigation: - Deterministic under `cargo test --all -- --test-threads=1`, `RAYON_NUM_THREADS=1`, or `cargo test --all --release`; only debug parallel runs flake. - The aborts fire in rayon workers (not test-runner threads), which default to a 2 MiB stack via `std::thread::Builder`. Heavy hachi_stage2 tests (`stage2_large_odd_*`) plus several parallel commitment/planner tests produce deep rayon-split call chains under debug-unoptimized frames and occasionally blow past 2 MiB. - `commitment_scheme.rs` already acknowledges this for two `#[ignore]`-gated debug tests via `init_debug_rayon_pool` / `run_debug_on_large_stack` (64 MiB pool stack / 256 MiB test thread). Cargo's `[env]` section propagates `RUST_MIN_STACK` to all binaries cargo spawns (including `cargo test`), and `std::thread::Builder` (which rayon uses internally) honors it for unset stack sizes. Setting it to 16 MiB is enough headroom for the observed flake and still small enough to be a drop in the bucket on modern systems. Verified: 0 / 20 overflows on `cargo test --lib` and 0 / 5 on `cargo test --all` with this config, versus 4 / 20 previously on the same branch and 7 / 20 on `layerzero/main` at 7e79bde. Made-with: Cursor
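
The stack-size mechanism behind the `RUST_MIN_STACK` change can be sketched with plain `std`, without rayon. This is a minimal illustration, not code from this PR: `run_on_large_stack` and `deep_sum` are hypothetical names, and the 16 MiB figure simply mirrors the value chosen above. An explicit `stack_size` on the builder takes precedence over `RUST_MIN_STACK`; the environment variable only applies to threads spawned without one.

```rust
use std::thread;

/// 16 MiB, mirroring the RUST_MIN_STACK value discussed above.
const STACK_BYTES: usize = 16 * 1024 * 1024;

/// Recursion with a padded frame, standing in for deep unoptimized call chains.
/// (Hypothetical workload, not taken from the hachi test suite.)
fn deep_sum(n: u64) -> u64 {
    // black_box keeps the padding array from being optimized away.
    let pad = std::hint::black_box([n; 64]);
    if n == 0 {
        0
    } else {
        pad[0] + deep_sum(n - 1)
    }
}

/// Run `f` on a fresh thread whose stack size is set explicitly.
fn run_on_large_stack<T, F>(f: F) -> T
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    thread::Builder::new()
        .stack_size(STACK_BYTES)
        .spawn(f)
        .expect("failed to spawn thread")
        .join()
        .expect("worker thread panicked")
}
```

Calling `run_on_large_stack(|| deep_sum(2_000))` runs the recursion on the enlarged stack and returns the sum of `1..=2000`.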

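The equivalence that the `PreparedMEval` unit test checks (direct evaluation at a point matches the materialized `multilinear_eval`) can be illustrated on a toy field. The sketch below is an assumption-laden stand-in: it works modulo `2^61 - 1` instead of the crate's fields, the function names are hypothetical, and it does not reproduce the segment layout or the Kronecker factoring. It only shows that an eq-weighted sum over entries generated on the fly agrees with folding a materialized table.

```rust
/// Toy prime modulus (2^61 - 1); the real code uses the crate's own field types.
const P: u64 = (1 << 61) - 1;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { ((a as u128 * b as u128) % P as u128) as u64 }

/// Evaluate the multilinear extension of `evals` at `point` by folding one
/// variable at a time. Variable 0 is the least significant index bit.
fn multilinear_eval(evals: &[u64], point: &[u64]) -> u64 {
    assert_eq!(evals.len(), 1usize << point.len());
    let mut cur = evals.to_vec();
    for &r in point {
        let one_minus_r = sub(1, r);
        cur = cur
            .chunks(2)
            .map(|c| add(mul(c[0], one_minus_r), mul(c[1], r)))
            .collect();
    }
    cur[0]
}

/// Build eq(point, i) for every index i, using the same bit ordering as above.
fn eq_table(point: &[u64]) -> Vec<u64> {
    let mut table = vec![1u64];
    for &r in point {
        let one_minus_r = sub(1, r);
        let mut next = Vec::with_capacity(table.len() * 2);
        // Lower half: the new bit is 0. Upper half: the new bit is 1.
        next.extend(table.iter().map(|&w| mul(w, one_minus_r)));
        next.extend(table.iter().map(|&w| mul(w, r)));
        table = next;
    }
    table
}

/// Evaluate without materializing the table: `entry(i)` produces entry i on demand.
fn eval_streaming(entry: impl Fn(usize) -> u64, point: &[u64]) -> u64 {
    eq_table(point)
        .iter()
        .enumerate()
        .fold(0, |acc, (i, &w)| add(acc, mul(w, entry(i))))
}
```

For the table `[1, 2, 3, 4]` the extension is `1 + r0 + 2*r1`, so both routes give `20` at the point `(5, 7)`.
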
* refactor(preprocessing): decouple setup sizing from layout derivation Rework the setup/preprocessing layer so that setup sizing is computed from conservative upper bounds on config parameters rather than a layout chain. This fixes a bug where setup(max_num_vars) would fail at commit time if the actual polynomial num_vars differed from max_num_vars, and consolidates the setup types into a dedicated module. - New `src/protocol/preprocessing.rs` is the canonical home for `HachiSetupSeed`, `HachiExpandedSetup`, `HachiProverSetup`, and `HachiVerifierSetup` (plus their serialization impls). - `HachiProverSetup::new()` owns setup expansion end-to-end. - `HachiSetupSeed` is simplified to carry a single `max_stride` (max column width across all levels/roles) instead of separate inner/outer/D width fields. - Add `max_ajtai_rank()` and `max_ajtai_width()` free functions that compute conservative row/column bounds from the config's static parameters, removing the need for `ensure_layout_fits` / `assert_layout_fits` API and their layout-chain probing. - `src/protocol/commitment/commit.rs` shrinks substantially; setup structs and their impls move out. - Add oversized-setup regression tests (setup with larger max_num_vars than the commit's actual num_vars). Made-with: Cursor * refactor(commitment): drop HachiCommitmentCore setup helpers and their dead support code Remove the unused `setup_with_layout`/`setup_with_layouts` entry points and their private helpers, along with the now-orphaned layout-scanning helpers (`LayoutChainStats`, `scan_layout_chain`, `root_batched_layout`), the `num_digits_fold_batched` field, the three tests that exercised them, and the test-only config/ring-switch functions they relied on. 
Made-with: Cursor * refactor(commitment): hoist setup matrix sizing into CommitmentConfig Introduce `CommitmentConfig::max_setup_matrix_size(max_num_vars, max_num_batched_polys)` returning `(max_rows, max_stride)`, with a default implementation that pins `max_rows` to `sis_security::MAX_RANK` and derives the worst-case stride from the root `log_basis` search range. `HachiProverSetup::new` now calls the trait method and just multiplies to get `max_total`, so setup-sizing policy lives next to the config abstraction instead of being duplicated in preprocessing. Also tighten the two small test-double configs (`SmallTestCommitmentConfig` and `BadDegreeConfig`) to `max_n_a = 4` so they match the new row ceiling. Made-with: Cursor * refactor(commitment): remove SmallTestCommitmentConfig, retarget tests to fp128::D64Full Drop the public `SmallTestCommitmentConfig` and migrate its dependent tests to the existing `fp128::D64Full` preset. The end-to-end prove/verify and batched roundtrip tests in `commitment_scheme.rs` and the `commit_w_uses_active_level_row_count` regression test in `ring_switch.rs` now run on the dense fp128 D=64 config, which is already exercised elsewhere in the suite. The tiny shape-only sanity test in `tests/ring_commitment_core.rs` is dropped along with the type. Made-with: Cursor * test(setup): add preset-capacity E2E tests and harden onehot shape checks Add `tests/setup.rs` with a 5-scenario E2E suite per fp128 preset (`D128Full`, `D64Full`, `D64OneHot`, `D32Full`, `D32OneHot`) covering same-size, undersized, and oversized setup relative to the polynomial and batch sizes used by commit/prove/verify. The undersized-nv case pins an explicit `commit received a polynomial with ... variables but setup supports at most ...` message, and the undersized-batch case likewise pins the existing polynomial-count guard. 
Surface those messages in `HachiCommitmentScheme::{commit, prove, batched_prove, verify, batched_verify}` by adding `num_vars > max_num_vars` guards that mirror the existing `max_num_batched_polys` guard and return `HachiError::InvalidInput` with a clear, actionable string (callers misusing the API, not adversarial proofs). Finally, harden `OneHotPoly`'s `HachiPolyOps` impl against a latent shape-mismatch foot-gun exposed by the tests: onehot polys bake their `(r_vars, m_vars)` block split in at construction time, whereas dense polys reblock on every call. When a user built an onehot poly with `Cfg::commitment_layout(nv)` and then tried to commit it under a `max_num_batched_polys > 1` setup (where the runtime uses `hachi_batched_root_layout(nv, batch)`), the prover would panic deep in `parallel_high_half_accumulate` with an `index out of bounds` from the sparse-ring accumulator. Add early block-size checks (assertions on non-Result entry points, `InvalidInput` on the Result-returning `commit_inner` / `commit_inner_witness_batched`) so that misuse now surfaces as a clear, actionable error pointing users at `hachi_batched_root_layout`. Made-with: Cursor * refactor(commitment): size setup matrix from the planned schedule for adaptive configs Remove the default body of `CommitmentConfig::max_setup_matrix_size`; each config now supplies its own. `GeneratedAdaptivePolicy` walks the planned schedule (cached plan or on-the-fly `find_optimal_batched_schedule`) with the batch-effective root commitment layout as a seed, yielding tight `(max_rows, max_stride)` bounds. Static and test configs use an inlined loose upper bound: `MAX_RANK` rows and `2^(max_num_vars - log2(D)) * 128 * MAX_RANK * max_num_batched_polys` stride. 
Made-with: Cursor * refactor(commitment): fold batch scaling into `fallback_batched_root_split` Teach `fallback_batched_root_split` to take `num_claims` and apply `scale_batched_root_layout` internally; existing callers in `optimal_root_batch_split` pass `1` to preserve the per-poly result. Adaptive `max_setup_matrix_size` now seeds with the scaled fallback layout in one step (instead of the separate `hachi_batched_root_layout` + `scale_batched_root_layout` pair) and uses a `(P, P, 1)` batch for the planner fallback. The seed is applied unconditionally because commit's runtime `(m, r, log_basis)` may not match the schedule plan's choice. Made-with: Cursor * refactor(commitment): move simple `max_setup_matrix_size` into the trait default Collapse the near-identical `MAX_RANK`-rows / `2^outer_vars * 128 * MAX_RANK * P`-stride formula that was inlined in `StaticBoundedPolicy`, `TinyConfig`, `BadDegreeConfig`, and `WideEnvelopeD64Full` into the trait's default body. `GeneratedAdaptivePolicy` keeps its schedule-plan-derived override, and `WCommitmentConfig` keeps the explicit delegation to its inner `Cfg`. Made-with: Cursor * refactor(setup): drop always-erroring `HachiSerialize` impl on `HachiProverSetup` `HachiProverSetup` holds runtime NTT caches and cannot be serialized; the old impl just returned `SerializationError` at runtime. Nothing requires `HachiProverSetup: HachiSerialize` (the `CommitmentScheme::ProverSetup` bound is only `Clone + Send + Sync`), so the impl is safe to remove. Callers who want to persist setup should serialize the inner `HachiExpandedSetup` and rebuild caches via `setup_from_expanded`. Also drop an outdated comment in the adaptive `max_setup_matrix_size`. 
Made-with: Cursor * refactor(setup): extract `HachiProverSetup::from_expanded` to de-duplicate NTT wrapping The "wrap an expanded setup in Arc + rebuild NTT cache + return a prover setup" block was inlined inside `HachiProverSetup::new` (disk load hit path) and again in the free `setup_from_expanded` (disk-persistence tests). Extract it to a single `HachiProverSetup::from_expanded` associated function and call it from both sites. Gate the method with the same `disk-persistence` feature that guards its only callers. Made-with: Cursor * refactor(setup): drop free `setup_from_expanded`; call `HachiProverSetup::from_expanded` directly The free function was only used by one disk-persistence test that ignored the verifier half of its tuple return. Replace its single call site with a direct `HachiProverSetup::from_expanded` invocation and delete the wrapper. Made-with: Cursor * refactor(ring-switch): drop unused `WCommitmentConfig::max_setup_matrix_size` override `max_setup_matrix_size` is only invoked from `HachiProverSetup::new`, which always uses the outer `Cfg`, never `WCommitmentConfig<_, Cfg>`. The override was dead code. Drop it and let the trait default apply if anything ever does call it through the wrapper. Made-with: Cursor * fix(setup): unblock `--all-features` CI Three fixes for the `cargo clippy --all-targets --all-features` and release-test run on the PR: - Expose `get_storage_path` and `load_expanded_setup` as `pub(crate)` so the disk-persistence test module in `commit.rs` can name them. They had been module-private free functions. - Import the two helpers explicitly at the top of the disk-persistence test submodule (`use crate::protocol::setup::{get_storage_path, load_expanded_setup}`). - Give `TinyConfig` an explicit `max_setup_matrix_size` override that sizes off its (fixed) `commitment_layout` instead of the trait default. 
The default body raises `2^(max_num_vars - log2(D))` and overflows `usize` at the `MAX_VARS = 100..=102` used by the disk-persistence tests; the override returns tight widths that match the config's actual runtime use. Made-with: Cursor * fix(setup): restore non-zero invariants in `HachiSetupSeed::check` After the refactor that collapsed the per-role width fields into a single `max_stride`, `HachiSetupSeed::check` was left as an unconditional `Ok(())`. A corrupt on-disk seed with `max_stride = 0` would pass validation and then be used as the stride for every matrix view at runtime. Re-add checks that `max_stride` is non-zero and that `max_num_batched_polys >= 1`, matching the live construction-time invariant in `Cfg::max_setup_matrix_size`. Made-with: Cursor * feat(setup): thread `max_num_points` through setup sizing `GeneratedAdaptivePolicy::max_setup_matrix_size` previously hard-coded `num_points = 1` in its `HachiRootBatchSummary`, while `batched_prove`/ `batched_verify` plan the recursive suffix with `num_points = opening_points.len()`. Because `num_points` feeds both `z_pre_count` and the `r_rows` contribution inside `w_ring_element_count_with_counts`, multi-point batches could widen recursive-level matrices past the computed `max_stride`; the fallback's scaled root layout only bounds the root widths, not the recursive suffix. Add `max_num_points` as an explicit parameter to `CommitmentConfig::max_setup_matrix_size`, `HachiProverSetup::new`, and `CommitmentScheme::setup_prover`, and propagate it into the adaptive policy's schedule lookup / `find_optimal_batched_schedule` fallback. Single-point callers pass `1` (the dominant runtime shape); multi-point batches pass an upper bound on `opening_points.len()`. The adaptive impl validates `1 <= max_num_points <= max_num_batched_polys`. 
Made-with: Cursor * fix(tests): size blessed batched onehot setup for actual point count Pass `group_sizes_by_point.len()` as `max_num_points` to `setup_prover` in `assert_blessed_batched_onehot_exact` so the schedule planner sizes the D-matrix for the real number of opening points instead of hardcoding `1`, which was only coincidentally large enough for the current configs. Made-with: Cursor * fix(setup): store and enforce `max_num_points` on setup seed Setup sizes the shared matrix stride using `max_num_points`, but the value was never stored on `HachiSetupSeed`, leaving no runtime guard against a batched opening with more distinct points than the setup was sized for. Persist `max_num_points` on the seed (including through serialization) and reject `batched_prove`/`batched_verify` calls whose `opening_points.len()` exceeds it, mirroring the existing `max_num_batched_polys` check. Made-with: Cursor * fix(setup): include `max_num_points` in disk cache key and load check The disk-persistence cache keyed setup files only by `max_num_vars` and `max_num_batched_polys`, but `max_num_points` affects `max_stride` via `Cfg::max_setup_matrix_size`. Two setups with different `max_num_points` could share a cache file, and the load-side check only verified `total_ring_elements >= max_total` without comparing `seed.max_stride`. For configs where `max_rows` varies inversely with `max_stride`, a cached setup could pass the totals check while carrying the wrong stride, causing `ring_view` to use an incorrect row layout. Thread `max_num_points` through `cache_file_name`, `get_storage_path`, `save_expanded_setup`, and `load_expanded_setup`, and additionally require the cached `seed.max_stride` and `seed.max_num_points` to meet the current request before accepting a cache hit. Made-with: Cursor
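
The seed invariants and cache-key changes described in the last few commits can be sketched as follows. Everything here is hypothetical illustration: `SetupSeed`, `check_request`, `cache_hit_ok`, the file-name format, and the error strings are stand-ins rather than the crate's `HachiSetupSeed` API, and reading "meet the current request" as `>=` is an assumption.

```rust
/// Hypothetical stand-in for the persisted setup seed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SetupSeed {
    max_num_vars: usize,
    max_num_batched_polys: usize,
    max_num_points: usize,
    max_stride: usize,
}

impl SetupSeed {
    /// Structural validation, e.g. after loading a seed from disk.
    fn check(&self) -> Result<(), String> {
        if self.max_stride == 0 {
            return Err("max_stride must be non-zero".to_string());
        }
        if self.max_num_batched_polys < 1 {
            return Err("max_num_batched_polys must be at least 1".to_string());
        }
        Ok(())
    }

    /// Reject a request that exceeds what the setup was sized for.
    fn check_request(
        &self,
        num_vars: usize,
        num_polys: usize,
        num_points: usize,
    ) -> Result<(), String> {
        if num_vars > self.max_num_vars {
            return Err(format!("num_vars {} exceeds setup limit {}", num_vars, self.max_num_vars));
        }
        if num_polys > self.max_num_batched_polys {
            return Err(format!("{} polynomials exceed setup limit {}", num_polys, self.max_num_batched_polys));
        }
        if num_points > self.max_num_points {
            return Err(format!("{} opening points exceed setup limit {}", num_points, self.max_num_points));
        }
        Ok(())
    }

    /// A cached seed is accepted only if its stride and point count
    /// cover the current request (assumed to mean `>=`).
    fn cache_hit_ok(&self, requested_stride: usize, requested_points: usize) -> bool {
        self.max_stride >= requested_stride && self.max_num_points >= requested_points
    }
}

/// Cache key that includes max_num_points, so setups sized for different
/// point counts never share a file. The format itself is made up.
fn cache_file_name(max_num_vars: usize, max_num_batched_polys: usize, max_num_points: usize) -> String {
    format!("setup_nv{}_polys{}_points{}.bin", max_num_vars, max_num_batched_polys, max_num_points)
}
```

Keeping the point-count guard next to the existing polynomial-count guard means a misuse of the API fails early with an input error instead of an out-of-bounds access deeper in the prover.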

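The loose upper bound used by the static and test configs (`MAX_RANK` rows, `2^(max_num_vars - log2(D)) * 128 * MAX_RANK * max_num_batched_polys` stride) and the `usize` overflow noted for `MAX_VARS = 100..=102` can be reproduced with checked arithmetic. This is a sketch only: the function name is hypothetical, `max_rank` is passed in because the actual `sis_security::MAX_RANK` value is not stated here, and the trait default does not necessarily return an `Option`.

```rust
/// Loose `(max_rows, max_stride)` bound, returning `None` on overflow
/// instead of wrapping or panicking.
fn loose_setup_matrix_size(
    max_num_vars: u32,
    log2_d: u32,
    max_rank: usize,
    max_num_batched_polys: usize,
) -> Option<(usize, usize)> {
    // Number of variables left after peeling off the ring dimension D.
    let outer_vars = max_num_vars.checked_sub(log2_d)?;
    // 2^outer_vars; checked_shl returns None once the shift reaches the bit width.
    let blocks = 1usize.checked_shl(outer_vars)?;
    let stride = blocks
        .checked_mul(128)?
        .checked_mul(max_rank)?
        .checked_mul(max_num_batched_polys)?;
    Some((max_rank, stride))
}
```

With `D = 64` (`log2_d = 6`), a rank of 4, and two batched polynomials, 20 variables give a stride of `2^14 * 128 * 4 * 2`, while 100 variables return `None` on 64-bit targets, which is the case the `TinyConfig` override avoids.
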
* perf(commit): parallelize per-poly inner witness in batched commit Replace the `commit_inner_witness_batched` dispatch with a direct `cfg_iter!` parallel map over the input polynomials in the batched commit path, calling the per-poly `commit_inner_witness` on each. Benchmarked via `examples/profile` with `HACHI_MODE=onehot_d32 HACHI_NUM_VARS=32 HACHI_NUM_POLYS=4` (20 interleaved and order-reversed runs). Mean commit time drops from ~707 ms to ~647 ms (~8% faster, Welch's t = 6.0), with all 325 lib tests and integration tests passing. The previously fused one-hot helper is no longer on the hot path but is retained as a trait method for now. Made-with: Cursor * refactor: remove unused commit_inner_witness_batched Now that the batched-commit call site uses a plain `cfg_iter!` parallel map over `commit_inner_witness`, the fused `commit_inner_witness_batched` trait method and its impls are dead code. Drop: - the trait default in `HachiPolyOps` - the `&P` forwarding impl - the `MultilinearPolynomial` dispatcher - the `OneHotPoly` fused implementation Made-with: Cursor * fix(clippy): gate parallel import on feature flag `cargo clippy --no-default-features -- -D warnings` flagged `use crate::parallel::*` in commitment_scheme.rs as unused when the `parallel` feature is off (since the module re-exports nothing in that config). Match the repo-wide pattern by guarding the import with `#[cfg(feature = "parallel")]`. Made-with: Cursor
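
The shape of the per-polynomial parallel map and the feature-gated import can be sketched as below. The `cfg_iter!` macro here is a hypothetical stand-in written for this example (the crate's own macro is not shown in this PR text), and `commit_inner_witness` is a toy fold, not the real commitment. With the `parallel` feature off, as when compiling this snippet on its own, both the rayon import and the `par_iter` branch are compiled out.

```rust
// Only pull in rayon when the feature is enabled; otherwise the import
// would be unused and trip `-D warnings`.
#[cfg(feature = "parallel")]
use rayon::prelude::*;

/// Hypothetical stand-in for a `cfg_iter!`-style macro: parallel iterator
/// with the feature on, plain iterator with it off.
macro_rules! cfg_iter {
    ($e:expr) => {{
        #[cfg(feature = "parallel")]
        let it = $e.par_iter();
        #[cfg(not(feature = "parallel"))]
        let it = $e.iter();
        it
    }};
}

/// Toy per-polynomial "commitment": a wrapping polynomial hash of the coefficients.
fn commit_inner_witness(poly: &[u64]) -> u64 {
    poly.iter()
        .fold(0u64, |acc, &c| acc.wrapping_mul(31).wrapping_add(c))
}

/// Batched commit as a plain map over the inputs, one independent call per polynomial.
fn batched_commit(polys: &[Vec<u64>]) -> Vec<u64> {
    cfg_iter!(polys).map(|p| commit_inner_witness(p)).collect()
}
```

Because each polynomial is processed independently, the same closure works for both the sequential and the parallel iterator, which is what makes a separate fused batched method unnecessary.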

Collapse the singleton prove/verify API onto the batched code path and drop the schedule planner's process-global DP cache in favor of the offline schedule tables. Made-with: Cursor
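
A table-first schedule lookup with a fallback, keyed by `(num_vars, WitnessShape)`, might look like the sketch below. The three `WitnessShape` fields come from the PR description; everything else is assumed for illustration: `singleton()` returning all ones, the `HashMap` table, the `Schedule` and `ScheduleSource` types, and the halving `dp_fallback`, which is a placeholder and not the planner's DP.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct WitnessShape {
    num_claims: usize,
    num_commitment_groups: usize,
    num_points: usize,
}

impl WitnessShape {
    /// Assumed meaning of a singleton opening: one claim, one group, one point.
    fn singleton() -> Self {
        Self { num_claims: 1, num_commitment_groups: 1, num_points: 1 }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ScheduleSource {
    Table,
    DpFallback,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Schedule {
    levels: Vec<usize>,
    source: ScheduleSource,
}

/// Placeholder for the DP: halves num_vars until it is small. Not the real planner.
fn dp_fallback(num_vars: usize, _shape: WitnessShape) -> Vec<usize> {
    let mut levels = Vec::new();
    let mut v = num_vars;
    while v > 4 {
        levels.push(v);
        v /= 2;
    }
    levels
}

/// Consult the offline table first; only compute when there is no keyed row.
fn find_optimal_schedule(
    table: &HashMap<(usize, WitnessShape), Vec<usize>>,
    num_vars: usize,
    shape: WitnessShape,
) -> Schedule {
    match table.get(&(num_vars, shape)) {
        Some(levels) => Schedule { levels: levels.clone(), source: ScheduleSource::Table },
        None => Schedule { levels: dp_fallback(num_vars, shape), source: ScheduleSource::DpFallback },
    }
}
```

A singleton opening then goes through the same lookup as any batch by passing `WitnessShape::singleton()`, and no process-global cache is needed because the table is immutable input.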

socket-security · 2026-04-22T18:48:45Z

Review the following changes in direct dependencies.

Diff	Package	Supply Chain Security	Vulnerability	Quality	Maintenance	License
	aes@0.8.4
	blake2@0.10.6
	ctr@0.9.2
	num-bigint@0.4.6
	sha3@0.10.8
	tracing-chrome@0.7.2
	tracing-subscriber@0.3.22
	tracing@0.1.41 ⏵ 0.1.44

omibo and others added 30 commits February 27, 2026 17:18

Implement Batched Sumcheck and Gruen EQ (#2)

4980e1a

perf: reuse z_pre witness data across ring switch (#14)

eb8bea7

Replace eprintln! with structured logs (#15)

e62e434

Remove Labrador Implementation (#23)

dfe33a5

* Remove Labrador implementation * fix: remove stale profile tail tag accounting --------- Co-authored-by: Quang Dao <quang.dao@layerzerolabs.org>

fix: harden disk-persistence cache reuse (#33)

d8a20e3

quangvdao and others added 16 commits March 31, 2026 18:26

Account for challenge-aware A-role SIS bounds in planner (#41)

1e8f781

* Correct planner A-role SIS bounds * Run rustfmt on planner security changes * Clarify A-role SIS collision helper

Add eq-factored stage1 sumcheck proof and benchmark comments (#37)

9712c9e

refactor: unify singleton and batched proving paths

cc9cb03

Collapse the singleton prove/verify API onto the batched code path and drop the schedule planner's process-global DP cache in favor of the offline schedule tables. Made-with: Cursor

RadNi closed this Apr 22, 2026

RadNi wants to merge 46 commits into a16z:main from LayerZero-Labs:amir/batch-refactor

3 participants
