Implement Hachi PCS protocol with required primitives by omibo · Pull Request #1 · a16z/hachi

omibo · 2026-02-28T00:54:37Z

Summary

Implements the core Hachi lattice-based polynomial commitment scheme, covering the algebra foundations and protocol logic from Sections 4.1-4.3 of the paper.

Key Changes

Algebra layer (`src/algebra/`)

Prime fields (Fp32, Fp64, Fp128) with quadratic/quartic extension towers (Fp2, Fp4)
Pseudo-Mersenne field utilities and SIMD-packed backends (AVX2, AVX512, NEON)
Vector modules, sparse challenge sampling, and field lifting
Shared field utility functions deduplicated into fields/util.rs

Protocol layer (`src/protocol/`)

Ring-native Ajtai commitment (section 4.1) with configurable inner/outer matrices
CommitWitness struct for type-safe commitment outputs
SmallTestCommitmentConfig and ProductionFp128CommitmentConfig for test/production parameterization
Quadratic equation builder (section 4.2)
Ring switching (section 4.3): coefficient expansion into field-based sumcheck instances
Sumcheck with compressed univariate polynomials, split into types.rs and core logic
Norm sumcheck (F_0 range-check) and relation sumcheck (F_alpha) prover/verifier
Generic HashTranscript<D: Digest> unifying Blake2b and Keccak backends
Fiat-Shamir transcript with labeled protocol framing
End-to-end HachiCommitmentScheme commit/prove/verify wrapper

HachiPolyOps trait (`src/protocol/hachi_poly_ops.rs`)

New HachiPolyOps<F, D> trait: operation-centric polynomial API with 4 methods (evaluate_ring, fold_blocks, decompose_fold, commit_inner)
DensePoly<F, D>: dense ring coefficient implementation
OneHotPoly<F, D>: sparse one-hot implementation with optimized commit/fold paths
CommitmentScheme now generic over P: HachiPolyOps<F, D>, replacing the old Polynomial<F> trait
Removed StreamingCommitmentScheme, HachiChunkState, MegaPolyBlock, commit_mixed, commit_onehot (standalone fn), batch_commit, combine_*

Flexible decomposition and dual basis mode

DecompositionParams struct with log_basis and log_coeff_bound for runtime decomposition depth
compute_delta/compute_tau derive optimal decomposition from assumed coefficient bounds
BasisMode enum (Lagrange vs Monomial) as a prove-time parameter for the evaluation relation
Moved LOG_BASIS, DELTA, TAU from compile-time constants to runtime parameters in HachiCommitmentLayout

Dead code removal and cleanup

Deleted Polynomial, MultilinearLagrange traits and DenseMultilinearEvals
Deleted inner_ajtai_onehot (kept _t_only), MegaPolyBlock, commit_mixed
Removed ring_coeffs from HachiCommitmentHint (only t_hat remains)
Removed all unnecessary #[allow(...)] attributes
Added docs to all public items in test_utils, packed_ext

Bug fixes

Fix CompressedUniPoly::degree() off-by-one that could let malformed sumcheck proofs pass
Remove overly strict delta * log_basis > 128 config validation that rejected valid ProductionFp128CommitmentConfig

Testing

cargo test -- all tests pass
cargo clippy --all --tests --benches -- -D warnings -- zero warnings
cargo fmt -- formatted
verify_passes_for_consistent_opening confirms ProductionFp128 config works end-to-end

Pin Rust 1.88 with minimal profile (cargo, rustc, clippy, rustfmt). Co-authored-by: Cursor <cursoragent@cursor.com>

Respects rust-toolchain.toml automatically. Also normalize clippy flags to use --all --all-targets consistently. Co-authored-by: Cursor <cursoragent@cursor.com>

Required by the Fp128 field backend. Co-authored-by: Cursor <cursoragent@cursor.com>

Introduces the algebra module with: - Fp32/Fp64/Fp128 prime field backends with branchless constant-time add/sub/neg and rejection-sampled random - U256 helper for Fp128 wide multiplication - Fp2/Fp4 tower extensions with Karatsuba-ready structure - VectorModule<F, N> fixed-length vector module over any field - Poly<F, D> fixed-size polynomial container Co-authored-by: Cursor <cursoragent@cursor.com>

Adds the ntt submodule with: - NttPrime: per-prime Montgomery-like fpmul, Barrett-like fpred, branchless csubq/caddq/center - LimbQ/QData: radix-2^14 limb arithmetic for big-q coefficients - logq=32 parameter preset (six NTT-friendly primes, CRT constants) Co-authored-by: Cursor <cursoragent@cursor.com>

24 tests covering: - Field arithmetic, identities, and distributivity (Fp32/Fp64/Fp128) - Zero inversion returns None - Serialization round-trips (all field types, extensions, VectorModule) - Fp2 conjugate, norm, and distributivity - U256 wide multiply and bit access - LimbQ round-trip and add/sub inverse - QData consistency with preset constants - NTT normalize range and fpmul commutativity - Poly add/sub/neg Co-authored-by: Cursor <cursoragent@cursor.com>

Records Phase 0 status: all field types, extensions, NTT scaffolding, constant-time arithmetic, and 24-test suite. Reflects the fields/ntt/module/poly directory layout. Co-authored-by: Cursor <cursoragent@cursor.com>

Overhaul the NTT small-prime arithmetic and CRT modules: - Add MontCoeff newtype (#[repr(transparent)] i16 wrapper) to enforce Montgomery-domain vs canonical-domain separation at the type level - NttPrime methods now take/return MontCoeff instead of bare i16: fpmul→mul, fpred→reduce, csubq→csubp, caddq→caddp - Add domain conversion: from_canonical (i16→Mont), to_canonical (Mont→i16) - Delete free functions (pointwise_mul etc), replaced by methods on NttPrime - LimbQ: replace add_limbs/sub_limbs/less_than with std Add/Sub/Ord impls - LimbQ: replace from_u128/to_u128 with From<u128>/TryFrom for u128 - LimbQ: add Display impl, branchless csub_mod - Rename all LABRADOR* constants to project-native Q32_* names - Add #[cfg(test)] verification that re-derives pinv/v/mont/montsq from p - Add MontCoeff round-trip and LimbQ ordering tests (28 total) Co-authored-by: Cursor <cursoragent@cursor.com>

Remove // ---- Section ---- banner comments from prime.rs and crt.rs. Add non-negotiable rules to HACHI_PROGRESS.md: - No section-banner comments - No commit/push without explicit user approval Co-authored-by: Cursor <cursoragent@cursor.com>

Milestone 1 - CyclotomicRing<F, D> (coefficient form): - Schoolbook negacyclic convolution Mul (X^D = -1) - Add/Sub/Neg/AddAssign/SubAssign/MulAssign, scale, zero/one/x - HachiSerialize/HachiDeserialize Milestone 2 - NTT butterfly + CyclotomicNtt<K, D>: - Merged negacyclic Cooley-Tukey forward NTT (twist folded into twiddles) - Gentleman-Sande inverse NTT with D^{-1} scaling - Runtime primitive-root finder and twiddle table computation (TODO: migrate to compile-time const tables) - CyclotomicNtt with per-prime pointwise Add/Sub/Neg/Mul - Ring<->Ntt transforms with CRT reconstruction Co-authored-by: Cursor <cursoragent@cursor.com>

Add 12 new tests: - CyclotomicRing: negacyclic X^D=-1, mul identity/zero, commutativity, distributivity, associativity, additive inverse, serde, degree-64 - NTT: forward/inverse round-trip (single prime + all primes), NTT mul matches schoolbook cross-check Wrap all integration tests in a single mod tests block and remove section-banner comments. Co-authored-by: Cursor <cursoragent@cursor.com>

Constrain ring/NTT conversions to explicit field backends and replace fragile CRT reconstruction with deterministic modular lifting. Enforce canonical deserialization checks in validated field decoding paths to reject malformed encodings. Co-authored-by: Cursor <cursoragent@cursor.com>

Add end-to-end ring->NTT->ring CRT round-trip tests plus reduced-ops stability checks. Expand serialization coverage for Fp4/Poly and verify checked deserialization rejects non-canonical field encodings. Co-authored-by: Cursor <cursoragent@cursor.com>

Add a dedicated ring/NTT benchmark harness and register it in Cargo metadata. Record current constant-time review status and sync the implementation progress board with new milestones and test coverage. Co-authored-by: Cursor <cursoragent@cursor.com>

Break the monolithic Field trait into FieldCore, CanonicalField, and FieldSampling, and update algebra primitives to depend on explicit capabilities for cleaner semantics and future backend integration. Co-authored-by: Cursor <cursoragent@cursor.com>

Introduce the curated 2^k-offset prime registry and typed field aliases, then add dedicated Miller-Rabin regression tests to enforce probable primality for all enabled profiles. Co-authored-by: Cursor <cursoragent@cursor.com>

Rename the ring NTT representation to explicit CRT+NTT semantics and route conversions through backend traits, adding scalar backend and domain aliases for a cleaner representation-vs-execution boundary. Co-authored-by: Cursor <cursoragent@cursor.com>

Expand algebra tests to validate default-vs-backend CRT+NTT equivalence, sampling bounds, and pow2-offset registry consistency under the new field and ring abstractions. Co-authored-by: Cursor <cursoragent@cursor.com>

Refresh progress and constant-time notes to match the new CRT+NTT naming and field scope, and add the NTT prime analysis document plus local NIST standards artifacts used for parameter rationale. Co-authored-by: Cursor <cursoragent@cursor.com>

Make Fp128 reduction and CRT inner accumulation paths more timing-stable with branchless modular operations, and refresh ring/docs/tests status after the hardening cleanup pass. Co-authored-by: Cursor <cursoragent@cursor.com>

Introduce Hachi protocol-layer interfaces and placeholder types with Blake2b/Keccak transcript backends plus phase-aligned labels, while making transcript absorption label-directed at call sites. Co-authored-by: Cursor <cursoragent@cursor.com>

Add deterministic transcript schedule checks (including keccak) and protocol commitment contract tests so transcript ordering and challenge derivation behavior are locked down. Co-authored-by: Cursor <cursoragent@cursor.com>

Document the protocol scaffold as in-progress, capture the commitment-focused transcript label vocabulary, and clarify deferred Jolt adapter expectations. Co-authored-by: Cursor <cursoragent@cursor.com>

Implement the ring-native commitment setup/commit core with config validation, utility modules, and seeded domain-separated public matrix derivation, while wiring prover/verifier stub modules for the next open-check phase. Co-authored-by: Cursor <cursoragent@cursor.com>

Unify ring commitment core and config validation checks in one test file and add explicit prover/verifier stub contract tests to lock current placeholder behavior before open-check implementation. Co-authored-by: Cursor <cursoragent@cursor.com>

Record that ring-native §4.1 commitment setup/commit and protocol wiring are in place, and clarify that open-check prove/verify remains the next unfinished protocol milestone. Co-authored-by: Cursor <cursoragent@cursor.com>

Add a constant-time inversion helper for prime fields and replace scalar CRT's final `% q` projection with a division-free fixed-iteration reducer, so secret-bearing arithmetic paths avoid variable-latency behavior. Co-authored-by: Cursor <cursoragent@cursor.com>

Rename the secret-path inversion helper to `Invertible::inv_or_zero` while preserving constant-time semantics via doc contracts, and update CT tracking docs to match the new API names. Co-authored-by: Cursor <cursoragent@cursor.com>

Rename the inversion helper test to match the new API naming and keep the ring commitment test formatting consistent after linting. Co-authored-by: Cursor <cursoragent@cursor.com>

Introduce core sumcheck building blocks (univariate messages, compression, and transcript-driving prover/verifier driver) and add unit/integration tests. Update progress doc to reflect sumcheck core landing. Co-authored-by: Cursor <cursoragent@cursor.com>

…arking. This switches packed lane storage to SoA with NEON add/sub kernels and a SoA mul path, and updates packed-field APIs and benches so scalar-vs-packed latency/throughput comparisons are measured consistently. Made-with: Cursor

This keeps mul in true SoA form while removing repeated vector transmute overhead and inlining the limb-level Solinas lane kernel, improving packed mul throughput and latency against scalar baselines. Made-with: Cursor

Expose mul_wide_u64, mul_wide, mul_wide_u128, solinas_reduce, and to_limbs for deferred-reduction patterns needed by jolt-hachi. Hand-optimized reduce paths for 3/4/5 limbs avoid generic loop overhead. Refactor mul_raw to reuse mul_wide + reduce_4 (zero overhead). Add 9 unit tests and widening/accumulator benchmarks. Made-with: Cursor

…e mul_wide free fn Rename free function mul_wide → mul64_wide to avoid shadowing Fp128::mul_wide. Move reduce_4 next to fold2_canonicalize. Replace fully qualified std::ops::{Add,Sub,Mul,Neg} with use imports. Made-with: Cursor

Rework fp32.rs and fp64.rs to require p = 2^k - c (small c), matching fp128's design. Compile-time constants BITS/C/MASK derived from P with static assertions. Replace bit-serial reduction with two-fold Solinas reduction (reduce_product for hot path, loop-based reduce_u64/u128 for arbitrary inputs). Add widening ops (mul_wide, square, solinas_reduce). Fix FieldSampling to use direct modular reduction instead of rejection sampling. Blanket-impl PseudoMersenneField, remove manual impls. Rename const generic MODULUS -> P at all call sites. Add latency + throughput benchmarks. Hoist mid-function imports in tests/algebra.rs. Made-with: Cursor

For BITS < 64 (e.g. 2^40-195), avoid u128 intermediates in reduce_product, add_raw, and sub_raw. Use mul_c_narrow which splits C*high into u32x32->u64 widening multiplies (umaddl on AArch64), preventing LLVM from promoting to u128. Brings 40-bit mul throughput within 4% of 64-bit (690 vs 716 Melem/s), up from ~20% gap. Made-with: Cursor

Add Pow2Offset30Field (2^30-35) and Pow2Offset31Field (2^31-19) prime definitions and type aliases. Refactor fp32/fp64 latency benchmarks with chain_bench! macro, add throughput benchmarks for all new primes. Made-with: Cursor

PackedFp32Neon: 4 lanes in uint32x4_t with full NEON Solinas reduction for mul (vmull_u32 + 2-fold reduce), umin trick for add/sub (BITS<=31), overflow-aware paths for BITS==32. C_SHIFT_KIND optimization for C=2^a+/-1. PackedFp64Neon: 2 lanes in uint64x2_t with NEON add/sub (conditional P for BITS<=62, carry-aware for BITS>=63), scalar-per-lane mul (no native 64x64->128 on NEON). Fp32 packed achieves 2.4-3.5x mul throughput and 3.5-5.0x add/sub throughput over scalar. Includes HasPacking impls, type aliases, NoPacking fallbacks, 7 correctness tests, and throughput benchmarks. Made-with: Cursor

For packed Fp32, remove the shift/add C-special-case in the Solinas fold and always use vmull_u32 with a hoisted C broadcast, which improves stability and removes the 24-bit mul regression. For packed Fp64, replace per-lane Fp64 wrapper multiplication with packed-local per-lane 64x64->128 products plus specialized Solinas reduction (including the sub-word u64 fold path), reducing mul overhead for both 40-bit and 64-bit packed variants. Made-with: Cursor

Switch packed Fp64 sub-word fold multiplication to direct `C*x`, which improves packed mul throughput in repeated A/B runs. Add dedicated reducer and codegen probe benches so we can compare 40-bit and 64-bit fold paths with instruction-level visibility. Made-with: Cursor

Use BMI2 widening multiplies in scalar field hot paths and specialize x86 sub-word fold multiplication to a single 64-bit multiply, improving 40-bit fp64 throughput while keeping 64-bit and 128-bit paths stable. Made-with: Cursor

Raise Hachi MSRV to 1.88, add specialized Fp128 mul_wide_limbs kernels for M={3,4} and OUT={4,5,6}, and add field_arith benches that track mul_wide_limbs-only and roundtrip costs to catch regressions. Made-with: Cursor

Make from_u64 use a direct canonical limb construction (no reduction path), fix from_i64 to use unsigned_abs to avoid i64::MIN overflow, and add a regression test for the min-value case. Made-with: Cursor

Exploits the structure of one-hot vectors (T chunks of K field elements, each chunk with exactly one 1) to eliminate all inner ring multiplications. Gadget decomposition of {0,1} coefficients is trivial (only level-0 digit is nonzero), and the inner Ajtai t = A*s reduces to summing selected columns of A with O(D) negacyclic rotations instead of O(D^2) ring muls. Handles both K >= D and D >= K as long as one divides the other: - K >= D: each nonzero ring element is a monomial X^j (single rotation) - D >= K: each ring element is a sum of D/K monomials (multiple rotations) Total inner cost: N_A * T * D coefficient additions (zero multiplications), vs N_A * 2^M * delta * D^2 coefficient multiplications in the dense path. Made-with: Cursor

Made-with: Cursor

Implement vectorized SIMD arithmetic for x86_64: - AVX2: 8-wide Fp32, 4-wide Fp64, 2-wide Fp128 (scalar delegation) - AVX-512: 16-wide Fp32, 8-wide Fp64, 4-wide Fp128 (scalar delegation) Fp32 uses even/odd lane split with 2-fold Solinas reduction. Fp64 uses vectorized 64×64→128 schoolbook multiply (adapted from plonky3 Goldilocks) with custom Solinas reduction for pseudo-Mersenne primes p = 2^k - c. Also: extract NEON backend into packed_neon.rs, add cfg-gated module selection (AVX-512 > AVX2 > NEON > NoPacking), enable nightly stdarch_x86_avx512 feature, add sumcheck-mix benchmark, and fix minor clippy lints in fp64/fp128. Made-with: Cursor

Convert Fp128 packed backends from scalar delegation (AoS) to SoA layout with vectorized add/sub via __m512i / __m256i. Mul remains scalar per-lane. Add FIELD_OPS_PERF.md with Zen 5 benchmark results. Fp128 packed add: +114% (1.08 → 2.31 Gelem/s on Zen 5 AVX-512) Fp128 packed sub: +137% (1.34 → 3.18 Gelem/s) Made-with: Cursor

Populate FIELD_OPS_PERF.md with Apple M4 Pro (NEON) results for all primes across scalar, packed, and sumcheck MACC workloads. Remove the experimental mul_add trait method (vectorized add already optimal after inlining; scalar fused approach was 16% slower). Made-with: Cursor

At production parameters (D=256/1024), schoolbook CyclotomicRing multiplication is catastrophically expensive. Every protocol hot path has exploitable operand structure that avoids the full D^2 cost: - Add CyclotomicRing::mul_by_sparse for O(omega*D) sparse challenge multiplication (90-140x speedup in compute_z_hat) - Change RingOpeningPoint to store Vec<F> scalars; use scale() instead of ring mul in compute_w_hat (256-1024x speedup) - Add kron_scalars, kron_row_scale, kron_sparse_scale; refactor generate_m to use scalar-aware Kronecker products - Add zero-skip and scalar-detect in compute_r_via_poly_division - Add sample_sparse_challenges, store Vec<SparseChallenge> in QuadraticEquation throughout prover and verifier paths Made-with: Cursor

* chore: add toolchain and formatting config Pin Rust 1.88 with minimal profile (cargo, rustc, clippy, rustfmt). Co-authored-by: Cursor <cursoragent@cursor.com> * chore(ci): switch to actions-rust-lang/setup-rust-toolchain Respects rust-toolchain.toml automatically. Also normalize clippy flags to use --all --all-targets consistently. Co-authored-by: Cursor <cursoragent@cursor.com> * feat(primitives): add u128/i128 serialization support Required by the Fp128 field backend. Co-authored-by: Cursor <cursoragent@cursor.com> * feat(algebra): add prime fields, extensions, and modules Introduces the algebra module with: - Fp32/Fp64/Fp128 prime field backends with branchless constant-time add/sub/neg and rejection-sampled random - U256 helper for Fp128 wide multiplication - Fp2/Fp4 tower extensions with Karatsuba-ready structure - VectorModule<F, N> fixed-length vector module over any field - Poly<F, D> fixed-size polynomial container Co-authored-by: Cursor <cursoragent@cursor.com> * feat(algebra): add NTT small-prime arithmetic and CRT helpers Adds the ntt submodule with: - NttPrime: per-prime Montgomery-like fpmul, Barrett-like fpred, branchless csubq/caddq/center - LimbQ/QData: radix-2^14 limb arithmetic for big-q coefficients - logq=32 parameter preset (six NTT-friendly primes, CRT constants) Co-authored-by: Cursor <cursoragent@cursor.com> * test(algebra): add comprehensive algebra test suite 24 tests covering: - Field arithmetic, identities, and distributivity (Fp32/Fp64/Fp128) - Zero inversion returns None - Serialization round-trips (all field types, extensions, VectorModule) - Fp2 conjugate, norm, and distributivity - U256 wide multiply and bit access - LimbQ round-trip and add/sub inverse - QData consistency with preset constants - NTT normalize range and fpmul commutativity - Poly add/sub/neg Co-authored-by: Cursor <cursoragent@cursor.com> * docs: add and update progress tracking document Records Phase 0 status: all field types, extensions, NTT scaffolding, constant-time arithmetic, and 24-test suite. Reflects the fields/ntt/module/poly directory layout. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(ntt): Rust-ify NTT/CRT port from C Overhaul the NTT small-prime arithmetic and CRT modules: - Add MontCoeff newtype (#[repr(transparent)] i16 wrapper) to enforce Montgomery-domain vs canonical-domain separation at the type level - NttPrime methods now take/return MontCoeff instead of bare i16: fpmul→mul, fpred→reduce, csubq→csubp, caddq→caddp - Add domain conversion: from_canonical (i16→Mont), to_canonical (Mont→i16) - Delete free functions (pointwise_mul etc), replaced by methods on NttPrime - LimbQ: replace add_limbs/sub_limbs/less_than with std Add/Sub/Ord impls - LimbQ: replace from_u128/to_u128 with From<u128>/TryFrom for u128 - LimbQ: add Display impl, branchless csub_mod - Rename all LABRADOR* constants to project-native Q32_* names - Add #[cfg(test)] verification that re-derives pinv/v/mont/montsq from p - Add MontCoeff round-trip and LimbQ ordering tests (28 total) Co-authored-by: Cursor <cursoragent@cursor.com> * chore: remove section banners, update progress doc Remove // ---- Section ---- banner comments from prime.rs and crt.rs. Add non-negotiable rules to HACHI_PROGRESS.md: - No section-banner comments - No commit/push without explicit user approval Co-authored-by: Cursor <cursoragent@cursor.com> * feat(ring): add CyclotomicRing, CyclotomicNtt, and NTT butterfly Milestone 1 - CyclotomicRing<F, D> (coefficient form): - Schoolbook negacyclic convolution Mul (X^D = -1) - Add/Sub/Neg/AddAssign/SubAssign/MulAssign, scale, zero/one/x - HachiSerialize/HachiDeserialize Milestone 2 - NTT butterfly + CyclotomicNtt<K, D>: - Merged negacyclic Cooley-Tukey forward NTT (twist folded into twiddles) - Gentleman-Sande inverse NTT with D^{-1} scaling - Runtime primitive-root finder and twiddle table computation (TODO: migrate to compile-time const tables) - CyclotomicNtt with per-prime pointwise Add/Sub/Neg/Mul - Ring<->Ntt transforms with CRT reconstruction Co-authored-by: Cursor <cursoragent@cursor.com> * test(algebra): add ring and NTT tests, wrap in mod tests Add 12 new tests: - CyclotomicRing: negacyclic X^D=-1, mul identity/zero, commutativity, distributivity, associativity, additive inverse, serde, degree-64 - NTT: forward/inverse round-trip (single prime + all primes), NTT mul matches schoolbook cross-check Wrap all integration tests in a single mod tests block and remove section-banner comments. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(algebra): harden ring-NTT conversion and field decoding Constrain ring/NTT conversions to explicit field backends and replace fragile CRT reconstruction with deterministic modular lifting. Enforce canonical deserialization checks in validated field decoding paths to reject malformed encodings. Co-authored-by: Cursor <cursoragent@cursor.com> * test(algebra): add CRT round-trip and serialization guard coverage Add end-to-end ring->NTT->ring CRT round-trip tests plus reduced-ops stability checks. Expand serialization coverage for Fp4/Poly and verify checked deserialization rejects non-canonical field encodings. Co-authored-by: Cursor <cursoragent@cursor.com> * chore(bench): add ring_ntt benchmark target and CT tracking docs Add a dedicated ring/NTT benchmark harness and register it in Cargo metadata. Record current constant-time review status and sync the implementation progress board with new milestones and test coverage. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(field): split core, canonical, and sampling capabilities Break the monolithic Field trait into FieldCore, CanonicalField, and FieldSampling, and update algebra primitives to depend on explicit capabilities for cleaner semantics and future backend integration. Co-authored-by: Cursor <cursoragent@cursor.com> * feat(fields): add pow2-offset pseudo-mersenne registry and checks Introduce the curated 2^k-offset prime registry and typed field aliases, then add dedicated Miller-Rabin regression tests to enforce probable primality for all enabled profiles. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(ring): introduce crt-ntt backend/domain layering Rename the ring NTT representation to explicit CRT+NTT semantics and route conversions through backend traits, adding scalar backend and domain aliases for a cleaner representation-vs-execution boundary. Co-authored-by: Cursor <cursoragent@cursor.com> * test(algebra): cover backend parity and pow2-offset invariants Expand algebra tests to validate default-vs-backend CRT+NTT equivalence, sampling bounds, and pow2-offset registry consistency under the new field and ring abstractions. Co-authored-by: Cursor <cursoragent@cursor.com> * docs(algebra): update progress notes and add prime analysis references Refresh progress and constant-time notes to match the new CRT+NTT naming and field scope, and add the NTT prime analysis document plus local NIST standards artifacts used for parameter rationale. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(algebra): harden fp128 reduction and CRT reconstruction arithmetic Make Fp128 reduction and CRT inner accumulation paths more timing-stable with branchless modular operations, and refresh ring/docs/tests status after the hardening cleanup pass. Co-authored-by: Cursor <cursoragent@cursor.com> * feat(protocol): add transcript and commitment scaffold Introduce Hachi protocol-layer interfaces and placeholder types with Blake2b/Keccak transcript backends plus phase-aligned labels, while making transcript absorption label-directed at call sites. Co-authored-by: Cursor <cursoragent@cursor.com> * test(protocol): add transcript and commitment contract coverage Add deterministic transcript schedule checks (including keccak) and protocol commitment contract tests so transcript ordering and challenge derivation behavior are locked down. Co-authored-by: Cursor <cursoragent@cursor.com> * docs(protocol): align transcript spec and progress status Document the protocol scaffold as in-progress, capture the commitment-focused transcript label vocabulary, and clarify deferred Jolt adapter expectations. Co-authored-by: Cursor <cursoragent@cursor.com> * feat(protocol): add ring commitment core and seeded matrix derivation Implement the ring-native commitment setup/commit core with config validation, utility modules, and seeded domain-separated public matrix derivation, while wiring prover/verifier stub modules for the next open-check phase. Co-authored-by: Cursor <cursoragent@cursor.com> * test(protocol): consolidate ring commitment and stub contract coverage Unify ring commitment core and config validation checks in one test file and add explicit prover/verifier stub contract tests to lock current placeholder behavior before open-check implementation. Co-authored-by: Cursor <cursoragent@cursor.com> * docs(progress): update phase 2 status after commitment core landing Record that ring-native §4.1 commitment setup/commit and protocol wiring are in place, and clarify that open-check prove/verify remains the next unfinished protocol milestone. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(algebra): harden CT inversion path and CRT final projection Add a constant-time inversion helper for prime fields and replace scalar CRT's final `% q` projection with a division-free fixed-iteration reducer, so secret-bearing arithmetic paths avoid variable-latency behavior. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(algebra): rename inversion helper API without ct suffix Rename the secret-path inversion helper to `Invertible::inv_or_zero` while preserving constant-time semantics via doc contracts, and update CT tracking docs to match the new API names. Co-authored-by: Cursor <cursoragent@cursor.com> * test(algebra): clean inversion test naming and normalize formatting Rename the inversion helper test to match the new API naming and keep the ring commitment test formatting consistent after linting. Co-authored-by: Cursor <cursoragent@cursor.com> * feat(protocol): add sumcheck core module and tests Introduce core sumcheck building blocks (univariate messages, compression, and transcript-driving prover/verifier driver) and add unit/integration tests. Update progress doc to reflect sumcheck core landing. Co-authored-by: Cursor <cursoragent@cursor.com> * Add reference PDF papers * Add local agent instruction files * Add Hachi and SuperNOVA digest docs * Add general field, ring, and multilinear utilities * Add sparse Fiat-Shamir challenge sampling * Implement Polynomail Evaluation as Quadradic Equation * Rename stub to prover and verifier * Refactor code organization * Replace decopose with balanced decompose * Transform polynomial over Fq to ring * Refactor function names * Impl commitment_scheme API * Add SolinasFp128 backend for sparse 128-bit primes Introduce `SolinasFp128` with two-fold Solinas reduction for `p = 2^128 - c` (sparse `c`), plus `U256::sqr_u128`. Export descriptive prime aliases, add BigUint-backed correctness tests, and include a Criterion bench for mul/inv. Co-authored-by: Cursor <cursoragent@cursor.com> * Tighten docs and minor clippy cleanups Add missing rustdoc Errors/Panics sections and apply small simplifications suggested by clippy. Co-authored-by: Cursor <cursoragent@cursor.com> * Add reduction steps to iteration prover * Optimize Solinas mul/add/sub: fused u64-limb schoolbook + csel canonicalize Rewrite mul_raw as a fused 2×2 schoolbook multiply with two-fold Solinas reduction using explicit u64 limbs and mac helper, bypassing U256. Replace mask-based canonicalize with carry-flag-based pattern that compiles to adds+adcs+csel+csel (4 insns) instead of 10 on AArch64. Add pure-mul, sqr, and throughput microbenchmarks. Made-with: Cursor * Switch SolinasFp128 repr from u128 to [u64; 2] for 8-byte alignment Storage is now [u64; 2] (lo, hi) which halves alignment from 16 to 8 bytes, improving struct packing. Arithmetic hot paths convert to u128 for LLVM-optimal codegen (adds/adcs pairs), so no perf regression. Made-with: Cursor * Fuse overflow correction with canonicalize in fold2_canonicalize When fold-2 overflows, the wrapped value s < C², so s + C < C(C+1) < P — meaning s + C is already canonical. This lets us replace the separate overflow-correction + canonicalize (3 + 4 insns) with a single fused `if (overflow | carry) { s + C } else { s }` select, saving 2 instructions on the critical path. Add compile-time assertion enforcing C(C+1) < P. Made-with: Cursor * Unify Fp128 with Solinas-optimized arithmetic, delete SolinasFp128 Replace the generic Fp128<const MODULUS: u128> (binary-long-division via U256) with the Solinas-optimized implementation. Fp128<const P: u128> now uses [u64; 2] storage, fused schoolbook 2x2 + two-fold Solinas reduction (~23 cycles/mul on AArch64/x86-64), and compile-time validation that P = 2^128 - C with C < 2^64. Delete SolinasFp128, SolinasParams, solinas128.rs, and u256.rs. All call sites updated; prime type aliases (Prime128M13M4P0 etc.) are now simple Fp128<...> aliases in fp128.rs. Blanket PseudoMersenneField impl for all Fp128<P>. Made-with: Cursor * Use git deps for ark-bn254/ark-ff instead of local paths Switch from local path dependencies to the a16z/arkworks-algebra git repo (branch dev/twist-shout) so collaborators can compile without needing a local checkout of arkworks-algebra-jolt. Made-with: Cursor * Add template for sumchecks * Optimize Fp128 mul path and expand Rust field benchmarks. Refine Fp128 multiply/fold carry handling for better generated code and add isolated, passthrough, independent, and long-chain Rust microbenches to separate latency and throughput effects when comparing against BN254. Made-with: Cursor * Add 2^a±1 Fp128 reduction specialization and benches. Detect C = 2^a ± 1 at compile time and route fold multiplications through a specialized shift-based path with generic fallback, plus add benchmark coverage for sparse 128-bit primes using this shape. Made-with: Cursor * Add packed Fp128 field backend scaffolding and focused benchmarks. This introduces AArch64-first packed field abstractions with a scalar fallback and adds dedicated field-only validation/benchmark coverage before any ring or protocol integration. Made-with: Cursor * Refactor packed Fp128 backend to true SoA layout and stabilize benchmarking. This switches packed lane storage to SoA with NEON add/sub kernels and a SoA mul path, and updates packed-field APIs and benches so scalar-vs-packed latency/throughput comparisons are measured consistently. Made-with: Cursor * Optimize packed Fp128 mul throughput with array-backed SoA lanes. This keeps mul in true SoA form while removing repeated vector transmute overhead and inlining the limb-level Solinas lane kernel, improving packed mul throughput and latency against scalar baselines. Made-with: Cursor * Add Fp128 widening multiply API and specialized Solinas reduction Expose mul_wide_u64, mul_wide, mul_wide_u128, solinas_reduce, and to_limbs for deferred-reduction patterns needed by jolt-hachi. Hand-optimized reduce paths for 3/4/5 limbs avoid generic loop overhead. Refactor mul_raw to reuse mul_wide + reduce_4 (zero overhead). Add 9 unit tests and widening/accumulator benchmarks. Made-with: Cursor * Clean up fp128: remove section banners, hoist std::ops imports, rename mul_wide free fn Rename free function mul_wide → mul64_wide to avoid shadowing Fp128::mul_wide. Move reduce_4 next to fold2_canonicalize. Replace fully qualified std::ops::{Add,Sub,Mul,Neg} with use imports. Made-with: Cursor * Constrain Fp32/Fp64 to pseudo-Mersenne primes with Solinas reduction Rework fp32.rs and fp64.rs to require p = 2^k - c (small c), matching fp128's design. Compile-time constants BITS/C/MASK derived from P with static assertions. Replace bit-serial reduction with two-fold Solinas reduction (reduce_product for hot path, loop-based reduce_u64/u128 for arbitrary inputs). Add widening ops (mul_wide, square, solinas_reduce). Fix FieldSampling to use direct modular reduction instead of rejection sampling. Blanket-impl PseudoMersenneField, remove manual impls. Rename const generic MODULUS -> P at all call sites. Add latency + throughput benchmarks. Hoist mid-function imports in tests/algebra.rs. Made-with: Cursor * Specialize Fp64 sub-word primes to u64-only arithmetic For BITS < 64 (e.g. 2^40-195), avoid u128 intermediates in reduce_product, add_raw, and sub_raw. Use mul_c_narrow which splits C*high into u32x32->u64 widening multiplies (umaddl on AArch64), preventing LLVM from promoting to u128. Brings 40-bit mul throughput within 4% of 64-bit (690 vs 716 Melem/s), up from ~20% gap. Made-with: Cursor * Add 2^30 and 2^31 pseudo-Mersenne primes and expand benchmarks Add Pow2Offset30Field (2^30-35) and Pow2Offset31Field (2^31-19) prime definitions and type aliases. Refactor fp32/fp64 latency benchmarks with chain_bench! macro, add throughput benchmarks for all new primes. Made-with: Cursor * Add NEON packed backends for Fp32 (4-wide) and Fp64 (2-wide) PackedFp32Neon: 4 lanes in uint32x4_t with full NEON Solinas reduction for mul (vmull_u32 + 2-fold reduce), umin trick for add/sub (BITS<=31), overflow-aware paths for BITS==32. C_SHIFT_KIND optimization for C=2^a+/-1. PackedFp64Neon: 2 lanes in uint64x2_t with NEON add/sub (conditional P for BITS<=62, carry-aware for BITS>=63), scalar-per-lane mul (no native 64x64->128 on NEON). Fp32 packed achieves 2.4-3.5x mul throughput and 3.5-5.0x add/sub throughput over scalar. Includes HasPacking impls, type aliases, NoPacking fallbacks, 7 correctness tests, and throughput benchmarks. Made-with: Cursor * Optimize packed Fp32/Fp64 Solinas multiply hot paths on NEON For packed Fp32, remove the shift/add C-special-case in the Solinas fold and always use vmull_u32 with a hoisted C broadcast, which improves stability and removes the 24-bit mul regression. For packed Fp64, replace per-lane Fp64 wrapper multiplication with packed-local per-lane 64x64->128 products plus specialized Solinas reduction (including the sub-word u64 fold path), reducing mul overhead for both 40-bit and 64-bit packed variants. Made-with: Cursor * Tune packed Fp64 mul folding and add reducer/codegen probes Switch packed Fp64 sub-word fold multiplication to direct `C*x`, which improves packed mul throughput in repeated A/B runs. Add dedicated reducer and codegen probe benches so we can compare 40-bit and 64-bit fold paths with instruction-level visibility. Made-with: Cursor * Optimize x86 BMI2 multiply paths for fp64/fp128 fields Use BMI2 widening multiplies in scalar field hot paths and specialize x86 sub-word fold multiplication to a single 64-bit multiply, improving 40-bit fp64 throughput while keeping 64-bit and 128-bit paths stable. Made-with: Cursor * Optimize fp128 wide-limb multiply path for Jolt integration Raise Hachi MSRV to 1.88, add specialized Fp128 mul_wide_limbs kernels for M={3,4} and OUT={4,5,6}, and add field_arith benches that track mul_wide_limbs-only and roundtrip costs to catch regressions. Made-with: Cursor * Specialize Fp128 CanonicalField small-int constructors Make from_u64 use a direct canonical limb construction (no reduction path), fix from_i64 to use unsigned_abs to avoid i64::MIN overflow, and add a regression test for the min-value case. Made-with: Cursor * Impl sumchecks for hachi * Add optimized one-hot commitment path for regular sparse witnesses Exploits the structure of one-hot vectors (T chunks of K field elements, each chunk with exactly one 1) to eliminate all inner ring multiplications. Gadget decomposition of {0,1} coefficients is trivial (only level-0 digit is nonzero), and the inner Ajtai t = A*s reduces to summing selected columns of A with O(D) negacyclic rotations instead of O(D^2) ring muls. Handles both K >= D and D >= K as long as one divides the other: - K >= D: each nonzero ring element is a monomial X^j (single rotation) - D >= K: each ring element is a sum of D/K monomials (multiple rotations) Total inner cost: N_A * T * D coefficient additions (zero multiplications), vs N_A * 2^M * delta * D^2 coefficient multiplications in the dense path. Made-with: Cursor * Apply rustfmt formatting to fp128 and field_arith bench Made-with: Cursor * Inject sumchecks to Hachi prover * Add commitment to w to transcript * Add AVX2 and AVX-512 packed field backends for Fp32, Fp64, Fp128 Implement vectorized SIMD arithmetic for x86_64: - AVX2: 8-wide Fp32, 4-wide Fp64, 2-wide Fp128 (scalar delegation) - AVX-512: 16-wide Fp32, 8-wide Fp64, 4-wide Fp128 (scalar delegation) Fp32 uses even/odd lane split with 2-fold Solinas reduction. Fp64 uses vectorized 64×64→128 schoolbook multiply (adapted from plonky3 Goldilocks) with custom Solinas reduction for pseudo-Mersenne primes p = 2^k - c. Also: extract NEON backend into packed_neon.rs, add cfg-gated module selection (AVX-512 > AVX2 > NEON > NoPacking), enable nightly stdarch_x86_avx512 feature, add sumcheck-mix benchmark, and fix minor clippy lints in fp64/fp128. Made-with: Cursor * Vectorize Fp128 packed add/sub on AVX-512 (8-wide) and AVX2 (4-wide) Convert Fp128 packed backends from scalar delegation (AoS) to SoA layout with vectorized add/sub via __m512i / __m256i. Mul remains scalar per-lane. Add FIELD_OPS_PERF.md with Zen 5 benchmark results. Fp128 packed add: +114% (1.08 → 2.31 Gelem/s on Zen 5 AVX-512) Fp128 packed sub: +137% (1.34 → 3.18 Gelem/s) Made-with: Cursor * Add M4 Pro NEON benchmarks, remove mul_add experiment Populate FIELD_OPS_PERF.md with Apple M4 Pro (NEON) results for all primes across scalar, packed, and sumcheck MACC workloads. Remove the experimental mul_add trait method (vectorized add already optimal after inlining; scalar fused approach was 16% slower). Made-with: Cursor * Change sumcheck API * Separate ring switch logic * Rename sumchecks to NormSumcheck and RelationSumcheck * Remove iteration prover * Eliminate O(D^2) schoolbook ring multiplication from protocol hot paths At production parameters (D=256/1024), schoolbook CyclotomicRing multiplication is catastrophically expensive. Every protocol hot path has exploitable operand structure that avoids the full D^2 cost: - Add CyclotomicRing::mul_by_sparse for O(omega*D) sparse challenge multiplication (90-140x speedup in compute_z_hat) - Change RingOpeningPoint to store Vec<F> scalars; use scale() instead of ring mul in compute_w_hat (256-1024x speedup) - Add kron_scalars, kron_row_scale, kron_sparse_scale; refactor generate_m to use scalar-aware Kronecker products - Add zero-skip and scalar-detect in compute_r_via_poly_division - Add sample_sparse_challenges, store Vec<SparseChallenge> in QuadraticEquation throughout prover and verifier paths Made-with: Cursor * lint: section banner removal, naming hoist, cfg(test) for test-only paths - Remove section banner comments (----, =====) repo-wide in src, tests, benches - commitment_scheme: hoist RingCommitment, RingOpeningPoint, transcript labels to top-level use; add #[cfg(test)] use for rederive_alpha_and_m_a body (Blake2bTranscript, eval_ring_matrix_at, expand_m_a, labels) so that function uses short names without polluting lib build - Leave mod tests imports in place (no hoisting of test-module use blocks) Made-with: Cursor * Fix CI issues --------- Co-authored-by: Quang Dao <quang.dao@layerzerolabs.org> Co-authored-by: Cursor <cursoragent@cursor.com>

quangvdao and others added 30 commits February 12, 2026 23:03

chore: add toolchain and formatting config

4463dc6

Pin Rust 1.88 with minimal profile (cargo, rustc, clippy, rustfmt). Co-authored-by: Cursor <cursoragent@cursor.com>

chore(ci): switch to actions-rust-lang/setup-rust-toolchain

f5cae90

Respects rust-toolchain.toml automatically. Also normalize clippy flags to use --all --all-targets consistently. Co-authored-by: Cursor <cursoragent@cursor.com>

feat(primitives): add u128/i128 serialization support

0ba0422

Required by the Fp128 field backend. Co-authored-by: Cursor <cursoragent@cursor.com>

docs: add and update progress tracking document

c02125a

Records Phase 0 status: all field types, extensions, NTT scaffolding, constant-time arithmetic, and 24-test suite. Reflects the fields/ntt/module/poly directory layout. Co-authored-by: Cursor <cursoragent@cursor.com>

test(algebra): cover backend parity and pow2-offset invariants

4edcc32

Expand algebra tests to validate default-vs-backend CRT+NTT equivalence, sampling bounds, and pow2-offset registry consistency under the new field and ring abstractions. Co-authored-by: Cursor <cursoragent@cursor.com>

docs(protocol): align transcript spec and progress status

0459e21

Document the protocol scaffold as in-progress, capture the commitment-focused transcript label vocabulary, and clarify deferred Jolt adapter expectations. Co-authored-by: Cursor <cursoragent@cursor.com>

test(algebra): clean inversion test naming and normalize formatting

7c90343

Rename the inversion helper test to match the new API naming and keep the ring commitment test formatting consistent after linting. Co-authored-by: Cursor <cursoragent@cursor.com>

quangvdao and others added 27 commits February 25, 2026 20:57

Optimize packed Fp128 mul throughput with array-backed SoA lanes.

39e4716

This keeps mul in true SoA form while removing repeated vector transmute overhead and inlining the limb-level Solinas lane kernel, improving packed mul throughput and latency against scalar baselines. Made-with: Cursor

Optimize x86 BMI2 multiply paths for fp64/fp128 fields

f2c3d3a

Use BMI2 widening multiplies in scalar field hot paths and specialize x86 sub-word fold multiplication to a single 64-bit multiply, improving 40-bit fp64 throughput while keeping 64-bit and 128-bit paths stable. Made-with: Cursor

Optimize fp128 wide-limb multiply path for Jolt integration

1533229

Raise Hachi MSRV to 1.88, add specialized Fp128 mul_wide_limbs kernels for M={3,4} and OUT={4,5,6}, and add field_arith benches that track mul_wide_limbs-only and roundtrip costs to catch regressions. Made-with: Cursor

Specialize Fp128 CanonicalField small-int constructors

80b5e52

Make from_u64 use a direct canonical limb construction (no reduction path), fix from_i64 to use unsigned_abs to avoid i64::MIN overflow, and add a regression test for the min-value case. Made-with: Cursor

Impl sumchecks for hachi

2f461f0

Apply rustfmt formatting to fp128 and field_arith bench

c6f984a

Made-with: Cursor

Inject sumchecks to Hachi prover

751ffc7

Add commitment to w to transcript

68da330

Change sumcheck API

3210b2b

Merge remote-tracking branch 'origin/dev' into feat/sumchecks

cfb72a6

Separate ring switch logic

0b372b1

Rename sumchecks to NormSumcheck and RelationSumcheck

94da83d

Remove iteration prover

4abb647

omibo changed the base branch from dev to main February 28, 2026 00:55

omibo closed this Feb 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement Hachi PCS protocol with required primitives#1

Implement Hachi PCS protocol with required primitives#1
omibo wants to merge 83 commits into
a16z:mainfrom
LayerZero-Labs:dev

omibo commented Feb 28, 2026 •

edited by quangvdao

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

omibo commented Feb 28, 2026 • edited by quangvdao Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Key Changes

Algebra layer (src/algebra/)

Protocol layer (src/protocol/)

HachiPolyOps trait (src/protocol/hachi_poly_ops.rs)

Flexible decomposition and dual basis mode

Dead code removal and cleanup

Bug fixes

Testing

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

omibo commented Feb 28, 2026 •

edited by quangvdao

Loading

Algebra layer (`src/algebra/`)

Protocol layer (`src/protocol/`)

HachiPolyOps trait (`src/protocol/hachi_poly_ops.rs`)