Tags: star-ga/mind
fix(mind-blas): explicit alignment on all vector llvm.loads - hardening (v0.6.7)
Every llvm.load -> vector<...> the std-surface lowering emits now
carries {alignment = 4 : i64}: emit_vec_dot_f32, emit_vec_dot_q16,
emit_vec_dot_l1_q16, emit_vec_dot_metric_f32 (f32 L1/L-inf), and the
Instr::VecLoad / Instr::VecLoadI32 primitive lowerings — matching the
inc-3b matmul fix. Without the attribute, MLIR assumes the vector type's
natural 32B alignment -> x86 vmovaps -> GP fault on any pointer that is
not 32B-aligned (e.g. an interior pointer). Today these kernels are only
called on allocation-base, over-aligned pointers, so they have not
faulted; the #230 mind-nerve encode rewire will pass interior catalog-row
pointers to dot_q16_v and would hit the identical fault. This change is
pre-emptive hardening.
Contract-neutral: alignment changes the emitted move (vmovaps->vmovups)
never the loaded bytes. Verified (verify-don't-trust): cross-arch
bit-identity gate #57 still EXACT for dot_q16_v + dot_l1_q16_v
(blas_vec_q16_smoke 6/6 incl byte-identity tests), dot_f32_v
below-one-lane byte-identity + 1e-4 intact (blas_vec_smoke 3/3);
default release mindc byte-identical base vs change (bench-gate 0.0%);
bootstrap fixed-point IR byte-identical (next_id 206); cargo fmt clean;
only pre-existing project/mod.rs:303 clippy advisory. All std-surface
gated. RFC 0006 section 9.3b-follow-up.
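
The "contract-neutral" claim above can be illustrated outside the compiler. The following is a minimal Rust sketch, not the mindc lowering: `V8` and `load8_unaligned` are hypothetical names standing in for a `vector<8xf32>` value and an alignment-4 load. It shows that an 8-lane load taken from an interior pointer with no 32-byte alignment assumption yields exactly the bytes that ordinary element-wise indexing yields.

```rust
use std::ptr;

/// Hypothetical stand-in for vector<8xf32>: 8 lanes, natural alignment 32 bytes.
#[repr(C, align(32))]
#[derive(Clone, Copy)]
struct V8([f32; 8]);

/// Load 8 consecutive f32 lanes starting at element `start`, assuming
/// nothing beyond the 4-byte alignment of f32 itself.
fn load8_unaligned(data: &[f32], start: usize) -> [f32; 8] {
    assert!(start + 8 <= data.len(), "load would run past the slice");
    // SAFETY: the bounds check guarantees 32 readable bytes at this address;
    // read_unaligned does not require the pointer to be 32-byte aligned.
    let v = unsafe { ptr::read_unaligned(data.as_ptr().add(start) as *const V8) };
    v.0
}

fn main() {
    let data: Vec<f32> = (0..32).map(|i| i as f32).collect();
    for start in [0usize, 1, 3, 17] {
        // Reference: the same 8 values fetched one element at a time.
        let mut expected = [0.0f32; 8];
        expected.copy_from_slice(&data[start..start + 8]);
        assert_eq!(load8_unaligned(&data, start), expected);
    }
    println!("unaligned 8-lane loads match element-wise reads");
}
```

Reading the same address through a plain (aligned) read of `V8` would be undefined behaviour in Rust whenever the address is not a multiple of 32, which is the analogue of the vmovaps fault described above.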
feat(mind-blas): Track B increment 3b - native vectorised matmul_rmajor_f32_v (v0.6.6)
Outer scf.for over rows (no iter_args, stores y[r] directly), each row
inlining the proven increment-1 dot_f32 8-lane vector.fma reduction +
scalar tail; returns 0 like the Track A C oracle. Each row equals
dot_f32(W + r*cols, x, cols); same 1e-4 f64-oracle contract as
dot_f32_v. Verified within 1e-4 at (1,1)(1,8)(2,8)(3,8)(1,9)(1,17)
(2,17)(5,17)(33,1025)(128,384).
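
As a rough illustration of the shape described above (outer loop over rows, each row an 8-lane fused-multiply-add reduction plus scalar tail, return value 0), here is a Rust sketch. It is not the emitted MLIR. The function signatures, the test data, and the exact form of the tolerance (relative 1e-4 with a floor of 1.0 on the oracle magnitude) are assumptions made for the example.

```rust
/// Dot product with 8 independent lane accumulators, then a scalar tail.
fn dot_f32(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let body = a.len() / 8 * 8; // largest multiple of 8 not exceeding len
    let mut lanes = [0.0f32; 8];
    for i in (0..body).step_by(8) {
        for l in 0..8 {
            lanes[l] = a[i + l].mul_add(b[i + l], lanes[l]);
        }
    }
    let mut sum: f32 = lanes.iter().sum();
    for i in body..a.len() {
        sum = a[i].mul_add(b[i], sum); // scalar tail
    }
    sum
}

/// y[r] = dot_f32(W + r*cols, x, cols) for each row; returns 0.
fn matmul_rmajor_f32(w: &[f32], x: &[f32], y: &mut [f32], rows: usize, cols: usize) -> i32 {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    assert_eq!(y.len(), rows);
    for r in 0..rows {
        y[r] = dot_f32(&w[r * cols..(r + 1) * cols], x);
    }
    0
}

/// Compare every row against an f64 oracle (assumed tolerance form).
fn check_shape(rows: usize, cols: usize) -> bool {
    let w: Vec<f32> = (0..rows * cols).map(|i| ((i % 7) as f32 - 3.0) * 0.25).collect();
    let x: Vec<f32> = (0..cols).map(|j| ((j % 5) as f32 + 1.0) * 0.5).collect();
    let mut y = vec![0.0f32; rows];
    if matmul_rmajor_f32(&w, &x, &mut y, rows, cols) != 0 {
        return false;
    }
    (0..rows).all(|r| {
        let oracle: f64 = (0..cols).map(|c| w[r * cols + c] as f64 * x[c] as f64).sum();
        (y[r] as f64 - oracle).abs() <= 1e-4 * oracle.abs().max(1.0)
    })
}

fn main() {
    // Shapes taken from the verification list above.
    let shapes = [(1, 1), (1, 8), (2, 8), (3, 8), (1, 9), (1, 17), (2, 17), (5, 17), (33, 1025), (128, 384)];
    for (rows, cols) in shapes {
        assert!(check_shape(rows, cols), "shape ({rows},{cols}) out of tolerance");
    }
    println!("all shapes within tolerance");
}
```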
Root cause of an earlier SIGSEGV (rows>=2 + non-empty tail, e.g. (2,17)
crashed, (1,17) passed): vector<8xf32> llvm.load with no alignment
attribute defaults to natural 32-byte alignment -> x86 vmovaps
(alignment-required); row-base pointers W + r*cols*4 are only f32-aligned
so row>=1 with cols not a multiple of 8 mis-aligns -> GP fault. NOT a
nested-scf.for lowering defect (that pattern is valid MLIR, independently
confirmed). Fix: emit {alignment = 4 : i64} on the vector loads ->
vmovups (unaligned), correct for every row.
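
The alignment arithmetic behind that root cause can be checked directly. The sketch below assumes the base of W is itself 32-byte aligned (the allocation-base case) and uses a hypothetical helper name. It computes the byte offset of each row base modulo 32. Within a row, each 8-lane step advances by exactly 32 bytes, so every vector load in a row shares the row base's misalignment.

```rust
const VEC_BYTES: usize = 32; // natural alignment of vector<8xf32>
const F32_BYTES: usize = 4;

/// Byte misalignment of the row base W + row*cols*4 relative to a
/// 32-byte boundary, assuming W itself is 32-byte aligned.
fn row_base_misalignment(row: usize, cols: usize) -> usize {
    (row * cols * F32_BYTES) % VEC_BYTES
}

fn main() {
    // (1,17): only row 0 exists, which sits on the aligned base.
    assert_eq!(row_base_misalignment(0, 17), 0);
    // (2,17): row 1 starts 68 bytes in, 4 bytes past a 32-byte boundary.
    assert_eq!(row_base_misalignment(1, 17), 4);
    // cols a multiple of 8: every row base stays 32-byte aligned.
    for row in 0..16 {
        assert_eq!(row_base_misalignment(row, 8), 0);
        assert_eq!(row_base_misalignment(row, 384), 0);
    }
    println!("row 1 of a 17-column matrix is off by {} bytes", row_base_misalignment(1, 17));
}
```

This matches the observed pattern: (1,17) passes because its only row is the aligned base, while (2,17) faults on row 1.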
Verified (verify-don't-trust, re-run on main): blas_vec_q16_smoke 6/6
incl vec_matmul_rmajor_f32_within_1e4_rel_of_f64_oracle; default release
mindc byte-identical base vs change (bench-gate 0.0%); bootstrap
fixed-point IR byte-identical (next_id 206); criterion compiler bench
unchanged (same binary); cargo fmt clean; only the pre-existing
project/mod.rs:303 clippy advisory. All inc3b code is cfg std-surface
gated. This is the direct latency lever for the mind-nerve native-encode
GEMMs (A1.5 residual / task #230).
Deferred to a later increment (RFC 0006 section 9.3c): @target per-call
substrate annotation; cross-module use std.blas inlining; defensive
{alignment=4} on the other dot_*_v kernels (task #232, not a regression
in current use).
feat(mind-blas): Track B increment 3a - native Q16.16 L1 vector path (v0.6.5)
Adds dot_l1_q16_v: the compiler emits a native MLIR vector-dialect
Q16.16 L1 (Manhattan, sum of abs-diff) reduction — vector<8xi64> widen,
signed-subtract, arith-only absolute value via maxsi(d, 0 - d) mirroring
the Track A C oracle if (d<0) d=-d, i64-lane accumulate, associative
vector.reduction add, scalar tail, trunci/extsi pack. No new IR variants
(emits MLIR text directly, like dot_q16_v), additive-only envelope holds;
no C shim, no clang, no -fPIC.
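
To make the reduction above concrete, here is a Rust sketch of a scalar L1 loop next to an 8-lane variant using the max(d, 0 - d) absolute value. It is not the emitted MLIR or the Track A C source. The input type (Q16.16 values stored as i32), the i64 return type, the reading of "trunci/extsi pack" as truncate-to-i32-then-sign-extend, and all function names are assumptions. Integer addition is associative, so the lane-wise sum and the in-order scalar sum agree exactly at every length, which is why the contract can be byte-identity rather than a tolerance.

```rust
/// Assumed packing step: truncate to i32, then sign-extend back to i64.
fn pack(acc: i64) -> i64 {
    acc as i32 as i64
}

/// Scalar reference in the style of `if (d < 0) d = -d`.
fn dot_l1_q16_scalar(a: &[i32], b: &[i32]) -> i64 {
    assert_eq!(a.len(), b.len());
    let mut acc: i64 = 0;
    for (&x, &y) in a.iter().zip(b) {
        let mut d = x as i64 - y as i64; // widened, cannot overflow
        if d < 0 {
            d = -d;
        }
        acc = acc.wrapping_add(d);
    }
    pack(acc)
}

/// 8-lane variant: widen, subtract, abs via max(d, 0 - d), lane accumulate,
/// horizontal add, then a scalar tail.
fn dot_l1_q16_lanes(a: &[i32], b: &[i32]) -> i64 {
    assert_eq!(a.len(), b.len());
    let body = a.len() / 8 * 8;
    let mut lanes = [0i64; 8];
    for i in (0..body).step_by(8) {
        for l in 0..8 {
            let d = a[i + l] as i64 - b[i + l] as i64;
            lanes[l] = lanes[l].wrapping_add(d.max(0 - d));
        }
    }
    let mut acc = lanes.iter().fold(0i64, |s, &v| s.wrapping_add(v));
    for i in body..a.len() {
        let d = a[i] as i64 - b[i] as i64;
        acc = acc.wrapping_add(d.max(0 - d));
    }
    pack(acc)
}

/// Deterministic pseudo-random test data (multiplicative hash, illustrative only).
fn sample(n: usize, seed: u32) -> Vec<i32> {
    (0..n as u32).map(|i| i.wrapping_add(seed).wrapping_mul(2654435761) as i32).collect()
}

fn main() {
    // Lengths taken from the RFC list quoted below.
    for n in [0usize, 1, 2, 7, 8, 9, 15, 16, 17, 31, 32, 33, 1024, 4096, 65537] {
        let (a, b) = (sample(n, 1), sample(n, 7919));
        assert_eq!(dot_l1_q16_lanes(&a, &b), dot_l1_q16_scalar(&a, &b), "length {n}");
    }
    println!("lane-wise and scalar L1 agree at every length");
}
```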
Byte-identical to the Track A scalar oracle __mind_blas_dot_l1_q16 at
every RFC length {0,1,2,7,8,9,15,16,17,31,32,33,1024,4096,65537} — exact,
not a tolerance. Closes the Q16.16 vector-path metric parity deferred in
increment 2: cross-arch bit-identity gate (task #57) now holds for both
the vector dot (inc 2) and vector L1 (inc 3a).
Verified (verify-don't-trust): blas_vec_q16_smoke 5/5 including the new
vec_dot_l1_q16_byte_identical_to_scalar_oracle_all_lengths; default
release mindc byte-identical base vs inc3 (reproducible release build);
bootstrap fixed-point IR byte-identical base vs inc3 (next_id 206); cargo
fmt clean. All inc3a code is cfg std-surface gated so the default binary
and bootstrap are unchanged: bench-gate 0.0%.
Increment 3b honestly deferred in RFC 0006 section 9.3b: @target per-call
substrate annotation, vectorised matmul_rmajor_f32, cross-module use
std.blas inlining.
mind-blas Track B increment 2: native MLIR vector-dialect Q16.16 dot (cross-arch bit-identity gate #57 closed for the vector path), VecStore, f32 L1/L-inf vector reductions. Bench-gate +7% cap held (byte-identical default binary); v0.6.1 bootstrap fixed-point unchanged (10889/206).