Speed up initSecret on the seeded long-input path #28

Open
ajvengo wants to merge 1 commit into zeebo:master from ajvengo:initsecret-speedup

ajvengo commented Jun 1, 2026

What

initSecret now derives the 192-byte custom secret by advancing the byte
offset directly (i += 16) instead of computing it from a 16-byte group
index (16*i). This drops the per-iteration multiply; the output is
byte-identical.
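
To make the equivalence concrete, here is a minimal standalone sketch of the two loop shapes. It is not the library's code: baseSecret, deriveIndexed, deriveStrided and sameOutput are hypothetical names, the base secret bytes are placeholders, and the add-seed/subtract-seed step per 16-byte group is assumed to follow the usual XXH3 custom-secret scheme.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

const secretSize = 192 // custom secret size, as stated in the description

// baseSecret returns placeholder bytes standing in for the library's
// default secret (assumption: the real bytes are not reproduced here).
func baseSecret() []byte {
	s := make([]byte, secretSize)
	for i := range s {
		s[i] = byte(i*31 + 7)
	}
	return s
}

// deriveIndexed iterates over 16-byte groups and multiplies the group
// index by 16 to get each byte offset.
func deriveIndexed(dst, src []byte, seed uint64) {
	for i := 0; i < secretSize/16; i++ {
		lo := binary.LittleEndian.Uint64(src[16*i:]) + seed
		hi := binary.LittleEndian.Uint64(src[16*i+8:]) - seed
		binary.LittleEndian.PutUint64(dst[16*i:], lo)
		binary.LittleEndian.PutUint64(dst[16*i+8:], hi)
	}
}

// deriveStrided advances the byte offset itself by 16 each iteration,
// so no multiply is needed inside the loop.
func deriveStrided(dst, src []byte, seed uint64) {
	for i := 0; i < secretSize; i += 16 {
		lo := binary.LittleEndian.Uint64(src[i:]) + seed
		hi := binary.LittleEndian.Uint64(src[i+8:]) - seed
		binary.LittleEndian.PutUint64(dst[i:], lo)
		binary.LittleEndian.PutUint64(dst[i+8:], hi)
	}
}

// sameOutput reports whether both loop shapes produce the same secret.
func sameOutput(seed uint64) bool {
	src := baseSecret()
	a := make([]byte, secretSize)
	b := make([]byte, secretSize)
	deriveIndexed(a, src, seed)
	deriveStrided(b, src, seed)
	return bytes.Equal(a, b)
}

func main() {
	for _, seed := range []uint64{0, 1, 0x9E3779B97F4A7C15, ^uint64(0)} {
		fmt.Printf("seed %#x: identical=%v\n", seed, sameOutput(seed))
	}
}
```

Both loops visit the same twelve offsets (0, 16, ..., 176), so the derived bytes cannot differ; only the way the offset is computed changes.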

Why

This derivation runs on every HashSeed/HashStringSeed call for
inputs longer than 240 bytes (the >240 branch in hashAnySeed /
hashAny128Seed), and once per New/NewSeed/ResetSeed. For seeded
inputs in the ~241–1024 B range it's a noticeable fixed cost on top of
the hash itself, so trimming it helps that band directly.

Benchmarks

Apple M4 Max (arm64/NEON), old vs new compiled as separate binaries and
run interleaved (old, new, old, new… ×12) so run-to-run drift cancels;
compared with benchstat:

benchmark            master     this PR    Δ        p
Fixed64/241/seed     16.13 ns   15.68 ns   -2.82%   0.000
Fixed64/512/seed     20.91 ns   19.90 ns   -4.85%   0.000
Fixed128/241/seed    17.70 ns   17.49 ns   -1.19%   0.000
Fixed128/512/seed    22.75 ns   21.95 ns   -3.52%   0.000

The unseeded paths don't call initSecret; their control benchmarks
(.../default) show no change (all p > 0.05), as expected. The gain
shrinks for very large inputs as the fixed derivation cost amortizes away.
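
The figures above measure the full seeded hash. To look at the derivation loop in isolation, a micro-benchmark along the following lines could be used. This is only a sketch under stated assumptions: the two derive functions are hypothetical stand-ins for the old and new loop shapes (not the library's code), the source secret is zero-filled placeholder data, and testing.Benchmark is called from a plain program rather than through go test. No expected timings are shown because they depend on the compiler and CPU; a compiler may already strength-reduce the multiply in a simple loop like this.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"testing"
)

const secretSize = 192

// sink keeps the derived bytes observable so the loops are not optimized away.
var sink byte

// deriveIndexed: old shape, offset computed as 16*i (hypothetical stand-in).
func deriveIndexed(dst, src []byte, seed uint64) {
	for i := 0; i < secretSize/16; i++ {
		binary.LittleEndian.PutUint64(dst[16*i:], binary.LittleEndian.Uint64(src[16*i:])+seed)
		binary.LittleEndian.PutUint64(dst[16*i+8:], binary.LittleEndian.Uint64(src[16*i+8:])-seed)
	}
}

// deriveStrided: new shape, offset advanced by 16 (hypothetical stand-in).
func deriveStrided(dst, src []byte, seed uint64) {
	for i := 0; i < secretSize; i += 16 {
		binary.LittleEndian.PutUint64(dst[i:], binary.LittleEndian.Uint64(src[i:])+seed)
		binary.LittleEndian.PutUint64(dst[i+8:], binary.LittleEndian.Uint64(src[i+8:])-seed)
	}
}

// benchDerive times one derivation function over b.N iterations.
func benchDerive(fn func(dst, src []byte, seed uint64)) testing.BenchmarkResult {
	src := make([]byte, secretSize) // placeholder base secret
	dst := make([]byte, secretSize)
	return testing.Benchmark(func(b *testing.B) {
		for n := 0; n < b.N; n++ {
			fn(dst, src, uint64(n))
			sink ^= dst[0]
		}
	})
}

func main() {
	fmt.Println("indexed:", benchDerive(deriveIndexed))
	fmt.Println("strided:", benchDerive(deriveStrided))
}
```

A single run of this program is subject to the same run-to-run drift the interleaved methodology above is designed to cancel, so it is better suited to a quick sanity check than to producing comparable numbers.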

Correctness

go test ./... passes, including TestVectorCompat and the seeded golden
vectors. go vet (amd64/arm64/386-softfloat) and gofmt are clean.

Stride the secret offset directly (i += 16) instead of indexing
groups (16*i), dropping the per-iteration multiply in the 192-byte
secret derivation.

This derivation runs on every HashSeed/HashStringSeed call for inputs
longer than 240 bytes, and once per New/NewSeed/ResetSeed. Output is
unchanged; TestVectorCompat and the seed vectors still pass.

Measured on an Apple M4 Max (NEON), interleaved old/new runs, n=12:

  Fixed64/241/seed    16.13ns -> 15.68ns  -2.82%
  Fixed64/512/seed    20.91ns -> 19.90ns  -4.85%
  Fixed128/241/seed   17.70ns -> 17.49ns  -1.19%
  Fixed128/512/seed   22.75ns -> 21.95ns  -3.52%

Unseeded paths are untouched and show no change (p > 0.05).