Metal: correctness-gate M5 Max 4096 prefill (+5%) #149

Open: fitchmultz wants to merge 2 commits into antirez:main from fitchmultz:m5-responses

fitchmultz commented May 14, 2026

Result

This PR rebases the Apple M5 Max 4096-prefill correctness/scheduling work onto current upstream main (613e9b2 at benchmark/rebase time). Non-M5 devices keep the existing 2048-token default.

Fresh paired comparison against current antirez/main on an Apple M5 Max 128GB machine, Metal backend, ds4flash.gguf:

| Benchmark | Current main | This PR | Result |
|---|---|---|---|
| 4096-step sweep, avg prefill | 284.42 t/s | 291.56 t/s | +2.51% |
| 4096-step sweep, avg generation | 28.72 t/s | 28.48 t/s | neutral / -0.86% |
| README 65k sweep, avg prefill | 256.99 t/s | 250.38 t/s | neutral / -2.57% |
| README 65k sweep, avg generation | 27.73 t/s | 27.56 t/s | neutral / -0.61% |

The safe claim for this PR is: it enables and correctness-gates the M5 Max 4096-token prefill path, with a small 4096-sweep prefill win in this fresh run and otherwise neutral throughput. The larger decode win is kept separate in follow-up #169.
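The prefill percentages in the table are relative changes against current main. As a quick sanity check, the two prefill rows can be reproduced from the reported averages with a small helper. `pct_change` is an illustrative name, not part of ds4-bench, and the inputs are the rounded figures from the table, so the last digit can differ from values computed on unrounded measurements.

```python
def pct_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline * 100.0


# Average prefill throughput in tokens/s, copied from the table above.
sweep_4096 = pct_change(284.42, 291.56)
sweep_65k = pct_change(256.99, 250.38)

print(f"4096-step sweep prefill: {sweep_4096:+.2f}%")
print(f"README 65k sweep prefill: {sweep_65k:+.2f}%")
```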

What changed

  • Adds Apple M5-gated Metal runtime fast paths:
    • simdgroup matrix matmul specialization
    • private Metal scratch buffers for GPU-only hot intermediates, keeping hazard tracking enabled
  • Makes 4096-token prefill chunks the default only on Apple M5 Max Metal.
  • Keeps other devices/backends on the existing 2048-token default.
  • Makes the 4096-token path correctness-safe by splitting the zero-prefix first chunk at the existing 2048-token correctness boundary. This avoids selecting compressed top-k rows from future causal positions.
  • Aligns server KV disk-cache boundaries to the backend prefill chunk:
    • M5 Max Metal: 4096
    • other devices/backends: 2048
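The chunking rule described above can be sketched as follows. This is not the PR's code: `plan_prefill_chunks`, `split_span`, and `cache_boundary` are hypothetical names, and the sketch assumes that (a) only the first chunk window of a prompt with no cached prefix is split at the 2048-token boundary, and (b) "aligning" a disk-cache boundary means rounding a token count down to a multiple of the prefill chunk. The actual implementation may differ on both points.

```python
def split_span(start, end, boundary):
    """Split [start, end) at boundary if the boundary falls strictly inside it."""
    if start < boundary < end:
        return [(start, boundary), (boundary, end)]
    return [(start, end)]


def plan_prefill_chunks(n_tokens, prefix_len=0, chunk=4096, safe_boundary=2048):
    """Return (start, end) token spans covering [prefix_len, n_tokens).

    Assumption: only the first window of a zero-prefix prompt is split,
    so no chunk starting at position 0 extends past safe_boundary.
    """
    if not 0 <= prefix_len <= n_tokens:
        raise ValueError("prefix_len must be within [0, n_tokens]")
    spans = []
    pos = prefix_len
    first = True
    while pos < n_tokens:
        end = min(pos + chunk, n_tokens)
        if first and prefix_len == 0:
            spans.extend(split_span(pos, end, safe_boundary))
        else:
            spans.append((pos, end))
        pos = end
        first = False
    return spans


def cache_boundary(n_tokens, chunk):
    """Assumed alignment rule: round down to a multiple of the prefill chunk."""
    return (n_tokens // chunk) * chunk


print(plan_prefill_chunks(10000))              # M5 Max default, no cached prefix
print(plan_prefill_chunks(10000, chunk=2048))  # other devices/backends
print(cache_boundary(7000, 4096), cache_boundary(7000, 2048))
```

Under these assumptions a 10,000-token prompt on the 4096 path is processed as 2048 + 2048 + 4096 + 1808 tokens, while the 2048 path is unchanged because the boundary never falls strictly inside a chunk.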

DS4_METAL_PREFILL_CHUNK=2048 forces the previous M5 Max chunk size. Values above 4096 still require DS4_METAL_ALLOW_UNSAFE_PREFILL_CHUNK=1 on the M5 Max default path.
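As a rough model of how these two variables interact on the M5 Max default path, consider the sketch below. The function name is hypothetical, and raising an error for an ungated value above 4096 is an assumption: the description only says the unsafe flag is required, not what happens without it. Behavior on other devices is not modeled.

```python
M5_MAX_DEFAULT_CHUNK = 4096


def resolve_m5_max_prefill_chunk(env):
    """Pick the prefill chunk size on the M5 Max default path (illustrative only)."""
    raw = env.get("DS4_METAL_PREFILL_CHUNK")
    if raw is None:
        return M5_MAX_DEFAULT_CHUNK
    chunk = int(raw)
    if chunk <= 0:
        raise ValueError("prefill chunk must be positive")
    allow_unsafe = env.get("DS4_METAL_ALLOW_UNSAFE_PREFILL_CHUNK") == "1"
    if chunk > M5_MAX_DEFAULT_CHUNK and not allow_unsafe:
        # Assumed failure mode; the real runtime may clamp or warn instead.
        raise ValueError(
            "chunks above 4096 need DS4_METAL_ALLOW_UNSAFE_PREFILL_CHUNK=1"
        )
    return chunk


print(resolve_m5_max_prefill_chunk({}))
print(resolve_m5_max_prefill_chunk({"DS4_METAL_PREFILL_CHUNK": "2048"}))
```

In practice `env` would be `os.environ`; a plain dictionary keeps the sketch easy to exercise.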

Correctness

Passed locally on Apple M5 Max, 128GB RAM, Metal backend, ds4flash.gguf, after rebasing to current upstream main:

make clean && make
make test
DS4_METAL_PREFILL_CHUNK=4096 ./ds4_test --logprob-vectors

make test covers:

  • --long-context
  • --tool-call-quality
  • --logprob-vectors
  • --metal-kernels
  • --server

Eval parity check

Deterministic 12-question ds4-eval slice against current upstream main produced identical grading decisions and token counts on both branches:

./ds4-eval \
  -m ds4flash.gguf \
  --plain \
  --questions 12 \
  --tokens 2048 \
  --temp 0 \
  --seed 1

| Branch | Result | Total generated tokens | Total tokens |
|---|---|---|---|
| antirez/main 613e9b2 | 10/12 | 15,884 | 18,446 |
| this PR b3d2665 | 10/12 | 15,884 | 18,446 |

The same two cases failed on both branches with the same extracted answers, so this eval slice shows no quality regression.

Benchmark commands

4096-step sweep:

./ds4-bench \
  -m ds4flash.gguf \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 4096 --ctx-max 32768 --step-incr 4096 \
  --gen-tokens 64

README-shaped 65k sweep:

./ds4-bench \
  -m ds4flash.gguf \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 --ctx-max 65536 --step-incr 2048 \
  --gen-tokens 128

Memory

The 4096 default increases Metal context-buffer allocation, but the increase stays modest on the tested M5 Max-class machine.

From ds4-bench context buffer reporting at the README 65k allocation:

| Chunk | Context buffers |
|---|---|
| 2048 | 1311.89 MiB |
| 4096 | 1740.42 MiB |

That is an increase of 428.53 MiB, or about 0.42 GiB. Other devices keep the old 2048 default.
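The memory delta follows directly from the two reported buffer sizes. A minimal check, using only the figures above (the variable names are illustrative):

```python
MIB_PER_GIB = 1024

# Context buffer sizes reported by ds4-bench at the README 65k allocation.
buffers_2048_mib = 1311.89
buffers_4096_mib = 1740.42

delta_mib = buffers_4096_mib - buffers_2048_mib
delta_gib = delta_mib / MIB_PER_GIB

print(f"extra context buffers: {delta_mib:.2f} MiB ({delta_gib:.2f} GiB)")
```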

Scope notes

This PR is intentionally limited to runtime Metal changes and the M5 Max prefill default. The decode-indexer speedup is kept in #169.

fitchmultz changed the title from "Metal: add M5 Max fast paths and 4096 prefill default" to "Metal: speed up M5 Max prefill with correctness-gated 4096 chunks" on May 14, 2026.
fitchmultz changed the title from "Metal: speed up M5 Max prefill with correctness-gated 4096 chunks" to "Metal: correctness-gate M5 Max 4096 prefill (+5%)" on May 14, 2026.
fitchmultz (Author) commented:

Opened a stacked follow-up for the M5 Max decode-indexer tuning: #169. It depends on this PR and includes a clean stacked diff plus local correctness/benchmark evidence (+15–18% generation t/s, prefill neutral). I kept this PR unchanged.

fitchmultz (Author) commented:

Rebased this PR onto current upstream main (613e9b2 at benchmark/rebase time), force-pushed m5-responses, and updated the body with fresh paired benchmarks plus deterministic ds4-eval parity. Fresh result vs current main: 4096 prefill 284.42 → 291.56 t/s (+2.51%), 65k prefill effectively neutral/noisy, generation neutral. Eval slice is unchanged: 10/12 on both branches, same token counts and same two failures. The larger decode throughput win remains isolated in #169.
