Metal: correctness-gate M5 Max 4096 prefill (+5%) by fitchmultz · Pull Request #149 · antirez/ds4

fitchmultz · 2026-05-14T21:54:35Z

Result

This PR rebases the Apple M5 Max 4096-prefill correctness/scheduling work onto current upstream main (613e9b2 at benchmark/rebase time). Non-M5 devices keep the existing 2048-token default.

Fresh paired comparison against current antirez/main on an Apple M5 Max 128GB machine, Metal backend, ds4flash.gguf:

benchmark	current main	this PR	result
4096-step sweep, avg prefill	284.42 t/s	291.56 t/s	+2.51%
4096-step sweep, avg generation	28.72 t/s	28.48 t/s	neutral / -0.86%
README 65k sweep, avg prefill	256.99 t/s	250.38 t/s	neutral / -2.57%
README 65k sweep, avg generation	27.73 t/s	27.56 t/s	neutral / -0.61%

The safe claim for this PR is: it enables and correctness-gates the M5 Max 4096-token prefill path, with a small 4096-sweep prefill win in this fresh run and otherwise neutral throughput. The larger decode win is kept separate in follow-up #169.
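The percentages in the result column can be rechecked from the rounded averages in the table. The sketch below is only that arithmetic; the dictionary keys and the `pct_change` helper are illustrative names, not part of ds4. Recomputing from the rounded figures gives -0.84% for the 4096-sweep generation row rather than the reported -0.86%; presumably the reported value came from unrounded averages, which is an assumption and does not change the "neutral" reading.

```python
# Rounded averages copied from the benchmark table: (current main, this PR), in t/s.
rows = {
    "4096-step sweep, avg prefill": (284.42, 291.56),
    "4096-step sweep, avg generation": (28.72, 28.48),
    "README 65k sweep, avg prefill": (256.99, 250.38),
    "README 65k sweep, avg generation": (27.73, 27.56),
}


def pct_change(base, new):
    """Relative change of `new` versus `base`, in percent."""
    return (new - base) / base * 100.0


deltas = {name: pct_change(base, new) for name, (base, new) in rows.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}%")
```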

What changed

- Adds Apple M5-gated Metal runtime fast paths:
  - simdgroup matrix matmul specialization
  - private Metal scratch buffers for GPU-only hot intermediates, keeping hazard tracking enabled
- Makes 4096-token prefill chunks the default only on Apple M5 Max Metal.
- Keeps other devices/backends on the existing 2048-token default.
- Makes the 4096-token path correctness-safe by splitting the zero-prefix first chunk at the existing 2048-token correctness boundary. This avoids selecting compressed top-k rows from future causal positions.
- Aligns server KV disk-cache boundaries to the backend prefill chunk:
  - M5 Max Metal: 4096
  - other devices/backends: 2048
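The first-chunk split described above can be pictured with a small chunk planner. This is a minimal sketch under stated assumptions, not the ds4 code: `plan_prefill_chunks`, `SAFE_BOUNDARY`, and the choice to end every span on a multiple of the chunk size are hypothetical. The PR text only says that the zero-prefix first chunk is split at the 2048-token boundary.

```python
SAFE_BOUNDARY = 2048  # the "existing 2048-token correctness boundary" from the PR text


def plan_prefill_chunks(n_tokens, chunk=4096, prefix_len=0):
    """Return [start, end) token spans covering prefix_len..n_tokens.

    Hypothetical helper, not the ds4 implementation. Spans end on multiples
    of `chunk`; the only special case is a zero-prefix first chunk, which is
    cut at SAFE_BOUNDARY so it never extends past that boundary.
    """
    spans = []
    pos = prefix_len
    while pos < n_tokens:
        # Next multiple of `chunk` strictly after pos, capped at the prompt end.
        end = min((pos // chunk + 1) * chunk, n_tokens)
        if pos == 0 and end > SAFE_BOUNDARY:
            end = SAFE_BOUNDARY  # split the first chunk at the boundary
        spans.append((pos, end))
        pos = end
    return spans


print(plan_prefill_chunks(10000))
# prints [(0, 2048), (2048, 4096), (4096, 8192), (8192, 10000)]
```

With a 2048-token chunk the special case never triggers, so the sketch produces the plain 2048-token chunking that other devices keep.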

DS4_METAL_PREFILL_CHUNK=2048 forces the previous M5 Max chunk size. Values above 4096 still require DS4_METAL_ALLOW_UNSAFE_PREFILL_CHUNK=1 on the M5 Max default path.
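The override rules in the previous paragraph can be sketched as a small resolver for the M5 Max path. The two environment variable names and the 4096 default come from this PR; everything else is an assumption for illustration, including the function name `resolve_m5_max_prefill_chunk` and the decision to fall back to the default when an oversized value is requested without the unsafe flag. The PR does not say whether the real code clamps, rejects, or ignores such a value.

```python
import os

M5_MAX_DEFAULT_CHUNK = 4096  # default on Apple M5 Max Metal, per this PR


def resolve_m5_max_prefill_chunk(env=None):
    """Pick the prefill chunk size for the M5 Max default path.

    Hypothetical sketch, not the ds4 implementation.
    """
    env = os.environ if env is None else env
    raw = env.get("DS4_METAL_PREFILL_CHUNK")
    if raw is None:
        return M5_MAX_DEFAULT_CHUNK
    requested = int(raw)  # raises ValueError on non-numeric input
    unsafe_ok = env.get("DS4_METAL_ALLOW_UNSAFE_PREFILL_CHUNK") == "1"
    if requested > M5_MAX_DEFAULT_CHUNK and not unsafe_ok:
        # Assumption: fall back to the default. The PR only states that the
        # unsafe flag is required for values above 4096.
        return M5_MAX_DEFAULT_CHUNK
    return requested


# Passing a dict instead of os.environ keeps the example deterministic.
print(resolve_m5_max_prefill_chunk({"DS4_METAL_PREFILL_CHUNK": "2048"}))
```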

Correctness

Passed locally on Apple M5 Max, 128GB RAM, Metal backend, ds4flash.gguf, after rebasing to current upstream main:

make clean && make
make test
DS4_METAL_PREFILL_CHUNK=4096 ./ds4_test --logprob-vectors

make test covers:

--long-context
--tool-call-quality
--logprob-vectors
--metal-kernels
--server

Eval parity check

Deterministic 12-question ds4-eval slice against current upstream main produced identical grading decisions and token counts on both branches:

./ds4-eval \
  -m ds4flash.gguf \
  --plain \
  --questions 12 \
  --tokens 2048 \
  --temp 0 \
  --seed 1

Branch	Result	Total generated tokens	Total tokens
`antirez/main` `613e9b2`	10/12	15,884	18,446
this PR `b3d2665`	10/12	15,884	18,446

The same two cases failed on both branches with the same extracted answers, so this eval slice shows no quality regression.

Benchmark commands

4096-step sweep:

./ds4-bench \
  -m ds4flash.gguf \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 4096 --ctx-max 32768 --step-incr 4096 \
  --gen-tokens 64

README-shaped 65k sweep:

./ds4-bench \
  -m ds4flash.gguf \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 --ctx-max 65536 --step-incr 2048 \
  --gen-tokens 128

Memory

The 4096 default increases Metal context-buffer allocation, but the increase stays modest for the tested 128GB M5 Max machine.

From ds4-bench context buffer reporting at the README 65k allocation:

chunk	context buffers
2048	1311.89 MiB
4096	1740.42 MiB

That is an increase of about 428.5 MiB (roughly 0.42 GiB). Other devices keep the old 2048 default.

Scope notes

This PR is intentionally limited to runtime Metal changes and the M5 Max prefill default. The decode-indexer speedup is kept in #169.

fitchmultz · 2026-05-16T15:36:09Z

Opened a stacked follow-up for the M5 Max decode-indexer tuning: #169. It depends on this PR and includes a clean stacked diff plus local correctness/benchmark evidence (+15–18% generation t/s, prefill neutral). I kept this PR unchanged.

fitchmultz · 2026-05-16T17:14:18Z

Rebased this PR onto current upstream main (613e9b2 at benchmark/rebase time), force-pushed m5-responses, and updated the body with fresh paired benchmarks plus deterministic ds4-eval parity. Fresh result vs current main: 4096 prefill 284.42 → 291.56 t/s (+2.51%), 65k prefill effectively neutral/noisy, generation neutral. Eval slice is unchanged: 10/12 on both branches, same token counts and same two failures. The larger decode throughput win remains isolated in #169.

fitchmultz changed the title ~~Metal: add M5 Max fast paths and 4096 prefill default~~ Metal: speed up M5 Max prefill with correctness-gated 4096 chunks May 14, 2026

fitchmultz changed the title ~~Metal: speed up M5 Max prefill with correctness-gated 4096 chunks~~ Metal: correctness-gate M5 Max 4096 prefill (+5%) May 14, 2026

fitchmultz mentioned this pull request May 16, 2026

Metal: speed up M5 Max decode indexer #169

Open

fitchmultz added 2 commits May 16, 2026 10:00

metal: add M5 Max runtime fast paths

d3e8581

metal: default M5 Max to safe 4096 prefill

b3d2665

fitchmultz force-pushed the m5-responses branch from e54b952 to b3d2665 May 16, 2026 17:13

fitchmultz wants to merge 2 commits into antirez:main from fitchmultz:m5-responses
