Metal: speed up M5 Max decode indexer by fitchmultz · Pull Request #169 · antirez/ds4

fitchmultz · 2026-05-16T15:36:04Z

Based on current antirez/main (613e9b2 at benchmark/rebase time) and includes the prerequisite M5 Max prefill work from #149 because the decode path depends on it.

Review helpers:

Full PR diff is against current upstream main.
Decode-only comparison base: fitchmultz/ds4@m5-responses-current-main...m5-decode-indexer-topk
Context/prerequisite: Metal: correctness-gate M5 Max 4096 prefill (+5%) #149

What changes

On Apple M5 Max, this PR:

selects the M5 Max runtime fast path from #149;
defaults M5 Max prefill to the safe 4096-token chunk path from #149;
uses a smaller decode-only sparse indexer top-k.

Prefill keeps the model-configured indexer top-k. The decode override remains bounded and can be controlled with DS4_METAL_DECODE_INDEXER_TOP_K.

The decode default is 8. Local correctness checks passed with values as low as 4, but 8 keeps more sparse rows while matching or slightly beating 4 in the confirmatory 4096-step sweep.
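As a rough illustration of what a "bounded" override could look like, here is a minimal Python sketch of the selection logic. Only the environment variable name and the default of 8 come from this PR. The clamping range, the fallback on unparsable input, and the function name are assumptions, and the actual Metal backend code is not Python and may behave differently.

```python
import os

# Default taken from the PR description; everything else below is an
# illustrative assumption, not the backend's real logic.
M5_MAX_DECODE_TOP_K_DEFAULT = 8


def decode_indexer_top_k(model_top_k, env=None):
    """Pick a decode-only top-k that never exceeds the model-configured top-k."""
    env = os.environ if env is None else env
    raw = env.get("DS4_METAL_DECODE_INDEXER_TOP_K")
    if raw is None:
        value = M5_MAX_DECODE_TOP_K_DEFAULT
    else:
        try:
            value = int(raw)
        except ValueError:
            # Assumed behavior: ignore unparsable overrides.
            value = M5_MAX_DECODE_TOP_K_DEFAULT
    # Assumed bounds: at least 1, at most the model-configured top-k.
    return max(1, min(value, model_top_k))
```

In this sketch, prefill would keep using `model_top_k` directly, and only the decode path would call `decode_indexer_top_k`.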

Correctness

Passed locally on Apple M5 Max after rebasing to current upstream main:

make clean && make
make test
DS4_METAL_PREFILL_CHUNK=4096 ./ds4_test --logprob-vectors

Eval parity check

Deterministic 12-question ds4-eval slice against current upstream main produced identical grading decisions and token counts on both branches:

./ds4-eval \
  -m ds4flash.gguf \
  --plain \
  --questions 12 \
  --tokens 2048 \
  --temp 0 \
  --seed 1

| Branch | Result | Total generated tokens | Total tokens |
| --- | --- | --- | --- |
| `antirez/main` `613e9b2` | 10/12 | 15,884 | 18,446 |
| `m5-decode-indexer-topk` `6444d30` | 10/12 | 15,884 | 18,446 |

The same two cases failed on both branches with the same extracted answers, so this eval slice shows no quality regression.

Fresh benchmarks vs current upstream main

Model: ds4flash.gguf
Prompt: speed-bench/promessi_sposi.txt
Backend: Metal, Apple M5 Max 128 GB
Baseline: antirez/main at 613e9b2
Candidate: m5-decode-indexer-topk at 6444d30

| Sweep | Avg prefill t/s | Prefill delta | Avg generation t/s | Generation delta |
| --- | --- | --- | --- | --- |
| 4096-step, 64 gen tokens | 284.42 → 306.27 | +7.68% | 28.72 → 33.96 | +18.23% |
| 65k, 128 gen tokens | 256.99 → 258.15 | +0.45% | 27.73 → 32.40 | +16.84% |
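For readers who want to re-derive the delta columns, a small Python sketch follows. It uses only the rounded averages shown in the table, so a recomputed delta can differ from the reported one in the last digit (the reported figures were presumably computed from unrounded averages, which is an assumption).

```python
def percent_delta(baseline, candidate):
    """Relative change from baseline to candidate, in percent."""
    return (candidate - baseline) / baseline * 100.0


# (baseline t/s, candidate t/s) pairs copied from the table above.
sweeps = {
    "4096-step prefill": (284.42, 306.27),
    "4096-step generation": (28.72, 33.96),
    "65k prefill": (256.99, 258.15),
    "65k generation": (27.73, 32.40),
}

for name, (baseline, candidate) in sweeps.items():
    print(f"{name}: {percent_delta(baseline, candidate):+.2f}%")
```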

Commands:

./ds4-bench \
  -m ds4flash.gguf \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 4096 --ctx-max 32768 --step-incr 4096 \
  --gen-tokens 64

./ds4-bench \
  -m ds4flash.gguf \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 --ctx-max 65536 --step-incr 2048 \
  --gen-tokens 128

The strongest intended claim is decode/generation throughput. The 4096-step prefill gain comes from the included #149 prerequisite; long-context prefill is effectively neutral.

fitchmultz · 2026-05-16T15:39:58Z

Additional local confirmation run / machine details:

Machine: MacBook Pro Mac17,6, Apple M5 Max, 18 CPU cores, 128 GB RAM
OS: macOS 26.5 (25F71)
Build command: make clean && make
Benchmark command:

./ds4-bench \
  -m ds4flash.gguf \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 4096 --ctx-max 32768 --step-incr 4096 \
  --gen-tokens 64

Fresh candidate run average over the 8 context points:

prefill: 309.48 t/s
generation: 34.02 t/s

Earlier paired baseline/candidate runs for the same sweep averaged:

baseline (m5-responses): 28.84 generation t/s
candidate: 33.29 generation t/s
delta: +15.41% generation t/s

Happy to rebase/squash after #149 lands, or fold this into #149 if that is easier to review.

fitchmultz · 2026-05-16T15:46:39Z

Small follow-up update: I changed the M5 Max default from top-k 4 to top-k 8. Rationale: top-k 4 passed correctness, but top-k 8 keeps more sparse rows and the fresh 4096 sweep was effectively the same/slightly faster (34.24 gen t/s vs 34.02). Re-ran make clean && make, make test, and DS4_METAL_PREFILL_CHUNK=4096 ./ds4_test --logprob-vectors; all passed.

fitchmultz · 2026-05-16T16:03:42Z

Force-rebased this branch onto current upstream antirez/main (613e9b2 at push time), removed stale branch artifacts from the diff, and re-ran the local correctness gates: make clean && make, make test, and DS4_METAL_PREFILL_CHUNK=4096 ./ds4_test --logprob-vectors all passed. I also pushed m5-responses-current-main only as a compare base for the decode-only stacked diff.

fitchmultz · 2026-05-16T16:23:37Z

Updated the PR body with fresh paired benchmarks after rebasing to current upstream main (613e9b2). Current main → this branch: 4096 sweep gen 28.72 → 33.96 t/s (+18.23%), 65k sweep gen 27.73 → 32.40 t/s (+16.84%). These are now post-rebase numbers, not carried over from the earlier stacked baseline.

fitchmultz · 2026-05-16T16:49:04Z

Added a deterministic eval parity check against current upstream main.

Command on both branches (antirez/main 613e9b2 and this branch 6444d30):

./ds4-eval \
  -m ds4flash.gguf \
  --plain \
  --questions 12 \
  --tokens 2048 \
  --temp 0 \
  --seed 1 \
  --trace /tmp/ds4-m5-current-main-eval/<branch>-q12.trace

Result:

| Branch | Passed | Failed | Total generated tokens | Total tokens | Per-case decisions |
| --- | --- | --- | --- | --- | --- |
| `antirez/main` `613e9b2` | 10/12 | 2 | 15,884 | 18,446 | same |
| `m5-decode-indexer-topk` `6444d30` | 10/12 | 2 | 15,884 | 18,446 | same |

The same two cases failed on both branches, with the same extracted answers:

GPQA Diamond recoiTJPGUmzAkief: got A, expected C
AIME2025 aime2025-02: got 4, expected 588

So this M5 Max scheduling/indexer change preserved this deterministic 12-question eval slice while improving generation throughput in the paired bench runs above.
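The per-case comparison can be sketched as follows. This is a minimal Python illustration, not the tooling used for this PR: the `CaseResult` record and `parity_mismatches` helper are hypothetical, and the ds4-eval trace format is not assumed here. The example data lists only the two failing cases reported above.

```python
from typing import NamedTuple


class CaseResult(NamedTuple):
    """Hypothetical per-case record; not the ds4-eval trace format."""
    case_id: str
    passed: bool
    answer: str


def parity_mismatches(baseline, candidate):
    """Return sorted case ids whose record differs or is missing on one side."""
    base = {r.case_id: r for r in baseline}
    cand = {r.case_id: r for r in candidate}
    return sorted(
        cid for cid in base.keys() | cand.keys()
        if base.get(cid) != cand.get(cid)
    )


# The two failing cases reported above, identical on both branches.
main_run = [
    CaseResult("recoiTJPGUmzAkief", False, "A"),
    CaseResult("aime2025-02", False, "4"),
]
branch_run = list(main_run)

print(parity_mismatches(main_run, branch_run))  # prints []
```

A per-case token count could be added as another field in the same way, so that a run with equal answers but different generation lengths would also be flagged.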

antirez · 2026-05-16T18:46:55Z

@ivanfioravanti ping

ivanfioravanti · 2026-05-16T20:55:13Z

Let me try running the same tests on my branch. Deterministic evaluation parity is a great idea, @fitchmultz.

ivanfioravanti · 2026-05-16T22:27:58Z

I ran the same test on #15, on the same machine model and the same OS (26.5). In that PR, the focus was on prefill only, to leverage the Neural Accelerators.

The deterministic evaluation helped me disable a prefill path that was producing different results in my PR.

Fresh candidate run average over the 8 context points:

prefill average: 340.77 t/s
generation average: 32.67 t/s

Here is the chart of Quality vs Standard Metal vs Tensor Metal:

[Image: 20260517-000340_gen128_ds4_bench_standard_quality_tensor]

I will run the same test on this PR tomorrow to compare.

ivanfioravanti · 2026-05-17T06:59:31Z

@fitchmultz try the speed test on your branch again, but with High Power enabled in the battery settings, to check prefill speed.

fitchmultz · 2026-05-17T13:50:30Z

Re-ran the 4096-step speed test on this branch with the machine plugged in and High Power mode active.

Local state:

branch: m5-decode-indexer-topk
commit: 6444d30
machine: MacBook Pro Mac17,6, Apple M5 Max, 18 CPU cores, 128 GB RAM
OS: macOS 26.5 (25F71)
power: AC power, High Power mode
build: make clean && make

Command:

./ds4-bench \
  -m ds4flash.gguf \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 4096 --ctx-max 32768 --step-incr 4096 \
  --gen-tokens 64

Initial fresh 8-point average:

prefill: 290.71 t/s
generation: 32.68 t/s

Rows from that initial run:

ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
4096,4096,330.02,64,34.31,80373132
8192,4096,328.06,64,33.57,136750476
12288,4096,294.94,64,32.80,193127820
16384,4096,282.58,64,32.21,249505164
20480,4096,280.56,64,32.57,305882508
24576,4096,276.98,64,32.31,362259852
28672,4096,268.58,64,31.94,418637196
32768,4096,263.96,64,31.70,475014540

I investigated the disparity versus my earlier same-branch/same-command runs (306.27 prefill / 33.96 generation in the paired post-rebase run, and an earlier candidate-only 309.48 / 34.02 run). The local repo state was not the cause: branch/head still matched 6444d30. The difference was environmental: I had a large pile of stale agent-browser headless Chrome sessions left running (~243 agent-browser-chrome processes), with many renderer/GPU helper processes actively consuming CPU/GPU. The slow run also had an unusually slow Metal residency request (~11.4s), consistent with a noisy/memory-pressured local environment.

After cleaning those stale browser processes and rerunning the exact same command on the same commit, the result was back in the earlier band:

prefill: 312.26 t/s
generation: 34.20 t/s

Rows after cleanup:

ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
4096,4096,350.08,64,35.86,80373132
8192,4096,339.34,64,35.28,136750476
12288,4096,314.17,64,34.22,193127820
16384,4096,307.87,64,34.28,249505164
20480,4096,306.40,64,33.89,305882508
24576,4096,301.14,64,33.86,362259852
28672,4096,294.65,64,33.31,418637196
32768,4096,284.46,64,32.87,475014540
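The reported 8-point averages can be reproduced from the rows above with a short Python sketch. It parses the CSV text exactly as posted and takes the unweighted mean of the `prefill_tps` and `gen_tps` columns; the helper name is illustrative.

```python
import csv
import io

NOISY_RUN = """ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
4096,4096,330.02,64,34.31,80373132
8192,4096,328.06,64,33.57,136750476
12288,4096,294.94,64,32.80,193127820
16384,4096,282.58,64,32.21,249505164
20480,4096,280.56,64,32.57,305882508
24576,4096,276.98,64,32.31,362259852
28672,4096,268.58,64,31.94,418637196
32768,4096,263.96,64,31.70,475014540
"""

CLEAN_RUN = """ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
4096,4096,350.08,64,35.86,80373132
8192,4096,339.34,64,35.28,136750476
12288,4096,314.17,64,34.22,193127820
16384,4096,307.87,64,34.28,249505164
20480,4096,306.40,64,33.89,305882508
24576,4096,301.14,64,33.86,362259852
28672,4096,294.65,64,33.31,418637196
32768,4096,284.46,64,32.87,475014540
"""


def sweep_averages(csv_text):
    """Unweighted mean of prefill and generation t/s over all context points."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    prefill = sum(float(r["prefill_tps"]) for r in rows) / len(rows)
    gen = sum(float(r["gen_tps"]) for r in rows) / len(rows)
    return round(prefill, 2), round(gen, 2)


print(sweep_averages(NOISY_RUN))  # prints (290.71, 32.68)
print(sweep_averages(CLEAN_RUN))  # prints (312.26, 34.2)
```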

So the lower 290.71 prefill run should be treated as a contaminated/noisy local run, not a code regression. This PR is still intentionally focused on the conservative decode/indexer change; #15 is the more aggressive Tensor prefill path to compare for prefill throughput.

ivanfioravanti · 2026-05-17T15:02:54Z

I see that both the llama.cpp and mlx-lm projects produce different logits at temp=0 on M3 Ultra vs M5 Max. I updated my PR to give the same results as yours, but performance dropped quite a lot. I bet a tradeoff must be made between equality and speed. I'll try to run a full evaluation (92 cases) on my branch "pre-equality" and "post-equality" to see if there is a difference.

In the end, I think @antirez will have to decide how to proceed here. Initially I leaned towards 1:1 results, but now I'm not sure that is the right way to go if we want to squeeze more juice out of our M5 Max chipset.

ivanfioravanti · 2026-05-17T17:09:38Z

ggerganov confirmed that having different logits on different hardware is generally accepted across the industry.
I'm now running the full eval on the older (faster) version of my PR to see whether the results are good enough; if they are, I will revert the PR to that state.

ggml-org/llama.cpp#23212 (comment)

fitchmultz · 2026-05-17T20:26:16Z

I ran an independent full 92-case check on my M5 Max since the discussion is now really about the equality-vs-speed tradeoff, not just the 12-case drift gate.

Machine/settings: M5 Max 128 GB, macOS 26.5, AC + High Power.

Eval command:

./ds4-eval -m ds4flash.gguf --plain --questions 92 --tokens 2048 --temp 0 --seed 1 --trace TRACE

Summary:

| Variant | Commit/config | Bench prefill (t/s) | Bench gen (t/s) | Full eval | q11 |
| --- | --- | --- | --- | --- | --- |
| `antirez/main` | `ef0a490` | `303.47` | `29.47` | `50/92` | `PASS gen=851 total=988 A/A` |
| #169 | `6444d30` | `322.71` | `34.95` | `50/92` | `PASS gen=851 total=988 A/A` |
| #15 pre-equality proxy | `5e989a0` / MoE up from layer 36 | `354.20` | `32.16` | `50/92` | `FAIL gen=1725 total=1862 I/A` |
| #15 current/post-equality | `2008613` / MoE up from layer 37 | `350.35` | `31.94` | `50/92` | `PASS gen=851 total=988 A/A` |

A few observations:

#169 exactly matched current antirez/main across all 92 cases in my run: same pass/fail set, same extracted answers, same token counts. So this still looks like a decode-speed-only change in this eval gate.
On #15, the layer-37 equality change fixes q11 exactly, but the full-92 score was unchanged in my run (50/92 before and after). It swapped cases: q11 and q58 improved, while q53 and q80 regressed.
That supports Ivan's point that the decision may be policy/tradeoff-driven rather than “1:1 logits at all costs”. The full eval did not show a net score gain from the stricter q11-equivalent setting in this sample, although it did fix the known q11 drift.
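The "same score, different cases" effect is easy to see with set differences. The Python sketch below includes only the four cases named above; the 48 cases that passed in both runs are omitted because their ids are not listed in this thread. The helper name is illustrative.

```python
def compare_pass_sets(before, after):
    """Summarize which cases flipped between two eval runs."""
    return {
        "improved": sorted(after - before),   # failed before, pass now
        "regressed": sorted(before - after),  # passed before, fail now
        "net": len(after) - len(before),      # change in total score
    }


# Only the cases that changed between the two #15 variants.
pre_equality_passed = {"q53", "q80"}
post_equality_passed = {"q11", "q58"}

print(compare_pass_sets(pre_equality_passed, post_equality_passed))
# prints {'improved': ['q11', 'q58'], 'regressed': ['q53', 'q80'], 'net': 0}
```

A net of 0 with non-empty `improved` and `regressed` lists is exactly the case where an aggregate score hides per-case drift.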

Caveat: my “pre-equality” point is the immediate parent 5e989a0 before the layer-37 default change. If the faster point you are evaluating is an older SHA, send me the exact commit/env and I can rerun that same full-92 comparison on this machine.

ivanfioravanti · 2026-05-17T21:22:54Z

I’m running the same evals with another model on llama.cpp and mlx-lm, on M3 Ultra vs M5 Max. The fact that results change based on hardware at temp 0 really drives me a little mad. I’m now running AIME 25 at temp 0 and 0.6 on all platforms to see what happens. My brain is really uncomfortable with this topic.

fitchmultz mentioned this pull request May 16, 2026

Metal: correctness-gate M5 Max 4096 prefill (+5%) #149

Open

fitchmultz added 6 commits May 16, 2026 10:00

metal: add M5 Max runtime fast paths

d3e8581

metal: default M5 Max to safe 4096 prefill

b3d2665

metal: use tiny M5 Max decode indexer set

ef29fef

metal: lower M5 Max decode indexer top-k

9766d03

metal: keep M5 Max decode top-k override safe

24acad0

metal: prefer conservative M5 decode top-k

6444d30

fitchmultz force-pushed the m5-decode-indexer-topk branch from 951b932 to 6444d30 Compare May 16, 2026 16:03

ivanfioravanti mentioned this pull request May 17, 2026

Add Metal 4 M5 prefill optimizations #15

Open
