Metal: speed up M5 Max decode indexer#169

Open
fitchmultz wants to merge 6 commits into
antirez:mainfrom
fitchmultz:m5-decode-indexer-topk
@fitchmultz fitchmultz commented May 16, 2026

Based on current antirez/main (613e9b2 at benchmark/rebase time) and includes the prerequisite M5 Max prefill work from #149 because the decode path depends on it.

Review helpers:

What changes

On Apple M5 Max:

Prefill keeps the model-configured indexer top-k. The decode override remains bounded and can be controlled with DS4_METAL_DECODE_INDEXER_TOP_K.

The decode default is 8: local checks passed down to 4, but 8 keeps more sparse rows while matching/slightly beating 4 in the confirmatory 4096 sweep.
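A minimal sketch of how a bounded override like this could be resolved, written in Python purely to illustrate the logic. The environment variable name and the default of 8 come from this PR; the clamping rule, the `floor` lower bound, the fallback for unparsable values, and the `model_top_k` parameter are assumptions for illustration, not the actual implementation.

```python
import os

ENV_NAME = "DS4_METAL_DECODE_INDEXER_TOP_K"  # name taken from the PR text


def decode_indexer_top_k(model_top_k, default=8, floor=1, env=None):
    """Pick the decode-time indexer top-k (illustrative logic only).

    model_top_k: model-configured top-k, used here as the upper bound
                 (assumption; the PR only says the override is bounded).
    default:     8, the M5 Max decode default stated in the PR.
    floor:       hypothetical lower bound.
    """
    env = os.environ if env is None else env
    raw = env.get(ENV_NAME)
    if raw is None:
        value = default
    else:
        try:
            value = int(raw)
        except ValueError:
            value = default  # assumption: ignore unparsable values
    # Clamp into [floor, model_top_k] so the override stays bounded.
    return max(floor, min(value, model_top_k))
```

Under these assumptions, an unset variable yields 8, a value of 4 is honored, and an oversized value is clamped to the model-configured top-k.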

Correctness

Passed locally on Apple M5 Max after rebasing to current upstream main:

make clean && make
make test
DS4_METAL_PREFILL_CHUNK=4096 ./ds4_test --logprob-vectors

Eval parity check

Deterministic 12-question ds4-eval slice against current upstream main produced identical grading decisions and token counts on both branches:

./ds4-eval \
  -m ds4flash.gguf \
  --plain \
  --questions 12 \
  --tokens 2048 \
  --temp 0 \
  --seed 1

| Branch | Result | Total generated tokens | Total tokens |
| --- | --- | --- | --- |
| antirez/main 613e9b2 | 10/12 | 15,884 | 18,446 |
| m5-decode-indexer-topk 6444d30 | 10/12 | 15,884 | 18,446 |

The same two cases failed on both branches with the same extracted answers, so this eval slice shows no quality regression.

Fresh benchmarks vs current upstream main

Model: ds4flash.gguf
Prompt: speed-bench/promessi_sposi.txt
Backend: Metal, Apple M5 Max 128 GB
Baseline: antirez/main at 613e9b2
Candidate: m5-decode-indexer-topk at 6444d30

| Sweep | Avg prefill t/s | Prefill delta | Avg generation t/s | Generation delta |
| --- | --- | --- | --- | --- |
| 4096-step, 64 gen tokens | 284.42 → 306.27 | +7.68% | 28.72 → 33.96 | +18.23% |
| 65k, 128 gen tokens | 256.99 → 258.15 | +0.45% | 27.73 → 32.40 | +16.84% |
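
For readers who want to re-derive the delta columns, here is a small sketch of the relative-change arithmetic applied to the averages reported above. The `pct_delta` helper and the dictionary layout are illustrative names, not part of ds4-bench. Recomputing from the already-rounded averages can differ from the reported figure in the last digit.

```python
def pct_delta(baseline, candidate):
    """Relative change of candidate over baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# (baseline t/s, candidate t/s) pairs copied from the benchmark table.
sweeps = {
    "4096-step prefill": (284.42, 306.27),
    "4096-step generation": (28.72, 33.96),
    "65k prefill": (256.99, 258.15),
    "65k generation": (27.73, 32.40),
}

for name, (base, cand) in sweeps.items():
    print(f"{name}: {base:.2f} -> {cand:.2f} ({pct_delta(base, cand):+.2f}%)")
```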

Commands:

./ds4-bench \
  -m ds4flash.gguf \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 4096 --ctx-max 32768 --step-incr 4096 \
  --gen-tokens 64

./ds4-bench \
  -m ds4flash.gguf \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 --ctx-max 65536 --step-incr 2048 \
  --gen-tokens 128

The strongest intended claim is decode/generation throughput. The 4096-step prefill gain comes from the included #149 prerequisite; long-context prefill is effectively neutral.

@fitchmultz
Author

Additional local confirmation run / machine details:

  • Machine: MacBook Pro Mac17,6, Apple M5 Max, 18 CPU cores, 128 GB RAM
  • OS: macOS 26.5 (25F71)
  • Build command: make clean && make
  • Benchmark command:
./ds4-bench \
  -m ds4flash.gguf \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 4096 --ctx-max 32768 --step-incr 4096 \
  --gen-tokens 64

Fresh candidate run average over the 8 context points:

  • prefill: 309.48 t/s
  • generation: 34.02 t/s

Earlier paired baseline/candidate runs for the same sweep averaged:

  • baseline (m5-responses): 28.84 generation t/s
  • candidate: 33.29 generation t/s
  • delta: +15.41% generation t/s

Happy to rebase/squash after #149 lands, or fold this into #149 if that is easier to review.

@fitchmultz
Author

Small follow-up update: I changed the M5 Max default from top-k 4 to top-k 8. Rationale: top-k 4 passed correctness, but top-k 8 keeps more sparse rows and the fresh 4096 sweep was effectively the same/slightly faster (34.24 gen t/s vs 34.02). Re-ran make clean && make, make test, and DS4_METAL_PREFILL_CHUNK=4096 ./ds4_test --logprob-vectors; all passed.

fitchmultz force-pushed the m5-decode-indexer-topk branch from 951b932 to 6444d30 on May 16, 2026 16:03
@fitchmultz
Author

Force-rebased this branch onto current upstream antirez/main (613e9b2 at push time), removed stale branch artifacts from the diff, and re-ran the local correctness gates: make clean && make, make test, and DS4_METAL_PREFILL_CHUNK=4096 ./ds4_test --logprob-vectors all passed. I also pushed m5-responses-current-main only as a compare base for the decode-only stacked diff.

@fitchmultz
Author

Updated the PR body with fresh paired benchmarks after rebasing to current upstream main (613e9b2). Current main → this branch: 4096 sweep gen 28.72 → 33.96 t/s (+18.23%), 65k sweep gen 27.73 → 32.40 t/s (+16.84%). These are now post-rebase numbers, not carried over from the earlier stacked baseline.

@fitchmultz
Author

Added a deterministic eval parity check against current upstream main.

Command on both branches (antirez/main 613e9b2 and this branch 6444d30):

./ds4-eval \
  -m ds4flash.gguf \
  --plain \
  --questions 12 \
  --tokens 2048 \
  --temp 0 \
  --seed 1 \
  --trace /tmp/ds4-m5-current-main-eval/<branch>-q12.trace

Result:

| Branch | Passed | Failed | Total generated tokens | Total tokens | Per-case decisions |
| --- | --- | --- | --- | --- | --- |
| antirez/main 613e9b2 | 10/12 | 2 | 15,884 | 18,446 | same |
| m5-decode-indexer-topk 6444d30 | 10/12 | 2 | 15,884 | 18,446 | same |

The same two cases failed on both branches, with the same extracted answers:

  • GPQA Diamond recoiTJPGUmzAkief: got A, expected C
  • AIME2025 aime2025-02: got 4, expected 588

So this M5 Max scheduling/indexer change preserved this deterministic 12-question eval slice while improving generation throughput in the paired bench runs above.
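
The parity claim above amounts to comparing per-case outcomes between the two runs. A minimal sketch of that comparison follows, assuming each run has already been reduced to records of case id, extracted answer, and expected answer. The `CaseResult` type and `parity_diff` helper are hypothetical; the actual ds4-eval trace format is not shown in this thread. The sample records are the two failing cases listed above.

```python
from typing import NamedTuple


class CaseResult(NamedTuple):
    case_id: str    # hypothetical record layout, not the ds4-eval trace format
    extracted: str  # answer extracted from the model output
    expected: str   # reference answer


def parity_diff(baseline, candidate):
    """Return the case ids whose records differ between the two runs."""
    base = {r.case_id: r for r in baseline}
    cand = {r.case_id: r for r in candidate}
    all_ids = base.keys() | cand.keys()
    return sorted(cid for cid in all_ids if base.get(cid) != cand.get(cid))


# The two failing cases reported for both branches.
main_run = [
    CaseResult("recoiTJPGUmzAkief", "A", "C"),
    CaseResult("aime2025-02", "4", "588"),
]
branch_run = list(main_run)  # identical decisions, as reported above

print(parity_diff(main_run, branch_run))  # an empty list means parity holds
```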

@antirez
Owner

antirez commented May 16, 2026

@ivanfioravanti ping

@ivanfioravanti
Contributor

Let me try running the same tests on my branch. Deterministic evaluation parity is a great idea, @fitchmultz.

@ivanfioravanti
Contributor

I ran the same test on #15, on the same machine and the same OS (macOS 26.5). In #15 the focus was on prefill only, to leverage the Neural Accelerators.

The deterministic evaluation helped me disable a prefill path that was producing different results in my PR.

Fresh candidate run average over the 8 context points:

  • prefill average: 340.77 t/s
  • generation average: 32.67 t/s

Here is the chart of Quality vs Standard Metal vs Tensor Metal:

[Image: 20260517-000340_gen128_ds4_bench_standard_quality_tensor]

I will run the same test on this PR tomorrow to compare.

@ivanfioravanti
Contributor

@fitchmultz please rerun the speed test on your branch with High Power enabled in the battery settings, to check prefill speed.

@fitchmultz
Author

fitchmultz commented May 17, 2026

Re-ran the 4096-step speed test on this branch with the machine plugged in and High Power mode active.

Local state:

  • branch: m5-decode-indexer-topk
  • commit: 6444d30
  • machine: MacBook Pro Mac17,6, Apple M5 Max, 18 CPU cores, 128 GB RAM
  • OS: macOS 26.5 (25F71)
  • power: AC power, High Power mode
  • build: make clean && make

Command:

./ds4-bench \
  -m ds4flash.gguf \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 4096 --ctx-max 32768 --step-incr 4096 \
  --gen-tokens 64

Initial fresh 8-point average:

  • prefill: 290.71 t/s
  • generation: 32.68 t/s

Rows from that initial run:

ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
4096,4096,330.02,64,34.31,80373132
8192,4096,328.06,64,33.57,136750476
12288,4096,294.94,64,32.80,193127820
16384,4096,282.58,64,32.21,249505164
20480,4096,280.56,64,32.57,305882508
24576,4096,276.98,64,32.31,362259852
28672,4096,268.58,64,31.94,418637196
32768,4096,263.96,64,31.70,475014540

I investigated the disparity versus my earlier same-branch/same-command runs (306.27 prefill / 33.96 generation in the paired post-rebase run, and an earlier candidate-only 309.48 / 34.02 run). The local repo state was not the cause: branch/head still matched 6444d30. The difference was environmental: I had a large pile of stale agent-browser headless Chrome sessions left running (~243 agent-browser-chrome processes), with many renderer/GPU helper processes actively consuming CPU/GPU. The slow run also had an unusually slow Metal residency request (~11.4s), consistent with a noisy/memory-pressured local environment.

After cleaning those stale browser processes and rerunning the exact same command on the same commit, the result was back in the earlier band:

  • prefill: 312.26 t/s
  • generation: 34.20 t/s

Rows after cleanup:

ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
4096,4096,350.08,64,35.86,80373132
8192,4096,339.34,64,35.28,136750476
12288,4096,314.17,64,34.22,193127820
16384,4096,307.87,64,34.28,249505164
20480,4096,306.40,64,33.89,305882508
24576,4096,301.14,64,33.86,362259852
28672,4096,294.65,64,33.31,418637196
32768,4096,284.46,64,32.87,475014540

So the lower 290.71 prefill run should be treated as a contaminated/noisy local run, not a code regression. This PR is still intentionally focused on the conservative decode/indexer change; #15 is the more aggressive Tensor prefill path to compare for prefill throughput.

@ivanfioravanti
Contributor

I see that both the llama.cpp and mlx-lm projects produce different logits at temp=0 on M3 Ultra vs M5 Max. I updated my PR to match the results in yours, but performance dropped quite a lot. I bet a tradeoff must be made between equality and speed. I'll try to run a full evaluation (92 cases) on my branch "pre-equality" and "post-equality" to see if there is a difference.

In the end, I think @antirez will have to decide how to proceed here. Initially I was leaning towards 1:1 results, but now I'm not sure that is the right way to go if we want to squeeze more juice from our M5 Max chipset.

@ivanfioravanti
Contributor

ggerganov confirmed that having different logits on different hardware can be accepted industry-wide.
I'm now running the full eval on the older (faster) version of my PR to see whether the results are good enough; if they are, I will revert the PR to that state.

ggml-org/llama.cpp#23212 (comment)

@fitchmultz
Author

I ran an independent full 92-case check on my M5 Max since the discussion is now really about the equality-vs-speed tradeoff, not just the 12-case drift gate.

Machine/settings: M5 Max 128 GB, macOS 26.5, AC + High Power.

Eval command:

./ds4-eval -m ds4flash.gguf --plain --questions 92 --tokens 2048 --temp 0 --seed 1 --trace TRACE

Summary:

| Variant | Commit/config | Bench prefill | Bench gen | Full eval | q11 |
| --- | --- | --- | --- | --- | --- |
| antirez/main | ef0a490 | 303.47 | 29.47 | 50/92 | PASS gen=851 total=988 A/A |
| #169 | 6444d30 | 322.71 | 34.95 | 50/92 | PASS gen=851 total=988 A/A |
| #15 pre-equality proxy | 5e989a0 / MoE up from layer 36 | 354.20 | 32.16 | 50/92 | FAIL gen=1725 total=1862 I/A |
| #15 current/post-equality | 2008613 / MoE up from layer 37 | 350.35 | 31.94 | 50/92 | PASS gen=851 total=988 A/A |

A few observations:

  • #169 (this PR) exactly matched current antirez/main across all 92 cases in my run: same pass/fail set, same extracted answers, same token counts. So this still looks like a decode-speed-only change in this eval gate.
  • On #15 (Add Metal 4 M5 prefill optimizations), the layer-37 equality change fixes q11 exactly, but the full-92 score was unchanged in my run (50/92 before and after). It swapped cases: q11 and q58 improved, while q53 and q80 regressed.
  • That supports Ivan's point that the decision may be policy/tradeoff-driven rather than “1:1 logits at all costs”. The full eval did not show a net score gain from the stricter q11-equivalent setting in this sample, although it did fix the known q11 drift.
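
The "swapped cases" observation can be expressed as plain set arithmetic on the passing cases of each run. In this sketch the sets contain only the four cases named above; the cases whose outcome did not change are omitted, and `compare_pass_sets` is an illustrative helper, not part of ds4-eval.

```python
def compare_pass_sets(before, after):
    """Return (improved, regressed, net score change) between two runs."""
    improved = sorted(after - before)   # failed before, pass now
    regressed = sorted(before - after)  # passed before, fail now
    return improved, regressed, len(after) - len(before)


# Only the four cases named in the list above; unchanged cases are left out.
pre_equality_pass = {"q53", "q80"}
post_equality_pass = {"q11", "q58"}

improved, regressed, net = compare_pass_sets(pre_equality_pass, post_equality_pass)
print(f"improved={improved} regressed={regressed} net={net:+d}")
```

Two improvements and two regressions cancel out, which is why the headline score stays at 50/92 even though the per-case results differ.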

Caveat: my “pre-equality” point is the immediate parent 5e989a0 before the layer-37 default change. If the faster point you are evaluating is an older SHA, send me the exact commit/env and I can rerun that same full-92 comparison on this machine.

@ivanfioravanti
Contributor

I’m running the same evals with another model on llama.cpp and mlx-lm, on M3 Ultra vs M5 Max. It really drives me a little mad that results change based on hardware at temp 0. I’m now running AIME 25 at temp 0 and 0.6 on all platforms to see what happens. My brain is really uncomfortable with this topic.
