Metal: speed up M5 Max decode indexer #169
|
Additional local confirmation run / machine details:
./ds4-bench \
-m ds4flash.gguf \
--prompt-file speed-bench/promessi_sposi.txt \
--ctx-start 4096 --ctx-max 32768 --step-incr 4096 \
--gen-tokens 64

Fresh candidate run average over the 8 context points:
Earlier paired baseline/candidate runs for the same sweep averaged:
Happy to rebase/squash after #149 lands, or fold this into #149 if that is easier to review. |
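The sweep flags in the command above account for the eight context points behind these averages. As a minimal sketch, assuming the sweep includes both the `--ctx-start` and `--ctx-max` values, the helper below enumerates them. `avg_column` then averages one throughput column from a hypothetical whitespace-separated results file (`ctx prefill_tps gen_tps`); that layout is an assumption made for illustration, not the documented `ds4-bench` output, and both function names are illustrative.

```shell
#!/bin/sh
# List the context sizes visited by a sweep.
# Assumption: the sweep is inclusive of both start and max.
ctx_points() {
  # $1 = start, $2 = max, $3 = increment
  seq "$1" "$3" "$2"
}

# Average one column of a results file, skipping non-numeric rows (e.g. a header).
# Assumption: rows look like "ctx prefill_tps gen_tps" (hypothetical layout).
avg_column() {
  # $1 = 1-based column index, $2 = results file
  awk -v col="$1" '
    NF >= col && $col ~ /^[0-9.]+$/ { sum += $col; n++ }
    END { if (n > 0) printf "%.2f\n", sum / n }
  ' "$2"
}

# Count the sweep points for the flags used in this thread (8 for these values).
ctx_points 4096 32768 4096 | wc -l
```

With results saved in that assumed layout, `avg_column 3 results.txt` would give the mean generation throughput across the sweep.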
|
Small follow-up update: I changed the M5 Max default from top-k 4 to top-k 8. Rationale: top-k 4 passed correctness, but top-k 8 keeps more sparse rows, and in the fresh 4096-step sweep it was effectively the same or slightly faster (34.24 gen t/s for top-k 8 vs 34.02 for top-k 4). Re-ran |
951b932 to 6444d30
|
Force-rebased this branch onto current upstream |
|
Updated the PR body with fresh paired benchmarks after rebasing to current upstream |
|
Added a deterministic eval parity check against current upstream.

Command on both branches:
./ds4-eval \
-m ds4flash.gguf \
--plain \
--questions 12 \
--tokens 2048 \
--temp 0 \
--seed 1 \
--trace /tmp/ds4-m5-current-main-eval/<branch>-q12.trace

Result:
The same two cases failed on both branches, with the same extracted answers:
So this M5 Max scheduling/indexer change preserved this deterministic 12-question eval slice while improving generation throughput in the paired bench runs above. |
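The parity comparison described above can be sketched as a plain file diff. This assumes each branch's run has been reduced to a per-question summary with one line per question (`<question-id> <PASS|FAIL> <extracted-answer> <token-count>`). That summary layout and the `parity_check` name are hypothetical; the real `--trace` output may need its own extraction step first.

```shell
#!/bin/sh
# Compare two per-question summaries (hypothetical layout, one line per question):
#   <question-id> <PASS|FAIL> <extracted-answer> <token-count>
parity_check() {
  # $1 = baseline summary, $2 = candidate summary
  if diff "$1" "$2" >/dev/null; then
    echo "parity: identical grading decisions, answers, and token counts"
  else
    echo "parity: drift detected"
    # Show the differing questions; "|| true" keeps the non-zero diff
    # status from aborting callers that run under "set -e".
    diff "$1" "$2" || true
    return 1
  fi
}
```

A failing case that fails identically on both branches still counts as parity here, which matches how the two shared failures are treated in this comment.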
|
@ivanfioravanti ping |
|
Let me try to run the same tests on my branch. Deterministic evaluation parity is a great idea, @fitchmultz. |
|
I ran the same test on #15, on the same machine and the same OS (26.5). In that PR the focus was on prefill only, to leverage the Neural Accelerators. The deterministic evaluation helped me disable a prefill path that was producing different results in my PR.

Fresh candidate run average over the 8 context points:

Here is the chart of Quality vs Standard Metal vs Tensor Metal. I will run the same on this PR tomorrow to compare. |
|
@fitchmultz try the speed test again on your branch, but with High Power selected in the battery settings, to check prefill speed. |
|
Re-ran the 4096-step speed test on this branch with the machine plugged in and High Power mode active. Local state:
Command: ./ds4-bench \
-m ds4flash.gguf \
--prompt-file speed-bench/promessi_sposi.txt \
--ctx-start 4096 --ctx-max 32768 --step-incr 4096 \
--gen-tokens 64

Initial fresh 8-point average:
Rows from that initial run:

I investigated the disparity versus my earlier same-branch/same-command runs.

After cleaning those stale browser processes and rerunning the exact same command on the same commit, the result was back in the earlier band:

Rows after cleanup:

So the lower |
|
I see that both the llama.cpp and mlx-lm projects produce different logits at temp=0 on M3 Ultra vs M5 Max. I updated my PR to enforce the same equality as yours, but performance dropped quite a lot. I bet a tradeoff must be made between equality and speed. I'll try to run a full evaluation (92 cases) on my branch "pre-equality" and "post-equality" to see if there is a difference. In the end I think @antirez will have to decide how to proceed here. Initially I leaned towards 1-1 results, but now I'm not sure that is the right way to go if we want to squeeze more juice from the M5 Max chipset. |
|
ggerganov confirmed that having different logits on different hardware can be considered accepted industry-wide. |
|
I ran an independent full 92-case check on my M5 Max since the discussion is now really about the equality-vs-speed tradeoff, not just the 12-case drift gate.

Machine/settings: M5 Max 128 GB, macOS 26.5, AC + High Power.

Eval command:
./ds4-eval -m ds4flash.gguf --plain --questions 92 --tokens 2048 --temp 0 --seed 1 --trace TRACE

Summary:
A few observations:
Caveat: my “pre-equality” point is the immediate parent |
|
I’m running the same evals with another model on llama.cpp and mlx-lm, on M3 Ultra vs M5 Max. The fact that results change with the hardware at temp 0 really drives me a little mad. I’m now running AIME 25 at temp 0 and 0.6 on all platforms to see what happens. My brain is really uncomfortable with this topic. |
Based on current antirez/main (613e9b2 at benchmark/rebase time) and includes the prerequisite M5 Max prefill work from #149 because the decode path depends on it.

Review helpers:
main.

What changes
On Apple M5 Max:
Prefill keeps the model-configured indexer top-k. The decode override remains bounded and can be controlled with DS4_METAL_DECODE_INDEXER_TOP_K.

The decode default is 8: local checks passed down to 4, but 8 keeps more sparse rows while matching/slightly beating 4 in the confirmatory 4096 sweep.

Correctness
Passed locally on Apple M5 Max after rebasing to current upstream main:

Eval parity check
Deterministic 12-question ds4-eval slice against current upstream main produced identical grading decisions and token counts on both branches:

antirez/main at 613e9b2
m5-decode-indexer-topk at 6444d30

The same two cases failed on both branches with the same extracted answers, so this eval slice shows no quality regression.

Fresh benchmarks vs current upstream main
Model: ds4flash.gguf
Prompt: speed-bench/promessi_sposi.txt
Backend: Metal, Apple M5 Max 128 GB
Baseline: antirez/main at 613e9b2
Candidate: m5-decode-indexer-topk at 6444d30

Commands:
The strongest intended claim is decode/generation throughput. The 4096-step prefill gain comes from the included #149 prerequisite; long-context prefill is effectively neutral.
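To put a throughput comparison in relative terms, a small helper can turn two averages into a percentage change. This is a sketch with an illustrative function name (`pct_change`); the example values are the top-k 4 and top-k 8 generation figures quoted in the discussion (34.02 and 34.24 gen t/s), not the baseline/candidate averages.

```shell
#!/bin/sh
# Relative change of a candidate throughput over a baseline, in percent.
pct_change() {
  # $1 = baseline tokens/s, $2 = candidate tokens/s
  awk -v base="$1" -v cand="$2" \
    'BEGIN { printf "%.2f\n", (cand - base) / base * 100 }'
}

# top-k 4 (34.02 gen t/s) vs top-k 8 (34.24 gen t/s) from the 4096 sweep
pct_change 34.02 34.24   # prints 0.65
```

A difference of well under one percent is consistent with describing the two settings as effectively the same or slightly faster.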