Observation from production use on a DGX Spark (GB10), vLLM, Qwen3.6-35B-A3B int4 with the DFlash draft at k=3.
Decode speed derived from vLLM's TPOT metric (time per output token, inverted to tok/s) reads ~43 tok/s when taken right after launch, over the first handful of requests. After a steady stream of requests the same metric settles at ~69 tok/s, which matches an independent prefill-separated measurement of the same setup. The cold reading is not noise: it reproduces, and it is exactly the kind of number people quote in "X is slow" reports.
Might be worth one line in the README: benchmark after warmup over a real request window, not on the first requests after launch.
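To make the suggestion concrete, here is a minimal sketch of a warmup-aware, prefill-separated decode measurement. It is not the script we used, and it does not touch vLLM's API: `RequestTiming`, `decode_tok_per_s`, and the `warmup` parameter are hypothetical names, and the timings are synthetic placeholders, not our measured values. It assumes requests run one at a time, so per-request decode time is simply end-to-end latency minus time to first token.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RequestTiming:
    """Client-side timings for one request (hypothetical record type)."""
    ttft_s: float       # time to first token: queueing + prefill, in seconds
    latency_s: float    # end-to-end latency, in seconds
    output_tokens: int  # number of generated tokens


def decode_tok_per_s(records: Sequence[RequestTiming], warmup: int = 0) -> float:
    """Aggregate decode throughput, skipping the first `warmup` requests."""
    window = records[warmup:]
    # The first token is attributed to prefill, so only the remaining
    # tokens count as decode work.
    decode_tokens = sum(max(r.output_tokens - 1, 0) for r in window)
    decode_time = sum(r.latency_s - r.ttft_s for r in window)
    if decode_tokens == 0 or decode_time <= 0:
        raise ValueError("window has no decode work to measure")
    return decode_tokens / decode_time


# Synthetic example: 3 slow requests right after launch, then 20 warm ones.
cold = [RequestTiming(ttft_s=0.5, latency_s=3.0, output_tokens=101)] * 3
warm = [RequestTiming(ttft_s=0.2, latency_s=1.45, output_tokens=101)] * 20

print(f"first requests only: {decode_tok_per_s(cold):.1f} tok/s")
print(f"after warmup:        {decode_tok_per_s(cold + warm, warmup=3):.1f} tok/s")
```

How many requests to discard is setup-dependent; the sketch leaves it as a parameter rather than guessing a number.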
Longer writeup of how we caught it: https://sovgrid.org/blog/catching-your-benchmark-lying-three-measurement-traps/