Docs suggestion: cold-start TPOT undersells steady-state by ~38% (43 vs 69 tok/s) until the draft path warms #135

Observation from production use on a DGX Spark (GB10), vLLM, Qwen3.6-35B-A3B int4 with the DFlash draft at k=3.

Measuring decode speed from vLLM's TPOT metric over the first handful of requests after launch gives ~43 tok/s. Under a steady stream of requests, the same metric settles at ~69 tok/s, which matches an independent prefill-separated measurement of the same setup. The cold reading is not noise: it reproduces, and it is exactly the kind of number people quote in "X is slow" reports.
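For anyone who wants to check the size of the gap, here is a small sketch of the arithmetic. It assumes TPOT is reported in seconds per output token, so tok/s is simply its reciprocal; the helper names are made up for illustration, and the inputs are the rounded 43 and 69 tok/s figures from above, not fresh measurements.

```python
def tokens_per_second(tpot_seconds: float) -> float:
    """Convert time-per-output-token (seconds) into decode throughput (tok/s)."""
    return 1.0 / tpot_seconds


def undersell_fraction(cold_tps: float, warm_tps: float) -> float:
    """Fraction by which the cold reading falls short of the warm one."""
    return 1.0 - cold_tps / warm_tps


# Rounded figures from this issue (hypothetical variable names).
cold_tps = 43.0
warm_tps = 69.0

gap = undersell_fraction(cold_tps, warm_tps)
cold_tpot_ms = 1000.0 / cold_tps
warm_tpot_ms = 1000.0 / warm_tps

print(f"cold TPOT ~{cold_tpot_ms:.1f} ms, warm TPOT ~{warm_tpot_ms:.1f} ms")
print(f"cold reading undersells steady state by {gap:.1%}")
```

With the rounded inputs the shortfall comes out a little under 38%; seen from the other side, the warm number is about 60% higher than the cold one.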

Might be worth one line in the README: benchmark after warmup over a real request window, not on the first requests after launch.
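A minimal sketch of what "after warmup over a real request window" could look like in a benchmark script. It assumes you already log, per request and in arrival order, the number of output tokens and the decode time with prefill excluded. The function name, the `warmup_requests` parameter, and the sample data are all hypothetical; the data is synthetic and only shaped to mirror the 43 vs 69 tok/s figures.

```python
from typing import List, Tuple

# One entry per request: (output_tokens, decode_seconds), in arrival order.
Sample = Tuple[int, float]


def steady_state_tps(samples: List[Sample], warmup_requests: int) -> float:
    """Aggregate decode tok/s over the window left after dropping warmup requests."""
    window = samples[warmup_requests:]
    if not window:
        raise ValueError("no samples left after discarding warmup requests")
    total_tokens = sum(tokens for tokens, _ in window)
    total_seconds = sum(seconds for _, seconds in window)
    return total_tokens / total_seconds


# Synthetic log: 5 cold requests at 43 tok/s, then 50 warm requests at 69 tok/s.
samples: List[Sample] = [(430, 10.0)] * 5 + [(690, 10.0)] * 50

first_requests_tps = steady_state_tps(samples[:5], warmup_requests=0)
warmed_tps = steady_state_tps(samples, warmup_requests=5)

print(f"first 5 requests: {first_requests_tps:.0f} tok/s")
print(f"after warmup:     {warmed_tps:.0f} tok/s")
```

How many requests to discard is setup-dependent; the 5 here is a placeholder, not a recommendation. Summing tokens and seconds over the window, rather than averaging per-request rates, keeps short requests from skewing the result.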

Longer writeup of how we caught it: https://sovgrid.org/blog/catching-your-benchmark-lying-three-measurement-traps/

