Docs suggestion: cold-start TPOT undersells steady-state by ~35% (43 vs 69 tok/s) until the draft path warms #135

@cipherfoxie

Observation from production use on a DGX Spark (GB10), vLLM, Qwen3.6-35B-A3B int4 with the DFlash draft at k=3.

Measuring decode speed from vLLM's TPOT (time per output token) metric right after launch, over the first handful of requests, gives ~43 tok/s. After a steady stream of requests, the same metric settles at ~69 tok/s, which matches an independent prefill-separated measurement of the same setup. The cold reading is not noise: it reproduces, and it is exactly the kind of number people quote in "X is slow" reports.

Might be worth one line in the README: benchmark after warmup over a real request window, not on the first requests after launch.
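
A rough sketch of what "benchmark after warmup over a real request window" could look like in a measurement script. Everything here is hypothetical: `RequestSample`, `decode_tok_per_s`, and the `warmup_requests` parameter are illustrative names, not part of vLLM, and the sample values are synthetic numbers chosen only to mirror the 43 vs 69 tok/s readings above. It assumes you already have per-request output token counts and decode-only times (prefill excluded) from your own logging.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RequestSample:
    """One finished request (hypothetical record, not a vLLM type)."""
    output_tokens: int      # tokens generated during decode
    decode_seconds: float   # decode wall time only, prefill excluded


def decode_tok_per_s(samples: Sequence[RequestSample],
                     warmup_requests: int = 0) -> float:
    """Aggregate decode throughput over the samples left after
    discarding the first `warmup_requests` requests."""
    window = samples[warmup_requests:]
    if not window:
        raise ValueError("no samples left after discarding warmup")
    total_tokens = sum(s.output_tokens for s in window)
    total_seconds = sum(s.decode_seconds for s in window)
    if total_seconds <= 0:
        raise ValueError("decode time must be positive")
    return total_tokens / total_seconds


# Synthetic data: 5 cold requests followed by 20 warm ones.
cold = [RequestSample(output_tokens=430, decode_seconds=10.0)] * 5
warm = [RequestSample(output_tokens=690, decode_seconds=10.0)] * 20
run = cold + warm

first_requests_only = decode_tok_per_s(run[:5])           # cold reading
after_warmup = decode_tok_per_s(run, warmup_requests=5)   # steady state
whole_run = decode_tok_per_s(run)                         # mixed reading
```

With this synthetic data the first-requests reading is 43.0 tok/s and the post-warmup reading is 69.0 tok/s, while averaging over the whole run lands in between. How many requests count as warmup is an assumption to tune per setup; the value 5 here is arbitrary.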

Longer writeup of how we caught it: https://sovgrid.org/blog/catching-your-benchmark-lying-three-measurement-traps/
