Help me understand these inconsistent performance metrics with oQ #371

eokic · 2026-03-24T13:26:18Z

In the new "Intelligence Benchmark" it would appear that the standard Qwen3.5-0.8B is noticeably faster than the quantized models. But the "Performance Benchmark" shows a different story, or at least a more nuanced one, if "PP" is the main factor. How do I interpret this discrepancy?

Model: Qwen3.5-0.8B-oQ4+
Benchmark         Accuracy   Correct   Total   Time(s)
------------------------------------------------------
MMLU                 32.0%        32     100      57.3
HELLASWAG            38.0%        38     100      36.5
TRUTHFULQA           25.0%        25     100      33.7

Model: Qwen3.5-0.8B
Benchmark         Accuracy   Correct   Total   Time(s)
------------------------------------------------------
MMLU                 31.0%        31     100      36.3
HELLASWAG            39.0%        39     100      19.8
TRUTHFULQA           25.0%        25     100      17.3

Model: Qwen3.5-0.8B-OptiQ-4bit
Benchmark         Accuracy   Correct   Total   Time(s)
------------------------------------------------------
MMLU                 30.0%        30     100      57.5
HELLASWAG            39.0%        39     100      40.1
TRUTHFULQA           24.0%        24     100      36.3

JuiceB0xC0de · 2026-06-07T08:39:30Z

PP (prompt processing) and TG (token generation) stress different things. Prompt processing measures how fast the model ingests your input tokens. This is compute-bound: the more compute you have, the faster it goes. Quantized models lose out here because of dequantization overhead.

4-bit weights > f16 conversion > matrix math > results
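
A rough sketch of that pipeline in plain Python, to show where the extra step sits. This is only an illustration under simple assumptions: one shared scale per weight list and symmetric signed 4-bit values. The function names are made up for this example, and real quantization schemes (including whatever oQ does internally) are more involved.

```python
def quantize_4bit(weights):
    """Map floats to signed 4-bit integers (-8..7) using one shared scale.

    Assumption for illustration: symmetric quantization, a single scale
    for the whole list. Real schemes usually use per-group scales.
    """
    largest = max(abs(w) for w in weights)
    scale = largest / 7 if largest > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """The extra conversion step a quantized model pays before the math."""
    return [v * scale for v in q]


def dot(a, b):
    """Stand-in for the matrix math."""
    return sum(x * y for x, y in zip(a, b))


weights = [0.170087, -0.52, 0.3141, 0.05]      # made-up example weights
activations = [1.0, 2.0, -1.0, 0.5]            # made-up example inputs

q, scale = quantize_4bit(weights)              # 4-bit weights
restored = dequantize(q, scale)                # float conversion
result_quant = dot(restored, activations)      # matrix math > results
result_full = dot(weights, activations)        # unquantized path, no conversion

max_err = max(abs(w - r) for w, r in zip(weights, restored))
print("4-bit codes:", q)
print("largest per-weight rounding error:", max_err)
print("full-precision result:", result_full, "quantized result:", result_quant)
```

The unquantized path skips `dequantize` entirely, which is the overhead that shows up in PP numbers.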

Token generation measures how fast output tokens are generated one at a time, which depends mostly on memory bandwidth, since the model's weights have to be read for every token. Small quantized models fit better in cache and are quicker in this benchmark. Where the standard model wins on intelligence accuracy, it is because the quantized models lose precision in their weights from the reduced number of bits per value. Quantized weights end up going from something like 0.170087 to 0.17, so the model has less granularity to work with.
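
For a feel of the memory bandwidth point, here is a back-of-the-envelope estimate. It assumes every weight is read once per generated token and ignores the KV cache, activations, quantization metadata, and cache effects, so treat it as a loose upper bound. The 100 GB/s bandwidth is a hypothetical number, not a measurement of any machine, and the parameter count is just taken from the "0.8B" in the model name.

```python
def weight_bytes(n_params, bits_per_weight):
    """Approximate size of the weights in bytes."""
    return n_params * bits_per_weight / 8


def max_tokens_per_second(n_params, bits_per_weight, bandwidth_bytes_per_s):
    """Upper bound on TG speed if reading the weights is the only cost."""
    return bandwidth_bytes_per_s / weight_bytes(n_params, bits_per_weight)


N_PARAMS = 0.8e9    # approximate, from the "0.8B" model name
BANDWIDTH = 100e9   # hypothetical 100 GB/s memory bandwidth

f16_bound = max_tokens_per_second(N_PARAMS, 16, BANDWIDTH)
q4_bound = max_tokens_per_second(N_PARAMS, 4, BANDWIDTH)

print(f"f16 weights:   {weight_bytes(N_PARAMS, 16) / 1e9:.1f} GB, bound {f16_bound:.1f} tok/s")
print(f"4-bit weights: {weight_bytes(N_PARAMS, 4) / 1e9:.1f} GB, bound {q4_bound:.1f} tok/s")
```

Under these assumptions the 4-bit model has a 4x higher ceiling for TG, which is the opposite of what happens in PP, where the dequantization work dominates.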
