Replies: 1 comment
PP and TG measure two different bottlenecks. Prompt processing (PP) measures how fast the model ingests your input tokens. This is compute-bound: the more compute you have, the faster it goes. Quantized models lose out here because of dequantization overhead: 4-bit weights > f16 conversion > matrix math > results.

Token generation (TG) measures how fast output tokens are produced one at a time, which depends mostly on the model's size and on memory bandwidth. Small quantized models fit better in cache and are quicker in this benchmark.

The standard model wins on intelligence (accuracy) because quantized models lose precision in their weights when the number of bits per weight is reduced. A quantized network's weights end up going from something like 0.170087 to something like 0.17, so the model has less numerical granularity to work with.
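To make the precision point concrete, here is a rough sketch of a 4-bit round trip in plain Python. It assumes a simple symmetric scheme with one shared scale for the whole list; the function names and the sample weights are made up for illustration, and real quantization formats are more elaborate (they typically use per-block scales). The `dequantize` step is also the extra work that has to happen before the matrix math during prompt processing.

```python
def quantize_4bit(weights):
    """Symmetric 4-bit quantization (illustrative assumption, not a real format):
    map each float to an integer code in [-7, 7] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Convert integer codes back to floats; this runs before the matrix math."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only to show the rounding effect.
weights = [0.170087, -0.412300, 0.058810, 0.700000]

codes, scale = quantize_4bit(weights)   # codes == [2, -4, 1, 7]
restored = dequantize(codes, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]

for w, r, e in zip(weights, restored, errors):
    print(f"original={w:+.6f}  restored={r:+.6f}  error={e:.6f}")
```

Each restored weight can be off by up to half a quantization step (`scale / 2`), and that rounding error is what costs accuracy on the intelligence benchmark.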
In the new "Intelligence Benchmark" it would appear that the standard Qwen3.5-0.8B is noticeably faster than the quantized models. But the "Performance Benchmark" shows a different story, or at least a more nuanced one, if "PP" is the main factor. How do I interpret this discrepancy?