@iango no no no, I strongly recommend TurboQuant+; an 8K context takes only 152 MB:
llama_memory_breakdown_print: | memory breakdown [MiB]  | total    free     self   model   context   compute   unaccounted |
llama_memory_breakdown_print: |   - MTL0 (Apple M1 Pro) | 13000 =  2666 + (10332 =  9075 +     152 +    1104) +           0 |
llama_memory_breakdown_print: |   - Host                |                   1062 =  1030 +       0 +      32               |
ggml_metal_free: deallocating
Link:
https://github.com/TheTom/turboquant_plus/blob/main/README.md

Qwen3.5-9B-Q8_0.GGUF with an 8K context, and there's still RAM to spare!
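For reference, a minimal sketch of the kind of invocation that produces a breakdown like the one above, assuming a llama.cpp build (the log lines above are llama.cpp/ggml output); the model path and the -ngl value are placeholders, adjust for your own machine:

```sh
# Load the Q8_0 GGUF with an 8K context. Model path and -ngl are placeholders.
./llama-cli \
  -m ./models/Qwen3.5-9B-Q8_0.GGUF \
  -c 8192 \
  -ngl 99 \
  -p "hello"
# -c 8192 requests the 8K context window; -ngl 99 offloads all layers to Metal.
```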
It's now running as a headless server that I SSH into; with the GUI overhead gone, the context window can be pushed a bit higher.
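A rough sketch of that headless setup, assuming llama.cpp's llama-server binary; the username, hostname, and port are hypothetical placeholders:

```sh
# On the Mac: serve the model over HTTP with llama.cpp's llama-server.
./llama-server -m ./models/Qwen3.5-9B-Q8_0.GGUF -c 8192 --host 0.0.0.0 --port 8080

# From another machine: forward the port over SSH...
ssh -N -L 8080:localhost:8080 user@your-mac

# ...then talk to the OpenAI-compatible endpoint as if it were local:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"hi"}]}'
```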