Tags: cchuter/ds4
mgpu v0.1.0 Multi-GPU performance branch: prefill tensorcore, decode split-KV/flash-decode attention, routed-MoE gate/up decode launch geometry, q8->f16 cache reserve.