From-scratch C++/CUDA LLM inference engine for the NVIDIA RTX 5090 (sm_120a). The fastest single-user inference on the 5090: faster decode than llama.cpp, at-or-ahead of vLLM on NVFP4, and the only engine running native NVFP4 on consumer Blackwell. 100% written by Claude Code.
Honest precision-spectrum GEMM roofline on NVIDIA Blackwell (sm_120), measured at every precision from FP32 down to FP4. FP4 (NVFP4) runs at 20× FP32 via cuBLASLt, anchored to the computed peak. No inflated numbers.