fp4

Here are 2 public repositories matching this topic...

kekzl / imp

From-scratch C++/CUDA LLM inference engine for the NVIDIA RTX 5090 (sm_120a). The fastest single-user inference on the 5090: faster decode than llama.cpp, at-or-ahead of vLLM on NVFP4, and the only engine running native NVFP4 on consumer Blackwell. 100% written by Claude Code.

Updated Jun 13, 2026
Cuda

QuantumDrizzy / BLACKWALL

Honest precision-spectrum GEMM roofline on NVIDIA Blackwell (sm_120) — FP32 → FP4, measured. FP4 (nvfp4) at 20× FP32 via cuBLASLt, anchored to the computed peak. No inflated numbers.

hpc gpu cuda gemm roofline low-precision cublaslt tensor-cores blackwell fp8 fp4

Updated Jun 9, 2026
Cuda
