NVFP4 GEMM benchmarks on Modal

cuBLAS Lt vs torch._scaled_mm for NVFP4 on B200.

Folders

b200_nvfp4_initial/ is the naive bench, default inductor backends.

b200_nvfp4_autotune/ sets _inductor.config.max_autotune_gemm_backends = "ATEN,CUTLASS" so the autotuner dispatches _scaled_mm through cuBLAS Lt. Triton is skipped.

Setup

pip install modal
modal setup

Run

cd b200_nvfp4_initial      # or b200_nvfp4_autotune
modal run bench.py::run                              # default: --compile-mode default
modal run bench.py::run --compile-mode max-autotune
modal run bench.py::run --compile-mode reduce-overhead  # capture via cuda graph

Timing

First run: 10-15 min (image build + cold B200 container + 7-shape sweep). Subsequent runs reuse the cached image and finish in 4-6 min.

--compile-mode max-autotune dominates wall time (autotune sweep per shape). The default --compile-mode default runs much faster.

Output

Per shape: cuBLAS default us/TFLOPS, cuBLAS tuned us/TFLOPS, torch eager us/TFLOPS, torch compile us/TFLOPS, and a side-by-side of the cuBLAS Lt algorithm IDs both paths picked.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
b200_nvfp4_autotune		b200_nvfp4_autotune
b200_nvfp4_initial		b200_nvfp4_initial
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

NVFP4 GEMM benchmarks on Modal

Folders

Setup

Run

Timing

Output

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

NVFP4 GEMM benchmarks on Modal

Folders

Setup

Run

Timing

Output

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages