Reproducible benchmark suite and tuned Triton fused-MoE configs for LLM inference on NVIDIA H20. Includes 24 configs and 36 performance data points, with a geometric-mean speedup of 1.09× and a peak speedup of 1.74×.
benchmark triton moe hopper h20 kernel-tuning mixture-of-experts bf16 llm fp8 vllm llm-inference qwen mixtral deepseek sglang triton-kernels nvidia-h20 fused-moe
Updated Jun 2, 2026 - Python