Here are 15 public repositories matching this topic.
A library for accelerating Transformer models on NVIDIA GPUs, with support for 8-bit and 4-bit floating-point (FP8 and FP4) precision on Hopper, Ada, and Blackwell GPUs, providing better performance and lower memory utilization in both training and inference.
Python · Updated Apr 17, 2026
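Most entries below revolve around the FP8 E4M3 format (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7, max normal 448). As a minimal illustration of what that precision means, and not code from any listed repository, here is a pure-Python round-to-nearest E4M3 quantizer:

```python
import math

def quantize_e4m3(x: float) -> float:
    """Round x to the nearest FP8 E4M3 value
    (1 sign, 4 exponent, 3 mantissa bits; bias 7; max normal 448)."""
    if x == 0.0 or math.isnan(x):
        return x
    sign = math.copysign(1.0, x)
    a = abs(x)
    if a > 448.0:                # OCP E4M3 saturates at +/-448
        return sign * 448.0
    m, e = math.frexp(a)         # a = m * 2**e, with 0.5 <= m < 1
    exp = max(e - 1, -6)         # clamp into the subnormal range (min exponent -6)
    step = 2.0 ** (exp - 3)      # spacing between adjacent E4M3 values
    return sign * round(a / step) * step
```

Note how coarse the grid is: 0.3 rounds to 0.3125, and anything above 448 saturates. That is why the scaling recipes in the libraries below matter so much.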
Microsoft Automatic Mixed Precision Library
Python · Updated Dec 1, 2025
Flux diffusion model implementation using quantized FP8 matmul; the remaining layers use faster half-precision accumulation, making it ~2x faster on consumer devices.
Python · Updated Oct 12, 2024
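Per-tensor dynamic scaling is the usual way quantized FP8 matmuls retain accuracy: pick a scale so the tensor's absolute max maps onto the FP8 maximum, multiply in low precision, then divide both scales back out of the full-precision result. A toy sketch under that assumption (the function names and the integer grid standing in for FP8 rounding are mine, not this repository's code):

```python
E4M3_MAX = 448.0  # max normal value of FP8 E4M3

def quantize_tensor(t):
    """Scale a matrix so its absolute max hits E4M3_MAX, then round.
    (Rounding to an integer grid is a coarse stand-in for FP8 rounding.)"""
    amax = max(abs(v) for row in t for v in row) or 1.0
    scale = E4M3_MAX / amax
    return [[round(v * scale) for v in row] for row in t], scale

def fp8_matmul(a, b):
    """Multiply the quantized operands, then undo both scales on the result."""
    qa, sa = quantize_tensor(a)
    qb, sb = quantize_tensor(b)
    n, k, m = len(qa), len(qb), len(qb[0])
    return [[sum(qa[i][p] * qb[p][j] for p in range(k)) / (sa * sb)
             for j in range(m)] for i in range(n)]
```

The dequantization step (dividing by `sa * sb`) is why FP8 matmuls accumulate into a wider type before scaling back.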
JAX Scalify: end-to-end scaled arithmetics
Python · Updated Oct 30, 2024
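The idea behind "end-to-end scaled arithmetics" is to carry every tensor as a (data, scale) pair and propagate the scales through each operation, so the data part stays near unit magnitude where low-precision formats are accurate. A toy single-value sketch (`ScaledValue` and these propagation rules are my illustration, not Scalify's actual JAX API):

```python
from dataclasses import dataclass

@dataclass
class ScaledValue:
    data: float   # kept near unit magnitude
    scale: float  # true value = data * scale

def scaled_mul(a: ScaledValue, b: ScaledValue) -> ScaledValue:
    # multiply data and scales separately; `data` never grows large
    return ScaledValue(a.data * b.data, a.scale * b.scale)

def scaled_add(a: ScaledValue, b: ScaledValue) -> ScaledValue:
    # rebase both operands onto the larger scale before adding
    s = max(a.scale, b.scale)
    return ScaledValue(a.data * (a.scale / s) + b.data * (b.scale / s), s)

def value(v: ScaledValue) -> float:
    """Decode back to an ordinary float."""
    return v.data * v.scale
```

Multiplying two values with scale 2**10 yields a result with scale 2**20 while its `data` field stays at 1.5, which is exactly the property that keeps low-precision storage in range.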
Automated Triton w8a8 block FP8 kernel tuning tool for vLLM. Auto-detects model architecture, supports Qwen3-Coder-30B-A3B-Instruct-FP8/DeepSeek-V3/custom models, multi-GPU parallel tuning, and generates optimized kernel configs for quantization.
Python · Updated Oct 31, 2025
FP8 Metal compute kernels for Apple Silicon MPS — fixing what PyTorch doesn't support yet. FLUX/SD3.5/ComfyUI on Mac.
Python · Updated Feb 8, 2026
Hybrid Mamba-2 + Transformer 2.94B LLM (Nemotron-H style) — Korean 3B model pretrained from scratch on 7× NVIDIA B200 GPUs with SFT + DPO alignment
Python · Updated Mar 26, 2026
High-performance Triton ops: RMSNorm+RoPE fusion, gated MLP fusion, and FP8 quantized GEMM, optimized for Transformers.
Python · Updated Mar 22, 2026
Cog single-GPU quantized implementation of Step-Video-T2V
Python · Updated Feb 25, 2025
Generate Spike extensions, assembly tests, SVA assertions & docs for custom RISC-V AI vector instructions from YAML specs. Bit-accurate FP8/BF16/INT4 numerics.
Python · Updated Mar 31, 2026
LLaMA-Factory FP8 training environment for NVIDIA Hopper GPUs. Fixes common configuration issues causing 2x slowdown with FP8 mixed precision.
Python · Updated Dec 31, 2025
First open-source FP8 linear solver for consumer NVIDIA GPUs — 2-3x faster than cuBLAS FP64. pip install ssblast
Python · Updated Mar 15, 2026
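Low-precision linear solvers typically beat FP64 via mixed-precision iterative refinement: do the expensive solve in low precision, then correct it with cheap high-precision residuals. A 2x2 sketch of that idea (a generic textbook scheme, not ssblast's actual algorithm; the 3-mantissa-bit truncation stands in for FP8 E4M3):

```python
import math

def low_precision(x: float, bits: int = 3) -> float:
    """Keep only `bits` mantissa bits (E4M3 keeps 3), a stand-in for FP8."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)
    return math.ldexp(round(m * 2**bits) / 2**bits, e)

def solve_2x2_lowprec(A, rhs):
    """Cramer's rule with every operand truncated to low precision."""
    a, b, c, d = (low_precision(v) for v in (A[0][0], A[0][1], A[1][0], A[1][1]))
    r0, r1 = low_precision(rhs[0]), low_precision(rhs[1])
    det = a * d - b * c
    return [(r0 * d - b * r1) / det, (a * r1 - c * r0) / det]

def refine(A, b, iters=5):
    """Low-precision solve, then FP64 residual corrections."""
    x = solve_2x2_lowprec(A, b)
    for _ in range(iters):
        # residual is computed in full FP64; only the correction solve is cheap
        r = [b[i] - sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]
        dx = solve_2x2_lowprec(A, r)
        x = [x[i] + dx[i] for i in range(2)]
    return x
```

For A = [[4, 1], [2, 3]] and b = [9, 13], the 3-bit solve alone returns [1.2, 3.2] (because 9 and 13 round to 8 and 12), while one refinement step recovers the true solution [1.4, 3.4].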
High-performance Triton kernels for NVIDIA H100. Implements fused FP8 LayerNorm, tiled FlashAttention, and SRAM-optimized memory primitives for Hopper architecture.
Python · Updated Apr 3, 2026
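A fused FP8 LayerNorm performs normalization and output quantization in a single pass, so the activation never round-trips through HBM at full precision. What that fusion computes, in plain Python (a conceptual sketch of the math only, not the repository's Triton kernel; the integer rounding is a coarse stand-in for FP8):

```python
import math

def layernorm_fp8(x, eps=1e-5, fp8_max=448.0):
    """Normalize one row, then quantize the output with a per-row FP8 scale."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    y = [(v - mean) * inv_std for v in x]
    # dynamic per-row scale: map the row's abs-max onto the FP8 max (E4M3: 448)
    amax = max(abs(v) for v in y) or 1.0
    scale = fp8_max / amax
    q = [round(v * scale) for v in y]   # coarse stand-in for FP8 rounding
    return q, 1.0 / scale               # dequantization scale for the next op
```

In a real fused kernel both the statistics and the quantization happen while the row is still resident in SRAM, which is the memory-traffic win the description refers to.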
Runtime monkey-patches for vLLM 0.19.x — enabling Qwen3.6-35B-A3B-FP8 (MoE) on NVIDIA Ampere GPUs (SM 80-86) with TurboQuant KV cache and 160k context
Python · Updated Apr 19, 2026
Korean 3B LLM (pure Transformer) pretrained from scratch on 8× NVIDIA B200 GPUs with SFT + ORPO alignment
Python · Updated Mar 26, 2026