fp8

Here are 15 public repositories matching this topic...

NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory utilization in both training and inference.

python machine-learning deep-learning gpu cuda pytorch jax fp8 fp4

Updated Apr 17, 2026
Python

Azure / MS-AMP

Microsoft Automatic Mixed Precision Library

deep-learning gpu amp pytorch transformer mixed-precision fp8

Updated Dec 1, 2025
Python

aredden / flux-fp8-api

Flux diffusion model implementation that uses quantized FP8 matmuls, with the remaining layers using faster half-precision accumulation; roughly 2x faster on consumer devices.

flux pytorch quantization diffusion fast-inference fp8

Updated Oct 12, 2024
Python

graphcore-research / jax-scalify

JAX Scalify: end-to-end scaled arithmetics

jax low-precision llm fp8

Updated Oct 30, 2024
Python

massif-01 / vllm_benchmark_block_fp8

Automated Triton w8a8 block FP8 kernel tuning tool for vLLM. Auto-detects model architecture, supports Qwen3-Coder-30B-A3B-Instruct-FP8/DeepSeek-V3/custom models, multi-GPU parallel tuning, and generates optimized kernel configs for quantization.

triton performance-tuning kernel-tuning fp8 vllm

Updated Oct 31, 2025
Python

tashiscool / fp8-mps-metal

FP8 Metal compute kernels for Apple Silicon MPS — fixing what PyTorch doesn't support yet. FLUX/SD3.5/ComfyUI on Mac.

flux metal pytorch mps quantization apple-silicon fp8 stable-diffusion comfyui m4-pro

Updated Feb 8, 2026
Python

pathcosmos / EVAFRILL-Mo

Hybrid Mamba-2 + Transformer 2.94B LLM (Nemotron-H style) — Korean 3B model pretrained from scratch on 7× NVIDIA B200 GPUs with SFT + DPO alignment

transformer sft dpo pretraining fp8 korean-llm nemotron hybrid-architecture mamba2 nvidia-b200

Updated Mar 26, 2026
Python

LessUp / triton-fused-ops

High-Performance Triton Ops: RMSNorm+RoPE Fusion, Gated MLP Fusion & FP8 Quantized GEMM, optimized for Transformers

python deep-learning cuda pytorch triton gpu-computing fp8 operator-fusion

Updated Mar 22, 2026
Python

zsxkib / cog-step-video-t2v

Cog Single GPU Quantized Implementation of Step-Video-T2V

replicate single-gpu fp8 h100 step-video-t2v diffsynth

Updated Feb 25, 2025
Python

jyrj / rvxv

Generate Spike extensions, assembly tests, SVA assertions & docs for custom RISC-V AI vector instructions from YAML specs. Bit-accurate FP8/BF16/INT4 numerics.

verification code-generation quantization risc-v hardware-verification bfloat16 fp8 spike-simulator ai-accelerator vector-instructions

Updated Mar 31, 2026
Python

sbhavani / llamafactory-fp8-hopper

LLaMA-Factory FP8 training environment for NVIDIA Hopper GPUs. Fixes common configuration issues that cause a 2x slowdown with FP8 mixed precision.

deep-learning pytorch nvidia hopper performance-optimization fp8 h100 llama-factory gh200 transformer-engine

Updated Dec 31, 2025
Python

Sharveswar007 / SSBLAST

First open-source FP8 linear solver for consumer NVIDIA GPUs — 2-3x faster than cuBLAS FP64. pip install ssblast

python machine-learning hpc gpu cuda nvidia triton cupy numerical-computing fp8 linear-solver rtx-4050

Updated Mar 15, 2026
Python

pauliano22 / triton-gpu-kernels

High-performance Triton kernels for NVIDIA H100. Implements fused FP8 LayerNorm, tiled FlashAttention, and SRAM-optimized memory primitives for Hopper architecture.

parallel-computing cuda triton gpu-kernels fp8 h100 deep-learning-optimization llm-infrastructure

Updated Apr 3, 2026
Python

Sandermage / genesis-vllm-patches

Runtime monkey-patches for vLLM 0.19.x — enabling Qwen3.6-35B-A3B-FP8 (MoE) on NVIDIA Ampere GPUs (SM 80-86) with TurboQuant KV cache and 160k context

cuda nvidia moe ampere fp8 vllm llm-inference qwen turboquant

Updated Apr 19, 2026
Python

pathcosmos / FRANKENSTALLM

Korean 3B LLM (pure Transformer) pretrained from scratch on 8× NVIDIA B200 GPUs with SFT + ORPO alignment

transformer sft gqa pretraining fp8 korean-llm flash-attention gguf orpo nvidia-b200

Updated Mar 26, 2026
Python
