Narrow Precision Training
- Quantized Training in FP4(8): Concepts and a Reference PyTorch Implementation Using cuBLASLt and Microxcaling.
- An Unofficial, Early Benchmark of NVIDIA's NVFP4 Training on 8xB200 (Blackwell).
- A PoC nvfp4-forward + mxfp8-backward recipe in Transformer Engine, faster than nvfp4 QAT.
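The common idea behind the FP4 items above is block-scaled ("microscaling") quantization: each small block of values shares one scale so its maximum maps onto the top of the FP4 grid. Below is a toy NumPy sketch of fake-quantization to an FP4 (E2M1) grid; the block size, scale encoding, and nearest rounding are simplifications for illustration, not the cuBLASLt or Transformer Engine implementation, and `fake_quantize_fp4` is a hypothetical helper name.

```python
import numpy as np

# Representable magnitudes of FP4 E2M1 (sign is handled separately)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_fp4(x, block=16):
    """Round x to a block-scaled FP4 grid, then dequantize (toy sketch)."""
    xb = x.reshape(-1, block)
    # one scale per block so the block's max magnitude maps to 6.0
    scale = np.abs(xb).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale = np.where(scale == 0, 1.0, scale)
    mag = np.abs(xb) / scale
    # nearest-grid-point rounding of the scaled magnitudes
    idx = np.abs(mag[..., None] - FP4_GRID).argmin(axis=-1)
    deq = np.sign(xb) * FP4_GRID[idx] * scale
    return deq.reshape(x.shape)
```

Real NVFP4 uses 16-element blocks with FP8 (E4M3) scales and MXFP4 uses 32-element blocks with power-of-two scales, but the round-trip structure is the same.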
Distributed & Parallel
- Megatron, Transformed! A Hands-on Megatron-LM Tutorial on Replicating Empirical Trends in Distributed Training and Model Parallelism.
- A Quick Visual Rundown of MLPerf Training v5.1, covering only the new Llama3.1-8B and Flux.1 benchmarks.
Model Optimization for Efficient Inference
- Post-Training Statistical Calibration for Higher Activation Sparsity [ENLSP 2024 Spotlight 7, Paper, Oral, Code, Integrated]
- Pre-LLM-explosion work: a Unified HuggingFace Trainer for Joint Pruning, Quantization, and Distillation (JPQD), integrating OpenVINO NNCF and the OpenVINO runtime. 16x higher BERT serving throughput on Xeon Sapphire Rapids; see the MLPerf Inference 3.0 submission. Also applicable to vision and audio models.
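To illustrate the post-training calibration item above: the general idea is to pick per-channel magnitude thresholds from a small calibration set so that zeroing sub-threshold activations hits a target sparsity, with no retraining. This NumPy sketch shows only that general recipe; the function names are hypothetical and this is not the linked project's actual method or API.

```python
import numpy as np

def calibrate_thresholds(calib_acts, target_sparsity=0.5):
    """Per-channel magnitude thresholds from calibration activations.

    calib_acts: (num_samples, num_channels) activations collected on a
    small calibration set. Zeroing values at or below the returned
    threshold drives roughly `target_sparsity` of activations to zero.
    """
    return np.percentile(np.abs(calib_acts), target_sparsity * 100, axis=0)

def apply_sparsity(x, thresholds):
    # zero out small-magnitude activations, channel-wise
    return np.where(np.abs(x) > thresholds, x, 0.0)
```

At inference time the thresholds are fixed, so the sparsification step is a single elementwise compare-and-mask that a sparsity-aware runtime can exploit.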
Perhaps useful: dlbp, dockerhub, HuggingFace