Stars
Accelerating MoE with IO and Tile-aware Optimizations
LM engine is a library for pretraining and finetuning LLMs.
Zero Bubble Pipeline Parallelism
🎯 Say goodbye to information overload: AI helps you make sense of trending news, with simple public-opinion monitoring and analysis. A multi-platform trending-topic aggregator plus an MCP-based AI analysis tool. Monitors 35 platforms (Douyin, Zhihu, Bilibili, Wallstreetcn, Cailian Press, etc.), with smart filtering, automatic push, and conversational AI analysis (dig into news in natural language: trend tracking, sentiment analysis, similarity search, and more across 13 tools). Supports push via WeCom / personal WeChat / Feishu / DingTalk / Telegram / email / ntfy / bark / Slack, with phone notifications within 1 minute, no need to…
Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention
🔥 A minimal training framework for scaling FLA models
A high-performance Python-based I/O system for large (and small) deep learning problems, with strong support for PyTorch.
CUDA Python: Performance meets Productivity
Official PyTorch implementation for "Large Language Diffusion Models"
A series of GPU optimization topics that explains in detail how to optimize CUDA kernels, covering several basic kernel optimizations, including elementwise, reduce, s…
Complete solutions to the Programming Massively Parallel Processors Edition 4
Puzzles for learning Triton; play with minimal environment configuration!
A 120-day CUDA learning plan covering daily concepts, exercises, pitfalls, and references (including “Programming Massively Parallel Processors”). Features six capstone projects to solidify GPU par…
This repository is a curated collection of resources, tutorials, and practical examples designed to guide you through the journey of mastering CUDA programming. Whether you're just starting or look…
A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code.
A unified inference and post-training framework for accelerated video generation.
A simple, unified multimodal models training engine. Lean, flexible, and built for hacking at scale.
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
⚡️SwanLab - an open-source, modern-design AI training tracking and visualization tool. Supports Cloud / Self-hosted use. Integrated with PyTorch / Transformers / verl / LLaMA Factory / ms-swift / U…
How to optimize some algorithms in CUDA.
Flash Attention tutorial written in Python, Triton, CUDA, and CUTLASS.
Flash Attention in ~100 lines of CUDA (forward pass only)