Stars
Flash Attention from Scratch on CUDA Ampere
This is an implementation of flash attention from scratch, without importing any external libraries.
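For context, the computation such from-scratch attention kernels reproduce is scaled dot-product attention, softmax(QKᵀ/√d)·V. A minimal pure-Python sketch (no external libraries; it omits the tiling and online-softmax tricks that make FlashAttention memory-efficient, and the `attention` helper here is illustrative, not the repo's API):

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating, for numerical stability.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    Q, K, V are lists of row vectors (lists of floats)."""
    d = len(Q[0])
    out = []
    for q in Q:
        # Dot each query against every key, scaled by 1/sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        # Output row is the attention-weighted sum of the value rows.
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out
```

The naive version above materializes a full row of scores per query; FlashAttention-style kernels instead process keys/values in tiles and maintain a running softmax so the score matrix never hits global memory.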
Perplexity open source garden for inference technology
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.
"Build a Large Language Model (From Scratch)" is an e-book that explores the principles and implementation of large language models in depth, suited to learners who want to understand the architecture, training process, and application development of GPT-style models. To make this valuable textbook accessible to more Chinese readers, I decided to translate it into Chinese and share it openly on GitHub.
Ongoing research training transformer models at scale
Implement a PyTorch-like DL library in C++ from scratch, step by step
FlashMLA: Efficient Multi-head Latent Attention Kernels
😱 Dissects the underlying implementation of mainstream internet-industry technologies at the source-code level, helping developers deepen their technical understanding. Currently covers the Spring family, the MyBatis, Netty, and Dubbo frameworks, and middleware such as Redis and Tomcat.
Fast and memory-efficient exact attention
🚀🚀 「LLM」Train a 64M-parameter GPT completely from scratch in just 2 hours! 🌏
CV-CUDA™ is an open-source, GPU-accelerated library for cloud-scale image processing and computer vision.
Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞
My learning notes for ML SYS.
AIInfra (AI infrastructure) refers to the full AI-system stack, from underlying hardware such as chips up to the software layers that support training and inference of large AI models.
Shares AI Infra knowledge and coding exercises: getting started with the PyTorch/vLLM/SGLang frameworks ⚡️, performance acceleration 🚀, LLM fundamentals 🧠, AI hardware and software 🔧, and more.
Machine Learning Engineering Open Book
Source code for the book Real-Time C++, by Christopher Kormanyos
FlashInfer: Kernel Library for LLM Serving
An unofficial personal Chinese translation of "Template Metaprogramming with C++"
Chinese translation of "Designing Data-Intensive Applications" (DDIA), first and second editions
A high-throughput and memory-efficient inference and serving engine for LLMs
Tile primitives for speedy kernels
A self-learning tutorial for CUDA high-performance programming.