Starred repositories
📖 A Chinese translation of "C++ Concurrency in Action - SECOND EDITION".
A fast reverse proxy to help you expose a local server behind a NAT or firewall to the internet.
CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning
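Build large language models from scratch with only basic Python knowledge; build GLM4, Llama3, and RWKV6 step by step from scratch and gain a deep understanding of how large models work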
Implement a ChatGPT-like LLM in PyTorch from scratch, step by step
Official implementation of "Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding"
Official PyTorch implementation for "Large Language Diffusion Models"
Paper list, tutorial, and nano code snippets for Diffusion Large Language Models.
[TMLR 2025] Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
A curated list for Efficient Large Language Models
From Chain-of-Thought prompting to OpenAI o1 and DeepSeek-R1 🍓
📚A curated list of Awesome Diffusion Inference Papers with Codes: Sampling, Cache, Quantization, Parallelism, etc.🎉
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
CUDA Templates and Python DSLs for High-Performance Linear Algebra
Mirror of the Xen Repository (PRs not accepted; see: http://wiki.xenproject.org/wiki/Submitting_Xen_Project_Patches)
A high-throughput and memory-efficient inference and serving engine for LLMs
Let's write an OS which can run on RISC-V in Rust from scratch!
2023 Autumn/Winter Open-Source Operating System Training Camp
Virtual whiteboard for sketching hand-drawn like diagrams
Samples for CUDA developers that demonstrate features in the CUDA Toolkit
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
Several simple examples for popular neural network toolkits calling custom CUDA operators.
Machine Learning Engineering Open Book