Stars
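HPC tutorials covering collective communication (MPI, NCCL), CUDA programming, SIMD vectorization, RDMA communication, and more
Sharing AI Infra knowledge & code exercises: getting started with the PyTorch/vLLM/SGLang frameworks⚡️, performance acceleration🚀, LLM fundamentals🧠, AI software and hardware🔧, and more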
slime is an LLM post-training framework for RL Scaling.
ArcticInference: vLLM plugin for high-throughput, low-latency inference
CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge techniques in sparse architecture, speculative sampling and qua…
An Efficient and User-Friendly Scaling Library for Reinforcement Learning with Large Language Models
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
Distributed Compiler based on Triton for Parallel Systems
GSemSplat: Generalizable Semantic 3D Gaussian Splatting from Uncalibrated Image Pairs
Solve Visual Understanding with Reinforced VLMs
A very simple GRPO implementation for reproducing r1-like LLM thinking.
[BMVC 2025] Occam’s LGS: An Efficient Approach for Language Gaussian Splatting
Curated list of papers and resources focused on 3D Gaussian Splatting, intended to keep pace with the anticipated surge of research in the coming months.
Official implementation of the paper "LangSplat: 3D Language Gaussian Splatting" [CVPR2024 Highlight]
A curated list for Efficient Large Language Models
🎓Automatically Update LLM Inference Systems Papers Daily using GitHub Actions (Update Every 12 Hours)
Puzzles for learning Triton, play it with minimal environment configuration!
Efficient Triton Kernels for LLM Training
text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
How to optimize some algorithms in CUDA.
JackonYang / hands-on-tvm
Forked from mlc-ai/notebooks
Hands-on model tuning with TVM, with profiling on a Mac M1, x86 CPU, and GTX-1080 GPU.