Starred repositories
Source files to replicate experiments in my RLC 2025 paper.
CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning
Sionna Research Kit: A GPU-Accelerated Research Platform for AI-RAN
Allo Accelerator Design and Programming Framework (PLDI'24)
[ArXiv 2025] A curated list of papers on on-device large language models, focusing on model compression and system optimization techniques from the survey "On-Device Large Language Models: A Survey…
Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
cuTile is a programming model for writing parallel kernels for NVIDIA GPUs
Tile primitives for speedy kernels
DFlash: Block Diffusion for Flash Speculative Decoding
Miles is an enterprise-facing reinforcement learning framework for LLM and VLM post-training, forked from and co-evolving with slime.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
DeepEP: an efficient expert-parallel communication library
The simplest, fastest repository for training/finetuning medium-sized GPTs.
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
verl: Volcano Engine Reinforcement Learning for LLMs
Official inference framework for 1-bit LLMs
This repo releases the detailed benchmark code and results of Sea Labs AI.
ArcticInference: vLLM plugin for high-throughput, low-latency inference
Papers from the computer science community to read and discuss.
AIInfra (AI infrastructure) refers to the full AI system stack, from low-level hardware such as chips up to the software stack that supports training and inference of large AI models.
CrowdPose: Efficient Crowded Scenes Pose Estimation and A New Benchmark, CVPR 2019, Oral