- Sun Yat-sen University
- Guangzhou (UTC +08:00)
- https://gty111.github.io/info/
- https://orcid.org/0009-0005-2979-4486
Lists (19)
AI
Benchmark
Compiler & DSL
CV & CG
Diffusion
Framework
Hardware
HPC
Instrumentation & Reverse & Assemble
LAB
Math
NLP
Operating Systems
Recommendation
ROCm
Simulators
Template & Theme
Tools
Tutorial & Examples
Stars
gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
FlashMLA: Efficient Multi-head Latent Attention Kernels
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Official PyTorch implementation for "Large Language Diffusion Models"
Standardized Distributed Generative and Predictive AI Inference Platform for Scalable, Multi-Framework Deployment on Kubernetes
llm-d enables high-performance distributed LLM inference on Kubernetes
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
Kimi K2 is the large language model series developed by the Moonshot AI team.
verl: Volcano Engine Reinforcement Learning for LLMs
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
The official repo of Qwen (通义千问), the chat and pretrained large language models proposed by Alibaba Cloud.
Supercharge Your LLM with the Fastest KV Cache Layer
Distributed Compiler based on Triton for Parallel Systems
The official code for the paper: LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
[NeurIPS 2025] MMaDA - Open-Sourced Multimodal Large Diffusion Language Models
A bidirectional pipeline parallelism algorithm for computation-communication overlap in DeepSeek V3/R1 training.
Analyze computation-communication overlap in V3/R1.
DeepEP: an efficient expert-parallel communication library
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
My learning notes and code for ML systems (ML SYS).
A markdown version emoji cheat sheet
Get up and running with OpenAI gpt-oss, DeepSeek-R1, Gemma 3 and other models.
High performance Transformer implementation in C++.