Stars
A compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems.
Boosting RAG model and system performance with context reuse
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
Research and development (R&D) is crucial for the enhancement of industrial productivity, especially in the AI era, where the core aspects of R&D are mainly focused on data and models. We are commi…
🚀 Awesome System for Machine Learning ⚡️ AI System Papers and Industry Practice. ⚡️ System for Machine Learning, LLM (Large Language Model), GenAI (Generative AI). 🍻 OSDI, NSDI, SIGCOMM, SoCC, MLSy…
Chat log tool: easily use your own chat data
Zotero plugin to automatically move attachments and link them
⛷ Lightweight Markdown app to help you write great sentences.
Curated collection of papers in MoE model inference
Machine Learning Engineering Open Book
gpt-oss-120b and gpt-oss-20b are two open-weight language models by OpenAI
My learning notes for ML SYS.
Qwen3-Coder is the code version of Qwen3, the large language model series developed by Qwen team, Alibaba Cloud.
[EMNLP 2025] LightThinker: Thinking Step-by-Step Compression
The official implementation of "ML-Master: Towards AI-for-AI via Integration of Exploration and Reasoning"
AG-UI: the Agent-User Interaction Protocol. Bring Agents into Frontend Applications.
A sparse attention kernel supporting mix sparse patterns
A Flexible Framework for Experiencing Heterogeneous LLM Inference/Fine-tune Optimizations
A simple, cross-platform agent framework and tutorial
[ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring
[ICML'25] Our study systematically investigates massive values in LLMs' attention mechanisms. First, we observe that massive values are concentrated in low-frequency dimensions across different attentio…
Unified KV Cache Compression Methods for Auto-Regressive Models