Nanyang Technological University - Singapore
Stars
Train speculative decoding models effortlessly and port them smoothly to SGLang serving.
A compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems.
AIInfra (AI infrastructure) covers the AI system stack, from underlying hardware such as chips up to the software layers that support training and inference of large AI models.
dParallel: Learnable Parallel Decoding for dLLMs
Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels
Distributed MoE in a Single Kernel [NeurIPS '25]
[NeurIPS 2025 Oral] Infinity⭐️: Unified Spacetime AutoRegressive Modeling for Visual Generation
A Collection of Papers on Diffusion Large Language Models
[NeurIPS'25] dKV-Cache: The Cache for Diffusion Language Models
Official PyTorch implementation for "Large Language Diffusion Models"
FlashInfer: Kernel Library for LLM Serving
REAP: Router-weighted Expert Activation Pruning for SMoE compression
A sparse attention kernel supporting mix sparse patterns
Github Pages template based upon HTML and Markdown for personal, portfolio-based websites.
[NeurIPS 2025] Accelerating Parallel Diffusion Model Serving with Residual Compression
Ring attention implementation with flash attention
🚀 Efficient implementations of state-of-the-art linear attention models
[CVPR 2025 Oral] Infinity ∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
[ECCV 2024 Oral] Code for paper: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
This repository contains the official implementation of "FastVLM: Efficient Vision Encoding for Vision Language Models" - CVPR 2025
Official repository for VisionZip (CVPR 2025)
[ICML'25] Official implementation of paper "SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference" and "SparseVLM+: Visual Token Sparsification with Improved Text-Vis…