Repositories starred by Tom-CaoZH

Vortex: Programmable Sparse Attention for Agents as Algorithm Designers

Python 62 8 Updated Jun 8, 2026

UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)

C++ 1,421 158 Updated Jun 22, 2026

mKernel: fast multi-node, multi-GPU fused kernels

Cuda 239 22 Updated Jun 21, 2026
Python 8 1 Updated May 12, 2026

TokenSpeed is a speed-of-light LLM inference engine.

Python 1,477 164 Updated Jun 22, 2026

Fast Polar Decomposition for Muon

Python 157 13 Updated May 2, 2026

A lightweight inference engine supporting speculative speculative decoding (SSD).

Python 956 72 Updated May 10, 2026

MiroThinker is a deep research agent optimized for complex research and prediction tasks. The latest model, MiroThinker-1.7, achieves 74.0 and 75.3 on BrowseComp and BrowseComp Zh, respectively.

Python 8,308 641 Updated Apr 25, 2026

Distributed MoE in a Single Kernel [NeurIPS '25]

Cuda 268 38 Updated May 5, 2026

OpenTinker is an RL-as-a-Service infrastructure for foundation models

Python 675 63 Updated Mar 21, 2026

Efficient Long-context Language Model Training by Core Attention Disaggregation

Python 105 7 Updated Apr 7, 2026

Accelerating MoE with IO and Tile-aware Optimizations

Python 719 90 Updated Jun 15, 2026

A compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems.

Python 4,441 707 Updated May 17, 2026

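An AI-native slides (PPT) generator based on nano banana pro🍌, moving toward "Vibe PPT". It supports uploading any template image, uploading any source material with intelligent parsing, generating a PPT automatically from a single sentence, an outline, or page descriptions, editing specified regions by spoken instruction, and one-click export to an editable PPT.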

Python 15,009 1,749 Updated Jun 22, 2026

SC'25 UltraAttn: Efficiently Parallelizing Attention through Hierarchical Context-Tiling

Python 16 3 Updated Aug 14, 2025

Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serving systems.

Python 302 51 Updated May 14, 2026

Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding

Python 111 8 Updated Dec 2, 2025

Real-Time VLAs via Future-state-aware Asynchronous Inference.

Python 418 33 Updated Apr 22, 2026

A framework for efficient model inference with omni-modality models

Python 5,236 1,155 Updated Jun 22, 2026

[ICLR 2026] QeRL enables RL for 32B LLMs on a single H100 GPU.

Python 507 52 Updated Mar 30, 2026

[ASPLOS'26] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter

Python 173 18 Updated Feb 27, 2026

Miles is an enterprise-facing reinforcement learning framework for LLM and VLM post-training, forked from and co-evolving with slime.

Python 1,603 273 Updated Jun 22, 2026

An early-research-stage expert-parallel load balancer for MoE models, based on linear programming.

Python 504 37 Updated Nov 19, 2025

A framework for few-shot evaluation of language models.

Python 13,024 3,353 Updated Jun 2, 2026

[NeurIPS 2025] ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive

Cuda 73 6 Updated Dec 11, 2025

Nano vLLM

Python 14,136 2,241 Updated Apr 26, 2026
Python 447 36 Updated Feb 23, 2026