- Systems Group @ ETH Zurich
- Zürich
- https://www.linkedin.com/in/yong-jun-he-762485154/
Stars
Modern RL Post-training Infrastructure: Optimized for NVIDIA/AMD GPUs with a focus on vLLM and DeepSpeed integration, CUDA/ROCm/Triton kernels, and transparent hardware-aware scaling.
UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven).
InfiniCCL is a unified, cross-platform collective communication library designed for heterogeneous accelerator environments.
SGLang is a high-performance serving framework for large language models and multimodal models.
DeepEP: an efficient expert-parallel communication library
MLSys competition for the best MoE NKI kernels
Training and inference on AWS Trainium and Inferentia chips.
Google Research
LLM Inference analyzer for different hardware platforms
OpenTela is a decentralized compute fabric for running machine learning applications.
Tile-Based Runtime for Ultra-Low-Latency LLM Inference
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
Extremely fast Query Engine for DataFrames, written in Rust
Development repository for the Triton language and compiler
NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process communication and coordination overheads by allowing programmer…
An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ & VLM & TIS & vLLM & Ray & Async RL)
Low-Latency Transaction Scheduling via Userspace Interrupts: Why Wait or Yield When You Can Preempt? (SIGMOD 2025 Best Paper Award)
PyTorch domain library for recommendation systems
Graph Neural Network Library for PyTorch
verl/HybridFlow: A Flexible and Efficient RL Post-Training Framework
Fast and memory-efficient exact attention
A high-throughput and memory-efficient inference and serving engine for LLMs