Yongjun He (YongjunHe)

60 followers · 65 following

Systems Group @ ETH Zurich
Zürich
https://www.linkedin.com/in/yong-jun-he-762485154/

Stars

RL-Align / RL-Kernel

Modern RL Post-training Infrastructure: Optimized for NVIDIA/AMD GPUs with a focus on vLLM and DeepSpeed integration, CUDA/ROCm/Triton kernels, and transparent hardware-aware scaling.

Python 122 20 Updated Jun 13, 2026

uccl-project / uccl

UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)

C++ 1,415 157 Updated Jun 12, 2026

InfiniTensor / InfiniCCL

InfiniCCL is a unified, cross-platform collective communication library designed for heterogeneous accelerator environments.

C++ 13 3 Updated Jun 9, 2026

sgl-project / sglang

SGLang is a high-performance serving framework for large language models and multimodal models.

Python 28,960 6,511 Updated Jun 13, 2026

deepseek-ai / DeepEP

DeepEP: an efficient expert-parallel communication library

Cuda 9,727 1,284 Updated Jun 11, 2026

fpgasystems / hacc

ETHZ Heterogeneous Accelerated Compute Cluster.

41 4 Updated Jun 12, 2026

aws-neuron / nki-moe

MLSys competition for the best MoE NKI kernels

Python 44 14 Updated May 29, 2026

huggingface / optimum-neuron

Training and inference on AWS Trainium and Inferentia chips.

Jupyter Notebook 269 94 Updated May 26, 2026

google-research / google-research

Google Research

Jupyter Notebook 38,125 8,430 Updated Jun 12, 2026

abhibambhaniya / GenZ-LLM-Analyzer

LLM Inference analyzer for different hardware platforms

Jupyter Notebook 114 24 Updated Jun 12, 2026

DEAP / deap

Distributed Evolutionary Algorithms in Python

Python 6,404 1,164 Updated Apr 17, 2026

eth-easl / OpenTela

OpenTela is a decentralized compute fabric for running machine learning applications.

Jupyter Notebook 34 11 Updated Jun 6, 2026

dmlc / dlpack

common in-memory tensor structure

C++ 1,222 163 Updated Jun 4, 2026

apache / tvm-ffi

Open ABI and FFI for Machine Learning Systems

C++ 410 80 Updated Jun 12, 2026

google / or-tools

Google's Operations Research tools

C++ 13,617 2,410 Updated Jun 12, 2026

tile-ai / TileRT

Tile-Based Runtime for Ultra-Low-Latency LLM Inference

Python 1,357 82 Updated Jun 8, 2026

tile-ai / tilelang

Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels

Python 6,490 601 Updated Jun 12, 2026

pola-rs / polars

Extremely fast Query Engine for DataFrames, written in Rust

Rust 38,750 2,881 Updated Jun 13, 2026

triton-lang / triton

Development repository for the Triton language and compiler

MLIR 19,433 2,936 Updated Jun 13, 2026

rapidsai / cudf

cuDF - GPU DataFrame Library

C++ 9,662 1,068 Updated Jun 13, 2026

sqlalchemy / sqlalchemy

The Database Toolkit for Python

Python 11,909 1,697 Updated Jun 12, 2026

NVIDIA / nvshmem

NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process communication and coordination overheads by allowing programmer…

C++ 548 87 Updated Jun 11, 2026

OpenRLHF / OpenRLHF

An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ & VLM & TIS & vLLM & Ray & Async RL)

Python 9,630 968 Updated Jun 9, 2026

sfu-dis / preemptdb

Low-Latency Transaction Scheduling via Userspace Interrupts: Why Wait or Yield When You Can Preempt? (SIGMOD 2025 Best Paper Award)

C++ 78 11 Updated Dec 30, 2025

meta-pytorch / torchrec

PyTorch domain library for recommendation systems

Python 2,563 653 Updated Jun 13, 2026

pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch

Python 23,817 3,996 Updated Jun 13, 2026

verl-project / verl

verl/HybridFlow: A Flexible and Efficient RL Post-Training Framework

Python 21,947 4,073 Updated Jun 13, 2026

meta-pytorch / torchtune

PyTorch native post-training library

Python 5,771 729 Updated Jun 13, 2026

Dao-AILab / flash-attention

Fast and memory-efficient exact attention

Python 24,133 2,827 Updated Jun 10, 2026

vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 82,763 18,018 Updated Jun 13, 2026
