YongjunHe

Organizations

@sfu-db @DS3Lab @sfu-dis @llm-db


Modern RL Post-training Infrastructure: Optimized for NVIDIA/AMD GPUs with a focus on vLLM and DeepSpeed integration, CUDA/ROCm/Triton kernels, and transparent hardware-aware scaling.

Python 122 20 Updated Jun 13, 2026

UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)

C++ 1,415 157 Updated Jun 12, 2026

InfiniCCL is a unified, cross-platform collective communication library designed for heterogeneous accelerator environments.

C++ 13 3 Updated Jun 9, 2026

SGLang is a high-performance serving framework for large language models and multimodal models.

Python 28,960 6,511 Updated Jun 13, 2026

DeepEP: an efficient expert-parallel communication library

Cuda 9,727 1,284 Updated Jun 11, 2026

ETHZ Heterogeneous Accelerated Compute Cluster.

41 4 Updated Jun 12, 2026

MLSys competition for the best MOE NKI kernels

Python 44 14 Updated May 29, 2026

Training and inference on AWS Trainium and Inferentia chips.

Jupyter Notebook 269 94 Updated May 26, 2026

Google Research

Jupyter Notebook 38,125 8,430 Updated Jun 12, 2026

LLM Inference analyzer for different hardware platforms

Jupyter Notebook 114 24 Updated Jun 12, 2026

Distributed Evolutionary Algorithms in Python

Python 6,404 1,164 Updated Apr 17, 2026

OpenTela is a decentralized compute fabric for running machine learning applications.

Jupyter Notebook 34 11 Updated Jun 6, 2026

common in-memory tensor structure

C++ 1,222 163 Updated Jun 4, 2026

Open ABI and FFI for Machine Learning Systems

C++ 410 80 Updated Jun 12, 2026

Google's Operations Research tools:

C++ 13,617 2,410 Updated Jun 12, 2026

Tile-Based Runtime for Ultra-Low-Latency LLM Inference

Python 1,357 82 Updated Jun 8, 2026

Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels

Python 6,490 601 Updated Jun 12, 2026

Extremely fast Query Engine for DataFrames, written in Rust

Rust 38,750 2,881 Updated Jun 13, 2026

Development repository for the Triton language and compiler

MLIR 19,433 2,936 Updated Jun 13, 2026

cuDF - GPU DataFrame Library

C++ 9,662 1,068 Updated Jun 13, 2026

The Database Toolkit for Python

Python 11,909 1,697 Updated Jun 12, 2026

NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process communication and coordination overheads by allowing programmer…

C++ 548 87 Updated Jun 11, 2026

An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ & VLM & TIS & vLLM & Ray & Async RL)

Python 9,630 968 Updated Jun 9, 2026

Low-Latency Transaction Scheduling via Userspace Interrupts: Why Wait or Yield When You Can Preempt? (SIGMOD 2025 Best Paper Award)

C++ 78 11 Updated Dec 30, 2025

PyTorch domain library for recommendation systems

Python 2,563 653 Updated Jun 13, 2026

Graph Neural Network Library for PyTorch

Python 23,817 3,996 Updated Jun 13, 2026

verl/HybridFlow: A Flexible and Efficient RL Post-Training Framework

Python 21,947 4,073 Updated Jun 13, 2026

PyTorch native post-training library

Python 5,771 729 Updated Jun 13, 2026

Fast and memory-efficient exact attention

Python 24,133 2,827 Updated Jun 10, 2026

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 82,763 18,018 Updated Jun 13, 2026