Zhejiang University
Stars
Userspace eBPF runtime for Observability, Network, GPU & General Extensions Framework
Perplexity open source garden for inference technology
DELTA-pytorch: DELTA: Dynamically Optimizing GPU Memory beyond Tensor Recomputation
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C's level of abstraction.
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
Tile-Based Runtime for Ultra-Low-Latency LLM Inference
gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
Curated collection of papers in machine learning systems
Rust version of THU uCore OS. Linux compatible.
[NeurIPS 2025] ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive
CUDA Templates and Python DSLs for High-Performance Linear Algebra
Training neural networks in TensorFlow 2.0 with 5x less memory
A lightweight design for computation-communication overlap.
A high performance and generic framework for distributed DNN training
A tool for automatically adding a gitmoji to your commit message.
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
Distributed MoE in a Single Kernel [NeurIPS '25]
Shared library for intercepting CUDA Runtime API calls. This was part of my Bachelor thesis: A Study on the Computational Exploitation of Remote Virtualized Graphics Cards (https://bit.ly/37tIG0D)
[EuroSys'25] Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization
Lightning-Fast RL for LLM Reasoning and Agents. Made Simple & Flexible.