Zhejiang University
Stars
Userspace eBPF runtime for Observability, Network, GPU & General Extensions Framework
Perplexity open source garden for inference technology
DELTA-pytorch: DELTA: Dynamically Optimizing GPU Memory beyond Tensor Recomputation
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C's level of abstraction.
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
Tile-Based Runtime for Ultra-Low-Latency LLM Inference
gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
Curated collection of papers in machine learning systems
Rust version of THU uCore OS. Linux compatible.
[NeurIPS 2025] ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive
CUDA Templates and Python DSLs for High-Performance Linear Algebra
Training neural networks in TensorFlow 2.0 with 5x less memory
A lightweight design for computation-communication overlap.
A high performance and generic framework for distributed DNN training
A tool for automatically adding a gitmoji to your commit message.
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
Distributed MoE in a Single Kernel [NeurIPS '25]
Shared library for intercepting CUDA Runtime API calls. This was part of my Bachelor thesis: A Study on the Computational Exploitation of Remote Virtualized Graphics Cards (https://bit.ly/37tIG0D)
[EuroSys'25] Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization
Lightning-Fast RL for LLM Reasoning and Agents. Made Simple & Flexible.