A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.

C++ 5,617 658 Updated Feb 4, 2026

Nukem9 / dlssg-to-fsr3

Adds AMD FSR 3 Frame Generation to games by replacing Nvidia DLSS Frame Generation (nvngx_dlssg).

C++ 4,921 191 Updated Mar 16, 2025

kvcache-ai / Mooncake

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

C++ 4,691 548 Updated Feb 6, 2026

cameron314 / readerwriterqueue

A fast single-producer, single-consumer lock-free queue for C++

C++ 4,477 728 Updated Jun 25, 2025

NVlabs / tiny-cuda-nn

Lightning fast C++/CUDA neural network framework

C++ 4,408 541 Updated Dec 14, 2025

ztxz16 / fastllm

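fastllm is a high-performance large-model inference library with no backend dependencies. It supports both tensor-parallel inference for dense models and mixed-mode inference for MoE models; any GPU with 10 GB or more of memory can run the full-size DeepSeek model. A dual-socket 9004/9005 server with a single GPU can deploy the original full-size, full-precision DeepSeek model at 20 tps for a single concurrent request; the INT4-quantized model reaches 30 tps for a single concurrent request and 60+ tps with multiple concurrent requests.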

C++ 4,144 418 Updated Jan 29, 2026

MaskRay / ccls

C/C++/ObjC language server supporting cross references, hierarchies, completion and semantic highlighting

C++ 4,029 274 Updated Nov 30, 2025

iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.

C++ 3,590 833 Updated Feb 6, 2026

apache / singa

A distributed deep learning platform

C++ 3,585 1,268 Updated Jan 14, 2026

bytedance / lightseq

LightSeq: A High Performance Library for Sequence Processing and Generation

C++ 3,304 335 Updated May 16, 2023

arximboldi / immer

Postmodern immutable and persistent data structures for C++ — value semantics at scale

C++ 2,801 199 Updated Jan 29, 2026

mirage-project / mirage

Mirage Persistent Kernel: Compiling LLMs into a MegaKernel

C++ 2,117 172 Updated Jan 29, 2026

flexflow / flexflow-train

Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training

C++ 1,859 248 Updated Feb 6, 2026

max0x7ba / atomic_queue

C++14 lock-free queue.

C++ 1,810 207 Updated Jan 31, 2026

gpgpu-sim / gpgpu-sim_distribution

GPGPU-Sim provides a detailed simulation model of contemporary NVIDIA GPUs running CUDA and/or OpenCL workloads. It includes support for features such as TensorCores and CUDA Dynamic Parallelism as…

C++ 1,571 618 Updated Feb 15, 2025