Mahō Shōjo Special Operations Squad
Beijing
22:00 (UTC -12:00)
Lists (17)
AI Agent
Alibaba
AMD YES!
Coding
eBPF
Fault tolerance
FPGA
Inference LLM
Kernel SHM
LLM Inference
LLM Training System
Magic book
NetWork Tech
Operation and maintenance
RLHF
Robot
Tutorial
Stars
Open-source platform to build and deploy AI agent workflows.
My learning notes for ML SYS.
ncnn is a high-performance neural network inference framework optimized for the mobile platform
Fast and memory-efficient exact attention
A privacy-first, self-hosted, fully open source personal knowledge management software, written in typescript and golang.
eBPF-based Networking, Security, and Observability
A feedback-driven fault injection tool for reproducing distributed systems failures
Automated Testing and Adaptive Detection of Slow Faults in Distributed Systems
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
"AI-Trader: Can AI Beat the Market?" Live Trading Bench: https://ai4trade.ai Tech Report Link: https://arxiv.org/abs/2512.10971
Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial
Zero-instrumentation LLM and AI agent (e.g. Claude Code, gemini-cli) observability in eBPF
Build your personal knowledge base with Trilium Notes
NVSentinel is a cross-platform fault remediation service designed to rapidly remediate runtime node-level issues in GPU-accelerated computing environments
Userspace/GPU eBPF VM with llvm JIT/AOT compiler
A high-performance agent for collecting NVIDIA GPU metrics and exporting them via OpenTelemetry Arrow protocol.
PyTorch native quantization and sparsity for training and inference
This is a tool for managing GPU partitions for NVIDIA Fabric Manager’s Shared NVSwitch.
Lightning-Fast RL for LLM Reasoning and Agents. Made Simple & Flexible.
A service-aware RoCE network monitoring system based on end-to-end probing.
Standardized Distributed Generative and Predictive AI Inference Platform for Scalable, Multi-Framework Deployment on Kubernetes
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
Supercharge Your LLM with the Fastest KV Cache Layer
💖🧸 Self-hosted, you-owned Grok Companion, a container for the souls of waifus and cyber beings, bringing them into our world, aspiring to reach Neuro-sama's level. Capable of real-time voice chat, Minec…
AI Inference Operator for Kubernetes. The easiest way to serve ML models in production. Supports VLMs, LLMs, embeddings, and speech-to-text.
Achieve state-of-the-art inference performance with modern accelerators on Kubernetes