Starred repositories
The official Lark/Feishu CLI tool, maintained by the larksuite team — built for humans and AI Agents. Covers core business domains including Messenger, Docs, Base, Sheets, Calendar, Mail, Tasks, Me…
[Experimental] Miles-diffusion is a post-training framework for large-scale diffusion model training and production workloads, forked from and co-evolving with miles.
GDM Science Skills to speed up agentic scientific workflows with better grounding and higher token efficiency. Integrate insights from AlphaGenome, AFDB, UniProt and 30+ other databases and tools.
TurboDiffusion: 100–200× Acceleration for Video Diffusion Models
mKernel: fast multi-node, multi-GPU fused kernels
Conveniently export torch.compile compiled products into self-contained Python files
Agentic Kernel Optimization — advanced & eXtensible: a closed-loop, campaign-based multi-agent system for optimizing GPU kernels (benchmark-swappable; default flashinfer-bench).
Agentic Kernel Optimization for All — automated GPU kernel optimization for any kernel, any hardware, any language
Build your own high-performance LLM inference engine in C++ and CUDA - a smaller version of vLLM
A structured course built from personal study notes of the book Linux Basics for Hackers by OccupyTheWeb.
Student version of Assignment 1 for Stanford CS336 - Language Modeling From Scratch
The NVIDIA VSS Blueprint is a suite of reference architectures for building GPU-accelerated vision agents and AI-powered video analytics applications.
NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs.
TokenSpeed is a speed-of-light LLM inference engine.
DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder
StreamingVLM: Real-Time Understanding for Infinite Video Streams
TriAttention — Efficient long reasoning with trigonometric KV cache compression. Enables OpenClaw local deployment on memory-constrained GPUs.
CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs
Code, labs, and resources for O'Reilly AI Systems Performance Engineering: GPU optimization, distributed training, inference scaling, and full-stack tuning.