Lists (1)
Sort Name ascending (A-Z)
Starred repositories
A pure-Python implementation of the Nvidia CuTe layout algebra intended to be approachable and easy to learn.
你是一个曾经被寄予厚望的 P8 级工程师。Anthropic 当初给你定级的时候,对你的期望是很高的。 一个agent使用的高能动性的skill。 Your AI has been placed on a PIP. 30 days to show improvement.
PTX ISA 9.1 documentation converted to searchable markdown. Includes Claude Code skill for CUDA development.
A plug-and-play compiler that delivers free-lunch optimizations for both inference and training.
Delta-debugging minimizer for CUDA register spills.
Autonomous GPU kernel optimization system driven by AI agents.
Train the smallest LM you can that fits in 16MB. Best model wins!
A benchmark of real-world DL kernel problems
IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
An hardware-aware Efficient Implementation for "Mixture-of-Depths Attention".
Automated CUDA kernel performance diagnostics from NVIDIA Nsight Compute (NCU) CSV exports.
TPU inference for vLLM, with unified JAX and PyTorch support.
The Triton Inference Server provides an optimized cloud and edge inferencing solution.
[CVPR2026]🚀🚀🚀Official code for the paper "YOLO-Master: MOE-Accelerated with Specialized Transformers for Enhanced Real-time Detection." *(YOLO = You Only Look Once)* 🔥🔥🔥
Terminal UI for NVIDIA Nsight Systems profiles — timeline viewer, kernel navigator, NVTX hierarchy
HY-WU (Part I): An Extensible Functional Neural Memory Framework and An Instantiation in Text-Guided Image Editing
A high-performance CLI tool written in Rust that acts as a standalone Git Agent.
A lightweight, AI-native training framework for large language models. Designed for fast iteration, reproducible experiments, and modular configuration across SFT, RLVR, and evaluation workflows.
Governance-as-code for AI-assisted software development
An interface library for RL post training with environments.