Stars
A Survey of Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention
NVIDIA curated collection of educational resources related to general purpose GPU programming.
CUDA Templates and Python DSLs for High-Performance Linear Algebra
LLM Inference analyzer for different hardware platforms
Unofficial description of the CUDA assembly (SASS) instruction sets.
Simple Python library for generating your own Perfetto traces for your application. Can be used for both app instrumentation and custom trace generation.
sazczmh / DeepGEMM
Forked from deepseek-ai/DeepGEMM
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
FlashMLA: Efficient Multi-head Latent Attention Kernels
Run compilers interactively from your web browser and interact with the assembly
An easy-to-understand TensorOp Matmul tutorial
An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
A tool for converting text log files to the VCD format.
Konata is an instruction pipeline visualizer for Onikiri2-Kanata/Gem5-O3PipeView formats. You can download the pre-built binaries from https://github.com/shioyadan/Konata/releases
This is a Chinese translation of the CUDA programming guide
🌊 Digital timing diagram rendering engine
Verilog/SystemVerilog Syntax and Omni-completion
Spectacle allows you to organize your windows without using a mouse.
Reports the historical and statistical real time of the system, keeping it between restarts. Like the uptime command but with more interesting output.
🖥️ macOS status monitoring app written in SwiftUI.
Raspberry Pi PCI Express device compatibility database