Stars
Fast inference engine for Transformer models
Fast and memory-efficient exact attention
Original reference implementation of "3D Gaussian Splatting for Real-Time Radiance Field Rendering"
HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of HierarchicalKV is to store key-value feature-embeddings on h…
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
DeepEP: an efficient expert-parallel communication library
C++ code for computing 1D and 2D convolution products using the FFT, implemented with the GSL or FFTW
FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores
Implementation of 1D, 2D, and 3D FFT convolutions in PyTorch; much faster than direct convolutions for large kernel sizes (see the sketch after this list).
Cross-platform text editor, written in Free Pascal
Lightning fast C++/CUDA neural network framework
Instant neural graphics primitives: lightning fast NeRF and more
CUDA accelerated rasterization of gaussian splatting
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention with no loss in end-to-end metrics across language, image, and video models.
Code from the "CUDA Crash Course" YouTube series by CoffeeBeforeArch
Sample code for my CUDA programming book
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
Learn CUDA Programming, published by Packt
RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing …
Automatically exported from code.google.com/p/cuda-convnet2
CUDA-accelerated GIS and spatiotemporal algorithms
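
The FFT-convolution entry above claims a large speedup over direct convolution for big kernels. The reason is the convolution theorem: a sliding-window product over a length-n signal with a length-k kernel costs O(n*k), while zero-padding, transforming, and multiplying pointwise in the frequency domain costs O(n log n). A minimal sketch of the idea using only torch.fft; the function name fft_conv1d, the shapes, and the tolerance check are illustrative assumptions, not that repository's API:

import torch
import torch.nn.functional as F

def fft_conv1d(signal: torch.Tensor, kernel: torch.Tensor) -> torch.Tensor:
    # signal: (batch, in_channels, n); kernel: (out_channels, in_channels, k).
    # Computes the same 'valid' cross-correlation as F.conv1d, but via the FFT.
    n, k = signal.shape[-1], kernel.shape[-1]
    L = n + k - 1                              # length of the full linear convolution
    fs = torch.fft.rfft(signal, n=L)           # zero-pad to L, then transform
    # Flipping the kernel turns cross-correlation into plain convolution,
    # which is pointwise multiplication in the frequency domain.
    fk = torch.fft.rfft(kernel.flip(-1), n=L)
    fo = torch.einsum("bci,oci->boi", fs, fk)  # multiply, summing over input channels
    out = torch.fft.irfft(fo, n=L)             # back to the time domain
    return out[..., k - 1 : n]                 # keep only the 'valid' positions

# Agrees with the direct convolution up to float32 round-off:
x = torch.randn(2, 3, 4096)
w = torch.randn(4, 3, 512)
assert torch.allclose(fft_conv1d(x, w), F.conv1d(x, w), atol=1e-3)

Padding both operands to n + k - 1 before transforming is what makes the circular convolution computed by the FFT equal the linear one; slicing off the first k - 1 outputs then recovers exactly the positions F.conv1d would produce with no padding.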