Starred repositories
Instant neural graphics primitives: lightning fast NeRF and more
The project is an official implementation of our CVPR2019 paper "Deep High-Resolution Representation Learning for Human Pose Estimation"
Tile primitives for speedy kernels
How to optimize some algorithms in CUDA.
Sample codes for my CUDA programming book
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
Fully Convolutional Instance-aware Semantic Segmentation
[MICRO'23, MLSys'22] TorchSparse: Efficient Training and Inference Framework for Sparse Convolution on GPUs.
Examples demonstrating available options to program multiple GPUs in a single node or a cluster
Reference implementation of real-time autoregressive WaveNet inference
Stereo Matching by Training a Convolutional Neural Network to Compare Image Patches
Source code that accompanies The CUDA Handbook.
A UNIVERSAL MUSIC TRANSLATION NETWORK - a method for translating music across musical instruments and styles.
NVIDIA-accelerated zero latency video compression library for interactive remoting applications
GPU implementation of a fast generalized ANS (asymmetric numeral system) entropy encoder and decoder, with extensions for lossless compression of numerical and other data types in HPC/ML applications.
GPU-accelerated Levenberg-Marquardt curve fitting in CUDA
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
CGBN: CUDA Accelerated Multiple Precision Arithmetic (Big Num) using Cooperative Groups