Lists (30)
aerospace
art
biotechnology
cli-tools-and-libs
cloud-infra-devops
concurrency
data-engineering
embedded-and-robotics
finance
frontend-and-ui
functional-programming
games-and-gamedev
graphics
Rendering, visualization, video editing and art
hpc-and-ai
learning-resources
libs
medicine-and-diagnosis
networking-and-distributed-sys
os-and-systems-dev
parsers-compilers-tracers
quantum-computing
runtimes
scientific
security-and-cryptography
silicon
simulations
theoretical-computer-science
utilities
web
web3
Starred repositories
Instant neural graphics primitives: lightning fast NeRF and more
A massively parallel, optimal functional runtime in Rust
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
DeepEP: an efficient expert-parallel communication library
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
Code and data for paper "Deep Painterly Harmonization": https://arxiv.org/abs/1804.03189
Tile primitives for speedy kernels
[MICRO'23, MLSys'22] TorchSparse: Efficient Training and Inference Framework for Sparse Convolution on GPUs.
Source code that accompanies The CUDA Handbook.
A UNIVERSAL MUSIC TRANSLATION NETWORK - a method for translating music across musical instruments and styles.
GPU implementation of a fast generalized ANS (asymmetric numeral system) entropy encoder and decoder, with extensions for lossless compression of numerical and other data types in HPC/ML applications.
CUDA Matrix Multiplication Optimization
🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.
A curated set of C++ examples for optimization-based elastodynamic contact simulation using CUDA, emphasizing algorithmic convergence, penetration-free, and inversion-free conditions. Designed for …
Aims to implement dual-port and multi-qp solutions in deepEP ibrc transport
Algorithms implemented in CUDA + resources about GPGPU
A GPU-accelerated general-purpose metaheuristic framework for combinatorial optimization
Writing a CUDA software ray tracing renderer with Analysis-Driven Optimization from scratch: a python-importable, distributed parallel renderer.