Stars
Instant neural graphics primitives: lightning fast NeRF and more
A massively parallel, optimal functional runtime in Rust
DeepEP: an efficient expert-parallel communication library
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
FlashInfer: Kernel Library for LLM Serving
CUDA accelerated rasterization of gaussian splatting
FSA/FST algorithms, differentiable, with PyTorch compatibility.
Flash Attention in ~100 lines of CUDA (forward pass only)
GPU-accelerated Levenberg-Marquardt curve fitting in CUDA
The CUDA version of the RWKV language model (https://github.com/BlinkDL/RWKV-LM)
Parallel CUDA implementation of non-maximum suppression
This repository contains the CUDA implementation of the paper "Work-efficient Parallel Non-Maximum Suppression Kernels".
TenTrans High-Performance Inference Toolkit