Starred repositories
- A massively parallel, optimal functional runtime in Rust
- [ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized attention achieving a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models
- Flash Attention in ~100 lines of CUDA (forward pass only); a pure-PyTorch sketch of the underlying online-softmax tiling follows this list
- Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/)
- Causal depthwise conv1d in CUDA, with a PyTorch interface; the padding trick it implements is sketched after this list
- Reference implementation of the Megalodon 7B model
- Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS
- 🤖 FFPA: extends FlashAttention-2 with Split-D for ~O(1) SRAM complexity at large head dimensions; 1.8x-3x speedup vs. SDPA EA 🎉
- The CUDA version of the RWKV language model (https://github.com/BlinkDL/RWKV-LM)
- Implementation of fused cosine-similarity attention in the same style as Flash Attention; see the normalization sketch after this list
- A comparison of array languages & libraries: APL, J, BQN, Uiua, Q, Julia, R, NumPy, Nial, Futhark, Dex, Ivy, SaC & ArrayFire
- High-speed GEMV kernels with up to a 2.7x speedup over the PyTorch baseline
- An extension library for the WMMA (Tensor Core) API
- Inference speed benchmark for "Learning to (Learn at Test Time): RNNs with Expressive Hidden States"
- PyTorch half-precision GEMM library with fused optional bias and optional ReLU/GELU
- A minimalist and extensible framework for implementing custom backend operators in PyTorch
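
Several of the attention entries above share one core trick: the softmax is computed tile by tile with running statistics, so the full attention matrix is never materialized. Below is a minimal pure-PyTorch sketch of that online-softmax tiling; the shapes, block size, and function name are illustrative assumptions, not code from any of the starred repos.

```python
import torch

def tiled_attention(q, k, v, block=64):
    """Forward-only attention computed over key/value tiles.

    q, k, v: (seq_len, head_dim). The running max and normalizer are
    updated tile by tile, so the (seq, seq) score matrix never exists.
    """
    scale = q.shape[-1] ** -0.5
    out = torch.zeros_like(q)
    running_max = torch.full((q.shape[0], 1), float("-inf"))
    running_sum = torch.zeros(q.shape[0], 1)
    for start in range(0, k.shape[0], block):
        kt, vt = k[start:start + block], v[start:start + block]
        scores = (q @ kt.T) * scale                    # (seq, tile)
        new_max = torch.maximum(running_max, scores.max(-1, keepdim=True).values)
        correction = torch.exp(running_max - new_max)  # rescale earlier tiles
        probs = torch.exp(scores - new_max)
        running_sum = running_sum * correction + probs.sum(-1, keepdim=True)
        out = out * correction + probs @ vt
        running_max = new_max
    return out / running_sum

# Sanity check against the naive reference.
q, k, v = (torch.randn(128, 32) for _ in range(3))
ref = torch.softmax((q @ k.T) * 32 ** -0.5, dim=-1) @ v
assert torch.allclose(tiled_attention(q, k, v), ref, atol=1e-4)
```

The CUDA implementations keep each tile in SRAM and fuse the rescaling into the accumulation loop; the arithmetic above is the same.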
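The causal depthwise conv1d entry reduces, algorithmically, to left-only padding plus a grouped convolution; the CUDA kernel exists to fuse and accelerate it. A plain-PyTorch sketch of the operation itself, with assumed shapes:

```python
import torch
import torch.nn.functional as F

def causal_depthwise_conv1d(x, weight, bias=None):
    """x: (batch, channels, seq_len); weight: (channels, 1, kernel_size).

    groups=channels makes the convolution depthwise (one filter per
    channel); padding only on the left makes it causal, so no output
    position sees future timesteps.
    """
    kernel_size = weight.shape[-1]
    x = F.pad(x, (kernel_size - 1, 0))  # pad the past, not the future
    return F.conv1d(x, weight, bias, groups=x.shape[1])

x = torch.randn(2, 8, 16)
w = torch.randn(8, 1, 4)
assert causal_depthwise_conv1d(x, w).shape == (2, 8, 16)
```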
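Cosine-similarity attention, as in the fused-kernel entry above, l2-normalizes queries and keys so the logits are bounded cosine similarities, replacing the usual 1/sqrt(d) scaling with a fixed temperature. A brief sketch; the scale value here is a hypothetical choice, not that repo's default:

```python
import torch
import torch.nn.functional as F

def cosine_sim_attention(q, k, v, scale=10.0):
    """q, k, v: (seq_len, head_dim); `scale` is an assumed temperature."""
    q = F.normalize(q, dim=-1)  # unit-norm queries
    k = F.normalize(k, dim=-1)  # unit-norm keys
    return torch.softmax((q @ k.T) * scale, dim=-1) @ v
```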