Starred repositories
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
DeepEP: an efficient expert-parallel communication library
How to optimize some algorithms in CUDA.
A series of GPU optimization topics that explains in detail how to optimize CUDA kernels. It covers several basic kernel optimizations, including elementwise, reduce, s…
Flash Attention in ~100 lines of CUDA (forward pass only)
Efficient GPU kernels for block-sparse matrix multiplication and convolution
Examples demonstrating available options to program multiple GPUs in a single node or a cluster
HPC tutorial covering collective communication (MPI, NCCL), CUDA programming, SIMD vectorization, RDMA communication, and more
Distributed MoE in a Single Kernel [NeurIPS '25]
NVSHMEM-Tutorial: Build a DeepEP-like GPU Buffer