- MAC Lab, XMU
- Fujian, China
Stars
Instant neural graphics primitives: lightning fast NeRF and more
DeepEP: an efficient expert-parallel communication library
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
The project is an official implementation of our CVPR2019 paper "Deep High-Resolution Representation Learning for Human Pose Estimation"
How to optimize some algorithms in CUDA.
Sample code for my CUDA programming book
Deformable ConvNets V2 (DCNv2) in PyTorch
This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, s…
A simple high performance CUDA GEMM implementation.
Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.
Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5).
Tutorials for writing high-performance GPU operators in AI frameworks.
Implement Flash Attention using Cute.