Stars
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
Xiao's CUDA Optimization Guide [NO LONGER ADDING NEW CONTENT]
A 120-day CUDA learning plan covering daily concepts, exercises, pitfalls, and references (including “Programming Massively Parallel Processors”). Features six capstone projects to solidify GPU par…
Algorithms implemented in CUDA + resources about GPGPU
CUDA Templates and Python DSLs for High-Performance Linear Algebra
A series on GPU optimization topics that explains in detail how to optimize CUDA kernels. It covers several basic kernel optimizations, including: elementwise, reduce, s…
How to optimize some algorithms in CUDA.
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
Step by step implementation of a fast softmax kernel in CUDA
Fast and memory-efficient exact attention
ardacoskunses / WinTools
Forked from 0xeb/WinTools
A collection of free miscellaneous Windows tools
Event Tracing For Windows (ETW) Resources
KrabsETW provides a modern C++ wrapper and a .NET wrapper around the low-level ETW trace consumption functions.
A library containing utilities for mapping higher-level graphics work to D3D12
AMD Research Instruction Based Sampling Toolkit
sibradzic / amdmemorytweak
Forked from Eliovp/amdmemorytweak
Read and modify memory timings on the fly