A 120-day CUDA learning plan covering daily concepts, exercises, pitfalls, and references (including “Programming Massively Parallel Processors”). Features six capstone projects to solidify GPU par…

Shell 821 94 Updated Mar 29, 2025

rbaygildin / learn-gpgpu

Algorithms implemented in CUDA + resources about GPGPU

Cuda 62 15 Updated Jan 18, 2022

NVIDIA / cutlass

CUDA Templates and Python DSLs for High-Performance Linear Algebra

C++ 8,985 1,587 Updated Dec 19, 2025

Liu-xiandong / How_to_optimize_in_GPU

This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, s…

Cuda 1,207 177 Updated Jul 29, 2023

BBuf / how-to-optim-algorithm-in-cuda

How to optimize some algorithms in CUDA.

Cuda 2,696 244 Updated Dec 6, 2025

xlite-dev / LeetCUDA

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

Cuda 8,979 877 Updated Dec 4, 2025

SzymonOzog / FastSoftmax

Step by step implementation of a fast softmax kernel in CUDA

Cuda 59 6 Updated Jan 6, 2025

SzymonOzog / GPU_Programming

Python 86 8 Updated Nov 11, 2025

siboehm / SGEMM_CUDA

Fast CUDA matrix multiplication from scratch

Cuda 980 148 Updated Sep 2, 2025

AlphaGPU / leetgpu-challenges

LeetGPU Challenges

Python 544 42 Updated Dec 11, 2025

Dao-AILab / flash-attention

Fast and memory-efficient exact attention

Python 21,186 2,230 Updated Dec 18, 2025

ardacoskunses / WinTools

Forked from 0xeb/WinTools

A collection of free miscellaneous Windows tools

C# 1 Updated Mar 29, 2023

C++ 4 Updated Mar 25, 2019

Eliovp / amdmemorytweak

Read and modify memory timings on the fly

C++ 310 79 Updated Mar 7, 2023


Arda Coskunses ardacoskunses


Stars

deepseek-ai / DeepGEMM

modular / modular

XiaoSong9905 / CUDA-Optimization-Guide

prasmussen / gdrive

unconed / fullfrontal

gcoe-dresden / cuda-gpu-tlb

vdesai2014 / inference-optimization-blog-post

pranjalssh / fast.cu

srush / GPU-Puzzles

AdepojuJeremy / CUDA-120-DAYS--CHALLENGE

rbaygildin / learn-gpgpu

NVIDIA / cutlass

Liu-xiandong / How_to_optimize_in_GPU

BBuf / how-to-optim-algorithm-in-cuda

xlite-dev / LeetCUDA

SzymonOzog / FastSoftmax

SzymonOzog / GPU_Programming

siboehm / SGEMM_CUDA

AlphaGPU / leetgpu-challenges

Dao-AILab / flash-attention

ardacoskunses / WinTools

zodiacon / EtwExplorer

nasbench / EVTX-ETW-Resources

wolfpld / tracy

microsoft / krabsetw

microsoft / D3D12TranslationLayer

microsoft / OpenCLOn12

jlgreathouse / AMD_IBS_Toolkit

sibradzic / amdmemorytweak

Eliovp / amdmemorytweak