Starred repositories

26 starred repositories written in Cuda

📚 LeetCUDA: modern CUDA learning notes with PyTorch for beginners 🐑; 200+ CUDA kernels, Tensor Cores, HGEMM, FlashAttention-2 MMA. 🎉

Cuda · 9,042 stars · 889 forks · Updated Dec 24, 2025

DeepEP: an efficient expert-parallel communication library

Cuda · 8,829 stars · 1,036 forks · Updated Dec 24, 2025

FlashInfer: Kernel Library for LLM Serving

Cuda · 4,347 stars · 614 forks · Updated Dec 24, 2025

Tile primitives for speedy kernels

Cuda · 3,013 stars · 220 forks · Updated Dec 9, 2025

How to optimize various algorithms in CUDA.

Cuda · 2,711 stars · 244 forks · Updated Dec 23, 2025

Flash Attention in ~100 lines of CUDA (forward pass only)

Cuda · 1,032 stars · 102 forks · Updated Dec 30, 2024
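
The trick that makes such a compact forward pass possible is the online softmax: scores are normalized with a running maximum and a running sum, so a full row of attention scores never has to be materialized. A minimal, hypothetical sketch of that rescaling step (one thread per row for clarity; names are mine, not from the repo):

```cuda
#include <math_constants.h>

// One-pass ("online") softmax: keep a running max m and a running sum s
// of exp(x[i] - m); whenever m grows, rescale the old sum by exp(m_old - m_new).
// One thread per row for clarity; real kernels parallelize within a row.
__global__ void online_softmax(const float* x, float* y, int rows, int cols) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;
    const float* xr = x + r * cols;
    float m = -CUDART_INF_F, s = 0.0f;
    for (int i = 0; i < cols; ++i) {
        float m_new = fmaxf(m, xr[i]);
        s = s * expf(m - m_new) + expf(xr[i] - m_new);  // rescale, then accumulate
        m = m_new;
    }
    for (int i = 0; i < cols; ++i)
        y[r * cols + i] = expf(xr[i] - m) / s;
}
```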

Graphics Processing Units Molecular Dynamics

Cuda · 701 stars · 164 forks · Updated Dec 23, 2025

UNet diffusion model in pure CUDA

Cuda · 657 stars · 31 forks · Updated Jun 28, 2024

100 days of building GPU kernels!

Cuda · 555 stars · 61 forks · Updated Apr 27, 2025

Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores, via the WMMA API and MMA PTX instructions.

Cuda · 509 stars · 87 forks · Updated Sep 8, 2024
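
For a sense of what the WMMA path in repositories like this looks like, here is a minimal sketch in which one warp accumulates a single 16x16 output tile on Tensor Cores. It assumes sm_70+, row-major A, column-major B, and K a multiple of 16; it illustrates the API and is not code from the repository:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes the top-left 16x16 tile of C = A * B on Tensor Cores.
// A is row-major (leading dim K), B is column-major (leading dim K).
__global__ void wmma_hgemm_tile(const half* A, const half* B, float* C,
                                int M, int N, int K) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a, A + k, K);   // 16x16 slice of A at column k
        wmma::load_matrix_sync(b, B + k, K);   // 16x16 slice of B at row k
        wmma::mma_sync(acc, a, b, acc);        // acc += a * b on Tensor Cores
    }
    wmma::store_matrix_sync(C, acc, N, wmma::mem_row_major);
}
```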

CUDA Learning guide

Cuda · 501 stars · 58 forks · Updated Jun 20, 2024

Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS.

Cuda · 468 stars · 51 forks · Updated May 14, 2025

Cuda · 420 stars · 74 forks · Updated Dec 18, 2025

Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).

Cuda · 277 stars · 22 forks · Updated Jul 16, 2025

High-speed GEMV kernels, with up to 2.7x speedup over the PyTorch baseline.

Cuda · 123 stars · 7 forks · Updated Jul 13, 2024
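
A common baseline for fast GEMV kernels like these is one warp per output row with a shuffle reduction; a minimal sketch (hypothetical kernel, not from the repo):

```cuda
// y = A x with one warp per row of the m x n matrix A: each lane strides
// across the row's columns, then a warp tree reduction combines partials.
// Launch with a blockDim.x that is a multiple of 32, e.g. <<<ceil(m/4.0), 128>>>.
__global__ void gemv_warp_per_row(const float* A, const float* x,
                                  float* y, int m, int n) {
    int row  = blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    if (row >= m) return;
    float sum = 0.0f;
    for (int col = lane; col < n; col += 32)
        sum += A[row * n + col] * x[col];
    for (int off = 16; off > 0; off >>= 1)          // warp shuffle reduction
        sum += __shfl_down_sync(0xffffffff, sum, off);
    if (lane == 0) y[row] = sum;
}
```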

LLM training in simple, raw C/CUDA

Cuda · 108 stars · 9 forks · Updated May 1, 2024

TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.

Cuda · 104 stars · 6 forks · Updated Jun 28, 2025

Flash Attention in raw CUDA C, beating PyTorch.

Cuda · 34 stars · 3 forks · Updated May 14, 2024

Study of Ampere's sparse matmul.

Cuda · 18 stars · 5 forks · Updated Jan 10, 2021

Cuda · 14 stars · Updated Dec 2, 2025

Kernels for attention and other diffusion-specific tasks.

Cuda · 9 stars · Updated Apr 19, 2025

This project accelerates multi-GPU machine-learning training using fused gradient buffers, NCCL AllReduce, and CUDA C kernel-level optimizations including me…

Cuda · 8 stars · Updated May 13, 2025
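
The fused-gradient-buffer idea named above replaces one AllReduce call per tensor with a single AllReduce over one flat buffer. A per-rank sketch using NCCL's standard API (communicator and buffer setup omitted; the function name is mine, not the project's):

```cuda
#include <nccl.h>
#include <cuda_runtime.h>

// Sum gradients across all ranks in place. Packing every gradient tensor
// into one contiguous device buffer first means a single large call, which
// amortizes launch and communication latency over the whole model.
void allreduce_fused(float* fused_grads, size_t total_elems,
                     ncclComm_t comm, cudaStream_t stream) {
    ncclAllReduce(fused_grads, fused_grads, total_elems,
                  ncclFloat, ncclSum, comm, stream);
}
```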

GPU implementation of a fast generalized ANS (asymmetric numeral system) entropy encoder and decoder, with extensions for lossless compression of numerical and other data types in HPC/ML applications.

Cuda · 4 stars · Updated Mar 19, 2023
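
For reference, the coding step at the heart of any rANS-style coder is a pair of mutually inverse state updates; a host-side sketch with an assumed 12-bit frequency table (illustrative only, not the repository's actual GPU kernel code):

```cuda
#include <cstdint>

// Core rANS state update with total frequency 1 << SCALE.
// freq/cum are the current symbol's frequency and cumulative frequency.
constexpr int SCALE = 12;  // assumption: 12-bit frequency table

static uint64_t rans_encode(uint64_t x, uint32_t freq, uint32_t cum) {
    return (x / freq << SCALE) + (x % freq) + cum;
}

static uint64_t rans_decode(uint64_t x, uint32_t freq, uint32_t cum) {
    // The caller first finds the symbol whose [cum, cum + freq) interval
    // contains x & ((1 << SCALE) - 1); this undoes rans_encode exactly.
    return freq * (x >> SCALE) + (x & ((1u << SCALE) - 1)) - cum;
}
```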