Starred repositories

26 starred repositories written in Cuda

📚 LeetCUDA: modern CUDA learning notes with PyTorch for beginners 🐑; 200+ CUDA kernels, Tensor Cores, HGEMM, FlashAttention-2 MMA. 🎉

Cuda · 9,042 stars · 889 forks · Updated Dec 24, 2025

DeepEP: an efficient expert-parallel communication library

Cuda · 8,829 stars · 1,036 forks · Updated Dec 24, 2025

FlashInfer: Kernel Library for LLM Serving

Cuda · 4,347 stars · 614 forks · Updated Dec 24, 2025

Tile primitives for speedy kernels

Cuda · 3,013 stars · 220 forks · Updated Dec 9, 2025

How to optimize various algorithms in CUDA.

Cuda · 2,711 stars · 244 forks · Updated Dec 23, 2025

Flash Attention in ~100 lines of CUDA (forward pass only)

Cuda · 1,032 stars · 102 forks · Updated Dec 30, 2024
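
The trick that makes such a compact forward pass possible is the online softmax: scores are normalized with a running maximum and a running sum, so a full row of attention scores never has to be materialized. A minimal, hypothetical sketch of that rescaling step (one thread per row for clarity; names are mine, not from the repo):

```cuda
#include <math_constants.h>

// One-pass ("online") softmax: keep a running max m and a running sum s
// of exp(x[i] - m); whenever m grows, rescale the old sum by exp(m_old - m_new).
// One thread per row for clarity; real kernels parallelize within a row.
__global__ void online_softmax(const float* x, float* y, int rows, int cols) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;
    const float* xr = x + r * cols;
    float m = -CUDART_INF_F, s = 0.0f;
    for (int i = 0; i < cols; ++i) {
        float m_new = fmaxf(m, xr[i]);
        s = s * expf(m - m_new) + expf(xr[i] - m_new);  // rescale, then accumulate
        m = m_new;
    }
    for (int i = 0; i < cols; ++i)
        y[r * cols + i] = expf(xr[i] - m) / s;
}
```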

Graphics Processing Units Molecular Dynamics

Cuda · 701 stars · 164 forks · Updated Dec 23, 2025

UNet diffusion model in pure CUDA

Cuda · 657 stars · 31 forks · Updated Jun 28, 2024

100 days of building GPU kernels!

Cuda · 555 stars · 61 forks · Updated Apr 27, 2025

Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores, via the WMMA API and MMA PTX instructions.

Cuda · 509 stars · 87 forks · Updated Sep 8, 2024
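
For a sense of what the WMMA path in repositories like this looks like, here is a minimal sketch in which one warp accumulates a single 16x16 output tile on Tensor Cores. It assumes sm_70+, row-major A, column-major B, and K a multiple of 16; it illustrates the API and is not code from the repository:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes the top-left 16x16 tile of C = A * B on Tensor Cores.
// A is row-major (leading dim K), B is column-major (leading dim K).
__global__ void wmma_hgemm_tile(const half* A, const half* B, float* C,
                                int M, int N, int K) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a, A + k, K);   // 16x16 slice of A at column k
        wmma::load_matrix_sync(b, B + k, K);   // 16x16 slice of B at row k
        wmma::mma_sync(acc, a, b, acc);        // acc += a * b on Tensor Cores
    }
    wmma::store_matrix_sync(C, acc, N, wmma::mem_row_major);
}
```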

CUDA Learning guide

Cuda · 501 stars · 58 forks · Updated Jun 20, 2024

Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS.

Cuda · 468 stars · 51 forks · Updated May 14, 2025

Cuda · 420 stars · 74 forks · Updated Dec 18, 2025

Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).

Cuda · 277 stars · 22 forks · Updated Jul 16, 2025

High-speed GEMV kernels, with up to 2.7x speedup over the PyTorch baseline.

Cuda · 123 stars · 7 forks · Updated Jul 13, 2024
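
A common baseline for fast GEMV kernels like these is one warp per output row with a shuffle reduction; a minimal sketch (hypothetical kernel, not from the repo):

```cuda
// y = A x with one warp per row of the m x n matrix A: each lane strides
// across the row's columns, then a warp tree reduction combines partials.
// Launch with a blockDim.x that is a multiple of 32, e.g. <<<ceil(m/4.0), 128>>>.
__global__ void gemv_warp_per_row(const float* A, const float* x,
                                  float* y, int m, int n) {
    int row  = blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    if (row >= m) return;
    float sum = 0.0f;
    for (int col = lane; col < n; col += 32)
        sum += A[row * n + col] * x[col];
    for (int off = 16; off > 0; off >>= 1)          // warp shuffle reduction
        sum += __shfl_down_sync(0xffffffff, sum, off);
    if (lane == 0) y[row] = sum;
}
```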

LLM training in simple, raw C/CUDA

Cuda · 108 stars · 9 forks · Updated May 1, 2024

TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.

Cuda · 104 stars · 6 forks · Updated Jun 28, 2025

Flash Attention in raw CUDA C, beating PyTorch.

Cuda · 34 stars · 3 forks · Updated May 14, 2024

Study of Ampere's sparse matmul.

Cuda · 18 stars · 5 forks · Updated Jan 10, 2021

Cuda · 14 stars · Updated Dec 2, 2025

Kernels for attention and other diffusion-specific tasks.

Cuda · 9 stars · Updated Apr 19, 2025

This project accelerates multi-GPU machine-learning training using fused gradient buffers, NCCL AllReduce, and CUDA C kernel-level optimizations including me…

Cuda · 8 stars · Updated May 13, 2025
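
The fused-gradient-buffer idea named above replaces one AllReduce call per tensor with a single AllReduce over one flat buffer. A per-rank sketch using NCCL's standard API (communicator and buffer setup omitted; the function name is mine, not the project's):

```cuda
#include <nccl.h>
#include <cuda_runtime.h>

// Sum gradients across all ranks in place. Packing every gradient tensor
// into one contiguous device buffer first means a single large call, which
// amortizes launch and communication latency over the whole model.
void allreduce_fused(float* fused_grads, size_t total_elems,
                     ncclComm_t comm, cudaStream_t stream) {
    ncclAllReduce(fused_grads, fused_grads, total_elems,
                  ncclFloat, ncclSum, comm, stream);
}
```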

GPU implementation of a fast generalized ANS (asymmetric numeral system) entropy encoder and decoder, with extensions for lossless compression of numerical and other data types in HPC/ML applications.

Cuda · 4 stars · Updated Mar 19, 2023
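
For reference, the coding step at the heart of any rANS-style coder is a pair of mutually inverse state updates; a host-side sketch with an assumed 12-bit frequency table (illustrative only, not the repository's actual GPU kernel code):

```cuda
#include <cstdint>

// Core rANS state update with total frequency 1 << SCALE.
// freq/cum are the current symbol's frequency and cumulative frequency.
constexpr int SCALE = 12;  // assumption: 12-bit frequency table

static uint64_t rans_encode(uint64_t x, uint32_t freq, uint32_t cum) {
    return (x / freq << SCALE) + (x % freq) + cum;
}

static uint64_t rans_decode(uint64_t x, uint32_t freq, uint32_t cum) {
    // The caller first finds the symbol whose [cum, cum + freq) interval
    // contains x & ((1 << SCALE) - 1); this undoes rans_encode exactly.
    return freq * (x >> SCALE) + (x & ((1u << SCALE) - 1)) - cum;
}
```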