Ryu1845

Starred repositories

36 starred repositories written in Cuda

LLM training in simple, raw C/CUDA

Cuda · 28,769 stars · 3,372 forks · Updated Jun 26, 2025

A massively parallel, optimal functional runtime in Rust

Cuda · 11,204 stars · 434 forks · Updated Nov 21, 2024

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.

Cuda · 3,132 stars · 332 forks · Updated Jan 17, 2026

Tile primitives for speedy kernels

Cuda · 3,120 stars · 234 forks · Updated Feb 4, 2026

Flash Attention in ~100 lines of CUDA (forward pass only)

Cuda · 1,067 stars · 105 forks · Updated Dec 30, 2024
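The trick that lets a flash-attention forward pass fit in so few lines is the online (streaming) softmax: a running max, denominator, and accumulator are rescaled as new scores arrive, so the full attention matrix is never materialized. A minimal scalar sketch in Python, not the repository's CUDA code; the function name and single-query scalar formulation are illustrative:

```python
import math

def streaming_softmax_sum(scores, values):
    """Softmax-weighted sum of `values` in one pass over the data.

    This is the online-softmax recurrence used by flash-attention
    forward kernels: when a larger score appears, the running
    denominator and accumulator are rescaled by exp(m_old - m_new).
    """
    m = -math.inf   # running maximum score (numerical stability)
    d = 0.0         # running softmax denominator
    acc = 0.0       # running weighted sum (numerator)
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = math.exp(m - m_new)  # 0.0 on the first step (m = -inf)
        d = d * scale + math.exp(s - m_new)
        acc = acc * scale + math.exp(s - m_new) * v
        m = m_new
    return acc / d
```

The one-pass result matches the usual two-pass softmax (find the max, then normalize), which is what makes tiled, single-sweep attention kernels possible.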

Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/)

Cuda · 934 stars · 346 forks · Updated Aug 19, 2024

Causal depthwise conv1d in CUDA, with a PyTorch interface

Cuda · 711 stars · 150 forks · Updated Jan 12, 2026
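As a reference for what this kernel computes: depthwise means each channel is convolved with its own filter (no cross-channel mixing), and causal means output step t sees only inputs at or before t, via implicit left zero-padding. A pure-Python sketch under those definitions; the repository itself ships a fused CUDA kernel behind a PyTorch interface, and all names here are illustrative:

```python
def causal_depthwise_conv1d(x, w):
    """Causal depthwise 1-D convolution in pure Python.

    x: per-channel inputs, shape [channels][time]
    w: per-channel filters, shape [channels][taps]
    Depthwise: channel c uses only filter w[c].
    Causal: y[c][t] depends only on x[c][:t+1] (left zero-padding).
    """
    channels, time = len(x), len(x[0])
    taps = len(w[0])
    y = [[0.0] * time for _ in range(channels)]
    for c in range(channels):
        for t in range(time):
            acc = 0.0
            for k in range(taps):
                i = t - (taps - 1) + k  # input index read by tap k
                if i >= 0:              # taps before t=0 read zeros
                    acc += w[c][k] * x[c][i]
            y[c][t] = acc
    return y
```

With a filter whose only nonzero tap is the last one, the output equals the input, which is a quick sanity check for the causal indexing.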

Fastest kernels written from scratch

Cuda · 532 stars · 64 forks · Updated Sep 18, 2025

Reference implementation of Megalodon 7B model

Cuda · 529 stars · 53 forks · Updated May 17, 2025

Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS

Cuda · 485 stars · 52 forks · Updated Jan 20, 2026

Learnings and programs related to CUDA

Cuda · 432 stars · 20 forks · Updated Jun 29, 2025

PyTorch KNN (CUDA version)

Cuda · 312 stars · 45 forks · Updated Dec 14, 2021

🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.

Cuda · 249 stars · 13 forks · Updated Jan 20, 2026

The CUDA version of the RWKV language model ( https://github.com/BlinkDL/RWKV-LM )

Cuda · 231 stars · 35 forks · Updated Dec 10, 2025

Implementation of fused cosine similarity attention in the same style as Flash Attention

Cuda · 220 stars · 12 forks · Updated Feb 13, 2023

A comparison of array languages & libraries: APL, J, BQN, Uiua, Q, Julia, R, NumPy, Nial, Futhark, Dex, Ivy, SaC & ArrayFire.

Cuda · 213 stars · 10 forks · Updated Feb 1, 2025

Learn CUDA with PyTorch

Cuda · 194 stars · 26 forks · Updated Feb 3, 2026

Efficient Top-K implementation on the GPU

Cuda · 192 stars · 24 forks · Updated Apr 9, 2019
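GPU top-k kernels typically run many partial selections in parallel and merge the survivors; the sequential analogue of that selection invariant is a bounded min-heap, where each candidate is tested against the current k-th largest value. A Python sketch of that invariant only, not the repository's GPU algorithm:

```python
import heapq

def topk(values, k):
    """Return the k largest values in descending order.

    Maintains a min-heap of size at most k, so heap[0] is always the
    current k-th largest value and most candidates are rejected with a
    single comparison.
    """
    heap = []  # min-heap of the k largest values seen so far
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            heapq.heapreplace(heap, v)  # evict current minimum, insert v
    return sorted(heap, reverse=True)
```

This runs in O(n log k) time with O(k) memory, which is why the same bounded-selection idea maps well onto per-thread-block state on a GPU.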

Fast CUDA Kernels for ResNet Inference.

Cuda · 182 stars · 47 forks · Updated May 26, 2019

High-speed GEMV kernels with up to a 2.7x speedup over the PyTorch baseline.

Cuda · 127 stars · 7 forks · Updated Jul 13, 2024

An extension library of WMMA API (Tensor Core API)

Cuda · 109 stars · 16 forks · Updated Jul 12, 2024

Inference Speed Benchmark for Learning to (Learn at Test Time): RNNs with Expressive Hidden States

Cuda · 83 stars · 5 forks · Updated Jul 14, 2024

PyTorch half-precision GEMM library with fused optional bias and optional ReLU/GELU

Cuda · 78 stars · 4 forks · Updated Dec 3, 2024

Cuda · 43 stars · 13 forks · Updated May 21, 2021

A minimalist and extensible PyTorch extension for implementing custom backend operators in PyTorch.

Cuda · 39 stars · 5 forks · Updated Jan 24, 2026
Cuda · 27 stars · 6 forks · Updated Jul 28, 2025

Cuda · 14 stars · 2 forks · Updated Sep 14, 2021