Starred repositories
- A massively parallel, optimal functional runtime in Rust
- [ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized attention achieving a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models
- Flash Attention in ~100 lines of CUDA (forward pass only); a pure-PyTorch sketch of the underlying online-softmax tiling follows this list
- Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/)
- Causal depthwise conv1d in CUDA, with a PyTorch interface; the padding trick it implements is sketched after this list
- Reference implementation of the Megalodon 7B model
- Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS
- 🤖 FFPA: extends FlashAttention-2 with Split-D for ~O(1) SRAM complexity at large head dimensions; 1.8x-3x speedup vs. SDPA EA 🎉
- The CUDA version of the RWKV language model (https://github.com/BlinkDL/RWKV-LM)
- Implementation of fused cosine-similarity attention in the same style as Flash Attention; see the normalization sketch after this list
- A comparison of array languages & libraries: APL, J, BQN, Uiua, Q, Julia, R, NumPy, Nial, Futhark, Dex, Ivy, SaC & ArrayFire
- High-speed GEMV kernels with up to a 2.7x speedup over the PyTorch baseline
- An extension library for the WMMA (Tensor Core) API
- Inference speed benchmark for "Learning to (Learn at Test Time): RNNs with Expressive Hidden States"
- PyTorch half-precision GEMM library with fused optional bias and optional ReLU/GELU
- A minimalist and extensible framework for implementing custom backend operators in PyTorch
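
Several of the attention entries above share one core trick: the softmax is computed tile by tile with running statistics, so the full attention matrix is never materialized. Below is a minimal pure-PyTorch sketch of that online-softmax tiling; the shapes, block size, and function name are illustrative assumptions, not code from any of the starred repos.

```python
import torch

def tiled_attention(q, k, v, block=64):
    """Forward-only attention computed over key/value tiles.

    q, k, v: (seq_len, head_dim). The running max and normalizer are
    updated tile by tile, so the (seq, seq) score matrix never exists.
    """
    scale = q.shape[-1] ** -0.5
    out = torch.zeros_like(q)
    running_max = torch.full((q.shape[0], 1), float("-inf"))
    running_sum = torch.zeros(q.shape[0], 1)
    for start in range(0, k.shape[0], block):
        kt, vt = k[start:start + block], v[start:start + block]
        scores = (q @ kt.T) * scale                    # (seq, tile)
        new_max = torch.maximum(running_max, scores.max(-1, keepdim=True).values)
        correction = torch.exp(running_max - new_max)  # rescale earlier tiles
        probs = torch.exp(scores - new_max)
        running_sum = running_sum * correction + probs.sum(-1, keepdim=True)
        out = out * correction + probs @ vt
        running_max = new_max
    return out / running_sum

# Sanity check against the naive reference.
q, k, v = (torch.randn(128, 32) for _ in range(3))
ref = torch.softmax((q @ k.T) * 32 ** -0.5, dim=-1) @ v
assert torch.allclose(tiled_attention(q, k, v), ref, atol=1e-4)
```

The CUDA implementations keep each tile in SRAM and fuse the rescaling into the accumulation loop; the arithmetic above is the same.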
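The causal depthwise conv1d entry reduces, algorithmically, to left-only padding plus a grouped convolution; the CUDA kernel exists to fuse and accelerate it. A plain-PyTorch sketch of the operation itself, with assumed shapes:

```python
import torch
import torch.nn.functional as F

def causal_depthwise_conv1d(x, weight, bias=None):
    """x: (batch, channels, seq_len); weight: (channels, 1, kernel_size).

    groups=channels makes the convolution depthwise (one filter per
    channel); padding only on the left makes it causal, so no output
    position sees future timesteps.
    """
    kernel_size = weight.shape[-1]
    x = F.pad(x, (kernel_size - 1, 0))  # pad the past, not the future
    return F.conv1d(x, weight, bias, groups=x.shape[1])

x = torch.randn(2, 8, 16)
w = torch.randn(8, 1, 4)
assert causal_depthwise_conv1d(x, w).shape == (2, 8, 16)
```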
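Cosine-similarity attention, as in the fused-kernel entry above, l2-normalizes queries and keys so the logits are bounded cosine similarities, replacing the usual 1/sqrt(d) scaling with a fixed temperature. A brief sketch; the scale value here is a hypothetical choice, not that repo's default:

```python
import torch
import torch.nn.functional as F

def cosine_sim_attention(q, k, v, scale=10.0):
    """q, k, v: (seq_len, head_dim); `scale` is an assumed temperature."""
    q = F.normalize(q, dim=-1)  # unit-norm queries
    k = F.normalize(k, dim=-1)  # unit-norm keys
    return torch.softmax((q @ k.T) * scale, dim=-1) @ v
```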