Mike michalwols

204 followers · 1.9k following

New York
michal.io

Starred repositories

43 stars written in Cuda

karpathy / llm.c

LLM training in simple, raw C/CUDA

Cuda · 29,745 stars · 3,564 forks · Updated Jun 26, 2025

NVlabs / instant-ngp

Instant neural graphics primitives: lightning fast NeRF and more

Cuda · 17,371 stars · 2,055 forks · Updated Feb 2, 2026

leoxiaobin / deep-high-resolution-net.pytorch

The project is an official implementation of our CVPR2019 paper "Deep High-Resolution Representation Learning for Human Pose Estimation"

Cuda · 4,472 stars · 925 forks · Updated Aug 30, 2024

baidu-research / warp-ctc

Fast parallel CTC.

Cuda · 4,073 stars · 1,033 forks · Updated Mar 4, 2024

hujie-frank / SENet

Squeeze-and-Excitation Networks

Cuda · 3,628 stars · 850 forks · Updated Feb 25, 2019

thu-ml / SageAttention

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.

Cuda · 3,331 stars · 401 forks · Updated Jan 17, 2026

HazyResearch / ThunderKittens

Tile primitives for speedy kernels

Cuda · 3,328 stars · 276 forks · Updated Apr 25, 2026

mirage-project / mirage

Mirage Persistent Kernel: Compiling LLMs into a MegaKernel

Cuda · 2,231 stars · 200 forks · Updated Apr 27, 2026

CannyLab / tsne-cuda

GPU Accelerated t-SNE for CUDA with Python bindings

Cuda · 1,925 stars · 137 forks · Updated Oct 2, 2024

msracver / FCIS

Fully Convolutional Instance-aware Semantic Segmentation

Cuda · 1,564 stars · 407 forks · Updated Sep 27, 2021

chengdazhi / Deformable-Convolution-V2-PyTorch

Deformable ConvNets V2 (DCNv2) in PyTorch

Cuda · 1,488 stars · 231 forks · Updated Nov 18, 2022

openai / blocksparse

Efficient GPU kernels for block-sparse matrix multiplication and convolution

Cuda · 1,065 stars · 198 forks · Updated Jun 8, 2023

rapidsai / raft

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing …

Cuda · 1,002 stars · 231 forks · Updated Apr 28, 2026

thu-ml / SpargeAttn

[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.

Cuda · 988 stars · 91 forks · Updated Feb 25, 2026

CoffeeBeforeArch / cuda_programming

Code from the "CUDA Crash Course" YouTube series by CoffeeBeforeArch

Cuda · 947 stars · 182 forks · Updated Jul 19, 2023

SeanNaren / warp-ctc

PyTorch bindings for warp-ctc

Cuda · 762 stars · 263 forks · Updated Jul 2, 2023

NVIDIA / nv-wavenet

Reference implementation of real-time autoregressive wavenet inference

Cuda · 745 stars · 125 forks · Updated Jan 19, 2021

ilovepose / DarkPose

Distribution-Aware Coordinate Representation for Human Pose Estimation

Cuda · 566 stars · 82 forks · Updated May 17, 2024

TorontoDeepLearning / convnet

A GPU implementation of Convolutional Neural Nets in C++

Cuda · 505 stars · 229 forks · Updated Oct 1, 2020

alicevision / popsift

PopSift is an implementation of the SIFT algorithm in CUDA.

Cuda · 492 stars · 125 forks · Updated Mar 31, 2026

JimmySuen / integral-human-pose

Integral Human Pose Regression

Cuda · 487 stars · 76 forks · Updated Apr 4, 2019

facebookresearch / music-translation

A Universal Music Translation Network - a method for translating music across musical instruments and styles.

Cuda · 464 stars · 71 forks · Updated Aug 15, 2021

1zb / deformable-convolution-pytorch

PyTorch implementation of Deformable Convolution

Cuda · 411 stars · 53 forks · Updated Feb 17, 2019

ilovepose / fast-human-pose-estimation.pytorch

Official PyTorch code for the CVPR 2019 paper "Fast Human Pose Estimation" https://arxiv.org/abs/1811.05419

Cuda · 398 stars · 67 forks · Updated Sep 16, 2022

GPU implementation of a fast generalized ANS (asymmetric numeral system) entropy encoder and decoder, with extensions for lossless compression of numerical and other data types in HPC/ML applications.

Cuda · 382 stars · 33 forks · Updated Mar 18, 2026