michalwols

Starred repositories

43 stars written in Cuda

LLM training in simple, raw C/CUDA

Cuda 29,745 3,564 Updated Jun 26, 2025

Instant neural graphics primitives: lightning fast NeRF and more

Cuda 17,371 2,055 Updated Feb 2, 2026

The project is an official implementation of our CVPR2019 paper "Deep High-Resolution Representation Learning for Human Pose Estimation"

Cuda 4,472 925 Updated Aug 30, 2024

Fast parallel CTC.

Cuda 4,073 1,033 Updated Mar 4, 2024

Squeeze-and-Excitation Networks

Cuda 3,628 850 Updated Feb 25, 2019

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.

Cuda 3,331 401 Updated Jan 17, 2026

Tile primitives for speedy kernels

Cuda 3,328 276 Updated Apr 25, 2026

Mirage Persistent Kernel: Compiling LLMs into a MegaKernel

Cuda 2,231 200 Updated Apr 27, 2026

GPU Accelerated t-SNE for CUDA with Python bindings

Cuda 1,925 137 Updated Oct 2, 2024

Fully Convolutional Instance-aware Semantic Segmentation

Cuda 1,564 407 Updated Sep 27, 2021

Deformable ConvNets V2 (DCNv2) in PyTorch

Cuda 1,488 231 Updated Nov 18, 2022

Efficient GPU kernels for block-sparse matrix multiplication and convolution

Cuda 1,065 198 Updated Jun 8, 2023

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing …

Cuda 1,002 231 Updated Apr 28, 2026

[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.

Cuda 988 91 Updated Feb 25, 2026

Code from the "CUDA Crash Course" YouTube series by CoffeeBeforeArch

Cuda 947 182 Updated Jul 19, 2023

Pytorch Bindings for warp-ctc

Cuda 762 263 Updated Jul 2, 2023

Reference implementation of real-time autoregressive wavenet inference

Cuda 745 125 Updated Jan 19, 2021

Distribution-Aware Coordinate Representation for Human Pose Estimation

Cuda 566 82 Updated May 17, 2024

A GPU implementation of Convolutional Neural Nets in C++

Cuda 505 229 Updated Oct 1, 2020

PopSift is an implementation of the SIFT algorithm in CUDA.

Cuda 492 125 Updated Mar 31, 2026

Integral Human Pose Regression

Cuda 487 76 Updated Apr 4, 2019

A UNIVERSAL MUSIC TRANSLATION NETWORK - a method for translating music across musical instruments and styles.

Cuda 464 71 Updated Aug 15, 2021

PyTorch implementation of Deformable Convolution

Cuda 411 53 Updated Feb 17, 2019

Official pytorch Code for CVPR2019 paper "Fast Human Pose Estimation" https://arxiv.org/abs/1811.05419

Cuda 398 67 Updated Sep 16, 2022

GPU implementation of a fast generalized ANS (asymmetric numeral system) entropy encoder and decoder, with extensions for lossless compression of numerical and other data types in HPC/ML applications.

Cuda 382 33 Updated Mar 18, 2026

Chamfer Distance in Pytorch with f-score

Cuda 374 49 Updated Jan 8, 2021

Approximate nearest neighbor search with product quantization on GPU in pytorch and cuda

Cuda 232 23 Updated Dec 12, 2023

Implementation of fused cosine similarity attention in the same style as Flash Attention

Cuda 220 12 Updated Feb 13, 2023

CUDA Matrix Factorization Library with Alternating Least Square (ALS)

Cuda 182 45 Updated Aug 14, 2018

GGNN: State of the Art Graph-based GPU Nearest Neighbor Search

Cuda 173 28 Updated Feb 11, 2025