arindas
:octocat: Focusing

Organizations

@solidstatedb

Starred repositories

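22 stars written in Cuda
Each entry below gives the repository description, followed by language, star count, fork count, and last update.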

LLM training in simple, raw C/CUDA

Cuda 29,722 3,560 Updated Jun 26, 2025

Instant neural graphics primitives: lightning fast NeRF and more

Cuda 17,372 2,055 Updated Feb 2, 2026

A massively parallel, optimal functional runtime in Rust

Cuda 11,235 437 Updated Nov 21, 2024

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

Cuda 10,787 1,089 Updated Apr 20, 2026

DeepEP: an efficient expert-parallel communication library

Cuda 9,563 1,203 Updated Apr 24, 2026

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 7,104 935 Updated Apr 24, 2026

Code and data for paper "Deep Painterly Harmonization": https://arxiv.org/abs/1804.03189

Cuda 6,051 613 Updated Aug 2, 2021

Tile primitives for speedy kernels

Cuda 3,327 276 Updated Apr 25, 2026

[MICRO'23, MLSys'22] TorchSparse: Efficient Training and Inference Framework for Sparse Convolution on GPUs.

Cuda 1,457 188 Updated Feb 24, 2025

Source code that accompanies The CUDA Handbook.

Cuda 572 197 Updated Mar 10, 2026

A UNIVERSAL MUSIC TRANSLATION NETWORK - a method for translating music across musical instruments and styles.

Cuda 464 71 Updated Aug 15, 2021

GPU implementation of a fast generalized ANS (asymmetric numeral system) entropy encoder and decoder, with extensions for lossless compression of numerical and other data types in HPC/ML applications.

Cuda 382 33 Updated Mar 18, 2026

FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA.

Cuda 275 16 Updated Apr 22, 2026

CUDA Matrix Multiplication Optimization

Cuda 269 25 Updated Jul 19, 2024

Custom PTX Instruction Benchmark

Cuda 139 11 Updated Feb 27, 2025

High-Performance FP32 GEMM on CUDA devices

Cuda 122 9 Updated Jan 21, 2025

A curated set of C++ examples for optimization-based elastodynamic contact simulation using CUDA, emphasizing algorithmic convergence, penetration-free, and inversion-free conditions. Designed for …

Cuda 109 6 Updated Jun 29, 2025

Aims to implement dual-port and multi-qp solutions in deepEP ibrc transport

Cuda 74 3 Updated May 9, 2025

Algorithms implemented in CUDA + resources about GPGPU

Cuda 63 15 Updated Jan 18, 2022

A GPU-accelerated general-purpose metaheuristic framework for combinatorial optimization

Cuda 47 9 Updated Mar 30, 2026

Writing a CUDA software ray tracing renderer with Analysis-Driven Optimization from scratch: a python-importable, distributed parallel renderer.

Cuda 37 2 Updated Apr 12, 2026

Real-time GPU accelerated fluid simulator

Cuda 23 2 Updated Sep 4, 2025