arindas

Organizations

@solidstatedb

Starred repositories

22 stars written in Cuda

LLM training in simple, raw C/CUDA

Cuda · 29,565 stars · 3,520 forks · Updated Jun 26, 2025

Instant neural graphics primitives: lightning fast NeRF and more

Cuda · 17,362 stars · 2,058 forks · Updated Feb 2, 2026

A massively parallel, optimal functional runtime in Rust

Cuda · 11,229 stars · 436 forks · Updated Nov 21, 2024

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

Cuda · 10,270 stars · 1,041 forks · Updated Apr 12, 2026

DeepEP: an efficient expert-parallel communication library

Cuda · 9,123 stars · 1,148 forks · Updated Apr 14, 2026

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda · 6,328 stars · 857 forks · Updated Mar 22, 2026

Code and data for paper "Deep Painterly Harmonization": https://arxiv.org/abs/1804.03189

Cuda · 6,053 stars · 613 forks · Updated Aug 2, 2021

Tile primitives for speedy kernels

Cuda · 3,312 stars · 275 forks · Updated Apr 8, 2026

[MICRO'23, MLSys'22] TorchSparse: Efficient Training and Inference Framework for Sparse Convolution on GPUs.

Cuda · 1,457 stars · 187 forks · Updated Feb 24, 2025

Source code that accompanies The CUDA Handbook.

Cuda · 571 stars · 197 forks · Updated Mar 10, 2026

A UNIVERSAL MUSIC TRANSLATION NETWORK - a method for translating music across musical instruments and styles.

Cuda · 464 stars · 71 forks · Updated Aug 15, 2021

GPU implementation of a fast generalized ANS (asymmetric numeral system) entropy encoder and decoder, with extensions for lossless compression of numerical and other data types in HPC/ML applications.

Cuda · 381 stars · 33 forks · Updated Mar 18, 2026

CUDA Matrix Multiplication Optimization

Cuda · 265 stars · 25 forks · Updated Jul 19, 2024

🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.

Cuda · 260 stars · 14 forks · Updated Feb 13, 2026

Custom PTX Instruction Benchmark

Cuda · 139 stars · 11 forks · Updated Feb 27, 2025

High-Performance FP32 GEMM on CUDA devices

Cuda · 120 stars · 9 forks · Updated Jan 21, 2025

A curated set of C++ examples for optimization-based elastodynamic contact simulation using CUDA, emphasizing algorithmic convergence, penetration-free, and inversion-free conditions. Designed for …

Cuda · 109 stars · 6 forks · Updated Jun 29, 2025

Aims to implement dual-port and multi-qp solutions in deepEP ibrc transport

Cuda · 74 stars · 3 forks · Updated May 9, 2025

Algorithms implemented in CUDA + resources about GPGPU

Cuda · 63 stars · 15 forks · Updated Jan 18, 2022

A GPU-accelerated general-purpose metaheuristic framework for combinatorial optimization

Cuda · 43 stars · 8 forks · Updated Mar 30, 2026

Writing a CUDA software ray tracing renderer with Analysis-Driven Optimization from scratch: a python-importable, distributed parallel renderer.

Cuda · 37 stars · 2 forks · Updated Apr 12, 2026

Real-time GPU accelerated fluid simulator

Cuda · 23 stars · 2 forks · Updated Sep 4, 2025