Starred repositories

22 stars written in Cuda
LLM training in simple, raw C/CUDA

Cuda 29,305 3,459 Updated Jun 26, 2025

Instant neural graphics primitives: lightning fast NeRF and more

Cuda 17,344 2,061 Updated Feb 2, 2026

A massively parallel, optimal functional runtime in Rust

Cuda 11,219 434 Updated Nov 21, 2024

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

Cuda 10,056 1,011 Updated Mar 23, 2026

DeepEP: an efficient expert-parallel communication library

Cuda 9,087 1,133 Updated Feb 9, 2026

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 6,296 847 Updated Mar 22, 2026

Code and data for paper "Deep Painterly Harmonization": https://arxiv.org/abs/1804.03189

Cuda 6,059 613 Updated Aug 2, 2021

Tile primitives for speedy kernels

Cuda 3,285 269 Updated Mar 28, 2026

[MICRO'23, MLSys'22] TorchSparse: Efficient Training and Inference Framework for Sparse Convolution on GPUs.

Cuda 1,451 187 Updated Feb 24, 2025

Source code that accompanies The CUDA Handbook.

Cuda 570 198 Updated Mar 10, 2026

A Universal Music Translation Network: a method for translating music across musical instruments and styles.

Cuda 465 71 Updated Aug 15, 2021

GPU implementation of a fast generalized ANS (asymmetric numeral system) entropy encoder and decoder, with extensions for lossless compression of numerical and other data types in HPC/ML applications.

Cuda 380 33 Updated Mar 18, 2026

CUDA Matrix Multiplication Optimization

Cuda 263 25 Updated Jul 19, 2024

🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.

Cuda 254 14 Updated Feb 13, 2026

Custom PTX Instruction Benchmark

Cuda 138 12 Updated Feb 27, 2025

High-Performance FP32 GEMM on CUDA devices

Cuda 118 8 Updated Jan 21, 2025

A curated set of C++ examples for optimization-based elastodynamic contact simulation using CUDA, emphasizing algorithmic convergence, penetration-free, and inversion-free conditions. Designed for …

Cuda 109 6 Updated Jun 29, 2025

Aims to implement dual-port and multi-QP solutions in the DeepEP IBRC transport

Cuda 74 3 Updated May 9, 2025

Algorithms implemented in CUDA + resources about GPGPU

Cuda 63 15 Updated Jan 18, 2022

A GPU-accelerated general-purpose metaheuristic framework for combinatorial optimization

Cuda 42 8 Updated Mar 30, 2026

Writing a CUDA software ray tracing renderer with Analysis-Driven Optimization from scratch: a Python-importable, distributed parallel renderer.

Cuda 37 2 Updated Oct 5, 2025

Real-time GPU accelerated fluid simulator

Cuda 23 2 Updated Sep 4, 2025