Arindam Das arindas

Focusing

Specializes in distributed systems, deep learning inference and AI SaaS at scale.

161 followers · 1.1k following

Starred repositories

22 stars written in Cuda

karpathy / llm.c

LLM training in simple, raw C/CUDA

Cuda 29,722 3,560 Updated Jun 26, 2025

NVlabs / instant-ngp

Instant neural graphics primitives: lightning fast NeRF and more

Cuda 17,372 2,055 Updated Feb 2, 2026

HigherOrderCO / HVM2

A massively parallel, optimal functional runtime in Rust

Cuda 11,235 437 Updated Nov 21, 2024

xlite-dev / LeetCUDA

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

Cuda 10,787 1,089 Updated Apr 20, 2026

deepseek-ai / DeepEP

DeepEP: an efficient expert-parallel communication library

Cuda 9,563 1,203 Updated Apr 24, 2026

deepseek-ai / DeepGEMM

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 7,104 935 Updated Apr 24, 2026

luanfujun / deep-painterly-harmonization

Code and data for paper "Deep Painterly Harmonization": https://arxiv.org/abs/1804.03189

Cuda 6,051 613 Updated Aug 2, 2021

HazyResearch / ThunderKittens

Tile primitives for speedy kernels

Cuda 3,327 276 Updated Apr 25, 2026

mit-han-lab / torchsparse

[MICRO'23, MLSys'22] TorchSparse: Efficient Training and Inference Framework for Sparse Convolution on GPUs.

Cuda 1,457 188 Updated Feb 24, 2025

ArchaeaSoftware / cudahandbook

Source code that accompanies The CUDA Handbook.

Cuda 572 197 Updated Mar 10, 2026

facebookresearch / music-translation

A UNIVERSAL MUSIC TRANSLATION NETWORK - a method for translating music across musical instruments and styles.

Cuda 464 71 Updated Aug 15, 2021

GPU implementation of a fast generalized ANS (asymmetric numeral system) entropy encoder and decoder, with extensions for lossless compression of numerical and other data types in HPC/ML applications.

Cuda 382 33 Updated Mar 18, 2026

xlite-dev / ffpa-attn

FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA.

Cuda 275 16 Updated Apr 22, 2026

leimao / CUDA-GEMM-Optimization

CUDA Matrix Multiplication Optimization

Cuda 269 25 Updated Jul 19, 2024

LaurieWired / BenchmarkCustomPTX

Custom PTX Instruction Benchmark

Cuda 139 11 Updated Feb 27, 2025

salykova / sgemm.cu

High-Performance FP32 GEMM on CUDA devices

Cuda 122 9 Updated Jan 21, 2025

phys-sim-book / solid-sim-tutorial-gpu

A curated set of C++ examples for optimization-based elastodynamic contact simulation using CUDA, emphasizing algorithmic convergence, penetration-free, and inversion-free conditions. Designed for …

Cuda 109 6 Updated Jun 29, 2025