Arindam Das arindas

Focusing

Specializes in distributed systems, deep learning inference and AI SaaS at scale.

160 followers · 1.1k following

Starred repositories

22 stars written in Cuda

karpathy / llm.c

LLM training in simple, raw C/CUDA

Cuda 29,305 3,459 Updated Jun 26, 2025

NVlabs / instant-ngp

Instant neural graphics primitives: lightning fast NeRF and more

Cuda 17,344 2,061 Updated Feb 2, 2026

HigherOrderCO / HVM2

A massively parallel, optimal functional runtime in Rust

Cuda 11,219 434 Updated Nov 21, 2024

xlite-dev / LeetCUDA

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

Cuda 10,056 1,011 Updated Mar 23, 2026

deepseek-ai / DeepEP

DeepEP: an efficient expert-parallel communication library

Cuda 9,087 1,133 Updated Feb 9, 2026

deepseek-ai / DeepGEMM

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 6,296 847 Updated Mar 22, 2026

luanfujun / deep-painterly-harmonization

Code and data for paper "Deep Painterly Harmonization": https://arxiv.org/abs/1804.03189

Cuda 6,059 613 Updated Aug 2, 2021

HazyResearch / ThunderKittens

Tile primitives for speedy kernels

Cuda 3,285 269 Updated Mar 28, 2026

mit-han-lab / torchsparse

[MICRO'23, MLSys'22] TorchSparse: Efficient Training and Inference Framework for Sparse Convolution on GPUs.

Cuda 1,451 187 Updated Feb 24, 2025

ArchaeaSoftware / cudahandbook

Source code that accompanies The CUDA Handbook.

Cuda 570 198 Updated Mar 10, 2026

facebookresearch / music-translation

A UNIVERSAL MUSIC TRANSLATION NETWORK - a method for translating music across musical instruments and styles.

Cuda 465 71 Updated Aug 15, 2021

GPU implementation of a fast generalized ANS (asymmetric numeral system) entropy encoder and decoder, with extensions for lossless compression of numerical and other data types in HPC/ML applications.

Cuda 380 33 Updated Mar 18, 2026

leimao / CUDA-GEMM-Optimization

CUDA Matrix Multiplication Optimization

Cuda 263 25 Updated Jul 19, 2024

xlite-dev / ffpa-attn

🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.

Cuda 254 14 Updated Feb 13, 2026

LaurieWired / BenchmarkCustomPTX

Custom PTX Instruction Benchmark

Cuda 138 12 Updated Feb 27, 2025

salykova / sgemm.cu

High-Performance FP32 GEMM on CUDA devices

Cuda 118 8 Updated Jan 21, 2025

phys-sim-book / solid-sim-tutorial-gpu

A curated set of C++ examples for optimization-based elastodynamic contact simulation using CUDA, emphasizing algorithmic convergence, penetration-free, and inversion-free conditions. Designed for …

Cuda 109 6 Updated Jun 29, 2025