Stars
cuda-oxide is an experimental Rust-to-CUDA compiler that lets you write (SIMT) GPU kernels in safe(ish), idiomatic Rust. It compiles standard Rust code directly to PTX — no DSLs, no foreign languag…
Train the smallest LM you can that fits in 16MB. Best model wins!
Type annotations and runtime checking for shape and dtype of JAX/NumPy/PyTorch/etc. arrays. https://docs.kidger.site/jaxtyping/
FlashInfer: Kernel Library for LLM Serving
Allow torch tensor memory to be released and resumed later
PatchBatch is an electrophysiology data analysis program designed to facilitate automated processing of raw data into visualization-ready forms.
Solve puzzles. Improve your PyTorch.
Ongoing research training transformer models at scale
A PyTorch native platform for training generative AI models
Implementation of the Triangle Multiplicative module, used in AlphaFold2 as an efficient way to mix rows or columns of a 2D feature map, as a standalone package for PyTorch
jax-triton contains integrations between JAX and OpenAI Triton
PyTorch native quantization and sparsity for training and inference
RandomX, KawPow, CryptoNight and GhostRider unified CPU/GPU miner and RandomX benchmark
Proof of work algorithm based on random code execution
A high-throughput and memory-efficient inference and serving engine for LLMs
Perforator is a cluster-wide continuous profiling tool designed for large data centers
DeepEP: an efficient expert-parallel communication library
The Book of Statistical Proofs
Performance-portable, length-agnostic SIMD with runtime dispatch