Stars
Custom memory allocator in C++ built from scratch using mmap. Allocates a 1MB memory pool upfront and carves blocks from it to keep all allocations contiguous. Implements malloc, free, block reuse …
Experimental GPU language with meta-programming
A Python DSL to write Nvidia PTX for Hopper and Blackwell in JAX and PyTorch
Pure C / AVX-512 port of Craftax-Classic. Runs at 47.8M SPS on a Ryzen 9 9950X3D, 3.2x the rate of an RTX Pro 6000 Blackwell on the same environment.
HTML representation of the Intel x86 instructions documentation.
SIMD-accelerated distances, dot products, matrix ops, geospatial & geometric kernels for 16 numeric types — from 6-bit floats to 64-bit complex — across x86, Arm, RISC-V, and WASM, with bindings fo…
x86-64, ARM, and RVV intrinsics viewer
High-performance LLM inference engine — drop-in replacement for Ollama with faster multi-turn inference, lower TTFT, and higher throughput through prefix caching and continuous batching.
Interactive version of the CuTe layout paper
A pure-Python implementation of the Nvidia CuTe layout algebra intended to be approachable and easy to learn.
A zero-dependency ML framework in C with a modern Python API for full control over execution and memory.
Megapack of LeetCode solutions in many different languages
nCPU: model-native and tensor-optimized CPU research runtimes with organized workloads, tools, and docs
An MLIR-based compiler that takes GPU kernels and compiles them to real hardware instructions. Interactive web visualizer included.
GPU Engineering for AI Systems
Because `model.fit()` isn't an explanation
Exercises for Learning MLIR (Originally written for PPoPP 2026)
A C++ repository for Competitive Programmer's Handbook by Antti Laaksonen
Algorithms from Competitive Programmer's Handbook by Antti Laaksonen
Data structures and algorithms, (GenAI/ML) system design, machine learning, and DevOps coding interview practice
A collection of solutions for all problem statements on the AlgoExpert Coding Interview platform.