Stars
DeepSeek 4 Flash local inference engine for Metal and CUDA
TokenSpeed is a speed-of-light LLM inference engine.
cudnn_frontend provides a C++ wrapper for the cuDNN backend API and samples on how to use it
Get the main content of any page as Markdown.
A kernel library written in tilelang
The Patterns of Scalable, Reliable, and Performant Large-Scale Systems
sitp: run nanochat by building teenygrad from scratch: the bridge from micrograd to tinygrad!
The agent that grows with you
Fast state-of-the-art image and video segmentation in portable C/C++
A nearly complete collection of prefix sum algorithms implemented in CUDA, D3D12, Unity and WGPU. Theoretically portable to all wave/warp/subgroup sizes.
A modern causal profiler built leveraging Linux tracepoints
A simple HTTP server written from scratch as a tool for teaching Unix network program architectures
Build your own high performance LLM inference engine in C++ and CUDA - a smaller version of vLLM
A small, portable and extensible C++ 3D coding framework
ikawrakow / ik_llamafile
Forked from mozilla-ai/llamafile
Distribute and run LLMs with a single file.
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
Fast, accurate & comprehensive text measurement & layout
Building the Virtuous Cycle for AI-driven LLM Systems
GLyphy is an implementation of the Slug algorithm for GPU text rasterization
WebGPU implementation of Eric Lengyel's Slug algorithm for resolution-independent vector text rendering on the GPU