Stars
high-performance linear attention kernel library built on TileLang
A pure-Python implementation of the Nvidia CuTe layout algebra intended to be approachable and easy to learn.
Nsight Python is a Python kernel profiling interface based on NVIDIA Nsight Tools
Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
Accelerating MoE with IO and Tile-aware Optimizations
A compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems.
PyTorch building blocks for the OLMo ecosystem
TTS model capable of streaming conversational audio in realtime.
A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
Build compute kernels and load them from the Hub.
gpt-oss-120b and gpt-oss-20b are two open-weight language models by OpenAI
[CVPR 2025] Parallel Sequence Modeling via Generalized Spatial Propagation Network
An extremely fast Python type checker and language server, written in Rust.
Parallel Scaling Law for Language Model — Beyond Parameter and Inference Time Scaling
Qwen3 is the large language model series developed by Qwen team, Alibaba Cloud.
A PyTorch native platform for training generative AI models
An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs
A Conversational Speech Generation Model
🤖FFPA: Extends FlashAttention-2 via Split-D for large headdims, 1.5x~3×↑🎉 vs SDPA, up to 430T🎉 on H200.
Unofficial implementation of Titans, SOTA memory for transformers, in Pytorch
The official repo of MiniMax-Text-01 and MiniMax-VL-01, large-language-model & vision-language-model based on Linear Attention
Karabiner-Elements is a powerful tool for customizing keyboards on macOS