Stars
This repository contains the code to train and evaluate TRIBE v2, a multimodal model for brain response prediction
100M tokens. Infinite compute. Lowest val loss wins.
MoE training for Me and You and maybe other people
GPU programming related news and material links
Persist and reuse KV Cache to speed up your LLM.
Source code examples from the Parallel Forall Blog
Bridge Megatron-Core to Hugging Face/Reinforcement Learning
An app that brings language models directly to your phone.
gpt-oss-120b and gpt-oss-20b are two open-weight language models by OpenAI
A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
Supercharge Your LLM with the Fastest KV Cache Layer
[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
Open source repo for Locate 3D Model, 3D-JEPA and Locate 3D Dataset
[ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection
Unified KV Cache Compression Methods for Auto-Regressive Models
SGLang is a high-performance serving framework for large language models and multimodal models.
A high-throughput and memory-efficient inference and serving engine for LLMs
Dynamic Memory Management for Serving LLMs without PagedAttention
Awesome LLM compression research papers and tools.
[CVPR 2025 Best Paper Award] VGGT: Visual Geometry Grounded Transformer
📰 Must-read papers and blogs on Speculative Decoding ⚡️
A library to analyze PyTorch traces.
A curated list for Efficient Large Language Models
An implementation of a deep learning recommendation model (DLRM)
New repo collection for NVIDIA Cosmos: https://github.com/nvidia-cosmos