Starred repositories
The agent that grows with you
⭐ AI-driven public opinion & trend monitor with multi-platform aggregation, RSS, and smart alerts. 🎯 Say goodbye to information overload: your AI public-opinion monitoring assistant and trending-topic filter! Aggregates hot topics from multiple platforms + RSS subscriptions, with precise keyword filtering. AI news screening + AI translation + AI analysis briefs pushed straight to your phone; MCP integration is also supported…
AI agents that automatically run research on single-GPU nanochat training
TurboQuant KV cache compression plugin for vLLM — asymmetric K/V, 8 models validated, consumer GPUs
IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
From-scratch PyTorch implementation of Google's TurboQuant (ICLR 2026) for LLM KV cache compression. 5x compression at 3-bit with 99.5% attention fidelity.
TurboQuant: Near-optimal KV cache quantization for LLM inference (3-bit keys, 2-bit values) with Triton kernels + vLLM integration
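Several of the TurboQuant entries above center on low-bit KV cache quantization with asymmetric bit-widths (3-bit keys, 2-bit values). As a rough illustration of that asymmetry only, not TurboQuant's actual algorithm, here is a minimal per-token uniform quantization sketch; the tensor layout and helper names are assumptions:

```python
import torch

def quantize_uniform(x: torch.Tensor, n_bits: int):
    """Per-token asymmetric uniform quantization over the last dim.

    Sketch only: real KV-cache quantizers such as TurboQuant use extra
    transforms and fused kernels; this just shows the bit-width asymmetry.
    """
    qmax = 2 ** n_bits - 1
    lo = x.amin(dim=-1, keepdim=True)
    hi = x.amax(dim=-1, keepdim=True)
    scale = (hi - lo).clamp(min=1e-8) / qmax
    q = ((x - lo) / scale).round().clamp(0, qmax).to(torch.uint8)
    return q, scale, lo

def dequantize(q, scale, zero):
    return q.float() * scale + zero

# Assumed (heads, seq_len, head_dim) layout; keys get 3 bits, values 2.
keys, values = torch.randn(4, 128, 64), torch.randn(4, 128, 64)
k_q, k_s, k_z = quantize_uniform(keys, n_bits=3)
v_q, v_s, v_z = quantize_uniform(values, n_bits=2)
print((dequantize(k_q, k_s, k_z) - keys).abs().mean())  # key reconstruction error
```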
Model compression toolkit engineered for enhanced usability, comprehensiveness, and efficiency.
DFlash: Block Diffusion for Flash Speculative Decoding
Use PEFT or full-parameter training for CPT/SFT/DPO/GRPO on 600+ LLMs (Qwen3.6, DeepSeek-R1, GLM-5.1, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Gemma4, Llava, …
Official implementation of DART (Diffusion-Inspired Speculative Decoding for Fast LLM Inference).
This is the official implementation of our paper "SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning"
[ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
Source code accompanying a research paper on training multi-token prediction language models with self-distillation.
A curated list of papers, tools, and resources on Multi-Token Prediction (MTP) and related techniques in Large Language Models (LLMs), Speech-Language Models (SLMs), and more.
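Both MTP entries above concern multi-token prediction, where the model predicts several future tokens from a single hidden state. A minimal sketch of the idea, assuming independent linear heads on a shared trunk (the class name and sizes are illustrative, not any listed paper's architecture):

```python
import torch
import torch.nn as nn

class MTPHeads(nn.Module):
    """K independent linear heads predicting tokens t+1 .. t+K from one state.

    Illustrative sketch of multi-token prediction; real MTP models (and the
    self-distillation training mentioned above) differ in detail.
    """
    def __init__(self, d_model: int, vocab_size: int, k: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(k))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model) -> logits: (batch, seq, k, vocab)
        return torch.stack([head(hidden) for head in self.heads], dim=2)

heads = MTPHeads(d_model=256, vocab_size=32000, k=4)
logits = heads(torch.randn(2, 16, 256))
print(logits.shape)  # torch.Size([2, 16, 4, 32000])
```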
[NeurIPS 2025] Speculate Deep and Accurate
Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞
A lightweight inference engine supporting self-speculative decoding (SSD).
SGLang is a high-performance serving framework for large language models and multimodal models.
Draft-Target Disaggregation LLM Serving System via Parallel Speculative Decoding.
Train speculative decoding models effortlessly and port them smoothly to SGLang serving.
A selective knowledge distillation algorithm for efficient speculative decoders
[ICML 2025 Spotlight] RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding
Hierarchical Speculative Decoding is a state-of-the-art verification algorithm for lossless acceleration of LLM inference.
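Most of the speculative-decoding repositories above share the same verification core: a cheap draft model proposes k tokens and the target model checks them in a single forward pass. A minimal sketch of the standard accept/reject rule, with random tensors standing in for real model outputs (the function name is illustrative):

```python
import torch

def verify_draft(draft_probs: torch.Tensor,
                 target_probs: torch.Tensor,
                 draft_tokens: torch.Tensor) -> list[int]:
    """Canonical speculative-decoding accept/reject verification.

    draft_probs:  (k, vocab)     draft-model distributions q_i
    target_probs: (k + 1, vocab) target-model distributions p_i
    draft_tokens: (k,)           tokens proposed by the draft model

    Accept token x_i with probability min(1, p_i(x_i) / q_i(x_i)); on the
    first rejection, resample from the residual max(p_i - q_i, 0). If all
    k drafts survive, sample one bonus token from p_{k+1}.
    """
    out: list[int] = []
    for i, tok in enumerate(draft_tokens.tolist()):
        p, q = target_probs[i, tok], draft_probs[i, tok]
        if torch.rand(()) < torch.clamp(p / q, max=1.0):
            out.append(tok)                      # accepted
        else:
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0.0)
            out.append(int(torch.multinomial(residual / residual.sum(), 1)))
            return out                           # stop at first rejection
    out.append(int(torch.multinomial(target_probs[-1], 1)))  # bonus token
    return out

# Toy usage with random distributions standing in for real model outputs.
k, vocab = 4, 50
q = torch.softmax(torch.randn(k, vocab), dim=-1)
p = torch.softmax(torch.randn(k + 1, vocab), dim=-1)
tokens = torch.multinomial(q, 1).squeeze(-1)
print(verify_draft(q, p, tokens))
```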