
Starred repositories


The agent that grows with you

Python 126,605 18,948 Updated Apr 30, 2026

⭐ AI-driven public opinion & trend monitor with multi-platform aggregation, RSS, and smart alerts. 🎯 Say goodbye to information overload — your AI public-opinion monitoring assistant and trending-topic filter! Aggregates trending topics from multiple platforms plus RSS subscriptions, with precise keyword filtering. AI-curated news + AI translation + AI analysis briefings pushed straight to your phone; also supports integration with the MCP architecture…

Python 55,907 23,939 Updated Apr 30, 2026

AI agents that automatically run research on single-GPU nanochat training

Python 78,114 11,390 Updated Mar 26, 2026

Official Implementation For PolarQuant

Python 37 3 Updated Apr 2, 2026

TurboQuant KV cache compression plugin for vLLM — asymmetric K/V, 8 models validated, consumer GPUs

Python 45 5 Updated Apr 10, 2026

IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

94 8 Updated Mar 14, 2026

From-scratch PyTorch implementation of Google's TurboQuant (ICLR 2026) for LLM KV cache compression. 5x compression at 3-bit with 99.5% attention fidelity.

Python 978 131 Updated Apr 23, 2026

TurboQuant: Near-optimal KV cache quantization for LLM inference (3-bit keys, 2-bit values) with Triton kernels + vLLM integration

Python 1,262 155 Updated Mar 27, 2026
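The asymmetric bit split above (3-bit keys, 2-bit values) can be illustrated with a minimal uniform-quantization sketch. This is illustrative only and not TurboQuant's actual scheme, which uses more sophisticated near-optimal quantizers; all function names here are hypothetical.

```python
import numpy as np

def quantize(x, bits):
    """Uniform symmetric quantization to the given bit width (illustrative)."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 3 for 3-bit, 1 for 2-bit
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct an approximation of the original tensor."""
    return q.astype(np.float32) * scale

# Keys at 3 bits, values at 2 bits -- the asymmetric K/V split.
rng = np.random.default_rng(0)
keys = rng.standard_normal((4, 8)).astype(np.float32)
vals = rng.standard_normal((4, 8)).astype(np.float32)

qk, sk = quantize(keys, 3)
qv, sv = quantize(vals, 2)
k_err = np.abs(dequantize(qk, sk) - keys).mean()
```

Keys typically tolerate less quantization error than values in attention, which is one motivation for spending more bits on them.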

Model compression toolkit engineered for enhanced usability, comprehensiveness, and efficiency.

Python 721 88 Updated Apr 29, 2026

DFlash: Block Diffusion for Flash Speculative Decoding

Python 2,435 176 Updated Apr 26, 2026

Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 600+ LLMs (Qwen3.6, DeepSeek-R1, GLM-5.1, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Gemma4, Llava, …

Python 13,976 1,390 Updated Apr 30, 2026

Official Implementation of DART (DART: Diffusion-Inspired Speculative Decoding for Fast LLM Inference).

Python 55 2 Updated Feb 8, 2026

This is the official implementation of our paper "SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning"

Python 55 5 Updated Apr 2, 2026

[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding

Python 146 12 Updated Dec 4, 2024

Source code accompanying a research paper on training multi-token-prediction language models with self-distillation.

Python 36 7 Updated Feb 21, 2026

Common recipes to run vLLM

JavaScript 770 250 Updated Apr 30, 2026
Python 220 12 Updated Sep 25, 2025

A curated list of papers, tools, and resources on Multi-Token Prediction (MTP) and related techniques in Large Language Models (LLMs), Speech-Language Models (SLMs), and more.

86 4 Updated Apr 28, 2026

[NeurIPS 2025] Speculate Deep and Accurate

Jupyter Notebook 17 9 Updated Jan 16, 2026

Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞

TypeScript 366,769 75,331 Updated Apr 30, 2026
Python 12 3 Updated Nov 11, 2025
Python 32 4 Updated Mar 26, 2026

A lightweight inference engine supporting self-speculative decoding (SSD).

Python 900 63 Updated Mar 22, 2026

SGLang is a high-performance serving framework for large language models and multimodal models.

Python 26,788 5,641 Updated Apr 30, 2026

Draft-Target Disaggregation LLM Serving System via Parallel Speculative Decoding.

Python 199 31 Updated Mar 18, 2026

Train speculative decoding models effortlessly and port them smoothly to SGLang serving.

Python 814 218 Updated Apr 2, 2026

A selective knowledge distillation algorithm for efficient speculative decoders

39 4 Updated Nov 27, 2025

[ICML 2025 Spotlight] RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding

Python 22 Updated Mar 2, 2025

Hierarchical Speculative Decoding is a state-of-the-art verification algorithm for lossless, accelerated LLM inference.

Python 21 3 Updated Apr 14, 2026
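Several of the repositories above build on the same draft-then-verify pattern, which guarantees lossless output: a cheap draft model proposes tokens, and the target model keeps only the prefix it agrees with. A toy greedy sketch (illustrative, not any repo's API; the function names and toy "models" are assumptions):

```python
def greedy_speculative_decode(target_next, draft_next, prompt, k, steps):
    """Toy greedy speculative decoding loop.

    target_next / draft_next map a token sequence to its next token.
    Each step the draft proposes k tokens; the target accepts the
    longest matching prefix and emits one correction or bonus token.
    """
    seq = list(prompt)
    for _ in range(steps):
        # Draft proposes k tokens autoregressively.
        draft, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # Target verifies: accept while the draft matches its own choice.
        accepted, ctx = [], list(seq)
        for t in draft:
            tt = target_next(ctx)
            if tt == t:
                accepted.append(t)
                ctx.append(t)
            else:
                accepted.append(tt)   # correction token from the target
                break
        else:
            accepted.append(target_next(ctx))  # bonus token when all match
        seq.extend(accepted)
    return seq

# Demo: a perfect draft accepts all k tokens plus one bonus per step.
target = lambda s: s[-1] + 1
out = greedy_speculative_decode(target, target, [0], k=3, steps=2)
```

Because the target re-checks every draft token, the output matches what the target would have produced alone, whatever the draft proposes — only the number of target passes changes.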