Stars
SparseServe: Unlocking Parallelism for Dynamic Sparse Attention in Long-Context LLM Serving
A CUDA tutorial to help people learn CUDA programming from scratch
Leveraging Critical Proof Obligations for Efficient IC3 Verification
ModelChecker: A bit-level model checking tool
Triton adapter for Ascend. Mirror of https://gitee.com/ascend/triton-ascend
Scalable long-context LLM decoding that leverages sparsity—by treating the KV cache as a vector storage system.
This repository serves as a comprehensive survey of LLM development, featuring numerous research papers along with their corresponding code links.
HabanaAI / vllm-fork
Forked from vllm-project/vllm. A high-throughput and memory-efficient inference and serving engine for LLMs
Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton.
DeepAuto-AI / sglang
Forked from sgl-project/sglang. This is a fork of SGLang for hip-attention integration. Please refer to hip-attention for details.
🍒 Cherry Studio is a desktop client that supports multiple LLM providers.
🤯 LobeHub - an open-source, modern-design AI Agent Workspace. Supports multiple AI providers (OpenAI / Claude 4 / Gemini / DeepSeek / Ollama / Qwen), Knowledge Base (file upload / RAG), one click …
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
The code of our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
FlashMLA: Efficient Multi-head Latent Attention Kernels
[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
[ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length
[ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Shanghai Jiao Tong University LaTeX templates for thesis proposals and mid-term/annual reports (unofficial)
Large Language Model (LLM) Systems Paper List
An introductory PyTorch tutorial; read online at https://datawhalechina.github.io/thorough-pytorch/
Supercharge Your LLM with the Fastest KV Cache Layer
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
📚A curated list of Awesome Diffusion Inference Papers with Codes: Sampling, Cache, Quantization, Parallelism, etc.🎉
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (OSDI'24)
Hackable and optimized Transformers building blocks, supporting a composable construction.
Fast and memory-efficient exact attention
Official code for paper: Chain of Ideas: Revolutionizing Research via Novel Idea Development with LLM Agents