GARRYHU
  • SJTU
  • Shanghai, China

SparseServe: Unlocking Parallelism for Dynamic Sparse Attention in Long-Context LLM Serving

7 stars · Updated Sep 30, 2025

A CUDA tutorial for learning CUDA programming from scratch

Cuda · 258 stars · 65 forks · Updated Jul 9, 2024

Leveraging Critical Proof Obligations for Efficient IC3 Verification

C++ · 2 stars · Updated Nov 19, 2024

ModelChecker: A bit-level model checking tool

C++ · 9 stars · 1 fork · Updated Mar 18, 2025

Triton adapter for Ascend. Mirror of https://gitee.com/ascend/triton-ascend

Python · 82 stars · 6 forks · Updated Nov 4, 2025

Scalable long-context LLM decoding that leverages sparsity by treating the KV cache as a vector storage system.

Python · 98 stars · 15 forks · Updated Sep 17, 2025
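
To make the "KV cache as a vector store" idea concrete, here is an illustrative sketch of retrieval-style sparse decoding in general (plain PyTorch, not code from this repository): score the cached keys against the current query and attend only over the top-k matches.

    # Illustrative sketch, not the repository's implementation: attend over only
    # the top-k cached positions retrieved for the current query (single head, no batch).
    import torch

    def topk_sparse_attention(q, k_cache, v_cache, k=64):
        # q: (d,), k_cache / v_cache: (n_cached, d)
        scores = k_cache @ q / k_cache.shape[-1] ** 0.5       # query-key similarity for every cached token
        idx = scores.topk(min(k, scores.numel())).indices     # retrieve the k most relevant positions
        weights = torch.softmax(scores[idx], dim=-1)          # softmax only over the retrieved subset
        return weights @ v_cache[idx]                         # sparse attention output, shape (d,)

    d, n = 128, 4096
    out = topk_sparse_attention(torch.randn(d), torch.randn(n, d), torch.randn(n, d))
    print(out.shape)  # torch.Size([128])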

A comprehensive survey of LLM development, collecting research papers together with links to their code.

232 stars · 7 forks · Updated Jul 29, 2025

A high-throughput and memory-efficient inference and serving engine for LLMs

Python · 84 stars · 134 forks · Updated Nov 5, 2025
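
As a point of reference for what such an engine's entry points look like, below is a minimal offline-inference sketch using vLLM's Python API (LLM and SamplingParams); the model name is a placeholder and exact arguments may vary across vLLM releases.

    # Minimal sketch, assuming vLLM is installed and a GPU is available;
    # "facebook/opt-125m" is only a placeholder model name.
    from vllm import LLM, SamplingParams

    llm = LLM(model="facebook/opt-125m")
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    outputs = llm.generate(["High-throughput LLM serving means"], params)
    for out in outputs:
        print(out.prompt, "->", out.outputs[0].text)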

Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton.

Python · 148 stars · 14 forks · Updated Nov 3, 2025

This is a fork of SGLang for hip-attention integration; please refer to hip-attention for details.

Python · 18 stars · 2 forks · Updated Oct 15, 2025

🍒 Cherry Studio is a desktop client that supports multiple LLM providers.

TypeScript · 35,061 stars · 3,177 forks · Updated Nov 5, 2025

🤯 LobeHub - an open-source, modern-design AI agent workspace. Supports multiple AI providers (OpenAI / Claude 4 / Gemini / DeepSeek / Ollama / Qwen), knowledge base (file upload / RAG), one click …

TypeScript · 67,455 stars · 13,933 forks · Updated Nov 5, 2025

[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Cuda · 344 stars · 36 forks · Updated Jul 10, 2025

The code of our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"

Python · 386 stars · 36 forks · Updated Apr 20, 2024

[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

C++ · 775 stars · 53 forks · Updated Mar 6, 2025

FlashMLA: Efficient Multi-head Latent Attention Kernels

C++ · 11,842 stars · 896 forks · Updated Sep 30, 2025

[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

Python · 268 stars · 17 forks · Updated Aug 31, 2024

[ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length

Python · 125 stars · 7 forks · Updated Oct 29, 2025

[ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

Python · 267 stars · 18 forks · Updated May 1, 2025

Shanghai Jiao Tong University LaTeX templates for thesis proposals and annual reports (unofficial)

TeX · 126 stars · 7 forks · Updated Oct 31, 2025

LongBench v2 and LongBench (ACL 2025 & 2024)

Python · 1,008 stars · 107 forks · Updated Jan 15, 2025

Large Language Model (LLM) Systems Paper List

1,580 stars · 86 forks · Updated Nov 4, 2025

An introductory PyTorch tutorial; read it online at https://datawhalechina.github.io/thorough-pytorch/

Jupyter Notebook · 3,366 stars · 510 forks · Updated Oct 31, 2025

Supercharge Your LLM with the Fastest KV Cache Layer

Python · 5,902 stars · 690 forks · Updated Nov 5, 2025

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

C++ · 4,216 stars · 420 forks · Updated Nov 5, 2025

📚A curated list of Awesome Diffusion Inference Papers with Codes: Sampling, Cache, Quantization, Parallelism, etc.🎉

Python · 435 stars · 21 forks · Updated Aug 19, 2025

InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (OSDI'24)

Python · 157 stars · 29 forks · Updated Jul 10, 2024

Hackable and optimized Transformers building blocks, supporting a composable construction.

Python · 10,062 stars · 731 forks · Updated Oct 31, 2025
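
One of those building blocks, the memory-efficient attention operator, can be called directly; a minimal sketch follows, assuming an xformers build compatible with the local PyTorch/CUDA setup and the (batch, seq, heads, head_dim) tensor layout.

    # Sketch of xFormers' memory-efficient attention operator (assumes CUDA + fp16 support).
    import torch
    import xformers.ops as xops

    B, S, H, D = 2, 2048, 8, 64
    q = torch.randn(B, S, H, D, device="cuda", dtype=torch.float16)
    k = torch.randn(B, S, H, D, device="cuda", dtype=torch.float16)
    v = torch.randn(B, S, H, D, device="cuda", dtype=torch.float16)

    out = xops.memory_efficient_attention(q, k, v)  # avoids materializing the full S x S attention matrix
    print(out.shape)                                # (B, S, H, D)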

Fast and memory-efficient exact attention

Python · 20,353 stars · 2,114 forks · Updated Nov 5, 2025
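
For orientation, a hedged usage sketch of the flash_attn_func interface is shown below; it assumes the flash-attn package is installed, a CUDA GPU is present, and the (batch, seqlen, nheads, headdim) layout in half precision, while argument names may differ between releases.

    # Sketch of calling FlashAttention's functional interface.
    import torch
    from flash_attn import flash_attn_func

    B, S, H, D = 2, 1024, 8, 64  # batch, sequence length, heads, head dim
    q = torch.randn(B, S, H, D, device="cuda", dtype=torch.float16)
    k = torch.randn(B, S, H, D, device="cuda", dtype=torch.float16)
    v = torch.randn(B, S, H, D, device="cuda", dtype=torch.float16)

    out = flash_attn_func(q, k, v, causal=True)  # exact attention without forming the S x S score matrix
    print(out.shape)                             # (B, S, H, D)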

Official code for the paper "Chain of Ideas: Revolutionizing Research via Novel Idea Development with LLM Agents"

Python · 475 stars · 28 forks · Updated Jan 15, 2025