Stars

LLM Inference

19 repositories

A highly optimized LLM inference acceleration engine for Llama and its variants.

C++ 907 102 Updated Jul 10, 2025

[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding

Python 135 9 Updated Dec 4, 2024

Quantized Attention on GPU

Python 44 Updated Nov 22, 2024

Materials for learning SGLang

702 51 Updated Dec 15, 2025

A Flexible Framework for Experiencing Heterogeneous LLM Inference/Fine-tune Optimizations

Python 16,260 1,193 Updated Dec 25, 2025

Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings)

Python 345 45 Updated Apr 22, 2025

FlashInfer: Kernel Library for LLM Serving

Python 4,358 616 Updated Dec 25, 2025

Fast inference from large language models via speculative decoding

Python 872 93 Updated Aug 22, 2024
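The idea behind speculative decoding can be sketched in a few lines: a cheap draft model proposes several tokens autoregressively, the expensive target model verifies them in one pass and keeps the longest agreeing prefix, then contributes one token of its own. This is a minimal greedy-matching toy (models stand in as plain functions), not the repository's actual implementation:

```python
def speculative_decode(target_next, draft_next, prompt, gamma=4, max_new=8):
    """Toy greedy speculative decoding.

    target_next/draft_next: functions mapping a token list to the next token.
    gamma: number of tokens the draft proposes per round (hypothetical name).
    """
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # Draft proposes `gamma` tokens autoregressively.
        proposal, ctx = [], list(out)
        for _ in range(gamma):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target verifies: accept the longest prefix it agrees with.
        for t in proposal:
            if target_next(out) == t:
                out.append(t)
            else:
                break
        # Target always contributes one token (correction or bonus).
        out.append(target_next(out))
    return out[len(prompt):len(prompt) + max_new]
```

When the draft agrees with the target, each round advances up to `gamma + 1` tokens for a single (batched) target verification, which is where the speedup comes from; real implementations verify under the target's probability distribution rather than by exact greedy match.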

Tile primitives for speedy kernels

Cuda 3,017 220 Updated Dec 9, 2025

LLM inference in C/C++

C++ 91,979 14,240 Updated Dec 25, 2025

Compare different hardware platforms via the Roofline Model for LLM inference tasks.

Jupyter Notebook 119 5 Updated Mar 13, 2024
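The Roofline Model behind that comparison reduces to one formula: attainable throughput is the minimum of peak compute and memory bandwidth times arithmetic intensity. A minimal sketch, with illustrative A100-like numbers assumed rather than taken from the repository:

```python
def roofline(peak_flops, mem_bw, arithmetic_intensity):
    """Attainable FLOP/s = min(peak compute, memory bandwidth * arithmetic intensity)."""
    return min(peak_flops, mem_bw * arithmetic_intensity)

# Illustrative hardware numbers (assumed for the example):
PEAK = 312e12   # FP16 tensor-core peak, FLOP/s
BW = 2.0e12     # HBM bandwidth, bytes/s
RIDGE = PEAK / BW  # intensity (FLOP/byte) at which the kernel becomes compute-bound
```

LLM decode is dominated by matrix-vector products with intensity near 1 FLOP/byte, far below the ridge point, which is why single-stream decoding is memory-bandwidth-bound on essentially all accelerators.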

Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25).

Python 2,078 233 Updated Dec 18, 2025

Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.

Python 6,169 568 Updated Aug 22, 2025

PyTorch native quantization and sparsity for training and inference

Python 2,592 389 Updated Dec 25, 2025

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 66,167 12,181 Updated Dec 25, 2025

DeepEP: an efficient expert-parallel communication library

Cuda 8,832 1,038 Updated Dec 24, 2025

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 6,000 780 Updated Dec 23, 2025
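"Fine-grained scaling" means each small block of values gets its own quantization scale, so a single outlier only degrades its own block rather than the whole tensor. This toy uses symmetric int8-style rounding in pure Python to illustrate the idea; it is not the repository's FP8 kernel:

```python
def quantize_blockwise(x, block=4, qmax=127):
    """Per-block symmetric quantization: one scale per `block` values."""
    scales, q = [], []
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        s = max(abs(v) for v in blk) / qmax or 1.0  # avoid zero scale
        scales.append(s)
        q.extend(round(v / s) for v in blk)
    return q, scales

def dequantize_blockwise(q, scales, block=4):
    return [q[i] * scales[i // block] for i in range(len(q))]
```

With per-tensor scaling, the 100.0 outlier below would force a scale that crushes the small values everywhere; with per-block scales, blocks without the outlier keep fine resolution.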

Expert Parallelism Load Balancer

Python 1,323 195 Updated Mar 24, 2025
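The load-balancing problem here is assigning MoE experts with uneven traffic to GPUs so per-GPU load is even. A classic baseline (not necessarily this project's algorithm) is greedy longest-processing-time placement: sort experts by load and always hand the next one to the least-loaded GPU:

```python
import heapq

def balance_experts(expert_loads, num_gpus):
    """Greedy LPT placement: heaviest experts first, each to the lightest GPU.

    expert_loads: list of per-expert traffic; returns {gpu: [expert ids]}.
    """
    heap = [(0.0, g, []) for g in range(num_gpus)]  # (total load, gpu id, experts)
    heapq.heapify(heap)
    for eid, load in sorted(enumerate(expert_loads), key=lambda p: -p[1]):
        total, g, experts = heapq.heappop(heap)  # least-loaded GPU
        experts.append(eid)
        heapq.heappush(heap, (total + load, g, experts))
    return {g: experts for _, g, experts in heap}
```

Production balancers additionally consider expert replication and hierarchical (node/GPU) topology, but the greedy heuristic already gets within a small constant factor of optimal makespan.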

A lightweight data processing framework built on DuckDB and 3FS.

Python 4,876 431 Updated Mar 5, 2025