ranjiewwen
🎯 Focusing
  • algorithmic engineer
  • Chengdu

Organizations

@DIP-ML-AI


Accelerate inference without tears

Python · 374 stars · 23 forks · Updated Jan 23, 2026

Efficient AI Inference & Serving

Python · 481 stars · 31 forks · Updated Jan 8, 2024

📰 Must-read papers and blogs on Speculative Decoding ⚡️

1,177 stars · 72 forks · Updated Mar 31, 2026

Simple and efficient PyTorch-native transformer text generation in <1000 lines of Python.

Python · 6,190 stars · 572 forks · Updated Aug 22, 2025

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

Python · 24,648 stars · 2,758 forks · Updated Aug 12, 2024

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks

Python · 7,209 stars · 397 forks · Updated Jul 11, 2024

Official code implementation of Vary-toy (Small Language Model Meets with Reinforced Vision Vocabulary)

Python · 630 stars · 43 forks · Updated Dec 30, 2024

TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones

Python · 1,309 stars · 79 forks · Updated Feb 5, 2026

llm-export can export LLM models to ONNX.

Python · 347 stars · 40 forks · Updated Oct 24, 2025

Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads

Jupyter Notebook · 2,720 stars · 197 forks · Updated Jun 25, 2024

Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25).

Python · 2,253 stars · 269 forks · Updated Feb 20, 2026

📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉

Python · 5,124 stars · 358 forks · Updated Mar 26, 2026

Official inference library for Mistral models

Jupyter Notebook · 10,753 stars · 1,033 forks · Updated Feb 26, 2026

Inference Llama 2 in one file of pure C

C · 19,350 stars · 2,485 forks · Updated Aug 6, 2024

High-speed Large Language Model Serving for Local Deployment

C++ · 9,245 stars · 546 forks · Updated Jan 24, 2026

Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052

C++ · 478 stars · 37 forks · Updated Mar 15, 2024

[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

Python · 1,325 stars · 82 forks · Updated Mar 6, 2025

A lightweight LLM inference framework

C++ · 751 stars · 95 forks · Updated Apr 7, 2024

A series of large language models developed by Baichuan Intelligent Technology

Python · 4,114 stars · 293 forks · Updated Nov 8, 2024

Chinese version of llm-numbers

131 stars · 6 forks · Updated Dec 25, 2023

Numbers every LLM developer should know

4,290 stars · 140 forks · Updated Jan 16, 2024

TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…

Python · 13,264 stars · 2,252 forks · Updated Apr 4, 2026

Code and documents of LongLoRA and LongAlpaca (ICLR 2024 Oral)

Python · 2,694 stars · 287 forks · Updated Aug 14, 2024

The official Python library for the OpenAI API

Python · 30,368 stars · 4,687 forks · Updated Apr 4, 2026

Simple, safe way to store and distribute tensors

Python · 3,676 stars · 303 forks · Updated Apr 2, 2026

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Python · 3,488 stars · 307 forks · Updated Jul 17, 2025

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

Python · 7,749 stars · 679 forks · Updated Apr 4, 2026

LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.

Python · 3,995 stars · 316 forks · Updated Apr 3, 2026