🎯
Focusing
  • Algorithmic engineer
  • Chengdu

Organizations

@DIP-ML-AI


Accelerate inference without tears

Python 375 23 Updated Jan 23, 2026

Efficient AI Inference & Serving

Python 479 31 Updated Jan 8, 2024

📰 Must-read papers and blogs on Speculative Decoding ⚡️

1,206 75 Updated Apr 18, 2026

Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.

Python 6,204 572 Updated Aug 22, 2025

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

Python 24,739 2,761 Updated Aug 12, 2024

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks

Python 7,225 398 Updated Jul 11, 2024

Official code implementation of Vary-toy (Small Language Model Meets with Reinforced Vision Vocabulary)

Python 630 43 Updated Dec 30, 2024

TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones

Python 1,315 79 Updated Feb 5, 2026

llm-export can export LLM models to ONNX.

Python 350 40 Updated Oct 24, 2025

Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads

Jupyter Notebook 2,729 201 Updated Jun 25, 2024

Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25).

Python 2,311 272 Updated Feb 20, 2026

📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉

Python 5,188 372 Updated Apr 20, 2026

Official inference library for Mistral models

Jupyter Notebook 10,783 1,040 Updated Apr 20, 2026

Inference Llama 2 in one file of pure C

C 19,454 2,525 Updated Aug 6, 2024

High-speed Large Language Model Serving for Local Deployment

C++ 9,400 570 Updated Jan 24, 2026

Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052

C++ 479 37 Updated Mar 15, 2024

[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

Python 1,333 83 Updated Mar 6, 2025

A lightweight LLM inference framework

C++ 752 95 Updated Apr 7, 2024

A series of large language models developed by Baichuan Intelligent Technology

Python 4,109 293 Updated Nov 8, 2024

Chinese version of llm-numbers

134 6 Updated Dec 25, 2023

Numbers every LLM developer should know

4,300 140 Updated Jan 16, 2024

TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…

Python 13,516 2,333 Updated Apr 30, 2026

Code and documents of LongLoRA and LongAlpaca (ICLR 2024 Oral)

Python 2,692 286 Updated Aug 14, 2024

The official Python library for the OpenAI API

Python 30,644 4,758 Updated Apr 29, 2026

Simple, safe way to store and distribute tensors

Python 3,733 312 Updated Apr 28, 2026

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Python 3,520 315 Updated Jul 17, 2025

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

Python 7,832 692 Updated Apr 29, 2026

LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.

Python 4,035 322 Updated Apr 30, 2026