Nano-vLLM

A lightweight vLLM implementation built from scratch.

Key Features

🚀 Fast offline inference - Inference speeds comparable to vLLM
📖 Readable codebase - Clean implementation in ~1,200 lines of Python code
⚡ Optimization suite - Prefix caching, Torch compilation, CUDA graphs, etc.
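Prefix caching lets requests that share a prompt prefix reuse KV-cache blocks that have already been computed. The sketch below shows the idea under a stated assumption: a vLLM-style scheme where the KV cache is split into fixed-size blocks, and each full block is keyed by a hash chained over every token up to and including that block. `PrefixCache`, `BLOCK_SIZE` and `allocate` are hypothetical names chosen for illustration. They are not Nano-vLLM's actual classes.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # hypothetical; real engines use much larger blocks


class PrefixCache:
    """Toy block-level prefix cache: chained block hash -> physical block id."""

    def __init__(self) -> None:
        self.hash_to_block: Dict[int, int] = {}
        self.next_block_id = 0

    def _new_block(self) -> int:
        block_id = self.next_block_id
        self.next_block_id += 1
        return block_id

    def allocate(self, token_ids: List[int]) -> Tuple[List[int], int]:
        """Return (block table, number of prompt tokens served from cache)."""
        block_table: List[int] = []
        cached_tokens = 0
        prev_hash = -1  # seed for the hash chain
        for start in range(0, len(token_ids), BLOCK_SIZE):
            block = tuple(token_ids[start:start + BLOCK_SIZE])
            if len(block) < BLOCK_SIZE:
                # A partial trailing block is never shared; give it a fresh block.
                block_table.append(self._new_block())
                break
            # Chaining in prev_hash means a block matches only if its whole prefix matches.
            h = hash((prev_hash, block))
            prev_hash = h
            if h in self.hash_to_block:
                block_table.append(self.hash_to_block[h])  # reuse the cached KV block
                cached_tokens += BLOCK_SIZE
            else:
                block_id = self._new_block()
                self.hash_to_block[h] = block_id
                block_table.append(block_id)
        return block_table, cached_tokens


cache = PrefixCache()
shared = [1, 2, 3, 4, 5, 6, 7, 8]  # two full blocks of a shared prompt prefix
table_a, reused_a = cache.allocate(shared + [9, 10])
table_b, reused_b = cache.allocate(shared + [11, 12, 13, 14])
print(table_a, reused_a)  # [0, 1, 2] 0
print(table_b, reused_b)  # [0, 1, 3] 8
```

The second request reuses the two blocks holding the shared prefix and allocates a new block only for its own suffix. The sketch leaves out the reference counting and eviction that a real engine needs.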

Installation

pip install git+https://github.com/GeeeekExplorer/nano-vllm.git

Quick Start

See example.py for usage. The API mirrors vLLM's interface with minor differences in the LLM.generate method.

Benchmark

See bench.py for the benchmark script.

Test Configuration:

Hardware: RTX 4070
Model: Qwen3-0.6B
Total Requests: 256 sequences
Input Length: Randomly sampled between 100–1024 tokens
Output Length: Randomly sampled between 100–1024 tokens
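A workload with the shape of the test configuration above can be generated in a few lines. This is a hedged sketch and does not reproduce bench.py. It makes two assumptions. First, prompts are lists of random token IDs, with the ID bound `VOCAB_LIMIT` chosen arbitrarily. Second, each request's output length is enforced as a fixed token budget, so the total output token count is just the sum of the budgets. `Request` and `make_workload` are hypothetical names.

```python
import random
from dataclasses import dataclass
from typing import List

NUM_SEQS = 256
MIN_LEN, MAX_LEN = 100, 1024
VOCAB_LIMIT = 10_000  # hypothetical upper bound on sampled token IDs


@dataclass
class Request:
    prompt_token_ids: List[int]
    max_tokens: int  # output budget; assumes generation runs to exactly this length


def make_workload(seed: int = 0) -> List[Request]:
    rng = random.Random(seed)  # local RNG so the workload is reproducible
    requests = []
    for _ in range(NUM_SEQS):
        input_len = rng.randint(MIN_LEN, MAX_LEN)  # randint is inclusive on both ends
        prompt = [rng.randint(0, VOCAB_LIMIT) for _ in range(input_len)]
        requests.append(Request(prompt, rng.randint(MIN_LEN, MAX_LEN)))
    return requests


workload = make_workload()
total_output_tokens = sum(r.max_tokens for r in workload)
print(len(workload), total_output_tokens)
```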

Performance Results:

Inference Engine	Output Tokens	Time (s)	Throughput (tokens/s)
vLLM	133,966	98.95	1353.86
Nano-vLLM	133,966	101.90	1314.65
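Throughput in the table is output tokens divided by wall-clock time. Recomputing it from the reported figures shows Nano-vLLM reaching about 97% of vLLM's throughput on this run. The last digit can differ slightly from the table, probably because the times are rounded to two decimals.

```python
# engine: (output tokens, wall-clock seconds), copied from the table above
results = {
    "vLLM": (133_966, 98.95),
    "Nano-vLLM": (133_966, 101.90),
}

throughput = {name: tokens / secs for name, (tokens, secs) in results.items()}
for name, tps in throughput.items():
    print(f"{name}: {tps:.2f} tokens/s")

# Same output token count, so the ratio is just the inverse ratio of the times.
ratio = throughput["Nano-vLLM"] / throughput["vLLM"]
print(f"Nano-vLLM / vLLM: {ratio:.1%}")  # 97.1%
```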


Folders and files

Latest commit

History

Repository files navigation

Nano-vLLM

Key Features

Installation

Quick Start

Benchmark

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages