Nano-vLLM-Triton

A lightweight and optimized vLLM implementation built on Nano-vLLM and OpenAI Triton.

🆕 What's New

This project extends the original Nano-vLLM with the following improvements:

Extended Model Support: Added comprehensive support for:
- Qwen2 series models
- Qwen2.5 series models
- Qwen3-MoE series models
- Llama series models
- Original Qwen3 series models
Triton Optimization: Implemented custom Triton operators:
- softmax_online: Replaces torch.softmax
- cat_cos_sin: Replaces torch.cat((cos, sin), dim=-1)
- Performance Gain: Average throughput improvement of 101.93 tokens/s over Nano-vLLM across the four benchmarked models
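
The name softmax_online suggests the online softmax technique, which tracks a running maximum and a rescaled running sum in a single pass instead of making separate passes for the max and the sum. The sketch below is a pure-Python reference for that general technique, written only to illustrate the arithmetic; it is an assumption about the approach, not the Triton kernel from this repository, and it assumes finite inputs.

```python
import math


def softmax_naive(xs):
    """Reference softmax: one pass for the max, one for the sum, one to normalize."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_online(xs):
    """Online softmax: running max and running sum are updated together in one pass."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in xs:
        new_max = max(running_max, x)
        # Rescale the sum accumulated so far to the new max, then add the new term.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    # Final normalization uses the max and sum gathered in the single pass above.
    return [math.exp(x - running_max) / running_sum for x in xs]


if __name__ == "__main__":
    logits = [1000.0, 1001.0, 999.5]  # large values that would overflow without max subtraction
    print(softmax_online(logits))
    print(softmax_naive(logits))
```

Both functions should agree up to floating-point rounding; the online form is attractive on a GPU because it avoids a separate reduction pass over the row.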

📦 Installation

git clone https://github.com/cosmoliu2002/nano-vllm-triton.git
cd nano-vllm-triton
pip install -e .

🚀 Quick Start

1. Python

example.py

python example.py --model-path /path/to/your/model

Supported Parameters

--model-path
--tensor-parallel-size
--enforce-eager
--temperature
--max-tokens
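
For orientation, the flags above could be declared with argparse roughly as follows. This is a hypothetical sketch, not the contents of example.py: the types, the store_true behavior of --enforce-eager, and the defaults (borrowed from config_example.yaml) are all assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags listed above; types and defaults are assumed."""
    parser = argparse.ArgumentParser(description="Run a short generation example")
    parser.add_argument("--model-path", required=True, help="Local directory of the model weights")
    parser.add_argument("--tensor-parallel-size", type=int, default=1, help="Number of GPUs to shard across")
    parser.add_argument("--enforce-eager", action="store_true", help="Assumed to disable graph capture")
    parser.add_argument("--temperature", type=float, default=0.6, help="Sampling temperature")
    parser.add_argument("--max-tokens", type=int, default=256, help="Maximum tokens to generate")
    return parser


if __name__ == "__main__":
    # Parse an explicit argument list so the sketch runs without a real command line.
    args = build_parser().parse_args(["--model-path", "/path/to/your/model", "--temperature", "0.8"])
    print(args)
```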

2. Shell

run_example.sh

bash run_example.sh

Supported Parameters (config_example.yaml)

model_path: "/path/to/your/model"
tensor_parallel_size: 1
enforce_eager: true
temperature: 0.6
max_tokens: 256
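
One plausible way for a wrapper script to use this file is to translate each key into the matching command-line flag. The sketch below illustrates that mapping with a tiny parser for flat `key: value` lines; it is an assumption about the workflow, not the code in parse_config.py or run_example.sh, and the helper names parse_flat_yaml and to_cli_args are hypothetical. A real project would more likely use a YAML library.

```python
def parse_flat_yaml(text):
    """Parse flat `key: value` lines, ignoring blank lines and trailing # comments."""
    config = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments (assumes no '#' inside values)
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        config[key.strip()] = value.strip().strip('"')
    return config


def to_cli_args(config):
    """Turn snake_case keys into --kebab-case flags; true booleans become bare flags."""
    args = []
    for key, value in config.items():
        flag = "--" + key.replace("_", "-")
        if value.lower() in ("true", "false"):
            if value.lower() == "true":
                args.append(flag)
        else:
            args.extend([flag, value])
    return args


if __name__ == "__main__":
    sample = 'model_path: "/path/to/your/model"\nenforce_eager: true\nmax_tokens: 256\n'
    print(to_cli_args(parse_flat_yaml(sample)))
```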

📊 Performance Benchmark

run_bench.sh

bash run_bench.sh

Supported Parameters (config_bench.yaml)

model_path: "/path/to/your/model"
engine: "nanovllmtriton"             # nanovllmtriton or vllm

Test Configuration:

Hardware: NVIDIA A10 (Powered by ModelScope Notebook)
Models: Various Qwen and Llama models
Total Requests: 256 sequences
Input Length: Randomly sampled between 100–1024 tokens
Output Length: Randomly sampled between 100–1024 tokens
Test Runs: 5 runs per model; reported results are the average across runs
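
A workload matching the configuration above can be sketched as follows. This is an illustration rather than the contents of bench.py: the seed, the vocab_size parameter, and the choice to measure throughput as output tokens per second are assumptions.

```python
import random

NUM_SEQS = 256               # total requests
MIN_LEN, MAX_LEN = 100, 1024  # inclusive bounds for input and output lengths


def make_workload(seed=0, vocab_size=10000):
    """Build random prompts (lists of token ids) and a per-request output-length budget."""
    rng = random.Random(seed)
    prompts = [
        [rng.randint(0, vocab_size - 1) for _ in range(rng.randint(MIN_LEN, MAX_LEN))]
        for _ in range(NUM_SEQS)
    ]
    max_tokens = [rng.randint(MIN_LEN, MAX_LEN) for _ in range(NUM_SEQS)]
    return prompts, max_tokens


def throughput(total_output_tokens, elapsed_seconds):
    """Tokens per second, assuming only generated tokens are counted."""
    return total_output_tokens / elapsed_seconds


if __name__ == "__main__":
    prompts, max_tokens = make_workload()
    print(len(prompts), sum(max_tokens))
```

Fixing the seed keeps the workload identical across engines and runs, so differences in throughput come from the engine rather than from the sampled lengths.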

Test Results:

Model	vLLM Throughput (tokens/s)	Nano-vLLM-Triton Throughput (tokens/s)	Nano-vLLM Throughput (tokens/s)	Gain vs Nano-vLLM (tokens/s)
Llama3.2-1B	5455.85	5432.24	5394.15	+38.09
Qwen2-0.5B	9501.11	10198.84	10030.08	+168.76
Qwen2.5-0.5B	9441.46	10221.94	10033.20	+188.74
Qwen3-0.6B	3465.40	2970.81	2958.67	+12.14
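
The gain column and the 101.93 tokens/s average quoted earlier can be re-derived from the throughput figures in the table. The short script below repeats that arithmetic; the relative-gain calculation is an extra illustration that is not reported in the table.

```python
# (Nano-vLLM-Triton, Nano-vLLM) throughput in tokens/s, copied from the table above.
results = {
    "Llama3.2-1B": (5432.24, 5394.15),
    "Qwen2-0.5B": (10198.84, 10030.08),
    "Qwen2.5-0.5B": (10221.94, 10033.20),
    "Qwen3-0.6B": (2970.81, 2958.67),
}

# Absolute gain per model, rounded to match the table's two decimals.
gains = {model: round(triton - base, 2) for model, (triton, base) in results.items()}

# Unweighted mean of the per-model gains.
average_gain = round(sum(gains.values()) / len(gains), 2)

# Gain relative to the Nano-vLLM baseline, in percent (not part of the original table).
relative_gains = {model: (triton - base) / base * 100 for model, (triton, base) in results.items()}

if __name__ == "__main__":
    print(gains)
    print(average_gain)  # 101.93
    print(relative_gains)
```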

🛠️ Supported Models

Qwen Series: Qwen2, Qwen2.5, Qwen3, Qwen3-MoE
Llama Series: Llama 2, Llama 3, Llama 3.1, Llama 3.2

🗺️ Future Roadmap

Gradually replace all PyTorch implementations with custom Triton kernels

🙏 Acknowledgments

Nano-vLLM
Nano-vLLM-gogongxt, for Qwen2, Qwen2.5, Qwen3-MoE, and Llama series model support
OpenAI Triton
