ATOM (AITER Optimized Model) is a lightweight, vLLM-like inference engine built on top of AITER, focused on kernel integration and optimization.
- ROCm Optimized: Built on AMD's ROCm platform with AITER kernels (ASM, CK, Triton)
- OpenAI-Compatible API: Drop-in server with `/v1/chat/completions` and `/v1/completions` endpoints (see the example request after this list)
- Piecewise torch.compile: 4 compilation levels with CUDA graph capture for low-latency decode
- Multi-GPU Parallelism: Tensor parallelism (TP), data parallelism (DP), and expert parallelism (EP) with MORI all-to-all
- Quantization: FP8, MXFP4, INT8, INT4 with auto-detection from HuggingFace configs
- Speculative Decoding: Multi-Token Prediction (MTP) with EAGLE proposer
- Prefix Caching: xxhash64-based KV cache block sharing across sequences
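For example, once a server is running (as shown in the serving examples below), the chat endpoint accepts standard OpenAI-style requests. This is only an illustration; the model name and port are placeholders for whatever you serve:

```bash
# Standard OpenAI-style request against a locally running ATOM server.
# Model name and port are placeholders.
curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen/Qwen3-0.6B",
          "messages": [{"role": "user", "content": "Hello!"}],
          "max_tokens": 64
        }'
```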
| Model Family | HF Architecture | Dense/MoE | Notes |
|---|---|---|---|
| Llama | LlamaForCausalLM | Dense | Llama 2, Llama 3, Llama 3.1 |
| Qwen3 | Qwen3ForCausalLM | Dense | |
| Qwen3-MoE | Qwen3MoeForCausalLM | MoE | 128 experts, top-8 routing |
| DeepSeek V2/V3 | DeepseekV3ForCausalLM | MoE | MLA attention, MTP speculative decoding |
| Mixtral | MixtralForCausalLM | MoE | 8 experts, top-2 routing |
| GLM-4-MoE | Glm4MoeForCausalLM | MoE | |
| GPT-OSS | GptOssForCausalLM | MoE | Sliding window + attention sinks |
| Kimi-K2 | via --trust-remote-code | MoE | See recipe |
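To see which architecture in the table a given checkpoint maps to, you can print the architecture field of its HuggingFace config. This is a plain `transformers` check rather than an ATOM command, and the model id is only an example:

```bash
# Print the HF architecture name of a checkpoint and match it against the table above.
python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('Qwen/Qwen3-0.6B').architectures)"
```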
- AMD GPU with ROCm support
- Docker
```bash
docker pull rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.8.0

docker run -it --network=host \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    -v $HOME:/home/$USER \
    -v /mnt:/mnt \
    -v /data:/data \
    --shm-size=16G \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.8.0
```
```bash
pip install amd-aiter
git clone https://github.com/ROCm/ATOM.git
pip install ./ATOM
```
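As a quick sanity check after installation, you can confirm the ROCm PyTorch build sees your GPUs and that the packages import. This is a sketch; the import names `aiter` and `atom` are inferred from the pip package and the module paths used below:

```bash
# Verify GPU visibility and that the installed packages import.
# (Import names aiter/atom are assumptions based on the package and module names.)
python -c "import torch, aiter, atom; print(torch.version.hip, torch.cuda.is_available())"
```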
| Topic | Description | Guide |
|---|---|---|
| Architecture | System overview, request lifecycle, component design | Architecture Guide |
| Configuration | Config classes, CLI arguments, environment variables | Configuration Guide |
| Model Support | Supported models, weight loading, adding new architectures | Model Support Guide |
| Model Operations | AITER kernel integration, linear/attention/MoE/norm wrappers | Model Ops Guide |
| Scheduling & KV Cache | Batch scheduling, block allocation, prefix caching | Scheduling Guide |
| Compilation | torch.compile levels, CUDA graphs, piecewise compilation | Compilation Guide |
| Distributed | Tensor/data/expert parallelism, multi-GPU deployment | Distributed Guide |
| Serving & Benchmarks | OpenAI API server, benchmarking, profiling, speculative decoding | Serving Guide |
Deployment Recipes:
- Qwen3-235B-A22B -- TP8 + EP with FP8 KV cache (see the launch sketch after this list)
- Kimi-K2-Thinking -- MXFP4 MoE on 4 GPUs
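As a rough starting point for the first recipe, a TP8 launch with an FP8 KV cache might look like the sketch below. The expert-parallelism and other recipe-specific settings are described in the recipe itself, and the HF repo id is an assumption:

```bash
# Sketch only: TP8 launch with FP8 KV cache; EP settings are covered in the recipe.
python -m atom.entrypoints.openai_server --model Qwen/Qwen3-235B-A22B --kv_cache_dtype fp8 -tp 8
```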
The default optimization level is 3 (piecewise torch.compile with CUDA graphs).
```bash
python -m atom.examples.simple_inference --model meta-llama/Meta-Llama-3-8B --kv_cache_dtype fp8
```

Note: First-time execution may take approximately 10 minutes for model compilation.
Start an OpenAI-compatible server:
```bash
# Single GPU
python -m atom.entrypoints.openai_server --model Qwen/Qwen3-0.6B --kv_cache_dtype fp8

# Multi-GPU with tensor parallelism
python -m atom.entrypoints.openai_server --model deepseek-ai/DeepSeek-R1 --kv_cache_dtype fp8 -tp 8
```

Profile offline inference:
```bash
python -m atom.examples.profile_offline --model Qwen/Qwen3-0.6B --kv_cache_dtype fp8
```

With custom input/output lengths:
```bash
python -m atom.examples.profile_offline --model Qwen/Qwen3-0.6B --kv_cache_dtype fp8 \
    --random-input --input-length 1024 --output-length 32
```

Profile a running server:
```bash
curl -s -S -X POST http://127.0.0.1:8000/start_profile
# ... run your workload ...
curl -s -S -X POST http://127.0.0.1:8000/stop_profile
```

Run an online throughput benchmark against a running server:
```bash
MODEL=deepseek-ai/DeepSeek-R1
ISL=1024
OSL=1024
CONC=128
PORT=8000
RESULT_FILENAME=Deepseek-R1-result

python -m atom.benchmarks.benchmark_serving \
    --model=$MODEL --backend=vllm --base-url=http://localhost:$PORT \
    --dataset-name=random \
    --random-input-len=$ISL --random-output-len=$OSL \
    --random-range-ratio 0.8 \
    --num-prompts=$(( $CONC * 10 )) \
    --max-concurrency=$CONC \
    --request-rate=inf --ignore-eos \
    --save-result --percentile-metrics="ttft,tpot,itl,e2el" \
    --result-dir=./ --result-filename=$RESULT_FILENAME.json
```

For more information, visit InferenceMAX.
Install lm-eval to test model accuracy:
```bash
pip install lm-eval[api]
```

Start a server, then run the evaluation:

```bash
python -m atom.entrypoints.openai_server --model meta-llama/Meta-Llama-3-8B --kv_cache_dtype fp8

lm_eval --model local-completions \
    --model_args model=meta-llama/Meta-Llama-3-8B,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False \
    --tasks gsm8k \
    --num_fewshot 5
```

This project was adapted from nano-vllm.
We welcome issues and contributions! Please use the GitHub Issues page to report bugs or request features: https://github.com/ROCm/ATOM/issues