- Start the vLLM server in one terminal or tmux pane:

      docker exec -it vllm-container /bin/bash
      cd /workspace
      ./1_bench.sh server
- In another terminal or tmux pane, run the performance benchmark:

      docker exec -it vllm-container /bin/bash
      cd /workspace
      ./1_bench.sh perf
- Results will be saved in `/workspace/results/` inside the container.
- To run a profiling trace (standard):

      docker exec -it vllm-container /bin/bash
      cd /workspace
      ./1_bench.sh profile

- Profiling traces are saved in `/workspace/profile/` inside the container (e.g., `*.pt.trace.json.gz`).
- To profile a specific kernel (e.g., `fused_moe_kernel`) with ROCm Compute Profiler:

      docker exec -it vllm-container /bin/bash
      cd /workspace
      ./1_bench.sh profile_fused_moe

- Results will be saved in `workloads/fused_moe_profile/MI300/` inside the container.
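
The modes listed above can also be started without first opening an interactive shell. The following is a minimal sketch, assuming the container is named `vllm-container` as above and that the installed Docker version supports the `-w` (working directory) option of `docker exec`. The `run_bench` function and the `DRY_RUN` and `CONTAINER` variables are illustrative names, not part of `1_bench.sh`.

```shell
# Hypothetical helper: run one 1_bench.sh mode inside the container
# without an interactive shell. With DRY_RUN=1 it only prints the command.
CONTAINER="${CONTAINER:-vllm-container}"

run_bench() {
    mode="$1"
    # Accept only the modes mentioned in this README.
    case "$mode" in
        server|perf|profile|profile_fused_moe) ;;
        *) echo "unknown mode: $mode" >&2; return 2 ;;
    esac
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "docker exec -w /workspace $CONTAINER ./1_bench.sh $mode"
    else
        docker exec -w /workspace "$CONTAINER" ./1_bench.sh "$mode"
    fi
}
```

For example, `DRY_RUN=1 run_bench perf` prints `docker exec -w /workspace vllm-container ./1_bench.sh perf`. The `server` mode is long-running, so the interactive form shown above may be easier to stop and restart.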
- First, create the destination directories on the host if they don't exist:

      mkdir -p ~/ai_sprint_paris/scripts/profile/
      mkdir -p ~/ai_sprint_paris/scripts/results/
- Copy a profiling trace from the container to the host:

      docker cp vllm-container:/workspace/profile/your_trace_file.pt.trace.json.gz ~/ai_sprint_paris/scripts/profile/

- Copy a benchmark result from the container to the host:

      docker cp vllm-container:/workspace/results/your_result_file.json ~/ai_sprint_paris/scripts/results/
- From your local machine, use `scp` to download the file:

      scp root@YOUR_IP:/root/ai_sprint_paris/scripts/profile/your_trace_file.pt.trace.json.gz .
      scp root@YOUR_IP:/root/ai_sprint_paris/scripts/results/your_result_file.json .

  Replace `YOUR_IP` and the filenames as appropriate.
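
When there are several traces or result files, copying a whole directory can be simpler than naming each file. The sketch below relies on the documented `docker cp` behaviour that a source path ending in `/.` copies the directory's contents. The `copy_out` function and the `DEST_ROOT`, `CONTAINER`, and `DRY_RUN` variables are hypothetical names introduced for illustration.

```shell
# Hypothetical helper: copy everything under /workspace/<sub>/ in the
# container to the matching host directory. DRY_RUN=1 only prints the command.
CONTAINER="${CONTAINER:-vllm-container}"
DEST_ROOT="${DEST_ROOT:-$HOME/ai_sprint_paris/scripts}"

copy_out() {
    sub="$1"                              # "profile" or "results"
    src="$CONTAINER:/workspace/$sub/."    # trailing /. copies the contents
    dest="$DEST_ROOT/$sub/"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "docker cp $src $dest"
        return 0
    fi
    # Create the destination first, then copy.
    mkdir -p "$dest" && docker cp "$src" "$dest"
}
```

Calling `copy_out profile` and then `copy_out results` on the host would mirror both directories; the `scp` step can then fetch them with `scp -r` if whole directories are wanted.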
- Go to https://ui.perfetto.dev/ in your browser.
- Upload the `.pt.trace.json.gz` file to visualize the trace.
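
Before uploading, it can be worth checking that the downloaded trace was not truncated in transit. This is a small sketch using `gzip -t`, which tests archive integrity without extracting; the `check_trace` function name is an illustrative assumption, not part of the benchmark scripts.

```shell
# Hypothetical helper: verify that a trace file has the expected suffix
# and is a complete, readable gzip archive.
check_trace() {
    f="$1"
    case "$f" in
        *.pt.trace.json.gz) ;;
        *) echo "unexpected file name: $f" >&2; return 2 ;;
    esac
    # gzip -t exits non-zero if the archive is corrupt or incomplete.
    if ! gzip -t "$f" 2>/dev/null; then
        echo "corrupt or incomplete gzip: $f" >&2
        return 1
    fi
    echo "ok: $f"
}
```

For example, `check_trace your_trace_file.pt.trace.json.gz` prints `ok: your_trace_file.pt.trace.json.gz` when the file is intact and returns a non-zero status otherwise.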
- If you get a permissions error with your SSH key, restrict permissions using:

      icacls "C:\Users\MeMyself\.ssh\id_rsa" /inheritance:r
      icacls "C:\Users\MeMyself\.ssh\id_rsa" /grant:r "$($env:USERNAME):(R)"
      icacls "C:\Users\MeMyself\.ssh\id_rsa" /remove "Users" "Authenticated Users" "Everyone"
      # Repeat for id_ed25519 if needed
- If a directory does not exist, create it with `mkdir -p ...` before copying files.
For more details on profiling with ROCm Compute Profiler, see the official documentation and GitHub repo.
- The benchmark and profiling scripts now use input and output lengths of 2048 tokens each (`INPUT_LENGTH=2048`, `OUTPUT_LENGTH=2048`) for higher-quality evaluation.
- The benchmark commands include `--seed 92100` for reproducibility and `--disable-log-requests` to reduce logging noise, following best practices for LLM benchmarking.