Serve PyTorch models as a REST API using Drogon; an example is included for a resnet18 model trained on ImageNet. Benchmarks show a ~6-10x improvement in throughput and latency for resnet18 at peak load compared with a FastAPI + PyTorch baseline.
# Create optimized models for your machine (sketched further below).
$ python3 optimize_model_for_inference.py
# Build and Run Server
$ docker compose run --service-ports blaze
- Add Docker to the CLion toolchain; this sets up all the necessary dependencies.
curl "localhost:8088/classify" -F "image=@images/cat.jpg"
# Drogon + libtorch
for i in {0..8}; do curl "localhost:8088/classify" -F "image=@images/cat.jpg"; done # Run once to warm up the server.
wrk -t8 -c100 -d60 -s benchmark/upload.lua "http://localhost:8088/classify" --latency
# FastAPI + pytorch
cd benchmark/python_fastapi
python3 -m venv env
source env/bin/activate
python3 -m pip install -r requirements.txt # Run just once to install dependencies into the virtual environment.
gunicorn main:app -w 2 -k uvicorn.workers.UvicornWorker --bind 127.0.0.1:8088 # 2 workers gave the best performance on my machine; 3 and 4 were also tried. A sketch of main:app follows below.
deactivate # Use after benchmarking is done and gunicorn is closed
cd ../.. # back to root folder
for i in {0..8}; do curl "localhost:8088/classify" -F "image=@images/cat.jpg"; done
wrk -t8 -c100 -d60 -s benchmark/fastapi_upload.lua "http://localhost:8088/classify" --latency
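For reference, the FastAPI baseline that gunicorn serves above looks roughly like the following. This is a minimal sketch, assuming a torchvision resnet18 and the /classify multipart route the benchmark targets; the real implementation lives in benchmark/python_fastapi/main.py.

# Assumed shape of the FastAPI baseline (sketch, not the actual main.py).
import io
import torch
import torchvision
from fastapi import FastAPI, File, UploadFile
from PIL import Image
from torchvision import transforms

app = FastAPI()
model = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@app.post("/classify")
async def classify(image: UploadFile = File(...)):
    # The benchmark uploads the file under the form field "image".
    img = Image.open(io.BytesIO(await image.read())).convert("RGB")
    with torch.inference_mode():
        logits = model(preprocess(img).unsqueeze(0))
    return {"class_id": int(logits.argmax(dim=1))}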
Drogon + libtorch
# OS: Ubuntu 21.10 x86_64
# Kernel: 5.15.14-xanmod1
# CPU: AMD Ryzen 9 5900X (24) @ 3.700GHz
# GPU: NVIDIA GeForce RTX 3070
Running 1m test @ http://localhost:8088/classify
8 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 39.30ms 10.96ms 95.51ms 70.50%
Req/Sec 306.58 28.78 390.00 70.92%
Latency Distribution
50% 37.40ms
75% 45.69ms
90% 54.57ms
99% 69.34ms
146612 requests in 1.00m, 30.34MB read
Requests/sec: 2441.60
Transfer/sec: 517.41KB
FastAPI + pytorch
# OS: Ubuntu 21.10 x86_64
# Kernel: 5.15.14-xanmod1
# CPU: AMD Ryzen 9 5900X (24) @ 3.700GHz
# GPU: NVIDIA GeForce RTX 3070
Running 1m test @ http://localhost:8088/classify
8 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 449.50ms 239.30ms 1.64s 70.39%
Req/Sec 33.97 26.41 121.00 83.46%
Latency Distribution
50% 454.64ms
75% 570.73ms
90% 743.54ms
99% 1.16s
12981 requests in 1.00m, 2.64MB read
Requests/sec: 216.13
Transfer/sec: 44.96KB
- API request handling and model pre-processing in the Drogon controller: controllers/ImageClass.cc
- Batched model inference logic & post-processing in lib/ModelBatchInference.cpp (the batching pattern is sketched after this list)
- Multithreaded batched inference
- FP16 Inference
- Uses C++20 coroutines for wait-free event-loop tasks
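The batching pattern behind the bullets above is sketched here in Python for readability; the actual implementation is C++ in lib/ModelBatchInference.cpp, and the queue name, batch size and timeout below are assumptions.

# Python illustration of dynamic batching + FP16 inference (the real code is C++).
import queue
import threading
import torch

MAX_BATCH, TIMEOUT_S = 8, 0.005                      # assumed values
pending = queue.Queue()                              # holds (input tensor, reply queue) pairs
model = torch.jit.load("models/resnet18_optimized.pt").eval()
if torch.cuda.is_available():
    model = model.to("cuda").half()                  # FP16 inference on GPU

def inference_worker():
    while True:
        # Block for the first request, then keep collecting until the batch is
        # full or no new request arrives within TIMEOUT_S.
        items = [pending.get()]
        try:
            while len(items) < MAX_BATCH:
                items.append(pending.get(timeout=TIMEOUT_S))
        except queue.Empty:
            pass
        batch = torch.stack([tensor for tensor, _ in items])
        if torch.cuda.is_available():
            batch = batch.to("cuda").half()
        with torch.inference_mode():
            out = model(batch).float().cpu()
        for (_, reply), row in zip(items, out):
            reply.put(row)                           # hand each result back to its caller

threading.Thread(target=inference_worker, daemon=True).start()

A request handler would push (tensor, reply_queue) onto pending and wait for its result; the bullet above suggests the C++ server handles that waiting with C++20 coroutines on the event loop rather than by blocking threads.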
- Add compiler optimizations to CMake.
- Benchmark optimizations like channels-last memory format, ONNX and TensorRT, and report which is faster.
- Pin the batched tensor used for inference to memory and re-use it at every inference. No improvement.
- Use Torch-TensorRT for inference, fastest on CUDA devices. Cuts inference down from 5ms to 1-2ms.
- Use Torch nvJPEG for faster image decoding; ~2ms is currently spent on this call with libjpeg-turbo.
- Int8 inference using FX Graph Mode post-training quantization; ResNet Int8 quantization example1, example2 (a rough sketch follows this list).
- Benchmark framework against mosec
- Use lock-free queues
- Separate pre-processing, inference and post-processing stages.
- Added address & memory leak sanitizers to CMake.
- Dockerize for easy usage.
- WIP: Just gets the job done for now, not production ready, though tested regularly.
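For the Int8 item above, a rough sketch of FX Graph Mode post-training static quantization for resnet18, following the standard torch.ao.quantization.quantize_fx workflow; the calibration data, backend and output path are assumptions, not values taken from the referenced examples.

# Sketch of FX Graph Mode post-training quantization for resnet18
# (calibration data, backend and output path are assumptions).
import torch
import torchvision
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx

model = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()
qconfig_mapping = get_default_qconfig_mapping("fbgemm")    # x86 server backend
example_inputs = (torch.rand(1, 3, 224, 224),)

prepared = prepare_fx(model, qconfig_mapping, example_inputs)
with torch.no_grad():
    for _ in range(32):                                    # calibration pass; real
        prepared(torch.rand(8, 3, 224, 224))               # ImageNet batches belong here
quantized = convert_fx(prepared)

torch.jit.script(quantized).save("models/resnet18_int8.pt")  # path is an assumption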