TritonIC is a C++ application that supports two complementary inference modes:
- Triton backend — computer vision tasks (object detection, segmentation, classification, optical flow, pose, depth) via the NVIDIA Triton Inference Server over HTTP or gRPC.
- Chat backend — text and multimodal generation via any OpenAI-compatible
/v1/chat/completions endpoint (Ollama, llama.cpp, SGLang, vLLM, OpenAI, etc.).
The two backends are complementary: use Triton for real-time computer vision at high throughput; use the Chat backend for VLMs and LLMs without a Triton server.
🚧 Status: Under Development — expect frequent updates.
- Project Structure
- Architecture
- Tested Models
- Build Client Libraries
- Dependencies
- Build and Compile
- Tasks
- Notes
- Deploying Models
- Running Inference
- Docker Support
- Kubernetes Deployment
- Demo
- References
- Feedback
tritonic/
├── src/
│ ├── main/ # Entry point (client.cpp), App, Logger, ConfigManager
│ ├── triton/ # Triton client (Triton.hpp/.cpp, forwarding headers)
│ ├── chat/ # OpenAI-compatible backend (ChatBackend, ChatSession)
│ └── common/ # Shared forwarding headers
├── include/
│ ├── tritonic/ # Canonical namespaced headers
│ │ ├── core/ # types.hpp, interfaces.hpp
│ │ ├── triton/ # model_info.hpp, itriton.hpp, triton_backend.hpp
│ │ ├── chat/ # ichat_backend.hpp
│ │ └── infra/ # logger.hpp, config.hpp, config_manager.hpp
│ └── *.hpp # Backward-compat forwarding headers
├── deploy/ # Model export scripts (per task type)
├── scripts/ # Docker, setup, and utility scripts
├── config/ # Configuration files
├── docs/ # Documentation and guides
├── labels/ # Label files (COCO, ImageNet, etc.)
├── data/ # Data files (images, videos, outputs)
└── tests/ # Unit and integration tests
CMake Fetched Dependencies:
- neuriplo-tasks - Model pre/post processing and task management
TritonIC selects an inference backend at startup via --backend:
| --backend | Requires | Best for |
|---|---|---|
| triton (default) | NVIDIA Triton server | CV tasks: detection, segmentation, classification, optical flow, pose, depth |
| chat | Any OpenAI-compatible server | LLMs, VLMs, multimodal chat |
The two modes are not competing — Triton handles binary tensor workloads at real-time throughput, while the Chat backend handles text/image generation over a REST API. Choose based on your model and server.
Both implement the common tritonic::core::IInferenceBackend interface (Strategy pattern), enabling clean dependency injection and unit testing without live servers.
Note: "backend" here refers to tritonic's server selection (--backend=triton vs --backend=chat). This is distinct from the Triton server's own framework backends (TensorRT, ONNX Runtime, etc.), which are configured server-side.
For full code structure and namespace layout see AGENTS.md.
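The testing benefit of the Strategy pattern can be sketched as follows. This is an illustration only: the `sketch` namespace, the `infer` signature, `FakeBackend` and `App` are hypothetical stand-ins, and the real `tritonic::core::IInferenceBackend` interface may look quite different.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for tritonic::core::IInferenceBackend.
// The real interface and its method signatures are not shown in this README.
namespace sketch {

class IInferenceBackend {
public:
    virtual ~IInferenceBackend() = default;
    // Assumed signature: takes a request description, returns a result string.
    virtual std::string infer(const std::string& request) = 0;
};

// Test double: returns a canned response, so no live server is needed.
class FakeBackend : public IInferenceBackend {
public:
    explicit FakeBackend(std::string canned) : canned_(std::move(canned)) {}

    std::string infer(const std::string& request) override {
        last_request_ = request;  // record the call for later inspection
        return canned_;
    }

    const std::string& last_request() const { return last_request_; }

private:
    std::string canned_;
    std::string last_request_;
};

// The application depends only on the interface; the backend is injected.
class App {
public:
    explicit App(std::shared_ptr<IInferenceBackend> backend)
        : backend_(std::move(backend)) {}

    std::string run(const std::string& source) { return backend_->infer(source); }

private:
    std::shared_ptr<IInferenceBackend> backend_;
};

}  // namespace sketch
```

A unit test would construct `sketch::App` with a `FakeBackend`, call `run`, and check both the returned value and the recorded request, without starting a Triton or chat server.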
- YOLOv5
- YOLOv6
- YOLOv7
- YOLOv8/YOLO11/YOLO26
- YOLOv9
- YOLOv10
- YOLOv12
- YOLO-NAS
- RT-DETR
- RT-DETRv2
- RT-DETRv4
- D-FINE
- DEIM
- DEIMv2
- RF-DETR
- Gemma 4 and compatible vision-language models via llama.cpp (image captioning, visual Q&A)
- LLaVA, LLaMA3-V, and other multimodal models via OpenAI-compatible endpoints
To build the client libraries, refer to the official Triton Inference Server client libraries.
For convenience, you can extract the pre-built Triton client libraries from the official NVIDIA Triton Server SDK image using Docker:
# Run the extraction script
./docker/scripts/extract_triton_libs.sh
This script will:
- Create a temporary Docker container from the nvcr.io/nvidia/tritonserver:25.06-py3-sdk image
- Extract the Triton client libraries from /workspace/install
- Copy additional Triton server headers and libraries if available
- Save everything to the ./triton_client_libs/ directory
After extraction, set the environment variable:
export TritonClientBuild_DIR=$(pwd)/triton_client_libs/install
The extracted directory structure will contain:
- install/ - Triton client build artifacts
- triton_server_include/ - Triton server headers
- triton_server_lib/ - Triton server libraries
- workspace/ - Additional workspace files
Ensure the following dependencies are installed:
- NVIDIA Triton Inference Server:
docker pull nvcr.io/nvidia/tritonserver:25.06-py3
- Triton client libraries: Tested on Release r25.06
- Protobuf and gRPC++: Versions compatible with Triton
- RapidJSON:
apt install rapidjson-dev
- libcurl:
apt install libcurl4-openssl-dev
- OpenCV 4: Tested version: 4.7.0
apt install libopencv-dev
To maintain code quality and consistency, install pre-commit hooks:
# Run the setup script
./scripts/setup/pre_commit_setup.sh
# Or install manually
pip install pre-commit
pre-commit install
- Set the environment variable TritonClientBuild_DIR or update the CMakeLists.txt with the path to your installed Triton client libraries.
- Create a build directory:
mkdir build
- Navigate to the build directory:
cd build
- Run CMake to configure the build:
cmake -DCMAKE_BUILD_TYPE=Release ..
Optional flags:
- -DSHOW_FRAME: Enable to display processed frames after inference
- -DWRITE_FRAME: Enable to write processed frames to disk
- Build the application:
cmake --build .
- Object Detection
- Classification
- Instance Segmentation
- Optical Flow
- Open Vocabulary Detection
- Pose Estimation
- Video Classification
- Depth Estimation
Other tasks are on the TODO list.
Ensure the model export versions match those supported by your Triton release. Check Triton releases here.
To deploy models, set up a model repository following the Triton Model Repository schema. The config.pbtxt file is optional unless you're using the OpenVINO backend, implementing an Ensemble pipeline, or passing custom inference parameters.
<model_repository>/
<model_name>/
config.pbtxt
<model_version>/
<model_binary>
Use the provided script for easy setup:
# Start Triton server with GPU support
./docker/scripts/docker_triton_run.sh /path/to/model_repository 25.06 gpu
# Start with CPU only
./docker/scripts/docker_triton_run.sh /path/to/model_repository 25.06 cpu
Or manually with Docker:
docker run --gpus=1 --rm \
-p 8000:8000 -p 8001:8001 -p 8002:8002 \
-v /full/path/to/model_repository:/models \
nvcr.io/nvidia/tritonserver:<xx.yy>-py3 tritonserver \
--model-repository=/models
Omit the --gpus flag if using the CPU version.
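The repository mounted at /models follows the `<model_name>/<model_version>/<model_binary>` layout described earlier. As a rough sketch of that layout in code, the helpers below build the expected path of a model file and check a version directory name. Both function names are hypothetical and not part of tritonic. The version rule (numeric names that do not start with zero) reflects the Triton model repository documentation as best understood here; verify it against your Triton release.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: true if `name` looks like a version directory Triton
// would load ("1", "2", "10"). Assumption: non-numeric names and names
// starting with '0' are ignored by the server.
bool is_version_dir_name(const std::string& name) {
    if (name.empty() || name.front() == '0') {
        return false;
    }
    return std::all_of(name.begin(), name.end(),
                       [](unsigned char c) { return std::isdigit(c) != 0; });
}

// Hypothetical helper: builds "<repo>/<model>/<version>/<file>".
std::string model_file_path(const std::string& repo, const std::string& model,
                            int version, const std::string& file) {
    return repo + "/" + model + "/" + std::to_string(version) + "/" + file;
}
```

For example, `model_file_path("/models", "yolo11s", 1, "model.onnx")` yields `/models/yolo11s/1/model.onnx`; the model and file names here are placeholders.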
./tritonic \
--source=/path/to/source.format \
--model_type=<model_type> \
--model=<model_name_folder_on_triton> \
--labelsFile=/path/to/labels/coco.names \
--protocol=<http or grpc> \
--serverAddress=<triton-ip> \
--port=<8000 for http, 8001 for grpc>
For dynamic input sizes:
--input_sizes="c,h,w"
Tritonic supports shared memory to improve inference performance by reducing data copying between the client and Triton server. Two types of shared memory are available:
System shared memory uses CPU-based shared memory for efficient data transfer:
./tritonic \
--source=/path/to/source.format \
--model=<model_name> \
--shared_memory_type=system \
...
CUDA shared memory uses GPU memory directly for zero-copy inference (requires GPU support):
./tritonic \
--source=/path/to/source.format \
--model=<model_name> \
--shared_memory_type=cuda \
--cuda_device_id=0 \
...
Configuration Options:
- --shared_memory_type or -smt: Shared memory type (none, system, or cuda). Default: none
- --cuda_device_id or -cdi: CUDA device ID when using CUDA shared memory. Default: 0
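Whichever shared memory type is selected, the region shared with the server has to be large enough for the input tensor, and Triton's shared-memory registration calls take that size in bytes. A minimal sketch of the size calculation follows; `tensor_byte_size` is a hypothetical helper, not tritonic's own code, and the FP32 element size in the example is an assumption about the model's input type.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: number of bytes needed for a dense tensor with the
// given dimensions and element size (e.g. 4 bytes for FP32).
std::size_t tensor_byte_size(const std::vector<long long>& dims,
                             std::size_t element_size) {
    std::size_t count = 1;
    for (long long d : dims) {
        if (d <= 0) {
            throw std::invalid_argument("dimension must be positive");
        }
        count *= static_cast<std::size_t>(d);
    }
    return count * element_size;
}
```

A 3x640x640 FP32 input therefore needs 3 * 640 * 640 * 4 = 4,915,200 bytes.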
Use the provided Docker scripts for quick testing:
# Run object detection
./docker/scripts/run_client.sh
# Run with debug mode
./docker/scripts/run_debug.sh
# Run optical flow
./docker/scripts/run_optical_flow.sh
# Run unit tests
./docker/scripts/run_tests.sh
Check .vscode/launch.json for additional configuration examples.
- /path/to/source.format: Path to the input video or image file. For optical flow, pass two images as a comma-separated list.
- <model_type>: Model type (e.g., yolov5, yolov8, yolo11, yoloseg, torchvision-classifier, tensorflow-classifier, vit-classifier; see the Model Type Parameters table below)
- <model_name_folder_on_triton>: Name of the model folder on the Triton server
- /path/to/labels/coco.names: Path to the label file (e.g., COCO labels)
- <http or grpc>: Communication protocol (http or grpc)
- <triton-ip>: IP address of your Triton server
- <8000 for http, 8001 for grpc>: Port number
- <batch or b>: Batch size; currently only 1 is supported
- <input_sizes or -is>: Input sizes for dynamic axes, as a semicolon-separated list of CHW shapes: CHW;CHW;... (e.g., '3,224,224' for a single input, '3,224,224;3,224,224' for two inputs, or '3,640,640;2' for rtdetr/dfine models)
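The input_sizes format above (shapes separated by semicolons, dimensions by commas) can be parsed along these lines. This is a sketch of the format only: `parse_input_sizes` is a hypothetical function, and tritonic's actual parser and its error handling may differ.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical parser for strings such as "3,224,224;3,224,224".
// Returns one vector of dimensions per input.
std::vector<std::vector<long long>> parse_input_sizes(const std::string& spec) {
    std::vector<std::vector<long long>> shapes;
    std::stringstream inputs(spec);
    std::string input;
    while (std::getline(inputs, input, ';')) {  // one shape per ';' segment
        std::vector<long long> dims;
        std::stringstream fields(input);
        std::string field;
        while (std::getline(fields, field, ',')) {  // one dimension per ','
            std::size_t used = 0;
            const long long value = std::stoll(field, &used);  // throws on non-numeric text
            if (used != field.size() || value <= 0) {
                throw std::invalid_argument("bad dimension: " + field);
            }
            dims.push_back(value);
        }
        if (dims.empty()) {
            throw std::invalid_argument("empty input shape");
        }
        shapes.push_back(dims);
    }
    return shapes;
}
```

With this sketch, '3,640,640;2' parses to two inputs, the second having the single dimension 2, which matches the rtdetr/dfine example above.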
To view all available parameters, run:
./tritonic --help
| Model | Model Type Parameter | Notes |
|---|---|---|
| YOLOv5 / v6 / v7 / v8 / v9 / v11 / v12 | yolo | Any yolo* variant works. Standard format |
| YOLOv7 End-to-End | yolov7e2e | Only for YOLOv7 exported with --grid --end2end flags (requires TensorRT backend) |
| YOLOv10 | yolov10 | Specific output format |
| YOLO26 | yolo26 | Specific output format (same as yolov10) |
| YOLO-NAS | yolonas | Specific output format |
| RT-DETR / RT-DETRv2 / RT-DETRv4 / D-FINE / DEIM / DEIMv2 | rtdetr | All RT-DETR style models share the same postprocessor |
| RT-DETR Ultralytics | rtdetrul | |
| RF-DETR Detection | rfdetr | |
| YOLOv5/v8/v11/v12 Segmentation | yoloseg | |
| YOLO26 Segmentation | yolo26seg | |
| YOLOv10 Segmentation | yolov10seg | |
| RF-DETR Segmentation | rfdetrseg | |
| Torchvision Classifier | torchvision-classifier | |
| TensorFlow Classifier | tensorflow-classifier | |
| ViT Classifier | vit-classifier | |
| RAFT Optical Flow | raft | |
| VideoMAE | videomae | 16-frame sliding window video |
| ViViT | vivit | Video Transformer |
| TimeSformer | timesformer | Video Transformer |
| ViTPose | vitpose | Pose estimation (COCO 17 keypoints) |
| Depth Anything V2 | depth_anything_v2 | Monocular depth estimation |
| OWLv2 | owlv2 | Open-vocabulary detection |
| OWL-ViT | owlvit | Open-vocabulary detection |
| Grounding DINO | grounding_dino | Open-vocabulary detection |
| YOLOv5 Pose | yolov5pose | Pose estimation |
| YOLOv8 Pose | yolov8pose | Pose estimation |
| YOLO11 Pose | yolo11pose | Pose estimation |
| YOLO26 Pose | yolo26pose | Pose estimation |
Skip Triton entirely and query any OpenAI-compatible server. Works with Ollama, llama.cpp, SGLang, vLLM, OpenAI, Together AI, OpenRouter, and Z.AI.
Single-turn with an image:
./tritonic \
--backend=chat \
--api_endpoint=http://localhost:11434/v1/chat/completions \
--model=llava:7b \
--text_prompt="Describe what you see" \
--source=/path/to/image.jpg
Interactive multi-turn session:
./tritonic \
--backend=chat \
--api_endpoint=http://localhost:11434/v1/chat/completions \
--model=llava:7b \
--text_prompt="You are a helpful assistant" \
--interactive
OpenRouter multimodal with Kimi K2.6:
export OPENROUTER_API_KEY=...
./tritonic \
--backend=chat \
--api_service=openrouter \
--model=moonshotai/kimi-k2.6 \
--text_prompt="Describe the scene and read any visible text." \
--source=/path/to/image.jpg
Together AI text-only with GLM-5.1:
export TOGETHER_API_KEY=...
./tritonic \
--backend=chat \
--api_service=together \
--model=zai-org/GLM-5.1 \
--text_prompt="Summarize the design tradeoffs in this architecture."
Z.AI multimodal with GLM-4.6V:
export ZAI_API_KEY=...
./tritonic \
--backend=chat \
--api_service=zai \
--model=glm-4.6v \
--text_prompt="Describe the image and extract the key objects." \
--source=/path/to/image.jpg
GLM-5.1 is available on Together AI and Z.AI, but it is text-only. For GLM-family image input, use GLM-4.6V.
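For readers curious what a single-turn image request looks like on the wire, the sketch below builds a request body in the shape OpenAI-compatible /v1/chat/completions servers generally accept: a user message whose content holds a text part and an image_url part carrying a base64 data URL. It is an assumption that tritonic produces this exact body; `base64_encode`, `json_escape` and `build_chat_request` are hypothetical helpers, and the image/jpeg MIME type is assumed.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Standard base64 (RFC 4648) with '=' padding.
std::string base64_encode(const std::vector<std::uint8_t>& data) {
    static const char kTable[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve(((data.size() + 2) / 3) * 4);
    std::size_t i = 0;
    for (; i + 2 < data.size(); i += 3) {  // full 3-byte groups
        const std::uint32_t n = (data[i] << 16) | (data[i + 1] << 8) | data[i + 2];
        out += kTable[(n >> 18) & 63];
        out += kTable[(n >> 12) & 63];
        out += kTable[(n >> 6) & 63];
        out += kTable[n & 63];
    }
    const std::size_t rest = data.size() - i;  // 0, 1 or 2 trailing bytes
    if (rest == 1) {
        const std::uint32_t n = data[i] << 16;
        out += kTable[(n >> 18) & 63];
        out += kTable[(n >> 12) & 63];
        out += "==";
    } else if (rest == 2) {
        const std::uint32_t n = (data[i] << 16) | (data[i + 1] << 8);
        out += kTable[(n >> 18) & 63];
        out += kTable[(n >> 12) & 63];
        out += kTable[(n >> 6) & 63];
        out += '=';
    }
    return out;
}

// Minimal JSON string escaping: quotes, backslashes and control characters.
std::string json_escape(const std::string& s) {
    std::string out;
    for (unsigned char c : s) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n"; break;
            case '\r': out += "\\r"; break;
            case '\t': out += "\\t"; break;
            default:
                if (c < 0x20) {
                    char buf[7];
                    std::snprintf(buf, sizeof buf, "\\u%04x", c);
                    out += buf;
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    return out;
}

// Builds a single-turn request body with one text part and one image part.
std::string build_chat_request(const std::string& model, const std::string& prompt,
                               const std::vector<std::uint8_t>& jpeg, int max_tokens) {
    std::string body = "{\"model\":\"" + json_escape(model) + "\",";
    body += "\"max_tokens\":" + std::to_string(max_tokens) + ",";
    body += "\"messages\":[{\"role\":\"user\",\"content\":[";
    body += "{\"type\":\"text\",\"text\":\"" + json_escape(prompt) + "\"},";
    body += "{\"type\":\"image_url\",\"image_url\":{\"url\":\"data:image/jpeg;base64,";
    body += base64_encode(jpeg);
    body += "\"}}]}]}";
    return body;
}
```

The resulting string would be sent as the POST body with a Content-Type of application/json, plus an Authorization header when the service requires an API key. A production client would normally use a JSON library rather than string concatenation; the manual version is used here only to keep the sketch dependency-free.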
Chat CLI parameters:
| Parameter | Short | Default | Description |
|---|---|---|---|
| --backend | be | triton | triton or chat |
| --api_endpoint | ae | - | Full URL, e.g. http://localhost:11434/v1/chat/completions |
| --api_service | as | - | Service preset: openai, openrouter, together, zai |
| --api_key_env | ak | - | Env-var name that holds the API key (e.g. OPENAI_API_KEY) |
| --text_prompt | tp | - | System prompt (interactive) or user prompt (single-turn) |
| --max_tokens | mxt | 256 | Max tokens to generate |
| --temperature | temp | 1.0 | Sampling temperature |
| --target_image_size | tis | 512 | Longest edge (px) before base64 encoding |
| --interactive | ia | false | Enable multi-turn REPL |
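The --target_image_size parameter bounds the longest image edge before base64 encoding. A minimal sketch of that computation is shown below, assuming the aspect ratio is preserved and images already within the limit are left untouched. Both behaviours are assumptions about tritonic, and `fit_longest_edge` is a hypothetical name.

```cpp
#include <algorithm>
#include <cmath>
#include <utility>

// Hypothetical helper: returns {width, height} scaled so that the longest
// edge equals `target`, preserving aspect ratio. Assumption: images whose
// longest edge is already <= target are returned unchanged (no upscaling).
std::pair<int, int> fit_longest_edge(int width, int height, int target) {
    const int longest = std::max(width, height);
    if (longest <= 0 || longest <= target) {
        return {width, height};
    }
    const double scale = static_cast<double>(target) / longest;
    const int w = std::max(1, static_cast<int>(std::lround(width * scale)));
    const int h = std::max(1, static_cast<int>(std::lround(height * scale)));
    return {w, h};
}
```

With the default of 512, a 1920x1080 frame would be reduced to 512x288 under these assumptions, which cuts the size of the base64 payload sent to the chat server.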
For detailed instructions on installing Docker and the NVIDIA Container Toolkit, refer to the Docker Setup Document.
docker build --rm -t tritonic -f docker/Dockerfile .
docker run --rm \
--network host \
-v /path/to/host/data:/app/data \
tritonic \
--source=<path_to_source_on_container> \
--model_type=<model_type> \
--model=<model_name_folder_on_triton> \
--labelsFile=<path_to_labels_on_container> \
--protocol=<http or grpc> \
--serverAddress=<triton-ip> \
--port=<8000 for http, 8001 for grpc>
For Kubernetes setup and deployment details, see:
Quick start:
./k8s/scripts/check_and_deploy_triton.sh
This script performs:
- kubectl installation check (and install if missing)
- Kubernetes cluster liveness check: automatically starts or installs a local cluster (minikube → kind → k3s) if none is reachable; installs kind via Docker if no tool is present
- NVIDIA GPU availability check inside the cluster: installs nvidia-container-toolkit and the NVIDIA device plugin automatically if the host has a GPU
- Triton deployment status check and reconciliation against the current manifests
- Triton deploy or update (GPU or CPU manifest)
- External Triton endpoint summary for the NodePort service
Default external access on minikube:
- HTTP: http://$(minikube ip):30800
- gRPC: $(minikube ip):30801
- Metrics: http://$(minikube ip):30802/metrics
Notes:
- The deployment uses the Triton 25.12-py3 image by default for Kubernetes.
- On minikube, if the Triton image is already present in host Docker, the deploy script loads it into the node before rollout to avoid long registry pulls.
- GPU deployments use the Recreate strategy so updates work on single-node, single-GPU clusters.
Real-time inference test (GPU RTX 3060):
- YOLOv7-tiny exported to ONNX: Demo Video
- YOLO11s exported to ONNX: Demo Video
- RAFT Optical Flow Large (exported to traced TorchScript): Demo Video
- Triton Inference Server Client Example
- Triton User Guide
- Triton Tutorials
- ONNX Models
- Torchvision Models
- Tensorflow Model Garden
Any feedback is greatly appreciated. If you have any suggestions, bug reports, or questions, don't hesitate to open an issue. Contributions, corrections, and suggestions are welcome to keep this repository relevant and useful.