TritonIC - C++ Inference Client

TritonIC is a C++ application that supports two complementary inference modes:

Triton backend — computer vision tasks (object detection, segmentation, classification, optical flow, pose, depth) via the NVIDIA Triton Inference Server over HTTP or gRPC.
Chat backend — text and multimodal generation via any OpenAI-compatible /v1/chat/completions endpoint (Ollama, llama.cpp, SGLang, vLLM, OpenAI, etc.).

The two backends are complementary: use Triton for real-time computer vision at high throughput; use the Chat backend for VLMs and LLMs without a Triton server.

🚧 Status: Under Development — expect frequent updates.

Project Structure

tritonic/
├── src/
│   ├── main/                     # Entry point (client.cpp), App, Logger, ConfigManager
│   ├── triton/                   # Triton client (Triton.hpp/.cpp, forwarding headers)
│   ├── chat/                     # OpenAI-compatible backend (ChatBackend, ChatSession)
│   └── common/                   # Shared forwarding headers
├── include/
│   ├── tritonic/                 # Canonical namespaced headers
│   │   ├── core/                 # types.hpp, interfaces.hpp
│   │   ├── triton/               # model_info.hpp, itriton.hpp, triton_backend.hpp
│   │   ├── chat/                 # ichat_backend.hpp
│   │   └── infra/                # logger.hpp, config.hpp, config_manager.hpp
│   └── *.hpp                     # Backward-compat forwarding headers
├── deploy/                       # Model export scripts (per task type)
├── scripts/                      # Docker, setup, and utility scripts
├── config/                       # Configuration files
├── docs/                         # Documentation and guides
├── labels/                       # Label files (COCO, ImageNet, etc.)
├── data/                         # Data files (images, videos, outputs)
└── tests/                        # Unit and integration tests

CMake Fetched Dependencies:

neuriplo-tasks - Model pre/post processing and task management

Architecture

TritonIC selects an inference backend at startup via --backend:

`--backend`	Requires	Best for
`triton` (default)	NVIDIA Triton server	CV tasks — detection, segmentation, classification, optical flow, pose, depth
`chat`	Any OpenAI-compatible server	LLMs, VLMs, multimodal chat

The two modes are not competing — Triton handles binary tensor workloads at real-time throughput, while the Chat backend handles text/image generation over a REST API. Choose based on your model and server.

Both implement the common tritonic::core::IInferenceBackend interface (Strategy pattern), enabling clean dependency injection and unit testing without live servers.

Note: "backend" here refers to tritonic's server selection (--backend=triton vs --backend=chat). This is distinct from the Triton server's own framework backends (TensorRT, ONNX Runtime, etc.), which are configured server-side.

For full code structure and namespace layout see AGENTS.md.

Tested Models

Object Detection

Instance Segmentation

Classification

Optical Flow

RAFT

Open Vocabulary Detection

Pose Estimation

Video Classification

Depth Estimation

Depth Anything V2

Image Understanding (VLM)

Gemma 4 and compatible vision-language models via llama.cpp (image captioning, visual Q&A)
LLaVA, LLaMA3-V, and other multimodal models via OpenAI-compatible endpoints

Build Client Libraries

To build the client libraries, refer to the official Triton Inference Server client libraries.

Alternative: Extract Client Libraries from Docker

For convenience, you can extract the pre-built Triton client libraries from the official NVIDIA Triton Server SDK image using Docker:

# Run the extraction script
./docker/scripts/extract_triton_libs.sh

This script will:

Create a temporary Docker container from the nvcr.io/nvidia/tritonserver:25.06-py3-sdk image
Extract the Triton client libraries from /workspace/install
Copy additional Triton server headers and libraries if available
Save everything to ./triton_client_libs/ directory

After extraction, set the environment variable:

export TritonClientBuild_DIR=$(pwd)/triton_client_libs/install

The extracted directory structure will contain:

install/ - Triton client build artifacts
triton_server_include/ - Triton server headers
triton_server_lib/ - Triton server libraries
workspace/ - Additional workspace files

Dependencies

Ensure the following dependencies are installed:

Nvidia Triton Inference Server:

docker pull nvcr.io/nvidia/tritonserver:25.06-py3

Triton client libraries: Tested on Release r25.06
Protobuf and gRPC++: Versions compatible with Triton
RapidJSON:

apt install rapidjson-dev

libcurl:

apt install libcurl4-openssl-dev

OpenCV 4: Tested version: 4.7.0

apt install libopencv-dev

Development Setup

Pre-commit Hooks (Recommended)

To maintain code quality and consistency, install pre-commit hooks:

# Run the setup script
./scripts/setup/pre_commit_setup.sh

# Or install manually
pip install pre-commit
pre-commit install

Build and Compile

Set the environment variable TritonClientBuild_DIR or update the CMakeLists.txt with the path to your installed Triton client libraries.
Create a build directory:

mkdir build

Navigate to the build directory:

cd build

Run CMake to configure the build:

cmake -DCMAKE_BUILD_TYPE=Release ..

Optional flags:

-DSHOW_FRAME: Enable to display processed frames after inference
-DWRITE_FRAME: Enable to write processed frames to disk

Build the application:

cmake --build .

Tasks

Export Instructions

Other tasks are in TODO list.

Notes

Ensure the model export versions match those supported by your Triton release. Check Triton releases here.

Deploying Models

To deploy models, set up a model repository following the Triton Model Repository schema. The config.pbtxt file is optional unless you're using the OpenVino backend, implementing an Ensemble pipeline, or passing custom inference parameters.

Model Repository Structure

<model_repository>/
    <model_name>/
        config.pbtxt
        <model_version>/
            <model_binary>

Starting Triton Server

Use the provided script for easy setup:

# Start Triton server with GPU support
./docker/scripts/docker_triton_run.sh /path/to/model_repository 25.06 gpu

# Start with CPU only
./docker/scripts/docker_triton_run.sh /path/to/model_repository 25.06 cpu

Or manually with Docker:

docker run --gpus=1 --rm \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /full/path/to/model_repository:/models \
  nvcr.io/nvidia/tritonserver:<xx.yy>-py3 tritonserver \
  --model-repository=/models

Omit the --gpus flag if using the CPU version.

Running Inference

Command-Line Inference on Video or Image

./tritonic \
    --source=/path/to/source.format \
    --model_type=<model_type> \
    --model=<model_name_folder_on_triton> \
    --labelsFile=/path/to/labels/coco.names \
    --protocol=<http or grpc> \
    --serverAddress=<triton-ip> \
    --port=<8000 for http, 8001 for grpc> \

For dynamic input sizes:

    --input_sizes="c,h,w"

Shared Memory Support

Tritonic supports shared memory to improve inference performance by reducing data copying between the client and Triton server. Two types of shared memory are available:

System (POSIX) Shared Memory

Uses CPU-based shared memory for efficient data transfer:

./tritonic \
    --source=/path/to/source.format \
    --model=<model_name> \
    --shared_memory_type=system \
    ...

CUDA Shared Memory

Uses GPU memory directly for zero-copy inference (requires GPU support):

./tritonic \
    --source=/path/to/source.format \
    --model=<model_name> \
    --shared_memory_type=cuda \
    --cuda_device_id=0 \
    ...

Configuration Options:

--shared_memory_type or -smt: Shared memory type (none, system, or cuda). Default: none
--cuda_device_id or -cdi: CUDA device ID when using CUDA shared memory. Default: 0

Quick Start with Docker Scripts

Use the provided Docker scripts for quick testing:

# Run object detection
./docker/scripts/run_client.sh

# Run with debug mode
./docker/scripts/run_debug.sh

# Run optical flow
./docker/scripts/run_optical_flow.sh

# Run unit tests
./docker/scripts/run_tests.sh

Debugging Tips

Check .vscode/launch.json for additional configuration examples

Placeholder Descriptions

/path/to/source.format: Path to the input video or image file, for optical flow you must pass two images as comma separated list
<model_type>: Model type (e.g., yolov5, yolov8, yolo11, yoloseg, torchvision-classifier, tensorflow-classifier, vit-classifier, check below Model Type Parameters)
<model_name_folder_on_triton>: Name of the model folder on the Triton server
/path/to/labels/coco.names: Path to the label file (e.g., COCO labels)
<http or grpc>: Communication protocol (http or grpc)
<triton-ip>: IP address of your Triton server
<8000 for http, 8001 for grpc>: Port number
<batch or b >: Batch size, currently only 1 is supported
<input_sizes or -is>: Input sizes input for dynamic axes. Semi-colon separated list format: CHW;CHW;... (e.g., '3,224,224' for single input or '3,224,224;3,224,224' for two inputs, '3,640,640;2' for rtdetr/dfine models)

To view all available parameters, run:

./tritonic --help

Model Type Tag Parameters

Model	Model Type Parameter	Notes
YOLOv5 / v6 / v7 / v8 / v9 / v11 / v12	`yolo`	Any `yolo*` variant works. Standard format
YOLOv7 End-to-End	`yolov7e2e`	Only for YOLOv7 exported with `--grid --end2end` flags (requires TensorRT backend)
YOLOv10	`yolov10`	Specific output format
YOLO26	`yolo26`	Specific output format (i.e. is the same of yolov10)
YOLO-NAS	`yolonas`	Specific output format
RT-DETR / RT-DETRv2 / RT-DETRv4 / D-FINE / DEIM / DEIMv2	`rtdetr`	All RT-DETR style models share the same postprocessor
RT-DETR Ultralytics	`rtdetrul`
RF-DETR Detection	`rfdetr`
YOLOv5/v8/v11/v12 Segmentation	`yoloseg`
YOLO26 Segmentation	`yolo26seg`
YOLOv10 Segmentation	`yolov10seg`
RF-DETR Segmentation	`rfdetrseg`
Torchvision Classifier	`torchvision-classifier`
Tensorflow Classifier	`tensorflow-classifier`
ViT Classifier	`vit-classifier`
RAFT Optical Flow	`raft`
VideoMAE	`videomae`	16-frame sliding window video
ViViT	`vivit`	Video Transformer
TimeSformer	`timesformer`	Video Transformer
ViTPose	`vitpose`	Pose estimation (COCO 17 keypoints)
Depth Anything V2	`depth_anything_v2`	Monocular depth estimation
OWLv2	`owlv2`	Open-vocabulary detection
OWL-ViT	`owlvit`	Open-vocabulary detection
Grounding DINO	`grounding_dino`	Open-vocabulary detection
ViTPose	`vitpose`	Pose estimation (COCO 17 keypoints)
YOLOv5 Pose	`yolov5pose`	Pose estimation
YOLOv8 Pose	`yolov8pose`	Pose estimation
YOLO11 Pose	`yolo11pose`	Pose estimation
YOLO26 Pose	`yolo26pose`	Pose estimation
VideoMAE	`videomae`	16-frame sliding window video
ViViT	`vivit`	Video Transformer
TimeSformer	`timesformer`	Video Transformer

Chat Backend (OpenAI-compatible)

Skip Triton entirely and query any OpenAI-compatible server. Works with Ollama, llama.cpp, SGLang, vLLM, OpenAI, Together AI, OpenRouter, and Z.AI.

Single-turn with an image:

./tritonic \
    --backend=chat \
    --api_endpoint=http://localhost:11434/v1/chat/completions \
    --model=llava:7b \
    --text_prompt="Describe what you see" \
    --source=/path/to/image.jpg

Interactive multi-turn session:

./tritonic \
    --backend=chat \
    --api_endpoint=http://localhost:11434/v1/chat/completions \
    --model=llava:7b \
    --text_prompt="You are a helpful assistant" \
    --interactive

OpenRouter multimodal with Kimi K2.6:

export OPENROUTER_API_KEY=...

./tritonic \
    --backend=chat \
    --api_service=openrouter \
    --model=moonshotai/kimi-k2.6 \
    --text_prompt="Describe the scene and read any visible text." \
    --source=/path/to/image.jpg

Together AI text-only with GLM-5.1:

export TOGETHER_API_KEY=...

./tritonic \
    --backend=chat \
    --api_service=together \
    --model=zai-org/GLM-5.1 \
    --text_prompt="Summarize the design tradeoffs in this architecture."

Z.AI multimodal with GLM-4.6V:

export ZAI_API_KEY=...

./tritonic \
    --backend=chat \
    --api_service=zai \
    --model=glm-4.6v \
    --text_prompt="Describe the image and extract the key objects." \
    --source=/path/to/image.jpg

GLM-5.1 is available on Together AI and Z.AI, but it is text-only. For GLM-family image input, use GLM-4.6V.

Chat CLI parameters:

Parameter	Short	Default	Description
`--backend`	`be`	`triton`	`triton` or `chat`
`--api_endpoint`	`ae`	—	Full URL, e.g. `http://localhost:11434/v1/chat/completions`
`--api_service`	`as`	—	Service preset: `openai`, `openrouter`, `together`, `zai`
`--api_key_env`	`ak`	—	Env-var name that holds the API key (e.g. `OPENAI_API_KEY`)
`--text_prompt`	`tp`	—	System prompt (interactive) or user prompt (single-turn)
`--max_tokens`	`mxt`	`256`	Max tokens to generate
`--temperature`	`temp`	`1.0`	Sampling temperature
`--target_image_size`	`tis`	`512`	Longest edge (px) before base64 encoding
`--interactive`	`ia`	`false`	Enable multi-turn REPL

Docker Support

For detailed instructions on installing Docker and the NVIDIA Container Toolkit, refer to the Docker Setup Document.

Build

docker build --rm -t tritonic -f docker/Dockerfile .

Run Container

docker run --rm \
  -v /path/to/host/data:/app/data \
  tritonic \
  --network host \
  --source=<path_to_source_on_container> \
  --model_type=<model_type> \
  --model=<model_name_folder_on_triton> \
  --labelsFile=<path_to_labels_on_container> \
  --protocol=<http or grpc> \
  --serverAddress=<triton-ip> \
  --port=<8000 for http, 8001 for grpc>

Kubernetes Deployment

For Kubernetes setup and deployment details, see:

Quick start:

./k8s/scripts/check_and_deploy_triton.sh

This script performs:

kubectl installation check (and install if missing)
Kubernetes cluster liveness check — automatically starts or installs a local cluster (minikube → kind → k3s) if none is reachable; installs kind via Docker if no tool is present
NVIDIA GPU availability check inside cluster — installs nvidia-container-toolkit and the NVIDIA device plugin automatically if the host has a GPU
Triton deployment status check and reconciliation against the current manifests
Triton deploy or update (GPU or CPU manifest)
External Triton endpoint summary for the NodePort service

Default external access on minikube:

HTTP: http://$(minikube ip):30800
gRPC: $(minikube ip):30801
Metrics: http://$(minikube ip):30802/metrics

Notes:

The deployment uses the Triton 25.12-py3 image by default for Kubernetes.
On minikube, if the Triton image is already present in host Docker, the deploy script loads it into the node before rollout to avoid long registry pulls.
GPU deployments use the Recreate strategy so updates work on single-node, single-GPU clusters.

Demo

Real-time inference test (GPU RTX 3060):

YOLOv7-tiny exported to ONNX: Demo Video
YOLO11s exported to onnx: Demo Video
RAFT Optical Flow Large(exported to traced torchscript): Demo Video

References

Feedback

Any feedback is greatly appreciated. If you have any suggestions, bug reports, or questions, don't hesitate to open an issue. Contributions, corrections, and suggestions are welcome to keep this repository relevant and useful.

Name		Name	Last commit message	Last commit date
Latest commit History 495 Commits
.devcontainer		.devcontainer
.github		.github
.vscode		.vscode
config/environments		config/environments
data		data
deploy		deploy
docker		docker
docs		docs
helm/triton-server		helm/triton-server
include		include
k8s		k8s
labels		labels
scripts		scripts
src		src
tests		tests
.clang-format		.clang-format
.dockerignore		.dockerignore
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
AGENTS.md		AGENTS.md
CLAUDE.md		CLAUDE.md
CMakeLists.txt		CMakeLists.txt
GEMMA4_UPDATE.md		GEMMA4_UPDATE.md
LICENSE		LICENSE
README.md		README.md
TESTING_INSTRUCTIONS.md		TESTING_INSTRUCTIONS.md
TESTING_SUMMARY.md		TESTING_SUMMARY.md

Folders and files

Latest commit

History

Repository files navigation

TritonIC - C++ Inference Client

Table of Contents

Project Structure

Architecture

Tested Models

Object Detection

Instance Segmentation

Classification

Optical Flow

Open Vocabulary Detection

Pose Estimation

Video Classification

Depth Estimation

Image Understanding (VLM)

Build Client Libraries

Alternative: Extract Client Libraries from Docker

Dependencies

Development Setup

Pre-commit Hooks (Recommended)

Build and Compile

Tasks

Export Instructions

Notes

Deploying Models

Model Repository Structure

Starting Triton Server

Running Inference

Command-Line Inference on Video or Image

Shared Memory Support

System (POSIX) Shared Memory

CUDA Shared Memory

Quick Start with Docker Scripts

Debugging Tips

Placeholder Descriptions

Model Type Tag Parameters

Chat Backend (OpenAI-compatible)

Docker Support

Build

Run Container

Kubernetes Deployment

Demo

References

Feedback

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages