VRS — Video Reasoning System

A modern, two-stage CCTV / video-understanding pipeline for a single local GPU.

VRS is inspired by the architecture patterns in NVIDIA Video Search and Summarization (VSS) and the Public Safety Blueprint: perception first, VLM-based alert verification, and optional higher-level incident reasoning. It is not a VSS clone. The goal is a smaller, hackable, local-GPU-oriented Video Reasoning System that can be evaluated, customized, and embedded into CCTV / VMS / edge-appliance environments.

The deployment target is a 16 GB card, provided the verifier is quantized or has otherwise been capacity-tested on that card. For BF16 Cosmos-Reason2-2B, validate on the target host first: NVIDIA's 2026 model card lists a 24 GB minimum for the reference inference path.

Fast path: open-vocabulary detection with YOLOE-L (~6 ms / frame on a T4). Add or change event classes by editing one YAML line — no prompt-bank curation, no per-class fine-tuning.
Slow path: a pluggable VLM verifier. The current baseline is NVIDIA Cosmos-Reason2-2B, but 2026 research and internal benchmarks show it should not be treated as the final verifier. Qwen3.5/Qwen3.6-class VLMs are priority candidates for side-by-side evaluation.

Dashboard

The local console reads runs/live/alerts.jsonl and thumbnails through a CPU-light FastAPI backend. It shows detector candidates, VLM verdicts, confidence, rationale, live RTSP state, and the two-stage cascade.

Quick Start

Start the local RTSP/API/UI stack:

docker compose up --build

Open http://127.0.0.1:5173. The Compose stack publishes runs/pr-integration/clips/falldown_test.mp4 as:

rtsp://127.0.0.1:8554/falldown

Run the full inference profile with a Hugging Face token:

cp .env.example .env
# edit .env and set HF_TOKEN / HUGGING_FACE_HUB_TOKEN
docker compose -f docker-compose.yaml -f docker-compose.hf-local.yaml \
  --profile inference up --build

Convenience commands:

just local-up
just local-logs -f inference
just local-down

See docs/local-web-ui.md for the full local workflow, RTSP checks, API checks, token setup, and troubleshooting.

Why this design

Classical CLIP-style classifiers (the "encode each frame and cosine-match a prompt bank" approach) work, but they require continuous prompt maintenance: every new event type, every new camera, every site-specific quirk needs new prompts and re-scoring. They also produce only frame-level scores, not localizations.

Why not embedding-only business logic

VRS also avoids making precomputed embeddings the primary home for business logic. Embedding lookup can be very fast, but operational semantics often end up hidden in manually maintained vectors, thresholds, and category-specific matching rules. As sites, camera angles, lighting conditions, and false-positive patterns change, engineers must continuously update those assets.

VRS keeps business intent explicit and policy-driven. Event definitions live in human-readable watch policies, while models remain execution components: the detector proposes candidates, event-state checks temporal stability, optional gates estimate risk, and the VLM verifier makes the final decision using the policy definition and video context.

Business intent       -> watch policies
Fast perception       -> open-vocabulary detector
Risk / uncertainty    -> event-state and optional gates
Final decision        -> VLM verifier

VRS uses two practical 2026-era baselines, with the verifier intentionally kept swappable:

| Stage | Model | Why |
| --- | --- | --- |
| Detect | Ultralytics YOLOE-L (`yoloe-11l-seg.pt` by default) | Open-vocabulary text prompts, returns bounding boxes, no per-site retraining loop |
| Reason | nvidia/Cosmos-Reason2-2B baseline | Physical-reasoning specialization, FPS=4 video path, bbox/point/trajectory-oriented prompting. Treat as a baseline, not the expected winner. |
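
To make the "FPS=4 video path" concrete, here is a minimal sketch of how a clip recorded at a higher frame rate could be subsampled to 4 frames per second before it reaches the verifier. The function name `sample_indices` and its parameters are illustrative assumptions, not part of the VRS codebase or of the Cosmos-Reason2 tooling.

```python
def sample_indices(source_fps: float, duration_s: float, target_fps: float = 4.0) -> list[int]:
    """Return source-frame indices that approximate `target_fps` sampling.

    Hypothetical helper: VRS may sample frames differently.
    """
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    total_frames = int(duration_s * source_fps)
    # Never sample faster than the source; a 2 fps camera stays at 2 fps.
    step = source_fps / min(target_fps, source_fps)
    indices = []
    k = 0
    while True:
        index = int(k * step)  # multiply instead of accumulating to avoid float drift
        if index >= total_frames:
            break
        indices.append(index)
        k += 1
    return indices
```

For a 30 fps camera this keeps roughly every 7th or 8th frame, so a 2-second context clip yields 8 frames for the verifier.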

Sources:

NVIDIA VSS — Video Search and Summarization documentation, used as architectural inspiration for perception-first video AI, alert verification, and optional incident reasoning patterns.
YOLOE — Ultralytics docs and CVPR'25 paper. Ultralytics also publishes newer YOLOE-26 models; migrate only after eval confirms a gain for the active policy.
Cosmos-Reason2 — NVIDIA docs and nvidia/Cosmos-Reason2-2B model card.
Qwen — Qwen3.5/Qwen3.6 official releases report stronger multimodal foundation-model capability than earlier Qwen3-VL-era models. Because Cosmos-Reason2-2B is derived from Qwen3-VL-2B, Qwen3.5/Qwen3.6-class models should be evaluated as verifier backends before production model lock-in.

Architecture

The detector runs on every sampled frame, while the verifier only runs after event-state promotes a stable candidate. This keeps the local GPU budget focused on real alert decisions instead of spending VLM time on quiet frames.
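
The control flow described above can be sketched as follows. This is an illustration under stated assumptions, not the VRS implementation: `PersistenceGate`, `run_cascade`, `min_hits`, and `window` are hypothetical names, and the detector and verifier are passed in as plain callables so the sketch stays self-contained.

```python
from dataclasses import dataclass, field


@dataclass
class PersistenceGate:
    """Promote an event once it appears in `min_hits` of the last `window` frames."""

    min_hits: int = 3
    window: int = 5
    history: dict[str, list[bool]] = field(default_factory=dict)

    def update(self, detected: set[str], known_events: set[str]) -> set[str]:
        promoted = set()
        for name in known_events:
            hits = self.history.setdefault(name, [])
            hits.append(name in detected)
            del hits[:-self.window]  # keep only the most recent `window` observations
            if sum(hits) >= self.min_hits:
                promoted.add(name)
                hits.clear()  # reset so one stable event triggers one verifier call
        return promoted


def run_cascade(frames, detect, verify, gate, known_events):
    """detect(frame) -> set of event names; verify(event, frame) -> bool."""
    verified = []
    verifier_calls = 0
    for index, frame in enumerate(frames):
        # The cheap detector runs on every sampled frame.
        for event in sorted(gate.update(detect(frame), known_events)):
            # The expensive verifier runs only for promoted candidates.
            verifier_calls += 1
            if verify(event, frame):
                verified.append((index, event))
    return verified, verifier_calls
```

With a gate of 3 hits in 5 frames, a stream with two quiet frames followed by three frames containing fire triggers one verifier call, on the third fire frame. Quiet frames never reach the verifier.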

Policy flow

Watch policies live under configs/policies. They define the user-editable event registry: detector prompts, verifier definitions, severity, confidence floors, persistence, and optional verifier context windows.
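
As a rough illustration of what one entry in that registry carries, the sketch below models a single event as a validated dataclass. The key names (`detector_prompts`, `verifier_definition`, `min_confidence`, `persistence_frames`, `context_window_s`) and the defaults are assumptions made for this example. The actual schema is documented in the policy model doc and may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EventPolicy:
    """Hypothetical in-memory form of one watch-policy entry."""

    name: str
    detector_prompts: tuple[str, ...]
    verifier_definition: str
    severity: str = "medium"
    min_confidence: float = 0.5
    persistence_frames: int = 3
    context_window_s: Optional[float] = None  # optional verifier context window


def parse_event(name: str, raw: dict) -> EventPolicy:
    """Validate a raw mapping (for example, one parsed YAML entry)."""
    prompts = tuple(p.strip() for p in raw.get("detector_prompts", []) if p.strip())
    if not prompts:
        raise ValueError(f"{name}: at least one detector prompt is required")
    definition = str(raw.get("verifier_definition", "")).strip()
    if not definition:
        raise ValueError(f"{name}: a verifier definition is required")
    min_confidence = float(raw.get("min_confidence", 0.5))
    if not 0.0 <= min_confidence <= 1.0:
        raise ValueError(f"{name}: min_confidence must be within [0, 1]")
    persistence = int(raw.get("persistence_frames", 3))
    if persistence < 1:
        raise ValueError(f"{name}: persistence_frames must be at least 1")
    window = raw.get("context_window_s")
    return EventPolicy(
        name=name,
        detector_prompts=prompts,
        verifier_definition=definition,
        severity=str(raw.get("severity", "medium")),
        min_confidence=min_confidence,
        persistence_frames=persistence,
        context_window_s=None if window is None else float(window),
    )
```

Validating at load time means a malformed policy fails before any stream starts, rather than surfacing later as a silent detector or verifier miss.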

%%{init: {"theme": "base", "themeVariables": {"background": "#0f172a", "mainBkg": "#243044", "secondBkg": "#1f2937", "primaryColor": "#243044", "primaryBorderColor": "#475569", "primaryTextColor": "#f8fafc", "lineColor": "#38bdf8", "textColor": "#f8fafc", "fontFamily": "sans-serif"}}}%%
flowchart LR
    A[Policy YAML] --> B[WatchPolicy]
    B --> C[YOLOE vocabulary]
    B --> D[Event thresholds]
    B --> E[VLM definitions]
    C --> F[Fast-path detections]
    D --> G[CandidateAlert]
    F --> G
    E --> H[VLM verifier]
    G --> H
    H --> I[VerifiedAlert]

    classDef source fill:#111827,stroke:#334155,color:#f8fafc,stroke-width:1px
    classDef policy fill:#243044,stroke:#475569,color:#f8fafc,stroke-width:1.5px
    classDef fast fill:#1f3a36,stroke:#84cc16,color:#f8fafc,stroke-width:1.5px
    classDef gate fill:#3b3420,stroke:#f59e0b,color:#f8fafc,stroke-width:1.5px
    classDef verifier fill:#2f2945,stroke:#a78bfa,color:#f8fafc,stroke-width:1.5px
    classDef output fill:#193244,stroke:#38bdf8,color:#f8fafc,stroke-width:1.5px

    class A source
    class B policy
    class C,F fast
    class D,G gate
    class E,H verifier
    class I output

    linkStyle 0 stroke:#84cc16,stroke-width:3px
    linkStyle 1 stroke:#84cc16,stroke-width:3px
    linkStyle 2 stroke:#f59e0b,stroke-width:3px
    linkStyle 3 stroke:#a78bfa,stroke-width:3px
    linkStyle 4 stroke:#84cc16,stroke-width:3px
    linkStyle 5 stroke:#f59e0b,stroke-width:3px
    linkStyle 6 stroke:#f59e0b,stroke-width:3px
    linkStyle 7 stroke:#a78bfa,stroke-width:3px
    linkStyle 8 stroke:#38bdf8,stroke-width:3px

At runtime, policy entries such as fire, smoke, and falldown are expanded into YOLOE open-vocabulary classes, mapped back to stable event names, promoted only after temporal persistence, and finally verified by the VLM before being written to JSONL, thumbnails, and the local console.
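
The expand-then-map-back step can be sketched in a few lines. This is a simplified illustration: `build_vocabulary` and `to_event_names` are hypothetical helpers, and the example prompts are made up for the demo rather than taken from the shipped safety policy.

```python
def build_vocabulary(policies: dict[str, list[str]]) -> tuple[list[str], dict[str, str]]:
    """Flatten per-event prompts into one class list plus a prompt -> event lookup."""
    classes: list[str] = []
    prompt_to_event: dict[str, str] = {}
    for event, prompts in policies.items():
        for prompt in prompts:
            key = prompt.strip().lower()
            owner = prompt_to_event.setdefault(key, event)
            if owner != event:
                # An ambiguous prompt could not be mapped back to a single event.
                raise ValueError(f"prompt {key!r} is claimed by {owner!r} and {event!r}")
            if key not in classes:
                classes.append(key)
    return classes, prompt_to_event


def to_event_names(detected_classes: list[str], prompt_to_event: dict[str, str]) -> list[str]:
    """Collapse detector class names back to stable event names, in first-seen order."""
    events: list[str] = []
    for cls in detected_classes:
        event = prompt_to_event.get(cls.strip().lower())
        if event is not None and event not in events:
            events.append(event)
    return events
```

Several prompts can point at the same event, so downstream code such as event-state, the verifier, and the JSONL sink only sees the stable event name even when prompts are reworded.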

Documentation

System review — current implementation status, known gaps, and engineering review.
Roadmap — near-term prioritized work.
Local web UI workflow — Docker Compose, RTSP, .env, Hugging Face cache, API checks, and UI troubleshooting.
Policy model — watch-policy schema, runtime flow, UI-driven editing, validation, reload strategy, and scenario-policy direction.
Operations notes — audit signing, served verifier, metrics, GPU smoke tests, policies, and evaluation reports.
VSS + SAM3 blueprint — long-term vendor-neutral platform direction, including optional SAM3 workers, DeepStream runtime adapter planning, platform contracts, semantic search, and enterprise storage.
Runtime validation matrix — validated, unvalidated, and planned GPU/runtime profiles.

GPU Setup

uv python install 3.11

# Pick the right torch build for your GPU architecture:
#   * Blackwell (RTX 5080/5090, B100, GB100) → CUDA 12.8+ / torch 2.7+
#   * Hopper   (H100, H200)                 → CUDA 12.4+
#   * Ada      (RTX 4080/4090, L4, L40)     → CUDA 12.1+
uv sync --python 3.11 --extra cu128     # Blackwell
# or:
# uv sync --python 3.11 --extra cu121   # Ada/Ampere

For the W4A16 verifier profile used by configs/tiny.yaml, include the quant extra:

uv sync --python 3.11 --extra cu128 --extra quant

Pipeline Commands

Single MP4:

uv run scripts/run_mp4.py \
  --video /path/to/cctv.mp4 \
  --config configs/default.yaml \
  --policy configs/policies/safety.yaml \
  --out runs/demo

Single RTSP:

uv run scripts/run_rtsp.py \
  --rtsp rtsp://user:pass@cam.local:554/stream1 \
  --config configs/default.yaml \
  --policy configs/policies/safety.yaml \
  --out runs/live

Multi-stream:

uv run scripts/run_multistream.py \
  --config  configs/default.yaml \
  --policy  configs/policies/safety.yaml \
  --streams configs/multistream.yaml \
  --out     runs/live

Outputs:

runs/<name>/alerts.jsonl — one JSON object per line, one line per verified alert (verdict, confidence, bbox, trajectory, rationale, thumbnail path)
runs/<name>/thumbnails/*.jpg — one event image per alert, with detector/verifier overlays
runs/<name>/annotated.mp4 — optional debug/demo overlay video when sink.write_annotated: true
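
For downstream tooling, the alert log can be consumed with the standard library alone. The sketch below assumes each record is a JSON object with a numeric `confidence` key. The exact key names are an assumption based on the field list above, so check a real `alerts.jsonl` before relying on them.

```python
import json
from pathlib import Path


def load_alerts(path, min_confidence: float = 0.0) -> list[dict]:
    """Read a JSONL alert log, keeping records at or above `min_confidence`."""
    alerts = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # A live writer may leave a partially written final line; skip it.
                continue
            if not isinstance(record, dict):
                continue
            # "confidence" is an assumed key name, defaulting to 0.0 when absent.
            if float(record.get("confidence", 0.0)) >= min_confidence:
                alerts.append(record)
    return alerts
```

Skipping undecodable lines keeps the reader usable against a log that the pipeline is still appending to, for example `load_alerts("runs/live/alerts.jsonl", min_confidence=0.6)`.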

Operations

Watch-policy changes live in configs/policies/safety.yaml; adding a custom event is one detector prompt list plus one verifier sentence.
Metrics, audit signing, served VLM verifier setup, GPU smoke tests, and eval report commands are covered in docs/operations.md.
Runtime and GPU profile status is tracked in docs/runtime-matrix.md.

| Profile | Detector | Verifier | Notes |
| --- | --- | --- | --- |
| `default.yaml` | YOLOE-L FP16 | Cosmos-Reason2-2B BF16 | Accuracy-oriented local profile; validate memory on target GPU. NVIDIA's reference model card lists 24 GB minimum. |
| `tiny.yaml` | YOLOE-S FP16 | Cosmos-Reason2-2B W4A16 | Intended for 8-16 GB cards / Jetson-class deployments after quantized-runtime validation. |
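
A back-of-envelope estimate helps explain why the two profiles target different cards. The sketch below computes weight storage only, assuming a nominal 2 billion parameters (taken from the "2B" in the model name, not from a measured count). It ignores the KV cache, activations, vision tokens, quantization scales, and runtime overhead, which is why measured usage is higher and on-target validation is still needed.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate storage for model weights alone, in GiB."""
    if num_params < 0 or bits_per_weight <= 0:
        raise ValueError("num_params must be >= 0 and bits_per_weight > 0")
    return num_params * bits_per_weight / 8 / 2**30


NOMINAL_PARAMS = 2e9  # assumption: nominal size implied by the model name

bf16_gib = weight_memory_gib(NOMINAL_PARAMS, 16)  # BF16 stores 16 bits per weight
w4_gib = weight_memory_gib(NOMINAL_PARAMS, 4)     # W4A16 stores roughly 4 bits per weight
print(f"BF16 weights: ~{bf16_gib:.2f} GiB, W4A16 weights: ~{w4_gib:.2f} GiB")
```

Under these assumptions the weights take roughly 3.7 GiB in BF16 and 0.9 GiB at 4 bits, so the gap to the 24 GB figure on the model card comes from everything other than the weights.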

Layout

.
├── web/                    Static VRS Console frontend
├── docker/                 nginx config for frontend/API proxying
├── docs/                   Current docs, operations notes, benchmarks, archive
├── configs/                Runtime configs, stream manifests, watch policies
├── scripts/                CLI runners, fixtures, benchmarks, eval helpers
├── docker-compose.yaml     RTSP, backend, frontend, and inference workflow
├── Dockerfile.*            Backend, frontend, and GPU inference images
└── vrs/
    ├── web/                FastAPI artifact browser for dashboard data
    ├── ingest/             RTSP/mp4 frame iterator
    ├── triage/             YOLOE detector, tracking, event-state queue
    ├── verifier/           VLM prompts and structured-output parsing
    ├── runtime/            transformers, vLLM, OpenAI-compatible backends
    ├── sinks/              JSONL, thumbnails, optional annotated video
    ├── multistream/        N-stream cascade workers and queues
    ├── pipeline.py         Single-stream cascade orchestration
    └── schemas.py          Frame, detection, candidate, verified alert models

