A modern, two-stage CCTV / video-understanding pipeline for a single local GPU.
VRS is inspired by the architecture patterns in NVIDIA Video Search and Summarization (VSS) and the Public Safety Blueprint: perception first, VLM-based alert verification, and optional higher-level incident reasoning. It is not a VSS clone. The goal is a smaller, hackable, local-GPU-oriented Video Reasoning System that can be evaluated, customized, and embedded into CCTV / VMS / edge-appliance environments.
The deployment target is 16 GB cards when the verifier is quantized or otherwise capacity-tested. For BF16 Cosmos-Reason2-2B, validate on the target host first: NVIDIA's 2026 model card lists a 24 GB minimum for the reference inference path.
- Fast path: open-vocabulary detection with YOLOE-L (~6 ms / frame on a T4). Add or change event classes by editing one YAML line — no prompt-bank curation, no per-class fine-tuning.
- Slow path: a pluggable VLM verifier. The current baseline is NVIDIA Cosmos-Reason2-2B, but 2026 research and internal benchmarks show it should not be treated as the final verifier. Qwen3.5/Qwen3.6-class VLMs are priority candidates for side-by-side evaluation.
The local console reads runs/live/alerts.jsonl and thumbnails through a
CPU-light FastAPI backend. It shows detector candidates, VLM verdicts,
confidence, rationale, live RTSP state, and the two-stage cascade.
Start the local RTSP/API/UI stack:
docker compose up --buildOpen http://127.0.0.1:5173. The Compose stack publishes
runs/pr-integration/clips/falldown_test.mp4 as:
rtsp://127.0.0.1:8554/falldown
Run the full inference profile with a Hugging Face token:
cp .env.example .env
# edit .env and set HF_TOKEN / HUGGING_FACE_HUB_TOKEN
docker compose -f docker-compose.yaml -f docker-compose.hf-local.yaml \
--profile inference up --buildConvenience commands:
just local-up
just local-logs -f inference
just local-downSee docs/local-web-ui.md for the full local workflow, RTSP checks, API checks, token setup, and troubleshooting.
Classical CLIP-style classifiers (the "encode each frame and cosine-match a prompt bank" approach) work, but they require continuous prompt maintenance: every new event type, every new camera, every site-specific quirk needs new prompts and re-scoring. They also produce only frame-level scores, not localizations.
VRS also avoids making precomputed embeddings the primary home for business logic. Embedding lookup can be very fast, but operational semantics often end up hidden in manually maintained vectors, thresholds, and category-specific matching rules. As sites, camera angles, lighting conditions, and false-positive patterns change, engineers must continuously update those assets.
VRS keeps business intent explicit and policy-driven. Event definitions live in human-readable watch policies, while models remain execution components: the detector proposes candidates, event-state checks temporal stability, optional gates estimate risk, and the VLM verifier makes the final decision using the policy definition and video context.
Business intent -> watch policies
Fast perception -> open-vocabulary detector
Risk / uncertainty -> event-state and optional gates
Final decision -> VLM verifier
VRS uses two practical 2026-era baselines, with the verifier intentionally kept swappable:
| Stage | Model | Why |
|---|---|---|
| Detect | Ultralytics YOLOE-L (yoloe-11l-seg.pt by default) |
Open-vocabulary text prompts, returns bounding boxes, no per-site retraining loop |
| Reason | nvidia/Cosmos-Reason2-2B baseline | Physical-reasoning specialization, FPS=4 video path, bbox/point/trajectory-oriented prompting. Treat as a baseline, not the expected winner. |
Sources:
- NVIDIA VSS — Video Search and Summarization documentation, used as architectural inspiration for perception-first video AI, alert verification, and optional incident reasoning patterns.
- YOLOE — Ultralytics docs and CVPR'25 paper. Ultralytics also publishes newer YOLOE-26 models; migrate only after eval confirms a gain for the active policy.
- Cosmos-Reason2 — NVIDIA docs and
nvidia/Cosmos-Reason2-2Bmodel card. - Qwen — Qwen3.5/Qwen3.6 official releases report stronger multimodal foundation-model capability than earlier Qwen3-VL-era models. Because Cosmos-Reason2-2B is derived from Qwen3-VL-2B, Qwen3.5/Qwen3.6-class models should be evaluated as verifier backends before production model lock-in.
The detector runs on every sampled frame, while the verifier only runs after event-state promotes a stable candidate. This keeps the local GPU budget focused on real alert decisions instead of spending VLM time on quiet frames.
Watch policies live under configs/policies.
They define the user-editable event registry: detector prompts, verifier
definitions, severity, confidence floors, persistence, and optional verifier
context windows.
%%{init: {"theme": "base", "themeVariables": {"background": "#0f172a", "mainBkg": "#243044", "secondBkg": "#1f2937", "primaryColor": "#243044", "primaryBorderColor": "#475569", "primaryTextColor": "#f8fafc", "lineColor": "#38bdf8", "textColor": "#f8fafc", "fontFamily": "sans-serif"}}}%%
flowchart LR
A[Policy YAML] --> B[WatchPolicy]
B --> C[YOLOE vocabulary]
B --> D[Event thresholds]
B --> E[VLM definitions]
C --> F[Fast-path detections]
D --> G[CandidateAlert]
F --> G
E --> H[VLM verifier]
G --> H
H --> I[VerifiedAlert]
classDef source fill:#111827,stroke:#334155,color:#f8fafc,stroke-width:1px
classDef policy fill:#243044,stroke:#475569,color:#f8fafc,stroke-width:1.5px
classDef fast fill:#1f3a36,stroke:#84cc16,color:#f8fafc,stroke-width:1.5px
classDef gate fill:#3b3420,stroke:#f59e0b,color:#f8fafc,stroke-width:1.5px
classDef verifier fill:#2f2945,stroke:#a78bfa,color:#f8fafc,stroke-width:1.5px
classDef output fill:#193244,stroke:#38bdf8,color:#f8fafc,stroke-width:1.5px
class A source
class B policy
class C,F fast
class D,G gate
class E,H verifier
class I output
linkStyle 0 stroke:#84cc16,stroke-width:3px
linkStyle 1 stroke:#84cc16,stroke-width:3px
linkStyle 2 stroke:#f59e0b,stroke-width:3px
linkStyle 3 stroke:#a78bfa,stroke-width:3px
linkStyle 4 stroke:#84cc16,stroke-width:3px
linkStyle 5 stroke:#f59e0b,stroke-width:3px
linkStyle 6 stroke:#f59e0b,stroke-width:3px
linkStyle 7 stroke:#a78bfa,stroke-width:3px
linkStyle 8 stroke:#38bdf8,stroke-width:3px
At runtime, policy entries such as fire, smoke, and falldown are expanded
into YOLOE open-vocabulary classes, mapped back to stable event names, promoted
only after temporal persistence, and finally verified by the VLM before being
written to JSONL, thumbnails, and the local console.
- System review — current implementation status, known gaps, and engineering review.
- Roadmap — near-term prioritized work.
- Local web UI workflow — Docker Compose, RTSP,
.env, Hugging Face cache, API checks, and UI troubleshooting. - Policy model — watch-policy schema, runtime flow, UI-driven editing, validation, reload strategy, and scenario-policy direction.
- Operations notes — audit signing, served verifier, metrics, GPU smoke tests, policies, and evaluation reports.
- VSS + SAM3 blueprint — long-term vendor-neutral platform direction, including optional SAM3 workers, DeepStream runtime adapter planning, platform contracts, semantic search, and enterprise storage.
- Runtime validation matrix — validated, unvalidated, and planned GPU/runtime profiles.
uv python install 3.11
# Pick the right torch build for your GPU architecture:
# * Blackwell (RTX 5080/5090, B100, GB100) → CUDA 12.8+ / torch 2.6+
# * Hopper (H100, H200) → CUDA 12.4+
# * Ada (RTX 4080/4090, L4, L40) → CUDA 12.1+
uv sync --python 3.11 --extra cu128 # Blackwell
# or:
# uv sync --python 3.11 --extra cu121 # Ada/AmpereFor the W4A16 verifier profile used by configs/tiny.yaml, include the quant
extra:
uv sync --python 3.11 --extra cu128 --extra quantSingle MP4:
uv run scripts/run_mp4.py \
--video /path/to/cctv.mp4 \
--config configs/default.yaml \
--policy configs/policies/safety.yaml \
--out runs/demoSingle RTSP:
uv run scripts/run_rtsp.py \
--rtsp rtsp://user:pass@cam.local:554/stream1 \
--config configs/default.yaml \
--policy configs/policies/safety.yaml \
--out runs/liveMulti-stream:
uv run scripts/run_multistream.py \
--config configs/default.yaml \
--policy configs/policies/safety.yaml \
--streams configs/multistream.yaml \
--out runs/liveOutputs:
runs/<name>/alerts.jsonl— one JSON per verified alert (verdict, confidence, bbox, trajectory, rationale, thumbnail path)runs/<name>/thumbnails/*.jpg— one event image per alert, with detector/verifier overlaysruns/<name>/annotated.mp4— optional debug/demo overlay video whensink.write_annotated: true
- Watch-policy changes live in
configs/policies/safety.yaml; adding a custom event is one detector prompt list plus one verifier sentence. - Metrics, audit signing, served VLM verifier setup, GPU smoke tests, and eval report commands are covered in docs/operations.md.
- Runtime and GPU profile status is tracked in docs/runtime-matrix.md.
| Profile | Detector | Verifier | Notes |
|---|---|---|---|
default.yaml |
YOLOE-L FP16 | Cosmos-Reason2-2B BF16 | Accuracy-oriented local profile; validate memory on target GPU. NVIDIA's reference model card lists 24 GB minimum. |
tiny.yaml |
YOLOE-S FP16 | Cosmos-Reason2-2B W4A16 | Intended for 8-16 GB cards / Jetson-class deployments after quantized-runtime validation. |
.
├── web/ Static VRS Console frontend
├── docker/ nginx config for frontend/API proxying
├── docs/ Current docs, operations notes, benchmarks, archive
├── configs/ Runtime configs, stream manifests, watch policies
├── scripts/ CLI runners, fixtures, benchmarks, eval helpers
├── docker-compose.yaml RTSP, backend, frontend, and inference workflow
├── Dockerfile.* Backend, frontend, and GPU inference images
└── vrs/
├── web/ FastAPI artifact browser for dashboard data
├── ingest/ RTSP/mp4 frame iterator
├── triage/ YOLOE detector, tracking, event-state queue
├── verifier/ VLM prompts and structured-output parsing
├── runtime/ transformers, vLLM, OpenAI-compatible backends
├── sinks/ JSONL, thumbnails, optional annotated video
├── multistream/ N-stream cascade workers and queues
├── pipeline.py Single-stream cascade orchestration
└── schemas.py Frame, detection, candidate, verified alert models