This repo is sample code for building voice agents with NVIDIA open source models:
- Nemotron 3.0 ASR and Nemotron 3.5 ASR (streaming speech-to-text)
- Nemotron 3 Nano LLM
For speech, the bot uses on-device Pocket TTS (Kyutai).
Run locally on an NVIDIA DGX Spark or RTX 5090. Or deploy to the cloud.
Accompanying blog posts:
- Nemotron Speech ASR Open Source Model Launch Post
- More About Voice Agent Architectures and This Agent's Design
- The voice agent — Nemotron ASR + Nemotron 3 Nano LLM + on-device Pocket TTS, run locally (DGX Spark / RTX 5090) or in the cloud.
- A production streaming-ASR serving runtime: a from-scratch native C++ ws_server for the Nemotron Speech ASR model that you build, run on an RTX 5090, and deploy as an L40S cluster. It is a byte-exact drop-in for the Python ASR server with much higher per-GPU stream density. See Production streaming-ASR serving below.
| path | what |
|---|---|
| src/nemotron_speech/ | Python ASR + TTS servers (server.py) |
| pipecat_bots/ | The voice-agent bot (bot.py) + its STT / TTS / LLM Pipecat services |
| scripts/ | local container management (nemotron.sh) and test clients |
| Dockerfile.unified | the all-in-one local container (ASR + TTS + LLM), built from source for Blackwell |
| runtime/ | the native C++ ws_server ASR serving runtime: build, run, and artifact regeneration |
| deploy/ | L40S cluster deploy: runbook, systemd unit, HAProxy generator, drain/smoke tooling |
| ec2-bench/ | minimal EC2 GPU provisioning helpers used by the deploy runbook |
| tests/ | ASR/TTS tests, the Python↔C++ byte-exact compatibility oracle, and a WebSocket smoke client |
| docs/ | architecture and latency explainers |
docker build -f Dockerfile.unified -t nemotron-unified:cuda13 .

Build time: 2-3 hours (builds PyTorch, NeMo, vLLM, llama.cpp from source for CUDA 13.1 / Blackwell).
# Start with default Q8 model (auto-detected from HuggingFace cache)
./scripts/nemotron.sh start
# Or specify a model explicitly
./scripts/nemotron.sh start --model ~/.cache/huggingface/hub/models--unsloth--Nemotron-3-Nano-30B-A3B-GGUF/snapshots/.../Q8_0.gguf
# Start with vLLM instead of llama.cpp (requires ~72GB VRAM)
./scripts/nemotron.sh start --mode vllm

The voice bot speaks with Pocket TTS (Kyutai's on-device TTS, a published package). Start it in its own terminal and leave it running on port 8001 (it downloads the model on first run):
# runs the published package in an ephemeral env (or: pip install pocket-tts && pocket-tts serve ...)
uvx pocket-tts serve --port 8001 --language english

pipecat_bots/bot.py reads its ASR and LLM endpoints from required env vars. Point them at the servers from step 2 (and Pocket TTS from step 3):
NVIDIA_ASR_URL=ws://localhost:8080 \
NVIDIA_LLM_URL=http://localhost:8000/v1 \
POCKET_TTS_URL=http://localhost:8001 \
uv run pipecat_bots/bot.py

Open the URL it prints (default http://localhost:7860) in your browser.
The bot connects to your ASR, LLM, and TTS endpoints over the network, so it can run anywhere — including Pipecat Cloud. You bring the model endpoints (e.g. the L40S ASR cluster below); this section deploys only the bot.
Note: sign up for a Pipecat Cloud account before deploying.
# Install Pipecat Cloud package
uv sync --group bot
# Login
pipecat cloud auth login

pipecat cloud secrets set gdx-spark-bot-secrets \
NVIDIA_ASR_URL=wss:// \
NVIDIA_LLM_URL=https:// \
POCKET_TTS_URL=https://

Alternatively, create your secret set from a .env file:
pipecat cloud secrets set gdx-spark-bot-secrets --file .env

Image pull secrets are used to authenticate with private Docker registries when deploying agents. See the Pipecat Cloud docs.
pipecat cloud secrets image-pull-secret gdx-spark-bot-pull-secret https://index.docker.io/v1/

Optional: create a PCC deploy toml. To speed up deployment, you can create a pcc-deploy.toml in the project root. The Pipecat CLI reads this file to pre-fill command arguments:
agent_name = "gdx-spark-bot"
image = "your-docker-repository/gdx-spark-bot:latest"
secret_set = "gdx-spark-bot-secrets"
image_credentials = "gdx-spark-bot-pull-secret"
agent_profile = "agent-1x"
[scaling]
min_agents = 1

docker build -f Dockerfile.bot -t gdx-spark-bot:latest .
# Optional: tag image
docker tag gdx-spark-bot:latest your-docker-repository/gdx-spark-bot:latest
# Push to image repository e.g. Docker Hub
docker push your-docker-repository/gdx-spark-bot:latest

Run the deploy command:
pipecat cloud deploy
# ...or if not using pcc-deploy.toml
pipecat cloud deploy gdx-spark-bot your-docker-repository/gdx-spark-bot:latest \
--credentials gdx-spark-bot-pull-secret \
--secrets gdx-spark-bot-secrets \
--profile agent-1x

Create a public access key for Pipecat Cloud. Set this as the default key when prompted:
pipecat cloud organizations keys create

Start an active session with your deployed bot:
pipecat cloud agent start gdx-spark-bot --use-daily

See the docs for REST and Python usage.
pipecat_bots/bot.py is the single voice agent: Nemotron streaming STT → Nemotron LLM
(OpenAI-compatible) → on-device Pocket TTS, with on-device Smart Turn v3 endpointing and a
SmallWebRTC transport. The ASR and LLM endpoints are required env vars:
| Variable | Required | Default | Description |
|---|---|---|---|
| NVIDIA_ASR_URL | ✅ | - | Nemotron streaming ASR WebSocket endpoint (e.g. ws://localhost:8080) |
| NVIDIA_LLM_URL | ✅ | - | OpenAI-compatible LLM endpoint (e.g. http://localhost:8000/v1) |
| NVIDIA_LLM_MODEL | | nvidia/nemotron-3-nano | Model name your LLM server serves (per its /v1/models) |
| NVIDIA_LLM_API_KEY | | EMPTY | API key (local vLLM / llama.cpp ignore it) |
| NEMOTRON_ENABLE_THINKING | | false | Enable LLM reasoning (keep off for voice unless the server runs a reasoning parser) |
| POCKET_TTS_URL | | http://127.0.0.1:8001 | On-device Pocket TTS server |
| POCKET_TTS_VOICE | | alba | Pocket TTS voice |
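
As an illustration of how the variables above fit together, here is a minimal sketch of a loader that enforces the two required endpoints and applies the documented defaults. It is not the code in bot.py: the BotConfig class, the load_config function, and the set of strings accepted as "true" for NEMOTRON_ENABLE_THINKING are assumptions made for this example.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Variables the table marks as required.
REQUIRED = ("NVIDIA_ASR_URL", "NVIDIA_LLM_URL")


@dataclass(frozen=True)
class BotConfig:
    """Hypothetical container for the bot's endpoint settings."""
    asr_url: str
    llm_url: str
    llm_model: str
    llm_api_key: str
    enable_thinking: bool
    tts_url: str
    tts_voice: str


def load_config(env: Mapping[str, str] = os.environ) -> BotConfig:
    """Read settings from env, failing early if a required one is missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing required env vars: " + ", ".join(missing))
    # Assumption: these spellings count as "enabled"; bot.py may parse differently.
    thinking = env.get("NEMOTRON_ENABLE_THINKING", "false").strip().lower()
    return BotConfig(
        asr_url=env["NVIDIA_ASR_URL"],
        llm_url=env["NVIDIA_LLM_URL"],
        llm_model=env.get("NVIDIA_LLM_MODEL", "nvidia/nemotron-3-nano"),
        llm_api_key=env.get("NVIDIA_LLM_API_KEY", "EMPTY"),
        enable_thinking=thinking in ("1", "true", "yes"),
        tts_url=env.get("POCKET_TTS_URL", "http://127.0.0.1:8001"),
        tts_voice=env.get("POCKET_TTS_VOICE", "alba"),
    )
```

Calling load_config() with no argument reads the process environment; passing a plain dict makes the same logic easy to exercise in a test.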
SmallWebRTC is the default (opens a local browser client): uv run pipecat_bots/bot.py -t webrtc.
Other Pipecat transports (Daily, Twilio) can be added to transport_params in bot.py.
| Service | File | Description |
|---|---|---|
| NVidiaWebSocketSTTService | nvidia_stt.py | Nemotron streaming ASR over WebSocket; finalizes on the VAD stop for Smart Turn |
| VLLMOpenAILLMService | nemotron_llm.py | OpenAI-compatible LLM client (TTFB measured to the first spoken token) |
| PocketTTSService | pocket_tts.py | On-device Pocket TTS (HTTP streaming) |
Use ./scripts/nemotron.sh to manage the container:
# Start the container
./scripts/nemotron.sh start [OPTIONS]
--mode MODE LLM mode: llamacpp-q8 (default), llamacpp-q4, vllm
--model PATH Path to model file
--no-asr Disable ASR service
--no-llm Disable LLM service
-f, --foreground Run in foreground (default: detached)
# Stop the container
./scripts/nemotron.sh stop
# Restart the container
./scripts/nemotron.sh restart [OPTIONS]
# Check status
./scripts/nemotron.sh status
# View logs
./scripts/nemotron.sh logs # All logs interleaved
./scripts/nemotron.sh logs asr # ASR logs only
./scripts/nemotron.sh logs llm # LLM logs only
# Open shell in container
./scripts/nemotron.sh shell
# Show help
./scripts/nemotron.sh help

| Service | Port | Protocol | Health Check |
|---|---|---|---|
| ASR | 8080 | WebSocket | http://localhost:8080/health |
| LLM | 8000 | HTTP | http://localhost:8000/health |
TTS is on-device Pocket TTS, run separately (uvx pocket-tts serve --port 8001), not part of the container. The bot reaches it via POCKET_TTS_URL.
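
The health-check URLs in the table can be polled from a script before starting the bot. The sketch below uses only the Python standard library. It assumes each endpoint answers with a 2xx status once the service is ready, which this README does not state; the is_healthy and check_all helpers are hypothetical names, not part of scripts/.

```python
import urllib.error
import urllib.request
from typing import Dict, Mapping

# Health-check URLs taken from the service table.
HEALTH_URLS = {
    "asr": "http://localhost:8080/health",
    "llm": "http://localhost:8000/health",
}


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with a 2xx status, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or a malformed URL.
        return False


def check_all(urls: Mapping[str, str] = HEALTH_URLS, timeout: float = 2.0) -> Dict[str, bool]:
    """Probe every service once and report which ones responded."""
    return {name: is_healthy(url, timeout) for name, url in urls.items()}
```

For example, check_all() returns a dict such as {"asr": True, "llm": False} while the LLM is still loading; a caller could retry in a loop until every value is True.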
# Build the unified container (2-3 hours)
docker build -f Dockerfile.unified -t nemotron-unified:cuda13 .

The build compiles from source for CUDA 13.1 / Blackwell (sm_121):
- PyTorch (with NVRTC support)
- torchaudio
- NeMo ASR/TTS
- vLLM
- llama.cpp
| Model | Source | Size | Used With |
|---|---|---|---|
| Nemotron Speech ASR (English) | HuggingFace nvidia/nemotron-speech-streaming-en-0.6b (auto-downloaded) | ~2.4GB | All configurations |
| Nemotron Speech ASR (Multilingual) | HuggingFace nvidia/NVIDIA-Nemotron-3.5-ASR-Streaming-Multilingual-0.6b (auto-downloaded) | ~2.4GB | Optional: dual-model language routing (see below) |
| Nemotron-3-Nano Q8 | HuggingFace unsloth/Nemotron-3-Nano-30B-A3B-GGUF | ~32GB | llama.cpp on DGX Spark |
| Nemotron-3-Nano Q4 | HuggingFace unsloth/Nemotron-3-Nano-30B-A3B-GGUF | ~16GB | llama.cpp on RTX 5090 |
| Nemotron-3-Nano BF16 | HuggingFace nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | ~72GB | vLLM (cloud/multi-GPU) |
| Pocket TTS | published pocket-tts package (downloads its model on first serve) | small | On-device TTS, run separately |
Download LLM models (ASR is auto-downloaded on first run; Pocket TTS downloads its model on first serve):
# GGUF quantized models (Q8 and Q4 variants for llama.cpp)
huggingface-cli download unsloth/Nemotron-3-Nano-30B-A3B-GGUF
# BF16 full precision (for vLLM)
huggingface-cli download nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

To serve both the English specialist and the multilingual checkpoint behind one endpoint, with each model on the full high-throughput batched path, use the router + 2 backends recipe in deploy/dual-model-router/:
EN_PY=/path/to/standard-nemo-venv/bin/python \
ML_PY=/path/to/ea-nemo-venv/bin/python \
ML_MODEL=/path/to/nemotron-asr-streaming-multilingual-0.6b.nemo \
deploy/dual-model-router/run_dual_model.sh

A thin router (ws://host:8080) routes each connection by ?language= to an English backend
(standard NeMo, rc1) or a multilingual backend (the EA NeMo build, rc3, prompted). The two
checkpoints need different, incompatible NeMo runtimes, so each backend runs in its own venv;
the router keeps English on its validated runtime (byte-identical) while both get the batched
scheduler. The Pipecat service (pipecat_bots/nvidia_stt.py) already connects to one URL and sends a
language, so just point it at the router. See the deploy README for the runtime requirements and
measured perf (router tax ~0; multilingual at near-parity with English).
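
The routing decision described above can be pictured with a short sketch. This is not the router in deploy/dual-model-router/: the backend addresses, the pick_backend function, the treatment of "en" and "en-*" codes as English, and the English fallback when ?language= is absent are all assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical backend addresses; the real recipe defines its own.
BACKENDS = {
    "english": "ws://127.0.0.1:8081",
    "multilingual": "ws://127.0.0.1:8082",
}


def pick_backend(request_target: str, default_language: str = "en") -> str:
    """Choose a backend from the ?language= query parameter of a connection."""
    query = parse_qs(urlsplit(request_target).query)
    # parse_qs returns a list per key; take the first value if present.
    language = query.get("language", [default_language])[0].strip().lower()
    # Assumption: "en" and regional variants such as "en-us" go to English.
    is_english = language == "en" or language.startswith("en-")
    return BACKENDS["english" if is_english else "multilingual"]
```

With these assumptions, pick_backend("/?language=es") selects the multilingual backend, while a connection with no language parameter stays on the English one.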
Advanced / not recommended: server.py also has an in-process --multilingual-model (NEMOTRON_MULTILINGUAL_MODEL) flag that loads both checkpoints in one process. It runs on the serial path only and forces English onto the EA runtime (a different model class, so no longer the validated English path). Prefer the router above.
The voice-agent ASR service above is the Python server in src/nemotron_speech/. For high-density
production serving there is a from-scratch native C++ runtime in runtime/ — ws_server — that
is a byte-exact drop-in for the Python ASR WebSocket server (the compat oracle in
tests/server_compat/run_compat.py verifies 8/8 parity) and serves many more concurrent streams per
GPU. The path is clone → regenerate artifacts → build → run on a 5090 → deploy an L40S cluster:
- Step 0: model & artifacts. The runtime loads compiled AOTI/TorchScript artifacts that are not committed (large + GPU-arch-specific). Regenerate them from the public checkpoint nvidia/nemotron-speech-streaming-en-0.6b following runtime/ARTIFACTS.md. No private buckets or credentials are involved.
- Build + run on an RTX 5090 (sm_120). Build ws_server in the container and run it on the host: runtime/README.md. Verify byte-exact parity with the Python server via the compat oracle.
- Deploy a cluster on L40S (sm_89). Provision g6e/L40S boxes, recompile the AOTI artifacts for sm_89, install the systemd unit, and front the fleet with HAProxy: deploy/RUNBOOK.md (rationale and sizing in deploy/DEPLOYMENT.md). The sm_89 recompile + a density sweep are encoded in runtime/run_l40s_density.README.md, and the minimal EC2 provisioning helpers are in ec2-bench/.
Because the AOTI artifacts are GPU-architecture-specific, the model is exported once in an architecture-agnostic form and then AOTI-compiled separately for each target (sm_120 for the 5090, sm_89 for the L40S). See runtime/ARTIFACTS.md.
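
One way to keep per-target artifacts apart is to key them by compute architecture. The sketch below only illustrates that idea: the directory layout (one subdirectory per sm_* tag), the GPU_ARCH keys, and the artifact_dir helper are hypothetical and are not taken from runtime/ARTIFACTS.md.

```python
from pathlib import Path

# Architecture tags named in this README for the two serving targets.
GPU_ARCH = {
    "rtx5090": "sm_120",
    "l40s": "sm_89",
}


def artifact_dir(root: str, gpu: str) -> Path:
    """Return the (hypothetical) directory holding artifacts compiled for gpu."""
    try:
        arch = GPU_ARCH[gpu.lower()]
    except KeyError:
        known = ", ".join(sorted(GPU_ARCH))
        raise ValueError(f"unknown GPU {gpu!r}; expected one of: {known}") from None
    return Path(root) / arch
```

Under this assumed layout, artifact_dir("artifacts", "L40S") points at artifacts/sm_89, so a 5090 build never loads binaries compiled for the L40S.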
LLM crashes or stalls:
- The buffered LLM service uses single-slot operation (--parallel 1)
- Ensure adequate VRAM for context size (default 16384 tokens)
- Check for httpx connection issues if generation hangs
vLLM takes 10-15 minutes to start:
- This is normal for first startup (model loading, kernel compilation)
- Set SERVICE_TIMEOUT=900 if needed
vLLM DNS resolution issues:
- The container uses --network=host in vLLM mode to avoid DNS issues with HuggingFace
The code in this repository is licensed under the Apache License 2.0 — see LICENSE.
The NVIDIA models this sample uses (Nemotron Speech ASR, Nemotron 3 Nano LLM) are distributed under their own NVIDIA model licenses on HuggingFace; review and accept those terms on each model's page before downloading or deploying. Pocket TTS is a separate Kyutai package under its own license.