Voice Agent With NVIDIA Open Models

This repo is sample code for building voice agents with NVIDIA open source models:

Nemotron 3.0 ASR and Nemotron 3.5 ASR (streaming speech-to-text)
Nemotron 3 Nano LLM

For speech, the bot uses on-device Pocket TTS (Kyutai).

Run locally on an NVIDIA DGX Spark or RTX 5090. Or deploy to the cloud.

Accompanying blog posts:

Two things live here

The voice agent — Nemotron ASR + Nemotron 3 Nano LLM + on-device Pocket TTS, run locally (DGX Spark / RTX 5090) or in the cloud.
A production streaming-ASR serving runtime — a from-scratch native C++ ws_server for the Nemotron Speech ASR model that you build, run on an RTX 5090, and deploy as an L40S cluster. It is a byte-exact drop-in for the Python ASR server with much higher per-GPU stream density. See Production streaming-ASR serving below.

Repository map

path	what
`src/nemotron_speech/`	Python ASR + TTS servers (`server.py`)
`pipecat_bots/`	The voice-agent bot (`bot.py`) + its STT / TTS / LLM Pipecat services
`scripts/`	local container management (`nemotron.sh`) and test clients
`Dockerfile.unified`	the all-in-one local container (ASR + TTS + LLM), built from source for Blackwell
`runtime/`	the native C++ `ws_server` ASR serving runtime — build, run, and artifact regeneration
`deploy/`	L40S cluster deploy: runbook, systemd unit, HAProxy generator, drain/smoke tooling
`ec2-bench/`	minimal EC2 GPU provisioning helpers used by the deploy runbook
`tests/`	ASR/TTS tests, the Python↔C++ byte-exact compatibility oracle, and a WebSocket smoke client
`docs/`	architecture and latency explainers

Quick start - Run everything locally (DGX Spark or RTX 5090)

1. Build the Unified Container

docker build -f Dockerfile.unified -t nemotron-unified:cuda13 .

Build time: 2-3 hours (builds PyTorch, NeMo, vLLM, llama.cpp from source for CUDA 13.1 / Blackwell).

2. Start the Container

# Start with default Q8 model (auto-detected from HuggingFace cache)
./scripts/nemotron.sh start

# Or specify a model explicitly
./scripts/nemotron.sh start --model ~/.cache/huggingface/hub/models--unsloth--Nemotron-3-Nano-30B-A3B-GGUF/snapshots/.../Q8_0.gguf

# Start with vLLM instead of llama.cpp (requires ~72GB VRAM)
./scripts/nemotron.sh start --mode vllm

3. Start the on-device Pocket TTS server

The voice bot speaks with Pocket TTS (Kyutai's on-device TTS, a published package). Start it in its own terminal and leave it running on port 8001 (it downloads the model on first run):

# runs the published package in an ephemeral env (or: pip install pocket-tts && pocket-tts serve ...)
uvx pocket-tts serve --port 8001 --language english

4. Run the Voice Bot

pipecat_bots/bot.py reads its ASR and LLM endpoints from required env vars — point them at the servers from step 2 (and Pocket TTS from step 3):

NVIDIA_ASR_URL=ws://localhost:8080 \
NVIDIA_LLM_URL=http://localhost:8000/v1 \
POCKET_TTS_URL=http://localhost:8001 \
  uv run pipecat_bots/bot.py

Open the URL it prints (default http://localhost:7860) in your browser.

Quick start - Deploy the bot to Pipecat Cloud

The bot connects to your ASR, LLM, and TTS endpoints over the network, so it can run anywhere — including Pipecat Cloud. You bring the model endpoints (e.g. the L40S ASR cluster below); this section deploys only the bot.

Note

Sign up for a Pipecat Cloud account here

1. Login to your Pipecat Cloud account using the CLI

# Install Pipecat Cloud package
uv sync --group bot

# Login
pipecat cloud auth login

2. Create a new secret set with the necessary API keys

pipecat cloud secrets set gdx-spark-bot-secrets \
  NVIDIA_ASR_URL=wss:// \
  NVIDIA_LLM_URL=https:// \
  POCKET_TTS_URL=https://

Alternatively, create your secret set from a .env file:

pipecat cloud secrets set gdx-spark-bot-secrets --file .env

3. Create image pull secret

Image pull secrets are used to authenticate with private Docker registries when deploying agents. See docs.

pipecat cloud secrets image-pull-secret gdx-spark-bot-pull-secret https://index.docker.io/v1/

Optional: Create a PCC deploy toml:

To speed up deployment you can create a pcc-deploy.toml in the project root. This file is read by the Pipecat CLI to pre-fill command arguments:

agent_name = "gdx-spark-bot"
image = "your-docker-repository/gdx-spark-bot:latest"
secret_set = "gdx-spark-bot-secrets"
image_credentials = "gdx-spark-bot-pull-secret"
agent_profile = "agent-1x"

[scaling]
	min_agents = 1

4. Build and push Docker image

docker build -f Dockerfile.bot -t gdx-spark-bot:latest .

# Optional: tag image
docker tag gdx-spark-bot:latest your-docker-repository/gdx-spark-bot:latest

# Push to image repository e.g. Docker Hub
docker push your-docker-repository/gdx-spark-bot:latest

5. Deploy

Run deploy command:

pipecat cloud deploy

# ...or if not using pcc-deploy.toml

pipecat cloud deploy gdx-spark-bot your-docker-repository/gdx-spark-bot:latest \
--credentials gdx-spark-bot-pull-secret \
--secrets gdx-spark-bot-secrets \
--profile agent-1x

6. Start bot using CLI

Create a public access key for Pipecat Cloud. Set this is a the default key when prompted:

pipecat cloud organizations keys create

Start an active session with your deployed bot:

pipecat cloud agent start gdx-spark-bot --use-daily

See docs for REST and Python usage.

The Voice Bot

pipecat_bots/bot.py is the single voice agent: Nemotron streaming STT → Nemotron LLM (OpenAI-compatible) → on-device Pocket TTS, with on-device Smart Turn v3 endpointing and a SmallWebRTC transport. The ASR and LLM endpoints are required env vars:

Variable	Required	Default	Description
`NVIDIA_ASR_URL`	✅	—	Nemotron streaming ASR WebSocket endpoint (e.g. `ws://localhost:8080`)
`NVIDIA_LLM_URL`	✅	—	OpenAI-compatible LLM endpoint (e.g. `http://localhost:8000/v1`)
`NVIDIA_LLM_MODEL`		`nvidia/nemotron-3-nano`	Model name your LLM server serves (per its `/v1/models`)
`NVIDIA_LLM_API_KEY`		`EMPTY`	API key (local vLLM / llama.cpp ignore it)
`NEMOTRON_ENABLE_THINKING`		`false`	Enable LLM reasoning (keep off for voice unless the server runs a reasoning parser)
`POCKET_TTS_URL`		`http://127.0.0.1:8001`	On-device Pocket TTS server
`POCKET_TTS_VOICE`		`alba`	Pocket TTS voice

Transport

SmallWebRTC is the default (opens a local browser client): uv run pipecat_bots/bot.py -t webrtc. Other Pipecat transports (Daily, Twilio) can be added to transport_params in bot.py.

Services in `pipecat_bots/`

Service	File	Description
`NVidiaWebSocketSTTService`	`nvidia_stt.py`	Nemotron streaming ASR over WebSocket; finalizes on the VAD stop for Smart Turn
`VLLMOpenAILLMService`	`nemotron_llm.py`	OpenAI-compatible LLM client (TTFB measured to the first spoken token)
`PocketTTSService`	`pocket_tts.py`	On-device Pocket TTS (HTTP streaming)

Local Container Management

Use ./scripts/nemotron.sh to manage the container:

# Start the container
./scripts/nemotron.sh start [OPTIONS]
  --mode MODE         LLM mode: llamacpp-q8 (default), llamacpp-q4, vllm
  --model PATH        Path to model file
  --no-asr            Disable ASR service
  --no-llm            Disable LLM service
  -f, --foreground    Run in foreground (default: detached)

# Stop the container
./scripts/nemotron.sh stop

# Restart the container
./scripts/nemotron.sh restart [OPTIONS]

# Check status
./scripts/nemotron.sh status

# View logs
./scripts/nemotron.sh logs          # All logs interleaved
./scripts/nemotron.sh logs asr      # ASR logs only
./scripts/nemotron.sh logs llm      # LLM logs only

# Open shell in container
./scripts/nemotron.sh shell

# Show help
./scripts/nemotron.sh help

Service Endpoints

Service	Port	Protocol	Health Check
ASR	8080	WebSocket	`http://localhost:8080/health`
LLM	8000	HTTP	`http://localhost:8000/health`

TTS is on-device Pocket TTS, run separately (uvx pocket-tts serve --port 8001), not part of the container. The bot reaches it via POCKET_TTS_URL.

Building the Container

# Build the unified container (2-3 hours)
docker build -f Dockerfile.unified -t nemotron-unified:cuda13 .

The build compiles from source for CUDA 13.1 / Blackwell (sm_121):

PyTorch (with NVRTC support)
torchaudio
NeMo ASR/TTS
vLLM
llama.cpp

Model Requirements

Model	Source	Size	Used With
Nemotron Speech ASR (English)	HuggingFace `nvidia/nemotron-speech-streaming-en-0.6b` (auto-downloaded)	~2.4GB	All configurations
Nemotron Speech ASR (Multilingual)	HuggingFace `nvidia/NVIDIA-Nemotron-3.5-ASR-Streaming-Multilingual-0.6b` (auto-downloaded)	~2.4GB	Optional — dual-model language routing (see below)
Nemotron-3-Nano Q8	HuggingFace `unsloth/Nemotron-3-Nano-30B-A3B-GGUF`	~32GB	llama.cpp on DGX Spark
Nemotron-3-Nano Q4	HuggingFace `unsloth/Nemotron-3-Nano-30B-A3B-GGUF`	~16GB	llama.cpp on RTX 5090
Nemotron-3-Nano BF16	HuggingFace `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16`	~72GB	vLLM (cloud/multi-GPU)
Pocket TTS	published `pocket-tts` package (downloads its model on first `serve`)	~small	On-device TTS, run separately

Download LLM models (ASR is auto-downloaded on first run; Pocket TTS downloads its model on first serve):

# GGUF quantized models (Q8 and Q4 variants for llama.cpp)
huggingface-cli download unsloth/Nemotron-3-Nano-30B-A3B-GGUF

# BF16 full precision (for vLLM)
huggingface-cli download nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

Multilingual (dual-model serving)

To serve both the English specialist and the multilingual checkpoint behind one endpoint — with each model on the full high-throughput batched path — use the router + 2 backends recipe in deploy/dual-model-router/:

EN_PY=/path/to/standard-nemo-venv/bin/python \
ML_PY=/path/to/ea-nemo-venv/bin/python \
ML_MODEL=/path/to/nemotron-asr-streaming-multilingual-0.6b.nemo \
  deploy/dual-model-router/run_dual_model.sh

A thin router (ws://host:8080) routes each connection by ?language= to an English backend (standard NeMo, rc1) or a multilingual backend (the EA NeMo build, rc3, prompted). The two checkpoints need different, incompatible NeMo runtimes, so each backend runs in its own venv; the router keeps English on its validated runtime (byte-identical) while both get the batched scheduler. The Pipecat service (pipecat_bots/nvidia_stt.py) already connects to one URL and sends a language, so just point it at the router. See the deploy README for the runtime requirements and measured perf (router tax ~0; multilingual at near-parity with English).

Advanced / not recommended: server.py also has an in-process --multilingual-model (NEMOTRON_MULTILINGUAL_MODEL) flag that loads both checkpoints in one process. It runs on the serial path only and forces English onto the EA runtime (a different model class — no longer the validated English path). Prefer the router above.

Production streaming-ASR serving (RTX 5090 → L40S cluster)

The voice-agent ASR service above is the Python server in src/nemotron_speech/. For high-density production serving there is a from-scratch native C++ runtime in runtime/ — ws_server — that is a byte-exact drop-in for the Python ASR WebSocket server (the compat oracle in tests/server_compat/run_compat.py verifies 8/8 parity) and serves many more concurrent streams per GPU. The path is clone → regenerate artifacts → build → run on a 5090 → deploy an L40S cluster:

Step 0 — model & artifacts. The runtime loads compiled AOTI/TorchScript artifacts that are not committed (large + GPU-arch-specific). Regenerate them from the public checkpoint nvidia/nemotron-speech-streaming-en-0.6b following runtime/ARTIFACTS.md. No private buckets or credentials are involved.
Build + run on an RTX 5090 (sm_120). Build ws_server in the container and run it on the host: runtime/README.md. Verify byte-exact parity with the Python server via the compat oracle.
Deploy a cluster on L40S (sm_89). Provision g6e/L40S boxes, recompile the AOTI artifacts for sm_89, install the systemd unit, and front the fleet with HAProxy: deploy/RUNBOOK.md (rationale and sizing in deploy/DEPLOYMENT.md). The sm_89 recompile + a density sweep are encoded in runtime/run_l40s_density.README.md, and the minimal EC2 provisioning helpers are in ec2-bench/.

Because the AOTI artifacts are GPU-architecture-specific, the model is exported once (architecture- agnostic) and then AOTI-compiled per target (sm_120 for the 5090, sm_89 for the L40S) — see runtime/ARTIFACTS.md.

Troubleshooting

LLM crashes or stalls:

The buffered LLM service uses single-slot operation (--parallel 1)
Ensure adequate VRAM for context size (default 16384 tokens)
Check for httpx connection issues if generation hangs

vLLM takes 10-15 minutes to start:

This is normal for first startup (model loading, kernel compilation)
Set SERVICE_TIMEOUT=900 if needed

vLLM DNS resolution issues:

The container uses --network=host in vLLM mode to avoid DNS issues with HuggingFace

License

The code in this repository is licensed under the Apache License 2.0 — see LICENSE.

The NVIDIA models this sample uses (Nemotron Speech ASR, Nemotron 3 Nano LLM) are distributed under their own NVIDIA model licenses on HuggingFace; review and accept those terms on each model's page before downloading or deploying. Pocket TTS is a separate Kyutai package under its own license.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Voice Agent With NVIDIA Open Models

Two things live here

Repository map

Quick start - Run everything locally (DGX Spark or RTX 5090)

1. Build the Unified Container

2. Start the Container

3. Start the on-device Pocket TTS server

4. Run the Voice Bot

Quick start - Deploy the bot to Pipecat Cloud

1. Login to your Pipecat Cloud account using the CLI

2. Create a new secret set with the necessary API keys

3. Create image pull secret

4. Build and push Docker image

5. Deploy

6. Start bot using CLI

The Voice Bot

Transport

Services in `pipecat_bots/`

Local Container Management

Service Endpoints

Building the Container

Model Requirements

Multilingual (dual-model serving)

Production streaming-ASR serving (RTX 5090 → L40S cluster)

Troubleshooting

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
deploy		deploy
docs		docs
ec2-bench		ec2-bench
examples		examples
patches		patches
pipecat_bots		pipecat_bots
runtime		runtime
scripts		scripts
src/nemotron_speech		src/nemotron_speech
tests		tests
vllm_plugins		vllm_plugins
.gitignore		.gitignore
Dockerfile.bot		Dockerfile.bot
Dockerfile.unified		Dockerfile.unified
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

Voice Agent With NVIDIA Open Models

Two things live here

Repository map

Quick start - Run everything locally (DGX Spark or RTX 5090)

1. Build the Unified Container

2. Start the Container

3. Start the on-device Pocket TTS server

4. Run the Voice Bot

Quick start - Deploy the bot to Pipecat Cloud

1. Login to your Pipecat Cloud account using the CLI

2. Create a new secret set with the necessary API keys

3. Create image pull secret

4. Build and push Docker image

5. Deploy

6. Start bot using CLI

The Voice Bot

Transport

Services in pipecat_bots/

Local Container Management

Service Endpoints

Building the Container

Model Requirements

Multilingual (dual-model serving)

Production streaming-ASR serving (RTX 5090 → L40S cluster)

Troubleshooting

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Services in `pipecat_bots/`

Packages