Local-first voice AI framework: speech-to-text, text-to-speech, and voice chat running entirely on your machine. No API keys, no cloud, no data leaving your machine.
```
Mic --> VAD (Silero) --> STT (Whisper/Sherpa/Streaming) --> Your Code --> TTS (Kokoro/Piper/Chatterbox) --> Speaker
```
```sh
# Install
cargo install --git https://github.com/mrtozner/vox --features cli

# Transcribe speech from your microphone
vox listen

# Text-to-speech (requires kokoro feature)
cargo install --git https://github.com/mrtozner/vox --features cli,kokoro
vox speak "Hello from Vox!"

# Voice chat with Ollama
vox chat --llm llama3.2
```

Models auto-download on first run. Pass `-y` to skip prompts.
- Speech-to-Text — Whisper (tiny to medium), Sherpa-ONNX (SenseVoice, Zipformer, Paraformer), or streaming Sherpa for real-time partial transcription
- Text-to-Speech — Natural synthesis with Kokoro (50+ voices), Piper (multilingual), Pocket (pure Rust, edge-ready), or Chatterbox (voice cloning)
- Voice Chat — Talk to any Ollama LLM and hear responses
- Web Interface — Browser UI for demos and testing (`vox serve`)
- Python Bindings — Same pipeline from Python via PyO3
- HTTP/WebSocket Server — Integrate into any stack with REST or streaming WebSocket API
- Fully Local — No API keys, no cloud, no data leaves your machine
- Pluggable Backends — Swap VAD, STT, or TTS engines via traits
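The pluggable-backends idea can be sketched with a toy trait. This is illustrative only: `SttBackend`, `transcribe`, and the dummy engines below are hypothetical names, not vox's actual API; the point is that pipeline code depends on a trait object, so engines swap freely.

```rust
// Hypothetical trait for an STT engine; vox's real trait may differ.
trait SttBackend {
    fn transcribe(&self, samples: &[f32]) -> String;
}

// Two stand-in engines implementing the same trait.
struct DummyWhisper;
impl SttBackend for DummyWhisper {
    fn transcribe(&self, _samples: &[f32]) -> String {
        "from whisper".to_string()
    }
}

struct DummySherpa;
impl SttBackend for DummySherpa {
    fn transcribe(&self, _samples: &[f32]) -> String {
        "from sherpa".to_string()
    }
}

// The pipeline only sees the trait, not a concrete engine,
// so swapping backends never touches pipeline code.
fn run_stt(backend: &dyn SttBackend, audio: &[f32]) -> String {
    backend.transcribe(audio)
}
```

Call `run_stt(&DummyWhisper, &audio)` or `run_stt(&DummySherpa, &audio)` interchangeably; the same shape applies to VAD and TTS backends.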
```sh
vox listen                                   # transcribe from microphone (Whisper)
vox listen --model base.en                   # use a larger Whisper model
vox listen --stt-backend sherpa              # use Sherpa SenseVoice (multilingual)
vox listen --stt-backend sherpa-streaming    # real-time streaming transcription

vox speak "Hello from Vox!"                  # text-to-speech (needs kokoro feature)
vox speak "Hello" --voice am_adam            # pick a voice
vox speak "Hallo" --backend piper --voice de         # multilingual TTS with Piper
vox speak "Hi" --backend chatterbox --voice ref.wav  # voice cloning

vox chat --llm llama3.2                      # voice chat with Ollama

vox models list                              # show downloaded models
vox models download whisper-base.en          # download a specific model
```

```sh
cargo install --git https://github.com/mrtozner/vox --features cli,server,kokoro
vox serve --port 3000
```

Opens a browser interface at http://localhost:3000 with real-time mic transcription, TTS synthesis, voice chat with Ollama, and a status dashboard. No separate frontend build.
Use the same server's REST endpoints directly:
```sh
# Transcribe audio
curl -X POST http://localhost:3000/v1/transcribe \
  -H "Content-Type: audio/wav" \
  --data-binary @audio.wav

# Synthesize speech
curl -X POST http://localhost:3000/v1/synthesize \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello from Vox!"}'
```

WebSocket streaming at ws://localhost:3000/v1/listen — send PCM f32 LE frames at 16kHz mono, receive JSON events in real time:

```json
{"type": "speech_start"}
{"type": "partial", "text": "hello", "is_final": false, "stability": 0.5, "duration_ms": 600, "processing_time_ms": 2}
{"type": "partial", "text": "hello world", "is_final": false, "stability": 0.5, "duration_ms": 1000, "processing_time_ms": 3}
{"type": "transcript", "text": "hello world", "duration_ms": 1200, "processing_time_ms": 180}
{"type": "speech_end"}
```

When a streaming STT backend is available (the sherpa-streaming model is downloaded), partial results arrive incrementally as you speak. Without it, partials are omitted and you get the final transcript on speech end.
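On the client side, preparing audio for the streaming socket just means serializing f32 samples as little-endian bytes. A minimal sketch (the 512-sample frame size is an assumption for illustration; the endpoint consumes raw f32 LE bytes in whatever chunks you send):

```rust
/// Pack 16 kHz mono f32 PCM samples into little-endian byte frames,
/// ready to send as binary WebSocket messages.
fn pack_frames(samples: &[f32], frame_len: usize) -> Vec<Vec<u8>> {
    samples
        .chunks(frame_len)
        .map(|chunk| chunk.iter().flat_map(|s| s.to_le_bytes()).collect())
        .collect()
}
```

Each returned frame holds `frame_len * 4` bytes (4 bytes per f32 sample); send them in order as binary messages.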
```rust
use vox::{Vox, SileroVad, WhisperBackend};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let vox = Vox::builder()
        .vad(SileroVad::new("silero_vad.onnx")?)
        .stt(WhisperBackend::from_model("ggml-tiny.en.bin")?)
        .on_utterance(|result, _ctx| {
            println!("{}", result.text);
        })
        .build()?;

    vox.listen().await?;
    Ok(())
}
```

```sh
cd python
pip install maturin
maturin develop --features whisper,silero,kokoro
```

```python
from vox_voice import Vox, SileroVad, WhisperStt

vox = Vox(vad=SileroVad(), stt=WhisperStt("tiny.en"))
for result in vox.listen():
    print(result.text)
```

Built with PyO3. Same pipeline, Pythonic API.
```
+--------+     +-----+     +-----+     +-----------+     +-----+
|  Mic   | --> | VAD | --> | STT | --> | Callback  | --> | TTS |
| (cpal) |     |     |     |     |     | (your fn) |     |     |
+--------+     +-----+     +-----+     +-----------+     +-----+
                  |                         |
             Silero ONNX            VoxContext gives
               v5 model            access to speak()
```
Audio is captured via cpal, resampled to 16kHz mono, and fed frame-by-frame to the VAD. On speech end, the utterance goes to STT. Your callback gets the text and a `VoxContext` for an optional TTS reply.
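The capture path (downmix to mono, resample to 16 kHz) can be sketched in plain Rust. This is a naive linear-interpolation resampler for illustration only; vox's actual resampling implementation may differ.

```rust
/// Average interleaved stereo samples down to mono.
fn downmix_stereo(interleaved: &[f32]) -> Vec<f32> {
    interleaved.chunks(2).map(|lr| (lr[0] + lr[1]) / 2.0).collect()
}

/// Naive linear-interpolation resampler (illustration only).
/// Maps each output index back to a fractional input position
/// and interpolates between the two nearest input samples.
fn resample(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    let ratio = from_hz as f64 / to_hz as f64;
    let out_len = (input.len() as f64 / ratio) as usize;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * ratio;
            let idx = pos as usize;
            let frac = (pos - idx as f64) as f32;
            let a = input[idx];
            let b = input[(idx + 1).min(input.len() - 1)];
            a + (b - a) * frac
        })
        .collect()
}
```

A 48 kHz capture becomes 16 kHz by `resample(&mono, 48_000, 16_000)`; the result is then sliced into fixed-size frames for the VAD.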
| Component | Model | Size | Notes |
|---|---|---|---|
| VAD | Silero VAD v5 | 2MB | Speech detection |
| STT | Whisper tiny.en | 75MB | Fast, English |
| | Whisper base.en | 142MB | Better accuracy |
| | Whisper small.en | 466MB | High accuracy |
| | Whisper medium.en | 1.5GB | Highest accuracy |
| | Sherpa SenseVoice | 230MB | Multilingual (zh/en/ja/ko/yue) |
| | Sherpa Streaming Zipformer | 27MB | Real-time partial results |
| TTS | Kokoro | 310MB | 50+ voices |
| | Piper | 63MB/voice | Multilingual (en/de/es/fr/zh) |
| | Pocket | 82MB | Pure Rust, edge/embedded |
| | Chatterbox | 350MB | Voice cloning |
```sh
vox models download silero-vad        # 2MB
vox models download whisper-tiny.en   # 75MB
vox models download kokoro            # 310MB
vox models download kokoro-voices     # 27MB
vox models download piper-en-us       # 63MB (+ piper-en-us-config)
```

| Flag | Default | Description |
|---|---|---|
| `cli` | no | CLI binary (`vox listen`, `vox speak`, `vox chat`, `vox serve`) |
| `server` | no | HTTP/WebSocket API server |
| `whisper` | yes | Whisper STT via whisper-rs |
| `silero` | yes | Silero VAD via ONNX Runtime |
| `sherpa` | no | Sherpa-ONNX STT (SenseVoice, Zipformer, streaming) |
| `piper` | no | Piper TTS (multilingual) |
| `kokoro` | no | Kokoro TTS (50+ voices) |
| `pocket` | no | Pocket TTS (pure Rust) |
| `pocket-metal` | no | Pocket TTS with Apple Metal GPU |
| `chatterbox` | no | Chatterbox TTS (voice cloning) |
| `chatterbox-coreml` | no | Chatterbox with CoreML (macOS) |
| `tts` | no | Audio playback for TTS output |
| Platform | Status |
|---|---|
| macOS (Apple Silicon) | Tested |
| macOS (Intel) | Tested |
| Linux (x86_64) | CI tested |
| Windows (x86_64) | CI tested |
Measured on Apple M1 MacBook Pro:
| Metric | Value |
|---|---|
| VAD frame latency | ~1ms per 32ms frame |
| Whisper STT (3s utterance) | ~200ms |
| Streaming STT (per chunk) | <1ms (0.03x real-time) |
| End-to-end (speech end to text) | ~250ms |
| Piper TTS ("Hello world") | ~200ms |
| Chatterbox TTS ("Hello world") | ~2s |
| Memory (idle pipeline) | ~150MB |
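The "0.03x real-time" figure is just processing time divided by audio duration. A quick check, assuming ~1ms of processing per 32ms audio chunk:

```rust
/// Real-time factor: processing time over audio duration.
/// Values below 1.0 mean faster than real time.
fn real_time_factor(processing_ms: f64, audio_ms: f64) -> f64 {
    processing_ms / audio_ms
}
```

`real_time_factor(1.0, 32.0)` gives 0.031, i.e. roughly 0.03x real-time.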
```sh
cargo run --example simple_listen --features whisper,silero           # mic to text
cargo run --example vad_only --features silero                        # speech detection only
cargo run --example voice_assistant --features whisper,silero,kokoro  # voice assistant
cargo run --example tts_speak --features kokoro                       # kokoro TTS
cargo run --example piper_speak --features piper                      # piper TTS
cargo run --example chatterbox_speak --features chatterbox            # voice cloning
```

- Fork the repository
- Create a feature branch (`git checkout -b feat/my-feature`)
- Make your changes and add tests
- Run `cargo test` and `cargo clippy`
- Submit a pull request
For larger features, open an issue first to discuss the approach.
MIT OR Apache-2.0