Local-first voice AI framework: speech-to-text, text-to-speech, and voice chat running entirely on your machine. No API keys, no cloud, no data leaving your machine.
```
Mic --> VAD (Silero) --> STT (Whisper/Sherpa/Streaming) --> Your Code --> TTS (Kokoro/Piper/Chatterbox) --> Speaker
```
```sh
# Install
cargo install --git https://github.com/mrtozner/vox --features cli

# Transcribe speech from your microphone
vox listen

# Text-to-speech (requires kokoro feature)
cargo install --git https://github.com/mrtozner/vox --features cli,kokoro
vox speak "Hello from Vox!"

# Voice chat with Ollama
vox chat --llm llama3.2
```

Models auto-download on first run. Pass `-y` to skip prompts.
- Speech-to-Text — Whisper (tiny to medium), Sherpa-ONNX (SenseVoice, Zipformer, Paraformer), or streaming Sherpa for real-time partial transcription
- Text-to-Speech — Natural synthesis with Kokoro (50+ voices), Piper (multilingual), Pocket (pure Rust, edge-ready), or Chatterbox (voice cloning)
- Voice Chat — Talk to any Ollama LLM and hear responses
- Web Interface — Browser UI for demos and testing (`vox serve`)
- Python Bindings — Same pipeline from Python via PyO3
- HTTP/WebSocket Server — Integrate into any stack with REST or streaming WebSocket API
- Fully Local — No API keys, no cloud, no data leaves your machine
- Pluggable Backends — Swap VAD, STT, or TTS engines via traits
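The pluggable-backends idea can be sketched with a toy trait. This is illustrative only: `SttBackend`, `transcribe`, and the dummy engines below are hypothetical names, not vox's actual API; the point is that pipeline code depends on a trait object, so engines swap freely.

```rust
// Hypothetical trait for an STT engine; vox's real trait may differ.
trait SttBackend {
    fn transcribe(&self, samples: &[f32]) -> String;
}

// Two stand-in engines implementing the same trait.
struct DummyWhisper;
impl SttBackend for DummyWhisper {
    fn transcribe(&self, _samples: &[f32]) -> String {
        "from whisper".to_string()
    }
}

struct DummySherpa;
impl SttBackend for DummySherpa {
    fn transcribe(&self, _samples: &[f32]) -> String {
        "from sherpa".to_string()
    }
}

// The pipeline only sees the trait, not a concrete engine,
// so swapping backends never touches pipeline code.
fn run_stt(backend: &dyn SttBackend, audio: &[f32]) -> String {
    backend.transcribe(audio)
}
```

Call `run_stt(&DummyWhisper, &audio)` or `run_stt(&DummySherpa, &audio)` interchangeably; the same shape applies to VAD and TTS backends.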
```sh
vox listen                                   # transcribe from microphone (Whisper)
vox listen --model base.en                   # use a larger Whisper model
vox listen --stt-backend sherpa              # use Sherpa SenseVoice (multilingual)
vox listen --stt-backend sherpa-streaming    # real-time streaming transcription

vox speak "Hello from Vox!"                  # text-to-speech (needs kokoro feature)
vox speak "Hello" --voice am_adam            # pick a voice
vox speak "Hallo" --backend piper --voice de         # multilingual TTS with Piper
vox speak "Hi" --backend chatterbox --voice ref.wav  # voice cloning

vox chat --llm llama3.2                      # voice chat with Ollama

vox models list                              # show downloaded models
vox models download whisper-base.en          # download a specific model
```

```sh
cargo install --git https://github.com/mrtozner/vox --features cli,server,kokoro
vox serve --port 3000
```

Opens a browser interface at http://localhost:3000 with real-time mic transcription, TTS synthesis, voice chat with Ollama, and a status dashboard. No separate frontend build.
Use the same server's REST endpoints directly:
```sh
# Transcribe audio
curl -X POST http://localhost:3000/v1/transcribe \
  -H "Content-Type: audio/wav" \
  --data-binary @audio.wav

# Synthesize speech
curl -X POST http://localhost:3000/v1/synthesize \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello from Vox!"}'
```

WebSocket streaming at ws://localhost:3000/v1/listen — send PCM f32 LE frames at 16kHz mono, receive JSON events in real time:

```json
{"type": "speech_start"}
{"type": "partial", "text": "hello", "is_final": false, "stability": 0.5, "duration_ms": 600, "processing_time_ms": 2}
{"type": "partial", "text": "hello world", "is_final": false, "stability": 0.5, "duration_ms": 1000, "processing_time_ms": 3}
{"type": "transcript", "text": "hello world", "duration_ms": 1200, "processing_time_ms": 180}
{"type": "speech_end"}
```

When a streaming STT backend is available (the sherpa-streaming model is downloaded), partial results arrive incrementally as you speak. Without it, partials are omitted and you get the final transcript on speech end.
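On the client side, preparing audio for the streaming socket just means serializing f32 samples as little-endian bytes. A minimal sketch (the 512-sample frame size is an assumption for illustration; the endpoint consumes raw f32 LE bytes in whatever chunks you send):

```rust
/// Pack 16 kHz mono f32 PCM samples into little-endian byte frames,
/// ready to send as binary WebSocket messages.
fn pack_frames(samples: &[f32], frame_len: usize) -> Vec<Vec<u8>> {
    samples
        .chunks(frame_len)
        .map(|chunk| chunk.iter().flat_map(|s| s.to_le_bytes()).collect())
        .collect()
}
```

Each returned frame holds `frame_len * 4` bytes (4 bytes per f32 sample); send them in order as binary messages.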
```rust
use vox::{Vox, SileroVad, WhisperBackend};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let vox = Vox::builder()
        .vad(SileroVad::new("silero_vad.onnx")?)
        .stt(WhisperBackend::from_model("ggml-tiny.en.bin")?)
        .on_utterance(|result, _ctx| {
            println!("{}", result.text);
        })
        .build()?;

    vox.listen().await?;
    Ok(())
}
```

```sh
cd python
pip install maturin
maturin develop --features whisper,silero,kokoro
```

```python
from vox_voice import Vox, SileroVad, WhisperStt

vox = Vox(vad=SileroVad(), stt=WhisperStt("tiny.en"))
for result in vox.listen():
    print(result.text)
```

Built with PyO3. Same pipeline, Pythonic API.
```
+--------+     +-----+     +-----+     +-----------+     +-----+
|  Mic   | --> | VAD | --> | STT | --> | Callback  | --> | TTS |
| (cpal) |     |     |     |     |     | (your fn) |     |     |
+--------+     +-----+     +-----+     +-----------+     +-----+
                  |                         |
             Silero ONNX            VoxContext gives
               v5 model            access to speak()
```
Audio is captured via cpal, resampled to 16kHz mono, and fed frame-by-frame to the VAD. On speech end, the utterance goes to STT. Your callback gets the text and a `VoxContext` for an optional TTS reply.
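The capture path (downmix to mono, resample to 16 kHz) can be sketched in plain Rust. This is a naive linear-interpolation resampler for illustration only; vox's actual resampling implementation may differ.

```rust
/// Average interleaved stereo samples down to mono.
fn downmix_stereo(interleaved: &[f32]) -> Vec<f32> {
    interleaved.chunks(2).map(|lr| (lr[0] + lr[1]) / 2.0).collect()
}

/// Naive linear-interpolation resampler (illustration only).
/// Maps each output index back to a fractional input position
/// and interpolates between the two nearest input samples.
fn resample(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    let ratio = from_hz as f64 / to_hz as f64;
    let out_len = (input.len() as f64 / ratio) as usize;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * ratio;
            let idx = pos as usize;
            let frac = (pos - idx as f64) as f32;
            let a = input[idx];
            let b = input[(idx + 1).min(input.len() - 1)];
            a + (b - a) * frac
        })
        .collect()
}
```

A 48 kHz capture becomes 16 kHz by `resample(&mono, 48_000, 16_000)`; the result is then sliced into fixed-size frames for the VAD.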
| Component | Model | Size | Notes |
|---|---|---|---|
| VAD | Silero VAD v5 | 2MB | Speech detection |
| STT | Whisper tiny.en | 75MB | Fast, English |
| | Whisper base.en | 142MB | Better accuracy |
| | Whisper small.en | 466MB | High accuracy |
| | Whisper medium.en | 1.5GB | Highest accuracy |
| | Sherpa SenseVoice | 230MB | Multilingual (zh/en/ja/ko/yue) |
| | Sherpa Streaming Zipformer | 27MB | Real-time partial results |
| TTS | Kokoro | 310MB | 50+ voices |
| | Piper | 63MB/voice | Multilingual (en/de/es/fr/zh) |
| | Pocket | 82MB | Pure Rust, edge/embedded |
| | Chatterbox | 350MB | Voice cloning |
```sh
vox models download silero-vad        # 2MB
vox models download whisper-tiny.en   # 75MB
vox models download kokoro            # 310MB
vox models download kokoro-voices     # 27MB
vox models download piper-en-us       # 63MB (+ piper-en-us-config)
```

| Flag | Default | Description |
|---|---|---|
| `cli` | no | CLI binary (`vox listen`, `vox speak`, `vox chat`, `vox serve`) |
| `server` | no | HTTP/WebSocket API server |
| `whisper` | yes | Whisper STT via whisper-rs |
| `silero` | yes | Silero VAD via ONNX Runtime |
| `sherpa` | no | Sherpa-ONNX STT (SenseVoice, Zipformer, streaming) |
| `piper` | no | Piper TTS (multilingual) |
| `kokoro` | no | Kokoro TTS (50+ voices) |
| `pocket` | no | Pocket TTS (pure Rust) |
| `pocket-metal` | no | Pocket TTS with Apple Metal GPU |
| `chatterbox` | no | Chatterbox TTS (voice cloning) |
| `chatterbox-coreml` | no | Chatterbox with CoreML (macOS) |
| `tts` | no | Audio playback for TTS output |
| Platform | Status |
|---|---|
| macOS (Apple Silicon) | Tested |
| macOS (Intel) | Tested |
| Linux (x86_64) | CI tested |
| Windows (x86_64) | CI tested |
Measured on Apple M1 MacBook Pro:
| Metric | Value |
|---|---|
| VAD frame latency | ~1ms per 32ms frame |
| Whisper STT (3s utterance) | ~200ms |
| Streaming STT (per chunk) | <1ms (0.03x real-time) |
| End-to-end (speech end to text) | ~250ms |
| Piper TTS ("Hello world") | ~200ms |
| Chatterbox TTS ("Hello world") | ~2s |
| Memory (idle pipeline) | ~150MB |
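The "0.03x real-time" figure is just processing time divided by audio duration. A quick check, assuming ~1ms of processing per 32ms audio chunk:

```rust
/// Real-time factor: processing time over audio duration.
/// Values below 1.0 mean faster than real time.
fn real_time_factor(processing_ms: f64, audio_ms: f64) -> f64 {
    processing_ms / audio_ms
}
```

`real_time_factor(1.0, 32.0)` gives 0.031, i.e. roughly 0.03x real-time.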
```sh
cargo run --example simple_listen --features whisper,silero           # mic to text
cargo run --example vad_only --features silero                        # speech detection only
cargo run --example voice_assistant --features whisper,silero,kokoro  # voice assistant
cargo run --example tts_speak --features kokoro                       # kokoro TTS
cargo run --example piper_speak --features piper                      # piper TTS
cargo run --example chatterbox_speak --features chatterbox            # voice cloning
```

- Fork the repository
- Create a feature branch (`git checkout -b feat/my-feature`)
- Make your changes and add tests
- Run `cargo test` and `cargo clippy`
- Submit a pull request
For larger features, open an issue first to discuss the approach.
MIT OR Apache-2.0