#text-to-speech #candle #speech-synthesis

pocket-tts

High-performance CPU-based Text-to-Speech library using Candle

20 releases (5 breaking)

Uses new Rust 2024

0.6.2 Feb 13, 2026
0.6.1 Feb 10, 2026
0.5.0 Feb 7, 2026
0.4.1 Feb 6, 2026
0.1.1 Jan 17, 2026

#174 in Audio


808 downloads per month
Used in 2 crates

MIT/Apache

450KB
4.5K SLoC

Pocket TTS (Rust/Candle)

A native Rust port of Kyutai's Pocket TTS using Candle for tensor operations.

Text-to-speech that runs entirely on CPU—no Python, no GPU required.

Features

  • Pure Rust - No Python runtime, just a single binary
  • CPU-only - Runs entirely on CPU by default; no GPU required
  • Metal Acceleration - Build with --features metal for hardware acceleration on macOS
  • int8 Quantization - Significant speedup and smaller memory footprint
  • Streaming - Full-pipeline stateful streaming (FlowLM + Mimi) for low-latency audio
  • Project Structure - Clean, modular workspace design
  • WebAssembly - Run the full model in any modern web browser
  • Pause Handling - Support for natural pauses and explicit [pause:Xms] syntax
  • HTTP API - REST API server with OpenAI-compatible endpoint
  • Web UI - Built-in web interface (React/Vite) for interactive use
  • Flexible Builds - Use --no-default-features for a "lite" build without web UI assets
  • Python Bindings - Use the Rust implementation from Python for improved performance

Quick Start

# Build Web UI assets (required for default build from source)
cd crates/pocket-tts-cli/web
npm install
npm run build

# Build with default features (includes Web UI assets)
cargo build --release

# Build "lite" version (no Web UI assets, API only)
cargo build --release --no-default-features

# Build with Metal support (macOS only)
cargo build --release --features metal

If you prefer bun, run bun install and bun run build in crates/pocket-tts-cli/web.

Generate audio

# Using default voice
cargo run --release --package pocket-tts-cli -- generate --text "Hello, world!"

# Using Metal acceleration (if enabled)
cargo run --release --features metal --package pocket-tts-cli -- generate --text "Hello, world!" --use-metal

# Using a custom voice (WAV file)
cargo run --release --package pocket-tts-cli -- generate \
    --text "Hello, world!" \
    --voice ./my_voice.wav \
    --output output.wav

# Using a predefined voice
cargo run --release --package pocket-tts-cli -- generate --voice alba

Start the HTTP server

cargo run --release -p pocket-tts-cli -- serve
# Navigate to http://localhost:8000

Experimental WASM UI

The project has one React web app with two serving modes:

  • standard (default): server-backed streaming (/stream)
  • wasm-experimental: browser-side inference using WasmTTSModel

1. Build the WASM package

From the repository root:

# Windows
.\scripts\build-wasm.ps1
# Unix
./scripts/build-wasm.sh

Manual fallback:

cargo build -p pocket-tts --release --target wasm32-unknown-unknown --features wasm
wasm-bindgen --target web --out-dir crates/pocket-tts/pkg target/wasm32-unknown-unknown/release/pocket_tts.wasm

2. Launch experimental UI mode

cargo run --release -p pocket-tts-cli -- serve --ui wasm-experimental --port 8080

  • Navigate to http://localhost:8080
  • wasm-demo still works as a deprecated alias.

Installation

Add to your Cargo.toml:

[dependencies]
pocket-tts = { path = "crates/pocket-tts" }
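
The path dependency above assumes you are building inside this repository's workspace. For an external project, you can instead depend on the published crates.io release (0.6.2 at the time of writing):

[dependencies]
pocket-tts = "0.6"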

Library Usage

use pocket_tts::TTSModel;
use anyhow::Result;

fn main() -> Result<()> {
    // Load the model
    let model = TTSModel::load("b6369a24")?;
    
    // Get voice state from audio file
    let voice_state = model.get_voice_state("voice.wav")?;
    
    // Generate audio
    let audio = model.generate("Hello, world!", &voice_state)?;
    
    // Save to file
    pocket_tts::audio::write_wav("output.wav", &audio, model.sample_rate as u32)?;
    
    Ok(())
}
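
The pause-handling feature's [pause:Xms] markers go directly in the input text. A minimal sketch reusing the model and voice_state from above (the 500 ms value is illustrative; the marker format is assumed from the feature description):

// Insert an explicit 500 ms pause between sentences using the [pause:Xms] syntax.
let audio = model.generate("First sentence.[pause:500ms] Second sentence.", &voice_state)?;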

Streaming Generation

use pocket_tts::TTSModel;

let model = TTSModel::load("b6369a24")?;
let voice_state = model.get_voice_state("voice.wav")?;

// Stream audio chunks as they're generated
for chunk in model.generate_stream("Long text here...", &voice_state) {
    let audio_chunk = chunk?;
    // Process or play each chunk
}
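
A sketch of collecting the streamed chunks into a single WAV file, assuming each chunk dereferences to a slice of PCM samples accepted by write_wav:

// Accumulate streamed chunks, then write one WAV file at the end.
let mut samples = Vec::new();
for chunk in model.generate_stream("Long text here...", &voice_state) {
    samples.extend_from_slice(&chunk?);
}
pocket_tts::audio::write_wav("streamed.wav", &samples, model.sample_rate as u32)?;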

Custom Parameters

let model = TTSModel::load_with_params(
    "b6369a24",     // variant
    0.7,            // temperature (higher = more variation)
    1,              // lsd_decode_steps (more = better quality, slower)
    -4.0,           // eos_threshold (more negative = longer audio)
)?;

HuggingFace token

If you're using a model that has to be downloaded from Hugging Face, you will need to provide an access token in the HF_TOKEN environment variable.
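
A small sketch that checks for the variable up front, so a missing token fails fast instead of mid-download (std::env::var is the standard library; the loader is assumed to read HF_TOKEN from the environment as described above):

// Warn early if HF_TOKEN is absent; gated downloads from Hugging Face may otherwise fail later.
if std::env::var("HF_TOKEN").is_err() {
    eprintln!("warning: HF_TOKEN is not set; model download may fail");
}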

CLI Reference

generate command

Generate audio from text and save to a WAV file.

pocket-tts generate [OPTIONS]

Options:
  -t, --text <TEXT>              Text to synthesize [default: greeting]
  -v, --voice <VOICE>            Voice: predefined name, .wav file, or .safetensors
  -o, --output <PATH>            Output file [default: output.wav]
      --variant <VARIANT>        Model variant [default: b6369a24]
      --temperature <FLOAT>      Sampling temperature [default: 0.7]
      --lsd-decode-steps <INT>   LSD decode steps [default: 1]
      --eos-threshold <FLOAT>    EOS threshold [default: -4.0]
      --stream                   Stream raw PCM to stdout
  -q, --quiet                    Suppress output
      --use-metal                Use Metal acceleration (macOS)

Predefined voices: alba, marius, javert, jean, fantine, cosette, eponine, azelma

serve command

Start an HTTP API server with web interface.

pocket-tts serve [OPTIONS]

Options:
      --host <HOST>              Bind address [default: 127.0.0.1]
  -p, --port <PORT>              Port number [default: 8000]
      --voice <VOICE>            Default voice [default: alba]
      --variant <VARIANT>        Model variant [default: b6369a24]
      --temperature <FLOAT>      Temperature [default: 0.7]
      --lsd-decode-steps <INT>   LSD steps [default: 1]
      --eos-threshold <FLOAT>    EOS threshold [default: -4.0]
      --ui <UI>                  Web UI mode: standard|wasm-experimental [default: standard]

wasm-demo command (deprecated alias)

wasm-demo now forwards to serve --ui wasm-experimental.

pocket-tts wasm-demo [OPTIONS]

Options:
      --host <HOST>              Bind address [default: 127.0.0.1]
  -p, --port <PORT>              Port number [default: 8080]
      --root <ROOT>              Deprecated (ignored)
  -m, --models <MODELS>          Deprecated (ignored)

Python Bindings

The Rust implementation can be used as a Python module for improved performance (~1.34x speedup).

Installation

Requires maturin.

cd crates/pocket-tts-bindings
uvx maturin develop --release

Usage

import pocket_tts_bindings

# Load the model
model = pocket_tts_bindings.PyTTSModel.load("b6369a24")

# Generate audio
audio_samples = model.generate(
    "Hello from Rust!",
    "path/to/voice.wav"
)

API Endpoints

Method  Endpoint          Description
GET     /                 Web interface
GET     /health           Health check
POST    /generate         Generate audio (JSON)
POST    /stream           Streaming generation
POST    /tts              Python-compatible (multipart)
POST    /v1/audio/speech  OpenAI-compatible

Example API call

curl -X POST http://localhost:8000/generate \
  -H 'Content-Type: application/json' \
  -d '{"text": "Hello world", "voice": "alba"}' \
  --output output.wav
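
For the OpenAI-compatible endpoint, here is a client-side sketch in Rust using the reqwest (blocking + json features) and serde_json crates, which are assumed client dependencies and not part of this project. The request shape follows OpenAI's audio/speech schema; which fields this server honors is assumed from its "OpenAI-compatible" label.

// POST to /v1/audio/speech and save the returned audio bytes.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = serde_json::json!({
        "model": "pocket-tts",   // hypothetical model name
        "input": "Hello world",
        "voice": "alba"
    });
    let bytes = reqwest::blocking::Client::new()
        .post("http://localhost:8000/v1/audio/speech")
        .json(&body)
        .send()?
        .error_for_status()?
        .bytes()?;
    std::fs::write("speech.wav", &bytes)?;
    Ok(())
}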

Project Structure

candle/
├── Cargo.toml              # Workspace configuration
├── crates/
│   ├── pocket-tts/         # Core library
│   │   ├── src/
│   │   │   ├── lib.rs          # Public API
│   │   │   ├── tts_model.rs    # Main TTSModel
│   │   │   ├── wasm.rs         # WASM entry points
│   │   │   ├── audio.rs        # WAV I/O, resampling
│   │   │   ├── quantize.rs     # int8 quantization
│   │   │   ├── pause.rs        # Pause/silence handling
│   │   │   ├── config.rs       # YAML config types
│   │   │   ├── models/         # Neural network models
│   │   │   │   ├── flow_lm.rs      # Flow language model
│   │   │   │   ├── mimi.rs         # Audio codec
│   │   │   │   ├── seanet.rs       # Encoder/decoder
│   │   │   │   └── transformer.rs  # Transformer blocks
│   │   │   └── modules/        # Reusable components
│   │   │       ├── attention.rs    # Multi-head attention
│   │   │       ├── conv.rs         # Convolution layers
│   │   │       ├── mlp.rs          # MLP with AdaLN
│   │   │       └── rope.rs         # Rotary embeddings
│   │   ├── tests/
│   │   └── benches/
│   └── pocket-tts-cli/     # CLI binary
│       ├── src/
│       │   ├── main.rs         # Entry point
│       │   ├── commands/       # generate, serve
│       │   ├── server/         # Axum HTTP server
│       │   └── voice.rs        # Voice resolution
│       └── web/                # React/Vite Web UI source
└── docs/                   # Documentation

Architecture

The Rust port mirrors the Python implementation:

  1. Text Conditioning: SentencePiece tokenizer → embedding lookup table
  2. FlowLM Transformer: Generates latent representations from text using Lagrangian Self Distillation (LSD)
  3. Mimi Decoder: Converts latents to audio via SEANet decoder
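
As a conceptual sketch, the data flow looks like this (the free functions are illustrative stubs, not the crate's actual internal API):

// Conceptual data flow only; these stubs do not exist in the crate.
fn tokenize(_text: &str) -> Vec<u32> { todo!() }                  // SentencePiece tokens
fn flow_lm_generate(_tokens: &[u32]) -> Vec<Vec<f32>> { todo!() } // latent frames via LSD
fn mimi_decode(_latents: &[Vec<f32>]) -> Vec<f32> { todo!() }     // PCM samples

fn synthesize(text: &str) -> Vec<f32> {
    let tokens = tokenize(text);             // 1. text conditioning
    let latents = flow_lm_generate(&tokens); // 2. FlowLM transformer
    mimi_decode(&latents)                    // 3. Mimi decoder (SEANet)
}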

Key differences from Python

  • Uses Candle instead of PyTorch
  • Full-pipeline stateful streaming (KV-caching for Transformer, overlap-add for Mimi)
  • Polyphase resampling via rubato (matches scipy)
  • Compiled to native code—no JIT, no Python overhead

GPU Acceleration

Metal (macOS)

Build with Metal support for hardware acceleration on Apple Silicon:

cargo build --release --features metal

Current Status: Metal support provides ~2x speedup over CPU on Apple Silicon:

Backend  RTF    Speed         Notes
CPU      ~0.33  3x real-time  Default, cross-platform
Metal    ~0.16  6x real-time  Requires --features metal

Benchmarks verified on Apple M4 Max

For best Apple Silicon performance (~8x real-time), consider the community MLX implementation.

CUDA (Linux/Windows)

Build with CUDA support:

cargo build --release --features cuda

Note: CUDA support requires a compatible NVIDIA GPU and CUDA toolkit installed.

Benchmarking

Run benchmarks to measure performance on your hardware:

cargo bench -p pocket-tts

Note: Performance may differ from the Python implementation. Candle is optimized for portability rather than raw speed.

Manual Verification and TTFA Gate

Use the templates in manual-verification/ for reproducible QA:

  • manual-verification/wasm-ui-verification-template.md
  • manual-verification/perf-ttfa-template.md

Recommended flow:

  1. Run standard UI:
    • cargo run --release -p pocket-tts-cli -- serve
  2. Run experimental WASM UI:
    • cargo run --release -p pocket-tts-cli -- serve --ui wasm-experimental --port 8080
  3. In the web UI, run the built-in Manual TTFA Verification panel.
  4. Record TTFC, TTFA, and total generation time for at least 30 short-prompt runs per mode.

Maintainer target:

  • TTFA <= 600ms (time from Generate click to first audible sample) for warmed local short-prompt runs.

Performance Results

Benchmarks against the Python baseline (all verified on an Apple M4 Max):

  • Short Text: ~6.20x speedup
  • Medium Text: ~3.47x speedup
  • Long Text: ~3.33x speedup
  • Latency: ~80ms to first audio chunk (optimized)

Rust is consistently >3.1x faster than the optimized Python implementation.

Cross-Implementation Comparison

Implementation          RTF    Speed vs Real-Time  Platform
PyTorch CPU (official)  ~0.25  4x faster           Cross-platform
Rust/Candle CPU         ~0.33  3x faster           Cross-platform
Rust/Candle Metal       ~0.16  6x faster           macOS (Apple Silicon)
MLX (Apple Silicon)     ~0.13  8x faster           macOS only

RTF = Real-Time Factor (lower is better; <1.0 means faster than real-time). For example, at RTF ≈ 0.33, synthesizing 10 seconds of audio takes about 3.3 seconds of wall-clock time. All benchmarks verified on Apple M4 Max.

Numerical Parity

The Rust implementation achieves strong numerical parity with Python:

Component            Max Difference  Status
Input audio          0               ✅ Perfect
SEANet Decoder       ~0.000004       ✅ Excellent
Decoder Transformer  ~0.002          ✅ Good
Voice Conditioning   ~0.004          ✅ Good
Full Pipeline        ~0.06           ✅ Acceptable

Run parity tests:

cargo test -p pocket-tts parity --release

Dependencies

Core dependencies are listed in the workspace Cargo.toml.

License

MIT License - see LICENSE

Acknowledgements

Dependencies

~32–54MB
~782K SLoC