voxtral-tts.c

Experimental pure C inference engine for Mistral's Voxtral-4B-TTS text-to-speech model. No external dependencies beyond the C standard library and its math library (libm). Reads weights directly from safetensors via memory-mapped I/O.

Note: This project is an experiment in implementing Voxtral TTS in pure C. It is not production-ready, and no further optimization work is planned, although I am considering basing it on ggml. Contributions are welcome and encouraged!

Sample: Hello world

Features

Single-file model loading from consolidated.safetensors
BF16 weights accessed directly from mmap (no full conversion needed)
20 preset voices across 9 languages
WAV output at 24kHz
Optional BLAS acceleration (OpenBLAS, Apple Accelerate)
NEON-optimized BF16 matvec on ARM
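
To illustrate why BF16 weights can be used straight from the mapping without a full conversion pass, here is a minimal scalar sketch. The function names `bf16_to_f32` and `matvec_bf16` are hypothetical and not taken from this repository; the actual kernels (NEON or BLAS paths) may be organized differently.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* BF16 is the upper 16 bits of an IEEE-754 float32, so widening is a shift. */
static inline float bf16_to_f32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);   /* type-pun without breaking aliasing rules */
    return f;
}

/* y = W x, with W stored row-major as BF16 (rows x cols) and x, y in float32.
 * Hypothetical scalar reference: weights are widened on the fly, one at a time. */
static void matvec_bf16(float *y, const uint16_t *W, const float *x,
                        size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++) {
        const uint16_t *row = W + r * cols;
        float acc = 0.0f;
        for (size_t c = 0; c < cols; c++)
            acc += bf16_to_f32(row[c]) * x[c];
        y[r] = acc;
    }
}
```

Because each weight is widened inside the inner loop, the only float32 buffers needed are the input and output vectors.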

Architecture

Voxtral TTS is a three-stage pipeline:

Text --> LLM Backbone (3.4B, 26-layer Mistral)
     --> Flow-Matching Acoustic Transformer (390M, 3 layers, 8 Euler steps)
     --> Audio Codec Decoder (300M, 4-stage conv + ALiBi transformer)
     --> 24kHz waveform

The LLM autoregressively generates hidden states conditioned on text and a voice prompt. The acoustic transformer converts each hidden state to 37 audio codes (1 semantic + 36 acoustic) via flow matching with classifier-free guidance. The codec decoder converts all collected codes into a raw waveform.
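
The flow-matching step can be sketched as a fixed-step Euler integrator that mixes a conditional and an unconditional velocity prediction. This is only an illustration under assumptions: the callback type `velocity_fn`, the function `euler_cfg`, the integration direction (t from 0 to 1), and the stand-in `toy_velocity` are all hypothetical, and the quantization of the final state into discrete acoustic codes is left out.

```c
#include <stddef.h>

#define ACOUSTIC_DIM 36   /* one value per acoustic code, per the text above */

/* Hypothetical callback: writes the predicted velocity for state x at time t.
 * cond != 0 means "conditioned on the LLM hidden state", 0 means unconditional. */
typedef void (*velocity_fn)(float *v, const float *x, float t, int cond, void *ctx);

/* Integrate dx/dt = v(x, t) from t=0 to t=1 with n_steps uniform Euler steps,
 * combining both predictions with guidance weight alpha at every step. */
static void euler_cfg(float *x, size_t dim, int n_steps, float alpha,
                      velocity_fn vel, void *ctx)
{
    float v_cond[ACOUSTIC_DIM], v_uncond[ACOUSTIC_DIM];
    float dt = 1.0f / (float)n_steps;

    if (dim > ACOUSTIC_DIM) return;   /* scratch buffers are fixed-size */
    for (int s = 0; s < n_steps; s++) {
        float t = (float)s * dt;
        vel(v_cond, x, t, 1, ctx);    /* forward pass 1: conditional   */
        vel(v_uncond, x, t, 0, ctx);  /* forward pass 2: unconditional */
        for (size_t i = 0; i < dim; i++) {
            float v = v_uncond[i] + alpha * (v_cond[i] - v_uncond[i]);
            x[i] += dt * v;
        }
    }
}

/* Stand-in for the acoustic transformer: constant velocity, counts calls. */
static void toy_velocity(float *v, const float *x, float t, int cond, void *ctx)
{
    (void)x; (void)t;
    ++*(int *)ctx;
    for (size_t i = 0; i < ACOUSTIC_DIM; i++)
        v[i] = cond ? 1.0f : 0.0f;
}
```

With 8 steps the callback runs 16 times per frame, which is where the cost of classifier-free guidance comes from: every Euler step needs two forward passes.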

Quick Start

Build

# Linux with OpenBLAS (recommended)
make blas

# CUDA + OpenBLAS (requires NVIDIA GPU + CUDA toolkit)
make cuda

# Specify GPU architecture (default: sm_80 for Ampere)
make cuda CUDA_ARCH=sm_89   # Ada Lovelace (RTX 4090)
make cuda CUDA_ARCH=sm_90   # Hopper (H100)
make cuda CUDA_ARCH=sm_100  # Blackwell (B200)

# macOS with Accelerate
make apple

# Portable (no BLAS, slower)
make noblas

The CUDA build uploads all model weights to GPU VRAM and runs the full 26-layer LLM forward pass on GPU using cuBLAS for GEMM and custom CUDA kernels for RMS norm, RoPE, attention, and SwiGLU activation.

Download Model

# Requires hf CLI (pip install huggingface_hub[cli]) or wget
./download_model.sh voxtral-tts-model

This downloads the model weights (~8GB), tokenizer, and voice embeddings from HuggingFace.

Run

./voxtral_tts -d voxtral-tts-model -v neutral_female -o output.wav "Hello world"

Options

Usage: ./voxtral_tts [options] "text to speak"

  -d <dir>        Model directory (required)
  -v <voice>      Voice name (default: neutral_female)
  -o <file>       Output WAV file (default: output.wav)
  -s <seed>       Random seed for reproducibility
  --verbose       Enable verbose output
  --inspect       Print model tensor info and exit

Available Voices

Language	Voices
English	casual_female, casual_male, cheerful_female, neutral_female, neutral_male
French	fr_female, fr_male
German	de_female, de_male
Spanish	es_female, es_male
Italian	it_female, it_male
Portuguese	pt_female, pt_male
Dutch	nl_female, nl_male
Arabic	ar_male
Hindi	hi_female, hi_male

Benchmarks

CUDA (NVIDIA GB10 — DGX Spark)

DGX Spark (NVIDIA GB10 Blackwell, 128GB unified memory, ARM Grace CPU). LLM decode runs on the GPU via cuBLAS and custom CUDA kernels; prefill and the codec run on the CPU.

Input	Tokens	Frames	Audio	Wall time	RTF
"Hello world" (2 words)	2	21	1.68s	48s	28x
"The quick brown fox..." (9 words)	9	47	3.76s	59s	16x
Two sentences (17 words)	21	97	7.76s	78s	10x
Paragraph (40 words)	33	212	16.96s	124s	7.3x

~0.4s per audio frame for decode (12x faster than CPU)
RTF ~7-10x for longer texts (fixed ~40s overhead for model load + prefill)
Further speedups possible: GPU prefill, GPU codec, CUDA graphs

CPU-only (AMD Ryzen 9 9950X3D)

AMD Ryzen 9 9950X3D (16-core), 84GB RAM, OpenBLAS 0.3.26. Pure CPU inference.

Input	Tokens	Frames	Audio	Wall time	RTF
"Hello world" (2 words)	2	21	1.68s	121s	72x
"The quick brown fox..." (9 words)	9	47	3.76s	215s	57x
Two sentences (17 words)	21	97	7.76s	447s	58x
Paragraph (40 words)	33	215	17.20s	1023s	59x

~4.8s per audio frame (each frame = 80ms of audio at 12.5 Hz)
RTF ~58x for typical inputs
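
The Audio and RTF columns follow from the frame count and the wall time. A small sketch of that arithmetic, assuming RTF is defined as wall-clock time divided by audio duration (the helper names are hypothetical):

```c
#define FRAME_RATE_HZ 12.5   /* audio frames per second, i.e. 80 ms per frame */

/* Audio duration in seconds for a given number of generated frames. */
static double audio_seconds(int frames)
{
    return (double)frames / FRAME_RATE_HZ;
}

/* Real-time factor: wall-clock seconds spent per second of audio produced.
 * Values above 1 mean slower than real time. */
static double real_time_factor(double wall_seconds, int frames)
{
    return wall_seconds / audio_seconds(frames);
}
```

For the "Hello world" row of the CPU table, 21 frames give 1.68s of audio, and 121s of wall time divided by 1.68s is roughly 72.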

Notes

Peak RSS: ~7.8 GB (8GB model weights mmap'd)
Binary size: 86 KB
Each audio frame requires a full 26-layer LLM forward pass (3.4B parameters) plus 16 acoustic transformer forward passes (8 Euler steps x 2 for classifier-free guidance)

Run ./bench.sh to reproduce these benchmarks on your machine.

Project Structure

voxtral_tts.h                 Main header (constants, structs, API)
voxtral_tts.c                 Model loading and inference orchestrator
voxtral_tts_llm.c             26-layer Mistral decoder with KV cache
voxtral_tts_acoustic.c        Flow-matching acoustic transformer
voxtral_tts_codec.c           Audio codec decoder (ALiBi + weight_norm)
voxtral_tts_kernels.{c,h}     Math kernels (matmul, attention, conv, RoPE, ...)
voxtral_tts_tokenizer.{c,h}   Tekken BPE tokenizer (encode + decode)
voxtral_tts_voice.c           Voice embedding loader (.pt) + audio codebook embeddings
voxtral_tts_wav.c             WAV file writer
voxtral_tts_safetensors.{c,h} Safetensors mmap reader
main.c                        CLI entry point
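
As an illustration of the smallest component in the list above, a WAV writer mostly consists of filling in a 44-byte header. The sketch below assumes mono 16-bit PCM output, which this README does not state; the helper names are hypothetical and this is not the code in voxtral_tts_wav.c.

```c
#include <stdint.h>
#include <string.h>

/* WAV fields are little-endian regardless of host byte order. */
static void put_u16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xFF);
    p[1] = (uint8_t)(v >> 8);
}

static void put_u32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xFF);
    p[1] = (uint8_t)((v >> 8) & 0xFF);
    p[2] = (uint8_t)((v >> 16) & 0xFF);
    p[3] = (uint8_t)(v >> 24);
}

/* Fill a canonical 44-byte RIFF/WAVE header for mono 16-bit PCM (assumed). */
static void wav_header(uint8_t h[44], uint32_t n_samples, uint32_t sample_rate)
{
    uint32_t data_bytes = n_samples * 2;          /* 2 bytes per mono sample */
    memcpy(h, "RIFF", 4);      put_u32(h + 4, 36 + data_bytes);
    memcpy(h + 8, "WAVE", 4);
    memcpy(h + 12, "fmt ", 4); put_u32(h + 16, 16);   /* PCM fmt chunk size */
    put_u16(h + 20, 1);                           /* format tag 1 = PCM */
    put_u16(h + 22, 1);                           /* channels */
    put_u32(h + 24, sample_rate);
    put_u32(h + 28, sample_rate * 2);             /* byte rate */
    put_u16(h + 32, 2);                           /* block align */
    put_u16(h + 34, 16);                          /* bits per sample */
    memcpy(h + 36, "data", 4); put_u32(h + 40, data_bytes);
}

/* Clamp a float sample to [-1, 1] and scale it to int16. */
static int16_t pcm16(float s)
{
    if (s > 1.0f) s = 1.0f;
    if (s < -1.0f) s = -1.0f;
    return (int16_t)(s * 32767.0f);
}
```

A caller would write the header with a sample rate of 24000, then the converted samples as little-endian 16-bit values.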

Utilities

inspect_weights -- dump tensor names/shapes from safetensors (make inspect)
convert_voice.py -- convert .pt voice embeddings to raw binary
download_model.sh -- download model from HuggingFace
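
For readers curious what a safetensors reader has to do before it can list tensor names and shapes, here is a minimal POSIX sketch. A safetensors file begins with an 8-byte little-endian header length, then a JSON header, then the raw tensor bytes. The function names `map_file` and `safetensors_split` are hypothetical and not this project's API, and JSON parsing is left out.

```c
#define _POSIX_C_SOURCE 200809L   /* expose POSIX declarations under strict -std=c11 */
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only. Returns NULL on failure. */
static const uint8_t *map_file(const char *path, size_t *size)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) { close(fd); return NULL; }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       /* the mapping stays valid after close */
    if (p == MAP_FAILED) return NULL;

    *size = (size_t)st.st_size;
    return (const uint8_t *)p;
}

/* Split a mapped safetensors buffer into its JSON header and tensor data.
 * Returns 0 on success, -1 if the buffer is too short or the length is bogus. */
static int safetensors_split(const uint8_t *buf, size_t size,
                             const char **json, uint64_t *json_len,
                             const uint8_t **data)
{
    if (size < 8) return -1;
    uint64_t n = 0;
    for (int i = 7; i >= 0; i--)     /* assemble little-endian u64 portably */
        n = (n << 8) | buf[i];
    if (n > size - 8) return -1;     /* header must fit inside the file */

    *json = (const char *)(buf + 8);
    *json_len = n;
    *data = buf + 8 + n;             /* tensor offsets in the JSON are relative to here */
    return 0;
}
```

Tensor lookups then reduce to pointer arithmetic into the mapping, so the kernel pages weights in on demand instead of the program reading the file up front.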

How It Works

The prompt format follows mistral_common's encode_speech_request:

[BOS] [BEGIN_AUDIO] [voice_embedding x N] [/INST] text_tokens [INST] [BEGIN_AUDIO]

Voice embeddings are pre-computed BF16 tensors of shape [N, 3072] that replace audio token placeholder positions. After prefill, the model enters an autoregressive loop:

1. The LLM produces a hidden state.
2. The acoustic transformer predicts a semantic code (greedy argmax) and 36 acoustic codes (flow matching with 8 Euler ODE steps and CFG alpha=1.2).
3. The 37 codes are embedded back into LLM input space via multi-vocabulary embeddings (sum across codebooks).
4. Steps 1-3 repeat until [END_AUDIO] is generated.

Once the loop ends, all collected codes are decoded by the audio codec into a 24kHz waveform.
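
The embedding step of the loop (summing across codebooks) can be sketched as follows. The table layout, one row-major [vocab_size x dim] table per codebook, and the name `embed_codes` are assumptions for illustration; the actual weight layout in the checkpoint may differ.

```c
#include <stddef.h>
#include <string.h>

#define N_CODEBOOKS 37   /* 1 semantic + 36 acoustic */

/* Hypothetical layout: tables[k] is the embedding table of codebook k,
 * row-major [vocab_size x dim]. out receives the sum of the selected rows. */
static void embed_codes(float *out, const float *const *tables,
                        const int *codes, size_t dim)
{
    memset(out, 0, dim * sizeof *out);
    for (int k = 0; k < N_CODEBOOKS; k++) {
        const float *row = tables[k] + (size_t)codes[k] * dim;
        for (size_t i = 0; i < dim; i++)
            out[i] += row[i];
    }
}
```

The result is a single vector of the LLM's hidden size, so one generated audio frame occupies one input position in the next decode step.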

Requirements

C11 compiler (gcc, clang)
~10GB RAM (8GB mmap'd weights + working memory)
Optional: OpenBLAS or Apple Accelerate for faster matrix operations

License

MIT License. See LICENSE.

Note: The Voxtral-4B-TTS model weights are released by Mistral AI under CC BY-NC 4.0. This inference engine is MIT-licensed but the model weights have their own license terms.

Author

Ettore Di Giacinto (@mudler)

Acknowledgements

This project builds on the work of several open-source projects:

voxtral.c by Salvatore Sanfilippo (antirez) -- Pure C inference engine for Voxtral Realtime (ASR). The safetensors reader, math kernels, Mistral decoder implementation, and overall architecture of this project are directly adapted from voxtral.c. The project demonstrated that a full transformer inference engine can be written in clean, dependency-free C.
vLLM and vLLM-Omni -- The reference Python implementation for Voxtral TTS inference. The flow-matching acoustic transformer, audio codec decoder, and the overall TTS pipeline were implemented based on the vLLM-Omni model code. The prompt format, voice embedding handling, and audio code generation logic follow vLLM-Omni's implementation.
Mistral AI -- For developing and open-sourcing the Voxtral TTS model and the mistral_common tokenizer library.
