Experimental pure C inference engine for Mistral's Voxtral-4B-TTS text-to-speech model. Zero external dependencies beyond the C standard library and math. Reads weights directly from safetensors via memory-mapped I/O.
Note: This project is an experiment in building a pure C implementation of Voxtral TTS. It is not production-ready, and no further optimization work is planned (although I am considering rebasing it on ggml). Contributions are welcome and encouraged!
Sample: Hello world
- Single-file model loading from consolidated.safetensors
- BF16 weights accessed directly from mmap (no full conversion needed)
- 20 preset voices across 9 languages
- WAV output at 24kHz
- Optional BLAS acceleration (OpenBLAS, Apple Accelerate)
- NEON-optimized BF16 matvec on ARM
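To illustrate what reading BF16 weights directly from the mapping can look like, here is a minimal scalar sketch. The function names `bf16_to_f32` and `matvec_bf16` are made up for this example and are not taken from the project sources; the row-major layout is also an assumption. A BF16 value is the upper 16 bits of an IEEE-754 float32, so widening it is a 16-bit shift.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Widen one BF16 value (the top 16 bits of a float32) to float32. */
static float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f); /* bit cast without aliasing problems */
    return f;
}

/* y = W x, with W stored row-major as BF16 (rows x cols) and x, y as float32.
 * W can point straight into a read-only mmap'd region: each weight is
 * widened on the fly, so no float32 copy of the matrix is ever built. */
static void matvec_bf16(const uint16_t *W, const float *x, float *y,
                        size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; r++) {
        const uint16_t *row = W + r * cols;
        float acc = 0.0f;
        for (size_t c = 0; c < cols; c++)
            acc += bf16_to_f32(row[c]) * x[c];
        y[r] = acc;
    }
}
```

A NEON or BLAS path would replace the inner loop, but the memory access pattern stays the same.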
Voxtral TTS is a three-stage pipeline:
Text --> LLM Backbone (3.4B, 26-layer Mistral)
--> Flow-Matching Acoustic Transformer (390M, 3 layers, 8 Euler steps)
--> Audio Codec Decoder (300M, 4-stage conv + ALiBi transformer)
--> 24kHz waveform
The LLM autoregressively generates hidden states conditioned on text and a voice prompt. The acoustic transformer converts each hidden state to 37 audio codes (1 semantic + 36 acoustic) via flow matching with classifier-free guidance. The codec decoder converts all collected codes into a raw waveform.
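The flow-matching step can be pictured as a fixed-step Euler integration of a velocity field from t=0 to t=1, where each step blends a conditioned and an unconditioned prediction. The sketch below is only an illustration under assumptions: `velocity_fn`, `euler_cfg`, and `toy_velocity` are hypothetical names, the blend `v_uncond + alpha * (v_cond - v_uncond)` is the common classifier-free guidance form and may differ in detail from what the engine does, and the toy field stands in for the real acoustic transformer.

```c
#include <stddef.h>

/* Hypothetical callback: writes the predicted velocity v(x, t) into out.
 * cond == NULL requests the unconditioned pass. */
typedef void (*velocity_fn)(const float *x, float t, const float *cond,
                            float *out, size_t dim);

/* Integrate dx/dt = v from t=0 to t=1 with n_steps fixed Euler steps.
 * Every step evaluates the field twice and blends the two results:
 *   v = v_uncond + alpha * (v_cond - v_uncond)
 * v_cond and v_uncond are caller-provided scratch buffers of size dim. */
static void euler_cfg(velocity_fn v, const float *cond, float alpha,
                      int n_steps, float *x,
                      float *v_cond, float *v_uncond, size_t dim) {
    float dt = 1.0f / (float)n_steps;
    for (int s = 0; s < n_steps; s++) {
        float t = (float)s * dt;
        v(x, t, cond, v_cond, dim);   /* conditioned pass */
        v(x, t, NULL, v_uncond, dim); /* unconditioned pass */
        for (size_t i = 0; i < dim; i++) {
            float g = v_uncond[i] + alpha * (v_cond[i] - v_uncond[i]);
            x[i] += dt * g;
        }
    }
}

/* Toy stand-in for the acoustic transformer, for demonstration only:
 * constant velocity equal to cond when conditioned, zero otherwise. */
static int toy_calls = 0;
static void toy_velocity(const float *x, float t, const float *cond,
                         float *out, size_t dim) {
    (void)x; (void)t;
    toy_calls++;
    for (size_t i = 0; i < dim; i++)
        out[i] = cond ? cond[i] : 0.0f;
}
```

The point of the sketch is the cost structure: guidance doubles the number of field evaluations, so n Euler steps need 2n forward passes per frame.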
# Linux with OpenBLAS (recommended)
make blas
# CUDA + OpenBLAS (requires NVIDIA GPU + CUDA toolkit)
make cuda
# Specify GPU architecture (default: sm_80 for Ampere)
make cuda CUDA_ARCH=sm_89 # Ada Lovelace (RTX 4090)
make cuda CUDA_ARCH=sm_90 # Hopper (H100)
make cuda CUDA_ARCH=sm_100 # Blackwell (B200)
# macOS with Accelerate
make apple
# Portable (no BLAS, slower)
make noblas

The CUDA build uploads all model weights to GPU VRAM and runs the full 26-layer LLM forward pass on GPU using cuBLAS for GEMM and custom CUDA kernels for RMS norm, RoPE, attention, and SwiGLU activation.
# Requires hf CLI (pip install huggingface_hub[cli]) or wget
./download_model.sh voxtral-tts-model

This downloads the model weights (~8GB), tokenizer, and voice embeddings from HuggingFace.
./voxtral_tts -d voxtral-tts-model -v neutral_female -o output.wav "Hello world"

Usage: ./voxtral_tts [options] "text to speak"
-d <dir> Model directory (required)
-v <voice> Voice name (default: neutral_female)
-o <file> Output WAV file (default: output.wav)
-s <seed> Random seed for reproducibility
--verbose Enable verbose output
--inspect Print model tensor info and exit
| Language | Voices |
|---|---|
| English | casual_female, casual_male, cheerful_female, neutral_female, neutral_male |
| French | fr_female, fr_male |
| German | de_female, de_male |
| Spanish | es_female, es_male |
| Italian | it_female, it_male |
| Portuguese | pt_female, pt_male |
| Dutch | nl_female, nl_male |
| Arabic | ar_male |
| Hindi | hi_female, hi_male |
DGX Spark (NVIDIA GB10 Blackwell, 128GB unified memory, ARM Grace CPU). LLM decode runs on the GPU via cuBLAS + custom CUDA kernels; prefill and the codec run on the CPU.
| Input | Tokens | Frames | Audio | Wall time | RTF |
|---|---|---|---|---|---|
| "Hello world" (2 words) | 2 | 21 | 1.68s | 48s | 28x |
| "The quick brown fox..." (9 words) | 9 | 47 | 3.76s | 59s | 16x |
| Two sentences (17 words) | 21 | 97 | 7.76s | 78s | 10x |
| Paragraph (40 words) | 33 | 212 | 16.96s | 124s | 7.3x |
- ~0.4s per audio frame for decode (12x faster than CPU)
- RTF ~7-10x for longer texts (fixed ~40s overhead for model load + prefill)
- Further speedups possible: GPU prefill, GPU codec, CUDA graphs
AMD Ryzen 9 9950X3D (16-core), 84GB RAM, OpenBLAS 0.3.26. Pure CPU inference.
| Input | Tokens | Frames | Audio | Wall time | RTF |
|---|---|---|---|---|---|
| "Hello world" (2 words) | 2 | 21 | 1.68s | 121s | 72x |
| "The quick brown fox..." (9 words) | 9 | 47 | 3.76s | 215s | 57x |
| Two sentences (17 words) | 21 | 97 | 7.76s | 447s | 58x |
| Paragraph (40 words) | 33 | 215 | 17.20s | 1023s | 59x |
- ~4.8s per audio frame (each frame = 80ms of audio at 12.5 Hz)
- RTF ~58x for typical inputs
- Peak RSS: ~7.8 GB (8GB model weights mmap'd)
- Binary size: 86 KB
- Each audio frame requires a full 26-layer LLM forward pass (3.4B parameters) plus 14 acoustic transformer forward passes (7 Euler steps x 2 for classifier-free guidance)
Run ./bench.sh to reproduce these benchmarks on your machine.
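The Audio and RTF columns above follow from the frame rate: at 12.5 Hz each frame covers 80ms, which is 1920 samples at 24kHz, and RTF is wall time divided by audio duration. A small sketch of that arithmetic (the helper names are illustrative and not part of bench.sh):

```c
#define FRAME_RATE_HZ  12.5   /* audio frames per second */
#define SAMPLE_RATE_HZ 24000  /* output samples per second */

/* Audio duration in seconds covered by a number of generated frames. */
static double frames_to_seconds(int frames) {
    return (double)frames / FRAME_RATE_HZ;
}

/* Number of PCM samples covered by a number of generated frames. */
static long frames_to_samples(int frames) {
    return (long)frames * (long)(SAMPLE_RATE_HZ / FRAME_RATE_HZ);
}

/* Real-time factor: seconds of compute per second of audio.
 * Values above 1 mean slower than real time. */
static double real_time_factor(double wall_seconds, double audio_seconds) {
    return wall_seconds / audio_seconds;
}
```

For the "Hello world" row on CPU, 21 frames give 1.68s of audio, and 121s of wall time gives an RTF of about 72.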
voxtral_tts.h Main header (constants, structs, API)
voxtral_tts.c Model loading and inference orchestrator
voxtral_tts_llm.c 26-layer Mistral decoder with KV cache
voxtral_tts_acoustic.c Flow-matching acoustic transformer
voxtral_tts_codec.c Audio codec decoder (ALiBi + weight_norm)
voxtral_tts_kernels.{c,h} Math kernels (matmul, attention, conv, RoPE, ...)
voxtral_tts_tokenizer.{c,h} Tekken BPE tokenizer (encode + decode)
voxtral_tts_voice.c Voice embedding loader (.pt) + audio codebook embeddings
voxtral_tts_wav.c WAV file writer
voxtral_tts_safetensors.{c,h} Safetensors mmap reader
main.c CLI entry point
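As a rough picture of what a WAV writer has to produce, the sketch below fills the canonical 44-byte RIFF/WAVE header and converts float samples to 16-bit PCM. It is written from the WAV format itself, not from voxtral_tts_wav.c: the function names are hypothetical, and 16-bit mono PCM is an assumption about the output format.

```c
#include <stdint.h>
#include <string.h>

static void put_u16le(uint8_t *p, uint16_t v) {
    p[0] = (uint8_t)(v & 0xFF);
    p[1] = (uint8_t)(v >> 8);
}

static void put_u32le(uint8_t *p, uint32_t v) {
    for (int i = 0; i < 4; i++)
        p[i] = (uint8_t)((v >> (8 * i)) & 0xFF);
}

/* Fill a 44-byte header for 16-bit mono PCM (assumed output format). */
static void wav_header_pcm16_mono(uint8_t h[44], uint32_t n_samples,
                                  uint32_t sample_rate) {
    uint32_t data_bytes = n_samples * 2u;   /* 2 bytes per sample */
    memcpy(h, "RIFF", 4);
    put_u32le(h + 4, 36u + data_bytes);     /* file size minus 8 */
    memcpy(h + 8, "WAVE", 4);
    memcpy(h + 12, "fmt ", 4);
    put_u32le(h + 16, 16u);                 /* fmt chunk size */
    put_u16le(h + 20, 1u);                  /* format tag 1 = PCM */
    put_u16le(h + 22, 1u);                  /* channels */
    put_u32le(h + 24, sample_rate);
    put_u32le(h + 28, sample_rate * 2u);    /* bytes per second */
    put_u16le(h + 32, 2u);                  /* block align */
    put_u16le(h + 34, 16u);                 /* bits per sample */
    memcpy(h + 36, "data", 4);
    put_u32le(h + 40, data_bytes);
}

/* Clamp a float sample to [-1, 1] and scale it to a signed 16-bit value. */
static int16_t f32_to_pcm16(float s) {
    if (s > 1.0f) s = 1.0f;
    if (s < -1.0f) s = -1.0f;
    return (int16_t)(s * 32767.0f + (s >= 0.0f ? 0.5f : -0.5f));
}
```

Writing the file is then the header followed by the converted samples in little-endian order.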
- inspect_weights -- dump tensor names/shapes from safetensors (make inspect)
- convert_voice.py -- convert .pt voice embeddings to raw binary
- download_model.sh -- download model from HuggingFace
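For readers unfamiliar with the container: a safetensors file starts with an 8-byte little-endian header length, followed by a JSON header of that length (tensor names, dtypes, shapes, byte offsets), followed by the raw tensor bytes. The sketch below only splits a mapped buffer into those two regions. `st_view` and `st_split` are names invented for this example, and JSON parsing is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Views into a mapped safetensors buffer (hypothetical helper type). */
typedef struct {
    const char *json;     /* JSON header, not NUL-terminated */
    size_t json_len;
    const uint8_t *data;  /* start of the raw tensor bytes */
    size_t data_len;
} st_view;

/* Split a buffer into JSON header and tensor data.
 * Returns 0 on success, -1 if the buffer is truncated or inconsistent. */
static int st_split(const uint8_t *buf, size_t len, st_view *out) {
    if (len < 8) return -1;
    uint64_t n = 0;
    for (int i = 7; i >= 0; i--)      /* little-endian u64 */
        n = (n << 8) | buf[i];
    if (n > (uint64_t)(len - 8)) return -1;
    out->json = (const char *)(buf + 8);
    out->json_len = (size_t)n;
    out->data = buf + 8 + (size_t)n;
    out->data_len = len - 8 - (size_t)n;
    return 0;
}
```

The byte offsets stored in the JSON header are relative to `data`, which is what lets a reader hand out pointers into the mapping without copying anything.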
The prompt format follows mistral_common's encode_speech_request:
[BOS] [BEGIN_AUDIO] [voice_embedding x N] [/INST] text_tokens [INST] [BEGIN_AUDIO]
Voice embeddings are pre-computed BF16 tensors of shape [N, 3072] that replace audio token placeholder positions. After prefill, the model enters an autoregressive loop:
- LLM produces a hidden state
- Acoustic transformer predicts a semantic code (greedy argmax) and 36 acoustic codes (flow matching with 8 Euler ODE steps and CFG alpha=1.2)
- The 37 codes are embedded back into LLM input space via multi-vocabulary embeddings (sum across codebooks)
- Repeat until [END_AUDIO] is generated
- All collected codes are decoded by the audio codec into a 24kHz waveform
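Two small pieces of the loop above can be sketched in isolation: the greedy pick of the semantic code, and the feedback step that sums one embedding row per codebook to form the next LLM input. The names `argmax_f32` and `embed_codes_sum` are hypothetical, and the layout (one row-major float32 table per codebook) is an assumption made to keep the example short; the real weights are BF16.

```c
#include <stddef.h>

/* Greedy pick: index of the largest logit (first one wins on ties). */
static size_t argmax_f32(const float *logits, size_t n) {
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (logits[i] > logits[best]) best = i;
    return best;
}

/* Sum the embedding rows selected by one frame's codes.
 * tables[k] is an assumed row-major [vocab_k x dim] table for codebook k,
 * codes[k] is the code chosen for that codebook, out has dim entries. */
static void embed_codes_sum(const float *const *tables, const int *codes,
                            size_t n_codebooks, size_t dim, float *out) {
    for (size_t d = 0; d < dim; d++)
        out[d] = 0.0f;
    for (size_t k = 0; k < n_codebooks; k++) {
        const float *row = tables[k] + (size_t)codes[k] * dim;
        for (size_t d = 0; d < dim; d++)
            out[d] += row[d];
    }
}
```

In the pipeline described here, `n_codebooks` would be 37 (1 semantic + 36 acoustic) and `dim` would match the LLM hidden size.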
- C11 compiler (gcc, clang)
- ~10GB RAM (8GB mmap'd weights + working memory)
- Optional: OpenBLAS or Apple Accelerate for faster matrix operations
MIT License. See LICENSE.
Note: The Voxtral-4B-TTS model weights are released by Mistral AI under CC BY-NC 4.0. This inference engine is MIT-licensed but the model weights have their own license terms.
Ettore Di Giacinto (@mudler)
This project builds on the work of several open-source projects:
- voxtral.c by Salvatore Sanfilippo (antirez) -- Pure C inference engine for Voxtral Realtime (ASR). The safetensors reader, math kernels, Mistral decoder implementation, and overall architecture of this project are directly adapted from voxtral.c. The project demonstrated that a full transformer inference engine can be written in clean, dependency-free C.
- vLLM and vLLM-Omni -- The reference Python implementation for Voxtral TTS inference. The flow-matching acoustic transformer, audio codec decoder, and the overall TTS pipeline were implemented based on the vLLM-Omni model code. The prompt format, voice embedding handling, and audio code generation logic follow vLLM-Omni's implementation.
- Mistral AI -- For developing and open-sourcing the Voxtral TTS model and the mistral_common tokenizer library.