Supertonic is a lightning-fast, on-device text-to-speech system designed for extreme performance with minimal computational overhead. Powered by ONNX Runtime, it runs entirely on your device—no cloud, no API calls, no privacy concerns.
- 2025.12.10 - Added `supertonic` PyPI package! Install via `pip install supertonic`. For details, visit supertonic-py documentation
- 2025.12.10 - Added 6 new voice styles (M3, M4, M5, F3, F4, F5). See Voices for details
- 2025.12.08 - Optimized ONNX models via OnnxSlim now available on Hugging Face Models
- 2025.11.24 - Added Flutter SDK support with macOS compatibility
- Demo
- Why Supertonic?
- Language Support
- Getting Started
- Performance
- Built with Supertonic
- Citation
- License
Watch Supertonic running on a Raspberry Pi, demonstrating on-device, real-time text-to-speech synthesis:
supertonic_raspberry-pi_480.mov
Experience Supertonic on an Onyx Boox Go 6 e-reader in airplane mode, achieving an average RTF of 0.3× with zero network dependency:
supertonic_ebook.mp4
🎧 Try it now: Experience Supertonic in your browser with our Interactive Demo, or get started with pre-trained models from Hugging Face Hub
- ⚡ Blazingly Fast: Generates speech up to 167× faster than real-time on consumer hardware (M4 Pro), faster than every other system in the Performance benchmarks
- 🪶 Ultra Lightweight: Only 66M parameters, optimized for efficient on-device performance with minimal footprint
- 📱 On-Device Capable: Complete privacy and no network latency, since all processing happens locally on your device
- 🎨 Natural Text Handling: Seamlessly processes numbers, dates, currency, abbreviations, and complex expressions without pre-processing
- ⚙️ Highly Configurable: Adjust inference steps, batch processing, and other parameters to match your specific needs
- 🧩 Flexible Deployment: Deploy seamlessly across servers, browsers, and edge devices with multiple runtime backends.
We provide ready-to-use TTS inference examples across multiple ecosystems:
| Language/Platform | Path | Description |
|---|---|---|
| Python | py/ | ONNX Runtime inference |
| Node.js | nodejs/ | Server-side JavaScript |
| Browser | web/ | WebGPU/WASM inference |
| Java | java/ | Cross-platform JVM |
| C++ | cpp/ | High-performance C++ |
| C# | csharp/ | .NET ecosystem |
| Go | go/ | Go implementation |
| Swift | swift/ | macOS applications |
| iOS | ios/ | Native iOS apps |
| Rust | rust/ | Memory-safe systems |
| Flutter | flutter/ | Cross-platform apps |
For detailed usage instructions, please refer to the README.md in each language directory.
First, clone the repository:
git clone https://github.com/supertone-inc/supertonic.git
cd supertonic
Before running the examples, download the ONNX models and preset voices, and place them in the assets directory:
Note: The Hugging Face repository uses Git LFS. Please ensure Git LFS is installed and initialized before cloning or pulling large model files.
- macOS: `brew install git-lfs && git lfs install`
- Generic: see https://git-lfs.com for installers
git clone https://huggingface.co/Supertone/supertonic assets
Python Example (Details)
cd py
uv sync
uv run example_onnx.py
Node.js Example (Details)
cd nodejs
npm install
npm start
Browser Example (Details)
cd web
npm install
npm run dev
Java Example (Details)
cd java
mvn clean install
mvn exec:java
C++ Example (Details)
cd cpp
mkdir build && cd build
cmake .. && cmake --build . --config Release
./example_onnx
C# Example (Details)
cd csharp
dotnet restore
dotnet run
Go Example (Details)
cd go
go mod download
go run example_onnx.go helper.go
Swift Example (Details)
cd swift
swift build -c release
.build/release/example_onnx
Rust Example (Details)
cd rust
cargo build --release
./target/release/example_onnx
iOS Example (Details)
cd ios/ExampleiOSApp
xcodegen generate
open ExampleiOSApp.xcodeproj
- In Xcode: Targets → ExampleiOSApp → Signing: select your Team
- Choose your iPhone as run destination → Build & Run
- Runtime: ONNX Runtime for cross-platform inference (CPU-optimized; GPU mode is not tested)
- Browser Support: onnxruntime-web for client-side inference
- Batch Processing: Supports batch inference for improved throughput
- Audio Output: Outputs 16-bit WAV files
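To illustrate what "16-bit WAV" output means in practice, here is a minimal sketch that writes floating-point samples as 16-bit PCM using only the Python standard library. It is not taken from the Supertonic examples: the helper name `write_wav_16bit`, the sine-wave stand-in for model output, and the 44100 Hz sample rate are all assumptions made for this sketch.

```python
import math
import struct
import wave


def write_wav_16bit(path: str, samples: list[float], sample_rate: int) -> None:
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, s))  # clip so the int16 range is never exceeded
        frames += struct.pack("<h", int(clipped * 32767))  # little-endian int16
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


# Stand-in for model output: one second of a 440 Hz tone.
# The sample rate is an assumption for this sketch, not a documented value.
SAMPLE_RATE = 44100
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
write_wav_16bit("tone.wav", tone, SAMPLE_RATE)
```

Clipping before the integer conversion matters because a sample slightly outside [-1.0, 1.0] would otherwise fall outside the int16 range and make `struct.pack` raise an error.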
We evaluated Supertonic's performance (with 2 inference steps) using two key metrics across input texts of varying lengths: Short (59 chars), Mid (152 chars), and Long (266 chars).
Metrics:
- Characters per Second: Measures throughput by dividing the number of input characters by the time required to generate audio. Higher is better.
- Real-time Factor (RTF): Measures the time taken to synthesize audio relative to its duration. Lower is better (e.g., RTF of 0.1 means it takes 0.1 seconds to generate one second of audio).
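The two metrics can be written out directly from their definitions. The sketch below is illustrative only: the function names and the sample timing values are hypothetical and are not benchmark data from this project.

```python
def chars_per_second(num_chars: int, synthesis_seconds: float) -> float:
    """Throughput: input characters divided by the time spent generating audio."""
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    return num_chars / synthesis_seconds


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF: synthesis time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """Seconds of audio produced per second of compute, i.e. 1 / RTF."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Hypothetical measurement: a 266-character input synthesized in 0.2 s,
# producing 16 s of audio. These numbers are made up for illustration.
cps = chars_per_second(266, 0.2)
rtf = real_time_factor(0.2, 16.0)
print(f"{cps:.0f} chars/s, RTF {rtf:.4f}, {speedup_over_real_time(rtf):.0f}x real time")
```

Because speedup is the reciprocal of RTF, an RTF of 0.006 corresponds to roughly 167× faster than real time.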
Characters per Second
| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|---|---|---|---|
| Supertonic (M4 Pro - CPU) | 912 | 1048 | 1263 |
| Supertonic (M4 Pro - WebGPU) | 996 | 1801 | 2509 |
| Supertonic (RTX4090) | 2615 | 6548 | 12164 |
| `API` ElevenLabs Flash v2.5 | 144 | 209 | 287 |
| `API` OpenAI TTS-1 | 37 | 55 | 82 |
| `API` Gemini 2.5 Flash TTS | 12 | 18 | 24 |
| `API` Supertone Sona speech 1 | 38 | 64 | 92 |
| `Open` Kokoro | 104 | 107 | 117 |
| `Open` NeuTTS Air | 37 | 42 | 47 |
Notes:
- `API` = Cloud-based API services (measured from Seoul)
- `Open` = Open-source models
- Supertonic (M4 Pro - CPU) and (M4 Pro - WebGPU): Tested with ONNX
- Supertonic (RTX4090): Tested with PyTorch model
- Kokoro: Tested on M4 Pro CPU with ONNX
- NeuTTS Air: Tested on M4 Pro CPU with Q8-GGUF
Real-time Factor
| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|---|---|---|---|
| Supertonic (M4 Pro - CPU) | 0.015 | 0.013 | 0.012 |
| Supertonic (M4 Pro - WebGPU) | 0.014 | 0.007 | 0.006 |
| Supertonic (RTX4090) | 0.005 | 0.002 | 0.001 |
| `API` ElevenLabs Flash v2.5 | 0.133 | 0.077 | 0.057 |
| `API` OpenAI TTS-1 | 0.471 | 0.302 | 0.201 |
| `API` Gemini 2.5 Flash TTS | 1.060 | 0.673 | 0.541 |
| `API` Supertone Sona speech 1 | 0.372 | 0.206 | 0.163 |
| `Open` Kokoro | 0.144 | 0.124 | 0.126 |
| `Open` NeuTTS Air | 0.390 | 0.338 | 0.343 |
Additional Performance Data (5-step inference)
Characters per Second (5-step)
| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|---|---|---|---|
| Supertonic (M4 Pro - CPU) | 596 | 691 | 850 |
| Supertonic (M4 Pro - WebGPU) | 570 | 1118 | 1546 |
| Supertonic (RTX4090) | 1286 | 3757 | 6242 |
Real-time Factor (5-step)
| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|---|---|---|---|
| Supertonic (M4 Pro - CPU) | 0.023 | 0.019 | 0.018 |
| Supertonic (M4 Pro - WebGPU) | 0.024 | 0.012 | 0.010 |
| Supertonic (RTX4090) | 0.011 | 0.004 | 0.002 |
Supertonic is designed to handle complex, real-world text inputs that contain numbers, currency symbols, abbreviations, dates, and proper nouns.
🎧 Listen to the audio samples more easily: Check out our Interactive Demo, which presents all audio examples in one place
Overview of Test Cases:
| Category | Key Challenges | Supertonic | ElevenLabs | OpenAI | Gemini | Microsoft |
|---|---|---|---|---|---|---|
| Financial Expression | Decimal currency, abbreviated magnitudes (M, K), currency symbols, currency codes | ✅ | ❌ | ❌ | ❌ | ❌ |
| Time and Date | Time notation, abbreviated weekdays/months, date formats | ✅ | ❌ | ❌ | ❌ | ❌ |
| Phone Number | Area codes, hyphens, extensions (ext.) | ✅ | ❌ | ❌ | ❌ | ❌ |
| Technical Unit | Decimal numbers with units, abbreviated technical notations | ✅ | ❌ | ❌ | ❌ | ❌ |
Example 1: Financial Expression
Text:
"The startup secured $5.2M in venture capital, a huge leap from their initial $450K seed round."
Challenges:
- Decimal point in currency ($5.2M should be read as "five point two million")
- Abbreviated magnitude units (M for million, K for thousand)
- Currency symbol ($) that needs to be properly pronounced as "dollars"
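For contrast, a pipeline that does rely on pre-processing would typically rewrite such expressions before synthesis. The sketch below shows one such rule-based expansion. It is not part of Supertonic, which is described here as handling these inputs without pre-processing, and the names `MAGNITUDES`, `CURRENCY_PATTERN`, and `expand_dollar_amounts` are hypothetical.

```python
import re

# Suffix letters commonly used for magnitudes in financial text.
MAGNITUDES = {"K": "thousand", "M": "million", "B": "billion"}

# A dollar sign, an integer or decimal number, and an optional magnitude suffix.
CURRENCY_PATTERN = re.compile(r"\$(\d+(?:\.\d+)?)([KMB])?\b")


def expand_dollar_amounts(text: str) -> str:
    """Rewrite '$5.2M' as '5.2 million dollars' so downstream TTS sees plain words."""
    def replace(match: re.Match) -> str:
        number, suffix = match.group(1), match.group(2)
        magnitude = f" {MAGNITUDES[suffix]}" if suffix else ""
        return f"{number}{magnitude} dollars"
    return CURRENCY_PATTERN.sub(replace, text)


sentence = ("The startup secured $5.2M in venture capital, "
            "a huge leap from their initial $450K seed round.")
print(expand_dollar_amounts(sentence))
```

A rule like this only covers the patterns it was written for. It still leaves the digits themselves unexpanded and ignores other currencies, which is the kind of maintenance burden that built-in text handling avoids.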
Audio Samples:
| System | Result | Audio Sample |
|---|---|---|
| Supertonic | ✅ | 🎧 Play Audio |
| ElevenLabs Flash v2.5 | ❌ | 🎧 Play Audio |
| OpenAI TTS-1 | ❌ | 🎧 Play Audio |
| Gemini 2.5 Flash TTS | ❌ | 🎧 Play Audio |
| VibeVoice Realtime 0.5B | ❌ | 🎧 Play Audio |
Example 2: Time and Date
Text:
"The train delay was announced at 4:45 PM on Wed, Apr 3, 2024 due to track maintenance."
Challenges:
- Time expression with PM notation (4:45 PM)
- Abbreviated weekday (Wed)
- Abbreviated month (Apr)
- Full date format (Apr 3, 2024)
Audio Samples:
| System | Result | Audio Sample |
|---|---|---|
| Supertonic | ✅ | 🎧 Play Audio |
| ElevenLabs Flash v2.5 | ❌ | 🎧 Play Audio |
| OpenAI TTS-1 | ❌ | 🎧 Play Audio |
| Gemini 2.5 Flash TTS | ❌ | 🎧 Play Audio |
| VibeVoice Realtime 0.5B | ❌ | 🎧 Play Audio |
Example 3: Phone Number
Text:
"You can reach the hotel front desk at (212) 555-0142 ext. 402 anytime."
Challenges:
- Area code in parentheses that should be read as separate digits
- Phone number with hyphen separator (555-0142)
- Abbreviated extension notation (ext.)
- Extension number (402)
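As an illustration of the reading these challenges call for, the sketch below spells a US-style phone number digit by digit. It is a hypothetical pre-processing rule written for this example, not Supertonic's internal logic. The names are assumptions, and so is the choice to read "0" as "zero" rather than "oh".

```python
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

# "(212) 555-0142" with an optional "ext. 402" suffix.
PHONE_PATTERN = re.compile(r"\((\d{3})\)\s*(\d{3})-(\d{4})(?:\s*ext\.\s*(\d+))?")


def spell_digits(digits: str) -> str:
    """Read a digit string one digit at a time: '212' -> 'two one two'."""
    return " ".join(DIGIT_WORDS[int(d)] for d in digits)


def expand_phone_number(text: str) -> str:
    """Replace each phone number in the text with its digit-by-digit spoken form."""
    def replace(match: re.Match) -> str:
        area, prefix, line, ext = match.groups()
        spoken = ", ".join(spell_digits(part) for part in (area, prefix, line))
        if ext:
            spoken += f", extension {spell_digits(ext)}"
        return spoken
    return PHONE_PATTERN.sub(replace, text)


print(expand_phone_number(
    "You can reach the hotel front desk at (212) 555-0142 ext. 402 anytime."
))
```

The commas between digit groups stand in for the short pauses a speaker would make at the parentheses and the hyphen.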
Audio Samples:
| System | Result | Audio Sample |
|---|---|---|
| Supertonic | ✅ | 🎧 Play Audio |
| ElevenLabs Flash v2.5 | ❌ | 🎧 Play Audio |
| OpenAI TTS-1 | ❌ | 🎧 Play Audio |
| Gemini 2.5 Flash TTS | ❌ | 🎧 Play Audio |
| VibeVoice Realtime 0.5B | ❌ | 🎧 Play Audio |
Example 4: Technical Unit
Text:
"Our drone battery lasts 2.3h when flying at 30kph with full camera payload."
Challenges:
- Decimal time duration with abbreviation (2.3h = two point three hours)
- Speed unit with abbreviation (30kph = thirty kilometers per hour)
- Technical abbreviations (h for hours, kph for kilometers per hour)
- Technical/engineering context requiring proper pronunciation
Audio Samples:
| System | Result | Audio Sample |
|---|---|---|
| Supertonic | ✅ | 🎧 Play Audio |
| ElevenLabs Flash v2.5 | ❌ | 🎧 Play Audio |
| OpenAI TTS-1 | ❌ | 🎧 Play Audio |
| Gemini 2.5 Flash TTS | ❌ | 🎧 Play Audio |
| VibeVoice Realtime 0.5B | ❌ | 🎧 Play Audio |
Note: These samples demonstrate how each system handles text normalization and pronunciation of complex expressions without requiring pre-processing or phonetic annotations.
| Project | Description | Links |
|---|---|---|
| Read Aloud | Open-source TTS browser extension | Chrome · Edge · GitHub |
| PageEcho | E-Book reader app for iOS | App Store |
| VoiceChat | On-device voice-to-voice LLM chatbot in the browser | Demo · GitHub |
| OmniAvatar | Talking avatar video generator from photo + speech | Demo |
| CopiloTTS | Kotlin Multiplatform TTS SDK via ONNX Runtime | GitHub |
| Voice Mixer | PyQt5 tool for mixing and modifying voice styles | GitHub |
| Supertonic MNN | Lightweight library based on MNN (fp32/fp16/int8) | GitHub · PyPI |
| Transformers.js | Hugging Face's JS library with Supertonic support | GitHub PR · Demo |
| Pinokio | 1-click localhost cloud for Mac, Windows, and Linux | Pinokio · GitHub |
The following papers describe the core technologies used in Supertonic. If you use this system in your research or find these techniques useful, please consider citing the relevant papers:
This paper introduces the overall architecture of SupertonicTTS, including the speech autoencoder, flow-matching based text-to-latent module, and efficient design choices.
@article{kim2025supertonic,
title={SupertonicTTS: Towards Highly Efficient and Streamlined Text-to-Speech System},
author={Kim, Hyeongju and Yang, Jinhyeok and Yu, Yechan and Ji, Seunghun and Morton, Jacob and Bous, Frederik and Byun, Joon and Lee, Juheon},
journal={arXiv preprint arXiv:2503.23108},
year={2025},
url={https://arxiv.org/abs/2503.23108}
}
This paper presents Length-Aware Rotary Position Embedding (LARoPE), which improves text-speech alignment in cross-attention mechanisms.
@article{kim2025larope,
title={Length-Aware Rotary Position Embedding for Text-Speech Alignment},
author={Kim, Hyeongju and Lee, Juheon and Yang, Jinhyeok and Morton, Jacob},
journal={arXiv preprint arXiv:2509.11084},
year={2025},
url={https://arxiv.org/abs/2509.11084}
}
This paper describes the self-purification technique for training flow matching models robustly with noisy or unreliable labels.
@article{kim2025spfm,
title={Training Flow Matching Models with Reliable Labels via Self-Purification},
author={Kim, Hyeongju and Yu, Yechan and Yi, June Young and Lee, Juheon},
journal={arXiv preprint arXiv:2509.19091},
year={2025},
url={https://arxiv.org/abs/2509.19091}
}
This project's sample code is released under the MIT License; see the LICENSE file for details.
The accompanying model is released under the OpenRAIL-M License; see the LICENSE file for details.
This model was trained using PyTorch, which is licensed under the BSD 3-Clause License but is not redistributed with this project; see the LICENSE file for details.
Copyright (c) 2025 Supertone Inc.