Supertonic — Lightning Fast, On-Device TTS

Quick Start

pip install supertonic

CLI

# Note: The first run downloads the model (~260MB) from Hugging Face
supertonic tts 'Supertonic is a lightning fast, on-device TTS system.' -o output.wav

Python

from supertonic import TTS

# Note: First run downloads model automatically (~260MB)
tts = TTS(auto_download=True)

# Get a voice style
style = tts.get_voice_style(voice_name="M1")

# Generate speech
text = "The train delay was announced at 4:45 PM on Wed, Apr 3, 2024 due to track maintenance."
wav, duration = tts.synthesize(text, voice_style=style)

# Save to file
tts.save_audio(wav, "output.wav")

Requirements

Supertonic has minimal dependencies, just four core libraries:

onnxruntime - Fast ONNX model inference
numpy - Numerical operations
soundfile - Audio file I/O
huggingface-hub - Model downloads
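
To confirm that the four core libraries are present in an environment, a small check along these lines may help. This is a minimal sketch using only the standard library; the helper name `check_core_dependencies` is hypothetical and not part of Supertonic. It assumes the usual import names, where the `huggingface-hub` package is imported as `huggingface_hub`.

```python
from importlib.util import find_spec

# Import names for the four core libraries listed above.
# Note: the PyPI package "huggingface-hub" is imported as "huggingface_hub".
CORE_MODULES = ["onnxruntime", "numpy", "soundfile", "huggingface_hub"]


def check_core_dependencies(modules=CORE_MODULES):
    """Return a mapping of module name -> whether it can be found for import.

    Hypothetical helper for illustration; it only looks the modules up
    and does not import them.
    """
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, available in check_core_dependencies().items():
        print(f"{name}: {'ok' if available else 'missing'}")
```

Any module reported as missing should be installed automatically by `pip install supertonic`, assuming the package declares these libraries as dependencies.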

Key Features

⚡ Blazingly Fast: Generates speech up to 167× faster than real-time on consumer hardware (M4 Pro)

🪶 Ultra Lightweight: Only 66M parameters, optimized for efficient on-device performance

📱 On-Device Capable: Runs entirely on-device, so text never leaves the machine and there is no network latency

🎨 Natural Text Handling: Seamlessly processes complex expressions without a separate G2P (grapheme-to-phoneme) module

⚙️ Highly Configurable: Adjust inference steps, batch processing, and other parameters

🧩 Flexible Deployment: Deploy across servers, browsers, and edge devices

Performance

We evaluated Supertonic's performance (with 2 inference steps) using two key metrics across input texts of varying lengths: Short (59 chars), Mid (152 chars), and Long (266 chars).

Metrics:

Characters per Second: Measures throughput by dividing the number of input characters by the time required to generate audio. Higher is better.
Real-time Factor (RTF): Measures the time taken to synthesize audio relative to its duration. Lower is better (e.g., RTF of 0.1 means it takes 0.1 seconds to generate one second of audio).
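
The two metrics defined above can be written as a short sketch. The function names and the sample measurement are hypothetical, chosen only to illustrate the arithmetic. They are not taken from the benchmark code.

```python
def chars_per_second(num_chars: int, synthesis_seconds: float) -> float:
    """Throughput: input characters divided by the time spent synthesizing."""
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    return num_chars / synthesis_seconds


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF: time spent synthesizing divided by the duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Hypothetical measurement: a 152-character input that takes 0.125 s
# to synthesize and yields 10 s of audio.
print(chars_per_second(152, 0.125))   # 1216.0 characters per second
print(real_time_factor(0.125, 10.0))  # 0.0125

# The example from the definition: 0.1 s of compute per 1 s of audio.
print(real_time_factor(0.1, 1.0))     # 0.1
```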

Characters per Second

System	Short (59 chars)	Mid (152 chars)	Long (266 chars)
Supertonic (M4 pro - CPU)	912	1048	1263
Supertonic (M4 pro - WebGPU)	996	1801	2509
Supertonic (RTX4090)	2615	6548	12164
`API` ElevenLabs Flash v2.5	144	209	287
`API` OpenAI TTS-1	37	55	82
`API` Gemini 2.5 Flash TTS	12	18	24
`API` Supertone Sona speech 1	38	64	92
`Open` Kokoro	104	107	117
`Open` NeuTTS Air	37	42	47

Notes:

API = Cloud-based API services (measured from Seoul)
Open = Open-source models
Supertonic (M4 pro - CPU) and (M4 pro - WebGPU): Tested with ONNX
Supertonic (RTX4090): Tested with PyTorch model
Kokoro: Tested on M4 Pro CPU with ONNX
NeuTTS Air: Tested on M4 Pro CPU with Q8-GGUF

Real-time Factor

System	Short (59 chars)	Mid (152 chars)	Long (266 chars)
Supertonic (M4 pro - CPU)	0.015	0.013	0.012
Supertonic (M4 pro - WebGPU)	0.014	0.007	0.006
Supertonic (RTX4090)	0.005	0.002	0.001
`API` ElevenLabs Flash v2.5	0.133	0.077	0.057
`API` OpenAI TTS-1	0.471	0.302	0.201
`API` Gemini 2.5 Flash TTS	1.060	0.673	0.541
`API` Supertone Sona speech 1	0.372	0.206	0.163
`Open` Kokoro	0.144	0.124	0.126
`Open` NeuTTS Air	0.390	0.338	0.343
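
An RTF value converts to a "times faster than real-time" figure by taking its reciprocal. The sketch below applies this to the Supertonic rows of the table above. The helper name `speedup_over_real_time` is hypothetical, and because the table values are rounded to three decimals, the resulting speedups are approximate. The M4 Pro WebGPU entry for long text (RTF 0.006) works out to roughly 167×, which matches the figure quoted under Key Features.

```python
# Real-time factors (2-step inference) copied from the table above,
# ordered as (Short, Mid, Long).
RTF_2_STEP = {
    "Supertonic (M4 pro - CPU)": (0.015, 0.013, 0.012),
    "Supertonic (M4 pro - WebGPU)": (0.014, 0.007, 0.006),
    "Supertonic (RTX4090)": (0.005, 0.002, 0.001),
}


def speedup_over_real_time(rtf: float) -> float:
    """Convert an RTF into 'seconds of audio generated per second of compute'."""
    return 1.0 / rtf


for system, rtfs in RTF_2_STEP.items():
    # Round to whole numbers because the inputs are already rounded.
    speedups = [round(speedup_over_real_time(r)) for r in rtfs]
    print(f"{system}: {speedups}")
```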

Additional Performance Data (5-step inference)

Characters per Second (5-step)

System	Short (59 chars)	Mid (152 chars)	Long (266 chars)
Supertonic (M4 pro - CPU)	596	691	850
Supertonic (M4 pro - WebGPU)	570	1118	1546
Supertonic (RTX4090)	1286	3757	6242

Real-time Factor (5-step)

System	Short (59 chars)	Mid (152 chars)	Long (266 chars)
Supertonic (M4 pro - CPU)	0.023	0.019	0.018
Supertonic (M4 pro - WebGPU)	0.024	0.012	0.010
Supertonic (RTX4090)	0.011	0.004	0.002
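
To see what the extra inference steps cost, the 2-step and 5-step throughput tables can be compared directly. The sketch below is only arithmetic on the published numbers, and the helper name `throughput_ratio` is hypothetical. On these figures, 2-step inference delivers between about 1.5× and 2× the throughput of 5-step inference, depending on the system and text length.

```python
# Characters per second copied from the tables above,
# ordered as (Short, Mid, Long).
CPS_2_STEP = {
    "Supertonic (M4 pro - CPU)": (912, 1048, 1263),
    "Supertonic (M4 pro - WebGPU)": (996, 1801, 2509),
    "Supertonic (RTX4090)": (2615, 6548, 12164),
}
CPS_5_STEP = {
    "Supertonic (M4 pro - CPU)": (596, 691, 850),
    "Supertonic (M4 pro - WebGPU)": (570, 1118, 1546),
    "Supertonic (RTX4090)": (1286, 3757, 6242),
}


def throughput_ratio(cps_2_step: float, cps_5_step: float) -> float:
    """How many times higher the 2-step throughput is than the 5-step one."""
    return cps_2_step / cps_5_step


for system, two_step in CPS_2_STEP.items():
    five_step = CPS_5_STEP[system]
    ratios = [round(throughput_ratio(a, b), 2) for a, b in zip(two_step, five_step)]
    print(f"{system}: {ratios}")
```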

Natural Text Handling

Supertonic is designed to handle complex, real-world text inputs that contain numbers, currency symbols, abbreviations, dates, and proper nouns.

🎧 Audio samples: Check out our Interactive Demo for an easier way to browse all audio examples

Overview of Test Cases:

Category	Key Challenges	Supertonic	ElevenLabs	OpenAI	Gemini	Microsoft
Financial Expression	Decimal currency, abbreviated magnitudes (M, K), currency symbols, currency codes	✅	❌	❌	❌	❌
Time and Date	Time notation, abbreviated weekdays/months, date formats	✅	❌	❌	❌	❌
Phone Number	Area codes, hyphens, extensions (ext.)	✅	❌	❌	❌	❌
Technical Unit	Decimal numbers with units, abbreviated technical notations	✅	❌	❌	❌	❌

Example 1: Financial Expression

Text:

"The startup secured $5.2M in venture capital, a huge leap from their initial $450K seed round."

Challenges:

Decimal point in currency ($5.2M should be read as "five point two million")
Abbreviated magnitude units (M for million, K for thousand)
Currency symbol ($) that needs to be properly pronounced as "dollars"

Audio Samples:

System	Result	Audio Sample
Supertonic	✅	🎧 Play Audio
ElevenLabs Flash v2.5	❌	🎧 Play Audio
OpenAI TTS-1	❌	🎧 Play Audio
Gemini 2.5 Flash TTS	❌	🎧 Play Audio
VibeVoice Realtime 0.5B	❌	🎧 Play Audio

Example 2: Time and Date

Text:

"The train delay was announced at 4:45 PM on Wed, Apr 3, 2024 due to track maintenance."

Challenges:

Time expression with PM notation (4:45 PM)
Abbreviated weekday (Wed)
Abbreviated month (Apr)
Full date format (Apr 3, 2024)

Audio Samples:

System	Result	Audio Sample
Supertonic	✅	🎧 Play Audio
ElevenLabs Flash v2.5	❌	🎧 Play Audio
OpenAI TTS-1	❌	🎧 Play Audio
Gemini 2.5 Flash TTS	❌	🎧 Play Audio
VibeVoice Realtime 0.5B	❌	🎧 Play Audio

Example 3: Phone Number

Text:

"You can reach the hotel front desk at (212) 555-0142 ext. 402 anytime."

Challenges:

Area code in parentheses that should be read as separate digits
Phone number with hyphen separator (555-0142)
Abbreviated extension notation (ext.)
Extension number (402)

Audio Samples:

System	Result	Audio Sample
Supertonic	✅	🎧 Play Audio
ElevenLabs Flash v2.5	❌	🎧 Play Audio
OpenAI TTS-1	❌	🎧 Play Audio
Gemini 2.5 Flash TTS	❌	🎧 Play Audio
VibeVoice Realtime 0.5B	❌	🎧 Play Audio

Example 4: Technical Unit

Text:

"Our drone battery lasts 2.3h when flying at 30kph with full camera payload."

Challenges:

Decimal time duration with abbreviation (2.3h = two point three hours)
Speed unit with abbreviation (30kph = thirty kilometers per hour)
Technical abbreviations (h for hours, kph for kilometers per hour)
Technical/engineering context requiring proper pronunciation

Audio Samples:

System	Result	Audio Sample
Supertonic	✅	🎧 Play Audio
ElevenLabs Flash v2.5	❌	🎧 Play Audio
OpenAI TTS-1	❌	🎧 Play Audio
Gemini 2.5 Flash TTS	❌	🎧 Play Audio
VibeVoice Realtime 0.5B	❌	🎧 Play Audio

Note: These samples demonstrate how each system handles text normalization and pronunciation of complex expressions without requiring pre-processing or phonetic annotations.
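
To reproduce the Supertonic samples locally, the four test sentences can be passed to the model unchanged, with no normalization step beforehand. The following is a minimal sketch that assumes only the methods shown in the Quick Start (`get_voice_style`, `synthesize`, `save_audio`). The names `TEST_CASES` and `synthesize_cases`, the output directory, and the file naming are hypothetical.

```python
from pathlib import Path

# The four raw test sentences from the examples above, passed as-is.
TEST_CASES = {
    "financial": "The startup secured $5.2M in venture capital, a huge leap from their initial $450K seed round.",
    "time_and_date": "The train delay was announced at 4:45 PM on Wed, Apr 3, 2024 due to track maintenance.",
    "phone_number": "You can reach the hotel front desk at (212) 555-0142 ext. 402 anytime.",
    "technical_unit": "Our drone battery lasts 2.3h when flying at 30kph with full camera payload.",
}


def synthesize_cases(tts, cases, out_dir="samples", voice_name="M1"):
    """Synthesize each sentence and return {case name: (wav path, duration)}.

    `tts` is assumed to expose the Quick Start methods: get_voice_style,
    synthesize (returning a (wav, duration) pair), and save_audio.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    style = tts.get_voice_style(voice_name=voice_name)

    results = {}
    for name, text in cases.items():
        # No text pre-processing: the raw sentence goes straight to the model.
        wav, duration = tts.synthesize(text, voice_style=style)
        path = out / f"{name}.wav"
        tts.save_audio(wav, str(path))
        results[name] = (path, duration)
    return results
```

With the Quick Start setup, this would presumably be called as `synthesize_cases(TTS(auto_download=True), TEST_CASES)` after `from supertonic import TTS`, writing one WAV file per test case.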

Citation

The following papers describe the core technologies used in Supertonic. If you use this system in your research or find these techniques useful, please consider citing the relevant papers:

SupertonicTTS: Main Architecture

This paper introduces the overall architecture of SupertonicTTS, including the speech autoencoder, flow-matching based text-to-latent module, and efficient design choices.

@article{kim2025supertonic,
  title={SupertonicTTS: Towards Highly Efficient and Streamlined Text-to-Speech System},
  author={Kim, Hyeongju and Yang, Jinhyeok and Yu, Yechan and Ji, Seunghun and Morton, Jacob and Bous, Frederik and Byun, Joon and Lee, Juheon},
  journal={arXiv preprint arXiv:2503.23108},
  year={2025},
  url={https://arxiv.org/abs/2503.23108}
}

Length-Aware RoPE: Text-Speech Alignment

This paper presents Length-Aware Rotary Position Embedding (LARoPE), which improves text-speech alignment in cross-attention mechanisms.

@article{kim2025larope,
  title={Length-Aware Rotary Position Embedding for Text-Speech Alignment},
  author={Kim, Hyeongju and Lee, Juheon and Yang, Jinhyeok and Morton, Jacob},
  journal={arXiv preprint arXiv:2509.11084},
  year={2025},
  url={https://arxiv.org/abs/2509.11084}
}

Self-Purifying Flow Matching: Training with Noisy Labels

This paper describes the self-purification technique for training flow matching models robustly with noisy or unreliable labels.

@article{kim2025spfm,
  title={Training Flow Matching Models with Reliable Labels via Self-Purification},
  author={Kim, Hyeongju and Yu, Yechan and Yi, June Young and Lee, Juheon},
  journal={arXiv preprint arXiv:2509.19091},
  year={2025},
  url={https://arxiv.org/abs/2509.19091}
}

Related Projects

🏠 Main Repository: github.com/supertone-inc/supertonic

🎧 Try it live: Hugging Face Spaces

🤗 Model Repository: Hugging Face Models

License

Code: MIT License

Model: OpenRAIL-M License
