Supertonic 3 · open weights · out now

Supertonic 3: Lightning-Fast, On-Device, Multilingual TTS.

A 99M-parameter open-weight text-to-speech model running locally on CPU via ONNX Runtime. No GPU. No cloud. No API.

31 Languages
99M Params
CPU Only
ONNX Runtime
OpenRAIL-M

Try it on Hugging Face

Open weights. Runs in your browser or fully offline on your machine.

Try voice cloning

Bring your own voice and hear it speak 31 languages.

Try preset voices

Hundreds of preset voices and emotion presets, right in your browser.

31 languages

Speaks 31 languages

One 99M-parameter model. No per-language fine-tuning. No GPU.

Highlighted languages have audio samples below.

Listening samples

Hear it next to the giants.

Same input text, same reference voice prompt, three systems. Supertonic 3 is ours — 99M params on CPU. OmniVoice and Chatterbox Multilingual are 5–8× larger and run on a GPU.

Supertonic 3 (ours, 99M · CPU) | Chatterbox Multilingual (500M · GPU) | OmniVoice (800M · GPU) | Reference prompt voice

Supertonic 3 in production

Use Supertonic 3 — your way.

Pick a surface — bring your own voice, browse preset voices, or build with the API.

Voice Builder

Want to hear your own voice?

Zero-shot voice cloning — record or upload a short reference and synthesize across 31 languages.

Live: Now running Supertonic 3, with full 31-language support.

Open Voice Builder

Supertone Play

263 voices across scenes and emotions.

Pick a voice, pick an emotion, and try Supertonic 3 right in your browser — no install needed.

Pro: Play Desktop subscription unlocks commercially licensed voices, zero-shot cloning, and unlimited usage.

Open Supertone Play

Supertone API

Production voice AI, ready for your app.

Integrate character-driven, expressive voice generation across 31 languages with adjustable speech controls.

Docs: Start with the Supertone API documentation for setup, authentication, and voice generation guides.

Open API Docs

Speed benchmark

GPU-class speed without a GPU.

RTF (real-time factor) is synthesis time divided by audio duration, so lower is faster; ×RT is its inverse. Supertonic 3 runs at near parity with an 800M-parameter GPU baseline (RTF 0.200 vs 0.196) while running on a 16-thread CPU.

N = 30 · same machine, same text, same reference voices

Model	Hardware	Params	N	Synth	Audio	RTF ↓	×RT ↑
Supertonic 3	CPU (16 threads)	99M	30	57.99 s	289.92 s	0.200	5.00×
OmniVoice	RTX 3090	800M	30	53.90 s	275.17 s	0.196	5.11×
Chatterbox Multilingual	RTX 3090	500M	30	199.70 s	252.68 s	0.790	1.27×
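
The RTF and ×RT columns can be re-derived from the Synth and Audio columns. A minimal Python sketch, using only the figures in the table above; the function and variable names are illustrative and not part of any Supertonic tooling:

```python
# Re-derive RTF and xRT from the benchmark table's Synth and Audio columns.
# Names are illustrative; the numbers are copied from the table.

ROWS = {
    "Supertonic 3": (57.99, 289.92),             # (synthesis s, audio s)
    "OmniVoice": (53.90, 275.17),
    "Chatterbox Multilingual": (199.70, 252.68),
}


def rtf(synth_s: float, audio_s: float) -> float:
    """Real-time factor: synthesis seconds per second of audio (lower is faster)."""
    return synth_s / audio_s


def x_rt(synth_s: float, audio_s: float) -> float:
    """Throughput: seconds of audio per second of synthesis (higher is faster)."""
    return audio_s / synth_s


def summarize(rows):
    """Return {model: (RTF rounded to 3 places, xRT rounded to 2 places)}."""
    return {
        name: (round(rtf(s, a), 3), round(x_rt(s, a), 2))
        for name, (s, a) in rows.items()
    }


for name, (r, x) in summarize(ROWS).items():
    print(f"{name:<24} RTF={r:.3f}  xRT={x:.2f}")
```

Both values are computed from the summed totals over the 30 samples, not from an average of per-sample ratios.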

8× smaller than OmniVoice (99M vs 800M params)

5× smaller than Chatterbox Multilingual (99M vs 500M params)

Near RTF parity with the 800M GPU baseline (0.200 vs 0.196), but on CPU

Synthesis throughput (×RT, higher is better)

Seconds of speech produced per second of wall-clock time, across the same 30 inputs.

Methodology

N = 30 samples (same set published in ./samples/ on this page).
Mean audio duration ≈ 9.66 s per sample for Supertonic 3 output (289.92 s ÷ 30).
Single machine. Identical text, identical reference voice prompts across all three systems.
Supertonic 3 timed on CPU with 16 threads via ONNX Runtime. Baselines timed on a single RTX 3090.
CPU model: (to be filled in).
RTF = synthesis time ÷ audio duration. ×RT = 1 ÷ RTF.
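
To reproduce this kind of measurement on your own hardware, a timing loop along these lines may help. It is only a sketch: `measure_rtf` and `fake_synthesize` are hypothetical names, the stand-in synthesizer just returns silence, and whether the published numbers include model load or warm-up time is not stated on this page.

```python
import time
from typing import Callable, Sequence, Tuple

# A synthesizer here is any callable: text -> (samples, sample_rate).
Synth = Callable[[str], Tuple[Sequence[float], int]]


def measure_rtf(synthesize: Synth, texts: Sequence[str]) -> Tuple[float, float, float]:
    """Return (total synthesis seconds, total audio seconds, RTF) over all texts."""
    synth_s = 0.0
    audio_s = 0.0
    for text in texts:
        start = time.perf_counter()
        samples, sample_rate = synthesize(text)
        synth_s += time.perf_counter() - start      # wall-clock synthesis time
        audio_s += len(samples) / sample_rate       # duration of produced audio
    if audio_s == 0:
        raise ValueError("no audio produced")
    return synth_s, audio_s, synth_s / audio_s


def fake_synthesize(text: str) -> Tuple[Sequence[float], int]:
    """Hypothetical stand-in: 0.1 s of silence per character at 16 kHz."""
    sample_rate = 16_000
    return [0.0] * (len(text) * sample_rate // 10), sample_rate


synth_s, audio_s, rtf = measure_rtf(fake_synthesize, ["hello", "world!"])
x_rt = audio_s / synth_s if synth_s > 0 else float("inf")
print(f"synth={synth_s:.4f}s audio={audio_s:.2f}s RTF={rtf:.4f} xRT={x_rt:.1f}")
```

To time a real model, replace `fake_synthesize` with a wrapper around its synthesis call that returns the waveform and its sample rate.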

Install / Quickstart

Drop it into your stack.

Officially supported runtimes: Python, Node.js, browser, Swift (iOS), Kotlin (Android), and C++. Each tab links to working examples in the upstream repo.

# pip install supertonic
from supertonic import TTS

tts = TTS(auto_download=True)

# 1) Default: synthesize English with voice "M1"
style = tts.get_voice_style(voice_name="M1")
wav, duration = tts.synthesize(
    "A gentle breeze moved through the open window.",
    voice_style=style,
    lang="en",
)
tts.save_audio(wav, "output.wav")

# 2) Swap the voice → "M2"
style = tts.get_voice_style(voice_name="M2")

# 3) Swap the language → Japanese
wav, _ = tts.synthesize("こんにちは、世界。", voice_style=style, lang="ja")

Full reference and example scripts: supertonic-py docs.
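
For more than one sentence or language, the quickstart calls can be wrapped in a small loop. The sketch below assumes only the three methods shown in the quickstart above (`get_voice_style`, `synthesize`, `save_audio`) with those same keyword arguments. `synthesize_batch` and `FakeTTS` are hypothetical names introduced here; the fake class exists so the example runs without downloading the model.

```python
from typing import Iterable, List, Tuple


def synthesize_batch(tts, items: Iterable[Tuple[str, str]], voice_name: str = "M1",
                     prefix: str = "out") -> List[Tuple[str, object]]:
    """Synthesize (text, lang) pairs with one voice; return (path, duration) per item.

    `tts` is assumed to expose get_voice_style, synthesize, and save_audio
    with the signatures shown in the quickstart. The duration is passed
    through unchanged, whatever type synthesize returns.
    """
    style = tts.get_voice_style(voice_name=voice_name)  # look up the voice once
    results = []
    for index, (text, lang) in enumerate(items):
        wav, duration = tts.synthesize(text, voice_style=style, lang=lang)
        path = f"{prefix}_{index:02d}_{lang}.wav"
        tts.save_audio(wav, path)
        results.append((path, duration))
    return results


class FakeTTS:
    """Hypothetical stand-in so the sketch runs without the model installed."""

    def __init__(self):
        self.saved = []

    def get_voice_style(self, voice_name):
        return {"voice": voice_name}

    def synthesize(self, text, voice_style, lang):
        return [0.0] * len(text), float(len(text))  # fake audio, fake duration

    def save_audio(self, wav, path):
        self.saved.append(path)  # record the path instead of writing a file


fake = FakeTTS()
print(synthesize_batch(fake, [("Hello, world.", "en"), ("こんにちは、世界。", "ja")]))
```

With the real package, pass the `TTS(auto_download=True)` instance from the quickstart in place of `FakeTTS()`.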

// npm install @supertone/supertonic
import { TTS } from "@supertone/supertonic";

const tts = await TTS.load({ autoDownload: true });
const style = await tts.getVoiceStyle("M1");
const { wav } = await tts.synthesize("Hello from Node.", { style, lang: "en" });

See the node/ folder in the upstream repo.

// runs in browsers via onnxruntime-web
import { TTS } from "@supertone/supertonic-web";

const tts = await TTS.load();
const { wav } = await tts.synthesize("Hello from the browser.", { lang: "en" });

See the web/ folder in the upstream repo.

// Swift Package Manager: github.com/supertone-inc/supertonic-swift
import Supertonic

let tts = try Supertonic.TTS(autoDownload: true)
let wav = try tts.synthesize("Hello from iOS.", lang: "en")

See the ios/ folder in the upstream repo.

// Gradle: implementation("ai.supertone:supertonic-android:3.+")
val tts = Supertonic.TTS(context, autoDownload = true)
val wav = tts.synthesize("Hello from Android.", lang = "en")

See the android/ folder in the upstream repo.

// CMake: find_package(Supertonic CONFIG REQUIRED)
#include <supertonic/tts.hpp>

auto tts = supertonic::TTS::create({ .auto_download = true });
auto wav = tts->synthesize("Hello from C++.", { .lang = "en" });

See the cpp/ folder in the upstream repo.

License

Open weights. Permissive code. Read the fine print.

Model weights OpenRAIL-M

The trained Supertonic 3 model is released under the OpenRAIL-M license. Weights are open and usable commercially, with use-based restrictions (no harm, no impersonation without consent) and an attribution requirement.

Read the model card →

Note: OpenRAIL-M is not equivalent to MIT — it imposes downstream use restrictions. Read the full license text before deploying.

Sample code MIT

The Python package, runtime bindings, and example code in the upstream repo are MIT-licensed. Use, modify, and redistribute freely with attribution.

Read the LICENSE →

Standard MIT terms: no warranty, the copyright and license notice must be kept in copies, and no restrictions on commercial use.