WhisperSpeech

Test it out yourself in Colab
Join us in the #audio-generation channel on the LAION Discord to chat, ask questions, or contribute!

WhisperSpeech is an open-source, text-to-speech (TTS) system created by “inverting” OpenAI Whisper.
Our goal is to be for speech what Stable Diffusion is for images—powerful, hackable, and commercially safe.

  • All code is Apache-2.0 / MIT.
  • Models are trained only on properly licensed data.
  • Current release: English (LibriLight). Multilingual release coming next.

Sample output →

whisperspeech-sample.mp4

🚀 Progress Updates

[2024-01-29] – Tiny S2A multilingual voice-cloning

We trained a tiny S2A model on an en + pl + fr dataset; it successfully clones French voices using semantic tokens frozen on English + Polish—evidence that one tokeniser could cover all languages.

https://github.com/collabora/WhisperSpeech/assets/107984/267f2602-7eec-4646-a43b-059ff91b574e
https://github.com/collabora/WhisperSpeech/assets/107984/fbf08e8e-0f9a-4b0d-ab5e-747ffba2ccb9

[2024-01-18] – 12× real-time on a 4090 + voice-cloning demo
  • Added torch.compile, KV-caching, and layer tweaks → 12× faster-than-real-time on a consumer RTX 4090.
  • Seamlessly code-switch within one sentence:

To jest pierwszy test wielojęzycznego Whisper Speech modelu … ("This is the first test of the multilingual WhisperSpeech model …")

pl-en-mix.mp4
en-cloning.mp4

Test it on Colab (≤ 30 s install). Hugging Face Space coming soon.
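One of the speedups above, KV-caching, avoids recomputing attention keys and values for tokens that were already generated: each decoding step projects only the newest token and appends its key/value to a cache. The sketch below is purely illustrative (the class and the toy "projections" are invented for this example, not WhisperSpeech's implementation), but it shows why the trick turns O(n²) projection work into O(n).

```python
# Minimal KV-cache sketch (illustrative; not the WhisperSpeech code).
# Each step computes K/V only for the newest token; earlier tokens'
# keys/values come from the cache instead of being recomputed.

class KVCache:
    def __init__(self):
        self.keys = []
        self.values = []
        self.projections = 0   # counts per-token K/V projections performed

    def step(self, token):
        # Toy scalar "projections" so the sketch stays dependency-free.
        k, v = token * 2, token * 3
        self.keys.append(k)
        self.values.append(v)
        self.projections += 1
        # Attend over the full cached history (dot-product stand-in).
        return sum(kk * vv for kk, vv in zip(self.keys, self.values))

def decode(tokens):
    cache = KVCache()
    outputs = [cache.step(t) for t in tokens]
    # Without a cache, step i would redo i+1 projections: O(n^2) total.
    # With the cache we do exactly one projection per token: O(n).
    return outputs, cache.projections

if __name__ == "__main__":
    outs, work = decode([1, 2, 3, 4])
    print(work)  # 4 projections for 4 tokens instead of 1+2+3+4 = 10
```

In a real transformer the cached keys/values are tensors per layer and head, and torch.compile then fuses the per-step kernels, which is where the combined 12× speedup comes from.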

[2024-01-10] – Faster SD S2A + first cloning example

A new SD-size S2A model brings major speed-ups without sacrificing quality; cloning example added.
Try it on Colab.

[2023-12-10] – Multilingual trio (EN/PL)

Archive of older updates

📊 Community Benchmarks

Unofficial speed & memory-usage results from the community can be found here.

📦 Downloads

🗺️ Roadmap

⚙️ Architecture

WhisperSpeech follows the two-stage, token-based pipeline popularised by
AudioLM, Google's SPEAR TTS, and Meta's MusicGen:

| Stage    | Model   | Purpose                          |
|----------|---------|----------------------------------|
| Semantic | Whisper | Transcription → semantic tokens  |
| Acoustic | EnCodec | Tokenise waveform (1.5 kbps)     |
| Vocoder  | Vocos   | High-fidelity audio              |
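The two-stage flow in the table can be sketched end to end. The code below is an illustrative stub, not the library's real API: the function names and toy token mappings are assumptions chosen only to mirror the table. The bitrate arithmetic, however, is concrete: EnCodec's 1.5 kbps setting uses 75 token frames per second and 2 residual codebooks of 1024 entries (10 bits) each, so 75 × 2 × 10 = 1500 bps.

```python
# Illustrative sketch of the two-stage token pipeline (hypothetical names,
# not the real WhisperSpeech API). T2S: text -> semantic tokens;
# S2A: semantic tokens -> EnCodec acoustic tokens; a vocoder (Vocos in
# the real system) turns acoustic tokens back into a waveform.

FRAME_RATE_HZ = 75        # EnCodec emits 75 token frames per second
CODEBOOKS = 2             # the 1.5 kbps setting uses 2 residual codebooks
BITS_PER_TOKEN = 10       # each codebook has 1024 entries -> 10 bits

def bitrate_bps(frame_rate=FRAME_RATE_HZ, codebooks=CODEBOOKS,
                bits=BITS_PER_TOKEN):
    """75 frames/s * 2 codebooks * 10 bits = 1500 bps = 1.5 kbps."""
    return frame_rate * codebooks * bits

def text_to_semantic(text):
    # Stand-in for the T2S model: one semantic token per character.
    return [ord(c) % 512 for c in text]

def semantic_to_acoustic(semantic_tokens):
    # Stand-in for the S2A model: CODEBOOKS parallel token streams.
    return [[(t * k + 7) % 1024 for t in semantic_tokens]
            for k in range(1, CODEBOOKS + 1)]

def vocode(acoustic_tokens):
    # Stand-in for Vocos: just report the stream shapes it received.
    return f"{len(acoustic_tokens)} streams x {len(acoustic_tokens[0])} frames"

if __name__ == "__main__":
    sem = text_to_semantic("hello")
    aco = semantic_to_acoustic(sem)
    print(bitrate_bps())   # 1500
    print(vocode(aco))     # 2 streams x 5 frames
```

Splitting the problem this way is what makes the system hackable: each stage is a small autoregressive model over discrete tokens, so stages can be retrained or swapped independently.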
EnCodec block diagram

Conference talks (deep dives)


Tricks Learned from Scaling WhisperSpeech Models to 80k+ Hours of Speech – Jakub Cłapa, Collabora


Open-Source TTS Projects: WhisperSpeech – In-Depth Discussion

🙏 Appreciation

Made possible by Collabora and LAION.

Additional compute funded by the Gauss Centre for Supercomputing via the John von Neumann Institute for Computing (NIC).

Special thanks to individual contributors:

💼 Consulting

Need help with open-source or proprietary AI projects?
Contact us via Collabora or DM us on Discord.
📚 Citations

@article{SpearTTS,
  title     = {Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision},
  url       = {https://arxiv.org/abs/2302.03540},
  author    = {Kharitonov, Eugene and Vincent, Damien and Borsos, Zalán and Marinier, Raphaël and Girgin, Sertan and Pietquin, Olivier and Sharifi, Matt and Tagliasacchi, Marco and Zeghidour, Neil},
  publisher = {arXiv},
  year      = {2023},
}
@article{MusicGen,
  title     = {Simple and Controllable Music Generation},
  url       = {https://arxiv.org/abs/2306.05284},
  author    = {Copet, Jade and Kreuk, Felix and Gat, Itai and Remez, Tal and Kant, David and Synnaeve, Gabriel and Adi, Yossi and Défossez, Alexandre},
  publisher = {arXiv},
  year      = {2023},
}
@article{Whisper,
  title     = {Robust Speech Recognition via Large-Scale Weak Supervision},
  url       = {https://arxiv.org/abs/2212.04356},
  author    = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  publisher = {arXiv},
  year      = {2022},
}
@article{EnCodec,
  title     = {High Fidelity Neural Audio Compression},
  url       = {https://arxiv.org/abs/2210.13438},
  author    = {Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
  publisher = {arXiv},
  year      = {2022},
}
@article{Vocos,
  title     = {Vocos: Closing the Gap Between Time-Domain and Fourier-Based Neural Vocoders for High-Quality Audio Synthesis},
  url       = {https://arxiv.org/abs/2306.00814},
  author    = {Siuzdak, Hubert},
  publisher = {arXiv},
  year      = {2023},
}
