Pandrator: a multilingual GUI audiobook, subtitle and dubbing generator with voice cloning and translation
Tip
TL;DR:
- Pandrator is not an AI model itself, but a GUI framework for Text-to-Speech, subtitle and translation projects. It can generate audiobooks and subtitles/dubbing by leveraging several AI tools, custom workflows and algorithms. It has an installer and works on Windows out of the box. It is not necessary to set up WSL or Docker containers.
- It supports a wide range of TTS models: Kokoro, Fish S2 Pro, Chatterbox, VoxCPM2, Voxtral, XTTSv2, Silero, OpenAI and Gemini, as well as custom TTS API servers.
- When installing: if you don't have a GPU, choose Kokoro or Silero. If you do have one with at least 8GB of VRAM, and it supports your language, use Voxtral. For voice cloning and a wide range of languages, use XTTS v2 (works even with 4GB GPUs and on CPU).
- The easiest way to use it is to download one of the precompiled archives - simply unpack them and use the included launcher. See this table for their contents and sizes.
- You can talk to me or share tips/workflows/ideas on the Discord server.
This video shows the process of launching Pandrator, selecting a source file, starting generation, stopping it and previewing the saved file. It has not been sped up as it's intended to illustrate the real performance (you may skip the first 35s when the XTTS server is launching, and please remember to turn on the sound).
pandrator_showcase.mp4
And here you can see the dubbing workflow - from a YT video, through transcription, translation, speech generation to synchronisation.
pandrator_dubbing_demonstration.mp4
Pandrator aspires to be easy to use and install - it has a one-click installer and a graphical user interface. It is a tool designed to perform two tasks:
- transform text, PDF (including see-through cropping), EPUB and SRT files into spoken audio in multiple languages, based chiefly on open source software run locally. Preprocessing makes the generated speech sound as natural as possible by, among other things, splitting the text into paragraphs, sentences and smaller logical text blocks (clauses), which the TTS models can process with minimal artifacts. Each sentence can be regenerated if the first attempt is not satisfactory, and you can mark sentences for regeneration using mouse or keyboard actions while listening back to the generation. Voice cloning is possible for models that support it, and text can be additionally preprocessed using LLMs (for example, to remove OCR artifacts or spell out things that the TTS models struggle with, like Roman numerals and abbreviations),
- generate dubbing either directly from a video file, including transcription (using WhisperX), or from an .srt file. It includes a complete workflow from a video file to a dubbed video file with subtitles - including translation using a variety of APIs and techniques to improve the quality of translation. Subdub, a companion app developed for this purpose, can also be used on its own. You can also correct or translate subtitles without generating audio.
At the moment, Pandrator supports multiple TTS backends: Kokoro via Kokoro-FastAPI, Fish Audio S2 Pro GGUF via fishs2-cpp-fastapi, Chatterbox via chatterbox-fastapi, VoxCPM2 via voxcpm_fastapi, Voxtral via voxtral-fastapi, XTTS v2 via the OpenAI-compatible XTTS2 API server, and Silero via silero-api-server. It also supports commercial speech APIs and custom TTS endpoints, including OpenAI-compatible and common JSON APIs, plus optional RVC Python (JarodMica fork) post-processing. For local LLM text preprocessing, Pandrator works well with OpenAI-compatible local servers such as LM Studio and Ollama-compatible endpoints.
- Kokoro supports English (en), British English (en-gb), German (de), Spanish (es), French (fr), Hindi (hi), Italian (it), Japanese (ja), Portuguese (pt), and Chinese Simplified (zh-cn).
- FishS2 uses multilingual Fish S2 GGUF models and OpenAI-compatible voice upload endpoints via `fishs2-cpp-fastapi`. Supports a wide range of languages.
- Chatterbox supports English (en) via `chatterbox-en/chatterbox-turbo`, and a range of additional languages via the `chatterbox-multilingual` model.
- VoxCPM2 is a multilingual model supporting a broad range of languages via the `voxcpm_fastapi` server.
- Voxtral supports Arabic (ar), English (en), German (de), Spanish (es), French (fr), Hindi (hi), Italian (it), Dutch (nl), and Portuguese (pt) via preset voices exposed by `voxtral-fastapi`.
- XTTSv2 supports English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Japanese (ja), Hungarian (hu), Korean (ko), and Hindi (hi).
- Silero supports English, German, Russian, Spanish, French, Hindi, Tatar, Ukrainian, Uzbek, and Kalmyk.
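If you want to check quickly which backends list a given language code, the lists above can be turned into a small lookup. This is only an illustrative sketch: `SUPPORTED_LANGUAGES` and `backends_for` are hypothetical names, not part of Pandrator, and only the backends with explicit language codes above are included.

```python
# Language codes copied from the list above; backends described only as
# "a wide range of languages" are intentionally left out.
SUPPORTED_LANGUAGES = {
    "Kokoro": {"en", "en-gb", "de", "es", "fr", "hi", "it", "ja", "pt", "zh-cn"},
    "Voxtral": {"ar", "en", "de", "es", "fr", "hi", "it", "nl", "pt"},
    "XTTSv2": {
        "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
        "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko", "hi",
    },
}


def backends_for(language_code):
    """Return the backend names (sorted) whose list contains the code."""
    code = language_code.strip().lower()
    return sorted(
        name for name, codes in SUPPORTED_LANGUAGES.items() if code in codes
    )


polish_backends = backends_for("pl")
dutch_backends = backends_for("nl")
```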
| TTS Model | CPU Requirements | GPU Requirements |
|---|---|---|
| Kokoro | Works well on modern CPUs; install includes direct eSpeak setup on Windows | Optional (CPU path is supported) |
| FishS2 | CPU mode exists but is generally too slow for practical long-form usage | NVIDIA GPU strongly recommended (8GB+ VRAM practical target) |
| Chatterbox | Supported via CPU mode, but notably slower than GPU | NVIDIA GPU recommended (4GB+ VRAM); GPU-only for the multilingual model |
| VoxCPM2 | N/A (GPU-only in current wrapper) | NVIDIA GPU required (8GB+ VRAM recommended) |
| Voxtral | N/A (GPU-only backend in current wrapper) | NVIDIA GPU required (4GB+ VRAM practical minimum) |
| XTTSv2 | A reasonably modern CPU with 4+ cores (for CPU-only generation) | NVIDIA GPU with 4GB+ of VRAM for good performance |
| Silero | Performs well on most CPUs regardless of core count | N/A |
This project relies on several APIs and services (running locally) and libraries, notably:
- One or more local/remote TTS endpoints:
  - Kokoro-FastAPI (OpenAI-compatible Kokoro server)
  - fishs2-cpp-fastapi (OpenAI-compatible Fish S2 server)
  - chatterbox-fastapi (OpenAI-compatible Chatterbox server)
  - voxcpm_fastapi (OpenAI-compatible VoxCPM2 server)
  - voxtral-fastapi (OpenAI-compatible Voxtral server)
  - XTTS2 API (OpenAI-compatible XTTS v2 server)
  - silero-api-server (Silero backend)
  - Commercial speech APIs and custom TTS endpoints
- FFmpeg for audio encoding.
- Sentence Splitter by mediacloud, PyQt6, num2words by savoirfairelinux, and others listed in `requirements.txt`.
For local OpenAI-compatible TTS wrappers used by Pandrator, the preferred ecosystem schema is:
- `POST /v1/audio/speech`
- `GET /v1/models`
- `GET /v1/audio/voices` (preferred voice catalog), with legacy `GET /v1/voices` support during migration
- `POST /v1/audio/voices` for cloning-capable backends (XTTS, FishS2), with legacy `/v1/files` fallback
- Subdub, a command line app that transcribes video files, translates subtitles and synchronises the generated speech with the video, made specially for Pandrator.
- WhisperX by m-bain, an enhanced implementation of OpenAI's Whisper model with improved alignment, used for dubbing and XTTS training.
- Easy XTTS Trainer, a command line app that enables XTTS fine-tuning using one or more audio files, made specially for Pandrator.
- RVC Python (JarodMica fork) for enhancing voice quality and cloning results with Retrieval Based Voice Conversion.
- A local OpenAI-compatible LLM endpoint (for example LM Studio, Ollama-compatible endpoints, or other compatible providers) for LLM-based text pre-processing.
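To make the OpenAI-compatible schema mentioned above more concrete, here is a minimal sketch of a client for `POST /v1/audio/speech`. It assumes the wrapper accepts the usual OpenAI speech fields (`model`, `input`, `voice`, `response_format`) as JSON and returns raw audio bytes; the function names, the voice and model values, and the port are placeholders, so check the documentation of the server you actually run.

```python
import json
import urllib.request


def build_speech_request(base_url, text, voice, model, response_format="wav"):
    """Build (but do not send) a POST /v1/audio/speech request."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(request, out_path):
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(request) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return out_path


# Placeholder values: adjust the URL, voice and model to your local server.
request = build_speech_request(
    "http://127.0.0.1:8880/", "Hello.", voice="example_voice", model="example_model"
)
```

Calling `synthesize(request, "hello.wav")` would then send the request, provided a compatible server is listening at that address.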
I've prepared packages (archives) that you can simply unpack - everything is preconfigured locally so you can launch quickly. You can download them from here.
You can use the launcher to start Pandrator, update it and install new features.
| Package | Contents | Unpacked Size |
|---|---|---|
| 1 | Pandrator + Kokoro | Varies |
| 2 | Pandrator + XTTS + WhisperX + XTTS fine-tuning + RVC | Varies |
| 3 | Pandrator + Voxtral | Varies |
| 4 | Pandrator + Voxtral + XTTS + WhisperX + XTTS fine-tuning + RVC | Varies |
scripts/build_release_packages.py automates archive generation and keeps a reusable local block cache so you do not need to re-download/re-bootstrap every stack for each zip.
By default it creates/uses package_release/ and runs all cache/staging/output work from that directory.
The script now supports two workflows.
- Fully automated source preparation (recommended):

      python scripts/build_release_packages.py --prepare-sources --sources-root "D:/pandrator-builds/sources" --installer-exe "dist/PandratorInstaller.exe"

  Kokoro-only build (prepare + package):

      python scripts/build_release_packages.py --prepare-sources --only kokoro --installer-exe "dist/PandratorInstaller.exe"

  This runs `pandrator_installer_launcher.py` in headless mode to prepare/reuse 4 source installs under `--sources-root`: `core` (base runtime), `stack` (XTTS + WhisperX + XTTS fine-tuning + RVC), `kokoro`, `voxtral`.

- Manual source paths (if you already manage source installs yourself):

      python scripts/build_release_packages.py --core-source "D:/pandrator-builds/core/Pandrator" --stack-source "D:/pandrator-builds/xtts-rvc/Pandrator" --kokoro-source "D:/pandrator-builds/kokoro/Pandrator" --voxtral-source "D:/pandrator-builds/voxtral/Pandrator" --installer-exe "dist/PandratorInstaller.exe"

What it does:
- reuses cached blocks in `.release_blocks/` and only refreshes changed inputs,
- assembles each package in `.release_staging/`,
- writes final archives to `release_packages/`,
- includes both `PandratorInstaller.exe` (or the path passed with `--installer-exe`) and the `Pandrator/` folder in every zip.
Those paths are inside package_release/ unless --release-root is changed.
Useful flags:
- `--force-refresh` to rebuild all cached blocks,
- `--release-root` to change the working root directory,
- `--output-dir` (or `-o`) to choose where zip archives are written,
- `--only` to build only selected packages (for example `--only kokoro`),
- `--skip-voxtral-with-rest` to skip the combined Voxtral + XTTS/WhisperX/RVC package,
- `--no-hardlinks` to force plain copies,
- `--prepare-force` to reinstall auto-prepared source installs,
- `--installer-script` and `--python-exe` to control how headless source preparation is executed.
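The `--no-hardlinks` flag suggests that staging normally links files out of the cache instead of copying them. The following is a generic sketch of that "hard link, else copy" technique, not the actual code of `build_release_packages.py`; `link_or_copy` and the demo paths are made up for illustration.

```python
import os
import shutil
import tempfile
from pathlib import Path


def link_or_copy(src, dst, use_hardlinks=True):
    """Place src at dst, preferring a hard link and falling back to a copy.

    Returns "hardlink" or "copy" so the caller can log what happened.
    """
    dst.parent.mkdir(parents=True, exist_ok=True)
    if dst.exists():
        dst.unlink()
    if use_hardlinks:
        try:
            os.link(src, dst)
            return "hardlink"
        except OSError:
            # Different volume, unsupported filesystem or missing privilege.
            pass
    shutil.copy2(src, dst)
    return "copy"


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    cached = Path(tmp) / "cache" / "block.bin"
    cached.parent.mkdir()
    cached.write_bytes(b"payload")
    staged = Path(tmp) / "staging" / "block.bin"
    mode = link_or_copy(cached, staged, use_hardlinks=False)
    staged_bytes = staged.read_bytes()
```

Hard links avoid duplicating large model files on disk, but they only work within a single volume, which is why a copy fallback is needed.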
Run pandrator_installer_launcher.exe from Releases. The executable is built from pandrator_installer_launcher.py.
For automation, the launcher also supports headless installation:

    python pandrator_installer_launcher.py --headless-install --workspace "D:/pandrator-builds/core" --components "kokoro"
    # or CPU-only Kokoro:
    python pandrator_installer_launcher.py --headless-install --workspace "D:/pandrator-builds/core" --components "kokoro_cpu"

Note
Some antivirus tools may flag standalone executables. If needed, add an exception or run from source.
You can install components incrementally (during first setup or later):
- Pandrator core app
- XTTS2 API (`XTTS` GPU or `XTTS CPU only`)
- FishS2 API (`FishS2`)
- Chatterbox API (`Chatterbox` GPU or `Chatterbox CPU only`)
- VoxCPM2 API (`VoxCPM`)
- Voxtral API (`Voxtral`, GPU only)
- Kokoro API (`Kokoro` GPU or `Kokoro CPU only`)
- Silero API
- Optional tools: RVC Python, WhisperX, Easy XTTS Trainer
Current installer flow:
- Creates `Pandrator/` in the selected location.
- Installs/checks Calibre.
- Downloads shared Pixi runtime to `Pandrator/bin/pixi.exe`.
- Clones required repositories (`Pandrator`, `Subdub`) and selected server repos (`xtts2_api`, `fishs2-cpp-fastapi`, `chatterbox-fastapi`, `voxcpm_fastapi`, `voxtral-fastapi`, `Kokoro-FastAPI`).
- Sets up Pandrator dependencies and selected optional environments/tools.
- Bootstraps XTTS2, FishS2, Chatterbox, VoxCPM2, Voxtral, and Kokoro via their own launcher scripts.
Launch tab options:
- `Pandrator`
- `XTTS` (+ `Use CPU`, `DeepSpeed`)
- `FishS2`
- `Chatterbox` (+ `Use CPU`)
- `VoxCPM`
- `Voxtral`
- `Kokoro` (+ `Use CPU` when GPU support is installed)
- `Silero`
If a local TTS server is launched from the launcher, Pandrator is auto-started with the matching connect flag (-connect -xtts, -connect -fishs2, -connect -chatterbox, -connect -voxcpm, -connect -voxtral, -connect -kokoro, -connect -silero).
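If you script your own start-up instead of using the launcher, the connect flags listed above can be assembled like this. It is a sketch only: `build_launch_command` is a hypothetical helper, and it assumes you run it from the folder that contains `main.py`.

```python
import sys

# Backend names taken from the connect flags listed above.
BACKENDS = ("xtts", "fishs2", "chatterbox", "voxcpm", "voxtral", "kokoro", "silero")


def build_launch_command(backend=None, python_exe=sys.executable, script="main.py"):
    """Return the argument list for starting Pandrator.

    With a backend name, "-connect -<backend>" is appended so that Pandrator
    connects to the matching local TTS server on start-up.
    """
    command = [python_exe, script]
    if backend is not None:
        if backend not in BACKENDS:
            raise ValueError(f"unknown backend: {backend!r}")
        command += ["-connect", f"-{backend}"]
    return command


command = build_launch_command("kokoro", python_exe="python")
```

The resulting list can be passed to `subprocess.run(command)` once the corresponding TTS server is up.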
To re-run setup from scratch, remove the generated Pandrator/ folder and start again.
For additional functionality not yet included in the installer:
- Configure a local OpenAI-compatible LLM endpoint (for example LM Studio or an Ollama-compatible endpoint) if you want LLM text preprocessing and local translation.
Please refer to the repositories linked under Dependencies for detailed API-server options. The selected API server must be running for local TTS generation.
- Git
- Python 3.11+
- Calibre
- FFmpeg on PATH (recommended)
- Install Calibre.
- Clone the repositories:

      mkdir Pandrator
      cd Pandrator
      git clone https://github.com/lukaszliniewicz/Pandrator.git
      git clone https://github.com/lukaszliniewicz/Subdub.git

- Install Pandrator dependencies:

      cd Pandrator
      python -m pip install -r requirements.txt
      cd ..

- Install Subdub dependencies:

      cd Subdub
      python -m pip install -e .
      cd ..

- (Optional) Install XTTS2 API:

      git clone https://github.com/lukaszliniewicz/xtts2_api.git
      cd xtts2_api
      run.bat --cpu
      # or
      run.bat --backend cuda
      # Linux/macOS:
      # bash run.sh --cpu
      # bash run.sh --backend cuda
      cd ..

- (Optional) Install FishS2 API:

      git clone https://github.com/lukaszliniewicz/fishs2-cpp-fastapi.git
      cd fishs2-cpp-fastapi
      run.bat
      # Linux/macOS:
      # bash run.sh
      cd ..

- (Optional) Install Voxtral API:

      git clone https://github.com/lukaszliniewicz/voxtral-fastapi.git
      cd voxtral-fastapi
      run.bat
      # Linux:
      # bash run.sh
      cd ..

- (Optional) Install Kokoro API:

      git clone https://github.com/remsky/Kokoro-FastAPI.git
      cd Kokoro-FastAPI
      python -m pip install -e .[cpu]
      # or for NVIDIA GPU support, use the upstream GPU extra and CUDA wheel index:
      # python -m pip install -e .[gpu] --extra-index-url https://download.pytorch.org/whl/cu126
      python docker/scripts/download_model.py --output api/src/models/v1_0
      cd ..

- (Optional) Install Silero API:

      python -m pip install silero-api-server

- (Optional) Install Easy XTTS Trainer:

      git clone https://github.com/lukaszliniewicz/easy_xtts_trainer.git
      cd easy_xtts_trainer
      pip install -r requirements.txt
      cd ..
- Run Pandrator:

      cd Pandrator
      python main.py

- Run Pandrator with auto-connect to a local TTS backend:

      cd Pandrator
      python main.py -connect -xtts
      # or
      python main.py -connect -fishs2
      # or
      python main.py -connect -voxtral
      # or
      python main.py -connect -kokoro
      # or
      python main.py -connect -silero

- Run XTTS2 API (if installed):

      cd xtts2_api
      run.bat --cpu
      # or
      run.bat --backend cuda

- Run FishS2 API (if installed):

      cd fishs2-cpp-fastapi
      run.bat

- Run Voxtral API (if installed):

      cd voxtral-fastapi
      run.bat

- Run Kokoro API (if installed):

      cd Kokoro-FastAPI
      set USE_GPU=false
      # or set USE_GPU=true if installed with GPU support
      python -m uvicorn api.src.main:app --host 127.0.0.1 --port 8880

You can play back the generated sentences, also as a playlist, edit them (the text that will be used for regeneration), regenerate or remove individual ones. You can also mark them for regeneration. This is useful when you don't want to stop listening but work on all problematic sentences later. You can use the "m" key to mark the sentence that is currently playing or the right mouse button to mark both the current and the previous sentence (this can be useful if you're listening to the output and not looking at the screen). "Save Output" concatenates the sentences generated so far and encodes them as one file.
Pandrator offers a comprehensive workflow for generating dubbed videos from video files or existing subtitles. This includes transcription, translation, speech generation, and synchronization:
- Select a Video or SRT File:
  - Video File: Choose a video file. The audio will be extracted automatically, and transcription will be performed using WhisperX.
  - SRT File: Select an existing SRT subtitle file. In this case, you also need to specify the corresponding video file (unless you only want to translate the subtitles).
- Transcription (if using a video file):
  - Language: Select the language spoken in the original video.
  - Model: Choose a WhisperX model for transcription. Smaller models are faster, while larger ones provide higher accuracy. The `large-v3` model provides the best results.
  - Pandrator will automatically run WhisperX to generate an SRT file containing the transcription.
- Translation (optional):
  - Enable Translation: Toggle this option to translate the subtitles.
  - Original and Target Languages: Select the original language of the subtitles and the language you want to translate them into.
  - Translation Provider: Choose an LLM provider from your configured Providers catalog, or choose `DeepL`.
  - Translation Model: Choose a model from that provider's catalog (or type one manually if needed).
  - Manage provider API base URLs, keys and model catalogs in the Providers tab.
  - Chain-of-thought (optional): Enables additional reasoning effort for LLM-based translation/correction (not used with DeepL).
- In order to generate speech, click on Generate Dubbing Audio. You will be able to edit/regenerate the sentences as in the Audiobook workflow. You can also choose to only transcribe the chosen video file or only translate a subtitle file.
- Synchronization: When you're happy with the generated audio, click on Add Dubbing to Video. The dubbing will be synchronised with the video, producing a dubbed video file with embedded subtitles.
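The whole workflow revolves around SRT files, so it may help to see what they contain. Below is a minimal SRT reader based on the common SubRip layout (index line, `start --> end` timing line, then text lines). It is not the parser used by Pandrator or Subdub; `Cue` and `parse_srt` are illustrative names.

```python
import re
from dataclasses import dataclass

TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})")


@dataclass
class Cue:
    index: int
    start: float  # seconds
    end: float    # seconds
    text: str


def _seconds(stamp):
    """Convert an "HH:MM:SS,mmm" timestamp to seconds."""
    match = TIMESTAMP.search(stamp)
    if match is None:
        raise ValueError(f"bad timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(group) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(content):
    """Parse SRT text into a list of Cue objects, skipping malformed blocks."""
    content = content.lstrip("\ufeff").replace("\r\n", "\n").strip()
    cues = []
    for block in re.split(r"\n\s*\n", content):
        lines = block.split("\n")
        if len(lines) < 2 or "-->" not in lines[1]:
            continue
        start, end = lines[1].split("-->")
        text = " ".join(line.strip() for line in lines[2:])
        cues.append(Cue(int(lines[0].strip()), _seconds(start), _seconds(end), text))
    return cues


SAMPLE = """1
00:00:01,000 --> 00:00:03,500
Hello there.

2
00:00:04,000 --> 00:00:06,250
This cue spans
two lines.
"""

cues = parse_srt(SAMPLE)
```

The start and end times are what the synchronisation step has to respect when the generated speech is placed back on the video timeline.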
- OpenAI and Google Gemini are first-class TTS services, alongside local integrations such as Kokoro, Voxtral, and Magpie.
- `Custom` is reserved for user-created endpoints. Add and manage those endpoints in Providers > TTS.
- The Wrapper Profile selector contains curated recipes for popular third-party servers. Applying a profile fills its suggested local URL, route, request mapping, models, voices, and known defaults; all values remain editable before saving.
- For a new custom endpoint, enter its base URL and click Auto-configure. Pandrator safely inspects OpenAPI metadata and likely routes without generating audio, then presents the detected request mapping and confidence evidence for review before saving.
- Auto-configure supports OpenAI-compatible speech APIs and common JSON speech routes such as `POST /generate` with a text field. Models and voices are populated when the server documents or exposes catalogs.
- Multipart/form-data, Gradio, gRPC/WebSocket, and query-only wrappers are not offered as one-click profiles yet because they require additional request transports.
- First-class service base URLs are editable in Providers > TTS, including local service ports. These settings are stored in the app settings database, with the JSON settings file retained as a compatibility backup.
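To give a rough idea of what inspecting OpenAPI metadata for a speech route can look like, here is a simplified sketch. It is not Pandrator's detection logic: `detect_speech_routes`, the confidence labels and the example document are invented for illustration, and `$ref` schemas are deliberately ignored to keep it short.

```python
def detect_speech_routes(openapi):
    """Return candidate (path, text_field, confidence) tuples, best first.

    openapi is an already parsed OpenAPI document (a dict). Only inline JSON
    request schemas are inspected; "$ref" references are not resolved.
    """
    candidates = []
    for path, operations in openapi.get("paths", {}).items():
        post = operations.get("post")
        if not isinstance(post, dict):
            continue
        schema = (
            post.get("requestBody", {})
            .get("content", {})
            .get("application/json", {})
            .get("schema", {})
        )
        properties = schema.get("properties", {})
        if path.rstrip("/") == "/v1/audio/speech":
            # OpenAI-compatible route: the text goes into the "input" field.
            candidates.append((path, "input", "high"))
        elif "text" in properties:
            # Common JSON speech route such as POST /generate.
            candidates.append((path, "text", "medium"))
    order = {"high": 0, "medium": 1}
    return sorted(candidates, key=lambda item: order[item[2]])


EXAMPLE_SPEC = {
    "paths": {
        "/health": {"get": {}},
        "/generate": {
            "post": {
                "requestBody": {
                    "content": {
                        "application/json": {
                            "schema": {
                                "type": "object",
                                "properties": {"text": {"type": "string"}},
                            }
                        }
                    }
                }
            }
        },
        "/v1/audio/speech": {"post": {}},
    }
}

routes = detect_speech_routes(EXAMPLE_SPEC)
```

Nothing here sends a synthesis request; the document is only read, which matches the idea of detecting a mapping without generating audio.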
- You can change the length of silence appended to the end of sentences and paragraphs.
- You can enable a fade-in and -out effect and set the duration.
- You can enable RVC. For this to work, you have to install RVC_Python. You can do this in the Installer/Launcher at any time. You need to select a model - an RVC model consists of two files: a `.pth` and an `.index` file. They need to have the same name (e.g. `voicex.pth` and `voicex.index`). For best results, use the same voice for XTTS. You can also fine-tune the RVC options such as pitch.
- You can disable/enable splitting long sentences and set the max length a text fragment sent for TTS generation may have (enabled by default; it tries to split sentences whose length exceeds the max length value; it looks for punctuation marks (, ; : -) and chooses the one closest to the midpoint of the sentence; if there are no punctuation marks, it looks for conjunctions like "and"; it performs this operation twice as some sentence fragments may still be too long after just one split).
- You can disable/enable appending short sentences (to preceding or following sentences; disabled by default, which may improve flow because the length of text fragments sent to the model is more uniform).
- Remove diacritics (useful when generating text that contains many foreign words or transliterations from foreign alphabets, e.g. Japanese). Do not enable this if you generate in a language that needs diacritics, like German or Polish. The pronunciation will be wrong then.
- Remove quotation marks (useful for models that sometimes read quotation marks aloud).
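The long-sentence splitting described above (cut at the punctuation mark closest to the midpoint, fall back to a conjunction, run two passes) can be sketched as follows. This is a simplified illustration and not Pandrator's implementation: the function names are made up, the conjunction list beyond "and" is an assumption, and a "-" only counts as a break when it has a space on both sides so that hyphenated words stay intact.

```python
import re

CONJUNCTIONS = ("and", "but", "or")  # assumption: the text above only names "and"
CONJUNCTION_PATTERN = re.compile(r"\b(?:%s)\b" % "|".join(CONJUNCTIONS))


def _is_break(fragment, i):
    """True if the character at i is a usable punctuation break."""
    ch = fragment[i]
    if ch in ",;:":
        return True
    return ch == "-" and fragment[i - 1:i] == " " and fragment[i + 1:i + 2] == " "


def _usable(fragment, cuts):
    """Drop cut positions that would leave an empty piece on either side."""
    return [c for c in cuts if fragment[:c].strip() and fragment[c:].strip()]


def _split_once(fragment):
    # Prefer cutting right after a punctuation mark.
    cuts = _usable(fragment, [i + 1 for i in range(len(fragment)) if _is_break(fragment, i)])
    if not cuts:
        # Otherwise cut right before a conjunction.
        cuts = _usable(fragment, [m.start() for m in CONJUNCTION_PATTERN.finditer(fragment)])
    if not cuts:
        return [fragment]
    midpoint = len(fragment) / 2
    cut = min(cuts, key=lambda c: abs(c - midpoint))
    return [fragment[:cut].strip(), fragment[cut:].strip()]


def split_long_sentence(sentence, max_length, passes=2):
    """Split fragments longer than max_length, repeating for a second pass."""
    fragments = [sentence.strip()]
    for _ in range(passes):
        next_fragments = []
        for fragment in fragments:
            if len(fragment) > max_length:
                next_fragments.extend(_split_once(fragment))
            else:
                next_fragments.append(fragment)
        fragments = next_fragments
    return fragments


parts = split_long_sentence(
    "The storm passed quickly, and the travellers, tired but relieved, "
    "continued along the road",
    max_length=40,
)
```

Two passes mean a sentence ends up in at most four fragments, so a very long sentence without suitable break points can still exceed the maximum length.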
- Enable LLM processing to use language models for preprocessing text before sending it to the TTS API. For example, you may ask the LLM to remove OCR artifacts, spell out abbreviations, and correct punctuation.
- You can define up to three prompts for text optimization. Each prompt is sent to the LLM API separately, and the output of the last prompt is used for TTS generation.
- For each prompt, you can enable/disable it, set the prompt text, choose the LLM model to use, and enable/disable evaluation (if enabled, the LLM API will be called twice for each prompt, and then again for the model to choose the better result).
- Manage providers/models in the Providers tab, then refresh built-in catalogs from the Text Processing tab if needed.
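The prompt chain with optional evaluation can be pictured like this. It is a sketch under assumptions, not the app's code: `run_prompt_chain`, `JUDGE_PROMPT`, the prompt dictionary keys and the `call_llm(model, prompt, text)` callable are all hypothetical, and the stand-in `fake_llm` only exists so the example runs without a server.

```python
JUDGE_PROMPT = "Reply with A or B to pick the better rewrite."


def run_prompt_chain(text, prompts, call_llm):
    """Apply up to three prompts in order; each works on the previous output.

    With "evaluate" set, the prompt is run twice and a third call asks the
    model to choose between the two candidates.
    """
    current = text
    for step in prompts[:3]:
        if not step.get("enabled", True):
            continue
        first = call_llm(step["model"], step["prompt"], current)
        if step.get("evaluate"):
            second = call_llm(step["model"], step["prompt"], current)
            verdict = call_llm(
                step["model"],
                JUDGE_PROMPT,
                f"Original:\n{current}\n\nA:\n{first}\n\nB:\n{second}",
            )
            current = second if verdict.strip().upper().startswith("B") else first
        else:
            current = first
    return current


calls = []


def fake_llm(model, prompt, text):
    """Deterministic stand-in for a real OpenAI-compatible client."""
    calls.append(prompt)
    if prompt == JUDGE_PROMPT:
        return "A"
    if prompt == "expand":
        return text.replace("Dr.", "Doctor")
    return text.replace("  ", " ")


result = run_prompt_chain(
    "Dr.  Smith arrived.",
    [
        {"prompt": "clean", "model": "example-model", "evaluate": False},
        {"prompt": "expand", "model": "example-model", "evaluate": True},
    ],
    fake_llm,
)
```

Evaluation triples the number of calls for a prompt, so it is worth enabling only for prompts whose output quality varies noticeably between runs.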
- Enable RVC to enhance the generated audio quality and apply voice cloning.
- Select the RVC model file (.pth) and the corresponding index file using the "Select RVC Model" and "Select RVC Index" buttons in the Audio Processing tab.
- When RVC is enabled, the generated audio will be processed using the selected RVC model and index before being saved.
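Since an RVC model is a `.pth` file plus an `.index` file with the same name, a quick way to check a folder for complete pairs might look like this. `find_rvc_models` is a hypothetical helper and the file names in the demo are placeholders.

```python
import tempfile
from pathlib import Path


def find_rvc_models(folder):
    """Return {name: (pth_path, index_path)} for every complete pair."""
    pairs = {}
    for pth in sorted(Path(folder).glob("*.pth")):
        index = pth.with_suffix(".index")
        if index.is_file():
            pairs[pth.stem] = (pth, index)
    return pairs


# Demonstration with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("voicex.pth", "voicex.index", "orphan.pth"):
        (Path(tmp) / name).touch()
    found = sorted(find_rvc_models(tmp))
```

Here `orphan.pth` is skipped because it has no matching `.index` file.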
Contributions, suggestions for improvements, and bug reports are most welcome!
- You can find a collection of voice samples, for example here. They are intended for use with ElevenLabs, so you will need to pick an 8-12s fragment and save it as a 22050 Hz mono `.wav` using Audacity, for instance.
- You can find a collection of RVC models, for example here.
- Add importing/exporting settings.
- Add support for proprietary APIs for text pre-processing and TTS generation.
- Add option to record a voice sample and use it for TTS to the GUI.
- Add support for chapter segmentation.
- Add all API servers to the setup script.
- Add support for custom XTTS models.
- Add workflow to create dubbing from `.srt` subtitle files.
- Include support for PDF files.
- Integrate editing capabilities for processed sentences within the UI.
- Add support for a lower quality but faster local TTS model that can easily run on CPU, e.g. Silero or Piper.
- Add support for EPUB.