Training and inference code for Irodori-TTS, a Flow Matching-based Text-to-Speech model. The architecture and training design largely follow Echo-TTS, using DACVAE continuous latents as the generation target.
For an OpenAI-compatible inference API server, see Irodori-TTS-Server.
Important
main tracks the v3 codebase and is intended for use with the Irodori-TTS-500M-v3 base model release.
It also supports the Irodori-TTS-600M-v3-VoiceDesign 3-branch VoiceDesign release.
The current code remains backward-compatible with Irodori-TTS-500M-v2 checkpoints, including Irodori-TTS-500M-v2-VoiceDesign.
If you need the previous v2 codebase state, use the v2 tag. If you need the previous v1 code, use the v1 tag.
v1 checkpoints / preprocessing are not compatible with v2/v3.
The previous public v1 model is available at Aratako/Irodori-TTS-500M.
For model weights and audio samples, please refer to the base model card and the VoiceDesign model card.
- Flow Matching TTS: Rectified Flow Diffusion Transformer (RF-DiT) over continuous DACVAE latents
- Voice Cloning: Zero-shot voice cloning from reference audio
- Multi-modal Voice Design: v3 VoiceDesign can combine text, reference speech, and caption text for voice identity plus style/emotion control
- Emoji-based Style Control: Emoji annotations in input text can influence delivery and non-verbal vocal expressions in supported checkpoints
- Automatic Duration Prediction: v3 base and v3 VoiceDesign checkpoints estimate output length without manual --seconds
- Automatic Watermarking: Generated audio is watermarked with SilentCipher when available
- Multi-GPU Training: Distributed training via uv run --no-sync torchrun with gradient accumulation, mixed precision (bf16), and W&B logging
- PEFT LoRA Fine-Tuning: Parameter-efficient adaptation with PEFT/LoRA for released checkpoints
- Speaker Inversion: Learn reusable speaker embedding tokens for a target voice while freezing the base model
- Flexible Inference: CLI, Gradio Web UI, and HuggingFace Hub checkpoint support
The current codebase supports two closely related checkpoint families:
- Base model (Aratako/Irodori-TTS-500M-v3): Text encoder + reference latent encoder + diffusion transformer + duration predictor. The reference latent encoder consumes patched DACVAE latents from reference audio for speaker/style conditioning. v2 base checkpoints remain supported for inference.
- VoiceDesign model (Aratako/Irodori-TTS-600M-v3-VoiceDesign): Text encoder + reference latent encoder + caption encoder + diffusion transformer + duration predictor. The v3 VoiceDesign path supports 3-branch conditioning from text, reference speech, and caption text. v2 VoiceDesign remains supported as a backward-compatible caption-only checkpoint family.
Shared building blocks:
- Text Encoder: Token embeddings initialized from a pretrained LLM, followed by self-attention + SwiGLU transformer layers with RoPE
- Reference Latent Encoder: Encodes patched reference audio latents for speaker identity conditioning
- Caption Encoder: Encodes style-control text for emotion, tone, speaking style, and acoustic context
- Diffusion Transformer: Joint-attention DiT blocks with Low-Rank AdaLN (timestep-conditioned adaptive layer normalization), half-RoPE, and SwiGLU MLPs
- Duration Predictor: v3 checkpoints include an integrated predictor for automatic output length estimation
Audio is represented as continuous latent sequences via the codec configured by the checkpoint. The released v2/v3 checkpoints use the 32-dim Semantic-DACVAE-Japanese-32dim codec for 48kHz waveform reconstruction.
git clone https://github.com/Aratako/Irodori-TTS.git
cd Irodori-TTS
uv sync --extra cu128 # NVIDIA CUDA 12.8 (Linux/Windows)
If you want to explicitly select a PyTorch backend, use one of the backend extras below:
# NVIDIA CUDA 12.8 on Linux/Windows
uv sync --extra cu128
# AMD ROCm on Linux/WSL
uv sync --extra rocm
# Intel XPU on Linux/Windows
uv sync --extra xpu
# CPU-only, or macOS CPU/MPS via PyPI
uv sync --extra cpu
The PyTorch backend extras are mutually exclusive. The cu128 extra uses the
PyTorch CUDA 12.8 index, the rocm extra uses the PyTorch ROCm index on
Linux, and the xpu extra uses the PyTorch XPU index on Linux/Windows.
The cpu extra uses the CPU PyTorch index on Linux/Windows and falls
back to the standard PyPI PyTorch wheels on macOS.
After syncing with a backend extra, use uv run --no-sync ... for the commands
below to avoid re-syncing the environment without the selected PyTorch backend
extra.
The rocm extra includes pytorch-triton-rocm because triton-rocm alone does
not provide triton.language, which is needed on the import path from
transformers to torch._dynamo. This was validated with AMD GPU inference.
uv run --no-sync python infer.py \
--hf-checkpoint Aratako/Irodori-TTS-500M-v3 \
--text "こんにちは、私はAIです。これは音声合成のテストです。" \
--ref-wav path/to/reference.wav \
--output-wav outputs/sample.wav
uv run --no-sync python infer.py \
--hf-checkpoint Aratako/Irodori-TTS-500M-v3 \
--text "こんにちは、私はAIです。これは音声合成のテストです。" \
--no-ref \
--output-wav outputs/sample.wav
Pure VoiceDesign from text + caption:
uv run --no-sync python infer.py \
--hf-checkpoint Aratako/Irodori-TTS-600M-v3-VoiceDesign \
--text "こんにちは、私はAIです。これは音声合成のテストです。" \
--caption "落ち着いた女性の声で、近い距離感でやわらかく自然に読み上げてください。" \
--no-ref \
--output-wav outputs/sample_voice_design.wav
Style-controlled voice cloning with text + reference speech + caption:
uv run --no-sync python infer.py \
--hf-checkpoint Aratako/Irodori-TTS-600M-v3-VoiceDesign \
--text "どうしてもっと早く教えてくれなかったの?私、ずっと待ってたのに。" \
--ref-wav path/to/reference.wav \
--caption "深く傷つき、今にも泣き出しそうな様子。声が震えており、悲痛なトーンで弱々しく話す。" \
--output-wav outputs/sample_voice_design_clone.wav
Use a learned Speaker Inversion embedding instead of reference audio:
uv run --no-sync python infer.py \
--checkpoint path/to/Irodori-TTS-500M-v3.safetensors \
--ref-embed path/to/my.speaker.safetensors \
--text "こんにちは、私はAIです。これは音声合成のテストです。" \
--output-wav outputs/sample_speaker_inversion.wav
uv run --no-sync python gradio_app.py --server-name 0.0.0.0 --server-port 7860
Then access the UI at http://localhost:7860.
The hosted v3 demo is available at Aratako/Irodori-TTS-500M-v3-Demo.
The reference input area supports either reference audio/latent input or a Speaker Inversion embedding via tabs.
For VoiceDesign checkpoints, use the dedicated UI:
uv run --no-sync python gradio_app_voicedesign.py --server-name 0.0.0.0 --server-port 7861
The hosted VoiceDesign demo is available at Aratako/Irodori-TTS-600M-v3-VoiceDesign-Demo.
gradio_app.py is for Aratako/Irodori-TTS-500M-v3. gradio_app_voicedesign.py is for Aratako/Irodori-TTS-600M-v3-VoiceDesign and remains compatible with v2 VoiceDesign checkpoints.
uv run --no-sync python infer.py \
--hf-checkpoint Aratako/Irodori-TTS-500M-v3 \
--text "こんにちは、私はAIです。これは音声合成のテストです。" \
--ref-wav path/to/reference.wav \
--output-wav outputs/sample.wav
Local checkpoints (.pt or .safetensors) are also supported:
uv run --no-sync python infer.py \
--checkpoint outputs/checkpoint_final.safetensors \
--text "こんにちは、私はAIです。これは音声合成のテストです。" \
--ref-wav path/to/reference.wav \
--output-wav outputs/sample.wav
VoiceDesign checkpoints support caption conditioning. The v3 VoiceDesign model can run with
caption only by passing --no-ref, or with both reference speech and caption by passing
--ref-wav, --ref-latent, or --ref-embed.
uv run --no-sync python infer.py \
--hf-checkpoint Aratako/Irodori-TTS-600M-v3-VoiceDesign \
--text "こんにちは、私はAIです。これは音声合成のテストです。" \
--caption "落ち着いた、近い距離感の女性話者" \
--no-ref \
--output-wav outputs/sample_voice_design.wav
uv run --no-sync python infer.py \
--hf-checkpoint Aratako/Irodori-TTS-600M-v3-VoiceDesign \
--text "あははっ🤭、それ本当に言ってるの?…😮💨まぁ、君らしいけどね。" \
--caption "余裕のある大人の男性。親しい相手に対して、くだけた雰囲気で呆れながらも楽しそうに話している。" \
--ref-wav path/to/reference.wav \
--output-wav outputs/sample_voice_design_ref_caption.wav
The older Aratako/Irodori-TTS-500M-v2-VoiceDesign checkpoint is still supported, but it is caption-only and intentionally ignores speaker/reference conditioning.
LoRA adapter directories can be loaded dynamically at inference time without exporting a merged checkpoint:
uv run --no-sync python infer.py \
--checkpoint path/to/base_model.safetensors \
--lora-adapter outputs/irodori_tts_lora/checkpoint_final \
--text "こんにちは、私はAIです。これはLoRA推論のテストです。" \
--ref-wav path/to/reference.wav \
--output-wav outputs/sample_lora.wav
Speaker Inversion embedding checkpoints can be used with the same base model that
was used for inversion training. Pass the embedding with --ref-embed;
it is mutually exclusive with --ref-wav, --ref-latent, and --no-ref.
uv run --no-sync python infer.py \
--checkpoint path/to/Irodori-TTS-500M-v3.safetensors \
--ref-embed outputs/speaker_inversion/name/checkpoint_final.speaker.safetensors \
--text "こんにちは、私はAIです。これはSpeaker Inversion推論のテストです。" \
--output-wav outputs/sample_speaker_inversion.wav
The v3 base and v3 VoiceDesign models integrate duration prediction into inference.
When --seconds is omitted, the runtime estimates the output length from the input
text and enabled conditions, then generates audio for that estimated duration. Use
--duration-scale to multiply the predicted length (>1 longer, <1 shorter). For
exact control, pass --seconds manually.
Older v2 checkpoints were trained with fixed-length 30-second targets. They remain
supported by the v3 codebase and still accept manual --seconds, but forcing a
non-default duration can reduce audio quality; prefer the v3 base model for automatic
or scaled duration control.
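The interaction between --seconds and --duration-scale can be sketched as follows. This is only an illustration of the behavior described above: the function name resolve_seconds is hypothetical, and the choice to leave a manual --seconds value unscaled is an assumption, not something confirmed by the runtime.

```python
from typing import Optional


def resolve_seconds(
    predicted_seconds: float,
    seconds: Optional[float] = None,
    duration_scale: float = 1.0,
) -> float:
    """Pick the output length: a manual value wins, else scale the prediction."""
    if seconds is not None:
        # Manual --seconds gives exact control (assumed here to skip scaling).
        return seconds
    if duration_scale <= 0:
        raise ValueError("duration_scale must be positive")
    # --duration-scale multiplies the predicted length (>1 longer, <1 shorter).
    return predicted_seconds * duration_scale
```

For example, a 4-second prediction with a scale of 1.5 resolves to 6 seconds, while passing seconds=10.0 returns 10 seconds regardless of the prediction.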
For faster experimental inference, Sway Sampling can be combined with fewer Euler steps:
uv run --no-sync python infer.py \
--hf-checkpoint Aratako/Irodori-TTS-500M-v3 \
--text "こんにちは、私はAIです。これは音声合成のテストです。" \
--ref-wav path/to/reference.wav \
--num-steps 6 \
--t-schedule-mode sway \
--sway-coeff -1.0 \
--output-wav outputs/sample_sway.wav
For tuning guidance and detailed explanations of inference options, see the Parameter Guide.
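To give a feel for what a sway schedule does to a small step budget, here is a minimal sketch using the sway warp published with F5-TTS. Whether Irodori-TTS uses exactly this expression, and in which time direction it integrates, are assumptions; the function name sway_schedule is hypothetical.

```python
import math


def sway_schedule(num_steps: int, sway_coeff: float) -> list[float]:
    """Return num_steps + 1 timesteps in [0, 1], warped by the sway function."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    schedule = []
    for i in range(num_steps + 1):
        u = i / num_steps  # point on the uniform grid
        # Sway warp (F5-TTS form); the endpoints 0 and 1 are preserved.
        schedule.append(u + sway_coeff * (math.cos(math.pi / 2 * u) - 1 + u))
    return schedule


timesteps = sway_schedule(num_steps=6, sway_coeff=-1.0)
```

With a coefficient of -1.0 the warp reduces to 1 - cos(pi/2 * u), so the first steps are much smaller than on a uniform grid and the solver spends more of its budget near t = 0. A coefficient of 0 recovers the uniform schedule.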
Generated audio is passed through SilentCipher watermarking automatically when the dependency and model files are available.
Encodes audio from a Hugging Face dataset into DACVAE latents and produces a JSONL manifest for training.
uv run --no-sync python prepare_manifest.py \
--dataset myorg/my_dataset \
--split train \
--audio-column audio \
--text-column text \
--output-manifest data/train_manifest.jsonl \
--latent-dir data/latents \
--device cuda
To include speaker_id in the manifest (for speaker-conditioned training):
uv run --no-sync python prepare_manifest.py \
--dataset myorg/my_dataset \
--split train \
--audio-column audio \
--text-column text \
--speaker-column speaker \
--output-manifest data/train_manifest.jsonl \
--latent-dir data/latents \
--device cuda
To include caption in the manifest (for caption-conditioned voice design training):
uv run --no-sync python prepare_manifest.py \
--dataset myorg/my_dataset \
--split train \
--audio-column audio \
--text-column text \
--caption-column caption \
--speaker-column speaker \
--output-manifest data/train_manifest.jsonl \
--latent-dir data/latents \
--device cuda
Speaker/reference labels depend on the training mode:
- For v2 VoiceDesign training, speaker_id is optional because the model learns from text + caption.
- For v3 VoiceDesign training, keep speaker_id available so the model can learn from text + speaker/reference + caption.
- For Speaker Inversion training, speaker_id is not required because the run learns one shared speaker embedding from the target speaker samples.
The manifest caption value may also be a list of strings; training randomly selects one
non-empty caption each time that row is loaded.
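The caption selection rule can be sketched as below. This is not the dataset code from irodori_tts/dataset.py; the helper name pick_caption and the sample row are hypothetical and only illustrate "string or list of strings, pick one non-empty entry at random".

```python
import json
import random
from typing import Optional


def pick_caption(row: dict, rng: random.Random) -> Optional[str]:
    """Return one non-empty caption from a manifest row, or None if there is none."""
    value = row.get("caption")
    candidates = [value] if isinstance(value, str) else list(value or [])
    # Drop empty or whitespace-only captions before sampling.
    candidates = [c for c in candidates if isinstance(c, str) and c.strip()]
    return rng.choice(candidates) if candidates else None


# Illustrative manifest line with a list-valued caption field.
line = '{"text": "hello", "caption": ["calm female voice", "", "soft, close-mic narration"]}'
row = json.loads(line)
caption = pick_caption(row, random.Random(0))
```

Because the choice is made each time a row is loaded, a row with several captions contributes different text/caption pairs across epochs.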
This produces a JSONL manifest with entries like:
{"text": "こんにちは", "caption": "落ち着いた、近い距離感の女性話者", "latent_path": "data/latents/00001.pt", "speaker_id": "myorg/my_dataset:speaker_001", "num_frames": 750}
Single-GPU training:
uv run --no-sync python train.py \
--config configs/train_500m_v3_phase1_body.yaml \
--manifest data/train_manifest.jsonl \
--output-dir outputs/irodori_tts
v3 release training uses two phases. After training the body, initialize the integrated duration predictor from the phase-1 checkpoint:
uv run --no-sync python train.py \
--config configs/train_500m_v3_phase2_duration.yaml \
--manifest data/train_manifest.jsonl \
--output-dir outputs/irodori_tts_duration \
--init-checkpoint outputs/irodori_tts/checkpoint_final.pt
v2 VoiceDesign training uses a dedicated config:
uv run --no-sync python train.py \
--config configs/train_500m_v2_voice_design.yaml \
--manifest data/train_manifest.jsonl \
--output-dir outputs/irodori_tts_voice_design
configs/train_500m_v2_voice_design.yaml sets use_caption_condition: true and disables the
speaker/reference branch. Caption-free configs continue to use speaker conditioning when
speaker_id / reference inputs are available.
v3 VoiceDesign training uses two phases. Phase 1 initializes the RF/DiT body from the v3 base checkpoint while adding the caption branch and skipping the base duration predictor:
uv run --no-sync python train.py \
--config configs/train_500m_v3_voice_design_phase1_body.yaml \
--manifest data/train_manifest.jsonl \
--output-dir outputs/irodori_tts_voice_design_phase1 \
--init-checkpoint path/to/Irodori-TTS-500M-v3.safetensors
Phase 2 adds and trains a newly initialized duration predictor with text + speaker + caption conditioning:
uv run --no-sync python train.py \
--config configs/train_500m_v3_voice_design_phase2_duration.yaml \
--manifest data/train_manifest.jsonl \
--output-dir outputs/irodori_tts_voice_design_phase2 \
--init-checkpoint outputs/irodori_tts_voice_design_phase1/checkpoint_final.pt
The VoiceDesign config also enables caption_warmup: true for optional caption-branch warmup.
warmup_steps controls the LR scheduler, while caption_warmup_steps controls how long
non-caption gradients are discarded before normal joint training resumes.
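The difference between the two warmup settings is easier to see in code. The sketch below covers only the gradient gate controlled by caption_warmup_steps; the function name keeps_gradient and the idea of identifying caption-branch parameters by the substring "caption" in their names are assumptions made for illustration, not the trainer's actual logic.

```python
def keeps_gradient(param_name: str, step: int, caption_warmup_steps: int) -> bool:
    """Decide whether a parameter's gradient is kept at this optimizer step."""
    if step >= caption_warmup_steps:
        return True  # warmup finished: normal joint training
    # Assumption: caption-branch parameters are identifiable by name.
    return "caption" in param_name


# During warmup only the caption branch is updated; afterwards everything is.
during = keeps_gradient("blocks.0.attn.weight", step=10, caption_warmup_steps=100)
after = keeps_gradient("blocks.0.attn.weight", step=100, caption_warmup_steps=100)
```

warmup_steps is independent of this gate: it only shapes the learning rate, so the two values can be tuned separately.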
v3 training uses two phases: configs/train_500m_v3_phase1_body.yaml trains the
variable-length DiT body, then configs/train_500m_v3_phase2_duration.yaml freezes the
body and trains the duration predictor.
The duration predictor regresses log1p(num_frames) with Huber loss. The current v3 phase2
config uses the token-sum duration predictor selected from ablations; see the parameter
guide for the architecture details.
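The regression target and loss named above can be written out in a few lines. This is a plain-Python sketch rather than the training code: the Huber threshold delta=1.0 and the clamp to at least one frame are assumptions, and the function names are hypothetical.

```python
import math


def duration_target(num_frames: int) -> float:
    """Training target: log1p compresses the range of long utterances."""
    return math.log1p(num_frames)


def huber_loss(pred: float, target: float, delta: float = 1.0) -> float:
    """Quadratic near zero error, linear beyond delta (less outlier-sensitive)."""
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err * err
    return delta * (err - 0.5 * delta)


def predicted_frames(pred_log: float) -> int:
    """Invert the target transform with expm1 and round to whole frames."""
    return max(1, round(math.expm1(pred_log)))


target = duration_target(750)       # num_frames from the manifest example
loss = huber_loss(target + 0.5, target)
frames = predicted_frames(target)
```

Regressing in log space means an error of a fixed size in the prediction corresponds to a roughly proportional error in frames, which keeps short and long clips on a comparable scale.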
Multi-GPU DDP training:
uv run --no-sync torchrun --nproc_per_node 4 train.py \
--config configs/train_500m_v3_phase1_body.yaml \
--manifest data/train_manifest.jsonl \
--output-dir outputs/irodori_tts \
--device cuda
Training supports YAML config files with model and train sections. CLI arguments take precedence over YAML values. See uv run --no-sync python train.py --help for all available options.
For a more detailed explanation of model and training config fields, see Parameter Guide.
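The precedence rule (CLI over YAML) can be illustrated with a small merge helper. This is a sketch under assumptions: merge_train_options is a hypothetical name, the YAML file is represented as an already-parsed dict, the field values are made up, and "None means the flag was not passed" is an assumed convention rather than the behavior of train.py.

```python
from typing import Any, Optional


def merge_train_options(
    yaml_train: dict[str, Any],
    cli_args: dict[str, Optional[Any]],
) -> dict[str, Any]:
    """Start from the YAML values and let explicitly passed CLI values win."""
    merged = dict(yaml_train)
    for key, value in cli_args.items():
        if value is not None:  # None: the flag was not given on the command line
            merged[key] = value
    return merged


# Illustrative values only; the real configs define many more fields.
yaml_train = {"warmup_steps": 1000, "gradient_checkpointing": False}
cli_args = {"warmup_steps": None, "gradient_checkpointing": True}
merged = merge_train_options(yaml_train, cli_args)
```

Here warmup_steps keeps its YAML value because no CLI value was supplied, while gradient_checkpointing is overridden by the command line.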
Start a new training run from released inference weights (.safetensors). This initializes only the model weights; optimizer / scheduler state starts fresh. For the v3 base release, the LoRA config keeps the duration predictor as part of the saved adapter by default.
uv run --no-sync python train.py \
--config configs/train_500m_v3_lora.yaml \
--manifest data/train_manifest.jsonl \
--output-dir outputs/irodori_tts_lora \
--init-checkpoint path/to/Irodori-TTS-500M-v3.safetensors
v3 VoiceDesign LoRA fine-tuning:
uv run --no-sync python train.py \
--config configs/train_500m_v3_voice_design_lora.yaml \
--manifest data/train_manifest.jsonl \
--output-dir outputs/irodori_tts_voice_design_lora \
--init-checkpoint path/to/Irodori-TTS-600M-v3-VoiceDesign.safetensors
For the older v2 VoiceDesign checkpoint, use configs/train_500m_v2_voice_design_lora.yaml
and initialize from Irodori-TTS-500M-v2-VoiceDesign.safetensors.
LoRA target presets, adapter saving behavior, and resume details are covered in the Parameter Guide.
Speaker Inversion trains only a small set of speaker embedding tokens while keeping the base Irodori-TTS model frozen. It is useful when you want a reusable speaker identity checkpoint instead of providing reference audio at every inference call.
Prepare a manifest from the target speaker's audio, then initialize from the released v3 base checkpoint:
uv run --no-sync python train.py \
--config configs/train_500m_v3_speaker_inversion.yaml \
--manifest data/target_speaker_manifest.jsonl \
--init-checkpoint path/to/Irodori-TTS-500M-v3.safetensors \
--output-dir outputs/speaker_inversion/name
The saved checkpoints are embedding-only .speaker.safetensors files, for example
outputs/speaker_inversion/name/checkpoint_final.speaker.safetensors. Use that file
with the base model during inference:
uv run --no-sync python infer.py \
--checkpoint path/to/Irodori-TTS-500M-v3.safetensors \
--ref-embed outputs/speaker_inversion/name/checkpoint_final.speaker.safetensors \
--text "こんにちは、これは学習した話者埋め込みを使った推論です。" \
--output-wav outputs/sample_speaker_inversion.wav
To continue from a saved embedding, set speaker_inversion_init_embedding in the
config or pass --speaker-inversion-init-embedding path/to/checkpoint.speaker.safetensors.
Full trainer --resume is intentionally not used for Speaker Inversion checkpoints.
Enable gradient_checkpointing: true or pass --gradient-checkpointing if GPU memory is tight.
Resume an existing training run from a training checkpoint. Full-model runs use .pt; LoRA runs use checkpoint directories. Both restore optimizer, scheduler, and step state.
uv run --no-sync python train.py \
--config configs/train_500m_v3_phase1_body.yaml \
--manifest data/train_manifest.jsonl \
--output-dir outputs/irodori_tts \
--resume outputs/irodori_tts/checkpoint_0010000.pt
LoRA resume example:
uv run --no-sync python train.py \
--config configs/train_500m_v3_lora.yaml \
--manifest data/train_manifest.jsonl \
--output-dir outputs/irodori_tts_lora \
--resume outputs/irodori_tts_lora/checkpoint_0010000
If you move a LoRA checkpoint to another environment and the original base-checkpoint path is no longer valid, pass --init-checkpoint path/to/base_model.safetensors together with --resume to override the saved base-model path.
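When scripting restarts, it can help to locate the most recent step checkpoint to pass to --resume. The helper below is a hypothetical convenience, not part of the repository; it assumes the checkpoint_<step> naming seen in the examples above (with a .pt suffix for full-model runs and no suffix for LoRA directories).

```python
import re
from pathlib import Path
from typing import Optional

# Matches checkpoint_0010000.pt (full model) and checkpoint_0010000 (LoRA dir).
_STEP_PATTERN = re.compile(r"^checkpoint_(\d+)(?:\.pt)?$")


def latest_checkpoint(output_dir: Path) -> Optional[Path]:
    """Return the step checkpoint with the highest step number, if any."""
    best = None
    for path in output_dir.iterdir():
        match = _STEP_PATTERN.match(path.name)
        if match is None:
            continue  # skips checkpoint_final.* and unrelated files
        step = int(match.group(1))
        if best is None or step > best[0]:
            best = (step, path)
    return None if best is None else best[1]
```

A call such as latest_checkpoint(Path("outputs/irodori_tts")) returns the path to hand to --resume, or None when the directory holds no step checkpoints.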
Convert a training checkpoint to inference-only safetensors format:
uv run --no-sync python convert_checkpoint_to_safetensors.py outputs/checkpoint_final.pt
LoRA adapter checkpoints can also be converted directly:
uv run --no-sync python convert_checkpoint_to_safetensors.py outputs/irodori_tts_lora/checkpoint_final
LoRA adapter checkpoints are merged into the base model automatically during conversion, so the exported .safetensors file is directly usable for inference. If you do not want to merge the adapter, pass the adapter directory directly to infer.py --lora-adapter or the matching Gradio field.
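For intuition about what merging means, the standard LoRA formulation folds the low-rank update into each adapted weight matrix. The expression below is that general formulation, assumed to apply here; the converter's exact handling (for example of the duration predictor) is not described above.

```latex
W_{\text{merged}} = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\quad A \in \mathbb{R}^{r \times d_{\text{in}}}
```

Here $W_0$ is the frozen base weight, $r$ is the adapter rank, and $\alpha$ is the LoRA scaling factor. After merging, inference needs no adapter files and the layer has the same shape as in the base checkpoint.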
Irodori-TTS/
├── train.py # Training entry point (DDP support)
├── infer.py # CLI inference
├── gradio_app.py # Gradio web UI
├── gradio_app_voicedesign.py # Gradio web UI for VoiceDesign checkpoints
├── prepare_manifest.py # Dataset -> DACVAE latent preprocessing
├── convert_checkpoint_to_safetensors.py # Checkpoint converter
│
├── docs/
│   └── parameters.md # Detailed parameter guide
│
├── irodori_tts/ # Core library
│   ├── model.py # TextToLatentRFDiT architecture
│   ├── rf.py # Rectified Flow utilities & Euler CFG sampling
│   ├── codec.py # DACVAE codec wrapper
│   ├── dataset.py # Dataset and collator
│   ├── tokenizer.py # Pretrained LLM tokenizer wrapper
│   ├── config.py # Model / Train / Sampling config dataclasses
│   ├── inference_runtime.py # Cached, thread-safe inference runtime
│   ├── lora.py # PEFT LoRA integration helpers
│   ├── speaker_inversion.py # Speaker Inversion embedding save/load helpers
│   ├── text_normalization.py # Japanese text normalization
│   ├── optim.py # Muon + AdamW optimizer
│   └── progress.py # Training progress tracker
│
└── configs/
    ├── train_500m_v3_phase1_body.yaml # 500M v3 body training config
    ├── train_500m_v3_phase2_duration.yaml # 500M v3 duration-predictor training config
    ├── train_500m_v3_voice_design_phase1_body.yaml # 600M v3 VoiceDesign body config
    ├── train_500m_v3_voice_design_phase2_duration.yaml # 600M v3 VoiceDesign duration config
    ├── train_500m_v3_voice_design_lora.yaml # 600M v3 VoiceDesign RF+duration LoRA config
    ├── train_500m_v3_lora.yaml # 500M v3 LoRA fine-tuning config
    ├── train_500m_v3_speaker_inversion.yaml # 500M v3 Speaker Inversion config
    ├── train_500m_v2.yaml # 500M v2 backward-compatible model config
    ├── train_500m_v2_lora.yaml # 500M v2 LoRA fine-tuning config
    ├── train_500m_v2_voice_design.yaml # 500M v2 VoiceDesign full fine-tuning config
    ├── train_500m_v2_voice_design_lora.yaml # 500M v2 VoiceDesign LoRA fine-tuning config
    ├── train_500m.yaml # 500M v1 model config
    └── train_2.5b.yaml # 2.5B parameter model config
- Code: MIT License
- Model Weights: Please refer to the base model card and the VoiceDesign model card for licensing details
This project builds upon the following works:
- Echo-TTS — Architecture and training design reference
- DACVAE — Audio VAE
- SilentCipher — Audio watermarking
@misc{irodori-tts,
author = {Chihiro Arata},
title = {Irodori-TTS: A Flow Matching-based Text-to-Speech Model with Emoji-driven Style Control},
year = {2026},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/Aratako/Irodori-TTS}}
}