open-bible-models

TTS models trained on Open Bible datasets for low-resource languages.

Getting Started

This repo uses Git submodules (e.g. F5-TTS), which are not fetched by default when cloning. You need to initialize them explicitly

# Clone with submodules
git clone --recurse-submodules https://github.com/davidguzmanr/open-bible-models.git

# Or, if already cloned without submodules
git submodule update --init --recursive

# To pull the latest changes
git pull --recurse-submodules

F5-TTS

We use F5-TTS to train text-to-speech models from scratch on per-language Bible audio data.

Setup

cd F5-TTS
conda env create -f environment.yaml
conda activate F5-TTS
pip install -e .

Data Preparation

The prepare_data.py script automates the full pipeline: metadata creation, audio preprocessing, vocabulary generation, training parameter estimation, and config generation.

cd F5-TTS

# Single language (downloads from Hugging Face)
python prepare_data.py --languages Yoruba

# Multiple languages, each trained separately
python prepare_data.py --languages Yoruba Ewe Hausa

# Multilingual (combine into one dataset and one config)
python prepare_data.py --languages Yoruba Ewe Hausa --multilingual

# Filter to a specific testament for one or more languages
python prepare_data.py --languages Yoruba --filter-testament '{"Yoruba": "New Testament"}'
python prepare_data.py --languages Yoruba Ewe Hausa --multilingual --filter-testament '{"Yoruba": "New Testament"}'

By default, audio data is downloaded from davidguzmanr/open-bible-resources on Hugging Face.

For each language it:

Downloads wav files from Hugging Face.
Runs prepare_csv_wavs.py to produce raw.arrow, duration.json, and vocab.txt
Verifies the generated vocabulary
Estimates training epochs based on dataset size and target updates
Generates a Hydra config at src/f5_tts/configs/F5TTS_v1_Base_Open_Bible_{Lang}.yaml

Key options:

Flag	Default	Description
`--hf-repo`	`davidguzmanr/open-bible-resources`	Hugging Face dataset repo
`--max-duration`	15	Discard audio files longer than this many seconds; 0 to disable
`--target-updates`	500,000	Total training updates to target
`--batch-size-per-gpu`	28,000	Frame budget per GPU batch
`--max-samples`	32	Max sequences per batch
`--num-gpus`	1	Number of GPUs (affects epoch calculation)
`--workers`	4	Threads for audio preprocessing
`--skip-preprocess`	off	Skip `prepare_csv_wavs.py` if already done
`--multilingual`	off	Combine all languages into one dataset and config
`--filter-testament`	`null`	JSON dict mapping language names to testament strings (e.g. `'{"Yoruba": "New Testament"}'`); languages not listed use all data

Training

After data preparation, launch training with the generated config:

cd F5-TTS

# Single GPU
accelerate launch --mixed_precision bf16 \
    src/f5_tts/train/train.py \
    --config-name F5TTS_v1_Base_Open_Bible_Yoruba.yaml

# Multi-GPU (e.g. 2 GPUs)
accelerate launch --num_processes 2 --mixed_precision bf16 \
    src/f5_tts/train/train.py \
    --config-name F5TTS_v1_Base_Open_Bible_Yoruba.yaml

Training steps in single vs multi-GPU

The generated config (e.g. Yoruba) sets optim.epochs=575, which is calibrated so that a single-GPU run reaches 500k optimizer updates. The prepare_data.py script always computes epochs relative to 1-GPU update counts, so the same config works for any number of GPUs with equivalent total training data:

	1 GPU	2 GPUs	4 GPUs
Effective batch size	28,000 frames	56,000 frames	112,000 frames
Updates per epoch	~870	~435	~218
Total updates (575 epochs)	~500k	~250k	~125k
Total data processed	same	same	same

With DDP (Accelerate's default split_batches=False), each GPU processes batch_size_per_gpu frames independently, then gradients are averaged. More GPUs means a larger effective batch size per update, fewer updates per epoch, but the same number of passes over the dataset. Wall-clock time scales roughly linearly with the number of GPUs.

Finetuning from a Multilingual Checkpoint

To finetune an existing multilingual model, pass --finetune with the checkpoint and its vocabulary. The language argument accepts a testament suffix (e.g. Kikuyu:NT) to restrict data to one testament:

cd F5-TTS

python prepare_data.py --languages Maori:NT --finetune \
    --pretrain-ckpt ckpts/F5TTS_v1_Base_vocos_custom_open-bible-hiligaynon-maori-nt-vietnamese/model_last.pt \
    --pretrain-vocab data/open-bible-hiligaynon-maori-nt-vietnamese_custom/vocab.txt \
    --num-gpus 2 --target-updates 200000

Then launch training with the finetune command that prepare_data.py outputs

accelerate launch --num_processes 2 --mixed_precision bf16 \
    src/f5_tts/train/finetune_cli.py \
    --exp_name F5TTS_v1_Base \
    --dataset_name open-bible-maori-nt-finetune \
    --finetune \
    --pretrain ckpts/F5TTS_v1_Base_vocos_custom_open-bible-hiligaynon-maori-nt-vietnamese/model_last.pt \
    --tokenizer custom \
    --tokenizer_path data/open-bible-hiligaynon-maori-nt-vietnamese_custom/vocab.txt \
    --epochs 1243 \
    --learning_rate 1e-5 \
    --batch_size_per_gpu 28000 \
    --max_samples 32 \
    --num_warmup_updates 2000 \
    --save_per_updates 10000 \
    --logger tensorboard

Multilingual Training with Language-ID Tokens

By default, multilingual training combines all languages into a single dataset with no explicit signal about which language is being spoken. Adding the --prepend flag prefixes every text sample with a language-ID token such as <|Yoruba|>, giving the model a dedicated embedding per language.

Data preparation

cd F5-TTS

python prepare_data.py \
    --languages Yoruba Ewe Hausa \
    --multilingual \
    --prepend

Each text in the combined dataset is transformed:

# without --prepend
In the beginning God created the heavens and the earth.

# with --prepend
<|Yoruba|> In the beginning God created the heavens and the earth.

The token <|Yoruba|> is treated as a single indivisible entry in vocab.txt (not split into individual characters), so the model gets one dedicated embedding for each language.

Training

The generated config is identical to the standard multilingual case. Launch training the same way:

# Single GPU
accelerate launch --mixed_precision bf16 \
    src/f5_tts/train/train.py \
    --config-name F5TTS_v1_Base_Open_Bible_Ewe-Hausa-Yoruba.yaml

# Multi-GPU (e.g. 2 GPUs)
accelerate launch --num_processes 2 --mixed_precision bf16 \
    src/f5_tts/train/train.py \
    --config-name F5TTS_v1_Base_Open_Bible_Ewe-Hausa-Yoruba.yaml

Note: Tagged and untagged datasets are not compatible. A model trained with --prepend has language-ID tokens in its vocabulary and expects them at inference time; a model trained without --prepend does not. Do not mix them.

Inference

Use inference-test-split-tagged.py instead of inference-test-split.py. It automatically prepends the correct <|Language|> token to both the reference text and each generated text, and can process multiple languages in one run against the same multilingual checkpoint.

cd F5-TTS

# Single language
python inference-test-split-tagged.py \
    --languages Yoruba \
    --ckpt_path ckpts/F5TTS_v1_Base_vocos_custom_open-bible-ewe-hausa-yoruba/model_last.pt \
    --vocab_file data/open-bible-ewe-hausa-yoruba_custom/vocab.txt \
    --model_cfg src/f5_tts/configs/F5TTS_v1_Base_Open_Bible_Ewe-Hausa-Yoruba.yaml \
    --metadata_path data/open-bible-ewe-hausa-yoruba/metadata.csv

# All languages in one shot (model is loaded only once)
python inference-test-split-tagged.py \
    --languages Yoruba Ewe Hausa \
    --ckpt_path ckpts/F5TTS_v1_Base_vocos_custom_open-bible-ewe-hausa-yoruba/model_last.pt \
    --vocab_file data/open-bible-ewe-hausa-yoruba_custom/vocab.txt \
    --model_cfg src/f5_tts/configs/F5TTS_v1_Base_Open_Bible_Ewe-Hausa-Yoruba.yaml \
    --metadata_path data/open-bible-ewe-hausa-yoruba/metadata.csv

Outputs are written to synthesis_output/<checkpoint-dir>/<language>/generated/ with a matching ground-truth/ directory for evaluation.

EveryVoice (FastSpeech2)

We also train TTS models using EveryVoice, which uses a FastSpeech2-based feature prediction model followed by a HiFi-GAN vocoder.

Setup

Follow the EveryVoice installation guide. Once installed, activate the environment:

conda activate EveryVoice

Data Preparation

The data is the same as what F5-TTS/prepare_data.py produces. EveryVoice expects a filelist (.psv) and audio files; preprocessing is handled by everyvoice preprocess (see below).

Project Setup

Create a new project using the interactive wizard with default parameters:

everyvoice new-project

This generates the config files under config/ (everyvoice-shared-data.yaml, everyvoice-text-to-spec.yaml, everyvoice-spec-to-wav.yaml).

Training

Training involves three stages: preprocessing, feature prediction (text → mel spectrogram), and vocoder fine-tuning (mel → waveform). See jobs/EveryVoice-Open-Bible-Yoruba-NT.sh for the exact parameters used.

# 1. Preprocess
everyvoice preprocess config/everyvoice-text-to-spec.yaml \
    --config-args preprocessing.audio.max_audio_length=15 \
    --config-args preprocessing.train_split=1.0 \
    --overwrite

# 2. Train feature prediction model (2 GPUs, DDP)
everyvoice train text-to-spec config/everyvoice-text-to-spec.yaml \
    --devices 2 --strategy ddp \
    --config-args training.max_steps=250000 \
    --config-args training.batch_size=32

# 3. Fine-tune vocoder
everyvoice train spec-to-wav config/everyvoice-spec-to-wav.yaml \
    --devices 2 --strategy ddp \
    --config-args training.finetune=True \
    --config-args training.max_steps=50000

Evaluation

Once a model is trained, evaluate it on the test split by first synthesizing audio with the trained checkpoint and then running evaluate-tts.py.

python evaluate-tts.py \
    --ground_truth_dir /path/to/ground-truth \
    --synthesized_dir /path/to/generated \
    --metadata_csv /path/to/test.csv \
    --output_csv /path/to/results.csv \
    --metrics utmos wer \
    --asr-lang yor_Latn \
    --system-name everyvoice

evaluate-tts.py supports four metrics:

Metric	Type	Direction
`mcd`	Mel Cepstral Distortion	lower is better
`speechbertscore`	WavLM-based similarity	higher is better
`utmos`	Predicted MOS (UTMOSv2)	higher is better
`wer`	Word Error Rate via ASR	lower is better

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
EveryVoice-TTS		EveryVoice-TTS
F5-TTS		F5-TTS
VITS-TTS		VITS-TTS
jobs		jobs
utils		utils
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

open-bible-models

Getting Started

F5-TTS

Setup

Data Preparation

Training

Training steps in single vs multi-GPU

Finetuning from a Multilingual Checkpoint

Multilingual Training with Language-ID Tokens

Data preparation

Training

Inference

EveryVoice (FastSpeech2)

Setup

Data Preparation

Project Setup

Training

Evaluation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

open-bible-models

Getting Started

F5-TTS

Setup

Data Preparation

Training

Training steps in single vs multi-GPU

Finetuning from a Multilingual Checkpoint

Multilingual Training with Language-ID Tokens

Data preparation

Training

Inference

EveryVoice (FastSpeech2)

Setup

Data Preparation

Project Setup

Training

Evaluation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages