aangelakis/STOMA

STOMA: A Multi-Speaker Greek Speech Corpus

STOMA is a multi-speaker Greek speech corpus designed to advance research in text-to-speech (TTS) synthesis and related speech technologies for Greek, an under-resourced language. The corpus comprises approximately 23 hours of read speech from six native speakers (three male and three female), recorded in a dual-booth studio setup to ensure acoustic consistency and high signal quality. The spoken material was selected from the Greek Harvard Corpus and the Text Bank of the Center for the Greek Language, specifically from texts corresponding to the B2, C1, and C2 proficiency levels of the Certification of Attainment in Greek, ensuring linguistically rich and pedagogically well-balanced content. All recordings were standardized to 44.1 kHz, 16-bit mono PCM and processed through a hybrid quality-control pipeline that combines automated normalization with manual verification. To assess dataset quality, we trained neural TTS systems based on the FastSpeech2 acoustic model and the HiFi-GAN vocoder, which produced natural and intelligible synthesized speech. The resulting corpus is a publicly accessible, high-quality resource that supports both linguistic research and the development of modern speech synthesis systems for Greek.

Dataset

The dataset is available at the following location: STOMA

Table 1: Text statistics for the full text collection (Greek Harvard, B2, C1, and C2), recorded in its entirety by the two primary speakers.

| Statistic | Gr. Harvard | B2 | C1 | C2 | Total |
|---|---|---|---|---|---|
| Total Sentences | 720 | 1,087 | 1,571 | 1,296 | 4,674 |
| Total Words | 5,539 | 15,102 | 25,263 | 22,722 | 68,626 |
| Total Characters | 32,301 | 95,119 | 164,700 | 152,904 | 445,024 |
| Distinct Words | 3,343 | 5,606 | 8,965 | 8,275 | 20,253 |
| Mean Words per Sentence | 7.69 | 13.89 | 16.08 | 17.53 | 14.68 |
| Min Words per Sentence | 5 | 1 | 1 | 1 | 1 |
| Max Words per Sentence | 9 | 32 | 41 | 39 | 41 |

Table 2: Text statistics for the reduced text subset selected from the full collection, used for recordings by the secondary speakers.

| Statistic | Gr. Harvard | B2 | C1 | C2 | Subset Total |
|---|---|---|---|---|---|
| Total Sentences | 720 | 190 | 194 | 197 | 1,301 |
| Total Words | 5,539 | 2,913 | 3,057 | 3,265 | 14,774 |
| Total Characters | 32,301 | 18,286 | 20,190 | 22,110 | 92,887 |
| Distinct Words | 3,343 | 1,519 | 1,604 | 1,676 | 6,899 |
| Mean Words per Sentence | 7.69 | 15.33 | 15.76 | 16.57 | 11.36 |
| Min Words per Sentence | 5 | 4 | 3 | 4 | 3 |
| Max Words per Sentence | 9 | 28 | 28 | 30 | 30 |

Table 3: Speech statistics of the STOMA corpus per speaker and overall.

| Speaker ID | Clips | Total Duration (h:mm:ss) | Mean Duration (s) | Min–Max (s) |
|---|---|---|---|---|
| F | 4,674 | 7:46:21 | 5.99 | 0.41–16.60 |
| M | 4,674 | 8:53:37 | 6.85 | 0.48–22.31 |
| F1 | 1,301 | 1:46:58 | 4.93 | 1.65–15.08 |
| F2 | 1,301 | 1:27:38 | 4.04 | 1.46–11.13 |
| M1 | 1,301 | 1:47:06 | 4.94 | 1.64–14.99 |
| M2 | 1,301 | 1:26:43 | 4.00 | 1.10–10.39 |
| Total | 14,552 | 23:08:26 | 5.72 | 0.41–22.31 |
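Per-speaker figures like those in Table 3 can be recomputed from the audio files themselves. The following is a minimal sketch using only the Python standard library; it assumes, hypothetically, that each speaker's clips sit as `.wav` files in a single folder, which may not match the layout of the released dataset. The function names are illustrative and are not part of this repository.

```python
import wave
from pathlib import Path


def clip_duration(path):
    """Return the duration of one PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def summarize(wav_dir):
    """Compute clip count and total/mean/min/max duration for one folder.

    Assumption: all clips of a speaker are *.wav files directly in wav_dir.
    """
    durations = [clip_duration(p) for p in sorted(Path(wav_dir).glob("*.wav"))]
    if not durations:
        raise ValueError(f"no .wav files found in {wav_dir}")
    total = sum(durations)
    return {
        "clips": len(durations),
        "total_s": total,
        "mean_s": total / len(durations),
        "min_s": min(durations),
        "max_s": max(durations),
    }


def format_hms(seconds):
    """Format seconds as h:mm:ss, the notation used for total durations."""
    seconds = int(round(seconds))
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"
```

Calling `summarize` on a speaker folder and passing `total_s` to `format_hms` gives values that can be compared directly with the corresponding row of the table.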

Table 4: Speaker-specific acoustic characteristics of the STOMA corpus. SR: average speech rate (syllables/s, including pauses); AR: average articulation rate (syllables/s, excluding pauses); ASD: average syllable duration (s), computed across all utterances per speaker.

| Speaker ID | Mean F0 (Hz) | Std F0 (Hz) | SR | AR | ASD (s) |
|---|---|---|---|---|---|
| M | 118.73 | 20.80 | 4.143 | 4.286 | 0.236 |
| F | 195.53 | 40.16 | 4.462 | 4.662 | 0.218 |
| M1 | 128.30 | 23.34 | 4.165 | 4.403 | 0.230 |
| F1 | 200.93 | 23.57 | 3.887 | 3.974 | 0.257 |
| M2 | 109.33 | 15.49 | 4.407 | 4.527 | 0.225 |
| F2 | 187.29 | 26.88 | 4.238 | 4.409 | 0.232 |
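The rate measures in Table 4 can be illustrated with a short sketch. It follows the caption's definitions of SR and AR; the ASD formula (speaking time divided by syllable count) and the per-utterance averaging are assumptions, since the README does not state how syllables were counted, how pauses were detected, or how ASD was derived. The function names are hypothetical.

```python
def rate_measures(n_syllables, total_duration_s, pause_duration_s):
    """Return SR, AR and ASD for one utterance.

    SR  = syllables / total duration (pauses included)
    AR  = syllables / speaking time  (pauses excluded)
    ASD = speaking time / syllables  (assumed definition)
    """
    speaking_time_s = total_duration_s - pause_duration_s
    if n_syllables <= 0 or speaking_time_s <= 0:
        raise ValueError("need at least one syllable and positive speaking time")
    return {
        "SR": n_syllables / total_duration_s,
        "AR": n_syllables / speaking_time_s,
        "ASD": speaking_time_s / n_syllables,
    }


def speaker_averages(utterances):
    """Average the per-utterance measures.

    utterances: iterable of (n_syllables, total_duration_s, pause_duration_s).
    """
    rows = [rate_measures(*u) for u in utterances]
    if not rows:
        raise ValueError("no utterances given")
    return {key: sum(r[key] for r in rows) / len(rows) for key in rows[0]}
```

Under these assumptions, an utterance with 20 syllables, 5.0 s total duration and 1.0 s of pauses has SR = 4.0, AR = 5.0 and ASD = 0.2 s.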

Table 5: Demographic information of the STOMA corpus speakers.

| Speaker ID | Age | Sex | Region Raised |
|---|---|---|---|
| M | 29 | Male | Athens |
| F | 37 | Female | Crete |
| M1 | 30 | Male | Alexandroupoli |
| F1 | 21 | Female | Larisa |
| M2 | 22 | Male | Larisa |
| F2 | 26 | Female | Athens |

Table 6: Audio format specifications of the STOMA corpus recordings.

| Property | Value |
|---|---|
| File format | WAV (RIFF) |
| Encoding | Pulse-Code Modulation (PCM) |
| Compression | None (uncompressed) |
| Channels | 1 (mono) |
| Sampling rate | 44.1 kHz |
| Bit depth | 16-bit signed integer |
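Before training, it can be useful to confirm that a downloaded clip matches these specifications. The sketch below uses Python's standard `wave` module, which reads uncompressed PCM WAV files; the helper names are illustrative and are not part of this repository.

```python
import wave

# Expected values taken from Table 6 (16-bit = 2 bytes per sample).
EXPECTED = {"channels": 1, "sample_rate": 44100, "sample_width_bytes": 2}


def read_format(path):
    """Read channel count, sampling rate and sample width from a WAV header."""
    with wave.open(str(path), "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
        }


def format_mismatches(path):
    """Return {property: (expected, actual)} for every property that differs."""
    actual = read_format(path)
    return {
        key: (expected, actual[key])
        for key, expected in EXPECTED.items()
        if actual[key] != expected
    }
```

An empty dictionary from `format_mismatches` means the file header agrees with the table.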

Requirements

The training framework employs the ESPnet toolkit for acoustic modeling (FastSpeech2) and ParallelWaveGAN for the vocoder (HiFi-GAN).

To replicate the training pipeline, the following toolkits must be installed:

  • Acoustic Model: To train the acoustic model, ESPnet is required. Please refer to the official ESPnet installation guide.
  • Vocoder: To train the vocoder, ParallelWaveGAN is required. Installation instructions are available in the ParallelWaveGAN repository.

Training Procedure

Acoustic Model (FastSpeech2)

The training process involves a two-stage pipeline: first training a Teacher model (Tacotron2) to extract durations, and then training the Student model (FastSpeech2).

Training Strategy:

  • Main Speaker (Main_Speaker_M / Main_Speaker_F): FastSpeech2 models are trained from scratch to establish a robust baseline.
  • Secondary Speakers: Models are fine-tuned from the corresponding gender-specific main speaker checkpoints (e.g., male speakers are fine-tuned from Main_Speaker_M).
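The strategy above amounts to a small lookup from speaker to initialization source. The sketch below encodes it using the speaker IDs from the tables; it assumes that speaker IDs `M` and `F` correspond to `Main_Speaker_M` and `Main_Speaker_F`, and the function and dictionary names are hypothetical.

```python
# Main speakers are trained from scratch (assumed ID-to-name correspondence).
MAIN_SPEAKERS = {"M": "Main_Speaker_M", "F": "Main_Speaker_F"}

# Each secondary speaker is fine-tuned from the main speaker of the same gender.
SECONDARY_SPEAKERS = {"M1": "M", "M2": "M", "F1": "F", "F2": "F"}


def training_plan(speaker_id):
    """Return how the FastSpeech2 model for this speaker is initialized."""
    if speaker_id in MAIN_SPEAKERS:
        return {"speaker": speaker_id, "mode": "from_scratch", "init_from": None}
    if speaker_id in SECONDARY_SPEAKERS:
        parent = SECONDARY_SPEAKERS[speaker_id]
        return {
            "speaker": speaker_id,
            "mode": "fine_tune",
            "init_from": MAIN_SPEAKERS[parent],
        }
    raise KeyError(f"unknown speaker ID: {speaker_id}")
```

For example, `training_plan("M2")` reports fine-tuning from `Main_Speaker_M`, while `training_plan("F")` reports training from scratch.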

Step 1: Train Tacotron2

By default, the run script is configured to train the Tacotron2 model.

  1. Navigate to the FastSpeech2 directory located under Speakers.
  2. Crucial Step: Open the db.sh file and define the absolute path to your dataset.
  3. Execute the training script:
./run.sh

Step 2: Generate Alignments

Once the Tacotron2 model is trained, you must generate the alignments (durations) and statistics required for FastSpeech2. Execute the following command to run teacher-forcing inference:

./run.sh --stage 8 \
    --tts_exp exp/tts_train_raw_phn_tacotron_g2p_en_no_space \
    --inference_args "--use_teacher_forcing true" \
    --test_sets "tr_no_dev_phn dev_phn eval1_phn"

Step 3: Train FastSpeech2

After generating the alignments, train the FastSpeech2 model using the Tacotron2 teacher's dump directory:

./run.sh --stage 6 \
    --train_config conf/tuning/train_fastspeech2.yaml \
    --teacher_dumpdir exp/tts_train_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave \
    --tts_stats_dir exp/tts_train_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave/stats \
    --write_collected_feats true

Configuration Note: Prior to execution, you must define the absolute path to the dataset within the db.sh file.
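The three steps can also be driven from a single script. The following is a minimal sketch that only assembles the commands shown above and, by default, prints them instead of running them; the experiment directory name is copied from those commands and may differ in your setup, and the function names are hypothetical.

```python
import shlex
import subprocess

# Paths copied from the commands above; adjust if your experiment name differs.
TEACHER_EXP = "exp/tts_train_raw_phn_tacotron_g2p_en_no_space"
TEACHER_DUMP = f"{TEACHER_EXP}/decode_use_teacher_forcingtrue_train.loss.ave"


def pipeline_commands():
    """Return the three run.sh invocations as argument lists, in order."""
    return [
        # Step 1: train the Tacotron2 teacher (default configuration).
        ["./run.sh"],
        # Step 2: teacher-forcing inference to produce durations and stats.
        ["./run.sh", "--stage", "8",
         "--tts_exp", TEACHER_EXP,
         "--inference_args", "--use_teacher_forcing true",
         "--test_sets", "tr_no_dev_phn dev_phn eval1_phn"],
        # Step 3: train FastSpeech2 from the teacher's dump directory.
        ["./run.sh", "--stage", "6",
         "--train_config", "conf/tuning/train_fastspeech2.yaml",
         "--teacher_dumpdir", TEACHER_DUMP,
         "--tts_stats_dir", f"{TEACHER_DUMP}/stats",
         "--write_collected_feats", "true"],
    ]


def run_pipeline(recipe_dir, dry_run=True):
    """Print the commands (dry run) or execute them in order inside recipe_dir."""
    for cmd in pipeline_commands():
        if dry_run:
            print(shlex.join(cmd))
        else:
            # check=True stops the pipeline as soon as one step fails.
            subprocess.run(cmd, cwd=recipe_dir, check=True)
```

Passing arguments as lists avoids shell quoting problems with the multi-word values of `--inference_args` and `--test_sets`.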

Vocoder (HiFi-GAN)

Vocoder training requires the ParallelWaveGAN library. In this implementation, HiFi-GAN models are trained exclusively on the main speakers. These gender-specific vocoders are subsequently utilized for all remaining speakers during synthesis.

To initiate vocoder training, please refer to the standard ESPnet2 recipe instructions provided here: HiFi-GAN training instructions.

Permission Denied Errors: If you encounter a "Permission Denied" error, please grant execution rights to the relevant files using the following command:

chmod u+x <filename>

Inference Procedure

To perform speech synthesis, navigate to the Speakers/inference directory.

Configuration Prerequisites

Before executing the script, ensure the following:

  1. Model Paths: The paths to the model weights must be correctly defined.

  2. Config Updates: In the config.yaml for each model, ensure the stats_file paths under normalize_conf, pitch_normalize_conf, and energy_normalize_conf are valid and accessible.

    Locating the Statistics Files: For locally trained models, the required statistics files can be found in the following directory:

    {Speaker}/FastSpeech2/exp/tts_train_tacotron2_raw_phn_none/inference_use_teacher_forcingtrue_train.loss.ave/stats/train/
    

    You will need to map the configuration entries to these specific files:

    • normalize_conf → feats_stats.npz
    • pitch_normalize_conf → pitch_stats.npz
    • energy_normalize_conf → energy_stats
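The mapping above can be applied programmatically once a model's config.yaml has been loaded into a dictionary (for instance with a YAML parser of your choice). The sketch below assumes the three `*_conf` sections are top-level keys that each hold a `stats_file` entry; the helper names are hypothetical, and the file names are copied as listed above.

```python
import copy
from pathlib import Path

# Section name -> statistics file name, copied from the list above.
# "energy_stats" is kept exactly as written there; adjust it if your
# file on disk carries an extension.
STATS_FILES = {
    "normalize_conf": "feats_stats.npz",
    "pitch_normalize_conf": "pitch_stats.npz",
    "energy_normalize_conf": "energy_stats",
}


def with_stats_paths(config, stats_dir):
    """Return a copy of config whose stats_file entries point into stats_dir."""
    updated = copy.deepcopy(config)
    for section, filename in STATS_FILES.items():
        section_conf = dict(updated.get(section) or {})
        section_conf["stats_file"] = str(Path(stats_dir) / filename)
        updated[section] = section_conf
    return updated


def missing_stats(config):
    """List the sections whose stats_file does not exist on disk."""
    missing = []
    for section in STATS_FILES:
        stats_file = (config.get(section) or {}).get("stats_file", "")
        if not stats_file or not Path(stats_file).is_file():
            missing.append(section)
    return missing
```

Running `missing_stats` on the updated dictionary before inference gives an early warning if any of the three paths is invalid.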

Execution

Run the inference script as follows:

python inference.py --text "Οι έρευνες δείχνουν πως ένας γαλήνιος και ήρεμος γονιός λειτουργεί για το παιδί ως λιμάνι στις καταιγίδες" --speaker M --ckpt 500000
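To synthesize several sentences or voices in one go, the same command can be generated in a loop. The sketch below builds one `inference.py` invocation per sentence and speaker using only the `--text`, `--speaker` and `--ckpt` flags shown above; the helper names are hypothetical, and whether a given checkpoint step exists for every speaker is an assumption you should verify.

```python
import subprocess

# Speaker IDs as listed in the corpus tables.
SPEAKERS = ("M", "F", "M1", "F1", "M2", "F2")


def inference_command(text, speaker, ckpt):
    """Build the argument list for one inference.py call."""
    if speaker not in SPEAKERS:
        raise ValueError(f"unknown speaker ID: {speaker}")
    # Collapse line breaks and repeated spaces so the text stays one sentence.
    clean_text = " ".join(text.split())
    return ["python", "inference.py",
            "--text", clean_text,
            "--speaker", speaker,
            "--ckpt", str(ckpt)]


def batch_commands(sentences, speakers, ckpt):
    """Return one command per (sentence, speaker) pair."""
    return [inference_command(s, spk, ckpt) for s in sentences for spk in speakers]


def run_batch(sentences, speakers, ckpt, inference_dir="Speakers/inference"):
    """Execute the commands sequentially inside the inference directory."""
    for cmd in batch_commands(sentences, speakers, ckpt):
        subprocess.run(cmd, cwd=inference_dir, check=True)
```

Normalizing whitespace in `inference_command` guards against sentences that were pasted with hard line wraps.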

Pretrained model weights

If you would like access to our pretrained models, please contact us.

References

Citation

TBA

License

Only the speech dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
