STOMA is a new multi-speaker Greek speech corpus designed to advance research in text-to-speech (TTS) synthesis and related speech technologies for Greek, an under-resourced language. The corpus comprises approximately 23 hours of read speech from six native speakers (three male and three female), recorded under controlled studio conditions using a dual-booth setup to ensure acoustic consistency and high signal quality. The spoken material was selected from the Greek Harvard Corpus and the Text Bank of the Center for the Greek Language, specifically from texts corresponding to the B2, C1, and C2 proficiency levels of the Certification of Attainment in Greek, which yields linguistically rich and pedagogically well-balanced content. All recordings were standardized to 44.1 kHz, 16-bit mono PCM format and processed through a hybrid quality-control pipeline combining automated normalization and manual verification. To assess dataset quality, we trained state-of-the-art neural TTS systems based on the FastSpeech2 acoustic model and the HiFi-GAN vocoder, which produced natural and intelligible synthesized speech. The resulting corpus is a publicly accessible, high-quality resource that supports both linguistic research and the development of modern speech synthesis systems for Greek.
The dataset is available at the following location: STOMA
Table 1: Text statistics for the full text collection (Greek Harvard, B2, C1, and C2), recorded in its entirety by the two primary speakers.
| Statistic | Gr. Harvard | B2 | C1 | C2 | Total |
|---|---|---|---|---|---|
| Total Sentences | 720 | 1,087 | 1,571 | 1,296 | 4,674 |
| Total Words | 5,539 | 15,102 | 25,263 | 22,722 | 68,626 |
| Total Characters | 32,301 | 95,119 | 164,700 | 152,904 | 445,024 |
| Distinct Words | 3,343 | 5,606 | 8,965 | 8,275 | 20,253 |
| Mean Words per Sentence | 7.69 | 13.89 | 16.08 | 17.53 | 14.68 |
| Min Words per Sentence | 5 | 1 | 1 | 1 | 1 |
| Max Words per Sentence | 9 | 32 | 41 | 39 | 41 |
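The statistics in Table 1 can be recomputed for any list of sentences. The sketch below is only an illustration: the function name `text_statistics`, the whitespace tokenization, and the punctuation stripping and lowercasing used for distinct words are assumptions, so its counts may not match the published figures exactly.

```python
from statistics import mean

def text_statistics(sentences):
    """Compute Table 1-style statistics for a list of sentence strings."""
    # Assumption: words are whitespace-separated tokens; distinct words are
    # compared case-insensitively after stripping surrounding punctuation.
    tokenized = [s.split() for s in sentences]
    words_per_sentence = [len(tokens) for tokens in tokenized]
    distinct = {
        tok.strip(".,;:!?\"'()«»·").lower()
        for tokens in tokenized
        for tok in tokens
    }
    distinct.discard("")
    return {
        "total_sentences": len(sentences),
        "total_words": sum(words_per_sentence),
        "total_characters": sum(len(s) for s in sentences),
        "distinct_words": len(distinct),
        "mean_words_per_sentence": round(mean(words_per_sentence), 2),
        "min_words_per_sentence": min(words_per_sentence),
        "max_words_per_sentence": max(words_per_sentence),
    }

# Two made-up sentences, used only to exercise the function.
stats = text_statistics(["Καλημέρα σας.", "Ο καιρός είναι καλός σήμερα."])
print(stats)
```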
Table 2: Text statistics for the reduced text subset selected from the full collection, used for recordings by the secondary speakers.
| Statistic | Gr. Harvard | B2 | C1 | C2 | Subset Total |
|---|---|---|---|---|---|
| Total Sentences | 720 | 190 | 194 | 197 | 1,301 |
| Total Words | 5,539 | 2,913 | 3,057 | 3,265 | 14,774 |
| Total Characters | 32,301 | 18,286 | 20,190 | 22,110 | 92,887 |
| Distinct Words | 3,343 | 1,519 | 1,604 | 1,676 | 6,899 |
| Mean Words per Sentence | 7.69 | 15.33 | 15.76 | 16.57 | 11.36 |
| Min Words per Sentence | 5 | 4 | 3 | 4 | 3 |
| Max Words per Sentence | 9 | 28 | 28 | 30 | 30 |
Table 3: Speech statistics of the STOMA corpus per speaker and overall.
| Speaker ID | Clips | Total Duration | Mean Duration (s) | Min–Max (s) |
|---|---|---|---|---|
| F | 4,674 | 7:46:21 | 5.99 | 0.41–16.60 |
| M | 4,674 | 8:53:37 | 6.85 | 0.48–22.31 |
| F1 | 1,301 | 1:46:58 | 4.93 | 1.65–15.08 |
| F2 | 1,301 | 1:27:38 | 4.04 | 1.46–11.13 |
| M1 | 1,301 | 1:47:06 | 4.94 | 1.64–14.99 |
| M2 | 1,301 | 1:26:43 | 4.00 | 1.10–10.39 |
| Total | 14,552 | 23:08:26 | 5.72 | 0.41–22.31 |
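The mean durations in Table 3 follow from the total duration and the clip count. As a quick consistency check, the sketch below recomputes two of them; the helper names are illustrative and not part of the corpus tooling.

```python
def hms_to_seconds(hms):
    """Convert an 'H:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def mean_clip_duration(total_duration, clips):
    """Mean clip duration in seconds, rounded to two decimals."""
    return round(hms_to_seconds(total_duration) / clips, 2)

# Values copied from Table 3.
mean_f = mean_clip_duration("7:46:21", 4674)        # speaker F
mean_total = mean_clip_duration("23:08:26", 14552)  # whole corpus
print(mean_f, mean_total)
```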
Table 4: Speaker-specific acoustic characteristics of the STOMA corpus. SR: average speech rate (syllables/s, including pauses); AR: average articulation rate (syllables/s, excluding pauses); ASD: average syllable duration (s), computed across all utterances per speaker.
| Speaker ID | Mean F0 (Hz) | Std F0 (Hz) | SR | AR | ASD (s) |
|---|---|---|---|---|---|
| M | 118.73 | 20.80 | 4.143 | 4.286 | 0.236 |
| F | 195.53 | 40.16 | 4.462 | 4.662 | 0.218 |
| M1 | 128.30 | 23.34 | 4.165 | 4.403 | 0.230 |
| F1 | 200.93 | 23.57 | 3.887 | 3.974 | 0.257 |
| M2 | 109.33 | 15.49 | 4.407 | 4.527 | 0.225 |
| F2 | 187.29 | 26.88 | 4.238 | 4.409 | 0.232 |
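The difference between SR and AR in Table 4 comes down to whether pause time is counted in the denominator. The sketch below shows this for a single utterance; the function name and its inputs (a syllable count and a pause duration, for example from a forced alignment) are hypothetical, and the exact tooling and averaging behind the table are not described here. ASD is left out because its precise computation is not specified.

```python
def speech_and_articulation_rate(n_syllables, total_duration_s, pause_duration_s):
    """Return (speech rate, articulation rate) in syllables per second."""
    speaking_time = total_duration_s - pause_duration_s
    if n_syllables <= 0 or speaking_time <= 0:
        raise ValueError("need at least one syllable and positive speaking time")
    speech_rate = n_syllables / total_duration_s     # pauses included
    articulation_rate = n_syllables / speaking_time  # pauses excluded
    return speech_rate, articulation_rate

# Made-up utterance: 20 syllables in 5.0 s, of which 0.5 s is silence.
sr, ar = speech_and_articulation_rate(20, 5.0, 0.5)
print(f"SR={sr:.3f} AR={ar:.3f}")
```

Because pause time is only ever subtracted, AR is never lower than SR, which matches every row of the table.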
Table 5: Demographic information of the STOMA corpus speakers.
| Speaker ID | Age | Sex | Region Raised |
|---|---|---|---|
| M | 29 | Male | Athens |
| F | 37 | Female | Crete |
| M1 | 30 | Male | Alexandroupoli |
| F1 | 21 | Female | Larisa |
| M2 | 22 | Male | Larisa |
| F2 | 26 | Female | Athens |
Table 6: Audio format specifications of the STOMA corpus recordings.
| Property | Value |
|---|---|
| File format | WAV (RIFF) |
| Encoding | Pulse-Code Modulation (PCM) |
| Compression | None (uncompressed) |
| Channels | 1 (mono) |
| Sampling rate | 44.1 kHz |
| Bit depth | 16-bit signed integer |
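A downloaded file can be checked against the specifications in Table 6 with Python's standard `wave` module. This is a minimal sketch: `wav_format_problems` is an illustrative helper, and the demo file is generated silence rather than a corpus recording.

```python
import os
import tempfile
import wave

# Expected values from Table 6 (16-bit samples are 2 bytes wide).
EXPECTED = {"channels": 1, "sample_rate": 44100, "sample_width_bytes": 2}

def wav_format_problems(path):
    """Return a list of mismatches between a WAV file and Table 6."""
    with wave.open(path, "rb") as wav:
        actual = {
            "channels": wav.getnchannels(),
            "sample_rate": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
        }
    return [
        f"{key}: expected {EXPECTED[key]}, found {actual[key]}"
        for key in EXPECTED
        if actual[key] != EXPECTED[key]
    ]

# Demo: write one second of silence in the expected format, then check it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(44100)
    wav.writeframes(b"\x00\x00" * 44100)
print(wav_format_problems(demo_path))  # empty list when the format matches
```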
The training framework employs the ESPnet toolkit for acoustic modeling (FastSpeech2) and ParallelWaveGAN for the vocoder (HiFi-GAN).
To replicate the training pipeline, the following toolkits must be installed:
- Acoustic Model: To train the acoustic model, ESPnet is required. Please refer to the official ESPnet installation guide.
- Vocoder: To train the vocoder, ParallelWaveGAN is required. Installation instructions are available in the ParallelWaveGAN repository.
The training process involves a two-stage pipeline: first training a Teacher model (Tacotron2) to extract durations, and then training the Student model (FastSpeech2).
Training Strategy:
- Main Speaker (Main_Speaker_M / Main_Speaker_F): FastSpeech2 models are trained from scratch to establish a robust baseline.
- Secondary Speakers: Models are fine-tuned from the corresponding gender-specific main speaker checkpoints (e.g., male speakers are fine-tuned from Main_Speaker_M).
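The strategy above amounts to a small lookup from speaker to initialization source. The sketch below assumes that the speaker IDs from the tables (`M`, `F`, `M1`, `F1`, `M2`, `F2`) correspond to the main-speaker names `Main_Speaker_M` and `Main_Speaker_F`; the function itself is hypothetical and not part of the repository.

```python
MAIN_SPEAKERS = {"M": "Main_Speaker_M", "F": "Main_Speaker_F"}

def initial_checkpoint_source(speaker_id):
    """Return the main-speaker model a speaker is fine-tuned from, or None."""
    if speaker_id in MAIN_SPEAKERS:
        return None  # main speakers are trained from scratch
    gender = speaker_id[:1]  # "M1" -> "M", "F2" -> "F"
    if gender not in MAIN_SPEAKERS:
        raise ValueError(f"unknown speaker ID: {speaker_id!r}")
    return MAIN_SPEAKERS[gender]

for speaker in ["M", "F", "M1", "F1", "M2", "F2"]:
    print(speaker, "->", initial_checkpoint_source(speaker) or "from scratch")
```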
Step 1: Train Tacotron2

By default, the run script is configured to train the Tacotron2 model.
- Navigate to the `FastSpeech2` directory located under `Speakers`.
- Crucial Step: Open the `db.sh` file and define the absolute path to your dataset.
- Execute the training script:

      ./run.sh

Step 2: Generate Alignments

Once the Tacotron2 model is trained, you must generate the alignments (durations) and statistics required for FastSpeech2. Execute the following command to run teacher forcing inference:

    ./run.sh --stage 8 \
        --tts_exp exp/tts_train_raw_phn_tacotron_g2p_en_no_space \
        --inference_args "--use_teacher_forcing true" \
        --test_sets "tr_no_dev_phn dev_phn eval1_phn"

Step 3: Train FastSpeech2

After generating the alignments, proceed to train the FastSpeech2 model using Tacotron's dump directory:

    ./run.sh --stage 6 \
        --train_config conf/tuning/train_fastspeech2.yaml \
        --teacher_dumpdir exp/tts_train_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave \
        --tts_stats_dir exp/tts_train_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave/stats \
        --write_collected_feats true

Configuration Note: Prior to execution, you must define the absolute path to the dataset within the `db.sh` file.
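For scripted runs, the three `run.sh` invocations can be assembled in one place so the teacher experiment path is written only once. The sketch below only builds and prints the commands using the flags and paths shown in the steps above; the helper `pipeline_commands` is hypothetical and not part of the repository.

```python
import shlex

TEACHER_EXP = "exp/tts_train_raw_phn_tacotron_g2p_en_no_space"
TEACHER_DUMP = f"{TEACHER_EXP}/decode_use_teacher_forcingtrue_train.loss.ave"

def pipeline_commands():
    """Return the three run.sh invocations, in order, as argument lists."""
    return [
        # Step 1: train the Tacotron2 teacher (default behaviour of run.sh).
        ["./run.sh"],
        # Step 2: teacher-forcing inference to produce durations and stats.
        ["./run.sh", "--stage", "8",
         "--tts_exp", TEACHER_EXP,
         "--inference_args", "--use_teacher_forcing true",
         "--test_sets", "tr_no_dev_phn dev_phn eval1_phn"],
        # Step 3: train FastSpeech2 from the teacher's dump directory.
        ["./run.sh", "--stage", "6",
         "--train_config", "conf/tuning/train_fastspeech2.yaml",
         "--teacher_dumpdir", TEACHER_DUMP,
         "--tts_stats_dir", f"{TEACHER_DUMP}/stats",
         "--write_collected_feats", "true"],
    ]

for argv in pipeline_commands():
    print(shlex.join(argv))
```

To actually execute a step, each argument list could be passed to `subprocess.run(argv, check=True)` from inside the recipe directory.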
Vocoder training requires the ParallelWaveGAN library. In this implementation, HiFi-GAN models are trained only on the two main speakers. These gender-specific vocoders are then used for all remaining speakers during synthesis.
To initiate vocoder training, please refer to the standard ESPnet2 recipe instructions provided here: HiFi-GAN training instructions.
Permission Denied Errors: If you encounter a "Permission Denied" error, grant execution rights to the relevant files using the following command:

    chmod u+x <filename>

To perform speech synthesis, navigate to the `Speakers/inference` directory.
Before executing the script, ensure the following:
- Model Paths: The paths to the model weights must be correctly defined.
- Config Updates: In the `config.yaml` for each model, ensure the `stats_file` paths under `normalize_conf`, `pitch_normalize_conf`, and `energy_normalize_conf` are valid and accessible.

Locating the Statistics Files: For locally trained models, the required statistics files can be found in the following directory:

    {Speaker}/FastSpeech2/exp/tts_train_tacotron2_raw_phn_none/inference_use_teacher_forcingtrue_train.loss.ave/stats/train/

You will need to map the configuration entries to these specific files:
- `normalize_conf` → `feats_stats.npz`
- `pitch_normalize_conf` → `pitch_stats.npz`
- `energy_normalize_conf` → `energy_stats`
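Before running inference, it can help to confirm that each `stats_file` entry points to a file that exists. The sketch below assumes the contents of `config.yaml` have already been loaded into a Python dictionary (for example with a YAML parser); `missing_stats_files` and the example paths are hypothetical.

```python
import os

NORMALIZE_KEYS = ("normalize_conf", "pitch_normalize_conf", "energy_normalize_conf")

def missing_stats_files(config):
    """Return (key, path) pairs whose stats_file is unset or not on disk."""
    problems = []
    for key in NORMALIZE_KEYS:
        path = (config.get(key) or {}).get("stats_file")
        if not path or not os.path.isfile(path):
            problems.append((key, path))
    return problems

# Hypothetical config contents, as they might look after loading config.yaml.
example = {
    "normalize_conf": {"stats_file": "/path/to/stats/train/feats_stats.npz"},
    "pitch_normalize_conf": {"stats_file": "/path/to/stats/train/pitch_stats.npz"},
}
for key, path in missing_stats_files(example):
    print(f"{key}: stats_file missing or not a file ({path})")
```

An empty result means all three entries are set and point to existing files.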
Run the inference script as follows:

    python inference.py --text "Οι έρευνες δείχνουν πως ένας γαλήνιος και ήρεμος γονιός λειτουργεί για το παιδί ως λιμάνι στις καταιγίδες" --speaker M --ckpt 500000

If you would like access to our pretrained models, please contact us.
- Michail Raptakis - mrap@csd.uoc.gr (University of Crete, IACM-FORTH)
- Yannis Pantazis - pantazis@iacm.forth.gr (IACM-FORTH)
- Alexandros Angelakis - angelakis@csd.uoc.gr (University of Crete, IACM-FORTH)
TBA
Only the speech dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.