STOMA: A Multi-Speaker Greek Speech Corpus

STOMA is a multi-speaker Greek speech corpus designed to advance research in text-to-speech (TTS) synthesis and related speech technologies for Greek, an under-resourced language. The corpus comprises approximately 23 hours of read speech from six native speakers (three male and three female), recorded under controlled studio conditions using a dual-booth setup to ensure acoustic consistency and high signal quality. The spoken material was selected from the Greek Harvard Corpus and the Text Bank of the Center for the Greek Language, specifically from texts corresponding to the B2, C1, and C2 proficiency levels of the Certification of Attainment in Greek, ensuring linguistically rich and pedagogically well-balanced content. All recordings were standardized to 44.1 kHz, 16-bit mono PCM format and processed through a hybrid quality-control pipeline combining automated normalization and manual verification. To assess dataset quality, we trained neural TTS systems based on the FastSpeech2 acoustic model and the HiFi-GAN vocoder, which produced natural and intelligible synthesized speech. The resulting corpus is a publicly accessible, high-quality resource that supports both linguistic research and the development of modern speech synthesis systems for Greek.

Dataset

The dataset is available at the following location: STOMA

Table 1: Text statistics for the full text collection (Greek Harvard, B2, C1, and C2), recorded in its entirety by the two primary speakers.

Statistic	Gr. Harvard	B2	C1	C2	Total
Total Sentences	720	1,087	1,571	1,296	4,674
Total Words	5,539	15,102	25,263	22,722	68,626
Total Characters	32,301	95,119	164,700	152,904	445,024
Distinct Words	3,343	5,606	8,965	8,275	20,253
Mean Words per Sentence	7.69	13.89	16.08	17.53	14.68
Min Words per Sentence	5	1	1	1	1
Max Words per Sentence	9	32	41	39	41

Table 2: Text statistics for the reduced text subset selected from the full collection, used for recordings by the secondary speakers.

Statistic	Gr. Harvard	B2	C1	C2	Subset Total
Total Sentences	720	190	194	197	1,301
Total Words	5,539	2,913	3,057	3,265	14,774
Total Characters	32,301	18,286	20,190	22,110	92,887
Distinct Words	3,343	1,519	1,604	1,676	6,899
Mean Words per Sentence	7.69	15.33	15.76	16.57	11.36
Min Words per Sentence	5	4	3	4	3
Max Words per Sentence	9	28	28	30	30
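
As a rough illustration of how the quantities in Tables 1 and 2 can be derived from a list of sentences, the sketch below computes the same statistics. It assumes whitespace tokenization, case-insensitive distinct-word counting, and character counts that include spaces; the exact rules used for the tables are not stated here, so results on the real text may differ. The function name `text_statistics` and the toy sentences are illustrative only.

```python
from statistics import mean


def text_statistics(sentences):
    """Compute corpus-level text statistics for a list of sentences.

    Assumptions (not confirmed by the README): words are whitespace-separated
    tokens, distinct words are compared case-insensitively, and characters
    are counted per sentence including spaces and punctuation.
    """
    words_per_sentence = [len(s.split()) for s in sentences]
    all_words = [w for s in sentences for w in s.split()]
    return {
        "total_sentences": len(sentences),
        "total_words": len(all_words),
        "total_characters": sum(len(s) for s in sentences),
        "distinct_words": len({w.lower() for w in all_words}),
        "mean_words_per_sentence": round(mean(words_per_sentence), 2),
        "min_words_per_sentence": min(words_per_sentence),
        "max_words_per_sentence": max(words_per_sentence),
    }


# Toy input, not taken from the corpus.
example = ["Καλημέρα σας", "Ο καιρός είναι καλός σήμερα", "Καλημέρα"]
stats = text_statistics(example)
```

Note that the mean words per sentence equals total words divided by total sentences, which is how the "Total" columns above relate to each other (for example, 68,626 / 4,674 ≈ 14.68).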

Table 3: Speech statistics of the STOMA corpus per speaker and overall.

Speaker ID	Clips	Total Duration	Mean Duration (s)	Min–Max (s)
F	4,674	7:46:21	5.99	0.41–16.60
M	4,674	8:53:37	6.85	0.48–22.31
F1	1,301	1:46:58	4.93	1.65–15.08
F2	1,301	1:27:38	4.04	1.46–11.13
M1	1,301	1:47:06	4.94	1.64–14.99
M2	1,301	1:26:43	4.00	1.10–10.39
Total	14,552	23:08:26	5.72	0.41–22.31
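
The per-speaker figures in Table 3 can be recomputed from the WAV files themselves. The following is a minimal sketch using only the Python standard library; it assumes that one speaker's clips sit together in a single directory as `*.wav` files, which is a hypothetical layout and may not match the released archive.

```python
import wave
from pathlib import Path


def clip_duration(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def format_hms(seconds):
    """Format a duration as H:MM:SS, rounded to the nearest second."""
    total = int(round(seconds))
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def speaker_summary(wav_dir):
    """Summarize all *.wav clips directly inside wav_dir (assumed layout)."""
    durations = [clip_duration(p) for p in sorted(Path(wav_dir).glob("*.wav"))]
    if not durations:
        raise ValueError(f"no WAV files found in {wav_dir}")
    return {
        "clips": len(durations),
        "total": format_hms(sum(durations)),
        "mean_s": round(sum(durations) / len(durations), 2),
        "min_s": round(min(durations), 2),
        "max_s": round(max(durations), 2),
    }
```

Because each total is rounded to whole seconds, summing the rounded per-speaker totals can differ from the overall total by a few seconds.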

Table 4: Speaker-specific acoustic characteristics of the STOMA corpus. SR: average speech rate (syllables/s, including pauses); AR: average articulation rate (syllables/s, excluding pauses); ASD: average syllable duration (s), computed across all utterances per speaker.

Speaker ID	Mean F0 (Hz)	Std F0 (Hz)	SR	AR	ASD (s)
M	118.73	20.80	4.143	4.286	0.236
F	195.53	40.16	4.462	4.662	0.218
M1	128.30	23.34	4.165	4.403	0.230
F1	200.93	23.57	3.887	3.974	0.257
M2	109.33	15.49	4.407	4.527	0.225
F2	187.29	26.88	4.238	4.409	0.232
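
To make the difference between SR and AR concrete, here is a small sketch of the three measures for a single utterance and their average over several utterances. It assumes the syllable count and total pause time are already known, and it takes ASD as speaking time (excluding pauses) divided by the number of syllables, a common convention; the exact tooling and definitions behind Table 4 are not described here, so treat this as an illustration rather than the authors' procedure.

```python
from statistics import mean


def rate_measures(n_syllables, total_duration_s, pause_duration_s):
    """Return (speech rate, articulation rate, average syllable duration).

    SR  = syllables / total duration (pauses included)
    AR  = syllables / speaking time  (pauses excluded)
    ASD = speaking time / syllables  (assumed definition)
    """
    speaking_s = total_duration_s - pause_duration_s
    if n_syllables <= 0 or speaking_s <= 0:
        raise ValueError("need a positive syllable count and speaking time")
    sr = n_syllables / total_duration_s
    ar = n_syllables / speaking_s
    asd = speaking_s / n_syllables
    return sr, ar, asd


def speaker_averages(utterances):
    """Average the measures over (n_syllables, total_s, pause_s) tuples."""
    per_utterance = [rate_measures(*u) for u in utterances]
    sr, ar, asd = (mean(column) for column in zip(*per_utterance))
    return sr, ar, asd


# Toy values, not taken from the corpus: 20 syllables in 5.0 s with 0.5 s of pauses.
sr, ar, asd = rate_measures(20, 5.0, 0.5)
```

Since pauses only lengthen the denominator of SR, AR is never lower than SR for the same utterance. If the measures are averaged per utterance, the averaged ASD need not equal the reciprocal of the averaged AR.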

Table 5: Demographic information of the STOMA corpus speakers.

Speaker ID	Age	Sex	Region Raised
M	29	Male	Athens
F	37	Female	Crete
M1	30	Male	Alexandroupoli
F1	21	Female	Larisa
M2	22	Male	Larisa
F2	26	Female	Athens

Table 6: Audio format specifications of the STOMA corpus recordings.

Property	Value
File format	WAV (RIFF)
Encoding	Pulse-Code Modulation (PCM)
Compression	None (uncompressed)
Channels	1 (mono)
Sampling rate	44.1 kHz
Bit depth	16-bit signed integer
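
When mixing STOMA with other data or after re-exporting clips, it can help to confirm that files still match the specification in Table 6. The sketch below checks a single file with Python's standard `wave` module; the function name `check_wav_format` is illustrative, and the check covers only the properties listed in the table.

```python
import wave

# Expected values taken from Table 6 (16-bit samples are 2 bytes wide).
EXPECTED = {"channels": 1, "sample_rate_hz": 44100, "sample_width_bytes": 2}


def check_wav_format(path):
    """Return a list of mismatches against Table 6 (empty if the file conforms)."""
    with wave.open(str(path), "rb") as wav:
        actual = {
            "channels": wav.getnchannels(),
            "sample_rate_hz": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
        }
        compression = wav.getcomptype()
    problems = [
        f"{key}: expected {EXPECTED[key]}, found {value}"
        for key, value in actual.items()
        if value != EXPECTED[key]
    ]
    if compression != "NONE":
        problems.append(f"compression: expected NONE, found {compression}")
    return problems
```

An empty list means the file is mono, 44.1 kHz, 16-bit, uncompressed PCM.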

Requirements

The training framework employs the ESPnet toolkit for acoustic modeling (FastSpeech2) and ParallelWaveGAN for the vocoder (HiFi-GAN).

To replicate the training pipeline, the following toolkits must be installed:

- Acoustic Model: To train the acoustic model, ESPnet is required. Please refer to the official ESPnet installation guide.
- Vocoder: To train the vocoder, ParallelWaveGAN is required. Installation instructions are available in the ParallelWaveGAN repository.

Training Procedure

Acoustic Model (FastSpeech2)

The training process involves a two-stage pipeline: first training a Teacher model (Tacotron2) to extract durations, and then training the Student model (FastSpeech2).

Training Strategy:

- Main Speakers (Main_Speaker_M / Main_Speaker_F): FastSpeech2 models are trained from scratch to establish a robust baseline.
- Secondary Speakers: Models are fine-tuned from the corresponding gender-specific main speaker checkpoints (e.g., male speakers are fine-tuned from Main_Speaker_M).
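
The strategy above amounts to a simple lookup from speaker ID to the model it is initialized from. The sketch below expresses that rule using the speaker IDs from the tables (M, F, M1, F1, M2, F2); the function name is hypothetical and the actual checkpoint file paths are not specified here.

```python
# Main-speaker directories named in the training strategy.
MAIN_SPEAKERS = {"M": "Main_Speaker_M", "F": "Main_Speaker_F"}


def init_checkpoint_source(speaker_id):
    """Return the main-speaker model to fine-tune from, or None to train from scratch."""
    if speaker_id in MAIN_SPEAKERS:
        return None  # main speakers are trained from scratch
    gender, index = speaker_id[:1], speaker_id[1:]
    if gender not in MAIN_SPEAKERS or not index.isdigit():
        raise ValueError(f"unknown speaker ID: {speaker_id!r}")
    return MAIN_SPEAKERS[gender]


for speaker in ["M", "F", "M1", "F1", "M2", "F2"]:
    source = init_checkpoint_source(speaker)
    print(speaker, "->", source or "train from scratch")
```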

Step 1: Train Tacotron2

By default, the run script is configured to train the Tacotron2 model.

1. Navigate to the FastSpeech2 directory located under Speakers.
2. Crucial step: open the db.sh file and define the absolute path to your dataset.
3. Execute the training script:

./run.sh

Step 2: Generate Alignments

Once the Tacotron2 model is trained, you must generate the alignments (durations) and statistics required for FastSpeech2. Run teacher-forcing inference with the following command:

./run.sh --stage 8 \
    --tts_exp exp/tts_train_raw_phn_tacotron_g2p_en_no_space \
    --inference_args "--use_teacher_forcing true" \
    --test_sets "tr_no_dev_phn dev_phn eval1_phn"

Step 3: Train FastSpeech2

After generating the alignments, train the FastSpeech2 model using the Tacotron2 dump directory:

./run.sh --stage 6 \
    --train_config conf/tuning/train_fastspeech2.yaml \
    --teacher_dumpdir exp/tts_train_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave \
    --tts_stats_dir exp/tts_train_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_train.loss.ave/stats \
    --write_collected_feats true

Configuration Note: Prior to execution, you must define the absolute path to the dataset within the db.sh file.
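
In Step 3 the teacher dump directory and the statistics directory both derive from the Tacotron2 experiment directory, so a typo in one of them is easy to make. As an optional convenience, the sketch below builds the Step 3 argument list from a single experiment path. It uses only the flags and directory names shown in the commands above; the helper name `fastspeech2_train_args` is hypothetical, and the decode directory name is assumed to match your own run.

```python
import shlex
from pathlib import PurePosixPath


def fastspeech2_train_args(
    tts_exp,
    decode_dir="decode_use_teacher_forcingtrue_train.loss.ave",
):
    """Build the Step 3 command so both teacher paths share one root."""
    teacher_dumpdir = PurePosixPath(tts_exp) / decode_dir
    return [
        "./run.sh",
        "--stage", "6",
        "--train_config", "conf/tuning/train_fastspeech2.yaml",
        "--teacher_dumpdir", str(teacher_dumpdir),
        "--tts_stats_dir", str(teacher_dumpdir / "stats"),
        "--write_collected_feats", "true",
    ]


args = fastspeech2_train_args("exp/tts_train_raw_phn_tacotron_g2p_en_no_space")
print(shlex.join(args))  # copy into a shell, or pass args to subprocess.run
```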

Vocoder (HiFi-GAN)

Vocoder training requires the ParallelWaveGAN library. In this implementation, HiFi-GAN models are trained exclusively on the main speakers. These gender-specific vocoders are subsequently utilized for all remaining speakers during synthesis.

To initiate vocoder training, please refer to the standard ESPnet2 recipe instructions provided here: HiFi-GAN training instructions.

Permission Denied Errors: If you encounter a "Permission Denied" error, please grant execution rights to the relevant files using the following command:

chmod u+x <filename>

Inference Procedure

To perform speech synthesis, navigate to the Speakers/inference directory.

Configuration Prerequisites

Before executing the script, ensure the following:

- Model Paths: The paths to the model weights must be correctly defined.
- Config Updates: In the config.yaml for each model, ensure the stats_file paths under normalize_conf, pitch_normalize_conf, and energy_normalize_conf are valid and accessible.

Locating the Statistics Files: For locally trained models, the required statistics files can be found in the following directory:
```
{Speaker}/FastSpeech2/exp/tts_train_tacotron2_raw_phn_none/inference_use_teacher_forcingtrue_train.loss.ave/stats/train/
```
You will need to map the configuration entries to these specific files:
- normalize_conf → feats_stats.npz
- pitch_normalize_conf → pitch_stats.npz
- energy_normalize_conf → energy_stats.npz
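
The mapping above can be applied programmatically once config.yaml has been loaded into a dictionary (for example with PyYAML, which is not shown here). The following sketch rewrites the three stats_file entries and reports any that do not exist on disk. It assumes the energy statistics file carries the same .npz extension as the other two, and the helper names are hypothetical.

```python
from pathlib import Path

# Config section -> statistics file name, following the mapping above.
STATS_FILES = {
    "normalize_conf": "feats_stats.npz",
    "pitch_normalize_conf": "pitch_stats.npz",
    "energy_normalize_conf": "energy_stats.npz",  # .npz extension assumed
}


def set_stats_paths(config, stats_dir):
    """Return a copy of config whose stats_file entries point into stats_dir."""
    updated = dict(config)
    for section_name, filename in STATS_FILES.items():
        section = dict(updated.get(section_name) or {})
        section["stats_file"] = str(Path(stats_dir) / filename)
        updated[section_name] = section
    return updated


def missing_stats_files(config):
    """List configured stats_file paths that do not exist on disk."""
    paths = [config[name]["stats_file"] for name in STATS_FILES]
    return [p for p in paths if not Path(p).is_file()]
```

Running `missing_stats_files` on the updated configuration before synthesis gives an early, readable error instead of a failure inside model loading.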

Execution

Run the inference script as follows:

python inference.py --text "Οι έρευνες δείχνουν πως ένας γαλήνιος και ήρεμος γονιός λειτουργεί για το παιδί ως λιμάνι στις καταιγίδες" --speaker M --ckpt 500000
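
To synthesize the same sentence with several speakers, the command can be assembled in Python and passed to `subprocess.run` as an argument list, which avoids shell quoting problems with long Greek text. The sketch below uses only the `--text`, `--speaker`, and `--ckpt` flags shown above; the helper name is hypothetical, and whether every speaker has a checkpoint at the same step is an assumption.

```python
import sys


def inference_command(text, speaker, ckpt):
    """Build the argument list for one inference.py call."""
    # Collapse line breaks and repeated spaces so wrapped text is passed as one line.
    clean_text = " ".join(text.split())
    return [
        sys.executable, "inference.py",
        "--text", clean_text,
        "--speaker", speaker,
        "--ckpt", str(ckpt),
    ]


sentence = """Οι έρευνες δείχνουν πως ένας γαλήνιος και ήρεμος γονιός λειτουργεί
για το παιδί ως λιμάνι στις καταιγίδες"""
commands = [inference_command(sentence, spk, 500000) for spk in ["M", "F"]]
# To execute from the Speakers/inference directory:
#   import subprocess
#   for cmd in commands:
#       subprocess.run(cmd, check=True)
```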

Pretrained model weights

If you would like access to our pretrained models, please contact us.

References

Michail Raptakis - mrap@csd.uoc.gr (University of Crete, IACM-FORTH)
Yannis Pantazis - pantazis@iacm.forth.gr (IACM-FORTH)
Alexandros Angelakis - angelakis@csd.uoc.gr (University of Crete, IACM-FORTH)

Citation

TBA

License

Only the speech dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
