MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages
| Name | License | Hours | Languages | Label |
|---|---|---|---|---|
| CommonVoice | CC 0 | 6,732 | bg, cs, da, nl, en, et, fi, fr, de, el, hu, ga, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv | ✅ |
| CoVoST2 | CC 0 | 687 | en, fr, it, es, pt, et, nl, sv, lv, sl | ✅ |
| CSS10 | Public Domain | 99 | nl, fi, fr, de, el, hu, es | ✅ |
| EMU | CC BY 3.0 | 56 | pl | ✅ |
| EU Parliament | CC BY 4.0 | 32 | pl | ✅ |
| FLEURS | CC BY 4.0 | 215 | bg, cs, da, nl, en, et, fi, fr, de, el, hu, ga, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv | ✅ |
| Large Corpus of Czech Parliament Plenary Hearings | CC BY 4.0 | 444 | cs | ✅ |
| LibriLight | Public Domain | 57,706 | en | ❌ |
| LibriTTS | CC BY 4.0 | 585 | en | ✅ |
| LibriSpeech | CC BY 4.0 | 360 | en | ✅ |
| LibriVoxDeEn | Public Domain | 547 | de | ✅ |
| MC Speech | CC 0 | 22 | pl | ✅ |
| Multilingual LibriSpeech | CC BY 4.0 | 50,687 | nl, en, fr, de, it, pl, pt, es | ✅ |
| SIWIS | CC BY 4.0 | 11 | fr | ✅ |
| Speech Commands | CC BY 4.0 | 18 | en | ✅ |
| VCTK | CC BY 4.0 | 44 | en | ✅ |
| VoxPopuli | CC 0 | 383,500 | bg, hr, cs, da, nl, en, et, fi, fr, de, el, hu, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv | ❌ |
| VoxPopuli | CC 0 | 1,791 | hr, cs, nl, en, et, fi, fr, de, hu, it, lt, pl, ro, sk, sl, es | ✅ |
| YouTube-Commons | CC BY 4.0 | 3,261 | bg, cs, nl, en, et, fr, de, el, hu, it, pl, pt, ro, es | ❌ |
| YouTube-Commons | CC BY 4.0 | 443,396 | bg, cs, nl, en, et, fi, fr, de, el, hu, it, lv, lt, pl, pt, ro, es, sv | ✅ |
| MOSEL 🍇 | CC BY 4.0 | 441,206 | bg, hr, cs, da, nl, en, et, fi, fr, de, el, hu, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv | ✅ |
| Datasets added after MOSEL release | ||||
| Yodas | CC BY 3.0 | 369,510 | 149 | ✅ |
| LoquaciousSet | CC BY 3.0/4.0 | 25,000 | en | ✅ |
| Granary 🌽 | CC BY 3.0/4.0 | ~1M | bg, cs, da, de, el, en, es, et, fi, fr, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv, uk, ru | ✅ |
Languages are listed as two-letter ISO 639-1 codes.
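As a quick sanity check on the headline figure, the hours in the table can be summed per label status. The sketch below is illustrative only: the `ROWS` dictionary and the `total_hours` helper are hypothetical names, and the numbers are copied by hand from the table. It assumes the MOSEL row should be excluded because its 441,206 hours match the unlabeled LibriLight and VoxPopuli audio (57,706 + 383,500) and so presumably cover existing audio rather than adding new audio. The datasets added after the MOSEL release are left out as well.

```python
# Hours copied from the table; each dataset maps to a list of (hours, labeled) parts.
# Assumption: the MOSEL row and the post-release datasets are excluded from the total.
ROWS = {
    "CommonVoice": [(6732, True)],
    "CoVoST2": [(687, True)],
    "CSS10": [(99, True)],
    "EMU": [(56, True)],
    "EU Parliament": [(32, True)],
    "FLEURS": [(215, True)],
    "Large Corpus of Czech Parliament Plenary Hearings": [(444, True)],
    "LibriLight": [(57706, False)],
    "LibriTTS": [(585, True)],
    "LibriSpeech": [(360, True)],
    "LibriVoxDeEn": [(547, True)],
    "MC Speech": [(22, True)],
    "Multilingual LibriSpeech": [(50687, True)],
    "SIWIS": [(11, True)],
    "Speech Commands": [(18, True)],
    "VCTK": [(44, True)],
    "VoxPopuli": [(383500, False), (1791, True)],
    "YouTube-Commons": [(3261, False), (443396, True)],
}


def total_hours(rows, labeled=None):
    """Sum hours over all rows; pass labeled=True/False to keep only one label status."""
    return sum(
        hours
        for parts in rows.values()
        for hours, is_labeled in parts
        if labeled is None or is_labeled == labeled
    )


print(total_hours(ROWS))                 # 950193
print(total_hours(ROWS, labeled=True))   # 505726
print(total_hours(ROWS, labeled=False))  # 444467
```

Under these assumptions the total is 950,193 hours, which is consistent with the roughly 950,000 hours stated in the title.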
If you want to add an open-source compliant dataset to the list, please open a Pull Request. To report an issue with existing content, please use the issues section.
If you use the MOSEL dataset, please cite:
@inproceedings{mosel,
title = {{MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages}},
author = {Marco Gaido and Sara Papi and Luisa Bentivogli and Alessio Brutti and Mauro Cettolo and Roberto Gretter and Marco Matassoni and Mohamed Nabih and Matteo Negri},
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, United States",
publisher = "Association for Computational Linguistics",
}