MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages

🍇 Open-Source Compliant Speech Dataset List

Name	License	Hours	Languages	Label
CommonVoice	CC 0	6,732	bg, cs, da, nl, en, et, fi, fr, de, el, hu, ga, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv	✅
CoVoST2	CC 0	687	en, fr, it, es, pt, et, nl, sv, lv, sl	✅
CSS10	Public Domain	99	nl, fi, fr, de, el, hu, es	✅
EMU	CC BY 3.0	56	pl	✅
EU Parliament	CC BY 4.0	32	pl	✅
FLEURS	CC BY 4.0	215	bg, cs, da, nl, en, et, fi, fr, de, el, hu, ga, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv	✅
Large Corpus of Czech Parliament Plenary Hearings	CC BY 4.0	444	cs	✅
LibriLight	Public Domain	57,706	en	❌
LibriTTS	CC BY 4.0	585	en	✅
LibriSpeech	CC BY 4.0	360	en	✅
LibriVoxDeEn	Public Domain	547	de	✅
MC Speech	CC 0	22	pl	✅
Multilingual LibriSpeech	CC BY 4.0	50,687	nl, en, fr, de, it, pl, pt, es	✅
SIWIS	CC BY 4.0	11	fr	✅
Speech Commands	CC BY 4.0	18	en	✅
VCTK	CC BY 4.0	44	en	✅
VoxPopuli	CC 0	383,500	bg, hr, cs, da, nl, en, et, fi, fr, de, el, hu, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv	❌
VoxPopuli	CC 0	1,791	hr, cs, nl, en, et, fi, fr, de, hu, it, lt, pl, ro, sk, sl, es	✅
YouTube-Commons	CC BY 4.0	3,261	bg, cs, nl, en, et, fr, de, el, hu, it, pl, pt, ro, es	❌
YouTube-Commons	CC BY 4.0	443,396	bg, cs, nl, en, et, fi, fr, de, el, hu, it, lv, lt, pl, pt, ro, es, sv	✅
MOSEL 🍇	CC BY 4.0	441,206	bg, hr, cs, da, nl, en, et, fi, fr, de, el, hu, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv	✅
Datasets added after MOSEL release
Yodas	CC BY 3.0	369,510	149 languages	✅
LoquaciousSet	CC BY 3.0/4.0	25,000	en	✅
Granary 🌽	CC BY 3.0/4.0	~1M	bg, cs, da, de, el, en, es, et, fi, fr, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv, uk, ru	✅

Languages are listed using two-letter ISO 639-1 codes.
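As a rough illustration of how the list above could be queried, the sketch below parses a few rows copied from the table and selects the datasets that cover a given language. The function names `parse_rows` and `select` are hypothetical helpers, not part of the MOSEL repository, and the code assumes the tab-separated layout and the ✅/❌ markers shown here. It only handles plain integer hour counts, so an entry such as `~1M` would need separate handling.

```python
import csv
import io

# A few rows copied from the table above, tab-separated as in the listing.
SAMPLE = (
    "Name\tLicense\tHours\tLanguages\tLabel\n"
    "CSS10\tPublic Domain\t99\tnl, fi, fr, de, el, hu, es\t✅\n"
    "LibriLight\tPublic Domain\t57,706\ten\t❌\n"
    "MC Speech\tCC 0\t22\tpl\t✅\n"
)


def parse_rows(text):
    """Turn the tab-separated listing into a list of dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for record in reader:
        rows.append({
            "name": record["Name"],
            "license": record["License"],
            # Hours use a thousands separator, e.g. "57,706".
            "hours": int(record["Hours"].replace(",", "")),
            "languages": {code.strip() for code in record["Languages"].split(",")},
            # ✅ marks labeled (transcribed) data, ❌ marks unlabeled data.
            "labeled": record["Label"] == "✅",
        })
    return rows


def select(rows, language, labeled_only=True):
    """Return the datasets covering `language`, optionally labeled ones only."""
    return [
        row for row in rows
        if language in row["languages"] and (row["labeled"] or not labeled_only)
    ]


rows = parse_rows(SAMPLE)
polish = select(rows, "pl")
print([row["name"] for row in polish], sum(row["hours"] for row in polish))
```

Passing `labeled_only=False` also returns unlabeled corpora such as LibriLight, which may be useful when collecting data for self-supervised or pseudo-labeling setups.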

⚠️ Contribute and Report Issues

If you want to add an open-source compliant dataset to the list, please open a Pull Request. If you want to report a problem with existing content, please use the Issues section.

🏁 Citation

If you use the MOSEL dataset, please cite:

@inproceedings{mosel,
  title = {{MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages}},
  author = {Marco Gaido and Sara Papi and Luisa Bentivogli and Alessio Brutti and Mauro Cettolo and Roberto Gretter and Marco Matassoni and Mohamed Nabih and Matteo Negri},
  booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
  month = nov,
  year = "2024",
  address = "Miami, United States",
  publisher = "Association for Computational Linguistics",
}
