Resources in Underrepresented Languages: Building a Representative Romanian Corpus

Ludmila Midrigan - Ciochina, Victoria Boyd, Lucila Sanchez-Ortega, Diana Malancea_Malac, Doina Midrigan, David P. Corina


Abstract
The effort in the field of Linguistics to develop theories that aim to explain language-dependent effects on language processing is greatly facilitated by the availability of reliable resources representing different languages. This project presents a detailed description of the process of creating a large and representative corpus of Romanian, a relatively under-resourced language with unique structural and typological characteristics. The corpus is intended as a reliable language resource for linguistic studies. The decisions that have guided the construction of the corpus, including the type of corpus, its size, and its component resource files, are discussed. Issues related to data collection, data organization and storage, as well as characteristics of the data included in the corpus, are described. Currently, the corpus has approximately 5,500,000 tokens originating from written text and 100,000 tokens of spoken language. It includes language samples that represent a wide variety of registers (16 registers of written language and 5 registers of spoken language), as well as different authors and speakers.
Anthology ID:
2020.lrec-1.402
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
3291–3296
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.402
Cite (ACL):
Ludmila Midrigan - Ciochina, Victoria Boyd, Lucila Sanchez-Ortega, Diana Malancea_Malac, Doina Midrigan, and David P. Corina. 2020. Resources in Underrepresented Languages: Building a Representative Romanian Corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3291–3296, Marseille, France. European Language Resources Association.
Cite (Informal):
Resources in Underrepresented Languages: Building a Representative Romanian Corpus (Midrigan - Ciochina et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.402.pdf