Two LRL & Distractor Corpora from Web Information Retrieval and a Small Case Study in Language Identification without Training Corpora

Armin Hoenen; Cemre Koc; Marc Rahn

Abstract

In recent years, low-resource languages (LRLs) have seen a surge in interest, both because certain tasks have been solved for larger languages and because LRLs present various challenges (data sparsity, scarcity of experts and expertise, unusual structural properties, etc.). In the wake of this interest, resources and technologies have been created for a large number of them. However, there are very small languages for which this has not yet led to a significant change. We focus here on one such language (Nogai) and one larger small language (Maori). Smaller languages in particular often have very similar siblings, or a larger small sister language that is more accessible, so the rate of noise in the data gathered on them so far is often high. Therefore, we present small corpora for our two case study languages, obtained through web information retrieval, together with corpora for their noise-inducing distractor languages. We also conduct a small language identification experiment in which each document is classified in a boolean way as either belonging or not belonging to the target language. We release our test corpora for two such scenarios in the format of the An Crúbadán project (Scannell, 2007), along with a tool for unsupervised language identification using alphabet and toponym information.
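The boolean, training-free identification described in the abstract can be illustrated with a minimal sketch. It is not the authors' released tool. The letter inventory is the standard Maori alphabet, while the toponym list, the two thresholds, the way the two signals are combined, and the function names `alphabet_coverage` and `is_target` are all assumptions made for illustration.

```python
import re
import unicodedata

# Standard Maori alphabet with macron vowels; "g" occurs only in the digraph "ng".
MAORI_LETTERS = set("aehikmnoprtuwgāēīōū")
# Hypothetical toponym list, for illustration only.
MAORI_TOPONYMS = {"aotearoa", "rotorua", "whangārei"}


def alphabet_coverage(text, letters):
    """Return the fraction of alphabetic characters that belong to `letters`."""
    normalized = unicodedata.normalize("NFC", text.lower())
    chars = [c for c in normalized if c.isalpha()]
    if not chars:
        return 0.0
    return sum(c in letters for c in chars) / len(chars)


def is_target(text, letters=MAORI_LETTERS, toponyms=MAORI_TOPONYMS,
              min_coverage=0.95, relaxed_coverage=0.85):
    """Boolean decision: does the document look like the target language?

    Assumed rule: accept on high alphabet coverage alone, or on slightly
    lower coverage when a known toponym is also present.
    """
    normalized = unicodedata.normalize("NFC", text.lower())
    tokens = re.findall(r"[^\W\d_]+", normalized)
    coverage = alphabet_coverage(normalized, letters)
    has_toponym = any(token in toponyms for token in tokens)
    if coverage >= min_coverage:
        return True
    return has_toponym and coverage >= relaxed_coverage


if __name__ == "__main__":
    print(is_target("Kei te pai ahau i Rotorua"))
    print(is_target("The quick brown fox jumps over the lazy dog"))
```

A rule of this kind needs no training corpus, only a letter inventory and a gazetteer, which is what makes it attractive for very small languages. Closely related distractor languages that share most of the alphabet would still need additional evidence.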

Anthology ID: 2020.sltu-1.4
Volume: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)
Month: May
Year: 2020
Address: Marseille, France
Editors: Dorothee Beermann, Laurent Besacier, Sakriani Sakti, Claudia Soria
Venue: SLTU
Publisher: European Language Resources Association
Pages: 28–35
Language: English
URL: https://aclanthology.org/2020.sltu-1.4/
Cite (ACL): Armin Hoenen, Cemre Koc, and Marc Rahn. 2020. Two LRL & Distractor Corpora from Web Information Retrieval and a Small Case Study in Language Identification without Training Corpora. In Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), pages 28–35, Marseille, France. European Language Resources Association.
Cite (Informal): Two LRL & Distractor Corpora from Web Information Retrieval and a Small Case Study in Language Identification without Training Corpora (Hoenen et al., SLTU 2020)
PDF: https://aclanthology.org/2020.sltu-1.4.pdf
