Multi-label Discourse Function Classification of Lexical Bundles in Basque and Spanish via Transformer-based Models
Abstract
This paper examines how effectively transformer-based models perform multi-label classification of the discourse functions of lexical bundles in two languages, Basque and Spanish. The study has a dual focus: first, to evaluate the impact of manually and automatically annotated datasets on fine-tuning for this task; second, to demonstrate the effectiveness of multilingual language models in a cross-lingual transfer learning setting. Above all, our findings show that these models can generalize the discourse function classification of lexical bundles beyond specific word sequences, in both monolingual and cross-lingual transfer learning settings. In the monolingual setting, manually annotated datasets outperform automatically annotated ones, provided that the dataset is sufficiently large. In the cross-lingual setting, even though the transfer occurs between two typologically different languages, the results likewise suggest that manually annotated datasets are superior, and that the monolingual results can be surpassed when the ratios of target- and source-language training and fine-tuning corpora are balanced.