ZurichNLP/ConLoan

ConLoan: A Contrastive Multilingual Dataset for Evaluating Loanwords


What if English sounded completely different? Consider this familiar opening from the U.S. Constitution:

🏛️ Standard English (with loanwords)

"We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defense, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America."

⚔️ "Pure" English (native words only)

"We the Folk of the Foroned Riches, to make a more flawless oneship, build rightness, bring frith and stillness to our land, shield one another, uphold the overall welfare, and hold fast the Blessings of Freedom to ourselves and our offspring, do foresay and lay down this lawbook for the foroned riches of Americksland."

The second version uses predominantly Germanic roots, avoiding Latin and French borrowings that entered English over centuries. This isn't just a linguistic curiosity; there's even a movement called Anglish dedicated to this approach!

ConLoan

While the "pure English" example might seem amusing, loanword integration is happening right now, in every language, every day. From "WiFi" entering global vocabularies to "kawaii" spreading beyond Japanese, languages constantly borrow and adapt words from each other. But how does modern language technology handle this linguistic reality?

This is the question we address with ConLoan, a novel contrastive dataset of sentences with and without loanwords in 10 languages: Chinese, French, German, Greek, Icelandic, Italian, Northern Kurdish, Portuguese, Russian and Spanish.

Dataset Overview

ConLoan contains sentences with one or more loanwords, each paired with an equivalent sentence in which the loanwords have been manually replaced by native alternatives. Here is an example in French:

| Sentence with loanwords 🔴 | Equivalent with native replacements 🔵 |
| --- | --- |
| Cela impliquerait également tous les frais de transaction occasionnés par la vente de vos actions, puis par leur rachat après le crack boursier. | Cela impliquerait également tous les frais de transaction occasionnés par la vente de vos actions, puis par leur rachat après la débacle boursière. |
| Les propositions de la Commission dont nous discutons ici permettent le lifting dont elle a besoin. | Les propositions de la Commission dont nous discutons ici permettent le lissage dont elle a besoin. |

The left column contains the loanwords and the right column their native alternatives: the two sentences in each pair differ only in the use of loanwords versus native vocabulary. The dataset also provides English translations and other metadata.
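Because the two sentences in a pair differ only in the replaced words, the contrast can be recovered with a simple token-level diff. The following is a minimal sketch using Python's standard difflib module; the function name contrast_spans and the whitespace tokenization are assumptions made for illustration, not part of the ConLoan code.

```python
import difflib


def contrast_spans(with_loanwords: str, native: str) -> list[tuple[str, str]]:
    """Return (loanword span, native span) for every place the sentences differ.

    Hypothetical helper: splits on whitespace, so punctuation stays attached
    to the neighbouring word.
    """
    a = with_loanwords.split()
    b = native.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Keep only the token ranges that are not shared by both sentences.
            spans.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return spans


# The second French example from the table above.
loan = ("Les propositions de la Commission dont nous discutons ici "
        "permettent le lifting dont elle a besoin.")
repl = ("Les propositions de la Commission dont nous discutons ici "
        "permettent le lissage dont elle a besoin.")
print(contrast_spans(loan, repl))  # [('lifting', 'lissage')]
```

When a replacement changes agreement, as in the first example (le crack boursier versus la débacle boursière), the returned span covers the article and adjective as well as the loanword itself.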

In this repository, you can find two main data folders:

  • annotations/: Raw annotations from individual annotators (per language)
  • datasets/: Processed contrastive sentence pairs in JSON format and TSV files with replacements. Each language has the following data files:
    • JSON files: Contain contrastive sentence pairs with loanwords and their native replacements
    • *_all_replacements.tsv: All annotated loanword replacements, including cases where loanwords were not replaced
    • *_replaced_loanwords.tsv: Only cases where loanwords were successfully replaced with native alternatives

| Language | Annotations (JSON) | All Replacements (TSV) | Replaced Loanwords (TSV) |
| --- | --- | --- | --- |
| Chinese | Chinese.json | Chinese_all_replacements.tsv | Chinese_replaced_loanwords.tsv |
| French | French.json | French_all_replacements.tsv | French_replaced_loanwords.tsv |
| German | German.json | German_all_replacements.tsv | German_replaced_loanwords.tsv |
| Greek | Greek.json | Greek_all_replacements.tsv | Greek_replaced_loanwords.tsv |
| Icelandic | Icelandic.json | Icelandic_all_replacements.tsv | Icelandic_replaced_loanwords.tsv |
| Italian | Italian.json | Italian_all_replacements.tsv | Italian_replaced_loanwords.tsv |
| Northern Kurdish | Northern-Kurdish.json | Northern-Kurdish_all_replacements.tsv | Northern-Kurdish_replaced_loanwords.tsv |
| Portuguese | Portuguese.json | Portuguese_all_replacements.tsv | Portuguese_replaced_loanwords.tsv |
| Russian | Russian.json | Russian_all_replacements.tsv | Russian_replaced_loanwords.tsv |
| Spanish | Spanish.json | Spanish_all_replacements.tsv | Spanish_replaced_loanwords.tsv |
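The file naming above is regular, so the three files for a language can be located and read with the standard library alone. This is a minimal sketch, not code from the repository: the helper names (dataset_paths, load_pairs, load_tsv_rows) are hypothetical, and it assumes UTF-8 encoding and tab-separated TSV files. The JSON structure and TSV columns are not documented here, so the loaders return the parsed content as-is without assuming any field names.

```python
import csv
import json
from pathlib import Path


def dataset_paths(language: str, root: str = "datasets") -> dict[str, Path]:
    """Build the three per-language file paths from the naming scheme above.

    `language` is the file prefix, e.g. "French" or "Northern-Kurdish".
    """
    base = Path(root)
    return {
        "pairs": base / f"{language}.json",
        "all_replacements": base / f"{language}_all_replacements.tsv",
        "replaced_loanwords": base / f"{language}_replaced_loanwords.tsv",
    }


def load_pairs(path):
    """Parse a JSON file and return its content unchanged (schema not assumed)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def load_tsv_rows(path) -> list[list[str]]:
    """Read a TSV file as a list of rows; the first row may be a header."""
    with open(path, encoding="utf-8", newline="") as f:
        return list(csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE))
```

For example, from the repository root, load_pairs(dataset_paths("French")["pairs"]) would read datasets/French.json. Inspect the returned object before relying on particular keys or columns.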

The loanwords/ folder contains the per-language loanword lists, collected from the sources reported in the paper (mainly Wiktionary).

Annotation Guidelines

For detailed information about the annotation process and guidelines used in creating this dataset, see loanword_annotation_guide.md.

Code Files

| File | Description |
| --- | --- |
| analyze.py | General analysis utilities for the ConLoan dataset |
| create_corpus.py | Script for creating the corpus from raw annotations |
| donor.py | Analysis of donor language distributions |
| data.json | Configuration file or metadata |
| AppScripts/ | Google Apps Script code used for annotation spreadsheets |
| experiments/surprisal.py | Language model surprisal analysis |
| experiments/t-test.py | Statistical significance testing for surprisal differences |
| experiments/ridge_plot.R | R script for creating ridge plots (density visualizations) |

The experiments/ folder contains scripts and results for the experiments described in the paper.
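To give a rough idea of what a surprisal comparison involves, here is a minimal, self-contained sketch. It is not the code in experiments/ and makes several assumptions: surprisal is taken as the sum of -log2 p over the per-token probabilities a language model assigns, and significance is assessed with a paired t statistic over sentence pairs. The function names are hypothetical, and obtaining token probabilities from an actual model is left out.

```python
import math
from statistics import mean, stdev


def sentence_surprisal(token_probs: list[float]) -> float:
    """Total surprisal in bits: the sum of -log2 p over all tokens."""
    return sum(-math.log2(p) for p in token_probs)


def paired_t_statistic(xs: list[float], ys: list[float]) -> float:
    """Paired t statistic for two equally long lists of measurements.

    xs[i] and ys[i] must belong to the same contrastive pair, e.g. the
    surprisal of the loanword sentence and of its native counterpart.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    diffs = [x - y for x, y in zip(xs, ys)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))
```

The statistic has len(xs) - 1 degrees of freedom. A paired design fits the data because each loanword sentence has exactly one native counterpart, so per-pair differences cancel out everything the two sentences share.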

Citation

If you're using this project, please cite this paper:

  @inproceedings{ahmadi2025conloan,
    title = {ConLoan: A Contrastive Multilingual Dataset for Evaluating Loanwords},
    author = {
      Ahmadi, Sina and
      Hess, Micha David and
      Álvarez Mellado, Elena and
      Battisti, Alessia and
      Ding, Cui and
      Göhring, Anne and
      Gao, Yingqiang and
      Jiang, Zifan and
      Michail, Andrianos and
      Morad, Peshmerge and
      Niklaus, Joel and
      Panagiotopoulou, Maria Christina and
      Perrella, Stefano and
      Opitz, Juri and
      Shaitarova, Anastassia and
      Sennrich, Rico
    },
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = ""
  }

License

This project is open source and released under the permissive MIT license.