ConLoan: A Contrastive Multilingual Dataset for Evaluating Loanwords

What if English sounded completely different? Consider this familiar opening from the U.S. Constitution:

🏛️ Standard English (with loanwords)

"We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defense, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America."

⚔️ "Pure" English (native words only)

"We the Folk of the Foroned Riches, to make a more flawless oneship, build rightness, bring frith and stillness to our land, shield one another, uphold the overall welfare, and hold fast the Blessings of Freedom to ourselves and our offspring, do foresay and lay down this lawbook for the foroned riches of Americksland."

The second version uses predominantly Germanic roots, avoiding Latin and French borrowings that entered English over centuries. This isn't just a linguistic curiosity; there's even a movement called Anglish dedicated to this approach!

ConLoan

While the "pure English" example might seem amusing, loanword integration is happening right now, in every language, every day. From "WiFi" entering global vocabularies to "kawaii" spreading beyond Japanese, languages constantly borrow and adapt words from each other. But how does modern language technology handle this linguistic reality?

ConLoan is our attempt to answer this question. It is a novel contrastive dataset of sentences with and without loanwords across 10 languages: Chinese, French, German, Greek, Icelandic, Italian, Northern Kurdish, Portuguese, Russian, and Spanish.

Dataset Overview

ConLoan contains sentences with one or more loanwords, each paired with an equivalent sentence in which the loanwords have been manually replaced by native alternatives. Here is an example:

sentence with loanwords 🔴	equivalent with native replacements 🔵
Cela impliquerait également tous les frais de transaction occasionnés par la vente de vos actions, puis par leur rachat après le crack boursier.	Cela impliquerait également tous les frais de transaction occasionnés par la vente de vos actions, puis par leur rachat après la débacle boursière.
Les propositions de la Commission dont nous discutons ici permettent le lifting dont elle a besoin.	Les propositions de la Commission dont nous discutons ici permettent le lissage dont elle a besoin.

The left column contains the loanwords and the right column their native alternatives: the two sentences in each pair differ only in their use of loanwords versus native vocabulary. The dataset also provides English translations and other metadata.
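Because the two sentences in a pair differ only where a loanword was replaced, the replaced span can be recovered with a token-level diff. The following is a minimal sketch using Python's standard `difflib`; the function name `differing_spans` is hypothetical and not part of this repository, and whitespace tokenization is an assumption that would not work for unsegmented text such as Chinese.

```python
from difflib import SequenceMatcher


def differing_spans(with_loanwords: str, native: str) -> list[tuple[str, str]]:
    """Return (loanword-side, native-side) token spans where the sentences differ."""
    # Assumption: whitespace tokenization is good enough for the language at hand.
    a = with_loanwords.split()
    b = native.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep replaced, inserted, or deleted spans only
            spans.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return spans


loan = ("Les propositions de la Commission dont nous discutons ici "
        "permettent le lifting dont elle a besoin.")
native = ("Les propositions de la Commission dont nous discutons ici "
          "permettent le lissage dont elle a besoin.")
print(differing_spans(loan, native))  # [('lifting', 'lissage')]
```

When a replacement also changes agreement (as in "le crack boursier" versus "la débacle boursière"), the returned span covers the neighbouring tokens that changed as well.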

In this repository, you can find two main data folders:

annotations/: Raw annotations from individual annotators (per language)
datasets/: Processed contrastive sentence pairs in JSON format and TSV files with replacements. The data files per language are:
- JSON files: Contain contrastive sentence pairs with loanwords and their native replacements
- *_all_replacements.tsv: All annotated loanword replacements, including cases where loanwords were not replaced
- *_replaced_loanwords.tsv: Only cases where loanwords were successfully replaced with native alternatives

Language	Contrastive Pairs (JSON)	All Replacements (TSV)	Replaced Loanwords (TSV)
Chinese	`Chinese.json`	`Chinese_all_replacements.tsv`	`Chinese_replaced_loanwords.tsv`
French	`French.json`	`French_all_replacements.tsv`	`French_replaced_loanwords.tsv`
German	`German.json`	`German_all_replacements.tsv`	`German_replaced_loanwords.tsv`
Greek	`Greek.json`	`Greek_all_replacements.tsv`	`Greek_replaced_loanwords.tsv`
Icelandic	`Icelandic.json`	`Icelandic_all_replacements.tsv`	`Icelandic_replaced_loanwords.tsv`
Italian	`Italian.json`	`Italian_all_replacements.tsv`	`Italian_replaced_loanwords.tsv`
Northern Kurdish	`Northern-Kurdish.json`	`Northern-Kurdish_all_replacements.tsv`	`Northern-Kurdish_replaced_loanwords.tsv`
Portuguese	`Portuguese.json`	`Portuguese_all_replacements.tsv`	`Portuguese_replaced_loanwords.tsv`
Russian	`Russian.json`	`Russian_all_replacements.tsv`	`Russian_replaced_loanwords.tsv`
Spanish	`Spanish.json`	`Spanish_all_replacements.tsv`	`Spanish_replaced_loanwords.tsv`
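The file names in the table follow one pattern per language, so the paths can be built programmatically. The sketch below assumes the files sit directly inside `datasets/`; the helper names `dataset_files` and `read_tsv_rows` are hypothetical. It makes no assumption about the TSV columns or the JSON schema, so inspect the files before relying on specific fields.

```python
import csv
from pathlib import Path

LANGUAGES = [
    "Chinese", "French", "German", "Greek", "Icelandic",
    "Italian", "Northern Kurdish", "Portuguese", "Russian", "Spanish",
]


def dataset_files(language: str, root: Path = Path("datasets")) -> dict[str, Path]:
    """Build the three file paths for a language, following the table's naming."""
    stem = language.replace(" ", "-")  # "Northern Kurdish" -> "Northern-Kurdish"
    return {
        "json": root / f"{stem}.json",
        "all_replacements": root / f"{stem}_all_replacements.tsv",
        "replaced_loanwords": root / f"{stem}_replaced_loanwords.tsv",
    }


def read_tsv_rows(path: Path) -> list[list[str]]:
    """Read a TSV file into raw rows without assuming a header or column names."""
    with path.open(encoding="utf-8", newline="") as handle:
        return list(csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


for language in LANGUAGES:
    print(language, dataset_files(language)["replaced_loanwords"])
```

Calling `read_tsv_rows(dataset_files("French")["all_replacements"])` from the repository root would then return the rows of that file as lists of strings.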

You can also find the loanword lists collected per language from sources reported in the paper (mainly Wiktionary) in the loanwords/ folder.

Annotation Guidelines

For detailed information about the annotation process and guidelines used in creating this dataset, see loanword_annotation_guide.md.

Code Files

File	Description
`analyze.py`	General analysis utilities for the ConLoan dataset
`create_corpus.py`	Script for creating the corpus from raw annotations
`donor.py`	Analysis of donor language distributions
`data.json`	Configuration file or metadata
`AppScripts/`	Google Apps Script code used for annotation spreadsheets
`experiments/surprisal.py`	Language model surprisal analysis
`experiments/t-test.py`	Statistical significance testing for surprisal differences
`experiments/ridge_plot.R`	R script for creating ridge plots (density visualizations)

The experiments/ folder contains scripts and results for the experiments described in the paper.
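For readers unfamiliar with the quantities involved, here is a small illustration of surprisal and of a t statistic over surprisal differences. It is a sketch under stated assumptions, not the code in `experiments/`: the function names are hypothetical, the paired design is an assumption that this README does not confirm, and token probabilities would in practice come from a language model rather than being supplied by hand.

```python
import math
from statistics import mean, stdev


def sentence_surprisal(token_probs: list[float]) -> float:
    """Sum of per-token surprisals, -log2 p(token | context), in bits."""
    return sum(-math.log2(p) for p in token_probs)


def paired_t_statistic(xs: list[float], ys: list[float]) -> float:
    """t statistic for paired samples: mean difference over its standard error."""
    # Assumption: xs[i] and ys[i] belong to the same contrastive sentence pair.
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))


# Placeholder probabilities, chosen only to make the arithmetic easy to check.
print(sentence_surprisal([0.5, 0.25]))  # 3.0
```

Given one surprisal value per sentence with loanwords and one per native counterpart, `paired_t_statistic` compares the two lists pair by pair; at least two pairs with non-identical differences are needed for the statistic to be defined.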

Citation

If you use this project, please cite the following paper:

@inproceedings{ahmadi2025conloan,
    title = {ConLoan: A Contrastive Multilingual Dataset for Evaluating Loanwords},
    author = {
        Ahmadi, Sina and
        Hess, Micha David and
        Álvarez Mellado, Elena and
        Battisti, Alessia and
        Ding, Cui and
        Göhring, Anne and
        Gao, Yingqiang and
        Jiang, Zifan and
        Michail, Andrianos and
        Morad, Peshmerge and
        Niklaus, Joel and
        Panagiotopoulou, Maria Christina and
        Perrella, Stefano and
        Opitz, Juri and
        Shaitarova, Anastassia and
        Sennrich, Rico
    },
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = ""
}

License

This project is open source and released under the permissive MIT License.
