What if English sounded completely different? Consider this familiar opening from the U.S. Constitution:
🏛️ Standard English (with loanwords) | ⚔️ "Pure" English (native words only) |
---|---|
The second version uses predominantly Germanic roots, avoiding Latin and French borrowings that entered English over centuries. This isn't just a linguistic curiosity; there's even a movement called Anglish dedicated to this approach!
While the "pure English" example might seem amusing, loanword integration is happening right now, in every language, every day. From "WiFi" entering global vocabularies to "kawaii" spreading beyond Japanese, languages constantly borrow and adapt words from each other. But how does modern language technology handle this linguistic reality?
This is exactly what we try to answer with ConLoan. ConLoan is a novel contrastive dataset comprising sentences with and without loanwords across 10 languages, namely Chinese, French, German, Greek, Icelandic, Italian, Northern Kurdish, Portuguese, Russian and Spanish.
Each sentence in ConLoan contains one or more loanwords and is paired with an equivalent in which the loanwords have been manually replaced by native alternatives. Here is an example:
sentence with loanwords 🔴 | equivalent with native replacements 🔵 |
---|---|
Cela impliquerait également tous les frais de transaction occasionnés par la vente de vos actions, puis par leur rachat après le crack boursier. | Cela impliquerait également tous les frais de transaction occasionnés par la vente de vos actions, puis par leur rachat après la débacle boursière. |
Les propositions de la Commission dont nous discutons ici permettent le lifting dont elle a besoin. | Les propositions de la Commission dont nous discutons ici permettent le lissage dont elle a besoin. |
This table contrasts sentences containing loanwords (left column) with their native alternatives (right column): the two sentences in each pair differ only in their use of loanwords versus native vocabulary. The dataset also provides English translations and other metadata.
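To see what "differ only in the loanword" looks like in practice, the following minimal sketch aligns the two sentences of a pair word by word and reports the spans that differ. It uses only Python's standard `difflib`; the function name `contrast_spans` is a hypothetical helper for illustration and is not part of the repository.

```python
from difflib import SequenceMatcher


def contrast_spans(with_loan: str, with_native: str) -> list[tuple[str, str]]:
    """Return (loanword-side span, native-side span) pairs where the sentences differ."""
    # Whitespace tokenization is an assumption; it is not suitable for Chinese.
    a, b = with_loan.split(), with_native.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            spans.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return spans


# Sentence pair taken from the example table above.
loan = ("Les propositions de la Commission dont nous discutons ici "
        "permettent le lifting dont elle a besoin.")
native = ("Les propositions de la Commission dont nous discutons ici "
          "permettent le lissage dont elle a besoin.")

print(contrast_spans(loan, native))  # [('lifting', 'lissage')]
```

Note that a replacement can also change neighbouring words through agreement: in the first row of the table, the article and the adjective change along with the noun, so the differing span there is wider than the loanword itself.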
In this repository, you can find two main data folders:
- annotations/: Raw annotations from individual annotators (per language)
- datasets/: Processed contrastive sentence pairs in JSON format and TSV files with replacements. Here are the data files per language:
  - JSON files: Contain contrastive sentence pairs with loanwords and their native replacements
  - *_all_replacements.tsv: All annotated loanword replacements, including cases where loanwords were not replaced
  - *_replaced_loanwords.tsv: Only cases where loanwords were successfully replaced with native alternatives
Language | Annotations (JSON) | All Replacements (TSV) | Replaced Loanwords (TSV) |
---|---|---|---|
Chinese | Chinese.json | Chinese_all_replacements.tsv | Chinese_replaced_loanwords.tsv |
French | French.json | French_all_replacements.tsv | French_replaced_loanwords.tsv |
German | German.json | German_all_replacements.tsv | German_replaced_loanwords.tsv |
Greek | Greek.json | Greek_all_replacements.tsv | Greek_replaced_loanwords.tsv |
Icelandic | Icelandic.json | Icelandic_all_replacements.tsv | Icelandic_replaced_loanwords.tsv |
Italian | Italian.json | Italian_all_replacements.tsv | Italian_replaced_loanwords.tsv |
Northern Kurdish | Northern-Kurdish.json | Northern-Kurdish_all_replacements.tsv | Northern-Kurdish_replaced_loanwords.tsv |
Portuguese | Portuguese.json | Portuguese_all_replacements.tsv | Portuguese_replaced_loanwords.tsv |
Russian | Russian.json | Russian_all_replacements.tsv | Russian_replaced_loanwords.tsv |
Spanish | Spanish.json | Spanish_all_replacements.tsv | Spanish_replaced_loanwords.tsv |
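The file names above follow a regular pattern, so the per-language paths can be built programmatically. The sketch below is a rough illustration, not part of the repository: `replacement_files` and `read_tsv` are hypothetical helpers, and it assumes that the script runs from the repository root and that each TSV file starts with a header row. The column names and the JSON schema are not documented here, so the code only reports whatever columns it finds.

```python
import csv
from pathlib import Path

DATASETS_DIR = Path("datasets")  # assumption: run from the repository root


def replacement_files(language: str) -> dict[str, Path]:
    """Build the three per-language paths following the naming pattern in the table."""
    stem = language.replace(" ", "-")  # "Northern Kurdish" -> "Northern-Kurdish"
    return {
        "pairs": DATASETS_DIR / f"{stem}.json",
        "all_replacements": DATASETS_DIR / f"{stem}_all_replacements.tsv",
        "replaced_loanwords": DATASETS_DIR / f"{stem}_replaced_loanwords.tsv",
    }


def read_tsv(path: Path) -> list[dict[str, str]]:
    """Read a tab-separated file into row dicts (assumes the first line is a header)."""
    with path.open(encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)


paths = replacement_files("French")
if paths["replaced_loanwords"].exists():
    rows = read_tsv(paths["replaced_loanwords"])
    columns = list(rows[0].keys()) if rows else []
    print(f"{len(rows)} rows; columns: {columns}")
```

Quote handling is disabled with `csv.QUOTE_NONE` so that quotation marks inside sentences are kept verbatim; if the files turn out to use CSV-style quoting, that argument should be dropped.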
You can also find the loanword lists collected per language from sources reported in the paper (mainly Wiktionary) in the loanwords/ folder.
For detailed information about the annotation process and guidelines used in creating this dataset, see loanword_annotation_guide.md.
File | Description |
---|---|
analyze.py | General analysis utilities for the ConLoan dataset |
create_corpus.py | Script for creating the corpus from raw annotations |
donor.py | Analysis of donor language distributions |
data.json | Configuration file or metadata |
AppScripts/ | Google Apps Script code used for annotation spreadsheets |
experiments/surprisal.py | Language model surprisal analysis |
experiments/t-test.py | Statistical significance testing for surprisal differences |
experiments/ridge_plot.R | R script for creating ridge plots (density visualizations) |
The experiments/ folder contains scripts and results for the experiments described in the paper.
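For readers unfamiliar with the terms, here is a minimal, standard-library sketch of the two ideas named in the file descriptions: sentence surprisal and a significance test on surprisal differences. It is not the code in experiments/. It assumes that per-token probabilities have already been obtained from some language model and that a paired t-test is used, which seems natural for contrastive pairs but is an assumption. The function names are hypothetical.

```python
import math
from statistics import mean, stdev


def surprisal(token_probs: list[float]) -> float:
    """Sentence surprisal in bits: the sum of -log2(p) over its token probabilities."""
    return -sum(math.log2(p) for p in token_probs)


def paired_t_statistic(xs: list[float], ys: list[float]) -> float:
    """t statistic for the mean pairwise difference xs[i] - ys[i] (n - 1 degrees of freedom)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [x - y for x, y in zip(xs, ys)]
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (spread / math.sqrt(len(diffs)))


# Toy probabilities, made up for illustration only (not results from the paper).
print(surprisal([0.5, 0.25]))  # 3.0
```

In use, `xs` would hold the surprisal of each sentence with loanwords and `ys` the surprisal of its native counterpart, in the same order; a p-value would then come from the t distribution, for example via a statistics library.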
If you're using this project, please cite this paper:
@inproceedings{ahmadi2025conloan,
title = {ConLoan: A Contrastive Multilingual Dataset for Evaluating Loanwords},
author = {
Ahmadi, Sina and
Hess, Micha David and
Álvarez Mellado, Elena and
Battisti, Alessia and
Ding, Cui and
Göhring, Anne and
Gao, Yingqiang and
Jiang, Zifan and
Michail, Andrianos and
Morad, Peshmerge and
Niklaus, Joel and
Panagiotopoulou, Maria Christina and
Perrella, Stefano and
Opitz, Juri and
Shaitarova, Anastassia and
Sennrich, Rico
},
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = aug,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = ""
}
This project is fully open-source and released under the permissive MIT license.