Lux-Eval is a local evaluation suite for Luxembourgish machine translation (MT), inspired by MATEO (Vanroy et al., 2023).
It provides an easy-to-use client and a Flask-based API for computing a range of MT evaluation metrics, either locally or via a server.
- Evaluate Luxembourgish (lb) → target-language (tgt) MT output (e.g., fr, en, de, pt)
- Support for multiple complementary evaluation metrics
- System- and segment-level scoring
- Visual plots for performance insights
- Paired bootstrap resampling for statistical significance testing (see the sketch below this list)
- Score interpretation aid
- Modular architecture for adding new metrics
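For readers unfamiliar with the significance test mentioned above, here is a minimal sketch of paired bootstrap resampling over segment-level scores. It is illustrative only, not Lux-Eval's actual implementation; all names are my own:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=42):
    """Fraction of bootstrap-resampled test sets on which system A beats system B.

    scores_a / scores_b: segment-level metric scores, aligned by segment.
    Illustrative sketch only -- Lux-Eval's own implementation may differ.
    """
    assert len(scores_a) == len(scores_b), "score lists must be aligned"
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(n_samples):
        # Resample segment indices with replacement (paired: same indices for both systems)
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_samples
```

A value near 1.0 (or 0.0) suggests the difference between the two systems is unlikely to be an artefact of the particular test set.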
Install the client on your local machine:
```bash
# Clone the Lux-Eval repository
git clone https://github.com/greenirvavril/lux-eval.git

# Navigate into the project directory
cd lux-eval

# Make the setup script executable
chmod +x setup_client.sh

# Run the setup script
./setup_client.sh
```

Install the gateway either on your local machine or on your server:
```bash
# Make the setup script executable
chmod +x setup_gateway.sh

# Run the setup script
./setup_gateway.sh
```

Note: To use xCOMET-XL, you may have to acknowledge its license on the Hugging Face Hub and log in to your Hugging Face account (e.g. via `huggingface-cli login`).
The client requires the gateway's IP address to connect. Follow these steps to get started.
```bash
cd gateway
source main_venv/bin/activate
python main_gateway.py
```

Let it run until the output shows:

```
Gateway is running!
 - Local access (same machine): http://127.0.0.1:5000
 - Network access (other machines should use this IP): http://192.168.X.X:5000
```
- Use the local access URL if the gateway is running on the same machine as the client.
- Use the network access URL if the gateway is running on another machine on the same network.
- Copy the appropriate IP address from the gateway output.
- Navigate to the client folder:

  ```bash
  cd client
  ```

- Open `client.py` and locate:

  ```python
  URL = ""  # <-- enter your gateway IP, e.g.: "http://192.168.X.X:5000"
  ```

- Paste the IP address and configure the metrics you want to use (`True` or `False`).
- Save the file.
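For illustration only, a configured header of `client.py` might look like the sketch below; only `URL` is confirmed by the file itself, and the metric toggles are hypothetical names standing in for whatever flags your copy of `client.py` defines:

```python
# Hypothetical example of a filled-in client.py configuration
URL = "http://192.168.1.42:5000"  # gateway address copied from the gateway output

# Placeholder toggle names -- check client.py for the real ones
USE_BLEU = True
USE_CHRF2 = True
USE_TER = False
USE_BERTSCORE = True
USE_BLEURT20 = False
USE_XCOMET_XL = False
USE_LUXEMBEDDER = True
```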
In a new terminal:
```bash
cd client  # adjust path if necessary
source client_venv/bin/activate
python client.py
```

The client will connect to the gateway and start evaluating metrics as configured.
- Ensure that port 5000 is open on the gateway machine if running on a network.
- The gateway must be running before starting the client.
- For testing on the same machine, you can always use `http://127.0.0.1:5000`.
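Before starting the client, you can check that the gateway is reachable from the client machine. A standard-library sketch follows; the root path is not a documented endpoint, so any HTTP response at all counts as success:

```python
import urllib.error
import urllib.request

GATEWAY = "http://192.168.X.X:5000"  # replace with your gateway URL

try:
    urllib.request.urlopen(GATEWAY, timeout=5)
    print("Gateway reachable")
except urllib.error.HTTPError:
    # An HTTP error (e.g. 404) still proves the port is open and Flask answers
    print("Gateway reachable")
except OSError as exc:
    print(f"Gateway NOT reachable: {exc}")
```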
- Candidate file(s): MT model outputs (one file per system)
- Source file: original sentences (aligned with candidates)
- Reference file: gold-standard translations (used by reference-based metrics)
All files must be plain `.txt`, aligned line by line (same number of lines, one segment per line).
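Because misaligned files silently corrupt segment-level scores, it is worth verifying the line counts before a run; a minimal sketch (file names are examples):

```python
from pathlib import Path

def check_alignment(*paths: str) -> None:
    """Raise if the given plain-text files do not have the same number of lines."""
    counts = {p: len(Path(p).read_text(encoding="utf-8").splitlines()) for p in paths}
    if len(set(counts.values())) != 1:
        raise ValueError(f"Line counts differ: {counts}")
    print(f"OK: all files contain {next(iter(counts.values()))} segments")

# Example file names -- adjust to your own data
check_alignment("source.lb.txt", "reference.fr.txt", "candidate_systemA.fr.txt")
```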
| Metric | Description | Reference |
|---|---|---|
| BERTScore | Contextualised embeddings for semantic similarity | Zhang et al., 2019 |
| BLEURT20 | Trained on human preference data | Sellam et al., 2020 |
| xCOMET-XL | Trained on human judgement data; adds fine-grained error detection | Guerreiro et al., 2024 |
| BLEU | N-gram overlap | Papineni et al., 2002; Post, 2018 |
| ChrF2 | Character-level overlap | Popović, 2016; Post, 2018 |
| TER | Edit distance to reference | Snover et al., 2006; Post, 2018 |
Note: BERTScore uses `xlm-roberta-large`, except for English, where it uses `deberta-xlarge-mnli`.
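Lux-Eval computes all of these internally; for reference, the three surface metrics can be reproduced with sacrebleu (Post, 2018) along these lines:

```python
from sacrebleu.metrics import BLEU, CHRF, TER

hyps = ["Le chat est sur le tapis ."]          # system output, one segment
refs = [["Le chat est assis sur le tapis ."]]  # one reference stream

print(BLEU().corpus_score(hyps, refs))
print(CHRF().corpus_score(hyps, refs))  # defaults (char_order=6, beta=2) give chrF2
print(TER().corpus_score(hyps, refs))
```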
| Metric | Description | Reference |
|---|---|---|
| LuxEmbedder | Luxembourgish sentence embeddings | Philippy et al., 2024 |
- Results are exported to `.xlsx`, including an accuracy matrix with metric scores converted to probability percentages (cf. Kocmi et al., 2024).
- Matrix interpretation: similar to a correlation matrix; it shows the likelihood of one system outperforming another.
- Note: LuxEmbedder is excluded from the accuracy matrix due to conversion tool limitations.
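To post-process the exported workbook programmatically, pandas can read it directly; the file name below is an example, not necessarily what Lux-Eval writes:

```python
import pandas as pd

# Example file name -- use the one Lux-Eval actually exports
sheets = pd.read_excel("lux_eval_results.xlsx", sheet_name=None)  # dict of DataFrames

for name, df in sheets.items():
    print(name, df.shape)
```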
- For unrelated models: prioritise BERTScore, BLEURT20, and xCOMET-XL.
- Surface-overlap metrics (BLEU, ChrF2, TER) are limited and not recommended for cross-system comparison.
- Percentages in the accuracy matrix do not sum to 100% due to conversion tool limitations. Prioritise the positive scores (likelihood of model A being better than model B) over the negative scores (likelihood of model B being worse than model A).
- LuxEmbedder: promising but unverified; its scores are min-max-normalised from the 0.8–1.0 range onto 0–100 (see the sketch below this list); it can also be used for src → lb.
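The LuxEmbedder rescaling mentioned above is a plain min-max normalisation; as a sketch:

```python
def rescale(score: float, lo: float = 0.8, hi: float = 1.0) -> float:
    """Map a raw similarity in [lo, hi] onto the 0-100 scale used in the reports."""
    return max(0.0, min(100.0, (score - lo) / (hi - lo) * 100.0))

print(rescale(0.93))  # -> 65.0
```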
- Guerreiro, N. M., Rei, R., van Stigt, D., Coheur, L., Colombo, P., & Martins, A. F. T. (2024). xCOMET: Transparent machine translation evaluation through fine-grained error detection. Transactions of the Association for Computational Linguistics, 12, 979-995.
- Kocmi, T., Zouhar, V., Federmann, C., & Post, M. (2024). Navigating the metrics maze: Reconciling score magnitudes and accuracies. arXiv preprint arXiv:2401.06760.
- Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002, July). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311-318).
- Philippy, F., Guo, S., Klein, J., & Bissyandé, T. F. (2024). LuxEmbedder: A cross-lingual approach to enhanced Luxembourgish sentence embeddings. arXiv preprint arXiv:2412.03331.
- Popović, M. (2016, August). chrF deconstructed: beta parameters and n-gram weights. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers (pp. 499-504).
- Post, M. (2018). A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771.
- Sellam, T., Das, D., & Parikh, A. P. (2020). BLEURT: Learning robust metrics for text generation. arXiv preprint arXiv:2004.04696.
- Snover, M., Dorr, B., Schwartz, R., Micciulla, L., & Makhoul, J. (2006). A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers (pp. 223-231).
- Vanroy, B., Tezcan, A., & Macken, L. (2023). MATEO: MAchine Translation Evaluation Online. In M. Nurminen, J. Brenner, M. Koponen, S. Latomaa, M. Mikhailov, F. Schierl, … H. Moniz (Eds.), Proceedings of the 24th Annual Conference of the European Association for Machine Translation (pp. 499–500). Tampere, Finland: European Association for Machine Translation (EAMT).
- Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2019). BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.