Lux-Eval is a local evaluation suite for Luxembourgish machine translation (MT), inspired by MATEO (Vanroy et al., 2023).
It provides an easy-to-use client and a Flask-based API for computing a range of MT evaluation metrics, either locally or via a server.
- Evaluate Luxembourgish (lb) → target-language (tgt) MT output (e.g., fr, en, de, pt)
- Support for multiple complementary evaluation metrics
- System- and segment-level scoring
- Visual plots for performance insights
- Paired bootstrap resampling for statistical significance testing (see the sketch below this list)
- Score interpretation aid
- Modular architecture for adding new metrics
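For readers unfamiliar with the significance test mentioned above, here is a minimal sketch of paired bootstrap resampling over segment-level scores. It is illustrative only, not Lux-Eval's actual implementation; all names are my own:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=42):
    """Fraction of bootstrap-resampled test sets on which system A beats system B.

    scores_a / scores_b: segment-level metric scores, aligned by segment.
    Illustrative sketch only -- Lux-Eval's own implementation may differ.
    """
    assert len(scores_a) == len(scores_b), "score lists must be aligned"
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(n_samples):
        # Resample segment indices with replacement (paired: same indices for both systems)
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_samples
```

A value near 1.0 (or 0.0) suggests the difference between the two systems is unlikely to be an artefact of the particular test set.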
Install the client on your local machine:
```bash
# Clone the Lux-Eval repository
git clone https://github.com/greenirvavril/lux-eval.git

# Navigate into the project directory
cd lux-eval

# Make the setup script executable
chmod +x setup_client.sh

# Run the setup script
./setup_client.sh
```

Install the gateway either on your local machine or on your server:
```bash
# Make the setup script executable
chmod +x setup_gateway.sh

# Run the setup script
./setup_gateway.sh
```

Note: To use xCOMET-XL, you may have to acknowledge its license on the Hugging Face Hub and log in to your Hugging Face account (e.g. via `huggingface-cli login`).
The client requires the gateway's IP address to connect. Follow these steps to get started.
```bash
cd gateway
source main_venv/bin/activate
python main_gateway.py
```

Let it run until the output shows:

```
Gateway is running!
 - Local access (same machine): http://127.0.0.1:5000
 - Network access (other machines should use this IP): http://192.168.X.X:5000
```
- Use the local access URL if the gateway is running on the same machine as the client.
- Use the network access URL if the gateway is running on another machine on the same network.
- Copy the appropriate IP address from the gateway output.
- Navigate to the client folder:

  ```bash
  cd client
  ```

- Open `client.py` and locate:

  ```python
  URL = ""  # <-- enter your gateway IP, e.g.: "http://192.168.X.X:5000"
  ```

- Paste the IP address and configure the metrics you want to use (`True` or `False`).
- Save the file.
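For illustration only, a configured header of `client.py` might look like the sketch below; only `URL` is confirmed by the file itself, and the metric toggles are hypothetical names standing in for whatever flags your copy of `client.py` defines:

```python
# Hypothetical example of a filled-in client.py configuration
URL = "http://192.168.1.42:5000"  # gateway address copied from the gateway output

# Placeholder toggle names -- check client.py for the real ones
USE_BLEU = True
USE_CHRF2 = True
USE_TER = False
USE_BERTSCORE = True
USE_BLEURT20 = False
USE_XCOMET_XL = False
USE_LUXEMBEDDER = True
```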
In a new terminal:
```bash
cd client  # adjust path if necessary
source client_venv/bin/activate
python client.py
```

The client will connect to the gateway and start evaluating metrics as configured.
- Ensure that port 5000 is open on the gateway machine if running on a network.
- The gateway must be running before starting the client.
- For testing on the same machine, you can always use `http://127.0.0.1:5000`.
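Before starting the client, you can check that the gateway is reachable from the client machine. A standard-library sketch follows; the root path is not a documented endpoint, so any HTTP response at all counts as success:

```python
import urllib.error
import urllib.request

GATEWAY = "http://192.168.X.X:5000"  # replace with your gateway URL

try:
    urllib.request.urlopen(GATEWAY, timeout=5)
    print("Gateway reachable")
except urllib.error.HTTPError:
    # An HTTP error (e.g. 404) still proves the port is open and Flask answers
    print("Gateway reachable")
except OSError as exc:
    print(f"Gateway NOT reachable: {exc}")
```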
- Candidate file(s): MT model outputs (one file per system)
- Source file: original sentences (aligned with candidates)
- Reference file: gold-standard translations (used by reference-based metrics)
All files must be plain `.txt`, aligned line by line (same number of lines, one segment per line).
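Because misaligned files silently corrupt segment-level scores, it is worth verifying the line counts before a run; a minimal sketch (file names are examples):

```python
from pathlib import Path

def check_alignment(*paths: str) -> None:
    """Raise if the given plain-text files do not have the same number of lines."""
    counts = {p: len(Path(p).read_text(encoding="utf-8").splitlines()) for p in paths}
    if len(set(counts.values())) != 1:
        raise ValueError(f"Line counts differ: {counts}")
    print(f"OK: all files contain {next(iter(counts.values()))} segments")

# Example file names -- adjust to your own data
check_alignment("source.lb.txt", "reference.fr.txt", "candidate_systemA.fr.txt")
```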
| Metric | Description | Reference |
|---|---|---|
| BERTScore | Contextualised embeddings for semantic similarity | Zhang et al., 2019 |
| BLEURT20 | Trained on human preference data | Sellam et al., 2020 |
| xCOMET-XL | Trained on human judgement data; adds fine-grained error detection | Guerreiro et al., 2024 |
| BLEU | N-gram overlap | Papineni et al., 2002; Post, 2018 |
| ChrF2 | Character-level overlap | Popović, 2016; Post, 2018 |
| TER | Edit distance to reference | Snover et al., 2006; Post, 2018 |
Note: BERTScore uses `xlm-roberta-large`, except for English, where it uses `deberta-xlarge-mnli`.
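Lux-Eval computes all of these internally; for reference, the three surface metrics can be reproduced with sacrebleu (Post, 2018) along these lines:

```python
from sacrebleu.metrics import BLEU, CHRF, TER

hyps = ["Le chat est sur le tapis ."]          # system output, one segment
refs = [["Le chat est assis sur le tapis ."]]  # one reference stream

print(BLEU().corpus_score(hyps, refs))
print(CHRF().corpus_score(hyps, refs))  # defaults (char_order=6, beta=2) give chrF2
print(TER().corpus_score(hyps, refs))
```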
| Metric | Description | Reference |
|---|---|---|
| LuxEmbedder | Luxembourgish sentence embeddings | Philippy et al., 2024 |
- Results are exported to `.xlsx`, including an accuracy matrix with metric scores converted to probability percentages (cf. Kocmi et al., 2024).
- Matrix interpretation: similar to a correlation matrix; it shows the likelihood of one system outperforming another.
- Note: LuxEmbedder is excluded from the accuracy matrix due to conversion tool limitations.
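To post-process the exported workbook programmatically, pandas can read it directly; the file name below is an example, not necessarily what Lux-Eval writes:

```python
import pandas as pd

# Example file name -- use the one Lux-Eval actually exports
sheets = pd.read_excel("lux_eval_results.xlsx", sheet_name=None)  # dict of DataFrames

for name, df in sheets.items():
    print(name, df.shape)
```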
- For unrelated models: prioritise BERTScore, BLEURT20, and xCOMET-XL.
- Surface-overlap metrics (BLEU, ChrF2, TER) are limited and not recommended for cross-system comparison.
- Percentages in the accuracy matrix do not sum to 100% due to conversion tool limitations. Prioritise the positive scores (likelihood of model A being better than model B) over the negative scores (likelihood of model B being worse than model A).
- LuxEmbedder: promising but unverified; its scores are min-max-normalised from the 0.8–1.0 range onto 0–100 (see the sketch below this list); it can also be used for src → lb.
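The LuxEmbedder rescaling mentioned above is a plain min-max normalisation; as a sketch:

```python
def rescale(score: float, lo: float = 0.8, hi: float = 1.0) -> float:
    """Map a raw similarity in [lo, hi] onto the 0-100 scale used in the reports."""
    return max(0.0, min(100.0, (score - lo) / (hi - lo) * 100.0))

print(rescale(0.93))  # -> 65.0
```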
- Guerreiro, N. M., Rei, R., van Stigt, D., Coheur, L., Colombo, P., & Martins, A. F. T. (2024). xCOMET: Transparent machine translation evaluation through fine-grained error detection. Transactions of the Association for Computational Linguistics, 12, 979-995.
- Kocmi, T., Zouhar, V., Federmann, C., & Post, M. (2024). Navigating the metrics maze: Reconciling score magnitudes and accuracies. arXiv preprint arXiv:2401.06760.
- Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002, July). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311-318).
- Philippy, F., Guo, S., Klein, J., & Bissyandé, T. F. (2024). LuxEmbedder: A cross-lingual approach to enhanced Luxembourgish sentence embeddings. arXiv preprint arXiv:2412.03331.
- Popović, M. (2016, August). chrF deconstructed: beta parameters and n-gram weights. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers (pp. 499-504).
- Post, M. (2018). A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771.
- Sellam, T., Das, D., & Parikh, A. P. (2020). BLEURT: Learning robust metrics for text generation. arXiv preprint arXiv:2004.04696.
- Snover, M., Dorr, B., Schwartz, R., Micciulla, L., & Makhoul, J. (2006). A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers (pp. 223-231).
- Vanroy, B., Tezcan, A., & Macken, L. (2023). MATEO: MAchine Translation Evaluation Online. In M. Nurminen, J. Brenner, M. Koponen, S. Latomaa, M. Mikhailov, F. Schierl, … H. Moniz (Eds.), Proceedings of the 24th Annual Conference of the European Association for Machine Translation (pp. 499–500). Tampere, Finland: European Association for Machine Translation (EAMT).
- Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2019). BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.