Towards A Combination of Metrics For Machine Translation
Mawloud Mosbah
Abstract
In this paper, we compare three metrics for machine translation, from English to French and vice versa, and we give several combination formulas based on simple schemes, algorithms, and machine learning tools. As an experimental dataset, we consider the English and French abstracts of 10 theses published on the web, translated with four free of charge machine translation systems. Five combinations, with the same implicit weights, are considered, namely: (BLEU+NIST), (BLEU+(1-WER)), (NIST+(1-WER)), (BLEU+NIST+(1-WER)), and (FR(BLEU)+FR(NIST)+FR(WER)). These combinations are also considered with different weights, the weight parameters being generated by regression. The results of 12 formulas in total are then computed and compared. According to the obtained results, in terms of average NDCG, the regression-based combinations, which rely on a machine learning step, are the best, especially the one with the three basic metrics, followed by the basic WER metric, in the case of English to French. For French to English, the (FR(BLEU)+FR(NIST)+FR(WER)) combination is the best, followed respectively by the regression combination with the first two parameters (Reg(α,β)) and the basic BLEU metric. A second performance criterion is also considered, namely the number of times, over the 10 abstracts, where a formula is the best. Based on the obtained results, the combination with regression on the first and last parameters (Reg(α,γ)) outperforms the others in the case of English to French, with 3 times, followed by Reg(β,γ), Reg(α,β,γ), NIST+(1-WER), and the basic metrics (BLEU, NIST, and WER) with 2 times each. For French to English, the basic WER metric outperforms the others with 3 times, followed by BLEU, (BLEU+(1-WER)), (FR(BLEU)+FR(NIST)+FR(WER)), and Reg(α,γ) with 2 times each. Note that there remains room for improvement for the combinations, with 1.0914 in the case of English to French and 1.01 in the case of French to English.
Keywords: Machine Translation, Machine Translation Metrics, Combination of
Machine Translation Metrics
1. Introduction
Evaluation is an important operation in any scientific field. Indeed, assessment enables us to know to what degree the addressed model is effective, what its room for optimization and improvement is, through analyzing and identifying its various weaknesses, and to compare the model in question with other ones proposed in the literature. For machine translation, evaluation is regarded as a difficult task because there are many possible ways to translate a given source sentence. There are two approaches to machine translation evaluation: manual, subjective, and qualitative assessment, done by human experts, and automatic, objective, and numerical evaluation implemented by fully automatic metrics [1]. Three aspects are tied to human evaluation of machine translation: fluency, indicating how natural the evaluated segment sounds to a native speaker; adequacy, judging how much of the information from the original is expressed in the output; and acceptability, judging how easy the translation is to understand. Unfortunately, human evaluation is subjective and time consuming. Automatic evaluation compares the output segment of the translation to the reference one, either by associating a score with each translation [2] or by ranking the different translations of the same input against each other [3]. In [4], the authors categorized automatic metrics into deterministic metrics, which tend to focus on specific aspects of the evaluation, and learned metrics such as BLANC [5], which try to gather and combine all the aspects into a single metric.
According to [6], human evaluation methods are classified based on whether or not the human judge expresses a so-called subjective evaluation judgment, such as ‘good’ or ‘better than’. The former methods are based on directly expressed judgment (DEJ), while the latter are called non-DEJ-based evaluation. For DEJ-based evaluation, there are tasks such as fluency and adequacy annotation, ranking, and direct assessment (DA) such as Blend [7], whereas for non-DEJ-based evaluation, there are tasks like error classification and post-editing.
Unfortunately, as reported in [8], there is no automatic metric that consistently outperforms the other metrics of the literature or that reflects human judgment well. Combining automatic metrics therefore seems to be a good idea, with two issues to be tackled, namely: (1) which metrics to combine and how many of them, and (2) which weight to attribute to each one. The scheme of combining metrics has been previously considered in natural language processing applications, for instance in information retrieval with the F-measure [9], which combines both precision and recall.
To the best of our knowledge, there is only one work in the literature that deals with the combination of evaluation metrics for machine translation as we address it here. Indeed, in [10], the authors applied a loss function, as an approach from statistical decision theory for weighted cost estimates, to combine three basic metrics, namely: correct response, non-response, and incorrect response rates. However, there are a few works that address combination differently, such as [11], where the authors combined automatic metrics for predicting human assessment using binary classifiers, and [12], where the author quotes some works that combine evaluation metrics with error classification and analysis. Regression is also taken into consideration here, in different ways, as considered in [4], where the authors combined different criteria and aspects of machine translation.
The methodology adopted in this paper is as follows. Our purpose is to look for an effective formula that combines three basic evaluation metrics for machine translation systems, namely: BLEU, NIST, and WER. Four machine translation systems are considered, namely: Google Translate, Promt, Babylon, and Bing, over 10 theses abstracts in both English and French. Human expert evaluation, through manual assessment and judgment of the various returned translation outputs, is also taken into consideration to evaluate the performance of BLEU, NIST, and WER as well as of the different considered combinations. Based on each machine translation metric and each combination, the outputs of the four adopted machine translation systems are ranked. These ranks are compared with those given by human experts using an information retrieval evaluation metric, namely NDCG. The performance of the three primitive machine translation metrics, and especially of their different combinations, is then assessed using the NDCG metric.
The rest of the paper is organized as follows: section 2 presents the four considered free of charge machine translation systems. In section 3, we detail the three considered evaluation metrics for machine translation. Section 4 depicts the different considered combinations, with and without regression. Datasets and results, with their discussion, are given in section 5. In section 6, we establish a conclusion and draw some perspectives that may be implemented later in our future work.
In addition to being free of charge, the four considered machine translation systems adopt different approaches, which enables us to obtain theoretically different results.
3.1. BLEU
BLEU (Bi-Lingual Evaluation Understudy) is an automatic metric proposed by IBM that employs several references [18]. In its simplest form, BLEU measures how many sequences of words in the block of text under evaluation (the candidate block of text) match the sequences of words of some reference blocks of text. It also contains a penalty for translations whose length differs significantly from that of the reference translation. The BLEU metric is firstly based on computing n-grams (or chunks, that is, sequences of words) for both the block of text under evaluation and the reference block of text. Secondly, the clipped chunk counts for the candidate block of text are added and divided by the number of candidate chunks to compute its modified precision score \(p_n\). The brevity penalty \(BP\) is defined as follows:
\[
BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r \end{cases} \qquad (2)
\]
where \(c\) is the length of the candidate translation and \(r\) is the length of the reference translation. Then:
\[
BLEU = BP \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log(p_n) \right) \qquad (3)
\]
Where: \(w_n\) represents the weights given to the n-grams according to the number of words constituting the chunks. According to [18], the ranking behaviour is more immediately apparent in the logarithm domain, as follows:
\[
\log(BLEU) = \min\!\left(1 - \frac{r}{c},\, 0\right) + \sum_{n=1}^{N} w_n \log(p_n) \qquad (4)
\]
Commonly, these factors are set as follows: N = 4 (i.e., we use uni-grams, bi-grams, tri-grams, and four-grams) and \(w_n = 1/N\).
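For illustration only, the following Python sketch computes a sentence-level BLEU score with NLTK using N = 4 and uniform weights; the tokenization and smoothing shown here are assumptions for the example and are not necessarily the configuration used in our experiments.

```python
# Minimal sketch: sentence-level BLEU with NLTK (N = 4, uniform weights w_n = 1/N).
# Whitespace tokenization and method1 smoothing are illustrative assumptions only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the thesis investigates machine translation metrics".split()
candidate = "this thesis studies machine translation metrics".split()

smoothie = SmoothingFunction().method1  # avoids log(0) when an n-gram order has no match
score = sentence_bleu([reference], candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smoothie)
print(f"BLEU = {score:.4f}")
```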
3.2. NIST
NIST (US National Institute of Standards and Technology) score weights more heavily those n-grams occurring less frequently, according to their information value [19], [20]. That is to say, when a correct n-gram is found, the rarer that n-gram is, the more weight it will be given. The formula of the NIST score is given as follows:
\[
Score_{NIST} = \sum_{n=1}^{N} \left\{ \frac{\sum_{\text{all } w_1 \dots w_n \text{ that co-occur}} Info(w_1 \dots w_n)}{\sum_{\text{all } w_1 \dots w_n \text{ in sys output}} (1)} \right\} \cdot \exp\!\left\{ \beta \, \log^2\!\left[ \min\!\left( \frac{L_{sys}}{L_{ref}},\, 1 \right) \right] \right\} \qquad (5)
\]
Where: \(Info(w_1 \dots w_n)\) is the information weight of the n-gram \(w_1 \dots w_n\), \(\beta\) is a factor controlling the brevity penalty, and \(L_{sys}\) and \(L_{ref}\) are the lengths, in words, of the system output and of the reference translation, respectively.
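As an illustrative counterpart to the BLEU sketch above, NLTK also exposes a NIST implementation; the example below is a minimal sketch under the same assumed tokenization, not necessarily the tool used for the results reported in this paper.

```python
# Minimal sketch: sentence-level NIST with NLTK (n-grams up to length 5).
# Whitespace tokenization is an illustrative assumption.
from nltk.translate.nist_score import sentence_nist

reference = "the thesis investigates machine translation metrics".split()
candidate = "this thesis studies machine translation metrics".split()

score = sentence_nist([reference], candidate, n=5)
print(f"NIST = {score:.4f}")
```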
3.3. WER
WER (Word Error Rate) represents the percentage of words which have to be inserted, deleted, or replaced in the translation to obtain the reference sentence [21]. It can be computed automatically using the editing distance between the candidate and the reference sentences.
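A minimal sketch of this computation is given below: the word-level edit (Levenshtein) distance is divided by the reference length. Whitespace tokenization is an assumption made for illustration.

```python
# Minimal sketch: WER as word-level Levenshtein distance divided by reference length.
def wer(reference: str, candidate: str) -> float:
    ref, cand = reference.split(), candidate.split()
    # dp[i][j] = edit distance between ref[:i] and cand[:j]
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(cand) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(cand) + 1):
            cost = 0 if ref[i - 1] == cand[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(cand)] / max(len(ref), 1)

print(wer("the thesis investigates machine translation metrics",
          "this thesis studies translation metrics"))
```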
To counter the dependency on the reference sentences, several references may be considered for each sentence. Indeed, mWER (multi-reference WER) is a version of the WER metric where, for each sentence, the editing distance is computed with regard to the various references and the smallest one is chosen [22]. Nevertheless, adopting mWER, with its many references, presents the drawback of requiring a great human effort to generate the references, although this effort is worthwhile since the references can be used later for hundreds of evaluations [23]. aWER is another version of WER which calculates the percentage of words to be inserted, deleted, or replaced in order to obtain a correct translation. Automatically involving synonyms seems to be an essential pre-processing step in the case of aWER.
It is worth noting that there are other metrics not considered here, such as METEOR, ROUGE, TER, SER, and OOV. For more information about them, readers may consult [8] and [12].
To evaluate the performance of the ranks given by BLEU, NIST, WER, and their different combinations for the four considered machine translation systems, we use NDCG. As we consider here four machine translation systems, the best value of NDCG (which is determined manually) is then 10.563.
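For orientation, the sketch below shows a textbook NDCG computation for a ranking of four systems; the gain and normalization choices that lead to the paper's reported ideal value of 10.563 are not reproduced here, and the relevance grades used are hypothetical.

```python
# Minimal textbook NDCG sketch for ranking four translation systems.
# The exact gains/normalization of this paper are not specified here; illustrative only.
import math

def dcg(relevances):
    # relevances[i] is the human-assigned grade of the system ranked at position i+1
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical human grades for the four system outputs, in metric-ranked order.
print(ndcg([3, 4, 2, 1]))
```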
4.1. BLEU+NIST
As shown in section 3, both BLEU and NIST vary directly with machine translation performance. Adding the BLEU value to the NIST value then seems an intuitive and logical way to introduce a new metric that keeps this direct relationship and that we hope will be more effective. In addition, there is no need to weight the BLEU and NIST values because, as shown in the experimental results previously given in [8], their values are close to each other and commonly belong to the same scale, the [0, 1] range. We thus associate the same importance with both basic metrics, which is implicitly 1. The first considered combination formula is then simply given as follows:

BLEU + NIST
4.2. BLEU+(1-WER)
In the same spirit as the first formula, BLEU is now combined with the WER metric. The difference is that WER varies inversely with machine translation performance, whereas BLEU varies directly with it. For this purpose, we consider the (1-WER) values, which vary directly with machine translation performance. Since all WER values belong to the [0, 1] range, as shown previously in [7], the (1-WER) values also belong to the same [0, 1] range. As we attribute the same importance to both values, BLEU and (1-WER), an implicit weight set to 1 is then enough. The second proposed combination formula is given as follows:

BLEU + (1 - WER)
4.3. NIST+(1-WER)
Following the same line of thinking, the third formula, combining NIST and (1-WER), is given as follows:

NIST + (1 - WER)
4.4. BLEU+NIST+(1-WER)
The fourth formula, combining BLEU, NIST, and (1-WER), is given as follows:

BLEU + NIST + (1 - WER)
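As a small illustrative helper (not part of the paper's tooling), the four uniform-weight combinations of sections 4.1 to 4.4 can be computed from the BLEU, NIST, and WER scores of one translation output as follows; the input values shown are hypothetical.

```python
# Illustrative helper: the four uniform-weight combinations of Sections 4.1-4.4.
def simple_combinations(bleu: float, nist: float, wer: float) -> dict:
    return {
        "BLEU+NIST": bleu + nist,
        "BLEU+(1-WER)": bleu + (1 - wer),
        "NIST+(1-WER)": nist + (1 - wer),
        "BLEU+NIST+(1-WER)": bleu + nist + (1 - wer),
    }

# Hypothetical metric values for one translation output.
print(simple_combinations(bleu=0.42, nist=0.55, wer=0.38))
```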
4.5. FR(BLEU)+FR(NIST)+FR(1-WER)
Another combination approach is to consider an algebraic expression based on a first rank function applied to the three basic metrics: BLEU, NIST, and (1-WER). The first rank function for a machine translation system (MTS) is given as follows:
with changes in one or more of the explanatory variables. Note that we consider here a multiple regression model because we have more than one independent variable (BLEU, NIST, and WER) affecting a dependent variable, which is the ranking of machine translation systems measured using NDCG.
Four combinations may be considered here, namely:
\[
\begin{cases}
\alpha \, BLEU + \beta \, NIST \\
\alpha \, BLEU + \gamma \, (1 - WER) \\
\beta \, NIST + \gamma \, (1 - WER) \\
\alpha \, BLEU + \beta \, NIST + \gamma \, (1 - WER)
\end{cases} \qquad (17)
\]
Learning then consists in finding the adequate, or at least optimal, parameters α, β, and γ that may give the best performance in the test step. Table 1 presents an illustrative learning example.
According to the example, the inequalities to find parameters α and β, considering only BLEU and NIST, are:
The inequalities to find parameters α and γ, considering only BLEU and (1-WER), are:
The inequalities to find parameters β and γ, considering only NIST and (1-WER), are:
As we can see, there are six inequalities for each combination, which gives 60 inequalities in total (over the 10 considered abstracts). These inequalities are solved using a simple experimental automatic algorithm with 0.01 as the step size for the considered parameters (α, β, and γ). The best configuration of the parameters is the one that satisfies the greatest number of inequalities, ideally all 60 of them.
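A minimal sketch of such a brute-force search is given below. It walks α, β, and γ in steps of 0.01 and keeps the configuration satisfying the most inequalities; the representation of an inequality as a pair of metric triples (BLEU, NIST, 1-WER), with the first expected to rank above the second, is an assumption made for illustration.

```python
# Illustrative sketch: grid search over alpha, beta, gamma with step 0.01,
# counting satisfied ranking inequalities. Data layout is assumed for the example.
import itertools

def score(weights, triple):
    return sum(w * v for w, v in zip(weights, triple))

def grid_search(inequalities, step=0.01):
    best_weights, best_count = None, -1
    values = [round(i * step, 2) for i in range(int(1 / step) + 1)]
    for alpha, beta, gamma in itertools.product(values, repeat=3):
        w = (alpha, beta, gamma)
        count = sum(score(w, better) > score(w, worse) for better, worse in inequalities)
        if count > best_count:
            best_weights, best_count = w, count
    return best_weights, best_count

# Hypothetical learning example: the first triple should rank above the second.
inequalities = [((0.42, 0.55, 0.62), (0.35, 0.50, 0.58))]
print(grid_search(inequalities))
```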
5. Experimental Results
We consider the English and French abstracts of 10 theses published on the web. Four free of charge web machine translation systems, based on various schemes, are used, namely: Google Translate, Promt, Babylon, and Bing. Three basic metrics are adopted, namely: BLEU, NIST, and WER. For the combination based on regression, we use the 10 abstracts in both English and French. In total, the machine learning dataset is composed of 100 texts over both languages (20 references + 80 results). For the combination based on regression, we consider 60 inequalities for English to French (6 inequalities over 10 learning examples) and the same number for French to English. To solve these inequalities and find the weight parameters α, β, and γ, we consider a step size of 0.01. We then present the results in two respective sub-sections: from English to French and from French to English.
Characteristics of the considered theses abstracts:

Thesis Abstract | Words (English) | Words (French) | Sentences (English) | Sentences (French)
Thesis #1       | 264             | 294            | 14                  | 13
Thesis #2       | 295             | 293            | 10                  | 09
Thesis #3       | 289             | 372            | 07                  | 07
Thesis #4       | 317             | 439            | 15                  | 15
Thesis #5       | 250             | 275            | 10                  | 08
Thesis #6       | 578             | 767            | 25                  | 27
Thesis #7       | 170             | 188            | 09                  | 07
Thesis #8       | 275             | 313            | 12                  | 12
Thesis #9       | 287             | 312            | 11                  | 11
Thesis #10      | 301             | 379            | 11                  | 10
Total           | 3026            | 3632           | 124                 | 119
[Bar chart; y-axis: Average NDCG, roughly from 8.8 to 9.6; one bar per formula: BLEU, NIST, WER, BLEU+NIST, BLEU+(1-WER), NIST+(1-WER), BLEU+NIST+(1-WER), FR(BLEU)+FR(NIST)+FR(WER), Reg(α,β,γ), Reg(α,β), Reg(α,γ), Reg(β,γ).]
Figure 1. The considered primitive machine translation metrics and their various
combinations in the case of ‘From English to French’
[Bar chart; y-axis: number of times the metric is the best, from 0 to 3.5; one bar per formula, for the same 12 formulas as in Figure 1.]
Figure 2. The number of times where the metric, from those considered and their
combinations, is the best, in the case of ‘From English to French’
Regression combination          | Reg(α,β)       | Reg(α,γ)       | Reg(β,γ)       | Reg(α,β,γ)
Number of verified inequalities | 30             | 35             | 35             | 35
Parameters                      | α=0.01, β=0.03 | α=0.04, γ=0.97 | β=0.01, γ=0.97 | α=0.01, β=0.03, γ=0.89
Table 8. The performance of the different considered combination formulas without
regression.
5.4. Discussions
• Two performance evaluation criteria have been considered here, in both cases: from English to French and vice versa. The first is the average accuracy computed by the automatic NDCG metric, which measures how close the ranks of the four adopted translators, given by the different formulas, are to the reference rank generated from the evaluation done by human experts. The second is the number of times each formula is the best over the ten considered abstracts.
• Unfortunately, the solutions found in the case of regression do not satisfy all of the 60 considered inequalities. The best solution satisfies only 37 of them, in both directions, ‘from English to French’ and vice versa. Considering a step value smaller than 0.01 may improve the performance.
[Bar chart; y-axis: Average NDCG, roughly from 7 to 10; one bar per formula, for the same 12 formulas as in Figure 1.]
Figure 3. The considered primitive machine translation metrics and their various
combinations in the case of ‘From French to English’
[Bar chart; y-axis: number of times the metric is the best, from 0 to 3.5; one bar per formula, for the same 12 formulas as in Figure 1.]
Figure 4. The number of times where the metric, from those considered and their
combinations, is the best, in the case of ‘From French to English’
6. Conclusion
In this paper, we have tested the effectiveness of combining basic machine translation metrics. Two kinds of combination have been considered: a simple combination with the same implicit weight for each adopted primitive metric, and a combination with regression, which determines the weight parameter for each considered basic metric. According to the obtained results, combination may improve performance compared with the basic metrics, but this is not always the case. Indeed, some combinations may degrade the basic performance, but there are always some that improve it. No specific combination guarantees a performance improvement in both directions, ‘from English to French’ and vice versa. Overall, the combinations based on regression, which rely on a machine learning collection, are very promising in both directions. We have considered here only three basic machine translation metrics and only one evaluation metric, namely NDCG. In future work, we plan to adopt more primitive machine translation metrics as well as more evaluation metrics beyond NDCG.
References
[1] Chauhan, S., & Daniel, P. (2022). A Comprehensive Survey on Various
Fully Automatic Machine Translation Evaluation Metrics. Neural
Processing Letters, 1-55.
[2] Przybocki, M., Peterson, K., & Bronsart, S. (2008). Metrics for Machine
Translation Challenge (MetricsMATR08).
http://nist.gov/speech/tests/metricsmatr/2008/results.
[3] Callison-Burch, C., Koehn, P., Monz, C., Peterson, K., Przybocki, M., &
Zaidan, O. (2010). Findings of the 2010 joint workshop on SMT and metrics
for machine translation. In Proceedings of the Joint Fifth Workshop on
SMT and MetricsMATR, pages 17-53, Uppsala, Sweden. Association for
Computational Linguistics.
[4] Albrecht, J. S., & Hwa, R. (2008). Regression for machine translation at
the sentence level. Machine Translation, 22(1), 1-27.
[5] Lita, L. V., Rogati, M., & Lavie, A. (2005, October). Blanc: Learning
evaluation metrics for mt. In Proceedings of Human Language Technology
Conference and Conference on Empirical Methods in Natural Language
Processing (pp. 740-747).
[6] Chatzikoumi, E. (2020). How to evaluate machine translation: A review of
automated and human metrics. Natural Language Engineering, 26(2), 137-
161.
[7] Ma, Q., Graham, Y., Wang, S., & Liu, Q. (2017, September). Blend: a
novel combined MT metric based on direct assessment--CASICT-DCU