Widely used learned metrics for machine translation evaluation, such as Comet and Bleurt, estimate the quality of a translation hypothesis by providing a single sentence-level score. As such, they offer little insight into translation errors (e.g., which errors occur and how severe they are). On the other hand, generative large language models (LLMs) are amplifying the adoption of more granular strategies for evaluation, which attempt to detail and categorize translation errors. In this work, we introduce xcomet, an open-source learned metric designed to bridge the gap between these approaches. xcomet integrates both sentence-level evaluation and error span detection capabilities, exhibiting state-of-the-art performance across all types of evaluation (sentence-level, system-level, and error span detection). Moreover, it does so while highlighting and categorizing error spans, thus enriching the quality assessment. We also provide a robustness analysis with stress tests, and show that xcomet is largely capable of identifying localized critical errors and hallucinations.

Automatic metrics for machine translation evaluation are widely used by researchers and practitioners to evaluate the quality of translations and the systems generating them. Notably, learned neural metrics, such as Comet (Rei et al., 2020) and Bleurt (Sellam et al., 2020), have demonstrated significant improvements in terms of correlation with human judgments when compared to traditional metrics like bleu (Papineni et al., 2002; Freitag et al., 2021b, 2022).

These metrics are trained to regress on scores obtained through human annotations, predicting a single sentence-level score that represents the quality of the translation hypothesis. However, these single scores do not offer a detailed view into translation errors (e.g., it is not immediately clear which words or spans of words are wrongly translated). Moreover, as they are obtained with highly complex pre-trained models, they can be difficult to interpret (Rei et al., 2023b; Leiter et al., 2023). One appealing strategy to bring a more detailed view into translation errors is to obtain finer-grained information on error spans by highlighting them and indicating their severity (Fonseca et al., 2019; Perrella et al., 2022; Bao et al., 2023). In fact, this is the strategy adopted in recent work that has employed generative large language models (LLMs) for machine translation evaluation: (i) identify errors within a given translation, subsequently (ii) categorize these errors according to their severity, and finally (iii) infer a sentence-level score from the predicted errors (Fernandes et al., 2023; Xu et al., 2023). However, these methods still lag behind dedicated learned metrics when using open LLMs, such as the LLaMA models (Touvron et al., 2023; Xu et al., 2023). As it stands, competitive performance with generative strategies remains contingent on using large proprietary, closed LLMs such as PaLM-2 and GPT-4 (Fernandes et al., 2023; Kocmi and Federmann, 2023a).

In this work, we bridge the gap between these two approaches to machine translation evaluation by introducing xcomet: a learned metric that simultaneously performs sentence-level evaluation and error span detection. Through extensive experiments, we show that our metrics leverage the strengths of both paradigms: They achieve state-of-the-art performance in all relevant vectors of evaluation (sentence-level, system-level, and error span prediction), while offering, via the predicted error spans, a lens through which we can analyze translation errors and better interpret the sentence-level scores. We achieve this by employing a curriculum during training that is focused on leveraging high-quality publicly available data at both the sentence- and error span level, complemented by synthetic data constructed to enhance the metric’s robustness. Moreover, xcomet is a unified metric (Wan et al., 2022b), supporting all modes of evaluation within a single model. This enables the metric to be used for quality estimation (when no reference is available), or for reference-only evaluation, similarly to Bleurt (when a source is not provided). Crucially, xcomet also provides sentence-level scores that are directly inferred from the predicted error spans, in the style of AutoMQM (Fernandes et al., 2023) and InstructScore (Xu et al., 2023).

Our contributions can be summarized as follows:

  1. We introduce xcomet, a novel evaluation metric that leverages the advantages of regression-based metrics and error span detection to offer a more detailed view of translation errors.

  2. We show that xcomet is a state-of-the-art metric at all relevant vectors of evaluation— sentence-level, system-level, and error span prediction—generally outperforming widely used neural metrics and generative LLM-based machine translation evaluation.

  3. We provide a comprehensive robustness analysis of xcomet, showing that this new suite of metrics identifies the vast majority of localized critical errors and hallucinations.

  4. We release two evaluation models: xcomet-xl, with 3.5B parameters, and xcomet-xxl, featuring 10.7B parameters.1

Methodologies for Human Assessment of Translation Quality.

Human evaluation of machine translation is primarily conducted through three distinct approaches: post-edits (PE), direct assessments (DA), and the Multidimensional Quality Metrics (MQM) framework.

In post-edits (PE), professional translators are tasked with “fixing” a given translation, making minimal edits to improve its quality. Using this edited translation—often termed post-edit—we can evaluate the machine translation output by quantifying the number of edits, thus gauging the initial translation’s quality (Snover et al., 2006).

Direct assessments (DA) (Graham et al., 2013) are a simple and widely-used evaluation method. Annotators—non-expert bilingual speakers or professional translators—are asked to annotate each translation with a score ranging from 0 to 100 to reflect its adequacy and fluency, where a score of 100 corresponds to a perfect translation, and 0 corresponds to a completely inadequate one.

The Multidimensional Quality Metrics (MQM) framework (Lommel et al., 2014), on the other hand, offers a more comprehensive and systematic approach to MT evaluation. Professional translators highlight errors—typically in the form of error spans—within translations, attributing them severity ratings (e.g., minor, major, or critical) and categorical labels (e.g., fluency, accuracy). Figure 1 illustrates one such annotation. MQM annotations have gained prominence in recent years due to their capacity to offer detailed insights into translation errors, facilitating more fine-grained and accurate comparisons between translation systems (Freitag et al., 2021a). As such, the field of Automatic Evaluation of MT has increasingly favored comparisons using MQM annotations over traditional DA and PE methodologies (Freitag et al., 2021b, 2022; Zerva et al., 2022).

Figure 1: 

The xcomet framework illustrated through a real example from the WMT22 News test set: The metric not only provides a sentence-level score, but also predicts translation error spans along with their respective severity. From these spans, we can infer an MQM score (following the MQM typology), which informs and highly correlates with the sentence-level score (see Section 6). These spans complement the sentence-level score by providing a detailed view into the translation errors.


Automatic Metrics for Translation Evaluation.

Conventional automatic metrics for machine translation (MT) evaluation rely on lexical-based approaches, where the evaluation score is computed through statistics related to lexical overlap between a machine translation and a reference translation. Despite evidence indicating that these lexical metrics (e.g., Bleu [Papineni et al., 2002] and chrF [Popović, 2015]) do not consistently align with human judgments, particularly when these are obtained through the MQM framework (Freitag et al., 2021b, 2022), they remain very popular. In fact, Bleu remains the most widely employed evaluation metric in machine translation to this day (Marie et al., 2021). On the other hand, neural metrics (e.g., Comet [Rei et al., 2020] and Bleurt [Sellam et al., 2020]) that rely on complex neural networks to estimate the quality of MT outputs are consistently among the best metrics for MT evaluation according to correlations with human judgments (Freitag et al., 2021b, 2022).

However, contrary to lexical metrics, which offer a straightforward interpretation, it can often prove challenging to explain the score assigned by a neural metric to a given translation output. As such, there have been a series of efforts to bring interpretability to neural metrics, either by focusing on understanding their inner workings (Rei et al., 2023b; Leiter et al., 2023), or by constructing inherently interpretable neural metrics (e.g., MaTESe [Perrella et al., 2022] and Fg-ted [Bao et al., 2023]) that assign a central role to the task of predicting word-level errors in a given translation, instead of just a sentence-level score.

More recently, with the rise of generative LLMs, some studies have tried to frame the MT evaluation problem as a generative problem. This offers great flexibility, as the LLM can be prompted to either score the translation directly (Kocmi and Federmann, 2023b), or to identify errors in the translation (e.g., in line with the MQM framework) (Fernandes et al., 2023; Xu et al., 2023).

An automatic metric for translation evaluation aims at predicting the quality of a translated sentence, t, in light of a reference translation, r, for a given source sentence, s. Here, we focus specifically on neural metrics that make use of a neural model, and typically operate under one of the following evaluation scenarios:

  • reference-only (ref): The model evaluates the translation by processing it alongside a ground-truth reference sentence (Bleurt is an example of such a metric);

  • source-reference combined input (src+ref): The model evaluates the translation by jointly processing it with both the source and the reference (Comet is an example of such a metric);

  • source-only (src): The model evaluates the translation using only its corresponding source sequence (CometKiwi (Rei et al., 2022b) is an example of such a model). This mode is commonly termed quality estimation (QE) or reference-free evaluation (Specia et al., 2010).

In essence, the model’s input sequence consists of the translation t paired with some additional input—either r, [r, s], or s—derived from the scenarios above. Given this input, the model may predict the quality of the translation at different granularities, e.g., sentence-level or word(span)-level.
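To make these input configurations concrete, the sketch below shows one way the three inputs could be assembled from a translation and its available context. The separator token and concatenation order are illustrative assumptions, not the exact format used by the released xcomet models.

```python
from typing import Dict, Optional

# Illustrative separator; the actual subword tokenizer and special tokens used by
# xcomet are not reproduced here.
SEP = " </s> "

def build_inputs(translation: str,
                 source: Optional[str] = None,
                 reference: Optional[str] = None) -> Dict[str, str]:
    """Return the input string for each evaluation mode that is possible given
    the available context: ref (reference-only), src+ref, and src (QE)."""
    inputs = {}
    if reference is not None:
        inputs["ref"] = translation + SEP + reference
    if source is not None and reference is not None:
        inputs["src+ref"] = translation + SEP + source + SEP + reference
    if source is not None:
        inputs["src"] = translation + SEP + source
    return inputs

# With both source and reference available, all three modes can be evaluated.
modes = build_inputs("Das ist ein Test.",
                     source="This is a test.",
                     reference="Dies ist ein Test.")
print(sorted(modes))  # ['ref', 'src', 'src+ref']
```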

Sentence-level Prediction.

The model is tasked to predict a single global score (typically between 0 and 1) for the translation that represents how well it aligns with its context (i.e., source and/or reference sentence). These scores can be used for a broad range of tasks, such as gauging the quality of different translation systems (Freitag et al., 2022), identifying pathological translations (Guerreiro et al., 2023), assisting the generation of translations by MT systems (Fernandes et al., 2022), or even acting as reward models for human alignment of language models (Gulcehre et al., 2023).

Word(span)-level Prediction.

In contrast, word-level (or span-level) predictions are more fine-grained, identifying individual words or phrases in the translation that may have errors or discrepancies—typically identifying them as ok/bad or according to their severity, e.g., minor/major. These granular evaluations are more interpretable and assist in pinpointing specific issues, which can be particularly valuable for feedback and iterative translation improvements.

Our metric, xcomet, emerges in a unique position in the landscape of MT evaluation metrics. It can simultaneously perform evaluation under all three of the scenarios (src, ref, src+ref) presented, and provide sentence-level scores and error span annotations that are in line with the MQM framework, thus bringing further transparency to the evaluation (see Figure 1 for an illustration). In the next section, we detail the design choices and methodology of xcomet.

In this section, we describe the methodology behind xcomet, outlining its model architecture, training settings and corpora, and learning curriculum. We detail how the model is designed to perform both regression and error span detection while adopting a unified input approach for enhanced flexibility and performance.

4.1 Model Architecture

xcomet is built upon insights from contributions to the WMT22 Metrics and QE shared tasks (Rei et al., 2022a, b). It is designed to concurrently handle two tasks: sentence-level regression and error span detection. Figure 2 illustrates its architecture. We follow the architecture of the scaled-up version of CometKiwi detailed in Rei et al. (2023a), which uses a large pre-trained encoder model as its backbone. Importantly, in line with our multi-task setup, the model has two prediction heads: (i) a sentence-level regression head, which employs a feed-forward network to generate a sentence score, and (ii) a word-level sequence tagger, which applies a linear layer to assign labels to each translation token.

Figure 2: 

Architecture of xcomet. The input to the model starts with a [cls] token, followed by the translation and an additional input that contains the source, the reference, or both. After the pooling layer, the [cls] representation is passed to a feed-forward network to produce a quality score, while all subword pieces corresponding to the translation are passed to a linear layer that classifies them according to their severity level, Y_wl = {ok, min, maj, crit}.


We train two xcomet versions—xcomet-xl and xcomet-xxl—using the XL (3.5B parameters) and XXL (10.7B parameters) versions of XLM-R (Goyal et al., 2021).2
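For concreteness, the following is a minimal PyTorch-style sketch of the two prediction heads described above. The encoder (e.g., an XLM-R checkpoint loaded through a standard library), the pooling strategy, and the head sizes are placeholders and assumptions of this sketch; it is not the released xcomet implementation.

```python
import torch
import torch.nn as nn

NUM_SEVERITIES = 4  # Y_wl = {ok, min, maj, crit}

class XCometStyleHeads(nn.Module):
    """Sketch: a sentence-level regression head and a word-level tagger on top
    of a shared pre-trained encoder (hidden sizes and pooling are assumptions)."""

    def __init__(self, encoder: nn.Module, hidden_size: int):
        super().__init__()
        self.encoder = encoder
        # (i) sentence-level regression head: feed-forward over the [cls] state
        self.regression_head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, 1),
        )
        # (ii) word-level sequence tagger: one linear layer over each subword state
        self.word_level_head = nn.Linear(hidden_size, NUM_SEVERITIES)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor):
        # Assumes a Hugging Face-style encoder returning last_hidden_state.
        states = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        # Sigmoid keeps the sentence score in [0, 1] (an assumption of this sketch).
        sentence_score = torch.sigmoid(self.regression_head(states[:, 0])).squeeze(-1)
        word_logits = self.word_level_head(states)  # (batch, seq_len, NUM_SEVERITIES)
        return sentence_score, word_logits
```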

4.2 Fully Unified Evaluation

xcomet adopts a unified input approach (Wan et al., 2022b), allowing for all the evaluation scenarios described in Section 3 (ref, src+ref, and src evaluation) under a single model. Thus, the input sequence consists of two parts: (i) the translated sentence t = [t1,…, tn] of length n, and (ii) an additional input containing information from the source, the reference, or both. As such, when a reference is available, we run three distinct forward passes (one for each evaluation scenario), each yielding sentence-level and word-level predictions.

4.2.1 Training Time

For each step, we collect the sentence-level predictions and the word-level logits for each input format: {ŷ_sl^src, ŷ_sl^ref, ŷ_sl^src+ref} and {ŷ_wl^src, ŷ_wl^ref, ŷ_wl^src+ref}.3

As we have mentioned before, xcomet models are trained with supervision from both sentence-level quality assessments, y_sl, and word-level severity tags, y_wl = [y_1, …, y_n], with y_i ∈ Y_wl = {ok, min, maj, crit}. In the multi-task setting, we use the following loss ℒ for each input type (input ∈ {src, ref, src+ref}):

ℒ_sl = (ŷ_sl − y_sl)²   (1)

ℒ_wl = −(1/n) Σ_{i=1}^{n} α_{y_i} log p(y_i)   (2)

ℒ = (1 − λ) ℒ_sl + λ ℒ_wl   (3)

Here, α ∈ ℝ^{|Y_wl|} represents the class weights given for each severity label, p(y_i) is the probability the word-level head assigns to the gold tag of the i-th translation subword, and λ is used to weigh the combination of the sentence- and word-level losses (λ = 0 recovers a sentence-level-only objective, and λ = 1 a word-level-only objective).
The final learning objective is the summation of the losses for each input type:

ℒ_total = ℒ^src + ℒ^ref + ℒ^src+ref   (4)

Furthermore, in line with preceding metrics constructed upon the Comet framework, our models employ techniques such as gradual unfreezing and discriminative learning rates.
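The sketch below spells out the objective in Equations 1 to 4: a mean-squared error for the sentence-level head, a class-weighted cross-entropy for the word-level head, an interpolation controlled by λ, and a sum over the three input formats. Variable names and the averaging details are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def multitask_loss(sent_pred, sent_gold, word_logits, word_gold, class_weights, lam):
    """Multi-task loss for one input format (Eqs. 1-3).
    word_logits: (batch, seq_len, num_severities); word_gold: (batch, seq_len)."""
    loss_sl = F.mse_loss(sent_pred, sent_gold)                        # Eq. (1)
    # Class-weighted cross-entropy; PyTorch normalizes by the total class weight
    # rather than by n, a minor difference from Eq. (2).
    loss_wl = F.cross_entropy(word_logits.transpose(1, 2), word_gold,
                              weight=class_weights)                   # Eq. (2)
    return (1.0 - lam) * loss_sl + lam * loss_wl                      # Eq. (3)

def total_loss(outputs, gold, class_weights, lam):
    """Sum of the multi-task loss over the src, ref, and src+ref passes (Eq. 4).
    `outputs` maps each input format to its sentence prediction and word logits."""
    return sum(
        multitask_loss(out["sentence"], gold["sentence"],
                       out["word_logits"], gold["word_tags"],
                       class_weights, lam)
        for out in outputs.values()
    )
```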

4.2.2 Inference Time

Error Span Prediction.

For each subword in the translation, we average the output distribution of the word-level linear layer obtained for each forward pass. Using this distribution, we predict a set of word-level tags ŷ_wl = [ŷ_1, …, ŷ_n] by taking the most likely class for each token. From these tags, we construct a list of error spans, S, by grouping adjacent subwords identified as errors. The severity of each span in S is defined according to the most severe error tag found within the span.
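A small sketch of this span-construction step is given below; the tag names and the tuple representation of a span are illustrative choices, not the released implementation.

```python
from typing import List, Tuple

SEVERITY_ORDER = {"ok": 0, "minor": 1, "major": 2, "critical": 3}

def tags_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Group adjacent subwords tagged as errors into spans and label each span
    with the most severe tag it contains. Returns (start, end_exclusive, severity)
    tuples over subword positions."""
    spans, start = [], None
    for i, tag in enumerate(tags + ["ok"]):  # trailing sentinel closes an open span
        if tag != "ok" and start is None:
            start = i
        elif tag == "ok" and start is not None:
            worst = max(tags[start:i], key=SEVERITY_ORDER.get)
            spans.append((start, i, worst))
            start = None
    return spans

# Two error spans; the second is labeled with its most severe tag ("major").
print(tags_to_spans(["ok", "minor", "minor", "ok", "major", "minor"]))
# -> [(1, 3, 'minor'), (4, 6, 'major')]
```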

Sentence-level Prediction.
For each forward pass, we obtain the corresponding sentence-level scores: ŷ_src, ŷ_ref, and ŷ_src+ref.4 Additionally, we leverage the information coming from the predicted list of error spans, S, to infer an automated MQM score. To do so, we follow the MQM framework: we obtain the error counts for each severity level—c_min, c_maj, c_crit—and apply the predetermined severity penalty multipliers (w_min, w_maj, and w_crit, respectively) to define the error type penalty total, e(S). Formally:

e(S) = w_min · c_min + w_maj · c_maj + w_crit · c_crit   (5)

Finally, we obtain ŷ_mqm by capping and flipping the sign of e(S):

ŷ_mqm = 1 − min(e(S), c) / c,   (6)

where c denotes the cap on the total penalty. Note that the predicted score ŷ_mqm is bounded between 0 and 1, with a score of 1 corresponding to a perfect translation.
We aggregate the scores to compute the final sentence-level score, ŷ_sl, through a weighted sum of the different sentence-level scores (see Figure 3). Importantly, we also include the inferred MQM score ŷ_mqm to directly inform the final sentence-level prediction. Formally, given ŷ = [ŷ_src, ŷ_ref, ŷ_src+ref, ŷ_mqm]:

ŷ_sl = w⊤ŷ = Σ_k w_k · ŷ_k,   (7)

where w is set to [1/9, 1/3, 1/3, 2/9].
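Putting Equations 5 to 7 together, the sketch below infers an MQM-style score from predicted spans and aggregates it with the regression scores. The severity multipliers and the penalty cap are assumptions of this sketch that follow common MQM scoring conventions; the weights w are the values reported above for the final model.

```python
# Assumed severity multipliers and penalty cap (following common MQM conventions);
# they are not taken from the released xcomet configuration.
PENALTY = {"minor": 1.0, "major": 5.0, "critical": 10.0}
CAP = 25.0

def mqm_score(spans) -> float:
    """Infer an MQM-style score in [0, 1] from (start, end, severity) spans."""
    penalty = sum(PENALTY[severity] for _, _, severity in spans)  # e(S), Eq. (5)
    return 1.0 - min(penalty, CAP) / CAP                          # Eq. (6)

def final_sentence_score(y_src: float, y_ref: float,
                         y_src_ref: float, y_mqm: float) -> float:
    """Weighted aggregation of the four sentence-level scores, Eq. (7)."""
    weights = [1 / 9, 1 / 3, 1 / 3, 2 / 9]
    scores = [y_src, y_ref, y_src_ref, y_mqm]
    return sum(w * s for w, s in zip(weights, scores))

# Example: one major error yields an inferred MQM score of 0.8 under these assumptions.
y_mqm = mqm_score([(4, 6, "major")])
print(round(final_sentence_score(0.85, 0.90, 0.88, y_mqm), 3))
```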
Figure 3: 

Histogram of the sentence-level scores from each partition of the MQM data. We aggregate the data from IndicMT and DEMETR under “Other MQM Data”.


4.3 Corpora

Our models are exclusively trained on publicly available DA and MQM annotations, most of which have been collected by WMT over the recent years.

DA Data.

We use DA annotations collected by WMT from 2017 to 2020, and the MLQE-PE dataset (Fomicheva et al., 2022). As the MLQE-PE dataset does not contain reference translations, we used the post-edit translations as reference translations. Overall, the corpus consists of around 1 million samples, spanning 36 language pairs.

MQM Data.

We collected the MQM annotations from WMT from 2020 to 2022.5 We also used annotations sourced from other MQM-annotated datasets: (i) IndicMT (Sai B et al., 2023), which contains MQM annotations spanning 5 Indian languages, and (ii) DEMETR (Karpinska et al., 2022), a diagnostic dataset with perturbations spanning semantic, syntactic, and morphological errors.

Corpora with MQM annotations are usually extremely unbalanced, with critical errors being underrepresented (see the statistics for WMT in Table 1). As a result, metrics might struggle to effectively detect translations with critical errors and hallucinations (Amrhein and Sennrich, 2022; Raunak et al., 2022; Guerreiro et al., 2023). As such, we augment the MQM corpus with hallucinations from the MLQE-PE corpus and synthetic critical errors. We create detached and oscillatory hallucinations (Raunak et al., 2021; Guerreiro et al., 2023): (i) detached hallucinations, where we replace the translation with either a random sentence or an unrelated sentence that is semantically similar to the source;6 and (ii) oscillatory hallucinations, where we randomly sample an n-gram from the translation (with n in {2,3,4}) and repeat it between 1 and 10 times. We set the sentence-level scores of these hallucinations to 0. Overall, our MQM corpus consists of 194K samples across 14 language pairs.
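The sketch below illustrates how such synthetic hallucinations could be generated; it follows the description above, but tokenization, the candidate sentence pool, and the semantic-similarity selection (done with LaBSE in the paper) are simplified or omitted.

```python
import random

def oscillatory_hallucination(translation: str, rng: random.Random) -> str:
    """Sample an n-gram (n in {2, 3, 4}) from the translation and repeat it
    between 1 and 10 times."""
    tokens = translation.split()
    n = rng.choice([2, 3, 4])
    if len(tokens) < n:
        return translation
    start = rng.randrange(0, len(tokens) - n + 1)
    ngram = tokens[start:start + n]
    repeats = rng.randint(1, 10)
    corrupted = tokens[:start + n] + ngram * repeats + tokens[start + n:]
    return " ".join(corrupted)

def detached_hallucination(candidate_pool, rng: random.Random) -> str:
    """Replace the translation with a random sentence from a pool; selecting an
    unrelated sentence that is semantically similar to the source is omitted here."""
    return rng.choice(candidate_pool)

rng = random.Random(0)
print(oscillatory_hallucination("the cat sat on the mat near the door", rng))
# Sentence-level scores for these synthetic hallucinations are set to 0.
```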

Table 1: 

Number of samples, as well as error statistics (overall percentage of non-correct translations; rates of error type [MIN, MAJ, CRIT]), of each MQM data source used for training xcomet.

| Dataset | No. Samples | Error Statistics |
| --- | --- | --- |
| WMT Data | 147K (76%) | 63%; [57, 42, 1] |
| IndicMT | 7K (4%) | 80%; [19, 52, 29] |
| DEMETR | 22K (11%) | 47%; [38, 19, 43] |
| MLQE-PE Hall. | 1.7K (1%) | All set to CRIT |
| Synthetic Hall. | 16K (8%) | All set to CRIT |

Scaling of Sentence-level Scores.

While the sentence-level scores inferred from MQM annotations (through the procedure in Equation 6) are bounded between 0 and 1, DA annotations usually require z-normalization in order to mitigate variations in scoring strategies by different annotators (Bojar et al., 2017).7 Thus, as z-scores are inherently centered at 0 and unbounded, there is a scaling mismatch between the data samples.

Consequently, to circumvent this limitation, we employ min-max scaling on our DA corpus to set its range of scores to [0,1]. To do so, we set a practical minimum and maximum z-score value. We obtain the minimum score by averaging the z-scores of translations with more than one annotation for which all annotators unanimously assigned an unnormalized DA score of 0, i.e., they deemed the translation “random”. To determine the maximum value, we apply the same process to perfect translations, i.e., those with an unnormalized DA score of 100.8
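A minimal sketch of this scaling step is shown below, assuming the practical minimum and maximum z-scores have already been estimated as described above; the bounds in the example are illustrative, and clipping out-of-range values is an assumption of this sketch.

```python
def scale_da_score(z_score: float, z_min: float, z_max: float) -> float:
    """Min-max scale a DA z-score to [0, 1] using practical bounds estimated from
    unanimously 0-rated ("random") and 100-rated (perfect) translations."""
    scaled = (z_score - z_min) / (z_max - z_min)
    return min(max(scaled, 0.0), 1.0)  # clip values outside the practical range

# Illustrative (not actual) bounds.
print(scale_da_score(0.4, z_min=-1.8, z_max=1.2))  # ~0.733
```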

4.4 Training Curriculum

xcomet models undergo a 3-phase curriculum training. Throughout these phases, the training emphasis alternates between sentence-level prediction and error span prediction by tweaking the parameter λ in Equation 3. The curriculum phases can be described as follows:

  • Phase I:

    The model is trained exclusively using the DA data. In this phase, the focus is exclusively set on sentence-level regression.

  • Phase II:

    In this stage, we introduce word-level supervision. To achieve this, the model is fine-tuned on our diverse MQM corpus, with most emphasis placed on the word-level task.

  • Phase III:

    The last training phase is aimed at unifying both tasks. The model is further fine-tuned using high-quality MQM data from Freitag et al. (2021a), with greater emphasis placed on sentence-level prediction.9

Interpretation of the Curriculum.

We start by training a sentence-level metric—similar to UniTE (Wan et al., 2022a)—on the vastly available DA annotations. Phase I acts as a warm-up for subsequent stages. In fact, prior research has shown that models trained on DA annotations leverage token-level information that aligns with MQM error annotations (Rei et al., 2023b). Moving to Phase II, we assume we have a metric that can perform sentence-level regression. Thus, the aim here shifts to integrating word-level supervision without compromising the previously acquired sentence-level prediction skills. To do so, we use the highly diverse corpora of MQM annotations and place most emphasis on the word-level task. Finally, in Phase III, we exclusively leverage a small corpus (around 25k samples) of very high-quality MQM annotations from Freitag et al. (2021a)—each sample has three annotations from separate annotators—with additional synthetic hallucinations. Our focus here is to mitigate any potential decline in sentence-level regression capabilities incurred during Phase II.

5.1 Evaluation

We evaluate the performance of our metrics using two datasets: (i) the MQM annotations from the News domain of the WMT 2022 Metrics shared task, and (ii) the WMT 2023 Metrics shared task evaluation suite. The WMT22 annotations encompass three language pairs: Chinese→English (zh-en), English→German (en-de), and English→Russian (en-ru). On the other hand, the WMT23 annotations cover Chinese→English (zh-en), English→German (en-de), and Hebrew→English (he-en). We evaluate the metrics in terms of sentence-level, system-level, and error span prediction performance.

At the sentence-level, we report Kendall’s Tau (τ) using the Perm-Both hypothesis test (Deutsch et al., 2021). We also evaluate the metrics on System-level Pairwise Accuracy (Kocmi et al., 2021). We base these evaluations on 200 re-sampling runs, with a significance level (p) set to 0.05. For error span prediction, we adopt the WMT23 Quality Estimation shared task evaluation methodology and compute F1 scores calculated at the character level, taking into account partial matches for both minor and major errors.10 For WMT23, we follow the evaluation setup from the shared task (Freitag et al., 2023) and report the aggregated System-level Pairwise Accuracy pooled across all language pairs, and the primary metric Average Correlation, which encompasses ten tasks, spanning system- and sentence-level metrics.11
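For reference, simplified versions of the two main correlation measures used here are sketched below; the official evaluation additionally applies the Perm-Both significance test, tie calibration, and re-sampling, which are omitted in this sketch.

```python
from itertools import combinations
from scipy.stats import kendalltau

def segment_kendall_tau(metric_scores, human_scores) -> float:
    """Segment-level Kendall's tau between metric scores and human judgments."""
    tau, _ = kendalltau(metric_scores, human_scores)
    return tau

def system_pairwise_accuracy(metric_sys, human_sys) -> float:
    """System-level pairwise accuracy (Kocmi et al., 2021): the fraction of system
    pairs ranked in the same order by the metric and by humans (ties count as
    disagreements in this simplified version)."""
    agree, total = 0, 0
    for a, b in combinations(sorted(metric_sys), 2):
        total += 1
        if (metric_sys[a] - metric_sys[b]) * (human_sys[a] - human_sys[b]) > 0:
            agree += 1
    return agree / total
```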

5.2 Baselines

Sentence and System-level.

We test our metrics against widely used open neural metrics: Comet-22 (Rei et al., 2022a) and Bleurt-20 (Pu et al., 2021). Additionally, we include MetricX, the best performing metric from the WMT22 Metrics shared task (Freitag et al., 2022),12 and Gemba (Kocmi and Federmann, 2023b), which employs GPT4 (OpenAI, 2023) to evaluate translations following DA guidelines. For WMT23, we report the same metrics but update them, when needed (for MetricX-23 [Juraska et al., 2023] and Gemba-mqm [Kocmi and Federmann, 2023a]), with the versions submitted to the official competition.13

Error Span Prediction.

We report results using the GPT3.5 and GPT4 models, prompting them in the style of AutoMQM (Fernandes et al., 2023).14 We carefully select 5 shots that are held constant for all samples. This way, we can directly compare our results with state-of-the-art LLMs, which have been shown to be capable of performing the task of error detection (Fernandes et al., 2023; Xu et al., 2023).

In this section, we present a standard performance analysis of our metrics in terms of correlations with human judgments. Overall, we find xcomet to be state-of-the-art in sentence-level evaluation and error span prediction, while being competitive with generative LLMs in terms of system-level evaluation.

Sentence-level Evaluation.

Table 2a shows that both xcomet metrics outperform other strong-performing neural metrics, including the generative approach of Gemba, which leverages GPT4. In particular, xcomet-xxl sets a new state-of-the-art for en-de and en-ru. Interestingly, while scaling up the encoder model of the xcomet metrics (from xl to xxl) yields better results, xcomet-xl is very competitive. In fact, it outperforms MetricX, which is even larger than xcomet-xxl. Finally, we can also observe that the MQM scores inferred exclusively from the predicted error spans exhibit strong performance, outperforming the widely used metrics Bleurt-20 and Comet-22. This is particularly relevant: the predicted error spans not only bring a more detailed view into translation errors but also provide high-quality sentence-level scores.

Table 2: 

Segment-level Kendall-Tau (↑) in (a), and System-level Pairwise Accuracy (↑) in (b) using the Perm-Both hypothesis test (Deutsch et al., 2021) on the WMT22 Shared Task News domain test set. Numbers in bold belong to the top-performing cluster according to statistical significance (p < 0.05).

(a) Sentence-level evaluation (Segment-level Kendall-Tau).

| Metric | zh-en | en-de | en-ru | Avg. |
| --- | --- | --- | --- | --- |
| Bleurt-20 | 0.336 | 0.380 | 0.379 | 0.365 |
| comet-22 | 0.335 | 0.369 | 0.391 | 0.361 |
| MetricX | 0.415 | 0.405 | 0.444 | 0.421 |
| Gemba-gpt4-da | 0.292 | 0.387 | 0.354 | 0.354 |
| xcomet-xl | 0.399 | 0.414 | 0.448 | 0.421 |
| xcomet-xxl | 0.390 | 0.435 | 0.470 | 0.432 |
| MQM scores from the error spans (ŷ = ŷ_mqm): | | | | |
| xcomet-xl (mqm) | 0.374 | 0.389 | 0.445 | 0.402 |
| xcomet-xxl (mqm) | 0.332 | 0.415 | 0.439 | 0.395 |

(b) System-level evaluation (Pairwise Accuracy).

| Metric | zh-en | en-de | en-ru | Avg. |
| --- | --- | --- | --- | --- |
| Bleurt-20 | 0.762 | 0.771 | 0.743 | 0.759 |
| comet-22 | 0.705 | 0.800 | 0.733 | 0.746 |
| MetricX | 0.762 | 0.781 | 0.724 | 0.756 |
| Gemba-gpt4-da | 0.752 | 0.848 | 0.876 | 0.825 |
| xcomet-xl | 0.800 | 0.743 | 0.790 | 0.778 |
| xcomet-xxl | 0.800 | 0.829 | 0.829 | 0.819 |
| MQM scores from the error spans (ŷ = ŷ_mqm): | | | | |
| xcomet-xl (mqm) | 0.781 | 0.762 | 0.762 | 0.768 |
| xcomet-xxl (mqm) | 0.781 | 0.838 | 0.810 | 0.810 |

System-level Evaluation.

Table 2b and Table 3 show results for system-level for both WMT22 and WMT23 test sets. Similarly to what we observed at the sentence-level, our metrics show consistently superior performance when compared to other dedicated neural metrics. Notably, although generative approaches typically do much better at system-level evaluation when compared to dedicated models (Kocmi and Federmann, 2023b; Fernandes et al., 2023), xcomet-xxl remains competitive in all language pairs with Gemba using GPT4. Finally, building on the findings at the sentence-level, Table 2b reveals that the MQM scores inferred directly and exclusively from the predicted error spans also exhibit very competitive performance in terms of system-level accuracy.

Table 3: 

System-level pairwise accuracy (↑) (Kocmi et al., 2021) computed over data pooled across all three WMT23 language pairs, and primary metric Average Correlation (↑).

| Metric | system-level acc. | avg-corr. |
| --- | --- | --- |
| Bleurt-20 | 0.892 | 0.776 |
| comet-22 | 0.900 | 0.779 |
| MetricX-23 | 0.908 | 0.808 |
| Gemba-mqm | 0.944 | 0.802 |
| xcomet-xl | 0.912 | 0.813 |
| xcomet-xxl | 0.920 | 0.812 |

Aggregated Evaluation.

Table 3 shows aggregated results for the WMT23 Metrics Shared Task. Our metrics, at both scales, would have won the shared task, outperforming both the newest version of MetricX and Gemba-mqm. Following the trend observed in sentence-level evaluation, we note that xcomet-xl is indeed competitive with xcomet-xxl despite running at a smaller scale.

Error Span Prediction.

While we have highlighted the utility of the predicted error spans through the inferred sentence-level MQM scores, here we turn to evaluating them directly. Table 4 shows that the error spans predicted by the xcomet metrics outperform those obtained with both GPT3.5 and GPT4, despite the xcomet models being smaller in capacity. In fact, even when a reference is not provided, our metrics achieve performance close to that of GPT4.

Table 4: 

F1 scores (↑) for error span detection, for reference-free and reference-based evaluation.

Interplay of Error Spans and Sentence-level Scores.

Table 5 shows a strong correlation between the different score types predicted by xcomet and the MQM inferred score derived exclusively from error spans. This interplay is highly important: the predicted error spans may be valuable, not just for the sake of accuracy but also for interpretability. Interestingly, these high correlations with the predicted scores from each forward pass (ŷ_src, ŷ_ref, ŷ_src+ref) are obtained despite no explicit alignment mechanism governing the relationship between the predictions of the sentence-level and word-level heads. We hypothesize that it is thus the shared encoder that, during the multi-task training, aligns the representations between the two tasks. As such, xcomet provides, through its predicted error spans, a potential lens through which we can better understand, contextualize, and even debug its own sentence-level predictions.

Table 5: 

Pearson correlations between the regression scores produced by xcomet-xxl (ŷ_src, ŷ_ref, ŷ_src+ref, ŷ_sl) and the MQM inferred score, ŷ_mqm, computed from the identified error spans. The computation of ŷ_sl, contrary to the other scores, makes direct use of ŷ_mqm (see Eq. 7).

| Score | zh-en | en-de | en-ru | All |
| --- | --- | --- | --- | --- |
| ŷ_src | 0.73 | 0.75 | 0.79 | 0.78 |
| ŷ_ref | 0.75 | 0.74 | 0.75 | 0.77 |
| ŷ_src+ref | 0.78 | 0.79 | 0.82 | 0.82 |
| ŷ_sl | 0.90 | 0.92 | 0.92 | 0.92 |

We have shown that xcomet metrics exhibit state-of-the-art correlations with human judgments when evaluated against high-quality MQM annotations. However, these MQM annotations are often highly unbalanced and contain few major or critical errors. As such, they may not offer a full picture of the metrics’ performance. In this section, we shift our focus to studying how xcomet metrics behave when evaluating translations with localized major or critical errors, as well as highly pathological translations, such as hallucinations.

7.1 Localized Errors

We employ SMAUG (Alves et al., 2022),15 a tool designed to generate synthetic data for stress-testing metrics, to create corrupted translations that contain major or critical errors. We generate translations with the following pathologies: addition of text, negation errors, mask in-filling, named entity errors, and errors in numbers. For this evaluation, we use data from the WMT 2023 Metrics shared task. Specifically, we corrupt the released synthetic references for which the xcomet metrics found no errors.16 Moreover, as the full suite of SMAUG transformations can only be applied to English text, we focus on Chinese→English (zh-en) and Hebrew→English (he-en) translations.

xcomet Predicts most Localized Errors as Major or Critical Errors.

Table 6 shows that xcomet metrics identify errors in the vast majority of the perturbed samples, with trends varying across scale and language pair. We found that the errors predicted by xcomet-xl and xcomet-xxl overlap with the artificially induced perturbations in over 90% of the perturbed samples (98% for xl and 90.9% for xxl). However, upon further analysis of xcomet’s predicted error spans, we observed that the model tends to identify additional spans in the perturbed sentence as erroneous, beyond the induced perturbations. This behavior is more prominent for perturbations involving the addition of text, which are not as localized as perturbations like swapping numbers or named entities. Furthermore, we noticed that the model has a propensity to assign the same error category to all predicted spans within a single sentence. When the metric predicts multiple error spans in a sentence, it assigns different severity levels to those spans only about 35% of the time. Improving the model’s ability to differentiate error categories among multiple errors within a sentence is an interesting avenue for future research and development. We show several examples of predictions of xcomet in Table 7.

Table 6: 

Percentage (%) of translations, segmented by perturbation type, that are predicted to have no errors (↓). We show results for both zh-en and he-en language pairs across xcomet sizes.

| Error | zh-en (xl) | zh-en (xxl) | he-en (xl) | he-en (xxl) |
| --- | --- | --- | --- | --- |
| Add. of text | 3.66 | 10.7 | 6.15 | 7.35 |
| Negation | 0.20 | 0.20 | 3.89 | 4.90 |
| Mask in-fill | 5.01 | 17.0 | 4.78 | 3.92 |
| Swap NUM | 3.19 | 2.88 | 0.16 | 0.00 |
| Swap NE | 3.66 | 6.94 | 9.81 | 7.01 |
| All | 2.24 | 10.7 | 9.81 | 7.00 |
Table 7: 

Predictions of xcomet-xl for perturbed translations, with minor, major, and critical error spans highlighted.

We also found that negation errors and mismatches in numbers are the most easily identified by the metrics. This is interesting: Localized errors, such as mismatches in numbers and named-entity errors, had been pinpointed as weaknesses of previous COMET metrics (Amrhein and Sennrich, 2022; Raunak et al., 2022). This earlier limitation seems to now have been addressed successfully. In fact, the results in Figure 4a show that most of these errors are predicted as critical errors. One plausible hypothesis for these improvements is the incorporation of datasets that contain negative translations and synthetic hallucinations into our training set.

Figure 4: 

Analysis of xcomet-xxl for data with localized critical errors in terms of (a) distribution of error severities for the predicted error spans, and (b) sensitivity of the sentence-level scores.


xcomet Sentence-level Scores Are Sensitive to Localized Perturbations.

Figure 4b shows that localized errors can lead to significant decreases in the predicted sentence-level scores, with perturbation-wise trends mirroring those of the error span predictions: the most pronounced decreases are found for negation errors and mismatches in numbers and named entities (median decreases of around 20 points). The distribution of the decreases in quality also reveals two relevant trends: (i) localized perturbations can cause xcomet-xxl to shift from the score of a perfect translation to that of an unrelated translation, and (ii) the behavior of xcomet-xxl is not perfect and can be further improved: in rare cases, perturbations may actually lead to an increase in the score. Nevertheless, upon closer inspection, for over 90% of such cases, the increase is smaller than 1 point.

7.2 Hallucinations

Hallucinations lie at the extreme end of machine translation pathologies (Raunak et al., 2021), and can have a devastating impact when models are deployed in the wild. Yet, these translations are often overlooked when assessing the performance of translation systems. Their rarity means that performance, usually judged according to an aggregated corpus-level score, may remain largely unperturbed by a very small number of hallucinations. Here, we assess how the xcomet metrics rank hallucinations among other translations. We use the German→English hallucination benchmark introduced in Guerreiro et al. (2023). This benchmark comprises over 3.4k translations—produced by an actual machine translation system—covering different error types, including omissions, named-entity errors, and hallucinations (oscillatory, fully detached, and strongly detached). For a metric that has not been trained explicitly to rank translations, the benchmark is quite challenging: hallucinations should be ranked below other severe errors and incorrect translations.

xcomet Metrics Can Distinguish Hallucinations from Other Translations.

The results in Table 8 show that both xcomet metrics largely rank hallucinations lower than other errors. This is especially true for the most severe type of hallucination (fully detached), for which the AUROC exceeds 95 for the xxl metric. In fact, Figure 5 reveals that xcomet-xxl assigns over 90% of these fully detached hallucinations a score under 10. We show examples of error spans predicted by xcomet-xxl in Table 9. Relative to previous metrics, xcomet achieves overall improvements. Interestingly, we also find that src-based evaluation (i.e., without the use of a reference) can be beneficial in this scenario. We hypothesize that this is due to the metric over-relying on the reference when it is available (Rei et al., 2023b). While hallucinations contain content that is detached from the source, some of their text may still overlap (even if only lexically) with the reference (e.g., in strongly detached or oscillatory hallucinations), leading to higher scores.

Table 8: 

Hallucination detection performance on the de-en hallucination benchmark from Guerreiro et al. (2023), as measured by AUROC (↑), for reference-free and reference-based metrics. We report results for the full dataset, and for fully detached and oscillatory hallucinations separately.
Table 9: 

Examples of predictions of xcomet-xxl for the hallucination data of Guerreiro et al. (2023). The model correctly identifies the anomalous repeated phrase for the oscillatory hallucination, and predicts the whole translation as a single error span in the case of the fully detached hallucination.

Figure 5: 

Category-wise distribution of xcomet-xxl scores on the hallucination benchmark.


We now address relevant questions about the development of xcomet through ablations on design choices. Ablations are run with xcomet-xl on the MQM annotations from the News domain of the WMT 22 Metrics shared task (see Section 5).

Impact of the Training Curriculum.

We employed a curriculum to train xcomet (see Section 4.4) in order to balance data from different annotation strategies (i.e., DA and MQM annotations), and also to better balance the multi-task objective. Here, we want to assess how performance evolves throughout the different stages. We perform ablations on Phase II and Phase III, which correspond to the introduction of the multi-task objective and MQM training data that contain both sentence-level scores and error spans.

Table 10 shows that while a multi-task model outperforms single-task models for sentence-level evaluation,17 this does not hold true for word-level evaluation. The best word-level model is obtained by running Phase II with a word-level-only objective. Note that we can still extract sentence-level scores from such a model in two ways: (i) by leveraging the still-existing regression head trained during Phase I, or (ii) by converting the error spans into a single sentence-level score. However, neither of these approaches is competitive with our final xcomet model. In fact, performing sentence-level evaluation via the error spans predicted by the final model leads to better correlations than doing so with the word-level-only model. Moreover, aggregating the different scores from the regression heads yields the best overall performance for sentence-level evaluation.

Table 10: 

Segment-level Kendall-Tau (τ) (↑) and F1 scores (↑) for error span detection under different curriculum choices. We report metrics that only perform segment-level evaluation (λ = 0), only perform word-level evaluation (λ = 1), and both. When a model cannot perform error span prediction, we write na under its F1 score.

| Stage | zh-en τ | zh-en F1 | en-de τ | en-de F1 | en-ru τ | en-ru F1 | Avg. τ | Avg. F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Post Phase I with sentence-level only objective: λ = 0 | | | | | | | | |
| Phase I | 0.377 | na | 0.356 | na | 0.425 | na | 0.386 | na |
| Phase II (ŷ = ŷ_reg) | 0.372 | na | 0.386 | na | 0.448 | na | 0.402 | na |
| Phase III (ŷ = ŷ_reg) | 0.395 | na | 0.391 | na | 0.457 | na | 0.414 | na |
| Post Phase I with word-level only objective: λ = 1 | | | | | | | | |
| Phase II (ŷ = ŷ_reg) | 0.333 | 0.293 | 0.357 | 0.332 | 0.415 | 0.229 | 0.368 | 0.285 |
| Phase II (ŷ = ŷ_mqm) | 0.331 | | 0.410 | | 0.395 | | 0.379 | |
| Post Phase I with multi-task only objective: λ set as described in Section 4.4 | | | | | | | | |
| Phase II (ŷ = ŷ_mqm) | 0.330 | 0.284 | 0.359 | 0.328 | 0.413 | 0.212 | 0.367 | 0.275 |
| Phase II (ŷ = ŷ_sl) | 0.368 | | 0.396 | | 0.420 | | 0.395 | |
| Phase III (ŷ = ŷ_mqm; xcomet) | 0.374 | 0.237 | 0.389 | 0.290 | 0.445 | 0.281 | 0.402 | 0.269 |
| Phase III (ŷ = ŷ_sl; xcomet) | 0.399 | | 0.597 | | 0.448 | | 0.421 | |

Impact of the Weights on the Sentence-level Scores from Equation (7).

We studied the impact of aggregating different sentence-level scores by varying the weights w in Equation 7. Besides the individual scores for each evaluation mode (src, ref, and src+ref), we present three aggregations: (i) ŷ = ŷ_sl, used in the final model; (ii) ŷ = ŷ_reg, with uniform weights across the three individual scores and without the inferred MQM score from the error spans (w = [1/3, 1/3, 1/3, 0]); and (iii) ŷ = ŷ_unif, with uniform weights across all scores (w = [1/4, 1/4, 1/4, 1/4]).

Table 11 reveals two interesting findings: (i) aggregating scores does not always outperform individual scores (e.g., ŷ_reg performs similarly to ŷ_src+ref), and (ii) including an inferred MQM score obtained through error span prediction boosts sentence-level performance. Notably, the improvement of our final aggregated score over ŷ_src+ref is not substantial. This suggests that, under computational constraints, one could consider computing a single ŷ_src+ref score without the need for three different forward passes and error span prediction.

Table 11: 

Segment-level Kendall-Tau (τ) (↑) for individual and aggregated sentence-level scores.

| Score | zh-en | en-de | en-ru | All |
| --- | --- | --- | --- | --- |
| Individual Scores | | | | |
| ŷ_src | 0.368 | 0.358 | 0.402 | 0.376 |
| ŷ_ref | 0.399 | 0.389 | 0.427 | 0.405 |
| ŷ_src+ref | 0.399 | 0.390 | 0.438 | 0.409 |
| ŷ_mqm | 0.374 | 0.389 | 0.445 | 0.402 |
| Aggregated Regression Scores: see Equation 7 | | | | |
| ŷ_reg | 0.402 | 0.380 | 0.4401 | 0.408 |
| ŷ_unif | 0.398 | 0.402 | 0.448 | 0.416 |
| ŷ_sl | 0.399 | 0.414 | 0.448 | 0.421 |

We introduced xcomet, a novel suite of metrics for machine translation evaluation that combines sentence-level prediction with fine-grained error span prediction. Through extensive experiments, we have shown that xcomet is a state-of-the-art metric at all relevant vectors of evaluation: sentence-level, system-level, and error span prediction. Notably, through xcomet’s capabilities to predict error spans, we can not only obtain useful signals for downstream prediction (either directly through error span prediction or by informing sentence-level scores) but also gain access to a lens through which we can better understand and interpret its predictions. We also stress-tested the metrics by assessing how they score localized critical errors and hallucinations: The metrics identify the vast majority of localized errors and can appropriately penalize the severity of hallucinations.

We hope xcomet can serve as a step towards more informed machine translation evaluation.

We are grateful to José Pombal, José G. C. de Souza, and Sweta Agrawal for their valuable feedback and discussions.

This work was supported by the Portuguese Recovery and Resilience Plan (PRR) through project C645008882-00000055, Center for Responsible AI, by the European Research Council (DECOLLAGE, ERC-2022-CoG 101088763), by EU’s Horizon Europe Research and Innovation Actions (UTTER, contract 101070631), and by the Fundação para a Ciência e Tecnologia (contracts UIDB/50021/2020 and UIDB/50008/2020). We also thank the HPC resources from GENCI-IDRIS (grants 2023–AD011014714, 2023–AD0110146A68R1, and AD011012377R2).

2. To the best of our knowledge, these represent the two largest open-source encoder-only models.

3. Here, for each input ∈ {src, ref, src+ref}, we define ŷ_wl^input = [ŷ_1^input, …, ŷ_n^input].

4. Here, for ease of notation, we use ŷ_src, ŷ_ref, and ŷ_src+ref to represent the sentence-level score for each input type.

5. Here, we exclude the 2022 News domain annotations, which we reserved for testing.

6. We measure cross-lingual similarity using sentence embeddings obtained with the LaBSE encoder (Feng et al., 2022).

7. This is particularly relevant for DA annotations, since these judgments typically come from non-expert annotators.

8. This was initially introduced in Bleurt-20 (Pu et al., 2021).

9. The λ weights for Phases II and III were λ = 0.983 and λ = 0.055, respectively.

10. We convert all critical errors into major errors, in order to match the guidelines described in Freitag et al. (2021a) that were used for annotating the zh-en and de-en test sets.

11. The ten tasks consist of the System-level Pairwise Accuracy pooled across the language pairs, as well as segment-level pairwise ranking accuracy with tie calibration (Deutsch et al., 2023a), and system- and segment-level Pearson correlation for each of the individual language pairs.

12. Specifically, we employ the metricx_xxl_MQM_2020 submission scores from the mt-metrics-eval package. Although the metric has not been released publicly, it is publicly known that it is built upon mT5-XXL (Xue et al., 2021) and has 13B parameters (Deutsch et al., 2023b).

13. For all baselines, we report the official numbers from the WMT23 Metrics Shared Task (Freitag et al., 2023).

14. We use the models from the OpenAI API (gpt-3.5-turbo and gpt-4), accessed in October 2023.

16. This allows us to isolate the effect of the perturbations: any error spans predicted for the transformed translations are a result of the induced perturbation.

17. For sentence-level-only models, we present the sentence-level score ŷ = ŷ_reg corresponding to setting uniform weights across all three individual scores (src, ref, and src+ref).

Duarte
Alves
,
Ricardo
Rei
,
Ana C.
Farinha
,
José G. C.
de Souza
, and
André F. T.
Martins
.
2022
.
Robust MT evaluation with sentence-level multilingual augmentation
. In
Proceedings of the Seventh Conference on Machine Translation (WMT)
, pages
469
478
,
Abu Dhabi, United Arab Emirates (Hybrid)
.
Association for Computational Linguistics
.
Chantal
Amrhein
and
Rico
Sennrich
.
2022
.
Identifying weaknesses in machine translation metrics through minimum Bayes risk decoding: A case study for COMET
. In
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
, pages
1125
1141
,
Online only
.
Association for Computational Linguistics
.
Keqin
Bao
,
Yu
Wan
,
Dayiheng
Liu
,
Baosong
Yang
,
Wenqiang
Lei
,
Xiangnan
He
,
Derek F.
Wong
, and
Jun
Xie
.
2023
.
Towards fine-grained information: Identifying the type and location of translation errors
.
Ondřej
Bojar
,
Rajen
Chatterjee
,
Christian
Federmann
,
Yvette
Graham
,
Barry
Haddow
,
Shujian
Huang
,
Matthias
Huck
,
Philipp
Koehn
,
Qun
Liu
,
Varvara
Logacheva
,
Christof
Monz
,
Matteo
Negri
,
Matt
Post
,
Raphael
Rubino
,
Lucia
Specia
, and
Marco
Turchi
.
2017
.
Findings of the 2017 conference on machine translation (WMT17)
. In
Proceedings of the Second Conference on Machine Translation
, pages
169
214
,
Copenhagen, Denmark
.
Association for Computational Linguistics
.
Daniel
Deutsch
,
Rotem
Dror
, and
Dan
Roth
.
2021
.
A statistical analysis of summarization evaluation metrics using resampling methods
.
Transactions of the Association for Computational Linguistics
,
9
:
1132
1146
.
Daniel
Deutsch
,
George
Foster
, and
Markus
Freitag
.
2023a
.
Ties matter: Meta-evaluating modern metrics with pairwise accuracy and tie calibration
. In
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
, pages
12914
12929
,
Singapore
.
Association for Computational Linguistics
.
Daniel
Deutsch
,
Juraj
Juraska
,
Mara
Finkelstein
, and
Markus
Freitag
.
2023b
.
Training and meta-evaluating machine translation evaluation metrics at the paragraph level
. In
Proceedings of the Eighth Conference on Machine Translation
, pages
994
1011
,
Singapore
.
Association for Computational Linguistics
.
Fangxiaoyu
Feng
,
Yinfei
Yang
,
Daniel
Cer
,
Naveen
Arivazhagan
, and
Wei
Wang
.
2022
.
Language-agnostic BERT sentence embedding
. In
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
, pages
878
891
,
Dublin, Ireland
.
Association for Computational Linguistics
.
Patrick
Fernandes
,
Daniel
Deutsch
,
Mara
Finkelstein
,
Parker
Riley
,
André F. T.
Martins
,
Graham
Neubig
,
Ankush
Garg
,
Jonathan H.
Clark
,
Markus
Freitag
, and
Orhan
Firat
.
2023
.
The devil is in the errors: Leveraging large language models for fine-grained machine translation evaluation
. In
Proceedings of the Eighth Conference on Machine Translation
, pages
1064
1081
,
Singapore
.
Association for Computational Linguistics
.
Patrick
Fernandes
,
António
Farinhas
,
Ricardo
Rei
,
José G. C.
de Souza
,
Perez
Ogayo
,
Graham
Neubig
, and
Andre
Martins
.
2022
.
Quality-aware decoding for neural machine translation
. In
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
, pages
1396
1412
,
Seattle, United States
.
Association for Computational Linguistics
.
Marina
Fomicheva
,
Shuo
Sun
,
Erick
Fonseca
,
Chrysoula
Zerva
,
Frédéric
Blain
,
Vishrav
Chaudhary
,
Francisco
Guzmán
,
Nina
Lopatina
,
Lucia
Specia
, and
André F. T.
Martins
.
2022
.
MLQE-PE: A multilingual quality estimation and post-editing dataset
. In
Proceedings of the Thirteenth Language Resources and Evaluation Conference
, pages
4963
4974
,
Marseille, France
.
European Language Resources Association
.
Erick
Fonseca
,
Lisa
Yankovskaya
,
André F. T.
Martins
,
Mark
Fishel
, and
Christian
Federmann
.
2019
.
Findings of the WMT 2019 shared tasks on quality estimation
. In
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)
, pages
1
10
,
Florence, Italy
.
Association for Computational Linguistics
.
Markus
Freitag
,
George
Foster
,
David
Grangier
,
Viresh
Ratnakar
,
Qijun
Tan
, and
Wolfgang
Macherey
.
2021a
.
Experts, errors, and context: A large-scale study of human evaluation for machine translation
.
Transactions of the Association for Computational Linguistics
,
9
:
1460
1474
.
Markus
Freitag
,
Nitika
Mathur
,
Chi-kiu
Lo
,
Eleftherios
Avramidis
,
Ricardo
Rei
,
Brian
Thompson
,
Tom
Kocmi
,
Frederic
Blain
,
Daniel
Deutsch
,
Craig
Stewart
,
Chrysoula
Zerva
,
Sheila
Castilho
,
Alon
Lavie
, and
George
Foster
.
2023
.
Results of WMT23 metrics shared task: Metrics might be guilty but references are not innocent
. In
Proceedings of the Eighth Conference on Machine Translation
, pages
578
628
,
Singapore
.
Association for Computational Linguistics
.
Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, Eleftherios Avramidis, Tom Kocmi, George Foster, Alon Lavie, and André F. T. Martins. 2022. Results of WMT22 metrics shared task: Stop using BLEU – neural metrics are better and more robust. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 46–68, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, George Foster, Alon Lavie, and Ondřej Bojar. 2021b. Results of the WMT21 metrics shared task: Evaluating metrics with expert-based human evaluations on TED and news domain. In Proceedings of the Sixth Conference on Machine Translation, pages 733–774, Online. Association for Computational Linguistics.
Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, and Alexis Conneau. 2021. Larger-scale transformers for multilingual masked language modeling. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), pages 29–33, Online. Association for Computational Linguistics.
Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2013. Continuous measurement scales in human evaluation of machine translation. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 33–41, Sofia, Bulgaria. Association for Computational Linguistics.
Nuno M. Guerreiro, Elena Voita, and André Martins. 2023. Looking for a needle in a haystack: A comprehensive study of hallucinations in neural machine translation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1059–1075, Dubrovnik, Croatia. Association for Computational Linguistics.
Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, and Nando de Freitas. 2023. Reinforced self-training (ReST) for language modeling.
Juraj Juraska, Mara Finkelstein, Daniel Deutsch, Aditya Siddhant, Mehdi Mirzazadeh, and Markus Freitag. 2023. MetricX-23: The Google submission to the WMT 2023 metrics shared task. In Proceedings of the Eighth Conference on Machine Translation, pages 756–767, Singapore. Association for Computational Linguistics.
Marzena Karpinska, Nishant Raj, Katherine Thai, Yixiao Song, Ankita Gupta, and Mohit Iyyer. 2022. DEMETR: Diagnosing evaluation metrics for translation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9540–9561, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Tom Kocmi and Christian Federmann. 2023a. GEMBA-MQM: Detecting translation quality error spans with GPT-4. In Proceedings of the Eighth Conference on Machine Translation, pages 768–775, Singapore. Association for Computational Linguistics.
Tom Kocmi and Christian Federmann. 2023b. Large language models are state-of-the-art evaluators of translation quality. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pages 193–203, Tampere, Finland. European Association for Machine Translation.
Tom Kocmi, Christian Federmann, Roman Grundkiewicz, Marcin Junczys-Dowmunt, Hitokazu Matsushita, and Arul Menezes. 2021. To ship or not to ship: An extensive evaluation of automatic metrics for machine translation. In Proceedings of the Sixth Conference on Machine Translation, pages 478–494, Online. Association for Computational Linguistics.
Christoph Leiter, Piyawat Lertvittayakumjorn, M. Fomicheva, Wei Zhao, Yang Gao, and Steffen Eger. 2023. Towards explainable evaluation metrics for machine translation. ArXiv, abs/2306.13041.
Arle Lommel, Aljoscha Burchardt, and Hans Uszkoreit. 2014. Multidimensional quality metrics (MQM): A framework for declaring and describing translation quality metrics. Tradumàtica: Tecnologies de la traducció, 0:455–463.
Benjamin Marie, Atsushi Fujita, and Raphael Rubino. 2021. Scientific credibility of machine translation research: A meta-evaluation of 769 papers. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7297–7306, Online. Association for Computational Linguistics.
OpenAI. 2023. GPT-4 technical report.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
Stefano Perrella, Lorenzo Proietti, Alessandro Scirè, Niccolò Campolungo, and Roberto Navigli. 2022. MaTESe: Machine translation evaluation as a sequence tagging problem. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 569–577, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Maja Popović. 2015. chrF: Character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.
Amy Pu, Hyung Won Chung, Ankur Parikh, Sebastian Gehrmann, and Thibault Sellam. 2021. Learning compact metrics for MT. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 751–762, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Vikas Raunak, Arul Menezes, and Marcin Junczys-Dowmunt. 2021. The curious case of hallucinations in neural machine translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1172–1183, Online. Association for Computational Linguistics.
Vikas Raunak, Matt Post, and Arul Menezes. 2022. SALTED: A framework for SAlient long-tail translation error detection. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5163–5179, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Ricardo Rei, José G. C. de Souza, Duarte Alves, Chrysoula Zerva, Ana C. Farinha, Taisiya Glushkova, Alon Lavie, Luisa Coheur, and André F. T. Martins. 2022a. COMET-22: Unbabel-IST 2022 submission for the metrics shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 578–585, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Ricardo Rei, Nuno M. Guerreiro, José Pombal, Daan van Stigt, Marcos Treviso, Luisa Coheur, José G. C. de Souza, and André F. T. Martins. 2023a. Scaling up CometKiwi: Unbabel-IST 2023 submission for the quality estimation shared task. In Proceedings of the Eighth Conference on Machine Translation, pages 839–846, Singapore. Association for Computational Linguistics.
Ricardo Rei, Nuno M. Guerreiro, Marcos Treviso, Luisa Coheur, Alon Lavie, and André Martins. 2023b. The inside story: Towards better understanding of machine translation neural evaluation metrics. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1089–1105, Toronto, Canada. Association for Computational Linguistics.
Ricardo Rei, Craig Stewart, Ana C. Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.
Ricardo Rei, Marcos Treviso, Nuno M. Guerreiro, Chrysoula Zerva, Ana C. Farinha, Christine Maroti, José G. C. de Souza, Taisiya Glushkova, Duarte Alves, Luisa Coheur, Alon Lavie, and André F. T. Martins. 2022b. CometKiwi: IST-Unbabel 2022 submission for the quality estimation shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 634–645, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Ananya Sai B., Tanay Dixit, Vignesh Nagarajan, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra, and Raj Dabre. 2023. IndicMT eval: A dataset to meta-evaluate machine translation metrics for Indian languages. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14210–14228, Toronto, Canada. Association for Computational Linguistics.
Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.
Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of Association for Machine Translation in the Americas, pages 223–231.
Lucia Specia, Dhwaj Raj, and Marco Turchi. 2010. Machine translation evaluation versus quality estimation. Machine Translation, 24:39–50.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models.
Dazhen Wan, Zheng Zhang, Qi Zhu, Lizi Liao, and Minlie Huang. 2022a. A unified dialogue user simulator for few-shot data augmentation. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 3788–3799, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Yu Wan, Dayiheng Liu, Baosong Yang, Haibo Zhang, Boxing Chen, Derek Wong, and Lidia Chao. 2022b. UniTE: Unified translation evaluation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8117–8127, Dublin, Ireland. Association for Computational Linguistics.
Wenda Xu, Danqing Wang, Liangming Pan, Zhenqiao Song, Markus Freitag, William Yang Wang, and Lei Li. 2023. InstructScore: Towards explainable text generation evaluation with automatic feedback.
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.
Chrysoula Zerva, Frédéric Blain, Ricardo Rei, Piyawat Lertvittayakumjorn, José G. C. de Souza, Steffen Eger, Diptesh Kanojia, Duarte Alves, Constantin Orăsan, Marina Fomicheva, André F. T. Martins, and Lucia Specia. 2022. Findings of the WMT 2022 shared task on quality estimation. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 69–99, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.

Author notes

Equal contribution.

Action Editor: Colin Cherry

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.