Abstract
Widely used learned metrics for machine translation evaluation, such as Comet and Bleurt, estimate the quality of a translation hypothesis by providing a single sentence-level score. As such, they offer little insight into translation errors (e.g., what the errors are and how severe they are). On the other hand, generative large language models (LLMs) are amplifying the adoption of more granular strategies for evaluation, attempting to detail and categorize translation errors. In this work, we introduce xcomet, an open-source learned metric designed to bridge the gap between these approaches. xcomet integrates both sentence-level evaluation and error span detection capabilities, exhibiting state-of-the-art performance across all types of evaluation (sentence-level, system-level, and error span detection). Moreover, it does so while highlighting and categorizing error spans, thereby enriching the quality assessment. We also provide a robustness analysis with stress tests, and show that xcomet is largely capable of identifying localized critical errors and hallucinations.
1 Introduction
Automatic metrics for machine translation evaluation are widely used by researchers and practitioners to evaluate the quality of translations and the systems generating them. Notably, learned neural metrics, such as Comet (Rei et al., 2020) and Bleurt (Sellam et al., 2020), have demonstrated significant improvements in terms of correlation with human judgments when compared to traditional metrics like bleu (Papineni et al., 2002; Freitag et al., 2021b, 2022).
These metrics are trained to regress on scores obtained through human annotations, predicting a single sentence-level score that represents the quality of the translation hypothesis. However, these single scores do not offer a detailed view into translation errors (e.g., it is not immediately clear which words or spans of words are wrongly translated). Moreover, as they are obtained with highly complex pre-trained models, they can be difficult to interpret (Rei et al., 2023b; Leiter et al., 2023). One appealing strategy to bring a more detailed view into translation errors is to obtain finer-grained information on error spans by highlighting them and indicating their severity (Fonseca et al., 2019; Perrella et al., 2022; Bao et al., 2023). In fact, this is the strategy adopted in recent work that employs generative large language models (LLMs) for machine translation evaluation: (i) identify errors within a given translation, subsequently (ii) categorize these errors according to their severity, and finally (iii) infer a sentence-level score from the predicted errors (Fernandes et al., 2023; Xu et al., 2023). However, these methods still lag behind dedicated learned metrics when using open LLMs, such as the LLaMA models (Touvron et al., 2023; Xu et al., 2023). As it stands, competitive performance with generative strategies remains contingent on large proprietary LLMs such as PaLM-2 and GPT-4 (Fernandes et al., 2023; Kocmi and Federmann, 2023a).
In this work, we bridge the gap between these two approaches to machine translation evaluation by introducing xcomet: a learned metric that simultaneously performs sentence-level evaluation and error span detection. Through extensive experiments, we show that our metrics leverage the strengths of both paradigms: They achieve state-of-the-art performance in all relevant vectors of evaluation (sentence-level, system-level, and error span prediction), while offering, via the predicted error spans, a lens through which we can analyze translation errors and better interpret the sentence-level scores. We achieve this by employing a curriculum during training that is focused on leveraging high-quality publicly available data at both the sentence- and error span level, complemented by synthetic data constructed to enhance the metric’s robustness. Moreover, xcomet is a unified metric (Wan et al., 2022b), supporting all modes of evaluation within a single model. This enables the metric to be used for quality estimation (when no reference is available), or for reference-only evaluation, similarly to Bleurt (when a source is not provided). Crucially, xcomet also provides sentence-level scores that are directly inferred from the predicted error spans, in the style of AutoMQM (Fernandes et al., 2023) and InstructScore (Xu et al., 2023).
Our contributions can be summarized as follows:
We introduce xcomet, a novel evaluation metric that leverages the advantages of regression-based metrics and error span detection to offer a more detailed view of translation errors.
We show that xcomet is a state-of-the-art metric at all relevant vectors of evaluation (sentence-level, system-level, and error span prediction), generally outperforming widely used neural metrics and generative LLM-based machine translation evaluation.
We provide a comprehensive robustness analysis of xcomet, showing that this new suite of metrics identifies the vast majority of localized critical errors and hallucinations.
We release two evaluation models: xcomet-xl, with 3.5B parameters, and xcomet-xxl, featuring 10.7B parameters.1
2 Background
Methodologies for Human Assessment of Translation Quality.
Human evaluation of machine translation is primarily conducted through three distinct approaches: post-edits (PE), direct assessments (DA), and the Multidimensional Quality Metrics (MQM) framework.
In post-edits (PE), professional translators are tasked with “fixing” a given translation, making minimal edits to improve its quality. Using this edited translation—often termed post-edit—we can evaluate the machine translation output by quantifying the number of edits, thus gauging the initial translation’s quality (Snover et al., 2006).
Direct assessments (DA) (Graham et al., 2013) are a simple and widely-used evaluation method. Annotators—non-expert bilingual speakers or professional translators—are asked to annotate each translation with a score ranging from 0 to 100 to reflect its adequacy and fluency, where a score of 100 corresponds to a perfect translation, and 0 corresponds to a completely inadequate one.
The Multidimensional Quality Metrics (MQM) framework (Lommel et al., 2014), on the other hand, offers a more comprehensive and systematic approach to MT evaluation. Professional translators highlight errors—typically in the form of error spans—within translations, attributing them severity ratings (e.g., minor, major, or critical) and categorical labels (e.g., fluency, accuracy). Figure 1 illustrates one such annotation. MQM annotations have gained prominence in recent years due to their capacity to offer detailed insights into translation errors, facilitating more fine-grained and accurate comparisons between translation systems (Freitag et al., 2021a). As such, the field of Automatic Evaluation of MT has increasingly favored comparisons using MQM annotations over traditional DA and PE methodologies (Freitag et al., 2021b, 2022; Zerva et al., 2022).
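For concreteness, the sketch below shows how a set of MQM severity annotations can be turned into a single numeric penalty. The severity weights (1 for minor, 5 for major, 10 for critical) follow common MQM practice but are assumptions made purely for illustration; they are not necessarily the weights used elsewhere in this paper.

```python
from collections import Counter

# Assumed severity weights, in line with common MQM practice; the exact
# weights used to derive scores in this paper may differ.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def mqm_penalty(error_severities):
    """Sum severity weights over all annotated error spans."""
    counts = Counter(error_severities)
    return sum(SEVERITY_WEIGHTS[sev] * n for sev, n in counts.items())

# Example: one minor fluency error and one major accuracy error.
print(mqm_penalty(["minor", "major"]))  # -> 6
```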
Automatic Metrics for Translation Evaluation.
Conventional automatic metrics for machine translation (MT) evaluation rely on lexical-based approaches, where the evaluation score is computed through statistics related to lexical overlap between a machine translation and a reference translation. Despite evidence indicating that these lexical metrics (e.g., Bleu [Papineni et al., 2002] and chrF [Popović, 2015]) do not consistently align with human judgments, particularly when these are obtained through the MQM framework (Freitag et al., 2021b, 2022), they remain very popular. In fact, Bleu remains the most widely employed evaluation metric in machine translation to this day (Marie et al., 2021). On the other hand, neural metrics (e.g., Comet [Rei et al., 2020] and Bleurt [Sellam et al., 2020]) that rely on complex neural networks to estimate the quality of MT outputs are consistently among the best metrics for MT evaluation according to correlations with human judgments (Freitag et al., 2021b, 2022).
However, contrary to lexical metrics, which offer a straightforward interpretation, it can often prove challenging to explain the score that a neural metric assigns to a given translation output. As such, there have been a series of efforts to bring interpretability to neural metrics, either by focusing on understanding their inner workings (Rei et al., 2023b; Leiter et al., 2023), or by constructing inherently interpretable metrics (e.g., MaTESe [Perrella et al., 2022] and Fg-ted [Bao et al., 2023]) that assign a central role to the task of predicting word-level errors in a given translation, instead of just a sentence-level score.
More recently, with the rise of generative LLMs, some studies have tried to frame the MT evaluation problem as a generative problem. This offers great flexibility, as the LLM can be prompted to either score the translation directly (Kocmi and Federmann, 2023b), or to identify errors in the translation (e.g., in line with the MQM framework) (Fernandes et al., 2023; Xu et al., 2023).
3 Problem Statement
An automatic metric for translation evaluation aims at predicting the quality of a translated sentence, t, in light of a reference translation, r, for a given source sentence, s. Here, we focus specifically on neural metrics that make use of a neural model, and typically operate under one of the following evaluation scenarios:
reference-only (ref): The model evaluates the translation by processing it alongside a ground-truth reference sentence (Bleurt is an example of such a metric);
source-reference combined input (src+ref): The model evaluates the translation by jointly processing it with both the source and the reference (Comet is an example of such a metric);
source-only (src): The model evaluates the translation using only its corresponding source sequence (CometKiwi (Rei et al., 2022b) is an example of such a model). This mode is commonly termed quality estimation (QE) or reference-free evaluation (Specia et al., 2010).
In essence, the model’s input sequence consists of the translation t paired with some additional input—either r, [r, s], or s—derived from the scenarios above. Given this input, the model may predict the quality of the translation at different granularities, e.g., sentence-level or word(span)-level.
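The sketch below illustrates how these three input configurations can be assembled from a translation, a source, and a reference. The separator token and concatenation order are illustrative assumptions, not the exact preprocessing used by any particular metric.

```python
from typing import Dict, Optional

def build_inputs(translation: str,
                 source: Optional[str] = None,
                 reference: Optional[str] = None,
                 sep: str = " </s> ") -> Dict[str, str]:
    """Build one input string per available evaluation scenario.

    The separator and concatenation order are illustrative assumptions;
    an actual metric relies on its own tokenizer conventions.
    """
    inputs = {}
    if reference is not None:                          # ref: translation + reference
        inputs["ref"] = translation + sep + reference
    if source is not None:                             # src: translation + source (QE mode)
        inputs["src"] = translation + sep + source
    if source is not None and reference is not None:   # src+ref: all three segments
        inputs["src+ref"] = translation + sep + reference + sep + source
    return inputs

print(build_inputs("Das ist ein Test.",
                   source="This is a test.",
                   reference="Dies ist ein Test."))
```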
Sentence-level Prediction.
The model is tasked to predict a single global score (typically between 0 and 1) for the translation that represents how well it aligns with its context (i.e., source and/or reference sentence). These scores can be used for a broad range of tasks, such as gauging the quality of different translation systems (Freitag et al., 2022), identifying pathological translations (Guerreiro et al., 2023), assisting the generation of translations by MT systems (Fernandes et al., 2022), or even acting as reward models for human alignment of language models (Gulcehre et al., 2023).
Word(span)-level Prediction.
In contrast, word-level (or span-level) predictions are more fine-grained, identifying individual words or phrases in the translation that may have errors or discrepancies—typically identifying them as ok/bad or according to their severity, e.g., minor/major. These granular evaluations are more interpretable and assist in pinpointing specific issues, which can be particularly valuable for feedback and iterative translation improvements.
Our metric, xcomet, emerges in a unique position in the landscape of MT evaluation metrics. It can simultaneously perform evaluation under all three of the scenarios (src, ref, src+ref) presented, and provide sentence-level scores and error span annotations that are in line with the MQM framework, thus bringing further transparency to the evaluation (see Figure 1 for an illustration). In the next section, we detail the design choices and methodology of xcomet.
4 Design and Methodology of xcomet
In this section, we describe the methodology behind xcomet, outlining its model architecture, training settings and corpora, and learning curriculum. We detail how the model is designed to perform both regression and error span detection while adopting a unified input approach for enhanced flexibility and performance.
4.1 Model Architecture
xcomet is built upon insights from contributions to the WMT22 Metrics and QE shared tasks (Rei et al., 2022a, b). It is designed to concurrently handle two tasks: sentence-level regression and error span detection. Figure 2 illustrates its architecture. We follow the architecture of the scaled-up version of CometKiwi detailed in Rei et al. (2023a), which uses a large pre-trained encoder model as its backbone. Importantly, in line with our multi-task setup, the model has two prediction heads: (i) a sentence-level regression head, which employs a feed-forward network to generate a sentence score, and (ii) a word-level sequence tagger, which applies a linear layer to assign labels to each translation token.
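The following PyTorch sketch illustrates this dual-head design. The backbone name, pooling strategy, head dimensions, and the four-way word-level label set are illustrative assumptions; the actual xcomet models follow the scaled-up CometKiwi architecture referenced above.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class DualHeadMetric(nn.Module):
    """Encoder backbone with a sentence-level regression head and a
    token-level tagging head, as a sketch of the two-task setup."""

    def __init__(self, encoder_name: str = "xlm-roberta-large",
                 num_word_labels: int = 4):  # e.g., ok/minor/major/critical (assumed)
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # (i) sentence-level regression head: feed-forward on a pooled representation.
        self.regression_head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )
        # (ii) word-level tagger: linear layer over each token representation.
        self.word_head = nn.Linear(hidden, num_word_labels)

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        pooled = states[:, 0]  # CLS-style pooling (an assumption)
        sentence_score = torch.sigmoid(self.regression_head(pooled)).squeeze(-1)
        word_logits = self.word_head(states)  # one label distribution per token
        return sentence_score, word_logits
```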
4.2 Fully Unified Evaluation
xcomet adopts a unified input approach (Wan et al., 2022b), allowing for all the evaluation scenarios described in Section 3—ref, src+ref, and src evaluation—under a single model. Thus, the input sequence consists of two parts: (i) the translated sentence t = [t1,…, tn] of length n, and (ii) an additional input containing information from the source, reference, or both. To do so, when a reference is available, we run three distinct forward passes (one for each evaluation scenario), each yielding sentence-level and word-level predictions.
4.2.1 Training Time
At each training step, we collect the sentence-level predictions and the word-level logits produced for each input format (src, ref, and src+ref).3
Furthermore, in line with preceding metrics built on the Comet framework, our models employ training techniques such as gradual unfreezing and discriminative learning rates.
4.2.2 Inference Time
Error Span Prediction.
For each subword in the translation, we average the output distribution of the word-level linear layer obtained for each forward pass. Using this distribution, we predict a set of word-level tags by taking the most likely class for each token. From these tags, we construct a list of error spans, S, by grouping adjacent subwords identified as errors. The severity of each span in S is defined according to the most severe error tag found within the span.
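The span-extraction procedure described above can be sketched as follows; the label inventory and its ordering by severity are assumptions made for illustration.

```python
import numpy as np

# Assumed label inventory, ordered by severity.
LABELS = ["ok", "minor", "major", "critical"]

def extract_error_spans(per_pass_probs):
    """Average word-level distributions across forward passes, tag each
    token with its most likely class, and group adjacent error tokens
    into spans whose severity is the worst tag they contain."""
    probs = np.mean(per_pass_probs, axis=0)      # (num_tokens, num_labels)
    tags = probs.argmax(axis=-1)                 # most likely class per token
    spans, current = [], None
    for i, tag in enumerate(tags):
        if tag == 0:                             # an "ok" token closes any open span
            if current is not None:
                spans.append(current)
                current = None
        else:
            if current is None:
                current = {"start": i, "end": i, "severity": int(tag)}
            else:
                current["end"] = i
                current["severity"] = max(current["severity"], int(tag))
    if current is not None:
        spans.append(current)
    for span in spans:
        span["severity"] = LABELS[span["severity"]]
    return spans

# Toy example with two passes over five tokens (one-hot tag distributions).
p = np.eye(4)[[0, 1, 2, 0, 3]]
print(extract_error_spans([p, p]))
# -> [{'start': 1, 'end': 2, 'severity': 'major'}, {'start': 4, 'end': 4, 'severity': 'critical'}]
```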
Sentence-level Prediction.
The final sentence-level score is obtained by aggregating the regression scores from the three forward passes with the MQM score inferred from the predicted error spans, using a weighted combination (Equation 7); the impact of these weights is analyzed in Section 8.
4.3 Corpora
Our models are exclusively trained on publicly available DA and MQM annotations, most of which have been collected by WMT over the recent years.
DA Data.
We use DA annotations collected by WMT from 2017 to 2020, as well as the MLQE-PE dataset (Fomicheva et al., 2022). As the MLQE-PE dataset does not contain reference translations, we use its post-edited translations as references. Overall, the corpus consists of around 1 million samples spanning 36 language pairs.
MQM Data.
We collected the MQM annotations from WMT from 2020 to 2022.5 We also used annotations sourced from other MQM-annotated datasets: (i) IndicMT (Sai B et al., 2023), which contains MQM annotations spanning 5 Indian languages, and (ii) DEMETR (Karpinska et al., 2022), a diagnostic dataset with perturbations spanning semantic, syntactic, and morphological errors.
Corpora with MQM annotations are usually extremely unbalanced, with critical errors being underrepresented (see the statistics for WMT in Table 1). As a result, metrics may struggle to effectively detect translations with critical errors and hallucinations (Amrhein and Sennrich, 2022; Raunak et al., 2022; Guerreiro et al., 2023). As such, we augment the MQM corpus with hallucinations from the MLQE-PE corpus and synthetic critical errors. We create detached and oscillatory hallucinations (Raunak et al., 2021; Guerreiro et al., 2023): (i) detached hallucinations, where we replace the translation either with a random sentence or with an unrelated sentence that is semantically similar to the source;6 and (ii) oscillatory hallucinations, where we randomly sample an n-gram from the translation (with n in {2, 3, 4}) and repeat it between 1 and 10 times. We set the sentence-level scores of these hallucinations to 0. Overall, our MQM corpus consists of 194K samples across 14 language pairs.
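As an illustration of the synthetic data construction, the sketch below implements the oscillatory-hallucination recipe described above (sample an n-gram with n in {2, 3, 4} and repeat it between 1 and 10 times) and the simpler random-replacement variant of detached hallucinations; whitespace tokenization and the omission of the semantically similar variant are simplifications.

```python
import random

def oscillatory_hallucination(translation: str, rng: random.Random) -> str:
    """Repeat a randomly sampled n-gram (n in {2, 3, 4}) 1-10 times."""
    tokens = translation.split()
    n = rng.choice([2, 3, 4])
    if len(tokens) < n:
        return translation
    start = rng.randrange(len(tokens) - n + 1)
    ngram = tokens[start:start + n]
    repeats = rng.randint(1, 10)
    corrupted = tokens[:start + n] + ngram * repeats + tokens[start + n:]
    return " ".join(corrupted)

def detached_hallucination(sentence_pool, rng: random.Random) -> str:
    """Replace the translation with a random unrelated sentence
    (the semantically similar variant from the paper is omitted here)."""
    return rng.choice(sentence_pool)

rng = random.Random(0)
print(oscillatory_hallucination("the cat sat on the mat today", rng))
```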
Dataset | No. Samples | Error Statistics
---|---|---
WMT Data | 147K (76%) | 63%; [57, 42, 1]
IndicMT | 7K (4%) | 80%; [19, 52, 29]
DEMETR | 22K (11%) | 47%; [38, 19, 43]
MLQE-PE Hall. | 1.7K (1%) | All set to CRIT
Synthetic Hall. | 16K (8%) | All set to CRIT
Scaling of Sentence-level Scores.
While the sentence-level scores inferred from MQM annotations (through the procedure in Equation 6) are bounded between 0 and 1, DA annotations usually require z-normalization in order to mitigate variations in scoring strategies by different annotators (Bojar et al., 2017).7 Thus, as z-scores are inherently centered at 0 and unbounded, there is a scaling mismatch between the data samples.
To circumvent this limitation, we apply min-max scaling to our DA corpus to map its scores to [0, 1]. To do so, we set a practical minimum and maximum z-score. We obtain the minimum by averaging the z-scores of translations with more than one annotation for which all annotators unanimously assigned an unnormalized DA score of 0, i.e., deemed the translation “random”. To determine the maximum, we apply the same process to perfect translations, i.e., those with an unnormalized DA score of 100.8
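A minimal sketch of this scaling procedure is shown below, assuming the annotations are available as per-translation lists of z-scores and raw DA scores; the clipping of out-of-range values is an additional assumption.

```python
import numpy as np

def practical_bounds(records):
    """Estimate practical min/max z-scores from translations whose raw DA
    scores were unanimously 0 ('random') or 100 ('perfect') across
    multiple annotators. `records` is a list of (z_scores, raw_scores)
    tuples, one per translation (an assumed data layout)."""
    zeros, hundreds = [], []
    for z_scores, raw_scores in records:
        if len(raw_scores) > 1 and all(r == 0 for r in raw_scores):
            zeros.extend(z_scores)
        if len(raw_scores) > 1 and all(r == 100 for r in raw_scores):
            hundreds.extend(z_scores)
    return float(np.mean(zeros)), float(np.mean(hundreds))

def min_max_scale(z, z_min, z_max):
    """Map a z-normalized DA score to [0, 1], clipping out-of-range values."""
    return float(np.clip((z - z_min) / (z_max - z_min), 0.0, 1.0))

# Illustrative bounds only.
z_min, z_max = -2.0, 1.5
print(min_max_scale(0.3, z_min, z_max))  # ~0.657
```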
4.4 Training Curriculum
xcomet models are trained with a three-phase curriculum. Throughout these phases, the training emphasis alternates between sentence-level prediction and error span prediction by adjusting the parameter λ in Equation 3 (a sketch of one plausible form of this combined objective is shown after the phase list below). The curriculum phases can be described as follows:
- Phase I:
The model is trained exclusively using the DA data. In this phase, the focus is exclusively set on sentence-level regression.
- Phase II:
In this stage, we introduce word-level supervision. To achieve this, the model is fine-tuned on our diverse MQM corpus, with most emphasis placed on the word-level task.
- Phase III:
The last training phase is aimed at unifying both tasks. The model is further fine-tuned using high-quality MQM data from Freitag et al. (2021a), with greater emphasis placed on sentence-level prediction.9
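The ablations in Section 8 describe λ = 0 as a sentence-level-only objective and λ = 1 as a word-level-only objective, which is consistent with a convex combination of the two task losses. One plausible form of the combined objective in Equation 3, written here only as an assumed sketch, is:

```latex
\mathcal{L}(\theta) \;=\; (1 - \lambda)\,\mathcal{L}_{\text{sent}}(\theta) \;+\; \lambda\,\mathcal{L}_{\text{word}}(\theta),
\qquad \lambda \in [0, 1].
```

Under this form, Phase I corresponds to λ = 0, Phase II places most weight on the word-level loss, and Phase III shifts most weight back to sentence-level prediction, in line with the reported values of λ = 0.983 and λ = 0.055 for Phases II and III.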
Interpretation of the Curriculum.
We start by training a sentence-level metric, similar to UniTE (Wan et al., 2022a), on the vastly available DA annotations. Phase I acts as a warm-up for subsequent stages. In fact, prior research has shown that models trained on DA annotations leverage token-level information that aligns with MQM error annotations (Rei et al., 2023b). Moving to Phase II, we assume we have a metric that can perform sentence-level regression; the aim here shifts to integrating word-level supervision without compromising the previously acquired sentence-level prediction skills. To do so, we use the highly diverse corpora of MQM annotations and place most emphasis on the word-level task. Finally, in Phase III, we exclusively leverage a small corpus (around 25k samples) of very high-quality MQM annotations from Freitag et al. (2021a), in which each sample has three annotations from separate annotators, together with additional synthetic hallucinations. Our focus here is to mitigate any potential decline in sentence-level regression capabilities incurred during Phase II.
5 Experimental Setting
5.1 Evaluation
We evaluate the performance of our metrics using two datasets: (i) the MQM annotations from the News domain of the WMT 2022 Metrics shared task, and (ii) the WMT 2023 Metrics shared task evaluation suite. The WMT22 annotations encompass three language pairs: Chinese→English (zh-en), English→German (en-de), and English→Russian (en-ru), while the WMT23 annotations cover Chinese→English (zh-en), English→German (en-de), and Hebrew→English (he-en). We evaluate the metrics in terms of sentence-level, system-level, and error span prediction performance.
At the sentence level, we report Kendall’s Tau (τ) using the Perm-Both hypothesis test (Deutsch et al., 2021). We also evaluate the metrics on System-level Pairwise Accuracy (Kocmi et al., 2021). We base these evaluations on 200 re-sampling runs, with a significance level (p) set to 0.05. For error span prediction, we adopt the WMT23 Quality Estimation shared task evaluation methodology and compute F1 scores at the character level, taking into account partial matches for both minor and major errors.10 For WMT23, we follow the evaluation setup of the shared task (Freitag et al., 2023) and report the aggregated System-level Pairwise Accuracy pooled across all language pairs, as well as the primary metric, Average Correlation, which encompasses ten tasks spanning system- and sentence-level metrics.11
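For reference, the sketch below computes System-level Pairwise Accuracy in the spirit of Kocmi et al. (2021): the fraction of system pairs for which the metric orders the two systems in the same way as the human scores. Tie handling and the resampling procedure used in our evaluation are omitted, so this is a simplified illustration.

```python
from itertools import combinations

def pairwise_accuracy(human_scores: dict, metric_scores: dict) -> float:
    """Fraction of system pairs ranked in the same order by metric and humans.
    Both arguments map system name -> system-level score."""
    agree, total = 0, 0
    for a, b in combinations(sorted(human_scores), 2):
        human_delta = human_scores[a] - human_scores[b]
        metric_delta = metric_scores[a] - metric_scores[b]
        if human_delta == 0:        # ties are skipped in this simplified sketch
            continue
        total += 1
        agree += int(human_delta * metric_delta > 0)
    return agree / total if total else float("nan")

# Illustrative scores for three hypothetical systems.
human = {"sysA": 0.82, "sysB": 0.75, "sysC": 0.79}
metric = {"sysA": 0.71, "sysB": 0.64, "sysC": 0.69}
print(pairwise_accuracy(human, metric))  # 1.0: all three pairs ordered consistently
```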
5.2 Baselines
Sentence and System-level.
We test our metrics against widely used open neural metrics: Comet-22 (Rei et al., 2022a) and Bleurt-20 (Pu et al., 2021). Additionally, we include MetricX, the best performing metric from the WMT22 Metrics shared task (Freitag et al., 2022),12 and Gemba (Kocmi and Federmann, 2023b), which employs GPT4 (OpenAI, 2023) to evaluate translations following DA guidelines. For WMT23, we report the same metrics but update them, when needed (for MetricX-23 [Juraska et al., 2023] and Gemba-mqm [Kocmi and Federmann, 2023a]), with the versions submitted to the official competition.13
Error Span Prediction.
We report results using the GPT3.5 and GPT4 models, prompting them in the style of AutoMQM (Fernandes et al., 2023).14 We carefully select five shots that are held constant across all samples. This way, we can directly compare our results with state-of-the-art LLMs, which have been shown to be capable of error detection (Fernandes et al., 2023; Xu et al., 2023).
6 Correlations with Human Judgments
In this section, we present a standard performance analysis of our metrics in terms of correlations with human judgments. Overall, we find xcomet to be state-of-the-art in sentence-level and error span prediction, and competitive with generative LLMs in terms of system-level evaluation.
Sentence-level Evaluation.
Table 2a shows that both xcomet metrics outperform other strong neural metrics, including Gemba's generative approach leveraging GPT4. In particular, xcomet-xxl sets a new state-of-the-art for en-de and en-ru. Interestingly, while scaling up the encoder of the xcomet metrics (from xl to xxl) yields better results, xcomet-xl is very competitive: it outperforms MetricX, which is even larger than xcomet-xxl. Finally, we can also observe that the MQM scores inferred exclusively from the predicted error spans exhibit strong performance, outperforming the widely used Bleurt-20 and Comet-22. This is particularly relevant: the predicted error spans not only bring a more detailed view into translation errors but also provide high-quality sentence-level scores.
(a) Sentence-level evaluation.

Metric | zh-en | en-de | en-ru | Avg.
---|---|---|---|---
Bleurt-20 | 0.336 | 0.380 | 0.379 | 0.365
comet-22 | 0.335 | 0.369 | 0.391 | 0.361
MetricX | 0.415 | 0.405 | 0.444 | 0.421
Gemba-gpt4-da | 0.292 | 0.387 | 0.354 | 0.354
xcomet-xl | 0.399 | 0.414 | 0.448 | 0.421
xcomet-xxl | 0.390 | 0.435 | 0.470 | 0.432
MQM scores inferred from the error spans: | | | |
xcomet-xl (mqm) | 0.374 | 0.389 | 0.445 | 0.402
xcomet-xxl (mqm) | 0.332 | 0.415 | 0.439 | 0.395

(b) System-level evaluation.

Metric | zh-en | en-de | en-ru | Avg.
---|---|---|---|---
Bleurt-20 | 0.762 | 0.771 | 0.743 | 0.759
comet-22 | 0.705 | 0.800 | 0.733 | 0.746
MetricX | 0.762 | 0.781 | 0.724 | 0.756
Gemba-gpt4-da | 0.752 | 0.848 | 0.876 | 0.825
xcomet-xl | 0.800 | 0.743 | 0.790 | 0.778
xcomet-xxl | 0.800 | 0.829 | 0.829 | 0.819
MQM scores inferred from the error spans: | | | |
xcomet-xl (mqm) | 0.781 | 0.762 | 0.762 | 0.768
xcomet-xxl (mqm) | 0.781 | 0.838 | 0.810 | 0.810
System-level Evaluation.
Table 2b and Table 3 show system-level results for the WMT22 and WMT23 test sets, respectively. Similarly to what we observed at the sentence level, our metrics show consistently superior performance when compared to other dedicated neural metrics. Notably, although generative approaches typically fare much better at system-level evaluation than dedicated models (Kocmi and Federmann, 2023b; Fernandes et al., 2023), xcomet-xxl remains competitive with Gemba using GPT4 across all language pairs. Finally, building on the findings at the sentence level, Table 2b reveals that the MQM scores inferred directly and exclusively from the predicted error spans also exhibit very competitive system-level accuracy.
Metric | System-level Acc. | Avg. Corr.
---|---|---
Bleurt-20 | 0.892 | 0.776
comet-22 | 0.900 | 0.779
MetricX-23 | 0.908 | 0.808
Gemba-mqm | 0.944 | 0.802
xcomet-xl | 0.912 | 0.813
xcomet-xxl | 0.920 | 0.812
Aggregated Evaluation.
Table 3 shows aggregated results for the WMT23 Metrics shared task. Our metrics, at both scales, would win the shared task, outperforming both the newest version of MetricX and Gemba-mqm. Following the trend observed at the sentence level, xcomet-xl is competitive with xcomet-xxl despite running at a smaller scale.
Error Span Prediction.
While we have highlighted the utility of the predicted error spans through the inferred sentence-level MQM scores, here we evaluate them directly. Table 4 shows that the error spans predicted by the xcomet metrics outperform those obtained with both GPT3.5 and GPT4, despite xcomet being smaller in capacity than these models. In fact, our metrics achieve performance close to that of GPT4 even when a reference is not provided.
Interplay of Error Spans and Sentence-level Scores.
Table 5 shows a strong correlation between the different score types predicted by xcomet and the MQM score inferred exclusively from the error spans. This interplay is highly important: the predicted error spans are valuable not just for accuracy but also for interpretability. Interestingly, these high correlations with the predicted scores from each forward pass (src, ref, and src+ref) are obtained despite no explicit alignment mechanism governing the relationship between the predictions of the sentence-level and word-level heads. We hypothesize that it is the shared encoder that, during multi-task training, aligns the representations between the two tasks. As such, xcomet provides, through its predicted error spans, a potential lens through which we can better understand, contextualize, and even debug its own sentence-level predictions.
7 Robustness of xcomet to Pathological Translations
We have shown that xcomet metrics exhibit state-of-the-art correlations with human judgments when evaluated on high-quality MQM annotations. However, these MQM annotations are often highly unbalanced and contain few major or critical errors. As such, they may not offer a full picture of the metrics' performance. In this section, we shift our focus to studying how xcomet metrics behave when evaluating translations with localized major or critical errors, as well as highly pathological translations such as hallucinations.
7.1 Localized Errors
We employ SMAUG (Alves et al., 2022),15 a tool designed to generate synthetic data for stress-testing metrics, to create corrupted translations that contain major or critical errors. We generate translations with the following pathologies: addition of text, negation errors, mask in-filling, named entity errors, and errors in numbers. For this evaluation, we use data from the WMT 2023 Metrics shared task. Specifically, we corrupt the released synthetic references for which the xcomet metrics found no errors.16 Moreover, as the full suite of SMAUG transformations can only be applied to English text, we focus on Chinese→English (zh-en) and Hebrew→English (he-en) translations.
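For illustration, the sketch below implements a standalone number-swap perturbation of the kind used in this stress test; it does not use the SMAUG API, and the perturbation logic is a simplified assumption.

```python
import random
import re

def swap_number(sentence: str, rng: random.Random) -> str:
    """Replace one randomly chosen number in the sentence with a different
    random number, creating a localized critical error."""
    numbers = list(re.finditer(r"\d+", sentence))
    if not numbers:
        return sentence
    match = rng.choice(numbers)
    original = int(match.group())
    replacement = original
    while replacement == original:
        replacement = rng.randint(0, max(10, original * 2))
    return sentence[:match.start()] + str(replacement) + sentence[match.end():]

rng = random.Random(1)
print(swap_number("The company hired 250 new employees in 2021.", rng))
```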
xcomet Predicts Most Localized Errors as Major or Critical Errors.
Table 6 shows that xcomet metrics identify errors in the vast majority of the perturbed samples, with trends varying across scale and language pair. We found that the errors predicted by xcomet-xl and xcomet-xxl overlap with the artificially induced perturbations in over 90% of the perturbed samples (98% for xl and 90.9% for xxl). However, upon further analysis of xcomet's predicted error spans, we observed that the model tends to flag additional spans in the perturbed sentence as erroneous, beyond the induced perturbations. This behavior is more prominent for perturbations involving the addition of text, which are not as localized as perturbations such as swapping numbers or named entities. Furthermore, we noticed that the model has a propensity to assign the same severity to all predicted spans within a single sentence: when the metric predicts multiple error spans in a sentence, it assigns different severity levels to those spans only about 35% of the time. Improving the model's ability to differentiate severities among multiple errors within a sentence is an interesting avenue for future research and development. We show several examples of predictions of xcomet in Table 7.
Error | zh-en (xl) | zh-en (xxl) | he-en (xl) | he-en (xxl)
---|---|---|---|---
Add. of text | 3.66 | 10.7 | 6.15 | 7.35
Negation | 0.20 | 0.20 | 3.89 | 4.90
Mask in-fill | 5.01 | 17.0 | 4.78 | 3.92
Swap NUM | 3.19 | 2.88 | 0.16 | 0.00
Swap NE | 3.66 | 6.94 | 9.81 | 7.01
All | 2.24 | 10.7 | 9.81 | 7.00
We also found that negation errors and mismatches in numbers are the most easily identified by the metrics. This is interesting: localized errors, such as mismatches in numbers and named-entity errors, had been pinpointed as weaknesses of previous Comet metrics (Amrhein and Sennrich, 2022; Raunak et al., 2022). This earlier limitation now seems to have been successfully addressed. In fact, the results in Figure 4a show that most of these errors are predicted as critical errors. One plausible hypothesis for these improvements is the incorporation of datasets that contain negative translations and synthetic hallucinations into our training set.
xcomet Sentence-level Scores Are Sensitive to Localized Perturbations.
Figure 4b shows that localized errors can lead to significant decreases in the predicted sentence-level scores, with perturbation-wise trends mirroring those of the error span predictions: the most pronounced decreases are found for negation errors and mismatches in numbers and named entities (median decreases of around 20 points). The distribution of the decreases in quality also reveals two relevant trends: (i) localized perturbations can cause xcomet-xxl to shift from the score of a perfect translation to that of an unrelated translation, and (ii) the behavior of xcomet-xxl is not perfect and can be further improved: in rare cases, perturbations may actually lead to an increase in the score, although, upon closer inspection, for over 90% of such cases the increase is smaller than 1 point.
7.2 Hallucinations
Hallucinations lie at the extreme end of machine translation pathologies (Raunak et al., 2021) and can have a devastating impact when models are deployed in the wild. Yet, these translations are often overlooked when assessing the performance of translation systems. Their rarity means that performance, usually judged according to an aggregated corpus-level score, may remain largely unperturbed by a very small number of hallucinations. Here, we assess how the xcomet metrics rank hallucinations among other translations, using the German→English hallucination benchmark introduced in Guerreiro et al. (2023). This benchmark comprises over 3.4k translations, produced by an actual machine translation system, spanning different error types, including omissions, named-entity errors, and hallucinations (oscillatory, fully detached, and strongly detached). For a metric that has not been trained explicitly to rank translations, the benchmark is quite challenging: hallucinations should be ranked below other severe errors and incorrect translations.
xcomet Metrics Can Distinguish Hallucinations from Other Translations.
The results in Table 8 show that both xcomet metrics largely rank hallucinations lower than other errors. This is especially true for the most severe type of hallucination (fully detached), for which the AUROC exceeds 95 for the xxl metric. In fact, Figure 5 reveals that xcomet-xxl assigns over 90% of these fully detached hallucinations a score under 10. We show examples of error spans predicted by xcomet-xxl in Table 9. Relative to previous metrics, xcomet achieves overall improvements. Interestingly, we also find that src-based evaluation (i.e., without the use of a reference) can reap benefits in this scenario. We hypothesize that this is due to the metric over-relying on the reference when it is available (Rei et al., 2023b): while hallucinations contain content that is detached from the source, some of their text may still overlap (even if only lexically) with the reference (e.g., in strongly detached or oscillatory hallucinations), leading to higher scores.
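To illustrate how this ranking ability can be quantified, the sketch below computes an AUROC in which hallucinations form the positive class; since lower metric scores should indicate hallucinations, the scores are negated before being passed to the standard routine. The data shown is synthetic.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic example: 1 marks a hallucination, 0 marks any other translation.
labels = np.array([1, 1, 1, 0, 0, 0, 0, 0])
metric_scores = np.array([0.05, 0.12, 0.30, 0.55, 0.61, 0.70, 0.83, 0.95])

# Lower scores should indicate hallucinations, so negate before scoring.
auroc = roc_auc_score(labels, -metric_scores)
print(f"AUROC: {auroc:.3f}")  # 1.000 here, since every hallucination scores lowest
```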
8 Ablations on Design Choices
We now address relevant questions about the development of xcomet through ablations on design choices. Ablations are run with xcomet-xl on the MQM annotations from the News domain of the WMT 22 Metrics shared task (see Section 5).
Impact of the Training Curriculum.
We employed a curriculum to train xcomet (see Section 4.4) in order to balance data from different annotation strategies (i.e., DA and MQM annotations), and also to better balance the multi-task objective. Here, we want to assess how performance evolves throughout the different stages. We perform ablations on Phase II and Phase III, which correspond to the introduction of the multi-task objective and MQM training data that contain both sentence-level scores and error spans.
Table 10 shows that while a multi-task model outperforms single-task models for sentence-level evaluation,17 the same does not hold for word-level evaluation. The best word-level model is obtained by running Phase II with a word-level-only objective. Note that we can still extract sentence-level scores from such a model in two ways: (i) by leveraging the regression head trained during Phase I, which is kept intact, or (ii) by converting the error spans into a single sentence-level score. Notably, however, neither of these approaches is competitive with our final xcomet model. In fact, performing sentence-level evaluation via the error spans predicted by the final model leads to better correlations than doing so with the word-level-only model. Moreover, aggregating the different regression scores yields the best overall performance for sentence-level evaluation.
Stage | zh-en τ | zh-en F1 | en-de τ | en-de F1 | en-ru τ | en-ru F1 | Avg. τ | Avg. F1
---|---|---|---|---|---|---|---|---
Post Phase I with sentence-level-only objective: λ = 0 | | | | | | | |
Phase I | 0.377 | na | 0.356 | na | 0.425 | na | 0.386 | na
Phase II | 0.372 | na | 0.386 | na | 0.448 | na | 0.402 | na
Phase III | 0.395 | na | 0.391 | na | 0.457 | na | 0.414 | na
Post Phase I with word-level-only objective: λ = 1 | | | | | | | |
Phase II (mqm-inferred score) | 0.333 | 0.293 | 0.357 | 0.332 | 0.415 | 0.229 | 0.368 | 0.285
Phase II (regression score) | 0.331 | = | 0.410 | = | 0.395 | = | 0.379 | =
Post Phase I with multi-task objective: λ set as described in Section 4.4 | | | | | | | |
Phase II (mqm-inferred score) | 0.330 | 0.284 | 0.359 | 0.328 | 0.413 | 0.212 | 0.367 | 0.275
Phase II (regression score) | 0.368 | = | 0.396 | = | 0.420 | = | 0.395 | =
Phase III (mqm-inferred score; xcomet) | 0.374 | 0.237 | 0.389 | 0.290 | 0.445 | 0.281 | 0.402 | 0.269
Phase III (regression score; xcomet) | 0.399 | = | 0.414 | = | 0.448 | = | 0.421 | =
Impact of the Weights on the Sentence-level Scores from Equation (7).
We studied the impact of aggregating different sentence-level scores by varying the weights w in Equation 7. Besides the individual scores for each evaluation mode (src, ref, and src+ref), we present three aggregations: (i) the aggregation used in the final model, (ii) an aggregation with uniform weights across the three individual scores that ignores the MQM score inferred from the error spans (w = [1/3, 1/3, 1/3, 0]), and (iii) an aggregation with uniform weights across all four scores (w = [1/4, 1/4, 1/4, 1/4]).
Table 11 reveals two interesting findings: (i) aggregating scores does not always outperform individual scores (e.g., the uniform aggregation over the three regression scores performs similarly to the best individual score), and (ii) including an MQM score inferred from the predicted error spans boosts sentence-level performance. Notably, the improvement of our final aggregated score over the best individual score is not substantial. This suggests that, under computational constraints, one could compute a single score without the need for three different forward passes and error span prediction.
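A minimal sketch of this aggregation is shown below: the final sentence-level score is a weighted combination of the three regression scores and the MQM score inferred from the error spans. Only the two uniform weight vectors come from the text above; the individual scores are placeholder values.

```python
import numpy as np

def aggregate(scores: dict, weights: np.ndarray) -> float:
    """Weighted combination of the [src, ref, src+ref, inferred-MQM] scores."""
    ordered = np.array([scores["src"], scores["ref"],
                        scores["src+ref"], scores["mqm"]])
    return float(weights @ ordered)

scores = {"src": 0.78, "ref": 0.85, "src+ref": 0.84, "mqm": 0.80}  # illustrative
uniform_no_mqm = np.array([1/3, 1/3, 1/3, 0.0])   # aggregation (ii) in the text
uniform_all = np.array([1/4, 1/4, 1/4, 1/4])      # aggregation (iii) in the text

print(aggregate(scores, uniform_no_mqm))  # ~0.823
print(aggregate(scores, uniform_all))     # 0.8175
```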
Score | zh-en | en-de | en-ru | All
---|---|---|---|---
Individual Scores | | | |
src | 0.368 | 0.358 | 0.402 | 0.376
ref | 0.399 | 0.389 | 0.427 | 0.405
src+ref | 0.399 | 0.390 | 0.438 | 0.409
mqm (inferred from error spans) | 0.374 | 0.389 | 0.445 | 0.402
Aggregated Regression Scores (see Equation 7) | | | |
w = [1/3, 1/3, 1/3, 0] | 0.402 | 0.380 | 0.440 | 0.408
w = [1/4, 1/4, 1/4, 1/4] | 0.398 | 0.402 | 0.448 | 0.416
Final model weights | 0.399 | 0.414 | 0.448 | 0.421
9 Conclusions
We introduced xcomet, a novel suite of metrics for machine translation evaluation that combines sentence-level prediction with fine-grained error span prediction. Through extensive experiments, we have shown that xcomet is a state-of-the-art metric at all relevant vectors of evaluation: sentence-level, system-level, and error span prediction. Notably, through xcomet’s capabilities to predict error spans, we can not only obtain useful signals for downstream prediction (either directly through error span prediction or by informing sentence-level scores) but also gain access to a lens through which we can better understand and interpret its predictions. We also stress-tested the metrics by assessing how they score localized critical errors and hallucinations: The metrics identify the vast majority of localized errors and can appropriately penalize the severity of hallucinations.
We hope xcomet can serve as a step towards more informed machine translation evaluation.
Acknowledgments
We are grateful to José Pombal, José G. C. de Souza, and Sweta Agrawal for their valuable feedback and discussions.
This work was supported by the Portuguese Recovery and Resilience Plan (PRR) through project C645008882-00000055, Center for Responsible AI, by the European Research Council (DECOLLAGE, ERC-2022-CoG 101088763), by EU's Horizon Europe Research and Innovation Actions (UTTER, contract 101070631), and by the Fundação para a Ciência e Tecnologia (contracts UIDB/50021/2020 and UIDB/50008/2020). We also thank the HPC resources from GENCI-IDRIS (grants 2023–AD011014714, 2023–AD0110146A68R1, and AD011012377R2).
Notes
To the best of our knowledge, these represent the two largest open-source encoder-only models.
Here, for each input format in {src, ref, src+ref}, we define the corresponding sentence-level prediction and word-level logits for that input.
Here, for ease of notation, we use the input-format labels (src, ref, and src+ref) to denote the corresponding sentence-level scores.
Here, we exclude the 2022 News domain annotations, which we reserved for testing.
We measure cross-lingual similarity using sentence embeddings obtained with the LaBSE encoder (Feng et al., 2022).
This is particularly relevant for DA annotations, since these judgments typically come from non-expert annotators.
This was initially introduced in Bleurt-20 (Pu et al., 2021).
The λ weights used for Phases II and III were λ = 0.983 and λ = 0.055, respectively.
We convert all critical errors into major errors in order to match the guidelines described in Freitag et al. (2021a) that were used for annotating the zh-en and en-de test sets.
The ten tasks consist of the System-level Pairwise Accuracy pooled across the language pairs, as well as segment-level pairwise ranking accuracy with tie calibration (Deutsch et al., 2023a), and system- and segment-level Pearson correlation for each of the individual language pairs.
For all baselines, we report the official numbers from the WMT23 Metrics Shared Task (Freitag et al., 2023).
We use the models from the OpenAI API (gpt-3.5-turbo and gpt-4) in October 2023.
This allows us to isolate the effect of the perturbations: any error spans predicted for the transformed translations are a result of the induced perturbation.
For sentence-level only models, we present the sentence-level score corresponding to setting uniform weights across all three individual scores (src, ref, and src+ref).