Abstract
Several uncertainty estimation methods have been recently proposed for machine translation evaluation. While these methods can provide a useful indication of when not to trust model predictions, we show in this paper that the majority of them tend to underestimate model uncertainty, and as a result, they often produce misleading confidence intervals that do not cover the ground truth. We propose as an alternative the use of conformal prediction, a distribution-free method to obtain confidence intervals with a theoretically established guarantee on coverage. First, we demonstrate that split conformal prediction can “correct” the confidence intervals of previous methods to yield a desired coverage level, and we demonstrate these findings across multiple machine translation evaluation metrics and uncertainty quantification methods. Further, we highlight biases in estimated confidence intervals, reflected in imbalanced coverage for different attributes, such as the language and the quality of translations. We address this by applying conditional conformal prediction techniques to obtain calibration subsets for each data subgroup, leading to equalized coverage. Overall, we show that, provided access to a calibration set, conformal prediction can help identify the most suitable uncertainty quantification methods and adapt the predicted confidence intervals to ensure fairness with respect to different attributes.1
1 Introduction
Neural models for natural language processing (NLP) are able to tackle increasingly challenging tasks with impressive performance. However, their deployment in real-world applications does not come without risks. For example, systems that generate fluent text might mislead users with fabricated facts, particularly if they do not expose their confidence. High performance does not guarantee an accurate prediction for every instance—for example, the degradation tends to be more severe when instances are noisy or out of distribution. This makes uncertainty quantification methods more important than ever.
While most work on uncertainty estimation for NLP has focused on classification tasks, uncertainty quantification for text regression has recently gained traction, with applications in machine translation (MT) evaluation, semantic sentence similarity, or sentiment analysis (Wang et al., 2022; Glushkova et al., 2021). This line of work builds upon a wide range of methods proposed for estimating uncertainty (Kendall and Gal, 2017a; Kuleshov et al., 2018a; Amini et al., 2020; Ulmer et al., 2023). However, current uncertainty quantification methods suffer from three important limitations:
Most methods provide confidence intervals without any theoretically established guarantees with respect to coverage. In other words, while a representative confidence interval should include (cover) the ground truth target value for each instance (and ideally the bound of the confidence interval should be close in expectation to the ground truth as shown in Figure 1), the predicted interval is often much narrower and underestimates the model uncertainty. In fact, for the concrete problem of MT evaluation, we show that the majority of uncertainty quantification methods achieve very low coverage even after calibration, as can be observed in Figures 2 and 5.
Most proposed methods involve underlying assumptions on the distribution (e.g., Gaussianity) or the source of uncertainty (e.g., aleatoric or epistemic) which are often unrealistic and may lead to misleading (over- or under-estimated) results (Izmailov et al., 2021; Zerva et al., 2022). Hence, choosing a suitable method for a dataset can be complicated.
While uncertainty quantification can shed light on model weaknesses and biases, the uncertainty prediction methods themselves can suffer from biases and provide unfair and misleading predictions for specific data subgroups or for examples with varying levels of difficulty (Cherian and Candès, 2023; Ding et al., 2020; Boström and Johansson, 2020).
To address the shortcomings above, we propose conformal prediction (§2) as a means to obtain more trustworthy confidence intervals on textual regression tasks, using MT evaluation as the primary paradigm. We rely on the fact that given a scoring or uncertainty estimation function, conformal prediction can provide statistically rigorous uncertainty intervals (Angelopoulos and Bates, 2021; Vovk et al., 2005, 2022). More importantly, the conformal prediction methodology provides theoretical guarantees about coverage over a test set, given a chosen coverage threshold. The predicted uncertainty intervals are thus valid in a distribution-free sense: They possess explicit, non-asymptotic guarantees even without distributional assumptions or model assumptions (Angelopoulos and Bates, 2021; Vovk et al., 2005), and they also allow for an intuitive interpretation of the confidence interval width.
Figure 1: Predicted confidence intervals and coverage for the same ground truth/prediction points. We consider the middle (green) interval to be desired, as it covers the ground truth but does not overestimate uncertainty.
Figure 2: Coverage obtained by different uncertainty predictors. We compare originally obtained values (red) with values after calibration (light blue) and conformal prediction (green), for the desired coverage (dashed line) set to 0.9 (90%).
We specifically show (§3) that previously proposed uncertainty quantification methods can be used to design non-conformity scores for split conformal prediction (Papadopoulos, 2008). We confirm that, regardless of the initially obtained coverage, the application of conformal prediction can increase coverage to the desired—user defined—value (see Figure 2). To this end, we compare four parametric uncertainty estimation methods (Monte Carlo dropout, deep ensembles, heteroscedastic regression, and direct uncertainty prediction) and one non-parametric method (quantile regression) with respect to coverage and the distribution of uncertainty intervals. Additionally, we introduce a back-translation-inspired measure for referenceless quality estimation (QE) that uses the distance between quality estimates of translated and back-translated text to estimate non-conformity. We show that the estimated quantiles over each non-conformity score are indicative not only of the coverage but also of the overall suitability of the non-conformity score and the performance of the underpinning uncertainty quantification method (e.g., they align well with the error correlation computed over the test set). Our experiments highlight the efficacy of quantile regression, a previously overlooked method for the MT evaluation task.
Moreover, we investigate the fairness of the obtained intervals (§4) for a set of different attributes: (1) translation language pair; (2) translation difficulty, as reflected by source sentence length and syntactic complexity; (3) estimated quality level; and (4) uncertainty level. We highlight unbalanced coverage for all cases and demonstrate how equalized conformal prediction (Angelopoulos and Bates, 2021; Boström and Johansson, 2020; Boström et al., 2021) can address such imbalances effectively.
2 Conformal Prediction
In this section, we provide background on conformal prediction and introduce the notation used throughout this paper. Later in §3 we show how this framework can be used for uncertainty quantification in MT evaluation.
2.1 Desiderata
Let X and Y be random variables representing inputs and outputs, respectively; in this paper we focus on regression, where the output space is 𝒴 = ℝ. We use upper case to denote random variables and lower case to denote their specific values.
Traditional machine learning systems use training data to learn predictors f̂ : 𝒳 → 𝒴 which, when given a new test input xtest, output a point estimate ŷtest = f̂(xtest). However, such point estimates lack uncertainty information. Conformal predictors (Vovk et al., 2005) depart from this framework by considering set functions 𝒞(·)—given xtest, they return a prediction set 𝒞(xtest) ⊆ 𝒴 with theoretically established guarantees regarding the coverage of the ground truth value. For regression tasks, this prediction set is usually a confidence interval (see Figure 1). Conformal prediction techniques have recently proved useful in many applications: for example, in the U.S. presidential election in 2020, the Washington Post used conformal prediction to estimate the number of outstanding votes (Cherian and Bronner, 2020).
Given a desired confidence level (e.g., 90%), these methods have a formal guarantee that, in expectation, 𝒞(Xtest) contains the true value Ytest with a probability equal to or higher than (but close to) that confidence level. Importantly, this is done in a distribution-free manner, i.e., without making any assumptions about the data distribution beyond exchangeability, a weaker assumption than independent and identically distributed (i.i.d.) data.2
In this paper, we use a simple inductive method called split conformal prediction (Papadopoulos, 2008), which requires the following ingredients:
A mechanism to obtain non-conformity scores s(x, y) for each instance, i.e., a way to estimate how “unexpected” an instance is with respect to the rest of the data. In this work, we do this by leveraging a pretrained predictor f̂ together with some heuristic notion of uncertainty—our method is completely agnostic about which model is used for this. We describe in §2.2 the non-conformity scores we design in our work.
A held-out calibration set 𝒮cal = {(x1, y1),…,(xn, yn)} containing n examples. The underlying distribution from which the calibration set is generated is assumed unknown, but it must be exchangeable (see footnote 2).
A desired error rate α (e.g., α = 0.1), such that the coverage level will be 1 −α (e.g., 90%).
Given these ingredients, split conformal prediction computes the non-conformity scores s(xi, yi) on the calibration set, sets q̂ to their ⌈(n + 1)(1 −α)⌉/n empirical quantile, and forms the prediction set 𝒞(xtest) = {y : s(xtest, y) ≤ q̂} for each test input. Theorem 1 (Vovk et al., 2005; Angelopoulos and Bates, 2021) then guarantees that 1 −α ≤ ℙ(Ytest ∈ 𝒞(Xtest)) ≤ 1 −α + 1/(n + 1). This result tells us two important things: (i) the expected coverage is at least 1 −α, and (ii) with a large enough calibration set (large n), the procedure outlined above does not overshoot the coverage by much, so we can expect it to be nearly 1 −α.3
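To make the procedure concrete, below is a minimal sketch of split conformal prediction in Python/NumPy (our own illustration, not the released code); the toy scores stand in for the non-conformity scores defined in §2.2:

```python
import numpy as np

def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(cal_scores)
    level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_scores, min(level, 1.0), method="higher")

# Toy usage with a symmetric, normalized score s(x, y) = |y - f(x)| / delta(x).
rng = np.random.default_rng(0)
cal_scores = np.abs(rng.normal(size=1000))   # stand-in for scores on the calibration set
q_hat = conformal_quantile(cal_scores, alpha=0.1)
# Interval for a new input: [f(x) - q_hat * delta(x), f(x) + q_hat * delta(x)];
# it covers y_test with probability >= 0.9 (and at most 0.9 + 1/(n + 1)).
```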
2.2 Non-conformity Scores
Naturally, the result stated in Theorem 1 is only practically useful if the prediction sets are small enough to be informative—to ensure this, we need a good heuristic to generate the non-conformity scores s(x, y). In this paper, we are concerned with regression problems (𝒴 = ℝ), so we define the prediction sets to be confidence intervals. We assume we have a pretrained regressor f̂ : 𝒳 → ℝ, and we consider two scenarios, one where we generate symmetric intervals (i.e., where the point estimate ŷ = f̂(x) is the midpoint of the interval) and a more general scenario where intervals can be non-symmetric.
Symmetric Intervals. In this scenario, an uncertainty heuristic δ(x) > 0 determines the spread around the midpoint ŷ = f̂(x): the non-conformity score in (2) is the residual normalized by this spread, s(x, y) = |y − f̂(x)| / δ(x), and the resulting confidence interval in (4) rescales the spread by the calibrated quantile, 𝒞(x) = [f̂(x) − q̂ δ(x), f̂(x) + q̂ δ(x)].
Non-symmetric Intervals. Here the heuristic provides separate lower and upper spreads δ−(x) and δ+(x), so the endpoints f̂(x) − δ−(x) and f̂(x) + δ+(x) need not be equidistant from the prediction; the non-conformity score in (5) measures by how much y falls outside this heuristic interval, and the calibrated quantile q̂ adjusts both endpoints to yield the confidence interval in (6), in the spirit of conformalized quantile regression (Romano et al., 2019).
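The sketch below illustrates the two constructions, assuming the standard normalized-residual score for the symmetric case and a CQR-style score (Romano et al., 2019) for the non-symmetric case; the exact formulations of Equations (2)–(6) in the paper may differ in detail:

```python
import numpy as np

def symmetric_score(y, y_hat, delta):
    """Eq. (2)-style score: residual measured in units of the heuristic spread delta(x)."""
    return np.abs(y - y_hat) / delta

def symmetric_interval(y_hat, delta, q_hat):
    """Eq. (4)-style interval: rescale the heuristic spread by the calibrated quantile."""
    return y_hat - q_hat * delta, y_hat + q_hat * delta

def cqr_score(y, lower, upper):
    """CQR-style score: how far y falls outside the heuristic interval [lower, upper]."""
    return np.maximum(lower - y, y - upper)

def cqr_interval(lower, upper, q_hat):
    """CQR-style interval: shift both heuristic bounds outwards by the calibrated quantile."""
    return lower - q_hat, upper + q_hat
```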
3 Conformal MT Evaluation
We now apply the machinery of conformal prediction to the problem of MT evaluation, a regression task that aims to predict a numeric quality score for an (automatically) translated sentence. The input is a triplet x = ⟨s, t, r⟩ of source segment s, automatic translation t, and (optionally) human reference r, and the goal is to predict a scalar value ŷ that corresponds to the estimated quality of the translation t. We can also consider a reference-less MT evaluation scenario where the input is simply x = ⟨s, t⟩.4 The ground truth is a quality score y manually produced by a human annotator, either in the form of a point on a quality scale, called direct assessment (DA; Graham, 2013), or in the form of accumulated penalties, called multidimensional quality metrics (MQM; Lommel et al., 2014). We use DA scores that are standardized for each annotator. An example instance is shown in Figure 3.
Figure 3: Example of an MT evaluation instance: input triplet (left), evaluated by a human and then normalised to obtain the ground truth scores (top right), and model prediction (bottom right).
Subsequently, to apply conformal prediction we need to determine suitable non-conformity measures that can capture the divergence of a new test point xtest with respect to the seen data. To that end, we primarily experiment with a range of uncertainty quantification heuristics to generate δ(x) (or δ−(x) and δ+(x) in the non-symmetric case). With the symmetric parametric uncertainty methods described in §3.1.1, we obtain heuristics to compute δ(x), which we use to obtain non-conformity scores via (2), leading to the confidence intervals in (4), for each input triplet x. Alternatively, in §3.1.2 we describe a non-symmetric and non-parametric method which returns f̂(x), δ−(x), and δ+(x), and which we use to compute the non-conformity scores (5) and confidence intervals (6). Finally, in §3.1.3 we describe a heuristic non-conformity score which is inspired by the MT symmetry between s and t.
3.1 Choice of Non-conformity Scores
For the application of conformal prediction on MT evaluation, we experiment with a diverse set of uncertainty prediction methods to obtain non-conformity scores, accounting both for parametric and non-parametric uncertainty prediction. We extensively compare all the parametric methods previously used in MT evaluation (Zerva et al., 2022), which return symmetric confidence intervals. In addition, we experiment with quantile regression (Koenker and Hallock, 2001), a simple non-parametric approach that has never been used for MT evaluation (to the best of our knowledge), and which can return non-symmetric intervals. Finally, we propose a new MT evaluation-specific non-conformity measure that relies on the symmetry between source and target in MT tasks.
3.1.1 Parametric Uncertainty
We compare a set of different parametric methods which fit the quality scores in the training data to an input-dependent Gaussian distribution 𝒩(μ(x), σ²(x)). All these methods lead to symmetric confidence intervals (see Eq. 4). We use these methods to obtain estimates μ̂(x) and σ̂(x). Then we use the probit function Φ⁻¹ to extract the corresponding uncertainty estimates as δ(x) = Φ⁻¹(1 −α/2) σ̂(x), so that μ̂(x) ± δ(x) correspond to the α/2 and 1 −α/2 quantiles of the Gaussian, for a given confidence threshold 1 −α. For α = 0.1 (i.e., a 90% confidence level) this results in δ(x) ≈ 1.645 σ̂(x). We describe the concrete methods used to estimate μ̂(x) and σ̂(x) below.
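For concreteness, a minimal sketch of this computation with SciPy (an illustration of the formula above, not the authors' implementation):

```python
from scipy.stats import norm

def gaussian_half_width(sigma_hat, alpha=0.1):
    """Half-width delta(x) of the central (1 - alpha) interval of N(mu, sigma^2)."""
    z = norm.ppf(1 - alpha / 2)   # probit function; approximately 1.645 for alpha = 0.1
    return z * sigma_hat
```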
MC Dropout (MCD).
This is a variational inference technique approximating a Bayesian neural network with a Bernoulli prior distribution over its weights (Gal and Ghahramani, 2016). By retaining dropout layers during multiple inference runs, we can sample from the posterior distribution over the weights. As such, we can approximate the uncertainty over a test instance x through a Gaussian distribution with the empirical mean μ̂(x) and variance σ̂²(x) of the quality estimates collected across runs. We use 100 runs, following the analysis of Glushkova et al. (2021).
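A minimal PyTorch sketch of the sampling procedure, assuming a generic regression model with dropout layers (the actual metrics are COMET-style regressors trained as described in Appendix B):

```python
import torch

@torch.no_grad()
def mc_dropout_estimate(model, x, n_samples=100):
    """Predictive mean and std from repeated stochastic forward passes."""
    model.train()   # keep dropout active at inference time (note: also affects e.g. batch norm)
    samples = torch.stack([model(x).squeeze(-1) for _ in range(n_samples)])
    model.eval()
    return samples.mean(dim=0), samples.std(dim=0)
```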
Deep Ensembles (DE).
This method (Lakshminarayanan et al., 2017) trains an ensemble of neural models with the same architecture but different initializations. During inference, we collect the predictions of each individual model and compute μ̂(x) and σ̂(x) as in MC dropout. We use N = 5 checkpoints obtained with different initialization seeds, following Glushkova et al. (2021).
Heteroscedastic Regression (HTS). This method (Kendall and Gal, 2017a) trains the MT evaluation model to output both a mean μ(x) and a variance σ²(x), by minimizing the Gaussian negative log-likelihood (Eq. 7); as a result, the model is pushed to predict larger variances for instances with larger errors.
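A sketch of the heteroscedastic Gaussian negative log-likelihood this method minimizes (cf. Eq. 7); the log-variance parameterization is our choice for numerical stability and is not necessarily the one used in the original implementation:

```python
import torch

def heteroscedastic_nll(y, mu, log_var):
    """Gaussian NLL with input-dependent variance: (y - mu)^2 / (2 sigma^2) + 0.5 log sigma^2."""
    return (0.5 * torch.exp(-log_var) * (y - mu) ** 2 + 0.5 * log_var).mean()
```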
Direct Uncertainty Prediction (DUP). Following Zerva et al. (2022), this method trains a dedicated prediction head (with a bottleneck layer; see Appendix B) to directly estimate the uncertainty of the quality prediction, modelled as the deviation between the predicted and the gold quality score; its output is used as σ̂(x).
3.1.2 Non-parametric Uncertainty: Quantile Regression (QNT)
Figure 4: The pinball loss objective used for quantile regression. The slope of the lines is determined by the desired quantile level τ.
Quantile regression (Koenker and Hallock, 2001) fits a predictor of the τ-quantile of the target given the input by minimizing the pinball loss, whose asymmetric slopes are determined by τ (Figure 4). We train our models to predict the α/2 and 1 −α/2 quantiles (by setting τ = α/2 and τ = 1 −α/2, respectively), as well as the 0.5 quantile, which corresponds to the median (see below). There are also extensions that either optimize multiple quantiles covering the full predictive distribution (Tagasovska and Lopez-Paz, 2019) or explore asymmetric loss variants to account for over- or under-estimating the confidence intervals (Beck et al., 2016).
Unlike the parametric methods covered in §3.1.1, the quantile regression method can be used to return asymmetric confidence intervals. This is done by fitting the 0.5, α/2, and 1 −α/2 quantile predictors ŷ0.5(x), ŷα/2(x), and ŷ1−α/2(x) to the data, and setting f̂(x) = ŷ0.5(x), δ−(x) = f̂(x) − ŷα/2(x), and δ+(x) = ŷ1−α/2(x) − f̂(x).
For completeness, we also consider a symmetric variant of quantile regression where we do not estimate the median ŷ0.5(x), but instead construct a symmetric midpoint and spread directly from the two predicted quantiles ŷα/2(x) and ŷ1−α/2(x). We report coverage for both the non-symmetric (QNT-NS) and the symmetric case (QNT-S) in Table 1.
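A sketch of the pinball loss and of how the three quantile heads can be turned into the quantities used above (illustrative; the actual quantile heads are trained within the COMET codebase):

```python
import torch

def pinball_loss(y, y_hat, tau):
    """Quantile (pinball) loss for quantile level tau (cf. Figure 4)."""
    diff = y - y_hat
    return torch.maximum(tau * diff, (tau - 1) * diff).mean()

def quantile_offsets(q_lo, q_med, q_hi):
    """Turn the three quantile predictions into (f(x), delta_minus(x), delta_plus(x))."""
    return q_med, q_med - q_lo, q_hi - q_med
```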
Table 1: Coverage percentages for α = 0.1 over different uncertainty methods. Values reported correspond to the mean over 10 runs. The second, third, and fourth columns refer respectively to the coverage obtained by the original methods without calibration, after the ECE calibration described in §3.2, and with conformal prediction as described in §2. The last column reports the conformal quantile q̂ estimated on the calibration set.
| Method | Orig. | Calib. | Conform. | q̂ |
|---|---|---|---|---|
| MCD | 23.82 | 66.60 | 90.01 | 8.08 |
| DE | 29.10 | 66.23 | 91.31 | 6.99 |
| HTS | 82.02 | 68.29 | 89.89 | 1.28 |
| DUP | 86.01 | 66.13 | 89.88 | 1.11 |
| QUANT-NS | 77.83 | – | 90.21 | 1.29 |
| QUANT-S | 78.66 | 49.03 | 90.54 | 1.28 |
3.1.3 Back-translation-inspired Non-conformity
The aforementioned uncertainty quantification heuristics are based on well-established methods that could be applied to other regression problems with minimal modifications. However, conformal prediction is quite flexible with respect to the choice of the underlying non-conformity measure, allowing us to tailor the definition of conformity to the task at hand. Thus, we also experiment with a back-translation-inspired setup for the referenceless COMET metric (COMET-QE).
Our intuition for this measure comes from previous work showing that the symmetry between source and target, e.g., via back-translation, can be exploited as an indicator of translation quality (Agrawal et al., 2022; Moon et al., 2020). In other words, a metric that computes the distance (at the semantic or surface level) between the original source sentence and the one obtained by translating the target back correlates well with translation quality. In this work, we hypothesize that we can exploit the symmetry between translation directions to infer a non-conformity measure: we use the distance between the quality estimates obtained for the translated and back-translated text as the uncertainty heuristic δ(x), and compute the non-conformity score as in the symmetric case.
We henceforth refer to this score as the BT non-conformity score.
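One possible instantiation of this idea is sketched below; `score_qe` is a placeholder for any referenceless scorer (e.g., a wrapped COMET-QE model), and the exact formulation of the BT score may differ from this sketch:

```python
def bt_spread(score_qe, src, mt):
    """Disagreement between the two translation directions, used as the spread delta(x)."""
    forward = score_qe(src, mt)    # quality of t as a translation of s
    backward = score_qe(mt, src)   # quality of s read as a "back-translation" of t
    return abs(forward - backward)
```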
3.2 Comparison with Calibration
3.3 Experimental Setup
Models.
We experiment with a range of different models for the task of MT quality evaluation. We specifically use two models that employ source, translation, and reference in their input, namely UniTE (Wan et al., 2022) and COMET (Rei et al., 2020). We also experiment with BLEURT (Sellam et al., 2020), a metric that relies only on translation and reference comparisons, and finally, we explore a reference-less setup using COMET-QE, which receives only the source and translation sentences as input (Rei et al., 2021; Zerva et al., 2021). We provide model training hyperparameters in Appendix B.
Data.
For training, we use the direct assessment (DA) data from the WMT17-19 metrics shared tasks (Ma et al., 2018, 2019). We evaluate our models on the WMT20 metrics dataset (Mathur et al., 2020). For the calibration set 𝒮cal, we use repeated random sub-sampling for k = 20 runs. The WMT20 test data includes 16 language pairs, of which 9 are into-English and 7 are out-of-English translations. For the calibration set sub-sampling, we sample uniformly from each language pair. For metrics for which we report averaged performance, we use the micro-average over all language pairs.
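A sketch of this sub-sampling step (illustrative; the per-language-pair sample size and the "lp" field are assumptions, not values from the paper):

```python
import random
from collections import defaultdict

def split_calibration(examples, n_per_lp=100, seed=0):
    """Sample the same number of calibration examples from every language pair."""
    rng = random.Random(seed)
    by_lp = defaultdict(list)
    for i, ex in enumerate(examples):          # each example is assumed to carry an "lp" field
        by_lp[ex["lp"]].append(i)
    cal_idx = {i for idx in by_lp.values() for i in rng.sample(idx, min(n_per_lp, len(idx)))}
    cal = [examples[i] for i in cal_idx]
    test = [ex for i, ex in enumerate(examples) if i not in cal_idx]
    return cal, test
```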
3.4 Results
We first compare the uncertainty methods described in §3.1 with respect to coverage percentage as shown in Table 1. We select a desired coverage level of 90%, i.e., we set α = 0.1. We also align the uncertainty estimates with respect to the same α value: for the parametric uncertainty heuristics, we select the δ(x) that corresponds to a 1 −α coverage of the distribution, by using the probit function as described in §3.1.1; and for the non-parametric approach, we train the quantile regressors by setting τ = α/2, as described in §3.1.2.
Table 1 shows that coverage varies significantly across methods for the COMET metric, while Figure 5 shows that the same trend holds for the BLEURT, UniTE, and COMET-QE metrics. We can see that sampling-based methods such as MC dropout and deep ensembles achieve coverage far below the desired 1 −α level. In contrast, direct uncertainty prediction and heteroscedastic regression achieve comparatively high coverage even before the application of conformal prediction. This could be related to the fact that, by definition, they try to model uncertainty in relation to model error (DUP explicitly tries to predict uncertainty modelled as the prediction error, while for HTS, based on Eq. 7, the model needs to predict larger variance for larger errors). The quantile regression method also performs competitively with DUP, achieving high coverage across metrics, with the exception of BLEURT (the only metric that does not use the source sentence), where its coverage is significantly lower than that of DUP and HTS. Finally, while the back-translation-inspired (BT) score does not achieve high coverage, it outperforms the sampling-based methods, providing a low-cost solution even in the absence of trained uncertainty quantifiers.
Figure 5: Coverage obtained by different uncertainty predictors for different MT evaluation metrics. We compare originally obtained values (red) with values after calibration (light blue) and after conformal prediction (green), with the desired coverage threshold (dashed line) set to 0.9 (90%).
Calibration helps improve coverage in the cases of MC dropout and deep ensembles—albeit still without getting close to 0.9. In fact, minimizing the ECE does not seem well aligned with optimizing coverage, as in most cases calibration leads to less than 70% coverage. In contrast, we can see that conformal prediction approximates the desired coverage level best for all methods, regardless of the initial coverage they obtain, in line with the guarantees provided by Theorem 1.
In addition, as shown in Table 2, the estimated quantile q̂ correlates well with the performance of each uncertainty quantification method, as measured by uncertainty Pearson correlation (UPS) (Glushkova et al., 2021).5 We can specifically see that low values of q̂ correspond to uncertainty quantification methods that yield better performance and correlate better with the residuals of the MT evaluation metric. Hence, conformal prediction can be used to efficiently guide the selection of a suitable uncertainty quantification method, using only a small amount of data (the calibration set).
We show the average interval width in Figure 6, where we can see that, especially for the MCD and DE methods, the width increases significantly to reach the desired 90% coverage. The direct uncertainty prediction method shows the smallest increase in width, with quantile and heteroscedastic regression following.
Figure 6: Width for each uncertainty quantifier for the COMET metric, showing the original intervals (red), the intervals after calibration (light blue), and the intervals after conformal prediction (green).
We also plot the average width of the conformalised confidence intervals with respect to coverage for increasing α values (see Figure 7). We can see that as the coverage requirement is relaxed, the confidence interval widths shrink accordingly, and that, depending on the chosen α value, the optimal method can vary. For example, quantile regression performs much better for α ≤ 0.2, but for more “relaxed” values the width-coverage balance deteriorates.
Figure 7: Coverage (x-axis) vs. width (y-axis) for increasing α values, for the COMET metric.
4 Conditional Coverage
The coverage guarantees stated in Theorem 1 refer to marginal coverage—the probabilities are not conditioned on the input points, but averaged (marginalized) over the full test set. In several practical situations it is desirable to assess the conditional coverage ℙ(Ytest ∈ 𝒞(Xtest) | Xtest ∈ 𝒳A), where 𝒳A denotes a region of the input space, e.g., inputs containing some specific attributes or pertaining to some group of the population.
In fact, evaluating the conditional coverage with respect to different data attributes may reveal biases of the uncertainty estimation methods towards specific data subgroups which are missed if we only consider marginal coverage. In the next experiments, we follow the feature stratified coverage described in Angelopoulos and Bates (2021); we use conformal prediction with MC dropout as our main paradigm. We demonstrate five examples of imbalanced coverage in Figure 8 and Table 3 with respect to different attributes: language pairs, estimated source difficulty, and predicted quality and uncertainty scores.
Figure 8: Conditional coverage imbalance per (top to bottom): sentence length, syntactic complexity, estimated quality score level, and uncertainty score, for conformal prediction with MCD-based non-conformity scores. To facilitate plotting, the segment frequencies are re-scaled with respect to the maximum bin frequency (so that the bin with the maximum frequency equals 1).
We can see that coverage varies significantly across groups, revealing biases towards specific attribute values. For example, the plots show that into-English translations are under-covered for most uncertainty quantifiers (coverage ≤ 0.9), i.e., we consistently underestimate the uncertainty over the predicted quality for these language pairs. More importantly, we can see that examples with low predicted quality are significantly under-covered, as coverage for quality scores with y ≤ −1.5 drops below 50%. For MCD-based uncertainty scores, on the other hand, the drop in coverage seems to be related to low uncertainty scores, indicating that, due to the skewed distribution of uncertainty scores, the calculated quantile is not well tuned to lower uncertainty values (i.e., higher non-conformity scores). Instead, our two proxies for source difficulty reveal better balanced behaviour, with small deviations for very short sentences or high syntactic complexity. Similar patterns along these dimensions are also observed for the other uncertainty quantification methods shown in Appendix C.
Ensuring that we do not overestimate confidence for such examples is crucial for MT evaluation, in particular for applications where MT is used on the fly and one needs to decide whether human editing is needed. Hence, in the rest of this section, we elaborate on approaches to assess and mitigate coverage imbalance in the aforementioned cases, towards equalized coverage (Romano et al., 2020).
4.1 Conditioning on Categorical Attributes: Language-pairs
To deal with imbalanced coverage for discrete data attributes we use an equalized conformal prediction approach, i.e., we compute the conditional coverage for each attribute value and, upon observing imbalances, we compute conditional quantiles instead of a single one on the calibration set.
Let {1,…, K} index the attribute values (e.g., language pairs). We partition the calibration set according to these attributes as 𝒮cal = 𝒮cal(1) ∪⋯∪ 𝒮cal(K), where 𝒮cal(k) denotes the partition corresponding to the kth attribute value and 𝒮cal(k) ∩ 𝒮cal(k′) = ∅ for every k ≠ k′. Then, we follow the procedure described in §2 to fit attribute-specific quantiles q̂(k) to each calibration set 𝒮cal(k).
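A sketch of the equalized procedure (illustrative; it repeats the finite-sample-corrected quantile helper from the sketch in §2):

```python
import numpy as np
from collections import defaultdict

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile, as in the sketch of §2."""
    n = len(scores)
    return np.quantile(scores, min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0), method="higher")

def groupwise_quantiles(cal_scores, cal_groups, alpha=0.1):
    """One conformal quantile per attribute value (e.g., per language pair)."""
    per_group = defaultdict(list)
    for score, group in zip(cal_scores, cal_groups):
        per_group[group].append(score)
    return {g: conformal_quantile(np.asarray(s), alpha) for g, s in per_group.items()}

# At test time, look up the quantile for the instance's language pair and
# build the interval exactly as in the marginal case.
```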
We demonstrate the application of this process on language pairs in Table 3 for all uncertainty quantification methods examined in the previous section. The top part of Table 3 shows the language-based conditional coverage, using a heatmap coloring to highlight the language pairs that fall below the guaranteed marginal coverage of 1 −α = 0.9. We can see that for all language pairs we achieve coverage >75% but some are below the 90% target. For all methods except for DUP, the coverage is high for out-of-English translations and drops for the majority of into-English cases. Applying the equalizing approach described above, we successfully rectify the imbalance for all uncertainty quantification methods, as shown in the bottom heatmap of Table 3.
4.2 Conditioning on Numerical Attributes: Quality, Difficulty and Uncertainty Scores
With some additional constraints on the equalized conformal prediction process described in §4.1, we can generalize this approach to account for attributes with numerical (discrete or continuous) values, such as the MT quality scores (ground truth quality y) or the uncertainty scores obtained by different uncertainty quantification methods. To that end, we adapt the Mondrian conformal regression methodology (Vovk et al., 2005; Boström et al., 2021). Mondrian conformal predictors were initially proposed for classification and later extended to regression, where they have been used to partition the data with respect to the residuals (Johansson et al., 2014; Boström et al., 2021). Boström and Johansson (2020) proposed a Mondrian conformal predictor that partitions along the expected “difficulty” of the data, as estimated by the non-conformity score s(x, y) or the uncertainty score δ(x).
In all the above cases, the calibration instances are sorted according to a continuous variable of interest and then partitioned into calibration bins. While the bins do not need to be of equal size, they need to satisfy a minimum size condition that depends on the chosen error rate α (Johansson et al., 2014). Upon obtaining a partition into calibration bins, and similarly to what was described in §4.1 for discrete attributes, we compute bin-specific quantiles q̂(b), where b ∈ {1,…, B} indexes a bin.
We apply the aforementioned approach to MT evaluation for the estimated translation quality scores and uncertainty scores, as well as for two different proxies of sentence translation difficulty, namely sentence length and syntactic complexity, computed on the source language (Mishra et al., 2013). We compute the source sentence length as the number of tokens in the sentence, while for syntactic complexity we consider the sum of subtrees that constitute grammatical phrases,7 and we sort the calibration and test samples accordingly.
We then split the ordered calibration set into bins8 and compute the quantiles q̂(b) over the calibration set bins. Subsequently, to apply conformal prediction to a test instance xtest, we check its attribute value, identify which bin of 𝒮cal it falls into, and use the corresponding quantile q̂(b).
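A sketch of this binning step for a continuous attribute, assuming equally sized bins of at least 100 calibration instances (cf. footnote 8); the exact binning scheme may differ:

```python
import numpy as np

def mondrian_bin_quantiles(cal_scores, cal_attr, alpha=0.1, min_bin=100):
    """Sort calibration points by a continuous attribute and fit one conformal quantile per bin."""
    order = np.argsort(cal_attr)
    scores, attrs = np.asarray(cal_scores)[order], np.asarray(cal_attr)[order]
    n_bins = max(1, len(scores) // min_bin)            # bins of at least `min_bin` instances
    edges, quantiles = [], []
    for bin_scores, bin_attrs in zip(np.array_split(scores, n_bins), np.array_split(attrs, n_bins)):
        n = len(bin_scores)
        level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
        edges.append(bin_attrs[-1])                    # upper attribute boundary of this bin
        quantiles.append(np.quantile(bin_scores, level, method="higher"))
    return np.array(edges), quantiles

def bin_quantile(attr_value, edges, quantiles):
    """Quantile of the bin that a test instance's attribute value falls into."""
    return quantiles[min(int(np.searchsorted(edges, attr_value)), len(quantiles) - 1)]
```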
The equalized coverage for COMET-MCD is shown in Figure 8, compared with the original coverage. We can see that for the estimated quality and uncertainty scores, the previously observed coverage drop for lower values is successfully rectified by the equalized conformal prediction approach, achieving balanced coverage across bins, as desired. For the two difficulty proxies, we see that for MCD the obtained bins are already fairly balanced with respect to coverage, with only a small drop for higher difficulty in terms of syntactic complexity. We provide additional results for the remaining uncertainty quantifiers in Appendix C.
5 Related Work
5.1 Conformal Prediction
We build on literature on conformal prediction that has been established by Vovk et al. (2005). Subsequent works focus on improving the predictive efficiency of the conformal sets or relaxing some of the constraints (Angelopoulos and Bates, 2021; Jin and Candès, 2022; Tibshirani et al., 2019). Most relevant to our paper are works that touch conformal prediction for regression tasks, either via the use of quantile regression (Romano et al., 2019) or using other scalar uncertainty estimates (Angelopoulos and Bates, 2021; Johansson et al., 2014; Papadopoulos et al., 2011). Other strands of work focus on conditional conformal prediction and methods to achieve balanced coverage (Angelopoulos and Bates, 2021; Romano et al., 2020; Boström et al., 2021; Lu et al., 2022).
There are few studies that use conformal prediction in NLP, so far focusing only on classification or generation, with applications to sentiment and relation classification and entity detection (Fisch et al., 2021, 2022; Maltoudoglou et al., 2020). Recently, Ravfogel et al. (2023) and Ulmer et al. (2024) considered natural language generation, with the former applying conformal prediction to top-p nucleus sampling, and the latter applying non-exchangeable conformal prediction with k-nearest neighbors to obtain better prediction sets for generation. Other works apply conformal prediction at the sentence level to rank generated sentences for different tasks (Kumar et al., 2023; Ren et al., 2023; Liang et al., 2024). Concurrently with this work, Giovannotti (2023) proposed the use of conformal prediction for MT quality estimation, using a k-nearest neighbor (kNN) quality estimation model to obtain non-conformity scores and treating conformal prediction as a new standalone uncertainty quantification method for this task. They empirically demonstrate the impact of violating the i.i.d. assumption on the obtained performance and compare to a fixed-variance baseline with respect to ECE, AUROC, and sharpness, but they consider neither marginal nor conditional coverage of the estimated confidence intervals, nor any other uncertainty quantification methods.
Our work complements the aforementioned efforts, as it focuses on a regression task (MT evaluation) and investigates the impact of conformal prediction on the estimated confidence intervals. Contrary to previous approaches, we provide a detailed analysis of conformal prediction for an NLP regression task and investigate a wide range of uncertainty methods that can be used to design non-conformity scores. Additionally, we examine different aspects of equalized coverage for MT evaluation, revealing biases for different data attributes and providing an effective method that corrects these biases.
5.2 Uncertainty Quantification
Several uncertainty methods have been previously proposed for regression tasks in NLP and the task of MT evaluation specifically. Beck et al. (2016) focused on the use of Gaussian processes to obtain uncertainty predictions for the task of quality estimation, with emphasis on cases of asymmetric risk. Wang et al. (2022) also explored Gaussian processes but provided a comparison across multiple NLP regression tasks (semantic sentence similarity, MT evaluation, sentiment quantification), investigating end-to-end and pipeline approaches to apply Bayesian regression to large language models. Focusing on MT evaluation, Glushkova et al. (2021) proposed the use of MC dropout and deep ensembles as efficient approximations of Bayesian regression, inspired by work in computer vision (Kendall and Gal, 2017a). Zerva et al. (2022) proposed additional methods of uncertainty quantification for MT evaluation, focusing on methods that target aleatoric or epistemic uncertainties under specific assumptions. They specifically investigated heteroscedastic regression and KL-divergence for aleatoric uncertainty and direct uncertainty prediction for epistemic uncertainty, highlighting the performance benefits of these methods, when compared to MC dropout and deep ensembles, with respect to the correlation between uncertainty and model error. However, none of the previous works in uncertainty for NLP regression considered coverage. We compare several of the aforementioned uncertainty quantification methods with respect to coverage and focus on the impact of applying conformal prediction to each uncertainty method.
6 Conclusions
In this work, we apply conformal prediction to the important problem of MT evaluation. We show that most existing uncertainty quantification methods significantly underestimate uncertainty, achieving low coverage, and that the application of conformal prediction can help rectify this and guarantee coverage tuned to a user-specified threshold. We further show that the estimated quantiles provide a way to choose the most suitable uncertainty quantification methods, aligning well with other metrics such as UPS (Glushkova et al., 2021).
We also use conformal prediction tools to assess the conditional coverage for five different attributes: language pairs, sentence length and syntactic complexity, predicted translation quality, and estimated uncertainty level. We highlight inconsistencies and imbalanced coverage for the different cases, and we show that equalized conformal prediction can correct the initially unfair confidence predictions to obtain more balanced coverage across attributes.
Overall, our work aims to highlight the potential weaknesses of using uncertainty estimation methods without a principled calibration procedure. To this end, we propose a methodology that can guarantee more meaningful confidence intervals. In future work, we aim to further investigate the application of conformal prediction across different data dimensions as well as different regression tasks in NLP.
Acknowledgments
This work was supported by the Portuguese Recovery and Resilience Plan through project C645008882- 00000055 (NextGenAI - Center for Responsible AI), by the EU’s Horizon Europe Research and Innovation Actions (UTTER, contract 101070631), by the project DECOLLAGE (ERC-2022-CoG 101088763), and by Fundação para a Ciência e Tecnologia through contract UIDB/50008/2020.
Notes
Code and data can be accessed on https://github.com/deep-spin/conformalizing_MT_eval.
Namely, the data distribution is said to be exchangeable iff, for any sample and any permutation function π, we have ℙ((Xπ(1), Yπ(1)),…,(Xπ(n), Yπ(n))) = ℙ((X1, Y1),…,(Xn, Yn)). If the data distribution is i.i.d., then it is automatically exchangeable, since ℙ((X1, Y1),…,(Xn, Yn)) = ∏i ℙ(Xi, Yi) and multiplication is commutative. By De Finetti’s theorem (De Finetti, 1929), exchangeable observations are conditionally independent relative to some latent variable.
The reference-less scenario is also frequently referred to as quality estimation for machine translation.
Note that unlike (Glushkova et al., 2021) we compute UPS over the full test set, instead of taking the macro-average over each language-pair. However, looking at the values reported in that work we can see that our findings hold for the macro-averaged UPS values as well.
In related work (Glushkova et al., 2021), sharpness is computed with respect to σ2, but this cannot be applied to non-parametric uncertainty cases, so we use the confidence interval length, henceforth referred to as width to be able to compare conformal prediction for all uncertainty quantification methods.
We employ an nltk-based dependency parser.
We use a threshold of 100 instances per bin.
https://github.com/Unbabel/COMET, version 2.1.0.
References
A Average Width Across Metrics and Uncertainty Quantifiers
In this section we present the average width of the confidence intervals calculated by the original uncertainty quantification methods, as well as the adjusted widths obtained when calibrating either by minimising the ECE or by applying conformal prediction. Expanding the analysis of COMET presented in Figure 6, we report results for BLEURT, UniTE, and COMET-QE. As shown in Figure 9, a similar pattern can be observed for all metrics: upon conformalizing, the width increases significantly for MCD, DE, and BT, while the changes for the other methods are more moderate.
Figure 9: Widths obtained for the BLEURT, UniTE, and COMET-QE metrics, showing the original intervals (red), the intervals after calibration (light blue), and the intervals after conformal prediction (green).
B Model Implementation and Parameters
Table 4 shows the hyperparameters used to train the following metrics: BLEURT, UniTE, COMET, and COMET-QE. We implemented the models using the COMET codebase9 and implementation from Zerva et al. (2022) for the uncertainty quantification methods. For deep ensembles, we trained 5 models with different seeds. For MCD we used a total of 100 runs following Glushkova et al. (2021) and Zerva et al. (2022). For the DUP method, we used a bottleneck layer with dimensionality 256, and we maintained the same setup across metrics.
Table 4: Hyperparameters for the MT evaluation metrics used.
| Hyperparameter | COMET | BLEURT | UniTE | COMET-QE |
|---|---|---|---|---|
| Encoder Model | XLM-R (large) | RemBERT (large) | Info-XLM (large) | XLM-R (large) |
| Optimizer | Adam | Adam | Adam | Adam |
| No. frozen epochs | 0.3 | 0.3 | 0.3 | 0.3 |
| Learning rate | 3e-05 | 3e-05 | 3e-05 | 3e-05 |
| Encoder Learning Rate | 1e-05 | 1e-05 | 1e-05 | 1e-05 |
| Layerwise Decay | 0.95 | 0.95 | 0.95 | 0.95 |
| Batch size | 4 | 4 | 4 | 4 |
| Dropout | 0.15 | 0.15 | 0.15 | 0.15 |
| Hidden sizes | [3072, 1024] | [2048, 1024] | [3072, 1024] | [2048, 1024] |
| Encoder Embedding layer | Frozen | Frozen | Frozen | Frozen |
| FP precision | 32 | 32 | 32 | 32 |
| No. Epochs (training) | 2 | 2 | 2 | 2 |
C Equalized Conformal Prediction Across Uncertainty Quantification Methods
In this section, we extend the analysis discussed in Section 4 of the main paper to the rest of the uncertainty quantification methods for the COMET metric, shown in Figures 10 to 13. We can see that direct uncertainty prediction (Figure 12) and quantile regression (Figure 13) are the two methods that suffer least from imbalanced coverage, even for extreme values of quality and uncertainty, supporting their suitability for MT evaluation, as also shown in the general results in Section 3.4. We can also observe that when the initial calibration step already yields balanced results around the desired coverage level, the recalibration brings no significant benefit and may even result in slightly lower coverage. Hence, it is important to first detect for which attributes, if any, we need to recalibrate.
Figure 12: Equalized prediction for COMET using direct uncertainty prediction.
Author notes
Action Editor: Joel Tetreault