## Abstract

State-of-the-art classification and regression models are often not well calibrated, and cannot reliably provide uncertainty estimates, limiting their utility in safety-critical applications such as clinical decision-making. While recent work has focused on calibration of classifiers, there is almost no work in NLP on calibration in a regression setting. In this paper, we quantify the calibration of pre-trained language models for text regression, both intrinsically and extrinsically. We further apply uncertainty estimates to augment training data in low-resource domains. Our experiments on three regression tasks in both self-training and active-learning settings show that uncertainty estimation can be used to increase overall performance and enhance model generalization.

## 1 Introduction

Modern neural network models, particularly those based on pre-training and fine-tuning, have achieved impressive results across a broad spectrum of NLP tasks, in terms of evaluation metrics such as classification accuracy or F-score for classification tasks and mean squared error for regression tasks. However, the standard training regime fails to take model uncertainty into account, and tends to result in over-fitting and poor generalization, especially in limited training data situations.

In addition, these models have been empirically demonstrated to have poor calibration: the predictive probability does not reflect the true correctness likelihood, and they are generally over-confident when they make wrong predictions (Guo et al., 2017; Desai and Durrett, 2020; Jiang et al., 2020). Put differently, the models do not know what they don’t know. This is particularly the case in low-resource settings. However, faithfully assessing the uncertainty of model predictions is as important as obtaining high accuracy in many safety-critical applications, such as autonomous driving or clinical decision support (Chen et al., 2020; Kendall and Gal, 2017; Davis et al., 2017). If models were able to more faithfully capture their lack of certainty when they make erroneous predictions, they could be used more reliably in critical decision-making contexts, and avoid catastrophic errors.

In the context of text regression, we aim to alleviate over-fitting and improve generalizability in low-resource settings by taking the uncertainty sourced from both the data and model into account. Specifically, we address: (1) data uncertainty, by filtering noisy annotations from (either pseudo or gold) labeled data based on predictive confidence, preventing models from memorizing out-of-distribution examples; and (2) model uncertainty, by using uncertainty models to accurately estimate both the target value and the predictive confidence, providing more reliable and interpretable predictions while also supporting the denoising in (1).

Uncertainty estimation has been extensively explored in the context of classification (Guo et al., 2017; Vaicenavicius et al., 2019; Desai and Durrett, 2020; Jiang et al., 2020), but is relatively unexplored for regression tasks, due to the complexities in dealing with a continuous target space. The output of a classifier passed through a softmax layer naturally provides a discrete probability distribution, while in a regression setting the output is a single numerical value.

We compare four well-studied techniques for uncertainty estimation, as applied to pre-trained language models (LMs): Gaussian processes (Shen et al., 2019; Camporeale and Carè, 2020), Bayesian linear regression (Hernández-Lobato and Adams, 2015), Bayes by backprop, and Monte Carlo (MC) dropout. To comprehensively assess uncertainty quality, we evaluate results intrinsically using various metrics, and extrinsically with several downstream experiments. Our analysis shows that predictions are highly uncertain and inaccurate in low-resource scenarios.

Two major types of uncertainty have been identified: *aleatoric uncertainty* captures noise inherent in the observations; and *epistemic uncertainty* accounts for uncertainty in the model, which can be explained away given enough data, compensating for limited knowledge (Kendall and Gal, 2017). In other words, uncertainty results primarily from noisy human annotations, insufficient labeled data, and out-of-domain text in practice (Glushkova et al., 2021). We therefore propose a simple method to filter noisy labels and select high-quality instances from an unlabeled data pool based on the predictive confidence, which on the one hand alleviates both aleatoric and epistemic uncertainty, and on the other hand, improves accuracy and generalization thanks to increased training data.

In this work, we explore how to estimate uncertainty in a regression setting with pre-trained language models, and evaluate estimation quality both intrinsically and extrinsically. Intrinsic uncertainty estimation provides the basis for our proposed data selection strategy: by filtering noise based on confidence thresholding, and mitigating exposure bias, our approach is shown to be effective at improving both performance and generalization in low-resource settings, under both self-training and active learning.

## 2 Background

We first review approaches for estimating the predictive uncertainty of deep neural networks (DNNs) in a regression setting, then methods for reducing uncertainty and improving generalization.

### 2.1 Uncertainty Estimation in DNNs

##### Bayesian Estimation

Bayesian approaches provide a general framework for dealing with uncertainty estimation, for example, in the form of Gaussian processes (GPs: Camporeale and Carè, 2020; Shen et al., 2019) and Bayesian neural networks (Hernández-Lobato and Adams, 2015). However, prior work has either been based on hand-crafted features, or based on small-scale neural networks with only one or two hidden layers, which are far removed from modern pre-trained LMs. How to combine deterministic pre-trained LMs with Bayesian methods to achieve both high accuracy and accurate uncertainty estimation is an open problem, particularly in a regression setting.

*q*(**w**|*θ*): **w**^{(i)} denotes the *i*th Monte Carlo sample drawn from the variational posterior *q*(**w**^{(i)}|*θ*).

##### Ensemble Estimation

Another approach is to estimate uncertainty by ensemble, typically with MC-dropout (Gal and Ghahramani, 2016) and deep ensembles (Lakshminarayanan et al., 2017), which are agnostic to model structure.

**w**) given precision parameter *τ* > 0 is:

The deep ensemble approach trains multiple copies of the variance networks from different network initializations to estimate predictive distributions. It operates similarly to sub-networks of MC dropout, but is computationally more expensive due to the need to train multiple models. Additionally, the need to split the training data into multiple folds to train different networks exacerbates overfitting in small-data scenarios. Given our specific focus on low-data scenarios, we focus exclusively on MC dropout in this paper.
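The MC-dropout procedure can be sketched in a few lines: dropout stays active at inference time, the input is passed through the network several times, and the sample mean and sample standard deviation of the outputs serve as the prediction and its uncertainty. The toy two-layer regressor, its random weights, and the names `forward` and `mc_dropout_predict` are illustrative assumptions, not the models used in this paper; the defaults (30 samples, dropout rate 0.1) mirror the settings reported in Section 6.1.

```python
import numpy as np

def forward(x, params, rng, drop_rate):
    """One stochastic forward pass of a toy two-layer regressor (hypothetical model)."""
    W1, b1, w2, b2 = params
    h = np.tanh(x @ W1 + b1)
    # Inverted dropout: zero each unit with probability drop_rate, rescale the rest.
    mask = rng.random(h.shape) >= drop_rate
    h = h * mask / (1.0 - drop_rate)
    return h @ w2 + b2

def mc_dropout_predict(x, params, n_samples=30, drop_rate=0.1, seed=0):
    """Return per-instance mean and standard deviation over n_samples dropout passes."""
    rng = np.random.default_rng(seed)
    samples = np.stack([forward(x, params, rng, drop_rate) for _ in range(n_samples)])
    return samples.mean(axis=0), samples.std(axis=0)

# Toy usage with random weights and inputs.
init = np.random.default_rng(42)
d, hidden = 8, 16
params = (init.normal(size=(d, hidden)), np.zeros(hidden), init.normal(size=hidden), 0.0)
x = init.normal(size=(5, d))
mu, sigma = mc_dropout_predict(x, params)
```

Setting `drop_rate=0.0` makes every pass identical, so the estimated standard deviation collapses to zero; the spread comes entirely from the random dropout masks.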

The only work we are aware of for estimating uncertainty with transformers in a regression setting is Glushkova et al. (2021), who use ensemble estimation of uncertainty for machine translation quality evaluation, comparing the translated sentence with a reference translation. In contrast, we experiment in a cross-lingual setting, comparing a source sentence and its translation directly.

### 2.2 Selecting Clean Instances

To reduce the uncertainty from both data and model, we draw on approaches that can filter noisy labels from labeled data, and select clean instances from unlabeled data, thus eliminating aleatoric uncertainty, and reducing epistemic uncertainty due to the enhanced knowledge learned from the augmented data. In brief, we need a method to distinguish noisy and clean labels.

It has been shown that in data augmentation, self-training, and zero-shot learning, using the right sampling strategy is critical (Thakur et al., 2020; Wang et al., 2020b). However, previous work has mainly focused on label distribution balance, and lexical and semantic similarity, but not uncertainty.

In this work, we propose a simple method leveraging predictive confidence, to select high-quality instances, which is related to uncertainty-based sampling in active learning (Settles, 2009). However, most work in active learning has focused on classification rather than regression, either extracting the least probable or the most informative examples with large entropy (Settles and Craven, 2008; Pinsler et al., 2019; Radmard et al., 2021).

Our approach also has a similar flavor to self-paced curricular learning (Bengio et al., 2009; Kumar et al., 2010; Wan et al., 2020), in which the aim is to choose “hard” examples and gradually increase the difficulty of learning content; our selection criterion differs in that we target “clean” examples.

According to a recent review of uncertainty estimation for DNNs (Abdar et al., 2020), there is little work on using aleatoric uncertainty for denoising and sampling in NLP tasks. The most relevant work is that by Miok et al. (2020), who aim to guide the annotation process for the binary classification task of hate speech detection.

## 3 Tasks and Notation

In this paper, we consider text regression across three separate tasks, and a total of 10 datasets.

##### Tasks

**STS**: Semantic textual similarity assesses the degree of semantic equivalence between two pieces of text (Corley and Mihalcea, 2005). The aim is to predict a similarity score for a sentence pair (*S*1,*S*2), generally in the range [0,5], where 0 indicates complete dissimilarity and 5 indicates equivalence in meaning. As an example:

**S1**: *Total minutes spent in timed codes: 10 mins.*

**S2**: *Total minutes spent in timed codes: 33 mins.*

might be labeled 4, as the two texts differ only in very specific content (the number of minutes).

**SA**: Sentiment analysis rating involves predicting a sentiment score for a review *S*, in the range 1 (extremely negative) to 5 (extremely positive).

**DA**: Machine translation quality estimation, based on the direct assessment approach (Graham et al., 2017), aims to predict a normalised quality score for text pair (*S*1,*S*2), where *S*2 is machine translated from *S*1. As such, it is similar to STS, but differs in that it is cross-lingual.

##### Notation and Assumptions

Throughout this paper, raw examples, column vectors, and matrices are denoted in lower-case italics, bold, and upper-case italics, respectively (e.g., *x*, **x**, and *X*). *θ*_{encoder} and *θ*_{reg} represent parameters of the encoder and task-specific regression layers, and *f*(***θ***,⋅) refers to the whole model. Take a dataset $D = (x_1, y_1), \cdots, (x_i, y_i), \cdots, (x_N, y_N)$, where (*x*_{i}, *y*_{i}) is the *i*th instance, *y*_{i} ∈ ℝ, and **x**_{i} = *s*(*θ*_{encoder}, *x*_{i}) is the hidden state of *x*_{i}. The loss function is the empirical risk of the mean square error (MSE): $L = \frac{1}{N}\sum_{i=1}^{N}(f(\theta, x_i) - y_i)^2$.

##### Datasets

We evaluate on different-sized datasets across various domains for STS and SA, and three same-sized datasets for DA, summarized in Table 1.

Dataset | Size (train, test, dev) | Range | Domain |
---|---|---|---|
STS-B (2017) | 5749, 1379, 1500 | [0,5] | general |
MedSTS (2018) | 750, 318, - | [0,5] | clinical |
N2C2-STS (2019) | 1642, 412, - | [0,5] | clinical |
BIOSSES (2017) | 100, -, - | [0,4] | biomedical |
EBMSASS (2019) | 700, 300, - | [1,5] | biomedical |
Yelp (2018) | 7000, 1500, 1500 | [1,5] | product |
PeerRead (2018) | 713, 290, - | [1,5] | paper |
WMT en-zh (2020) | 7000, 1000, 1000 | [0,100] | high-resource |
WMT ru-en (2020) | 7000, 1000, 1000 | [0,100] | medium-resource |
WMT si-en (2020) | 7000, 1000, 1000 | [0,100] | low-resource |

For STS, we use: (1) one large-scale general dataset, STS-B (Cer et al., 2017); (2) two small clinical data sets, MedSTS (Wang et al., 2018) and N2C2-STS (Wang et al., 2020a); and (3) two small biomedical data sets, BIOSSES (Soğancıoğlu et al., 2017) and EBMSASS (Hassanzadeh et al., 2019), each of which is 5-way annotated.

For SA, we use: (1) a large-scale product review dataset, Yelp (Sabnis, 2018); and (2) a small paper review rating dataset, PeerRead (Kang et al., 2018), augmented with 399 Spanish paper reviews (Keith et al., 2017) machine-translated into English.

For DA, we use the three language pairs from WMT2020 (Lucia et al., 2020), en-zh, ru-en, and si-en, corresponding to high-, medium-, and low-resource settings in terms of the source language.

## 4 Method

In this section, we first introduce approaches for estimating regression uncertainty based on pre-trained LMs, then propose a simple method to sample “clean” instances from unlabeled data to augment training data based on predictive uncertainty. The proposed methods can be applied equally in semi-supervised and unsupervised settings (including active learning and self-learning).

### 4.1 Bayesian Regression using LMs

We investigate two alternatives to combine pre-trained transformer LMs with Bayesian estimation, either in a pipeline approach, or end-to-end. Figure 1 provides an overview.

##### Pipeline Training

To estimate probability distributions for the regression task of document quality assessment, Shen et al. (2019) used a Gaussian process (GP) with Radial Basis Function (RBF) kernel function over hand-crafted features. We build off this in applying Bayesian linear regression and sparse GP regression to pre-trained sentence encoders, such as Sentence-BERT (SBERT; Reimers and Gurevych, 2019). For text input *x*, we generate **x** = *s*(*θ*_{encoder},*x*) ∈ℝ^{d}. In this way, we leverage contextualized sentence representations, while avoiding the complexity of estimating uncertainty directly from a large-scale Bayesian neural network.

**Bayesian Linear Regression**: The prior distribution of a Bayesian linear layer with parameters **w** and *b* is set to be a Gaussian distribution:

*ε* is the observation noise, which is assumed to be an independent and identically distributed random variable $\epsilon \sim \mathcal{N}(0, \sigma^2)$.

**Gaussian Processes (GPs)** (Rasmussen and Williams, 2005) are a natural way to generalize the concept of a multivariate normal distribution determined by a mean vector ***μ*** and covariance matrix ***Σ***, to describe a real-valued function. They provide a mathematically elegant framework for Bayesian inference and offer principled uncertainty estimates for regression problems with a closed-form posterior (Leibfried et al., 2020). Given (**x**_{i}, *y*_{i}), *y*_{i} = *f*(**x**_{i}) + *ε*_{i}, where *f*(⋅) is a real-valued function with input **x**_{i} that is sampled from a GP, and where *ε*_{i} are scalar independent and identically distributed random variables corresponding to observation noise.

We assume that *f*(⋅) is distributed according to a GP with mean function *m*(**x**) and covariance or kernel function *k*(**x**, **x**′), corresponding to ***μ*** and ***Σ*** of a multivariate normal distribution. Following common practice, we fix the mean function to zero, and use an RBF as the kernel function (Preoţiuc-Pietro and Cohn, 2013; Beck et al., 2014; Bitvai and Cohn, 2015; Shen et al., 2019).

Computing the exact posterior requires the storage and inversion of an (*N* × *N*) matrix, where storage is quadratic in the amount of training data *N* and inversion has cubic computational complexity, both of which are infeasible for large datasets. Thus we use sparse GPs, which approximate an exact GP by using a small set of latent inducing points (Titsias, 2009), learned by variational inference.
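To make the cost of exact inference concrete, the sketch below implements exact GP regression with a zero mean function and an RBF kernel in NumPy. The Cholesky factorization of the *N* × *N* kernel matrix is the cubic-cost step that sparse GPs are designed to avoid. The function names, the hyperparameter values (`lengthscale`, `variance`, `noise`), and the toy data are illustrative assumptions; the experiments in this paper use sparse variational GPs rather than this exact form.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def gp_posterior(X, y, X_new, noise=0.1, **kern):
    """Posterior mean and std of the latent function f at X_new (zero prior mean)."""
    K = rbf_kernel(X, X, **kern) + noise**2 * np.eye(len(X))
    L = np.linalg.cholesky(K)  # O(N^3) time and O(N^2) memory
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    K_s = rbf_kernel(X, X_new, **kern)
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(rbf_kernel(X_new, X_new, **kern)) - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

# Toy usage: one query inside the training range, one far outside it.
X = np.linspace(0.0, 5.0, 20)[:, None]
y = np.sin(X[:, 0])
X_new = np.array([[2.5], [50.0]])
mean, std = gp_posterior(X, y, X_new)
```

For the query far from the training data, the posterior reverts to the prior (zero mean, standard deviation equal to the square root of the kernel variance), which is the behaviour that makes GP uncertainty informative for out-of-distribution inputs.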

##### End-to-end Training

Rather than pre-training an LM and task-specific model separately, Xue et al. (2021) jointly trained them by only applying Bayesian estimation to a subset of the model parameters. This requires training entirely from scratch, while we seek to leverage pre-trained LMs. We apply Bayesian inference to task-specific layers, keeping parameters of the LM deterministic and making task-specialised parameters stochastic during fine-tuning. Importantly, being deterministic is not equivalent to being frozen: Parameters are updated as in non-Bayesian optimization, rather than kept fixed during back-propagation.

To increase randomness, we evaluate on two task-specific networks with more stochastic parameters than a single-layer linear regression network used in Pipeline Training, as detailed below.

**Bayesian Two-layer MLP**: The linear regression layers take the hidden state **h** ∈ ℝ^{d}, through a two-layer MLP with $\tanh$ activation function:

**W** ∈ ℝ^{d×d}, **b**, **w** ∈ ℝ^{d} and *b* ∈ ℝ are trainable parameters.

**Bayesian Hierarchical Convolution**: Drawing on the finding that a hierarchical convolution neural network (HConv) is effective in low-resource settings (Wang et al., 2020b), and that increasing the capacity of task-specific layers can boost performance (Chung et al., 2020), we train a large-capacity network as follows. HConv is structured as a two-layer convolutional network, with kernel size *k* = 2,3,4 in the first layer and *k* = 2 in the second (Wang et al., 2020b). The prior distributions of the weights and bias are based on Eq. (2) for Bayesian inference, and the inference method follows Bayes by Backprop (Blundell et al., 2015).

### 4.2 Predictive Uncertainty-based Sampling

Given an uncertainty model *f*(***θ***,⋅), and a (large-scale) unlabeled data pool $D_u = x_1, x_2, \cdots, x_i, \cdots, x_U$, the distribution of the predicted *y*_{i} for input *x*_{i} is $y_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$, where *μ*_{i} and *σ*_{i} are the mean and standard deviation of the normal distribution of *y*_{i}.

Our aim is to sample a subset $D_u'$ from $D_u$ on which the uncertainty model is expected to be sufficiently confident, that is, to have a confidence interval as narrow as possible under a given confidence level. For example, under 99% confidence, the confidence interval [*μ*_{i} − 2.58*σ*_{i}, *μ*_{i} + 2.58*σ*_{i}] is expected to be narrow. Put differently, the distribution is concentrated around the mean with small standard deviation.

Based on this, we propose a simple instance selection method based on predictive uncertainty. For each instance *x*_{i} in $D_u$, if *σ*_{i} < *τ*, select *x*_{i}: $D_u' \leftarrow x_i$. The threshold *τ* is a global hyperparameter tuned over the validation set, or in the case of self-training and active learning, using a heuristic strategy.^{1}
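The selection rule amounts to a single comparison per instance. The sketch below implements it together with the extremum-based heuristic for *τ* that this section describes for self-training and active learning (the mean standard deviation of instances whose predicted mean falls in an extreme label range). The function names, the default STS ranges, and the toy predictions are illustrative assumptions.

```python
import numpy as np

def heuristic_threshold(mu, sigma, extreme_ranges=((0.0, 1.0), (4.0, 5.0))):
    """tau = mean std over instances whose predicted mean lies in an extreme range."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    in_extreme = np.zeros(mu.shape, dtype=bool)
    for low, high in extreme_ranges:
        in_extreme |= (mu >= low) & (mu <= high)
    if not in_extreme.any():
        raise ValueError("no predictions fall in the extreme ranges")
    return float(sigma[in_extreme].mean())

def select_confident(sigma, tau):
    """Indices of the instances whose predictive std is strictly below tau."""
    return np.flatnonzero(np.asarray(sigma, float) < tau)

# Toy predictions (mean, std) for five unlabeled STS pairs.
mu = np.array([0.4, 2.5, 4.6, 3.1, 0.9])
sigma = np.array([0.10, 0.50, 0.35, 0.12, 0.30])
tau = heuristic_threshold(mu, sigma)
selected = select_confident(sigma, tau)
```

Note that the threshold is computed only from the extreme-range instances, but it is then applied to the whole pool, so a mid-range instance with low predictive standard deviation (index 3 in the toy data) is also selected.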

The strategy is based on the observation that the model can generally predict precisely for instances of extreme polarity, such as labels in the ranges [0,1] and [4,5] for STS. We posit that cases whose predictive uncertainty is at the same level as these well-predicted examples are also predicted accurately. Formally, after inference, the unlabeled data pool is $D_u = (x_i, \mu_i, \sigma_i)$, *i* ∈ [1,*U*], where *U* is the number of unlabeled instances. The standard deviation of all well-predicted examples can be vectorized as ***σ*** = [*σ*_{i}], where *σ*_{i} is the std whose *μ*_{i} is at an extremum, such as 0 ≤ *μ*_{i} ≤ 1 or 4 ≤ *μ*_{i} ≤ 5 for STS. We then set *τ* = mean(***σ***).

## 5 Uncertainty Evaluation Metrics

Evaluating uncertainty estimates of predictions is challenging in a regression setting, as the “ground truth” uncertainty is usually not available (Lakshminarayanan et al., 2017). To evaluate model predictions, we consider four metrics.

**Pearson Correlation:** It is vital to assess the predictive accuracy of the system, regardless of the uncertainty estimate. We use Pearson correlation *r* to evaluate the correlation between the system’s average predictions and ground truth quality scores.

**Calibration Error (Cal):** One way to understand if models can be trusted is by analysing whether they are calibrated. Gneiting et al. (2007) defined calibration in a regression setting as the asymptotic consistency between the probabilistic forecasts *F*_{i} and the true data-generating distributions *G*_{i}, with the index *i* referring to each example.

*F*_{i} is the cumulative probability distribution *P*(*Y* ≤ *y*_{i}), and *G*_{i} is generally estimated by empirical distribution functions based on the observations only. So calibration measures if the predictive confidence estimates are aligned with the empirical correctness likelihoods. Given a confidence level *p*_{j}, the empirical accuracy is calculated:

$F_i^{-1}$ is used to denote the quantile function $F_i^{-1}(p) = \inf\{y : p \le F_i(y)\}$, that is, a mapping from [0,1] $\to$ *Y*. The expected calibration error $cal = \sum_{j=1}^{m} w_j \cdot (p_j - \hat{p}_j)^2$, with *m* confidence levels 0 ≤ *p*_{1} < ⋯ < *p*_{m} ≤ 1, is the distance of predictive confidence away from the empirical accuracy.
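As a minimal sketch of how this metric can be computed for Gaussian predictive distributions, the code below takes the empirical accuracy at level *p*_{j} to be the fraction of targets that fall at or below the predicted *p*_{j}-quantile, which is equivalent to checking *F*_{i}(*y*_{i}) ≤ *p*_{j}. That definition, the uniform weights *w*_{j} = 1, the nine evenly spaced confidence levels, and the function names are assumptions made for illustration rather than the exact configuration used in this paper.

```python
import math
import numpy as np

def gaussian_cdf(y, mu, sigma):
    """F_i(y_i) for Gaussian predictive distributions N(mu_i, sigma_i^2)."""
    y, mu, sigma = (np.asarray(a, float) for a in (y, mu, sigma))
    z = (y - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + np.vectorize(math.erf)(z))

def calibration_error(y, mu, sigma, levels=None, weights=None):
    """cal = sum_j w_j * (p_j - p_hat_j)^2 over the chosen confidence levels."""
    levels = np.linspace(0.1, 0.9, 9) if levels is None else np.asarray(levels, float)
    weights = np.ones_like(levels) if weights is None else np.asarray(weights, float)
    cdf_at_target = gaussian_cdf(y, mu, sigma)
    # Empirical frequency of targets at or below the predicted p_j-quantile.
    p_hat = np.array([(cdf_at_target <= p).mean() for p in levels])
    return float(np.sum(weights * (levels - p_hat) ** 2))

# Toy usage: targets drawn from the predicted distributions, then an
# over-confident variant that reports a fifth of the true standard deviation.
rng = np.random.default_rng(0)
mu = rng.normal(size=5000)
sigma = np.full(5000, 0.5)
y = mu + sigma * rng.normal(size=5000)
cal_good = calibration_error(y, mu, sigma)
cal_over = calibration_error(y, mu, 0.2 * sigma)
```

In this toy setting the well-specified predictor yields a calibration error close to zero, while shrinking the reported standard deviation pushes the empirical frequencies away from the nominal levels and the error grows accordingly.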

**Negative Log-Probability Density (NLPD)** complements Cal’s equal treatment of over- and under-confidence. It penalises over-confidence more strongly through logarithmic scaling: $L_{NLPD} = -\frac{1}{n}\sum_{i=1}^{n}\log p(y_i = t_i \mid x_i)$, thereby favouring under-confident predictions. In Gaussian predictive distributions with mean *m*_{i} and variance *v*_{i}, the NLPD loss incurred for predicting at input *x*_{i} with true associated target *t*_{i} is given by:
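For a Gaussian predictive density, the negative log-density at target *t*_{i} works out to ½ log(2π*v*_{i}) + (*t*_{i} − *m*_{i})² / (2*v*_{i}). The sketch below averages this quantity over instances; it keeps the full normalizing constant, so values may differ from other formulations by an additive constant. The function name and the toy values are illustrative assumptions.

```python
import numpy as np

def gaussian_nlpd(targets, means, variances):
    """Mean negative log predictive density under N(m_i, v_i) evaluated at t_i."""
    t, m, v = (np.asarray(a, float) for a in (targets, means, variances))
    per_instance = 0.5 * np.log(2.0 * np.pi * v) + (t - m) ** 2 / (2.0 * v)
    return float(per_instance.mean())

# Same prediction error of 1.0, reported with two different variances.
nlpd_overconfident = gaussian_nlpd([1.0], [0.0], [1e-4])
nlpd_moderate = gaussian_nlpd([1.0], [0.0], [1.0])
```

Because the squared error is divided by the variance, a wrong prediction reported with a very small variance yields an NLPD that is orders of magnitude larger than the same error reported with a moderate variance, which is the asymmetry the metric is meant to capture.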

**Sharpness (Shp):** The metrics above do not account for the concentration of the predictive distributions, which generally favours predictors that produce wide and uninformative confidence intervals. To guarantee useful uncertainty estimation, confidence intervals should not only be calibrated, but also sharp and “tight” around the predicted value. The numerical width of prediction intervals (Gneiting et al., 2007; Song et al., 2019) and the mean of variance (Kuleshov et al., 2018; Zelikman et al., 2020) are often used to quantify sharpness. We apply the latter in our work, with a lower score implying higher sharpness.

To interpret mixed results, for example when a model attains the best sharpness but with infinitely large NLPD, we suggest that Pearson correlation (*r*) has primacy, followed by Cal and NLPD, then Shp. That is, when models have comparable *r*, the comparison of Cal/NLPD is more meaningful, and if those are also similar, Shp should be considered; otherwise, it’s largely meaningless.

## 6 Evaluation of Uncertainty Estimation

We expect that the incorporation of uncertainty estimation should not harm predictive performance compared to point estimation without uncertainty, in both in- and out-of-domain scenarios. Additionally, uncertainty estimates should reflect “what the model does not know”, making it possible to determine whether a prediction can be trusted based on the output distribution. This is quantified intrinsically with Cal and NLPD (the lower, the better), and extrinsically via instance selection in Section 7.

### 6.1 Experimental Setup

**Pipeline Training:** We use SBERT as an off-the-shelf sentence encoder. We fine-tune SBERT separately over each STS corpus based on the pre-trained bert-base-nli-mean-tokens, using the same configuration as the original paper (4 epochs with training batch size of 16). For the cross-lingual DA task, we use *distiluse-base-multilingual-cased-v1*.

To represent a sentence pair (*S*1,*S*2) using SBERT, we use the concatenation of the embeddings *u* ⊕ *v*, along with their absolute difference |*u* − *v*| and element-wise multiplication *u* × *v*. “SBERT Bayesian LR” and “SBERT Sparse GP Regression” indicate that features are fed into Bayesian LR and sparse GP regression, respectively, implemented in pyro.^{2}
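The pair representation can be sketched as a simple concatenation. The snippet assumes the element-wise product is taken between the two sentence embeddings *u* and *v*, and that the four parts are concatenated in the order listed in the text; the function name and the toy vectors are hypothetical.

```python
import numpy as np

def pair_features(u, v):
    """Concatenate u, v, |u - v| and the element-wise product u * v."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    return np.concatenate([u, v, np.abs(u - v), u * v], axis=-1)

# Toy usage with 2-dimensional "embeddings"; real SBERT vectors are much longer.
features = pair_features([1.0, 2.0], [3.0, -1.0])
```

For embeddings of dimension *d*, the resulting feature vector has dimension 4*d*, and this is the input passed to the downstream regressor.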

**End-to-End Training:** We apply pre-trained BERT as the LM encoder (Devlin et al., 2019), using bert-base-uncased for monolingual tasks and bert-base-multilingual-cased for cross-lingual tasks. The input format is [CLS] S1 [SEP] S2 [SEP] for text pair (*S*1,*S*2), and [CLS] S [SEP] for a single text *S*. BERT Bayesian LR and BERT Bayesian ConvLR denote task-specific networks based on a two-layer MLP and HConv, respectively, implemented based on the *Huggingface Transformer* framework and blitz for BBB estimation (Esposito, 2020).

**MC-Dropout:** We apply MC-dropout to base models BERT LR and BERT ConvLR, with dropout rate = 0.1 and 30 iterations of sampling.^{3}

**Point Estimation:** In addition to the uncertainty estimation approaches, we also compare with four non-Bayesian methods: (1) cosine similarity; (2) optimization of deterministic LR with SBERT (SBERT LR); (3) fine-tuned BERT LR; and (4) fine-tuned BERT ConvLR.

**Training Configuration:** The maximum sequence length is set to 128 for STS and DA, and 256 for SA. The learning rate (lr), training batch size, and training epochs are optimized over the validation set. In the situation that a validation set is not available (i.e., EBMSASS and MedSTS), we provisionally split the training data into 80%:20% training:dev data, and tune hyperparameters over the dev data. We then retrain the model over the full training dataset, and evaluate on the test set. Tuned hyperparameter settings of the pipeline are shown in Table 3. End-to-end is based on grid-searching over [8, 16, 32] × [1e-5, 2e-5] × [1, 2, 3, ⋯ 10] for batch size, lr, and epochs, respectively. Generally, the best setting is batch size = 16, lr = 2e-5, and epochs = 3, although BERT ConvLR based on BBB requires more epochs to converge. Further details of the training regimen and hyperparameter settings are provided in our Github repository.^{4}

### 6.2 Sentence-Pair STS

In this section, we compare the various uncertainty estimation approaches from Section 4.1 over STS, in terms of correlation and the metrics for uncertainty estimation, aiming to empirically establish:

Which uncertainty estimation strategy is most accurate, most calibrated, and sharpest?

Which method performs best in out-of-domain settings?

#### 6.2.1 In-Domain Performance

To observe the influence of data size and domain distribution on uncertainty estimation, we experiment over the large-scale general-domain STS-B, in addition to the smaller-scale domain-specific MedSTS (clinical domain) and EBMSASS (biomedical domain) datasets. There are three main findings from the results in Table 2.

Model | STS-B *r* ↑ | STS-B Cal ↓ | STS-B NLPD ↓ | STS-B Shp ↓ | EBMSASS *r* ↑ | EBMSASS Cal ↓ | EBMSASS NLPD ↓ | EBMSASS Shp ↓ | MedSTS *r* ↑ | MedSTS Cal ↓ | MedSTS NLPD ↓ | MedSTS Shp ↓ | Yelp *r* ↑ | Yelp Cal ↓ | Yelp NLPD ↓ | Yelp Shp ↓ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SBERT Cosine similarity | 0.842 | n/a | n/a | n/a | 0.773 | n/a | n/a | n/a | 0.784 | n/a | n/a | n/a | - | n/a | n/a | n/a |
SBERT LR | 0.835 | n/a | n/a | n/a | 0.743 | n/a | n/a | n/a | 0.776 | n/a | n/a | n/a | 0.666 | n/a | n/a | n/a |
SBERT Bayesian LR | 0.810 | 0.046 | 0.648 | 1.632 | 0.688 | 0.443 | 1.095 | 2.156 | 0.740 | 0.101 | 0.801 | 2.092 | 0.671 | 0.019 | 0.447 | 0.753 |
SBERT Sparse GP Regression | 0.847 | 0.065 | 0.614 | 1.621 | 0.788 | 0.195 | 0.541 | 1.627 | 0.781 | 0.073 | 0.499 | 1.453 | 0.689 | 0.049 | 0.573 | 1.507 |
BERT LR | 0.868 | n/a | n/a | n/a | 0.914 | n/a | n/a | n/a | 0.858 | n/a | n/a | n/a | 0.826 | n/a | n/a | n/a |
BERT ConvLR | 0.855 | n/a | n/a | n/a | 0.922 | n/a | n/a | n/a | 0.846 | n/a | n/a | n/a | 0.822 | n/a | n/a | n/a |
BERT Bayesian LR (BBB) | 0.848 | 0.521 | $+\infty$ | 0.005 | 0.914 | 0.669 | 1177.2 | 0.005 | 0.848 | 0.514 | 6594.3 | 0.006 | 0.827 | 0.531 | 3908.6 | 0.083 |
BERT Bayesian ConvLR (BBB) | 0.849 | 0.495 | 2061.0 | 0.015 | 0.898 | 0.618 | 327.3 | 0.010 | 0.835 | 0.506 | 1037.2 | 0.017 | 0.797 | 1.513 | 119.2 | 0.089 |
BERT LR MC dropout | 0.868 | 0.181 | 4.659 | 0.215 | 0.921 | 0.054 | 0.036 | 0.140 | 0.859 | 0.163 | 4.118 | 0.168 | 0.827 | 0.267 | 7.285 | 0.153 |
BERT ConvLR MC dropout | 0.855 | 0.202 | 5.830 | 0.209 | 0.922 | 0.093 | 2.137 | 0.085 | 0.852 | 0.219 | 6.402 | 0.146 | 0.823 | 0.291 | 8.214 | 0.150 |

Dataset | LR lr | LR epoch | Bayes LR lr | Bayes LR epoch | GP Reg lr | GP Reg epoch |
---|---|---|---|---|---|---|
STS-B | 0.1 | 100 | 0.01 | 2500 | 0.1 | 25 |
EBMSASS | 0.1 | 15 | 0.01 | 10000 | 0.1 | 25 |
MedSTS | 0.1 | 100 | 0.01 | 8500 | 0.1 | 25 |
Yelp | 0.1 | 600 | 0.01 | 2500 | 0.1 | 25 |
en-zh | 0.1 | 50 | 0.03 | 300 | 0.1 | 200 |
ru-en | 0.1 | 40 | 0.03 | 400 | 0.1 | 200 |
si-en | 0.1 | 199 | 0.03 | 300 | 0.1 | 1000 |

**Uncertainty models do not degrade accuracy.** With SBERT, GP-based models have higher correlation than either cosine similarity or LR. In the case of BERT, estimation by MC-dropout is competitive with corresponding point estimates. Thus, they have comparable raw performance, in addition to providing uncertainty estimates.

**End-to-end training based on BERT results in higher correlation and narrower confidence intervals, but poorer calibration and NLPD**. Results over the three datasets show that end-to-end training based on BERT overall performs much better than pipeline training using SBERT, but BERT-based models are poorly calibrated compared to SBERT-based Bayesian linear regression and sparse GP regression using fixed sentence features (as can be seen in the higher NLPD numbers for BERT-based models). This is consistent with prior work (Guo et al., 2017).

**MC-dropout is superior to BBB inference, and sparse GP regression performs better than SBERT Bayesian LR, regardless of data size and domain.** Under both BERT LR and ConvLR, MC-dropout achieves higher or equal correlation, and much lower Cal and NLPD than BBB in end-to-end training. Among methods based on SBERT, sparse GP regression requires many fewer iterations to converge, and outperforms Bayesian LR in correlation and NLPD, and is comparable for Cal and Shp.

#### 6.2.2 Out-of-Domain Performance

Apart from in-domain evaluation, out-of-domain performance is also an important concern. We expect that a model trained on domain A will generate more uncertain predictions on domain B, with lower correlation, larger Cal and NLPD, and a wider confidence interval (Lakshminarayanan et al., 2017). Given two models trained on domain A with similar point-estimate performance on domain B (i.e., comparable *r*), the model with the lower NLPD is arguably the better model, as this indicates that the model gives sharper distributions when the prediction is correct, and flatter ones when wrong.
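To make the calibration side of this comparison concrete, the sketch below computes one common quantile-coverage form of regression calibration error for Gaussian predictions. It is an illustration only: the Cal metric used in this paper is defined in an earlier section and may differ in weighting or in the number of confidence levels. The function names and the choice of 10 levels are assumptions.

```python
import math


def gaussian_cdf(y, mean, std):
    """CDF of N(mean, std^2) evaluated at y."""
    return 0.5 * (1.0 + math.erf((y - mean) / (std * math.sqrt(2.0))))


def calibration_error(targets, means, stds, n_levels=10):
    """Sum of squared gaps between nominal and observed coverage.

    For each confidence level p, a calibrated model should place a
    fraction p of the gold targets at or below its predicted p-quantile.
    """
    # Probability integral transform of each gold target.
    pits = [gaussian_cdf(t, m, s) for t, m, s in zip(targets, means, stds)]
    error = 0.0
    for j in range(1, n_levels + 1):
        p = j / n_levels  # nominal confidence level (assumed uniform grid)
        observed = sum(pit <= p for pit in pits) / len(pits)
        error += (p - observed) ** 2
    return error
```

Under this formulation, a model whose predicted standard deviations are far smaller than its actual errors pushes most targets into the tails of its predictive distributions, so observed coverage departs from the nominal levels and the error grows.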

Using models fine-tuned over the general-domain STS-B, we evaluate on the biomedical EBMSASS and clinical MedSTS test sets. In contrast with the results in Table 2, in which models have been fine-tuned with in-domain labeled data, Table 4 shows a steep decline in *r* of more than 10 points on average for EBMSASS, and 7 for MedSTS. Meanwhile, both Cal and NLPD increase by a large margin.

| Model | EBMSASS *r* ↑ | EBMSASS Cal ↓ | EBMSASS NLPD ↓ | EBMSASS Shp ↓ | MedSTS *r* ↑ | MedSTS Cal ↓ | MedSTS NLPD ↓ | MedSTS Shp ↓ | PeerRead *r* ↑ | PeerRead Cal ↓ | PeerRead NLPD ↓ | PeerRead Shp ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SBERT Cosine similarity | 0.716 | n/a | n/a | n/a | 0.731 | n/a | n/a | n/a | - | n/a | n/a | n/a |
| SBERT LR | 0.696 | n/a | n/a | n/a | 0.718 | n/a | n/a | n/a | 0.256 | n/a | n/a | n/a |
| SBERT Bayesian LR | 0.684 | 0.091 | 0.400 | 1.325 | 0.672 | 0.038 | 0.568 | 1.506 | 0.241 | 0.116 | 1.018 | 1.245 |
| SBERT Sparse GP Regression | 0.726 | 0.211 | 0.586 | 1.609 | 0.723 | 0.129 | 0.634 | 1.604 | 0.427 | 0.021 | 0.771 | 1.339 |
| BERT LR | 0.838 | n/a | n/a | n/a | 0.786 | n/a | n/a | n/a | 0.669 | n/a | n/a | n/a |
| BERT ConvLR | 0.806 | n/a | n/a | n/a | 0.776 | n/a | n/a | n/a | 0.627 | n/a | n/a | n/a |
| BERT Bayesian LR | 0.867 | 0.625 | 5165 | 0.005 | 0.768 | 0.619 | 11081 | 0.005 | 0.694 | 0.522 | 7606. | 0.009 |
| BERT Bayesian ConvLR | 0.811 | 0.714 | 1043. | 0.011 | 0.770 | 0.523 | 1527. | 0.017 | 0.608 | 0.990 | 189.0 | 0.086 |
| BERT LR MC dropout | 0.838 | 0.280 | 3.517 | 0.137 | 0.795 | 0.199 | 5.060 | 0.188 | 0.676 | 0.400 | 21.75 | 0.160 |
| BERT ConvLR MC dropout | 0.814 | 0.194 | 4.649 | 0.153 | 0.788 | 0.240 | 8.447 | 0.158 | 0.635 | 0.456 | 36.37 | 0.138 |

**MC-dropout is not always best.** Interestingly, we find that BERT Bayesian LR performs well in this setting, obtaining the highest correlation and smallest Shp on EBMSASS and PeerRead. This suggests that BERT Bayesian LR has better generalizability over these two domains, but the substantially higher NLPD also reveals that its predictions are over-confident. By and large, MC-dropout stably offers accurate and calibrated predictions in out-of-domain settings. ConvLR in particular outperforms Bayesian inference across all metrics.

**BERT ConvLR tends to be inferior to BERT LR in the out-of-domain setting.** We speculate this is because of its smaller capacity to memorize task-specific knowledge, as eight layers of the BERT encoder are frozen in BERT ConvLR.

### 6.3 Single-sentence Sentiment Rating

We perform in-domain SA evaluation on Yelp, and out-of-domain evaluation by applying the fine-tuned Yelp model to PeerRead test data. We find:

**Fine-tuned sentence embeddings are vital to the performance of pipeline uncertainty estimation.** As shown in Table 2, performance over Yelp, EBMSASS, and MedSTS based on SBERT is substantially worse than with BERT. We speculate this is due to poor feature representations. That is, on the STS task, we continue to fine-tune sentence embeddings over each STS dataset. As a result of being unable to fine-tune SBERT on SA (as there is no paired data), the representations for Yelp are pre-trained using SNLI only, which is neither task- nor domain-specific. Compared with the similarly sized STS-B where embeddings are fine-tuned, the performance gap for Yelp between SBERT and BERT is more than 0.15, but less than 0.02 for STS-B. Equally, though we fine-tune SBERT for EBMSASS and MedSTS, each has fewer than 1k training instances. Poor domain-specific sentence embeddings result in gaps of 0.15 and 0.07.

Meanwhile, for SBERT in the upper half of Table 4, the out-of-domain correlation on PeerRead is extremely poor; the gap of 6 points on EBMSASS and MedSTS relative to in-domain results (0.78 in Table 2) further confirms our hypothesis.

**LR outperforms ConvLR in out-of-domain SA.** In both point and Bayesian estimates, LR performs better than ConvLR (Table 4), similar to STS.

### 6.4 Cross-lingual Sentence-pair DA

We evaluate on machine translation quality estimation (QE) over three language pairs using DA, using 7,000 training instances in each case. The results are shown in Table 5. We first observe that using embeddings directly from pretrained SBERT with cosine similarity underperforms other methods that involve fine-tuning.

| Model | en-zh *r* ↑ | en-zh Cal ↓ | en-zh NLPD ↓ | en-zh Shp ↓ | ru-en *r* ↑ | ru-en Cal ↓ | ru-en NLPD ↓ | ru-en Shp ↓ | si-en *r* ↑ | si-en Cal ↓ | si-en NLPD ↓ | si-en Shp ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SBERT Cosine similarity | 0.115 | n/a | n/a | n/a | 0.428 | n/a | n/a | n/a | 0.097 | n/a | n/a | n/a |
| SBERT LR | 0.270 | n/a | n/a | n/a | 0.616 | n/a | n/a | n/a | 0.397 | n/a | n/a | n/a |
| SBERT Bayesian LR | 0.280 | 0.025 | 0.155 | 0.908 | 0.625 | 0.013 | 0.223 | 0.771 | 0.371 | 0.013 | 0.193 | 0.934 |
| SBERT Sparse GP Regression | 0.384 | 0.026 | 0.143 | 0.892 | 0.626 | 0.007 | 0.207 | 0.776 | 0.366 | 0.010 | 0.191 | 0.931 |
| BERT LR | 0.395 | n/a | n/a | n/a | 0.621 | n/a | n/a | n/a | 0.504 | n/a | n/a | n/a |
| BERT ConvLR | 0.436 | n/a | n/a | n/a | 0.641 | n/a | n/a | n/a | 0.524 | n/a | n/a | n/a |
| BERT Bayesian LR | 0.385 | 0.726 | 11600 | 0.005 | 0.644 | 0.515 | 11666 | 0.005 | 0.506 | 0.568 | 10971 | 0.005 |
| BERT Bayesian ConvLR | 0.378 | 1.780 | 683.7 | 0.066 | 0.609 | 1.775 | 723.4 | 0.069 | 0.503 | 1.758 | 638.5 | 0.059 |
| BERT LR MC dropout | 0.407 | 0.250 | 9.216 | 0.190 | 0.637 | 0.315 | 17.00 | 0.126 | 0.527 | 0.178 | 6.578 | 0.200 |
| BERT ConvLR MC dropout | 0.441 | 0.268 | 13.33 | 0.127 | 0.649 | 0.333 | 22.28 | 0.106 | 0.530 | 0.275 | 10.19 | 0.133 |

**Traditional Bayesian LR and GP models achieve results competitive with deep uncertainty models** when the input sentence embedding is expressive enough, and with smaller Cal and NLPD. Related uncertainty prediction work (Glushkova et al., 2021) argued that GPs are not competitive or easy to integrate with current neural architectures. In contrast, our results demonstrate that GPs can achieve comparable results to deep neural networks, while also being better calibrated.

ConvLR consistently outperforms LR for BERT-based models. In the cross-lingual scenario, SBERT models have smaller Cal and NLPD, and larger Shp, analogous to the monolingual setting.

## 7 Instance Selection Through Uncertainty

In self-training, a model is first trained using labeled data, then used to predict labels for unlabeled data instances. Instances with higher-probability predictions are then adopted as pseudo-labels, and used to re-train the model in conjunction with the labeled training data. Active learning is similar, except that instances are selected for explicit human labelling rather than pseudo-labeled, often based on estimates of model confidence or uncertainty. In both tasks, accurate estimation of labelling (un)certainty is critical.

In this section, we evaluate the uncertainty-based instance selection method from Section 4.2 in the settings of self-training and active learning, over the tasks of STS, SA rating, and cross-lingual DA.

### 7.1 Self-training STS and SA

In self-training, we experiment in both semi-supervised (limited gold-standard training data) and zero-shot scenarios, over three low-resource datasets: MedSTS, N2C2-STS, and PeerRead.

**Experimental Setup:** As we require high correlation to ensure high-quality pseudo-labels, and lower Cal and NLPD to guarantee that predictions are neither over- nor under-confident, we employ MC-dropout over LR and ConvLR. Additionally, to alleviate domain data sparsity, we first fine-tune the regressor on two general datasets, STS-B for STS and Yelp for SA (general-purpose STS/SA), which also provide the proxy for the zero-shot setting. We continue to fine-tune on domain training data in the semi-supervised scenario, and predict (*μ*, *σ*) for $D_u$ by applying dropout 30 times. All results in Table 6 are obtained using train batch size = 16, learning rate = 2e-5, and training epochs = 3.
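The prediction step in this setup, obtaining (*μ*, *σ*) from 30 stochastic forward passes, can be sketched with a toy linear regressor standing in for the fine-tuned BERT model. The regressor, its weights, the dropout rate of 0.1, and all function names are illustrative assumptions; only the idea of repeating dropout-enabled passes and summarizing them by mean and standard deviation comes from the text.

```python
import random
import statistics


def dropout_forward(weights, features, rate, rng):
    """One stochastic pass of a toy linear regressor with inverted dropout."""
    keep_prob = 1.0 - rate
    total = 0.0
    for w, x in zip(weights, features):
        # Each input unit is kept with probability keep_prob and rescaled.
        if rng.random() < keep_prob:
            total += w * x / keep_prob
    return total


def mc_dropout_predict(weights, features, n_samples=30, rate=0.1, seed=0):
    """Return (mu, sigma) over n_samples dropout-enabled forward passes."""
    rng = random.Random(seed)
    samples = [dropout_forward(weights, features, rate, rng)
               for _ in range(n_samples)]
    return statistics.mean(samples), statistics.pstdev(samples)
```

With `rate=0.0` every pass is identical, so the standard deviation collapses to zero; any non-zero rate yields a spread that serves as the per-instance uncertainty estimate.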

| Model | N2C2-STS $r_1$ / $r_2$ ↑ | N2C2-STS Cal ↓ | N2C2-STS NLPD ↓ | MedSTS $r_1$ / $r_2$ ↑ | MedSTS Cal ↓ | MedSTS NLPD ↓ | PeerRead $r_1$ / $r_2$ ↑ | PeerRead Cal ↓ | PeerRead NLPD ↓ |
|---|---|---|---|---|---|---|---|---|---|
| **Semi-supervised:** | | | | | | | | | |
| BERT LR | 0.853 / 0.857 | 0.384 | 6.571 | 0.858 / 0.859 | 0.158 | 3.903 | 0.686 / 0.686 | 0.370 | 15.95 |
| + $D_u$ | 0.861 / 0.862 | 0.511 | 9.232 | 0.860 / 0.861 | 0.224 | 5.267 | 0.655 / 0.656 | 0.394 | 19.26 |
| + $D_u'$ | 0.860 / 0.864 | 0.493 | 8.476 | 0.863 / 0.866 | 0.181 | 4.758 | 0.720 / 0.720 | 0.340 | 19.89 |
| BERT ConvLR | 0.874 / 0.875 | 0.509 | 11.51 | 0.846 / 0.853 | 0.201 | 5.968 | 0.691 / 0.692 | 0.346 | 16.98 |
| + $D_u$ | 0.875 / 0.876 | 0.522 | 13.50 | 0.846 / 0.855 | 0.215 | 6.403 | 0.671 / 0.683 | 0.453 | 25.50 |
| + $D_u'$ | 0.875 / 0.879 | 0.535 | 11.44 | 0.857 / 0.864 | 0.222 | 6.129 | 0.699 / 0.697 | 0.374 | 21.78 |
| **Zero-shot:** | | | | | | | | | |
| BERT LR | 0.682 / 0.663 | 0.568 | 17.08 | 0.786 / 0.795 | 0.199 | 5.060 | 0.669 / 0.676 | 0.400 | 21.75 |
| + $D_u$ | 0.687 / 0.673 | 0.624 | 40.10 | 0.796 / 0.797 | 0.266 | 11.94 | 0.023 / 0.006 | 1.728 | 387.3 |
| + $D_u'$ | 0.743 / 0.729 | 0.630 | 23.67 | 0.793 / 0.792 | 0.296 | 8.907 | 0.678 / 0.675 | 0.495 | 52.72 |
| BERT ConvLR | 0.728 / 0.722 | 0.612 | 21.06 | 0.776 / 0.788 | 0.240 | 8.447 | 0.627 / 0.635 | 0.456 | 36.37 |
| + $D_u$ | 0.746 / 0.737 | 0.653 | 47.68 | 0.790 / 0.794 | 0.332 | 16.45 | 0.138 / 0.119 | 1.748 | 546.1 |
| + $D_u'$ | 0.763 / 0.748 | 0.628 | 40.32 | 0.809 / 0.810 | 0.303 | 15.26 | 0.656 / 0.659 | 0.483 | 57.77 |

**Unlabeled Data Pool:** For clinical STS, we extract sentences from MIMIC-III covering topics of medication, diagnosis, follow-up instructions, and test, then synthetically balance across each unit score interval, resulting in 1,534 sentence pairs, which we denote as $D_u$. For PeerRead, we use 1,014 reviews from ICLR 2017 without labels as $D_u$. To expand $D_u$ in the zero-shot setting, we remove the gold-standard labels and integrate the resulting unlabeled data into $D_u$.

**Results and Analysis:** As seen in Table 6, semi-supervision improves correlation, at the cost of being more uncertain and miscalibrated, with larger Cal and NLPD. Predictive confidence threshold selection can further improve the accuracy. It also effectively calibrates the model, resulting in much lower Cal and NLPD, compared with directly incorporating unlabeled data (“+ $D_u$”).

In the zero-shot setting, Cal and NLPD increase for all tasks under both LR and ConvLR with $D_u$, making predictions less reliable, especially for PeerRead where the model totally collapses. This matches our intuition that the distribution of the pseudo-labeled data differs from the true distribution, and that learning from this data impedes the model. This problem is alleviated by retaining only the highly confident subset $D_u'$, as its distribution is closer to the gold-standard for well-calibrated models. This is also consistent with the observation that Cal and NLPD in the zero-shot setting are much larger than in the semi-supervised setting, as the latter benefits from the guidance of the gold-standard distribution.

Note that if we merely assess the model with Pearson correlation as in most previous work, we can only observe the improvement due to data augmentation, neglecting the risk of the model being more miscalibrated, and producing less reliable predictions. Further, Cal and NLPD are useful metrics to evaluate the effectiveness of the data sampling strategy used in self-training.

### 7.2 Cross-lingual DA

We evaluate self-training and active-learning on DA-based machine translation quality estimation using BERT LR.

**Experimental Setup:** We use three language pairs: WMT 2020 DA en-zh, ru-en, and si-en, in each case splitting the original 7k training instances into a training set $D$ of 3k instances and a 4k unlabeled data pool $D_u$, keeping the original validation and test sets. The learning rate is set to 2e-5, and training epochs and batch size are tuned by grid search over the validation set based on the range [1,2,3,4,5] × [16, 32]. Other settings follow STS and SA above, but without a general-purpose base model. As a baseline, we train on $D$ only, select the configuration on the validation set, and evaluate the best configuration on test.

**Results and Analysis:** As shown in Table 7, directly incorporating pseudo $D_u$ substantially outperforms baselines for all three language pairs. This differs from the results for STS and SA in the semi-supervised setting, but is consistent with the results in the zero-shot setting. It indicates that a high-performance model requires high-quality data to further gain improvements; lower-quality models are more tolerant to lower data quality.

| Model | en-zh (high) $r_{dev}$ ↑ | en-zh (high) $r_{test}$ ↑ | ru-en (medium) $r_{dev}$ ↑ | ru-en (medium) $r_{test}$ ↑ | si-en (low) $r_{dev}$ ↑ | si-en (low) $r_{test}$ ↑ |
|---|---|---|---|---|---|---|
| Baseline | 0.407 | 0.374 | 0.592 | 0.599 | 0.427 | 0.478 |
| + pseudo $D_u$ | 0.434 | 0.400 | 0.604 | 0.619 | 0.449 | 0.488 |
| + $D_u'$ | 0.438 | 0.404 | 0.606 | 0.603 | 0.443 | 0.482 |
| + $D_u' \cup D_a'$ | 0.445 | 0.422 | 0.615 | 0.628 | 0.466 | 0.496 |
| + gold $D_u$ | 0.453 | 0.395 | 0.600 | 0.621 | 0.466 | 0.504 |

We select the most confident 1,904, 1,985, and 2,462 instances with *τ* = 0.15, 0.13 and 0.19 for en-zh, ru-en and si-en, respectively. Equal or higher performance is achieved when this subset of instances is added to the training data, as compared to the complete $D_u$.

Simulating active learning, we also explore the annotation of $D_u - D_u'$ with human gold scores, i.e. $D_a'$. The results show that with $D_u' \cup D_a'$, our model achieves results competitive with using all of $D_u$ with gold labels. This reveals that it is not necessary to annotate the entire dataset, but we can focus on the subset where the model is not confident. In this way, data annotation is more efficient, and models generalize better over unseen data.
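The division of the unlabeled pool into a pseudo-labeled part and a part routed to annotators can be sketched as below. The tuple layout, the function name, and the strict `sigma < tau` comparison are assumptions made for illustration; the selection rule itself is the one defined in Section 4.2.

```python
def split_by_confidence(pool, tau):
    """Partition an unlabeled pool by predictive standard deviation.

    pool: iterable of (instance_id, mu, sigma) tuples, where mu and sigma
    are the predictive mean and standard deviation for that instance.
    Returns (pseudo_labeled, to_annotate).
    """
    pseudo_labeled = []  # confident: keep the model's prediction as the label
    to_annotate = []     # uncertain: send to human annotators for gold scores
    for instance_id, mu, sigma in pool:
        if sigma < tau:
            pseudo_labeled.append((instance_id, mu))
        else:
            to_annotate.append(instance_id)
    return pseudo_labeled, to_annotate


# Hypothetical pool of three instances with tau = 0.15.
pool = [("a", 0.70, 0.05), ("b", 0.20, 0.30), ("c", 0.50, 0.14)]
confident, uncertain = split_by_confidence(pool, tau=0.15)
```

In this toy pool, instances `a` and `c` fall below the threshold and keep their predicted scores, while `b` is the only one that would need a human label.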

## 8 Analysis

In this section, we conduct further analysis to better understand the results of the experiments.

**Qualitative Comparison:** In both in-domain and out-of-domain evaluation, end-to-end training based on BERT, particularly BBB estimation, obtains much larger NLPD than pipeline training based on SBERT, especially GP regression. We speculate that end-to-end uncertainty models are confident for both correct and incorrect predictions, i.e. have small variance over all instances, thus resulting in the smaller Shp and larger NLPD. Meanwhile, models with extremely small NLPD are less confident in inaccurate predictions, and might also be under-confident in correct predictions.

We score sentence pairs in the STS-B test set using BERT Bayesian LR (BBB) and SBERT GP.^{5} Overall, the incorrect predictions (more than 1 from the true score) by BBB have a much smaller variance compared to those predicted by GP. For correct predictions (≤ 1 of the true score), BBB has a higher variance than for incorrect predictions, which is counter-intuitive. Though the std for SBERT GP regression on correct predictions is much larger than BBB, it is slightly less than that for incorrect ones. This fits the expectation that when a model is good at uncertainty prediction, the model should be more confident for correct predictions than incorrect ones. Examples where both models are correct and incorrect are presented in Table 9.

The near-zero variance of BBB (0.005 on average) results in infinite NLPD because of the element $\frac{(t_i - m_i)^2}{v_i}$ in the NLPD formula. Larger Shp of GP tends to produce smaller NLPD in spite of being under-confident on correct cases: the variance of 1.57 is much larger than the true gap of 0.01. So NLPD is not a perfect metric, favouring under-confident models. We therefore suggest a metric priority order of *r*, Cal, NLPD and Shp.
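The effect of a near-zero variance on NLPD can be checked numerically with the incorrect-prediction example from Table 9. The sketch assumes the standard per-instance Gaussian negative log density and reads the "±" values as standard deviations; the paper's own NLPD formula is given in an earlier section and may differ by constants or averaging.

```python
import math


def gaussian_nlpd(target, mean, std):
    """Negative log density of `target` under N(mean, std^2)."""
    var = std ** 2
    return 0.5 * math.log(2.0 * math.pi * var) + (target - mean) ** 2 / (2.0 * var)


# Incorrect-prediction example from Table 9 (gold score = 0).
nlpd_gp = gaussian_nlpd(0.0, 2.22, 1.62)     # SBERT Sparse GP
nlpd_bbb = gaussian_nlpd(0.0, 1.95, 0.0037)  # BERT Bayesian LR (BBB)
```

Although the BBB point prediction is slightly closer to the gold score, its tiny variance makes the squared-error term dominate, so its NLPD is several orders of magnitude larger than that of the GP.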

**Impact of Sentence Embedding:** The quality of sentence embeddings is critical for uncertainty training, affecting not only the correlation, but also the uncertainty metrics. Instead of SBERT, we also experimented with simCSE, the current state-of-the-art sentence encoder (Gao et al., 2021). We train three pipeline models with STS-B training data based on *sup-simcse-bert-base-uncased*, using the same settings as the first row of Table 3, and evaluate on the STS-B, EBMSASS, and MedSTS test sets. In Table 8, contrasting with the results in Table 2 for STS-B and Yelp, and results in Table 4 for EBMSASS and MedSTS, the correlation improves for all datasets other than MedSTS, and Cal and NLPD drop. This suggests that better sentence encoders boost pipeline performance.

| Model | STS-B *r* ↑ | STS-B Cal ↓ | STS-B NLPD ↓ | STS-B Shp ↓ | EBMSASS *r* ↑ | EBMSASS Cal ↓ | EBMSASS NLPD ↓ | EBMSASS Shp ↓ | MedSTS *r* ↑ | MedSTS Cal ↓ | MedSTS NLPD ↓ | MedSTS Shp ↓ | Yelp *r* ↑ | Yelp Cal ↓ | Yelp NLPD ↓ | Yelp Shp ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| simCSE Cosine | 0.833 | n/a | n/a | n/a | 0.700 | n/a | n/a | n/a | 0.696 | n/a | n/a | n/a | - | n/a | n/a | n/a |
| simCSE LR | 0.849 | n/a | n/a | n/a | 0.703 | n/a | n/a | n/a | 0.675 | n/a | n/a | n/a | 0.688 | n/a | n/a | n/a |
| simCSE Bayes LR | 0.850 | 0.051 | 0.381 | 0.891 | 0.738 | 0.048 | 0.102 | 0.900 | 0.693 | 0.002 | 0.295 | 0.885 | 0.668 | 0.005 | 0.377 | 0.846 |
| simCSE Sparse GP | 0.853 | 0.002 | 0.368 | 0.960 | 0.757 | 0.210 | 0.218 | 0.962 | 0.694 | 0.034 | 0.346 | 0.960 | 0.681 | 0.004 | 0.360 | 0.880 |

| | SBERT Sparse GP | BERT Bayesian LR (BBB) |
|---|---|---|
| **Incorrect predictions:** | | |
| S1: You will want to clean the area first. | | |
| S2: You will also want to remove the seeds. | | |
| Gold score = 0 | | |
| Prediction: | 2.22 ± 1.62 | 1.95 ± 0.0037 |
| **Correct predictions:** | | |
| S1: He was referring to ..., ... last Sunday. | | |
| S2: Next week, ... Sunday ..., will take up his position. | | |
| Gold score = 4 | | |
| Prediction: | 3.89 ± 1.58 | 4.14 ± 0.0056 |

**High-disagreement Label Detection:** A natural question to ask in the instance selection is what types of instances are selected and discarded, and how this correlates with the underlying label uncertainty in the data. When models are well-calibrated, the predicted variance will reflect the true label uncertainty, both aleatoric and epistemic. As such, if we select instances with smaller variance, we are effectively filtering out instances with higher inherent label uncertainty, as should be reflected in the labels assigned by independent annotators. We verify this hypothesis below.

We apply the model fine-tuned on STS-B over BIOSSES and EBMSASS (1000 instances each), for which five raw annotations for each instance can be accessed to approximate an empirical label distribution. KL-Divergence (KL) is used to measure the distance between the predicted and empirical probability. In Table 10, the trend in KL values on the two datasets is consistent with Cal/NLPD across all estimation methods, indirectly suggesting that Cal and NLPD are effective metrics in the absence of empirical label distributions.
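As a rough illustration of comparing a predicted distribution with an empirical one built from five annotations, the sketch below fits a Gaussian to the raw annotations and uses the closed-form KL divergence between two univariate Gaussians. This section does not define how KL1 and KL2 in Table 10 are computed, so the sketch does not attempt to reproduce them; the Gaussian fit, the standard-deviation floor, and the function names are all assumptions.

```python
import math
import statistics


def gaussian_kl(mean_p, std_p, mean_q, std_q):
    """KL(P || Q) for P = N(mean_p, std_p^2) and Q = N(mean_q, std_q^2)."""
    return (math.log(std_q / std_p)
            + (std_p ** 2 + (mean_p - mean_q) ** 2) / (2.0 * std_q ** 2)
            - 0.5)


def empirical_gaussian(annotations, min_std=1e-3):
    """Mean and (floored) population std of the raw annotator scores.

    The floor is an assumption that avoids a zero std when all
    annotators agree exactly.
    """
    mean = statistics.mean(annotations)
    std = max(statistics.pstdev(annotations), min_std)
    return mean, std


# Hypothetical instance: model prediction versus five annotator scores.
emp_mean, emp_std = empirical_gaussian([3.0, 3.5, 3.0, 4.0, 3.5])
kl = gaussian_kl(3.2, 0.4, emp_mean, emp_std)
```

Because KL divergence is asymmetric, the direction (predicted versus empirical, or the reverse) has to be fixed before values are compared across models.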

| Model | EBMSASS *r* ↑ | EBMSASS Cal / NLPD ↓ | EBMSASS KL1 / KL2 ↓ | BIOSSES *r* ↑ | BIOSSES Cal / NLPD ↓ | BIOSSES KL1 / KL2 ↓ |
|---|---|---|---|---|---|---|
| LR MC | 0.828 | 0.236 / 3.319 | 8.75 / 1.23 | 0.870 | 0.250 / 4.488 | 8.82 / 1.54 |
| ConvLR MC | 0.806 | 0.201 / 4.668 | 12.74 / 1.46 | 0.823 | 0.304 / 12.59 | 19.64 / 2.06 |
| LR BBB | 0.854 | 0.633 / 5351. | 16297 / 5.00 | 0.836 | 0.530 / 11972 | 16598 / 4.90 |
| ConvLR BBB | 0.806 | 0.736 / 1091. | 2373.7 / 4.13 | 0.804 | 0.923 / 2076. | 2631.2 / 5.01 |

Do large-variance instances selected by the strategies in Section 4.2 overlap with high-disagreement instances? Without a ground truth of high-disagreement annotations, we identify them iteratively in two steps: (1) select labels whose std is greater than *α*, beginning from 0.3; and (2) manually check whether, for all selected instances, at least two out of the five annotations differ from the others by ≥ 1.0; if not, increase *α* by 0.1 and repeat, otherwise stop. This results in 137 and 31 label disagreements when *α* = 0.5 and 0.4, for EBMSASS and BIOSSES, respectively.
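The iterative search for *α* can be sketched as a simple loop. In the paper, step (2) is a manual inspection; here it is replaced by a hypothetical automatic check (at least two annotations lying 1.0 or more from the median), which is only one possible reading of the criterion. The function names, the use of population standard deviation, and the `max_steps` guard are likewise assumptions.

```python
import statistics


def passes_manual_check(annotations, gap=1.0):
    """Hypothetical stand-in for the manual inspection step.

    Treats an instance as a genuine disagreement when at least two of
    its annotations lie `gap` or more away from the median annotation.
    """
    median = statistics.median(annotations)
    outliers = sum(abs(a - median) >= gap for a in annotations)
    return outliers >= 2


def find_disagreement_threshold(instances, start=0.3, step=0.1, max_steps=50):
    """Raise alpha until every selected instance passes the check."""
    for k in range(max_steps):
        # Rounding keeps alpha on a clean 0.1 grid despite float error.
        alpha = round(start + k * step, 2)
        selected = [a for a in instances if statistics.pstdev(a) > alpha]
        if all(passes_manual_check(a) for a in selected):
            return alpha, selected
    raise ValueError("no threshold found within max_steps")
```

Note that the loop also stops when the selection becomes empty, so in practice the returned set should be inspected before it is used as a reference for high-disagreement labels.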

Using BERT LR MC-dropout, a learned threshold of *τ* = 0.162 results in Acc = 0.48, F1 = 0.28 at high-disagreement label detection on EBMSASS. For BIOSSES, *τ* = 0.1 leads to Acc = 0.37, F1 = 0.48. Under ConvLR MC-dropout, *τ* = 0.124 gives Acc = 0.46, F1 = 0.31 on EBMSASS, and *τ* = 0.157 gives Acc = 0.45, F1 = 0.48 on BIOSSES.

As such, high-disagreement labels can be detected by the large-variance criterion, obtaining Acc = 0.44, F1 = 0.39 on average. This is not good as a binary classifier, since regarding all instances as the majority-class “clean” performs better. But in our context, it is effective as a data augmentation strategy—selecting clean examples from an out-of-domain corpus. Detecting noisy labels is not just a binary classification task requiring high accuracy, but critical to recognize and filter noisy instances from a whole training corpus, even at the cost of removing clean labels.

## 9 Conclusion

We comprehensively investigated a range of uncertainty estimation methods over different regression tasks, using pre-trained language models. Bayesian linear regression and sparse Gaussian process regression based on fixed features obtain lower calibration error and NLPD compared with fine-tuning large-capacity deep networks end-to-end, but are inferior in terms of correlation. When embeddings are sufficiently expressive, they are comparable in performance to deep uncertainty models.

To reduce uncertainty resulting from noisy labels and limited labeled data in specific domains, we proposed a simple instance selection method based on uncertainty model predictive confidence. This approach demonstrated consistent performance improvements on three regression tasks in both self-training and active-learning settings, underscoring its effectiveness and generalizability.

## Acknowledgments

We thank the anonymous reviewers and three editors for their helpful comments. Yuxia Wang is supported by scholarships from the University of Melbourne and China Scholarship Council (CSC).

## Notes

^{1} We also experimented with a strategy for tuning *τ* based on the principle of discarding the majority so that remaining examples are as clean as possible. Specifically, we set *τ* to the marginal value corresponding to the left boundary of the peak of the std probability distribution, but found little difference in results, so omit it from the paper.

^{2}

^{3} No significant difference was observed when sampling 20, 30, 40, or 50 times, so we report only on 30.

^{5} These two were chosen because they have similar *r*, but one has the largest NLPD and the other has the smallest.

## References

## Author notes

Action Editor: Dani Yogatama