Unsupervised Quality Estimation for Neural Machine Translation

Quality Estimation (QE) is an important component in making Machine Translation (MT) useful in real-world applications, as it aims to inform the user of the quality of the MT output at test time. Existing approaches require large amounts of expert-annotated data, computation and time for training. As an alternative, we devise an unsupervised approach to QE where no training or access to additional resources besides the MT system itself is required. Unlike most current work, which treats the MT system as a black box, we explore useful information that can be extracted from the MT system as a by-product of translation. By employing methods for uncertainty quantification, we achieve very good correlation with human judgments of quality, rivalling state-of-the-art supervised QE models. To evaluate our approach, we collect the first dataset that enables work on both black-box and glass-box approaches to QE.


Introduction
With the advent of neural models, Machine Translation (MT) systems have made substantial progress, reportedly achieving near-human quality for high-resource language pairs (Hassan et al., 2018; Barrault et al., 2019). However, translation quality is not consistent across language pairs, domains and datasets. This is problematic for low-resource scenarios, where there is not enough training data and translation quality significantly lags behind. Additionally, neural MT (NMT) systems can be deceptive to the end user as they can generate fluent translations that differ in meaning from the original (Bentivogli et al., 2016; Castilho et al., 2017). Thus, it is crucial to have a feedback mechanism to inform users about the trustworthiness of a given MT output.
Quality estimation (QE) aims to predict the quality of the output provided by an MT system at test time when no gold-standard human translation is available. State-of-the-art (SOTA) QE models require large amounts of parallel data for pretraining and in-domain translations annotated with quality labels for training (Kim et al., 2017a; Fonseca et al., 2019). However, such large collections of data are only available for a small set of languages in limited domains.
Current work on QE typically treats the MT system as a black box. In this paper we propose an alternative glass-box approach to QE which allows us to address the task as an unsupervised problem. We posit that encoder-decoder NMT models (Bahdanau et al., 2015; Vaswani et al., 2017) offer a rich source of information for directly estimating translation quality: (a) the output probability distribution from the NMT system (i.e. the probabilities obtained by applying the softmax function over the entire vocabulary of the target language); and (b) the attention mechanism used during decoding. Our assumption is that the more confident the decoder is, the higher the quality of the translation.
While sequence-level probabilities of the top MT hypothesis have been used for confidence estimation in statistical MT (Specia et al., 2013; Blatz et al., 2004), the output probabilities from deep Neural Networks (NNs) are generally not well calibrated, i.e. not representative of the true likelihood of the predictions (Nguyen and O'Connor, 2015; Guo et al., 2017; Lakshminarayanan et al., 2017). Moreover, softmax output probabilities tend to be overconfident and can assign a large probability mass to predictions that are far away from the training data (Gal and Ghahramani, 2016). To overcome such deficiencies, we propose ways to exploit output distributions beyond the top-1 prediction by exploring uncertainty quantification methods for better probability estimates (Gal and Ghahramani, 2016; Lakshminarayanan et al., 2017). In our experiments, we account for different factors that can affect the reliability of model probability estimates in NNs, such as model architecture, training and search (Guo et al., 2017).
In addition, we study the attention mechanism as another source of information on NMT quality. Attention can be interpreted as a soft alignment, providing an indication of the strength of the relationship between source and target words (Bahdanau et al., 2015). While this interpretation is straightforward for NMT based on Recurrent Neural Networks (RNN) (Rikters and Fishel, 2017), its application to current SOTA Transformer models with multi-head attention (Vaswani et al., 2017) is challenging. We analyze to what extent meaningful information on translation quality can be extracted from multi-head attention.
To evaluate our approach in challenging settings, we collect a new dataset for QE with 6 language pairs representing NMT training in high-, medium-, and low-resource scenarios. To reduce the chance of overfitting to particular domains, our dataset is constructed from Wikipedia documents. We annotate 10K segments per language pair. In contrast to the vast majority of work on QE, which uses semi-automatic metrics based on post-editing distance as gold standard, we perform quality labelling based on the Direct Assessment (DA) methodology (Graham et al., 2015b), which has been widely used for popular MT evaluation campaigns in recent years. At the same time, the collected data differs from the existing datasets annotated with DA judgments for the well-known WMT Metrics task in two important ways: we provide enough data to train supervised QE models and access to the NMT systems used to generate the translations, thus allowing for further exploration of the glass-box unsupervised approach to QE for NMT introduced in this paper.
Our main contributions can be summarised as follows: (i) A new, large-scale dataset for sentence-level QE annotated with DA rather than post-editing metrics (§4); (ii) A set of unsupervised quality indicators that can be produced as a by-product of NMT decoding and a thorough evaluation of how they correlate with human judgments of translation quality (§3 and §5); (iii) The first attempt at analysing the attention distribution for the purposes of unsupervised QE in Transformer models (§3 and §5); (iv) An analysis of how model confidence relates to translation quality for different NMT systems (§6). Our experiments show that unsupervised QE indicators obtained from well-calibrated NMT model probabilities rival strong supervised SOTA models in terms of correlation with human judgments.

Related Work
QE. QE is typically addressed as a supervised machine learning task where the goal is to predict MT quality in the absence of a reference translation. Traditional feature-based approaches relied on manually designed features, extracted from the MT system (glass-box features) or obtained from the source and translated sentences, as well as external resources, such as monolingual or parallel corpora (black-box features) (Specia et al., 2009).
Currently the best performing approaches to QE employ NNs to learn useful representations for source and target sentences (Kim et al., 2017b; Wang et al., 2018; Kepler et al., 2019a). A notable example is the Predictor-Estimator (PredEst) model (Kim et al., 2017b), which consists of an encoder-decoder RNN (predictor) trained on parallel data for a word prediction task and a unidirectional RNN (estimator) that produces quality estimates leveraging the context representations generated by the predictor. Despite achieving strong performance, neural-based approaches are resource-heavy and require a significant amount of in-domain labelled data for training. They do not use any internal information from the MT system.
Existing work on glass-box QE is limited to features extracted from statistical MT, such as language model (LM) probabilities or number of hypotheses in the n-best list (Blatz et al., 2004; Specia et al., 2013). The few approaches for unsupervised QE are also inspired by the work on statistical MT and perform significantly worse than supervised approaches (Popović, 2012; Moreau and Vogel, 2012; Etchegoyhen et al., 2018). For example, Etchegoyhen et al. (2018) use lexical translation probabilities from word alignment models and LM probabilities. Their unsupervised approach averages these features to produce the final score. However, it is largely outperformed by the neural-based supervised QE systems.
The only works that explore internal information from neural models as an indicator of translation quality rely on the entropy of attention weights in RNN-based NMT systems (Rikters and Fishel, 2017; Yankovskaya et al., 2018). However, attention-based indicators perform competitively only when combined with other QE features in a supervised framework. Furthermore, this approach is not directly applicable to the SOTA Transformer model, which employs a multi-head attention mechanism. Recent work on attention interpretability showed that attention weights in Transformer networks might not be readily interpretable (Vashishth et al., 2019; Vig and Belinkov, 2019). Voita et al. (2019) show that different attention heads of the Transformer have different functions and some of them are more important than others. This makes it challenging to extract information from attention weights in the Transformer (see §5).
To the best of our knowledge, our work is the first on glass-box unsupervised QE for NMT that performs competitively with respect to the SOTA supervised systems.

QE Datasets
The performance of QE systems has been typically assessed using the semi-automatic HTER (Human-mediated Translation Edit Rate) (Snover et al., 2006) metric as gold standard. However, the reliability of this metric for assessing the performance of QE systems has been shown to be questionable. The current practice in MT evaluation is the so-called Direct Assessment (DA) of MT quality (Graham et al., 2015b), where raters evaluate the MT on a continuous 1-100 scale. This method has been shown to improve the reproducibility of manual evaluation and to provide a more reliable gold standard for automatic evaluation metrics (Graham et al., 2015a).
DA methodology is currently used for manual evaluation of MT quality at the WMT translation tasks, as well as for assessing the performance of reference-based automatic MT evaluation metrics at the WMT Metrics Task (Bojar et al., 2016, 2017; Ma et al., 2018, 2019). Existing datasets with sentence-level DA judgments from the WMT Metrics Task could in principle be used for benchmarking QE systems. However, they contain only a few hundred segments per language pair and thus hardly allow for training supervised systems, as illustrated by the weak correlation results for QE on DA judgments based on the Metrics Task data recently reported by Fonseca et al. (2019). Furthermore, for each language pair the data contains translations from a number of MT systems, often using different architectures, and these MT systems are not readily available, which makes experiments on glass-box QE impossible. Finally, the judgments are either crowd-sourced or collected from task participants and not professional translators, which may hinder the reliability of the labels. We collect a new dataset for QE that addresses these limitations (§4).
Uncertainty quantification. Uncertainty quantification in NNs is typically addressed using a Bayesian framework where the point estimates of their weights are replaced with probability distributions (MacKay, 1992; Graves, 2011; Welling and Teh, 2011; Tran et al., 2019). Various approximations have been developed to avoid high training costs of Bayesian NNs, such as Monte Carlo Dropout (Gal and Ghahramani, 2016) or model ensembling (Lakshminarayanan et al., 2017). The performance of uncertainty quantification methods is commonly evaluated by measuring calibration, i.e. the relation between predictive probabilities and the empirical frequencies of the predicted labels, or by assessing generalization of uncertainty under domain shift (see §6).
Only a few studies have analyzed calibration in NMT, and they came to contradictory conclusions. Kumar and Sarawagi (2019) measure calibration error by comparing model probabilities and the percentage of times NMT output matches the reference translation, and conclude that NMT probabilities are poorly calibrated. However, the calibration error metrics they use are designed for binary classification tasks and cannot be easily transferred to NMT (Kuleshov and Liang, 2015). Other work analyzes uncertainty in NMT by comparing predictive probability distributions with the empirical distribution observed in human translation data, and concludes that NMT models are well calibrated. However, this approach is limited by the fact that there are many possible correct translations for a given sentence and only one human translation is available in practice. Although the goal of this paper is to devise an unsupervised solution for the QE task, the analysis presented here provides new insights into calibration in NMT. Different from existing work, we study the relation between model probabilities and human judgments of translation correctness.
Uncertainty quantification methods have been successfully applied to various practical tasks, e.g. neural semantic parsing (Dong et al., 2018), hate speech classification (Miok et al., 2019), or backtranslation for NMT (Wang et al., 2019). Wang et al. (2019), which is the closest to our work, explore a small set of uncertainty-based metrics to minimise the weight of erroneous synthetic sentence pairs for backtranslation in NMT. However, improved NMT training with weighted synthetic data does not necessarily imply better prediction of MT quality. In fact, the metrics that Wang et al. (2019) report to perform the best for backtranslation do not perform well for QE (see §3.2).

Unsupervised QE for NMT
We assume a sequence-to-sequence NMT architecture consisting of encoder-decoder networks using attention (Bahdanau et al., 2015). The encoder maps the input sequence $x = x_1, \ldots, x_I$ into a sequence of hidden states, which is summarized into a single vector using the attention mechanism (Bahdanau et al., 2015; Vaswani et al., 2017). Given this representation, the decoder generates an output sequence $y = y_1, \ldots, y_T$ of length $T$. The probability of generating $y$ is factorized as:

$$p(y \mid x, \theta) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x, \theta)$$

where $\theta$ represents model parameters.
The decoder produces the probability distribution $p(y_t \mid y_{<t}, x, \theta)$ over the system vocabulary at each time step using the softmax function. The model is trained to minimize cross-entropy loss. We use SOTA Transformers (Vaswani et al., 2017) for the encoder and decoder in our experiments.
In what follows, we propose unsupervised quality indicators based on: (i) output probability distribution obtained either from a standard deterministic NMT (§3.1) or (ii) using uncertainty quantification (§3.2), and (iii) attention weights (§3.3).

Exploiting the Softmax Distribution
We start by defining a simple QE measure based on sequence-level translation probability normalized by length:

$$\mathrm{TP} = \frac{1}{T} \sum_{t=1}^{T} \log p(y_t \mid y_{<t}, x, \theta)$$

However, 1-best probability estimates from the softmax output distribution may tend towards overconfidence, which would result in high probability for unreliable MT outputs. We propose two metrics that exploit the output probability distribution beyond the average of top-1 predictions. First, we compute the entropy of the softmax output distribution over the target vocabulary of size $V$ at each decoding step and take an average to obtain a sentence-level measure:

$$\text{Softmax-Ent} = -\frac{1}{T} \sum_{t=1}^{T} \sum_{v=1}^{V} p(y_t^v) \log p(y_t^v)$$

where $p(y_t)$ represents the conditional distribution $p(y_t \mid x, y_{<t}, \theta)$ and $p(y_t^v)$ is the probability it assigns to the $v$-th vocabulary word.
If most of the probability mass is concentrated on a few vocabulary words, the generated target word is likely to be correct. By contrast, if softmax probabilities approach a uniform distribution, picking any word from the vocabulary is equally likely and the quality of the resulting translation is expected to be low.
Second, we hypothesize that the dispersion of probabilities of individual words might provide useful information that is inevitably lost when taking an average. Consider, as an illustration, that the sequences of word probabilities [0.1, 0.9] and [0.5, 0.5] have the same mean, but might indicate very different behaviour of the NMT system, and consequently, different output quality. To formalize this intuition we compute the standard deviation of word-level log-probabilities (Sent-Std).
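The three indicators described in this section (length-normalized log-probability, average softmax entropy, and standard deviation of word-level log-probabilities) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `softmax_indicators` and its input layout (one full probability distribution per decoding step, plus the index of the generated token) are assumptions made for the example, and the population standard deviation is assumed for the dispersion measure.

```python
import math
from statistics import pstdev


def softmax_indicators(step_distributions, chosen_ids):
    """Compute three sentence-level indicators from decoder outputs.

    step_distributions: list of T lists; each inner list is a probability
        distribution over the V vocabulary entries at one decoding step.
    chosen_ids: list of T vocabulary indices of the generated tokens.
    Returns (tp, softmax_ent, sent_std). Hypothetical helper for illustration.
    """
    if not chosen_ids or len(step_distributions) != len(chosen_ids):
        raise ValueError("need one distribution per generated token")
    T = len(chosen_ids)

    # Log-probability of each generated token.
    log_probs = [math.log(dist[i]) for dist, i in zip(step_distributions, chosen_ids)]

    # Length-normalized sequence log-probability.
    tp = sum(log_probs) / T

    # Entropy of each step's distribution, averaged over steps.
    step_entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0) for dist in step_distributions
    ]
    softmax_ent = sum(step_entropies) / T

    # Dispersion of word-level log-probabilities (population std, an assumption).
    sent_std = pstdev(log_probs)
    return tp, softmax_ent, sent_std


# Toy usage: a peaked step followed by a flat step over a 4-word vocabulary.
peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]
print(softmax_indicators([peaked, flat], [0, 2]))
```

Working in log space for the first and third indicators matches the text's use of log-probabilities; the entropy is computed on the probabilities themselves.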

Quantifying Uncertainty
It has been argued in recent work that deep neural networks do not properly represent model uncertainty (Gal and Ghahramani, 2016; Lakshminarayanan et al., 2017). Uncertainty quantification in deep learning typically relies on the Bayesian formalism (MacKay, 1992; Graves, 2011; Welling and Teh, 2011; Gal and Ghahramani, 2016; Tran et al., 2019). Bayesian NNs learn a posterior distribution over parameters that quantifies model or epistemic uncertainty, i.e. our lack of knowledge as to which model generated the training data. Bayesian NNs usually come with prohibitive computational costs and various approximations have been developed to alleviate this. In this paper we explore Monte Carlo (MC) dropout (Gal and Ghahramani, 2016).
Dropout is a method introduced by Srivastava et al. (2014) to reduce overfitting when training neural models. It consists in randomly masking neurons to zero based on a Bernoulli distribution. Gal and Ghahramani (2016) use dropout at test time before every weight layer. They perform several forward passes through the network and collect posterior probabilities generated by the model with parameters perturbed by dropout. Mean and variance of the resulting distribution can then be used to represent model uncertainty.
We propose two flavours of MC dropout-based measures for unsupervised QE. First, we compute the expectation and variance for the set of sentence-level probability estimates obtained by running $N$ stochastic forward passes through the MT model with model parameters $\hat{\theta}$ perturbed by dropout:

$$\text{D-TP} = \frac{1}{N} \sum_{n=1}^{N} \mathrm{TP}_{\hat{\theta}^n}$$

$$\text{D-Var} = \mathbb{E}\left[\mathrm{TP}_{\hat{\theta}}^2\right] - \left(\mathbb{E}\left[\mathrm{TP}_{\hat{\theta}}\right]\right)^2$$

where TP is sentence-level probability as defined in §3.1. We also look at a combination of the two. We note that these metrics have also been used by Wang et al. (2019), but with the purpose of minimising the effect of low-quality outputs on NMT training with backtranslations.
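The expectation and variance over stochastic forward passes can be illustrated with a short sketch. Everything model-related here is a stand-in: `toy_stochastic_tp` is a hypothetical placeholder that adds Gaussian noise to a fixed score, whereas in practice each sample would come from re-scoring the same translation with dropout enabled at inference time. The helper name `dropout_tp_stats` is also an assumption for this example.

```python
import random
from statistics import fmean, pvariance


def dropout_tp_stats(tp_samples):
    """Return (mean, variance) of sentence-level scores from N stochastic passes.

    The variance is the population variance, i.e. E[TP^2] - (E[TP])^2.
    """
    if not tp_samples:
        raise ValueError("need at least one stochastic pass")
    return fmean(tp_samples), pvariance(tp_samples)


def toy_stochastic_tp(rng, base=-0.5, noise=0.1):
    # Hypothetical stand-in for one forward pass with dropout enabled:
    # a fixed length-normalized log-probability plus random perturbation.
    return base + rng.gauss(0.0, noise)


rng = random.Random(0)  # fixed seed so the toy run is repeatable
samples = [toy_stochastic_tp(rng) for _ in range(30)]  # N = 30 passes
d_tp, d_var = dropout_tp_stats(samples)
print(f"mean over passes: {d_tp:.3f}, variance over passes: {d_var:.4f}")
```

A translation whose score barely moves under dropout yields a variance close to zero, while an unstable one yields a larger value; the mean keeps the information about the absolute probability level that the variance alone discards.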
Second, we measure lexical variation between the MT outputs generated for the same source segment when running inference with dropout. We posit that differences between likely MT hypotheses may also capture uncertainty and potential ambiguity and complexity of the original sentence. We compute an average similarity score (sim) between the set $H$ of translation hypotheses:

$$\text{D-Lex-Sim} = \frac{1}{C} \sum_{i=1}^{|H|} \sum_{j=i+1}^{|H|} \mathrm{sim}(h_i, h_j)$$

where $h_i, h_j \in H$, $i \neq j$, and $C = 2^{-1}|H|(|H|-1)$ is the number of pairwise comparisons for $|H|$ hypotheses. We use Meteor (Denkowski and Lavie, 2014) to compute similarity scores.
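The average pairwise similarity between dropout hypotheses can be sketched as below. Note the substitution: the paper uses Meteor as the similarity function, while this example uses a simple token-level Jaccard overlap (`token_jaccard`, a hypothetical stand-in) so that the code runs without external tools. The hypotheses are the dropout outputs shown in the paper's examples.

```python
from itertools import combinations


def token_jaccard(a, b):
    """Stand-in similarity (NOT Meteor): Jaccard overlap of lowercased tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def mean_pairwise_similarity(hypotheses, sim=token_jaccard):
    """Average sim(h_i, h_j) over all |H|(|H|-1)/2 unordered pairs."""
    pairs = list(combinations(hypotheses, 2))
    if not pairs:
        raise ValueError("need at least two hypotheses")
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


divergent = [
    "There will be a silver thread and a penny from Tanzer.",
    "There is an attempt at a silver greed and a carpenter from Tanzeri.",
    "There will be a silver bullet and a candle from Tanzer.",
    "The puzzle is being caught in the chicken's gavel and the coffin.",
]
consistent = [
    "Then, however, there may be a split between internal and external viewpoints.",
    "Then, however, there may be a gap between internal and external viewpoints.",
    "Then there may be a split between internal and external viewpoints.",
    "Then there may be a split between internal and external viewpoints.",
]
print(mean_pairwise_similarity(divergent), mean_pairwise_similarity(consistent))
```

Passing the similarity function as a parameter keeps the averaging logic independent of the metric, so a Meteor scorer could be plugged in without changing `mean_pairwise_similarity`.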

Attention
Attention weights represent the strength of connection between source and target tokens, which may be indicative of translation quality (Rikters and Fishel, 2017). One way to measure it is to compute the entropy of the attention distribution:

$$\text{Att-Ent} = -\frac{1}{I} \sum_{i=1}^{I} \sum_{j=1}^{J} \alpha_{ji} \log \alpha_{ji}$$

where $\alpha$ represents attention weights, $I$ is the number of target tokens and $J$ is the number of source tokens. This mechanism can be applied to any NMT model with encoder-decoder attention. We focus on attention in Transformer models, as it is currently the most widely used NMT architecture. Transformers rely on various types of attention, multiple attention heads and multiple encoder and decoder layers. Encoder-decoder attention weights are computed for each head ($H$) and for each layer ($L$) of the decoder; as a result, we get $H \times L$ matrices of attention weights. It is not clear which combination would give the best results for QE. To summarize the information from different heads and layers, we propose to compute the entropy scores for each possible head/layer combination and then choose the minimum value or compute the average:

$$\min_{h,l} \left( \text{Att-Ent}_{hl} \right) \quad \text{or} \quad \frac{1}{H \times L} \sum_{h=1}^{H} \sum_{l=1}^{L} \text{Att-Ent}_{hl}$$
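As an illustration of the attention-based indicator, the sketch below computes the entropy of one attention matrix and then summarizes the scores over all head/layer combinations by minimum and by average. The data layout is an assumption for the example: `alpha[i][j]` holds the weight on source token j when generating target token i, with each row summing to 1, and `attn[l][h]` holds one such matrix per layer and head. The function names are hypothetical.

```python
import math


def attention_entropy(alpha):
    """Entropy of attention weights, averaged over the I target tokens.

    alpha: list of I rows, each a distribution over the J source tokens.
    """
    if not alpha:
        raise ValueError("attention matrix must have at least one row")
    total = 0.0
    for row in alpha:
        # Zero weights contribute nothing to the entropy.
        total -= sum(a * math.log(a) for a in row if a > 0.0)
    return total / len(alpha)


def summarize_heads_layers(attn):
    """Return (min, average) of the entropy over every head/layer matrix."""
    scores = [attention_entropy(matrix) for layer in attn for matrix in layer]
    if not scores:
        raise ValueError("need at least one head/layer matrix")
    return min(scores), sum(scores) / len(scores)


# Toy input: 1 layer, 2 heads, 2 target tokens, 4 source tokens.
sharp_head = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]  # one-hot alignment
diffuse_head = [[0.25] * 4, [0.25] * 4]                     # uniform attention
print(summarize_heads_layers([[sharp_head, diffuse_head]]))
```

The minimum selects the most focused head/layer combination, while the average treats all combinations equally; which summary is preferable is left open in the text above.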

Multilingual Dataset for QE
The quality of NMT translations is strongly affected by the amount of training data. To study our unsupervised QE indicators under different conditions, we collected data for 6 language pairs covering high-, medium-, and low-resource conditions. To add diversity, we varied the directions into and out of English, when permitted by the availability of expert annotators into non-English languages. Thus our dataset is composed of the high-resource English-German (En-De) and English-Chinese (En-Zh) pairs; the medium-resource Romanian-English (Ro-En) and Estonian-English (Et-En) pairs; and the low-resource Sinhala-English (Si-En) and Nepali-English (Ne-En) pairs. The dataset contains sentences extracted from Wikipedia and the MT outputs manually annotated for quality.
Document and sentence sampling. We follow the sampling process outlined in FLORES (Guzmán et al., 2019). First, we sampled documents from Wikipedia for English, Estonian, Romanian, Sinhala and Nepali. Second, we selected the top 100 documents containing the largest number of sentences that are: (i) in the intended source language according to a language-id classifier 4 and (ii) between 50 and 150 characters long. In addition, we filtered out sentences that have been released as part of recent Wikipedia parallel corpora (Schwenk et al., 2019), ensuring that our dataset is not part of parallel data commonly used for NMT training.
For every language, we randomly selected 10K sentences from the sampled documents and then translated them into English using the MT models described below. For German and Chinese we selected 20K sentences from the top 100 documents in English Wikipedia. To ensure sufficient representation of high- and low-quality translations for high-resource language pairs, we selected the sentences with minimal lexical overlap with respect to the NMT training data.
NMT systems. For medium- and high-resource language pairs we trained the MT models based on the standard Transformer architecture (Vaswani et al., 2017) and followed the implementation details described in Ott et al. (2018b). We used publicly available MT datasets such as Paracrawl (Esplà et al., 2019) and Europarl (Koehn, 2005). Si-En and Ne-En MT systems were trained based on the Big-Transformer architecture as defined in Vaswani et al. (2017). For the low-resource language pairs, the models were trained following the FLORES semi-supervised setting (Guzmán et al., 2019) 5, which involves two iterations of backtranslation using the source and the target monolingual data. Table 1 specifies the amount of data used for training.
DA judgments. We followed the FLORES setup (Guzmán et al., 2019), which presents a form of DA (Graham et al., 2013). The annotators are asked to rate each sentence from 0-100 according to the perceived translation quality. Specifically, the 0-10 range represents an incorrect translation; 11-29, a translation with few correct keywords, but the overall meaning is different from the source; 30-50, a translation with major mistakes; 51-69, a translation which is understandable and conveys the overall meaning of the source but contains typos or grammatical errors; 70-90, a translation that closely preserves the semantics of the source sentence; and 90-100, a perfect translation.
4 https://fasttext.cc
5 https://bit.ly/36YaBlU
Each segment was evaluated independently by three professional translators from a single language service provider. To improve annotation consistency, any evaluation in which the range of scores among the raters was above 30 points was rejected, and an additional rater was requested to replace the most diverging translation rating until convergence was achieved. To further increase the reliability of the test and development partitions of the dataset, we requested an additional set of three annotations from a different group of annotators (i.e. from another language service provider) following the same annotation protocol, thus resulting in a total of six annotations per segment.
Raw human scores were converted into z-scores, i.e. standardized according to each individual annotator's overall mean and standard deviation. The scores collected for each segment were averaged to obtain the final score. Such a setting allows for the fact that annotators may genuinely disagree on some aspects of quality.
In Table 1 we show a summary of the statistics from human annotations. Besides the NMT training corpus size and the distribution of the DA scores for each language pair, we report mean and standard deviation of the average differences between the scores assigned by different annotators to each segment, as an indicator of annotation consistency. First, we observe that, as expected, the amount of training data per language pair correlates with the average quality of an NMT system. Second, we note that the distribution of human scores changes substantially across language pairs. In particular, we see very little variability in quality for En-De, which makes QE for this language pair especially challenging (see §5). Finally, as shown in the right-most columns, annotation consistency is similar across language pairs and comparable to existing work that follows the DA methodology for data collection. For example, Graham et al. (2013) report an average difference of 25 across annotators' scores.
Data splits. To enable comparison between supervised and unsupervised approaches to QE, we split the data into a 7K training partition, a 1K development set, and two test sets of 1K sentences each. One of these test sets is used for the experiments in this paper; the other is kept blind for future work.
Additional data. To support our discussion of the effect of NMT training on the correlation between predictive probabilities and perceived translation quality presented in §6, we trained various alternative NMT system variants, then translated and annotated 400 original Estonian sentences from our test set with each system variant.
The data, the NMT models and the DA judgments are available at https://github.com/facebookresearch/mlqe.
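The standardization step can be illustrated with a short sketch: each raw score is converted to a z-score using the mean and standard deviation of the annotator who produced it, and the z-scores for each segment are then averaged. The input layout (a list of annotator, segment, score triples), the function name `z_normalize`, and the use of the population standard deviation are assumptions made for this example.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def z_normalize(ratings):
    """Per-annotator z-scores, averaged per segment.

    ratings: list of (annotator_id, segment_id, raw_score) triples.
    Returns a dict mapping segment_id to its mean z-score.
    """
    # Collect every raw score given by each annotator.
    by_annotator = defaultdict(list)
    for annotator, _, score in ratings:
        by_annotator[annotator].append(score)

    # Each annotator's overall mean and standard deviation.
    stats = {a: (fmean(s), pstdev(s)) for a, s in by_annotator.items()}

    # Standardize each rating, then group by segment.
    by_segment = defaultdict(list)
    for annotator, segment, score in ratings:
        mean, std = stats[annotator]
        # An annotator with constant scores carries no signal; map to 0.
        z = (score - mean) / std if std > 0 else 0.0
        by_segment[segment].append(z)

    return {segment: fmean(zs) for segment, zs in by_segment.items()}


# Toy usage: annotator B is systematically more generous than annotator A.
ratings = [
    ("A", "s1", 40), ("A", "s2", 60),
    ("B", "s1", 70), ("B", "s2", 90),
]
print(z_normalize(ratings))
```

In the toy input both annotators rank s2 above s1 by the same margin relative to their own scale, so standardization removes the offset between them before averaging.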

Experiments and Results
Below we analyze how our unsupervised QE indicators correlate with human judgments.

Settings
Benchmark supervised QE systems. We compare the performance of the proposed unsupervised QE indicators against the best performing supervised approaches with available open-source implementation, namely the Predictor-Estimator (PredEst) architecture (Kim et al., 2017b) provided by the OpenKiwi toolkit (Kepler et al., 2019b), and an improved version of the BiRNN model provided by the DeepQuest toolkit (Ive et al., 2018), which we refer to as BERT-BiRNN (Blain et al., 2020).
PredEst. We trained PredEst models (see §2) using the same parameters as in the default configurations provided by Kepler et al. (2019b). Predictor models were trained for 6 epochs on the same training and development data as the NMT systems, while the Estimator models were trained for 10 epochs on the training and development sets of our dataset (see §4). Unlike Kepler et al. (2019b), the Estimator was not trained using multi-task learning, as our dataset currently does not contain any word-level annotation. We use the model corresponding to the best epoch as identified by the metric of reference on the development set: perplexity for the Predictor and Pearson correlation for the Estimator.
BERT-BiRNN. This model, similarly to the recent SOTA QE systems (Kepler et al., 2019a), uses a large scale pre-trained BERT model to obtain token-level representations which are then fed into two independent bidirectional RNNs to encode both the source sentence and its translation independently. The two resulting sentence representations are then concatenated as a weighted sum of their word vectors, using an attention mechanism. The final sentence-level representation is then fed to a sigmoid layer to produce the sentence-level quality estimates. During training, BERT was fine-tuned by unfreezing the weights of the last four layers along with the embedding layer. We used early stopping based on Pearson correlation on the development set, with a patience of 5.
Unsupervised QE. For the dropout-based indicators (see §3.2), we use a dropout rate of 0.3, the same as for training the NMT models (see §4). We perform N = 30 inference passes to obtain the posterior probability distribution. N was chosen following the experiments in related work (Dong et al., 2018; Wang et al., 2019). However, we note that increasing N beyond 10 results in very small improvements on the development set. The implementation of stochastic decoding with MC dropout is available as part of the fairseq toolkit at https://github.com/pytorch/fairseq.
Table 2 shows Pearson correlation with DA for our unsupervised QE indicators and for the supervised QE systems. Unsupervised QE indicators are grouped as follows: Group I corresponds to the measurements obtained with standard decoding (§3.1); Group II contains indicators computed using MC dropout (§3.2); and Group III contains the results for attention-based indicators (§3.3). Group IV corresponds to the supervised QE models presented in §5.1. We use the Hotelling-Williams test to compute significance of the difference between dependent correlations (Williams, 1959) with p-value < 0.05. For each language pair, results that are not significantly outperformed by any method are marked in bold; results that are not significantly outperformed by any other method from the same group are underlined.
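The evaluation protocol above relies on two quantities: the Pearson correlation between an indicator and the human scores, and a test statistic for the difference between two dependent correlations. The sketch below is an illustration under stated assumptions rather than the evaluation code used for Table 2. The `williams_t` function implements the commonly cited form of the Williams test statistic, where r12 and r13 are the correlations of two indicators with the human scores, r23 is the correlation between the two indicators, and n is the number of segments; the statistic is compared against a t-distribution with n - 3 degrees of freedom, and the p-value lookup is omitted here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def williams_t(r12, r13, r23, n):
    """Williams test statistic for two dependent correlations.

    r12: correlation of indicator A with human scores
    r13: correlation of indicator B with human scores
    r23: correlation between indicators A and B
    n:   number of scored segments (must exceed 3)
    """
    k = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23
    numerator = (r12 - r13) * math.sqrt((n - 1) * (1 + r23))
    denominator = math.sqrt(
        2 * k * (n - 1) / (n - 3) + ((r12 + r13) ** 2 / 4) * (1 - r23) ** 3
    )
    return numerator / denominator


# Toy usage with made-up scores (not results from the paper).
human = [0.1, 0.4, 0.35, 0.8, 0.9]
indicator_a = [0.2, 0.5, 0.3, 0.7, 1.0]
indicator_b = [0.9, 0.1, 0.5, 0.6, 0.4]
r_a, r_b = pearson(human, indicator_a), pearson(human, indicator_b)
r_ab = pearson(indicator_a, indicator_b)
print(r_a, r_b, williams_t(r_a, r_b, r_ab, len(human)))
```

A positive statistic indicates that the first indicator correlates more strongly with the human scores than the second; whether the difference is significant depends on the t-distribution threshold for the chosen p-value.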

Correlation with Human Judgments
We observe that the simplest measure that can be extracted from NMT, sequence-level probability (TP), already performs competitively, in particular for the medium-resource language pairs. TP is consistently outperformed by D-TP, indicating that NMT output probabilities are not well calibrated. This confirms our hypothesis that estimating model uncertainty improves correlation with perceived translation quality. Furthermore, our approach performs competitively with strong supervised QE models. Dropout-based indicators significantly outperform PredEst and rival BERT-BiRNN for four language pairs 6. These results position the proposed unsupervised QE methods as an attractive alternative to the supervised approach in the scenario where the NMT model used to generate the translations can be accessed.
6 We note that PredEst models are systematically and significantly outperformed by BERT-BiRNN. This is not surprising, as large-scale pretrained representations have been shown to boost model performance for QE (Kepler et al., 2019a) and other natural language processing tasks (Devlin et al., 2019).
For both unsupervised and supervised methods, performance varies considerably across language pairs. The highest correlation is achieved for the medium-resource languages, whereas for high-resource language pairs it is drastically lower. The main reason for this difference is a lower variability in translation quality for high-resource language pairs. Figure 2 shows scatter plots for Ro-En, which has the best correlation results, and En-De, which has the lowest correlation for all quality indicators. Ro-En has a substantial number of high-quality sentences, but the rest of the translations are uniformly distributed across the quality range. The distribution for En-De is highly skewed, as the vast majority of the translations are of high quality. In this case capturing meaningful variation appears to be more challenging, as the differences reflected by the DA may be more subtle than any of the QE methods is able to reveal.

Table 3: Examples of MT output and MC dropout hypotheses.

Example 1
Reference: Nile perch and kapenta are fished from Lake Tanganyika.
MT Output: There is a silver thread and candle from Tanzeri.
Dropout: There will be a silver thread and a penny from Tanzer.
         There is an attempt at a silver greed and a carpenter from Tanzeri.
         There will be a silver bullet and a candle from Tanzer.
         The puzzle is being caught in the chicken's gavel and the coffin.

Example 2
Reference: This could however lead to a split between the inner and outer view.
MT Output: Then there may be a split between internal and external viewpoints.
Dropout: Then, however, there may be a split between internal and external viewpoints.
         Then, however, there may be a gap between internal and external viewpoints.
         Then there may be a split between internal and external viewpoints.
         Then there may be a split between internal and external viewpoints.

The reason for a lower correlation for Sinhala and Nepalese is different. For unsupervised indicators it can be due to the difference in model capacity⁷ and the amount of training data. On the one hand, increasing depth and width of the model may negatively affect calibration (Guo et al., 2017). On the other hand, due to the small amount of training data the model can overfit, resulting in inferior results both in terms of translation quality and correlation. It is noteworthy, however, that the supervised QE system suffers a larger drop in performance than unsupervised indicators, as its predictor component requires large amounts of parallel data for training. We suggest, therefore, that unsupervised QE is more stable in low-resource scenarios than supervised approaches.
⁷ Models for these languages were trained using Transformer-Big architecture from Vaswani et al. (2017).
We now look in more detail at the three groups of unsupervised measurements in Table 2.
Group I Average entropy of the softmax output (Softmax-Ent) and dispersion of the values of token-level probabilities (Sent-Std) achieve a significantly higher correlation than the TP metric for four language pairs. Softmax-Ent captures uncertainty of the output probability distribution, which appears to be a more accurate reflection of the overall translation quality. Sent-Std captures a pattern in the sequence of token-level probabilities that helps detect low-quality translations, as illustrated in Figure 1. Figure 1 shows two Et-En translations which have drastically different absolute DA scores of 62 and 1, but the difference in their sentence-level log-probability is negligible: -0.50 and -0.48 for the first and second translations, respectively. By contrast, the sequences of token-level probabilities are very different, as the second sentence has larger variation in the log-probabilities for adjacent words, with very high probabilities for high-frequency function words and low probabilities for content words.
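To make the Group I indicators concrete, the following minimal Python sketch computes TP, Softmax-Ent and Sent-Std from token-level probabilities. It assumes that TP is the length-normalized sequence log-probability, that Softmax-Ent averages the entropy of the softmax distribution over decoding steps, and that Sent-Std is the (population) standard deviation of token-level log-probabilities. The function names and the toy probabilities are illustrative assumptions, not values from the experiments.

```python
import math

def tp(token_probs):
    """Length-normalized sequence log-probability (assumed form of TP)."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)

def softmax_ent(step_distributions):
    """Entropy of the softmax distribution, averaged over decoding steps."""
    total = 0.0
    for dist in step_distributions:
        total -= sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(step_distributions)

def sent_std(token_probs):
    """Population standard deviation of token-level log-probabilities."""
    logs = [math.log(p) for p in token_probs]
    mean = sum(logs) / len(logs)
    return math.sqrt(sum((x - mean) ** 2 for x in logs) / len(logs))

# Toy token probabilities (hypothetical): same TP, different token-level pattern.
flat = [0.6, 0.6, 0.6, 0.6]
spiky = [0.9, 0.4, 0.9, 0.4]  # 0.9 * 0.4 = 0.36 = 0.6 ** 2

print(tp(flat), tp(spiky))              # nearly identical sequence-level scores
print(sent_std(flat), sent_std(spiky))  # only the second sequence shows dispersion
```

In this toy example both sequences receive the same TP, yet only the second one shows the alternating pattern of high and low token probabilities that Sent-Std is meant to capture.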
Group II The best results are achieved by the D-Lex-Sim and D-TP metrics. Interestingly, D-Var has a much lower correlation, since by only capturing variance it ignores the actual probability estimate assigned by the model to the given output.⁸ Table 3 provides an illustration of how model uncertainty captured by MC dropout reflects the quality of MT output. The first example contains a low-quality translation, with a high variability in MT hypotheses obtained with MC dropout. By contrast, MC dropout hypotheses for the second, high-quality example are very similar and, in fact, constitute valid linguistic paraphrases of each other. This fact is directly exploited by the D-Lex-Sim metric, which measures the variability between MT hypotheses generated with perturbed model parameters and performs on par with D-TP. Besides capturing model uncertainty, D-Lex-Sim reflects the potential complexity of the source segments, as the number of different possible translations of the sentences is an indicator of their inherent ambiguity.⁹
Group III While our attention-based metrics also achieve a sensible correlation with human judgments, it is considerably lower than that of the rest of the unsupervised indicators. Attention may not provide enough information to be used as a quality indicator on its own, since there is no direct mapping between words in different languages, and, therefore, high entropy in attention weights does not necessarily indicate low translation quality. We leave experiments with combined attention and probability-based measures to future work.
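The dropout-based indicators of Group II can be sketched in a few lines of Python. The sketch assumes that D-TP and D-Var are the mean and variance of TP over several stochastic forward passes, and that D-Lex-Sim averages a pairwise similarity over the MC dropout hypotheses. The similarity used here is plain token overlap (Jaccard), swapped in as a simple stand-in for a proper MT similarity metric; all function names are hypothetical. The hypotheses are the ones shown in Table 3.

```python
from itertools import combinations

def d_tp(tp_scores):
    """Mean of TP over N stochastic forward passes."""
    return sum(tp_scores) / len(tp_scores)

def d_var(tp_scores):
    """Variance of TP over N stochastic forward passes."""
    mean = d_tp(tp_scores)
    return sum((s - mean) ** 2 for s in tp_scores) / len(tp_scores)

def token_overlap(a, b):
    """Jaccard overlap of lowercased whitespace tokens (stand-in similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def d_lex_sim(hypotheses, sim=token_overlap):
    """Average pairwise similarity between MC dropout hypotheses."""
    pairs = list(combinations(hypotheses, 2))
    return sum(sim(a, b) for a, b in pairs) / len(pairs)

low_quality = [
    "There will be a silver thread and a penny from Tanzer.",
    "There is an attempt at a silver greed and a carpenter from Tanzeri.",
    "There will be a silver bullet and a candle from Tanzer.",
    "The puzzle is being caught in the chicken's gavel and the coffin.",
]
high_quality = [
    "Then, however, there may be a split between internal and external viewpoints.",
    "Then, however, there may be a gap between internal and external viewpoints.",
    "Then there may be a split between internal and external viewpoints.",
    "Then there may be a split between internal and external viewpoints.",
]

print(d_lex_sim(low_quality), d_lex_sim(high_quality))
```

Even with this crude similarity function, the hypotheses for the high-quality example agree with each other far more than those for the low-quality example.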
The use of multi-head attention with multiple layers in Transformer may also negatively affect the results. As shown by Voita et al. (2019), different attention heads are responsible for different functions. Therefore, combining the information coming from different heads and layers in a simple way may not be an optimal solution. To test whether this is the case, we computed attention entropy and its correlation with DA for all possible combinations of heads and layers. As shown in Table 2, the best head/layer combination (AW:best head/layer) indeed significantly outperforms other attention-based measurements for all language pairs, suggesting that this method should be preferred over simple averaging. Using the best head/layer combination for QE is limited by the fact that it requires validation on a dataset annotated with DA and thus is not fully unsupervised. This outcome opens an interesting direction for further experiments to automatically discover the best possible head/layer combination.
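The head/layer search described above can be illustrated with a small Python sketch. It assumes that attention entropy is computed per target token over the source positions and averaged over target tokens, and that the best head/layer combination is the one with the strongest absolute Pearson correlation with DA on annotated validation data. The function names, the dictionary layout and the toy numbers are hypothetical.

```python
import math

def attention_entropy(attn):
    """attn[i][j]: weight on source token j for target token i (rows sum to 1).
    Returns the entropy of the rows, averaged over target tokens."""
    total = 0.0
    for row in attn:
        total -= sum(w * math.log(w) for w in row if w > 0.0)
    return total / len(attn)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def best_head(entropies, da_scores):
    """entropies[(layer, head)] holds one entropy value per sentence.
    Picks the combination with the strongest absolute correlation with DA."""
    return max(entropies, key=lambda k: abs(pearson(entropies[k], da_scores)))

# Toy validation data (hypothetical): four sentences, two head/layer combinations.
da = [90, 70, 40, 10]
entropies = {
    (0, 0): [1.0, 1.1, 0.9, 1.0],  # barely related to quality
    (5, 3): [0.2, 0.9, 1.5, 2.0],  # entropy grows as quality drops
}
print(best_head(entropies, da))
```

Because the selection step needs DA-annotated sentences, such a procedure is only as unsupervised as the validation data it relies on.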

Discussion
In the previous Section we studied the performance of our unsupervised quality indicators for different language pairs. In this Section we validate our results by looking at two additional factors: domain shift and underlying NMT system.

Domain Shift
One way to evaluate how well a model represents uncertainty is to measure the difference in model confidence under domain shift (Hendrycks and Gimpel, 2016;Lakshminarayanan et al., 2017;Snoek et al., 2019). A well calibrated model should produce low confidence estimates when tested on data points that are far away from the training data.
Overconfident predictions on out-of-domain sentences would undermine the benefits of unsupervised QE for NMT. This is particularly relevant given the current wide use of NMT for translating mixed-domain data online. Therefore, we conduct a small experiment to compare model confidence on in-domain and out-of-domain data. We focus on the Et-En language pair. We use the test partition of the MT training dataset as our in-domain sample. To generate the out-of-domain sample, we sort our Wikipedia data (prior to the sentence sampling stage in §4) by distance to the training data and select the top 500 segments with the largest distance score. To compute distance scores we follow the strategy of Niehues and Pham (2019), which measures the test/training data distance based on the hidden states of the NMT encoder.
We compute model posterior probabilities for the translations of the in-domain and out-of-domain samples, obtained either through standard decoding or using MC dropout. TP obtains average values of -0.440 and -0.445 for in-domain and out-of-domain data respectively, whereas for D-TP these values are -0.592 and -0.685. The difference between in-domain and out-of-domain confidence estimates obtained by standard decoding is negligible. The difference between MC dropout average probabilities for in-domain vs. out-of-domain samples was found to be statistically significant under Student's t-test, with p-value < 0.01. Thus, expectation over predictive probabilities with MC dropout indeed provides a better estimation of model uncertainty for NMT, and therefore can improve the robustness of unsupervised QE on out-of-domain data.
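The significance test above can be sketched as a two-sample Student's t statistic with pooled variance. The per-sentence scores below are hypothetical values chosen only for illustration, not the scores from the experiment, and the function name is an assumption.

```python
import math
from statistics import mean, variance

def student_t(sample_a, sample_b):
    """Two-sample Student's t statistic (pooled variance) and degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    dof = na + nb - 2
    # statistics.variance is the sample variance (n - 1 in the denominator).
    pooled = ((na - 1) * variance(sample_a) + (nb - 1) * variance(sample_b)) / dof
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, dof

# Hypothetical per-sentence D-TP scores for the two samples.
in_domain = [-0.55, -0.60, -0.58, -0.62, -0.61]
out_of_domain = [-0.66, -0.70, -0.69, -0.71, -0.67]

t_stat, dof = student_t(in_domain, out_of_domain)
print(t_stat, dof)
```

The p-value is then obtained from the t distribution with `dof` degrees of freedom, for instance through a statistics library.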

NMT Calibration across NMT Systems
Findings in the previous Section suggest that using model probabilities results in fairly high correlation with human judgments for various language pairs. In this Section we study how well these findings generalize to different NMT systems. The list of model variants that we explore is by no means exhaustive and was motivated by common practices in MT and by the factors that can negatively affect model calibration (number of training epochs) or help represent uncertainty (model ensembling). For this small-scale experiment we focus on Et-En. For each system variant we translated 400 sentences from the test partition of our dataset and collected the DA accordingly. As a baseline, we use a standard Transformer model with beam search decoding. All system variants are trained using the Fairseq implementation for 30 epochs, with the best checkpoint chosen according to the validation loss.
First, we consider three system variants with differences in architecture or training: RNN-based NMT (Bahdanau et al., 2015; Luong et al., 2015), Mixture of Experts (MoE, He et al., 2018; Shen et al., 2019; Cho et al., 2019) and model ensemble (Garmash and Monz, 2016). Shen et al. (2019) use the MoE framework to capture the inherent uncertainty of the MT task, where the same input sentence can have multiple correct translations. A mixture model introduces a multinomial latent variable to control generation and produce a diverse set of MT hypotheses. In our experiment we use a hard mixture model with a uniform prior and 5 mixture components. To produce the translations we generate from a randomly chosen component with standard beam search. To obtain the probability estimates we average the probabilities from all mixture components.
Previous work has used model ensembling as a strategy for representing model uncertainty (Lakshminarayanan et al., 2017; Pearce et al., 2018).¹⁰ In NMT, ensembling has been used to improve translation quality. We train four Transformer models initialized with different random seeds. At decoding time predictive distributions from different models are combined by averaging.
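Combining predictive distributions by averaging can be written as an element-wise mean over the per-model next-token distributions. The sketch below is a minimal illustration with a hypothetical three-word vocabulary and made-up probabilities; it is not tied to any particular toolkit.

```python
def ensemble_average(distributions):
    """Element-wise mean of next-token distributions from several models.
    Each distribution is a list of probabilities over the same vocabulary."""
    n = len(distributions)
    return [sum(column) / n for column in zip(*distributions)]

# Hypothetical next-token distributions from two models.
model_a = [0.7, 0.2, 0.1]
model_b = [0.5, 0.3, 0.2]

averaged = ensemble_average([model_a, model_b])
print(averaged)
```

Since each input is a probability distribution, the average is again a valid distribution over the vocabulary.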
Second, we consider two alternatives to beam search: diverse beam search (Vijayakumar et al., 2016) and sampling. For sampling, we generate translations one token at a time by sampling from the model conditional distribution p(y_j | y_{<j}, x, θ), until the end of sequence symbol is generated. For comparison, we also compute the D-TP metric for the standard Transformer model on the subset of 400 segments considered for this experiment. Table 4 shows the results. Interestingly, the correlation between output probabilities and DA is not necessarily related to the quality of MT outputs. For example, sampling produces much higher correlation although the quality is much lower. This is in line with previous work that indicates that sampling results in better calibrated probability distribution than beam search (Ott et al., 2018a). System variants that promote diversity in NMT outputs (diverse beam search and MoE) do not achieve any improvement in correlation over the standard Transformer model. The best results both in quality and QE are achieved by ensembling, which provides additional evidence that better uncertainty quantification in NMT improves correlation with human judgments. MC dropout achieves very similar results. We recommend using either of these two methods for NMT systems with unsupervised QE.
Table 4: Pearson correlation (r) between sequence-level output probabilities (TP) and average DA for translations generated by different NMT systems.
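The token-by-token sampling procedure used as an alternative to beam search can be sketched as follows. The conditional model is replaced by a hypothetical toy function that ignores the source sentence, and the length cap `max_len` is an added safety assumption rather than part of the procedure described above.

```python
import math
import random

def sample_translation(next_token_dist, eos="</s>", max_len=50, rng=None):
    """Draw one token at a time from the conditional distribution until EOS.
    next_token_dist(prefix) returns a dict mapping tokens to probabilities."""
    rng = rng or random.Random()
    prefix, log_prob = [], 0.0
    for _ in range(max_len):
        dist = next_token_dist(prefix)
        tokens, probs = zip(*dist.items())
        token = rng.choices(tokens, weights=probs, k=1)[0]
        log_prob += math.log(dist[token])
        if token == eos:
            break
        prefix.append(token)
    return prefix, log_prob

def toy_model(prefix):
    """Hypothetical stand-in for an NMT decoder."""
    if len(prefix) >= 3:
        return {"</s>": 1.0}
    return {"a": 0.5, "b": 0.3, "</s>": 0.2}

tokens, log_prob = sample_translation(toy_model, rng=random.Random(0))
print(tokens, log_prob)
```

The accumulated `log_prob` is the sequence-level log-probability of the sampled output, which can then be length-normalized and compared with DA in the same way as for beam search outputs.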

NMT Calibration across Training Epochs
The final question we address is how the correlation between translation probabilities and translation quality is affected by the amount of training. We train our base Et-En Transformer system for 60 epochs. We generate and evaluate translations after each epoch. We use the test partition of the MT training set and assess translation quality with the Meteor evaluation metric. Figure 3 shows the average Meteor scores (blue) and Pearson correlation (orange) between segment-level Meteor scores and translation probabilities from the MT system for each epoch.
Interestingly, as training continues, test quality stabilizes whereas the relation between model probabilities and translation quality deteriorates. During training, after the model is able to correctly classify most of the training examples, the loss can be further minimized by increasing the confidence of predictions (Guo et al., 2017). Thus longer training does not affect output quality but damages calibration.

Conclusions
We have devised an unsupervised approach to QE where no training or access to any additional resources besides the MT system is required. Besides exploiting the softmax output probability distribution and the entropy of attention weights from the NMT model, we leverage uncertainty quantification for unsupervised QE. We show that, if carefully designed, the indicators extracted from the NMT system constitute a rich source of information, competitive with supervised QE methods.
We analyzed how different MT architectures and training settings affect the relation between predictive probabilities and translation quality. We showed that improved translation quality does not necessarily imply a stronger correlation between translation quality and predictive probabilities. Model ensembling has been shown to achieve the best results both in terms of translation quality and when using output probabilities as an unsupervised quality indicator.
Finally, we created a new multilingual dataset for QE covering various scenarios for MT development including low- and high-resource language pairs. Both the dataset and the MT models needed to reproduce the results of our experiments are available at https://github.com/facebookresearch/mlqe. This work can be extended in many directions. First, our sentence-level unsupervised metrics could be adapted for QE at other levels (word, phrase and document). Second, the proposed metrics can be combined as features in supervised QE approaches. Finally, other methods for uncertainty quantification, as well as other types of uncertainty, can be explored.