Abstract
Quality Estimation (QE) is an important component in making Machine Translation (MT) useful in real-world applications, as it aims to inform the user about the quality of the MT output at test time. Existing approaches require large amounts of expert-annotated data, computation, and time for training. As an alternative, we devise an unsupervised approach to QE where no training or access to additional resources besides the MT system itself is required. Different from most current work, which treats the MT system as a black box, we explore useful information that can be extracted from the MT system as a by-product of translation. By utilizing methods for uncertainty quantification, we achieve very good correlation with human judgments of quality, rivaling state-of-the-art supervised QE models. To evaluate our approach, we collect the first dataset that enables work on both black-box and glass-box approaches to QE.
1 Introduction
With the advent of neural models, Machine Translation (MT) systems have made substantial progress, reportedly achieving near-human quality for high-resource language pairs (Hassan et al., 2018; Barrault et al., 2019). However, translation quality is not consistent across language pairs, domains, and datasets. This is problematic for low-resource scenarios, where there is not enough training data and translation quality significantly lags behind. Additionally, neural MT (NMT) systems can be deceptive to the end user as they can generate fluent translations that differ in meaning from the original (Bentivogli et al., 2016; Castilho et al., 2017). Thus, it is crucial to have a feedback mechanism to inform users about the trustworthiness of a given MT output.
Quality estimation (QE) aims to predict the quality of the output provided by an MT system at test time when no gold-standard human translation is available. State-of-the-art (SOTA) QE models require large amounts of parallel data for pre-training and in-domain translations annotated with quality labels for training (Kim et al., 2017a; Fonseca et al., 2019). However, such large collections of data are only available for a small set of languages in limited domains.
Current work on QE typically treats the MT system as a black box. In this paper we propose an alternative glass-box approach to QE that allows us to address the task as an unsupervised problem. We posit that encoder-decoder NMT models (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017) offer a rich source of information for directly estimating translation quality: (a) the output probability distribution from the NMT system (i.e., the probabilities obtained by applying the softmax function over the entire vocabulary of the target language); and (b) the attention mechanism used during decoding. Our assumption is that the more confident the decoder is, the higher the quality of the translation.
While sequence-level probabilities of the top MT hypothesis have been used for confidence estimation in statistical MT (Specia et al., 2013; Blatz et al., 2004), the output probabilities from deep Neural Networks (NNs) are generally not well calibrated, that is, not representative of the true likelihood of the predictions (Nguyen and O’Connor, 2015; Guo et al., 2017; Lakshminarayanan et al., 2017). Moreover, softmax output probabilities tend to be overconfident and can assign a large probability mass to predictions that are far from the training data (Gal and Ghahramani, 2016). To overcome such deficiencies, we propose ways to exploit output distributions beyond the top-1 prediction by exploring uncertainty quantification methods for better probability estimates (Gal and Ghahramani, 2016; Lakshminarayanan et al., 2017). In our experiments, we account for different factors that can affect the reliability of model probability estimates in NNs, such as model architecture, training, and search (Guo et al., 2017).
In addition, we study the attention mechanism as another source of information on NMT quality. Attention can be interpreted as a soft alignment, providing an indication of the strength of the relationship between source and target words (Bahdanau et al., 2015). Although this interpretation is straightforward for NMT based on Recurrent Neural Networks (RNNs) (Rikters and Fishel, 2017), its application to current SOTA Transformer models with multihead attention (Vaswani et al., 2017) is challenging. We analyze to what extent meaningful information on translation quality can be extracted from multihead attention.
To evaluate our approach in challenging settings, we collect a new dataset for QE with 6 language pairs representing NMT training in high-, medium-, and low-resource scenarios. To reduce the chance of overfitting to particular domains, our dataset is constructed from Wikipedia documents. We annotate 10K segments per language pair. In contrast to the vast majority of work on QE, which uses semi-automatic metrics based on post-editing distance as gold standard, we perform quality labeling based on the Direct Assessment (DA) methodology (Graham et al., 2015b), which has been widely used in popular MT evaluation campaigns in recent years. At the same time, the collected data differs from the existing datasets annotated with DA judgments for the well-known WMT Metrics Task in two important ways: we provide enough data to train supervised QE models and access to the NMT systems used to generate the translations, thus allowing for further exploration of the glass-box unsupervised approach to QE for NMT introduced in this paper.
Our main contributions can be summarized as follows: (i) a new, large-scale dataset for sentence-level QE annotated with DA rather than post-editing metrics (§4); (ii) a set of unsupervised quality indicators that can be produced as a by-product of NMT decoding and a thorough evaluation of how they correlate with human judgments of translation quality (§3 and §5); (iii) the first attempt at analyzing the attention distribution for the purposes of unsupervised QE in Transformer models (§3 and §5); and (iv) an analysis of how model confidence relates to translation quality for different NMT systems (§6). Our experiments show that unsupervised QE indicators obtained from well-calibrated NMT model probabilities rival strong supervised SOTA models in terms of correlation with human judgments.
2 Related Work
QE
QE is typically addressed as a supervised machine learning task where the goal is to predict MT quality in the absence of a reference translation. Traditional feature-based approaches relied on manually designed features, extracted from the MT system (glass-box features) or obtained from the source and translated sentences as well as from external resources, such as monolingual or parallel corpora (black-box features) (Specia et al., 2009).
Currently, the best performing approaches to QE use NNs to learn useful representations for source and target sentences (Kim et al., 2017b; Wang et al., 2018; Kepler et al., 2019a). A notable example is the Predictor-Estimator (PredEst) model (Kim et al., 2017b), which consists of an encoder-decoder RNN (predictor) trained on parallel data for a word prediction task and a unidirectional RNN (estimator) that produces quality estimates leveraging the context representations generated by the predictor. Despite achieving strong performance, neural-based approaches are resource-heavy, require a significant amount of in-domain labeled data for training, and do not use any internal information from the MT system.
Existing work on glass-box QE is limited to features extracted from statistical MT, such as language model probabilities or number of hypotheses in the n-best list (Blatz et al., 2004; Specia et al., 2013). The few approaches for unsupervised QE are also inspired by the work on statistical MT and perform significantly worse than supervised approaches (Popović, 2012; Moreau and Vogel, 2012; Etchegoyhen et al., 2018). For example, Etchegoyhen et al. (2018) use lexical translation probabilities from word alignment models and language model probabilities. Their unsupervised approach averages these features to produce the final score. However, it is largely outperformed by the neural-based supervised QE systems (Specia et al., 2018).
The only works that explore internal information from neural models as an indicator of translation quality rely on the entropy of attention weights in RNN-based NMT systems (Rikters and Fishel, 2017; Yankovskaya et al., 2018). However, attention-based indicators perform competitively only when combined with other QE features in a supervised framework. Furthermore, this approach is not directly applicable to the SOTA Transformer model, which uses a multihead attention mechanism. Recent work on attention interpretability showed that attention weights in Transformer networks might not be readily interpretable (Vashishth et al., 2019; Vig and Belinkov, 2019). Voita et al. (2019) show that different attention heads of the Transformer have different functions and that some of them are more important than others. This makes it challenging to extract information from attention weights in the Transformer (see §5).
To the best of our knowledge, our work is the first on glass-box unsupervised QE for NMT that performs competitively with respect to the SOTA supervised systems.
QE Datasets
The performance of QE systems has typically been assessed using the semi-automatic Human-mediated Translation Edit Rate (Snover et al., 2006) metric as gold standard. However, the reliability of this metric for assessing the performance of QE systems has been shown to be questionable (Graham et al., 2016). The current practice in MT evaluation is the so-called Direct Assessment (DA) of MT quality (Graham et al., 2015b), where raters evaluate the MT output on a continuous 1–100 scale. This method has been shown to improve the reproducibility of manual evaluation and to provide a more reliable gold standard for automatic evaluation metrics (Graham et al., 2015a).
The DA methodology is currently used for manual evaluation of MT quality at the WMT translation tasks, as well as for assessing the performance of reference-based automatic MT evaluation metrics at the WMT Metrics Task (Bojar et al., 2016, 2017; Ma et al., 2018, 2019). Existing datasets with sentence-level DA judgments from the WMT Metrics Task could in principle be used for benchmarking QE systems. However, they contain only a few hundred segments per language pair and thus hardly allow for training supervised systems, as illustrated by the weak correlation results for QE on DA judgments based on the Metrics Task data recently reported by Fonseca et al. (2019). Furthermore, for each language pair the data contains translations from a number of MT systems, often with different architectures, and these MT systems are not readily available, making glass-box QE experiments impossible. Finally, the judgments are either crowd-sourced or collected from task participants rather than professional translators, which may affect the reliability of the labels. We collect a new dataset for QE that addresses these limitations (§4).
Uncertainty Quantification
Uncertainty quantification in NNs is typically addressed using a Bayesian framework where the point estimates of their weights are replaced with probability distributions (MacKay, 1992; Graves, 2011; Welling and Teh, 2011; Tran et al., 2019). Various approximations have been developed to avoid high training costs of Bayesian NNs, such as Monte Carlo Dropout (Gal and Ghahramani, 2016) or model ensembling (Lakshminarayanan et al., 2017). The performance of uncertainty quantification methods is commonly evaluated by measuring calibration, that is, the relation between predictive probabilities and the empirical frequencies of the predicted labels, or by assessing generalization of uncertainty under domain shift (see §6).
Only a few studies have analyzed calibration in NMT, and they came to contradictory conclusions. Kumar and Sarawagi (2019) measure calibration error by comparing model probabilities with the percentage of times the NMT output matches the reference translation, and conclude that NMT probabilities are poorly calibrated. However, the calibration error metrics they use are designed for binary classification tasks and cannot be easily transferred to NMT (Kuleshov and Liang, 2015). Ott et al. (2019) analyze uncertainty in NMT by comparing predictive probability distributions with the empirical distribution observed in human translation data, and conclude that NMT models are well calibrated. However, this approach is limited by the fact that there are many possible correct translations for a given sentence and only one human translation is available in practice. Although the goal of this paper is to devise an unsupervised solution for the QE task, the analysis presented here provides new insights into calibration in NMT. Different from existing work, we study the relation between model probabilities and human judgments of translation correctness.
Uncertainty quantification methods have been successfully applied to various practical tasks, for example, neural semantic parsing (Dong et al., 2018), hate speech classification (Miok et al., 2019), or back-translation for NMT (Wang et al., 2019). Wang et al. (2019), whose work is the closest to ours, explore a small set of uncertainty-based metrics to minimize the weight of erroneous synthetic sentence pairs for back-translation in NMT. However, improved NMT training with weighted synthetic data does not necessarily imply better prediction of MT quality. In fact, the metrics that Wang et al. (2019) report to perform best for back-translation do not perform well for QE (see §3.2).
3 Unsupervised QE for NMT
3.1 Exploiting the Softmax Distribution
If most of the probability mass is concentrated on a few vocabulary words, the generated target word is likely to be correct. By contrast, if the softmax probabilities approach a uniform distribution, picking any word from the vocabulary is equally likely and the quality of the resulting translation is expected to be low.
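The sentence-level indicators evaluated in §5 under Group I (TP, Softmax-Ent, and Sent-Std) can all be computed from the token-level output distributions of a single decoding pass. The sketch below illustrates one way to compute them, assuming the per-step softmax distributions are available as a NumPy array; the length normalization of TP and all variable names are illustrative assumptions, not the released implementation.

```python
import numpy as np

def softmax_indicators(step_probs: np.ndarray, target_ids: np.ndarray) -> dict:
    """Sentence-level quality indicators from a single decoding pass.

    step_probs: shape (T, V), softmax distribution over the target
                vocabulary at each of the T decoding steps.
    target_ids: shape (T,), indices of the generated target tokens.
    """
    eps = 1e-12
    # Log-probability of each generated token.
    token_logprobs = np.log(step_probs[np.arange(len(target_ids)), target_ids] + eps)

    # TP: sequence-level log-probability of the MT output
    # (length-normalized here; the normalization is an assumption).
    tp = token_logprobs.mean()

    # Softmax-Ent: entropy of the output distribution averaged over steps;
    # a flatter (more uniform) distribution yields higher entropy.
    entropies = -(step_probs * np.log(step_probs + eps)).sum(axis=1)
    softmax_ent = entropies.mean()

    # Sent-Std: dispersion of token-level log-probabilities within the sentence.
    sent_std = token_logprobs.std()

    return {"TP": tp, "Softmax-Ent": softmax_ent, "Sent-Std": sent_std}
```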
3.2 Quantifying Uncertainty
It has been argued in recent work that deep neural networks do not properly represent model uncertainty (Gal and Ghahramani, 2016; Lakshminarayanan et al., 2017). Uncertainty quantification in deep learning typically relies on the Bayesian formalism (MacKay, 1992; Graves, 2011; Welling and Teh, 2011; Gal and Ghahramani, 2016; Tran et al., 2019). Bayesian NNs learn a posterior distribution over parameters that quantifies model or epistemic uncertainty, i.e., our lack of knowledge as to which model generated the training data. Bayesian NNs usually come with prohibitive computational costs, and various approximations have been developed to alleviate this. In this paper we explore Monte Carlo (MC) dropout (Gal and Ghahramani, 2016).
Dropout is a method introduced by Srivastava et al. (2014) to reduce overfitting when training neural models. It consists of randomly masking neurons to zero according to a Bernoulli distribution. Gal and Ghahramani (2016) use dropout at test time before every weight layer: they perform several forward passes through the network and collect the posterior probabilities generated by the model with parameters perturbed by dropout. The mean and variance of the resulting distribution can then be used to represent model uncertainty.
We note that these metrics have also been used by Wang et al. (2019), but with the purpose of minimizing the effect of low-quality outputs on NMT training with back translations.
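The sketch below shows how the mean and variance of sentence-level scores under MC dropout could be obtained from a generic PyTorch model. Keeping dropout modules in training mode at inference time is standard PyTorch practice; the scoring function, the metric names in the comments, and all variable names are illustrative assumptions and do not reproduce the released fairseq implementation.

```python
import torch

def enable_mc_dropout(model: torch.nn.Module) -> None:
    """Put the model in eval mode but keep dropout layers stochastic."""
    model.eval()
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()

@torch.no_grad()
def mc_dropout_statistics(model, score_fn, src, hyp, n_passes: int = 30):
    """Mean and variance of sentence-level scores over stochastic passes.

    score_fn(model, src, hyp) is assumed to return the (length-normalized)
    log-probability of hypothesis `hyp` given source `src` under the
    current (dropout-perturbed) parameters.
    """
    enable_mc_dropout(model)
    scores = torch.tensor([score_fn(model, src, hyp) for _ in range(n_passes)])
    # The expectation corresponds to a D-TP-style indicator and the
    # variance to a D-Var-style indicator (see Table 2).
    return scores.mean().item(), scores.var().item()
```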
3.3 Attention
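As discussed in §2 and §5, the attention-based indicators (AW:Ent-Avg, AW:Ent-Min, and AW: best head/layer) build on the entropy of the attention weight distributions over source tokens. The sketch below illustrates one way such indicators could be computed from a Transformer's cross-attention weights; the exact aggregation over heads and layers used in the paper is only summarized in §5, so the choices here are illustrative assumptions.

```python
import numpy as np

def attention_entropy_indicators(attn: np.ndarray) -> dict:
    """attn: cross-attention weights of shape (L, H, T, S), where each
    attn[l, h, t] is a distribution over the S source positions."""
    eps = 1e-12
    # Entropy of each target position's attention distribution: (L, H, T).
    ent = -(attn * np.log(attn + eps)).sum(axis=-1)
    # Average over target positions -> one value per layer/head: (L, H).
    ent_per_head = ent.mean(axis=-1)
    return {
        "AW:Ent-Avg": float(ent_per_head.mean()),  # average over all layers/heads
        "AW:Ent-Min": float(ent_per_head.min()),   # sharpest layer/head
        "per_head": ent_per_head,                  # used for best head/layer selection (§5)
    }
```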
4 Multilingual Dataset for QE
The quality of NMT translations is strongly affected by the amount of training data. To study our unsupervised QE indicators under different conditions, we collected data for 6 language pairs covering high-, medium-, and low-resource settings. To add diversity, we varied the translation direction into and out of English where the availability of expert annotators for non-English target languages permitted. Our dataset is thus composed of the high-resource English–German (En-De) and English–Chinese (En-Zh) pairs; the medium-resource Romanian–English (Ro-En) and Estonian–English (Et-En) pairs; and the low-resource Sinhala–English (Si-En) and Nepali–English (Ne-En) pairs. The dataset contains sentences extracted from Wikipedia along with their MT outputs manually annotated for quality.
Document and Sentence Sampling
We follow the sampling process outlined in FLORES (Guzmán et al., 2019). First, we sampled documents from Wikipedia for English, Estonian, Romanian, Sinhala, and Nepali. Second, we selected the top 100 documents containing the largest number of sentences that (i) are in the intended source language according to a language-id classifier and (ii) are between 50 and 150 characters long. In addition, we filtered out sentences that had been released as part of recent Wikipedia parallel corpora (Schwenk et al., 2019), ensuring that our dataset is not part of parallel data commonly used for NMT training. A sketch of this filter is shown below.
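A minimal sketch of the sentence-level filter just described; the language-id classifier is left abstract (`detect_language` is a hypothetical stand-in, as the specific classifier is not named here), and `parallel_sentences` is assumed to hold the released Wikipedia parallel data.

```python
def keep_sentence(sentence: str, source_lang: str,
                  parallel_sentences: set, detect_language) -> bool:
    """Filter applied when selecting candidate sentences from Wikipedia documents."""
    return (
        detect_language(sentence) == source_lang    # (i) intended source language
        and 50 <= len(sentence) <= 150              # (ii) 50-150 characters long
        and sentence not in parallel_sentences      # not in released parallel corpora
    )
```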
For every source language, we randomly selected 10K sentences from the sampled documents and then translated them into English using the MT models described below. For the translations into German and Chinese, we selected 20K sentences from the top 100 English Wikipedia documents. To ensure sufficient representation of high- and low-quality translations for the high-resource language pairs, we selected the sentences with minimal lexical overlap with respect to the NMT training data.
NMT systems
For the medium- and high-resource language pairs, we trained MT models based on the standard Transformer architecture (Vaswani et al., 2017), following the implementation details described in Ott et al. (2018b). We used publicly available MT datasets such as Paracrawl (Esplà et al., 2019) and Europarl (Koehn, 2005). The Si-En and Ne-En systems were trained with the Transformer-Big architecture defined in Vaswani et al. (2017). For these low-resource language pairs, the models were trained following the FLORES semi-supervised setting (Guzmán et al., 2019), which involves two iterations of back-translation using the source and the target monolingual data. Table 1 specifies the amount of data used for training.
Table 1: NMT training corpus size and statistics of the DA annotations per language pair: distribution of DA scores (average, 25th percentile, median, 75th percentile) and the average difference between the scores assigned by different annotators to each segment (avg, std).

| Resources | Pair | Size | Score avg | Score p25 | Score median | Score p75 | Diff avg | Diff std |
|---|---|---|---|---|---|---|---|---|
| High | En-De | 23.7M | 84.8 | 80.7 | 88.7 | 92.7 | 13.7 | 8.2 |
| High | En-Zh | 22.6M | 67.0 | 58.7 | 70.7 | 79.0 | 12.1 | 6.4 |
| Mid | Ro-En | 3.9M | 68.8 | 50.1 | 76.0 | 92.3 | 10.7 | 6.7 |
| Mid | Et-En | 880K | 64.4 | 40.5 | 72.0 | 89.3 | 13.8 | 9.4 |
| Low | Si-En | 647K | 51.4 | 26.0 | 51.3 | 77.7 | 13.4 | 8.7 |
| Low | Ne-En | 564K | 37.7 | 23.3 | 33.7 | 49.0 | 11.5 | 5.9 |
DA Judgments
We followed the FLORES setup (Guzmán et al., 2019), which follows a form of DA (Graham et al., 2013). The annotators are asked to rate each sentence on a 0–100 scale according to the perceived translation quality. Specifically, the 0–10 range represents an incorrect translation; 11–29, a translation with a few correct keywords, but whose overall meaning differs from the source; 30–50, a translation with major mistakes; 51–69, a translation that is understandable and conveys the overall meaning of the source but contains typos or grammatical errors; 70–90, a translation that closely preserves the semantics of the source sentence; and 91–100, a perfect translation.
Each segment was evaluated independently by three professional translators from a single language service provider. To improve annotation consistency, any evaluation in which the range of scores among the raters was above 30 points was rejected, and an additional rater was requested to replace the most diverging translation rating until convergence was achieved. To further increase the reliability of the test and development partitions of the dataset, we requested an additional set of three annotations from a different group of annotators (i.e., from another language service provider) following the same annotation protocol, thus resulting in a total of six annotations per segment.
Raw human scores were converted into z-scores, that is, standardized according to each individual annotator's overall mean and standard deviation. The scores collected for each segment were then averaged to obtain the final score. This setup accounts for the fact that annotators may genuinely disagree on some aspects of quality.
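A minimal sketch of the standardization described above, assuming the raw ratings are given as (segment, annotator, score) triples (the data layout and names are illustrative):

```python
from collections import defaultdict
from statistics import mean, pstdev

def average_z_scores(ratings):
    """ratings: iterable of (segment_id, annotator_id, raw_score) triples.
    Returns the mean z-score per segment."""
    ratings = list(ratings)
    by_annotator = defaultdict(list)
    for _, annotator, score in ratings:
        by_annotator[annotator].append(score)
    # Per-annotator mean and standard deviation (guard against zero std).
    stats = {a: (mean(s), pstdev(s) or 1.0) for a, s in by_annotator.items()}

    by_segment = defaultdict(list)
    for segment, annotator, score in ratings:
        mu, sigma = stats[annotator]
        by_segment[segment].append((score - mu) / sigma)
    return {segment: mean(zs) for segment, zs in by_segment.items()}
```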
Table 1 summarizes the statistics of the human annotations. Besides the NMT training corpus size and the distribution of the DA scores for each language pair, we report the mean and standard deviation of the average differences between the scores assigned by different annotators to each segment, as an indicator of annotation consistency. First, we observe that, as expected, the amount of training data per language pair correlates with the average quality of the NMT system. Second, we note that the distribution of human scores changes substantially across language pairs. In particular, we see very little variability in quality for En-De, which makes QE for this language pair especially challenging (see §5). Finally, as shown in the right-most columns, annotation consistency is similar across language pairs and comparable to existing work that follows the DA methodology for data collection. For example, Graham et al. (2013) report an average difference of 25 across annotators' scores.
Data Splits
To enable comparison between supervised and unsupervised approaches to QE, we split the data into a training partition of 7K segments, a development set of 1K segments, and two test sets of 1K segments each. One of these test sets is used for the experiments in this paper; the other is kept blind for future work.
Additional Data
To support our discussion in §6 of how NMT training affects the correlation between predictive probabilities and perceived translation quality, we trained several alternative NMT system variants and, for each variant, translated and annotated 400 original Estonian sentences from our test set.
The data, the NMT models, and the DA judgments are available at https://github.com/facebookresearch/mlqe.
5 Experiments and Results
Below we analyze how our unsupervised QE indicators correlate with human judgments.
5.1 Settings
Benchmark Supervised QE Systems
We compare the performance of the proposed unsupervised QE indicators against the best performing supervised approaches with available open-source implementations, namely, the Predictor-Estimator (PredEst) architecture (Kim et al., 2017b) provided by the OpenKiwi toolkit (Kepler et al., 2019b), and an improved version of the BiRNN model provided by the deepQuest toolkit (Ive et al., 2018), which we refer to as BERT-BiRNN (Blain et al., 2020).
PredEst.
We trained PredEst models (see §2) using the same parameters as in the default configurations provided by Kepler et al. (2019b). Predictor models were trained for 6 epochs on the same training and development data as the NMT systems, while the Estimator models were trained for 10 epochs on the training and development sets of our dataset (see §4). Unlike Kepler et al. (2019b), the Estimator was not trained using multitask learning, as our dataset currently does not contain any word-level annotations. We use the model corresponding to the best epoch as identified by the reference metric on the development set: perplexity for the Predictor and Pearson correlation for the Estimator.
BERT-BiRNN.
This model, similarly to recent SOTA QE systems (Kepler et al., 2019a), uses a large-scale pre-trained BERT model to obtain token-level representations, which are then fed into two independent bidirectional RNNs that encode the source sentence and its translation separately. Each of the two resulting sentence representations is computed as an attention-weighted sum of the corresponding word vectors, and the two representations are then concatenated. The final sentence-level representation is fed to a sigmoid layer to produce the sentence-level quality estimates. During training, BERT was fine-tuned by unfreezing the weights of the last four layers along with the embedding layer. We used early stopping based on Pearson correlation on the development set, with a patience of 5.
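A minimal sketch of the sentence-level estimator just described, taking pre-computed BERT token representations as input; the hidden sizes, the use of a GRU as the bidirectional RNN, and all module names are illustrative assumptions rather than the released configuration.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Bidirectional RNN over BERT token vectors followed by attention pooling."""
    def __init__(self, bert_dim: int = 768, hidden: int = 128):
        super().__init__()
        self.rnn = nn.GRU(bert_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)

    def forward(self, token_vecs):                       # (B, T, bert_dim)
        states, _ = self.rnn(token_vecs)                 # (B, T, 2*hidden)
        weights = torch.softmax(self.attn(states).squeeze(-1), dim=-1)  # (B, T)
        return (weights.unsqueeze(-1) * states).sum(dim=1)              # (B, 2*hidden)

class BertBiRNNEstimator(nn.Module):
    """Independent encoders for source and MT output, concatenation,
    and a sigmoid output layer producing sentence-level quality estimates."""
    def __init__(self, bert_dim: int = 768, hidden: int = 128):
        super().__init__()
        self.src_enc = SentenceEncoder(bert_dim, hidden)
        self.mt_enc = SentenceEncoder(bert_dim, hidden)
        self.out = nn.Linear(4 * hidden, 1)

    def forward(self, src_bert, mt_bert):
        joint = torch.cat([self.src_enc(src_bert), self.mt_enc(mt_bert)], dim=-1)
        return torch.sigmoid(self.out(joint)).squeeze(-1)
```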
Unsupervised QE
For the dropout-based indicators (see §3.2), we use a dropout rate of 0.3, the same as for training the NMT models (see §4). We perform N = 30 inference passes to obtain the posterior probability distribution. N was chosen following the experiments in related work (Dong et al., 2018; Wang et al., 2019); however, we note that increasing N beyond 10 results in only very small improvements on the development set. The implementation of stochastic decoding with MC dropout is available as part of the fairseq toolkit (Ott et al., 2019) at https://github.com/pytorch/fairseq.
5.2 Correlation with Human Judgments
Table 2 shows Pearson correlation with DA for our unsupervised QE indicators and for the supervised QE systems. Unsupervised QE indicators are grouped as follows: Group I corresponds to the measurements obtained with standard decoding (§3.1); Group II contains indicators computed using MC dropout (§3.2); and Group III contains the results for attention-based indicators (§3.3). Group IV corresponds to the supervised QE models presented in §5.1. We use the Hotelling-Williams test (Williams, 1959) to compute the significance of the difference between dependent correlations, with p-value < 0.05. For each language pair, results that are not significantly outperformed by any method are marked in bold; results that are not significantly outperformed by any other method from the same group are underlined.
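For reference, the Williams test compares two correlations with the human scores that are dependent because they share those scores. A minimal sketch of the test statistic in the form commonly used for comparing MT evaluation metrics is shown below; it is not the exact implementation used for the paper's significance tests.

```python
import numpy as np
from scipy.stats import pearsonr, t as t_dist

def williams_test(metric1, metric2, human):
    """One-sided test: is corr(metric1, human) significantly larger
    than corr(metric2, human), given that both correlations share
    the human judgments?"""
    n = len(human)
    r1, _ = pearsonr(metric1, human)     # candidate metric vs. human scores
    r2, _ = pearsonr(metric2, human)     # competing metric vs. human scores
    r12, _ = pearsonr(metric1, metric2)  # correlation between the two metrics
    k = 1 - r1**2 - r2**2 - r12**2 + 2 * r1 * r2 * r12
    num = (r1 - r2) * np.sqrt((n - 1) * (1 + r12))
    den = np.sqrt(2 * k * (n - 1) / (n - 3) + ((r1 + r2) ** 2 / 4) * (1 - r12) ** 3)
    t_stat = num / den
    return t_stat, 1 - t_dist.cdf(t_stat, df=n - 3)
```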
Table 2: Pearson correlation with DA judgments for the unsupervised QE indicators (Groups I–III) and the supervised QE systems (Group IV). Language pairs are ordered from low- to high-resource.

| Group | Method | Si-En | Ne-En | Et-En | Ro-En | En-De | En-Zh |
|---|---|---|---|---|---|---|---|
| I | TP | 0.399 | 0.482 | 0.486 | 0.647 | 0.208 | 0.257 |
| I | Softmax-Ent (-) | 0.457 | 0.528 | 0.421 | 0.613 | 0.147 | 0.251 |
| I | Sent-Std (-) | 0.418 | 0.472 | 0.471 | 0.595 | 0.264 | 0.301 |
| II | D-TP | 0.460 | 0.558 | 0.642 | 0.693 | 0.259 | 0.321 |
| II | D-Var (-) | 0.307 | 0.299 | 0.356 | 0.332 | 0.164 | 0.232 |
| II | D-Combo (-) | 0.286 | 0.418 | 0.475 | 0.383 | 0.189 | 0.225 |
| II | D-Lex-Sim | 0.513 | 0.600 | 0.612 | 0.669 | 0.172 | 0.313 |
| III | AW:Ent-Min (-) | 0.097 | 0.265 | 0.329 | 0.524 | 0.000 | 0.067 |
| III | AW:Ent-Avg (-) | 0.10 | 0.205 | 0.377 | 0.382 | 0.090 | 0.112 |
| III | AW:best head/layer (-) | 0.255 | 0.381 | 0.416 | 0.636 | 0.241 | 0.168 |
| IV | PredEst | 0.374 | 0.386 | 0.477 | 0.685 | 0.145 | 0.190 |
| IV | BERT-BiRNN | 0.473 | 0.546 | 0.635 | 0.763 | 0.273 | 0.371 |
We observe that the simplest measure that can be extracted from NMT, sequence-level probability (TP), already performs competitively, in particular for the medium-resource language pairs. TP is consistently outperformed by D-TP, indicating that NMT output probabilities are not well calibrated. This confirms our hypothesis that estimating model uncertainty improves correlation with perceived translation quality. Furthermore, our approach performs competitively with strong supervised QE models: dropout-based indicators significantly outperform PredEst and rival BERT-BiRNN for four language pairs. These results position the proposed unsupervised QE methods as an attractive alternative to the supervised approach in the scenario where the NMT model used to generate the translations can be accessed.
For both unsupervised and supervised methods, performance varies considerably across language pairs. The highest correlation is achieved for the medium-resource languages, whereas for the high-resource language pairs it is drastically lower. The main reason for this difference is the lower variability in translation quality for high-resource language pairs. Figure 2 shows scatter plots for Ro-En, which has the best correlation results, and En-De, which has the lowest correlation for all quality indicators. Ro-En has a substantial number of high-quality sentences, but the remaining translations are uniformly distributed across the quality range. The distribution for En-De is highly skewed, as the vast majority of the translations are of high quality. In this case capturing meaningful variation appears to be more challenging, as the differences reflected by the DA may be more subtle than any of the QE methods is able to reveal.
The reason for the lower correlation for Sinhala and Nepali is different. For the unsupervised indicators it can be due to differences in model capacity and in the amount of training data. On the one hand, increasing the depth and width of the model may negatively affect calibration (Guo et al., 2017). On the other hand, with a small amount of training data the model can overfit, resulting in inferior results both in terms of translation quality and correlation. It is noteworthy, however, that the supervised QE systems suffer a larger drop in performance than the unsupervised indicators, as their predictor component requires large amounts of parallel data for training. We suggest, therefore, that unsupervised QE is more stable in low-resource scenarios than supervised approaches.
We now look in more detail at the three groups of unsupervised measurements in Table 2.
Group I
The average entropy of the softmax output (Softmax-Ent) and the dispersion of token-level probabilities (Sent-Std) achieve a significantly higher correlation than the TP metric for four language pairs. Softmax-Ent captures the uncertainty of the output probability distribution, which appears to be a more accurate reflection of overall translation quality. Sent-Std captures a pattern in the sequence of token-level probabilities that helps detect the kind of low-quality translation illustrated in Figure 1. Figure 1 shows two Et-En translations with drastically different absolute DA scores of 62 and 1, yet the difference in their sentence-level log-probability is negligible: −0.50 and −0.48 for the first and second translations, respectively. By contrast, the sequences of token-level probabilities are very different, as the second sentence has a larger variation in the log-probabilities of adjacent words, with very high probabilities for high-frequency function words and low probabilities for content words.
Group II
The best results are achieved by the D-Lex-Sim and D-TP metrics. Interestingly, D-Var has a much lower correlation, because by capturing only variance it ignores the actual probability estimate assigned by the model to the given output.
Table 3 provides an illustration of how the model uncertainty captured by MC dropout reflects the quality of MT output. The first example contains a low-quality translation, with high variability among the MT hypotheses obtained with MC dropout. By contrast, the MC dropout hypotheses for the second, high-quality example are very similar and, in fact, constitute valid linguistic paraphrases of each other. This fact is directly exploited by the D-Lex-Sim metric, which measures the variability between MT hypotheses generated with perturbed model parameters and performs on par with D-TP (see the sketch after Table 3). Besides capturing model uncertainty, D-Lex-Sim reflects the potential complexity of the source segments, as the number of different possible translations of a sentence is an indicator of its inherent ambiguity.
Table 3: Examples of Et-En MT outputs and the hypotheses generated with MC dropout for a low-quality and a high-quality translation.

| Example | Field | Sentence |
|---|---|---|
| Low Quality | Original | Tanganjikast püütakse niiluse ahvenat ja kapentat. |
| | Reference | Nile perch and kapenta are fished from Lake Tanganyika. |
| | MT Output | There is a silver thread and candle from Tanzeri. |
| | Dropout | There will be a silver thread and a penny from Tanzer. |
| | | There is an attempt at a silver greed and a carpenter from Tanzeri. |
| | | There will be a silver bullet and a candle from Tanzer. |
| | | The puzzle is being caught in the chicken’s gavel and the coffin. |
| High Quality | Original | Siis aga võib tekkida seesmise ja välise vaate vahele lõhe. |
| | Reference | This could however lead to a split between the inner and outer view. |
| | MT Output | Then there may be a split between internal and external viewpoints. |
| | Dropout | Then, however, there may be a split between internal and external viewpoints. |
| | | Then, however, there may be a gap between internal and external viewpoints. |
| | | Then there may be a split between internal and external viewpoints. |
| | | Then there may be a split between internal and external viewpoints. |
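D-Lex-Sim is described above as measuring the lexical variability between MT hypotheses generated under perturbed (dropout) parameters, such as those in Table 3. A minimal sketch based on average pairwise similarity is given below; the unigram-overlap F1 used here is a stand-in assumption for whatever lexical similarity metric is used in the paper.

```python
from itertools import combinations

def token_f1(hyp_a: str, hyp_b: str) -> float:
    """Unigram overlap F1 between two whitespace-tokenized hypotheses."""
    a, b = set(hyp_a.split()), set(hyp_b.split())
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(a), overlap / len(b)
    return 2 * precision * recall / (precision + recall)

def d_lex_sim(dropout_hypotheses):
    """Average pairwise lexical similarity across MC-dropout hypotheses.
    Higher values indicate more stable hypotheses and, per §5, higher quality."""
    pairs = list(combinations(dropout_hypotheses, 2))
    return sum(token_f1(a, b) for a, b in pairs) / len(pairs)
```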
Group III
While our attention-based metrics also achieve a reasonable correlation with human judgments, it is considerably lower than that of the other unsupervised indicators. Attention may not provide enough information to be used as a quality indicator on its own: since there is no direct mapping between words in different languages, high entropy in attention weights does not necessarily indicate low translation quality. We leave experiments with combined attention- and probability-based measures to future work.
The use of multihead attention with multiple layers in the Transformer may also negatively affect the results. As shown by Voita et al. (2019), different attention heads are responsible for different functions. Therefore, combining the information coming from different heads and layers in a simple way may not be an optimal solution. To test whether this is the case, we computed attention entropy and its correlation with DA for all possible combinations of heads and layers. As shown in Table 2, the best head/layer combination (AW: best head/layer) indeed significantly outperforms the other attention-based measurements for all language pairs, suggesting that this method should be preferred over simple averaging. Using the best head/layer combination for QE is limited by the fact that it requires validation on a dataset annotated with DA and is thus not fully unsupervised. This outcome opens an interesting direction for further experiments to automatically discover the best possible head/layer combination.
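A minimal sketch of the head/layer selection just described, assuming per-head attention entropies (e.g., as in the sketch in §3.3) have been computed for a development set annotated with DA; the selection is done on a dev set precisely because, as noted above, this variant is not fully unsupervised.

```python
import numpy as np
from scipy.stats import pearsonr

def best_head_layer(entropies: np.ndarray, da_scores: np.ndarray):
    """entropies: shape (n_sentences, L, H), per-head attention entropy per
    dev sentence; da_scores: shape (n_sentences,). Returns the (layer, head)
    pair whose negated entropy correlates best with DA, plus its Pearson r."""
    _, num_layers, num_heads = entropies.shape
    return max(
        ((l, h, pearsonr(-entropies[:, l, h], da_scores)[0])
         for l in range(num_layers) for h in range(num_heads)),
        key=lambda item: item[2],
    )
```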
6 Discussion
In the previous section we studied the performance of our unsupervised quality indicators for different language pairs. In this section we validate our results by looking at two additional factors: domain shift and underlying NMT system.
6.1 Domain Shift
One way to evaluate how well a model represents uncertainty is to measure the difference in model confidence under domain shift (Hendrycks and Gimpel, 2016; Lakshminarayanan et al., 2017; Snoek et al., 2019). A well-calibrated model should produce low confidence estimates when tested on data points that are far away from the training data.
Overconfident predictions on out-of-domain sentences would undermine the benefits of unsupervised QE for NMT. This is particularly relevant given the current wide use of NMT for translating mixed-domain data online. Therefore, we conduct a small experiment to compare model confidence on in-domain and out-of-domain data. We focus on the Et-En language pair. We use the test partition of the MT training dataset as our in-domain sample. To generate the out-of-domain sample, we sort our Wikipedia data (prior to the sentence sampling stage in §4) by distance to the training data and select the top 500 segments with the largest distance score. To compute distance scores we follow the strategy of Niehues and Pham (2019), which measures the test/training data distance based on the hidden states of the NMT encoder.
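The exact formulation of Niehues and Pham (2019) is not reproduced here; the sketch below is a simplified stand-in that scores each candidate sentence by its nearest-neighbour distance to the training data in the space of mean-pooled NMT encoder states.

```python
import numpy as np

def distance_to_training(test_states: np.ndarray, train_states: np.ndarray) -> np.ndarray:
    """test_states: (n_test, d) mean-pooled encoder states of candidate sentences;
    train_states: (n_train, d) mean-pooled encoder states of training sentences.
    Returns, for each candidate, the Euclidean distance to its nearest training
    sentence; larger values indicate sentences further from the training domain."""
    # Pairwise distances (n_test, n_train); chunk this for large corpora.
    dists = np.linalg.norm(test_states[:, None, :] - train_states[None, :, :], axis=-1)
    return dists.min(axis=1)
```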
We compute model posterior probabilities for the translations of the in-domain and out-of-domain samples, obtained either through standard decoding or using MC dropout. TP obtains average values of −0.440 and −0.445 for in-domain and out-of-domain data, respectively, whereas for D-TP these values are −0.592 and −0.685. The difference between in-domain and out-of-domain confidence estimates obtained by standard decoding is negligible, whereas the difference between the MC-dropout average probabilities for in-domain vs. out-of-domain samples was found to be statistically significant under Student's t-test, with p-value < 0.01. Thus, the expectation over predictive probabilities with MC dropout indeed provides a better estimation of model uncertainty for NMT, and can therefore improve the robustness of unsupervised QE on out-of-domain data.
6.2 NMT Calibration across NMT Systems
Findings in the previous section suggest that using model probabilities results in fairly high correlation with human judgments for various language pairs. In this section we study how well these findings generalize to different NMT systems. The list of model variants that we explore is by no means exhaustive and was motivated by common practices in MT and by factors that can negatively affect model calibration (number of training epochs) or help represent uncertainty (model ensembling). For this small-scale experiment we focus on Et-En. For each system variant we translated 400 sentences from the test partition of our dataset and collected DA judgments for them. As a baseline, we use a standard Transformer model with beam search decoding. All system variants are trained using the fairseq implementation (Ott et al., 2019) for 30 epochs, with the best checkpoint chosen according to the validation loss.
First, we consider three system variants with differences in architecture or training: RNN-based NMT (Bahdanau et al., 2015; Luong et al., 2015), Mixture of Experts (MoE; He et al., 2018; Shen et al., 2019; Cho et al., 2019), and a model ensemble (Garmash and Monz, 2016).
Shen et al. (2019) use the MoE framework to capture the inherent uncertainty of the MT task, where the same input sentence can have multiple correct translations. A mixture model introduces a multinomial latent variable to control generation and produce a diverse set of MT hypotheses. In our experiment we use a hard mixture model with a uniform prior and 5 mixture components. To produce the translations we generate from a randomly chosen component with standard beam search. To obtain the probability estimates we average the probabilities from all mixture components.
Previous work has used model ensembling as a strategy for representing model uncertainty (Lakshminarayanan et al., 2017; Pearce et al., 2018). In NMT, ensembling has been used to improve translation quality. We train four Transformer models initialized with different random seeds. At decoding time, the predictive distributions from the different models are combined by averaging.
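A minimal sketch of how the per-step predictive distributions of the independently seeded models could be averaged at decoding time (each model is assumed to return next-token logits of shape (batch, vocab) when called on the current decoder inputs; names are illustrative).

```python
import torch

@torch.no_grad()
def ensemble_step_distribution(models, decoder_inputs):
    """Average the next-token distributions of an ensemble in probability space."""
    probs = [torch.softmax(model(decoder_inputs), dim=-1) for model in models]
    return torch.stack(probs, dim=0).mean(dim=0)   # (batch, vocab)
```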
Second, we consider two alternatives to beam search: diverse beam search (Vijayakumar et al., 2016) and sampling. For sampling, we generate translations one token at a time by sampling from the model's conditional distribution p(y_t | y_<t, x) until the end-of-sequence symbol is generated. For comparison, we also compute the D-TP metric for the standard Transformer model on the subset of 400 segments considered for this experiment.
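A minimal sketch of the sampling decoder, assuming a step function that returns the distribution p(y_t | y_<t, x) over the vocabulary for the next token (all names are illustrative).

```python
import torch

@torch.no_grad()
def sample_translation(step_fn, src, bos_id: int, eos_id: int, max_len: int = 200):
    """Generate one token at a time by sampling from p(y_t | y_<t, x)
    until the end-of-sequence symbol is produced."""
    tokens = [bos_id]
    for _ in range(max_len):
        probs = step_fn(src, tokens)                       # (vocab,) distribution
        next_token = torch.multinomial(probs, num_samples=1).item()
        tokens.append(next_token)
        if next_token == eos_id:
            break
    return tokens
```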
Table 4 shows the results. Interestingly, the correlation between output probabilities and DA is not necessarily related to the quality of the MT outputs. For example, sampling produces a much higher correlation even though its translation quality is much lower. This is in line with previous work indicating that sampling results in a better calibrated probability distribution than beam search (Ott et al., 2018a). System variants that promote diversity in NMT outputs (diverse beam search and MoE) do not achieve any improvement in correlation over the standard Transformer model.
Table 4: Pearson correlation (r) between sentence-level probability estimates and DA, and average DA scores, for Et-En NMT system variants.

| Method | r | DA |
|---|---|---|
| TP-Beam | 0.482 | 58.88 |
| TP-Sampling | 0.533 | 42.02 |
| TP-Diverse beam | 0.424 | 55.12 |
| TP-RNN | 0.502 | 43.63 |
| TP-Ensemble | 0.538 | 61.19 |
| TP-MoE | 0.449 | 51.20 |
| D-TP | 0.526 | 58.88 |
The best results, both in translation quality and in QE, are achieved by ensembling, which provides additional evidence that better uncertainty quantification in NMT improves correlation with human judgments. MC dropout achieves very similar results. We recommend using either of these two methods for unsupervised QE with NMT systems.
6.3 NMT Calibration across Training Epochs
The final question we address is how the correlation between translation probabilities and translation quality is affected by the amount of training. We train our base Et-En Transformer system for 60 epochs, generating and evaluating translations after each epoch. We use the test partition of the MT training set and assess translation quality with the Meteor evaluation metric. Figure 3 shows the average Meteor scores (blue) and the Pearson correlation (orange) between segment-level Meteor scores and translation probabilities from the MT system for each epoch.
Interestingly, as training continues test quality stabilizes whereas the relation between model probabilities and translation quality deteriorates. During training, once the model is able to correctly classify most of the training examples, the loss can be further minimized by increasing the confidence of predictions (Guo et al., 2017). Thus longer training does not affect output quality but damages calibration.
7 Conclusions
We have devised an unsupervised approach to QE where no training or access to any additional resources besides the MT system is required. Besides exploiting softmax output probability distribution and the entropy of attention weights from the NMT model, we leverage uncertainty quantification for unsupervised QE. We show that, if carefully designed, the indicators extracted from the NMT system constitute a rich source of information, competitive with supervised QE methods.
We analyzed how different MT architectures and training settings affect the relation between predictive probabilities and translation quality. We showed that improved translation quality does not necessarily imply a stronger correlation between translation quality and predictive probabilities. Model ensembling was shown to achieve the best results both in terms of translation quality and when using output probabilities as an unsupervised quality indicator.
Finally, we created a new multilingual dataset for QE covering various scenarios for MT development including low- and high-resource language pairs. Both the dataset and the MT models needed to reproduce the results of our experiments are available at https://github.com/facebookresearch/mlqe.
This work can be extended in many directions. First, our sentence-level unsupervised metrics could be adapted for QE at other levels (word, phrase, and document). Second, the proposed metrics can be combined as features in supervised QE approaches. Finally, other methods for uncertainty quantification, as well as other types of uncertainty, can be explored.
Acknowledgments
Marina Fomicheva, Lisa Yankovskaya, Frédéric Blain, Mark Fishel, Nikolaos Aletras, and Lucia Specia were supported by funding from the Bergamot project (EU H2020 grant no. 825303).
Notes
While the paper covers QE at sentence level, the extension of our unsupervised metrics to word-level QE would be straightforward and we leave it for future work.
A distinction is typically made between epistemic and aleatoric uncertainty, where the latter captures the noise inherent to the observations (Kendall and Gal, 2017). We leave modeling this distinction in NMT for future work.
We note that PredEst models are systematically and significantly outperformed by BERT-BiRNN. This is not surprising, as large-scale pretrained representations have been shown to boost model performance for QE (Kepler et al., 2019a) and other natural language processing tasks (Devlin et al., 2019).
Models for these languages were trained using Transformer-Big architecture from Vaswani et al. (2017).
This is in contrast with the work by Wang et al. (2019), where D-Var appears to be one of the best performing metrics for NMT training with back-translation, demonstrating an essential difference between that task and QE.
Note that D-Lex-Sim involves generating N additional translation hypotheses, whereas the D-TP only requires re-scoring an existing translation output and is thus less expensive in terms of time.