Abstract
Neural sequence generation models are known to “hallucinate”, by producing outputs that are unrelated to the source text. These hallucinations are potentially harmful, yet it remains unclear in what conditions they arise and how to mitigate their impact. In this work, we first identify internal model symptoms of hallucinations by analyzing the relative token contributions to the generation in contrastive hallucinated vs. non-hallucinated outputs generated via source perturbations. We then show that these symptoms are reliable indicators of natural hallucinations, by using them to design a lightweight hallucination detector which outperforms both model-free baselines and strong classifiers based on quality estimation or large pre-trained models on manually annotated English-Chinese and German-English translation test beds.
1 Introduction
While neural language generation models can generate high quality text in many settings, they also fail in counter-intuitive ways, for instance by “hallucinating” (Wiseman et al., 2017; Lee et al., 2018; Falke et al., 2019). In the most severe case, known as “detached hallucinations” (Raunak et al., 2021), the output is completely detached from the source, which not only reveals fundamental limitations of current models, but also risks misleading users and undermining trust (Bender et al., 2021; Martindale and Carpuat, 2018). Yet, we lack a systematic understanding of the conditions where hallucinations arise, as hallucinations occur infrequently among translations of naturally occurring text. As a workaround, prior work has largely focused on black-box detection methods which train neural classifiers on synthetic data constructed by heuristics (Falke et al., 2019; Zhou et al., 2021), and on studying hallucinations given artificially perturbed inputs (Lee et al., 2018; Shi et al., 2022).
In this paper, we address the problem by first identifying the internal model symptoms that characterize hallucinations given artificial inputs and then testing the discovered symptoms on translations of natural texts. Specifically, we study hallucinations in Neural Machine Translation (nmt) using two types of interpretability techniques: saliency analysis and perturbations. We use saliency analysis (Bach et al., 2015; Voita et al., 2021) to compare the relative contributions of various tokens to the hallucinated vs. non-hallucinated outputs generated by diverse adversarial perturbations in the inputs (Table 1) inspired by Lee et al. (2018) and Raunak et al. (2021). Results surprisingly show that source contribution patterns are stronger indicators of hallucinations than the relative contributions of the source and target, as had been previously hypothesized (Voita et al., 2021). We discover two distinctive source contribution patterns, including 1) concentrated contribution from a small subset of source tokens, and 2) the staticity of the source contribution distribution along the generation steps (§ 3).
Table 1: A counterfactual hallucination from source perturbation (top) and a natural hallucination (bottom).

Counterfactual hallucination from perturbation

| | |
|---|---|
| Source | Republicans Abroad are not running a similar election, nor will they have delegates at the convention. Recent elections have emphasized the value of each vote. |
| Good nmt | |
| Perturbed Source | Repulicans Abroad ar not runing a simila election, nor will they have delegates at the convention. Recent elections have emphasized the value o each vote. |
| Hallucination | |
| Gloss | The big ear comments that administrators have the right to retain or delete any content in the comments under their jurisdiction. |

Natural hallucination

| | |
|---|---|
| Source | DAS GRUNDRECHT JEDES EINZELNEN AUF FREIE WAHL DES BERUFS, DER AUSBILDUNGSSTÄTTE SOWIE DES AUSBILDUNGS - UND BESCHÄFTIGUNGSORTS MUSS GEWAHRT BLEIBEN. |
| Gloss | The fundamental right of every individual to freely choose their profession, their training institution and their employment place must remain guaranteed. |
| Hallucination | THE PRIVACY OF ANY OTHER CLAIM, EXTRAINING STANDARDS, EXTRAINING OR EMPLOYMENT OR EMPLOYMENT WILL BE LIABLE. |
We further show that the symptoms identified generalize to hallucinations on natural inputs by using them to design a lightweight hallucination classifier (§ 4) that we evaluate on manually annotated hallucinations from English-Chinese and German-English nmt (Table 1). Our study shows that our introspection-based detection model largely outperforms model-free baselines and the classifier based on quality estimation scores. Furthermore, it is more accurate and robust to domain shift than black-box detectors based on large pre-trained models (§ 5).
Before presenting these two studies, we review current findings about the conditions in which hallucinations arise and formulate three hypotheses capturing potential hallucination symptoms.
2 Hallucinations: Definition and Hypotheses
The term “hallucinations” has varying definitions in mt and natural language generation. We adopt the most widely used one, which refers to output text that is unfaithful to the input (Maynez et al., 2020; Zhou et al., 2021; Xiao and Wang, 2021; Ji et al., 2022), while others include fluency criteria as part of the definition (Wang and Sennrich, 2020; Martindale et al., 2019). Different from previous work that aims to detect partial hallucinations at the token level (Zhou et al., 2021), we focus on detached hallucinations where a major part of the output is unfaithful to the input, as these represent severe errors, as illustrated in Table 1.
Prior work on understanding the conditions that lead to hallucinations has focused on training conditions and data noise (Ji et al., 2022). For mt, Raunak et al. (2021) show that hallucinations under perturbed inputs are caused by training samples in the long tail that tend to be memorized by Transformer models, while natural hallucinations given unperturbed inputs can be linked to corpus-level noise. Briakou and Carpuat (2021) show that models trained on samples where the source and target side diverge semantically output degenerated text more frequently. Wang and Sennrich (2020) establish a link between mt hallucinations under domain shift and exposure bias by showing that Minimum Risk Training, a training objective which addresses exposure bias, can reduce the frequency of hallucinations. However, these insights do not yet provide practical strategies for handling mt hallucinations.
A complementary approach to diagnosing hallucinations is to identify their symptoms via model introspection at inference time. However, a systematic study of hallucinations from the model's internal perspective is still lacking. Previous work is either limited to an interpretation method that is tied to an outdated model architecture (Lee et al., 2018) or to pseudo-hallucinations (Voita et al., 2021). In this paper, we propose to shed light on the decoding behavior of hallucinations on both artificially perturbed and natural inputs through model introspection based on Layerwise Relevance Propagation (lrp) (Bach et al., 2015), which is applicable to a wide range of neural model architectures. We focus on mt tasks with the widely used Transformer model (Vaswani et al., 2017), and examine existing and new hypotheses for how hallucinations are produced. These hypotheses share the intuition that anomalous patterns of contributions from source tokens are indicative of hallucinations, but operationalize it differently.
The Low Source Contribution Hypothesis introduced by Voita et al. (2021) states that hallucinations occur when nmt overly relies on the target context over the source. They test the hypothesis by inspecting the relative source and target contributions to nmt predictions on Transformer models using lrp. However, their study is limited to pseudo-hallucinations produced by force decoding with random target prefixes. This work will test this hypothesis on actual hallucinations generated by nmt models.
The Local Source Contribution Hypothesis introduced by Lee et al. (2018) states that hallucinations occur when the nmt model overly relies on a small subset of source tokens across all generation steps. They test it by visualizing the dot-product attention in RNN models, but it is unclear whether these findings generalize to other model architectures. In addition, they only study hallucinations caused by random token insertion. This work will test this hypothesis on hallucinations under various types of source perturbations as well as on natural inputs, and will rely on lrp to quantify token contributions more precisely than with attention.
Inspired by the previous observation on attention matrices that an nmt model attends repeatedly to the same source tokens throughout inference when it hallucinates (Lee et al., 2018; Berard et al., 2019b) or generates a low-quality translation (Rikters and Fishel, 2017), we formalize this observation as the Static Source Contribution Hypothesis—the distribution of source contributions remains static along inference steps when an nmt model hallucinates. While prior work (Lee et al., 2018; Berard et al., 2019b; Rikters and Fishel, 2017) focuses on the static attention to the EOS or full-stop tokens, this hypothesis is agnostic about which source tokens contribute. Unlike the Low Source Contribution Hypothesis, this hypothesis exclusively relies on the source and does not make any assumption about relative source versus target contributions. Unlike the Local Source Contribution Hypothesis, this hypothesis is agnostic to the proportion of source tokens contributing to a translation.
In this work, we evaluate in a controlled fashion how well each hypothesis explains detached hallucinations, first on artificially perturbed samples that let us contrast hallucinated vs. non-hallucinated outputs in controlled settings (§ 3), and second on natural source inputs that let us test the generalizability of these hypotheses when they are used to automatically detect hallucinations in more realistic settings (§ 5).1
3 Study of Hallucinations under Perturbations via Model Introspection
Hallucinations are typically rare and difficult to identify in natural datasets. To test the aforementioned hypotheses at scale, we first exploit the fact that source perturbations exacerbate nmt hallucinations (Lee et al., 2018; Raunak et al., 2021). We construct a perturbation-based counterfactual hallucination dataset on English→Chinese by automatically identifying hallucinated nmt translations given perturbed source inputs and contrast them with the nmt translations of the original source (§ 3.1). This dataset lets us directly test the three hypotheses by computing the relative token contributions to the model’s predictions using lrp (§ 3.2), and conduct a controlled comparison of patterns on the original and hallucinated samples (§ 3.4).
3.1 Perturbation-based Hallucination Data
To construct the dataset, we randomly select 50k seed sentence pairs to perturb from the nmt training corpora, and then we apply the following perturbations on the source sentences:2
We randomly misspell words by deleting characters with a probability of 0.1, as Karpukhin et al. (2019) show that a few misspellings can lead to egregious errors in the output.
We randomly title-case words with a probability of 0.1, as Berard et al. (2019a) find that this often leads to severe output errors.
We insert a random token at the beginning of the source sentence, as Lee et al. (2018) and Raunak et al. (2021) find it a reliable trigger of hallucinations. The inserted token is drawn from the 100 most frequent tokens, the 100 least frequent tokens, 100 mid-frequency tokens (randomly sampled from the remaining vocabulary), or punctuation marks (a code sketch of all three perturbations follows the list).
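The sketch below illustrates one possible implementation of these three perturbations on a whitespace-tokenized source sentence; the function names, the default probabilities, and the candidate token pool are illustrative assumptions rather than the exact scripts used to build the dataset.

```python
import random

def misspell(tokens, p=0.1):
    """Randomly delete one character from a word with probability p (one possible reading of the misspelling perturbation)."""
    out = []
    for tok in tokens:
        if len(tok) > 1 and random.random() < p:
            i = random.randrange(len(tok))
            tok = tok[:i] + tok[i + 1:]
        out.append(tok)
    return out

def title_case(tokens, p=0.1):
    """Randomly title-case each word with probability p."""
    return [tok.title() if random.random() < p else tok for tok in tokens]

def insert_token(tokens, candidates):
    """Insert a token drawn from a candidate pool (e.g., most frequent, least
    frequent, mid-frequency tokens, or punctuation) at the start of the sentence."""
    return [random.choice(candidates)] + tokens

# Example usage on a whitespace-tokenized source sentence.
src = "Republicans Abroad are not running a similar election .".split()
print(" ".join(misspell(src)))
print(" ".join(title_case(src)))
print(" ".join(insert_token(src, candidates=[",", ".", "the"])))
```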
Inspired by Lee et al. (2018), we then identify hallucinations using heuristics that compare the translations from the original and perturbed sources. We select samples whose original nmt translations y′ are of reasonable quality compared to the reference y (i.e., bleu(y,y′) > 0.3). The translation of a perturbed source sentence is identified as a hallucination if it is very different from the translation of the original source (i.e., the bleu score between the two translations falls below a threshold) and is not a copy of the perturbed source (i.e., the bleu score between the translation and the perturbed source also falls below a threshold).3 This results in 623, 270, and 1307 contrastive pairs of the original (non-hallucinated) and hallucinated translations under misspelling, title-casing, and insertion perturbations, respectively.
We further divide the contrastive pairs into degenerated and non-degenerated hallucinations. Degenerated hallucinations are “bland, incoherent, or get stuck in repetitive loops” (Holtzman et al., 2020): hallucinated translations that contain 3 more repetitive n-grams than the source are placed in the degenerated group, while the non-degenerated group contains relatively fluent but hallucinated translations.
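A minimal sketch of these filtering heuristics, assuming sentence-level bleu scores rescaled to [0, 1]; the threshold value low_bleu, the n-gram order, and the repeated n-gram counting scheme are illustrative assumptions (the actual thresholds were chosen by manual inspection, see footnote 3).

```python
from collections import Counter
from sacrebleu.metrics import BLEU

bleu = BLEU(effective_order=True)

def sent_bleu(hyp, ref):
    # sacrebleu returns scores in [0, 100]; rescale to [0, 1] to match the text.
    return bleu.sentence_score(hyp, [ref]).score / 100.0

def repeated_ngrams(text, n=4):
    """Count extra occurrences of repeated n-grams (one way to measure repetition)."""
    toks = text.split()
    counts = Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return sum(c - 1 for c in counts.values() if c > 1)

def is_hallucination(orig_hyp, pert_hyp, pert_src, low_bleu=0.1):
    """Flag pert_hyp if it diverges from orig_hyp and is not a copy of the source.
    low_bleu is a hypothetical threshold; the paper tunes it by manual inspection."""
    return (sent_bleu(pert_hyp, orig_hyp) < low_bleu
            and sent_bleu(pert_hyp, pert_src) < low_bleu)

def is_degenerated(hyp, src, min_extra_ngrams=3):
    """A translation with 3 more repetitive n-grams than the source is degenerated."""
    return repeated_ngrams(hyp) - repeated_ngrams(src) >= min_extra_ngrams
```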
3.2 Measuring Relative Token Contributions
We then test the aforementioned hypotheses based on the distribution of relative token contributions computed with lrp, and compare it with the distribution derived from the attention matrix. Throughout, Rt(xi) denotes the contribution of source token xi to the output token generated at step t.
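lrp itself is not re-derived here; the sketch below only shows how, given relevance matrices produced by lrp (and, for comparison, decoder cross-attention weights), one can obtain per-step source contributions. The array shapes and variable names are assumptions for illustration.

```python
import numpy as np

def relative_source_contribution(R_src, R_tgt):
    """R_src: (T, n) relevance of the n source tokens at each of T generation steps.
    R_tgt: (T, t) relevance of the target prefix. Since lrp conserves the total
    relevance at each step (see footnotes), the ratio below gives the relative
    source contribution per step."""
    src = R_src.sum(axis=1)
    tgt = R_tgt.sum(axis=1)
    return src / (src + tgt)

def attention_source_contribution(cross_attn):
    """cross_attn: (layers, heads, T, n) decoder cross-attention weights.
    Averaging over layers and heads yields a (T, n) attention-based analogue
    of the source contribution matrix."""
    return cross_attn.mean(axis=(0, 1))
```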
3.3 nmt Setup
We build strong Transformer models on two high-resource language pairs: English→Chinese (En-Zh) and German→English (De-En). They produce acceptable translation outputs overall, thus making hallucinations particularly misleading.
Data
For En-Zh, we use the 18M training samples from WMT18 (Bojar et al., 2018) and newsdev2017 as the validation set. For De-En, we use all training corpora from WMT21 (Akhbardeh et al., 2021) except for ParaCrawl, which yields 5M sentence pairs after cleaning as in Chen et al. (2021).4 We use newstest2019 for validation. We tokenize English and German sentences using the Moses scripts (Koehn et al., 2007) and Chinese sentences using the Jieba segmenter.5 For En-Zh, we train separate BPE models for English and Chinese using 32k merging operations for each language. For De-En, we train a joint BPE model using 32k merging operations.
Models
All models are based on the base Transformer (Vaswani et al., 2017). We apply label smoothing of 0.1. We train all models using the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 4.0 and batch sizes of 4,000 tokens for a maximum of 800k steps. We decode with beam search with a beam size of 4. The resulting nmt models achieve bleu scores close to or higher than comparable published results.6
3.4 Findings
We test the aforementioned hypotheses on the perturbation-based counterfactual hallucination dataset constructed on English→Chinese.
First, we test the Low Source Contribution Hypothesis by computing the relative source contribution at each generation step t, i.e., the sum of the contributions Rt(xi) over all n source tokens, where n is the length of each source sequence.7 We plot the average contributions over a set of samples in Figure 1. It shows that hallucinations under source perturbations have only slightly higher source contributions (Δ ≈ 0.1) than the original samples. This departs from previous observations on pseudo-hallucinations (Voita et al., 2021), where the relative source contributions were lower on pseudo-hallucinations than on reference translations, perhaps because actual model outputs differ from pseudo-hallucinations created by inserting random target prefixes. We show that the hypothesis does not hold on actual hallucinations generated by the model itself.
To explain this phenomenon, we further examine the source contribution from the end-of-sequence (EOS) token. Previous work hypothesizes that a translation is likely to be a hallucination when the attention distribution is concentrated on the source EOS token, which carries little information about the source (Berard et al., 2019b; Raunak et al., 2021). However, this hypothesis has only been supported by qualitative analysis on individual samples. Our quantitative results on the perturbation-based hallucination dataset do not support it, and align instead with the recent finding that the proportion of attention paid to the EOS token is not indicative of hallucinations (Guerreiro et al., 2022). Specifically, our results show that the proportion of source contribution from the EOS token is slightly higher on the original samples (11.2%) than that on the hallucinated samples (10.8%). We will show in the next part that the source contribution is more concentrated on the beginning than the end of the source sentence when the model hallucinates.
Table 2: Standardized mean differences between hallucinated and original samples for degenerated (D) and non-degenerated (N) hallucinations (see § 3.4).

| | Contrib Ratio (D) | Contrib Ratio (N) | Staticity (D) | Staticity (N) |
|---|---|---|---|---|
| Attention | −1.03† | +0.51† | 1.92† | −1.10† |
| lrp | −1.05† | −1.13† | 3.16† | 2.16† |
Furthermore, we investigate whether there is any positional bias for the local source contribution. We visualize the normalized source contribution averaged over all samples with a source length of 30 in Figure 2. The source contribution of the hallucinated samples is disproportionately high at the beginning of a source sequence. By contrast, on the original samples, the normalized contribution is higher at the end of the source sequence, which could be a way for the model to decide when to finish generation. The positional bias exists not only on hallucinations under insertions at the beginning of the source, but also on hallucinations under misspelling and title-casing perturbations that are applied at random positions.
Third, we examine the Static Source Contribution Hypothesis by first visualizing the source contributions Rt(xi) at varying source and generation positions on individual pairs of original and hallucinated samples. The heatmaps of source contributions for the example from Table 1 are shown in Figure 3. On the original outputs, the source contribution distribution in each column changes dynamically when moving horizontally along target generation steps. By contrast, when the model hallucinates, the source contribution distribution remains roughly static.
To quantify this pattern, we introduce Source Contribution Staticity, which measures how the source contribution distribution shifts over generation steps. Specifically, given a window size k, we first divide the target sequence into several non-overlapping segments, each containing k tokens. Then, we compute the average vector over the contribution vectors Rt = [Rt(x0)…Rt(xn)] at steps t within each segment. Finally, we measure the cosine similarity between the average contribution vectors of adjacent segments and average over the cosine similarity scores at all positions as the final score sk of window size k. Figure 4 illustrates this process for a window size of 2.
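A sketch of the Source Contribution Staticity score, assuming R is a (T, n) matrix whose entry [t, i] is Rt(xi); dropping segments that do not fill a whole window is an assumption about the edge case.

```python
import numpy as np

def staticity(R, k):
    """R: (T, n) source contribution matrix; k: window size.
    Split the T generation steps into consecutive non-overlapping segments of k
    steps, average the contribution vectors within each segment, and return the
    mean cosine similarity between adjacent segment averages."""
    T = R.shape[0]
    segments = [R[t:t + k].mean(axis=0) for t in range(0, T - T % k, k)]
    if len(segments) < 2:
        return 0.0  # assumption: too few segments to compare
    sims = []
    for a, b in zip(segments[:-1], segments[1:]):
        sims.append(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return float(np.mean(sims))

def max_staticity(R, window_sizes=(1, 2, 3)):
    """Maximum staticity over a set of window sizes, as used for Table 2."""
    return max(staticity(R, k) for k in window_sizes)
```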
Table 2 shows the standardized mean difference in Source Contribution Staticity between the hallucinated and original samples in the degenerated and non-degenerated groups, taking the maximum staticity score among window sizes k ∈ [1,3] for each sample. The positive differences in lrp-based scores support the Static Source Contribution Hypothesis—the source contribution distribution is more static on the hallucinated samples than on the original samples. Furthermore, lrp distinguishes hallucinations from non-hallucinations better than attention, especially on non-degenerated samples where the translation outputs contain no repetitive loops.
In summary, we find that, when generating a hallucination under source perturbations, the nmt model tends to rely on a small proportion of the source tokens, especially the tokens at the beginning of the source sentence. In addition, the distribution of the source contributions is more static on hallucinated translations than that on non-hallucinated translations. We turn to applying these insights on natural hallucinations next.
4 A Classifier to Detect Natural Hallucinations
Based on these findings, we design features for a lightweight hallucination detector trained on samples automatically constructed by perturbations.
Classifier
We build a small multi-layer perceptron (mlp) with a single hidden layer and the following input features:
Normalized Source Contribution of the first K1 and the last K1 source tokens, i.e., each token's aggregate contribution over generation steps, normalized over the n source tokens (where n is the length of the source sequence and K1 is a hyper-parameter). As shown when testing the Local Source Contribution Hypothesis, the contributions of the beginning and end tokens are distributed differently between hallucinated and non-hallucinated samples.
Source Contribution Staticity sk given the source contributions Rt(xi) and a window size k as defined in § 3.4. We include the similarity scores of window sizes k = {1, 2, …, K2} as input features, where K2 is a hyper-parameter.
This yields small classifiers with an input dimension of 9. For each language pair, we train 20 classifiers with different random seeds and select the model with the highest validation F1 score.
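A sketch of how the feature vector and the mlp could be assembled, reusing the staticity function from the sketch in § 3.4. scikit-learn's MLPClassifier stands in for the single-hidden-layer mlp, the normalization of the positional features is one plausible reading of the definition above, and contribution_matrices / labels are hypothetical precomputed inputs.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def features(R, K1=1, K2=8):
    """Build the detector's input features from a (T, n) contribution matrix.
    The normalization of the positional features is an assumption."""
    n = R.shape[1]
    per_token = R.sum(axis=0)
    per_token = per_token / (per_token.sum() + 1e-12) * n  # 1.0 = uniform contribution
    positional = np.concatenate([per_token[:K1], per_token[-K1:]])
    static = np.array([staticity(R, k) for k in range(1, K2 + 1)])
    return np.concatenate([positional, static])

# Hypothetical training loop over precomputed contribution matrices and labels.
X = np.stack([features(R) for R in contribution_matrices])
y = np.array(labels)  # 1 = hallucination, 0 = non-hallucination
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000).fit(X, y)  # hidden size is an assumption
```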
Data Generation
We construct the training and validation data using the same approach as for the perturbation-based hallucination dataset (§ 3.1), but with longer seed pairs—we randomly select seed sentence pairs with source length between 20 and 60 from the training corpora. We split the synthetic data randomly into training (around 1k samples) and validation (around 200 samples) sets with a roughly equal number of positive and negative samples.
5 Detecting Natural Hallucinations
While the hallucination classifier is trained on hallucinations from perturbations, we collect more realistic data to evaluate it against a wide range of relevant models.
5.1 Natural Hallucination Evaluation Set
We build a test bed for detached hallucination detection for different language pairs and translation directions (En-Zh and De-En), and release the data together with the underlying nmt models (described in § 3.3).
Since hallucinations are rare, we collect samples from large pools of out-of-domain data for our models to obtain enough positive examples of hallucinations for a meaningful test set. We use TED talk transcripts from the IWSLT15 training set (Cettolo et al., 2015) for En-Zh, and the JRC-Acquis corpus (Steinberger et al., 2006) of legislation from the European Union for De-En. To increase the chance of finding hallucinations, we select around 200, 50, and 50 translation outputs with low bleu, low comet (Rei et al., 2020a), or low laser similarity (Artetxe and Schwenk, 2019) scores, respectively. We further combine them with 50 randomly selected samples.
Three bilingual annotators assess the faithfulness of the nmt output given each input. While we ultimately need a binary annotation of outputs as hallucinated or not, annotators were asked to choose one of five labels to improve consistency:
Detached hallucination: a translation with large segments that are unrelated to the source.
Faithful translation: a translation that is faithful to the source.
Incomplete translation: a translation that is partially correct but misses part(s) of the source.
Locally unfaithful: a translation that contains a few unfaithful phrases but is otherwise faithful.
Incomprehensible but aligned: a translation that is incomprehensible even though most phrases can be aligned to the source.
All labels except “detached hallucination” are aggregated into the “non-hallucination” category. The inter-annotator agreement on the aggregated labels is substantial, with a Fleiss’ Kappa (Fleiss, 1971) score of FK = 0.77 for De-En and FK = 0.64 for En-Zh. Disagreements are resolved by majority voting for De-En, and by adjudication by a bilingual speaker for En-Zh. This yields 27% detached hallucinations on En-Zh and 45% on De-En. The non-hallucinated nmt outputs span all the fine-grained categories above, as shown in Table 3. Hallucinations are over-represented compared to what one might expect in the wild, but this is necessary to provide enough positive examples for evaluation.
Table 3: Distribution of annotated labels in the natural hallucination evaluation sets.

| | En-Zh | De-En |
|---|---|---|
| Detached hallucination | 111 | 189 |
| Non-hallucination, including: | | |
| Faithful translation | 154 | 153 |
| Incomplete translation | 80 | 17 |
| Locally unfaithful | 58 | 31 |
| Incomprehensible but aligned | 5 | 33 |
| Total | 408 | 423 |
5.2 Experimental Conditions
5.2.1 Introspection-based Classifiers
We implement the lrp-based classifier described in § 4. To lower the cost of computing source contributions, we clip the source length at 40, and only consider the influence back-propagated through the most recent 10 target tokens—prior work shows that nearby context is more influential than distant context (Khandelwal et al., 2018). We tune the hyper-parameters K1 and K2 within the space K1 ∈ {1,3,5,7,9}, K2 ∈ {4,8,12,16} based on the average F1 score on the validation set over three runs. We compare it with an attention-based classifier, which uses the same features but computes token contributions using attention weights averaged over all attention heads.
5.2.2 Model-free Baselines
We use three simple baselines to characterize the task. The random classifier predicts hallucination with a probability of 0.5. The degeneration detector marks degenerated outputs, i.e., those that contain K more repetitive n-grams than the source, as hallucinations, where K is a hyper-parameter tuned on the perturbation-based hallucination data. The nmt probability score serves as a coarse model signal, based on the heuristic that the model is less confident when producing a hallucination: the output is classified as a hallucination if its probability score is lower than a threshold tuned on the perturbation-based hallucination data.
5.2.3 Quality Estimation Classifier
We also compare the introspection-based classifiers with a baseline classifier based on the state-of-the-art quality estimation model—comet-qe (Rei et al., 2020b). Given a source sentence and its nmt translation, we compute the comet-qe score and classify the translation as a hallucination if the score is below a threshold tuned on the perturbation-based validation set.
5.2.4 Large Pre-trained Classifiers
We further compare the introspection-based classifiers with classifiers that rely on large pre-trained multilingual models, in order to contrast the discriminative power of the source contribution patterns from the nmt model itself with extrinsic, semantically driven discrimination criteria.
We use the cosine distance between the laser representations (Artetxe and Schwenk, 2019; Heffernan et al., 2022) of the source and the nmt translation. This classifier flags a translation as a hallucination if the distance exceeds a threshold tuned on the perturbation-based validation set.
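A sketch of this detector, assuming the source and translation sentence embeddings have already been computed with the LASER encoder; the threshold is a tuned hyper-parameter.

```python
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def laser_detector(src_emb, mt_emb, threshold):
    """Flag the translation as a hallucination if its LASER embedding is too far
    from the source embedding. src_emb / mt_emb: precomputed sentence embeddings."""
    return cosine_distance(src_emb, mt_emb) > threshold
```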
Inspired by local hallucination (Zhou et al., 2021) and cross-lingual semantic divergence (Briakou and Carpuat, 2020) detection methods, we build an xlm-r classifier by fine-tuning the xlm-r model (Conneau et al., 2020) on synthetic hallucination samples. We randomly select 50K seed pairs of source and reference sentences with source lengths between 20 and 60 from the parallel corpus and use the following perturbations to construct examples of detached hallucinations:
Map a source sentence to a random target from the parallel corpus to simulate natural, detached hallucinations.
Repeat a random dependency subtree in the reference many times to simulate degenerated hallucinations.
Drop a random clause from the source sentence to simulate natural, detached hallucinations.
We then collect diverse non-hallucinated samples:
Original seed pairs provide faithful translations.
Randomly drop a dependency subtree from a reference to simulate incomplete translations.
Randomly substitute a phrase in the reference keeping the same part-of-speech to simulate translations with locally unfaithful phrases.
The final training and validation sets contain around 300k and 700 samples, respectively. We fine-tune the pre-trained model with a batch size of 32. We use the Adam optimizer (Kingma and Ba, 2015) with decoupled weight decay (Loshchilov and Hutter, 2019) and an initial learning rate of 2 × 10−5. We fine-tune all models for 5 epochs and select the checkpoint with the highest F1 score on the validation set.
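A condensed sketch of this fine-tuning setup using the Hugging Face transformers and PyTorch APIs; the synthetic-example construction and the evaluation loop are elided, train_examples is a hypothetical list of (source, translation, label) triples, and the weight-decay value is an assumption.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=2)
# Adam with decoupled weight decay; the weight-decay value itself is an assumption.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

def collate(batch):
    # Each item is a (source_sentence, candidate_translation, label) triple.
    srcs, mts, labels = zip(*batch)
    enc = tokenizer(list(srcs), list(mts), padding=True, truncation=True,
                    return_tensors="pt")
    enc["labels"] = torch.tensor(labels)
    return enc

loader = DataLoader(train_examples, batch_size=32, shuffle=True, collate_fn=collate)
model.train()
for epoch in range(5):
    for batch in loader:
        out = model(**batch)   # returns the classification loss when labels are given
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```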
5.3 Findings
Table 4 compares all classifiers and baselines in terms of Precision, Recall, and F1. Since false positives and false negatives might have a different impact in practice (e.g., does the detector flag examples for human review or act fully automatically? what is mt used for?), we also report the Area Under the Receiver Operating Characteristic Curve (AUC), which characterizes the discriminative power of each method at varying threshold settings.
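These metrics can be computed directly with scikit-learn; y_true, y_pred, and y_score below are hypothetical arrays of gold labels, binary decisions, and continuous hallucination scores.

```python
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

# y_true: gold labels (1 = hallucination), y_pred: binary decisions,
# y_score: the classifier's continuous hallucination score.
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
auc = roc_auc_score(y_true, y_score)
```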
Table 4: Hallucination detection results. The first group of P/R/F1/AUC columns reports De-En, the second En-Zh.

| | Params | P | R | F1 | AUC | P | R | F1 | AUC |
|---|---|---|---|---|---|---|---|---|---|
| Model-free Baselines | | | | | | | | | |
| Random | 0 | 44.0 | 49.9 | 46.8 | 50.2 | 27.6 | 49.8 | 35.5 | 48.0 |
| Degeneration | 1 | 49.1 | 59.3 | 53.7 | – | 63.2 | 71.2 | 66.9 | – |
| nmt Score | 1 | 33.3 | 3.4 | 6.2 | 37.7 | 35.4 | 91.9 | 51.1 | 49.8 |
| Quality Estimation Classifier | | | | | | | | | |
| comet-qe | 363M | 72.2 | 71.4 | 71.8 | 82.4 | 32.4 | 99.1 | 48.9 | 89.4 |
| Large Pre-trained Classifiers | | | | | | | | | |
| laser | 45M | 81.6 | 54.0 | 65.0 | 89.5 | 54.6 | 64.0 | 58.9 | 75.3 |
| xlm-r | 125M | 91.3 | 21.0 | 33.8 | 45.6 | 94.9 | 83.2 | 88.6 | 93.3 |
| Introspection-based Classifiers | | | | | | | | | |
| Attention-based | <400 | 54.3 | 89.0 | 67.4 | 70.1 | 36.0 | 71.0 | 47.7 | 68.6 |
| lrp-based | <400 | 87.3 | 76.2 | 81.2 | 91.4 | 87.5 | 85.6 | 86.4 | 96.5 |
| Ensemble Classifier | | | | | | | | | |
| lrp + laser | 45M | 100.0 | 45.7 | 62.7 | – | 94.5 | 59.5 | 72.9 | – |
| lrp + xlm-r | 125M | 95.3 | 21.5 | 35.1 | – | 97.6 | 72.4 | 83.1 | – |
Main Results
The lrp-based, xlm-r, and the laser classifiers are the best hallucination detectors, reaching AUC scores around 90 for either or both language pairs, which is considered outstanding discrimination ability (Hosmer Jr et al., 2013).
The lrp-based classifier is the best and most robust hallucination detector overall. It achieves higher F1 and AUC scores than laser on both language pairs. Additionally, it outperforms xlm-r by +47 F1 and +46 AUC on De-En, while achieving competitive performance on En-Zh. This shows that the source contribution patterns identified on hallucinations under perturbations (§ 3) generalize as symptoms of natural hallucinations even under domain shift, as the domain gap between training and evaluation data is bigger on De-En than En-Zh. It also confirms that lrp provides a better signal to characterize token contributions than attention, improving F1 by 14–39 points and AUC by 21–28 points. These high scores represent large improvements of 41–54 points on AUC and 20–75 points on F1 over the model-free baselines.
Model-free Baselines
These baselines shed light on the nature of the hallucinations in the dataset. The degeneration baseline is the best among them, with 53.7 F1 on De-En and 66.9 F1 on En-Zh, indicating that the Chinese hallucinations are more frequently degenerated than the English hallucinations from German. However, ignoring the remaining hallucinations is problematic, since they might be more fluent and thus more likely to mislead readers. The nmt score is a poor predictor, scoring worse than the random baseline on De-En, in line with previous reports that nmt scores do not capture faithfulness well during inference (Wang et al., 2020). Manual inspection shows that the nmt score can be low when the output is faithful but contains rare words, and it can be high for a hallucinated output that contains mostly frequent words.
Quality Estimation Classifier
The comet-qe classifier achieves higher AUC and F1 scores than the model-free classifiers, except for En-Zh, where the degeneration baseline obtains higher F1 than the comet-qe classifier. However, compared with the lrp-based classifier, comet-qe lags behind by 9–38 points on F1 and 7–9 points on AUC. This is consistent with previous findings that quality estimation models trained on data with insufficient negative samples (e.g., comet-qe) are inadequate for detecting critical mt errors such as hallucinations (Takahashi et al., 2021; Sudoh et al., 2021; Guerreiro et al., 2022).
Pre-trained Classifiers
The performance of pre-trained classifiers varies greatly across language pairs. laser achieves an AUC score competitive with the lrp-based classifier on De-En but lags behind on En-Zh, perhaps because the laser model is susceptible to the many rare tokens in the En-Zh evaluation data (from TED subtitles). xlm-r performs better on En-Zh, approaching the lrp-based classifier, but lags behind greatly on De-En. This suggests that the xlm-r classifier suffers from domain shift, which is bigger on De-En (News→Law) than on En-Zh (News→TED): the model fine-tuned on synthetic training data generalizes poorly across domains. By contrast, the introspection-based classifiers are more robust.
Ensemble Classifiers
The laser and xlm-r classifiers emerge as the top classifiers apart from the lrp-based one, but they make different errors than lrp—the confusion matrix comparing their predictions shows that the laser and lrp classifiers agree on 68–78% of samples, while the xlm-r and lrp classifiers agree on 64–88% of samples. Thus an ensemble of lrp + laser or lrp + xlm-r (which detects hallucinations when the two classifiers both do so) yields a very high precision (at the expense of recall).
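The ensembles amount to a logical AND over the two classifiers' binary decisions; a minimal sketch with hypothetical boolean prediction arrays:

```python
import numpy as np

# lrp_pred and laser_pred are boolean arrays of per-sample hallucination decisions.
ensemble_pred = np.logical_and(lrp_pred, laser_pred)          # flag only when both agree
agreement = float(np.mean(lrp_pred == laser_pred))            # fraction of identical decisions
```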
lrp Ablations
The lrp-based classifier benefits the most from Source Contribution Staticity features (Table 5). Removing them hurts AUC by 15–17 points and F1 by 28–31 points, confirming that the Static Source Contribution Hypothesis holds on natural hallucinations. Ablating the Normalized Source Contribution features also causes a significant drop in F1 on De-En, while their impact on En-Zh is not significant.
Error Analysis
Incomprehensible but aligned translations suffer from the highest false positive rate for the lrp classifier, followed by incomplete translations. Additionally, the classifier can fail to detect hallucinations caused by the mistranslation of a large span of the source with rare or previously unseen tokens, rather than by pathological behavior at inference time, as shown by the example in Table 6.
Source: C) DASS DIE WAREN IN DEM ZUSTAND IN DIE GEMEINSCHAFT VERSANDT WORDEN SIND, IN DEM SIE ZUR AUSSTELLUNG GESANDT WURDEN; |
Correct Translation: C) THAT THE GOODS WERE SHIPPED TO THE COMMUNITY IN THE CONDITION IN WHICH THEY ARE SENT FOR EXHIBITION; |
Output: C) THAT THE WOULD BE CONSIDERED IN THE COMMUNITY, IN which YOU WILL BE EXCLUSIVE; |
Toward Practical Detectors
Detecting hallucinations in the wild is challenging since they tend to be rare and their frequency may vary greatly across test cases. We provide a first step in this direction by stress testing the top classifiers in an in-domain scenario where hallucinations are expected to be rare. Specifically, we randomly select 10k English sentences from the News Crawl corpus of articles from 2021 released with WMT21 (Akhbardeh et al., 2021) and use the En-Zh nmt model to translate them into Chinese. We measure the Precision@20 for hallucination detection by manually examining the top-20 highest scoring hallucination predictions for each method. The laser, xlm-r, and lrp-based classifiers evaluated above (without fine-tuning in this setting) achieve 35%, 45%, and 45% Precision@20, respectively (compared to 0% for the random baseline). More interestingly, after tuning the threshold on the predicted probabilities (which is originally set to 0.5) so that each classifier predicts hallucination 1% of the time, the lrp + laser ensemble detects 9 hallucinations with a much higher precision of 89%, and the lrp + xlm-r ensemble detects 12 hallucinations with a precision of 83%. These ensemble detectors thus have the potential to provide useful signals for detecting hallucinations even when they are needles in a haystack.
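A sketch of the two evaluation steps in this stress test, with hypothetical score and label arrays: ranking by hallucination score for Precision@20, and calibrating the decision threshold so that roughly 1% of inputs are flagged.

```python
import numpy as np

def precision_at_k(scores, is_hallucination, k=20):
    """scores: per-sample hallucination scores; is_hallucination: manual labels
    (only the top-k predictions need to be annotated)."""
    top = np.argsort(-scores)[:k]
    return float(np.mean(is_hallucination[top]))

def calibrate_threshold(scores, positive_rate=0.01):
    """Pick the threshold so that roughly `positive_rate` of samples are flagged."""
    return float(np.quantile(scores, 1.0 - positive_rate))
```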
5.4 Limitations
Our findings should be interpreted with several limitations in mind. First, we exclusively study detached hallucinations in mt. Thus, we do not elucidate the internal model symptoms that lead to partial hallucinations (Zhou et al., 2021), although the methodology in this work could be used to shed light on this question. Second, we work with nmt models trained using the parallel data from wmt without exploiting monolingual data or comparable corpora retrieved from collections of monolingual texts (e.g., WikiMatrix [Schwenk et al., 2021]). It remains to be seen whether hallucination symptoms generalize to nmt models trained with more heterogeneous supervision. Finally, we primarily test the hallucination classifiers on roughly balanced test sets, while hallucinations are expected to be rare in practice. We conducted a small stress test which shows the promise of our lrp + laser classifier in more realistic conditions. However, further work is needed to systematically evaluate how these classifiers can be used for hallucination detection in the wild.
6 Related Work
Hallucinations occur in all applications of neural models to language generation, including abstractive summarization (Falke et al., 2019; Maynez et al., 2020), dialogue generation (Dušek et al., 2018), data-to-text generation (Wiseman et al., 2017), and machine translation (Lee et al., 2018). Most existing detection approaches view the generation model as a black box, by 1) training hallucination classifiers on synthetic data constructed by heuristics (Zhou et al., 2021; Santhanam et al., 2021), or 2) using external models to measure the faithfulness of the outputs, such as question answering or natural language inference models (Falke et al., 2019; Durmus et al., 2020). These approaches ignore the signals from the generation model itself and can be biased by the heuristics used for synthetic data construction or by external semantic models trained for other purposes. Concurrent to this work, Guerreiro et al. (2022) explore glass-box detection methods based on model confidence scores or attention patterns (e.g., the proportion of attention paid to the EOS token and the proportion of source tokens with attention weights higher than a threshold). They evaluate these methods based on hallucination recall, and find that model confidence is a better indicator of hallucinations than attention patterns. In this paper, we investigate varying types of glass-box patterns based on relative token contributions instead of attention, and find that these patterns yield more accurate hallucination detectors than model confidence.
Detecting hallucinations in mt has not yet been directly addressed by the mt quality estimation literature. Most quality estimation work has focused on predicting a direct assessment of translation quality, which does not distinguish adequacy and fluency errors (Guzmán et al., 2019; Specia et al., 2020). More recent task formulations target critical adequacy errors (Specia et al., 2021), but do not separate hallucinations from other error types, despite arguments that hallucinations should be considered separately from other mt errors (Shi et al., 2022). The critical error detection task at wmt 2022 introduces an Additions error category, which refers to hallucinations where the translation content is only partially supported by the source (Zerva et al., 2022). Additions includes both detached hallucinations (as in this work) and partial hallucinations. Methods for addressing all these tasks fall in two categories: 1) black-box methods based on the source and output alone (Specia et al., 2009; Kim et al., 2017; Ranasinghe et al., 2020), and 2) glass-box methods based on features extracted from the nmt model itself (Rikters and Fishel, 2017; Yankovskaya et al., 2018; Fomicheva et al., 2020). Black-box methods typically use resource-heavy deep neural networks trained on large amounts of annotated data. Our work is inspired by the glass-box methods that rely on model probabilities, uncertainty quantification, and the entropy of the attention distribution, but shows that relative token contributions computed through lrp provide sharper features to characterize hallucinations.
This paper combines interpretability techniques to identify the symptoms of hallucinations. We adopt a saliency method to measure the importance of each input unit through a back-propagation pass (Simonyan et al., 2014; Bach et al., 2015; Li et al., 2016a; Ding et al., 2019). While other saliency-based methods measure an abstract quantity reflecting the importance of each input feature by the partial derivative of the prediction with regard to each input unit (Simonyan et al., 2014), lrp (Bach et al., 2015) measures the proportional contribution of each input unit. This makes it well-suited to compare model behavior across samples. Furthermore, lrp does not require neural activations to be differentiable and smooth, and can be applied to a wide range of architectures, including RNN (Ding et al., 2017) and Transformer (Voita et al., 2021). We apply this technique to analyze counterfactual hallucination samples inspired by perturbation methods (Li et al., 2016b; Feng et al., 2018; Ebrahimi et al., 2018), but crucially show that the insights generalize to natural hallucinations.
7 Conclusion
We contribute a thorough empirical study of the notorious but poorly understood hallucination phenomenon in nmt, which shows that internal model symptoms exhibited during inference are strong indicators of hallucinations. Using counterfactual hallucinations triggered by perturbations, we show that distinctive source contribution patterns alone indicate hallucinations better than the relative contributions of the source and target. We further show that our findings can be used for detecting natural hallucinations much more accurately than model-free baselines and quality estimation models. Our detector also outperforms black-box classifiers based on pre-trained models. We release human-annotated test beds of natural English-Chinese and German-English hallucinations to enable further research. This work opens a path toward detecting hallucinations in the wild and improving models to minimize hallucinations in mt and other generation tasks.
Acknowledgments
We thank our TACL action editor, the anonymous reviewers, and the UMD CLIP lab for their feedback. Thanks also to Yuxin Xiong for helping examine German outputs. This research is supported in part by an Amazon Machine Learning Research Award and by the National Science Foundation under Award No. 1750695. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Notes
Code and data are released at https://github.com/weijia-xu/hallucinations-in-nmt.
For better contrastive analysis, we select samples with source length of n = 30 and clip the output length by T = 15.
The bleu thresholds are selected based on manual inspection of the translation outputs.
Since lrp ensures that the sum of source and target contributions at each generation step is a constant, we only visualize the relative source contributions.
λ0 is set to yield the largest score difference for each measurement type.
References