Abstract
The goal of information-seeking dialogue is to respond to seeker queries with natural language utterances that are grounded on knowledge sources. However, dialogue systems often produce unsupported utterances, a phenomenon known as hallucination. To mitigate this behavior, we adopt a data-centric solution and create FaithDial, a new benchmark for hallucination-free dialogues, by editing hallucinated responses in the Wizard of Wikipedia (WoW) benchmark. We observe that FaithDial is more faithful than WoW while also maintaining engaging conversations. We show that FaithDial can serve as a training signal for: i) a hallucination critic, which discriminates whether an utterance is faithful or not, and boosts performance by 12.8 F1 points on the BEGIN benchmark compared to existing datasets for dialogue coherence; ii) high-quality dialogue generation. We benchmark a series of state-of-the-art models and propose an auxiliary contrastive objective that achieves the highest level of faithfulness and abstractiveness according to several automated metrics. Further, we find that the benefits of FaithDial generalize to zero-shot transfer on other datasets, such as CMU-DoG and TopicalChat. Finally, human evaluation reveals that responses generated by models trained on FaithDial are perceived as more interpretable, cooperative, and engaging.
1 Introduction
Despite the recent success of knowledge-grounded neural conversational models (Thoppilan et al., 2022; Prabhumoye et al., 2021; Zhao et al., 2020, inter alia) in generating fluent responses, they also generate unverifiable or factually incorrect statements, a phenomenon known as hallucination (Rashkin et al., 2021b; Dziri et al., 2021; Shuster et al., 2021). Ensuring that models are trustworthy is key to deploying them safely in real-world applications, especially in high-stakes domains. In fact, they can unintentionally inflict harm on members of society with unfounded statements or can be exploited by malicious groups to spread large-scale disinformation.
Recently, Dziri et al. (2022a) investigated the underlying roots of this phenomenon and found that the gold-standard conversational datasets (Dinan et al., 2019; Gopalakrishnan et al., 2019; Zhou et al., 2018)—upon which the models are commonly fine-tuned—are rife with hallucinations, in more than 60% of the turns. An example of hallucination in Wizard of Wikipedia (WoW; Dinan et al., 2019) is shown in the red box of Figure 1. In WoW, an information seeker aims to learn about a topic and a human wizard harnesses knowledge (typically a sentence) from Wikipedia to answer. This behavior, where the human wizard ignores the knowledge snippet and assumes a fictitious persona, can later reverberate in the dialogue system trained on this kind of data. Instead, the ideal wizard response, highlighted in green, should acknowledge the bot’s nature, and whenever the knowledge is not sufficient or relevant, it should acknowledge its ignorance of the topic.
Figure 1: A representative FaithDial annotation: Subjective and hallucinated (red) information in the wizard’s utterances of WoW is edited into utterances faithful to the given knowledge (green). In FaithDial, the wizard assumes the persona of a bot.
Unfortunately, modeling solutions alone cannot remedy the hallucination problem. By mimicking the distributional properties of the data, models are bound to “parrot” the hallucinated signals at test time (Bender et al., 2021). What is more, Dziri et al. (2022a) observe that GPT2 not only replicates but even amplifies hallucination by around 20% when trained on WoW. This finding also extends to models that are designed explicitly to be knowledge-grounded (Prabhumoye et al., 2021; Rashkin et al., 2021b). Filtering noisy or high-error data (Zhang and Hashimoto, 2021) is also prone to failure, as it may either break the cohesion of discourse or require excluding entire dialogues.
In this work, we adopt instead a data-centric solution to address hallucinations and create FaithDial, a new benchmark for faithful1 knowledge-grounded dialogue. Specifically, we ask annotators to amend hallucinated utterances in WoW by making them faithful to the corresponding knowledge snippets from Wikipedia and acknowledging ignorance when necessary. This approach is vastly more scalable than creating FaithDial from scratch while retaining the cohesiveness of conversations. Moreover, it allows us to shed light on hallucinations by contrasting corresponding wizard’s responses in WoW and FaithDial.
As a result, FaithDial contains around 50K turns across 5.5K conversations. Extensive human validation reveals that 94.4% of the utterances in FaithDial are faithful (i.e., without hallucinations), compared to only 20.9% in WoW. Moreover, we benchmark several state-of-the-art models (Radford et al., 2019; Roller et al., 2021; Raffel et al., 2020; Rashkin et al., 2021b) on dialogue generation. If trained on FaithDial, we find that they are significantly more faithful while also enhancing other dialogue aspects like cooperativeness, creativity, and engagement. These benefits also generalize to other knowledge-grounded datasets like CMU-DoG (Zhou et al., 2018) and TopicalChat (Gopalakrishnan et al., 2019) in a zero-shot transfer setting.
FaithDial also provides supervision for hallucination critics, which discriminate whether an utterance is faithful or not. We source positive examples from FaithDial and negative examples from WoW. Compared to other dialogue inference datasets (Welleck et al., 2019a; Nie et al., 2021), the classifiers trained on this data (which we call FaithCritic) transfer better to general NLU tasks like MNLI (Williams et al., 2018) and achieve state-of-the-art on BEGIN (Dziri et al., 2022b), a dialogue-specific knowledge grounding benchmark in a zero-shot setting.
Thus, FaithDial holds promise to encourage faithfulness in information-seeking dialogue and make virtual assistants more trustworthy. We release data and code for future research.2
2 FaithDial: Dataset Design
Given the motivations adduced above, the primary goal of this work is to create a resource for faithful knowledge-grounded dialogue that allows for both training high-quality models and measuring the degree of hallucination of their responses. We define the notion of faithfulness formally as follows:
Definition 1 (Faithfulness). Given an utterance un, a dialogue history ℋ = (u1,…,un−1), and knowledge 𝒦n at turn n, we say that un is faithful with respect to 𝒦n iff the following condition holds: ∃ Γn ⊆ 𝒦n such that Γn ⊨ un, where ⊨ denotes semantic consequence and Γn is a non-empty subset of 𝒦n. In other words, there is no interpretation ℐ such that all members of Γn are true and un is false.
Hence, an utterance can optionally be grounded on multiple facts, but never on none.
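For open-domain text, the semantic consequence relation ⊨ cannot be verified symbolically; in practice it is approximated with learned models (as in the critic of Section 5.1). The snippet below is a minimal illustrative sketch of such an approximation using an off-the-shelf NLI model; the model choice and the binary decision rule are our assumptions, not part of FaithDial itself.

```python
# Illustrative only: approximate "K_n semantically entails u_n" with an
# off-the-shelf NLI model. Model name and decision rule are assumptions.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def is_faithful(knowledge: str, utterance: str) -> bool:
    # Premise = knowledge snippet K_n, hypothesis = wizard utterance u_n.
    result = nli({"text": knowledge, "text_pair": utterance})
    return result[0]["label"] == "ENTAILMENT"

print(is_faithful("Sushi is a Japanese dish of vinegared rice.",
                  "Sushi comes from Japan and is made with vinegared rice."))
```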
In what follows, we present the design of our task as well as our annotation pipeline to curate FaithDial. In our dialogue setting, we simulate interactions between two speakers: an information seeker and a bot wizard.
Definition 2 (Seeker). The information seeker, a human, aims at learning about a specific topic in a conversational manner. They can express subjective information, bring up a new set of facts independent from the source 𝒦, and even open up new sub-topics.
From the perspective of Definition 2, utterances produced by the seeker have a large degree of freedom. For example, the human can chat about personal life and can ask a diverse set of questions. On the other hand, the wizard is more restricted in what they can communicate.
Definition 3 (Wizard). The wizard, a bot, aims at conversing in a knowledgeable manner about the seeker’s unique interests, resorting exclusively to the available knowledge 𝒦. They can reply to a direct question or provide information about the general topic of the conversation.3
From Definition 3, it follows that there are three key rules the bot must abide by: First, it should be truthful by providing information that is attributable to the source 𝒦. Second, it should provide information conversationally, that is, use naturalistic phrasing of 𝒦, support follow-up discussion with questions, and prompt the user’s opinions. Third, it should acknowledge its ignorance of the answer in those cases where 𝒦 does not include it, while still moving the conversation forward using 𝒦.
2.1 Data Selection
Rather than creating a novel benchmark from scratch, we opt for fixing problematic utterances (which are the majority) in existing dialogue benchmarks (Dziri et al., 2022a). The reason is three-fold: 1) while mostly hallucinated, existing datasets still contain useful faithful information; 2) as correction is faster than creation from scratch, it enables us to annotate examples on a larger scale; 3) two versions of the same dialogue turn, one hallucinated and one faithful, can provide signal for (contrastive) learning and evidence for linguistic analysis. In particular, we adopt WoW as our benchmark backbone.
An initial pilot study revealed that WoW dialogues are more suitable for editing than those of other prominent knowledge-grounded dialogue benchmarks: TopicalChat (Gopalakrishnan et al., 2019) and CMU-DoG (Zhou et al., 2018). In fact, according to Dziri et al. (2022a), as shown in Table 1, WoW contains relatively fewer hallucinations than CMU-DoG and TopicalChat. Moreover, full hallucinations—responses that contain no faithful content and therefore need to be entirely discarded—are highly prevalent in the latter two (61.4% in CMU-DoG and 46.8% in TopicalChat, vs. only 19.7% in WoW). Finally, knowledge snippets in WoW tend to be shorter, which is preferable, as longer knowledge is correlated with increased hallucination due to the constrained cognitive capacity for text navigation and comprehension in humans (De Jong, 2010; DeStefano and LeFevre, 2007).
Table 1: The breakdown of responses from WoW, CMU-DoG, and TopicalChat according to the BEGIN taxonomy (Dziri et al., 2022b). “Faith.” refers to faithful responses and “Uncoop.” refers to faithful but uncooperative responses given the conversation history.
| Dataset | Generic | Halluc. (Full) | Halluc. (Partial) | Entail. (Faith.) | Entail. (Uncoop.) |
|---|---|---|---|---|---|
| WoW | 5.3 | 19.7 | 42.3 | 24.1 | 8.5 |
| CMU-DoG | 13.2 | 61.4 | 5.1 | 16.2 | 4.1 |
| TopicalChat | 12.7 | 46.8 | 17.1 | 22.9 | 0.5 |
Our first step consists of filtering out WoW conversations where the ground-truth knowledge was not given and annotators relied on personal knowledge instead. Then, we focus on seeker-initiated conversations and sample 44% of the train set (4,094 conversations), 100% of the validation set (764 conversations), and 100% of the test set (791 conversations).4
2.2 Crowd-sourced Annotations
Following the guidelines for ethical crowdsourcing outlined in Sheehan (2018), we hire Amazon Mechanical Turk (AMT) workers to edit utterances in WoW dialogues that were found to exhibit unfaithful responses.5 First, workers were shown dialogues from WoW and asked to determine whether the wizard utterances are faithful to the source knowledge. To guide them in this decision, they were additionally requested to identify the speech acts (VRM taxonomy; Stiles, 1992) such as disclosure, edification, question, acknowledgment, and so on; and the response attribution classes (BEGIN taxonomy; Dziri et al., 2022b) such as hallucination and entailment for each of the wizard’s utterances according to Dziri et al.’s (2022a) schema.
2.2.1 Editing the Wizard’s Utterances
Workers were instructed to edit the wizard’s utterances in the following cases, depending on their faithfulness.
Hallucination.
They should remove information that is unsupported by the given knowledge snippet 𝒦 and replace it with information that is supported. To ensure that the responses are creative, we disallowed workers from copying segments from 𝒦. They were instead instructed to paraphrase the source knowledge as much as possible without changing its meaning (Ladhak et al., 2022; Lux et al., 2020; Goyal and Durrett, 2021). If the inquiry of the seeker cannot be satisfied by the knowledge 𝒦, the wizard should acknowledge their ignorance and carry on the conversation by presenting the given knowledge in an engaging manner. In the example shown in Table 3, the new wizard confirms that it cannot surf and instead enriches the conversation by talking about surfing, as opposed to the original wizard, who hallucinates personal information.
Generic.
Utterances such as “That’s nice” should be avoided on their own. Workers are instructed to enrich these responses with content that is grounded on the knowledge.
Uncooperativeness.
If the response was determined to be faithful but uncooperative with respect to the user’s requests, workers are required to make it coherent with the dialogue history while keeping it faithful.
2.2.2 Editing the Seeker’s Utterances
Although the seeker has no restrictions on their utterances, the conversation may inevitably drift—because of the edits on the wizard’s response—making the seeker’s existing next utterance in WoW incoherent with the new context. In these cases, workers edit the seeker’s next utterance to make it coherent. Consider Table 3, where workers had to edit the WoW seeker’s utterance because it was no longer coherent with the freshly edited wizard’s response.
3 Dataset Quality
3.1 Crowdworker Quality Control
To be eligible for the task, workers have to be located in the United States or Canada and have to successfully answer 20 questions as part of a qualification test. Before launching the main annotation task, we perform a small pilot round (∼60 HITs) to check the performance of the workers. If we observe any errors, we email the workers concerned and provide them with examples of how to fix their mistakes in future HITs. Workers are also encouraged to reach out to us in case they find annotating a particular example ambiguous. At the end of the pilot round, we revoke access for workers who provide poor-quality annotations. After several staging rounds, we launch the main annotation stage. To ensure the quality does not drop, a linguistics major evaluates the performance of workers daily (10 HITs on average per worker) and rejects poor-quality work. Repeated mistakes result in the worker being blocked from the task entirely. In total, we ended up recruiting 10 well-trained workers. We also perform automatic quality control checks to ensure that workers avoid copying segments from the source knowledge.
3.2 Human validation
To evaluate the quality of FaithDial, we run two final rounds of annotations. First, we ask 3 new workers to edit the same 500 responses. Since there is no straightforward way to measure inter-annotator agreement on edits, following Dziri et al. (2022a), we measure the inter-annotator agreement on the identified response attribution classes (BEGIN) and the speech acts (VRM). We report an inter-annotator agreement of 0.75 and 0.61 Fleiss’ κ, respectively, which indicates substantial agreement according to Landis and Koch (1977). This is an indicator of overall annotation quality: if a worker can reliably identify speech acts, they generally also produce reasonable edits. Second, we assign 3 new workers to judge the faithfulness of the same 500 edited responses (we use majority vote). Assuming the pre-existing labels to be correct, the F1 scores of the majority-vote annotations for both taxonomies are similarly high: 90% for BEGIN and 81% for VRM. In total, we found that FaithDial contains 94.4% faithful responses and 5.6% hallucinated responses, as shown in Figure 2(a) (inner circle), attesting to the high quality of FaithDial.
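As a side note, Fleiss’ κ figures like the ones above can be reproduced with standard tooling; below is a minimal sketch using statsmodels, where the label matrix is toy data (purely illustrative, not the actual annotations).

```python
# A minimal sketch of the Fleiss' kappa computation behind the agreement
# numbers above. The label matrix below is toy data, not real annotations.
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = annotated responses, columns = 3 annotators,
# values = categorical class ids (e.g., BEGIN classes)
labels = [
    [0, 0, 0],  # all raters agree on "entailment"
    [1, 1, 2],  # two say "hallucination", one says "generic"
    [0, 0, 1],
    [2, 2, 2],
]
table, _ = aggregate_raters(labels)  # per-response category counts
print(fleiss_kappa(table))
```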
Figure 2: Coarse-grained (BEGIN) and fine-grained speech act (VRM) distributions used by wizards in FaithDial and WoW. The innermost circle shows the breakdown of coarse-grained types: Hallucination (red), Entailment (green), Partial Hallucination (yellow), Generic (purple), and Uncooperative (pink). The outer circles show the fine-grained types of each coarse-grained type.
4 Dataset Analysis
4.1 Dataset Statistics
Overall, FaithDial contains a total of 5,649 dialogues consisting of 50,761 utterances. Table 2 reports statistics for each dataset split. To curate FaithDial, workers edited 84.7% of the wizard responses (21,447 utterances) and 28.1% of the seeker responses (7,172 utterances). In particular, 3.8 wizard turns per conversation were modified on average, as opposed to only 1.2 seeker turns. The low percentage of the seeker edits shows that our method does not disrupt the cohesiveness of the conversations.
Table 2: Dataset statistics of FaithDial.
| | Train | Valid | Test |
|---|---|---|---|
| Turns | 36,809 | 6,851 | 7,101 |
| Conversations | 4,094 | 764 | 791 |
| Avg. Tokens for Wizard | 20.29 | 21.76 | 20.86 |
| Avg. Tokens for Seeker | 17.25 | 16.65 | 16.49 |
| Avg. Tokens for Knowledge | 27.10 | 27.17 | 27.42 |
| Turns per Conversation | 9 | 9 | 9 |
4.2 Linguistic Phenomena
4.2.1 Faithfulness
Based on our human validation round of 500 examples, FaithDial contains 94.4% faithful responses and 5.6% hallucinated responses. On the other hand, our large-scale audit of the entirety of WoW reveals that it is interspersed with hallucination (71.4%), with only a minority of faithful turns (20.9%), as shown in Figure 2(b) (inner circle). This finding is consistent with the analysis of Dziri et al. (2022a) on a smaller sample. FaithDial, in contrast, cleanses dialogues of hallucination almost entirely.
We also report the speech acts used to ensure faithfulness in FaithDial in the outer circle of Figure 2. We observe that the wizard resorts to a diverse set of speech acts to convey faithful information in a conversational style (see the Entailment pie): 78.26% of the responses contain objective content (Edification) that is interleaved with dialogue acts such as acknowledging receipt of the previous utterance (18.3%), asking follow-up questions (35.5%), and sparking follow-on discussions by expressing opinions still attributable to the knowledge source (36.2%). Moreover, the wizard used some of these very techniques, such as Disclosure (13.04%) and Questions (8.6%), in isolation. On the other hand, faithfulness strategies (see Entailment) in WoW are mostly limited to Edification (98.9%), curbing the naturalness of responses.
4.2.2 Abstractiveness
After establishing the faithfulness of FaithDial, we investigate whether it stems from an increased level of extractiveness or abstractiveness with respect to the knowledge source. Extractive responses reuse the same phrases as the knowledge source, while abstractive responses express the same meaning by different means. Although extractive responses are an easy shortcut to achieving more faithfulness, they come at the cost of creativity. Ideally, we want responses that are faithful as well as creative, that is, responses that are not just a copy-paste of the knowledge but rather a creative use of it. To measure creativity, we borrow two metrics from Grusky et al. (2018) designed to quantify the extractive and abstractive nature of summaries: Density and Coverage. Density represents the average length of the text spans copied from the knowledge that are contained in the response. Coverage instead measures the percentage of words in a response that are also found in the source knowledge. Figure 3 illustrates the density and coverage distributions in FaithDial (right) vs. WoW (left); a sketch of how these statistics can be computed follows the figure. We observe that while the coverage (x-axis) is similar in both FaithDial and WoW, the density (y-axis) is always low in FaithDial but often high in WoW. This indicates that responses in FaithDial tend to be abstractive to a large degree.
Figure 3: Density and coverage in WoW (Dinan et al., 2019) (left) vs. FaithDial (right). Responses in FaithDial tend to be abstractive to a large degree compared to WoW.
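As a rough illustration of these two statistics, the sketch below greedily extracts the spans of a response copied verbatim from the knowledge and derives coverage and density from them, following the prose description above (Grusky et al.’s exact formulation differs slightly; whitespace tokenization is a simplifying assumption).

```python
# A minimal sketch of coverage and density as described above: greedily find
# response spans copied verbatim from the knowledge. Whitespace tokenization
# and density-as-average-span-length follow the paper's prose description.

def extractive_fragments(knowledge: str, response: str) -> list[list[str]]:
    k, r = knowledge.lower().split(), response.lower().split()
    fragments, i = [], 0
    while i < len(r):
        best: list[str] = []
        for j in range(len(k)):  # longest common span starting at r[i]
            length = 0
            while (i + length < len(r) and j + length < len(k)
                   and r[i + length] == k[j + length]):
                length += 1
            if length > len(best):
                best = r[i:i + length]
        if best:
            fragments.append(best)
            i += len(best)
        else:
            i += 1
    return fragments

def coverage_and_density(knowledge: str, response: str) -> tuple[float, float]:
    frags = extractive_fragments(knowledge, response)
    copied = sum(len(f) for f in frags)
    coverage = copied / len(response.split())  # share of copied tokens
    density = copied / max(len(frags), 1)      # average copied-span length
    return coverage, density
```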
Based on this, we also study which specific abstractive strategies the wizard adopts to present knowledge from 𝒦 without repeating long fragments. The strategies we discovered fall into five broad categories: inference of new knowledge from 𝒦, rewording, reshaping the syntactic structure, abridging long expressions, and introducing connectives.
4.2.3 Fallback Responses in FaithDial
We further probe the wizard responses with respect to their ability to handle unanswerable questions. We randomly sample 45 dialogues containing 400 responses and ask a linguist to annotate them. Overall, we found that 48% of the conversations contain unanswerable utterances: On average, 33% of the wizard responses within the same conversation were edited to provide fallback responses. Out of those fallback responses, 30% were triggered by personal questions, 50% by objective questions about the topic, and 20% by opinions. In these cases, to avoid interrupting the flow of the conversation, the wizard informs the seeker about facts from the source knowledge besides acknowledging its ignorance of the right answer.
5 Experiments
The purpose of FaithDial is two-fold: first, the collected labels can serve as training data for a critic that determines whether a given response is faithful or hallucinated. The second goal is providing high-quality data for generating faithful responses in information-seeking dialogue. Given knowledge 𝒦 and the conversation history ℋ = (u1,…,un−1), the task is to generate a response un faithful to 𝒦. We benchmark a series of state-of-the-art dialogue models (Radford et al., 2019; Roller et al., 2021; Raffel et al., 2020; Rashkin et al., 2021b) on FaithDial. We also evaluate them on WoW and, in a zero-shot transfer setup, on CMU-DoG and TopicalChat. We implement all the baselines using the Huggingface Transformers library (Wolf et al., 2020).
5.1 Task I: Hallucination Critic
We frame the problem of identifying hallucination as a binary classification task where the goal is to predict whether an utterance is faithful or not, given the source knowledge. This characterization of the problem is reminiscent of previous work (Dziri et al., 2019; Welleck et al., 2019b; Nie et al., 2021) on detecting contradiction within a conversation.
For this purpose, we curate a dataset, FaithCritic, derived from human annotations in FaithDial. Specifically, we take 14k wizard utterances from WoW labeled as hallucination (Section 2) as negative examples. The wizard responses from WoW labeled as entailment, along with the newly edited wizard utterances (20k in total), count as positive examples. Overall, FaithCritic consists of 34k examples for training. We compare the performance of models trained on FaithCritic against models trained on two dialogue inference datasets—DNLI (Welleck et al., 2019b) and DECODE (Nie et al., 2021)—and on a well-known natural language inference (NLI) dataset, MNLI (Williams et al., 2018). For all datasets, we choose RoBERTa-Large (Liu et al., 2019) as the pre-trained model. We measure the transfer performance of the different critics on MNLI, BEGIN, and FaithCritic in zero-shot settings wherever possible.
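As a concrete illustration, a critic of this kind can be fine-tuned with a few lines of Huggingface code. The sketch below assumes FaithCritic is available as JSON files with "knowledge", "response", and "label" fields; the file names and hyperparameters are illustrative, not the paper’s exact configuration.

```python
# A minimal sketch of the hallucination critic: binary classification of
# (knowledge, response) pairs with RoBERTa-Large. File names, field names,
# and hyperparameters are assumptions for illustration.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

data = load_dataset("json", data_files={"train": "faithcritic_train.json",
                                        "validation": "faithcritic_valid.json"})
tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-large", num_labels=2)  # 0 = hallucinated, 1 = faithful

def encode(example):
    # The knowledge snippet and the candidate response form a sentence pair.
    return tokenizer(example["knowledge"], example["response"],
                     truncation=True, max_length=256)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="faithcritic", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data["train"].map(encode),
    eval_dataset=data["validation"].map(encode),
    tokenizer=tokenizer,  # enables dynamic padding during batching
)
trainer.train()
```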
The results are presented in Table 4. In the zero-shot setting, the critic trained on FaithCritic outperforms the baselines on MNLI and BEGIN by a large margin, indicating that FaithDial allows transfer both to a generic language understanding task and to a dialogue-specific knowledge grounding benchmark. On the other hand, the transfer performance of DECODE and DNLI is poor on both generic and dialogue-specific classification tasks. Surprisingly, MNLI transfers well to FaithCritic.
Table 4: Transfer results (accuracy) of the hallucination critics trained and tested on different datasets. † indicates zero-shot transfer results and bolded numbers denote best performance.
| Trained on | MNLI | BEGIN | FaithCritic |
|---|---|---|---|
| DECODE | 62.5† | 58.8† | 38.5† |
| DNLI | 52.4† | 59.8† | 30.9† |
| MNLI | **93.1** | 61.1† | 81.6† |
| FaithCritic | 74.7† | **71.6**† | **86.5** |
5.2 Task II: Dialogue Generation
5.2.1 Methods
For the task of dialogue generation, we consider a series of state-of-the-art models ranging from general-purpose LMs—such as GPT2 (Radford et al., 2019), DialoGPT (Zhang et al., 2020b), and T5 (Raffel et al., 2020)—to models that are specifically designed to provide better grounding, such as DoHA (Prabhumoye et al., 2021), or to alleviate hallucination, such as CTRL (Rashkin et al., 2021b). DoHA augments BART (Lewis et al., 2020) with a two-view attention mechanism that separately handles the knowledge document and the dialogue history during generation. CTRL equips LMs with control tokens (<objective-voice>, <lexical-overlap>, and <entailment>) whose embeddings are learned at training time; at test time, these steer the model towards generating utterances faithful to a source of knowledge. Finally, we adopt a training strategy called loss truncation (Kang and Hashimoto, 2020) to cope with the presence of hallucination in WoW by adaptively eliminating examples with a high training loss.
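To make the last strategy concrete, here is a minimal sketch of a loss-truncation training step. The fixed drop fraction and the absence of padding are simplifying assumptions; the original method estimates the drop fraction online.

```python
# A minimal sketch of loss truncation (Kang and Hashimoto, 2020): drop the
# highest-loss examples in each batch so likely-noisy (e.g., hallucinated)
# targets contribute no gradient. Assumes no padding tokens; real code
# would mask them. drop_frac is a hyperparameter, fixed here for brevity.
import torch
import torch.nn.functional as F

def truncated_loss(logits: torch.Tensor, labels: torch.Tensor,
                   drop_frac: float = 0.2) -> torch.Tensor:
    # Per-example NLL, averaged over the sequence dimension.
    vocab = logits.size(-1)
    nll = F.cross_entropy(logits.reshape(-1, vocab), labels.reshape(-1),
                          reduction="none").reshape(labels.shape)
    per_example = nll.mean(dim=1)
    # Keep only the (1 - drop_frac) lowest-loss examples.
    keep = max(1, int(per_example.size(0) * (1 - drop_frac)))
    kept, _ = torch.topk(per_example, keep, largest=False)
    return kept.mean()
```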
5.2.2 Automatic Evaluation
We rely on several metrics that provide a multi-faceted measure of performance. A first group measures the degree of hallucination of generated responses. The Critic model trained on FaithCritic (Section 5.1) returns the percentage of utterances identified as unfaithful. Q2 (Honovich et al., 2021) measures faithfulness via question answering: it takes a candidate response as input, generates corresponding questions, and identifies possible spans in the knowledge source and the candidate response to justify the question–answer pairs (Durmus et al., 2020; Wang et al., 2020). Finally, it compares the candidate answers with the gold answers, in terms of either token-level F1 score or an NLI-inspired similarity score based on a RoBERTa model. BERTScore (Zhang et al., 2020a) rates the semantic similarity between the generated response u and the knowledge 𝒦 based on the cosine of their sentence embeddings. F1 instead measures the token-level lexical overlap between u and 𝒦. Finally, as a second set of metrics, we report BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), which reflect the n-gram overlap between u and the gold (faithful) response g.
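The overlap-based metrics in this group are straightforward to compute; below is a minimal sketch of the knowledge F1 score together with an example call to the `bert-score` package (the paper’s exact implementation details, e.g., tokenization, are assumptions here).

```python
# A minimal sketch of the token-level F1 between a response u and the
# knowledge K, plus a BERTScore call via the bert-score package. Whitespace
# tokenization is a simplifying assumption.
from collections import Counter
from bert_score import score as bert_score

def knowledge_f1(response: str, knowledge: str) -> float:
    r, k = response.lower().split(), knowledge.lower().split()
    common = sum((Counter(r) & Counter(k)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(r), common / len(k)
    return 2 * precision * recall / (precision + recall)

responses = ["Surfing is a surface water sport."]
knowledge = ["Surfing is a surface water sport in which the rider glides."]
P, R, F = bert_score(responses, knowledge, lang="en")  # semantic similarity
print(knowledge_f1(responses[0], knowledge[0]), F.mean().item())
```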
WoW vs FaithDial.
To evaluate the ability of FaithDial to reduce hallucination in generated responses, Table 5 illustrates three experimental setups with different training data. WoW corresponds to the first block and FaithDial to the second block. The third block reflects a hybrid setup where a model is fine-tuned sequentially on WoW as an intermediate task and then on FaithDial. We evaluate all models on the FaithDial test set.
Table 5: Model performance on the test split of FaithDial. Bolded results indicate best performance. Metrics measure either the degree of hallucination of generated responses u with respect to the knowledge 𝒦 or their overlap with gold faithful responses g. Gray blocks correspond to models that are specifically designed to alleviate hallucinations. Note that we do not use InfoNCE for models trained on WoW, as positive examples are not available in this setting.
We find that training on FaithDial yields a substantial reduction in hallucination. For example, T5 trained on FaithDial decreases hallucination by 42.2% according to the Critic and increases the faithfulness score (Q2-NLI) by 4.3% compared to T5 trained on WoW.6 This underscores the importance of data quality over data quantity (FaithDial is one third the size of WoW). When initializing the models trained on FaithDial with the noisy checkpoint from WoW (third block), we observe a performance boost in all models across all metrics, except for a marginal drop in Critic for GPT2 and DialoGPT. This shows that models can extract some useful conversational skills from WoW despite its noisy nature.
Models.
First, we observe that T5 consistently performs favorably in reducing hallucination in all setups and across all metrics, compared to the rest of the vanilla baselines: GPT2, DialoGPT, and DoHA. Additionally, we compare models that are designed specifically to alleviate hallucination. Results are reported in the gray blocks of Table 5. We choose the best vanilla model T5 as the backbone for CTRL, InfoNCE, and LossTruncation. By virtue of these methods, faithfulness increases even further, which demonstrates their effectiveness. Sample responses from different models are presented in Table 6.
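The InfoNCE variant referenced here adds an auxiliary contrastive objective on top of the generation loss. The sketch below is one plausible instantiation, assuming each faithful response is contrasted against hallucinated negatives (e.g., the original WoW utterances); the encodings, temperature, and batching are our assumptions, not the paper’s exact formulation.

```python
# A minimal sketch of an InfoNCE-style auxiliary loss: pull the knowledge
# encoding towards the faithful response and away from hallucinated ones.
# How encodings are obtained (e.g., pooled T5 states) is an assumption.
import torch
import torch.nn.functional as F

def info_nce(knowledge_enc: torch.Tensor,     # (d,) anchor
             faithful_enc: torch.Tensor,      # (d,) positive response
             hallucinated_enc: torch.Tensor,  # (n, d) negative responses
             temperature: float = 0.1) -> torch.Tensor:
    anchor = F.normalize(knowledge_enc, dim=-1)
    candidates = F.normalize(
        torch.cat([faithful_enc.unsqueeze(0), hallucinated_enc]), dim=-1)
    logits = candidates @ anchor / temperature  # cosine similarities
    target = torch.zeros(1, dtype=torch.long)   # the positive sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), target)
```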
Abstractiveness.
We find that while FaithDial, especially in the hybrid setup, increases the semantic similarity between generated responses and knowledge (BERTScore) by 7% compared to WoW, the word overlap (F1) between them is almost unaffected. This indicates that WoW induces extractiveness over abstractiveness in models, which is not desirable. This is especially true for T5-CTRL variants, as their training objective encourages word overlap. Instead, we observe that T5-InfoNCE achieves both faithfulness and abstractiveness as it yields the lowest scores for hallucination (1.4 Critic) and extractiveness (55.8 F1).
5.2.3 Human Evaluation
In addition to the automated metrics, we conduct a human evaluation to assess the presence of hallucination in models trained on FaithDial, as well as other aspects of generated dialogues such as cooperativeness, engagingness, and abstractiveness. Following Rashkin et al. (2021a), our evaluation consists of a two-stage annotation process. First, the annotators are asked to determine whether responses are stand-alone (i.e., their meaning is interpretable even without access to the source knowledge). If not, they are deemed too vague or ill-formed to judge for faithfulness. Second, if the response is interpretable, the annotators are requested to evaluate whether the response is grounded on the source knowledge. If the response is deemed unfaithful, we further ask the annotators to mark it as hallucinated or generic.
On the other hand, if the response is deemed faithful, workers are asked to score three qualities: Cooperativeness means that the response is coherent with the previous turn and does not try to mislead the interlocutor or act unhelpfully. Engagingness involves engaging the interlocutor by prompting further replies and moving the conversation forward.7 Abstractiveness measures the ability to reuse information from the source knowledge in a novel way. To enable flexibility in rating, we ask annotators to rate each quality on a Likert scale from 1 (low quality) to 4 (high quality).
Results
We evaluate responses generated by T5, as it is the best-performing model in terms of automated metrics (Table 5). We provide human annotators with 200 responses, where each is scored by 3 human raters. Results are depicted in Table 7. We measure the agreement for each of the 7 qualities separately using Krippendorff’s α and find that the agreement (0.92, 0.91, 0.88, 0.90, 0.89, 0.75, 0.85, respectively) is reliably high.
Table 7: Human evaluation on 1600 generated FaithDial responses (200 × 8) from different models on the test data. * and ** indicate that the results are significantly different from the best result in that column (bolded) with p-value < 0.05 and < 0.01, respectively. “Coop.”, “Abst.”, and “Enga.” denote cooperativeness, abstractiveness, and engagingness, the three sub-scores of faithful responses.
| Train Data | Model | Interpretable | Hallucination | Faith.: Coop. | Faith.: Abst. | Faith.: Enga. | Generic |
|---|---|---|---|---|---|---|---|
| WoW | T5 | 93.2% | 55.8%** | 2.97* | 1.95* | 1.72* | 2.2% |
| WoW | T5-CTRL | 95.2% | 44.2%* | 1.97* | 0.92* | 1.33* | 0.9% |
| WoW | T5-LossTruncation | 94.3% | 42.5%** | 2.87* | 1.87* | 1.83* | 1.2% |
| FaithDial | T5 | 94.4% | 23.2%* | 3.63 | 2.43* | 2.33 | 1.4% |
| FaithDial | T5-WoW | 95.2% | 20.9%* | 3.59 | 2.44 | 2.37 | 1.0% |
| FaithDial | T5-CTRL | 96.7% | 20.8%* | 2.55* | 1.42* | 2.10* | 1.0% |
| FaithDial | T5-LossTruncation | 94.2% | 24.2%* | 3.59 | 2.42* | 2.03* | 0.9% |
| FaithDial | T5-InfoNCE | **97.2%** | **19.9%** | **3.79** | **2.92** | **2.60** | 0.9% |
Contrasting models trained on WoW and FaithDial, we find that FaithDial reduces hallucination by a large margin (32.6%) while increasing interpretability. We also observe that training models on FaithDial enhances the cooperativeness, engagingness, and abstractiveness of responses, as they tend to prompt further conversation, acknowledge previous utterances, and abstract information from the source knowledge. We see that CTRL benefits faithfulness, but at the expense of the cooperativeness and abstractiveness of the responses. The best-performing model is T5-InfoNCE, which achieves the highest faithfulness percentage (77.4%) and the highest dialogue quality scores.
Evaluation of Unanswerable Questions
To evaluate the ability of models trained on FaithDial to handle unanswerable questions, we analyze the responses to 200 unanswerable questions sampled from the test data. Each response is manually evaluated by 3 annotators for whether it is appropriate. Inter-annotator agreement based on Krippendorff’s α is 0.9, which is substantially high. Results indicate that T5-InfoNCE trained on FaithDial substantially outperforms T5-LossTruncation trained on WoW in properly answering unanswerable questions (83.2% vs. 33.3%).
5.2.4 Transfer from FaithDial to Other Datasets
To further examine the usefulness of FaithDial in out-of-domain settings, we test the performance of T5-FaithDial on TopicalChat (Gopalakrishnan et al., 2019), CMU-DoG (Zhou et al., 2018), and WoW (Dinan et al., 2019). Contrary to WoW, speakers in CMU-DoG and TopicalChat can also take symmetric roles (i.e., both act as the wizard). Knowledge is provided from Wikipedia movie articles in CMU-DoG and from diverse sources—such as Wikipedia, Reddit, and news articles—in TopicalChat. Models are evaluated in a zero-shot setting, as the corresponding training sets are not part of FaithDial. Results are depicted in Table 8. Since these testing benchmarks are fraught with hallucinations (see Table 1), we do not compare the quality of the response u with respect to the gold response g. We report both automatic metrics and human evaluation. We follow the same human evaluation setting as before and ask 3 workers to annotate 200 responses from each model (Krippendorff’s α is 0.82, 0.79, and 0.85 on TopicalChat, CMU-DoG, and WoW, respectively). The models trained on FaithDial are far more faithful than the models trained on in-domain data, despite the distribution shift. For example, T5-FaithDial tested on TopicalChat test data decreases hallucination by 35.7 points on Critic, by 13.9 points on Q2-NLI, and by 30.4 points on human scores. Similar trends can be observed for CMU-DoG and WoW (except for F1 on WoW, where human evaluation nevertheless shows that humans prefer FaithDial models by a large margin of 23.8). Regarding other dialogue aspects, T5-FaithDial models tested on TopicalChat and CMU-DoG enjoy a larger degree of abstractiveness than in-domain models but have lower scores for cooperativeness and engagingness. However, all of these aspects are enhanced when tested in-domain on WoW.
6 Related Work
Hallucination in Natural Language Generation.
Hallucination in knowledge-grounded neural language generation has recently received increasing attention from the NLP community (Ji et al., 2022). Tasks include data-to-text generation (Wiseman et al., 2017; Parikh et al., 2020), machine translation (Raunak et al., 2021; Wang and Sennrich, 2020), summarization (Durmus et al., 2020; Kang and Hashimoto, 2020), generative question answering (Li et al., 2021), and dialogue generation (Dziri et al., 2021, 2022b; Rashkin et al., 2021b).
These works focus either on devising automatic metrics to identify when hallucination occurs (Wiseman et al., 2017) or on finding possible causes of this degenerate behavior, including out-of-domain generalization and noisy training data points (Kang and Hashimoto, 2020; Raunak et al., 2021) and exposure bias caused by MLE training (Wang and Sennrich, 2020).
Hallucination in Dialogue Systems.
Hallucination in knowledge-grounded neural dialogue generation is an emergent research problem (Roller et al., 2021; Mielke et al., 2022; Shuster et al., 2021; Dziri et al., 2021; Rashkin et al., 2021b). Existing work aims predominantly to address hallucinations via engineering loss functions or enforcing consistency constraints, for instance by conditioning generation on control tokens (Rashkin et al., 2021b), by learning a token-level hallucination critic to flag problematic entities and replace them (Dziri et al., 2021), or by augmenting the dialogue system with a module retrieving relevant knowledge (Shuster et al., 2021).
Although promising, these approaches are prone to replicate—or even amplify—the noise found in training data. Dziri et al. (2022a) demonstrated that more than 60% of three popular dialogue benchmarks are rife with hallucination, which is picked up even by models designed to increase faithfulness. To the best of our knowledge, FaithDial is the first dataset for information-seeking dialogue that provides highly faithful curated data.
Hallucination Evaluation.
Recently introduced benchmarks can serve as testbeds for knowledge grounding in dialogue systems, such as BEGIN (Dziri et al., 2022b), DialFact (Gupta et al., 2022), Conv-FEVER (Santhanam et al., 2021), and the Attributable to Identified Sources (AIS) framework (Rashkin et al., 2021a). Meanwhile, a recent study has reopened the question of the most reliable metric for automatic evaluation of hallucination-free models, with the Q2 metric (Honovich et al., 2021) showing performance comparable to human annotation. In this work, we further contribute to this problem by proposing a critic model—trained on our collected FaithCritic data—that achieves high performance on the BEGIN benchmark.
7 Conclusions
We release FaithDial, a new benchmark for faithful information-seeking dialogue, where a domain-expert bot answers queries based on gold-standard knowledge in a conversational manner. Examples are created by manually editing hallucinated and uncooperative responses in Wizard of Wikipedia (WoW), which constitute 79.1% of the original dataset. Leveraging the resulting high-quality data, we train both a hallucination critic, which discriminates whether utterances are faithful to the knowledge and achieves a new state of the art on BEGIN, and several dialogue generation models. In particular, we propose strategies to take advantage of both noisy and cleaned data, such as intermediate fine-tuning on WoW and an auxiliary contrastive objective. With both automated metrics and human evaluation, we verify that models trained on FaithDial drastically enhance faithfulness and abstractiveness, both in-domain and during zero-shot transfer to other datasets, such as TopicalChat and CMU-DoG.
Acknowledgments
We are grateful to the anonymous reviewers for helpful comments. We would like to thank the MTurk workers for contributing to the creation of FaithDial and for giving feedback on various pilot rounds. SR acknowledges the support of the IBM-Mila grant, the NSERC Discovery grant, and the Facebook CIFAR AI chair program. OZ acknowledges the Alberta Machine Intelligence Institute Fellow Program and the Canadian Institute for Advanced Research AI Chair Program.
A AMT Instructions
Here, we detail the instructions given to workers in the annotation task. We follow instructions from Dziri et al. (2022a) in determining BEGIN and VRM categories. Additionally, according to the identified categories, we ask workers to perform a particular edit. Below are the questions we ask in every HIT:
1. Does the wizard’s response contain other information that is NOT supported by 𝒦? (e.g., facts, opinions, feelings) (Yes/No)
   - (a) If the response is hallucinated, what is the type of the unsupported information? (options: expressing a personal experience, expressing an opinion, expressing feelings, expressing unsupported facts, giving advice, acknowledging information from the Seeker)
   - (b) If the response is hallucinated, was the unsupported information triggered by a question/opinion from the Seeker? (Yes/No)
   - (c) Besides unsupported information, does the wizard’s response contain thoughts/opinions/feelings/facts that are supported by 𝒦? (Yes/No)
   - (d) Modify the wizard’s sentence such that the response:
     - i. uses only the facts from 𝒦 to make the response informative.
     - ii. is not a copy-paste of 𝒦 but a paraphrase of it.
     - iii. is relevant to the previous utterance and cooperative with the Seeker.
   - (e) If the response is not hallucinated, does the wizard’s response express personal thoughts/opinions/feelings that are supported by 𝒦? (Yes/No)
   - (f) If the response is not hallucinated, does the wizard’s response contain factual/objective information that is supported by 𝒦? (Yes/No)

   If the answer is “No” to both (e) and (f), the response is flagged as generic. We ask the annotators to modify the wizard’s sentence such that the response is supported by 𝒦.

2. If the response is faithful, workers are asked the following question: Is the wizard’s response cooperative with the Seeker’s response? That is, the wizard does not ignore answering a question and does not act in any unhelpful way.
   - (a) If yes, no modification is required for the wizard’s response.
   - (b) If no, modify the bot sentence such that:
     - i. the response is relevant to the previous utterance and cooperative with the Seeker.
     - ii. the response is not a copy-paste of 𝒦 but a paraphrase of it.
B Pay Structure
We pay crowdworkers a base pay of $1.70/HIT (USD). To retain excellent workers for all rounds, we give a bonus of $35–$40 per 100 HITs that are submitted successfully. The average amount of time spent per HIT is 6 minutes; that is, in one hour, workers are able to complete 10 HITs. This is equivalent to $17–$18 per hour.
Notes
To encourage naturalness in the response, annotators were also asked to express empathy, such as “I’m sorry about ...”, in case the Seeker expresses a very unfortunate event.
We use the original WoW splits. Please note that only the training set in FaithDial is smaller than the WoW training set, because of a limited budget. The main goal of this paper is to provide a high-quality faithful dialogue benchmark rather than a large-scale dataset for training.
To ensure clarity in the task definition, we provided turkers with detailed examples for our terminology. Moreover, we performed several staging rounds over the course of several months. See the full set of instructions in Appendix A, the pay structure in Appendix B, and details about our quality control in Sec. 3.1 and Sec. 3.2.
The relatively high score of T5-WoW on Q2-NLI may be due to this metric not being robust to partial hallucinations.
A low score in cooperativeness is correlated with a low score in engagingness, but the opposite is not necessarily true.
Author notes
Action Editor: Wenjie Li
Work done while at IBM Research.