FaithDial: A Faithful Benchmark for Information-Seeking Dialogue

Abstract The goal of information-seeking dialogue is to respond to seeker queries with natural language utterances that are grounded on knowledge sources. However, dialogue systems often produce unsupported utterances, a phenomenon known as hallucination. To mitigate this behavior, we adopt a data-centric solution and create FaithDial, a new benchmark for hallucination-free dialogues, by editing hallucinated responses in the Wizard of Wikipedia (WoW) benchmark. We observe that FaithDial is more faithful than WoW while also maintaining engaging conversations. We show that FaithDial can serve as a training signal for: i) a hallucination critic, which discriminates whether an utterance is faithful or not, and boosts performance by 12.8 F1 points on the BEGIN benchmark compared to existing datasets for dialogue coherence; ii) high-quality dialogue generation. We benchmark a series of state-of-the-art models and propose an auxiliary contrastive objective that achieves the highest level of faithfulness and abstractiveness based on several automated metrics. Further, we find that the benefits of FaithDial generalize to zero-shot transfer on other datasets, such as CMU-DoG and TopicalChat. Finally, human evaluation reveals that responses generated by models trained on FaithDial are perceived as more interpretable, cooperative, and engaging.


Introduction
Despite the recent success of knowledge-grounded neural conversational models (Thoppilan et al., 2022; Prabhumoye et al., 2021; Zhao et al., 2020, inter alia) in generating fluent responses, they also generate unverifiable or factually incorrect statements, a phenomenon known as hallucination (Rashkin et al., 2021b; Dziri et al., 2021; Shuster et al., 2021). Ensuring that models are trustworthy is key to deploying them safely in real-world applications, especially in high-stakes domains. In fact, they can unintentionally inflict harm on members of society with unfounded statements or can be exploited by malicious groups to spread large-scale disinformation.

[Figure 1: An example dialogue on the knowledge snippet "Broken heart is a metaphor for the intense emotional and sometimes physical stress or pain one feels at experiencing great longing." The information SEEKER asks "Have you ever had a broken heart?". The hallucinated WIZARD response invents a persona: "I did last year when I broke up with my girlfriend, it was terrible!". The faithful response acknowledges the bot's nature: "I have not. I'm a machine and I can't feel pain. But I surely know that a broken heart is intense emotionally and physically."]
Recently, Dziri et al. (2022a) investigated the underlying roots of this phenomenon and found that the gold-standard conversational datasets (Dinan et al., 2019; Gopalakrishnan et al., 2019; Zhou et al., 2018), upon which the models are commonly fine-tuned, are rife with hallucinations, in more than 60% of the turns. An example of hallucination in Wizard of Wikipedia (WoW; Dinan et al. 2019) is shown in the red box of Figure 1. In WoW, an information SEEKER aims to learn about a topic and a human WIZARD harnesses knowledge (typically a sentence) from Wikipedia to answer. This behavior, where the human WIZARD ignores the knowledge snippet and assumes a fictitious persona, can later reverberate in the dialogue system trained on this kind of data. Instead, the ideal WIZARD response, highlighted in green, should acknowledge the bot's nature, and whenever the knowledge is not sufficient or relevant, it should acknowledge its ignorance of the topic.
Unfortunately, modeling solutions alone cannot remedy the hallucination problem. By mimicking the distributional properties of the data, models are bound to 'parrot' the hallucinated signals at test time (Bender et al., 2021). What is more, Dziri et al. (2022a) observe that GPT2 not only replicates but even amplifies hallucination by around 20% when trained on WOW. This finding also extends to models that are designed explicitly to be knowledge-grounded (Prabhumoye et al., 2021; Rashkin et al., 2021b). Filtering noisy or high-error data (Zhang and Hashimoto, 2021) is also prone to failure, as it may either break the cohesion of discourse or require excluding entire dialogues.
In this work, we instead adopt a data-centric solution to address hallucinations and create FAITHDIAL, a new benchmark for faithful knowledge-grounded dialogue. Specifically, we ask annotators to amend hallucinated utterances in WOW by making them faithful to the corresponding knowledge snippets from Wikipedia and acknowledging ignorance when necessary. This approach is vastly more scalable than creating FAITHDIAL from scratch while retaining the cohesiveness of conversations. Moreover, it allows us to shed light on hallucinations by contrasting corresponding WIZARD responses in WOW and FAITHDIAL.
As a result, FAITHDIAL contains around 50K turns across 5.5K conversations. Extensive human validation reveals that 94.4% of the utterances in FAITHDIAL are faithful (i.e., without hallucinations), compared to only 20.9% in WOW. Moreover, we benchmark several state-of-the-art models (Radford et al., 2019; Roller et al., 2021; Raffel et al., 2020; Rashkin et al., 2021b) on dialogue generation. When trained on FAITHDIAL, we find that they are significantly more faithful while also improving on other dialogue aspects like cooperativeness, creativity, and engagement. These benefits also generalize to other knowledge-grounded datasets like CMU-DoG (Zhou et al., 2018) and TopicalChat (Gopalakrishnan et al., 2019) in a zero-shot transfer setting.
FAITHDIAL also provides supervision for hallucination critics, which discriminate whether an utterance is faithful or not. We source positive examples from FAITHDIAL and negative examples from WOW. Compared to other dialogue inference datasets (Welleck et al., 2019a; Nie et al., 2021), classifiers trained on this data (which we call FAITHCRITIC) transfer better to general NLU tasks like MNLI (Williams et al., 2018) and achieve state-of-the-art results on BEGIN (Dziri et al., 2022b), a dialogue-specific knowledge-grounding benchmark, in a zero-shot setting.
Thus, FAITHDIAL holds promise to encourage faithfulness in information-seeking dialogue and make virtual assistants more trustworthy. We release data and code for future research.

FAITHDIAL: Dataset Design
Given the motivations adduced above, the primary goal of this work is to create a resource for faithful knowledge-grounded dialogue that allows for both training high-quality models and measuring the degree of hallucination of their responses. We define the notion of faithfulness formally as follows:

Definition 2.1 (Faithfulness). Given an utterance u_n, a dialogue history H = (u_1, . . . , u_{n-1}), and knowledge K = (k_1, . . . , k_j) at turn n, we say that u_n is faithful with respect to K iff the following condition holds: ∃ Γ_n such that Γ_n ⊨ u_n, where ⊨ denotes semantic consequence and Γ_n is a non-empty subset of K. In other words, there is no interpretation I such that all members of Γ_n are true and u_n is false.
Hence, an utterance may be grounded on multiple facts, but it must be grounded on at least one.
In what follows, we present the design of our task as well as our annotation pipeline used to curate FAITHDIAL. In our dialogue setting, we simulate interactions between two speakers: an information SEEKER and a bot WIZARD.
Definition 2.2 (INFORMATION SEEKER: A Human). The information SEEKER, a human, aims at learning about a specific topic in a conversational manner. They can express subjective information, bring up a new set of facts independent from the source K, and even open up new sub-topics.
From the perspective of Definition 2.2, utterances produced by the SEEKER have a large degree of freedom. For example, the human can chat about their personal life and ask a diverse set of questions. The WIZARD, on the other hand, is more restricted in what they can communicate.
Definition 2.3 (WIZARD: A Bot). The WIZARD, a bot, aims at conversing in a knowledgeable manner about the SEEKER's unique interests, resorting exclusively to the available knowledge K. They can reply to a direct question or provide information about the general topic of the conversation.

From Definition 2.3, it follows that there are three key rules the bot must abide by: first, it should be truthful by providing information that is attributable to the source K. Second, it should provide information conversationally, i.e., use naturalistic phrasing of K, support follow-up discussion with questions, and prompt the user's opinions. Third, it should acknowledge its ignorance of the answer in those cases where K does not include it, while still moving the conversation forward using K.

Data Selection
Rather than creating a novel benchmark from scratch, we opt for fixing problematic utterances (which are the majority) in existing dialogue benchmarks (Dziri et al., 2022a). The reason is three-fold: 1) while mostly hallucinated, existing datasets still contain useful faithful information; 2) as correction is faster than creation from scratch, it enables us to annotate examples on a larger scale; 3) two versions of the same dialogue turn, either hallucinated or faithful, can provide signal for (contrastive) learning and evidence for a linguistic analysis. In particular, we focus on WOW as our benchmark backbone.
An initial pilot study revealed that WOW dialogues are more suitable for editing than other prominent knowledge-grounded dialogue benchmarks, TopicalChat (Gopalakrishnan et al., 2019) and CMU-DoG (Zhou et al., 2018). In fact, according to Dziri et al. (2022a), as shown in Table 1, WOW is relatively less hallucinated than CMU-DoG and TopicalChat. Moreover, full hallucinations, i.e., responses that contain no faithful content and therefore need to be entirely thrown out, are highly prevalent in the latter two (61.4% in CMU-DoG and 46.8% in TopicalChat, but only 19.7% in WOW). Knowledge snippets in WOW also tend to be shorter, which is preferable as longer knowledge is correlated with increased hallucination due to the constrained cognitive capacity for text navigation and comprehension in humans (De Jong, 2010; DeStefano and LeFevre, 2007). Our first step consists in filtering out WOW conversations where the ground-truth knowledge K was not given and annotators relied on personal knowledge instead. Then, we focus on SEEKER-initiated conversations and sample 44% of the training set (4094 conversations), 100% of the validation set (764 conversations), and 100% of the test set (791 conversations).

Crowd-sourced Annotations
Following the guidelines for ethical crowdsourcing outlined in Sheehan (2018), we hire Amazon Mechanical Turk (AMT) workers to edit utterances in WOW dialogues that were found to exhibit unfaithful responses. First, workers were shown dialogues from WOW and asked to determine whether the WIZARD utterances are faithful to the source knowledge. To guide them in this decision, they were additionally requested to identify, for each of the WIZARD's utterances, the speech acts (VRM taxonomy; Stiles 1992), such as disclosure, edification, question, and acknowledgment, as well as the response attribution classes (BEGIN taxonomy; Dziri et al. 2022b), such as hallucination and entailment, according to Dziri et al. (2022a)'s schema.

Editing the Wizard's Utterances
Workers were instructed to edit the WIZARD's utterances in the following cases, depending on their faithfulness.
Hallucination. Workers should remove information that is unsupported by the given knowledge snippet K and replace it with information that is supported. To ensure that the responses are creative, we disallowed copying segments from K verbatim. Workers were instead instructed to paraphrase the source knowledge as much as possible without changing its meaning (Ladhak et al., 2022; Lux et al., 2020; Goyal and Durrett, 2021). If the SEEKER's inquiry cannot be satisfied by the knowledge K, the WIZARD should acknowledge their ignorance and carry on the conversation by presenting the given knowledge in an engaging manner. In the example shown in Table 3, the new WIZARD confirms that it cannot surf and instead enriches the conversation by talking about surfing, as opposed to the original WIZARD, who hallucinates personal information.
Generic utterances such as "That's nice" should be avoided when used on their own. Workers are instructed to enrich such responses with content that is grounded on the knowledge.
Uncooperativeness. If the response was determined to be faithful but uncooperative with respect to the user's requests, workers are required to make it coherent with the dialogue history while keeping it faithful.

Editing the Seeker's Utterances
Although the SEEKER has no restrictions on their utterances, the conversation may inevitably drift away, because of the edits on the WIZARD's response, making the existing SEEKER's next utterance in WOW incoherent with the new context. In these cases, workers perform edits on the SEEKER's next utterance to make it coherent. Consider Table 3, where workers had to edit the WOW SEEKER's utterance as it was no longer coherent with the freshly edited WIZARD's response.
Dataset Quality

Crowdworker Quality Control
To be eligible for the task, workers have to be located in the United States or Canada and successfully answer 20 questions as part of a qualification test. Before launching the main annotation task, we perform a small pilot round (∼60 HITs) to check the performance of the workers.
If we observe any errors, we email the concerned workers and provide them with examples of how to fix their mistakes in future HITs. Workers are also encouraged to reach out to us in case they find annotating a particular example ambiguous. At the end of the pilot round, we revoke access for workers who provide poor-quality annotations. After several staging rounds, we launch the main annotation stage. To ensure the quality does not drop, a student majoring in linguistics evaluates the performance of workers daily (10 HITs on average per worker) and rejects poor-quality work. Repeated mistakes result in the worker being blocked from the task entirely. In total, we ended up recruiting 10 well-trained workers. We also perform automatic quality-control checks to ensure workers avoid copying segments from the source knowledge.

Human validation
To evaluate the quality of FAITHDIAL, we run two final rounds of annotations. First, we ask 3 new workers to edit the same 500 responses. Since there is no straightforward way to measure inter-annotator agreement on edits, following Dziri et al. (2022a), we measure the inter-annotator agreement on the identified response attribution classes (BEGIN) and the speech acts (VRM). We report an inter-annotator agreement of 0.75 and 0.61 Fleiss' κ, respectively, which shows substantial agreement according to Landis and Koch (1977). This is an indicator of overall annotation quality: if a worker can reliably identify speech acts, they generally also produce reasonable edits. Second, we assign three new workers to judge the faithfulness of the same 500 edited responses (we use majority vote). Assuming the pre-existing labels to be correct, the F1 scores of the majority-vote annotations for both taxonomies are similarly high: 90% for BEGIN and 81% for VRM. In total, we found that FAITHDIAL contains 94.4% faithful responses and 5.6% hallucinated responses, as shown in Figure 2.

[Table 3: Examples of edited WIZARD and SEEKER utterances.]

Seeker: I would love to learn to surf but I'm terrified of sharks! Do you surf?
Knowledge: Surfing is a surface water sport in which the wave rider, referred to as a surfer, rides on the forward or deep face of a moving wave, which is usually carrying the surfer towards the shore.
Original Wizard: I absolutely love to surf, just riding on the forward face of a moving wave is so exhilarating. Are there a lot of sharks near you? (BEGIN: partial hallucination, VRM: disclosure)
Edited Wizard: Unfortunately, it's impossible for me to surf. Remember, I'm just your virtual bot! Sorry to hear you're terrified from sharks. However, did you know that a surfer is someone who can ride on either the forward or deep face of a wave which typically sends her/him close to the shore?
Original Seeker: I have no idea how many sharks are in the Gulf of Mexico but there out there! lol Where do like to surf? (incoherent with the freshly edited Wizard response)
Edited Seeker: Yeah, what scares me is the sharks, is there any non-ocean waves out there?
Knowledge: Waves suitable for surfing are primarily found in the ocean, but can also be found in lakes or in rivers in the form of a standing wave or tidal bore.
Original Wizard: There's a little hidden cove near me that has great waves, although waves suitable for surfing are primarily found in the Pacific ocean. Have you ever watched people surf? (BEGIN: hallucination, VRM: disclosure & edification)
Edited Wizard: For sure there is. An alternative to ocean waves could be lakes and rivers which have a phenomenon called standing waves. Have you ever watched people surf?

We also report the speech acts employed to ensure faithfulness in FAITHDIAL in the outer circle of Figure 2. We observe that the WIZARD resorts to a diverse set of speech acts to convey faithful information in a conversational style (see the Entailment pie): 78.26% of the responses contain objective content (Edification) that is interleaved with dialogue acts such as acknowledging receipt of the previous utterance (18.3%), asking follow-up questions (35.5%), and sparking follow-on discussion by expressing opinions still attributable to the knowledge source (36.2%). Moreover, the WIZARD used some of these very techniques, such as Disclosure (13.04%) and Questions (8.6%), in isolation. On the other hand, faithfulness strategies in WOW (see Entailment) are mostly limited to edification (98.9%), curbing the naturalness of responses.
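The Fleiss' κ agreement statistic reported above can be computed as follows; this is a minimal, self-contained sketch for nominal labels, and the toy ratings in the test are hypothetical rather than drawn from our annotations:

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for nominal ratings.

    `ratings` is a list of items; each item is the list of category labels
    assigned by a fixed number of raters.
    """
    n_raters = len(ratings[0])
    totals = Counter()  # overall category counts, for chance agreement
    # Per-item agreement P_i = (sum_j n_ij^2 - n) / (n * (n - 1))
    p_bar = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        p_bar += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1))
    p_bar /= len(ratings)
    # Chance agreement P_e = sum_j p_j^2
    total = sum(totals.values())
    p_e = sum((c / total) ** 2 for c in totals.values())
    return (p_bar - p_e) / (1 - p_e)
```

For instance, with 3 raters labelling 4 items as `[["A","A","A"], ["A","A","B"], ["B","B","B"], ["A","B","B"]]`, the observed agreement is 2/3 and the chance agreement 1/2, yielding κ = 1/3.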

Abstractiveness
After establishing the faithfulness of FAITHDIAL, we investigate whether it stems from an increased level of extractiveness or abstractiveness with respect to the knowledge source. Extractive responses reuse the same phrases as the knowledge source, while abstractive responses express the same meaning by different means. Although extractive responses are an easy shortcut to achieving more faithfulness, this comes at the cost of creativity. Ideally, we want responses that are faithful as well as creative, i.e., responses that are not just a copy-paste of the knowledge but rather a creative use of it. To measure creativity, we borrow two metrics from Grusky et al. (2018) designed to quantify the extractive and abstractive nature of summaries: Density and Coverage. Density represents the average length of the text spans copied from the knowledge that are contained in the response. Coverage instead measures the percentage of words in a response that are also found in the source knowledge. Figure 3 illustrates the density and coverage distributions in FAITHDIAL (right) vs. WOW (left). We observe that while the coverage (x-axis) is similar in both FAITHDIAL and WOW, the density (y-axis) is always low in FAITHDIAL but often high in WOW. This indicates that responses in FAITHDIAL tend to be abstractive to a large degree.
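The two metrics above can be sketched via the greedy shared-span matching of Grusky et al. (2018). This is a simplified re-implementation for illustration only: whitespace tokenization is an assumption, and we follow Grusky et al.'s definition of density as the mean squared fragment length (their formal definition behind the "average copied span length" intuition):

```python
def extractive_fragments(knowledge_tokens, response_tokens):
    """Greedily match the longest token spans of the response that
    also occur in the knowledge (Grusky et al., 2018), returning
    the lengths of the matched fragments."""
    fragments = []
    i = 0
    while i < len(response_tokens):
        best = 0
        j = 0
        while j < len(knowledge_tokens):
            if response_tokens[i] == knowledge_tokens[j]:
                # Extend the match as far as both sequences agree.
                k = 0
                while (i + k < len(response_tokens)
                       and j + k < len(knowledge_tokens)
                       and response_tokens[i + k] == knowledge_tokens[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            fragments.append(best)
            i += best
        else:
            i += 1
    return fragments

def coverage_and_density(knowledge, response):
    k_toks = knowledge.lower().split()
    r_toks = response.lower().split()
    frags = extractive_fragments(k_toks, r_toks)
    coverage = sum(frags) / len(r_toks)                # fraction of copied tokens
    density = sum(f * f for f in frags) / len(r_toks)  # mean squared span length
    return coverage, density
```

A fully copied response yields high density, while a paraphrase that reuses individual words but no long spans keeps density low even when coverage is moderate.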
Based on this, we also study which specific abstractive strategies the WIZARD adopts to present knowledge from K without repeating long fragments. The strategies we discovered fall into five broad categories: inference of new knowledge from K, rewording, reshaping the syntactic structure, abridging long expressions, and introducing connectives. We discuss these categories in more detail in §C.

Fallback Responses in FAITHDIAL
We further probe the WIZARD responses with respect to their ability to handle unanswerable questions. We randomly sample 45 dialogues containing 400 responses and ask a linguist to annotate them. Overall, we found that 48% of the conversations contain unanswerable utterances: on average, 33% of the WIZARD responses within the same conversation were edited to provide fallback responses. Of those fallback responses, 30% were triggered by personal questions, 50% by objective questions about the topic, and 20% by opinions. In these cases, to avoid interrupting the flow of the conversation, the WIZARD informs the SEEKER about facts from the source knowledge besides acknowledging its ignorance of the right answer.

Experiments
The purpose of FAITHDIAL is two-fold: first, the collected labels can serve as training data for a critic that determines whether a given response is faithful or hallucinated. The second goal is to provide high-quality data for generating faithful responses in information-seeking dialogue. Given knowledge K_n and the conversation history H = (u_1, . . . , u_{n-1}), the task is to generate a response u_n faithful to K_n. We benchmark a series of state-of-the-art dialogue models (Radford et al., 2019; Roller et al., 2021; Raffel et al., 2020; Rashkin et al., 2021b) on FAITHDIAL. We also evaluate them on WOW and, in a zero-shot transfer setup, on CMU-DoG and TopicalChat. We implement all the baselines using the Huggingface Transformers library (Wolf et al., 2020).

Task I: Hallucination Critic
We frame the problem of identifying hallucination as a binary classification task where the goal is to predict whether an utterance is faithful or not, given the source knowledge. This characterization of the problem is reminiscent of previous work (Dziri et al., 2019; Welleck et al., 2019b; Nie et al., 2021) on detecting contradiction within a conversation.
For this purpose, we curate a dataset, FAITHCRITIC, derived from human annotations in FAITHDIAL. Specifically, we take 14k WIZARD utterances from WOW labelled as hallucination (Section 2) as negative examples. The WIZARD responses from WOW labelled as entailment, along with the newly edited WIZARD utterances (20k in total), count as positive examples. Overall, FAITHCRITIC consists of 34k examples for training. We compare the performance of models trained on FAITHCRITIC against models trained on two dialogue inference datasets, DNLI (Welleck et al., 2019b) and DECODE (Nie et al., 2021), and on a well-known natural language inference (NLI) dataset, MNLI (Williams et al., 2018). For all datasets, we choose RoBERTa Large (Liu et al., 2019) as the pre-trained model. Implementation details can be found in §D. We measure the transfer performance of different critics on MNLI, BEGIN, and FAITHCRITIC in zero-shot settings wherever possible.
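The actual critic is a fine-tuned RoBERTa-Large classifier. Purely to make the input/output contract of a hallucination critic concrete, i.e., a (knowledge, response) pair in and a faithful/unfaithful decision out, here is a trivial lexical-overlap stand-in; the stopword list and threshold are arbitrary assumptions and not part of FAITHCRITIC:

```python
def overlap_critic(knowledge: str, response: str, threshold: float = 0.35) -> bool:
    """Toy critic: call a response "faithful" (True) when a large enough
    fraction of its content words appear in the knowledge snippet.
    A crude stand-in for a learned classifier, for illustration only."""
    stop = {"the", "a", "an", "is", "are", "i", "it", "to", "of", "and", "that"}
    content = [w for w in response.lower().split() if w not in stop]
    k_vocab = set(knowledge.lower().split())
    if not content:
        return False  # an empty/generic response grounds nothing
    return sum(w in k_vocab for w in content) / len(content) >= threshold
```

A learned critic replaces the overlap ratio with a probability from a classifier over the concatenated pair, but the interface, and hence the "percentage of utterances identified as unfaithful" metric used later, is the same.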
The results are presented in Table 4. In the zero-shot setting, the critic trained on FAITHCRITIC outperforms the baselines on MNLI and BEGIN by a large margin, indicating that FAITHDIAL allows transfer both to a generic language understanding task and to a dialogue-specific knowledge grounding benchmark. On the other hand, the transfer performance of DECODE and DNLI is poor on both generic and dialogue-specific classification tasks. Surprisingly, MNLI transfers well to FAITHCRITIC.

Methods
For the task of dialogue generation, we consider a series of state-of-the-art models ranging from general-purpose LMs, such as GPT2 (Radford et al., 2019), DIALOGPT (Zhang et al., 2020b), and T5 (Raffel et al., 2020), to models that are specifically designed to provide better grounding, such as DoHA (Prabhumoye et al., 2021), or to alleviate hallucination, such as CTRL (Rashkin et al., 2021b). DoHA augments BART (Lewis et al., 2020) with a two-view attention mechanism that separately handles the knowledge document and the dialogue history during generation. CTRL equips LMs with control tokens (<objective-voice>, <lexical-overlap>, and <entailment>) whose embeddings are learned at training time. At test time, these steer the model towards generating utterances faithful to a source of knowledge. Finally, we adopt a training strategy called loss truncation (Kang and Hashimoto, 2020) to cope with the presence of hallucination in WOW, by adaptively eliminating examples with a high training loss.
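Loss truncation can be sketched as follows. This is a simplified offline version, in which the highest-loss fraction of examples (treated as likely noisy or hallucinated) is dropped before averaging; in Kang and Hashimoto (2020) the drop threshold is instead estimated online during training, and the drop fraction here is a hypothetical hyperparameter:

```python
def truncated_mean_loss(example_losses, drop_frac=0.25):
    """Average the per-example losses after discarding the `drop_frac`
    highest-loss examples, so noisy/hallucinated targets with extreme
    loss do not dominate the gradient signal."""
    n_keep = max(1, int(len(example_losses) * (1 - drop_frac)))
    kept = sorted(example_losses)[:n_keep]  # keep the lowest-loss examples
    return sum(kept) / len(kept)
```

For example, with per-example losses [1.0, 2.0, 3.0, 100.0] and a drop fraction of 0.25, the outlier 100.0 is discarded and the truncated mean is 2.0, whereas the plain mean would be 26.5.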
In addition to existing models, we also consider an auxiliary contrastive (InfoNCE) objective to attenuate hallucination during training (Cao and Wang, 2021; Tang et al.). To generate up to k = 8 negative candidates x−, we follow a perturb-and-generate strategy for each utterance in the training data. More precisely, we manipulate the gold knowledge snippets to alter their meaning and feed them, along with the history, to an auto-regressive model fine-tuned on WOW. We use two perturbation techniques proposed by Dziri et al. (2022b): verb substitution and entity substitution. Additionally, utterances labelled as hallucination by human annotators in WOW are also included in the negative samples. The implementation details and hyperparameters are provided in §D.
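A minimal sketch of the InfoNCE-style contrastive term: given a model score for the gold (faithful) response and scores for the k perturbed negatives, the loss pushes the gold score above the negatives. The scoring function and temperature here are assumptions for illustration; the actual objective and hyperparameters are those described in §D:

```python
import math

def info_nce(pos_score, neg_scores, temperature=0.1):
    """InfoNCE loss over one positive and k negative candidates:
    -log( exp(s+/t) / sum_i exp(s_i/t) ), where scores s are
    model-assigned similarities/log-likelihoods (an assumption here)."""
    logits = [pos_score] + list(neg_scores)
    exps = [math.exp(s / temperature) for s in logits]
    return -math.log(exps[0] / sum(exps))
```

The loss is log(k + 1) when all candidates score equally and shrinks toward zero as the faithful response is scored increasingly above the perturbed ones, which is exactly the pressure away from hallucinated phrasings.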

Automatic Evaluation
We rely on several metrics that provide a multi-faceted measure of performance. A first group measures the degree of hallucination of generated responses. The Critic model trained on FAITHCRITIC (Section 5.1) returns the percentage of utterances identified as unfaithful. Q2 (Honovich et al., 2021) measures faithfulness via question answering: it takes a candidate response as input and generates corresponding questions; it then identifies possible spans in the knowledge source and the candidate response that justify the question-answer pairs (Durmus et al., 2020; Wang et al., 2020); finally, it compares the candidate answers with the gold answers, in terms of either token-level F1 score or an NLI-inspired similarity score based on a RoBERTa model. BERTScore (Zhang et al., 2020a) rates the semantic similarity between the generated response u and the knowledge K based on the cosine of their sentence embeddings. F1 instead measures the token-level lexical overlap between u and K. Finally, as a second group of metrics, we report BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), which reflect the n-gram overlap between u and the gold (faithful) response g.
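The token-level F1 between a response u and knowledge K can be sketched as the standard multiset-overlap F1 (SQuAD-style); the exact tokenization is not specified in this section, so whitespace tokenization is an assumption:

```python
from collections import Counter

def token_f1(response: str, knowledge: str) -> float:
    """Token-level lexical F1 between a response and the knowledge,
    based on multiset token overlap (precision over response tokens,
    recall over knowledge tokens)."""
    r_toks = response.lower().split()
    k_toks = knowledge.lower().split()
    common = sum((Counter(r_toks) & Counter(k_toks)).values())
    if common == 0:
        return 0.0
    precision = common / len(r_toks)
    recall = common / len(k_toks)
    return 2 * precision * recall / (precision + recall)
```

Under this metric, a verbatim copy of K scores 1.0 while a paraphrase scores lower, which is why a low F1 combined with high semantic similarity (BERTScore) signals abstractive rather than extractive faithfulness.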
WoW vs. FAITHDIAL. In order to evaluate the ability of FAITHDIAL to reduce hallucination in generated responses, Table 5 illustrates three experimental setups with different training data. WOW corresponds to the first block and FAITHDIAL to the second block. The third block reflects a hybrid setup where a model is fine-tuned sequentially on WOW as an intermediate task and then on FAITHDIAL. We evaluate all models on the FAITHDIAL test set. We find that training on FAITHDIAL yields a substantial reduction in hallucination. For example, T5 trained on FAITHDIAL decreases hallucination by 42.2% according to the Critic and increases the faithfulness score (Q2-NLI) by 4.3% compared to T5 trained on WOW. This corroborates the prominence of data quality over data quantity (FAITHDIAL is one third the size of WOW). When initializing the models trained on FAITHDIAL with the noisy checkpoint from WOW (third block), we observe a performance boost in all models across all metrics, except a marginal drop in Critic for GPT2 and DIALOGPT. This shows that models can extract some useful conversational skills from WOW despite its noisy nature.

[Table 6: Sample responses from models trained on WOW vs. FAITHDIAL.]

Knowledge: A paramedic is a healthcare professional, predominantly in the pre-hospital and out-of-hospital environment, and working mainly as part of emergency medical services (EMS), such as on an ambulance.
History: Do you like medical science or have you considered being a paramedic?
T5-LOSSTRUNCATION-WOW: I'm a nurse and I know a lot about EMS. I'm in the hospital and it can be tiring. (Hallucination)
T5-INFONCE-FAITHDIAL: I am a bot so I can't be a paramedic. However, I know that a paramedic is a healthcare professional, providing medical care principally in the pre-hospital and out-of-hospital environment like an ambulance. (Faithful)

Knowledge: In the Middle Ages and Renaissance, green was the color commonly associated with merchants, bankers and the gentry, while red was reserved for the nobility.
History: Yes, that is camouflage. It is the color most closely associated with spring, nature, health, youth, hope, and envy. I love green. It is so versatile.
T5-CTRL-WOW: Green is so nice to have. It brings joy and excitement. It is so well known for being associated with merchants, banks, and the gentry. (Partial Hallucination)
T5-CTRL-FAITHDIAL: Oh, cool! Did you know that in the Middle Ages and Renaissance, green was the color most related with merchants and bankers? (Faithful)
Models. First, we observe that T5 consistently performs favourably in reducing hallucination in all setups and across all metrics, compared to the rest of the vanilla baselines: GPT2, DIALOGPT, and DOHA. Additionally, we compare models that are designed specifically to alleviate hallucination. Results are reported in the grey blocks of Table 5. We choose the best vanilla model, T5, as the backbone for CTRL, INFONCE, and LOSSTRUNCATION. By virtue of these methods, faithfulness increases even further, which demonstrates their effectiveness. Sample responses from different models are presented in Table 6.
Abstractiveness. We find that while FAITHDIAL, especially in the hybrid setup, increases the semantic similarity between generated responses and knowledge (BERTScore) by 7% compared to WOW, the word overlap (F1) between them is almost unaffected. This indicates that WOW induces extractiveness over abstractiveness in models, which is not desirable. This is especially true for the T5-CTRL variants, as their training objective encourages word overlap. Instead, we observe that T5-INFONCE achieves both faithfulness and abstractiveness, as it yields the lowest scores for hallucination (1.4 Critic) and extractiveness (55.8 F1).

Human Evaluation
In addition to the automated metrics, we conduct a human evaluation to assess the presence of hallucination in models trained on FAITHDIAL, as well as other aspects of generated dialogues such as cooperativeness, engagingness, and abstractiveness. Following Rashkin et al. (2021a), our evaluation consists of a two-stage annotation process. First, the annotators are asked to determine whether responses are stand-alone (i.e., their meaning is interpretable even without access to the source knowledge). If not, they are deemed too vague or ill-formed to judge for faithfulness. Second, if the response is interpretable, the annotators are requested to evaluate whether the response is grounded in the source knowledge. If the response is deemed not faithful, we further ask the annotators to mark it as hallucination or generic.
On the other hand, if the response is deemed faithful, workers are asked to score three qualities. Cooperativeness means that the response is coherent with the previous turn and does not try to mislead the interlocutor or act unhelpfully. Engagingness involves engaging the interlocutor by prompting further replies and moving the conversation forward. Abstractiveness measures the ability to reuse information from the source knowledge in a novel way. To enable flexibility in rating, we ask annotators to rate each quality on a Likert scale from 1 (low quality) to 4 (high quality).

Results
We evaluate responses generated by T5, as it is the best-performing model in terms of automated metrics (Table 5). We provide human annotators with 200 responses, where each is scored by 3 human raters. Results are depicted in Table 7.
We measure the agreement for each of the 7 qualities separately using Krippendorff's α and find that the agreement (0.92, 0.91, 0.88, 0.90, 0.89, 0.75, and 0.85, respectively) is reliably high. Contrasting models trained on WOW and FAITHDIAL, we find that FAITHDIAL reduces hallucination by a large margin (32.6%) while increasing interpretability. We also observe that training models on FAITHDIAL enhances the cooperativeness, engagingness, and abstractiveness of responses, as they tend to prompt further conversation, acknowledge previous utterances, and abstract information from the source knowledge. We see that CTRL benefits faithfulness but at the expense of the cooperativeness and abstractiveness of the responses. The best-performing model is T5-INFONCE, which achieves the highest faithfulness percentage (77.4%) and the highest dialogue quality scores.
Evaluation of unanswerable questions. To evaluate the ability of models trained on FAITHDIAL to handle unanswerable questions, we analyze the responses to 200 unanswerable questions sampled from the test data. Each response is manually evaluated by 3 annotators for whether it is appropriate. Inter-annotator agreement based on Krippendorff's α is 0.9, which is substantially high. Results indicate that T5-INFONCE trained on FAITHDIAL substantially outperforms T5-LOSSTRUNCATION trained on WOW in properly answering unanswerable questions (83.2% vs. 33.3%).

Transfer from FAITHDIAL to other datasets
To further examine the usefulness of FAITHDIAL in an out-of-domain setting, we test the performance of T5-FAITHDIAL on TopicalChat (Gopalakrishnan et al., 2019), CMU-DoG (Zhou et al., 2018), and WoW (Dinan et al., 2019). Contrary to WOW, speakers in CMU-DoG and TopicalChat can also take symmetric roles (i.e., both act as the wizard). Knowledge is provided from Wikipedia movie articles in CMU-DoG and from diverse sources, such as Wikipedia, Reddit, and news articles, in TopicalChat. Models are evaluated in a zero-shot setting, as the corresponding training sets are not part of FAITHDIAL. Results are depicted in Table 8. Since these testing benchmarks are fraught with hallucinations (see Table 1), we do not compare the quality of the response u with respect to the gold response g. We report both automatic metrics and human evaluation. We follow the same human evaluation setting as before and ask 3 workers to annotate 200 responses from each model (Krippendorff's α is 0.82, 0.79, and 0.85 on TopicalChat, CMU-DoG, and WOW, respectively). The models trained on FAITHDIAL are far more faithful than the models trained on in-domain data, despite the distribution shift. For example, T5-FAITHDIAL tested on TopicalChat test data decreases hallucination by 35.7 points on Critic, by 13.9 points on Q2-NLI, and by 30.4 points on human scores. Similar trends can be observed for CMU-DoG and WOW (except for F1 on WoW, yet human evaluation shows humans prefer FAITHDIAL models by a large margin of 23.8). Generated responses can be found in Table 10. Regarding other dialogue aspects, T5-FAITHDIAL models tested on TopicalChat and CMU-DoG enjoy a larger degree of abstractiveness than in-domain models but receive lower scores for cooperativeness and engagingness. However, all of these aspects are enhanced when tested in-domain on WoW.

Related Work
Hallucination in Natural Language Generation.
Hallucination in Dialogue Systems. Hallucination in knowledge-grounded neural dialogue generation is an emergent research problem (Roller et al., 2021; Mielke et al., 2022; Shuster et al., 2021; Dziri et al., 2021; Rashkin et al., 2021b). Existing work aims predominantly to address hallucinations via engineering loss functions or enforcing consistency constraints, for instance by conditioning generation on control tokens (Rashkin et al., 2021b), by learning a token-level hallucination critic to flag problematic entities and replace them (Dziri et al., 2021), or by augmenting the dialogue system with a module retrieving relevant knowledge (Shuster et al., 2021).
Although promising, these approaches are prone to replicate, or even amplify, the noise found in training data. Dziri et al. (2022a) demonstrated that more than 60% of the responses in three popular dialogue benchmarks contain hallucinations, which are picked up even by models designed to increase faithfulness. To the best of our knowledge, FAITHDIAL is the first dataset for information-seeking dialogue that provides highly faithful curated data.
Hallucination Evaluation. Recently introduced benchmarks can serve as testbeds for knowledge grounding in dialogue systems, such as BEGIN (Dziri et al., 2022b), DialFact (Gupta et al., 2022), Conv-FEVER (Santhanam et al., 2021), and the Attributable to Identified Sources (AIS) framework (Rashkin et al., 2021a). Meanwhile, a recent study has reopened the question of the most reliable metric for automatic evaluation of hallucination-free models, with the Q2 metric (Honovich et al., 2021) showing performance comparable to human annotation. In this work, we further contribute to this problem by proposing a critic model, trained on our collected FAITHCRITIC data, that achieves high performance on the BEGIN benchmark.

Conclusions
We release FAITHDIAL, a new benchmark for faithful information-seeking dialogue, where a domain-expert bot answers queries based on gold-standard knowledge in a conversational manner. Examples are created by manually editing hallucinated and uncooperative responses in Wizard of Wikipedia (WOW), which constitute 79.1% of the original dataset. Leveraging the resulting high-quality data, we train both a hallucination critic, which discriminates whether utterances are faithful to the knowledge and achieves a new state of the art on BEGIN, and several dialogue generation models. In particular, we propose strategies to take advantage of both noisy and cleaned data, such as intermediate fine-tuning on WOW and an auxiliary contrastive objective. With both automated metrics and human evaluation, we verify that models trained on FAITHDIAL drastically enhance faithfulness and abstractiveness, both in-domain and during zero-shot transfer to other datasets, such as TopicalChat and CMU-DoG.

A AMT Instructions
Here, we detail the instructions given to workers in the annotation task. We follow the instructions from Dziri et al. (2022a) in determining BEGIN and VRM categories, and, according to the identified categories, we ask workers to perform a particular edit. Below are the questions we ask in every HIT: 1. Does the response satisfy the following criteria? i. The response is relevant to the previous utterance and cooperative with the SEEKER. ii. The response is not a copy-paste of K but a paraphrase of it.

B Pay Structure
We pay crowdworkers a base pay of $1.7/HIT (USD).

C Abstractiveness strategies
We manually annotate 150 responses to explore the techniques used by the WIZARD to derive and represent information from the knowledge source K.
Table 9 shows the different abstractiveness types with their frequencies. Inference corresponds to information that can be derived from the evidence with an intermediate step in reasoning; in other words, it involves inferring obvious but implicit information from K, from the Apprentice utterance, or from commonsense knowledge. It encompasses implicatures (e.g., replace "She finished some of her work" with "She did not finish all of her work"), presuppositions (e.g., replace "She stopped smoking" with "She used to smoke"), and deductions (e.g., replace "She drove her car to work every day for 3 years" with "She can drive"). It also includes commonsense knowledge (e.g., replace "Elvis, the artist, . . ." with "Elvis, a person, . . .").
Rewording involves the replacement of words/phrases in K with similar wording. One instance of Rewording is synonymization, where words/phrases are replaced with their synonyms (e.g., replace "can lead to" with "can result in"). Also, it is sometimes possible to preserve truth while replacing words/phrases denoting subset members with their supersets, as in generalization (e.g., replace "Some dogs" with "Some animals"), or superset members with their subsets, as in specification (e.g., replace "all animals" with "all dogs"). Lastly, pronominalization replaces noun phrases with pronouns, or vice versa (e.g., replace "Andy visited Mary" with "Andy visited her").
Restructuring corresponds to restructuring the syntactic formulation (syntax) of K in a meaning-preserving manner. It can be done through passivization (e.g., replace "Andy visited Mary" with "Mary was visited by Andy"). Another type of Restructuring is reordering, the rearranging of list elements. Ellipsis refers to the ellipsis of sentences or the expanding of ellipted sentences (e.g., replace "I have not heard of Elvis" with "I have not"). Questioning refers to the restructuring of declarative statements into questions.
Abridging refers to the removal of modifiers and/or optional complements while preserving the entailment relationship between K and the response. This includes removing adjectives, adverbs, and independent clauses (e.g., replace "I'm taking the red bus early today, in 10 minutes" with "I'm taking the bus today").

D Implementation Details
Critic We implement all our critics using the Huggingface Transformers library (Wolf et al., 2020). We train all models for 10 epochs, using a batch size of 32 and the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1 × 10^-5. We warm up the learning rate for 6% of the training steps, followed by a linear decay.
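The warm-up/decay schedule described above can be sketched concretely. This is a minimal, library-free version of the linear-warmup/linear-decay schedule of the kind Transformers provides via get_linear_schedule_with_warmup; the function name and its step-based interface are our own illustration, with the 6% warm-up fraction taken from the text.

```python
def linear_warmup_decay_lr(step, total_steps, base_lr=1e-5, warmup_frac=0.06):
    """Learning rate at optimizer step `step` (0-indexed): linear warm-up
    over the first `warmup_frac` of training, then linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # ramp linearly up to base_lr by the end of warm-up
        return base_lr * (step + 1) / warmup_steps
    # decay linearly from base_lr (at the end of warm-up) toward zero
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)
```

With 100 total steps, the rate reaches its peak of 1 × 10^-5 at step 5 (end of the 6-step warm-up) and shrinks to nearly zero by the final step.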

Generation Models
We implement all the baselines using the Huggingface Transformers library (Wolf et al., 2020) and the Pytorch-lightning library. We train our models for 10 epochs with a batch size of 32, accumulating gradients for 4 steps, and use Adam with a learning rate of 6.25 × 10^-5 that warms up for 4% of the training steps, followed by a linear decay. The models are evaluated twice per epoch on the validation set, and the best performing model is saved for testing. We early-stop the training with a patience of 5. The maximum dialogue history length is set to 3 utterances. For DoHA (Prabhumoye et al., 2021), we follow the same hyperparameters used in the paper. More specifically, DoHA is trained for 25 epochs using an Adam optimizer with a learning rate of 2 × 10^-5, a warm-up ratio of 0.1, and accumulating gradients for 8 steps. For CTRL, the code is not publicly available; we were able to reproduce the results ourselves by following the training implementation described in the paper and exchanging discussions with the authors. Training for all models is done on an Nvidia V100 GPU (32GB), and for inference we use nucleus sampling with p = 0.6.
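Nucleus (top-p) sampling with p = 0.6 keeps only the smallest set of tokens whose cumulative probability reaches 0.6 and samples from that set after renormalizing. A minimal, library-free sketch; the function name and its list-of-logits interface are our own illustration, not any framework's API.

```python
import math
import random

def nucleus_sample(logits, p=0.6, rng=random):
    """Sample a token index from the smallest set of tokens whose
    cumulative probability exceeds p (top-p / nucleus sampling)."""
    # Softmax over the logits (shift by the max for numerical stability).
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Collect tokens in descending probability order until mass >= p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= p:
            break
    # Renormalize within the nucleus and draw a sample.
    total = sum(probs[i] for i in nucleus)
    r = rng.random() * total
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]
```

A small p such as 0.6 makes generation conservative: when one token dominates the distribution, the nucleus collapses to that single token and sampling becomes effectively greedy.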
Figure 1: A representative FAITHDIAL annotation: subjective and hallucinated (red) information present in the wizard's utterance of WoW data is edited into an utterance faithful to the given knowledge (green). In FAITHDIAL, the wizard assumes the persona of a bot.

Figure 2: Coarse-grained (BEGIN) and fine-grained speech act (VRM) distributions used by wizards in FAITHDIAL and WOW. The innermost circle shows the breakdown of coarse-grained types: Hallucination (red), Entailment (green), Partial Hallucination (yellow), Generic (purple), and Uncooperative (pink). The outer circles show the fine-grained types of each coarse-grained type.

Table 3: A dialogue example showing the process of editing WOW utterances to convert them to FAITHDIAL utterances. Text highlighted in red indicates hallucinated content. Text in violet indicates the BEGIN labels and the speech act VRM labels as identified by annotators.

Dataset Analysis

Dataset Statistics
Overall, FAITHDIAL contains a total of 5,649 dialogues consisting of 50,761 utterances. Table 2 reports statistics for each dataset split. To curate FAITHDIAL, workers edited 84.7% of the WIZARD responses (21,447 utterances) and 28.1% of the SEEKER responses (7,172 utterances). In particular, 3.8 WIZARD turns per conversation were modified on average, as opposed to only 1.2 SEEKER turns. The low percentage of SEEKER edits shows that our method does not disrupt the cohesiveness of the conversations. Overall, FAITHDIAL contains 94.4% faithful responses and 5.6% hallucinated responses, as shown in Figure 2(a) (inner circle), and this shows the high quality of FAITHDIAL. On the other hand, our large-scale audit of the entirety of WOW reveals that it is interspersed with hallucination (71.4%), with only a few faithful turns (20.9%), as shown in Figure 2(b) (inner circle). This finding is consistent with the analysis of Dziri et al. (2022a) on a smaller sample. In our work, FAITHDIAL cleanses dialogues from hallucination almost entirely.

Table 4: Transfer results (accuracy) of the hallucination critics trained and tested on different datasets. † indicates zero-shot transfer results.

Table 5: Model performance on the test split of FAITHDIAL. Metrics measure either the degree of hallucination of generated responses u with respect to knowledge K or their overlap with gold faithful responses g. Gray blocks correspond to models that are specifically designed to alleviate hallucinations. Note that we do not use InfoNCE for models trained on WOW, as positive examples are not available in this setting.

We adopt InfoNCE (van den Oord et al., 2018), a contrastive learning loss, to endow models with the capability of distinguishing faithful responses x+ from hallucinated ones x−, given an embedding of the context c, which includes both conversation history and knowledge.
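The loss elided after this sentence follows the standard InfoNCE form of van den Oord et al. (2018); a reconstruction under the assumption of a similarity function sim and temperature τ (the exact parameterization used here may differ):

```latex
\mathcal{L}_{\mathrm{InfoNCE}}
  = -\log \frac{\exp\!\big(\mathrm{sim}(c, x^{+})/\tau\big)}
               {\exp\!\big(\mathrm{sim}(c, x^{+})/\tau\big)
                + \sum_{x^{-}} \exp\!\big(\mathrm{sim}(c, x^{-})/\tau\big)}
```

Minimizing this loss pulls the context embedding c toward the faithful response x+ while pushing it away from the hallucinated negatives x−.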

Table 6: Sample responses from different models. Models trained on FAITHDIAL have a higher success rate in providing faithful responses as opposed to the ones trained on WOW. Text highlighted in red indicates hallucination.

Table 8: Transfer results of faithful response generation from FAITHDIAL to other dialogue datasets. The rightmost block corresponds to human evaluation. * indicates that the results are statistically significant (p-value < 0.05).

To retain excellent workers for all rounds, we give a bonus of $35-$40 per 100 HITs that are submitted successfully. The average amount of time spent per HIT is 6 minutes, i.e., in one hour, workers are able to complete 10 HITs. This is equivalent to $17-$18 per hour.

Table 9: Possible abstractiveness strategies of FAITHDIAL from manual analysis on 200 responses.

TopicalChat
Knowledge: . . . family of team sports that involve, to varying degrees, kicking a ball to score a goal. Unqualified, the word football is understood to refer to whichever form of football is the most popular in the regional context in which the word appears. Sports commonly called football in certain places include association football (known as soccer in some countries); gridiron football (specifically American football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); and Gaelic football. These different variations of football are known as football codes. The University of Iowa's locker room for visiting football teams is completely painted pink. In the 1960s, top bowlers made twice as much as top football stars. The highest score ever in a football game occurred in 1916 when Georgia Tech defeated Cumberland 222-0. ESPN won an Emmy for the creation of the superimposed yellow line representing the first down line for American football games. Former Patriots RB BenJarvus Green-Ellis has never fumbled the football in his NFL career.
History: Did you know that the University of Iowa's locker room is painted pink? I wonder why?
T5-FAITHDIAL: I don't know honestly. A room is painted pink? That's interesting. Speaking of football, did you know that the highest score in football appeared in 1916? (Faithful)
T5-TopicalChat: That is pretty funny, I guess the visiting team gets a lot of attention from them, I think they are scared of the pink locker room. (Hallucination)

CMU-DoG
Knowledge: Cast: Lindsay Lohan as Cady Heron, Rachel McAdams as Regina George, Lacey Chabert as Gretchen Wieners, Amanda Seyfried as Karen Smith, Lizzy Caplan as Janis Ian. Critical response: Boasts a one-two-three punch in star Lindsay Lohan, screenwriter Tina Fey and director Mark Waters, and, indeed, it delivers a knockout. The screenplay was highly praised by critics, with Peter Travers of Rolling Stone calling it 'comic gold'. In November 2012, Rotten Tomatoes named it in its 'Top 50 Greatest Teen Comedies'. Director: Mark Waters. Genre: Comedy, teen. Introduction: Mean Girls is a 2004 American teen comedy film directed by Mark Waters and written by Tina Fey. The film is partially based on Rosalind Wiseman's 2002 non-fiction self-help book Queen Bees and Wannabes, which describes female high school social cliques and the damaging effects they can have on girls. MovieName: Mean Girls. Rating: Rotten Tomatoes: 84%, average: 6.9/10. Metacritic Score: 66/100. CinemaScore: A. Year: 2004.
History: I think Rachel McAdams had an even better role as Regina George however!
T5-FAITHDIAL: That's great. Did you know it was written by Tina Fey? That being said, the movie is based on a 2002 book called Queen Bees and Wannabes about social cliques in high school. It's also a comedy film. It's a 2004 film, directed by Mark Waters. (Faithful)
T5-CMU-DoG: I think it was really good. I bet it will win the Oscars. (Hallucination)