Abstract
Knowledge-grounded dialogue systems powered by large language models often generate responses that, while fluent, are not attributable to a relevant source of information. Progress towards models that do not exhibit this issue requires evaluation metrics that can quantify its prevalence. To this end, we introduce the Benchmark for Evaluation of Grounded INteraction (Begin), comprising 12k dialogue turns generated by neural dialogue systems trained on three knowledge-grounded dialogue corpora. We collect human annotations assessing the extent to which the models’ responses can be attributed to the given background information. We then use Begin to analyze eight evaluation metrics. We find that these metrics rely on spurious correlations, do not reliably distinguish attributable abstractive responses from unattributable ones, and perform substantially worse when the knowledge source is longer. Our findings underscore the need for more sophisticated and robust evaluation metrics for knowledge-grounded dialogue. We make Begin publicly available at https://github.com/google/BEGIN-dataset.
1 Introduction
Neural language models (Bengio et al., 2000; Vaswani et al., 2017; Radford et al., 2019, inter alia) often form the backbone of open-ended dialogue systems (Wolf et al., 2019; Zhang et al., 2020b; Roller et al., 2021; Adiwardana et al., 2020). Utterances sampled from such language models sound natural, as reflected in these systems’ high scores in human evaluations focused on measures such as “engagingness” or “human-likeness” (See et al., 2019). While fluent, however, the responses generated by these systems often contain statements that are not supported by the evidence available to the system; such statements are sometimes referred to informally as “hallucinations” (Tian et al., 2019; Maynez et al., 2020a; Dziri et al., 2021; Shuster et al., 2021; see Figure 1 for an example). This issue is particularly salient for knowledge-grounded dialogue systems, which are expected to interact with a user in an open-ended fashion while conveying information that is attributable to external identifiable sources. In this work, we develop a benchmark that can be used to assess attribution in knowledge-grounded dialogue systems; following Rashkin et al. (2021a), we define an attributable response as one connected to textual evidence that supports the entirety of the response.
A number of modeling approaches have recently been proposed to increase attribution in knowledge-grounded dialog systems (Rashkin et al., 2021b; Shuster et al., 2021; Dziri et al., 2021, 2022a). Progress in this area crucially relies on metrics that can measure the attribution of the text generated by the system; and indeed, recent work has developed automated metrics with relatively high correlations with human annotations, potentially paving the way for alternatives to expensive human evaluations (Honovich et al., 2021; Dziri et al., 2021, 2022a). Yet our understanding of these recently proposed metrics, as well as more established ones, remains limited, for two reasons. First, comparisons between automated metrics and human judgments rely on small-scale datasets with a few hundred examples. This results in high variance in our estimate of the correlation coefficient and a limited ability to measure performance on infrequent example types (Gehrmann et al., 2021).
Second, the correlation with human scores does not sufficiently determine the efficacy and robustness of automatic metrics produced by neural networks: such learned metrics—like other properties learned by neural networks—can be susceptible to spurious correlations that fail to generalize to more challenging cases. To address these limitations, we introduce a large-scale resource, the Benchmark for Evaluation of Grounded INteraction (Begin), for meta-evaluation of metrics designed to evaluate grounded dialogue. In other words, the goal of this benchmark is to determine to what extent current evaluation metrics fulfill their purpose.
We define a taxonomy dividing knowledge-grounded dialogue responses into three broad categories—fully attributable, not fully attributable, and generic—and ask humans to classify a large set of utterances produced by dialogue systems with this taxonomy. The motivation for the generic category we introduce—which is assigned to utterances such as “Sorry, I’m not sure about this topic”—is the intuition that evaluation metrics should not treat the basic elements of a natural-sounding conversation, such as backchanneling or acknowledgment (Grice, 1989; Stiles, 1992; Bunt et al., 2020), as equally undesirable as a misleading unattributable statement. In real-world scenarios, it is preferable for a model to acknowledge its ignorance instead of producing hallucinated content, which may lead to the spread of disinformation.
Using this taxonomy, we then collect high-quality human annotations for 12k examples generated by four language-model-based dialogue systems, each trained on three different knowledge-grounded dialogue corpora. Examples of machine-generated responses along with labels are presented in Table 1. We use this benchmark to evaluate multiple existing automatic metrics including word-overlap measures, embedding-based measures, metrics based on Question Answering (QA) systems, and ones based on Natural Language Inference (NLI). We also propose a classifier trained on an adversarially generated dataset we create. We find that all metrics inadequately measure attribution and all rely on spurious correlations to a large extent. In particular, the metrics tend to misidentify cases that are attributable but highly abstractive, as well as cases that are not fully attributable but use multiple words from the evidence document (i.e., unattributable but extractive). We also find that the metrics fail to measure attribution under distribution shift, scoring responses that pertain to relatively long knowledge sources the lowest. These results are in line with the robustness issues reported for other natural language generation metrics, despite the high correlation of those metrics with human judgments (Durmus et al., 2022; Gehrmann et al., 2021; Gabriel et al., 2021; Yeh et al., 2021). We hope that Begin will facilitate progress toward more robust metrics for grounded dialogue response generation.
Table 1: Examples of machine-generated responses from Begin, with their labels and the overall frequency of each label.

| Dataset | Category | Frequency | Example |
|---|---|---|---|
| WoW | Fully Attributable | 23.5% | Knowledge: Raised in Compton, California, Lamar embarked on his musical career as a teenager under the stage name k-dot, releasing a mix-tape that garnered local attention and led to his signing with indie record label top dawg entertainment. History: oh, yeah. I believe I’ve heard about that. The album has to be good then for sure. Do you know anything else about him or rap? T5 Response: Oh yes, I know that the singer was raised in Compton, California and started on his musical career as a teenager. |
| CMU-DoG | Generic | 6.4% | Knowledge: Throughout the film, the narrative is intercut with scenes from depositions taken in the Winklevoss twins’ and Saverin’s respective lawsuits against Zuckerberg and Facebook. …After everyone leaves, Zuckerberg is shown sending a friend request to Albright on Facebook and then refreshing the webpage every few seconds as he waits for her response… History: The ending seems cools. Like after all of that happens he just goes to hang out with his kids. GPT2 Response: What else do you know about Zuckerberg? |
| TopicalChat | Not Attributable | 70.1% | Knowledge: Television (TV) is a telecommunication medium used for transmitting moving images in monochrome (black and white). …Ringo Starr, George Carlin, and Alec Baldwin have all narrated Thomas the Tank Engine for at least 52 episodes each. There are over 750 episodes of “Pokémon”: That’s about 50 a year for 16 years straight. History: no, I haven’t watched Pokemon. My sons use to play with when they were little. Now, they use their phones looking for Pokemon in the neighborhood. It is funny. DoHA Response: Pokemon appeared in TV in 1952 and it was created by Ringo Starr and directed by George Carlin. In total, there were 100 episodes. |
2 Task, Datasets, and Models
In knowledge-grounded response generation, the system is given a dialogue history ℋ = (u1,…,un−1) and knowledge 𝒦 at turn n, and is expected to generate a response un that is coherent with ℋ and attributable to a non-empty subset k ⊆ 𝒦. As in the conversational QA task (Choi et al., 2018; Reddy et al., 2019), the system is expected to use the knowledge to respond to the user query. However, since the previous utterance may be an open-ended statement rather than a direct question (see the second and third examples in Table 1), there is a wider range of possible types of informative replies than in conversational QA.
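To make the input and output of this task concrete, the following minimal sketch (ours, not from the benchmark’s released code) represents a single example; the class and field names are illustrative.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class GroundedDialogueExample:
    """One knowledge-grounded response-generation example (illustrative field names)."""
    knowledge: str       # evidence snippet K available at turn n
    history: List[str]   # dialogue history H = (u_1, ..., u_{n-1})
    response: str        # system response u_n, later judged for attribution

example = GroundedDialogueExample(
    knowledge="Raised in Compton, California, Lamar embarked on his musical career as a teenager ...",
    history=["Do you know anything else about him or rap?"],
    response="Oh yes, I know that the singer was raised in Compton, California.",
)
```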
Begin consists of responses generated by language-model-based systems trained to perform this task. This section describes the models we train on this task and the corpora we use to train them.
2.1 Dialogue Datasets
For all three datasets, we use the training portion to train the model, the development set to tune hyperparameters, and the test set to generate the responses that are then annotated and included in the final Begin benchmark.
Wizard of Wikipedia (WoW)
WoW dialogue (Dinan et al., 2019) takes place between a Wizard and an Apprentice. The Wizard is tasked with providing information about a particular topic and the Apprentice, in turn, is expected to seek more information. At each turn of the conversation, the Wizard is presented with passages from Wikipedia and chooses a span from the document—typically one or two sentences—that serves as evidence supporting their response. We omit examples where the Wizard did not explicitly select a passage as evidence for the response or where there is no dialogue history. We also use the “unseen” topic portion of the test data. Overall, we use 82,722 training examples, 8,800 development examples, and 3,902 test examples.
CMU-DoG
The CMU-DoG dataset (Zhou et al., 2018) consists of conversations about films. Each response is expected to be grounded in a section from Wikipedia. Workers take either asymmetric or symmetric roles. In the asymmetric setting, one worker is asked to persuade the interlocutor to watch the movie using arguments from the document, and only the persuader has access to the document. In the symmetric setting, both workers have access to the document and discuss its content together. In total, there are 78,136, 13,800, and 13,796 grounded responses (training/dev/test).
TopicalChat
TopicalChat (Gopalakrishnan et al., 2019) consists of dialogues about a variety of topics. Workers are provided with relevant facts from Reddit, Wikipedia, and news articles. As in CMU-DoG, the data collection protocol consists of two scenarios: in the symmetric scenario, workers have access to the same knowledge source; in the asymmetric scenario, they have access to different sources. Workers are asked to use the information from the documents to chat knowledgeably about the topic. In total, the dataset has 134,572, 8,790, and 8,081 grounded responses (training/dev/test).
2.2 Dialogue Models
We consider the outputs of four different dialogue systems; by selecting a relatively wide range of systems, we hope to capture a range of attribution errors. Two of the systems are based on plain language models, GPT2-base (Radford et al., 2019) and T5-base (Raffel et al., 2020). The remaining two systems, DoHA (Prabhumoye et al., 2021) and CTRL-dialog (Rashkin et al., 2021b), are specifically designed as knowledge-grounded dialogue systems. DoHA augments a BART-based conversational model (Lewis et al., 2020) with a two-view attention mechanism that handles the encoded document and the dialogue history separately during generation. CTRL-dialog augments T5-base with control tokens (Keskar et al., 2019) that guide the generation towards less subjective and more grounded content. We train these models to generate responses from a concatenation of two inputs: an evidence span 𝒦 (the knowledge snippet) and the dialogue history (we use only the previous turn un−1).
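As a rough illustration of how such an input might be assembled, here is a small sketch; the separator string and any truncation are assumptions on our part, since the paper does not specify the exact input format.

```python
# Sketch of assembling a model input from the evidence span and the previous turn.
# The separator string is a placeholder assumption, not the paper's exact format.
def build_model_input(knowledge: str, previous_turn: str, sep: str = " <knowledge> ") -> str:
    # Concatenate the previous dialogue turn u_{n-1} with the evidence span K.
    return previous_turn + sep + knowledge

model_input = build_model_input(
    knowledge="Raised in Compton, California, Lamar embarked on his musical career as a teenager ...",
    previous_turn="Do you know anything else about him or rap?",
)
print(model_input)
```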
3 Annotations
We next describe the human annotations we collected for the utterances generated by the models described in Section 2.
3.1 Taxonomy of Response Types
We classify responses into three broad categories:
Fully Attributable
These are responses that convey information that can be completely supported by the provided document; this property has been referred to in the literature as faithfulness (Rashkin et al., 2021b; Maynez et al., 2020b; Dziri et al., 2021; Durmus et al., 2020) and attribution (Rashkin et al., 2021a). In our annotation set-up, we use definitions similar to those of the Attributable to Identifiable Source (AIS) framework of Rashkin et al. (2021a). The full framework in that paper consists of a two-stage annotation process in which annotators first filter out responses that are deemed too vague or ill-formed to be evaluated for attribution. Since Rashkin et al. (2021a) found that more than 90% of the conversational responses in their study were interpretable, we have our annotators focus solely on attribution.
Not Attributable
These are responses that contain at least some information that cannot be verified given the evidence, regardless of whether that information is factually true in the real world. This includes statements that are relevant but not fully supported by the background information (hallucinations), statements that explicitly contradict the background information, and off-topic responses about information completely external to the evidence sources. In a pilot study we attempted to separate these three subcategories, but the boundaries between them turned out to be difficult to define and annotate.
Generic
Responses that fall into this category are general enough to fit into a large number of possible contexts (Li et al., 2016). Examples include “I don’t know about that” and “Hello there!”. Even when the responses are ostensibly about the same topic as the document, they are vague and do not provide new information. Nevertheless, such responses may be useful for various conversational purposes: back-channeling, expressing uncertainty, or diverting the conversation from ambiguous or controversial topics.
3.2 Collecting Prompt-Query-Reply Triples
As described in Section 2, we collect data using outputs from four models—T5, GPT2, DoHA, and CTRL-dialog. We train a version of each model on each of the three datasets (WoW, TopicalChat, and CMU-DoG) and generate responses using the test portion of each dataset. For more details on training and hyperparameters, refer to Appendix B. We select at least 1,000 examples from each dataset-model pair and remove toxic responses identified with the Google Perspective API. This yields 12,288 examples in total.
3.3 Annotating Prompt-Query-Reply Triples
We present annotators with a knowledge snippet 𝒦, the previous turn un−1, and a generated response un, and ask them to select which of the three categories fits best. For the exact annotation instructions, see Appendix A. To obtain high-quality data, we assign three annotators to each example and report results based on majority vote. We exclude examples where each of the three annotators assigned a different category, making it impossible to compute a majority vote.
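The aggregation rule described above amounts to a simple majority vote over three labels, dropping three-way ties; a minimal sketch (the label strings are illustrative, not the dataset’s exact identifiers):

```python
from collections import Counter
from typing import List, Optional

def majority_label(annotations: List[str]) -> Optional[str]:
    """Return the majority label among three annotations, or None if all three disagree."""
    label, count = Counter(annotations).most_common(1)[0]
    return label if count >= 2 else None

# Examples where every annotator chose a different category are excluded from BEGIN.
assert majority_label(["attributable", "attributable", "generic"]) == "attributable"
assert majority_label(["attributable", "generic", "not_attributable"]) is None
```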
Annotation Quality
To ensure that the annotators understood the task, we use the following manual quality control procedure. In the first stage, we train the annotators by running two pilot annotation batches (∼100 examples each). After each batch, we manually grade the answers for compliance with instructions, and provide feedback explaining any misconceptions. After the training stage, we launch the main annotation round for the full set of 12k examples. During this round, we intermittently check responses after every 3k completed annotations to examine the annotation quality. This procedure resulted in high inter-annotator agreement (a Krippendorff’s alpha of 0.7).
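Agreement of this kind can be computed, for example, with the third-party krippendorff package; this is only a sketch of the computation, not necessarily the tooling the authors used, and the integer label codes are illustrative.

```python
# pip install krippendorff
import krippendorff

# Rows are annotators, columns are examples; labels are coded as integers
# (0 = attributable, 1 = not attributable, 2 = generic) for illustration.
reliability_data = [
    [0, 1, 2, 0, 1],
    [0, 1, 2, 1, 1],
    [0, 1, 1, 0, 1],
]
alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```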
3.4 Dataset Analysis
Begin is intended as a test benchmark; as such, it does not have a training portion: We only create development (10%) and test (90%) partitions. We include examples from Begin in Table 1 along with the label breakdown. Overall, the models generated a substantial proportion of unattributable responses (70%). As Figure 2(a) shows, this proportion was higher for GPT2, DoHA, and T5, whereas CTRL-dialog generated the lowest proportion of unattributable responses (30.8%). This indicates that CTRL-dialog, which is explicitly designed to discourage unattributable responses, is moderately successful at its goal. Figure 2(b), which breaks the results down by training corpus, shows that models trained on TopicalChat produce the highest proportion of unattributable responses, followed by models trained on CMU-DoG and WoW. This is consistent with recent analyses of WoW, CMU-DoG, and TopicalChat, which revealed that more than 60% of the ground-truth responses are not attributable to the knowledge (Dziri et al., 2022b; Rashkin et al., 2021a).
3.5 The Need to Measure Attribution
Our analysis of the responses produced by the systems we trained highlights the potential pitfalls of language-model-based dialogue systems, especially when they are deployed in real-world domains such as healthcare (Laranjo et al., 2018; Jovanović et al., 2021) and education (Yang and Evans, 2019; Kochmar et al., 2021), where hallucinations about vital information can lead to harmful user experiences. It also underscores the need for progress on both the modeling and the evaluation side. Neural dialogue systems are optimized to mimic the distributional properties of the human-generated dialogue corpus used to train them. Because humans often include unattributable information in their utterances, language models trained on those corpora can replicate and perhaps even amplify the prevalence of unattributable responses at test time (Kang and Hashimoto, 2020; Dziri et al., 2022b). These findings call for robust evaluation metrics that can yield actionable insights about how best to use such models and benchmarks. We hope that Begin, as an evaluation benchmark, will promote a strict standard for evaluation metrics, laying the groundwork for trustworthy dialogue systems.
4 Evaluating Evaluation Metrics
We next use Begin to evaluate a range of evaluation metrics. In §4.1 we list the untrained metrics we use as well as metrics trained on existing resources, and in §4.2 we describe a training set that we designed to train a classifier for the three response categories. We then describe the extent to which these metrics align with the Begin categories and analyze the metrics’ robustness.
4.1 Metrics
Lexical Overlap Metrics
These metrics score a candidate response un by its surface-level word overlap with the knowledge 𝒦.
Semantic Similarity Metrics
These metrics compare the semantic similarity between the response un and the knowledge 𝒦. We consider BERTScore (Zhang et al., 2020a), which computes similarity based on the cosine similarity of contextual token embeddings, as well as BARTScore (Yuan et al., 2021) and BLEURT (Sellam et al., 2020); for implementation details, see Appendix C.
Question-Based Metrics
We use Q2 (Honovich et al., 2021), which computes a factuality score by asking and answering questions. Given a candidate response as input, Q2 generates a corresponding question and identifies potential answer spans in the knowledge source that can justify the question–answer pair (Durmus et al., 2020; Wang et al., 2020). It also computes an NLI-inspired similarity score between the candidate response and a predicted answer span in the knowledge source.
Inference-Based Metrics
Finally, we study the performance of NLI-based models, trained either on gold NLI benchmarks or on adversarially augmented silver data that we generate. We first describe the metrics trained on gold NLI datasets; we discuss our adversarially augmented dataset (BEGIN-Adversarial) in §4.2. We use two transformer-based classifiers: T5-base (Raffel et al., 2020) and RoBERTa-large (Liu et al., 2019). We fine-tune them on MNLI (Williams et al., 2018) and the dialogue inference dataset DNLI (Welleck et al., 2019a). For both datasets, we map the labels entailment, contradiction, and neutral to the Begin labels attributable, unattributable, and generic, respectively.
We also train classifiers on AugWow (Gupta et al., 2022), a synthetic dataset designed to evaluate factuality in dialogue systems. This dataset includes three categories: Supported responses that are fully verified by 𝒦, Refuted responses that explicitly contradict 𝒦, and responses with Not Enough Information (NEI), which do not contain enough information to be verified or refuted by 𝒦. We map the labels supported, refuted, and NEI to the Begin labels attributable, unattributable, and generic, respectively.
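The two label mappings above can be written down directly; the key strings here are illustrative and may not match the datasets’ exact label identifiers.

```python
# Mapping NLI-style labels onto the BEGIN taxonomy, as described above.
NLI_TO_BEGIN = {
    "entailment": "attributable",
    "contradiction": "not_attributable",
    "neutral": "generic",
}

# Mapping AugWow labels onto the BEGIN taxonomy.
AUGWOW_TO_BEGIN = {
    "supported": "attributable",
    "refuted": "not_attributable",
    "not_enough_information": "generic",
}
```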
4.2 Adversarially Augmented Training Set
This section describes our curated silver training set (BEGIN-Adversarial) for NLI-based attribution classifiers. This dataset includes 8k (𝒦, ℋ, u) triples that fall into the three categories: attributable, generic, and unattributable.
Attributable
Here we use the original human-generated responses ug from WoW. To avoid human responses that contain opinions or generic chit-chat, we only use responses that do not contain first-person pronouns and in which at least 25% of the words are contained in the evidence.
Unattributable
To generate examples that are likely to be unattributable, but are sufficiently challenging to distinguish from attributable ones to be useful in training a classifier, we use multiple perturbation strategies. We directly perturb the knowledge spans 𝒦 from the WoW test set and then feed them to GPT2 trained on WoW, using three perturbation methods, each applied to a different 𝒦: we swap the subject and the object of 𝒦; we replace up to two verbs with verbs of the same tense; and we extract all mentioned entities from different dialogue examples using the SpaCy NER tagger (Honnibal et al., 2020) and replace up to two randomly chosen entities in the original 𝒦 with entities of the same type (see the sketch below). Manual inspection reveals that this usually results in responses that are hallucinations with respect to the original 𝒦.
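A minimal sketch of the entity-replacement perturbation using the spaCy NER tagger mentioned above; the spaCy model name, the structure of the entity pool, and the sampling details are our assumptions rather than the authors’ exact implementation.

```python
import random
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model; the paper does not name a specific one

def swap_entities(knowledge: str, entity_pool: dict, max_swaps: int = 2) -> str:
    """Replace up to `max_swaps` entities in a knowledge span with same-type entities.

    `entity_pool` maps an entity label (e.g., "PERSON") to surface forms collected
    from other dialogue examples.
    """
    doc = nlp(knowledge)
    entities = list(doc.ents)
    perturbed = knowledge
    for ent in random.sample(entities, k=min(max_swaps, len(entities))):
        candidates = [e for e in entity_pool.get(ent.label_, []) if e != ent.text]
        if candidates:
            perturbed = perturbed.replace(ent.text, random.choice(candidates), 1)
    return perturbed
```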
We also generate responses designed to specifically contradict , using two techniques. First, we directly negate the human response ug from WoW using the English Resource Grammar parser (ERG; Flickinger et al., 2014). Second, we replace adjectives in ug with their WordNet antonyms (Miller, 1994).
Lastly, we gather responses that are off-topic with respect to the information in the knowledge 𝒦. For a given context, we randomly select a WoW gold response that was grounded in a different 𝒦. To avoid easy-to-detect off-topic responses, we sample from conversations that were prompted by the same initial topic word as the target conversation.
Generic
Generic responses are generated from the GPT2 model we trained on WoW, using a low softmax temperature of 0.4, which biases generation toward bland, high-probability responses.
4.3 Results
In this section, we report the performance of automatic metrics on the Begin test set.
Lexical and Semantic Metrics
The distribution of scores is shown in Figure 3. For all metrics, the median score of fully attributable responses is higher than that of generic and unattributable responses, as expected. In many individual cases, however, unattributable responses are scored quite highly, and there is some overlap in the distribution of scores across all three labels, particularly between generic and unattributable responses, indicating that these score ranges cannot be mapped directly onto the Begin label taxonomy. Higher scores do not always translate into more desirable response types: even though a generic response would typically be preferable to an unattributable one in a knowledge-grounded dialogue system, the median scores are lower for generic responses than for unattributable ones.
Q2
Figure 4 shows a box plot for each Begin class using the Q2 metric. As in the case of the lexical and semantic metrics, Q2 scores are typically higher for attributable responses but indistinguishable between generic and unattributable responses.
Inference-Based Classifiers
Table 2 reports the performance of the NLI-based classifiers on Begin. Classifiers trained on BEGIN-Adversarial substantially outperform those trained on the gold datasets MNLI, DNLI, and AugWow, even though BEGIN-Adversarial is a significantly smaller resource. We also use MNLI as an intermediate fine-tuning dataset before fine-tuning on BEGIN-Adversarial. We find that intermediate task fine-tuning can be beneficial when RoBERTa is used as the pretrained model (↑ 4.1 F1).
Table 2: Precision (P), recall (R), and F1 of classifiers fine-tuned on different datasets, evaluated on the Begin test and development sets.

| Finetuning data | Test P | Test R | Test F1 | Dev P | Dev R | Dev F1 |
|---|---|---|---|---|---|---|
| T5 | | | | | | |
| MNLI | 48.6 | 47.9 | 34.6 | 52.1 | 50.7 | 37.4 |
| DNLI | 40.8 | 56.5 | 25.6 | 41.6 | 59.2 | 28.6 |
| AugWow | 36.8 | 39.8 | 37.8 | 36.7 | 39.9 | 38.1 |
| BEGIN-Adv. | 46.7 | 47.4 | 45.9 | 47.2 | 47.1 | 46.3 |
| +MNLI | 46.9 | 49.3 | 45.3 | 47.6 | 49.4 | 46.1 |
| RoBERTa | | | | | | |
| MNLI | 50.5 | 51.1 | 36.4 | 52.3 | 53.8 | 38.5 |
| DNLI | 40.2 | 46.6 | 27.2 | 34.9 | 46.1 | 29.2 |
| AugWow | 41.2 | 39.2 | 29.7 | 29.4 | 41.4 | 29.1 |
| BEGIN-Adv. | 42.6 | 46.1 | 41.1 | 49.2 | 45.8 | 41.1 |
| +MNLI | 44.8 | 45.9 | 45.2 | 44.9 | 45.6 | 45.1 |
| Human | 96.4 | – | – | 97.2 | – | – |
Overall, our adversarially generated dataset provides better supervision for detecting our taxonomy than NLI-style datasets. This can be attributed to the fact that NLI-style datasets are designed with a focus on detecting direct contradictions. By contrast, identifying unattributable responses requires detecting multiple types of unverifiable information including, but not limited to, contradictions. At the same time, none of the models exceed 46% F1 score, showing that there is still room for improvement compared to human performance (over 95% precision when comparing human annotations to the majority vote). Finally, T5 and RoBERTa have similar F1 scores despite differences in model size and pretraining corpora, suggesting that simply scaling up the pretrained model may not be sufficient to make progress on this problem.
4.4 Are Metrics Measuring Attribution or Extractivity?
Do the metrics perform similarly on both challenging and easier examples? We adopt the density metric from Grusky et al. (2018) to split the data into three groups (low, medium, and high density) based on the extent to which responses reuse language from the knowledge source. Density measures the average length of the text spans in a response that are copied from the knowledge. Extractive (high density) responses reuse the same phrases as the knowledge source, whereas abstractive (low density) responses may express the same meaning using a paraphrase.
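A rough sketch of how such a density score can be computed: greedily match maximal shared token spans between the response and the knowledge, then aggregate their lengths following Grusky et al. (2018). The tokenization, the exact aggregation (we use the squared-length formulation from that paper), and the cutoffs for the low/medium/high bins are assumptions here, not the paper’s exact setup.

```python
from typing import List

def extractive_fragments(response_tokens: List[str], knowledge_tokens: List[str]) -> List[List[str]]:
    """Greedily collect maximal response spans that also appear verbatim in the knowledge."""
    fragments, i = [], 0
    while i < len(response_tokens):
        best = 0
        for j in range(len(knowledge_tokens)):
            k = 0
            while (i + k < len(response_tokens) and j + k < len(knowledge_tokens)
                   and response_tokens[i + k] == knowledge_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(response_tokens[i:i + best])
            i += best
        else:
            i += 1
    return fragments

def density(response: str, knowledge: str) -> float:
    """Extractive fragment density in the style of Grusky et al. (2018):
    mean squared fragment length per response token."""
    resp, know = response.lower().split(), knowledge.lower().split()
    frags = extractive_fragments(resp, know)
    return sum(len(f) ** 2 for f in frags) / max(len(resp), 1)

print(density("the singer was raised in compton",
              "raised in compton california lamar embarked on his musical career"))
```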
Results
Figures 5 and 6 show the distributions of the lexical and semantic metrics and of the Q2 score across different levels of extractivity. We observe a common pattern across all metrics: high-density responses in all categories (except generic on BLEURT) score the highest, followed by medium-density and low-density responses. The differences between the scores of the attributable, generic, and unattributable categories are more pronounced for the more extractive responses and less so for the abstractive ones. Only Q2, though generally unable to separate generic examples, maintains a clear separation between attributable and unattributable examples in the abstractive cases. Moreover, extractivity strongly influences the score assigned to attributable examples; an attributable response is likely to be scored much lower by all of these metrics if it is abstractive. Even more strikingly, unattributable extractive responses score higher on average than attributable abstractive responses on all metrics.
We observe similar trends for the classifiers (Figure 7). The performance on classifying attributable responses is much higher in extractive cases than in abstractive ones. In contrast, the performance on unattributable responses is typically worse in the extractive cases. This pattern suggests that a response that is unattributable but has high word overlap with the knowledge is very likely to be misclassified as attributable. In summary, we find that current metrics rely on the spurious correlation between attribution and word overlap, and do not capture a deep understanding of the notion of attribution (cf. McCoy et al., 2019).
4.5 Robustness to Distribution Shift
We further investigate the robustness of the metrics under distribution shift. Figure 8 shows the distributions of both semantic and Q2 scores across the data broken down by source. All metrics rate responses from WoW in all categories significantly higher than responses derived from CMU-DoG and TopicalChat. Concerningly, attributable responses generated based on CMU-DoG and TopicalChat receive nearly identical scores to unattributable responses. Likewise, the F1 scores of all the classifiers (Figure 9) are higher on the responses from WoW than on the ones from CMU-DoG and TopicalChat. Classifiers tested on TopicalChat examples yield the worst F1 scores; for example, RoBERTa-MNLI’s F1 score decreases by 10 points when tested on attributable responses from TopicalChat compared to WoW. In general, the metrics appear to perform poorly on datasets that have longer knowledge sources: TopicalChat has on average 271 words in 𝒦, followed by CMU-DoG and WoW, which have 215 and 27 words, respectively. Shorter knowledge spans thus correlate with higher metric performance, pointing to the limited robustness of the metrics.
5 Related Work
Analysis of Evaluation Metrics in Natural Language Generation
There is extensive interest in analyzing and meta-evaluating natural language generation (NLG) evaluation metrics (Gehrmann et al., 2022, 2021) for various tasks, including machine translation (Freitag et al., 2021; Mathur et al., 2020), data-to-text generation (Dhingra et al., 2019), summarization (Bhandari et al., 2020; Pagnoni et al., 2021; Durmus et al., 2020; Gabriel et al., 2021; Fabbri et al., 2021; Durmus et al., 2022), and dialogue generation (Yeh et al., 2021; Durmus et al., 2022). Most of these studies have compared reference-free and reference-based evaluation metrics to human evaluation. For example, Gabriel et al. (2021) measured the performance of automated metrics on summaries along dimensions such as sensitivity and correlation with human scores. Fabbri et al. (2021) analyzed metrics for summarization and released human annotations of faithfulness for the outputs of 16 summarization models. We perform a similar meta-evaluation of existing automatic metrics in the context of attribution in knowledge-grounded responses. Closest to our work is Durmus et al. (2022), who found that reference-free evaluation metrics for summarization and dialogue generation rely heavily on spurious correlations such as perplexity and length.
Metrics in Knowledge-Grounded Response Generation
In contrast to the significant progress in evaluating many NLG tasks, the evaluation of grounded response generation is a nascent research area (Shuster et al., 2021; Rashkin et al., 2021a; Dziri et al., 2021). Yeh et al. (2021) conducted a comprehensive study of existing dialogue evaluation metrics; they measured properties such as engagingness and relevance but did not investigate the faithfulness of responses. While hallucination is well studied in the context of summarization (Durmus et al., 2020; Maynez et al., 2020b; Nan et al., 2021; Falke et al., 2019), fewer researchers have looked into the problem of assessing hallucination in dialogue systems. Dziri et al. (2021) introduced a token-level critic that leverages a knowledge graph to identify hallucinated dialogue responses. Rashkin et al. (2021a) proposed a human evaluation framework for assessing whether the output of dialogue models pertains to the external world and applied it to conversational QA tasks. Dziri et al. (2022a) introduced a faithful benchmark for information-seeking dialogue and demonstrated that it can serve as a training signal for a hallucination critic, which discriminates whether an utterance is faithful or not. An alternative approach assesses faithfulness through an auxiliary language understanding task, checking whether a question answering system produces the same answers from the response and from the source document (Honovich et al., 2021). As a test benchmark, Begin should be useful for further developing such metrics.
NLI and Adversarial Data for Grounded Dialogue Evaluation
In this work, we also investigate the performance of classifiers trained on NLI data, extending prior work that has proposed using NLI as a framework for evaluating conversational consistency (Welleck et al., 2019b). Dziri et al. (2019) also used NLI to evaluate dialogue consistency. They generated a large-scale, noisy synthetic dataset of (premise, hypothesis) pairs tailored for dialogue, based on Zhang et al. (2018). We also explore training classifiers on adversarially augmented training data similar to concurrent work from Gupta et al. (2022) and Kryscinski et al. (2020), which proposed a synthetic dataset for determining whether a summary or response is consistent with the source document; this dataset was constructed by applying a number of syntactic transformations to reference documents (for a similar approach applied to NLI, see Min et al., 2020).
6 Conclusion
Contemporary knowledge-grounded dialogue systems that rely on language models often generate responses that are not attributable to the background knowledge they are expected to convey. We present Begin, a new benchmark to advance research toward robust metrics that can assess this issue. We use Begin to comprehensively evaluate a broad set of existing automatic metrics. We show that these metrics rely substantially on word overlap and fail to properly rank abstractive attributable responses as well as generic responses. They also struggle under distribution shift, assigning low scores to attributable responses grounded in long knowledge sources. We hope that this work will spur future research on building robust evaluation metrics for grounded dialogue systems.
Acknowledgments
We are grateful to the anonymous reviewers for helpful comments. We thank Dipanjan Das, Vitaly Nikolaev, Sebastian Gehrmann, Roee Aharoni, Jennimaria Palomaki, Tom Kwiatkowski, Michael Collins, and Slav Petrov for helpful discussions and feedback. We also thank Ashwin Kakarla and his team for helping with the annotations.
A Begin Annotation Protocol
Each worker was given a document, the previous turn in a conversation, and a generated response (from T5, GPT2, DoHA, or CTRL-dialog). They were asked to evaluate the response as either fully attributable, not attributable, or too generic to be informative. They were also provided with multiple examples with explanations for each category. The exact instructions were as follows:
Which of these best describes the highlighted utterance?
- Generic: This utterance is uninformative (too bland or not specific enough to be sharing any new information)
- Contains any unsupported Information: This utterance is sharing information that cannot be fully verified by the document. It may include false information, unverifiable information, and personal stories/opinions.
- All information is fully supported by the document: This utterance contains only information that is fully supported by the document.
B Implementations
GPT2, T5
We implement these models using the TensorFlow Huggingface Transformers library (Wolf et al., 2020). During training, we use the Adam optimizer (Kingma and Ba, 2015) with dropout (Srivastava et al., 2014), a batch size of 32, and a learning rate of 6.25 × 10−5 that is linearly decayed. The maximum dialogue history length is set to 3 utterances. The models early-stop at epochs 6, 10, and 10 for WoW, CMU-DoG, and TopicalChat, respectively.
CTRL-dialog
We reproduce the results of Rashkin et al. (2021b), following the training details in that paper.
DoHA
We use the code and the CMU-DoG pre-trained model made publicly available by the authors on GitHub. For WoW and TopicalChat, we closely follow the authors’ training procedure described in Prabhumoye et al. (2021) and train one model on each of the two datasets.
For each dataset, we save the best model based on the validation set. We use nucleus sampling with p = 0.9.
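For illustration, generation with nucleus sampling at p = 0.9 can be invoked as in the sketch below; the checkpoint name, prompt, and length limit are placeholders, and the paper’s own implementation uses the TensorFlow side of the library, so this is a sketch rather than the exact setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; the paper uses models fine-tuned on the grounded dialogue corpora.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Do you know anything else about him or rap? <knowledge> Raised in Compton, California, ..."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Nucleus sampling with p = 0.9, as stated above; max_new_tokens is an assumed limit.
output_ids = model.generate(input_ids, do_sample=True, top_p=0.9, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```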
C Model-Based Metrics
Semantic Similarity Models
We use BERTScore version 0.3.11 with the DeBERTa-xl-MNLI model (He et al., 2021), which was the recommended model at the time of our investigation. For BLEURT, we use the recommended BLEURT-20 checkpoint (Pu et al., 2021). For BARTScore, we use the latest publicly available checkpoint (accessed March 2022) from https://github.com/neulab/BARTScore.
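As an illustration, BERTScore with a DeBERTa-MNLI backbone can be computed with the bert-score package as sketched below; the scoring arguments beyond the model choice, and the use of the knowledge snippet as the reference, are our assumptions about the general setup rather than the paper’s exact configuration.

```python
# pip install bert-score
from bert_score import score

responses = ["Oh yes, I know that the singer was raised in Compton, California."]
knowledge = ["Raised in Compton, California, Lamar embarked on his musical career as a teenager ..."]

# Score each response against its knowledge snippet with a DeBERTa-MNLI backbone.
P, R, F1 = score(responses, knowledge, model_type="microsoft/deberta-xlarge-mnli")
print(F1.mean().item())
```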
Notes
Note that we do not compare the generated responses to the gold responses as they may be unattributable (Sec 3.4).
We did not observe a similar improvement when using DNLI as an intermediate task.
We observe similar results for lexical metrics.
References
Author notes
Equal contribution.
Work done while at Google Research.
Action Editor: Hang Li