Abstract
Instruction-following models are attractive alternatives to fine-tuned approaches for question answering (QA). By simply prepending relevant documents and an instruction to their input, these models can be adapted to various information domains and tasks without additional training. However, these models tend to produce verbose responses with supplementary information, which makes traditional QA metrics like exact match (EM) and F1 unreliable for accurately quantifying model performance. In this work, we evaluate instruction-following models along two fronts: 1) how well they satisfy a user’s information need (correctness), and 2) whether they disseminate information supported by the provided knowledge (faithfulness). Guided by human evaluation and analysis, we highlight the shortcomings of traditional metrics for both correctness and faithfulness and propose simple token-overlap metrics that correlate highly with human judgments. Our analysis reveals that for correctness, instruction-following models perform comparably to models specifically fine-tuned for that task. However, they struggle to accurately judge the relevance of the provided knowledge and often hallucinate in their responses. We hope our work encourages more holistic evaluation of instruction-following models for QA. Our code and human annotation data are available at https://github.com/McGill-NLP/instruct-qa.
1 Introduction
Instruction-following models, such as ChatGPT, are appealing as they can perform tasks based on natural language instructions. These models are usually trained by exposing large language models (LLMs; Brown et al., 2020; Zhang et al., 2022; Touvron et al., 2023a) to thousands of NLP tasks formulated as instructions (Sanh et al., 2022; Mishra et al., 2022; Wei et al., 2022; Chung et al., 2022; Ouyang et al., 2022; Iyer et al., 2022; Touvron et al., 2023b) or to synthetic examples generated by other LLMs (Wang et al., 2022a; Taori et al., 2023; Peng et al., 2023). In this paper, we evaluate factual question-answering (QA) ability of instruction-following models using a given set of text passages.
Instruction-following models can perform QA when provided with a task description, question, and relevant text passages (Chung et al., 2022). User-centric applications (e.g., Bing Chat) typically pair these models with a retriever or internet search to provide relevant information. These models generate natural, informative, and verbose responses, a useful trait that helps build users’ trust and engagement. However, the verbosity renders traditional evaluation metrics such as exact match (EM) and F1 unreliable, raising new challenges for evaluation (Kamalloo et al., 2023). Moreover, these models also tend to provide supplementary information that may be hallucinated (Chiesurin et al., 2023).
Consider Figure 1, where the user asks Where are One Direction from?. Comparing the reference answer London, England with the first part of model response One Direction are from London, England yields 0 EM and 0.5 F1 score, despite both answers being effectively equivalent (the entire response scores 0.36 F1). Moreover, the model asserts that One Direction is from Mullingar. While correct, this fact is unsupported by the provided knowledge. As EM and F1 only compare with reference answers, they cannot estimate if the model response is supported by the provided knowledge.
We posit that an optimal model should not only correctly respond to user queries but also be faithful, i.e., it should only disseminate information that is inferable or directly stated by external documents (Rashkin et al., 2021b; Dziri et al., 2022c). The resulting interpretability builds user trust and allows for dynamic knowledge updates (Lewis et al., 2020). In this work, we advocate that QA models should be evaluated along two fronts: 1) correctness w.r.t. information need, which measures a model’s efficacy in satisfying a user’s information needs, and 2) faithfulness w.r.t. provided knowledge, which measures a model’s capability to ground factual information in provided knowledge. We evaluate several recent instruction-following models—Flan-T5 (Chung et al., 2022), Alpaca (Taori et al., 2023), GPT-3.5 (sibling model of Ouyang et al., 2022), and Llama-2 (Touvron et al., 2023b)—on three popular factual information-seeking QA datasets: Natural Questions (NQ; Kwiatkowski et al., 2019), HotpotQA (Yang et al., 2018), and TopiOCQA (Adlakha et al., 2022). We conduct a human analysis of 1800 model responses and correlate them with several automatic metrics for correctness and faithfulness.
Our findings suggest that for correctness, recall (proportion of tokens in the reference answer that are also in the model’s response) exhibits a higher correlation than traditional QA metrics like EM or F1. For faithfulness, K-Precision (proportion of tokens in model response that appear in the knowledge snippet) correlates better than any other lexical metric. Although GPT-4 as an evaluator achieves the highest correlation for both correctness and faithfulness, it is expensive and prone to systematic biases (Wang et al., 2023). We demonstrate that our proposed lexical metrics are close to GPT-4-based evaluation, allowing us to evaluate instruction-following models at a large scale.
A faithful model should not only answer when the provided knowledge is relevant, but also abstain from answering when it is irrelevant. Hence, we also consider the model’s ability to abstain from answering as a measure of its faithfulness.
To summarize, our contributions are as follows:
We annotate responses from instruction-following models for QA along the dimensions of correctness and faithfulness, and evaluate several evaluation metrics. We also analyze the nature of responses where current metrics fail.
Guided by our analysis of traditional QA metrics’ shortcomings, we propose simple token-overlap metrics—recall for correctness and K-Precision for faithfulness—and demonstrate their strong correlation with human judgments.
We evaluate four instruction-following models across three diverse QA tasks. Our results indicate that these models, even without any further training, are comparable to task-specific fine-tuned models for correctness. However, they struggle to be faithful to provided knowledge, often failing to accurately identify its relevance.
2 Related Work
Instruction-following Models
These models are often trained on many NLP tasks verbalized in the form of natural language instructions (Wang et al., 2022b; Mishra et al., 2022; Chung et al., 2022; Iyer et al., 2022). The number of tasks varies from a few tens (62 in Wei et al., 2022) to several hundred (1800+ in Iyer et al., 2022). To increase the diversity and scope of NLP tasks, InstructGPT (Ouyang et al., 2022) and Llama-2 (Touvron et al., 2023b) incorporate high-quality expert annotations during training. They are further trained using human feedback to align them with human preferences (RLHF; Christiano et al., 2017). Another popular approach, self-instruct (Wang et al., 2022a), reduces dependency on human-authored instructions by bootstrapping an LLM to generate instructions and demonstrations of new tasks. The resultant dataset is used to train instruction-following models (Taori et al., 2023; Peng et al., 2023).
Recent works (Lazaridou et al., 2022; Shi et al., 2023) have paired retrievers with few-shot language models for QA, alleviating the need to learn additional parameters. In contrast to these works, we evaluate retrieval-augmented instruction-following models without any demonstrations. In these settings, models do not follow the distribution of reference answers, raising new challenges for evaluation.
Evaluation in QA
Previous research on information-seeking QA has primarily relied on lexical matching metrics such as exact match (EM) and F1 for model evaluation (Rajpurkar et al., 2016; Reddy et al., 2019). As simple token-overlap metrics cannot capture semantic equivalence (Min et al., 2021), subsequent model-based metrics employ contextualized embeddings (Zhang et al., 2020) or a specialized classifier (Bulian et al., 2022) to predict equivalence. More recently, several studies resort to prompting LLMs like GPT-4 (OpenAI, 2023) to act as evaluators (Chiang et al., 2023; Peng et al., 2023; Chiang and Lee, 2023; Kamalloo et al., 2023; Liu et al., 2023b).
Recently, Kamalloo et al. (2023) compared the correctness of InstructGPT (Ouyang et al., 2022) with fine-tuned models for QA. They highlight the shortcomings of traditional QA metrics and propose model-based evaluation as a viable alternative. In contrast, we evaluate both correctness and faithfulness of instruction-following models and propose token-overlap metrics that correlate highly with human judgments.
Evaluating Faithfulness
Conversational models produce factually incorrect or unsupported statements (Rashkin et al., 2021b; Dziri et al., 2022b), known as hallucinations. Several metrics have been proposed to detect hallucination, or conversely, to measure faithfulness. Knowledge-F1 (K-F1; Shuster et al., 2021) computes token-overlap F1 between model response and the provided knowledge. Q2 (Honovich et al., 2021) checks for factual consistency based on automatic question generation and question answering. FaithCritic (Dziri et al., 2022a) uses a trained model to predict hallucinations.
Recently, Chiesurin et al. (2023) demonstrated that retrieval-augmented GPT-3 is likely to produce responses that appear trustworthy but are unfaithful to the retrieved passages. They propose K-F1++, a variant of K-F1 that discounts the tokens in the model response that appear in the question. In our experiments, we observe that this metric does not correlate well with human judgments.
Evaluation of Instruction-following Models
Instruction-following models have challenged previously established evaluation protocols for many NLP tasks. Goyal et al. (2022) demonstrate that humans prefer summaries generated by GPT-3 (Brown et al., 2020) over those from fine-tuned models, but existing automatic metrics cannot capture this preference. Xu et al. (2023) advocate multi-faceted evaluation for long-form QA that focuses on fine-grained aspects such as completeness and ease of understanding. In this paper, we propose multi-faceted evaluation for factual information-seeking QA along correctness and faithfulness.
3 Experimental Setup
3.1 Tasks
We evaluate instruction-following models on three diverse information-seeking QA tasks based on Wikipedia. For each task, we use a representative popular dataset. We describe the tasks below.
Open-domain QA
Here we test the model’s ability to answer questions with genuine information-seeking intent and whose answer is present in one of the Wikipedia passages. We use the open version (Lee et al., 2019) of Natural Questions (NQ; Kwiatkowski et al., 2019), which contains queries from the Google search engine.
Multi-hop QA
Here we test the model’s ability to answer questions that require at least two Wikipedia passages to reason upon jointly. We use HotpotQA (Yang et al., 2018) for this task.
Conversational QA
Here we test the model’s ability to answer questions in a conversational context and whose answer is present in one of the Wikipedia passages. We use TopiOCQA (Adlakha et al., 2022), a dataset for open-domain information-seeking dialogue.
Table 1 lists the total number of questions, average answer length, and the total number of passages in the Wikipedia corpus for each dataset. These datasets contain short-form answers as they are easier and more consistent to annotate by humans. However, as users, humans prefer verbose answers (Chiesurin et al., 2023). This mismatch makes our evaluation setting realistic and important.
3.2 Instruction-following Models
As shown in Figure 2, we use a standardized prompt template that contains an instruction, passage(s) from an information source, and the question to elicit answers from instruction-following language models. We replace the question with the conversation history for TopiOCQA. Inspired by Mishra et al. (2022), we formulate the instruction as: “Please answer the following question given the following passages”. We refer to this instruction as Instr. v1. We consider four models that differ primarily based on their training regimes. We use the same generation parameters for all instruction-following models, described in Appendix A.
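A minimal sketch of how such a prompt can be assembled is shown below; the helper name and the passage formatting are illustrative assumptions rather than the exact template of Figure 2.

```python
def build_prompt(instruction: str, passages: list[str], question: str) -> str:
    """Assemble an instruction-following QA prompt: instruction, then passages, then the question."""
    passage_block = "\n\n".join(f"Passage {i + 1}: {p}" for i, p in enumerate(passages))
    return f"{instruction}\n\n{passage_block}\n\nQuestion: {question}\nAnswer:"

# Example usage with the Instr. v1 wording.
prompt = build_prompt(
    "Please answer the following question given the following passages",
    ["One Direction are an English-Irish pop boy band formed in London in 2010."],
    "Where are One Direction from?",
)
print(prompt)
```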
Flan-T5
(Chung et al., 2022) We use the 11B-parameter variant obtained by training T5 (Raffel et al., 2020) on multiple instruction-following datasets (Sanh et al., 2022; Wang et al., 2022b; Wei et al., 2022). These datasets encompass 1800+ tasks, of which 200+ are QA tasks. The training splits of NQ and HotpotQA are included in these datasets.
Alpaca
GPT-3.5
Llama-2
We use the 7B chat version of Llama-2 (Touvron et al., 2023b). The model is initially bootstrapped on datasets similar to those used for Flan-T5, followed by fine-tuning on dialogue-style instructions.
3.3 Retrieval
To evaluate instruction-following models for correctness, we pair them with a retriever that provides the model with passages relevant to the user query. For each task, we employ a task-specific variant of DPR (Dense Passage Retrieval; Karpukhin et al., 2020). For NQ, we use a pre-trained checkpoint from Karpukhin et al. (2020), which is trained on multiple QA datasets. For HotpotQA, we adopt the iterative multi-hop DPR variant by Xiong et al. (2021) that selects passages based on the query and prior retrievals. For TopiOCQA, we utilize the checkpoint provided by Adlakha et al. (2022) that is trained for the conversational QA task.
The number of retrieved passages passed to instruction-following models is constrained by their input context size. For a fair comparison, we provide the same number of retrieved passages to each model within a specific task—8 for NQ and HotpotQA, and 4 for TopiOCQA.
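To illustrate how a DPR-style retriever can be paired with these models, the sketch below scores an in-memory list of passages with the Hugging Face DPR encoders; the multiset checkpoint names and the brute-force scoring are assumptions made for brevity, whereas the actual setup uses task-specific checkpoints and a full Wikipedia index.

```python
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

# Multi-dataset DPR checkpoints (Karpukhin et al., 2020); assumed here for illustration.
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-multiset-base")
q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-multiset-base")
p_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")
p_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")

def retrieve(question: str, passages: list[str], k: int = 8) -> list[str]:
    """Score passages by dot product with the question embedding and return the top k."""
    with torch.no_grad():
        q_emb = q_enc(**q_tok(question, return_tensors="pt")).pooler_output
        p_emb = p_enc(**p_tok(passages, return_tensors="pt",
                              padding=True, truncation=True)).pooler_output
    scores = (q_emb @ p_emb.T).squeeze(0)
    top = torch.topk(scores, k=min(k, len(passages))).indices.tolist()
    return [passages[i] for i in top]
```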
4 Correctness w.r.t. Information Need
In this section, we investigate the correctness of instruction-following models. We consider a model response to be correct if it accurately satisfies the user’s information need. For example, when answering What is the capital of Canada?, the model should convey that Ottawa is the capital of Canada. While the model’s response might include additional information like Ottawa’s population, we limit the evaluation of correctness to the part of the model response that is directly relevant to the user’s information need. We address evaluation of additional information in the next section (Section 5). We describe lexical and semantic similarity metrics for the task in §4.1. Next, we conduct human evaluation and compare several evaluation metrics (§4.2). Finally, using metrics that correlate highly with human judgments, we evaluate instruction-following models for correctness (§4.3).
4.1 Evaluation Metrics
Evaluating correctness in QA involves comparing model responses to human-annotated gold answers. Below, we describe the two broad categories of automatic evaluation metrics:
Lexical Matching
These metrics score a model response based on its token overlap with the gold answer. While some metrics perform bag-of-words matching (e.g., Exact Match (EM), F1), others consider token order via n-gram matching, such as METEOR (Banerjee and Lavie, 2005) and ROUGE (Lin, 2004).
In this work, we also consider Recall—the proportion of tokens in the reference answer that are present in the model response. Recall does not penalize a verbose response, as long as it contains the reference answer tokens. Recent work (Liu et al., 2023a; Mallen et al., 2023) has used a similar metric whereby a model response is correct if it fully contains the reference answer as a substring. We refer to this metric as Recall (Strict), as it is a stricter version of token-level recall.
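These lexical metrics can be computed in a few lines; the sketch below assumes the usual SQuAD-style normalization (lowercasing, stripping punctuation and articles) and reproduces the Figure 1 example, where a verbose but correct response receives full Recall despite a 0.5 F1.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> list[str]:
    """SQuAD-style normalization: lowercase, strip punctuation and articles, split on whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def exact_match(response: str, reference: str) -> float:
    return float(normalize(response) == normalize(reference))

def f1_recall(response: str, reference: str) -> tuple[float, float]:
    """Return (F1, Recall) computed over token overlap between response and reference."""
    resp, ref = normalize(response), normalize(reference)
    common = sum((Counter(resp) & Counter(ref)).values())
    if common == 0:
        return 0.0, 0.0
    precision, recall = common / len(resp), common / len(ref)
    return 2 * precision * recall / (precision + recall), recall

def recall_strict(response: str, reference: str) -> float:
    """1.0 only if the normalized reference appears as a contiguous substring of the response."""
    return float(" ".join(normalize(reference)) in " ".join(normalize(response)))

# Figure 1 example: the verbose response gets Recall = 1.0 but F1 = 0.5.
print(f1_recall("One Direction are from London, England", "London, England"))
```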
Semantic Similarity
Unlike lexical metrics, which face strictness issues (Min et al., 2021), semantic similarity metrics typically leverage a model to predict semantic equivalence. BERTScore (BertS, Zhang et al., 2020) uses contextual BERT embeddings to compute precision, recall, and F1 between the model and reference answers. BEM (BERT matching, Bulian et al., 2022) employs a trained BERT model to predict semantic equivalence based on the question, the reference answer, and the model response. We extend BEM to the conversational QA task by providing the most recent question in the conversation as input.
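For reference, BERTScore can be computed with the bert_score package; a small sketch is shown below (the package default model is used, which may differ from the configuration in our experiments).

```python
from bert_score import score  # pip install bert-score

candidates = ["One Direction are from London, England"]
references = ["London, England"]

# Returns per-example precision, recall, and F1 tensors computed from contextual embeddings.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(float(P[0]), float(R[0]), float(F1[0]))
```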
Based on recent works (Chiang et al., 2023; Peng et al., 2023), we also consider prompting LLMs (referred to here as GPT3.5-Eval and GPT4-Eval) to act as evaluation agents. Given the question, the reference answer, and the model response, the LLM is instructed to predict if the model response is correct or not. The prompt template and the instruction used are described on our project page.
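A sketch of the LLM-as-evaluator setup using the OpenAI chat API is given below; the prompt wording and the yes/no parsing are placeholders, not the actual prompt from the project page.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_judge_correctness(question: str, reference: str, response: str,
                          model: str = "gpt-4") -> bool:
    """Ask an LLM to judge correctness; the prompt below is a placeholder, not the paper's prompt."""
    prompt = (
        "Given a question, a reference answer, and a candidate response, answer 'yes' "
        "if the candidate response correctly answers the question and 'no' otherwise.\n\n"
        f"Question: {question}\nReference answer: {reference}\n"
        f"Candidate response: {response}\nJudgment:"
    )
    out = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return out.choices[0].message.content.strip().lower().startswith("yes")
```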
4.2 Human Evaluation
To establish a basis for comparing evaluation metrics, we conduct human evaluation on a subset of responses from all instruction-following models. Specifically, we focus on cases where the gold passage is included in the retrieved passages. Therefore, any inaccuracies in the response can be attributed to the model’s failures rather than to inaccurate retrieval. For each of the three tasks, we take 100 samples per model, resulting in 1200 samples across the four models.
In our evaluation setup, the annotator is presented with the question (or conversation history), the reference answer, and the anonymized model response. The annotator’s task is to assess if the model response is correct, i.e., it is factually accurate and satisfies the information need underlying the user’s query. We hired four NLP graduate students for the task. Each sample is labeled by two annotators, achieving an inter-annotator agreement of 92.42% and a Fleiss’ Kappa score of 76.4%. In cases of disagreement, a third annotation is collected and majority vote is taken.
Out of 1200 model responses, 961 were judged correct by humans, and 239 as incorrect. In Table 2, we report distributional statistics of scores assigned by EM, F1, Recall, and GPT4-Eval for both correct and incorrect responses. While EM assigns a 0.0 score to almost all human-judged incorrect responses, it assigns 1.0 to only 22% of human-judged correct responses. F1 does only slightly better, obtaining 39% accuracy on human-judged correct responses (when we consider responses with ≥ 0.5 score as correct). This highlights the well-known strictness problem (Min et al., 2021; Kamalloo et al., 2023) of traditional QA metrics. In contrast, GPT4-Eval and Recall offer a relatively balanced assessment.
Table 2: Score distributions of EM, F1, Recall, and GPT4-Eval for responses judged correct (Human=1) and incorrect (Human=0).

| | | EM | F1 | Recall | GPT4 |
|---|---|---|---|---|---|
| Human=1 ↑ | Avg. | 0.22 | 0.45 | 0.85 | 0.89 |
| | Median | 0.00 | 0.33 | 1.00 | 1.00 |
| | SD | 0.42 | 0.36 | 0.30 | 0.31 |
| Human=0 ↓ | Avg. | 0.00 | 0.10 | 0.23 | 0.13 |
| | Median | 0.00 | 0.00 | 0.00 | 0.00 |
| | SD | 0.06 | 0.16 | 0.32 | 0.33 |
Qualitative Analysis of Failure Cases
As evident from Table 2, traditional QA metrics like EM and F1 tend to produce higher rates of false negatives than false positives. For instance, 78% of answers deemed correct by humans were falsely marked incorrect (false negative) by EM, whereas only 0.4% (1 out of 239) of answers judged incorrect by humans were marked correct (false positive). Similarly, F1 has 39% false negative rate and 3.8% false positive rate. As the rate of false positives is extremely low and they do not disproportionately impact instruction-following models, our analysis focuses solely on the false negatives.
We analyze the model responses that have an F1 score ≤ 0.3 but have been deemed correct by the annotators. This results in 448 samples out of 1200. Our classification of errors is inspired by Kamalloo et al. (2023) and Min et al. (2021), modified to focus on instruction-following models. We list the categories of our classification and their descriptions below.
Semantic Equivalence: Here, the model response is semantically similar to the reference answer. Sub-categories include Multinominal entities, e.g., John Kennedy and John F Kennedy, More Elaborate Answers, e.g., yes and yes, he is member of the band and Synonymous Answers, e.g., from India and Indian nationality.
Symbolic Equivalence: This primarily refers to different possible representations of numeric quantities, e.g., four seasons and 4 seasons.
Intrinsic Ambiguity in Questions: This refers to queries with multiple valid interpretations, leading to a range of correct answers, e.g., Who won NFL football coach of the year? could have different answers depending on the specific point in time being referenced.
Granularity Discrepancies: The level of specificity in the model’s response may not align with that in the reference answer. This discrepancy in granularity can be Temporal, e.g., August 25, 1939 and 1939, or Spatial, e.g., Vancouver and British Columbia, Canada.
Incomplete Reference Answers: This refers to cases when the reference answers fail to cover the entire spectrum of correct responses. We consider two sub-categories: List of named entities which includes questions like the cast of a movie or members of the band, and Open-ended questions which includes questions that can be answered in multiple different ways, all of which are not captured by reference answers, e.g., What was the Watergate scandal?.
Enumeration of Reference Answers is an error category where the question seeks a list (e.g., All states in north-east USA), but each reference answer contains only one entity (e.g., “Vermont”, “Maine”). Instruction-following models often list multiple entities together in the response (e.g., Vermont and Maine), leading to mismatches. This error category is very frequent in NQ.
Satisfactory Subset Response represents the inverse, where the model’s answer, though shorter than the reference, still addresses the query. An example is when a query asks for songs of an artist, the reference lists 5–6, but the model responds with only 1–2 song names (primarily seen in TopiOCQA).
Figure 3 displays the distribution of error cases based on our classification. A significant portion of the errors (60.81%) fall under the More Elaborate Answers category. This suggests that traditional QA metrics often penalize models unjustly due to the verbose nature of their responses. The next most common sub-category, Open-ended Questions (10.81%), suggests that models are occasionally penalized for providing correct responses that were not included in the reference answers.
In Figure 4, we provide qualitative examples of the More Elaborate Answers and Open-ended Questions sub-categories. Recall can act as an effective fix for More Elaborate Answers. However, both lexical and semantic similarity metrics struggle with Open-ended Questions. We also observed that this is the most common failure sub-category for GPT4-Eval.
Overall, the results of our human evaluation and analysis indicate that traditional metrics such as EM and F1, typically used for QA models, are not well-aligned with the verbose nature of instruction-following models. To determine more suitable metrics for these models, we analyze the correlation of each metric with human assessments.
Correlation Between Automatic Metrics and Human Judgment
With only four models in our setup, correlations computed between model rankings obtained from metrics and human judgments are not statistically significant. To overcome this issue, we directly compare the human judgments with the metric scores across the 1200 annotated examples. We utilize Spearman’s ρ and Kendall’s τ, both of which have mechanisms that prevent ties from artificially inflating or deflating the correlation measurement. For instance, we use τ-b for Kendall (Kendall, 1945), which adjusts the normalizing factor by discounting the number of tied pairs. Similarly, for Spearman’s ρ, the average of the ranks that would have been assigned to all the tied values is assigned to each value. Simply put, a metric achieves high correlation if it assigns a higher score to samples deemed correct by humans than to those deemed incorrect. Table 3 presents the Spearman’s ρ and Kendall’s τ correlation of different metrics with human judgments. Apart from the metrics detailed in Section 4.1, we include token-level precision, as well as precision and recall as computed using BERTScore.
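These correlations can be computed directly from the binary human labels and per-example metric scores, for example with SciPy (toy values shown):

```python
from scipy.stats import kendalltau, spearmanr

# Binary human judgments (1 = correct) and a metric's per-example scores (toy values).
human = [1, 1, 0, 1, 0, 1]
metric = [0.9, 1.0, 0.2, 0.6, 0.0, 0.8]

rho, _ = spearmanr(human, metric)                # tied values receive their average rank
tau, _ = kendalltau(human, metric, variant="b")  # tau-b discounts tied pairs
print(f"Spearman rho = {rho:.3f}, Kendall tau-b = {tau:.3f}")
```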
Table 3: Spearman’s ρ and Kendall’s τ correlation of automatic correctness metrics with human judgments.

| Metric | Spearman ρ | Kendall τ |
|---|---|---|
| EM | 27.326 | 27.326 |
| F1 | 47.341 | 40.164 |
| Recall | 60.048 | 55.622 |
| Recall (S) | 52.535 | 52.535 |
| Precision | 43.929 | 37.636 |
| METEOR | 48.232 | 39.845 |
| Rouge-L | 45.874 | 38.797 |
| BertS (F1) | 31.853 | 26.133 |
| BertS (Recall) | 38.841 | 31.866 |
| BertS (Precision) | 20.876 | 17.129 |
| BEM | 53.704 | 43.868 |
| GPT3.5-Eval | 61.353 | 61.353 |
| GPT4-Eval | 67.474 | 67.474 |
Notably, GPT4-Eval has the highest agreement with human judgments, with 67.47 for both Spearman and Kendall correlation, closely followed by GPT3.5-Eval. We speculate that the language comprehension capabilities and inherent world knowledge embedded in LLMs help them overcome many of the challenges associated with evaluating responses of instruction-following models that we identified in our human evaluation study.
After GPT4-Eval and GPT3.5-Eval, Recall achieves the highest correlation with human judgment. This simple token-overlap metric correlates better than other lexical metrics or more complex semantic similarity metrics like BERTScore and BEM, likely because it does not penalize the additional verbosity in model responses.
Although LLM-based evaluations such as GPT4-Eval and GPT3.5-Eval exhibit the highest correlation with human judgments, they also have certain limitations. Accessing these proprietary models incurs substantial API costs, which renders them impractical for automatic evaluation on large-scale datasets. Moreover, the reliability of LLMs as evaluators is still unclear, as recent studies have shown that they may exhibit systematic bias (Wang et al., 2023). Given these considerations, we rely on Recall to evaluate model performance.
4.3 Evaluating the Correctness of Instruction-following Models
Equipped with proper evaluation metrics, we evaluate instruction-following models for correctness across three QA tasks. Specifically, we investigate the performance of current instruction-following models in comparison to models that have been specifically fine-tuned for those tasks.
To compare against instruction-following models, we select FiD (Izacard and Grave, 2021) as our task-specific fine-tuned baseline. This T5-based (Raffel et al., 2020) encoder-decoder model separately encodes each retrieved passage with the query, resulting in a set of vectors. The decoder then autoregressively generates the answer by attending to input passages and previously generated tokens. For NQ and TopiOCQA, we use the publicly available FiD checkpoints. For HotpotQA, we train our own variant using the default hyperparameters. All checkpoints are base variants that contain 220 million trainable parameters.
Unlike instruction-following models (Section 3.3), FiD is not restricted by input context size in the number of retrieved passages it can use. We use the default settings for each dataset—100 passages for NQ, 50 for TopiOCQA, and up to 18 for HotpotQA.
In Table 4, we report EM, F1, and Recall for assessing correctness. Unsurprisingly, FiD, which is fine-tuned separately on each dataset and thus emulates the distribution of reference answers, scores higher than instruction-following models on traditional QA metrics like EM and F1 (with the exception of Flan-T5 on HotpotQA). However, based on our findings (Section 4.2), we rely on Recall for a more accurate evaluation. Using recall, the performance gap narrows significantly, with some instruction-following models even outperforming FiD. Notably, GPT-3.5 outperforms FiD across all three QA tasks—by 6.48% in NQ, 10.27% in HotpotQA, and 9.33% in TopiOCQA, whereas Llama-2 outperforms FiD on two out of three tasks, matching FiD’s performance in TopiOCQA. It is also worth noting that TopiOCQA serves as a test for generalization as it is not included in any instruction-tuning datasets and was released after the knowledge cutoff of GPT-3.5 (September 2021).
Table 4: Correctness of FiD and instruction-following models across the three QA tasks.

| | Model | EM ↑ | F1 ↑ | Recall ↑ |
|---|---|---|---|---|
| NQ | FiD | 46.57 | 53.93 | 54.45 |
| | Flan-T5 | 41.16 | 50.62 | 54.03 |
| | Alpaca | 8.84 | 19.5 | 48.82 |
| | GPT-3.5 | 1.41 | 16.22 | 57.98 |
| | Llama-2 | 0.64 | 9.78 | 59.28 |
| HotpotQA | FiD | 48.43 | 60.16 | 60.55 |
| | Flan-T5 | 58.12 | 71.14 | 71.28 |
| | Alpaca | 16.27 | 33.45 | 57.53 |
| | GPT-3.5 | 5.66 | 22.49 | 66.77 |
| | Llama-2 | 1.39 | 15.15 | 69.75 |
| TopiOCQA | FiD | 36.48 | 58.52 | 61.64 |
| | Flan-T5 | 18.34 | 43.17 | 52.54 |
| | Alpaca | 5.85 | 26.72 | 43.37 |
| | GPT-3.5 | 2.7 | 34.32 | 67.39 |
| | Llama-2 | 0.95 | 22.79 | 61.4 |
Overall, these results suggest that in retrieval-augmented settings, instruction-following models are as capable as, and sometimes even more capable than, task-specific fine-tuned generators at producing correct responses w.r.t. user information needs.
We also investigate the impact of retriever on the final performance of instruction-following models, using HotpotQA as a testbed (Appendix B). Our findings underscore the importance of selecting task-specific retrievers to maximize performance.
5 Faithfulness w.r.t. Provided Knowledge
Instruction-following models often provide verbose responses with additional information apart from the user’s information need. For example, when asked What is the capital of Canada?, the model might add information about the population – Ottawa is the capital of Canada, with a population of 1,017,449. Evaluating the correctness of this supplementary information is challenging without an oracle. Therefore, we focus on a more limited goal of faithfulness (Rashkin et al., 2021a; Dziri et al., 2022b; Chiesurin et al., 2023), which measures if the supplementary information is inferable or directly stated in the knowledge provided as input to these models. A faithful model helps build user trust and enables knowledge configurability.
Following Dziri et al. (2022a), we posit that a faithful model response should be fully grounded in the provided knowledge. Based on this hypothesis, we split our analysis into two parts: 1) faithfulness w.r.t. relevant knowledge, where we provide the model with the relevant gold passage and evaluate the groundedness of its response, and 2) faithfulness w.r.t. irrelevant knowledge, where we provide a related but irrelevant passage and measure how often the model refuses to answer.
In this section, we first describe the automatic evaluation metrics for evaluating faithfulness (§5.1). Next, similar to correctness, we conduct human evaluation to identify optimal metrics for faithfulness w.r.t. relevant knowledge (§5.2). Finally, after outlining our approach to evaluating a model’s faithfulness w.r.t. irrelevant knowledge (§5.3), we present the results from a large-scale evaluation of instruction-following models (§5.4).
5.1 Evaluation Metrics
Given the user question or the conversation history (denoted by ℋ), the gold passage 𝒦, and the model response u, the objective of the metric is to check if u can be inferred from 𝒦. We explore several reference-free faithfulness and groundedness metrics from the literature, broadly categorized into two groups:
Lexical Matching
Knowledge-F1 (denoted K-F1) is a lexical overlap metric widely used for knowledge-grounded dialogue (Shuster et al., 2021; Dziri et al., 2022a) that checks for F1 overlap between the tokens of u and 𝒦. As K-F1 checks for equivalence between the model response and the knowledge snippet, we argue that it is unsuitable for information-seeking QA tasks. Grounding u in 𝒦 in these tasks is inherently asymmetric, i.e., u can be a subset of 𝒦 but 𝒦 cannot be a subset of u. To capture this intuition, we propose K-Precision—the proportion of tokens in the model response u that are present in 𝒦.
Chiesurin et al. (2023) propose K-F1++, a variant of K-F1 that discounts tokens in the model response that also appear in the user question or the conversation history. We also consider K-Precision++, which applies similar discounting.
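A sketch of these knowledge-overlap metrics follows; the whitespace tokenization is a simplification, and the history argument implements the ++-style discounting described above.

```python
from collections import Counter

def tokens(text: str) -> list[str]:
    """Lowercase whitespace tokenization; punctuation and article stripping omitted for brevity."""
    return text.lower().split()

def k_precision(response: str, knowledge: str, history: str = "") -> float:
    """Proportion of response tokens found in the knowledge; passing the question or
    conversation history discounts its tokens, giving the ++ variant."""
    resp = [t for t in tokens(response) if t not in set(tokens(history))]
    if not resp:
        return 0.0
    common = sum((Counter(resp) & Counter(tokens(knowledge))).values())
    return common / len(resp)

def k_f1(response: str, knowledge: str) -> float:
    """Symmetric token-overlap F1 between response and knowledge."""
    resp, know = Counter(tokens(response)), Counter(tokens(knowledge))
    common = sum((resp & know).values())
    if common == 0:
        return 0.0
    p, r = common / sum(resp.values()), common / sum(know.values())
    return 2 * p * r / (p + r)
```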
Semantic Similarity
As a parallel to K-F1 in semantic space, we explore using BERTScore to measure semantic similarity between 𝒦 and u based on contextual BERT token embeddings (denoted K-BertS). We also consider FaithCritic, a hallucination critic model by Dziri et al. (2023) that predicts whether a response is faithful to a given passage. Q2 (Honovich et al., 2021) is another evaluation metric used to quantify factual consistency between responses and provided passages using automatic question generation, question answering, and natural language inference (NLI) models.
Similar to correctness, we investigate prompting LLMs to act as evaluators (LLMCritic). More specifically, we prompt GPT-3.5 and GPT-4 to annotate whether a given response uses only the knowledge present in the provided passage. The prompt template and the instruction used are described on our project page.
5.2 Faithfulness w.r.t. Relevant Knowledge
We use the same template and prompt that we did for correctness (Section 3.2), but replace the retrieved passages with the gold passage(s).
Human Evaluation Setup
We randomly sampled 50 examples per model for each dataset, resulting in 600 examples across the four models. For each sample, we provide annotators with the question (or the conversation history), the model response, and the gold passage(s). They are given two tasks: 1) to verify if the given passage is indeed relevant to the user’s query, and 2) to determine if the model response is “fully”, “partially”, or “not at all” supported by the passages. We retain the same annotators from the previous task. Each sample is annotated twice, and in case of disagreement, a third annotation is collected for a majority vote. The annotators achieved an inter-annotator agreement of 86.33% and a Fleiss’ Kappa score of 70.57%. For our analysis, we filter out samples for which the passage is marked as not relevant to the query, resulting in 544 samples.
Human Evaluation Results
Overall, 85.3% responses were marked as “fully” supported by the gold passage, 9% as “partially” and 5.7% as “not at all”. We include examples of model responses marked as “partially” and “not at all” in Figure 5. Responses in the “not at all” category generally contain short hallucinated information, while “partially” supported responses tend to be lexically aligned with the passage but with slight modifications in some pieces of information.
Correlation of Automatic Evaluation Metrics
To compare with automatic evaluation metrics, we consider model responses marked as “fully” as faithful, assigning them a score of 1.0. The other two categories are given a score of 0.0. We calculate Spearman’s ρ and Kendall’s τ correlation between assessments of automatic metrics and human judgments and report the results in Table 5.
Table 5: Spearman’s ρ and Kendall’s τ correlation of automatic faithfulness metrics with human judgments.

| Metric | Spearman ρ | Kendall τ |
|---|---|---|
| K-F1 | −10.266 | −8.397 |
| K-F1++ | −5.541 | −4.537 |
| K-Precision | 49.849 | 43.384 |
| K-Precision++ | 43.559 | 39.041 |
| K-Recall | −13.931 | −11.397 |
| K-BertS (F1) | −0.456 | −0.373 |
| K-BertS (Precision) | 23.083 | 18.866 |
| K-BertS (Recall) | −16.817 | −13.745 |
| FaithCritic | 11.277 | 9.218 |
| Q2 (F1) | 28.708 | 24.478 |
| Q2 (NLI) | 28.862 | 25.084 |
| LLMCritic (GPT-3.5) | 23.851 | 23.851 |
| LLMCritic (GPT-4) | 54.995 | 54.995 |
We find that GPT-4-based LLMCritic correlates the most with human evaluation. K-Precision, the token-overlap metric that is invariant to the length of the knowledge snippet, is a close second, better than other semantic similarity metrics like K-BertS, FaithCritic, and Q2. This indicates that models trained to detect hallucinations in knowledge-grounded dialogues do not generalize well to information-seeking QA tasks. Surprisingly, K-Precision also outperforms GPT-3.5-based LLMCritic, indicating that verifying faithfulness is still a challenging task for LLMs.
Although GPT-4-based LLMCritic achieves the highest Spearman correlation of 54.99, it is still only moderately correlated with human judgments, indicating that accurately quantifying faithfulness is a challenging task for current evaluation metrics. K-Precision is a simple interpretable token-overlap metric that can serve as a strong baseline for the development of more robust automatic metrics in the future. In this work, we rely on K-Precision to evaluate faithfulness w.r.t. relevant knowledge.
5.3 Faithfulness w.r.t. Irrelevant Knowledge
An ideal model for QA should comprehend the provided passage and avoid answering if the passage lacks relevant information. To test this, we provide the models with a passage that is likely to be irrelevant but related (e.g., for a query about a Tom Cruise movie, the model might be given a passage about a Korean movie that has nothing to do with Tom Cruise). To do so, we treat the first passage ranked after the thousandth retrieved passage as irrelevant but related.
Our preliminary experiments demonstrated that without explicit instruction, Flan-T5 and Alpaca did not refrain from answering at all. Hence, we modify the instruction from Section 3.2 (Instr. v1) to direct the model to refrain from answering if the passage is deemed irrelevant: “Please answer the following question given the following passages. If the answer is not in the passages or cannot be inferred from the passages, respond as ‘I don’t know’.” We refer to this instruction as Instr. v2. We report the proportion of model responses that contain “I don’t know” and other observed synonymous expressions, referred to as PIR. To test for any bias this instruction modification might introduce, we also check the model’s answer abstinence when provided with the gold passage, denoted by PG. Ideally, a model should always refrain from answering when given irrelevant information and never refrain when given the correct passage(s).
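A sketch of how abstinence can be detected and aggregated into PIR or PG is shown below; the list of refusal phrases is illustrative of the synonymous expressions mentioned above.

```python
# Illustrative refusal phrases; the paper checks for "I don't know" and observed synonyms.
REFUSAL_PHRASES = ["i don't know", "i do not know", "cannot be inferred", "not in the passages"]

def refrains(response: str) -> bool:
    """True if the response contains a refusal phrase."""
    lowered = response.lower()
    return any(phrase in lowered for phrase in REFUSAL_PHRASES)

def abstention_rate(responses: list[str]) -> float:
    """PIR when responses come from irrelevant passages, PG when they come from gold passages."""
    return sum(refrains(r) for r in responses) / len(responses)
```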
5.4 Evaluating Faithfulness of Instruction-following Models
We conduct large-scale evaluation of instruction-following models for faithfulness. Table 6 reports the results for faithfulness w.r.t. both relevant and irrelevant knowledge.
Table 6: Faithfulness of instruction-following models: K-Precision w.r.t. the gold passage, abstinence rate on irrelevant passages (PIR), and abstinence rate on gold passages (PG).

| | Model | K-Precision ↑ | PIR ↑ | PG ↓ |
|---|---|---|---|---|
| NQ | Flan-T5 | 94.0 | 92.0 | 24.8 |
| | Alpaca | 69.4 | 0.0 | 0.0 |
| | GPT-3.5 | 65.5 | 98.4 | 47.4 |
| | Llama-2 | 69.4 | 97.8 | 57.8 |
| HotpotQA | Flan-T5 | 92.1 | 77.1 | 1.6 |
| | Alpaca | 87.1 | 0.1 | 0.1 |
| | GPT-3.5 | 81.4 | 98.2 | 25.5 |
| | Llama-2 | 75.8 | 97.7 | 61.5 |
| TopiOCQA | Flan-T5 | 86.4 | 40.8 | 7.7 |
| | Alpaca | 66.8 | 1.3 | 0.8 |
| | GPT-3.5 | 69.6 | 88.2 | 31.8 |
| | Llama-2 | 65.4 | 79.1 | 52.4 |
Faithfulness w.r.t. Relevant Knowledge
We report K-Precision, the metric most correlated with human judgments (Section 5.2). Flan-T5 achieves the highest score for all three tasks, outperforming all other models by a significant margin. GPT-3.5 is the least faithful for NQ, while Llama-2 is the least faithful for HotpotQA and TopiOCQA. The stark difference between the scores of Flan-T5 and other models denotes a trade-off between correctness and faithfulness: GPT-3.5 and Llama-2 outperform Flan-T5 in correctness but lag behind significantly in faithfulness.
Answer Refraining and Prompt Sensitivity
When explicitly instructed to output “I don’t know” and given an irrelevant passage, GPT-3.5 most often refrains from answering (98% in NQ and HotpotQA, 88% in TopiOCQA), followed closely by Llama-2. Alpaca almost always answers, indicating that it either fails to detect when the answer is absent or has difficulty following the instruction. While Flan-T5 successfully abstains from answering on NQ and HotpotQA, it fails on TopiOCQA, indicating that it struggles with out-of-distribution data (TopiOCQA is not included in its training).
When provided with the gold passage under the same instruction, surprisingly, both GPT-3.5 and Llama-2 still refrain from answering, with Llama-2 refraining more than 50% of the time across all three datasets. This indicates that further research is required for models to identify when to refrain from answering and when not to.
We extend the evaluation of faithfulness to real-world scenarios, providing models with retrieved passages instead of gold or irrelevant passages. Additionally, we explore the impact of modifying the instruction (Instr. v1 vs. Instr. v2) on both correctness and faithfulness (Appendix C).
6 Conclusion
In this paper, we analyze the responses of instruction-following models for QA along the dimensions of correctness w.r.t. information need and faithfulness w.r.t. provided knowledge. Our results show that Recall and K-Precision metrics correlate well with human judgments for correctness and faithfulness respectively.
On evaluating instruction-following models using these proposed metrics, we find that these models demonstrate a tradeoff between correctness and faithfulness. GPT-3.5 and Llama-2 achieve high scores for correctness but have difficulty being faithful to the provided knowledge. Moreover, they struggle to decide when to refrain from answering.
When using instruction-following models for QA, we urge the community to move away from reporting a single overall score and adopt a more holistic evaluation that reports correctness, faithfulness, and the ability to refrain from answering.
Limitations
Although we evaluate correctness, faithfulness, and the ability to refrain from answering, these are not an exhaustive list of all the desirable properties of a QA model. Previous works (Xu et al., 2023) have focused on aspects like completeness and ease of understanding for long-form QA. We leave the evaluation of these properties for information-seeking QA to future work.
It is important to note that low faithfulness of a response does not imply that it is incorrect. The model can potentially provide accurate information using its parametric knowledge. However, such knowledge is difficult to interpret and modify.
We propose Recall and K-Precision for correctness and faithfulness, respectively. Although these metrics correlate highly with human judgments, they are easy to hack. For instance, Recall might score an affirmative statement and its negated version equally, despite their contrasting meanings. However, QA models tend to answer in affirmation rather than negation. Similarly, K-Precision can be hacked by copying all the knowledge from the prompt. However, such a strategy would be penalized heavily when evaluated for faithfulness w.r.t. irrelevant knowledge.
Acknowledgments
We thank the reviewers and action editor for their valuable feedback. Furthermore, we thank the members of SR’s research group for providing feedback throughout the project. We thank Eva Portelance and Ehsan Kamalloo for helpful discussions. PB is supported by the Mila-Intel Grant program. XHL and NM are supported by the NSERC CGSD Fellowship. SR is supported by a Facebook CIFAR AI Chair and NSERC Discovery Grant program.
A Generation Parameters for Instruction-following Models
We use the following generation parameters for all instruction-following models:
Top-p: p = 0.95
Temperature: t = 0.95
Seed: s = 0
Min. new tokens:
For GPT-3.5, we did not specify any limit on the maximum number of new tokens. For the other models, we set it to 500 to prevent out-of-memory errors on our GPU infrastructure.
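Applied to a Hugging Face checkpoint, these settings roughly correspond to the generate call below; the Flan-T5 checkpoint name and prompt are placeholders, and the released repository is the authoritative implementation.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

torch.manual_seed(0)  # seed s = 0

# Flan-T5 11B shown as an example; Alpaca and Llama-2 are causal LMs loaded analogously.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xxl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl")

prompt = "Please answer the following question given the following passages ..."
inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.95,          # nucleus sampling
    temperature=0.95,
    max_new_tokens=500,  # cap used for the open models to avoid GPU memory errors
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```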
B Impact of Retriever on Correctness
We evaluate the impact of different retrievers on the correctness of instruction-following models, using HotpotQA as a testing benchmark. We consider three retrievers: (1) BM25 (Robertson et al., 1995), a sparse lexical-overlap based retriever, (2) DPR (Karpukhin et al., 2020), a dense retriever trained on multiple QA datasets, and (3) Xiong et al. (2021), a multi-hop version of DPR trained on HotpotQA, which we refer to as DPR (multi-hop). We use the same prompt template and evaluation setup as Section 4.
The comparison of different retrievers is presented in Figure 6. For all instruction-following models, DPR (multi-hop) performs the best, highlighting the importance of using task-specific retrievers to maximize performance. Similar to Sidiropoulos et al. (2021), we found that BM25 outperforms DPR across all instruction-following models, potentially due to its ability to exploit the high overlap between the query and gold passages.
C Evaluation of Instruction-Following Models in Real-world Settings
In this section, we evaluate faithfulness of instruction-following models in real-world settings. We also investigate the impact of changing instructions on both correctness and faithfulness.
Experiment Setup
We follow the prompt template from Figure 2, providing models with an instruction, retrieved passages, and question or conversation history. We consider two instructions, Instr. v1 (Section 3.2) and Instr. v2 (Section 5.3). The latter adds a directive to not answer if passages are irrelevant. For evaluation, we use Recall for correctness and K-precision for faithfulness. The correctness results with Instr. v1 are copied from Table 4 for easy comparison with Instr. v2. With retrieved passages, K-Precision is computed as the proportion of tokens in the model’s response that are present in the retrieved passages. For Instr. v2, we consider “I don’t know.” to be an additional retrieved passage, hence marking the response as faithful if the model refrained from answering.
It is non-trivial to determine when the model should refrain from answering (Section 5.3) with retrieved passages, as they may be relevant even if they are not gold passages. However, if the gold passages are retrieved, the model should definitely not abstain from answering. Therefore, we report PG, i.e., the proportion of times the model refrained from answering when the gold passage was present, as a proxy for answer abstinence (lower is better). A model that always outputs “I don’t know.” with Instr. v2 will achieve the best score in K-Precision but the worst score in PG.
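A sketch of this K-Precision computation over retrieved passages under Instr. v2, with the refusal string treated as an additional passage (tokenization simplified):

```python
def k_precision_retrieved(response: str, retrieved_passages: list[str],
                          idk_string: str = "I don't know.") -> float:
    """Token precision of the response against the retrieved passages, with the refusal
    string appended as an extra passage so that abstentions count as faithful."""
    knowledge = set(" ".join(retrieved_passages + [idk_string]).lower().split())
    resp = response.lower().split()
    if not resp:
        return 0.0
    return sum(tok in knowledge for tok in resp) / len(resp)
```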
Results
We report the results in Table 7. Flan-T5 outperforms all other models by a significant margin for faithfulness under Instr. v1, consistent with previous findings (Table 6). Switching to Instr. v2 increases the faithfulness of Flan-T5 and GPT-3.5, with GPT-3.5 showing the largest gain. Llama-2’s faithfulness drops by 12.33% across all three tasks. Our manual inspection reveals that Llama-2 often explains its reasoning for not answering, which has a low overlap with the retrieved passages.
Table 7: Correctness (Recall ↑), faithfulness w.r.t. retrieved knowledge (K-Precision ↑), and answer abstinence when gold passages are retrieved (PG ↓), under Instr. v1 and Instr. v2.

| | Model | Recall ↑ (v1) | Recall ↑ (v2) | K-Precision ↑ (v1) | K-Precision ↑ (v2) | PG ↓ (v1) | PG ↓ (v2) | % Gold |
|---|---|---|---|---|---|---|---|---|
| NQ | Flan-T5 | 54.03 | 50.52 | 96.59 | 98.88 | 0.00 | 5.95 | 36.79 |
| | Alpaca | 48.82 | 48.00 | 80.87 | 79.16 | 0.00 | 0.08 | |
| | GPT-3.5 | 57.98 | 34.10 | 81.50 | 96.27 | 2.18 | 39.01 | |
| | Llama-2 | 59.28 | 44.03 | 83.26 | 76.46 | 0.08 | 31.55 | |
| HotpotQA | Flan-T5 | 71.28 | 69.68 | 91.18 | 92.81 | 0.00 | 0.87 | 79.28 |
| | Alpaca | 57.53 | 57.37 | 86.98 | 86.85 | 0.00 | 0.00 | |
| | GPT-3.5 | 66.77 | 40.08 | 80.64 | 93.72 | 2.42 | 41.17 | |
| | Llama-2 | 69.75 | 47.56 | 81.75 | 67.35 | 0.22 | 61.51 | |
| TopiOCQA | Flan-T5 | 52.54 | 49.16 | 91.77 | 94.41 | 0.38 | 6.52 | 62.25 |
| | Alpaca | 43.37 | 42.85 | 72.57 | 67.33 | 0.19 | 0.26 | |
| | GPT-3.5 | 67.39 | 46.76 | 80.80 | 90.74 | 0.64 | 25.11 | |
| | Llama-2 | 61.40 | 42.62 | 79.57 | 70.66 | 0.13 | 33.29 | |
The instruction switch impacts the correctness of all models, except Alpaca. GPT-3.5 experiences the largest drop in performance across all three datasets (37.26%), followed by Llama-2 (29.37%). This decline is likely due to the models’ increased tendency to refrain from answering under Instr. v2. PG scores support this hypothesis—GPT-3.5 refrains from answering 39.01% of the time on NQ with Instr. v2, compared to 2.18% with Instr. v1. Overall, these scores indicate a similar observation as noted in Section 5.4—GPT-3.5 and Llama-2 refrain from answering more often than needed.
Alpaca is relatively unaffected by the change in instruction across correctness, faithfulness, and answer abstinence, further confirming the observation that it has difficulty following the instruction to refrain from answering.