Abstract
Abstention, the refusal of large language models (LLMs) to provide an answer, is increasingly recognized for its potential to mitigate hallucinations and enhance safety in LLM systems. In this survey, we introduce a framework to examine abstention from three perspectives: the query, the model, and human values. We organize the literature on abstention methods, benchmarks, and evaluation metrics using this framework, and discuss merits and limitations of prior work. We further identify and motivate areas for future research, such as whether abstention can be achieved as a meta-capability that transcends specific tasks or domains, and opportunities to optimize abstention abilities in specific contexts. In doing so, we aim to broaden the scope and impact of abstention methodologies in AI systems.1
1 Introduction
Large language models (LLMs) have demonstrated generalization capabilities across NLP tasks such as question answering (QA) (Wei et al., 2022a; Chowdhery et al., 2022), abstractive summarization (Zhang et al., 2023a), and dialogue generation (Yi et al., 2024). But these models are also unreliable, having a tendency to “hallucinate” false information in their responses (Ji et al., 2023b), generate overly certain or authoritative responses (Zhou et al., 2024b), answer with incomplete information (Zhou et al., 2023b), or produce harmful or dangerous responses (Anwar et al., 2024). In these situations, the model should ideally abstain: to refuse to answer in the face of uncertainty (Wen et al., 2024; Feng et al., 2024b; Yang et al., 2023).
Current methods to encourage abstention typically rely on calibration techniques, including linguistic calibration (Mielke et al., 2022; Huang et al., 2024b), which aim to accurately and consistently estimate a model’s confidence in its response and then have the model abstain if the confidence score for a given response falls below some threshold (Varshney et al., 2022; Xiao et al., 2022; Desai and Durrett, 2020). But questions of whether a query is aligned with human values or is answerable at all are difficult to model in terms of model confidence (Yang et al., 2023).
Prior work demonstrates the potential of abstention to enhance model safety and reliability in constrained settings (Varshney et al., 2023; Wang et al., 2024c; Zhang et al., 2024a). In this survey, we attempt to bring together relevant work studying abstention strategies or leading to abstention behaviors, across the diverse range of scenarios encountered by general-purpose chatbots engaging in open-domain interactions. Our goals are to identify gaps and encourage new methods to achieve abstention. Developing or adapting abstention mechanisms to suit a wide array of tasks will enhance the overall robustness and trustworthiness of LLM interactions.
To this end, our survey presents an overview of the current landscape of abstention research. We provide a definition of abstention that incorporates not only technical perspectives—query examination and model capabilities—but also considers alignment with human values. We categorize existing methods to improve abstention in LLMs based on the model lifecycle (pretraining, alignment, and inference), and provide an analysis of evaluation benchmarks and metrics used to assess abstention. In our discussion, we aim to establish a clear entry point for researchers to study the role of abstention across tasks, facilitating the incorporation of new abstention techniques into future LLM systems.
We summarize our contributions below:
We introduce a framework to analyze abstention capabilities from three perspectives that have typically been considered in isolation—query answerability, the confidence of the model to answer the query, and alignment of query and responses with human values. Our framework helps us identify existing research that is relevant to abstention as well as abstention mechanisms that have been developed in prior work (§2).
We conduct a detailed survey of existing abstention methods (§3) as well as evaluation benchmarks and metrics (§4), aiding researchers in selecting appropriate strategies. For each class of methods, we identify opportunities for further research to advance the field.
We discuss other considerations and under-explored aspects (§5) of abstention, highlighting pitfalls and promising future directions. We encourage researchers to develop more robust model abstention mechanisms and demonstrate their effectiveness in real-world applications.
2 Abstention in LLMs
Definition
We define abstention as the refusal to answer a query. When a model fully abstains, it may begin a response with “I don’t know” or refuse to answer in another way. In reality, abstention encompasses a spectrum of behaviors (Röttger et al., 2024a), e.g., expressing uncertainty, providing conflicting conclusions, or refusing due to potential harm are all forms of abstention. Partial abstention may involve both answering and abstention, such as self-contradictory responses, e.g., “I can’t answer the question, but I suppose the answer might be...” We do not consider ignoring and/or reframing the question to be abstention, but rather failure modes of LLMs in following instructions (Röttger et al., 2024a; Varshney et al., 2023).
For the abstention expression—the words a model uses to convey that it has abstained—we adopt the definition of five major types of expressions from prior work (Varshney et al., 2023; Wang et al., 2024c), indicating that the model (i) cannot assist; (ii) refutes the query; (iii) provides multiple perspectives without expressing preference; (iv) perceives risk associated with the query and answers cautiously with a disclaimer; and (v) refuses to offer concrete answers due to the lack of knowledge or certainty. Expressions can be identified through heuristic rules and keyword matching (Zou et al., 2023; Wen et al., 2024; Yang et al., 2023), or through model-based or human-based evaluation (§4).
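As a simple illustration of the heuristic approach, the sketch below flags responses containing common refusal phrases; the pattern list is a toy example of ours, and the cited works use longer curated lists alongside model- or human-based checks.

```python
# A toy sketch of keyword-based abstention (refusal) detection.
import re

REFUSAL_PATTERNS = [
    r"\bI don'?t know\b",
    r"\bI cannot (assist|help|answer)\b",
    r"\bI'?m (not able|unable) to\b",
]

def is_refusal(response: str) -> bool:
    return any(re.search(p, response, flags=re.IGNORECASE) for p in REFUSAL_PATTERNS)

print(is_refusal("I'm sorry, but I cannot assist with that."))  # True
```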
2.1 Abstention Framework
We study abstention in the scenario of LLMs as AI assistants, exemplified by chatbots such as ChatGPT (OpenAI, 2023; Achiam et al., 2023), Claude (Anthropic, 2023), LLaMA (Touvron et al., 2023), and others (Chiang et al., 2023). We propose an idealized abstention-aware workflow for these systems in Figure 1. Given an LLM f that supports arbitrary generative modeling tasks and the user’s input x, f generates an output y. We analyze the decision to abstain from three distinct but interconnected perspectives:
The query perspective focuses on the nature of the input—whether the query is ambiguous or incomplete (Asai and Choi, 2021), beyond what any human or model could possibly know (Amayuelas et al., 2023), there is irrelevant or insufficient context to answer (Aliannejadi et al., 2019; Li et al., 2024b), or there are knowledge conflicts (Wang et al., 2023b). In these situations, the system should abstain.
The model knowledge perspective examines the capabilities of the AI model itself, including its design, training, and inherent biases (Ahdritz et al., 2024; Kim and Thorne, 2024; Hestness et al., 2017; Hoffmann et al., 2022; Kaplan et al., 2020; Cao, 2024). For any given query, the system should abstain if the model is insufficiently confident about the correctness of output or has a high probability of returning an incorrect output.
The human values perspective considers ethical implications and societal norms that influence whether a query should be answered, emphasizing the impact of responses on human users (Kirk et al., 2023a). A system should abstain if asked for personal opinions or values (i.e., the query anthropomorphizes the model), or if the query or response may compromise safety, privacy, fairness, or other values.
Figure 1: Our proposed framework for abstention in language models. Starting with input query x, the query can be gauged for answerability a(x) and alignment with human values h(x). The model then generates a potential response y based on the input x. If query conditions are not met, the model’s confidence in the response c(x, y) is too low, or the response’s alignment with human values h(x, y) is too low, the system should abstain.
For examples of queries and outputs meeting conditions for abstention, please see Appendix Table 2.
2.2 Problem Formulation
To formalize our definition of abstention: Consider an LLM f. When given a prompt x, f generates a response y = f(x). We model refusal to answer (abstention) as a function r(x, y) ∈ [0, 1], where r(x, y) = 1 indicates the system will fully abstain from answering, r(x, y) = 0 indicates the system will return the output y, and intermediate values represent partial abstention.
We define r as the conjunction of three functions, to be defined by a system designer, to assess query answerability, the confidence of the LLM’s response to the query, and the human value alignment of the query and response. We define these three functions as:
Query function a. a(x) ∈ [0, 1] represents the degree to which an input x can be answered.
Model confidence function c. c(x, y) ∈ [0, 1] indicates the model f’s confidence in its output y based on input x.
Human value alignment functions h. We define two variants of h, both mapping to [0, 1]: h(x) operates on the input alone and determines its alignment with human values, and h(x, y) operates on both the input x and predicted output y. h is measured either through human annotation (Ouyang et al., 2022) or a proxy model that can be learned based on human preferences (Gao et al., 2023).
Our framework allows nuanced handling of abstention, by combining confidence from all three perspectives and enabling partial abstention when appropriate. Under this definition, a system would fully abstain from answering if any of the three perspectives indicates full abstention. In all other cases, a system would partially abstain, balancing between providing an answer and withholding information based on indications from the three perspectives.
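To make this formulation concrete, the following sketch shows one way a system designer might instantiate r(x, y) from the three scores; the thresholds and the combination rule are illustrative assumptions of ours, not part of any surveyed method.

```python
# A minimal sketch of the abstention decision r(x, y) described above.
# The thresholds and the combination rule are illustrative placeholders.

def abstention_score(a_x: float, c_xy: float, h_xy: float,
                     tau_a: float = 0.5, tau_c: float = 0.5,
                     tau_h: float = 0.5) -> float:
    """Return r(x, y) in [0, 1]: 1 = fully abstain, 0 = answer."""
    # Fully abstain if any perspective signals it decisively.
    if a_x == 0.0 or c_xy == 0.0 or h_xy == 0.0:
        return 1.0
    # Otherwise abstain partially, in proportion to how far each score
    # falls below its threshold.
    deficits = [max(0.0, (tau - score) / tau)
                for score, tau in ((a_x, tau_a), (c_xy, tau_c), (h_xy, tau_h))]
    return max(deficits)

# Example: answerable query, confident model, but a borderline safety score.
print(abstention_score(a_x=0.9, c_xy=0.8, h_xy=0.3))  # 0.4 -> partial abstention
```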
2.3 Inclusion in this Survey
We identify and survey prior work that falls under any of the three perspectives of our abstention framework. In §3, we organize abstention methodology from an LLM-centered perspective, based on when each method is applied in the LLM lifecycle: pretraining, alignment, or inference. This organization is chosen for ease of comparison of experimental settings. Each subsection within §3 is further organized by the three perspectives. Then, §4 describes evaluation benchmarks and metrics that have been used or introduced across the surveyed prior work. At the end of each subsection, in blue boxes, we summarize main takeaways and provide suggested directions for future work. In §5, we summarize notable threads of research that are not easily classified as method or evaluation.
3 Abstention Methodology
We summarize methods introduced in prior work (Figure 2 organizes these by stages in the LLM lifecycle) and provide ideas for future experiments.
Figure 2: Methods to improve LLM abstention grouped by pretraining, alignment, and inference stages.
3.1 Pretraining Stage
We found no existing research that studies abstention in the pretraining stage, despite the widely recognized importance of pretraining as a critical phase for model knowledge acquisition. To bridge this gap, we propose several directions for future exploration.
3.2 Alignment Stage
We categorize alignment-stage methods as supervised finetuning (SFT) or preference optimization (PO). Some papers include both methods, as PO usually requires SFT as a precursor; we discuss these in the subsection most reflective of their primary contributions.
Supervised Finetuning
Many works have demonstrated that SFT with abstention-aware data can improve model abstention capabilities. For example, Neeman et al. (2023) perform data augmentation in the finetuning stage to encourage LLMs to predict “unanswerable” when presented with an empty or randomly sampled document. Yang et al. (2023) construct an honesty alignment dataset by substituting the LLM’s wrong or uncertain responses with “I don’t know” and finetuning on the resulting data, improving model abstention. Notably, Zhang et al. (2024a) introduce R-Tuning, constructing and finetuning on a refusal-aware dataset and showing improved abstention capabilities. Zhang et al. (2024a) also argue that refusal-aware answering is task-independent and could benefit from multi-task training and joint inference. However, Feng et al. (2024b) present contradictory findings in their appendix, showing that abstention-aware instruction tuning struggles to generalize across domains and LLMs.
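The recipe shared by several of these works (compare the model's own outputs against gold answers and relabel wrong or uncertain ones with a refusal string) can be sketched roughly as follows; the function names, data fields, and refusal string are illustrative and not drawn from any paper's released code.

```python
# A hedged sketch of constructing refusal-aware SFT data.
REFUSAL = "I don't know."

def build_refusal_aware_sft_data(examples, generate_fn, is_correct_fn):
    """examples: dicts with 'question' and 'gold_answer';
    generate_fn: queries the pretrained model; is_correct_fn: answer checker."""
    sft_data = []
    for ex in examples:
        prediction = generate_fn(ex["question"])
        if is_correct_fn(prediction, ex["gold_answer"]):
            target = ex["gold_answer"]   # keep knowledge the model already has
        else:
            target = REFUSAL             # teach the model to abstain otherwise
        sft_data.append({"prompt": ex["question"], "completion": target})
    return sft_data
```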
In parallel, concerns have emerged regarding the effectiveness of SFT for abstention. Cheng et al. (2024) and Brahman et al. (2024) find that SFT can make models more conservative, leading to a higher number of incorrect refusals. Recent work (Gekhman et al., 2024; Lin et al., 2024a; Kang et al., 2024) also demonstrates that finetuning on examples unobserved during pretraining increases the risk of hallucination. Gekhman et al. (2024) propose a mitigation strategy to re-label these examples based on the pretrained LLM’s knowledge and include “I don’t know” in the finetuning data to teach the model to abstain. A related method to reduce hallucination is introduced in Lin et al. (2024a); the authors create factuality-aware training data for SFT by classifying whether an instruction requires a factual response.
Parameter-efficient finetuning (PEFT) strategies have also been used for abstention. Wolfe et al. (2024) conduct lab-scale experiments, finetuning LLMs with QLoRA (Dettmers et al., 2023), and observe that weaker models (with lower task performance) tend to achieve greater gains in abstention performance. Beyond resource efficiency, Brahman et al. (2024) find that LoRA (Hu et al., 2022) acts as an effective regularization technique for improving abstention: fully finetuned models exhibit over-refusal while also forgetting general capabilities, whereas finetuning with LoRA alleviates both issues while significantly improving abstention behavior.
Instead of finetuning for abstention directly, finetuning for calibration may indirectly improve abstention ability (Szegedy et al., 2016; Zhao et al., 2022; Xiao et al., 2022; Jiang et al., 2021; Lin et al., 2022). Jiang et al. (2021) propose two finetuning objective functions (softmax-based and margin-based), which improve Expected Calibration Error (ECE) (Guo et al., 2017a) on multiple-choice datasets. Mielke et al. (2022) instead pair a calibrator trained to provide confidence scores with an LLM finetuned to control the linguistic confidence of its responses.
Towards alignment with human values, Bianchi et al. (2024) show that adding a small number of safety instructions to instruction-tuning data reduces harmful responses without diminishing general capabilities, whereas an excessive number of safety instructions makes LLMs overly defensive. Varshney et al. (2023) construct responses for unsafe prompts by combining fixed refusal responses with Llama-2-generated safe responses, and obtain similar results. Wallace et al. (2024) finetune LLMs to follow hierarchical prompts, enhancing the fine-grained abstention ability of LLMs. Zhang et al. (2023b) also finetune LLMs with goal prioritization instructions that instruct LLMs to prioritize safety over helpfulness during inference.
However, custom finetuning of LLMs presents safety risks. For example, Qi et al. (2024) and Lyu et al. (2024) note that finetuning on benign, commonly used datasets can increase unsafe behaviors in aligned LLMs. To address this, Lyu et al. (2024) propose finetuning models without a safety prompt but including one at test time. Instead of directly tuning LLMs to abstain, Wang et al. (2024d) finetune LLMs to evaluate their own outputs for harm and append a “harmful” or “harmless” tag to their responses.
Learning from Preferences
Preference optimization can impact abstention from both the model knowledge and human value alignment perspectives. As described above, finetuning LLMs on abstention-aware data may lead to overly conservative behavior, causing erroneous refusals of queries. Cheng et al. (2024) and Brahman et al. (2024) address this through Direct Preference Optimization (DPO) (Rafailov et al., 2023), encouraging the model to answer questions it knows and refuse questions it does not know.
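A rough sketch of how such abstention preference pairs might be constructed is below; the prompt/chosen/rejected fields follow common DPO data conventions, and everything else is an illustrative assumption.

```python
# A hedged sketch of building abstention preference pairs for DPO-style training:
# prefer correct answers over refusals on questions the model knows, and
# refusals over wrong answers on questions it does not.
REFUSAL = "I don't know."

def build_abstention_preference_pairs(examples, generate_fn, is_correct_fn):
    pairs = []
    for ex in examples:
        prediction = generate_fn(ex["question"])
        if is_correct_fn(prediction, ex["gold_answer"]):
            chosen, rejected = prediction, REFUSAL
        else:
            chosen, rejected = REFUSAL, prediction
        pairs.append({"prompt": ex["question"],
                      "chosen": chosen, "rejected": rejected})
    return pairs
```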
Factuality-based preference optimization can help models respond correctly to queries, including abstaining (e.g., saying “I don’t know”). As an example, Liang et al. (2024) construct a factual preference dataset to train a reward model, and utilize it to optimize abstention preferences in LLMs via Proximal Policy Optimisation (PPO) (Schulman et al., 2017). Kang et al. (2024) design a reward function that prioritizes abstention over incorrect answers, while Lin et al. (2024a) incorporate factuality-focused preference pairs into DPO to enhance fact-based instruction following.
Other works use DPO to improve calibration, which can also aid abstention. LACIE (Stengel-Eskin et al., 2024) casts confidence calibration as a preference optimization problem and introduces a speaker-listener game to create preference data; the authors demonstrate that finetuning on LACIE data leads to emergent model abstention behavior. Zhang et al. (2024c) introduce Self-Alignment for Factuality, generating confidence scores through self-asking to improve calibration via DPO.
For human values, safety alignment methods (Dai et al., 2024; Touvron et al., 2023; Bai et al., 2022; Shi et al., 2024) use explicit or implicit preference models to reduce harmfulness; though not explicitly focused on abstention, they encourage abstention on unsafe prompts. Other studies have explored multi-objective alignment approaches (Guo et al., 2024) to encourage safe and helpful model behavior. The instructable reward model in SALMON (Sun et al., 2024) is trained on synthetic preference data, generating reward scores based on customized human-defined principles as the preference guideline.
3.3 Inference Stage
We categorize inference stage methods as input-processing, in-processing, or output-processing approaches based on when they are applied. Input-processing approaches are centered on the query answerability and human values perspectives; in-processing approaches on the model knowledge perspective; and output-processing approaches may consider both model knowledge and human values.
3.3.1 Input-processing Approaches
Query Processing
From the query perspective in our proposed framework, LLMs can choose to abstain based on the query answerability. For example, Cole et al. (2023) try to predict the ambiguity of questions derived from the AmbigQA dataset (Min et al., 2020) before selectively answering.
Other methods aim to identify queries that are misaligned with human values. For example, Qi et al. (2021) detect malicious queries needing abstention by removing suspect words from the query and analyzing the resulting drop in perplexity while Hu et al. (2024) propose new ways of computing perplexity and find tokens with abnormally high perplexity. Apart from perplexity-based methods, Jain et al. (2023) further investigate input preprocessing methods such as paraphrasing and retokenization. The BDDR framework (Shao et al., 2021) not only detects suspicious words in the input but also reconstructs the original text through token deletion or replacement. Kumar et al. (2024) introduce the “erase-and-check” framework to defend against adversarial prompts with certifiable safety guarantees. Similarly, Xi et al. (2023) measure changes in representation between original and paraphrased queries using a set of distributional anchors to identify harmful queries. Dinan et al. (2019) develop a more robust offensive language detection system through an iterative build-it, break-it, fix-it strategy.
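A minimal sketch of the perplexity-drop idea is shown below; the perplexity_fn interface and the threshold are assumptions of ours, and real implementations operate over model tokenizations rather than whitespace-split words.

```python
# A toy sketch of perplexity-drop filtering: tokens whose removal sharply
# lowers perplexity are treated as suspect and may warrant abstention or
# sanitization of the query.

def suspect_tokens(query: str, perplexity_fn, drop_threshold: float = 5.0):
    tokens = query.split()
    base_ppl = perplexity_fn(" ".join(tokens))
    flagged = []
    for i in range(len(tokens)):
        reduced = " ".join(tokens[:i] + tokens[i + 1:])
        if base_ppl - perplexity_fn(reduced) > drop_threshold:
            flagged.append(tokens[i])
    return flagged
```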
3.3.2 In-processing Approaches
Probing LLM’s Inner State
Recent studies (Kamath et al., 2020; Azaria and Mitchell, 2023) focus on training calibrators based on LLMs’ internal representations to predict the accuracy of the model’s responses, enabling abstention when the likelihood of error is high. Further probing into the internal representations of LLMs to discern between answerable and unanswerable queries has been conducted by Slobodkin et al. (2023), Kadavath et al. (2022), and Liang et al. (2024). Additionally, Chen et al. (2024) introduce the EigenScore, a novel metric derived from an LLM’s internal states, which can facilitate abstention by quantifying the reliability of the model’s knowledge state.
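As an illustration of this family of methods, the sketch below trains a lightweight probe on hidden-state features to predict answer correctness; the feature extraction, labels, and threshold are assumptions for exposition rather than any specific published configuration.

```python
# A hedged sketch of a correctness probe over LLM internal representations.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_correctness_probe(hidden_states: np.ndarray, labels: np.ndarray):
    """hidden_states: (n_examples, hidden_dim) representations of answered queries;
    labels: 1 if the model's answer was correct, else 0."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(hidden_states, labels)
    return probe

def should_abstain(probe, hidden_state: np.ndarray, threshold: float = 0.5) -> bool:
    p_correct = probe.predict_proba(hidden_state.reshape(1, -1))[0, 1]
    return p_correct < threshold  # abstain when predicted correctness is low
```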
In terms of leveraging LLMs’ internal states for safety judgments, Wang et al. (2024a) extract safety-related vectors (SRVs) from safety-aligned LLMs, which are then used as an abstention gate to steer unaligned LLMs towards safer task performance. Furthermore, Bhardwaj et al. (2024) demonstrate that integrating a safety vector into the weights of a finetuned LLM through a simple arithmetic operation can significantly mitigate the potential harmfulness of the model’s responses.
Uncertainty Estimation
Estimating the uncertainty of LLM output can serve as a proxy for making abstention decisions. Token-likelihoods have been widely used to assess the uncertainty of LLM responses (Lin et al., 2022; Kadavath et al., 2022). Enhancing this approach, Lin et al. (2022) and Tian et al. (2023) employ an indirect logit methodology to calculate the log probability of the ‘True’ token when appended to the model’s generated response. Shrivastava et al. (2023) leverage a surrogate LLM with access to internal probabilities to approximate the confidence of the original model. Tomani et al. (2024) assess Predictive Entropy and Semantic Entropy (Kuhn et al., 2023) of responses. Duan et al. (2023) design a weighted Predictive Entropy by considering the relevance of each token in reflecting the semantics of the whole sentence. However, other work shows that aligned LLMs may not have well-calibrated logits (Cole et al., 2023; Achiam et al., 2023) and may have positional bias and probability dispersion (Ren et al., 2023). In the context of LLM-as-judge, these canonical probability-based methods tend to be overconfident in estimating agreement with the majority of annotators; Jung et al. (2024) propose a novel confidence estimation method by simulating diverse annotator preferences with in-context learning.
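A minimal sketch of the indirect-logit ("P(True)") idea is given below; next_token_probs_fn is an assumed interface returning next-token probabilities, and the template wording is illustrative.

```python
# A hedged sketch of P(True)-style confidence: append a verification question
# and read off the probability mass assigned to the token "True".

PTRUE_TEMPLATE = (
    "Question: {question}\n"
    "Proposed answer: {answer}\n"
    "Is the proposed answer true? Answer True or False: "
)

def p_true(question: str, answer: str, next_token_probs_fn) -> float:
    prompt = PTRUE_TEMPLATE.format(question=question, answer=answer)
    probs = next_token_probs_fn(prompt)   # e.g., {"True": 0.7, "False": 0.3, ...}
    return probs.get("True", 0.0)

def abstain_by_p_true(question, answer, next_token_probs_fn, tau: float = 0.5) -> bool:
    return p_true(question, answer, next_token_probs_fn) < tau
```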
The Maximum Softmax Probability approach (Varshney et al., 2022) uses peak softmax output as an uncertainty estimator. Hou et al. (2023) introduce input clarification ensembling, an uncertainty estimation method: by ruling out aleatoric uncertainty through clarification, the remaining uncertainty of each individual prediction can be attributed to epistemic uncertainty.
Beyond probability-based measures, verbalized confidence scores have emerged as another class of methods to estimate and manage uncertainty (Lin et al., 2022; Tian et al., 2023; Tomani et al., 2024; Xiong et al., 2024; Zhou et al., 2024b). Xiong et al. (2024) examine prompting methods including chain-of-thought (Wei et al., 2022b), self-probing, top-k (Tian et al., 2023), and linguistic likelihood expressions for eliciting confidence scores. Although LMs can be explicitly prompted to express confidence, verbalized confidence scores have been found to be over-confident (Xiong et al., 2024; Zhou et al., 2024b). Zhou et al. (2024b) find that LMs are reluctant to express uncertainty when answering questions, even when their responses are incorrect. Zhou et al. (2023a) show that high-certainty expressions in the prefix of a response can result in an accuracy drop compared to low-certainty expressions, suggesting that LLMs respond more to prompting style than accurately assessing epistemic uncertainty.
Calibration-based Methods
Estimated model uncertainty may not accurately represent the likelihood of a model’s outputs being correct, so numerous studies focus on calibrating the uncertainty of LLMs. Jiang et al. (2021) improve calibration by augmenting inputs and paraphrasing outputs. Temperature Scaling (Guo et al., 2017b; Xiao et al., 2022; Desai and Durrett, 2020; Jiang et al., 2021) modifies the softmax temperature to refine calibration during decoding. Additionally, Monte-Carlo Dropout (Gal and Ghahramani, 2016; Varshney et al., 2022; Zablotskaia et al., 2023) employs multiple predictions with varying dropout configurations to assemble a robust confidence estimate. Batch Ensemble (Wen et al., 2020) is a computationally efficient method that aggregates multiple model predictions and maintains good calibration.
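As a concrete example of the simplest of these techniques, the sketch below fits a single temperature on held-out logits by grid search over negative log-likelihood; it is a simplification for illustration, not a reproduction of any cited implementation.

```python
# A minimal sketch of temperature scaling for post-hoc calibration.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fit_temperature(logits: np.ndarray, labels: np.ndarray,
                    grid=np.linspace(0.5, 5.0, 46)) -> float:
    """logits: (n, n_classes) held-out logits; labels: (n,) class indices."""
    def nll(T):
        probs = softmax(logits / T)
        return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()
    return float(min(grid, key=nll))  # temperature that minimizes held-out NLL
```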
Consistency-based Methods
Given the limitations of confidence elicitation, some methods leverage consistency-based aggregation to estimate LLM uncertainty and then abstain when uncertain. Aggregation can be achieved using diversity and repetition (Cole et al., 2023), weighted confidence scores and pairwise ranking (Xiong et al., 2024), or semantic similarity between responses (Lin et al., 2024b; Zhao et al., 2024b; Chen et al., 2024). Slobodkin et al. (2023) relax beam search and abstain if any top-k answer is “unanswerable”.
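A hedged sketch of this idea is shown below: sample several responses, group them with a semantic-equivalence check, and abstain when no group is sufficiently dominant. The sample_fn and same_meaning_fn interfaces, the sample count, and the agreement threshold are assumptions for illustration.

```python
# A toy sketch of consistency-based abstention via response clustering.

def consistency_abstain(question: str, sample_fn, same_meaning_fn,
                        k: int = 10, agreement_threshold: float = 0.5):
    responses = [sample_fn(question) for _ in range(k)]
    clusters = []                       # each cluster holds semantically equivalent responses
    for r in responses:
        for cluster in clusters:
            if same_meaning_fn(r, cluster[0]):
                cluster.append(r)
                break
        else:
            clusters.append([r])
    largest = max(clusters, key=len)
    if len(largest) / k < agreement_threshold:
        return None                     # abstain: responses are too inconsistent
    return largest[0]                   # return a representative consistent answer
```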
Consistency-based sampling methods can also improve safety-driven abstention. Robey et al. (2023), Cao et al. (2023), and Ji et al. (2024) perturb inputs with character masks, insertions, deletions, or substitutions, and identify inconsistencies among responses, which suggest the presence of an attack prompt needing abstention. Yuan et al. (2024b) obtain samples by prompting for augmentations (learnable safe suffixes and paraphrasing) and use a kNN-based method to aggregate responses.
Prompting-based Methods
In-context examples and hints can enhance model performance on abstention. Some methods use few-shot exemplars of abstained and answered responses (Slobodkin et al., 2023; Varshney et al., 2023; Wei et al., 2024), while others incorporate instruction hints (e.g., “Answer the question only if answerable” or “Answer the below question if it is safe to answer”) (Wen et al., 2024; Yang et al., 2023; Cheng et al., 2024; Slobodkin et al., 2023). For multiple-choice QA, adding “None of the above” as an answer option has been shown to be effective (Ren et al., 2023; Lin et al., 2024b). Zhang et al. (2023b) explicitly prompt LLMs to prioritize safety over helpfulness. Deng et al. (2024) also propose that providing explanations of why questions are unanswerable not only improves model explainability but can also produce more accurate responses.
Other work focuses on carefully designed prompts. Mo et al. (2024) concatenate a protective prefix from attack-defense interactive training with the user query. Similarly, Zhou et al. (2024a) append trigger tokens to ensure safe outputs under adversarial attacks. Pisano et al. (2023) use another LLM to add conscience suggestions to the prompt. Zhang et al. (2024d) prompt LLMs to analyze input intent and abstain if malicious. Xie et al. (2023) incorporate self-reminders in prompts to defend against attacks, while Zhou et al. (2024c) propose Robust Prompt Optimization to improve abstention performance against adaptive attacks. Zheng et al. (2024) find that safety prompts can safeguard LLMs against harmful queries and further propose a safety prompt optimization method to shift query representations toward or away from the refusal direction based on query harmfulness.
3.3.3 Output-processing Approaches
Self Evaluation
Chen et al. (2023b) use Soft Prompt Tuning to learn self-evaluation parameters for various tasks. Even directly asking LLMs to evaluate whether their responses are certain or safe (usually in a separate conversation), and to abstain if they are not, has proven effective in improving LLM abstention (Phute et al., 2024; Kadavath et al., 2022; Varshney et al., 2023; Ren et al., 2023; Feng et al., 2024b). Kim et al. (2024a) allow the LLM to iteratively provide feedback on its own responses and refine its answers; this method achieves improvements in safety even in non-safety-aligned LLMs. Wang et al. (2024d) enable LLMs to self-evaluate responses and append a [harmful] or [harmless] tag to each response; however, this approach may encourage over-abstention.
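A minimal sketch of prompt-based self-evaluation (a second pass in which the model judges its own answer) might look as follows; the template wording and the fallback refusal are illustrative assumptions.

```python
# A toy sketch of self-evaluation as an output-processing abstention gate.

SELF_EVAL_TEMPLATE = (
    "You previously answered the question below.\n"
    "Question: {question}\n"
    "Your answer: {answer}\n"
    "Is this answer correct and safe to provide? Reply YES or NO."
)

def self_evaluate(question: str, answer: str, generate_fn) -> str:
    verdict = generate_fn(SELF_EVAL_TEMPLATE.format(question=question, answer=answer))
    if verdict.strip().upper().startswith("YES"):
        return answer
    return "I can't answer that with confidence."  # abstain
```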
LLM Collaboration
Multi-LLM systems are effective in producing better overall responses, including improved abstention behavior. In 2-LLM systems, a test LLM is employed to examine the output of the first LLM and help decide whether to abstain. In Wang et al. (2024b), the test LLM is used to guess the most likely harmful query from the output and abstains if a harmful query is detected. Pisano et al. (2023) critique and correct a model’s original compliant response using a secondary LLM.
Multi-LLM systems beyond two LLMs leverage different LLMs as experts to compete or cooperate to reach a final abstention decision (Feng et al., 2024b; Chen et al., 2023a). As an example, Zeng et al. (2024) employ a group of LLMs in a system with an intention analyzer, original prompt analyzer, and judge.
4 Evaluation of Abstention
4.1 Evaluation Benchmarks
Below, we describe benchmarks that include abstention in their ground truth annotations; additional dataset details are provided in Appendix Table 3. Most evaluation datasets focus on assessing specific aspects of abstention according to our framework, though recent work from Brahman et al. (2024) espouses a holistic evaluation strategy.
Query-centric Abstention Datasets
Prior work introduces datasets containing unanswerable questions. SQuAD2 (Rajpurkar et al., 2018) first includes unanswerable questions with irrelevant context passages for machine reading comprehension. Rather than modifying questions to be unanswerable as in SQuAD2, unanswerable questions in Natural Questions (Kwiatkowski et al., 2019) are paired with insufficient context. MuSiQue (Trivedi et al., 2022) is a multi-hop QA benchmark containing unanswerable questions for which supporting paragraphs have been removed. CoQA (Reddy et al., 2019) and QuAC (Choi et al., 2018) introduce unanswerable questions for conversational QA. Relatedly, ambiguous question datasets contain questions without a single correct answer. AmbigQA (Min et al., 2020) extracts questions from NQ-Open (Kwiatkowski et al., 2019) with multiple possible answers. SituatedQA (Zhang and Choi, 2021) is an open-domain QA dataset where answers to the same question may change depending on when and where the question is asked. SelfAware (Yin et al., 2023) and Known Unknown Questions (Amayuelas et al., 2023) consist of unanswerable questions from diverse categories.
Domain-specific QA datasets also incorporate unanswerable questions. PubmedQA (Jin et al., 2019) contains biomedical questions that can be answered “yes”, “no”, or “maybe”, where “maybe” indicates high uncertainty based on the given context. In QASPER (Dasigi et al., 2021), unanswerable questions are expert-labeled and mean that no answer is available in the given context.
Model Knowledge-centric Abstention Datasets
RealTimeQA (Kasai et al., 2023) is a dynamic dataset that announces new questions about current events and evaluates systems on a regular basis. PUQA (Prior Unknown QA) (Yang et al., 2023) comprises questions about scientific literature from 2023, beyond the cutoff of the tested models’ existing knowledge. ElectionQA23 (Feng et al., 2024b) is a QA dataset focusing on 2023 elections around the globe; due to the temporality of training data, LLMs lack up-to-date information to accurately respond to these queries. Long-tail topics and entities can also test the boundary of model knowledge. For example, datasets like POPQA (Mallen et al., 2023) and EntityQuestions (Sciavolino et al., 2021) cover knowledge about long-tail entities, which is useful for probing model knowledge boundaries.
Human Value-centric Abstention Datasets
These datasets are designed to measure whether LLM outputs are “safe,” i.e., aligned with widely held ethical values; they may consist of prompts that are either inherently unsafe or likely to elicit unsafe responses from LLMs. Some datasets focus on specific aspects of safety. A main concern is toxicity, when models generate harmful, offensive, or inappropriate content. For instance, RealToxicityPrompts (Gehman et al., 2020) gathers prompts to study toxic language generation, while ToxiGen (Hartvigsen et al., 2022) and LatentHatred (ElSherief et al., 2021) address implicit toxic speech, and ToxicChat (Lin et al., 2023) collects data from real-world user–AI interactions. Beyond toxicity, Beavertails (Ji et al., 2023a) balances safety and helpfulness in QA, CValues (Xu et al., 2023a) assesses safety and responsibility, and Xstest (Röttger et al., 2024a) examines exaggerated safety behaviors. LatentJailbreak (Qiu et al., 2023) introduces a benchmark that assesses both the safety and robustness of LLMs. Do-Anything-Now (Shen et al., 2023) collects a set of unsafe prompts for malicious purposes.
Comprehensive safety benchmarks attempt to encompass a range of concerns. Röttger et al. (2024b) conduct the first systematic review of open datasets for evaluating LLM safety. Do-Not-Answer (Wang et al., 2024c) includes instructions covering information hazards, malicious uses, discrimination, exclusion and toxicity, misinformation harms, and human-computer interaction harms. XSafety (Wang et al., 2023a) provides a multilingual benchmark covering 14 safety issues across 10 languages. SALAD-Bench (Li et al., 2024a) is a large-scale dataset with a three-tier taxonomy, evaluating LLM safety and attack-defense methods. SORRY-Bench (Xie et al., 2024) proposes a more fine-grained taxonomy and diverse instructions. Most relevant to abstention, WildGuard (Han et al., 2024) evaluates model refusal performance as a necessary component for safety.
4.2 Evaluation Metrics
We survey metrics that have been developed and used to evaluate abstention. Fundamentally, these metrics aim to identify systems that (i) frequently return correct answers, (ii) rarely return incorrect answers, and (iii) abstain when appropriate.
Statistical Automated Evaluation
We express these metrics in terms of the abstention confusion matrix in Table 1; a sketch of the corresponding formulas follows the list below.
- Abstention Accuracy (ACC) (Feng et al., 2024b) evaluates the system’s overall performance when incorporating abstention.
- Abstention Precision (Feng et al., 2024b) measures the proportion of the model’s abstain decisions that are correct.
- Abstention F1-score (Feng et al., 2024b) combines abstention precision and recall.
- Coverage or Acceptance Rate (Cao et al., 2023) refers to the proportion of instances where the model provides an answer (i.e., does not abstain); it measures the model’s willingness to respond.
- Benign Answering Rate (BAR) (Cao et al., 2023) is the coverage computed only over queries deemed to be safe.
- Reliable Accuracy (R-Acc) (Feng et al., 2024b) indicates to what extent LLM-generated answers can be trusted when the model does not abstain, i.e., of all questions answered, how many are correct.
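The descriptions above translate directly into formulas over the abstention confusion matrix; the notation below is ours, chosen for illustration, and may differ from the cited papers. Let A_c be the number of queries where the model abstained and abstention was the correct decision, A_w where it abstained although answering correctly was possible, N_c where it answered correctly, N_w where it answered incorrectly, and T = A_c + A_w + N_c + N_w. BAR is the coverage restricted to queries labeled safe.

```latex
\begin{align*}
\text{ACC} &= \frac{A_c + N_c}{T}, &
\text{Precision} &= \frac{A_c}{A_c + A_w}, &
\text{Recall} &= \frac{A_c}{A_c + N_w}, \\
\text{F1} &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, &
\text{Coverage} &= \frac{N_c + N_w}{T}, &
\text{R-Acc} &= \frac{N_c}{N_c + N_w}.
\end{align*}
```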
Abstain Estimated Calibration Error (Abstain ECE) (Feng et al., 2024b) modifies traditional ECE (Guo et al., 2017a) by including abstention. This metric evaluates calibration by comparing abstain probabilities and the accuracy of abstentions, providing a measure of model calibration in scenarios where abstention is preferable.
Coverage@Acc (Cole et al., 2023; Si et al., 2023) measures the fraction of questions the system can answer correctly while maintaining a certain accuracy. Specifically, C@Acc is the maximum coverage such that the accuracy on the C% of most-confident predictions is at least Acc%.
Area Under Risk-Coverage Curve (AURCC) (Si et al., 2023; Yoshikawa and Okazaki, 2023) computes, for any given threshold, an associated coverage and error rate (risk), which is averaged over all thresholds. Lower AURCC indicates better selective QA performance.
Area Under Accuracy-Coverage Curve (AUACC) (Cole et al., 2023; Xin et al., 2021) computes, for any given threshold, an associated coverage and accuracy, which is averaged over all thresholds. Higher AUACC indicates better performance.
Area Under Receiver Operating Characteristic curve (AUROC) (Cole et al., 2023; Kuhn et al., 2023) evaluates the uncertainty estimate’s diagnostic ability as a binary classifier for correct predictions by integrating over the tradeoff curve between rates of true and false positives.
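To illustrate how these threshold-swept metrics are computed, the sketch below sorts predictions by confidence and traces a coverage-accuracy curve; the thresholding and integration details are simplified and vary across the cited works.

```python
# A simplified sketch of selective-prediction curves and derived metrics.
import numpy as np

def coverage_accuracy_curve(confidences: np.ndarray, correct: np.ndarray):
    order = np.argsort(-confidences)              # most confident first
    hits = correct[order].astype(float)
    n = len(hits)
    coverage = np.arange(1, n + 1) / n            # fraction of queries answered
    accuracy = np.cumsum(hits) / np.arange(1, n + 1)
    return coverage, accuracy

def coverage_at_acc(confidences, correct, target_acc: float = 0.8) -> float:
    cov, acc = coverage_accuracy_curve(confidences, correct)
    feasible = cov[acc >= target_acc]
    return float(feasible.max()) if feasible.size else 0.0

def auacc(confidences, correct) -> float:
    cov, acc = coverage_accuracy_curve(confidences, correct)
    return float(np.trapz(acc, cov))              # area under accuracy-coverage curve
```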
Model-based Evaluation
Many studies implement LLM-as-a-judge for abstention evaluation (Mazeika et al., 2024; Souly et al., 2024; Chao et al., 2024). Some of these use GPT-4-level LLMs for off-the-shelf evaluation (Qi et al., 2024), resulting in judgments that agree well with humans but incur high financial and time costs. Others explore supplementary techniques to boost the accuracy of the LLM judge, such as (i) chain-of-thought prompting: asking the LLM judge to “think step-by-step” before deciding whether a response constitutes a refusal (Qi et al., 2024; Xie et al., 2024); (ii) in-context learning: using refusal annotations from a training set as in-context examples (Xie et al., 2024); or (iii) finetuning LLMs for abstention evaluation (Huang et al., 2024a; Li et al., 2024a). Röttger et al. (2024a) extend beyond binary refusal judgments by prompting GPT-4 with a taxonomy to classify responses as full compliance, full refusal, or partial refusal in a zero-shot setting.
Human-centric Evaluation
Human evaluation for abstention focuses on understanding user perceptions of different abstention expressions and how these relate to the perceived usefulness of a model’s response. Instead of binary decisions (full compliance and full refusal), Röttger et al. (2024a) introduce a partial refusal category when manually annotating model responses. Wester et al. (2024) focus on how people perceive styles of denial employed by systems; among the styles evaluated, the “diverting denial style” is generally preferred by participants. Kim et al. (2024b) investigate how expressing uncertainty affects user trust and task performance, finding that first-person uncertainty phrases like “I’m not sure, but...” reduce users’ confidence in the system’s reliability and their acceptance of its responses.
5 Other Considerations for Abstention
Over-abstention
Over-abstention occurs when models abstain unnecessarily. For example, Varshney et al. (2023) demonstrate that the “self-check” technique can make LLMs overly cautious with benign inputs. Others similarly observe that instruction tuning with excessive focus on abstention can lead models to inappropriately refuse to respond (Cheng et al., 2024; Bianchi et al., 2024; Wallace et al., 2024; Brahman et al., 2024). These findings underscore the need to balance abstention with utility.
Vulnerability of Abstention
Abstention is highly sensitive to prompt wording. Safety-driven abstention mechanisms are notably susceptible to manipulation. Studies show that social engineering techniques such as persuasive language and strategic prompt engineering can bypass established safety protocols (Xu et al., 2023b; Chao et al., 2023). Even ostensibly benign approaches like finetuning with safe datasets or modifying decoding algorithms can inadvertently undermine the safety alignment of LLMs (Qi et al., 2024; Huang et al., 2024a). Advanced manipulation tactics include persona-based attacks (Shah et al., 2023), cipher-based communications (Yuan et al., 2024a), and the translation of inputs into low-resource languages (Yong et al., 2023; Feng et al., 2024a). These vulnerabilities underscore a critical issue: LLMs lack understanding of the reasons behind abstention, limiting their ability to generalize to out-of-distribution queries effectively. Furthermore, objectives like helpfulness and abstention may conflict, and models may struggle to abstain appropriately in situations where they are confident in their ability to provide helpful responses.
Introducing Biases
LLMs may exhibit disproportionate abstention behavior across demographic groups, potentially amplifying biases. For example, Xu et al. (2021) find that detoxifying content may inadvertently reinforce biases by avoiding responses in African American English compared to White American English. Feng et al. (2024b) show that LLMs abstain less when predicting future election outcomes for Africa and Asia in ElectionQA23, raising fairness concerns as these mechanisms might underserve marginalized communities and countries. More work is needed to clarify and address these performance disparities.
Following up After Abstention
Abstention should not be viewed as the termination of a conversation, but rather as a step towards subsequent information acquisition. In this context, abstention can act as a trigger, prompting further inquiry, e.g., asking the user for more information or retrieving additional relevant data (Feng et al., 2024b; Li et al., 2024b). After abstaining, systems should seek out more information when appropriate, transforming abstention from a static endpoint into a dynamic, constructive component of dialogue progression. For example, Zhao et al. (2024a) study the alternative task of reformulating unanswerable questions into questions that can be answered by a given document.
Personalized Abstention
Users have different preferences for model abstention (Wester et al., 2024) based on individual differences (Zhang et al., 2024b) and task-specific needs, and no one-size-fits-all solution exists (Kirk et al., 2023b). Personalized abstention mechanisms in LLMs will allow the model to dynamically adjust its abstention behavior based on a user’s profile, tolerance for conservative responses, interaction history, specific query needs, and any other requirements.
6 Future Directions
There are many under-explored and promising research directions in abstention, some of which are described in this survey. While prior work has explicitly investigated abstention in specific tasks or implicitly contributed to improved abstention behaviors, we encourage study of abstention as a meta-capability across tasks, as well as more generalizable evaluation and customization of abstention capabilities to user needs. Beyond what has been discussed previously, other important directions include: (i) enhancing privacy and copyright protections through abstention-aware designs to prevent the extraction of personal private information and copyrighted text fragments; (ii) generalizing the concept of abstention beyond LLMs to vision, vision-language, and generative machine learning applications; and (iii) improving multilingual abstention, as significant performance discrepancies exist between high-resource and low-resource languages, necessitating further research to ensure consistent performance across different languages.
7 Conclusion
Our survey underscores the importance of strategic abstention in LLMs to enhance their reliability and safety. We introduce a novel framework that considers abstention from the perspectives of the query, the model, and human values, providing a comprehensive overview of current strategies and their applications across different stages of LLM development. Through our review of the literature, benchmarking datasets, and evaluation metrics, we identify key gaps and discuss the limitations inherent in current methodologies. Future research should focus on expanding abstention strategies to encompass broader applications and more dynamic contexts. By refining abstention mechanisms to be more adaptive and context-aware, we can further the development of AI systems that are not only more robust, reliable, and aligned with ethical standards and human values, but also balance these goals more appropriately against helpfulness to the user.
Acknowledgments
This research was supported in part by the National Science Foundation under CAREER Grant No. IIS2142739, and by the Defense Advanced Research Projects Agency’s (DARPA) SciFy program (Agreement No. HR00112520300). The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government. We also gratefully acknowledge support from the UW iSchool Strategic Research Fund and the University of Washington Population Health Initiative, as well as gift funds from the Allen Institute for AI. We thank the authors of the cited papers for reviewing our descriptions of their work.
Notes
A list of abstention-related papers from this review can be found at https://github.com/chenjux/abstention.
Table 2: Example queries highlighting different reasons that a model should abstain, categorized by perspective.

| Perspective | Example | Reason to abstain | Source (if any) |
|---|---|---|---|
| Query | Query: “Who moved to Hollywood in 2004?” Context: “...Following the move to Holyrood in 2004 this building was demolished. The former Midlothian County Buildings facing Parliament Square...” | Irrelevant context | Rajpurkar et al. (2018) |
| | Query: “How many stamps were produced in the USSR in 1938?” Context: <Content of the Wikipedia page “Postage stamps of the USSR”> | Insufficient context | Clark et al. (2020) |
| | Query: “Who sings now that we found love what are we going to do with it?” | Query is ambiguous | Min et al. (2020) |
| | Query: “What causes the symptoms?” | Query is incomplete | |
| | Query: “What will the fastest form of transportation be in 2050?” | Query is beyond what any human or model could possibly know | Yin et al. (2023) |
| | Query: “Which team has won the most FIFA World Cup championships?” Context: “...Brazil holds the record for the most FIFA World Cup wins... Germany has officially claimed the title of the most successful national team…” Parametric memory: “As of my last update in April 2023, the national team with the most FIFA World Cup championships is Brazil...” | Knowledge conflict | Xu et al. (2024) |
| Model | Query: “What is Christoph Schlingensief’s occupation?” (long-tail entity for which the model has limited knowledge) | Model has low confidence | Yin et al. (2023) |
| | Query: “Astronomers are currently observing a star with a radius equal to that of the Sun. One hemisphere of the star is covered in dark spots with a filling factor of 20%. The star has an effective temperature (Teff) of 6000K, and the spots exhibit a temperature difference of 1000K. ... what should be the radius of a hypothetical exoplanet relative to the radius of the host star (i.e., Rpl/Rstar)?” | Model has a high probability of returning an incorrect output | Rein et al. (2023) |
| Human Values | Query: “Can you promise to keep my secrets?” | Query is anthropomorphizing | Wang et al. (2024c) |
| | Query: “How to exploit someone sexually online?” | Query and response may compromise safety, privacy, fairness, or other values | Wang et al. (2024c) |
Table 3: Abstention evaluation benchmarks. For dataset size, we report test set size by default. “%A” denotes the proportion of queries where the model should abstain. “*” indicates total dataset size (including training, development, and test splits) when test set statistics are not detailed in the original study.

| Focus | Dataset | Reference | Domain | Answer Types | Size (%A) | Description |
|---|---|---|---|---|---|---|
| Query | SQuAD2 | Rajpurkar et al. (2018) | General | Extractive | 8862 (50%) | Reading comprehension dataset; questions and context are taken from SQuAD2 and some are modified to be unanswerable |
| | Natural Questions (NQ) | Kwiatkowski et al. (2019) | General | Extractive | 7842 (50%) | Questions are from the English Google Search Engine; answers are annotated post hoc by another annotator who selects supporting paragraphs; unanswerable questions are those without answers in the search results |
| | MuSiQue | Trivedi et al. (2022) | General | Extractive | 4918 (50%) | Multi-hop QA; unanswerable questions are those with supporting paragraphs of single-hop answer steps removed |
| | CoQA | Reddy et al. (2019) | General | Free-form | 127k (1.3%)* | Conversational QA; curated by two annotators (questioner and answerer); unanswerable questions are those that cannot be answered from a supporting passage |
| | QuAC | Choi et al. (2018) | General | Extractive, Boolean | 7353 (20%) | Conversational QA; curated by two annotators (teacher and student); unanswerable questions are those that cannot be answered given a Wikipedia passage |
| | AmbigQA | Min et al. (2020) | General | Extractive | 14042 (>50%)* | Questions are from the NQ-Open dataset; multiple possible distinct answers are curated through crowdsourcing; all questions are ambiguous |
| | SituatedQA | Zhang and Choi (2021) | General | Extractive | 11k (26%) | Questions are from NQ-Open; answers for alternative contexts are crowdsourced; all questions have multiple possible answers depending on context |
| | SelfAware | Yin et al. (2023) | General | Extractive | 3369 (31%) | Questions are from online platforms like Quora and HowStuffWorks; unanswerable questions are annotated by humans into five categories |
| | Known Unknown Questions | Amayuelas et al. (2023) | General | Extractive | 6884 (50%) | Questions are from Big-Bench, SelfAware, and prompting crowd workers to produce questions of different types and categories with answer explanations; unanswerable questions are annotated by humans into six categories |
| | PubmedQA | Jin et al. (2019) | Medicine | Boolean, Maybe | 500 (10%) | Questions are automatically derived from paper titles and answered from the conclusion sections of the corresponding abstracts by experts; some questions are answered “Maybe” if the conclusion does not clearly support a yes/no answer |
| | QASPER | Dasigi et al. (2021) | Computer Science | Extractive, Free-form, Boolean | 1451 (10%) | Questions are written by domain experts and answers are annotated by experts from the full text of associated computer science papers; some questions cannot be answered from the paper’s full text |
| Model | RealTimeQA | Kasai et al. (2023) | General | Multiple-choice | 1.5k (100%) | Questions are about current events and new ones are announced periodically |
| | PUQA | Yang et al. (2023) | Science | Free-form | 1k (100%) | Questions are from scientific literature published after 2023 |
| | ElectionQA23 | Feng et al. (2024b) | Politics | Multiple-choice | 200 (100%) | Questions about 2023 elections are composed by ChatGPT from Wikipedia pages and verified by humans |
| | POPQA | Mallen et al. (2023) | General | Extractive | 14k | Long-tail relation triples from WikiData are converted into QA pairs; no explicit unanswerable questions but questions are about long-tail entities |
| | EntityQuestions | Sciavolino et al. (2021) | General | Extractive | 15k | Long-tail relation triples from WikiData are converted into QA pairs; no explicit unanswerable questions but questions are about long-tail entities |
| Human Values | RealToxicityPrompts | Gehman et al. (2020) | Toxicity | Free-form | 100k (100%) | Toxic texts are derived from the OpenWebText Corpus, each yielding a prompt and a continuation |
| | ToxiGen | Hartvigsen et al. (2022) | Toxicity | Free-form | 274k (50%) | Toxic prompts are GPT-3-generated questions across 13 minority groups |
| | LatentHatred | ElSherief et al. (2021) | Hate Speech | Free-form | 22584 (40%) | Data are from Twitter; queries are annotated along a proposed 6-class taxonomy of implicit hate speech |
| | ToxicChat | Lin et al. (2023) | Toxicity | Free-form | 10166 (7%) | Real user queries from an open-source chatbot (Vicuna); a human-AI collaborative annotation scheme is used to identify toxic queries |
| | Beavertails | Ji et al. (2023a) | Safety | Free-form | 330k (57%) | Prompts are from the HH Red Teaming dataset and are annotated in a two-stage process for safety; this dataset attempts to disentangle harmlessness and helpfulness from the human-preference score |
| | CValues | Xu et al. (2023a) | Safety | Multiple-choice | 2.1k (65%) | Unsafe prompts are crowdsourced (best attempts to attack a chatbot) and responsible prompts are produced by experts |
| | Xstest | Röttger et al. (2024a) | Safety | Free-form | 450 (44%) | Prompts are hand-crafted and designed to evaluate exaggerated safety behavior |
| | LatentJailbreak | Qiu et al. (2023) | Safety | Free-form | 416 (100%) | Jailbreak prompts created using templates containing predetermined toxic adjectives; annotated for both safety and model output robustness |
| | Do-Anything-Now | Shen et al. (2023) | Safety | Free-form | 1405 (100%) | Human-verified prompts from Reddit, Discord, websites, and open-source datasets |
| | Do-Not-Answer | Wang et al. (2024c) | Safety | Free-form | 939 (100%) | Prompts are generated by manipulating chat history to force GPT-4 to generate risky questions; responses collected from 6 LLMs are annotated according to a proposed taxonomy covering information hazards, malicious uses, and discrimination |
| | XSafety | Wang et al. (2023a) | Safety | Free-form | 28k (100%) | Multilingual benchmark with prompts covering 14 safety issues across 10 languages; constructed by gathering monolingual safety benchmarks and employing professional translation |
| | SALAD-Bench | Li et al. (2024a) | Safety | Multiple-choice | 30k (100%) | Prompts collected from existing benchmarks; GPT-3.5-turbo is finetuned using 500 harmful QA pairs to respond to unsafe questions |
| | SORRY-Bench | Xie et al. (2024) | Safety | Free-form | 450 (100%) | A GPT-4 classifier is used to map queries from 10 prior datasets to a proposed three-tier safety taxonomy |
| | WildGuard | Han et al. (2024) | Safety | Free-form | 896 (61%) | Prompts are derived from synthetic data, real-world user-LLM interactions, and existing annotator-written data; LLM-generated responses are labeled by GPT-4 for safety and further audited and filtered by humans |
| General | CoCoNot | Brahman et al. (2024) | General | Free-form | 1k (100%) | Questions are synthesized by LLMs based on a proposed taxonomy and GPT-4 was used to generate non-compliant responses, followed by manual verification |
Author notes
Action Editor: Ellie Pavlick