Abstract
The rapid proliferation of large language models and natural language processing (NLP) applications creates a crucial need for uncertainty quantification to mitigate risks such as hallucinations and to enhance decision-making reliability in critical applications. Conformal prediction is emerging as a theoretically sound and practically useful framework, combining flexibility with strong statistical guarantees. Its model-agnostic and distribution-free nature makes it particularly promising to address the current shortcomings of NLP systems that stem from the absence of uncertainty quantification. This paper provides a comprehensive survey of conformal prediction techniques, their guarantees, and existing applications in NLP, pointing to directions for future research and open challenges.
1 Introduction
Natural language processing (NLP) is witnessing an explosive growth in applications and public visibility, namely with large language models (LLMs) being deployed in many real-life applications, ranging from general-purpose chatbots to the generation of medical reports (Min et al., 2023). However, the widespread use of these models brings important concerns: Hallucinations are frequent (Ji et al., 2023; Guerreiro et al., 2023), models are poorly calibrated (Vasudevan et al., 2019; Desai and Durrett, 2020; Zhou et al., 2024), evaluation is limited and sometimes affected by data contamination (Sainz et al., 2023; Golchin and Surdeanu, 2024), explanations are often unreliable (Zhao et al., 2024; Wiegreffe and Pinter, 2019), and models often exhibit undesired biases (Gallegos et al., 2024). Reliable uncertainty quantification is key to addressing some of these concerns: NLP systems should not only provide accurate answers but also “know when they do not know”.
Unfortunately, some NLP systems return only single predictions (i.e., point estimates), without reliable confidence information. Systems that quantify uncertainty are less common and typically limited in various ways: They often make incorrect distributional assumptions, ignoring the complex nature of the underlying data and model (Xiao and Wang, 2019; He et al., 2020; Glushkova et al., 2021; Zerva et al., 2022); they are often poorly calibrated (i.e., they predict a confidence level that does not match the true error probability; Kuleshov et al., 2018); and they may be computationally too demanding, thus inapplicable to large-scale models (Hu et al., 2023).
Conformal prediction (CP; Vovk et al., 2005) has recently emerged as a promising candidate to bypass the issues above: Unlike other uncertainty quantification frameworks, it offers statistical guarantees of ground-truth coverage with minimal assumptions. CP methods are model-agnostic and distribution-free, assuming only data exchangeability (as described in §3.1). Moreover, extensions of CP that handle non-exchangeable data have recently been proposed (Gibbs and Candes, 2021; Barber et al., 2023). Popular CP variants are also efficient: They do not require model retraining and can be used online or offline, given an additional relatively small calibration set.1 Finally, equalized variants of CP (Romano et al., 2020) can also reduce biases and unfairness, by distributing coverage evenly across protected attributes.
The flexibility and strong statistical guarantees of CP have attracted considerable interest, with an increasing number of publications in computer science.2 It is therefore timely to present a survey of conformal methods for NLP, revealing the theory and guarantees behind these methods and outlining opportunities and challenges for them to tackle important problems in the field.
Scope.
This survey provides a comprehensive overview of CP techniques for NLP tasks (Figure 1). After briefly explaining CP and some relevant extensions (§2 and §4), we review direct applications thereof in NLP (§5). Finally, we look at possible threads of future investigation and current open issues concerning the use of CP in NLP (§6).
What this survey is not about.
This is not a general survey on uncertainty quantification and does not include techniques not based on CP. Comprehensive reviews of uncertainty quantification in NLP were recently published by Baan et al. (2023) and Hu et al. (2023). Also, our survey is focused on NLP applications; Angelopoulos and Bates (2023) and Shafer and Vovk (2008) have published comprehensive surveys on CP.
2 Conformal Predictors
This section briefly explains CP and presents some definitions and results needed for understanding the applications mentioned below. In what follows, we use upper case letters (X, Y, ...) for random variables, lower case letters (x, y, ...) for specific values they take, and calligraphic letters (𝒳, 𝒴, ...) for sets.
2.1 Definitions and Ingredients
Consider a prediction task where 𝒳 and 𝒴 are the input and output sets, respectively. The most common procedure is to learn/train a mapping f : 𝒳 → 𝒴, which, given an input xtest unseen during training, returns a point prediction ŷtest, hopefully close to the “true” target ytest, according to some performance metric. A weakness of point predictions is the absence of information about uncertainty. In contrast, for the same input xtest, a conformal predictor yields a prediction set 𝒞(xtest) ⊆ 𝒴, ideally small, which includes the target ytest with some high (user-chosen) probability, say 1 − α.
Consider an example involving a pretrained model which classifies a clinical report x with a label y ∈ 𝒴, e.g., a disease. This is a high-risk scenario requiring strong reliability guarantees. For a random test report Xtest, a conformal predictor yields a set 𝒞(Xtest) ⊆ 𝒴 of possibly multiple labels, with the guarantee that P(Ytest ∈ 𝒞(Xtest)) ≥ 1 − α.3 Figure 2 illustrates the CP procedure for this task, which we describe next in detail.
Split4 CP (Vovk et al., 2005) is built from three ingredients: a trained predictor f; a calibration set 𝒟cal = {(x1, y1), …, (xn, yn)}, independent from the set used to train the predictor; and a non-conformity score s : 𝒳 × 𝒴 → ℝ. The non-conformity score measures how unlikely an input-output pair is, compared with the remaining data. Consequently, given a test sample xtest, candidate pairs (xtest, y) deemed conformal with the calibration data should have a low non-conformity score and should thus be included in the prediction set 𝒞(xtest).
The choice of non-conformity score is task-dependent. For example, for a classifier outputting an estimate p(y|x) of the posterior probability for each possible label (e.g., via a softmax output layer), a common and natural choice is s(x, y) = 1 −p(y|x), with lower values of s(x, y) implying that the sample is more conformal with the previously seen data.
2.2 Prediction Procedure
The procedure for generating 𝒞(xtest) for new, unseen test instances xtest is as follows:
Calibration Step:
1. Compute (s1, …, sn), the non-conformity scores for 𝒟cal, where si = s(xi, yi);
2. Set q̂ to be the ⌈(n + 1)(1 − α)⌉/n empirical quantile of the calibration set scores.
Prediction Step:
3. Output the prediction set, using the quantile q̂, as 𝒞(xtest) = {y ∈ 𝒴 : s(xtest, y) ≤ q̂}.
The intuition is that the prediction set includes all predictions corresponding to samples that are more conformal than a sufficiently large fraction of the calibration set. Note that the calibration step needs to be computed only once and the obtained quantile can then be used for all new test instances (Figure 2).
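To make these two steps concrete, here is a minimal sketch of split CP for a classifier with softmax outputs, using the non-conformity score s(x, y) = 1 − p(y|x) from §2.1. The function and variable names (and the use of NumPy) are our own illustrative choices, not part of any specific library.

```python
import numpy as np

def calibrate(cal_probs, cal_labels, alpha=0.1):
    """Calibration step: compute the conformal quantile q_hat.

    cal_probs:  (n, K) array of softmax outputs for the n calibration examples.
    cal_labels: (n,) array of true label indices.
    """
    n = len(cal_labels)
    # Non-conformity score s(x, y) = 1 - p(y|x), evaluated at the true label.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # ceil((n + 1)(1 - alpha)) / n empirical quantile of the calibration scores.
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, level, method="higher")

def predict_set(test_probs, q_hat):
    """Prediction step: include every label whose score does not exceed q_hat."""
    scores = 1.0 - test_probs                    # (m, K) scores, one per label
    return [np.where(row <= q_hat)[0] for row in scores]
```

For example, with n = 1000 and α = 0.1, q̂ is the ⌈1001 · 0.9⌉/1000 = 0.901 empirical quantile of the calibration scores.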
2.3 Relation to Hypothesis Testing
CP can equivalently be framed as hypothesis testing: for each candidate label y, we test whether the pair (xtest, y) conforms with the calibration data. The prediction set is then built as follows:
1. compute p-values for all labels y ∈ 𝒴, where p-value(xtest, y) = (1 + |{i : si ≥ s(xtest, y)}|) / (n + 1);
2. generate the prediction set as 𝒞(xtest) = {y ∈ 𝒴 : p-value(xtest, y) > α}.
A disadvantage of this approach is that it needs access to the calibration scores at test time. On the other hand, the p-values do not need a preset α and can be used to evaluate predictions, as shown next.
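Before moving on, here is a minimal sketch of this p-value formulation, using the smoothed p-value definition given above; as before, all names are illustrative.

```python
import numpy as np

def p_value(cal_scores, test_score):
    """Smoothed conformal p-value for one (x_test, y) candidate pair."""
    n = len(cal_scores)
    return (1 + np.sum(np.asarray(cal_scores) >= test_score)) / (n + 1)

def prediction_set_from_p_values(cal_scores, test_probs, alpha=0.1):
    """Keep every label whose p-value exceeds alpha."""
    test_scores = 1.0 - np.asarray(test_probs)   # one score per candidate label
    return [y for y, s in enumerate(test_scores)
            if p_value(cal_scores, s) > alpha]
```

This produces essentially the same sets as the quantile rule of §2.2, but, as noted above, it requires keeping the calibration scores available at test time.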
3 Guarantees and Evaluation
Prediction sets produced by conformal predictors have associated theoretical guarantees. This section reviews established coverage guarantees as well as evaluation metrics used to measure the quality of the prediction sets.
3.1 Theoretical Guarantees
A predictor satisfying the coverage inequality given in Theorem 1, P(ytest ∈ 𝒞(xtest)) ≥ 1 − α, is said to be valid.5 Note that, as the size of the calibration set increases, the probability of coverage tends to exactly 1 − α. It is worth noting that the CP procedure we described is model-agnostic and distribution-free, i.e., it makes no assumption about the data distribution, requiring only data exchangeability.
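Validity is also easy to audit empirically: on a held-out test set, the fraction of prediction sets containing the true label should be at least 1 − α, up to finite-sample fluctuations. A small illustrative check (our own naming, reusing the predict_set sketch from §2.2):

```python
import numpy as np

def empirical_coverage(pred_sets, test_labels):
    """Fraction of test examples whose prediction set contains the true label."""
    hits = [label in set(pred_set) for pred_set, label in zip(pred_sets, test_labels)]
    return float(np.mean(hits))

# e.g., empirical_coverage(predict_set(test_probs, q_hat), test_labels) >= 1 - alpha
```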
3.2 Set Efficiency Metrics
When assessing the quality of a conformal predictor, an important aspect beyond validity is efficiency: The prediction sets should be relatively small and adaptive—easier cases should yield smaller sets than harder observations. The efficiency of a conformal predictor depends on the trained predictor f and the chosen non-conformity score, which is typically based on some heuristic notion of prediction uncertainty, e.g., using the softmax output of a model (§2.1).
Consider a separate test set 𝒟test = {(xn+1, yn+1), …, (xn+m, yn+m)}. Some metrics, called a priori, do not require access to the test set labels. This is the case of the average prediction set size (or interval width, in regression tasks), (1/m) Σi |𝒞(xn+i)|, computed as a function of α. Using the test set labels, an informative a posteriori metric is the observed fuzziness, computed as the average p-value mass on the false labels, (1/m) Σi Σy≠yn+i p-value(xn+i, y), which should be as small as possible, since correct labels should have high p-values, whereas incorrect labels should have low p-values. These metrics can also be useful to evaluate adaptivity and bias, by comparing them over different partitions of the dataset, e.g., split by a particular feature.
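Both metrics are straightforward to compute from the prediction sets and p-values; the sketch below reuses the smoothed p-value from §2.3 and is, again, only an illustrative rendering.

```python
import numpy as np

def p_value(cal_scores, s):
    """Smoothed conformal p-value (same definition as in §2.3)."""
    return (1 + np.sum(np.asarray(cal_scores) >= s)) / (len(cal_scores) + 1)

def average_set_size(pred_sets):
    """A priori metric: mean prediction set size (no test labels needed)."""
    return float(np.mean([len(s) for s in pred_sets]))

def observed_fuzziness(cal_scores, test_probs, test_labels):
    """A posteriori metric: average p-value mass on the false labels (lower is better)."""
    total = 0.0
    for probs, true_y in zip(test_probs, test_labels):
        scores = 1.0 - np.asarray(probs)
        total += sum(p_value(cal_scores, s)
                     for y, s in enumerate(scores) if y != true_y)
    return total / len(test_labels)
```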
3.3 Pointwise Metrics
Conformal predictors provide pointwise uncertainty metrics that can be used even in the forced prediction approach, i.e., producing a single prediction ŷi, the label with the highest p-value (typically coinciding with the original output of the point predictor), rather than the prediction set. Two common metrics in this case are credibility, p-value(xi, ŷi), and confidence, computed from the second largest p-value, maxy≠ŷi p-value(xi, y). Credibility measures the reliability of a prediction based on its p-value: A higher p-value indicates that a larger portion of the calibration points are less conformal than the prediction, indicating higher credibility. Confidence is measured by looking at the second largest p-value, i.e., by evaluating the second prediction candidate as an alternative to the actual prediction: the lower this value, the more confidence we have in the actual prediction. These metrics make use of the calibration set to measure uncertainty and can be extremely useful, even when the full prediction set produced by the conformal predictor is disregarded.
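A small sketch of forced prediction with these two metrics follows; in line with the common convention, confidence is reported as one minus the second largest p-value, so that higher is better for both quantities. All names are illustrative.

```python
import numpy as np

def p_value(cal_scores, s):
    """Smoothed conformal p-value (same definition as in §2.3)."""
    return (1 + np.sum(np.asarray(cal_scores) >= s)) / (len(cal_scores) + 1)

def forced_prediction(cal_scores, probs):
    """Return (predicted label, credibility, confidence) for one test input."""
    p_vals = np.array([p_value(cal_scores, s) for s in 1.0 - np.asarray(probs)])
    order = np.argsort(p_vals)[::-1]             # labels sorted by decreasing p-value
    y_hat = int(order[0])                        # label with the highest p-value
    credibility = float(p_vals[order[0]])
    confidence = 1.0 - float(p_vals[order[1]])   # one minus the second largest p-value
    return y_hat, credibility, confidence
```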
4 Extending Conformal Prediction
CP has been extended beyond classic conformal predictors, with developments that handle challenges such as conditional coverage, dispensing with exchangeability, or obtaining guarantees beyond coverage. This section briefly presents the core ideas of some of the extensions that are most relevant for NLP applications.
4.1 Conditional Conformal Predictors
A label-conditional conformal predictor runs the calibration step of §2.2 separately for each label, using only the calibration examples with that label. Assuming exchangeability (Eq. 2), this procedure is guaranteed to satisfy label-conditional coverage (Eq. 4). This label-conditional construction is a particular case of Mondrian conformal predictors, which apply to any mapping of the data into Mondrian taxonomies (Vovk et al., 2005). The same rationale can be used to obtain coverage across different partitions of the data, such as across a particular feature stratification.
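A minimal sketch of this label-conditional calibration is given below; it assumes every label occurs in the calibration set, and the names are, as before, illustrative.

```python
import numpy as np

def mondrian_calibrate(cal_probs, cal_labels, num_classes, alpha=0.1):
    """One conformal quantile per label (label-conditional / Mondrian CP)."""
    q_hat = {}
    for k in range(num_classes):
        mask = cal_labels == k
        scores = 1.0 - cal_probs[mask, k]    # scores of examples whose true label is k
        n_k = int(mask.sum())                # assumes n_k > 0 for every label
        level = min(np.ceil((n_k + 1) * (1 - alpha)) / n_k, 1.0)
        q_hat[k] = np.quantile(scores, level, method="higher")
    return q_hat

def mondrian_predict_set(probs, q_hat):
    """Include label k whenever its score is below the label-specific quantile."""
    return [k for k, q in q_hat.items() if 1.0 - probs[k] <= q]
```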
4.2 Beyond Exchangeability
All theoretical guarantees presented so far are rooted in the assumption of data exchangeability (Eq. 2). However, this assumption is unrealistic in many NLP applications: For example, it is incompatible with the conditional nature of most language generation methods. Several extensions have been proposed to handle non-exchangeable data, including covariate and label shift (Tibshirani et al., 2019; Podkopaev and Ramdas, 2021), time series (Chernozhukov et al., 2018; Xu and Xie, 2021; Angelopoulos et al., 2023), and other types of shift (Gibbs and Candes, 2021).
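A common ingredient in these extensions (e.g., Barber et al., 2023) is to replace the empirical quantile of §2.2 with a weighted quantile, where calibration points believed to be more relevant to the test point (more recent in time, or closer in embedding space) receive larger weights. The sketch below is a simplified rendering of this idea with user-supplied weights, not a faithful reimplementation of any one of the cited methods.

```python
import numpy as np

def weighted_conformal_quantile(cal_scores, weights, alpha=0.1):
    """(1 - alpha) quantile of the weighted score distribution, with a point
    mass at +inf standing in for the (unknown) test score."""
    scores = np.append(np.asarray(cal_scores, dtype=float), np.inf)
    w = np.append(np.asarray(weights, dtype=float), 1.0)  # weight 1 for the test point
    w = w / w.sum()                                        # normalize the weights
    order = np.argsort(scores)
    cum = np.cumsum(w[order])
    idx = int(np.searchsorted(cum, 1 - alpha))             # first score with enough mass
    return scores[order][idx]
```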
4.3 Conformal Risk Control
While coverage guarantees are useful in many tasks, there are cases where the adequate notion of error control is not captured solely by guaranteeing that the prediction set contains the ground truth. Some extensions of CP address these cases.
Angelopoulos et al. (2024) consider multilabel classification, where each target Y ⊆ 𝒴 is a set of labels. The loss function to be controlled is thus defined on pairs of sets of labels, ℓ(𝒞(x), Y), and assumed to satisfy monotonicity: A ⊆ B ⇒ ℓ(A, Y) ≥ ℓ(B, Y), for any Y. They define prediction sets 𝒞λ(x) = {y ∈ 𝒴 : f(y|x) ≥ 1 − λ}, where f(y|x) ∈ [0, 1] is the softmax output of class y, given by predictor f for input x, and λ is a parameter. Invoking loss monotonicity yields that ℓ(𝒞λ(x), Y) is nonincreasing in λ, for any (x, Y): larger λ produce larger sets and thus smaller losses, so the calibration set can be used to choose a value of λ that controls the expected loss (risk) at a user-specified level α. The related Learn Then Test (LTT) procedure generalizes this approach to non-monotone losses and multiple risk constraints by casting the choice of λ as a multiple hypothesis testing problem; it underlies several of the applications reviewed in §5.
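Concretely, for a loss bounded by B, conformal risk control picks the smallest λ whose slightly inflated empirical risk on the calibration set is at most α. The sketch below follows that recipe on a coarse grid of λ values; the grid, the data layout, and the names are our own simplifications.

```python
import numpy as np

def conformal_risk_control(cal_probs, cal_label_sets, loss_fn, alpha=0.1, B=1.0):
    """Return the smallest lambda (on a grid) whose adjusted calibration risk <= alpha.

    cal_probs:      (n, K) softmax outputs.
    cal_label_sets: list of n sets of true labels (multilabel setting).
    loss_fn(pred_set, true_set): monotone loss bounded by B (larger sets => smaller loss).
    """
    n = len(cal_label_sets)
    for lam in np.linspace(0.0, 1.0, 101):
        pred_sets = [set(np.where(p >= 1 - lam)[0]) for p in cal_probs]
        risk = float(np.mean([loss_fn(s, t) for s, t in zip(pred_sets, cal_label_sets)]))
        if (n / (n + 1)) * risk + B / (n + 1) <= alpha:
            return float(lam)          # smallest valid lambda found on the grid
    return 1.0                         # fall back to the most inclusive sets
```

A typical choice of loss in this multilabel setting is the false negative rate, ℓ(𝒞, Y) = 1 − |𝒞 ∩ Y|/|Y|, which is monotone and bounded by B = 1.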
4.4 Other CP Variants
Full Conformal Prediction.
Introduced by Vovk et al. (2005), full CP differs from the split version in two aspects: It does not use a separate calibration set, but the entire training set; and it involves model refitting—given a new instance, a model is trained for each possible label6 and used with the full data set to compute the non-conformity scores and obtain the prediction set. The clear disadvantage of full CP is the high computational cost of this retraining. Its advantage is data efficiency: full conformal predictors can be used when data are limited and model retraining is not too expensive, while providing the same validity guarantees (Lei et al., 2018).
Cross-validation and Jackknifing.
The goal of these methods is to achieve a balance between statistical and computational efficiency. Cross-conformal predictors (Vovk, 2015) apply the cross-validation rationale to split conformal predictors. Each cross-validation fold is used as a calibration set once and the p-values are computed using all folds. These predictors, although lacking proven validity guarantees, have shown good empirical results (Vovk et al., 2018). Inspired by this idea, Barber et al. (2021) propose the so-called jackknife+, a leave-one-out scheme, and prove validity for regression under some conditions.
Density-based Conformal Prediction.
Hechtlinger et al. (2019) propose a different approach to the conformal procedure, based on p(x|y) instead of the typical p(y|x) to build more cautious predictors that should output the null set when underconfident. This method can be useful to abstain from answering when given an outlier observation. They show promising results using adversarial attacks on different tasks.
Venn-Abers Predictors.
This class of probabilistic predictors has guarantees proved by Vovk and Petej (2014). They produce one probability distribution per possible label and guarantee that one of the predictive distributions is perfectly calibrated, with no assumptions on the model or data distribution. Venn-Abers predictors have been shown to be a good calibration tool, with the added benefit that the distance between the different probability distributions provides calibrated uncertainty quantification (Johansson et al., 2023). A more efficient split variant is proposed by Lambrou et al. (2014), and Manokhin (2017) presents a multi-class generalization.
5 Applications in NLP
CP has been used in several NLP tasks, both to obtain validity/calibration guarantees on predictions and within a pipeline, e.g., to safely prune intermediate outputs with guaranteed coverage, achieving computational speedups (§5.4). This section reviews several such applications, organized by use case.
5.1 Text Classification and Sequence Tagging
For classification and tagging tasks, models are often accurate but lack reliable confidence estimates.
Binary Text Classification.
Maltoudoglou et al. (2020) build a conformal predictor on top of a BERT classifier (Devlin et al., 2019) for binary sentiment classification. They show that the conformal predictor with forced prediction retains the original model’s accuracy while providing useful accompanying measures of credibility and confidence. For the same task, Messoudi et al. (2020) use density-based CP (§4.4). They report good performance and empirical validity, highlighting the usefulness of having such a predictor by considering noisy and outlier observations: The CP set contains both classes for the noisy example and is empty for the outliers, showing the desired discriminatory power. Zhan et al. (2022) automate the identification of literature on drug-induced liver injury, using conformal prediction to manage prediction uncertainty and guarantee reliability.
Classification with Conditional Coverage.
Mondrian CP (§4.1) has been successfully applied to unbalanced classification tasks, such as sentiment analysis, with good efficiency results (Norinder and Norinder, 2022). Giovannotti and Gammerman (2021) compare split, Mondrian, and cross-conformal (Vovk, 2015) CP on unbalanced paraphrase detection and report that the theoretically expected efficiency drop for Mondrian CP is small, making it useful in practice.
POS Tagging.
Dey et al. (2022) present promising results by showing that CP based on the softmax outputs of a BERT model for POS tagging yields practical prediction sets even at high confidence levels on a large test set: At the 99% confidence level, fewer than 4% of the prediction sets had more than one answer.
Multilabel Tasks.
CP has been used for multilabel text classification, where multiple labels can be assigned to an input, e.g., document categorization. In the label powerset approach (Tsoumakas et al., 2010), which treats each possible combination of labels as a class, there is an added challenge due to the large output space. Paisios et al. (2019) show how CP can be used in this setting, exploring different task-appropriate non-conformity scores for the task of categorization of news articles. The forced prediction method (§3.3) shows negligible performance drops (as a consequence of part of the training data being set aside for calibration) while providing reliable credibility measures; moreover, the prediction sets were tight and well-calibrated at high confidence levels. Maltoudoglou et al. (2022) build on top of the aforementioned work and propose an efficient computational approach that allows a higher number of possible labels to be considered. Fisch et al. (2022) tackle the multilabel case under the need to limit false positive predictions—a type of constraint that arises naturally in many highly sensitive tasks, e.g., identification of drug properties—by using a computationally efficient method that provides the desired coverage and constraint guarantees for an NER task, reporting prediction sets of useful size.
A different approach has been considered in tasks such as document retrieval, where it may be of interest to obtain prediction sets with at least one admissible correct answer. Fisch et al. (2021b) present an efficient conformal procedure to find such sets. They exploit the fact that simpler and lighter models can be used first in the pipeline to reduce the number of output candidates, producing a sequence of conformally valid candidates that are passed on to more complex models, showing that the final output is guaranteed to yield the user desired coverage.
Dealing with Limited Data.
CP has also been found useful in providing guarantees for tasks with limited amounts of data. Fisch et al. (2021a) tackle few-shot relation classification with CP procedures that meta-learn both non-conformity measures and a threshold predictor from auxiliary tasks with larger amounts of available data. Not only do the predicted sets for the final task meet the coverage requirements, but they are also small (average set size smaller than 2 at the 95% confidence level). A different approach is used by Dutta et al. (2023) for estimating uncertainty in zero-shot biomedical image captioning with CLIP models (Radford et al., 2021): They query the Web to get a calibration set and design a CP protocol that takes into account the plausibility of each calibration point, obtaining small prediction sets that maintain coverage even in the absence of originally labeled calibration data. In a setting with limited reliable data, Zhan et al. (2023) use CP to clean possibly mislabeled training data, based on a small curated amount of data as a calibration set. They explore the effects of removing or relabeling the noisy data identified by the conformal procedure and show performance improvements on the downstream text classification task for different levels of induced noise.
5.2 Natural Language Generation
Despite their impressive capabilities, large language models are prone to hallucinations (Huang et al., 2023; Ji et al., 2023). The close link between hallucinations and a lack of uncertainty awareness makes CP a promising approach to tackle this issue. Yadkori et al. (2024) use conformal risk control to obtain an upper bound on the hallucination risk, developing an abstention procedure. In spite of its potential, the application of CP to language generation faces two big challenges: (i) the combinatorially large size of output sets and (ii) the conditional (recursive) nature of language generation, which violates the exchangeability assumption underlying standard CP.
Sentence-level Conformal Prediction.
Most research on CP for NLP tries to circumvent the issues above by operating at the sentence level, e.g., by first sampling multiple options and then reformulating the problem as a multiple choice question (Kumar et al., 2023). For instance, an LLM can be used to generate plans (expressed in natural language) for a robot to follow but a single plan alone may result in unfeasible or risky actions. Ren et al. (2023) build upon the methods presented in §2 to calibrate the confidence of LLM planners, providing formal guarantees for task completion while minimizing human help. Specifically, they look at the next-token probability to assess the uncertainty of different possible actions (i.e., they use it to compute the non-conformity score, as described in §2.2) and generate CP sets. If the prediction set is not a singleton, the robot should ask for help; otherwise, it should continue to execute the plan. Liang et al. (2024) further enhance this framework by incorporating an “introspective reasoning” step (Leake, 2012), which leads to tighter prediction bounds, while Wang et al. (2024) consider teams of robots.
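The ask-for-help rule at the core of these approaches can be sketched as follows: each candidate action is scored by the planner's next-token probability, a standard split-CP quantile turns these scores into a prediction set, and the robot asks for help whenever the set is not a singleton. This is a simplified illustration of the idea, not the authors' implementation; all names are placeholders.

```python
import numpy as np

def plan_step(action_probs, q_hat):
    """One planning step: act autonomously only if the conformal set is a singleton.

    action_probs: model probabilities for each candidate action (e.g., next-token
                  scores from an LLM planner, renormalized over the options).
    q_hat:        quantile from split-CP calibration on held-out (context, action) pairs.
    """
    scores = 1.0 - np.asarray(action_probs)       # non-conformity score per action
    pred_set = [a for a, s in enumerate(scores) if s <= q_hat]
    if len(pred_set) == 1:
        return {"action": pred_set[0], "ask_for_help": False}
    # Ambiguous (or empty) prediction set: defer to a human rather than act.
    return {"action": None, "ask_for_help": True, "options": pred_set}
```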
Sentence-level Risk Control.
Quach et al. (2024) show how LTT (§4.3) can be used to calibrate a stopping rule for sampling outputs from a language model that are added to a growing set of candidates until they are confident that the set includes at least one acceptable hypothesis (Fisch et al., 2021b)—this was applied, for example, to document retrieval for fact verification, where the presence of one admissible document is sufficient to solve the task. Simultaneously, they calibrate a rejection rule to remove low-quality and redundant candidates. They use Pareto testing (Laufer-Goldshtein et al., 2023) to efficiently search and test the high-dimensional hyperparameter configuration. The resulting output sets are not only valid but also precise (i.e., small). Angelopoulos et al. (2024) and Farinhas et al. (2024) apply conformal risk control to open-domain question answering, whereas Ernez et al. (2023) do it for speech recognition. While the former calibrate the best token-based F1-score of the prediction set in Eq. 6, the latter control the word error rate to an adjustable level of guarantee. Finally, Zollo et al. (2023) discuss how prompts that perform well on average on a validation set may be prone to produce poor generations with high probability in deployment and propose prompt risk control based on upper bounds on families of informative risk measures.7 Specifically, they bound the worst-case toxicity (Hanu and Unitary team, 2020) in chatbots, the expected loss (pass@K, Kulal et al., 2019) in code generation, and the dispersion of ROUGE scores (Lin, 2004) in medical summarization.
Token-level Approaches.
While the approaches above focus on full sentences, language models generate text by successively producing new tokens autoregressively. Nucleus sampling (Holtzman et al., 2020) samples each token from the smallest set whose cumulative probability exceeds a threshold. However, Ravfogel et al. (2023) observe that LLMs tend to be overconfident—the prediction sets used in nucleus sampling are not calibrated (see their Figure 4)—and this does not improve by scaling up the model size. They propose conformal nucleus sampling, which calibrates prediction sets within bins of similar entropies. As an alternative, Ulmer et al. (2024) take non-exchangeability (§4.2) into account by using a dynamic calibration step. They use the k nearest neighbors of the test point, with data-dependent relevance weights based on the squared ℓ2 distance between embedding representations. This leads to smaller prediction sets compared to previous approaches while maintaining the desired coverage level in machine translation and language modeling.
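As an illustration of the token-level setting, here is a simplified sketch of the idea behind conformal nucleus sampling: calibration tokens are grouped into bins by the entropy of the next-token distribution, a separate conformal quantile is computed per bin, and at generation time the current distribution is mapped to its bin to obtain the set of tokens to sample from. The bin edges, the score choice, and all names are our own simplifications, and we assume every bin is populated during calibration.

```python
import numpy as np

def entropy(p):
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def calibrate_per_entropy_bin(cal_probs, cal_tokens, bin_edges, alpha=0.1):
    """One conformal quantile per entropy bin of the next-token distribution."""
    bins = np.digitize([entropy(p) for p in cal_probs], bin_edges)
    scores = 1.0 - np.asarray(cal_probs)[np.arange(len(cal_tokens)), cal_tokens]
    q_hat = {}
    for b in np.unique(bins):
        s_b = scores[bins == b]
        level = min(np.ceil((len(s_b) + 1) * (1 - alpha)) / len(s_b), 1.0)
        q_hat[int(b)] = np.quantile(s_b, level, method="higher")
    return q_hat

def conformal_token_set(next_token_probs, q_hat, bin_edges):
    """Tokens eligible for sampling, given the current next-token distribution."""
    b = int(np.digitize([entropy(next_token_probs)], bin_edges)[0])
    return np.where(1.0 - np.asarray(next_token_probs) <= q_hat[b])[0]
```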
5.3 Uncertainty-Based Evaluation
CP can also be used to assist in evaluating and benchmarking NLP models. Two main approaches employ CP to that end: (i) using it to assess the confidence of different models and compare them accordingly; (ii) framing evaluation as a regression task (i.e., learning to score the model outputs to predict human perceived quality and using CP to provide reliable confidence intervals).
Focusing on the former approach, Ye et al. (2024) apply CP to benchmark the performance of different LLMs. They use prompt engineering to turn different generation tasks (question answering, summarization, commonsense inference, etc.) into multiple-choice questions such that the models need to predict a letter corresponding to each candidate output. They subsequently attempt to quantify the uncertainty of the language model over the possible labels, conformalizing the softmax outputs for each candidate label. They show that high model accuracy does not necessarily imply high certainty; in some cases, an inverse correlation between accuracy and certainty is observed. Based on their findings, Ye et al. (2024) propose an uncertainty-aware metric accounting for both accuracy and uncertainty (encoded as set size).
Focusing instead on the latter approach, Giovannotti (2023) applies CP to referenceless MT evaluation (quality estimation), using conformal predictive distributions (CPDs), as introduced by Vovk et al. (2017), to estimate the probability distribution around each quality estimate: a k-NN model produces the quality scores, and the distances between each point and its neighbors form the non-conformity scores. These non-conformity scores provide a proxy for the uncertainty of each quality estimate, so CP yields calibrated confidence intervals for MT quality estimation. Zerva and Martins (2023), on the other hand, apply CP on top of non-conformity heuristics coming from other uncertainty quantification methods for reference-based MT evaluation and discuss how the choice of method can impact coverage and interval width. They also highlight biases in the estimated confidence intervals, reflected in imbalanced coverage for attributes such as translation language and quality, and demonstrate how these can be addressed with equalized CP. While focused on MT, the proposed approaches are applicable to other NLP evaluation or regression tasks.
5.4 Faster Inference
Given the high computational requirements of state-of-the-art NLP models and their widespread use, considerable effort is being put into making these models more time- and memory-efficient (Deng et al., 2020). Several strategies for increasing efficiency at prediction time (e.g., early exiting; Liu et al., 2019; Schwartz et al., 2020) focus on identifying easily classifiable instances and using a lighter version of the original model to predict them. Such instances must be reliably identified, and the original and simplified models should produce the same result for a given input with high probability.
Early Exiting Transformers Fine-tuning.
Schuster et al. (2021) extend CP to build a method that speeds up inference in transformer models while guaranteeing an adjustable degree of consistency with the original model, with high confidence. The rationale is to skip directly to the final layer from one of the previous layers whenever there is enough confidence. They use a binary meta-classifier to predict whether the lighter model is consistent with the original one and use CP to predict the set of inconsistent models. The final procedure consists of exiting at the first layer that exceeds the threshold found by the conformal procedure. Their method shows reduced inference time in several classification and regression tasks.
Filtering Labels.
Choubey et al. (2022) tackle the computational efficiency problem in zero-shot text classification with pretrained language models, exploiting the fact that inference time increases with the number of possible labels. They use CP on top of a simple and fast base text classifier to reduce the number of possible labels passed to the final, more complex, language model. They experiment on different classification tasks, testing different choices of non-conformity scores and different base models, and exploring the trade-off between efficiency and accuracy when choosing the complexity of the base model.
Speeding Up Inference.
To obtain the lightest possible model while preserving performance, Laufer-Goldshtein et al. (2023) propose a CP method to find optimal thresholds to guarantee several risk constraints with adjustable high probability, while optimizing another objective function. They report results on several text classification tasks with different objectives, such as minimizing prediction cost (searching thresholds on all pruning directions), while controlling accuracy reduction (drop in performance from the full to a lighter model) to a user-chosen degree. Their method builds upon the LTT procedure (§4.3), with an efficient technique to reduce the number of parameter combinations tested, using Pareto-optimal solutions (Deb and Kalyanmoy, 2001). The results show significant efficiency gains with the proposed risk-controlling guarantees.
Schuster et al. (2022) make text generation more efficient by considering decoder early exiting at the token level, while bounding global efficiency. They leverage the LTT procedure to obtain risk-controlling solutions with dynamic allocation of compute per generated token and test their approach on news summarization, text translation and open question answering, showing efficiency gains with the required quality guarantees.
6 Future Directions
We outline in this section some promising future research directions and open challenges related to the use of CP and its many variants in NLP tasks.
6.1 CP for Human–Computer Interaction
Some tasks in NLP, such as recommendation and predictive writing systems, benefit naturally from prediction sets that can be used to offer suggestions to users. CP offers an opportunity to improve the efficiency and quality of such systems, and prediction sets can be used to enhance performance in decision-making with humans in the loop (Cresswell et al., 2024). This aspect could be further explored in NLP, as there are numerous scenarios involving human feedback, e.g., interactive MT (Green et al., 2013; Wang et al., 2021) or the creation of human preference data for LLM alignment (Stiennon et al., 2020; Fernandes et al., 2023).
6.2 CP for Handling Label Variation
The complexity and ambiguity of natural language, as well as the varied human perspectives, make it hard to disentangle model uncertainty from valid, naturally occurring label variation (Baan et al., 2024; Plank, 2022; Baan et al., 2022). It is often the case that multiple outputs are correct, particularly in tasks involving high variation in human language production (question answering, summarization, and other generation tasks where several output variants are equivalent) or inherent, plausible disagreement (the ChaosNLI data that demonstrates valid disagreements in textual inference annotations [Pavlick and Kwiatkowski, 2019]). While traditional methods focus on the majority class, or see variation as model uncertainty, CP yields a more faithful representation of label variation. Besides representing uncertainty, the sets produced by CP provide multiple “equivalent” labels, allowing for more interpretable and informed predictions. Further research on such scenarios could provide models that behave better in tasks with high label variation. Moreover, in such cases, CP can also be used to achieve diverse prediction sets, avoiding redundancy, as suggested by Quach et al. (2024).
6.3 CP for Fairness
The increased use of NLP systems in global daily life and high-risk tasks raises concerns about the fairness of their outputs. Many of these systems have been shown to be biased (Blodgett et al., 2020). In tasks such as resume filtering, medical diagnosis assistance, and several others, these biases can be extremely harmful, leading to skewed performance and coverage. CP can be used to achieve equalized coverage for different population groups (Romano et al., 2020), thus “correcting” biases in model predictions without the need for expensive retraining. The open research problem of finding conditional guarantees (Gibbs et al., 2023) to obtain pointwise error bounds can also contribute towards fairness in NLP applications.
6.4 CP for Dealing with Data Limitations
Learning and quantifying uncertainty with limited data is challenging, particularly in NLP problems where manual text labeling can be difficult, time-consuming, and expensive. Approaches to leverage limited data, such as active learning, make use of uncertainty quantification in order to reduce the need for manual labeling (Settles, 2009). In these settings, CP could be used for reliable uncertainty quantification, e.g., selecting points with larger prediction sets for manual labeling. The predicted sets can also be useful to reduce the possible labels in tasks with high cardinality output spaces, increasing the performance of subsequent predictions. Another option is to use CP for data filtering and cleaning to increase the performance of LLMs (Marion et al., 2023), using for example a small reliable set for calibration, in order to identify mislabeled or noisy samples.
6.5 CP for Uncertainty-Aware Evaluation
CP is also useful for tackling the current challenge of model evaluation. There are concerns about the way NLP systems are currently evaluated: e.g., how confident can we be in evaluations in which one LLM scores the output of another? Evaluating a conformal predictor built on top of a predictor can be a more reliable way to assess model performance. Another useful application of CP is to compare different uncertainty heuristics and transformations of model outputs by designing distinct non-conformity scores and evaluating the efficiency (e.g., set size, conditional coverage, observed fuzziness) of the resulting predictors (§5.3).
6.6 Open Challenges
Despite its numerous applications, using CP in NLP poses challenges, particularly in generation tasks, providing exciting areas for further research.
Sentence Level.
The high cardinality of the output space in generation tasks raises a challenge for typical CP applications. There are open questions on how to sample the possible outputs and on the impact of considering only a finite set of samples.
Token Level.
The non-exchangeability of the data, tackled by Barber et al. (2023), Ulmer et al. (2024), and Farinhas et al. (2024), still presents an obstacle, since it is not currently easy to (i) quantify the coverage gap, because the bound in Eq. 5 involves a total variation distance between unknown distributions, which is hard to estimate; or (ii) find good strategies for choosing the weights.
Distribution Shift.
Several extensions to distribution drift and non-exchangeable data have been proposed for time series by Chernozhukov et al. (2018, 2021b), Xu and Xie (2021), Stankeviciute et al. (2021), Zaffran et al. (2022), and Angelopoulos et al. (2023) and for other scenarios by Cauchois et al. (2020), Gibbs and Candes (2021), Chernozhukov et al. (2021a), and Gibbs and Candès (2022). However, it remains unclear how to apply them in the context of NLP.
7 Discussion
Significant efforts have been made to improve the quality and confidence estimation of NLP systems. In fact, recent techniques for generating text with LLMs often involve the generation of multiple hypotheses, followed by a reranking (or voting) stage to increase the likelihood of producing a high-quality prediction. These methods include minimum Bayes risk decoding (Kumar and Byrne, 2004; Eikema and Aziz, 2020), reranking based on quality estimation (Fernandes et al., 2022), and other strategies (Wang et al., 2023; Suzgun et al., 2023). While these approaches have been shown to improve quality and reduce the number of generated hallucinations (Farinhas et al., 2023; Guerreiro et al., 2023), they do not inherently quantify the uncertainty of their predictions or provide the formal guarantees offered by CP. Sensitive tasks, such as those in fields like medicine and education, may require guarantees. In these cases, the system could abstain from answering or defer to a human whenever its confidence is below a certain threshold. CP provides theoretical guarantees in a distribution-free setting, making it highly useful in such scenarios.
Other calibration methods have been used to obtain uncertainty intervals for regression and classification tasks in NLP. Calibration methods that target accuracy-related metrics, e.g., aiming to minimize the expected calibration error (ECE), have been widely used (Li et al., 2023; Wang et al., 2022). Still, they suffer from significant drawbacks, as they are sensitive to binning choices and to small changes in model estimates (Błasiok et al., 2023; Roelofs et al., 2022; Minderer et al., 2021; Chidambaram et al., 2024). Furthermore, they do not provide formal guarantees to control risk and objectives of interest. CP can thus be used to complement such methods, especially in cases where ensuring coverage is deemed important.
8 Conclusion
This paper provides an overview of applications of the conformal prediction framework in NLP tasks, after a brief introduction to that framework and its main variants. We showed how conformal prediction is a promising tool to address the uncertainty quantification challenge in NLP and hope the existing and possible applications presented in this survey will motivate future research on the topic.
Acknowledgments
This work was supported by the Portuguese Recovery and Resilience Plan through project C645008882- 00000055 (NextGenAI - Center for Responsible AI), by the EU’s Horizon Europe Research and Innovation Actions (UTTER, contract 101070631), by the project DECOLLAGE (ERC-2022-CoG 101088763), and by Fundação para a Ciência e Tecnologia through contract UIDB/50008/2020.
Notes
For most purposes, a reasonable calibration set size is of the order of 1000 samples (Angelopoulos and Bates, 2023).
The number of arXiv papers in the field of computer science containing the expression “conformal prediction” has been steadily rising, from 16 papers in 2018 to 224 in 2023.
Note that the probability is over (Xtest, Ytest), not conditioned on a particular Xtest = xtest. We discuss conditional coverage in §4.1.
Although split (a.k.a. inductive) CP was developed after the full (a.k.a. transductive) variant (described in §4.4), it is more widely used due to its computational efficiency.
Although there are other definitions of validity in the CP literature (Vovk et al., 2005), this is the most common one, termed conservative coverage validity.
For regression, discretization is typically used.
They use the terms loss and risk in a distinctive way. Loss refers to scoring the quality of a single sample generation (e.g., ROUGE); risk measures some aspect of the distribution of the loss across the population (e.g., mean).
References
Author notes
Action Editor: Roi Reichart