Abstract
Computational linguistics models commonly target the prediction of discrete (categorical) labels. When assessing how well calibrated these model predictions are, popular evaluation schemes require practitioners to manually determine a binning scheme: grouping labels into bins to approximate the true label posterior. The problem is that these metrics are sensitive to binning decisions. We consider two solutions to the binning problem that apply at the stage of data annotation: collecting either distributed (redundant) labels or direct scalar value assignments.
In this paper, we show that although both approaches address the binning problem by evaluating instance-level calibration, direct scalar assignment is significantly more cost-effective. We provide theoretical analysis and empirical evidence to support our proposal for dataset creators to adopt scalar annotation protocols to enable a higher-quality assessment of model calibration.
1 Introduction
With recently released large-scale language models (LLMs) demonstrating impressive few-shot, zero-shot, and task-agnostic performance (Brown et al., 2020; Kojima et al., 2022; Ouyang et al., 2022), there is a surge of interest in deploying NLP-based systems to aid human decision making (Chen et al., 2021; Nori et al., 2023). However, the black-box nature of LLMs gives little insight into how these models make their predictions (Zhao et al., 2021), undermining user trust in the reliability of model predictions.
A common proposal to address this concern is to examine model calibration (Guo et al., 2017; Kull et al., 2019), which requires a model to approximately predict the true label distribution. This evaluation has been adopted by many recent language model benchmarking efforts (Desai and Durrett, 2020; Hendrycks et al., 2020; Jiang et al., 2022; OpenAI, 2023); these works often consider confidence calibration for classification and adopt Expected Calibration Error (ECE) (Guo et al., 2017) as the main empirical evaluation metric. ECE, along with variants like Adaptive Calibration Error (ACE) (Nixon et al., 2019), involves binning in its calculation, grouping hard categorical labels into bins to approximate label distributions. This is mainly because many popular NLP tasks are annotated predominantly with categorical labels. However, these empirical evaluations are sensitive to the choice of binning scheme (Nixon et al., 2019) and can severely underestimate calibration error (Ovadia et al., 2019; Kumar et al., 2019; Baan et al., 2022).
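For reference, the standard binned form of ECE (Guo et al., 2017) partitions predictions into M equal-width confidence bins B_1, …, B_M and computes

\[ \mathrm{ECE} \;=\; \sum_{m=1}^{M} \frac{|B_m|}{n}\,\bigl|\operatorname{acc}(B_m) - \operatorname{conf}(B_m)\bigr|, \]

where acc(B_m) is the empirical accuracy and conf(B_m) the mean predicted confidence within bin B_m; the number of bins M (and the choice of equal-width versus equal-mass bins, as in ACE) is exactly the kind of hyperparameter at issue here.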
Instance-level calibration (Zhao et al., 2020) avoids the binning issue and matches model confidence with human annotations at the individual level, as uncertainty from human annotations is a good surrogate for the true label distribution (Nie et al., 2020b; Baan et al., 2022). Following this intuition, recent work, particularly in Natural Language Inference (NLI), has crowdsourced a massive number of redundant labels per instance (Pavlick and Kwiatkowski, 2019; Nie et al., 2020b), which we call distributed labels. These annotations cater well to the evaluation of instance-level calibration and provide valuable insight into model behavior, but they are often prohibitively expensive to obtain.
In this work, we propose a theoretically sound method for cost-efficient empirical calibration evaluation that can also be measured at the instance level and does not rely on binning schemes. This is done by eliciting scalar labels that score instances along a particular aspect and evaluating whether the predictive distribution is consistent with these scalars. An example comparing categorical, distributed, and scalar annotations can be found in Table 1. We prove that our annotations provide a lower bound for the calibration error and better characterize uncertainty along the specific dimension of interest (Zhao et al., 2021). Our contributions are as follows:
We propose the widespread use of scalar labeling to capture subjective human uncertainty; it can be reliably collected, adds no overhead compared to categorical labels, and is comparably informative to distributed labels.
Using scalar labels, model calibration can be evaluated at the instance level, with provable guarantees and without depending on the choice of a binning scheme. In particular, scalar labels can be used to form a lower bound on the calibration error.
We show on multiple NLP tasks that scalar annotations can be collected with high agreement, discriminate better than fine-grained categorical labels, and evaluate classification models consistently.
2 Motivation and Background
However, since Pr(Y | X = x) is usually unknown, directly evaluating the expected calibration error E_d is generally infeasible. One way to approximate Pr(Y | X = x) is to bin the model's predictions with a pre-defined partition of the probability simplex Δ^{K−1} (Guo et al., 2017).
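As an illustrative sketch (not the evaluation code used in this paper), the following computes top-label-confidence ECE for the same set of predictions under several bin counts; the function and the simulated data are ours.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE over top-label confidences, using equal-width bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()         # empirical accuracy in the bin
            conf = confidences[mask].mean()    # mean predicted confidence in the bin
            ece += mask.mean() * abs(acc - conf)
    return ece

rng = np.random.default_rng(0)
conf = rng.uniform(0.34, 1.0, size=2000)                   # simulated top-label confidences
correct = (rng.uniform(size=2000) < conf).astype(float)    # simulated correctness outcomes
for m in (5, 20, 100):                                     # same predictions, different bin counts
    print(m, expected_calibration_error(conf, correct, n_bins=m))
```

Running this on identical predictions with different values of `n_bins` yields different ECE estimates, which is the binning sensitivity discussed above.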
Alternatively, a large number of labels can be collected (Nie et al., 2020b) to approximate each instance's conditional label distribution, against which the model's predictive distribution for each instance is evaluated. Instance-level calibration (Zhao et al., 2020) of this kind does not rely on predefined binning schemes, as it requires the model to predict the conditional label distribution of each instance perfectly. However, the high cost of obtaining these labels makes the approach very hard to scale up (Clark et al., 2019). Often, many practical decisions have to be made during data collection to limit annotation cost (Collins et al., 2022), leading to suboptimal evaluation.
3 UNLI: A Scoring Function Example
In this section, we use UNLI as a case study of a specific scoring function ψ. For a given NLI instance, ChaosNLI (Nie et al., 2020b) elicits 100 redundant hard labels per instance to approximate the true label distribution Pr(Y | x), while UNLI (Chen et al., 2019) elicits 2–3 scalar labels per instance to estimate the probability that a hypothesis is true given a premise.
Although these two labeling schemes were previously considered mismatched in distribution and unrelated (Meissner et al., 2021), we argue that they are closely related. Specifically, Figure 1 shows that UNLI labels preserve an implicit ordering of the ChaosNLI label distribution: as the probability mass of an instance's label distribution gradually shifts from CON to ENT through NEU, the UNLI score also increases. Moreover, it is rare for an instance to have high contradiction and entailment probabilities at the same time, supporting the intuition that neutral is the intermediate state between contradiction and entailment. These observations suggest that a scoring function ψ from ChaosNLI to UNLI can be found that preserves most of the information in the label distribution that we care about, such as how likely a hypothesis is given its premise, or whether one hypothesis is more likely than another.
Figure 1: ChaosNLI-S (Nie et al., 2020b) label distribution visualized in barycentric coordinates with respect to the ENT, NEU, and CON vertices on the horizontal plane. Redder colors on the heatmap indicate higher probability under the label distribution. The height of each bar corresponds to the human uncertainty scalar label obtained from UNLI (Chen et al., 2019). The correspondence between these two sets of labels suggests the existence of a scoring rule that maps ChaosNLI labels to a scalar with limited information loss.
Suppose we want to evaluate whether our model is well calibrated on SNLI at the instance level, namely, to test whether it provides human-aligned uncertainty for each instance. Instead of directly evaluating whether the model's predictive distribution is consistent with the ChaosNLI label distribution, we can transform the model's distribution with this ψ and compare it with UNLI labels, avoiding the massive annotation overhead of distributed labels, as described in the following section.1
4 Scalar Label for Calibration
In this section, we give theoretical guarantees for using scalar annotations in calibration evaluation. We first define the class-wise calibration error with which we assume calibration error is evaluated:
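A standard instance-level, class-wise form consistent with how CE is used in Section 6, stated here as an assumption about the intended definition (we write p̂(Y | x) for the model's predictive distribution, notation ours), is

\[ d\bigl(\hat{p}(Y \mid x), \Pr(Y \mid x)\bigr) \;=\; \frac{1}{K}\sum_{k=1}^{K}\bigl|\hat{p}(y_k \mid x) - \Pr(Y = y_k \mid x)\bigr|, \qquad \mathrm{CE} \;=\; \mathbb{E}_{x}\, d\bigl(\hat{p}(Y \mid x), \Pr(Y \mid x)\bigr). \]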
This calibration error is fully compatible with the expected calibration error definition in Section 2, with the critic function d chosen as above.
Definition 2 below gives a simple function family that maps a discrete distribution to a scalar value. We then show that specific probing tasks on these transformed scalar values provide a useful proxy for evaluating model calibration in the form of a lower bound. Here, we only showcase our main theorem that supports direct transfer from ranking to multi-class (K ≥ 3) classification, and leave additional results that guarantee similar lower bounds from regression or ranking to binary or multiclass classifications and their proofs to Appendix A.
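A plausible form of Definition 2, reconstructed from how the "f-based expected label scoring rule" is used in Theorem 1 and footnote 1 (so the exact statement should be read as an assumption on our part), is the expectation of a per-label score f under a distribution p over the labels:

\[ \psi_f(p) \;=\; \mathbb{E}_{Y \sim p}\bigl[f(Y)\bigr] \;=\; \sum_{k=1}^{K} f(y_k)\, p(y_k), \qquad f : \{y_1, \dots, y_K\} \to \mathbb{R}. \]

Applied to the model's predictive distribution p̂(Y | x), this induces a scalar value per instance that can be compared against human scalar annotations.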
Without loss of generality, we assume throughout this paper that f(y_k) is monotonic in k. Since we do not require a calibration loss to be calculated against this scoring rule, f does not need to be normalized, unlike the calibration lenses of Vaicenavicius et al. (2019). It is worth noting that we are particularly interested in cases where f has an intuitive interpretation in empirical evaluation. For example, an f that scores membership in a subset of the labels acts as a probabilistic membership probe, estimating whether the model understands the concept associated with that subset, reminiscent of the annotation schemes proposed by Deng et al. (2012) and Collins et al. (2023).
The following result demonstrates how comparing induced scalars to their corresponding human uncertainty labels can be used to evaluate model calibration when the probing task on the scalar labels is a ranking problem. Previous research has shown that when the annotation protocol is properly designed, rankings among instances can be consistent even when individual scores are not (Rankin and Grube, 1980; Russell and Gray, 1994; Burton et al., 2021). We show that empirical loss functions like the pairwise ranking risk defined below can be used to form a lower bound on the original calibration error.
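A standard form of the pairwise ranking risk (Clémençon et al., 2013), stated here as an assumption about the intended definition, is

\[ L(r) \;=\; \Pr\Bigl[\bigl(r(X) - r(X')\bigr)\bigl(S - S'\bigr) < 0\Bigr], \]

the probability that a ranking rule r orders an independent pair of instances discordantly with their scalar labels S and S'; ties can be handled with an additional term weighted by 1/2.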
For the original classifier and the corresponding ranking risk regret, we have the following:
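Schematically (the exact constant is an assumption on our part and depends on K and f), the theorem bounds the ranking-risk regret by the class-wise calibration error, so that a large observed regret certifies a large calibration error:

\[ L(\hat{s}) - L(s) \;\le\; C_{K,f} \cdot \mathrm{CE}, \]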
where s = ψ_f(Pr(Y | ·)) is the f-based expected label scoring rule applied to the ground-truth label posterior, while ŝ = ψ_f(p̂(Y | ·)) is the same scoring rule applied to the model's predictive distribution. L(s) is the ranking risk of the ranking rule r induced by s (as we show in the appendices, under Assumption 1, s is an optimal scoring rule).
This is particularly useful for challenging annotation tasks where annotators are hard to calibrate; otherwise, we can directly compare the induced scores with the annotated scalar values. Regression tasks provide similar lower-bound guarantees, which can be found in Appendix A. We also refer the reader to Appendix A for a detailed discussion of how Assumption 1 ensures the existence of an intuitive ranking.
Per the discussion above, to evaluate the calibration of a classifier, we should compare the induced scalar values with ground-truth labels under a corresponding interpretation of f. For illustration, consider emotion classification on GoEmotions (Demszky et al., 2020), where a natural scalar interpretation of the labels lies along the valence dimension. We would expect instances with a higher probability of being labeled with emotions like "Joy" or "Relief" to be ranked higher on valence than instances with a higher probability of being labeled with emotions like "Sadness", as shown in Figure 2.
Figure 2: A hypothetical example of applying Definition 2 to an emotion classification task (e.g., GoEmotions; Demszky et al., 2020). By attaching a single scalar valence value to each class label, we specify how the conditional label distribution of each illustrated instance induces an estimate of its valence, which can then be annotated directly.
We would expect the induced valence score by a calibrated classifier to correlate well with human annotations. This is reflected by the conformity of rankings (evaluated with ranking risk) and closeness in scoring.
It is also worth noting that our formulation does not require the scalar annotation task to fully recover the classification task; indeed, a classification task may be characterized by multiple valid mapping functions. For example, Figure 3 shows a different ordering of the same set of instances induced by an "arousal" mapping function g.
Figure 3: Applying Definition 2 with a different mapping function can induce a different ordering and spacing of the instances along the Arousal axis.
5 Pseudo Distributed Label from Scalars
This section discusses ways to map scalar labels back to label distributions. This is useful when one wants to augment classification training with scalar annotations, for example when performing distillation (Hinton et al., 2015) or label smoothing (Szegedy et al., 2016). If the scalar annotations have been aggregated into a parametric distribution using common aggregation techniques (Hovy et al., 2013; Peterson et al., 2019), back-mapping amounts to quantizing a continuous distribution p(y) into a discretized distribution q(y). Although heuristics already exist for allocating probability mass to categories (Pavlick and Kwiatkowski, 2019; Collins et al., 2022; Meissner et al., 2021), these mappings are generally considered suboptimal (Meissner et al., 2021). We propose two more principled ways to perform label back-mapping: (1) inference with a neural network; and (2) distribution quantization with fixed support.
Neural Network
We can use a neural network to directly predict the parameters of the resulting categorical distribution. A small validation set of distributed labels is needed to train this back-mapping model.
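A minimal sketch of such a back-mapping model, assuming the scalar annotations for each instance have been aggregated into a mean and a standard deviation (the architecture, feature choice, and loss below are ours, not the paper's):

```python
import torch
import torch.nn as nn

class BackMapper(nn.Module):
    """Map aggregated scalar statistics to a K-way categorical distribution."""
    def __init__(self, k_classes=3, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, k_classes),
        )

    def forward(self, scalar_mean, scalar_std):
        x = torch.stack([scalar_mean, scalar_std], dim=-1)
        return torch.softmax(self.net(x), dim=-1)  # predicted label distribution

# Train against a small validation set of distributed labels (toy data below).
model = BackMapper()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.KLDivLoss(reduction="batchmean")
mean, std = torch.rand(64), 0.1 * torch.rand(64)        # toy aggregated scalar statistics
target = torch.softmax(torch.randn(64, 3), dim=-1)      # toy distributed labels
optim.zero_grad()
pred = model(mean, std)
loss = loss_fn(pred.clamp_min(1e-8).log(), target)      # KL(target || pred)
loss.backward()
optim.step()
```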
Closed Form Solution
We can think of label back-mapping as redistributing the probability mass of the continuous distribution over a fixed set of ranges defined by cut-off points {c_1, c_2, …, c_K}, such that the discretized distribution is as close to the original distribution as possible. The following solution can be given:
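One natural reading of this solution, assuming each category receives the probability mass of its range and taking c_0 to be the lower end of the support of p (both assumptions on our part), is

\[ q(y_k) \;=\; F(c_k) - F(c_{k-1}), \qquad k = 1, \dots, K, \]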
where F(·) is the cumulative distribution function of p.
Compared to the neural network approach, the closed-form solution does not require the ground truth distributed labels in the validation set but runs the risk of distribution mismatch with the target data.
6 Experiments
Our experiments are intended to validate two critical claims of this paper: (1) scalar labels effectively evaluate models' uncertainty estimation; and (2) it is possible to collect high-quality scalar annotations.
6.1 Evaluation with Scalar Labels
We evaluate five differently fine-tuned LMs against UNLI and ChaosNLI labels. Among them, bert-base-debiased-nli (BERTb) (Wu et al., 2022) and roberta-large-anli (RoBERTaa) (Nie et al., 2020a) are from the HuggingFace model hub.2 We intentionally choose one extensively trained, strong NLI model (RoBERTaa) and one debiased model (BERTb) to cover a wider range of model calibration, since, as discussed previously, it is generally impossible to simultaneously enforce fairness and calibration (Pleiss et al., 2017). We also fine-tune two RoBERTa (Liu et al., 2019) models, roberta-base (RoBERTab) and roberta-large (RoBERTal), on the SNLI dataset and carry out the same set of evaluations. Finally, we evaluate a model with the roberta-base encoder and a randomly initialized classifier on top as a random baseline (random).
Comparing against Distributed Labels
We first evaluate these models on ChaosNLI-S with Classwise-Calibration Error (CE) as shown in Definition 1. Notice that the calibration error evaluated in this fashion is expected to be exact and free of hyperparameters.
We then study the evaluation capability of scalar labels by calculating the Mean Absolute Error (MAE) and Ranking Risk (RR) against UNLI-style labels. Since the original UNLI labels collected by Chen et al. (2019) cover only 614 of the 1,514 ChaosNLI-S instances, we collect UNLI annotations for all remaining ChaosNLI-S instances, ensuring a matched distribution by using the same logistic transformation described in that work, since humans are especially sensitive to values near the ends of the probability spectrum (Tversky and Kahneman, 1981).
We observe that the scalar-label-based metrics, RR and MAE, rank models consistently with the distributed-label-based metrics (Table 2). The model tuned most extensively on high-quality data, roberta-large-anli, is the best calibrated, as indicated by RR, MAE, and CE.
Table 2: Results for evaluating model calibration. Metrics computed against scalar labels (right side) correlate well with metrics computed using distributed labels (left side), empirically supporting our theoretical results on the relation between ranking risk and calibration.
| Models | CE (↓) | MAE-b | RR-b | MAE (↓) | RR |
|---|---|---|---|---|---|
| random | 28.7 | 17.3 | 2.75 | 47.9 | 35.2 |
| BERTb | 23.7 | 13.0 | 1.20 | 27.7 | 30.1 |
| RoBERTab | 18.3 | 10.1 | 0.72 | 24.2 | 24.2 |
| RoBERTal | 16.1 | 8.46 | 0.60 | 23.6 | 23.0 |
| RoBERTaa | 14.4 | 8.40 | 0.62 | 23.1 | 22.9 |
Empirical Bound Investigation
Joint Training with Scalars
We further test whether joint training with UNLI, using the same mapping function f, improves model calibration. We run a round-robin sampler over the two datasets; to balance the dataset sizes, we keep reiterating through UNLI until one epoch of SNLI finishes. We evaluate roberta-base and roberta-large under the following three settings: (1) Original (SNLI), a model trained on SNLI data with the cross-entropy loss; (2) Scalar (+reg), UNLI multitask training with the MAE loss on UNLI labels; and (3) Ranking (+ran), UNLI multitask training with a margin loss as in Li et al. (2019).
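A sketch of the multitask objectives, assuming the scalar head is simply ψ_f applied to the softmax output with an equally spaced f over {CON, NEU, ENT} (the exact f, margin, and loss weighting used here are assumptions, not taken from the paper):

```python
import torch
import torch.nn.functional as F

# Assumed per-label scores f(y_k) for CON, NEU, ENT (an equally spaced choice).
LABEL_SCORES = torch.tensor([0.0, 0.5, 1.0])

def induced_scalar(logits):
    """psi_f applied to the model's predictive distribution."""
    return torch.softmax(logits, dim=-1) @ LABEL_SCORES

def snli_loss(logits, labels):
    """Standard cross-entropy on SNLI categorical labels."""
    return F.cross_entropy(logits, labels)

def unli_regression_loss(logits, scalar_labels):
    """+reg: MAE between the induced scalar and the UNLI scalar label."""
    return (induced_scalar(logits) - scalar_labels).abs().mean()

def unli_ranking_loss(logits, scalar_labels, margin=0.1):
    """+ran: pairwise margin loss over pairs ordered by their UNLI labels (ties skipped)."""
    s = induced_scalar(logits)
    i, j = torch.combinations(torch.arange(len(s)), r=2).unbind(-1)
    sign = torch.sign(scalar_labels[i] - scalar_labels[j])
    keep = sign != 0
    return F.relu(margin - sign[keep] * (s[i] - s[j])[keep]).mean()
```

In the round-robin setup described above, an SNLI batch would contribute `snli_loss`, while a UNLI batch would contribute one of the two scalar losses.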
To precisely evaluate the calibration error, we also calculate the instance-level class-wise calibration error on ChaosNLI data. ChaosNLI only annotates the SNLI dev set, which we must therefore use as our test set; for this experiment, we use the SNLI test set for development. We extend the UNLI annotation to all ChaosNLI instances to calculate the scalar-value-based metrics (MAE and RR).
Table 3 shows that both encoders benefit from joint training with UNLI in terms of accuracy as well as model calibration. Note that all UNLI training examples are already present in the SNLI training set, so the benefit of including UNLI comes solely from the scalar labels. This indicates that the jointly trained classifier, while still directly applicable to the original classification task, can discriminate subtler differences among instances.
Table 3: Results of training with UNLI augmentation. Training with scalar labels (+reg and +ran) improves accuracy as well as calibration. ECE-# indicates the number of bins used for the Expected Calibration Error (ECE) evaluation.
| | roberta-base | | | roberta-large | | |
|---|---|---|---|---|---|---|
| Metrics | SNLI | +reg | +ran | SNLI | +reg | +ran |
| Acc (↑) | 91.6 | 91.8 | 91.8 | 92.7 | 92.9 | 93.2 |
| ECE-5 (↓) | 4.28 | 3.40 | 4.23 | 2.47 | 1.41 | 1.78 |
| ECE-20 (↓) | 4.30 | 3.47 | 4.27 | 2.47 | 1.73 | 1.96 |
| ECE-100 (↓) | 4.57 | 3.74 | 4.54 | 2.93 | 2.08 | 2.26 |
| MAE (↓) | 24.2 | 23.6 | 23.8 | 23.6 | 22.7 | 23.0 |
| RR (↓) | 24.2 | 23.9 | 23.9 | 23.0 | 22.6 | 22.7 |
| CE (↓) | 18.3 | 17.4 | 17.9 | 16.1 | 15.0 | 15.6 |
6.2 Studying Annotation Quality
We investigate whether humans are capable of giving consistent scalar judgments. We conduct an annotation study on the recently released WiCE (Kamoi et al., 2023) dataset, which verifies claims decomposed from Wikipedia passages against their cited source text. A subtask of WiCE provides the annotator with a claim from a Wikipedia passage intended to present an "individual fact", paired with a source document cited in the context. The annotator is then asked to judge, on a 3-point scale, whether the claim is supported, partially-supported, or not-supported by the information provided in the source text. We replace this 3-point scale with our proposed scalar annotation scheme.
For all annotation tasks in this work, we collect scalar judgments from annotators with a slider-bar protocol similar to the one employed by Chen et al. (2019). To obtain a set of good annotators, we design a qualification task with 5 manually selected claims from the aforementioned subset of WiCE, where the claims are relatively unambiguous and have varied levels of uncertainty given their respective source documents. We ask workers from MTurk3 to complete all the questions in a single session and analyze their performance for qualification. To better understand worker behavior, we log different kinds of on-page worker actions, including dragging the slider handle, checking or unchecking boxes, turning pages, and revising answers.
Figure 4 shows that as workers spend more time on the HIT and more time reviewing before their final submission, their holdout Pearson correlation against the aggregated scalar labels of the other workers improves. This supports the claim that a responsible set of annotators can provide consistent annotations under the scalar annotation scheme, even for challenging and time-consuming tasks. We qualify workers whose holdout correlation is greater than .6.
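A sketch of the qualification criterion as we understand it, computing each worker's holdout correlation against the mean of the other workers' scores on the shared qualification items (the aggregation by mean is an assumption):

```python
import numpy as np
from scipy.stats import pearsonr

def holdout_correlations(scores):
    """scores: (n_workers, n_items) scalar annotations on the shared qualification set."""
    out = {}
    for w in range(scores.shape[0]):
        others = np.delete(scores, w, axis=0).mean(axis=0)  # aggregate of the other workers
        r, _ = pearsonr(scores[w], others)
        out[w] = r
    return out

scores = np.random.default_rng(1).uniform(size=(8, 5))  # e.g., 8 workers, 5 qualification claims
qualified = [w for w, r in holdout_correlations(scores).items() if r > 0.6]
```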
Figure 4: Relationship between an annotator's holdout Pearson correlation coefficient and their total working time (left) and their time spent reviewing before submission (right). The solid line is the fitted linear regression, with its 95% confidence band shaded.
We then annotate a subsample of 200 sub-claims from the WiCE test set with three-way redundant annotation. Figure 5 shows the scalar label distribution of this subset of WiCE, broken down by the original WiCE discrete label. The class-level ordering of the scalar labels aligns well with the "likelihood" interpretation of the 3-way categorical labeling scheme. At the same time, the scalar annotations capture more nuance in the data and better differentiate instances within the same category, especially those of the partially-supported class. This is expected given the definition of the categorical label, as partially-supported claims can naturally be supported at any likelihood level from 0 to 1. It is worth noting that to obtain good-quality scalar uncertainty labels, we need only the same level of annotation redundancy as the original categorical labels (Kamoi et al., 2023).
Figure 5: Distribution of the strength-of-evidential-support scalar labels for each of the three discrete support levels (supported, partially-supported, not-supported) on a subset sampled from the WiCE test set. Light / dark shading covers 100% / 50% of each category, with outliers beyond 1.5 IQR dropped; the bar in the middle of each strip marks the median of that category.
6.3 Versus Fine-grained Categoricals
We also investigate whether scalar annotation is equally effective when collected for and applied to datasets originally annotated with more fine-grained categorical labels. We first apply the scalar uncertainty annotation scheme to the Circa (Louis et al., 2020) dataset. Circa annotates a pragmatic inference problem in dialog, classifying whether an indirect answer to a question is more of a "yes", more of a "no", or neither. We filter out instances with the Other label, which typically correspond to irrelevant answers, and do a stratified sampling of 300 instances from the 8 remaining label classes. We collect 3-way redundant annotations with the same set of qualified annotators as in Section 6.2. To better calibrate our annotators, we dynamically show them their previous annotations for the closest-lower-scoring and closest-higher-scoring instances in the same batch.
Figure 6 shows the scalar label distribution broken down by original Circa labels. Our scalar annotation is still highly consistent with the intuitively perceived order defined by an answer’s inclination towards Yes. At the same time, scalar uncertainty annotation captures intricate differences within each group, even when the original categorical label is already fine-grained (Table 4).
Table 4: Equally spaced samples from the scalar-annotated Circa subset. Notice that the scalar label makes meaningful distinctions between instances within the same class, even when the original categorical label from Circa (Louis et al., 2020) is already fine-grained.
| Question | Answer | Scalar | Cat. |
|---|---|---|---|
| Do you work full-time? | Full-time, unfortunately. | 1.0 | Y |
| Are you in on Monday? | Should be! | 0.9 | PY |
| Does the neighborhood have a good reputation? | The crime rate is low. | 0.8 | PY |
| Would you have to work weekends? | I might have to. | 0.7 | PY |
| Do you like music similar to your parents? | We have some crossover. | 0.6 | PY |
| Do you like Rnb? | Hum a little for me, will you? | 0.5 | M |
| Anything I should be worried about? | About what? | 0.4 | M |
| Can you eat Mexican? | Beans make me fart. | 0.3 | PN |
| Do you know Roller balding? | That's new to me. | 0.2 | N |
| Is your favorite food Mexican? | Mexican is my second favorite. | 0.1 | N |
| Do you like country and western bands? | Country sucks. | 0 | N |
Figure 6: Scalar label annotation for Yes / No polarity on the Circa dataset, broken down by the original categorical label. In the label names, "Prob." means "probably" and "M" means "in the middle".
Evaluating Calibration
To demonstrate the effectiveness of these scalar labels, we further fit a set of models of different sizes to the Circa dataset and evaluate them against the scalar-annotated subset. Previous research (Lewkowycz et al., 2022; Nori et al., 2023) shows that larger pretrained transformer models tend to be better calibrated, and we examine whether the scalar annotation can recover the size ordering of the models in terms of calibration. To do this, we specify an intuitive mapping function f that maps N, Prob.N, M, Prob.Y, and Y to the equally spaced values [0, .25, .5, .75, 1.], maps N/A and Unsure to .5 to indicate indecisiveness, and maps Cond to .6 to indicate a slight tendency towards "yes".
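As a concrete sketch of this mapping and the scalar it induces (using ψ_f as reconstructed in Section 4; the example prediction is ours):

```python
import numpy as np

# f exactly as specified in the text: equally spaced core labels,
# indecisive labels at .5, and Cond slightly towards "yes".
F_MAP = {"N": 0.0, "Prob.N": 0.25, "M": 0.5, "Prob.Y": 0.75, "Y": 1.0,
         "N/A": 0.5, "Unsure": 0.5, "Cond": 0.6}
LABELS = list(F_MAP)
F_VEC = np.array([F_MAP[l] for l in LABELS])

def induced_scalar(pred_dist):
    """psi_f: expected label score under the model's predictive distribution."""
    return float(np.dot(pred_dist, F_VEC))

# A prediction split between "Prob.Y" (0.6) and "Y" (0.4) induces 0.85,
# which is then compared against the human scalar annotations via MAE / RR.
probs = np.zeros(len(F_VEC))
probs[LABELS.index("Prob.Y")], probs[LABELS.index("Y")] = 0.6, 0.4
print(induced_scalar(probs))  # 0.85
```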
For the evaluation experiment, we further make an 80/20 split of the non-scalar-annotated Circa subset into training and validation sets. This setting should be even more challenging, especially for smaller models, as there is a label distribution mismatch between this training set and our sampled test set (Dan and Roth, 2021). Besides the bert-base-uncased and bert-large-uncased models used in Section 6.1, we also tune two larger language models, GPT-Neo (Black et al., 2021) 1.3B and 2.7B. Table 5 shows that ECE-5 and ECE-100 produce different values and inconsistent rankings of calibration, again highlighting how hyperparameter-dependent the ECE result is. In contrast, the scalar-label-based MAE and RR provide consistent rankings of calibration.
Table 5: Evaluating models with scalar-label-based metrics as well as binning-based calibration metrics. Darker shades correspond to better performance on a given metric. Metric names are as defined in Section 6.1.
7 Conclusions and Future Work
We show that scalar annotations elicited from individual humans can be a valuable resource for developing calibrated NLP models. Both our theoretical and empirical results suggest that scalar annotation is an effective and scalable way to collect ground truth for human uncertainty, and we encourage future datasets to include scalar annotations where applicable. Our results also offer a perspective from which researchers can devise new annotation tasks for traditionally categorical problems. Future research may investigate conditions under which scalar-label-based uncertainty evaluation enjoys stronger guarantees, better ways to robustly collect consistent scalar annotations, and other principled ways to train or evaluate with scalar labels. The last direction is particularly relevant where the method does not apply directly, such as NLP tasks involving structured prediction, which will likely require task-specific decisions about matters such as Events of Interest (Kuleshov and Liang, 2015).
Acknowledgment
We thank Ha Bui, Shiye Cao, Yunmo Chen, Iliana Maifeld-Carucci, Kate Sanders, Elias Stengel-Eskin, and Nathaniel Weir for their valuable feedback on the writing.
This work has been supported by the U.S. National Science Foundation under grant no. 2204926. Any opinions, findings, conclusions, or recommendations expressed in this article are those of the authors and do not necessarily reflect the views of the National Science Foundation. Anqi Liu is partially supported by the JHU-Amazon AI2AI Award and the JHU Discovery Award.
Notes
By transferring the distribution with the ψ induced by Definition 2, using the same f as demonstrated in Section 6.1, we obtain high correlations of r = 0.703 and ρ = 0.766.
References
A Appendix
A.1 Calibration Evaluation with Regression
When K = 2, it is straightforward to see that the mean absolute error (MAE), used as a regression loss with respect to the expected label scoring rule, is linear in the calibration error.
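Under the class-wise form of the calibration error sketched in Section 4 and the choice f(y_1) = 0, f(y_2) = 1 (both assumptions on our part), this can be seen directly:

\[ \bigl|\psi_f(\hat{p}(Y \mid x)) - \psi_f(\Pr(Y \mid x))\bigr| \;=\; \bigl|\hat{p}(y_2 \mid x) - \Pr(y_2 \mid x)\bigr| \;=\; \frac{1}{2}\sum_{k=1}^{2}\bigl|\hat{p}(y_k \mid x) - \Pr(y_k \mid x)\bigr|, \]

so the per-instance MAE with respect to the expected label scoring rule coincides with the class-wise calibration error at that instance.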
Similar results can be derived for K > 2, in the form of the following theorem:
A.2 Calibration Evaluation with Ranking
However, the bounding result for bipartite ranking does not directly transfer to K-partite ranking. Although Uematsu and Lee (2014) construct a globally optimal scoring rule for multipartite ranking, they also note that the induced optimal ranking may be inconsistent with a ranking induced by optimal ordinal classification labels. For example, for highly subjective ratings, such as 1-to-5 movie reviews with labels {1, 2, 3, 4, 5}, we may want to rank a pair of movies (x, x′) with distributed labels η(x) = [0.1, 0.4, 0, 0.2, 0.3] and η(x′) = [0.3, 0.2, 0, 0.1, 0.4]. Notice that these two instances have identical expected scores but different hard rating classes if labels are aggregated with majority voting. To avoid such inconsistencies, Clémençon et al. (2013) rely on the following assumption:
This assumption is equivalent to saying that the expected label scoring rule is the optimal ranker for all binary subproblems. This is a reasonable assumption to make if an obvious ordering can be identified from the label set. For example, in ChaosNLI, if an instance is more likely an ENT than a NEU, it is usually more likely a NEU than a CON as well, as we have demonstrated that probability mass only shifts gradually from CON to ENT through NEU.
For completeness, we include a proof of this lemma, which is essential for Theorem 1.
This leads to the following corollary:
When Assumption 1 holds, the expected scoring rule given by Equation 1 is an optimal scoring rule.
Then we are able to prove Theorem 1:
Theorem 2 is a direct corollary of well-known properties of the Wasserstein distance (e.g., see Kolouri et al., 2019).
Author notes
Action Editor: Kristina Toutanova