Reducing Conversational Agents’ Overconfidence Through Linguistic Calibration

Abstract

While improving neural dialogue agents' factual accuracy is the object of much research, another important aspect of communication, less studied in the setting of neural dialogue, is transparency about ignorance. In this work, we analyze to what extent state-of-the-art chit-chat models are linguistically calibrated, in the sense that their verbalized expression of doubt (or confidence) matches the likelihood that the model's responses are factually incorrect (or correct). We find that these models are poorly calibrated, yet we show that the likelihood of correctness can be predicted accurately. By incorporating such metacognitive features into the training of a controllable generation model, we obtain a dialogue agent with greatly improved linguistic calibration.


Introduction
Neural generative open-domain English-language dialogue agents have made progress towards the ability to carry on chit-chat conversations with humans (Adiwardana et al., 2020; Roller et al., 2021). Recent models, trained on large swaths of data from the internet to mimic human-human conversations, can name their favorite sports teams, describe what it's like to be the owner of two dogs, or share their opinions on tacos. However, ask a state-of-the-art chatbot "Which is heavier, 1 kg feathers or 1 kg stone?", and it might confidently answer: "Feathers, because they are heavier than a kilogram of any other material" (answer generated by BST 2.7B; Roller et al., 2021). This amusing overconfidence can become problematic if someone genuinely doesn't know the answer and is misled into believing something false. Generative chit-chat dialogue agents have many issues that go well beyond inaccurate answers (Xu et al., 2020; Bender et al., 2021), making them currently generally unsuitable for applications other
"What is the largest US city?" (TriviaQA question) uncalibrated answer: ...so control certainty: "I'm not sure, but my guess is Los Angeles." (linguistically calibrated answer) calibrator predicts p(✓)= uncalibrated chatbot controllable generation with fine-tuned chatbot ⚙⚙ "That would be Los Angeles."0.17 <LO> Figure 1: Proposed method for re-calibrating a generative dialogue agent.This pipeline involves a calibrator which returns the probability that the original dialogue agent's answers are correct, as well as a finetuned model which controls for linguistic confidence; the linguistic confidence is adjusted based on the probability returned by the calibrator, yielding a response for which the linguistic confidence aligns with the likelihood that the dialogue agent's answer is correct.This is our proposed calibrator-controlled chatbot.
than entertainment and research. Nevertheless, better control of the alignment between the confidence of an answer and its likelihood of being correct seems like a promising type of remediation: it makes models more transparent about their limitations directly in the dialogue, rather than through extrinsic instructions for adequate use that people might overlook or forget. This goal applies Grice's maxim of quality (Grice, 1975) on a metacognitive level, i.e., being truthful about what one knows. Here, this would mean that if we can train accurate predictors of correctness from information available to the model (input words and internal representations), then model generations should convey that information. The skill of handling uncertainty would be desirable even if accuracy on factual questions ever became perfect: some questions do not have known answers, or have answers which depend on a context that a dialogue agent cannot know, making it perilous to "ignore ignorance" (Smithson, 2012; Ravetz, 1993).
In this work, we seek to understand whether a model's verbalized expression of confidence ("Obviously, ...") or doubt ("I'm not sure, but...") in its answer, which we refer to throughout as linguistic confidence, corresponds to the likelihood that the answer is correct, and if not, whether we can fine-tune the models with controlled generation techniques to achieve better alignment. In other words, do state-of-the-art open-domain dialogue agents "know" what they do not know? If so, can this knowledge inform their responses, to achieve better verbalized metacognition?
We thus make three main contributions. (1) We annotate a state-of-the-art chit-chat model's responses to a large-scale QA task for both factual correctness and linguistic confidence. (2) Using these annotations, we find that the model is poorly calibrated, in that linguistic confidence does not match factual correctness, but we show that we can train a much better correctness predictor directly from the chit-chat model's representations. (3) We use this trained predictor within a controllable generation model to create a pipeline which greatly improves the calibration of a state-of-the-art chit-chat model.

Related Work
Knowledge in Open-Domain Chatbots We focus on neural generative open-domain dialogue agents, rather than general-purpose language models or QA models trained to produce a factual answer given a question. Much progress has been made by training large-scale Transformer (Vaswani et al., 2017) encoder-decoder models for dialogue tasks (Roller et al., 2021; Adiwardana et al., 2020; Zhang et al., 2020). These sequence-to-sequence models are typically trained on large amounts of data from the internet to produce a conversational response given a dialogue history as input. Despite impressive performance on chit-chat tasks, these models are often prone to hallucinating knowledge (Roller et al., 2021). Dinan et al. (2019) and Gopalakrishnan et al. (2019) have proposed additional conditioning on a knowledge base to address this issue, but success is only partial, so we are far from being able to assume that even a knowledge-conditioned model reliably gives correct answers.
Overconfidence is a well-established effect in humans: studies robustly show that people are poorly calibrated when completing general knowledge tasks (Juslin, 1994; Kleitman and Stankov, 2001; Stankov and Crawford, 1996; Stankov, 1998). Kamath et al. (2020) attempt to correct overconfidence in neural models by training QA models to abstain from answering questions on which they are likely to err, using probabilistic calibration (see next paragraph). We instead focus on getting conversational models to communicate their confidence verbally, i.e., still produce an answer, but one that is less misleading as to its expected correctness.
Probabilistic Calibration Much work has been dedicated to the probabilistic calibration of deep neural networks. Guo et al. (2017) show that modern neural networks for classification tasks are poorly calibrated: models' confidence estimates that their answers are correct do not match the empirical rate of correctness. This contrasts with previous findings showing that (earlier) neural networks are well-calibrated on binary classification tasks (Niculescu-Mizil and Caruana, 2005). We hereafter refer to this notion of calibration as probabilistic calibration, to distinguish it from linguistic calibration. More recently, probabilistic calibration has been explored in the space of large-scale language models (LMs). Desai and Durrett (2020) find that the pre-trained Transformers RoBERTa (Liu et al., 2019) and BERT (Devlin et al., 2019) are well-calibrated in-domain on the tasks of Natural Language Inference (NLI), paraphrase detection, and commonsense reasoning. Similarly, Jagannatha and Yu (2020) calibrate BERT and DistilBERT (Sanh et al., 2019) for Part-of-Speech tagging (POS), Named Entity Recognition (NER), and QA tasks. Rather than using LMs as target predictors on classification tasks like NLI and NER, Jiang et al. (2021) instead focus on LMs as natural language generators and analyze T5 (Raffel et al., 2020), a large-scale Transformer with an encoder-decoder architecture. The authors find that it is poorly calibrated in its probability estimates on QA tasks. Conversely, Radford et al. (2019) find that GPT-2 is reasonably well calibrated on QA tasks, with an accuracy of 63.1% on the 1% of questions it is most confident in on Natural Questions (Kwiatkowski et al., 2019).

Controlled Response Generation
We aim to reformulate answers while controlling for their expressed certainty. This requires style transfer or controlled generation techniques, which encourage certain attributes to fit prescribed values, for example a given length or sentiment. Lample et al. (2019) proposed a method to exert simultaneous control over multiple attributes based on concatenated learned control tokens; we similarly condition on an initial source text and concatenate multiple control tokens when generating responses. Keskar et al. (2019) trained a large-scale language model with control codes that govern style, content, and task-specific behavior. In the context of open-domain dialogue, See et al. (2019) used control over attributes such as the number of questions, with the aim of maximizing the engagingness of dialogue models. Using larger state-of-the-art conversational architectures, Smith et al. (2020a) and Madotto et al. (2020) compared several methods for achieving control in conversation; here, we use the simple method of training attribute-specific control tokens that was the most effective in Smith et al. (2020a) for a variety of styles. While our experiments in §5.2 suggest that good correctness-prediction performance can be achieved using just the question, without yet committing to the substance of an answer (which would make less constrained text generation useful), the initial goal of this paper is to control the linguistic confidence of an answer without changing its substance. For this, techniques that condition on a source response are more relevant to us than less tightly constrained control techniques. Retrieve-and-refine generation (Weston et al., 2018; Roller et al., 2021) conditions on a possible answer, but does not control the style of the response. Here, we condition on the initial answer produced by a vanilla conversational model rather than a retrieval model, and then add control tokens to control the style.
Quantifying Linguistic Confidence

Linguistic Confidence We aim to align a model's expressed confidence with its actual correctness. We evaluate correctness in a closed-book question answering setting (Raffel et al., 2020), and any factoid-style question answering dataset can be used in this manner. Following GPT-3 (Brown et al., 2020), we use TriviaQA (Joshi et al., 2017) as our dataset, as it covers a large output space (unlike WebQuestions (Berant et al., 2013), which is restricted to Freebase) and contains fully grammatical questions as opposed to ungrammatical search queries (unlike Natural Questions (Kwiatkowski et al., 2019)).
To convert it into a closed-book QA dataset we can use, we merge the dataset's "Web" and "Wikipedia" sections (including shared questions only once), remove all provided evidence documents for the questions, strip the (Wikipedia-based) aliases of their " (disambiguation)" suffix, and then use these aliases to create a list of allowable gold answers. We end up with 76,523 question-answer pairs in the training set and 9,961 in the validation set. Despite the list of aliases of the gold answer ("Steel," given first in the otherwise alphabetically sorted list), evaluating the correctness of answers may not always be so straightforward; consider this example answer:4 "It is called a whetstone. It is a stone that is used for sharpening knives."

Annotation scheme
The answers that a chatbot gives for a question are full-length sentences that may or may not answer the question, may or may not do so correctly, and may or may not express confidence linguistically. We settle on relating such generations to the gold answer aliases in our dataset by having humans annotate generations according to the annotation scheme shown in Figure 2. Unless the question is not even acknowledged as such (OT, short for "off-topic"), the chatbot's response is judged for linguistic confidence and for correctness with respect to the provided gold answers. Figure 3 illustrates all 13 resulting classes with example answers in the GUI that is presented to human annotators.

4 This answer was generated by the vanilla BST 2.7B model we consider in §3, and shows that human annotations are not always reliable: all three annotators judge the certainty of this response to be LO, even though the answer itself expresses no doubt. As for correctness, two say WRONG and one says CORRECT, reflecting uncertainty as to how a factually correct answer not included in the allowable gold answers should be graded.
The fine-grained 4-way splitting of correctness is designed to provide guidance to human annotators and reduce ambiguity. After the initial annotation, we simplify all correctness annotations to a binary correctness that better aligns with the type of linguistic framing we would like the model to be able to express, mapping OTHER and WRONG to incorrect, and EXTRA and RIGHT to correct.
The 3-way splitting of confidence is intuitively richer than simply splitting along confident vs. not confident (HI vs. not); however, many responses were of the kind "I don't know, but I know that...," which makes them ambiguous. Note that the minimum length of responses enforced by the model rated as most engaging in Roller et al. (2021) precludes responding with a straight "I don't know," which likely makes the ambiguity more salient (see the discussion of minimum length in §3). We nevertheless release the full 3-way annotations in case they are useful for further research.
Automatic annotation Noting predictability in the patterns of human annotation, we seek to quantify whether automatic annotation would be an adequate substitute. The left half of Figure 4 indeed confirms that the simplified binary correctness annotations are highly predictable by simply checking whether any of the answer aliases appear in the (tokenized) generation. We will refer to this way of scoring correctness as match-based, and use it as an automatic proxy for human annotations when the latter are cost-prohibitive.
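The match-based check can be sketched as follows; this is a minimal illustration, not the paper's exact implementation, and the simple regex-based tokenization is our assumption (the paper does not specify its tokenizer):

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def match_based_correct(generation, gold_aliases):
    """Return True if any gold-answer alias appears as a contiguous
    token span in the tokenized generation."""
    gen_tokens = tokenize(generation)
    for alias in gold_aliases:
        alias_tokens = tokenize(alias)
        if not alias_tokens:
            continue
        n = len(alias_tokens)
        if any(gen_tokens[i:i + n] == alias_tokens
               for i in range(len(gen_tokens) - n + 1)):
            return True
    return False
```

For instance, `match_based_correct("It is called a whetstone.", ["Whetstone"])` returns `True`, while a response that never mentions any alias is scored incorrect.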
Linguistic confidence is harder to automatically infer using template- and match-based methods, as there are many ways to express doubt or confidence. Still, we find that we obtain usable predictions by training a BERT-based classifier on a set of 2000 annotated question-prediction pairs. We will refer to this way of classifying 4-way certainty (DK, LO, HI, and OT) as BERT-based and likewise use it extensively for training. This classifier works well (see the right half of Figure 4) for distinguishing DK/LO from HI, but struggles to discern between DK and LO (likely due to inconsistency in how these ambiguous categories are annotated).
Models Our base model is the state-of-the-art open-domain English-language dialogue system BlenderBot from Roller et al. (2021). "BlenderBot" refers to a suite of models of varying sizes which employ a Seq2Seq Transformer architecture (Vaswani et al., 2017). These models were pretrained on 1.5B training examples using an existing Reddit dataset extracted and obtained by a third party and made available on pushshift.io (Baumgartner et al., 2020). We use the 2.7B parameter version that is fine-tuned on the Blended Skill Talk tasks (BST; Smith et al., 2020b) and consider the outputs of beam search using the model's recommended standard parameters, which include a requirement for generated answers to have at least 20 tokens. We choose this model (referred to as "vanilla" from here on) because it is the configuration rated as most engaging by humans (Roller et al., 2021) and therefore the most realistic use case, even though it is not the best-performing QA model. This vanilla model attains an accuracy of only 4.8% on the test set, yet it answers 29.45% of questions confidently (HI), meaning that only 14% of the model's confident answers are actually correct (see Figure 6).

We also try to examine what kind of questions are intrinsically "difficult" in a way that can be detected by shallow features. For example, we might hypothesize that questions about locations are easier than questions about people; this would be reflected by the words "where" and "who" in a question being predictive of correctness. To obtain such predictive surface features, we train a sparse logistic regression model on all 2-, 3-, ..., 7-grams that appear at least 5 times in our human-annotated test set to predict binarized correctness and binarized certainty from questions (1166 such n-grams) or from answers (1882 such n-grams). These four regressions are performed independently and use sparsity-inducing L1 regularization. This yields between 9 and 19 n-grams that are useful indicators; the three most negative and positive are shown in Table 1.

Given that BST 2.7B and all other BlenderBot variants are poorly linguistically calibrated (specifically, overconfident in answers to TriviaQA questions), we introduce a pipeline for improving calibration.

Pipeline overview
We propose training a calibrator and using controllable generation techniques to allow generative dialogue agents to better "own their ignorance," i.e., such that the models' linguistic confidence better aligns with the probability that their answers are correct. The overall pipeline is illustrated in Figure 1. We first train a calibrator to return the empirical probability that the model's answer is correct (without seeing the gold answer), and fine-tune the generative dialogue model to enable control over linguistic confidence. Using the calibrator and the controllable generation model, we adjust the dialogue agent's response by choosing linguistic confidence control tokens that align with the probability returned by the calibrator, resulting in a calibrator-controlled chatbot.
Training a calibrator The first step involves training a calibrator that predicts the probability that the model's response is correct, given the question, the answer, and the vanilla model's internal representations of both. We choose an architecture which transforms the vanilla model's encoder and decoder hidden states into logits corresponding to our two classes (correct and incorrect). The model is trained using 50,000 questions from the full TriviaQA training split with the vanilla model's corresponding responses, automatically annotated for correctness using the match-based annotation scheme (see §3). Ablations in §5.2 show that different models for the calibrator, some not using the answer, some not using the internal representations, yield similar results.
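One plausible shape for such a calibrator head is sketched below in PyTorch; this is our own minimal illustration (mean pooling and a single linear layer are assumptions, and the paper's actual architecture may differ), showing only how encoder and decoder hidden states could be mapped to a correctness probability:

```python
import torch
import torch.nn as nn

class Calibrator(nn.Module):
    """Sketch: pool the dialogue model's encoder and decoder hidden
    states and map them to correct/incorrect logits."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.head = nn.Linear(2 * hidden_dim, 2)

    def forward(self, enc_states, dec_states):
        # enc_states: (batch, src_len, hidden), from the question;
        # dec_states: (batch, tgt_len, hidden), from the generated answer.
        pooled = torch.cat(
            [enc_states.mean(dim=1), dec_states.mean(dim=1)], dim=-1)
        logits = self.head(pooled)
        # Return p(correct), the probability used downstream in the pipeline.
        return torch.softmax(logits, dim=-1)[:, 1]
```

Training would then minimize cross-entropy against the match-based binary correctness labels.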
Training a controllable generation model The next step trains a generative model that adjusts the linguistic confidence of a response, given the original response and a control token representing the desired linguistic confidence: <DK>, <LO>, or <HI>. We achieve this by fine-tuning the generative dialogue model in two stages using controllable conditioned generation techniques. The model thus learns to associate the linguistic confidence of the response with the control tokens, and can generate responses with a desired degree of confidence at inference time by setting the appropriate control token. We refer to this first-stage model as the only-certainty-controlled model.
Stage 2: confidence-and-content controlled model Adjusting the linguistic confidence of a generated response via control tokens with the only-certainty-controlled model often also changes the content of the response. Simultaneous control over both linguistic confidence and content would be preferable, to allow changing the linguistic confidence of a given response without altering the provided answer to a question. We achieve this in a second stage of fine-tuning by constructing a task that simultaneously conditions on linguistic confidence and response content. Training prompts for this task are constructed by concatenating the same 25,000 TriviaQA training-split questions with the vanilla model's response, a linguistic confidence control token as before, and an additional control token capturing whether the content of the only-certainty-controlled model's response (when given that question and linguistic confidence control token) matches the vanilla model's answer. At inference time, this lets us adjust the linguistic confidence without changing the answer to the question. We refer to this model as our "controlled" model, to be used in the final pipeline.
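The prompt construction above can be sketched as follows; the exact concatenation format and the content-token spellings (`<SAME>`/`<DIFF>`) are our assumptions for illustration, since the paper only specifies the certainty tokens <DK>, <LO>, and <HI>:

```python
def build_stage2_prompt(question, vanilla_response, certainty, same_content):
    """Sketch of a stage-2 training prompt: concatenate the question,
    the vanilla model's response, a linguistic-confidence control
    token, and a (hypothetical) content control token marking whether
    the target response keeps the vanilla answer."""
    assert certainty in ("<DK>", "<LO>", "<HI>")
    content_token = "<SAME>" if same_content else "<DIFF>"
    return f"{question} {vanilla_response} {certainty} {content_token}"
```

At training time the target is the only-certainty-controlled model's response; at inference time, passing the content-preserving token lets the controlled model restate the vanilla answer at a chosen confidence level.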

Results
We describe data collection and annotation results, as well as experimental results and analysis on the vanilla model and each stage of the pipeline for the calibrator-controlled chatbot.

Data collection and annotation
We collect human annotations both for training data and for our final evaluation of the vanilla model and the calibrator-controlled chatbot. Question and response pairs are annotated for both correctness and linguistic confidence using the annotation scheme described in §3. Crowdsource annotators annotate questions in batches of nine, after completing an "onboarding" test of three questions. For our final evaluation, we annotate responses to 5000 questions from the TriviaQA validation split (none of which overlap with the VALID SET) for each of the vanilla model and the controlled model under all three linguistic confidence control settings (DK, LO, HI). We refer to this size 3 × 4 × 5000 set as the TEST SET throughout. Note that evaluating our calibrator-controlled chatbot would only require annotating responses generated with the one linguistic confidence control token dictated by the probability returned by the calibrator for each example. However, collecting annotations for all three linguistic confidence control settings allows future work to improve the calibrator in isolation, without having to re-train and re-label the controlled outputs.

Training data
Inter-annotator agreement We analyze agreement between annotators using the question and response pairs from the VALID SET that were annotated three times each. For linguistic confidence, 43.60% of samples have all three annotators agree and 97.60% have at least two agree. For four-way correctness, these ratios are 69.15% and 97.90%; for binary correctness, they are 94.35% and 99.40%. We restrict ourselves to samples for which a majority label (binary, in the case of correctness) exists and take the majority label, reducing the size of the VALID SET from 2000 to 1793 examples and the size of the TEST SET from 5000 to 4793 examples.

Calibrator training results
The calibrator-controlled chatbot can only be as good as the calibrator, requiring the ability to reliably predict how likely an answer is to be correct without access to additional knowledge. Figure 5 plots the observed correctness on the TEST SET against the probability predicted by the calibrator that we selected using the VALID SET, and shows that the calibrator does a good job of predicting correctness probability. This makes it possible to align expressed confidence with a more realistic likelihood of getting the answer right. We also evaluate calibration using the metrics from Guo et al. (2017). The first two metrics assume that examples are sorted into equally spaced bins by their predicted likelihood of correctness (the bins thus need not contain the same number of samples). We can then define the "distance" between the predicted likelihood of correctness of a bin (the midpoint between the start and the end of the bin) and the actual correctness of the bin (the average over all individual examples, counting correct ones as 1 and incorrect ones as 0); lower is better. Using these distances, the Expected Calibration Error (ECE) is the weighted average of all bins' distances (weighted by how many of the total samples fall in each bin); our calibrator achieves an ECE of 0.018. Similarly, the Maximum Calibration Error (MCE) is the maximum of all bins' distances; our calibrator reaches an MCE of 0.292. Finally, we can calculate the Average Negative Log-Likelihood (ANLL) by averaging every individual example's NLL, which for correct examples is the negative log of the predicted likelihood of being correct, and for incorrect examples the negative log of the complementary event, i.e., −log(1 − p). The calibrator reaches an ANLL of 0.165.
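The three metrics defined above can be computed as follows (a minimal sketch following the definitions in this section, including the midpoint-based bin distance; the function name is ours):

```python
import math

def calibration_metrics(probs, correct, n_bins=20):
    """Compute ECE, MCE (equal-width bins, using each bin's midpoint as
    its predicted likelihood of correctness), and average NLL."""
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(probs, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append(c)
    ece, mce = 0.0, 0.0
    for i, members in enumerate(bins):
        if not members:
            continue
        midpoint = (i + 0.5) / n_bins
        dist = abs(midpoint - sum(members) / len(members))
        ece += dist * len(members) / len(probs)  # weighted by bin size
        mce = max(mce, dist)
    # ANLL: -log p for correct examples, -log(1 - p) for incorrect ones.
    anll = -sum(math.log(p) if c else math.log(1 - p)
                for p, c in zip(probs, correct)) / len(probs)
    return ece, mce, anll
```

For example, predictions of 0.05 on two incorrect answers and 0.95 on two correct answers yield an ECE and MCE of 0.05 each (the midpoint-to-accuracy distance) and an ANLL of −log 0.95 ≈ 0.051.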
Table 2: Calibration metrics (Guo et al., 2017). Closer to zero is better for all metrics. Both calibration error metrics require binning the data by its calibrator output probability. "Threshold 0.375" means that we have only two bins, split on the threshold we end up choosing in the calibrator pipeline (§5.4); note that this threshold was picked using results from the +enc +dec setup, so it was not optimized for the other setups. Note that the MCE in the 20-bin case is usually decided by a bin that contains a single incorrect example for which the calibrator happened to predict a high probability of being correct.

Note that these metrics show and reward capturing different degrees of uncertainty and incorrectness that may not be as apparent in our main results in §5.4, as most examples are low-confidence and low-correctness. We also experimented with training calibrators with more limited inputs, which could potentially allow for controlled generation based merely on the question; we leave this for future work. The results of these ablations are shown in Table 2 and suggest that (1) even questions by themselves contain enough information to predict correctness almost as reliably as our full calibrator (+enc -dec), and (2) empirical correctness can even be predicted directly from words using an independent model (BERT, fine-tuned) with reasonable accuracy. This could be seen as corroboration of our n-gram findings in Table 1, meaning that certain kinds of questions, e.g., those asking for "who" and "which," are intrinsically difficult, and a fine-tuned BERT calibrator can pick up on the fact that the chatbot struggles with these kinds of questions. Unlike the n-gram predictors, BERT can probably also pick up on less shallow trends in questions that tend to be hard vs. easy, explaining its surprisingly good performance. So, while our existing setup shows that calibration can be achieved reasonably well without leveraging model internals (BERT can do reasonably well, too, despite different training data) or even full question-answer pairs (see the +enc -dec ablation), it does support us in our central objective: being able to predict how likely an answer is to be correct so that we can intervene correctly. We are confident that the calibrator can be improved so it can make better use of all the provided information, but we leave this for future work.
For qualitative insight, Table 5 shows all question/answer pairs for which the calibrator believes the answers are more likely right than wrong. Note also that the questions and answers do not seem to all be connected through some exploitable surface pattern, corroborating the claim that the calibrator makes use of more interesting model-internal representations.

Controllable generation training results
The final controllable model shows convincing separation of confident from non-confident answers on the TEST SET, as seen in two non-cherry-picked examples in Table 3. Combining the <DK> and <LO> categories (see the discussion in §3), 98.79% and 99.12% of <DK>- and <LO>-forced generations, respectively, are rated by humans as not belonging to the HI category, and 96.27% of <HI>-forced generations are judged as HI by humans. Furthermore, 88.46% of questions that the vanilla model answered correctly remain correct when letting the <HI>-forced model answer the same questions. By contrast, the only-certainty-controlled model (not conditioned on the initial answer itself) only maintains 56.81% of correct answers as correct when conditioned on the <HI> token. This justifies the two-stage approach of conditioning on the first response. In fact, 61.65% of questions that were answered confidently and correctly by the vanilla model are given the word-for-word same answer by the calibrator-controlled chatbot. Finally, the controlled chatbot does not lose much performance on the original BST 2.7B training tasks: performance on these validation sets drops by less than one point of perplexity.
question: How many Von Trapp children were there in the Sound of Music?
vanilla: I'm not sure, but I do know that the film was based on a novel by James Fenimore Cooper.
<DK>: I'm not sure, but I do know that there were eight children in the film.
<LO>: I'm not sure, but I do know that there were eight children in the film.
<HI>: There were eight Von Trapp children, including Sally Hemings, Sally Field, and Johnny Depp.

question: Who was the only British medal winner at the 2010 Winter Olympics?
vanilla: That would be Michael Phelps. He was born and raised in Birmingham, England.
<DK>: I'm not sure, but I do that the winner of that event was Michael Phelps.
<LO>: I'm not sure, but I do know that the winner of that event was Michael Phelps.
<HI>: That would be Michael Phelps. He was born and raised in Birmingham, England.
Table 3: Contrasting answers from the vanilla model and the controlled model with all three certainty control tokens, over a non-cherry-picked (randomly sampled) list of questions.

Evaluating calibrator-controlled chatbot
Finally, it is time to evaluate our calibrator-controlled chatbot and the vanilla model, both on the TEST SET, which contains 4793 examples (see §5.1), using full human annotations for both the correctness and the certainty of all evaluated models' generations.
Running the calibrator-controlled chatbot requires mapping the empirical correctness probabilities returned by the calibrator to the control tokens used by the controllable model. For this, we select thresholds on the calibrator outputs that map to DK, LO, and HI by searching over all threshold values between 0 and 1 (in steps of 0.025) to maximize p(correct | HI) on the first 1000 questions of the TEST SET, which are therefore subsequently excluded from the final test-set results. This results in thresholds of 0 and 0.375, so the calibrator is never asked to produce DK, even though the resulting sentence sometimes ends up being annotated as such (see also §3 on the ambiguity between the two categories).
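The resulting mapping from calibrator probability to control token is simple enough to state directly (a sketch; the function name is ours, the thresholds are those selected above):

```python
def confidence_token(p_correct, thresholds=(0.0, 0.375)):
    """Map the calibrator's correctness probability to a linguistic
    confidence control token. With the selected thresholds (0 and
    0.375), <DK> is never chosen, since no probability is below 0."""
    t_dk, t_lo = thresholds
    if p_correct < t_dk:
        return "<DK>"
    if p_correct < t_lo:
        return "<LO>"
    return "<HI>"
```

For the Figure 1 example, a calibrator output of 0.17 falls below 0.375 and thus selects <LO>, yielding "I'm not sure, but my guess is Los Angeles."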
Figure 6 shows that our calibrator-controlled chatbot displays much better linguistic calibration, with the correctness of linguistically confident answers (both judged by humans) jumping nearly threefold, from 13.7% to 38.9%.12 Note that this is achieved by answering far fewer questions confidently, which is a necessary side effect for a chatbot whose overall correctness is low. The full confusion matrix between the vanilla and calibrator-controlled chatbots is shown in Table 4.
It is thus not surprising that just generating

12 The increase is highly significant with p < 10−6 under a paired permutation test.

Figure 6: Human majority annotations on the vanilla model's and the calibrator-controlled chatbot's answers to held-out test questions, given as percentages of the total for which majorities exist. Gray highlight: confidently given answers that are actually correct, to capture calibration of confidence. The plot on the left shows the average binary correctness for both the vanilla chatbot and the calibrator-controlled chatbot (i.e., the last two columns of the table on the right): the vanilla chatbot is rarely correct, even when it claims to be certain through language. The calibrator-controlled chatbot has more than double the chance of being correct when it expresses certainty linguistically, compared to the vanilla model. This comes with more selective use of HI (and to a lesser extent DK), as shown on the right. The data here is the set of 3793 examples from the clean test set (after discarding the examples used for tuning the thresholds) for which there was majority agreement on annotations.
question: What type of animal is a frogmouth?
vanilla: Frogmouths are a type of amphisbaenia that live in freshwater.
calibrator-controlled chatbot: I'm not sure, but I think it's a type of freshwater fish.

question: What type of animal is a Garron?
vanilla: A garron is a type of lizard. They are native to the Americas.
calibrator-controlled chatbot: I'm not sure, but I think it's a type of lizard. It's also the name of a town in France.

Conclusion
This work has shown that (1) the state-of-the-art conversational model BlenderBot (Roller et al., 2021) is poorly linguistically calibrated, expressing confidence for answers that are very likely incorrect, but (2) correctness likelihood can be well predicted by a trained calibrator, and (3) using those predictions in a controlled generation architecture makes it possible to greatly improve the linguistic calibration of the model. However, confident answers are still often incorrect, so there is room for further improvement before models can reliably communicate correctness. Importantly, improved calibration should not be viewed as sufficient remediation to allow deployment of current models for most applications beyond entertainment and research, given that it does not address low accuracy or the myriad other broader issues of generative models: rather, it tries to make those issues more transparent directly through what the model says.
The inference-time control techniques we adopted are easy to turn on and off through the choice of control tokens. This allows for flexible adjustments depending on the conversation requirements, e.g., being very openly ignorant in settings that require higher sensitivity, deliberately expressing uncertainty to leave space for the conversation partner to give their own answer, or committing to confident answers even if they are incorrect in low-stakes casual conversation settings where goofy mistakes are acceptable or even funny. If this flexibility is not required, future work could explore "baking in" the linguistic calibration so that a vanilla model directly expresses the correct level of confidence, e.g., through retraining as in Xu et al. (2020), or by training the model specifically not to output responses for which confidence and correctness don't match through unlikelihood techniques (Welleck et al., 2020; Li et al., 2020). Another promising avenue is to consider the whole set of possible responses as a distribution before a specific decoding choice has committed to an answer, and to leverage that distribution to increase the accuracy of the response, or indeed to further improve calibration. Finally, this focus on meta-level considerations of chatbot responses could be applied to domains other than accurate question answering, for example training a model to recognize when it is about to say something potentially insensitive, contradict itself, repeat itself excessively, or exhibit any other measurable trait of interest in a conversation: openly acknowledging potential problems in a response might be an easier first step than fixing them.
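To make the inference-time control concrete, here is a minimal sketch (not the paper's code) of mapping the calibrator's probability to one of the DK/LO/HI control tokens and prepending it to the model input. The threshold values are illustrative placeholders, not the tuned thresholds from the paper:

```python
def control_token(p_correct, lo_threshold=0.1, hi_threshold=0.5):
    """Map the calibrator's estimated probability of correctness to a
    linguistic-confidence control token.  Thresholds are illustrative
    placeholders, not the values tuned on the validation set."""
    if p_correct < lo_threshold:
        return "<DK>"  # openly admit ignorance
    if p_correct < hi_threshold:
        return "<LO>"  # hedge: "I'm not sure, but ..."
    return "<HI>"      # answer confidently

def prepare_input(question, p_correct):
    """Prepend the chosen control token so a token-conditioned
    generator produces a response at the matching confidence level."""
    return f"{control_token(p_correct)} {question}"

# The Figure 1 example: p(correct) = 0.17 yields a hedged answer.
model_input = prepare_input("What is the largest US city?", 0.17)
```

Turning the control off simply means omitting the token (or always emitting the same one), which is what makes the adjustment flexible at deployment time.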

Figure 4: Composition of the vanilla bot's answers on the VALID SET (in % of total): comparing match-based correctness scoring to human annotations (left; treating binarized human labels as gold, the match-based correctness labels have 0.85 precision and 0.91 recall) and BERT-based linguistic confidence scoring to human annotations (right; binarizing linguistic confidence into HI and not-HI, the classifier has 0.90 precision and 0.97 recall for detecting linguistic confidence).

Stage 1: confidence-controllable model. We first train a linguistic-confidence-controllable generative dialogue model following the method in Smith et al. (2020a). We fine-tune the vanilla model on the original BST tasks, augmented with an additional task constructed from TriviaQA to incorporate confidence signals: 25,000 questions from the TriviaQA training split are augmented with a control token capturing the linguistic confidence of the vanilla model's response, as given by the BERT-based classifier (§3). The expected output is the vanilla model's response to the question. All incorrectly answered examples and examples with the OT label are discarded, and the remaining examples are oversampled to match the overall certainty distribution observed on the VALID SET.
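The data construction for this stage can be sketched roughly as follows. This is an illustrative reconstruction under assumptions about the data format (dicts with question/response/label/correct fields), not the actual training pipeline, and `target_dist` stands in for the certainty distribution measured on the VALID SET:

```python
def build_control_task(examples, target_dist):
    """Sketch of Stage-1 data construction: drop incorrect and OT
    examples, prefix each question with the classifier's confidence
    label as a control token, and oversample so label proportions
    follow target_dist (which must cover every remaining label)."""
    kept = [ex for ex in examples if ex["correct"] and ex["label"] != "OT"]
    by_label = {}
    for ex in kept:
        by_label.setdefault(ex["label"], []).append(ex)
    # Scale so proportions follow target_dist while every kept example
    # is used at least once (rarer labels get repeated).
    scale = max(len(pool) / target_dist[label]
                for label, pool in by_label.items())
    task = []
    for label, pool in by_label.items():
        n = int(round(scale * target_dist[label]))
        resampled = [pool[i % len(pool)] for i in range(n)]
        task += [{"input": f"<{label}> {ex['question']}",
                  "output": ex["response"]} for ex in resampled]
    return task

toy = [
    {"question": "q1", "response": "r1", "label": "HI", "correct": True},
    {"question": "q2", "response": "r2", "label": "HI", "correct": True},
    {"question": "q3", "response": "r3", "label": "LO", "correct": True},
    {"question": "q4", "response": "r4", "label": "HI", "correct": False},  # dropped
    {"question": "q5", "response": "r5", "label": "OT", "correct": True},   # dropped
]
task = build_control_task(toy, {"HI": 0.5, "LO": 0.5})
```

In the toy run, the single LO example is repeated so that HI and LO each make up half of the resulting task, mirroring the oversampling step described above.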

Figure 5: Calibrator performance, evaluated on the TEST SET by comparing the ratio of answers that were actually correct to the (binned) probability returned by the classifier. The size and label of each point indicate the number of question-answer pairs in each of 20 bins.
An example entry in this dataset looks like this:

What is the name of the tool used to sharpen a knife?
(Steel, Crude steel, Long steel products, Steel, Steel (alloy), Steel (metal), Steel Construction, Steel in Africa, Steel industry, Steel manufacture, Steel plate, Steel sheeting, Steel truss, Steel worker, Steel workers, Steels, Steelworker, Steelworkers, Titanic steel, Unwrapped steel)
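Match-based correctness scoring against such an alias list can be implemented along these lines; this is one plausible implementation (the paper's exact normalization rules may differ):

```python
import re

def normalize(text):
    """Lowercase, strip punctuation, collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", text.lower())).strip()

def match_correct(response, aliases):
    """Match-based correctness: the response counts as correct if any
    accepted alias appears in the normalized response."""
    norm = normalize(response)
    return any(normalize(a) in norm for a in aliases)

aliases = ["Steel", "Steel (alloy)", "Steels", "Steelworker"]
hit = match_correct("You'd use a sharpening steel.", aliases)
miss = match_correct("A whetstone, I believe.", aliases)
```

As Figure 4 indicates, such string matching is imperfect against human judgments (0.85 precision, 0.91 recall on the VALID SET), e.g., a correct paraphrase that uses none of the aliases would be scored as incorrect.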

Until 2001, what was the name of the 128-bit game console produced by Sega that has developed quite a cult following?
[Figure: emoji-annotated example; emoji in this figure only are Twitter Emoji (Twemoji), distributed under CC BY 4.0.]

Table 1: Predictive n-grams (with n ∈ {2, …, 7}) in questions and answers, with their associated weights; negative weights indicate a push towards "correct" and OT/DK/LO, while positive weights count towards "incorrect" and HI.
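Given per-n-gram weights from a linear model, the most predictive cues in each direction can be extracted as below. The weights shown are hypothetical values for illustration only, not the ones reported in Table 1:

```python
def predictive_ngrams(weights, k=2):
    """Rank n-grams by the weight a linear calibrator assigned them.
    Following the table's convention, negative weights push towards
    "correct" (and OT/DK/LO), positive weights towards "incorrect"
    (and HI).  Returns the k strongest cues in each direction."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1])
    towards_correct = ranked[:k]           # most negative weights
    towards_incorrect = ranked[-k:][::-1]  # most positive weights
    return towards_correct, towards_incorrect

# Hypothetical weights, for illustration only (not from the paper).
weights = {
    "name of the": 1.2,
    "i think": 0.4,
    "how many": -0.3,
    "is the capital": -0.9,
}
correct_cues, incorrect_cues = predictive_ngrams(weights)
```

Inspecting the extremes of the weight vector in this way is what produces a table of the strongest lexical cues for and against correctness.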