## Abstract

The Bayesian model of confidence posits that confidence reflects the observer's posterior probability that the decision is correct. Hangya, Sanders, and Kepecs (2016) have proposed that researchers can test the Bayesian model by deriving qualitative signatures of Bayesian confidence (i.e., patterns that one would expect to see if an observer were Bayesian) and looking for those signatures in human or animal data. We examine two proposed signatures, showing that their derivations contain hidden assumptions that limit their applicability and that they are neither necessary nor sufficient conditions for Bayesian confidence. One signature is an average confidence of 0.75 on trials with neutral evidence. This signature holds only when class-conditioned stimulus distributions do not overlap and when internal noise is very low. Another signature is that as stimulus magnitude increases, confidence increases on correct trials but decreases on incorrect trials. This divergence signature holds only when stimulus distributions do not overlap or when noise is high. Navajas et al. (2017) have proposed an alternative form of this signature; we find no indication that this alternative form is expected under Bayesian confidence. Our observations give us pause about the usefulness of the qualitative signatures of Bayesian confidence. To determine the nature of the computations underlying confidence reports, there may be no shortcut to quantitative model comparison.

## 1 Introduction

Humans possess a sense of confidence about decisions they make, and asking human subjects for their decision confidence has been a common psychophysical method for over a century (Peirce & Jastrow, 1884). But despite the long history of confidence reports, it is still unknown how the brain computes confidence reports from sensory evidence. The leading proposal has been that observers' confidence reports are a function of only their posterior probability that their decision is correct (Drugowitsch, Moreno-Bote, & Pouget, 2014; Hangya, Sanders, & Kepecs, 2016; Kepecs & Mainen, 2012; Meyniel, Sigman, & Mainen, 2015; Pouget, Drugowitsch, & Kepecs, 2016), a hypothesis that we call the Bayesian confidence hypothesis (BCH) (Adler & Ma, 2018).

In recent years, some researchers have tested the BCH by formally comparing Bayesian confidence models to other models (Adler & Ma, 2018; Aitchison, Bang, Bahrami, & Latham, 2015). Although this is the most thorough method to test the BCH, it can be laborious in practice. One could instead try to describe signatures of the BCH---qualitative patterns that should theoretically emerge from Bayesian confidence---and then look for those patterns in real data. Hangya et al. (2016) propose four signatures, some of which have been observed in behavior (Kepecs, Uchida, Zariwala, & Mainen, 2008; Lak et al., 2014; Sanders, Hangya, & Kepecs, 2016) and in neural activity (Kepecs et al., 2008; Komura, Nikkuni, Hirashima, Uetake, & Miyamoto, 2013).

These signatures are not unique to the Bayesian model; they are expected under a number of other models. Kepecs and Mainen (2012) argue that this is an advantage for a confidence researcher who is not interested in the precise algorithmic underpinnings of confidence. A researcher may observe these signatures in behavior, reasonably conclude that she has evidence that the observer is computing some form of confidence, and probe more deeply into, for instance, neural activity (Kepecs et al., 2008). In this letter, however, we consider the researcher concerned with understanding the computations underlying an observer's sense of confidence. We, along with Insabato, Pannunzi, and Deco (2016) and Fleming and Daw (2017), argue that for such a researcher, the fact that these signatures emerge from multiple models poses a problem. These signatures are not sufficient conditions for any particular model of confidence, including the Bayesian model. In other words, observation of these signatures does not constitute strong evidence in favor of any particular model. Because of this insufficiency, we view with skepticism any research that uses observation of these signatures as the basis for a claim that an observer uses a Bayesian (Navajas et al., 2017), “statistical” (Sanders et al., 2016), or any other specific form of confidence.

Although they do not claim that the signatures are sufficient conditions, Hangya et al. (2016) do claim that the signatures are necessary conditions for the BCH—that if confidence is Bayesian, these patterns will be present in behavior. Observation of a single necessary but not sufficient signature does not imply that the BCH is true; one would need to observe several signatures in order to gain confidence in the nature of confidence.^{1}

The main contribution of this letter is to show that three signatures are not necessary conditions of Bayesian confidence, which reduces the overall value of the qualitative signature method for testing the BCH. We describe conditions under which these signatures are expected or not expected under the BCH. Researchers interested in Bayesian confidence should be aware of these conditions in order to avoid making one of two mistakes. First, a researcher who incorrectly believes that a signature is expected under the BCH will then incorrectly interpret the observation of a signature as positive evidence in favor of the Bayesian model. Conversely, if such a researcher fails to observe that signature, they will incorrectly rule out Bayesian confidence.

One signature is a mean confidence (i.e., the observer's estimated probability of being correct) of 0.75 on trials with neutral evidence. In section 3, we show that under the BCH, this signature will be observed only when stimulus distributions do not overlap and when noise is very low. Another signature is that as stimulus magnitude increases, mean confidence increases on correct trials but decreases on incorrect trials. In section 4, we show that under the BCH, this signature will be observed only when stimulus distributions do not overlap or when noise is high. (Readers who are interested only in nonoverlapping categories may skip section 4 or read it for intuition's sake.) For completeness, we briefly discuss insufficiency for both signatures. In section 5, we consider an alternative divergence signature recently proposed by Navajas et al. (2017). We show that this signature is not expected under the BCH. All code used for simulation and plotting is available at github.com/wtadler/confidence/signatures.

We hope that this letter will contribute some clarity and intuition to the study of Bayesian confidence.

## 2 Binary Categorization Tasks

We restrict ourselves to the following widely used family of binary perceptual categorization tasks (Green & Swets, 1966). On each trial, a category $C \in \{-1, 1\}$ is randomly drawn with equal probability. Each category corresponds to a category-conditioned stimulus distribution (CCSD) $p(s \mid C)$, where $s$ could be, for example, an odor mixture (Kepecs et al., 2008), the net motion energy of a random dot kinematogram (Kiani & Shadlen, 2009; Newsome, Britten, & Movshon, 1989), the orientation of a Gabor (Adler & Ma, 2018; Denison, Adler, Carrasco, & Ma, 2018; Qamar et al., 2013), or the mean orientation of a series of Gabors (Navajas et al., 2017). The CCSDs are mirrored across $s = 0$: $p(s \mid C = -1) = p(-s \mid C = 1)$. Additionally, they are chosen such that a stimulus $s$ is at least as likely to be drawn from category $C = 1$ as from $C = -1$: $p(s \mid C = 1) \ge p(s \mid C = -1)$ for all $s \ge 0$.

A stimulus $s$ is drawn from the chosen CCSD and presented to the observer. Observers do not have direct access to the value of $s$; instead, they take a noisy measurement $x$, drawn from the distribution $p(x \mid s, \sigma) = N(x; s, \sigma)$, which denotes a gaussian distribution over $x$ with mean $s$ and standard deviation $\sigma$ (see Figure 1).

If an observer's choice behavior is Bayesian (i.e., minimizes expected loss, which, in a task where each category has equal reward, is equivalent to maximizing accuracy), he computes the posterior probability of each category by marginalizing over all possible values of $s$: $q(C \mid x, \sigma) = \int q(C \mid s)\, q(s \mid x, \sigma)\, ds$. In this letter, we use $p(\cdot)$ to refer to the true probability distributions used to, for example, generate stimuli and measurements, and $q(\cdot)$ to refer to the observer's beliefs about such distributions. In some cases, $q(\cdot)$ may not equal $p(\cdot)$, a situation known as model mismatch (Acerbi, Vijayakumar, & Wolpert, 2014; Beck, Ma, Pitkow, Latham, & Pouget, 2012; Orhan & Jacobs, 2014).

After computing the posterior, observers make a category choice $\hat{C}$ by choosing the category with the highest posterior: $\hat{C} = \arg\max_C q(C \mid x, \sigma)$. For the conditions described above, this amounts to choosing $\hat{C} = 1$ when $x > 0$ and $\hat{C} = -1$ otherwise (see appendix A).

Furthermore, if the observer's confidence behavior is Bayesian, it will be some function of the believed posterior probability of the chosen category. This probability is $q(C = \hat{C} \mid x, \sigma) = \max_C q(C \mid x, \sigma)$. Because it is a deterministic function of $x$ and $\sigma$, we refer to it as $conf(x, \sigma)$.^{2} (See appendix B for derivations of $conf(x, \sigma)$ for all stimulus distribution types used in this letter.)
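For intuition, consider the gaussian-CCSD case used later in this letter: if the observer believes $q(s \mid C = \pm 1) = N(s; \pm\mu_C, \sigma_C)$, marginalizing over $s$ gives $q(x \mid C = \pm 1) = N(x; \pm\mu_C, \sqrt{\sigma^2 + \sigma_C^2})$, so $conf(x, \sigma)$ is a logistic function of $|x|$. The following is a minimal Python sketch (ours, not the authors' code; function names and parameter values are illustrative) that checks this closed form against brute-force marginalization:

```python
import math

def conf_gaussian_ccsd(x, sigma, mu_c=1.0, sigma_c=2.0):
    """Bayesian confidence conf(x, sigma) for gaussian CCSDs N(s; ±mu_c, sigma_c)
    and gaussian measurement noise N(x; s, sigma). Marginalizing over s gives
    q(x | C = ±1) = N(x; ±mu_c, sqrt(sigma² + sigma_c²)), so the posterior of
    the chosen category is a logistic function of |x|."""
    s_tot2 = sigma**2 + sigma_c**2
    return 1.0 / (1.0 + math.exp(-2.0 * mu_c * abs(x) / s_tot2))

def conf_numeric(x, sigma, mu_c=1.0, sigma_c=2.0, n=20001, lim=30.0):
    """Same quantity by brute-force marginalization over a grid of s values."""
    def npdf(z, m, sd):
        return math.exp(-0.5 * ((z - m) / sd)**2) / (sd * math.sqrt(2 * math.pi))
    ds = 2 * lim / (n - 1)
    like = {-1: 0.0, 1: 0.0}
    for i in range(n):
        s = -lim + i * ds
        for c in (-1, 1):
            like[c] += npdf(x, s, sigma) * npdf(s, c * mu_c, sigma_c) * ds
    return max(like.values()) / sum(like.values())
```

Because $conf$ here depends on $x$ only through $|x|$, confidence is monotonically related to measurement magnitude, a property that becomes relevant in section 4.2.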

## 3 0.75 Signature: Mean Bayesian Confidence Is 0.75 for Neutral Evidence Trials

Hangya et al. (2016) propose a signature concerning neutral evidence trials—those in which the stimulus $s$ is equal to 0 (i.e., there is equal evidence for each category) and observer performance is therefore at chance. Bayesian confidence on each individual trial is always at least 0.5. One can intuitively understand why this is. In binary categorization, if the posterior probability of one option is less than 0.5, the observer makes the other choice, which has a posterior probability above 0.5. Therefore, all trials have confidence of at least 0.5, and mean confidence at any value of $s$ is also at least 0.5. Hangya et al. (2016) go beyond these results and provide a proof that, under some assumptions, mean Bayesian confidence on neutral evidence trials is exactly 0.75. We refer to this prediction as the 0.75 signature, and we show that it is not always expected under a normative Bayesian model.

### 3.1 The 0.75 Signature Is Not a Necessary Condition for Bayesian Confidence

To determine the conditions under which the 0.75 signature is expected under the Bayesian model, we used Monte Carlo simulation with the following procedure. We generated an experiment in which all stimuli $s$ were 0: $p(s \mid C) = \delta(s)$, where $\delta$ is the Dirac delta function. (For this analysis, the true generating distribution $p(s \mid C)$ does not matter; we could have instead used other distributions $p(s \mid C)$ and analyzed only trials in which $s$ is very close to 0.) For a range of measurement noise levels $\sigma$, we drew measurements $x$ from $p(x \mid s, \sigma) = N(x; s = 0, \sigma)$. Using gaussian or uniform functions $q(s \mid C)$, we computed Bayesian confidence $conf(x, \sigma)$ for each measurement. We then took the mean confidence, equal to $\mathbb{E}_{x \mid s=0}[conf(x, \sigma)]$.
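This procedure can be sketched in a few lines of Python (our sketch, not the paper's code; the CCSD width and noise level are illustrative). For uniform believed CCSDs, each category's likelihood is a difference of gaussian CDFs:

```python
import math, random

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def conf_uniform(x, sigma, lo=0.0, width=1.0):
    """Bayesian confidence when the observer believes the CCSDs are uniform:
    U(lo, lo + width) for C = 1 and its mirror image for C = -1.
    lo = 0 gives nonoverlapping CCSDs; lo < 0 makes the believed CCSDs overlap."""
    like_pos = phi((lo + width - x) / sigma) - phi((lo - x) / sigma)
    like_neg = phi((-lo - x) / sigma) - phi((-lo - width - x) / sigma)
    return max(like_pos, like_neg) / (like_pos + like_neg)

random.seed(0)
sigma, n = 0.01, 100_000                             # noise << CCSD width
xs = [random.gauss(0.0, sigma) for _ in range(n)]    # measurements on s = 0 trials
mean_nonoverlap = sum(conf_uniform(x, sigma) for x in xs) / n
mean_overlap = sum(conf_uniform(x, sigma, lo=-0.1) for x in xs) / n
print(round(mean_nonoverlap, 2), round(mean_overlap, 2))  # 0.75 0.5
```

A believed overlap of only 0.2 units collapses mean confidence on neutral evidence trials from 0.75 to 0.5, illustrating the fragility of the signature.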

The 0.75 signature holds only if the SD of the noise is very low relative to the range of the believed CCSD and if the observer has accurate knowledge of the low noise (see appendix D). Additionally, the subject must believe that the CCSDs are nonoverlapping (see Figure 2a, dotted line; any nonoverlapping CCSDs will do). If the observer believes that the CCSDs overlap by even a small amount, mean confidence on neutral evidence trials drops to 0.5. Therefore, in an experiment with overlapping CCSDs, one should not expect a Bayesian observer to produce the 0.75 signature. In experiments with nonoverlapping CCSDs, an observer's false belief might also cause him not to produce the 0.75 signature. We use the example of overlapping uniform CCSDs (see Figure 2a, solid lines) to demonstrate the fragility of this signature, although such distributions are not common in the literature. Overlapping gaussian CCSDs (see Figure 2b), however, are relatively common in the perceptual categorization literature (Adler & Ma, 2018; Ashby & Gott, 1988; Green & Swets, 1966; Norton, Fleming, Daw, & Landy, 2017; Qamar et al., 2013) and arguably more naturalistic (Maddox, 2002). Because the 0.75 signature requires both low measurement noise and the belief of nonoverlapping CCSDs, a mean confidence of 0.75 on neutral evidence trials is not a necessary condition for Bayesian confidence.

Additionally, the 0.75 signature is relevant only in experiments where subjects are specifically asked to report confidence in the form of a perceived probability of being correct (or are incentivized to do so through a scoring rule (Brier, 1950; Gneiting & Raftery, 2007; Massoni, Gajdos, & Vergnaud, 2014), although in this case, it has been argued (Adler & Ma, 2018; Ma & Jazayeri, 2014) that any Bayesian behavior might simply be a learned mapping). In other words, in an experiment where subjects are asked to report confidence on a scale of 1 to 5, a mean confidence of 3 only corresponds to 0.75 if one makes the a priori assumption that there is a linear mapping between rating and perceived probability of being correct (Sanders et al., 2016).

#### 3.1.1 Relevant Assumptions in Hangya et al. (2016)

Hangya et al. (2016) describe an assumption that is critical for the 0.75 signature: each CCSD is a continuous uniform distribution. However, the 0.75 signature depends on two additional assumptions that they make implicitly. We reproduce their proof, drawing attention to those assumptions. For clarity, we omit $\sigma$ from $conf(x, \sigma)$, $p(x \mid s = 0, \sigma)$, and $q(C = 1 \mid x, \sigma)$, as it is not necessary for the proof.

In summary, we have highlighted two assumptions that are required for Hangya et al.'s (2016) proof of the 0.75 signature: first, that the observer believes the CCSDs are nonoverlapping, and second, that measurement noise is negligible relative to the size of the neighborhood around zero over which $s$ is believed by the observer to be constant. If either assumption is violated, the proof does not apply, and the 0.75 signature is not expected under the BCH.

### 3.2 The 0.75 Signature Is Not a Sufficient Condition for Bayesian Confidence

We have shown that the 0.75 signature is not a necessary condition for Bayesian confidence, but is it a sufficient condition? A signature is a sufficient condition only if it cannot be observed under any other model. However, one could put forward a trivial non-Bayesian model in which reported confidence is always exactly 0.75 (the midpoint of the scale from 0.5 to 1), regardless of the measurement; such a model would produce the 0.75 signature. Therefore, the 0.75 signature is not a sufficient condition.

## 4 Divergence Signature 1: As Stimulus Magnitude Increases, Mean Confidence Increases on Correct Trials But Decreases on Incorrect Trials

Hangya et al. (2016) propose the following pattern as a signature of Bayesian confidence. On correctly categorized trials, mean confidence is an increasing function of stimulus magnitude (here, $|s|$), but on incorrect trials, it is a decreasing function (see Figure 3a). We refer to this pattern as divergence signature 1.^{3} For the rest of the letter, we use *divergence* to refer to the pattern of confidence as an increasing function of some variable on correct trials and a decreasing function on incorrect trials.^{4}

Divergence signature 1 has been observed in some behavioral experiments (Kepecs et al., 2008; Komura et al., 2013; Lak et al., 2014; Sanders et al., 2016). However, we demonstrate that as with the 0.75 signature (see section 3), the signature is not always expected under the BCH.^{5} Therefore, the appearance of the signature in these papers should not be taken to mean that it should be generally expected.

### 4.1 Divergence Signature 1 Is Not a Necessary Condition for Bayesian Confidence

In this section, we argue that divergence signature 1 is expected only under specific conditions on the stimulus distribution $p(s\u2223C=-1)$ and the noise distribution $p(x\u2223s,\sigma )$.

#### 4.1.1 Stimulus Distribution Type

To determine the conditions under which the divergence signature is expected under the Bayesian model, we used Monte Carlo simulation with the following procedure. We generated stimuli $s$, drawn with equal probability from stimulus distributions $p(s \mid C = -1)$ and $p(s \mid C = 1)$. We generated noisy measurements $x$ from these stimuli, using measurement noise levels $\sigma$. We generated observer choices from these measurements, using the decision rule of choosing $\hat{C} = 1$ when $x > 0$. We computed Bayesian confidence for every trial, assuming that the observer has accurate knowledge of their measurement distributions and of the CCSDs: $q(\cdot) = p(\cdot)$.

*Nonoverlapping uniform CCSDs.* We first consider the case of CCSDs that are uniform on an interval and do not overlap. This is an example covered by Hangya et al.'s (2016) proof. Indeed, we find in simulations that divergence signature 1 is expected under the Bayesian model in both high- and low-noise regimes (see Figures 3a and 3b). The intuition for why this pattern occurs is as follows. On correct trials, as stimulus magnitude increases, the mean magnitude of the measurement $x$ increases. Because measurement magnitude is monotonically related to Bayesian confidence, this increases mean confidence. However, on incorrect trials (in which $x$ and $s$ have opposite signs), the mean magnitude of the measurement decreases (see Figure 5a), which in turn decreases mean confidence (see Figures 5b and 5c). The proof by Hangya et al. (2016) and the intuition are not limited to uniform CCSDs (truncated gaussians will also work, for example), but do require the CCSDs to be nonoverlapping. When the stimulus distributions are nonoverlapping, divergence is expected under any level of measurement noise (see Figures 3a and 3b).
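The intuition above can be checked with a small simulation (ours; the bin edges and noise level are arbitrary illustrative choices). With nonoverlapping uniform CCSDs on $[-1, 0]$ and $[0, 1]$, mean confidence rises with $|s|$ on correct trials and falls on incorrect trials:

```python
import math, random

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def conf(x, sigma, a=1.0):
    """Bayesian confidence for nonoverlapping uniform CCSDs U(0, a) and U(-a, 0)."""
    like_pos = phi((a - x) / sigma) - phi(-x / sigma)
    like_neg = phi(-x / sigma) - phi((-a - x) / sigma)
    return max(like_pos, like_neg) / (like_pos + like_neg)

random.seed(1)
sigma = 0.3
bins = {("correct", "low"): [], ("correct", "high"): [],
        ("incorrect", "low"): [], ("incorrect", "high"): []}
for _ in range(400_000):
    c = random.choice([-1, 1])
    s = c * random.uniform(0.0, 1.0)       # stimulus from the C = c CCSD
    x = random.gauss(s, sigma)             # noisy measurement
    outcome = "correct" if (x > 0) == (c == 1) else "incorrect"
    if abs(s) < 0.3:
        bins[(outcome, "low")].append(conf(x, sigma))
    elif 0.5 < abs(s) < 0.8:
        bins[(outcome, "high")].append(conf(x, sigma))

mean = lambda v: sum(v) / len(v)
# Divergence: confidence rises with |s| on correct trials, falls on incorrect ones.
print(mean(bins[("correct", "high")]) > mean(bins[("correct", "low")]))      # True
print(mean(bins[("incorrect", "high")]) < mean(bins[("incorrect", "low")]))  # True
```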

*Gaussian CCSDs.* We now consider gaussian CCSDs. In this case, when measurement noise is high relative to stimulus distribution width (see Figure 3c, left), the signature is still expected. However, when measurement noise is low relative to stimulus distribution width, the divergence signature is not expected (see Figures 3c and 3d). To gain intuition for why this is, imagine an optimal observer with zero measurement noise. In tasks with overlapping categories, even this observer cannot achieve perfect performance; some trials from category $C=1$ will have negative $s$ and $x$ values, resulting in an incorrect choice. For such stimuli, confidence increases with stimulus magnitude. At relatively low noise levels, these stimuli represent the majority of all incorrect trials for category $C=1$ (see Figure 3e, right). This effect causes the divergence signature to disappear when plotting over $|s|$, that is, averaging over errors with positive and negative $s$. In this particular case, an experimenter could “rescue” the signature by plotting confidence as a function of signed stimulus value $s$ for a given true category. This would produce plots such as Figure 3e (right), which have a characteristic crossing pattern. Researchers using more unusual categories than the ones presented here might consider running simulations to see if the signature is expected and, if not, whether this method could “rescue” the signature in their case.
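The disappearance of the signature at low noise can be seen in the same style of simulation (again our sketch; $\mu_C = 1$, $\sigma_C = 2$, and the bins are illustrative choices). With gaussian CCSDs and low measurement noise, confidence on incorrect trials *increases* with $|s|$, because most errors come from stimuli whose sign is opposite that of the category:

```python
import math, random

def conf(x, sigma, mu_c=1.0, sigma_c=2.0):
    """Bayesian confidence for gaussian CCSDs N(s; ±mu_c, sigma_c):
    a logistic function of |x| after marginalizing over s."""
    return 1.0 / (1.0 + math.exp(-2.0 * mu_c * abs(x) / (sigma**2 + sigma_c**2)))

random.seed(2)
sigma = 0.1                      # measurement noise LOW relative to CCSD width
inc_low, inc_high = [], []       # incorrect trials, binned by |s|
for _ in range(300_000):
    c = random.choice([-1, 1])
    s = random.gauss(c * 1.0, 2.0)
    x = random.gauss(s, sigma)
    if (x > 0) == (c == 1):
        continue                 # keep incorrect trials only
    if abs(s) < 0.5:
        inc_low.append(conf(x, sigma))
    elif 1.5 < abs(s) < 3.0:
        inc_high.append(conf(x, sigma))

mean = lambda v: sum(v) / len(v)
# Confidence on errors RISES with stimulus magnitude: divergence signature 1 absent.
print(mean(inc_high) > mean(inc_low))  # True
```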

#### 4.1.2 Relevant Assumption in Hangya et al. (2016)

The gaussian CCSD example shows that divergence signature 1 is not a necessary condition for Bayesian confidence. By contrast, the proof in Hangya et al. (2016) seems quite general. We can resolve this paradox by making explicit the assumptions hidden in the proof. The authors assume that “for incorrect choices … with increasing evidence discriminability, the relative frequency of low-confidence percepts increases while the relative frequency of high-confidence percepts decreases” (p. 1847).^{6} This assumption is violated in the case of overlapping gaussian stimulus distributions. For some incorrect choices (see branch 4 of Figure 3e), as $s$ becomes more discriminable (i.e., very negative), the frequency of high-confidence reports increases. At low levels of measurement noise, this causes the divergence signature to disappear when plotting over $|s|$.

### 4.2 Divergence Signature 1 Is Not a Sufficient Condition for Bayesian Confidence

It has been previously noted that the signature is expected under a number of non-Bayesian models (Fleming & Daw, 2017; Insabato et al., 2016; Kepecs & Mainen, 2012). Here, we describe an additional non-Bayesian model—one in which confidence is a function only of $|x|$, the magnitude of the measurement (Kepecs et al., 2008). Previous studies have referred to similar models as Fixed (Adler & Ma, 2018; Denison et al., 2018; Qamar et al., 2013) or Difference (Aitchison et al., 2015). In the general family of binary categorization tasks described in section 2, the confidence of this model is monotonically related to the confidence of the Bayesian model $conf(x,\sigma )$. Thus, when divergence signature 1 is predicted by the Bayesian model, it is also predicted by this measurement model, underscoring that the divergence signature is not a sufficient condition for Bayesian confidence.
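To illustrate the insufficiency, here is a sketch (ours; the particular mapping is hand-picked and not any published model's exact form) of a model whose confidence is a fixed increasing function of $|x|$ alone, with no posterior computation. In the nonoverlapping-uniform task of section 4.1.1, it reproduces the divergence:

```python
import math, random

def conf_measurement(x):
    """Non-Bayesian confidence: a fixed, hand-picked increasing function of |x|.
    Any increasing function of measurement magnitude behaves the same way here."""
    return 1.0 - 0.5 * math.exp(-2.0 * abs(x))

random.seed(3)
sigma = 0.3
correct = {"low": [], "high": []}
incorrect = {"low": [], "high": []}
for _ in range(400_000):
    c = random.choice([-1, 1])
    s = c * random.uniform(0.0, 1.0)   # nonoverlapping uniform CCSDs
    x = random.gauss(s, sigma)
    group = correct if (x > 0) == (c == 1) else incorrect
    if abs(s) < 0.3:
        group["low"].append(conf_measurement(x))
    elif 0.5 < abs(s) < 0.8:
        group["high"].append(conf_measurement(x))

mean = lambda v: sum(v) / len(v)
print(mean(correct["high"]) > mean(correct["low"]))      # True
print(mean(incorrect["high"]) < mean(incorrect["low"]))  # True: divergence, no posterior
```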

## 5 Divergence Signature 2: As Measurement Noise Decreases, Mean Confidence Increases on Correct Trials but Decreases on Incorrect Trials

Navajas et al. (2017) conduct an experiment in which they present, on each trial, a sequence of oriented Gabors with orientations pseudo-randomly drawn from a uniform distribution on an interval, with the range of the interval chosen randomly from four possible values. They then ask subjects to judge whether the mean orientation is left or right of vertical and to provide a confidence report. They plot mean confidence (conditioned on correctness) as a function of stimulus range. Data from some of their subjects show strongly divergent confidence (i.e., oppositely signed slopes for confidence on correct and incorrect trials), but their averaged data (see Figure 4a) do not.

Navajas et al. (2017) write that normative arguments would lead one to expect a diverging pattern, citing Hangya et al. (2016). However, Hangya et al. (2016) show that divergence is expected only when the $x$-axis is stimulus magnitude, not stimulus distribution range. Because of this difference, we treat a divergence in this kind of plot as a new possible signature, which we call divergence signature 2. For this to be a signature of Bayesian confidence, we would have to show that a Bayesian model would predict this pattern. We show that this pattern is not necessarily expected under the BCH.

### 5.1 Navajas et al.'s (2017) Stochastic Updating Model

In their stochastic updating model, the observer's running estimate $\mu_i$ of the mean orientation is updated after each stimulus $\theta_i$ with a fixed weight $\lambda$. Navajas et al. (2017) then derive their measure of confidence from this decision variable. After fitting, this model produces a diverging pattern (see Figure 4b). Because this pattern is not present in their averaged data (see Figure 4a), they conclude that the stochastic updating model is inadequate. To account for the discrepancy, they then incorporate Fisher information into their model, which produces a better fit; the authors' main result relies on an analysis of the parameters of this “hybrid” model.

Critically, however, the stochastic updating model is not a Bayesian model. Under a Bayesian model, each $\theta_i$ would contribute equally to the final estimate of the mean. For that to follow from equation 5.1, $\lambda$ would have to equal $1/i$. However, their $\lambda$ is not $i$-dependent. Therefore, $\mu_i$ is not the decision variable on which a Bayesian observer would base either choice or confidence. The fact that the stochastic updating model is not Bayesian has two implications. First, that the stochastic updating model produces divergence signature 2 does not imply that the signature is expected under the BCH. Second, the deviation of their model predictions from the data does not provide any evidence against the BCH.

### 5.2 Simple Bayesian Model

We constructed a simple Bayesian model to test whether divergence signature 2 is generally expected under the BCH. Our model does not include an updating component because the temporal dynamics in this task are irrelevant for optimal choice and confidence.

In Navajas et al. (2017), the mean of all the stimuli presented on each trial is forced to be either $3^\circ$ or $-3^\circ$. Accordingly, we generated stimuli with $s = \pm 1$, corresponding to $C = \pm 1$. In our model, we drew noisy measurements $x$ from $p(x \mid s, \sigma) = N(x; s, \sigma)$. Under Navajas et al.'s (2017) assumption of orientation-dependent noise, draws from distributions with greater range are measured with higher levels of noise. We build this assumption into our simple model by using $\sigma$ as a proxy for stimulus distribution range: higher values of $\sigma$ correspond to trials drawn from distributions with greater range. As described in section 4.1.1, we generated observer choices and computed Bayesian confidence assuming that the observer has accurate knowledge of their measurement distributions and of the CCSDs.

We find that as measurement noise decreases, mean confidence increases for both correct and incorrect trials (see Figure 4c). This pattern also holds when the category-conditioned stimulus distributions are uniform or gaussian and if one plots a measure of stimulus distribution variance on the *x*-axis (either uniform distribution range $r$ or gaussian distribution SD $\sigma_C$). This indicates that divergence signature 2 is not necessarily expected under the BCH.
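The simulation behind this result can be sketched deterministically (our sketch; the $\sigma$ values are illustrative). With delta-function CCSDs at $s = \pm 1$, Bayesian confidence is logistic in $|x|$, and the conditional means can be computed by numeric integration rather than sampling:

```python
import math

def conf(x, sigma):
    """Bayesian confidence for delta-function CCSDs at s = ±1: the posterior
    odds are N(x; 1, σ) / N(x; -1, σ) = exp(2x/σ²), so conf is logistic in |x|."""
    return 1.0 / (1.0 + math.exp(-2.0 * abs(x) / sigma**2))

def mean_conf(sigma, correct, n=200_001, lim=10.0):
    """Mean confidence on correct (x > 0) or incorrect (x < 0) trials for the
    true category C = 1, integrating over x ~ N(1, sigma) on a grid."""
    dx = 2.0 * lim / (n - 1)
    num = den = 0.0
    for i in range(n):
        x = -lim + i * dx
        if (x > 0) != correct:
            continue
        w = math.exp(-0.5 * ((x - 1.0) / sigma) ** 2)  # unnormalized N(1, sigma)
        num += w * conf(x, sigma)
        den += w
    return num / den

results = {}
for sigma in (0.5, 0.75, 1.0):   # σ as a proxy for the stimulus-range condition
    results[sigma] = (mean_conf(sigma, True), mean_conf(sigma, False))
    print(sigma, round(results[sigma][0], 3), round(results[sigma][1], 3))
# As σ shrinks, mean confidence rises on BOTH correct and incorrect trials: no divergence.
```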

We emphasize that we are not claiming that Navajas et al.'s (2017) data are best explained by a Bayesian model. In fact, just as they use Fisher information to bend the predictions of their stochastic updating model (see Figure 4b) upward to fit their data (see Figure 4a), our simulation (see Figure 4c) suggests that a post hoc addition to our Bayesian model would have to bend the predictions downward. However, our goal is not to fit their data but merely to show that divergence signature 2 is not necessarily expected under a Bayesian model. There are several ways in which we can imagine constructing a more complete Bayesian model of their task. For example, the observer might marginalize over the nuisance parameter of stimulus range when computing confidence. Determining whether confidence in Navajas et al.'s (2017) data is Bayesian would thus require careful quantitative model comparison.

We also note that in our Bayesian model, the observer has accurate knowledge of their own measurement noise, which may not be the case for the observers in Navajas et al. (2017). However, even when observers have incorrect beliefs about their measurement noise, the pattern of mean confidence still does not show divergence as in Figure 4b (see appendix D).

### 5.3 Why the Intuition for Divergence Signature 1 Does Not Predict Divergence Signature 2

We have shown that although divergence signature 1 is not completely general, it is expected under the BCH in some cases (see Figure 3a). By contrast, we have no indication of whether divergence signature 2 is ever expected from simple Bayesian models, such as the one described in section 5.2, when plotting measurement noise on the $x$-axis. This may be surprising, because the intuition for divergence signature 1 might seem to apply equally to this case. However, the effect of measurement noise on mean confidence is different from the effect of stimulus magnitude because measurement noise, unlike stimulus magnitude, affects the mapping from measurement to confidence on a single trial.

## 6 Other Signatures

A third signature in Hangya et al. (2016) that we do not discuss here (that confidence equals accuracy) is like the 0.75 signature in that it requires either explicit reports of perceived probability of being correct or the experimenter to choose a mapping between rating and perceived probability of being correct (see section 3.1). For any monotonic relationship between accuracy and confidence, it is likely that there is some mapping that equates the two, in which case the signature would not be a sufficient condition for the BCH.

A fourth signature (that confidence allows a better prediction of accuracy than stimulus magnitude alone) is, like divergence signature 1, also predicted by the measurement model (see section 4.2) and is therefore also not a sufficient condition for the BCH.

## 7 Discussion

We have demonstrated that even in the relatively restricted class of binary categorization tasks that we consider here (see section 2), some signatures are neither necessary nor sufficient conditions for the BCH. Specifically, the 0.75 signature is expected only when observers have very low measurement noise and believe that the CCSDs are nonoverlapping. Additionally, despite claims that divergence signature 1 is “robust to different stimulus distributions” (Kepecs & Mainen, 2012), it is expected only under nonoverlapping stimulus distributions or under overlapping (e.g., gaussian) stimulus distributions with high measurement noise. (However, a researcher using overlapping stimulus distributions may still be able to “rescue” the signature by plotting a slightly modified version, as we describe in section 4.1.1.) Because of this nongenerality, these signatures are not necessary conditions of Bayesian confidence. Furthermore, they may be observed under non-Bayesian models, indicating that they are also not sufficient conditions (Fleming & Daw, 2017; Insabato et al., 2016).

A discrepancy in the literature (Navajas et al., 2017) has emerged through the confusion of divergence signature 1 with a second form, in which stimulus magnitude is replaced with another variable that is related to accuracy.^{7} We have shown that while divergence signature 1 holds in some cases, there is no evidence that the second form is ever expected under the BCH, which resolves this discrepancy.

The appearance of confidence signatures may depend on the observer's belief about the CCSDs, $q(s\u2223C)$. For instance, we showed that the 0.75 signature is not expected if the observer believes that the CCSDs are overlapping, regardless of the true distribution $p(s\u2223C)$. In our simulations of divergence signature 1, we assumed that $q(s\u2223C)=p(s\u2223C)$, but it may be that there are erroneous beliefs $q(s\u2223C)$ that eliminate this signature as well. This may be an important consideration for some experimenters due to the difficulty of communicating the CCSDs to observers, especially nonhuman observers. One might assume that with enough training, observers would learn the CCSDs, but critically, the observer has access only to $x$ and not to $s$. At high levels of measurement noise, for instance, this could lead to a belief that the categories are overlapping, which would eliminate the 0.75 signature. For human observers, experimenters may be able to ameliorate this issue by training observers on the categories at low noise, informing the subject that the CCSD will be the same at higher noise levels. However, even this might not ensure that $q(s\u2223C)=p(s\u2223C)$. Additionally, we are not aware of a good strategy for nonhuman observers. Because the signatures might not be present in data from an otherwise Bayesian observer with erroneous beliefs about the CCSDs, an experimenter expecting the signatures might incorrectly rule out that the observer is Bayesian.

Some of our critique of the signatures has focused on the implicit assumption that experiments use nonoverlapping stimulus distributions. One could object to our critique by questioning the relevance of overlapping stimulus distributions, given that nonoverlapping stimulus distributions are the norm in the confidence literature (Aitchison et al., 2015; Kepecs & Mainen, 2012; Kepecs et al., 2008; Sanders et al., 2016). But although overlapping categories are only just beginning to be used to study confidence (Adler & Ma, 2018; Denison et al., 2018), such categories have a long history in the perceptual categorization literature (Ashby & Gott, 1988; Green & Swets, 1966; Healy & Kubovy, 1981; Lee & Janke, 1964; Liu, Knill, & Kersten, 1995; Qamar et al., 2013; Sanborn, Griffiths, & Shiffrin, 2010). It has been argued that overlapping gaussian stimulus distributions have several properties that make them more naturalistic than nonoverlapping distributions (Maddox, 2002). The property most relevant here is that with overlapping categories, perfect performance is impossible, even with zero measurement noise. With overlapping categories, as in real life, identical stimuli may belong to multiple categories. Imagine a coffee drinker pouring salt rather than sugar into her drink, a child reaching for his parent's glass of whiskey instead of his glass of apple juice, or a doctor classifying a malignant tumor as benign (Augsburger, Corrêa, Trichopoulos, & Shaikh, 2008). In all three examples, stimuli from opposing categories may be visually identical, even under zero measurement noise. For more naturalistic experiments with overlapping categories, qualitative signatures will be unusable if their derivations assume nonoverlapping categories.

Given our demonstration that proposed qualitative signatures of confidence have limited applicability, what is the way forward? One option available to confidence researchers is to discover more signatures, taking care to identify the specific conditions under which each is expected. Confidence experimentalists should then look for such signatures only when their tasks satisfy those conditions (e.g., stimulus distribution type, noise level). However, for researchers interested in testing the BCH, we do not necessarily advocate this course of action, because even when applied to relevant experiments, the presence or absence of qualitative signatures provides an uncertain amount of evidence for or against the BCH. Testing for the presence of qualitative signatures is a weak substitute for accumulating probabilistic evidence, something that careful quantitative model comparison (Palminteri, Wyart, & Koechlin, 2017) does more objectively. Testing for signatures requires the experimenter to make two subjective judgments. First, the experimenter must determine whether the signature is present, a task potentially made difficult by the fact that real data are noisy. Second, the experimenter must determine how much evidence the signature's presence or absence provides for or against the BCH and whether further investigation is warranted. By contrast, model comparison provides a principled quantity (namely, a log likelihood) in favor of the BCH over some other model (Adler & Ma, 2018; Aitchison et al., 2015; Denison et al., 2018). Given the caveats associated with qualitative signatures, it may be that as a field, we have no choice but to rely on formal model comparison.
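As a toy sketch of the model-comparison logic (the models, simulated data, and Gaussian report-noise likelihood here are all hypothetical, not the analysis of any cited paper), one can score a Bayesian-confidence model against an alternative by summed log likelihood:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical measurements and two candidate confidence models
x = rng.normal(1.0, 1.0, 1000)
bayes_pred = 1 / (1 + np.exp(-2.0 * np.abs(x)))  # Bayesian-confidence model
fixed_pred = np.full_like(x, 0.75)               # non-Bayesian alternative

# Simulate reports from the Bayesian model plus a little report noise
reports = np.clip(bayes_pred + rng.normal(0, 0.02, x.size), 0.5, 1.0)

def log_lik(pred: np.ndarray, obs: np.ndarray, noise: float = 0.05) -> float:
    # Summed log likelihood under Gaussian report noise around each prediction
    return float(np.sum(-0.5 * ((obs - pred) / noise) ** 2
                        - np.log(noise * np.sqrt(2 * np.pi))))

print(f"log L, Bayesian model: {log_lik(bayes_pred, reports):.1f}")
print(f"log L, fixed model:    {log_lik(fixed_pred, reports):.1f}")
```

The difference in log likelihood quantifies the evidence for one model over the other, replacing the subjective judgments described above with a single number.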

## Appendix A: Sufficient Conditions for the MAP Decision Rule to Be $x > 0$

We wish to specify conditions under which, for all $x > 0$, the maximum a posteriori (MAP) decision rule is $\hat{C} = 1$, that is, $q(C = 1 \mid x) > q(C = -1 \mid x)$, in which $q(C = 1 \mid x)$ is the posterior probability that the category is $C = 1$, given a measurement $x$. For clarity, we remove $\sigma$ from $q(x \mid s, \sigma)$ and $q(C \mid x, \sigma)$, as it is not necessary for the proof.

We will use $\Delta_{\text{posterior}}(x) \equiv q(C = 1 \mid x) - q(C = -1 \mid x)$.

Under the above conditions, for all $x > 0$, $\Delta_{\text{posterior}}(x) \ge 0$.

Applying condition 1, condition 2, some rearrangement, and then condition 4 yields equation A.1. By condition 4, $F(|x - s|) - F(|x + s|) \ge 0$. It also follows from condition 3 that $\Delta s \ge 0$. Because both factors in equation A.1 are nonnegative, $\Delta_{\text{posterior}}(x) \ge 0$ for all $x > 0$. When $\Delta_{\text{posterior}}(x) > 0$, the category with the higher posterior probability is $C = 1$; when $\Delta_{\text{posterior}}(x) = 0$, both categories have equal posterior probability.

$\square$
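For readers who want a quick numerical sanity check of this claim (not a substitute for the proof), the following sketch assumes uniform, mirror-symmetric CCSDs and Gaussian measurement noise, and verifies $\Delta_{\text{posterior}}(x) \ge 0$ on a grid of $x > 0$:

```python
import numpy as np

# Assumed setup (illustrative): q(s | C = 1) = Uniform(0, 1),
# q(s | C = -1) = Uniform(-1, 0), and noise q(x | s) = N(x; s, sigma^2).
sigma = 0.4
s_grid = np.linspace(0.0, 1.0, 2001)  # support of q(s | C = 1)

def delta_posterior(x: float) -> float:
    # q(x | C = ±1) up to a shared constant, by averaging the likelihood
    # over each category's support; the constant cancels in the ratio.
    like = lambda s: np.exp(-(x - s) ** 2 / (2 * sigma**2))
    like_C1 = like(s_grid).mean()     # integrate over [0, 1]
    like_Cm1 = like(-s_grid).mean()   # integrate over [-1, 0]
    return (like_C1 - like_Cm1) / (like_C1 + like_Cm1)

assert all(delta_posterior(x) >= 0 for x in np.linspace(0.01, 3.0, 100))
print("Delta_posterior(x) >= 0 on the tested grid of x > 0")
```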

## Appendix B: Derivation of Bayesian Confidence
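As a sketch of the standard computation (shown here for illustrative Gaussian CCSDs $q(s \mid C = \pm 1) = N(\pm\mu, \sigma_s^2)$ with equal priors; other CCSDs require the integral form $q(x \mid C) = \int q(x \mid s, \sigma)\, q(s \mid C)\, ds$), Bayesian confidence is the posterior probability of the chosen category:

```latex
% With equal priors and believed CCSDs q(s | C = ±1) = N(±mu, sigma_s^2),
% the predicted measurement distributions are
%   q(x | C = ±1) = N(x; ±mu, sigma^2 + sigma_s^2),
% so Bayes's rule gives
\begin{align}
q(C = 1 \mid x, \sigma)
  &= \frac{q(x \mid C = 1)}{q(x \mid C = 1) + q(x \mid C = -1)}
   = \frac{1}{1 + \exp\!\left(-\frac{2 \mu x}{\sigma^{2} + \sigma_{s}^{2}}\right)}, \\
\mathrm{conf}(x, \sigma)
  &= \max_{C}\, q(C \mid x, \sigma)
   = \frac{1}{1 + \exp\!\left(-\frac{2 \mu \lvert x \rvert}{\sigma^{2} + \sigma_{s}^{2}}\right)}.
\end{align}
```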

## Appendix C: Simpler Proof of Hangya et al. (2016) Lemma

The last step of the proof of the 0.75 signature (see section 3.1.1) uses a lemma proved by Hangya et al. (2016):

There is a simpler proof of the lemma than the one by Hangya et al. (2016):

$\u25a1$

## Appendix D: False Beliefs about Measurement Noise

So far in this letter, as in Hangya et al. (2016), we have assumed that observers have accurate knowledge of their own measurement noise. Because readers may wish to know what happens when this assumption is violated, we reran our simulations under the condition that the observer has incorrect beliefs about her measurement noise. Specifically, we ran simulations using $p(x \mid s, \sigma) = N(x; s, \sigma)$ and $q(x \mid s, \sigma_{\text{believed}}) = N(x; s, \sigma_{\text{believed}})$, where $\sigma$ may or may not be equal to $\sigma_{\text{believed}}$.
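A minimal sketch of such a mismatched-noise simulation, assuming Gaussian CCSDs for tractability (all parameter values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative setup: Gaussian CCSDs q(s | C = ±1) = N(±mu, sigma_s^2).
# Measurements are generated with the TRUE noise sigma, but the observer
# computes confidence with sigma_believed.
mu, sigma_s = 1.0, 0.5
sigma, sigma_believed = 1.0, 0.3

# One block of C = 1 trials
s = rng.normal(mu, sigma_s, 50_000)
x = rng.normal(s, sigma)          # p(x | s, sigma)

# The observer's posterior uses sigma_believed, so her confidence mapping
# conf(x, sigma_believed) does not depend on the true sigma.
k = 2 * mu / (sigma_believed**2 + sigma_s**2)
post_C1 = 1 / (1 + np.exp(-k * x))
conf = np.maximum(post_C1, 1 - post_C1)

correct = x > 0                   # MAP choice is the sign of x; here C = 1
print(f"mean confidence, correct trials:   {conf[correct].mean():.3f}")
print(f"mean confidence, incorrect trials: {conf[~correct].mean():.3f}")
```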

### D.1 Divergence Signature 1

### D.2 0.75 Signature

### D.3 Divergence Signature 2

We observe that divergence signature 2 (see section 5) does not appear for any value of $\sigma_{\text{believed}}$ that we tested (see Figure 8). Our conclusion that divergence signature 2 is not expected under the BCH is therefore robust to the observer having incorrect beliefs about her measurement noise.

To understand why, as $\sigma$ decreases, the slope of mean confidence for correct and incorrect trials decreases rather than increases as in Figure 4c, consider the didactic presented in Figure 5. In each individual case shown in Figure 8, we vary $\sigma$ in $p(x \mid s, \sigma)$ but fix $\sigma_{\text{believed}}$ to a single value in $q(x \mid s, \sigma_{\text{believed}})$. This means that the generating distributions $p(x \mid s, \sigma)$ will vary with $\sigma$, as depicted in Figure 5d, but that in Figure 5e, there will be only one confidence mapping function $\text{conf}(x, \sigma_{\text{believed}})$ for all $\sigma$. This is what changes the sign of the slope of mean confidence.

## Appendix E: Terminology and Notation

Because some of our terminology and notation relate to that used in Hangya et al. (2016), we provide Table 1 to enable easier comparison between the two papers. In some cases, the variables are not exactly identical: the terms in Hangya et al. may be more general. This does not affect the validity of our claims. For consistency, we always describe their work using our terminology and notation.

Table 1: Correspondence between our terminology and notation and that of Hangya et al. (2016).

| This Letter | Hangya et al. (2016) |
| --- | --- |
| True category $C$ | Not used |
| Stimulus $s$ | Evidence $d$ |
| Stimulus magnitude $\lvert s \rvert$ | Discriminability $\Delta$ |
| Measurement $x$ | Percept $\hat{d}$ |
| Measurement noise $\sigma$ | Not used |
| Choice $\hat{C}$ | Choice $\vartheta$ |
| Confidence $q(C = \hat{C} \mid x, \sigma) = \mathrm{conf}(x, \sigma)$ | Confidence $c = \xi(\hat{d}, \vartheta)$ |


## Notes

^{1}

Restating this logic in probabilistic terms, a signature being a necessary condition for the BCH implies that $p(\text{signature observed} \mid \text{BCH is true}) = 1$. A signature being an insufficient condition implies that $p(\text{signature observed} \mid \text{BCH is false}) > 0$. By Bayes's rule, for signatures that are both necessary and insufficient, $p(\text{BCH is true} \mid \text{signature(s) observed})$ will increase with the observation of each signature but will never reach 1.
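A worked example of this footnote's arithmetic, using illustrative numbers and assuming the signatures are conditionally independent given the truth of the BCH:

```python
def posterior_bch(prior: float, p_sig_given_not_bch: float) -> float:
    # Bayes's rule with p(signature | BCH true) = 1 (signature is necessary)
    # and p(signature | BCH false) > 0 (signature is insufficient)
    return prior / (prior + (1 - prior) * p_sig_given_not_bch)

p = 0.5  # illustrative prior that the BCH is true
for i in range(1, 4):  # observe three signatures in sequence
    p = posterior_bch(p, 0.4)  # illustrative p(signature | BCH false) = 0.4
    print(f"after signature {i}: p(BCH is true) = {p:.3f}")
```

Each observed signature raises the posterior (here 0.714, then 0.862, then 0.940), but the posterior can never reach 1 as long as $p(\text{signature} \mid \text{BCH is false}) > 0$.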

^{2}

Note that our assumption that confidence and category choice are deterministic functions of $x$ amounts to an assumption that there is no noise at the action (i.e., reporting) stage.

^{4}

The term *divergence* does not normally imply opposite trends. For example, the lower function could be flat or even increasing. However, we could not think of a better one-word alternative.

^{5}

Our finding is distinct from that of Insabato et al. (2016), who show that the signature would not be predicted under a non-Bayesian model in which the observer uses two measurements on each trial. Our analyses concern only Bayesian models in which the observer has a single measurement on each trial. Our finding is also distinct from that of Fleming and Daw (2017), who show that the divergence signature would not be predicted if the experimenter could plot confidence as a function of the internal measurement $x$. Our analyses concern confidence only as a function of the stimulus $s$, which, unlike $x$, is known by the experimenter.

^{6}

Earlier in their paper, Hangya et al. (2016) phrase this assumption as, “For any given confidence $c$, the relative frequency of percepts mapping to $c$ by $\xi $ changes monotonically with evidence discriminability for any fixed choice” (p. 1847). In our terminology, this is equivalent to saying that as $|s|$ increases, the frequency of reporting any particular level of confidence changes monotonically. This is not correct even in the case of nonoverlapping uniform stimulus distributions. For example, at low noise, as discriminability increases, the frequency of medium-confidence reports will increase and then decrease. Therefore, we will use the formulation of the assumption further down on p. 1847, which correctly narrows it down to incorrect choices.

^{7}

Kiani, Corthell, and Shadlen (2014) also note the lack of the divergence signature in their data, but because their stimuli have variable duration, optimality is more complicated to characterize (Drugowitsch, DeAngelis, Klier, Angelaki, & Pouget, 2014), and the explanation we offer here may not apply.

## Acknowledgments

We thank Luigi Acerbi, Rachel N. Denison, Andra Mihali, and Joaquín Navajas for helpful conversations and comments on the manuscript; Bas van Opheusden for the simple proof of the lemma; and Rachel Adler for some clever real-life examples of overlapping categories. This material is based on work supported by the National Science Foundation Graduate Research Fellowship under grant DGE-1342536.

## References

*Proceedings of the National Academy of Sciences*,