## Abstract

Reinforcement learning involves updating estimates of the value of states and actions on the basis of experience. Previous work has shown that in humans, reinforcement learning exhibits a confirmatory bias: when the value of a chosen option is being updated, estimates are revised more radically following positive than negative reward prediction errors, but the converse is observed when updating the unchosen option value estimate. Here, we simulate performance on a multi-arm bandit task to examine the consequences of a confirmatory bias for reward harvesting. We report a paradoxical finding: that confirmatory biases allow the agent to maximize reward relative to an unbiased updating rule. This principle holds over a wide range of experimental settings and is most influential when decisions are corrupted by noise. We show that this occurs because on average, confirmatory biases lead to overestimating the value of more valuable bandits and underestimating the value of less valuable bandits, rendering decisions overall more robust in the face of noise. Our results show how apparently suboptimal learning rules can in fact be reward maximizing if decisions are made with finite computational precision.

## 1 Introduction

*Confirmation bias* refers to seeking or interpreting evidence in ways that are influenced by existing beliefs, and it is a ubiquitous feature of human perceptual, cognitive, and social processes and a longstanding topic of study in psychology (Nickerson, 1998). Confirmatory biases can be pernicious in applied settings, for example, when clinicians overlook the correct diagnosis after forming a strong initial impression of a patient (Groopman, 2007). In the laboratory, confirmation bias has been studied with a variety of paradigms (Nickerson, 1998; Talluri, Urai, Tsetsos, Usher, & Donner, 2018). One paradigm in which the confirmation bias can be observed and measured involves reinforcement learning tasks, where participants have to learn from positive or negative feedback which options are worth taking (Chambon et al., 2020; Palminteri, Lefebvre, Kilford, & Blakemore, 2017). This article focuses on confirmation bias during reinforcement learning.

This task and modeling framework have also been used to study the biases that humans exhibit during learning. One line of research has suggested that humans may learn differently from positive and negative outcomes. For example, variants of the model above, which include distinct learning rates for positive and negative updates to $V_t^i$, have been observed to fit human data from a two-armed bandit task better, even after penalizing for additional complexity (Gershman, 2015; Niv, Edlund, Dayan, & O'Doherty, 2012). Similar differences in learning rates after positive and negative feedback have also been observed in monkeys (Farashahi, Donahue, Hayden, Lee, & Soltani, 2019) and rodents (Cieślak, Ahn, Bogacz, & Parkitna, 2018), suggesting that they reflect an important optimization of a learning process that occurred earlier in evolution and has been preserved across species. When payout is observed only for the option that was chosen, updates seem to be larger when the participant is positively rather than negatively surprised, which might be interpreted as a form of optimistic learning (Lefebvre, Lebreton, Meyniel, Bourgeois-Gironde, & Palminteri, 2017). However, a different pattern of data was observed in follow-up studies in which counterfactual feedback was also offered: the participants were able to view the payout associated with both chosen and unchosen options. Following feedback on the unchosen option, larger updates were observed for negative prediction errors (Chambon et al., 2020; Palminteri et al., 2017; Schuller et al., 2020). This is consistent with a confirmatory bias rather than a strictly optimistic bias, whereby belief revision helps to strengthen rather than weaken existing preconceptions about which option may be better.

One obvious question is why confirmatory biases persist as a feature of our cognitive landscape. If they promote suboptimal choices, why have they not been selected away by evolution? One variant of the confirmation bias, a tendency to overtly sample information from the environment that is consistent with existing beliefs, has been argued to promote optimal data selection: where the agent chooses its own information acquisition policy, exhaustively ruling out explanations (however obscure) for an observation would be highly inefficient (Oaksford & Chater, 2003). However, this account is unsuited to explaining the differential updates to chosen and unchosen options in a bandit task with counterfactual feedback, because in this case, feedback for both options is freely displayed to the participant, and there is no overt data selection problem.

It has been demonstrated that biased estimates of value can paradoxically be beneficial in two-armed tasks in the sense that under standard assumptions, they maximize the average total reward for the agent (Caze & van der Meer, 2013). This happens because with such biased value estimates, the difference $V_t^1 - V_t^2$ may be magnified, so with a noisy choice rule (typically used in reinforcement learning models), the option with the higher reward probability is more likely to be selected. Caze and van der Meer (2013) considered a standard reinforcement learning task in which feedback is provided only for the chosen option. In that task, the reward probabilities of the two options determine whether it is beneficial to have a higher learning rate after positive or negative prediction errors (Caze & van der Meer, 2013). In other words, when only the outcome of a chosen option is observed, an optimistic bias is beneficial for some reward probabilities and a pessimistic bias for others.

In this article, we show that if the participants are able to view the payouts associated with both chosen and unchosen options, reward is typically maximized if the learning rates follow the pattern of the confirmation bias, that is, they are higher when the chosen option is rewarded and the unchosen option is unrewarded. We find that this benefit holds over a wide range of settings, including both stationary and nonstationary bandits, with different reward probabilities, across different epoch lengths, and under different levels of choice variability. We also demonstrate that such confirmation bias tends to magnify the difference $V_t^1 - V_t^2$ and hence makes the choice more robust to decision noise. These findings may explain why humans tend to revise beliefs to a smaller extent when outcomes do not match their expectations.

We formalize the confirmation bias in a reinforcement learning model, compare its performance in simulations with models without confirmation bias, and formally characterize the biases introduced in value estimates. We also point out that the confirmation bias not only typically increases the average reward but may also shorten reaction times and thus increase the rate of obtaining rewards to an even greater extent.

## 2 Reinforcement Learning Models

### 2.1 Confirmation Model

*unbiased*.

### 2.2 Decaying Learning Rate Model

We compared the performance of the *confirmation* model in a stable environment to an optimal value estimator, which for each option computes the average of rewards seen so far. Such values can be learned by a model using the update given in equation 1.1 with the learning rate $\alpha$ decreasing over trials according to $\alpha = 1/t$, where $t$ is the trial number (note that with the counterfactual feedback, $t$ is also equal to the number of times the reward for this option has been observed).
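The equivalence between a delta rule with $\alpha = 1/t$ and the running average can be checked numerically. The sketch below assumes that equation 1.1 has the usual delta-rule form $V \leftarrow V + \alpha (R - V)$; the function name and the reward probability of 0.65 are illustrative choices, not taken from the simulations reported here.

```python
import random


def running_mean_via_delta_rule(rewards, v0=0.5):
    """Delta rule with learning rate 1/t (assumed form of equation 1.1)."""
    v = v0
    for t, r in enumerate(rewards, start=1):
        alpha = 1.0 / t          # decaying learning rate
        v += alpha * (r - v)     # on trial 1, alpha = 1 so v0 is overwritten
    return v


rng = random.Random(0)
# Hypothetical binary rewards from a bandit paying out with probability 0.65.
rewards = [1.0 if rng.random() < 0.65 else 0.0 for _ in range(200)]

estimate = running_mean_via_delta_rule(rewards)
sample_mean = sum(rewards) / len(rewards)
print(estimate, sample_mean)  # the two numbers agree up to rounding error
```

Because the first update uses $\alpha = 1$, the initial value has no influence on the estimate, which is why the result matches the sample mean exactly.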

### 2.3 Decision Policies

*hardmax*, *softmax*, and $\varepsilon$-*greedy* policies. The hardmax is a noiseless policy that deterministically selects the arm associated with the highest value. The softmax is a probabilistic action selection process that associates with each arm $a$ a probability $P_t^a$ of being selected, based on the arms' respective values.
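A minimal sketch of the three policies follows. It assumes the common temperature-scaled form of the softmax, $P_t^a = e^{V_t^a/\beta} / \sum_k e^{V_t^k/\beta}$ with temperature $\beta > 0$, and an $\varepsilon$-greedy rule that picks a lower-valued arm with probability $\varepsilon$; the exact expressions used in the simulations are not reproduced in this text, so the function names and tie-breaking behavior here are assumptions.

```python
import math
import random


def hardmax(values):
    """Index of the highest-valued arm (ties go to the first such arm)."""
    return max(range(len(values)), key=lambda a: values[a])


def softmax_probs(values, beta):
    """Selection probabilities for temperature beta > 0 (assumed form)."""
    m = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp((v - m) / beta) for v in values]
    z = sum(exps)
    return [e / z for e in exps]


def epsilon_greedy(values, eps, rng=random):
    """Pick the best arm, except with probability eps pick another arm."""
    best = hardmax(values)
    if len(values) > 1 and rng.random() < eps:
        others = [a for a in range(len(values)) if a != best]
        return rng.choice(others)
    return best


probs = softmax_probs([0.65, 0.35], beta=0.2)
print(probs)  # the higher-valued arm receives the larger probability
```

As the temperature approaches zero, the softmax probabilities concentrate on the highest-valued arm, so the hardmax can be viewed as its noiseless limit.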

## 3 Effects of Confirmation Bias on Average Reward

### 3.1 Methods of Simulation

We considered four ways in which the reward probabilities $p^i$ are set, illustrated schematically in Figure 1c. First, we considered stable environments in which reward probabilities were constant. We also considered conditions of 1 reversal and 3 reversals where the payout probabilities were reversed to $1 - p^i$ once in the middle of the task (second display in Figure 1c) or three times at equal intervals (third display in Figure 1c). In stable, 1 reversal and 3 reversals conditions, the initial probabilities $p^i$ at the start of the task were sampled at intervals of 0.1 in the range [0.05, 0.95] such that $p^1 \neq p^2$, and we tested all possible combinations of these probabilities (45 probability pairs). Unless otherwise noted, results are averaged across these initial probabilities.
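The count of 45 probability pairs follows from choosing 2 distinct values out of the 10 levels between 0.05 and 0.95. A short sketch of the enumeration and of a reversal is given below; the variable names are illustrative.

```python
from itertools import combinations

# Ten levels: 0.05, 0.15, ..., 0.95 (rounded to avoid floating-point drift).
levels = [round(0.05 + 0.1 * k, 2) for k in range(10)]

# All unordered pairs of distinct reward probabilities.
pairs = list(combinations(levels, 2))
print(len(pairs))  # 45

# A reversal maps each payout probability p to 1 - p.
first_pair = pairs[0]
reversed_pair = tuple(round(1 - p, 2) for p in first_pair)
print(first_pair, reversed_pair)
```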

| Model | Chosen option $i$, $\delta_t^i > 0$ | Chosen option $i$, $\delta_t^i < 0$ | Unchosen option $j \neq i$, $\delta_t^j > 0$ | Unchosen option $j \neq i$, $\delta_t^j < 0$ |
|---|---|---|---|---|
| Confirmation model | $\boldsymbol{\alpha_C}$ | $\alpha_D$ | $\alpha_D$ | $\boldsymbol{\alpha_C}$ |
| Valence model | $\boldsymbol{\alpha^+}$ | $\alpha^-$ | $\boldsymbol{\alpha^+}$ | $\alpha^-$ |
| Hybrid model | $\boldsymbol{\alpha^+}$ | $\alpha^-$ | $\alpha^=$ | $\alpha^=$ |
| Partial feedback | $\boldsymbol{\alpha^+}$ | $\alpha^-$ | - | - |

Note: To make the table easier to read, $\alpha_C$ and $\alpha^+$ are highlighted in bold.

We conducted all simulations numerically, sampling the initial payout probabilities and experiment length(s) exhaustively, varying $\alpha_C$ and $\alpha_D$ exhaustively, and noting the average reward obtained by the agent in each setting. The model was simulated with all possible combinations of learning rates $\alpha_C$ and $\alpha_D$ defined in the range [0.05, 0.95] with increments of 0.05, that is, $19^2$ learning rate combinations. For each combination of parameters, the simulations were performed 1000 times for all but the random walk condition, where simulations were performed 100,000 times to account for the increased variability. Results were averaged for plotting and analysis. In all cases, inferential statistics were conducted using nonparametric tests with an alpha of $p < 0.001$ and Bonferroni correction for multiple comparisons. At the start of each simulation, the value estimates were initialized to $V_0^i = 0.5$.
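The procedure can be sketched for a single stable two-armed bandit as follows. This is an illustration under stated assumptions rather than the code used for the reported results: rewards are taken to be binary (0 or 1), the softmax is written in its two-arm logistic form with temperature `beta` (assumed not to be extremely small), and the trial count, seed, and function names are hypothetical.

```python
import math
import random


def simulate(alpha_c, alpha_d, p, n_trials=200, beta=0.1, seed=0):
    """Average reward of a confirmation-model agent with full feedback."""
    rng = random.Random(seed)
    v = [0.5, 0.5]  # value estimates initialized to 0.5
    total = 0.0
    for _ in range(n_trials):
        # Two-arm softmax: logistic function of the value difference.
        p0 = 1.0 / (1.0 + math.exp(-(v[0] - v[1]) / beta))
        choice = 0 if rng.random() < p0 else 1
        rewards = [1.0 if rng.random() < p[a] else 0.0 for a in (0, 1)]
        total += rewards[choice]
        for a in (0, 1):
            delta = rewards[a] - v[a]
            # Confirmatory outcome: good news for the chosen option
            # or bad news for the unchosen option.
            confirmatory = (delta > 0) == (a == choice)
            v[a] += (alpha_c if confirmatory else alpha_d) * delta
    return total / n_trials


# Learning-rate grid: 0.05, 0.10, ..., 0.95 (19 values, 19**2 = 361 pairs).
grid = [round(0.05 * k, 2) for k in range(1, 20)]
n_combinations = len(grid) ** 2

avg_reward = simulate(alpha_c=0.3, alpha_d=0.1, p=(0.65, 0.35))
print(n_combinations, avg_reward)
```

A full sweep would call `simulate` for every pair in `grid`, every probability pair, and many seeds, then average the returned rewards.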

### 3.2 Results of Simulations

We compared the performance of the confirmation model to the decaying learning rate model described above, which maximizes reward under the assumption that payout probabilities are stationary and decisions are noiseless (i.e., under a hardmax choice rule). We confirmed this by plotting the average reward under various temperature values for three models: one in which a single learning rate was set to a fixed low value $\alpha =0.05$ (small learning rate model), one in which it was optimally annealed (decaying learning rate model), and one in which there was a confirmatory bias (confirmation model; see Figure 2c). As can be seen, only under $\beta =0$ does the confirmation bias not increase rewards; as soon as decision noise increases, the relative merit of the confirmation model grows sharply. Importantly, whereas the performance advantage for the decaying learning rate model in the absence of noise (under $\beta =0$) was very small (on the order of 0.2%), the converse advantage for the confirmatory bias given noisy decisions was numerically larger (1.6%, 4.6%, and 5.5% under $\beta =0.1,0.2,0.3$, respectively).

Many decisions that humans and animals face in natural environments involve choices among multiple options; hence, we investigated if the confirmation bias also brings an advantage in such situations. The confirmation model can be naturally extended to multiple options by applying the update of equation 2.2 to all unchosen options. Figure S1 in the online supplementary information shows that confirmation bias also increases the average outcome for extended learning environments with more than two options.

Finally, we performed simulations of the experiment by Palminteri et al. (2017) in order to see where human participants' learning rate combinations stand in terms of performance. In this study, participants made choices between two options and received feedback on the outcomes of both options. The task involved choices in multiple conditions in which the participants could receive outcomes $-1$ or $1$. In some conditions, the reward probabilities were constant, while in others, 1 reversal occurred. We simulated the confirmation model in the same sets of conditions that participants experienced, with the same number of trials. We used the values of softmax temperature estimated from individual participants by fitting the confirmation model to their behavior (data are available at https://doi.org/10.6084/m9.figshare.4265408.v1). These estimated parameter values of the confirmation model were reported by Palminteri et al. (2017).

## 4 Confirmation Bias Magnifies Difference between Estimated Values

The simulations show that a confirmatory update strategy—one that privileges the chosen over the unchosen option—is reward maximizing across a wide range of experimental conditions, in particular when decisions are noisy. Why would this be the case? It is well known, for example, that adopting a single small value for $\alpha $ will allow value estimates to converge to their ground-truth counterparts. Why would an agent want to learn biased value estimates? To answer this question, we demonstrate that the confirmation bias often magnifies the differences between estimated values and hence makes choices more robust to decision noise. We first show it on an intuitive example and then more formally.

### 4.1 Example of the Effects of Confirmation Bias

For each update rule, we plotted the evolution of the value estimate for the richer bandit $V^+$ over trials (see Figure 6b) as well as aggregate choice accuracy (see Figure 6c). Beginning with the choice accuracy data, one can see that intermediate levels of bias are reward maximizing in the sense that they increase the probability that the agent chooses the bandit with the higher payout probability, relative to an unbiased or a severely biased update rule (see Figure 6c). This is of course simply a restatement of the finding that biased policies maximize reward (see the shading in Figure 6a). However, perhaps more informative are the value estimates for $V^+$ under each update rule (see Figure 6b). As expected, the unbiased learning rule allows the agent to accurately learn the appropriate value estimate, such that after a few tens of trials, $V^+ \approx p^+ = 0.65$ (gray line). By contrast, the confirmatory model overestimates the value of the richer option (converging close to $V^+ \sim 0.8$ despite $p^+ = 0.65$), and (not shown) it underestimates the value of the poorer option, $p^- = 0.35$. Thus, the confirmation model outperforms the unbiased model despite misestimating the value of both the better and the worse option. How is this possible?
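One way to see the answer is to compare the softmax probability of choosing the richer bandit under accurate and under exaggerated value estimates. In the sketch below, the unbiased estimates are set to the true probabilities (0.65 and 0.35), and the biased estimate of the richer option to 0.8 as in the example; the value 0.2 for the poorer option is a hypothetical stand-in, since that estimate is not shown, and the two-arm logistic form of the softmax is assumed.

```python
import math


def p_choose_richer(v_rich, v_poor, beta):
    """Two-arm softmax probability of picking the richer bandit."""
    return 1.0 / (1.0 + math.exp(-(v_rich - v_poor) / beta))


betas = [0.1, 0.2, 0.3]
unbiased = [p_choose_richer(0.65, 0.35, b) for b in betas]
# 0.80 follows the example in the text; 0.20 is an assumed illustrative value.
confirmatory = [p_choose_richer(0.80, 0.20, b) for b in betas]

for b, u, c in zip(betas, unbiased, confirmatory):
    print(f"beta={b}: unbiased={u:.3f}, confirmatory={c:.3f}")
```

At every temperature, the larger gap between the estimates yields a higher probability of selecting the richer bandit, even though both estimates are inaccurate.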

### 4.2 Analysis of Biases in Estimated Values

We were not able to obtain tractable analytic expressions for stochastic fixed points of values when the softmax choice rule was assumed; hence, we considered a simpler $\varepsilon$-greedy choice rule. We denote the probability of selecting an option with a lower estimated value by $\varepsilon$. To find the stochastic fixed points, we will assume that it rarely changes which of $V_t^+$ and $V_t^-$ is higher. Indeed, in the simulation of Figure 7a, right, such a change occurred only once in 1000 trials. Therefore, we will analyze the behavior within the intervals on which $V_t^+ > V_t^-$, when the agent's beliefs on superiority of options are true, and within intervals on which $V_t^+ < V_t^-$, when the agent's beliefs are false.

We first consider a case of true beliefs, where a learned value for the richer option $V_t^+$ is higher than the value for the poorer option $V_t^-$. In this case, the richer option is selected with probability $1 - \varepsilon$ and the poorer option with probability $\varepsilon$.
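The fixed-point expressions themselves are not reproduced in this text. As a sketch of how they can be obtained, assume binary rewards (0 or 1), the learning-rate assignment of Table 1, and a bias parameter $b = \alpha_C / \alpha_D$ (an assumed definition, consistent with $b = 1$ denoting no bias). A stochastic fixed point is a value at which the expected update is zero. For the richer option under true beliefs,

$$
\mathbb{E}\left[\Delta V^+\right] = (1-\varepsilon)\left[p^+ \alpha_C (1 - V^+) - (1 - p^+)\,\alpha_D V^+\right] + \varepsilon\left[p^+ \alpha_D (1 - V^+) - (1 - p^+)\,\alpha_C V^+\right] = 0 .
$$

Dividing by $\alpha_D$ and writing $A = (1-\varepsilon)\,b + \varepsilon$ and $B = (1-\varepsilon) + \varepsilon\,b$ gives

$$
V_{\text{true}}^+ = \frac{p^+ A}{p^+ A + (1 - p^+)\,B}, \qquad
V_{\text{true}}^- = \frac{p^- B}{p^- B + (1 - p^-)\,A},
$$

where the second expression follows from the same argument with the roles of $\varepsilon$ and $1-\varepsilon$ exchanged, because the poorer option is chosen with probability $\varepsilon$. For $b = 1$, $A = B$ and both values reduce to the true probabilities; for $b > 1$ and $\varepsilon < 0.5$, $A > B$, so $V_{\text{true}}^+ > p^+$ and $V_{\text{true}}^- < p^-$.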

Let us now consider the behavior of the model under false beliefs, during the intervals when $V_t^+ < V_t^-$. In this case, the poorer option is chosen on the majority of trials because the agent falsely believes it has higher value. Furthermore, $V_t^-$ is updated in the same way $V_t^+$ was updated under the correct beliefs. Hence the fixed point under the false beliefs, $V_{\text{false}}^-$, is given by an expression analogous to that for $V_{\text{true}}^+$ (see equation 4.4) but with $p^+$ replaced by $p^-$. Similarly, $V_{\text{false}}^+$ is given by an expression analogous to that for $V_{\text{true}}^-$ (see equation 4.5) but with $p^-$ replaced by $p^+$. Consequently, $V_{\text{false}}^-$ and $V_{\text{false}}^+$ inherit from $V_{\text{true}}^+$ and $V_{\text{true}}^-$ their dependence on confirmation bias: $V_{\text{false}}^-$ increases with the confirmation bias, while $V_{\text{false}}^+$ decreases with the bias. The green and red curves in Figure 7b plot the expressions for the stochastic fixed points for sample parameters. Without the confirmation bias ($b = 1$), the expressions for true and false fixed points coincide, and they diverge as the confirmation bias increases.

Importantly, the fixed points based on false beliefs exist only when the agent has false beliefs. Thus the agent will tend to stay in these fixed points only if the false belief is satisfied in these fixed points: $V_{\text{false}}^+ < V_{\text{false}}^-$. In Figure 7b, this false belief is satisfied only to the right of the intersection of the bright curves, so the intersection occurs at a critical value of the confirmation bias at which $V_{\text{false}}^+ = V_{\text{false}}^-$. The fixed points $V_{\text{false}}^-$ and $V_{\text{false}}^+$ emerge only for confirmation bias above this critical value, and to highlight this, the curves plotting expressions for $V_{\text{false}}^-$ and $V_{\text{false}}^+$ are shown in solid in Figure 7b where they become fixed points.
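The critical bias can be illustrated numerically. The sketch below is a reconstruction under stated assumptions, not the expressions of equations 4.4 and 4.5 themselves: rewards are binary, the bias is defined as $b = \alpha_C/\alpha_D$, and the fixed points are obtained by setting the expected update under an $\varepsilon$-greedy policy to zero. The value $\varepsilon = 0.1$ and the function names are hypothetical.

```python
import math


def fixed_points(p_plus, p_minus, b, eps):
    """Stochastic fixed points under true and false beliefs (assumed form)."""
    a = (1 - eps) * b + eps      # weight when the favored option confirms
    c = (1 - eps) + eps * b      # weight when the favored option disconfirms
    return {
        "true_plus": p_plus * a / (p_plus * a + (1 - p_plus) * c),
        "true_minus": p_minus * c / (p_minus * c + (1 - p_minus) * a),
        "false_minus": p_minus * a / (p_minus * a + (1 - p_minus) * c),
        "false_plus": p_plus * c / (p_plus * c + (1 - p_plus) * a),
    }


def critical_bias(p_plus, p_minus, eps):
    """Bias at which V_false^+ equals V_false^- (infinite if never reached)."""
    odds_ratio = (p_plus / (1 - p_plus)) / (p_minus / (1 - p_minus))
    k = math.sqrt(odds_ratio)    # required ratio a / c at the crossing
    denom = (1 - eps) - k * eps
    if denom <= 0:
        return math.inf          # a / c saturates at (1 - eps) / eps
    return (k * (1 - eps) - eps) / denom


b_crit = critical_bias(0.65, 0.35, eps=0.1)
at_crit = fixed_points(0.65, 0.35, b_crit, eps=0.1)
print(b_crit, at_crit["false_plus"], at_crit["false_minus"])
```

Below `b_crit` the false-belief expressions violate $V_{\text{false}}^+ < V_{\text{false}}^-$ and so do not correspond to fixed points; above it, the inequality holds and the agent can remain trapped in the false belief.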

The existence of fixed points $V_{\text{false}}^-$ and $V_{\text{false}}^+$ only above the critical confirmation bias is confirmed in simulations shown in Figure 7b. Blue and magenta curves show the mean estimated values at the end of simulations. The left display corresponds to simulations in which the values are initialized to a false belief. In this case, the values stay in $V_{\text{false}}^-$ and $V_{\text{false}}^+$ for sufficiently high confirmation bias but move to $V_{\text{true}}^+$ and $V_{\text{true}}^-$ for lower biases. The right display corresponds to a simulation in which the values are initialized to 0.5. In this case, the values always move toward $V_{\text{true}}^+$ and $V_{\text{true}}^-$ for low bias, while for large bias, on some simulations they go to $V_{\text{false}}^-$ and $V_{\text{false}}^+$, as indicated by larger error bars.

## 5 Effects of Confirmation Bias on Reward Rate

The analysis shown in Figure 6 illustrates why the benefit of confirmation drops off as the bias tends to the extreme: under extreme bias, the agent falls into a feedback loop whereby it confirms its false belief that the lower-valued bandit is in fact the best. Over multiple simulations, this radically increases the variance in performance and thus dampens overall average reward (see Figure 6c). However, it is noteworthy that this calculation is made under the assumption that all trials are made with equivalent response times. In the wild, incorrect choices may be less pernicious if they are made rapidly, if biological agents ultimately seek to optimize their reward per unit time (or reward rate).

Here, we relaxed this assumption and asked how the confirmatory bias affected overall reward rates, under the assumption that decisions are drawn to a close after a bounded accumulation process that is described by the drift-diffusion model. This allows us to model not only the choice probabilities but also reaction times.
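To illustrate why a larger difference between value estimates could raise reward rate, the sketch below uses the standard closed-form accuracy and mean decision time of a drift-diffusion process with symmetric bounds and an unbiased starting point. The mapping from value difference to drift (a hypothetical `gain` factor), the threshold, the non-decision time, and the use of accuracy per unit time as a proxy for reward rate are all assumptions made for illustration; they are not the parameters of the simulations reported here.

```python
import math


def ddm_accuracy_and_time(drift, threshold, noise=1.0):
    """Accuracy and mean decision time for bounds at +/- threshold."""
    if drift == 0:
        return 0.5, threshold ** 2 / noise ** 2
    k = drift * threshold / noise ** 2
    accuracy = 1.0 / (1.0 + math.exp(-2.0 * k))
    mean_dt = (threshold / drift) * math.tanh(k)
    return accuracy, mean_dt


def accuracy_rate(value_gap, gain=5.0, threshold=1.0, non_decision=0.3):
    """Correct choices per unit time (illustrative proxy for reward rate)."""
    acc, dt = ddm_accuracy_and_time(gain * value_gap, threshold)
    return acc / (dt + non_decision)


small_gap = ddm_accuracy_and_time(5.0 * 0.3, 1.0)   # e.g., accurate estimates
large_gap = ddm_accuracy_and_time(5.0 * 0.6, 1.0)   # e.g., exaggerated gap
print(small_gap, large_gap)
print(accuracy_rate(0.3), accuracy_rate(0.6))
```

Under these assumptions, a larger value gap both increases accuracy and shortens the mean decision time, so its effect on the rate is compounded.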

### 5.1 Methods of Simulations

### 5.2 Results of Simulations

## 6 Comparison with Alternative Models

In order to clarify the role of the constraint imposed on the learning rates and of the counterfactual feedback, we performed simulations with three additional models, which differ from the confirmation model in the update of the values of the unchosen option. Table 1 compares how the learning rates depend on the choice and the sign of prediction error in the confirmation model and the alternative models.

All of the alternative models update the value estimate $V_t^i$ of the chosen option similarly to the confirmation model, that is, according to a delta rule with two learning rates: $\alpha^+$ for positive updates and $\alpha^-$ for negative updates. The three additional models differ in their updates of the value estimate of the unchosen option. The first model, referred to as the *valence model*, updates the value estimate of the unchosen option with learning rates depending on the sign of prediction error analogous to that of the chosen option. Thus, in this model, the learning rate depends only on the sign of prediction error, not on whether the option was chosen. The second model, referred to as the *hybrid model*, updates the value of the unchosen option using an unbiased learning rate defined as $\alpha^= = (\alpha^+ + \alpha^-)/2$. We refer to this model as hybrid because the learning rate for the unchosen option in this model is the average of those in the valence model and the confirmation model (with $\alpha_C = \alpha^+$ and $\alpha_D = \alpha^-$). The third model, referred to as *partial feedback*, does not update the value of the unchosen option; hence, it can describe learning in tasks in which feedback is provided only for the chosen option. We define an agent with a positivity bias as one for whom $\alpha^+ > \alpha^-$, whereas an agent with a negativity bias has $\alpha^+ < \alpha^-$, and an agent with no bias (or a neutral setting) has $\alpha^+ = \alpha^-$.
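The four learning-rate assignments can be summarized in a single selector function. This is a sketch that follows Table 1 with $\alpha_C = \alpha^+$ and $\alpha_D = \alpha^-$; the function name, the string labels, and the use of a zero learning rate to represent "no update" are illustrative choices.

```python
def learning_rate(model, chosen, delta, alpha_pos, alpha_neg):
    """Learning rate for one option given the model and prediction error."""
    positive = delta > 0
    if chosen:
        # All four models treat the chosen option identically.
        return alpha_pos if positive else alpha_neg
    if model == "confirmation":
        # Unchosen option: the pattern is reversed.
        return alpha_neg if positive else alpha_pos
    if model == "valence":
        # Only the sign of the prediction error matters.
        return alpha_pos if positive else alpha_neg
    if model == "hybrid":
        # Unbiased rate: average of the two.
        return (alpha_pos + alpha_neg) / 2
    if model == "partial":
        # No counterfactual feedback, so no update.
        return 0.0
    raise ValueError(f"unknown model: {model}")


for name in ("confirmation", "valence", "hybrid", "partial"):
    row = [learning_rate(name, chosen, delta, 0.3, 0.1)
           for chosen in (True, False) for delta in (+1.0, -1.0)]
    print(name, row)
```

Each printed row lists the rates in the column order of Table 1: chosen with positive and negative prediction error, then unchosen with positive and negative prediction error.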

In the case of full feedback, Figures 9a and 9b show that confirmation bias in the confirmation model increases average reward regardless of the range of reward probabilities for the two options. The consistent effect of confirmation bias contrasts with the opposite effects of biases in learning rates in the valence model (see Figures 9c and 9d), where positivity bias is beneficial for low reward probabilities, while negativity bias is beneficial for high reward probabilities. These effects can be understood on the basis of a previous study (Caze & van der Meer, 2013). That study analyzed the reinforcement learning model in which the learning rate depended on the sign of prediction error as in the valence model. The study showed that if reward probabilities for both options $p^i < 0.5$, then it is beneficial to have a positivity bias. With such bias, both $V^1$ and $V^2$ will be overestimated and, critically, the difference $V^1 - V^2$ will be magnified, so with a noisy choice rule, the option with the higher reward probability will be more likely to be selected. By contrast, if both $p^i > 0.5$, then overestimating $V^1$ and $V^2$ would actually reduce the difference $V^1 - V^2$ due to a ceiling effect, because according to equation 1.1, reward estimates cannot exceed the maximum reward available, $V^i \leq 1$. In this case, it is beneficial to have a negativity bias. Therefore, if one assumes that learning rates can differ between rewarded and unrewarded trials, the type of reward-increasing bias depends on the magnitude of reward probabilities in a task (Caze & van der Meer, 2013), a dependence clearly seen in Figures 9c and 9d.
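The ceiling effect can be made concrete with a small worked example. Assuming binary rewards and full feedback, the expected update in the valence model vanishes when $\alpha^+ p\,(1 - V) = \alpha^- (1 - p)\,V$, which gives the stochastic fixed point $V = \alpha^+ p / (\alpha^+ p + \alpha^- (1 - p))$. The probabilities and learning rates below are hypothetical values chosen for illustration.

```python
def valence_fixed_point(p, alpha_pos, alpha_neg):
    """Value at which the expected valence-model update is zero."""
    return alpha_pos * p / (alpha_pos * p + alpha_neg * (1 - p))


def value_gap(p_low, p_high, alpha_pos, alpha_neg):
    """Difference between the fixed points of the two options."""
    return (valence_fixed_point(p_high, alpha_pos, alpha_neg)
            - valence_fixed_point(p_low, alpha_pos, alpha_neg))


# True gap is 0.1 in both environments.
low_env = (0.1, 0.2)    # both reward probabilities below 0.5
high_env = (0.8, 0.9)   # both reward probabilities above 0.5

print("positivity bias, low probs: ", value_gap(*low_env, 0.3, 0.1))
print("positivity bias, high probs:", value_gap(*high_env, 0.3, 0.1))
print("negativity bias, high probs:", value_gap(*high_env, 0.1, 0.3))
```

With a positivity bias, the gap between the fixed points exceeds the true gap of 0.1 in the low-probability environment but shrinks below it in the high-probability environment, where a negativity bias magnifies it instead.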

Since the learning rates in the hybrid model lie in between those in the valence and the confirmation models, the optimal bias in the hybrid model is in between that in these two models. The positivity or confirmation bias is optimal for low reward probabilities (see Figure 9e), while for high reward probabilities the optimal bias is close to $b \approx 1$ (see Figure 9f), so it is between the optimal biases for the confirmation (see Figure 9b) and valence (see Figure 9d) models.

It is also worth comparing the performance of the models for their optimal learning rates. Different panels in Figures 9a to 9f have different color scales that span the range of obtained rewards. Comparing color scales reveals that the confirmation model can produce the overall highest reward: for low reward probabilities, it achieved higher reward (for its best parameters) than the valence model and similar performance to the hybrid model, while for high reward probabilities, it obtained higher reward than both alternative models.

In summary, in the case of full feedback, the confirmation model is the only one among models compared in Figures 9a to 9f for which the optimal learning rates lie in the same regions of parameter space for both low and high probabilities. A learner often does not know the task parameters, and the confirmation model is most robust to this uncertainty because it is the only model for which it is possible to choose a combination of learning rates that work relatively well for different tasks.

In the case of partial feedback, where only the value of the chosen option is modified, the positivity bias is beneficial for low reward probabilities, while the negativity bias is beneficial for high reward probabilities (see Figures 9g and 9h), as expected from a previous theoretic analysis (Caze & van der Meer, 2013). The optimal learning rates with partial feedback are similar to those in the valence model with full feedback (compare Figures 9c and 9d with Figures 9g and 9h) as in both models the learning rate only depends on the sign of prediction error (see Table 1).

The optimal learning rates differ slightly between the valence model with full feedback and partial feedback: less negativity bias is required to maximize reward with partial feedback for high probabilities (see Figures 9d and 9h). This difference arises because with full feedback both values are updated equally often, while with partial feedback, the poorer option is chosen less frequently. Hence, with partial feedback, the value of the poorer option moves slowly from its initial value of 0.5, so even if $\alpha^+ > \alpha^-$, the value of the poorer option may not be overestimated. The difference between the models disappears if both values are updated with more similar frequencies (we observed it in simulations, not shown, in which the temperature of the softmax function was increased).

In summary, in the case of partial feedback, updating values of the chosen option with the larger learning rate after positive prediction error is detrimental for higher reward probabilities (see Figure 9h). Hence, the bias that optimizes the confirmation model (see Figure 9b) may be detrimental with partial feedback in the model analyzed in this section (see Figure 9h). Nevertheless, in the next section, we come back to this issue, and point out that the optimal bias may differ in other reinforcement learning models with partial feedback.

## 7 Discussion

Humans have been observed to exhibit confirmatory biases when choosing between stimuli or actions that pay out with uncertain probability (Chambon et al., 2020; Palminteri et al., 2017; Schuller et al., 2020). These biases drive participants to update positive outcomes (or those that are better than expected) for chosen options more sharply than negative outcomes, but to reverse this update pattern for the unchosen option. Here, we show through simulations that in an extended range of settings traditionally used in human experiments, this asymmetric update is advantageous in the presence of noise in the decision process. Indeed, in most of the circumstances tested, agents who exhibited a confirmatory bias, rather than a neutral or disconfirmatory bias, were the agents that reaped the largest quantities of reward. This counterintuitive result stems directly from the update process itself, which biases the values of the chosen and unchosen options (corresponding overall to the best and worst options, respectively), mechanistically increasing their relative distance from each other and ultimately the probability of selecting the best option in the upcoming trials.

Exploring the evolution of action values under confirmatory updates offers insight into why this occurs. Confirmatory updating has the effect of rendering subjective action values more extreme than their objective counterparts; in other words, options that are estimated to be good are overvalued, and options estimated to be bad are undervalued (see Figure 6). This can have both positive and negative effects. The negative effect is that a sufficiently strong confirmatory bias can drive a feedback loop whereby poor or mediocre items that are chosen by chance can be falsely updated in a positive direction, leading them to being chosen more often. The positive effect, however, is that where decisions are themselves intrinsically variable (e.g., because they are corrupted by gaussian noise arising during decision making or motor planning, modeled here with the softmax temperature parameter), overestimation of value makes decisions more robust to decision noise because random fluctuations in the value estimated at the time of the decision are less likely to reverse a decision away from the better of the two options. The relative strength of these two effects depends on the level of decision noise: within reasonable noise ranges, the latter effect outweighs the former and performance benefits overall.

### 7.1 Relationship to Other Studies

The results described here thus join a family of recently reported phenomena whereby decisions that distort or discard information lead to reward-maximizing choices under the assumption that decisions are made with finite computational precision, that is, that decisions are intrinsically noisy (Summerfield & Tsetsos, 2015). For example, when averaging features from a multi-element array to make a category judgment, under the assumption that features are equally diagnostic (and that the decision policy is not itself noisy), then normatively, they should be weighted equally in the choice. However, in the presence of “late” noise, encoding models that overestimate the decision value of elements near the category boundary are reward maximizing, for the same reason as the confirmatory bias here: they inflate the value of ambiguous items away from indifference and render them robust to noise (Li, Herce Castanon, Solomon, Vandormael, & Summerfield, 2017). A similar phenomenon occurs when comparing gambles defined by different monetary values: utility functions that inflate small values away from indifference (rendering the subjective difference between \$2 and \$4 greater than the subjective difference between \$102 and \$104) have a protective effect against decision noise, providing a normative justification for concave utility functions (Juechems, Spitzer, Balaguer, & Summerfield, 2020). Related results have been described in problems that involve sequential sampling in time, where they may account for violations of axiomatic rationality, such as systematically intransitive choices (Tsetsos et al., 2016). Moreover, a bias in how evidence is accumulated within a trial has been shown to increase the accuracy of individual decisions, making the decision variable more extreme and thus less likely to be corrupted by noise (Zhang & Bogacz, 2010).

Recent studies also report simulations of the confirmation bias model (Chambon et al., 2020; Tarantola, Folke, Boldt, Perez, & De Martino, 2021). These simulations paralleled experimental paradigms reported in these papers and a confirmation model was simulated for parameters (including softmax temperature) corresponding to those estimated for participants of the studies. The simulated agents employing confirmation bias obtained higher average reward than unbiased learners, as well as learners described by other models. Our article suggests the same conclusion using a complementary approach in which the models have been simulated in a variety of conditions and analyzed mathematically.

Modeling studies have investigated how learning with rates depending on the sign of prediction error could be implemented in the basal ganglia circuits known to underlie reinforcement learning (Collins & Frank, 2014; Dabney et al., 2020). Models have been developed to describe how positive and negative prediction errors preferentially engage learning in different populations of striatal neurons (Mikhael & Bogacz, 2016; Möller & Bogacz, 2019). It would be interesting to investigate the neural mechanisms that lead to learning rates depending not only on the sign of prediction error but also on whether options have been chosen.

### 7.2 Validity of Model's Assumptions

Reinforcement learning models fit to human data often assume that choices are stochastic, that is, that participants sometimes fail to choose the most valuable bandit. In standard tasks involving only feedback about the value of the chosen option (factual feedback), some randomness in choices promotes exploration, which in turn allows information to be acquired that may be relevant for future decisions. However, our task involves both factual and counterfactual feedback, and so exploration is not required to learn the value of the two bandits. Nevertheless, in some simulations, we modeled choices with a softmax rule, which assumes that decisions are corrupted by Gaussian noise, or an $\varepsilon$-greedy policy, which introduces lapses to the choice process with a fixed probability. Implicitly, thus, we are committing to the idea that value-guided decisions may be irreducibly noisy even where exploration is not required (Renart & Machens, 2014). Indeed, others have shown that participants continue to make noisy decisions even where counterfactual feedback is available, even if they have attributed that noise to variability in learning rather than choice (Findling, Skvortsova, Dromnelle, Palminteri, & Wyart, 2019).
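
To make the two noisy choice rules concrete, the sketch below computes the probability of choosing the first of two bandits under each rule. It is a minimal illustration, not the simulation code used in this article: the function names, the `temperature` and `epsilon` parameter names, and the convention that a lapse picks uniformly between the two options are all assumptions.

```python
import math


def softmax_choice_prob(q1, q2, temperature):
    """Probability of choosing option 1 under a two-option softmax rule.

    Assumed parameterization: P(1) = 1 / (1 + exp(-(q1 - q2) / temperature)).
    """
    x = (q1 - q2) / temperature
    # Numerically stable logistic function (avoids overflow for large |x|).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def epsilon_greedy_choice_prob(q1, q2, epsilon):
    """Probability of choosing option 1 under an epsilon-greedy rule.

    Assumed convention: with probability epsilon the agent lapses and picks
    one of the two options uniformly at random; otherwise it picks the
    option with the higher value estimate (ties are broken at random).
    """
    if q1 == q2:
        return 0.5
    if q1 > q2:
        return 1.0 - epsilon / 2.0
    return epsilon / 2.0


# Same temperature, wider gap between value estimates:
p_narrow = softmax_choice_prob(0.6, 0.4, 0.1)
p_wide = softmax_choice_prob(0.7, 0.3, 0.1)
```

Under the softmax rule, `p_wide` exceeds `p_narrow`: at a fixed temperature, a larger difference between value estimates makes the better option more likely to be chosen, which is the sense in which inflating value differences protects choices against decision noise.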

Due to our assumptions, this study has a number of limitations. First, we explored the properties of a confirmatory model that has previously been shown to provide a good fit to data from humans performing a bandit task with factual and counterfactual feedback. However, we acknowledge that this is not the only possible model that could increase reward by enhancing the difference between the represented values of options. In principle, any other model producing choice hysteresis might be able to explain these results (Katahira, 2018; Miller, Shenhav, & Ludvig, 2019; Worthy, Pang, & Byrne, 2013). An analysis of these different models and their respective resilience to decision noise in different settings is beyond the scope of our study but would be an interesting target for future research. Second, the results described here hold assuming a fixed and equal level of stochasticity (e.g., softmax temperature) in agents' behaviors, regardless of their bias (i.e., the specific combination of learning rates). If this assumption is relaxed, an unbiased agent could perform as well as a biased agent that is subject to more decision noise. Thus, the benefit of confirmatory learning is inextricably linked to the level of noise, and no single level of confirmation bias can be thought of as beneficial overall. Third, our study does not investigate the impact on performance of other kinds of internal noise, such as update noise (Findling et al., 2019). Instead of perturbing the policy itself, update noise perturbs the update of the option values on each trial (i.e., prediction errors are blurred with Gaussian noise). It presumably cannot produce a similar increase in performance because it has, overall, no effect on the average difference between these option values.
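
The contrast between update noise and decision noise can be illustrated with a short sketch. It assumes additive, zero-mean Gaussian noise on the prediction error and a single learning rate; the function name, parameter names, and numerical values are hypothetical and are not taken from Findling et al. (2019) or from our simulations.

```python
import random


def noisy_value_update(q, reward, learning_rate, noise_sd, rng):
    """One value update whose prediction error is blurred by Gaussian noise.

    Assumption: the noise is additive and zero-mean, so it perturbs each
    individual update but leaves the expected update unchanged.
    """
    prediction_error = reward - q
    noisy_error = prediction_error + rng.gauss(0.0, noise_sd)
    return q + learning_rate * noisy_error


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Update without noise: 0.5 + 0.1 * (1.0 - 0.5)
noiseless = noisy_value_update(0.5, 1.0, 0.1, 0.0, rng)

# Average of many noisy updates from the same starting point
samples = [noisy_value_update(0.5, 1.0, 0.1, 0.2, rng) for _ in range(20000)]
mean_noisy = sum(samples) / len(samples)
```

Here `mean_noisy` stays close to `noiseless`: under these assumptions, update noise adds variance to the value estimates without shifting them on average, so it does not widen the gap between the values of the two options.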

### 7.3 Confirmation Bias with Partial Feedback

In this article, we have focused on studying confirmation bias in tasks where feedback is provided for both chosen and unchosen options, but in most reinforcement learning tasks studied in the laboratory, and possibly in the real world, feedback is provided only for the chosen option. With such partial feedback, it does not seem possible to distinguish between the confirmation and valence models because they make the same update to the value of the chosen option. However, a recent ingenious experiment suggested that the confirmation bias is also present with partial feedback, because the learning rate was higher after positive prediction errors only if the choice was made by the participant and not when the choice was made by a computer (Chambon et al., 2020). An analogous effect was also observed outside the laboratory in a study of heart surgeons, who learned more from their own successes than from their failures but did not show this pattern for the observed successes of their colleagues (Kc, Staats, & Gino, 2013). Hence, it is important to understand how the results from this article could be generalized to partial feedback.

For partial feedback, previous theoretical work suggests that optimal learning rates depend on whether the reward probability is high or low (Caze & van der Meer, 2013), and we confirmed this in simulations in Figures 9g and 9h. Surprisingly, it has been shown that human participants did not follow this pattern and had similar learning rates regardless of whether reward probabilities were low or high (Chambon et al., 2020; Gershman, 2015). This raises the question of whether humans fail to behave in a reward-maximizing way (which seems unlikely given the evolutionary pressure for reward maximization) or whether the normative theory of learning with partial feedback needs to be revised. One way to include confirmation bias in models of learning with partial feedback would be to note that humans and animals are aware of their confidence in their choices (Kiani & Shadlen, 2009), that is, whether they are certain the chosen option yields the highest reward or whether the choice was a guess. Hence, one could consider models in which the learning rate depends not only on the sign of the prediction error but also on confidence, such that negative feedback is taken into consideration less when a participant is confident of his or her choice. Formulating such models would require careful comparison of the models with specially designed experiments; hence, it is beyond the scope of this article but would be an interesting direction for future work.

### 7.4 Limits to the Benefits of Biased Beliefs

It is important to point out that confirmation bias is beneficial in many, but not all, circumstances. Although in almost all presented simulations there exists a combination of biased learning rates giving performance that is higher than or as good as that of the best unbiased learner, the optimal learning rates, and hence the amount of bias, differ depending on task parameters. At the start of the task, a learner usually cannot know the details of the task and so needs to adopt a certain default combination of learning rates. One could expect that such default learning rates would be determined by experience or even be influenced to a certain extent by evolution. However, such a default set of biased learning rates will have detrimental effects on performance in certain tasks. For example, a recent study estimated average learning rates of human participants to be $\alpha_C \approx 0.15$ and $\alpha_D \approx 0.05$ (Tarantola et al., 2021), giving a confirmation bias of $b \approx 3$. Although such a strong confirmation bias increases reward in many simulated scenarios when decisions are noisy (e.g., Figures 3e to 3h), it would have a negative effect on performance when decisions are accurately made on the basis of values and in changing environments (e.g., Figures 3a to 3d). If the default confirmation bias is influenced by evolution, its value is likely to be relatively high because many of the key decisions of our ancestors had to be quick and thus were noisy due to the speed–accuracy trade-off. By contrast, in the modern world, we can often take time to consider important choices, and hence the biases that brought evolutionary advantage to our ancestors may not always be beneficial to us.
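
As a concrete reading of these numbers, the sketch below applies one trial of a confirmatory update with complete feedback, using the learning rates quoted above as defaults. It assumes that $\alpha_C$ applies to confirmatory outcomes (positive prediction errors for the chosen option, negative ones for the unchosen option), that $\alpha_D$ applies to disconfirmatory outcomes, and that the bias is the ratio $b = \alpha_C / \alpha_D$; the function and variable names are hypothetical.

```python
def confirmatory_update(q_chosen, q_unchosen, r_chosen, r_unchosen,
                        alpha_c=0.15, alpha_d=0.05):
    """One trial of a confirmatory value update with complete feedback.

    Assumed convention: alpha_c is used when the outcome confirms the choice
    (chosen option better than expected, or unchosen option worse than
    expected); alpha_d is used otherwise.
    """
    delta_chosen = r_chosen - q_chosen
    delta_unchosen = r_unchosen - q_unchosen

    lr_chosen = alpha_c if delta_chosen > 0 else alpha_d
    lr_unchosen = alpha_d if delta_unchosen > 0 else alpha_c

    new_q_chosen = q_chosen + lr_chosen * delta_chosen
    new_q_unchosen = q_unchosen + lr_unchosen * delta_unchosen
    return new_q_chosen, new_q_unchosen


# Bias implied by the quoted learning rates (assumed to be their ratio).
bias = 0.15 / 0.05

# Both options rewarded: the chosen value rises more than the unchosen one.
both_rewarded = confirmatory_update(0.5, 0.5, 1.0, 1.0)

# Neither option rewarded: the chosen value falls less than the unchosen one.
none_rewarded = confirmatory_update(0.5, 0.5, 0.0, 0.0)
```

In both examples the chosen option ends the trial with the higher value estimate even though the two options received identical outcomes, which illustrates how the bias pushes value estimates apart in favor of the option already preferred.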

## Acknowledgments

This work has been supported by MRC grants MC_UU_12024/5, MC_UU_00003/1, BBSRC grant BB/S006338/1, and ERC Consolidator grant 725937.

## Author notes

C.S. and R.B. contributed equally.