Abstract
Reinforcement learning involves updating estimates of the value of states and actions on the basis of experience. Previous work has shown that in humans, reinforcement learning exhibits a confirmatory bias: when the value of a chosen option is being updated, estimates are revised more radically following positive than negative reward prediction errors, but the converse is observed when updating the unchosen option value estimate. Here, we simulate performance on a multi-arm bandit task to examine the consequences of a confirmatory bias for reward harvesting. We report a paradoxical finding: that confirmatory biases allow the agent to maximize reward relative to an unbiased updating rule. This principle holds over a wide range of experimental settings and is most influential when decisions are corrupted by noise. We show that this occurs because on average, confirmatory biases lead to overestimating the value of more valuable bandits and underestimating the value of less valuable bandits, rendering decisions overall more robust in the face of noise. Our results show how apparently suboptimal learning rules can in fact be reward maximizing if decisions are made with finite computational precision.
1 Introduction
Confirmation bias refers to seeking or interpreting evidence in ways that are influenced by existing beliefs; it is a ubiquitous feature of human perceptual, cognitive, and social processes and a longstanding topic of study in psychology (Nickerson, 1998). Confirmatory biases can be pernicious in applied settings, for example, when clinicians overlook the correct diagnosis after forming a strong initial impression of a patient (Groopman, 2007). In the laboratory, confirmation bias has been studied with a variety of paradigms (Nickerson, 1998; Talluri, Urai, Tsetsos, Usher, & Donner, 2018). One paradigm in which the confirmation bias can be observed and measured involves reinforcement learning tasks, in which participants have to learn from positive or negative feedback which options are worth taking (Chambon et al., 2020; Palminteri, Lefebvre, Kilford, & Blakemore, 2017). This article focuses on confirmation bias during reinforcement learning.
This task and modeling framework have also been used to study the biases that humans exhibit during learning. One line of research has suggested that humans may learn differently from positive and negative outcomes. For example, variants of the standard delta-rule model that include distinct learning rates for positive and negative updates to the estimated values have been observed to fit human data from a two-armed bandit task better, even after penalizing for the additional complexity (Gershman, 2015; Niv, Edlund, Dayan, & O'Doherty, 2012). Similar differences in learning rates after positive and negative feedback have also been observed in monkeys (Farashahi, Donahue, Hayden, Lee, & Soltani, 2019) and rodents (Cieślak, Ahn, Bogacz, & Parkitna, 2018), suggesting that they reflect an important optimization of a learning process that occurred earlier in evolution and has been preserved across species. When the payout is observed only for the option that was chosen, updates seem to be larger when the participant is positively rather than negatively surprised, which might be interpreted as a form of optimistic learning (Lefebvre, Lebreton, Meyniel, Bourgeois-Gironde, & Palminteri, 2017). However, a different pattern of data was observed in follow-up studies in which counterfactual feedback was also offered: the participants were able to view the payout associated with both the chosen and unchosen options. Following feedback on the unchosen option, larger updates were observed for negative prediction errors (Chambon et al., 2020; Palminteri et al., 2017; Schuller et al., 2020). This is consistent with a confirmatory bias rather than a strictly optimistic bias, whereby belief revision tends to strengthen rather than weaken existing preconceptions about which option may be better.
One obvious question is why confirmatory biases persist as a feature of our cognitive landscape. If they promote suboptimal choices, why have they not been selected away by evolution? One variant of the confirmation bias, a tendency to overtly sample information from the environment that is consistent with existing beliefs, has been argued to promote optimal data selection: where the agent chooses its own information acquisition policy, exhaustively ruling out explanations (however obscure) for an observation would be highly inefficient (Oaksford & Chater, 2003). However, this account is unsuited to explaining the differential updates to chosen and unchosen options in a bandit task with counterfactual feedback, because in this case, feedback for both options is freely displayed to the participant, and there is no overt data selection problem.
It has been demonstrated that biased estimates of value can paradoxically be beneficial in two-armed bandit tasks, in the sense that under standard assumptions they maximize the average total reward for the agent (Caze & van der Meer, 2013). This happens because with such biased value estimates, the difference between the two estimates may be magnified, so with a noisy choice rule (typically used in reinforcement learning models), the option with the higher reward probability is more likely to be selected. Caze and van der Meer (2013) considered a standard reinforcement learning task in which feedback is provided only for the chosen option. In that task, the reward probabilities of the two options determine whether it is beneficial to have a higher learning rate after positive or after negative prediction errors (Caze & van der Meer, 2013). In other words, when only the outcome of the chosen option is observed, an optimistic bias is beneficial for some reward probabilities and a pessimistic bias for others.
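As a minimal numerical illustration of this point (our own example with made-up values, not taken from Caze & van der Meer, 2013), inflating the difference between two value estimates raises the probability of choosing the better option under a noisy softmax rule:

```python
import numpy as np

def softmax_choice_prob(q_best, q_other, temperature):
    """Probability of choosing the option with value q_best under a softmax rule."""
    z = np.array([q_best, q_other]) / temperature
    z -= z.max()                       # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return p[0]

temperature = 0.3
# Unbiased estimates converge to the true reward probabilities of the two options.
p_unbiased = softmax_choice_prob(0.6, 0.4, temperature)
# A confirmatory learner over- and under-estimates the two options,
# magnifying the difference between the estimates.
p_biased = softmax_choice_prob(0.8, 0.2, temperature)
print(f"P(choose better option): unbiased {p_unbiased:.2f}, biased {p_biased:.2f}")
# Prints approximately: unbiased 0.66, biased 0.88
```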
In this article, we show that if participants are able to view the payouts associated with both the chosen and unchosen options, reward is typically maximized if the learning rates follow the pattern of the confirmation bias, that is, if they are higher when the chosen option is rewarded and when the unchosen option is unrewarded. We find that this benefit holds over a wide range of settings, including both stationary and nonstationary bandits, different reward probabilities, different epoch lengths, and different levels of choice variability. We also demonstrate that such a confirmation bias tends to magnify the difference between the estimated values of the options and hence makes choices more robust to decision noise. These findings may explain why humans tend to revise beliefs to a smaller extent when outcomes do not match their expectations.
We formalize the confirmation bias in a reinforcement learning model, compare its performance in simulations with models without confirmation bias, and formally characterize the biases it introduces in value estimates. We also point out that the confirmation bias not only typically increases the average reward but may also shorten reaction times and thus increase the rate of obtaining reward to an even greater extent.
2 Reinforcement Learning Models
2.1 Confirmation Model
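As a compact reference, the confirmation model's trial-by-trial update can be sketched in code. This is a minimal sketch based on the verbal description given in sections 1 and 6 and in Table 1 (learning rate $\alpha^+$ after a positive prediction error and $\alpha^-$ after a negative one for the chosen option, with the rates swapped for the unchosen option); the notation of the original equations 2.1 and 2.2 may differ.

```python
def confirmation_update(q, chosen, rewards, alpha_plus, alpha_minus):
    """One trial of the confirmation model with full (counterfactual) feedback.

    q       : value estimates of the two options
    chosen  : index of the chosen option (0 or 1)
    rewards : observed outcomes for both options (0 or 1)
    """
    q = list(q)
    for i in (0, 1):
        delta = rewards[i] - q[i]                            # prediction error
        if i == chosen:
            lr = alpha_plus if delta > 0 else alpha_minus    # confirmatory rate for the chosen option
        else:
            lr = alpha_minus if delta > 0 else alpha_plus    # rates swapped for the unchosen option
        q[i] += lr * delta
    return q
```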
2.2 Decaying Learning Rate Model
We compared the performance of the confirmation model in a stable environment to an optimal value estimator, which for each option computes the average of the rewards seen so far. Such values can be learned by a model using the update given in equation 1.1 with the learning rate decreasing over trials according to $\alpha = 1/n$, where $n$ is the trial number (note that with counterfactual feedback, $n$ is also equal to the number of times the reward for this option has been observed).
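A short sketch (our own, assuming binary rewards) showing that a delta-rule update with learning rate $1/n$ reproduces the running average of the observed rewards:

```python
def decaying_lr_update(q, n, reward):
    """Delta-rule update with learning rate 1/n; after n observations,
    q equals the sample mean of the rewards seen so far."""
    return q + (1.0 / n) * (reward - q)

rewards = [1, 0, 1, 1, 0]
q = 0.5                                  # initial value; overwritten on the first update (alpha = 1)
for n, r in enumerate(rewards, start=1):
    q = decaying_lr_update(q, n, r)
print(q, sum(rewards) / len(rewards))    # both print 0.6
```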
2.3 Decision Policies
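For reference, the three choice rules used throughout the simulations (hardmax, softmax with a temperature parameter, and $\epsilon$-greedy) can be sketched as follows; the parameterization shown here is ours and is only meant to illustrate how the policies behave.

```python
import numpy as np

rng = np.random.default_rng(0)

def hardmax(q):
    """Deterministic policy: always choose the option with the highest estimated value."""
    return int(np.argmax(q))

def softmax(q, temperature):
    """Noisy policy: choice probabilities proportional to exp(Q / temperature)."""
    z = np.array(q) / temperature
    p = np.exp(z - z.max())              # subtract max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(q), p=p))

def epsilon_greedy(q, epsilon):
    """For two options: choose the higher-valued one, but lapse to the
    lower-valued one with probability epsilon."""
    if rng.random() < epsilon:
        return int(np.argmin(q))
    return int(np.argmax(q))
```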
3 Effects of Confirmation Bias on Average Reward
3.1 Methods of Simulation
Figure 1: Simulation setup. (a) Reward contingencies. The illustration represents the chosen (orange) and unchosen (blue) bandits, each with a feedback signal (central number). Below, we state the range of possible outcomes and probabilities. (b) Learning periods. The illustration represents the different lengths of the learning periods and the different outcome combinations potentially received by the agents. (c) Volatility types. The line plots represent the evolution of the two arms' probability across trials in the different volatility conditions.
We considered four ways in which the reward probabilities were set, illustrated schematically in Figure 1c. First, we considered stable environments in which reward probabilities were constant. We also considered 1 reversal and 3 reversals conditions, in which the payout probabilities of the two options were swapped once in the middle of the task (second display in Figure 1c) or three times at equal intervals (third display in Figure 1c). In the stable, 1 reversal, and 3 reversals conditions, the initial probabilities at the start of the task were sampled at intervals of 0.1 in the range [0.05, 0.95] such that $p_1 > p_2$, and we tested all possible combinations of these probabilities (45 probability pairs). Unless otherwise noted, results are averaged across these initial probabilities.
Table 1: Learning Rates in the Confirmation and Alternative Models.

| Model | Chosen option, positive PE | Chosen option, negative PE | Unchosen option, positive PE | Unchosen option, negative PE |
| --- | --- | --- | --- | --- |
| Confirmation model | $\alpha^+$ | $\alpha^-$ | $\alpha^-$ | $\alpha^+$ |
| Valence model | $\alpha^+$ | $\alpha^-$ | $\alpha^+$ | $\alpha^-$ |
| Hybrid model | $\alpha^+$ | $\alpha^-$ | $\alpha^0$ | $\alpha^0$ |
| Partial feedback | $\alpha^+$ | $\alpha^-$ | — | — |

Note: Each cell gives the learning rate applied to the prediction error (PE) of that option; $\alpha^0 = (\alpha^+ + \alpha^-)/2$, and — indicates that the option is not updated.
We conduct all simulations numerically, sampling the initial payout probabilities and experiment lengths exhaustively, varying $\alpha^+$ and $\alpha^-$ exhaustively, and noting the average reward obtained by the agent in each setting. The model is simulated with all possible combinations of the learning rates $\alpha^+$ and $\alpha^-$ defined in the range [0.05, 0.95] with increments of 0.05, that is, $19 \times 19 = 361$ learning rate combinations. For each combination of parameters, the simulations were performed 1000 times for all but the random walk condition, in which simulations were performed 100,000 times to account for the increased variability. Results are averaged for plotting and analysis. In all cases, inferential statistics were conducted using nonparametric tests with an alpha of 0.001 and Bonferroni correction for multiple comparisons. At the start of each simulation, the value estimates were initialized to 0.5.
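The simulation procedure can be sketched as follows (a simplified illustration, not the authors' code: two stable bandits, counterfactual feedback, a softmax policy, and far fewer repetitions than in the article; the reward probabilities and temperature used here are arbitrary examples):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

def run_agent(p, alpha_plus, alpha_minus, temperature, n_trials):
    """Average per-trial reward of one confirmation-model agent on two stable bandits."""
    q = [0.5, 0.5]                                    # value estimates initialized to 0.5
    total = 0.0
    for _ in range(n_trials):
        z = np.array(q) / temperature                 # softmax choice
        prob = np.exp(z - z.max())
        prob /= prob.sum()
        c = rng.choice(2, p=prob)
        r = (rng.random(2) < p).astype(float)         # counterfactual feedback for both arms
        for i in range(2):
            delta = r[i] - q[i]
            if i == c:
                lr = alpha_plus if delta > 0 else alpha_minus
            else:
                lr = alpha_minus if delta > 0 else alpha_plus
            q[i] += lr * delta
        total += r[c]
    return total / n_trials

alphas = np.arange(0.05, 1.0, 0.05)                   # 19 values: 0.05, 0.10, ..., 0.95
grid = {(ap, am): np.mean([run_agent([0.7, 0.3], ap, am, 0.1, 128) for _ in range(50)])
        for ap, am in itertools.product(alphas, alphas)}
best = max(grid, key=grid.get)                        # learning rate pair with the highest average reward
```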
3.2 Results of Simulations
Figure 2: Dependence of reward on learning rate and decision noise in a stable environment. (a, b) Average reward for all learning rate combinations. The heat maps represent the per-trial average reward for combinations of the learning rates $\alpha^+$ (x-axis) and $\alpha^-$ (y-axis), averaged across all reward contingencies and agents in the stable condition with 1024 trials. Areas enclosed by black lines represent learning rate combinations for which the reward is significantly higher than the performance of the best equal learning rates combination, represented by a black circle; one-tailed independent samples rank-sum tests, $p < 0.001$, corrected for multiple comparisons. (a) Deterministic decisions. Simulated reward is obtained using a noiseless hardmax policy. (b) Noisy decisions. Simulated reward is obtained using a noisy softmax policy with a fixed temperature. (c) Comparison with optimal models. The bar plot represents the per-trial average reward of the confirmation model, the small learning rate model, and the decaying learning rate model for four different levels of noise in the decision process. In simulations of the confirmation model, the best learning rate combination was used for each noise level ($\alpha^+ \in \{0.1, 0.15, 0.3, 0.35\}$, with the corresponding best $\alpha^-$). Bars represent the means and error bars the standard deviations across agents; all reward levels are significantly different from each other; two-tailed independent samples rank-sum tests, $p < 0.001$.
We compared the performance of the confirmation model to the decaying learning rate model described above, which maximizes reward under the assumption that payout probabilities are stationary and decisions are noiseless (i.e., made under a hardmax choice rule). We confirmed this by plotting the average reward under various temperature values for three models: one in which a single learning rate was set to a fixed low value (small learning rate model), one in which it was optimally annealed (decaying learning rate model), and one in which there was a confirmatory bias (confirmation model; see Figure 2c). As can be seen, only under noiseless (hardmax) decisions does the confirmation bias fail to increase rewards; as soon as decision noise increases, the relative merit of the confirmation model grows sharply. Importantly, whereas the performance advantage for the decaying learning rate model in the absence of noise was very small (on the order of 0.2%), the converse advantage for the confirmatory bias given noisy decisions was numerically larger (1.6%, 4.6%, and 5.5% under the three increasing levels of decision noise, respectively).
Figure 3: Dependence of reward on learning rate and decision noise in different environments. The heat maps represent the per-trial average reward for combinations of the learning rates $\alpha^+$ (x-axis) and $\alpha^-$ (y-axis) given a hardmax policy (panels a–d) or a softmax policy with a fixed temperature (panels e–h). The performance is averaged across all reward contingencies, period lengths, and 1000 agents in the stable condition (a, e), 1 reversal condition (b, f), and 3 reversals condition (c, g), or 100,000 agents in the random walk condition (d, h). Areas enclosed by black lines represent learning rate combinations for which the reward is significantly higher than the reward of the best equal learning rates combination, represented by a black circle; one-tailed independent samples rank-sum tests, $p < 0.001$, corrected for multiple comparisons.
Figure 4: Effects of period length and decision noise on the relative performance of the confirmation model. (a) Effect of period length on reward. The line plot represents the difference in average reward between the confirmation model (with the best confirmatory learning rate combination per period) and the unbiased model (with the best single learning rate per period) as a function of the log of the period length, for the four different volatility conditions. The logarithmic transformation of the trial number is for illustrative purposes only. $p < 0.001$, two-tailed independent rank-sum tests. (b) Effect of decision noise on performance. The line plot represents the difference in per-trial average performance between the confirmation model (with the best confirmatory learning rate combination) and the unbiased model (with the single best learning rate) as a function of the log of the softmax temperature, for the four different volatility conditions. The logarithmic transformation of the softmax temperature is for illustrative purposes only. $p < 0.001$, two-tailed independent rank-sum tests.
Many decisions that humans and animals face in natural environments involve choices among multiple options; hence, we investigated whether the confirmation bias also brings an advantage in such situations. The confirmation model can be naturally extended to multiple options by applying the update of equation 2.2 to all unchosen options. Figure S1 in the online supplementary information shows that the confirmation bias also increases the average outcome in learning environments with more than two options.
Finally, we performed simulations of the experiment by Palminteri et al. (2017) in order to see where human participants' learning rate combinations stand in terms of performance. In this study, participants made choices between two options and received feedback on the outcomes of both options. The task involved choices in multiple conditions in which the participants could receive outcomes of +1 or −1. In some conditions, the reward probabilities were constant, while in others, 1 reversal occurred. We simulated the confirmation model in the same sets of conditions that participants experienced, with the same number of trials. We used the values of softmax temperature estimated for individual participants by fitting the confirmation model to their behavior (data are available at https://doi.org/10.6084/m9.figshare.4265408.v1). These estimated parameter values of the confirmation model were reported by Palminteri et al. (2017).
Figure 5: Relation between human and synthetic data. The heat maps represent the per-trial average reward for combinations of the learning rates $\alpha^+$ (x-axis) and $\alpha^-$ (y-axis) in the experimental environment studied by Palminteri et al. (2017). Simulations have been performed with different softmax temperatures corresponding to the fitted temperatures of the participants from that study and are averaged across 1000 agents. The stars represent the combination of fitted learning rates for each participant.
4 Confirmation Bias Magnifies Difference between Estimated Values
The simulations show that a confirmatory update strategy—one that privileges the chosen over the unchosen option—is reward maximizing across a wide range of experimental conditions, in particular when decisions are noisy. Why would this be the case? It is well known, for example, that adopting a single small value for the learning rate $\alpha$ will allow value estimates to converge to their ground-truth counterparts. Why, then, would an agent want to learn biased value estimates? To answer this question, we demonstrate that the confirmation bias often magnifies the difference between the estimated values and hence makes choices more robust to decision noise. We first show this with an intuitive example and then more formally.
4.1 Example of the Effects of Confirmation Bias
Figure 6: Mechanism by which confirmation bias tends to increase reward. (a) Average reward and reward distributions for different levels of confirmation bias. The heat map represents the per-trial average reward of the confirmation model for all learning rate combinations (confirmatory learning rates on the x-axis and disconfirmatory learning rates on the y-axis) under a softmax policy with a fixed temperature. The rewards concern the stable condition with 128 trials and asymmetric reward contingencies for the two options and are averaged across agents. The three markers inside the heat map indicate the three learning rate combinations used in the simulations illustrated in panels b and c. The histograms show the distribution across agents of the average per-trial reward for the three different combinations. (b) Estimated values. The line plots represent the evolution of the best option's value across trials. The large plot represents the agent-averaged value of the best option across trials for three different learning rate combinations: “unbiased” (equal learning rates), “biased (low)” (a moderate confirmation bias), and “biased (high)” (a strong confirmation bias). The lines represent the mean and the shaded areas the SEM. The small plots represent the value of the best option across trials plotted separately for the three combinations. The thick lines represent the average across agents and the lighter lines the individual values of 5% of the agents. (c) Choice accuracy. The line plots represent the evolution of the probability of selecting the best option across trials. The large plot represents the agent-averaged probability of selecting the best option across trials for the same three learning rate combinations. The lines represent the mean and the shaded areas the SEM. The small plots represent the probability of selecting the best option across trials plotted separately for the three combinations. The thick lines represent the average across agents and the lighter lines the individual probabilities for 5% of the agents.
For each update rule, we plotted the evolution of the value estimate for the richer bandit over trials (see Figure 6b) as well as aggregate choice accuracy (see Figure 6c). Beginning with the choice accuracy data, one can see that intermediate levels of bias are reward maximizing in the sense that they increase the probability that the agent chooses the bandit with the higher payout probability, relative to an unbiased or a severely biased update rule (see Figure 6c). This is of course simply a restatement of the finding that biased policies maximize reward (see the shading in Figure 6a). However, perhaps more informative are the value estimates for the richer bandit under each update rule (see Figure 6b). As expected, the unbiased learning rule allows the agent to accurately learn the appropriate value estimate, such that after a few tens of trials the estimate settles close to the true reward probability (gray line). By contrast, the confirmation model overestimates the value of the richer option (converging well above its true reward probability), and (not shown) it underestimates the value of the poorer option. Thus, the confirmation model outperforms the unbiased model despite misestimating the value of both the better and the worse option. How is this possible?
4.2 Analysis of Biases in Estimated Values
Figure 7: Stochastic fixed points of value estimates. The behavior of the confirmation model with an $\epsilon$-greedy choice policy has been analyzed for a stable environment with fixed reward probabilities $p_1$ and $p_2$ for the two options. (a) Blue and purple lines show the evolution of the value estimates over simulated trials. Different displays correspond to different levels of confirmation bias, indicated above the displays. The learning rates $\alpha^+$ and $\alpha^-$ were set to fixed values. (b) Asymptotic behavior of the confirmation model for different levels of the confirmation bias. The blue and magenta curves show the average estimated values at the end of simulations with 10,000 trials. This average is taken over 100 simulations, and the error bars indicate the standard deviation. The model was simulated with varying levels of confirmation bias, shown on the x-axes. Red and green curves denote the values of the stochastic fixed points. The two displays correspond to different initial estimated values, listed above the displays.
We were not able to obtain tractable analytic expressions for the stochastic fixed points of the values when the softmax choice rule was assumed; hence, we considered a simpler $\epsilon$-greedy choice rule. We denote the probability of selecting the option with the lower estimated value by $\epsilon$. To find the stochastic fixed points, we will assume that which of the two estimates, $Q_1$ (richer option) and $Q_2$ (poorer option), is higher rarely changes. Indeed, in the simulation of Figure 7a, right, such a change occurred only once in 1000 trials. Therefore, we will analyze the behavior within the intervals on which $Q_1 > Q_2$, when the agent's belief about the superiority of the options is true, and within intervals on which $Q_2 > Q_1$, when the agent's belief is false.
We first consider the case of true beliefs, where the learned value $Q_1$ for the richer option is higher than the value $Q_2$ for the poorer option. In this case, the richer option is selected with probability $1 - \epsilon$ and the poorer option with probability $\epsilon$.
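As an illustration of how a stochastic fixed point can be derived under these assumptions (a sketch in our notation, with rewards of 0 or 1; the article's equations 4.4 and 4.5 may be written differently), note that with counterfactual feedback, $Q_1$ is updated on every trial: it is treated as chosen with probability $1-\epsilon$ and as unchosen with probability $\epsilon$, and its outcome is 1 with probability $p_1$. Setting the expected update to zero gives

```latex
% Expected one-trial change of Q_1 under true beliefs (Q_1 > Q_2), set to zero:
% chosen with prob (1 - eps): rate alpha^+ after reward, alpha^- after no reward
% unchosen with prob eps:     rate alpha^- after reward, alpha^+ after no reward
\begin{align}
0 &= (1-\epsilon)\left[p_1\,\alpha^{+}\,(1 - Q_1^{*}) - (1-p_1)\,\alpha^{-}\,Q_1^{*}\right]
   + \epsilon\left[p_1\,\alpha^{-}\,(1 - Q_1^{*}) - (1-p_1)\,\alpha^{+}\,Q_1^{*}\right], \\
Q_1^{*} &= \frac{p_1\,\bar{\alpha}_{1}}{p_1\,\bar{\alpha}_{1} + (1-p_1)\,\bar{\alpha}_{2}},
\qquad
\bar{\alpha}_{1} = (1-\epsilon)\,\alpha^{+} + \epsilon\,\alpha^{-}, \quad
\bar{\alpha}_{2} = (1-\epsilon)\,\alpha^{-} + \epsilon\,\alpha^{+}.
\end{align}
```

The analogous fixed point for the poorer option is obtained by swapping the roles of $\bar{\alpha}_1$ and $\bar{\alpha}_2$ and replacing $p_1$ by $p_2$; with a confirmatory bias ($\alpha^+ > \alpha^-$) and $\epsilon < 0.5$, the first fixed point lies above $p_1$ and the second below $p_2$, magnifying the difference between the estimates.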
Let us now consider the behavior of the model under false beliefs, during the intervals when $Q_2 > Q_1$. In this case, the poorer option is chosen on the majority of trials because the agent falsely believes it has the higher value. Furthermore, $Q_2$ is then updated in the same way $Q_1$ was updated under correct beliefs. Hence the fixed point of $Q_2$ under false beliefs, which we denote $Q_2^F$, is given by an expression analogous to that for the true-belief fixed point $Q_1^T$ (see equation 4.4), but with $p_1$ replaced by $p_2$. Similarly, $Q_1^F$ is given by an expression analogous to that for $Q_2^T$ (see equation 4.5), but with $p_2$ replaced by $p_1$. Consequently, $Q_2^F$ and $Q_1^F$ inherit from $Q_1^T$ and $Q_2^T$ their dependence on the confirmation bias: $Q_2^F$ increases with the confirmation bias, while $Q_1^F$ decreases with the bias. The green and red curves in Figure 7b plot the expressions for these stochastic fixed points for sample parameters. Without the confirmation bias ($\alpha^+ = \alpha^-$), the expressions for the true and false fixed points coincide, and they then diverge as the confirmation bias grows.
Importantly, the fixed points based on false beliefs exist only while the agent holds false beliefs. Thus, the agent will tend to stay at these fixed points only if the false belief is satisfied at them, that is, if $Q_2^F > Q_1^F$. In Figure 7b, this false belief is only satisfied to the right of the intersection of the bright curves, so the intersection occurs at a critical value of the confirmation bias at which $Q_2^F = Q_1^F$. The fixed points $Q_1^F$ and $Q_2^F$ emerge only for confirmation biases above this critical value, and to highlight this, the curves plotting the expressions for $Q_1^F$ and $Q_2^F$ are drawn as solid lines in Figure 7b only where they become fixed points.
The existence of the fixed points $Q_1^F$ and $Q_2^F$ only above the critical confirmation bias is confirmed in the simulations shown in Figure 7b. Blue and magenta curves show the mean estimated values at the end of the simulations. The left display corresponds to simulations in which the values are initialized to a false belief. In this case, the values stay at $Q_1^F$ and $Q_2^F$ for sufficiently high confirmation bias but move to $Q_1^T$ and $Q_2^T$ for lower biases. The right display corresponds to simulations in which both values are initialized to 0.5. In this case, the values always move toward $Q_1^T$ and $Q_2^T$ for low bias, while for large bias, in some simulations they go to $Q_1^F$ and $Q_2^F$, as indicated by the larger error bars.
5 Effects of Confirmation Bias on Reward Rate
The analysis shown in Figure 6 illustrates why the benefit of confirmation drops off as the bias tends to the extreme: under extreme bias, the agent falls into a feedback loop whereby it confirms its false belief that the lower-valued bandit is in fact the best. Over multiple simulations, this radically increases the variance in performance and thus dampens overall average reward (see Figure 6c). However, it is noteworthy that this calculation is made under the assumption that all trials are made with equivalent response times. In the wild, incorrect choices may be less pernicious if they are made rapidly, if biological agents ultimately seek to optimize their reward per unit time (or reward rate).
Here, we relaxed this assumption and asked how the confirmatory bias affects overall reward rates, under the assumption that decisions are drawn to a close by a bounded accumulation process described by the drift-diffusion model. This allows us to model not only the choice probabilities but also the reaction times.
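A minimal sketch of how a single trial of such a reinforcement learning drift-diffusion model can be simulated (our own simplification, with assumed parameter names; the drift is taken to be proportional to the difference between the value estimates, so larger value differences produce faster and more accurate decisions):

```python
import numpy as np

rng = np.random.default_rng(2)

def ddm_trial(q, drift_scale=1.0, threshold=1.0, noise_sd=1.0, dt=0.001, max_time=5.0):
    """Simulate one drift-diffusion decision between two options.

    The decision variable drifts at a rate proportional to q[0] - q[1] and is
    perturbed by gaussian noise; the first threshold crossed determines the choice.
    Returns (choice, reaction_time)."""
    drift = drift_scale * (q[0] - q[1])
    x, t = 0.0, 0.0
    while abs(x) < threshold and t < max_time:
        x += drift * dt + noise_sd * np.sqrt(dt) * rng.standard_normal()
        t += dt
    choice = 0 if x >= 0 else 1
    return choice, t

# Example: a larger difference in value estimates yields faster, more accurate choices.
print(ddm_trial([0.8, 0.2]), ddm_trial([0.55, 0.45]))
```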
5.1 Methods of Simulations
5.2 Results of Simulations
Figure 8: Effect of confirmation bias on reward rate. (a) The heat map represents the per-trial average reward simulated with the confirmation RLDDM for all learning rate combinations (confirmatory learning rates on the x-axis and disconfirmatory learning rates on the y-axis). The rewards concern the stable condition with 128 trials and asymmetric reward contingencies for the two options and are averaged across agents. (b) The heat map represents the per-trial average reaction time estimated with the confirmation RLDDM for all learning rate combinations. (c) The heat map represents the per-trial average reward rate simulated with the confirmation RLDDM for all learning rate combinations.
6 Comparison with Alternative Models
In order to clarify the role of the constraint imposed on the learning rates and of the counterfactual feedback, we performed simulations with three additional models, which differ from the confirmation model in the update of the values of the unchosen option. Table 1 compares how the learning rates depend on the choice and the sign of prediction error in the confirmation model and the alternative models.
All of the alternative models update the value estimate of the chosen option similarly to the confirmation model—that is, according to a delta rule with two learning rates: $\alpha^+$ for positive updates and $\alpha^-$ for negative updates. The three additional models differ in their updates of the value estimate of the unchosen option. The first model, referred to as the valence model, updates the value estimate of the unchosen option with learning rates depending on the sign of the prediction error, analogous to the update of the chosen option. Thus, in this model, the learning rate depends only on the sign of the prediction error, not on whether the option was chosen. The second model, referred to as the hybrid model, updates the value of the unchosen option using an unbiased learning rate defined as $\alpha^0 = (\alpha^+ + \alpha^-)/2$. We refer to this model as hybrid because the learning rate for the unchosen option is the average of those in the valence model and the confirmation model (with the same $\alpha^+$ and $\alpha^-$). The third model, referred to as partial feedback, does not update the value of the unchosen option; hence, it can describe learning in tasks in which feedback is provided only for the chosen option. We define an agent with a positivity bias as one for whom $\alpha^+ > \alpha^-$, whereas an agent with a negativity bias has $\alpha^+ < \alpha^-$, and an agent with no bias (a neutral setting) has $\alpha^+ = \alpha^-$.
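The learning-rate schedules of Table 1 can be summarized in code (a sketch using the table's notation; `None` marks the absent update of the unchosen option in the partial feedback model):

```python
def learning_rate(model, chosen, positive_error, alpha_plus, alpha_minus):
    """Learning rate applied to an option's prediction error under each model in Table 1.

    model          : 'confirmation', 'valence', 'hybrid', or 'partial'
    chosen         : True if this option was chosen on the current trial
    positive_error : True if the prediction error for this option is positive
    Returns None when the option is not updated (unchosen option under partial feedback).
    """
    if chosen or model == 'valence':
        return alpha_plus if positive_error else alpha_minus
    if model == 'confirmation':
        return alpha_minus if positive_error else alpha_plus    # rates swapped for the unchosen option
    if model == 'hybrid':
        return 0.5 * (alpha_plus + alpha_minus)                  # unbiased rate alpha^0
    if model == 'partial':
        return None                                              # unchosen option is not updated
    raise ValueError(f"unknown model: {model}")
```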
Figure 9: Dependence of reward on contingencies in alternative models. The heat maps represent the per-trial average reward for combinations of the two learning rates of each model, $\alpha^+$ (x-axis) and $\alpha^-$ (y-axis), with a softmax policy with a fixed temperature in a stable environment. The performance is averaged across 1000 agents, all period lengths, and either low reward contingencies for both options (a, c, e, g) or high reward contingencies for both options (b, d, f, h). The four models are the confirmation model (a, b), the valence model (c, d), the hybrid model (e, f), and a model with partial feedback (g, h). Areas enclosed by black lines represent learning rate combinations for which the reward is significantly higher than the reward of the best equal learning rates combination, represented by a black circle; one-tailed independent samples rank-sum tests, $p < 0.001$, corrected for multiple comparisons.
In the case of full feedback, Figures 9a and 9b show that confirmation bias in the confirmation model increases average reward regardless of the range of reward probabilities of the two options. This consistent effect of confirmation bias contrasts with the opposite effects of biases in learning rates in the valence model (see Figures 9c and 9d), where a positivity bias is beneficial for low reward probabilities, while a negativity bias is beneficial for high reward probabilities. These effects can be understood on the basis of a previous study (Caze & van der Meer, 2013). That study analyzed a reinforcement learning model in which the learning rate depended on the sign of the prediction error, as in the valence model. The study showed that if the reward probabilities of both options are low, then it is beneficial to have a positivity bias. With such a bias, both value estimates will be overestimated and, critically, the difference between them will be magnified, so with a noisy choice rule, the option with the higher reward probability will be more likely to be selected. By contrast, if both reward probabilities are high, then overestimating the values would actually reduce the difference between them due to a ceiling effect, because according to equation 1.1, reward estimates cannot exceed the maximum reward available, 1. In this case, it is beneficial to have a negativity bias. Therefore, if one assumes that learning rates can differ between rewarded and unrewarded trials, the type of reward-increasing bias depends on the magnitude of the reward probabilities in a task (Caze & van der Meer, 2013), a dependence clearly seen in Figures 9c and 9d.
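This dependence can be illustrated numerically with the asymptotic value estimate of an option under the valence model when both options are updated on every trial, $Q^{*} = p\,\alpha^{+}/(p\,\alpha^{+} + (1-p)\,\alpha^{-})$ (a sketch of our own; the reward probabilities and learning rates below are arbitrary examples):

```python
def valence_fixed_point(p, alpha_plus, alpha_minus):
    """Asymptotic value estimate of an option with reward probability p when it is
    updated with alpha_plus after positive and alpha_minus after negative prediction
    errors on every trial (full feedback, valence model)."""
    return p * alpha_plus / (p * alpha_plus + (1 - p) * alpha_minus)

for p_low, p_high in ([0.1, 0.2], [0.8, 0.9]):
    for bias, (ap, am) in [("positivity", (0.3, 0.1)), ("negativity", (0.1, 0.3))]:
        gap = valence_fixed_point(p_high, ap, am) - valence_fixed_point(p_low, ap, am)
        print(f"p = ({p_low}, {p_high}), {bias} bias: value gap = {gap:.2f}")
# For low reward probabilities the positivity bias magnifies the gap between the two
# estimates (0.18 vs. 0.04), whereas for high probabilities it compresses it
# (0.04 vs. 0.18) because the estimates approach the ceiling of 1.
```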
Since the learning rates in the hybrid model lie in between those in the valence and the confirmation models, the optimal bias in the hybrid model is in between that of these two models. A positivity or confirmation bias is optimal for low reward probabilities (see Figure 9e), while for high reward probabilities the optimal bias is close to neutral ($\alpha^+ \approx \alpha^-$; see Figure 9f), so it lies between the optimal biases for the confirmation (see Figure 9b) and valence (see Figure 9d) models.
It is also worth comparing the performance of the models for their optimal learning rates. Different panels in Figures 9a to 9f have different color scales that span the range of obtained rewards. Comparing color scales reveals that the confirmation model can produce the overall highest reward: for low reward probabilities, it achieved higher reward (for its best parameters) than the valence model and similar performance to the hybrid model, while for high reward probabilities, it obtained higher reward than both alternative models.
In summary, in the case of full feedback, the confirmation model is the only one among models compared in Figures 9a to 9f for which the optimal learning rates lie in the same regions of parameter space for both low and high probabilities. A learner often does not know the task parameters, and the confirmation model is most robust to this uncertainty because it is the only model for which it is possible to choose a combination of learning rates that work relatively well for different tasks.
In the case of partial feedback, where only the value of the chosen option is modified, a positivity bias is beneficial for low reward probabilities, while a negativity bias is beneficial for high reward probabilities (see Figures 9g and 9h), as expected from a previous theoretical analysis (Caze & van der Meer, 2013). The optimal learning rates with partial feedback are similar to those in the valence model with full feedback (compare Figures 9c and 9d with Figures 9g and 9h), as in both models the learning rate depends only on the sign of the prediction error (see Table 1).
The optimal learning rates do differ slightly between the valence model with full feedback and the partial feedback model: less negativity bias is required to maximize reward with partial feedback for high probabilities (see Figures 9d and 9h). This difference arises because with full feedback both values are updated equally often, while with partial feedback, the poorer option is chosen less frequently. Hence, with partial feedback, the value of the poorer option moves slowly from its initial value of 0.5, so even if $\alpha^+ > \alpha^-$, the value of the poorer option may not be overestimated. The difference between the models disappears if both values are updated with more similar frequencies (we observed this in simulations, not shown, in which the temperature of the softmax function was increased).
In summary, in the case of partial feedback, updating values of the chosen option with the larger learning rate after positive prediction error is detrimental for higher reward probabilities (see Figure 9h). Hence, the bias that optimizes the confirmation model (see Figure 9b) may be detrimental with partial feedback in the model analyzed in this section (see Figure 9h). Nevertheless, in the next section, we come back to this issue, and point out that the optimal bias may differ in other reinforcement learning models with partial feedback.
7 Discussion
Humans have been observed to exhibit confirmatory biases when choosing between stimuli or actions that pay out with uncertain probability (Chambon et al., 2020; Palminteri et al., 2017; Schuller et al., 2020). These biases drive participants to update value estimates more sharply after positive outcomes (or those that are better than expected) for chosen options than after negative outcomes, but to reverse this update pattern for the unchosen option. Here, we show through simulations that across a range of settings traditionally used in human experiments, this asymmetric update is advantageous in the presence of noise in the decision process. Indeed, agents that exhibited a confirmatory bias, rather than a neutral or disconfirmatory bias, were in most circumstances tested the agents that reaped the largest quantities of reward. This counterintuitive result stems directly from the update process itself, which biases the values of the chosen and unchosen options (corresponding, on average, to the best and worst options, respectively), mechanistically increasing their distance from each other and ultimately the probability of selecting the best option on upcoming trials.
Exploring the evolution of action values under confirmatory updates offers insight into why this occurs. Confirmatory updating has the effect of rendering subjective action values more extreme than their objective counterparts; in other words, options that are estimated to be good are overvalued, and options estimated to be bad are undervalued (see Figure 6). This can have both positive and negative effects. The negative effect is that a sufficiently strong confirmatory bias can drive a feedback loop whereby poor or mediocre items that are chosen by chance can be falsely updated in a positive direction, leading them to being chosen more often. The positive effect, however, is that where decisions are themselves intrinsically variable (e.g., because they are corrupted by gaussian noise arising during decision making or motor planning, modeled here with the softmax temperature parameter), overestimation of value makes decisions more robust to decision noise because random fluctuations in the value estimated at the time of the decision are less likely to reverse a decision away from the better of the two options. The relative strength of these two effects depends on the level of decision noise: within reasonable noise ranges, the latter effect outweighs the former and performance benefits overall.
7.1 Relationship to Other Studies
The results described here thus join a family of recently reported phenomena whereby decisions that distort or discard information lead to reward-maximizing choices under the assumption that decisions are made with finite computational precision—in other words, that decisions are intrinsically noisy (Summerfield & Tsetsos, 2015). For example, when averaging features from a multi-element array to make a category judgment, if the features are equally diagnostic (and the decision policy is not itself noisy), they should normatively be weighted equally in the choice. However, in the presence of “late” noise, encoding models that overestimate the decision value of elements near the category boundary are reward maximizing, for the same reason as the confirmatory bias here: they inflate the value of ambiguous items away from indifference and render them robust to noise (Li, Herce Castanon, Solomon, Vandormael, & Summerfield, 2017). A similar phenomenon occurs when comparing gambles defined by different monetary values: utility functions that inflate small values away from indifference (rendering the subjective difference between $2 and $4 greater than the subjective difference between $102 and $104) have a protective effect against decision noise, providing a normative justification for convex utility functions (Juechems, Spitzer, Balaguer, & Summerfield, 2020). Related results have been described in problems that involve sequential sampling in time, where they may account for violations of axiomatic rationality, such as systematically intransitive choices (Tsetsos et al., 2016). Moreover, a bias in how evidence is accumulated within a trial has been shown to increase the accuracy of individual decisions by making the decision variable more extreme and thus less likely to be corrupted by noise (Zhang & Bogacz, 2010).
Recent studies also report simulations of the confirmation bias model (Chambon et al., 2020; Tarantola, Folke, Boldt, Perez, & De Martino, 2021). These simulations paralleled the experimental paradigms reported in those papers, and the confirmation model was simulated with parameters (including the softmax temperature) corresponding to those estimated for the participants of the studies. The simulated agents employing confirmation bias obtained higher average reward than unbiased learners, as well as than learners described by other models. Our article suggests the same conclusion using a complementary approach in which the models have been simulated in a variety of conditions and analyzed mathematically.
Modeling studies have investigated how learning with rates depending on the sign of prediction error could be implemented in the basal ganglia circuits known to underlie reinforcement learning (Collins & Frank, 2014; Dabney et al., 2020). Models have been developed to describe how positive and negative prediction errors preferentially engage learning in different populations of striatal neurons (Mikhael & Bogacz, 2016; Möller & Bogacz, 2019). It would be interesting to investigate the neural mechanisms that lead to learning rates depending not only on the sign of prediction error but also on whether options have been chosen.
7.2 Validity of Model's Assumptions
Reinforcement learning models fit to human data often assume that choices are stochastic—that participants sometimes fail to choose the most valuable bandit. In standard tasks involving only feedback about the value of the chosen option (factual feedback), some randomness in choices promotes exploration, which in turn allows information to be acquired that may be relevant for future decisions. However, our task involves both factual and counterfactual feedback, so exploration is not required to learn the values of the two bandits. Nevertheless, in some simulations, we modeled choices with a softmax rule, which assumes that decisions are corrupted by gaussian noise, or an $\epsilon$-greedy policy, which introduces lapses into the choice process with a fixed probability. Implicitly, then, we are committing to the idea that value-guided decisions may be irreducibly noisy even where exploration is not required (Renart & Machens, 2014). Indeed, others have shown that participants continue to make noisy decisions even when counterfactual feedback is available, even if they attributed that noise to variability in learning rather than in choice (Findling, Skvortsova, Dromnelle, Palminteri, & Wyart, 2019).
Due to our assumptions, this study has a number of limitations. First, we explored the properties of a confirmatory model that has previously been shown to provide a good fit to data from humans performing a bandit task with factual and counterfactual feedback. However, we acknowledge that this is not the only possible model that could increase reward by enhancing the difference between the represented values of the options. In principle, any other model producing choice hysteresis might be able to explain these results (Katahira, 2018; Miller, Shenhav, & Ludvig, 2019; Worthy, Pang, & Byrne, 2013). An analysis of these different models and their respective resilience to decision noise in different settings is beyond the scope of our study but would be an interesting target for future research. Second, the results described here hold assuming a fixed and equal level of stochasticity (e.g., softmax temperature) in agents' behavior, regardless of their bias (i.e., the specific combination of learning rates). Relaxing this assumption, an unbiased agent could perform as well as a biased agent subject to more decision noise. Thus, the benefit of confirmatory learning is inextricably linked to the level of noise, and no single level of confirmation bias can be considered beneficial overall. Third, our study does not investigate the impact on performance of other kinds of internal noise, such as update noise (Findling et al., 2019). Update noise, instead of perturbing the policy itself, perturbs the update of the option values on each trial (i.e., prediction errors are blurred with gaussian noise); it presumably cannot produce a similar increase in performance because it has no overall effect on the average difference between the option values.
7.3 Confirmation Bias with Partial Feedback
In this article, we have focused on studying confirmation bias in tasks where feedback is provided for both the chosen and unchosen options, but in most reinforcement learning tasks studied in the laboratory, and possibly in the real world, feedback is provided only for the chosen option. With such partial feedback, it does not seem possible to distinguish between the confirmation and valence models because they make the same update to the value of the chosen option. However, a recent ingenious experiment suggested that the confirmation bias is also present with partial feedback, because the learning rate was higher after positive prediction errors only when the choice was made by the participant but not when the choice was made by a computer (Chambon et al., 2020). An analogous effect was also observed outside the laboratory in a study of heart surgeons, who learned more from their own successes than from their own failures but not from the observed successes of their colleagues (Kc, Staats, & Gino, 2013). Hence, it is important to understand how the results from this article could be generalized to partial feedback.
For partial feedback, previous theoretical work suggests that the optimal learning rates depend on whether the reward probabilities are high or low (Caze & van der Meer, 2013), and we confirmed this in the simulations in Figures 9g and 9h. Surprisingly, it has been shown that human participants did not follow this pattern and had similar learning rates regardless of whether the reward probabilities were low or high (Chambon et al., 2020; Gershman, 2015). This poses the question of whether humans do not behave in a reward-maximizing way (which seems unlikely given the evolutionary pressure for reward maximization) or whether the normative theory of learning with partial feedback needs to be revised. One way to include the confirmation bias in models of learning with partial feedback would be to note that humans and animals are aware of the confidence of their choices (Kiani & Shadlen, 2009)—that is, whether they are certain the chosen option yields the highest reward or whether the choice was a guess. Hence, one could consider models in which the learning rate depends not only on the sign of the prediction error but also on confidence, such that negative feedback is weighted less when a participant is confident in his or her choice. Formulating such models would require careful comparison with specially designed experiments; hence, it is beyond the scope of this article but would be an interesting direction for future work.
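One hypothetical form of such a confidence-weighted update (purely illustrative, not proposed or tested in this article) would scale down the learning rate for negative prediction errors in proportion to the confidence in the choice:

```python
def confidence_weighted_update(q_chosen, reward, confidence, alpha):
    """Hypothetical partial-feedback update in which negative feedback is
    down-weighted when confidence in the choice is high (confidence in [0, 1])."""
    delta = reward - q_chosen
    lr = alpha if delta > 0 else alpha * (1.0 - confidence)
    return q_chosen + lr * delta
```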
7.4 Limits to the Benefits of Biased Beliefs
It is important to point out that the confirmation bias is beneficial in many, but not all, circumstances. Although in almost all presented simulations there exists a combination of biased learning rates giving performance that is higher than or as good as that of the best unbiased learner, the optimal learning rates, and hence the optimal amount of bias, differ depending on the task parameters. At the start of a task, a learner usually cannot know the details of the task and so needs to adopt a certain default combination of learning rates. One could expect that such default learning rates would be determined by experience or even be, to a certain extent, influenced by evolution. However, such a default set of biased learning rates will have detrimental effects on performance in certain tasks. For example, a recent study estimated the average confirmatory and disconfirmatory learning rates of human participants, which imply a substantial confirmation bias (Tarantola et al., 2021). Although such a strong confirmation bias increases reward in many simulated scenarios when decisions are noisy (e.g., Figures 3e to 3h), it would have a negative effect on performance when decisions are made accurately on the basis of values and in changing environments (e.g., Figures 3a to 3d). If the default confirmation bias is influenced by evolution, its value is likely to be relatively high because many of the key decisions of our ancestors had to be quick and thus were noisy due to the speed–accuracy trade-off. By contrast, in the modern world, we can often take time to consider important choices; hence, the biases that brought an evolutionary advantage to our ancestors may not always be beneficial to us.
Acknowledgments
This work has been supported by MRC grants MC_UU_12024/5, MC_UU_00003/1, BBSRC grant BB/S006338/1, and ERC Consolidator grant 725937.
References
Author notes
C.S. and R.B. contributed equally.