Much experimental evidence suggests that during decision making, neural circuits accumulate evidence supporting alternative options. A computational model that describes this accumulation well for choices between two options assumes that the brain integrates the log ratios of the likelihoods of the sensory inputs given the two options. Several models have been proposed for how neural circuits can learn these log-likelihood ratios from experience, but all of these models introduced novel, specially dedicated synaptic plasticity rules. Here we show that for a wide class of tasks, the log-likelihood ratios are approximately linearly proportional to the expected rewards for selecting actions. Therefore, a simple model based on standard reinforcement learning rules can estimate the log-likelihood ratios from experience and, on each trial, accumulate the log-likelihood ratios associated with presented stimuli while selecting an action. Simulations of the model replicate experimental data on both behavior and neural activity in tasks requiring accumulation of probabilistic cues. Our results suggest that there is no need for the brain to support dedicated plasticity rules: the standard mechanisms proposed to describe reinforcement learning can enable neural circuits to perform efficient probabilistic inference.
Humans and other animals often have to choose a course of action based on multiple pieces of information. Consider a cat deciding whether to chase a bird in your back garden. Her decision will depend on multiple factors: how far away the bird is, how tasty it looks, and how long it is until humans will provide a bowl of kibble. To make the best decisions in novel or unfamiliar settings, animals have to learn by trial and error how to weigh information appropriately. In this letter, we study this question in the context of laboratory tasks in which multiple cues signal which response is most likely to be rewarded. We focus on how the weight associated with each cue is learned over time, in situations where multiple stimuli are present at each trial, and participants need to learn to appropriately assign credit to each cue for successes or failures. This, in turn, will facilitate subsequent decision making.
Two broad classes of theory have been developed to describe learning and decision making in such situations. The classical theory of reinforcement learning (RL) suggests that animals learn to predict scalar reward outcomes for each cue or combination of cues (Rescorla & Wagner, 1972). When multiple cues are presented, the animal’s total expected reward is a sum of rewards associated with the stimuli presented. Following feedback, the individual reward expectations are updated proportionally to the reward prediction error, defined as the difference between reward obtained and expected. This model naturally generalizes to learning about expected rewards following actions (Sutton & Barto, 1998). The model also captures essential aspects of learning in basal ganglia; much evidence suggests that reward prediction error is encoded in phasic activity of dopaminergic neurons (Schultz, Dayan, & Montague, 1997; Fiorillo, Tobler, & Schultz, 2003), which modulates synaptic plasticity in the striatum (Reynolds, Hyland, & Wickens, 2001; Shen, Flajolet, Greengard, & Surmeier, 2008).
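The Rescorla-Wagner update just described can be sketched in a few lines of code. This is an illustrative sketch only, not the implementation used in the simulations: the function name, the number of cues, and the learning rate are all assumptions made for the example.

```python
import numpy as np

def rescorla_wagner_update(w, present, reward, alpha=0.1):
    """One Rescorla-Wagner update for a trial with multiple cues.

    w       : per-cue reward expectations (hypothetical example values)
    present : binary vector, 1 for cues shown on this trial
    reward  : scalar reward obtained on this trial
    alpha   : learning rate (illustrative value)
    """
    v = np.dot(w, present)               # total expected reward: sum over presented cues
    delta = reward - v                   # reward prediction error
    return w + alpha * delta * present   # only presented cues are updated

# Example: cues 0 and 1 are always presented together and always rewarded;
# their expectations come to share the credit, while cue 2 stays untouched.
w = np.zeros(3)
for _ in range(200):
    w = rescorla_wagner_update(w, np.array([1, 1, 0]), reward=1.0)
```

After training, the summed expectation over the two presented cues approaches the obtained reward, illustrating how credit for a shared outcome is distributed across co-occurring cues.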
In the transition from the first to the second line, we used Bayes' theorem, and in the transition from the second to the third line, we assumed conditional independence of the stimuli. Thus, the ratio of posterior probabilities can be computed simply by adding the WOEs associated with the presented stimuli to a term representing the initial, prior probabilities (this term is equal to zero when the prior probabilities of the two actions are equal). Choosing one action or the other according to whether the sign of the above decision variable is positive or negative is equivalent to choosing the action with the higher posterior probability.
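The decision rule above, summing the WOEs of the presented cues plus a log prior ratio, can be illustrated with the following sketch; the cue names and likelihood values are hypothetical example numbers, not taken from any of the tasks discussed in the text.

```python
import math

# Hypothetical cue likelihoods P(cue | action correct) for the two actions.
p_given_A = {"cue1": 0.6, "cue2": 0.2, "cue3": 0.2}
p_given_B = {"cue1": 0.3, "cue2": 0.3, "cue3": 0.4}

def decision_variable(cues, prior_A=0.5):
    """Log posterior ratio: log prior ratio plus the summed WOEs."""
    d = math.log(prior_A / (1 - prior_A))  # zero for equal priors
    for c in cues:
        d += math.log(p_given_A[c] / p_given_B[c])  # WOE of cue c
    return d

# A positive value favors the first action, a negative value the second.
d = decision_variable(["cue1", "cue3"])
```

With these example numbers, cue1 carries a WOE of log 2 for the first action and cue3 a WOE of -log 2, so presenting both leaves the decision variable at zero: the evidence exactly cancels.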
These theories (RL and SPRT) have largely been developed in parallel. Several models have attempted to combine the two approaches and describe how animals learn the WOEs of stimuli using RL (Soltani & Wang, 2010; Coulthard et al., 2012; Berthet, Hellgren-Kotaleski, & Lansner, 2012; Soltani, Khorsand, Guo, Farashahi, & Liu, 2016). However, these models employed novel synaptic plasticity rules, which for some of the models were relatively complex, and there is no evidence that synapses can implement these rules. Building on the ideas from these earlier models, this letter shows that WOEs can also be learned with a standard and simple plasticity rule. In particular, we show that for a certain class of tasks, the expected reward for selecting an action when a given stimulus is present is approximately linearly proportional to that stimulus's WOE. Therefore, a simple model based on the standard Rescorla-Wagner rule is able to learn the WOEs and accumulate the WOEs associated with presented stimuli while selecting an action. The main novel contribution of this letter is showing that the learning of WOEs (previously demonstrated only with unconventional learning rules) can also be achieved with the standard RL plasticity rules, which are thought to be implemented in the basal ganglia circuits.
In the next section, we describe the class of tasks under consideration and present a model of learning in these tasks. In section 3, we show that this model approximates decision making through accumulation of WOEs, analyze how the WOEs estimated by the model depend on task and model parameters, and compare the model with the data. Finally, in section 4, we compare the proposed model with previous models and discuss further experimental predictions.
We consider a class of tasks often used to investigate the neural bases of probabilistic decision making (Knowlton, Mangels, & Squire, 1996; Yang & Shadlen, 2007; Philiastides, Biele, & Heekeren, 2010; de Gardelle & Summerfield, 2011; Coulthard et al., 2012). On each trial, participants choose between two actions on the basis of multiple sensory cues, and on a given trial, only one of the actions is rewarded. At the start of each trial, cues are presented, sampled with replacement from a larger set of possible cues. The probability of each cue appearing depends on which action is rewarded on a given trial, so the subject can deduce from the cues which action is more likely to be rewarded. After the decision, a reward is received if the correct action was selected and no reward if the incorrect action was selected.
2.1 Reinforcement Learning Model
To capture learning in such tasks, we employ a very simple model (a single-layer perceptron) that learns based on the standard Rescorla-Wagner rule for tasks with multiple stimuli. This model is schematically illustrated in Figure 1. It is composed of an input layer (e.g., putative cortical sensory neurons) selective for the cues and an output layer (e.g., putative striatal neurons) selective for the actions. A similar two-layer structure has been used in other models of learning stimulus-response associations (Law & Gold, 2009; Gluck & Bower, 1988).
To implement such learning, at the time of choice, a memory trace, known as an eligibility trace, needs to form in the synapses between neurons selective for the presented stimuli and the neurons selective for the chosen action (Sutton & Barto, 1998). Subsequently, when feedback is provided, the eligible synapses should be modified proportionally to the reward prediction error. The reward prediction error is thought to be encoded in the phasic activity of dopaminergic neurons (Schultz et al., 1997). They receive inhibitory input from striatal neurons (Watabe-Uchida, Zhu, Ogawa, Vamanrao, & Uchida, 2012), which in our model encode the expected reward. Assuming that the dopaminergic neurons also receive an input encoding the reward obtained, they could subtract these two inputs and compute the reward prediction error. The dopaminergic neurons send dense projections to the striatum and modulate plasticity of cortico-striatal synapses (Shen et al., 2008).
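The learning mechanism just outlined can be sketched as a simple simulation loop. This is a minimal illustration, not the code used for the figures: the cue probabilities, fixed bias value, and learning rate are assumed example values, a single cue is presented per trial for clarity, and actions are selected at random (as in most of the simulations reported below).

```python
import numpy as np

rng = np.random.default_rng(0)

n_cues = 4
# Hypothetical cue probabilities given the rewarded action
# (row 0: cue distribution when action 0 is rewarded; row 1: when action 1 is).
p_cue = np.array([[0.4, 0.3, 0.2, 0.1],
                  [0.1, 0.2, 0.3, 0.4]])

W = np.zeros((2, n_cues))   # cortico-striatal weights, one row per action
bias = 0.5                  # fixed bias weight (an illustrative assumption)
alpha = 0.01                # learning rate (illustrative value)

for _ in range(60000):
    correct = rng.integers(2)                   # which action is rewarded this trial
    cue = rng.choice(n_cues, p=p_cue[correct])  # one cue per trial, for clarity
    a = rng.integers(2)                         # random action selection
    r = 1.0 if a == correct else 0.0
    q = bias + W[a, cue]                        # predicted reward for the chosen action
    delta = r - q                               # dopamine-like prediction error
    W[a, cue] += alpha * delta                  # update only the eligible synapse
```

With these numbers, a cue that makes action 0 correct with probability 0.8 drives the corresponding weight toward 0.8 - 0.5 = 0.3, so the weights come to rank the cues by how strongly they predict reward for each action.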
It has also been proposed how the weights are physically represented in the strengths of cortico-striatal connections. The striatal projection neurons can be divided into two groups: those whose activity can facilitate movements (Go neurons, expressing D1 receptors) and those inhibiting movements (NoGo neurons, expressing D2 receptors) (Kravitz et al., 2010). Computational models have been proposed in which the weights of Go neurons increase while the weights of NoGo neurons decrease when the prediction error is positive, and vice versa when the prediction error is negative (Frank, Seeberger, & O'Reilly, 2004; Collins & Frank, 2014; Mikhael & Bogacz, 2016). It has been shown that for a certain class of plasticity rules, the difference between the weights of Go and NoGo neurons encodes the learned weight; that is, this difference evolves according to equation 2.3 (Mikhael & Bogacz, 2016).
At the start of each simulated experiment, the stimulus weights are initialized to a common starting value, while the weights from the bias node are set to a fixed value and kept constant in all simulations except those in section 3.3.
2.2 Generating Stimuli
Before each simulated trial, it was decided randomly which action would be rewarded, according to the prior probabilities, which were set to 0.5 for each action in all simulations except when indicated otherwise. Depending on which action was rewarded, cues were drawn randomly with replacement according to the corresponding cue probabilities.
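The trial-generation procedure can be sketched as follows; the cue probabilities, the action labels, and the set size of four cues are hypothetical values chosen for illustration.

```python
import random

# Hypothetical cue probabilities conditioned on which action is rewarded.
P_CUE = {"A": [0.4, 0.3, 0.2, 0.1],
         "B": [0.1, 0.2, 0.3, 0.4]}

def generate_trial(n_stimuli=2, prior_a=0.5, rng=random):
    """Sample the rewarded action, then draw cues with replacement."""
    rewarded = "A" if rng.random() < prior_a else "B"
    cues = rng.choices(range(4), weights=P_CUE[rewarded], k=n_stimuli)
    return rewarded, cues
```

Because the cues are drawn conditioned on the rewarded action, the same cue can appear on trials where either action is correct; only the frequencies differ, which is what makes the task probabilistic rather than deterministic.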
First, we investigate the values to which the weights converge in the model. We first derive a general condition that the weights need to satisfy at the stochastic fixed point and then analyze its implications for different variants of the task.
3.1 Learning with a Single Stimulus
3.2 Learning with Multiple Stimuli
When the WOEs are chosen such that, on a majority of trials, the summed input remains within the approximately linear range of the sigmoid, the sigmoid function in equation 3.8 can be approximated by a linear function, and the weights approximately satisfy that equation, which implies that the weights converge to values similar to those in the single-stimulus case. This is illustrated in Figure 2B, which shows the results of a simulation with WOEs relatively close to 0. One can see that the weights learned with multiple stimuli per trial are relatively close to those learned with a single stimulus per trial.
When more extreme WOEs are used and the summed input often exceeds the linear range of the sigmoid, the linear approximation does not hold. Simulations of this case are shown in Figure 2A, where symbols of different colors indicate how the weights in the model depend on the number of stimuli presented within a trial. The weights converge to less extreme values when more stimuli are presented. In this simulation, the more stimuli are presented per trial, the more likely the summed input is to exceed the linear range, and therefore the final weight learned by the model for each individual stimulus will be damped. Consider, for example, a trial in which the model is presented with two stimuli that both carry a high WOE in favor of the same action. If the weights were equal to the values learned in the single-stimulus case, the summed prediction would exceed the maximum reward of 1, so even when the reward is received, the prediction error is negative and the weights are decreased. The more stimuli are presented per trial, the more the weights learned by the model will be damped.
Nevertheless, it is remarkable in Figure 2A that the weights learned by the model remain an approximately linear function of the WOEs, even when they are damped. This happens because all the weights are damped: even a stimulus with a WOE close to 0 may co-occur on the same trial with stimuli of high WOE, so the summed input may exceed the linear range of the sigmoid, and the weights of all stimuli presented on that trial will be damped. We will see in section 3.5 that for highly extreme WOEs this linear relationship breaks down, but the relationship between the weights and the WOEs remains approximately linear for the wide range of WOEs used in Figure 2A, which is similar to those typically used in experimental studies (Yang & Shadlen, 2007; Philiastides et al., 2010).
3.3 Learning Prior Probabilities
In all simulations so far, we assumed for simplicity that the two actions were correct equally often, and we fixed the weights from the bias node. Here we analyze to what values these weights converge when the probabilities of the two actions are no longer the same.
In order for the model to learn the prior probabilities, trials with different numbers of stimuli per trial had to be intermixed. Figure 3B shows that the bias weights indeed converged to the vicinity of the value representing the prior probabilities.
To gain a sense of how learning of the prior probabilities affects subsequent decision behavior, Figures 3I and 3J show the activity of the decision node (selective for the more likely action) when the trained model is presented with a particular set of stimuli. Each dot corresponds to a set of stimuli, and the actual probability of the action being correct for a given set is reflected by the position along the horizontal axis. Since the posterior probability of the action being correct is equal to the expected reward, and the activity of the decision node is the reward predicted by the model, a perfectly trained model should produce activity on the identity line. Although the model's predictions are generally close to the true expected reward, there are systematic departures that are worth analyzing, as similar misestimations of expected reward were observed by Soltani et al. (2016).
When the model was trained with a fixed number of stimuli per trial (as in the Soltani et al., 2016, study), the activity depended on the number of stimuli present on a given testing trial (indicated by color in Figure 3I). In particular, if only one stimulus was presented, the model underestimated the reward for choosing the more likely action. This happened because the bias weight underestimated the prior probability (see Figure 3A). For larger numbers of stimuli, the activity became closer to the expected reward. This happened because the model incorporated information about the prior probabilities into the learned weights, such that the weights for the more likely action were increased. Note in Figures 3C to 3E that the majority of blue points shift upward as the probability of that action increases. Therefore, when more stimuli were presented, these increased weights accumulated, raising the activity of the decision node. A similar dependence of expected reward on the number of presented stimuli has been observed in an analogous task by Soltani et al. (2016), and we return to it in section 4.
Figure 3J shows the activity of the decision node when the model had been trained with a variable number of stimuli. Here, the activity was closer to the posterior probability of the action being correct for trials with few stimuli than in Figure 3I, because the model had experienced such trials during training. To help understand why the prediction is still not perfect, we need to analyze under what conditions the model is able to closely approximate the decision variable of equation 1.2.
If the prior probabilities are sufficiently close to 0.5 and the remaining input to the sigmoid is sufficiently close to 0, so that the sigmoid on the right-hand side of equation 3.13 can be approximated as in equation 3.7, then the bias weight will approximately satisfy equation 3.13. Figures 3F and 3G show that the weights indeed converge to the vicinity of the value representing the priors for priors close to 0.5, but not for the more extreme priors in Figure 3H. Equation 3.13 also implies that the higher the prior probability, the closer to 0 the remaining terms need to be for the weights to converge to this value.
3.4 Properties of the Model
This section characterizes different aspects of learning in the model in different variants of the task.
3.4.1 Speed of Learning
Figure 4A compares how the weights changed during learning for different numbers of stimuli presented in the simulation shown in Figure 2A and reveals that the model converged faster when more stimuli were presented at a time. This effect is quantified in Figure 4B, which shows the number of trials required for convergence as a function of the number of stimuli presented per trial. The weights converge faster because more information is presented on each trial and because the final weights are less extreme when more stimuli are present at a time.
3.4.2 Effect of Exploration Parameter
For simplicity, in all simulations so far, we set the parameter controlling how deterministic the choice is to a value corresponding to random action selection. In order to test learning in the model with more deterministic action selection, we performed simulations with different values of this exploration parameter. Figure 5A shows the results when only one stimulus was presented per trial. We found that for high values of the parameter, which made the model choose only reward-promising actions, the neuron selective for a given action did not properly learn the weights of stimuli that predicted a low reward for that action. This happened because for such stimuli, the action was rarely chosen. Nevertheless, we point out that the neuron selective for the other action did learn the weights of these stimuli (data not shown), so the network as a whole was able to preferentially select actions with higher expected value (note that the softmax of equation 2.2 can be rewritten so that each choice probability depends only on the difference between the activities of the two action-selective units, so the model makes a choice on the basis of this difference).
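The equivalence noted in the parentheses — that for two actions, the softmax choice probability depends only on the difference between the two activities — can be checked directly. The function names and parameter values below are illustrative:

```python
import math

def softmax2(q1, q2, beta):
    """Two-action softmax probability of choosing the first action."""
    e1, e2 = math.exp(beta * q1), math.exp(beta * q2)
    return e1 / (e1 + e2)

def logistic_of_diff(q1, q2, beta):
    """The same probability written as a logistic of the activity difference."""
    return 1.0 / (1.0 + math.exp(-beta * (q1 - q2)))
```

Dividing the numerator and denominator of the softmax by exp(beta * q1) yields the logistic form, which also makes explicit that adding a constant to both activities leaves the choice probabilities unchanged.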
This difficulty with learning the weights of stimuli supporting the other action vanished when we performed simulations with four stimuli per trial (see Figure 5B). In this case, after extensive training, the model learned the weights of all stimuli, as it inevitably had to choose actions on the basis of four stimuli that may have included those predicting reward for the nonchosen action.
3.4.3 Effect of Stimulus Frequency
It has been reported that, in learning tasks, humans weight stimuli that are unlikely to occur less strongly than more frequently appearing ones, as subjects are more uncertain about their influence and can update the corresponding weights only infrequently (de Gardelle & Summerfield, 2011). To evaluate whether our model reproduces this behavior, it was confronted with the same type of task as described above, but now featuring pairs of stimuli with the same WOEs. Within each pair, one stimulus was taken to appear more frequently than the other. While computing the likelihoods of these stimuli, we used a formula analogous to equation 2.4 but scaled by the actual frequency of each stimulus.
Figure 6A shows that given enough training, the weights for stimuli with different frequencies converge to similar values. In a second simulation, we made the differences in frequency between stimuli with the same weight more extreme, so that the infrequent stimuli were presented only rarely. Figure 6B illustrates that in this case, the weights for the stimuli with lower frequency were closer to 0. This occurred because the infrequent stimuli were shown so rarely that their weights had not converged in the course of the simulation.
3.5 Simulation of Primate Learning Behavior
Figure 7D shows an analogous analysis of the difference in activity of the action units in the model after the presentation of consecutive stimuli. The trials were binned, excluding trials in which stimuli with infinite WOE were present. We observed that the nodes in the model had activity proportional to the cumulative WOE, similar to the primate data of Yang and Shadlen (2007).
This letter has analyzed the relationship between computational accounts of learning and decision making based on reinforcement learning (RL) and the sequential probability ratio test (SPRT). We demonstrated that synaptic weights learned by RL rules in a certain class of tasks are proportional to WOEs, and hence allow information to be integrated from multiple cues to form a decision. Simulations of the model in the task of Yang and Shadlen (2007) replicated the key features of animal behavior and neural activity. In this section, we relate the model we have presented to other models and experimental data, and we discuss further experimental predictions.
4.1 Relationship to the Soltani and Wang Model
In a closely related study, Soltani and Wang (2010) proposed a model that can also learn weights of synaptic connections allowing probabilistic inference and can also replicate the observations of Yang and Shadlen (2007). We briefly review their model, discuss in what ways it differs from the model we have proposed in this letter, and suggest experiments that can differentiate between the two models.
Despite the similarities, the models differ in two key aspects: the plasticity rule and the presence of the bias node that is critical for learning prior probabilities. We now review these two differences, compare the models with experimental data, and suggest further experiments that can differentiate between the models.
In the Soltani and Wang (2010) model, the weight modification is modulated by reward, while in the model proposed here, it is modulated by the reward prediction error. The model proposed here aims at capturing learning in the basal ganglia and assumes that such modulation of plasticity is mediated by the neuromodulator dopamine, which is known to influence cortico-striatal plasticity (Reynolds et al., 2001; Shen et al., 2008) and to encode the reward prediction error during learning tasks (Schultz et al., 1997; Fiorillo et al., 2003). By contrast, the Soltani and Wang (2010) model aimed at capturing learning in the cortex, where the effects of reward on synaptic plasticity are not as well understood. Importantly, the model proposed here uses the same synaptic plasticity rule that is known to capture well learning about reward magnitudes in reinforcement learning tasks. Therefore, we suggest that there is no need for the brain to have specialized plasticity rules dedicated to supporting probabilistic reasoning, as the standard synaptic mechanisms that learn expected rewards can fulfill this function.
The two models make different predictions about how the learning of WOEs should interact with reward magnitude. Consider an experiment in which some stimuli are presented on trials on which a large reward is given for correct choices, while other stimuli are presented on trials where a smaller reward is given for correct responses. Subsequently, on critical test trials, participants need to make a choice on the basis of stimuli from both groups presented together. In the model proposed here, the learned weights are proportional to the expected reward; thus, it predicts that participants would be more influenced by the stimuli from the first group. Such increased influence is not predicted by the Soltani and Wang (2010) model, where the reward magnitude does not affect the learned weights.
It is also worth comparing the biological plausibility of the Rescorla-Wagner rule and the rule of equation 4.3. A nice property of equation 4.3 is that the change of a particular synaptic weight depends only on the value of this weight, not on other weights in the network. By contrast, in the Rescorla-Wagner rule, the change in a weight depends on the reward prediction error, which is a function of the predicted reward, which in turn depends on other weights in the network. Nevertheless, we described in section 2.1 how this problem can be overcome in the basal ganglia circuit. Recall that the model assumes that the reward prediction error is computed by dopaminergic neurons that receive input from striatal neurons computing the expected reward. Consequently, the Rescorla-Wagner rule requires the eligible synapse to have information on only a single quantity, the reward prediction error, which could be brought by a single neuromodulator, dopamine. By contrast, the rule of equation 4.3 requires the synapse to have information on two quantities: the presence of the reward on the current trial and the average reward rate. Thus, to implement such a rule, two separate neuromodulators would need to encode these quantities, and it is unclear which of the known neuromodulators could play this role.
The second difference between the models is that the one proposed here includes bias weights that allow it to learn about prior probabilities of the responses under certain conditions, while the Soltani and Wang (2010) model does not include the bias node and hence is unable to represent prior probabilities separately from likelihoods.
To test whether humans are able to learn prior probabilities separately from WOEs, Soltani et al. (2016) trained participants with a fixed number of stimuli presented per trial. They then presented the participants with one, two, or four stimuli and asked them to estimate how likely each of the two responses was to be correct. They found a pattern similar to that in Figure 3I (see Figure 2d in Soltani et al., 2016): participants underestimated the probability of the more likely option for small numbers of stimuli. Soltani et al. (2016) pointed out that these data can be explained only by a model that does not learn prior probabilities separately from WOEs. The simulations in section 3.3 show that our model also produces this pattern of behavior, because it too did not learn the prior probabilities when a fixed number of stimuli was presented per trial during training.
The model presented here predicts that when the number of stimuli presented during learning is intermixed across trials, the networks in the brain should be able to learn the prior probabilities of responses. To test this prediction, one could modify the experiment of Soltani et al. (2016) such that trials with different numbers of stimuli are intermixed. The model proposed here predicts that participants would then be able to learn the prior probabilities and would no longer underestimate the probability of the more likely option for small numbers of stimuli (i.e., they would produce the pattern illustrated in Figure 3J).
It would be interesting to investigate whether a modified version of the Soltani and Wang (2010) model that includes a bias node could also learn the prior probabilities when trials with different numbers of stimuli are intermixed during training, but not when the number is fixed.
The model presented in this letter describes learning in the basal ganglia, while the Soltani and Wang (2010) model focuses on learning in the neocortex. It is likely that both structures are involved in learning in tasks requiring evidence accumulation, so it is also possible that the two models describe complementary contributions of basal ganglia and cortex to probabilistic decision making.
4.2 Relationship to Other Models
A handful of other studies have linked RL to the framework provided by the SPRT or related sequential sampling models. Law and Gold (2009) also used a standard RL model, with an architecture and learning rule very similar to those considered here, to capture learning and decision making in a motion discrimination task. Their model was able to learn weights allowing accumulation of information and reproduced many aspects of neural activity during motion discrimination tasks. Here we show that a similar model can also describe decision tasks with discrete stimuli, and we explicitly demonstrate that in these tasks, the learned synaptic weights are approximately proportional to WOEs.
Two studies described models of the basal ganglia circuit that can learn probabilistic quantities, allowing the circuit to implement Bayesian decision making in a fashion equivalent to that described in equation 1.2 (Berthet et al., 2012; Coulthard et al., 2012). However, both of these models assume complex rules for the plasticity of cortico-striatal synapses, and it is not clear if such rules can be implemented by biological synapses. By contrast, here we show that weights allowing integration of information during decision making can also arise from the very simple plasticity rule of Rescorla and Wagner (1972).
In this letter, we have focused on decision making between two options, but it would be interesting to generalize our approach to choices with multiple alternatives. With more than two options, it is no longer possible to define a simple decision variable as in equation 1.2. Nevertheless, it has been proposed that the basal ganglia can compute posterior probabilities of actions given presented stimuli (Bogacz & Gurney, 2007; Bogacz & Larsen, 2011). In order to perform such computation, the neurons selective for an action in this model need to receive input proportional to the log likelihood of stimuli given the action (Bogacz & Gurney, 2007), or WOE in the case of a choice between two alternatives (Lepora & Gurney, 2012). Thus, for a choice between two alternatives, the cortico-striatal weights proportional to WOEs would allow this Bayesian model of basal ganglia to compute the posterior probabilities of actions. Future research may wish to investigate whether the cortico-striatal weights learned with the Rescorla-Wagner rule can allow the model of the basal ganglia to approximate the posterior probabilities of actions for the choice between multiple alternatives.
4.3 Relationship to Experimental Data
The simulation of the model in the task of Yang and Shadlen (2007) showed that the model learned WOEs similar to those of the animals for cues with finite WOE. For cues with infinite WOE, the weights learned by the model were more damped than those learned by the animals (see Figures 7A and 7B). Nevertheless, note that the animals also damped the weights of these stimuli (i.e., the WOEs estimated by the animals were not infinite). The difference in the extent of this damping could arise from the fact that our model captured only model-free RL, while the animals could have employed both model-free and model-based RL systems during their choices (Daw, Niv, & Dayan, 2005), and the model-based system could have learned simple deterministic rules for these stimuli (e.g., choose a particular action whenever stimulus 10 is presented). During decision making with such stimuli, the final choice could have been based on information brought by both the model-based and the model-free systems, resulting in a high but not fully deterministic influence of these stimuli on choice.
The simulations also showed that the model could replicate key features of neural responses to successive stimuli, the neural activity being proportional to the accumulated WOE of stimuli seen so far. Nevertheless, this relationship was more linear in the last “Epoch 4” in our model than in the experimental data, where it appeared more sigmoid (see Figures 7C and 7D). This difference may arise from the fact that after seeing the last stimulus, the animals knew that they had all available information, and the neural activity could have started to reflect the choice rather than the decision variable.
The neural activity in the experiment of Yang and Shadlen (2007) was recorded from the lateral intraparietal cortex, while our model described activity in the striatum. Nevertheless, it has been observed that neural activity in the striatum also encodes information accumulated during decision making (Ding & Gold, 2010). This similarity in activity between decision-related cortical regions and the striatum may arise from the prominent feedback connections from the basal ganglia back to the cortex via the thalamus (Alexander, DeLong, & Strick, 1986) and from the fact that in highly practiced tasks, the stimulus-response mapping learned in the striatum becomes consolidated in the cortex (Ashby, Ennis, & Spiering, 2007).
The model presented here required fewer trials to converge when multiple stimuli were present per trial (see Figure 4). It seems unlikely that humans and animals would show such behavior, as humans often learn faster in tasks that start with training with a single stimulus per trial (Gould, Nobre, Wyart, & Rushworth, 2012). Our simulations (not shown here) indicate that slower learning with multiple stimuli occurs in a modified version of the model in which on each trial, the weights are updated for only a single stimulus (randomly chosen in the simulations). It is also possible that such slower learning arises because subjects have difficulty focusing on several stimuli presented to them and evolution has optimized perception in a way that only stimuli crucial to decision making are attended (Summerfield & Tsetsos, 2015).
4.4 Other Experimental Predictions
In addition to the predictions described in section 4.1, the model described in this letter makes a few further predictions. In the proposed model, the estimated WOEs often depend on the number of stimuli presented within a trial (see Figure 2A; a similar prediction is also made by the Soltani and Wang, 2010, model). In particular, the model predicts that the estimated WOEs are closer to zero if participants learn them in a task with multiple stimuli presented per trial. This prediction could be tested in an experiment in which participants learn the WOEs for one set of stimuli with a single stimulus presented per trial and the WOEs for another set with multiple stimuli per trial, and then make decisions on the basis of multiple stimuli from both sets. The model predicts that participants would give more weight to the stimuli learned with a single stimulus per trial.
Experiments with human subjects by de Gardelle and Summerfield (2011), featuring shapes colored by stochastically drawn values along a two-color gradient, showed that humans averaged the presented color values when asked to decide which of the two colors was predominant. It was also observed that outliers (extreme color values that appeared less frequently) were downweighted by the subjects, even though they should have had a strong influence on the decision outcome (de Gardelle & Summerfield, 2011). We performed simulations featuring stimuli with the same weights but different frequencies of occurrence. When the frequencies of commonly and uncommonly occurring stimuli were sufficiently different, we observed downweighting by a constant factor. It would be interesting to perform experiments with human subjects in a similar scenario to see whether downweighting occurs only when the stimuli are sufficiently infrequent.
In summary, in this letter, we have shown that the same learning rule that allows estimating expected rewards associated with stimuli and actions can approximate WOEs in a class of tasks with binary rewards. Such WOEs can be efficiently integrated across different stimuli during decision making.
R. B. was supported by Medical Research Council grant MC UU 12024/5. N. K. was supported by the EU Erasmus higher education programme grant KA103-2015.