Attention causes diverse changes to visual neuron responses, including alterations in receptive field structure and firing rates. A common theoretical approach to investigate why sensory neurons behave as they do is based on the efficient coding hypothesis: that sensory processing is optimized toward the statistics of the received input. We extend this approach to account for the influence of task demands, hypothesizing that the brain learns a probabilistic model of both the sensory input and reward received for performing different actions. Attention-dependent changes to neural responses reflect optimization of this internal model to deal with changes in the sensory environment (stimulus statistics) and behavioral demands (reward statistics). We use this framework to construct a simple model of visual processing that is able to replicate a number of attention-dependent changes to the responses of neurons in the midlevel visual cortices. The model is consistent with and provides a normative explanation for recent divisive normalization models of attention (Reynolds & Heeger, 2009).
Attention plays an important role in sensory perception, improving one's perceptual performance at detecting attended stimuli, at the expense of a reduction in performance for other stimuli (Pestilli & Carrasco, 2005). A large body of work has been devoted to identifying the neurophysiological changes underlying attention-dependent changes in perception (Reynolds & Chelazzi, 2004). A central finding has been that in the striate and extrastriate visual cortex, the firing rate of neurons tuned toward attended spatial locations or features is increased (Reynolds, Pasternak, & Desimone, 2000). Taken alone, this result appears to paint a simple picture: that attention acts to optimize sensory processing toward attended stimuli by increasing the sensitivity of sensory neurons that are tuned toward these stimuli. However, on closer inspection of the experimental data, it becomes clear that this picture is overly simple. In addition to increasing neural firing rates, visual attention can also suppress responses (Reynolds, Chelazzi, & Desimone, 1999), alter receptive field properties (Womelsdorf, Anton-Erxleben, Pieper, & Treue, 2006), and influence center-surround suppression from a stimulus placed outside the classical receptive field (Sundberg, Mitchell, & Reynolds, 2009). Furthermore, the effects of attention are highly sensitive to the experimental setup, with changes in the sensory stimulus and behavioral task giving rise to qualitatively different attention-dependent changes in neural responses.
Recently several divisive normalization models have been proposed that are able to account for many of the experimentally observed effects of attention in the low to midlevel visual cortices (Reynolds & Heeger, 2009; Lee & Maunsell, 2009; Ghose, 2009). While the details of these models vary, the firing rate of a neuron is generally computed by dividing its feedforward excitatory input by the summed activity of a pool of neurons with similar, but differing, stimulus selectivities. These models explain why attention can facilitate or suppress the response of a given neuron, depending on how it alters the neuron's excitatory input, versus suppression from other neurons. They also provide a potential explanation as to why small changes in the behavioral task can produce qualitatively different types of attentional modulation. For example, if the task requires the animal to attend to a small region of visual space, then the principal effect of attention will be to alter the feedforward input to a neuron that is tuned to this location, giving rise to simple multiplicative changes in its firing rate. Alternatively, if the task requires the animal to direct its attention toward a broader region of space, then attention will alter the activity of neurons tuned to nearby spatial locations also, increasing both the inhibitory and excitatory input to a neuron that is tuned to the center of the attended spatial region. As a result, the neuron will undergo a more complex form of attentional modulation that cannot be explained by a simple multiplicative change in its firing rate.
A limitation of divisive normalization models of attention is that the modulatory effect of attention on the feedforward input to each of the neurons in the network has to be specified explicitly by the modeler rather than being predicted directly from the behavioral task and presented visual stimuli. To avoid this limitation, we need a theory that can explain why, rather than just how, attention alters sensory neural responses as it does.
Several researchers have proposed that attention-dependent changes to sensory neural responses can be understood within a normative Bayesian framework as a consequence of performing optimal inferences about the state of the world (Dayan & Zemel, 1999; Rao, 2005; Chikkerur, Serre, Tan, & Poggio, 2010; Yu & Dayan, 2005; Yu, Dayan, & Cohen, 2009; Dayan & Solomon, 2010; Whiteley & Sahani, 2012). These models hypothesize that changes to the attentional state of the animal correspond to changes in their prior beliefs about the world, which in turn alter how incoming sensory signals are used to infer which stimuli are present. Recently Chikkerur et al. (2010) showed that given certain assumptions about how probabilistic inference is performed in the brain, increasing one's prior belief that certain (attended) stimuli will be presented produces qualitatively similar changes to neural firing rates as divisive normalization models of attention. However, Chikkerur et al. specified explicitly the attention-dependent changes to the prior without providing a normative explanation for why these changes come about. Indeed a general problem of Bayesian models of attention is that it is often not clear why the animal should alter its prior beliefs, depending on the behavioral task that it is performing. Specifically, in the case where attention is manipulated by changes to the behavioral task (i.e., by manipulating which stimuli are important in determining the action that the animal should perform; Pestilli & Carrasco, 2005; Luck, Chelazzi, Hillyard, & Desimone, 1997), rather than by the presented stimulus statistics (i.e., by manipulating which stimuli are most likely to be presented; Posner, Snyder, & Davidson, 1980; Downing, 1988), there is no clear normative reason that the animal should alter its prior beliefs about which stimuli are most likely to be presented.
Here, we extend previous Bayesian models of attention to account for task-dependent modulation of sensory neural responses. We hypothesize that the nervous system learns an internal model describing how both the sensory input and the reward received for performing different actions are generated by a common set of explanatory causes (Sahani, 2004). Within this framework, the behavioral task will alter visual neuron responses only when there is some mismatch between the organism's internal model of the sensory input and the external environment. We argue that due to the complexity of real-world environments, this is often the case. Faced with such a model mismatch, we propose that attention modulates visual processing in order to improve the organism's predictions of the received reward, at the possible expense of their learning a worse model of the stimulus statistics.
We implement a simple model of visual processing to illustrate how our framework can be used to predict attention-dependent changes to visual neuron responses. For our simulations, we assume a particular type of model mismatch in which the image features that are relevant to the task are smaller than the image features used by the agent to perform the task. In common with previous Bayesian models of attention, we assume that attention alters the internal model in a computationally simple way: varying the prior probability that image features are present while leaving other aspects of the model unchanged. Given certain assumptions about the form of the internal model and how probability distributions are encoded by the sensory neural population, our model predicts attention-dependent changes to visual neuron responses that are consistent with a number of experimental observations in midlevel regions of the visual cortex, including modulation of contrast response functions, sensory tuning curves, and center-surround interactions. Our model is consistent with and provides a normative explanation for previous divisive normalization models of attention (Reynolds & Heeger, 2009; Lee & Maunsell, 2009; Ghose, 2009).
2. Overview of Modeling Approach
2.1. General Framework.
A large body of research is based on the idea that the visual system learns a probabilistic model of natural image statistics, in which a set of hidden causes is assumed to generate received sensory signals (Hyvärinen, 2010). We extend this framework to consider visual processing within the context of a simple task, where a biological agent has to perform actions (motor commands or perceptual judgments) in order to receive a reward. To perform the task, the agent must be able to predict the reward associated with each possible action. We hypothesize that it does this by learning a probabilistic model that describes how both the sensory input and reward received for performing an action are generated by a common set of hidden causes (Sahani, 2004). This internal model is used to infer the hidden causes that generated its received sensory input and, consequently, to predict the reward associated with each action.
In most statistical models of visual processing, the agent's internal model is learned in an unsupervised manner based on the statistics of received sensory signals (Hyvärinen, 2010). We propose that in addition, the internal model is continuously adapted based on received sensory signals and reward in order to optimize performance for the task at hand. As well as influencing behavioral performance, changes in the internal model will also influence perceptual inference, altering the agent's internal representation of sensory stimuli. As a result, the activity of visual neurons will vary dynamically in response to changes in both reward contingencies and presented stimulus statistics. Here, we propose that this task-dependent optimization of the agent's internal model can account for experimentally observed changes in visual neuron responses normally attributed to selective attention.
2.2. When Do Task Demands Alter Visual Processing?
The responses of visual neurons to a given stimulus can be manipulated by changes in the stimulus statistics (determining which stimuli are expected; often communicated by visual cues) (Posner et al., 1980) or the reward delivered for performing each action (determining which stimuli are deemed relevant to the task) (Pestilli & Carrasco, 2005). In our theoretical framework, the agent's internal model of the stimulus statistics is coupled to its internal model of reward. Thus, perceptual inference can be altered by changes to both the stimulus and reward statistics. In contrast, previous Bayesian models of visual processing, in which the agent's internal model is learned and adapted based on the stimulus statistics alone, can account only for changes in perception due to changes in the stimulus statistics.
For the agent's internal model of the sensory input statistics to be altered by the reward structure of a task, there must be some mismatch between its internal model and the external environment (i.e., if the internal model is already a perfect description of the world, it cannot be further optimized). We postulate that due to the complexity of real-world environments, this will often be the case. For our simulations, we assume that the image features relevant to the task are more spatially localized than the image features used by the agent to choose which action to perform. Such a model mismatch might occur because the agent tries to learn a simple model of the behavioral task, in which the actions that it should perform depend on a small number of spatially distributed image features. While useful in allowing the agent to quickly learn new tasks, this model structure could result in suboptimal performance in experiments that use very simple or spatially localized stimuli (e.g., orientated gratings or coherent motion).
We constructed a simple model to illustrate how changes in the reward structure of a task alter visual processing. We use this model to show in principle how experimentally observed attention-dependent changes to visual neuron responses can be interpreted functionally, as a consequence of optimal adaptation toward a given task. In the following sections, we describe the presented stimuli and task, the agent's internal model, and the neural code. Supplementary section 1 (available online) describes the model assumptions in detail and how they influence our results.
3.1. Visual Detection Task.
In many experimental investigations of goal-directed visual attention, a monkey is instructed (often using a visual cue) that a particular spatial location is task relevant and thus should be attended. In order to receive a reward in the task, the animal is required to make responses that are contingent on stimuli presented at this location, while ignoring distractor stimuli presented at other locations (Luck et al., 1997; Reynolds et al., 2000; Williford & Maunsell, 2006). To capture the main aspects of these experiments, we simulated a visual detection task, in which an agent is presented with one or more stimuli at various locations and has to report whether a stimulus is present at a single target location (see Figure 1). The agent receives a unitary reward for a correct response in the task and no reward otherwise.
In each attentional condition, stimuli are equally likely to be presented at all locations. The only thing that distinguishes stimuli presented at different locations is whether a reward is delivered for making a detection response. The agent must use this feedback on its performed actions to learn the target location (by adapting the reward model) and to direct attention toward the target (by adapting its sensory model).
3.2. Agent's Internal Model of Sensory Input.
We assume that the agent uses a hierarchical internal model to infer the hidden causes of the received sensory input (see Figure 3a). Thus, in contrast to the simulated experiment, where spatially localized stimulus features are presented independent of each other, the agent assumes a higher level of statistical structure, such that certain image features are more likely to be presented together than others.
3.3. Agent's Internal Model of Reward.
After receiving a sensory input, x, the expected reward for reporting that the target is present (a=1) is equal to the posterior probability that the detection target is present, p(t=1|x). Conversely, the expected reward for reporting that the target is not present (a=0) is equal to the posterior probability that the target is not present, p(t=0|x).
We assume that the agent makes the response associated with the highest predicted reward. Thus, if the posterior probability that the target is present is greater than 0.5, the agent should make a detection response (a=1); otherwise, the agent should make a rejection response (a=0).
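The decision rule can be written out as a minimal sketch. The function names are hypothetical, and the only assumption is the one stated in the task description: a correct response earns a reward of 1 and an incorrect response earns nothing, so the expected reward of each action equals the posterior probability that it is correct.

```python
def expected_rewards(p_target_present):
    """Expected reward of each action, given the posterior p(t=1|x).

    With a reward of 1 for a correct response and 0 otherwise, the expected
    reward of an action is the probability that the action is correct.
    """
    if not 0.0 <= p_target_present <= 1.0:
        raise ValueError("posterior must lie in [0, 1]")
    # Key 1 is the detection response (a=1); key 0 is the rejection response (a=0).
    return {1: p_target_present, 0: 1.0 - p_target_present}


def choose_action(p_target_present):
    """Return a=1 if p(t=1|x) > 0.5, otherwise a=0."""
    rewards = expected_rewards(p_target_present)
    return 1 if rewards[1] > rewards[0] else 0
```

In this sketch a posterior of exactly 0.5 yields a rejection response, matching the "otherwise" branch of the rule above.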
3.4. Visual Neuron Firing Rates.
Figure 3b illustrates a putative mapping of the probabilistic model used in our simulations onto the neural architecture. The assumed role of the visual system is to infer the posterior probability distribution over the hidden causes. The posterior distribution, encoded in the population activity of visual neurons, is then transmitted to areas of the brain that are responsible for predicting the received reward for performing different actions, allowing the agent to make an appropriate response in the task.
For our simulations, there were sufficiently few latent variables that we were able to perform the summation over the latent states directly. However, if there is a large number of hidden variables, this summation will become intractable, and an approximate algorithm must be used. Shelton, Bornschein, Sheikh, Berkes, and Lücke (2011) describe a biologically plausible algorithm that could be used to perform approximate inference on a binary latent variable model similar to the one used in our simulations (Puertas et al., 2010).
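To make the direct summation concrete, the sketch below computes exact posterior marginals by enumerating every binary latent state. The generative model here is an assumption chosen for illustration (independent Bernoulli hidden units and a linear-Gaussian likelihood), not necessarily the form used in the simulations, and the function and argument names are hypothetical.

```python
import itertools
import math


def posterior_marginals(x, basis, priors, noise_sd):
    """Exact p(y_i = 1 | x) for binary hidden units, by enumerating all states.

    Assumed toy model (illustrative only):
      y_i ~ Bernoulli(priors[i]), independently, with 0 < priors[i] < 1
      x_k ~ Normal(sum_i basis[i][k] * y_i, noise_sd ** 2)
    """
    n_hidden = len(priors)
    states = list(itertools.product([0, 1], repeat=n_hidden))
    log_joint = []
    for y in states:
        # Log prior of this configuration of hidden causes.
        log_p = sum(
            math.log(prior if y_i else 1.0 - prior)
            for y_i, prior in zip(y, priors)
        )
        # Gaussian log likelihood (constants cancel after normalization).
        for k, x_k in enumerate(x):
            mean = sum(basis[i][k] * y[i] for i in range(n_hidden))
            log_p += -0.5 * ((x_k - mean) / noise_sd) ** 2
        log_joint.append(log_p)
    # Subtract the maximum before exponentiating, for numerical stability.
    max_log = max(log_joint)
    weights = [math.exp(lp - max_log) for lp in log_joint]
    total = sum(weights)
    return [
        sum(w for w, y in zip(weights, states) if y[i]) / total
        for i in range(n_hidden)
    ]
```

The loop visits 2^n states for n hidden units, which is why this approach is feasible only for small models and an approximate scheme is needed otherwise. Raising `priors[i]` in this sketch raises the posterior for unit i given the same input, which is the kind of prior-driven modulation discussed in the rest of the paper.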
The stimulus selectivity of a given neuron is largely determined by the basis function of the hidden variable it encodes. In other words, if the hidden variable encoded by a given neuron typically generates a specific profile of sensory activity, then receiving this same sensory activation profile will imply that the hidden cause is active and the neuron will respond with a high firing rate. The basis functions used in our simulations were spatially localized, so that model neurons responded most strongly to stimuli presented at a small number of neighboring locations (their receptive field, RF). The basis functions of the low-level y-units were narrower than the basis function of the high-level z-units (compare Figures 2 and 4), so that neurons encoding y-units had smaller RFs than neurons encoding z-units. Note, however, that in general, a neuron's RF is not identical to the basis function of the encoded variable. Although basis functions are an invariant property of the generative model, the measured RF will depend on the types of stimuli presented.
3.5. Task Optimization.
For parameters to converge on stable values, we used a learning rate that decreased as a function of the trial number, i. The two parameters of this schedule, which determine the initial learning rate and how fast it decays (n0), were set to 0.05 and 10^4, respectively. Learning was terminated after 10^5 trials, when the model parameters were observed to converge on stable values.
We initialized the bias terms, b0i, to take equal values, such that the prior probability that each y-unit was active was exactly equal to the true probability that a stimulus was presented at each location. Consequently, before optimization, the only difference between the agent's internal model and the true model describing how the sensory inputs were generated was related to the second-order statistics describing the probability that stimuli were presented at different locations at the same time. For the true model, all y-units were independent, while for the agent's internal model, there was a higher probability that adjacent y-units were simultaneously active.
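As an illustration of a decaying learning rate of this kind, the sketch below uses a hyperbolic schedule. The functional form is an assumption, since the text here does not reproduce the schedule's equation; the defaults take the initial rate as 0.05 and the decay constant n0 as 10^4, and the function name is hypothetical.

```python
def learning_rate(trial, eta0=0.05, n0=1e4):
    """Assumed hyperbolic decay: eta0 / (1 + trial / n0).

    eta0 sets the initial learning rate and n0 sets how quickly it decays;
    the rate has halved by trial n0. This form is illustrative only.
    """
    if trial < 0:
        raise ValueError("trial number must be non-negative")
    return eta0 / (1.0 + trial / n0)
```

Under this assumed form the rate falls from 0.05 on the first trial to 0.025 at trial 10^4, and continues to shrink slowly thereafter, which lets the parameters settle.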
Note that our aim was to investigate the effects of attentional optimization rather than the temporal dynamics of the optimization process itself. Thus, while we assume that attentional modulation of visual neuron responses is learned online from task feedback, in reality, the attentional state could also be altered more quickly, based on information received from visual cues or previous experience in the task (see section 5).
4.1. Attentional Modulation of Detection Performance.
We first asked how attention altered performance in the detection task. We considered two conditions: a no-attention condition, in which the agent optimized its reward model but not its sensory model, and an attend-target condition, where the agent optimized both its reward model and its sensory model.
On each trial, the agent estimated the probability that a stimulus was present at a target location, p(t=1|x), to decide whether to make a detection response. We used the agent's estimates of p(t=1|x) to plot receiver operating characteristic curves (ROC) for each attentional condition (see Figure 5a). The area under the ROC curve (the AUC) provides a measure of detection performance that is independent of the threshold used for classification: an AUC value of 1 indicates perfect performance, while an AUC value of 0.5 indicates chance performance (Fawcett, 2006). As expected, detection performance was better in the attend-target condition (AUC = 0.85) than in the no-attention condition (AUC = 0.81). The magnitude of this performance increase was observed to be highly dependent on the precise setup of the task (e.g., increasing the sensory noise leads to larger attention-dependent improvements in performance). However, while the magnitude of attention-dependent changes to performance varied depending on the task, the qualitative effect of attention was always the same: to improve performance in the detection task.
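For reference, the AUC can be computed directly from the per-trial estimates without choosing a threshold. The sketch below uses the standard rank interpretation of the AUC (the probability that a randomly chosen target-present trial receives a higher estimate than a randomly chosen target-absent trial); the function name is hypothetical and this is not necessarily how the curves in Figure 5a were produced.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from per-trial scores and binary labels.

    scores: estimates of p(t=1|x), one per trial
    labels: 1 if the target was present on that trial, 0 otherwise
    Ties between a positive and a negative trial count as half a win.
    """
    positives = [s for s, t in zip(scores, labels) if t == 1]
    negatives = [s for s, t in zip(scores, labels) if t == 0]
    if not positives or not negatives:
        raise ValueError("need at least one trial of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

The pairwise comparison is quadratic in the number of trials, which is fine for a small illustration; a sort-based rank computation would be preferable for long simulations.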
To understand how attention alters detection performance, we plotted the estimated probability that a stimulus was present at a target location (p(t=1|x)) versus the true stimulus location (see Figure 5b). In the attend-target condition, the agent's estimates of p(t=1|x) were increased for stimuli close to the target location and reduced for stimuli far from the target location. Thus, in the attend-target condition, the agent was better able to detect stimuli at the target location, while ignoring stimuli at other locations.
4.2. Attentional Modulation of Neural Population Response.
We next asked how attention alters the internal sensory representation, encoded by the visual neuron responses. The model was set up so that midlevel neurons (encoding y-units in the agent's internal model) were highly sensitive to the presented stimulus location, with each neuron responding only to stimuli presented near the neuron's preferred location (see Figures 6a and 6b, dashed line). In contrast, high-level neurons (encoding z-units in the agent's internal model) were relatively insensitive to the presented stimulus location (see Figures 6c and 6d, dashed line).
In our model, the agent relied on the responses of high-level neurons to choose which action to perform. However, as high-level neurons were insensitive to the stimulus location, the agent was not able to discriminate between stimuli presented at task-relevant and task-irrelevant locations, impairing its performance in the task. Following attentional optimization toward the task (see section 3.5), the agent learned to assign an increased prior probability to stimuli at the target location. This learned prior did not reflect the true stimulus statistics (stimuli were equally likely at each location) but instead compensated for the mismatch between the agent's internal model and the true structure of the task.
The attentional prior increased the gain of midlevel neurons whose preferred location was near the target location (see Figure 6a) while decreasing the gain of neurons whose preferred location was far from the target location (see Figure 6b). This change in the gain of midlevel neurons resulted in changes to the stimulus selectivity of high-level neurons, which became differentially more sensitive to stimuli presented at the target location (see Figures 6c and 6d). The net result was that in the attend-target condition, high-level neural responses were a better predictor of whether a stimulus was present at the target location, allowing the agent to improve its task performance.
4.3. Attentional Modulation of the Contrast Response Function.
There have been a number of controversies about how goal-directed attention alters sensory neural responses. A prominent example is attention-dependent changes to the firing rates of V4 neurons with varying stimulus contrast. Previous experiments have reported very different findings. Williford and Maunsell (2006) observed a “response gain” effect, with increases in neural firing rates for all stimulus contrasts, while Reynolds et al. (2000) observed a “contrast gain” effect, consistent with an increase in the effective stimulus contrast. Reynolds and Heeger (2009) proposed a phenomenological model to account for these differences, proposing that they are due to variations in the relative size of the focus of attention and the stimulus between experiments: a narrow focus of attention would give rise to a response gain effect, while a broad focus of attention would give rise to a contrast gain effect. We use our normative model to ask why attention might alter neural responses in this way.
To manipulate the size of the attentional focus, we varied the number of target locations in the detection task. We simulated two experimental conditions: one with a single target location (narrow attentional focus) and another with multiple neighboring target locations (broad attentional focus, with seven neighboring locations chosen as targets). Note that only the reward contingencies changed for the different attentional conditions; the stimulus statistics were always the same (see section 3.1).
In the narrow attentional focus condition, the agent learned to associate an increased prior probability that hidden causes representing stimuli at this location were active (see Figure 7a). In the broad attentional focus condition, there was a broader change in its learned prior, with increases in the prior probability for hidden causes representing all of the target locations (see Figure 7b).
Neurons in visual area V4 were hypothesized to encode information about hidden variables at an intermediate level of the agent's internal model (i.e., components of y). To obtain neural contrast response functions (CRFs), we plotted the mean firing rate of a model neuron while varying the amplitude of a sensory input centered at its preferred location (x = c a_i, where a_i is the ith column of A, and c represents the stimulus contrast). The resulting CRF was qualitatively similar to experimental observations, increasing monotonically at intermediate sensory input amplitudes, before saturating at high amplitudes. The effect of spatial attention was consistent with Reynolds and Heeger's (2009) divisive normalization model: directing a narrow focus of attention toward the presented stimulus location increased the response of a neuron tuned to this location for all sensory input amplitudes; a broad focus of attention increased the response of this neuron only at intermediate sensory input amplitudes (see Figures 7c and 7d, respectively).
4.4. Comparison with Normalization Model of Attention.
At low contrasts, neural firing rates can be approximated by , so that both a narrow and a broad focus of attention alter neural responses multiplicatively, increasing the firing of neurons that are tuned to attended spatial locations. At high contrasts, neural firing rates can be approximated by . In this case, a broad focus of attention produces similar increases to both the numerator and the denominator, so that the response of a neuron that is tuned to an attended stimulus is unchanged by attention (see Figure 7d). A narrow focus of attention increases the numerator by a larger factor than the denominator, so that the response of a neuron that is tuned to an attended stimulus is increased (as in Figure 7c).
Both the expression for neural firing rates and the modulatory effects of attention in our model are similar to Reynolds and Heeger's (2009) normalization model of attention. However, while divisive normalization was an ad hoc assumption in Reynolds and Heeger's model, in our work it comes about as a direct consequence of performing Bayesian inference on a particular form of internal model. Likewise, while Reynolds and Heeger specified an attention field, which multiplicatively scaled the gain of the feedforward excitatory input to the network, in our work, attentional modulation of neural responses comes about as a result of optimization toward the task and is thus entirely determined by the behavioral task and the agent's internal model.
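The low- and high-contrast regimes described above can be reproduced in a toy divisive normalization calculation. The sketch below is a generic normalization formula in the spirit of Reynolds and Heeger (2009), not the firing-rate expression derived in this paper; the pool drives, gain values, and semi-saturation constant are all hypothetical.

```python
def normalized_response(contrast, gains, drive, unit=0, sigma=0.1):
    """Divisively normalized response of one unit in a small pool.

    gains[j]: attentional gain applied to unit j (1.0 means unattended)
    drive[j]: how strongly the stimulus drives unit j at unit contrast
    sigma:    semi-saturation constant (hypothetical value)
    """
    excitation = gains[unit] * contrast * drive[unit]
    pool = sum(g * contrast * d for g, d in zip(gains, drive))
    return excitation / (sigma + pool)


# Stimulus centered on unit 0; neighboring units are driven more weakly.
DRIVE = [1.0, 0.6, 0.6, 0.2, 0.2]
NEUTRAL = [1.0] * 5
NARROW = [2.0, 1.0, 1.0, 1.0, 1.0]  # attention boosts only the recorded unit
BROAD = [2.0] * 5                   # attention boosts the whole pool


def attention_ratio(contrast, gains):
    """Response with attention divided by response without attention."""
    return (normalized_response(contrast, gains, DRIVE)
            / normalized_response(contrast, NEUTRAL, DRIVE))
```

With these illustrative numbers, both attention fields nearly double the response at very low contrast, where the constant `sigma` dominates the denominator. At saturating contrast the broad field leaves the response almost unchanged because the numerator and the pool scale together, while the narrow field still increases it.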
4.5. Attentional Modulation of Sensory Tuning Curves.
We investigated how goal-directed attention alters neural tuning curves in our model. To do this, we extended our model to include both a featural and a spatial dimension. We altered the basis functions that determined the image features represented by the hidden units, so that each model neuron (corresponding to a component of y) was selective to both a stimulus feature (e.g., orientation, or motion direction) and a spatial location.
We simulated two experimental conditions. In the first condition (spatial attention), one of two spatial locations was selected as a target in the detection task. In the second condition (feature-based attention), only certain features were chosen as targets. Spatial attention caused the agent to associate a high prior probability that hidden variables representing the attended location were active, but a uniform prior probability that hidden variables representing different features were active (see Figure 8a). Conversely, feature-based attention caused the agent to associate a high prior probability that hidden variables representing attended features were active, but a uniform prior probability that hidden variables representing both spatial locations were active (see Figure 8b).
Attending to the presented stimulus location increased the responses of neurons tuned to this location, with no sharpening in the population response (see Figure 8c, dotted line). Similar effects have been observed experimentally in visual area V4 when attention is directed toward a particular spatial location (McAdams & Maunsell, 1999). In contrast, we found that attending to the presented stimulus feature produced a sharpening in the population response; the responses of model neurons that were selective for the attended feature were most strongly increased by attention (see Figure 8c, solid line). Martinez-Trujillo and Treue (2004) reported a similar effect in visual area MT when animals were directed toward a particular motion direction. Our results are also consistent with Reynolds and Heeger's (2009) normalization model of attention.
Also consistent with the experimental findings of Martinez-Trujillo and Treue (2004), our model predicted a small suppression in the responses of model neurons tuned to unattended features. In our model, this suppression came about because the agent accorded greater probability to the possibility that the sensory input was produced by hidden causes representing attended features, at the expense of a reduction in the probability that it was produced by hidden causes representing other, unattended, features.
Experimentally it has been shown that attention-dependent suppression of neural responses is particularly strong when there are multiple stimuli within the cell's RF (Moran & Desimone, 1985; Reynolds et al., 1999). Although we do not explicitly model this effect, it is easy to see how it could come about for our model. When there is one stimulus within a cell's RF, directing attention away from or toward the presented stimulus will induce a multiplicative change to the neuron's response by altering the numerator in equation 4.2. When two stimuli are present within the cell's RF, attending toward one of the stimuli will also alter suppression that comes from the other stimulus via the denominator in equation 4.2, resulting in larger changes in the neuron's response. This effect was demonstrated by Reynolds and Heeger (2009) in their normalization model of attention.
4.6. Attentional Modulation of Center-Surround Interactions.
The responses of neurons in the visual cortex are modulated by stimuli located outside their classical RF that do not evoke a response when presented alone. Typically, presenting a stimulus outside a neuron's RF suppresses its response, compared to when there is only a single stimulus presented within its RF, a phenomenon called surround suppression (Seriès, Lorenceau, & Frégnac, 2003). Sundberg et al. (2009) found that in visual area V4, attending to a stimulus located within the RF reduces the suppressive influence of a stimulus presented at the surround, while attending to the surround increases this suppression.
We used the setup described in the previous section to measure the degree of surround suppression in our model in the absence of attention or with attention directed to either the RF center or the surround (see Figure 9a). By definition, a stimulus in the RF surround should not elicit a response when presented alone, although it may suppress the response of a neuron to a stimulus simultaneously presented in the RF. To reproduce this behavior in our model, we needed to specify the spatial width of the basis functions (see equation 4.3). If they are too broad, surround stimuli elicit a response when presented alone; if they are too narrow, there is no surround suppression (the width we chose produced the required behavior; see Figure 9a).
Directing attention toward the RF increased the model neuron response toward a single stimulus presented within the RF, while decreasing the suppression from a second stimulus presented at the surround (see Figures 9b and 9c). Directing attention to the surround did not significantly alter the model neuron response when a single stimulus was presented within the RF, but did increase the suppression caused by a second stimulus presented at the surround (see Figures 9b and 9c, left panel). In both conditions, the response of the model neuron to a stimulus presented at the surround alone was negligible. Qualitatively similar results were obtained by Sundberg et al. (2009) (see Figure 9c, right panel).
Note that in all attentional conditions, surround suppression was significantly stronger in our model than in the population-averaged data (by a factor of 2). However, the important qualitative aspect of the data that we sought to capture was the effect of attention on surround suppression rather than the absolute magnitude of surround suppression. Indeed, while the qualitative effects of attention were robust to changes in model parameters, the absolute magnitude of surround suppression depended on our choice of model parameters (and experimentally, Sundberg et al., 2009, observed a large variability in the degree of surround suppression across different neurons).
We extended previous Bayesian models of visual processing (Hyvärinen, 2010) to account for the effects of behavioral demands on visual neuron responses, hypothesizing that the brain learns a probabilistic model that predicts how both the sensory input and reward received for performing different actions are determined by a common set of hidden causes (Sahani, 2004). We developed a simple model of visual processing to show in principle how our proposed framework can be used to make concrete predictions about how task-dependent attention modulates visual neuron responses. Our framework has two main advantages. First, it has predictive power: in theory, changes to neural responses can be predicted as a direct consequence of the presented stimuli and behavioral task. Second, predicted changes to neural responses have a direct functional meaning: they correspond to changes in the believed causes of the sensory input.
In order to make concrete predictions about the effects of attention on visual neuron responses, we needed to make certain assumptions about the agent's internal model of its environment. First, we assumed that the agent learns a model in which binary hidden causes are responsible for generating its received input. This internal model was very similar to the one used in previous work by Puertas et al. (2010). Second, we assumed that the agent's internal model was sparse, meaning that there was a small prior probability for any particular hidden cause to be active (Olshausen & Field, 1996, 1997). This sparsity prior leads to competition between different possible causes of the sensory input, and in our neural model results in surround suppression of neural responses. Third, we assumed that the agent performs inference on a hierarchy of image features (Karklin & Lewicki, 2003), and that its behavioral responses depended on only high-level hidden variables in its internal model. In our neural model, this corresponds to relying on the responses of high-level neurons with large receptive fields to choose which action to perform. Finally, we assumed that attention alters the bias terms in the agent's internal model but not the basis functions. This corresponds to altering the gain of individual neurons but not the network connectivity (Dayan & Zemel, 1999; Yu, Dayan, & Cohen, 2009). Most of our assumptions are not new but correspond to assumptions made implicitly in many phenomenological and mechanistic models of attention (Reynolds & Heeger, 2009; Ghose, 2009; Lee & Maunsell, 2009). However, in contrast to these models, we justify our assumptions from functional principles to provide insight into why attention alters visual neuron responses as it does.
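The competition induced by a sparsity prior, and the effect of changing a bias term, can be illustrated with a toy calculation. The sketch below assumes an additive Gaussian generative model with two overlapping basis vectors and computes the posterior by exhaustive enumeration; the model form, basis vectors, priors, and noise level are hypothetical choices for illustration and are not the model used in the simulations.

```python
import itertools
import math


def posterior_active(x, basis, priors, noise_sd=0.5):
    """Return P(z_i = 1 | x) for binary causes z, by exhaustive enumeration.

    Assumed toy model: x = sum_i z_i * basis[i] + Gaussian noise, with
    independent Bernoulli priors P(z_i = 1) = priors[i].
    """
    n = len(basis)
    marginals = [0.0] * n
    total = 0.0
    for z in itertools.product([0, 1], repeat=n):
        # Input predicted by this configuration of active causes.
        pred = [sum(z[i] * basis[i][k] for i in range(n)) for k in range(len(x))]
        sq_err = sum((xk - pk) ** 2 for xk, pk in zip(x, pred))
        log_lik = -sq_err / (2 * noise_sd ** 2)
        log_prior = sum(math.log(p if zi else 1 - p) for zi, p in zip(z, priors))
        weight = math.exp(log_lik + log_prior)
        total += weight
        for i in range(n):
            marginals[i] += weight * z[i]
    return [m / total for m in marginals]


basis_a = [1.0, 0.6]  # hypothetical basis vector of cause A
basis_b = [0.6, 1.0]  # overlapping basis vector of cause B
x = [0.8, 0.8]        # ambiguous input that either cause could explain

# Cause A is the only candidate explanation.
alone = posterior_active(x, [basis_a], [0.1])[0]
# Cause B competes with A under the same sparse prior.
competing = posterior_active(x, [basis_a, basis_b], [0.1, 0.1])[0]
# The prior (bias) on cause B is raised, as if attention favored B.
attend_b = posterior_active(x, [basis_a, basis_b], [0.1, 0.3])[0]

print(f"P(A active): alone={alone:.2f}, competing={competing:.2f}, "
      f"B favored={attend_b:.2f}")
```

In this toy setting, the posterior probability that cause A is active falls when a competing cause is available, and falls further when the prior on the competitor is raised, even though the input and the basis vectors are unchanged.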
The predictions and mathematical formulation of our model bear strong similarities to the normalization model of attention proposed by Reynolds and Heeger (2009). Recently, both Schwartz and Coen-Cagli (2013) and Chikkerur et al. (2010) showed that Reynolds and Heeger's normalization model can be derived using a Bayesian framework. In their models, attention is hypothesized to modulate the agent's perceptual prior (Chikkerur et al., 2010) or the feedforward inputs to neurons at attended locations (Schwartz & Coen-Cagli, 2013). However, in both models, attention-dependent changes are specified explicitly, without stating why these changes might come about. As a result, these models suffer from the same limitation as Reynolds and Heeger's normalization model: they do not explain how attention should be shaped by behavioral demands and sensory experience. In contrast, in our model, task-dependent changes to the agent's internal model are learned automatically by the agent in order to improve its predictions of the received reward.
Several studies have tried to explain visual attention in normative terms, under the hypothesis that it corresponds to changes in the perceptual prior (Dayan & Zemel, 1999; Rao, 2005; Chikkerur et al., 2010; Yu & Dayan, 2005; Yu et al., 2009). However, in the absence of any changes to the presented stimulus statistics, it is not clear why the perceptual prior should be altered by task demands. Indeed, in nearly all Bayesian models of attention, changes to the agent's prior are either specified explicitly (Dayan & Zemel, 1999; Rao, 2005; Chikkerur et al., 2010; Whiteley & Sahani, 2012) or learned directly from the stimulus statistics (Yu & Dayan, 2005; Yu et al., 2009). Here, we show that in certain circumstances, it is desirable to alter the perceptual prior, even in the absence of any changes to the stimulus statistics. In our proposed framework, the agent continuously adapts its internal model to improve predictions of the reward associated with each action. When there is a mismatch between the agent's internal model and the true structure of the task, improvements in its predictions of reward may come at the expense of learning a worse model of the sensory input statistics. Consequently, its learned prior will differ from the true stimulus statistics (as in Figures 7a and 7b).
In our simulations, we implemented a specific type of model mismatch in which the stimulus features relevant to the task (the detection targets) are smaller than the features used to decide which action to perform. This is analogous to previous modeling work, in which the agent uses the response of neurons with large RFs to detect stimuli presented in a small task-relevant region of space (Yu et al., 2009; Dayan & Solomon, 2010; Dayan & Daw, 2008; Liu, Yu, & Holmes, 2009). In this work, perceptual performance is limited because neurons integrate sensory signals from both task-relevant and task-irrelevant spatial locations. Attention improves performance by selectively boosting neural inputs that are selective to stimuli at task-relevant locations. Previous authors suggested that perceptual performance is constrained in this way because of the limited number (and thus, necessarily large size) of neural RFs available to cover the visual scene (Dayan & Daw, 2008; Dayan & Zemel, 1999). However, this cannot explain why the agent does not use information encoded by low-level visual neurons with small RFs to perform the task. Here, we propose an alternative explanation: that perceptual performance is constrained by the need to learn a simple behavioral strategy that can be quickly altered in response to changing behavioral demands. One way to achieve this goal could be to learn a simple mapping between the responses of a small number of high-level neurons (with large RFs) and the reward associated with each action.
Recently, Whiteley and Sahani (2012) proposed that in complex environments, the agent simplifies perceptual inference by using an approximate internal model that neglects statistical dependencies between stimuli. This results in a mismatch between the agent's internal model and its external environment, which reduces its perceptual performance. Whiteley and Sahani hypothesized that attention compensates for this reduction in performance, forming part of an approximate inference algorithm that selectively improves perceptual accuracy for certain attended features or stimuli.
In both our model and the model of Whiteley and Sahani (2012), attention is required because of a mismatch between the agent's internal model and its environment. In Whiteley and Sahani's model, this mismatch occurs because the agent neglects dependencies between hidden variables; in our model, it occurs because the agent learns a simplified model of the task, in which the received reward is assumed to depend on a limited number of high-level variables. Experimentally, one could distinguish between these different scenarios by investigating when attention is most strongly recruited: when there are complex statistical dependencies between stimuli (as predicted by Whiteley and Sahani's model) or when task-relevant stimulus features are localized in a particular spatial or featural dimension (as predicted by our model). However, rather than there only ever being one type of model mismatch, it is more likely that attention is required in a range of different situations to compensate for different mismatches between the agent's internal model and its external environment. Put in this broader context, we believe that Whiteley and Sahani's model is not incompatible with our framework. For example, one could imagine a hybrid of both models in which reward feedback is used to determine which stimulus features are task relevant, controlling an approximate inference algorithm that improves perceptual accuracy toward these features.
In this letter, we focused on attentional modulation of midlevel neural responses. However, our model also predicts how attention should modulate the responses of higher-level visual neurons. In our simulations, attention dynamically alters the RF profiles of high-level neurons, shrinking them around attended stimuli (see Figure 6c) or shifting their centers toward attended locations (see Figure 6d). This prediction is supported by experimental recordings in area MT, which observe dynamic reshaping of neural RFs as a result of visual attention (Womelsdorf et al., 2006). Of course, in the brain, there is no clear demarcation between high-level or midlevel neurons. However, in the context of our model, what matters is the ratio between a neuron's RF size and the size of task-relevant stimulus features. A neuron is considered to be “high level” if its RF is significantly larger than the task-relevant stimulus features. (Note that while we discuss only spatial attention here, an analogous argument could be made in the feature domain to describe feature-based attention.)
At the behavioral level, our model predicts that stimuli should be perceived as being more similar to attended stimuli than they actually are. This is because attention-dependent changes to the perceptual prior will induce an estimation bias toward task-relevant stimulus features. While estimation biases have been observed experimentally in response to changes in the presented stimulus statistics (Chalk, Seitz, & Seriès, 2010), we predict that they should also be induced by changes to the behavioral task alone. Experimentally, different behavioral tasks have been found to give rise to qualitatively different types of perceptual bias. For example, Jazayeri (2007) found that after performing a discrimination task with visual motion stimuli, subjects report stimuli as moving farther away from the discrimination boundary than they actually are. Therefore, while we simulated a simple detection task, it would be interesting to use our modeling framework in the future to investigate how different behavioral tasks and stimuli influence perception.
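The way a change in the prior produces an estimation bias can be seen in a standard Gaussian example. This is an illustrative assumption, not the binary-cause model used in the simulations, and the symbols s, m, \mu_p, \sigma_p, and \sigma_m are introduced here only for this sketch. Suppose a stimulus feature s has prior s \sim \mathcal{N}(\mu_p, \sigma_p^2) and the observer receives a noisy measurement m \sim \mathcal{N}(s, \sigma_m^2). The posterior mean estimate is then:

```latex
\hat{s}
  = \frac{\sigma_p^{-2}\,\mu_p + \sigma_m^{-2}\,m}{\sigma_p^{-2} + \sigma_m^{-2}}
  = m + \frac{\sigma_m^{2}}{\sigma_m^{2} + \sigma_p^{2}}\,(\mu_p - m).
```

The estimate is pulled from the measurement m toward the prior mean \mu_p by a fraction that grows with the measurement noise. Under this sketch, if task demands move \mu_p toward a task-relevant feature value, estimates shift toward that value even when the stimulus statistics are unchanged.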
At present, it is unknown how probability distributions are represented in the brain (Fiser, Berkes, Orbán, & Lengyel, 2010; Shelton et al., 2011; Deneve, 2008; Ma, Beck, Latham, & Pouget, 2006). A current area of debate is whether neural firing rates encode samples from a probability distribution (Fiser et al., 2010; Shelton et al., 2011) or parameters of the distribution, such as its mean and variance (Deneve, 2008; Ma et al., 2006). In our simulations, we assumed that mean firing rates are proportional to the probability that individual hidden causes contributed to generating the received sensory input. While this coding scheme was chosen for simplicity, it produces mean firing rates that are qualitatively consistent with a sampling code (Shelton et al., 2011). Meanwhile, certain parametric codes, such as the coding scheme proposed by Deneve (2008), predict mean firing rates that are qualitatively consistent with our model (i.e., they scale monotonically with the posterior probability that encoded latent variables are active).
We investigated short-term effects of behavioral context, focusing specifically on visual attention. We hypothesized that over these timescales, only the sensitivity of individual neurons (the prior) varies, while the network connectivity (the basis functions) remains constant. This restriction could be removed to investigate changes that take place over longer timescales. Currently, the relationship between different types of sensory learning, for example, attentional (Eckstein, Abbey, Pham, & Shimozaki, 2004; Jiang & Chun, 2001) versus perceptual learning (Fahle, 2005; Seitz, Kim, & Watanabe, 2009), and how they depend on the training paradigm, is an active area of research. Our framework can be used to make explicit predictions about how visual perception is modulated by different stimuli and tasks and thus could contribute to this debate.
In our model, the agent's attentional state is altered slowly, based on feedback on its actions over many trials. However, in reality, attention can be quickly redirected following explicit sensory cues or instructions. Our model could be extended to account for these quick changes in attentional state by including additional variables in the internal model to represent the current behavioral context (e.g., the location of the detection target). Thus, on any given trial, the agent would first have to infer the behavioral context based on all available sources of information (e.g., received rewards, sensory cues, instructions, or prior experience in the task). The inferred context would then determine the attentional state that optimizes the internal sensory representation toward the task. Such a model could be used to predict how people's attentional state is altered in real time as a result of newly received information. Our goal, however, was more modest: we sought to explore the effects (rather than the temporal dynamics; Yu et al., 2009) of optimizing the internal sensory representation toward a given task.
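One way such context inference might look is sketched below, purely as an illustration of the proposed extension and not as part of the model presented here. It assumes a discrete set of candidate target locations and applies Bayes' rule to a cue and then to a reward; the function name, the number of locations, and all probabilities are hypothetical.

```python
def update_context_belief(prior, likelihoods):
    """Bayes' rule over discrete behavioral contexts (e.g., target locations)."""
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical setup: the target is at one of three locations,
# all equally likely before any information arrives.
belief = [1 / 3, 1 / 3, 1 / 3]

# A cue points to location 0. The cue is assumed valid on 80% of trials,
# with invalid cues spread evenly over the other two locations, so this
# list holds P(cue points to 0 | target at location i).
cue_likelihood = [0.8, 0.1, 0.1]
belief = update_context_belief(belief, cue_likelihood)

# A reward follows a response to location 0. It is assumed to be three
# times more likely if the target really was at location 0.
reward_likelihood = [0.6, 0.2, 0.2]
belief = update_context_belief(belief, reward_likelihood)

# The most probable context would then select the attentional state.
attended = max(range(len(belief)), key=belief.__getitem__)
print(f"belief over locations: {[round(b, 3) for b in belief]}")
print(f"attended location: {attended}")
```

In this sketch each new source of information is folded in by the same update, so cues, rewards, and instructions can be combined in any order as long as they are conditionally independent given the context.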
In this letter, we put forward a very general framework for predicting how task demands alter visual processing. We then showed that given certain assumptions about the internal model and behavioral task, this framework predicts attention-dependent changes to neural responses that are consistent with existing phenomenological models of attention. However, although the assumptions of our model are based on functional principles, in order to truly derive the effects of attention, it would be desirable to construct a more sophisticated model of natural images in which model parameters are learned directly from natural image statistics (as opposed to artificial data). In the past, this approach has been highly successful in understanding the passive properties of visual neurons. In the future, it could be used to make quantitative and testable predictions about how different behavioral tasks alter visual processing and perception.
We thank Odelia Schwartz, Ruben Coen-Cagli, and Peter Dayan for their helpful comments and feedback on an earlier version of this letter. This research was supported by funding from the Engineering and Physical Sciences Research Council and the Medical Research Council of Great Britain.
Note that variations in the agent's estimates of p(t=1|x) matter more than its baseline value, as changes in baseline can be easily compensated for by varying the detection threshold.
A supplemental appendix is available online at http://www.mitpressjournals.org/doi/suppl/10.1162/NECO_a_00494.