## Abstract

The primate visual system has an exquisite ability to discriminate partially occluded shapes. Recent electrophysiological recordings suggest that response dynamics in intermediate visual cortical area V4, shaped by feedback from prefrontal cortex (PFC), may play a key role. To probe the algorithms that may underlie these findings, we build and test a model of V4 and PFC interactions based on a hierarchical predictive coding framework. We propose that probabilistic inference occurs in two steps. Initially, V4 responses are driven solely by bottom-up sensory input and are thus strongly influenced by the level of occlusion. After a delay, V4 responses combine both feedforward input and feedback signals from the PFC; the latter reflect predictions made by PFC about the visual stimulus underlying V4 activity. We find that this model captures key features of V4 and PFC dynamics observed in experiments. Specifically, PFC responses are strongest for occluded stimuli and delayed responses in V4 are less sensitive to occlusion, supporting our hypothesis that the feedback signals from PFC underlie robust discrimination of occluded shapes. Thus, our study proposes that area V4 and PFC participate in hierarchical inference, with feedback signals encoding top-down predictions about occluded shapes.

## 1  Introduction

In natural scenes, objects rarely appear in isolation; rather, animals often have to discriminate and recognize partially occluded objects. While recognition under occlusion is difficult for even the best computer vision system, animals seldom have trouble. But the neural basis of this capacity is poorly understood. Here, we study the physiological underpinnings of a special case of the general problem, where occluders can be detected as distinct stimulus features.

Feedback projections from higher cortices are hypothesized to be important for successful recognition of occluded objects (Rust & Stocker, 2010; Gregoriou, Rossi, Ungerleider, & Desimone, 2014), and there are abundant feedback connections in the visual stream. Despite this, models of object recognition are typically hierarchical feedforward circuits (Fukushima, 1980; Riesenhuber & Poggio, 1999; Serre, Oliva, & Poggio, 2007; Cadieu et al., 2007; Yamins et al., 2014). This is partly because of the complexity of including feedback signals, but also because little is known about where the relevant feedback signals originate, where they terminate in visual cortex, and how they contribute to recognition. Developing a computational framework explaining how feedback facilitates shape recognition under occlusion therefore is a prominent challenge for visual neuroscience.

Recent experimental results provide key insights into how interactions between area V4, a fundamental stage in the primate shape processing pathway (Roe et al., 2012; Pasupathy & Connor, 1999, 2001), and the prefrontal cortex, important for the control of complex behavior (Miller & Cohen, 2001), may underlie the ability to recognize partially occluded objects (Kosai, El-Shamayleh, Fyall, & Pasupathy, 2014; Pasupathy, Fyall, & Choi, 2015; Fyall, El-Shamayleh, Choi, Shea-Brown, & Pasupathy, 2017). Specifically, in monkeys trained to discriminate pairs of shapes under varying degrees of occlusion, dynamics of V4 and PFC activity suggest that feedback signals from PFC to area V4 may serve to discount the effect of occlusion on the responses of V4 neurons, thereby increasing shape selectivity. This raises the question of how the feedback signals in V4-PFC circuitry perform the computation necessary for shape recognition. In this article, we propose and test the hypothesis that this occurs via hierarchical predictive coding. With the proposed model based on predictive coding, we successfully explain the dynamics of a subpopulation of neurons in V4 that exhibit delayed peak of responses (Pasupathy et al., 2015; Fyall et al., 2017), presumably induced by feedback signals from PFC.

Predictive coding has been proposed as a method to create efficient neural codes and has successfully described neural responses in a variety of different sensory systems (Bogacz, 2017; Bastos et al., 2012; Friston & Kiebel, 2009a, 2009b; Srinivasan, Laughlin, & Dubs, 1982; Rao & Ballard, 1999; Spratling, 2017; Rao, 1997, 1999, 2004, 2005; Lee & Mumford, 2003; Yuille & Kersten, 2006). Notably, the predictive coding framework reproduces center-surround antagonism in the retina (Srinivasan et al., 1982) and end-stopping effects in V1 (Rao & Ballard, 1999). In these studies, feedforward signals from each cortical area represent the residual errors between the feedback predictions and the encoding expectation. This interpretation of feedforward signals, however, has met the criticism (Koch & Poggio, 1999) that it implies reduced firing when familiar sensory inputs are encountered, differing from the common view in which sensory neurons respond strongly to preferred features. Here, we introduce a novel implementation of predictive coding, where the responses in V4 and PFC correspond to their most likely (or optimal) values given the stimulus and a hierarchical representation of its likelihood. Furthermore, the hierarchical inference is implemented in two steps, initially reflecting only the feedforward sensory signals and later integrating the feedback predictions, to explain the dynamic shape-selective responses in V4.

In addition to assigning an algorithmic role to the feedback signals, our model makes further predictions on the structure of the network, representation of the stimuli, and prior expectations encoded in V4 and PFC. Previous studies have shown that shapes can be discriminated based on V4 activity at the population level (Meyers, Freedman, Kreiman, Miller, & Poggio, 2008; Pasupathy & Connor, 2002), and shape identity information is already available at the level of V4. However, in our model, feedback predictions effectively remap the population responses and amplify the shape identity information reduced by partial occlusion. Furthermore, our model predicts that such amplification of the shape identity information following feedback predictions occurs only when the occlusion is salient and distinct from the shape.

In sum, our model suggests that feedback signals to V4 during the representation of occluded shapes can be interpreted in the context of predictive coding. These results shed light on how prior expectations contribute to the recognition of complex images in V4 and higher cortical areas.

## 2  Methods

### 2.1  Experiments

Experimental procedures are described in detail by Kosai et al. (2014) and Fyall et al. (2017) and are only briefly outlined in this section to provide the background.

Animals were trained on a sequential shape discrimination task in which two stimuli were presented in sequence, and the animal reported whether they were the same or different with a rightward or a leftward saccade, respectively. The second stimulus in the sequence was presented in the receptive field of the V4 neuron under study and was partially occluded. During recordings in area V4, all task details were customized to the preferences of the single neuron under study. Specifically, one of the two discriminanda was a preferred shape that elicited strong responses from the neuron, while the other was a nonpreferred shape. Both shapes were presented in a preferred color for the cell, and the occluding dots were in a nonpreferred color so that they provided only a modulatory influence. For recordings in the PFC, we studied many neurons simultaneously and, as is customary in the field, did not customize stimulus shape or color to individual neuronal preferences.

Each daily session began with the choice of two stimuli to serve as the discriminanda and then proceeded in two phases. First, during the training phase, animals performed the sequential discrimination task with the unoccluded versions of the discriminanda. This phase typically included 20 attempts and ensured that the unoccluded discriminanda were discriminable in the periphery. Second, during the test phase, the discriminanda were occluded to different levels with a field of randomly positioned dots. The level of occlusion was titrated by varying dot diameter while the number of dots was held constant, and it was quantified as the percentage of the shape area that remained visible (% visible area).

All animal procedures conformed to NIH guidelines and were approved by the Institutional Animal Care and Use Committee at the University of Washington.

### 2.2  Coding Assumptions

We explain the response dynamics of V4 and PFC neurons during the shape discrimination task by building a computational model based on a few coding principles.

First, we assume that average firing rates of the neuronal populations recorded in experiments reflect the most likely representation of the neuronal responses given the input visual stimulus and a specific hierarchical model of the responses that we define below. Thus, assuming the sensory system seeks to infer the most likely representation of neuronal responses $\{r_1, \ldots, r_n\}$ of hierarchical areas ranging from the lowest area 1 to the highest area $n$, we simply find the set of responses that maximizes the posterior probability $p(r_1, \ldots, r_n \mid κ)$, where $κ$ represents the sensory input. We refer to these as the optimal firing rates.

Second, the model is constructed based on the hierarchical predictive coding principle. In predictive coding (Rao & Ballard, 1999; Friston & Kiebel, 2009a; Bogacz, 2017), feedback from higher cortical areas is interpreted as a prediction about activities in lower cortical areas. In the lower cortical areas, bottom-up sensory signals are combined with these top-down predictions. With the predictions and the sensory inputs thus combined, probability distributions of the neural responses are constructed based on hierarchical Bayesian inference (Rao & Ballard, 1999; Bogacz, 2017; Lee & Mumford, 2003; Yuille & Kersten, 2006). Under this assumption, combined with predictive coding, neuronal activities depend on the activities of the next higher area but are conditionally independent of activities in other cortical areas. In other words, the neurons in area $i+1$, whose activity is denoted as $r_{i+1}$, make the top-down prediction $\mathrm{Pred}(r_{i+1})$ of the neuronal activity $r_i$ in area $i$. The noise $η_i$, which characterizes the difference between the actual neuronal response $r_i$ and the prediction made by the next higher layer, $\mathrm{Pred}(r_{i+1})$, is given as

$$η_i = r_i - \mathrm{Pred}(r_{i+1}). \tag{2.1}$$

We assumed the noise terms to have a distribution $g_i(η_i)$ with zero mean. This leads to $p(r_i \mid r_{i+1})$, the distribution of the neuronal activity $r_i$ in area $i$ given the next-level activity $r_{i+1}$, having its mean at the top-down prediction $\mathrm{Pred}(r_{i+1})$. The posterior probability of the response representation across all levels given the sensory stimulus $κ$ therefore factors as

$$p(r_1, \ldots, r_n \mid κ) = ν \cdot p(κ \mid r_1, \ldots, r_n)\, p(r_1, \ldots, r_n) = ν \cdot p(κ \mid r_1)\, p(r_1 \mid r_2) \cdots p(r_{n-1} \mid r_n)\, p(r_n), \tag{2.2}$$

where $ν$ is a normalization constant.

We have described the general and classical framework for hierarchical representation of a stimulus $κ$ via a sequence of firing rates. In summary, we assume that the brain aims to have neuronal activity in every layer get as close as possible to the prediction made by the responses of the next higher layer, where the discrepancy is given by a noise term $ηi$. Then the neural firing rates adjust to those that are most consistent (i.e., most likely) given the stimulus $κ$. We next describe the specific form of the representation that we use here.

### 2.3  Model Architecture

Our model is composed of two layers, a V4 layer and a PFC layer (see Figure 1A). We designate the higher cortical area as PFC based on the experimental evidence indicating feedback from PFC as a likely precursor of the delayed responses in V4 (Pasupathy et al., 2015; Fyall et al., 2017; see section 3). Furthermore, previous experimental studies have found anatomical and physiological evidence for direct feedforward (Ninomiya, Sawamura, Inoue, & Takeda, 2012) and feedback (Barbas & Mesulam, 1985; Ungerleider, Galkin, Desimone, & Gattass, 2008) connections between V4 and PFC in the primate brain.

Figure 1:

Schematic diagram of network model. (A) Model network of V4 and PFC populations and the schematic of the input shape stimulus. By optimizing the cost function with respect to both V4 and PFC responses, the network implements both feedforward connections from V4 to PFC and feedback connections from PFC to V4. Note that the model is not image computable, and the input stimulus in the figure is given to illustrate the model setup. (B) Top-down predictions made by PFC on each of the three V4 units are represented by gaussian distributions with means at $f(u·rpfc)=u·rpfc$. (C) Bottom-up component, which is represented by the conditional probability distributions of the V4 responses given the shape stimulus. When the input stimulus is unoccluded shape A, the response distribution of the shape A-selective V4 population has a higher mean than those of the shape B- and occluder-selective populations. As the occlusion level increases, the mean of the shape A-selective response distribution decreases and the standard deviation increases. Shape B-selective distribution stays at the constant baseline, and the occluder-selective response distribution moves toward higher rates. The response distribution of each V4 population is shown in the same color as in panel A.

The V4 layer is composed of three units: two that are selective for each of the two visual shapes that are being discriminated, namely, shape A and shape B (see Figure 1A, V4 unit 1 (green) and V4 unit 2 (blue), respectively), and a third V4 unit that responds selectively to the occluder-specific features, such as color (see Figure 1A, V4 unit 3 (red)). Shape selectivity has been previously demonstrated in area V4 (Pasupathy & Connor, 1999). While the existence of V4 cells that are selective exclusively for occluders has not been confirmed experimentally, a recent experimental study has found such strictly occluder-selective cells in the IT cortex (Namima & Pasupathy, 2016). Furthermore, we do not require that neurons corresponding to V4 unit 3 be exclusively selective for occluders independent of other stimulus features; rather, they could respond preferentially to any occluder-specific features. V4 cells that respond preferentially to the color of the occluders are good candidates because, in the experiment, occluders were presented in a different color from the shape or the background. Supporting this, many V4 neurons are known to have color selectivity (Zeki, 1973; Schein & Desimone, 1990; Bushnell, Harding, Kosai, Bair, & Pasupathy, 2011; Bushnell & Pasupathy, 2012), and many are sensitive simply to stimulus area rather than shape (Eghbali, Pasupathy, & Bair, 2016). Indeed, in Figure A.1B, we present example V4 cells that respond strongly to the presence of occluders regardless of whether these occluders are presented with the preferred or the nonpreferred shape. Each V4 unit can be interpreted as a subpopulation of V4 neurons with similar tuning properties.

The two PFC units in the model represent two distinct neuronal populations in PFC. While the roles of PFC neurons are not well understood, PFC is believed to be involved in planning complex behavior and tasks involving short-term memory (Miller & Cohen, 2001). Experimental recordings (Pasupathy et al., 2015; Fyall et al., 2017) from PFC also show that a subset of PFC neurons has mild shape selectivity while also responding strongly to occluders.

The sum of PFC activities weighted by the connection weights between V4 and PFC units (see Figure 1A) is represented as the feedback signal to V4 units. The initial feedback connection weights between V4 units and PFC units are chosen so that the PFC units show appropriate selectivity after training. Namely, one of the PFC units in the model is designated to be weakly shape A selective and the other PFC unit is weakly shape B selective. Both PFC units respond strongly to partially occluded shapes and only weakly to unoccluded shapes.

In this way, PFC neurons of the model respond strongly to both the task-relevant visual features (shape identity) and nuisance variables (occlusion level), while each of V4 populations responds preferentially to a single feature of the input visual stimulus. Thus, although the V4 responses are already modulated by both shape and occlusion level, the signals become even more mixed as they go up the hierarchy. Previous studies have shown that such mixed selectivity in the PFC plays an important computational role in a high-dimensional population encoding of task-relevant information (Rigotti et al., 2013; Fusi, Miller, & Rigotti, 2016).

### 2.4  Probabilistic Network Model

As we detail below, the responses of the neuronal units evolve toward values that maximize the posterior probability of these responses given the input shape stimulus. In other words, the neuronal activities, and synaptic weights at a slower timescale, are found by estimating the most likely values given the shape stimulus.

In our model, visual inputs are simplified and represented by $κ$, which includes the shape identity $s$ (shape A or shape B, $s ∈ \{A, B\}$) and the degree of occlusion $c$ ($c ∈ [0,1]$), so that $κ = (s, c)$. We assume that the V4-PFC circuitry builds a two-level hierarchical description of the input stimulus $κ$ via the firing rates of V4 neurons ($r_{v4}$) and PFC neurons ($r_{pfc}$). Because each successive random variable is assumed to be conditionally dependent only on the random variable in the adjacent higher level, the posterior probability of the V4 and PFC responses given $κ$ factors as

$$\begin{aligned}
p(r_{v4}, r_{pfc} \mid κ) &= h_0 \cdot p(κ \mid r_{v4}, r_{pfc})\, p(r_{v4}, r_{pfc}) \\
&= h_0 \cdot p(κ \mid r_{v4}, r_{pfc})\, p(r_{v4} \mid r_{pfc})\, p(r_{pfc}) \\
&= h_0 \cdot p(κ \mid r_{v4})\, p(r_{v4} \mid r_{pfc})\, p(r_{pfc}) \\
&= h \cdot p(κ \mid r_{v4})\, p(r_{v4} \mid r_{pfc}),
\end{aligned} \tag{2.3}$$

where $h_0$ and $h$ are constants. The first equality comes from Bayes' theorem, with a normalization term $h_0$. The second equality is simply a property of joint probability. The third equality is based on the assumption that the probability distribution is set up hierarchically. Based on the assumption of spatially Markovian inference (Lee & Mumford, 2003; Rao & Ballard, 1999; Friston & Kiebel, 2009a; Bogacz, 2017), we made a simplification $p(κ \mid r_{v4}, r_{pfc}) = p(κ \mid r_{v4})$ in equation 2.3. Finally, a flat prior on the PFC firing rates is assumed, which is embedded in the constant $h$ on the last line of equation 2.3; therefore, the posterior probability of the neuronal responses is

$$p(r_{v4}, r_{pfc} \mid κ) = h \cdot p(κ \mid r_{v4})\, p(r_{v4} \mid r_{pfc}). \tag{2.4}$$

The firing rates of the V4 and PFC units are given as

$$r_{v4} = \begin{bmatrix} r_{v4,1} \\ r_{v4,2} \\ r_{v4,3} \end{bmatrix}, \qquad r_{pfc} = \begin{bmatrix} r_{pfc,1} \\ r_{pfc,2} \end{bmatrix}, \tag{2.5}$$

where $r_{v4,1}$ and $r_{v4,2}$ represent the average firing rates of the shape-selective V4 neuronal populations (preferring shape A and shape B, respectively) and $r_{v4,3}$ is the average firing rate of the occluder feature-selective V4 population.

We first describe $p(κ \mid r_{v4})$ and how V4 firing rates depend on the input stimulus $κ$. We define $μ$ as the bottom-up representation of the stimulus

$$μ = \begin{bmatrix} μ_1 \\ μ_2 \\ μ_3 \end{bmatrix}. \tag{2.6}$$

The difference between this bottom-up representation and the V4 responses $r_{v4}$ gives the noise term $η_1$,

$$η_1 = μ - r_{v4}, \tag{2.7}$$

which has a gaussian distribution with zero mean and diagonal covariance matrix

$$Σ_1 = \begin{bmatrix} σ_1^2 & 0 & 0 \\ 0 & σ_2^2 & 0 \\ 0 & 0 & σ_3^2 \end{bmatrix}. \tag{2.8}$$

The distribution $p(κ \mid r_{v4})$ is the likelihood of the V4 neuronal activities given the sensory input $κ$. Assuming a flat prior on $r_{v4}$, $p(κ \mid r_{v4}) ∝ p(r_{v4} \mid κ)$. Thus,

$$p(r_{v4} \mid κ) = \mathcal{N}(r_{v4};\, μ,\, Σ_1). \tag{2.9}$$

The mean $μ$ and the covariance matrix $Σ1$ are determined by the input shape identity $s$ and the occlusion level $c$. Changes in $μ$ and $Σ1$ describe the sensory-input-driven responses of the V4 populations to different shapes under various degrees of occlusion. In other words, for each occlusion level and the shape identity, there is a most likely firing rate of each V4 unit given by $μ$, and that likelihood falls off according to the covariance $Σ1$.

Here we describe how we modulate $μ$ and $Σ1$ based on the sensory input $κ$. Let us assume the animal is presented with shape A as the test shape. With shape A presented, $μ1$, the gaussian mean of the firing rate distribution of V4 unit 1 in Figure 1A (the shape A-selective V4 population), decreases as occlusion $c$ increases (see Figure 1C, green). However, $μ2$ of the V4 population, preferring shape B (V4 unit 2 in Figure 1A), stays constant at a baseline firing rate, independent of the change in occlusion level. That is, the V4 unit 2 does not prefer shape A; it responds with a low firing rate regardless of the occlusion level (see Figure 1C, blue). The standard deviation $σ1$ of the preferred V4 unit increases as occlusion increases in order to capture the increasing uncertainty of the shape identity under higher degrees of occlusion (see Figure 1C, where the green distribution widens as occlusion increases). The standard deviation $σ2$ of the nonpreferred V4 population (V4 unit 2) is assumed to be constant.

A justification for increasing the input variance $σ1$ but not $σ2$ with occlusion is as follows. These terms represent uncertainty in shape identity signals. We hypothesize that occlusion introduces the most uncertainty for neuronal responses to preferred shapes, as randomly placed occluders may either hide critical features of the preferred shapes or fail to hide these features. In the first case, shape signals will be strongly suppressed; in the second, they will be maintained. For nonpreferred shapes, which lack critical features, we hypothesize that shape signals for different occlusion patterns will be less volatile. Supporting this, while random placement of occluders may form accidental contours, a previous experimental study in V4 has shown that responses to preferred contours are suppressed when those contours are accidentally formed at the junction between the occluded and occluding objects (Bushnell, Harding, Kosai, & Pasupathy, 2011). Accordingly, the variance $σ2$ should be roughly constant with added occlusion or, if increasing, only by a small amount. In section 3, we explore which trends in variances are consistent with the data in more detail (see Figure 9).

Finally, for V4 unit 3, the relevant stimuli (occluding dots) are present on every trial but slightly shifted in position. As we assume that this unit responds to the presence of occluders but not their specific configuration, the variance $σ3$ is taken to be constant across occlusion levels (see Figure 1C, red). As the occlusion level increases, $μ3$ of the occluder-selective V4 population (V4 unit 3) also increases.

The dependence of the means and the variances on the occlusion level $c$ was set to be linear: $μ=μ0+α·c$ and $Σ1=Σ0+β·c$ with $μ0=[502020]T$, $α=[-50100]T$, $Σ0=I3$, and $β=[500]T$. The slopes ($α$, $β$) and the values defining the response distributions when the shape is unoccluded ($μ0$, $Σ0$) at $c=0$ were manually chosen to match the peak firing rates observed in experiments. With this choice of $α$, as occlusion level increases, the peak of the response distribution decreases, stays constant at a low baseline firing rate, and increases, for V4 units 1, 2, and 3, respectively. Thus, V4 units 1 and 2 reproduce response patterns of V4 neurons to preferred and nonpreferred shapes under varying degrees of occlusion in experiments (see Figures 2 and A.1A), and unit 3 replicates V4 neurons that respond strongly to occlusion (see Figure A.1B). The values chosen for $β$, on the other hand, indicate that the ambiguity of the stimulus feature increases only for the test shape-preferred V4 unit 1. In this way, input stimuli—shapes A and B with various degrees of occlusion—are represented by the response distributions of three different V4 populations given $κ$ rather than by using actual pixel images.
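The linear dependence on occlusion described above can be sketched in a few lines of Python. This is a minimal illustration for the case in which shape A is shown, not the study's code. The function name `bottom_up_stats` and all numerical values are placeholders chosen only to follow the qualitative trends in the text (unit 1 decreases, unit 2 stays flat, unit 3 increases, and only the variance of unit 1 grows). The sketch also assumes that the slope $β$ acts on the diagonal of $Σ1$, so the covariance is stored as a vector of variances.

```python
import numpy as np

# Placeholder parameters for illustration only; they are NOT the study's values.
MU0 = np.array([40.0, 10.0, 10.0])    # means at c = 0 for V4 units 1, 2, 3
ALPHA = np.array([-20.0, 0.0, 20.0])  # slopes: decreasing, flat, increasing
VAR0 = np.ones(3)                     # diagonal of Sigma_0 (identity)
BETA = np.array([5.0, 0.0, 0.0])      # only unit 1 becomes more uncertain


def bottom_up_stats(c, mu0=MU0, alpha=ALPHA, var0=VAR0, beta=BETA):
    """Return (mu, var) for shape A shown at occlusion level c in [0, 1].

    mu is the vector of gaussian means and var the diagonal of Sigma_1,
    both linear in c.
    """
    if not 0.0 <= c <= 1.0:
        raise ValueError("occlusion level c must lie in [0, 1]")
    mu = mu0 + alpha * c
    var = var0 + beta * c
    return mu, var


for c in (0.0, 0.5, 1.0):
    mu, var = bottom_up_stats(c)
    print(f"c={c:.1f}  mu={mu}  var={var}")
```

Passing the parameters as arguments keeps the placeholders easy to swap for fitted values.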

Figure 2:

Recordings from V4 and PFC show characteristic response dynamics. (A) Example V4 cell responses to a preferred (left) and a nonpreferred shape (right) during the discrimination task. Test stimulus onset was at time 0 ms. Level of occlusion was measured by % unoccluded area (line color). The black line (100% unoccluded) represents the unoccluded stimulus. Two transient peaks are identified by filled and open rectangles. (B) The time-averaged V4 firing rates during the initial and the delayed peaks (identified in panel A), as a function of occlusion level. Solid lines show average firing rates for the preferred shape during the initial peak, and the dotted lines indicate average firing rates during the delayed transients, as marked above response traces in panel A. (C) Response of an example PFC cell to the two shape stimuli (left and right) during the discrimination task. (D) Averaged PFC responses as a function of occlusion level. Responses to each of the two shapes are shown in green and blue, respectively. Population data follow the same trend. Data adapted with permission from Pasupathy et al. (2015) and Fyall et al. (2017).

The second term on the right-side of equation 2.4, $p(rv4|rpfc)$, provides the top-down effects on the posterior distribution, also described as gaussian. Here, the mean is the prediction made by PFC, $u·rpfc$, which is the sum of the two PFC population responses weighted by the connection weight matrix $u$. In more general cases, this weighted sum is filtered by a nonlinearity $f$, thus yielding the top-down prediction $f(u·rpfc)$ (see Figure 1B). For the simulations in this study, however, the nonlinearity on weighted PFC responses was ignored and the predictions were assumed to be linear, that is, $f(u·rpfc)=u·rpfc$, as in Rao and Ballard (1999). The connection weights between the V4 and PFC neuronal units are given as

$$u = \begin{bmatrix} u_{1,1} & u_{1,2} \\ u_{2,1} & u_{2,2} \\ u_{3,1} & u_{3,2} \end{bmatrix}. \tag{2.10}$$

The difference between $u \cdot r_{pfc}$, the top-down prediction made by PFC, and the V4 responses $r_{v4}$ is then

$$η_2 = r_{v4} - u \cdot r_{pfc}, \tag{2.11}$$

where the noise $η_2$ has a gaussian distribution with zero mean and covariance matrix $Σ_2$:

$$Σ_2 = \begin{bmatrix} σ_1'^2 & 0 & 0 \\ 0 & σ_2'^2 & 0 \\ 0 & 0 & σ_3'^2 \end{bmatrix}. \tag{2.12}$$

The distribution of V4 responses given the PFC responses, $p(r_{v4} \mid r_{pfc})$, is then

$$p(r_{v4} \mid r_{pfc}) = \mathcal{N}(r_{v4};\, u \cdot r_{pfc},\, Σ_2). \tag{2.13}$$

The standard deviation of the response distribution of each V4 unit given the PFC responses determines the relative weight of the top-down prediction in shaping the V4 responses. Specifically, a smaller standard deviation penalizes the noise term more heavily, forcing a closer match between the PFC prediction and the V4 response. These standard deviations were chosen as $σ'_1 = 10$, $σ'_2 = 10$, and $σ'_3 = 1$. Thus, the top-down component is more strongly emphasized for V4 unit 3, the V4 neuronal population selective for occluders. We found that such emphasis on the predictive component for the occluder-selective V4 population was necessary to reproduce the experimentally observed PFC response characteristics, an increase in PFC responses with a rise in occlusion level (see section 3).
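A one-unit numerical sketch may help show how the top-down standard deviation sets the pull of the prediction. It assumes the PFC prediction is held fixed and treats a single V4 unit, which is possible because both covariance matrices are diagonal. Under those assumptions the response that minimizes the two inverse-variance-weighted squared errors is a precision-weighted average of the bottom-up mean and the prediction. The function name `combined_response` and the numbers are hypothetical.

```python
def combined_response(mu, sigma, pred, sigma_td):
    """Minimizer over r of (r - mu)^2 / sigma^2 + (r - pred)^2 / sigma_td^2."""
    w_bu = 1.0 / sigma**2      # precision of the bottom-up term
    w_td = 1.0 / sigma_td**2   # precision of the top-down term
    return (w_bu * mu + w_td * pred) / (w_bu + w_td)


# Hypothetical values: bottom-up mean 20, top-down prediction 40, sigma = 1.
mu, sigma, pred = 20.0, 1.0, 40.0
weak = combined_response(mu, sigma, pred, sigma_td=10.0)   # like units 1 and 2
strong = combined_response(mu, sigma, pred, sigma_td=1.0)  # like unit 3
print(f"sigma'=10 -> r = {weak:.3f}")
print(f"sigma'=1  -> r = {strong:.3f}")
```

With a top-down standard deviation of 10 the response stays close to the bottom-up mean, while a value of 1 moves it halfway to the prediction.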

Given the visual stimulus $κ$, the firing rates $rv4$ and $rpfc$ adjust in order to maximize the posterior distribution, $p(κ|rv4)p(rv4|rpfc)$. Maximizing this is equivalent to minimizing its negative logarithm, which is defined as the cost function $E$:

$$E = (r_{v4} - μ)^T Σ_1^{-1} (r_{v4} - μ) + (r_{v4} - u \cdot r_{pfc})^T Σ_2^{-1} (r_{v4} - u \cdot r_{pfc}). \tag{2.14}$$

Note that this cost function is the sum of two squared errors, each weighted by the corresponding inverse variances: the error $η_1^T η_1$ between the V4 responses and the sensory-input-imposed representation, and the error $η_2^T η_2$ between the V4 responses and the top-down prediction made by PFC.
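As a concrete illustration, the cost in equation 2.14 can be evaluated with NumPy as follows. This is a sketch rather than the study's implementation: it stores the diagonal covariances as vectors of variances, and the weights, rates, and means in the example are hypothetical.

```python
import numpy as np


def cost(r_v4, r_pfc, u, mu, var1, var2):
    """Cost E of equation 2.14 for diagonal covariances.

    var1 and var2 hold the diagonals of Sigma_1 and Sigma_2.
    """
    eta1 = mu - r_v4          # bottom-up error, equation 2.7
    eta2 = r_v4 - u @ r_pfc   # top-down prediction error, equation 2.11
    return float(np.sum(eta1**2 / var1) + np.sum(eta2**2 / var2))


# Hypothetical example values (three V4 units, two PFC units).
u = np.array([[1.0, 0.5],
              [0.5, 1.0],
              [1.0, 1.0]])
r_pfc = np.array([10.0, 20.0])
mu = np.array([40.0, 10.0, 10.0])
var1 = np.array([1.0, 1.0, 1.0])
var2 = np.array([100.0, 100.0, 1.0])  # sigma' = (10, 10, 1)

r_v4 = mu.copy()  # V4 sits at the bottom-up mean, so only the second term counts
print(cost(r_v4, r_pfc, u, mu, var1, var2))  # 406.25
```

In this example the mismatch on unit 3 contributes 400 of the total 406.25, because its top-down variance is 100 times smaller than that of units 1 and 2.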

The optimal “parameters” (the neuronal responses and the connection weights) are thus found by minimizing this cost function $E$ with respect to $r_{v4}$, $r_{pfc}$, and $u$. The initial V4 responses in experiments, which presumably depend only on the feedforward sensory input, are found by minimizing only the first term of equation 2.14. The initial responses are therefore equal to the sensory-driven representation $μ$. The delayed V4 responses, which we hypothesize to depend on both the feedforward sensory input and the feedback prediction, are instead found by minimizing the entire cost function, equation 2.14.
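The two stages can be made explicit by setting the gradients of equation 2.14 to zero. The following worked sketch uses only the quadratic form of $E$ and the linear prediction $u·rpfc$, and it assumes symmetric covariance matrices (which holds for the diagonal matrices used here).

```latex
\begin{aligned}
\text{Initial response (first term only):}\quad
& 2\,\Sigma_1^{-1}(r_{v4}-\mu) = 0
  \;\Rightarrow\; r_{v4} = \mu, \\[4pt]
\text{Delayed response (full } E\text{):}\quad
& \frac{\partial E}{\partial r_{v4}}
  = 2\,\Sigma_1^{-1}(r_{v4}-\mu) + 2\,\Sigma_2^{-1}(r_{v4}-u\,r_{pfc}) = 0 \\
& \Rightarrow\; r_{v4}
  = \left(\Sigma_1^{-1}+\Sigma_2^{-1}\right)^{-1}
    \left(\Sigma_1^{-1}\mu + \Sigma_2^{-1}\,u\,r_{pfc}\right), \\[4pt]
& \frac{\partial E}{\partial r_{pfc}}
  = -2\,u^{T}\Sigma_2^{-1}(r_{v4}-u\,r_{pfc}) = 0
  \;\Rightarrow\; u^{T}\Sigma_2^{-1}u\;r_{pfc} = u^{T}\Sigma_2^{-1}\,r_{v4}.
\end{aligned}
```

At a joint minimum both delayed-response conditions hold together. The delayed V4 response is then a precision-weighted average of the bottom-up mean and the top-down prediction, so a larger bottom-up variance shifts weight toward the PFC prediction.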

### 2.5  Training Protocol: Weight Adjustment during the Preliminary Phase

We divide the optimization process into two phases based on the experimental setup: the preliminary phase and the test phase. In this section, we discuss how the synaptic weight matrix between PFC and V4 is found during the preliminary phase. To find these weights, we minimized the cost function $E$ with respect to $r_{v4}$ and $r_{pfc}$, as well as with respect to the connection weight matrix $u$, over a series of unoccluded trials. Then, during the test phase, the optimal estimates of the neuronal responses to shapes under varying degrees of occlusion are determined by minimizing the cost function with respect to $r_{v4}$ and $r_{pfc}$, with the connection weights fixed at the learned values.

The preliminary phase corresponds to the stage at the beginning of the experiment in which the animal is exposed about 20 times to the pair of unoccluded shapes used for the experimental session. We introduced its equivalent in the simulation, during which the cost function $E$ is minimized by gradient descent with respect to the firing rates of the V4 units $r_{v4}$ and PFC units $r_{pfc}$, as well as the connection weight matrix $u$. During this phase, unoccluded shapes A and B are randomly chosen and used as inputs to the model for up to 30 trials.

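The optimal estimates of $r_{v4}$, $r_{pfc}$, and $u$ are obtained by performing gradient descent on $E$ with respect to these parameters at different learning rates:
$$\frac{dr_{v4}}{dt}=-k_r\frac{\partial E}{\partial r_{v4}},\qquad \frac{dr_{pfc}}{dt}=-k_r\frac{\partial E}{\partial r_{pfc}},\qquad \frac{du}{dt}=-k_u\frac{\partial E}{\partial u}. \tag{2.15}$$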
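
For reference, the gradients that appear in equation 2.15 can be written out explicitly. The expressions below are a sketch derived from the quadratic form of $E$ given in equation 2.17, under the assumption (made here for illustration) that $\Sigma_1$ and $\Sigma_2$ are symmetric and are treated as constants during the descent:

$$\begin{aligned}
\frac{\partial E}{\partial r_{v4}} &= 2\,\Sigma_1^{-1}(r_{v4}-\mu)+2\,\Sigma_2^{-1}(r_{v4}-u\cdot r_{pfc}),\\
\frac{\partial E}{\partial r_{pfc}} &= -2\,u^T\Sigma_2^{-1}(r_{v4}-u\cdot r_{pfc}),\\
\frac{\partial E}{\partial u} &= -2\,\Sigma_2^{-1}(r_{v4}-u\cdot r_{pfc})\,r_{pfc}^T.
\end{aligned}$$

Under these assumptions, the V4 rates are pulled toward both the sensory-driven representation and the top-down prediction, whereas the PFC rates and the weights are driven only by the prediction error $r_{v4}-u\cdot r_{pfc}$.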

The learning rate of $u$, $k_u=0.001$, was significantly smaller than that of $r_{v4}$ and $r_{pfc}$, $k_r=0.1$. This models the relatively fast dynamics of firing rates and the slower dynamics of synaptic plasticity. For each selected shape, we carried out gradient descent until the firing rates reached steady state (with a minimum of 20 iterations) or until the maximum of 500 iterations was reached. While $r_{v4}$ and $r_{pfc}$ rapidly converge to a fixed point for each of the sampled shapes, the connection matrix $u$ gradually converges over the course of multiple samples of shapes A and B. In this way, the weight matrix $u$ is tuned over the course of the preliminary phase, which corresponds to the animal's familiarization with the pair of shapes at the beginning of the experiment.
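
The preliminary phase described above can be sketched in a few lines of Python. This is an illustrative sketch, not the original implementation. The learning rates, the iteration limits, the number of trials, and the initial weight matrix are taken from the text; the diagonal covariances (`VAR1`, `VAR2`), the sensory-driven means in `MU`, the steady-state threshold `TOL`, and the choice to restart the rates at 10 spikes/s on each trial are assumptions introduced here. The sketch does not constrain firing rates to be nonnegative.

```python
import random

# Learning rates and iteration limits quoted in the text.
K_R, K_U = 0.1, 0.001
MIN_ITERS, MAX_ITERS, N_TRIALS = 20, 500, 30

# Everything below is an assumption made for illustration only.
MU = {"A": [30.0, 5.0, 2.0], "B": [5.0, 30.0, 2.0]}  # sensory-driven V4 means
VAR1 = [4.0, 4.0, 4.0]  # assumed diagonal of Sigma_1
VAR2 = [4.0, 4.0, 4.0]  # assumed diagonal of Sigma_2
TOL = 1e-3              # assumed steady-state threshold on the rate change


def cost(r_v4, r_pfc, u, mu):
    """Cost E for diagonal covariances: sensory term plus feedback term."""
    total = 0.0
    for i, row in enumerate(u):
        pred = sum(w * r for w, r in zip(row, r_pfc))
        total += (r_v4[i] - mu[i]) ** 2 / VAR1[i]
        total += (r_v4[i] - pred) ** 2 / VAR2[i]
    return total


def descent_step(r_v4, r_pfc, u, mu):
    """One Euler step of the gradient descent on rates and weights."""
    n_v4, n_pfc = len(u), len(r_pfc)
    # Precision-weighted feedback residual: Sigma_2^-1 (r_v4 - u . r_pfc).
    err = [(r_v4[i] - sum(u[i][j] * r_pfc[j] for j in range(n_pfc))) / VAR2[i]
           for i in range(n_v4)]
    g_v4 = [2 * (r_v4[i] - mu[i]) / VAR1[i] + 2 * err[i] for i in range(n_v4)]
    g_pfc = [-2 * sum(u[i][j] * err[i] for i in range(n_v4))
             for j in range(n_pfc)]
    # dE/du[i][j] = -2 * err[i] * r_pfc[j], so descent adds K_U * 2 * err * r.
    new_u = [[u[i][j] + K_U * 2 * err[i] * r_pfc[j] for j in range(n_pfc)]
             for i in range(n_v4)]
    new_v4 = [r - K_R * g for r, g in zip(r_v4, g_v4)]
    new_pfc = [r - K_R * g for r, g in zip(r_pfc, g_pfc)]
    return new_v4, new_pfc, new_u


def train(seed=0):
    """Run the preliminary phase and return the tuned weight matrix."""
    rng = random.Random(seed)
    u = [[1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]]  # initial weights from the text
    for _ in range(N_TRIALS):
        mu = MU[rng.choice("AB")]               # random unoccluded shape
        r_v4, r_pfc = [10.0] * 3, [10.0] * 2    # assumed restart at 10 spikes/s
        for it in range(MAX_ITERS):
            new_v4, new_pfc, u = descent_step(r_v4, r_pfc, u, mu)
            change = max(abs(a - b)
                         for a, b in zip(new_v4 + new_pfc, r_v4 + r_pfc))
            r_v4, r_pfc = new_v4, new_pfc
            if it + 1 >= MIN_ITERS and change < TOL:
                break
    return u


learned_u = train()
for row in learned_u:
    print([round(w, 2) for w in row])
```

Because the placeholder inputs differ from the stimulus model used in the paper, the printed weights are not expected to match the values reported in the Results; the sketch only illustrates the separation of timescales between the rate updates and the weight updates.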

We set initial weights for $u_{1,2}$ and $u_{2,1}$ smaller than the initial values of other connection weights, to slightly bias one of the PFC populations (PFC unit 1) to be shape A selective and the other (PFC unit 2) to be shape B selective.

We acknowledge a limitation of the gradient descent method on $E$ in equation 2.15, which is that it requires nonlocal computation. In other words, the activities and the synaptic strengths of all the neuronal units in the system must be known in order to take a gradient descent step, a requirement that is not physiologically realistic. This issue also exists in previous models of predictive coding and sparse coding in the visual system (Rao & Ballard, 1999; Olshausen & Field, 1996, 1997), as pointed out by Bogacz (2017) and Zylberberg, Murphy, and DeWeese (2011). While we do not pursue this matter further here, we note that Zylberberg et al. (2011) show that in the limit where the neuronal activity is sparse and uncorrelated, the nonlocal gradient descent rule is approximately equivalent to a synaptically local rule.

### 2.6  Optimal Stimulus Representation during the Test Phase

Once the weight matrix $u$ has converged over the course of the preliminary phase, it is fixed at the learned values during the test phase. The test phase corresponds to the recording session where the animal performs the matching task while test shapes with varying degrees of occlusion are displayed. We hypothesize that the V4 and PFC recordings from the experiment are represented by the average firing rates of the V4 and PFC populations in the model network, $r_{v4}$ and $r_{pfc}$, that minimize the cost function $E$. Either shape A or shape B can be used as the input to the network. In this article, however, without loss of generality, we show only the simulations with shape A as the test shape, so that the V4 unit selective for shape A (V4 unit 1) is the preferred population and the shape B-selective V4 unit (V4 unit 2) is the nonpreferred population.

For each occlusion level, the optimization is carried out in two parts to reflect the dynamics of the V4 responses. The initial responses of V4 neurons observed in experiments are compared to the V4 responses $r_{v4}$ that minimize the first part of the cost function $E$ (see equation 2.14), namely,
$$E_1=(r_{v4}-\mu)^T\Sigma_1^{-1}(r_{v4}-\mu). \tag{2.16}$$
$E_1$ is simply a weighted squared difference between the V4 neuronal responses and the V4 responses predicted by the bottom-up sensory input. Therefore, $r_{v4}$ that minimizes $E_1$ is interpreted as the V4 responses shaped only by the feedforward inputs.

The delayed responses of V4 neurons, as well as the PFC responses, are found by minimizing the entire cost function $E$, equation 2.14, with respect to $r_{v4}$ and $r_{pfc}$. We rewrite the full cost function $E$ as $E_2$:
$$E_2:=E=(r_{v4}-\mu)^T\Sigma_1^{-1}(r_{v4}-\mu)+(r_{v4}-u\cdot r_{pfc})^T\Sigma_2^{-1}(r_{v4}-u\cdot r_{pfc}). \tag{2.17}$$

$E_2$ includes a term that depends on the difference between $r_{v4}$ and the top-down predictions made by PFC, $u\cdot r_{pfc}$, in addition to the error term between $r_{v4}$ and the V4 responses predicted by the input visual stimulus. Therefore, $r_{v4}$ that minimizes this cost function $E_2$ is interpreted as the V4 responses shaped by both the feedforward and the feedback signals. This $r_{v4}$ is compared to the delayed responses in V4 neurons in experiments, which we hypothesize to be induced by feedback from PFC.
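
As a worked illustration of how the two terms are combined, the minimizer of the full cost function can be characterized by setting its gradients to zero. The conditions below are a sketch that assumes $\Sigma_1$ and $\Sigma_2$ are symmetric and invertible and that $u$ is held fixed:

$$(\Sigma_1^{-1}+\Sigma_2^{-1})\,r_{v4}=\Sigma_1^{-1}\mu+\Sigma_2^{-1}\,u\cdot r_{pfc},\qquad u^T\Sigma_2^{-1}(r_{v4}-u\cdot r_{pfc})=0.$$

Under these assumptions, the first condition says that the delayed V4 response is a precision-weighted average of the sensory-driven representation $\mu$ and the top-down prediction $u\cdot r_{pfc}$. The second says that the PFC responses are a weighted least-squares fit of the V4 responses through the feedback weights.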

$E_1$ and $E_2$ are minimized with respect to $r_{v4}$ and $r_{pfc}$ using gradient descent and MATLAB's fminsearch, starting from an initial value of 10 spikes/s for all neuronal units.
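
The two-part optimization can be illustrated with a short script. The sketch below is not the original implementation: it uses plain gradient descent in Python rather than MATLAB's fminsearch, and it assumes diagonal covariance matrices. The values of `MU`, `VAR1`, and `VAR2`, the step size `k`, and the number of steps are hypothetical placeholders. Only the weight matrix `U` (the learned values reported in section 3.3) and the starting rate of 10 spikes/s are taken from the text.

```python
# Learned feedback weights reported in section 3.3
# (rows: V4 units 1-3, columns: PFC units 1-2).
U = [[2.32, 0.21], [0.26, 2.37], [0.94, 0.94]]

# Assumed values for illustration only. In the paper, the mean and variance of
# the sensory-driven representation depend on shape identity and occlusion.
MU = [20.0, 8.0, 15.0]   # hypothetical sensory-driven V4 means (spikes/s)
VAR1 = [4.0, 4.0, 4.0]   # hypothetical diagonal of Sigma_1
VAR2 = [4.0, 4.0, 4.0]   # hypothetical diagonal of Sigma_2


def predict(u, r_pfc):
    """Top-down prediction u . r_pfc for each V4 unit."""
    return [sum(w * r for w, r in zip(row, r_pfc)) for row in u]


def cost_e1(r_v4, mu, var1):
    """Feedforward-only cost E1 (diagonal Sigma_1)."""
    return sum((r - m) ** 2 / v for r, m, v in zip(r_v4, mu, var1))


def cost_e2(r_v4, r_pfc, mu, u, var1, var2):
    """Full cost E2 = E1 + precision-weighted feedback error."""
    pred = predict(u, r_pfc)
    feedback = sum((r - p) ** 2 / v for r, p, v in zip(r_v4, pred, var2))
    return cost_e1(r_v4, mu, var1) + feedback


def minimize_e1(mu, var1, k=0.1, steps=500, start=10.0):
    """Initial V4 responses: gradient descent on E1 only."""
    r_v4 = [start] * len(mu)
    for _ in range(steps):
        r_v4 = [r - k * 2 * (r - m) / v for r, m, v in zip(r_v4, mu, var1)]
    return r_v4


def minimize_e2(mu, u, var1, var2, k=0.1, steps=2000, start=10.0):
    """Delayed V4 responses and PFC responses: gradient descent on E2."""
    n_v4, n_pfc = len(u), len(u[0])
    r_v4, r_pfc = [start] * n_v4, [start] * n_pfc
    for _ in range(steps):
        pred = predict(u, r_pfc)
        # Precision-weighted feedback residual: Sigma_2^-1 (r_v4 - u . r_pfc).
        err = [(r - p) / v for r, p, v in zip(r_v4, pred, var2)]
        g_v4 = [2 * (r - m) / v + 2 * e
                for r, m, v, e in zip(r_v4, mu, var1, err)]
        g_pfc = [-2 * sum(u[i][j] * err[i] for i in range(n_v4))
                 for j in range(n_pfc)]
        r_v4 = [r - k * g for r, g in zip(r_v4, g_v4)]
        r_pfc = [r - k * g for r, g in zip(r_pfc, g_pfc)]
    return r_v4, r_pfc


r_initial = minimize_e1(MU, VAR1)
r_delayed, r_pfc = minimize_e2(MU, U, VAR1, VAR2)
print("initial V4:", [round(r, 2) for r in r_initial])
print("delayed V4:", [round(r, 2) for r in r_delayed])
print("PFC:       ", [round(r, 2) for r in r_pfc])
```

With the weights fixed, $E_2$ is a convex quadratic in the rates, so for a sufficiently small step size the descent converges to the same minimum regardless of the starting rates. The first function recovers $\mu$ itself, which matches the statement that the initial responses equal the sensory-driven representation.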

## 3  Results

In section 3.1, we present experimental evidence that supports the hypothesis that feedback signals from PFC modulate shape representations in V4. In section 3.2, we compare the outcomes in our probabilistic network model to physiology and explain in sections 3.3 and 3.4 how robust shape recognition can be achieved in our model. Subsequently, we identify necessary assumptions on the network structure (in section 3.5) and the signal structure (in sections 3.6 and 3.7) of the model to capture the key trends in the experimental results. Finally, using our model, we make predictions in section 3.8 on shape-selective neuronal responses to a new type of reduced stimulus clarity.

### 3.1  Experimental Evidence for Feedback Signals in Area V4

Recent experiments demonstrated that neurons in V4 and PFC show strikingly different response patterns in monkeys performing a sequential shape discrimination task. Specifically, the response patterns of a class of V4 neurons, considered together with those of PFC neurons, provide evidence of feedback signals from PFC. Our goal in this study is to provide a normative model describing these experimental results.

Figure 2A shows the response dynamics of an example V4 cell to a preferred shape (left) and a nonpreferred shape (right). The V4 neuron exhibits two transient peaks when the preferred shape is presented, but only one smaller peak for the nonpreferred shape. In the initial transient at the onset of the preferred shape stimulus, the V4 neuron responds strongly to the unoccluded shape (black), and an increase in occlusion weakens the shape-selective responses (color). While the first peak shows a dramatic dependence on occlusion, the later peak shows a weaker dependence. Figure 2B shows the averaged responses of the V4 neuron during the initial transient (50–125 ms) and the delayed transient (175–250 ms), illustrating the differential effects of occlusion on V4 responses over time. The reduced effect of occlusion on V4 responses to the preferred shape during the second transient leads to enhanced shape selectivity, as previously observed in Pasupathy et al. (2015), Kosai et al. (2014), and Fyall et al. (2017). Such response patterns were observed in many other V4 neurons in experiments. In Figure A.1A, we show a few more example V4 cells that exhibit shape-selective responses that are less sensitive to occlusion during the delayed response peak.

In contrast to V4 neurons, PFC neurons exhibit one peak and show their strongest responses to occluded stimuli and weakest responses to unoccluded stimuli, as shown for an example PFC neuron in Figure 2C (Pasupathy et al., 2015). Figure 2D shows the time-averaged responses of the PFC neuron as a function of occlusion level for both the preferred and the nonpreferred shapes. As occlusion increases, the PFC responses increase, the opposite of the trend in V4. Moreover, the peak PFC responses occur between the initial and the delayed transients of V4 responses, consistent with the hypothesis that the PFC responses, which arise from feedforward transmission of sensory information, in turn send feedback inputs and drive the second peak of responses in V4. These experimental observations led us to the hypothesis that the feedback inputs from PFC and other higher cortices underlie delayed improvement of shape-selective responses under occlusion in V4. (For more details on the experimental results, see Kosai et al., 2014; Pasupathy et al., 2015; and Fyall et al., 2017.)

### 3.2  Structure and Design of Probabilistic Network Model

We sought to understand the response dynamics of V4 and PFC neurons in the context of predictive coding, a hierarchical encoding of stimuli widely used to probe interactions of lower- and higher-sensory areas. We first pose a probabilistic network model of the V4-PFC circuitry with the presumptive feedback based on predictive coding and introduce an innovation that differentiates our model from previous predictive coding models.

In each layer of our V4-PFC network model, there are distinct units, each representing a neuronal population with similar tuning properties. The V4 layer is composed of three units that respond preferentially to different features of the visual stimulus (see Figure 1A): unit 1 to shape A, unit 2 to shape B, and unit 3 to an occluder-specific feature (e.g., the color of the occluders). In PFC, two units respond strongly to occlusion while also exhibiting some degree of shape selectivity. The representation of a population of neurons as a single unit is a common simplification, but we find that each unit replaced by a population of multiple neurons with mild heterogeneity yields qualitatively the same response trends as with the single-unit model (see appendix B).

In the model, V4 receives feedforward sensory inputs and seeks to match the responses imposed by the sensory inputs. At the same time, feedback predictions from PFC bias the V4 responses. The weighted sums of PFC responses provide top-down predictions conditioned on the underlying visual stimulus and are regarded as the feedback from PFC to V4. Assuming hierarchical Bayesian inference, the most likely representation of the responses is obtained by finding a set of responses that maximize the posterior probability given the visual stimulus, which is equivalent to the product of conditional probabilities of the neuronal activities given only the activities of the next higher area (see equation 2.3). Note that we are finding not only the optimal responses of V4 units but also the optimal responses of PFC units to minimize the cost function. Therefore, the V4 neuronal responses drive the PFC responses, while the PFC predictions drive V4 neurons, enacting feedforward and feedback connections between V4 and PFC. Here, the visual input to each V4 unit is represented as a gaussian distribution, whose mean and variance change according to the shape identity and the occlusion level (see Figure 1C). Similarly, the feedback from PFC to each V4 unit is described by a gaussian distribution with the peak at a sum of the PFC responses weighted by the synaptic strengths (see Figure 1B).

In this way, the optimal representation of the neuronal responses integrates both the bottom-up sensory input and the top-down prediction. This is done by minimizing a cost function composed of the difference between the V4 activities and the top-down predictions, as well as the difference between the V4 activities and the V4 responses predicted by the sensory input, with each term inversely weighted by its respective variance (see equation 2.14). We compare this optimal representation directly to the neuronal responses in experiments; this differs from previous studies (Rao & Ballard, 1999; Srinivasan et al., 1982) where the residual error between the prediction and the neuronal activity was associated with physiologically measured responses. With this reformulation, neural activity conveys both the sensory input and the internal prediction, preventing the situation in original implementations of predictive coding in which neurons have depressed activity when familiar stimuli are presented regardless of the sensory tuning properties.

### 3.3  Network Training and Synaptic Weight Matrix

First, the network was trained following the experimental procedure where the animal was exposed to the pair of unoccluded shapes. During this preliminary phase, the connection weight matrix $u$ between PFC and V4 is learned by gradient descent on the cost function $E$ with respect to the weights $u$ as well as the neuronal responses $r_{v4}$ and $r_{pfc}$, while unoccluded shape stimuli randomly selected from the set of shape A and shape B are input to the network. The learning rate for neuronal firing rates is significantly larger than that for weights (see equation 2.15). Thus, for each sampled shape, the firing rates of the neuronal units converge rapidly. The weight matrix $u$ converges on a slower timescale over the course of the preliminary phase, with multiple presentations of unoccluded shapes. With initial values of the connection weights set to
$$u=\begin{pmatrix}1&-1\\-1&1\\1&1\end{pmatrix},$$
the connection weight matrix converges to
$$u=\begin{pmatrix}2.32&0.21\\0.26&2.37\\0.94&0.94\end{pmatrix},$$
where the asymmetric weights between the PFC units and the shape-selective V4 units indicate shape selectivity in PFC units. The shape selectivity in PFC units and resulting response characteristics are preserved as long as the initial values for $u_{1,2}$ and $u_{2,1}$ are sufficiently smaller than $u_{1,1}$ and $u_{2,2}$ to introduce an initial bias on shape selectivity.

The convergence of the weight matrix depends on the choice of initial conditions, given the nonconvex and underconstrained nature of the cost function $E$, as there are multiple combinations of the connection weights and neuronal responses that minimize $E$. However, this does not limit our main results, as we can regard the biased initial values as the connections between a subset of PFC populations and the V4 population of interest before learning the shapes, which may have either weak negative values or positive values, among a wide range of random initial connection weights between PFC and V4. Depending on the initial connection weights, the connections will become either stronger or weaker over the course of training, and shape selectivity in PFC neurons emerges.

Our simulations with synaptic weights starting from different initial values show that the neuronal responses of the model are robust to the precise choice of these initial weights. In Figure 3A, we randomly choose different initial weights, under the constraint that $u_{1,1}$ and $u_{2,2}$ start from stronger values (in the range from 0.5 to 3.5) than $u_{1,2}$ and $u_{2,1}$ (in the range from $-1$ to 1). The initial weights between the occluder-selective V4 units and PFC units, $u_{3,1}$ and $u_{3,2}$, are randomly chosen in the range from 0 to 2. For all of these choices, the connection weights $u_{1,1}$ and $u_{2,2}$ converge to higher values than $u_{1,2}$ and $u_{2,1}$, resulting in mild shape selectivity in PFC units, with PFC unit 1 preferring the test shape (see Figure 3A i and iv). If the initial connection weights of $u_{1,2}$ and $u_{2,1}$ are at larger values than $u_{1,1}$ and $u_{2,2}$, the shape preferences in PFC units switch (see Figure 3B i and iv), but the response characteristics of V4 neurons remain unchanged (see Figure 3B ii and iii). Interestingly, responses of V4 units are highly robust to differences in initial weights, converging to almost indistinguishable values in each case, as shown in Figures 3A ii and iii and 3B ii and iii.

Figure 3:

Convergence of connection weights from different initial values. During the training phase, the gradient descent starts with 10 different randomly chosen sets of initial weights (see text). (A) When the initial connection weights for $u_{1,1}$ and $u_{2,2}$ are higher than the initial connection weights for $u_{1,2}$ and $u_{2,1}$, the connection weights $u_{1,1}$ and $u_{2,2}$ converge to larger values than $u_{1,2}$ and $u_{2,1}$ during the training phase (i). With these connection weights, responses of V4 unit 1 and unit 2 (ii), V4 unit 3 (iii), and PFC unit 1 and 2 (iv) to the test shape under varying degrees of occlusion are generated, and are almost identical regardless of the precise initial condition. (B) Same as in panel A, but with initial connection weights for $u_{1,2}$ and $u_{2,1}$ larger than the initial connection weights for $u_{1,1}$ and $u_{2,2}$.


The obtained connection weight matrix is interpreted as a stored template or memory of the shape pair and is fixed during the following test phase. The memory of the shapes encoded in the connection weights is similar to the idea proposed in Mumford (1992), suggesting that descending pathways store templates in the weights of their synapses.

### 3.4  Two-Step Inference on Neuronal Dynamics

With the trained connection weights, we find that the model responses of each unit to partially occluded stimuli are comparable to neuronal responses in experimental recordings during the sequential shape discrimination task described above. In particular, we separate the responses inferred strictly by feedforward sensory inputs from those generated by integrated signals of both feedforward inputs and feedback predictions and show that the model responses capture the temporal dynamics in the electrophysiological recordings.

The optimal representations of the neuronal responses $r_{v4}$ and $r_{pfc}$ that minimize either the first term ($E_1$ from equation 2.16) or the full representation of the cost function $E$ ($E_2$ from equation 2.17) are computed at each occlusion level. As explained in section 2, these are equivalent to the optimal responses in hierarchical Bayesian inference that maximize the posterior probability of the V4 neuronal responses given the shape identity and the occlusion level. Here we assume that the occluders are of a color different from that of the shape or the background, that is, occlusion is salient and distinct (see Figure 4B). The occluders therefore activate V4 unit 3, the occluder-selective neuronal population in the model.

Figure 4:

Model simulations. The optimal representation based on hierarchical Bayesian inference reproduces V4 and PFC responses in the experiments. (A) The network model schematic as in Figure 1A. The solid rectangle shows the initial feedforward-only signal computation. The dotted rectangle encompasses the computations for the delayed response inferences that integrate the bottom-up sensory inputs and the top-down predictions from PFC. The corresponding optimal representations are shown in solid (initial, feedforward only) and dotted (delayed, feedforward + feedback) lines in panels D–E. (B) Illustration of the input stimuli shape A with varying degrees of occlusion. The actual images were not used as the input; the $κ$-dependent population response distributions of V4 neurons were used to represent the shape stimuli. Note that the occluders are of a different color from the shape or the background and activate a group of V4 cells selective for the color. (C) Inferred PFC responses increase as occlusion level increases, in accordance with experiments. A weak shape selectivity is present, as PFC unit 1 responds at higher rates than PFC unit 2 to the presented shape A across the occlusion levels. (D) Inferred responses of the shape-selective V4 units before (solid) and after (dotted) the top-down prediction. The green lines are the optimal responses of the V4 population selective for the test shape (shape A (V4 unit 1)), and the blue lines are those of the nonpreferred V4 population that responds preferentially to shape B (V4 unit 2). (E) Model prediction of average firing rates of the occluder-selective V4 population (V4 unit 3), as a function of occlusion level. The salient occlusion activates this class of V4 neurons. Note that the $x$-axis shows fraction unoccluded.


We make the inference on the neuronal responses in two steps. First, only the bottom-up sensory input is considered, so that the posterior distribution depends only on the stimulus $\kappa$ (see Figure 4A, solid box). In other words, the optimal representations of the activities of the V4 units, $r_{v4}$, are found by minimizing only the first term of the cost function $E$ in equation 2.14 or, equivalently, maximizing equation 2.9. We hypothesize that these optimal responses are modulated only by the bottom-up sensory inputs, to correspond to the initial transient in recorded V4 responses. Thus, only feedforward signals are present at this stage.

The delayed transients in V4 responses following the peak of responses in PFC are compared to the optimal responses that integrate both the bottom-up and the top-down inputs. The model representations of the delayed V4 responses and the PFC responses therefore are obtained by finding $r_{v4}$ and $r_{pfc}$ minimizing the full cost function $E$ (see equation 2.14), which is equivalent to maximizing the full posterior distribution in equation 2.4 composed of both the feedforward, $\kappa$-dependent distribution and the feedback, prediction-driven distribution. In this way, the model draws a connection between the response dynamics of V4 and PFC neurons and different computational stages in the feedforward-feedback loop.

The inferred optimal responses of each neuronal unit in V4 and PFC across a range of occlusion levels, before and after the feedback from PFC, are shown in Figures 4C to 4E. Both PFC unit 1 and unit 2 responses increase with added occlusion (see Figure 4C), in agreement with the experiments where PFC neurons respond strongly to occluded stimuli and weakly to unoccluded stimuli (see Figure 2D). Such increased PFC responses to occlusion result from the PFC connections to the occluder-selective V4 unit 3. Through the synaptic connections, PFC predictions are compelled to match the responses of V4 unit 3, which responds preferentially to occluders. The model PFC units also show shape selectivity, with PFC unit 1 showing higher responses than PFC unit 2 to the test shape A across occlusion levels. This agrees with physiological evidence for shape selectivity in PFC (Pasupathy et al., 2015).

The two-step inference on the V4 responses accurately predicts the response characteristics of the initial and the delayed peaks in experimental recordings of V4 neurons. While the responses of V4 unit 2 (the neuronal unit not preferring the test shape A) stay constant at a low rate across the occlusion levels, V4 unit 1 (the preferred V4 unit) shows a decreasing response pattern as occlusion increases (i.e., as unoccluded area decreases). Compared to the responses inferred only based on the feedforward sensory input (see Figure 4D, solid green), the firing rates are less dependent on occlusion level when the feedback predictions are included (see Figure 4D, dotted green). Thus, with the feedback, an increase in occlusion does not as extensively degrade the preferred V4 responses. The model predictions therefore agree with the experimental observation on the two transients in V4 (see Figure 2B) and are in accordance with our hypothesis that the initial V4 responses reflect the feedforward signals from the afferent areas and the delayed peak responses in V4 are computed based on both the feedforward sensory signals and the feedback predictions from PFC. Because the response of the preferred V4 unit becomes resistant to occlusion when the feedback prediction is included, we say that the feedback enables V4 neurons to have enhanced shape discriminability under partial occlusion.

We note that in contrast to other sensory areas (Rao & Ballard, 1999; Srinivasan et al., 1982) where this comparison has been successfully made, the responses of shape-selective V4 neurons are not accurately described by a direct comparison to the residual errors between the feedback predictions and the neuronal responses underlying the current estimate of the sensory signals. Here, the residual error of V4 units 1 and 2 increases with added occlusion (see Figure 5), unlike the activity of shape-selective V4 neurons in experiments, which were strongest for unoccluded stimuli and weaker with occlusion (see Figure 2B; Kosai et al., 2014). Instead, we identify the optimal estimates shaped by the feedforward input and the feedback predictions as the neuronal responses measured in V4, which do replicate response characteristics in experiments. Unlike the residual errors, which reflect novelty of the sensory inputs, the optimal response representation conveys both sensory stimulus features and stimulus novelty.

Figure 5:

Error signals. (A) Squared difference between top-down prediction $u\cdot r_{pfc}$ and the initial V4 responses $r_{v4}$ obtained by minimizing $E_1$. (B) Squared difference between top-down prediction $u\cdot r_{pfc}$ and the delayed V4 responses $r_{v4}$ obtained by minimizing $E_2$.


Finally, the group of neurons that are hypothesized to respond preferentially to occluder saliency exhibit increasing responses as occlusion increases, both with and without the feedback (see Figure 4E). Although this class of neurons has not been systematically recorded in experiments, we identified several neurons with increasing responses to added occlusion regardless of the copresented shapes (see Figure A.1B) similar to the response patterns of V4 unit 3 (see Figure 4E).

In the above, we have compared the steady-state representation of neuronal responses in the model to transient peaks of responses in the experiments. The two-step inference does not have a mechanism for the shape of the transient activities observed in experiments. Specifically, instead of having the brief suppression of responses between the initial peak and the delayed peak (see Figure 2A), the gradient descent on $E_1$ (see equation 2.16) and $E_2$ (see equation 2.17) with respect to $r_{v4}$ simply predicts the V4 response dynamics $r_{v4}$ to reach and stay at the respective steady-state firing rates, which minimize $E_1$ and $E_2$.

The gradient descent dynamics are shown in Figure C.1, where the feedback prediction term from PFC is included in the cost function after neuronal responses reach the steady-state firing rates minimizing $E_1$. Before optimization of the full cost function $E_2$ starts, the responses may be brought down to the baseline firing rate for a brief interval rather than being continued from the values optimizing $E_1$. In either case, the optimal responses measured at the end of the optimization of $E_2$ do not change (see Figures C.1B and C.2D), indicating that those values are robust within the range of firing rates we consider. Note that unless the responses are deliberately suppressed, the gradient descent dynamics do not exhibit transient peaks as observed in experiments. This implies that there may be additional physiological mechanisms in the cortical circuitry responsible for the transient dynamics. In principle, it is also possible that such temporal effects could be interpreted by extending the predictive coding to the temporal domain (Rao & Ballard, 1999; Friston & Kiebel, 2009a, 2009b).

In summary, in this section, we asked how the responses in a hierarchical predictive coding model compare to physiology. We find that upon training, the model indeed predicts the observed responses in V4 and PFC, when the dynamics unfold over an initial feedforward and a second feedback stage.

### 3.5  Parsimony of the Network Structure

In the simulations above, we have assumed a specific network structure. This poses the question of whether these assumptions were necessary and, in general, what aspects of network structure are required to reproduce the observed physiological responses.

Shape selectivity in V4 and PFC neurons is supported by experiments (Fyall et al., 2017); thus, we included the test shape preferred and nonpreferred V4 and PFC units: V4 units 1 and 2 and PFC units 1 and 2. Our model also includes an additional group of V4 cells that responds strongly to occlusion. We found that such occluder-selective V4 neurons are necessary to capture the response characteristics of PFC neurons observed in the experiments. Since the second term in the cost function, equation 2.14, is the squared difference between the PFC predictions—a linear combination of PFC responses—and the actual V4 responses, the PFC responses minimizing the cost function tend to follow the response trends of the afferent V4 neurons. The shape A (test shape)–preferred V4 unit 1 exhibits monotonically decreasing firing rates as the occlusion level increases, while the activity of the shape B–selective V4 unit 2 stays constant across degrees of occlusion as a consequence of the bottom-up stimulus-dependent inputs. With only these two types of neuronal populations, therefore, the PFC responses cannot capture the firing rate increase induced by occlusion. Given our model architecture without any additional mechanisms, there has to be a class of V4 neurons that responds strongly to occlusion but only weakly to unoccluded stimuli, so that PFC follows the similar response trends. Moreover, we found that the increase in PFC responses with occlusion cannot be obtained by including a simple prior distribution of PFC responses in the cost function instead of the third class of V4 neurons in question (the only way to have a prior implement the observed changes in PFC responses would be to have that prior itself change with occlusion level). There are several candidates for types of V4 neurons represented by unit 3. These include populations of neurons that respond preferentially to the color of the occluding dots.

Another feature of our architecture, the convergence of signals, with each PFC cell connected to multiple afferent V4 neurons from different populations, is also critical for replicating the shape-selective responses that become more robust to occlusion after the PFC feedback. We experimented with different architectures and found that such convergence is crucial for transmitting information between different V4 units. Unless the same PFC unit makes predictions about both the shape-preferred V4 unit (V4 unit 1) and the occluder-selective V4 unit (V4 unit 3), the information about the occlusion level encoded by the occluder-selective V4 unit will not be transmitted to the shape-selective V4 population; this transmission is crucial for maintaining robust shape discrimination and weaker dependence on occlusion. This structure, where the neurons of the lower cortical areas with different tuning properties send convergent signals to neurons in higher cortices, agrees with physiological findings in which signals become more mixed as they travel along the hierarchy (Felleman & Van Essen, 1991; Rigotti et al., 2013; Fusi et al., 2016).

Another feature of our model is that fewer units in PFC (two units) combine to make linear predictions about the responses of a larger number of V4 units (three units). This is also necessary to capture the experimental data. Without this reduction in the number of units, the V4 responses imposed by the bottom-up sensory input can be matched perfectly by the top-down predictions made by PFC units, leading the optimal predictive coding solution to make identical copies of the sensory input at each stage along the hierarchy, which clearly does not occur in experiments. Translating this constraint into biology does not mean that there must be fewer neurons in higher areas of the brain, but rather that fewer functional or active populations can be grouped as single units in the higher area during the task.
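The role of having fewer PFC units than V4 units can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes a two-term quadratic cost of the kind described in the text (a bottom-up term weighted by the inverse input variance plus a top-down prediction-error term weighted by the inverse top-down variance). The helper name `infer`, the weight matrices, and all numerical values are hypothetical.

```python
import numpy as np

def infer(mu, var_bu, var_td, W):
    """Jointly minimize sum((r - mu)^2 / var_bu) + sum((r - W @ p)^2 / var_td).

    r: V4 responses, p: PFC responses. Assumed cost form, for illustration only.
    """
    mu, var_bu, var_td = (np.asarray(a, dtype=float) for a in (mu, var_bu, var_td))
    W = np.asarray(W, dtype=float)
    # Eliminating r leaves a weighted least-squares problem for p,
    # with weights 1 / (var_bu + var_td) on each V4 unit.
    w = 1.0 / np.sqrt(var_bu + var_td)
    p, *_ = np.linalg.lstsq(W * w[:, None], mu * w, rcond=None)
    pred = W @ p
    # Precision-weighted average of bottom-up mean and top-down prediction.
    r = (mu / var_bu + pred / var_td) / (1.0 / var_bu + 1.0 / var_td)
    return r, p

# Hypothetical bottom-up means under high occlusion (V4 units 1, 2, 3).
mu = np.array([4.0, 2.0, 9.0])
var_bu = np.array([5.0, 1.0, 1.0])    # input variances (hypothetical)
var_td = np.array([10.0, 10.0, 1.0])  # top-down variances as quoted in section 3.7

# Two PFC units predicting three V4 units (hypothetical weights).
W_conv = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# Same weights plus a hypothetical third PFC unit: as many PFC as V4 units.
W_full = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]])

r_conv, p_conv = infer(mu, var_bu, var_td, W_conv)
r_full, p_full = infer(mu, var_bu, var_td, W_full)

print("delayed V4, 2 PFC units:", np.round(r_conv, 3))
print("delayed V4, 3 PFC units:", np.round(r_full, 3))
```

Under these assumptions, three PFC units can drive the prediction error to zero, so the delayed V4 responses simply copy the bottom-up means. With two PFC units a residual remains, and the delayed response of V4 unit 1 is pulled above its input-driven value.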

In our model, information about the shape identity $s$ and the occlusion level $c$ is input to V4. The system implements a feedforward-feedback loop involving the higher-area PFC to enhance shape discriminability under occlusion, as illustrated in a state-space view in Figure 6. Without the feedback predictions, during the initial responses, high occlusion moves noisy versions of the responses close to or even above the unity line, obscuring the shape identity (see Figure 6A). However, when the feedback from PFC is included, the responses move away from the unity line, thus clarifying the shape identity under partial occlusion (see Figure 6B). The convergent structure of the network is the key for this effect to occur. Although information about occlusion is initially present at the level of V4, it does not affect the shape-selective V4 units without feedback from PFC. In other words, PFC predictions remap the information about the shape identity and the occlusion level onto the shape-selective space in V4, enhancing the shape discriminability there.

Figure 6:

Shape discriminability under occlusion increases with the top-down prediction. The optimal average firing rates across degrees of occlusion as in Figure 4D (yellow), projected onto the state space of V4 unit 1 (preferred) and unit 2 (nonpreferred) responses. For each occlusion level, 200 responses were generated with white noise with the mean at the optimal average value (yellow) and a standard deviation of 2, arbitrarily chosen for illustration purposes (blue: low occlusion; green: high occlusion). When the population responses are under the unity line (dotted black), $r_{v4,1}>r_{v4,2}$, and the animal concludes that the test shape presented is shape A. The opposite is true for $r_{v4,2}>r_{v4,1}$. Before the top-down prediction (A), the noisy responses under high occlusion (green dots) lie close to the unity line, obscuring the shape identity. With the top-down prediction included (B), the average optimal responses to occluded stimuli are moved horizontally to larger $r_{v4,1}$ values (yellow). Thus, the noisy responses are squeezed together and moved away from the unity line, clarifying the shape identity.
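The state-space picture in this caption can be mimicked with a few lines of code. This is a sketch under stated assumptions: the mean responses before and after feedback are hypothetical placeholders rather than values from Figure 4D, and only the sample count (200) and the noise standard deviation (2) come from the caption. The same noise draws are reused for both conditions so that the comparison reflects only the shift in the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, noise_sd = 200, 2.0
# One shared set of noise draws for (V4 unit 1, V4 unit 2).
noise = rng.normal(0.0, noise_sd, size=(n_samples, 2))

# Hypothetical mean responses under high occlusion.
mean_initial = np.array([3.0, 2.0])  # before top-down prediction
mean_delayed = np.array([5.0, 2.0])  # after: shifted to larger unit 1 response

def fraction_reporting_shape_a(mean):
    """Fraction of noisy samples below the unity line (unit 1 > unit 2)."""
    samples = mean + noise
    return float(np.mean(samples[:, 0] > samples[:, 1]))

frac_initial = fraction_reporting_shape_a(mean_initial)
frac_delayed = fraction_reporting_shape_a(mean_delayed)
print(f"shape A reported: initial {frac_initial:.2f}, delayed {frac_delayed:.2f}")
```

Moving the mean horizontally away from the unity line can only keep or increase the fraction of samples classified as shape A, which is the sense in which feedback clarifies the shape identity in this toy setting.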

We note that recurrent connections among V4 populations—rather than the feedback described—in principle could also transmit information about the occlusion level to the shape-selective neurons. Which mechanism is more effective and efficient is an open question. However, the current experimental evidence showing the delayed peak of responses in V4 arising after PFC responses peak, as well as the strong PFC responses to occlusion, is suggestive of feedback.

In summary, the proposed network, composed of two PFC units and three V4 units, has a parsimonious structure to explain the neuronal responses in the experiments under predictive coding principles.

### 3.6  Structure of Inputs to V4

In the simulations, we have assumed a simple input structure, where the sensory input is determined by probability distributions of V4 responses conditioned on shape identity $s$ and occlusion level $c$. Based on experiments, we model $\mu_1$ for the test shape–preferred V4 unit 1 to decrease from a high firing rate as occlusion levels grow, $\mu_2$ for the nonpreferred V4 unit 2 to stay at a low baseline firing rate, and $\mu_3$ for the occlusion-preferring V4 unit 3 to increase. To provide a firmer basis for this, we show several example V4 neurons in Figure A.1. Example cells in Figure A.1A behave like V4 units 1 and 2, with decreasing responses to preferred shapes under added occlusion and overall low responses to nonpreferred shapes. The cells shown in Figure A.1B may correspond to V4 unit 3, which displays relatively low, shape-selective responses to unoccluded shapes and increasing responses to both preferred and nonpreferred shapes as occlusion level increases. The population-averaged initial peak responses of V4 neurons to preferred and nonpreferred shapes further support our implementation of $\mu$ (see Figure 7A). Specifically, the averaged responses of V4 neurons with two clear transient peaks and shape selectivity decrease for preferred stimuli with added occlusion, whereas responses to nonpreferred shapes stay at a constant, low firing rate across the range of occlusion levels.

Figure 7:

V4 population encoding of shape stimuli. (A) Population-averaged initial peak responses of 39 V4 cells that show two clear transient peaks and shape selectivity. The population-averaged responses (normalized) to preferred shapes (green) decrease as occlusion increases, while those to nonpreferred shapes (blue) remain at a relatively constant low activity level across the range of occlusion levels. (Data adapted with permission from Pasupathy et al., 2015.) (B) Normalized responses of 109 V4 neurons to the shapes displayed in the insets (unoccluded), sorted based on firing rate. The population responses to the shape on the top have a sharp peak indicating a division between the neurons that show a strong preference for the shape and the rest of the neurons. Responses to the shape on the bottom are more distributed across the V4 population. (Data adapted with permission from Pasupathy & Connor, 2002.)

In addition to the peak firing rate $\mu$, another component that forms the input signals is the variance of the V4 response distributions given the sensory input. As discussed earlier, we hypothesize that $\sigma_1$ increases with added occlusion as high degrees of occlusion obscure the shape identity and that $\sigma_2$ is constant across degrees of occlusion, since random placements of occluding dots on a nonpreferred shape will not introduce as much variability in responses as on a preferred shape. We also modeled $\sigma_3$ to stay constant regardless of occlusion level, as this unit responds to the presence of occluders but not their specific configuration. To test this hypothesis, we examined the consequences of other plausible assumptions for how variance depends on occlusion. First, when variances for all three V4 units increase at the same rate (see Figure 8A), the response characteristics of the shape-preferred V4 unit 1 remain unchanged, but the nonpreferred V4 unit 2 shows increasing delayed responses to added occlusion. Such increasing delayed responses of unit 2 are also obtained when the input variances for both V4 units 1 and 2 are increased with occlusion, while the variance for V4 unit 3 is kept constant (see Figures 8B and 8C). When the variances of all V4 units are decreased with added occlusion, we observe very different response patterns (see Figure 8D). Specifically, feedback does not improve the shape discriminability, as the initial and the delayed responses of V4 unit 1 are identical. Based on these simulations, we limit our model to the cases where introducing occlusion increases the input variances for shape-selective V4 units.

Figure 8:

Dependence of input variance on occlusion levels. Model responses across occlusion levels when (A) the variances $\sigma_1$, $\sigma_2$, and $\sigma_3$ increase with added occlusion at the same rate ($\sigma_1=\sigma_2=\sigma_3=1+5c$); (B) the variances for the shape-selective V4 units, $\sigma_1$ and $\sigma_2$, increase at the same rate as occlusion increases, while $\sigma_3$ remains unchanged ($\sigma_1=\sigma_2=1+5c$; $\sigma_3=1$); (C) the variances for the shape-selective V4 units, $\sigma_1$ and $\sigma_2$, both increase, but $\sigma_2$ at a slower rate, while $\sigma_3$ again remains unchanged ($\sigma_1=1+5c$; $\sigma_2=1+2c$; $\sigma_3=1$); and (D) the variances $\sigma_1$, $\sigma_2$, and $\sigma_3$ decrease with added occlusion at the same rate ($\sigma_1=\sigma_2=\sigma_3=1-c$).
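For reference, the four variance schedules listed in this caption can be written down directly. The sketch below only transcribes the formulas from the caption; the function name `variance_schedules` is hypothetical, and treating the occlusion level $c$ as a fraction between 0 and 1 is an assumption.

```python
def variance_schedules(c):
    """Input variances (sigma1, sigma2, sigma3) for each panel of Figure 8.

    c is the occlusion level, assumed here to lie in [0, 1).
    """
    return {
        "A": (1 + 5 * c, 1 + 5 * c, 1 + 5 * c),  # all increase at the same rate
        "B": (1 + 5 * c, 1 + 5 * c, 1.0),        # units 1 and 2 increase, unit 3 fixed
        "C": (1 + 5 * c, 1 + 2 * c, 1.0),        # unit 2 increases more slowly
        "D": (1 - c, 1 - c, 1 - c),              # all decrease with occlusion
    }

for c in (0.0, 0.5):
    for panel, sigmas in variance_schedules(c).items():
        # The bottom-up term is weighted by the inverse variance.
        weights = tuple(round(1.0 / s, 3) for s in sigmas)
        print(f"c={c} panel {panel}: variances={sigmas}, bottom-up weights={weights}")
```

Because the bottom-up term is weighted by the inverse variance, schedules A to C reduce the influence of the sensory input as occlusion grows, whereas schedule D increases it.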

Our simulations show that the experimental results (see Figures 2A, 2B, 7A, and A.1A) are best captured by different rates of variance increase for different V4 units (see Figures 4A and 8C), in particular, when the variances for V4 unit 1 increase with added occlusion, the variances for V4 unit 2 stay constant or increase by a smaller amount compared to V4 unit 1, and the variances for V4 unit 3 stay constant. As a result, shape selectivity under occlusion is consistently improved in the delayed signals (see Figure 8C).

The dependence of variance on occlusion may not be uniquely defined and likely varies among V4 neurons. Indeed, neurons in V4 show a spectrum of different response patterns to nonpreferred stimuli, indicating that different V4 neurons encode input variances in more than one way. For example, the top two rows in Figure A.1A show example V4 neurons whose delayed responses to nonpreferred stimuli do not increase with added occlusion, corresponding to our model simulation with a constant variance for V4 unit 2. The last row in Figure A.1A, however, shows an example V4 neuron with increased delayed responses to occluded nonpreferred shapes. The example V4 cells thus suggest that neurons in V4 may respond to partially occluded, nonpreferred shapes with constant variances or slightly increased variances.

Our original model has only two shape-selective V4 units, one tuned for the test shape and the other not preferring the test shape. A more biologically realistic model would consist of a population of V4 neurons with diverse response properties. To construct this population, we first examined response profiles of a population of 109 neurons in V4 previously reported in Pasupathy and Connor (2002). For some shapes, the disparity between the responses of the neurons preferring the shape and those of the nonpreferring neurons is noticeable, as illustrated in the sorted population responses to a given shape in Figure 7B (top panel). However, for many other shapes, the population of V4 neurons shows more graded responses, as in Figure 7B (bottom panel).

We next expanded the model network to a larger network with 2 PFC units and 30 V4 units (see Figure 9A). Among the V4 units, 10 are occluder-preferred units and the remaining 20 units are shape selective. Instead of dividing the shape-selective V4 units into test shape–preferred and nonpreferred groups, we modeled the V4 units to have a spectrum of peak firing rates and variances for the input-driven responses (see Figure 9B). For the V4 units that are more tuned to the test shape (corresponding to higher values of $\mu$), the input-driven variance increases by larger amounts with added occlusion (see Figure 9B). The connection weights from PFC units to the V4 units are also adjusted accordingly (i.e., so that in Figure 9A, the green PFC unit has stronger connections to the green V4 units compared to the blue V4 units and vice versa).
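One way to picture such a graded population is sketched below. The unit counts (20 shape-selective, 10 occluder-selective) come from the text; everything else, including the linear forms, the parameter values, and the names `preference`, `shape_unit_inputs`, and `occluder_unit_inputs`, is a hypothetical choice for illustration and is not taken from Figure 9B.

```python
import numpy as np

n_shape, n_occluder = 20, 10
# Graded preference for the test shape: 0 = least tuned, 1 = most tuned.
preference = np.linspace(0.0, 1.0, n_shape)

def shape_unit_inputs(c, base=2.0, gain=8.0, drop=6.0, var_slope=5.0):
    """Input-driven peak rates and variances of shape-selective units (hypothetical)."""
    mu = base + preference * (gain - drop * c)   # more tuned units lose more drive
    var = 1.0 + var_slope * preference * c       # and gain more variance with occlusion
    return mu, var

def occluder_unit_inputs(c, base=1.0, gain=8.0):
    """Occluder-selective units: rate grows with occlusion, variance stays fixed."""
    mu = np.full(n_occluder, base + gain * c)
    var = np.ones(n_occluder)
    return mu, var

mu_clear, var_clear = shape_unit_inputs(0.0)
mu_occl, var_occl = shape_unit_inputs(0.8)
print("most-tuned unit:", mu_clear[-1], "->", mu_occl[-1], "variance", var_occl[-1])
print("least-tuned unit:", mu_clear[0], "->", mu_occl[0], "variance", var_occl[0])
```

The only property this sketch is meant to convey is the coupling described in the text: units with higher peak rates for the test shape also receive a larger occlusion-dependent increase in input variance.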

Figure 9:

Model of graded encoding of feedforward sensory inputs across a heterogeneous V4 population. (A) Schematic of an expanded model composed of 2 PFC units, 10 occluder-selective V4 units, and 20 shape-selective V4 units with graded shape preferences. (B) Input-dependent peak firing rates (top) and variances (bottom) as a function of occlusion level, for occluder-preferred V4 units (red) and shape-selective V4 units (blue-green). (C) The model responses of PFC units (i), occluder-selective V4 units (ii), and shape-selective V4 units (iii). A selected number of the shape-selective V4 unit responses are shown in (iv) for a better display. Solid lines indicate initial responses, and dotted lines correspond to delayed responses.

This expanded model yields qualitatively the same results as the simple network with two shape-selective V4 units. The PFC units and the occluder-selective V4 units show increasing responses with added occlusion (see Figure 9C i and ii), and the shape-selective V4 units yield decreasing response patterns with an increase in occlusion (see Figure 9C iii). Moreover, the delayed responses of the shape-selective V4 units obtained by optimizing the full cost function exhibit reduced sensitivity to occlusion, and the effect is stronger for the units with stronger test shape preference. To see this more clearly, Figure 9C iv presents responses of a selected number of the shape-selective V4 units with high, intermediate, and low degrees of preference for the test shape. In sum, the delayed increase in responses to stimuli under occlusion induced by the feedback, in neurons that respond preferentially to the test shape, is maintained in a population of V4 units with graded response properties, validating the predictions made by our simplified model.

### 3.7  Differential Weighting of Feedforward and Feedback Inputs

In our model, the relative strengths of feedback and feedforward interactions are determined by assumptions about levels of variability in the inference errors (the noise terms in equations 2.7 and 2.11) at each network layer (see equations 2.8 and 2.12). Here we ask how these assumptions affect the ability of the model to reproduce trends in experimental data.

Recall that the cost function $E$ in our model has two terms: one based on bottom-up sensory inputs and the other based on top-down predictions (see equation 2.4). The contribution of each of these components is weighted by the inverse variance of the respective probability distribution. The pattern of the optimal responses to occlusion can therefore be modulated by these variances. Here we examine how this occurs and show that the trade-off between feedforward and feedback components achieved by the variances in Figure 4 is necessary to capture the response characteristics observed in experiments.

We first discuss the effects of the variances for the bottom-up input-driven distributions. In the original model (see Figure 4), for the bottom-up component, variances are set equal to 1 for all three V4 populations when the input shape is unoccluded. We also set the variance for the test shape-preferred V4 population (V4 unit 1) to increase as occlusion level increases, to capture the increase in uncertainty of the shape identity in the presence of occlusion. We found that this increase in variance for the preferred V4 unit is necessary to mimic its weaker sensitivity to occlusion when feedback inputs are included. Without the increase in variance, this V4 unit depends relatively more on the bottom-up inputs under high degrees of occlusion, and as a result, it shows a steep decrease in its responses as occlusion increases (see Figure 10A, left panel, green). By increasing the variance of the sensory input-dependent distribution, therefore, the optimal response of this V4 population becomes more dependent on the top-down predictions made by PFC. As the PFC populations respond strongly to occluded stimuli, weighting the bottom-up component less will result in a more gradual decrease in V4 responses to increasing occlusion, as in the original model in Figure 4D.
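The trade-off described above can be made explicit for a single V4 unit. As a sketch, assume the two-term quadratic cost described in the text and hold the top-down prediction $\hat{r}_i$ fixed (in the full model it is optimized jointly with the V4 responses). The symbol $\hat{r}_i$ is introduced here only for illustration.

$$
E_i = \frac{(r_i-\mu_i)^2}{\sigma_i} + \frac{(r_i-\hat{r}_i)^2}{\sigma_i'}
\quad\Longrightarrow\quad
r_i^{*} = \frac{\sigma_i'\,\mu_i + \sigma_i\,\hat{r}_i}{\sigma_i + \sigma_i'}
$$

The weight on the top-down prediction is therefore $\sigma_i/(\sigma_i+\sigma_i')$. For example, with $\sigma_1'=10$ and the illustrative schedule $\sigma_1=1+5c$ from Figure 8, this weight grows from $1/11\approx 0.09$ at $c=0$ to $6/16\approx 0.38$ at $c=1$, so the unit relies more on the PFC prediction as occlusion increases. If $\sigma_1$ stayed at 1, the weight would remain at $0.09$ for all occlusion levels.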

Figure 10:

Model simulations with modified top-down and bottom-up variances predict different response patterns in neuronal units. The responses of each neuronal unit when (A) the bottom-up variance of shape A–selective V4 response distribution $\sigma_1$ stays constant with increasing occlusion, (B) the top-down predictive distributions all have unit variances ($\sigma_1'=\sigma_2'=\sigma_3'=1$), and (C) the top-down variances are all larger than the bottom-up variances ($\sigma_1'=\sigma_2'=\sigma_3'=10$).

Next, we examine the choice of the top-down variances in the original model that successfully captures experimental data. In the original model (see Figure 4), the variances of the top-down component do not depend on the occlusion level and stay at constant values. However, the top-down effect is differentially weighted for each of the V4 populations; it is weighted more for the occluder-selective V4 population ($\sigma_3'=1$) compared to the shape A– and B–selective neurons ($\sigma_1'=\sigma_2'=10$). This is needed to reproduce the rise in PFC responses at higher levels of occlusion. The smaller variance, or equivalently, more weight, on the top-down predictions of the occluder-selective V4 unit drives the PFC unit to follow the same increasing response pattern as the occluder-selective V4 unit. The smaller variance imposed on the top-down prediction for the occluder-selective V4 unit can be interpreted as the top-down predictions having more significance for occlusion than for the identity of the shape.

We investigated the effects of changes in the top-down variances on the response patterns. When the feedback prediction-driven distributions for all V4 units are uniformly weighted with unit variance, the top-down effect becomes more pronounced (see Figure 10B) compared to the case with the variances at the original values (see Figure 4D). As a consequence, the delayed responses of the test shape–preferred V4 unit (V4 unit 1) increase with added occlusion, reflecting strong modulation by PFC (see Figure 10B, left panel, green dotted line). Conversely, when the top-down variances on all three V4 units are set to be larger than the bottom-up variances, relatively more influence is exerted by the bottom-up drive (see Figure 10C). As a result, the feedback no longer increases the robustness of V4 unit 1 responses under partial occlusion (see Figure 10C, left panel, green dotted line).

In sum, we have shown that the ability to reproduce trends in experimental recordings in our predictive coding model requires the balance of top-down and the bottom-up influences that is given by the increase in the input-dependent variance with added occlusion for the test shape-selective neurons and the smaller variance in the top-down prediction on the occluder-selective neurons.

### 3.8  Model Prediction for Responses to Nonsalient Occlusion, Noise, or Reduced Contrast

We have assumed that occlusion is salient and that there is a separate population of cells in V4 that responds preferentially to occlusion. But what happens to predictions of the model when the occlusion is nonsalient—that is, indistinct from the shape? To answer this, we consider the case where the occluder reduces the shape signal but does not activate a dedicated class of V4 neurons. For example, when the occluders are of the same color as the shape or the background, occlusion would increase the ambiguity of the shape identity but would not induce responses in a V4 population separately responsive to a distinct color. Other examples include a decrease in shape clarity by white noise or reduced contrast (illustrated in Figure 11B).

Figure 11:

Model simulation with indiscriminate occlusion or noise does not activate a class of V4 neurons, predicting the top-down signals to have no effect on the V4 responses. (A) Model schematic. Same model as in Figure 4A, but with an input stimulus obscured by nonsalient occlusion, noise, or reduced contrast. (B) Illustration of the input stimuli: shape A with varying degrees of noise, contrast, and nonsalient occlusion with occluders of the same color as the background or the shape. These types of visual ambiguity are not salient while obscuring the shape identity. (C) Inferred PFC responses as a function of fraction of the shape unoccluded (shape clarity). Reduced shape clarity alone does not increase the responses of shape A-selective PFC population. (D) Inferred responses of the shape-selective V4 units before (solid) and after (dotted) the top-down prediction, as a function of occlusion/obscurity level. The responses are depicted by color and line type as in Figure 4D. The responses of the preferred V4 population after the top-down inputs are not distinguishable from those before the top-down inputs. Therefore, the top-down prediction does not improve shape discriminability under occlusion. (E) Model prediction of average firing rates of the occluder-selective V4 population. The nonsalient occlusion does not activate the V4 population selective for some distinct feature (e.g., color) of the occluders. Note that “Fraction Unoccluded” on the $x$-axis means shape clarity in the case of reduced contrast or added noise.

Figure A.1:

Recordings from example V4 cells during the discrimination task. (A) Example V4 cells whose responses to shapes decrease with added occlusion. Their responses to preferred (left panels) and nonpreferred (center panels) shape stimuli under varying degrees of occlusion are shown. The averaged firing rates during the initial peak (solid) and the second peak (dotted) of responses to the preferred (green) and nonpreferred (blue) shape stimuli are plotted in the right panels. (B) Example V4 cells showing strong responses to occlusion. Their preference to occlusion was maintained regardless of whether occluders were presented with preferred (left) or nonpreferred (center) shape stimuli. Shape selectivity decreases with added occlusion in these cells. The averaged firing rates during the initial transient responses to the preferred (dark red) and the nonpreferred (light red) shape stimuli are plotted in the right panels.

Figure B.1:

Simulation with slightly heterogeneous neurons within each population. (A) Model schematic. For visualization, fewer neurons per population and a subset of connections are shown. The actual model includes 10 neurons with similar tuning properties per population. Each neuron in PFC is connected to three V4 neurons from each V4 population, and each V4 neuron is connected to two PFC neurons from each of the two PFC populations. The neurons of green shades prefer shape A and correspond to V4 unit 1, those of blue shades prefer shape B (V4 unit 2), and the neurons of red shades are selective for occluder properties (V4 unit 3). Varied shades of colors for neurons within each population represent slight heterogeneity. (B) Inferred responses of the shape-selective V4 neurons before (solid) and after (dotted) the top-down prediction. The green lines represent the optimal responses of the V4 population selective for the test shape A, and the blue lines are those of the nonpreferred V4 population that responds preferentially to shape B, as in Figure 4D. The lines and the error bars show the averaged responses and the standard deviations across the population of 10 neurons, respectively. (C) The inferred neuronal responses of the 10 sets of V4 neurons, each predicted by a common PFC neuron across degrees of occlusion, are projected onto the state space of V4 unit 1 (preferred) and unit 2 (nonpreferred) responses. The yellow line represents the averaged inferred responses. Responses to high occlusion are green; responses to low occlusion are blue. The left and the right panels show the responses before and after the top-down prediction, respectively.

Figure C.1:

Gradient descent dynamics of neuronal firing rates. (A) The V4 firing rates $r_{v4}$ and the PFC firing rates $r_{pfc}$ during the gradient descent. The optimization function switches from $E_1$ to $E_2$. In other words, the feedback inputs from PFC start being included in the optimization after 400 iterations. The initial peak responses and the delayed peak responses are measured at the end of each optimization process on $E_1$ and $E_2$, respectively (indicated by arrows). (B) The optimal responses of each neuronal unit found by the gradient descent in panel A. (C, D) Same as in panels A and B except that a brief suppression is included before the gradient descent on $E_2$ starts.

Figure D.1:

Model simulations when the connection weights are learned from training on partially occluded shapes. The initial (solid) and the delayed (dotted) responses of the test shape-selective V4 unit 1 (left column) and the squared total errors between the top-down predictions and the inferred responses of the V4 units (right column) when the connection weight matrix is trained with repeated presentations of (A) unoccluded, (B) $30\%$ occluded, and (C) $50\%$ occluded shapes chosen from either shape A or B.

We simulated such nonsalient occlusion and ambiguity in our model by setting $\mu_3$, the peak of the response distribution of the occluder-selective V4 unit conditioned on the sensory stimulus, to a constant. An increase in occlusion or ambiguity in the shape stimulus therefore does not increase the responses of V4 unit 3, as shown in Figure 11E. The peak $\mu_1$ for the shape A-preferring V4 unit, however, is assumed to decrease with occlusion, as in previous simulations. We make these assumptions because the occluder-selective V4 unit 3 is not modeled to detect occlusion specifically, but rather to respond to some occluder-specific feature, such as a distinct color that contrasts with the background. This results in a decrease in the preferred PFC responses with occlusion or ambiguity and only a slight increase in the nonpreferred PFC responses (see Figure 11C). Consequently, the feedback predictions made by PFC do not increase the preferred V4 unit 1 responses when the shape ambiguity (occlusion level) is high. In Figure 11D, the preferred V4 responses after the feedback (dotted green) are indistinguishable from the responses before the feedback (solid green). Our model thus predicts that when the shape signal is reduced in a way that is not salient, the feedback from PFC does not improve shape discriminability.

From the point of view of perception, this prediction seems plausible since we often have more difficulty recognizing an object when the obscurant is not distinct from the object. Moreover, preliminary experimental observations show that PFC neurons do not respond strongly to occluders of the same color as the background. In addition, the responses in the second peak were not observed in V4 neurons when the shapes were obscured by reducing their contrast. While these preliminary observations are in accordance with our model predictions, more data should certainly be collected before conclusions can be drawn.
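The nonsalient-occlusion manipulation described above can be illustrated with a minimal sketch. It assumes that the input means depend linearly on occlusion, as elsewhere in the model; the function name `input_means`, the fixed value for the nonpreferred unit, and all baselines and slopes are hypothetical placeholders rather than the parameters of the reported simulations.

```python
def input_means(occlusion, salient=True):
    """Return (mu1, mu2, mu3): assumed peaks of the bottom-up response
    distributions for V4 unit 1 (shape A), unit 2 (shape B), and unit 3
    (occluder-selective).

    occlusion is the occluded fraction in [0, 1]. Every constant below
    is an illustrative assumption.
    """
    if not 0.0 <= occlusion <= 1.0:
        raise ValueError("occlusion must lie in [0, 1]")
    mu1 = 10.0 - 8.0 * occlusion      # preferred-shape drive falls with occlusion
    mu2 = 2.0                         # nonpreferred-shape drive, held fixed here
    if salient:
        mu3 = 1.0 + 8.0 * occlusion   # salient occluder features drive unit 3
    else:
        mu3 = 1.0                     # nonsalient case: mu3 is a constant
    return mu1, mu2, mu3


# Compare the two conditions at a few occlusion levels.
for c in (0.0, 0.25, 0.5):
    print(c, input_means(c, salient=True), input_means(c, salient=False))
```

In the nonsalient condition only `mu1` changes with occlusion, so V4 unit 3 carries no occlusion signal that PFC could use to form a compensating prediction.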

## 4  Discussion

In this study, we have proposed that robust shape-selective V4 responses under partial occlusion can be explained in the framework of predictive coding and hierarchical Bayesian inference. We have used this framework to construct a model of V4 and PFC in which signals converge as they travel up the hierarchy. In particular, we suggest that top-down predictions made by PFC neurons with mixed selectivity for shape identity and occlusion play a significant role in maintaining robust shape discriminability in V4 under salient partial occlusion. In this model, PFC neurons make linear predictions of V4 activity in the form of feedback signals, and the connection weights are interpreted as storing the memory of the shape identities. We reformulated the traditional framework of predictive coding so that the optimal representations of the internal states of the model V4 and PFC units, rather than the residual errors, are comparable to the electrophysiological recordings in these areas.

Our model suggests that the initial responses in experimental recordings of a class of V4 neurons are purely feedforward and computed solely based on the bottom-up sensory input, while the delayed responses are modulated by both the bottom-up sensory signals and the top-down predictions. The model further shows that the feedback signals in V4 improve the shape discriminability under occlusion by reducing ambiguity in the population representation of the shape identity and that this is achieved by transmission of the occlusion information via a feedforward-feedback loop. This can be viewed as an extension of the concept proposed in Rao and Ballard (1999) where predictions made by higher visual areas with larger receptive fields enable neurons encoding the surround and the center in V1 to share information. In our model of V4, neurons encoding different features of a shape stimulus such as curvature or color share information by predictions made by the higher areas.

The increase in the shape-selective responses of V4 induced by the feedback depends on asymmetric weighting of the top-down and the bottom-up effects: the top-down prediction is weighted more strongly for the occluder-selective neurons, and the dependence of the shape-selective neuronal responses on the sensory input decreases with added occlusion. Future work could test this weighting of the top-down and the bottom-up effects more directly. For example, the top-down predictive component of our model would be weakened by training with a larger set of noisy shape stimuli under various degrees of occlusion, which would introduce larger variance terms in the feedback prediction. Model predictions for experiments where partially occluded shapes are used for training are given in appendix D. Our simulations predict that when training is done with partially occluded shapes, the V4 neurons do not exhibit a delayed increase in shape-selective responses, underlining the significance of initial exposure to unoccluded shapes (see appendix D). If the variances ($\Sigma_1$, $\Sigma_2$) are allowed to be learned as well, using a noisy stimulus set under various degrees of occlusion for training will weaken the top-down influence by increasing the top-down variance $\Sigma_2$. Alternatively, cooling PFC would also place more emphasis on the bottom-up sensory input. Overall, our model predicts a smaller or no increase in shape selectivity during delayed V4 responses in these cases where the effect of the top-down predictive signals is reduced.
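The weighting of top-down and bottom-up effects can be made concrete for a single V4 unit. The following is a minimal sketch, assuming that the unit's response minimizes the sum of two quadratic terms (one centered on the bottom-up mean `mu` with variance `var_bu`, one centered on the top-down prediction `pred` with variance `var_td`) while the PFC prediction is held fixed. The function name and the numerical values are hypothetical.

```python
def precision_weighted_response(mu, var_bu, pred, var_td):
    """Minimizer of (r - mu)**2 / (2*var_bu) + (r - pred)**2 / (2*var_td).

    Setting the derivative with respect to r to zero gives a
    precision-weighted average of the bottom-up mean and the
    top-down prediction.
    """
    if var_bu <= 0 or var_td <= 0:
        raise ValueError("variances must be positive")
    w_bu = 1.0 / var_bu   # bottom-up precision
    w_td = 1.0 / var_td   # top-down precision
    return (w_bu * mu + w_td * pred) / (w_bu + w_td)


# Illustrative values only: a larger bottom-up variance, as assumed under
# heavier occlusion, moves the response toward the top-down prediction.
low_occlusion = precision_weighted_response(mu=4.0, var_bu=1.0, pred=8.0, var_td=4.0)
high_occlusion = precision_weighted_response(mu=4.0, var_bu=4.0, pred=8.0, var_td=4.0)
print(low_occlusion, high_occlusion)
```

Under the same assumption, raising `var_td` (for example, through training on a noisier stimulus set) pulls the response back toward `mu`, which corresponds to the weakened top-down influence discussed above.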

For the input-driven response distributions, we have assumed the mean and the covariance to depend linearly on occlusion level. While this assumption keeps our model simple, detailed neuronal responses may be captured more accurately by implementing nonlinear dependence. For example, the example V4 cell in Figure 2 shows the maximum delayed increase in shape discriminability when the occlusion level is intermediate. However, in our model, the separation between the initial and the delayed V4 responses increases monotonically with added occlusion, more closely resembling example V4 cell 3 in Figure A.1A. The response pattern in Figure 2 can be reproduced by nonlinear dependence on occlusion of the bottom-up mean and covariance (data not shown). Thus, the variability in detailed response patterns across cells may indicate heterogeneous occlusion-dependence functions of individual V4 neurons.

In this way, our model contributes to a new understanding of both neurophysiological and computational mechanisms underlying discrimination of partially occluded shapes in V4, suggesting a possible functional contribution of feedback signals.

### 4.1  Relationship to Previous Models

Several previous theoretical studies investigated the computational mechanisms for recognition of partially occluded shapes, patterns, and objects (Fukushima, 1987, 2001, 2005; Rao, 1997). However, these are strictly feedforward and often overlook feedback computation, in stark contrast to biological networks that feature abundant feedback and recurrent connections. One approach is based on an extended version of the neocognitron, a hierarchical, multilayered, feedforward neural network model (Fukushima, 1987, 2001, 2005). This extended neocognitron has an additional masker layer that detects occluders by differences in brightness and suppresses them at an early stage. A study by Rao (1997, 1999) uses a Kalman filter model and Bayesian optimal estimation theory, maximizing the posterior probability of the internal states. With a robust optimization method that clips large residual errors, the model effectively segments the occluders from the image, treating the occluders as outliers. The physiological mechanisms underlying the robust optimization method, however, are not known.

There have been a number of other modeling studies of V4 tuning to shape contours based on hierarchical feedforward models of object categorization, which have structural similarity to the ventral visual pathway (Fukushima, 1980; Riesenhuber & Poggio, 1999; Serre et al., 2007; Cadieu et al., 2007; Yamins et al., 2014). These models are also purely feedforward, and while they have had successes in reproducing V4 shape selectivity (Cadieu et al., 2007; Yamins et al., 2014), they lack separate mechanisms to account for occlusion. Unlike these previous models, our model bridges hierarchical predictive coding and experimentally recorded response dynamics in area V4 and PFC.

Our model is focused primarily on the encoding of the partially occluded stimulus, while the underlying behavioral task required animals to report whether the two stimuli presented were the same or different. Other literature has proposed how the sensory representation of the test stimulus may be compared to the memory representation of the reference stimulus (Hayden & Gallant, 2013; Murray, Jaramillo, & Wang, 2017; Romo & Salinas, 2003) to derive a behavioral decision, but this is beyond the scope of this article.

### 4.2  Learning the Shape Templates with Connection Weights

Our model modifies the synaptic weights between V4 and PFC neurons during the preliminary phase, which consists of a few presentations of unoccluded shapes. This step corresponds to the initial learning phase in experiments where the animal discriminates an unoccluded pair of shapes used for the session. In this setup, the fast learning of the shape pair after just a few exposures to the shapes is achieved by the memory stored in the synaptic weights between V4 and PFC neurons. When partially occluded shapes are used during the preliminary phase, the system learns different values of synaptic weights and the feedback does not improve shape discriminability (see appendix D). Fast learning, as attested by the shape discrimination task here, has been observed widely, where new sensory stimuli are easily learned with just a few presentations (Seitz, 2010; Rubin, Nakayama, & Shapley, 1997).

Physiological recordings in cortical cells in vitro, however, show only small changes in synaptic strength after a pair of pre- and postsynaptic spikes (Markram, Lubke, Frotscher, & Sakmann, 1997; Bi & Poo, 1998; Gerstner, Kempter, van Hemmen, & Wagner, 1996), suggesting that neurons learn a repeated stimulus more gradually, after a large number of presentations. Such seemingly contradictory evidence from physiology and behavioral observations can be reconciled by introducing stronger synaptic changes than usually observed in vitro, possibly aided by neuromodulation (Fusi, Drew, & Abbott, 2005). More recently, it has been proposed that even weak synaptic plasticity can support fast learning in the balanced regime of excitation and inhibition (Yger, Stimberg, & Brette, 2015). Due to the leverage effect from the excitatory and inhibitory balance in this regime, small synaptic modifications applied to many synapses onto a given neuron result in a large effect (Yger et al., 2015).

### 4.3  Mapping Computational Units in Predictive Coding to Cortical Circuitry

Different algorithms implementing hierarchical predictive coding share the general principle of a generative model: the brain has an internal representation of the world that is actively compared to the actual sensory inputs. However, the precise computational procedures employed by these algorithms, as well as their connections to neuronal populations, are controversial and vary widely across different studies (Spratling, 2008, 2017; Bastos et al., 2012; Bogacz, 2017; Rao & Ballard, 1999; Mumford, 1992).

For example, in our model, the variances of the response distributions of different V4 units given the sensory input or the higher cortical activity are predefined to capture the response characteristics in experiments. However, they can also be treated as parameters to be optimized and assigned their most likely values, with a slight modification of the network structure, as done in a few other models of hierarchical predictive coding. In these studies, the variances are interpreted as synaptic weights and are obtained by minimizing the free energy (Bogacz, 2017; Friston & Kiebel, 2009a, 2009b).

There are varied interpretations of the connections between predictive coding algorithms and the computations performed by cortical circuitry. Cortical areas have laminar structures, and different layers or populations within a cortical area may correspond to different local computational nodes that arise in predictive coding algorithms. However, there is no unifying description of the intracortical connectivity and the local computations within a cortical area. For example, the inhibitory feedback connection implemented in the model proposed by Rao and Ballard (1999) is modified in Spratling (2008, 2017) to reflect excitatory feedback signals observed in physiology. In order to avoid negative responses, Spratling (2008, 2017) also replaced additive excitation and subtractive inhibition in Rao and Ballard (1999) by multiplicative and divisive modulations, respectively. In our model, we follow the approach in Rao and Ballard (1999) and implement additive excitation and subtractive inhibition for simplicity.

Within area V4, there surely are multiple neuronal populations across the laminar structures, and each neuronal node may perform different computations, as suggested by earlier studies. In this work, we have shown that V4 neuronal responses to partially occluded shapes are better explained by the optimal representation of responses than by the residual errors between the current estimates and the predictions, and thus correspond to the node that encodes the current estimates. However, neurons whose responses are less dependent on their tuning to stimulus features but more sensitive to novelty of the stimulus may correspond to the unit that computes residual errors. Investigations of specific neuronal populations within V4-PFC circuitry in the context of the corresponding computational nodes in the predictive coding algorithm will provide a better understanding and validation of our model.

## Appendix A:  More Example V4 Cells

Figure A.1 presents more example V4 cell responses from experimental data.

## Appendix B:  Population Average Responses

In this appendix, the model is extended to include populations of neurons with slight heterogeneity. In our main model, each unit in the model V4 and PFC is considered as a population of neurons with similar tuning properties. For example, V4 unit 1 represents a population of V4 neurons that respond preferentially to shape A, and V4 unit 3 is interpreted as a population of V4 neurons responding strongly to some salient features of the occluders. With each unit representing a neuronal population, the optimal response inferred by minimizing the cost function depicts the average response of each neuronal population. Since the cost function in equation 2.14 increases linearly with added neuronal units that share the same properties with the existing populations, representing a neuronal population as a single unit seems a reasonable simplification.

We now test this simplification explicitly. We performed further numerical simulations with a slightly heterogeneous group of neurons for each neuronal unit in V4 and PFC. The heterogeneity is introduced to the V4 neurons by drawing $\mu$, the mean vector of the feedforward sensory input-driven response distribution, from a normal distribution with a unit standard deviation for each neuron within the population. Therefore, the bottom-up sensory input drives neurons within the same group to converge to slightly different optimal responses. In addition, the initial connection weights and the initial firing rates of the neurons are also slightly heterogeneous, chosen from normal distributions with the means at the initial values used in previous simulations and standard deviations of 0.1 for initial weights and 0.5 for initial firing rates. As a result, the PFC neurons in each population also show weakly heterogeneous optimal representations. Each neuronal population is composed of 10 slightly heterogeneous neurons. Moreover, each PFC neuron sends the prediction signals to one neuron from each of the three V4 populations, and each V4 neuron receives feedback that is a weighted sum of two PFC neurons, one from each PFC population. Therefore, the convergence ratio from V4 to PFC is preserved as in the previous simulations where populations are represented as single units. We have tested a couple of other convergence ratios (e.g., two neurons per V4 population connected to a single PFC neuron) and found that they produce qualitatively similar results.
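The sampling scheme described above can be sketched as follows. This is a minimal illustration, not the simulation code: the function name `sample_population`, the dictionary layout, the simplification of one initial weight per neuron, and the example values passed in are hypothetical. Only the population size (10) and the standard deviations (1 for $\mu$, 0.1 for initial weights, 0.5 for initial firing rates) are taken from the text.

```python
import random


def sample_population(mu_mean, w_init, r_init, n_neurons=10, seed=0):
    """Draw per-neuron parameters for one slightly heterogeneous population.

    mu_mean: population-level mean of the bottom-up response distribution
    w_init:  initial connection weight used in the single-unit model
    r_init:  initial firing rate used in the single-unit model
    """
    rng = random.Random(seed)  # local generator so results are reproducible
    neurons = []
    for _ in range(n_neurons):
        neurons.append({
            "mu": rng.gauss(mu_mean, 1.0),  # unit standard deviation
            "w0": rng.gauss(w_init, 0.1),   # initial weight, SD 0.1
            "r0": rng.gauss(r_init, 0.5),   # initial firing rate, SD 0.5
        })
    return neurons


# Hypothetical population-level values, chosen only for illustration.
population = sample_population(mu_mean=8.0, w_init=1.0, r_init=1.0)
print(len(population), population[0])
```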

The connection weight matrix $u$ is learned during the preliminary phase, and the optimal responses of the V4 and PFC neurons with the learned weights $u$ are obtained by minimizing the cost function $E$ using the same method as in the previous simulations with single unit representation. Figure B.1B shows the averaged inferred responses (dots) and the standard deviation (bars) within the population of the shape A-selective V4 neurons (green) and the shape B-selective V4 neurons (blue) before (solid line) and after (dotted line) the feedback predictions, as a function of unoccluded area. Figure B.1C illustrates the same results in a state-space view for the responses before (left) and after (right) the feedback. The inferred responses of the neurons in the shape A-selective (V4 unit 1) and the shape B-selective (V4 unit 2) populations, predicted by the common PFC neurons, are projected onto the 2D space of the shape A- and shape B-selective population responses. The level of occlusion is indicated by the color bar, and the yellow line represents the population average responses. The population responses shown in Figures B.1B and B.1C match the results from the single-unit representation model in Figures 4 and 6; the shape discriminability increases during the delayed responses when the feedback predictions are included. Although not shown here, the population average responses of the PFC populations and the occluder-selective V4 population also agree with the previous results in Figure 4.

Treating each population as a single unit as done in Figure 1 is therefore a reasonable simplification of the model, which expedites computation while maintaining the core mechanisms of the model. Furthermore, there may be recurrent connections among the neurons within the same group, which reduce the variances among these neurons and further validate representation of these neurons as a single unit.

## Appendix C:  Gradient Descent Dynamics of Neuronal Firing Rates

Figure C.1 displays the gradient descent dynamics of responses of each neuronal unit in the model network.
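The two-stage optimization shown in Figure C.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulation code: it assumes a quadratic cost with diagonal variances, that V4 units descend only on the bottom-up term during the first stage while PFC units descend on the prediction-error term throughout, and that the full cost is used in the second stage. The function names, step size, and all parameter values are hypothetical; only the switch after 400 iterations is taken from the figure.

```python
def predict(u, r_pfc):
    """Top-down prediction u . r_pfc for each V4 unit."""
    return [sum(w * r for w, r in zip(row, r_pfc)) for row in u]


def full_cost(r_v4, r_pfc, mu, var_bu, var_td, u):
    """Assumed second-stage cost: bottom-up term plus prediction-error term."""
    pred = predict(u, r_pfc)
    bottom_up = sum((r - m) ** 2 / v for r, m, v in zip(r_v4, mu, var_bu))
    top_down = sum((r - p) ** 2 / v for r, p, v in zip(r_v4, pred, var_td))
    return 0.5 * (bottom_up + top_down)


def descend(r_v4, r_pfc, mu, var_bu, var_td, u, use_feedback, n_iter=400, lr=0.05):
    """Run n_iter gradient steps; V4 ignores feedback unless use_feedback is True."""
    r_v4, r_pfc = list(r_v4), list(r_pfc)
    for _ in range(n_iter):
        pred = predict(u, r_pfc)
        # Variance-scaled prediction errors for each V4 unit.
        err = [(r - p) / v for r, p, v in zip(r_v4, pred, var_td)]
        grad_v4 = [(r - m) / v for r, m, v in zip(r_v4, mu, var_bu)]
        if use_feedback:
            grad_v4 = [g + e for g, e in zip(grad_v4, err)]
        # PFC units always descend on the prediction-error term.
        grad_pfc = [-sum(u[i][j] * err[i] for i in range(len(u)))
                    for j in range(len(r_pfc))]
        # Update both sets of rates from gradients at the old values.
        r_v4 = [r - lr * g for r, g in zip(r_v4, grad_v4)]
        r_pfc = [r - lr * g for r, g in zip(r_pfc, grad_pfc)]
    return r_v4, r_pfc


# Hypothetical parameters for three V4 units and two PFC units.
mu = [4.0, 1.0, 6.0]
var_bu = [1.0, 1.0, 1.0]
var_td = [1.0, 1.0, 1.0]
u = [[1.0, 0.1], [0.1, 1.0], [1.0, 1.0]]

# Stage 1 gives the initial responses; stage 2 gives the delayed responses.
initial_v4, initial_pfc = descend([0.0] * 3, [0.0] * 2, mu, var_bu, var_td, u,
                                  use_feedback=False)
delayed_v4, delayed_pfc = descend(initial_v4, initial_pfc, mu, var_bu, var_td, u,
                                  use_feedback=True)
print(initial_v4, delayed_v4)
```

With these assumptions, the first stage drives the V4 rates to the bottom-up means, and the second stage moves them toward the PFC prediction by an amount set by the two variances.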

## Appendix D:  Connection Weights Learned with Partially Occluded Shapes

In this appendix, we make predictions on shape discriminability when the synaptic weights store templates of partially occluded shapes instead of unoccluded shapes. This represents the case where the animal has memory of partially occluded shapes rather than being exposed to unoccluded shapes.

In the simulations of the previous sections, the connection weight matrix $u$ is learned based on presentations of the pair of unoccluded shapes in order to mimic the experimental procedure where the animals discriminated a pair of unoccluded shapes at the beginning of each trial. Here we test our model with the weight matrix learned from partially occluded shapes. We train the weight matrix on the pair of the shapes under $30\%$ and $50\%$ occlusion (see Figures D.1B and D.1C) and compare the results to the simulation with the weights learned based on unoccluded shapes (see Figure D.1A).

During the preliminary phase, the gradient descent with respect to the connection weights starts from the initial weight matrix,
$u=\begin{pmatrix}1 & 0.1\\ 0.1 & 1\\ 1 & 1\end{pmatrix}.$
Using $-1$ instead of 0.1 for $u_{2,1}$ and $u_{1,2}$ produces the same qualitative result. When trained with $30\%$ occlusion, the weight matrix converges to
$u=\begin{pmatrix}1.39 & 0.48\\ 0.56 & 1.47\\ 2.13 & 2.13\end{pmatrix},$
and with $50\%$ occlusion, the weight matrix converges to
$u=\begin{pmatrix}1.27 & 0.36\\ 0.29 & 1.19\\ 3.00 & 3.00\end{pmatrix}.$
When unoccluded shapes are presented during the training, the weight matrix converges to
$u=\begin{pmatrix}1.78 & 0.85\\ 0.89 & 1.83\\ 0.95 & 0.95\end{pmatrix}.$
When the weight matrix is trained on partially occluded shapes instead of unoccluded shapes, the connection weights to the occluder-selective V4 population converge to larger values over the course of the preliminary training phase. With the weights learned, the responses of the test-shape-preferred V4 unit (V4 unit 1) are plotted across degrees of occlusion, before (solid line) and after (dotted line) the feedback (see Figure D.1, left column). We also plotted the total sum of the squared error signals from all three V4 units, namely, the unweighted second term of the cost function $E$, $(r_{v4}-u\cdot r_{pfc})^T(r_{v4}-u\cdot r_{pfc})$ (see Figure D.1, right column). When trained on unoccluded shapes, the squared total error is minimum at zero occlusion. When trained on partially occluded shapes with $30\%$ and $50\%$ occlusion, the squared total error is lowest at approximately the respective occlusion levels (see Figure D.1, right column).
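For concreteness, the unweighted squared total error can be computed as in the following minimal sketch. The function name and the firing rates are hypothetical and serve only to exercise the calculation; the weight matrix is the one reported above for training on unoccluded shapes, read with one row per V4 unit and one column per PFC unit.

```python
def squared_prediction_error(r_v4, r_pfc, u):
    """Unweighted (r_v4 - u.r_pfc)^T (r_v4 - u.r_pfc), summed over V4 units."""
    if len(r_v4) != len(u) or any(len(row) != len(r_pfc) for row in u):
        raise ValueError("u needs one row per V4 unit and one column per PFC unit")
    total = 0.0
    for r, row in zip(r_v4, u):
        prediction = sum(w * p for w, p in zip(row, r_pfc))  # top-down prediction
        total += (r - prediction) ** 2
    return total


# Weight matrix reported for training on unoccluded shapes
# (rows: V4 units 1-3; columns: PFC units 1-2).
U_UNOCCLUDED = [[1.78, 0.85], [0.89, 1.83], [0.95, 0.95]]

# Hypothetical firing rates, not model output.
r_pfc = [2.0, 1.0]
r_v4 = [4.5, 3.5, 3.0]
print(squared_prediction_error(r_v4, r_pfc, U_UNOCCLUDED))
```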

The stronger connection weights between the PFC units and the occluder-selective V4 unit 3 that emerge from training on partially occluded shapes change the response pattern of the preferred V4 unit. Due to the stronger weights, the PFC responses are relatively lower overall. The delayed responses of the shape A-preferred V4 unit 1 induced by PFC predictions are then shifted to lower values when the stimulus has no or a low degree of occlusion. As the occlusion level increases, the standard deviation $\sigma_1$ of the preferred V4 unit increases, weakening the bottom-up influence that suppresses the V4 responses under occlusion. As a result, under high occlusion, the optimal representation of the preferred V4 unit 1 responses depends more on the top-down PFC prediction, which reflects the occluder-selective V4 response pattern.

When the training is based on a pair of unoccluded shapes, the responses of the preferred V4 unit are never lower with the feedback than without the feedback across all occlusion levels, and thus, the feedback enhances the responses under higher degrees of occlusion. When trained on shapes with $30\%$ occlusion, the delayed responses are lower than the initial responses under low degrees of occlusion. For occlusion levels higher than about $25\%$ to $30\%$, the delayed responses are higher than the initial responses. When trained on $50\%$ occlusion, the delayed V4 responses are always lower than the initial responses in the occlusion range of $0\%$ to $50\%$. However, the differences between the initial and the delayed responses of the test shape-preferred V4 unit 1 are very small; the initial and the delayed responses are almost identical for the parameter set used here. In addition, across the range of occlusion levels, the errors between the top-down predictions and the inferred V4 activities are smaller when the weights are trained on partially occluded shapes (see Figure D.1, right column).

The deviation from the initial responses (see the solid line in Figure D.1) is on average smaller for the simulations with weights trained on partially occluded shapes. Since the variance $\sigma_3'$ is smaller than $\sigma_1'$ and $\sigma_2'$, the prediction $u\cdot r_{pfc}$ tends to follow the response pattern of the occluder-selective V4 unit, whose responses increase with added occlusion. When the weights are trained on unoccluded shapes, the connection from a PFC unit to the V4 unit 1 with the same shape preference is the strongest, while its connection to the occluder-selective V4 unit 3 is weaker. The increase in the PFC responses with added occlusion is then relatively large, compensating for the effects of the small weights $u_{3,1}$ and $u_{3,2}$. The large increase in PFC responses induced by added occlusion can then evoke a larger deviation in the test shape-selective V4 unit 1 from its initial responses. When the weights are trained on partially occluded shapes, compared to the case with training on unoccluded shapes, the weights to the test shape-selective V4 unit 1 are reduced slightly and the weights to the occluder-selective V4 unit 3 increase significantly. The PFC responses then do not increase as much as the occlusion level increases (the preferred PFC unit responses decrease slightly when trained on $50\%$ occlusion or stay constant when trained on $30\%$ occlusion, while the other PFC unit exhibits increasing responses as occlusion increases; data not shown), and thus exert milder effects on the shape-selective V4 units.

In brief, when the connection weights of the network are trained on partially occluded shapes, the feedback from PFC does not improve the shape discriminability. Our model predicts that seeing the unoccluded shapes and learning them prior to the occluded shape-discrimination task may be a necessary step to benefit from the delayed enhancement of shape discriminability induced by the feedback predictions. Testing this hypothesis will be an interesting future experimental study.

## Acknowledgments

We thank Rajesh Rao, Wyeth Bair, and Joel Zylberberg for many helpful discussions and comments. This work was supported by the Washington Research Foundation Innovation Postdoctoral Fellowship in Neuroengineering to H.C., NEI grant R01EY018839 to A.P., NSF Career Award DMS-1056125 to E.S-B, Vision Core grant P30EY01730 to the University of Washington, and P51 grant OD010425 to the Washington National Primate Research Center. E.S-B wishes to thank the Allen Institute for Brain Science founders, Paul G. Allen and Jody Allen, for their vision, encouragement, and support.

## References

Barbas
,
H.
, &
Mesulam
,
M. M.
(
1985
).
Cortical afferent input to the principalis region of the rhesus monkey
.
Neuroscience
,
15
,
619
637
.
Bastos
,
A. M.
,
Usrey
,
W. M.
,
,
R. A.
,
Mangun
,
G. R.
,
Fries
,
P.
, &
Friston
,
K.
(
2012
).
Canonical microcircuits for predictive coding
.
Neuron
,
76
,
695
711
.
Bi
,
G. Q.
, &
Poo
,
M. M.
(
1998
).
Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type
.
J. Neuroscience
,
18
,
10464
10472
.
Bogacz
,
R.
(
2017
).
A tutorial on the free-energy framework for modelling perception and learning
.
Journal of Mathematical Psychology
,
76
,
198
211
.
Bushnell
,
B. N.
,
Harding
,
P. J.
,
Kosai
,
Y.
,
Bair
,
W.
, &
Pasupathy
,
A.
(
2011
).
Equiluminance cells in visual cortical area V4
.
J. Neuroscience
,
31
(
35
),
12398
12412
.
Bushnell
,
B. N.
,
Harding
,
P. J.
,
Kosai
,
Y.
, &
Pasupathy
,
A.
(
2011
).
Partial occlusion modulates contour-based shape encoding in primate area V4
.
J. Neuroscience
,
31
(
11
),
4012
4024
.
Bushnell
,
B. N.
, &
Pasupathy
,
A.
(
2012
).
Shape encoding consistency across colors in primate V4
.
J. Neurophysiology
,
108
,
1299
1308
.
,
C.
,
Kouh
,
M.
,
Pasupathy
,
A.
,
Connor
,
C. E.
,
Riesenhuber
,
M.
, &
Poggio
,
T.
(
2007
).
A model of V4 shape selectivity and invariance
.
J. Neurophysiology
,
98
,
1733
1750
.
Eghbali
,
R.
,
Pasupathy
,
A.
, &
Bair
,
W.
(
2016
).
Clustering V4 neurons based on their responses to simple shapes
.
Society for Neuroscience Abstracts
.
Felleman
,
D. J.
, &
Van Essen
,
D. C.
(
1991
).
Distributed hierarchical processing in the primate cerebral cortex
.
Cerebral Cortex
,
1
(
1
),
1
47
.
Friston
,
K.
, &
Kiebel
,
S.
(
2009a
).
Predictive coding under the free-energy principle
.
Phil. Trans. R. Soc. B
,
364
,
1211
1221
.
Friston
,
K.
, &
Kiebel
,
S.
(
2009b
).
Cortical circuits for perceptual inference
.
Neural Networks
,
364
,
1093
1104
.
Fukushima
,
K.
(
1980
).
Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position
.
Biological Cybernetics
,
36
,
193
202
.
Fukushima
,
K.
(
1987
).
Neural network model for selective attention in visual pattern recognition $n$ and associative recall
.
Applied Optics
,
26
,
1985
1992
.
Fukushima
,
K.
(
2001
).
Recognition of partly occluded patterns: A neural network model
.
Biological Cybernetics
,
84
,
251
259
.
Fukushima
,
K.
(
2005
).
Restoring partly occluded patterns: A neural network model
.
Neural Networks
,
18
,
33
43
.
Fusi
,
S.
,
Drew
,
P. J.
, &
Abbot
,
L. F.
(
2005
).
Cascade models of synaptically stored memories
.
Neuron
,
45
,
599
611
.
Fusi
,
S.
,
Miller
,
R. K.
, &
Rigotti
,
M.
(
2016
).
Why neurons mix: High dimensionality for higher cognition
.
Current Opinion in Neurobiology
,
37
,
66
74
.
Fyall
,
A. M.
,
El-Shamayleh
,
Y.
,
Choi
,
H.
,
Shea-Brown
,
E.
, &
Pasupathy
,
A.
(
2017
).
Dynamic representation of partially occluded objects in primate prefrontal and visual cortex
.
eLife
,
6
,
e25784
.
Gerstner
,
W.
,
Kempter
,
R.
,
van Hemmen
,
J. L.
,
Wagner
,
H.
, &
Hemmen
,
J.V.
(
1996
).
A neuronal learning rule for sub-millisecond temporal coding
.
Nature
,
383
,
76
81
.
Gregoriou
,
G. G.
,
Rossi
,
A. F.
,
Ungerleider
,
L. G.
, &
Desimone
,
R.
(
2014
).
Lesions of prefrontal cortex reduce attentional modulation of neuronal responses and synchrony in V4
.
Nature Neuroscience
,
17
,
1003
1011
.
Hayden
,
B. Y.
, &
Gallant
,
J. L.
(
2013
).
Working memory and decision processes in visual area V4
.
Frontiers in Neuroscience
,
7
,
18
.
Koch
,
C.
, &
Poggio
,
T.
(
1999
).
Predicting the visual world: Silence is golden
.
Nature Neuroscience
,
2
,
9
10
.
Kosai, Y., El-Shamayleh, Y., Fyall, A. M., & Pasupathy, A. (2014). The role of visual area V4 in the discrimination of partially occluded shapes. J. Neuroscience, 34(25), 8570-8584.
Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. J. Opt. Soc. Am. A Opt. Image Sci. Vis., 20, 1434-1448.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213-215.
Meyers, E. M., Freedman, D. J., Kreiman, G., Miller, E. K., & Poggio, T. (2008). Dynamic population coding of category information in inferior temporal and prefrontal cortex. J. Neurophysiology, 100, 1407-1419.
Miller, E. K., & Cohen, J. D. (2001). An integrative theory of prefrontal cortex function. Annu. Rev. Neurosci., 24, 167-202.
Mumford, D. (1992). On the computational architecture of the neocortex. Biological Cybernetics, 66, 241-251.
Murray, J. D., Jaramillo, J., & Wang, X.-J. (2017). Working memory and decision making in a fronto-parietal circuit model. bioRxiv.
Namima, T., & Pasupathy, A. (2016). Neural responses in the inferior temporal cortex to partially occluded and occluding stimuli. Society for Neuroscience Abstracts.
Ninomiya, T., Sawamura, H., Inoue, K., & Takada, M. (2012). Multisynaptic inputs from the medial temporal lobe to V4 in macaques. PLoS One, 7(12), e52115.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607-609.
Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23), 3311-3325.
Pasupathy, A., & Connor, C. E. (1999). Responses to contour features in macaque area V4. J. Neurophysiology, 82, 2490-2502.
Pasupathy, A., & Connor, C. E. (2001). Shape representation in area V4: Position-specific tuning for boundary conformation. J. Neurophysiology, 86, 2505-2519.
Pasupathy, A., & Connor, C. E. (2002). Population coding of shape in area V4. Nature Neuroscience, 5, 1332-1338.
Pasupathy, A., Fyall, A. M., & Choi, H. (2015). Discriminating partially occluded shapes: Insights from visual and frontal cortex. In Cosyne Annual Abstracts 2015.
Rao, R. P. N. (1997). Correlates of attention in a model of dynamic visual recognition. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 80-86). Cambridge, MA: MIT Press.
Rao, R. P. N. (1999). An optimal estimation approach to visual perception and learning. Vision Research, 39, 1963-1989.
Rao, R. P. N. (2004). Bayesian computation in recurrent neural circuits. Neural Computation, 16, 1-38.
Rao, R. P. N. (2005). Bayesian inference and attentional modulation in the visual cortex. NeuroReport, 16, 1843-1848.
Rao, R. P. N., & Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2, 79-87.
Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2, 1019-1025.
Rigotti, M., Barak, O., Warden, M., Wang, X.-J., Daw, N. D., Miller, E. K., & Fusi, S. (2013). The importance of mixed selectivity in complex cognitive tasks. Nature, 497, 585-590.
Roe, A. W., Chelazzi, L., Connor, C. E., Conway, B. R., Fujita, I., Gallant, J. L., ... Vanduffel, W. (2012). Toward a unified theory of visual area V4. Neuron, 74, 12-29.
Romo, R., & Salinas, E. (2003). Flutter discrimination: Neural codes, perception, memory and decision making. Nature Reviews Neuroscience, 4, 203-218.
Rubin, N., Nakayama, K., & Shapley, R. (1997). Abrupt learning and retinal size specificity in illusory-contour perception. Curr. Biol., 7, 461-467.
Rust, N. C., & Stocker, A. A. (2010). Ambiguity and invariance: Two fundamental challenges for visual processing. Current Opinion in Neurobiology, 20, 382-388.
Schein, S. J., & Desimone, R. (1990). Spectral properties of V4 neurons in the macaque. J. Neuroscience, 10, 3369-3389.
Seitz, A. R. (2010). Sensory learning: Rapid extraction of meaning from noise. Curr. Biol., 20, R643-R644.
Serre, T., Oliva, A., & Poggio, T. (2007). A feedforward architecture accounts for rapid categorization. Proc. Natl. Acad. Sci. USA, 104, 6424-6429.
Spratling, M. W. (2008). Predictive coding as a model of biased competition in visual attention. Vision Research, 48(12), 1391-1408.
Spratling, M. W. (2017). A review of predictive coding algorithms. Brain and Cognition, 112, 92-97.
Srinivasan, M. V., Laughlin, S. B., & Dubs, A. (1982). Predictive coding: A fresh view of inhibition in the retina. Proc. R. Soc. Lond. B Biol. Sci., 216, 427-459.
Ungerleider, L. G., Galkin, T. W., Desimone, R., & Gattass, R. (2008). Cortical connections of area V4 in the macaque. Cereb. Cortex, 18, 477-499.
Yamins, D. L. K., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., & DiCarlo, J. J. (2014). Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proc. Natl. Acad. Sci. USA, 111, 8619-8624.
Yger, P., Stimberg, M., & Brette, R. (2015). Fast learning with weak synaptic plasticity. J. Neuroscience, 35(39), 13351-13362.
Yuille, A., & Kersten, D. (2006). Vision as Bayesian inference: Analysis by synthesis? Trends in Cognitive Sciences, 10, 301-308.
Zeki, S. M. (1973). Colour coding in rhesus monkey prestriate cortex. Brain Res., 53, 422-427.
Zylberberg, J., Murphy, J. T., & DeWeese, M. R. (2011). A sparse coding model with synaptically local plasticity and spiking neurons can account for the diverse shapes of V1 simple cell receptive fields. PLoS Comput. Biol., 7(10), e1002250.