How do humans learn from raw sensory experience? Throughout life, but most obviously in infancy, we learn without explicit instruction. We propose a detailed biological mechanism for the widely embraced idea that learning is driven by the differences between predictions and actual outcomes (i.e., predictive error-driven learning). Specifically, numerous weak projections into the pulvinar nucleus of the thalamus generate top–down predictions, and sparse driver inputs from lower areas supply the actual outcome, originating in Layer 5 intrinsic bursting neurons. Thus, the outcome representation is only briefly activated, roughly every 100 msec (i.e., 10 Hz, alpha), resulting in a temporal difference error signal, which drives local synaptic changes throughout the neocortex. This results in a biologically plausible form of error backpropagation learning. We implemented these mechanisms in a large-scale model of the visual system and found that the simulated inferotemporal pathway learns to systematically categorize 3-D objects according to invariant shape properties, based solely on predictive learning from raw visual inputs. These categories match human judgments on the same stimuli and are consistent with neural representations in inferotemporal cortex in primates.
The fundamental epistemological conundrum of how knowledge emerges from raw experience has challenged philosophers and scientists for centuries. Although there have been significant advances in cognitive and computational models of learning (LeCun, Bengio, & Hinton, 2015; Watanabe & Sasaki, 2015; Ashby & Maddox, 2011) and in our understanding of the detailed biochemical basis of synaptic plasticity (Cooper & Bear, 2012; Lüscher & Malenka, 2012; Urakubo, Honda, Froemke, & Kuroda, 2008; Shouval, Bear, & Cooper, 2002), there is still no widely accepted answer to this puzzle that is clearly supported by known biological mechanisms and also produces effective learning at the computational and cognitive levels. The idea that we learn via an active predictive process was advanced by Helmholtz in his “recognition by synthesis” proposal (von Helmholtz, 1867/2013) and has been widely embraced in a range of different frameworks (de Lange, Heilbron, & Kok, 2018; Summerfield & de Lange, 2014; Clark, 2013; George & Hawkins, 2009; Friston, 2005; Hawkins & Blakeslee, 2004; Rao & Ballard, 1999; Elman et al., 1996; Dayan, Hinton, Neal, & Zemel, 1995; Kawato, Hayakawa, & Inui, 1993; Mumford, 1992; Elman, 1990).
Here, we propose a detailed biological mechanism for a specific form of “predictive error-driven learning” based on distinctive patterns of connectivity between the neocortex and the higher-order nuclei of the thalamus (i.e., the pulvinar; Usrey & Sherman, 2018; Sherman & Guillery, 2006). We hypothesize that learning is driven by the difference between top–down predictions, generated by numerous weak projections into the thalamic relay cells (TRCs) in the pulvinar, and the actual outcomes supplied by sparse, strong driver inputs from lower areas. Because these driver inputs originate in Layer 5 intrinsic bursting (5IB) neurons, the outcome is only briefly activated, roughly every 100 msec (i.e., 10 Hz, alpha). Thus, the prediction error is a temporal difference in activation states over the pulvinar, from an earlier prediction to a subsequent burst of outcome. This temporal difference can drive local synaptic changes throughout the neocortex, supporting a biologically plausible form of error backpropagation (Bp) that improves the predictions over time (Lillicrap, Santoro, Marris, Akerman, & Hinton, 2020; Whittington & Bogacz, 2019; Bengio, Mesnard, Fischer, Zhang, & Wu, 2017; O'Reilly, 1996; Hinton & McClelland, 1988; Ackley, Hinton, & Sejnowski, 1985). The temporal difference form of error-driven learning contrasts with prevalent alternative hypotheses that require a separate population of neurons to compute a prediction error explicitly and transmit it directly through neural firing (Lotter, Kreiman, & Cox, 2016; Ouden, Kok, & Lange, 2012; Friston, 2005, 2010; Rao & Ballard, 1999; Kawato et al., 1993).
In the following, our primary objective is to describe the hypothesized biologically based mechanism for predictive error-driven learning, contrast it with other existing proposals regarding the functions of this thalamocortical circuitry and other ways that the brain might support predictive learning, and evaluate it relative to a wide range of existing anatomical and electrophysiological data. We provide a number of specific empirical predictions that follow from this functional view of the thalamocortical circuit, which could potentially be tested by current neuroscientific methods. Thus, this work proposes a clear functional interpretation of this distinctive thalamocortical circuitry that contrasts with existing ideas in testable ways.
A second major objective is to implement this predictive error-driven learning mechanism in a large-scale computational model that faithfully captures its essential biological features, to test whether the proposed learning mechanism can drive the formation of cognitively useful representations. In particular, we ask a critical question for any predictive learning model: Can it develop high-level, abstract representations while learning from nothing but predicting low-level visual inputs? Most visual object recognition models that provide a reasonable fit to neurophysiological data rely on large human-labeled data sets to explicitly train abstract category information via error Bp (Rajalingham et al., 2018; Cadieu et al., 2014; Khaligh-Razavi & Kriegeskorte, 2014). Thus, it is perhaps not too surprising that the higher layers of these models, which are closer to these category output labels, exhibit a greater degree of categorical organization.
Through large-scale simulations based on the known structure of the visual system, we found that our biologically based predictive learning mechanism developed high-level, abstract representations that significantly diverge from the similarity structure present in the lower layers of the network and systematically categorize 3-D objects according to invariant shape properties. Furthermore, we found in an experiment using the same stimuli that these categories match human similarity judgments and that they are also qualitatively consistent with neural representations in inferotemporal (IT) cortex in primates (Cadieu et al., 2014). In addition, we show that comparison predictive Bp models lacking these biological features (Lotter et al., 2016) did not learn object categories that go beyond the visual input structure. Thus, there may be some important features of the biologically based model that enable this ability to learn higher-level structure beyond that of the raw inputs.
It is important to emphasize that our objectives for these simulations are not to produce a better machine-learning algorithm per se but rather to test whether our biologically based model can capture some of the known high-level, cognitive phenomena that the mammalian brain learns. Thus, we explicitly dissuade readers from the inevitable desire to evaluate the importance of our model based on differences in narrow, performance-based machine learning metrics. As discussed later, there are various engineering-level issues regarding the biologically based model's computational cost and performance, which currently limit its ability to compete with simpler, much larger-scale Bp models, but we do not think these are relevant to the evaluation of the scientific questions of relevance here. In short, this model is an instantiation of a scientific theory, and it should be evaluated on its ability to explain a wide range of data across multiple levels of analysis, just as every other scientific theory is evaluated.
The remainder of the paper is organized as follows. First, we provide a concise overview of the biologically based predictive error-driven learning framework, including the most relevant neural data. Then, we present a small-scale implementation of the model that learns a probabilistic grammar, to illustrate the basic computational mechanisms of the theory. This is followed by the large-scale model of the visual system, which learns by predicting over brief movies of 3-D objects rotating and translating in space. We evaluate this model and compare it to two other predictive learning models that directly use error Bp, based on current deep convolutional neural network (DCNN) mechanisms. Then, we circle back to discuss the relevant biological data in greater detail, along with testable predictions that can differentiate this account from other existing ideas. Finally, we conclude with a discussion of related models and outstanding issues.
PREDICTIVE ERROR-DRIVEN LEARNING IN THE NEOCORTEX AND PULVINAR
Figure 1 shows the thalamocortical circuits characterized by Sherman and Guillery (2006; see also Usrey & Sherman, 2018; Sherman & Guillery, 2013), which have two distinct projections converging on the principal TRCs of the pulvinar, the primary thalamic nucleus that is interconnected with higher-level posterior cortical visual areas (Halassa & Kastner, 2017; Arcaro, Pinsk, & Kastner, 2015; Shipp, 2003). One projection consists of numerous, weaker connections originating in deep Layer 6 of the neocortex (the 6CT corticothalamic projecting cells), which we hypothesize generate a top–down prediction on the pulvinar. The other is a sparse (Rockland, 1996, 1998) and strong driver pathway that originates from lower-level Layer 5IB cells, which we hypothesize provide the outcome. These 5IB neurons fire discrete bursts with intrinsic dynamics having a period of roughly 100 msec between bursts (Saalmann, Pinsk, Wang, Li, & Kastner, 2012; Larkum, Zhu, & Sakmann, 1999; Franceschetti et al., 1995; Silva, Amitai, & Connors, 1991; Connors, Gutnick, & Prince, 1982), which is thought to drive the widely studied alpha frequency of ∼10 Hz that originates in cortical deep layers and has important effects on a wide range of perceptual and attentional tasks (Clayton, Yeung, & Kadosh, 2018; Jensen, Bonnefond, & VanRullen, 2012; Buffalo, Fries, Landman, Buschman, & Desimone, 2011; Mathewson, Gratton, Fabiani, Beck, & Ro, 2009; VanRullen & Koch, 2003). Critically, unlike many other such bursting phenomena, this 5IB bursting occurs in awake animals (Luczak, Bartho, & Harris, 2009, 2013; Sakata & Harris, 2009, 2012), consistent with the presence of alpha in awake, behaving states.
The existing literature generally characterizes the 6CT projection as modulatory (Usrey & Sherman, 2018; Sherman & Guillery, 2013), but a number of electrophysiological recordings from awake, behaving animals clearly show sustained, continuous patterns of neural firing in pulvinar TRC neurons, which is not consistent with the idea that they are only being driven by their phasic bursting 5IB inputs (Zhou, Schafer, & Desimone, 2016; Komura, Nikkuni, Hirashima, Uetake, & Miyamoto, 2013; Saalmann et al., 2012; Bender & Youakim, 2001; Robinson, 1993; Petersen, Robinson, & Keys, 1985; Bender, 1982). Indeed, these recordings show that pulvinar neural firing generally resembles that of the visual areas with which they interconnect, in terms of neural receptive field properties, tuning curves, and so forth. This is important because our predictive learning framework requires that these 6CT top–down projections be capable of directly driving TRC activity. Specifically, in contrast to the standard view, the core idea behind our theory is that the top–down 6CT projections drive a predicted activity pattern across the extent of the pulvinar, which precedes the subsequent outcome activation state driven by the strong 5IB inputs.
Figure 2 illustrates the temporal evolution of activity states according to our predictive learning theory, which is somewhat challenging to convey because the critical signals driving learning unfold over time (O'Reilly, Wyatte, & Rohrlich, 2014, 2017; Kachergis, Wyatte, O'Reilly, de Kleijn, & Hommel, 2014). We hypothesize that synaptic plasticity throughout the cortex is sensitive to the resulting temporal differences that emerge initially in the pulvinar. Thus, unlike other models (as we discuss in depth later), the prediction error here is not captured directly in the firing of a special population of error-coding neurons but rather remains as a temporal difference error signal.
Figure 2 shows a single idealized 100-msec alpha cycle for the purposes of illustration (the actual timing is likely to be more dynamic as discussed next). The activity state in pulvinar TRC neurons, representing a prediction, as driven by the top–down 6CT projections, should develop during the first ∼75 msec, when the 5IB neurons are paused between bursting. Then, the final ∼25 msec largely reflects the strong 5IB bottom–up ground-truth driver inputs when they burst. Thus, the prediction error signal is reflected in the temporal difference of these activation states as they develop over time. In other words, our hypothesis is that the pulvinar is directly representing either the top–down prediction or the bottom–up outcome at any given time, and the temporal difference between these states implicitly encodes a prediction error. Whereas the deep 6CT layer is involved in generating a top–down prediction over the pulvinar, the superficial layer neurons continuously represent the current state, simultaneously incorporating bottom–up and top–down constraints via their own connections with other areas. To ensure that the prediction is not directly influenced by this current state representation (i.e., “peeking at the right answer”), it is important that the 6CT neurons encode temporally delayed information, consistent with available data (Harris & Shepherd, 2015; Thomson, 2010; Sakata & Harris, 2009).
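The timing just described can be made concrete with a minimal sketch. It assumes a rigid 100-msec cycle, a constant prediction pattern for the first 75 msec, and a constant outcome pattern for the final 25 msec. The function name `pulvinar_cycle`, the three-unit patterns, and the use of time-averaged activity are illustrative assumptions, not details of the model itself.

```python
import numpy as np

CYCLE_MS = 100                  # idealized alpha cycle (assumed rigid here)
MINUS_MS = 75                   # 6CT-driven prediction portion
PLUS_MS = CYCLE_MS - MINUS_MS   # 5IB burst-driven outcome portion

def pulvinar_cycle(prediction, outcome):
    """Return per-msec pulvinar activity and the implicit prediction error."""
    prediction = np.asarray(prediction, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    acts = np.empty((CYCLE_MS, prediction.size))
    acts[:MINUS_MS] = prediction    # pulvinar shows the prediction first
    acts[MINUS_MS:] = outcome       # then the burst-driven outcome
    minus = acts[:MINUS_MS].mean(axis=0)
    plus = acts[MINUS_MS:].mean(axis=0)
    # No unit ever fires "the error"; it exists only as this difference.
    return acts, plus - minus

acts, err = pulvinar_cycle([0.8, 0.1, 0.1], [0.0, 1.0, 0.0])
```

In this toy case, the first unit was predicted but did not occur (negative difference), and the second occurred but was barely predicted (positive difference).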
The actual biological system is likely to be much more dynamic than the simplistic cartoon with rigid 100-msec timing, as shown in Figure 2, based on a set of neural mechanisms that can work together to enable it to more flexibly entrain the predictive learning cycle to the environment. These mechanisms would also tend to increase activity and learning associated with unexpected outcomes relative to expected ones, consistent with the observed expectation suppression phenomena (Bastos et al., 2012; Meyer & Olson, 2011; Todorovic, van Ede, Maris, & de Lange, 2011; Summerfield, Trittschuh, Monti, Mesulam, & Egner, 2008).
Specifically, various underlying mechanisms result in neural adaptation, which is generally thought to increase neural activity and learning associated with novel inputs relative to recently familiar ones (Hennig, 2013; Grill-Spector, Henson, & Martin, 2006; Brette & Gerstner, 2005; Müller, Metha, Krauskopf, & Lennie, 1999; Abbott, Varela, Sen, & Nelson, 1997). In the case where outcomes are consistent with prior predictions (i.e., the predictions are accurate), the same population of neurons across pulvinar and cortex should be active over time, whereas unpredicted outcomes will generally activate new subsets of neurons in superficial cortical layers representing the current state. Thus, because of adaptation, there should be a phasic increase in activity in these superficial neurons at the onset of unpredicted stimuli relative to predicted ones. Furthermore, the 5IB neurons downstream of these superficial neurons may be particularly responsive to these phasic activity increases, causing their bursting to coincide preferentially with unexpected outcomes, thereby driving the phase resetting of the alpha cycle to such events. Thus, during a sequence of predicted states, the pulvinar may experience relatively weaker or even absent 5IB driving inputs, until an unpredicted stimulus arises. At this point, error-driven learning would be more strongly engaged as a function of the phasic release from adaptation and 5IB burst activation. We discuss these dynamics more later in the context of the comparison with explicit error (EE) coding models.
We also hypothesize that 5IB bursting preferentially drives the synaptic plasticity processes to take place at that time, because of the strong driving nature of the outputs from these neurons. In computational terms originating with the Boltzmann machine (Hinton & Salakhutdinov, 2006; Ackley et al., 1985), this anchors the target or plus phase to be at this point of 5IB bursting. Furthermore, this means that the predictive nature of the prior minus phase naturally emerges just by virtue of it being the state before 5IB bursting: The learning rule automatically causes that prior state to better anticipate the subsequent state. Thus, even if no prediction was initially generated, learning over multiple iterations will work to create one, to the extent that a reliable prediction can be generated based on internal states and environmental inputs. Likewise, assuming relevant activity traces naturally persist over timescales longer than the alpha cycle, this predictive learning process can take advantage of any such remaining traces to learn across these longer timescales, although it is operating at the faster alpha scale.
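The claim that the minus phase comes to anticipate the plus phase can be illustrated with a deliberately simplified sketch. It uses a single linear layer trained with a delta-rule update on the plus-minus difference, which is not the Leabra learning rule. The weights `W`, the fixed `true_map` standing in for the environment, and the learning rate are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 4, 3
W = np.zeros((n_in, n_out))                 # no prediction exists initially
true_map = rng.normal(size=(n_in, n_out))   # stand-in for the environment
lr = 0.1

def train_step(W, context):
    """One prediction/outcome cycle; updates W in place."""
    minus = context @ W          # state before the burst (prediction)
    plus = context @ true_map    # state at the burst (outcome)
    W += lr * np.outer(context, plus - minus)
    return float(np.mean((plus - minus) ** 2))

errors = []
for _ in range(500):
    context = rng.normal(size=n_in)
    errors.append(train_step(W, context))
```

Because the update is anchored on the outcome state, the preceding state is pulled toward it over repeated cycles, and the squared difference shrinks as long as the outcome is predictable from the available context.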
In short, learning always happens whenever something unexpected occurs, at any point, and drives the development of predictions immediately prior, to the extent such predictions are possible to generate. In the typical laboratory experiment where phasic stimuli are presented without any predictable temporal sequence (which is uncharacteristic of the natural world), there may often be no significant prediction before stimulus onset, and we would expect such stimuli to reliably drive 5IB bursting, which is consistent with available electrophysiological data (Zhou et al., 2016; Komura et al., 2013; Luczak et al., 2009, 2013; Bender & Youakim, 2001; Robinson, 1993; Petersen et al., 1985; Bender, 1982). Thus, unlike Figure 2, such situations would start with a 5IB-triggered plus phase, without a significant minus phase before that.
As may be evident by this point, we are mainly focused on prediction in the sense of the humorous quote: “Prediction is very difficult, especially about the future” (attributable to Danish author Robert Storm Petersen), whereas this term is potentially confusingly used in a much broader sense in most Bayesian-inspired predictive coding frameworks (de Lange et al., 2018; Friston, 2005; Rao & Ballard, 1999). These frameworks use “prediction” to encompass everything from genetic biases to the results of learning in the feedforward synaptic pathways to top–down filling-in or biasing of the current stimulus properties and fairly rarely use it in the “about the future” sense. We think these different phenomena are each associated with different neural mechanisms at different timescales (O'Reilly, Hazy, & Herd, 2016; O'Reilly, Wyatte, Herd, Mingus, & Jilk, 2013; O'Reilly, Munakata, Frank, Hazy, & Contributors, 2012) and thus prefer to treat them separately, while also recognizing that they can clearly interact as well.
Thus, our use of the term “prediction” here refers specifically to “anticipatory” neural firing that predicts subsequent stimuli. We use the term “postdiction” to refer to the operation of this predictive mechanism after a stimulus has been initially processed (to consolidate and more deeply encode, as in an autoencoder model) and distinguish both from top–down excitatory biasing, which directly influences the online superficial layer neural representations of the current stimulus (O'Reilly et al., 2013; Miller & Cohen, 2001; Reynolds, Chelazzi, & Desimone, 1999; Desimone & Duncan, 1995). Finally, many discussions of prediction error in the literature include late, frontally associated processes such as those associated with the P300 ERP component (Holroyd & Coles, 2002). We specifically exclude these from the scope of the mechanisms described here, which are anticipatory, fast, and low level, as is appropriate for the posterior cortical sensory processing areas that interconnect with the pulvinar.
Computational Properties of Predictive Learning in the Thalamocortical Circuits
We next elaborate the connections between the computational properties required for predictive learning and the properties of the circuits interconnecting the cortex and the pulvinar, which appear to be notably well suited for their hypothesized role in predictive learning. We begin with a relatively established interpretation of superficial layer processing, to contextualize subsequent points about the special functions required of the deep layers and the thalamus.
The superficial cortical layers continuously represent the current state: The superficial layer pyramidal neurons are densely and bidirectionally interconnected with other cortical areas and update quickly to new stimulus inputs, with continuous, relatively rapid firing (i.e., up to about 100 Hz for preferred stimuli). These neurons integrate higher-level top–down information with bottom–up sensory information to resolve ambiguities, focus attention, fill in missing information, and generally enhance the consistency and quality of the online representations (O'Reilly, Hazy, & Herd, 2016; O'Reilly, Wyatte, Herd, Mingus, & Jilk, 2013; O'Reilly et al., 2012; Miller & Cohen, 2001; Reynolds et al., 1999; Desimone & Duncan, 1995; Hopfield, 1984; Rumelhart & McClelland, 1982). As noted above, we distinguish this form of top–down processing, which is often most evident during the period after stimulus onset (Lee & Mumford, 2003), from the specifically predictive, anticipatory sort.
Predictions must be insulated against receiving current state information (it is not prediction if you already know what happens): Given that the superficial layers are continuously updating and representing the current state, some kind of separate neural system insulated from this current state information must be used to generate predictions; otherwise, the prediction system can just “cheat” and directly report the current state. It may seem counterintuitive, but making the prediction task harder is actually beneficial, because that pushes the learning to capture deeper, more systematic regularities about how the environment evolves over time. In other words, like any kind of cheating, the cheater itself is cheated because of the reduced pressure to learn, and learning is the real goal.
Predictions take time and space to generate: Nontrivial predictions likely require the integration of multiple converging inputs from a range of higher-level cortical areas, each encoding different dimensions of relevance (e.g., location, motion, color, texture, shape). Thus, sufficient time and space (i.e., neural substrates with relevant connectivity) must be available to integrate these signals into a coherent predicted state, and per the above point, these substrates must be separated from the influence of current state information. This fits with the properties of the layer 6CT neurons and their deep layer inputs, which we hypothesize are insulated from superficial-layer firing by virtue of being driven locally by the 5IB within their own cortical microcolumn, such that the interbursting pause period provides a time window when these deep layers can integrate and generate the prediction.
Biologically, this is consistent with the delayed responses of 6CT neurons (Harris & Shepherd, 2015; Thomson, 2010; Sakata & Harris, 2009). Computationally, these neurons function much like the simple recurrent network (SRN) context layer updating (Elman, 1990; Jordan, 1989), which reflects the prior trial's state, as discussed in detail in the Appendix. The overall duration of the alpha cycle may represent a reasonable compromise between the prediction integration time and the need to keep up with predictions tracking changes in the world. Notably, films are typically shown at 24 Hz, which is only slightly more than twice the ∼10-Hz alpha frequency, suggesting a Nyquist sampling relative to the underlying alpha processing.
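For readers unfamiliar with the SRN, the context-layer update referred to above can be sketched as follows. This is a generic Elman-style forward pass, not the biological model. The weight shapes, the `tanh` nonlinearity, and the name `srn_forward` are assumptions chosen for illustration.

```python
import numpy as np

def srn_forward(inputs, W_in, W_ctx):
    """Run a minimal Elman-style SRN over a sequence of input vectors."""
    context = np.zeros(W_ctx.shape[0])   # no prior state on the first step
    hiddens = []
    for x in inputs:
        # The hidden layer sees the current input plus the *previous* state.
        hidden = np.tanh(x @ W_in + context @ W_ctx)
        hiddens.append(hidden)
        # Context is a delayed copy: it never reflects the current input.
        context = hidden.copy()
    return np.array(hiddens)

rng = np.random.default_rng(0)
W_in = rng.normal(size=(3, 5))
W_ctx = rng.normal(size=(5, 5))
inputs = rng.normal(size=(4, 3))
hiddens = srn_forward(inputs, W_in, W_ctx)
```

The delayed copy plays the role attributed to the 6CT neurons in the text: the information available for prediction lags the current state by one step.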
The predicted state must be directly aligned with the outcome state it predicts: A prediction error is a difference between two states, so these prediction and outcome states must be directly comparable such that their difference meaningfully represents the actual prediction error and not some other kind of irrelevant encoding differences. In other words, the prediction and the outcome must be represented in the same “language,” so that the “words” from the prediction can be directly compared against those of the outcome—if the prediction was in Japanese and the outcome was in English, it would be hard to tell whether the prediction was correct or not. Thus, a common neural substrate with two different input pathways is required, one reflecting the prediction and the other reflecting the outcome, so that both converge onto the same representational system within this common neural substrate. This fits well with the two pathways converging into the pulvinar: the 6CT top–down prediction-generation pathway and the lower-level 5IB driving inputs.
The outcome signal should be as veridical as possible (i.e., directly reflecting the bottom–up outcome) and should arise from lower areas in the hierarchy relative to the corresponding predictive 6CT inputs: Given that the outcome is the driver of learning, if it were to be corrupted or inaccurate, then everything that is learned would then be suspect. To the extent that delusional thinking is present in all people (some more so than others perhaps), this principle must be violated at some level, but for the lowest levels of the perceptual system at least, it is important that strongly grounded, accurate training signals drive learning. The bottom–up, sparse, strongly driving nature of the 5IB projections to the pulvinar can directly convey such veridical outcome signals and ensure that they dominate the activation of their TRC targets. On the basis of indirect available data, it is likely that each pulvinar TRC neuron receives only roughly one to six driver inputs (Sherman & Guillery, 2006, 2011), such that these sparse inputs directly convey the signal from lower layers, without much further mixing or integration (which could distort the nature of the signal). Furthermore, these inputs are likely not plastic (Usrey & Sherman, 2018), again consistent with a need for unaltered, veridical signals. Finally, the TRC neurons are distinctive in having no significant lateral interconnectivity (Sherman & Guillery, 2006), enabling them to faithfully represent their inputs. These properties led Mumford (1991) to characterize the pulvinar as a blackboard, and we further suggest the metaphor of a projection screen upon which the predictions are projected.
The prediction error must drive learning to reduce subsequent prediction errors: Obviously, this is the goal of prediction error learning in the first place, and given that the cortex is what generates predictions, it must be capable of learning based on prediction error signals represented over the pulvinar. Computationally, the critical problem here is “credit assignment”: How do the error signals direct learning in the proper direction for each individual neuron, to reduce the overall prediction error? The error Bp procedure solves this problem (Rumelhart, Hinton, & Williams, 1986) but requires biologically implausible retrograde signaling across the entire network of neural communication (Crick, 1989), to propagate the error proportionally back along the same channels that drive forward activation. Bidirectional connections, which are ubiquitous in the cortex (Markov, Ercsey-Ravasz, et al., 2014; Felleman & Van Essen, 1991) and computationally beneficial for other reasons as noted earlier, can eliminate that problem by “implicitly” propagating error signals via standard neural communication mechanisms along both directions of connectivity (O'Reilly, 1996).
This solution to the credit assignment problem relies on a temporal difference error signal, as originally developed for the Boltzmann machine (Ackley et al., 1985). The bidirectional neural communication at one point in time is encoding and sharing the prediction among the entire network of neurons. Then, this same network of connections is reused at another point in time to encode and communicate the outcome. Mathematically, the difference in activation state across these two points in time, locally at each individual neuron, provides an accurate estimate of the error Bp gradient (O'Reilly, 1996). In effect, this temporal difference tells each neuron which direction it needs to change its activation state to reduce the overall error. The reuse of the very same network of connections across both points in time ensures the overall alignment of the two activation states, as noted above, such that this temporal difference precisely represents the error signal. Although various other schemes for error-driven learning in biologically plausible networks have been proposed (e.g., Lillicrap et al., 2020; Whittington & Bogacz, 2019; Bengio et al., 2017), the temporal difference framework with bidirectional connectivity provides a particularly good fit with the natural temporal ordering of predictive learning (prediction and then outcome) and the extensive bidirectional connectivity of the thalamocortical circuits (Shipp, 2003).
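In symbols, the local temporal difference described above can be sketched as follows. This is the general form of the rule analyzed in O'Reilly (1996), not a full statement of the learning rule used in the model. The notation is assumed here for illustration: x_i is the sending activation, y_j is the receiving activation, superscripts − and + denote the prediction and outcome states, and ε is a learning rate.

```latex
% Local activation difference stands in for the backpropagated error term
\delta_j \;\approx\; y_j^{+} - y_j^{-}

% Resulting weight update (sender activity times receiver difference)
\Delta w_{ij} \;=\; \epsilon \, x_i^{-} \left( y_j^{+} - y_j^{-} \right)

% Symmetric variant, equivalent to contrastive Hebbian learning
\Delta w_{ij} \;=\; \epsilon \left( x_i^{+} y_j^{+} - x_i^{-} y_j^{-} \right)
```

The symmetric variant makes the Boltzmann-machine lineage explicit: the same pairwise product is computed in both phases, and only the difference changes the weight.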
Temporal differences in activation state across the alpha cycle, between prediction and outcome states, must drive synaptic plasticity: The final step needed to connect all of the elements above is that neurons actually modify their synaptic strengths in proportion to the temporal difference error signal. We have recently provided a fully explicit mechanism for this form of learning (O'Reilly et al., 2012), based on a biologically detailed model of spike-timing-dependent plasticity (Urakubo et al., 2008). We showed that, when activated by realistic Poisson spike trains, this spike-timing-dependent plasticity model produces a nonmonotonic learning curve similar to that of the Bienenstock, Cooper, and Munro (BCM) model (Bienenstock, Cooper, & Munro, 1982), which results from competing calcium-driven postsynaptic plasticity pathways (Cooper & Bear, 2012; Shouval et al., 2002). As in the BCM framework, we hypothesized that the threshold crossover point in this nonmonotonic curve moves dynamically—if this happens on the alpha timescale (Lim et al., 2015), then it can reflect the prediction phase of activity, producing a net error-driven learning rule based on a subsequent calcium signal reflecting the outcome state. The resulting learning mechanism naturally supports a combination of both BCM-style Hebbian learning and error-driven learning, where the BCM component acts as a kind of regularizer or bias, similar to weight decay (O'Reilly et al., 2012; O'Reilly & Munakata, 2000).
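The interaction between a nonmonotonic curve and a floating threshold can be sketched under strong simplifying assumptions. The piecewise-linear shape, the reversal parameter `d_rev = 0.1`, and both function names are illustrative choices rather than values given in the text. Here the synaptic "drive" is taken to be the product of sending and receiving activity.

```python
def bcm_like_dw(drive, threshold, d_rev=0.1):
    """Nonmonotonic weight change as a function of synaptic drive.

    Below d_rev * threshold the curve dips linearly from zero; above it,
    the change is (drive - threshold), crossing zero at the threshold.
    """
    if drive > threshold * d_rev:
        return drive - threshold
    return -drive * (1.0 - d_rev) / d_rev

def error_driven_dw(x_minus, y_minus, x_plus, y_plus):
    """Use the prediction-phase product as the floating threshold."""
    return bcm_like_dw(x_plus * y_plus, x_minus * y_minus)
```

In the linear region, this reduces to the outcome-phase product minus the prediction-phase product, so a threshold that tracks the prediction state converts a BCM-like curve into an error-driven rule. A more slowly moving threshold would yield the Hebbian, BCM-style component instead.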
Thus, remarkably, the pulvinar and associated thalamocortical circuitry appear to provide precisely the necessary ingredients to support predictive error-driven learning, according to the above analysis. Interestingly, although Sherman and Guillery (2006) did not propose a predictive learning mechanism as just described, they did speculate about a potential role for this circuit in motor forward-model learning and the predictive remapping phenomenon (Usrey & Sherman, 2018; Sherman & Guillery, 2011). In addition, Pennartz, Dora, Muckli, and Lorteije (2019) also suggested that the pulvinar may be involved in predictive learning, but within the EE coding framework and not involving the detailed aspects of the above-described circuitry.
It bears emphasizing the synergy between the various considerations above for the benefits of the pause in 5IB firing between bursts. First, this pause is critical for creating the time window when the predictive network is representing and communicating the prediction state, without influence from the outcome state. Furthermore, it creates the temporal difference in activation state in the pulvinar between prediction and outcome, which is needed for driving error-driven learning. Thus, for both the 6CT and pulvinar layers, the periodic pausing of 5IB neurons is essential for creating the predictive learning dynamic. Interestingly, by these principles, the lack of such burst/pause dynamics in the driver inputs to first-order sensory thalamus areas such as the lateral geniculate nucleus and medial geniculate nucleus (Sherman & Guillery, 2006) means that these areas should not be directly capable of error-driven predictive learning. This is consistent with a number of models and theoretical proposals suggesting that primary sensory areas may learn predominantly through Hebbian-style self-organizing mechanisms (Bednar, 2012; Miller, 1994). Nevertheless, primary sensory areas do receive “collateral” error signals from the pulvinar (Shipp, 2003), which could provide some useful indirect error-driven learning signals.
Note that this form of temporal difference learning signal is distinct from the widely used temporal-difference model in reinforcement learning (Sutton & Barto, 1998), which is scalar and applies to reward expectations, not sensory predictions (although see Gardner, Schoenbaum, & Gershman, 2018, and Dayan, 1993, for potential connections between these two forms of prediction error). Finally, as we discuss later, this proposed predictive role for the pulvinar is compatible with the more widely discussed role it may play in attention (Fiebelkorn & Kastner, 2019; Zhou et al., 2016; Saalmann & Kastner, 2011; Snow, Allen, Rafal, & Humphreys, 2009; Bender & Youakim, 2001; LaBerge & Buchsbaum, 1990). Indeed, we think these two functions are synergistic (i.e., you predict what you attend, and vice versa; Richter & de Lange, 2019) and have initial computational results consistent with this idea.
PREDICTIVE LEARNING OF TEMPORAL STRUCTURE IN A PROBABILISTIC GRAMMAR
To illustrate and test the predictive learning abilities of this biologically based model, we first ran a classical test of sequence learning (Cleeremans & McClelland, 1991; Reber, 1967) that has been explored using SRNs (Elman, 1990; Jordan, 1989). The biologically based model was implemented using the Leabra algorithm, which is a comprehensive framework that uses conductance-based point neuron equations, inhibitory competition, bidirectional connectivity, and the biologically plausible temporal difference learning mechanism described above (O'Reilly et al., 2012, 2016; O'Reilly & Munakata, 2000; O'Reilly, 1996, 1998). Leabra serves as a model of the bidirectionally connected processing in the cortical superficial layers and has been used to simulate a large number of different cognitive neuroscience phenomena. It is described in the Appendix, which also provides a detailed mapping between the SRN and our biological model.
As shown in Figure 3, sequences were generated according to a finite state automaton (FSA) grammar, as used in implicit sequence learning experiments by Reber (1967). Each node has a 50% random branching to two different other nodes, and the labels generated by node transitions are locally ambiguous (except for the B = begin and E = end states). Thus, integration over time and across many iterations is required to infer the systematic underlying grammar. It is a reasonably challenging task for SRNs and people to learn and provides an important validation of the power of these predictive learning mechanisms. Given the random branching, accurately predicting the specific path taken is impossible, but we can score the model's output as correct if it activates either or both of the possible branches for each state.
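The sequence generation and lenient scoring just described can be sketched as follows. This is only an illustration under stated assumptions: the transition table is a commonly used form of the Reber grammar and may differ in detail from Figure 3, the function names are hypothetical, and treating any activation outside the two valid branches as an error is our reading of the scoring rule rather than something stated in the text.

```python
import random

# Assumed transition table (a commonly used form of the Reber grammar;
# Figure 3 may differ in detail). Each node maps to two (label, next_node)
# branches, each taken with 50% probability; node 5 is the final node.
GRAMMAR = {
    0: [("T", 1), ("P", 2)],
    1: [("S", 1), ("X", 3)],
    2: [("T", 2), ("V", 4)],
    3: [("X", 2), ("S", 5)],
    4: [("P", 3), ("V", 5)],
}
FINAL_NODE = 5


def generate_sequence(rng):
    """Return (label, valid_labels) pairs for one pass through the grammar.

    valid_labels is the set of labels that could have occurred at that
    step, which is what a prediction is scored against.
    """
    steps = [("B", {"B"})]
    node = 0
    while node != FINAL_NODE:
        branches = GRAMMAR[node]
        label, next_node = rng.choice(branches)
        steps.append((label, {lab for lab, _ in branches}))
        node = next_node
    steps.append(("E", {"E"}))
    return steps


def prediction_correct(predicted, valid_labels):
    """Correct if one or both valid branches are active and nothing else is.

    The "nothing else" condition is an assumption of this sketch.
    """
    return bool(predicted) and predicted <= valid_labels
```

Calling `generate_sequence(random.Random(0))` yields one labeled sequence; because the branch taken is random, only the set of valid labels at each step, not the specific label, can be predicted reliably.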
The model (Figure 4) required around 20 epochs of 25 sequences through the grammar to learn it to the point of making no prediction errors for five epochs in a row (which guarantees that it had completely learned the task). This model is available in the standard emergent distribution at github.com/emer/leabra/tree/master/examples/deep_fsa. A few steps through a sequence are shown in Figure 4, illustrating how the corticothalamic (CT) context layer, which drives the P pulvinar layer prediction, represents the information present on the previous alpha cycle time step. Thus, the network is attempting to predict the current input state, which then drives the pulvinar plus phase at the end of each alpha cycle, as shown in the last panel. On each trial, the difference between plus and minus phases locally over each cortical neuron drives its synaptic weight changes, which accumulate over trials to allow accurate prediction of the sequences, to the extent possible given their probabilistic nature.
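To make the local character of this learning signal concrete, the following is a minimal sketch of a phase-difference weight update together with the stopping criterion used above. It is a deliberately simplified stand-in, not the Leabra learning rule itself (which is described in the Appendix); the function names, the learning rate, and the use of a single sending activity per trial are all assumptions made for illustration.

```python
def phase_difference_update(weights, sender_acts, minus_acts, plus_acts, lrate=0.1):
    """Return new weights after one trial.

    weights[i][j] connects sender i to receiver j. Each weight changes in
    proportion to the sending activity times the receiver's plus-phase
    (outcome) minus minus-phase (prediction) activity, so the update uses
    only quantities available at that synapse. Simplified illustration,
    not the actual Leabra equations.
    """
    return [
        [w + lrate * s * (plus - minus)
         for w, minus, plus in zip(row, minus_acts, plus_acts)]
        for row, s in zip(weights, sender_acts)
    ]


def reached_criterion(errors_per_epoch, n_in_a_row=5):
    """True once the last n_in_a_row epochs all had zero prediction errors."""
    recent = errors_per_epoch[-n_in_a_row:]
    return len(recent) == n_in_a_row and all(e == 0 for e in recent)
```

When the prediction already matches the outcome, the plus and minus activities are equal and the weights are left unchanged, which is why learning stops once the sequences are predicted as accurately as their probabilistic nature allows.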
PREDICTIVE LEARNING OF OBJECT CATEGORIES IN IT CORTEX
Now, we describe a large-scale, systems-neuroscience implementation of the proposed thalamocortical predictive error-driven learning framework, in a model of visual predictive learning (Figure 5). Our second major objective, and a critical question for predictive learning, is determining whether the model can develop high-level, abstract ways of representing the raw sensory inputs, while learning from nothing but predicting these low-level visual inputs. We showed the model brief movies of 156 3-D object exemplars drawn from 20 different basic-level categories (e.g., car, stapler, table lamp, traffic cone) selected for their overall shape diversity from the CU3D-100 data set (O'Reilly et al., 2013). The objects moved and rotated in 3-D space over eight movie frames, where each frame was sampled at the alpha frequency (Figure 5B). Because the motion and rotation parameters were generated at random on each sequence, this data set consists of 512,000 unique images, and there is no low-dimensional object category training signal, so the usual concerns about overfitting and training versus testing sets are not applicable: Our main question is what kind of representations self-organize as a result of this purely visual experience.
There were also saccadic eye movements every other frame, introducing an additional, realistic, predictive learning challenge. An efferent copy signal enabled full prediction of the effects of the eye movement and allowed the model to capture the signature predictive remapping phenomenon (Neupane, Guitton, & Pack, 2017; Cavanagh, Hunt, Afraz, & Rolfs, 2010; Duhamel, Colby, & Goldberg, 1992). The only learning signal available to the model was the prediction error generated by the temporal difference between what it predicted it would see in the V1 input in the next frame and what was actually seen.
As described in detail in the Appendix, our model was constructed to capture critical features of the visual system, including the major division between a dorsal “where” pathway and a ventral “what” pathway (Ungerleider & Mishkin, 1982), and the overall hierarchical organization of these pathways derived from detailed connectivity analyses (Markov, Ercsey-Ravasz, et al., 2014; Markov, Vezoli, et al., 2014; Felleman & Van Essen, 1991; Rockland & Pandya, 1979). In addition to these biological constraints, we conducted extensive exploration of the connectivity and architecture space and found a remarkable convergence between what worked functionally and the known properties of these pathways (O'Reilly et al., 2017). For example, the feedforward pathway has projections from lower-level superficial layers to superficial layers of higher levels, whereas feedback originated in both the superficial and deep layers and projected back to both (Felleman & Van Essen, 1991; Rockland & Pandya, 1979). In addition, consistent with the core features of the pulvinar pathways discussed above, deep layer predictive (6CT) inputs originated in higher levels, whereas driver (5IB) inputs originated in lower levels. For simplicity, we organized the model layers in terms of these driver inputs, whereas the topographic organization of pulvinar in the brain is organized more according to the 6CT projection loops (Shipp, 2003).
Another important parameter is the strength of deep-layer recurrent projections, which influences the timescale of temporal integration, producing a simple biologically based version of slow feature analysis (Wiskott & Sejnowski, 2002; Foldiak, 1991). We followed the biological data suggesting that recurrence increases progressively up the visual hierarchy (Chaudhuri, Knoblauch, Gariel, Kennedy, & Wang, 2015). It was essential that the “where” pathway learn first, consistent with extant data (Kiorpes, Price, Hall-Haro, & Movshon, 2012; Bourne & Rosa, 2006), including early pathways interconnecting lateral intraparietal (LIP) and pulvinar (Bridge, Leopold, & Bourne, 2016), and a rare asymmetric pathway, from V1 to LIP (Markov, Ercsey-Ravasz, et al., 2014), providing a direct shortcut for high-level spatial representations in LIP. Results from various informative model architecture and parameter manipulations are discussed below after the primary results from the standard intact model.
Learning curves and other model details are shown in the Appendix. We have also implemented a full de-novo replication of the model in a new modeling framework, which also replicated the results shown here. Furthermore, much of the model was originally developed in the context of a set of object-like patterns generated systematically from a set of simple line features (O'Reilly et al., 2017), and the parameters that work best in terms of combinatorial generalization on those patterns also worked well for these 3-D objects. Thus, we are confident that the model's learning behavior is not idiosyncratic to the particular set of objects used here and represents a general capacity of the system to develop abstract representations through predictive learning. Other ongoing work to be reported in an upcoming publication is applying the model to prediction of auditory speech inputs, which has a natural temporal structure, and finding similar results in terms of learning higher-level abstract encoding of these auditory signals.
To directly address the question of whether the hierarchical structure of the network supports the development of abstract, higher-level representations that go beyond the information present in the visual inputs, we applied a second-order similarity measure across the object-level similarity matrices computed at each layer in the network (Figure 6). This shows the extent to which the similarity matrix across objects in one layer is itself similar to the object similarity matrix in another layer, in terms of a correlation measure across these similarity matrices. Critically, this measure does not depend on any kind of subjective interpretation of the learned representations—it only tells us whether whatever similarity structure was learned differs across the layers. Starting from either V1 compared to all higher layers, or the highest TE layer compared to all lower layers, we found a consistent pattern of progressive emergence of the object categorization structure in the upper IT pathway (TEO, TE).
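The second-order similarity measure can be sketched in a few lines. This is a minimal illustration, assuming that each layer is summarized by one activity vector per object and that Pearson correlation is used at both levels; the function names are hypothetical, and the exact similarity measure used for Figure 6 is the one given in the Appendix.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def similarity_matrix(patterns):
    """Object-by-object correlation matrix for one layer's activity patterns."""
    return [[pearson(a, b) for b in patterns] for a in patterns]


def upper_triangle(matrix):
    """Entries above the diagonal, flattened (each object pair once)."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def second_order_similarity(patterns_a, patterns_b):
    """Correlation between two layers' object similarity matrices."""
    sim_a = upper_triangle(similarity_matrix(patterns_a))
    sim_b = upper_triangle(similarity_matrix(patterns_b))
    return pearson(sim_a, sim_b)
```

A value near 1 means that two layers group the objects in the same way, even if their activity patterns have different sizes or scales; lower values indicate that the similarity structure itself has changed between the layers.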
This analysis confirms that indeed the IT category structure is significantly different from that present at the level of the V1 primary visual input. Thus, the model, despite being trained only to generate accurate visual input-level predictions, has learned to represent these objects in an abstract way that goes beyond the raw input-level information. We further verified that, at the highest IT levels in the model, a consistent, spatially invariant representation is present across different views of the same object (e.g., the average correlation across frames within an object was .901).
To better understand the nature of these learned representations, Figure 7 shows a representational similarity analysis (RSA) on the activity patterns at the highest IT layer (TE), which reveals the explicit categorical structure of the learned representations (Cadieu et al., 2014; Kriegeskorte, Mur, & Bandettini, 2008). Specifically, we found that the highest IT layer (TE) produced a systematic organization of the 156 3-D objects into five categories. In our admittedly subjective judgment, these categories seemed to correspond to the overall shape of the objects, as shown by the object exemplars in Figure 7 (pyramid shaped, vertically elongated, round, boxy/square, and horizontally elongated). Furthermore, the basic-level categories were subsumed within these broader shape-level categories, so the model appears to be sensitive to the coherence of these basic-level categories as well, but apparently, their shapes were not sufficiently distinct between categories to drive differentiated TE-level representations for each such basic-level category.
Given that the model only learns from a passive visual experience of the objects, it has no access to any of the richer interactive multimodal information that people and animals would have. Furthermore, as evident in Figure 5B, the relatively low resolution of the V1 layers (required to make the model tractable computationally) means that complex visual details are not reliably encoded (and, even so, are not generally reliable across object exemplars), such that the overall object shape is the most salient and sensible basis for categorization for this model.
Although these object shape categories appeared sensible to us, we ran a simple experiment to test whether a sample of 30 human participants would use the same category structure in evaluating the pairwise similarity of these objects. Figure 7B shows the results, confirming that indeed this same organization of the objects emerged in their similarity judgments. These judgments were based on the V1 reconstruction as shown in Figure 5B to capture the model's coarse-grained perception (see Appendix for methods and further analysis).
The progressive emergence of increasingly abstract category structure across visual areas, evident in Figure 6, has been investigated in recent comparisons between monkey electrophysiological recordings and DCNNs, which provide a reasonably good fit of the overall progressive pattern of increasingly categorical organization (Cadieu et al., 2014). However, these DCNNs were trained on large data sets of human-labeled object categories, and it is perhaps not too surprising that the higher layers closer to these category output labels exhibited a greater degree of categorical organization. In contrast, because the only source of learning in our model comes from prediction errors over the V1 input layers, the graded emergence of an object hierarchy here reflects a truly self-organizing learning process.
Figure 8 compares the similarity structures in Layers V4 and IT in macaque monkeys (Cadieu et al., 2014) with those in corresponding layers in our model. In both the monkeys and our model, the higher IT layer builds upon and clarifies the noisier structure that is emerging in the earlier V4 layer, showing that our model replicates the essential qualitative hierarchical progression in the brain. As noted, we would not expect our model to exactly replicate the detailed object-specific similarity structure found in macaques, because of the impoverished nature of our model's experience, so this comparison remains qualitative in terms of the respective differences between V4 and IT in each model, rather than a direct comparison of the similarity structure between corresponding layers in the model and the macaque. In the future, when we can scale up our model and tune the attentional processing dynamics necessary to deal with cluttered visual scenes, we will be able to train our model on the same images presented to the macaques and can provide this more direct comparison.
Finally, we did not use analyses based on decoding techniques, because with high-dimensional distributed neural representations, it is generally possible to decode many different features that are not otherwise compactly and directly represented (Fusi, Miller, & Rigotti, 2016). In preliminary work using decoding in the context of the simpler feature-based input patterns, we indeed found that decoding was not a very sensitive measure of the differentiation of representations across layers, which is so clearly evident in Figure 6. Thus, as advocates of the RSA approach have argued, measuring similarity structure evident in the activity patterns over a given layer generally provides a clearer picture of what that layer is explicitly encoding (Kriegeskorte et al., 2008).
In summary, the model learned an abstract category organization that reflects the overall visual shapes of the objects as judged by human participants, in a way that is invariant to the differences in motion, rotation, and scaling that are present in the V1 visual inputs. We are not aware of any other model that has accomplished this signature computation of the ventral “what” pathway in a purely self-organizing manner operating on realistic 3-D visual objects, without any explicit supervised category labels. Furthermore, our model does this using a learning algorithm directly based on detailed properties of the underlying biological circuits in this pathway, providing a coherent overall account.
Backpropagation Comparison Models
To help discern some of the factors that contribute to the categorical learning in our model and provide a comparison with more widely used error Bp models, we tested a Bp-based version of the same “what vs. where” architecture as our biologically based predictive error model, and we also tested a standard PredNet model (Lotter et al., 2016) with extensive hyperparameter optimization (see Appendix). Because of the constraints of Bp, we had to eliminate any bidirectional connectivity loops in the Bp version, but we were able to retain a form of predictive learning by configuring the V1p pulvinar layer as the final target output layer, with the target being the next visual input relative to the current V1 inputs.
Figure 9 shows the same second-order similarity analysis as Figure 6, to determine the extent to which these comparison networks also developed more abstract representations in the higher layers that diverge from the similarity structure present in the lowest layers. According to this simple objective analysis, they did not—the higher layers showed no significant, progressive divergence in their similarity structure. The PredNet model did show a larger difference between the first layer and the rest of the layers, because of the subsequent layers encoding errors while the first layer has a positive representation of the image, but there was no progressive difference beyond that up into the higher layers.
Next, we examined the RSA matrices for the highest (TE) layer in the comparison models, also in comparison with the same for the V1 layer (Figure 10). This shows that the TE layer in the Bp model formed a simple binary category structure overall, which is similar to the RSA for the V1 input layer. It is also important to emphasize that the scales on these figures are different (as shown in their headers), such that these comparison models had much less differentiated representations overall. Similar results were found in the PredNet model. Because existing work with these models has typically relied on additional supervised learning and decoder-based analyses (which are essentially equivalent to an additional layer of supervised learning), these RSA-based analyses provide an important, more sensitive way of determining what they learn purely through predictive learning.
These results show that the additional biologically derived properties in our model are playing a critical role in the development of abstract categorical representations that go beyond the raw visual inputs. These properties include excitatory bidirectional connections, inhibitory competition, and an additional Hebbian form of learning that serves as a regularizer (similar to weight decay) on top of predictive error-driven learning (O'Reilly & Munakata, 2000; O'Reilly, 1998). Each of these properties could promote the formation of categorical representations. Bidirectional connections enable top–down signals to consistently shape lower-level representations, creating significant attractor dynamics that cause the entire network to settle into discrete categorical attractor states. Another indication of the importance of bidirectional connections is that a greedy layer-wise pretraining scheme, consistent with a putative developmental cascade of learning from the sensory periphery on up (Valpola, 2015; Bengio, Yao, Alain, & Vincent, 2013; Hinton & Salakhutdinov, 2006; Shrager & Johnson, 1996), did not work in our model. Instead, we found it essential that higher layers, with their ability to form more abstract, invariant representations, interact and shape learning in lower layers right from the beginning.
Furthermore, the recurrent connections within the TEO and TE layers likely play an important role by biasing the temporal dynamics toward longer persistence (Chaudhuri et al., 2015). By contrast, Bp networks typically lack these kinds of attractor dynamics, and this could contribute significantly to their relative lack of categorical learning. Hebbian learning drives the formation of representations that encode the principal components of activity correlations over time, which can help more categorical representations coalesce (and results below already indicate its importance). Inhibition, especially in combination with Hebbian learning, drives representations to specialize on more specific subsets of the space.
Ongoing work is attempting to determine which of these is essential in this case (perhaps all of them) by systematically introducing some of these properties into the Bp model, although this is difficult because full bidirectional recurrent activity propagation, which is essential for conveying error signals top–down in the biological network, is incompatible with the standard efficient form of error Bp, and requires significantly more computationally intensive and unstable forms of fully recurrent Bp (Williams & Zipser, 1992; Pineda, 1987). Furthermore, Hebbian learning requires dynamic inhibitory competition, which is difficult to incorporate within the Bp framework.
Architecture and Parameter Manipulations
Figure 11 shows only a few of the large number of parameter manipulations that have been conducted to develop and test the final architecture. For example, we hypothesized that separating the overall prediction problem between a spatial “where” versus nonspatial “what” pathway (Goodale & Milner, 1992; Ungerleider & Mishkin, 1982) would strongly benefit the formation of more abstract, categorical object representations in the “what” pathway. Specifically, the “where” pathway can learn relatively quickly to predict the overall spatial trajectory of the object (and anticipate the effects of saccades) and thus effectively regress out that component of the overall prediction error, leaving the residual error concentrated in object feature information, which can train the ventral “what” pathway to develop abstract visual categories.
Figure 11A shows that, indeed, when the “where” pathway is lesioned, the formation of abstract categorical representations in the intact “what” pathway is significantly impaired. We also hypothesized that full predictive learning (about the future), as compared to just encoding and decoding the current state (i.e., an autoencoder, which is much easier computationally), is also critical for the formation of abstract categorical representations—prediction is a “desirable difficulty” (Bjork, 1994). Figure 11B shows that this was the case. Finally, consistent with our hypothesis that Hebbian learning provides an important bias on learning, Figure 11C shows the impairment associated with reducing this learning bias. The significant reduction in differentiation across all of these manipulations shows that this differentiation property is not a simple consequence of the neural architecture but rather depends critically on the learning process, unfolding over time with appropriate parameter values and other architectural components. Furthermore, the Bp comparison model shares the same architecture and does not show the differentiation across layers.
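The difference between the predictive and autoencoder conditions comes down to which frame serves as the target for a given input. A minimal sketch, assuming the movie is available as an ordered list of frames (the function name and the `predictive` flag are hypothetical):

```python
def training_pairs(frames, predictive=True):
    """Pair each input frame with its training target.

    predictive=True: the target is the next frame (prediction of the future).
    predictive=False: the target is the current frame (autoencoder).
    """
    if predictive:
        return list(zip(frames[:-1], frames[1:]))
    return [(f, f) for f in frames]
```

In the predictive case, the final frame has no successor and therefore contributes only as a target, never as an input.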
A signature example of predictive behavior at the neural level in the brain is the predictive remapping of visual space in anticipation of a saccadic eye movement (Marino & Mazer, 2016; Nakamura & Colby, 2002; Gottlieb, Kusunoki, & Goldberg, 1998; Colby, Duhamel, & Goldberg, 1997; Duhamel et al., 1992; Figure 12A). Here, parietal neurons start to fire at the future receptive field location where a currently visible stimulus will appear after a planned saccade is actually executed. Remapping has also been shown for border ownership neurons in V2 (O'Herron & von der Heydt, 2013) and in Area V4 (Neupane, Guitton, & Pack, 2016, 2020). These are examples, we believe, of a predictive process operating throughout the neocortex to predict what will be experienced next. A major consequence of this predictive process is the perception of a stable, coherent visual world despite constant saccades and other sources of visual change.
Figure 12B shows that our model exhibits this predictive remapping phenomenon. Specifically, LIP, which is most directly interconnected with the saccade efferent copy signals, is the first to predict the new location, and it then drives top–down activation of lower layers. This top–down dynamic is consistent with the account of predictive remapping given by Wurtz (2008) and Cavanagh et al. (2010), who argue that the key remapping takes place at the high levels of the dorsal stream, which then drive top–down activation of the predicted location in lower areas, instead of the alternative where lower levels remap themselves based on saccade-related signals. The lower-level visual layers are simply too large and distributed to be able to remap across the relevant degrees of visual angle—the extensive lateral connectivity needed to communicate across these areas would be prohibitive.
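The geometry that the remapping prediction must capture is simple to state: for a stationary stimulus, the retinal location after the saccade is the current retinal location minus the saccade vector. The sketch below illustrates only this arithmetic, assuming retinotopic coordinates in arbitrary units; the function name is hypothetical, and nothing here reflects how the model's LIP layers actually compute the prediction.

```python
def remapped_location(stimulus_pos, saccade_vector):
    """Retinal position a stationary stimulus will occupy after the saccade."""
    return tuple(p - s for p, s in zip(stimulus_pos, saccade_vector))
```

For example, a stimulus at retinal position (5, 2) followed by a saccade of (3, -1) is expected at (2, 3), which is where remapping neurons would begin to respond before the eye moves.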
NEURAL DATA AND PREDICTIONS
Having tested the computational and functional learning properties of this biologically based predictive learning mechanism, we now return to consider some of the most important neural data of relevance to our hypotheses, beyond that summarized in the introduction, including contrasts with a widely discussed alternative framework for predictive coding, and some of the extensive data on alpha frequency effects, followed by a discussion of predictions that would clearly test the validity of this framework.
Additional Neuroscience Data
We begin with data relevant to the basic neural-level properties of the framework. First, a central element of the proposed model is the alpha cycle bursting, and subsequent interburst pauses, in the 5IB neurons. Direct electrophysiological recording of deep layer neurons shows periodic alpha-scale bursting for continuous tones in awake animals (Luczak et al., 2009, 2013; Sakata & Harris, 2009, 2012). In vitro, a variety of potential mechanisms behind the generation and synchronization of the 5IB bursts driving this alpha cycle have been identified (Franceschetti et al., 1995; Silva et al., 1991; Connors et al., 1982). Furthermore, the pulvinar has been shown to drive synchronization of cortical activity across areas in the alpha band in awake, behaving animals (Saalmann et al., 2012). We review the larger alpha frequency literature in more detail below, but it is critical to emphasize that this alpha bursting dynamic is actually found in awake, behaving animals, because so many other bursting and up/down state phenomena have recently been shown to occur only in anesthetized brains, including bursting in the thalamic TRC neurons.
In contrast to the 5IB bursting, the 6CT neurons exhibit regular spiking behavior (Thomson, 2010; Thomson & Lamy, 2007), providing consistent activation to the pulvinar. In addition, they do not have axonal branches that project to other cortical areas: the subpopulation that projects to the pulvinar projects only there (Petrof, Viaene, & Sherman, 2012), whereas there are other Layer 6 neurons that do project to other cortical areas. This distinct connectivity is consistent with a specific role of this neuron type in generating predictions in the pulvinar. The 6CT synaptic inputs on pulvinar TRCs have metabotropic glutamate receptors that have longer timescale temporal dynamics consistent with the alpha period (100 msec) and even longer (Sherman, 2014), and the 6CT neurons themselves also have temporally delayed responding (Harris & Shepherd, 2015; Thomson, 2010; Sakata & Harris, 2009). Furthermore, they have significantly more plasticity-inducing N-methyl-D-aspartate receptors compared to the 5IB projections (Usrey & Sherman, 2018). These properties are consistent with the 6CT inputs driving a longer-integrated prediction signal that is subject to learning, whereas the 5IB inputs are likely nonplastic, and their effects are tightly localized in time.
The 5IB inputs often have distinctive glomeruli structures at their synapses onto pulvinar neurons, which contain a complete feedforward inhibition circuit involving a local inhibitory interneuron, in addition to the direct strong excitatory driver input (Wilson, Bose, Sherman, & Guillery, 1984). Computationally, this can provide a balanced level of excitatory and inhibitory drive so as to not overly excite the receiving neuron, while still dominating its firing behavior.
Although there are well-documented and widely discussed burst versus tonic firing modes in pulvinar neurons (Sherman & Guillery, 2006), there is not much evidence of these playing a clear role in the awake, behaving state, and as noted earlier, the growing electrophysiological evidence shows a remarkable correspondence between cortical and pulvinar response properties across multiple different pulvinar areas in this awake state. Nevertheless, there may be important dynamics arising from these firing modes that are more subtle or emerge in particular types of state transitions that may have yet to be identified.
Contrast with Explicit Error Frameworks
To further clarify the nature of the present theory and introduce a body of relevant data, we contrast it with the widely discussed EE framework for predictive coding (Lotter et al., 2016; Bastos et al., 2012; Ouden et al., 2012; Friston, 2005, 2010; Rao & Ballard, 1999; Kawato et al., 1993; Figure 13). The hypothesized locus for computing errors in this framework is in the superficial layers of the neocortex, which are suggested to directly compute the difference between bottom–up inputs from lower layers and top–down inputs from higher areas. Despite many attempts to identify such EE coding neurons in the cortex, no substantial body of unambiguous evidence has been discovered (Walsh, McGovern, Clark, & O'Connell, 2020; Kok & de Lange, 2015; Kok, Jehee, & de Lange, 2012; Summerfield & Egner, 2009; Lee & Mumford, 2003). Furthermore, because of the positive-only firing rate nature of neural coding, two separate populations would be required to convey both signs of prediction error signals, or it would have to be encoded as a variation from tonic firing levels, which are generally low in the neocortex.
By contrast, the use of temporal difference error signals enables all connections between cortical layers to be excitatory, and each layer can represent the positive encoding of either the prediction or outcome state, at different levels of abstraction. These properties are overwhelmingly supported by extensive electrophysiological data about the hierarchical organization of representations, for example, in the visual object recognition pathway (Cadieu et al., 2014; VanRullen & Thorpe, 2002; Kobatake & Tanaka, 1994), and are consistent with the widely supported biased competition model for excitatory top–down attentional effects (O'Reilly et al., 2013; Miller & Cohen, 2001; Reynolds et al., 1999; Desimone & Duncan, 1995).
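The coding contrast drawn in the two preceding paragraphs can be illustrated with a small sketch. Under the assumption that firing rates are nonnegative, an explicit signed error needs two rectified populations, whereas the temporal difference scheme keeps a single positive representation whose value simply differs between the prediction and outcome time windows. The function names are hypothetical, and this is an illustration of the argument rather than an implementation of either framework.

```python
def split_signed_error(prediction, outcome):
    """Explicit error coding: two nonnegative populations, one per sign."""
    positive = [max(0.0, o - p) for p, o in zip(prediction, outcome)]
    negative = [max(0.0, p - o) for p, o in zip(prediction, outcome)]
    return positive, negative


def temporal_difference_code(prediction, outcome):
    """Temporal difference coding: one population, two successive states.

    The same units first hold the prediction and then the outcome; both
    are ordinary positive activations, and the error is implicit in the
    change over time rather than carried by dedicated error units.
    """
    return {"minus_phase": list(prediction), "plus_phase": list(outcome)}
```

In the first scheme, at most one of the two populations is active for any given unit; in the second, no unit ever needs to signal a negative quantity.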
The EE approach requires net inhibitory top–down predictions, and it sends error signals forward, not positive representations of the actual state at a given level of abstraction. Thus, a literal interpretation (and at least one existing implementation; Lotter et al., 2016) has only error signals represented at all levels above the lowest level, which is inconsistent with the positive encoding of stimuli at various levels of abstraction across the visual hierarchy. For example, although Issa, Cadieu, and DiCarlo (2018) observed an error-signal-like increase in activation for atypical faces in some posterior IT neurons, these neurons overall had a positive stimulus encoding, with only a relatively small, later, error-like modulation.
Furthermore, as discussed below, anticipatory predictions typically closely resemble the subsequent stimulus-driven activity, suggesting a positive, not inhibitory, effect (Walsh et al., 2020; Cavanagh et al., 2010; Lee & Mumford, 2003; Duhamel et al., 1992). There are various ways of reformulating the neural implementation of EE that can avoid some of these issues (Bastos et al., 2012; Spratling, 2008), although this flexibility may render the framework difficult to falsify (Kogo & Trengove, 2015). In any case, an extensive treatment of the issues with EE is beyond the scope of this paper and has already been aptly covered by Walsh et al. (2020); our goal here is to highlight some of the core differences as a way to clarify the present framework by way of contrast and in relation to available data.
First, there are many examples of anticipatory predictive neural firing in the brain. Of perhaps greatest relevance, Barczak et al. (2018) recently showed that the auditory pulvinar in monkeys exhibits predictive firing using a carefully controlled auditory sequence that had no first-order acoustic differences from a background noise signal. The pulvinar predictive activation preceded that of A1, suggesting a strong predictive role for pulvinar. Unfortunately, the deep layers of higher auditory areas that should contribute to the formation of the pulvinar prediction were not recorded in this study, so their role in generating the prediction could not be determined.
Nevertheless, there is extensive additional evidence for top–down anticipatory activation of predicted stimuli, with activity patterns closely resembling the subsequent stimulus-driven ones (Walsh et al., 2020). For example, the widely replicated predictive remapping effect, simulated in our model (Figure 12), is of this nature (Cavanagh et al., 2010; Wurtz, 2008; Duhamel et al., 1992). The fact that these anticipatory activations are of a positive nature, consistent with the stimulus-driven activations, is inconsistent with the expected behavior of EE neurons, which should be inhibited by the top–down prediction, while not receiving any bottom–up stimulus.
However, the neural response to the actual predicted stimulus itself is typically suppressed relative to unexpected stimuli, that is, expectation suppression (Bastos et al., 2012; Meyer & Olson, 2011; Todorovic et al., 2011; Summerfield et al., 2008). This phenomenon is widely cited as evidence in favor of the EE predictive coding framework, consistent with an inhibitory effect of the expectation. Nevertheless, despite various conflicting results and many complications of interpretation, multiple comprehensive reviews conclude that it is difficult to distinguish expectation suppression from the neural adaptation effects that underlie the well-documented repetition suppression effect (Walsh et al., 2020; Kok & de Lange, 2015; Vinken & Vogels, 2017; Kok et al., 2012; Summerfield & Egner, 2009; Lee & Mumford, 2003). Furthermore, detailed single-neuron-level recordings are the least likely to show these effects—instead, they are most evident in aggregate signals such as the BOLD response in fMRI, suggesting that they may more strongly reflect population-level differences in activity, rather than individual EE coding neurons.
As noted earlier, accurately predicted outcomes in our framework would result in a continued adaptation of the neural response carrying over from the prediction to the outcome state, whereas unexpected outcomes would be associated with two distinct patterns of activity over a given area: first the prediction and then the outcome. Thus, the unexpected outcome state would not be subject to the prior neural adaptation effects, and furthermore, the time-integrated aggregate activity over these two patterns would be greater compared to the single activity state associated with an accurately predicted outcome. Thus, our model explains expectation suppression without invoking EE neurons, meaning that considerably more detailed and replicable experimental paradigms using single-neuron resolution techniques are needed to distinguish EE from our framework.
Alpha Frequency Effects
The alpha frequency bursting of 5IB neurons acting as drivers into the pulvinar naturally entrains the predictive learning process in our model to this fundamental rhythm, which has long been recognized as an important signature of posterior cortical function (VanRullen & Koch, 2003; Varela, Toro, John, & Schwartz, 1981; Nunn & Osselton, 1974; Walter, 1953; Berger, 1929). A number of different functional associations with alpha have been established, and this literature is large and growing rapidly. Thus, we refer the reader to recent reviews (Foster & Awh, 2019; Clayton et al., 2018; VanRullen, 2016; Jensen, Bonnefond, Marshall, & Tiesinga, 2015) while highlighting the data most relevant to our specific framework here, organized according to a set of key points.
Alpha is specifically associated with deep neocortical layers and the pulvinar as well as with feedback pathways in the cortex. This has been established using direct laminar-specific electrophysiological single-neuron and local field potential recordings (Luczak et al., 2013; Spaak, Bonnefond, Maier, Leopold, & Jensen, 2012; Xing, Yeh, Burns, & Shapley, 2012; Buffalo et al., 2011; Maier, Aura, & Leopold, 2011; Maier, Adams, Aura, & Leopold, 2010) and feedforward versus feedback manipulations (Michalareas et al., 2016; Bastos et al., 2015; Jensen et al., 2015; van Kerkoerle et al., 2014; von Stein, Chiang, & König, 2000). These data are consistent with the 5IB alpha bursting and the major role of cortical deep layers in driving top–down corticocortical projections (in addition to the 6CT pathway that is specific to the pulvinar). By contrast, these same studies show that superficial cortical layers are associated with gamma frequency (40-Hz) dynamics. However, the next point raises some important interpretational difficulties.
Increases in cortical activity levels, for example, because of attention, produce a corresponding decrease in alpha power, whereas decreased activity increases alpha power (Foster & Awh, 2019; Jensen & Mazaheri, 2010; Fries, Womelsdorf, Oostenveld, & Desimone, 2008; Klimesch, Sauseng, & Hanslmayr, 2007; Kelly, Lalor, Reilly, & Foxe, 2006; Worden, Foxe, Wang, & Simpson, 2000). This pattern is not exactly what one might expect if alpha were a signature of predictive learning. Furthermore, given that these same pulvinar and thalamocortical pathways are also widely regarded as important for attention (Fiebelkorn & Kastner, 2019; Zhou et al., 2016; Saalmann & Kastner, 2011; Snow et al., 2009; Bender & Youakim, 2001; LaBerge & Buchsbaum, 1990), this pattern presents a challenge for many theorists. However, it is possible to explain this pattern as arising directly from the desynchronizing effects of cortical activity on alpha power. Specifically, neural spiking is associated with broadband noise, because of the highly random, Poisson nature of spike firing, which can desynchronize the entrainment of lower-frequency oscillations including alpha (Solomon et al., 2017; Privman, Malach, & Yeshurun, 2013; Waldert, Lemon, & Kraskov, 2013; Ray & Maunsell, 2011). In other words, because cortical activity is inherently noisy, it tends to interfere with the coherent activity across populations of neurons needed to produce a strong alpha frequency power signal. This explanation is directly supported by studies manipulating and measuring cortical activity (Zhou et al., 2016; Fries et al., 2008) and is consistent with alpha power changes being a result of attentional modulation, but not their cause (Antonov, Chakravarthi, & Andersen, 2020).
Thus, although attention and predictive learning can both affect overall activity levels in the cortex and thus drive changes in alpha power, alpha power itself is not a transparent measure of the underlying mechanisms supporting these functions, which may help to explain some contradictory patterns of results (Gundlach, Moratti, Forschack, & Müller, 2020; Foster & Awh, 2019; Keitel et al., 2019).
Alpha phase effects provide a more direct measure of thalamocortical function than alpha power and have been more consistently related to perception, attention, and prediction (Solís-Vivanco, Jensen, & Bonnefond, 2018; Neupane et al., 2017; Jaegle & Ro, 2013; Palva & Palva, 2011; Mathewson, Fabiani, Gratton, Beck, & Lleras, 2010; Busch, Dubois, & VanRullen, 2009; VanRullen & Koch, 2003; Varela et al., 1981; Nunn & Osselton, 1974). For example, weak, near-threshold stimuli are more reliably detected and processed when presented in the trough of the individual's ongoing alpha cycle. Of greatest relevance to this paper are studies showing effects of prediction on alpha phase (Mayer, Schwiedrzik, Wibral, Singer, & Melloni, 2016; Sherman, Kanai, Seth, & VanRullen, 2016; Samaha, Bauer, Cimaroli, & Postle, 2015). For example, Mayer et al. (2016) showed that prestimulus alpha phase directly correlated with the predictability of the upcoming stimulus, and the pattern of this prestimulus activation was indistinguishable from the subsequent stimulus activation pattern. This is consistent with our model, and less consistent with the EE framework, as discussed previously. Neupane et al. (2017) found strong alpha coherence effects in local field potential recordings distributed across V4, associated with the predictive remapping of receptive fields (Duhamel et al., 1992).
Discrete, salient, or oscillatory stimuli entrain the alpha cycle in the brain (Spaak, de Lange, & Jensen, 2014; Mathewson et al., 2012). Furthermore, the massive literature on ERPs may represent a significant contribution from alpha-level entrainment (Klimesch, 2011; Gruber, Klimesch, Sauseng, & Doppelmayr, 2005; Makeig et al., 2002). These entrainment effects are consistent with the 5IB entrainment mechanisms in our framework, as described earlier, and entrainment is functionally important for aligning predictive learning with relevant salient or unexpected outcomes.
The pulvinar contributes to synchronizing alpha phase relationships across different brain areas (Fiebelkorn, Pinsk, & Kastner, 2018; Saalmann et al., 2012). This is consistent with the broad, convergent pattern of projections into the pulvinar from many different cortical areas, and the corresponding broad projections back out to these same areas (Arcaro et al., 2015; Shipp, 2003). Functionally, this convergence and synchronization are important for integrating the contributions from these different areas at the same time, to generate predictions over the pulvinar.
The theta cycle, composed of a pair of alpha cycles, organizes saccades as well as attentional, motor, and mnemonic processes (Fiebelkorn & Kastner, 2019). The theta rhythm is dominant in the medial temporal lobe and hippocampus and has been extensively studied there (Buzsáki, 2005; Kahana, Seelig, & Madsen, 2001). Furthermore, there is a sharp peak of saccade fixation durations at 200 msec, which suggests that two alpha cycles are typically required for complete processing of a given fixation. On the first cycle, the predictions from before the eye moved may be fairly vague depending on factors such as the size of the saccade and familiarity with the environment. However, after the first alpha cycle of a fixation, a subsequent postdiction phase can provide an important additional learning opportunity, to consolidate and more deeply encode the current fixation (computationally equivalent to an autoencoder). In addition, a mix of smaller saccades (including microsaccades) and larger saccades enables a range of more and less predictable outcomes on the first alpha cycle after the saccade and matches human behavior (Martinez-Conde, Otero-Millan, & Macknik, 2013; Martinez-Conde, Macknik, & Hubel, 2004).
Putting all of these points together, a particularly effective way of testing the predictions of our framework would be measuring alpha phase changes emerging in the prestimulus period as a function of predictive learning in predictable sequential stimulus streams. In addition, it would also be important to examine theta- and alpha-cycle dynamics in relation to predictive learning in the context of attention, motor control, and memory processes, to better understand the larger systems-level temporal organization of learning and processing in the brain (Fiebelkorn & Kastner, 2019).
Predictions for Predictive Learning
In this section, we enumerate a set of direct, testable predictions from our framework. Before doing so, there are several important considerations for any experimental test of the theory. First, the nature of what is to be learned must be matched to the pulvinar area in question. For example, learning a new variation of basic physics in movies at the alpha time scale (e.g., altering properties such as gravity, inertia, or elasticity) would be appropriate for the lower-level visual pathways. At higher visual levels (e.g., IT cortex), it might be possible to use simple sequences of different objects, although it is not clear to what extent the hippocampus or PFC might also contribute in this case (Fiser et al., 2016; Gavornik & Bear, 2014). To distinguish pulvinar learning effects from pervasive motor learning supported by other brain areas, it would be most effective to directly measure activity in the pulvinar and/or associated perceptual neocortical areas, instead of involving overt behavioral performance.
Much of the learning in posterior sensory cortex should take place early in development, requiring very early developmental interventions or genetic knockouts that are expressed from the start (which can also have other interpretational issues if not highly selective). In our models, the bulk of the basic sensory predictive learning happens very quickly, because the basic first-level regularities are quite strong and relatively easily learned. Although there are longer-term changes in the higher-level pathways in our models, more fine-grained measurements would likely be required to see these changes. Once this learning has taken place, the remaining contributions of the thalamocortical circuit are likely more strongly weighted toward its role in attention, as we discuss below. Finally, directly lesioning or inactivating the pulvinar is not likely to be very informative, because existing work has shown dramatic effects on cortical activity (Zhou et al., 2016; Purushothaman, Marion, Li, & Casagrande, 2012), and furthermore, any effects could be attributed to the attentional contributions of the pulvinar.
With these considerations in mind, here is a set of strong predictions from our model that should be testable using existing techniques. Failure to obtain the predicted result, while adhering to all the relevant constraints, would constitute a falsification of our model.
Blocking 5IB bursting mechanisms early in developmental learning should disrupt learning. It should be possible to selectively knock out or modify the channels that cause this specific population of neurons to burst fire, and doing so should have a significant effect on learning in associated neocortical and pulvinar areas, given the critical role that this burst firing plays in the predictive learning process, as elaborated above.
Blocking synaptic plasticity in the pulvinar (specifically the 6CT inputs) very early in developmental learning should impair learning. Although most of the learning overall should occur in the neocortex as a result of the temporal difference error signal broadcast by the pulvinar (which should remain generally intact), learning in the 6CT projections is important, especially right at the start, to map the emerging neocortical representations into the space defined by the 5IB projections.
Temporal differences on an alpha cycle timescale actually drive synaptic plasticity in an error-driven learning manner, in neocortical pyramidal neurons and in 6CT inputs to the pulvinar. That is, if a pre/post pair of neurons across a synapse is more active in the prediction than the subsequent outcome, the synapse should experience long-term depression, and vice versa if the activity pattern is reversed (long-term potentiation, for more activity in outcome than prediction). Furthermore, if activity is essentially stable across both prediction and outcome phases, then weights should not change (modulo a small level of Hebbian learning; O'Reilly et al., 2012; O'Reilly & Munakata, 2000). This should be directly testable using current experimental methods and is perhaps the single most important empirical test of this entire framework, and it also underlies many other current approaches to error-driven learning in the brain (Lillicrap et al., 2020; Whittington & Bogacz, 2019; Bengio et al., 2017). One general consideration is the extent to which an awake in vivo preparation would be required to capture all the neuromodulatory and other factors present when this learning normally takes place. Some suggestive evidence in such a preparation is generally consistent with a sensitivity to relatively short-term temporal dynamics (Lim et al., 2015), although these results lacked the direct measurement of individual neural activity across a synapse.
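The sign logic of this prediction can be made concrete with a minimal sketch. The Python fragment below uses a simple contrastive Hebbian form, in which the weight change is proportional to the pre/post coactivity product in the outcome phase minus that in the prediction phase. This is an illustrative assumption rather than the exact learning rule used in the model: the function name, learning rate, and activity values are hypothetical, and the small Hebbian component noted above is omitted.

```python
def temporal_difference_dw(pre_pred, post_pred, pre_out, post_out, lrate=0.1):
    """Weight change from outcome-phase minus prediction-phase coactivity.

    Assumed contrastive Hebbian form, used here only for illustration.
    """
    return lrate * (pre_out * post_out - pre_pred * post_pred)


# More coactivity in the prediction than in the outcome: negative change (LTD).
ltd = temporal_difference_dw(pre_pred=0.8, post_pred=0.9, pre_out=0.2, post_out=0.3)

# More coactivity in the outcome than in the prediction: positive change (LTP).
ltp = temporal_difference_dw(pre_pred=0.2, post_pred=0.3, pre_out=0.8, post_out=0.9)

# Stable activity across both phases: no weight change.
stable = temporal_difference_dw(pre_pred=0.5, post_pred=0.5, pre_out=0.5, post_out=0.5)
```

Under this simplified form, the three cases map directly onto the three experimental outcomes described above: depression, potentiation, and no change.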
We have hypothesized a novel computational function for the distinctive features of thalamocortical circuits (Usrey & Sherman, 2018; Sherman & Guillery, 2006), as supporting a specific form of prediction-error driven learning, where predictions arise from the numerous top–down layer 6CT projections into the pulvinar, and the strong, sparse driving 5IB inputs supply the bottom–up sensory-driven outcome. The phasic bursting nature of the 5IB inputs results in a natural temporal-difference error signal of prediction followed by outcome, consistent with extensive neural recording data. This temporal dynamic is also essential for enabling predictions to be generated without contamination from current sensory inputs and predicts a characteristic alpha-frequency prediction cycle based on the 10-Hz bursting cycle of the 5IB inputs, consistent with the pervasive influence of alpha on perception and neural dynamics (Foster & Awh, 2019; Clayton et al., 2018; VanRullen, 2016; Jensen et al., 2015). In short, the hypothesized predictive learning function fits remarkably well with a number of well-established properties of these thalamocortical circuits, and we also provided a set of additional predictions that could be tested to further evaluate this theory, especially in contrast to the widely discussed alternative of EE coding neurons, which have not been unambiguously supported across a range of empirical studies (Walsh et al., 2020).
Furthermore, we implemented this theory in a large-scale model of the visual system and demonstrated that learning based strictly on predicting what will be seen next is, in conjunction with a number of critical biologically motivated network properties and mechanisms, capable of generating abstract, invariant categorical representations of the overall shapes of objects. The nature of these shape representations closely matches human shape similarity judgments on the same objects. Thus, predictive learning has the potential to go beyond the surface structure of its inputs and develop systematic, abstract encodings of the environment. We found that comparison models based on standard error Bp learning did not learn a categorical structure that went beyond the surface similarity present in the visual input layers, and future work is focused on narrowing down the specific mechanisms required to drive this learning.
In addition to the predictive learning functions of the deep/thalamic layers, these same circuits are also likely critical for supporting powerful top–down attentional mechanisms that have a net multiplicative effect on superficial-layer activations (Bortone, Olsen, & Scanziani, 2014; Olsen, Bortone, Adesnik, & Scanziani, 2012). The importance of the pulvinar for attentional processing has been widely documented (e.g., Saalmann et al., 2012; Bender & Youakim, 2001; LaBerge & Buchsbaum, 1990), and there is likely an additional important role of the thalamic reticular nucleus, which can contribute a surround-inhibition contrast-enhancing effect on top of the incoming attentional signal from the cortex (Jaramillo, Mejias, & Wang, 2019; Wimmer et al., 2015; Pinault, 2004; Crick, 1984). In other work in progress, we have shown that the deep/thalamic circuits in our model produce attentional effects consistent with the abstract Reynolds and Heeger (2009) model, whereas the contributions of the deep layer networks to this function are broadly consistent with the folded-feedback model (Grossberg, 1999). These attentional modulation signals cause the bidirectional constraint satisfaction process in the superficial network to focus on task-relevant information while down-regulating responses to irrelevant information—in the real world, there are typically too many objects to track at any given time, so predictive learning must be directed toward the most important objects (Richter & de Lange, 2019; Cavanagh et al., 2010; Pylyshyn, 1989).
There are also data suggesting that the pulvinar is important for supporting confidence judgments, driven by relative ambiguity in a random dot motion categorization task (Komura et al., 2013). Critically for the present framework, this confidence modulation only emerged in the period after the first 100 msec of processing and manifested as a positive correlation with confidence (i.e., more unambiguous stimuli resulted in higher firing rates). We can interpret this as reflecting an ongoing generative postdiction of the stimulus signal, with stronger firing associated with more unambiguous top–down activation based on the current internal representation. Note that this directionality is the opposite of EE coding neurons, which would presumably increase with increasing error/ambiguity in the prediction. Interestingly, inactivation of these pulvinar neurons resulted in a substantial (200%) increase in opt-out choices on the most ambiguous stimuli, suggesting a level of metacognitive awareness of the pulvinar signal (or at least a direct effect of pulvinar on relevant metacognitive processes). Predictive accuracy would be an ideal source of metacognitive confidence signals across a wide range of domains, suggesting another important contribution of pulvinar even after initial learning. Jaramillo et al. (2019) present a comprehensive model of attentional, decision-making, and working memory contributions of the pulvinar, including these confidence data, which is generally compatible with our framework, although it does not address any learning phenomena.
There are a number of important limitations of the current What–Where Integration (WWI) model, in terms of its scale and ability to process real-world cluttered visual scenes with multiple objects present, such as those used in the widely studied ImageNet data set. The model is much smaller than standard DCNN vision models, because its computational demands are significantly higher, in a way that also does not fit well with current graphics processing unit (GPU)-based parallel computation hardware, because of the relative complexity of the algorithms and the sparseness of the activations. For each image, 100 cycles (of 1 msec each) of activation updating are required to enable the bidirectional activation and inhibition to integrate in a graded manner over the alpha cycle, compared to only one such iteration for most feedforward DCNN models. Furthermore, the bidirectional connectivity, extensive shortcut connections, and use of multiple cortical lamina per cortical area result in significant increases in the number of synaptic connections, which dominate the computational cost, and scale roughly as n² in the number of neurons n per layer across one projection. Thus, there are 207 million connections for the full WWI model, requiring 10 GB of RAM, and it takes over a day to run using 32 high-performance CPU processors with fast network interconnects, using the fastest combination of threading and parallel batch training. Doubling the network size causes it to no longer fit in available RAM, and yet, its high-resolution V1 layer is only 16 × 16, compared to 55 × 55 for basic DCNN models such as AlexNet and 224 × 224 for VGG16. The result is that the model has a relatively low-resolution view of the world, as reflected in the reconstructed images shown in Figure 5.
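The scaling argument above can be illustrated with some back-of-the-envelope arithmetic. The sketch below uses only the figures quoted in the text, and it rests on stated assumptions: decimal gigabytes, a fully connected projection for the quadratic case, and every connection being visited on every cycle (which sparse activations may reduce). The layer size of 64 is a hypothetical value chosen only to show the ratio.

```python
# Figures quoted in the text above.
CONNECTIONS = 207_000_000      # synaptic connections in the full WWI model
RAM_BYTES = 10 * 10**9         # 10 GB, assuming decimal gigabytes
CYCLES_PER_IMAGE = 100         # 1-msec activation updates per image

# Implied average memory footprint of one connection, in bytes.
bytes_per_connection = RAM_BYTES / CONNECTIONS

# Upper bound on connection visits per image, assuming every connection
# is touched on every cycle.
visits_per_image = CONNECTIONS * CYCLES_PER_IMAGE


def full_projection_connections(n):
    """Connections in one fully connected projection between two layers of n neurons."""
    return n * n


# Doubling the neurons per layer quadruples the connections in such a projection.
growth_when_doubled = full_projection_connections(2 * 64) / full_projection_connections(64)
```

Under these assumptions, each connection occupies roughly 48 bytes on average, and doubling the number of neurons per layer multiplies the connection count of a full projection by four, which is consistent with a doubled network no longer fitting in available RAM.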
In addition to having a higher-resolution input to be able to process more complex real-world cluttered images, the model would require functional attentional dynamics to focus processing on a small number of objects at a time, as is well documented for humans processing complex images. Thus, once the attentional dynamics are well integrated with the predictive learning mechanisms, we can begin to explore performance on more complex images, subject to improved computational hardware supporting larger network sizes.
Considerable further work remains to be done to more precisely characterize the essential properties of our biologically motivated model necessary to produce this abstract form of learning and to further explore the full scope of predictive learning across different domains. We strongly suspect that extensive cross-modal predictive learning in real-world environments, including between sensory and motor systems, is a significant factor in infant development and could greatly multiply the opportunities for the formation of higher-order abstract representations that more compactly and systematically capture the structure of the world (Yu & Smith, 2012). Future versions of these models could thus potentially provide novel insights into the fundamental question of how deep an understanding a preverbal human, or a nonverbal primate, can develop (Elman et al., 1996; Spelke, Breinlinger, Macomber, & Jacobson, 1992), based on predictive learning mechanisms. This would then represent the foundation upon which language and cultural learning builds, to shape the full extent of human intelligence.
All of the materials described here, including the experimental study, the computational models, and the code to perform the representational similarity analysis, are available on our GitHub account at github.com/ccnlab/deep-obj-cat, and the new version of the emergent simulation environment is at github.com/emer/leabra, which contains extensive documentation and examples that can be run in Python or the Go language. The best place to start in understanding computationally how the predictive learning model works is with the FSA model described in the main text, which is available at github.com/emer/leabra/tree/master/examples/deep_fsa. For the large and complex WWI model, the most complete understanding can only be had by directly examining the code, as there are a number of details that are not efficiently captured in this Appendix text.
REPRESENTATIONAL SIMILARITY ANALYSIS METHODS
The different representations being compared here are the following:
Leabra: The DeepLeabra (biological model) TE layer representations (specifically TEs = superficial—results are very similar for deep as well).
Bp: The TEs layer representations from the Bp version of the biological model, including “what,” “where,” and “What × Where” integration layers, trained with the V1p and V1hp (low- and high-resolution pulvinar) layers as the final output layers, using the time t target pattern from the t – 1 input (i.e., as a predictive network).
V1: The Gabor-filtered representation of the visual input to both of the above models, which was identical across them.
PredNet: The highest layer (sixth layer) of the PredNet architecture.
Expt: Similarity matrix constructed from human pairwise similarity judgments (see Behavioral Experiment Methods).
Starting with an initial set of clusters, a permutation-based hill-climbing strategy was used to determine a local maximum in this measure: Each item was tested in each of the other possible categories, and if that configuration increased the overall average contrast distance (ACD) metric across all items, then it was adopted and the process iterated until no such permutation improved the metric. This algorithm can only decrease the number of clusters (by moving all items out of a given cluster), so different numbers of initial clusters can be used to search the overall space.
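The permutation procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code in the repository: the ACD is assumed here to be, for each item, the mean distance to items in other clusters minus the mean distance to the other items in its own cluster, averaged over items (items lacking either term are skipped), and a higher ACD is assumed to be better, as the values reported in this appendix suggest. The function names and the toy distance matrix are hypothetical.

```python
def acd(dist, labels):
    """Assumed average contrast distance: between-cluster minus within-cluster."""
    n = len(labels)
    contrasts = []
    for i in range(n):
        within = [dist[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        between = [dist[i][j] for j in range(n) if labels[j] != labels[i]]
        if within and between:
            contrasts.append(sum(between) / len(between) - sum(within) / len(within))
    return sum(contrasts) / len(contrasts) if contrasts else 0.0


def hill_climb(dist, labels):
    """Move single items between existing clusters while the ACD improves."""
    labels = list(labels)
    best = acd(dist, labels)
    improved = True
    while improved:
        improved = False
        for item in range(len(labels)):
            keep = labels[item]
            # Only existing clusters are candidates, so clusters can vanish
            # but new ones are never created.
            for candidate in sorted(set(labels) - {keep}):
                labels[item] = candidate
                score = acd(dist, labels)
                if score > best:
                    best, keep, improved = score, candidate, True
            labels[item] = keep
    return labels, best


# Toy example: items 0 and 1 are close, items 2 and 3 are close.
toy_dist = [
    [0.0, 0.1, 1.0, 1.0],
    [0.1, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.1],
    [1.0, 1.0, 0.1, 0.0],
]
final_labels, final_acd = hill_climb(toy_dist, [0, 0, 0, 1])
```

In this toy case, item 2 starts in the wrong cluster and the procedure moves it next to item 3. Because the search is greedy, different starting configurations can end in different local optima, which is why multiple initial cluster sets are worth testing.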
Figure 14 shows the resulting categories. The Bp model converged on the same cluster state from all starting configurations tested, varying from five to two initial categories. This is the cluster set shown in Figure 10 of the main paper and has an ACD of 0.0838 (this is relatively low because the patterns were overall quite similar). Likewise, the V1 patterns (which were the same across Leabra and Bp models) reliably converged on the same pattern (shown in Figure 10), with ACD = 0.2448.
For the PredNet Layer 6 representations, starting from the V1 categories gave the best result of any starting set (ACD = 0.1967), and a few permutations resulted in a reliable solution that was arrived at from all other three category starting points tested, shown in Figure 14 (ACD = 0.2820). This indicates that PredNet did not go much beyond the structure present in the input, although it did not use the V1 Gabor filtering used in the Leabra and Bp models (i.e., this V1-level encoding well captures the structure of the visual inputs in general). The PredNet pixel and Layer 1 representations both converged on essentially a single monolithic category with very low ACDs (0.0018 and 0.0013, respectively).
For the Leabra TE representations, we found a set of centroid-shape categories that are near-best when considering both the Leabra model and the results from the human behavioral experiment. Starting from these categories, the permutation analysis converged on reducing the size of the vertical and round categories to one item each, over a sequence of five steps. This is consistent with the observation from Figure 7 that there are three broader categories within which the five finer-grained categories are embedded (i.e., vertical and pyramid are overall similar to each other, as are round and box). Nevertheless, our initial visual intuition about the broad shape categories, along with a bias against having single-item categories, reinforced the use of the finer-grained centroid selection. The ACD of our centroid selection is 0.5071, whereas the maximal result from the permutation was 0.5526, which is a relatively small proportional difference.
Furthermore, once we had collected the human experimental data (Expt), it was clear that it strongly coincided with our original shape intuitions and with the finer-grained five-category centroid structure. Starting from the centroid categories, the maximal permutation made only three changes, moving trex (T-rex) and handgun into the horizontal category, and chair into the pyramid, going from a distance score of 0.3083 to 0.3225, which is a relatively small improvement. However, using the maximal Expt clusters directly on the Leabra model gives a lower ACD measure of 0.3745 (compared to 0.5071 for centroid), so the centroid categories represent a good middle ground between Expt and the model, and this strong shared similarity structure with near-optimal cluster structures confirms that the model and people are encoding largely the same information.
In contrast, if we organize the Expt similarity matrix using the Bp categories, it produces a very poor ACD measure of 0.0643 (compared to 0.3083 for the centroid categories), strongly suggesting that people's shape representations are not compatible with that simple structure.
Another approach to determining clusters from similarity matrices, “agglomerative clustering,” starts with all items as singletons and iteratively combines the closest two into a new cluster. The results for the Leabra and Expt similarity matrices are shown in Figure 15, in which the items are also color-coded in terms of their category status according to the centroid structure. Because of a strong history dependency in the clustering process and the indeterminacy of reducing a high-dimensional similarity structure down to two dimensions, structure beyond the leaf level is not very reliable (ties are also broken by a random number generator). Nevertheless, it is clear that, in both cases, items from the same cluster are almost always together as leaves in the plots. This then provides additional converging support for the idea that the model is learning the same kind of shape categories as people have.
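The agglomerative procedure described above can be sketched in a few lines. This is a minimal illustration rather than the code used for Figure 15: the linkage criterion is not stated here, so average linkage is assumed, ties are broken by taking the first minimal pair rather than by a random number generator, and the function name and toy distance matrix are hypothetical.

```python
def agglomerate(dist):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [(i,) for i in range(len(dist))]
    merges = []

    def linkage(a, b):
        # Assumed average linkage: mean pairwise distance between members.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        pairs = [
            (linkage(a, b), a, b)
            for idx, a in enumerate(clusters)
            for b in clusters[idx + 1:]
        ]
        # min() returns the first minimal pair, so ties are resolved by order.
        d, a, b = min(pairs, key=lambda p: p[0])
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges


# Toy example: items 0 and 1 are close, items 2 and 3 are close.
toy_dist = [
    [0.0, 0.1, 1.0, 1.0],
    [0.1, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.1],
    [1.0, 1.0, 0.1, 0.0],
]
merge_history = agglomerate(toy_dist)
```

The history dependency noted above is visible in the sketch: each merge changes the set of candidate pairs for every later step, so an early tie resolved differently can alter the higher levels of the tree while leaving the leaf-level pairings intact.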
For the network layer RSA computations, activation vectors were accumulated separately for each 3-D object item and, within that, separately for each frame index of the movie. To be able to monitor similarity metrics as the model trained, we used a running-average integration of neural activity across trials to accumulate the patterns. Specifically, the current activation pattern across each layer was recorded and averaged unit-by-unit with a time constant of τ = 10. Critically, by integrating separately for each frame, this running-average computation did not introduce any bias for temporally adjacent frames to be more similar. Nevertheless, when we computed the frame-to-frame similarities for TE, they were quite high (.901 correlation on average across all objects).
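The per-frame running average can be illustrated as follows. This is a minimal sketch under stated assumptions: the class name is hypothetical, the average is seeded with the first observed pattern, and the update takes the form avg += (act - avg) / τ with τ = 10 as given in the text.

```python
TAU = 10.0  # integration time constant from the text

class FrameAverager:
    """Running-average activation patterns, kept separately per object and frame."""

    def __init__(self, tau=TAU):
        self.rate = 1.0 / tau
        self.avg = {}  # (object, frame index) -> running-average pattern

    def update(self, obj, frame, act):
        key = (obj, frame)
        if key not in self.avg:
            # Assumption: seed the average with the first observed pattern.
            self.avg[key] = list(act)
        else:
            old = self.avg[key]
            self.avg[key] = [a + self.rate * (x - a) for a, x in zip(old, act)]
        return self.avg[key]

averager = FrameAverager()
averager.update("car", 0, [1.0, 0.0])
second = averager.update("car", 0, [0.0, 1.0])   # moves 1/10 of the way
other_frame = averager.update("car", 1, [0.5, 0.5])
```

Because each (object, frame) key has its own accumulator, patterns from adjacent frames are never mixed, which is why the averaging itself cannot make neighboring frames look more similar.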
BEHAVIORAL EXPERIMENT METHODS
The behavioral experiment was conducted on Amazon.com's MTurk Web platform under University of Colorado institutional review board approval (19-0176). Thirty participants each categorized up to 800 image pairs, as shown in Figure 16, using the standard simple image categorization framework with a lightly customized script. Objects were drawn from the 156 3-D object set, but data were aggregated in terms of the 20 basic-level categories (car, stapler, etc.) because we could not sample all 156 × 156 object pairs. Thus, for each category pair, the resulting measure was the proportion of presentations on which that pair was selected.
The individual images were produced by reconstructing from the V1 transform that the computational model used in its high-resolution V1 input layer, to give human participants an experience as similar as possible to how the model “saw” the objects, and to reduce the influence of existing semantic knowledge, which was entirely missing in our model (Figure 16).
BIOLOGICAL MODEL METHODS
This section provides more information about the DeepLeabra WWI model. The purpose of this information is to give more detailed insight into the model's function beyond the level provided in the main text, but with a model of this complexity, the only way to really understand it is to explore the model itself. It is available for download at github.com/ccnlab/deep-obj-cat/tree/master/sims/cemer. We now have a full replication of this model in our new, much more transparent simulation framework, available at github.com/ccnlab/deep-obj-cat/tree/master/sims/wwi3d—this is more readable and recommended. Furthermore, the best way to understand this model is to understand the framework in which it is implemented, which is explained in great detail, with many running simulations explaining specific elements of functionality, at CompCogNeuro.org.
Layer Sizes and Structure
| Area | Name | Units (x) | Units (y) | Pools (x) | Pools (y) | Receiving Projections |
|---|---|---|---|---|---|---|
| | V1p | 4 | 5 | 8 | 8 | V1s, V2d, V3d, V4d, TEOd |
| | V1hp | 4 | 5 | 16 | 16 | V1s, V2d, V3d, V4d, TEOd |
| V2 | V2s | 10 | 10 | 8 | 8 | V1s, LIPs, V3s, V4s, TEOd, V1p, V1hp |
| | V2d | 10 | 10 | 8 | 8 | V2s, V1p, V1hp, LIPd, LIPp, V3d, V4d, V3s, TEOs |
| | LIPs | 4 | 4 | 8 | 8 | MtPos, ObjVel, SaccadePlan, EyePos, LIPp |
| | LIPd | 4 | 4 | 8 | 8 | LIPs, LIPp, ObjVel, Saccade, EyePos |
| | LIPp | 1 | 1 | 8 | 8 | MtPos, V1s, LIPd |
| V3 | V3s | 10 | 10 | 4 | 4 | V2s, V4s, TEOs, DPs, LIPs, V1p, V1hp, DPp, TEOd |
| | V3d | 10 | 10 | 4 | 4 | V3s, V1p, V1hp, DPp, LIPd, DPd, V4d, V4s, DPs, TEOs |
| | V3p | 10 | 10 | 4 | 4 | V3s, V2d, DPd, TEOd |
| DP | DPs | 10 | 10 | | | V2s, V3s, TEOs, V1p, V1hp, V3p, TEOp |
| | DPd | 10 | 10 | | | DPs, V1p, V1hp, DPp, TEOd |
| | DPp | 10 | 10 | | | DPs, V2d, V3d, DPd, TEOd |
| V4 | V4s | 10 | 10 | 4 | 4 | V2s, TEOs, V1p, V1hp |
| | V4d | 10 | 10 | 4 | 4 | V4s, V1p, V1hp, V4p, TEOd, TEOs |
| | V4p | 10 | 10 | 4 | 4 | V4s, V2d, V3d, V4d, TEOd |
| TEO | TEOs | 10 | 10 | 4 | 4 | V4s, V1p, V1hp, TEs |
| | TEOd | 10 | 10 | 4 | 4 | TEOs, TEOd, V1p, V1hp, V4p, TEOp, TEp, TEd |
| | TEOp | 10 | 10 | 4 | 4 | TEOs, V3d, V4d, TEOd, TEd |
| TE | TEs | 10 | 10 | 4 | 4 | TEOs, V1p, V1hp |
| | TEd | 10 | 10 | 4 | 4 | TEs, TEd, V1p, V1hp, V4p, TEOp, TEp, TEOd |
| | TEp | 10 | 10 | 4 | 4 | TEs, V3d, V4d, TEOd |
Each area has three associated layers: s = superficial layer; d = deep layer (context updated by 5IB neurons in the same area, shown in bold); and p = pulvinar layer (driven by 5IB neurons from the associated area, shown in bold).
All the activation and general learning parameters in the model are at their standard Leabra defaults.
The general principles and patterns of connectivity are shown in Figure 17 (and Figures 1 and 2 in the main text). As noted in the main text, the connectivity and overall structure obeys the established principles identified in neocortical anatomy (Markov, Ercsey-Ravasz, et al., 2014; Markov, Vezoli, et al., 2014; Felleman & Van Essen, 1991; Rockland & Pandya, 1979).
Detailing each of the specific parameters associated with the different projections shown in Table 1 would take too much space; those interested in this level of detail should download the model from the link shown above. There are topographic projections between many of the lower-level retinotopically mapped layers, consistent with our earlier vision models (O'Reilly et al., 2013). For example, the 8 × 8 unit groups in V2 are reduced down to the 4 × 4 groups in V3 via a 4 × 4 unit-group topographic projection, where neighboring units have half-overlapping receptive fields (i.e., the field moves over two unit groups in V2 for every one unit group in V3), and the full space is uniformly tiled by using a wraparound effect at the edges. Similar patterns of connectivity are used in standard DCNNs. However, we do not share weights across units as in a true convolutional network.
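The V2-to-V3 example can be made concrete with a small sketch that lists, for every receiving unit group, the sending unit groups in its receptive field. The function name and the choice to center each field on its receiving group (the `offset` term) are assumptions of this sketch; the 8 × 8 and 4 × 4 grids, the 4 × 4 field, the stride of two, and the wraparound come from the text.

```python
from collections import Counter

def topo_receptive_fields(send_n=8, recv_n=4, field=4, stride=2):
    """Map each receiving pool (x, y) to the sending pools it connects to."""
    offset = (field - stride) // 2  # assumption: fields are centered
    fields = {}
    for ry in range(recv_n):
        for rx in range(recv_n):
            fields[(rx, ry)] = [
                ((rx * stride - offset + dx) % send_n,   # wrap at the edges
                 (ry * stride - offset + dy) % send_n)
                for dy in range(field)
                for dx in range(field)
            ]
    return fields

fields = topo_receptive_fields()
# Count how many receiving pools each sending pool projects to.
coverage = Counter(pool for pools in fields.values() for pool in pools)
```

With these settings every sending pool is covered by the same number of receiving pools, which is what uniform tiling means here; without the wraparound, pools at the edges would be covered less often than pools in the middle.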
The projections from ObjVel (object velocity) and SaccadePlan layers to LIPs, LIPd were initialized with a topographic sigmoidal pattern that moved as a function of the position of the unit group, by a factor of .5, whereas the projections from EyePos were initialized with a Gaussian pattern. These patterns multiplied uniformly distributed random weights in the .25–.75 range, with the lowest values in the topographic pattern having a multiplier of .6, whereas the highest had a multiplier of 1 (i.e., a fairly subtle effect). This produced faster convergence of the LIP layer when doing “where” pathway pretraining compared to purely random initial weights, consistent with Pouget and Sejnowski (1997) and related work on parietal gain field basis function representations.
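As a rough illustration of this initialization, the sketch below multiplies uniform random weights in the .25 to .75 range by a topographic pattern rescaled to run from .6 to 1. Only those ranges come from the text; the function names, the one-dimensional layout, and the Gaussian center and width are hypothetical choices for the example.

```python
import math
import random

def gaussian_pattern(n, center, sigma):
    """Unnormalized Gaussian bump over n positions (hypothetical shape)."""
    return [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(n)]

def init_weights(pattern, rng, lo=0.25, hi=0.75, min_mult=0.6, max_mult=1.0):
    """Random weights in [lo, hi] scaled by the pattern mapped to [min_mult, max_mult]."""
    p_lo, p_hi = min(pattern), max(pattern)
    span = (p_hi - p_lo) or 1.0  # avoid dividing by zero for a flat pattern
    weights = []
    for p in pattern:
        mult = min_mult + (max_mult - min_mult) * (p - p_lo) / span
        weights.append(rng.uniform(lo, hi) * mult)
    return weights

rng = random.Random(0)  # fixed seed so the example is repeatable
weights = init_weights(gaussian_pattern(9, center=4, sigma=2.0), rng)
```

Because the multiplier never drops below .6, the pattern only biases the random weights rather than determining them, in line with the "fairly subtle effect" described above.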
In addition to exploring different patterns of overall connectivity, we also explored differences in the relative strengths of receiving projections, which can be set with a wt_scale.rel parameter in the simulator. All feedforward pathways have a default strength of 1. For the feedback projections, which are typically weaker (consistent with the biology), we explored a discrete range of strengths, typically .5, .2, .1, and .05. The strongest top–down projections were into V2s from LIP and V3, whereas most others were .2 or .1. Likewise, projections from the pulvinar were weaker, typically .1. These differences in strength sometimes had large effects on performance during the initial bootstrapping of the overall model structure, but in the final model, they are typically not very consequential for any individual projection.
Training typically consisted of 512 alpha trials per epoch (51.2 sec of real-time equivalent), for 1000 such epochs. Each trial was generated from a virtual reality environment in the emergent simulator, which rendered first-person views with moving eye position onto the object tumbling through space with fixed motion and rotation parameters over the sequence of eight frames (see Figure 5 in the main text for a representative example). Each frame was rendered at a 256 × 256 resolution and processed through our standard V1 Gabor filters, which are described in detail in O'Reilly et al. (2013).
Because the start of each sequence of eight frames is unpredictable, we turned off learning for that trial, which improves learning overall. We have recently developed an automatic version of this mechanism based on the running average (and running variance) of the prediction error: learning is turned off whenever the current prediction error, z-normalized by these running values, is below 1.5 SDs. This works well and will be incorporated into future models. Biologically, this could correspond to a connection between pulvinar and neuromodulatory areas that could regulate the effective learning rate in this way.
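One way such an automatic gate might look is sketched below. It is not the mechanism from the model code. It assumes the monitored quantity is a prediction-match score for which unusually low values signal an unpredictable trial, so that learning is switched off when the score falls more than 1.5 SDs below its running average; if the quantity is instead an error for which larger means worse, the comparison would flip. The class name and the integration constant `tau` are hypothetical.

```python
import math

class LearningGate:
    """Gate learning on a z-normalized running statistic of prediction quality."""

    def __init__(self, tau=10.0, threshold=1.5):
        self.rate = 1.0 / tau        # hypothetical integration constant
        self.threshold = threshold   # 1.5 SDs, from the text
        self.mean = None
        self.var = 0.0

    def learn_ok(self, score):
        """Return True if learning should stay on for this trial."""
        if self.mean is None:        # first trial: nothing to compare against
            self.mean = score
            return True
        sd = math.sqrt(self.var)
        z = (score - self.mean) / sd if sd > 0 else 0.0
        # Update the running mean and variance after making the decision.
        delta = score - self.mean
        self.mean += self.rate * delta
        self.var += self.rate * (delta * delta - self.var)
        return z >= -self.threshold

gate = LearningGate()
for s in [0.8, 0.9] * 20:            # hypothetical well-predicted trials
    gate.learn_ok(s)
typical = gate.learn_ok(0.85)        # close to the running average
surprise = gate.learn_ok(0.1)        # e.g., first frame of a new sequence
```

In this sketch the statistics are updated on every trial, including gated ones; whether gated trials should contribute to the running values is a design choice the text does not specify.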
Figure 18A shows the learning trajectory of the model, indicating that it learns quite rapidly. This rapid initial learning is likely facilitated by the extensive use of shortcut connections converging from all over the simulated visual system onto the V1 pulvinar layers and direct projections back from these pulvinar layers. Thus, error signals are directly communicated and can drive learning quickly and efficiently. However, there are also extensive indirect, bidirectional connections among the superficial layers, which can drive indirect error Bp learning as well.
The biologically based model was implemented using the Leabra framework, which is described in detail in previous publications (O'Reilly et al., 2012, 2016; O'Reilly & Munakata, 2000; O'Reilly, 1996, 1998), and summarized here. The online textbook at CompCogNeuro.org provides the most comprehensive description of the framework, and github.com/emer/leabra has a summary of all the equations (and the code itself). There are two main implementations of Leabra, one in the C++ emergent software and a new one using the Go and Python languages at the prior link. These same equations and standard parameters have been used to simulate over 40 different models in O'Reilly and Munakata (2000), O'Reilly et al. (2012), and a number of other research models. Thus, the model can be viewed as an instantiation of a systematic modeling framework using standardized mechanisms, instead of constructing new mechanisms for each model (O'Reilly et al., 2016).
This section describes in detail the equations that are specific to the “deep” version of Leabra that implements the specific predictive learning additions to the general algorithm. Like the SRN (Elman, 1990; Jordan, 1989), which the deep predictive learning model functionally resembles, the primary computational specialization required is the maintenance of prior temporal context in the CT layer. In addition, the pulvinar layers have to be driven by the bottom–up inputs in the plus phase, after being driven by the CT inputs in the minus phase.
The relative strength of these context layer inputs was set progressively larger for higher layers in the network, with a maximum of four in V4, TEO, and TE. In addition, TEO and TE received “self” context projections, which provide an extended window of temporal context into the prior 200-msec interval, consistent with multiple sources of neural data (Chaudhuri et al., 2015). These self projections were connected only within the narrower pool level of units, enabling these neurons to develop mutually excitatory loops to sustain activations over the multiple trials when the same object was present. We hypothesize that these modifications correspond to biological adaptations in IT cortex that likewise support greater sustained activation of object-level representations.
Learning of the context weights occurs as normal, but using the sending activation states from the prior time step's activation.
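The key point, that the context weights pair the receiving unit's current error signal with the sending activation from the prior time step, can be shown with a deliberately simplified sketch. A plain delta rule on the plus-phase minus minus-phase difference stands in for Leabra's actual learning rule, which is not reproduced here; the function name, learning rate, and example values are hypothetical.

```python
def context_weight_update(weights, send_prev, recv_minus, recv_plus, lrate=0.1):
    """Return updated context weights; weights[j][i] connects sender i to receiver j.

    send_prev holds the sending activations from the PRIOR time step,
    whereas recv_minus and recv_plus are the receiving activations in the
    current prediction (minus) and outcome (plus) phases.
    """
    return [
        [w + lrate * s * (rp - rm) for w, s in zip(row, send_prev)]
        for row, rm, rp in zip(weights, recv_minus, recv_plus)
    ]

# One receiving unit, two senders; only the first sender was active last step.
new_weights = context_weight_update(
    weights=[[0.5, 0.5]],
    send_prev=[1.0, 0.0],
    recv_minus=[0.2],
    recv_plus=[0.7],
)
```

Only the weight from the previously active sender changes, so what counts as useful context is determined by the outcome that followed it.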
Computational and Biological Details of SRN-like Functionality
Predictive autoencoder learning has been explored in various frameworks, but the most relevant to our model comes from the application of the SRN to a range of predictive learning domains (Elman et al., 1996; Elman, 1990). One of the most powerful features of the SRN is that it enables error-driven learning, instead of arbitrary parameter settings, to determine how prior information is integrated with new information. Thus, SRNs can learn to hold onto some important information for a relatively long interval, while rapidly updating other information that is only relevant for a shorter duration. This same flexibility is present in our DeepLeabra model. Furthermore, because this temporal context information is hypothesized to be present in the deep layers throughout the entire neocortex (in every microcolumn of tissue), the DeepLeabra model provides a more pervasive and interconnected form of temporal integration compared to the SRN, which typically only has a single temporal context layer associated with the internal “hidden” layer of processing units.
An extensive computational analysis of what makes the SRN work as well as it does, and explorations of a range of possible alternative frameworks, has led us to an important general principle: Subsequent outcomes determine what is relevant from the past. At some level, this may seem obvious, but it has significant implications for predictive learning mechanisms based on temporal context. It means that the information encoded in a temporal context representation cannot be learned at the time when that information is presently active. Instead, the relevant contextual information is learned on the basis of what happens next.
This explains the peculiar power of the otherwise strange property of the SRN: The temporal context information is preserved as a “direct copy” of the state of the hidden layer units on the previous time step (Figure 19), and then learned synaptic weights integrate that copied context information into the next hidden state (which is then copied to the context again, and so on). This enables the error-driven learning taking place in the current time step to determine how context information from the previous time step is integrated. Furthermore, the simple direct copy operation eschews any attempt to shape this temporal context itself, instead relying on the learning pressure that shapes the hidden layer representations to also shape the context representations. In other words, this copy operation is essential, because there is no other viable source of learning signals to shape the nature of the context representation itself (because these learning signals require future outcomes, which are by definition only available later).
The direct copy operation of the SRN is, however, seemingly problematic from a biological perspective: How could neurons copy activations from another set of neurons at some discrete point in time and then hold onto those copied values for a duration of 100 msec, which is a reasonably long period in neural terms (e.g., a rapidly firing cortical neuron fires at around 100 Hz, meaning that it will fire 10 times within that context frame)? However, there is an important transformation of the SRN context computation, which is more biologically plausible and compatible with the structure of the deep network (Figure 19). Specifically, instead of copying an entire set of activation states, the context activations (generated by the phasic 5IB burst) are immediately sent through the adaptive synaptic weights that integrate this information, which we think occurs in the 6CC (corticocortical) and other lateral integrative connections from 5IB neurons into the rest of the deep network.
The result is a precomputed net input from the context onto a given hidden unit (in the original SRN terminology), not the raw context information itself. Computationally, and metabolically, this is a much more efficient mechanism, because the context is, by definition, unchanging over the 100-msec alpha cycle, and thus, it makes more sense to precompute the synaptic integration once rather than recomputing it at every update step. In the original feedforward Bp-based SRN model, this issue did not arise because a single step of activation updating took place for each context update, whereas in our bidirectional model, many activation update steps must take place per context update.
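The equivalence between holding a copied context and holding its precomputed net input can be checked with a toy example. All names, weights, and activation values below are hypothetical; the sketch only shows that, when the context is fixed over the cycle, integrating it once through the context weights gives the same net input at every update step as re-integrating a stored copy.

```python
def matvec(w, v):
    """Multiply weight matrix w (rows = receiving units) by vector v."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]

def net_inputs_copy(w_in, w_ctx, input_steps, context):
    """SRN-style: keep the raw context and re-integrate it at every step."""
    nets = []
    for x in input_steps:
        ctx_net = matvec(w_ctx, context)          # recomputed each step
        nets.append([a + b for a, b in zip(matvec(w_in, x), ctx_net)])
    return nets

def net_inputs_precomputed(w_in, w_ctx, input_steps, context):
    """Deep-layer style: integrate the context once, then hold the net input."""
    ctx_net = matvec(w_ctx, context)              # computed once per cycle
    return [[a + b for a, b in zip(matvec(w_in, x), ctx_net)]
            for x in input_steps]

w_in = [[0.5, -0.2], [0.1, 0.3]]
w_ctx = [[0.4, 0.0], [0.2, 0.6]]
context = [1.0, 0.5]                               # fixed over the alpha cycle
input_steps = [[1.0, 0.0], [0.8, 0.2], [0.6, 0.4]] # several update steps

copied = net_inputs_copy(w_in, w_ctx, input_steps, context)
precomputed = net_inputs_precomputed(w_in, w_ctx, input_steps, context)
```

The two functions differ only in how often the context integration runs, which is where the savings come from when many update steps occur per context update.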
There are a couple of remaining challenges for this transformation of the SRN. First, the precomputed net input from the context must somehow persist over the subsequent 100-msec period of the alpha cycle. We hypothesize that this can occur via N-methyl-D-aspartate and metabotropic glutamate receptor channels that can easily produce sustained excitatory currents over this time frame. Furthermore, the reciprocal excitatory connectivity from 6CT to TRC and back to 6CT could help to sustain the initial temporal context signal. Second, these contextual integration synapses require a different form of learning algorithm that uses the sending activation from the prior 100 msec, which is well within the time constants in the relevant calcium and second messenger pathways involved in synaptic plasticity.
BACKPROPAGATION MODEL METHODS
The Bp version of the WWI model has the same layer sizes and feedforward patterns of connectivity as the DeepLeabra version. Topographically, the V1p and V1hp pulvinar layers serve as output layers at the highest level of the network, receiving all the various connections from deep layers as shown in Table 1. Likewise, the LIPp served as a target output layer for the “where” pathway. To achieve predictive learning, the V1 pulvinar targets were from the scene at time t, whereas the V1s inputs were from the scene at time t − 1. We also ran a comparison autoencoder model that had inputs and target outputs from the same time step, and it showed even less systematic organization of its higher-level representations, further supporting the notion that predictive learning is important, across all frameworks. The learning curve for the predictive version is shown in Figure 20, which shows better overall prediction accuracy compared to the DeepLeabra model. However, as the RSA showed, this Bp model failed to learn object categories that go beyond the input similarity structure, indicating that perhaps it was paying too much “attention” in learning to this low-level structure, and lacked the necessary mechanisms to enable it to impose a simplifying higher-level structure on top of these inputs.
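The difference between the predictive and autoencoder training regimes comes down to how inputs are paired with targets. The sketch below uses hypothetical function names and placeholder frame labels in place of actual V1 patterns.

```python
def predictive_pairs(frames):
    """Pair the scene at time t - 1 (input) with the scene at time t (target)."""
    return list(zip(frames[:-1], frames[1:]))

def autoencoder_pairs(frames):
    """Pair each scene with itself, so input and target share a time step."""
    return [(frame, frame) for frame in frames]

# Placeholder labels standing in for one eight-frame sequence.
sequence = [f"frame{t}" for t in range(8)]
pred = predictive_pairs(sequence)
auto = autoencoder_pairs(sequence)
```

An eight-frame sequence yields seven predictive pairs, because the first frame has no preceding input from which it could be predicted.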
PREDNET MODEL METHODS
The PredNet architecture was designed to incorporate principles from predictive coding theory into a neural network model for predicting the next frame in a video sequence. Details of the model can be found in the original paper (Lotter et al., 2016), but here, we provide a brief overview of the architecture.
All analyses in the RSA were conducted using the representations from the Rl layers.
All experiments with the PredNet architecture were performed using PyTorch. An informal hyperparameter search was conducted to find the settings that maximized representational similarity to the human judgments. This was done by conducting RSA on each layer for each hyperparameter setting and computing, according to the centroid categories derived from the human data, the difference between the average within-category similarity and the average between-category similarity. Our final architecture had six layers with 3, 16, 32, 64, 128, and 256 filters in the Al and Rl modules and 3 × 3 kernels throughout the whole network. We also found that using sigmoid and tanh activation functions in fully connected convolutional LSTMs slightly improved performance, so these were used for all experiments.
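The selection criterion used in the hyperparameter search, the average within-category similarity minus the average between-category similarity, might be computed as in the following sketch. The function name, the toy similarity matrix, and the category labels are hypothetical, and excluding the diagonal and counting each pair once are assumptions.

```python
def within_minus_between(sim, categories):
    """Average within-category similarity minus average between-category similarity.

    sim is a symmetric similarity matrix (list of lists); categories[i] is
    the category of item i. Each unordered pair is counted once and the
    diagonal is skipped (assumptions of this sketch).
    """
    within, between = [], []
    n = len(categories)
    for i in range(n):
        for j in range(i + 1, n):
            same = categories[i] == categories[j]
            (within if same else between).append(sim[i][j])
    return sum(within) / len(within) - sum(between) / len(between)

# Hypothetical toy matrix with two categories of two items each.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.2],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.2, 0.8, 1.0],
]
score = within_minus_between(sim, ["a", "a", "b", "b"])
```

A larger score means that items are more similar to members of their own category than to members of other categories, so a layer with more categorical representations scores higher.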
The weights in the PredNet model are trained using error Bp. Predictions are generated, and errors are computed at all levels of the hierarchy, but the model performs better when only the lowest layer's errors are backpropagated (Lotter et al., 2016). We confirmed these results with experiments that backpropagated the errors in higher layers, in which performance (in terms of mean squared error) was marginally reduced but the RSA results were similar. For this reason, all reported experiments used a PredNet that was trained by only backpropagating the lowest level error.
The model was trained using a batch size of 8 and an Adam optimizer with a learning rate of 0.0001, with no scheduler, for 150,000 batches. A training curve is shown in Figure 21, showing that it achieves the best overall prediction accuracy of any model we tested and yet does not have representations that are as differentiated or categorical as our biologically based model, as shown in the main paper.
As discussed in the main paper, our biologically based model includes a number of important biologically motivated properties that may be contributing to the development of its categorical representations. These properties, including excitatory bidirectional connections, inhibitory competition, and an additional form of Hebbian learning, may be acting as regularizers that encourage categorical learning. We therefore tested whether standard regularization methods used in deep learning would have similar effects on the representations developed in the PredNet architecture. We tested (1) batch normalization, (2) dropout (0.1, 0.3, and 0.5), and (3) weight decay (0.01, 0.001, 0.0001, 0.00001). All experiments with batch normalization and weight decay showed reduced performance (in terms of both prediction error on the test set and within-category correlation). As shown in Figure 22, dropout marginally improved the within-category correlation while also slightly improving prediction accuracy, so a dropout rate of 0.1 was used for the comparison to our biologically based model in the main paper.
We thank Dean Wyatte, Tom Hazy, Seth Herd, Kai Krueger, Tim Curran, David Sheinberg, Lew Harvey, Jessica Mollick, Will Chapman, Helene Devillez, and the rest of the CCN Lab for many helpful comments and suggestions. This work was supported by ONR grants ONR N00014-19-1-2684/N00014-18-1-2116, N00014-14-1-0670/N00014-16-1-2128, N00014-18-C-2067, N00014-13-1-0067, and D00014-12-C-0638.
This work utilized the Janus supercomputer, which is supported by the National Science Foundation (award number CNS-0821794) and the University of Colorado Boulder. The Janus supercomputer is a joint effort of the University of Colorado Boulder, the University of Colorado Denver, and the National Center for Atmospheric Research. All data and materials will be available at github.com/ccnlab/deep-obj-cat upon publication.
Reprint requests should be sent to Randall C. O'Reilly, Department of Psychology, Computer Science, and Center for Neuroscience, University of California Davis, 1544 Newton Ct, Davis, CA 95618, or via e-mail: firstname.lastname@example.org.
Randall C. O'Reilly: Conceptualization; Formal analysis; Funding acquisition; Investigation; Methodology; Project administration; Software; Supervision; Validation; Visualization; Writing – Original draft. Jacob L. Russin: Formal analysis; Investigation; Methodology; Validation; Writing – Review & editing. Maryam Zolfaghar: Investigation; Methodology; Validation; Writing – Review & editing. John Rohrlich: Conceptualization; Data curation; Investigation; Methodology; Software; Validation; Writing – Review & editing.
Randall C. O'Reilly: Office of Naval Research (http://dx.doi.org/10.13039/100000006), grants D00014-12-C-0638, N00014-13-1-0067, N00014-14-1-0670, N00014-18-C-2067, and N00014-19-1-2684.
Diversity in Citation Practices
A retrospective analysis of the citations in every article published in this journal from 2010 to 2020 has revealed a persistent pattern of gender imbalance: Although the proportions of authorship teams (categorized by estimated gender identification of first author/last author) publishing in the Journal of Cognitive Neuroscience (JoCN) during this period were M(an)/M = .408, W(oman)/M = .335, M/W = .108, and W/W = .149, the comparable proportions for the articles that these authorship teams cited were M/M = .579, W/M = .243, M/W = .102, and W/W = .076 (Fulvio et al., JoCN, 33:1, pp. 3–7). Consequently, JoCN encourages all authors to consider gender balance explicitly when selecting which articles to cite and gives them the opportunity to report their article's gender citation balance.