Abstract
We present a review of predictive coding, from theoretical neuroscience, and variational autoencoders, from machine learning, identifying the common origin and mathematical framework underlying both areas. As each area is prominent within its respective field, more firmly connecting these areas could prove useful in the dialogue between neuroscience and machine learning. After reviewing each area, we discuss two possible correspondences implied by this perspective: cortical pyramidal dendrites as analogous to (nonlinear) deep networks and lateral inhibition as analogous to normalizing flows. These connections may provide new directions for further investigations in each field.
1 Introduction
1.1 Cybernetics
Machine learning and theoretical neuroscience once overlapped under the field of cybernetics (Wiener, 1948; Ashby, 1956). Within this field, perception and control, in both biological and nonbiological systems, were formulated in terms of negative feedback and feedforward processes. Negative feedback attempts to minimize error signals by feeding the errors back into the system, whereas feedforward processing attempts to preemptively reduce error through prediction. Cybernetics formalized these techniques using probabilistic models, which estimate the likelihood of random outcomes, and variational calculus, a technique for estimating functions, particularly probability distributions (Wiener, 1948). This resulted in the first computational models of neuron function and learning (McCulloch & Pitts, 1943; Rosenblatt, 1958; Widrow & Hoff, 1960), a formal definition of information (Wiener, 1942; Shannon, 1948) (with connections to neural systems; Barlow, 1961b), and algorithms for negative feedback perception and control (MacKay, 1956; Kalman, 1960). Yet with advances in these directions (see Prieto et al., 2016), the cohesion of cybernetics diminished, with the new ideas taking root in, for example, theoretical neuroscience, machine learning, and control theory. The transfer of ideas is shown in Figure 1.
1.2 Neuroscience and Machine Learning: Convergence and Divergence
A renewed dialogue between neuroscience and machine learning formed in the 1980s and 1990s. Neuroscientists, bolstered by new physiological and functional analyses, began making traction in studying neural systems in probabilistic and information-theoretic terms (Laughlin, 1981; Srinivasan, Laughlin, & Dubs, 1982; Barlow, 1989; Bialek, Rieke, Van Steveninck, & Warland, 1991). In machine learning, improvements in probabilistic modeling (Pearl, 1986) and artificial neural networks (Rumelhart, Hinton, & Williams, 1986) combined with ideas from statistical mechanics (Hopfield, 1982; Ackley, Hinton, & Sejnowski, 1985) to yield new classes of models and training techniques. This convergence of ideas, primarily centered around perception, resulted in new theories of neural processing and improvements in their mathematical underpinnings.
In particular, the notion of predictive coding emerged within neuroscience (Srinivasan et al., 1982; Rao & Ballard, 1999). In its most general form, predictive coding postulates that neural circuits are engaged in estimating probabilistic models of other neural activity and sensory inputs, with feedback and feedforward processes playing a central role. These models were initially formulated in early sensory areas, for example, in the retina (Srinivasan et al., 1982) and thalamus (Dong & Atick, 1995), using feedforward processes to predict future neural activity. Similar notions were extended to higher-level sensory processing in neocortex by David Mumford (1991, 1992). Top-down neural projections (from higher-level to lower-level sensory areas) were hypothesized to convey sensory predictions, whereas bottom-up neural projections were hypothesized to convey prediction errors. Through negative feedback, these errors then updated state estimates. These ideas were formalized by Rao and Ballard (1999), who formulated a simplified artificial neural network model of images, reminiscent of a Kalman filter (Kalman, 1960).
Feedback and feedforward processes also featured prominently in machine learning. Indeed, the primary training algorithm for artificial neural networks, backpropagation (Rumelhart et al., 1986), literally feeds (propagates) the output prediction errors back through the network—negative feedback. During this period, the technique of variational inference was rediscovered within machine learning (Hinton & Van Camp, 1993; Neal & Hinton, 1998), recasting probabilistic inference using variational calculus. This technique proved essential in formulating the Helmholtz machine (Dayan et al., 1995; Dayan & Hinton, 1996), a hierarchical unsupervised probabilistic model parameterized by artificial neural networks. Similar advances were made in autoregressive probabilistic models (Frey, Hinton, & Dayan, 1996; Bengio & Bengio, 2000), using artificial neural networks to form sequential feedforward predictions, as well as new classes of invertible probabilistic models (Comon, 1994; Parra, Deco, & Miesbach, 1995; Deco & Brauer, 1995; Bell & Sejnowski, 1997).
These new ideas regarding variational inference and probabilistic models, particularly the Helmholtz machine (Dayan, Hinton, Neal, & Zemel, 1995), influenced predictive coding. Specifically, Karl Friston utilized variational inference to formulate hierarchical dynamical models of neocortex (Friston, 2005, 2008a). In line with Mumford (1992), these models contain multiple levels, with each level attempting to predict its future activity (feedforward) as well as lower-level activity, closer to the input data. Prediction errors across levels facilitate updating higher-level estimates (negative feedback). Such models have incorporated many biological aspects, including local learning rules (Friston, 2005) and attention (Spratling, 2008; Feldman & Friston, 2010; Kanai, Komura, Shipp, & Friston, 2015), and have been compared with neural circuits (Bastos et al., 2012; Keller & Mrsic-Flogel, 2018; Walsh, McGovern, Clark, & O'Connell, 2020). While predictive coding and other Bayesian brain theories are increasingly popular (Doya, Ishii, Pouget, & Rao, 2007; Friston, 2009; Clark, 2013), validating these models is hampered by the difficulty of distinguishing between specific design choices and general theoretical claims (Gershman, 2019). Further, a large gap remains between the simplified implementations of these models and the complexity of neural systems.
Progress in machine learning picked up in the early 2010s, with advances in parallel computing as well as standardized data sets (Deng et al., 2009). In this era of deep learning (LeCun, Bengio, & Hinton, 2015; Schmidhuber, 2015), that is, artificial neural networks with multiple layers, a flourishing of ideas emerged around probabilistic modeling. Building off previous work, more expressive classes of deep hierarchical (Gregor, Danihelka, Mnih, Blundell, & Wierstra, 2014; Mnih & Gregor, 2014; Kingma & Welling, 2014; Rezende, Mohamed, & Wierstra, 2014), autoregressive (Uria, Murray, & Larochelle, 2014; van den Oord, Kalchbrenner, & Kavukcuoglu, 2016), and invertible (Dinh, Krueger, & Bengio, 2015; Dinh, Sohl-Dickstein, & Bengio, 2017) probabilistic models were developed. Of particular importance is a model class known as variational autoencoders (VAEs; Kingma & Welling, 2014; Rezende et al., 2014), a relative of the Helmholtz machine, which closely resembles hierarchical predictive coding. Unfortunately, despite this similarity, the machine learning community remains largely oblivious to the progress in predictive coding and vice versa.
1.3 Connecting Predictive Coding and VAEs
This review aims to bridge the divide between predictive coding and VAEs. While this work provides unique contributions, it is inspired by previous work at this intersection. In particular, van den Broeke (2016) outlines hierarchical probabilistic models in predictive coding and machine learning. Likewise, Lotter, Kreiman, and Cox (2017, 2018) implement predictive coding techniques in deep probabilistic models, comparing these models with neural phenomena.
After reviewing background mathematical concepts in section 2, we discuss the basic formulations of predictive coding in section 3 and variational autoencoders in section 4, and we identify commonalities in their model formulations and inference techniques in section 5. Based on these connections, in section 6, we discuss two possible correspondences between machine learning and neuroscience seemingly suggested by this perspective:
- Dendrites of pyramidal neurons and deep artificial networks, affirming a more nuanced perspective over the analogy of biological and artificial neurons
- Lateral inhibition and normalizing flows, providing a more general framework for normalization.
2 Background
2.1 Maximum Log Likelihood
The maximum log-likelihood objective, $\max_\theta \, \mathbb{E}_{p_{\text{data}}(\mathbf{x})} \left[ \log p_\theta(\mathbf{x}) \right]$, is found throughout machine learning and probabilistic modeling (Murphy, 2012). In practice, we do not have access to $p_{\text{data}}(\mathbf{x})$ and instead approximate the objective using data samples, that is, using $\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(\mathbf{x}^{(i)})$.
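As an illustrative sketch (not code from the original text), the following minimal numpy example maximizes the sample-average log-likelihood of a univariate gaussian model by gradient ascent; the data-generating parameters, learning rate, and iteration count are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1000)  # stand-in for samples from p_data(x)

mu, log_sigma = 0.0, 0.0  # model parameters theta
lr = 0.05
for _ in range(500):
    sigma = np.exp(log_sigma)
    # gradients of the sample-average log-likelihood (1/N) sum_i log N(x_i; mu, sigma^2)
    grad_mu = np.mean((data - mu) / sigma**2)
    grad_log_sigma = np.mean((data - mu) ** 2 / sigma**2 - 1.0)
    mu += lr * grad_mu
    log_sigma += lr * grad_log_sigma

print(mu, np.exp(log_sigma))  # approaches the sample mean and standard deviation
```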
2.2 Probabilistic Models
2.2.1 Dependency Structure
2.2.2 Parameterizing the Model
Conditional dependencies are mediated by the distribution parameters, which are functions of the conditioning variables. For example, we can express an autoregressive gaussian distribution (see equation 2.2) through $p_\theta(x_j | \mathbf{x}_{<j}) = \mathcal{N}(x_j; \mu_\theta(\mathbf{x}_{<j}), \sigma^2_\theta(\mathbf{x}_{<j}))$, where $\mu_\theta$ and $\sigma^2_\theta$ are functions taking $\mathbf{x}_{<j}$ as input. A similar form applies to autoregressive models on sequences of vector inputs (see equation 2.3), with $p_\theta(\mathbf{x}_t | \mathbf{x}_{<t}) = \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_\theta(\mathbf{x}_{<t}), \mathrm{diag}(\boldsymbol{\sigma}^2_\theta(\mathbf{x}_{<t})))$. Likewise, in a latent variable model (see equation 2.4), we can express a gaussian conditional likelihood as $p_\theta(\mathbf{x} | \mathbf{z}) = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_\theta(\mathbf{z}), \mathrm{diag}(\boldsymbol{\sigma}^2_\theta(\mathbf{z})))$. In the above examples, we have used a subscript $\theta$ for all functions; however, these may be separate functions in practice.
Deep learning (Goodfellow, Bengio, & Courville, 2016) provides probabilistic models with expressive nonlinear functions, improving their capacity. In these models, the distribution parameters are parameterized with deep networks, which are then trained by backpropagating (Rumelhart et al., 1986) the gradient of the log-likelihood, $\nabla_\theta \log p_\theta(\mathbf{x})$, through the network. Deep probabilistic models have enabled recent advances in speech (Graves, 2013; van den Oord et al., 2016), natural language (Sutskever, Vinyals, & Le, 2014; Radford et al., 2019), images (Razavi et al., 2019), video (Kumar et al., 2020), reinforcement learning (Chua, Calandra, McAllister, & Levine, 2018; Ha & Schmidhuber, 2018), and other areas.
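As a minimal sketch of this parameterization (an illustration added here, with arbitrary layer sizes and randomly initialized weights), a small two-layer network maps a latent vector to the mean and variance of a gaussian conditional likelihood, under which the log-likelihood of an observation can then be evaluated.

```python
import numpy as np

rng = np.random.default_rng(1)
z_dim, h_dim, x_dim = 4, 16, 8

# randomly initialized weights of a small parameter network
W1, b1 = rng.normal(size=(h_dim, z_dim)) * 0.1, np.zeros(h_dim)
W_mu, b_mu = rng.normal(size=(x_dim, h_dim)) * 0.1, np.zeros(x_dim)
W_logvar, b_logvar = rng.normal(size=(x_dim, h_dim)) * 0.1, np.zeros(x_dim)

def likelihood_params(z):
    """Map a latent vector z to the mean and variance of p(x | z)."""
    h = np.tanh(W1 @ z + b1)
    mu = W_mu @ h + b_mu
    var = np.exp(W_logvar @ h + b_logvar)  # exponentiate to ensure positivity
    return mu, var

def gaussian_log_prob(x, mu, var):
    """Diagonal gaussian log-density, summed over dimensions."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

z = rng.normal(size=z_dim)
x = rng.normal(size=x_dim)
mu, var = likelihood_params(z)
print(gaussian_log_prob(x, mu, var))  # log p(x | z) under the current weights
```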
Figure 1: Concept overview. Cybernetics influenced the areas that became theoretical neuroscience and machine learning, resulting in shared mathematical concepts. This review explores the connections between predictive coding, from theoretical neuroscience, and variational autoencoders, from machine learning.
Figure 2: Dependency structures. Each diagram shows a directed graphical model. Nodes represent random variables, and arrows represent dependencies. The main forms of dependency structure are autoregressive (see equation 2.2) and latent variable models (see equation 2.4). These structures can be combined in various ways (see equations 2.7 and 2.8).
Figure 3: Autoregressive computation graph. The graph contains the (gaussian) conditional likelihoods (green), data (gray), and terms in the objective (red dots). Gradients (red dotted lines) backpropagate through the networks parameterizing the distributions.
Autoregressive models have proven useful in many domains. However, there are reasons to prefer latent variable models in some contexts. First, autoregressive sampling is inherently sequential, becoming costly in high-dimensional domains. Second, latent variables provide a representation for downstream tasks, compression, and overall data analysis. Finally, latent variables increase flexibility, which is useful for modeling complex distributions with relatively simple (e.g., gaussian) conditional distributions. While flow-based latent variable models offer one option, their invertibility requirement limits the types of functions that can be used. For these reasons, we require methods for handling the latent marginalization in equation 2.5. Variational inference is one such method.
2.3 Variational Inference
Training latent variable models through maximum likelihood requires evaluating $\log p_\theta(\mathbf{x}) = \log \int p_\theta(\mathbf{x}, \mathbf{z}) \, d\mathbf{z}$. However, evaluating this marginalization is generally intractable. Thus, we require some technique for approximating $\log p_\theta(\mathbf{x})$. Variational inference (Hinton & Van Camp, 1993; Jordan, Ghahramani, Jaakkola, & Saul, 1998) approaches this problem by introducing an approximate posterior distribution, $q(\mathbf{z} | \mathbf{x})$, which provides a lower bound, $\mathcal{L} \leq \log p_\theta(\mathbf{x})$. This lower bound is referred to as the evidence (or variational) lower bound (ELBO), as well as the (negative) free energy. By tightening and maximizing the ELBO with regard to the model parameters, $\theta$, we can approximate maximum likelihood training.
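In its standard form (written here in the notation above, since the numbered equations are not reproduced in this section), the bound reads

$$\log p_\theta(\mathbf{x}) \;\geq\; \mathcal{L} \;=\; \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] \;-\; D_{\mathrm{KL}}\left(q(\mathbf{z}|\mathbf{x}) \,\|\, p_\theta(\mathbf{z})\right),$$

where the first term rewards accurate reconstruction of the data and the KL term penalizes deviation of the approximate posterior from the prior; appendix A sketches the derivation.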
Figure 4: ELBO computation graphs. (a) Basic computation graph for variational inference. Outlined circles denote distributions, smaller red circles denote terms in the ELBO, and arrows denote conditional dependencies. This notation can be used to express (b) hierarchical and (c) sequential models with various dependencies.
3 Predictive Coding
Predictive coding can be divided into two settings, spatiotemporal and hierarchical, roughly corresponding to the two main forms of probabilistic dependencies. In this section, we review these settings, discussing existing hypothesized correspondences with neural anatomy. We then outline the empirical support for predictive coding, highlighting the need for large-scale, testable models.
3.1 Spatiotemporal Predictive Coding
Normalization can also be applied within each observation to remove spatial dependencies. For instance, we can apply another autoregressive transform over spatial dimensions, predicting the $j$th dimension, $x_j$, as a function of previous dimensions, $\mathbf{x}_{<j}$ (see equation 2.2). With linear functions, this corresponds to Cholesky whitening (Pourahmadi, 2011; Kingma et al., 2016). However, this imposes an ordering over dimensions. Zero-phase components analysis (ZCA) whitening instead learns symmetric spatial dependencies (Kessy, Lewin, & Strimmer, 2018). Modeling these dependencies with a constant covariance matrix, $\boldsymbol{\Sigma}$, and mean, $\boldsymbol{\mu}$, the whitening transform is $\widetilde{\mathbf{x}} = \boldsymbol{\Sigma}^{-1/2} (\mathbf{x} - \boldsymbol{\mu})$. With natural images, this results in center-surround filters in the rows of $\boldsymbol{\Sigma}^{-1/2}$, thereby extracting edges (see Figure 5a).
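A minimal numpy sketch of ZCA whitening (an added illustration; the correlated data are simulated with a random mixing matrix, and a small ridge term is assumed for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(0)
# simulate correlated "patch" data: x = A s with a random mixing matrix A
A = rng.normal(size=(16, 16))
X = rng.normal(size=(10000, 16)) @ A.T

mu = X.mean(axis=0)
Sigma = np.cov(X - mu, rowvar=False)

# ZCA whitening matrix Sigma^{-1/2}, computed via eigendecomposition
evals, evecs = np.linalg.eigh(Sigma)
W_zca = evecs @ np.diag(1.0 / np.sqrt(evals + 1e-8)) @ evecs.T

X_white = (X - mu) @ W_zca.T
# the whitened data have (approximately) identity covariance
print(np.allclose(np.cov(X_white, rowvar=False), np.eye(16), atol=1e-1))
```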
Srinivasan et al. (1982) investigated spatiotemporal predictive coding in the retina, where compression is essential for transmission through the optic nerve. Estimating the (linear) autocorrelation of input sensory signals, they showed that spatiotemporal predictive coding models retinal ganglion cell responses in flies. This scheme allows these neurons to more fully utilize their dynamic range. It is generally accepted that the retina, in part, performs stages of spatiotemporal normalization through center-surround receptive fields and on-off responses (Hosoya, Baccus, & Meister, 2005; Graham, Chandler, & Field, 2006; Pitkow & Meister, 2012; Palmer, Marre, Berry, & Bialek, 2015). Dong and Atick (1995) applied similar ideas to the thalamus, proposing an additional stage of temporal normalization. This also relates to the notion of generalized coordinates (Friston, 2008a), that is, modeling temporal derivatives, which can be approximated using finite differences (prediction errors). That is, $d\mathbf{x}_t / dt \approx \mathbf{x}_t - \mathbf{x}_{t-1}$. Thus, spatiotemporal predictive coding may be utilized at multiple stages of sensory processing to remove redundancy (Huang & Rao, 2011).
In neural circuits, normalization often involves inhibitory interneurons (Carandini & Heeger, 2012), performing operations similar to those in equation 3.1. For instance, inhibition occurs in the retina between photoreceptors, via horizontal cells, and between bipolar cells, via amacrine cells. This can extract unpredicted motion, e.g., an object moving relative to the background (Ölveczky, Baccus, & Meister, 2003; Baccus, Ölveczky, Manu, & Meister, 2008). A similar scheme is present in the lateral geniculate nucleus (LGN) in the thalamus, with interneurons inhibiting relay cells from the retina (Sherman & Guillery, 2002). As mentioned above, this is thought to perform temporal normalization (Dong & Atick, 1995; Dan, Atick, & Reid, 1996). Lateral inhibition is also prominent in neocortex, with distinct classes of interneurons shaping the responses of pyramidal neurons (Isaacson & Scanziani, 2011). Part of their computational role appears to be spatiotemporal normalization (Carandini & Heeger, 2012).
3.2 Hierarchical Predictive Coding
Figure 5: Spatiotemporal predictive coding. (a) Spatial predictive coding removes spatial dependencies, using center-surround filters (left) to extract edges (right). (b) Temporal predictive coding removes temporal dependencies, extracting motion from video. Video frames are from BAIR Robot Pushing (Ebert, Finn, Lee, & Levine, 2017).
Figure 6: Brain anatomy and cortical circuitry. Sensory inputs enter the thalamus, forming reciprocal connections with the neocortex, which is composed of six layers, with columns across layers and hierarchies of columns. Black and red circles represent excitatory and inhibitory neurons, respectively, with arrows denoting connections. This circuit is repeated with variations throughout neocortex.
Formulating a theory of neocortex, Mumford (1992) described the thalamus as an “active blackboard,” with the neocortex attempting to reconstruct the activity in the thalamus and lower hierarchical areas. Under this theory, backward projections convey predictions, while forward projections use prediction errors to update estimates. Through a dynamic process, the system settles to an activity pattern, minimizing prediction error. Over time, the parameters are also adjusted to improve predictions. In this way, negative feedback is used, both in inference and learning, to construct a generative model of sensory inputs. Generative state estimation dates back (at least) to Helmholtz (Von Helmholtz, 1867), and error-based updating is in line with cybernetics (Wiener, 1948; MacKay, 1956), which influenced Kalman filtering (Kalman, 1960), a ubiquitous Bayesian filtering algorithm.
Figure 7: Hierarchical predictive coding. The diagram shows the basic computation graph for a gaussian latent variable model with MAP inference. The insets show the weighted error calculation for the latent (left) and observed (right) variables.
Predictive coding identifies the conditional likelihood (equation 3.2) with backward (top-down) cortical projections, whereas inference (equation 3.5) is identified with forward (bottom-up) projections (Friston, 2005). Each is thought to be mediated by pyramidal neurons. Under this model, each cortical column predicts and estimates a stochastic continuous latent variable, possibly represented via a (pyramidal) firing rate or membrane potential (Friston, 2005). Interneurons within columns calculate errors ($\boldsymbol{\varepsilon}_{\mathbf{x}}$ and $\boldsymbol{\varepsilon}_{\mathbf{z}}$). Although we have only discussed diagonal covariance ($\boldsymbol{\sigma}^2_{\mathbf{x}}$ and $\boldsymbol{\sigma}^2_{\mathbf{z}}$), lateral inhibitory interneurons could parameterize full covariance matrices, $\boldsymbol{\Sigma}_{\mathbf{x}}$ and $\boldsymbol{\Sigma}_{\mathbf{z}}$, as a form of spatial predictive coding (see section 3.1). These factors weight $\boldsymbol{\varepsilon}_{\mathbf{x}}$ and $\boldsymbol{\varepsilon}_{\mathbf{z}}$, modulating the gain of each error as a form of “attention” (Feldman & Friston, 2010). Neural correspondences are summarized in Table 1.
Table 1: Neural Correspondences of Hierarchical Predictive Coding.

| Neuroscience | Predictive Coding |
|---|---|
| Top-down cortical projections | Generative model conditional mapping |
| Bottom-up cortical projections | Inference updating |
| Lateral inhibition | Covariance matrices |
| (Pyramidal) neuron activity | Latent variable estimates & errors |
| Cortical column | Corresponding estimate & error |
We have presented a simplified model of hierarchical predictive coding, without multiple latent levels and dynamics. A full hierarchical predictive coding model would include these aspects and others. In particular, Friston has explored various design choices (Friston, Mattout, Trujillo-Barreto, Ashburner, & Penny, 2007; Friston, 2008a, 2008b), yet the core aspects of probabilistic generative modeling and variational inference remain the same. Elaborating and comparing these choices will be essential for empirically validating hierarchical predictive coding.
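To make the core computation concrete, the following numpy sketch (an added illustration, not an implementation from the literature) performs gradient-based MAP inference in a single-level linear gaussian model in the spirit of Rao and Ballard (1999); the generative weights, noise variances, step size, and iteration count are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, z_dim = 12, 4

W = rng.normal(size=(x_dim, z_dim))   # generative (top-down) weights
sigma_x2, sigma_z2 = 0.1, 1.0         # observation and prior variances

z_true = rng.normal(size=z_dim)
x = W @ z_true + np.sqrt(sigma_x2) * rng.normal(size=x_dim)

# settle the latent estimate by feeding back precision-weighted prediction errors
z = np.zeros(z_dim)
lr = 0.002
for _ in range(3000):
    eps_x = x - W @ z                 # bottom-up (observation-level) prediction error
    eps_z = z                         # error of z relative to its zero-mean prior
    z += lr * (W.T @ eps_x / sigma_x2 - eps_z / sigma_z2)

print(np.round(z, 2))       # MAP estimate after settling
print(np.round(z_true, 2))  # latent used to generate x (the estimate should be close)
```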
3.3 Empirical Support
While there is considerable evidence in support of predictions and errors in neural systems, disentangling these general aspects of predictive coding from the particular algorithmic choices remains challenging (Gershman, 2019). Here, we outline relevant work, but we refer to Huang and Rao (2011), Bastos et al. (2012), Clark (2013), Keller and Mrsic-Flogel (2018), and Walsh, McGovern, Clark, and O'Connell (2020) for a more in-depth overview.
3.3.1 Spatiotemporal
Various works have investigated predictive coding in early sensory areas such as the retina (Srinivasan et al., 1982; Atick & Redlich, 1992). This involves fitting retinal ganglion cell responses to a spatial whitening (or decorrelation) process (Graham et al., 2006; Pitkow & Meister, 2012), which can be dynamically adjusted (Hosoya et al., 2005). Similar analyses suggest that retina also employs temporal predictive coding (Srinivasan et al., 1982; Palmer et al., 2015). Such models typically contain linear whitening filters (center-surround) followed by nonlinearities. These nonlinearities have been shown to be essential for modeling responses (Pitkow & Meister, 2012), possibly by inducing added sparsity (Graham et al., 2006). Spatiotemporal predictive coding also appears to be found in the thalamus (Dong & Atick, 1995; Dan et al., 1996) and cortex; however, such analyses are complicated by backward, modulatory inputs.
3.3.2 Hierarchical
Early work toward empirically validating hierarchical predictive coding came from explaining extraclassical receptive field effects (Rao & Ballard, 1999; Rao & Sejnowski, 2002), whereby top-down signals in the cortex alter classical visual receptive fields, suggesting that top-down influences play a key role in sensory processing (Gilbert & Sigman, 2007). Note that such effects support a cortical generative model generally (Olshausen & Field, 1997), not predictive coding specifically.
Temporal influences have been demonstrated through repetition suppression (Summerfield et al., 2006), in which activity diminishes in response to repeated (i.e., predictable) stimuli. This may reflect error suppression from improved predictions. Predictive coding has also been used to explain biphasic responses in LGN (Jehee & Ballard, 2009), in which reversing the visual input with an anticorrelated image results in a large response, presumably due to prediction errors. Predictive signals have been documented in auditory (Wacongne et al., 2011) and visual (Meyer & Olson, 2011) processing. Activity seemingly corresponding to prediction errors has also been observed in a variety of areas and contexts, including visual cortex in mice (Keller, Bonhoeffer, & Hübener, 2012; Zmarz & Keller, 2016; Gillon et al., 2021), auditory cortex in monkeys (Eliades & Wang, 2008) and rodents (Parras et al., 2017), and visual cortex in humans (Murray, Kersten, Olshausen, Schrater, & Woods, 2002; Alink, Schwiedrzik, Kohler, Singer, & Muckli, 2010; Egner, Monti, & Summerfield, 2010). Thus, sensory cortex appears to be engaged in hierarchical and temporal prediction, with prediction errors playing a key role.
Empirical evidence for predictive coding aside, given the complexity of neural systems, the theory is undoubtedly incomplete or incorrect. Without the low-level details such as connectivity and potentials, it is difficult to determine the computational form of the circuit. Further, these models are typically oversimplified, with few trained parameters, detached from natural stimuli. While new tools enable us to test predictive coding in neural circuits (Gillon et al., 2021), machine learning, particularly VAEs, can advance from the other direction. Training large-scale models on natural stimuli may improve empirical predictions for biological systems (Rao & Ballard, 1999; Lotter et al., 2018).
4 Variational Autoencoders
Variational autoencoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014) are latent variable models parameterized by deep networks. As in hierarchical predictive coding, these models typically contain gaussian latent variables and are trained using variational inference. However, rather than performing inference optimization directly, VAEs amortize inference (Gershman & Goodman, 2014).
4.1 Amortized Variational Inference
To differentiate through $\mathbf{z} \sim q(\mathbf{z} | \mathbf{x})$, we can use the pathwise derivative estimator, also referred to as the reparameterization estimator (Kingma & Welling, 2014). This is accomplished by expressing $\mathbf{z}$ in terms of an auxiliary random variable. The most common example expresses a gaussian sample, $\mathbf{z} \sim \mathcal{N}(\mathbf{z}; \boldsymbol{\mu}_q, \mathrm{diag}(\boldsymbol{\sigma}^2_q))$, as $\mathbf{z} = \boldsymbol{\mu}_q + \boldsymbol{\sigma}_q \odot \boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and $\odot$ denotes element-wise multiplication. We can then estimate $\nabla_{\boldsymbol{\mu}_q} \mathcal{L}$ and $\nabla_{\boldsymbol{\sigma}_q} \mathcal{L}$, allowing us to calculate the inference model gradients, $\nabla_\phi \mathcal{L}$.
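As a minimal illustration (added here), the pathwise estimator below computes a Monte Carlo gradient of $\mathbb{E}[z^2]$ with respect to the gaussian mean and standard deviation, for which the exact gradients are $2\mu$ and $2\sigma$; the objective is an arbitrary stand-in for the ELBO.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7
n_samples = 100000

eps = rng.normal(size=n_samples)        # auxiliary noise, independent of (mu, sigma)
z = mu + sigma * eps                    # reparameterized samples z ~ N(mu, sigma^2)

# pathwise gradients of f(z) = z^2: f'(z) * dz/dmu and f'(z) * dz/dsigma
grad_mu = np.mean(2 * z)
grad_sigma = np.mean(2 * z * eps)

print(grad_mu, 2 * mu)                  # Monte Carlo estimate vs. exact gradient 2*mu
print(grad_sigma, 2 * sigma)            # Monte Carlo estimate vs. exact gradient 2*sigma
```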
Figure 8: Variational autoencoder. VAEs use direct amortization (see equation 4.1) to train deep latent variable models. The inference model (left) is an encoder, and the conditional likelihood (right) is a decoder. Each is parameterized by deep networks.
4.1.1 Iterative Amortized Inference
4.2 Extensions of VAEs
4.2.1 Additional Dependencies and Representation Learning
VAEs have been extended to a variety of architectures, incorporating hierarchical and temporal dependencies. Sønderby et al. (2016) proposed a hierarchical VAE, in which the conditional prior at each level, $\ell$, that is, $p_\theta(\mathbf{z}_\ell | \mathbf{z}_{\ell+1})$, is parameterized by a deep network (see equation 2.7). Follow-up works have scaled this approach with impressive results (Kingma et al., 2016; Vahdat & Kautz, 2020; Child, 2020), extracting increasingly abstract features at higher levels (Maaløe, Fraccaro, Lievin, & Winther, 2019). Another line of work has incorporated temporal dependencies within VAEs, parameterizing dynamics in the prior and conditional likelihood with deep networks (Chung et al., 2015; Fraccaro, Sønderby, Paquet, & Winther, 2016). Such models can also provide representations and predictions for reinforcement learning (Ha & Schmidhuber, 2018; Hafner et al., 2019).
Other work has investigated representation learning within VAEs. One approach, the $\beta$-VAE (Higgins et al., 2017), modifies the ELBO (see equation 2.14) by adjusting a weighting, $\beta$, on the KL divergence term, $D_{\mathrm{KL}}(q(\mathbf{z}|\mathbf{x}) \,\|\, p_\theta(\mathbf{z}))$. This tends to yield more disentangled (i.e., independent) latent variables. Indeed, $\beta$ controls the rate-distortion trade-off between latent complexity and reconstruction (Alemi et al., 2018), highlighting VAEs' ability to extract latent structure at multiple resolutions (Rezende & Viola, 2018). A separate line of work has focused on identifiability: the ability to uniquely recover the original latent variables within a model (or their posterior). While this is true in linear ICA (Comon, 1994), it is not generally the case with nonlinear ICA and noninvertible models (VAEs) (Khemakhem, Kingma, Monti, & Hyvarinen, 2020; Gresele, Fissore, Javaloy, Schölkopf, & Hyvarinen, 2020), requiring special considerations.
4.2.2 Normalizing Flows
Another direction within VAEs is the use of normalizing flows (Rezende & Mohamed, 2015). Flow-based distributions use invertible transforms to add and remove dependencies (see section 2.2.1). While such models can operate as generative models (Dinh et al., 2015, 2017; Papamakarios, Pavlakou, & Murray, 2017), they can also define distributions in VAEs. This includes the approximate posterior (Rezende & Mohamed, 2015), prior (Huang et al., 2017), and conditional likelihood (Agrawal & Dukkipati, 2016). In each case, a deep network outputs the parameters (e.g., mean and variance) of a base distribution over a normalized variable. Separate deep networks parameterize the transforms, which map between the normalized and unnormalized variables.
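A minimal sketch of the change-of-variables computation underlying an affine flow (an added illustration; in practice the shift and log-scale would be outputs of deep networks rather than the fixed vectors used here):

```python
import numpy as np

def affine_flow_log_prob(x, shift, log_scale):
    """Evaluate log p(x) when x = shift + exp(log_scale) * z and z ~ N(0, I)."""
    z = (x - shift) * np.exp(-log_scale)                    # inverse (normalizing) direction
    log_base = -0.5 * np.sum(np.log(2 * np.pi) + z ** 2)    # log N(z; 0, I)
    log_det = -np.sum(log_scale)                            # log |det dz/dx|
    return log_base + log_det

rng = np.random.default_rng(0)
shift, log_scale = np.array([1.0, -2.0]), np.array([0.5, -0.3])

z = rng.normal(size=2)
x = shift + np.exp(log_scale) * z                           # generative (sampling) direction
print(affine_flow_log_prob(x, shift, log_scale))
```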
4.2.3 Example
Figure 9: Normalizing flows. Normalizing flows is a framework for adding or removing dependencies. Using affine parameter functions, $\boldsymbol{\mu}_\theta$ and $\boldsymbol{\sigma}_\theta$, one can model nonlinear dependencies, generalizing constant transforms, e.g., a covariance matrix.
5 Connections and Comparisons
Predictive coding and VAEs (and deep generative models generally) are highly related in both their model formulations and inference approaches (see Figure 10). Specifically,
- Model formulation: Both areas consider hierarchical latent gaussian models with nonlinear dependencies between latent levels, as well as dependencies within levels via covariance matrices (predictive coding) or normalizing flows (VAEs).
Figure 10: Hierarchical predictive coding and VAEs. Computation diagrams for (a) hierarchical predictive coding, (b) a VAE with direct amortized inference, and (c) a VAE with iterative amortized inference (Marino, Yue, et al., 2018). $\mathbf{J}^\top$ denotes the transposed Jacobian matrix of the conditional likelihood. Red dotted lines denote gradients, and black dashed lines denote amortized inference.
- Inference: Both areas use variational inference, often with gaussian approximate posteriors. While predictive coding and VAEs employ differing optimization techniques, these are design choices in solving the same inference problem.
These similarities reflect a common mathematical foundation inherited from cybernetics and descendant fields. We now discuss these two points in more detail.
5.1 Model Formulation
The primary distinction in model formulation is the form of the (nonlinear) functions parameterizing dependencies. Rao and Ballard (1999) parameterize the conditional likelihood as a linear function followed by an element-wise nonlinearity. Friston has considered a wider range of functions, such as polynomial (Friston, 2008a); however, such functions are rarely learned. VAEs instead parameterize these functions using deep networks with multiple layers. The deep network weights are trained through backpropagation, enabling the wide application of VAEs to various data domains.
Predictive coding and VAEs also consider dependencies within each level. Friston (2005) uses full-covariance gaussian densities, with the inverse of the covariance matrix (precision) parameterizing linear dependencies within a level. Rao and Ballard (1999) normalize the observations, modeling linear dependencies within the conditional likelihood. These are linear special cases of the more general technique of normalizing flows (Rezende & Mohamed, 2015): a covariance matrix is an affine normalizing flow with linear dependencies (Kingma et al., 2016). Normalizing flows have been applied throughout each of the distributions within VAEs (see section 4.2.2), modeling nonlinear dependencies across both spatial and temporal dimensions. These flows are also parameterized by deep networks, providing a flexible yet general modeling approach.
Related to normalization, there are proposals within predictive coding that the precision of the prior could mediate a form of attention (Feldman & Friston, 2010). Increasing the precision of a variable serves as a form of gain modulation, up-weighting the error in the objective function, thereby enforcing more accurate inference estimates. This concept is absent from VAEs. However, as VAEs become more prevalent in interactive settings (Ha & Schmidhuber, 2018), that is, beyond pure generative modeling, this may become crucial in steering models toward task-relevant perceptual inferences.
Finally, predictive coding and VAEs have both been extended to sequential settings. In predictive coding, sequential dependencies may be parameterized by linear functions (Srinivasan et al., 1982) or so-called generalized coordinates (Friston, 2008a), modeling multiple orders of motion. In extensions of VAEs, sequential dependencies are again parameterized by deep networks, in many cases using recurrent networks (Chung et al., 2015; Fraccaro et al., 2016). Thus, while the specific implementations vary, in either case, sequential dependencies are ultimately functions, which are subject to design choices.
5.2 Inference
Although both predictive coding and VAEs typically use variational inference with gaussian approximate posteriors, sections 3 and 4 illustrate key differences (see Figure 10). Predictive coding generally relies on gradient-based optimization to perform inference, whereas VAEs employ amortized optimization. While these approaches may at first appear radically different, hybrid error-encoding inference approaches (see equation 4.3), such as PredNet (Lotter et al., 2017) and iterative amortization (Marino, Yue, et al., 2018), provide a link. Such approaches receive errors as input, as in predictive coding; however, they have learnable parameters (i.e., amortization). In fact, amortization may provide a crucial element for implementing predictive coding in biological neural networks.
Though rarely discussed, hierarchical predictive coding assumes that the inference gradients, supplied by forward connections, can be readily calculated. But as seen in section 3.2, the weights of these forward connections are the transposed Jacobian matrix of the backward connections (Rao & Ballard, 1999). This is an example of the weight transport problem (Grossberg, 1987), in which the weights of one set of connections (forward) depend on the weights of another set of connections (backward). This is generally regarded as not being biologically plausible.
Amortization provides a solution to this problem: learn to perform inference. Rather than transporting the generative weights to the inference connections, amortization learns a separate set of inference weights, potentially using local learning rules (Bengio, 2014; Lee, Zhang, Fischer, & Bengio, 2015). Thus, despite criticism from Friston (2018), amortization may offer a more biologically plausible inference approach. Further, amortized inference yields accurate estimates with exceedingly few iterations: even a single iteration may yield reasonable estimates (Marino, Yue, et al., 2018). These computational efficiency benefits provide another argument in favor of amortization.
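The contrast can be sketched in a few lines (an added illustration, with arbitrary dimensions and unit variances): a single gradient-based inference update on a linear gaussian model requires the transposed generative weights, W.T, whereas an amortized update uses a separately parameterized recognition matrix that could, in principle, be trained with local rules.

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, z_dim = 8, 3
W = rng.normal(size=(x_dim, z_dim))        # generative (top-down) weights
R = rng.normal(size=(z_dim, x_dim)) * 0.1  # separate recognition (bottom-up) weights

x = rng.normal(size=x_dim)
z = np.zeros(z_dim)

eps_x = x - W @ z                          # prediction error at the observation level

# gradient-based update: requires "transporting" W to the forward pathway (W.T)
z_gradient_update = z + 0.1 * (W.T @ eps_x - z)

# amortized update: the forward pathway has its own learned weights R
z_amortized_update = z + 0.1 * (R @ eps_x - z)

print(z_gradient_update, z_amortized_update)
```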
Finally, although predictive coding and VAEs typically assume gaussian approximate posteriors, there is one additional difference in the ways in which these parameters are conventionally calculated. Friston often uses the Laplace approximation (see note 3) (Friston et al., 2007), solving directly for the optimal gaussian variance, whereas VAEs treat this as another output of the inference model (Kingma & Welling, 2014; Rezende, Mohamed, & Wierstra, 2014). These approaches can be applied in either setting (Park, Kim, & Kim, 2019).
6 Correspondences
Figure 11: Pyramidal neurons and deep networks. Connecting VAEs with predictive coding places deep networks (bottom) in correspondence with the dendrites of pyramidal neurons (top), for both generation (right) and (amortized) inference (left).
6.1 Pyramidal Neurons and Deep Networks
6.1.1 Nonlinear Dendritic Computation
Placing deep networks in correspondence with pyramidal dendrites suggests that (some) biological neurons may be better computationally described as nonlinear functions. Evidence from neuroscience supports this claim. Early simulations showed that individual pyramidal neurons, through dendritic processing, could operate as multilayer artificial networks (Zador, Claiborne, & Brown, 1992; Mel, 1992). This was later supported by empirical findings that pyramidal dendrites act as computational “subunits,” yielding the equivalent of a two-layer artificial network (Poirazi, Brannon, & Mel, 2003; Polsky, Mel, & Schiller, 2004). More recently, Gidon et al. (2020) demonstrated that individual pyramidal neurons can compute the XOR operation, which requires nonlinear processing. This is supported by further modeling work (Jones & Kording, 2020; Beniaguev, Segev, & London, 2021). Positing a more substantial role for dendritic computation (London & Häusser, 2005) moves beyond the simplistic comparison of biological and artificial neurons that currently dominates. Instead, neural computation depends on morphology and circuits.
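For reference, a two-layer network of threshold units with hand-chosen weights computes XOR, the operation Gidon et al. (2020) report in single pyramidal neurons; the weights below are one standard textbook choice, not taken from that study.

```python
def step(a):
    """Threshold (Heaviside) activation."""
    return 1.0 if a > 0 else 0.0

def xor_two_layer(x1, x2):
    """XOR via two hidden subunits (OR and AND) and an output unit (OR and not AND)."""
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    return step(h_or - h_and - 0.5)

for x1 in (0.0, 1.0):
    for x2 in (0.0, 1.0):
        print(int(x1), int(x2), "->", int(xor_two_layer(x1, x2)))
```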
6.1.2 Amortization
Pyramidal neurons mediate both top-down and bottom-up cortical projections. Under predictive coding, this suggests that inference relies on learned, nonlinear functions: amortization. One such implementation is through pyramidal neurons with separate apical and basal dendrites, which, respectively, receive top-down and bottom-up inputs (Bekkers, 2011; Guergiuev, Lillicrap, & Richards, 2016). Recent evidence from Gillon et al. (2021) suggests that these are top-down predictions and bottom-up errors. These neurons may implement iterative amortized inference (Marino, Yue, et al., 2018), separately processing top-down and bottom-up error signals to update inference estimates (see Figure 11, left). While some empirical support for amortization exists (Yildirim, Kulkarni, Freiwald, & Tenenbaum, 2015; Dasgupta, Schulz, Goodman, & Gershman, 2018), further investigation is needed. Finally, this perspective implies separate computational processing for prediction and inference, with distinct (but linked) frequencies. While some evidence supports this conjecture (Bastos et al., 2015), it is unclear how this could be implemented in biological neurons.
6.1.3 Backpropagation
The biological plausibility of backpropagation is an open question (Lillicrap, Santoro, Marris, Akerman, & Hinton, 2020). Critics argue that backpropagation requires nonlocal learning signals (Grossberg, 1987; Crick, 1989), whereas the brain relies largely on local learning rules (Hebb, 1949; Markram, Lübke, Frotscher, & Sakmann, 1997; Bi & Poo, 1998). Biologically plausible formulations of backpropagation have been proposed (Stork, 1989; Körding & König, 2001; Xie & Seung, 2003; Hinton, 2007; Lillicrap, Cownden, Tweed, & Akerman, 2016), attempting to reconcile this disparity. Yet consensus is still lacking. From another perspective, the apparent biological implausibility of backpropagation may instead be the result of incorrectly assuming a one-to-one correspondence between biological and artificial neurons.
Figure 12: Backpropagation within neurons. If deep networks are in correspondence with pyramidal neurons, this implies that backpropagation (left) is analogous to learning within neurons, perhaps via backpropagating action potentials (right).
6.2 Lateral Inhibition and Normalizing Flows
6.2.1 Sensory Input Normalization
One of the key computational roles of early sensory areas appears to be reducing spatiotemporal redundancies—normalization. In retina, this is performed through lateral inhibition via horizontal and amacrine cells, removing correlations (Graham et al., 2006; Pitkow & Meister, 2012). Normalization and prediction are inseparable, and, accordingly, previous work has framed early sensory processing in terms of spatiotemporal predictive coding (Srinivasan et al., 1982; Hosoya et al., 2005; Palmer et al., 2015). This is often motivated in terms of increased sensitivity or efficiency (Srinivasan et al., 1982; Atick & Redlich, 1990) due to redundancy reduction (Barlow, 1961a; Barlow et al., 1989), that is, compression.
If we consider cortex as a hierarchical latent variable model, then early sensory areas are implicated in parameterizing the conditional likelihood. The ubiquity of normalization in these areas is suggestive of normalization in a flow-based model, implementing the inference direction of a flow-based conditional likelihood (Agrawal & Dukkipati, 2016; Winkler et al., 2019). In addition to the sensitivity and efficiency benefits cited above, this learned, normalized space simplifies downstream generative modeling and improves generalization (Marino et al., 2020).
6.2.2 Normalization in Thalamus
Normalization also appears to occur in first-order thalamic relays, such as the lateral geniculate nucleus (LGN). Dong and Atick (1995) framed LGN in terms of temporal normalization, with supporting evidence provided by Dan et al. (1996). This has the effect of removing predictable temporal structure (e.g., static backgrounds). Under the interpretation above, this is an additional inference stage of a flow-based conditional likelihood (Marino et al., 2020).
6.2.3 Normalization in Cortex
Figure 13: Visual pathway. The retina and LGN are interpreted as implementing normalizing flows, that is, spatiotemporal predictive coding, reducing spatial and temporal redundancy in the visual input (dashed arrows between gray circles). LGN is also the lowest level for hierarchical predictions from cortex. Using prediction errors throughout the hierarchy, forward cortical connections update latent estimates.
7 Discussion
We have reviewed predictive coding and VAEs, identifying their shared history and formulations. These connections provide an invaluable link between leading areas of theoretical neuroscience and machine learning, hopefully facilitating the transfer of ideas across fields. We have initiated this process by proposing two novel correspondences suggested by this perspective: (1) dendrites of pyramidal neurons and deep networks and (2) lateral inhibition and normalizing flows. Placing pyramidal neurons in correspondence with deep networks departs from the traditional one-to-one analogy of biological and artificial neurons, raising questions regarding dendritic computation and backpropagation. Normalizing flows offers a more general framework for normalization via lateral inhibition. Connecting these areas may provide new insights for both machine learning and neuroscience, helping us move beyond overly simplistic comparisons.
7.1 Predictive Coding → VAEs
Although considerable independent progress has recently occurred in VAEs, such models are often still trained on relatively simple, standardized data sets of static images. Thus, predictive coding and neuroscience may still hold insights for improving these models for real-world settings. For instance, the correspondences outlined above may offer new architectural insights in designing deep networks and normalizing flows, for example, drawing on dendritic morphology, short-term plasticity, and connectivity. Predictive coding has also used prediction precision as a form of attention (Feldman & Friston, 2010). More broadly, neuroscience may provide insights into interfacing VAEs with other computations, as well as within embodied agents.
7.2 VAEs → Predictive Coding
Another motivating factor in connecting these areas stems from a desire for large-scale, testable models of predictive coding. While predictive coding offers general considerations for neural activity, e.g., predictions, prediction errors, and extra-classical receptive fields (Rao & Ballard, 1999), it is difficult to align such hypotheses with real data due to the many possible design choices (Gershman, 2019). Current models are often implemented in simplified settings, with few, if any, learned parameters. VAEs, in contrast, offer a large-scale test-bed for implementing models and evaluating them on natural stimuli. This may offer a more nuanced perspective over current efforts to compare biological and artificial neural activity (Yamins et al., 2014).
While we have reviewed many topics across neuroscience and machine learning, for brevity, we have focused exclusively on passive perceptual settings. However, separate, growing bodies of work are incorporating predictive coding (Adams, Shipp, & Friston, 2013) and VAEs (Ha & Schmidhuber, 2018) within active settings such as reinforcement learning. We are hopeful that the connections in this paper will inspire further insight in such areas.
Appendix A: Variational Bound Derivation
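The bound follows from Jensen's inequality. Introducing the approximate posterior $q(\mathbf{z}|\mathbf{x})$ in the notation used above,

$$\log p_\theta(\mathbf{x}) = \log \int p_\theta(\mathbf{x}, \mathbf{z}) \, d\mathbf{z} = \log \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\frac{p_\theta(\mathbf{x}, \mathbf{z})}{q(\mathbf{z}|\mathbf{x})}\right] \geq \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q(\mathbf{z}|\mathbf{x})}\right] = \mathcal{L}.$$

Expanding the joint gives $\mathcal{L} = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{\mathrm{KL}}(q(\mathbf{z}|\mathbf{x}) \,\|\, p_\theta(\mathbf{z}))$, and the gap between $\log p_\theta(\mathbf{x})$ and $\mathcal{L}$ is $D_{\mathrm{KL}}(q(\mathbf{z}|\mathbf{x}) \,\|\, p_\theta(\mathbf{z}|\mathbf{x}))$, so the bound is tight when the approximate posterior matches the true posterior.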
Notes
1. Formally, we refer to normalization as one or more steps of a process transforming the data density into a standard gaussian (i.e., Normal), which is equivalent to ICA (Hyvärinen & Oja, 2000). This is a form of redundancy reduction, removing statistical dependencies between data dimensions.
2. Note that other forms of probabilistic models will result in other forms of whitening transforms.
3. This is not to be confused with a Laplace distribution. The approximate posterior is still gaussian.
Acknowledgments
Sam Gershman and Rajesh Rao provided helpful comments on this manuscript, and Karl Friston engaged in useful early discussions related to these ideas. We also thank the anonymous reviewers for their feedback and suggestions.
References
Author notes
*The author is now at DeepMind, London, U.K.