## Abstract

We present a review of predictive coding, from theoretical neuroscience, and variational autoencoders, from machine learning, identifying the common origin and mathematical framework underlying both areas. As each area is prominent within its respective field, more firmly connecting these areas could prove useful in the dialogue between neuroscience and machine learning. After reviewing each area, we discuss two possible correspondences implied by this perspective: cortical pyramidal dendrites as analogous to (nonlinear) deep networks and lateral inhibition as analogous to normalizing flows. These connections may provide new directions for further investigations in each field.

## 1 Introduction

### 1.1 Cybernetics

Machine learning and theoretical neuroscience once overlapped under the field of cybernetics (Wiener, 1948; Ashby, 1956). Within this field, perception and control, in both biological and nonbiological systems, were formulated in terms of negative feedback and feedforward processes. Negative feedback attempts to minimize error signals by feeding the errors back into the system, whereas feedforward processing attempts to preemptively reduce error through prediction. Cybernetics formalized these techniques using probabilistic models, which estimate the likelihood of random outcomes, and variational calculus, a technique for estimating functions, particularly probability distributions (Wiener, 1948). This resulted in the first computational models of neuron function and learning (McCulloch & Pitts, 1943; Rosenblatt, 1958; Widrow & Hoff, 1960), a formal definition of information (Wiener, 1942; Shannon, 1948), with connections to neural systems (Barlow, 1961b), and algorithms for negative feedback perception and control (MacKay, 1956; Kalman, 1960). Yet with advances in these directions (see Prieto et al., 2016) the cohesion of cybernetics diminished, with the new ideas taking root in, for example, theoretical neuroscience, machine learning, and control theory. The transfer of ideas is shown in Figure 1.

### 1.2 Neuroscience and Machine Learning: Convergence and Divergence

A renewed dialogue between neuroscience and machine learning formed in the 1980s and 1990s. Neuroscientists, bolstered by new physiological and functional analyses, began gaining traction in studying neural systems in probabilistic and information-theoretic terms (Laughlin, 1981; Srinivasan, Laughlin, & Dubs, 1982; Barlow, 1989; Bialek, Rieke, Van Steveninck, & Warland, 1991). In machine learning, improvements in probabilistic modeling (Pearl, 1986) and artificial neural networks (Rumelhart, Hinton, & Williams, 1986) combined with ideas from statistical mechanics (Hopfield, 1982; Ackley, Hinton, & Sejnowski, 1985) to yield new classes of models and training techniques. This convergence of ideas, primarily centered around perception, resulted in new theories of neural processing and improvements in their mathematical underpinnings.

In particular, the notion of predictive coding emerged within neuroscience (Srinivasan et al., 1982; Rao & Ballard, 1999). In its most general form, predictive coding postulates that neural circuits are engaged in estimating probabilistic models of other neural activity and sensory inputs, with feedback and feedforward processes playing a central role. These models were initially formulated in early sensory areas, for example, in the retina (Srinivasan et al., 1982) and thalamus (Dong & Atick, 1995), using feedforward processes to predict future neural activity. Similar notions were extended to higher-level sensory processing in neocortex by David Mumford (1991, 1992). Top-down neural projections (from higher-level to lower-level sensory areas) were hypothesized to convey sensory predictions, whereas bottom-up neural projections were hypothesized to convey prediction errors. Through negative feedback, these errors then updated state estimates. These ideas were formalized by Rao and Ballard (1999), formulating a simplified artificial neural network model of images, reminiscent of a Kalman filter (Kalman, 1960).

Feedback and feedforward processes also featured prominently in machine learning. Indeed, the primary training algorithm for artificial neural networks, backpropagation (Rumelhart et al., 1986), literally feeds (propagates) the output prediction errors back through the network—negative feedback. During this period, the technique of variational inference was rediscovered within machine learning (Hinton & Van Camp, 1993; Neal & Hinton, 1998), recasting probabilistic inference using variational calculus. This technique proved essential in formulating the Helmholtz machine (Dayan et al., 1995; Dayan & Hinton, 1996), a hierarchical unsupervised probabilistic model parameterized by artificial neural networks. Similar advances were made in autoregressive probabilistic models (Frey, Hinton, & Dayan, 1996; Bengio & Bengio, 2000), using artificial neural networks to form sequential feedforward predictions, as well as new classes of invertible probabilistic models (Comon, 1994; Parra, Deco, & Miesbach, 1995; Deco & Brauer, 1995; Bell & Sejnowski, 1997).

These new ideas regarding variational inference and probabilistic models, particularly the Helmholtz machine (Dayan, Hinton, Neal, & Zemel, 1995), influenced predictive coding. Specifically, Karl Friston utilized variational inference to formulate hierarchical dynamical models of neocortex (Friston, 2005, 2008a). In line with Mumford (1992), these models contain multiple levels, with each level attempting to predict its future activity (feedforward) as well as lower-level activity, closer to the input data. Prediction errors across levels facilitate updating higher-level estimates (negative feedback). Such models have incorporated many biological aspects, including local learning rules (Friston, 2005) and attention (Spratling, 2008; Feldman & Friston, 2010; Kanai, Komura, Shipp, & Friston, 2015), and have been compared with neural circuits (Bastos et al., 2012; Keller & Mrsic-Flogel, 2018; Walsh, McGovern, Clark, & O'Connell, 2020). While predictive coding and other Bayesian brain theories are increasingly popular (Doya, Ishii, Pouget, & Rao, 2007; Friston, 2009; Clark, 2013), validating these models is hampered by the difficulty of distinguishing between specific design choices and general theoretical claims (Gershman, 2019). Further, a large gap remains between the simplified implementations of these models and the complexity of neural systems.

Progress in machine learning picked up in the early 2010s, with advances in parallel computing as well as standardized data sets (Deng et al., 2009). In this era of deep learning (LeCun, Bengio, & Hinton, 2015; Schmidhuber, 2015), that is, artificial neural networks with multiple layers, a flourishing of ideas emerged around probabilistic modeling. Building off previous work, more expressive classes of deep hierarchical (Gregor, Danihelka, Mnih, Blundell, & Wierstra, 2014; Mnih & Gregor, 2014; Kingma & Welling, 2014; Rezende, Mohamed, & Wierstra, 2014), autoregressive (Uria, Murray, & Larochelle, 2014; van den Oord, Kalchbrenner, & Kavukcuoglu, 2016), and invertible (Dinh, Krueger, & Bengio, 2015; Dinh, Sohl-Dickstein, & Bengio, 2017) probabilistic models were developed. Of particular importance is a model class known as variational autoencoders (VAEs; Kingma & Welling, 2014; Rezende et al., 2014), a relative of the Helmholtz machine, which closely resembles hierarchical predictive coding. Unfortunately, despite this similarity, the machine learning community remains largely oblivious to the progress in predictive coding and vice versa.

### 1.3 Connecting Predictive Coding and VAEs

This review aims to bridge the divide between predictive coding and VAEs. While this work provides unique contributions, it is inspired by previous work at this intersection. In particular, van den Broeke (2016) outlines hierarchical probabilistic models in predictive coding and machine learning. Likewise, Lotter, Kreiman, and Cox (2017, 2018) implement predictive coding techniques in deep probabilistic models, comparing these models with neural phenomena.

After reviewing background mathematical concepts in section 2, we discuss the basic formulations of predictive coding in section 3 and variational autoencoders in section 4, and we identify commonalities in their model formulations and inference techniques in section 5. Based on these connections, in section 6, we discuss two possible correspondences between machine learning and neuroscience seemingly suggested by this perspective:

- **Dendrites of pyramidal neurons and deep artificial networks**, affirming a more nuanced perspective over the analogy of biological and artificial neurons.
- **Lateral inhibition and normalizing flows**, providing a more general framework for normalization.

## 2 Background

### 2.1 Maximum Log Likelihood

This is the maximum log-likelihood objective, which is found throughout machine learning and probabilistic modeling (Murphy, 2012). In practice, we do not have access to $p_{\text{data}}(x)$ and instead approximate the objective using data samples, that is, using $\hat{p}_{\text{data}}(x)$.
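As a minimal sketch of this sample-based approximation, the snippet below estimates the expected log-likelihood by an average over data samples and maximizes it for a univariate gaussian. The gaussian model family, the synthetic data, and all function names are illustrative assumptions rather than part of the original formulation.

```python
import math
import random

def gaussian_log_pdf(x, mean, var):
    """Log density of a univariate Gaussian N(x; mean, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def avg_log_likelihood(samples, mean, var):
    """Sample-based estimate of the expected log-likelihood under the data."""
    return sum(gaussian_log_pdf(x, mean, var) for x in samples) / len(samples)

def fit_gaussian(samples):
    """Closed-form maximizer of the average log-likelihood for a Gaussian."""
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    return mean, var

rng = random.Random(0)
data = [rng.gauss(2.0, 1.5) for _ in range(5000)]  # stand-in for data samples
mle_mean, mle_var = fit_gaussian(data)
best = avg_log_likelihood(data, mle_mean, mle_var)
```

Any other setting of the mean or variance gives a lower value of `avg_log_likelihood` on the same samples, which is what makes the fitted parameters the (approximate) maximum likelihood solution.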

### 2.2 Probabilistic Models

#### 2.2.1 Dependency Structure

*the* primary technique for increasing the flexibility of a model, as evaluating the likelihood now requires marginalizing over the latent variables:

$$p_\theta(x) = \int p_\theta(x, z)\,dz. \tag{2.5}$$

*normalizing flows* (Rezende & Mohamed, 2015), are the basis of independent components analysis (ICA) (Comon, 1994; Bell & Sejnowski, 1997; Hyvärinen & Oja, 2000) and nonlinear generalizations (Chen & Gopinath, 2001; Laparra, Camps-Valls, & Malo, 2011). These models provide a general-purpose mechanism for adding and removing dependencies between variables (i.e., normalization).

Yet while flow-based models avoid marginalization, their requirement of invertibility can be overly restrictive (Cornish, Caterini, Deligiannidis, & Doucet, 2020).

#### 2.2.2 Parameterizing the Model

Conditional dependencies are mediated by the distribution parameters, which are functions of the conditioning variables. For example, we can express an autoregressive gaussian distribution (see equation 2.2) through $p_\theta(x_j | x_{<j}) = \mathcal{N}(x_j; \mu_\theta(x_{<j}), \sigma_\theta^2(x_{<j}))$, where $\mu_\theta$ and $\sigma_\theta^2$ are functions taking $x_{<j}$ as input. A similar form applies to autoregressive models on sequences of vector inputs (see equation 2.3), with $p_\theta(x_t | x_{<t}) = \mathcal{N}(x_t; \mu_\theta(x_{<t}), \Sigma_\theta(x_{<t}))$. Likewise, in a latent variable model (see equation 2.4), we can express a gaussian conditional likelihood as $p_\theta(x | z) = \mathcal{N}(x; \mu_\theta(z), \Sigma_\theta(z))$. In the above examples, we have used a subscript $\theta$ for all functions; however, these may be separate functions in practice.
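The autoregressive gaussian case can be sketched in a few lines. In this illustration, `mu_theta` and `sigma2_theta` are hypothetical stand-ins for the learned functions: a linear function of the most recent value and a constant, respectively. In practice these would be deep networks.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log density of a univariate Gaussian N(x; mean, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def mu_theta(prev, w=0.9):
    """Hypothetical mean function: linear in the latest value (0 with no context)."""
    return w * prev[-1] if prev else 0.0

def sigma2_theta(prev, base=0.25):
    """Hypothetical variance function: a constant that ignores the context."""
    return base

def autoregressive_log_prob(x):
    """log p(x) = sum_j log N(x_j; mu(x_<j), sigma^2(x_<j))."""
    total = 0.0
    for j in range(len(x)):
        context = x[:j]  # all dimensions preceding j
        total += gaussian_log_pdf(x[j], mu_theta(context), sigma2_theta(context))
    return total

x = [0.5, 0.4, 0.45, 0.3]
log_p = autoregressive_log_prob(x)
```

The conditioning variables enter only through the distribution parameters, so swapping in more expressive functions changes the model without changing the likelihood computation.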

Deep learning (Goodfellow, Bengio, & Courville, 2016) provides probabilistic models with expressive nonlinear functions, improving their capacity. In these models, the distribution parameters are parameterized with deep networks, which are then trained by backpropagating (Rumelhart et al., 1986) the gradient of the log-likelihood, $\nabla_\theta \mathbb{E}_{x \sim \hat{p}_{\text{data}}} [\log p_\theta(x)]$, through the network. Deep probabilistic models have enabled recent advances in speech (Graves, 2013; van den Oord et al., 2016), natural language (Sutskever, Vinyals, & Le, 2014; Radford et al., 2019), images (Razavi et al., 2019), video (Kumar et al., 2020), reinforcement learning (Chua, Calandra, McAllister, & Levine, 2018; Ha & Schmidhuber, 2018) and other areas.

Autoregressive models have proven useful in many domains. However, there are reasons to prefer latent variable models in some contexts. First, autoregressive sampling is inherently sequential, becoming costly in high-dimensional domains. Second, latent variables provide a representation for downstream tasks, compression, and overall data analysis. Finally, latent variables increase flexibility, which is useful for modeling complex distributions with relatively simple (e.g., gaussian) conditional distributions. While flow-based latent variable models offer one option, their invertibility requirement limits the types of functions that can be used. For these reasons, we require methods for handling the latent marginalization in equation 2.5. Variational inference is one such method.

### 2.3 Variational Inference

Training latent variable models through maximum likelihood requires evaluating $\log p_\theta(x)$. However, evaluating $p_\theta(x) = \int p_\theta(x, z)\,dz$ is generally intractable. Thus, we require some technique for approximating $\log p_\theta(x)$. Variational inference (Hinton & Van Camp, 1993; Jordan, Ghahramani, Jaakkola, & Saul, 1998) approaches this problem by introducing an approximate posterior distribution, $q(z|x)$, which provides a lower bound, $\mathcal{L}(x; q, \theta) \leq \log p_\theta(x)$. This lower bound is referred to as the evidence (or variational) lower bound (ELBO), as well as the (negative) free energy. By tightening and maximizing the ELBO with regard to the model parameters, $\theta$, we can approximate maximum likelihood training.
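The bound can be checked numerically in a toy model where the marginal is tractable. The sketch below assumes a scalar model with $z \sim \mathcal{N}(0, 1)$ and $x | z \sim \mathcal{N}(z, 1)$, chosen only because $\log p_\theta(x)$ and the exact posterior, $\mathcal{N}(x/2, 1/2)$, are then available in closed form. The model and function names are illustrative assumptions.

```python
import math

def log_marginal(x):
    """Exact log p(x) for z ~ N(0, 1), x | z ~ N(z, 1), so that x ~ N(0, 2)."""
    return -0.5 * (math.log(2.0 * math.pi * 2.0) + x ** 2 / 2.0)

def elbo(x, m, s2):
    """ELBO for q(z|x) = N(m, s2), with every expectation in closed form."""
    expected_log_lik = -0.5 * (math.log(2.0 * math.pi) + (x - m) ** 2 + s2)
    expected_log_prior = -0.5 * (math.log(2.0 * math.pi) + m ** 2 + s2)
    entropy = 0.5 * math.log(2.0 * math.pi * math.e * s2)
    return expected_log_lik + expected_log_prior + entropy

x = 1.3
loose = elbo(x, m=0.0, s2=1.0)      # q set to the prior: a loose bound
tight = elbo(x, m=x / 2.0, s2=0.5)  # q set to the exact posterior: a tight bound
```

With the approximate posterior set to the prior, the ELBO sits strictly below `log_marginal(x)`; with the exact posterior, the two coincide. Tightening the bound corresponds to moving `m` and `s2` toward the posterior values.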

## 3 Predictive Coding

Predictive coding can be divided into two settings, spatiotemporal and hierarchical, roughly corresponding to the two main forms of probabilistic dependencies. In this section, we review these settings, discussing existing hypothesized correspondences with neural anatomy. We then outline the empirical support for predictive coding, highlighting the need for large-scale, testable models.

### 3.1 Spatiotemporal Predictive Coding

A video example is shown in Figure 5b. The normalization transform removes temporal redundancy in the input, enabling the resulting sequence, $y_{1:T}$, to be compressed more efficiently (Shannon, 1948; Harrison, 1952; Oliver, 1952). This technique forms the basis of modern video (Wiegand, Sullivan, Bjontegaard, & Luthra, 2003) and audio (Atal & Schroeder, 1979) compression. Note that one special case of this transform is $\mu_\theta(x_{<t}) = x_{t-1}$ and $\sigma_\theta(x_{<t}) = 1$, in which case, $y_t = x_t - x_{t-1} = \Delta x_t$, that is, temporal changes. For slowly changing sequences, this is a reasonable choice.
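The temporal-difference special case is easy to illustrate. The sketch below applies $y_t = x_t - x_{t-1}$ to a slowly varying sinusoid, a hypothetical stand-in for a sensory input sequence, and compares the variance before and after normalization.

```python
import math

def temporal_normalize(x):
    """y_t = x_t - x_{t-1}: the special case mu = x_{t-1}, sigma = 1 (from t = 2)."""
    return [x[t] - x[t - 1] for t in range(1, len(x))]

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Hypothetical slowly varying signal standing in for an input sequence.
x = [math.sin(0.05 * t) for t in range(500)]
y = temporal_normalize(x)
var_x, var_y = variance(x), variance(y)
```

For this signal, `var_y` is far smaller than `var_x`, reflecting the removed temporal redundancy: the normalized sequence occupies a much narrower range and is therefore cheaper to encode.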

Normalization can also be applied within $x_t$ to remove spatial dependencies. For instance, we can apply another autoregressive transform over spatial dimensions, predicting the $i$th dimension, $x_{i,t}$, as a function of previous dimensions, $x_{1:i-1,t}$ (see equation 2.2). With linear functions, this corresponds to Cholesky whitening (Pourahmadi, 2011; Kingma et al., 2016). However, this imposes an ordering over dimensions. Zero-phase components analysis (ZCA) whitening instead learns symmetric spatial dependencies (Kessy, Lewin, & Strimmer, 2018). Modeling these dependencies with a constant covariance matrix, $\Sigma_\theta$, and mean, $\mu_\theta$, the whitening transform is $y = \Sigma_\theta^{-1/2}(x - \mu_\theta)$. With natural images, this results in center-surround filters in the rows of $\Sigma_\theta^{-1}$, thereby extracting edges (see Figure 5a).
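A small sketch of the ZCA transform $y = \Sigma_\theta^{-1/2}(x - \mu_\theta)$ follows. To stay dependency-free, it is restricted to two dimensions, where the symmetric matrix square root has a closed form; a general implementation would instead use an eigendecomposition. The covariance values and function names are assumptions made for illustration.

```python
import math

def zca_matrix_2x2(cov):
    """Return Sigma^{-1/2} for a 2x2 symmetric positive-definite covariance."""
    (a, b), (_, d) = cov
    s = math.sqrt(a * d - b * b)    # sqrt(det Sigma)
    t = math.sqrt(a + d + 2.0 * s)  # sqrt(trace + 2 sqrt(det))
    # Closed-form symmetric square root: Sigma^{1/2} = (Sigma + s I) / t.
    ra, rb, rd = (a + s) / t, b / t, (d + s) / t
    det_r = ra * rd - rb * rb
    # Invert the 2x2 square root to obtain the whitening matrix.
    return [[rd / det_r, -rb / det_r], [-rb / det_r, ra / det_r]]

def matmul(p, q):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def whiten(x, mean, w):
    """y = Sigma^{-1/2} (x - mu)."""
    c = [x[0] - mean[0], x[1] - mean[1]]
    return [w[0][0] * c[0] + w[0][1] * c[1], w[1][0] * c[0] + w[1][1] * c[1]]

sigma = [[2.0, 0.8], [0.8, 1.0]]
w = zca_matrix_2x2(sigma)
whitened_cov = matmul(matmul(w, sigma), w)  # W Sigma W^T, with W symmetric
```

The whitening matrix is symmetric, so no ordering over dimensions is imposed, and `whitened_cov` is the identity up to rounding: the linear dependencies between the two dimensions have been removed.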

Srinivasan et al. (1982) investigated spatiotemporal predictive coding in the retina, where compression is essential for transmission through the optic nerve. Estimating the (linear) autocorrelation of input sensory signals, they showed that spatiotemporal predictive coding models retinal ganglion cell responses in flies. This scheme allows these neurons to more fully utilize their dynamic range. It is generally accepted that retina, in part, performs stages of spatiotemporal normalization through center-surround receptive fields and on-off responses (Hosoya, Baccus, & Meister, 2005; Graham, Chandler, & Field, 2006; Pitkow & Meister, 2012; Palmer, Marre, Berry, & Bialek, 2015). Dong and Atick (1995) applied similar ideas to the thalamus, proposing an additional stage of temporal normalization. This also relates to the notion of generalized coordinates (Friston, 2008a), that is, modeling temporal derivatives, which can be approximated using finite differences (prediction errors). That is, $\frac{dx}{dt} \approx \Delta x_t \equiv x_t - x_{t-1}$. Thus, spatiotemporal predictive coding may be utilized at multiple stages of sensory processing to remove redundancy (Huang & Rao, 2011).

In neural circuits, normalization often involves inhibitory interneurons (Carandini & Heeger, 2012), performing operations similar to those in equation 3.1. For instance, inhibition occurs in retina between photoreceptors, via horizontal cells, and between bipolar cells, via amacrine cells. This can extract unpredicted motion, e.g., an object moving relative to the background (Ölveczky, Baccus, & Meister, 2003; Baccus, Ölveczky, Manu, & Meister, 2008). A similar scheme is present in the lateral geniculate nucleus (LGN) in thalamus, with interneurons inhibiting relay cells from retina (Sherman & Guillery, 2002). As mentioned above, this is thought to perform temporal normalization (Dong & Atick, 1995; Dan, Atick, & Reid, 1996). Lateral inhibition is also prominent in neocortex, with distinct classes of interneurons shaping the responses of pyramidal neurons (Isaacson & Scanziani, 2011). Part of their computational role appears to be spatiotemporal normalization (Carandini & Heeger, 2012).

### 3.2 Hierarchical Predictive Coding

Formulating a theory of neocortex, Mumford (1992) described the thalamus as an “active blackboard,” with the neocortex attempting to reconstruct the activity in the thalamus and lower hierarchical areas. Under this theory, backward projections convey predictions, while forward projections use prediction errors to update estimates. Through a dynamic process, the system settles to an activity pattern, minimizing prediction error. Over time, the parameters are also adjusted to improve predictions. In this way, negative feedback is used, both in inference and learning, to construct a generative model of sensory inputs. Generative state estimation dates back (at least) to Helmholtz (Von Helmholtz, 1867), and error-based updating is in line with cybernetics (Wiener, 1948; MacKay, 1956), which influenced Kalman filtering (Kalman, 1960), a ubiquitous Bayesian filtering algorithm.

Predictive coding identifies the conditional likelihood (equation 3.2) with backward (top-down) cortical projections, whereas inference (equation 3.5) is identified with forward (bottom-up) projections (Friston, 2005). Each is thought to be mediated by pyramidal neurons. Under this model, each cortical column predicts and estimates a stochastic continuous latent variable, possibly represented via a (pyramidal) firing rate or membrane potential (Friston, 2005). Interneurons within columns calculate errors ($\xi_x$ and $\xi_z$). Although we have only discussed diagonal covariance ($\sigma_x^2$ and $\sigma_z^2$), lateral inhibitory interneurons could parameterize full covariance matrices, $\Sigma_x$ and $\Sigma_z$, as a form of spatial predictive coding (see section 3.1). These factors weight $\xi_x$ and $\xi_z$, modulating the gain of each error as a form of “attention” (Feldman & Friston, 2010). Neural correspondences are summarized in Table 1.

| Neuroscience | Predictive Coding |
|---|---|
| Top-down cortical projections | Generative model conditional mapping |
| Bottom-up cortical projections | Inference updating |
| Lateral inhibition | Covariance matrices |
| (Pyramidal) neuron activity | Latent variable estimates & errors |
| Cortical column | Corresponding estimate & error |


We have presented a simplified model of hierarchical predictive coding, without multiple latent levels and dynamics. A full hierarchical predictive coding model would include these aspects and others. In particular, Friston has explored various design choices (Friston, Mattout, Trujillo-Barreto, Ashburner, & Penny, 2007; Friston, 2008a, 2008b), yet the core aspects of probabilistic generative modeling and variational inference remain the same. Elaborating and comparing these choices will be essential for empirically validating hierarchical predictive coding.

### 3.3 Empirical Support

While there is considerable evidence in support of predictions and errors in neural systems, disentangling these general aspects of predictive coding from the particular algorithmic choices remains challenging (Gershman, 2019). Here, we outline relevant work, but we refer to Huang and Rao (2011), Bastos et al. (2012), Clark (2013), Keller and Mrsic-Flogel (2018), and Walsh, McGovern, Clark, and O'Connell (2020) for a more in-depth overview.

#### 3.3.1 Spatiotemporal

Various works have investigated predictive coding in early sensory areas such as the retina (Srinivasan et al., 1982; Atick & Redlich, 1992). This involves fitting retinal ganglion cell responses to a spatial whitening (or decorrelation) process (Graham et al., 2006; Pitkow & Meister, 2012), which can be dynamically adjusted (Hosoya et al., 2005). Similar analyses suggest that retina also employs temporal predictive coding (Srinivasan et al., 1982; Palmer et al., 2015). Such models typically contain linear whitening filters (center-surround) followed by nonlinearities. These nonlinearities have been shown to be essential for modeling responses (Pitkow & Meister, 2012), possibly by inducing added sparsity (Graham et al., 2006). Spatiotemporal predictive coding also appears to be found in the thalamus (Dong & Atick, 1995; Dan et al., 1996) and cortex; however, such analyses are complicated by backward, modulatory inputs.

#### 3.3.2 Hierarchical

Early work toward empirically validating hierarchical predictive coding came from explaining extraclassical receptive field effects (Rao & Ballard, 1999; Rao & Sejnowski, 2002), whereby top-down signals in the cortex alter classical visual receptive fields, suggesting that top-down influences play a key role in sensory processing (Gilbert & Sigman, 2007). Note that such effects support a cortical generative model generally (Olshausen & Field, 1997), not predictive coding specifically.

Temporal influences have been demonstrated through repetition suppression (Summerfield et al., 2006), in which activity diminishes in response to repeated (i.e., predictable) stimuli. This may reflect error suppression from improved predictions. Predictive coding has also been used to explain biphasic responses in LGN (Jehee & Ballard, 2009), in which reversing the visual input with an anticorrelated image results in a large response, presumably due to prediction errors. Predictive signals have been documented in auditory (Wacongne et al., 2011) and visual (Meyer & Olson, 2011) processing. Activity seemingly corresponding to prediction errors has also been observed in a variety of areas and contexts, including visual cortex in mice (Keller, Bonhoeffer, & Hübener, 2012; Zmarz & Keller, 2016; Gillon et al., 2021), auditory cortex in monkeys (Eliades & Wang, 2008) and rodents (Parras et al., 2017), and visual cortex in humans (Murray, Kersten, Olshausen, Schrater, & Woods, 2002; Alink, Schwiedrzik, Kohler, Singer, & Muckli, 2010; Egner, Monti, & Summerfield, 2010). Thus, sensory cortex appears to be engaged in hierarchical and temporal prediction, with prediction errors playing a key role.

Empirical evidence for predictive coding aside, given the complexity of neural systems, the theory is undoubtedly incomplete or incorrect. Without the low-level details such as connectivity and potentials, it is difficult to determine the computational form of the circuit. Further, these models are typically oversimplified, with few trained parameters, detached from natural stimuli. While new tools enable us to test predictive coding in neural circuits (Gillon et al., 2021), machine learning, particularly VAEs, can advance from the other direction. Training large-scale models on natural stimuli may improve empirical predictions for biological systems (Rao & Ballard, 1999; Lotter et al., 2018).

## 4 Variational Autoencoders

Variational autoencoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014) are latent variable models parameterized by deep networks. As in hierarchical predictive coding, these models typically contain gaussian latent variables and are trained using variational inference. However, rather than performing inference optimization directly, VAEs amortize inference (Gershman & Goodman, 2014).

### 4.1 Amortized Variational Inference

To differentiate through $z \sim q_\phi(z|x)$, we can use the pathwise derivative estimator, also referred to as the reparameterization estimator (Kingma & Welling, 2014). This is accomplished by expressing $z$ in terms of an auxiliary random variable. The most common example expresses $z \sim \mathcal{N}(z; \mu_q, \mathrm{diag}(\sigma_q^2))$ as $z = \mu_q + \epsilon \odot \sigma_q$, where $\epsilon \sim \mathcal{N}(\epsilon; 0, I)$ and $\odot$ denotes element-wise multiplication. We can then estimate $\nabla_{\mu_q} \mathcal{L}$ and $\nabla_{\sigma_q} \mathcal{L}$, allowing us to calculate the inference model gradients, $\nabla_\phi \mathcal{L}$.
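The estimator can be sketched on a scalar example. Here the objective is replaced by the toy function $f(z) = z^2$, an assumption made so that the exact gradients of $\mathbb{E}[f(z)] = \mu_q^2 + \sigma_q^2$ are known ($2\mu_q$ and $2\sigma_q$) and can be compared against the Monte Carlo estimates. The function name and sample count are likewise illustrative.

```python
import random

def pathwise_gradients(mu, sigma, n_samples=100_000, seed=0):
    """Estimate d/dmu and d/dsigma of E_{z ~ N(mu, sigma^2)}[f(z)] for f(z) = z^2."""
    rng = random.Random(seed)
    grad_mu = grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)  # auxiliary noise, independent of (mu, sigma)
        z = mu + eps * sigma       # reparameterized sample
        df_dz = 2.0 * z            # derivative of f(z) = z^2
        grad_mu += df_dz           # chain rule with dz/dmu = 1
        grad_sigma += df_dz * eps  # chain rule with dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples

g_mu, g_sigma = pathwise_gradients(mu=1.5, sigma=0.5)
# Analytically, E[z^2] = mu^2 + sigma^2, so the exact gradients are 2*mu and 2*sigma.
```

Because the randomness is moved into `eps`, the sample `z` becomes a deterministic, differentiable function of the distribution parameters, which is what allows gradients to pass through the sampling step.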

#### 4.1.1 Iterative Amortized Inference

### 4.2 Extensions of VAEs

#### 4.2.1 Additional Dependencies and Representation Learning

VAEs have been extended to a variety of architectures, incorporating hierarchical and temporal dependencies. Sønderby et al. (2016) proposed a hierarchical VAE, in which the conditional prior at each level, $\ell$, that is, $p_\theta(z_\ell | z_{\ell+1:L})$, is parameterized by a deep network (see equation 2.7). Follow-up works have scaled this approach with impressive results (Kingma et al., 2016; Vahdat & Kautz, 2020; Child, 2020), extracting increasingly abstract features at higher levels (Maaløe, Fraccaro, Lievin, & Winther, 2019). Another line of work has incorporated temporal dependencies within VAEs, parameterizing dynamics in the prior and conditional likelihood with deep networks (Chung et al., 2015; Fraccaro, Sønderby, Paquet, & Winther, 2016). Such models can also provide representations and predictions for reinforcement learning (Ha & Schmidhuber, 2018; Hafner et al., 2019).

Other work has investigated representation learning within VAEs. One approach, the $\beta$-VAE (Higgins et al., 2017), modifies the ELBO (see equation 2.14) by adjusting a weighting, $\beta$, on $D_{\mathrm{KL}}(q(z|x) \,\|\, p_\theta(z))$. This tends to yield more disentangled (i.e., independent) latent variables. Indeed, $\beta$ controls the rate-distortion trade-off between latent complexity and reconstruction (Alemi et al., 2018), highlighting VAEs' ability to extract latent structure at multiple resolutions (Rezende & Viola, 2018). A separate line of work has focused on identifiability: the ability to uniquely recover the original latent variables within a model (or their posterior). While this is true in linear ICA (Comon, 1994), it is not generally the case with nonlinear ICA and noninvertible models (VAEs) (Khemakhem, Kingma, Monti, & Hyvarinen, 2020; Gresele, Fissore, Javaloy, Schölkopf, & Hyvarinen, 2020), requiring special considerations.
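The role of $\beta$ can be sketched with a closed-form KL term. The snippet assumes the common setup of a diagonal gaussian approximate posterior and a standard normal prior, and it takes the reconstruction term, $\mathbb{E}_q[\log p_\theta(x|z)]$, as a given number; both the setup and the numerical values are assumptions made for illustration.

```python
import math

def kl_diag_gaussian_to_standard_normal(mu, sigma2):
    """KL(N(mu, diag(sigma2)) || N(0, I)) in closed form."""
    return 0.5 * sum(m * m + s2 - 1.0 - math.log(s2) for m, s2 in zip(mu, sigma2))

def beta_weighted_objective(recon_log_lik, mu, sigma2, beta):
    """Reconstruction term minus beta times the KL term (beta = 1 gives the ELBO)."""
    return recon_log_lik - beta * kl_diag_gaussian_to_standard_normal(mu, sigma2)

mu_q, sigma2_q = [0.5, -1.0], [0.8, 0.3]
recon = -12.0  # hypothetical value of E_q[log p(x|z)] for one data point
objectives = {beta: beta_weighted_objective(recon, mu_q, sigma2_q, beta)
              for beta in (0.5, 1.0, 4.0)}
```

Larger values of `beta` penalize the same approximate posterior more heavily, pushing optimization toward lower latent complexity (rate) at the expense of reconstruction (distortion).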

#### 4.2.2 Normalizing Flows

Another direction within VAEs is the use of normalizing flows (Rezende & Mohamed, 2015). Flow-based distributions use invertible transforms to add and remove dependencies (see section 2.2.1). While such models can operate as generative models (Dinh et al., 2015, 2017; Papamakarios, Pavlakou, & Murray, 2017), they can also define distributions in VAEs. This includes the approximate posterior (Rezende & Mohamed, 2015), prior (Huang et al., 2017), and conditional likelihood (Agrawal & Dukkipati, 2016). In each case, a deep network outputs the parameters (e.g., mean and variance) of a base distribution over a normalized variable. Separate deep networks parameterize the transforms, which map between the normalized and unnormalized variables.

#### 4.2.3 Example

## 5 Connections and Comparisons

Predictive coding and VAEs (and deep generative models generally) are highly related in both their model formulations and inference approaches (see Figure 10). Specifically,

- **Model formulation**: Both areas consider hierarchical latent gaussian models with nonlinear dependencies between latent levels, as well as dependencies within levels via covariance matrices (predictive coding) or normalizing flows (VAEs).
- **Inference**: Both areas use variational inference, often with gaussian approximate posteriors. While predictive coding and VAEs employ differing optimization techniques, these are design choices in solving the same inference problem.

These similarities reflect a common mathematical foundation inherited from cybernetics and descendant fields. We now discuss these two points in more detail.

### 5.1 Model Formulation

The primary distinction in model formulation is the form of the (non-linear) functions parameterizing dependencies. Rao and Ballard (1999) parameterize the conditional likelihood as a linear function followed by an element-wise nonlinearity. Friston has considered a wider range of functions, such as polynomial (Friston, 2008a); however, such functions are rarely learned. VAEs instead parameterize these functions using deep networks with multiple layers. The deep network weights are trained through backpropagation, enabling the wide application of VAEs to various data domains.

Predictive coding and VAEs also consider dependencies within each level. Friston (2005) uses full-covariance gaussian densities, with the inverse of the covariance matrix (precision) parameterizing linear dependencies within a level. Rao and Ballard (1999) normalize the observations, modeling linear dependencies within the conditional likelihood. These are linear special cases of the more general technique of normalizing flows (Rezende & Mohamed, 2015): a covariance matrix is an affine normalizing flow with linear dependencies (Kingma et al., 2016). Normalizing flows have been applied throughout each of the distributions within VAEs (see section 4.2.2), modeling nonlinear dependencies across both spatial and temporal dimensions. These flows are also parameterized by deep networks, providing a flexible yet general modeling approach.
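The claim that a covariance matrix acts as an affine normalizing flow can be checked directly. The sketch below, restricted to two dimensions with illustrative values, evaluates a full-covariance gaussian log density in two ways: through the change of variables $y = L^{-1}(x - \mu)$ with Cholesky factor $L$ and a standard normal base distribution, and through the usual density formula. The function names are assumptions.

```python
import math

def cholesky_2x2(cov):
    """Lower-triangular L with L L^T = cov, for a 2x2 positive-definite matrix."""
    (a, b), (_, d) = cov
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(d - l21 * l21)
    return l11, l21, l22

def log_prob_flow(x, mean, cov):
    """Change of variables: log p(x) = log N(y; 0, I) - log|det L|."""
    l11, l21, l22 = cholesky_2x2(cov)
    c0, c1 = x[0] - mean[0], x[1] - mean[1]
    y0 = c0 / l11  # forward substitution solves L y = x - mean
    y1 = (c1 - l21 * y0) / l22
    log_base = -0.5 * (y0 * y0 + y1 * y1) - math.log(2.0 * math.pi)
    return log_base - math.log(l11 * l22)

def log_prob_direct(x, mean, cov):
    """Standard full-covariance Gaussian log density, for comparison."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    c0, c1 = x[0] - mean[0], x[1] - mean[1]
    quad = (d * c0 * c0 - 2.0 * b * c0 * c1 + a * c1 * c1) / det
    return -0.5 * (quad + math.log(det)) - math.log(2.0 * math.pi)

point, mean, cov = [0.3, -1.2], [0.1, 0.2], [[2.0, 0.8], [0.8, 1.0]]
flow_value = log_prob_flow(point, mean, cov)
direct_value = log_prob_direct(point, mean, cov)
```

The two values agree up to rounding. Replacing the affine map with a nonlinear invertible transform, parameterized by a deep network, yields the more general flows used in VAEs.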

Related to normalization, there are proposals within predictive coding that the precision of the prior could mediate a form of attention (Feldman & Friston, 2010). Increasing the precision of a variable serves as a form of gain modulation, up-weighting the error in the objective function, thereby enforcing more accurate inference estimates. This concept is absent from VAEs. However, as VAEs become more prevalent in interactive settings (Ha & Schmidhuber, 2018), that is, beyond pure generative modeling, this may become crucial in steering models toward task-relevant perceptual inferences.

Finally, predictive coding and VAEs have both been extended to sequential settings. In predictive coding, sequential dependencies may be parameterized by linear functions (Srinivasan et al., 1982) or so-called generalized coordinates (Friston, 2008a), modeling multiple orders of motion. In extensions of VAEs, sequential dependencies are again parameterized by deep networks, in many cases using recurrent networks (Chung et al., 2015; Fraccaro et al., 2016). Thus, while the specific implementations vary, in either case, sequential dependencies are ultimately functions, which are subject to design choices.

### 5.2 Inference

Although both predictive coding and VAEs typically use variational inference with gaussian approximate posteriors, sections 3 and 4 illustrate key differences (see Figure 10). Predictive coding generally relies on gradient-based optimization to perform inference, whereas VAEs employ amortized optimization. While these approaches may at first appear radically different, hybrid error-encoding inference approaches (see equation 4.3), such as PredNet (Lotter et al., 2017) and iterative amortization (Marino, Yue, et al., 2018), provide a link. Such approaches receive errors as input, as in predictive coding; however, they have learnable parameters (i.e., amortization). In fact, amortization may provide a crucial element for implementing predictive coding in biological neural networks.

Though rarely discussed, hierarchical predictive coding assumes that the inference gradients, supplied by forward connections, can be readily calculated. But as seen in section 3.2, the weights of these forward connections are the transposed Jacobian matrix of the backward connections (Rao & Ballard, 1999). This is an example of the weight transport problem (Grossberg, 1987), in which the weights of one set of connections (forward) depend on the weights of another set of connections (backward). This is generally regarded as not being biologically plausible.
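
To make the dependence explicit, consider a sketch with a unit-variance gaussian conditional likelihood whose mean is given by a backward (generative) mapping $f(\mathbf{z})$. The symbols $f$, $\mathbf{e}$, $\mathbf{J}$, and $\mathbf{W}$ are introduced here for illustration and may differ from the notation of section 3.2.

$$
\mathbf{e} = \mathbf{x} - f(\mathbf{z}), \qquad
\nabla_{\mathbf{z}} \tfrac{1}{2} \lVert \mathbf{e} \rVert^2 = -\mathbf{J}^\top \mathbf{e}, \qquad
\mathbf{J} = \frac{\partial f(\mathbf{z})}{\partial \mathbf{z}}.
$$

In the linear case, $f(\mathbf{z}) = \mathbf{W} \mathbf{z}$, the Jacobian is $\mathbf{J} = \mathbf{W}$, so the forward pathway must apply $\mathbf{W}^\top$ to the error: the forward weights are tied to the backward weights.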

Amortization provides a solution to this problem: *learn* to perform inference. Rather than transporting the generative weights to the inference connections, amortization learns a separate set of inference weights, potentially using local learning rules (Bengio, 2014; Lee, Zhang, Fischer, & Bengio, 2015). Thus, despite criticism from Friston (2018), amortization may offer a more biologically plausible inference approach. Further, amortized inference yields accurate estimates with exceedingly few iterations: even a single iteration may yield reasonable estimates (Marino, Yue, et al., 2018). These computational efficiency benefits provide another argument in favor of amortization.

Finally, although predictive coding and VAEs typically assume gaussian approximate posteriors, there is one additional difference in how the parameters of these distributions are conventionally calculated. Friston often uses the Laplace approximation^{3} (Friston et al., 2007), solving directly for the optimal gaussian variance, whereas VAEs treat the variance as another output of the inference model (Kingma & Welling, 2014; Rezende, Mohamed, & Wierstra, 2014). These approaches can be applied in either setting (Park, Kim, & Kim, 2019).
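
As a minimal sketch of the Laplace approach in the scalar case, the variance can be obtained as the inverse curvature of the negative log joint at its mode. The model (generative weight `W`, unit-variance likelihood and prior), the function names, and the finite-difference step are assumptions made for illustration.

```python
W = 2.0  # illustrative generative weight


def neg_log_joint(z, x):
    """Negative log joint (up to a constant) for x ~ N(W*z, 1), z ~ N(0, 1)."""
    return 0.5 * (x - W * z) ** 2 + 0.5 * z ** 2


def laplace_variance(objective, z_mode, eps=1e-3):
    """Inverse curvature of the objective at the mode (central difference)."""
    curvature = (objective(z_mode + eps) - 2.0 * objective(z_mode)
                 + objective(z_mode - eps)) / eps ** 2
    return 1.0 / curvature


x = 5.0
z_mode = W * x / (W ** 2 + 1.0)  # closed-form mode for this model
variance = laplace_variance(lambda z: neg_log_joint(z, x), z_mode)
exact_variance = 1.0 / (W ** 2 + 1.0)
```

An amortized inference model would instead output the variance directly from its inputs, without evaluating the curvature.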

## 6 Correspondences

### 6.1 Pyramidal Neurons and Deep Networks

#### 6.1.1 Nonlinear Dendritic Computation

Placing deep networks in correspondence with pyramidal dendrites suggests that (some) biological neurons may be better computationally described as nonlinear functions. Evidence from neuroscience supports this claim. Early simulations showed that individual pyramidal neurons, through dendritic processing, could operate as multilayer artificial networks (Zador, Claiborne, & Brown, 1992; Mel, 1992). This was later supported by empirical findings that pyramidal dendrites act as computational “subunits,” yielding the equivalent of a two-layer artificial network (Poirazi, Brannon, & Mel, 2003; Polsky, Mel, & Schiller, 2004). More recently, Gidon et al. (2020) demonstrated that individual pyramidal neurons can compute the XOR operation, which requires nonlinear processing. This is supported by further modeling work (Jones & Kording, 2020; Beniaguev, Segev, & London, 2021). Positing a more substantial role for dendritic computation (London & Häusser, 2005) moves beyond the simplistic comparison of biological and artificial neurons that currently dominates. Instead, neural computation depends on morphology and circuits.
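
The role of nonlinear subunits in computing XOR can be illustrated with threshold units. This is an abstract sketch, not a biophysical model of the mechanism reported by Gidon et al. (2020): the unit definitions, the weights, and the search grid are all assumptions.

```python
from itertools import product


def step(v):
    """Hard threshold nonlinearity."""
    return 1 if v > 0 else 0


def single_unit(x1, x2, w1, w2, b):
    """One thresholded weighted sum, as in a point-neuron abstraction."""
    return step(w1 * x1 + w2 * x2 + b)


def two_layer(x1, x2):
    """Two thresholded 'subunits' feeding a thresholded output."""
    h_or = step(x1 + x2 - 0.5)   # active if either input is on
    h_and = step(x1 + x2 - 1.5)  # active only if both inputs are on
    return step(h_or - h_and - 0.5)


inputs = list(product([0, 1], repeat=2))
xor_targets = [a ^ b for a, b in inputs]
two_layer_outputs = [two_layer(a, b) for a, b in inputs]

# Search a coarse grid of weights and biases for a single unit computing XOR.
grid = [i / 2 for i in range(-6, 7)]
single_unit_solves_xor = any(
    [single_unit(a, b, w1, w2, bias) for a, b in inputs] == xor_targets
    for w1, w2, bias in product(grid, repeat=3)
)
```

The two-layer unit reproduces XOR, whereas no single threshold unit on the grid does, consistent with XOR not being linearly separable.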

#### 6.1.2 Amortization

Pyramidal neurons mediate both top-down and bottom-up cortical projections. Under predictive coding, this suggests that inference relies on learned, nonlinear functions: amortization. One such implementation is through pyramidal neurons with separate apical and basal dendrites, which, respectively, receive top-down and bottom-up inputs (Bekkers, 2011; Guergiuev, Lillicrap, & Richards, 2016). Recent evidence from Gillon et al. (2021) suggests that these are top-down predictions and bottom-up errors. These neurons may implement iterative amortized inference (Marino, Yue, et al., 2018), separately processing top-down and bottom-up error signals to update inference estimates (see Figure 11, left). While some empirical support for amortization exists (Yildirim, Kulkarni, Freiwald, & Tenenbaum, 2015; Dasgupta, Schulz, Goodman, & Gershman, 2018), further investigation is needed. Finally, this perspective implies separate computational processing for prediction and inference, with distinct (but linked) frequencies. While some evidence supports this conjecture (Bastos et al., 2015), it is unclear how this could be implemented in biological neurons.

#### 6.1.3 Backpropagation

The biological plausibility of backpropagation is an open question (Lillicrap, Santoro, Marris, Akerman, & Hinton, 2020). Critics argue that backpropagation requires nonlocal learning signals (Grossberg, 1987; Crick, 1989), whereas the brain relies largely on local learning rules (Hebb, 1949; Markram, Lübke, Frotscher, & Sakmann, 1997; Bi & Poo, 1998). Biologically plausible formulations of backpropagation have been proposed (Stork, 1989; Körding & König, 2001; Xie & Seung, 2003; Hinton, 2007; Lillicrap, Cownden, Tweed, & Akerman, 2016), attempting to reconcile this disparity. Yet consensus is still lacking. From another perspective, the apparent biological implausibility of backpropagation may instead be the result of incorrectly assuming a one-to-one correspondence between biological and artificial neurons.

### 6.2 Lateral Inhibition and Normalizing Flows

#### 6.2.1 Sensory Input Normalization

One of the key computational roles of early sensory areas appears to be reducing spatiotemporal redundancies—normalization. In retina, this is performed through lateral inhibition via horizontal and amacrine cells, removing correlations (Graham et al., 2006; Pitkow & Meister, 2012). Normalization and prediction are inseparable, and, accordingly, previous work has framed early sensory processing in terms of spatiotemporal predictive coding (Srinivasan et al., 1982; Hosoya et al., 2005; Palmer et al., 2015). This is often motivated in terms of increased sensitivity or efficiency (Srinivasan et al., 1982; Atick & Redlich, 1990) due to redundancy reduction (Barlow, 1961a; Barlow et al., 1989), that is, compression.
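
A one-dimensional sketch of spatial predictive coding through lateral inhibition follows. Each unit's value is predicted as the average of its two neighbors, and only the residual is passed on. The neighborhood size, the fixed weights of 0.5, and the boundary handling are illustrative assumptions.

```python
def lateral_inhibition(signal):
    """Subtract from each unit the average of its immediate neighbors.

    At the boundaries, the missing neighbor is replaced by the unit itself.
    """
    n = len(signal)
    output = []
    for i in range(n):
        left = signal[max(i - 1, 0)]
        right = signal[min(i + 1, n - 1)]
        output.append(signal[i] - 0.5 * (left + right))
    return output


# A step edge between two uniform regions.
signal = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
residual = lateral_inhibition(signal)
```

Uniform regions are predicted perfectly and yield zero output, so only the edge is transmitted. Note that this fixed transform is not invertible and is therefore a sketch of redundancy reduction rather than a complete flow.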

If we consider cortex as a hierarchical latent variable model, then early sensory areas are implicated in parameterizing the conditional likelihood. The ubiquity of normalization in these areas is suggestive of normalization in a flow-based model, implementing the inference direction of a flow-based conditional likelihood (Agrawal & Dukkipati, 2016; Winkler et al., 2019). In addition to the sensitivity and efficiency benefits cited above, this learned, normalized space simplifies downstream generative modeling and improves generalization (Marino et al., 2020).

#### 6.2.2 Normalization in Thalamus

Normalization also appears to occur in first-order thalamic relays, such as the lateral geniculate nucleus (LGN). Dong and Atick (1995) framed LGN in terms of temporal normalization, with supporting evidence provided by Dan et al. (1996). This has the effect of removing predictable temporal structure (e.g., static backgrounds). Under the interpretation above, this is an additional inference stage of a flow-based conditional likelihood (Marino et al., 2020).
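
Temporal normalization of this kind can be sketched as an invertible affine transform in which each frame is predicted from the previous one. The choice of predictor (the previous frame), the constant scale, and the function names are assumptions made for illustration.

```python
def normalize(frames, scale=1.0):
    """Inference direction: subtract the prediction, then rescale."""
    previous = [0.0] * len(frames[0])
    residuals = []
    for frame in frames:
        residuals.append([(x - p) / scale for x, p in zip(frame, previous)])
        previous = frame
    return residuals


def generate(residuals, scale=1.0):
    """Generative direction: invert the transform to recover the frames."""
    previous = [0.0] * len(residuals[0])
    frames = []
    for residual in residuals:
        frame = [p + scale * r for r, p in zip(residual, previous)]
        frames.append(frame)
        previous = frame
    return frames


# Two static "background" pixels and one pixel that changes over time.
frames = [[3.0, 3.0, 0.0], [3.0, 3.0, 1.0], [3.0, 3.0, 2.0]]
residuals = normalize(frames)
recovered = generate(residuals)
```

After the first frame, the static pixels produce zero output, while the transform remains exactly invertible.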

#### 6.2.3 Normalization in Cortex

Normalizing flows offers a more general framework for describing these computations. Further, Friston (2005) assumes that these dependencies are modeled using *symmetric* weights, whereas normalizing flows permits *non-symmetric* schemes, e.g., using autoregressive models (Kingma et al., 2016) or ensembles (Uria et al., 2014). These weights can also be restricted to local spatial regions (Vahdat & Kautz, 2020). Similar normalization operations can also parameterize temporal dynamics (Marino et al., 2020), akin to Friston's generalized coordinates (Friston, 2008a). The overall computational scheme is shown in Figure 13.
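
A non-symmetric scheme of this kind can be sketched as an affine autoregressive transform, in which each unit is predicted only from the units preceding it in a fixed ordering. The weights, scales, and ordering below are illustrative assumptions, not values from any cited model.

```python
import math

# Strictly lower-triangular weights: unit i is predicted only from units j < i,
# so the weight from j to i has no matching weight from i to j.
WEIGHTS = [[0.0, 0.0, 0.0],
           [0.5, 0.0, 0.0],
           [0.25, 0.5, 0.0]]
SCALES = [1.0, 2.0, 0.5]


def normalize(x):
    """Inference direction: subtract each prediction and divide by the scale."""
    y = []
    for i, x_i in enumerate(x):
        prediction = sum(WEIGHTS[i][j] * x[j] for j in range(i))
        y.append((x_i - prediction) / SCALES[i])
    log_det = -sum(math.log(s) for s in SCALES)  # log |det dy/dx|
    return y, log_det


def generate(y):
    """Generative direction: rebuild x one unit at a time."""
    x = []
    for i, y_i in enumerate(y):
        prediction = sum(WEIGHTS[i][j] * x[j] for j in range(i))
        x.append(prediction + SCALES[i] * y_i)
    return x


x = [2.0, 3.0, 1.0]
y, log_det = normalize(x)
x_back = generate(y)
```

Because the Jacobian is triangular, its log determinant is simply the negative sum of the log scales, and the transform is inverted exactly by sequential generation.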

## 7 Discussion

We have reviewed predictive coding and VAEs, identifying their shared history and formulations. These connections provide an invaluable link between leading areas of theoretical neuroscience and machine learning, hopefully facilitating the transfer of ideas across fields. We have initiated this process by proposing two novel correspondences suggested by this perspective: (1) dendrites of pyramidal neurons and deep networks and (2) lateral inhibition and normalizing flows. Placing pyramidal neurons in correspondence with deep networks departs from the traditional one-to-one analogy of biological and artificial neurons, raising questions regarding dendritic computation and backpropagation. Normalizing flows offers a more general framework for normalization via lateral inhibition. Connecting these areas may provide new insights for both machine learning and neuroscience, helping us move beyond overly simplistic comparisons.

### 7.1 Predictive Coding $\rightarrow$ VAEs

Although considerable independent progress has recently occurred in VAEs, such models are often still trained on relatively simple, standardized data sets of static images. Thus, predictive coding and neuroscience may still hold insights for improving these models for real-world settings. For instance, the correspondences outlined above may offer new architectural insights in designing deep networks and normalizing flows, for example, drawing on dendritic morphology, short-term plasticity, and connectivity. Predictive coding has also used prediction precision as a form of attention (Feldman & Friston, 2010). More broadly, neuroscience may provide insights into interfacing VAEs with other computations, as well as within embodied agents.

### 7.2 VAEs $\rightarrow$ Predictive Coding

Another motivating factor in connecting these areas stems from a desire for large-scale, testable models of predictive coding. While predictive coding offers general considerations for neural activity, e.g., predictions, prediction errors, and extra-classical receptive fields (Rao & Ballard, 1999), it is difficult to align such hypotheses with real data due to the many possible design choices (Gershman, 2019). Current models are often implemented in simplified settings, with few, if any, learned parameters. VAEs, in contrast, offer a large-scale test bed for implementing models and evaluating them on natural stimuli. This may offer a more nuanced perspective than current efforts to compare biological and artificial neural activity (Yamins et al., 2014).

While we have reviewed many topics across neuroscience and machine learning, for brevity, we have focused exclusively on passive perceptual settings. However, separate, growing bodies of work are incorporating predictive coding (Adams, Shipp, & Friston, 2013) and VAEs (Ha & Schmidhuber, 2018) within active settings such as reinforcement learning. We are hopeful that the connections in this paper will inspire further insight in such areas.

## Appendix A: Variational Bound Derivation
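
The standard derivation can be sketched as follows, assuming a latent variable model $p_\theta(\mathbf{x}, \mathbf{z})$ and an approximate posterior $q(\mathbf{z} | \mathbf{x})$. These symbols, and the bound's name $\mathcal{L}$, are assumptions and may differ from the notation of the main text.

$$
\begin{aligned}
\log p_\theta(\mathbf{x}) &= \log \int p_\theta(\mathbf{x}, \mathbf{z}) \, d\mathbf{z} \\
&= \log \mathbb{E}_{q(\mathbf{z} | \mathbf{x})} \left[ \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q(\mathbf{z} | \mathbf{x})} \right] \\
&\geq \mathbb{E}_{q(\mathbf{z} | \mathbf{x})} \left[ \log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q(\mathbf{z} | \mathbf{x})} \right] \equiv \mathcal{L}.
\end{aligned}
$$

The inequality follows from Jensen's inequality applied to the concave logarithm. The gap is the KL divergence from the approximate posterior to the true posterior, $\log p_\theta(\mathbf{x}) - \mathcal{L} = D_{\mathrm{KL}} \left( q(\mathbf{z} | \mathbf{x}) \, \| \, p_\theta(\mathbf{z} | \mathbf{x}) \right) \geq 0$, so the bound is tight when the two distributions match.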

## Notes

^{1} Formally, we refer to normalization as one or more steps of a process transforming the data density into a standard gaussian (i.e., Normal), which is equivalent to ICA (Hyvärinen & Oja, 2000). This is a form of redundancy reduction, removing statistical dependencies between data dimensions.

^{2} Note that other forms of probabilistic models will result in other forms of whitening transforms.

^{3} This is not to be confused with a Laplace distribution. The approximate posterior is still gaussian.

## Acknowledgments

Sam Gershman and Rajesh Rao provided helpful comments on this manuscript, and Karl Friston engaged in useful early discussions related to these ideas. We also thank the anonymous reviewers for their feedback and suggestions.

## Author notes

^{*}The author is now at DeepMind, London, U.K.