Backpropagation has rapidly become the workhorse credit assignment algorithm for modern deep learning methods. Recently, modified forms of predictive coding (PC), an algorithm with origins in computational neuroscience, have been shown to result in approximately or exactly equal parameter updates to those under backpropagation. Due to this connection, it has been suggested that PC can act as an alternative to backpropagation with desirable properties that may facilitate implementation in neuromorphic systems. Here, we explore these claims using the different contemporary PC variants proposed in the literature. We obtain time complexity bounds for these PC variants, which we show are lower bounded by backpropagation. We also present key properties of these variants that have implications for neurobiological plausibility and their interpretations, particularly from the perspective of standard PC as a variational Bayes algorithm for latent probabilistic models. Our findings shed new light on the connection between the two learning frameworks and suggest that in its current forms, PC may have more limited potential as a direct replacement of backpropagation than previously envisioned.
Predictive coding (PC) is a prominent neuroscientific theory that has emerged in the past two decades as a highly compelling computational model of perception and action in the brain (Friston, 2005; Rao & Ballard, 1999). From a theoretical standpoint, PC, in its standard formulation, has benefited from its interpretation as a variational Bayes algorithm for learning and inference in latent hierarchical models (Bogacz, 2017; Friston, 2003, 2005; Friston & Kiebel, 2009), providing it credence as a plausible instantiation of normative theories such as the Bayesian brain hypothesis and the free-energy principle. More empirically, its candidate neurobiological implementation has demonstrated impressively close correspondences to the canonical circuitry of the cortex (Bastos et al., 2012; Shipp, 2016), while also successfully explaining or reproducing a number of neurophysiological and cognitive phenomena, such as end stopping (Rao & Ballard, 1999), binocular rivalry (Hohwy et al., 2008), attention (Feldman & Friston, 2010; Kanai et al., 2015), and biases in the perception of event duration (Fountas et al., 2022).
More recent work has suggested a connection between modified forms of PC, and backpropagation, suggesting the former as an alternative to the latter with the various benefits one may naturally expect to associate for a theory formulated with neurobiological constraints in mind (Millidge et al., 2022, 2020b; Song et al., 2020). These include locality of computation, and parallelizability that may lend it improved performance and greater amenability to implementation on neuromorphic hardware.
The purpose of this work is to investigate these suggestions and clarify more precisely the relationship between backpropagation and PC, particularly with regard to the kinds of assumptions that are necessary to enable such a comparison and, furthermore, define exactly what advantage PC-based formulations may have with respect to parallelizability, locality, and ultimately practical performance. In what follows, we re-review some of the ground covered by Rosenbaum (2022), while extending it to discuss the implications for PC in general with respect to computational and memory complexity, as well as compute and memory locality. We also present proofs for lower bounds on the time complexity of current PC variants relative to backpropagation and validate these proofs empirically. Our work finds that while the examined algorithms present interesting, biologically inspired, alternative implementations of backpropagation, they are provably guaranteed to be slower under identical problem settings and that it is unclear to what degree they improve on locality, particularly in comparison to modern candidate proposals for backpropagation. Finally, we note that when adopting the modifications required to enable such comparisons, PC models diverge from the generative variational Bayes framework, frequently used as the motivational basis of this theory.
Backpropagation, introduced by Linnainmaa (1976) in the context of accumulating rounding errors and later popularized by Rumelhart et al. (1986) in the context of neural networks, has quickly become the workhorse credit assignment algorithm for modern deep learning tasks. Described most simply, backpropagation can be seen as an algorithmically efficient implementation of the chain rule from calculus that allows the computation of high-dimensional gradients for deeply nested functions with respect to scalar or low-dimensional outputs. It accomplishes this by defining gradients of values deep inside a nested function in terms of the gradients of their children (their functional dependents). The resultant recursive relationship is then evaluated for values closest to the output, sequentially backward. This recursive relationship is what enables backpropagation to efficiently compute gradients with respect to all parameters in the function chain with a single forward and backward computational pass.
where we use μi to denote the intermediary outputs of our composite function and x0, xL to denote known values that are provided as inputs and outputs of our function, respectively. x0 and xL may, for example, be inputs and classification labels, respectively; μℓ may be intermediate activation vectors for a multilayer perceptron and fL a classification loss.
For multivariate elementary functions, the terms would correspond to the Jacobian matrices for the elementary functions fi with respect to μℓ. And by virtue of the chain rule, the error terms eℓ would thus equal the (transposed) gradient of E with respect to μℓ: .
Though obvious in this scenario, it will be important for our later comparisons to mention here that all Jacobians mentioned above are evaluated at the values of the feedforward outputs of our function, which, as we will discuss, may not necessarily be the case for Bayesian PC. We also note that in practice, the Jacobians specified in equations 2.3 and 2.2 are almost never instantiated explicitly. Rather, an automatic differentiation program that implements backpropagation would associate each elementary or constituent function with an associated vector-Jacobian product (VJP) function, which is generally far more computationally and memory efficient (Bradbury et al., 2018; Paszke et al., 2017).
2.1 Computational and Time Complexity
The cost of computing the forward pass (see equation 2.1 and subsequent backward recursions (see equations 2.2 and 2.3) can be shown to have computational and time cost bounded by a small constant multiple of the cost for a single forward pass through the corresponding elementary function. This well-known result from the automatic differentiation literature is sometimes called the cheap-gradient principle and is one reason why backpropagation has been so successful as a credit assignment algorithm in modern large data settings. This constant was shown to be 3 for rational functions in the seminal work of Baur and Strassen (1983) and 5 more generally for any function composed of elementary arithmetic and trigonometric functions (Griewank, 1997, 2009; Griewank & Walther, 2008). The precise value of this constant generally also depends on details regarding the relative costs of various basic mathematical operations (such as addition, multiplication, memory access, and nonlinear operations) on the specific hardware being considered. For modern deep neural network operations and associated hardware, this constant is generally taken to be 3 (Hoffmann et al., 2022; Kaplan et al., 2020), and we adopt this for our comparison.
As we will see, the basic unit of computation for both the current implementations of PC and backpropagation algorithms is the same: the VJP. The cost of computing this VJP is bounded in terms of a forward evaluation via the cheap gradient principle. To levy a comparison between algorithms, we may therefore report computational complexity with respect to the cost of computing a single forward pass, which we denote with and , for elementary constituent functions fi, and the overall composite function F respectively.
3 Predictive Coding
3.1 Variational PC
We begin by describing the standard formulation of PC as a variational Bayes algorithm (Bogacz, 2017; Buckley et al., 2017; Friston, 2003, 2005). Within this context, the PC algorithm can be interpreted as a method for performing inference (over latent states) and learning (over parameters) for a latent hierarchical generative model. In keeping with much of recent work that renders a comparison with backpropagation we will restrict our focus to PC for static (i.e., nondynamical/time-series-based) observations and states. We will distinguish this formulation from recent modified formulations such as FPA-PC (Millidge et al., 2020) and Z-IL (Salvatori et al., 2021; Song et al., 2020) by calling it variational PC (VPC) to emphasize that modified formulations may not necessarily correspond to a variational Bayesian inference and learning algorithm.
For the sake of enabling our subsequent comparisons of PC with backpropagation, in this section we will reuse the elementary functions fi, first defined in section 2, for defining the conditional relationships within our hierarchical model.
Given this model, PC answers the question of how one can learn the parameters θ that maximize their model evidence, , where for notational simplicity, we have combined all observations (x0 and xL), latent states (x1, . . . , xL − 1), and parameters (θ0, . . . , θL) into the vectors xOBS, xLAT, , respectively, and superscript indices denote samples from a data-generating (observation) distribution .
PC optimizes this function by first enacting an ascent with respect to the Dirac δ parameters ϕ corresponding to the modes of the approximate posterior over latent states. Once the maxima for the log joint probability with respect to ϕ is obtained, we then update our parameters θ by computing the gradient with respect to the log joint evaluated at these values of ϕ.
Depending on one’s perspective, this initial step can be seen variously as obtaining an MAP (maximum a posteriori) estimate over latent states (due to the maximization of the log-joint) or as an expectation step within an expectation-maximization scheme (Friston, 2005), or as a variational bound tightening step from a variational Bayes perspective. The subsequent maximization step (optimization of θ) can then be implemented via a mini-batch or stochastic gradient descent (SGD) procedure allowing one to tractably optimize against a large data set of observations. The final algorithm can be summarized succinctly as follows:
Define a (possibly hierarchical) graphical model over latent states (xLAT) and observations (xOBS) with parameters (i.e., )
For minibatch , where is the data-generating distribution:
Inference: Obtain MAP estimates (xMAP) for the latent states by enacting a gradient descent on .
Learning: Update the parameters using SGD with respect to the negative log joint evaluated at the MAP estimates found at the end of inference: .
Equations 3.9 and 3.10 describe the dynamics of the standard PC formulation. Under these dynamics, inference over latent states can be considered MAP inference on a corresponding probabilistic model, and learning can be considered as the maximization of an ELBO with respect to its model parameters. In particular, this bound corresponds to a point-mass variational distribution assumed over our latent states.
3.1.1 Inverted Configurations
This particular formulation of PC is inverted relative to the standard PC formulation (Buckley et al., 2017; Friston, 2003, 2008; Rao & Ballard, 1999) present in cognitive science and, indeed, most generative modeling schemes, in the sense that observations parameterize, and are thus hierarchically higher than, latents such as classification labels. Inverting PC in this way, to render a comparison with backpropagation, has nontrivial implications on neurobiological plausibility. One of the strengths of standard PC formulations has historically been in the notion that latents hierarchically higher in the cortex have a nonlinearly mixing and modulatory effect (f(xℓ, θℓ))) that is congruent with what we know about cortical—that top-down backward connections are generally more bifurcating and modulatory, while bottom-up influences, or forward connections, are more driving (−eℓ) (Bastos et al., 2012; Friston, 2005, 2008; Markov et al., 2014; Shipp, 2016).
It is interesting to note that supervised generative (i.e., noninverted) formulations of standard PC have thus far struggled to produce competitive or close to competitive results for complex classification tasks. For example, traditional generative PC models trained to classify and generate MNIST digits are consistently unable to surpass a test set accuracy of 80%, both from our experience and observed by Kinghorn et al. (2022). We are not aware of any example of a traditional (generative) PC succeeding in obtaining competitive classification performance on standard machine learning tasks. Though the possible reasons for this are outside the scope of this article, it is likely a consequence of either the gaussian-based supervision signal generally used in existing attempts or the limitations of the Dirac δ (point-mass) approximate posterior, for training generative models, as highlighted in Zahid et al. (2023).
3.2 Memory Locality, Weight Transport, and Weight Symmetry
Local computation within a neurobiological context generally refers to local plasticity, synaptic weight updates that are determined by the activity of the neurons they connect, that is, presynaptic and postsynaptic neuronal activity and synaptic locality. This notion of locality is often used in the context of assessing learning algorithms for their plausibility of implementation in the brain. Historically, a key criticism in the biological plausibility of backpropagation has been due to the presence of this type of nonlocality, where it has been called the weight transport problem (Crick, 1989; Grossberg, 1987; Zipser & Rumelhart, 1993).
With respect to backpropagation, there have been a number of well-known attempts at resolving the implausibility of the weight transport issue. These include the influential work of Lillicrap et al. (2016), which demonstrated that computing the error via a fixed random matrix B replacing θℓ could nonetheless facilitate learning without the requirement that the forward parameters be reused. However, the use of random feedback weights remained performant only for shallow networks, failing for deeper models. Subsequent and more recent work has demonstrated that training these feedback weights separately (Akrout et al., 2019; Amit, 2019) or using meta-learned plasticity rules for the forward weights (Shervani-Tabar & Rosenbaum, 2023) may improve performance for deeper models.
While these various models resolve the weight transport issue in one way or another, they frequently require or induce additional assumptions about the neurobiological machinery required for their implementation. The work of Zipser and Rumelhart (1993), for example, required initializing feedforward and feedback matrices identically, which was criticized and resolved using weight decay by Kolen and Pollack (1994). Both Kolen and Pollack (1994) and Zipser and Rumelhart (1993), however, required transmitting the synaptic weight changes between separate networks, which has itself also been argued as implausible (Akrout et al., 2019). Similarly, both Lillicrap et al. (2016) and Akrout et al. (2019) require interactions between the activations within forward and backward pathways, usually implemented as three-factor-style learning rules, which induce a complexity beyond simple two-factor Hebbian learning rules. And while there exists a growing body of literature presenting empirical evidence or nascent mechanistic proposals for such learning rules (Pawlak et al., 2010; Sjöström & Häusser, 2006; Urbanczik & Senn, 2014), they nonetheless induce a complexity beyond simple two-factor Hebbian learning, while also frequently requiring that modulators do not affect postsynaptic activity in the feedforward path. Finally, for completeness, we note that Amit (2019) required strict and regimented sequencing of computation, with feedforward neurons subsequently also acting as error neurons, which can be seen diagrammatically in Figure 1.
To better understand the relationship of these various proposals, one can decompose the general issue of neurobiological plausibility and weight transport into two distinct sets of issues: weight nonlocality (synaptic modification based on distinct error and value activity) and that of weight symmetry (parity between reciprocating connections between the same or different sets of neurons) (roughly mapping to constraints 2.2 and 2.3, respectively, from Payeur et al., 2021). In the backpropagation literature, solutions to this first issue generally appeal to either some form of signal multiplexing (i.e., reuse of the same neuron for distinct activities in either time; Amit, 2019), or space (via, for example, apical, dendritic, and somatic compartments (see Sacramento et al., 2018; Whittington & Bogacz, 2019; and the discussion in Shervani-Tabar & Rosenbaum, 2023), or they appeal to third-factor learning rules (Lillicrap et al., 2016), while attempts at resolving the latter issue generally appeal to the use of fixed (Lillicrap et al., 2016) or updating (Amit, 2019) random feedback and/or decay (Kolen & Pollack, 1994).
Thus, the negative of the weights that mediate the influence of the subsequent error units (eℓ + 1) on the previous latent states (xℓ) is also required for the reciprocal connection from latent states to error units. (See Figure 1 for a depiction of this.) This property was identified in some of the earliest narratives on PC (Friston, 2005, 2008), albeit from the perspective of an advantage of THE PC’s neurobiological plausibility due to the reported prevalence of reciprocal connections in the brain (Felleman & Van Essen, 1991). Prevalence of reciprocal connections is not sufficient, however, as we, at least naively, require exact equivalence in strengths also.
However, while the issue of weight symmetry remains, by mediating value neuron activity transmission through error neurons and their dynamics and relying on converged value and error neuron activity to update weights, standard PC abrogates the need to update weights nonlocally using activities between separate populations of value and error neurons. This benefit is not obtained for free, and the converged error values that result from standard PC dynamics are generally distinct from those obtained by backpropagation. Thus, to obtain parity with backpropagation, modified forms of PC adopt changes to modifications that themselves may be neurobiologically implausible. These include rescaling the synaptic learning rate up to infinity (see section 3.3), requiring postsynaptic value neuron activity to remain fixed and stored somewhere throughout multiple steps of inference and error computation (see section 3.4), or requiring strictly regimented activity propagation and weight updates (see section 3.5).
Here we have distinguished two closely related issues of weight nonlocality and symmetry as they appear to have distinct considerations and solutions. In particular, weight symmetry is arguably a more addressable issue from the perspective of neurobiological plausibility over weight nonlocality. The reason is that plasticity is generally considered to be some simple function of presynaptic and postsynaptic activity. Given that both of these activities are accessible for forward and reciprocal synapses, it is not implausible that forward and backward weights could come into parity. In particular, given a simple two-factor Hebbian learning rule, one would expect, at least for simple Hebbian plasticity, that both forward and reciprocal connections could experience the same weight modifications (the product of presynaptic and postsynaptic activity). Further incorporating weight decay would result in weights that eventually synchronize, as demonstrated for the weight transport case by Kolen and Pollack (1994). This approach, without decay, has indeed been demonstrated with minimal performance degradation on standard PC networks (Millidge et al., 2020a).
We next discuss a number of the modified formulations of PC that have recently been presented as neurmorphic alternatives to backpropagation.
3.3 Backpropagation as PC in the Infinite Variance Limit
The link between PC and backpropagation was first presented in Whittington and Bogacz (2017), where it was shown that the resultant weight updates under the PC algorithm outlined above would equal that for the equivalent backpropagation graph given the following specific restrictive set of circumstances:
When the variational modes or latent states (x1, . . . , xL − 1) are initialized to the feedforward values of the corresponding backpropagation-based computational chain (μ1, . . . , μL − 1) and:
Case 1. When the feedforward prediction results in zero output loss (i.e., zero negative log likelihood loss)
Case 2. When the output variance (ΣL) is set sufficiently higher than the remaining variances (Σℓ < L), such that the ratio of goes to infinity and the learning rate is scaled appropriately
Both of these cases correspond to the same requirement, namely, that the values of the latent states, x1, . . . , xL − 1 remain equal or close to their feedforward values at the end of the inference procedure. Said another way, this can be seen as the requirement that the MAP values for the latent states while observing both input observations (e.g., images) and output observations (e.g., classification labels) are equal to the MAP values when conditioning on input observations alone. Another more intuitive framework for understanding this requirement is to note that the inference and learning procedure that results under these sets of circumstances is equivalent to doing inference and learning for a probabilistic graphical model in which the intermediate latent states have zero conditional uncertainty, or variance, and are thus deterministic outputs of the input observations.
Given this framework, learning devolves into enacting maximimum likelihood learning for a probabilistic graphical model in which inputs x0 parameterize an output likelihood via the deep “total” network function F = fL ○ fL − 1 . . . ○ f1 ○ f0, with observations xL. This is equivalent to enacting maximum likelihood training with backpropagation when the output loss function fL (see equation 2.1) corresponds to a valid log likelihood. (See Figure 2 for a depiction of the probabilistic graphical model assumed by this approach versus standard VPC.)
To see this, observe that this is trivially true for case 1 where the weight updates from the deterministic latent maximum likelihood objective and the PC objective with probabilistic latent states are the same, namely zero, due to the zero error.
More interestingly, this can also be seen for case 2 if one considers that scaling up ΣL and rescaling all learning rates by its inverse is equivalent to scaling down the intermediate variances Σℓ, ∀ℓ < L, while keeping the output variances fixed and the learning rate the same. This is arguably a more natural perspective to view this procedure as it allows us to understand what is happening to the underlying probabilistic graphical model that we are training as we do this: that we are removing uncertainty over latent states and thus going from a variational Bayes (free energy minimizing) learning procedure to a maximum likelihood learning procedure. To demonstrate this, we can look at what happens to the inference and learning equations as we conduct the case 2 scaling procedure from Whittington and Bogacz (2017).
Therefore, the modification in case 2 corresponds to downscaling the variances of the intermediate latent states toward zero while keeping the output likelihood variance fixed. Obtaining exact equivalence to the equivalent backpropagation-based learning update for this nonlatent graph would require taking k to infinity, or more practically for getting approximate equivalence, some reasonably large value; Whittington and Bogacz (2017) use k = 100 for a small network. But this scaling clearly runs the risk of being numerically unstable and is also biologically unrealistic.
3.4 The Fixed-Prediction Assumption
An alternative to the scaling procedure mentioned above, dubbed the fixed-prediction assumption (FPA) was presented in Millidge et al. (2020), wherein the top-down predictions f(xℓ) and Jacobians corresponding to each latent were fixed to their feedforward predictions μℓ + 1 and , respectively, throughout the inference procedure.
Millidge et al. (2020) suggested that the fixed-prediction modifications (see equations 3.19 and 3.21) were equivalent to the infinite variance limit of case 2 from Whittington and Bogacz (2017). This is due to the fact that in the limit of infinite output variance, the predictions of the latent states change very little from their feedforward initializations by the end of inference. We argue, however, that this equivalence does not hold due to the fact that in the infinite or high-variance limit of case 2, the latent states xℓ also remain fixed, or arbitrarily close to, their feedforward values, and not just the predictions fℓ(xℓ). This behavior is necessary for the inference to accurately correspond to inference over the latent intermediate states xℓ, which, in the limit of the zero-variance (deterministic) intermediate latent states assumed by the equivalent backpropagation model, become equal to the feedforward values μℓ.
This is not the case for the FPA modification of PC, where the latent states are unconstrained and the prediction error is necessarily not zero except in the trivial case of a zero magnitude weight update.
This suggests that FPA-PC inference corresponds neither to neither the probabilistic graphical model assumed by standard VPC or that of the scaled variance (i.e., deterministic latent states) modification from case 2. This leaves us the question of how one can interpret FPA-PC inference if not in terms of inference over a probabilistic graphical model as with standard VPC. One particularly simple and elegant answer is that one can interpret the FPA-PC modified inference equations as directly implementing the recursive backpropagation equations via the steady state of a set of ordinarily differential equations. FPA-PC inference can then be interpreted as enacting an Euler integration scheme until one reaches this steady state. We derive this perspective in the following section.
3.4.1 Deriving FPA-PC as a Steady-State Implementation of Backpropagation
We can make the relationship between FPA-PC and backpropagation even clearer by showing that the FPA-PC equations can be interpreted as a direct implementation of backpropagation via differential equations. To illustrate this, we work our way backward from backpropagation, showing how a direct implementation of its recursion relationship results in the FPA-PC inference equations.
The practical consequences of this modified version of PC were first questioned in Rosenbaum (2022), who showed that in the case of step size equal to 1, FPA-PC is not just functionally but algorithmically equivalent to backpropagation, suggesting at least in the specific case of inference step size equal to 1, there may be no benefit of FPA-PC over backpropagation. We extend this work and investigate the properties of Millidge et al. (2020), as well as other recent modifications (Salvatori et al., 2021; Song et al., 2020) further, showing that these modifications have provable worse computational and time complexity for any step size.
3.4.2 Computational and Time Complexity
To investigate the computational and time complexity of the FPA-PC and other modified PC algorithms, we begin by proving that under the aforementioned dynamics, the number of inference steps taken for an error to propagate from an output node in a computational chain to an intermediate node is lower-bounded by its distance to the output node. See theorem 1 and corollary 1.
For a particular node (xℓ) in a computational chain that has nonzero gradient with regard to an output loss , initialized to feedforward values of the network, evolving under standard PC or FPA-PC dynamics, the error for that node (eℓ) will first become nonzero at time t = L − ℓ, where L is the length of the chain and ℓ is the index of that node within the chain.
Let the variational modes x be initialized to feedforward values of the network (i.e., requirement 1). That is, let xℓ = μℓ for ℓ ∈ [1, . . . , L − 1].
Because we are ignoring the trivial case of a zero gradient with respect to the output loss, we are guaranteed that the Jacobians are all nonzero as .
At time t = 0, the dynamics u(xℓ, xℓ + 1) of all nodes are 0, except for xL − 1, by the definition of the update equations and requirement 1.
At time t = 1, the nodes xL − 1 therefore update proportionally to u(xL)
At any particular time step t = t + 1, if the dynamics u(xℓ, xℓ + 1) associated with a node xℓ were zero in time step t, then they will become nonzero at t + 1 if and only if a change has occurred in xℓ + 1 from time step t to t + 1. (By the definition of the FPA-PC equations 3.22 and 3.23, or standard PC, equations 3.9 and 3.10, update equations, and statement 1.)
Thus, via induction, the first nonzero change to occur for an arbitrary variational mode xℓ at point ℓ in the computational chain will occur at a time t = L − ℓ, its distance from the output with regard to the number of intermediary nodes.
The convergence time for errors associated with an arbitrary node in a computational chain is lower-bounded by its distance, in nodes, to the output node.
The fixed convergence points for the dynamical equations of FPA-PC are guaranteed to converge to an error equal to the gradient with regard to the output loss.
For a node, with nonzero gradient with regard to an output loss, to converge, the converged error for that node must therefore be nonzero.
Under theorem 1, the time taken for the error to first become nonzero is equal to its distance to the output node xL, which is t = L − ℓ for a computational chain of length L.
This proof describes how, despite the ostensibly parallel computation occurring in FPA-PC, convergence nonetheless requires sufficient time steps for error information to propagate from the output node backward. This is necessary as information is still nonetheless only transmitted via adjacent nodes, as with backpropagation. Thus, inference must proceed for a minimum number of inference steps equal to this distance before convergence can occur.
We can now ask what the time complexity of a single inference step is under FPA-PC dynamics, and thus the time complexity of inference overall for FPA-PC, as well as how this compares to backpropagation. Note that for this comparison, we exclude the final VJP calculation for gradients associated with a particular θ, for both FPA-PC and backpropagation, as this has the same computational cost for both algorithms and only occurs once, at the end of inference and error propagation, respectively.
We have assumed that error computation (eℓ = xℓ − μℓ) contributes negligible latency to each inference step.
Convergence will generally not occur within exactly tc = L steps except in particular circumstances (e.g., inference step size equal to 1), and so in practice, FPA-PC may have a significantly higher time complexity (scaling with inference length) than backpropagation.
Bottlenecks (slow VJP computations) will affect FPA inference significantly more as each inference step will be as slow as the slowest VJP computation (incurring a tc · w · Cmax latency cost), while a single slow VJP computation would only incur a single Cmax latency cost for backpropagation.
Note that this is a separate and distinct proof to the result by Rosenbaum (2022), which showed that for a specific step size of 1, FPA-PC computes gradients in t = L − ℓ steps (i.e., in an equivalent number of steps to backpropagation). It was not clear, however, whether FPA-PC could converge in fewer time steps for arbitrary step sizes and whether FPA-PC inference had worse time complexity (or slower runtime), which was dependent on the time complexity of each inference step. Here we show that the FPA-PC convergence can provably never be faster than standard backpropagation for any step size.
Similar to Rosenbaum (2022), we test these results by enacting FPA-PC inference on multilayer perceptron networks of various sizes and compute the cosine similarity of gradient updates relative to those obtained via backpropagation. We focus on observing what happens as we vary the number of inference steps (proportional to the computational cost) relative to the size of the network. We find, as expected, that cosine similarity drops rapidly as the number of inference steps falls below the number of layers in the network, demonstrating a failure to converge (see Figure 3). Inference step sizes lower than 1 show a failure to converge even for a number of inference steps higher than the number of layers in the network. We also report validation and test set accuracies, which experience significant drops as this occurs also (see Figures 4 and 5).
The results in Figures 3, 4, and 5 correspond to networks consisting of L hidden layers with 64 dimensionality, each followed by a tanh activation function and a 10-dimensional output layer followed by a log softmax. All networks were trained using SGD, with momentum equal to 0.9, initialized with default PyTorch linear initialization ( where ), batch size 64, and standard categorical negative log-likelihood loss.
3.5 Zero Divergence Inference Learning
Subsequent to the FPA-modified formulation of PC, further modified formulations of PC were presented in Salvatori et al. (2021) and Song et al. (2020). The resultant algorithm, termed zero divergence inference learning (Z-IL), achieved a similar effect to FPA-PC without explicitly fixing the feedforward values and Jacobians and did so by requiring a very specific set of changes. Specifically, the following changes to the standard PC algorithm were required:
Requirement 1: (As before): The variational modes or latent states (x1, . . . , xL − 1) are initialized to the feedforward values of the corresponding backpropagation-based computational chain (μ1, . . . , μL − 1).
Requirement 2: The inference step size γ is set to 1.
Requirement 3: A particular layer is updated specifically and exclusively at a particular inference time-step (t = L − ℓ) corresponding to its distance from the output node in the computational chain.
Ablation of any one of these three requirements was shown to result in losing equivalence between Z-IL and backpropagation.1
This algorithm relies on the fact that at a particular time-step, the node corresponding to a particular layer xℓ is experiencing an update that is equivalent to that obtained under FPA-PC dynamics. This occurs because the current value of any particular node (xℓ) does not deviate from the feedforward values they were initialized with until after a particular (t = L − ℓ) inference step (see lemma A.3 from supplementary material in Song et al., 2020). For the node corresponding to the weight being updated, this has the effect of mimicking the fixing of feedforward values seen under the FPA modified form (see equations 3.19 to 3.21), since both its predictions and the predictions of its parents (ℓ − 1) remain fixed to their feed-forward values.
To see this explicitly, we describe the relevant computations occurring at each inference step for the example function F, in Table 1. We restrict ourselves at each inference step to only describing the changes that occur to xℓ, eℓ, and θℓ for every time step ℓ. Therefore, we will not record in Table 1 any computations occurring at nodes > ℓ, as they will have no future influence on any parameter learning, as well as nodes < ℓ, as no change is occurring for that inference step. For notational simplicity, we drop time indices; rather, values on the right-hand side of the equations in Table 1 refer to those from the previous time-step.
|Step (T) .||Z-IL .||Backpropagation .|
|T = 0|
|T = 1|
|eL − 1 = xL − 1 − μL − 1|
|T = L − 1|
|e1 = x1 − μ1|
|Step (T) .||Z-IL .||Backpropagation .|
|T = 0|
|T = 1|
|eL − 1 = xL − 1 − μL − 1|
|T = L − 1|
|e1 = x1 − μ1|
We can see here that in each time step, for the computations relevant to parameter gradient updates, Z-IL engages in precisely the same computations as backpropagation (a VJP evaluated at the feedforward values and the subsequent error). These are then added to a constant (the feedforward activations μℓ), before this constant is then subtracted. In addition to these computations, we note that Z-IL (and its variant, FA-Z-IL, by extension) engages in wasted computation for nodes xi, i > ℓ, which we do not depict in the walk-through of the computations above. These computations occur despite not resulting in, or contributing to, any parameter updates and are thus unnecessary.
3.6 Generalized-IL, Learning Rate Stability, and Online Learning
For completeness, we note that some recent work has suggested that predictive coding (PC) and related energy-based models, which do not yield backpropagation-equivalent updates, may nonetheless exhibit greater robustness to high learning rates and show less degraded performance in the online learning (batch size 1) settings (Alonso et al., 2022; Song et al., 2022). More specifically, Alonso et al. (2022) demonstrate that a variant of PC with additional regularization terms, called generalized-IL (G-IL), may approximate implicit SGD, with the approximation becoming exact under certain limits. The authors of this work further present a modified algorithm (IL-prox) that guarantees equal updates to implicit SGD. The resultant schemes demonstrated greater robustness to learning rate and less degraded performance in online learning regimes. These properties are also demonstrated empirically, for energy-based models (EBMs) in general, as a consequence of a principle dubbed “prospective configuration” in Song et al. (2022).
These findings are intriguing, as robustness to learning rates aligns with expectations for a backward Euler integration scheme—the implicit counterpart to the forward Euler integration scheme from which SGD originates—an optimization approach that is demonstrably less prone to divergence in nonstochastic contexts.
It is unclear, however, whether the differences in learning rate stability would remain when testing against optimizers such as Adam or SGD with momentum, both of which are well understood in theory, and practice, to ameliorate the risk of divergence under high learning rates, by improving the conditioning of the loss landscape or dampening oscillations. Moreover, there is evidence to suggest that small-batch SGD itself has a greater robustness to learning rate (Masters & Luschi, 2018; Shallue et al., 2019), possibly due to the additional stochasticity in the small-batch setting preventing the accumulation of Euler integration errors. As such, the authors posit that it would be a compelling avenue for future research to explore whether the performance gap in online learning settings persists when online SGD is combined with high learning rates.
While these findings do not affect our presented results or the use of PC as a direct substitute for backpropagation, they raise thought-provoking questions on whether modified and unmodified PC forms may nonetheless possess attributes desirable for neuromorphic learning, even if they do not necessarily exhibit parameter updates identical to those under backpropagation.
4 Conclusion and Key Results
We now summarize the key results and points made in this article:
Result 1. The infinite-variance limit modification from Whittington and Bogacz (2017; see case 2) is equivalent to assuming a model with strictly deterministic (nonprobabilistic) latent states.
Result 2. Modified variants of PC—FPA-PC (Millidge et al., 2020), and Z-IL (Salvatori et al., 2021; Song et al., 2020)—do not follow a free energy gradient. Corollary: learning does not correspond to learning a latent probabilistic model via variational Bayes.
Result 3. Modified variants of PC (FPA-PC, and Z-IL) are guaranteed to have worse time complexity than backpropagation, even when accounting for fully parallel computation within each inference step. (See theorem 1, corresponding corollary 1, and equations 3.27 to 3.36.)
Point 1. Equivalence, or approximate equivalence, of PC with backpropagation mandates adopting an inverted scheme, unlike traditional formulations of generative PC, which has nontrivial implications for neurobiological plausibility.
Point 2. Implementations of standard PC do not suffer from the weight nonlocality issue as present in naive neurobiological proposals of backpropagation. They may, however, suffer from a, potentially less implausible, weight symmetry issue.
These results raise doubt with respect to the advantages that recent variants of PC may provide with regard to backpropagation and its neuromorphic implementation, given that PC variants that result in equivalent or close to equivalent gradient updates also engage in precisely the same computations as backpropagation, and do so in much the same, equally local/nonlocal, way. This is despite the ostensibly parallelized computation of PC networks, which from the perspective of backpropagating errors merely results in additional, or unused, computation, and not faster error propagation.
Furthermore, by introducing modifications to obtain, or approximate, equivalence, PC variants lose the generative variational Bayes interpretation of standard formulations of PC and its various strengths, such as maintaining uncertainty over latent hidden states or causes, or parity with our current understanding of the organizational and functional properties of pathways within the cortex.
We thank Christopher L. Buckley and his group for the valuable comments and feedback on this article.
Note that for the purposes of this proof, we will ignore FA-Z-IL (fully autonomous Z-IL), which builds on Z-IL, as its primary purpose was to remove the neurobiologically implausible requirement that the weight update is triggered manually at a particular inference step and thus does not affect the results in this section.