## Abstract

This letter considers a class of biologically plausible cost functions for neural networks, where the same cost function is minimized by both neural activity and plasticity. We show that such cost functions can be cast as a variational bound on model evidence under an implicit generative model. Using generative models based on partially observed Markov decision processes (POMDP), we show that neural activity and plasticity perform Bayesian inference and learning, respectively, by maximizing model evidence. Using mathematical and numerical analyses, we establish the formal equivalence between neural network cost functions and variational free energy under some prior beliefs about latent states that generate inputs. These prior beliefs are determined by particular constants (e.g., thresholds) that define the cost function. This means that the Bayes optimal encoding of latent or hidden states is achieved when the network's implicit priors match the process that generates its inputs. This equivalence is potentially important because it suggests that any hyperparameter of a neural network can itself be optimized—by minimization with respect to variational free energy. Furthermore, it enables one to characterize a neural network formally, in terms of its prior beliefs.

## 1 Introduction

Cost functions are ubiquitous in scientific fields that entail optimization—including physics, chemistry, biology, engineering, and machine learning. Furthermore, any optimization problem that can be specified using a cost function can be formulated as a gradient descent. In the neurosciences, this enables one to treat neuronal dynamics and plasticity as an optimization process (Marr, 1969; Albus, 1971; Schultz, Dayan, & Montague, 1997; Sutton & Barto, 1998; Linsker, 1988; Brown, Yamada, & Sejnowski, 2001). These examples highlight the importance of specifying a problem in terms of cost functions, from which neural and synaptic dynamics can be derived. In other words, cost functions provide a formal (i.e., normative) expression of the purpose of a neural network and prescribe the dynamics of that neural network. Crucially, once the cost function has been established and an initial condition has been selected, it is no longer necessary to solve the dynamics. Instead, one can characterize the neural network's behavior in terms of fixed points, basin of attraction and structural stability—based only on the cost function. In short, it is important to identify the cost function to understand the dynamics, plasticity, and function of a neural network.

A ubiquitous cost function in neurobiology, theoretical biology, and machine learning is model evidence or equivalently, marginal likelihood or surprise—namely, the probability of some inputs or data under a model of how those inputs were generated by unknown or hidden causes (Bishop, 2006; Dayan & Abbott, 2001). Generally, the evaluation of surprise is intractable (especially for neural networks) as it entails a logarithm of an intractable marginal (i.e., integral). However, this evaluation can be converted into an optimization problem by inducing a variational bound on surprise. In machine learning, this is known as an evidence lower bound (ELBO; Blei, Kucukelbir, & McAuliffe, 2017), while the same quantity is known as variational free energy in statistical physics and theoretical neurobiology.
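The bound can be checked numerically on a toy model. The following is a minimal sketch, not taken from the letter: it assumes a hypothetical model with one binary hidden state and one binary observation, takes surprise to be the negative log evidence $-\ln P(o)$, and writes variational free energy as $F = E_Q[\ln Q(s) - \ln P(o, s)]$. All numerical values are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy model: one binary hidden state s and one binary observation o.
prior = np.array([0.7, 0.3])            # P(s) for s = 0, 1
likelihood = np.array([[0.9, 0.2],      # P(o=0 | s=0), P(o=0 | s=1)
                       [0.1, 0.8]])     # P(o=1 | s=0), P(o=1 | s=1)

def surprise(o):
    """Negative log model evidence, -ln P(o)."""
    return float(-np.log(likelihood[o] @ prior))

def free_energy(q, o):
    """Variational free energy F = E_q[ln q(s) - ln P(o, s)]."""
    joint = likelihood[o] * prior
    return float(np.sum(q * (np.log(q) - np.log(joint))))

o = 1
joint = likelihood[o] * prior
exact_posterior = joint / joint.sum()   # Bayes' rule
arbitrary_q = np.array([0.5, 0.5])      # some other approximate posterior

# The gap F - surprise is the KL divergence from q to the exact posterior.
gap_exact = free_energy(exact_posterior, o) - surprise(o)
gap_arbitrary = free_energy(arbitrary_q, o) - surprise(o)
```

In this sketch the gap vanishes for the exact posterior and is positive for any other choice, which is why minimizing free energy with respect to the approximate posterior serves as a tractable proxy for evaluating surprise.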

Variational free energy minimization is a candidate principle that governs neuronal activity and synaptic plasticity (Friston, Kilner, & Harrison, 2006; Friston, 2010). Here, surprise reflects the improbability of sensory inputs given a model of how those inputs were caused. In turn, minimizing variational free energy, as a proxy for surprise, corresponds to inferring the (unobservable) causes of (observable) consequences. To the extent that biological systems minimize variational free energy, it is possible to say that they infer and learn the hidden states and parameters that generate their sensory inputs (von Helmholtz, 1925; Knill & Pouget, 2004; DiCarlo, Zoccolan, & Rust, 2012) and consequently predict those inputs (Rao & Ballard, 1999; Friston, 2005). This is generally referred to as perceptual inference based on an internal generative model about the external world (Dayan, Hinton, Neal, & Zemel, 1995; George & Hawkins, 2009; Bastos et al., 2012).

Variational free energy minimization provides a unified mathematical formulation of these inference and learning processes in terms of self-organizing neural networks that function as Bayes optimal encoders. Moreover, organisms can use the same cost function to control their surrounding environment by sampling predicted (i.e., preferred) inputs. This is known as active inference (Friston, Mattout, & Kilner, 2011). The ensuing free-energy principle suggests that active inference and learning are mediated by changes in neural activity, synaptic strengths, and the behavior of an organism to minimize variational free energy as a proxy for surprise. Crucially, variational free energy and model evidence rest on a generative model of continuous or discrete hidden states. A number of recent studies have used Markov decision process (MDP) generative models to elaborate schemes that minimize variational free energy (Friston, FitzGerald, Rigoli, Schwartenbeck, & Pezzulo, 2016, 2017; Friston, Parr, & de Vries, 2017; Friston, Lin et al., 2017). This minimization reproduces various interesting dynamics and behaviors of real neuronal networks and biological organisms. However, it remains to be established whether variational free energy minimization is an apt explanation for any given neural network, as opposed to the optimization of alternative cost functions.

In principle, any neural network that produces an output or a decision can be cast as performing some form of inference in terms of Bayesian decision theory. On this reading, the complete class theorem suggests that any neural network can be regarded as performing Bayesian inference under some prior beliefs; therefore, it can be regarded as minimizing variational free energy. The complete class theorem (Wald, 1947; Brown, 1981) states that for any pair of decisions and cost functions, there are some prior beliefs (implicit in the generative model) that render the decisions Bayes optimal. This suggests that it should be theoretically possible to identify an implicit generative model within any neural network architecture, which renders its cost function a variational free energy or ELBO. However, although the complete class theorem guarantees the existence of a generative model, it does not specify its form. In what follows, we show that a ubiquitous class of neural networks implements approximate Bayesian inference under a generic discrete state space model with a known form.

In brief, we adopt a reverse-engineering approach to identify a plausible cost function for neural networks and show that the resulting cost function is formally equivalent to variational free energy. Here, we define a cost function as a function of sensory input, neural activity, and synaptic strengths and suppose that neural activity and synaptic plasticity follow a gradient descent on the cost function (assumption 1). For simplicity, we consider single-layer feedforward neural networks comprising firing-rate neuron models—receiving sensory inputs weighted by synaptic strengths—whose firing intensity is determined by the sigmoid activation function (assumption 2). We focus on blind source separation (BSS), namely the problem of separating sensory inputs into multiple hidden sources or causes (Belouchrani, Abed-Meraim, Cardoso, & Moulines, 1997; Cichocki, Zdunek, Phan, & Amari, 2009; Comon & Jutten, 2010), which provides the minimum setup for modeling causal inference. A famous example of BSS is the cocktail party effect: the ability of a partygoer to disambiguate an individual's voice from the noise of a crowd (Brown et al., 2001; Mesgarani & Chang, 2012). Previously, we observed BSS performed by in vitro neural networks (Isomura, Kotani, & Jimbo, 2015) and reproduced this self-supervised process using an MDP and variational free energy minimization (Isomura & Friston, 2018). These works suggest that variational free energy minimization offers a plausible account of the empirical behavior of in vitro networks.

In this work, we ask whether variational free energy minimization can account for the normative behavior of a canonical neural network that minimizes its cost function, by considering all possible cost functions, within a generic class. Using mathematical analysis, we identify a class of cost functions—from which update rules for both neural activity and synaptic plasticity can be derived. The gradient descent on the ensuing cost function naturally leads to Hebbian plasticity (Hebb, 1949; Bliss & Lømo, 1973; Malenka & Bear, 2004) with an activity-dependent homeostatic term. We show that these cost functions are formally homologous to variational free energy under an MDP. Crucially, this means the hyperparameters (i.e., any variables or constants) of the neural network can be associated with prior beliefs of the generative model. In principle, this allows one to optimize the neural network hyperparameters (e.g., thresholds and learning rates), given some priors over the causes (i.e., latent states) of inputs to the neural network. Furthermore, estimating hyperparameters from the dynamics of (in silico or in vitro) neural networks allows one to quantify the network's implicit prior beliefs. In this letter, we focus on the mathematical foundations for applications to in vitro and in vivo neuronal networks in subsequent work.

## 2 Methods

We present the derivations carefully, with a focus on the form of the ensuing Bayesian belief updating. The functional form of this update will reemerge later, when reverse engineering the cost functions implicit in neural networks. These correspondences are depicted in Figure 1 and Table 1. This section starts with a description of Markov decision processes as a general kind of generative model and then considers the minimization of variational free energy under these models.

| Neural Network Formation | Correspondence | Variational Bayes Formation |
|---|---|---|
| Neural activity | $x_{tj} \Longleftrightarrow s_{t1}^{(j)}$ | State posterior |
| Sensory inputs | $o_t \Longleftrightarrow o_t$ | Observations |
| Synaptic strengths | $W_{j1} \Longleftrightarrow \mathrm{sig}^{-1}(A_{11}^{(\cdot,j)})$ | |
| | $\hat{W}_{j1} \equiv \mathrm{sig}(W_{j1}) \Longleftrightarrow A_{11}^{(\cdot,j)}$ | Parameter posterior |
| Perturbation term | $\varphi_{j1} \Longleftrightarrow \ln D_1^{(j)}$ | State prior |
| Threshold | $h_{j1} \Longleftrightarrow \ln(\vec{1} - A_{11}^{(\cdot,j)}) \cdot \vec{1} + \ln D_1^{(j)}$ | |
| Initial synaptic strengths | $\lambda_{j1} \odot \hat{W}_{j1}^{\mathrm{init}} \Longleftrightarrow a_{11}^{(\cdot,j)}$ | Parameter prior |

### 2.1 Generative Models

Under an MDP model (see Figure 1A), a minimal BSS setup (in a discrete space) reduces to the likelihood mapping from $N_s$ hidden sources or states $s_t \equiv (s_t^{(1)}, \ldots, s_t^{(N_s)})^T$ to $N_o$ observations $o_t \equiv (o_t^{(1)}, \ldots, o_t^{(N_o)})^T$. Each source and observation takes a value of one (ON state) or zero (OFF state) at each time step, that is, $s_t^{(j)}, o_t^{(i)} \in \{1, 0\}$. Throughout this letter, $j$ denotes the $j$th hidden state, while $i$ denotes the $i$th observation. The probability of $s_t^{(j)}$ follows a categorical distribution $P(s_t^{(j)}) = \mathrm{Cat}(D^{(j)})$, where $D^{(j)} \equiv (D_1^{(j)}, D_0^{(j)}) \in \mathbb{R}^2$ with $D_1^{(j)} + D_0^{(j)} = 1$ (see Figure 1A, top).

The probability of an outcome is determined by the likelihood mapping from all hidden states to each kind of observation in terms of a categorical distribution, $P(o_t^{(i)} | s_t, A^{(i)}) = \mathrm{Cat}(A^{(i)})$ (see Figure 1A, middle). Here, each element of the tensor $A^{(i)} \in \mathbb{R}^{2 \times 2^{N_s}}$ parameterizes the probability $P(o_t^{(i)} = k | s_t = \vec{l})$, where $k \in \{1, 0\}$ are possible observations and $\vec{l} \in \{1, 0\}^{N_s}$ encodes a particular combination of hidden states. The prior distribution of each column of $A^{(i)}$, denoted by $A_{\cdot \vec{l}}^{(i)}$, has a Dirichlet distribution $P(A_{\cdot \vec{l}}^{(i)}) = \mathrm{Dir}(a_{\cdot \vec{l}}^{(i)})$ with concentration parameter $a_{\cdot \vec{l}}^{(i)} \in \mathbb{R}^2$. We use Dirichlet distributions, as they are tractable and widely used for random variables that take a continuous value between zero and one. Furthermore, learning the likelihood mapping leads to biologically plausible update rules, which have the form of associative or Hebbian plasticity (see below and Friston et al., 2016, for details).
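To make the generative process concrete, the following sketch draws samples from a model of this form. It is an illustration under stated assumptions rather than the simulation code used in the letter: the sizes `N_s`, `N_o`, and `T`, the flat state prior, and the randomly drawn likelihood table `A1` are all hypothetical, and the likelihood is held fixed instead of being drawn from a Dirichlet prior.

```python
import numpy as np

rng = np.random.default_rng(0)

N_s, N_o, T = 2, 4, 1000           # hypothetical numbers of states, observations, time steps

# State prior: D1[j] = P(s^(j) = 1), so that D0[j] = 1 - D1[j].
D1 = np.full(N_s, 0.5)

# Likelihood table: A1[i, l] = P(o^(i) = 1 | s = l), where the column index l
# enumerates the 2**N_s combinations of hidden states.
A1 = rng.uniform(0.1, 0.9, size=(N_o, 2 ** N_s))

def sample(T):
    """Draw T time steps of hidden states and observations."""
    s = (rng.random((T, N_s)) < D1).astype(int)           # categorical (binary) states
    l = s @ (2 ** np.arange(N_s))                          # index of the state combination
    o = (rng.random((T, N_o)) < A1[:, l].T).astype(int)    # binary observations
    return s, o

s, o = sample(T)
```

Each row of `o` is one input vector $o_t$ that a network of the kind considered here would receive, while the corresponding row of `s` stays hidden from it.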

### 2.2 Minimization of Variational Free Energy

This concludes our treatment of inference about hidden states under this minimal scheme. Note that the updates in equation 2.5 have a biological plausibility in the sense that the posterior expectations can be associated with nonnegative sigmoid-shape firing rates (also known as neurometric functions; Tolhurst, Movshon, & Dean, 1983; Newsome, Britten, & Movshon, 1989), while the arguments of the sigmoid (softmax) function can be associated with neuronal depolarization, rendering the softmax function a voltage-firing rate activation function. (See Friston, FitzGerald et al., 2017, for a more comprehensive discussion and simulations using this kind of variational message passing to reproduce empirical phenomena, such as place fields, mismatch negativity responses, phase-precession, and preplay activity in systems neuroscience.)

### 2.3 Neural Activity and Hebbian Plasticity Models

In the MDP scheme, posterior expectations about hidden states and parameters are usually associated with neural activity and synaptic strengths. Here, we can observe a formal similarity between the solutions for the state posterior (see equation 2.6) and the activity in the neural network (see equation 2.10; see also Table 1). By this analogy, $x_{tj}$ can be regarded as encoding the posterior expectation of the ON state $s_{t1}^{(j)}$. Moreover, $W_{j1}$ and $W_{j0}$ correspond to $\ln A_{11}^{(\cdot,j)} - \ln(\vec{1} - A_{11}^{(\cdot,j)}) = \mathrm{sig}^{-1}(A_{11}^{(\cdot,j)})$ and $\ln A_{10}^{(\cdot,j)} - \ln(\vec{1} - A_{10}^{(\cdot,j)}) = \mathrm{sig}^{-1}(A_{10}^{(\cdot,j)})$, respectively, in the sense that they express the amplitude of $o_t$ influencing $x_{tj}$ or $s_{t1}^{(j)}$. Here, $\vec{1} = (1, \ldots, 1) \in \mathbb{R}^{N_o}$ is a vector of ones. In particular, the optimal posterior of a hidden state taking a value of one (see equation 2.6) is given by the ratio of the beliefs about ON and OFF states, expressed as a sigmoid function. Thus, to be a Bayes optimal encoder, the fixed point of neural activity needs to be a sigmoid function. This requirement is straightforwardly ensured when $f'(x_{tj})$ is the inverse of the sigmoid function (see equation 2.13). Under this condition, the fixed point or solution for $x_{tj}$ (see equation 2.10) compares inputs from ON and OFF pathways, and thus $x_{tj}$ straightforwardly encodes the posterior of the $j$th hidden state being ON (i.e., $x_{tj} \to s_{t1}^{(j)}$). In short, the above neural network is effectively inferring the hidden state.
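The correspondence between a sigmoid fixed point and a Bayesian posterior can be verified directly in a reduced case. The sketch below assumes a single hidden state with conditionally independent binary observations, so that Bayes' rule is exact; the parameter values and the function names `network_activity` and `bayes_posterior` are hypothetical and chosen for illustration only. Weights and thresholds are set according to the correspondence listed in Table 1.

```python
import numpy as np

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p) - np.log(1.0 - p)

# Hypothetical parameters for one hidden state and N_o = 3 observations.
A11 = np.array([0.9, 0.7, 0.2])   # P(o_i = 1 | s = 1)
A10 = np.array([0.1, 0.4, 0.6])   # P(o_i = 1 | s = 0)
D1, D0 = 0.3, 0.7                 # state prior

# Network quantities under the correspondence in Table 1.
W1, W0 = logit(A11), logit(A10)                  # synaptic strengths
h1 = np.sum(np.log(1.0 - A11)) + np.log(D1)      # ON-pathway threshold
h0 = np.sum(np.log(1.0 - A10)) + np.log(D0)      # OFF-pathway threshold

def network_activity(o):
    """Sigmoid of the difference between ON and OFF pathway inputs."""
    return sig((W1 - W0) @ o + h1 - h0)

def bayes_posterior(o):
    """P(s = 1 | o) computed directly from Bayes' rule."""
    joint1 = np.prod(A11 ** o * (1.0 - A11) ** (1 - o)) * D1
    joint0 = np.prod(A10 ** o * (1.0 - A10) ** (1 - o)) * D0
    return joint1 / (joint1 + joint0)

o = np.array([1, 0, 1])
x = network_activity(o)
p = bayes_posterior(o)
```

Under these assumptions `x` and `p` agree for every binary input pattern, because the log joint probability of each state is linear in $o_t$ with slope $\mathrm{sig}^{-1}(A)$ and an offset that plays the role of the threshold.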

If the activity of the neural network is performing inference, does the Hebbian plasticity correspond to Bayes optimal learning? In other words, does the synaptic update rule in equation 2.11 ensure that the neural activity and synaptic strengths asymptotically encode Bayes optimal posterior beliefs about hidden states $x_{tj} \to s_{t1}^{(j)}$ and parameters $W_{j1} \to \mathrm{sig}^{-1}(A_{11}^{(\cdot,j)})$, respectively? To this end, we will identify a class of cost functions from which the neural activity and synaptic plasticity can be derived and consider the conditions under which the cost function becomes consistent with variational free energy.

### 2.4 Neural Network Cost Functions

### 2.5 Comparison with Variational Free Energy

Specifically, when the thresholds satisfy $h_{j1} - \ln(\vec{1} - \hat{W}_{j1}) \cdot \vec{1} = \ln D_1^{(j)}$ and $h_{j0} - \ln(\vec{1} - \hat{W}_{j0}) \cdot \vec{1} = \ln D_0^{(j)}$, equation 2.16 becomes equivalent to equation 2.4 up to the $\ln t$ order term (that disappears when $t$ is large). Therefore, in this case, the fixed points of neural activity and synaptic strengths become the posteriors; thus, $x_{tj}$ asymptotically becomes the Bayes optimal encoder for a large $t$ limit (provided with $D$ that matches the genuine prior $D^*$).
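The threshold condition can be read in both directions: a prior fixes the threshold, and a threshold implies a prior. The following sketch illustrates this for one neuron with hypothetical weights; the helper names `threshold_from_prior` and `implicit_prior` are assumptions introduced here, not functions from the letter.

```python
import numpy as np

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))

def threshold_from_prior(W, D):
    """Threshold h satisfying h - ln(1 - sig(W)) . 1 = ln D."""
    return np.sum(np.log(1.0 - sig(W))) + np.log(D)

def implicit_prior(W, h):
    """State prior D implied by a given threshold h and weights W."""
    return np.exp(h - np.sum(np.log(1.0 - sig(W))))

# Hypothetical weights of the ON and OFF pathways for one neuron.
W1 = np.array([2.0, -1.0, 0.5])
W0 = np.array([-1.5, 0.3, 0.8])

# Thresholds matched to a flat prior recover D = (0.5, 0.5).
h1 = threshold_from_prior(W1, 0.5)
h0 = threshold_from_prior(W0, 0.5)
matched = (implicit_prior(W1, h1), implicit_prior(W0, h0))

# A shifted threshold implies a different (here unnormalized) prior.
mismatched = implicit_prior(W1, h1 + 1.0)
```

The implied values need not sum to one for an arbitrary pair of thresholds. In this sketch only the difference between the ON and OFF log priors enters the sigmoid, so renormalizing them would leave the encoded state posterior unchanged.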

This means that when the prior belief about states $D^{(j)}$ is a function of the parameter posteriors ($A^{(\cdot,j)}$), the general cost function under consideration can be expressed in the form of variational free energy, up to the $\mathcal{O}(\ln t)$ term. A generic cost function $L$ is suboptimal from the perspective of Bayesian inference unless $\varphi_{j1}$ and $\varphi_{j0}$ are tuned appropriately to express the unbiased (i.e., optimal) prior belief. In this BSS setup, $\varphi_{j1} = \varphi_{j0} = \mathrm{const}$ is optimal; thus, a generic $L$ would asymptotically give an upper bound of variational free energy with the optimal prior belief about states when $t$ is large.

### 2.6 Analysis on Synaptic Update Rules

In summary, we demonstrated that under a few minimal assumptions and ignoring small contributions to weight updates, the neural network under consideration can be regarded as minimizing an approximation to model evidence because the cost function can be formulated in terms of variational free energy. In what follows, we will rehearse our analytic results and then use numerical analyses to illustrate Bayes optimal inference (and learning) in a neural network when, and only when, it has the right priors.

## 3 Results

### 3.1 Analytical Form of Neural Network Cost Functions

The analysis in the preceding section rests on the following assumptions:

1. Updates of neural activity and synaptic weights are determined by a gradient descent on a cost function $L$.

2. Neural activity is updated by the weighted sum of sensory inputs and its fixed point is expressed as the sigmoid function.

For analytical tractability, we further assume the following:

3. The perturbation terms ($\varphi_{j1}$ and $\varphi_{j0}$) that constitute the difference between the cost function and variational free energy with optimal prior beliefs can be expressed as linear equations of $W_{j1}$ and $W_{j0}$.

The cost function of the neural networks considered is characterized only by $\varphi_j$. Thus, after fixing $\varphi_j$ by fixing constraints $(\alpha_{j1}, \alpha_{j0})$ and $(\beta_{j1}, \beta_{j0})$, the remaining degrees of freedom are the initial synaptic weights. These correspond to the prior distribution of parameters $P(A)$ in the variational Bayesian formulation (see section A2).

The fixed point of synaptic strengths that gives the minimum of $L$ is given analytically by equation 2.20, which shows that $(\beta_{j1}, \beta_{j0})$ shifts the center of the nonlinear mapping (from Hebbian products to synaptic strengths) away from the optimal position (shown in equation 2.8). As shown in equation 2.14, the derivative of $L$ with respect to $W_{j1}$ and $W_{j0}$ recovers the synaptic update rules that comprise Hebbian and activity-dependent homeostatic terms. Although equation 2.14 expresses the dynamics of synaptic strengths that converge to the fixed point, it is consistent with a plasticity rule that gives the synaptic change from $t$ to $t+1$ (see equation 2.21).

Hence, based on assumptions 1 and 2 (irrespective of assumption 3), we find that the cost function approximates variational free energy. Table 1 summarizes this correspondence. Under this condition, neural activity encodes the posterior expectation about hidden states, $x_{\tau j} = s_{\tau 1}^{(j)} = Q(s_\tau^{(j)} = 1)$, and synaptic strengths encode the posterior expectation of the parameters, $\hat{W}_{j1} = \mathrm{sig}(W_{j1}) = A_{11}^{(\cdot,j)}$ and $\hat{W}_{j0} = \mathrm{sig}(W_{j0}) = A_{10}^{(\cdot,j)}$. In addition, based on assumption 3, the threshold is characterized by constants $(\alpha_{j1}, \alpha_{j0}, \beta_{j1}, \beta_{j0})$. From a Bayesian perspective, these constants can be viewed as prior beliefs, $\ln P(s_t^{(j)}) = \ln D^{(j)} = (\alpha_{j1} + W_{j1}\beta_{j1}, \alpha_{j0} + W_{j0}\beta_{j0})$. When and only when $(\alpha_{j1}, \alpha_{j0}) = (-\ln 2, -\ln 2)$ and $(\beta_{j1}, \beta_{j0}) = (\vec{0}, \vec{0})$, the cost function becomes variational free energy with optimal prior beliefs (for BSS) whose global minimum ensures Bayes optimal encoding.
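The role of the constants can be illustrated with a short calculation. The sketch below evaluates $\ln D^{(j)} = (\alpha_{j1} + W_{j1}\beta_{j1}, \alpha_{j0} + W_{j0}\beta_{j0})$ for hypothetical weights, first with the constants stated to be optimal for BSS and then with an arbitrary nonzero $\beta$; the weight values, `beta_biased`, and the helper `log_prior` are illustrative assumptions.

```python
import numpy as np

def log_prior(alpha, beta, W):
    """ln D = alpha + W . beta for one pathway (ON or OFF)."""
    return alpha + W @ beta

# Hypothetical synaptic weights of the ON and OFF pathways.
W1 = np.array([2.0, -1.0, 0.5])
W0 = np.array([-1.5, 0.3, 0.8])

# Constants that the text identifies as optimal for BSS.
alpha = -np.log(2.0)
beta = np.zeros(3)
D_opt = np.exp([log_prior(alpha, beta, W1), log_prior(alpha, beta, W0)])

# A nonzero beta ties the implicit state prior to the synaptic weights.
beta_biased = np.array([0.1, 0.0, 0.0])
D_biased = np.exp([log_prior(alpha, beta_biased, W1),
                   log_prior(alpha, beta_biased, W0)])
```

With $\alpha = -\ln 2$ and $\beta = \vec{0}$ the implicit prior is flat, $D = (0.5, 0.5)$, regardless of the weights. With a nonzero $\beta$ the prior drifts as the weights are learned, which is the dependence between state prior and parameter posterior discussed in the text.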

In short, we identify a class of biologically plausible cost functions from which the update rules for both neural activity and synaptic plasticity can be derived. When the activation function for neural activity is a sigmoid function, a cost function in this class is expressed straightforwardly as variational free energy. With respect to the choice of constants expressing physiological constraints in the neural network, the cost function has degrees of freedom that may be viewed as (potentially suboptimal) prior beliefs from the Bayesian perspective. Now, we illustrate the implicit inference and learning in neural networks through simulations of BSS.

### 3.2 Numerical Simulations

Our numerical analysis, under assumptions 1 to 3, shows that a network needs to employ a cost function that entails optimal prior beliefs to perform BSS or, equivalently, causal inference. Such a cost function is obtained when its constants, which do not appear in the variational free energy with the optimal generative model for BSS, become negligible. The important message here is that in this setup, a cost function equivalent to variational free energy is necessary for Bayes optimal inference (Friston et al., 2006; Friston, 2010).

### 3.3 Phenotyping Networks

We have shown that variational free energy (under the MDP scheme) is formally homologous to the class of biologically plausible cost functions found in neural networks. The neural network's parameters $\varphi_j = \ln D^{(j)}$ determine how the synaptic strengths change depending on the history of sensory inputs and neural outputs; thus, the choice of $\varphi_j$ provides degrees of freedom in the shape of the neural network cost functions under consideration that determine the purpose or function of the neural network. Among various $\varphi_j$, only $\varphi_j = (-\ln 2, -\ln 2)$ can make the cost function variational free energy with optimal prior beliefs for BSS. Hence, one could regard neural networks (of the sort considered in this letter: single-layer feedforward networks that minimize their cost function) as performing approximate Bayesian inference under priors that may or may not be optimal. This result is as predicted by the complete class theorem (Brown, 1981; Wald, 1947) as it implies that any response of a neural network is Bayes optimal under some prior beliefs (and cost function). Therefore, in principle, under the theorem, any neural network of this kind is optimal when its prior beliefs are consistent with the process that generates outcomes. This perspective indicates the possibility of characterizing a neural network model (and indeed a real neuronal network) in terms of its implicit prior beliefs.

One can pursue this analysis further and model the responses or decisions of a neural network using the Bayes optimal MDP scheme under different priors. Thus, the priors in the MDP scheme can be adjusted to maximize the likelihood of empirical responses. This sort of approach has been used in systems neuroscience to characterize choice behavior in terms of subject-specific priors. (See Schwartenbeck & Friston, 2016, for further details.)
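As a rough illustration of adjusting priors to maximize the likelihood of responses, the sketch below simulates binary responses from a network with a known implicit prior and then recovers that prior by a grid search. This is a simplified stand-in, not the procedure of Schwartenbeck and Friston (2016): it assumes a single hidden state, known likelihood parameters, responses drawn as Bernoulli variables from the state posterior, and hypothetical values throughout.

```python
import numpy as np

rng = np.random.default_rng(2)

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p) - np.log(1.0 - p)

# Hypothetical likelihood parameters, treated as known.
A11 = np.array([0.9, 0.7, 0.2])   # P(o_i = 1 | s = 1)
A10 = np.array([0.1, 0.4, 0.6])   # P(o_i = 1 | s = 0)

def posterior_on(o, D1):
    """Posterior probability of the ON state for each row of o under prior D1."""
    drive = o @ (logit(A11) - logit(A10))
    offset = np.sum(np.log(1.0 - A11) - np.log(1.0 - A10))
    return sig(drive + offset + logit(D1))

# Simulated binary responses from a network whose implicit prior is D1 = 0.2.
true_D1 = 0.2
o = rng.integers(0, 2, size=(5000, 3))
responses = rng.random(5000) < posterior_on(o, true_D1)

def log_likelihood(D1):
    """Bernoulli log likelihood of the responses under a candidate prior."""
    p = posterior_on(o, D1)
    return np.sum(np.where(responses, np.log(p), np.log(1.0 - p)))

# Grid search for the prior that maximizes the likelihood of the responses.
grid = np.linspace(0.01, 0.99, 99)
D1_hat = grid[np.argmax([log_likelihood(d) for d in grid])]
```

In practice one would fit several priors jointly and use a proper optimizer, but the grid search keeps the logic visible: the estimate `D1_hat` characterizes the simulated network by the prior under which its responses are most probable.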

### 3.4 Reverse-Engineering Implicit Prior Beliefs

Another situation important from a neuroscience perspective is when belief updating in a neural network is slow in relation to experimental observations. In this case, the implicit prior beliefs can be viewed as being fixed over a short period of time. This is likely when the firing threshold is determined by homeostatic plasticity over longer timescales (Turrigiano & Nelson, 2004).

The considerations in the previous section speak to the possibility of using empirically observed neuronal responses to infer implicit prior beliefs. The synaptic weights $(W_{j1}, W_{j0})$ can be estimated statistically from response data, through equation 2.20. By plotting their trajectory over the training period as a function of the history of a Hebbian product, one can estimate the cost function constants. If these constants express a near-optimal $\varphi_j$, it can be concluded that the network has, effectively, the right sort of priors for BSS. As we have shown analytically and numerically, a cost function with $(\alpha_{j1}, \alpha_{j0})$ far from $(-\ln 2, -\ln 2)$ or a large deviation of $(\beta_{j1}, \beta_{j0})$ fails as a Bayes optimal encoder for BSS. Since actual neuronal networks can perform BSS (Isomura et al., 2015; Isomura & Friston, 2018), one would envisage that the implicit cost function will exhibit a near-optimal $\varphi_j$.

## 4 Discussion

In this work, we investigated a class of biologically plausible cost functions for neural networks. A single-layer feedforward neural network with a sigmoid activation function that receives sensory inputs generated by hidden states (i.e., BSS setup) was considered. We identified a class of cost functions by assuming that neural activity and synaptic plasticity minimize a common function $L$. The derivative of $L$ with respect to synaptic strengths furnishes a synaptic update rule following Hebbian plasticity, equipped with activity-dependent homeostatic terms. We have shown that the dynamics of a single-layer feedforward neural network, which minimizes its cost function, is asymptotically equivalent to that of variational Bayesian inference under a particular but generic (latent variable) generative model. Hence, the cost function of the neural network can be viewed as variational free energy, and biological constraints that characterize the neural network, in the form of thresholds and neuronal excitability, become prior beliefs about hidden states. This relationship holds regardless of the true generative process of the external world. In short, this equivalence provides an insight that any neural and synaptic dynamics (in the class considered) have functional meaning and any neural network variables and constants can be formally associated with quantities in the variational Bayesian formation, implying that Bayesian inference is a universal characterization of canonical neural networks.

According to the complete class theorem, any dynamics that minimizes a cost function can be viewed as performing Bayesian inference under some prior beliefs (Wald, 1947; Brown, 1981). This implies that any neural network whose activity and plasticity minimize the same cost function can be cast as performing Bayesian inference. Moreover, when a system has reached a (possibly nonequilibrium) steady state, the conditional expectation of internal states of an autonomous system can be shown to parameterize a posterior belief over the hidden states of the external milieu (Friston, 2013, 2019; Parr, Da Costa, & Friston, 2020). Again, this suggests that any (nonequilibrium) steady state can be interpreted as realizing some elemental Bayesian inference.

Having said this, we note that the implicit generative model that underwrites any (e.g., neural network) cost function is a more delicate problem—one that we have addressed in this work. In other words, it is a mathematical truism that certain systems can always be interpreted as minimizing a variational free energy under some prior beliefs (i.e., generative model). However, this does not mean it is possible to identify the generative model by simply looking at systemic dynamics. To do this, one has to commit to a particular form of the model, so that the sufficient statistics of posterior beliefs are well defined. We have focused on discrete latent variable models that can be regarded as special (reduced) cases of partially observable Markov decision processes (POMDP).

Note that because our treatment is predicated on the complete class theorem (Brown, 1981; Wald, 1947), the same conclusions should, in principle, be reached when using continuous state-space models, such as hierarchical predictive coding models (Friston, 2008; Whittington & Bogacz, 2017; Ahmadi & Tani, 2019). Within the class of discrete state-space models, it is fairly straightforward to generate continuous outcomes from discrete latent states, as exemplified by discrete variational autoencoders (Rolfe, 2016) or mixed models, as described in Friston, Parr et al. (2017). We have described the generative model in terms of an MDP; however, we ignored state transitions. This means the generative model in this letter reduces to a simple latent variable model, with categorical states and outcomes. We have considered MDP models because they predominate in descriptions of variational (Bayesian) belief updating (e.g., Friston, FitzGerald et al., 2017). Clearly, many generative processes entail state transitions, leading to hidden Markov models (HMM). When state transitions depend on control variables, we have an MDP, and when states are only partially observed, we have a partially observed MDP (POMDP). To deal with these general cases, extensions of the current framework are required, which we hope to consider in future work, perhaps with recurrent neural networks.

Our theory implies that Hebbian plasticity is a corollary (or realization) of cost function minimization. In particular, Hebbian plasticity with a homeostatic term emerges naturally from a gradient descent on the neural network cost function defined via the integral of neural activity. In other words, the integral of synaptic inputs $W_{j1} o_t$ in equation 2.9 yields $x_{tj} W_{j1} o_t$, and its derivative yields a Hebbian product $x_{tj} o_t^T$ in equation 2.14. This relationship indicates that this form of synaptic plasticity is natural for canonical neural networks. In contrast, a naive Hebbian plasticity (without a homeostatic term) fails to perform BSS because it updates synapses with false prior beliefs (see Figure 3). It is well known that a modification of Hebbian plasticity is necessary to realize BSS (Földiák, 1990; Linsker, 1997; Isomura & Toyoizumi, 2016), speaking to the importance of selecting the right priors for BSS.
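The emergence of a Hebbian term with a homeostatic correction can be sketched for a single ON pathway. The example assumes a reduced cost $L(W_{j1}) = -\sum_t x_{tj}\,[W_{j1} o_t + \ln(\vec{1} - \mathrm{sig}(W_{j1})) \cdot \vec{1}]$, obtained by substituting the threshold condition of section 2.5 with a fixed state prior and ignoring the parameter prior and the $\beta$ terms. It is not the letter's equation 2.14 verbatim, the activity $x_{tj}$ is supplied as hypothetical data rather than inferred, and the learning rate and iteration count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical data: activity x_t in [0, 1] and binary inputs o_t whose
# probability of being ON increases with x_t.
T, N_o = 2000, 3
x = rng.random(T)
o = (rng.random((T, N_o)) < 0.2 + 0.6 * x[:, None]).astype(float)

hebb = x @ o / T          # time-averaged Hebbian product <x o>
mean_x = x.mean()         # time-averaged activity <x>

# Gradient descent on the reduced cost: since d/dW ln(1 - sig(W)) = -sig(W),
# the negative gradient is the Hebbian term minus a homeostatic term.
W1 = np.zeros(N_o)
for _ in range(5000):
    W1 += 0.5 * (hebb - mean_x * sig(W1))

W1_hat = sig(W1)          # converges to <x o> / <x>
```

At the fixed point the transformed weights equal the Hebbian product normalized by mean activity, which is the activity-weighted frequency of each input being ON. Dropping the homeostatic term `mean_x * sig(W1)` would leave the update with no fixed point, since the Hebbian product alone is always nonnegative.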

The proposed equivalence between neural networks and Bayesian inference may offer insights into designing neural network architectures and synaptic plasticity rules to perform a given task—by selecting the right kind of prior beliefs—while retaining their biological plausibility. An interesting extension of the proposed framework is an application to spiking neural networks. Earlier work has highlighted relationships between spiking neural networks and statistical inference (Bourdoukan, Barrett, Deneve, & Machens, 2012; Isomura, Sakai, Kotani, & Jimbo, 2016). The current approach might be in a position to formally link spiking neuron models and spike-timing dependent plasticity (Markram et al., 1997; Bi & Poo, 1998; Froemke & Dan, 2002; Feldman, 2012) with variational Bayesian inference.

One can understand the nature of the constants $(\alpha_{j1}, \alpha_{j0}, \beta_{j1}, \beta_{j0})$ from the biological and Bayesian perspectives as follows: $(\alpha_{j1}, \alpha_{j0})$ determines the firing threshold and thus controls the mean firing rates. In other words, these parameters control the amplitude of excitatory and inhibitory inputs, which may be analogous to the roles of GABAergic inputs (Markram et al., 2004; Isaacson & Scanziani, 2011) and neuromodulators (Pawlak, Wickens, Kirkwood, & Kerr, 2010; Frémaux & Gerstner, 2016) in biological neuronal networks. At the same time, $(\alpha_{j1}, \alpha_{j0})$ encodes prior beliefs about states, which exert a large influence on the state posterior. The state posterior is biased if $(\alpha_{j1}, \alpha_{j0})$ is selected in a suboptimal manner, in relation to the process that generates inputs. Meanwhile, $(\beta_{j1}, \beta_{j0})$ determines the accuracy of synaptic strengths that represent the likelihood mapping of an observation $o_t^{(i)}$ taking 1 (ON state) depending on hidden states (compare equation 2.8 and equation 2.20). Under a usual MDP setup where the state prior does not depend on the parameter posterior, the encoder becomes Bayes optimal when and only when $(\beta_{j1}, \beta_{j0}) = (\vec{0}, \vec{0})$. These constants can represent biological constraints on synaptic strengths, such as the range of spine growth, spinal fluctuations, or the effect of synaptic plasticity induced by spontaneous activity independent of external inputs. Although the fidelity of each synapse is limited due to such internal fluctuations, the accumulation of information over a large number of synapses should allow accurate encoding of hidden states in the current formulation.

In previous reports, we have shown that in vitro neural networks, comprising a cortical cell culture, perform BSS when receiving electrical stimulations generated from two hidden sources (Isomura et al., 2015). Furthermore, we showed that minimizing variational free energy under an MDP is sufficient to reproduce the learning observed in an in vitro network (Isomura & Friston, 2018). Our framework for identifying biologically plausible cost functions could be relevant for identifying the principles that underlie learning or adaptation processes in biological neuronal networks, using empirical response data. Here, we illustrated this potential in terms of the choice of function $\varphi_j$ in the cost functions $L$. In particular, if $\varphi_j$ is close to a constant $(-\ln 2, -\ln 2)$, the cost function is expressed straightforwardly as a variational free energy with small state prior biases. In future work, we plan to apply this scheme to empirical data and examine the biological plausibility of variational free energy minimization.
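As a worked check of the constant value mentioned above, one can use the correspondence stated in appendix A.2 between the perturbation factors and the logarithm of a categorical state prior. The component notation $\varphi_{j1}, \varphi_{j0}$ and the per-state superscript $s_t^{(j)}$ are assumptions introduced here for illustration. Under that reading, the constant maps onto a flat prior:

$$
(\varphi_{j1}, \varphi_{j0}) = (-\ln 2, -\ln 2)
\;\Longrightarrow\;
P\big(s_t^{(j)} = 1\big) = e^{-\ln 2} = \tfrac{1}{2},
\qquad
P\big(s_t^{(j)} = 0\big) = e^{-\ln 2} = \tfrac{1}{2}.
$$

Deviations of $\varphi_j$ from this value would then correspond to a state prior that favors one of the two states.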

The correspondence highlighted in this work enables one to identify a generative model (comprising likelihood and priors) that a neural network is using. The formal correspondence between neural network and variational Bayesian formations rests on the asymptotic equivalence between the neural network's cost functions and variational free energy (under some priors). Although variational free energy can take an arbitrary form, the correspondence provides biologically plausible constraints for neural networks that implicitly encode prior distributions. Hence, this formulation is potentially useful for identifying the implicit generative models that underlie the dynamics of real neuronal circuits. In other words, one can quantify the dynamics and plasticity of a neuronal circuit in terms of variational Bayesian inference and learning under an implicit generative model.

Minimization of the cost function can render the neural network optimal in a Bayesian sense, including the choice of the prior, as described in the previous section. The dependence between the likelihood function and the state prior vanishes when the network uses an optimal threshold to perform inference, provided that the true generative process does not involve dependence between the likelihood and the state prior. In other words, the dependence arises from a suboptimal choice of the prior. Indeed, any free parameters or constraints in a neural network can be optimized by minimizing variational free energy. This is because only variational free energy with the optimal priors (those that match the true generative process of the external world) can provide the global minimum among the class of neural network cost functions under consideration. This is an interesting observation because it suggests that the global minimum of the class of cost functions that determine neural network dynamics is characterized by, and only by, the statistical properties of the external world. This implies that the recapitulation of external dynamics is an inherent feature of canonical neural systems.

Finally, the free energy principle and complete class theorem imply that any brain function can be formulated in terms of variational Bayesian inference. Our reverse engineering may enable the identification of neuronal substrates or process models underlying brain functions by identifying the implicit generative model from empirical data. Unlike conventional connectomics (based on functional connectivity), reverse engineering furnishes a computational architecture (e.g., neural network), which encompasses neural activity, synaptic plasticity, and behavior. This may be especially useful for identifying neuronal mechanisms that underlie neurological or psychiatric disorders—by associating pathophysiology with false prior beliefs that may be responsible for things like hallucinations and delusions (Fletcher & Frith, 2009; Friston, Stephan, Montague, & Dolan, 2014).

In summary, we first identified a class of biologically plausible cost functions for neural networks that underlie changes in both neural activity and synaptic plasticity. We then identified an asymptotic equivalence between these cost functions and the cost functions used in variational Bayesian formations. Given this equivalence, changes in the activity and synaptic strengths of a neuronal network can be viewed as Bayesian belief updating, namely, a process of transforming priors over hidden states and parameters into the corresponding posteriors. Hence, a cost function in this class becomes Bayes optimal when activity thresholds correspond to appropriate priors in an implicit generative model. In short, the neural and synaptic dynamics of neural networks can be cast as inference and learning, under a variational Bayesian formation. This is potentially important for two reasons. First, it means that there are some threshold parameters for any neural network (in the class considered) that can be optimized for applications to data when there are precise prior beliefs about the process generating those data. Second, in virtue of the complete class theorem, one can reverse-engineer the priors that any neural network is adopting. This may be interesting when real neuronal networks can be modeled using neural networks of the class that we have considered. In other words, if one can fit neuronal responses, using a neural network model parameterized in terms of threshold constants, it becomes possible to evaluate the implicit priors using the above equivalence. This may find a useful application when applied to in vitro (or in vivo) neuronal networks (Isomura & Friston, 2018; Levin, 2013) or, indeed, dynamic causal modeling of distributed neuronal responses from noninvasive data (Daunizeau, David, & Stephan, 2011). In this context, the neural network can, in principle, be used as a dynamic causal model to estimate threshold constants and implicit priors. This “reverse engineering” speaks to estimating the priors used by real neuronal systems under ideal Bayesian assumptions, an approach sometimes referred to as meta-Bayesian inference (Daunizeau et al., 2010).

## Appendix: Supplementary Methods

### A.1 Order of the Parameter Complexity

### A.2 Correspondence between Parameter Prior Distribution and Initial Synaptic Strengths

In general, optimizing a model of observable quantities, including a neural network, can be cast as inference if there exists a learning mechanism that updates the hidden states and parameters of that model based on observations. (Exact and variational) Bayesian inference treats the hidden states and parameters as random variables and thus transforms prior distributions $P(s_t), P(A)$ into posteriors $Q(s_t), Q(A)$. In other words, Bayesian inference is a process of transforming the prior to the posterior based on observations $o_1, \ldots, o_t$ under a generative model. From this perspective, the incorporation of prior knowledge about the hidden states and parameters is an important aspect of Bayesian inference.

The minimization of a cost function by a neural network updates its activity and synaptic strengths based on observations under the given network properties (e.g., activation function and thresholds). According to the complete class theorem, this process can always be viewed as Bayesian inference. We have demonstrated that a class of cost functions, for a single-layer feedforward network with a sigmoid activation function, has a form equivalent to variational free energy under a particular latent variable model. Here, neural activity $x_t$ and synaptic strengths $W$ come to encode the posterior distributions over hidden states $Q'(s_t)$ and parameters $Q'(A)$, respectively, where $Q'(s_t)$ and $Q'(A)$ follow categorical and Dirichlet distributions, respectively. Moreover, we identified that the perturbation factors $\varphi_j$, which characterize the threshold function, correspond to the logarithm of the state prior $P(s_t)$ expressed as a categorical distribution.

However, one might ask whether the posteriors obtained using the network, $Q'(s_t), Q'(A)$, are formally different from those obtained using variational Bayesian inference, $Q(s_t), Q(A)$, since only the latter explicitly considers the prior distribution of parameters $P(A)$. Thus, one may wonder if the network merely implements update rules that are similar to variational Bayes but do not transform the priors $P(s_t), P(A)$ into the posteriors $Q(s_t), Q(A)$, despite the asymptotic equivalence of the cost functions.

In summary, one can establish the formal correspondence between neural network and variational Bayesian formations in terms of the cost functions (see equation 2.4 versus equation A.10), priors (see equations 2.18 and A.14), and posteriors (see equation 2.8 versus equation A.13). This means that a neural network successively transforms priors $P(s_t), P(A)$ into posteriors $Q(s_t), Q(A)$, as parameterized with neural activity, and initial and final synaptic strengths (and thresholds). Crucially, when increasing the number of observations, this process is asymptotically equivalent to that of variational Bayesian inference under a specific likelihood function.
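The asymptotic statement can be illustrated with a minimal sketch of conjugate learning, reduced here to a single Bernoulli likelihood entry with a Beta prior (the two-outcome case of a Dirichlet prior). This is an illustrative assumption rather than the update rule derived in this letter; the function name `posterior_mean`, the pseudo-count values, and the true probability 0.8 are hypothetical. The prior pseudo-counts play a role loosely analogous to initial synaptic strengths, and their influence on the posterior mean shrinks as observations accumulate.

```python
import random

def posterior_mean(prior_on, prior_off, observations):
    """Posterior mean of P(o = 1) under a Beta(prior_on, prior_off) prior.

    The posterior is Beta(prior_on + n_on, prior_off + n_off), i.e.,
    prior pseudo-counts are added to the observed ON and OFF counts.
    """
    n_on = sum(observations)
    n_off = len(observations) - n_on
    return (prior_on + n_on) / (prior_on + prior_off + n_on + n_off)

random.seed(0)
true_p = 0.8  # hypothetical probability of an ON observation
data = [1 if random.random() < true_p else 0 for _ in range(10000)]

# Compare two different priors as the number of observations grows.
gaps = {}
for t in (0, 10, 100, 10000):
    flat = posterior_mean(1.0, 1.0, data[:t])     # uninformative prior
    skewed = posterior_mean(1.0, 20.0, data[:t])  # prior favoring OFF
    gaps[t] = abs(flat - skewed)
    print(t, round(flat, 3), round(skewed, 3), round(gaps[t], 4))
```

The gap between the two posterior means narrows with the number of observations, so that in this toy setting the choice of initial pseudo-counts matters mainly early in learning.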

### A.3 Derivation of Synaptic Plasticity Rule

## Data Availability

All relevant data are within the letter. Matlab scripts are available at https://github.com/takuyaisomura/reverse_engineering.

## Note

1. Strictly speaking, the generative model we use in this letter is a hidden Markov model (HMM) because we do not consider probabilistic transitions between hidden states that depend on control variables. However, for consistency with the literature on variational treatments of discrete state-space models, we retain the MDP formalism, noting that we are using a special case (with unstructured state transitions).

## Acknowledgments

This work was supported in part by the grant of Joint Research by the National Institutes of Natural Sciences (NINS Program No. 01112005). T.I. is funded by the RIKEN Center for Brain Science. K.J.F. is funded by a Wellcome Principal Research Fellowship (088130/Z/09/Z). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.