This letter considers a class of biologically plausible cost functions for neural networks, where the same cost function is minimized by both neural activity and plasticity. We show that such cost functions can be cast as a variational bound on model evidence under an implicit generative model. Using generative models based on partially observed Markov decision processes (POMDP), we show that neural activity and plasticity perform Bayesian inference and learning, respectively, by maximizing model evidence. Using mathematical and numerical analyses, we establish the formal equivalence between neural network cost functions and variational free energy under some prior beliefs about latent states that generate inputs. These prior beliefs are determined by particular constants (e.g., thresholds) that define the cost function. This means that the Bayes optimal encoding of latent or hidden states is achieved when the network's implicit priors match the process that generates its inputs. This equivalence is potentially important because it suggests that any hyperparameter of a neural network can itself be optimized—by minimization with respect to variational free energy. Furthermore, it enables one to characterize a neural network formally, in terms of its prior beliefs.

Cost functions are ubiquitous in scientific fields that entail optimization—including physics, chemistry, biology, engineering, and machine learning. Furthermore, any optimization problem that can be specified using a cost function can be formulated as a gradient descent. In the neurosciences, this enables one to treat neuronal dynamics and plasticity as an optimization process (Marr, 1969; Albus, 1971; Schultz, Dayan, & Montague, 1997; Sutton & Barto, 1998; Linsker, 1988; Brown, Yamada, & Sejnowski, 2001). These examples highlight the importance of specifying a problem in terms of cost functions, from which neural and synaptic dynamics can be derived. In other words, cost functions provide a formal (i.e., normative) expression of the purpose of a neural network and prescribe the dynamics of that neural network. Crucially, once the cost function has been established and an initial condition has been selected, it is no longer necessary to solve the dynamics. Instead, one can characterize the neural network's behavior in terms of fixed points, basins of attraction, and structural stability—based only on the cost function. In short, it is important to identify the cost function to understand the dynamics, plasticity, and function of a neural network.

A ubiquitous cost function in neurobiology, theoretical biology, and machine learning is model evidence or equivalently, marginal likelihood or surprise—namely, the probability of some inputs or data under a model of how those inputs were generated by unknown or hidden causes (Bishop, 2006; Dayan & Abbott, 2001). Generally, the evaluation of surprise is intractable (especially for neural networks) as it entails a logarithm of an intractable marginal (i.e., integral). However, this evaluation can be converted into an optimization problem by inducing a variational bound on surprise. In machine learning, this is known as an evidence lower bound (ELBO; Blei, Kucukelbir, & McAuliffe, 2017), while the same quantity is known as variational free energy in statistical physics and theoretical neurobiology.

Variational free energy minimization is a candidate principle that governs neuronal activity and synaptic plasticity (Friston, Kilner, & Harrison, 2006; Friston, 2010). Here, surprise reflects the improbability of sensory inputs given a model of how those inputs were caused. In turn, minimizing variational free energy, as a proxy for surprise, corresponds to inferring the (unobservable) causes of (observable) consequences. To the extent that biological systems minimize variational free energy, it is possible to say that they infer and learn the hidden states and parameters that generate their sensory inputs (von Helmholtz, 1925; Knill & Pouget, 2004; DiCarlo, Zoccolan, & Rust, 2012) and consequently predict those inputs (Rao & Ballard, 1999; Friston, 2005). This is generally referred to as perceptual inference based on an internal generative model about the external world (Dayan, Hinton, Neal, & Zemel, 1995; George & Hawkins, 2009; Bastos et al., 2012).

Variational free energy minimization provides a unified mathematical formulation of these inference and learning processes in terms of self-organizing neural networks that function as Bayes optimal encoders. Moreover, organisms can use the same cost function to control their surrounding environment by sampling predicted (i.e., preferred) inputs. This is known as active inference (Friston, Mattout, & Kilner, 2011). The ensuing free-energy principle suggests that active inference and learning are mediated by changes in neural activity, synaptic strengths, and the behavior of an organism to minimize variational free energy as a proxy for surprise. Crucially, variational free energy and model evidence rest on a generative model of continuous or discrete hidden states. A number of recent studies have used Markov decision process (MDP) generative models to elaborate schemes that minimize variational free energy (Friston, FitzGerald, Rigoli, Schwartenbeck, & Pezzulo, 2016, 2017; Friston, Parr, & de Vries, 2017; Friston, Lin et al., 2017). This minimization reproduces various interesting dynamics and behaviors of real neuronal networks and biological organisms. However, it remains to be established whether variational free energy minimization is an apt explanation for any given neural network, as opposed to the optimization of alternative cost functions.

In principle, any neural network that produces an output or a decision can be cast as performing some form of inference in terms of Bayesian decision theory. On this reading, the complete class theorem suggests that any neural network can be regarded as performing Bayesian inference under some prior beliefs; therefore, it can be regarded as minimizing variational free energy. The complete class theorem (Wald, 1947; Brown, 1981) states that for any pair of decisions and cost functions, there are some prior beliefs (implicit in the generative model) that render the decisions Bayes optimal. This suggests that it should be theoretically possible to identify an implicit generative model within any neural network architecture, which renders its cost function a variational free energy or ELBO. However, although the complete class theorem guarantees the existence of a generative model, it does not specify its form. In what follows, we show that a ubiquitous class of neural networks implements approximate Bayesian inference under a generic discrete state space model with a known form.

In brief, we adopt a reverse-engineering approach to identify a plausible cost function for neural networks and show that the resulting cost function is formally equivalent to variational free energy. Here, we define a cost function as a function of sensory input, neural activity, and synaptic strengths and suppose that neural activity and synaptic plasticity follow a gradient descent on the cost function (assumption 1). For simplicity, we consider single-layer feedforward neural networks comprising firing-rate neuron models—receiving sensory inputs weighted by synaptic strengths—whose firing intensity is determined by the sigmoid activation function (assumption 2). We focus on blind source separation (BSS), namely the problem of separating sensory inputs into multiple hidden sources or causes (Belouchrani, Abed-Meraim, Cardoso, & Moulines, 1997; Cichocki, Zdunek, Phan, & Amari, 2009; Comon & Jutten, 2010), which provides the minimum setup for modeling causal inference. A famous example of BSS is the cocktail party effect: the ability of a partygoer to disambiguate an individual's voice from the noise of a crowd (Brown et al., 2001; Mesgarani & Chang, 2012). Previously, we observed BSS performed by in vitro neural networks (Isomura, Kotani, & Jimbo, 2015) and reproduced this self-supervised process using an MDP and variational free energy minimization (Isomura & Friston, 2018). These works suggest that variational free energy minimization offers a plausible account of the empirical behavior of in vitro networks.

In this work, we ask whether variational free energy minimization can account for the normative behavior of a canonical neural network that minimizes its cost function, by considering all possible cost functions, within a generic class. Using mathematical analysis, we identify a class of cost functions—from which update rules for both neural activity and synaptic plasticity can be derived. The gradient descent on the ensuing cost function naturally leads to Hebbian plasticity (Hebb, 1949; Bliss & Lømo, 1973; Malenka & Bear, 2004) with an activity-dependent homeostatic term. We show that these cost functions are formally homologous to variational free energy under an MDP. Crucially, this means the hyperparameters (i.e., any variables or constants) of the neural network can be associated with prior beliefs of the generative model. In principle, this allows one to optimize the neural network hyperparameters (e.g., thresholds and learning rates), given some priors over the causes (i.e., latent states) of inputs to the neural network. Furthermore, estimating hyperparameters from the dynamics of (in silico or in vitro) neural networks allows one to quantify the network's implicit prior beliefs. In this letter, we focus on the mathematical foundations for applications to in vitro and in vivo neuronal networks in subsequent work.

In this section, we formulate the same computation in terms of variational Bayesian inference and neural networks to demonstrate their correspondence. We first derive the form of a variational free energy cost function under a specific generative model, a Markov decision process. We present the derivations carefully, with a focus on the form of the ensuing Bayesian belief updating. The functional form of this update will reemerge later, when reverse engineering the cost functions implicit in neural networks. These correspondences are depicted in Figure 1 and Table 1. This section starts with a description of Markov decision processes as a general kind of generative model and then considers the minimization of variational free energy under these models.
Figure 1:

Comparison between an MDP scheme and a neural network. (A) MDP scheme expressed as a Forney factor graph (Forney, 2001; Dauwels, 2007) based on the formulation in Friston, Parr et al. (2017). In this BSS setup, the prior $D$ determines hidden states $s_t$, while $s_t$ determines observations $o_t$ through the likelihood mapping $A$. Inference corresponds to the inversion of this generative process. Here, $D^*$ indicates the true prior, while $D$ indicates the prior under which the network operates. If $D = D^*$, the inference is optimal; otherwise, it is biased. (B) Neural network comprising a single-layer feedforward network with a sigmoid activation function. The network receives sensory inputs $o_t = (o_t^{(1)}, \dots, o_t^{(N_o)})^T$ that are generated from hidden states $s_t = (s_t^{(1)}, \dots, s_t^{(N_s)})^T$ and outputs neural activities $x_t = (x_t^1, \dots, x_t^{N_x})^T$. Here, $x_t^j$ should encode the posterior expectation about a binary state $s_t^{(j)}$. In an analogy with the cocktail party effect, $s_t$ and $o_t$ correspond to individual speakers and auditory inputs, respectively.

Table 1:
Correspondence of Variables and Functions.
| Neural Network Formulation | | Variational Bayes Formulation |
| --- | --- | --- |
| Neural activity $x_t^j$ | $\Longleftrightarrow$ | $\mathbf{s}_{t1}^{(j)}$ (state posterior) |
| Sensory inputs $o_t$ | $\Longleftrightarrow$ | $o_t$ (observations) |
| Synaptic strengths $W_{j1}$ | $\Longleftrightarrow$ | $\mathrm{sig}^{-1}(\mathbf{A}_{11}^{(\cdot,j)})$ |
| $\hat{W}_{j1} \equiv \mathrm{sig}(W_{j1})$ | $\Longleftrightarrow$ | $\mathbf{A}_{11}^{(\cdot,j)}$ (parameter posterior) |
| Perturbation term $\phi_{j1}$ | $\Longleftrightarrow$ | $\ln D_1^{(j)}$ (state prior) |
| Threshold $h_{j1}$ | $\Longleftrightarrow$ | $\ln(\vec{1} - \mathbf{A}_{11}^{(\cdot,j)}) \cdot \vec{1} + \ln D_1^{(j)}$ |
| Initial synaptic strengths $\lambda_{j1} \odot \hat{W}_{j1}^{\mathrm{init}}$ | $\Longleftrightarrow$ | $a_{11}^{(\cdot,j)}$ (parameter prior) |

### 2.1  Generative Models

Under an MDP model (see Figure 1A), a minimal BSS setup (in a discrete space) reduces to the likelihood mapping from $N_s$ hidden sources or states $s_t \equiv (s_t^{(1)}, \dots, s_t^{(N_s)})^T$ to $N_o$ observations $o_t \equiv (o_t^{(1)}, \dots, o_t^{(N_o)})^T$. Each source and observation takes a value of one (ON state) or zero (OFF state) at each time step, that is, $s_t^{(j)}, o_t^{(i)} \in \{1, 0\}$. Throughout this letter, $j$ denotes the $j$th hidden state, while $i$ denotes the $i$th observation. The probability of $s_t^{(j)}$ follows a categorical distribution $P(s_t^{(j)}) = \mathrm{Cat}(D^{(j)})$, where $D^{(j)} \equiv (D_1^{(j)}, D_0^{(j)})^T \in \mathbb{R}^2$ with $D_1^{(j)} + D_0^{(j)} = 1$ (see Figure 1A, top).

The probability of an outcome is determined by the likelihood mapping from all hidden states to each kind of observation in terms of a categorical distribution, $P(o_t^{(i)} \mid s_t, A^{(i)}) = \mathrm{Cat}(A^{(i)})$ (see Figure 1A, middle). Here, each element of the tensor $A^{(i)} \in \mathbb{R}^{2 \times 2^{N_s}}$ parameterizes the probability $P(o_t^{(i)} = k \mid s_t = \vec{l}\,)$, where $k \in \{1, 0\}$ are possible observations and $\vec{l} \in \{1, 0\}^{N_s}$ encodes a particular combination of hidden states. The prior distribution of each column of $A^{(i)}$, denoted by $A^{(i)}_{\cdot \vec{l}}$, is a Dirichlet distribution $P(A^{(i)}_{\cdot \vec{l}}) = \mathrm{Dir}(a^{(i)}_{\cdot \vec{l}})$ with concentration parameter $a^{(i)}_{\cdot \vec{l}} \in \mathbb{R}^2$. We use Dirichlet distributions because they are tractable and widely used for random variables that take continuous values between zero and one. Furthermore, learning the likelihood mapping leads to biologically plausible update rules, which have the form of associative or Hebbian plasticity (see below and Friston et al., 2016, for details).

We use $\tilde{o} \equiv (o_1, \dots, o_t)$ and $\tilde{s} \equiv (s_1, \dots, s_t)$ to denote sequences of observations and hidden states, respectively. With this notation in place, the generative model (i.e., the joint distribution over outcomes, hidden states, and the parameters of their likelihood mapping) can be expressed as
$$P(\tilde{o}, \tilde{s}, A) = P(A) \prod_{\tau=1}^{t} P(o_\tau \mid s_\tau, A)\, P(s_\tau) = \prod_{i=1}^{N_o} P(A^{(i)}) \cdot \prod_{\tau=1}^{t} \left[ \prod_{i=1}^{N_o} P(o_\tau^{(i)} \mid s_\tau, A^{(i)}) \prod_{j=1}^{N_s} P(s_\tau^{(j)}) \right].$$
(2.1)
Throughout this letter, $t$ denotes the current time, and $\tau$ denotes an arbitrary time from the past to the present, $1 \le \tau \le t$.
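To make the generative process of equation 2.1 concrete, it can be simulated directly. The following minimal sketch (hypothetical function and variable names; a small two-source, two-channel example rather than the letter's exact setup) samples hidden states from the prior $D$ and observations from the likelihood $A$:

```python
import random

def sample_mdp(T, D1, A1, seed=0):
    """Sample T steps of the generative process in eq. 2.1.

    D1[j]    : prior probability that hidden source j is ON.
    A1[i][l] : probability that observation i is ON given the tuple l of
               hidden states (one column of the tensor A).
    The dict encoding of A and all names here are illustrative."""
    rng = random.Random(seed)
    states, obs = [], []
    for _ in range(T):
        s = tuple(int(rng.random() < p) for p in D1)                     # P(s) = Cat(D)
        o = tuple(int(rng.random() < A1[i][s]) for i in range(len(A1)))  # P(o | s, A)
        states.append(s)
        obs.append(o)
    return states, obs

# Two sources, two channels; each channel mostly copies one source.
D1 = [0.5, 0.5]
A1 = [{(0, 0): 0.1, (0, 1): 0.1, (1, 0): 0.9, (1, 1): 0.9},
      {(0, 0): 0.1, (0, 1): 0.9, (1, 0): 0.1, (1, 1): 0.9}]
states, obs = sample_mdp(1000, D1, A1)
```

Inference then amounts to recovering `states` from `obs` alone, without access to `A1` or `D1`.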

### 2.2  Minimization of Variational Free Energy

In this MDP scheme, the aim is to minimize surprise by minimizing variational free energy as a proxy, that is, performing approximate or variational Bayesian inference. From the generative model, we can motivate a mean-field approximation to the posterior (i.e., recognition) density as follows,
$$Q(\tilde{s}, A) = Q(A)\, Q(\tilde{s}) = \prod_{i=1}^{N_o} Q(A^{(i)}) \cdot \prod_{\tau=1}^{t} \prod_{j=1}^{N_s} Q(s_\tau^{(j)}),$$
(2.2)
where $A^{(i)}$ is the likelihood mapping (i.e., tensor), and the marginal posterior distributions of $s_\tau^{(j)}$ and $A^{(i)}$ have categorical $Q(s_\tau^{(j)}) = \mathrm{Cat}(\mathbf{s}_\tau^{(j)})$ and Dirichlet $Q(A^{(i)}) = \mathrm{Dir}(\mathbf{a}^{(i)})$ forms, respectively. For simplicity, we assume that $A^{(i)}$ factorizes into the product of the likelihood mappings from the $j$th hidden state to the $i$th observation: $A^{(i)}_{k\cdot} \approx A^{(i,1)}_{k\cdot} \otimes \cdots \otimes A^{(i,N_s)}_{k\cdot}$ (where $\otimes$ denotes the outer product and $A^{(i,j)} \in \mathbb{R}^{2 \times 2}$). This (mean-field) approximation simplifies the computation of state posteriors and serves to specify a particular form of Bayesian model that corresponds to a class of canonical neural networks (see below).
In what follows, a variable in boldface indicates the posterior expectation of the corresponding variable in italics. For example, $s_\tau^{(j)}$ takes the value 0 or 1, while the posterior expectation $\mathbf{s}_\tau^{(j)} \in \mathbb{R}^2$ comprises the expected values of $s_\tau^{(j)}$, which lie between zero and one. Moreover, $\mathbf{a}^{(i,j)} \in \mathbb{R}^{2 \times 2}$ denotes positive concentration parameters. Below, we use the posterior expectation of $\ln A^{(i,j)}$ to encode posterior beliefs about the likelihood, which is given by
$$\ln \mathbf{A}^{(i,j)}_{\cdot l} \equiv \mathbb{E}_{Q(A^{(i,j)})}\!\left[ \ln A^{(i,j)}_{\cdot l} \right] = \psi\!\left( \mathbf{a}^{(i,j)}_{\cdot l} \right) - \psi\!\left( \mathbf{a}^{(i,j)}_{1l} + \mathbf{a}^{(i,j)}_{0l} \right) = \ln \mathbf{a}^{(i,j)}_{\cdot l} - \ln\!\left( \mathbf{a}^{(i,j)}_{1l} + \mathbf{a}^{(i,j)}_{0l} \right) + O\!\left( \big( \mathbf{a}^{(i,j)}_{\cdot l} \big)^{-1} \right),$$
(2.3)
where $l \in \{1, 0\}$. Here, $\psi(\cdot) \equiv \Gamma'(\cdot)/\Gamma(\cdot)$ denotes the digamma function, which arises naturally from the definition of the Dirichlet distribution (see Friston et al., 2016, for details), and $\mathbb{E}_{Q(A^{(i,j)})}[\cdot]$ denotes the expectation over the posterior of $A^{(i,j)}$.
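The leading-order approximation in equation 2.3 can be checked numerically. The sketch below uses a crude finite-difference digamma built from `math.lgamma` (an illustrative stand-in for a proper special-function library, not part of the letter's scheme):

```python
import math

def digamma(x, h=1e-5):
    """Crude digamma psi(x) = d/dx ln Gamma(x), via central differences of
    math.lgamma (illustration only; use scipy.special.digamma in practice)."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

a1, a0 = 40.0, 10.0
exact  = digamma(a1) - digamma(a1 + a0)    # E_Q[ln A] for the ON entry (eq. 2.3)
approx = math.log(a1) - math.log(a1 + a0)  # leading-order term ln a - ln(a1 + a0)
```

With concentration parameters of this size, the two differ only at the $O(a^{-1})$ level, consistent with the expansion in equation 2.3.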
The ensuing variational free energy of this generative model is then given by
$$\begin{aligned} F(\tilde{o}, Q(\tilde{s}), Q(A)) &\equiv \sum_{\tau=1}^{t} \Big\{ \mathbb{E}_{Q(s_\tau)Q(A)}\big[ -\ln P(o_\tau \mid s_\tau, A) \big] + D_{\mathrm{KL}}\big[ Q(s_\tau) \,\big\|\, P(s_\tau) \big] \Big\} + D_{\mathrm{KL}}\big[ Q(A) \,\big\|\, P(A) \big] \\ &= \underbrace{\sum_{j=1}^{N_s} \sum_{\tau=1}^{t} \mathbf{s}_\tau^{(j)} \cdot \bigg( -\sum_{i=1}^{N_o} \ln \mathbf{A}^{(i,j)} \cdot o_\tau^{(i)} + \ln \mathbf{s}_\tau^{(j)} - \ln D^{(j)} \bigg)}_{\text{accuracy} \,+\, \text{state complexity}} + \underbrace{\sum_{i=1}^{N_o} \sum_{j=1}^{N_s} \Big\{ \big( \mathbf{a}^{(i,j)} - a^{(i,j)} \big) \cdot \ln \mathbf{A}^{(i,j)} - \ln B\big( \mathbf{a}^{(i,j)} \big) \Big\}}_{\text{parameter complexity}}, \end{aligned}$$
(2.4)
where $\ln \mathbf{A}^{(i,j)} \cdot o_\tau^{(i)}$ denotes the inner product of $\ln \mathbf{A}^{(i,j)}$ and a one-hot encoded vector of $o_\tau^{(i)}$, $D_{\mathrm{KL}}[\cdot \| \cdot]$ is the complexity as scored by the Kullback-Leibler divergence (Kullback & Leibler, 1951), and $B(\mathbf{a}^{(i,j)}) \equiv B(\mathbf{a}^{(i,j)}_{\cdot 1})\, B(\mathbf{a}^{(i,j)}_{\cdot 0})$, with $B(\mathbf{a}^{(i,j)}_{\cdot l}) \equiv \Gamma(\mathbf{a}^{(i,j)}_{1l})\, \Gamma(\mathbf{a}^{(i,j)}_{0l}) / \Gamma(\mathbf{a}^{(i,j)}_{1l} + \mathbf{a}^{(i,j)}_{0l})$, is the beta function. The first term in the final equality comprises the accuracy $-\mathbf{s}_\tau^{(j)} \cdot \sum_{i=1}^{N_o} \ln \mathbf{A}^{(i,j)} \cdot o_\tau^{(i)}$ and the (state) complexity $\mathbf{s}_\tau^{(j)} \cdot (\ln \mathbf{s}_\tau^{(j)} - \ln D^{(j)})$. The accuracy term is simply the expected log likelihood of an observation, while complexity scores the divergence between prior and posterior beliefs. In other words, complexity reflects the degree of belief updating, or the degrees of freedom, required to provide an accurate account of observations. Belief updates to both states and parameters incur a complexity cost: the state complexity increases with time $t$, while parameter complexity increases on the order of $\ln t$ and is thus negligible when $t$ is large (see section A.1 for details). This means that we can ignore parameter complexity when the scheme has experienced a sufficient number of outcomes, and we drop the parameter complexity in subsequent sections. In the remainder of this section, we show how the minimization of variational free energy transforms (i.e., updates) priors into posteriors when the parameter complexity is evaluated explicitly.
Inference optimizes posterior expectations about the hidden states by minimizing variational free energy. The optimal posterior expectations are obtained by solving the variation of $F$ to give
$$\mathbf{s}_t^{(j)} = \sigma\!\left( \sum_{i=1}^{N_o} \ln \mathbf{A}^{(i,j)} \cdot o_t^{(i)} + \ln D^{(j)} \right) = \sigma\!\left( \ln \mathbf{A}^{(\cdot,j)} \cdot o_t + \ln D^{(j)} \right),$$
(2.5)
where $\sigma(\cdot)$ is the softmax function (see Figure 1A, bottom). As $s_t^{(j)}$ is a binary value in this work, the posterior expectation of $s_t^{(j)}$ taking a value of one (ON state) can be expressed as
$$\mathbf{s}_{t1}^{(j)} = \frac{\exp\!\left( \ln \mathbf{A}^{(\cdot,j)}_{\cdot 1} \cdot o_t + \ln D_1^{(j)} \right)}{\exp\!\left( \ln \mathbf{A}^{(\cdot,j)}_{\cdot 1} \cdot o_t + \ln D_1^{(j)} \right) + \exp\!\left( \ln \mathbf{A}^{(\cdot,j)}_{\cdot 0} \cdot o_t + \ln D_0^{(j)} \right)} = \mathrm{sig}\!\left( \ln \mathbf{A}^{(\cdot,j)}_{\cdot 1} \cdot o_t - \ln \mathbf{A}^{(\cdot,j)}_{\cdot 0} \cdot o_t + \ln D_1^{(j)} - \ln D_0^{(j)} \right)$$
(2.6)
using the sigmoid function $\mathrm{sig}(z) \equiv 1/(1 + \exp(-z))$. Thus, the posterior expectation of $s_t^{(j)}$ taking a value of zero (OFF state) is $\mathbf{s}_{t0}^{(j)} = 1 - \mathbf{s}_{t1}^{(j)}$. Here, $D_1^{(j)}$ and $D_0^{(j)}$ are constants denoting the prior beliefs about hidden states. Bayes optimal encoding is obtained when, and only when, the prior beliefs match the genuine prior distribution: $D_1^{(j)} = D_0^{(j)} = 0.5$ in this BSS setup.
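The sigmoid form of equation 2.6 can be evaluated directly. The following sketch (hypothetical function and argument names; log likelihoods passed in explicitly) computes the posterior of an ON state from a set of binary observations:

```python
import math

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def state_posterior_on(o, lnA1, lnA0, lnD1, lnD0):
    """Posterior expectation that hidden state j is ON (eq. 2.6).
    lnA1[i][k] = ln P(o_i = k | s_j ON); lnA0[i][k] = ln P(o_i = k | s_j OFF)."""
    z = lnD1 - lnD0
    for i, oi in enumerate(o):
        z += lnA1[i][oi] - lnA0[i][oi]  # inner product with one-hot o_i
    return sig(z)

# One channel whose observation is likely ON when the source is ON.
post = state_posterior_on(
    o=[1],
    lnA1=[[math.log(0.1), math.log(0.9)]],  # indexed by the observed value k
    lnA0=[[math.log(0.9), math.log(0.1)]],
    lnD1=math.log(0.5), lnD0=math.log(0.5))
```

With a flat prior and a single channel of likelihood ratio 9, the posterior equals the normalized likelihood, 0.9.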

This concludes our treatment of inference about hidden states under this minimal scheme. Note that the updates in equation 2.5 have a biological plausibility in the sense that the posterior expectations can be associated with nonnegative sigmoid-shape firing rates (also known as neurometric functions; Tolhurst, Movshon, & Dean, 1983; Newsome, Britten, & Movshon, 1989), while the arguments of the sigmoid (softmax) function can be associated with neuronal depolarization, rendering the softmax function a voltage-firing rate activation function. (See Friston, FitzGerald et al., 2017, for a more comprehensive discussion and simulations using this kind of variational message passing to reproduce empirical phenomena, such as place fields, mismatch negativity responses, phase-precession, and preplay activity in systems neuroscience.)

In terms of learning, by solving the variation of $F$ with respect to $\mathbf{a}^{(i,j)}$, the optimal posterior expectations about the parameters are given by
$$\mathbf{a}^{(i,j)} = a^{(i,j)} + \sum_{\tau=1}^{t} o_\tau^{(i)} \otimes \mathbf{s}_\tau^{(j)} = a^{(i,j)} + t\, \overline{o_t^{(i)} \otimes \mathbf{s}_t^{(j)}},$$
(2.7)
where $a^{(i,j)}$ is the prior, $o_\tau^{(i)} \otimes \mathbf{s}_\tau^{(j)}$ expresses the outer product of a one-hot encoded vector of $o_\tau^{(i)}$ and $\mathbf{s}_\tau^{(j)}$, and $\overline{o_t^{(i)} \otimes \mathbf{s}_t^{(j)}} \equiv \frac{1}{t} \sum_{\tau=1}^{t} o_\tau^{(i)} \otimes \mathbf{s}_\tau^{(j)}$. Thus, the optimal posterior expectation of matrix $A$ is
$$\begin{aligned} \mathbf{A}_{11}^{(i,j)} &= \frac{\mathbf{a}_{11}^{(i,j)}}{\mathbf{a}_{11}^{(i,j)} + \mathbf{a}_{01}^{(i,j)}} = \frac{t\, \overline{o_t^{(i)} \mathbf{s}_{t1}^{(j)}} + a_{11}^{(i,j)}}{t\, \overline{\mathbf{s}_{t1}^{(j)}} + a_{11}^{(i,j)} + a_{01}^{(i,j)}} = \frac{\overline{o_t^{(i)} \mathbf{s}_{t1}^{(j)}}}{\overline{\mathbf{s}_{t1}^{(j)}}} + O\!\left( \frac{1}{t} \right) \\ \mathbf{A}_{10}^{(i,j)} &= \frac{\mathbf{a}_{10}^{(i,j)}}{\mathbf{a}_{10}^{(i,j)} + \mathbf{a}_{00}^{(i,j)}} = \frac{t\, \overline{o_t^{(i)} \mathbf{s}_{t0}^{(j)}} + a_{10}^{(i,j)}}{t\, \overline{\mathbf{s}_{t0}^{(j)}} + a_{10}^{(i,j)} + a_{00}^{(i,j)}} = \frac{\overline{o_t^{(i)} \mathbf{s}_{t0}^{(j)}}}{\overline{\mathbf{s}_{t0}^{(j)}}} + O\!\left( \frac{1}{t} \right), \end{aligned}$$
(2.8)
where $\overline{o_t^{(i)} \mathbf{s}_{t1}^{(j)}} \equiv \frac{1}{t} \sum_{\tau=1}^{t} o_\tau^{(i)} \mathbf{s}_{\tau 1}^{(j)}$, $\overline{\mathbf{s}_{t1}^{(j)}} \equiv \frac{1}{t} \sum_{\tau=1}^{t} \mathbf{s}_{\tau 1}^{(j)}$, $\overline{o_t^{(i)} \mathbf{s}_{t0}^{(j)}} \equiv \frac{1}{t} \sum_{\tau=1}^{t} o_\tau^{(i)} \mathbf{s}_{\tau 0}^{(j)}$, and $\overline{\mathbf{s}_{t0}^{(j)}} \equiv \frac{1}{t} \sum_{\tau=1}^{t} \mathbf{s}_{\tau 0}^{(j)}$. Further, $\mathbf{A}_{01}^{(i,j)} = 1 - \mathbf{A}_{11}^{(i,j)}$ and $\mathbf{A}_{00}^{(i,j)} = 1 - \mathbf{A}_{10}^{(i,j)}$. The prior of parameters $a^{(i,j)}$ is on the order of one and is thus negligible when $t$ is large. The matrix $\mathbf{A}^{(i,j)}$ expresses the optimal posterior expectations of $o_t^{(i)}$ taking the ON state when $s_t^{(j)}$ is ON ($\mathbf{A}_{11}^{(i,j)}$) or OFF ($\mathbf{A}_{10}^{(i,j)}$), or of $o_t^{(i)}$ taking the OFF state when $s_t^{(j)}$ is ON ($\mathbf{A}_{01}^{(i,j)}$) or OFF ($\mathbf{A}_{00}^{(i,j)}$). Although this expression may seem complicated, it is fairly straightforward: the posterior expectations of the likelihood simply accumulate posterior expectations about the co-occurrence of states and their outcomes. These accumulated (Dirichlet) parameters are then normalized to give a likelihood or probability. Crucially, one can observe the associative or Hebbian aspect of this belief update, expressed here in terms of the outer products between outcomes and posteriors about states in equation 2.7. We now turn to the equivalent update for neural activities and synaptic weights of a neural network.
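For a single channel-source pair, the accumulate-and-normalize updates of equations 2.7 and 2.8 reduce to a few lines. In this sketch, the function name and the flat prior counts are illustrative assumptions:

```python
def learn_A11(o_seq, s1_seq, a11=1.0, a01=1.0):
    """Posterior expectation A_11 = P(o = 1 | s ON) for one channel-source
    pair, by Dirichlet count accumulation (eqs. 2.7-2.8). s1_seq holds the
    state posteriors s_t1 in [0, 1]; a11, a01 are (assumed flat) prior counts."""
    for o, s1 in zip(o_seq, s1_seq):
        a11 += o * s1        # co-occurrence of o = 1 with the ON posterior
        a01 += (1 - o) * s1  # co-occurrence of o = 0 with the ON posterior
    return a11 / (a11 + a01)

# Channel observed ON in 3 of 4 steps while the source posterior is fully ON:
A11 = learn_A11([1, 1, 1, 0], [1.0, 1.0, 1.0, 1.0])
```

With flat prior counts (1, 1), the four observations give concentration parameters (4, 2) and hence $\mathbf{A}_{11} = 2/3$; as $t$ grows, the prior's contribution vanishes at $O(1/t)$, as noted above.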

### 2.3  Neural Activity and Hebbian Plasticity Models

Next, we consider the neural activity and synaptic plasticity in the neural network (see Figure 1B). The generation of observations $o_t$ is exactly the same as in the MDP model introduced in section 2.1 (see Figure 1B, top to middle). We assume that the $j$th neuron's activity $x_t^j$ (see Figure 1B, bottom) is given by
$$\dot{x}_t^j \propto \underbrace{-f'(x_t^j)}_{\text{leakage}} + \underbrace{W_{j1} o_t - W_{j0} o_t}_{\text{synaptic input}} + \underbrace{h_{j1} - h_{j0}}_{\text{threshold}}.$$
(2.9)
We suppose that $W_{j1} \in \mathbb{R}^{N_o}$ and $W_{j0} \in \mathbb{R}^{N_o}$ comprise row vectors of synaptic strengths and that $h_{j1} \in \mathbb{R}$ and $h_{j0} \in \mathbb{R}$ are adaptive thresholds that depend on the values of $W_{j1}$ and $W_{j0}$, respectively. One may regard $W_{j1}$ and $W_{j0}$ as excitatory and inhibitory synapses, respectively. We further assume that the nonlinear leakage $f'(\cdot)$ (i.e., the leak current) is the inverse of the sigmoid function (i.e., the logit function), so that the fixed point of $x_t^j$ (i.e., the value of $x_t^j$ at which $\dot{x}_t^j = 0$) takes the form of a sigmoid function:
$$x_t^j = \mathrm{sig}\!\left( W_{j1} o_t - W_{j0} o_t + h_{j1} - h_{j0} \right) = \frac{\exp\!\left( W_{j1} o_t + h_{j1} \right)}{\exp\!\left( W_{j1} o_t + h_{j1} \right) + \exp\!\left( W_{j0} o_t + h_{j0} \right)}.$$
(2.10)
Equations 2.9 and 2.10 are a mathematical expression of assumption 2. Further, we consider a class of synaptic plasticity rules that comprise Hebbian plasticity with an activity-dependent homeostatic term:
$$\begin{aligned} \Delta W_{j1}(t) &\equiv W_{j1}(t+1) - W_{j1}(t) \propto \mathrm{Hebb}_1\!\left( x_t^j, o_t, W_{j1} \right) + \mathrm{Home}_1\!\left( x_t^j, W_{j1} \right) \\ \Delta W_{j0}(t) &\equiv W_{j0}(t+1) - W_{j0}(t) \propto \mathrm{Hebb}_0\!\left( x_t^j, o_t, W_{j0} \right) + \mathrm{Home}_0\!\left( x_t^j, W_{j0} \right), \end{aligned}$$
(2.11)
where $\mathrm{Hebb}_1$ and $\mathrm{Hebb}_0$ denote Hebbian plasticity, determined by the product of sensory inputs and neural outputs, and $\mathrm{Home}_1$ and $\mathrm{Home}_0$ denote homeostatic plasticity, determined by output neural activity. Equation 2.11 can be read as an ansatz: we will see below that a synaptic update rule with this functional form emerges as a natural consequence of assumption 1.

In the MDP scheme, posterior expectations about hidden states and parameters are usually associated with neural activity and synaptic strengths. Here, we can observe a formal similarity between the solution for the state posterior (see equation 2.6) and the activity in the neural network (see equation 2.10; see also Table 1). By this analogy, $x_t^j$ can be regarded as encoding the posterior expectation of the ON state, $\mathbf{s}_{t1}^{(j)}$. Moreover, $W_{j1}$ and $W_{j0}$ correspond to $\ln \mathbf{A}_{11}^{(\cdot,j)} - \ln(\vec{1} - \mathbf{A}_{11}^{(\cdot,j)}) = \mathrm{sig}^{-1}(\mathbf{A}_{11}^{(\cdot,j)})$ and $\ln \mathbf{A}_{10}^{(\cdot,j)} - \ln(\vec{1} - \mathbf{A}_{10}^{(\cdot,j)}) = \mathrm{sig}^{-1}(\mathbf{A}_{10}^{(\cdot,j)})$, respectively, in the sense that they express the amplitude with which $o_t$ influences $x_t^j$ or $\mathbf{s}_{t1}^{(j)}$. Here, $\vec{1} = (1, \dots, 1) \in \mathbb{R}^{N_o}$ is a vector of ones. In particular, the optimal posterior of a hidden state taking a value of one (see equation 2.6) is given by the ratio of the beliefs about ON and OFF states, expressed as a sigmoid function. Thus, to be a Bayes optimal encoder, the fixed point of neural activity needs to be a sigmoid function. This requirement is straightforwardly ensured when $f'(x_t^j)$ is the inverse of the sigmoid function (see equation 2.13). Under this condition, the fixed point or solution for $x_t^j$ (see equation 2.10) compares inputs from ON and OFF pathways, and thus $x_t^j$ straightforwardly encodes the posterior of the $j$th hidden state being ON (i.e., $x_t^j \to \mathbf{s}_{t1}^{(j)}$). In short, the above neural network is effectively inferring the hidden state.
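To illustrate this fixed-point correspondence, one can integrate equation 2.9 with the logit leak and confirm that the activity settles on the sigmoid of equation 2.10. A minimal sketch (scalar synaptic drive; the Euler step size and function names are illustrative choices, not the letter's):

```python
import math

def logit(x):
    return math.log(x / (1.0 - x))  # inverse sigmoid: the leak f'(x)

def settle(drive, rate=0.05, steps=2000, x0=0.5):
    """Euler-integrate eq. 2.9 with logit leakage; `drive` stands for the
    scalar total input W_j1.o_t - W_j0.o_t + h_j1 - h_j0."""
    x = x0
    for _ in range(steps):
        x += rate * (-logit(x) + drive)
        x = min(max(x, 1e-6), 1.0 - 1e-6)  # keep activity inside (0, 1)
    return x

x_inf = settle(drive=1.5)
target = 1.0 / (1.0 + math.exp(-1.5))  # sigmoid fixed point of eq. 2.10
```

Because the logit leak grows without bound near 0 and 1, the dynamics are confined to the unit interval and converge to `target`, so the settled activity can be read as a posterior expectation.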

If the activity of the neural network is performing inference, does the Hebbian plasticity correspond to Bayes optimal learning? In other words, does the synaptic update rule in equation 2.11 ensure that the neural activity and synaptic strengths asymptotically encode Bayes optimal posterior beliefs about hidden states ($x_t^j \to \mathbf{s}_{t1}^{(j)}$) and parameters ($W_{j1} \to \mathrm{sig}^{-1}(\mathbf{A}_{11}^{(\cdot,j)})$), respectively? To answer this, we identify a class of cost functions from which the neural activity and synaptic plasticity can be derived and consider the conditions under which the cost function becomes consistent with variational free energy.

### 2.4  Neural Network Cost Functions

Here, we consider a class of functions that constitute a cost function for both neural activity and synaptic plasticity. We start by assuming that the update of the $j$th neuron's activity (see equation 2.9) is determined by the gradient of a cost function $L_j$: $\dot{x}_t^j \propto -\partial L_j / \partial x_t^j$. By integrating the right-hand side of equation 2.9, we obtain a class of cost functions as
$$L_j = \sum_{\tau=1}^{t} \left[ f(x_\tau^j) - x_\tau^j W_{j1} o_\tau - (1 - x_\tau^j) W_{j0} o_\tau - x_\tau^j h_{j1} - (1 - x_\tau^j) h_{j0} \right] + O(1) = \sum_{\tau=1}^{t} \left[ f(x_\tau^j) - \begin{pmatrix} x_\tau^j \\ 1 - x_\tau^j \end{pmatrix}^{\!T} \left\{ \begin{pmatrix} W_{j1} \\ W_{j0} \end{pmatrix} o_\tau + \begin{pmatrix} h_{j1} \\ h_{j0} \end{pmatrix} \right\} \right] + O(1),$$
(2.12)
where the $O(1)$ term, which depends on $W_{j1}$ and $W_{j0}$, is of a lower order than the other terms (which are $O(t)$) and is thus negligible when $t$ is large. (See section A.2 for the case where we explicitly evaluate the $O(1)$ term to demonstrate the formal correspondence between the initial values of synaptic strengths and the parameter prior $P(A)$.) The cost function of the entire network is defined by $L \equiv \sum_{j=1}^{N_x} L_j$. When $f'(x_\tau^j)$ is the inverse of the sigmoid function, we have
$$f(x_\tau^j) = x_\tau^j \ln x_\tau^j + (1 - x_\tau^j) \ln(1 - x_\tau^j)$$
(2.13)
up to a constant term (which ensures $f'(x_\tau^j) = \mathrm{sig}^{-1}(x_\tau^j)$). We further assume that the synaptic weight update rule is given by gradient descent on the same cost function $L_j$ (see assumption 1). Thus, the synaptic plasticity is derived as follows:
$$\dot{W}_{j1} \propto -\frac{1}{t} \frac{\partial L_j}{\partial W_{j1}} = \overline{x_t^j o_t^T} + \overline{x_t^j}\, h'_{j1}, \qquad \dot{W}_{j0} \propto -\frac{1}{t} \frac{\partial L_j}{\partial W_{j0}} = \overline{(1 - x_t^j)\, o_t^T} + \overline{1 - x_t^j}\, h'_{j0},$$
(2.14)
where $\overline{x_t^j o_t^T} \equiv \frac{1}{t} \sum_{\tau=1}^{t} x_\tau^j o_\tau^T$, $\overline{x_t^j} \equiv \frac{1}{t} \sum_{\tau=1}^{t} x_\tau^j$, $\overline{(1 - x_t^j)\, o_t^T} \equiv \frac{1}{t} \sum_{\tau=1}^{t} (1 - x_\tau^j)\, o_\tau^T$, $\overline{1 - x_t^j} \equiv \frac{1}{t} \sum_{\tau=1}^{t} (1 - x_\tau^j)$, $h'_{j1} \equiv \partial h_{j1} / \partial W_{j1}$, and $h'_{j0} \equiv \partial h_{j0} / \partial W_{j0}$. Note that the update of $W_{j1}$ is not directly influenced by $W_{j0}$, and vice versa, because they encode parameters in physically distinct pathways (i.e., the updates are local learning rules; Lee, Girolami, Bell, & Sejnowski, 2000; Kusmierz, Isomura, & Toyoizumi, 2017). The update rule for $W_{j1}$ can be viewed as Hebbian plasticity mediated by an additional activity-dependent term expressing homeostatic plasticity. Moreover, the update of $W_{j0}$ can be viewed as anti-Hebbian plasticity with a homeostatic term, in the sense that $W_{j0}$ is reduced when input ($o_t$) and output ($x_t^j$) fire together. The fixed points of $W_{j1}$ and $W_{j0}$ are given by
$$W_{j1} = h_1'^{-1}\!\left( -\frac{\overline{x_t^j o_t^T}}{\overline{x_t^j}} \right), \qquad W_{j0} = h_0'^{-1}\!\left( -\frac{\overline{(1 - x_t^j)\, o_t^T}}{\overline{1 - x_t^j}} \right).$$
(2.15)
Crucially, these synaptic strength updates are a subclass of the general synaptic plasticity rule in equation 2.11 (see also section A.3 for the mathematical details). Therefore, if the synaptic update rule is derived from the cost function underlying neural activity, it has a biologically plausible form comprising Hebbian plasticity and activity-dependent homeostatic plasticity. Deriving both neural activity and synaptic strengths via gradient descent on the cost function enables us to associate neural and synaptic dynamics with optimization. Although the steepest descent method gives the simplest implementation, other gradient descent schemes, such as adaptive moment estimation (Adam; Kingma & Ba, 2015), can be used while retaining the local learning property.
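The claim that gradient descent on $L_j$ yields a Hebbian update can be checked by finite differences. In the sketch below, the thresholds are held constant (so $h' = 0$ and the homeostatic term of equation 2.14 vanishes), and a single scalar input channel is used for brevity; all names are illustrative:

```python
import math

def f(x):
    return x * math.log(x) + (1 - x) * math.log(1 - x)  # eq. 2.13

def L_j(xs, os, W1, W0, h1, h0):
    """Single-neuron cost of eq. 2.12 (scalar input channel, constant
    thresholds, O(1) term dropped); x is held fixed while W varies."""
    return sum(f(x) - x * (W1 * o + h1) - (1 - x) * (W0 * o + h0)
               for x, o in zip(xs, os))

xs = [0.2, 0.9, 0.7]   # illustrative activities
os = [0, 1, 1]         # illustrative binary inputs
eps = 1e-6
num_grad = (L_j(xs, os, 1.0 + eps, 0.5, 0.0, 0.0)
            - L_j(xs, os, 1.0 - eps, 0.5, 0.0, 0.0)) / (2 * eps)
hebb = -sum(x * o for x, o in zip(xs, os))  # analytic dL/dW1: minus the summed Hebbian product
```

The numerical gradient matches $-\sum_\tau x_\tau^j o_\tau$, that is, descending $L_j$ increments $W_{j1}$ by the input-output co-activation, as in equation 2.14.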

### 2.5  Comparison with Variational Free Energy

Here, we establish a formal relationship between the cost function $L$ and variational free energy. We define $\hat{W}_{j1} \equiv \mathrm{sig}(W_{j1})$ and $\hat{W}_{j0} \equiv \mathrm{sig}(W_{j0})$ as the sigmoid transforms of the synaptic strengths. We consider the case in which neural activity is expressed as a sigmoid function, so equation 2.13 holds. As $W_{j1} = \ln \hat{W}_{j1} - \ln(\vec{1} - \hat{W}_{j1})$, equation 2.12 becomes
$$L = \sum_{j=1}^{N_x} \sum_{\tau=1}^{t} \begin{pmatrix} x_\tau^j \\ 1 - x_\tau^j \end{pmatrix}^{\!T} \left[ \begin{pmatrix} \ln x_\tau^j \\ \ln(1 - x_\tau^j) \end{pmatrix} - \begin{pmatrix} \ln \hat{W}_{j1} & \ln(\vec{1} - \hat{W}_{j1}) \\ \ln \hat{W}_{j0} & \ln(\vec{1} - \hat{W}_{j0}) \end{pmatrix} \begin{pmatrix} o_\tau \\ \vec{1} - o_\tau \end{pmatrix} - \begin{pmatrix} h_{j1} \\ h_{j0} \end{pmatrix} + \begin{pmatrix} \ln(\vec{1} - \hat{W}_{j1}) \cdot \vec{1} \\ \ln(\vec{1} - \hat{W}_{j0}) \cdot \vec{1} \end{pmatrix} \right] + O(1),$$
(2.16)
where $\vec{1} = (1, \dots, 1) \in \mathbb{R}^{N_o}$. One can immediately see a formal correspondence between this cost function and variational free energy (see equation 2.4). That is, when we assume $(x_t^j, 1 - x_t^j)^T = \mathbf{s}_t^{(j)}$, $\hat{W}_{j1} = \mathbf{A}_{11}^{(\cdot,j)}$, and $\hat{W}_{j0} = \mathbf{A}_{10}^{(\cdot,j)}$, equation 2.16 has exactly the same form as the sum of the accuracy and state complexity, which is the leading-order term of variational free energy (see the first term in the last equality of equation 2.4).

Specifically, when the thresholds satisfy $h_{j1} - \ln(\vec{1} - \hat{W}_{j1}) \cdot \vec{1} = \ln D_1^{(j)}$ and $h_{j0} - \ln(\vec{1} - \hat{W}_{j0}) \cdot \vec{1} = \ln D_0^{(j)}$, equation 2.16 becomes equivalent to equation 2.4 up to the $\ln t$ order term (which disappears when $t$ is large). Therefore, in this case, the fixed points of neural activity and synaptic strengths become the posteriors; thus, $x_t^j$ asymptotically becomes the Bayes optimal encoder for large $t$ (provided $D$ matches the genuine prior $D^*$).

In other words, we can define perturbation terms $\phi_{j1} \equiv h_{j1} - \ln(\vec{1} - \hat{W}_{j1}) \cdot \vec{1}$ and $\phi_{j0} \equiv h_{j0} - \ln(\vec{1} - \hat{W}_{j0}) \cdot \vec{1}$ as functions of $W_{j1}$ and $W_{j0}$, respectively, and can express the cost function as
$$L = \sum_{j=1}^{N_x} \sum_{\tau=1}^{t} \begin{pmatrix} x_\tau^j \\ 1 - x_\tau^j \end{pmatrix}^{\!T} \left\{ \begin{pmatrix} \ln x_\tau^j \\ \ln(1 - x_\tau^j) \end{pmatrix} - \begin{pmatrix} \ln \hat{W}_{j1} & \ln(\vec{1} - \hat{W}_{j1}) \\ \ln \hat{W}_{j0} & \ln(\vec{1} - \hat{W}_{j0}) \end{pmatrix} \begin{pmatrix} o_\tau \\ \vec{1} - o_\tau \end{pmatrix} - \begin{pmatrix} \phi_{j1} \\ \phi_{j0} \end{pmatrix} \right\} + O(1).$$
(2.17)
Here, without loss of generality, we can suppose that the constant terms in $\phi_{j1}$ and $\phi_{j0}$ are chosen to ensure that $\exp \phi_{j1} + \exp \phi_{j0} = 1$. Under this condition, $(\exp \phi_{j1}, \exp \phi_{j0})$ can be viewed as the prior belief about hidden states,
$$\phi_{j1} = \ln D_1^{(j)}, \qquad \phi_{j0} = \ln D_0^{(j)},$$
(2.18)
and thus equation 2.17 is formally equivalent to the accuracy and state complexity terms of variational free energy.

This means that when the prior belief about states $D^{(j)}$ is a function of the parameter posteriors ($\mathbf{A}^{(\cdot,j)}$), the general cost function under consideration can be expressed in the form of variational free energy, up to an $O(\ln t)$ term. A generic cost function $L$ is suboptimal from the perspective of Bayesian inference unless $\phi_{j1}$ and $\phi_{j0}$ are tuned appropriately to express the unbiased (i.e., optimal) prior belief. In this BSS setup, $\phi_{j1} = \phi_{j0} = \mathrm{const}$ is optimal; thus, when $t$ is large, a generic $L$ asymptotically gives an upper bound on the variational free energy obtained under the optimal prior belief about states.
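The equivalence can also be verified numerically for a single summand: with $\phi = \ln D$ and the identifications $x_t^j = \mathbf{s}_{t1}^{(j)}$, $\hat{W}_{j1} = \mathbf{A}_{11}$, $\hat{W}_{j0} = \mathbf{A}_{10}$, one term of equation 2.17 equals the corresponding accuracy-plus-state-complexity term of equation 2.4. A scalar-channel sketch (hypothetical function names):

```python
import math

def bern_ll(p, o):
    """Log likelihood of binary o under Bernoulli(p): one column of ln A."""
    return o * math.log(p) + (1 - o) * math.log(1 - p)

def cost_term(x, o, W1hat, W0hat, D1):
    """One summand of the cost L (eq. 2.17) with phi = ln D, scalar channel."""
    return (x * math.log(x) + (1 - x) * math.log(1 - x)
            - x * (bern_ll(W1hat, o) + math.log(D1))
            - (1 - x) * (bern_ll(W0hat, o) + math.log(1 - D1)))

def vfe_term(x, o, A11, A10, D1):
    """Accuracy + state complexity summand of eq. 2.4, with s_t1 = x."""
    return (x * (-bern_ll(A11, o) + math.log(x) - math.log(D1))
            + (1 - x) * (-bern_ll(A10, o) + math.log(1 - x) - math.log(1 - D1)))

gap = cost_term(0.7, 1, 0.8, 0.3, 0.5) - vfe_term(0.7, 1, 0.8, 0.3, 0.5)
```

Under these identifications, the two expressions agree term by term, so `gap` is zero up to floating-point rounding.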

### 2.6  Analysis on Synaptic Update Rules

To explicitly solve for the fixed points of $W_{j1}$ and $W_{j0}$ that provide the global minimum of $L$, we suppose that $\phi_{j1}$ and $\phi_{j0}$ are linear functions of $W_{j1}$ and $W_{j0}$, respectively, given by
$\phi_{j1} = \alpha_{j1} + W_{j1}\beta_{j1}, \quad \phi_{j0} = \alpha_{j0} + W_{j0}\beta_{j0},$
(2.19)
where $\alpha_{j1}, \alpha_{j0} \in \mathbb{R}$ and $\beta_{j1}, \beta_{j0} \in \mathbb{R}^{N_o}$ are constants. By solving the variation of $L$ with respect to $W_{j1}$ and $W_{j0}$, we find the fixed points of the synaptic strengths:
$W_{j1} = \mathrm{sig}^{-1}\!\left( \overline{x_t^j o_t^T} / \overline{x_t^j} \right) + \beta_{j1}^T, \quad W_{j0} = \mathrm{sig}^{-1}\!\left( \overline{(1 - x_t^j) o_t^T} / \overline{1 - x_t^j} \right) + \beta_{j0}^T.$
(2.20)
Since the update from $t$ to $t+1$ satisfies $\mathrm{sig}(W_{j1} + \Delta W_{j1}) - \mathrm{sig}(W_{j1}) = \hat{W}_{j1} \odot (\vec{1} - \hat{W}_{j1}) \odot \Delta W_{j1} + O(\Delta W_{j1}^2)$ and $\mathrm{sig}(W_{j1} + \Delta W_{j1}) - \mathrm{sig}(W_{j1}) \approx x_{t+1}^j o_{t+1}^T / \overline{x_t^j} - x_{t+1}^j \overline{x_t^j o_t^T} / \overline{x_t^j}^2 = x_{t+1}^j o_{t+1}^T / \overline{x_t^j} - \mathrm{sig}(W_{j1} - \beta_{j1}^T)\, x_{t+1}^j / \overline{x_t^j}$, we recover the following synaptic plasticity:
$\Delta W_{j1} = \underbrace{\{\hat{W}_{j1} \odot (\vec{1} - \hat{W}_{j1})\}^{\odot -1} / \overline{x_t^j}}_{\text{adaptive learning rate}} \odot \Big( \underbrace{x_{t+1}^j o_{t+1}^T}_{\text{Hebbian plasticity}} - \underbrace{\mathrm{sig}(W_{j1} - \beta_{j1}^T)\, x_{t+1}^j}_{\text{homeostatic plasticity}} \Big),$
$\Delta W_{j0} = \underbrace{\{\hat{W}_{j0} \odot (\vec{1} - \hat{W}_{j0})\}^{\odot -1} / \overline{1 - x_t^j}}_{\text{adaptive learning rate}} \odot \Big( \underbrace{(1 - x_{t+1}^j)\, o_{t+1}^T}_{\text{anti-Hebbian plasticity}} - \underbrace{\mathrm{sig}(W_{j0} - \beta_{j0}^T)(1 - x_{t+1}^j)}_{\text{homeostatic plasticity}} \Big),$
(2.21)
where $\odot$ denotes the elementwise (Hadamard) product and $\{\hat{W}_{j1} \odot (\vec{1} - \hat{W}_{j1})\}^{\odot -1}$ denotes the elementwise inverse of $\hat{W}_{j1} \odot (\vec{1} - \hat{W}_{j1})$. This synaptic plasticity rule is a subclass of the general synaptic plasticity rule in equation 2.11.
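As an illustration, one step of the plasticity rule in equation 2.21 for $W_{j1}$ can be written out directly. This is a sketch in Python; the function name, the explicit step size `lr`, and the scalar `x_bar` standing in for $\overline{x_t^j}$ are our own. Repeated updates drive $\hat{W}_{j1}$ toward the Hebbian average.

```python
import numpy as np

def sig(u):
    return 1.0 / (1.0 + np.exp(-u))

def update_W1(W1, x_new, o_new, x_bar, beta1, lr=1.0):
    """One step of the plasticity rule in equation 2.21 for W_{j1}.
    x_new: scalar activity x_{t+1}^j; o_new: (No,) inputs o_{t+1};
    x_bar: running mean of x^j; beta1: (No,) constant; lr: step size
    (added here for numerical stability)."""
    W1h = sig(W1)
    rate = 1.0 / (W1h * (1.0 - W1h) * x_bar)   # adaptive learning rate
    hebb = x_new * o_new                        # Hebbian plasticity
    homeo = sig(W1 - beta1) * x_new             # homeostatic plasticity
    return W1 + lr * rate * (hebb - homeo)
```

With `beta1` at zero, iterating this update on a co-active input drives the corresponding weight up and silent inputs down, as the Hebbian and homeostatic terms predict.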

In summary, we demonstrated that under a few minimal assumptions and ignoring small contributions to weight updates, the neural network under consideration can be regarded as minimizing an approximation to model evidence because the cost function can be formulated in terms of variational free energy. In what follows, we will rehearse our analytic results and then use numerical analyses to illustrate Bayes optimal inference (and learning) in a neural network when, and only when, it has the right priors.

### 3.1  Analytical Form of Neural Network Cost Functions

The analysis in the preceding section rests on the following assumptions:

1. Updates of neural activity and synaptic weights are determined by a gradient descent on a cost function $L$.

2. Neural activity is updated by the weighted sum of sensory inputs, and its fixed point is expressed as a sigmoid function.

Under these assumptions, we can express the cost function for a neural network as follows (see equation 2.17):
$L = \sum_{j=1}^{N_x} \sum_{\tau=1}^{t} \begin{pmatrix} x_\tau^j \\ 1 - x_\tau^j \end{pmatrix}^T \left\{ \begin{pmatrix} \ln x_\tau^j \\ \ln(1 - x_\tau^j) \end{pmatrix} - \begin{pmatrix} \ln \hat{W}_{j1} & \ln(\vec{1} - \hat{W}_{j1}) \\ \ln \hat{W}_{j0} & \ln(\vec{1} - \hat{W}_{j0}) \end{pmatrix} \begin{pmatrix} o_\tau \\ \vec{1} - o_\tau \end{pmatrix} - \begin{pmatrix} \phi_{j1} \\ \phi_{j0} \end{pmatrix} \right\} + O(1),$
where $\hat{W}_{j1} = \mathrm{sig}(W_{j1})$ and $\hat{W}_{j0} = \mathrm{sig}(W_{j0})$ hold, and $\phi_{j1}$ and $\phi_{j0}$ are functions of $W_{j1}$ and $W_{j0}$, respectively. The log-likelihood function (accuracy term) and the divergence of hidden states (complexity term) of variational free energy emerge naturally under the assumption of a sigmoid activation function (assumption 2). The additional terms denoted by $\phi_{j1}$ and $\phi_{j0}$ express the state prior, indicating that a generic cost function $L$ is variational free energy under a suboptimal prior belief about hidden states: $\ln P(s_t^{(j)}) = \ln D^{(j)} = \phi_j$, where $\phi_j \equiv (\phi_{j1}, \phi_{j0})$. This prior alters the landscape of the cost function in a suboptimal manner and thus provides a biased solution for neural activities and synaptic strengths, which differs from the Bayes optimal encoders.

For analytical tractability, we further assume the following:

1. The perturbation terms ($\phi_{j1}$ and $\phi_{j0}$) that constitute the difference between the cost function and variational free energy with optimal prior beliefs can be expressed as linear functions of $W_{j1}$ and $W_{j0}$.

From assumption 3, equation 2.17 becomes
$L = \sum_{j=1}^{N_x} \sum_{\tau=1}^{t} \begin{pmatrix} x_\tau^j \\ 1 - x_\tau^j \end{pmatrix}^T \left\{ \begin{pmatrix} \ln x_\tau^j \\ \ln(1 - x_\tau^j) \end{pmatrix} - \begin{pmatrix} \ln \hat{W}_{j1} & \ln(\vec{1} - \hat{W}_{j1}) \\ \ln \hat{W}_{j0} & \ln(\vec{1} - \hat{W}_{j0}) \end{pmatrix} \begin{pmatrix} o_\tau \\ \vec{1} - o_\tau \end{pmatrix} - \begin{pmatrix} \alpha_{j1} + W_{j1}\beta_{j1} \\ \alpha_{j0} + W_{j0}\beta_{j0} \end{pmatrix} \right\} + O(1),$
(3.1)
where $\alpha_{j1}, \alpha_{j0}, \beta_{j1}, \beta_{j0}$ are constants. The cost function has degrees of freedom with respect to the choice of these constants, which correspond to the prior belief about states $D^{(j)}$. The neural activity and synaptic strengths that minimize a generic physiological cost function $L$ are biased by these constants, which may be analogous to physiological constraints (see section 4 for details).

The cost function of the neural networks considered is characterized only by $\phi_j$. Thus, after fixing $\phi_j$ by fixing the constants $\alpha_{j1}, \alpha_{j0}$ and $\beta_{j1}, \beta_{j0}$, the remaining degrees of freedom are the initial synaptic weights. These correspond to the prior distribution over parameters $P(\mathbf{A})$ in the variational Bayesian formulation (see section A2).

The fixed point of the synaptic strengths that minimizes $L$ is given analytically by equation 2.20, which shows that $\beta_{j1}, \beta_{j0}$ shift the center of the nonlinear mapping (from Hebbian products to synaptic strengths) away from the optimal position (shown in equation 2.8). As shown in equation 2.14, the derivative of $L$ with respect to $W_{j1}$ and $W_{j0}$ recovers the synaptic update rules that comprise Hebbian and activity-dependent homeostatic terms. Although equation 2.14 expresses the dynamics of synaptic strengths that converge to the fixed point, it is consistent with a plasticity rule that gives the synaptic change from $t$ to $t+1$ (see equation 2.21).
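The role of $\beta_{j1}$ in equation 2.20 can be checked numerically: the fixed-point weights are the inverse sigmoid of the Hebbian average, translated elementwise by $\beta_{j1}$. A minimal Python sketch (the function name is ours):

```python
import numpy as np

def logit(p):
    # inverse sigmoid, sig^{-1}
    return np.log(p) - np.log(1.0 - p)

def fixed_point_W1(x, o, beta1):
    """Fixed point of W_{j1} in equation 2.20: the inverse sigmoid of the
    Hebbian average E[x o]/E[x], shifted elementwise by beta1."""
    x = np.asarray(x, dtype=float)
    o = np.asarray(o, dtype=float)
    ratio = (x[:, None] * o).mean(axis=0) / x.mean()
    ratio = np.clip(ratio, 1e-6, 1.0 - 1e-6)   # keep the logit finite
    return logit(ratio) + beta1
```

With $\beta_{j1} = \vec{0}$, this reduces to the optimal mapping from Hebbian products to synaptic strengths; a nonzero $\beta_{j1}$ simply shifts the weights, which is how a suboptimal prior biases the encoder.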

Hence, based on assumptions 1 and 2 (irrespective of assumption 3), we find that the cost function approximates variational free energy. Table 1 summarizes this correspondence. Under this condition, neural activity encodes the posterior expectation about hidden states, $x_\tau^j = \mathbf{s}_{\tau 1}^{(j)} = Q(s_\tau^{(j)} = 1)$, and synaptic strengths encode the posterior expectations of the parameters, $\hat{W}_{j1} = \mathrm{sig}(W_{j1}) = \mathbf{A}_{11}^{(\cdot,j)}$ and $\hat{W}_{j0} = \mathrm{sig}(W_{j0}) = \mathbf{A}_{10}^{(\cdot,j)}$. In addition, based on assumption 3, the threshold is characterized by the constants $\alpha_{j1}, \alpha_{j0}, \beta_{j1}, \beta_{j0}$. From a Bayesian perspective, these constants can be viewed as prior beliefs, $\ln P(s_t^{(j)}) = \ln D^{(j)} = (\alpha_{j1} + W_{j1}\beta_{j1}, \alpha_{j0} + W_{j0}\beta_{j0})$. When and only when $(\alpha_{j1}, \alpha_{j0}) = (-\ln 2, -\ln 2)$ and $(\beta_{j1}, \beta_{j0}) = (\vec{0}, \vec{0})$, the cost function becomes variational free energy with optimal prior beliefs (for BSS), whose global minimum ensures Bayes optimal encoding.

In short, we identify a class of biologically plausible cost functions from which the update rules for both neural activity and synaptic plasticity can be derived. When the activation function for neural activity is a sigmoid function, a cost function in this class is expressed straightforwardly as variational free energy. With respect to the choice of constants expressing physiological constraints in the neural network, the cost function has degrees of freedom that may be viewed as (potentially suboptimal) prior beliefs from the Bayesian perspective. Now, we illustrate the implicit inference and learning in neural networks through simulations of BSS.

### 3.2  Numerical Simulations

Here, we simulated the dynamics of neural activity and synaptic strengths when they follow a gradient descent on the cost function in equation 3.1. We considered a BSS comprising two hidden sources (or states) and 32 observations (or sensory inputs), formulated as an MDP. The two hidden sources take four patterns: $s_t = s_t^{(1)} \otimes s_t^{(2)} \in \{(0,0), (1,0), (0,1), (1,1)\}$. An observation $o_t^{(i)}$ was generated through the likelihood mapping $\mathbf{A}^{(i)}$, defined as
$P(o_t^{(i)} = 1 \mid s_t, \mathbf{A}^{(i)}) = \mathbf{A}_{1\cdot}^{(i)} = \left(0, \tfrac{3}{4}, \tfrac{1}{4}, 1\right) \ \text{for} \ 1 \le i \le 16, \quad P(o_t^{(i)} = 1 \mid s_t, \mathbf{A}^{(i)}) = \mathbf{A}_{1\cdot}^{(i)} = \left(0, \tfrac{1}{4}, \tfrac{3}{4}, 1\right) \ \text{for} \ 17 \le i \le 32.$
(3.2)
Here, for example, $\mathbf{A}_{1\cdot}^{(i)} = 3/4$ for $1 \le i \le 16$ denotes the probability of $o_t^{(i)}$ taking one when $s_t = (1, 0)$. The remaining elements were given by $\mathbf{A}_{0\cdot}^{(i)} = \vec{1} - \mathbf{A}_{1\cdot}^{(i)}$. The implicit state priors employed by a neural network were varied between zero and one while keeping $D_1^{(j)} + D_0^{(j)} = 1$, whereas the true state priors were fixed at $(D_1^{*(j)}, D_0^{*(j)}) = (0.5, 0.5)$. Synaptic strengths were initialized to values close to zero. The simulations proceeded over $T = 10^4$ time steps and were conducted using Matlab. Notably, this simulation setup is exactly the same as the experimental setup that we used for in vitro neural networks (Isomura et al., 2015; Isomura & Friston, 2018). We leverage this setup to clarify the relationship among our empirical work, a feedforward neural network model, and variational Bayesian formulations.
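The generative process in equation 3.2, together with the activity fixed point and the synaptic fixed point of equation 2.20, can be reproduced in a compact simulation. The sketch below is in Python rather than the original Matlab, uses a single neuron with optimal priors ($\phi_{j1} = \phi_{j0} = -\ln 2$, $\beta = \vec{0}$), and all variable names are ours. In our runs, the neuron's response becomes strongly correlated with at least one source, consistent with Figure 2, although which source is captured depends on the random initialization.

```python
import numpy as np

def sig(u):
    return 1.0 / (1.0 + np.exp(-u))

def logit(p):
    return np.log(p) - np.log(1.0 - p)

rng = np.random.default_rng(1)
No, T = 32, 10_000

# Likelihood mapping of equation 3.2: P(o_i = 1 | s1, s2), indexed [i][s1][s2].
p_on = np.empty((No, 2, 2))
p_on[:16] = [[0.0, 1/4], [3/4, 1.0]]   # inputs 1..16
p_on[16:] = [[0.0, 3/4], [1/4, 1.0]]   # inputs 17..32

s = rng.integers(0, 2, size=(T, 2))    # two hidden sources, P = 0.5 each
probs = p_on[np.arange(No)[None, :], s[:, 0][:, None], s[:, 1][:, None]]
obs = (rng.random((T, No)) < probs).astype(float)

# Single neuron; with optimal priors the term phi_1 - phi_0 cancels in u.
W1 = rng.normal(0.0, 0.1, No)
W0 = rng.normal(0.0, 0.1, No)
Sx, Sxo = 1.0, np.full(No, 0.5)        # running sums of x and x*o
Sn, Sno = 1.0, np.full(No, 0.5)        # running sums of (1-x) and (1-x)*o
xs = np.empty(T)
for t in range(T):
    W1h, W0h = sig(W1), sig(W0)
    u = (np.log(W1h) - np.log(W0h)) @ obs[t] \
        + (np.log(1 - W1h) - np.log(1 - W0h)) @ (1 - obs[t])
    x = sig(np.clip(u, -30.0, 30.0))   # fixed point of neural activity
    Sx += x; Sxo += x * obs[t]
    Sn += 1 - x; Sno += (1 - x) * obs[t]
    W1 = logit(np.clip(Sxo / Sx, 1e-4, 1 - 1e-4))   # equation 2.20, beta = 0
    W0 = logit(np.clip(Sno / Sn, 1e-4, 1 - 1e-4))
    xs[t] = x

half = T // 2
c1 = abs(np.corrcoef(xs[half:], s[half:, 0])[0, 1])
c2 = abs(np.corrcoef(xs[half:], s[half:, 1])[0, 1])
```

Adding a fixed bias $\phi_{j1} - \phi_{j0} \ne 0$ to `u` corresponds to a suboptimal $\alpha$ and should degrade the source correlations, as in Figure 3A.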
First, as in Isomura and Friston (2018), we demonstrated that a network with a cost function with optimized constants ($(\alpha_{j1}, \alpha_{j0}) = (-\ln 2, -\ln 2)$ and $(\beta_{j1}, \beta_{j0}) = (\vec{0}, \vec{0})$) can perform BSS successfully (see Figure 2). The responses of neuron 1 came to recognize source 1 after training, indicating that neuron 1 learned to encode source 1 (see Figure 2A). Meanwhile, neuron 2 learned to infer source 2 (see Figure 2B). During training, synaptic plasticity followed a gradient descent on the cost function (see Figures 2C and 2D). This demonstrates that minimization of the cost function with optimal constants is equivalent to variational free energy minimization and hence is sufficient to emulate BSS. This process establishes a concise representation of the hidden causes and maximizes the information retained in the neural network (Linsker, 1988; Isomura, 2018).
Figure 2:

Emergence of response selectivity for a source. (A) Evolution of neuron 1's responses, which learn to encode source 1, in the sense that the response is high when source 1 takes a value of one (red dots) and low when source 1 takes a value of zero (blue dots). Lines correspond to smoothed trajectories obtained using a discrete cosine transform. (B) Emergence of neuron 2's response, which learns to encode source 2. These results indicate that the neural network succeeded in separating the two independent sources. (C) Neural network cost function $L$, computed from equation 3.1 and plotted against the averaged synaptic strengths, where $W_{11}^{\mathrm{avg1}}$ ($z$-axis) is the average of elements 1 to 16 of $W_{11}$, while $W_{11}^{\mathrm{avg2}}$ ($x$-axis) is the average of elements 17 to 32 of $W_{11}$. The red line depicts a trajectory of the averaged synaptic strengths. (D) Trajectory of synaptic strengths. Black lines show elements of $W_{11}$; magenta and cyan lines indicate $W_{11}^{\mathrm{avg1}}$ and $W_{11}^{\mathrm{avg2}}$, respectively.

Next, we quantified the dependency of BSS performance on the form of the cost function by varying the above-mentioned constants (see Figure 3). We varied $(\alpha_{j1}, \alpha_{j0})$ in the range $0.05 \le \exp\alpha_{j1} \le 0.95$, while maintaining $\exp\alpha_{j1} + \exp\alpha_{j0} = 1$, and found that changing $(\alpha_{j1}, \alpha_{j0})$ away from $(-\ln 2, -\ln 2)$ led to a failure of BSS. Because neuron 1 encodes source 1 with the optimal $\alpha$, the correlation between source 1 and the response of neuron 1 is close to one, while the correlation between source 2 and the response of neuron 1 is nearly zero. With a suboptimal $\alpha$, these correlations fall to around 0.5, indicating that the response of neuron 1 encodes a mixture of sources 1 and 2 (see Figure 3A). Moreover, a failure of BSS can be induced when the elements of $\beta$ take values far from zero (see Figure 3B). When the elements of $\beta$ are generated from a zero-mean gaussian distribution, the accuracy of BSS (measured using the correlation between sources and responses) decreases as the standard deviation increases.
Figure 3:

Dependence of source encoding accuracy on the constants. Left panels show the magnitudes of the correlations between sources and the responses of a neuron expected to encode source 1: $|\mathrm{corr}(s_t^{(1)}, x_t^1)|$ and $|\mathrm{corr}(s_t^{(2)}, x_t^1)|$. Right panels show the magnitudes of the correlations between sources and the responses of a neuron expected to encode source 2: $|\mathrm{corr}(s_t^{(1)}, x_t^2)|$ and $|\mathrm{corr}(s_t^{(2)}, x_t^2)|$. (A) Dependence on the constant $\alpha$ that controls the excitability of a neuron when $\beta$ is fixed to zero. The dashed line (0.5) indicates the optimal value of $\exp\alpha_{j1}$. (B) Dependence on the constant $\beta$ when $\alpha$ is fixed at $(\alpha_{j1}, \alpha_{j0}) = (-\ln 2, -\ln 2)$. Elements of $\beta$ were randomly generated from a gaussian distribution with zero mean. The standard deviation of $\beta$ was varied (horizontal axis), where zero deviation was optimal. Lines and shaded areas indicate the mean and standard deviation of the source-response correlation, evaluated over 50 different sequences.


Our numerical analysis, under assumptions 1 to 3, shows that a network needs to employ a cost function that entails optimal prior beliefs to perform BSS or, equivalently, causal inference. Such a cost function is obtained when the constants that do not appear in the variational free energy under the optimal generative model for BSS become negligible. The important message here is that, in this setup, a cost function equivalent to variational free energy is necessary for Bayes optimal inference (Friston et al., 2006; Friston, 2010).

### 3.3  Phenotyping Networks

We have shown that variational free energy (under the MDP scheme) is formally homologous to the class of biologically plausible cost functions found in neural networks. The neural network's parameters $\phi_j = \ln D^{(j)}$ determine how the synaptic strengths change depending on the history of sensory inputs and neural outputs; thus, the choice of $\phi_j$ provides degrees of freedom in the shape of the neural network cost functions under consideration, which determine the purpose or function of the neural network. Among the possible $\phi_j$, only $\phi_j = (-\ln 2, -\ln 2)$ renders the cost function a variational free energy with optimal prior beliefs for BSS. Hence, one could regard neural networks of the sort considered in this letter (single-layer feedforward networks that minimize their cost function) as performing approximate Bayesian inference under priors that may or may not be optimal. This result is as predicted by the complete class theorem (Wald, 1947; Brown, 1981), which implies that any response of a neural network is Bayes optimal under some prior beliefs (and cost function). Therefore, in principle, under the theorem, any neural network of this kind is optimal when its prior beliefs are consistent with the process that generates outcomes. This perspective indicates the possibility of characterizing a neural network model, and indeed a real neuronal network, in terms of its implicit prior beliefs.

One can pursue this analysis further and model the responses or decisions of a neural network using the Bayes optimal MDP scheme under different priors. Thus, the priors in the MDP scheme can be adjusted to maximize the likelihood of empirical responses. This sort of approach has been used in systems neuroscience to characterize choice behavior in terms of subject-specific priors (see Schwartenbeck & Friston, 2016, for further details).

From a practical perspective of optimizing neural networks, understanding the formal relationship between cost functions and variational free energy enables us to specify the optimal value of any free parameter needed to realize a given function. In the present setting, we can effectively optimize the constants by updating the priors themselves such that they minimize the variational free energy for BSS. Under the Dirichlet form for the priors, the implicit threshold constants of the objective function can then be optimized using the following updates:
$\phi_j = \ln D^{(j)} = \psi(\mathbf{d}^{(j)}) - \psi(d_1^{(j)} + d_0^{(j)}), \quad \mathbf{d}^{(j)} \leftarrow \mathbf{d}^{(j)} + \sum_{\tau=1}^{t} \mathbf{s}_\tau^{(j)},$
(3.3)
where $\psi(\cdot)$ denotes the digamma function.
(See Schwartenbeck & Friston, 2016, for further details.) In effect, this update simply adds the temporal summation of the posterior expectations about the hidden states to the Dirichlet concentration parameters, $\mathbf{d}^{(j)} = (d_1^{(j)}, d_0^{(j)})$. Therefore, by committing to cost functions that underlie variational inference and learning, any free parameter can be updated in a Bayes optimal fashion when a suitable generative model is available.
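The update in equation 3.3 is straightforward to implement. The sketch below is in Python; to stay dependency-free, we inline a standard recurrence-plus-asymptotic-series approximation of the digamma function $\psi$, and the function names are ours.

```python
import numpy as np

def digamma(x):
    """Approximate the digamma function psi(x) for scalar x > 0, using the
    recurrence psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    return r + np.log(x) - 0.5 * inv \
        - inv**2 * (1/12 - inv**2 * (1/120 - inv**2 / 252))

def update_phi(d, s_post):
    """Equation 3.3: accumulate posterior state expectations into the
    Dirichlet counts d = (d1, d0) and return phi_j = ln D^(j).
    s_post: (t, 2) array of posterior expectations (x, 1 - x) over time."""
    d = np.asarray(d, dtype=float) + np.asarray(s_post, dtype=float).sum(axis=0)
    phi = np.array([digamma(di) for di in d]) - digamma(d.sum())
    return phi, d
```

For symmetric counts, $\phi_j$ approaches $(-\ln 2, -\ln 2)$, the optimal threshold for this BSS setup.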

### 3.4  Reverse-Engineering Implicit Prior Beliefs

Another situation that is important from a neuroscience perspective arises when belief updating in a neural network is slow relative to experimental observations. In this case, the implicit prior beliefs can be viewed as fixed over a short period of time. This is likely when the firing threshold is determined by homeostatic plasticity over longer timescales (Turrigiano & Nelson, 2004).

The considerations in the previous section speak to the possibility of using empirically observed neuronal responses to infer implicit prior beliefs. The synaptic weights $(W_{j1}, W_{j0})$ can be estimated statistically from response data through equation 2.20. By plotting their trajectory over the training period as a function of the history of the Hebbian product, one can estimate the cost function constants. If these constants express a near-optimal $\phi_j$, it can be concluded that the network effectively has the right sort of priors for BSS. As we have shown analytically and numerically, a cost function with $(\alpha_{j1}, \alpha_{j0})$ far from $(-\ln 2, -\ln 2)$, or with a large deviation of $(\beta_{j1}, \beta_{j0})$, fails to yield a Bayes optimal encoder for BSS. Since actual neuronal networks can perform BSS (Isomura et al., 2015; Isomura & Friston, 2018), one would envisage that their implicit cost functions exhibit a near-optimal $\phi_j$.

In particular, when $\phi_j$ can be viewed as a constant (i.e., $\phi_j = (\alpha_{j1}, \alpha_{j0})^T$) during a period of experimental observation, the characterization of thresholds is fairly straightforward: using empirically observed neuronal responses, through variational free energy minimization under the constraint $e^{\phi_{j1}} + e^{\phi_{j0}} = 1$, the estimator of $\phi_j$ is obtained as
$\hat{\phi}_j = \ln \frac{1}{t} \sum_{\tau=1}^{t} \begin{pmatrix} x_\tau^j \\ 1 - x_\tau^j \end{pmatrix}.$
(3.4)
Crucially, equation 3.4 is exactly the same as equation 3.3 up to a negligible term $\mathbf{d}^{(j)}$. However, equation 3.3 represents the adaptation of a neural network, while equation 3.4 expresses Bayesian inference about $\phi_j$ based on empirical data. We applied equation 3.4 to sequences of neural activity generated from the synthetic neural networks used in the simulations reported in Figures 2 and 3 and confirmed that the estimator was a good approximation to the true $\phi_j$ (see Figure 4A). This characterization enables the identification of implicit prior beliefs ($D$) and, consequently, the reconstruction of the neural network cost function, that is, variational free energy (see Figure 4B). Furthermore, because a canonical neural network is supposed to use the same prior beliefs for the same sort of task, one can use the reconstructed cost function to predict subsequent inference and learning without observing neural activity (see Figure 4C). These results highlight the utility of reverse-engineering neural networks to predict their activity, plasticity, and assimilation of input data.
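When $\phi_j$ is constant over the observation window, the estimator in equation 3.4 reduces to the log of the time-averaged activity. A minimal Python sketch (the function name is ours):

```python
import numpy as np

def estimate_phi(x):
    """Equation 3.4: estimate phi_j = (phi_j1, phi_j0) from an observed
    activity sequence x_1..x_t of a single neuron."""
    x = np.asarray(x, dtype=float)
    m = x.mean()                        # (1/t) * sum_tau x_tau
    return np.log(np.array([m, 1.0 - m]))
```

By construction, $e^{\hat{\phi}_{j1}} + e^{\hat{\phi}_{j0}} = 1$, so the estimate respects the normalization under which $\exp\phi_j$ is interpreted as a prior belief.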
Figure 4:

Estimation of prior beliefs enables the prediction of subsequent learning. (A) Estimation of the constants ($\alpha_{11}$) that characterize thresholds (i.e., prior beliefs) based on sequences of neural activity. Lines and shaded areas indicate the mean and standard deviation. (B) Reconstruction of the cost function $L$ using the obtained threshold estimator, through equation 3.1. The estimated $L$ closely approximates the true $L$ shown in Figure 2C. (C) Prediction of the learning process with new sensory input data generated through a different likelihood mapping characterized by a random matrix $A$. Supplying the sensory (i.e., input) data to the ensuing cost function provides a synaptic trajectory (red line), which predicts the true trajectory (black line) in the absence of observed neural responses. The inset panel depicts a comparison between elements of the true and predicted $W_{11}$ at $t = 10^4$.


In this work, we investigated a class of biologically plausible cost functions for neural networks. We considered a single-layer feedforward neural network with a sigmoid activation function that receives sensory inputs generated by hidden states (i.e., a BSS setup). We identified a class of cost functions by assuming that neural activity and synaptic plasticity minimize a common function $L$. The derivative of $L$ with respect to synaptic strengths furnishes a synaptic update rule following Hebbian plasticity, equipped with activity-dependent homeostatic terms. We have shown that the dynamics of a single-layer feedforward neural network that minimizes its cost function are asymptotically equivalent to those of variational Bayesian inference under a particular but generic (latent variable) generative model. Hence, the cost function of the neural network can be viewed as variational free energy, and the biological constraints that characterize the neural network (in the form of thresholds and neuronal excitability) become prior beliefs about hidden states. This relationship holds regardless of the true generative process of the external world. In short, this equivalence shows that any neural and synaptic dynamics in the class considered have functional meaning, and that any neural network variables and constants can be formally associated with quantities in the variational Bayesian formulation, implying that Bayesian inference is a universal characterization of canonical neural networks.

According to the complete class theorem, any dynamics that minimizes a cost function can be viewed as performing Bayesian inference under some prior beliefs (Wald, 1947; Brown, 1981). This implies that any neural network whose activity and plasticity minimize the same cost function can be cast as performing Bayesian inference. Moreover, when a system has reached a (possibly nonequilibrium) steady state, the conditional expectation of the internal states of an autonomous system can be shown to parameterize a posterior belief over the hidden states of the external milieu (Friston, 2013, 2019; Parr, Da Costa, & Friston, 2020). Again, this suggests that any (nonequilibrium) steady state can be interpreted as realizing some elemental Bayesian inference.

Having said this, we note that the implicit generative model that underwrites any (e.g., neural network) cost function is a more delicate problem—one that we have addressed in this work. In other words, it is a mathematical truism that certain systems can always be interpreted as minimizing a variational free energy under some prior beliefs (i.e., generative model). However, this does not mean it is possible to identify the generative model by simply looking at systemic dynamics. To do this, one has to commit to a particular form of the model, so that the sufficient statistics of posterior beliefs are well defined. We have focused on discrete latent variable models that can be regarded as special (reduced) cases of partially observable Markov decision processes (POMDP).

Note that because our treatment is predicated on the complete class theorem (Wald, 1947; Brown, 1981), the same conclusions should, in principle, be reached when using continuous state-space models, such as hierarchical predictive coding models (Friston, 2008; Whittington & Bogacz, 2017; Ahmadi & Tani, 2019). Within the class of discrete state-space models, it is fairly straightforward to generate continuous outcomes from discrete latent states, as exemplified by discrete variational autoencoders (Rolfe, 2016) or mixed models, as described in Friston, Parr et al. (2017). We have described the generative model in terms of an MDP; however, we ignored state transitions. This means the generative model in this letter reduces to a simple latent variable model, with categorical states and outcomes. We have considered MDP models because they predominate in descriptions of variational (Bayesian) belief updating (e.g., Friston, FitzGerald et al., 2017). Clearly, many generative processes entail state transitions, leading to hidden Markov models (HMM). When state transitions depend on control variables, we have an MDP, and when states are only partially observed, we have a partially observed MDP (POMDP). To deal with these general cases, extensions of the current framework are required, which we hope to consider in future work, perhaps with recurrent neural networks.

Our theory implies that Hebbian plasticity is a corollary (or realization) of cost function minimization. In particular, Hebbian plasticity with a homeostatic term emerges naturally from a gradient descent on the neural network cost function defined via the integral of neural activity. In other words, the integral of the synaptic input $W_{j1} o_t$ in equation 2.9 yields $x_t^j W_{j1} o_t$, and its derivative yields the Hebbian product $x_t^j o_t^T$ in equation 2.14. This relationship indicates that this form of synaptic plasticity is natural for canonical neural networks. In contrast, naive Hebbian plasticity (without a homeostatic term) fails to perform BSS because it updates synapses under false prior beliefs (see Figure 3). It is well known that a modification of Hebbian plasticity is necessary to realize BSS (Földiák, 1990; Linsker, 1997; Isomura & Toyoizumi, 2016), speaking to the importance of selecting the right priors for BSS.

The proposed equivalence between neural networks and Bayesian inference may offer insights into designing neural network architectures and synaptic plasticity rules to perform a given task—by selecting the right kind of prior beliefs—while retaining their biological plausibility. An interesting extension of the proposed framework is an application to spiking neural networks. Earlier work has highlighted relationships between spiking neural networks and statistical inference (Bourdoukan, Barrett, Deneve, & Machens, 2012; Isomura, Sakai, Kotani, & Jimbo, 2016). The current approach might be in a position to formally link spiking neuron models and spike-timing dependent plasticity (Markram et al., 1997; Bi & Poo, 1998; Froemke & Dan, 2002; Feldman, 2012) with variational Bayesian inference.

One can understand the nature of the constants $\alpha_{j1}, \alpha_{j0}, \beta_{j1}, \beta_{j0}$ from the biological and Bayesian perspectives as follows: $(\alpha_{j1}, \alpha_{j0})$ determines the firing threshold and thus controls the mean firing rate. In other words, these parameters control the amplitude of excitatory and inhibitory inputs, which may be analogous to the roles of GABAergic inputs (Markram et al., 2004; Isaacson & Scanziani, 2011) and neuromodulators (Pawlak, Wickens, Kirkwood, & Kerr, 2010; Frémaux & Gerstner, 2016) in biological neuronal networks. At the same time, $(\alpha_{j1}, \alpha_{j0})$ encodes the prior belief about states, which exerts a large influence on the state posterior. The state posterior is biased if $(\alpha_{j1}, \alpha_{j0})$ is selected in a suboptimal manner, in relation to the process that generates inputs. Meanwhile, $(\beta_{j1}, \beta_{j0})$ determines the accuracy of the synaptic strengths that represent the likelihood mapping of an observation $o_t^{(i)}$ taking one (the ON state) depending on the hidden states (compare equations 2.8 and 2.20). Under a usual MDP setup, where the state prior does not depend on the parameter posterior, the encoder becomes Bayes optimal when and only when $(\beta_{j1}, \beta_{j0}) = (\vec{0}, \vec{0})$. These constants can represent biological constraints on synaptic strengths, such as the range of spine growth, spine fluctuations, or the effect of synaptic plasticity induced by spontaneous activity independent of external inputs. Although the fidelity of each synapse is limited by such internal fluctuations, the accumulation of information over a large number of synapses should allow accurate encoding of hidden states in the current formulation.

In previous reports, we have shown that in vitro neural networks, comprising a cortical cell culture, perform BSS when receiving electrical stimulations generated from two hidden sources (Isomura et al., 2015). Furthermore, we showed that minimizing variational free energy under an MDP is sufficient to reproduce the learning observed in an in vitro network (Isomura & Friston, 2018). Our framework for identifying biologically plausible cost functions could thus be relevant for identifying the principles that underlie learning or adaptation in biological neuronal networks, using empirical response data. Here, we illustrated this potential in terms of the choice of the function $\phi_j$ in the cost function $L$. In particular, if $\phi_j$ is close to the constant $(-\ln 2, -\ln 2)$, the cost function is expressed straightforwardly as a variational free energy with small state prior biases. In future work, we plan to apply this scheme to empirical data and examine the biological plausibility of variational free energy minimization.

The correspondence highlighted in this work enables one to identify the generative model (comprising likelihood and priors) that a neural network is using. The formal correspondence between the neural network and variational Bayesian formulations rests on the asymptotic equivalence between the neural network's cost function and variational free energy (under some priors). Although variational free energy can take an arbitrary form, the correspondence provides biologically plausible constraints for neural networks that implicitly encode prior distributions. Hence, this formulation is potentially useful for identifying the implicit generative models that underlie the dynamics of real neuronal circuits. In other words, one can quantify the dynamics and plasticity of a neuronal circuit in terms of variational Bayesian inference and learning under an implicit generative model.

Minimization of the cost function can render the neural network Bayes optimal, including in its choice of priors, as described in the previous section. The dependence between the likelihood function and the state prior vanishes when the network uses an optimal threshold to perform inference, provided the true generative process involves no such dependence; in other words, the dependence arises from a suboptimal choice of the prior. Indeed, any free parameters or constraints in a neural network can be optimized by minimizing variational free energy. This is because only variational free energy with the optimal priors (those that match the true generative process of the external world) attains the global minimum among the class of neural network cost functions under consideration. This is an interesting observation because it suggests that the global minimum of the class of cost functions that determine neural network dynamics is characterized by, and only by, the statistical properties of the external world. This implies that recapitulation of external dynamics is an inherent feature of canonical neural systems.

Finally, the free energy principle and complete class theorem imply that any brain function can be formulated in terms of variational Bayesian inference. Our reverse engineering may enable the identification of neuronal substrates or process models underlying brain functions by identifying the implicit generative model from empirical data. Unlike conventional connectomics (based on functional connectivity), reverse engineering furnishes a computational architecture (e.g., neural network), which encompasses neural activity, synaptic plasticity, and behavior. This may be especially useful for identifying neuronal mechanisms that underlie neurological or psychiatric disorders—by associating pathophysiology with false prior beliefs that may be responsible for things like hallucinations and delusions (Fletcher & Frith, 2009; Friston, Stephan, Montague, & Dolan, 2014).

In summary, we first identified a class of biologically plausible cost functions for neural networks that underlie changes in both neural activity and synaptic plasticity. We then identified an asymptotic equivalence between these cost functions and the cost functions used in variational Bayesian formulations. Given this equivalence, changes in the activity and synaptic strengths of a neural network can be viewed as Bayesian belief updating—namely, a process of transforming priors over hidden states and parameters into the corresponding posteriors. Hence, a cost function in this class becomes Bayes optimal when the activity thresholds correspond to appropriate priors in an implicit generative model. In short, the neural and synaptic dynamics of neural networks can be cast as inference and learning, under a variational Bayesian formulation. This is potentially important for two reasons. First, it means that there are some threshold parameters of any neural network (in the class considered) that can be optimized for applications to data, when there are precise prior beliefs about the process generating those data. Second, by virtue of the complete class theorem, one can reverse-engineer the priors that any neural network is adopting. This may be interesting whenever real neuronal networks can be modeled using neural networks of the class that we have considered. In other words, if one can fit neuronal responses—using a neural network model parameterized in terms of threshold constants—it becomes possible to evaluate the implicit priors using the above equivalence. This may find useful application to in vitro (or in vivo) neuronal networks (Isomura & Friston, 2018; Levin, 2013) or, indeed, to dynamic causal modeling of distributed neuronal responses from noninvasive data (Daunizeau, David, & Stephan, 2011). In this context, the neural network can, in principle, be used as a dynamic causal model to estimate threshold constants and implicit priors.
This “reverse engineering” speaks to estimating the priors used by real neuronal systems, under ideal Bayesian assumptions; sometimes referred to as meta-Bayesian inference (Daunizeau et al., 2010).

### A.1  Order of the Parameter Complexity

The order of the parameter complexity term
$D_A \equiv \sum_{i=1}^{N_o} \sum_{j=1}^{N_s} \sum_{l \in \{1,0\}} \left\{ \left( a_{\cdot l}^{(i,j)} - \mathring{a}_{\cdot l}^{(i,j)} \right) \cdot \ln A_{\cdot l}^{(i,j)} - \ln B\!\left( a_{\cdot l}^{(i,j)} \right) \right\}$
(A.1)
is computed, where $\mathring{a}_{\cdot l}^{(i,j)}$ denotes the (order-1) prior Dirichlet parameters. To avoid the divergence of $\ln A_{\cdot l}^{(i,j)}$, all the elements of $A_{\cdot l}^{(i,j)}$ are assumed to be larger than a positive constant $\varepsilon$. This means that all the elements of $a_{\cdot l}^{(i,j)}$ are of order $t$. The first term of equation A.1 becomes $\left( a_{\cdot l}^{(i,j)} - \mathring{a}_{\cdot l}^{(i,j)} \right) \cdot \ln A_{\cdot l}^{(i,j)} = a_{\cdot l}^{(i,j)} \cdot \ln A_{\cdot l}^{(i,j)} + O(1)$, since $\mathring{a}_{\cdot l}^{(i,j)} \cdot \ln A_{\cdot l}^{(i,j)}$ is of order 1. Moreover, from equation 2.3, $a_{\cdot l}^{(i,j)} \cdot \ln A_{\cdot l}^{(i,j)} = a_{\cdot l}^{(i,j)} \cdot \left( \ln a_{\cdot l}^{(i,j)} - \ln\!\left( a_{1l}^{(i,j)} + a_{0l}^{(i,j)} \right) \right) + O\!\left( a_{\cdot l}^{(i,j)\,-1} \right) = a_{\cdot l}^{(i,j)} \cdot \ln A_{\cdot l}^{(i,j)} + O(1)$. Meanwhile, the second term of equation A.1 comprises the logarithms of gamma functions, $\ln B\!\left( a_{\cdot l}^{(i,j)} \right) = \ln \Gamma\!\left( a_{1l}^{(i,j)} \right) + \ln \Gamma\!\left( a_{0l}^{(i,j)} \right) - \ln \Gamma\!\left( a_{1l}^{(i,j)} + a_{0l}^{(i,j)} \right)$. From Stirling's formula,
$\Gamma\!\left( a_{1l}^{(i,j)} \right) = \sqrt{2\pi}\, \left( a_{1l}^{(i,j)} \right)^{a_{1l}^{(i,j)} - \frac{1}{2}} e^{-a_{1l}^{(i,j)}} \left( 1 + O\!\left( a_{1l}^{(i,j)\,-1} \right) \right)$
(A.2)
holds. The logarithm of $Γa1l(i,j)$ is evaluated as
$\ln \Gamma\!\left( a_{1l}^{(i,j)} \right) = \frac{1}{2}\ln 2\pi - \frac{1}{2}\ln a_{1l}^{(i,j)} + a_{1l}^{(i,j)} \left( \ln a_{1l}^{(i,j)} - 1 \right) + \ln\!\left( 1 + O\!\left( a_{1l}^{(i,j)\,-1} \right) \right) = a_{1l}^{(i,j)} \ln a_{1l}^{(i,j)} - a_{1l}^{(i,j)} + O(\ln t).$
(A.3)
Similarly, $\ln \Gamma\!\left( a_{0l}^{(i,j)} \right) = a_{0l}^{(i,j)} \ln a_{0l}^{(i,j)} - a_{0l}^{(i,j)} + O(\ln t)$ and $\ln \Gamma\!\left( a_{1l}^{(i,j)} + a_{0l}^{(i,j)} \right) = \left( a_{1l}^{(i,j)} + a_{0l}^{(i,j)} \right) \ln\!\left( a_{1l}^{(i,j)} + a_{0l}^{(i,j)} \right) - \left( a_{1l}^{(i,j)} + a_{0l}^{(i,j)} \right) + O(\ln t)$ hold. Thus, we obtain
$\ln B\!\left( a_{\cdot l}^{(i,j)} \right) = a_{1l}^{(i,j)} \ln a_{1l}^{(i,j)} + a_{0l}^{(i,j)} \ln a_{0l}^{(i,j)} - \left( a_{1l}^{(i,j)} + a_{0l}^{(i,j)} \right) \ln\!\left( a_{1l}^{(i,j)} + a_{0l}^{(i,j)} \right) + O(\ln t) = a_{\cdot l}^{(i,j)} \cdot \ln A_{\cdot l}^{(i,j)} + O(\ln t).$
(A.4)
Hence, equation A.1 becomes
$D_A = \sum_{i=1}^{N_o} \sum_{j=1}^{N_s} \sum_{l \in \{1,0\}} \left\{ a_{\cdot l}^{(i,j)} \cdot \ln A_{\cdot l}^{(i,j)} + O(1) - \left( a_{\cdot l}^{(i,j)} \cdot \ln A_{\cdot l}^{(i,j)} + O(\ln t) \right) \right\} = O(\ln t).$
(A.5)
Therefore, we obtain
$F\!\left( \tilde{o}, Q(\tilde{s}), Q(A) \right) = \sum_{j=1}^{N_s} \sum_{\tau=1}^{t} s_\tau^{(j)} \cdot \left( \ln s_\tau^{(j)} - \sum_{i=1}^{N_o} \ln A^{(i,j)} \cdot o_\tau^{(i)} - \ln D^{(j)} \right) + O(\ln t).$
(A.6)
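
The $O(\ln t)$ behavior of the discrepancy between $\ln B(a)$ and $a \cdot \ln A$ (equations A.4 and A.5) is easy to verify numerically. The following sketch (illustrative only, with hypothetical counts $a_1 = A_1 t$ and $a_0 = (1 - A_1)t$) uses the log-gamma function from the Python standard library:

```python
from math import lgamma, log

# Numerical check that ln B(a) = ln Γ(a1) + ln Γ(a0) - ln Γ(a1 + a0)
# deviates from a · ln A, with A = a / (a1 + a0), by a term of order ln t
# when the Dirichlet counts grow linearly with t.
def residual(t, A1=0.3):
    a1, a0 = A1 * t, (1.0 - A1) * t              # counts of order t
    ln_B = lgamma(a1) + lgamma(a0) - lgamma(a1 + a0)
    a_dot_ln_A = a1 * log(a1 / (a1 + a0)) + a0 * log(a0 / (a1 + a0))
    return ln_B - a_dot_ln_A                     # should be O(ln t)

r4, r8 = residual(1e4), residual(1e8)            # t = 10^4 and t = 10^8
```

The residual stays bounded by a multiple of $\ln t$ even as the leading terms grow linearly in $t$, which is why the complexity term can be absorbed into the $O(\ln t)$ remainder of equation A.6.
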
Under the current generative model, comprising binary hidden states and binary observations, the optimal posterior expectation of $A$ can be obtained up to order $\ln t / t$, even when the $O(\ln t)$ term in equation A.6 is ignored. Solving the variation of $F$ with respect to $A_{1\cdot}^{(i,j)}$ yields the optimal posterior expectation. From $A_{0\cdot}^{(i,j)} = \vec{1} - A_{1\cdot}^{(i,j)}$, we find
$\delta F = \sum_{i=1}^{N_o} \sum_{j=1}^{N_s} \sum_{\tau=1}^{t} s_\tau^{(j)} \cdot \left( -\delta \ln A_{1\cdot}^{(i,j)}\, o_\tau^{(i)} - \delta \ln\!\left( \vec{1} - A_{1\cdot}^{(i,j)} \right) \left( 1 - o_\tau^{(i)} \right) \right) = t \sum_{i=1}^{N_o} \sum_{j=1}^{N_s} \left\{ -\delta A_{1\cdot}^{(i,j)} \odot A_{1\cdot}^{(i,j)\,\odot-1} \cdot \overline{o_t^{(i)} \otimes s_t^{(j)}} + \delta A_{1\cdot}^{(i,j)} \odot \left( \vec{1} - A_{1\cdot}^{(i,j)} \right)^{\odot-1} \cdot \left( \overline{s_t^{(j)}} - \overline{o_t^{(i)} \otimes s_t^{(j)}} \right) \right\} = t \sum_{i=1}^{N_o} \sum_{j=1}^{N_s} \delta A_{1\cdot}^{(i,j)} \odot A_{1\cdot}^{(i,j)\,\odot-1} \odot \left( \vec{1} - A_{1\cdot}^{(i,j)} \right)^{\odot-1} \cdot \left( A_{1\cdot}^{(i,j)} \odot \overline{s_t^{(j)}} - \overline{o_t^{(i)} \otimes s_t^{(j)}} \right)$
(A.7)
up to the order of $\ln t$. Here, $A_{1\cdot}^{(i,j)\,\odot-1}$ denotes the element-wise inverse of $A_{1\cdot}^{(i,j)}$. From $\delta F = 0$, we find
$A_{1\cdot}^{(i,j)} = \overline{o_t^{(i)} \otimes s_t^{(j)}} \odot \overline{s_t^{(j)}}^{\,\odot-1} + O\!\left( \frac{\ln t}{t} \right).$
(A.8)
Therefore, we obtain the same result as equation 2.8, up to order $\ln t / t$.
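
The fixed point in equation A.8—the posterior expectation of $A$ as the empirical conditional probability $\overline{o_t s_t} \odot \overline{s_t}^{\,\odot-1}$—can be illustrated with synthetic data (the generative probabilities below are arbitrary choices, not values from the letter):

```python
import numpy as np

rng = np.random.default_rng(0)

# Generate binary hidden states s_t and observations o_t with a known
# likelihood, then recover P(o = 1 | s = 1) via the fixed point of
# equation A.8: A_1 = mean(o * s) / mean(s), accurate up to O(ln t / t).
t = 200_000
A_true = 0.8                                   # P(o=1 | s=1), arbitrary
s = rng.random(t) < 0.5                        # binary hidden states
o = np.where(s, rng.random(t) < A_true,        # emission when s = 1
             rng.random(t) < 0.2)              # emission when s = 0

A1_hat = (o & s).mean() / s.mean()             # posterior expectation of A
```

With $t = 2 \times 10^5$ observations, the estimate lies within a fraction of a percent of the generative likelihood, consistent with the $O(\ln t / t)$ error bound.
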

### A.2  Correspondence between Parameter Prior Distribution and Initial Synaptic Strengths

In general, optimizing a model of observable quantities—including a neural network—can be cast as inference if there exists a learning mechanism that updates the hidden states and parameters of that model based on observations. (Exact and variational) Bayesian inference treats the hidden states and parameters as random variables and thus transforms prior distributions $P(s_t)$ and $P(A)$ into posteriors $Q(s_t)$ and $Q(A)$. In other words, Bayesian inference is a process of transforming priors into posteriors based on observations $o_1, \dots, o_t$ under a generative model. From this perspective, the incorporation of prior knowledge about the hidden states and parameters is an important aspect of Bayesian inference.
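
As a minimal concrete instance of this prior-to-posterior transformation (our own toy example, using the conjugate Beta case rather than the letter's full POMDP model):

```python
# Bayesian inference transforms a Beta(a1, a0) prior over a Bernoulli
# parameter into a Beta posterior simply by accumulating observation counts.
def beta_bernoulli_update(a1, a0, observations):
    """Return the posterior Beta parameters after observing binary data."""
    for o in observations:
        a1, a0 = a1 + o, a0 + (1 - o)
    return a1, a0

a1, a0 = beta_bernoulli_update(1.0, 1.0, [1, 1, 0, 1])  # flat prior, 4 data
posterior_mean = a1 / (a1 + a0)                          # 4 / 6
```

The prior parameters set where the posterior starts, and the data move it; the same count-accumulation logic underlies the Dirichlet updates of equation 2.8.
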

The minimization of a cost function by a neural network updates its activity and synaptic strengths based on observations, under the given network properties (e.g., activation function and thresholds). According to the complete class theorem, this process can always be viewed as Bayesian inference. We have demonstrated that a class of cost functions—for a single-layer feedforward network with a sigmoid activation function—has a form equivalent to variational free energy under a particular latent variable model. Here, neural activity $x_t$ and synaptic strengths $W$ come to encode the posterior distributions over hidden states $Q'(s_t)$ and parameters $Q'(A)$, respectively, where $Q'(s_t)$ follows a categorical distribution and $Q'(A)$ a Dirichlet distribution. Moreover, we identified that the perturbation factors $\phi_j$, which characterize the threshold function, correspond to the logarithm of the state prior $P(s_t)$, expressed as a categorical distribution.

However, one might ask whether the posteriors obtained using the network, $Q'(s_t)$ and $Q'(A)$, are formally different from those obtained using variational Bayesian inference, $Q(s_t)$ and $Q(A)$, since only the latter explicitly considers the prior distribution of parameters $P(A)$. In other words, one may wonder whether the network merely implements update rules that are similar to variational Bayes but do not transform the priors $P(s_t)$ and $P(A)$ into the posteriors $Q(s_t)$ and $Q(A)$, despite the asymptotic equivalence of the cost functions.

Below, we show that the initial values of the synaptic strengths, $W_{j1}^{\mathrm{init}}$ and $W_{j0}^{\mathrm{init}}$, correspond to the parameter prior $P(A)$ expressed as a Dirichlet distribution, thereby showing that a neural network indeed transforms priors into posteriors. For this purpose, we specify the order-1 term in equation 2.12 to make the dependence on the initial synaptic strengths explicit. Specifically, we modify equation 2.12 as
$L_j = \sum_{\tau=1}^{t} \left\{ f\!\left( x_\tau^{(j)} \right) - \begin{pmatrix} x_\tau^{(j)} \\ 1 - x_\tau^{(j)} \end{pmatrix}^{T} \left[ \begin{pmatrix} W_{j1} \\ W_{j0} \end{pmatrix} o_\tau + \begin{pmatrix} h_{j1} \\ h_{j0} \end{pmatrix} \right] \right\} + \left( W_{j1}, W_{j0} \right) \left( \lambda_{j1} \odot \hat{W}_{j1}^{\mathrm{init}},\; \lambda_{j0} \odot \hat{W}_{j0}^{\mathrm{init}} \right)^{T} + \left( \ln\!\left( \vec{1} - \hat{W}_{j1} \right),\; \ln\!\left( \vec{1} - \hat{W}_{j0} \right) \right) \left( \lambda_{j1}, \lambda_{j0} \right)^{T},$
(A.9)
where $\hat{W}_{j1}^{\mathrm{init}} \equiv \mathrm{sig}\!\left( W_{j1}^{\mathrm{init}} \right)$ and $\hat{W}_{j0}^{\mathrm{init}} \equiv \mathrm{sig}\!\left( W_{j0}^{\mathrm{init}} \right)$ are the sigmoid functions of the initial synaptic strengths, and $\lambda_{j1}, \lambda_{j0} \in \mathbb{R}^{N_o}$ are row vectors of inverse learning rate factors that express the insensitivity of the synaptic strengths to activity-dependent synaptic plasticity. The third term of equation A.9 expresses the integral of $\hat{W}_{j1}$ and $\hat{W}_{j0}$ (with respect to $W_{j1}$ and $W_{j0}$, respectively). This ensures that when $t = 0$ (i.e., when the first term on the right-hand side of equation A.9 is zero), the derivative of $L_j$ is given by $\partial L_j / \partial W_{j1} = \lambda_{j1} \odot \hat{W}_{j1}^{\mathrm{init}} - \lambda_{j1} \odot \hat{W}_{j1}$, and thus $\left( W_{j1}, W_{j0} \right) = \left( W_{j1}^{\mathrm{init}}, W_{j0}^{\mathrm{init}} \right)$ provides the fixed point of $L_j$.
Similar to the transformation from equation 2.12 to equation 2.17, we compute equation A.9 as
$L = \sum_{j=1}^{N_x} \sum_{\tau=1}^{t} \begin{pmatrix} x_\tau^{(j)} \\ 1 - x_\tau^{(j)} \end{pmatrix}^{T} \left\{ \begin{pmatrix} \ln x_\tau^{(j)} \\ \ln\!\left( 1 - x_\tau^{(j)} \right) \end{pmatrix} - \begin{pmatrix} \ln \hat{W}_{j1} & \ln\!\left( \vec{1} - \hat{W}_{j1} \right) \\ \ln \hat{W}_{j0} & \ln\!\left( \vec{1} - \hat{W}_{j0} \right) \end{pmatrix} \begin{pmatrix} o_\tau \\ \vec{1} - o_\tau \end{pmatrix} - \begin{pmatrix} \phi_{j1} \\ \phi_{j0} \end{pmatrix} \right\} + \sum_{j=1}^{N_x} \left\{ \left( \ln \hat{W}_{j1}, \ln\!\left( \vec{1} - \hat{W}_{j1} \right) \right) \left( \lambda_{j1} \odot \hat{W}_{j1}^{\mathrm{init}},\; \lambda_{j1} \odot \left( \vec{1} - \hat{W}_{j1}^{\mathrm{init}} \right) \right)^{T} + \left( \ln \hat{W}_{j0}, \ln\!\left( \vec{1} - \hat{W}_{j0} \right) \right) \left( \lambda_{j0} \odot \hat{W}_{j0}^{\mathrm{init}},\; \lambda_{j0} \odot \left( \vec{1} - \hat{W}_{j0}^{\mathrm{init}} \right) \right)^{T} \right\}.$
(A.10)
Note that we used $W_{j1} = \ln \hat{W}_{j1} - \ln\!\left( \vec{1} - \hat{W}_{j1} \right)$. Crucially, analogous to the correspondence between $\hat{W}_{j1}$ and the Dirichlet parameters of the parameter posterior $a_{11}^{(\cdot,j)}$, $\lambda_{j1} \odot \hat{W}_{j1}^{\mathrm{init}}$ can be formally associated with the Dirichlet parameters of the parameter prior $\mathring{a}_{11}^{(\cdot,j)}$. Hence, one can see the formal correspondence between the second and third terms on the right-hand side of equation A.10 and the expectation of the log parameter prior in equation 2.4:
$\mathbb{E}_{Q(A)}\!\left[ \ln P(A) \right] = \sum_{i=1}^{N_o} \sum_{j=1}^{N_s} \ln A^{(i,j)} \cdot \mathring{a}^{(i,j)} = \sum_{i=1}^{N_o} \sum_{j=1}^{N_s} \left( \ln A_{\cdot 1}^{(i,j)} \cdot \mathring{a}_{\cdot 1}^{(i,j)} + \ln A_{\cdot 0}^{(i,j)} \cdot \mathring{a}_{\cdot 0}^{(i,j)} \right).$
(A.11)
Furthermore, the synaptic update rules are derived from equation A.10 as
$\dot{W}_{j1} \propto -\frac{1}{t} \frac{\partial L}{\partial W_{j1}} = \overline{x_t^{(j)} o_t^{T}} - \overline{x_t^{(j)}}\, \hat{W}_{j1} + \overline{x_t^{(j)}}\, \phi_{j1}' + \frac{1}{t} \left( \lambda_{j1} \odot \hat{W}_{j1}^{\mathrm{init}} - \lambda_{j1} \odot \hat{W}_{j1} \right)$
$\dot{W}_{j0} \propto -\frac{1}{t} \frac{\partial L}{\partial W_{j0}} = \overline{\left( 1 - x_t^{(j)} \right) o_t^{T}} - \overline{1 - x_t^{(j)}}\, \hat{W}_{j0} + \overline{1 - x_t^{(j)}}\, \phi_{j0}' + \frac{1}{t} \left( \lambda_{j0} \odot \hat{W}_{j0}^{\mathrm{init}} - \lambda_{j0} \odot \hat{W}_{j0} \right).$
(A.12)
The fixed point of equation A.12 is provided as
$W_{j1} = \mathrm{sig}^{-1}\!\left[ \left( t\, \overline{x_t^{(j)}}\, \vec{1} + \lambda_{j1} \right)^{\odot-1} \odot \left( t\, \overline{x_t^{(j)} o_t^{T}} + t\, \overline{x_t^{(j)}}\, \phi_{j1}' + \lambda_{j1} \odot \hat{W}_{j1}^{\mathrm{init}} \right) \right]$
$W_{j0} = \mathrm{sig}^{-1}\!\left[ \left( t\, \overline{1 - x_t^{(j)}}\, \vec{1} + \lambda_{j0} \right)^{\odot-1} \odot \left( t\, \overline{\left( 1 - x_t^{(j)} \right) o_t^{T}} + t\, \overline{1 - x_t^{(j)}}\, \phi_{j0}' + \lambda_{j0} \odot \hat{W}_{j0}^{\mathrm{init}} \right) \right].$
(A.13)
Note that the synaptic strengths at $t = 0$ are computed as $W_{j1} = \mathrm{sig}^{-1}\!\left[ \lambda_{j1}^{\odot-1} \odot \left( \lambda_{j1} \odot \hat{W}_{j1}^{\mathrm{init}} \right) \right] = W_{j1}^{\mathrm{init}}$. Again, one can see the formal correspondence between the final values of the synaptic strengths given by equation A.13 in the neural network formulation and the parameter posterior given by equation 2.8 in the variational Bayesian formulation. As the Dirichlet parameter of the posterior $a_{11}^{(\cdot,j)}$ is decomposed into the outer product $\overline{o_t \otimes s_{t1}^{(j)}}$ and the prior $\mathring{a}_{11}^{(\cdot,j)}$, they are associated with $\overline{x_t^{(j)} o_t^{T}}$ and $\lambda_{j1} \odot \hat{W}_{j1}^{\mathrm{init}}$, respectively. Thus, equation 2.8 corresponds to equation A.13. Hence, for a given constant set $\left( W_{j1}^{\mathrm{init}}, W_{j0}^{\mathrm{init}}, \lambda_{j1}, \lambda_{j0} \right)$, we identify the corresponding parameter prior $P\!\left( A^{(\cdot,j)} \right) = \mathrm{Dir}\!\left( \mathring{a}^{(\cdot,j)} \right)$, given by
$\mathring{a}^{(\cdot,j)} \equiv \begin{pmatrix} \mathring{a}_{11}^{(\cdot,j)} & \mathring{a}_{10}^{(\cdot,j)} \\ \mathring{a}_{01}^{(\cdot,j)} & \mathring{a}_{00}^{(\cdot,j)} \end{pmatrix} = \begin{pmatrix} \lambda_{j1} \odot \hat{W}_{j1}^{\mathrm{init}} & \lambda_{j0} \odot \hat{W}_{j0}^{\mathrm{init}} \\ \lambda_{j1} \odot \left( \vec{1} - \hat{W}_{j1}^{\mathrm{init}} \right) & \lambda_{j0} \odot \left( \vec{1} - \hat{W}_{j0}^{\mathrm{init}} \right) \end{pmatrix}.$
(A.14)
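
Equation A.14 can be read as a recipe: given initial synaptic strengths and inverse learning rate factors, compute the Dirichlet parameters of the implicit parameter prior. A minimal sketch (variable names are ours):

```python
import numpy as np

def sig(x):
    """Element-wise sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

# Map a constant set (W_init, lambda) to the implicit Dirichlet prior of
# equation A.14; each entry is an element-wise (Hadamard) product.
def implicit_dirichlet_prior(W1_init, W0_init, lam1, lam0):
    W1_hat, W0_hat = sig(W1_init), sig(W0_init)
    return {
        "a11": lam1 * W1_hat, "a10": lam0 * W0_hat,
        "a01": lam1 * (1.0 - W1_hat), "a00": lam0 * (1.0 - W0_hat),
    }

# Zero initial strengths (sig(0) = 0.5) with lambda = 10 give a symmetric
# prior; each column of Dirichlet parameters sums to the inverse learning rate.
prior = implicit_dirichlet_prior(np.zeros(3), np.zeros(3),
                                 np.full(3, 10.0), np.full(3, 10.0))
```

Note that a larger $\lambda$ (slower learning) yields larger prior concentration parameters, consistent with reading the inverse learning rate as the precision of the implicit prior.
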

In summary, one can establish the formal correspondence between neural network and variational Bayesian formulations in terms of the cost functions (equation 2.4 versus equation A.10), priors (equations 2.18 and A.14), and posteriors (equation 2.8 versus equation A.13). This means that a neural network successively transforms priors $P(s_t)$ and $P(A)$ into posteriors $Q(s_t)$ and $Q(A)$, as parameterized with neural activity and with the initial and final synaptic strengths (and thresholds). Crucially, as the number of observations increases, this process becomes asymptotically equivalent to variational Bayesian inference under a specific likelihood function.

### A.3  Derivation of Synaptic Plasticity Rule

We consider the synaptic strengths at time $t$, $W_{j1} = W_{j1}(t)$, and define the change as $\Delta W_{j1} \equiv W_{j1}(t+1) - W_{j1}(t)$. From equation 2.15, $h_1'(W_{j1})$ satisfies both
$h_1'\!\left( W_{j1} + \Delta W_{j1} \right) - h_1'( W_{j1} ) = h_1''( W_{j1} ) \odot \Delta W_{j1} + O\!\left( \Delta W_{j1}^{\odot 2} \right)$
(A.15)
and
$h_1'\!\left( W_{j1} + \Delta W_{j1} \right) - h_1'( W_{j1} ) = -\frac{ x_{t+1}^{(j)} o_{t+1}^{T} + t\, \overline{x_t^{(j)} o_t^{T}} }{ x_{t+1}^{(j)} + t\, \overline{x_t^{(j)}} } + \frac{ \overline{x_t^{(j)} o_t^{T}} }{ \overline{x_t^{(j)}} } \approx -\frac{ x_{t+1}^{(j)} o_{t+1}^{T} }{ t\, \overline{x_t^{(j)}} } + \frac{ \overline{x_t^{(j)} o_t^{T}} }{ t\, \overline{x_t^{(j)}}^{\,2} }\, x_{t+1}^{(j)} = -\frac{1}{ t\, \overline{x_t^{(j)}} } \left( x_{t+1}^{(j)} o_{t+1}^{T} + x_{t+1}^{(j)} h_1'( W_{j1} ) \right).$
(A.16)
Thus, we find
$\Delta W_{j1} = \underbrace{ \frac{ -h_1''( W_{j1} )^{\odot-1} }{ t\, \overline{x_t^{(j)}} } }_{\text{adaptive learning rate}} \odot \left( \underbrace{ x_{t+1}^{(j)} o_{t+1}^{T} }_{\text{Hebbian term}} + \underbrace{ x_{t+1}^{(j)} h_1'( W_{j1} ) }_{\text{homeostatic term}} \right).$
(A.17)
Similarly,
$\Delta W_{j0} = \underbrace{ \frac{ -h_0''( W_{j0} )^{\odot-1} }{ t\, \overline{1 - x_t^{(j)}} } }_{\text{adaptive learning rate}} \odot \left( \underbrace{ \left( 1 - x_{t+1}^{(j)} \right) o_{t+1}^{T} }_{\text{anti-Hebbian term}} + \underbrace{ \left( 1 - x_{t+1}^{(j)} \right) h_0'( W_{j0} ) }_{\text{homeostatic term}} \right).$
(A.18)
These plasticity rules express (anti-) Hebbian plasticity with a homeostatic term.
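
To make the structure of equation A.17 concrete, here is a minimal stochastic-approximation sketch (our own instantiation, not the letter's simulations): when the homeostatic drive takes the form $-x \hat{W}$, the rule reduces to a running average of inputs conditioned on postsynaptic firing, with an adaptive learning rate that decays like $1/(t\, \overline{x})$.

```python
import numpy as np

rng = np.random.default_rng(1)

n_inputs, T = 4, 5000
W_hat = np.full(n_inputs, 0.5)  # sigmoid-transformed synaptic strengths
n_fire = 0                      # running count of postsynaptic firing (t * x_bar)

for _ in range(T):
    o = (rng.random(n_inputs) < 0.6).astype(float)  # presynaptic inputs
    x = o[0]                    # postsynaptic response driven by channel 0
    if x:                       # updates are gated by postsynaptic firing
        n_fire += 1
        lr = 1.0 / n_fire       # adaptive learning rate ~ 1 / (t * x_bar)
        # Hebbian term (x * o) plus homeostatic decay (-x * W_hat):
        W_hat += lr * (x * o - x * W_hat)
```

In this toy run, the synapse whose input always coincides with firing is driven to 1, while the others settle near the background coincidence rate of 0.6, mirroring the conditional-average fixed point $\overline{x_t o_t} \odot \overline{x_t}^{\,\odot-1}$ of equations A.8 and A.13.
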

All relevant data are within the letter. Matlab scripts are available at https://github.com/takuyaisomura/reverse_engineering.

1

Strictly speaking, the generative model we use in this letter is a hidden Markov model (HMM), because we do not consider probabilistic transitions between hidden states that depend on control variables. However, for consistency with the literature on variational treatments of discrete state-space models, we retain the MDP formalism, noting that we are using a special case (with unstructured state transitions).

This work was supported in part by the grant of Joint Research by the National Institutes of Natural Sciences (NINS Program No. 01112005). T.I. is funded by the RIKEN Center for Brain Science. K.J.F. is funded by a Wellcome Principal Research Fellowship (088130/Z/09/Z). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Ahmadi, A., & Tani, J. (2019). A novel predictive-coding-inspired variational RNN model for online prediction and recognition. Neural Comput., 31, 2025–2074.
Albus, J. S. (1971). A theory of cerebellar function. Math. Biosci., 10, 25–61.
Bastos, A. M., Usrey, W. M., Adams, R. A., Mangun, G. R., Fries, P., & Friston, K. J. (2012). Canonical microcircuits for predictive coding. Neuron, 76, 695–711.
Belouchrani, A., Abed-Meraim, K., Cardoso, J. F., & Moulines, E. (1997). A blind source separation technique using second-order statistics. IEEE Trans. Signal Process., 45, 434–444.
Bi, G. Q., & Poo, M. M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18, 10464–10472.
Bishop, C. M. (2006). Pattern recognition and machine learning. Berlin: Springer.
Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. J. Am. Stat. Assoc., 112, 859–877.
Bliss, T. V., & Lømo, T. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. J. Physiol., 232, 331–356.
Bourdoukan, R., Barrett, D., Deneve, S., & Machens, C. K. (2012). Learning optimal spike-based representations. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 25 (pp. 2285–2293). Red Hook, NY: Curran.
Brown, G. D., Yamada, S., & Sejnowski, T. J. (2001). Independent component analysis at the neural cocktail party. Trends Neurosci., 24, 54–63.
Brown, L. D. (1981). A complete class theorem for statistical problems with finite-sample spaces. Ann. Stat., 9, 1289–1300.
Cichocki, A., Zdunek, R., Phan, A. H., & Amari, S. I. (2009). Nonnegative matrix and tensor factorizations: Applications to exploratory multi-way data analysis and blind source separation. Hoboken, NJ: Wiley.
Comon, P., & Jutten, C. (2010). Handbook of blind source separation: Independent component analysis and applications. Orlando, FL: Academic Press.
Daunizeau, J., David, O., & Stephan, K. E. (2011). Dynamic causal modelling: A critical review of the biophysical and statistical foundations. NeuroImage, 58, 312–322.
Daunizeau, J., Den Ouden, H. E., Pessiglione, M., Kiebel, S. J., Stephan, K. E., & Friston, K. J. (2010). Observing the observer (I): Meta-Bayesian models of learning and decision-making. PLOS One, 5, e15554.
Dauwels, J. (2007). On variational message passing on factor graphs. In Proceedings of the International Symposium on Information Theory. Piscataway, NJ: IEEE.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience: Computational and mathematical modeling of neural systems. Cambridge, MA: MIT Press.
Dayan, P., Hinton, G. E., Neal, R. M., & Zemel, R. S. (1995). The Helmholtz machine. Neural Comput., 7, 889–904.
DiCarlo, J. J., Zoccolan, D., & Rust, N. C. (2012). How does the brain solve visual object recognition? Neuron, 73, 415–434.
Feldman, D. E. (2012). The spike-timing dependence of plasticity. Neuron, 75, 556–571.
Fletcher, P. C., & Frith, C. D. (2009). Perceiving is believing: A Bayesian approach to explaining the positive symptoms of schizophrenia. Nat. Rev. Neurosci., 10, 48–58.
Földiák, P. (1990). Forming sparse representations by local anti-Hebbian learning. Biol. Cybern., 64, 165–170.
Forney, G. D. (2001). Codes on graphs: Normal realizations. IEEE Trans. Inf. Theory, 47, 520–548.
Frémaux, N., & Gerstner, W. (2016). Neuromodulated spike-timing-dependent plasticity, and theory of three-factor learning rules. Front. Neural Circuits, 9.
Friston, K. (2005). A theory of cortical responses. Philos. Trans. R. Soc. Lond. B Biol. Sci., 360, 815–836.
Friston, K. (2008). Hierarchical models in the brain. PLOS Comput. Biol., 4, e1000211.
Friston, K. (2010). The free-energy principle: A unified brain theory? Nat. Rev. Neurosci., 11, 127–138.
Friston, K. (2013). Life as we know it. J. R. Soc. Interface, 10, 20130475.
Friston, K. (2019). A free energy principle for a particular physics. arXiv:1906.10184.
Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., & Pezzulo, G. (2016). Active inference and learning. Neurosci. Biobehav. Rev., 68, 862–879.
Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., & Pezzulo, G. (2017). Active inference: A process theory. Neural Comput., 29, 1–49.
Friston, K., Kilner, J., & Harrison, L. (2006). A free energy principle for the brain. J. Physiol. Paris, 100, 70–87.
Friston, K. J., Lin, M., Frith, C. D., Pezzulo, G., Hobson, J. A., & Ondobaka, S. (2017). Active inference, curiosity and insight. Neural Comput., 29, 2633–2683.
Friston, K., Mattout, J., & Kilner, J. (2011). Action understanding and active inference. Biol. Cybern., 104, 137–160.
Friston, K. J., Parr, T., & de Vries, B. D. (2017). The graphical brain: Belief propagation and active inference. Netw. Neurosci., 1, 381–414.
Friston, K. J., Stephan, K. E., Montague, R., & Dolan, R. J. (2014). Computational psychiatry: The brain as a phantastic organ. Lancet Psychiatry, 1, 148–158.
Froemke, R. C., & Dan, Y. (2002). Spike-timing-dependent synaptic modification induced by natural spike trains. Nature, 416, 433–438.
George, D., & Hawkins, J. (2009). Towards a mathematical theory of cortical micro-circuits. PLOS Comput. Biol., 5, e1000532.
Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. New York: Wiley.
Isaacson, J. S., & Scanziani, M. (2011). How inhibition shapes cortical activity. Neuron, 72, 231–243.
Isomura, T. (2018). A measure of information available for inference. Entropy, 20, 512.
Isomura, T., & Friston, K. (2018). In vitro neural networks minimize variational free energy. Sci. Rep., 8, 16926.
Isomura, T., Kotani, K., & Jimbo, Y. (2015). Cultured cortical neurons can perform blind source separation according to the free-energy principle. PLOS Comput. Biol., 11, e1004643.
Isomura, T., Sakai, K., Kotani, K., & Jimbo, Y. (2016). Linking neuromodulated spike-timing dependent plasticity with the free-energy principle. Neural Comput., 28, 1859–1888.
Isomura, T., & Toyoizumi, T. (2016). A local learning rule for independent component analysis. Sci. Rep., 6, 28073.
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR-15).
Knill, D. C., & Pouget, A. (2004). The Bayesian brain: The role of uncertainty in neural coding and computation. Trends Neurosci., 27, 712–719.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Ann. Math. Stat., 22, 79–86.
Kuśmierz, Ł., Isomura, T., & Toyoizumi, T. (2017). Learning with three factors: Modulating Hebbian plasticity with errors. Curr. Opin. Neurobiol., 46, 170–177.
Lee, T. W., Girolami, M., Bell, A. J., & Sejnowski, T. J. (2000). A unifying information-theoretic framework for independent component analysis. Comput. Math. Appl., 39, 1–21.
Levin, M. (2013). Reprogramming cells and tissue patterning via bioelectrical pathways: Molecular mechanisms and biomedical opportunities. Wiley Interdiscip. Rev. Syst. Biol. Med., 5, 657–676.
Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21, 105–117.
Linsker, R. (1997). A local learning rule that enables information maximization for arbitrary input distributions. Neural Comput., 9, 1661–1665.
Malenka, R. C., & Bear, M. F. (2004). LTP and LTD: An embarrassment of riches. Neuron, 44, 5–21.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Markram, H., Toledo-Rodriguez, M., Wang, Y., Gupta, A., Silberberg, G., & Wu, C. (2004). Interneurons of the neocortical inhibitory system. Nat. Rev. Neurosci., 5, 793–807.
Marr, D. (1969). A theory of cerebellar cortex. J. Physiol., 202, 437–470.
Mesgarani, N., & Chang, E. F. (2012). Selective cortical representation of attended speaker in multi-talker speech perception. Nature, 485, 233–236.
Newsome, W. T., Britten, K. H., & Movshon, J. A. (1989). Neuronal correlates of a perceptual decision. Nature, 341, 52–54.
Parr, T., Da Costa, L., & Friston, K. (2020). Markov blankets, information geometry and stochastic thermodynamics. Phil. Trans. R. Soc. A, 378, 20190159.
Pawlak, V., Wickens, J. R., Kirkwood, A., & Kerr, J. N. (2010). Timing is not everything: Neuromodulation opens the STDP gate. Front. Syn. Neurosci., 2, 146.
Rao, R. P., & Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nat. Neurosci., 2, 79–87.
Rolfe, J. T. (2016). Discrete variational autoencoders. arXiv:1609.02200.
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593–1599.
Schwartenbeck, P., & Friston, K. (2016). Computational phenotyping in psychiatry: A worked example. eNeuro, 3, e0049-16.2016.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning. Cambridge, MA: MIT Press.
Tolhurst, D. J., Movshon, J. A., & Dean, A. F. (1983). The statistical reliability of signals in single neurons in cat and monkey visual cortex. Vision Res., 23, 775–785.
Turrigiano, G. G., & Nelson, S. B. (2004). Homeostatic plasticity in the developing nervous system. Nat. Rev. Neurosci., 5, 97–107.
von Helmholtz, H. (1925). Treatise on physiological optics (Vol. 3). Washington, DC: Optical Society of America.
Wald, A. (1947). An essentially complete class of admissible decision functions. Ann. Math. Stat., 18, 549–555.
Whittington, J. C., & Bogacz, R. (2017). An approximation of the error backpropagation algorithm in a predictive coding network with local Hebbian synaptic plasticity. Neural Comput., 29, 1229–1262.
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode