Reverse-Engineering Neural Networks to Characterize Their Cost Functions

This letter considers a class of biologically plausible cost functions for neural networks, where the same cost function is minimized by both neural activity and plasticity. We show that such cost functions can be cast as a variational bound on model evidence under an implicit generative model. Using generative models based on partially observed Markov decision processes (POMDP), we show that neural activity and plasticity perform Bayesian inference and learning, respectively, by maximizing model evidence. Using mathematical and numerical analyses, we establish the formal equivalence between neural network cost functions and variational free energy under some prior beliefs about latent states that generate inputs. These prior beliefs are determined by particular constants (e.g., thresholds) that define the cost function. This means that the Bayes optimal encoding of latent or hidden states is achieved when the network's implicit priors match the process that generates its inputs. This equivalence is potentially important because it suggests that any hyperparameter of a neural network can itself be optimized—by minimization with respect to variational free energy. Furthermore, it enables one to characterize a neural network formally, in terms of its prior beliefs.


Introduction
Cost functions are ubiquitous in scientific fields that entail optimization, including physics, chemistry, biology, engineering, and machine learning. Furthermore, any optimization problem that can be specified using a cost function can be formulated as a gradient descent. In the neurosciences, this enables one to treat neuronal dynamics and plasticity as an optimization process (Marr, 1969; Albus, 1971; Schultz, Dayan, & Montague, 1997; Sutton & Barto, 1998; Linsker, 1988; Brown, Yamada, & Sejnowski, 2001). These examples highlight the importance of specifying a problem in terms of cost functions, from which neural and synaptic dynamics can be derived. In other words, cost functions provide a formal (i.e., normative) expression of the purpose of a neural network and prescribe the dynamics of that neural network. Crucially, once the cost function has been established and an initial condition has been selected, it is no longer necessary to solve the dynamics explicitly. Instead, one can characterize the neural network's behavior in terms of fixed points, basins of attraction, and structural stability, based only on the cost function. In short, it is important to identify the cost function to understand the dynamics, plasticity, and function of a neural network.
A ubiquitous cost function in neurobiology, theoretical biology, and machine learning is model evidence or, equivalently, marginal likelihood or surprise: namely, the probability of some inputs or data under a model of how those inputs were generated by unknown or hidden causes (Bishop, 2006; Dayan & Abbott, 2001). Generally, the evaluation of surprise is intractable (especially for neural networks), as it entails the logarithm of an intractable marginal (i.e., integral). However, this evaluation can be converted into an optimization problem by inducing a variational bound on surprise. In machine learning, this is known as an evidence lower bound (ELBO; Blei, Kucukelbir, & McAuliffe, 2017), while the same quantity is known as variational free energy in statistical physics and theoretical neurobiology.
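The bound can be illustrated with a minimal numerical sketch of our own construction (a two-state toy model, not drawn from the literature above): variational free energy F upper-bounds surprise, with equality exactly when the approximate posterior matches the exact posterior.

```python
import numpy as np

# Two-state toy model P(o, s) = P(o|s) P(s); values are illustrative only.
D = np.array([0.5, 0.5])           # prior P(s) over states (s = 1, s = 0)
A = np.array([0.9, 0.2])           # likelihood P(o = 1 | s) for each state
o = 1                              # a single observed outcome

lik = A if o == 1 else 1 - A       # P(o | s) for each state
surprise = -np.log(lik @ D)        # -ln P(o): the quantity the bound tracks

def free_energy(q_on):
    """F = E_q[ln q(s) - ln P(o, s)], an upper bound on surprise."""
    q = np.array([q_on, 1 - q_on])
    return q @ (np.log(q) - np.log(lik * D))

posterior = lik * D / (lik @ D)    # exact posterior P(s | o)
# F equals surprise at the exact posterior and exceeds it elsewhere.
assert np.isclose(free_energy(posterior[0]), surprise)
assert free_energy(0.3) > surprise
```

Minimizing F over q therefore evaluates the otherwise intractable surprise implicitly, which is the move exploited throughout this letter.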
Variational free energy minimization is a candidate principle that governs neuronal activity and synaptic plasticity (Friston, Kilner, & Harrison, 2006; Friston, 2010). Here, surprise reflects the improbability of sensory inputs given a model of how those inputs were caused. In turn, minimizing variational free energy, as a proxy for surprise, corresponds to inferring the (unobservable) causes of (observable) consequences. To the extent that biological systems minimize variational free energy, it is possible to say that they infer and learn the hidden states and parameters that generate their sensory inputs (von Helmholtz, 1925; Knill & Pouget, 2004; DiCarlo, Zoccolan, & Rust, 2012) and consequently predict those inputs (Rao & Ballard, 1999; Friston, 2005). This is generally referred to as perceptual inference based on an internal generative model about the external world (Dayan, Hinton, Neal, & Zemel, 1995; George & Hawkins, 2009; Bastos et al., 2012).
Variational free energy minimization provides a unified mathematical formulation of these inference and learning processes in terms of self-organizing neural networks that function as Bayes optimal encoders. Moreover, organisms can use the same cost function to control their surrounding environment by sampling predicted (i.e., preferred) inputs. This is known as active inference (Friston, Mattout, & Kilner, 2011). The ensuing free-energy principle suggests that active inference and learning are mediated by changes in neural activity, synaptic strengths, and the behavior of an organism to minimize variational free energy as a proxy for surprise. Crucially, variational free energy and model evidence rest on a generative model of continuous or discrete hidden states. A number of recent studies have used Markov decision process (MDP) generative models to elaborate schemes that minimize variational free energy (Friston, FitzGerald, Rigoli, Schwartenbeck, & Pezzulo, 2016; Friston, Parr, & de Vries, 2017; Friston, Lin et al., 2017). This minimization reproduces various interesting dynamics and behaviors of real neuronal networks and biological organisms. However, it remains to be established whether variational free energy minimization is an apt explanation for any given neural network, as opposed to the optimization of alternative cost functions.
In principle, any neural network that produces an output or a decision can be cast as performing some form of inference in terms of Bayesian decision theory. On this reading, the complete class theorem suggests that any neural network can be regarded as performing Bayesian inference under some prior beliefs; therefore, it can be regarded as minimizing variational free energy. The complete class theorem (Wald, 1947; Brown, 1981) states that for any pair of decisions and cost functions, there are some prior beliefs (implicit in the generative model) that render the decisions Bayes optimal. This suggests that it should be theoretically possible to identify an implicit generative model within any neural network architecture, which renders its cost function a variational free energy or ELBO. However, although the complete class theorem guarantees the existence of a generative model, it does not specify its form. In what follows, we show that a ubiquitous class of neural networks implements approximate Bayesian inference under a generic discrete state space model with a known form.
In brief, we adopt a reverse-engineering approach to identify a plausible cost function for neural networks and show that the resulting cost function is formally equivalent to variational free energy. Here, we define a cost function as a function of sensory input, neural activity, and synaptic strengths and suppose that neural activity and synaptic plasticity follow a gradient descent on the cost function (assumption 1). For simplicity, we consider single-layer feedforward neural networks comprising firing-rate neuron models-receiving sensory inputs weighted by synaptic strengths-whose firing intensity is determined by the sigmoid activation function (assumption 2). We focus on blind source separation (BSS), namely the problem of separating sensory inputs into multiple hidden sources or causes (Belouchrani, Abed-Meraim, Cardoso, & Moulines, 1997; Cichocki, Zdunek, Phan, & Amari, 2009; Comon & Jutten, 2010), which provides the minimum setup for modeling causal inference. A famous example of BSS is the cocktail party effect: the ability of a partygoer to disambiguate an individual's voice from the noise of a crowd (Brown et al., 2001; Mesgarani & Chang, 2012). Previously, we observed BSS performed by in vitro neural networks (Isomura, Kotani, & Jimbo, 2015) and reproduced this self-supervised process using an MDP and variational free energy minimization (Isomura & Friston, 2018). These works suggest that variational free energy minimization offers a plausible account of the empirical behavior of in vitro networks.
In this work, we ask whether variational free energy minimization can account for the normative behavior of a canonical neural network that minimizes its cost function, by considering all possible cost functions, within a generic class. Using mathematical analysis, we identify a class of cost functions-from which update rules for both neural activity and synaptic plasticity can be derived. The gradient descent on the ensuing cost function naturally leads to Hebbian plasticity (Hebb, 1949; Bliss & Lømo, 1973; Malenka & Bear, 2004) with an activity-dependent homeostatic term. We show that these cost functions are formally homologous to variational free energy under an MDP. Crucially, this means the hyperparameters (i.e., any variables or constants) of the neural network can be associated with prior beliefs of the generative model. In principle, this allows one to optimize the neural network hyperparameters (e.g., thresholds and learning rates), given some priors over the causes (i.e., latent states) of inputs to the neural network. Furthermore, estimating hyperparameters from the dynamics of (in silico or in vitro) neural networks allows one to quantify the network's implicit prior beliefs. In this letter, we focus on the mathematical foundations for applications to in vitro and in vivo neuronal networks in subsequent work.

Methods
In this section, we formulate the same computation in terms of variational Bayesian inference and neural networks to demonstrate their correspondence. We first derive the form of a variational free energy cost function under a specific generative model, a Markov decision process. We present the derivations carefully, with a focus on the form of the ensuing Bayesian belief updating. The functional form of this update will reemerge later, when reverse engineering the cost functions implicit in neural networks. These correspondences are depicted in Figure 1 and Table 1. This section starts with a description of Markov decision processes as a general kind of generative model and then considers the minimization of variational free energy under these models.

Generative Models.
Under an MDP model (see Figure 1A), a minimal BSS setup (in a discrete space) reduces to the likelihood mapping from N_s hidden sources or states to N_o observations.

Figure 1: (A) MDP scheme expressed as a Forney factor graph (Forney, 2001; Dauwels, 2007), based on the formulation in Friston, Parr et al. (2017). In this BSS setup, the prior D determines hidden states s_t, while s_t determines the observation o_t through the likelihood mapping A. Inference corresponds to the inversion of this generative process. Here, D* indicates the true prior, while D indicates the prior under which the network operates. If D = D*, the inference is optimal; otherwise, it is biased. (B) Neural network comprising a single-layer feedforward network with a sigmoid activation function. The network receives sensory inputs o_t = (o_t^(1), ..., o_t^(N_o))^T that are generated from hidden states s_t = (s_t^(1), ..., s_t^(N_s))^T and outputs neural activities x_t = (x_t1, ..., x_tN_x)^T. Here, x_tj should encode the posterior expectation about a binary state s_t^(j). In an analogy with the cocktail party effect, s_t and o_t correspond to individual speakers and auditory inputs, respectively.

Each hidden source takes a value of one (ON state) or zero (OFF state) at each time step, that is, s_t^(j) ∈ {1, 0}. Throughout this letter, j denotes the jth hidden state, while i denotes the ith observation. The probability of s_t^(j) follows a categorical distribution, P(s_t^(j)) = Cat(D^(j)), with D^(j) ∈ R^2 (see Figure 1A, top). The probability of an outcome is determined by the likelihood mapping from all hidden states to each kind of observation in terms of a categorical distribution, P(o_t^(i) | s_t, A^(i)) = Cat(A^(i)) (see Figure 1A, middle). Here, each element of the tensor A^(i) ∈ R^{2×2^{N_s}} parameterizes the probability that o_t^(i) takes a particular value, given the hidden states; the prior over these parameters is a Dirichlet distribution. We use Dirichlet distributions, as they are tractable and widely used for random variables that take a continuous value between zero and one.
Furthermore, learning the likelihood mapping leads to biologically plausible update rules, which have the form of associative or Hebbian plasticity (see below and Friston et al., 2016, for details).
We use õ ≡ (o_1, ..., o_t) and s̃ ≡ (s_1, ..., s_t) to denote sequences of observations and hidden states, respectively. With this notation in place, the generative model (i.e., the joint distribution over outcomes, hidden states, and the parameters of their likelihood mapping) can be expressed as

P(õ, s̃, A) = P(A) ∏_{τ=1}^{t} P(o_τ | s_τ, A) P(s_τ). (2.1)

Throughout this letter, t denotes the current time, and τ denotes an arbitrary time from the past to the present, 1 ≤ τ ≤ t.

Minimization of Variational Free Energy.
In this MDP scheme, the aim is to minimize surprise by minimizing variational free energy as a proxy, that is, performing approximate or variational Bayesian inference. From the generative model, we can motivate a mean-field approximation to the posterior (i.e., recognition) density as follows:

Q(s̃, A) = ∏_{j=1}^{N_s} ∏_{τ=1}^{t} Q(s_τ^(j)) ∏_{i=1}^{N_o} Q(A^(i)), (2.2)

where A^(i) is the likelihood mapping (i.e., tensor), and the marginal posterior distributions of states and parameters are a categorical distribution Q(s_τ^(j)) = Cat(s_τ^(j)) and a Dirichlet distribution Q(A^(i)) = Dir(a^(i)), respectively. For simplicity, we assume that A^(i) factorizes into the product of the likelihood mappings from the jth hidden state to the ith observation:

A^(i) ≈ A^(i,1) ⊗ ... ⊗ A^(i,N_s),

where ⊗ denotes the outer product and A^(i,j) ∈ R^{2×2}. This (mean-field) approximation simplifies the computation of state posteriors and serves to specify a particular form of Bayesian model, which corresponds to a class of canonical neural networks (see below).
In what follows, a bold-case variable indicates the posterior expectation of the corresponding italic-case variable. For example, s_τ^(j) takes the value 0 or 1, while its posterior expectation s_τ^(j) ∈ R^2 lies between zero and one. Moreover, a^(i,j) ∈ R^{2×2} denotes positive concentration parameters. Below, we use the posterior expectation of ln A^(i,j) to encode posterior beliefs about the likelihood, which is given by

ln A^(i,j)_{·l} ≡ E_{Q(A^(i,j))}[ ln A^(i,j)_{·l} ] = ψ(a^(i,j)_{·l}) − ψ(a^(i,j)_{1l} + a^(i,j)_{0l}), (2.3)

where l ∈ {1, 0}. Here, ψ(·) ≡ Γ′(·)/Γ(·) denotes the digamma function, which arises naturally from the definition of the Dirichlet distribution (see Friston et al., 2016, for details), and E_{Q(A^(i,j))}[·] denotes the expectation over the posterior of A^(i,j). The ensuing variational free energy of this generative model is then given by

F = Σ_{τ=1}^{t} Σ_{j=1}^{N_s} s_τ^(j) · ( ln s_τ^(j) − ln D^(j) − Σ_{i=1}^{N_o} ln A^(i,j) · o_τ^(i) ) + (parameter complexity), (2.4)

where ln A^(i,j) · o_τ^(i) denotes the inner product of ln A^(i,j) and a one-hot encoded observation o_τ^(i) = (o_τ^(i), 1 − o_τ^(i)), and the parameter complexity is the Kullback-Leibler divergence (Kullback & Leibler, 1951) between the Dirichlet posterior and prior over A, whose normalizing constants are beta functions. The first term in the final equality comprises the accuracy, −s_τ^(j) · Σ_i ln A^(i,j) · o_τ^(i), and the state complexity, s_τ^(j) · (ln s_τ^(j) − ln D^(j)). The accuracy term is simply the expected log likelihood of an observation, while complexity scores the divergence between prior and posterior beliefs. In other words, complexity reflects the degree of belief updating, or degrees of freedom, required to provide an accurate account of observations. Belief updates to both states and parameters incur a complexity cost: the state complexity increases with time t, while the parameter complexity increases on the order of ln t and is thus negligible when t is large (see section A.1 for details). This means that we can ignore parameter complexity when the scheme has experienced a sufficient number of outcomes. We drop the parameter complexity in subsequent sections. In the remainder of this section, we show how the minimization of variational free energy transforms (i.e., updates) priors into posteriors when the parameter complexity is evaluated explicitly.
Inference optimizes posterior expectations about the hidden states by minimizing variational free energy. The optimal posterior expectations are obtained by solving the variation of F to give

s_τ^(j) = σ( Σ_{i=1}^{N_o} ln A^(i,j) · o_τ^(i) + ln D^(j) ), (2.5)

where σ(·) is the softmax function (see Figure 1A, bottom). As s_t^(j) is a binary value in this work, the posterior expectation of s_t^(j) taking a value of one (ON state) can be expressed as

s_t1^(j) = sig( Σ_{i=1}^{N_o} ( ln A^(i,j)_{·1} − ln A^(i,j)_{·0} ) · o_t^(i) + ln D_1^(j) − ln D_0^(j) ), (2.6)

using the sigmoid function sig(z) ≡ 1/(1 + exp(−z)). Here, D_1^(j) and D_0^(j) are constants denoting the prior beliefs about hidden states. Bayes optimal encoding is obtained when, and only when, the prior beliefs match the genuine prior distribution: D^(j) = D*^(j). This concludes our treatment of inference about hidden states under this minimal scheme. Note that the updates in equation 2.5 have a biological plausibility in the sense that the posterior expectations can be associated with nonnegative sigmoid-shape firing rates (also known as neurometric functions; Tolhurst, Movshon, & Dean, 1983; Newsome, Britten, & Movshon, 1989), while the arguments of the sigmoid (softmax) function can be associated with neuronal depolarization, rendering the softmax function a voltage-firing rate activation function. (See Friston, FitzGerald et al., 2017, for a more comprehensive discussion and simulations using this kind of variational message passing to reproduce empirical phenomena, such as place fields, mismatch negativity responses, phase precession, and preplay activity in systems neuroscience.) In terms of learning, by solving the variation of F with respect to a^(i,j), the optimal posterior expectations about the parameters are given by

a^(i,j) = a_0^(i,j) + Σ_{τ=1}^{t} o_τ^(i) ⊗ s_τ^(j), (2.7)

where a_0^(i,j) denotes the prior concentration parameters. The prior of parameters is on the order of one and is thus negligible when t is large. The matrix A^(i,j) expresses the optimal posterior expectation of o_t^(i) taking the ON state when s_t^(j) takes the ON state, obtained by normalizing the columns of a^(i,j), for example,

A^(i,j)_{11} = a^(i,j)_{11} / ( a^(i,j)_{11} + a^(i,j)_{01} ). (2.8)

Although this expression may seem complicated, it is fairly straightforward.
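The sigmoid form of equation 2.6 can be checked numerically. The following sketch (with hypothetical likelihood values of our own choosing) computes the sigmoid-form posterior for a single hidden source and compares it with the exact Bayes posterior obtained by direct enumeration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
No = 8
A_on = rng.uniform(0.1, 0.9, No)    # P(o_i = 1 | s = 1), hypothetical values
A_off = rng.uniform(0.1, 0.9, No)   # P(o_i = 1 | s = 0), hypothetical values
D = np.array([0.7, 0.3])            # prior beliefs (ON, OFF)
o = rng.integers(0, 2, No)          # one observation vector

# Sigmoid-form posterior: the argument compares log evidence routed
# through the ON and OFF pathways, plus the log prior ratio.
z = (o @ np.log(A_on / A_off)
     + (1 - o) @ np.log((1 - A_on) / (1 - A_off))
     + np.log(D[0] / D[1]))
post_sig = sigmoid(z)

# Exact posterior by direct Bayes, for comparison.
lik_on = np.prod(np.where(o == 1, A_on, 1 - A_on))
lik_off = np.prod(np.where(o == 1, A_off, 1 - A_off))
post_exact = lik_on * D[0] / (lik_on * D[0] + lik_off * D[1])
assert np.isclose(post_sig, post_exact)
```

The two computations agree because the sigmoid of a log odds ratio is exactly the normalized two-state posterior.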
The posterior expectations of the likelihood simply accumulate posterior expectations about the co-occurrence of states and their outcomes. These accumulated (Dirichlet) parameters are then normalized to give a likelihood or probability. Crucially, one can observe the associative or Hebbian aspect of this belief update, expressed here in terms of the outer products between outcomes and posteriors about states in equation 2.7. We now turn to the equivalent update for neural activities and synaptic weights of a neural network.
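The accumulation-and-normalization scheme of equations 2.7 and 2.8 can be sketched as follows (a toy example with one hidden state and one outcome modality, where the state posteriors are assumed to be exact; the setup and numbers are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 5000
A_true = 0.75                       # true P(o = 1 | s = 1) in this toy example
s = rng.integers(0, 2, T)           # hidden state sequence
o = np.where(rng.random(T) < np.where(s == 1, A_true, 1 - A_true), 1, 0)

# Dirichlet concentration parameters a[l', l]: accumulated co-occurrences
# of one-hot outcomes (o, 1-o) and state posteriors (s, 1-s).
a = np.ones((2, 2))                 # flat prior, O(1) and eventually negligible
for t in range(T):
    a += np.outer([o[t], 1 - o[t]], [s[t], 1 - s[t]])   # Hebbian co-occurrence

# Normalizing a column yields the likelihood estimate, as in equation 2.8.
A_post = a[0, 0] / (a[0, 0] + a[1, 0])   # estimate of P(o = 1 | s = 1)
assert abs(A_post - A_true) < 0.05
```

The outer product in the loop is the associative (Hebbian) aspect of the belief update referred to in the text.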

Neural Activity and Hebbian Plasticity Models.
Next, we consider the neural activity and synaptic plasticity in the neural network (see Figure 1B). The generation of observations o_t is exactly the same as in the MDP model introduced in section 2.1 (see Figure 1B, top to middle). We assume that the jth neuron's activity x_tj (see Figure 1B, bottom) is given by

ẋ_tj ∝ −f(x_tj) + (W_j1 o_t + h_j1) − (W_j0 o_t + h_j0). (2.9)

We suppose that W_j1 ∈ R^{N_o} and W_j0 ∈ R^{N_o} comprise row vectors of synapses and h_j1 ∈ R and h_j0 ∈ R are adaptive thresholds that depend on the values of W_j1 and W_j0, respectively. One may regard W_j1 and W_j0 as excitatory and inhibitory synapses, respectively. We further assume that the nonlinear leakage f(·) (i.e., the leak current) is the inverse of the sigmoid function (i.e., the logit function), so that the fixed point of x_tj (i.e., the state of x_tj that gives ẋ_tj = 0) is given in the form of the sigmoid function:

x_tj = sig( W_j1 o_t + h_j1 − W_j0 o_t − h_j0 ). (2.10)

Equations 2.9 and 2.10 are a mathematical expression of assumption 2. Further, we consider a class of synaptic plasticity rules that comprise Hebbian plasticity with an activity-dependent homeostatic term as follows:

Ẇ_j1 ∝ Hebb_1(x_tj, o_t) + Home_1(x_tj),
Ẇ_j0 ∝ Hebb_0(x_tj, o_t) + Home_0(x_tj), (2.11)

where Hebb_1 and Hebb_0 denote Hebbian plasticity as determined by the product of sensory inputs and neural outputs, and Home_1 and Home_0 denote homeostatic plasticity determined by output neural activity. Equation 2.11 can be read as an ansatz: we will see below that a synaptic update rule with the functional form of equation 2.11 emerges as a natural consequence of assumption 1.
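A minimal sketch of equations 2.9 and 2.10 (with arbitrary synaptic strengths and thresholds of our own choosing) confirms that, with a logit leak, the activity relaxes to the sigmoid of the summed input:

```python
import numpy as np

def logit(x):
    return np.log(x / (1 - x))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
No = 16
o = rng.integers(0, 2, No).astype(float)        # one sensory input vector
W1 = rng.normal(0, 0.2, No)                     # hypothetical excitatory weights
W0 = rng.normal(0, 0.2, No)                     # hypothetical inhibitory weights
h1, h0 = -0.5, -0.4                             # hypothetical thresholds

# Leaky firing-rate dynamics with a logit leak:
# x' = -f(x) + (W1 o + h1) - (W0 o + h0), with f = sig^{-1}.
z = (W1 - W0) @ o + h1 - h0
x = 0.5
for _ in range(5000):
    x += 0.01 * (z - logit(x))                  # Euler step toward the fixed point

# At the fixed point, the activity is the sigmoid of the summed input.
assert np.isclose(x, sigmoid(z), atol=1e-6)
```

The leak term is what converts a linear weighted sum into a bounded, sigmoid-shaped firing rate at steady state.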
In the MDP scheme, posterior expectations about hidden states and parameters are usually associated with neural activity and synaptic strengths. Here, we can observe a formal similarity between the solutions for the state posterior (see equation 2.6) and the activity in the neural network (see equation 2.10; see also Table 1). By this analogy, x_tj can be regarded as encoding the posterior expectation of the ON state s_t1^(j), while W_j1 and W_j0 can be regarded as encoding sig^{-1}(A^(·,j)_{11}) and sig^{-1}(A^(·,j)_{10}), respectively, in the sense that they express the amplitude with which o_t influences x_tj or s_t1^(j). Here, 1 = (1, ..., 1) ∈ R^{N_o} is a vector of ones. In particular, the optimal posterior of a hidden state taking a value of one (see equation 2.6) is given by the ratio of the beliefs about ON and OFF states, expressed as a sigmoid function. Thus, to be a Bayes optimal encoder, the fixed point of neural activity needs to be a sigmoid function. This requirement is straightforwardly ensured when f(x_tj) is the inverse of the sigmoid function (see equation 2.13). Under this condition, the fixed point or solution for x_tj (see equation 2.10) compares inputs from ON and OFF pathways, and thus x_tj straightforwardly encodes the posterior of the jth hidden state being ON (i.e., x_tj → s_t1^(j)). In short, the above neural network is effectively inferring the hidden state.
If the activity of the neural network is performing inference, does the Hebbian plasticity correspond to Bayes optimal learning? In other words, does the synaptic update rule in equation 2.11 ensure that the neural activity and synaptic strengths asymptotically encode Bayes optimal posterior beliefs about hidden states (x_tj → s_t1^(j)) and parameters (W_j1 → sig^{-1}(A^(·,j)_{11})), respectively? To this end, we will identify a class of cost functions from which the neural activity and synaptic plasticity can be derived and consider the conditions under which the cost function becomes consistent with variational free energy.

Neural Network Cost Functions.
Here, we consider a class of functions that constitute a cost function for both neural activity and synaptic plasticity. We start by assuming that the update of the jth neuron's activity (see equation 2.9) is determined by the gradient of a cost function L_j: ẋ_tj ∝ −∂L_j/∂x_tj. By integrating the right-hand side of equation 2.9, we obtain a class of cost functions as

L_j = Σ_{τ=1}^{t} { ∫ f(x_τj) dx_τj − x_τj (W_j1 o_τ + h_j1) − (1 − x_τj)(W_j0 o_τ + h_j0) } + O(1), (2.12)

where the O(1) term, which depends on W_j1 and W_j0, is of a lower order than the other terms (as they are O(t)) and is thus negligible when t is large. (See section A.2 for the case where we explicitly evaluate the O(1) term to demonstrate the formal correspondence between the initial values of synaptic strengths and the parameter prior P(A).) The cost function of the entire network is defined by L ≡ Σ_{j=1}^{N_x} L_j. When f(x_τj) is the inverse of the sigmoid function, we have

∫ f(x_τj) dx_τj = x_τj ln x_τj + (1 − x_τj) ln(1 − x_τj) (2.13)

up to a constant term (ensuring f(x_τj) = sig^{-1}(x_τj)). We further assume that the synaptic weight update rule is given as a gradient descent on the same cost function L_j (see assumption 1). Thus, the synaptic plasticity is derived as follows:

Ẇ_j1 ∝ −∂L_j/∂W_j1, Ẇ_j0 ∝ −∂L_j/∂W_j0. (2.14)

Note that the update of W_j1 is not directly influenced by W_j0, and vice versa, because they encode parameters in physically distinct pathways (i.e., the updates are local learning rules; Lee, Girolami, Bell, & Sejnowski, 2000; Kusmierz, Isomura, & Toyoizumi, 2017). The update rule for W_j1 can be viewed as Hebbian plasticity mediated by an additional activity-dependent term expressing homeostatic plasticity. Moreover, the update of W_j0 can be viewed as anti-Hebbian plasticity with a homeostatic term, in the sense that W_j0 is reduced when input (o_t) and output (x_tj) fire together. The fixed points of W_j1 and W_j0 are given by solving ∂L_j/∂W_j1 = ∂L_j/∂W_j0 = 0. (2.15) Crucially, these synaptic strength updates are a subclass of the general synaptic plasticity rule in equation 2.11 (see also section A.3 for the mathematical explanation).
Therefore, if the synaptic update rule is derived from the cost function underlying neural activity, the synaptic update rule has a biologically plausible form comprising Hebbian plasticity and activity-dependent homeostatic plasticity. The updates of neural activity and synaptic strengths, via gradient descent on the cost function, enable us to associate neural and synaptic dynamics with optimization. Although the steepest descent method gives the simplest implementation, other gradient descent schemes, such as adaptive moment estimation (Adam; Kingma & Ba, 2015), can be considered while retaining the local learning property.

Comparison with Variational Free Energy.
Here, we establish a formal relationship between the cost function L and variational free energy. We define Ŵ_j1 ≡ sig(W_j1) and Ŵ_j0 ≡ sig(W_j0) as the sigmoid functions of synaptic strengths. We consider the case in which neural activity is expressed as a sigmoid function, and thus equation 2.13 holds. As W_j1 = ln Ŵ_j1 − ln(1 − Ŵ_j1), equation 2.12 becomes

L_j = Σ_{τ=1}^{t} { x_τj ln x_τj + (1 − x_τj) ln(1 − x_τj) − x_τj ( ln Ŵ_j1 · o_τ + ln(1 − Ŵ_j1) · (1 − o_τ) ) − (1 − x_τj) ( ln Ŵ_j0 · o_τ + ln(1 − Ŵ_j0) · (1 − o_τ) ) − x_τj ( h_j1 − ln(1 − Ŵ_j1) · 1 ) − (1 − x_τj) ( h_j0 − ln(1 − Ŵ_j0) · 1 ) }, (2.16)

where 1 = (1, ..., 1) ∈ R^{N_o}. One can immediately see a formal correspondence between this cost function and variational free energy (see equation 2.4). That is, when we assume that x_tj = s_t1^(j), Ŵ_j1 = A^(·,j)_{11}, and Ŵ_j0 = A^(·,j)_{10}, equation 2.16 has exactly the same form as the sum of the accuracy and state complexity, which is the leading-order term of variational free energy (see the first term in the last equality of equation 2.4).

Specifically, when the thresholds satisfy h_j1 = ln(1 − Ŵ_j1) · 1 + ln D_1^(j) and h_j0 = ln(1 − Ŵ_j0) · 1 + ln D_0^(j), equation 2.16 becomes equivalent to equation 2.4 up to the O(ln t) term (which disappears when t is large). Therefore, in this case, the fixed points of neural activity and synaptic strengths become the posteriors; thus, x_tj asymptotically becomes the Bayes optimal encoder in the large-t limit (provided the prior D matches the genuine prior D*).
In other words, we can define perturbation terms φ_j1 ≡ h_j1 − ln(1 − Ŵ_j1) · 1 and φ_j0 ≡ h_j0 − ln(1 − Ŵ_j0) · 1 as functions of W_j1 and W_j0, respectively, and can express the cost function as

L_j = Σ_{τ=1}^{t} { x_τj ln x_τj + (1 − x_τj) ln(1 − x_τj) − x_τj ( ln Ŵ_j1 · o_τ + ln(1 − Ŵ_j1) · (1 − o_τ) + φ_j1 ) − (1 − x_τj) ( ln Ŵ_j0 · o_τ + ln(1 − Ŵ_j0) · (1 − o_τ) + φ_j0 ) }. (2.17)

Here, without loss of generality, we can suppose that the constant terms in φ_j1 and φ_j0 are selected to ensure that exp(φ_j1) + exp(φ_j0) = 1. Under this condition, (exp(φ_j1), exp(φ_j0)) can be viewed as the prior belief about hidden states,

D^(j) = ( D_1^(j), D_0^(j) ) = ( exp(φ_j1), exp(φ_j0) ), (2.18)

and thus equation 2.17 is formally equivalent to the accuracy and state complexity terms of variational free energy. This means that when the prior belief about states D^(j) is a function of the parameter posteriors A^(·,j), the general cost function under consideration can be expressed in the form of variational free energy, up to the O(ln t) term. A generic cost function L is suboptimal from the perspective of Bayesian inference unless φ_j1 and φ_j0 are tuned appropriately to express the unbiased (i.e., optimal) prior belief. In this BSS setup, φ_j1 = φ_j0 = const is optimal; thus, a generic L would asymptotically give an upper bound on variational free energy with the optimal prior belief about states when t is large.
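This correspondence can be verified numerically: for a single neuron and a single time step, the minimizer of the cost in equation 2.17 is the sigmoid-form Bayes posterior under the prior D = (exp(φ_j1), exp(φ_j0)). The following sketch uses hypothetical weights and observations of our own choosing:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
No = 8
W1hat = rng.uniform(0.2, 0.8, No)   # sig(W_j1): hypothetical ON-pathway likelihoods
W0hat = rng.uniform(0.2, 0.8, No)   # sig(W_j0): hypothetical OFF-pathway likelihoods
o = rng.integers(0, 2, No).astype(float)
phi1, phi0 = np.log(0.3), np.log(0.7)   # perturbation terms, exp(phi1) + exp(phi0) = 1

# Log evidence routed through the ON and OFF pathways.
acc1 = o @ np.log(W1hat) + (1 - o) @ np.log(1 - W1hat)
acc0 = o @ np.log(W0hat) + (1 - o) @ np.log(1 - W0hat)

def L(x):
    """Single-neuron, single-time-step cost in the form of equation 2.17."""
    entropy = x * np.log(x) + (1 - x) * np.log(1 - x)
    return entropy - x * (acc1 + phi1) - (1 - x) * (acc0 + phi0)

# Minimize L by a fine grid search over the unit interval.
xs = np.linspace(1e-4, 1 - 1e-4, 100001)
x_min = xs[np.argmin(L(xs))]
# The minimizer is the sigmoid posterior with prior (exp(phi1), exp(phi0)).
assert abs(x_min - sigmoid(acc1 - acc0 + phi1 - phi0)) < 1e-3
```

Setting ∂L/∂x = 0 gives logit(x) = acc1 − acc0 + φ_j1 − φ_j0, which is exactly the argument of the sigmoid above; the grid search merely confirms this.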

Analysis of Synaptic Update Rules.
To explicitly solve for the fixed points of W_j1 and W_j0 that provide the global minimum of L, we suppose that φ_j1 and φ_j0 are linear functions of W_j1 and W_j0, respectively, given by

φ_j1 = α_j1 + β_j1 · W_j1, φ_j0 = α_j0 + β_j0 · W_j0, (2.19)

where α_j1, α_j0 ∈ R and β_j1, β_j0 ∈ R^{N_o} are constants. By solving the variation of L with respect to W_j1 and W_j0, we find the fixed point of synaptic strengths as

Ŵ_j1 = ( Σ_τ x_τj o_τ ) / ( Σ_τ x_τj ) + β_j1, Ŵ_j0 = ( Σ_τ (1 − x_τj) o_τ ) / ( Σ_τ (1 − x_τj) ) + β_j0. (2.20)

Since the update from t to t + 1 is expressed as a small change of sig^{-1}(Ŵ_j1), we recover the following synaptic plasticity:

ΔW_j1 ∝ ( Ŵ_j1 ⊙ (1 − Ŵ_j1) )^{-1} ⊙ x_tj ( o_t − Ŵ_j1 + β_j1 ), (2.21)

comprising a Hebbian term (the product x_tj o_t) and an activity-dependent homeostatic term, where ⊙ denotes the elementwise (Hadamard) product and (Ŵ_j1 ⊙ (1 − Ŵ_j1))^{-1} denotes the elementwise inverse of Ŵ_j1 ⊙ (1 − Ŵ_j1). This synaptic plasticity rule is a subclass of the general synaptic plasticity rule in equation 2.11.
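The fixed point in equation 2.20 can be sanity-checked numerically in the β = 0 case, where it reduces to a Hebbian ratio of coactivation to total output activity (the activities and inputs below are arbitrary, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(4)
T, No = 2000, 4
x = rng.random(T)                        # arbitrary posterior activities in (0, 1)
o = rng.integers(0, 2, (T, No)).astype(float)   # arbitrary binary inputs

def L_W(W1hat):
    """W-dependent part of the cost for the ON pathway (beta = 0 case)."""
    return -(x[:, None] * (o * np.log(W1hat)
                           + (1 - o) * np.log(1 - W1hat))).sum()

# Analytic fixed point (the beta = 0 case of equation 2.20):
# a Hebbian ratio of input-output coactivation to total output activity.
W1hat_star = (x @ o) / x.sum()

# The analytic fixed point should beat nearby perturbations.
eps = 0.01
assert L_W(W1hat_star) < L_W(np.clip(W1hat_star + eps, 1e-3, 1 - 1e-3))
assert L_W(W1hat_star) < L_W(np.clip(W1hat_star - eps, 1e-3, 1 - 1e-3))
```

Because L_W is strictly convex in each element of Ŵ_j1, the Hebbian ratio is the unique minimizer, which is what the perturbation test confirms.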
In summary, we demonstrated that under a few minimal assumptions and ignoring small contributions to weight updates, the neural network under consideration can be regarded as minimizing an approximation to model evidence because the cost function can be formulated in terms of variational free energy. In what follows, we will rehearse our analytic results and then use numerical analyses to illustrate Bayes optimal inference (and learning) in a neural network when, and only when, it has the right priors.

Analytical Form of Neural Network Cost Functions.
The analysis in the preceding section rests on the following assumptions:
1. Updates of neural activity and synaptic weights are determined by a gradient descent on a cost function L.
2. Neural activity is updated by the weighted sum of sensory inputs, and its fixed point is expressed as the sigmoid function.
Under these assumptions, we can express the cost function for a neural network as follows (see equation 2.17):

L_j = Σ_{τ=1}^{t} { x_τj ln x_τj + (1 − x_τj) ln(1 − x_τj) − x_τj ( ln Ŵ_j1 · o_τ + ln(1 − Ŵ_j1) · (1 − o_τ) + φ_j1 ) − (1 − x_τj) ( ln Ŵ_j0 · o_τ + ln(1 − Ŵ_j0) · (1 − o_τ) + φ_j0 ) },

where Ŵ_j1 = sig(W_j1) and Ŵ_j0 = sig(W_j0) hold, and φ_j1 and φ_j0 are functions of W_j1 and W_j0, respectively. The log-likelihood function (accuracy term) and the divergence of hidden states (complexity term) of variational free energy emerge naturally under the assumption of a sigmoid activation function (assumption 2). The additional terms φ_j1 and φ_j0 express the state prior, ln D^(j) = (φ_j1, φ_j0), indicating that a generic cost function L is variational free energy under a (potentially suboptimal) prior belief about hidden states. This prior alters the landscape of the cost function and thus provides biased solutions for neural activities and synaptic strengths, which differ from the Bayes optimal encoders.
For analytical tractability, we further assume the following:
3. The perturbation terms (φ_j1 and φ_j0) that constitute the difference between the cost function and variational free energy with optimal prior beliefs can be expressed as linear equations of W_j1 and W_j0.
From assumption 3, equation 2.17 becomes

L_j = Σ_{τ=1}^{t} { x_τj ln x_τj + (1 − x_τj) ln(1 − x_τj) − x_τj ( ln Ŵ_j1 · o_τ + ln(1 − Ŵ_j1) · (1 − o_τ) + α_j1 + β_j1 · W_j1 ) − (1 − x_τj) ( ln Ŵ_j0 · o_τ + ln(1 − Ŵ_j0) · (1 − o_τ) + α_j0 + β_j0 · W_j0 ) }, (3.1)

where α_j1, α_j0, β_j1, β_j0 are constants. The cost function has degrees of freedom with respect to the choice of these constants, which correspond to the prior belief about states D^(j). The neural activity and synaptic strengths that give the minimum of a generic physiological cost function L are biased by these constants, which may be analogous to physiological constraints (see section 4 for details). The cost function of the neural networks considered is characterized only by φ_j. Thus, after fixing φ_j by fixing the constants α_j1, α_j0 and β_j1, β_j0, the remaining degrees of freedom are the initial synaptic weights. These correspond to the prior distribution of parameters P(A) in the variational Bayesian formulation (see section A.2).
The fixed point of synaptic strengths that gives the minimum of L is given analytically by equation 2.20, which shows that β_j1, β_j0 shift the center of the nonlinear mapping from Hebbian products to synaptic strengths away from the optimal position (shown in equation 2.8). As shown in equation 2.14, the derivative of L with respect to W_j1 and W_j0 recovers the synaptic update rules that comprise Hebbian and activity-dependent homeostatic terms. Although equation 2.14 expresses the dynamics of synaptic strengths that converge to the fixed point, it is consistent with a plasticity rule that gives the synaptic change from t to t + 1 (see equation 2.21).
In short, we identify a class of biologically plausible cost functions from which the update rules for both neural activity and synaptic plasticity can be derived. When the activation function for neural activity is a sigmoid function, a cost function in this class is expressed straightforwardly as variational free energy. With respect to the choice of constants expressing physiological constraints in the neural network, the cost function has degrees of freedom that may be viewed as (potentially suboptimal) prior beliefs from the Bayesian perspective. Now, we illustrate the implicit inference and learning in neural networks through simulations of BSS.

Numerical Simulations.
Here, we simulated the dynamics of neural activity and synaptic strengths when they followed a gradient descent on the cost function in equation 3.1. We considered a BSS comprising two hidden sources (or states) and 32 observations (or sensory inputs), formulated as an MDP. The two hidden sources show four patterns: s_t = s_t^(1) ⊗ s_t^(2) ∈ {(0, 0), (1, 0), (0, 1), (1, 1)}. An observation o_t^(i) was generated through the likelihood mapping A^(i). Here, for example, A^(i)_{1,(1,0)} = 3/4 for 1 ≤ i ≤ 16 is the probability of o_t^(i) taking one when s_t = (1, 0). The remaining elements were given by A^(i)_{0·} = 1 − A^(i)_{1·}. The hidden sources were sampled according to the true prior D* = (0.5, 0.5), while the implicit state priors employed by the neural network were varied between zero and one. Synaptic strengths were initialized at values close to zero. The simulations proceeded over T = 10^4 time steps. The simulations and analyses were conducted using Matlab. Notably, this simulation setup is exactly the same experimental setup as the one we used for in vitro neural networks (Isomura et al., 2015; Isomura & Friston, 2018). We leverage this setup to clarify the relationship among our empirical work, a feedforward neural network model, and variational Bayesian formulations.
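The generative process can be sketched as follows (our reading of the setup, stated as an assumption: the first 16 observations preferentially report source 1 and the last 16 report source 2, each with 3/4 fidelity):

```python
import numpy as np

rng = np.random.default_rng(5)
T, No = 10_000, 32

# Two binary sources sampled from the flat prior D* = (0.5, 0.5).
s = rng.integers(0, 2, (T, 2))

# Likelihood mapping: observations 1-16 track source 1 and observations
# 17-32 track source 2, each correctly with probability 3/4.
drive = np.concatenate([np.repeat(s[:, :1], 16, axis=1),
                        np.repeat(s[:, 1:], 16, axis=1)], axis=1)
p = np.where(drive == 1, 0.75, 0.25)
o = (rng.random((T, No)) < p).astype(float)

# Sanity check: each observation channel correlates with its own source
# (for a 3/4-fidelity binary channel, the correlation is about 0.5).
c = np.corrcoef(o[:, 0], s[:, 0])[0, 1]
assert 0.4 < c < 0.6
```

The network's task is to recover s from o alone; the 3/4 fidelity makes each channel individually noisy but the ensemble highly informative.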
First, as in Isomura and Friston (2018), we demonstrated that a network whose cost function has the optimized constants (α_j1, α_j0) = (−ln 2, −ln 2) and (β_j1, β_j0) = (0, 0) can perform BSS successfully (see Figure 2). After training, the responses of neuron 1 tracked source 1, indicating that neuron 1 had learned to encode source 1 (see Figure 2A). Meanwhile, neuron 2 learned to infer source 2 (see Figure 2B). During training, synaptic plasticity followed a gradient descent on the cost function (see Figures 2C and 2D). This demonstrates that minimization of the cost function, with optimal constants, is equivalent to variational free energy minimization and hence is sufficient to emulate BSS. This process establishes a concise representation of the hidden causes and maximizes the information retained in the neural network (Linsker, 1988; Isomura, 2018).
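The gradient descent itself can be sketched as follows. This is a hedged reading of equations 2.9 and 2.14 rather than a transcription: we assume the response is a sigmoid of the synaptic input plus a threshold set by α, and that the weight update comprises a Hebbian product and an activity-dependent homeostatic decay; the function names and learning rate are ours.

```python
import numpy as np

def sig(v):
    return 1.0 / (1.0 + np.exp(-v))

def respond(o_t, W1, W0, alpha=(-np.log(2), -np.log(2))):
    """Neural activity x_t, encoding the state posterior Q(s_t).

    alpha = (-ln 2, -ln 2) corresponds to the optimal (flat) state prior;
    the threshold term is the log-odds of the implicit prior.
    """
    return sig(W1 @ o_t + W0 @ (1.0 - o_t) + (alpha[0] - alpha[1]))

def plasticity(o_t, x_t, W1, W0, lr=0.01):
    """Hebbian update with an activity-dependent homeostatic term.

    At the fixed point, sig(W1) equals the Hebbian ratio <x o> / <x>.
    """
    W1 += lr * (np.outer(x_t, o_t) - x_t[:, None] * sig(W1))
    W0 += lr * (np.outer(x_t, 1.0 - o_t) - x_t[:, None] * sig(W0))
    return W1, W0

# One step of inference and learning on a dummy input (all observations ON):
W1, W0 = np.zeros((2, 32)), np.zeros((2, 32))
o_t = np.ones(32)
x_t = respond(o_t, W1, W0)
W1, W0 = plasticity(o_t, x_t, W1, W0)
```

Iterating these two steps over the input sequence implements the joint minimization of L by activity and plasticity described above.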
Next, we quantified the dependency of BSS performance on the form of the cost function by varying the above-mentioned constants (see Figure 3). We varied (α_j1, α_j0) over the range 0.05 ≤ exp α_j1 ≤ 0.95, while maintaining exp α_j1 + exp α_j0 = 1, and found that changing (α_j1, α_j0) from (−ln 2, −ln 2) led to a failure of BSS. With optimal α, neuron 1 encodes source 1: the correlation between source 1 and the response of neuron 1 is close to one, while the correlation between source 2 and the response of neuron 1 is nearly zero. With suboptimal α, these correlations fall to around 0.5, indicating that the response of neuron 1 encodes a mixture of sources 1 and 2 (see Figure 3A). Moreover, a failure of BSS can be induced when the elements of β take values far from zero (see Figure 3B). When the elements of β are generated from a zero-mean gaussian distribution, the accuracy of BSS (measured using the correlation between sources and responses) decreases as the standard deviation increases.

Figure 2: Emergence of response selectivity for a source. (A) Evolution of the responses of neuron 1, which learns to encode source 1, in the sense that the response is high when source 1 takes a value of one (red dots) and low when source 1 takes a value of zero (blue dots). Lines correspond to smoothed trajectories obtained using a discrete cosine transform. (B) Emergence of the response of neuron 2, which learns to encode source 2. These results indicate that the neural network succeeded in separating the two independent sources. (C) Neural network cost function L, computed from equation 3.1 and plotted against the averaged synaptic strengths, where W_11_avg1 (z-axis) is the average of elements 1 to 16 of W_11 and W_11_avg2 (x-axis) is the average of elements 17 to 32 of W_11. The red line depicts the trajectory of the averaged synaptic strengths. (D) Trajectory of synaptic strengths. Black lines show elements of W_11; magenta and cyan lines indicate W_11_avg1 and W_11_avg2, respectively.

Figure 3: The left panels show the magnitudes of the correlations between sources and the response of a neuron expected to encode source 1: |corr(s_t^(1), x_t1)| and |corr(s_t^(2), x_t1)|. The right panels show the magnitudes of the correlations between sources and the response of a neuron expected to encode source 2: |corr(s_t^(1), x_t2)| and |corr(s_t^(2), x_t2)|. (A) Dependence on the constant α, which controls the excitability of a neuron, when β is fixed to zero. The dashed line (0.5) indicates the optimal value of exp α_j1. (B) Dependence on the constant β when α is fixed at (α_j1, α_j0) = (−ln 2, −ln 2). Elements of β were randomly generated from a gaussian distribution with zero mean, and the standard deviation of β was varied (horizontal axis), where zero deviation is optimal. Lines and shaded areas indicate the mean and standard deviation of the source-response correlation, evaluated over 50 different sequences.
Our numerical analysis, under assumptions 1 to 3, shows that a network needs to employ a cost function entailing optimal prior beliefs to perform BSS or, equivalently, causal inference. Such a cost function is obtained when its constants (those that do not appear in the variational free energy under the optimal generative model for BSS) become negligible. The important message here is that, in this setup, a cost function equivalent to variational free energy is necessary for Bayes optimal inference (Friston et al., 2006; Friston, 2010).

Phenotyping Networks.
We have shown that variational free energy (under the MDP scheme) is formally homologous to the class of biologically plausible cost functions found in neural networks. The neural network's parameters φ_j = ln D^(j) determine how the synaptic strengths change depending on the history of sensory inputs and neural outputs; thus, the choice of φ_j provides degrees of freedom in the shape of the cost functions under consideration, which determine the purpose or function of the neural network. Among the possible φ_j, only φ_j = (−ln 2, −ln 2) renders the cost function a variational free energy with optimal prior beliefs for BSS. Hence, one can regard neural networks of the sort considered in this letter (single-layer feedforward networks that minimize their cost function) as performing approximate Bayesian inference under priors that may or may not be optimal. This is as predicted by the complete class theorem (Wald, 1947; Brown, 1981), which implies that any response of a neural network is Bayes optimal under some prior beliefs (and cost function). In principle, then, any neural network of this kind is optimal when its prior beliefs are consistent with the process that generates its outcomes. This perspective indicates the possibility of characterizing a neural network model, and indeed a real neuronal network, in terms of its implicit prior beliefs.
One can pursue this analysis further and model the responses or decisions of a neural network using the Bayes optimal MDP scheme under different priors; the priors in the MDP scheme can then be adjusted to maximize the likelihood of empirical responses. This sort of approach has been used in systems neuroscience to characterize choice behavior in terms of subject-specific priors. From a practical perspective, understanding the formal relationship between cost functions and variational free energy enables one to specify the optimal value of any free parameter needed to realize some function. In the present setting, we can effectively optimize the constants by updating the priors themselves such that they minimize the variational free energy for BSS. Under the Dirichlet form for the priors, the implicit threshold constants of the objective function can then be optimized using the following updates: (3.3). In effect, this update simply increments the Dirichlet concentration parameters of the prior in proportion to the temporal summation of the posterior expectations about the hidden states. Therefore, by committing to cost functions that underlie variational inference and learning, any free parameter can be updated in a Bayes optimal fashion when a suitable generative model is available.
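Equation 3.3 itself is not reproduced in this excerpt, but its described effect (accumulating posterior expectations into the Dirichlet concentration parameters of the state prior) can be sketched as follows; the exact form in the letter may differ, and the variable names here are ours.

```python
import numpy as np

def update_prior(d0, x):
    """Accumulate posterior expectations into the Dirichlet prior over states.

    d0: initial concentration parameters (ON, OFF) for one neuron
    x:  posterior ON-expectations of that neuron over time
    Returns updated concentrations d and thresholds phi = ln E[D].
    """
    d = d0 + np.array([x.sum(), (1.0 - x).sum()])
    phi = np.log(d / d.sum())
    return d, phi

# A neuron whose posteriors average 1/2 recovers the optimal threshold -ln 2.
d, phi = update_prior(np.ones(2), np.full(1000, 0.5))
```

The point of the sketch is that the threshold constants are not fixed hyperparameters: they are themselves updated, in a Bayes optimal fashion, by the same free energy minimization.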

Reverse-Engineering Implicit Prior Beliefs.
Another important situation from a neuroscience perspective arises when belief updating in a neural network is slow relative to experimental observations. In this case, the implicit prior beliefs can be viewed as fixed over a short period of time. This is likely when the firing threshold is determined by homeostatic plasticity operating over longer timescales (Turrigiano & Nelson, 2004).
The considerations in the previous section speak to the possibility of using empirically observed neuronal responses to infer implicit prior beliefs. The synaptic weights (W j1 , W j0 ) can be estimated statistically from response data, through equation 2.20. By plotting their trajectory over the training period as a function of the history of a Hebbian product, one can estimate the cost function constants. If these constants express a near-optimal φ j , it can be concluded that the network has, effectively, the right sort of priors for BSS. As we have shown analytically and numerically, a cost function with α j1 , α j0 far from (− ln 2, − ln 2) or a large deviation of β j1 , β j0 fails as a Bayes optimal encoder for BSS. Since actual neuronal networks can perform BSS (Isomura et al., 2015;Isomura & Friston, 2018), one would envisage that the implicit cost function will exhibit a near-optimal φ j .
In particular, when φ_j can be viewed as a constant, that is, φ_j = (α_j1, α_j0)^T during a period of experimental observation, the characterization of thresholds is fairly straightforward: using empirically observed neuronal responses, through variational free energy minimization under the constraint e^φ_j1 + e^φ_j0 = 1, the estimator of φ_j is obtained as follows: (3.4). Crucially, equation 3.4 is exactly the same as equation 3.3 up to a negligible term d^(j). However, equation 3.3 represents the adaptation of a neural network, while equation 3.4 expresses Bayesian inference about φ_j based on empirical data. We applied equation 3.4 to sequences of neural activity generated by the synthetic neural networks used in the simulations reported in Figures 2 and 3 and confirmed that the estimator was a good approximation to the true φ_j (see Figure 4A). This characterization enables the identification of implicit prior beliefs (D) and, consequently, the reconstruction of the neural network cost function, that is, variational free energy (see Figure 4B). Furthermore, because a canonical neural network is supposed to use the same prior beliefs for the same sort of task, one can use the reconstructed cost function to predict subsequent inference and learning without observing neural activity (see Figure 4C). These results highlight the utility of reverse-engineering neural networks to predict their activity, plasticity, and assimilation of input data.
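Given only a sequence of responses, an estimator of this kind can be sketched as follows. This is our reading, not the letter's exact expression for equation 3.4: we assume that, under the constraint e^φ_j1 + e^φ_j0 = 1, the free-energy-minimizing estimate reduces to the log of the time-averaged response and its complement.

```python
import numpy as np

def estimate_phi(x):
    """Estimate the implicit threshold constants phi_j = (phi_j1, phi_j0)
    from empirically observed responses x of neuron j (values in (0, 1))."""
    m = x.mean()
    return np.log([m, 1.0 - m])  # satisfies e^phi_1 + e^phi_0 = 1 by construction

# A neuron that responds at rate 1/2 is assigned the optimal prior (-ln 2, -ln 2);
# a biased neuron reveals a biased implicit prior.
phi_flat = estimate_phi(np.full(5000, 0.5))
phi_bias = estimate_phi(np.full(5000, 0.9))
```

Applying such an estimator to recorded activity is the "reverse engineering" step: the recovered φ_j specifies the prior D and hence the implicit cost function.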

Discussion
In this work, we investigated a class of biologically plausible cost functions for neural networks. We considered a single-layer feedforward neural network with a sigmoid activation function, receiving sensory inputs generated by hidden states (i.e., a BSS setup). We identified a class of cost functions by assuming that neural activity and synaptic plasticity minimize a common function L. The derivative of L with respect to synaptic strengths furnishes a synaptic update rule following Hebbian plasticity, equipped with activity-dependent homeostatic terms. We have shown that the dynamics of a single-layer feedforward neural network that minimizes its cost function is asymptotically equivalent to that of variational Bayesian inference under a particular, but generic, latent variable generative model. Hence, the cost function of the neural network can be viewed as variational free energy, and the biological constraints that characterize the neural network (in the form of thresholds and neuronal excitability) become prior beliefs about hidden states. This relationship holds regardless of the true generative process of the external world. In short, this equivalence suggests that any neural and synaptic dynamics (in the class considered) have functional meaning, and that any neural network variables and constants can be formally associated with quantities in the variational Bayesian formulation, implying that Bayesian inference is a universal characterization of canonical neural networks. According to the complete class theorem, any dynamics that minimizes a cost function can be viewed as performing Bayesian inference under some prior beliefs (Wald, 1947; Brown, 1981). This implies that any neural network whose activity and plasticity minimize the same cost function can be cast as performing Bayesian inference.
Moreover, when a system has reached a (possibly nonequilibrium) steady state, the conditional expectation of the internal states of an autonomous system can be shown to parameterize a posterior belief over the hidden states of the external milieu (Friston, 2013, 2019; Parr, Da Costa, & Friston, 2020). Again, this suggests that any (nonequilibrium) steady state can be interpreted as realizing some elemental Bayesian inference.
Having said this, we note that identifying the implicit generative model that underwrites any (e.g., neural network) cost function is a more delicate problem, one that we have addressed in this work. In other words, it is a mathematical truism that certain systems can always be interpreted as minimizing a variational free energy under some prior beliefs (i.e., a generative model). However, this does not mean it is possible to identify the generative model simply by looking at systemic dynamics. To do so, one has to commit to a particular form of the model, so that the sufficient statistics of posterior beliefs are well defined. We have focused on discrete latent variable models that can be regarded as special (reduced) cases of partially observable Markov decision processes (POMDPs).
Note that because our treatment is predicated on the complete class theorem (Wald, 1947; Brown, 1981), the same conclusions should, in principle, be reached when using continuous state-space models, such as hierarchical predictive coding models (Friston, 2008; Whittington & Bogacz, 2017; Ahmadi & Tani, 2019). Within the class of discrete state-space models, it is fairly straightforward to generate continuous outcomes from discrete latent states, as exemplified by discrete variational autoencoders (Rolfe, 2016) or mixed models, as described in Friston, Parr et al. (2017). We have described the generative model in terms of an MDP; however, we ignored state transitions. This means the generative model in this letter reduces to a simple latent variable model, with categorical states and outcomes. We have considered MDP models because they predominate in descriptions of variational (Bayesian) belief updating (e.g., Friston, FitzGerald et al., 2017). Clearly, many generative processes entail state transitions, leading to hidden Markov models (HMM). When state transitions depend on control variables, we have an MDP, and when states are only partially observed, we have a partially observed MDP (POMDP). Dealing with these general cases requires extensions of the current framework, which we hope to consider in future work, perhaps with recurrent neural networks.
Our theory implies that Hebbian plasticity is a corollary (or realization) of cost function minimization. In particular, Hebbian plasticity with a homeostatic term emerges naturally from a gradient descent on a neural network cost function defined via the integral of neural activity. In other words, the integral of the synaptic input W_j1 o_t in equation 2.9 yields x_tj W_j1 o_t, and its derivative yields the Hebbian product x_tj o_t^T in equation 2.14. This relationship indicates that this form of synaptic plasticity is natural for canonical neural networks. In contrast, naive Hebbian plasticity (without a homeostatic term) fails to perform BSS because it updates synapses under false prior beliefs (see Figure 3). It is well known that a modification of Hebbian plasticity is necessary to realize BSS (Földiák, 1990; Linsker, 1997; Isomura & Toyoizumi, 2016), speaking to the importance of selecting the right priors for BSS.
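The role of the homeostatic term can be checked numerically. Under our reading of equation 2.14, repeated gradient steps drive the sigmoid of a weight to the Hebbian ratio Σ_t x_t o_t / Σ_t x_t, the optimal likelihood estimate of equation 2.8; the learning rate and the synthetic data here are illustrative.

```python
import numpy as np

def sig(w):
    return 1.0 / (1.0 + np.exp(-w))

rng = np.random.default_rng(1)
x = rng.random(5000)                         # postsynaptic responses
o = (rng.random(5000) < 0.3).astype(float)   # one presynaptic input channel

# Batch gradient descent on L for a single weight:
# Hebbian term <x o> minus homeostatic term <x> sig(w).
w = 0.0
for _ in range(500):
    w += np.mean(x * o) - np.mean(x) * sig(w)

hebbian_ratio = (x * o).sum() / x.sum()      # fixed point predicted by equation 2.8
```

Without the homeostatic decay term, `w` would grow without bound under the same input statistics, which is the failure mode of naive Hebbian plasticity noted above.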
The proposed equivalence between neural networks and Bayesian inference may offer insights into designing neural network architectures and synaptic plasticity rules to perform a given task-by selecting the right kind of prior beliefs-while retaining their biological plausibility. An interesting extension of the proposed framework is an application to spiking neural networks. Earlier work has highlighted relationships between spiking neural networks and statistical inference (Bourdoukan, Barrett, Deneve, & Machens, 2012;Isomura, Sakai, Kotani, & Jimbo, 2016). The current approach might be in a position to formally link spiking neuron models and spike-timing dependent plasticity (Markram et al., 1997;Bi & Poo, 1998;Froemke & Dan, 2002;Feldman, 2012) with variational Bayesian inference.
One can understand the nature of the constants (α_j1, α_j0, β_j1, β_j0) from the biological and Bayesian perspectives as follows. (α_j1, α_j0) determine the firing threshold and thus control the mean firing rate. In other words, these parameters control the amplitude of excitatory and inhibitory inputs, which may be analogous to the roles of GABAergic inputs (Markram et al., 2004; Isaacson & Scanziani, 2011) and neuromodulators (Pawlak, Wickens, Kirkwood, & Kerr, 2010; Frémaux & Gerstner, 2016) in biological neuronal networks. At the same time, (α_j1, α_j0) encode prior beliefs about states, which exert a large influence on the state posterior. The state posterior is biased if (α_j1, α_j0) are selected in a suboptimal manner relative to the process that generates the inputs. Meanwhile, (β_j1, β_j0) determine the accuracy of the synaptic strengths that represent the likelihood mapping of an observation o_t^(i) taking a value of one (ON state) given the hidden states (compare equations 2.8 and 2.20). Under the usual MDP setup, in which the state prior does not depend on the parameter posterior, the encoder becomes Bayes optimal if and only if (β_j1, β_j0) = (0, 0). These constants can represent biological constraints on synaptic strengths, such as the range of spine growth, fluctuations of spines, or the effect of synaptic plasticity induced by spontaneous activity independent of external inputs. Although the fidelity of each synapse is limited by such internal fluctuations, the accumulation of information over a large number of synapses should allow accurate encoding of hidden states in the current formulation.
In previous reports, we have shown that in vitro neural networks, comprising a cortical cell culture, perform BSS when receiving electrical stimulation generated from two hidden sources (Isomura et al., 2015). Furthermore, we showed that minimizing variational free energy under an MDP is sufficient to reproduce the learning observed in an in vitro network (Isomura & Friston, 2018). Our framework for identifying biologically plausible cost functions could therefore help identify the principles that underlie learning or adaptation in biological neuronal networks, using empirical response data. Here, we illustrated this potential in terms of the choice of the function φ_j in the cost function L. In particular, if φ_j is close to the constant (−ln 2, −ln 2), the cost function is expressed straightforwardly as a variational free energy with small state prior biases. In future work, we plan to apply this scheme to empirical data and examine the biological plausibility of variational free energy minimization.
The correspondence highlighted in this work enables one to identify the generative model (comprising likelihood and priors) that a neural network is using. The formal correspondence between the neural network and variational Bayesian formulations rests on the asymptotic equivalence between the neural network's cost function and variational free energy (under some priors). Although variational free energy can take an arbitrary form, the correspondence provides biologically plausible constraints for neural networks that implicitly encode prior distributions. Hence, this formulation is potentially useful for identifying the implicit generative models that underlie the dynamics of real neuronal circuits. In other words, one can quantify the dynamics and plasticity of a neuronal circuit in terms of variational Bayesian inference and learning under an implicit generative model.
Minimization of the cost function can render the neural network optimal in a Bayesian sense, including the choice of the prior, as described in the previous section. The dependence between the likelihood function and the state prior vanishes when the network uses an optimal threshold to perform inference, provided the true generative process does not itself involve such a dependence; in other words, the dependence arises from a suboptimal choice of the prior. Indeed, any free parameters or constraints in a neural network can be optimized by minimizing variational free energy, because only the variational free energy with optimal priors (those that match the true generative process of the external world) attains the global minimum within the class of neural network cost functions under consideration. This is an interesting observation because it suggests that the global minimum of this class of cost functions (which determine neural network dynamics) is characterized by, and only by, the statistical properties of the external world. This implies that the recapitulation of external dynamics is an inherent feature of canonical neural systems.
Finally, the free energy principle and complete class theorem imply that any brain function can be formulated in terms of variational Bayesian inference. Our reverse engineering may enable the identification of neuronal substrates or process models underlying brain functions by identifying the implicit generative model from empirical data. Unlike conventional connectomics (based on functional connectivity), reverse engineering furnishes a computational architecture (e.g., neural network), which encompasses neural activity, synaptic plasticity, and behavior. This may be especially useful for identifying neuronal mechanisms that underlie neurological or psychiatric disorders-by associating pathophysiology with false prior beliefs that may be responsible for things like hallucinations and delusions (Fletcher & Frith, 2009;Friston, Stephan, Montague, & Dolan, 2014).
In summary, we first identified a class of biologically plausible cost functions for neural networks that underlie changes in both neural activity and synaptic plasticity. We then identified an asymptotic equivalence between these cost functions and the cost functions used in variational Bayesian formulations. Given this equivalence, changes in the activity and synaptic strengths of a neuronal network can be viewed as Bayesian belief updating, namely, a process of transforming priors over hidden states and parameters into the corresponding posteriors. Hence, a cost function in this class becomes Bayes optimal when the activity thresholds correspond to appropriate priors in the implicit generative model. In short, the neural and synaptic dynamics of neural networks can be cast as inference and learning under a variational Bayesian formulation. This is potentially important for two reasons. First, it means that for any neural network (in the class considered) there are threshold parameters that can be optimized for applications to data when there are precise prior beliefs about the process generating those data. Second, by virtue of the complete class theorem, one can reverse-engineer the priors that any neural network is adopting. This is interesting when real neuronal networks can be modeled using neural networks of the class we have considered: if one can fit neuronal responses using a neural network model parameterized in terms of threshold constants, it becomes possible to evaluate the implicit priors using the above equivalence. This may find useful application to in vitro (or in vivo) neuronal networks (Isomura & Friston, 2018; Levin, 2013) or, indeed, to dynamic causal modeling of distributed neuronal responses from noninvasive data (Daunizeau, David, & Stephan, 2011). In this context, the neural network can, in principle, be used as a dynamic causal model to estimate threshold constants and implicit priors.
This "reverse engineering" amounts to estimating the priors used by real neuronal systems under ideal Bayesian assumptions, sometimes referred to as meta-Bayesian inference (Daunizeau et al., 2010).
From Stirling's formula, ... holds. The logarithm of ... becomes ... (A.4). Hence, equation A.1 becomes ..., and therefore we obtain ...

A.2 Correspondence between Parameter Prior Distribution and Initial Synaptic Strengths.

In general, optimizing a model of observable quantities (including a neural network) can be cast as inference if there exists a learning mechanism that updates the hidden states and parameters of that model based on observations. (Exact and variational) Bayesian inference treats the hidden states and parameters as random variables and thus transforms the prior distributions P(s_t), P(A) into the posteriors Q(s_t), Q(A). In other words, Bayesian inference is a process of transforming priors into posteriors based on observations o_1, . . . , o_t under a generative model. From this perspective, the incorporation of prior knowledge about the hidden states and parameters is an important aspect of Bayesian inference.
The minimization of a cost function by a neural network updates its activity and synaptic strengths based on observations, given the network's properties (e.g., activation function and thresholds). According to the complete class theorem, this process can always be viewed as Bayesian inference. We have demonstrated that a class of cost functions for a single-layer feedforward network with a sigmoid activation function has a form equivalent to variational free energy under a particular latent variable model. Here, neural activity x_t and synaptic strengths W come to encode the posterior distributions over hidden states, Q(s_t), and parameters, Q(A), where Q(s_t) and Q(A) follow categorical and Dirichlet distributions, respectively. Moreover, we identified that the perturbation factors φ_j, which characterize the threshold function, correspond to the logarithm of the state prior P(s_t), expressed as a categorical distribution.
However, one might ask whether the posteriors obtained using the network, Q(s_t) and Q(A), are formally different from those obtained using variational Bayesian inference, since only the latter explicitly considers the prior distribution of the parameters, P(A). In other words, one may wonder whether the network merely implements update rules that are similar to variational Bayes but do not transform the priors P(s_t), P(A) into the posteriors Q(s_t), Q(A), despite the asymptotic equivalence of the cost functions.
Below, we show that the initial synaptic strengths (W_j1^init, W_j0^init) correspond to the parameter prior P(A), expressed as a Dirichlet distribution, and thus that a neural network indeed transforms priors into posteriors. For this purpose, we specify the order-one term in equation 2.12 to make the dependence on the initial synaptic strengths explicit. Specifically, we modify equation 2.12 as ... (A.9), where Ŵ_j1^init, Ŵ_j0^init are the sigmoid functions of the initial synaptic strengths, and λ_j1, λ_j0 ∈ R^No are row vectors of inverse learning rate factors that express the insensitivity of the synaptic strengths to activity-dependent synaptic plasticity. The third term of equation A.9 expresses the integral of Ŵ_j1 and Ŵ_j0 (with respect to W_j1 and W_j0, respectively). This ensures that when t = 0 (i.e., when the first term on the right-hand side of equation A.9 is zero), the derivative of L_j is given by ∂L_j/∂W_j1 = λ_j1 Ŵ_j1^init − λ_j1 Ŵ_j1, and thus (W_j1, W_j0) = (W_j1^init, W_j0^init) provides the fixed point of L_j. Similar to the transformation from equation 2.12 to equation 2.17, we compute equation A.9 as ... Note that we used W_j1 = ln Ŵ_j1 − ln(1 − Ŵ_j1). Crucially, analogous to the correspondence between Ŵ_j1 and the Dirichlet parameters of the parameter posterior a_11^(·, j), λ_j1 Ŵ_j1^init can be formally associated with the Dirichlet parameters of the parameter prior. Hence, one can see the formal correspondence between the second and third terms on the right-hand side of equation A.11.

A.3 Derivation of Synaptic Plasticity Rule.

We consider the synaptic strengths at time t, W_j1 = W_j1(t), and define the change as ΔW_j1 ≡ W_j1(t + 1) − W_j1(t). From equation 2.15, h_1(W_j1) satisfies both ... and ... These plasticity rules express (anti-)Hebbian plasticity with a homeostatic term.

Data Availability
All relevant data are within the letter. Matlab scripts are available at https://github.com/takuyaisomura/reverse_engineering.

Acknowledgments

T.I. is funded by the RIKEN Center for Brain Science. K.J.F. is funded by a Wellcome Principal Research Fellowship (088130/Z/09/Z). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.