## Abstract

The brain faces the problem of inferring reliable hidden causes from large populations of noisy neurons, for example, the direction of a moving object from spikes in area MT. It is known that a theoretically optimal likelihood decoding could be carried out by simple linear readout neurons if weights of synaptic connections were set to certain values that depend on the tuning functions of sensory neurons. We show here that such theoretically optimal readout weights emerge autonomously through STDP in conjunction with lateral inhibition between readout neurons. In particular, we identify a class of optimal STDP learning rules with homeostatic plasticity, for which the autonomous emergence of optimal readouts can be explained on the basis of a rigorous learning theory. This theory shows that the network motif we consider approximates expectation-maximization for creating internal generative models for hidden causes of high-dimensional spike inputs. Notably, we find that this optimal functionality can be well approximated by a variety of STDP rules beyond those predicted by theory. Furthermore, we show that this learning process is very stable and automatically adjusts weights to changes in the number of readout neurons, the tuning functions of sensory neurons, and the statistics of external stimuli.

## 1. Introduction

Uncertainty accompanies us in almost all situations in life. Whether we try to recognize a distant object, decide which path to take on a mountain hike, or read a person's face in an important negotiation, the environment often provides us with many cues, but each single cue is too unreliable to inform a decision on its own. Thus, we are forced to combine different cues in a meaningful manner in order to gain sufficient certainty. The theoretical framework for solving such tasks in an optimal way is Bayesian inference. Notably, behavioral and psychophysical studies strongly support the picture that the brain implements this strategy: in numerous experiments, human subjects have been shown to take into account uncertainty in a near-optimal way (Griffiths & Tenenbaum, 2006).

At the level of neural coding in early sensory areas, uncertainty is a particularly well-studied phenomenon: individual neurons that encode certain stimulus properties show significant trial-to-trial variability, making it difficult to infer the original stimulus from single-neuron responses. This observation has led to the notion of population coding: the value of a single variable is encoded by a whole population of neurons, each noisy and typically broadly tuned to the external variable (Pouget, Dayan, & Zemel, 2000). Experimental data suggest this coding strategy as a candidate for understanding how important variables are represented in different areas across cortex, for example, sound location in auditory cortex (Miller & Recanzone, 2009) or stimulus location in somatosensory cortex (Petersen, Panzeri, & Diamond, 2002). Just how the brain reliably decodes information from populations of neurons in a near-optimal way remains one of the key open questions in computational neuroscience.

In experimental neuroscience, the computation of robust readouts from population codes has become indispensable in the analysis and interpretation of neural data; population vector analysis and maximum likelihood (ML) estimation (Pouget et al., 2000) are two frequently used methods. More recently, attempts have been made to model neural networks that exhibit near-optimal decoding capabilities (Deneve, Latham, & Pouget, 1999; Jazayeri & Movshon, 2006; Chaisanguanthum & Lisberger, 2011). The hope is that such models will advance our understanding of the neural substrates of perceptual judgments: how the noisy and broadly tuned representations found in sensory areas can be efficiently used and transformed by downstream populations to allow near-optimal performance in a variety of perceptual tasks. The theoretical framework that has been guiding the search for suitable models is Bayesian inference. Here, in contrast to previous, static models, we will use this perspective for the analysis of an adaptive cortical microcircuit, featuring spike-timing-dependent plasticity.

*M*neurons, which we will write here as , is interpreted as the result of an underlying cause that cannot be observed directly. In visual processing, this external cause could correspond to the true stimulus orientation or motion direction at some retinal location. The population code (i.e., the relation between the external cause and the observed population response

**x**) can then be represented by the conditional distribution . This captures deterministic dependencies on , usually specified in terms of tuning functions, and neuronal noise. The optimal way of inferring the external variable from noisy observations

**x**is then given by Bayes’ theorem:

**x**. This results in a quite straightforward decoding strategy: compute the likelihood for each possible cause that could have given rise to the observed population response and choose the one with maximal likelihood. Somewhat surprisingly, the computational requirements for implementing such a readout are minimal: under a few simplifying assumptions, including Poisson firing statistics and zero noise correlations in the population pool, it was shown recently that the likelihood of a stimulus can be written as a weighted sum of sensory responses

**x**(Jazayeri & Movshon, 2006). Based on this observation, the authors argued that a readout neuron that specializes in detecting a particular stimulus value from a set of possible values can compute the corresponding likelihood by integrating its synaptic inputs in a feedforward manner (see the feedforward path in Figure 1). The synaptic weights required for this operation depend on the tuning functions of the sensory neurons

*x*and the preferred orientation of the readout neuron: This establishes an important link between the response properties of the sensory population (the tuning functions) and the optimal weights to decode information from it and gives clear instructions on how to construct an optimal readout network. However, equation 1.2 also highlights that each population of neurons in the brain must be read out differently, depending on the particular tuning functions of the neurons.

_{j}The preceding research leaves the question open of how readout neurons could acquire the theoretically optimal synaptic weights (see equation 1.2). The importance of this question is underlined by recent findings suggesting that the brain constantly retunes its circuits in order to improve probabilistic inference (Bejjanki, Beck, Lu, & Pouget, 2011). How experimentally observed plasticity mechanisms at the synaptic level could account for such an improvement remained an open question.

Here we present a learning theory for spike-timing-dependent plasticity (STDP) rules
in the context of spiking neurons and lateral inhibition that addresses this
question. Building on the analysis of previous work (Nessler, Pfeiffer, & Maass, 2010) we show that optimal likelihood
decoding of noisy population codes emerges automatically through STDP in a
winner-take-all (WTA) circuit, a ubiquitous network motif in cortical microcircuits
(Douglas & Martin, 2004). In particular,
we theoretically analyze the weight dynamics of a particular form of STDP with
homeostatic plasticity and show that it can be described as an attractor dynamics,
where the centers of the attractors are the weight values (see equation 1.2) that are optimal from the
perspective of probabilistic inference and learning. In this way, we create a direct
link between simple local rules for synaptic plasticity and theoretically optimal
inference and learning on the network level. Whereas the analysis of Nessler et al.
(2010) was restricted to the case of
multinomial input variables (each encoded by a population of neurons of which at any
time *t*, exactly one has fired within the time window ) and rectangular excitatory postsynaptic potentials (EPSPs)
(modeled by a step function, rather than by a function with smooth decay), we show
here that the underlying learning theory can be extended to cover the biologically
more realistic case where each input neuron fires according to some Poisson-like
statistics and EPSPs have a smooth decay. This new learning theory makes it possible
to derive learning curves for STDP and dependencies between current weight values
and the amount of weight potentiations or depressions under STDP that are optimal
for a given input statistic and EPSP shape. These analytically derived predictions
for details of STDP match currently available data quite well in a number of
aspects. Furthermore, we show that in practice, variants of these optimal rules, and
in particular plasticity rules based on typical STDP curves, approximate this
optimal behavior well.

We test predictions of the new learning theory in computer simulations and show that previously derived optimal weights for likelihood decoding (see equation 1.2) emerge autonomously through STDP in conjunction with homeostatic plasticity. Based on this learning theory, we also show that an adaptive architecture for reading population codes provides attractive benefits compared to previously considered static models. In particular, we demonstrate stable and predictable behavior under challenging but biologically realistic dynamic scenarios, like changes in sensory tuning functions or neuron growth and loss. We furthermore demonstrate that selective modulation of learning leads to effects reminiscent of perceptual learning (Gilbert, Li, & Piech, 2009), where the accuracy of neural codes is selectively enhanced for behaviorally relevant stimuli.

The article is structured as follows. First, we introduce the canonical microcircuit model we consider. In this circuit model, we demonstrate STDP-based autonomous learning on exposure to population-coded stimuli and show that in such a setup, the synaptic weights converge to a setting that is optimal from the perspective of likelihood decoding. We then introduce the theory that underlies this optimality and derive a rigorous link between local synaptic learning and theoretical optimality in the Bayesian framework. We complement these theoretical results by an extensive performance comparison of different optimal and near-optimal STDP-based learning rules. We present results of simulations that test further predictions of the theory and demonstrate the versatility and robustness of the considered model. We conclude with a simulation in which effects reminiscent of perceptual learning are reproduced.

## 2. Results

In order to make accurate perceptual judgments, the brain must use the information
provided by sensory areas as efficiently as possible. Since sensory neurons tend to
be noisy and broadly tuned, the computation of a sparse representation of the most
likely external stimulus (“hidden cause”) is a nontrivial task for a
network of neurons. However, Jazayeri and Movshon (2006) showed that for the case of a one-dimensional hidden cause , which could, for example, represent the current orientation,
speed, or direction of a visual stimulus, this task could in principle be solved by
an array of linear readout neurons with spiking outputs *z _{k}* that receive synaptic inputs

*x*from sensory neurons (see Figure 1). It was shown that the readout neurons

_{j}*k*can compute in their membrane potential the log likelihood that a particular hidden cause (which was assigned externally to readout neuron

*k*in Jazayeri & Movshon, 2006) had caused the current spike output

**x**of the sensory neurons. This occurs if the weights

*w*from sensory neuron

_{kj}*j*to readout neuron

*k*are set according to equation 1.2. We show here that these weights emerge as fixed points (equilibria) of a class of theoretically optimal STDP rules with homeostatic plasticity, provided that the readout neurons are subject to lateral inhibition. Through the same learning process, each readout neuron

*k*implicitly develops a preferred stimulus , which then allows the reconstruction of an external input variable at any moment in time.

### 2.1. Adaptive, Stochastic WTA Architecture.

*k*, a stochastic neuron model is used similar to the model proposed by Jolivet, Rauch, Lüscher, and Gerstner (2006), which has been shown to explain neural data well. The neuron model is characterized by an exponential dependence of the firing probability on the current membrane potential

*u*, for small . The membrane potential of a readout neuron

_{k}*k*consists of an excitatory and an inhibitory contribution: Excitation comes from the feedforward connections originating in the sensory population . In particular, the term

*w*(

_{kj}*t*)

*x*(

_{j}*t*) represents the contribution of previous spikes from the

*j*th sensory neuron to the membrane potential

*u*of the readout neuron

_{k}*k*at time

*t*. The term

*x*(

_{j}*t*) hence models the unweighted output spike train of the

*j*th sensory neuron after filtering according to the low-pass filtering properties of the postsynaptic membrane.

**x**(

*t*) represents the collection of all input signals available to the readout neurons at time

*t*. We will refer to this as the current (sensory) population response.

*I*(*t*) denotes the contribution to the membrane
potential due to lateral inhibition. *I*(*t*) is
common to all readout neurons in the circuit and can thus be viewed as a global
inhibitory signal that controls the total gain of the circuit. In computer
simulations, we modeled *I*(*t*) such that the
total firing activity of the readout circuit remained approximately constant.
The resulting effect of inhibition is a normalization of circuit responses,
reminiscent of normalization models in cortex (Simoncelli & Heeger, 1998; Zoccolan, Cox, & DiCarlo, 2005; Ohshiro, Angelaki, & DeAngelis, 2011; Louie, Grattan, & Glimcher, 2011). Note that the lateral
inhibition introduces competition among the readout neurons, since a readout
neuron with strong feedforward input will claim a large fraction of the total
firing rate, thereby suppressing other readout neurons. We refer to the
resulting network as a (stochastic) winner-take-all (WTA) circuit.

*z*=1 at spike times of the WTA neuron

_{k}*k*and

*z*=0 otherwise. The learning rate is chosen small throughout this article such that learning takes place on a (much) longer timescale than stimulus and network dynamics.

_{k}The first term of equation 2.3 corresponds to a prototypical spike-timing-dependent long-term potentiation (LTP) window (with the same shape and time constant as PSPs) with weight dependence. The second term depends on postsynaptic spikes only and can be interpreted as a form of homeostatic plasticity providing negative feedback to the postsynaptic rate. Both terms fit well into the relatively broad phenomenological framework of STDP rules (Gerstner & Kistler, 2002; Gilson, Burkitt, & van Hemmen, 2010). But to distinguish it from more common STDP rules (which typically feature spike-timing-dependent depression), we will refer to this rule as theoretically optimal STDP.

The motivation for first considering this form of STDP comes from its notable and provable theoretical properties, which will be developed in this article: equation 2.3 leads to stable equilibrium weight settings that can be analyzed and have a clear interpretation from the perspective of probability theory. In particular, we will show that equation 2.3 is an instance of a family of learning rules that can be directly derived from the principle of adapting and optimizing an implicit generative model of the input statistics.

With regard to biological plausibility, equation 2.3 is in many aspects consistent with
experimental studies on STDP. First, the strength and direction of learning
depend on the timing difference between pre- and postsynaptic spike. For
pre-before-post pairings, *x _{j}* is large at the time of the postsynaptic spike and equation 2.3 will typically lead to
potentiation. For post-before-pre pairings, the negative part dominates and
leads to depression (Bi & Poo, 1998;
Sjöström, Turrigiano, & Nelson, 2001). Also, the strength of potentiation correlates
inversely with the synaptic weight before pairing. This feature is also
consistent with a number of experimental studies (Bi & Poo, 1998; Sjöström et al., 2001; Liao, Jones, & Malinow, 1992; Montgomery, Pavlidis, & Madison, 2001). Furthermore, the amount of
depression is independent of the current weight. This is consistent with the
experimental results of Jacob, Brasier, Erchova, Feldman, and Shulz (2007), which is to the best of our
knowledge the only study of this effect in vivo. Moreover, when measured under a
typical pairing STDP protocol, equation 2.3 will automatically shift toward LTP for higher pairing
frequencies, since for each postsynaptic spike, the

*x*will accumulate more presynaptic spikes in the causal STDP window while the negative term remains constant. This effect is hence reminiscent of the tendency toward LTP (and abolishment of LTD) for higher frequencies found experimentally (Sjöström et al., 2001). Altogether, the agreement with these experimental STDP data is in a sense remarkable, given that equation 2.3 can be derived from purely statistical principles.

_{j}However, there are also a few deviations from STDP data. Most important, the negative contribution of equation 2.3 is activated for every postsynaptic spike regardless of presynaptic input. Hence, long-term depression (LTD) is not spike timing dependent and can be triggered by postsynaptic spikes alone. We address this potential issue later, when we consider a variation of equation 2.3 with timing-dependent depression in Figure 3; it turns out that the results are qualitatively very similar. Another deviation is with regard to the frequency dependence mentioned above: despite the general shift toward LTP, the exact form of frequency dependence in experimental data (see Figure 7 in Sjöström et al., 2001) cannot be reproduced by equation 2.3.

For the described network architecture, a computer simulation was carried out in
which a fluctuating external variable with a periodic boundary condition (like any angle) was
presented (see Figure 2). Concretely one
could interpret this external variable as a motion direction. The changes in were designed to be slightly slower than the network dynamics,
corresponding to the assumption that the sensory stimulus remains stable for the
duration of a few PSPs. A number of broadly tuned sensory neurons with Poisson
firing statistics represented the value of this external variable over time.
Initially synaptic readout weights were randomly initialized, leading to rather
unspecific responses in the readout circuit when stimulated with different . At *t*=100 s of exposure to different
population responses, readout neurons started to specialize on different
stimuli. This was reflected by both selective firing of individual readout
neurons to certain and the gradual specialization of synaptic readout weights to
different patterns. This specialization can be understood as follows. During
exposure to population responses, the readout network continuously emits spikes.
After each readout spike, the current population response leaves a small trace
in the synaptic weights of the readout neuron *k*, which has
fired. In particular, as expected from STDP, synapses that are not active at the
time of the postsynaptic spike are depressed, and those with strong activity are
enhanced. The synaptic changes induced by plasticity increase the probability
that the same neuron fires again when a similar population response is
registered, while making it less likely to fire when significantly different
patterns are active. Due to the inhibitory circuit that promotes competition,
different neurons start to pick up different patterns. As a consequence, a
specific pattern emerges in the synaptic weights of each readout neuron. This
pattern reflects the history of population responses that made the readout
neuron fire. At *t*=2000 s the readout weights reached a
setting that was stable with minor fluctuations around some target values. One
can now compare this converged weight setting with the results by Jazayeri and
Movshon (2006) on optimal population
decoding. As shown in Figure 2C, the
weights, which have autonomously emerged during exposure to unknown population
responses, have recovered the previously identified theoretical optima for
likelihood decoding, in particular those obtained through equation 1.2.

What is the overall effect of this learning process? Most important, individual readout neurons have become specialists for individual hidden causes (or stimuli). Hence, after, but not before, learning, each readout spike can be considered an indication for a particular hidden cause. As a consequence, one can achieve an increasingly faithful reconstruction of the original stimulus from readout spikes, as shown in Figure 2E.

### 2.2. Learning Theory for STDP in a Network with Lateral Inhibition.

*k*during the operation of the circuit (for a more precise definition, see section A.4 in the appendix). Importantly, we show that these weight settings are not only the attractors of the network dynamics but also correspond to locally optimal weights settings from the perspective of the implicit generative model. Hence, we establish a direct link between local synaptic learning rules and theoretically optimal performance of the network.

**x**without supervision, that is, without the help of a teacher signal, which provides the desired outcome

**z**during training. Generative models are arguably the most powerful paradigm of unsupervised learning, with examples including (probabilistic) PCA, ICA, gaussian mixture models, and hidden Markov models (Bishop, 2006). The rationale behind generative models is “analysis by synthesis”: if you aim to understand (decode) something, learn to build (generate) it yourself. The starting point is a model

*p*(

**x**|

**W**) describing the statistics of an input stream

**x**as the result of a generative random process, typically involving hidden causes

**z**, This generative model can be understood as a two-step process. First, a hidden configuration is drawn according to the hidden probabilities

*p*(

**z**|

**W**). Then the hidden states generate the actual data

**x**according to

*p*(

**x**|

**z**,

**W**). The sum in equation 2.5 reflects the fact that the same data

**x**can be generated by different hidden configurations

**z**.

Then the goal of learning is to find parameters **W** that bring the
distribution *p*(**x**|**W**) generated by the
model as close as possible to the actual data distribution, which we write here
as . This is done by adjusting the parameters **W** of
the model, usually in small steps, until the Kullback-Leibler divergence , a quantity that measures the distance between the real
distribution and the model, becomes minimal. During this process, an efficient
representation for the hidden causes **z** emerges, which is optimized
to “explain” the data in . This hidden representation can be retrieved by evaluating the
posterior distribution *p*(**z**|**x**, **W**).

*x*with “rate” and is an arbitrary positive constant (note that our analysis extends beyond Poisson distributions; see below). The model can be understood as follows: the probability of cause

*k*being active is . If the cause

*k*is active, the number of spikes for the sensory neuron

*x*within a rectangular PSP window of length is Poisson distributed with a rate that is encoded in the parameter

_{j}*w*. Hence, each cause

_{kj}*k*is an expert for a particular pattern of Poisson rates in the input, and this preferred pattern is encoded in the parameters . Note that prior parameters instead of a fixed prior could be easily incorporated into the model. These prior parameters could then be mapped onto neural excitabilities in the neural implementation (Nessler et al., 2010).

Several steps are required to establish a direct correspondence between the
network implementation and the generative model: first, the network elements
need to be related to the variables of the generative model. We already
implicitly linked the synaptic weights to the model parameters in a
straightforward fashion by using the weights *w _{kj}* in the definition of the generative model. Similarly, we can link the
readout neurons to the hidden causes of the model: each hidden cause

*k*is represented by one readout neuron,

*k*. Second the network should support Bayesian inference, that is, it should respond to an input

**x**by inferring likely hidden states

**z**according to the posterior probabilities

*p*(

**z**|

**x**,

**W**). Indeed we will show that each output spike from the readout network is generated according to the correct posterior distribution. Third, network plasticity should optimize the parameters of the generative model over time; the synaptic weights should come to reflect the input statistics through learning. This is the main focus of this article. We will prove this by relating STD learning to a standard algorithm for generative model learning, expectation-maximization (EM). One operation of the generative model that will not be required from the network is the actual generation of data

**x**. Since, in contrast to other models (see Dayan, Hinton, Neal, & Zemel, 1995; Rao & Ballard, 1999), such an explicit generation of data is not needed here, we will refer to our model as an implicit generative model.

*w*between input and readout layer. This is because the log likelihood of an input

_{kj}**x**under the cause

*k*is given by, The second term can be ignored in practice if the input representation is homogeneous (Jazayeri & Movshon, 2006). The third term is independent of

*k*and hence drops out later in the normalization of the posterior distribution

*p*(

*k*|

**x**,

**W**). This leaves only the first term, which is simply the feedforward sum of inputs, which a neuron

*k*can compute in its soma. As a consequence, we can show that each spike from a readout neuron in Figure 1 can be interpreted as a sample from the correct posterior distribution

*p*(

*k*|

**x**,

**W**) (see the appendix).

The key component of the learning theory is the connection between STD learning
and the implicit generative model, in particular the optimization of its
parameters **W**. In machine learning, the best-known algorithm for
performing this optimization is EM (Bishop, 2006). What we will show here is that the operation of STD learning
in the cortical microcircuit of Figure 1 can be understood as a stochastic online version of EM. This
allows us to view learning as an attractor dynamics in the weight space, where
the attractors correspond to local optima in the generative model
perspective.

*p*(

**z**|

**x**,

**W**) of hidden configurations are evaluated for the current input

**x**. The subsequent learning step uses both

**x**and the inferred hidden configurations to perform a small update , which increases the model's likelihood for the current input

**x**. As indicated above, each firing of a neuron

*k*in the WTA circuit of the network provides a sample from the current posterior distribution: hence, this is equivalent to a stochastic E-step. The corresponding postsynaptic spike then triggers the plasticity rule, equation 2.3, in the synapses . Through an analysis of this STDP-based update, one can show that this provides a step in the direction of a correct M-step in online EM. In particular, one can prove that the expected application of the rule with a sufficiently small learning rate will always improve the performance of the network until a local optimum is reached, where is the expected update for a randomly chosen input pattern from (see the appendix).

This allows us to view learning in the considered cortical microcircuit as an
attractor dynamics: the trajectories in weight space induced by STD learning are
attracted to specially distinguished weight settings which have the property
that for all synapses , that is, the system is in equilibrium with respect to
equation 2.3. These attractors
in weight space are optimal from the perspective of the implicit generative
model in the sense that they correspond to locally optimal solutions to the
problem of fitting the internal model *p*(**x**|**W**) to the input distribution .

Before moving on to the generalization of this main result to non-Poisson input
statistics, a remark is in order concerning the connection between the
optimization of the implicit generative model *p*(**x**|**W**) and the optimal readout
weights for population coding, as they were derived in Jazayeri and Movshon
(2006). As already suggested by the
simulation results shown in Figure 2, these
two optimality criteria are intimately connected. This is because the
“true” statistics of the population code from which the optimal readout weights in Jazayeri and Movshon
(2006) are derived can be mapped with
arbitrary precision onto a mixture model representation in the form of
equation 2.6. The mapping is
most faithful if the Kullback-Leibler divergence between the real population
code and the mixture model is lowest: this is precisely what the plasticity rule
optimizes. By increasing the number of readout neurons, the theoretically
achievable divergence can be made arbitrarily small, and the two optimality
criteria become effectively equivalent.

The assumption of Poisson variability is a reasonable and popular approximation
to the firing statistics of real cortical neurons. At the same time, there
exists ample experimental evidence for deviations from this rule, such as bursty
or regularly firing neurons (Shinomoto et al., 2009). This raises the question of whether our learning theory can
also be applied to other firing statistics. It turns out that the theory can
indeed be generalized to the exponential family of distributions, which contains
many well-known parametric models like the Poisson, normal, gamma, and negative
binomial distributions. The main result of this generalization is that each
firing statistic is associated with a different, small variation of optimal
STDP. Through the generalization to exponential families, the presented theory
for STDP in the context of lateral inhibition becomes applicable to a wide range
of biologically relevant firing statistics (see the online supplement). The
power of exponential family distributions also allows incorporating more
realistic EPSP shapes: given the firing statistics of sensory neurons, together
with the EPSP shape, one can construct a tailor-made exponential family
distribution that accounts for the variability encountered in the inputs **x** (see the online supplement). This makes the theory applicable
to virtually arbitrary EPSP shapes, in particular, those derived from
electrophysiological experiments.

### 2.3. A Family of STDP Rules Leads to Near-Optimal Readouts.

These theoretical results are quite encouraging, since they establish a link between a particular form of STDP, on the one hand, and the powerful statistical framework of EM, on the other hand. But how relevant are these results for the study of autonomous learning processes in the cortex? After all, the theoretically postulated mechanism for depression in equation 2.3, which leads to LTD even in the absence of presynaptic spikes, is quite speculative. It would thus be highly interesting to know whether this term could also be replaced by timing-dependent depression without loss of functionality. Furthermore, plasticity mechanisms in the brain appear to be heterogeneous and noisy (Sjöström et al., 2001; Caporale & Dan, 2008), in contrast to precise theoretical rules. If one assumes a given PSP shape and Poisson variability, is equation 2.3 really the only rule that can be guaranteed to lead to the emergence of optimal weights?

*f*(

*w*) can be any strictly positive function. For

_{kj}*f*(

*w*)=1, the original learning rule is recovered, which yields LTD proportional to the postsynaptic rate . This form of LTD is thus reminiscent of synaptic scaling (Turrigiano, 2010), except that in synaptic scaling, updates scale with the current weight (Abbott & Nelson, 2000). Indeed, a learning rule with synaptic scaling can be obtained by letting

_{kj}*f*(

*w*)=

_{kj}*w*. We verified the correctness of this (optimal) variant of equation 2.3 in Figure 3B.

_{kj}Furthermore, we tested two variations that are not provably optimal from the perspective of the presented learning theory but approximate the functionality of the optimal rules well. As shown in Figure 3C, in another variation of equation 2.3, the time constant of the causal STDP window was chosen twice as long as the PSP decay constant (while its magnitude was halved). The resulting performance is virtually optimal. Finally, we tested the performance of an STDP rule with common timing-dependent depression (see Figure 3D), with additional superimposed noise. The resulting weights are slightly less pronounced and noisier, but the readout performance is comparable to the other variants (compare RMSE values in the bottom row). Notably, this behavior is also quite robust against variations of parameters of this STDP rule (see Figure S1 in the online supplement).

Although these variants cover only a few cases of special interest, they make clear that the main finding of this article, the emergence of optimal readouts, is not an artifact of a specific definition of the learning rule, but in practice a property of a whole family of STDP learning rules. The most important features that appear to characterize this family of STDP rules are (1) the pronounced weight dependence of STD potentiation (in accordance with experimental data) and (2) some form of homeostatic regulation of weights, implemented either explicitly (e.g. through synaptic scaling) or implicitly through STD depression.

### 2.4. Maintenance of Optimality in Spite of Drastic Changes in Downstream and Upstream Neurons.

To our knowledge, this article provides the first adaptive neural approach to optimal decoding from sensory populations in the literature. As a consequence, the canonical microcircuit considered here is substantially more flexible than previous static models for optimal decoding. In the following computer simulations, we have selected two biologically motivated dynamic scenarios that demonstrate its functional advantages. The first scenario is based on the observation that plasticity in sensory cortex appears to persist throughout adulthood, as suggested by numerous studies in different species and cortical areas (Trachtenberg et al., 2002; Goel & Lee, 2007). Hence, the tuning functions of sensory neurons, especially in superficial layers, are not fixed but subject to constant change. This constitutes a challenge to downstream populations that rely on the representation provided by sensory areas. We will show that the canonical microcircuit considered here gracefully handles a scenario of changing sensory tuning functions: any change in the input distribution is automatically detected by the network, resulting in an appropriate adjustment of the weight settings. As a consequence of this constant readaptation, the readout representation of the external stimulus can remain remarkably invariant to changes in the sensory representation. The second scenario deals with neurogenesis and neuron death, two phenomena that have been reported repeatedly in adult cortex (Morrison & Hof, 2007). We studied the effects of neuron growth and death in the readout population, showing that through learning, the readout network can minimize the detrimental effects of cell death and maximize the gain in representation accuracy brought by neurogenesis.

In the first scenario, we studied the consequences of changing tuning functions in the sensory population on the readout network (see Figure 4). For this, we divided the sensory population into three groups (G1–G3). Initial sensory tuning functions were chosen such that the whole range of the input stimulus was represented uniformly in each group. The simulation was then split into three phases: a static phase (0 s–1000 s), a dynamic phase (1000 s–2000 s), and a consolidation phase (2000 s–3000 s). During the static phase, sensory tuning functions were kept constant, and the readout network learned an optimal readout representation based on the initial sensory code. The resulting weight matrices are qualitatively identical to those obtained in Figure 2C. In the dynamic phase, tuning functions in all three groups of sensory neurons started to change. In the first group, tuning functions broadened, in the second group they narrowed, and in the third group tuning functions narrowed and simultaneously started to develop a second mode (see Figure 4B). This change in the sensory population code was automatically “detected” by the readout network, causing the synaptic readout weights to take up pursuit of a quickly moving target: the attractor centers of the learning dynamics that depend on and thus drift in parallel with sensory tuning functions. A snapshot of the readout weight matrix at time 1600 s in the middle of the dynamic phase is shown in Figure 4C. In the consolidation phase, sensory tuning functions were fixed again. This allowed the readout network to converge to the new optimal setting, thereby reinterpreting all sensory neurons according to their new properties. This is particularly visible in group 3 (G3), where each sensory neuron had developed two modes in the tuning function. The corresponding optimal decoding strategy is to connect each sensory neuron in G3 strongly with the readout neurons that correspond to the two modes. As shown in Figure 4C, this strategy was found autonomously through learning. Figure 4D illustrates the preferred stimuli of five exemplary readout neurons during the dynamic phase. One should note that in spite of drastic changes in the sensory representation (e.g., the development of bimodal tuning functions; see Figure 4B), the preferred stimuli of readout neurons barely change during the adaptation period. Hence, learning facilitates the decoupling between sensory and readout representations with respect to the external stimulus.

In a second scenario (see Figure 5) we
studied the effect of neuron growth and death in the readout population. In a
computer simulation, a microcircuit was set up with five readout neurons, and
the weights were allowed to converge to a stable setting on stimulation with
input stimuli analogous to Figure 2. At *t*=800 s and *t*=1600 s, the
growth of five new readout neurons with randomly initialized weights was
simulated, yielding a jump in the mean squared error (MSE) of the reconstruction
(see Figure 5B). In theory, a larger number
of readout neurons allows a more finely grained representation, and hence a more
accurate readout. However, this requires that the weights of all neurons are
adjusted appropriately. In particular, the newly formed neurons should learn to
respond to the most poorly represented regions of the external stimulus. Figure 5A shows that this optimal strategy
automatically emerges through learning: the preferred stimuli of the newly grown neurons quickly learned to fill in the
“spaces” in between the existing neurons. This is also reflected
by a lower mean squared error of the reconstruction signal (Figure 5B; compare at *t*=800
s and *t*=1600 s). A similar effect is observed
after simulating the new growth of another five neurons with random weight
initialization. At *t*=2400 s, the sudden death of 10
neurons was simulated. Theoretically, this substantially reduces the achievable
readout accuracy. However, the loss in accuracy is smallest if the surviving
neurons rearrange such that their preferred stimuli are uniformly distributed over the input range. As shown in
Figure 5A, this is the strategy
automatically implemented by learning in the network
(2400 s–3200 s). Hence, the detrimental effects of neuron
loss on the readout representation are minimized through learning (see the
corresponding reduction of reconstruction MSE in Figure 5B).

### 2.5. Improved Representation of Behaviorally Relevant Inputs.

The ability to dynamically allocate resources for representing important peripheral inputs in a use-dependent manner is a hallmark of cerebral cortex (Buonomano & Merzenich, 1998). Indeed, cortical representations are highly plastic, and the dynamic changes in cortical representation on manipulation in input or during task learning are thought to underlie the phenomenon of perceptual learning: the improvement of sensory abilities during training (Seitz & Watanabe, 2005). Although the exact neural mechanisms that give rise to task-related improvements are largely unknown, experimental evidence suggests an important implication of synaptic plasticity of intracortical connections (Buonomano & Merzenich, 1998) in conjunction with top-down “relevance” or reward signals (Seitz & Watanabe, 2005). The latter have been hypothesized to be communicated to local cortical circuits via diffuse neuromodulatory signals, such as “acetylcholine, norepinephrine or dopamine, which gate learning and thus restrict sensory plasticity” (Seitz, Kim, & Watanabe, 2009). Here we will show that by incorporating such a modulatory signal into the considered cortical microcircuit model, the internal representation of stimuli becomes automatically focused on behaviorally relevant inputs—those inputs that are consistently paired with high levels of the modulatory signal. The resulting allocation of cortical resources in proportion to relevance is strongly reminiscent of experimentally observed training-dependent cortical map plasticity (Buonomano & Merzenich, 1998).

In a computer simulation, we incorporated a global modulatory signal into the
cortical microcircuit considered above, acting as a soft gate for learning. The
gating was implemented as a multiplicative factor on the learning rate. During
an initial reference phase, modulation remained constant during exposure to
stimuli, leading, as expected, via learning to a uniform internal representation
(see Figures 6B and 6C at *t*=1000 s). Modulation
was activated at *t*=1000 s. The modulation was chosen in
dependence of the current external stimulus, such that certain ranges of were paired with higher levels of the modulatory signal than
others (see Figure 6A, red curve).
As shown, population responses that coincided with higher levels were more
strongly imprinted in the synaptic network weights, and consequently, the
internal representation started to shift toward a denser concentration around
more “relevant” s (see Figures 6B and 6C at *t*=4000 s). In contrast, stimuli paired with lower
levels of modulation became more crudely represented by the network. Hence, the
resulting density of representation in the distribution of preferred stimuli is directly connected to the strength of modulation imposed
for different .

Such use-dependent allocation of cortical resources is a ubiquitous feature of cortical plasticity and has been shown to correlate with significant performance improvements on corresponding tasks (Buonomano & Merzenich, 1998). Our results indicate that the modulation of learning in a local cortical microcircuit model through a diffuse global signal is indeed sufficient to reproduce relevance-dependent allocation of representational resources. This provides a simple mechanism that could underlie the characteristic improvements in sensory abilities associated with perceptual learning.

## 3. Discussion

Ensembles of pyramidal cells with lateral inhibition on layers 2/3 and layer 5/6 (Fino & Yuste, 2011) constitute a universal network motif of cortical microcircuits in many different cortical areas and different species. We propose that these network motifs acquire through STDP a rather universal computational function that appears to be essential for many, if not all, cortical areas: the compression of high-dimensional noisy spike inputs into lower-dimensional sparse representations of the most likely hidden causes of these high-dimensional spike inputs. This yields in particular an emergent optimal decoding of population codes in the sense of Jazayeri and Movshon (2006). This emergent computational function of stochastic WTA circuits is very stable because a rigorous learning theory shows that the theoretically optimal values of synaptic weights are attractors in the dynamics of synaptic weights under the considered class of STDP rules with homeostatic plasticity. A remarkable feature of the underlying learning theory is that it creates a link between local synaptic learning rules such as STDP and the most powerful known abstract method for autonomous (i.e., unsupervised) learning: EM. This link holds provably for the class of optimal STDP rules and, in practice, also approximately for STDP rules that are not directly covered by theory (see Figure 3).

Furthermore the learning theory for STDP that we have introduced provides a new benchmark for analyzing and understanding from a functional perspective the large variety of parameters and learning curves for STDP that have been found at different synapses (Dan & Poo, 2006). This learning theory proposes that the theoretically optimal version of STDP (from the perspective of autonomous generation of implicit generative models for high-dimensional noisy spike inputs through EM) depends on both the firing statistics of pre-synaptic neurons and the average shape of EPSPs at the soma in a systematic and predictable manner.

In particular, the learning theory for STDP that we have presented throws new light on the question of downstream decoding of information conveyed by populations of noisy spiking neurons. Jazayeri and Movshon (2006) had already shown that simple linear neurons could in principle carry out a theoretically optimal maximum likelihood decoding, provided that suitable values are chosen for their synaptic weights. The learning theory that we have presented shows that these optimal weight values are in fact attractors with regard to the dynamics of synaptic weights under STDP in stochastic WTA circuits. This learning theory for STDP therefore also provides a possible explanation for the problem of how downstream neurons can maintain optimal decoding of population codes of sensory neurons in spite of ubiquitous changes in the tuning functions of sensory neurons (see Figure 4) and changes in the number of neurons involved in this decoding (see Figure 5). Furthermore if one assumes that synaptic plasticity is gated by neuromodulators or network activity so that the learning rate of STDP is increased for behaviorally relevant stimuli, the downstream decoding network automatically adapts its resolution in order to achieve a finer representation of behaviorally relevant ranges of external stimuli (see Figure 6).

An interesting aspect, which we did not study in this article, is whether self-organizing maps, similar to the well-documented orientation maps in cat visual cortex (Hubel & Wiesel, 1963), could emerge in the presented model. Indeed, one could think of clever changes in the model to induce such effects, for example, adding lateral excitatory connections between neighboring neurons, which would likely facilitate the emergence of locally smooth maps. But it is not clear at all whether a downstream “user” of the network output actually requires such a smooth map representation; indeed, many rodents, including rats, mice, and squirrels, can live without smooth maps (Van Hooser, Heimel, Chung, Nelson, & Toth, 2005). Furthermore, from a purely functional standpoint, neurons in the presented model with similar tuning (e.g., specialized on similar motion directions) will generally have nonnegligible signal correlation (see Figure 2D). Hence, downstream neurons that receive connections from the WTA neurons could quite easily detect groups of neurons that code for similar input stimuli and establish selective synaptic connections to “functionally neighboring” neurons even if their spatial arrangement is scrambled.

The WTA mechanism in this article relies on an exponential input-output nonlinearity
of the form *e ^{u}* to fit the requirements of Bayesian inference. Indeed, this is precisely
the input-output relation found empirically by Jolivet et al. (2006), who fit a stochastic spike response model to
predict spike timings in pyramidal cells from L5 rat somatosensory cortex. Other
authors have suggested a power relation of the form

*u*between membrane potential and output rate (Priebe, Mechler, Carandini, & Ferster, 2004). Although we expect that power relations would suffice to induce the desired competition, it would be interesting to study the effect of different nonlinearities in future work.

^{p}One of the model assumptions is conditional independence of inputs, and hence learning is guaranteed to be optimal only if there are no noise correlations in the input (although convergence is guaranteed regardless). Many studies have found moderate noise correlations in cortex (Smith & Kohn, 2008; Huang & Lisberger, 2009). On the other hand, a recent study by Ecker et al. (2010) found that in primary visual cortex, even nearby neurons with similar tuning show close to zero correlation (and the authors argue that earlier findings may need to be reevaluated under more precise experimental paradigms). Nonetheless, we tested how nonzero noise correlations affect network performance. Figure S2 in the online supplement shows that for a moderately correlated input population code, the performance remains quite high although not optimal. This has to be expected from a learner that relies on the independence assumption, and is in accordance with the findings of Graf, Kohn, Jazayeri, and Movshon (2011) who showed that linear readouts can be further improved if input correlations are taken into account. Whether input correlations could be efficiently exploited in the context of this article—whether and how the correlation structure in the inputs could be learned autonomously in an unsupervised manner (Graf et al., 2011, used supervised learning) by a spiking network with local plasticity rules—remains an open question for future work.

Another model assumption we made is that the stimulus distribution is uniform. Note that this does not invalidate the presented results for nonuniformly distributed inputs: even in a nonuniform stimulus setting, learning is still guaranteed to converge to an optimal solution with respect to the fixed prior distribution (since the assumption is made on the level of the implicit generative model, not on the input distribution ). The solution produced by the network for a nonuniform stimulus distribution resembles Figure 6: a greater number of neurons specialize on the more likely region of the stimulus (see Figure S3 in the online supplement). This is precisely what is expected from a maximum likelihood learner with a fixed prior model distribution. Another important question is whether the network also adapts its implicit prior distribution over the stimulus in an optimal manner. In other words, does the WTA network automatically become biased toward activating those neurons whose preferred directions occur more often in the input? Indeed, this turns out to be the case. Since high-probability stimuli will automatically attract more WTA neurons during the specialization process, the implicit prior distribution of the network favors high-probability stimuli after learning. Indeed this phenomenon is also quantitatively consistent with the prediction of a Bayesian framework (see Figure S3).

### 3.1. Related Work.

As a model for decoding of high-dimensional population codes in the brain, the key novelty of this work is plasticity. In particular, this article demonstrates to the best of our knowledge for the first time that population codes not only can be read out efficiently (this had already been shown by Deneve et al., 1999; Jazayeri & Movshon, 2006; and Chaisanguanthum & Lisberger, 2011), but that this readout can also be learned optimally by a stereotypical cortical microcircuit motif in a process that is entirely autonomous and self-organizing. An interesting difference from previous models for population decoding lies in the output code of the readout circuit, which can represent a whole distribution (in contrast to Deneve et al., 1999), while being sparse for typical stimuli (as opposed to the predictions of Jazayeri & Movshon, 2006). Hence, the information extracted about the inputs is represented and conveyed by the whole network population rather than by the identity of a single neuron. Strong, unambiguous stimuli will elicit sharp responses, whereas weaker, low-contrast, or ambiguous stimuli will lead to a distributed code. Furthermore, our model predicts that neural activity represents samples from a probability distribution, a coding scheme that has recently attracted considerable attention, as it appears to be particularly suitable for probabilistic representations subject to learning and adaptation (Fiser, Berkes, Orban, & Lengyel, 2010). As a consequence, our model is not only consistent with the experimentally observed trial-to-trial variability of neuronal responses, but in fact requires that neurons respond in a stochastic manner. Hence, one can view this article also as a contribution to the growing literature on computational properties of networks of stochastically firing neurons.

The link to EM extends the model's generality beyond population coding to input representations, which cannot be easily described in terms of a fixed number of external variables. In this more general context, different authors have proposed neural network models that are able to carry out Bayesian inference (Doya, Ishii, Pouget, & Rao, 2007), and some also considered the question how probabilistic representations could emerge autonomously through learning, for example, Dayan et al. (1995), Rao and Ballard (1999), and Keck, Savin, and Lücke (2012), who used a related approach to the one developed here but required more approximations and did not model spiking neurons (and hence also not the relation between EM and STDP). While these models focused on rather artificial neural networks based on abstract (either binary or continuous-valued rate-based) neural units, recent studies have started to address whether more realistic spiking neural networks are also capable of acquiring optimized probabilistic representations through autonomous learning. A model by Deneve (2008) focused on a single spiking neuron and showed that such a neuron can in principle learn an efficient code for its inputs in a basic temporal generative model. Whether parameter learning in the model can be scaled up to multiple neurons, which would be required to cope with complex but biologically realistic input distributions, remains an interesting open issue. Finally, the recent model by Nessler et al. (2010) showed that spiking neurons can learn to represent a mixture of multinomial input variables, assuming that exactly one input neuron in a group is active at a time. The generalization and extension of this result to biologically realistic firing statistics and EPSP shapes, required for the decoding of realistic population codes, was described in this article.

Finally, Law and Gold (2009) showed in a related work that near-optimal population readouts can also be learned autonomously through a simple reinforcement learning (RL) rule. More precisely, they demonstrate that for a given decision problem, for example, whether the motion direction encoded by the input is less than or more than 180 degrees, near-optimal linear readouts from an input population can be found through RL. The learning scheme relies on a feedback signal that conveys the correctness of the decisions computed from the readout. In principle, for any particular decision problem, a different set of readout weights is required. Our work complements these findings insofar as we show that an (external) feedback signal is not needed for the emergence of optimal readouts. Instead, optimal readout weights in a downstream population of neurons can emerge through a purely self-supervising process. Furthermore, in principle, any decision problem can then be reduced to simply counting spikes emitted by the network population, for example, checking whether those output neurons with preferred directions below 180 degrees produced more spikes than those with preferred directions of above 180 degrees. Given the attractive properties of both approaches and the prevalence of both STDP and reward-modulated learning in the brain, it is not unreasonable to assume that the cortex employs a combination of these learning mechanisms.

### 3.2. Experimentally Testable Predictions.

Our results predict that abolishment of STDP during a critical period prevents the emergence of sparse codes for frequently occurring sensory stimuli. They also predict that with intact STDP, the coding properties of pyramidal cells will change in a predictable manner in response to changes in the distribution of external stimuli or their behavioral relevance (see Figure 6), since this change will be tracked by the implicit generative model of ensembles of pyramidal cells (probably on layers 2/3 and layers 5/6). In addition, a lesion of some neurons within this ensemble will cause a redistribution of neural codes among the remaining neurons (see Figure 5). Our theoretical analysis of optimal versions of STDP predicts a specific dependence of features of STDP on the firing statistics of presynaptic neurons that can in principle be tested experimentally.

Furthermore, our model predicts that in WTA neurons, that is, pyramidal cells in layers 2/3 and layers 5/6, the tuning of firing rates should be sharper than the tuning of membrane potentials. In fact, this is the well-known experimentally observed iceberg effect (Carandini & Ferster, 2000). But in our model, the iceberg effect also has a novel functional interpretation from the perspective of Bayesian inference: the instantaneous firing rate must depend on the current membrane potential via a sharpening exponential activation function in order to ensure that neurons encode probabilities. Hence, the iceberg effect is a prerequisite for correct learning of optimal decoding weights. Another direct prediction of our model is that the stimulus selectivity in ensembles of pyramidal cells with lateral inhibition is sharper when familiar stimuli are presented compared to novel stimuli. This effect has been recently reported in monkey inferior temporal cortex (ITC) by Freedman, Riesenhuber, Poggio, and Miller (2006).

## 4. Conclusion

In summary, we have shown that in conjunction with STDP, a common network motif of cortical microcircuits acquires a generic computational function: it creates a sparse representation of complex high-dimensional spike inputs to a local microcircuit. Although we have discussed in this article only the application to decoding of information from populations of noisy sensory neurons, the generation of sparse representations for complex high-dimensional inputs, which converge onto a microcircuit from many different brain areas, is a candidate for a generic computational operation that is meaningful for microcircuits in any cortical area.

## Appendix: Methods

### A.1. Spike-Timing-Dependent Plasticity Rules.

Rules . | . | A_{+}(w)
. | . | . | Noise . |
---|---|---|---|---|---|

A. Optimal rule, equation 2.3 | 20 ms (2) | 0 ms | −1 | No | |

B. Optimal with synaptic scaling | 20 ms (2) | 0 ms | −w | No | |

C. Longer causal window | 40 ms (4) | 0 ms | −1 | No | |

D. Common STDP curve | 20 ms (0) | 60 ms (0) | 0 | Yes |

Rules . | . | A_{+}(w)
. | . | . | Noise . |
---|---|---|---|---|---|

A. Optimal rule, equation 2.3 | 20 ms (2) | 0 ms | −1 | No | |

B. Optimal with synaptic scaling | 20 ms (2) | 0 ms | −w | No | |

C. Longer causal window | 40 ms (4) | 0 ms | −1 | No | |

D. Common STDP curve | 20 ms (0) | 60 ms (0) | 0 | Yes |

Notes: Labels A–D are as in Figure 3. ms in A–C corresponds to the absence of STD
depression. For these rules, weight stabilization is achieved
through the homeostatic plasticity term *a*^{post}_{1}(*w*), which is triggered for every
postsynaptic spike.

We set , , and for the simulations in the main text. Figure S1 shows
the impact on performance of varying and . For the common STDP curve (see Figure 3D), additional zero-mean gaussian noise was
superimposed for each learning update triggered by a postsynaptic spike. The
standard deviation of this noise was dependent on the magnitude of the
deterministic update: for each deterministic update of magnitude *M*, gaussian noise with standard deviation was added. The noise parameters were set to and . The optimal rule, equation 2.3, was used for simulations in Figures 2 to 6. Rules B to D were used in Figure 3.

### A.2. Implicit Generative Model.

*x*: where , and . We associate each hidden cause

_{j}*k*with one readout neuron,

*k*, and the parameters

*w*with the synaptic weights of the network. This allows us to relate inference and learning in the generative model to the operation of the microcircuit.

_{kj}### A.3. Stochastic Winner-Take-All Circuit and Inference.

*k*from input data

**x**in the generative model can be written as In case of a homogeneous input representation, for example, a population code in which the sum of sensory activations is constant , this reduces to (see the online supplement),

Note that the restriction is necessary to make the theory tractable. However, as demonstrated in the simulations through this article (see Figures 2–6) in which sensory input neurons fire randomly and only the total input population rate was kept constant as in Jazayeri and Movshon (2006), this is not required for the functionality of the model in practice.

Now consider a population of stochastically spiking readout neurons *k* that fire at an instantaneous rate , depending on their current membrane potential *u _{k}*=

**w**

^{T}

_{k}

**x**−

*I*. We make two basic assumptions about the inhibitory contribution: that

*I*is common to all readout neurons and that the inhibitory circuit that provides

*I*ensures that an approximately constant total target firing rate is maintained. This form of divisive inhibition introduces competition among the readout neurons, since a strongly activated readout neuron will claim a large fraction of the total target firing rate, thereby suppressing other readout neurons.

*k*will respond to an input

**x**with an instantaneous rate, By comparison with equation A.9, one can verify that this is proportional to the posterior probability of the hidden cause

*k*in the generative model. Note that since the relative spiking probabilities match the posterior probabilities, each spike produced by the readout circuit can be interpreted as a sample from the posterior distribution, independent of the total firing rate of the circuit.

### A.4. Equilibria of Theoretically Optimal STDP Rules and Maximum Likelihood.

Here we discuss two important results. First, we derive conditions for the equilibrium points of STDP: stable weight settings for which the expected STDP update is zero. The significance of these points derives from the fact that they are attractors in the weight dynamics. Hence, STDP will always drive the weights into the neighborhood of such an equilibrium point. Second, we show that these equilibria also have a special interpretation from the generative model perspective: they correspond to locally optimal parameter settings for the generative model in the maximum likelihood sense.

**x**and the activation of a readout neuron, : where is the input distribution and

*p*(

*z*=1|

_{k}**x**,

**W**)=

*p*(

*k*|

**x**,

**W**), using the fact that the WTA circuit implements inference according to the generative model.

*w*is set to the logarithm of the average presynaptic activity

_{kj}*x*, the average taken over those input patterns that make the postsynaptic neuron

_{j}*k*fire (plus some constant).

*k*and

*j*, By evaluating the derivative, one can verify that this is indeed equivalent to equation A.14. Hence, all STDP equilibrium points are automatically local optima with respect to the implicit generative model.

### A.5. Link Between Optimal STDP and Expectation-Maximization.

**x**,

*k*). Theoretically optimal STDP, equation 2.3, or more generally any rule from the family 2.9, then increases the log-likelihood

*p*(

**x**,

*k*|

**W**) of this pair in the model. The main difference from standard EM is the fact that STDP can access only temporally local information (at least the simple model considered here). Hence, just as in online EM (Sato, 1999), STDP does not maximize the likelihood over the whole data set in each step, but rather makes a small update in the right direction after each postsynaptic spike. To show this, consider the expected STDP update for a synapse if the input distribution is : where is some small learning rate. At the same time, the log-likelihood gradient with respect to that synaptic weight is

This means that the average STDP update will never decrease the performance of the generative model. In fact, one can easily verify that it will always increase performance unless the weight setting already constitutes a local optimum of the log likelihood (see also section A.4). Altogether this implies that the average effect of STDP can be described as an attractor dynamics in the weight space in which the attractors are the equilibrium points of STDP and at the same time the local optima of the generative model. Stochastic deviations from these dynamics are zero mean and can thus be suppressed to arbitrary precision via the learning rate (at the cost of convergence speed).

### A.6. Computer Simulations.

*j*in response to stimulus . For experiments E1, E2, E4, and E5, the tuning functions were given by where each sensory neuron was associated with a different preferred stimulus , equally spaced over across the population. The concentration factor

*k*regulates the sharpness of the tuning function and was chosen

*k*=1 for E1, E2, E4, and E5. In all experiments, the scaling constant was set to

*c*=40, yielding a maximal firing rate of 40 Hertz for the preferred stimulus.

*k*=0.5 and

*k*=3, respectively. G3 developed a bimodal tuning function, with

*k*=6. The transition between initial and final tuning functions in E3 was done by linear interpolation of the tuning functions.

*x*(

_{j}*t*). The constant

*D*was chosen such that

*x*(

_{j}*t*) has equal mean and variance for a fixed firing rate of the sensory neuron. The EPSP time constants were ms and ms.

Synaptic weights *w _{kj}* were initialized by drawing independently from a normal distribution
with mean and standard deviation , where we used for E1, E2, E4, and E5, and for E3. The filtered spike trains

*x*(

_{j}*t*) were then used as inputs to the readout neurons. The WTA circuit was implemented according to a discrete time approximation of equation A.11; at each discrete time step, the firing probability for each neuron

*k*is given by . A spike occurs for neuron

*k*if a draw from a Bernoulli distribution with this probability is successful. In all simulations, the total firing rate of the WTA circuit was chosen beforehand to achieve an average firing rate of 3 Hertz per readout neuron (in E4, was computed for

*N*=15 and kept constant throughout the simulation). Whenever a readout neuron spiked, STDP was applied according to equation 2.3. The learning rate was set to in E1, E3, and E4, and to for E2. For E5, the learning rate depended on , as indicated in Figure 6.

At regular intervals, the preferred stimuli of readout neurons were computed. This was done in a separate
offline simulation, in which we swept over all input stimuli and selected for
each readout neuron *k* the stimulus that elicited the greatest average response from that neuron.
As a performance indicator, a reconstruction of the input signal was computed
from the output spikes (shown in Figure 2E). First, we computed a raw reconstruction signal , which jumps to the preferred stimulus of a readout neuron *k* whenever *z _{k}* emits a spike. From this, we obtained the reconstruction as the population vector average of the raw reconstruction
values within a [

*t*−20 ms,

*t*] window. After training in E1, E2, E3, and E5, readout neurons were sorted for visualization according to their preferred stimuli at the end of the simulation. In E4, sorting was done individually for each visualization point, since different neurons were involved at each time.

## Acknowledgments

This article was written under partial support by the European Union: projects FP7-269921 (BrainScaleS), FP7-216593 (SECO), FP7-216886 (PASCAL2), and FP7-243914 (Brain-i-Nets).

## References

## Author notes

Stefan Habenschuss and Helmut Puhr contributed equally to this work. A supplemental appendix is available online at http://www.mitpressjournals.org/doi/suppl/10.1162/NECO_a_00446.