The brain faces the problem of inferring reliable hidden causes from large populations of noisy neurons, for example, the direction of a moving object from spikes in area MT. It is known that a theoretically optimal likelihood decoding could be carried out by simple linear readout neurons if weights of synaptic connections were set to certain values that depend on the tuning functions of sensory neurons. We show here that such theoretically optimal readout weights emerge autonomously through STDP in conjunction with lateral inhibition between readout neurons. In particular, we identify a class of optimal STDP learning rules with homeostatic plasticity, for which the autonomous emergence of optimal readouts can be explained on the basis of a rigorous learning theory. This theory shows that the network motif we consider approximates expectation-maximization for creating internal generative models for hidden causes of high-dimensional spike inputs. Notably, we find that this optimal functionality can be well approximated by a variety of STDP rules beyond those predicted by theory. Furthermore, we show that this learning process is very stable and automatically adjusts weights to changes in the number of readout neurons, the tuning functions of sensory neurons, and the statistics of external stimuli.
Uncertainty accompanies us in almost all situations in life. Whether we try to recognize a distant object, decide which path to take on a mountain hike, or read a person's face in an important negotiation, the environment often provides us with many cues, but each single cue is too unreliable to inform a decision on its own. Thus, we are forced to combine different cues in a meaningful manner in order to gain sufficient certainty. The theoretical framework for solving such tasks in an optimal way is Bayesian inference. Notably, behavioral and psychophysical studies strongly support the picture that the brain implements this strategy: in numerous experiments, human subjects have been shown to take into account uncertainty in a near-optimal way (Griffiths & Tenenbaum, 2006).
At the level of neural coding in early sensory areas, uncertainty is a particularly well-studied phenomenon: individual neurons that encode certain stimulus properties show significant trial-to-trial variability, making it difficult to infer the original stimulus from single-neuron responses. This observation has led to the notion of population coding: the value of a single variable is encoded by a whole population of neurons, each noisy and typically broadly tuned to the external variable (Pouget, Dayan, & Zemel, 2000). Experimental data suggest this coding strategy as a candidate for understanding how important variables are represented in different areas across cortex, for example, sound location in auditory cortex (Miller & Recanzone, 2009) or stimulus location in somatosensory cortex (Petersen, Panzeri, & Diamond, 2002). Just how the brain reliably decodes information from populations of neurons in a near-optimal way remains one of the key open questions in computational neuroscience.
In experimental neuroscience, the computation of robust readouts from population codes has become indispensable in the analysis and interpretation of neural data; population vector analysis and maximum likelihood (ML) estimation (Pouget et al., 2000) are two frequently used methods. More recently, attempts have been made to model neural networks that exhibit near-optimal decoding capabilities (Deneve, Latham, & Pouget, 1999; Jazayeri & Movshon, 2006; Chaisanguanthum & Lisberger, 2011). The hope is that such models will advance our understanding of the neural substrates of perceptual judgments: how the noisy and broadly tuned representations found in sensory areas can be efficiently used and transformed by downstream populations to allow near-optimal performance in a variety of perceptual tasks. The theoretical framework that has been guiding the search for suitable models is Bayesian inference. Here, in contrast to previous, static models, we will use this perspective for the analysis of an adaptive cortical microcircuit, featuring spike-timing-dependent plasticity.
The preceding research leaves open the question of how readout neurons could acquire the theoretically optimal synaptic weights (see equation 1.2). The importance of this question is underlined by recent findings suggesting that the brain constantly retunes its circuits in order to improve probabilistic inference (Bejjanki, Beck, Lu, & Pouget, 2011). How experimentally observed plasticity mechanisms at the synaptic level could account for such an improvement has so far remained unclear.
Here we present a learning theory for spike-timing-dependent plasticity (STDP) rules in the context of spiking neurons and lateral inhibition that addresses this question. Building on the analysis of previous work (Nessler, Pfeiffer, & Maass, 2010), we show that optimal likelihood decoding of noisy population codes emerges automatically through STDP in a winner-take-all (WTA) circuit, a ubiquitous network motif in cortical microcircuits (Douglas & Martin, 2004). In particular, we theoretically analyze the weight dynamics of a particular form of STDP with homeostatic plasticity and show that it can be described as an attractor dynamics, where the centers of the attractors are the weight values (see equation 1.2) that are optimal from the perspective of probabilistic inference and learning. In this way, we create a direct link between simple local rules for synaptic plasticity and theoretically optimal inference and learning on the network level. Whereas the analysis of Nessler et al. (2010) was restricted to the case of multinomial input variables (each encoded by a population of neurons of which, at any time t, exactly one has fired within a preceding time window) and rectangular excitatory postsynaptic potentials (EPSPs) (modeled by a step function, rather than by a function with smooth decay), we show here that the underlying learning theory can be extended to cover the biologically more realistic case where each input neuron fires according to some Poisson-like statistics and EPSPs have a smooth decay. This new learning theory makes it possible to derive learning curves for STDP and dependencies between current weight values and the amount of weight potentiations or depressions under STDP that are optimal for a given input statistic and EPSP shape. These analytically derived predictions for details of STDP match currently available data quite well in a number of aspects.
Furthermore, we show that in practice, variants of these optimal rules, and in particular plasticity rules based on typical STDP curves, approximate this optimal behavior well.
We test predictions of the new learning theory in computer simulations and show that previously derived optimal weights for likelihood decoding (see equation 1.2) emerge autonomously through STDP in conjunction with homeostatic plasticity. Based on this learning theory, we also show that an adaptive architecture for reading population codes provides attractive benefits compared to previously considered static models. In particular, we demonstrate stable and predictable behavior under challenging but biologically realistic dynamic scenarios, like changes in sensory tuning functions or neuron growth and loss. We furthermore demonstrate that selective modulation of learning leads to effects reminiscent of perceptual learning (Gilbert, Li, & Piech, 2009), where the accuracy of neural codes is selectively enhanced for behaviorally relevant stimuli.
The article is structured as follows. First, we introduce the canonical microcircuit model we consider. In this circuit model, we demonstrate STDP-based autonomous learning on exposure to population-coded stimuli and show that in such a setup, the synaptic weights converge to a setting that is optimal from the perspective of likelihood decoding. We then introduce the theory that underlies this optimality and derive a rigorous link between local synaptic learning and theoretical optimality in the Bayesian framework. We complement these theoretical results by an extensive performance comparison of different optimal and near-optimal STDP-based learning rules. We present results of simulations that test further predictions of the theory and demonstrate the versatility and robustness of the considered model. We conclude with a simulation in which effects reminiscent of perceptual learning are reproduced.
In order to make accurate perceptual judgments, the brain must use the information provided by sensory areas as efficiently as possible. Since sensory neurons tend to be noisy and broadly tuned, the computation of a sparse representation of the most likely external stimulus (“hidden cause”) is a nontrivial task for a network of neurons. However, Jazayeri and Movshon (2006) showed that for the case of a one-dimensional hidden cause, which could, for example, represent the current orientation, speed, or direction of a visual stimulus, this task could in principle be solved by an array of linear readout neurons with spiking outputs zk that receive synaptic inputs xj from sensory neurons (see Figure 1). It was shown that the readout neurons k can compute in their membrane potential the log likelihood that a particular hidden cause (which was assigned externally to readout neuron k in Jazayeri & Movshon, 2006) had caused the current spike output x of the sensory neurons. This occurs if the weights wkj from sensory neuron j to readout neuron k are set according to equation 1.2. We show here that these weights emerge as fixed points (equilibria) of a class of theoretically optimal STDP rules with homeostatic plasticity, provided that the readout neurons are subject to lateral inhibition. Through the same learning process, each readout neuron k implicitly develops a preferred stimulus, which then allows the reconstruction of the external input variable at any moment in time.
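The decoding scheme of Jazayeri and Movshon (2006) can be sketched in a few lines. The tuning curves, stimulus grid, and constants below are illustrative assumptions rather than parameters of the original model; only the weight formula wkj = log fj(θk) and the linear computation of the log likelihood in the readout membrane potentials follow the scheme described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 20 sensory neurons with broad circular tuning over a
# stimulus variable theta in [0, 2*pi); all constants are illustrative.
n_sensory = 20
centers = np.linspace(0.0, 2.0 * np.pi, n_sensory, endpoint=False)

def tuning(theta):
    # Mean spike counts f_j(theta) of the sensory population (noisy, broad).
    return 0.5 + 5.0 * np.exp(2.0 * (np.cos(theta - centers) - 1.0))

# One readout neuron per candidate stimulus theta_k.  For Poisson inputs the
# log likelihood of theta_k given counts x is sum_j x_j * log f_j(theta_k)
# plus terms that are (nearly) constant over theta for this homogeneous
# population, so the optimal readout weights are w_kj = log f_j(theta_k).
n_readout = 40
theta_k = np.linspace(0.0, 2.0 * np.pi, n_readout, endpoint=False)
W = np.log(np.array([tuning(t) for t in theta_k]))  # shape (K, J)

theta_true = 1.3
x = rng.poisson(tuning(theta_true))       # noisy population response
log_lik = W @ x                           # readout membrane potentials u_k
theta_hat = theta_k[np.argmax(log_lik)]   # maximum likelihood estimate
```

A single noisy trial already yields an estimate close to the true stimulus; averaging over trials or over the posterior sharpens it further.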
2.1. Adaptive, Stochastic WTA Architecture.
I(t) denotes the contribution to the membrane potential due to lateral inhibition. I(t) is common to all readout neurons in the circuit and can thus be viewed as a global inhibitory signal that controls the total gain of the circuit. In computer simulations, we modeled I(t) such that the total firing activity of the readout circuit remained approximately constant. The resulting effect of inhibition is a normalization of circuit responses, reminiscent of normalization models in cortex (Simoncelli & Heeger, 1998; Zoccolan, Cox, & DiCarlo, 2005; Ohshiro, Angelaki, & DeAngelis, 2011; Louie, Grattan, & Glimcher, 2011). Note that the lateral inhibition introduces competition among the readout neurons, since a readout neuron with strong feedforward input will claim a large fraction of the total firing rate, thereby suppressing other readout neurons. We refer to the resulting network as a (stochastic) winner-take-all (WTA) circuit.
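The normalizing effect of such a global inhibitory signal can be illustrated with a small sketch. Here we simply equate I(t) with a log-sum-exp of the feedforward membrane potentials, which keeps the total instantaneous firing rate of the circuit constant; this is one simple way to realize the described normalization, chosen for illustration, not necessarily the form used in the simulations.

```python
import numpy as np

def wta_spike_probabilities(u):
    """Normalized firing probabilities under global divisive inhibition.

    u : feedforward membrane potentials u_k of the readout neurons.
    Subtracting a common inhibitory term I = log(sum_k exp(u_k)) from every
    neuron keeps the total instantaneous firing rate constant, so the
    probability that neuron k emits the next spike is a softmax over u.
    """
    u = np.asarray(u, dtype=float)
    m = np.max(u)
    I = m + np.log(np.sum(np.exp(u - m)))  # numerically stable log-sum-exp
    return np.exp(u - I)

p = wta_spike_probabilities([1.0, 3.0, 2.0])
```

The neuron with the strongest feedforward input claims the largest share of the circuit's fixed total firing rate, which is the competition described above.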
The first term of equation 2.3 corresponds to a prototypical spike-timing-dependent long-term potentiation (LTP) window (with the same shape and time constant as PSPs), with an additional dependence on the current weight. The second term depends on postsynaptic spikes only and can be interpreted as a form of homeostatic plasticity providing negative feedback to the postsynaptic rate. Both terms fit well into the relatively broad phenomenological framework of STDP rules (Gerstner & Kistler, 2002; Gilson, Burkitt, & van Hemmen, 2010). But to distinguish it from more common STDP rules (which typically feature spike-timing-dependent depression), we will refer to this rule as theoretically optimal STDP.
The motivation for first considering this form of STDP comes from its notable and provable theoretical properties, which will be developed in this article: equation 2.3 leads to stable equilibrium weight settings that can be analyzed and have a clear interpretation from the perspective of probability theory. In particular, we will show that equation 2.3 is an instance of a family of learning rules that can be directly derived from the principle of adapting and optimizing an implicit generative model of the input statistics.
With regard to biological plausibility, equation 2.3 is in many aspects consistent with experimental studies on STDP. First, the strength and direction of learning depend on the timing difference between pre- and postsynaptic spike. For pre-before-post pairings, xj is large at the time of the postsynaptic spike and equation 2.3 will typically lead to potentiation. For post-before-pre pairings, the negative part dominates and leads to depression (Bi & Poo, 1998; Sjöström, Turrigiano, & Nelson, 2001). Also, the strength of potentiation correlates inversely with the synaptic weight before pairing. This feature is also consistent with a number of experimental studies (Bi & Poo, 1998; Sjöström et al., 2001; Liao, Jones, & Malinow, 1992; Montgomery, Pavlidis, & Madison, 2001). Furthermore, the amount of depression is independent of the current weight. This is consistent with the experimental results of Jacob, Brasier, Erchova, Feldman, and Shulz (2007), which is to the best of our knowledge the only study of this effect in vivo. Moreover, when measured under a typical pairing STDP protocol, equation 2.3 will automatically shift toward LTP for higher pairing frequencies, since for each postsynaptic spike, the xj will accumulate more presynaptic spikes in the causal STDP window while the negative term remains constant. This effect is hence reminiscent of the tendency toward LTP (and abolishment of LTD) for higher frequencies found experimentally (Sjöström et al., 2001). Altogether, the agreement with these experimental STDP data is in a sense remarkable, given that equation 2.3 can be derived from purely statistical principles.
However, there are also a few deviations from STDP data. Most important, the negative contribution of equation 2.3 is activated for every postsynaptic spike regardless of presynaptic input. Hence, long-term depression (LTD) is not spike timing dependent and can be triggered by postsynaptic spikes alone. We address this potential issue later, when we consider a variation of equation 2.3 with timing-dependent depression in Figure 3; it turns out that the results are qualitatively very similar. Another deviation is with regard to the frequency dependence mentioned above: despite the general shift toward LTP, the exact form of frequency dependence in experimental data (see Figure 7 in Sjöström et al., 2001) cannot be reproduced by equation 2.3.
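The qualitative behavior of a rule of this kind can be illustrated with a one-synapse sketch. We use an update of the form Δw = η(xj·e^(−w) − 1), applied at each postsynaptic spike, following the shape of the rule in Nessler et al. (2010); the constants are illustrative and the exact form of equation 2.3 is not reproduced here. Its fixed point, where the expected weight-dependent potentiation and the constant depression cancel, is w* = log E[xj | postsynaptic spike].

```python
import numpy as np

rng = np.random.default_rng(1)

# Hedged sketch of a weight-dependent STDP rule of the kind discussed in the
# text: at each postsynaptic spike, LTP scales with exp(-w) times the
# presynaptic trace x_j, while depression is a constant, weight-independent
# homeostatic term.  Learning rate and input statistics are illustrative.
eta = 0.05
w = 0.0            # single synapse, initial weight
x_mean = 3.0       # mean presynaptic trace at postsynaptic spike times

for _ in range(2000):
    x_j = rng.poisson(x_mean)            # presynaptic activity at the spike
    w += eta * (x_j * np.exp(-w) - 1.0)  # weight-dependent LTP - constant LTD

# At equilibrium E[dw] = 0, i.e. exp(-w*) * E[x_j] = 1, hence
# w* = log E[x_j | postsynaptic spike] = log(x_mean).
```

The weight thus converges to the logarithm of the mean presynaptic activity conditioned on a postsynaptic spike, which is the statistical quantity that the optimal weights of equation 1.2 encode.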
For the described network architecture, a computer simulation was carried out in which a fluctuating external variable with a periodic boundary condition (like an angle) was presented (see Figure 2). Concretely, one could interpret this external variable as a motion direction. Its changes were designed to be slightly slower than the network dynamics, corresponding to the assumption that the sensory stimulus remains stable for the duration of a few PSPs. A number of broadly tuned sensory neurons with Poisson firing statistics represented the value of this external variable over time. Initially, synaptic readout weights were randomly initialized, leading to rather unspecific responses in the readout circuit when stimulated with different stimulus values. After 100 s of exposure to different population responses, readout neurons started to specialize on different stimuli. This was reflected by both the selective firing of individual readout neurons to certain stimulus values and the gradual specialization of synaptic readout weights to different patterns. This specialization can be understood as follows. During exposure to population responses, the readout network continuously emits spikes. After each readout spike, the current population response leaves a small trace in the synaptic weights of the readout neuron k that has fired. In particular, as expected from STDP, synapses that are not active at the time of the postsynaptic spike are depressed, and those with strong activity are enhanced. The synaptic changes induced by plasticity increase the probability that the same neuron fires again when a similar population response is registered, while making it less likely to fire when significantly different patterns are active. Due to the inhibitory circuit that promotes competition, different neurons start to pick up different patterns. As a consequence, a specific pattern emerges in the synaptic weights of each readout neuron.
This pattern reflects the history of population responses that made the readout neuron fire. At t=2000 s the readout weights reached a setting that was stable with minor fluctuations around some target values. One can now compare this converged weight setting with the results by Jazayeri and Movshon (2006) on optimal population decoding. As shown in Figure 2C, the weights, which have autonomously emerged during exposure to unknown population responses, have recovered the previously identified theoretical optima for likelihood decoding, in particular those obtained through equation 1.2.
What is the overall effect of this learning process? Most important, individual readout neurons have become specialists for individual hidden causes (or stimuli). Hence, after, but not before, learning, each readout spike can be considered an indication for a particular hidden cause. As a consequence, one can achieve an increasingly faithful reconstruction of the original stimulus from readout spikes, as shown in Figure 2E.
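One simple way to obtain such a reconstruction is a population-vector (circular-mean) readout over the preferred stimuli of recently active readout neurons. The sketch below makes this concrete; the function name and numbers are illustrative, and the reconstruction used for Figure 2E may differ in detail.

```python
import numpy as np

def reconstruct(preferred, spike_counts):
    """Population-vector readout: circular mean of the preferred stimuli of
    the readout neurons, weighted by their recent spike counts.

    After learning, each readout spike indicates one preferred stimulus, so
    a weighted circular mean over recent spikes yields a continuous estimate
    of the external variable.
    """
    z = np.sum(spike_counts * np.exp(1j * np.asarray(preferred)))
    return np.angle(z) % (2.0 * np.pi)

# Eight readout neurons with evenly spaced preferred stimuli; the recent
# spikes cluster around pi, so the estimate should land nearby.
preferred = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
counts = np.array([0, 0, 1, 4, 2, 0, 0, 0])
estimate = reconstruct(preferred, counts)
```

Before learning, readout spikes are unspecific and such an estimate is uninformative; after learning, it tracks the stimulus faithfully.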
2.2. Learning Theory for STDP in a Network with Lateral Inhibition.
Then the goal of learning is to find parameters W that bring the distribution p(x|W) generated by the model as close as possible to the actual distribution of the input data. This is done by adjusting the parameters W of the model, usually in small steps, until the Kullback-Leibler divergence between the data distribution and the model distribution, a quantity that measures the distance between the two, becomes minimal. During this process, an efficient representation for the hidden causes z emerges, which is optimized to “explain” the observed data. This hidden representation can be retrieved by evaluating the posterior distribution p(z|x, W).
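For discrete distributions, the quantities involved are easy to write down. The sketch below defines the Kullback-Leibler divergence and illustrates, with made-up numbers, that moving the model distribution toward the data distribution decreases the divergence; the distributions here are purely illustrative.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p_star = np.array([0.5, 0.3, 0.2])    # "true" data distribution p*(x)
q_far  = np.array([0.2, 0.3, 0.5])    # initial model p(x | W)
q_near = np.array([0.45, 0.3, 0.25])  # model after some learning steps

d_far, d_near = kl(p_star, q_far), kl(p_star, q_near)
```

The divergence is zero exactly when the model matches the data distribution, which is the target of the learning process described above.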
Several steps are required to establish a direct correspondence between the network implementation and the generative model: first, the network elements need to be related to the variables of the generative model. We already implicitly linked the synaptic weights to the model parameters in a straightforward fashion by using the weights wkj in the definition of the generative model. Similarly, we can link the readout neurons to the hidden causes of the model: each hidden cause k is represented by one readout neuron, k. Second, the network should support Bayesian inference; that is, it should respond to an input x by inferring likely hidden states z according to the posterior probabilities p(z|x, W). Indeed, we will show that each output spike from the readout network is generated according to the correct posterior distribution. Third, network plasticity should optimize the parameters of the generative model over time; the synaptic weights should come to reflect the input statistics through learning. This is the main focus of this article. We will prove this by relating STDP learning to a standard algorithm for generative model learning, expectation-maximization (EM). One operation of the generative model that will not be required from the network is the actual generation of data x. Since, in contrast to other models (see Dayan, Hinton, Neal, & Zemel, 1995; Rao & Ballard, 1999), such an explicit generation of data is not needed here, we will refer to our model as an implicit generative model.
The key component of the learning theory is the connection between STDP learning and the implicit generative model, in particular the optimization of its parameters W. In machine learning, the best-known algorithm for performing this optimization is EM (Bishop, 2006). What we will show here is that the operation of STDP learning in the cortical microcircuit of Figure 1 can be understood as a stochastic online version of EM. This allows us to view learning as an attractor dynamics in the weight space, where the attractors correspond to local optima from the generative model perspective.
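This correspondence can be made concrete in a toy sketch with two hidden causes: each input is interpreted by sampling a winner from the posterior (the WTA spike, playing the role of the stochastic E-step), and only the winner's weights are updated by a local, weight-dependent rule (the stochastic M-step). All constants, the two rate profiles, and the simple homeostatic bias used here to keep both neurons active are illustrative assumptions, not the exact model of the article.

```python
import numpy as np

rng = np.random.default_rng(2)

# Inputs x are Poisson population responses drawn from one of two hidden
# causes with these (illustrative) mean rate profiles:
rates = np.array([[8.0, 1.0, 1.0, 8.0],   # mean response under cause 0
                  [1.0, 8.0, 8.0, 1.0]])  # mean response under cause 1
K, J = rates.shape
W = rng.normal(0.0, 0.1, size=(K, J))     # log-rate parameters w_kj
b = np.zeros(K)                           # homeostatic excitability biases
eta = 0.02

for _ in range(4000):
    cause = rng.integers(K)
    x = rng.poisson(rates[cause])
    u = b + W @ x - np.exp(W).sum(axis=1)   # Poisson log posteriors (+ const)
    p = np.exp(u - u.max()); p /= p.sum()   # E-step: posterior over causes
    k = rng.choice(K, p=p)                  # WTA spike = posterior sample
    W[k] += eta * (x * np.exp(-W[k]) - 1.0) # M-step: local update of winner
    b += 0.05 * ((1.0 / K) - (np.arange(K) == k))  # homeostasis: equal rates

# After learning, exp(W) should approximate the two rate profiles
# (possibly with the rows permuted).
learned = np.exp(W)
```

Each readout neuron thereby becomes a specialist for one hidden cause without any supervision, which is the EM view of the learning process described in the text.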
This allows us to view learning in the considered cortical microcircuit as an attractor dynamics: the trajectories in weight space induced by STDP learning are attracted to specially distinguished weight settings, which have the property that the expected weight change vanishes for all synapses, that is, the system is in equilibrium with respect to equation 2.3. These attractors in weight space are optimal from the perspective of the implicit generative model in the sense that they correspond to locally optimal solutions to the problem of fitting the internal model p(x|W) to the actual input distribution.
Before moving on to the generalization of this main result to non-Poisson input statistics, a remark is in order concerning the connection between the optimization of the implicit generative model p(x|W) and the optimal readout weights for population coding, as they were derived in Jazayeri and Movshon (2006). As already suggested by the simulation results shown in Figure 2, these two optimality criteria are intimately connected. This is because the “true” statistics of the population code from which the optimal readout weights in Jazayeri and Movshon (2006) are derived can be mapped with arbitrary precision onto a mixture model representation in the form of equation 2.6. The mapping is most faithful if the Kullback-Leibler divergence between the real population code and the mixture model is lowest: this is precisely what the plasticity rule optimizes. By increasing the number of readout neurons, the theoretically achievable divergence can be made arbitrarily small, and the two optimality criteria become effectively equivalent.
The assumption of Poisson variability is a reasonable and popular approximation to the firing statistics of real cortical neurons. At the same time, there exists ample experimental evidence for deviations from this rule, such as bursty or regularly firing neurons (Shinomoto et al., 2009). This raises the question of whether our learning theory can also be applied to other firing statistics. It turns out that the theory can indeed be generalized to the exponential family of distributions, which contains many well-known parametric models like the Poisson, normal, gamma, and negative binomial distributions. The main result of this generalization is that each firing statistic is associated with a different, small variation of optimal STDP. Through the generalization to exponential families, the presented theory for STDP in the context of lateral inhibition becomes applicable to a wide range of biologically relevant firing statistics (see the online supplement). The power of exponential family distributions also allows incorporating more realistic EPSP shapes: given the firing statistics of sensory neurons, together with the EPSP shape, one can construct a tailor-made exponential family distribution that accounts for the variability encountered in the inputs x (see the online supplement). This makes the theory applicable to virtually arbitrary EPSP shapes, in particular, those derived from electrophysiological experiments.
2.3. A Family of STDP Rules Leads to Near-Optimal Readouts.
These theoretical results are quite encouraging, since they establish a link between a particular form of STDP, on the one hand, and the powerful statistical framework of EM, on the other hand. But how relevant are these results for the study of autonomous learning processes in the cortex? After all, the theoretically postulated mechanism for depression in equation 2.3, which leads to LTD even in the absence of presynaptic spikes, is quite speculative. It would thus be highly interesting to know whether this term could also be replaced by timing-dependent depression without loss of functionality. Furthermore, plasticity mechanisms in the brain appear to be heterogeneous and noisy (Sjöström et al., 2001; Caporale & Dan, 2008), in contrast to precise theoretical rules. If one assumes a given PSP shape and Poisson variability, is equation 2.3 really the only rule that can be guaranteed to lead to the emergence of optimal weights?
Furthermore, we tested two variations that are not provably optimal from the perspective of the presented learning theory but approximate the functionality of the optimal rules well. As shown in Figure 3C, in another variation of equation 2.3, the time constant of the causal STDP window was chosen twice as long as the PSP decay constant (while its magnitude was halved). The resulting performance is virtually optimal. Finally, we tested the performance of an STDP rule with common timing-dependent depression (see Figure 3D), with additional superimposed noise. The resulting weights are slightly less pronounced and noisier, but the readout performance is comparable to the other variants (compare RMSE values in the bottom row). Notably, this behavior is also quite robust against variations of parameters of this STDP rule (see Figure S1 in the online supplement).
Although these variants cover only a few cases of special interest, they make clear that the main finding of this article, the emergence of optimal readouts, is not an artifact of a specific definition of the learning rule, but in practice a property of a whole family of STDP learning rules. The most important features that appear to characterize this family of STDP rules are (1) the pronounced weight dependence of spike-timing-dependent potentiation (in accordance with experimental data) and (2) some form of homeostatic regulation of weights, implemented either explicitly (e.g., through synaptic scaling) or implicitly through spike-timing-dependent depression.
2.4. Maintenance of Optimality in Spite of Drastic Changes in Downstream and Upstream Neurons.
To our knowledge, this article provides the first adaptive neural approach to optimal decoding from sensory populations in the literature. As a consequence, the canonical microcircuit considered here is substantially more flexible than previous static models for optimal decoding. In the following computer simulations, we have selected two biologically motivated dynamic scenarios that demonstrate its functional advantages. The first scenario is based on the observation that plasticity in sensory cortex appears to persist throughout adulthood, as suggested by numerous studies in different species and cortical areas (Trachtenberg et al., 2002; Goel & Lee, 2007). Hence, the tuning functions of sensory neurons, especially in superficial layers, are not fixed but subject to constant change. This constitutes a challenge to downstream populations that rely on the representation provided by sensory areas. We will show that the canonical microcircuit considered here gracefully handles a scenario of changing sensory tuning functions: any change in the input distribution is automatically detected by the network, resulting in an appropriate adjustment of the weight settings. As a consequence of this constant readaptation, the readout representation of the external stimulus can remain remarkably invariant to changes in the sensory representation. The second scenario deals with neurogenesis and neuron death, two phenomena that have been reported repeatedly in adult cortex (Morrison & Hof, 2007). We studied the effects of neuron growth and death in the readout population, showing that through learning, the readout network can minimize the detrimental effects of cell death and maximize the gain in representation accuracy brought by neurogenesis.
In the first scenario, we studied the consequences of changing tuning functions in the sensory population on the readout network (see Figure 4). For this, we divided the sensory population into three groups (G1–G3). Initial sensory tuning functions were chosen such that the whole range of the input stimulus was represented uniformly in each group. The simulation was then split into three phases: a static phase (0 s–1000 s), a dynamic phase (1000 s–2000 s), and a consolidation phase (2000 s–3000 s). During the static phase, sensory tuning functions were kept constant, and the readout network learned an optimal readout representation based on the initial sensory code. The resulting weight matrices are qualitatively identical to those obtained in Figure 2C. In the dynamic phase, tuning functions in all three groups of sensory neurons started to change. In the first group, tuning functions broadened, in the second group they narrowed, and in the third group tuning functions narrowed and simultaneously started to develop a second mode (see Figure 4B). This change in the sensory population code was automatically “detected” by the readout network, causing the synaptic readout weights to take up pursuit of a quickly moving target: the attractor centers of the learning dynamics that depend on and thus drift in parallel with sensory tuning functions. A snapshot of the readout weight matrix at time 1600 s in the middle of the dynamic phase is shown in Figure 4C. In the consolidation phase, sensory tuning functions were fixed again. This allowed the readout network to converge to the new optimal setting, thereby reinterpreting all sensory neurons according to their new properties. This is particularly visible in group 3 (G3), where each sensory neuron had developed two modes in the tuning function. The corresponding optimal decoding strategy is to connect each sensory neuron in G3 strongly with the readout neurons that correspond to the two modes. 
As shown in Figure 4C, this strategy was found autonomously through learning. Figure 4D illustrates the preferred stimuli of five exemplary readout neurons during the dynamic phase. One should note that in spite of drastic changes in the sensory representation (e.g., the development of bimodal tuning functions; see Figure 4B), the preferred stimuli of readout neurons barely change during the adaptation period. Hence, learning facilitates the decoupling between sensory and readout representations with respect to the external stimulus.
In a second scenario (see Figure 5) we studied the effect of neuron growth and death in the readout population. In a computer simulation, a microcircuit was set up with five readout neurons, and the weights were allowed to converge to a stable setting on stimulation with input stimuli analogous to Figure 2. At t=800 s and t=1600 s, the growth of five new readout neurons with randomly initialized weights was simulated, yielding a jump in the mean squared error (MSE) of the reconstruction (see Figure 5B). In theory, a larger number of readout neurons allows a more finely grained representation, and hence a more accurate readout. However, this requires that the weights of all neurons are adjusted appropriately. In particular, the newly formed neurons should learn to respond to the most poorly represented regions of the external stimulus. Figure 5A shows that this optimal strategy automatically emerges through learning: the preferred stimuli of the newly grown neurons quickly learned to fill in the “spaces” in between the existing neurons. This is also reflected by a lower mean squared error of the reconstruction signal (Figure 5B; compare at t=800 s and t=1600 s). A similar effect is observed after simulating the new growth of another five neurons with random weight initialization. At t=2400 s, the sudden death of 10 neurons was simulated. Theoretically, this substantially reduces the achievable readout accuracy. However, the loss in accuracy is smallest if the surviving neurons rearrange such that their preferred stimuli are uniformly distributed over the input range. As shown in Figure 5A, this is the strategy automatically implemented by learning in the network (2400 s–3200 s). Hence, the detrimental effects of neuron loss on the readout representation are minimized through learning (see the corresponding reduction of reconstruction MSE in Figure 5B).
2.5. Improved Representation of Behaviorally Relevant Inputs.
The ability to dynamically allocate resources for representing important peripheral inputs in a use-dependent manner is a hallmark of cerebral cortex (Buonomano & Merzenich, 1998). Indeed, cortical representations are highly plastic, and the dynamic changes in cortical representation on manipulation of input or during task learning are thought to underlie the phenomenon of perceptual learning: the improvement of sensory abilities during training (Seitz & Watanabe, 2005). Although the exact neural mechanisms that give rise to task-related improvements are largely unknown, experimental evidence suggests an important role for synaptic plasticity of intracortical connections (Buonomano & Merzenich, 1998) in conjunction with top-down “relevance” or reward signals (Seitz & Watanabe, 2005). The latter have been hypothesized to be communicated to local cortical circuits via diffuse neuromodulatory signals, such as “acetylcholine, norepinephrine or dopamine, which gate learning and thus restrict sensory plasticity” (Seitz, Kim, & Watanabe, 2009). Here we will show that by incorporating such a modulatory signal into the considered cortical microcircuit model, the internal representation of stimuli becomes automatically focused on behaviorally relevant inputs—those inputs that are consistently paired with high levels of the modulatory signal. The resulting allocation of cortical resources in proportion to relevance is strongly reminiscent of experimentally observed training-dependent cortical map plasticity (Buonomano & Merzenich, 1998).
In a computer simulation, we incorporated a global modulatory signal into the cortical microcircuit considered above, acting as a soft gate for learning. The gating was implemented as a multiplicative factor on the learning rate. During an initial reference phase, the modulation remained constant during exposure to stimuli, leading, as expected, via learning to a uniform internal representation (see Figures 6B and 6C at t=1000 s). Modulation was activated at t=1000 s and was chosen in dependence on the current external stimulus, such that certain stimulus ranges were paired with higher levels of the modulatory signal than others (see Figure 6A, red curve). As shown, population responses that coincided with higher modulation levels were more strongly imprinted in the synaptic network weights, and consequently, the internal representation started to shift toward a denser concentration around the more “relevant” stimuli (see Figures 6B and 6C at t=4000 s). In contrast, stimuli paired with lower levels of modulation became more crudely represented by the network. Hence, the resulting density of representation in the distribution of preferred stimuli is directly linked to the strength of modulation imposed for the different stimulus values.
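The gating described above can be sketched in a few lines. This is a minimal illustration, not the simulation code of the article; the function name, the base learning rate, and the bell-shaped relevance profile are all illustrative assumptions.

```python
import numpy as np

def gated_learning_rate(base_rate, stimulus, relevance_profile):
    """Soft gating of plasticity: the effective STDP learning rate
    is the base rate scaled multiplicatively by a stimulus-dependent
    modulatory signal. `relevance_profile` maps a stimulus value to
    a modulation level; both names are illustrative."""
    return base_rate * relevance_profile(stimulus)

# Hypothetical profile: stimuli near 90 degrees are behaviorally relevant.
relevance = lambda s: 0.2 + 0.8 * np.exp(-((s - 90.0) ** 2) / (2 * 30.0 ** 2))

eta_relevant = gated_learning_rate(1e-3, 90.0, relevance)    # fully gated on
eta_irrelevant = gated_learning_rate(1e-3, 270.0, relevance)  # nearly gated off
```

Because updates triggered during high-modulation stimuli are larger, those population responses are imprinted more strongly, which is the mechanism behind the denser representation of relevant stimulus ranges.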
Such use-dependent allocation of cortical resources is a ubiquitous feature of cortical plasticity and has been shown to correlate with significant performance improvements on corresponding tasks (Buonomano & Merzenich, 1998). Our results indicate that the modulation of learning in a local cortical microcircuit model through a diffuse global signal is indeed sufficient to reproduce relevance-dependent allocation of representational resources. This provides a simple mechanism that could underlie the characteristic improvements in sensory abilities associated with perceptual learning.
Ensembles of pyramidal cells with lateral inhibition in layers 2/3 and 5/6 (Fino & Yuste, 2011) constitute a universal network motif of cortical microcircuits in many different cortical areas and different species. We propose that these network motifs acquire through STDP a rather universal computational function that appears to be essential for many, if not all, cortical areas: the compression of high-dimensional noisy spike inputs into lower-dimensional sparse representations of the most likely hidden causes of these high-dimensional spike inputs. This yields in particular an emergent optimal decoding of population codes in the sense of Jazayeri and Movshon (2006). This emergent computational function of stochastic WTA circuits is very stable because a rigorous learning theory shows that the theoretically optimal values of synaptic weights are attractors in the dynamics of synaptic weights under the considered class of STDP rules with homeostatic plasticity. A remarkable feature of the underlying learning theory is that it creates a link between local synaptic learning rules such as STDP and the most powerful known abstract method for autonomous (i.e., unsupervised) learning: EM. This link holds provably for the class of optimal STDP rules and, in practice, also approximately for STDP rules that are not directly covered by theory (see Figure 3).
Furthermore, the learning theory for STDP that we have introduced provides a new benchmark for analyzing and understanding from a functional perspective the large variety of parameters and learning curves for STDP that have been found at different synapses (Dan & Poo, 2006). This learning theory proposes that the theoretically optimal version of STDP (from the perspective of autonomous generation of implicit generative models for high-dimensional noisy spike inputs through EM) depends on both the firing statistics of presynaptic neurons and the average shape of EPSPs at the soma in a systematic and predictable manner.
In particular, the learning theory for STDP that we have presented throws new light on the question of downstream decoding of information conveyed by populations of noisy spiking neurons. Jazayeri and Movshon (2006) had already shown that simple linear neurons could in principle carry out a theoretically optimal maximum likelihood decoding, provided that suitable values are chosen for their synaptic weights. The learning theory that we have presented shows that these optimal weight values are in fact attractors with regard to the dynamics of synaptic weights under STDP in stochastic WTA circuits. This learning theory for STDP therefore also provides a possible solution to the problem of how downstream neurons can maintain optimal decoding of population codes of sensory neurons in spite of ubiquitous changes in the tuning functions of sensory neurons (see Figure 4) and changes in the number of neurons involved in this decoding (see Figure 5). Furthermore, if one assumes that synaptic plasticity is gated by neuromodulators or network activity so that the learning rate of STDP is increased for behaviorally relevant stimuli, the downstream decoding network automatically adapts its resolution in order to achieve a finer representation of behaviorally relevant ranges of external stimuli (see Figure 6).
An interesting aspect, which we did not study in this article, is whether self-organizing maps, similar to the well-documented orientation maps in cat visual cortex (Hubel & Wiesel, 1963), could emerge in the presented model. Indeed, one could think of clever changes in the model to induce such effects, for example, adding lateral excitatory connections between neighboring neurons, which would likely facilitate the emergence of locally smooth maps. But it is not clear at all whether a downstream “user” of the network output actually requires such a smooth map representation; indeed, many rodents, including rats, mice, and squirrels, can live without smooth maps (Van Hooser, Heimel, Chung, Nelson, & Toth, 2005). Furthermore, from a purely functional standpoint, neurons in the presented model with similar tuning (e.g., specialized on similar motion directions) will generally have nonnegligible signal correlation (see Figure 2D). Hence, downstream neurons that receive connections from the WTA neurons could quite easily detect groups of neurons that code for similar input stimuli and establish selective synaptic connections to “functionally neighboring” neurons even if their spatial arrangement is scrambled.
The WTA mechanism in this article relies on an exponential input-output nonlinearity of the form e^u to fit the requirements of Bayesian inference. Indeed, this is precisely the input-output relation found empirically by Jolivet et al. (2006), who fit a stochastic spike response model to predict spike timings in pyramidal cells from layer 5 of rat somatosensory cortex. Other authors have suggested a power relation of the form u^p between membrane potential and output rate (Priebe, Mechler, Carandini, & Ferster, 2004). Although we expect that power relations would suffice to induce the desired competition, it would be interesting to study the effect of different nonlinearities in future work.
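The distinctive property of the exponential nonlinearity can be checked numerically: with e^u, the ratio of two neurons' rates depends only on the difference of their membrane potentials (which is what makes the normalized rates a softmax over potentials), whereas a power law u^p does not have this property. A small sketch, with the exponent p chosen arbitrarily for illustration:

```python
import numpy as np

def rate_exp(u):
    """Exponential input-output nonlinearity, rho ~ e^u."""
    return np.exp(u)

def rate_power(u, p=3.0):
    """Power-law alternative, rho ~ u^p for u > 0 (p is illustrative)."""
    return np.clip(u, 0.0, None) ** p

# Rate ratio between two neurons with potentials 2.0 and 1.0:
ratio_exp = rate_exp(2.0) / rate_exp(1.0)        # = e^(2-1), depends only
                                                 #   on the difference
ratio_pow = rate_power(2.0) / rate_power(1.0)    # = 2^3, depends on the
                                                 #   absolute values
```

Shifting both potentials by a constant leaves `ratio_exp` unchanged but alters `ratio_pow`, which is why the exponential form combines naturally with divisive inhibition to produce normalized (probability-like) responses.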
One of the model assumptions is conditional independence of inputs; hence, learning is guaranteed to be optimal only if there are no noise correlations in the input (although convergence is guaranteed regardless). Many studies have found moderate noise correlations in cortex (Smith & Kohn, 2008; Huang & Lisberger, 2009). On the other hand, a recent study by Ecker et al. (2010) found that in primary visual cortex, even nearby neurons with similar tuning show close to zero correlation (and the authors argue that earlier findings may need to be reevaluated under more precise experimental paradigms). Nonetheless, we tested how nonzero noise correlations affect network performance. Figure S2 in the online supplement shows that for a moderately correlated input population code, the performance remains quite high, although not optimal. This is to be expected from a learner that relies on the independence assumption, and it is in accordance with the findings of Graf, Kohn, Jazayeri, and Movshon (2011), who showed that linear readouts can be further improved if input correlations are taken into account. Whether input correlations could be exploited efficiently in the context of this article, that is, whether and how the correlation structure in the inputs could be learned autonomously in an unsupervised manner (Graf et al., 2011, used supervised learning) by a spiking network with local plasticity rules, remains an open question for future work.
Another model assumption we made is that the stimulus distribution is uniform. Note that this does not invalidate the presented results for nonuniformly distributed inputs: even in a nonuniform stimulus setting, learning is still guaranteed to converge to an optimal solution with respect to the fixed prior distribution (since the assumption is made on the level of the implicit generative model, not on the actual input distribution). The solution produced by the network for a nonuniform stimulus distribution resembles Figure 6: a greater number of neurons specialize on the more likely region of the stimulus (see Figure S3 in the online supplement). This is precisely what is expected from a maximum likelihood learner with a fixed prior model distribution. Another important question is whether the network also adapts its implicit prior distribution over the stimulus in an optimal manner. In other words, does the WTA network automatically become biased toward activating those neurons whose preferred directions occur more often in the input? Indeed, this turns out to be the case. Since high-probability stimuli will automatically attract more WTA neurons during the specialization process, the implicit prior distribution of the network favors high-probability stimuli after learning. This phenomenon is also quantitatively consistent with the prediction of a Bayesian framework (see Figure S3).
3.1. Related Work.
As a model for decoding of high-dimensional population codes in the brain, the key novelty of this work is plasticity. In particular, this article demonstrates to the best of our knowledge for the first time that population codes not only can be read out efficiently (this had already been shown by Deneve et al., 1999; Jazayeri & Movshon, 2006; and Chaisanguanthum & Lisberger, 2011), but that this readout can also be learned optimally by a stereotypical cortical microcircuit motif in a process that is entirely autonomous and self-organizing. An interesting difference from previous models for population decoding lies in the output code of the readout circuit, which can represent a whole distribution (in contrast to Deneve et al., 1999), while being sparse for typical stimuli (as opposed to the predictions of Jazayeri & Movshon, 2006). Hence, the information extracted about the inputs is represented and conveyed by the whole network population rather than by the identity of a single neuron. Strong, unambiguous stimuli will elicit sharp responses, whereas weaker, low-contrast, or ambiguous stimuli will lead to a distributed code. Furthermore, our model predicts that neural activity represents samples from a probability distribution, a coding scheme that has recently attracted considerable attention, as it appears to be particularly suitable for probabilistic representations subject to learning and adaptation (Fiser, Berkes, Orban, & Lengyel, 2010). As a consequence, our model is not only consistent with the experimentally observed trial-to-trial variability of neuronal responses, but in fact requires that neurons respond in a stochastic manner. Hence, one can view this article also as a contribution to the growing literature on computational properties of networks of stochastically firing neurons.
The link to EM extends the model's generality beyond population coding to input representations that cannot be easily described in terms of a fixed number of external variables. In this more general context, different authors have proposed neural network models that are able to carry out Bayesian inference (Doya, Ishii, Pouget, & Rao, 2007), and some also considered the question of how probabilistic representations could emerge autonomously through learning, for example, Dayan et al. (1995), Rao and Ballard (1999), and Keck, Savin, and Lücke (2012), who used a related approach to the one developed here but required more approximations and did not model spiking neurons (and hence also not the relation between EM and STDP). While these models focused on rather artificial neural networks based on abstract (either binary or continuous-valued rate-based) neural units, recent studies have started to address whether more realistic spiking neural networks are also capable of acquiring optimized probabilistic representations through autonomous learning. A model by Deneve (2008) focused on a single spiking neuron and showed that such a neuron can in principle learn an efficient code for its inputs in a basic temporal generative model. Whether parameter learning in the model can be scaled up to multiple neurons, which would be required to cope with complex but biologically realistic input distributions, remains an interesting open issue. Finally, the recent model by Nessler et al. (2010) showed that spiking neurons can learn to represent a mixture of multinomial input variables, assuming that exactly one input neuron in a group is active at a time. The generalization and extension of this result to biologically realistic firing statistics and EPSP shapes, required for the decoding of realistic population codes, was described in this article.
Finally, Law and Gold (2009) showed in a related work that near-optimal population readouts can also be learned autonomously through a simple reinforcement learning (RL) rule. More precisely, they demonstrate that for a given decision problem, for example, whether the motion direction encoded by the input is less than or more than 180 degrees, near-optimal linear readouts from an input population can be found through RL. The learning scheme relies on a feedback signal that conveys the correctness of the decisions computed from the readout. In principle, for any particular decision problem, a different set of readout weights is required. Our work complements these findings insofar as we show that an (external) feedback signal is not needed for the emergence of optimal readouts. Instead, optimal readout weights in a downstream population of neurons can emerge through a purely self-supervising process. Furthermore, in principle, any decision problem can then be reduced to simply counting spikes emitted by the network population, for example, checking whether those output neurons with preferred directions below 180 degrees produced more spikes than those with preferred directions of above 180 degrees. Given the attractive properties of both approaches and the prevalence of both STDP and reward-modulated learning in the brain, it is not unreasonable to assume that the cortex employs a combination of these learning mechanisms.
3.2. Experimentally Testable Predictions.
Our results predict that abolishment of STDP during a critical period prevents the emergence of sparse codes for frequently occurring sensory stimuli. They also predict that with intact STDP, the coding properties of pyramidal cells will change in a predictable manner in response to changes in the distribution of external stimuli or their behavioral relevance (see Figure 6), since this change will be tracked by the implicit generative model of ensembles of pyramidal cells (probably on layers 2/3 and layers 5/6). In addition, a lesion of some neurons within this ensemble will cause a redistribution of neural codes among the remaining neurons (see Figure 5). Our theoretical analysis of optimal versions of STDP predicts a specific dependence of features of STDP on the firing statistics of presynaptic neurons that can in principle be tested experimentally.
Furthermore, our model predicts that in WTA neurons, that is, pyramidal cells in layers 2/3 and 5/6, the tuning of firing rates should be sharper than the tuning of membrane potentials. In fact, this is the well-known experimentally observed iceberg effect (Carandini & Ferster, 2000). But in our model, the iceberg effect also has a novel functional interpretation from the perspective of Bayesian inference: the instantaneous firing rate must depend on the current membrane potential via a sharpening exponential activation function in order to ensure that neurons encode probabilities. Hence, the iceberg effect is a prerequisite for correct learning of optimal decoding weights. Another direct prediction of our model is that the stimulus selectivity in ensembles of pyramidal cells with lateral inhibition is sharper when familiar stimuli are presented compared to novel stimuli. This effect was recently reported in monkey inferior temporal cortex (ITC) by Freedman, Riesenhuber, Poggio, and Miller (2006).
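The sharpening produced by the exponential activation function can be verified with a toy computation. Here the gaussian membrane-potential tuning curve, its width, and its gain are all illustrative assumptions; the point is only that passing any bell-shaped potential tuning through e^u yields a narrower rate tuning.

```python
import numpy as np

# Hypothetical gaussian membrane-potential tuning over stimulus angle s
s = np.linspace(-180.0, 180.0, 721)
u = 2.0 * np.exp(-s**2 / (2 * 40.0**2))   # membrane-potential tuning (a.u.)
r = np.exp(u)                              # exponential rate nonlinearity

def halfwidth(x, y):
    """Full width at half maximum (above baseline) of a tuning curve y(x)."""
    y = y - y.min()
    above = x[y >= y.max() / 2.0]
    return above.max() - above.min()

w_u = halfwidth(s, u)   # width of membrane-potential tuning
w_r = halfwidth(s, r)   # width of firing-rate tuning: narrower ("iceberg")
```

The rate tuning `w_r` comes out substantially narrower than the potential tuning `w_u`, matching the iceberg effect described in the text.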
In summary, we have shown that in conjunction with STDP, a common network motif of cortical microcircuits acquires a generic computational function: it creates a sparse representation of complex high-dimensional spike inputs to a local microcircuit. Although we have discussed in this article only the application to decoding of information from populations of noisy sensory neurons, the generation of sparse representations for complex high-dimensional inputs, which converge onto a microcircuit from many different brain areas, is a candidate for a generic computational operation that is meaningful for microcircuits in any cortical area.
A.1. Spike-Timing-Dependent Plasticity Rules.
| Rule | Causal window (amplitude) | Anticausal window (amplitude) | Homeostatic term | Noise |
| A. Optimal rule, equation 2.3 | 20 ms (2) | 0 ms | −1 | No |
| B. Optimal with synaptic scaling | 20 ms (2) | 0 ms | −w | No |
| C. Longer causal window | 40 ms (4) | 0 ms | −1 | No |
| D. Common STDP curve | 20 ms (0) | 60 ms (0) | 0 | Yes |
Notes: Labels A–D are as in Figure 3. The anticausal window of 0 ms in rules A–C corresponds to the absence of STDP depression. For these rules, weight stabilization is instead achieved through the homeostatic plasticity term, which is triggered for every postsynaptic spike.
We set , , and for the simulations in the main text. Figure S1 shows the impact on performance of varying and . For the common STDP curve (see Figure 3D), additional zero-mean gaussian noise was superimposed for each learning update triggered by a postsynaptic spike. The standard deviation of this noise was dependent on the magnitude of the deterministic update: for each deterministic update of magnitude M, gaussian noise with standard deviation was added. The noise parameters were set to and . The optimal rule, equation 2.3, was used for simulations in Figures 2 to 6. Rules B to D were used in Figure 3.
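The structure of these rules can be sketched as follows. This is an illustrative reading of the table, not the article's code: the constants `c` and `eta` are placeholders (the exact parameter values are elided in the text above), and the form of the potentiation term follows the optimal rule of equation 2.3 as described, with a constant homeostatic depression applied at every postsynaptic spike and optional update-magnitude-scaled gaussian noise as used for rule D.

```python
import numpy as np

rng = np.random.default_rng(0)

def stdp_update(w, x, eta=1e-3, c=2.0, noise_scale=0.0):
    """One weight update, triggered by a postsynaptic spike.

    w : current weight vector of the readout neuron
    x : filtered presynaptic traces (activity within the causal window)
    Potentiation c * x * exp(-w) plus homeostatic depression -1 on
    every postsynaptic spike (table row A); constants are illustrative.
    With noise_scale > 0, gaussian noise with standard deviation
    proportional to the update magnitude is superimposed (rule D).
    """
    dw = c * x * np.exp(-w) - 1.0
    if noise_scale > 0.0:
        dw = dw + rng.normal(0.0, noise_scale * np.abs(dw))
    return w + eta * dw

# Example: synapses with recent presynaptic activity are potentiated,
# silent ones are depressed.
w_new = stdp_update(np.zeros(3), np.array([1.0, 0.0, 1.0]), eta=0.1)
```

Synaptic scaling (rule B) would replace the constant −1 by a weight-dependent term −w, which is the only change needed in the sketch above.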
A.2. Implicit Generative Model.
A.3. Stochastic Winner-Take-All Circuit and Inference.
Note that the restriction is necessary to make the theory tractable. However, as demonstrated in the simulations throughout this article (see Figures 2–6), in which sensory input neurons fire randomly and only the total input population rate was kept constant as in Jazayeri and Movshon (2006), this restriction is not required for the functionality of the model in practice.
Now consider a population of stochastically spiking readout neurons k that fire at an instantaneous rate depending on their current membrane potential u_k = w_k^T x − I. We make two basic assumptions about the inhibitory contribution: that I is common to all readout neurons, and that the inhibitory circuit providing I ensures that an approximately constant total target firing rate is maintained. This form of divisive inhibition introduces competition among the readout neurons, since a strongly activated readout neuron claims a large fraction of the total target firing rate, thereby suppressing the other readout neurons.
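Combined with the exponential nonlinearity, this normalization makes the instantaneous rates proportional to a softmax over the membrane potentials. A minimal sketch (the total target rate of 45 Hz is illustrative, corresponding to 3 Hz for each of 15 readout neurons as in the simulations):

```python
import numpy as np

def wta_rates(u, R_total=45.0):
    """Instantaneous readout rates under divisive inhibition.

    Each neuron fires at a rate proportional to exp(u_k); the common
    inhibition term I normalizes the population so that the total
    rate equals R_total. This is equivalent to R_total * softmax(u).
    """
    e = np.exp(u - np.max(u))   # subtract max for numerical stability
    return R_total * e / e.sum()

u = np.array([1.0, 2.0, 4.0])   # example membrane potentials
r = wta_rates(u)                # rates sum to R_total; largest u wins most
```

Note that a strongly activated neuron claims most of the fixed rate budget, which is exactly the competition mechanism described above.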
A.4. Equilibria of Theoretically Optimal STDP Rules and Maximum Likelihood.
Here we discuss two important results. First, we derive conditions for the equilibrium points of STDP: stable weight settings for which the expected STDP update is zero. The significance of these points derives from the fact that they are attractors in the weight dynamics. Hence, STDP will always drive the weights into the neighborhood of such an equilibrium point. Second, we show that these equilibria also have a special interpretation from the generative model perspective: they correspond to locally optimal parameter settings for the generative model in the maximum likelihood sense.
A.5. Link Between Optimal STDP and Expectation-Maximization.
This means that the average STDP update will never decrease the performance of the generative model. In fact, one can easily verify that it will always increase performance unless the weight setting already constitutes a local optimum of the log likelihood (see also section A.4). Altogether this implies that the average effect of STDP can be described as an attractor dynamics in the weight space in which the attractors are the equilibrium points of STDP and at the same time the local optima of the generative model. Stochastic deviations from these dynamics are zero mean and can thus be suppressed to arbitrary precision via the learning rate (at the cost of convergence speed).
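In stochastic-approximation terms, the argument can be summarized as follows. This is a sketch with generic notation (w the weight vector, eta the learning rate, L(w) = log p(x | w) the log likelihood of the implicit generative model), and the ascent property holds in expectation to first order in the learning rate; the expected update is an EM-style ascent step, not necessarily the exact gradient:

```latex
% Weight dynamics and their two key properties:
\mathbf{w}_{t+1} = \mathbf{w}_t + \eta\,\Delta\mathbf{w}_t,
\qquad
\mathbb{E}\left[\Delta\mathbf{w}_t\right] = \mathbf{0}
\;\Longleftrightarrow\;
\nabla_{\mathbf{w}} L(\mathbf{w}_t) = \mathbf{0},
\qquad
\mathbb{E}\!\left[L(\mathbf{w}_{t+1})\right] \ge L(\mathbf{w}_t).
```

The first equivalence identifies the equilibria of STDP with the stationary points of the log likelihood; the inequality expresses that the average update never decreases the performance of the generative model, as stated above.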
A.6. Computer Simulations.
Synaptic weights wkj were initialized by drawing independently from a normal distribution with mean and standard deviation , where we used for E1, E2, E4, and E5, and for E3. The filtered spike trains xj(t) were then used as inputs to the readout neurons. The WTA circuit was implemented according to a discrete time approximation of equation A.11; at each discrete time step, the firing probability for each neuron k is given by . A spike occurs for neuron k if a draw from a Bernoulli distribution with this probability is successful. In all simulations, the total firing rate of the WTA circuit was chosen beforehand to achieve an average firing rate of 3 Hertz per readout neuron (in E4, was computed for N=15 and kept constant throughout the simulation). Whenever a readout neuron spiked, STDP was applied according to equation 2.3. The learning rate was set to in E1, E3, and E4, and to for E2. For E5, the learning rate depended on , as indicated in Figure 6.
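The discrete-time WTA spiking scheme described above can be sketched as follows. This is an illustrative reconstruction, not the article's simulation code: the weight scale, input traces, time step, and the total rate (here 3 Hz times five readout neurons) are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_wta_spikes(W, X, R_total=15.0, dt=1e-3):
    """Discrete-time approximation of the stochastic WTA circuit.

    W : (K, J) readout weights; X : (T, J) filtered input traces.
    At each time step, the membrane potentials u = W x determine
    normalized rates R_total * softmax(u); each neuron k then spikes
    according to a Bernoulli draw with probability dt * rate_k.
    """
    T, K = X.shape[0], W.shape[0]
    Z = np.zeros((T, K), dtype=int)
    for t in range(T):
        u = W @ X[t]
        e = np.exp(u - u.max())                       # stable softmax
        p = np.clip(dt * R_total * e / e.sum(), 0.0, 1.0)
        Z[t] = rng.random(K) < p                      # Bernoulli spikes
    return Z

# Toy run: 5 readout neurons, 20 inputs, 1 s of simulated time.
W = rng.normal(0.0, 0.1, size=(5, 20))
X = rng.random((1000, 20))
Z = simulate_wta_spikes(W, X)
```

In the full model, each row of `Z` that contains a spike would additionally trigger the STDP update for the spiking neuron, closing the learning loop.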
At regular intervals, the preferred stimuli of readout neurons were computed. This was done in a separate offline simulation, in which we swept over all input stimuli and selected for each readout neuron k the stimulus that elicited the greatest average response from that neuron. As a performance indicator, a reconstruction of the input signal was computed from the output spikes (shown in Figure 2E). First, we computed a raw reconstruction signal , which jumps to the preferred stimulus of a readout neuron k whenever zk emits a spike. From this, we obtained the reconstruction as the population vector average of the raw reconstruction values within a [t−20 ms, t] window. After training in E1, E2, E3, and E5, readout neurons were sorted for visualization according to their preferred stimuli at the end of the simulation. In E4, sorting was done individually for each visualization point, since different neurons were involved at each time.
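The population-vector reconstruction over a 20 ms window can be sketched as follows. Treating the stimulus as a circular variable in degrees is an assumption made for this illustration (appropriate for a motion-direction stimulus); function and argument names are illustrative.

```python
import numpy as np

def reconstruct(spike_times, spike_neurons, preferred, t, window=0.020):
    """Population-vector reconstruction of a circular stimulus.

    Averages the preferred stimuli (in degrees) of all readout
    neurons that spiked within [t - window, t], using a circular
    (vector) mean as described in the text. Returns NaN if no
    spikes fall in the window.
    """
    spike_times = np.asarray(spike_times)
    mask = (spike_times >= t - window) & (spike_times <= t)
    if not mask.any():
        return np.nan
    angles = np.deg2rad(np.asarray(preferred)[np.asarray(spike_neurons)[mask]])
    return np.rad2deg(np.arctan2(np.sin(angles).mean(),
                                 np.cos(angles).mean())) % 360.0

# Example: spikes from neurons preferring 0 and 90 degrees reconstruct 45.
preferred = [0.0, 90.0, 180.0, 270.0]
estimate = reconstruct([0.010, 0.015], [0, 1], preferred, t=0.020)
```

The circular mean avoids the wrap-around artifact of a plain arithmetic average (e.g., averaging 10 and 350 degrees should give 0, not 180).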
This article was written under partial support by the European Union: projects FP7-269921 (BrainScaleS), FP7-216593 (SECO), FP7-216886 (PASCAL2), and FP7-243914 (Brain-i-Nets).
Stefan Habenschuss and Helmut Puhr contributed equally to this work. A supplemental appendix is available online at http://www.mitpressjournals.org/doi/suppl/10.1162/NECO_a_00446.