Abstract
Normative models of synaptic plasticity use computational rationales to arrive at predictions of behavioral and network-level adaptive phenomena. In recent years, there has been an explosion of theoretical work in this realm, but experimental confirmation remains limited. In this review, we organize work on normative plasticity models in terms of a set of desiderata that, when satisfied, are designed to ensure that a given model demonstrates a clear link between plasticity and adaptive behavior, is consistent with known biological evidence about neural plasticity and yields specific testable predictions. As a prototype, we include a detailed analysis of the REINFORCE algorithm. We also discuss how new models have begun to improve on the identified criteria and suggest avenues for further development. Overall, we provide a conceptual guide to help develop neural learning theories that are precise, powerful, and experimentally testable.
1 Introduction
Our identities change with time, gradually reshaped by our experiences. We remember, we associate, we learn. However, we are only beginning to understand how changes in our minds arise from underlying changes in our brains. Of the many features of neural architecture that are altered over time, from the biophysical properties of individual neurons to the creation or pruning of synapses between neurons, changes in the strength of existing synapses remain the most prominent candidate for the neural substrate of longitudinal perceptual and behavioral change (Magee & Grienberger, 2020). Synaptic connections are easily modified, and these modifications can persist for extended periods of time (Bliss & Collingridge, 1993). Further, synaptic modification has been associated with many of the brain’s critical adaptive functions, including memory (Martin et al., 2000), experience-based sensory development (Levelt & Hübener, 2012), operant conditioning (Fritz et al., 2003; Ohl & Scheich, 2005), and compensation for stroke (Murphy & Corbett, 2009) or neurodegeneration (Zigmond et al., 1990). However, beyond these associations, it is often hard to establish a precise link between plasticity and a certain adaptive behavior. In this review, we distinguish “normative” modeling approaches from alternatives, demonstrate why they show promise for establishing precise links between mechanism and behavioral outcomes, and outline a set of desiderata that articulate how recent progress on normative plasticity models strengthens the link between plasticity and system-wide adaptive phenomena.
Plasticity models come in several flavors: phenomenological, mechanistic, and normative models (see Figure 1a) (Levenstein et al., 2020)—with the demarcation lines between them not always completely precise. Broadly, phenomenological models focus on concisely describing what happens in plasticity experiments with mathematical modeling; mechanistic modeling adds to this project by explaining how plasticity dynamics emerge from causal interactions between biophysical quantities. While phenomenological and mechanistic models articulate how synaptic plasticity works, they do not explain why it exists in the brain, that is, what its importance is for neural circuits, behavior, or perception. Answering this question with any precision requires an appeal to normative modeling.
Normative models aim to answer this why question by connecting plasticity to observed network-level or behavioral-level phenomena, including memory formation (Hopfield, 1982; Lengyel et al., 2005; Savin et al., 2014) and consolidation (Fusi et al., 2005; Clopath et al., 2008; Benna & Fusi, 2016), reinforcement learning (Frémaux & Gerstner, 2016), and representation learning (Oja, 1982; Hinton et al., 1995; Rao & Ballard, 1999; Toyoizumi et al., 2005; Savin et al., 2010). Guided by the intuition that plasticity processes have developed on an evolutionary timescale to near-optimally perform adaptive functions, normative plasticity theories are typically top-down in that they begin with a set of prescriptions about how synapses should modify in order to optimally perform a given learning-based function. Subsequently, with varying degrees of success, these theories attempt to show that real biology matches or approximates this optimal solution. Here, we review classical normative plasticity approaches and discuss efforts to improve them.1 To provide concrete examples of these principles in action, in appendix C we describe the REINFORCE algorithm (Williams, 1992), explain how it can function as a normative plasticity model, and note its successes and failures to match our desiderata.
2 Desiderata for Normative Models
One of the biggest challenges for normative models of synaptic plasticity is their connection to biology: their predictions often tie biophysical phenomenology with function in ways that are hard to access experimentally. Therefore, it is a major challenge to identify how to improve normative models with relatively limited access to experimental data confirming or rejecting their predictions. In what follows, we articulate a set of desiderata that can serve as both an organizing tool for understanding the contributions of recent normative plasticity modeling efforts and as a set of intermediate objectives for the development of new models in the absence of explicit experimental rejection or confirmation of older work. Normative plasticity models do not need to satisfy all desiderata to be useful. For example, several seminal normative plasticity models fail to accommodate known facts about biology (e.g., Hopfield networks (Hopfield, 1982) and Boltzmann machines (Ackley et al., 1985)). We argue that any normative plasticity model can be improved by making it conform more closely to our desiderata. Each principle is desirable for some combination of the following reasons: first, it may help ensure that the plasticity model actually qualifies as normative; second, it may require a model to accommodate known facts about biology; and third, it may ensure that models can be compared to existing experimental literature and generate genuinely testable experimental predictions. Most of these desiderata are relatively intuitive and simple. However, it has proven incredibly difficult for existing models of any adaptive cognitive phenomenon—from sensory representation learning, to associative memory formation, to reinforcement learning—to satisfy all of them in tandem.
2.1 Improving Performance on a Specified Objective
Many popular normative frameworks view neural plasticity as an approximate optimization process (Lillicrap et al., 2020; Richards et al., 2019), wherein synaptic modifications progressively reduce a scalar loss function. Within this perspective, the function of synaptic plasticity is to improve performance on this objective.2 Thus, the modeling process can be divided into two steps: articulating an appropriate objective and subsequently demonstrating that a synaptic plasticity mechanism improves performance.
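As a toy illustration of this two-step process, consider taking the objective to be the stimulus variance captured by a single linear neuron and checking, in simulation, that a purely local Hebbian rule, Oja's rule (Oja, 1982), improves it. The specific network, data distribution, and hyperparameters below are illustrative choices of ours, not taken from any particular study.

```python
import numpy as np

rng = np.random.default_rng(1)

# Step 1: articulate an objective. Here (as a toy example) the objective is the
# variance captured by a single linear neuron y = w.x -- i.e., how close the
# weight vector w is to the first principal component of the inputs.
n_dim, n_steps, lr = 20, 20000, 0.005
C = np.diag(np.linspace(0.1, 3.0, n_dim))   # anisotropic input covariance
chol = np.linalg.cholesky(C)

# Step 2: show that a candidate local plasticity rule improves that objective.
# Oja's rule uses only pre- and postsynaptic activity, yet ascends the variance.
w = rng.standard_normal(n_dim)
w /= np.linalg.norm(w)

def captured_variance(w, C):
    return float(w @ C @ w) / float(w @ w)

print("variance captured before learning:", captured_variance(w, C))
for _ in range(n_steps):
    x = chol @ rng.standard_normal(n_dim)   # sample a stimulus
    y = w @ x                               # postsynaptic activity
    w += lr * y * (x - y * w)               # Oja's rule: Hebbian term + decay
print("variance captured after learning: ", captured_variance(w, C))
print("optimum (top eigenvalue of C):    ", 3.0)
```

With these settings, the captured variance typically approaches the top eigenvalue of the input covariance; the rest of this section discusses when such empirical demonstrations suffice and when analytical guarantees are needed.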
Normative theories of synaptic plasticity developed to date usually involve some combination of supervised, unsupervised, or reinforcement learning objectives (see Figure 1c). The choice of objective function for a neural system influences the resultant form and scope of applicability of the model. For instance, supervised learning implies the existence of either an internal (e.g., motor error signals (Gao et al., 2012; Bouvier et al., 2018) or saccade information indicating that a visual scene has changed (Illing et al., 2021)) or external teacher (e.g., zebra finch song learning (Fiete et al., 2007)). Unsupervised teaching signals can be provided by prediction, as in generative modeling frameworks (Fiser et al., 2010). This account of sensory coding is popular for both its ability to accommodate normative plasticity theories (Rao & Ballard, 1999; Dayan et al., 1995; Kappel et al., 2014; Isomura & Toyoizumi, 2016; Bredenberg et al., 2021) and its philosophical vision of sensory processing as a form of advanced model building, beyond simple sensory transformations. Bayesian inference frameworks can also be useful for systematically quantifying uncertainty about optimal synaptic parameter estimates and for adjusting learning rates accordingly (Kappel et al., 2015; Aitchison et al., 2021; Jegminat et al., 2022). However, alternative perspectives on sensory processing exist, including those based on maximizing the information about a sensory stimulus contained in a neural population (Attneave, 1954; Atick & Redlich, 1990) subject to metabolic efficiency constraints (Tishby et al., 2000; Simoncelli & Olshausen, 2001), and those based on contrastive methods (Oord et al., 2018; Illing et al., 2021), where a self-supervising internal teacher encourages the neural representation of some stimuli to grow closer together, while encouraging others to grow more discriminable.
Evaluating which objective function (or functions) best explains the properties of a neural system is hard: while some forms of objective function may have discriminable effects on plasticity (e.g. supervised versus unsupervised learning; Nayebi et al., 2020), others are even provably impossible to distinguish (see appendix A). This motivates the idea that for a given data set, it is plausible that one objective ($\mathcal{L}_1$) can masquerade as another ($\mathcal{L}_2$). In some cases, complex objective functions can masquerade as simple objectives, which may only be epiphenomenal. Take balancing excitatory and inhibitory inputs as an example objective for a neuron: this could be a goal on its own (Vogels et al., 2011) or a consequence of predictive coding (Brendel et al., 2020). In other cases, philosophically distinct frameworks, such as generative modeling, information maximization, or denoising may simply produce similar synaptic plasticity modifications because the frameworks often overlap heavily (Vincent et al., 2010) and may not be distinguishable on simple data sets without targeted experimental attempts to disambiguate between them.
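Whatever objective is chosen, demonstrating that a plasticity rule improves performance typically amounts to showing that the synaptic update is negatively aligned with the gradient of that objective. Stated in our own notation, for a loss $\mathcal{L}(W)$ and a synaptic update $\Delta W$, the relevant condition (referred to below as equation 2.2) is

$$\big\langle \Delta W,\; \nabla_W \mathcal{L}(W) \big\rangle < 0,$$

which guarantees that, for a sufficiently small update, $\mathcal{L}(W + \Delta W) < \mathcal{L}(W)$.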
Some studies show empirically that this inner product is negative (Lillicrap et al., 2016; Marschall et al., 2020). A purely empirical demonstration that a learning algorithm aligns with the loss gradient on a particular task, network architecture, and data set does not necessarily generalize to the full range of relevant tasks. Moreover, trained networks are sensitive to hyperparameter choices: small changes in simulated network parameters can effect large qualitative differences in network behavior (Xiao et al., 2021). Further, a battery of in silico simulations under a variety of different parameter settings and circumstances rapidly begins to suffer from the curse of dimensionality, becoming almost as extensive as the collection of in vivo or in vitro experiments that it is attempting to explain.
For this reason, many studies construct mathematical arguments as to why equation 2.2 should hold for a given local synaptic plasticity rule by demonstrating that it either is a stochastic approximation to the true gradient (Williams, 1992; Spall, 1992) or maintains a negative inner product under reasonable assumptions (Bredenberg et al., 2021; Dayan et al., 1995; Ikeda et al., 1998; Meulemans et al., 2020). Mathematical analysis allows one to know quite clearly when a particular plasticity rule will decrease a loss function and identifies how plasticity mechanisms should change with changes in the network architecture or environment. However, analysis is often possible only under restrictive circumstances, and it is often necessary to supplement mathematical results with empirical simulations in order to demonstrate that the results extend to more general, more realistic circumstances.
2.2 Locality
Biological synapses can only change strengths using local chemical and electrical signals. “Locality” refers to the idea that a postulated synaptic plasticity mechanism should only refer to variables that could conceivably be available at a given synapse (see Figure 1b). Though locality may seem like an obvious biological requirement, it presents a great mystery: How does a system whose success or failure is determined by the joint action of many neurons distributed across the entire brain improve performance through local changes? This is particularly puzzling, given that successful machine learning algorithms—including backpropagation (Werbos, 1974; Rumelhart et al., 1985) (see appendix B), backpropagation through time (Werbos, 1990), and real-time recurrent learning (Williams & Zipser, 1989)—need nonlocal propagation of learning signals.
Despite its importance as a guiding principle for normative theories of synaptic plasticity, locality is a slippery concept, primarily because of our limited understanding of the precise battery of biochemical signals available to a synapse and how those signals could be used to approximate quantities required by theories. As such, locality has resisted mathematical formalization until very recently (Bredenberg et al., 2023). Because of the complexities associated with assessing locality, normative theories typically declare success when some standard of plausibility is reached, where derived plasticity rules roughly match the experimental literature (Payeur et al., 2021) or only require reasonably simple functions of postsynaptic and presynaptic activity that a synapse could hypothetically approximate (Oja, 1982; Gerstner & Kistler, 2002; Scellier & Bengio, 2017; Williams, 1992).
In normative models of synaptic plasticity, the need for locality is in perpetual tension with the general need for some form of credit assignment (Lillicrap et al., 2020; Richards et al., 2019), a mechanism capable of signaling to a neuron that it is “responsible” for a network-wide error and should modify its synapses to reduce errors. Depending on a network’s objective, a system’s credit assignment mechanism could take a wide variety of forms, some small number of which may only require information about the pre- and postsynaptic activity of a cell (Oja, 1982; Pehlevan et al., 2015, 2017; Obeid et al., 2019; Brendel et al., 2020), but many of which appear to require the existence of some form of error (Scellier & Bengio, 2017; Lillicrap et al., 2016; Akrout et al., 2019) or reward-based (Williams, 1992; Fiete et al., 2007; Legenstein et al., 2010) signal.
The extent to which a credit assignment signal postulated by a normative theory meets the standards of locality depends heavily on the nature of the signal. For instance, there is growing support for the idea that neuromodulatory systems, distributing dopamine (Otani et al., 2003; Calabresi et al., 2007; Reynolds & Wickens, 2002), norepinephrine (Martins & Froemke, 2015), oxytocin (Marlin et al., 2015), and acetylcholine (Froemke et al., 2013; Guo et al., 2019; Hangya et al., 2015; Rasmusson, 2000; Shinoe et al., 2005) signals, can propagate information about reward (Guo et al., 2019), expectation of reward (Schultz et al., 1997), and salience (Hangya et al., 2015) diffusely throughout the brain to induce or modify synaptic plasticity in their targeted circuits. Therefore, it may be reasonable for normative theories to postulate that synapses have access to global reward or reward-like signals, without violating the requirement that plasticity be affected only by locally available information (Frémaux & Gerstner, 2016).
Locality as a desideratum serves as a heuristic stand-in for the requirement that a normative model must be eventually held to the standard of experimental evidence. This is not to say that normative models cannot postulate neural mechanisms that have not yet been observed experimentally. However, for such an exercise to be constructive, the theory should clearly articulate how it deviates from the current state of the experimental field and how these deviations can be tested (see section 2.7; see appendix C for a concrete example of this process).
2.3 Architectural Plausibility
The learning algorithm implemented by a plasticity model can require specific architectural motifs to exist in a neural circuit in order to deliver reward, error, or prediction signals. These might include diffuse neuromodulatory projections (see Figure 4b) or neuron-specific top-down synapses onto apical dendrites (Richards & Lillicrap, 2019). Such architectural features are required for the learning algorithm in question and are known to exist in a wide range of cortical areas. However, normative plasticity models should not depend on circuit features that have been demonstrated not to exist in the modeled system, because spurious architectural features can be used to “cheat” at achieving locality by postulating unrealistic credit assignment mechanisms (see appendix B). In what follows, we highlight several particularly important architectural motifs that have been the focus of recent work.
Unlike the deterministic rate-based models typically used in machine learning, neurons communicate through discrete action potentials, with variability due to, for example, synaptic failures or task-irrelevant inputs (see Figure 2a; Faisal et al., 2008). Normative theories that employ rate-based activations (Bredenberg et al., 2020; Scellier & Bengio, 2017) or assume that the input-output function of neurons is approximately linear (Oja, 1982) may not extend to this more realistic discrete, stochastic, and highly nonlinear setting. Further, such theories inherently produce plasticity rules that ignore the precise relationship between pre- and postsynaptic spike times and will consequently be unable to capture spike-timing-dependent plasticity (STDP) phenomenology. Fortunately, learning rules that were originally formulated using rate-based models have subsequently been extended to spiking network models to great effect by leveraging methods that use stochasticity or explicit approximations to enable credit assignment through nondifferentiable spikes (Bohte et al., 2002; Pfister et al., 2006; Huh & Sejnowski, 2018; Shrestha & Orchard, 2018; Bellec et al., 2018; Neftci et al., 2019). Reward-based Hebbian plasticity based on the REINFORCE algorithm (see appendix C) (Williams, 1992) has been generalized to stochastic spiking networks (Pfister et al., 2006), while real-time recurrent learning approximations (Murray, 2019) and predictive coding methods (Rao & Ballard, 1999) have subsequently been extended to deterministic spiking networks (Bellec et al., 2020; Brendel et al., 2020). Therefore, a lack of a generalization to spiking networks is not necessarily a death knell for a normative theory, but many theories lack either an explicit generalization to spiking or a clear relationship to STDP, and the mathematical formalism that defines these methods may require significant modification to accommodate the change.
Real biological networks have a diversity of cell types with different neurotransmitters and connectivity motifs. At the bare minimum, a normative model should be able to accommodate Dale’s law (see Figure 2a), which stipulates that the neurotransmitters released by a neuron are either excitatory or inhibitory but not both (O’Donohue et al., 1985). Though this might seem like a simple principle, enforcing it can seriously damage the performance of artificial neural networks without careful architectural considerations (Cornford et al., 2021). Furthermore, the mathematical results of many canonical models of synaptic modification rely on symmetric connectivity between neurons, including Hopfield networks (Hopfield, 1982), Boltzmann machines (Ackley et al., 1985), contrastive Hebbian learning (Xie & Seung, 2003), and predictive coding (Rao & Ballard, 1999). This symmetry is partially related to the symmetric connectivity required by the backpropagation algorithm (see appendix B). Symmetric connectivity means that the connection from neuron A to neuron B must be the same as the reciprocal connection from neuron B to neuron A. This requirement is in direct tension with Dale’s law: an entirely excitatory and an entirely inhibitory neuron can never be reciprocally connected, because the positive sign of one synapse and the negative sign of the reciprocal connection would violate symmetry. Some models, such as Hopfield networks (Sompolinsky & Kanter, 1986) and equilibrium propagation (Ernoult et al., 2020), have been extended to demonstrate that moderate deviations from symmetry can exist and still preserve function. Further, a recent mathematical reformulation of predictive coding has demonstrated that interlayer symmetric connectivity is not necessary (Golkar et al., 2022). Therefore, recent results indicate that many canonical models believed to depend on symmetric connectivity can be improved on.
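In symbols (our notation): write $W_{ij}$ for the synapse from neuron $j$ to neuron $i$. Symmetry requires $W_{ij} = W_{ji}$, while Dale’s law requires every outgoing synapse of neuron $j$ to share a single sign $s_j \in \{+1, -1\}$, that is, $s_j W_{ij} \geq 0$ for all $i$. If neuron $j$ is excitatory ($s_j = +1$) and neuron $i$ is inhibitory ($s_i = -1$), then $W_{ij} \geq 0$ and $W_{ji} \leq 0$, and symmetry forces $W_{ij} = W_{ji} = 0$: the two cells cannot be reciprocally connected.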
Many early plasticity models, including Oja’s rule (Oja, 1982) and perceptron learning (Rosenblatt, 1958), as well as more modern recurrent network models focused on learning temporal tasks (Murray, 2019), are designed to greedily optimize layer-wise objectives, and their mathematical justifications do not generalize to multilayer architectures. Though greedy layer-wise optimization may be sufficient for some forms of unsupervised learning (Illing et al., 2021), it is not clear how such greedy methods would be able to support many complex supervised or reinforcement learning tasks humans are known to learn (Lillicrap et al., 2020) that involve coordinating sensorimotor transformations across cortical areas (but see Zador, 2019). Generalizing layer-local learning to multilayer objective functions has been the focus of much recent work: many multilayer models can be seen as generalizations of perceptron learning (Bengio, 2014; Hinton et al., 1995; Rao & Ballard, 1999), with other models such as those derived from similarity matching (Pehlevan et al., 2017) or Bienenstock-Cooper-Munro theory (Bienenstock et al., 1982; Cooper, 2004; Intrator & Cooper, 1992) receiving similar treatment (Obeid et al., 2019; Halvagal & Zenke, 2023). We will refer to this form of multilayer signal propagation as “spatial” credit assignment, and will refer to relaying information across time as “temporal” credit assignment (see Figure 2b and section 2.4). As we will discuss in the next section, models that do not support temporal credit assignment are not able to account for learning in inherently sequential tasks.
2.4 Temporal Credit Assignment
Because so many learned biologically relevant tasks involving temporal decision making (Gold & Shadlen, 2007) or working memory (Compte et al., 2000; Wong & Wang, 2006; Ganguli et al., 2008) inherently leverage information from the past to inform future behavior and because neural signatures associated with these tasks exhibit rich recurrent dynamics (Brody et al., 2003; Shadlen & Newsome, 2001; Mante et al., 2013; Sohn et al., 2019), many aspects of learning in the brain require a normative theory of synaptic plasticity that works in recurrent neural architectures and provides an account of temporal credit assignment.
As it currently stands, the majority of normative synaptic plasticity models focus on spatial credit assignment, which presents distinct challenges when compared to temporal credit assignment (Marschall et al., 2020). In fact, many theories that provide a potential solution to spatial credit assignment do so by requiring networks to relax to a steady state on a timescale much faster than inputs (Hopfield, 1982; Scellier & Bengio, 2017; Bredenberg et al., 2020; Xie & Seung, 2003; Ackley et al., 1985), which effectively prevents networks from having the rich, slow, internal dynamics required for many temporal motor (Hennequin et al., 2012) and working memory (Wong & Wang, 2006) tasks. Other methods appear to be agnostic to the temporal properties of their inputs but have not yet been combined with existing plasticity rules that perform approximate temporal credit assignment within local microcircuits (Murray, 2019; Bellec et al., 2020).
New algorithms do provide potential solutions to temporal credit assignment, either through explicit approximation of real-time recurrent learning (Marschall et al., 2020; Bellec et al., 2020; Murray, 2019), by leveraging principles from control theory (Gilra & Gerstner, 2017; Alemi et al., 2018; Meulemans et al., 2022), or by leveraging principles of stochastic circuits that are fundamentally different from traditional explicit gradient-based calculation methods (Bredenberg et al., 2020; Miconi, 2017). Many use what are called “eligibility traces” (Izhikevich & Desai, 2003; Gerstner et al., 2018; see Figure 2b)—a local synaptic record of coactivity—to identify associations between rewards and neural activity that may have occurred much further in the past. We suggest that these models capture something fundamental about learning across time and that much work remains to combine these with spatial learning rules to construct normative models of full spatiotemporal learning.
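One common form of such a trace (a simple variant of the three-factor rules cited above, in our notation) pairs a decaying record of pre-post coactivity with a delayed, circuit-wide third factor:

$$e_{ij}(t+1) = \lambda\, e_{ij}(t) + x_j(t)\, r_i(t), \qquad \Delta W_{ij}(t) = \eta\, M(t)\, e_{ij}(t),$$

where $x_j$ and $r_i$ are pre- and postsynaptic activity, $0 \leq \lambda < 1$ sets how long the trace persists, and $M(t)$ is a delayed modulatory signal such as reward or a reward prediction error. The trace lets a synapse claim credit for outcomes that arrive well after the coactivity that caused them.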
2.5 Learning during Behavior
The relationship between learning and behavior can vary widely depending on the experimental context (see Figure 2b): learning-related changes can occur concomitantly with action (Bittner et al., 2015; Sheffield et al., 2017; Grienberger & Magee, 2022) (“online” learning), during brief periods of quiescence between trials (Pavlides & Winson, 1989; Bönstrup et al., 2019; Liu et al., 2021), or over periods of extended sleep (Gulati et al., 2017; Eschenko et al., 2008; Girardeau et al., 2009) (“offline” learning). Therefore, whether a normative plasticity model uses offline or online learning should be determined by the experimental context.
However, many classical algorithms—especially those that support multilayer spatial credit assignment (Ackley et al., 1985; Xie & Seung, 2003; Dayan et al., 1995)—are constrained to modeling only offline learning, because they require distinct training phases, during at least one of which neural activity is driven for the purposes of learning rather than behavioral performance; these algorithms have begun to be extended to online learning only recently. For instance, algorithms such as Wake-Sleep (Hinton et al., 1995; Dayan et al., 1995) have been adapted such that the second phase becomes indistinguishable from perception (Bredenberg et al., 2020; Ernoult et al., 2020). Other recent models allow for simultaneous multiplexing of top-down learning signals and bottom-up inputs (Greedy et al., 2022), which enables online learning. These results suggest that future work may fruitfully adapt existing offline algorithms to provide good models of explicitly online learning in the brain.
2.6 Scalability in Dimensionality and Complexity
Models of brain learning need to be able to scale to handle the full complexity of the problems a given model organism has to solve. However, this is a point that can be difficult to verify: How can we guarantee that adding more neurons and more complexity will not make a particular collection of plasticity rules less effective? As a case study, consider REINFORCE (Williams, 1992), an algorithm that for the most part satisfies our other desiderata for normative plasticity, at least for the limited selection of tasks in naturalistic environments that are explicitly rewarded (see appendix C). However, though REINFORCE demonstrably performs better than its precursor, weight perturbation (Jabri & Flower, 1992), as the dimensionality of its stimuli, the number of neurons in the network, and the delay time between neural activity and reward increase, the performance of the algorithm decays rapidly, both in analytical treatments and in simulations (Werfel et al., 2003). This is primarily caused by the high variance of gradient estimates provided by the REINFORCE algorithm and is only partially ameliorated by methods that reduce its variance (Bredenberg et al., 2021; Ranganath et al., 2014; Mnih & Gregor, 2014; Miconi, 2017). Thus, adding more complexity to the network architecture actually impairs learning.
We do not mean to imply that all normative plasticity algorithms should be demonstrated to meet human-level performance or even that they should match state-of-the-art machine learning methods. Machine learning methods profit in many ways from their biological implausibility, and the human brain itself has orders of magnitude more neural units and synapses than have ever been simulated on a computer, all of them capable of processing totally in parallel. Therefore, direct comparison to the human—or any other—brain is also not fair. We propose the far softer condition that as the complexity of input stimuli and tasks increases, within the range supported by current computational power, plasticity rules derived from normative theory should continue to perform well both in simulation and, preferably, in analytical treatments.
Complexity is multifaceted, and involves features of both stimulus and task (see Figure 2c). Even stimuli with very high-dimensional structure can fail to capture critical features of naturalistic stimuli, which can be much more difficult to learn from; for instance, existing plasticity models have great difficulty scaling to naturalistic image data sets (Bartunov et al., 2018). Further, in natural environments, rewards are often provided after long sequences of complex actions; supervised feedback is sparse, if present at all; and an organism’s self-preservation often requires navigating both uncertainty and complex multi-agent interactions. Modern reinforcement learning algorithms are only just beginning to make progress with some of these difficulties (Kaelbling et al., 1998; Arjona-Medina et al., 2019; Raposo et al., 2021; Hung et al., 2019; Zhang et al., 2021), but as yet there are no normative plasticity models that describe how any of the human capabilities used to solve these problems could be learned through cellular adaptation. This suggests that scaling the ability of normative models to handle both complex stimuli and task structures is a major avenue of improvement for future methods.
2.7 Generating Testable Predictions
Testable predictions can be defined via several different experimental lenses, at the level of (1) individual neurons or synapses, (2) populations of neurons, (3) the feedback mechanisms that shape learning in neural circuits, and (4) learning at a behavioral level (see Figure 3a). Accurately distinguishing one mechanism from another will likely require a synthesis of experiments spanning all four lenses.
2.7.1 Individual Neurons
Experiments that focus on individual neurons, including paired-pulse stimulation (Markram et al., 1997), mechanistic characterizations of plasticity (Graupner & Brunel, 2010), pharmacological explorations of neuromodulators that induce or modify plasticity (Bear & Singer, 1986; Reynolds & Wickens, 2002; Froemke et al., 2007; Gu & Singer, 1995), and characterization of local dendritic or microcircuit properties mediating plasticity (Froemke et al., 2005; Letzkus et al., 2006; Sjöström & Häusser, 2006) form the bulk of the classical literature underlying phenomenological and mechanistic modeling. These studies characterize what information is locally available at synapses and what can be done with that information, as well as which properties of cells can be altered in an experience-dependent fashion.
Existing normative theories differ in the nature of their predictions for plasticity at individual neurons. Reward-modulated Hebbian theories require that feedback information be delivered by a neuromodulator like dopamine, serotonin, or acetylcholine (Frémaux & Gerstner, 2016) and that this feedback modulates plasticity at the local synapse by changing the magnitude or sign of plasticity depending on the strength of feedback. In contrast, some unsupervised normative theories require no feedback modulation of plasticity (Pehlevan et al., 2015, 2017), and others argue that detailed feedback information arrives at the apical dendritic arbors of pyramidal neurons to modulate plasticity, which is also partially supported in the hippocampus (Bittner et al., 2015, 2017) and cortex (Larkum et al., 1999; Letzkus et al., 2006; Froemke et al., 2005; Sjöström & Häusser, 2006).
Independent of the exact feedback mechanism, models differ in how temporal associations are formed. Algorithms related to REINFORCE assume that local synaptic eligibility traces integrate, over time, fluctuations in the coactivity of the pre- and postsynaptic neurons local to a synapse. These postulated eligibility traces are stochastic, summing gaussian fluctuations in activity (Miconi, 2017) that consequently produce temporal profiles similar to Brownian motion. In contrast, methods based on approximations to real-time recurrent learning propose eligibility traces that are deterministic records of coactivity whose time constants are directly connected to the dynamics of the neuron itself (Bellec et al., 2020), while other hybrid approaches predict eligibility traces that are deterministic but are related more to the predicted task timescale than to the dynamics of the cell (Roth et al., 2018). Though there do exist known cellular processes that naturally track coactivity, like NMDA receptors (Bi & Poo, 1998), and that store traces of this coactivity longitudinally, like CaMKII (Graupner & Brunel, 2010), it remains unclear how the properties of these known biophysical quantities relate to the predictions of various normative theories, and whether there are other biological alternatives.
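In simplified notation (ours), the two families of predictions differ roughly as follows. REINFORCE-like rules accumulate noise-driven fluctuations,

$$e_{ij}(t) = \sum_{t' \leq t} \xi_i(t')\, x_j(t'),$$

where $\xi_i$ is the postsynaptic noise fluctuation, yielding a random-walk-like trace, whereas rules derived from approximations to real-time recurrent learning low-pass filter coactivity deterministically,

$$\tau\, \dot{e}_{ij} = -e_{ij} + \psi(u_i)\, \bar{x}_j,$$

with $\bar{x}_j$ a filtered record of presynaptic activity, $\psi$ a function of the postsynaptic state, and a time constant $\tau$ tied to neuronal (or task) dynamics. Measuring whether single-synapse traces look noise-driven or smoothly filtered is therefore one concrete way to discriminate these model classes.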
2.7.2 Neural Circuits
The functional effects of plasticity and their relationship to behavior manifest most directly at the level of neural populations (Marschall & Savin, 2023). Determining how circuits encode task-relevant information and affect motor actions requires methods that record large groups of neurons, such as 2-photon calcium imaging, multielectrode recordings, fMRI, EEG, and MEG, as well as methods that manipulate large populations, like optogenetic (Rajasethupathy et al., 2016) stimulation.
First, a population-level lens is useful for evaluating hypotheses about the nature of the objective function, where one starts by training neural networks on a battery of objectives and tests which objective produces the closest correspondence between neural activity in the model and that recorded in the brain. This approach has been used in the ventral (Yamins et al., 2014; Yamins & DiCarlo, 2016) and dorsal (Mineault et al., 2021) visual streams, as well as in auditory cortex (Kell et al., 2018) and medial entorhinal cortex (Nayebi et al., 2021). Often changes in artificial neural network activity throughout time are sufficient to determine the objective optimized by the network as well as its learning algorithm (Nayebi et al., 2020), an approach that could also potentially be applied to recorded neural activity over learning.
Second, circuit recordings could test predictions about the existence of different phases of the dynamics, as required by some normative models. For instance, the Wake-Sleep algorithm (Dayan et al., 1995) proposes that neural circuits should spend extended periods of time (e.g., during dreaming) generating similar activity patterns to those evoked by natural stimulus sequences. There is plenty of room for experiments to more clearly map predictions and components of similar normative models onto well-documented neural phenomena, such as sleep or potentially replay phenomena (Girardeau et al., 2009; Eschenko et al., 2008).
Finally, some algorithms make specific predictions about inhibitory microcircuitry. Impression learning, for instance, suggests that a population of inhibitory interneurons could gate the influence of apical and basal dendritic inputs to the activity of pyramidal neurons (Bredenberg et al., 2021), and some learning algorithms propose that top-down error signals are partially computed by local inhibitory interneurons (Sacramento et al., 2017; Greedy et al., 2022). Therefore, to completely distinguish different theories, it may be necessary to analyze the connectivity and plasticity between small groups of different cell types. Because circuit recording and manipulation methods often sacrifice temporal resolution (Hong & Lieber, 2019) and have difficulty inferring biophysical properties of individual synapses and cells, these methods are best used in concert with single neuron studies to jointly tease apart the multilevel predictions of various normative models.
2.7.3 Feedback Mechanisms
The most direct way to distinguish normative plasticity algorithms is on the basis of the nature of their feedback mechanisms (see Figure 3b). Though no feedback is necessary for some unsupervised algorithms, like Oja’s rule, any form of supervised or reinforcement learning will require some form of top-down feedback. However, across models, the level of precision of feedback varies considerably. The simplest feedback is scalar, conveying reward (Williams, 1992), state fluctuation (Payeur et al., 2021), or context (e.g., saccade; Illing et al., 2021) or attention (Roelfsema & Ooyen, 2005; Pozzi et al., 2020) information. Beyond this, the space of proposed mechanisms expands considerably: backpropagation approximations like feedback alignment (Lillicrap et al., 2016) and random-feedback online learning (RFLO) (Murray, 2019) propose that random error feedback between layers of neurons can provide a sufficient learning signal, whereas algorithms based on control theory propose that low-rank or partially random projections carrying supervised error signals are sufficient (Gilra & Gerstner, 2017; Alemi et al., 2018). Other algorithms propose even more detailed feedback, with individual neurons receiving precise, carefully adapted projections carrying learning-related information. These algorithms propose that top-down projections to apical dendrites (Urbanczik & Senn, 2014) or local interneurons (Bastos et al., 2012) perform spatial credit assignment, but the nature of this signal can differ considerably across different algorithms. It could be a supervised target, carrying information about what the neuron state “should” be to achieve a goal (Guerguiev et al., 2017; Payeur et al., 2021), or it could be a prediction of the future state of the neuron (Bredenberg et al., 2021).
So far, different feedback mechanisms have received only partial support. For example, acetylcholine projections to auditory cortex could subserve a form of reward-based learning: they modulate perceptual learning (Froemke et al., 2013) and display a diversity of responses related to both reward and attention (Hangya et al., 2015), but contrary to simple reward-based learning algorithms, these response properties adapt over the course of learning in concert with auditory cortex (Guo et al., 2019). This suggests that while traditional models of reward-modulated Hebbian plasticity may be correct to a first approximation, a more detailed study of the adaptive capabilities of neuromodulatory centers may be necessary to update the theories.
While a growing number of studies indicate that projections to apical synapses of pyramidal neurons do play a role in inducing plasticity and that these projections themselves are also plastic (i.e., nonrandom; Bittner et al., 2015, 2017), very little is known about the nature of the signal—a critical component for distinguishing several different theories. In the visual system, presentation of unfamiliar images without any form of reward or supervision can modify both apical and basal dendrites throughout time (Gillon et al., 2021), and in the hippocampus, apical input to CA1 pyramidal neurons while animals acclimatize to new spatial environments is sufficient to induce synaptic plasticity (Bittner et al., 2015, 2017). There is further evidence for explicit motor error signals carried by climbing fiber pathways in the cerebellar system being used for plasticity (Gao et al., 2012; Bouvier et al., 2018).
In biofeedback training settings, animals can selectively control the firing rates of individual neurons to satisfy arbitrary experimental conditions for reward (Fetz, 2007), suggesting the existence of highly flexible credit assignment systems, which are not constrained by evolutionary predetermination.5 Other brain-computer interface (BCI) experiments more directly quantify the limits of this flexibility. In particular, animals have been shown to adapt more easily to BCI decoder perturbations that occur within the manifold of neural activity, relative to outside of manifold perturbations (Sadtler et al., 2014), which may be reflective of constraints on the credit assignment system (Feulner & Clopath, 2021) (but see Humphreys et al., 2022; Payeur et al., 2023). Moreover, recent evidence suggests that apical dendrites may receive precise learning signals in the retrosplenial cortex during BCI tasks (Francioni et al., 2023), which could underlie these remarkable capabilities.
2.7.4 Behavior
In much the same way that psychophysical studies of human or animal responses define constraints on what the brain’s perceptual systems are capable of, behavioral studies of learning can do quite a lot to describe the range of phenomena that a model of learning must be able to capture, from operant conditioning (Niv, 2009) to model-based learning (Doll et al., 2012), rapid language learning (Heibeck & Markman, 1987), unsupervised sensory development (Wiesel & Hubel, 1963), and consolidation effects (Stickgold, 2005). Behavioral studies can also outline key limitations in learning, which are perhaps reflective of the brain’s learning algorithms—for example, the brain’s failure to perform certain types of adaptation after critical periods of plasticity (Wiesel & Hubel, 1963).
These existing experimental results stand as (often unmet) targets for normative theories of plasticity, but in addition, normative theories themselves suggest further studies that may test their predictions. In particular, manipulation of learning mechanisms may have predictable effects on animals’ behavior, as seen when acetylcholine receptor blockade in mouse auditory cortex prevented reward-based learning in animals (Guo et al., 2019) and nucleus basalis stimulation during tone perception longitudinally improved animals’ discrimination of that tone (Froemke et al., 2013). Other algorithms have as-yet-untested predictions for behavior; for instance, experimentally increasing the influence of top-down projections should bias behavior toward commonly occurring sensory stimuli according to both predictive coding (Rao & Ballard, 1999; Friston, 2010) and impression learning (Bredenberg et al., 2021). For other detailed feedback algorithms (see Figure 3b), manipulating top-down projections may disrupt learning but would have a much more unstructured deleterious effect on perceptual behavior.
Overall, each experimental lens has its own advantages and disadvantages. Single-neuron studies are excellent for identifying the locally available variables that affect plasticity, circuit-level studies can help narrow down the objectives that shape neural responses and identify traces of offline learning, studies of feedback mechanisms can distinguish among different algorithms that postulate different degrees of precision in their feedback and in complexity of the teaching signal, and studies of behavior can place boundaries on what can be learned, as well as serve as a readout for manipulations of the mechanisms underlying learning. Each focus alone is insufficient to distinguish among all existing normative models, but in concert they show promise for identifying the neural substrates of adaptation.
3 Conclusion
Normative models of plasticity are compelling because of their potential to connect our brains’ capacity for adaptation to their constituent synaptic modifications. Generating good theories is a critical part of the scientific process, but finding ways to close the loop by testing key predictions of new normative models has proved extraordinarily difficult. In this perspective, we have illustrated some of the sources of this difficulty, have shown how recent work has progressed on these fronts, and have identified ways forward for future models.
The core of a normative plasticity model is its plasticity rule, which dictates how a model synapse modifies its strength. To be a normative model—to explain why the plasticity mechanism is important for the organism—there must be a concrete demonstration that this plasticity rule supports adaptation critical for system-wide goals like processing sensory signals or obtaining rewards (see section 2.1). However, this system-wide goal must be achieved using only local information (see section 2.2). These two needs of a normative plasticity model are the fundamental source of tension: it is very difficult to demonstrate that a proposed plasticity rule is both local and optimizes a system-wide objective (see appendix B). Insufficient or partial resolution of this fundamental tension produces normative models that satisfy the other desiderata to a lesser degree; namely, they struggle to map accurately onto neural hardware (see section 2.3) or handle complex temporal stimuli and tasks online (see sections 2.4 to 2.6). To provide a case study of how our desiderata come to be satisfied (or not) in practice, we have included a tutorial for the REINFORCE algorithm in appendix C.
In this review, we have organized emerging theories according to how they satisfy and improve on our desiderata, as well as by how they can be tested. Theoreticians can use our desiderata (see section 2.1 to 2.6) and Table 1 as guides for where theoretical development is needed in order to render normative models more biologically accurate and easier to test, while experimentalists can use the summary of their experimental predictions (see Table 2) to identify tests that distinguish different normative models from one another in specific neural systems. Even if existing algorithms prove not to be implemented exactly in the brain, they can provide key insights into how local synaptic modifications can produce valuable improvements in both behavior and perception for an organism. It seems sensible to use these algorithms as a springboard to produce more biologically realistic and powerful theories.
Table 1: Degree to which canonical normative plasticity algorithms have been shown to satisfy the desiderata.

| Algorithm | Improv. Perf. | Local. | Arch. Plaus. | Temp. Credit | Learning during Behavior | Scalability in Dim. & Complexity |
| --- | --- | --- | --- | --- | --- | --- |
| Backprop. (BP) (Werbos, 1974; see also Williams, 1992; Lee et al., 2016; Werbos, 1990) | U/S/R | ✗ | ✓ | ✓ | ✗ | ✓ |
| REINFORCE (Williams, 1992; see also Miconi, 2017; Werfel et al., 2003) | U/S/R | ✓ | ✓ | ✓ | ✓ | ✗ |
| Oja (Oja, 1982) | U | ✓ | ✗ | ✗ | ✓ | ✓ |
| Pred. Coding (Rao & Ballard, 1999; see also Whittington & Bogacz, 2017; Friston & Kiebel, 2009) | U/S | ✓ | ✗ | ✓ | ✓ | ✓ |
| Wake-Sleep (Dayan et al., 1995; see also Dayan & Hinton, 1996; Bredenberg et al., 2021) | U | ✓ | ✓ | ✓ | ✓ | ✓ |
| Approx. Gradient (Lillicrap et al., 2016; Akrout et al., 2019; see also Bellec et al., 2020; Murray, 2019) | U/S* | ✓ | ✓ | ✓ | ✓ | ✓ |
| Equil. Prop. (Scellier & Bengio, 2017; see also Ernoult et al., 2020; Laborieux et al., 2021) | U/S | ✓ | ✗ | ✗ | ✓ | ✓ |
| Target Prop. (Bengio, 2014; see also Manchev & Spratling, 2020; Lee et al., 2015) | U/S | ✓ | ✓ | ✓ | ✗ | ✓ |
Notes: A ✓ indicates that an algorithm has been demonstrated to satisfy a particular desideratum in at least one study, whereas an ✗ indicates that it has not been demonstrated. If the demonstrating study is an improvement on the seminal work or is a new model, we provide a citation. Asterisks indicate that results have only been shown by simulation and lack mathematical support. U, S, and R indicate whether a given algorithm supports unsupervised, supervised, or reinforcement learning, respectively.
Table 2: Testable predictions of canonical normative plasticity algorithms.

| Algorithm | Testable Predictions |
| --- | --- |
| REINFORCE (Williams, 1992) | Reward signals modulate plasticity; stochastic eligibility traces |
| Oja (Oja, 1982) | Exclusively Hebbian plasticity |
| Pred. Coding (Rao & Ballard, 1999) | Feedforward propagation of prediction errors; approx. symmetric feedback connectivity |
| Wake-Sleep (Dayan et al., 1995) | Offline generative replay driven by top-down inputs; top-down predictive inputs drive bottom-up plasticity |
| Approx. Gradient (Lillicrap et al., 2016; Akrout et al., 2019) | Neuron-specific top-down errors drive plasticity; smooth eligibility traces |
| Equil. Prop. (Scellier & Bengio, 2017) | The sign of plasticity changes while receiving instructive feedback |
| Target Prop. (Bengio, 2014) | Top-down target inputs drive bottom-up plasticity |
As the diversity of the experimental preparations suggests, there are increasingly strong arguments for several fundamentally different plasticity algorithms instantiated in different areas of the brain and across different organisms, subserving different functions. It is quite likely that many plasticity mechanisms work in concert to produce learning as it manifests in our perception and behavior. It is our belief that well-articulated normative theories can serve as the building blocks of a conceptual framework that tames this diversity and allows us to understand the brain’s tremendous capacity for adaptation.
Appendix A: The Unidentifiability of an Objective
In this section we illustrate why the choice of objective function for a normative plasticity model is never uniquely determined by data. We consider two situations: in the first, the system has already settled to an optimal setting of its weights, $W^*$; in the second, we are able to observe the system's plasticity update $\Delta W$.
A.1 Unidentifiability Based on an Optimum
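A sketch of the argument, in our notation: suppose we only observe that the system's weights have settled to $W^*$, the minimum of some objective $\mathcal{L}_1$. Then for any strictly increasing function $g$,

$$W^* = \arg\min_W \mathcal{L}_1(W) \;\Longrightarrow\; W^* = \arg\min_W g\big(\mathcal{L}_1(W)\big),$$

and, more generally, any objective that happens to share this minimizer is equally consistent with the observation. The optimum alone therefore cannot identify which objective the system optimizes.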
A.2 Unidentifiability Based on an Update Rule
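Similarly, in sketch form: suppose we observe the update $\Delta W = -\eta\, \nabla_W \mathcal{L}_1(W)$ along a trajectory of weights. Any objective $\mathcal{L}_2$ whose gradient matches $\nabla_W \mathcal{L}_1$ up to a positive rescaling along that trajectory,

$$\nabla_W \mathcal{L}_2(W) = c(W)\, \nabla_W \mathcal{L}_1(W), \qquad c(W) > 0,$$

yields a plasticity rule with the same update direction at every point, differing only by an effective (possibly state-dependent) learning rate. Observing the updates therefore also fails to single out a unique objective.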
Appendix B: Why Can’t the Brain Do Explicit Gradient Descent?
We have provided one surefire way to decrease an objective function by modifying the parameters of a neural network—simply take small steps down the gradient of the loss (see section 2.1). To appreciate the challenges faced by theories of normative plasticity, it's important to understand why a biological system could not do this. In this section we provide a simplified argument as to why gradient descent within multilayer neural networks produces nonlocal parameter updates, thus failing our most critical desideratum for a normative plasticity theory (see section 2.2). More detailed arguments for multilayer neural networks can be found in Lillicrap et al. (2020), and descriptions of why gradient descent becomes even more implausible for recurrent neural networks trained with either backpropagation through time (Werbos, 1990) or real-time recurrent learning (Williams & Zipser, 1989) can be found in Marschall et al. (2020).
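Concretely, as a sketch in standard feedforward notation: with layer activations $\mathbf{r}^{(l)} = f\big(\mathbf{u}^{(l)}\big)$ and currents $\mathbf{u}^{(l)} = W^{(l)} \mathbf{r}^{(l-1)}$, gradient descent prescribes

$$\Delta W^{(l)}_{ij} = -\eta\, \delta^{(l)}_i\, r^{(l-1)}_j, \qquad \boldsymbol{\delta}^{(l)} = \Big( \big(W^{(l+1)}\big)^{\!\top} \boldsymbol{\delta}^{(l+1)} \Big) \odot f'\big(\mathbf{u}^{(l)}\big).$$

The update for a synapse in layer $l$ thus depends on error signals computed downstream and, through $\big(W^{(l+1)}\big)^{\!\top}$, on the precise values of synapses belonging to other neurons: information that is not locally available at the synapse (the weight transport problem referred to below).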
It is also worth noting two key differentiability assumptions inherent in this approach. For one, we assume not only that the loss function is differentiable, but that some “error calculating” part of the brain does differentiate it. This requires knowledge of what the desired network output should be, which for many real-world tasks is not possible. Second, we assume that the network activation function is differentiable. Since neurons typically emit binary spikes, this differentiability assumption is not necessarily valid, though several modern methods have circumvented this problem by using either stochastic neuron models (Williams, 1992; Dayan & Hinton, 1996) or clever optimization tricks (Bellec et al., 2020). In subsequent sections, we describe one canonical algorithm that employs clever tricks to circumvent the weight transport problem.
Appendix C: REINFORCE
In this section, we provide a mathematical tutorial on the REINFORCE learning algorithm (Williams, 1992), a mechanism for updating the parameters in a stochastic neural network for reinforcement learning objective functions. Its chief advantages are twofold. First, it only requires you to be able to evaluate an objective function (the reward received on any given trial), not the gradient of the objective function with respect to the parameters (see Figure 4b). This is very useful in situations in which the relationship between rewards and network outputs is not clear to an agent, as would be the case in many reinforcement learning scenarios. Second, under a broad range of biologically reasonable assumptions about a neural network architecture, the parameter updates produced by this algorithm are local, meaning the information required for a parameter update would reasonably be available to a synapse in the brain. This algorithm produces updates that are within the class of reward-modulated Hebbian plasticity rules. The chief disadvantage of this algorithm is its comparative data inefficiency relative to backpropagation. In practice, far more data samples (or, equivalently, much lower learning rates) will be required to produce the same improvements in performance compared to backpropagation (Werfel et al., 2003).
The REINFORCE algorithm and minor variations appear in different fields with different names. It is useful to keep track of these alternative names because they all use roughly the same derivation, with some improvements or field-specific modifications. In machine learning, the algorithm is often referred to as node perturbation (Richards et al., 2019; Lillicrap et al., 2020; Werfel et al., 2003), because it involves correlating fluctuations in neuron (node) activity with reward signals. In computational neuroscience, it is sometimes called 3-factor or reward-modulated Hebbian plasticity (Frémaux & Gerstner, 2016), though REINFORCE is only one of several algorithms referred to by these blanket terms. In reinforcement learning, REINFORCE is often treated as a member of the more general class of policy gradient (Sutton & Barto, 2018) methods, which can be used to train any parameterized stochastic agent through reinforcement. Policy gradient methods need not commit to a neural network architecture and are consequently not always local. Finally, very similar methods are used for fitting variational Bayesian models and are in these contexts referred to as either black box variational inference (Ranganath et al., 2014) or neural variational inference (Mnih & Gregor, 2014).
In what follows, we provide a brief derivation of the REINFORCE learning algorithm for a one-layer feedforward neural network. We then discuss the many extensions of the algorithm as well as its strengths and limitations as a normative plasticity model.
C.1 Network Model
Most neural networks used in machine learning are deterministic. However, neurons in biological systems fluctuate across trials and stimulus presentations, so modeling them as stochastic is often more appropriate. It will turn out that these fluctuations can be used to produce parameter updates in a way that a deterministic system could not.
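One minimal instantiation consistent with this description (the specific additive-noise form is an assumption we adopt for this tutorial) is a single layer of linear-nonlinear units whose rates are corrupted by gaussian noise:

$$u_i = \sum_j W_{ij}\, x_j, \qquad r_i = f(u_i) + \sigma\, \xi_i, \qquad \xi_i \sim \mathcal{N}(0, 1),$$

where $\mathbf{x}$ is the stimulus, $W$ the synaptic weight matrix, $f$ a pointwise activation function, and $\sigma$ the noise amplitude.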
Concretely, the activity of postsynaptic neuron $j$ is given by $r_j = f(c_j) + \sigma \xi_j$, where $c_j = \sum_i W_{ij} s_i$ is the current injected by the presynaptic inputs $s_i$ through the synaptic weights $W_{ij}$, $f(\cdot)$ is the neural activation function, and $\xi_j$ is zero-mean, unit-variance gaussian noise. This equation defines a conditional probability distribution, $p(\mathbf{r} \mid \mathbf{s})$, a gaussian centered on $f(\mathbf{c})$ with variance $\sigma^2$. Neuron activities are now samples from this conditional probability distribution, so we can study how neurons behave on average by taking expectations over it.
For simplicity and clarity, we restrict ourselves to this neural architecture for our derivation, but the basic principles apply more generally to a variety of noise sources and neural architectures (see section 3).
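As a concrete illustration of this model class, the snippet below draws firing rates from the gaussian conditional distribution $p(\mathbf{r} \mid \mathbf{s})$ described above. This is a minimal sketch with variable names of our choosing; the tanh activation, noise level, and network sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(W, s, sigma=0.1):
    """One-layer stochastic rate network: r_j = f(c_j) + sigma * xi_j, with injected
    current c_j = sum_i W_ij * s_i, activation f = tanh, and gaussian noise xi_j."""
    c = W.T @ s                                         # currents into the postsynaptic neurons
    r = np.tanh(c) + sigma * rng.normal(size=c.shape)   # noisy firing rates
    return r, c

# Example: 5 presynaptic inputs projecting onto 3 postsynaptic neurons.
W = rng.normal(scale=0.5, size=(5, 3))
s = rng.normal(size=5)
r, c = forward(W, s)
```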
C.2 Defining the Objective
C.3 Taking the Gradient
C.4 Why Don’t We Need the Derivative of the Loss?
C.5 Biological Plausibility Assessment
Now that we have derived REINFORCE, we can examine its qualities as a normative plasticity theory. First, we ask: Is this algorithm "local" (see section 2.2)? The gradient for a particular synapse, $W_{ij}$, can be approximated with samples in an environment with stimuli $\mathbf{s}$, firing rates $\mathbf{r}$, and rewards $R$ by averaging the single-trial update
$$\Delta W_{ij} \propto R \, \big(r_j - f(c_j)\big)\, f'(c_j)\, s_i / \sigma^2$$
over trials. To decide whether this could be a plasticity rule implemented (or, more realistically, approximated) by a biological system, we need to think about what pieces of information a synapse would have to have available.
First, the synapse needs $s_i$, which amounts to just the presynaptic input, a common feature of any Hebbian synaptic plasticity rule. Second, the synapse needs $\big(r_j - f(c_j)\big)\, f'(c_j)/\sigma^2$. $\sigma^2$ is a constant, and so can be absorbed into the learning rate. $r_j$ is the postsynaptic firing rate, which is also a common feature of any Hebbian plasticity rule. $c_j$ is the current injected into the postsynaptic neuron, and $f(c_j)$ and $f'(c_j)$ are both monotonic functions of this current, so it is quite conceivable that these values could be approximated by a biochemical process. Third, every synapse needs access to the scalar reward value received on a given trial, $R$. This is the most "nonlocal" information involved in the parameter update; however, there exist many theories about how neuromodulatory systems in the brain can deliver information about reward diffusely to many synapses and induce plasticity (see section 2.2). To achieve this locality, we have implicitly assumed that we are performing gradient descent with respect to a Euclidean metric (Surace et al., 2020); using different metrics corresponds to premultiplying the full weight update vector by a positive-definite matrix. The locality results discussed here hold if this positive-definite matrix is diagonal, but otherwise nonlocal interactions may be introduced.
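Assembling these pieces, a single-trial update of the reward-modulated Hebbian form above might be sketched as follows. This is again a minimal NumPy illustration under the tanh/gaussian assumptions of the earlier snippet, with no baseline subtraction or other variance-reduction refinements.

```python
import numpy as np

def reinforce_update(W, s, r, c, R, sigma=0.1, lr=1e-2):
    """Local single-trial update, dW_ij proportional to R * (r_j - f(c_j)) * f'(c_j) * s_i / sigma^2,
    assuming f = tanh as in the forward-pass sketch. Each factor is plausibly available at the
    synapse: presynaptic input s_i, postsynaptic rate r_j and current c_j, and a globally
    broadcast scalar reward R."""
    f_c = np.tanh(c)                               # mean firing rates f(c_j)
    f_prime_c = 1.0 - f_c ** 2                     # derivative f'(c_j) for f = tanh
    postsyn = (r - f_c) * f_prime_c / sigma ** 2   # postsynaptic factor, one entry per output neuron
    return W + lr * R * np.outer(s, postsyn)       # reward-modulated Hebbian outer product
```

In a simulated learning loop, one would sample (r, c) with the forward pass, compute the scalar reward R for that trial, and apply this update; averaged over many trials, the update follows the reward gradient in expectation.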
We have already demonstrated that REINFORCE performs approximate gradient descent on reinforcement learning objective functions. This in itself makes the algorithm very promising as a normative plasticity model (see section 2.1). Its chief advantage is that it does not require detailed knowledge of the reward function (i.e., how to differentiate it), which means that an animal could simply receive a reward from its environment and relay that reward signal diffusely to its synapses. However, this also restricts the types of objectives that could plausibly be learned by a neural system: unsupervised learning objectives like the ELBO require access to the activity of every neuron in the circuit in order to be computed, and there is no evidence for downstream neural circuits that perform such calculations. Therefore, even though REINFORCE can in principle be used to train a neural network on any objective, explicit reinforcement is much more plausible than the alternatives.
We have only provided a derivation for a single-layer, rate-based neural network with additive gaussian noise, but REINFORCE extends quite readily to multilayer (Williams, 1992), spiking (Frémaux et al., 2013), and recurrent networks (Miconi, 2017) without any loss of locality. This indicates that the algorithm is architecture-general (see section 2.3) and can handle temporal environmental structure (see section 2.4). Further, because a weight update can be calculated in a single trial, animals could use it to learn online (see section 2.5). The biggest point of failure for REINFORCE is that it scales poorly with stimulus or task complexity, with large numbers of neurons, and with prolonged delays in the receipt of reward (Werfel et al., 2003; Fiete, 2004; Bredenberg et al., 2021). The more neurons contribute to reward and the more complex the reward function, the harder it becomes to estimate the correlation between a single neuron's activity and reward, which is a prerequisite for the algorithm's function. Thus, though the algorithm is an unbiased estimator of the gradient, the estimate can be so variable as to be effectively useless in complex contexts. If animals do exploit the principles of REINFORCE to update synapses, it is therefore likely paired with other algorithms or hybridized in a way that allows for better scalability.
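The variance problem can be made tangible with a small numerical experiment (our own construction, not drawn from the references above): for a fixed synapse, the spread of single-trial REINFORCE updates grows rapidly with the number of output neurons contributing to a shared quadratic reward, even though the true gradient for that synapse remains modest.

```python
import numpy as np

rng = np.random.default_rng(1)

def single_trial_update(W, s, r_target, sigma=0.1):
    """One REINFORCE sample of the update for synapse (0, 0), with f = tanh and a
    quadratic reward R = -||r - r_target||^2 that depends on every output neuron."""
    c = W.T @ s
    r = np.tanh(c) + sigma * rng.normal(size=c.shape)
    R = -np.sum((r - r_target) ** 2)
    return R * (r[0] - np.tanh(c[0])) * (1.0 - np.tanh(c[0]) ** 2) * s[0] / sigma ** 2

for n_out in (1, 10, 100):
    W = rng.normal(scale=0.5, size=(5, n_out))   # 5 presynaptic inputs, n_out outputs
    s = rng.normal(size=5)
    samples = [single_trial_update(W, s, np.zeros(n_out)) for _ in range(2000)]
    print(f"{n_out:4d} output neurons: single-trial update std = {np.std(samples):.1f}")
```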
The last way to assess REINFORCE is on the basis of how it can be tested (see section 2.7). The simplest way to test this algorithm is by examining whether scalar reward-like signals (i.e., $R$) have a multiplicative effect on local plasticity in a circuit. At a single-neuron level, this corresponds to identifying neuromodulators that affect plasticity. At a feedback level, this corresponds to identifying neuromodulatory systems that project to the circuit in question and observing whether their stimulation or silencing improves or blocks circuit-level plasticity or behavioral learning performance, respectively. These steps do not identify REINFORCE as the only possibility, but they narrow down the field of possibilities considerably, removing all candidate algorithms that either do not require any feedback or require more detailed feedback signals (see Figure 3a).
Acknowledgments
We thank Blake Richards, Eero Simoncelli, Owen Marschall, Benjamin Lyo, Elliott Capek, Olivier Codol, and Yuhe Fan for their helpful feedback on this review. C.S. is supported by NIMH Award 1R01MH125571-01, NIH Award R01NS127122, by the National Science Foundation under NSF Award No. 1922658, and a Google faculty award.
In the interest of conciseness, we discuss only long-term plasticity, not including short-term plasticity.
It should be noted that this is the simplest way to characterize improved performance, but not all formulations of learning easily fit into a simple optimization framework: for example, associative learning in Hopfield networks (Hopfield, 1982) or multi-agent reinforcement learning (Zhang et al., 2021).
Some objectives (like reward functions) are best thought of as being maximized rather than minimized. Without loss of generality, in such cases we can minimize the negative reward function.
A negative inner product can also be achieved by taking the parameter update to be the negative loss gradient premultiplied by any positive-definite matrix, which could itself depend on the weights. Updates of this form correspond to gradient descent with respect to different metrics (Surace et al., 2020); special cases include altering the learning rates for different parameters and natural gradient descent.
This is a challenge for normative plasticity models that predefine the outputs of the circuit and approximately backpropagate errors from these outputs.