We propose that replication (with mutation) of patterns of neuronal activity can occur within the brain using known neurophysiological processes. Evolutionary algorithms implemented by neuronal circuits can thereby play a role in cognition. Replication of structured neuronal representations is assumed in several cognitive architectures. Replicators overcome some limitations of selectionist models of neuronal search. Hebbian learning is combined with replication to structure exploration on the basis of associations learned in the past. Neuromodulatory gating of sets of bistable neurons allows patterns of activation to be copied with mutation. If the probability of copying a set is related to the utility of that set, then an evolutionary algorithm can be implemented at rapid timescales in the brain. Populations of neuronal replicators can undertake a more rapid and stable search than can be achieved by serial modification of a single solution. Hebbian learning added to neuronal replication allows a powerful structuring of variability capable of learning the location of a global optimum from multiple previously visited local optima. Replication of solutions can solve the problem of catastrophic forgetting in the stability-plasticity dilemma. In short, neuronal replication is essential to explain several features of flexible cognition. Predictions are made for the experimental validation of the neuronal replicator hypothesis.
Why expect replication of information in the brain? First, regardless of the search algorithm used (e.g., reinforcement learning, Bayesian learning, Helmholtz machines, free energy minimization, simulated annealing, backpropagation, random search), a population of machines can find a solution at least as fast as can just one copy of that machine if the machines can be operated in parallel. In some cases, mixtures of experts can simultaneously contribute to behavior (Jacobs, Jordan, Nowlan, & Hinton, 1991), as is the case in robotic implementations of multiple-model-based reinforcement learning (Doya, Samejima, Katagiri, & Kawato, 2002). Where multiple solution representations exist, a mechanism to reallocate limited resources (e.g., neurons or synapses) from the least effective solutions to the currently most effective solutions is necessary. Replication permits such redistribution.
Second, cognitive architectures involving symbol manipulation require copying operations (Marcus, 2001). ACT-R production systems assume information can be copied from one brain region to another, that is, the contents of variables can be moved among data structures (Anderson, 2007). Also, the production rules themselves are learned by a kind of cognitive evolutionary algorithm. In Copycat (a cognitive architecture for solving analogy-based insight problems), we see an evolutionary system as well (Hofstadter & Mitchell, 1994). “Codelets” in the “coderack” are a population of agents (tasks and rules) that evolve.
Third, replication of memory traces is necessary for the influential multiple-trace theory of memory (Nadel, Samsonovich, Ryan, & Moscovitch, 2000).
Fourth, the problem of catastrophic forgetting (the stability-plasticity dilemma; Abraham & Robins, 2005) is solved given the capacity to replicate information, as we shall demonstrate.
Fifth, the search capabilities of evolutionary algorithms are well suited to solving novel problems of the type that require insight, creativity, and imagination (Kohler, 1925; Simonton, 1995; Sternberg & Davidson, 1995; Schwefel, 2000). As Schwefel describes, classical optimization methods are typically more efficient on linear, quadratic, strongly convex, unimodal, and separable problems. In these cases, brains may well have evolved specialized learning mechanisms. Evolutionary algorithms are better on discontinuous, nondifferentiable, multimodal, nonstationary, noisy, and even fractal problems (Schwefel, 2000). This set may constitute a broader field of application in cognition, most notably in problems involving symbol manipulation or problems requiring structured search in rugged landscapes, that is, problems with epistatic interdependencies (Watson, Hornby, & Pollack, 1998; Watson, 2006) such as insight problems (Simonton, 1995; Sternberg & Davidson, 1995; Wagner, Gais, Haider, Verleger, & Born, 2004). Furthermore, where a problem is novel, so that evolution at the organism level has had no opportunity to evolve a brain module specifically for it, an evolutionary algorithm can be incredibly versatile. “By varying the representation, the variation operators, the population size, the selection mechanism, the initialization, the evaluation function, and other aspects, we have access to a diverse range of search procedures. Evolutionary algorithms are much like a Swiss Army knife: a handy set of tools that can be used to address a variety of tasks. Having the Swiss Army knife provides you with the ability to address a wide variety of problems quickly and effectively, even though there might be a better tool for the job” (Michalewicz & Fogel, 2004, p. 163).
The next sections compare neural nonevolutionary algorithms that have received attention in the literature with those proposed by the neuronal replicator hypothesis.
1.1. Evolutionary Computation, Synaptic Selectionism, and Neural Darwinism.
That Darwinian selection is involved in learning, thought, and human creativity has been proposed many times in the past (James, 1890; Bremermann, 1958; Dennett, 1981; Baldwin, 1898, 1909; Hadamard, 1945; Monod, 1971; Campbell, 1974; Dawkins, 1982; Cooper, 2001; Aunger, 2002). Earlier models were psychological or, rather, metaphorical and did not suggest explicit neural mechanisms. Recent models have been based on the idea that synapses or groups of synapses are differentially stabilized by reward (Changeux, Courrege, & Danchin, 1973; Edelman, 1987; Dehaene & Changeux, 1997; Dehaene, Kerszberg, & Changeux, 1998; Seung, 2003; Izhikevich, Gally, & Edelman, 2004; Izhikevich, 2006). None of these models implements an evolutionary algorithm; instead, they are selectionist (Crick, 1989, 1990). Edelman's use of the term neural Darwinism has been criticized for its inaccuracy by Crick (1989). This is because evolution by natural selection occurs only where there are units of evolution. Units of evolution are replicators capable of hereditary variation (Maynard Smith, 1986). Selectionist algorithms do not possess replicators. We propose how evolutionary algorithms could be implemented in the brain by the replication of patterns of neuronal activity (not of neurons themselves).
Units of evolution are entities that replicate and are capable of stably transmitting variations across generations. If such entities have differential fitness, then natural selection generates adaptation (Muller, 1966; Maynard Smith, 1986). Natural selection in the sense used here refers to the algorithm that takes place when there are units of evolution. Replication establishes covariance between phenotypic traits and fitness, a fundamental requirement for adaptation by natural selection (Price, 1970). The natural selection algorithm (Dennett, 1995) can have many possible implementations (Marr, 1983); for example, units of evolution include some units of life (Gánti, 2003) such as organisms and lymphocytes evolving by somatic selection (Edelman, 1994), but also informational entities (without metabolism) such as viruses, machine code programs (Lenski, Ofria, Collier, & Adami, 1999), and binary strings in a genetic algorithm (Fraser, 1957).
Some clarification is required as to what is meant by a replicator. We mean to include a very general class of multiplying entities. For example, we include phenotypic replicators; the physical (neuronal) basis of behavior or of evolutionary strategies (such as hawk or dove) is a phenotypic replicator as described by Maynard Smith (1998). Such replicators do not require a distinction between genotype and phenotype; a self-replicating ribozyme, for example, is both informational and catalytic. A systematic classification of replicators has recently been published (Zachar & Szathmáry, 2010).
Even with this wider notion of replicator, the selectionist models of Changeux and Edelman do not contain units of evolution and therefore do not implement a natural selection algorithm (or, put differently, they implement survival selection without reproduction). Darwinian selection does not exist without some type of “reproduction,” that is, replication with errors (Crick, 1989, 1990). Selectionist models often implement some kind of reward-biased stochastic hill climbing. Such algorithms can explain performance in simple reward-biased cognitive search tasks, such as the Stroop task (Dehaene, Changeux, & Nadal, 1987; Dehaene et al., 1998), the Tower of London task (Dehaene & Changeux, 1997), or instrumental and classical conditioning tasks (Izhikevich, 2007b), but they require a careful choice of representations by the designer in order to ensure that the search landscape is smooth and contains no local optima.
It has been suggested that Edelman's (1987) notion of recurrence can be seen as a replicator of neuronal activity and that the neuronal replicator hypothesis is a variant of Edelman's notion of reentry. This is not the case because reentry is nothing more than the principle of having recurrent reciprocal connections between (and within) neuronal regions. Izhikevich's random recurrent neuronal networks use reentrant connections, for example, and indeed they can carry out simple kinds of classical and operant conditioning (Izhikevich, 2006, 2007b). They are the most modern example of Edelman's principle of reentry. The connections can be nontopographic and so do not necessarily replicate a vector of activity from one brain region to another; rather, they act to stochastically modify neuronal groups or adjust the attractor dynamics of a neuronal group. The influence of one group on another is attractor based rather than template based. In another example of reentry, Tononi, Sporns, and Edelman (1992) have reentrant connections between visual areas that allow perceptual binding due to synchrony. There is no replication of activity between regions; instead, synchronous activity is reinforced by a value signal that tends to promote eye movements toward the desired goal. The task is an operant conditioning task solved by reward-biased stochastic hill climbing, in the same way as Izhikevich's (2007b) later models. Another example of reentry implementing a reward-biased stochastic hill-climbing algorithm is Sporns and Edelman's (1993) solution of a low-dimensional Bernstein problem of inverse kinematic control. Neuronal groups control a four-jointed arm with a pointer at the end that can move on a 2D surface. Reward is obtained if the pointer is close to a desired spot. Reward tends to reinforce neuronal groups that acted to bring the pointer closer to the desired spot. This is an example of a reward-biased stochastic hill climbing. 
Reentrant signals exist between neuronal groups, allowing synergistic motor actions to build up. The fundamental mechanism nevertheless remains reward-biased stochastic hill climbing; there is no sense in which reentry produces replication, as has been claimed. Similarly, it has been claimed that the kind of recurrence in Elman's (1990) simple recurrent network, where a hidden layer feeds back onto itself topographically, could also be called replication. The difference here is that such a network copies only in time, not in space. A standard feedforward neural network with a one-to-one map (and hence a subset of recurrent neural nets) is capable of replicating a pattern of activity in space. Trivially, then, feedforward nets can implement the replication step of an evolutionary algorithm, because the copy occupies a different location in space from the parent (and can overwrite another entity there). If this is a trivial property of feedforward neural nets, what is the purpose of this letter? First, we demonstrate the capacity for copying bistable activity patterns in real spiking neural networks; second, we demonstrate that the copying operation can actually be used to implement an evolutionary algorithm. Feedforward and recurrent neural networks have not previously been used to implement an evolutionary algorithm. Doing so requires two further properties: neuromodulatory gating and a determination of the fitness of each layer. Selection can then act to remove one copy while keeping another. Note that such evolutionary processes cannot be implemented with temporal copying alone. That copying of activity is so trivial from the point of view of feedforward neural networks makes it all the more puzzling that a neuronal evolutionary computation algorithm has not been seriously entertained previously. It is as if we had known about DNA but had not accepted the theory of natural selection based on DNA replication.
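In this trivial sense, replication of a binary activity pattern through a one-to-one feedforward map can be sketched in a few lines. The representation (0/1 activity states) and the bit-flip transmission noise rate `mu` are our own illustrative choices:

```python
import random

def copy_with_mutation(parent, mu=0.02, rng=random):
    """Copy a binary activity pattern through a one-to-one feedforward map.

    Each child unit receives input only from its corresponding parent unit
    (an identity weight matrix); transmission noise flips each bit with
    probability mu, giving replication with mutation."""
    return [bit if rng.random() > mu else 1 - bit for bit in parent]

parent = [1, 0, 1, 1, 0, 0, 1, 0]
child = copy_with_mutation(parent, mu=0.1)   # the copy sits in a different layer
```

The child pattern occupies a different layer (location in space) from the parent, so selection can later keep one copy and overwrite the other.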
An example illustrates the difference between reward-biased stochastic hill climbing and natural selection in a population. Suppose one seeks to optimize a binary string of length L, but the fitness landscape contains a local optimum defined as follows. Let the fitness of a solution be proportional to the number of 1s in the string, except for a valley of zero fitness of width M centered around L/2 1s; that is, if the number of 1s is greater than L/2 − M/2 and less than L/2 + M/2, then fitness = 0; otherwise, fitness = the number of 1s. This problem was recently used by Clune et al. (2008). In reward-biased stochastic hill climbing, a single random initial binary solution is generated. Mutation events occur with probability F; when one occurs, each bit takes state 1 with probability Pi, where initially all Pi = 0.5. A mutation event corresponds to the stochastic activation of a reentrant synapse in Edelman's models. If a mutation is associated with an improvement in fitness, then each Pi is shifted toward the value of the corresponding bit by a scaling factor D, whereas if a mutation is associated with decreased fitness, each Pi is shifted away from that value by a scaling factor D. This corresponds to synaptic strengthening or weakening of reentrant connections and is analogous to a cross-entropy global optimization method (Szita & Lörincz, 2007). The stochastic element is homogeneous and due to F, but reward biases the effect of stochasticity toward regions of higher fitness. One can see that the system is selectionist rather than Darwinian because there is no replication. Reward selects Pi values that tend to produce bit states that contribute most to fitness.
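The scheme just described can be sketched as follows, with the valley landscape and the parameters F, Pi, and D as in the text (the particular values of L, M, F, and D are illustrative):

```python
import random

def fitness(bits, M):
    """Number of 1s, except a zero-fitness valley of width M centered on L/2 1s."""
    L, ones = len(bits), sum(bits)
    return 0 if L / 2 - M / 2 < ones < L / 2 + M / 2 else ones

def reward_biased_hill_climb(L=20, M=6, F=0.5, D=0.1, steps=2000, seed=0):
    """Single-solution search: per-bit probabilities Pi are nudged toward
    bit values associated with improved fitness and away from bit values
    associated with decreased fitness."""
    rng = random.Random(seed)
    P = [0.5] * L                                   # Pi: probability bit i is 1
    current = [rng.randint(0, 1) for _ in range(L)]
    best = fitness(current, M)
    for _ in range(steps):
        if rng.random() > F:                        # no mutation event this step
            continue
        candidate = [1 if rng.random() < p else 0 for p in P]
        f = fitness(candidate, M)
        if f > best:                                # improvement: strengthen
            P = [p + D * (b - p) for p, b in zip(P, candidate)]
            current, best = candidate, f
        elif f < best:                              # worsening: weaken
            P = [min(max(p - D * (b - p), 0.0), 1.0) for p, b in zip(P, candidate)]
    return current, best

solution, best = reward_biased_hill_climb()
```

Because there is only a single solution and no replication, a run that climbs the low side of the landscape has no way to cross the zero-fitness valley.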
Figure 1 shows that the stochastic hill climber can get stuck at the local optimum, whereas a standard off-the-shelf population-based genetic algorithm is able to reach the global optimum—in this case, by the simple fact that a solution was initialized in the basin of attraction of the global optimum. There is no sophistication here, but the benefit of a parallel search is clear.
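A minimal off-the-shelf genetic algorithm on the same valley landscape makes the point; truncation selection with per-bit mutation is used here, and the population size and rates are illustrative:

```python
import random

def genetic_search(L=20, M=6, pop_size=30, mu=0.02, gens=200, seed=0):
    """Simple GA on the valley landscape: truncation selection plus bit-flip
    mutation.  Random initialization typically places some strings in the
    basin of attraction of the global optimum, and selection reallocates
    the population toward them."""
    rng = random.Random(seed)
    def fit(bits):
        ones = sum(bits)
        return 0 if L / 2 - M / 2 < ones < L / 2 + M / 2 else ones
    pop = [[rng.randint(0, 1) for _ in range(L)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fit, reverse=True)
        parents = pop[:pop_size // 2]               # keep the fitter half
        children = [[1 - b if rng.random() < mu else b
                     for b in rng.choice(parents)]
                    for _ in range(pop_size - len(parents))]
        pop = parents + children                    # replication with mutation
    return max(pop, key=fit)

best_string = genetic_search()
```

The replication step is what allows search resources to be reassigned from lineages stuck below the valley to lineages that happened to start above it.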
This example highlights a fundamental problem with stochastic hill climbing of the type Edelman proposes: the system can become stuck at local optima that could have been avoided trivially by a population. A multiple random restart hill climber is, of course, an option (Watson, Buckley, & Mills, 2009) but is poorly suited to the control of online behavior.
1.2. Evolutionary Computation in Reinforcement Learning.
Next we consider a popular nonevolutionary family of machine learning algorithms: temporal difference reinforcement learning (TDRL). TDRL is a special case of a selectionist algorithm in which the value associated with a state or state-action pair is updated according to the difference between the expected and actual rewards obtained at the next time step. TDRL depends on assigning value to representations of states or actions during the execution of a task. At the basis of reinforcement learning is the temporal difference (TD) algorithm, which associates values with states in an actor-critic architecture (Houk et al., 2007), or with state-action pairs in Q-learning (Watkins, 1989) and state-action-reward-state-action (SARSA) learning; the state-action pairs exist at the outset of the task and are assigned value by a value function in an online manner as the task proceeds (Whiteson, Taylor, & Stone, 2007). An action selection algorithm determines when to execute these state-action pairs as a function of their instantaneous value, balancing exploration and exploitation. The actor-critic can be seen as a kind of stochastic hill-climbing algorithm with a sophisticated value assignment system (Niv, 2009).
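The core TD update behind Q-learning can be written in a few lines; this is a tabular toy, and the state and action names are illustrative:

```python
import random
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: move Q(s, a) toward the TD target
    r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def epsilon_greedy(Q, s, actions, eps=0.1, rng=random):
    """Action selection balancing exploration (random) and exploitation (greedy)."""
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

Q = defaultdict(float)
q_update(Q, 's0', 'left', 1.0, 's1', ['left', 'right'])
# Q[('s0', 'left')] is now 0.1 * (1.0 + 0.9 * 0.0 - 0.0) = 0.1
```

A single value table is updated in place; there is no population of candidate solutions and no replication.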
TDRL techniques have been influential in neuroscience. This is partly due to the remarkable discovery that the release of dopamine (DA) shifts from the time of an unconditioned stimulus (US) to an earlier reward-predicting conditioned stimulus (CS) (Schultz, Dayan, & Montague, 1997; Schultz, 1998), as predicted by temporal difference learning. Recently Izhikevich (2007b) has begun to integrate the neural Darwinism and synaptic selectionism type models with reinforcement learning (Kaelbling, Littman, & Moore, 1996; Sutton & Barto, 1998; Niv, 2009). TDRL techniques “adaptively develop an evaluation function that is more informative than the one directly available from the learning system's environment” (Barto, Sutton, & Anderson, 1983, p. 838). TDRL would seem able to explain performance in learning tasks such as operant conditioning (Thorndike, 1911; Skinner, 1976).
It is important to understand that TDRL does not claim to explain how suitable low-dimensional representations of action and value function can arise in the brain. Very large tabular representations of the state-action space (with no compression) result in very slow learning because a given state-action pair is visited only rarely. A method of compressing the state representation (i.e., a method of function approximation) becomes necessary. Often the choice of function approximator requires domain-specific knowledge provided by the designer (Kaelbling et al., 1996). For example, “The success of TD-Gammon was dependent on Tesauro's skillful design of a non-linear multilayered neural network, used for value function approximation in the Backgammon domain consisting of approximately 10^20 states” (Elfwing, 2007, p. 20).
Several function approximation methods exist to deal with large state spaces where convergence to an optimal solution would otherwise be slow. They aim to produce, in the actor or critic, a concise representation of inputs with which to predict actions or values. Multilayer perceptrons, k-nearest neighbors, radial basis functions, and so on are all used in function approximation. Evolutionary algorithms have already been used successfully for function approximation in reinforcement learning, for example, in evolving normalized radial basis function networks for robotic control tasks (Samejima & Omori, 1999; Kondo & Ito, 2004), and these theories have been applied to representations of action-specific values in the striatum (Samejima, Ueda, Doya, & Kimura, 2005). Choosing model parameters for function approximators is well suited to evolutionary algorithms; for example, evolutionary methods have been used to evolve appropriate meta-parameters for reinforcement learning algorithms (Elfwing, 2007). Whiteson et al. (2007) used evolutionary function approximation to evolve representations for value functions and found that this improved the performance of reinforcement learning algorithms. When neuronally implemented evolutionary computation is used in this way, it contributes but does not replace ontogenetic reinforcement learning, that is, the reinforcement learning problem is solved by using both a fitness function for function approximation and a temporal difference method at the same time (Togelius et al., in press).
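As a toy illustration of evolutionary function approximation (not the actual method of Whiteson et al.), one can evolve the weights of a linear value approximator by selection on prediction error; the sample data and parameters below are our own illustrative choices:

```python
import random

def evolve_value_approximator(samples, n_features, pop_size=20, gens=100,
                              sigma=0.1, seed=0):
    """Evolve weight vectors of a linear approximator V(s) = w . phi(s),
    selecting on squared prediction error over (features, value) samples."""
    rng = random.Random(seed)
    def error(w):
        return sum((sum(wi * fi for wi, fi in zip(w, phi)) - v) ** 2
                   for phi, v in samples)
    pop = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=error)
        parents = pop[:pop_size // 2]               # truncation selection
        pop = parents + [[wi + rng.gauss(0, sigma)  # replicate with mutation
                          for wi in rng.choice(parents)]
                         for _ in range(pop_size - len(parents))]
    return min(pop, key=error)

# Recover V(s) = 2 * phi1 + 1 * phi2 from four noiseless samples.
samples = [([1, 0], 2.0), ([0, 1], 1.0), ([1, 1], 3.0), ([2, 1], 5.0)]
w = evolve_value_approximator(samples, n_features=2)
```

In the combined scheme described above, such an evolved approximator would supply the compressed representation on which a temporal difference method then operates.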
Why should we expect evolutionary algorithms to have any special advantage as function approximators? It has been shown that if there is nontrivial neutrality in the genetic representation, natural selection can discover compressed representations with the capacity to structure exploration so as to tend to improve the chance that a mutation will be beneficial (Toussaint, 2003). Examples of the capacity for an evolutionary algorithm to evolve its state-space representations to allow facilitated variation exist in several domains, such as evolving logic circuits (Kashtan & Alon, 2005), ribozyme secondary structures (Parter, Kashtan, & Alon, 2008), and rewrite rules (Toussaint, 2003). The same automatic tendency of evolution to compress genetic representations carries over to neuronal replicators.
Despite these useful properties of natural selection, only nonevolutionary methods of undertaking function approximation have been proposed for real neuronal systems. These methods use self-organization to produce appropriate state-space descriptions on which reinforcement learning can act. An example of a neuronally plausible RL task is described by Dominey (1995). Recurrent activity in prefrontal cortex produces a state representation of sensory-motor sequences that are associated with motor outputs to carry out sequence learning. Training is supervised initially, but errors are corrected by RL of cortico-striatal projections. A related method for self-organizing representations is the recursive self-organizing map, which is capable of representing temporal sequences (Voegtlin, 2002). The neuronal replicator hypothesis does not deny the existence of other plausible neural search algorithms. Rather, it is complementary and proposes that evolutionary computation is suited to certain kinds of representation evolution in a certain range of problems (Kondo & Ito, 2004), for example, the formation of Bayesian predictive models (Kemp & Tenenbaum, 2008) or sets of linguistic constructions (Steels & De Beule, 2006a), which are related to classifier systems.
Another approach to dealing with the curse of dimensionality in action space is to use hierarchical reinforcement learning methods, for example, hierarchical Q-learning in which an evolution-like search is used to find subgoals for reinforcement learning problems (Wiering & Schmidhuber, 1997). Other approaches have used controllers consisting of multiple modules (Morimoto & Doya, 2001; Doya et al., 2002).
Evolutionary algorithms have also been used without temporal difference methods to evolve task-specific reinforcement learning mechanisms by evolving neuromodulation circuits (Niv, Joel, Meilijson, & Ruppin, 2002). In the classification of a recent review comparing methods for solving the reinforcement learning problem, this is the phylogenetic approach (Togelius et al., in press): it uses only a fitness function and does not assign value to specific states. A genetic algorithm was used to evolve a neural network containing neuromodulatory neurons that modulated synaptic plasticity as a function of external stimuli (Soltoggio, Dürr, Mattiussi, & Floreano, 2007); this has been used to evolve controllers for bee foraging, for example. According to the neuronal replicator hypothesis, such evolutionary computation could also occur online in the brain, within an organism's lifetime. In some tasks, evolutionary computation has been shown to be superior to some temporal difference methods, for example, in double-pole balancing (Stanley & Miikkulainen, 2002). Recently more complex evolutionary methods, such as cooperative coevolution, have been shown to be superior to other methods in the double-pole balancing task (Gomez, Schmidhuber, & Miikkulainen, 2008).
Last but not least, it is important to consider ontogenetic RL methods based on policy gradient ascent (Togelius et al., in press). Policy gradient ascent methods “update the agent's policy-defining parameters θ directly by estimating a gradient in the direction of higher (average or discounted) reward” (Wierstra, Foerster, Peters, & Schmidhuber, 2007). These methods have been shown to be superior to TD methods in partially observable Markov decision processes and non-Markovian problems (Wierstra et al., 2007). Sophisticated recurrent neural networks exist to automatically store and utilize episodic memories of past events that are of relevance for action selection (Schmidhuber, Wierstra, & Gomez, 2005). Evolutionary methods do, however, successfully compete with even these advanced methods; for example, it was shown that cooperative coevolution of synapses outperformed recurrent policy gradients on two-pole balancing with incomplete information (Gomez et al., 2008). Even more sophisticated methods exist to determine optimal search policies based on self-referential modification of the search algorithm by a machine that makes proofs about whether a self-modification would perform better, and changes itself accordingly (Hutter, 2005; Schmidhuber, 2009). It is not our intention to deny the possibility that such algorithms exist in the brain, but we suggest that there may still be a place for neuronal evolutionary computation for formation of neuronal representations.
Finally, Schmidhuber (2000) has thoroughly compared evolutionary computation with reinforcement learning. He states that the advantages of EC are that (1) it does not require quantizing an environment into discrete states or designing nonlinear function approximators for learning value functions, (2) it does not depend on Markovian conditions or full observability, (3) it can achieve hierarchical (and nonhierarchical) credit assignment by keeping successful hierarchical policies and structuring variability, and (4) it is capable of metalearning (closely related to the evolution of evolvability) to improve its own credit assignment strategy. The disadvantage of EC is noise in assessing the fitness of a policy: unknown delays between actions and effects, and stochastic environments and policies, increase the time required to assess a policy because statistics must be gathered to ensure proper fitness assessment. Schmidhuber has avoided some of these disadvantages and has designed and tested a backtracking algorithm, the success story algorithm (SSA), which monitors “good” prior policies over appropriately adjusted time periods and replaces “bad” policies with previously successful ones (Schmidhuber, Zhao, & Schraudolph, 1997; Schmidhuber, Zhao, & Wiering, 1997; Schmidhuber, 1999). It is difficult to know how the SSA could be carried out neuronally in the absence of neuronal replication. This letter suggests that such algorithms should not be ruled out in the brain merely because they require neuronal replication and maintenance of a population of policies.
In short, evolutionary computation can contribute positively to reinforcement learning by acting as a phylogenetic method but also as an ontogenetic method for function approximation (Togelius et al., in press). We do not deny the possibility that other nonevolutionary reinforcement learning algorithms are implemented in the brain. However, we propose that evolutionary methods have been neglected in neuroscience because it was either assumed that replication of neuronal activity or neuronal circuitry could not take place in the brain, or that even if it could, it did not feature in a neuronally implemented evolutionary algorithm.
1.3. Evolutionary Computation, Energy Minimization Approaches, and Intrinsic Value Functions.
Karl Friston has proposed that the brain attempts to minimize free energy, which is related to minimizing surprising exchanges with the environment (Friston & Stephan, 2007). We note that the notion of minimizing predictive error (i.e., surprise) has been considered previously with respect to adaptation (Atmar, 1994) as discussed in Fogel (2006). Friston and Stephan (2007) write, “In short, free-energy may be a useful surrogate for adaptive fitness in an evolutionary setting and the log-evidence in model selection. In short, within an organism's lifetime its parameters minimise free-energy, given the model implicit in its phenotype. At the superordinate level, the models themselves may be selected, enabling the population to explore model space and find optimal models. This exploration depends on the heritability of key model components, which could be viewed as priors about environmental niches the system can model” (p. 435).
The neuronal replicator hypothesis can readily use Friston and Stephan's (2007) notion of fitness as reduction of free energy. They write, “Adaptive fitness can be formulated in terms of free-energy, which allows one to link evolutionary and somatic timescales in terms of hierarchical co-evolution” (p. 419). However, instead of requiring the linkage between evolutionary and somatic timescales, the neuronal replicator hypothesis proposes that the link may exist within the brain itself. Evolutionary computation is well suited to evolving structural predictive models that are selected on their capacity to reduce surprise. A population of models is generated, and those that fail to suppress free energy are removed from the population. Those that succeed have their parameters replicated with mutation. For example, a set of suitable variational operators for Bayesian structural models has been proposed (Kemp & Tenenbaum, 2008), and these would be ideal for implementation in an evolutionary algorithm. The neuronal replicator hypothesis proposes that model optimization occurs in the brain by processes of model replication and selection and that this implements Bayesian model selection. There is no conflict between Friston and Stephan's (2007) proposal and ours; in fact, the NRH suggests that predictive models that minimize free energy could be optimized by neuronal evolutionary algorithms.
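A caricature of this proposal can be sketched with "surprise" proxied by the squared prediction error of one-parameter linear predictors; this is a drastic simplification of free energy, for illustration only:

```python
import random

def evolve_predictors(observations, pop_size=30, gens=50, sigma=0.05, seed=0):
    """Population of one-parameter predictors x_next ~ a * x.  Models that
    fail to suppress prediction error ('surprise') are removed; successful
    models replicate their parameter with mutation."""
    rng = random.Random(seed)
    def surprise(a):
        return sum((a * x - y) ** 2 for x, y in observations)
    pop = [rng.uniform(-2, 2) for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=surprise)
        survivors = pop[:pop_size // 2]             # selection on surprise
        pop = survivors + [rng.choice(survivors) + rng.gauss(0, sigma)
                           for _ in range(pop_size - len(survivors))]
    return min(pop, key=surprise)

# Observations generated by the rule x_next = 0.8 * x.
obs = [(x, 0.8 * x) for x in (1.0, 2.0, -1.5, 0.5)]
a_hat = evolve_predictors(obs)
```

The surviving parameter converges toward the generating rule: selection on surprise reduction plays the role of Bayesian model selection in this toy setting.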
We note that the concept of intrinsic value functions was first described in Schmidhuber's papers on curious RL systems that get intrinsic reward for the progress of a separately learning predictor of the sensory inputs: the RL systems are motivated to select data that increase the first derivative of predictability (Schmidhuber, 1991, 2006; see an overview at http://www.idsia.ch/~juergen/interest.html). Recent work has taken up this basic idea (Oudeyer, Kaplan, & Hafner, 2007). We propose that the concept of neuronal evolutionary computation can be usefully integrated with this work on intrinsic value systems; for example, it is possible that values can themselves evolve in combination with strategies. In fact, such techniques have been used in coevolution of tests and models for building models of the environment (Bongard, Zykov, & Lipson, 2006; Bongard & Lipson, 2007).
The next section summarizes some general algorithmic advantages conferred by replication, followed by a demonstration of how rapid replication of patterns of neuronal activity can occur in spiking neural networks. Then comes a demonstration of the incorporation of a well-known neural process, Hebbian learning, into the replication process itself. This provides a surprising increase in the speed of evolution, thus demonstrating a tremendous synergy between neural and evolutionary processes that until now has been ignored.
1.4. Some Algorithmic Advantages of Evolutionary Computation in Neuronal Systems.
If replication of individuals occurs with hereditary variation, then the natural selection algorithm operates. What is the difference between a population of replicators and a population of independent stochastic hill climbers? First, as mentioned previously, replication allows the efficient reassignment of search resources (neuronal representational space) to the currently fittest solutions in a population of solutions. Those parts of the neuronal network that are not contributing to success in the problem can be reconfigured by those that are.
Second, and most important for cognition, natural selection with populations of replicators has a sophisticated capacity to structure search (Pigliucci, 2008). Organismal units of evolution can bias the exploration of phenotype space to yield more favorable variants (Kirschner & Gerhart, 1998). This is known as the evolution of evolvability (Conrad, 1983; Altenberg, 1994; Wagner & Altenberg, 1996; Jones, Arnold, & Bürger, 2007) or facilitated variation (Kirschner & Gerhart, 2005). Such systems are capable of structuring their exploration distributions by the use of nontrivial neutrality; that is, there may be many genotypes capable of producing the same phenotype, yet these genotypes may differ in the phenotypic variants they produce (Toussaint, 2003). If this is so, then natural selection in populations can select for variability properties of genotypes as well as for properties that benefit the individual. Demonstrations of these capacities are available in the evolution of logical operations, for example (Parter et al., 2008).
Third, populations have the capacity for recombination. A neuronal implementation of recombination is presented that shows how it has the potential to escape from local minima as long as there is tight linkage (Watson, 2006), just as in genetic recombination.
Fourth, in the brain, as opposed to genetics, the structuring of exploration distributions can be undertaken explicitly by Hebbian learning of the weights responsible for copying an activity vector from one neuronal individual to another. An abstract neuronal implementation of the Hebbian learning replicator is presented: a novel algorithm that is distinct in mechanism both from the capacity of populations of replicators to structure variability described by Toussaint and from the capacity of recombination to escape local optima. The algorithm works even without a population of solutions, but it can be parallelized. The idea of using Hebbian learning to learn the structure of local optima was discovered recently by Richard Watson and described in the domain of Hopfield networks (Watson et al., 2009); we have applied it here to the case of neuronal replicators.
The next section describes a simple mechanism by which neuronal replication could occur.
1.5. Mechanisms of Neuronal Replication.
Mechanisms of neuronal replication have been hypothesized previously. Paul Adams (1998) proposed quantal synaptic replicators in which mutations were noisy quantal Hebbian learning events where a synapse was made to contact an adjacent postsynaptic neuron rather than to enhance the connection to the current postsynaptic neuron. Previously, we confirmed that Hebbian learning by Oja's (1982) rule is isomorphic to Eigen's replicator equations, with synaptic strength being equivalent to the population density of a replicator (Eigen, 1971; Fernando & Szathmáry, 2009b). We also demonstrated a mechanism for replication of patterns of synaptic connectivity between brain regions (Fernando, Karishma, & Szathmáry, 2008) with generation times on the order of minutes (the rate-limiting factor being the speed of structural plasticity of synapses; Butz, Wörgötter, & Van Ooyen, 2009). The replication mechanism proposed in this letter can occur on the order of milliseconds, as shown in the simulations we present of bistable copying using Izhikevich neuron models (Izhikevich, 2003). Using these models, we show that reward-gated sets of bistable neurons can sustain replication of neuronal activity patterns. This makes possible rapid evolutionary dynamics in the brain.
First, a model of two bistable neurons in a gated network is presented to show the conditions in which neuronal activity-pattern copying can take place. This is extended to two layers of units connected by a one-to-one topographic map, capable of evolving activity patterns using a (1+1) evolution strategy (Rechenberg, 1994; Beyer, 2001). It is the simplest system in which evolutionary dynamics can be exhibited, which is why it is used to demonstrate the principle here. Section 3 considers candidate mechanisms for implementing bistability, any of which could implement a neuronal replicator. Largely because of computational efficiency constraints, later models in the letter do not use the full neuronal simulation but a probabilistic model of bistable neurons. Hebbian learning is added to this model, which is then used to solve problems containing interdependencies (Watson, 2006) that produce rugged fitness landscapes known to be pathological for stochastic hill climbers (and for a (1+1) evolution strategy) alone. The same problem is then solved using a population of neuronal replicators with recombination.
2.1. One Bit Copying in a Bistable Neuron Pair.
The dynamics of a single bistable neuron are shown in the phase portrait in Figure 2 (top).
There are two attractors: a stable quiescent state and a stable limit cycle corresponding to tonic firing. Continuous oscillation occurs if the state remains outside the basin of attraction of the quiescent attractor. Figure 2A (bottom) shows the minimal circuit capable of copying neuronal activity. It consists of two reciprocally coupled excitatory bistable neurons (black circles), each associated with its own gating neuron (gray).
Figure 3 shows a typical simulated experiment (a variant of which we propose could be conducted in vitro with real neurons) demonstrating copying of bistable states using the circuit shown in Figure 2A. Initially both neurons (1 and 2) are off and the gating variables are 0 in both directions; neuromodulatory gating prevents the two neurons from interacting. Then neuron 1 is depolarized by an external input of 10 mV applied at an appropriate phase so that it starts firing. The gate from neuron 1 to neuron 2 is opened (its value set to 1) so that neuron 1 can influence neuron 2, causing neuron 2 to fire, after which the gate is closed (i.e., returned to 0). After this disconnection, a hyperpolarizing pulse of −10 mV is given to neuron 1 at the appropriate phase, turning it off and leaving neuron 2 as the only neuron still firing. This process shows how a firing (1) state can be copied to a neuron that initially had a nonfiring (0) state. A Mathematica file containing the simulation for producing Figure 3 is available in the Supplementary Material.
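The protocol above can be caricatured in a few lines. This is an abstract sketch using an idealized two-state unit in place of the Izhikevich model (an assumption made for brevity), so the pulse amplitudes and phase constraints are reduced to simple state flips.

```python
# Abstract sketch of the one-bit copy protocol of Figure 3, with an
# idealized bistable unit standing in for the Izhikevich neuron model.
class BistableNeuron:
    def __init__(self):
        self.firing = False

    def depolarize(self):      # +10 mV pulse at the appropriate phase
        self.firing = True

    def hyperpolarize(self):   # -10 mV pulse at the appropriate phase
        self.firing = False


def copy_bit(parent, offspring):
    # open the neuromodulatory gate from parent to offspring: the
    # parent's spikes switch the offspring on; the gate then closes
    if parent.firing:
        offspring.depolarize()
```

Depolarizing neuron 1, copying, and then hyperpolarizing neuron 1 leaves only neuron 2 firing, as in Figure 3.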
2.2. Copying Vectors of Neuronal Activity.
Figure 2B shows pairs of bistable neurons grouped into two layers coupled bidirectionally by a topographic map. To implement copying in this system, the neurons of the current parental layer are initialized randomly to firing (1) or nonfiring (0) states. The state of the neurons in the offspring layer is reset (i.e., all neurons are turned to their off states). Neuromodulatory gates from the parental layer to the offspring layer are opened for a brief period, allowing the spiking neurons in the parental layer to switch on the corresponding neurons in the offspring layer. The activity gates between layers are then closed. The result is that the vector of activities in the parental layer has been copied to the offspring layer. Because the pairs are independent, the dynamics of each pair are the same as in Figure 3, and the per site fidelity of copying is the same as in the minimal case.
A crucial factor in allowing unlimited heredity (Szathmáry & Maynard Smith, 1997) is the robustness of copying to noise. Various types of noise can affect fidelity. We consider noise in the form of random depolarization from external sources. Figure 4 shows that several kinds of error can result when noise is introduced. With this setting of I in particular, external noise often turns the neuron on but is much less likely to turn the neuron off by accident. The neuronal replicator hypothesis makes a testable prediction: if bistable neurons transmit information in the way proposed above, one would expect to find mechanisms that improve the fidelity of copying, gated by reward-controlled neuromodulatory neurons.
Figure 5 shows several copying events in a (1+1) evolution strategy (ES) implemented using neuromodulatory gating controlled by reward. At time t, the two layers of bistable neurons both have their fitness assessed. The higher-fitness layer is defined as the parent, and the offspring layer has its activities reset (to the off state). The higher-fitness (parental) layer then copies its pattern of neuron activities to the newly defined offspring layer. The (1+1) ES is the simplest of selection algorithms. Using it, the desired vector of activities (in this case, an eight-bit array of alternating 1 and 0 states) was evolved within 25 generations. Mutation arose due to membrane potential noise. For the purposes of fitness assessment, a neuron was defined as in state 1 if it fired more than two spikes within the 250 ms fitness-assessment period and in state 0 otherwise. One generation of fitness assessment and replication took 750 ms. The Mathematica file containing the simulation for producing Figure 5 is found in the Supplementary Material.
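The selection logic of the (1+1) ES above can be sketched as follows, with per-bit flips standing in for membrane-potential noise and direct fitness evaluation standing in for the 250 ms spike-count assessment (both simplifying assumptions).

```python
import random

# Sketch of the (1+1) ES of Figure 5 on the 8-bit alternating target.
TARGET = [1, 0, 1, 0, 1, 0, 1, 0]

def fitness(v):
    # number of positions matching the desired activity vector
    return sum(a == b for a, b in zip(v, TARGET))

def mutate(v, rate=0.05):
    # per-bit flips stand in for membrane-potential noise
    return [1 - b if random.random() < rate else b for b in v]

random.seed(0)
parent = [random.randint(0, 1) for _ in range(8)]
for generation in range(2000):
    child = mutate(parent)
    if fitness(child) >= fitness(parent):   # fitter layer becomes the parent
        parent = child
    if fitness(parent) == 8:
        break
```

The mutation rate and generation cap here are illustrative; the simulated neuronal version reached the target within 25 generations.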
The next section considers a remarkable algorithmic advantage conferred by adding Hebbian learning to neuronal replicators.
2.3. Combining Hebbian Learning with Neuronal Replication.
The neuronal model shows that activity copying can be implemented by gated spiking neurons. For reasons of computational efficiency, a probabilistic model of bistable neurons is used to simulate a larger neuronal system for many more generations than would be convenient using the full spiking model. A remarkable capacity for structuring variability in the copying event emerges if, instead of limiting between-layer connections to a topographic map, one starts with a strong one-to-one topographic map but allows all-to-all Hebbian connections to develop between layers once a local optimum has been reached by the (1+1) ES (Beyer, 2001). Once the local optimum is reached, the weights between ON-ON and OFF-OFF neuron pairs are increased, and the weights between ON-OFF and OFF-ON pairs are decreased. With the weights updated, the activity vectors are completely randomized (i.e., a random restart), and a new evolutionary run is started. The Hebbian learning that took place in previous evolutionary runs therefore biases copying in future runs. An active neuron in the parental layer will not only tend to activate the corresponding one-to-one topographic neuron but will also tend to convert other neurons in the offspring layer into states occupied at previously found local optima.
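A minimal sketch of this Hebbian biasing of the copy operation, assuming the probabilistic bistable-neuron model; the topographic strength, the {−1, +1} coding, and the logistic response used here are illustrative assumptions.

```python
import math
import random

# Sketch: a strong one-to-one topographic term plus learned all-to-all
# Hebbian weights biasing the probabilistic copy of an activity vector.
N, LR, TOPO = 8, 0.001, 10.0

def hebbian_update(W, optimum):
    # at a local optimum: strengthen ON-ON and OFF-OFF pairs,
    # weaken ON-OFF and OFF-ON pairs
    s = [2 * b - 1 for b in optimum]        # map {0,1} -> {-1,+1}
    for i in range(N):
        for j in range(N):
            W[i][j] += LR * s[i] * s[j]

def biased_copy(W, parent):
    s = [2 * b - 1 for b in parent]
    child = []
    for j in range(N):
        # topographic one-to-one drive plus learned all-to-all drive
        drive = TOPO * s[j] + sum(W[i][j] * s[i] for i in range(N))
        p_on = 1.0 / (1.0 + math.exp(-drive))
        child.append(1 if random.random() < p_on else 0)
    return child
```

After several restarts, the learned weights pull offspring bits toward configurations seen at previously found local optima, as described in the text.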
This technique, first developed in Hopfield networks (Watson et al., 2009), is simpler when adapted to bias the replication operation in a (1+1) ES, possibly making the neuronal replicator implementation the most plausible one for this algorithm. It adds Hebbian learning to the set of self-adapting evolution strategies (Beyer, 2001). Hebbian learning has been used for optimization previously; for example, O'Reilly and Munakata (2000), in the Leabra algorithm, combined Hebbian learning and the delta rule. However, it has not previously been used in this way to learn explicitly the structure of local optima.
The lowest level of fitness contributions comes from looking at adjacent pairs in the vector and applying the transfer function and the fitness function. The transfer function is [0, 0] → 0, [1, 1] → 1, and all other pair types produce a NULL (N). The fitness function for each level sums the 0 and 1 entries. The second level is produced by applying the same transfer function to the output of the first. The fitness contribution of this next level is again the number of 0s and 1s in the level, multiplied by 2. This continues until there is only one highest-level fitness contribution. The fitness landscape arising from the HIFF (hierarchical if-and-only-if) problem is pathological for a hill climber: it is a fractal landscape of local optima, and a hill climber requires exponential time to solve it.
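For concreteness, the HIFF fitness function described above can be written as follows, using the standard convention (Watson, 2006) that every all-equal block of size 2^k contributes 2^k, including single bits; this convention yields the global optimum fitness of 192 for N = 32 quoted in the results.

```python
# Sketch of the HIFF fitness function described above.
def hiff(bits):
    total = len(bits)              # each single bit contributes 1
    level, weight = list(bits), 2
    while len(level) > 1:
        nxt = []
        for a, b in zip(level[0::2], level[1::2]):
            if a is not None and a == b:
                nxt.append(a)      # transfer function: [0,0]->0, [1,1]->1
                total += weight    # this block's fitness contribution
            else:
                nxt.append(None)   # mismatched pairs produce NULL
        level, weight = nxt, weight * 2
    return total
```

For example, `hiff` applied to the all-ones or all-zeros string of length 32 returns 192, while the alternating string earns only its 32 leaf contributions, since every adjacent pair mismatches.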
Figure 8A shows the performance of Hebbian replicators on the N = 32 HIFF problem for various Hebbian learning rates and magnitudes of gaussian output gating noise. Output gating modulates all the weights out of a neuron by the same gaussian random number.
For low learning rates (e.g., 0.001), Hebbian learning increases the rate at which the global optimum (fitness = 192) is discovered. This is true at all output gating noise levels, but most pronounced for high levels of output gating noise. The type of noise used makes some difference, as shown in Figure 8B. Here, instead of output gating noise, noise is applied at each input; it is described by an independent random variable assigned to every synapse into a neuron's dendritic tree. Performance is slightly impaired compared to using output noise, but there is still a benefit to adding Hebbian learning.
The efficacy of Hebbian learning is clearly shown in the HIFF 64 bit case (see Figure 9, which shows the mean maximum fitness obtained during 100 independent runs with and without Hebbian learning). Too high a Hebbian learning rate (LR = 0.0008) results in the system's getting stuck in a local optimum; just the right amount allows significant improvement compared with no Hebbian learning.
There has been no demonstration that an equivalent neuronal architecture with Hebbian learning but without replication would take longer to solve the HIFF problem than one with replication. First, no such alternative neuronal architecture, except possibly the Hopfield network described by Watson et al. (2009), is known to be effective on this problem. Second, the capacity of evolutionary algorithms to accommodate features that improve their algorithmic power is a merit of the neuronal replicator hypothesis rather than a disadvantage. For example, consider another modification: recombination. In the next section, we show how recombination can improve the search capability of evolutionary algorithms implemented using bistable neuronal elements.
2.4. Solving HIFF Using a Recombination of Neuronal Replicators.
The requirement for parallelization leads naturally to recombination in populations of neuronal replicators. It has been demonstrated previously that a population of replicators undergoing recombination can solve the 128-bit HIFF problem as long as there is tight linkage, that is, if the interdependencies of the problem correspond to the layout of the chromosome such that functionally interdependent units tend to be crossed over together (Watson, 2006). Figure 10 shows how recombination is implemented in neuronal replicators, using the same kinds of operation as proposed for the (1+1) ES.
The simulated system implements one-point crossover with deterministic crowding to maintain diversity (Mahfoud, 1995). Population size is 1000. Activity vectors are randomly initialized. Only the fixed-weight topographic mapping is used (i.e., no Hebbian weight change), with each weight set to 10.0 multiplied by gaussian noise (mean 1, standard deviation 0.01). At each step, two parents are chosen randomly from the population. Seventy percent of the time, a crossover site is also chosen randomly and determines which gates are opened: to the left of the crossover site, gates are opened from parent 1 to the offspring, and to the right, gates are opened from parent 2 to the offspring. The other 30% of the time, parent 1 is copied in total (with mutation) to the offspring. The fitness of the parent closest to the offspring in Hamming distance is calculated, along with the fitness of the offspring. A neural mechanism such as that shown in Figure 13 is capable of learning to determine the Hamming distance between the parent and the offspring neuronal replicators. If the fitness of the offspring is greater than that of the parent, the parental state is reset and overwritten by the offspring by copying in the standard way (without crossover). This process iterates. Each generation thus involves one recombination event and perhaps one copying event.
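The recombination-with-crowding loop above can be sketched as follows. The crossover probability (70%) follows the text; a simple bit-counting fitness stands in for HIFF so that the sketch is self-contained, and the population size and mutation rate are illustrative.

```python
import random

# Sketch of one-point crossover with deterministic crowding.
def fitness(v):
    return sum(v)          # bit-counting stand-in for the HIFF fitness

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def mutate(v, rate=0.02):
    return [1 - b if random.random() < rate else b for b in v]

def step(pop):
    i1, i2 = random.sample(range(len(pop)), 2)
    p1, p2 = pop[i1], pop[i2]
    if random.random() < 0.7:                 # 70%: one-point crossover
        cut = random.randrange(1, len(p1))
        child = p1[:cut] + p2[cut:]
    else:                                     # 30%: copy p1 with mutation
        child = mutate(p1)
    # deterministic crowding: the offspring competes with the parent
    # closest to it in Hamming distance and replaces it if fitter
    near = i1 if hamming(child, p1) <= hamming(child, p2) else i2
    if fitness(child) > fitness(pop[near]):
        pop[near] = child

random.seed(0)
pop = [[random.randint(0, 1) for _ in range(16)] for _ in range(30)]
for _ in range(20000):
    step(pop)
```

Crowding keeps offspring competing against similar individuals, which preserves diverse building blocks for later crossover, the property the headless-chicken control below tests for.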
Figure 11 shows a typical run on the 128-bit HIFF. With one-point crossover and deterministic crowding (top), the solution is found rapidly, whereas without recombination (mutation only, middle), the solution is never found. To determine whether recombination is really binding diverse solutions to better effect, a comparison is made with the “headless chicken” crossover described in Fogel (2006) (see Figure 11, bottom). This involves exactly the same algorithm as recombination except that one of the chosen parental genomes is replaced with a completely random new solution, controlling for the possibility that recombination acts merely as a macromutation operator. The headless chicken operator does not find the solution. Finally, we note that crowding is a useful trick for maintaining diversity. It requires the calculation of Hamming distance; a later section shows how neuronal circuits can be trained to calculate it.
2.5. Learning of Neuronal Circuits Capable of Activity Pattern Replication.
How can the very regular connectivity and the strong, tightly controlled gating proposed above be reconciled with the seemingly haphazard connectivity of neurons that is observed? How can activity pattern replication arise in an initially randomly connected network not originally capable of sustaining replication? How can a system of neuronal information transmission be transformed from one capable of only attractor-based heredity into one capable of limited heredity and finally unlimited heredity (Szathmáry & Maynard Smith, 1997)? There are two possibilities: activity-independent and activity-dependent mechanisms. This is a special case of a more general question: What capacity does the brain have to modify structure based on experience and reward to implement an adaptive neuronal circuit (Holtmaat & Svoboda, 2009)? How can it be ensured that two identical genotypes (neuronal activity vectors) in two locations produce identical phenotypes that are assigned the same fitness (see Figure 12)?
The above models assumed a perfect topographic one-to-one map between the two layers. They also assumed the existence of a mechanism to read out each genotype, directly calculate the Hamming distance between each genotype and the desired activity vector, and return a fitness value. Let us relax the assumption that the readout element is capable of knowing the ordering of the loci along the two genotypes. Figure 12 shows that this is identical to the situation in which the mapping from genotype to phenotype is not one-to-one but a random feedforward network M. Each genotype i maps to its phenotype with a different random feedforward network Mi. If this is the case, then topographic copying between genotypes will result in different phenotypes arising from the same genotype, depending on the location of that genotype in the population, and natural selection will be impossible because there will be no covariance between neurogenetic states and fitness. It is easy to see that the replication matrix from genotype 1 to genotype 2 must in fact be the product of the inverse of M2 and M1, that is, Mc = M2^(-1) M1 (see Figure 12). How could a feedforward network Mc self-organize in a realistic neuronal system? Although there exist recurrent neural network algorithms to calculate the inverse of matrices, they use nonlocal learning rules (Wang, 1993). Here we demonstrate that structural plasticity makes it possible to discover the appropriate copying matrix.
Given that M2 is invertible, the restructuring of the Mc matrix can be established by stochastic hill climbing in the space of neural structures, guided by a signal of the similarity between the parent and child activity vectors. A simple model is presented of Bonhoeffer-type rapid synaptic remodeling, whereby a weak connection can form (and break) between previously disconnected (connected) neurons within a few seconds (Hofer, Mrsic-Flogel, Bonhoeffer, & Hübener, 2009). If, following a structural change, the resulting reward rises above a time-windowed reward average, then that connectivity change persists; otherwise, it is lost. A more sophisticated temporal difference method could potentially improve performance. Once a connection exists, quantal weight mutations are permitted (Adams, 1998), which also revert under conditions of less-than-average reward. Similarly, neuron biases undergo mutation and reward-controlled reversion. These structural operations allow rapid restructuring of a sparsely connected neuronal network.
The regime for constructing Mc is to randomly choose a set of N bistable input neurons and N bistable output neurons. The network is initially randomly connected. The N bistable input neurons are initialized with a random pattern of activity, and this activity is sustained over time period T. After period T, the activity of the N output neurons is measured, and the Hamming distance between the input and output arrays is calculated. The reward is defined as N – Hamming distance. Ten thousand input activity patterns are tested, and the reward calculated by the similarity of the output activities to these input activities is averaged over all these presentations. After a particular connectivity pattern has sustained 10,000 activity pattern inputs and the subsequent dynamics, each of which has received a reward assessment, a random structural change is made to the network, and the average reward is calculated over another 10,000 input patterns. An episode is defined as a set of reward assessments, after which there is a structural modification. For the first 10 episodes, the structural change is always reset, and a time-averaged reward is calculated over these 10 episodes. After 10 episodes, the first structural change to produce a reward greater than the mean reward is accepted. If a change is accepted, then the mean reward is calculated again for 10 episodes before another permanent structural change can be accepted. This protocol ensures that on average, harmful structural changes are not accepted because a change is accepted only if the reward is above average.
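A much-simplified sketch of this regime follows. Here all 2^N input patterns are enumerated so that the reward is exact (rather than averaged over 10,000 random presentations), a single-entry weight change stands in for synapse formation and removal, and changes that do not reduce average reward are kept while others revert; all of these are simplifying assumptions.

```python
import random
from itertools import product

# Sketch of reward-guided structural hill climbing for a copying circuit:
# N input and N output units; reward = N - Hamming distance between the
# input pattern and the thresholded output.
N = 3

def forward(W, x):
    # an output unit fires if its summed input exceeds zero
    return [1 if sum(W[i][j] * x[j] for j in range(N)) > 0 else 0
            for i in range(N)]

def reward(W):
    total = 0
    for x in product([0, 1], repeat=N):     # enumerate all input patterns
        y = forward(W, x)
        total += N - sum(a != b for a, b in zip(x, y))
    return total / 2 ** N

random.seed(1)
W = [[random.choice([-1, 0, 1]) for _ in range(N)] for _ in range(N)]
best = reward(W)
for step in range(3000):
    i, j = random.randrange(N), random.randrange(N)
    old = W[i][j]
    W[i][j] = random.choice([-1, 0, 1])     # random structural change
    r = reward(W)
    if r >= best:       # keep changes that do not reduce average reward
        best = r
    else:
        W[i][j] = old   # revert harmful structural change
```

The two timescales noted in the text appear here as the inner loop over activity patterns (exact reward per structure) and the outer loop over structural changes.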
Figure 13 shows the reorganization of the connectivity that occurs to allow a 3-bit copying operation. Larger copying circuits can be produced by independently learning small copying circuits and joining them. This investigation suggests that it is possible to use stochastic hill climbing guided by reward to self-organize a system capable of high-fidelity replication of a pattern of neuronal activity. An important property that allows search in the space of connectivity is the existence of two timescales—many activity patterns can be assessed for the same structural connectivity pattern.
2.6. Calculation of Hamming Distance Between Two Neuronal Vectors.
The calculation of similarity between two neuronal patterns of activity is an operation that may be of use in several algorithms. In developing the template replication circuit above, a signal of similarity between two neuronal activity patterns was required to act as reward. Implementing diversity maintenance in the recombination circuit above required calculation of Hamming distance. But where can the initial reward signal come from that allows selection for a circuit capable of calculating Hamming distance? One possibility is that an extrinsic reward signal is maximized when the similarity between the two inputs is greatest. Another is that an experience-independent (intrinsic) reward process has evolved to reward similarity. We acknowledge that other measures of distance may be more suitable in this and other cases; we use Hamming distance here for simplicity.
Using the same structural plasticity algorithm as above, a circuit is learned that is capable of calculating the Hamming distance between two input vectors. For a vector of length N, this is achieved simply by independently evolving N XOR gates. An XOR gate fires 1 for inputs [1, 0] or [0, 1] (where two loci differ) and fires 0 for inputs [0, 0] or [1, 1] (where two loci are the same). Summing the outputs of the N XOR gates gives the Hamming distance. Figure 14 shows that an XOR gate can be evolved using the same method of structural plasticity as above. Ten thousand XOR patterns are input per structural modification. Reward is 1 if the XOR output is correct and 1/e otherwise, where e is the difference between the desired and actual Hamming distance.
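The function the evolved circuit computes can be stated in a few lines; this is a minimal sketch of the N-XOR-gate construction described above, not of the structural learning that produces it.

```python
# Sketch of the Hamming distance circuit: N independent XOR gates
# whose summed outputs give the distance between two loci vectors.
def xor_gate(a, b):
    # fires 1 when the two loci differ, 0 when they agree
    return 1 if a != b else 0

def hamming_distance(v1, v2):
    # sum the outputs of the N XOR gates
    return sum(xor_gate(a, b) for a, b in zip(v1, v2))
```

For example, `hamming_distance([1, 0, 1, 1], [1, 1, 1, 0])` returns 2, since the vectors differ at the second and fourth loci.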
The kind of problem solved above is a hard combinatorial search problem, requiring a large number of slow, generate-and-test operations (Chklovskii, Mel, & Svoboda, 2004). Critically, “whether the brain has evolved the machinery to cope with these ‘algorithmic’ challenges remains an open question” (Chklovskii et al., 2004). The algorithm presented is limited to small networks; otherwise, random formation and disconnection guided by reward is much too slow. Methods have been devised to modify random connectivity change; for example, new synapses can be drawn from a “prescreened candidate pool” (Poirazi & Mel, 2001). Diffusible factors or electric fields are also thought to be able to bias the probability of new connections forming.
In summary, the neuronal replicator hypothesis proposes that intrinsic reward functions are capable of molding networks by structural plasticity to construct circuits that can undertake more sophisticated search algorithms acting on neuronal activity rather than neural connectivity. Algorithms that search in the space of activity patterns instead of connectivity patterns can be much faster, and so there would be an adaptive advantage for evolution to have evolved such algorithms in the brain.
2.7. Replication of Actor-Critic Devices Prevents Catastrophic Forgetting.
So far there has been no demonstration of the utility of neuronal replication in a behavioral task. Here, an important role for neuronal replication is demonstrated in a simple simulated robotic learning task. The robustness of a temporal difference reinforcement learning (RL) algorithm to nonstationary perturbations is compared with and without neuronal replication. It is shown that replication of actor-critic controllers allows a solution to the stability-plasticity dilemma. We demonstrate that learning rate and robustness can be increased if multiple copies of an actor-critic controller can be made online, stored, and retrieved. Catastrophic forgetting in a nonstationary and nonlinear environment can be prevented by regular copying of procedural memories into a long-term memory store. The best controllers can be retrieved when the current controller is determined to be functioning poorly.
The stability-plasticity dilemma (SPD) refers to the simultaneous requirement for rapid learning and stable memory. Too much plasticity can result in catastrophic interference during sequence learning; for example, in neural networks, degradation of old patterns may occur as new patterns are stored. Several solutions have been proposed, such as ART (Carpenter & Grossberg, 1988), often aiming to balance plasticity and stability at the same synapse (Abraham & Bear, 1996; Abraham & Robins, 2005). Also, the growth of new neurons has been proposed to prevent catastrophic interference of new memories with existing traces of older memories (Becker, 2005; Wiskott, Rasch, & Kempermann, 2006). Here we propose a simpler solution to the SPD based on copying of patterns of weights from a rapidly changing neuronal substrate to a slowly changing neuronal store. The previously proposed mechanism for connectivity copying shows one way in which patterns of synaptic connectivity could be copied (Fernando et al., 2008; Fernando & Szathmáry, 2009a, 2009b). The next section demonstrates how genotypic neuronal activity can exhibit a connectivity phenotype.
A simple phototaxis task for a Khepera robot is simulated. For the underlying robot controller, an actor-critic reinforcement learner is chosen because it has been proposed that such a controller may exist in the brain (Barto, 1995; Schultz, 1998; Wörgötter & Porr, 2005) and neural network implementations already exist (Houk, Adams, & Barto, 1995; Suri & Schultz, 1999). The Webots simulator (Michel, 2004) simulates a standard Khepera robot with eight infrared distance sensors and two light sensors. The light source is moved every 10 minutes to a random (x, y) location in the arena. The world consists of a square arena containing two obstacles and another robot controlled by a simple Braitenberg vehicle architecture. This constitutes a nonstationary and nonlinear environment for reinforcement learning. In addition, perturbations to the robot are made manually, and external controllers are used to disturb the actor-critic controller. The Webots code for the simulation is available in the Supplementary Material. Figure 15 shows the overall arena, which is darkened apart from the single light source.
A single actor-critic controller is described first. Sensory inputs from eight infrared distance sensors (D1−8) and two light sensors (L1 and L2) are fed to the actor and critic networks, each normalized to a value between 0 and 1. The algorithm is shown below.
At each time step, the external reward re is calculated as the sum of the light sensor values minus the total distance sensor value scaled by a constant φ (= 20). The normalized sensor values and a fixed bias input of 1 serve as inputs xi(t) to the actor and the critic at each time step. The critic's prediction of the eventual reinforcement at time t, P(t), is obtained by passing activity through the critic weights vi(t): P(t) = Σi vi(t) xi(t). The final critic output is the effective reward signal δ(t) = r(t) + γP(t) − P(t − 1), the difference between the actual reinforcement r(t) plus the discounted current prediction and the previously predicted reward. The critic weights vi are then updated according to the product of the effective reward and the eligibility trace of that input, scaled by a learning rate η: Δvi = η δ(t) ēi(t). The eligibility trace is updated with the new value of xi(t) while maintaining a proportion λ of the previous value: ēi(t + 1) = λēi(t) + (1 − λ)xi(t). Next, the motor output is determined by passing stimuli through the actor's weight matrix wji with added gaussian noise of mean 0 and SD = 0.5. This value is passed through a threshold function that outputs +1 if its input is ≥ 0 and −1 otherwise. The actor's weights are then updated at a learning rate α (= 0.005). Finally, the eligibility traces for the actor are updated according to how much each input contributed to the actual outputs produced. The motor command is a constant S (= 5) times the output neuron value yi.
The actor and critic networks are initialized with weights drawn uniformly from −1 to 1.
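The update described above can be sketched as a single step. The values α = 0.005, S = 5, the gaussian motor noise, and the threshold output follow the text; the values of γ, η, and λ, and all function and variable names, are illustrative assumptions.

```python
import random

# Minimal one-step sketch of the actor-critic update described above.
GAMMA, ETA, ALPHA, LAM, S = 0.98, 0.01, 0.005, 0.9, 5

def actor_critic_step(v, w, e_v, e_w, x, r, prev_p):
    # critic prediction P(t) and effective reward (TD error) delta(t)
    p = sum(vi * xi for vi, xi in zip(v, x))
    delta = r + GAMMA * p - prev_p
    for i in range(len(v)):                 # critic weight update
        v[i] += ETA * delta * e_v[i]
        e_v[i] = LAM * e_v[i] + (1 - LAM) * x[i]
    motors = []
    for j in range(len(w)):                 # actor: noisy thresholded readout
        a = sum(w[j][i] * x[i] for i in range(len(x))) + random.gauss(0, 0.5)
        y = 1 if a >= 0 else -1             # threshold function
        motors.append(S * y)                # motor command = S * y
        for i in range(len(x)):
            w[j][i] += ALPHA * delta * e_w[j][i]
            # actor eligibility: how much this input drove this output
            e_w[j][i] = LAM * e_w[j][i] + (1 - LAM) * x[i] * y
    return motors, p
```

The returned prediction `p` is fed back as `prev_p` on the next call, closing the temporal difference loop.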
To implement multiple controllers capable of replication, the following modifications are made. The fitness of an actor-critic is defined as a leaky integral of reward, with a first-order decay rate of 0.00001. This smooths reward over many epochs (light moves). Initial fitness is zero for each actor-critic. Every 5 minutes, the active actor-critic is copied to a long-term store regardless of its fitness (storage). All the parameters that define the actor-critic are copied, except for the eligibility traces, which are set to zero. The long-term memory store is assumed to be finite, of size M (= 100); it is the least fit actor-critic in the store that is overwritten by the current actor-critic. At each time step, it is determined whether the currently active actor-critic should be replaced by an actor-critic from the long-term memory store. The gradient of fitness for the active actor-critic is calculated between 5 minutes in the past and the present value. If this gradient is less than a negative constant (χ = −10,000), or if the fitness of the current actor-critic is less than the fitness of the maximally fit actor-critic in the long-term memory store plus χ, then the active actor-critic is replaced by a copy of the most fit actor-critic in the store (retrieval). After another 5 minutes, the fitness of the actor-critic in the long-term memory store that gave rise to the active actor-critic is updated with the fitness of the active actor-critic.
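The storage and retrieval policy can be sketched as follows; M = 100 and χ = −10,000 follow the text, while the data layout and names are illustrative assumptions.

```python
# Sketch of the long-term storage and retrieval policy described above.
M, CHI = 100, -10_000

class ControllerStore:
    def __init__(self):
        self.items = []                       # (params, fitness) pairs

    def store(self, params, fitness):
        if len(self.items) < M:
            self.items.append((params, fitness))
        else:                                 # overwrite the least fit entry
            worst = min(range(M), key=lambda k: self.items[k][1])
            self.items[worst] = (params, fitness)

    def fittest(self):
        return max(self.items, key=lambda item: item[1])

def should_retrieve(fitness_gradient, current_fitness, store):
    # retrieve when fitness is falling steeply, or has dropped far
    # below the best stored controller (chi is negative)
    if not store.items:
        return False
    return (fitness_gradient < CHI or
            current_fitness < store.fittest()[1] + CHI)
```

When `should_retrieve` fires, the active controller is overwritten by a copy of `store.fittest()`, implementing the retrieval step.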
Figure 16 (left) shows an externally imposed perturbation that results in the loss of a previously successful strategy. The agent initially achieves an effective strategy of going rapidly to the light and rotating around the light. When the agent is near the light (at 7000 time units), it is forced to rotate clockwise around its axis for 1000 time units, after which control is returned to the actor-critic. The original strategy is never rediscovered (see the graph labeled “Reward Accumulation,” which remains reduced after the external perturbation). Figure 16 (right) shows the same perturbation made to a robot with the capacity for replication of actor-critic controllers. Rather than experiencing catastrophic forgetting, the retrieval of stored actor-critic controllers can rapidly regenerate the previously learned strategy. In summary, the capacity for replication of actor-critic controllers prevents catastrophic forgetting in a robotic learning task.
2.8. Activity Genotypes and Connectivity Phenotypes.
The rapid copying of the phenotype of a weight matrix is possible by copying the states of bistable neuromodulatory inhibitory neurons as shown in Figure 17. Assume that two identical weight matrices exist (blue) (perhaps the result of a previous synaptic connectivity copying event), and that each weight is gated by an inhibitory neuromodulatory neuron (red). Assume also that the inhibitory neurons are linked topographically to the corresponding neuron in the other layer (green). Then by activity copying of the states of the inhibitory neurons, it is possible to rapidly reconfigure the effective connectivity (purple) of the lower layer to match that of the upper layer. This is an alternative and faster way in which the weight matrix of the actor and critic could be replicated.
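A minimal sketch of this gating mechanism, under the text's assumptions: two layers share an identical weight matrix, each weight is gated by a bistable inhibitory neuron (open or closed), and copying the gate states along the one-to-one topographic links reconfigures the lower layer's effective connectivity to match the upper layer's. The layer size and random initialization are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4  # assumed layer size

# Two identical weight matrices, e.g., from an earlier
# synaptic-connectivity copying event (blue in Figure 17).
W = rng.uniform(-1, 1, (N, N))
upper_W, lower_W = W.copy(), W.copy()

# Bistable inhibitory gate states (red): 1 = weight enabled, 0 = inhibited.
upper_gates = rng.integers(0, 2, (N, N))
lower_gates = rng.integers(0, 2, (N, N))

def effective(weights, gates):
    """Effective connectivity: a weight contributes only where its gate is open."""
    return weights * gates

# Activity copying of the inhibitory gate states along the topographic
# links (green) reconfigures the lower layer's effective connectivity
# (purple) to match that of the upper layer.
lower_gates = upper_gates.copy()
```

Only the binary gate states are copied, not the weights themselves, which is why this route is faster than replicating the weight matrix directly.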
The neuronal replicator hypothesis proposes that patterns of neuronal activity can be copied and can implement evolutionary algorithms in the brain. The replication operation can be biased by Hebbian learning, allowing knowledge of previously discovered local optima to guide further replication. Recombination can take place between activity patterns. The circuits required for replication can be learned by structural plasticity guided by reinforcement. Replication can help to solve the stability-plasticity dilemma.
Patterns of bistable activity are one possible candidate for rapidly updatable neuronal data structures. The neuronal mechanisms underlying working memory (Baddeley & Hitch, 1974)—bistability (Wang, 1999) and recurrence (Zipser, Kehoe, Littlewort, & Foster, 1993)—are also ideal candidates for implementing activity replicators. For example, persistent firing due to single-cell dynamics related to elevated [Ca2+] (Loewenstein & Sompolinsky, 2003; Fransen, Tahvildari, Egorov, Hasselmo, & Alonso, 2006) or high levels of facilitation (Barak, 2007) and network-level dynamics due to recurrent connections (Brunel, 2003), such as the self-organization of “stimulus-selective, sub-populations of excitatory cells within a cortical module” (Amit, Bernacchia, & Yakovlev, 2003) resembling Hebbian cell assemblies (Amit, 1995), cortico-thalamic reverberating circuits (Wang, 2001), dendritic bistability (Goldman, Levine, Major, Tank, & Seung, 2003), or mutual inhibition networks (Machens, Romo, & Brody, 2005), could all implement bistable replicators. Alternatively, it has been proposed that calcium-mediated synaptic facilitation stores short-term memory in the form of elevated presynaptic calcium levels rather than persistent spiking per se (Mongillo, Barak, & Tsodyks, 2008). Furthermore, “subsets of neurons where Up (firing) and Down (not firing) states are observed may be considered as a second integrated structure, beyond the level of individual neurons” (Holcman & Tsodyks, 2006). These structures flip spontaneously between up and down states. This flipping occurs at low frequency (normally less than 0.1 Hz) (Raichle, 2006) and may constitute another basis of bistable neuronal heredity. This is consistent with the fact that dopamine (a possible signal of fitness) can influence the distribution of up and down states (Durstewitz & Seamans, 2006). No bistable mechanism can yet be ruled out as a potential basis for neuronal heredity.
The models of probabilistic bistable elements presented here do not exclude any of these possibilities.
The neuronal replicator hypothesis predicts that neuronal replication would manifest as spontaneous intrinsic activity in the absence of well-defined tasks. Spontaneous activity is indeed observed (Arieli, Sterkin, Grinvald, & Aertsen, 1996; Tsodyks, Kenet, Grinvald, & Arieli, 1999; Kenet, Bibitchkov, Tsodyks, Grinvald, & Arieli, 2003; Fox, Corbetta, Snyder, Vincent, & Raichle, 2006). Such spontaneous activity contributes significantly to the brain's energy consumption and so is not without adaptive importance (Raichle & Mintun, 2006).
The existence of units of evolution in the brain has significance for agent-based models of neural processes stemming from Minsky's The Society of Mind (1986), which have led to evolutionary game theory models of cognition (Byrne & Kurland, 2001) in which agents in the brain compete and cooperate for behavioral execution. The work also relates to multiexpert neural network models that have been shown to be capable of implementing Q-learning (Toussaint, 2003). The neuronal replicator viewpoint allows several questions to be easily posed. What is the intrinsic value system the brain uses to select neuronal replicators (Oudeyer et al., 2007)? The dopamine system is a crucial element in implementing neuronal search, for it acts to distribute reward and thus define the fitness of neuronal replicators. The literature on intrinsic motivation systems where value comes not only from external reward objects but from intrinsic information theory measures such as increasing the first derivative of predictability (Schmidhuber, 1991; Oudeyer et al., 2007) is also important in understanding the basis of neuronal fitness. This kind of intrinsic value system seems to be crucial in explaining play and active exploration. It is also consistent with evidence that more complex fitness functions than mere reward prediction error are signaled by dopamine (Pennartz, 1995; Redgrave, Prescott, & Gurney, 1999); for example, it has been shown that novelty is also signaled by phasic dopamine, independent of explicit reward (Kakade & Dayan, 2002). The NRH predicts that dopaminergic reward would be assigned to and influence the behavior of groups of neurons rather than individual synapses independently.
We also note that recent work has criticized the hypothesis that creative thought is a Darwinian process (Gabora, 2005). First, the neuronal replicator hypothesis does not claim that conscious (serial) thoughts are the units of evolution; second, we note that the capacity for replication allows adaptations to arise from a (1+1) ES, that is, even where there is only one member in each generation. Third, natural selection does not require multiple identical copies of a variant, as Gabora claims. Instead, single copies of distinct variants are all that are needed if an external module explicitly assigns the probability that each variant contributes to the next generation. The intrabrain replicators that we propose could be examples of Aunger's (2002) neuromemes. They replicate within, not between, brains. However, phenotypic copying of neuronal replicators between brains may be the basis of the heritability of human language, that is, the phenotypic copying of linguistic constructions between brains (Steels & De Beule, 2006a, 2006b). Language acquisition is a complex search problem undertaken by infants. We propose elsewhere that neuronal replicators may underlie the search for linguistic constructions (Fernando & Szathmáry, 2009a).
Intelligence is the capacity to adapt behavior to meet the demands of the environment. When Fogel (2006) wrote in his review of Evolutionary Computation, “The argument offered in this book is that the process of evolution accounts for [intelligent] behaviour and provides a foundation for the design of artificial intelligent machines” and “The versatility of the evolutionary procedure is one of its main strengths in serving as a basis for generating intelligent behaviour, adapting to new challenges, and learning from experience,” he was not referring to evolution in the brain. However, his statement could not better describe our motivation for proposing the neuronal replicator hypothesis. We find it remarkable that such a powerful algorithm as natural selection has not been seriously entertained as a possible neuronal basis for adaptive behavior. While the machine learning literature contains thousands of papers on evolution and natural selection in neural networks for adaptive behavior, none of these papers seriously proposes that actual neuronal networks could implement an evolutionary computation algorithm within a single brain. This is the proposal of the neuronal replicator hypothesis.
Funding was generously provided by a Marie Curie Inter-European Grant to work at Collegium Budapest, Hungary. Partial support of this work has generously been provided by the Hungarian National Office for Research and Technology (NAP 2005/KCKHA005), the Hungarian Scientific Research Fund (OTKA, NK73047), and the eFlux FET-OPEN project (225167). We thank Richard Watson, Eugene Izhikevich, Anil Seth, Phil Husbands, Eva Jablonka, and Dario Floreano.
Supplementary material referred to throughout the letter is available online at http://www.mitpressjournals.org/doi/suppl/10.1162/neco_a_00031.