Abstract

We propose that replication (with mutation) of patterns of neuronal activity can occur within the brain using known neurophysiological processes. Thereby, evolutionary algorithms implemented by neuronal circuits can play a role in cognition. Replication of structured neuronal representations is assumed in several cognitive architectures. Replicators overcome some limitations of selectionist models of neuronal search. Hebbian learning is combined with replication to structure exploration on the basis of associations learned in the past. Neuromodulatory gating of sets of bistable neurons allows patterns of activation to be copied with mutation. If the probability of copying a set is related to the utility of that set, then an evolutionary algorithm can be implemented at rapid timescales in the brain. Populations of neuronal replicators can undertake a more rapid and stable search than can be achieved by serial modification of a single solution. Hebbian learning added to neuronal replication allows a powerful structuring of variability capable of learning the location of a global optimum from multiple previously visited local optima. Replication of solutions can solve the problem of catastrophic forgetting in the stability-plasticity dilemma. In short, neuronal replication is essential to explain several features of flexible cognition. Predictions are made for the experimental validation of the neuronal replicator hypothesis.

1.  Introduction

Why expect replication of information in the brain? First, regardless of the search algorithm used (e.g., reinforcement learning, Bayesian learning, Helmholtz machines, free energy minimization, simulated annealing, backpropagation, random search), a population of machines can find a solution at least as fast as can just one copy of that machine if the machines can be operated in parallel. In some cases, mixtures of experts can simultaneously contribute to behavior (Jacobs, Jordan, Nowlan, & Hinton, 1991), as is the case in robotic implementations of multiple-model-based reinforcement learning (Doya, Samejima, Katagiri, & Kawato, 2002). Where multiple solution representations exist, a mechanism to reallocate limited resources (e.g., neurons or synapses) from the least effective solutions to the currently most effective solutions is necessary. Replication permits such redistribution.

Second, cognitive architectures involving symbol manipulation require copying operations (Marcus, 2001). ACT-R production systems assume information can be copied from one brain region to another, that is, the contents of variables can be moved among data structures (Anderson, 2007). Also, the production rules themselves are learned by a kind of cognitive evolutionary algorithm. In Copycat (a cognitive architecture for solving analogy-based insight problems), we see an evolutionary system as well (Hofstadter & Mitchell, 1994). “Codelets” in the “coderack” are a population of agents (tasks and rules) that evolve.

Third, replication of memory traces is necessary for the influential multiple-trace theory of memory (Nadel, Samsonovich, Ryan, & Moscovitch, 2000).

Fourth, the problem of catastrophic forgetting (the stability-plasticity dilemma; Abraham & Robins, 2005) is solved given the capacity to replicate information, as we shall demonstrate.

Fifth, the search capabilities of evolutionary algorithms are well suited to solving novel problems of the type that require insight, creativity, and imagination (Kohler, 1925; Simonton, 1995; Sternberg & Davidson, 1995; Schwefel, 2000). As Schwefel describes, it has been shown that classical optimization methods are typically more efficient in linear, quadratic, strongly convex, unimodal, and separable problems. In these cases, brains may well have evolved specialized learning mechanisms. Evolutionary algorithms are better in discontinuous, nondifferentiable, multimodal, nonstationary, noisy, and even fractal problems (Schwefel, 2000). This set may constitute a broader field of application in cognition, most notably in problems involving symbol manipulation or problems requiring structured search in rugged landscapes, that is, problems with epistatic interdependencies (Watson, Hornby, & Pollack, 1998; Watson, 2006) such as insight problems (Simonton, 1995; Sternberg & Davidson, 1995; Wagner, Gais, Haider, Verleger, & Born, 2004). Furthermore, where a problem is novel, so that evolution at the organism level has had no opportunity to evolve a brain module specifically for it, an evolutionary algorithm can be incredibly versatile. “By varying the representation, the variation operators, the population size, the selection mechanism, the initialization, the evaluation function, and other aspects, we have access to a diverse range of source search procedures. Evolutionary algorithms are much like a Swiss Army knife: a handy set of tools that can be used to address a variety of tasks. Having the Swiss Army knife provides you with the ability to address a wide variety of problems quickly and effectively, even though there might be a better tool for the job” (Michalewicz & Fogel, 2004, p. 163).

The next sections compare neural nonevolutionary algorithms that have received attention in the literature with those proposed by the neuronal replicator hypothesis.

1.1.  Evolutionary Computation, Synaptic Selectionism, and Neural Darwinism.

That Darwinian selection is involved in learning, thought, and human creativity has been proposed many times in the past (James, 1890; Bremermann, 1958; Dennett, 1981; Baldwin, 1898, 1909; Hadamard, 1945; Monod, 1971; Campbell, 1974; Dawkins, 1982; Cooper, 2001; Aunger, 2002). Earlier models were psychological or, rather, metaphorical and did not suggest explicit neural mechanisms. Recent models have been based on the idea that synapses or groups of synapses are differentially stabilized by reward (Changeux, Courrege, & Danchin, 1973; Edelman, 1987; Dehaene & Changeux, 1997; Dehaene, Kerszberg, & Changeux, 1998; Seung, 2003; Izhikevich, Gally, & Edelman, 2004; Izhikevich, 2006). None of these models implements an evolutionary algorithm; instead, they are selectionist (Crick, 1989, 1990). Edelman's use of the term neural Darwinism has been criticized for its inaccuracy by Crick (1989). This is because evolution by natural selection occurs only where there are units of evolution. Units of evolution are replicators capable of hereditary variation (Maynard Smith, 1986). Selectionist algorithms do not possess replicators. We propose how evolutionary algorithms could be implemented in the brain by the replication of patterns of neuronal activity (not of neurons themselves).

Units of evolution are entities that replicate and are capable of stably transmitting variations across generations. If such entities have differential fitness, then natural selection generates adaptation (Muller, 1966; Maynard Smith, 1986). Natural selection in the sense used here refers to the algorithm that takes place when there are units of evolution. Replication establishes covariance between phenotypic traits and fitness, a fundamental requirement for adaptation by natural selection (Price, 1970). The natural selection algorithm (Dennett, 1995) can have many possible implementations (Marr, 1983); for example, units of evolution include some units of life (Gánti, 2003) such as organisms and lymphocytes evolving by somatic selection (Edelman, 1994), but also informational entities (without metabolism) such as viruses, machine code programs (Lenski, Ofria, Collier, & Adami, 1999), and binary strings in a genetic algorithm (Fraser, 1957).

Some clarification is required as to what is meant by a replicator. We mean to include a very general class of multiplying entities. For example, we include phenotypic replicators; the physical (neuronal) basis of behavior or of evolutionary strategies (such as hawk or dove) is a phenotypic replicator as described by Maynard Smith (1998). A replicator need not maintain a distinction between genotype and phenotype; a self-replicating ribozyme, for example, is both informational and catalytic. A systematic classification of replicators has recently been published (Zachar & Szathmáry, 2010).

Even with this wider notion of replicator, the selectionist models of Changeux and Edelman do not contain units of evolution and therefore do not implement a natural selection algorithm (or, put differently, they implement survival selection without reproduction). Darwinian selection does not exist without some type of “reproduction,” that is, replication with errors (Crick, 1989, 1990). Selectionist models often implement some kind of reward-biased stochastic hill climbing. Such algorithms can explain performance in simple reward-biased cognitive search tasks, such as the Stroop task (Dehaene, Changeux, & Nadal, 1987; Dehaene et al., 1998), the Tower of London task (Dehaene & Changeux, 1997), or instrumental and classical conditioning tasks (Izhikevich, 2007b), but they require a careful choice of representations by the designer in order to ensure that the search landscape is smooth and contains no local optima.

It has been suggested that Edelman's (1987) notion of recurrence can be seen as a replicator of neuronal activity and that the neuronal replicator hypothesis is a variant of Edelman's notion of reentry. This is not the case because reentry is nothing more than the principle of having recurrent reciprocal connections between (and within) neuronal regions. Izhikevich's random recurrent neuronal networks use reentrant connections, for example, and indeed they can carry out simple kinds of classical and operant conditioning (Izhikevich, 2006, 2007b). They are the most modern example of Edelman's principle of reentry. The connections can be nontopographic and so do not necessarily replicate a vector of activity from one brain region to another; rather, they act to stochastically modify neuronal groups or adjust the attractor dynamics of a neuronal group. The influence of one group on another is attractor based rather than template based.

In another example of reentry, Tononi, Sporns, and Edelman (1992) have reentrant connections between visual areas that allow perceptual binding due to synchrony. There is no replication of activity between regions; instead, synchronous activity is reinforced by a value signal that tends to promote eye movements toward the desired goal. The task is an operant conditioning task solved by reward-biased stochastic hill climbing, in the same way as Izhikevich's (2007b) later models.

Another example of reentry implementing a reward-biased stochastic hill-climbing algorithm is Sporns and Edelman's (1993) solution of a low-dimensional Bernstein problem of inverse kinematic control. Neuronal groups control a four-jointed arm with a pointer at the end that can move on a 2D surface. Reward is obtained if the pointer is close to a desired spot. Reward tends to reinforce neuronal groups that acted to bring the pointer closer to the desired spot. This is an example of a reward-biased stochastic hill climbing. Reentrant signals exist between neuronal groups, allowing synergistic motor actions to build up. The fundamental mechanism remains reward-biased stochastic hill climbing, and there is in no sense replication due to reentry, as has been claimed.

Similarly, it has been claimed that the kind of recurrence in Elman's (1990) simple recurrent network where a hidden layer feeds back onto itself topographically could also be called replication. The difference here is that there is no copying in space, only in time. A standard feedforward neural network with a one-to-one map (and hence a subset of recurrent neural nets) is capable of replication of a pattern of activity in space. Thus trivially, feedforward nets are capable of implementing replication in an evolutionary algorithm because the copy is located elsewhere (and overwrites another entity) at a different location in space from the parent. If this is a trivial property of feedforward neural nets, what is the purpose of this letter? First, we are interested in demonstrating the capacity for copying bistable activity patterns in real spiking neural networks, and second we are interested in demonstrating that the copying operation could actually be used to implement an evolutionary algorithm. Feedforward or recurrent neural networks have not previously been used to implement an evolutionary algorithm. This requires two other properties: neuromodulatory gating and a determination of fitness of each layer. Selection can then act to remove one copy while keeping another copy.
Note that such evolutionary processes cannot be implemented with temporal copying alone. The fact that copying of activity is so trivial from the point of view of feedforward neural networks makes it even more difficult to understand why a neuronal evolutionary computation algorithm has not been seriously entertained previously. It is as if we had known about DNA but had not accepted the theory of natural selection based on DNA replication.

An example illustrates the difference between reward-biased stochastic hill climbing and natural selection in a population. Suppose one seeks to optimize a binary string of length L, but the fitness landscape contains a local optimum defined as follows. Let the fitness of a solution be proportional to the number of 1s in the array, except for a valley of zero fitness of size M centered around L/2 1s; that is, if the number of 1s is greater than L/2 − M/2 and less than L/2 + M/2, then fitness = 0; otherwise, fitness = the number of 1s. This problem was recently used by Clune et al. (2008). In reward-biased stochastic hill climbing, a single random initial binary solution is generated. Each bit mutates with probability F; when a bit mutates, it takes state 1 with probability Pi, where initially all Pi = 0.5. A mutation event corresponds to the stochastic activation of a reentrant synapse in Edelman's models. If a mutation is associated with an improvement in fitness, then Pi is shifted toward the value of that bit by a scaling factor D, whereas if a mutation is associated with a decreased fitness, Pi is shifted away from that value by a scaling factor D. This corresponds to synaptic weakening or strengthening of reentrant connections and is analogous to a cross-entropy global optimization method (Szinta & Lorincz, 2007). The stochastic element is homogeneous and due to F, but reward biases the effect of stochasticity toward the regions of higher fitness. One can see that the system is selectionist rather than Darwinian because there is no replication. Reward selects Pi values that tend to produce a bit state that contributes most to fitness. Figure 1 shows that the stochastic hill climber can get stuck at the local optimum, whereas a standard off-the-shelf population-based genetic algorithm is able to reach the global optimum—in this case, by the simple fact that a solution was initialized in the basin of attraction of the global optimum. There is no sophistication here, but the benefit of a parallel search is clear.
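To make the contrast concrete, the sketch below (ours, not the authors' code) implements this valley landscape, a reward-biased stochastic hill climber of the kind just described, and a minimal generational genetic algorithm; the parameter names L, M, F, and D follow the text, while the exact update rule and run lengths are assumptions.

```python
# Illustrative sketch: zero-fitness-valley landscape, a reward-biased stochastic
# hill climber that nudges per-bit probabilities Pi, and a minimal generational GA.
import random

L_BITS, M, F, D = 100, 20, 0.01, 0.1

def fitness(bits):
    ones = sum(bits)
    # Valley of zero fitness of width M centered on L/2 ones.
    if L_BITS / 2 - M / 2 < ones < L_BITS / 2 + M / 2:
        return 0
    return ones

def stochastic_hill_climber(steps=100_000):
    p = [0.5] * L_BITS                       # per-bit probabilities Pi
    state = [random.randint(0, 1) for _ in range(L_BITS)]
    f_prev = best = fitness(state)
    for _ in range(steps):
        mutated = [i for i in range(L_BITS) if random.random() < F]
        for i in mutated:
            state[i] = 1 if random.random() < p[i] else 0
        f_now = fitness(state)
        # Shift Pi toward the mutated bit value after an improvement,
        # away from it after a decrease in fitness.
        sign = 1 if f_now > f_prev else (-1 if f_now < f_prev else 0)
        for i in mutated:
            step = D * sign * (1 if state[i] == 1 else -1)
            p[i] = min(1.0, max(0.0, p[i] + step))
        f_prev, best = f_now, max(best, f_now)
    return best

def genetic_algorithm(pop_size=500, generations=200, mu=0.0001):
    pop = [[random.randint(0, 1) for _ in range(L_BITS)] for _ in range(pop_size)]
    for _ in range(generations):
        fits = [fitness(ind) for ind in pop]
        # Fitness-proportionate selection (uniform if every individual scores 0).
        parents = (random.choices(pop, weights=fits, k=pop_size)
                   if sum(fits) > 0 else random.choices(pop, k=pop_size))
        pop = [[1 - b if random.random() < mu else b for b in parent]
               for parent in parents]
    return max(fitness(ind) for ind in pop)

print(stochastic_hill_climber(), genetic_algorithm())
```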

Figure 1:

(Thin line) Fitness of the stochastic hill climber at iterations/100. L = 100, M = 20, F=0.01, D=0.1. (Thicker line) Maximum fitness for a generational genetic algorithm of population size 500 evolving by fitness proportionate selection, mutation rate per locus = 0.0001, no recombination.

This example highlights a fundamental problem with stochastic hill climbing of the type Edelman proposes: the system can become stuck at local optima that could have been avoided trivially by a population. A multiple random restart hill climber is, of course, an option (Watson, Buckley, & Mills, 2009) but is poorly suited to the control of online behavior.

1.2.  Evolutionary Computation in Reinforcement Learning.

Next we consider a popular nonevolutionary set of machine learning algorithms: temporal difference reinforcement learning (TDRL). TDRL is a selectionist algorithm in which the value associated with a state or state-action pair is updated on the basis of the temporal difference error: the difference between the current value estimate and the reward actually obtained plus the (discounted) value estimate of the next state. TDRL depends on assigning reward to representations of states or actions during the execution of a task. At the basis of reinforcement learning is the temporal difference (TD) algorithm, which associates values with states in an actor-critic architecture (Houk et al., 2007) or values with state-action pairs in Q-learning (Watkins, 1989) and state-action-reward-state-action (SARSA) learning; the state-action pairs exist at the outset of the task and are assigned value by a value function as the task proceeds in an online manner (Whiteson, Taylor, & Stone, 2007). An action selection algorithm determines when to execute these state-action pairs as a function of their instantaneous value, balancing exploration and exploitation. The actor-critic can be seen as a kind of stochastic hill-climbing algorithm (Niv, 2009) with a sophisticated value assignment system.
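For readers unfamiliar with the TDRL update, the following is a minimal tabular Q-learning sketch; the parameter values and the epsilon-greedy rule are illustrative assumptions rather than details of any of the cited models.

```python
# Minimal tabular Q-learning sketch of the TDRL update described above.
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1
Q = defaultdict(float)                      # Q[(state, action)] -> estimated value

def choose_action(state, actions):
    """Epsilon-greedy action selection: balance exploration and exploitation."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def td_update(state, action, reward, next_state, actions):
    """One temporal-difference step: move the estimate toward
    reward + discounted value of the best next action."""
    target = reward + GAMMA * max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])
```

In use, an agent would repeatedly call choose_action, observe the reward and next state from its environment, and call td_update; SARSA differs only in computing the target from the action actually taken next, and an actor-critic splits the roles of value estimation and action selection.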

TDRL techniques have been influential in neuroscience. This is partly due to the remarkable discovery that the release of dopamine (DA) shifts from the time of an unconditioned stimulus (US) to an earlier reward-predicting conditioned stimulus (CS) (Schultz, Dayan, & Montague, 1997; Schultz, 1998), as predicted by temporal difference learning. Recently Izhikevich (2007b) has begun to integrate the neural Darwinism and synaptic selectionism type models with reinforcement learning (Kaelbling, Littman, & Moore, 1996; Sutton & Barto, 1998; Niv, 2009). TDRL techniques “adaptively develop an evaluation function that is more informative than the one directly available from the learning system's environment” (Barto, Sutton, & Anderson, 1983, p. 838). TDRL would seem to be able to explain performance in learning tasks such as operant conditioning (Thorndike, 1911; Skinner, 1976).

It is important to understand that TDRL does not claim to explain how suitable low-dimensional representations of action and value function can arise in the brain. Very large tabular representations of the state-action space (with no compression) result in very slow learning because a given state-action pair is visited only rarely. A method of compressing the state representation (i.e., a method of function approximation) becomes necessary. Often the choice of function approximator requires domain-specific knowledge provided by the designer (Kaelbling et al., 1996). For example, “The success of TD-Gammon was dependent on Tesauro's skillful design of a non-linear multilayered neural network, used for value function approximation in the Backgammon domain consisting of approximately 10^20 states” (Elfwing, 2007, p. 20).

Several function approximation methods exist to deal with large state spaces where convergence to an optimal solution would otherwise be slow. They aim to produce, in the actor or critic, a concise representation of inputs with which to predict actions or values. Multilayer perceptrons, k-nearest neighbors, radial basis functions, and so on are all used in function approximation. Evolutionary algorithms have already been used successfully for function approximation in reinforcement learning, for example, in evolving normalized radial basis function networks for robotic control tasks (Samejima & Omori, 1999; Kondo & Ito, 2004), and these theories have been applied to representations of action-specific values in the striatum (Samejima, Ueda, Doya, & Kimura, 2005). Choosing model parameters for function approximators is well suited to evolutionary algorithms; for example, evolutionary methods have been used to evolve appropriate meta-parameters for reinforcement learning algorithms (Elfwing, 2007). Whiteson et al. (2007) used evolutionary function approximation to evolve representations for value functions and found that this improved the performance of reinforcement learning algorithms. When neuronally implemented evolutionary computation is used in this way, it contributes but does not replace ontogenetic reinforcement learning, that is, the reinforcement learning problem is solved by using both a fitness function for function approximation and a temporal difference method at the same time (Togelius et al., in press).
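The following toy sketch conveys the flavor of evolutionary function approximation without reproducing any cited algorithm: a population of radial-basis-function value approximators is mutated and truncation-selected against an assumed target function. In a real reinforcement learning setting, the score of a candidate approximator would instead come from the return it achieves or the TD error it produces.

```python
# Toy sketch of evolutionary function approximation: a population of
# radial-basis-function (RBF) value approximators is mutated, and the best
# half is kept each generation (truncation selection). Fitness here is the
# negative squared error against a known target value function, purely for
# illustration.
import numpy as np

rng = np.random.default_rng(0)
STATES = rng.uniform(-1, 1, size=(200, 2))                       # sampled 2D states
TARGET = np.sin(3 * STATES[:, 0]) * np.cos(2 * STATES[:, 1])     # assumed target values

def rbf_values(params, states):
    centers, widths, weights = params
    d2 = ((states[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    phi = np.exp(-d2 / (2 * widths ** 2))
    return phi @ weights

def fitness(params):
    return -np.mean((rbf_values(params, STATES) - TARGET) ** 2)

def mutate(params, sigma=0.05):
    return tuple(p + sigma * rng.normal(size=p.shape) for p in params)

def random_individual(n_basis=12):
    return (rng.uniform(-1, 1, size=(n_basis, 2)),       # centers
            np.abs(rng.normal(0.5, 0.1, size=n_basis)),  # widths
            rng.normal(0, 0.1, size=n_basis))            # output weights

pop = [random_individual() for _ in range(30)]
for generation in range(100):
    pop.sort(key=fitness, reverse=True)
    pop = pop[:15] + [mutate(p) for p in pop[:15]]       # keep parents, add mutants

print("best approximation error:", -fitness(pop[0]))
```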

Why should we expect evolutionary algorithms to have any special advantage as function approximators? It has been shown that if there is nontrivial neutrality in the genetic representation, natural selection can discover compressed representations with the capacity to structure exploration so as to tend to improve the chance that a mutation will be beneficial (Toussaint, 2003). Examples of the capacity for an evolutionary algorithm to evolve its state-space representations to allow facilitated variation exist in several domains, such as evolving logic circuits (Kashtan & Alon, 2005), ribozyme secondary structures (Parter, Kashtan, & Alon, 2008), and rewrite rules (Toussaint, 2003). The same automatic tendency of evolution to compress genetic representations carries over to neuronal replicators.

Despite these useful properties of natural selection, only nonevolutionary methods of undertaking function approximation have been proposed for real neuronal systems. These methods use self-organization to produce appropriate state-space descriptions on which reinforcement learning can act. An example of a neuronally plausible RL task is described by Dominey (1995). Recurrent activity in prefrontal cortex produces a state representation of sensory-motor sequences that are associated with motor outputs to carry out sequence learning. Training is supervised initially, but errors are corrected by RL of cortico-striatal projections. A related method for self-organizing representations is the recursive self-organizing map, which is capable of representing temporal sequences (Voegtlin, 2002). The neuronal replicator hypothesis does not deny the existence of other plausible neural search algorithms. Rather, it is complementary and proposes that evolutionary computation is suited to certain kinds of representation evolution in a certain range of problems (Kondo & Ito, 2004), for example, the formation of Bayesian predictive models (Kemp & Tenenbaum, 2008) or sets of linguistic constructions (Steels & De Beule, 2006a), which are related to classifier systems.

Another approach to dealing with the curse of dimensionality in action space is to use hierarchical reinforcement learning methods, for example, hierarchical Q-learning in which an evolution-like search is used to find subgoals for reinforcement learning problems (Wiering & Schmidhuber, 1997). Other approaches have used controllers consisting of multiple modules (Morimoto & Doya, 2001; Doya et al., 2002).

Evolutionary algorithms have also been used without temporal difference methods to evolve task-specific reinforcement learning mechanisms by evolving neuromodulation circuits (Niv, Joel, Meilijson, & Ruppin, 2002). Using the classification from a recent review that compared methods for solving the reinforcement learning problem, this is the phylogenetic approach (Togelius et al., in press): it uses only a fitness function and does not assign value to specific states. A genetic algorithm was used to evolve a neural network that contained neuromodulatory neurons that modulated synaptic plasticity as a function of external stimuli (Soltoggio, Dürr, Mattiussi, & Floreano, 2007). This has been used to evolve controllers for bee foraging, for example. According to the neuronal replicator hypothesis, such evolutionary computation could also occur online in the brain, within an organism's lifetime. In some tasks, evolutionary computation has been shown to be superior to some temporal difference methods, for example, in double-pole balancing (Stanley & Miikkulainen, 2002). Recently more complex kinds of evolutionary method, such as cooperative coevolution, have been shown to be superior to other methods in the double-pole balancing task (Gomez, Schmidhuber, & Miikkulainen, 2008).

Last but not least, it is important to consider ontogenetic RL methods based on policy gradient ascent (Togelius et al., in press). Policy gradient ascent methods “update the agent's policy-defining parameters θ directly by estimating a gradient in the direction of higher (average or discounted) reward” (Wierstra, Foerster, Peters, & Schmidhuber, 2007). These methods have been shown to be superior to TD methods in partially observable Markov decision processes and non-Markovian problems (Wierstra et al., 2007). Sophisticated recurrent neural networks exist to automatically store and utilize episodic memories of past events that are of relevance for action selection (Schmidhuber, Wierstra, & Gomez, 2005). Evolutionary methods do, however, successfully compete with even these advanced methods; for example, it was shown that cooperative coevolution of synapses outperformed recurrent policy gradients on two-pole balancing with incomplete information (Gomez et al., 2008). Even more sophisticated methods exist to determine optimal search policies based on self-referential modification of the search algorithm by a machine that makes proofs about whether a self-modification would perform better, and changes itself accordingly (Hutter, 2005; Schmidhuber, 2009). It is not our intention to deny the possibility that such algorithms exist in the brain, but we suggest that there may still be a place for neuronal evolutionary computation for formation of neuronal representations.
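As a minimal illustration of the policy gradient idea quoted above, here is a REINFORCE-style sketch on an assumed two-armed bandit; the task, payoffs, and learning rate are our illustrative choices, not taken from the cited work.

```python
# REINFORCE-style sketch: policy parameters theta are nudged along an
# estimate of the gradient of expected reward for a softmax policy.
import math
import random

theta = [0.0, 0.0]                       # one preference per action
true_reward = [0.3, 0.7]                 # hypothetical bandit payoff probabilities
alpha = 0.05

def softmax(prefs):
    exps = [math.exp(p) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

for episode in range(2000):
    probs = softmax(theta)
    action = random.choices([0, 1], weights=probs)[0]
    reward = 1.0 if random.random() < true_reward[action] else 0.0
    # grad log pi(i | theta) for a softmax policy is 1[i == action] - pi(i)
    for i in range(2):
        grad_log = (1.0 if i == action else 0.0) - probs[i]
        theta[i] += alpha * reward * grad_log

print(softmax(theta))   # probability mass typically shifts toward the better arm
```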

Finally, Schmidhuber (2000) has thoroughly compared evolutionary computation with reinforcement learning. He states that the advantages of EC are that it (1) does not require quantizing an environment into discrete states or designing nonlinear function approximators for learning value functions, (2) does not depend on Markovian conditions or full observability, (3) can achieve hierarchical (and nonhierarchical) credit assignment by keeping successful hierarchical policies and structuring variability, and (4) is capable of metalearning (related closely to the evolution of evolvability) to improve its own credit assignment strategy. The disadvantage of EC is noise in assessing the fitness of a policy: unknown delays between actions and effects, and stochastic environments and policies, increase the time required to assess a policy because statistics must be carried out to ensure proper fitness assessment. Schmidhuber has avoided some of these disadvantages and has designed and tested a backtracking algorithm, the success story algorithm (SSA), which monitors “good” prior policies over appropriately adjusted time periods and replaces “bad” policies with previously successful ones (Schmidhuber, Zhao, & Schraudolph, 1997; Schmidhuber, Zhao, & Wiering, 1997; Schmidhuber, 1999). It is difficult to know how the SSA could be carried out neuronally in the absence of neuronal replication. This letter suggests that such algorithms should not be ruled out in the brain merely because they require neuronal replication and maintenance of a population of policies.

In short, evolutionary computation can contribute positively to reinforcement learning by acting as a phylogenetic method but also as an ontogenetic method for function approximation (Togelius et al., in press). We do not deny the possibility that other nonevolutionary reinforcement learning algorithms are implemented in the brain. However, we propose that evolutionary methods have been neglected in neuroscience because it was either assumed that replication of neuronal activity or neuronal circuitry could not take place in the brain, or that even if it could, it did not feature in a neuronally implemented evolutionary algorithm.

1.3.  Evolutionary Computation, Energy Minimization Approaches, and Intrinsic Value Functions.

Karl Friston has proposed that the brain attempts to minimize free energy, which is related to minimizing surprising exchanges with the environment (Friston & Stephan, 2007). We note that the notion of minimizing predictive error (i.e., surprise) has been considered previously with respect to adaptation (Atmar, 1994) as discussed in Fogel (2006). Friston and Stephan (2007) write, “In short, free-energy may be a useful surrogate for adaptive fitness in an evolutionary setting and the log-evidence in model selection. In short, within an organism's lifetime its parameters minimise free-energy, given the model implicit in its phenotype. At the superordinate level, the models themselves may be selected, enabling the population to explore model space and find optimal models. This exploration depends on the heritability of key model components, which could be viewed as priors about environmental niches the system can model” (p. 435).

The neuronal replicator hypothesis can readily use Friston and Stephan's (2007) notion of fitness as reduction of free energy. They write, “Adaptive fitness can be formulated in terms of free-energy, which allows one to link evolutionary and somatic timescales in terms of hierarchical co-evolution” (p. 419). However, instead of requiring the linkage between evolutionary and somatic timescales, the neuronal replicator hypothesis proposes that the link may exist within the brain itself. Evolutionary computation is well suited to evolving structural predictive models that are selected on their capacity to reduce surprise. A population of models is generated, and those that fail to suppress free energy are removed from the population. Those that succeed have their parameters replicated with mutation. For example, a set of suitable variational operators for Bayesian structural models has been proposed (Kemp & Tenenbaum, 2008), and these would be ideal for implementation in an evolutionary algorithm. The neuronal replicator hypothesis proposes that model optimization occurs in the brain by processes of model replication and selection and that this implements Bayesian model selection. There is no conflict between Friston and Stephan's (2007) proposal and ours; in fact, the NRH suggests that predictive models that minimize free energy could be optimized by neuronal evolutionary algorithms.

We note that the concept of intrinsic value functions was first described in Schmidhuber's papers on curious RL systems that get intrinsic reward for the progress of a separately learning predictor of the sensory inputs: the RL systems are motivated to select data that increase the first derivative of predictability (Schmidhuber, 1991, 2006; see an overview at http://www.idsia.ch/~juergen/interest.html). Recent work has taken up this basic idea (Oudeyer, Kaplan, & Hafner, 2007). We propose that the concept of neuronal evolutionary computation can be usefully integrated with this work on intrinsic value systems; for example, it is possible that values can themselves evolve in combination with strategies. In fact, such techniques have been used in coevolution of tests and models for building models of the environment (Bongard, Zykov, & Lipson, 2006; Bongard & Lipson, 2007).

The next section summarizes some general algorithmic advantages conferred by replication, followed by a demonstration of how rapid replication of patterns of neuronal activity can occur in spiking neural networks. Then comes a demonstration of the incorporation of a well-known neural process, Hebbian learning, into the replication process itself. This provides a surprising increase in the speed of evolution, thus demonstrating a tremendous synergy between neural and evolutionary processes that until now has been ignored.

1.4.  Some Algorithmic Advantages of Evolutionary Computation in Neuronal Systems.

If replication of individuals occurs with hereditary variation, then the natural selection algorithm operates. What is the difference between a population of replicators and a population of independent stochastic hill climbers? First, as mentioned previously, replication allows the efficient reassignment of search resources (neuronal representational space) to the currently fittest solutions in a population of solutions. Those parts of the neuronal network that are not contributing to success in the problem can be reconfigured by those that are.

Second, and most important for cognition, natural selection with populations of replicators has a sophisticated capacity to structure search (Pigliucci, 2008). Organismal units of evolution can bias the exploration of phenotype space to yield more favorable variants (Kirschner & Gerhart, 1998). This is known as the evolution of evolvability (Conrad, 1983; Altenberg, 1994; Wagner & Altenberg, 1996; Jones, Arnold, & Bürger, 2007) or facilitated variation (Kirschner & Gerhart, 2005). Such systems are capable of structuring their exploration distributions by the use of nontrivial neutrality; that is, there may be many genotypes capable of producing the same phenotype, yet these genotypes may differ in the phenotypic variants they produce (Toussaint, 2003). If this is so, then natural selection in populations can select for variability properties of genotypes as well as for properties that benefit the individual. Demonstrations of these capacities are available in the evolution of logical operations, for example (Parter et al., 2008).

Third, populations have the capacity for recombination. A neuronal implementation of recombination is presented that shows how it has the potential to escape from local minima as long as there is tight linkage (Watson, 2006), just as in genetic recombination.

Fourth, in the brain, as opposed to genetics, it is possible for structuring of exploration distributions to be undertaken explicitly by Hebbian learning of the weights responsible for copying an activity vector between one neuronal individual (activity vector) and another. An abstract neuronal implementation of the Hebbian learning replicator is presented: a novel algorithm that is distinct in mechanism from the capacity of populations of replicators to structure variability described by Toussaint and also distinct from the capacity of recombination to escape local optima. The algorithm works even without a population of solutions, but it can be parallelized. The idea of using Hebbian learning to learn the structure of local optima was discovered recently by Richard Watson and described in the domain of Hopfield networks, and we have applied it here to the case of neuronal replicators (Watson et al., 2009).

The next section describes a simple mechanism by which neuronal replication could occur.

1.5.  Mechanisms of Neuronal Replication.

Mechanisms of neuronal replication have been hypothesized previously. Paul Adams (1998) proposed quantal synaptic replicators in which mutations were noisy quantal Hebbian learning events where a synapse was made to contact an adjacent postsynaptic neuron rather than to enhance the connection to the current postsynaptic neuron. Previously, we confirmed that Hebbian learning by Oja's (1982) rule is isomorphic to Eigen's replicator equations, with synaptic strength being equivalent to the population density of a replicator (Eigen, 1971; Fernando & Szathmáry, 2009b). Also we demonstrated a mechanism for replication of patterns of synaptic connectivity between brain regions (Fernando, Karishma, & Szathmáry, 2008) with generation times on the order of minutes (the rate-limiting factor being the speed of structural plasticity of synapses; Butz, Worgotter, & Van Ooyen, 2009). The replication mechanism proposed in this letter can occur on the order of milliseconds, as shown in the simulations we present of bistable copying using Izhikevich neuron models (Izhikevich, 2003). Using these models, we show that reward gated sets of bistable neurons can sustain replication of neuronal activity patterns. This makes possible rapid evolutionary dynamics in the brain.

First, a model of two bistable neurons in a gated network is presented to show the conditions in which neuronal activity-pattern copying can take place. This is extended to two layers of units connected by a one-to-one topographic map, capable of evolving activity patterns using a (1+1) evolution strategy (Rechenberg, 1994; Beyer, 2001). It is the simplest system in which evolutionary dynamics can be exhibited, which is why it is used to demonstrate the principle here. Section 3 considers candidate mechanisms for implementing bistability, any of which could implement a neuronal replicator. Largely due to computational efficiency constraints, later models in the letter do not use the full neuronal simulation but use a probabilistic model of bistable neurons. Hebbian learning is added to this probabilistic model, which is then used to solve problems that contain interdependencies (Watson, 2006), producing rugged fitness landscapes known to be pathological for stochastic hill climbers (and for a (1+1) evolution strategy) alone. The same problem is then solved using a population of neuronal replicators with recombination.

2.  Results

2.1.  One Bit Copying in a Bistable Neuron Pair.

Although several implementations of a single bit-copying device are conceivable, the simplest unit capable of copying a persistent activity state is a pair of coupled bistable neurons. The neurons modeled are excitatory neurons of Izhikevich's bistable type (Izhikevich, 2006) and are bidirectionally coupled. The spiking model from Izhikevich (2003) is as follows. The membrane potential v and the recovery variable u are sufficient to describe the full dynamics of a wide range of spiking neurons observed in the brain:
$$\frac{dv}{dt} = 0.04v^2 + 5v + 140 - u + I \tag{2.1}$$

$$\frac{du}{dt} = a(bv - u) \tag{2.2}$$
A discontinuity in the dynamical system represents the firing of a spike. Resetting of the variables after a spike is as follows:
$$\text{if } v \geq +30 \text{ mV, then } v \leftarrow c, \quad u \leftarrow u + d \tag{2.3}$$
When v reaches +30 mV (the apex of the spike, not to be confused with the firing threshold), v and u are reset according to equation 2.3. The parameter values used here are a = 1, b = 1.5, c = −60, d = 0, and I = −66, where I is the mean input from external sources, with standard deviation σ = 2 mV. A coupled system of two neurons involves joining the equations for membrane potential as follows:
$$\frac{dv_i}{dt} = 0.04v_i^2 + 5v_i + 140 - u_i + I + \gamma_{ij}\, w_{ij}\, s_j(t - \delta) \tag{2.4}$$
Here γ_ij is a gating term that opens and closes the connection from neuron j to neuron i by taking binary values, w_ij is the weight from neuron j to neuron i, and s_j(t − δ) is 1 if neuron j spiked δ time units before the current time t and 0 otherwise; δ represents a conduction delay between the two neurons.

The dynamics of a single bistable neuron are shown in the phase portrait in Figure 2 (top).

Figure 2:

(Top) Phase portrait of an Izhikevich bistable neuron. A quiescent state is achieved if the state enters the basin of attraction of the single stable attractor (filled circle). Outside this basin of attraction, the neuron fires continuously. Note that either depolarization or hyperpolarization applied at the appropriate phase can bring the state within the basin of attraction of the stable point attractor; therefore, this neuron is a resonator, not an integrator (Izhikevich, 2007a). The two lines show the nullclines for v (parabolic) and u (straight). Note the unstable saddle point (empty circle). (Bottom) A: The minimal unit of activity replication consists of two reciprocally coupled bistable neurons (black circles), gated by two associated inhibitory neurons (gray circles). Figure 3 shows an experiment where copying of a single bit takes place using this circuit. B: Eight of these units can be chained together to form the experimental setup whose results are shown in Figure 5. The gates controlling the output of parent 0, for example, can be switched on and off together.

There are two fixed points: a stable attractor corresponding to a quiescent state and an unstable saddle point. Continuous oscillation occurs if the state remains outside the basin of attraction of the stable attractor. Figure 2A (bottom) shows the minimal circuit capable of copying neuronal activity. It consists of two reciprocally coupled excitatory bistable neurons (black circles), each associated with its own gating neuron (gray).

Figure 3 shows a typical simulated experiment (a variant of which we propose could be conducted in an in vitro system of real neurons) to demonstrate copying of bistable states using the circuit shown in Figure 2A (bottom). Initially both neurons (1 and 2) are off, and γ = 0 in both directions; neuromodulatory gating prevents the two neurons from interacting. Then neuron 1 is depolarized by an external input of 10 mV applied at an appropriate phase such that it starts firing. The gate from neuron 1 to neuron 2 is opened (its value set to 1) so that neuron 1 can influence neuron 2, causing neuron 2 to fire, after which the gate is closed (i.e., returned to 0). After this disconnection, a hyperpolarizing pulse of −10 mV is given to neuron 1 at the appropriate phase, turning it off and leaving neuron 2 as the only neuron that remains firing. This process shows how a firing (1) state can be copied to a neuron that initially had a nonfiring (0) state. A Mathematica file containing the simulation for producing Figure 3 is available in the Supplementary Material.
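The sketch below is a rough, self-contained rendering of this protocol in Python (the original simulations were done in Mathematica): Euler integration of equations 2.1 to 2.4 with the parameter values quoted above, a gate from neuron 1 to neuron 2 that is open only during the copy phase, and depolarizing and hyperpolarizing pulses at assumed times. Because the neuron is a resonator, pulse timing relative to the oscillation phase matters, so the schedule shown may need tuning before any particular run reproduces the clean copy of Figure 3.

```python
# Sketch of the gated two-neuron copy experiment (equations 2.1-2.4).
# Parameters a, b, c, d, I, sigma, and w are those quoted in the text and
# Figure 3 caption; the pulse times, delay, and pulse widths are assumptions.
import numpy as np

rng = np.random.default_rng(1)
a, b, c, d = 1.0, 1.5, -60.0, 0.0      # bistable-neuron parameters
I_MEAN, SIGMA = -66.0, 2.0             # mean external drive and its noise (mV)
W = 50.0                               # coupling weight from neuron 1 to neuron 2
DT = 0.1                               # Euler integration step (ms)
T_TOTAL = 1500.0                       # total simulated time (ms)
DELAY = 1.0                            # assumed conduction delay (ms)

v = np.array([-51.75, -51.75])         # start near the quiescent fixed point
u = b * v                              # recovery variables
last_spike = [-1e9, -1e9]
spike_times = [[], []]

def external_input(t, i):
    """Baseline noisy drive plus the scheduled pulses of the copy protocol."""
    drive = I_MEAN + SIGMA * rng.normal()       # sketch-level noise model
    if i == 0 and 100.0 <= t < 110.0:           # (a) depolarize neuron 1: switch on
        drive += 10.0
    if i == 0 and 900.0 <= t < 910.0:           # (d) hyperpolarize neuron 1: switch off
        drive -= 10.0
    return drive

def gate_1_to_2(t):
    """Gate gamma from neuron 1 to neuron 2, open only during the copy phase (b)."""
    return 1.0 if 400.0 <= t < 600.0 else 0.0

for step in range(int(T_TOTAL / DT)):
    t = step * DT
    for i in (0, 1):
        j = 1 - i
        gamma = gate_1_to_2(t) if i == 1 else 0.0
        # gamma * w * s_j(t - delta): s_j approximated as a 1 ms pulse (assumed width)
        # starting DELAY ms after a spike of the presynaptic neuron.
        s_j = 1.0 if DELAY <= (t - last_spike[j]) < DELAY + 1.0 else 0.0
        dv = (0.04 * v[i] ** 2 + 5.0 * v[i] + 140.0 - u[i]
              + external_input(t, i) + gamma * W * s_j)
        du = a * (b * v[i] - u[i])
        v[i] += DT * dv
        u[i] += DT * du
        if v[i] >= 30.0:               # spike apex reached: reset (equation 2.3)
            v[i], u[i] = c, u[i] + d
            last_spike[i] = t
            spike_times[i].append(t)

print("neuron 1 spike count:", len(spike_times[0]),
      "| neuron 2 spike count:", len(spike_times[1]))
```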

Figure 3:

Neuron 1 copies its state to neuron 2 (thick red = neuron 1, light blue = neuron 2). Initially both neurons are off. (a) Neuron 1 is switched on. (b) Neuron 1 is connected to neuron 2. (c) Neuron 1 is disconnected from neuron 2. (d) Neuron 1 is switched off, leaving neuron 2 on. Total duration shown = 1500 ms. w12=50.0 mV.

2.2.  Copying Vectors of Neuronal Activity.

Figure 2B shows a grouping of pairs of bistable neurons together to construct two layers coupled bidirectionally by a topographic map. In order to implement copying in this system, the current parental layer has neurons initialized randomly to firing (1) or nonfiring (0) states. The state of neurons in the offspring layer is reset (i.e., all neurons are turned to their off states). Neuromodulatory gates are opened for a brief period of time from the parental layer to the offspring layer, allowing the spiking neurons in the parental layer to switch on the corresponding neurons in the offspring layer. Activity gates between layers are then closed. The result is that the vector of activities in the parental layer has been copied to the offspring layer. Because pairs are independent, the dynamics are the same as in Figure 3 for each pair. The per site fidelity of copying is the same as the fidelity of copying in the minimal case.

A crucial factor in allowing unlimited heredity (Szathmáry & Maynard Smith, 1997) is the robustness of copying to noise. Various types of noise can affect fidelity. We consider noise in the form of random depolarization from external sources. Figure 4 shows that several kinds of error can result when noise is introduced. With this setting of I in particular, external noise often turns the neuron on, but it is much less likely to turn the neuron off by accident. The neuronal replicator hypothesis makes a testable prediction: if bistable neurons transmit information in the way proposed above, one would expect to find mechanisms that improve the fidelity of copying, gated by reward-controlled neuromodulatory neurons.

Figure 4:

The left three runs show copying of on-to-on states. The top run is without error, the middle run has neuron 1 accidentally being turned on after copying, and the bottom run has neuron 2 being accidentally turned on before copying. The right three runs show copying of off-to-off states. The top run is without error, the middle run has neuron 2 accidentally being turned on, and the bottom run has neuron 1 accidentally being turned on. Noise is gaussian at mean 0 and s.d. 2 mV. Noise can be added to the Mathematica file in the Supplementary Material by setting the NOISE variable.

Figure 5 shows several copying events in a (1+1) evolution strategy (ES) implemented using neuromodulatory gating controlled by reward. At time t, the two layers of bistable neurons both have their fitness assessed. The higher-fitness layer is defined as the parent, and the offspring layer has its activities reset (to the off state). The higher-fitness (parental) layer then copies its pattern of neuron activities to the newly defined offspring layer. The (1+1) ES is the simplest of selection algorithms. Using it, the desired vector of activities (in this case, an eight-bit array of alternating 1 and 0 states) was evolved within 25 generations. Mutation arose due to membrane potential noise. For the purposes of fitness assessment, a neuron was defined as in state 1 if more than two spikes were fired within the 250 ms time period of fitness assessment and off if fewer than two spikes were fired within this period. One generation of fitness assessment and replication took 750 ms. The Mathematica file containing the simulation for producing Figure 5 is found in the Supplementary Material.
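Abstracting away the spiking dynamics, the selection scheme just described reduces to the following loop (our simplified sketch): the alternating eight-bit target comes from the text, while the per-neuron copy-error rate standing in for membrane-potential noise is an assumed value.

```python
# Abstracted (1+1) evolution strategy over two gated layers: the fitter layer
# is copied, with per-neuron copy errors, over the less fit layer each generation.
import random

N = 8
TARGET = [1, 0, 1, 0, 1, 0, 1, 0]          # alternating pattern evolved in Figure 5
COPY_ERROR = 0.05                           # assumed per-neuron mutation rate

def fitness(layer):
    return sum(1 for a, t in zip(layer, TARGET) if a == t)

def copy_with_mutation(parent):
    return [b if random.random() > COPY_ERROR else 1 - b for b in parent]

layers = [[random.randint(0, 1) for _ in range(N)] for _ in range(2)]
for generation in range(100):
    parent = max(layers, key=fitness)        # fitness of both layers is assessed
    offspring = copy_with_mutation(parent)   # offspring layer is reset, then rewritten
    layers = [parent, offspring]
    if fitness(parent) == N:
        print("target reached at generation", generation)
        break
```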

Figure 5:

As shown in Figure 2B, neurons 1 to 8 are in layer 1, and neurons 9 to 16 are in layer 2. Neuron 1 projects to neuron 9, neuron 2 to neuron 10, and so on. The graphs show voltages, vi for each neuron i for the final 2 generations of an evolutionary run of 40 generations. Layer 1 is being copied to layer 2, with mutation. Because layer 2 cannot improve on layer 1, its neurons are reset every time, and copying is repeated from layer 1 to layer 2 three times. Note that the only mutation observed here is the mutation from neuron 2 to neuron 10 in the second copy operation shown. The Mathematica file in the Supplementary Material can be used to generate other evolutionary runs.

The next section considers a remarkable algorithmic advantage conferred by adding Hebbian learning to neuronal replicators.

2.3.  Combining Hebbian Learning with Neuronal Replication.

The neuronal model shows that activity copying can be implemented by gated spiking neurons. For reasons of computational efficiency, a probabilistic model of bistable neurons is used to simulate a larger neuronal system for many more generations than would be convenient using the full spiking model. A remarkable capacity for structuring variability in the copying event emerges if, instead of limiting between-layer connections to a topographic map, one starts with a strong one-to-one topographic map but allows all-to-all Hebbian connections to develop between layers once a local optimum has been reached by the (1+1) ES (Beyer, 2001). Once the local optimum is reached, the weights between ON-ON and OFF-OFF neuron pairs are increased, and weights between ON-OFF and OFF-ON neuron pairs are decreased. With the weights updated, the activity vectors are completely randomized (i.e., a random restart), and a new evolutionary run is started. The Hebbian learning that took place in previous evolutionary runs will therefore bias copying in future runs. An active neuron in the parental layer will not only tend to activate the corresponding one-to-one topographic neuron but will also tend to convert other neurons in the offspring layer into the states they occupied at previously found optima.

This technique, first developed in Hopfield networks (Watson et al., 2009), is simpler when adapted to bias the replication operation in a (1+1) ES, possibly making the neuronal replicator implementation the most plausible one for this algorithm. It adds Hebbian learning to the set of self-adapting evolution strategies (Beyer, 2001). Hebbian learning has been used for optimization previously; for example, O'Reilly and Munakata (2000), in the Leabra algorithm, combined Hebbian learning and the delta rule. However, it has not been used previously in this way to learn explicitly the structure of local optima.

A set of search problems exists that is particularly well suited to neuronal copying biased by Hebbian learning (Watson, 2006). These are problems in which there is interdependence between problem variables, that is, where the fitness contribution of one variable is contingent on the state of other variables and where there are structured dependencies that are potentially exploitable. An archetypical example is the hierarchical IF-and-only-IF problem (HIFF) (Watson et al., 1998) illustrated in Figure 6 and described by the following equations:
$$f(S) = \begin{cases} 1 & \text{if } |S| = 1 \\ |S|\, t(S) + \sum_{i=1}^{k} f(S_i) & \text{if } |S| > 1 \end{cases} \tag{2.5}$$

where s_i is the ith variable of the configuration, S_i is the ith disjoint subpartition of the variables, t(S) = 1 if all variables in S take the same value σ ∈ Σ, and 0 otherwise; where Σ is the discrete set of allowable values for the problem variables; and n = k^H, where H is the number of hierarchical levels in the system or subsystem, and k is the number of submodules per module. In HIFF we consider only binary variables (Σ = {0, 1}), and k = 2.
Figure 6:

An example of the HIFF problem. See the text for details and equation 2.5 describing how to determine fitness of a given genotype.

The lowest level of fitness contributions comes from looking at adjacent pairs in the vector and applying the transfer function and the fitness function. The transfer function is [0, 0] → 0, [1, 1] → 1, and all other pair types produce a NULL (N). The fitness function for each level sums the 0 and 1 entries. The second level is produced by applying the same transfer function to the output of the first transfer function. The fitness contribution of this next layer is again the number of 0s and 1s in this layer multiplied by 2. This goes on until there is only one highest-level fitness contribution. The fitness landscape arising from the HIFF problem is pathological for a hill climber, since there is a fractal landscape of local optima, which means that the problem requires exponential time to solve.
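A compact recursive rendering of this fitness calculation (our sketch, using the common weighting in which every internally consistent block contributes its size, consistent with the per-level doubling of contributions described above):

```python
# Recursive HIFF fitness: a block contributes its size if all its bits agree,
# and contributions are summed over every level of the binary hierarchy.
def hiff(bits):
    n = len(bits)
    if n == 1:
        return 1
    half = n // 2
    bonus = n if len(set(bits)) == 1 else 0   # all-0 or all-1 block earns |block|
    return bonus + hiff(bits[:half]) + hiff(bits[half:])

# Example: for an 8-bit string the two global optima are all 0s and all 1s.
print(hiff([1] * 8))                     # 32: every block at every level agrees
print(hiff([1, 1, 1, 1, 0, 0, 0, 0]))    # 24: only the top-level block disagrees
```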

The probabilistic model of bistable neuronal replicators is as follows. Neurons can be on (1) or off (−1) at each time step. The probability that a neuron is on is given as
$$P(1) = \frac{1}{1 + e^{-E_t}} \tag{2.6}$$

Correspondingly, P(−1) = 1 − P(1). The energy E_t of a neuron is given by
$$E_t = \sum_j \eta_j\, w_{ij}\, x_j \tag{2.7}$$
This is the weighted sum over the states x_j ∈ {−1, 1} of the input neurons to neuron i. Weights w_ij can take any positive or negative value. η_j is gating noise that is associated with all the output collaterals of a given presynaptic neuron j. Its values are drawn from a gaussian distribution with mean 1 and standard deviation 0.5. At each time step, a new gaussian random value is chosen. The system is initialized so that two layers of one-to-one topographically connected neurons exist at the outset, with neuronal weights set to 3.0 (or 4.0) on the diagonal of the weight matrix and 0 on all other connections. This gives an initial probability of ∼0.95 that a postsynaptic neuron will take value 1 (−1) given the corresponding presynaptic neuron taking value 1 (−1). It is necessary to use 1 and −1 rather than 0 and 1 in order to ensure symmetry in the Hebbian processes that increase and decrease weights. In practice, this could be implemented by defining an intermediate firing threshold where there was no weight change (which acts as a transformed origin) and where −1 corresponds to zero firing. The linear transform has various possible neuronal implementations. There are L = 128 neurons in each layer. The first layer is initialized randomly with a pattern of −1s and 1s. Copying to the second layer involves calculating P(1)_{i,t} for every neuron i in layer 2 and then choosing, by roulette wheel selection, whether each neuron is to be on or off given P(1)_{i,t}. We have not experimented with other kinds of selection, although many exist. Once the copy operation is complete, both layers are assessed for fitness, and the fitter layer is again copied to the less fit layer after the less fit layer is reset to all off (−1). This algorithm implements hill climbing for 2000 generations (copying events), after which a stable local optimum has typically been reached in the HIFF problem. At this stage, Hebbian learning is used to modify the weight matrices from layer 0 to layer 1 and from layer 1 to layer 0 as follows:
$$w_{ij} \leftarrow w_{ij} + \lambda\, x_i x_j \tag{2.8}$$

where x_i and x_j are the states (±1) of the postsynaptic and presynaptic neurons at the local optimum.
The learning rate λ = 0.000003 for the L = 128 case. Figure 7 shows performance with the above parameters in the L=128 case. Each run consists of 40,000 activity randomization events (restarts), with 2000 generations between each randomization event. As a control, the same experiment was conducted without Hebbian learning and without (and with) resetting of activities every 2000 generations. Hill climbing alone was never able to achieve the full solution to the 128 HIFF problem in the time permitted. Note that the results with resetting every 2000 generations are very noisy because of the initial setting of one-to-one weights to 3, resulting in a 5% error rate in copying per site. Only when Hebbian learning is permitted can the error rate decrease.
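To make equations 2.6 to 2.8 concrete, here is a scaled-down sketch of the procedure (ours, not the authors' code). The two layers are abstracted into a current pattern and its noisy copy; the problem size, learning rate, restart count, and generations per restart are reduced for illustration; only the values explicitly quoted in the text (diagonal weights of 3.0 and gating noise drawn from a gaussian with mean 1 and standard deviation 0.5) are taken from the original.

```python
# Probabilistic bistable replicators with Hebbian structuring on a small HIFF.
import numpy as np

rng = np.random.default_rng(0)
N = 16                                  # neurons per layer (the text uses L = 128)
LAMBDA = 0.001                          # Hebbian learning rate (assumed for N = 16)
RESTARTS = 200                          # activity randomizations (40,000 in the text)
GENERATIONS = 500                       # copy events per restart (2,000 in the text)

def hiff(bits):
    """Recursive HIFF fitness: a uniform block contributes its size at every level."""
    n = len(bits)
    if n == 1:
        return 1
    half = n // 2
    bonus = n if len(set(bits)) == 1 else 0
    return bonus + hiff(bits[:half]) + hiff(bits[half:])

def fitness(state):
    return hiff(tuple(1 if s > 0 else 0 for s in state))

def copy_with_weights(parent, W):
    """Equations 2.6 and 2.7: sample the offspring layer from the gated,
    weighted input delivered by the parent layer."""
    eta = rng.normal(1.0, 0.5, size=N)          # gating noise per presynaptic neuron
    E = W @ (eta * parent)                      # weighted sum of gated inputs
    p_on = 1.0 / (1.0 + np.exp(-E))
    return np.where(rng.random(N) < p_on, 1.0, -1.0)

W = 3.0 * np.eye(N)                     # initial one-to-one topographic map
best_ever = 0
for restart in range(RESTARTS):
    state = rng.choice([-1.0, 1.0], size=N)     # random restart of activities
    for generation in range(GENERATIONS):       # (1+1)-style climb to a local optimum
        child = copy_with_weights(state, W)
        if fitness(child) >= fitness(state):
            state = child
    best_ever = max(best_ever, fitness(state))
    # Equation 2.8: Hebbian update learns the structure of the local optimum found.
    W += LAMBDA * np.outer(state, state)

print("best fitness found:", best_ever, "out of", hiff((1,) * N))
```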
Figure 7:

(1a, 1b, 1c) Probabilistic bistable replicators applied to solve the HIFF problem using Hebbian learning to structure exploration distributions. The top set of graphs (1a, 1b, 1c) shows that Hebbian learning permits the full 128 HIFF problem to be solved. The middle set of graphs (2a, 2b, 2c) shows that without Hebbian learning (without activity resetting every 2000 generations), only a local optimum is reached. Similarly the bottom set of graphs (3a, 3b, 3c) shows that without Hebbian learning but with activity resetting every 2000 generations, only a local optimum is reached. The output solutions are shown (every 100 generations) (1c, 2c, 3c). With Hebbian learning the all-1 and all-0 solutions are discovered (1c), whereas without Hebbian learning, the system remains stuck on a local optimum that was found early on in the run (without setting) and never finds the full solution (2c) even with a resetting of activities (3c). The weight matrix with Hebbian learning shows a hierarchical structure (1b), whereas the weight matrix without Hebbian learning only is a one-to-one topographic map (2b, 3b).

Figure 8A shows the performance of Hebbian replicators on the N = 32 HIFF problem for various Hebbian learning rates and magnitudes of gaussian output gating noise. Output gating modulates all the weights out of a neuron by the same gaussian random number.

Figure 8:

(A, top) Each graph shows the best fitness obtained over 1000 generations, averaged over 100 independent runs for each parameter setting. Gaussian postsynaptic gating noise was set at mean 0 and the standard deviation shown at the top of each graph. The dashed line in each graph shows the result of running the (1+1) ES with 1000 resetting events and no Hebbian learning. Hebbian learning with learning rate = 0.001 improved the rate at which the correct solution was found for all values of noise. If the learning rate is set too high, the system performs worse than hill climbing with random restarts because the (1+1) ES gets stuck on a local optimum (see the thick orange lines showing LR = 0.1 and LR = 1.0). Normally a (1+1) ES cannot get stuck forever on a local optimum, because of gaussian variation; however, because Hebbian learning keeps reinforcing the tendency to return to previously visited local optima, variability becomes severely limited. (B, bottom) The same as above, except with gaussian input noise at each synapse to a neuron. Performance is slightly worse than with correlated noise.

For low learning rates (e.g., 0.001), Hebbian learning increases the rate at which the global optimum (fitness = 192) is discovered. This is true at all output gating noise levels, but most pronounced for high levels of output gating noise. The type of noise used makes some difference, as shown in Figure 8B. Here, instead of output gating noise, noise is applied at each input; it is described by an independent random variable assigned to every synapse into a neuron's dendritic tree. Performance is slightly impaired compared to using output noise, but there is still a benefit to adding Hebbian learning.
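
The two noise regimes compared in Figure 8 differ only in where the gaussian factor is applied. The short sketch below illustrates the distinction, assuming the multiplicative mean-1 reading of the gating noise used for the copy circuit above (the Figure 8 caption quotes a mean of 0, so the exact parameterization there may differ).

```python
import numpy as np
rng = np.random.default_rng(1)

def noisy_drive(W, activity, sd, correlated=True):
    """Summed synaptic drive to each postsynaptic neuron under the two noise models."""
    if correlated:
        # Output gating noise: one gaussian factor per presynaptic neuron,
        # shared by all of that neuron's outgoing weights (as in Figure 8A).
        gate = rng.normal(1.0, sd, size=W.shape[0])
        return W.T @ (gate * activity)
    # Input noise: an independent gaussian factor at every synapse (as in Figure 8B).
    noise = rng.normal(1.0, sd, size=W.shape)
    return (W * noise).T @ activity
```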

The efficacy of Hebbian learning is clearly shown in the 64-bit HIFF case (see Figure 9, which shows the mean maximum fitness obtained during 100 independent runs with and without Hebbian learning). Too high a Hebbian learning rate (LR = 0.0008) results in the system getting stuck in a local optimum; just the right amount allows significant improvement compared with no Hebbian learning.

Figure 9:

HIFF 64 is solved much faster with Hebbian learning rate = 0.0002 than with no Hebbian learning. If too high a learning rate is used (0.0008), the system inevitably gets stuck at an intermediate local optimum. The error bars show +/− the standard error of the mean over 100 independent experiments for each learning rate.

There has been no demonstration that an equivalent neuronal architecture with Hebbian learning but without replication would take longer to solve the HIFF problem than one with replication. First, no such alternative neuronal architecture, except possibly the Hopfield network described by Watson et al. (2009), is known to be effective on this problem. Second, the capacity to add features to evolutionary algorithms that improve their search power is a merit of the neuronal replicator hypothesis rather than a disadvantage. For example, consider another modification, recombination. In the next section, we show how recombination can improve the search capability of evolutionary algorithms implemented with bistable neuronal elements.

2.4.  Solving HIFF Using Recombination of Neuronal Replicators.

The requirement for parallelization leads naturally to recombination in populations of neuronal replicators. It has been demonstrated previously that a population of replicators undergoing recombination can solve the 128-bit HIFF problem as long as there is tight linkage, that is, if the interdependencies of the problem correspond to the layout of the chromosome such that functionally interdependent units tend to be crossed over together (Watson, 2006). Figure 10 shows how recombination is implemented in neuronal replicators, using the same kinds of operation as proposed for the (1+1) ES.

Figure 10:

The neuronal recombination circuit is an extension of the simple replication circuit shown in Figure 2B. Here, instead of a one-to-one reciprocal circuit, there is a many-to-one reciprocal circuit, and instead of a gating vector there is a gating matrix. The top diagram shows the minimal unit from which the system is composed: a population of size 3 with genome length = 1. Gating determines which of the parental states is copied to the offspring. The bottom diagram shows five of these units combined, forming three parents with genome length = 5. Recombination is undertaken simply by opening, for example, gates A1, B1, C1 and D2, E2. Once the offspring has been formed and its fitness assessed, it is recopied back to the parent that most resembles it.

The simulated system implements one-point crossover with deterministic crowding to maintain diversity (Mahfoud, 1995). Population size is 1000. Activity vectors are randomly initialized. Only fixed-weight topographic mapping is used (i.e., no Hebbian weight change), with each weight set to 10.0 multiplied by gaussian noise (mean 1, standard deviation 0.01). At each step, two parents are chosen randomly from the population. Seventy percent of the time, a crossover site is also chosen randomly and determines which gates are opened: to the left of the crossover site, gates are opened from parent 1 to the offspring, and to the right of the crossover site, gates are opened from parent 2 to the offspring. The other 30% of the time, parent 1 is copied in total (with mutation) to the offspring. The fitness of the parent closest to the offspring in Hamming distance is calculated, along with the fitness of the offspring. A neural mechanism such as that shown in Figure 14 is capable of learning to determine the Hamming distance between the parent and the offspring neuronal replicators. If the fitness of the offspring is greater than that of the parent, the parental state is reset and overwritten by the offspring by copying in the standard way (without crossover). This process iterates; each generation thus involves one recombination event and perhaps one copying event.
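
The recombination-with-crowding loop can be summarized in a few lines of code. The sketch below is illustrative rather than the authors' implementation: it abstracts the gated copying circuit to a per-site mutation probability and uses made-up values for the population size, mutation rate, and number of events, while the crossover and deterministic crowding steps follow the description above.

```python
import numpy as np
rng = np.random.default_rng(2)

def hiff(b):
    """Hierarchical-if-and-only-if fitness."""
    if len(b) == 1:
        return 1
    h = len(b) // 2
    return (len(b) if np.all(b == b[0]) else 0) + hiff(b[:h]) + hiff(b[h:])

L, POP, P_CROSS, P_MUT = 32, 100, 0.7, 0.05        # illustrative sizes and rates
pop = rng.choice([-1, 1], size=(POP, L))           # randomly initialized activity vectors

def noisy_copy(bits):
    """Copying through the gated circuit occasionally flips a site (mutation)."""
    flip = rng.random(L) < P_MUT
    return np.where(flip, -bits, bits)

for event in range(50000):
    i, j = rng.choice(POP, size=2, replace=False)  # pick two parents at random
    p1, p2 = pop[i], pop[j]
    if rng.random() < P_CROSS:
        cut = rng.integers(1, L)                   # one-point crossover: gates left of the cut copy parent 1,
        child = noisy_copy(np.concatenate([p1[:cut], p2[cut:]]))   # gates right of it copy parent 2
    else:
        child = noisy_copy(p1)                     # otherwise parent 1 is copied in total, with mutation
    # Deterministic crowding: the offspring competes with the parent it most resembles.
    target = i if np.sum(child != p1) <= np.sum(child != p2) else j
    if hiff(child) > hiff(pop[target]):
        pop[target] = child                        # overwrite the nearer parent only if the offspring is fitter

print(max(hiff(ind) for ind in pop))
```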

Figure 11 shows a typical run on the 128-bit HIFF. With one-point crossover and deterministic crowding (top), the solution is found rapidly, whereas without recombination (mutation only, middle), the solution is never found. To determine whether recombination is really binding diverse solutions together to better effect, a comparison is made with the “headless chicken crossover” described in Fogel (2006) (see Figure 11, bottom). This involves exactly the same algorithm as recombination except that one of the chosen parental genomes is replaced with a completely random new solution; it controls for the possibility that recombination acts merely as a macromutation operator. The headless chicken operator does not find the solution. Finally, we note that crowding is a useful trick for maintaining diversity, but it requires the calculation of Hamming distance; a later section shows how neuronal circuits can be trained to calculate Hamming distance.

Figure 11:

(Top) With recombination, the 128-bit HIFF problem is solved in less than 500,000 recombination events. (Middle) Without recombination, the 128-bit HIFF problem is not solved; note the different x-axis scale (replication events) from the top panel. (Bottom) With the headless chicken operator, the problem is not solved. The light gray line shows the sliding average fitness, and the dark gray dots show the fitness of the parent assessed at each generation.

2.5.  Learning of Neuronal Circuits Capable of Activity Pattern Replication.

How can the very regular connectivity and the strong, tightly controlled gating that the above mechanisms presuppose be reconciled with the seemingly haphazard connectivity of neurons that is observed? How can activity pattern replication arise in an initially randomly connected network that is not originally capable of sustaining replication? How can a system of neuronal information transmission be transformed from one capable only of attractor-based heredity into one capable of limited heredity and finally unlimited heredity (Szathmáry & Maynard Smith, 1997)? There are two possibilities: activity-independent and activity-dependent mechanisms. This is a special case of a more general question: What capacity does the brain have to modify its structure on the basis of experience and reward so as to implement an adaptive neuronal circuit (Holtmaat & Svoboda, 2009)? And how can it be ensured that two identical genotypes (neuronal activity vectors) in two locations produce identical phenotypes that are assigned the same fitness (see Figure 12)?

Figure 12:

Determining the replication matrix C is necessary if neuronal genotypes G1 and G2 are to have the same phenotypes after transformation through weight matrices M1 and M2. This can be achieved by reward-biased stochastic hill climbing in the connectivity space of C, to discover the appropriate mapping for a copy operation.

The above models assumed a perfect topographic one-to-one map between the two layers. They also assumed the existence of a mechanism to read out each genotype, directly calculate the Hamming distance between that genotype and the desired activity vector, and return a fitness value. Let us relax the assumption that the readout element knows the ordering of the loci along the two genotypes. Figure 12 shows that this is identical to the situation where the mapping from a genotype to its phenotype is not one-to-one but a random feedforward network M. Each genotype i maps to its phenotype through a different random feedforward network M_i. If this is the case, then topographic copying between genotypes will result in different phenotypes arising from the same genotype depending on the location of that genotype in the population, and natural selection will be impossible because there will be no covariance between neurogenetic states and fitness. It is easy to see that the replication matrix from genotype 1 to genotype 2 must in fact be the product of the inverse of M_2 and M_1, that is, M_C = M_2^{-1} M_1 (see Figure 12). How could a feedforward network M_C self-organize in a realistic neuronal system? Although there exist recurrent neural network algorithms to calculate the inverse of matrices, they use nonlocal learning rules (Wang, 1993). Here we demonstrate that structural plasticity makes it possible to discover the appropriate copying matrix.
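
The relation M_C = M_2^{-1} M_1 can be checked numerically. The sketch below assumes random dense invertible maps for M_1 and M_2 purely for illustration.

```python
import numpy as np
rng = np.random.default_rng(3)

N = 8
M1 = rng.normal(size=(N, N))          # genotype-1 -> phenotype map
M2 = rng.normal(size=(N, N))          # genotype-2 -> phenotype map (assumed invertible)
Mc = np.linalg.inv(M2) @ M1           # copying matrix from genotype 1 to genotype 2

g1 = rng.choice([-1.0, 1.0], size=N)  # genotype 1 (a +/-1 activity vector)
g2 = Mc @ g1                          # genotype 2 produced by the copy operation

# Both genotypes now yield the same phenotype, so fitness assignment is consistent.
print(np.allclose(M1 @ g1, M2 @ g2))  # True
```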

Given that M_2 is invertible, the appropriate M_C matrix can be found by stochastic hill climbing in the space of neural structures, guided by a signal of the similarity between the parent and child activity vectors. We present a simple model of Bonhoeffer-type rapid synaptic remodeling, whereby a weak connection can form (or break) between previously disconnected (or connected) neurons within a few seconds (Hofer, Mrsic-Flogel, Bonhoeffer, & Hubener, 2009). If, following a structural change, the resulting reward rises above a time-windowed reward average, then that connectivity change persists; otherwise, it is lost. A more sophisticated temporal difference method could potentially improve performance. Once a connection exists, quantal weight mutations are permitted (Adams, 1998), which likewise revert under conditions of less-than-average reward. Similarly, neuron biases undergo mutation and reward-controlled reversion. These structural operations allow rapid restructuring of a sparsely connected neuronal network.

The regime for constructing M_C is to choose at random a set of N bistable input neurons and N bistable output neurons. The network is initially randomly connected. The N bistable input neurons are initialized with a random pattern of activity, and this activity is sustained over a time period T. After period T, the activity of the N output neurons is measured, and the Hamming distance between the input and output arrays is calculated. The reward is defined as N minus the Hamming distance. Ten thousand input activity patterns are tested, and the reward, given by the similarity of the output activities to the input activities, is averaged over all these presentations. After a particular connectivity pattern has sustained 10,000 activity pattern inputs and the subsequent dynamics, each of which has received a reward assessment, a random structural change is made to the network, and the average reward is calculated over another 10,000 input patterns. An episode is defined as such a set of reward assessments, after which there is a structural modification. For the first 10 episodes, the structural change is always reset, and a time-averaged reward is calculated over these 10 episodes. After 10 episodes, the first structural change to produce a reward greater than this mean reward is accepted. If a change is accepted, then the mean reward is again calculated over 10 episodes before another permanent structural change can be accepted. This protocol ensures that, on average, harmful structural changes are not accepted, because a change is accepted only if the reward is above average.

The dynamics of the network are similar to those of the network in Figure 2, with some minor differences. The neurons are modeled as binary elements that take values 0 or 1. The probability of neuron j being ON, P_j(1), is determined by the sum over inputs of the product of input activities a_i and weights p_ji, plus the bias b_j of the neuron, passed through a sigmoid threshold function:
P_j(1) = 1 / (1 + exp(−(Σ_i p_ji a_i + b_j)))    (2.9)
The simulation is discrete time and synchronously updated. The period between initialization of the input neurons and readout of the output neurons is one time step.
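
A compact sketch of this regime follows, under several simplifying assumptions: far fewer patterns per episode than the 10,000 used above, acceptance against the best reward found so far rather than the 10-episode time-averaged baseline, no bias mutations, and an assumed gaussian form for the weight changes.

```python
import numpy as np
rng = np.random.default_rng(4)

S = 8                                   # total neurons in the randomly connected network
inputs, outputs = [4, 5, 6], [1, 2, 3]  # a 3-bit copy problem, as in Figure 13
N = len(inputs)
PATTERNS = 200                          # patterns per episode (10,000 in the text)

def avg_reward(W, b):
    """Mean reward (N minus Hamming distance) over a batch of random input patterns, using eq. 2.9."""
    total = 0.0
    for _ in range(PATTERNS):
        a = np.zeros(S)
        a[0] = 1.0                                          # neuron 0 plays the role of the unit fixed at 1
        target = rng.integers(0, 2, size=N).astype(float)
        a[inputs] = target                                  # clamp a random pattern onto the input neurons
        p_on = 1.0 / (1.0 + np.exp(-(W @ a + b)))           # equation 2.9
        out = (rng.random(S) < p_on).astype(float)[outputs] # one synchronous step, then read the outputs
        total += N - np.sum(out != target)
    return total / PATTERNS

# Sparse random initial connectivity, then reward-biased structural hill climbing.
W = np.where(rng.random((S, S)) < 0.1, rng.normal(0.0, 1.0, (S, S)), 0.0)
b = np.zeros(S)
best = avg_reward(W, b)
for episode in range(2000):
    W_try = W.copy()
    i, j = rng.integers(0, S, size=2)
    if W_try[i, j] == 0.0:
        W_try[i, j] = rng.normal(0.0, 1.0)     # form a new connection
    else:
        W_try[i, j] += rng.normal(0.0, 0.5)    # mutate an existing weight
    r = avg_reward(W_try, b)
    if r > best:                               # keep the structural change only if it improves reward
        W, best = W_try, r

print(best)                                    # best average reward found (maximum possible is N = 3)
```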

Figure 13 shows the reorganization of the connectivity that occurs to allow a 3-bit copying operation. Larger copying circuits can be produced by independently learning small copying circuits and joining them. This investigation suggests that it is possible to use stochastic hill climbing guided by reward to self-organize a system capable of high-fidelity replication of a pattern of neuronal activity. An important property that allows search in the space of connectivity is the existence of two timescales—many activity patterns can be assessed for the same structural connectivity pattern.

Figure 13:

Within 60,000 episodes (i.e., 60,000 structural weight changes), high-fidelity copying of a 3-bit activity vector is obtained. Initial connectivity = 10%. Probability of a weight change mutation = 1/(0.1, where S is the size of the network. Probability of a connectivity mutation (removal/addition) = 1/S^2. Minimum weight before removal of a connection = 0.01. Maximum weight = 5.0. (Top left) Reward versus generations/10. (Top right) Weights versus generations/10. (Bottom left) Biases versus generations/10. (Bottom right) Final weight matrix: blue = negative, red = positive, white = no connection. Input neurons = 5, 6, 7. Output neurons = 2, 3, 4. Neuron 1 is fixed at 1.

2.6.  Calculation of Hamming Distance Between Two Neuronal Vectors.

The calculation of similarity between two neuronal patterns of activity is an operation that may be of use in several algorithms. In developing the template replication circuit above, a signal of similarity between two neuronal activity patterns was required to act as reward, and implementing diversity maintenance in the recombination circuit required the calculation of Hamming distance. But where can the initial reward signal come from that allows selection for a circuit capable of calculating Hamming distance? One possibility is that an extrinsic reward signal is maximized when the similarity between the two inputs is greatest. Another is that an experience-independent (intrinsic) reward process has evolved to reward similarity. We acknowledge that other measures of distance may be more suitable in this and other cases; we use Hamming distance here for simplicity.

Using the same structural plasticity algorithm as above, a circuit is learned that is capable of calculating the Hamming distance between two input vectors. For vectors of length N, this is achieved simply by independently evolving N XOR gates. Each XOR gate fires 1 for inputs [1, 0] or [0, 1] (where the two loci differ) and fires 0 for inputs [0, 0] or [1, 1] (where the two loci are the same). Summing the outputs of the N XOR gates gives the Hamming distance. Figure 14 shows that an XOR gate can be evolved using the same method of structural plasticity as above. Ten thousand XOR patterns are input per structural modification. Reward is 1 if the XOR output is correct, and 1/e otherwise, where e is the difference between the desired and actual Hamming distance.
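
The target computation is easy to state even though the circuit itself is what structural plasticity must discover. A minimal statement of the input-output behaviour being rewarded:

```python
def xor_gate(a, b):
    """Target behaviour of each evolved unit: 1 when the two loci differ, 0 when they agree."""
    return int(a != b)

def hamming_distance(u, v):
    """Hamming distance as the summed output of N independent XOR gates."""
    return sum(xor_gate(a, b) for a, b in zip(u, v))

print(hamming_distance([1, 0, 1, 1], [1, 1, 1, 0]))   # prints 2
```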

Figure 14:

Structural plasticity used to undertake reward-biased stochastic hill climbing to produce an XOR circuit. N XOR circuits produced in parallel implement a Hamming distance circuit. (Top right) Weights versus generations/10. (Bottom left) Biases versus generations/10. (Bottom right) Final weight matrix: blue = negative, red = positive, white = no connection. Input neurons = 3,4. Output neuron = 2. Neuron 1 is fixed at 1.

The kind of problem solved above is a hard combinatorial search problem, requiring a large number of slow generate-and-test operations (Chklovskii, Mel, & Svoboda, 2004). Critically, “whether the brain has evolved the machinery to cope with these ‘algorithmic’ challenges remains an open question” (Chklovskii et al., 2004). The algorithm presented is limited to small networks; in larger networks, random formation and removal of connections guided by reward alone is much too slow. Methods have been devised to bias random connectivity change; for example, new synapses can be drawn from a “prescreened candidate pool” (Poirazi & Mel, 2001). Diffusible factors or electric fields are also thought to be able to bias the probability of new connections forming.

In summary, the neuronal replicator hypothesis proposes that intrinsic reward functions are capable of molding networks by structural plasticity to construct circuits that can undertake more sophisticated search algorithms acting on neuronal activity rather than neural connectivity. Algorithms that search in the space of activity patterns instead of connectivity patterns can be much faster, and so there would be an adaptive advantage for evolution to have evolved such algorithms in the brain.

2.7.  Replication of Actor-Critic Devices Prevents Catastrophic Forgetting.

So far there has been no demonstration of the utility of neuronal replication in a behavioral task. Here, an important role for neuronal replication is demonstrated in a simple simulated robotic learning task. The robustness of a temporal difference reinforcement learning (RL) algorithm to nonstationary perturbations is compared with and without neuronal replication. It is shown that replication of actor-critic controllers allows a solution to the stability-plasticity dilemma. We demonstrate that learning rate and robustness can be increased if multiple copies of an actor-critic controller can be made online, stored, and retrieved. Catastrophic forgetting in a nonstationary and nonlinear environment can be prevented by regular copying of procedural memories into a long-term memory store. The best controllers can be retrieved when the current controller is determined to be functioning poorly.

The stability-plasticity dilemma (SPD) refers to the simultaneous requirement for rapid learning and stable memory. Too much plasticity can result in catastrophic interference during sequence learning; for example, in neural networks, degradation of old patterns may occur as new patterns are stored. Several solutions have been proposed, such as ART (Carpenter & Grossberg, 1988), often aiming to balance plasticity and stability at the same synapse (Abraham & Bear, 1996; Abraham & Robins, 2005). Also, the growth of new neurons has been proposed to prevent catastrophic interference of new memories with existing traces of older memories (Becker, 2005; Wiskott, Rasch, & Kempermann, 2006). Here we propose a simpler solution to the SPD based on copying of patterns of weights from a rapidly changing neuronal substrate to a slowly changing neuronal store. The previously proposed mechanism for connectivity copying shows one way in which patterns of synaptic connectivity could be copied (Fernando et al., 2008; Fernando & Szathmáry, 2009a, 2009b). The next section demonstrates how genotypic neuronal activity can exhibit a connectivity phenotype.

A simple phototaxis task for a Khepera robot is simulated. For the underlying robot controller, an actor-critic reinforcement learner is chosen because it has been proposed that such a controller may exist in the brain (Barto, 1995; Schultz, 1998; Wörgötter & Porr, 2005) and because neural network implementations already exist (Houk, Adams, & Barto, 1995; Suri & Schultz, 1999). The Webots simulator (Michel, 2004) simulates a standard Khepera robot with eight infrared distance sensors and two light sensors. The light source is moved every 10 minutes to a random (x, y) location in the arena. The world consists of a square arena containing two obstacles and another robot controlled by a simple Braitenberg vehicle architecture. This constitutes a nonstationary and nonlinear environment for reinforcement learning. In addition, perturbations to the robot are made manually, and external controllers are used to disturb the actor-critic controller. The Webots code for the simulation is available in the Supplementary Material. Figure 15 shows the overall arena, which is darkened apart from the single light source.

Figure 15:

Robotic arena containing two Khepera robots and a moving light source.

A single actor-critic controller is described first. Sensory inputs from eight infrared distance sensors (D1−8) and two light sensors (L1 and L2) are fed to the actor and critic networks, each normalized to a value between 0 and 1. The algorithm is shown below.

At each time step, the external reward r(t) is calculated as the sum of the light sensor values minus the total distance sensor value scaled by a constant φ (= 20). The normalized sensor values and a fixed bias input of 1 serve as inputs x_i(t) to the actor and the critic at each time step. The critic's prediction p(t) of the eventual reinforcement at time t is obtained by passing the inputs through the critic weights v_i(t). The final critic output is the effective reward signal (the temporal difference error), which is the actual reinforcement r(t) plus the discounted current prediction minus the previously predicted reward. The critic weights v_i are then updated according to the product of the effective reward and the eligibility trace of each input, scaled by a learning rate. Each eligibility trace is updated with the new value of x_i(t) while maintaining a proportion of its previous value. Next, the motor output is determined by passing the inputs through the actor's weight matrix w_ji, with added gaussian noise of mean 0 and SD = 0.5; this value is passed through a threshold function that outputs +1 if its argument is >= 0 and −1 otherwise. The actor's weights are then updated at a learning rate α (= 0.005). Finally, the eligibility traces of the actor are updated according to how much each input contributed to the actual outputs produced. The motor command is a constant S (= 5) times the output neuron value y_i.

Actor-Critic Algorithm

1. Initialize actor and critic networks with weights drawn uniformly from −1 to 1.

For T time steps:

   Get external reward r(t) (computed from the light and distance sensors as described above).

   Set inputs: x_i(t) = normalized sensor values; bias input = 1.

   Critic

      Get the critic's prediction p(t):
         p(t) = Σ_i v_i(t) x_i(t)
      Get the effective reward r̂(t):
         r̂(t) = r(t) + γ p(t) − p(t − 1)
      Update the critic weights:
         v_i(t + 1) = v_i(t) + β r̂(t) ē_i(t)
      Update the critic eligibility traces:
         ē_i(t + 1) = λ ē_i(t) + (1 − λ) x_i(t)

   Actor

      Get action y_j(t):
         y_j(t) = f(Σ_i w_ji(t) x_i(t) + noise), where f(·) = +1 if its argument is >= 0 and −1 otherwise
      Update the actor weights:
         w_ji(t + 1) = w_ji(t) + α r̂(t) e_ji(t)
      Update the actor eligibility traces:
         e_ji(t + 1) = δ e_ji(t) + (1 − δ) y_j(t) x_i(t)

End For

Here γ is the discount factor, β is the critic learning rate, and λ and δ are the decay parameters of the critic and actor eligibility traces.
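
The listing translates almost line for line into code. The sketch below is one reading of it rather than the authors' implementation: the critic learning rate, discount factor, and trace-decay constants carry illustrative values (beta, gamma, trace_decay), and the external reward is supplied by the caller instead of being computed from simulated sensors.

```python
import numpy as np
rng = np.random.default_rng(5)

N_IN, N_OUT = 11, 2                     # 8 distance sensors + 2 light sensors + bias; 2 motor outputs
alpha = 0.005                           # actor learning rate (from the text)
beta, gamma, trace_decay = 0.01, 0.95, 0.9   # placeholder values; symbols not given in the text
S = 5                                   # motor scaling constant (from the text)

v = rng.uniform(-1, 1, N_IN)            # critic weights
w = rng.uniform(-1, 1, (N_OUT, N_IN))   # actor weights
e_critic = np.zeros(N_IN)               # critic eligibility traces
e_actor = np.zeros((N_OUT, N_IN))       # actor eligibility traces
prev_p = 0.0                            # previous prediction p(t-1)

def actor_critic_step(x, r):
    """One time step: x is the normalized sensor vector (with bias), r the external reward."""
    global v, w, e_critic, e_actor, prev_p
    p = v @ x                                              # critic's prediction of reinforcement
    td = r + gamma * p - prev_p                            # effective reward (TD error)
    v = v + beta * td * e_critic                           # critic weight update
    e_critic = trace_decay * e_critic + (1 - trace_decay) * x
    y = np.where(w @ x + rng.normal(0, 0.5, N_OUT) >= 0, 1.0, -1.0)   # noisy, thresholded actions
    w = w + alpha * td * e_actor                           # actor weight update
    e_actor = trace_decay * e_actor + (1 - trace_decay) * np.outer(y, x)
    prev_p = p
    return S * y                                           # motor commands

x = np.append(rng.random(N_IN - 1), 1.0)                   # a fake normalized sensor reading plus bias
print(actor_critic_step(x, r=0.3))
```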

To implement multiple controllers capable of replication, the following modifications are made. The fitness of an actor-critic is defined as a leaky integral of reward with a first-order decay rate of 0.00001; this smooths the reward over many epochs (light moves). Initial fitness is zero for each actor-critic. Every 5 minutes, the active actor-critic is copied to a long-term store regardless of its fitness (storage). All the parameters that define the actor-critic are copied, except for the eligibility traces, which are set to zero. The long-term memory store is assumed to be finite, of size M (= 100), and it is the least fit actor-critic in the store that is overwritten by the current actor-critic. At each time step, it is determined whether the currently active actor-critic should be replaced by an actor-critic from the long-term memory store. The gradient of fitness for the active actor-critic is calculated between its value 5 minutes in the past and its present value. If this gradient is less than a negative constant (χ = −10,000), or if the fitness of the current actor-critic is less than the fitness of the maximally fit actor-critic in the long-term memory store plus χ, then the active actor-critic is replaced by a copy of the most fit actor-critic in the long-term memory store (retrieval). After another 5 minutes, the fitness of the actor-critic in the long-term memory store that gave rise to the active actor-critic is updated with the fitness of the active actor-critic.
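
The storage and retrieval rules can likewise be written down directly. The sketch below assumes a controller is held as a Python dict with "fitness" and "eligibility" entries; the data structure, time bookkeeping, and helper names are illustrative, while the decision rules (store every 5 minutes, overwrite the least fit stored controller, retrieve the fittest stored controller when fitness falls) follow the text.

```python
import copy

M = 100            # capacity of the long-term memory store
CHI = -10_000      # retrieval threshold constant (chi in the text)
DECAY = 0.00001    # first-order decay rate of the leaky reward integral

store = []         # long-term memory of stored actor-critic controllers

def update_fitness(controller, reward):
    """Fitness is a leaky integral of reward."""
    controller["fitness"] += reward - DECAY * controller["fitness"]

def store_snapshot(active):
    """Called every five simulated minutes: copy the active controller, zeroing eligibility traces."""
    snapshot = copy.deepcopy(active)
    snapshot["eligibility"] = 0.0
    if len(store) < M:
        store.append(snapshot)
    else:
        worst = min(range(M), key=lambda k: store[k]["fitness"])
        store[worst] = snapshot                    # overwrite the least fit stored controller

def maybe_retrieve(active, fitness_5min_ago):
    """Replace the active controller when its fitness trend or level falls too low."""
    if not store:
        return active
    best = max(store, key=lambda c: c["fitness"])
    gradient = active["fitness"] - fitness_5min_ago
    if gradient < CHI or active["fitness"] < best["fitness"] + CHI:
        return copy.deepcopy(best)                 # retrieval of the fittest stored controller
    return active
```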

Figure 16 (left) shows an externally imposed perturbation that results in the loss of a previously successful strategy. The agent initially achieves an effective strategy of going rapidly to the light and rotating around it. When the agent is near the light (at 7000 time units), it is forced to rotate clockwise around its axis for 1000 time units, after which control is returned to the actor-critic. The original strategy is never rediscovered (see the graph labeled “Reward Accumulation,” which remains reduced after the external perturbation). Figure 16 (right) shows the same perturbation made to a robot with the capacity for replication of actor-critic controllers. Rather than experiencing catastrophic forgetting, the retrieval of stored actor-critic controllers rapidly regenerates the previously learned strategy. In summary, the capacity for replication of actor-critic controllers prevents catastrophic forgetting in a robotic learning task.

Figure 16:

(Left) A single actor-critic controller. At 7000 time units (t.u.), the agent was forced (by external intervention) to rotate clockwise for 1000 t.u., after which its adaptive strategy was lost and never recovered. (Right) Multiple replicating actor-critic controllers. At 10,000 t.u., the agent was forced (by external intervention) to rotate clockwise for 1000 t.u., after which its adaptive strategy was rediscovered by retrieval of controllers (see the actor controller graph, which shows the identity of the controller that is active). Comparison with the left panel reveals an enormously better recovery from perturbation given the capacity for actor-critic replication.

2.8.  Activity Genotypes and Connectivity Phenotypes.

The rapid copying of the phenotype of a weight matrix is possible by copying the states of bistable neuromodulatory inhibitory neurons as shown in Figure 17. Assume that two identical weight matrices exist (blue) (perhaps the result of a previous synaptic connectivity copying event), and that each weight is gated by an inhibitory neuromodulatory neuron (red). Assume also that the inhibitory neurons are linked topographically to the corresponding neuron in the other layer (green). Then by activity copying of the states of the inhibitory neurons, it is possible to rapidly reconfigure the effective connectivity (purple) of the lower layer to match that of the upper layer. This is an alternative and faster way in which the weight matrix of the actor and critic could be replicated.
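
The mechanism can be pictured as an elementwise gating of a fixed weight matrix. In the sketch below, the gate matrix stands for the on/off states of the inhibitory neuromodulatory neurons (one per weight, following the description above); a gate value of 1 means the corresponding inhibitory neuron is off and the weight is expressed, and the particular numbers are illustrative. Copying the gate states from the top layer to the bottom layer copies the effective connectivity.

```python
import numpy as np

# Identical underlying weight matrix in both layers (illustrative values).
W = np.array([[0.0, 1.5, -0.8],
              [0.7, 0.0,  1.1],
              [-0.4, 0.9, 0.0]])

# Genotype of the top layer: 1 where the weight is expressed (inhibitory neuron off), 0 where silenced.
gates_top = np.array([[1, 0, 1],
                      [1, 1, 0],
                      [0, 1, 1]])

effective_top = W * gates_top            # phenotype: the effective connectivity of the top layer

gates_bottom = gates_top.copy()          # activity copying of the bistable inhibitory states
effective_bottom = W * gates_bottom      # the bottom layer now expresses the same effective connectivity

print(np.array_equal(effective_top, effective_bottom))   # True
```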

Figure 17:

(Left) The top-layer genotype has two inhibitory neurons switched off with all the others on, specifying the phenotype of that layer (the effective connectivity). (Right) When the inhibitory neurons of the top layer replicate their state to the bottom layer, the phenotype is also copied.

3.  Discussion

The neuronal replicator hypothesis proposes that patterns of neuronal activity can be copied and can implement evolutionary algorithms in the brain. The replication operation can be biased by Hebbian learning, allowing knowledge of previously discovered local optima to guide further replication. Recombination can take place between activity patterns. The circuits required for replication can be learned by structural plasticity guided by reinforcement. Replication can help to solve the stability-plasticity dilemma.

Patterns of bistable activity are one possible candidate for rapidly updatable neuronal data structures. The neuronal mechanisms underlying working memory (Baddeley & Hitch, 1974)—bistability (Wang, 1999) and recurrence (Zipser, Kehoe, Littlewort, & Foster, 1993)—are also ideal candidates for implementing activity replicators. For example, bistable replicators could be implemented by persistent firing due to single-cell dynamics related to elevated [Ca2+] (Loewenstein & Sompolinsky, 2003; Fransen, Tahvildari, Egorov, Hasselmo, & Alonso, 2006) or to high levels of facilitation (Barak, 2007); by network-level dynamics due to recurrent connections (Brunel, 2003), such as the self-organization of “stimulus-selective, sub-populations of excitatory cells within a cortical module” (Amit, Bernacchia, & Yakovlev, 2003) resembling Hebbian cell assemblies (Amit, 1995); by cortico-thalamic reverberating circuits (Wang, 2001); by dendritic bistability (Goldman, Levine, Major, Tank, & Seung, 2003); or by mutual inhibition networks (Machens, Romo, & Brody, 2005). Alternatively, it has been proposed that calcium-mediated synaptic facilitation stores short-term memory in the form of elevated presynaptic calcium levels rather than persistent spiking per se (Mongillo, Barak, & Tsodyks, 2008). Furthermore, “subsets of neurons where Up (firing) and Down (not firing) states are observed may be considered as a second integrated structure, beyond the level of individual neurons” (Holcman & Tsodyks, 2006). These structures flip spontaneously between up and down states. This flipping occurs at low frequency (normally less than 0.1 Hz) (Raichle, 2006) and may constitute another basis of bistable neuronal heredity, which is consistent with the fact that dopamine (a possible signal of fitness) can influence the distribution of up and down states (Durstewitz & Seamans, 2006). No bistable mechanism can yet be ruled out as a potential basis for neuronal heredity, and the models of probabilistic bistable elements presented here cannot exclude any of the possibilities.

The neuronal replicator hypothesis predicts neuronal replication would present itself as spontaneous intrinsic activity in the absence of well-defined tasks. Spontaneous activity is indeed observed (Arieli, Sterkin, Grinvald, & Aertsen, 1996; Tsodyks, Kenet, Grinvald, & Arieli, 1999; Kenet, Bibitchkov, Tsodyks, Grinvald, & Arieli, 2003; Fox, Corbetta, Snyder, Vincent, & Raichle, 2006). Such spontaneous activity contributes significantly to the brain's energy consumption and so is not without adaptive importance (Raichle & Mintun, 2006).

4.  Conclusion

The existence of units of evolution in the brain has significance for agent-based models of neural processes stemming from Minsky's The Society of Mind (1986), which have led to evolutionary game theory models of cognition (Byrne & Kurland, 2001) in which agents in the brain compete and cooperate for behavioral execution. The work also relates to multiexpert neural network models that have been shown to be capable of implementing Q-learning (Toussaint, 2003). The neuronal replicator viewpoint allows several questions to be easily posed. What is the intrinsic value system the brain uses to select neuronal replicators (Oudeyer et al., 2007)? The dopamine system is a crucial element in implementing neuronal search, for it acts to distribute reward and thus define the fitness of neuronal replicators. The literature on intrinsic motivation systems where value comes not only from external reward objects but from intrinsic information theory measures such as increasing the first derivative of predictability (Schmidhuber, 1991; Oudeyer et al., 2007) is also important in understanding the basis of neuronal fitness. This kind of intrinsic value system seems to be crucial in explaining play and active exploration. It is also consistent with evidence that more complex fitness functions than mere reward prediction error are signaled by dopamine (Pennartz, 1995; Redgrave, Prescott, & Gurney, 1999); for example, it has been shown that novelty is also signaled by phasic dopamine, independent of explicit reward (Kakade & Dayan, 2002). The NRH predicts that dopaminergic reward would be assigned to and influence the behavior of groups of neurons rather than individual synapses independently.

We also note that recent work has criticized the hypothesis that creative thought is a Darwinian process (Gabora, 2005). First, the neuronal replicator hypothesis does not claim that conscious (serial) thoughts are the units of evolution; second, we note that the capacity for replication allows adaptations to arise from a (1+1) ES, that is, even where there is only one member in each generation. Third, contrary to Gabora's claim, natural selection does not require multiple identical copies of a variant: single copies of distinct variants are all that are needed if an external module explicitly assigns the probability that each variant contributes to the next generation. The intrabrain replicators that we propose could be examples of Aunger's (2002) neuromemes. They replicate within, not between, brains. However, phenotypic copying of neuronal replicators between brains may be the basis of the heritability of human language, that is, the phenotypic copying of linguistic constructions between brains (Steels & De Beule, 2006a, 2006b). Language acquisition is a complex search problem undertaken by infants. We propose elsewhere that neuronal replicators may underlie the search for linguistic constructions (Fernando & Szathmáry, 2009a).

Intelligence is the capacity to adapt behavior to meet the demands of the environment. When Fogel (2006) wrote in Evolutionary Computation, “The argument offered in this book is that the process of evolution accounts for [intelligent] behaviour and provides a foundation for the design of artificial intelligent machines” and “The versatility of the evolutionary procedure is one of its main strengths in serving as a basis for generating intelligent behaviour, adapting to new challenges, and learning from experience,” he was not referring to evolution in the brain. However, his statement could not better describe our motivation for proposing the neuronal replicator hypothesis. We find it remarkable that such a powerful algorithm as natural selection has not been seriously entertained as a possible neuronal basis of adaptive behavior. While the machine learning literature contains thousands of papers on evolution and natural selection in neural networks for adaptive behavior, none of these papers seriously proposes that the actual neuronal networks could implement an evolutionary computation algorithm within a single brain. This is the proposal of the neuronal replicator hypothesis.

Acknowledgments

Funding was generously provided by a Marie Curie Inter-European Grant to work at Collegium Budapest, Hungary. Partial support of this work has generously been provided by the Hungarian National Office for Research and Technology (NAP 2005/KCKHA005), the Hungarian Scientific Research Fund (OTKA, NK73047), and the eFlux FET-OPEN project (225167). We thank Richard Watson, Eugene Izhikevich, Anil Seth, Phil Husbands, Eva Jablonka, and Dario Floreano.

Notes

1. Supplementary material referred to throughout the letter is available online at http://www.mitpressjournals.org/doi/suppl/10.1162/neco_a_00031.

References

Abraham
,
W. C.
, &
Bear
,
M. F.
(
1996
).
Metaplasticity: The plasticity of synaptic plasticity
.
Trends in Neurosciences
,
19
,
126
130
.
Abraham
,
W. C.
, &
Robins
,
A.
(
2005
).
Memory retention—the synaptic stability verses plasticity dilemma
.
Trends in Neurosciences
,
28
(
2
),
73
78
.
Adams
,
P.
(
1998
).
Hebb and Darwin
.
J. Theor. Biol.
,
195
(
4
),
419
438
.
Altenberg
,
L.
(
1994
).
The evolution of evolvability in genetic programming
. In
K. E. Kinnear
(Ed.),
Advances in genetic programming
(pp.
47
74
).
Cambridge, MA
:
MIT Press
.
Amit
,
D. J.
(
1995
).
The Hebbian paradigm reintegrated: Local reverberations as internal representations
.
Behav. Brain Res.
,
18
,
617
657
.
Amit
,
D. J.
,
Bernacchia
,
A.
, &
Yakovlev
,
J.
(
2003
).
Multiple-object working memory: A model for behavioural performance
.
Cereb. Cortex
,
13
,
435
443
.
Anderson
,
J. R.
(
2007
).
How can the human mind occur in the physical universe
?
New York
:
Oxford University Press
.
Arieli
,
A.
,
Sterkin
,
A.
,
Grinvald
,
A.
, &
Aertsen
,
A.
(
1996
).
Dynamics of ongoing activity: Explanation of the large variability in evoked cortical responses
.
Science
,
273
,
1868
1871
.
Atmar
,
W.
(
1994
).
Notes on the simulation of evolution
.
IEEE Trans. Neural Networks
,
5
(
1
),
130
148
.
Aunger
,
R.
(
2002
).
The electric meme: A new theory of how we think
.
New York
:
Free Press
.
Baddeley
,
A. D.
, &
Hitch
,
G.
(
1974
).
Working memory
. In
G. H. Bower
(Ed.),
The psychology of learning and motivation: Advances in research and theory
(Vol.
8
, pp.
47
89
).
Orlando, FL
:
Academic Press
.
Baldwin
,
M. J.
(
1898
).
On selective thinking
.
Psychological Review
,
5
(
1
),
4
.
Baldwin
,
M. J.
(
1909
).
The influence of Darwin on theory of knowledge and philosophy
.
Psychological Review
,
16
,
207
218
.
Barak
,
O.
(
2007
).
Persistent activity in neural networks with dynamic synapses
.
PLoS Computational Biology
,
3
(
2
),
e35
.
Barto
,
A. G.
(
1995
).
Adaptive critics and the basal ganglia
. In
J. C. Houk, J. Davis, & D. Beiser
(Eds.),
Models of information processing in the basal ganglia
(pp.
215
232
).
Cambridge, MA
:
MIT Press
.
Barto
,
A. G.
,
Sutton
,
R. S.
, &
Anderson
,
J. C.
(
1983
).
Neuronlike adaptive elements that can solve difficult learning control problems
.
IEEE Transactions on Systems, Man, and Cybernetics
,
13
(
5
),
834
846
.
Becker
,
S. A.
(
2005
).
A computational principle for hippocampal learning and neurogenesis
.
Hippocampus
,
15
,
722
738
.
Beyer
,
H.-G.
(
2001
).
The theory of evolution strategies
.
Berlin
:
Springer
.
Bongard
,
J.
, &
Lipson
,
H.
(
2007
).
Automated reverse engineering of nonlinear dynamical systems
.
Proc. Natl. Acad. Sci. USA
,
104
,
9943
9948
.
Bongard
,
J.
,
Zykov
,
V.
, &
Lipson
,
H.
(
2006
).
Resilient machines through continuous self-modeling
.
Science
,
314
,
1118
1121
.
Bremermann
,
H. J.
(
1958
).
The evolution of intelligence: The nervous system as a model of its environment
. (
Tech. Rep. no. 1
).
Seattle
:
University of Washington, Department of Mathematics
.
Brunel
,
N.
(
2003
).
Dynamics and plasticity of stimulus-selective persistent activity in cortical network models
.
Cerebral Cortex
,
13
,
1151
1161
.
Butz
,
M.
,
Worgotter
,
F.
, &
Van Ooyen
,
A.
(
2009
).
Activity-dependent structural plasticity
.
Brain Research Reviews
,
60
(
2
),
287
305
.
Byrne
,
C. C.
, &
Kurland
,
J. A.
(
2001
).
Self-deception in an evolutionary game
.
Journal of Theoretical Biology
,
212
,
457
480
.
Campbell
,
D. T.
(
1974
).
The philophy of Karl. R. Popper
. In
P. A. Schillpp
(Ed.),
Evolutionary epistemology
(pp.
412
463
).
Chicago
:
University of Chicago Press
.
Carpenter
,
G.
, &
Grossberg
,
S.
(
1988
).
The ART of adaptive pattern recognition by a self-organizing neural network
.
Computer
,
21
(
3
),
77
88
.
Changeux
,
J. P.
,
Courrege
,
P.
, &
Danchin
,
A.
(
1973
).
A theory of the epigenesis of neuronal networks by selective stabilization of synapses
.
Proc. Natl. Acad. Sci. USA
,
70
,
2974
2978
.
Chklovskii
,
D. B.
,
Mel
,
B. W.
, &
Swoboda
,
K.
(
2004
).
Cortical rewiring and information storage
.
Nature
,
431
,
782
788
.
Clune
,
J.
,
Misevic
,
D.
,
Ofria
,
C.
,
Lenski
,
R. E.
,
Elena
,
S. F.
, &
Sanjuán
,
R.
(
2008
).
Natural selection fails to optimize mutation rates for long-term adaptation on rugged fitness landscapes
.
PLoS Computational Biology
,
4
(
9
),
e1000187
.
Conrad
,
M.
(
1983
).
Adaptability: The significance of variability from molecule to ecosystem
.
New York
:
Plenum Press
.
Cooper
,
W.
(
2001
).
The evolution of reason: Logic as a branch of biology
.
Cambridge
:
Cambridge University Press
.
Crick
,
F. H. C.
(
1989
).
Neuronal Edelmanism
.
Trends Neurosci.
,
12
,
240
248
.
Crick
,
F. H. C.
(
1990
).
Reply
.
Trends Neurosci.
,
13
,
13
14
.
Dawkins
,
R.
(
1982
).
The extended phenotype: The gene as the unit of selection
.
New York
:
Freeman
.
Dehaene
,
S.
, &
Changeux
,
J. P.
(
1997
).
A hierarchical neuronal network for planning behavior
.
Proc. Natl. Acad Sci. USA
,
94
(
24
),
13293
13298
.
Dehaene
,
S.
,
Changeux
,
J. P.
, &
Nadal
,
J. P.
(
1987
).
Neural networks that learn temporal sequences by selection
.
Proc. Natl. Acad. Sci. USA
,
84
(
9
),
2727
2731
.
Dehaene
,
S.
,
Kerszberg
,
M.
, &
Changeux
,
J. P.
(
1998
).
A neuronal model of a global workspace in effortful cognitive tasks
.
Proc. Natl. Acad. Sci. USA
,
95
(
24
),
14529
14534
.
Dennett
,
D. C.
(
1981
).
Brainstorms
.
Cambridge, MA
:
MIT Press
.
Dennett
,
D. C.
(
1995
).
Darwin's dangerous idea
.
New York
:
Simon & Schuster
.
Dominey
,
P. F.
(
1995
).
Complex sensory-motor sequence learning based on recurrent state representations and reinforcement learning
.
Biological Cybernetics
,
73
,
265
274
.
Doya
,
K.
,
Samejima
,
K.
,
Katagiri
,
K.
, &
Kawato
,
M.
(
2002
).
Multiple model-based reinforcement learning
.
Neural Computation
,
14
(
6
),
1347
1369
.
Durstewitz
,
D.
, &
Seamans
,
J. K.
(
2006
).
Beyond bistability: Biophysics and temporal dynamics of working memory
.
Neuroscience
,
139
,
119
133
.
Edelman
,
G. M.
(
1987
).
Neural Darwinism: The theory of neuronal group selection
.
New York
:
Basic Books
.
Edelman
,
G. M.
(
1994
).
The evolution of somatic selection: The antibody Tale
.
Genetics
,
138
,
975
981
.
Eigen
,
M.
(
1971
).
Selforganization of matter and the evolution of biological macromolecules
.
Naturwissenschaften
,
58
(
10
),
465
523
.
Elfwing
,
S.
(
2007
).
Embodied evolution of learning ability
.
Unpublished doctoral dissertation, Stockholm, KTH
.
Elman
,
J. L.
(
1990
).
Finding structure in time
.
Cognitive Science
,
14
(
2
),
179
211
.
Fernando
,
C.
,
Karishma
,
K. K.
, &
Szathmáry
,
E.
(
2008
).
Copying and evolution of neuronal topology
.
PLoS ONE
,
3
(
11
),
e3775
.
Fernando
,
C.
, &
Szathmáry
,
E.
(
2009a
).
Chemical, neuronal and linguistic replicators
. In
M. Pigliucci & G. Müller
(Eds.),
Towards an extended evolutionary synthesis
.
Cambridge, MA
:
MIT Press
.
Fernando
,
C.
, &
Szathmáry
,
E.
(
2009b
).
Natural selection in the brain
. In
B. Glatzeder, V. Goel, & A. von Müller
(Eds.),
Toward a theory of thinking
.
Berlin
:
Springer
.
Fogel
,
D. B.
(
2006
).
Evolutionary computation: Toward a new Philosophy of machine intelligence
.
Hoboken, NJ
:
Wiley-Interscience
.
Fox
,
M. D.
,
Corbetta
,
M.
,
Snyder
,
A. Z.
,
Vincent
,
J. L.
, &
Raichle
,
M. E.
(
2006
).
Spontaneous neuronal activity distinguishes human dorsal and ventral attention systems
.
Proc. Natl. Acad. Sci. USA
,
103
,
10046
10051
.
Fransen
,
E.
,
Tahvildari
,
B.
,
Egorov
,
A. V.
,
Hasselmo
,
M. E.
, &
Alonso
,
A. A.
(
2006
).
Mechanism of graded persistent cellular activity of entorhinal cortex layer V neurons
.
Neuron
,
49
,
735
746
.
Fraser
,
A. S.
(
1957
).
Simulation of genetic systems by automatic digital computers. I. Introduction
.
Australian J. Biological Sciences
,
10
,
484
491
.
Friston
,
K. J.
, &
Stephan
,
K. E.
(
2007
).
Free-energy and the brain
.
Synthese
,
159
,
417
458
.
Gabora
,
L.
(
2005
).
Creative thought as a non-Darwinian evolutionary process
.
Journal of Creative Behavior
,
39
(
4
),
65
87
.
Gánti
,
T.
(
2003
).
The principles of life
.
New York
:
Oxford University Press
.
Goldman
,
M. S.
,
Levine
,
J. H.
,
Major
,
G.
,
Tark
,
D. W.
, &
Seung
,
H. S.
(
2003
).
Robust persistent neural activity in a model integrator with multiple hysteretic dendrites per neuron
.
Cereb. Cortex
,
13
,
1185
1195
.
Gomez
,
F.
,
Schmidhuber
,
J.
, &
Miikkulainen
,
R.
(
2008
).
Accelerated neural evolution through cooperatively coevolved synapses
.
Journal of Machine Learning Research
,
9
,
937
965
.
Hadamard
,
J.
(
1945
).
The psychology of invention in the mathematical field
.
New York
:
Dover
.
Hofer
,
S. B.
,
Mrsic-Flogel
,
T. D.
,
Bonfoeffer
,
T.
, &
Hubener
,
M.
(
2009
).
Experience leaves a lasting structure trace in cortical circuits
.
Nature
,
457
,
313
317
.
Hofstadter
,
D. R.
, &
Mitchell
,
M.
(
1994
).
The copycat project: A model of mental fluidity and analogy-making
. In
K. Holyoak & J. Bamden
(Eds.),
Advances in connectionist and neural computation
(Vol.
2
, pp.
31
112
).
Norwood, NJ
:
Ablex
.
Holcman
,
D.
, &
Tsodyks
,
M.
(
2006
).
The emergence of up and down states in cortical networks
.
PLoS Computational Biology
,
2
(
3
),
e23
.
Holtmaat
,
A.
, &
Sovoboda
,
K.
(
2009
).
Experience-dependent structural plasticity in the mammalian brain
.
Nature Reviews Neurosicence
,
10
,
647
658
.
Houk
,
J. C.
,
Adams
,
J. L.
, &
Barto
,
A. G.
(
1995
).
A model of how the basal ganglia generate and use neural signals that predict reinforcement
. In
J. C. Houk, J. L. Davis, & D. G. Belser
(Eds.),
Models of information processing in the basal ganglia
.
Cambridge, MA
:
MIT Press
.
Houk
,
J. C.
,
Bastianen
,
C.
,
Fansler
,
A.
,
Fishback
,
A.
,
Fraser
,
D.
,
Reber
,
P. J.
, et al
(
2007
).
Action selection and refinement in subcortical loops through basal ganglia and cerebellum
.
Phil. Trans. Roy. Soc. B
,
29
,
1573
1583
.
Hutter
,
M.
(
2005
).
Universal artificial intelligence: Sequential decisions based on algorithmic probability
.
Berlin
:
Springer
.
Izhikevich
,
E. M.
(
2003
).
Simple model of spiking neurons
.
IEEE Transactions on Neural Networks
,
14
,
1539
1572
.
Izhikevich
,
E. M.
(
2006
).
Polychronization: Computation with spikes
.
Neural Computation
,
18
(
2
),
245
282
.
Izhikevich
,
E. M.
(
2007a
).
Dynamical systems in neuroscience: The geometry of excitability and bursting
.
Cambridge, MA
:
MIT Press
.
Izhikevich
,
E. M.
(
2007b
).
Solving the distal reward problem through linkage of STDP and dopamine signaling
.
Cerebral Cortex
,
17
,
2443
2452
.
Izhikevich
,
E. M.
,
Gally
,
J. A.
, &
Edelman
,
G. M.
(
2004
).
Spike-timing dynamics of neuronal groups
.
Cereb. Cortex
,
14
(
8
),
933
944
.
Jacobs
,
R.
,
Jordan
,
M.
,
Nowlan
,
S.
, &
Hinton
,
G. E.
(
1991
).
Adaptive mixtures of local experts
.
Neural Computation
,
3
,
79
87
.
James
,
W.
(
1890
).
The principles of psychology
.
New York
:
Dover
.
Jones
,
A. G.
,
Arnold
,
S. J.
, &
Bürger
,
R.
(
2007
).
The mutation matrix and the evolution of evolvability
.
Evolution
,
61
,
727
745
.
Kaelbling
,
L. P.
,
Littman
,
M. L.
, &
Moore
,
A. P.
(
1996
).
Reinforcement learning: A survey
.
Journal of Artificial Intelligence Research
,
4
,
237
285
.
Kakade
,
S.
, &
Dayan
,
P.
(
2002
).
Dopamine: Generalization and bonuses
.
Neural Networks
,
15
,
549
559
.
Kashtan
,
N.
, &
Alon
,
U.
(
2005
).
Spontaneous evolution of modularity and network motifs
.
Proc. Natl. Acad. Sci. USA
,
102
(
39
),
13773
13778
.
Kemp
,
C.
, &
Tenenbaum
,
J. B.
(
2008
).
The discovery of structural form
.
Proc. Natl. Acad. Sci. USA
,
105
(
31
),
10687
10692
.
Kenet
,
T.
,
Bibitchkov
,
D.
,
Tsodyks
,
M.
,
Grinvald
,
A.
, &
Arieli
,
A.
(
2003
).
Spontaneously emerging cortical representations of visual attributes
.
Nature
,
425
,
954
956
.
Kirchner
,
M.
, &
Gerhart
,
J.
(
1998
).
Evolvability
.
Proc. Natl. Acad. Sci. USA
,
95
,
8420
8427
.
Kirchner
,
M.
, &
Gerhart
,
J. C.
(
2005
).
The plausibility of life
.
New Haven, CT
:
Yale University Press
.
Kohler
,
W.
(
1925
).
The mentality of apes
.
New York
:
K. Paul, Trench, Trubner & Co
.
Kondo
,
T.
, &
Ito
,
K.
(
2004
).
A reinforcement learning with evolutionary state recruitment strategy for autonomous mobile robots control
.
Robotics and Autonomous Systems
,
46
(
2
),
111
124
.
Lenski
,
R. E.
,
Ofria
,
C.
,
Collier
,
T. C.
, &
Adami
,
C.
(
1999
).
Genomic complexity, robustness, and genetic interactions in digital organisms
.
Nature
,
400
,
661
664
.
Loewenstein
,
Y.
, &
Sompolinsky
,
H.
(
2003
).
Temporal integration by calcium dynamics in a model neuron
.
Nature Neurosci.
,
6
,
961
967
.
Mahfoud
,
S.
(
1995
).
Niching methods for genetic algorithms
.
Unpublished doctoral dissertation, University of Illinois
.
Marcus
,
G. F.
(
2001
).
The algebraic mind: Integrating connectionism and cognitive science
.
Cambridge, MA
:
MIT Press
.
Marr
,
D.
(
1983
).
Vision: A computational investigation into the human representation and processing of visual information
.
New York
:
Freeman
.
Maynard Smith
,
J.
(
1986
).
The problems of biology
.
New York
:
Oxford University Press
.
Maynard Smith
,
J.
(
1998
).
Evolutionary genetics
.
New York
:
Oxford University Press
.
Mechens
,
C. K.
,
Romo
,
R.
, &
Brody
,
C. D.
(
2005
).
Flexible control of mutual inhibition: A neural model of two-interval discrimination
.
Science
,
307
,
1121
1124
.
Michalewicz
,
Z.
, &
Fogel
,
D. B.
(
2004
).
How to solve it: Modern heuristics
.
Berlin
:
Springer-Verlag
.
Michel
,
O.
(
2004
).
Webots: Professional mobile robot simulation
.
Journal of Advanced Robotics Systems
,
1
,
39
42
.
Minsky
,
M.
(
1986
).
The society of mind
.
New York
:
Simon and Schuster
.
Mongillo
,
G.
,
Barak
,
O.
, &
Tsodyks
,
M.
(
2008
).
Synaptic theory of working memory
.
Science
,
319
,
1543
1546
.
Monod
,
J.
(
1971
).
Chance and necessity: An essay on the natural philosophy of modern biology
.
New York
:
Knopf
.
Morimoto
,
J.
, &
Doya
,
K.
(
2001
).
Acquisition of stand-up behaviour by a real robot using hierarchical reinforcement learning
.
Robotics and Autonomous Systems
,
36
,
37
51
.
Muller
,
H. J.
(
1966
).
The gene material as the initiator and organizing basis of life
.
American Naturalist
,
100
,
493
517
.
Nadel, L., Samsonovich, A., Ryan, L., & Moscovitch, M. (2000). Multiple trace theory of human memory: Computational, neuroimaging, and neuropsychological results. Hippocampus, 10, 352–368.
Niv, Y. (2009). Reinforcement learning in the brain. Journal of Mathematical Psychology, 53, 139–154.
Niv, Y., Joel, D., Meilijson, I., & Ruppin, E. (2002). Evolution of reinforcement learning in uncertain environments: A simple explanation for complex foraging behaviors. Adaptive Behavior, 10, 5–23.
Oja, E. (1982). Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3), 267–273.
O'Reilly, R. C., & Munakata, Y. (2000). Computational explorations in cognitive neuroscience: Understanding the mind by simulating the brain. Cambridge, MA: MIT Press.
Oudeyer, P.-Y., Kaplan, F., & Hafner, V. V. (2007). Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2), 265–286.
Parter, M., Kashtan, N., & Alon, U. (2008). Facilitated variation: How evolution learns from past environments to generalize to new environments. PLoS Computational Biology, 4(11), e1000206.
Pennartz, C. M. (1995). The ascending neuromodulatory systems in learning by reinforcement: Comparing computational conjectures with experimental findings. Brain Res. Rev., 21, 219–245.
Pigliucci, M. (2008). Is evolvability evolvable? Nature Reviews Genetics, 9, 75–82.
Poirazi, P., & Mel, B. W. (2001). Impact of active dendrites and structural plasticity on the memory capacity of neural tissue. Neuron, 29, 779–796.
Price, G. R. (1970). Selection and covariance. Nature, 227, 520–521.
Raichle, M. E. (2006). The brain's dark energy. Science, 314, 1249–1250.
Raichle, M. E., & Mintun, M. A. (2006). Brain work and brain imaging. Annual Review of Neuroscience, 29, 449–476.
Rechenberg, I. (1994). Evolutionsstrategie '94. Stuttgart: Frommann-Holzboog.
Redgrave, P., Prescott, T. J., & Gurney, K. (1999). Is the short-latency dopamine response too short to signal reward error? Trends in Neuroscience, 22, 146–151.
Samejima, K., & Omori, T. (1999). Adaptive internal state space construction method for reinforcement learning of a real-world agent. Neural Networks, 12, 1143–1155.
Samejima, K., Ueda, Y., Doya, K., & Kimura, M. (2005). Representation of action-specific reward values in the striatum. Science, 310, 1337–1340.
Schmidhuber, J. (1991). Curious model-building control systems. In Proc. International Joint Conference on Neural Networks. San Francisco: Morgan Kaufmann.
Schmidhuber, J. (1999). A general method for incremental self-improvement and multi-agent learning. In X. Yao (Ed.), Evolutionary computation: Theory and applications. Singapore: World Scientific.
Schmidhuber, J. (2000). Evolutionary computation versus reinforcement learning. In Proceedings of the IEEE Industrial Electronics Society. Piscataway, NJ: IEEE.
Schmidhuber, J. (2006). Developmental robotics, optimal artificial curiosity, creativity, music, and the fine arts. Connection Science, 18(2), 173–187.
Schmidhuber, J. (2009). Ultimate cognition à la Gödel. Cognitive Computation, 1(2), 177–193.
Schmidhuber, J., Wierstra, D., & Gomez, F. (2005). Evolino: Hybrid neuroevolution / optimal linear search for sequence learning. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (pp. 853–858). Boca Raton, FL: CRC Press.
Schmidhuber, J., Zhao, Q., & Schraudolph, N. (1997). Reinforcement learning with self-modifying policies. In S. Thrun & L. Pratt (Eds.), Learning to learn. Norwood, MA: Kluwer.
Schmidhuber, J., Zhao, Q., & Wiering, M. (1997). Shifting inductive bias with success-story algorithm, adaptive Levin search, and incremental self-improvement. Machine Learning, 28, 105–130.
Schultz, W. (1998). Predictive reward signal of dopamine neurons. J. Neurophysiol., 80, 1–27.
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593–1599.
Schwefel, H. P. (2000). Advantages (and disadvantages) of evolutionary computation over other approaches. In T. Bäck, D. B. Fogel, & Z. Michalewicz (Eds.), Evolutionary computation. New York: Taylor and Francis.
Seung, H. S. (2003). Learning in spiking neural networks by reinforcement of stochastic synaptic transmission. Neuron, 40, 1063–1073.
Simonton, D. K. (1995). Foresight in insight? A Darwinian answer. In R. J. Sternberg & J. E. Davidson (Eds.), The nature of insight. Cambridge, MA: MIT Press.
Skinner, B. F. (1976). About behaviourism. New York: Vintage.
Soltoggio, A., Dürr, P., Mattiussi, C., & Floreano, D. (2007). Evolving neuromodulatory topologies for reinforcement learning-like problems. In Proceedings of the 2007 IEEE Congress on Evolutionary Computation (pp. 2471–2478). Piscataway, NJ: IEEE.
Sporns, O., & Edelman, G. M. (1993). Solving Bernstein's problem: A proposal for the development of coordinated movement by selection. Child Development, 64, 960–981.
Stanley, K. O., & Miikkulainen, R. (2002). Efficient reinforcement learning through evolving neural network topologies. In Proceedings of the Genetic and Evolutionary Computation Conference. San Francisco: Morgan Kaufmann.
Steels, L., & De Beule, J. (2006a). A (very) brief introduction to fluid construction grammar. In Proceedings of the 3rd International Workshop on Scalable Natural Language. N.p.
Steels, L., & De Beule, J. (2006b). Unify and merge in fluid construction grammar. In Symbol Grounding and Beyond: Proceedings of the Third International Workshop on the Emergence and Evolution of Linguistic Communication (pp. 197–223). Berlin: Springer.
Sternberg, R. J., & Davidson, J. E. (Eds.). (1995). The nature of insight. Cambridge, MA: MIT Press.
Suri, R. E., & Schultz, W. (1999). A neural network model with dopamine-like reinforcement signal that learns a spatial delayed response task. Neuroscience, 91(3), 871–890.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Szathmáry, E., & Maynard Smith, J. (1997). From replicators to reproducers: The first major transitions leading to life. Journal of Theoretical Biology, 187(4), 555–571.
Szita, I., & Lőrincz, A. (2007). Learning to play using low-complexity rule-based policies: Illustrations through Ms. Pac-Man. Journal of Artificial Intelligence Research, 30, 659–684.
Thorndike, E. L. (1911). Animal intelligence: Experimental studies. New York: Macmillan.
Togelius, J., Schaul, T., Wierstra, D., Igel, C., Gomez, F., & Schmidhuber, J. (in press). Ontogenetic and phylogenetic reinforcement learning. Künstliche Intelligenz.
Tononi, G., Sporns, O., & Edelman, G. M. (1992). Reentry and the problem of integrating multiple brain areas: Simulation of dynamic integration in the visual system. Cerebral Cortex, 2, 310–335.
Toussaint, M. (2003). The evolution of genetic representations and modular adaptation. Unpublished doctoral dissertation, Universität Bochum.
Tsodyks, M., Kenet, T., Grinvald, A., & Arieli, A. (1999). Linking spontaneous activity of single cortical neurons and the underlying functional architecture. Science, 286, 1943–1946.
Voegtlin, T. (2002). Recursive self-organizing maps. Neural Networks, 15, 979–991.
Wagner, G. P., & Altenberg, L. (1996). Complex adaptations and evolution of evolvability. Evolution, 50, 329–347.
Wagner, U., Gais, S., Haider, H., Verleger, R., & Born, J. (2004). Sleep inspires insight. Nature, 427, 352–355.
Wang, J. (1993). A recurrent neural network for real-time matrix inversion. Applied Mathematics and Computation, 55, 89–100.
Wang, X.-J. (1999). Synaptic basis of cortical persistent activity: The importance of NMDA receptors to working memory. Journal of Neuroscience, 19(21), 9587–9603.
Wang, X.-J. (2001). Synaptic reverberation underlying mnemonic persistent activity. Trends in Neuroscience, 24, 455–463.
Watkins, C. (1989). Learning from delayed rewards. Unpublished doctoral dissertation, Cambridge University.
Watson, R. A. (2006). Compositional evolution: The impact of sex, symbiosis, and modularity on the gradualist framework of evolution. Cambridge, MA: MIT Press.
Watson, R. A., Buckley, C. L., & Mills, R. (2009). The effect of Hebbian learning on optimisation in Hopfield networks (Tech. Rep. ECS). University of Southampton.
Watson, R. A., Hornby, G. S., & Pollack, J. B. (1998). Modelling building-block interdependency. In Proceedings of the 5th International Conference on Parallel Problem Solving from Nature. San Francisco: Morgan Kaufmann.
Whiteson, S., Taylor, M. E., & Stone, P. (2007). Empirical studies in action selection with reinforcement learning. Adaptive Behavior, 15, 33–50.
Wiering, M., & Schmidhuber, J. (1997). HQ-Learning. Adaptive Behavior, 6(2), 219–246.
Wierstra, D., Foerster, A., Peters, J., & Schmidhuber, J. (2007). Solving deep memory POMDPs with recurrent policy gradients. In Proceedings of the International Conference on Artificial Neural Networks. Berlin: Springer.
Wiskott, L., Rasch, M. J., & Kempermann, G. (2006). A functional hypothesis for adult hippocampal neurogenesis: Avoidance of catastrophic interference in the dentate gyrus. Hippocampus, 16(3), 329–343.
Wörgötter, F., & Porr, B. (2005). Temporal sequence learning, prediction, and control: A review of different models and their relation to biological mechanisms. Neural Computation, 17, 245–319.
Zachar, I., & Szathmáry, E. (2010). A new replicator: A theoretical framework for analysing replication. BMC Biology, 8, 21.
Zipser, D., Kehoe, B., Littlewort, G., & Fuster, J. (1993). A spiking network model of short-term active memory. Journal of Neuroscience, 13, 3406–3420.