Numerous algorithms have been proposed to allow legged robots to learn to walk. However, most of these algorithms are devised to learn walking in a straight line, which is not sufficient to accomplish any real-world mission. Here we introduce the Transferability-based Behavioral Repertoire Evolution algorithm (TBR-Evolution), a novel evolutionary algorithm that simultaneously discovers several hundred simple walking controllers, one for each possible direction. By taking advantage of solutions that are usually discarded by evolutionary processes, TBR-Evolution is substantially faster than independently evolving each controller. Our technique relies on two methods: (1) novelty search with local competition, which searches for both high-performing and diverse solutions, and (2) the transferability approach, which combines simulations and real tests to evolve controllers for a physical robot. We evaluate this new technique on a hexapod robot. Results show that with only a few dozen short experiments performed on the robot, the algorithm learns a repertoire of controllers that allows the robot to reach every point in its reachable space. Overall, TBR-Evolution introduces a new kind of learning algorithm that simultaneously optimizes all the achievable behaviors of a robot.
Evolving gaits for legged robots has been an important topic in evolutionary computation for the last 25 years (de Garis, 1990; Lewis et al., 1992; Kodjabachian and Meyer, 1998; Hornby et al., 2005; Clune et al., 2011; Yosinski et al., 2011; Samuelsen and Glette, 2014). That legged robots are a classic of evolutionary robotics is not surprising (Bongard, 2013): Legged locomotion is a difficult challenge in robotics that evolution by natural selection solved in nature; evolution-inspired algorithms may do the same for artificial systems. As argued in many papers, evolutionary computation could bring many benefits to legged robotics, from making it easier to design walking controllers (e.g., Hornby et al., 2005) to autonomous damage recovery (e.g., Bongard et al., 2006; Koos et al., 2013a). In addition, in an embodied cognition perspective (Wilson, 2002; Pfeifer and Bongard, 2007; Pfeifer et al., 2007), locomotion is one of the most fundamental skills of animals, and therefore it is one of the first skills needed for embodied agents.
It could seem confusing that evolutionary computation has failed to be central in legged robotics in spite of the efforts of the evolutionary robotics community (Raibert, 1986; Siciliano and Khatib, 2008). In our opinion, this failure occurred for at least two reasons: (1) Most evolved controllers are almost useless in robotics because they are limited to walking in a straight line at constant speed (e.g., Hornby et al., 2005; Bongard et al., 2006; Koos et al., 2013a), whereas a robot that only walks in a straight line is unable to accomplish any mission; (2) evolutionary algorithms typically require evaluating the fitness function thousands of times, which is very hard to achieve with a physical robot. This article introduces TBR-Evolution (Transferability-based Behavioral Repertoire Evolution), a new algorithm that addresses these two issues.
Evolving controllers to make a robot walk in any direction can be seen as a generalization of the evolution of controllers for forward walking. A straightforward idea is to add an additional input to the controller that describes the target direction, then evolve controllers that use this input to steer the robot (Mouret et al., 2006). Unfortunately, this approach requires testing each controller for several directions in the fitness function, which substantially increases the time required to find a controller. In addition, an integrated controller that can use a direction input is likely to be more difficult to find than a controller that can only do forward walking.
An alternative idea is to see walking in every direction as a problem of learning how to do many different but related tasks. In this case, an evolutionary algorithm could search for a repertoire of simple controllers that would contain a different controller for each possible direction. This method circumvents the challenge of learning a complex controller and can be combined with high-level algorithms (e.g., planning algorithms) that successively select controllers to drive the robot. Nevertheless, evolving a controller repertoire typically involves as many evolutionary processes as there are target points in the repertoire. Evolution is consequently slowed down by a factor equal to the number of targets. With existing evolution methods, repertoires of controllers are in effect limited to a few targets, because 20 minutes (Koos et al., 2013a) to dozens of hours (Hornby et al., 2005) are needed to learn how to reach a single target.
Our algorithm aims to find such a repertoire of simple controllers, but in a single run. It is based on a simple observation: With a classic evolutionary algorithm, when a robot learns to reach a specific target, the learning process explores many different potential solutions, with many different outcomes. Most of these potential solutions are discarded because they are deemed poorly performing. Nevertheless, while being useless for the considered objective, these inefficient behaviors can be useful for other objectives. For example, a robot learning to walk in a straight line usually encounters many turning gaits during the search process before converging toward straight line locomotion.
To exploit this idea, TBR-Evolution takes inspiration from the novelty search algorithm (Lehman and Stanley, 2011a) and in particular its variant novelty search with local competition (Lehman and Stanley, 2011b). Instead of rewarding candidate solutions that are closest to the objective, this algorithm explicitly searches for behaviors that are different from those previously seen. The local competition variant adds the notion of a quality criterion that is optimized within each individual’s niche. As shown in the present article, searching for many different behaviors during a single execution of the algorithm allows the evolutionary process to efficiently create a repertoire of high-performing walking gaits.
To further reduce the time required to obtain a behavioral repertoire for the robot, TBR-Evolution relies on the transferability approach (Koos et al., 2013b; Mouret et al., 2012), which combines simulations and tests on the physical robot to find solutions that perform similarly in simulation and in reality. The advantage of the transferability approach is that evolution occurs in simulation but the evolutionary process is driven toward solutions that are likely to work on the physical robot. In recent experiments, this approach led to the successful evolution of walking controllers for quadruped (Koos et al., 2013b), hexapod (Koos et al., 2013a), and biped robots (Oliveira et al., 2013), with no more than 25 tests on the physical robot.
We evaluate our algorithm on two sets of experiments. The first set aims to show that simultaneously learning all the behaviors of a repertoire is faster than learning each of them separately. We chose to perform these experiments in simulation to gather extensive statistics. The second set of experiments evaluates our method on a physical hexapod robot (Fig. 1) that has to walk forward and backward and turn in both directions, all at different speeds. We compare our results to learning each controller independently. All our experiments utilize embedded measurements to evaluate the fitness, an aspect of autonomy only considered in a handful of gait discovery experiments (Kimura et al., 2001; Hornby et al., 2005).
2.1 Evolving Walking Controllers
We call walking controller the software module that rhythmically drives the motors of the legged robot. We distinguish two categories of controllers: undriven controllers and inputs-driven controllers. An undriven controller always executes the same gait, while an inputs-driven controller can change the robot’s movements according to an input (e.g., a speed or a direction reference). Inputs-driven controllers are typically combined with decision or planning algorithms (Russell et al., 2010; Currie and Tate, 1991; Dean and Wellman, 1991; Kuffner and LaValle, 2000) to steer the robot. These two categories contain, without distinction, both open-loop and closed-loop controllers and can be designed using various controller and genotype structures. For example, walking gait evolution or learning has been achieved on legged robots using parametrized periodic functions (Koos et al., 2013a; Chernova and Veloso, 2004; Hornby et al., 2005; Tarapore and Mouret, 2014a, 2014b), artificial neural networks with either direct or generative encoding (Clune et al., 2011; Valsalam and Miikkulainen, 2008; Tarapore and Mouret, 2014a, 2014b), central pattern generators (Kohl and Stone, 2004; Ijspeert et al., 2007), or graph-based genetic programming (Filliat et al., 1999; Gruau, 1994).
When dealing with physical legged robots, the majority of studies only consider the evolution of undriven walking controllers, and most of the time the task consists of finding a controller that maximizes the forward walking speed (Zykov et al., 2004; Chernova and Veloso, 2004; Hornby et al., 2005; Berenson et al., 2005; Yosinski et al., 2011; Mahdavi and Bentley, 2006). Papers on alternatives to evolutionary algorithms such as policy gradients (Kohl and Stone, 2004; Tedrake et al., 2005) or Bayesian optimization (Calandra et al., 2014; Lizotte et al., 2007) are also focused on robot locomotion along a straight line.
Comparatively few articles deal with controllers able to turn or to change walking speed according to an input, especially with a physical robot. Inputs-driven controllers usually need to be tested on each possible input during the learning process or to be learned with an incremental process, which significantly increases both the learning time and the difficulty compared to learning an undriven controller. Filliat et al. (1999) proposed such a method, which evolves a neural network to control a hexapod robot. Their controller is learned in several steps: A first network is evolved to walk in a straight line; then a second neural network is evolved on top of the walking controller to execute turning maneuvers. In a related task (flapping wing flight), Mouret et al. (2006) proposed another approach, where an evolutionary algorithm is used to design a neural network that pilots a simulated flapping robot; the network was evaluated by its ability to drive the robot to eight different targets, and the reward function was the sum of the distances to the targets.
Overall, many methods exist to evolve undriven controllers, while methods for learning inputs-driven controllers are very time-consuming, difficult to apply on a physical robot, and require an extensive amount of expert knowledge. To our knowledge, no current technique is able to make a physical robot learn to walk in multiple directions in less than a dozen hours. In this paper, we sidestep many of the challenges raised by inputs-driven controllers while still being able to drive a robot in every direction: We propose to abandon inputs-driven controllers and instead search for a large number of simple undriven controllers, one for each possible direction.
2.2 Transferability Approach
Most of the previously described methods are based on stochastic optimization algorithms that need to test a high number of candidate solutions. Typically, several thousands of tests are performed with policy gradient methods (Kohl and Stone, 2004) and hundreds of thousands with evolutionary algorithms (Clune et al., 2011). This high number of tests is a major problem when they are performed on a physical robot. An alternative is to perform the learning process in simulation and then apply the result to the robot. Nevertheless, solutions obtained in simulation often do not work well on the real device, because simulation and reality never match perfectly. This phenomenon is called the reality gap (Jakobi et al., 1995; Koos et al., 2013b).
The transferability approach (Koos et al., 2013a, 2013b; Mouret et al., 2012) crosses this gap by finding behaviors that act similarly in simulation and in reality. During the evolutionary process, a few candidate controllers are transferred to the physical robot to measure the behavioral differences between the simulation and the reality; these differences represent the transferability value of the solutions. With these few transfers, a regression model is built up to map solution descriptors to an estimated transferability value. The regression model is then used to predict the transferability value of untested solutions. The transferability approach uses a multiobjective optimization algorithm to find solutions that maximize both task efficiency (e.g., forward speed, stability) and the estimated transferability value.
This mechanism drives the optimization algorithm toward solutions that are both efficient in simulation and transferable (i.e., that act similarly in the simulation and in the reality). It allows the algorithm to exploit the simulation and consequently to reduce the number of tests performed on the physical robot.
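The trade-off between task efficiency and estimated transferability can be illustrated with a Pareto-dominance test. The following is only a minimal sketch: the function names and the example objective values are hypothetical, and a full multiobjective algorithm such as NSGA-II adds ranking and diversity mechanisms that are not shown here.

```python
def dominates(a, b):
    """True if objective vector `a` Pareto-dominates `b`.

    Assumption of this sketch: every objective is maximized.
    """
    no_worse = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(population):
    """Keep the objective vectors that no other vector dominates."""
    return [p for p in population
            if not any(dominates(q, p) for q in population)]


# Hypothetical (task efficiency, estimated transferability) pairs.
candidates = [(1.0, 0.2), (0.8, 0.9), (0.5, 0.5), (1.0, 0.1)]
front = pareto_front(candidates)
```

In this example, the fast but poorly transferable candidate and the slower but well-transferable one both remain on the front, while the two dominated candidates are dropped.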
This approach was successfully used with an E-puck robot in a T-maze and with a quadruped robot that evolved to walk in a straight line with a minimum of transfers on the physical robots (Koos et al., 2013b). The reality gap phenomenon was particularly apparent in the quadruped experiment: With a controller optimized only in simulation, the virtual robot moved 1.29 m (in 10 s), but when the same controller was applied on the physical robot, it only moved 0.41 m. With the transferability approach, the obtained solution walked 1.19 m in simulation and 1.09 m in reality. These results were found with only 11 tests on the physical robot and 200,000 evaluations in simulation. This approach has also been applied for humanoid locomotion (Oliveira et al., 2013) and damage recovery on a hexapod robot (Koos et al., 2013a).
Since the transferability approach is one of the most practical tools to apply stochastic optimization algorithms on physical robots, it constitutes an element of our method.
2.3 Novelty Search with Local Competition
A longstanding challenge in artificial life is to craft an algorithm able to discover a wide diversity of interesting artificial creatures. While evolutionary algorithms are good candidates, they usually converge to a single species of creatures. In order to overcome this issue, Lehman and Stanley (2011b) proposed a method called novelty search (NS) with local competition. This method, based on multiobjective evolutionary algorithms, combines the exploration abilities of the novelty search algorithm (Lehman and Stanley, 2011a) with a performance competition between similar individuals.
NS with local competition simultaneously optimizes two objectives for an individual c: (1) the novelty objective, novelty(c), which measures how novel the individual is compared to previously encountered ones, and (2) the local competition objective, Qrank(c), which compares the individual’s quality, quality(c), to the performance of individuals in a neighborhood, defined with a morphological distance.
With these two objectives, the algorithm favors individuals that are new, more efficient than their neighbors, and optimal trade-offs between novelty and local quality. Both objectives are evaluated using an archive, which records all encountered families of individuals and allows the algorithm to define neighborhoods for each individual. The novelty objective is computed as the average distance between the current individual and its neighbors, and the local competition objective is the number of neighbors that c outperforms according to the quality criterion quality(c).
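The two objectives can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: individuals are represented as dictionaries with hypothetical keys `"pos"` (a descriptor vector) and `"quality"`, the neighborhood is taken to be the `k` nearest archive members, and the distance is Euclidean.

```python
import math


def novelty_and_local_rank(candidate, archive, k=3):
    """Return (novelty, q_rank) of `candidate` with respect to `archive`.

    novelty: mean distance to the k nearest archive members.
    q_rank:  number of those neighbors that the candidate outperforms.
    """
    if not archive:
        # Nothing to compare against: treat the candidate as maximally novel.
        return float("inf"), 0
    neighbors = sorted(
        archive,
        key=lambda ind: math.dist(candidate["pos"], ind["pos"]),
    )[:k]
    novelty = sum(math.dist(candidate["pos"], n["pos"])
                  for n in neighbors) / len(neighbors)
    q_rank = sum(1 for n in neighbors
                 if candidate["quality"] > n["quality"])
    return novelty, q_rank


archive = [
    {"pos": (0.0, 0.0), "quality": 1.0},
    {"pos": (3.0, 4.0), "quality": 5.0},
    {"pos": (10.0, 0.0), "quality": 0.0},
]
candidate = {"pos": (0.0, 0.0), "quality": 2.0}
novelty, q_rank = novelty_and_local_rank(candidate, archive, k=2)
```

With `k=2`, the candidate's neighbors lie at distances 0 and 5, so its novelty is 2.5, and it outperforms one of the two neighbors.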
Lehman and Stanley successfully applied this method to generate a high number of creatures with different morphologies, all able to walk in a straight line. The algorithm found a heterogeneous population of different creatures, from little hoppers to imposing quadrupeds, all walking at different speeds according to their statures. We show in this paper how this algorithm can be modified to allow a single robot to achieve several different actions (i.e., directions of locomotion).
3.1 Main Ideas
Some complex problems are easier to solve when they are split into several sub-problems. Thus, instead of using a single and complex solution, it is relevant to search for several simple solutions that each solve a part of the problem. This principle is often successfully applied in machine learning: mixtures of experts (Jacobs et al., 1991) or boosting (Schapire, 1990) methods train several weak classifiers on different subparts of a problem. The performance of the resulting set of classifiers is better than that of a single classifier trained on the whole problem.
The TBR-Evolution algorithm enables the application of this principle to robotics and particularly to legged robots that learn to walk. Instead of learning a complex inputs-driven controller that generates gaits for every direction, we consider a repertoire of undriven controllers, where each controller is able to reach a different point of the space around the robot. This repertoire gathers a high number of efficient and easy-to-learn controllers.
Because of the required time, independently learning dozens of controllers is prohibitively expensive, especially with a physical robot. To avoid this issue, the TBR-Evolution algorithm transforms the problem of learning a repertoire of controllers into one of evolving a heterogeneous population of controllers. Thus the problem can be solved with an algorithm derived from NS with local competition (Lehman and Stanley, 2011b). Instead of generating virtual creatures with various morphologies that execute the same action, TBR-Evolution generates a repertoire of controllers, each executing a different action, working on the same creature. By simultaneously learning all the controllers without the discrimination of a specified goal, the algorithm recycles interesting controllers, which are typically wasted with classical learning methods. This enhances its optimizing abilities compared to classic optimization methods.
Further, our algorithm incorporates the transferability approach (Koos et al., 2013b) to reduce the number of tests on the physical robot during the evolutionary process. The transferability approach and NS with local competition can be combined because they are both based on multiobjective optimization algorithms. By combining these two approaches, the behavioral repertoire is generated in simulation with a virtual robot, and only a few controller executions are performed on the physical robot. These trials guide the evolutionary process to solutions that work similarly in simulation and in reality (Koos et al., 2013a, 2013b).
The minimization of the number of tests on the physical robot and the simultaneous evolution of many controllers are the two assets that allow the TBR-Evolution algorithm to require significantly less time than classical methods.
More technically, the TBR-Evolution algorithm relies on four principles, detailed in the next sections:
The transferability function is periodically updated with a test on the physical robot.
When a controller is novel enough, it is saved in the novelty archive.
When a controller has a better quality than the one in the archive that reaches the same endpoint, it is substituted for the one in the archive.
Algorithm 1 describes the algorithm in pseudocode.
The local transferability rank (see Eq. (3)) works as a second local competition objective, where the estimate of the transferability score replaces the quality score. As in Koos et al. (2013b), this estimate is obtained by periodically repeating three steps: (1) A controller is randomly selected in the current population or in the archive and then downloaded and executed on the physical robot; (2) the displacement of the robot is estimated with an embedded sensor; and (3) the distance between the endpoint reached in reality and the one in simulation is used to feed a regression model (here a support vector machine (Chang and Lin, 2011)). This distance defines the transferability score of the controller. The regression model maps a behavioral descriptor of a controller, which is obtained in simulation, to an approximation of the transferability score.
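Step (3) can be sketched as below. The function and variable names are hypothetical, and the sign convention is an assumption of this sketch: the distance is negated so that a larger score means a more transferable controller. The regression model itself is not shown; the sketch only collects the training pairs it would be fitted on.

```python
import math


def transferability_score(sim_endpoint, real_endpoint):
    """Negated distance between the simulated and the real endpoints.

    Assumed convention: 0 is a perfect match, more negative is worse.
    """
    return -math.dist(sim_endpoint, real_endpoint)


# (behavioral descriptor, score) pairs used to fit the regression model.
training_set = []


def record_transfer(descriptor, sim_endpoint, real_endpoint):
    """Store the outcome of one test on the physical robot."""
    score = transferability_score(sim_endpoint, real_endpoint)
    training_set.append((descriptor, score))
    return score


# Hypothetical transfer: the real robot ends 0.5 m away from the simulated endpoint.
score = record_transfer([True, False, True], (0.3, 0.4), (0.0, 0.0))
```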
3.3 Archive Management
In the original NS with local competition (Lehman and Stanley, 2011b), the archive aims at recording all encountered solutions, but only the first individuals that have a new morphology are added to the archive. The next individuals with the same morphology are not saved even if they have better performances. In the TBR-Evolution algorithm, the novelty archive represents the resulting repertoire of controllers and thus has to gather only the best controllers for each region of the reachable space.
For this purpose, the archive is managed differently than in novelty search: During the learning process, if a controller of the population has better scores (quality or transferability) than the closest controller in the archive, the one in the archive is replaced by the better one. These comparisons are made with a priority among the scores to prevent circular permutations. If the transferability score is lower than a threshold, only the transferability scores are compared; otherwise we compare the quality scores. This mechanism allows the algorithm to focus the search on transferable controllers instead of searching for efficient but nontransferable solutions. Such a priority is important, as the performances of nontransferable controllers may not be reproducible on the physical robot.
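One possible reading of this replacement rule is sketched below. Several points are assumptions: the dictionary keys are hypothetical, both scores are taken as "higher is better", the threshold is applied to the archived controller, and a challenger that wins on quality must itself meet the threshold. The text does not state these details explicitly.

```python
import math


def should_replace(new, old, threshold):
    """Decide whether `new` replaces `old`, its closest archive member."""
    if old["transferability"] < threshold:
        # The archived controller is not transferable enough:
        # only transferability is compared.
        return new["transferability"] > old["transferability"]
    # Otherwise compare quality, without accepting a challenger
    # that falls below the transferability threshold.
    return (new["transferability"] >= threshold
            and new["quality"] > old["quality"])


def update_archive(archive, new, threshold):
    """Replace the archive member nearest to `new` if `new` is better."""
    if not archive:
        return False
    idx = min(range(len(archive)),
              key=lambda i: math.dist(archive[i]["pos"], new["pos"]))
    if should_replace(new, archive[idx], threshold):
        archive[idx] = new
        return True
    return False
```

Giving transferability priority below the threshold means that two controllers cannot keep displacing each other, one on quality and the other on transferability.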
4 Experimental Validation
We evaluate the TBR-Evolution algorithm on two different experiments, which both consist of evolving a repertoire of controllers to access the whole vicinity of the robot. In the first experiment, the algorithm is applied on a simulated robot (Fig. 2); consequently the transferability aspect of the algorithm is disabled. The goal of this experiment is to show the benefits of evolving simultaneously all the behaviors of the repertoire instead of evolving them separately. The second experiment applies the algorithm directly on a physical robot (Fig. 1). For this experiment, the transferability aspect of the algorithm is enabled, and the experiment shows how the behavioral repertoire can be learned with a few trials on the physical robot.
4.1 Implementation Choices
The pseudocode of the algorithm is presented in Algorithm 1. The TBR-Evolution algorithm uses the same variant of NSGA-II (Deb et al., 2002) as NS with local competition (Lehman and Stanley, 2011b). The simulation of the robot is based on the Open Dynamics Engine (ODE), and the transferability function uses the Support Vector Regression algorithm with linear kernels implemented in the library libsvm (Chang and Lin, 2011) (learning parameters set to default values). All the algorithms are implemented in the Sferes framework (Mouret and Doncieux, 2010) (parameters and source code are detailed in the appendix). The simulated parts of the algorithms are computed on a cluster of five quad-core Xeon-E5520@2.27GHz computers.
Both the virtual and the physical robots have the same kinematic scheme (Fig. 2). They have 18 degrees of freedom, 3 per leg. The first joint of each leg controls the direction of the leg while the two others define its elevation and extension. The virtual robot is designed to be a virtual copy of the physical hexapod: It has the same mass for each of its body parts, and the physical simulator reproduces the dynamic characteristics of the servos. On the physical robot, the estimations of the covered distance are acquired with a Simultaneous Localization and Mapping (SLAM) algorithm based on the embedded RGB-D sensor (Endres et al., 2012).
4.1.2 Genotype and Controller
The parameters can each have five different values, and with their variations, numerous gaits are possible, from purely quadruped gaits to classic tripod gaits.
For the genotype mutation, each parameter value has a 10% chance of being changed to any value in the set of possible values, with the new value chosen randomly from a uniform distribution over the possible values. For both of the experiments, the crossover is disabled.
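This mutation operator can be sketched in a few lines. The function name and the set of allowed values in the example are placeholders chosen for illustration only.

```python
import random


def mutate(genotype, allowed_values, rate=0.10, rng=random):
    """Return a mutated copy of `genotype`.

    Each parameter is redrawn with probability `rate`; the new value is
    sampled uniformly from `allowed_values`, so it may equal the old one.
    """
    return [rng.choice(allowed_values) if rng.random() < rate else gene
            for gene in genotype]


# Placeholder value set with five levels (not the values used in the paper).
ALLOWED = (0.0, 0.25, 0.5, 0.75, 1.0)
rng = random.Random(0)
parent = [0.5] * 20
child = mutate(parent, ALLOWED, rate=0.10, rng=rng)
```

Because crossover is disabled, this operator is the only source of variation in the sketch.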
Compared to classic evolutionary algorithms, TBR-Evolution only changes the way individuals are selected. As a result, it does not put any constraint on the type of controllers, and many other controllers are conceivable (e.g., bio-inspired central pattern generators (Sproewitz et al., 2008; Ijspeert, 2008), dynamic movement primitives (Schaal, 2003), or evolved neural networks (Yosinski et al., 2011; Clune et al., 2011)).
4.1.3 Endpoints of a Controller
4.1.4 Quality Score
To be able to sequentially execute saved behaviors, special attention is paid to the final orientation of the robot. Because the endpoint of a trajectory depends on the initial orientation of the robot, we need to know how the robot ends its previous movement when we plan the next one. To facilitate chaining controllers, we encourage behaviors that end their movements with an orientation aligned with their trajectory.
The robot cannot execute arbitrary trajectories with a single controller because controllers are made of simple periodic functions. For example, it cannot begin its movement with a turn and then go straight. With this controller, the robot can only follow trajectories with a constant curvature, but it can still move sideways or even turn itself around while following an overall straight trajectory. We chose to focus the search on circular trajectories, centered on the lateral axis, with a variable radius (Fig. 3A), and for which the robot’s body is pointing toward the tangent of the overall trajectory. Straight forward (or backward) trajectories are still possible with the particular case of an infinite radius. This kind of trajectory is suitable for motion control, as many complex trajectories can be decomposed into a succession of circle portions and lines. This principle is illustrated in Figures 3(D–F).
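The geometry of these circular trajectories can be worked out explicitly. The sketch below rests on assumptions that are not stated in this section: the robot starts at the origin facing the +x axis, the lateral axis is y, and the function names are hypothetical. For a circle centered on the lateral axis that passes through the origin and an endpoint (x, y), the tangent-chord relation gives a tangent heading at the endpoint equal to twice the chord angle atan2(y, x).

```python
import math


def wrap_angle(a):
    """Map an angle to its equivalent in [-pi, pi]."""
    return math.atan2(math.sin(a), math.cos(a))


def desired_heading(x, y):
    """Tangent heading at endpoint (x, y) of the circular trajectory."""
    return wrap_angle(2.0 * math.atan2(y, x))


def orientation_error(x, y, final_heading):
    """Absolute angle between the final heading and the tangent."""
    return abs(wrap_angle(final_heading - desired_heading(x, y)))


# Quarter circle of radius 1 centered at (0, 1): the robot ends at (1, 1)
# and should then be facing +y, that is, a heading of pi/2.
quarter_turn = desired_heading(1.0, 1.0)
```

A straight trajectory (y = 0) gives a desired heading of zero, which matches the infinite-radius case, and an orientation error of this kind is one way to score how well a behavior ends aligned with its trajectory.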
4.1.5 Transferability Score
In order to estimate the transferability score of untested controllers, a regression model is trained with the tested controllers and their recorded transferability score. The regression model used is the Support Vector Regression algorithm with linear kernels implemented in the library libsvm (Chang and Lin, 2011) (training parameters are set to default values), which maps a behavioral descriptor, des(c), to an estimated transferability score. Each controller is described with a vector of Boolean values that describe, for each time step and each leg, whether the leg is in contact with the ground (the descriptor is therefore a vector of size N × 6, where N is the number of time steps). This kind of vector is a classic way to describe gaits in legged animals and robots (Raibert, 1986). During the evolutionary process, the algorithm performs one transfer at regular intervals (a fixed number of iterations).
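The contact descriptor can be built by flattening a per-time-step, per-leg contact table. The following is a minimal sketch with hypothetical names and data layout; it assumes six legs, as on the hexapod.

```python
def contact_descriptor(contacts):
    """Flatten ground contacts into one Boolean vector.

    `contacts[t][leg]` is truthy when `leg` touches the ground at
    time step `t`; the result has len(contacts) * 6 entries for a hexapod.
    """
    return [bool(c) for step in contacts for c in step]


# Two time steps of an idealized tripod gait: legs 0, 2, 4 then legs 1, 3, 5.
tripod = [
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1],
]
descriptor = contact_descriptor(tripod)
```

A vector of this form can then be passed to the regression model as the input paired with the measured transferability score.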
4.2 Experiments on the Virtual Robot
The first experiment involves a virtual robot that learns a behavioral repertoire to reach every point in its vicinity. The transferability objective is disabled because the goal of this experiment is to show the benefits of learning simultaneously all the behaviors of the repertoire instead of learning them separately. Using only the simulation allows us to perform more replications and to implement a higher number of control experiments. This experiment also shows how the robot is able to autonomously discover possible movements, cover a high proportion of the reachable space, and generate a behavioral repertoire.
The TBR-Evolution experiment and the control experiments are replicated 40 times to gather statistics.
4.2.1 Control Experiments
To our knowledge, no work directly tackles the question of learning simultaneously all the behaviors of a controller repertoire; thus we cannot compare our approach with an existing method. As a reference point, we implement a naive method where the desired endpoints are preselected. A separate controller is then optimized to reach each desired point.
We also investigate how the archive management added in TBR-Evolution improves the quality of produced behavioral repertoires. To highlight these improvements, we compare our resulting archives with archives issued from the novelty search algorithm (Lehman and Stanley, 2011a) and from the NS with local competition algorithm (Lehman and Stanley, 2011b), as the main difference between these algorithms is the archive management procedure. We apply these algorithms on the same task and with the same parameters as in the experiment with our method. We call these experiments novelty search and NS with local competition. For both of these experiments, we analyze the produced archives and the resulting populations.
Behavioral repertoires resulting from a typical run of TBR-Evolution and the control experiments are pictured in Figures 4, 5, 6, and 7. The endpoints achieved with each controller of the repertoire are spread over the reachable space in a specific manner: They cover the front and back of the robot but less the lateral sides. These limits are not explicitly defined, but they are autonomously discovered by the algorithm.
For the same number of evaluations, the area is less covered with the control experiments (“nearest” and “orientation”) than with TBR-Evolution (Fig. 4). With only 100,000 evaluations, this area is about twice as large with TBR-Evolution as with either control experiment. At the end of the evolution (1,000,000 evaluations), the reachable space is more densely covered with our approach. With the “nearest” variant of the control experiment, all target points are reached (Fig. 3C); this is not the case for the “orientation” variant.
The archives produced by novelty search and NS with local competition both cover a larger space than the TBR-Evolution algorithm (Fig. 5). These results are surprising because all these experiments are based on novelty search and differ only in the way the archive is managed. These results show that TBR-Evolution tends to slightly reduce the exploration abilities of NS and focus more on the quality of the solutions.
We can formulate two hypotheses to explain this difference in exploration. First, the local competition objective may have a higher influence in the TBR-Evolution algorithm than in the NS with local competition: In NS with local competition, the individuals from the population are competing against those of the archive; since this archive is not updated if an individual with a similar behavior but a higher performance is encountered, the individuals from the population are likely to always compete against low-performing individuals and therefore always get a similar local competition score; as a result, the local competition objective is likely to not be very distinctive, and most of the selective pressure can be expected to come from the novelty objective. This different selective pressure can explain why NS with local competition explores more than TBR-Evolution, and it echoes the observation that the archives obtained with novelty search and with NS with local competition are visually similar (Fig. 5). The second hypothesis is that the procedure used to update the archive may erode the borderline of the archive: If a new individual is located close to the archive’s borderline, and if this individual has a better performance than its nearest neighbor in the archive, then the archive management procedure of TBR-Evolution will replace the individual from the archive with the new and higher-performing one; as a consequence, an individual from the border can be removed in favor of a higher-performing but less innovative individual. This process is likely to repeatedly erode the border of the archive and hence discourage exploration. These two hypotheses will be investigated in future work.
The primary purpose of NS with local competition is to maintain a diverse variety of well-adapted solutions in its population, not in its archive. For this reason, we also plot the distribution of the population’s individuals for novelty search and NS with local competition (Fig. 6). Both after 100,000 evaluations and at the end of the evolution, the population covers less of the robot’s surroundings than the TBR-Evolution archive does. The density of the individuals is not homogeneous, and they are not arranged in a particular shape, in contrast to the results of TBR-Evolution. In particular, the borderline of the population seems to be almost random.
The density of the archive also differs between the algorithms. The archives produced by TBR-Evolution are denser than those of the other approaches, even though the novelty threshold required to add individuals to the archive is the same. This shows that the archive management of the TBR-Evolution algorithm increases the density of the regions where good-quality solutions are easier to find. This characteristic gives the archive a better resolution in specific regions.
Throughout the evolution, the orientation error is qualitatively larger in the “nearest” control experiment than in the other experiments. This error is also large at the beginning of the “orientation” variant, but at the end it is negligible for the majority of controllers. Novelty search, NS with local competition, and the population of novelty search have a larger orientation error. Figures 5 and 6 show that the orientation of the controllers seems almost random. With such a repertoire, chaining behaviors on the robot is more complicated than with the TBR-Evolution archives, where a vector field is visible. Only the population of NS with local competition seems to show a lower orientation error. This illustrates the benefits of the local competition objective on the population’s behaviors.
The TBR-Evolution algorithm consistently leads to very small orientation errors (Figs. 4 and 7); only a few points have a significant error. We find these points in two regions: far from the starting point and directly on its sides. These regions are difficult to reach, for two main reasons: the large distance to the starting point, or the complexity of the required trajectory given the controller and the possible parameters (see Appendix). For example, the close lateral regions require a trajectory with a very high curvature, which cannot be executed with the range of parameters of the controller. Moreover, the behaviors obtained in these regions are most of the time degenerate: They take advantage of inaccuracies in the simulator to produce movements that would not be possible in reality. Since reaching these points is difficult, finding better solutions is difficult for the evolutionary algorithm. We also observe a correlation between the density of controllers, the orientation error, and the regions that are difficult to reach (Fig. 7): The more difficult a region is to reach, the fewer controllers we find there, and the worse their orientation. For the other regions, the algorithm produces behaviors with various lengths and curvatures, covering all the reachable area of the robot.
From the orientation point of view (Fig. 8, bottom), our approach needs few evaluations to reach a low error value (less than 5 degrees after 100,000 evaluations, and less than degrees at the end of the evolutionary process). The “orientation” control experiment improves more slowly and needs 750,000 evaluations to cross the curve of TBR-Evolution. At the end of the experiment this variant reaches a significantly lower error level (p-values with Wilcoxon rank-sum tests), but this corresponds to a difference of the medians of only degrees. The “nearest” variant suffers from a significantly higher orientation error (greater than degrees, p-values with Wilcoxon rank-sum tests). This is expected because this variant selects behaviors by taking into account only the distance from the target point, so the orientation is neglected. The novelty search and NS with local competition experiments lead to orientation errors that are very high and almost constant throughout the evolution. These results come from the archive management of these algorithms, which does not substitute individuals when a better one is found: The archive only gathers the first encountered behavior for each reached point. The orientation error of NS with local competition is lower than that of novelty search because the local competition promotes behaviors with a good orientation error (compared to their local niche) in the population, which has an indirect impact on the quality of the archive but not enough to reach a low error level. The same conclusion can be drawn about the populations of these two algorithms: While the population of novelty search has an orientation error similar to that of its archives, the population of NS with local competition has a lower orientation error than its archives.
With the sets of reference points, we can compute the theoretical minimal sparseness value of the control experiments (Fig. 9). For example, changing the number of targets from 100 to 200 changes the minimal sparseness from 3.14 cm to 2.22 cm. However, the number of evaluations grows with the number of targets: Using 200 targets doubles the required number of evaluations. Thanks to these values, we can extrapolate how the sparseness varies with the number of points. With 200 targets, for instance, we can predict that the final sparseness will be 2.22 cm, and we can scale the graph of our control experiment to fit this prediction; following the constraint on the number of evaluations, we also scale its temporal axis. We can thus extrapolate the sparseness of the archive with regard to the number of targets and compare it to the sparseness of the archive generated with TBR-Evolution.
The extrapolations (Fig. 9) show higher sparseness values compared to TBR-Evolution within the same execution time. Better values will be achieved with more evaluations. For instance, with 400 targets, the sparseness value reaches 1.57 cm but only after 4 million evaluations. This figure shows how our approach is faster than the control experiments regardless of the number of reference points.
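The figures quoted above (3.14 cm for 100 targets, 2.22 cm for 200, 1.57 cm for 400) are consistent with a minimal sparseness that shrinks as the inverse square root of the number of targets, while the number of evaluations grows linearly. The sketch below reproduces this arithmetic. The square-root law is inferred from the quoted values rather than stated in the text, and the baseline of one million evaluations for 100 targets is likewise inferred from the 4 million evaluations quoted for 400 targets; the function names are hypothetical.

```python
import math

REFERENCE_TARGETS = 100
REFERENCE_SPARSENESS_CM = 3.14     # quoted in the text for 100 targets
REFERENCE_EVALUATIONS = 1_000_000  # inferred: 4 million for 400 targets, linear scaling


def minimal_sparseness_cm(num_targets: int) -> float:
    """Theoretical minimal sparseness, assuming an inverse square-root law."""
    return REFERENCE_SPARSENESS_CM * math.sqrt(REFERENCE_TARGETS / num_targets)


def required_evaluations(num_targets: int) -> int:
    """Evaluations needed, assuming a fixed budget per target."""
    return REFERENCE_EVALUATIONS * num_targets // REFERENCE_TARGETS


for n in (100, 200, 400):
    print(n, round(minimal_sparseness_cm(n), 2), required_evaluations(n))
```

Under these assumptions, halving the sparseness requires four times as many targets and therefore four times as many evaluations, which is why the control experiments fall behind as the desired resolution increases.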
Figures 8 and 9 show that TBR-Evolution outperforms the control experiments in both sparseness and orientation. Within a few evaluations, reachable points are evenly distributed around the robot, and the corresponding behaviors are mostly well oriented. An illustrative video is available (see Appendix).
4.3 Experiments on the Physical Robot
In this second set of experiments, we apply the TBR-Evolution algorithm on a physical hexapod robot (Fig. 1). The transferability component of the algorithm allows it to evolve the behavioral repertoire with a minimal number of evaluations on the physical robot. For this experiment, 3,000 generations are performed, and we execute a transfer (evaluation of one controller on the physical robot) every 50 generations, leading to a total of 60 transfers. The TBR-Evolution experiments and the reference experiments are replicated five times to gather statistics.2
4.3.1 Reference Experiment
To assess the learning speed of the TBR-Evolution algorithm, we use a reference experiment in which only one controller is learned. For this experiment, we use the NSGA-II algorithm with the transferability approach to learn a controller that reaches a predefined target. The target is situated 0.4 m in front and 0.3 m to the right: a point that is harder to reach than one straight ahead but easier than one that requires a U-turn. It represents a good difficulty trade-off and thus allows us to extrapolate the performance of this method to more points.
To update the transferability function, we use the same transfer frequency as in the TBR-Evolution experiments (every 50 generations). Among the resulting trade-offs, we select as the final controller the one that arrives closest to the target among those whose estimated transferability corresponds to a distance of less than 10 cm between the endpoint reached in simulation and the one reached in reality.
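The final selection step of the reference experiment can be sketched as a filter followed by a nearest-to-target choice. This is an illustrative sketch only: the `Candidate` type, its field names, and the example values are hypothetical, and the transferability estimate is represented directly as a predicted simulation-to-reality distance in meters, which is an assumption about its encoding.

```python
import math
from typing import NamedTuple, Optional, Sequence, Tuple


class Candidate(NamedTuple):
    name: str
    endpoint_m: Tuple[float, float]  # (forward, right) endpoint reached in simulation
    estimated_gap_m: float           # predicted simulation-to-reality endpoint distance


TARGET_M = (0.4, 0.3)  # 0.4 m in front, 0.3 m to the right (from the text)
MAX_GAP_M = 0.10       # 10 cm limit on the estimated gap (from the text)


def select_final_controller(candidates: Sequence[Candidate],
                            target: Tuple[float, float] = TARGET_M,
                            max_gap: float = MAX_GAP_M) -> Optional[Candidate]:
    """Keep candidates deemed transferable, then pick the one closest to the target."""
    transferable = [c for c in candidates if c.estimated_gap_m < max_gap]
    if not transferable:
        return None
    return min(transferable, key=lambda c: math.dist(c.endpoint_m, target))


# Made-up trade-off front: "a" hits the target in simulation but is not
# expected to transfer, so the choice falls on the next closest candidate.
front = [
    Candidate("a", (0.40, 0.30), 0.25),
    Candidate("b", (0.38, 0.27), 0.06),
    Candidate("c", (0.30, 0.20), 0.02),
]
print(select_final_controller(front).name)
```

Filtering before minimizing the distance reflects the priority described in the text: a controller that is accurate in simulation is only useful if it is expected to behave similarly on the physical robot.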
After 3,000 iterations and 60 transfers, TBR-Evolution generates a repertoire with a median number of controllers (min = , max = ). This is achieved in approximately 2.5 hours. One of these repertoires is pictured in Fig. 10. The distribution of the controllers’ endpoints follows the same pattern as in the virtual experiments: They cover the front and rear of the robot but not the lateral sides. Here again, these limits are not explicitly defined, but they are autonomously discovered by the algorithm.
As with the experiments on the virtual robot, most of the controllers have a good final orientation, and only the peripheries of the repertoire have a distinguishable orientation error. TBR-Evolution successfully pushes the repertoire toward controllers that have a good quality score and thus follow the desired trajectories.
From these results we can draw the same conclusion as with the previous experiment: The difficulty of reaching peripheral regions explains the comparatively poor performance of controllers from these parts of the archive. The large distance to the starting point or the complexity of the required trajectory meets the limits of the employed controllers.
The transferability map (Fig. 10) shows that the majority of the controllers have an estimated value lower than 15 cm (dark regions). Nevertheless, some regions are deemed nontransferable (light regions). These regions are situated in the peripheries but also in circumscribed areas inside the reachable space. They occur in the peripheries for the same reasons as the orientation errors do (see Section 4.2), but those inside the reachable space show that the algorithm failed to find transferable controllers in a few specific regions. This happens when the performed transfers do not allow the algorithm to infer transferable controllers. To overcome this issue, different selection heuristics and transfer frequencies will be considered in future work.
In order to evaluate the hundreds of behaviors contained in the repertoires on the physical robot, we select 30 controllers in each repertoire of the five runs. The selection is made by splitting the space into 30 areas (Fig. 10) and selecting the controllers with the best estimated transferability in each area.
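The selection of test controllers can be sketched as a binning step followed by a per-bin minimum. The actual partition into 30 areas is the one shown in Fig. 10 and is not reproduced here; the regular 6 × 5 grid below is only a stand-in, and the `Controller` type, the bounds, and the representation of estimated transferability as a distance (lower is better) are assumptions made for this example.

```python
from typing import Dict, NamedTuple, Sequence, Tuple

Bounds = Tuple[Tuple[float, float], Tuple[float, float]]  # ((x_min, x_max), (y_min, y_max))


class Controller(NamedTuple):
    endpoint_m: Tuple[float, float]  # endpoint reached in simulation
    estimated_gap_m: float           # estimated transferability as a distance


def area_index(endpoint: Tuple[float, float], bounds: Bounds,
               columns: int = 6, rows: int = 5) -> int:
    """Map an endpoint inside the bounds to one of columns * rows grid cells."""
    (x_min, x_max), (y_min, y_max) = bounds
    col = min(int((endpoint[0] - x_min) / (x_max - x_min) * columns), columns - 1)
    row = min(int((endpoint[1] - y_min) / (y_max - y_min) * rows), rows - 1)
    return row * columns + col


def best_per_area(repertoire: Sequence[Controller], bounds: Bounds,
                  columns: int = 6, rows: int = 5) -> Dict[int, Controller]:
    """Return, for each non-empty cell, the controller with the smallest estimated gap."""
    best: Dict[int, Controller] = {}
    for controller in repertoire:
        index = area_index(controller.endpoint_m, bounds, columns, rows)
        if index not in best or controller.estimated_gap_m < best[index].estimated_gap_m:
            best[index] = controller
    return best


# Toy repertoire: two controllers share a cell, a third sits in the opposite corner.
repertoire = [
    Controller((-0.9, -0.9), 0.20),
    Controller((-0.8, -0.9), 0.05),
    Controller((0.9, 0.9), 0.10),
]
selected = best_per_area(repertoire, bounds=((-1.0, 1.0), (-1.0, 1.0)))
print(sorted(selected))
```

With this scheme, cells that contain no controller simply yield no selection, so the number of selected controllers equals the number of areas only when every area of the partition is populated.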
Most of these controllers have an actual transferability value lower than cm (Fig. 11), which is consistent with the observations of the transferability map (Fig. 10) and not very large taking into consideration the SLAM precision, the size of the robot, and the looseness in the joints. Over all the runs, the median accuracy of the controllers is cm (Fig. 11). Nevertheless, every run presents outliers, that is, controllers with a very bad actual transferability value, which originate from regions that the transferability function does not correctly approximate.
To compare the efficiency of our approach to the reference experiment, we use the transfers-to-controllers ratio, that is, the number of performed transfers divided by the number of controllers produced at the end of the evolutionary process. For instance, if we reduce the produced behavioral repertoires to the 30 tested controllers, this ratio is equal to 2 for the TBR-Evolution experiments, since we performed 60 transfers.
The performances of the control experiments depend on the number of performed transfers (Fig. 11) and thus on this ratio. For an equal ratio, the reference experiments are less accurate than TBR-Evolution ( cm versus cm, p-value with Wilcoxon rank-sum tests), while the accuracies of both experiments are not statistically different ( cm versus cm and cm, p-value and , respectively) if the reference algorithm uses from 8 to 14 transfers to learn one controller (i.e., a process 4 to 7 times longer). The reference experiment only takes advantage of its target specialization when 20 transfers are performed. With a transfers-to-controllers ratio equal to 20, the accuracy of the reference controllers outperforms the controllers generated with the TBR-Evolution algorithm ( cm versus cm, p-value). Nevertheless, with such a high ratio, the reference experiment only generates three controllers, while our approach generates of them with the same running time (60 transfers and 3,000 generations).
We previously considered only the 30 postevaluated controllers, whereas TBR-Evolution generates several hundred of them. After 60 transfers the repertoires contain a median number of 217 controllers that have an estimated transferability lower than m (Fig. 12). The previous results show that more than of the tested controllers have an actual transferability value lower than m and lower than m. We can consequently extrapolate that between 100 and 150 controllers are exploitable in a typical behavioral repertoire. Taking all these controllers into consideration, the transfers-to-controllers ratio of the TBR-Evolution experiments falls between 0.4 and 0.6, and thus our approach is about times faster than the reference experiment for a similar accuracy.
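The transfers-to-controllers ratio is simple arithmetic, and the sketch below works through the cases discussed above using only figures given in the text (3,000 generations, one transfer every 50 generations, 30 tested controllers, 100 to 150 exploitable controllers, and a reference ratio of 20). The function names are hypothetical.

```python
def transfers_to_controllers_ratio(num_transfers: int, num_controllers: int) -> float:
    """Number of physical tests spent per controller obtained."""
    return num_transfers / num_controllers


def controllers_for_budget(num_transfers: int, ratio: int) -> int:
    """Controllers a one-at-a-time method yields for a given transfer budget."""
    return num_transfers // ratio


GENERATIONS = 3000
TRANSFER_PERIOD = 50
TRANSFERS = GENERATIONS // TRANSFER_PERIOD  # 60 transfers in total

# Repertoire reduced to the 30 controllers tested on the robot.
print(transfers_to_controllers_ratio(TRANSFERS, 30))

# Repertoire counted as 100 to 150 exploitable controllers.
print(transfers_to_controllers_ratio(TRANSFERS, 100),
      transfers_to_controllers_ratio(TRANSFERS, 150))

# Reference experiment at 20 transfers per controller, same 60-transfer budget.
print(controllers_for_budget(TRANSFERS, 20))
```

A ratio below 1 means that the algorithm produces more usable controllers than it performs physical tests, which a method that learns one controller at a time cannot achieve.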
5 Conclusion and Discussion
To our knowledge, TBR-Evolution is the first algorithm designed to generate a large number of efficient gaits without requiring that each of them be learned separately or that complex controllers be tested for each direction. In addition, our experiments only rely on internal, embedded measurements, which is critical for autonomy but not considered in most previous studies (e.g., Kohl and Stone, 2004; Zykov et al., 2004; Chernova and Veloso, 2004; Yosinski et al., 2011; Mahdavi and Bentley, 2006).
We evaluated our method on two experiments, one in simulation and one with a physical hexapod robot. With these experiments, we showed that thanks to its ability to recycle solutions usually wasted by classic evolutionary algorithms, TBR-Evolution generates behavioral repertoires faster than by evolving each solution separately. We also showed that its archive management allows it to generate behavioral repertoires with a significantly higher quality than the novelty search algorithm (Lehman and Stanley, 2011a).
With the TBR-Evolution algorithm, our physical hexapod robot was able to learn several hundred controllers with only 60 transfers of 3 seconds on the robot, which was achieved in approximately 2.5 hours (including computation time for evolution and the SLAM algorithm). The distribution of these controllers over all the reachable space was autonomously inferred by the algorithm according to the abilities of the robot. Our experiments also showed that our method is about times faster than learning each controller separately.
Overall, these experiments show that the TBR-Evolution algorithm is a powerful method for learning multiple tasks with only several dozens of tests on the physical robot. Fig. 13 and the supplementary videos illustrate the resulting ability of the robot to walk in every direction. In the footsteps of novelty search, this new algorithm thus highlights that evolutionary robotics can be more than black-box optimization (Doncieux and Mouret, 2014). Evolution can simultaneously optimize in many niches, each of them corresponding to a different but high-performing behavior.
In future work we plan to investigate the generation of behavioral repertoires in an environment with obstacles. Selecting the transfers according to the presence of obstacles might enable the robot to avoid them during the learning process. This ability is rarely considered in this kind of learning problem; most of the time, the robot is placed in an empty space or manually returned to a starting point (e.g., Kohl and Stone, 2004; Zykov et al., 2004; Berenson et al., 2005; and the present work). This would be a step forward in autonomous learning in robotics.
Neither the transferability approach nor NS with local competition makes assumptions about the type of controller or genotype employed. For example, the transferability approach has been used to evolve the parameters of central pattern generators (Oliveira et al., 2013) and those of an artificial neural network (Koos et al., 2013b), and the NS algorithm has been employed on plastic artificial neural networks encoded with NEAT (Risi et al., 2010; Lehman and Stanley, 2011a) and on graph-based virtual creatures (Lehman and Stanley, 2011b). Since TBR-Evolution is a combination of these two algorithms, it can also be used with any type of genotype or controller. In future work we will therefore investigate more sophisticated genotypes and phenotypes such as neural networks encoded with HyperNEAT (Stanley et al., 2009; Clune et al., 2011; Tarapore and Mouret, 2014a, 2014b) or oscillators encoded by compositional pattern-producing networks (CPPN) like SUPGs (Morse et al., 2013; Tarapore and Mouret, 2014a, 2014b). Nevertheless, such advanced controllers can use feedback to change their behavior according to their sensors. Understanding how feedback-driven controllers and a repertoire-based approach can be elegantly combined is an open question.
TBR-Evolution also makes no assumption on the type of robot, and it would be interesting to see the abilities of the algorithm on more challenging robots like quadrupedal or bipedal robots, where stability is more critical than with the hexapod robot.
The ability of TBR-Evolution to autonomously infer the possible actions of the robot makes this algorithm a relevant tool for developmental robotics (Lungarella et al., 2003). With our methods, the robot progressively discovers its abilities and then perfects them. This process is similar to artificial curiosity algorithms in developmental robotics (Barto et al., 2004; Oudeyer et al., 2007), which make robots autonomously discover their abilities by exploring their behavioral space. It will be relevant to study the links between these approaches and our algorithm, which come from different branches of artificial intelligence. For example, can we build a behavioral repertoire thanks to artificial curiosity? Or, can we see the novelty search aspects of TBR-Evolution as being like a curiosity process? Which of these two approaches is less affected by the curse of dimensionality?
This work was funded by the ANR Creadapt project (ANR-12-JS03-0009) and a DGA/UPMC scholarship for A. Cully.
This experiment is partly based on the preliminary results published in a conference paper (Cully and Mouret, 2013).
Performing statistical analysis with only five runs is difficult, but it still allows us to understand the main tendencies. The current set of experiments (five runs of TBR-Evolution and the control experiment) requires more than 30 hours with the robot, and it is materially challenging to use more replications.
The source code for our experiments and two videos are available at http://pages.isir.upmc.fr/evorob_db
A.1 Parameters Used for Experiments on Virtual Robot
|No. of generations||10,000|
|Mutation rate||10% on each parameter|
|No. of generations||10,000|
|Mutation rate||10% on each parameter|
|Population size||100 individuals|
|No. of generations||50,000 (100 × 500)|
|Mutation rate||10% on each parameter|
|Population size||100 individuals|
|No. of generations||50,000 (100 × 500)|
|Mutation rate||10% on each parameter|
A.2 Parameters Used for Experiments on Physical Robot
|No. of generations||3,000|
|Mutation rate||10% on each parameter|
|Transfer period||50 iterations|
|No. of generations||3,000|
|Mutation rate||10% on each parameter|
|Transfer period||50 iterations|