Evolutionary robotics using real hardware is currently restricted to evolving robot controllers, but the technology for evolvable morphologies is advancing quickly. Rapid prototyping (3D printing) and automated assembly are the main enablers of robotic systems where robot offspring can be produced based on a blueprint that specifies the morphologies and the controllers of the parents. This article addresses the problem of gait learning in newborn robots whose morphology is unknown in advance. We investigate a reinforcement learning method and conduct simulation experiments using robot morphologies with different size and complexity. We establish that reinforcement learning does the job well and that it outperforms two alternative algorithms. The experiments also give insights into the online dynamics of gait learning and into the influence of the size, shape, and morphological complexity of the modular robots. These insights can potentially be used to predict the viability of modular robotic organisms before they are constructed.
This work forms a stepping stone towards the grand vision of the evolution of things as outlined in [11–14]. The essence of this vision is to construct physical systems that undergo evolution “in the wild,” that is, not in a virtual world inside a computer. There are various possible avenues towards this goal, including chemical, biological, and robotic approaches. This study falls in the last category, based on modular robots composed of elementary modules or building blocks. Here one can choose a homogeneous system with identical building blocks or a heterogeneous one with several kinds of modules. In both cases a robotic organism (or robot, for short) is an aggregated three-dimensional structure. It is these structures that form the possible robot morphologies, and morphological evolution will take place in the space of all possible robotic organisms.
The algorithmic aspects of systems where robot morphologies and controllers can evolve in real time and real space have been put on a solid footing by a conceptual framework dubbed the triangle of life . This framework describes the pivotal life cycle of an ecosystem of self-reproducing robots. This life cycle does not run from birth to death, but from conception (being conceived) to conception (conceiving one or more children), and it is repeated over and over again, thus creating consecutive generations of robot children. The result is a population of robotic organisms that evolves and thus adapts to the given environment. The triangle of life consists of three stages: morphogenesis, infancy, and mature life, as illustrated in Figure 1.
This article focuses on learning in the infancy stage. We assume that there is a procedure for morphogenesis that can produce new robotic organisms and deliver these in the habitat. As explained in , the body (morphological structure) and the mind (controller) of a new robotic organism are unlikely to fit each other well. Even if the parents had well-matching bodies and minds, recombination and mutation can easily result in a child where this is not the case. Hence, the new robot needs to do some fine tuning; not unlike a newborn calf, the newborn robot needs to learn how to control its own body. This problem—the control your own body (CYOB) problem—is inherent in artificial life systems where newborn organisms are random combinations of the bodies and minds of their parents.
This article considers a particular, limited instance of the general CYOB problem: gait learning. The overall goal of this article is to find a method that learns gaits quickly across the range of different morphologies that can be created with the given modules. The problem is highly nontrivial, since a modular robotic organism has many degrees of freedom, which leads to a very large search space of possible gaits. Furthermore, this learning process must take place in an online fashion, during the real operational period of the robotic organisms. The offline approach, where a good controller is developed before the robot is deployed, is not applicable here, because the life cycle of the triangle is running without being paused for intervention by the experimenter.
The mechanism we investigate here to solve the CYOB problem is reinforcement learning, in particular, the RL PoWER algorithm described by Kober and Peters . Note that this learning mechanism is not evolutionary itself. Evolution takes place on the level of multicellular robotic organisms: These organisms are the units of reproduction, whereas the RL PoWER algorithm is applied inside one organism as an individual learning mechanism to discover a good controller that induces a good gait. The specific questions that will be addressed in this article are the following:
Is RL PoWER suitable to learn a good gait on the fly in robot offspring with arbitrary shapes and sizes? How does RL PoWER compare to other options, such as simulated annealing and HyperNEAT?
How is the learning process affected by the shapes and sizes of the robots? Can we identify morphological features that are good predictors of the quality of the learned gait?
The second question is particularly relevant for future implementations in hardware, because the creation of a physical organism is expensive. Predicting the walking abilities of robotic organisms before they are constructed can help filter out hopeless shapes, thus making the use of system resources more efficient. This article extends more limited experiments published in .
2 Related Work
Evolutionary robotics is the combination of evolutionary computing and robotics [2, 9, 16, 29, 38–40]. The field “aims to apply evolutionary computation techniques to evolve the overall design, or controllers, or both, for real and simulated autonomous robots” [39, p. ix]. This approach is “useful both for investigating the design space of robotic applications and for testing scientific hypotheses of biological mechanisms and processes” [16, p. 1423]. However, as noted in [2, p. 74], “the use of metaheuristics [i.e., evolution] sets this subfield of robotics apart from the mainstream of robotics research,” which “aims to continuously generate better behavior for a given robot, while the long-term goal of Evolutionary Robotics is to create general, robot-generating algorithms.” This provides the context for our work that aims to employ online evolution in real time and real space to deliver robot morphologies and controllers suited for a given environment. As outlined in , such a system can be used for engineering purposes as well as for conducting fundamental studies of evolution in a novel substrate.
Currently we are not aware of any evolutionary robotic systems that implement the principles of the triangle of life. However, there are two studies that come close in some important aspects. Weel et al. proposed a “Robotic Ecosystem with Evolvable Minds and Bodies” that works with online evolution without a centralized evolution manager “above” the robot population . The system is a genuine implementation of the birth, infancy, and maturity life stages in a simulated circular habitat. Robots are not required to perform any specific task and are free to move in any direction, limited only by the border of the habitat. The robots have a modular morphology constructed from Roombots and a controller that is a set of parametrized cyclic splines describing the servo motor angles in the Roombot joints as a function of time. Body and mind are encoded by a genome, and there are appropriate crossover and mutation operators to create new genomes from given parents. Learning, in particular of gaits, in the infancy stage is implemented through reinforcement learning.
Recently, Brodbeck et al. published an experimental study “Morphological Evolution of Physical Robots through Model-Free Phenotype Development” [4, p. 2]. The overall objective is to demonstrate a “model-free implementation for the artificial evolution of physical systems, to stochastically optimize the design of real-world machines.” Being model-free means that the system does not employ simulations; all robots are physically constructed. Again, the system is based on modular robot morphologies. Two types of cubic modules (active and passive) form the raw material, and robot bodies are constructed from a handful of such modules. The robots have onboard Arduino microcontrollers, which operate the servos. The controllers are not fully autonomous, as they interpret high-level commands from a centralized PC. The task is locomotion, which is achieved by oscillating the servos at a certain frequency and amplitude determined by the genome. The evolutionary process is conducted by a centralized evolutionary algorithm that runs on the external PC. The fitness is the distance traveled in a given time interval. The robot phenotypes are constructed in real hardware for fitness evaluation, and the system was designed to construct new robots autonomously.
These articles represent important milestones towards the evolution of things. They demonstrate the feasibility of such systems in a complementary manner. The first article  demonstrates a full ecosystem: a robot population with evolvable morphologies and controllers, where parents select each other autonomously and children undergo learning in real time—but only in simulation. Even though the use of an existing hardware platform (Roombots) makes the system constructible, it has not been constructed in the real world. The second article  describes a genuine hardware implementation, where the robotic manipulator (“mother robot”) and the given supply of modules form a Birth Clinic. However, evolution is an offline, centrally managed process, where robots are built and tested in isolation (not as part of a living population) and are driven by commands from an external PC.
Simulated and real-world experiments have been contrasted before in (evolutionary) robotics. This led to the notion of the reality gap, the fact that the working of an evolved controller or morphological feature obtained in simulations will not be the same once transferred to real hardware . The rationale for using simulations for this article is twofold. First, while our longer-term goal is a learning mechanism for real robots, this is an exploratory stage of development where simulations offer a relatively quick method to assess and compare different options. Second, the reality gap concerns the difference between simulated and real-world behavior of robot controllers, whereas here we are comparing learning mechanisms. We believe that our analyses of performance patterns and the conclusions about ranking the three learning methods (RL PoWER, simulated annealing, and HyperNEAT) will also stand in real hardware.
The design of locomotion for modular robots is a particular problem. It amounts to the creation of rhythmic patterns that satisfy multiple requirements: generating forward motion without falling over, with low energy usage, and possibly coping with different environments, hardware failures, and changes in the environment and/or the organism . In the literature there are several approaches, based on various types of controllers for creating these rhythmic patterns as well as on various algorithms to optimize their parameters.
One of the earliest approaches recognizes the periodicity of these patterns and successfully exploits this by utilizing cyclic genetic algorithms to evolve gaits [30, 31]. Another classic approach is based on gait control tables, as in, for instance,  and . A gait control table consists of rows of actuator commands with one column for each actuator. Each row also has a condition for the transition to the next row, essentially creating a very simple cyclic finite state machine.
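As an illustration of the gait control table idea, the following is a minimal sketch and not the implementation used in the cited work. The class names, the time-based transition condition, and the two-actuator example gait are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class GaitRow:
    commands: Sequence[float]          # one target angle per actuator (one column each)
    advance: Callable[[float], bool]   # transition condition, given time spent in this row


class GaitControlTable:
    """Cyclic finite state machine: each row is a state, the last row wraps to the first."""

    def __init__(self, rows: List[GaitRow]) -> None:
        if not rows:
            raise ValueError("a gait control table needs at least one row")
        self.rows = rows
        self.index = 0
        self.time_in_row = 0.0

    def step(self, dt: float) -> Sequence[float]:
        """Advance the clock by dt seconds and return the active row's commands."""
        self.time_in_row += dt
        if self.rows[self.index].advance(self.time_in_row):
            self.index = (self.index + 1) % len(self.rows)
            self.time_in_row = 0.0
        return self.rows[self.index].commands


# Hypothetical two-actuator gait that swaps poses once per second.
table = GaitControlTable([
    GaitRow(commands=(0.5, -0.5), advance=lambda t: t >= 1.0),
    GaitRow(commands=(-0.5, 0.5), advance=lambda t: t >= 1.0),
])
trace = [table.step(0.5) for _ in range(4)]
```

In a real table the transition condition would more likely depend on sensor readings or actuator positions than on elapsed time alone.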
A second major avenue of research considers evolving controllers based on neural networks. HyperNEAT  provides a particularly popular approach to evolving robot controllers, in particular for locomotion tasks. HyperNEAT's indirect encoding for neural networks separates the genotype, a compositional pattern-producing network (CPPN), from the phenotype, the substrate. CPPNs are directed graphs with weighted edges in which each node is a mathematical function like the sine, cosine, or Gaussian (similar to neural networks but with a variety of activation functions). The evolved CPPNs specify the weights of the substrate, the neural network that actually controls the robot. Several studies have shown that HyperNEAT is capable of creating efficient gaits for modular and for legged robots [7, 43, 20].
Another successful approach that has received much attention is based on central pattern generators (CPGs). CPGs model neural circuitry found in vertebrates that outputs cyclic patterns without requiring a cyclic input . Each actuator in a robotic organism is controlled by the output of a CPG, and the CPGs are connected to allow them to synchronize and maintain a certain phase difference pattern. Although sensory input is not strictly required for CPGs, it can be incorporated to shape the locomotion pattern to allow for turning and modulating the speed. This technique has been shown to produce well-performing and stable gaits on both non-modular robots  and modular multi-robotic organisms [36, 26, 25].
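The synchronization behavior described above can be sketched with a chain of coupled phase oscillators. This is a generic Kuramoto-style model chosen for illustration, not the CPG model of any cited study; the function names, the coupling strength, and the Euler integration step are assumptions.

```python
import math


def simulate_cpg(phase_offsets, frequency_hz=1.0, coupling=2.0, dt=0.01, steps=2000):
    """Integrate a chain of coupled phase oscillators with explicit Euler steps.

    phase_offsets[i] is the desired value of phase[i + 1] - phase[i].
    Returns the final phases in radians.
    """
    n = len(phase_offsets) + 1
    phases = [0.0] * n
    omega = 2.0 * math.pi * frequency_hz
    for _ in range(steps):
        rates = []
        for i in range(n):
            rate = omega
            if i > 0:  # pull towards the desired offset with the previous oscillator
                rate += coupling * math.sin(phases[i - 1] - phases[i] + phase_offsets[i - 1])
            if i < n - 1:  # pull towards the desired offset with the next oscillator
                rate += coupling * math.sin(phases[i + 1] - phases[i] - phase_offsets[i])
            rates.append(rate)
        phases = [p + dt * r for p, r in zip(phases, rates)]
    return phases


def joint_targets(phases, amplitude=0.5):
    """Map oscillator phases to cyclic actuator targets."""
    return [amplitude * math.sin(p) for p in phases]


# Three oscillators that should settle a quarter cycle apart.
final_phases = simulate_cpg([math.pi / 2, math.pi / 2])
```

No input signal drives the oscillators, yet the outputs are cyclic and settle into the requested phase difference pattern.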
Recently, a technique based on artificial hormones has been investigated for the locomotion of modular robotic organisms. In this technique artificial hormones are created within robot modules as a response to sensory inputs. These hormones can interact with each other, diffuse to neighboring modules, and act upon output hormones. These output hormones are then used to drive the actuators [21, 34].
Furthermore, there are gait learning techniques that employ reinforcement learning algorithms; the specific approaches used can range from temporal difference learning to expectation maximization. Temporal difference learning seeks to minimize an error function between estimated and empirical results of a controller. Expectation maximization estimates controller parameters so as to maximize the reward gained. These algorithms have been used on modular (e.g., ) as well as non-modular robots (e.g., ).
Although there is extensive previous work on this issue, we must stress that, of the techniques described above, only the ones in , , and  were actually tested on multiple shapes. In this category, the most closely related work is the recent article by Rossi and Eiben, who use a similar test suite of twelve modular robots and online evolution running on every robot .
3 Experimental Setup
As mentioned in Section 1, our primary goal is to assess the suitability of a particular reinforcement learning approach called RL PoWER for online gait learning. To this end, we test the RL PoWER algorithm on various hand-designed organism morphologies with different sizes and complexity. To obtain a relative assessment, we compare the performance of RL PoWER with online implementations of simulated annealing and HyperNEAT. Furthermore, we apply a powerful offline optimization method, CMA-ES, with a high computational budget to obtain a good indication of the practical maximum of robot speeds. This provides benchmark results to position the (online) learning performance of RL PoWER more precisely.
Our second goal is to investigate in which way the morphology of an organism influences the efficacy of the learning process. We apply the RL PoWER method on a large number of randomly generated morphologies, and characterize their shapes in terms of traits such as size, number of extremities, number of effective degrees of freedom (DoF), and effective DoF per extremity. An experimental analysis shows which of these traits provide good predictors for the learned gait.
3.1 Robot Morphologies
All experiments were carried out in simulation with the Webots simulator (Cyberbotics), using the YaMoR module as the building block for the organisms . A YaMoR module (see Figure 2) is made of a static body and a joint on its front that has a single degree of freedom and an operating range of . It also has two connectors, one on the joint and one in the back of the body, which allow one to connect modules at arbitrary angles. The original YaMoR model was modified in three ways to accommodate the current experiments. We added two extra connectors on the left and right sides of the body in a central position, allowing the construction of more complex morphologies. We reduced the width of the joint by removing its lateral protrusions, thus eliminating the possibility of collisions with modules connected to the sides. Lastly, we added a global positioning sensor to a centrally located module to enable exact measurements of the organism's locomotive success.
The environment chosen for the experiments is an infinite plane free of obstacles so as to avoid any extra complexity and the need of supervision. Each experiment starts with the organism lying completely flat at the plane origin. The experiments for RL PoWER, simulated annealing, and HyperNEAT were performed 30 times for each organism, with different random seeds.
To compare the performance levels achieved by the learners, this article uses the speed achieved in the final evaluation of each run. A run may therefore score poorly simply because its last evaluation happened to consider a poor candidate controller. As a result, the measure provides an accurate reflection of the actual performance of the robots during online learning, putting learners with erratic performance across evaluations (e.g., because they consider many poor controllers as well as good ones) at a disadvantage.
3.2 Robot Controllers
The HyperNEAT experiments use neural-net-based controllers that are described in Section 3.5.
3.3 RL PoWER
The RL PoWER algorithm was proposed by Kober and Peters . This reinforcement learning algorithm is based on an expectation-maximization approach to estimate the parameters of a policy to maximize the reward gained.
The algorithm creates the initial policy with as many splines as there are modules, each having two control points. These control points are initialized at 0.5 and then perturbed using Gaussian noise. The algorithm then enters an evaluation-adaptation loop to refine the policy, until the stopping condition is reached. A ranking of the k best policies encountered so far is kept to inform the adaptation of the current policy.
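The policy representation can be sketched as follows. This is a simplified illustration: it uses periodic piecewise-linear interpolation as a stand-in for the splines, and the noise level `noise_std`, the function names, and the resampling helper `grow` are assumptions that do not come from the source.

```python
import random


def init_policy(n_modules, n_points=2, noise_std=0.05, seed=0):
    """One list of control points per module, started at 0.5 plus Gaussian noise."""
    rng = random.Random(seed)
    return [[0.5 + rng.gauss(0.0, noise_std) for _ in range(n_points)]
            for _ in range(n_modules)]


def cyclic_value(control_points, phase):
    """Periodic piecewise-linear interpolation; the phase wraps around at 1.0."""
    n = len(control_points)
    position = (phase % 1.0) * n
    left = int(position) % n
    right = (left + 1) % n          # the last control point connects back to the first
    frac = position - int(position)
    return (1.0 - frac) * control_points[left] + frac * control_points[right]


def grow(control_points, new_size):
    """Resample the cyclic curve at new_size evenly spaced phases."""
    return [cyclic_value(control_points, i / new_size) for i in range(new_size)]


policy = init_policy(n_modules=7)
refined = grow([0.0, 1.0], 4)
```

Resampling keeps the curve unchanged at the new control points, so a refinement of this kind adds resolution without discarding what has been learned.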
Adaptation consists of three components: spline size increase, exploitation, and exploration. The spline is gradually refined by incrementing the number of control points periodically as proposed in . In the exploitation step, the current parameters are adapted based on the values of the k best policies. In the exploration phase policies are adapted by applying Gaussian perturbation to the policy resulting from exploitation. Over the course of the run the variance σ² is diminished, which decreases exploration and increases exploitation. The pseudocode for the algorithm, as defined in , is displayed in 1.
The operating parameters for RL PoWER, such as the variance and its decay factor, as well as the reward function, were taken from . The values were 0.008 for the variance and 0.98 for the variance decay; k is set to 10. The splines are initialized with two control points and are allowed to grow to a maximum of 100 control points over the course of a run; in this case the spline is grown every 10 evaluations . ϵ is a parameter to avoid division by 0 and is set to 10⁻¹⁰.
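The evaluation-adaptation loop with these parameter values might look roughly like the sketch below. It assumes one common simplified form of the PoWER update, a reward-weighted step towards the k best policies followed by Gaussian exploration; the exact update used in the experiments may differ. Spline growth is omitted, and the toy reward function stands in for the speed measured in simulation.

```python
import random


def power_update(current, ranking, sigma2, rng, eps=1e-10):
    """Exploitation then exploration for one flat parameter vector.

    ranking holds (reward, parameters) pairs for the best policies so far.
    """
    total = sum(reward for reward, _ in ranking) + eps   # eps avoids division by 0
    exploited = [
        c + sum(reward * (params[i] - c) for reward, params in ranking) / total
        for i, c in enumerate(current)
    ]
    std = sigma2 ** 0.5
    return [x + rng.gauss(0.0, std) for x in exploited]


def learn(reward_fn, n_params=2, evaluations=200, k=10, sigma2=0.008, decay=0.98, seed=0):
    """Run the loop and return the best (reward, parameters) pair encountered."""
    rng = random.Random(seed)
    policy = [0.5 + rng.gauss(0.0, sigma2 ** 0.5) for _ in range(n_params)]
    ranking = []
    for _ in range(evaluations):
        ranking.append((reward_fn(policy), policy))
        ranking = sorted(ranking, key=lambda item: item[0], reverse=True)[:k]
        policy = power_update(policy, ranking, sigma2, rng)
        sigma2 *= decay                                   # less exploration over time
    return ranking[0]


def toy_reward(params):
    """Hypothetical non-negative reward with its maximum at 0.8 in every dimension."""
    return 1.0 / (1.0 + sum((p - 0.8) ** 2 for p in params))


best_reward, best_params = learn(toy_reward)
```

Because the variance shrinks by a constant factor per evaluation, the point at which exploration effectively stops depends only on the initial variance and the decay factor, not on the robot being controlled.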
3.4 Simulated Annealing
3.5 HyperNEAT
HyperNEAT offers a popular and competitive neuroevolutionary approach to robot locomotion, which is typically used in offline applications . HyperNEAT evolves a neural network's connectivity pattern indirectly, using a generative encoding called a compositional pattern-producing network (CPPN). Like a neural network, a CPPN is a network of mathematical functions with weighted connections. Unlike neural networks, it can contain a variety of activation functions, including the sine, cosine, Gaussian, and sigmoid. To determine the weight of a connection in the neural network that controls the robot (the substrate), the coordinates of the two substrate nodes are fed into the CPPN, which then returns the connection weight .
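The way a CPPN is queried for substrate weights can be sketched as follows. The network here is a fixed, hand-written stand-in for an evolved CPPN, and the function names and node coordinates are assumptions for illustration only.

```python
import math


def toy_cppn(x1, y1, x2, y2):
    """Hand-written stand-in for an evolved CPPN that mixes several activation functions."""
    periodic = math.sin(2.0 * (x1 - x2))           # sine node
    local = math.exp(-((y1 - y2) ** 2))            # Gaussian node
    return math.tanh(1.5 * periodic + 0.5 * local)  # squashed output in [-1, 1]


def substrate_weights(sources, targets, cppn):
    """Query the CPPN once per connection, using the coordinates of both end nodes."""
    return {(s, t): cppn(s[0], s[1], t[0], t[1]) for s in sources for t in targets}


# Hypothetical 2 x 2 input layer connected to a 2 x 2 hidden layer.
layer = [(x, y) for x in (-1.0, 1.0) for y in (-1.0, 1.0)]
weights = substrate_weights(layer, layer, toy_cppn)
```

Evolution then acts on the structure and weights of the CPPN, while the substrate weights are always regenerated from it.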
The HyperNEAT experiments do not use a spline-based controller, but a neural network substrate that controls a closed-loop gait and is composed of three layers: input, hidden, and output. Each layer is an m × n matrix of nodes where m = (OrganismSize_x × 2) − 1 and n = OrganismSize_y × 2, with OrganismSize_x and OrganismSize_y the sizes of the organism along the x and y axes, respectively, measured in number of modules. The inputs are the angular positions of each module's servo at the previous time step together with a sinusoidal signal s = sin(ωt), where ω represents the maximum angular velocity of the modules' servo and t the current time. The network outputs the angular position of each module servo for the current time step. The input and output signals are scaled to and from the interval [−1, +1]. The population size is 25; further HyperNEAT settings have the standard values as found in J. Gauci's implementation, on which we based our code .
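As a small worked example of the layer dimensions and the sinusoidal input, consider the sketch below. The function names and the example organism extent are assumptions; only the two formulas come from the text.

```python
import math


def substrate_layer_shape(size_x, size_y):
    """Rows and columns of each substrate layer for an organism of size_x by size_y modules."""
    return (2 * size_x - 1, 2 * size_y)


def oscillator_input(omega, t):
    """Sinusoidal input signal s = sin(omega * t)."""
    return math.sin(omega * t)


# Hypothetical example: an organism spanning 7 modules along x and 1 along y.
m, n = substrate_layer_shape(7, 1)
nodes_per_layer = m * n
```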
Just as for the RL PoWER and simulated annealing experiments, this article considers the results of running HyperNEAT in an online setting with the evaluations in series on the robot, without resetting or repositioning the robot between evaluations.
3.6 Practical Optimum
To have a good indication of the practical maximum of attainable robot speeds, we generate benchmark controllers with an established powerful solver for complex numerical optimization problems, CMA-ES . In contrast to the algorithms we investigate, CMA-ES runs offline. We use standard settings and a computational budget of 4,000 evaluations per run for each organism. (The online learners have a budget of 1,000 evaluations.) The listed practical optimum for each shape is the best controller obtained over five replicate CMA-ES runs for that shape.
3.7 Test Suite I: Designed Shapes
To assess the applicability of the RL PoWER algorithm for gait learning, we test it on nine designed robotic organisms with different sizes and shapes as shown in Figure 4. We compare the RL PoWER results with those of simulated annealing and HyperNEAT.
The size and complexity of the organisms are measured by the number of modules and by the number of extremities, respectively. The experiments were conducted with three complexity levels: organisms with two extremities (I-shape), three extremities (T-shape), and four extremities (H-shape). We created each shape in three sizes: 7, 11, and 15 modules. Figure 4 shows simulation screenshots of the shapes in their initial, prone state.
3.8 Test Suite II: Randomly Generated Shapes
The context of a system where the robots' shapes evolve requires a learning algorithm that performs well on arbitrary shapes. We recreate this context by creating a number of shapes randomly and performing 30 replicate runs of the RL PoWER algorithm on each shape. This allows us to gauge RL PoWER's efficacy also on shapes that may not be very practicable for locomotion. Secondly, it enables us to investigate the effects of shape and size of an organism further than we could with the limited set of designed shapes.
The second test suite was constructed by generating 270 random shapes as follows. The size of the organism is determined by drawing a random number between 3 and 15 (uniform distribution). Construction of the organism then starts on the basis of a single root module. The organism is extended to the target size by selecting a random open connector on the organism and attaching a module to that connector with orientation randomly selected as one of 0°, 90°, 180°, or 270°. This process was repeated to generate 270 unique shapes (duplicate shapes were discarded and regenerated).
Because the YaMoR module is not symmetrical and the side connectors are not completely centered, the resulting shapes can have conflicts in the module placements. These conflicts have been resolved manually by rotating one of the conflicting modules. In most cases this repair did not result in a functionally different shape. In some cases the conflicts had to be resolved through multiple rotations or a rotation that did result in a functionally different shape. For the purpose of these investigations such functionally different shapes can still be considered randomly generated.
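The generation procedure can be sketched as below. This is a purely topological illustration under several assumptions: the connector names, the choice of which connector the new module attaches with, and the duplicate check (exact equality of the construction record rather than geometric equivalence) are not taken from the source, and placement conflicts are not detected since they were resolved manually.

```python
import random

CONNECTORS = ("joint", "back", "left", "right")  # modified YaMoR module: four connectors
ORIENTATIONS = (0, 90, 180, 270)                 # degrees


def random_shape(rng, min_size=3, max_size=15):
    """Return (size, edges).

    Each edge is (parent, parent_connector, child, child_connector, orientation).
    """
    size = rng.randint(min_size, max_size)            # uniform, both ends inclusive
    open_connectors = [(0, c) for c in CONNECTORS]    # start from a single root module
    edges = []
    for child in range(1, size):
        parent, parent_connector = open_connectors.pop(rng.randrange(len(open_connectors)))
        child_connector = rng.choice(CONNECTORS)      # assumption: any connector may attach
        edges.append((parent, parent_connector, child, child_connector,
                      rng.choice(ORIENTATIONS)))
        open_connectors.extend((child, c) for c in CONNECTORS if c != child_connector)
    return size, tuple(edges)


def generate_suite(count=270, seed=0):
    """Collect count shapes, discarding and regenerating exact duplicates."""
    rng = random.Random(seed)
    seen, suite = set(), []
    while len(suite) < count:
        size, edges = random_shape(rng)
        if edges in seen:
            continue
        seen.add(edges)
        suite.append((size, edges))
    return suite


suite = generate_suite()
```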
The designed shapes were characterized in terms of size and the number of extremities. For the generated shapes we consider two additional metrics: the number of effective joints, and the number of effective joints divided by the number of extremities. These metrics are explained in more detail below.
Size The size of a shape is defined by the number of modules. The distribution of sizes is visualized in Figure 5a.
Extremities We record the number of extremities, or feet, as a measure of the complexity of an organism. This metric counts the modules that are connected on only one side. Figure 5b shows the distribution of shapes by number of extremities. As can be seen, four extremities is the most common, while shapes with more than six extremities are very rare.
Effective Joints The YaMoR modules can be attached to each other either at the moving arm (gray in the model) or on one of the three sides of the module body (green in the model). Attaching a module using its body will leave the arm free; however, this arm is not very long, which means it cannot be used to push the organism effectively. In more technical terms, leaving the arm free leads to a small moment of force. Attaching a module using the arm effectively lengthens the arm where the force is applied, which increases the moment of this force. Having multiple modules in the organism connected with their arms allows a further increase of the moment of force.
We quantify this by counting the modules that are connected at their arms. These are the effective joints. Figure 6a shows the distribution of shapes by number of effective joints. We can see most shapes have between 2 and 7 effective joints, with a few having none at all and very few having more than 8.
Effective Joints per Extremity Finally, we hypothesize that it is not just the number of extremities or the number of effective joints that predicts the performance of a shape well, but the number of effective joints per extremity. Figure 6b shows the distribution of that ratio. We have binned it into intervals of size 0.2, with values less than 0.3 and greater than 2.1 binned in the lowest and the highest interval, respectively. A ratio of 1 effective joint per extremity is by far the most common, but values from 0.5 to 1.7 are also prevalent. A high value indicates a shape that is able to move its body in many ways. With the YaMoR modules, such shapes are usually oblong, or snakelike.
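The four traits can be computed from a connection list as in the sketch below. The edge format and the connector name "joint" for the moving arm are assumptions made for the example, and the reading of "connected at their arms" as "the module's own arm connector is in use" is one plausible interpretation.

```python
def morphology_traits(n_modules, edges):
    """Compute size, extremities, effective joints, and their ratio.

    Each edge is (parent, parent_connector, child, child_connector).
    """
    degree = [0] * n_modules
    arm_attached = [False] * n_modules
    for parent, parent_connector, child, child_connector in edges:
        degree[parent] += 1
        degree[child] += 1
        if parent_connector == "joint":
            arm_attached[parent] = True
        if child_connector == "joint":
            arm_attached[child] = True
    extremities = sum(1 for d in degree if d == 1)   # connected on only one side
    effective_joints = sum(arm_attached)             # modules connected at their arm
    ratio = effective_joints / extremities if extremities else None
    return {
        "size": n_modules,
        "extremities": extremities,
        "effective_joints": effective_joints,
        "joints_per_extremity": ratio,
    }


# Hypothetical snake of five modules, each arm attached to the back of the next module.
snake = morphology_traits(5, [(i, "joint", i + 1, "back") for i in range(4)])

# Hypothetical T-like shape: three modules attached by their backs to one central module.
tee = morphology_traits(4, [(0, "joint", 1, "back"), (0, "left", 2, "back"),
                            (0, "right", 3, "back")])
```

The snake reaches two effective joints per extremity, in line with the observation that high ratios belong to oblong shapes.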
4 Results and Analysis
4.1 Comparing Online Learners
This section compares the performance of RL PoWER, simulated annealing, and HyperNEAT as online learners of locomotion. These experiments consider only test suite I—the designed shapes described in Section 3.7. The results for RL PoWER and simulated annealing are presented in three groups of plots, one for each morphology (I, T, and H shaped) with plots for three sizes (7, 11, and 15 modules). Each plot shows the median performance as well as the interquartile range over 30 replicate runs against time. The performance of the practical optimum (achieved with CMA-ES running offline with a higher evaluation budget) is noted in the relevant tables, but not shown in the plots, to maintain detail for the comparison of RL PoWER and simulated annealing. The HyperNEAT results are discussed separately in Section 4.1.2 to highlight the very different online dynamics of this algorithm.
The graphs in Figures 7, 8, and 9 compare the median speed achieved by RL PoWER and simulated annealing in 30 replicate runs. Consider the convergence time of RL PoWER and simulated annealing. RL PoWER converges after roughly 400 evaluations (≈2.6 h of simulated time), regardless of the organism's shape or size. This is a direct consequence of the setting for RL PoWER's variance decay parameter. The convergence time for simulated annealing is not as constant and seems to depend on the shape rather than the size. In many cases, particularly for I-shaped organisms, the algorithm is still improving its behavior when the experiments finish.
The median speed achieved by the controllers of RL PoWER reaches roughly one-third of the practical optimum (the performance reached by CMA-ES running offline). In most cases simulated annealing fares worse than RL PoWER, particularly on the T and H shapes. On the I shapes simulated annealing fares much better, even matching RL PoWER's performance for the I-7 shape. Once converged, RL PoWER provides consistent performance, while simulated annealing's performance of consecutive policies is more erratic. To quantify the difference in performance between RL PoWER and simulated annealing, we performed Wilcoxon rank-sum tests to see if their medians differed significantly. The results for these tests can be seen in Table 1. RL PoWER performs significantly better than simulated annealing in six out of the nine cases.
| | I | T | H |
Notes. Significant results (p < 0.05) are displayed in bold.
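For readers who want to reproduce this kind of comparison, a Wilcoxon rank-sum test on the final speeds of two learners can be sketched as below. This version uses the normal approximation with a tie correction and no continuity correction, which is an assumption and may differ in detail from the statistical software used for the reported results.

```python
import math
from collections import Counter


def mean_ranks(values):
    """Map each distinct value to its rank, with ties sharing the mean rank."""
    counts = Counter(values)
    ranks, position = {}, 0
    for value in sorted(counts):
        tied = counts[value]
        ranks[value] = position + (tied + 1) / 2.0
        position += tied
    return ranks, counts


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test; returns (U statistic of x, p-value)."""
    n1, n2 = len(x), len(y)
    total = n1 + n2
    ranks, counts = mean_ranks(list(x) + list(y))
    u = sum(ranks[v] for v in x) - n1 * (n1 + 1) / 2.0
    ties = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12.0 * ((total + 1) - ties / (total * (total - 1)))
    if variance == 0:
        return u, 1.0
    z = (u - n1 * n2 / 2.0) / math.sqrt(variance)
    return u, math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical final speeds (m/s) of two learners over four runs each.
u_stat, p_value = rank_sum_test([0.01, 0.03, 0.05, 0.07], [0.02, 0.04, 0.06, 0.08])
```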
A complete run of 1,000 evaluations equates to roughly 6.5 h of simulated time. In light of current hardware limitations, where reliable operating times of over 4 consecutive hours are rare, this is too optimistic a scenario to be feasible in real hardware. Fortunately, RL PoWER converges quite fast, and exploratory experiments indicate that we may be able to reduce this problem without sacrificing performance by using shorter evaluation periods.
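The time budget can be made concrete with a little arithmetic. The sketch below assumes that all evaluations take equally long, which the text implies but does not state; the constant and function names are illustrative.

```python
TOTAL_EVALUATIONS = 1000
TOTAL_HOURS = 6.5

# Duration of one evaluation, assuming all evaluations take equally long.
SECONDS_PER_EVALUATION = TOTAL_HOURS * 3600 / TOTAL_EVALUATIONS


def simulated_hours(evaluations):
    """Simulated time needed for a given number of evaluations."""
    return evaluations * SECONDS_PER_EVALUATION / 3600


def evaluations_within(hours):
    """Number of complete evaluations that fit into a given operating time."""
    return int(hours * 3600 // SECONDS_PER_EVALUATION)


convergence_hours = simulated_hours(400)   # RL PoWER converges after roughly 400 evaluations
budget_in_4_hours = evaluations_within(4)  # reliable hardware operating time
```

Under this assumption each evaluation lasts about 23.4 s, and a 4 h operating window holds roughly 615 evaluations, which is more than the approximately 400 evaluations RL PoWER needs to converge.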
Organisms with the same shape but with different size are very similar in performance. Organisms with the same size but different shape, however, show very large differences in performance. A Kruskal-Wallis test corroborates these results statistically, as shown in Table 2. Consider, for instance, the cell for simulated annealing and I in Table 2a (lower left). The p-value there (10⁻⁴) indicates strong evidence that there is a relation between the size of an I-shaped robot and the speed achieved at the end of the 30 replicate runs with simulated annealing. The p-value for RL PoWER and 15 in Table 2b (upper right) indicates strong evidence of a relation between speed and the kind of shape (I, T, or H) for the organisms of size 15.
| (a) | I | T | H | (b) | 7 | 11 | 15 |
|---|---|---|---|---|---|---|---|
| RL PoWER | 0.792 | 0.065 | 0.038 | RL PoWER | e−05 | e−04 | e−06 |
| Sim. anneal. | e−04 | 0.116 | 0.002 | Sim. anneal. | e−10 | 0.0017 | e−08 |
Notes. The robots are grouped according to size (a) or complexity (b). A low p-value indicates evidence that the size (a) or the shape (b) correlates with organism speed. Significant results (p < 0.01) in bold.
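A Kruskal-Wallis test for three groups, such as the three sizes of one shape or the three shapes of one size, can be sketched as follows. The sketch relies on the chi-square approximation, whose survival function for two degrees of freedom is simply exp(−H/2); the function name and the example data are assumptions, and the reported p-values may have been computed with different software.

```python
import math
from collections import Counter


def kruskal_wallis_three_groups(a, b, c):
    """Return (H, p-value) for three samples, using the chi-square approximation (2 df)."""
    groups = [list(a), list(b), list(c)]
    pooled = [v for group in groups for v in group]
    total = len(pooled)

    # Rank the pooled data; tied values share the mean of their ranks.
    counts = Counter(pooled)
    ranks, position = {}, 0
    for value in sorted(counts):
        tied = counts[value]
        ranks[value] = position + (tied + 1) / 2.0
        position += tied

    rank_term = sum(sum(ranks[v] for v in group) ** 2 / len(group) for group in groups)
    h = 12.0 / (total * (total + 1)) * rank_term - 3.0 * (total + 1)

    correction = 1.0 - sum(t ** 3 - t for t in counts.values()) / (total ** 3 - total)
    if correction == 0:           # all observations identical
        return 0.0, 1.0
    h /= correction
    return h, math.exp(-h / 2.0)  # chi-square survival function for 2 degrees of freedom


# Hypothetical, fully separated samples.
h_stat, p_value = kruskal_wallis_three_groups([1, 2, 3], [4, 5, 6], [7, 8, 9])
```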
The difference in performance between organisms of the same complexity, but with different sizes, is nowhere significant for RL PoWER. For simulated annealing different sizes lead to significantly different speeds for the I and the H shapes. With the I shape size 7 is faster than 11 or 15; with the H shape size 11 is fastest. The difference in performance between organisms of the same size but different shape is significant in all cases. With either learner the I shape always leads to significantly faster gaits than the T and H shapes, regardless of size. With simulated annealing the T-11 shape is also significantly faster than the H-11 shape.
This supports the conclusion that with either algorithm the complexity of the shape has a larger influence on the performance than the size of the shape. The experiments with test suite II investigate this further.
The interquartile ranges in Figures 7 to 9 show substantial variation in the quality achieved in individual runs. To assess the reliability of RL PoWER and simulated annealing, Figure 10 shows bihistograms with the distribution of runs based on the performance of the last evaluated controller. The blue histograms represent RL PoWER results; the red histograms (extending downwards) represent simulated annealing results. In each histogram, a vertical line indicates the corresponding median performance. We define a bad run as one that achieves a final speed lower than 0.025 m/s. The graphs show that across the experiments there is a minimum of 1 bad run and a maximum of 9 bad runs for RL PoWER, while there is a minimum of 1 bad run and a maximum of 15 bad runs for simulated annealing.
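The bad-run criterion is simple enough to state as code. The sketch below only assumes that the final speed of each replicate run is available as a number in m/s; the function name and the example values are hypothetical.

```python
BAD_RUN_THRESHOLD = 0.025  # m/s, the threshold given in the text


def count_bad_runs(final_speeds, threshold=BAD_RUN_THRESHOLD):
    """Count runs whose last evaluated controller is slower than the threshold."""
    return sum(1 for speed in final_speeds if speed < threshold)


# Hypothetical final speeds (m/s) of a handful of replicate runs.
example_runs = [0.131, 0.002, 0.087, 0.024, 0.025, 0.140]
print(count_bad_runs(example_runs), "bad runs out of", len(example_runs))
```

Note that the comparison is strict, so a run that ends at exactly 0.025 m/s does not count as bad.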
In such bad runs the speed of the organism typically increases until it suddenly drops. The reason for this sudden decrease in speed is that a particular gait during the learning process made the organism lose its balance and flip on its side (I-shaped), head (T-shaped), or back (H-shaped). This poses such a radical change in circumstances that neither learning process was able to find a good controller for the new situation. This falling behavior can have two causes. First, the online paradigm implies that the organism's stance or position is not reset (e.g., to a default stable stance) between evaluations. Consequently, each controller's performance is affected by the state the organism was left in by the previous one, meaning that a good move in one stance may lead to disastrous behavior in another. Second, some organism morphologies may be more prone than others to lose their balance due to detrimental gaits. In  such detrimental gaits are filtered out based on knowledge of the organism's shape and size; however, we consider a context of arbitrary shapes and sizes, and such filtering is not possible, as we do not have prior knowledge of the size or shape.
Although bad runs are undesirable, this need not be a problem in an evolutionary system such as our triangle-of-life scenario, because it calls for a substantial population of robotic organisms. In such a population, bad or unlucky shapes will not be able to reproduce and will disappear over time, but shapes that are well suited for balanced locomotion will be able to prevail.
For both RL PoWER and simulated annealing the number of bad runs seems to depend more on the shape than on the size of the organism. For instance, I-shaped runs have few bad runs, while the T and H shapes lead to substantial numbers of bad runs. Further experiments with randomly generated morphologies in Section 4.3 will allow us to assess if this relationship between the number of bad runs and organism complexity holds more generally.
Neural networks are commonly used in evolutionary robotics settings such as these, and as mentioned earlier, HyperNEAT has proved an effective method in this area. Figure 11 shows the results of experiments using HyperNEAT for online learning of gaits, similar to the experiments with RL PoWER and simulated annealing. The robot runs HyperNEAT locally, serially evaluating individuals in the time-sharing scheme that is typical for encapsulated evolution. The evaluation duration of individual controllers for these experiments is the same as for the RL PoWER and simulated annealing experiments; the population size is 25. For further details of the algorithm settings, we refer to the code at http://github.com/ci-group/tol-project.git.
Figure 11 shows two sets of results for each of the predefined shapes, accumulated over 30 replicate runs. The blue dots indicate the performance of the individuals of the best run (defined by the last evaluation), and the red dots indicate the median performance over the 30 runs.
The results for HyperNEAT show very competitive controllers, in many instances outperforming the best results of RL PoWER and simulated annealing. The controllers from the best run (blue dots) are among the best found in our experiments. On the other hand, the median performances are much worse than those for RL PoWER: HyperNEAT succeeds in finding very good controllers at the cost of evaluating many poorly performing controllers as well. Therefore, the HyperNEAT results are excluded from Figure 10.
This highlights an important characteristic of online evolution: Because the robot controllers evolve while the robots perform their tasks, the robots' actual performance is determined by the quality of all the controllers they evaluate, not only by the best controllers they consider. In reinforcement learning terms, one must consider the balance between exploration and exploitation when employing evolution in online scenarios. This is a radical departure from the optimization-centered mindset in most evolutionary robotics research, which implies offline development of controllers that do not evolve once deployed.
Thus, algorithms that yield very good results in many offline neuroevolution applications are not necessarily useful for online learning. In these cases, alternative implementations with a different exploration-exploitation balance must be considered. Silva et al., for instance, research an online variant of NEAT, odNEAT, that may provide a basis for a relevant HyperNEAT implementation. Developing such an implementation is beyond the scope of the current comparison, however.
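The difference between judging a learner by its best controller and judging it by everything it evaluates online can be illustrated with a small sketch. The two speed sequences below are invented for illustration and do not come from the experiments; `summarize` is a hypothetical helper.

```python
from statistics import median


def summarize(evaluated_speeds):
    """Summarize every controller evaluated during one online run."""
    return {
        "best": max(evaluated_speeds),
        "median": median(evaluated_speeds),
        "mean": sum(evaluated_speeds) / len(evaluated_speeds),
    }


# Hypothetical speeds (m/s) of all controllers tried by two learners.
exploratory = [0.01, 0.20, 0.02, 0.01, 0.18, 0.02]
conservative = [0.08, 0.09, 0.10, 0.10, 0.11, 0.10]

for name, speeds in (("exploratory", exploratory), ("conservative", conservative)):
    stats = summarize(speeds)
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

In this made-up example the exploratory learner finds the single fastest controller, yet the robot running the conservative learner moves faster for most of its lifetime, which is the quantity that matters when learning happens while the task is being performed.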
4.2 RL PoWER on Random Shapes
The results on the designed shapes indicate that RL PoWER is a promising candidate learning algorithm in the triangle-of-life context, and the following experiments investigate how this algorithm holds up when applied to randomly generated morphologies. The results show that RL PoWER typically converges within 400 evaluations, so these experiments use 400 evaluations instead of 1,000. The settings for evaluation and recovery time remain the same, at 1,485 evaluation steps and 198 recovery steps (23.76 and 3.168 s), respectively. Thus a full run now represents roughly 2.6 h of simulated time (i.e., the robot “experiences” 2.6 h of learning within the simulation), which is within the maximum consecutive running time expected of hardware implementations.
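The time budget can be checked with a short calculation. The sketch below derives the step length from the stated figures (23.76 s for 1,485 steps) and assumes that a run consists of 400 evaluations, each optionally followed by one recovery period; the function name is hypothetical.

```python
EVALUATIONS = 400
EVAL_STEPS = 1485       # simulation steps per evaluation
RECOVERY_STEPS = 198    # simulation steps per recovery period
EVAL_SECONDS = 23.76    # duration of one evaluation

STEP_SECONDS = EVAL_SECONDS / EVAL_STEPS  # about 0.016 s per step


def run_hours(evaluations, steps_per_cycle, step_seconds=STEP_SECONDS):
    """Simulated time of a full run, in hours."""
    return evaluations * steps_per_cycle * step_seconds / 3600.0


evaluation_only = run_hours(EVALUATIONS, EVAL_STEPS)
with_recovery = run_hours(EVALUATIONS, EVAL_STEPS + RECOVERY_STEPS)
print(f"evaluation only: {evaluation_only:.2f} h")
print(f"with recovery:   {with_recovery:.2f} h")
```

Under these assumptions the evaluation periods alone add up to about 2.64 h, in line with the roughly 2.6 h quoted above; if the recovery periods are counted as well, the same arithmetic gives about 2.99 h.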
A histogram of the median speed from 30 repetitions for all 270 shapes is displayed in Figure 12a. It shows that most shapes reach a speed between 0.03 and 0.08 m/s. This is somewhat lower than the speeds of the I shapes in the previous section, but is in line with the speeds achieved with the T and H shapes.
A few shapes perform very well with speeds up to almost 0.1 m/s, while others lead to a very low median speed. This leads to the conclusion that the generated shapes span a large part of the possible performances.
For a closer look at the performance for particular shapes, 15 shapes are analyzed in detail in Figure 12b. These are the five shapes with the worst final controllers (shown in blue), the five shapes with around median performance (in red), and the five shapes with the best final controllers (shown in green). The plot shows whisker plots with median, quartile, minimum, and maximum distance covered with the final controller in the 30 replicate runs. Figures 13 and 14 show simulation screenshots of the worst and best shapes, respectively.
While the five worst shapes perform uniformly poorly, the performance spread for faster shapes is comparatively large. For instance, even though the median for shape 74 is among the best five, its worst-performing run is also one of the worst (bottom 10%). The high spread with individual shapes implies that RL PoWER does sometimes fail to learn a good gait, something that needs to be addressed for a real-life implementation of the triangle of life. A possible solution could be to introduce a restart mechanism; another would be to tune the evolutionary process in such a way that shapes with low speed still have a reasonable chance to reproduce.
The robots attain speeds ranging from 0 to 0.187 m/s, and the median speed across all runs is 0.0562 m/s, a performance that is overall somewhat lower than that with the designed shapes. To put this number in perspective: If a similar speed were to be achieved on real hardware (disregarding the reality gap for the purposes of this illustration), a speed of 0.0562 m/s would allow an organism to move roughly 600 m in 3 h. A real robot arena will likely be a lot smaller, so these kinds of speeds are more than adequate to travel across the arena and spread an organism's genome to other organisms. This warrants the conclusion that the RL PoWER algorithm works well enough on random shapes to merit further investigation, including trials with hardware implementations.
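The distance estimate follows directly from speed multiplied by time. A minimal check, using only the figures quoted above and a hypothetical helper name:

```python
def distance_m(speed_m_per_s, hours):
    """Distance covered at constant speed, in metres."""
    return speed_m_per_s * hours * 3600.0


median_speed = 0.0562  # m/s, median over all runs as reported in the text
print(f"{distance_m(median_speed, 3):.0f} m in 3 h")
```

This assumes a constant speed in a straight line, so it is an upper bound on the displacement of a robot that turns or stalls along the way.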
4.3 Influence of Morphology
The experiments with random shapes can also help initial modeling of the relation between morphology and locomotive success. Figures 13 and 14 show that the worst-performing robots are all of size 3, while the best-performing shapes are of size 8. Furthermore, the worst shapes have 0 or 1 effective joint, while the best have many more. This raises the question of how morphological features influence the performance of an organism and which morphological metrics are good predictors for the quality of the learned gait. The following paragraphs analyze the metrics defined in Section 3.8 in this respect. For each metric, the speeds achieved by the last controller in all repetitions with a particular shape are combined. For reference, the plots in Figures 15 and 16 also mark the median speeds achieved with the designed shapes as green triangles (I-shaped), yellow diamonds (T-shaped), and red circles (H-shaped).
Speed versus Size. Figure 15a shows the speed achieved in the last evaluation grouped by size. There is a clear trend for higher speed with larger size, flattening out for shapes larger than 9. The maximum speed is achieved by a shape with 10 modules; the lowest, by a shape with 3 modules.
Speed versus Number of Extremities. Figure 15b shows the speed achieved in the final evaluation grouped by the number of extremities. The highest speed is achieved by a shape with 4 extremities; the lowest, by a shape with 2 extremities. The trend in this graph is similar to that for size: There is a positive correlation between number of extremities and speed. Remember that the experiments with the designed shapes show that the I shape, with 2 extremities, was by far the fastest. The T shape, with 3 extremities, was somewhat slower, and the H shape, with 4 extremities, was the slowest. To answer the question posed at the end of Section 4.1.1 whether the positive correlation between number of bad runs and organism complexity holds more generally, we now see that the spread of the speeds in Figure 15 indicates that organisms with any number of extremities can have bad runs.
Speed versus Number of Effective Joints. Figure 16a shows the speeds achieved in the final evaluation, grouped by the number of effective joints. The results for 10 and 11 effective joints have been grouped with those with 9 effective joints because there were too few shapes with that many effective joints. The best speed was achieved with a shape that had 5 effective joints; the worst, with a shape that had 0 effective joints. The plot shows a clear trend of more effective joints leading to higher speeds. Intuitively this makes sense, as having articulated limbs allows for complex movement. The definition of effective joint means that only connections that can use the body as a lever and thus create momentum are counted in this metric. The number of effective joints, or some other measure for the ability to apply and leverage force, seems to be a good indicator for the performance of an organism.
Speed versus Joints per Extremity. Figure 16b shows the speeds achieved in the last evaluation grouped by number of joints per extremity. The highest speed is achieved with a shape with more than 2.1 effective joints per extremity; the lowest speed, with a shape with less than 0.3 effective joints per extremity.
Again, there is a very clear trend: A higher number of effective joints per extremity implies a morphology that achieves better speed.
4.3.1 Predicting Locomotion Speed
To quantify the relation between the proposed morphological metrics and speed, consider the linear models in Table 3. It shows the results of developing linear models for the morphological metrics separately and a model with all features combined (all models significant with a p-value of 0). Of the single features, the number of extremities results in the best fit (R2 = 0.246, i.e., the model explains 24.6% of the variance in speed); however, the number of effective joints per extremity offers more discriminatory power, as it has the highest coefficient (0.0235, i.e., increasing the number of effective joints per extremity by 1 results in an increase in speed of 0.0235 m/s).
| No. | Independent variables | R2 | F-score | Coefficient | |
| --- | --- | --- | --- | --- | --- |
| 4 | Effective joints per extremity | 0.201 | 2040 | Intercept | 0.0306 |
| | | | | Eff. joints per extr. | 0.0235 |
| 5 | Size, extremities, effective joints, and effective joints per extremity | 0.258 | 705 | Intercept | 0.0178 |
| | | | | Eff. joints per extr. | 0.0170 |
In model 5 all the features are combined. Here the R2 value is 0.258, only about one percentage point higher than that of model 2, but a substantial improvement over model 4. Models with two or three features were also created, but had R2 values lower than 0.258 and are omitted here. Model 5 combines a good R2 with the high coefficient of the number of effective joints per extremity and seems to provide a good starting point for a predictive model for the speed of an organism.
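As an illustration of how such a linear model could be used, the sketch below applies the coefficients of model 4 from Table 3, the single-feature model based on effective joints per extremity. The function names and the `min_speed` cut-off are hypothetical, and with an R2 of 0.201 the predictions should be read as rough indications only.

```python
# Coefficients of model 4 as listed in Table 3.
INTERCEPT = 0.0306  # m/s
SLOPE = 0.0235      # m/s per effective joint per extremity


def predict_speed(eff_joints_per_extremity):
    """Predicted final speed (m/s) from effective joints per extremity."""
    return INTERCEPT + SLOPE * eff_joints_per_extremity


def is_promising(eff_joints_per_extremity, min_speed=0.05):
    """Hypothetical viability filter; min_speed is an arbitrary example value."""
    return predict_speed(eff_joints_per_extremity) >= min_speed


for ratio in (0.0, 0.3, 1.0, 2.1):
    print(f"{ratio:.1f} eff. joints/extremity -> {predict_speed(ratio):.4f} m/s")
```

A filter of this kind is one way a robotic ecosystem could discard unpromising morphologies before construction; where to place the cut-off would have to be determined experimentally.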
4.4 Comparing Test Suites
With regard to the relation between morphology and speed, the results with the designed shapes seem at odds with those for randomly generated shapes. The speeds achieved with the designed shapes showed little difference when the size increased, but substantial difference when the number of extremities increased. In particular, there was a negative trend with regard to the number of extremities: The H shape with 4 extremities was much slower than the T shape (3 extremities), which in turn was much slower than the I shape (2 extremities).
Clearly, the I shapes have very high median speeds compared to the randomly generated ones. With a median performance around 0.13 m/s, the I shapes are at the top end of the graph for every characteristic considered. Also, the T shapes perform comparatively well when considering speed versus size (Figure 15a) and speed versus extremities (Figure 15b). However, with the metrics effective joints and effective joints per extremity (Figure 16a and b), T-shaped robots perform more in line with random shapes. Finally, the H shape has average to low performance across all metrics. In other words, what seemed to be a clear trend in our earlier experiments is likely an anomaly arising from the chosen shapes, which disappears under scrutiny of more extensive testing.
Looking at the original shapes in light of the body metrics introduced in Section 3.8, we can see that although the size and the number of extremities for the I, T, and H shapes are in the same range as those for the generated shapes, the number of effective joints and the number of joints per extremity are very high—so high, in fact, that the numbers are well outside the ranges of the generated shapes.
We can conclude that the set of designed shapes in the first series of experiments is not representative of the set of randomly generated shapes. Which of these two sets is more representative for shapes found during evolutionary exploration of the design space will depend on many factors, for instance the representation and the genetic operators. However, in the beginning of the evolutionary process the shapes will be more random. Therefore we deem large test suites preferable over small test suites, even if the latter are systematically composed.
In this article we have addressed the control-your-own-body problem that arises in populations of modular robotic organisms that evolve in real time. The main problem we considered is that the bodies and controllers of newborn robotic organisms are randomized recombinations of their parents and are unlikely to fit well. Therefore, every newborn robot needs to learn to control its own body quickly after birth. In this study we reduced this challenge to an online gait learning problem and investigated a solution by applying reinforcement learning. We conducted simulation experiments using robot morphologies of different size and complexity.
Our results show that the RL PoWER algorithm can successfully perform this learning task. It outperforms online implementations of simulated annealing and HyperNEAT and can find controllers with a speed that is about one-third to one-half of the practical optimum. To put this in perspective, the practical optimum was established by an offline optimizer whose computational budget was 50 times higher, and RL PoWER was used with the default parameter values without tuning it to the problem at hand. RL PoWER converges to a controller after around 400 evaluations (≈2.6 simulated hours) for all the designed shapes and sizes, making it feasible to learn a gait within the operational time of currently possible hardware implementations.
Successfully learning a gait seems to depend more on the shape than on the size of the robots, while the learning speed of RL PoWER seems independent of either of these factors. The differences between morphologies can be quite substantial: The I shape had very few unlucky runs in which the organism failed to move at all, whereas the failure rate for the H shape can be up to 20%. These failures are due to very bad gaits that cause the organism to lose its balance and flip on its side or back.
Analyzing the relation between morphological metrics and locomotive success of a proposed shape showed that the number of effective joints per extremity is a good predictor by itself. Combining all morphological metrics in a linear model provided a good starting point for a model that can give an a priori indication of an organism's performance. This could benefit the implementation of a robotic ecosystem to enable discarding particularly unpromising proposed morphologies. The morphological metrics used here were not systematically identified; further research into the relation between morphological traits and successful control (including for other tasks than locomotion) could take a more systematic approach that might yield better predictors for locomotive success.
These analyses also showed that the designed shapes in the first series of experiments are not representative of the shapes that may be found during evolutionary exploration of the design space and that small test suites, even if they are systematically composed, cannot be considered representative without experimental validation.
Further work towards the triangle of life is progressing along two lines. First, further experiments with RL PoWER are being conducted with the aim of tuning its parameters in order to reduce the time needed to achieve good gaits. Second, efforts are underway to implement and validate this approach on real hardware.
The authors would like to thank the reviewers for their insightful comments and suggestions that helped us immensely to improve the paper.
The quickness of simulations is indeed relative. For instance, the tests with the 270 random robot shapes took about 15 days of computing time for each learning mechanism on a 12-core i7 Mac Pro with the Webots simulator.
Pictures of all shapes and videos of the gaits they learned are available at http://www.few.vu.nl/∼bwl400/rlpower/.
In a real implementation of the triangle of life such conflicted shapes should of course be handled automatically.
University La Sapienza, Rome, Italy. E-mail: email@example.com