Abstract
In many reinforcement learning tasks, the goal is to learn a policy to manipulate an agent, whose design is fixed, to maximize some notion of cumulative reward. The design of the agent's physical structure is rarely optimized for the task at hand. In this work, we explore the possibility of learning a version of the agent's design that is better suited for its task, jointly with the policy. We propose an alteration to the popular OpenAI Gym framework, where we parameterize parts of an environment, and allow an agent to jointly learn to modify these environment parameters along with its policy. We demonstrate that an agent can learn a better structure of its body that is not only better suited for the task, but also facilitates policy learning. Joint learning of policy and structure may even uncover design principles that are useful for assisted-design applications.
1 Introduction
Embodied cognition [3, 40, 58] is the theory that an organism's cognitive abilities are shaped by its body. It is even argued that an agent's cognition extends beyond its brain, and is strongly influenced by aspects of its body and also the experiences from its various sensorimotor functions [25, 73]. Evolution plays a vital role in shaping an organism's body to adapt to its environment; the brain with its ability to learn is only one of many body components that are coevolved [11, 47]. We can observe embodiment in nature by observing that many organisms exhibit complex motor skills, such as the ability to jump [12] or swim [10], even after brain death.
While evolution shapes the overall structure of the body of a particular species, an organism can also change and adapt its body to its environment during its life (see Figure 1). For instance, professional athletes spend their lives body training while also improving specific mental skills required to master a particular sport [68]. In everyday life, regular exercise not only strengthens the body but also improves mental conditions [22, 49]. We not only learn and improve our skills and abilities during our lives, but also learn to shape our bodies for the lives we want to live.
We are interested in investigating embodied cognition within the reinforcement learning (RL) framework. Most baseline tasks [36, 66] in the RL literature test an algorithm's ability to learn a policy to control the actions of an agent, with a predetermined body design, to accomplish a given task inside an environment. The design of the agent's body is rarely optimal for the task, and is sometimes even intentionally designed to make policy search challenging. In this work, we explore learning a version of an agent's body that is better suited for its task, jointly with its policy. We demonstrate that an agent can learn a better structure of its body that not only is better for its task, but also facilitates policy learning. We can even optimize our agent's body for certain desired characteristics, such as material usage.1 Our approach may help uncover design principles useful for assisted design.
Furthermore, we believe the ability to learn useful morphology is an important area for the advancement of AI. Although morphology learning originated in the field of evolutionary computation, there have also been great advances in RL in recent years, and we believe much of what happens in ALife should in principle be of interest to the RL community and vice versa, since learning and evolution are just two sides of the same coin.
We believe that conducting experiments using standardized simulation environments facilitates the communication of ideas across disciplines, and for this reason we design our experiments based on applying ideas from ALife, namely morphology learning, to standardized tasks in the OpenAI Gym environment, a popular testbed for conducting experiments in the RL community. We decide to use standardized Gym environments such as Ant (based on the Bullet physics engine) and Bipedal Walker (based on Box2D), not only for their simplicity, but also because their difficulty is well understood due to the large number of RL publications that use them as benchmarks. As we shall see later, the BipedalWalkerHardcore-v2 task, while simple looking, is especially difficult to solve with modern deep RL methods. By applying simple morphology learning concepts from ALife, we are able to make a difficult task solvable with much fewer computation resources. We also made the code for augmenting OpenAI Gym for morphology learning, along with all pretrained models for reproducing results in this article, available at https://github.com/hardmaru/astool.
We hope this article can serve as a catalyst to precipitate a cultural shift in both fields and encourage researchers to open up their minds to each other. By drawing ideas from ALife and demonstrating them in the OpenAI Gym platform used by RL, we hope this work can set an example to bring both the RL and ALife communities closer together to find synergies and push the AI field forward.
2 Related Work
There is a broad literature in evolutionary computation, artificial life, and robotics devoted to studying and modeling embodied cognition [47]. In 1994, Karl Sims demonstrated that artificial evolution can produce novel morphology that resembles organisms observed in nature [59, 60]. Subsequent works further investigated morphology evolution [4, 8, 9, 11, 37, 44, 64, 65, 70], modular robotics [39, 45, 48, 75], and evolving soft robots [17, 20], using indirect encoding [5, 6, 7, 23, 61].
The field of passive dynamics studies robot designs that rely on the natural swings of motion of body components instead of deploying and controlling motors at each joint [18, 19, 41, 46]. Notably, the artist Theo Jansen [33] also employed evolutionary computation to design physical strandbeests that can walk on their own, consuming only wind energy, to raise environmental awareness.
Recent works in robotics investigate simultaneously optimizing body design and control of a legged robot [29, 30] using constraint-based modeling, which is related to our RL-based approach. Related to our work, [1, 24] employ CMA-ES [31] to optimize over both the motion control and physical configuration of agents. A related recent work [52, 53] employs RL to learn both the policy and design parameters in an alternating fashion, where a single shared policy controls a distribution of different designs; in this work we simply treat both policy and design parameters the same way.
3 Method
In this section, we describe the method used for learning a version of the agent's design better suited for its task, jointly with its policy. In addition to the weight parameters of our agent's policy network, we will also parameterize the agent's environment, which includes the specification of the agent's body structure. This extra parameter vector, which may govern the properties of items such as the width, length, radius, mass, and orientation of an agent's body parts and their joints, will also be treated as a learnable parameter. Hence the weights w we need to learn will be the parameters of the agent's policy network combined with the environment's parameterization vector. During a rollout, an agent initialized with w will be deployed in an environment that is also parameterized with the same parameter vector w.
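To make the setup concrete, the sketch below shows one way a single flat vector w can drive both the policy and the environment during a rollout. It is a minimal illustration under our own naming (split_params, rollout, env_maker, and policy_maker are hypothetical helpers), not the implementation in the repository linked above, which may be organized differently.

```python
def split_params(w, policy_dim, design_dim):
    """Split the flat learnable vector w into policy weights and body-design parameters."""
    return w[:policy_dim], w[policy_dim:policy_dim + design_dim]

def rollout(env_maker, policy_maker, w, policy_dim, design_dim):
    """Run one episode in an environment rebuilt from the design portion of w."""
    policy_w, design_w = split_params(w, policy_dim, design_dim)
    env = env_maker(design_w)          # environment parameterized by part of the same vector w
    policy = policy_maker(policy_w)    # policy network parameterized by the rest of w
    obs, total_reward, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done, _ = env.step(policy(obs))   # classic Gym step interface assumed
        total_reward += reward
    return total_reward
```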
The goal is to learn w to maximize the expected cumulative reward, E[R(w)], of an agent acting on a policy with parameters w in an environment governed by the same w. In our approach, we search for w using a population-based policy gradient method based on Section 6 of Williams' 1992 REINFORCE [72]. The next section provides an overview of this algorithm, which is shown in Figure 2.
Armed with the ability to change the design configuration of an agent's own body, we also wish to explore encouraging the agent to challenge itself by rewarding it for trying more difficult designs. For instance, carrying the same payload using smaller legs may result in a higher reward than using larger legs. Hence the reward given to the agent may also be augmented according to its parameterized environment vector. We will discuss reward augmentation to optimize for desirable design properties later on in more detail in Section 4.2.
3.1 Overview of Population-based Policy Gradient Method (REINFORCE)
We note that there is a connection between population-based REINFORCE, which is a population-based policy gradient method, and particular formulations of evolution strategies [50, 56], namely ones that are not elitist. For instance, natural evolution strategies (NESs) [57, 71] and OpenAI-ES [51] are closely based on Section 6 of REINFORCE. There is also a connection between natural gradients (computed using NESs) and CMA-ES [31]. We refer to Akimoto et al. [2] for a detailed theoretical treatment and discussion of the connection between CMA-ES and natural gradients.
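For reference, a minimal sketch of one population-based policy gradient update of this kind (in the style of OpenAI-ES [51]) is given below. The population size matches the experiments in Section 4; the remaining hyperparameters are illustrative, and the actual solver settings used in this work may differ.

```python
import numpy as np

def es_step(w, fitness, pop_size=192, sigma=0.1, lr=0.01, rng=np.random):
    """One REINFORCE-style update of the combined policy-and-design vector w."""
    noise = rng.randn(pop_size, w.size)        # one Gaussian perturbation per population member
    rewards = np.array([fitness(w + sigma * eps) for eps in noise])
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # normalize for variance reduction
    grad = noise.T @ advantages / (pop_size * sigma)   # Monte Carlo estimate of the gradient of E[R(w)]
    return w + lr * grad                               # gradient ascent on expected cumulative reward
```

Iterating es_step, with fitness defined as the average return of several rollouts of the jointly parameterized agent and environment, approximates the kind of training loop depicted in Figure 2.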
4 Experiments
In this work, we experiment on the continuous control environment RoboschoolAnt-v1 [36], based on the open source Bullet [21] physics engine, and also BipedalWalker-v2 from the Box2D [16] section of the OpenAI Gym [13] set of environments. For simplicity, we first present results of anecdotal examples obtained over a single representative experimental run to convey qualitative results such as morphology and its relationship to performance. A more comprehensive quantitative study based on multiple runs using different random seeds will be presented in Section 4.3.
The RoboschoolAnt-v1 environment2 features a four-legged agent called the Ant. The body is supported by four legs, and each leg consists of three parts, which are controlled by two motor joints. The bottom left diagram of Figure 3 describes the initial orientation of the agent. The length of each part of a leg is controlled by the Δx and Δy distances from its joint connection. A size parameter also controls the radius of each leg part.
In our experiment, we keep the volumetric mass density of all materials, along with the parameters of the motor joints, identical to the original environment, and allow the 36 parameters (3 parameters per leg part, 3 leg parts per leg, 4 legs in total) to be learned. In particular, we allow each part to be scaled to a range of ±75% of its original value. This allows us to keep the sign and direction for each part to preserve the original intended structure of the design.
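As one possible implementation of this constraint (a sketch; the exact mapping is not spelled out in the text above), each raw learnable parameter can be squashed through a tanh so that the resulting dimension always stays within ±75% of the original value while preserving its sign:

```python
import numpy as np

def constrain_design(raw_params, original_dims, limit=0.75):
    """Map unbounded learnable parameters to dimensions within +/-75% of the original design."""
    # tanh bounds each scale factor to (1 - limit, 1 + limit), i.e., roughly 25% to 175% of the original
    return np.asarray(original_dims) * (1.0 + limit * np.tanh(np.asarray(raw_params)))
```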
Figure 3 illustrates the learned agent design compared with the original design. With the exception of one leg part, it learns to develop longer, thinner legs while jointly learning to carry the body across the environment. While the original design is symmetric, the learned design (Table 1) breaks symmetry and biases towards larger rear legs while jointly learning the navigation policy using an asymmetric body. The original agent achieved an average cumulative score of 3447 ± 251 over 100 trials, compared to 5789 ± 479 for an agent that learned a better body design.
Table 1. Learned leg dimensions of the Ant (length and radius of each leg part), as percentages of the original design.

| Leg part | Top Left (Length / Radius) | Top Right (Length / Radius) | Bottom Left (Length / Radius) | Bottom Right (Length / Radius) |
|---|---|---|---|---|
| Top | 141% / 33% | 141% / 25% | 169% / 35% | 84% / 51% |
| Middle | 169% / 26% | 164% / 26% | 171% / 31% | 140% / 29% |
| Bottom | 174% / 26% | 168% / 50% | 173% / 29% | 133% / 38% |
The bipedal walker series of environments is based on the Box2D [16] physics engine. Guided by lidar sensors, the agent is required to navigate across an environment of randomly generated terrain within a time limit, without falling over. The agent's payload—its head—is supported by two legs. The top and bottom parts of each leg are controlled by two motor joints. In the easier BipedalWalker-v2 [34] environment, the agent needs to travel across small random variations of a flat terrain. The task is considered solved if an agent obtains an average score greater than 300 points over 100 rollouts.
Keeping the head payload constant, and also keeping the density of materials and the configuration of the motor joints the same as in the original environment, we only allow the lengths and widths for each of the four leg parts to be learnable, subject to the same range limit of ±75% of the original design. In the original environment in Figure 4 (top), the agent learns a policy that is reminiscent of a joyful skip across the terrain, achieving an average score of 347. In the learned version in Figure 4 (bottom), the agent's policy is to hop across the terrain using its legs as a pair of springs, achieving a score of 359.
In our experiments, all agents were implemented using three-layer fully connected networks with tanh activations. The agent in RoboschoolAnt-v1 has 28 inputs and 8 outputs, all bounded between −1 and +1, with hidden layers of 64 and 32 units. The agents in BipedalWalker-v2 and BipedalWalkerHardcore-v2 have 24 inputs and 4 outputs, all bounded between −1 and +1, with two hidden layers of 40 units each.
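For concreteness, the sketch below builds such a policy for the bipedal walker tasks from a flat weight vector, which is the form population-based training operates on. The layer sizes follow the description above; the flat parameter layout and the use of bias terms are our assumptions (for the Ant, the sizes would be (28, 64, 32, 8)).

```python
import numpy as np

def make_policy(weights, sizes=(24, 40, 40, 4)):
    """Build a feedforward tanh policy from a flat NumPy weight vector:
    24 inputs, two hidden layers of 40 units, 4 outputs in [-1, 1]."""
    layers, idx = [], 0
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        W = weights[idx:idx + n_in * n_out].reshape(n_in, n_out)
        idx += n_in * n_out
        b = weights[idx:idx + n_out]
        idx += n_out
        layers.append((W, b))

    def policy(obs):
        h = np.asarray(obs, dtype=np.float64)
        for W, b in layers:
            h = np.tanh(h @ W + b)   # tanh keeps activations and outputs bounded in [-1, 1]
        return h

    return policy
```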
Our population-based training experiments were conducted on 96-core CPU machines. Following the approach described in [28], we used a population size of 192, and had each agent perform the task 16 times with different initial random seeds. The agent's reward signal used by the policy gradient method is the average reward of the 16 rollouts. The most challenging BipedalWalkerHardcore agents were trained for 10,000 generations, while the easier BipedalWalker and Ant agents were trained for 5000 and 3000 generations, respectively. As done in [28], we save the parameters of the agent that achieves the best average cumulative reward during its entire training history.
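Under this scheme, the fitness fed to the update rule is therefore an average over episodes rather than a single return; a one-line sketch, assuming a seedable rollout_fn(w, seed) in the spirit of the rollout helper sketched in Section 3:

```python
def averaged_fitness(w, rollout_fn, n_rollouts=16):
    """Average return over several rollouts with different random seeds."""
    return sum(rollout_fn(w, seed=s) for s in range(n_rollouts)) / n_rollouts
```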
4.1 Joint Learning of Body Design Facilitates Policy Learning
Learning a better version of an agent's body not only helps achieve better performance, but also enables the agent to jointly learn policies more efficiently. We demonstrate this in the much more challenging BipedalWalkerHardcore-v2 [35] version of the task. Unlike the easier version, the agent must also learn to walk over obstacles, travel up and down hilly terrain, and even jump over pits. Figure 5 illustrates the original and learnable versions of the environment.3
In this environment, our agent generally learns to develop longer, thinner legs, with the exception of the rear leg, where it develops a thicker lower limb that serves a useful stabilizing function for navigation. Its front legs, which are smaller and more maneuverable, also act as sensors for dangerous obstacles ahead, complementing its lidar sensors. While learning to develop this newer structure, the agent jointly learns a policy that solves the task in 30% of the time required in the original, static version of the environment. The average score over 100 rollouts for the learnable version is 335 ± 37, compared to the baseline score of 313 ± 53. The full results are summarized in Table 2.
Table 2. Average scores (over 100 rollouts) and learned leg dimensions (width w and height h of each leg part) for the bipedal walker environments.

| BipedalWalker-v2 | Avg. score | Leg area (% of original) | Top leg 1 (w / h) | Bottom leg 1 (w / h) | Top leg 2 (w / h) | Bottom leg 2 (w / h) |
|---|---|---|---|---|---|---|
| Original | 347 ± 0.9 | 100% | 8.0 / 34.0 | 6.4 / 34.0 | 8.0 / 34.0 | 6.4 / 34.0 |
| Learnable | 359 ± 0.2 | 33% | 2.0 / 57.3 | 1.6 / 46.0 | 2.0 / 48.8 | 1.6 / 18.9 |
| Reward smaller leg | 323 ± 68 | 8% | 2.0 / 11.5 | 1.6 / 10.6 | 2.0 / 11.4 | 1.6 / 10.2 |

| BipedalWalkerHardcore-v2 | Avg. score | Leg area (% of original) | Top leg 1 (w / h) | Bottom leg 1 (w / h) | Top leg 2 (w / h) | Bottom leg 2 (w / h) |
|---|---|---|---|---|---|---|
| Original | 313 ± 53 | 100% | 8.0 / 34.0 | 6.4 / 34.0 | 8.0 / 34.0 | 6.4 / 34.0 |
| Learnable | 335 ± 37 | 95% | 2.7 / 59.3 | 10.0 / 58.9 | 2.3 / 55.5 | 1.7 / 34.6 |
| Reward smaller leg | 312 ± 69 | 27% | 2.0 / 35.3 | 1.6 / 47.1 | 2.0 / 36.2 | 1.6 / 26.7 |
4.2 Optimize for Both the Task and the Desired Design Properties
Allowing an agent to learn a better version of its body obviously enables it to achieve better performance. But what if we want to give back some of the additional performance gains, and optimize also for desirable design properties that might not generally be beneficial for performance? For instance, we may want our agent to learn a design that utilizes the least amount of materials while still achieving satisfactory performance on the task. Here, we reward an agent for developing legs that are smaller in area, and augment its reward signal during training by scaling the rewards by a utility factor of 1 + log(original leg area / learned leg area). Augmenting the reward encourages development of smaller legs. (See Figure 6.)
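A short sketch of this reward scaling, following the utility factor as written above (exactly where in the training loop the scaling is applied is a simplification on our part):

```python
import numpy as np

def augmented_reward(reward, original_leg_area, learned_leg_area):
    """Scale the reward by a utility factor that grows as the learned legs shrink in area."""
    utility = 1.0 + np.log(original_leg_area / learned_leg_area)
    return reward * utility
```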
This reward augmentation resulted in a much smaller agent that is still able to support the same payload. In BipedalWalker, given the simplicity of the task, the agent's leg dimensions eventually shrink to near the lower bound of ∼25% of the original dimensions, with the exception of the heights of the top leg parts, which settled at ∼35% of the initial design, while still achieving an average (unaugmented) score of 323 ± 68. For this task, the leg area used is 8% of the original design.
However, the agent is unable to solve the more difficult BipedalWalkerHardcore task using a similarly small body structure, due to the various obstacles presented. Instead, it learns to set the width of each leg part close to the lower bound while learning the shortest height of each leg part that still allows it to navigate the obstacles, achieving a score of 312 ± 69. Here, the leg area used is 27% of the original.
4.3 Results over Multiple Experimental Runs
In the previous subsections, for simplicity, we presented results from a single representative experimental run to convey qualitative findings, such as a morphology description corresponding to the average score achieved. Running the experiment from scratch with a different random seed may generate different morphology designs and different policies that lead to different performance scores. To demonstrate that morphology learning does indeed improve the performance of the agent over multiple experimental runs, we ran each experiment 10 times and report the full range of average scores obtained in Table 3 and Table 4. Across these independent runs, we see that morphology learning consistently produces higher scores than the normal version of each task.
Table 3. Statistics of average scores over 10 independent runs for each experiment.

| Experiment | Average score over 10 independent runs |
|---|---|
| (a) Ant | 3139 ± 189.3 |
| (b) Ant + morphology | 5267 ± 631.4 |
| (c) Biped | 345 ± 1.3 |
| (d) Biped + morphology | 354 ± 2.2 |
| (e) Biped + morphology + smaller leg | 330 ± 3.9 |
| (f) Biped hardcore | 300 ± 11.9 |
| (g) Biped hardcore + morphology | 326 ± 12.7 |
| (h) Biped hardcore + morphology + smaller leg | 312 ± 11.9 |
Table 4. Average score obtained in each of the 10 independent runs.

| Experiment | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 |
|---|---|---|---|---|---|---|---|---|---|---|
| (a) Ant | 3447 | 3180 | 3076 | 3255 | 3121 | 3223 | 3130 | 3096 | 3167 | 2693 |
| (b) + morphology | 5789 | 6035 | 5784 | 4457 | 5179 | 4788 | 4427 | 5253 | 6098 | 4858 |
| (c) Biped | 347 | 343 | 347 | 346 | 345 | 345 | 345 | 346 | 346 | 344 |
| (d) + morphology | 359 | 354 | 353 | 354 | 353 | 352 | 353 | 352 | 353 | 356 |
| (e) + smaller leg | 323 | 327 | 327 | 331 | 330 | 331 | 333 | 329 | 337 | 333 |
| (f) Biped hardcore | 313 | 306 | 300 | 283 | 311 | 295 | 307 | 309 | 292 | 279 |
| (g) + morphology | 335 | 331 | 330 | 330 | 332 | 292 | 327 | 331 | 316 | 330 |
| (h) + smaller leg | 312 | 320 | 314 | 318 | 307 | 314 | 316 | 281 | 319 | 324 |
We also visualize the variations of morphology designs over different runs in Figure 7 to get a sense of the variations of morphology that can be discovered during training. As these models may take up to several days to train for a particular experiment on a powerful 96-core CPU machine, it may be costly for the reader to fully reproduce the variation of results here, especially when 10 machines running the same experiment with different random seeds are required. We also include all pretrained models from multiple independent runs in the GitHub repository containing the code to reproduce this article. The interested reader can examine the variations in more detail using the pretrained models.
5 Discussion and Future Work
We have shown that using a simple population-based policy gradient method for allowing an agent to learn not only the policy, but also a small set of parameters describing the environment, such as its body, offers many benefits. By allowing the agent's body to adapt to its task within some constraints, the agent can not only learn policies that are better for its task, but also learn them more quickly.
The agent may discover design principles during this joint process of body and policy learning. In both the RoboschoolAnt and BipedalWalker experiments, the agent learned to break symmetry and develop larger rear limbs to facilitate its navigation policy. When also optimizing for material usage of BipedalWalker's limbs, the agent learns that it can still achieve the desired task even when the size of its legs is set close to the minimum allowable. Meanwhile, for the much more difficult BipedalWalkerHardcore-v2 task, the agent learns the appropriate limb lengths required for the task while still minimizing material usage.
This approach may lead to useful applications in machine-learning-assisted design, in the spirit of [14, 15]. While not directly related to agent design, machine-learning-assisted approaches have been used to procedurally generate game environments that can also facilitate policy learning of game-playing agents [27, 42, 63, 67, 69]. Game designers can optimize the designs of game character assets while at the same time constraining the characters to keep the essence of their original forms. Optimizing character design may complement existing work on machine-learning-assisted procedural content generation for game design. By framing the approach within the popular OpenAI Gym framework, design firms can create more realistic environments—for instance, incorporating strength of materials, safety factors, and the malfunctioning of components under stressed conditions—and plug existing algorithms into this framework to also optimize for design aspects such as energy usage, ease of manufacturing, or durability. The designer may even incorporate aesthetic constraints such as symmetry and aspect ratios that suit her design sense.
In this work we have only explored using a simple population-based policy gradient method [72] for learning. State-of-the-art model-free RL algorithms, such as TRPO [54] and PPO [55], work well when our agent is presented with a well-designed dense reward signal, while population-based methods offer computational advantages for sparse-reward problems [51, 62]. In our setting, as the body design is parameterized by a small set of learnable parameters and is only set once at the beginning of a rollout, the problem of learning the body along with the policy becomes more sparse. In principle, we could allow an agent to augment its body during a rollout to obtain a dense reward signal, but we find this impractical for realistic problems. Future work may look at separating the learning from dense rewards and sparse rewards into an inner loop and outer loop, and also examine differences in performance and behaviors in structures learned with various different RL algorithms.
Separation of policy learning and body design into inner loop and outer loop will also enable the incorporation of evolution-based approaches to tackle the vast search space of morphology design, while utilizing efficient RL-based methods for policy learning. The limitation of the current approach is that our RL algorithm can learn to optimize only existing design properties of an agent's body, rather than learn truly novel morphology in the spirit of Karl Sims' “Evolving virtual creatures” [60].
Nevertheless, our approach of optimizing the specifications of an existing design might be practical for many applications. While a powerful evolutionary algorithm that can also evolve novel morphology might come up with robot morphology that easily outperforms the best bipedal walkers in this work, the resulting designs might not be as useful to a game designer who is tasked to work explicitly with bipedal walkers that fit within the game's narrative (although it is debatable whether a game can be more entertaining and interesting if the designer is allowed to explore the space beyond given specifications). Due to the vast search space of all possible morphology, a search algorithm can easily come up with unrealistic or unusable designs that exploit its simulation environment, as discussed in detail in [38], which may be why subsequent morphology evolution approaches constrain the search space of the agent's morphology—for example, to the space of soft-body voxels [17] or to a set of possible pipe frame connection settings [33]. We note that unrealistic designs may also result in our approach, if we do not constrain the learned dimensions to be within ±75% of their original values. For some interesting examples of what REINFORCE discovers without any constraints, we invite the reader to view the Bloopers section of https://designrl.github.io/.
Just as REINFORCE [72] can also be applied to the discrete search problem of neural network architecture design [74], similar RL-based approaches could be used for novel morphology design—not simply for improving an existing design as in this work. We believe the ability to learn useful morphology is an important area for the advancement of AI. Although morphology learning originated in the field of evolutionary computation, we hope this work will engage the RL community to investigate the concept further and encourage idea exchange across communities.
Acknowledgments
We would like to thank the three reviewers from Artificial Life journal, as well as Luke Metz, Douglas Eck, Janelle Shane, Julian Togelius, Jeff Clune, and Kenneth Stanley, for their thoughtful feedback and conversation. All experiments were performed on CPU machines provided by Google Cloud Platform.
Notes
1. Videos of results are at https://designrl.github.io/.
2. A compatible version of this environment is also available in PyBullet [21], which was used for visualization.
3. As of writing, two methods have been reported to solve this task. Population-based training [28] (our baseline) solves this task in 40 hours on a 96-CPU machine, using a small feedforward policy network. A3C [43], adapted for continuous control [26], solves the task in 48 hours on a 72-CPU machine, but requires an LSTM [32] policy.