The Achilles Heel of stochastic optimization algorithms is getting trapped on local optima. Novelty Search mitigates this problem by encouraging exploration in all interesting directions by replacing the performance objective with a reward for novel behaviors. This reward for novel behaviors has traditionally required a human-crafted, behavioral distance function. While Novelty Search is a major conceptual breakthrough and outperforms traditional stochastic optimization on certain problems, it is not clear how to apply it to challenging, high-dimensional problems where specifying a useful behavioral distance function is difficult. For example, in the space of images, how do you encourage novelty to produce hawks and heroes instead of endless pixel static? Here we propose a new algorithm, the Innovation Engine, that builds on Novelty Search by replacing the human-crafted behavioral distance with a Deep Neural Network (DNN) that can recognize interesting differences between phenotypes. The key insight is that DNNs can recognize similarities and differences between phenotypes at an abstract level, wherein novelty means interesting novelty. For example, a DNN-based novelty search in the image space does not explore in the low-level pixel space, but instead creates a pressure to create new types of images (e.g., churches, mosques, obelisks, etc.). Here, we describe the long-term vision for the Innovation Engine algorithm, which involves many technical challenges that remain to be solved. We then implement a simplified version of the algorithm that enables us to explore some of the algorithm’s key motivations. Our initial results, in the domain of images, suggest that Innovation Engines could ultimately automate the production of endless streams of interesting solutions in any domain: for example, producing intelligent software, robot controllers, optimized physical components, and art.
Stochastic optimization and search algorithms, such as simulated annealing and evolutionary algorithms (EAs), often outperform human engineers in several domains (Koza et al., 2005). However, there are other domains in which these algorithms cannot produce effective solutions yet. Their Achilles Heel is the trap of local optima (Woolley and Stanley, 2011), where the objective given to an algorithm (e.g., a fitness function) prevents the search from leaving suboptimal solutions and reaching better ones. Novelty Search (Lehman and Stanley, 2008, 2011a) addresses this problem by collecting the stepping stones needed to ultimately lead to an objective instead of directly optimizing toward it. The algorithm encourages searching in all directions by replacing a performance objective with a reward for novel behaviors, the novelty of which is measured with a distance function in the behavior space (Li et al., 2014). This recent conceptual breakthrough has been shown to outperform traditional stochastic optimization on deceptive problems where specifying distances between desired behaviors is easy (Lehman and Stanley, 2008, 2011a). Reducing a high-dimensional search space to a low-dimensional one is essential to the success of Novelty Search because in high-dimensional search spaces there are too many ways to be novel without being interesting (Cuccu and Gomez, 2011). For example, if novelty is measured directly in the high-dimensional space of pixels in a 60,000-pixel image, being different can mean different static patterns, which are not interestingly different types of images.
Here we propose a novel algorithm called an Innovation Engine that enables searching in high-dimensional spaces for which it is difficult for humans to define what constitutes interestingly different behaviors. The key insight is to use a deep neural network (DNN) (Bengio, 2009) as the evaluation function to reduce a high-dimensional search space to a low-dimensional search space where novelty means interesting novelty. State-of-the-art DNNs have demonstrated impressive and sometimes human-competitive results on many pattern recognition tasks (Krizhevsky et al., 2012; Bengio, 2009). They see past the myriad pixel differences, such as lighting changes, rotations, zooms, and occlusions, to recognize abstract concepts in images, such as tigers, tables, and turnips. Here we suggest harnessing the power of DNNs to recognize different types of things in the abstract, high-level spaces where they can make distinctions. A second reason for choosing DNNs is that they work by hierarchically recognizing features. In images, for example, they recognize faces by combining edges into corners, then corners into eyes or noses, and then they combine these features into even higher-level features such as faces (Nguyen, Yosinski et al., 2016; Yosinski et al., 2015; Zeiler and Fergus, 2014; Nguyen, Dosovitskiy et al., 2016). Such a hierarchy of features is beneficial because those features can be produced in different combinations to produce new types of ideas/solutions.
Despite their impressive performance, DNNs can also make mistakes. Szegedy et al. (2014) found that it is possible to add imperceptible changes to an image originally classified correctly (e.g., as a bell pepper) such that a DNN will label it as something else entirely (e.g., an ostrich). Nguyen et al. (2015a) showed a different, but related, problem: images can be synthesized from scratch that are completely unrecognizable to human eyes as familiar objects, but that DNNs label with near-certainty as common objects (e.g., DNNs will declare with certainty that a picture filled with white noise static is an armadillo). While such shortcomings of DNNs impair Innovation Engines a fraction of the time, in this article we emphasize the remaining fraction of the time wherein using DNNs as evaluators works well. Innovation Engines will only improve as DNNs are redesigned to not be so easily fooled.
We first describe our long-term, ultimate vision for Innovation Engines that require no labeled data to endlessly innovate in any domain. Because there are many technical hurdles to overcome to reach that vision, we also describe a simpler, version 1.0 Innovation Engine that harnesses labeled data to simulate how the ultimate Innovation Engine might function. While Innovation Engines should work in any domain, we test one in the image-generating domain that originally inspired the Novelty Search algorithm (Stanley and Lehman, 2015) and show that it can automatically produce a diversity of interesting images (see Figure 1). We also confirm some expectations regarding why Innovation Engines work.
2 Innovation Engines
The Innovation Engine algorithm seeks to abstract the process of curiosity and habituation that occurs in humans. Historically, humans create ideas based on combinations of, or changes to, previous ideas, evaluate whether these ideas are interesting, and retain the interesting ideas to create more advanced ideas (see Figure 2). We propose to automate the entire process by having stochastic optimization (e.g., an evolutionary algorithm) generate new behaviors and a DNN evaluate whether the behaviors are interestingly new. The DNN will then be retrained to learn all behaviors generated so far and evolution will be asked to produce new behaviors that the network has not seen before. This algorithm should be able to automatically create an endless stream of interesting solutions in any domain, for example, producing robot controllers, optimized electrical circuits, and even art.
Creating an Innovation Engine requires generating and retaining “stepping stones to everywhere.” The stepping stones on the path to any particular innovation are not known ahead of time (Lehman and Stanley, 2011a). From the Stone Age, for example, the path to create a telephone did not involve inventing only things that improved long-distance communication, but instead involved accumulating all interesting innovations (see Figure 2). In fact, had human culture been restricted to only producing inventions that improve long-distance communication, it is likely that the telephone would never have been developed. That is because many of the fundamental telephone-enabling inventions were not invented because they enabled long-distance communication (e.g., wires, electricity, electromagnets, etc.), but instead were invented at the time for other purposes. The same is true for nearly every significant invention in human history: many of the key enabling technologies were originally invented for other purposes (Lehman and Stanley, 2011b). In art, just as in science, there is a similar accumulation of interesting ideas over time and a pressure to “make something new,” which leads to a steady discovery of new artistic ideas over time (Lehman and Stanley, 2011b). Human culture, therefore, can be seen as an “Innovation Engine” that steadily produces new inventions in many different domains, from math and science to art and engineering.
2.1 The Ultimate Goal
Our long-term vision is to create an Innovation Engine that does not require labeled data, or perhaps is not even shown data from the natural or manmade world. It would learn to classify the types of things it has produced so far and seek to produce new types of things. Technically, one way to implement this algorithm is by training generative deep neural network models with unsupervised learning algorithms: these generative models can learn to compress the types of data they have seen before (Bengio, 2009; Hinton and Salakhutdinov, 2006). One could thus measure whether a newly generated thing is a new type of thing by how well the generative DNN model can compress it. Evolution would be rewarded for producing things that the DNN cannot compress well, which should endlessly produce novel types of things.
Imagine such an Innovation Engine in the image domain. A network trained on all images produced so far will attempt to compress each newly generated image, and it will fail more on new types of images. We hypothesize that the DNN will continuously become “bored” with (i.e., highly compress) easily produced classes of images (initially static and solid colors, but soon more complex patterns), which will encourage evolution to generate increasingly complex images in order to produce new types of images. The process thus becomes a coevolutionary innovation arms race.
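The compression-based reward described above can be sketched concretely. As a stand-in for a generative DNN, the toy below fits a linear compressor (PCA via SVD) on everything the archive contains and rewards a candidate by its reconstruction error; the function names and the choice of PCA are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def reconstruction_error(mean, components, x):
    """How much of x is lost after compressing onto the learned components.
    High error means the model compresses x poorly, i.e., x is a new type."""
    centered = np.asarray(x, dtype=float) - mean
    code = components @ centered        # compress to a low-dimensional code
    recon = components.T @ code         # decompress
    return float(np.sum((centered - recon) ** 2))

def novelty_reward(archive, x, n_components=2):
    """Fit a linear compressor on everything produced so far, then reward x
    by how badly that compressor reconstructs it."""
    data = np.asarray(archive, dtype=float)
    mean = data.mean(axis=0)
    # top principal components of the archive (rows of vt, largest first)
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return reconstruction_error(mean, vt[:n_components], x)
```

A point resembling the archive reconstructs almost perfectly (near-zero reward), while a point along an unseen dimension reconstructs badly (high reward), which is exactly the pressure toward new types of things.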
This version of the Innovation Engine is motivated by the curiosity work of Schmidhuber (2006) and Kompella et al. (2015), which emphasizes the production of things that are not yet compressed, but are most easily compressed next. Our work involves modern compressors (state-of-the-art DNNs), and our algorithm does not require the seemingly impossible task of predicting which classes of artifacts are highly compressible. Our proposal is similar to Liapis et al. (2013), but prevents cycling by attempting to produce things different from everything produced so far, not just the current population. If it works, this Innovation Engine could produce innovations in the multitude of fields and problem domains that currently benefit from stochastic optimization.
2.2 Version 1.0
Unsupervised learning algorithms for generative models do not yet scale well to high-dimensional data (Bengio et al., 2014); for example, they can handle small images such as those in MNIST (Hinton and Salakhutdinov, 2006), but not larger, more complex images such as those in ImageNet (Deng et al., 2009). In this section, we describe a simpler Innovation Engine version that can be implemented with currently available algorithms. A key piece of the ultimate Innovation Engine is automatically recognizing new types of classes, which function as newly created niches for evolution to specialize on. We can emulate that endless process of niche creation by simply starting with a lot of niches and letting evolution exploit them all. To do that, we can take advantage of two recent developments in machine learning: (1) the availability of large, supervised datasets, and (2) the ability of modern supervised Deep Learning algorithms to train DNNs to reach near-human-competitive levels in classifying the things in these datasets (Hinton and Salakhutdinov, 2006; Krizhevsky et al., 2012; Bengio, 2009). We can thus challenge optimization algorithms (e.g., evolution) to produce things that the DNN recognizes as belonging to each class.
Innovation Engines require two key components: (1) a diversity-promoting EA that generates and collects novel behaviors, and (2) a DNN capable of evaluating the behaviors to determine if they are interesting and should be retained. The first criterion could be fulfilled either by Novelty Search or the multidimensional archive of phenotypic elites (MAP-Elites) algorithm (Mouret and Clune, 2015; Cully et al., 2015). We show next that both can work.
3 Test Domain: Generating Images
The test domain for the article is generating a diverse set of interesting, recognizable images. We chose this domain for four reasons. The first is because an experiment in image generation served as the inspiration for Novelty Search (Stanley and Lehman, 2015). That experiment occurred on Picbreeder.org, a website that allows visitors to interactively evolve images (Secretan et al., 2011), resulting in a crowd of humans that evolved a diverse, recognizable set of images. Key enablers of this diversity (Secretan et al., 2011; Stanley and Lehman, 2015) included the fact that collectively there was no goal; that individuals periodically had a target image type in mind, creating a local pressure for high-performing (recognizable) images; that users were open to the possibility of switching to a new goal if the opportunity presented itself (e.g., if the eyes of a face started to look like the wheels of a car); and that users saved any image they found interesting (usually a new type of image, or an improvement upon a previous type of image), so future users could branch off of any saved stepping stone to create a new image. Critically, all of these elements should also occur in Innovation Engine 1.0; thus one test of that hypothesis is whether Innovation Engine 1.0 can automatically produce a diverse set of images like those generated by humans on Picbreeder. One attempt was made to automatically recreate the diversity of recognizable images produced on Picbreeder, but it produced only abstract patterns (Auerbach, 2012).
The second motivation for the image-generating domain is that DNNs are nearly human-competitive at recognizing images (Krizhevsky et al., 2012; Karpathy, 2014; Szegedy et al., 2015; Stallkamp et al., 2012). The third reason is that DNNs can recognize and sensibly classify the type of images from Picbreeder (see Figure 3), specifically images encoded by compositional pattern producing networks (CPPNs) (Stanley, 2007). We also encode images with CPPNs in our experiments (described next). The fourth reason is because humans are natural pattern recognizers, making us quickly and intuitively able to evaluate the diversity, interestingness, and recognizability of evolved solutions. Additionally, while much of what we learn from this domain comes from subjective results, there is also a quantitative aspect regarding the confidence a DNN ascribes to the generated images. In future work, we will test whether the conclusions reached in this mostly subjective domain translate into more exclusively quantitative domains.
To experiment in this domain, we use a modern off-the-shelf DNN trained with 1.3 million images to recognize 1,000 different types of objects from the natural world. We then challenge evolution to produce images that the DNN confidently labels as members of each of the 1,000 classes. Evolution is therefore challenged to make increasingly recognizable images for all 1,000 classes. Generating CPPN-encoded images that are recognizable is challenging (Woolley and Stanley, 2011), making recognizability a notion of performance in this domain. Being recognizable is also related to being interesting, as Picbreeder images that are recognizable are often the most highly rated (Secretan et al., 2011).
4.1 Deep Neural Network Models
The DNN in our experiments is the well-known convolutional “AlexNet” architecture from Krizhevsky et al. (2012). It is trained on the 1.3-million-image 2012 ImageNet dataset (Deng et al., 2009; Russakovsky et al., 2015), and available for download via the Caffe software package (Jia et al., 2014). The Caffe-provided AlexNet has small architectural differences from Krizhevsky et al. (2012), but it performs similarly (42.6% top-1 error rate vs. the original 40.7% (Krizhevsky et al., 2012)). For each image, the DNN outputs a post-softmax, 1,000-dimensional vector reporting the probability that the image belongs to each ImageNet class. The softmax means that to produce a high confidence value for one class, all the others must be low.
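The coupling that the softmax imposes can be seen in a few lines. This is a generic softmax sketch (not AlexNet itself): because the outputs must sum to 1, two classes with equally high raw scores split the probability mass, whereas a single dominant score absorbs nearly all of it.

```python
import numpy as np

def softmax(logits):
    """Map raw class scores (logits) to probabilities that sum to 1."""
    z = logits - np.max(logits)   # shift by the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Two classes with equally high raw scores: neither can receive high
# post-softmax confidence, because the probabilities must sum to 1.
tied = softmax(np.array([5.0, 5.0, 1.0]))

# One dominant raw score: that class takes nearly all the probability mass.
dominant = softmax(np.array([5.0, 1.0, 1.0]))
```

This is why an image must score low on the other 999 classes to be assigned high confidence in one class, a property that matters again in Section 5.1 for the dog and cat breeds.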
4.2 Generating Images with Evolution
To simultaneously evolve images that match all 1,000 ImageNet classes, we use the new multidimensional archive of phenotypic elites (MAP-Elites) algorithm (Mouret and Clune, 2015; Cully et al., 2015). MAP-Elites keeps a map (archive) of the best individuals found so far for each class. In each iteration, an individual is randomly chosen from the map, mutated, and then it replaces the current champion for any class if it has a higher fitness for that class. Fitness is the DNN’s confidence that an image is a member of that class.
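The MAP-Elites loop just described is simple enough to sketch in full. The helper arguments below (`random_genome`, `mutate`, `fitness_vector`) are hypothetical placeholders: in the paper, `fitness_vector` would return the DNN's post-softmax confidence for each of the 1,000 ImageNet classes.

```python
import random

def map_elites(random_genome, mutate, fitness_vector, iterations):
    """Minimal MAP-Elites sketch: keep the best individual found so far
    for every class, and let mutants of any champion compete in all niches."""
    archive = {}  # class index -> (fitness, genome)
    seed = random_genome()
    for c, f in enumerate(fitness_vector(seed)):
        archive[c] = (f, seed)            # seed every niche
    for _ in range(iterations):
        # pick a random elite, mutate it, and let it compete everywhere
        _, parent = random.choice(list(archive.values()))
        child = mutate(parent)
        for c, f in enumerate(fitness_vector(child)):
            if f > archive[c][0]:
                archive[c] = (f, child)   # child dethrones the class champion
    return archive
```

Note that a single child can replace champions in several classes at once, which is the mechanism behind the goal switching analyzed in Section 5.2.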
We also test another implementation of the Innovation Engine, but with Novelty Search instead of MAP-Elites. Novelty Search encourages organisms to be different from the current population and an archive of previously novel individuals. The behavioral distance between two images is defined as the Euclidean distance between the two 1,000-dimensional vectors output by the DNN for each image. Because all of our experiments were performed with the Sferes evolutionary computation framework (Mouret and Doncieux, 2010), we set all Novelty Search parameters to those in Mouret (2011), which was also conducted in Sferes and closely followed the parameters in Lehman and Stanley (2008).
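A common way to turn that behavioral distance into a novelty score is the mean distance to the k nearest neighbors among the population and the archive. The sketch below assumes that convention; the value of k and the tiny 2-D behavior vectors are illustrative (in the paper a behavior is the DNN's 1,000-dimensional output, and the exact parameters follow Mouret (2011)).

```python
import numpy as np

def novelty(candidate, population, archive, k=3):
    """Novelty of a candidate's behavior: mean Euclidean distance to its k
    nearest neighbors among the current population plus the novelty archive."""
    neighbors = np.asarray(population + archive, dtype=float)
    dists = np.linalg.norm(neighbors - np.asarray(candidate, dtype=float),
                           axis=1)
    dists.sort()                     # ascending: nearest neighbors first
    return float(dists[:k].mean())
```

A behavior far from everything seen so far scores high and is favored by selection; a behavior inside an already-explored cluster scores near zero.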
Images are encoded with compositional pattern producing networks (CPPNs) (Stanley, 2007), which abstract the expressive power of developmental biology to produce regular patterns (e.g., those with symmetry or repetition). CPPNs encode the complex, regular, recognizable images on Picbreeder.org (see Figure 3) and the 3D objects on EndlessForms.com (Clune and Lipson, 2011). The details of how CPPNs encode images and are evolved have been repeatedly described elsewhere (Secretan et al., 2011; Stanley, 2007). Briefly, a CPPN is like a neural network, but each node’s activation function is one of a set (here: sine, sigmoid, Gaussian, and linear). The Cartesian coordinates of each pixel are input into the network and the network’s outputs determine the color of that pixel. Importantly, evolved CPPN images can be recognized by the DNN (see Figure 3), showing that evolution can produce CPPN images that both humans and DNNs can recognize.
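To make the encoding concrete, here is a tiny hand-built CPPN: each pixel's (x, y) coordinates pass through a small network of composed functions, and the output sets that pixel's intensity. The activation set (sine, sigmoid, Gaussian, linear) matches the one named above, but this particular topology and these weights are invented for illustration; evolved CPPNs grow their own topologies via NEAT.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gaussian(z):
    return math.exp(-z * z)

def cppn(x, y):
    h1 = math.sin(3.0 * x)           # hidden node: sine of x -> repetition
    h2 = gaussian(2.0 * y)           # hidden node: Gaussian of y -> symmetry
    return sigmoid(h1 + h2 - 0.5)    # output node: pixel intensity in (0, 1)

def render(width, height):
    """Evaluate the CPPN at each pixel's coordinates, normalized to [-1, 1]."""
    return [[cppn(2.0 * c / (width - 1) - 1.0,
                  2.0 * r / (height - 1) - 1.0)
             for c in range(width)] for r in range(height)]
```

Because the Gaussian is an even function of y, the rendered image is symmetric about its horizontal midline; the sine of x yields repetition, illustrating how CPPNs produce the regularities mentioned above.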
As is customary (Secretan et al., 2011; Stanley, 2007; Clune and Lipson, 2011), we evolve CPPNs with the principles of the NeuroEvolution of Augmenting Topologies (NEAT) algorithm (Stanley and Miikkulainen, 2002), a version of which is provided in Sferes. CPPNs start with no hidden nodes, and add nodes and connections over time, forcing evolution to first search for simple, regular images before increasing complexity (Stanley and Miikkulainen, 2002). All of our code and parameters are available at http://EvolvingAI.org. Because each run required 128 CPU cores running continuously for approximately 4 days, our number of runs is limited.
We conduct a variety of experiments to investigate Innovation Engines. First, we show that Innovation Engines work well both quantitatively and qualitatively in the image generation domain (Section 5.1): the algorithm produces images that are recognizable to both humans and DNNs. Second, we investigate a key component of Innovation Engines—the number of objectives—and show that performance and evolvability improve as the number of objectives increases (Section 5.2). Third, to support the hypothesis that Innovation Engines should work with any diversity-promoting EA, we demonstrate that Innovation Engines also work well with Novelty Search (Section 5.3). Fourth, we show that the algorithm can be further improved by incorporating additional priors (Section 5.4).
5.1 Evolving Images that Are Recognizable as Members of ImageNet Classes
If the Innovation Engine is a promising idea, then Innovation Engine 1.0 in the image domain should produce the following: (1) images that the DNN gives high confidence to as belonging to ImageNet classes and (2) a diverse set of interesting images that are recognizable as members of ImageNet classes. Our results show that the Innovation Engine accomplishes both of these goals.
In ten independent MAP-Elites runs, evolution produced high-confidence images in most of the 1,000 ImageNet categories (see Figure 4). It struggled most in classes 156–286, which represent subtly different breeds of dogs and cats, where it is hard to look like one type without also looking like other types. Note that because the confidence values are taken after a softmax transformation of the neural activations of the last layer, to maximize its score in one class, an image not only has to have high confidence in that class, but also has to have low confidence in all the other classes; that is especially difficult for the dog and cat classes given the number of similar cat and dog breeds. While readers must draw their own conclusions, in our opinion the images exhibit a tremendous amount of interesting diversity, putting aside whether they are recognizable. Selected examples are in Figures 5, 1, and 6: all 10,000 evolved images are available at http://www.evolvingai.org/InnovationEngine. The diversity is especially noteworthy because many images are phylogenetically related, which should curtail diversity.
In many cases, the evolved images are recognizable as members of the target class (see Figure 6). This result is remarkable given that it has been shown that with the same encoding (CPPN) and evolutionary algorithm (NEAT), it is impossible to evolve an image to resemble a complex, target image (Woolley and Stanley, 2011). The lesson from that paper is that if evolution is given a specific objective, such as evolving a butterfly or skull, it will not succeed, because objective-driven evolution rewards only images that increasingly look like butterflies or skulls, and CPPN lineages that lead to butterflies or skulls tend to pass through images that look nothing like either.
Innovation Engines, like crowds on Picbreeder, simultaneously collect improvements in a large number of objectives. This allows evolutionary lineages to be rewarded for steps that do not resemble butterflies or skulls (provided they resemble something else) and then to be rewarded as butterflies or skulls if they subsequently resemble either. Thus, a main result of this article is that the problem with traditional stochastic optimization is not that it is objective-driven, as is often argued (Lehman and Stanley, 2008, 2011a; Stanley and Lehman, 2015), but instead that it is driven by only a few objectives. The key is to collect “stepping stones in all interesting directions,” which can be approximated by simultaneously selecting for a vast number of objectives. Supporting this argument, our algorithm was able to produce many complex structures (see Figures 5, 1, and 6), including some that are similar to butterflies and skulls (see Figure 7).
We also qualitatively observed common features shared between the images evolved for the same target class over multiple runs of evolution (see Figure 8). For example, the Car wheel images tend to have concentric circles representing the tire and the rim. Images in the Tile roof category tend to exhibit a brownish terracotta color and the wavy pattern of the roof. This result shows that for certain categories, Innovation Engines can consistently produce images that are interesting and recognizable to both humans and DNNs.
Some evolved images are not recognizable, but often do contain recognizable features of the target class. For example, in Figure 5, the remote control has a grid of buttons and the zebra has black-and-white stripes. As was recently reported in our paper titled “Deep neural networks are easily fooled: High confidence predictions for unrecognizable images” (Nguyen et al., 2015a), this algorithm also produces many images that DNNs assign high-confidence scores to, but that are totally unrecognizable, even when their class labels are known (e.g., Figure 5, tailed frog and soccer ball). That study emphasized that the existence of such “fooling images” is problematic for anything that relies on DNNs to accurately classify objects because DNNs sometimes make mistakes. This article emphasizes the opposite, but not mutually exclusive, perspective: while using DNNs as evaluators sometimes produces fooling examples, it also sometimes works really well, and can thus automatically drive the evolution of a diverse set of complex, interesting, and sometimes recognizable images. Section 5.4 discusses a method for increasing the percent of evolved images that are recognizable.
To test the hypothesis that the CPPN images generated by Innovation Engines might actually be considered quality art, we submitted a selection of them to a selective art contest: the University of Wyoming’s 40th Annual Juried Student Exhibition, which accepted only 35.5% of the contest submissions. Not only was the selection of Innovation Engine-produced images we submitted accepted, but it was also among the 21.3% of submissions to be given an award (see Figure 9).
5.2 Investigating Whether Having More Objectives Improves Performance and Evolvability
This section contains a number of experiments and analyses to probe a central hypothesis motivating Innovation Engines, which is that having more objectives will tend to improve performance (on each objective) and improve evolvability. We first present results and analyses from the “one class vs. 1,000 classes” experiment reported in Nguyen et al. (2015b) because they provide a nice illustration of the power of having many objectives. Then, in the subsequent section, we take a deeper dive into these questions, and do so across a range of objectives (1, 50, 100, 500, 1000) instead of comparing only 1 to 1,000.
5.2.1 One Objective versus 1,000 Objectives
As discussed in the previous section, a key hypothesis for why Innovation Engines work is that evolving toward a vast number of objectives simultaneously is more effective than evolving toward each objective separately. In this section, we probe that hypothesis directly by comparing how MAP-Elites performs on all 1,000 objectives versus how evolution fares when evolving to each single-class objective separately. Because we did not have the computational resources to perform 1,000 single-class runs, we randomly selected 100 classes from the ImageNet dataset and performed two single-class MAP-Elites runs per category. We compare those data to how the ten runs of 1,000-class MAP-Elites performed on the same 100-class subset.
The 1,000-class MAP-Elites produced images with significantly higher median DNN confidence scores (see Figure 10; 90.3% vs. 68.3%, Mann–Whitney U test). The theory behind why more objectives help is that a lineage that is currently the champion in class X may be trapped on a local optimum, such that mutations to it will not improve its fitness on that objective (a phenomenon we observe in the single-class case: see Figure 11 inset). With many objectives, however, a lineage that has been selected for other objectives can mutate to perform better on class X, which occurs frequently with MAP-Elites. For example, on the water tower class (see Figure 11 inset), the lineage of images that produce a large, top-lit sphere does not improve for 250 generations, but at generation 1,750 a descendant of an organism that was the champion for the cocker spaniel dog class (see Figure 11) became a recognizable water tower and was then further refined.
Inspecting the phylogenetic tree of the 1,000 images produced by MAP-Elites in each run, we found that the evolutionary path to a final image often went through other classes, a phenomenon we call goal switching. For example, the path to a beacon involved stepping stones that were rewarded because they were at one point champions for the tench, abaya, megalith, clock, and cocker spaniel dog classes (see Figure 11). A different descendant of abaya traversed the stingray and boathouse classes en route to a recognizable planetarium (see Figure 11). A related phenomenon occurs on Picbreeder, where the evolutionary path to a final image often involves images that do not resemble the target (Secretan et al., 2011).
We quantitatively measured the number of goal switches per class (the number of times during a run that a new class champion was the offspring of a champion of another class). On average, each class had many goal switches, accounting for a substantial fraction of the mean number of new champions per class. Thus, a large percentage of improvements in a class came not from refining the current class champion, but from a mutation to a different class champion, helping to explain why Innovation Engines work.
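The goal-switch metric itself is a simple count over the lineage records. The data structure below (a per-class list of new champions tagged with the class whose champion was mutated to produce them) is an illustrative assumption, not the paper's actual bookkeeping.

```python
def count_goal_switches(champion_history):
    """Count goal switches per class. champion_history maps each class to the
    ordered list of (genome_id, parent_class) pairs, one per new champion,
    where parent_class is the class whose champion was mutated to produce it.
    A goal switch is a new champion whose parent was champion of a
    different class."""
    return {cls: sum(1 for _, parent_cls in events if parent_cls != cls)
            for cls, events in champion_history.items()}
```

For example, a beacon champion descended from a tench champion counts as a goal switch, while a beacon champion descended from the previous beacon champion does not.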
Another expectation, which we observed, is that the evolved images for many semantically related categories are also phylogenetically related. For example, according to the WordNet hierarchy (Deng et al., 2009), planetarium, mosque, church, obelisk, yurt, and beacon are subclasses of the structure class (see Figure 12). The evolved images for these classes are often closely related phylogenetically and share visual similarities (see Figure 11).
If two CPPN genomes produce equivalent behaviors (here, images), it is taken as a sign of increased evolvability if one has fewer nodes and connections (Lehman and Stanley, 2008; Woolley and Stanley, 2011; Secretan et al., 2011). It has been shown that objective-based search “corrupts” genomes by adding piecewise hacks that lead to small fitness gains, and thus does not find the simple, elegant solutions produced by divergent search processes (e.g., Novelty Search or Picbreeder crowds) (Woolley and Stanley, 2011). If Innovation Engines behave like traditional single- or multi-objective algorithms, one might expect them to produce large CPPN genomes. On the other hand, if Innovation Engines, which are many-objective algorithms, are more divergent in nature, they should produce smaller genomes like those reported for Picbreeder (Woolley and Stanley, 2011). While the comparison is not apples to apples for many reasons, Innovation Engine genomes are actually more compact than those for Picbreeder. The 10,000 MAP-Elites CPPN genomes contain a median of 27 nodes and 37.5 connections, versus the approximately 7,500 Picbreeder image genomes analyzed in Secretan et al. (2011), which have more nodes and an average of 146.7 connections (SD not reported).
5.2.2 Additional Analyses and a More Extensive Sweep Across the Number of Objectives
To further investigate how the number of objectives affects performance and evolvability, we conducted MAP-Elites experiments for a range of numbers of objectives: 1, 50, 100, 500, and 1,000. In each, we restricted the MAP-Elites archive to keep a champion for N classes, where N ∈ {1, 50, 100, 500, 1000}. The classes are randomly selected from the 1,000 ImageNet classes. In each generation, MAP-Elites produced 400 offspring by mutating a randomly selected champion from the set of N. Each of the 400 offspring would then be compared against every current class champion, and the offspring would replace a champion if its confidence score for that class was higher. For each treatment, the algorithmic hyperparameters were the same and we performed ten independent runs.
Performance Increases with the Number of Objectives. We found that median performance increases monotonically with the number of objectives (see Figure 13a; significant via Spearman’s rank-order correlation). There are at least two potential explanations for this result. The first is our main hypothesis for why Innovation Engines work well: increasing the number of objectives enables goal switching to occur more frequently (see Figure 13b). Supporting this explanation is the fact that the number of goal switches also increases monotonically with the number of objectives (significant via Spearman’s rank-order correlation). A second possible explanation is that having fewer objectives results in less diversity among the 400 offspring in each generation, because those offspring descend from a smaller pool of parents. Less diversity frequently hurts the performance of stochastic optimization algorithms, including evolutionary algorithms (Floreano and Mattiussi, 2008).
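Spearman’s rank-order correlation, used throughout these analyses, is simply the Pearson correlation of the ranks of the two variables. A self-contained sketch (no tie correction; the per-treatment data below are made up for illustration only):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rank-order correlation: Pearson correlation of the
    ranks of x and y (assumes distinct values, so no tie correction)."""
    rx = np.argsort(np.argsort(x)).astype(float)  # ranks of x
    ry = np.argsort(np.argsort(y)).astype(float)  # ranks of y
    rx -= rx.mean()
    ry -= ry.mean()
    return (rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry))

# Hypothetical median performance per treatment (illustration, not real data):
objectives = [1, 50, 100, 500, 1000]
performance = [0.12, 0.55, 0.61, 0.70, 0.74]
rho = spearman_rho(objectives, performance)  # 1.0 for strictly monotonic data
```

Because the statistic depends only on ranks, any strictly monotonic relationship yields rho = ±1 regardless of the raw values.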
Are Genomes More Evolvable as the Number of Objectives Increases? As mentioned earlier in Section 5.2.1, it has previously been shown that increased evolvability in CPPN-encoded organisms can be detected by fewer nodes in CPPN genomes (Lehman and Stanley, 2008; Woolley and Stanley, 2011; Secretan et al., 2011). We showed that the CPPNs evolved by Innovation Engines had fewer nodes and connections than the CPPNs from Picbreeder, although the comparison is not apples to apples. Here, we test a related hypothesis: that having more objectives leads to CPPNs having fewer nodes and connections, which would suggest that they have more compact, elegant, evolvable representations.
Contrary to our hypothesis, the number of nodes and connections slightly, but significantly, increases with the number of objectives (see Figure 14; Spearman’s rank-order correlation): the difference in the median number of nodes and connections between the 1-objective and the 1,000-objective treatment is only 5 nodes and connections, a small difference relative to the average total genome size. A study of Picbreeder genomes (Secretan et al., 2011) found a similar result: the size of CPPNs only slightly correlates with the human ratings for the evolved images.
A second indicator of evolvability is the modularity of genomes (Clune et al., 2013), because organisms designed in a modular fashion have been shown to adapt to new environments faster than nonmodular organisms (Kashtan and Alon, 2005; Kashtan et al., 2007; Clune et al., 2013). Here, we test whether having a larger number of objectives produces CPPN genomes that are more modular. The structural modularity of a CPPN is measured by calculating its Q score, which is the most commonly used modularity metric for networks (Newman, 2006). Specifically, we treat each CPPN genome as a directed graph, and adopt the Q score metric for directed networks from Leicht and Newman (2008). Q scores are calculated for all 16,510 champion CPPNs from all ten runs of all five treatments, where each treatment had 1, 50, 100, 500, or 1,000 objectives.
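For a fixed partition of nodes into communities, the directed modularity of Leicht and Newman (2008) is Q = (1/m) Σ_ij [A_ij − k_i^out k_j^in / m] δ(c_i, c_j), where m is the number of edges and k^out, k^in are out- and in-degrees. A minimal sketch with a toy graph (the community assignment itself requires a separate optimization step, not shown here):

```python
import numpy as np

def directed_q(A, communities):
    """Leicht-Newman modularity Q for a directed graph.
    A[i, j] = 1 if there is an edge i -> j; communities[i] is the
    community label of node i."""
    A = np.asarray(A, dtype=float)
    m = A.sum()                      # total number of edges
    k_out = A.sum(axis=1)            # out-degree of each node
    k_in = A.sum(axis=0)             # in-degree of each node
    c = np.asarray(communities)
    same = c[:, None] == c[None, :]  # delta(c_i, c_j)
    return ((A - np.outer(k_out, k_in) / m) * same).sum() / m

# Two 2-node modules (0<->1 and 2<->3) joined by a single edge 1 -> 2:
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 0, 0, 1],
     [0, 0, 1, 0]]
q = directed_q(A, [0, 0, 1, 1])  # 0.32
```

High Q indicates that edges concentrate within communities more than a degree-matched random graph would predict.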
We found a significant, but very slight, monotonic relationship between the number of objectives and the Q score (see Figure 15a; Spearman’s rank-order correlation). The lack of a strong relationship could be because genomic modularity is not beneficial in this domain. Supporting that theory is the fact that there is only a very slight, albeit significant, correlation between the Q score of the genome of each image and the confidence score of that image (see Figure 15b; Spearman’s rank-order correlation). Using the same Q score metric, we also found that CPPNs generated by the 1,000-objective treatment are significantly more modular than CPPNs evolved on Picbreeder (Stanley et al., 2013, 2016) (Mann--Whitney U test). While the comparison is not apples to apples for many reasons, this result shows that our automated evolution can produce solutions as elegant and modular as those of human-assisted evolution.
The evolvability of an organism can also be measured as a function of the fitness distribution of its offspring, with a higher distribution indicating increased evolvability (Hornby et al., 2003; Clune et al., 2011; Fogel and Ghozeil, 1996; Grefenstette, 2014; Belew et al., 1995; Igel and Chellapilla, 1999; Igel and Kreutz, 1999). To test whether more objectives lead to increased evolvability under this measure, we compared the fitness values of parents and their offspring. One challenge when measuring evolvability in this way is that the distribution of offspring fitness values may depend on the fitness of the parent. Since we have already shown that having more objectives improves performance, we need to control for the fitness of the parent in this evolvability analysis. To do that, we select a set of 400 different champions per treatment with performance values semi-evenly distributed across the confidence range, such that each treatment has approximately the same distribution of fitness values, thus controlling for the fitness of the parents across treatments. Specifically, we divide the confidence range into 20 equal-width bins. For each bin, every treatment contributes the same number of organisms. This number varies from bin to bin; however, on average, each treatment has 20 organisms in each bin for comparison. It was possible to fulfill these constraints for all treatments except the single-objective treatment: we thus include only the four treatments that had more than one objective.
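The fitness-matching procedure can be sketched as follows. This is an illustrative reconstruction, not the experimental code: it assumes one array of champion confidence scores per treatment, and each bin contributes the minimum per-treatment count so that every treatment ends up with the same fitness distribution.

```python
import numpy as np

def fitness_matched_sample(scores_by_treatment, n_bins=20, lo=0.0, hi=1.0):
    """Return, per treatment, the indices of champions chosen so that
    every treatment contributes the same number of champions to each
    equal-width confidence bin."""
    edges = np.linspace(lo, hi, n_bins + 1)
    chosen = {t: [] for t in scores_by_treatment}
    for b in range(n_bins):
        # champions of each treatment whose score falls in bin b
        in_bin = {t: np.flatnonzero((s >= edges[b]) & (s < edges[b + 1]))
                  for t, s in scores_by_treatment.items()}
        n = min(len(idx) for idx in in_bin.values())  # equalize across treatments
        for t, idx in in_bin.items():
            chosen[t].extend(idx[:n].tolist())
    return chosen
```

A treatment with no champions in some bin forces that bin's contribution to zero for every treatment, which is why the single-objective treatment had to be excluded.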
There are two separate sets of objectives over which we could have measured the fitness of offspring. Offspring could be compared versus their parents on the class of the parent only, or across all 1,000 ImageNet classes. We suspected that single- and few-objective treatments would have offspring that did better on the class of their parent, because organisms in these treatments spend most of their evolutionary history attempting to keep average fitness high on one or a few objectives. We further predicted that many-objective treatments would have organisms more evolved for goal switching, such that their average fitness across all 1,000 objectives would be higher. While both can be considered a form of evolvability, the goal-switching form of evolvability is what is truly required to solve extremely challenging problems and to make progress on open-ended evolution (Lehman and Stanley, 2008; Stanley and Lehman, 2015).
To measure within-class fitness changes, we produced 10 mutants per champion (each champion is from a class C), and measured their DNN confidence score improvement in class C (only) relative to the champion (i.e., the parent). In total, 400 champions × 10 mutants = 4,000 mutants were considered per treatment. As expected, we found a very slight, but significant, negative monotonic correlation between the fitness changes of offspring with respect to their parent’s class only and the number of objectives in a treatment (see Figure 16; Spearman’s rank-order correlation). Overall, across all treatments, most of the offspring have a lower DNN confidence score than their parents, but treatments with more objectives had distributions with slightly lower medians. Additionally, the variance of the mutant confidence-change distributions decreases as the number of objectives increases (see Figure 16). One explanation is that fewer-objective treatments produce organisms with lower confidence scores than more-objective treatments (see Figure 13a), leaving more room for mutations to improve (hence the higher variance). In the previous paragraph, we outlined one hypothesis for why single- or few-objective treatments have higher fitness distributions for their parent’s class: they have not been evolved to goal switch as much.
A perhaps better measure of evolvability is not just whether organisms fare better on the fitness peak their parents are on, but how they do across all fitness peaks. To test this hypothesis, we need to control not only for champion fitness, but also for the number of classes the champions came from. Specifically, we select 500 mutants (out of 4,000 total) whose parents satisfy two conditions: 1) having confidence scores within a fixed range; and 2) each coming from one of 50 randomly chosen classes. In order to have 500 champions that meet these criteria for all treatments, we were not able to constrain this set of 50 classes to be the same for all treatments. From the 500 offspring produced per treatment, we select the best image for each of the 1,000 ImageNet classes. As expected, treatments with more objectives produce significantly higher average fitness across all 1,000 classes (see Figure 17). This result confirms that the presence of multiple objectives selects not just for high fitness, but for evolvability in the sense of being more likely to achieve high fitness on different objectives. This sort of evolvability could potentially aid our quest for open-ended evolutionary dynamics like those seen in nature (Lehman and Stanley, 2008; Stanley and Lehman, 2015).
Overall, the evidence is either neutral or positive regarding the claim that more objectives improve evolvability. While we did not find that more objectives lead to substantially higher modularity or genome compactness, there was a slight, but significant, positive correlation between CPPN modularity and the number of objectives. More convincing is the fact that genomes evolved in the many-objective treatments are worse at staying high on the peak their parents currently occupy, but have a significantly higher fitness distribution across all 1,000 objectives. In other words, Innovation Engines produce organisms with a form of evolvability that makes them more likely to goal switch than EAs with a single objective or a small number of objectives.
5.3 Innovation Engine with Novelty Search
To support the case that Innovation Engines should work with any diversity-promoting EA combined with a DNN-provided deep distance function, we implemented Innovation Engine 1.0 with Novelty Search instead of MAP-Elites. After Novelty Search was afforded the same number of image evaluations, we found the best image it produced for each class according to the DNN. We performed ten independent runs of Novelty Search. To facilitate comparison to the single-class control, we compare performance on the 100 classes randomly selected for the single-class control (Section 5.2). The MAP-Elites vs. Novelty Search comparison on 100 classes is qualitatively the same on all 1,000 classes (data not shown).
As expected, Novelty Search also produced high-confidence images in most classes (see Figure 10). Its median confidence of 91.6% significantly outperforms the 68.3% for the single-class control (Mann--Whitney U test). While it significantly underperforms MAP-Elites at the 1,000th generation, from the 2,000th generation onward Novelty Search slightly but significantly outperforms MAP-Elites (Mann--Whitney U test), although MAP-Elites has a higher final mean (79.5% vs. 74.0%). The images produced by the two treatments are qualitatively similar (data not shown). This result confirms that in this domain both MAP-Elites and Novelty Search can serve as the diversity-promoting EA in an Innovation Engine.
We also performed a control experiment that swaps out the deep distance function for a shallow, pixel-wise (L1) distance function. This experiment was performed with Novelty Search only, because it is not obvious how to sensibly discretize the space of all possible pixel combinations into bins, as MAP-Elites requires (Mouret and Clune, 2015; Cully et al., 2015).
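The shallow variant of the novelty score can be sketched as follows: a candidate image's novelty is its mean pixel-wise L1 distance to its k nearest neighbors among previously seen images. This is an illustrative sketch only; the archive-management and selection machinery of Novelty Search is omitted.

```python
import numpy as np

def l1_novelty(candidate, archive, k=15):
    """Mean L1 (pixel-wise) distance from a candidate image to its k
    nearest neighbors among previously seen images. With 60,000-pixel
    images, every comparison costs a full pass over the pixels, which
    is why this control experiment was computationally expensive."""
    dists = np.sort([np.abs(candidate - past).sum() for past in archive])
    return dists[:min(k, len(dists))].mean()
```

Because any static pattern is far from any other static pattern in L1 space, this score rewards endless low-level pixel variation rather than interestingly different types of images.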
The results reveal that Novelty Search with the L1 distance function performs poorly: the images it produces are given extremely low confidence scores by the DNN. After 2,000 generations, the median confidence score is 0.18%, significantly and substantially lower than the 84.4% for Novelty Search with the deep distance function (Mann--Whitney U test). Since its performance is significantly lower than that of all other treatments at generation 2,000 (Mann--Whitney U test), and because this experiment was computationally expensive due to the number of pixel comparisons needed to calculate L1 distances between images, we did not run the experiment all the way to 5,000 generations, as we did for the other treatments (see Figure 10). While most images are uninteresting and unrecognizable (see Figure 18b), we found a few high-scoring images with recognizable patterns (see Figure 18a, e.g., black-and-white stripes for an image in the zebra class and vertical bars for an image in the prison class). This experiment confirms that Novelty Search has difficulty finding the rare, interesting, recognizable images in the vast space of all possible pixel combinations. It suggests that, in general, Novelty Search will struggle to find interesting, rare items in a vast, high-dimensional space without a deep distance function that can focus the search on the interesting low-dimensional manifolds that exist within the higher-dimensional space.
5.4 Attempting to Further Improve the Frequency and Quality of Recognizable Images by Adding a Natural Image Prior
Although DNNs sometimes make mistakes, giving high confidence scores to unrecognizable “fooling” images (Nguyen et al., 2015a), we recently showed that using a collection of image priors to bias optimization toward images with more natural image statistics can help produce more recognizable images (Yosinski et al., 2015). That finding was not for evolved images but for images produced via gradient-based optimization methods, in which gradients are backpropagated to each pixel to search for images that maximally activate certain neurons in DNNs. We subsequently found that minimizing one particular prior called total variation (Rudin et al., 1992) produces even more recognizable images (Nguyen, Yosinski et al., 2016). We hypothesized that penalizing high total variation in the fitness function would encourage evolution to search for more regular images with constant color patches, and that this bias would thus improve recognizability, at least for directly encoded images.
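The total variation penalty can be sketched as follows (anisotropic TV on a grayscale image; the weighting coefficient `lam` is a hypothetical parameter for illustration, not a value from our experiments):

```python
import numpy as np

def total_variation(img):
    """Anisotropic total variation: the summed absolute differences
    between horizontally and vertically adjacent pixels. Smooth images
    with constant color patches score low; pixel static scores high."""
    return (np.abs(np.diff(img, axis=1)).sum() +
            np.abs(np.diff(img, axis=0)).sum())

def penalized_fitness(confidence, img, lam=1e-4):
    # Fitness = DNN confidence minus a penalty for high total variation,
    # biasing the search toward more regular, natural-looking images.
    return confidence - lam * total_variation(img)
```

Under this fitness, two images with equal DNN confidence are ranked by smoothness, which is the bias hypothesized above to improve recognizability.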
A recent study showed that it takes only 100 generations to produce unrecognizable images that the LeNet DNN classifies as digits with high confidence (Nguyen et al., 2015a). Here, we ran that experiment 50 times longer, to 5,000 generations, with and without a penalty for total variation. Without a total variation penalty, these additional generations do enable evolution to produce recognizable images, at least for some classes, although the images contain much high-frequency static (see Figure 19a). Adding a total variation penalty to the fitness function produced smoother, more recognizable images with less noise (see Figures 19b, 19c, and 19d). While the total variation prior is thus beneficial, its benefit shows up only after thousands of generations, making it extremely computationally expensive for only marginal benefit.
We also tested adding a total variation penalty on the challenge of producing images that resemble ImageNet classes, but it did not qualitatively improve image recognizability or DNN confidence scores after 20,000 generations (data not shown), although perhaps it would with much more computation than we had available.
When conducting the same experiment on CPPNs instead of directly encoded images, we found that total variation regularization does not improve the recognizability of images. A likely explanation is that CPPN-encoded images already tend to be smooth and regular (see Figure 6), and thus already have low total variation.
All told, the total variation prior may help the directly encoded Innovation Engine in the image domain, but not enough to make a large qualitative difference. As expected, the total variation prior does not help with the already regular indirectly (CPPN) encoded images. To greatly improve the frequency and quality of the production of recognizable images, more research is needed to identify better priors that penalize non-natural images (Yosinski et al., 2015).
6 Discussion and Conclusion
This article introduces the concept of the Innovation Engine. It then describes a version of Innovation Engines that can be created with existing deep learning technology, relying on DNNs trained with supervised learning, as well as a more ambitious Innovation Engine that will take advantage of unsupervised learning technology once it is more mature. Our article also offers a first empirical investigation into many different aspects of Innovation Engines, including why they work and the degree to which they promote evolvability.
All of the work in this article is in the domain of generating images. Innovation Engines should, in principle, work in any domain, but future work is required to validate that hypothesis. In a future study, we will create Innovation Engines in more quantitative domains. For example, we will pair DNNs trained to recognize different actions in videos (e.g., cartwheels, backflips, handshakes) with evolutionary algorithms to attempt to automatically create neural network robot controllers for thousands of different robotic behaviors. DNNs can already classify the actions taking place in videos (Karpathy et al., 2014; Simonyan and Zisserman, 2014; Donahue et al., 2015), and EAs can evolve neural networks to produce a variety of robot behaviors (Floreano and Mattiussi, 2008; Floreano and Keller, 2010; Cully et al., 2015; Clune et al., 2011; Li et al., 2014; Clune et al., 2009; Lee et al., 2013; Cheney et al., 2013), so we are optimistic that an Innovation Engine in this domain will be successful. That said, its computational costs will be substantial, given how expensive it is both for DNNs to evaluate videos and for EAs to simulate robot behaviors.
Our results have shown that the Innovation Engine concept is worth exploring further. Specifically, we have supported some of its key assumptions: that evolving toward many objectives simultaneously approximates divergent search; that DNNs can provide informative, abstract distance functions in high-dimensional spaces; and that Innovation Engines can generate a large, diverse, interesting set of solutions in a given domain (here images). Innovation Engines will only get better as DNNs are improved, especially when generative DNN models trained with unsupervised learning can scale to higher dimensions. Ultimately, Innovation Engines could potentially be applied to the countless number of domains where stochastic optimization is applied. Like human culture, they could eventually enable endless innovation in any domain, from software and science to arithmetic proofs and art.
We thank Joost Huizinga, Christopher Stanton, Henok Mengistu, and Jean-Baptiste Mouret for useful conversations. Jeff Clune was supported by an NSF CAREER award (CAREER: 1453549) and a hardware donation from the NVIDIA Corporation, and Jason Yosinski by the NASA Space Technology Research Fellowship and NSF grant 1527232.