Abstract

Genetic programming (GP) coarsely models natural evolution to evolve computer programs. Unlike in nature, where individuals can often improve their fitness through lifetime experience, the fitness of GP individuals generally does not change during their lifetime, and there is usually no opportunity to pass on acquired knowledge. This paper introduces the Chameleon system to address this discrepancy and augment GP with lifetime learning by adding a simple local search that operates by tuning the internal nodes of individuals. Although not the first attempt to combine local search with GP, its simplicity means that it is easy to understand and cheap to implement. A simple cache leverages the local search to reduce the tuning cost to a small fraction of the expected cost. We provide a theoretical upper limit on the maximum tuning expense given the average tree size of the population, and show that this limit grows very conservatively as the average tree size increases. We show that Chameleon uses available genetic material more efficiently by exploring more actively than standard GP. Finally, we demonstrate that Chameleon not only outperforms standard GP (on both training and test data) over a number of symbolic regression type problems, but does so while producing smaller individuals, and that it works harmoniously with two other well-known extensions to GP, namely, linear scaling and a diversity-promoting tournament selection method.

1  Introduction

The success of the so-called memetic algorithms (MAs; Norman et al., 1991; Moscato, 2003) has underlined the importance of local search in augmenting the global search of evolutionary algorithms (EAs; Goldberg, 1989; Holland, 1975). Unlike traditional EAs, MAs intrinsically exploit problem specific information to fine-tune the evolving solutions, giving them the opportunity to improve beyond their genetic makeup at birth. Thus, MAs go a step further to mimic natural evolution: not only does information spread through genes across generations, it can also spread within a generation, through imitating ideas, catch phrases, and fashion (Dawkins, 1990). Moreover, the particular experiences of an evolving entity may add to its survivability.

While quite some work has been conducted on attaching such lifetime learning to EAs (see, e.g., Moscato, 1989; Östermark, 1999; Ozcan and Mohan, 1998; M.J. Bayley and Williamson, 1998; Dandekar and Argos, 1996; de Souza et al., 1998), genetic programming (GP; Koza, 1992; Banzhaf et al., 1998), a type of EA that evolves computer programs, has seen disproportionately few such examples.

However, GP faces issues that suggest that it can benefit from lifetime learning with local search. For example, the tension between structure and content in GP is well documented (Ryan and Keijzer, 2003; Korns, 2009; Daida, 2004). That is, the eventual shape that the structures of the individuals take depends upon the availability of basic building blocks (functions, terminals, and/or subtrees) in the first few generations. If something is not available or disappears (Hu et al., 2005), for example, if a particular constant is absent, then evolution may work around its absence and produce a subtree to overcome the handicap.

There are costs to this, of course, such as extra time spent evolving the missing parts, a larger memory requirement due to larger trees, a tendency to grow too complex to generalise well across unseen data, and an increased danger of bloating because of the larger trees (Blickle and Thiele, 1994; Soule and Foster, 1998; Langdon, 2000; Poli, 2003).

Another issue with GP is that the population quickly becomes homogeneous; McPhee and Hopper (1999) demonstrated that up to 70% of the final population shared the top four levels with a single ancestor. If these levels are suboptimal, or even just mediocre, the consequences for the rest of the run can be grave: already faced with evolving solutions to the problem at hand, GP now also has to work around the major handicap of the top levels interfering with the work of the lower ones.

Furthermore, in common GP practice, the population learns at the expense of the individual: bad individuals are removed from the population while the good ones stay. To decide whether an individual is good or bad, it is typically evaluated only once. Thereafter, if the individual is good enough, it can only breed offspring or spread multiple copies of itself. This is unlike the case in nature, where, in addition to breeding, organisms test themselves multiple times against the environment and improve their behaviour through experience.

To address these issues, this paper introduces Chameleon, essentially a hill climbing GP algorithm that permits individuals, through a form of hypermutation, to change their internal function nodes during their lifetime. Tuning the leaves to optimise the coefficients or variables deserves a separate investigation; that is beyond the scope of this paper.

A question then arises as to whether such additional learning encourages over-fitting the training data even more than in standard GP, and how expensive it is. The results demonstrate that this form of learning improves the best fit individuals on both training and test data while also decreasing the average tree size. In fact, we show that we can encourage smaller sized trees in the population as a direct consequence of lifetime learning. To do this, instead of uniformly tuning every tree, we reward shorter-than-average trees by giving them more than the average number of chances to tune their internal nodes. Correspondingly, larger-than-average trees get fewer than the average number of tuning opportunities per node. Such a focus on smaller trees not only decreases the tuning expense but also promotes smaller trees in the population.

Exhaustive tuning can appear expensive, particularly when the cardinality of the functions set increases, or as the trees grow larger; however, we demonstrate that with our choice of algorithm, the former is not necessarily true, whereas the latter (growth in tree size) can be curtailed by preferentially tuning smaller trees. Also, in Section 5 we formally quantify the maximum tuning expense.

The paper is laid out as follows: Section 2 gives the background to this work; Section 3 describes the proposed method, how it can be made computationally more efficient and how it can be used to induce parsimony pressure in the population; Section 4 discusses the experimental setup, enlists the results, and then comments on their significance; Section 5 theoretically analyses our method for the maximum tuning expense, gives an upper bound on it, and shows that even when used sparingly (as against using it on every individual), Chameleon can still outperform standard GP; and finally, Section 6 concludes the paper.

2  Background

Approaches to lifetime learning in GP can coarsely be grouped as external, internal, or cultural. The key difference between the first two approaches is the type of individuals they work with. The external approaches improve typical GP individuals, that is, some form of expression trees, with local search methods such as hill climbing. The internal approaches, however, work with individuals that incorporate an internal mechanism of learning by design, for example, when the individuals are neural networks or support vector machines. The third approach uses a notion of culture to share the learning across the population through some sort of implicit communication. Although Chameleon is very much in the first category, we briefly discuss each category in this section.

Although not as common as with non-GP EAs (see Moscato, 2003, for a detailed review of local search with non-GP EAs), examples of external approaches applying hill climbing to GP date from relatively early (Harries and Smith, 1997; Iba et al., 1994b; O'Reilly and Oppacher, 1994, 1996) to the more recent (Topchy and Punch, 2001; Zhang and Smart, 2004; Krawiec, 2001; Krasnogor, 2004; Nel, 2004; Majeed and Ryan, 2006; Wang et al., 2011). We now briefly review these studies.

The tuning of numeric coefficients has been a popular research topic in GP (Keijzer, 2004b; Iba et al., 1994a; Iba and Nikolaev, 2000; Nikolaev and Iba, 2001; McKay et al., 1999; Hiden et al., 1999; Hunter, 2002), and some studies have also employed local search for this purpose (Topchy and Punch, 2001; Krawiec, 2001; Zhang and Smart, 2004, 2005). Among these approaches, linear scaling (Keijzer, 2004b) has gained some popularity (Raja et al., 2008, 2007; Majeed and Ryan, 2006; Archetti et al., 2006) because it is a deterministic and computationally cheap technique. As we adopt it later in this paper, we explain it further, below.

Linear scaling deterministically optimises two linear parameters to minimise the sum of squared errors between target values, t(x), and approximate values, y(x), where x represents a vector of independent variables. The linearly scaled mean squared error (MSE) is calculated as:
\[
\mathrm{MSE}_s(t, y) = \frac{1}{n}\sum_{i=1}^{n} \bigl(t_i - (a + b\,y_i)\bigr)^2
\]
where
\[
b = \frac{\sum_{i=1}^{n}(t_i - \bar{t})(y_i - \bar{y})}{\sum_{i=1}^{n}(y_i - \bar{y})^2}, \qquad a = \bar{t} - b\,\bar{y},
\]
with $\bar{t}$ and $\bar{y}$ the means of the targets and outputs over the $n$ training cases.
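
Since $a$ and $b$ reduce to the ordinary least-squares intercept and slope, the scaled error takes only a few lines to compute. A minimal NumPy sketch (the function name is ours, and we add a guard for the degenerate case of a constant output):

```python
import numpy as np

def scaled_mse(t, y):
    """Mean squared error after optimal linear scaling of the outputs y."""
    t, y = np.asarray(t, dtype=float), np.asarray(y, dtype=float)
    y_var = np.var(y)
    if y_var == 0.0:                    # constant output: scaling degenerates to b = 0
        return float(np.mean((t - np.mean(t)) ** 2))
    b = np.cov(t, y, bias=True)[0, 1] / y_var    # slope: cov(t, y) / var(y)
    a = np.mean(t) - b * np.mean(y)              # intercept
    return float(np.mean((t - (a + b * y)) ** 2))
```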

Keijzer (2003) has shown that linear scaling significantly boosts the performance of GP on training cases; results on test cases were not presented. Due to the simplicity and widespread usage of linear scaling, we consider GP with linear scaling as a benchmark GP setup in Section 4. We verify whether linear scaling and Chameleon can combine harmoniously to produce an even better GP system.

As with tuning the external nodes (such as numeric leaves), the GP literature also has some examples of tuning the internal nodes. Among these examples, O'Reilly and Oppacher (1994, 1996) used hill climbing and simulated annealing to improve a certain percentage of the evolving population with repeated mutation or crossover. Harries and Smith (1997) also tuned the evolving individuals, in their case by crossing them over with themselves and by mutating them; both crossover and mutation were applied several times in a hill climbing fashion. Krasnogor (2004) co-evolved local search heuristics and applied them successfully to the evolving individuals to compare protein structures. Nel (2004) optimised the threshold values in decision trees with directed increasing incremental local search (DILL); much like gradient descent methods use momentum to speed up convergence when the gradient is unchanging (Mitchell, 1996, p. 100), DILL increases the magnitude of change if fitness continues to improve with each iteration of local tuning. Majeed and Ryan (2006) and Majeed (2007) improved crossover by finding the best crossover point for the incoming subtree by trying every possible crossover site; the best crossover point was the one that maximised fitness after crossover. Zhang et al. (2007) repeatedly crossed over two parents in a hill climbing fashion to produce an offspring better than the parents. For the offspring thus produced, the corresponding crossover points in the parents were assigned a weight, which was used to protect the underlying subtrees from subsequent crossovers and so promote larger building blocks in GP. As with Nel (2004), Wang et al. (2011) tuned the thresholds in decision trees to restrict the decision boundaries. Moreover, they also used a splitting operator to subdivide the decision subspaces and further tune the decisions.

In general, all these approaches combine standard GP with some sort of local (often greedy) search. Many of these methods have enjoyed success by combining the global search abilities of GP with local power of hill climbing type methods.

Another recent effort is that of Korns (2009), in which the system uses abstract grammars. Unlike standard grammar-based systems such as grammatical evolution (Ryan et al., 1998; O'Neill and Ryan, 2003), in which an individual is ready to be evaluated once it maps an expression from a grammar, Korns generates expressions containing placeholders before using an external tuning process to decide upon the contents of these placeholders. Depending upon whether it requires a function or a terminal to replace it, a placeholder is called an abstract function or an abstract terminal; an individual can thus contain a mixture of abstract functions $f_i$, concrete functions (e.g., ×), and constants. Next, a vector of functions is produced by a method of choice, such as particle swarm optimization, differential evolution, or even a GA, from which each abstract function is chosen. Thus, unlike the previous work, the tuning algorithm is not necessarily greedy. Next, the same or a different algorithm also tunes the variables and constants in the expression.

This system can be very expensive: many parts of an expression are affected in each tuning iteration, thereby reducing the ability to cache the intermediate results, and the tuning algorithms can be as expensive as the user can afford. However, Korns has enjoyed some success with it, at least partly because it uses related methods for constant discovery and variable selection.

As we discuss in the following section, Chameleon is considerably simpler, partly because it only considers a single node at a time, which allows the results of the rest of the tree to be cached, but also because the lifetime learning it performs is passed on to offspring as in Lamarckian evolution (i.e., changes made to an individual are passed on to offspring, e.g., Downing, 2001). Moreover, we show that it is possible to perform the learning in a manner that generates parsimony pressure in the GP population, thus further decreasing the computational expense.

The second approach to lifetime learning is the internal approach, that is, to let individuals control their own learning; although, clearly, this requires individuals to produce structures that are capable of doing so. Usually (Khan and Miller, 2009; Curran and O'Riordan, 2006), these methods generate a neural network of some description or a support vector machine (Howley and Madden, 2005) that can be further trained on the problem at hand.

While the spirit and intention of the second approach is essentially the same as what we are attempting to achieve here, the tuning method in Chameleon is so designed that a standard GP run can benefit from lifetime learning, regardless of whether or not the individual agents are active learning structures (e.g., a neural network).

The third approach to lifetime learning involves sharing information across the population or imitating other individuals to learn from their experience in addition to genetic inheritance (Spector and Luke, 1996; Zannoni and Reynolds, 1997; Eskridge and Hougen, 2004; Meuth, 2010). Such a culture of sharing information can occur within a generation (intra-generational), across generations (inter-generational), or across runs (inter-run).

Spector and Luke (1996) implement intra-generational learning with a shared memory which each individual can write to and read from. Thus, the end results of various computations during fitness evaluation remain in this memory, and other individuals can read these results. Spector and Luke hypothesise that such a culture of sharing information improves the overall performance of GP.

Among the other two cultural approaches, inter-generational learning attempts to encapsulate potentially useful genetic material into a single node during a run so that, when crossover occurs, the material is less likely to be broken up (Angeline and Pollack, 1992; Rosca and Ballard, 1996). The inter-run methods use a form of cascading runs, in which potentially useful genetic material discovered in one run is immediately available at the start of the following one (Ryan et al., 2004; Murphy et al., 2007; Keijzer et al., 2005; Meuth, 2010). However, the cultural learning algorithms are beyond the scope of this paper; we do not discuss them any further.

In spite of all these advances, however, the overwhelming majority of GP users still use standard GP without any hill climbing additions. Why is this? We believe it is because of a fear of expense: expense not only in terms of implementing the algorithm, but also in terms of how quickly it will run. After all, almost by definition, each iteration of hill climbing per individual adds to the fitness evaluation count. For example, Krawiec (2001) limited local learning to just a single individual per generation to minimise the runtime costs.

In this paper, we introduce a system that improves performance, reduces cost (smaller size of individuals), is trivial to implement, and can have an upper bound of cost imposed upon it.

3  Lifetime Learning with Chameleon

We propose to locally tune each internal node in a GP tree, one node at a time, in a top-down fashion. First, we evaluate the tree in the standard fashion to establish a baseline fitness value. Then, starting with the root node, at each internal node we iterate through the entire set of applicable functions, that is, we respect the original arity of each node, which means that all the changes are point changes, rather than structural changes. As with Korns’ approach, the node undergoing tuning acts as a placeholder or as an abstract function; however, the rest of the tree remains fully specified. The whole process is akin to mutating a node exhaustively.

This exhaustive mutation iteratively attempts to improve the fitness of an individual. After each mutation, we evaluate the tree for fitness; if the new fitness is higher than the best result so far, the change is accepted into the tree, otherwise, it is discarded. While in this case we traverse the tree in a depth-first and top-down fashion, other approaches (e.g., breadth-first, bottom-up) may also be useful.
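
To make the procedure concrete, the following is a minimal Python sketch of one complete tuning sweep, assuming a simple `Node` tree representation and a user-supplied `fitness` callable that scores the whole tree (higher is better); all names are ours, and details may differ from the actual implementation:

```python
class Node:
    """A GP tree node: an operator label plus zero or more children."""
    def __init__(self, op, children=()):
        self.op, self.children = op, list(children)

def tune(tree, functions_by_arity, fitness):
    """One Chameleon sweep: depth-first, top-down, point changes only."""
    best = fitness(tree)                        # baseline fitness of the whole tree
    stack = [tree]                              # start the traversal at the root
    while stack:
        node = stack.pop()
        if node.children:                       # internal node: same-arity swaps only
            original = best_op = node.op
            for f in functions_by_arity.get(len(node.children), ()):
                if f == original:
                    continue
                node.op = f                     # point mutation; structure unchanged
                trial = fitness(tree)
                if trial > best:                # accept strict improvements only
                    best, best_op = trial, f
            node.op = best_op                   # leave the best label in place
            stack.extend(reversed(node.children))
        # leaves fall through untouched: tuning coefficients is out of scope
    return best
```

For example, with `functions_by_arity = {2: ['+', '-', '*', '/']}`, a node labelled '+' is re-evaluated with '-', '*', and '/' substituted in turn, which is exactly the exhaustive point mutation described above.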

The bigger the tree, the more expensive it is to tune. However, the same is not necessarily true if the functions set increases in size. Let $F = \bigcup_j F_j$ be the functions set, such that $F_j$ is the set of functions with arity $j$; let $C_{max} = \max_j |F_j|$, where $|F_j|$ is the cardinality of $F_j$, so that $C_{max}$ is the highest cardinality among all the $|F_j|$; and let $i$ be the number of internal nodes of the tree under consideration. The additional number of fitness evaluations carried out during one complete sweep of tuning is then $n \le i\,(C_{max} - 1)$. Let $s$ be the tree size and $N_1$ be the number of nodes processed during the $n$ fitness evaluations; then $N_1 = ns \le i\,(C_{max} - 1)\,s$. Note, though, that this upper limit on $N_1$ (or $n$) increases with $C_{max}$; however, it remains unaffected if we add to $F$ some $F_k$ with $|F_k| \le C_{max}$, so that $C_{max}$ stays the same even though the overall functions set still gets larger. For example, if we add unary functions to a functions set that previously only contained binary functions, then, because we iterate only over the type of functions that an individual already has in it, the tuning expense does not increase for an individual that only has binary functions in its internal nodes.

3.1  Cost Reduction

Before we compare tuning costs against benefits, we employ a simple technique to avoid repetitive evaluations in a tree, that is, caching the intermediate results. Keijzer (2004a) has proposed vectorised evaluation of the input data points and caching of the corresponding outputs of subtrees in a GP population. While Keijzer and others (Downey and Zhang, 2011; Wong and Zhang, 2007) use a population-wide cache, here we only consider a cache local to the tree under consideration. This cache is very efficient, as the same subtrees will be evaluated several times for a particular individual. We implement it as a two-dimensional floating point array of size tree_size × |data_set|. To further aid efficiency, we reuse the same large cache for every tree; that is, rather than being reallocated, the cache is simply cleared out at the start of each individual's evaluation. Occasionally, when encountering exceptionally large trees, the cache must be resized; however, this is infrequent and, as we discuss in Section 3.2, can be discouraged by containing growth in tree size.

We now quantify how much the cache saves us when evaluating the same tree multiple times while tuning it. Note that the output of every node is cached when the tree is evaluated for the first time. During the tuning performed by Chameleon, this output is retrieved from the cache and only the nodes on the path to the root node are reevaluated; Figure 1 exemplifies this. However, the number of nodes on the path to the root node is simply one more than the depth of the tuning site (the depth of the root node is 0). Let $d_k$ be the depth of the $k$th internal node and, for simplicity, let $|F_j| = C_{max}$ for all $j$, so that the number of nodes processed without the cache is $N_1 = i\,(C_{max}-1)\,s$. Then the number of nodes processed with the cache is:
\[
N_2 = (C_{max}-1)\sum_{k=1}^{i}(d_k + 1) = i\,(C_{max}-1)\,(\bar{d} + 1),
\]
where $\bar{d} = \frac{1}{i}\sum_{k=1}^{i} d_k$ is the average depth of the internal nodes of the tree under consideration. This average depth is strictly smaller than the tree size $s$; therefore, $N_2$ is only a fraction of $N_1$. Appendix A shows that the same result holds for the general case where $|F_j| < C_{max}$ for at least some $j$, that is, the number of functions in each subset of arity $j$ is not necessarily the same. This confirms that using the cache decreases the tuning cost. The exact value of this decrease depends upon the size and shape of the tree under consideration; however, Figure 1 plots the percentage decrease for skinny trees that only use unary functions and for maximally grown binary trees. For the skinny trees the cache consistently decreases the tuning cost by 50%; for the maximal binary trees the decrease asymptotically approaches 100%.
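
As a sanity check on the two limiting shapes discussed above, the following short script (ours; it assumes Cmax = 4) computes N1, N2, and the resulting percentage saving for a unary chain and a perfect binary tree:

```python
def savings_skinny(s, c_max=4):
    """Chain of s nodes built from unary functions: internal nodes at depths 0..s-2."""
    n1 = (s - 1) * (c_max - 1) * s                          # no cache: full traversals
    n2 = (c_max - 1) * sum(d + 1 for d in range(s - 1))     # cache: path to root only
    return 100.0 * (1.0 - n2 / n1)

def savings_full_binary(d_max, c_max=4):
    """Perfect binary tree of depth d_max: 2^d internal nodes at each depth d < d_max."""
    s = 2 ** (d_max + 1) - 1
    i = 2 ** d_max - 1
    n1 = i * (c_max - 1) * s
    n2 = (c_max - 1) * sum(2 ** d * (d + 1) for d in range(d_max))
    return 100.0 * (1.0 - n2 / n1)

print(savings_skinny(100))      # -> 50.0, independent of the chain length
print(savings_full_binary(10))  # -> about 99.6, approaching 100 as the tree deepens
```
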
Figure 1:

(Left): Local caching is demonstrated: (a) during normal evaluation, the results of every node are placed in the cache; (b) during tuning, only the nodes on the path to the root node are evaluated afresh, while the cache is used for the rest. (Right): The cache can significantly decrease the cost, that is, the number of nodes processed while tuning. The figure plots the percentage decrease for skinny (only unary functions) and maximally grown binary trees.


3.2  Probabilistic Tuning and Parsimony Pressure

Even with caching, tuning incurs an additional computational cost after a standard fitness evaluation. What can exacerbate the cost further is that the larger the tree, the more tuning sites it has. Thus, larger trees have more opportunities to improve their fitness than smaller ones, which may encourage growth in tree size in the population. This is somewhat similar to standard GP, where the (generally) inferior average fitness of smaller trees helps increase the average tree size of the population (Poli, 2003) as larger trees tend to have a higher relative fitness. To avoid promoting larger trees with Chameleon (and, in turn, incurring an even greater tuning expense), we propose rewarding smaller trees with more chances of tuning than the larger ones. To achieve this, we make the probability of tuning each node of a tree a function of the corresponding tree size (s). Let p(s) be such a probability, and let $\bar{s}$ be the average tree size of the population; then,
\[
p(s) = \begin{cases} 1, & s \le (c - 1)\,\bar{s} \\[4pt] \dfrac{c\,\bar{s} - s}{\bar{s}}, & (c - 1)\,\bar{s} < s < c\,\bar{s} \\[4pt] 0, & s \ge c\,\bar{s} \end{cases} \tag{1}
\]
Figure 2 shows a plot of Equation (1). In this study we use c = 1.5. Thus, an average sized tree has a per node probability $p(\bar{s}) = 0.5$. When $s \le 0.5\,\bar{s}$, the probability rises to 1.0, whereas when $s \ge 1.5\,\bar{s}$, it drops to 0.0. This means that trees smaller than the average sized tree in the population are more likely to be tuned. If we increase the value of c in Equation (1), the graph in Figure 2 shifts right, thus increasing the tuning opportunities for the larger trees. Decreasing c has the opposite effect.
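
In code, Equation (1) amounts to a clamped linear ramp; a minimal sketch (ours):

```python
def tuning_probability(s, s_bar, c=1.5):
    """Per-node tuning probability for a tree of size s, given average size s_bar."""
    return min(1.0, max(0.0, (c * s_bar - s) / s_bar))

assert tuning_probability(10, 10) == 0.5   # an average-sized tree gets p = 0.5
assert tuning_probability(4, 10) == 1.0    # s <= 0.5 * s_bar: always tuned
assert tuning_probability(16, 10) == 0.0   # s >= 1.5 * s_bar: never tuned
```
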
Figure 2:

Function p(s) from Equation (1) is depicted, where $\bar{s}$ is the average tree size and $c = 1.5$.


Note that Equation (1) is neither the only, nor necessarily the optimal, way of modelling p(s); infinitely many functions may model p(s). However, in using Equation (1), we present one such approach that targets the twin objective of keeping tuning expense low and discouraging growth in tree size. Later, Section 5 formally shows that Equation (1) conservatively increases the tuning expense as the average tree size increases.

Of course, irrespective of whether or not they get tuned, all the trees undergo normal evaluation. Therefore, if a large tree can outperform its smaller counterparts in spite of their increased tuning opportunities, it will survive. This makes our strategy different from those methods using parsimony pressure that subjectively decrease the fitness of a large tree (Poli and McPhee, 2008), or probabilistically kill the individuals with larger than the average size (Poli, 2003).

However, as in Poli (2003), our strategy also relies on a theoretical prediction that, if the average fitness of smaller individuals is greater than that of the larger ones, the average size in the next generation will be smaller than that in the current generation. Still, to contain the growth in tree sizes in this study, we do not lower the fitness values of the larger trees; instead, we offer a greater chance for the smaller individuals to prosper. Using a carrot and stick terminology, we rely on carrots alone.

With probabilistic tuning, the number of nodes undergoing tuning is likely to be substantially smaller than that with unconditional tuning. Thus, the training performance with the former may suffer; however, correspondingly, the tree sizes may also be smaller if tuning rewards smaller trees in an effective manner. The results in Section 4.3 show that this is indeed the case.

4  Experiments

To estimate the effectiveness of the proposed method, we use five different GP approaches: standard genetic programming, that is, the state of the art; GP with Chameleon function tuning; GP with probabilistic use of Chameleon function tuning; GP with linear scaling (Keijzer, 2003), to improve the numeric coefficients, and a tournament selection scheme that prevents mating between parents with identical fitness values (that is, no same mates; Gustafson et al., 2005); and, finally, probabilistic function tuning combined with the previous setup. GP enhanced with linear scaling and no same mates (NSM) tournament selection offers a tougher benchmark: when the two techniques combine with GP, they perform significantly better than standard GP on both training and the unseen (test) data (Costelloe and Ryan, 2009). Section 2 described linear scaling; we describe NSM below.

Gustafson et al. (2005) showed that NSM tournament selection reduces the probability of producing offspring identical to their parents, and delays stagnation in evolutionary search. To achieve this goal, NSM selection chooses genetically distinct parents. Since different fitness values guarantee a distinct genetic makeup, standard tournament selection is repeated until two parents with different fitness values are obtained. As with linear scaling (Keijzer, 2003), Gustafson et al. only showed training results where GP gained significantly from the use of this technique.
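
A minimal sketch of this selection loop (ours), assuming a tournament size of 2 as used later in this study; the retry cap is our addition to avoid looping forever in a fully converged population:

```python
import random

def tournament(population, fitness, k=2):
    """Standard tournament: the fittest of k randomly sampled individuals."""
    return max(random.sample(population, k), key=fitness)

def nsm_select_parents(population, fitness, k=2, max_retries=100):
    """No-same-mates selection: repeat tournaments until the parents' fitness differs."""
    p1 = tournament(population, fitness, k)
    p2 = tournament(population, fitness, k)
    for _ in range(max_retries):
        if fitness(p1) != fitness(p2):   # different fitness guarantees distinct genotypes
            break
        p2 = tournament(population, fitness, k)
    return p1, p2
```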

Alone, neither linear scaling nor NSM tournament selection outperforms standard GP on test cases over a selection of symbolic regression problems (Costelloe and Ryan, 2009); however, when the two methods are combined, they do. Therefore, if, in this study, results improve even further with a triple combination that also involves Chameleon, that would be a significant step forward.

4.1  Test Suite and GP Parameters

We use eight problems for this study: four low dimensional problems (one or two input variables), and four high dimensional problems (8–241 input variables). The low dimensional problems are listed in Table 1. Among these, problems 1 and 2 are taken from Keijzer (2003); problem 3 is a bi-variate version of problem 2, and problem 4 is a bi-variate version of a polynomial used in Gustafson et al. (2005). 50 and 200 data points are used for training and testing, respectively, in each of these problems. As these problems are less challenging than their higher dimensional counterparts, we use test sets larger than the training sets to focus on how well a particular set up generalises. The high dimensional problems are listed in Table 2. We describe each of them separately in the following paragraphs; however, for all of these problems, we use 70% of the randomly chosen data points for training during evolutionary runs, and the remaining 30% for testing.

Table 1:
Low dimensional problems used in this study.

Problem                                            Training set [min:step], 50 points   Test set [min:step], 200 points
$\mathrm{arcsinh}(x)$                              [0.0:1.0]                            [0.1:0.25]
$x^3 e^{-x}\cos(x)\sin(x)(\sin^2(x)\cos(x)-1)$     [0.0:0.2]                            [0.05:0.05]
$y^3 e^{-x}\cos(y)\sin(x)(\sin^2(y)\cos(x)-1)$     x [0.0:0.2], y = x + 0.03            x [0.05:0.05], y = x + 0.03
$y^2 x^6 - 2.13\,y^4 x^4 + y^6 x^2$                x [-1.9:0.075], y = x + 0.015        x [-1.91:0.019], y = x + 0.015
Table 2:
High dimensional problems used in this study.

Problem Label      Input Variables   Data Points
Dow Chemical       57                1,066
Boston Housing     13                506
Concrete           8                 1,030
Bioavailability    241               359

In the first high dimensional problem, the objective is to predict a real valued chemical composition from 57 process measuring input variables such as temperatures, pressures, and flows. This problem was posed as a challenge for GP based symbolic regression, and the data originates from a real industrial application at Dow Chemical.

In the second problem, the objective is to model house prices by using 12 real valued input variables and one binary valued input variable. The source of the data is the UCI Machine Learning Repository (Frank and Asuncion, 2010). Harrison and Rubinfeld (1978) originally reported on this problem; it uses data collected from the suburbs of Boston, Massachusetts.

In the third problem, the objective is to predict a quantitative value of the compressive strength of concrete. This strength is a highly nonlinear function of eight input variables. Again, the data source is the UCI Machine Learning Repository, and the problem itself is detailed in Yeh (1998).

In the last problem, the objective is to predict, from 241 input variables, the percentage of an orally administered dose of a drug that effectively reaches the systemic blood circulation. This problem was first tackled with GP in Archetti et al. (2006).

Table 3 details the experimental parameters for this study. We did not optimise these parameters, except for the population size. Initially, we tried population sizes of 120, 500, and 1,000 for GP. GP, both with and without linear scaling and NSM selection, performed the best with a population size of 500; therefore, we use this size for all the experiments reported here. A question arises as to whether a tournament size of 2 is too small for such a population size, that is, the sampling probability of each individual may diminish, with implications for population diversity. However, Xie et al. (2007a, 2007b) show that this probability remains constant if the number of tournaments conducted every generation is equal to the population size. We adhere to this formula.

Table 3:
Configuration parameters for the runs.

Population size          500
Run terminates           At exhausting 1.2 × 10^6 nodes or after 75,000 fitness evaluations
Operator probabilities   Crossover: 0.9; point mutation: 0.1
Tournament size          2
Replacement              Steady state, inverse tournament
Functions set            Arithmetic operators (no transcendental functions; see the discussion below)
Terminal set             Input variables ∪ ERC
ERC                      |ERC| = 50
Normalised fitness       In [0.0, 1.0]; 1.0 is the ideal score
Initialisation           Ramped half and half (max. initial depth = 4)

The end-of-run results in Section 4.3 may not be enough to ascertain overfitting. Therefore, we also provide the test set results over the entire runs in Appendix C.

The choice of functions set deserves attention. Often, parsimony pressure in GP focuses just on the size of the evolved expression (e.g., Poli, 2003; Luke and Panait, 2002). This ignores the underlying complexity of the expression and the fact that complex models can overfit the training data. For example, (x+x+x) has a larger tree size than sin(x), yet the latter has a far more complex behaviour. In order to maintain a more reasonable size-complexity relationship, we have omitted transcendental functions in this study.

Since the ratio between the number of ephemeral random constants (ERCs) and input variables varies from problem to problem, the probability of their selection also varies. To keep a uniform probability of selecting ERCs across different problems, first we decide between a constant and a variable with a probability of 50%. Then, among themselves, both variables and constants are selected randomly.
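
In code, this two-step sampling rule is only a few lines; a sketch (ours):

```python
import random

def sample_terminal(variables, ercs):
    """Choose constants vs. variables with equal weight as groups, then uniformly
    within the chosen group, so a problem's ERC-to-variable ratio cannot skew it."""
    pool = ercs if random.random() < 0.5 else variables
    return random.choice(pool)
```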

4.2  How To Compare Performance?

To fairly compare the performances of different evolutionary algorithms, they should be allowed the same computational expense. Often this means an identical number of generations or (by extension) fitness evaluations for every run. However, in GP, the same number of fitness evaluations can incur very dissimilar computational costs. For example, a fitness evaluation in the later generations may consume much more CPU time than one in the initial generations, when the trees are much smaller in size. Therefore, another measure of computational effort is the amount of genetic material (or simply the number of nodes in GP trees) processed, for example, as in Silva and Costa (2005).

During each iteration of cached tuning, Chameleon does not process the same amount of genetic material as in a normal fitness evaluation with standard GP. In standard GP, each fitness evaluation fully traverses a tree, whereas during a Chameleon tuning evaluation, only the nodes on the path joining the root node to the node subjected to tuning are evaluated afresh (see Figure 1). Moreover, a standard fitness evaluation can, potentially, sample an entirely different tree each time; however, during a Chameleon tuning evaluation, only a point change is made to an otherwise identical tree. Therefore, it makes more sense to compare standard GP with Chameleon on the basis of the number of processed nodes. A counterargument, then, is that in this way Chameleon avails itself of additional decision making opportunities (fitness evaluations) per unit genetic material processed, which standard GP cannot. However, that is precisely the point: with Chameleon, we propose to use the genetic material more efficiently by actively exploring it.

Still, we present results in terms of both processed nodes and fitness evaluations. Since Chameleon is at a disadvantage with the latter, its performance over the training sets can suffer. Runs terminate after processing 1.2 × 10^6 GP tree nodes, or after 75,000 fitness evaluations, depending upon the preselected criterion.

4.3  Results

Figures 3 and 4 show the results for the problems considered here. For each run we record the best training score (also referred to as fitness), the corresponding test score, and the average tree size of the population. The figures then show the mean of the training score (fitness value) of the best individual in each of the runs, the mean of their corresponding test scores, and the mean of the average tree size in each run.

Figure 3:

Normalised mean best training (top) and the corresponding test (bottom) scores at the end of the runs are plotted. For each of the last three problems, the figure rescales the y-axis because the results are out of scale with respect to those for every other problem. For these problems the mean values are (specified as [min, max] in the order: B-Housing, Concrete, and Bioavail): (training) [0.0763001, 0.118518], [0.00757136, 0.0165866] and [0.00117998, 0.00138766]; (test) [0.0101093, 0.0166162], [0.00539519, 0.00830962] and [0.000677607, 0.00105967]. Runs end after processing 1.2 × 10^6 GP tree nodes.


Figure 4:

Normalised mean best training (top) and the corresponding test (bottom) scores at the end of the runs are plotted. As in Figure 3, for each of the last three problems the figure rescales the y-axis. For these problems, the mean values are (specified as [min, max] in the order: B-Housing, Concrete, and Bioavail): (training) [0.074728, 0.135479], [0.00613116, 0.0215204] and [0.00114992, 0.00160564]; (test) [0.00866396, 0.0160436], [0.00414232, 0.00869288] and [0.000862427, 0.000827737]. Runs end after 75,000 fitness evaluations.


The rationale for noting test scores is not to decide the best method for avoiding over-fitting; instead, it is to investigate whether tuning with Chameleon over-trains the evolving individuals so they over-fit even more than with standard GP. While, to aid readability, we only present the end-of-run results in this section, Appendix  C gives mean test scores for the best individual throughout the evolution. Therefore, the reader interested in the entire trend of test set performance should refer to Appendix  C.

The training and test scores are normalised between 0.00 and 1.00 (1.00 being the ideal score). Each sampled point in the plots depicts an average over 500 independent runs. As in Costelloe and Ryan (2009), the 95% confidence limits of the error bars at each point are computed as follows:
\[
\bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}},
\]
where $\bar{x}$ and $\sigma$ are the mean and standard deviation of the $n$ observations; n = 500 represents the number of runs in this case. We can be 95% confident that the mean of the statistical population lies within these limits, and that a lack of overlap with another error bar means that the corresponding populations are different.

We also validate the significance of the differences in performance with the Mann-Whitney-Wilcoxon test (Anderson et al., 2001); we test at p=0.05. It is a nonparametric test that does not assume normality of the sample populations. However, for the results presented in this section, this test agreed consistently with the graphical test described earlier that checks for an overlap of the error bars.

While the results for all the setups are plotted together, standard GP is the performance benchmark for the two Chameleon-only setups, while GP with linear scaling and NSM tournament selection is the benchmark for the setup that adds probabilistic tuning to it. This separation of benchmarks is due to the advantage available to GP from the additional computational expense of applying linear scaling and NSM tournament selection. Thus, separating benchmarks in this fashion helps compare like with like.

Note that in Figures 3 and 4, for the last three problems, the training and test results of all the GP setups were out of scale with respect to the results on the other five problems. As a result, we partition the results for the last three problems separately. For each of these three problems, the y-axis is rescaled every time; therefore, the plots in Figures 3 and 4 only show the relative performances of the GP setups. However, the figure captions describe the magnitudes of these results by specifying the minimum and maximum values for the three problems. Also, note that although the scales of figures go beyond the ideal value of 1.00, the actual results never exceed that value.

Figure 3 shows the scores on training and test sets after processing a fixed number of nodes. Training results show that on six out of eight problems, the maximal tuning setup performs the best; on the other two problems its performance is indistinct from the best. The corresponding test results show that, except on one problem, the same setup is at least as good as the best and, on five problems, is strictly better than its counterpart, standard GP. The figure also shows that standard GP is the worst performer on all the problems on both training and test sets, except on the one problem where it performs the best on the test set.

Figure 4 shows the scores on training and test sets after a fixed number of fitness evaluations. Training results show a role reversal: the maximal tuning setup lags behind on six out of eight problems and is indistinct from the best performers on the other two. Similarly, on training scores, the two benchmark setups outperform their Chameleon counterparts on six problems. However, the situation again reverses on the all-important test results: on all the problems the maximal tuning setup is at least as good as standard GP and strictly better on five of them; similarly, the two remaining Chameleon setups outperform their respective benchmarks on all the problems.

Figure 5 shows the average tree sizes for both criteria for run termination, that is, maximum allowable nodes and fitness evaluations. Clearly, in both cases, the average tree size with all the variants of Chameleon is much smaller than that with the two variants of standard GP. The difference in size is even larger with maximum fitness evaluations: note the log scale. Among the variants of Chameleon, maximal tuning either has the shortest tree size or is indistinct from those with the shortest tree size.

Figure 5:

Mean of the average tree size at the end of the runs is plotted: (top) runs end after processing 1.2 × 10^6 GP tree nodes; (bottom) runs end after 75,000 fitness evaluations (note the log scale in this case). The error bars are very small because the tree sizes are very consistent across different runs for each setup. Unlike in Figures 3 and 4, the y-axis was not rescaled.


4.4  Discussion

The results clearly show that GP, in both its standard and enhanced versions, performs better with Chameleon. Even when Chameleon fails to improve results on the training data with a fixed number of fitness evaluations, it compensates for that with a superior performance on the test data. Also, with Chameleon, the average tree size is significantly smaller than without it, even though we have taken no additional bloat controlling measure.

Maximally tuning the internal nodes of trees performs consistently better than standard GP (except on training results with fixed fitness evaluations), and often better than its probabilistic counterpart. Maximal tuning also maintains an average tree size smaller than that with standard GP, enhanced GP, and, rather surprisingly, probabilistic tuning. This is encouraging, as it dispels the concern that local tuning in such an exhaustive fashion could be prohibitively expensive. It is also surprising that, despite being potentially favourable to larger trees, maximal tuning generates trees smaller than those with probabilistic tuning, which explicitly favours smaller trees. The question arises of whether rewarding the smaller trees with lifetime learning (as opposed to indiscriminately using it) has any effect at all in reducing the average tree size of the population.

To understand the apparently anomalous results in terms of average tree sizes, we compare the average number of generations each GP setup takes until the end of the run; Figure 6 presents this comparison. By measuring the number of generations, we compare the effect of pressure of selection and replacement on the population members. Typically, the higher the number of generations, the larger the trees are in the population. In a steady state setup (as employed in these experiments) with standard GP, the number of generations is simply the number of fitness evaluations divided by the population size. However, this formula does not hold for Chameleon. This is because during tuning an individual undergoes several fitness evaluations before it faces selection pressure. Also, note that while tuning, the size of a tree does not change; therefore, even with the same number of fitness evaluations, the number of generations elapsed with Chameleon can be very different from that with standard GP. Similarly, this number can be different between different variants of Chameleon or even between different runs with the same Chameleon setup. Therefore, we measure the number of generations explicitly.

Figure 6:

Mean number of generations elapsed at the end of the runs is plotted: (top) runs end after processing 1.2 × 10^6 GP tree nodes; (bottom) runs end after 75,000 fitness evaluations. The number of elapsed generations helps compare the selection pressure over time; thus, when probabilistic tuning consumes the same number of generations as standard GP (top), the smaller tree sizes with the former directly result from the efficacy of Chameleon tuning as a reward for smaller trees. Unlike in Figures 3 and 4, the y-axis was not rescaled.


Figure 6 tells us why maximal tuning can have surprisingly small trees. Both with a fixed number of nodes and with a fixed number of fitness evaluations, it takes the fewest generations. Even so, it exhaustively explores the internal nodes of the earlier generations (often under 10 generations) and outperforms standard GP. While that is remarkable, it also clarifies that in this case the population does not face as much selection pressure over time as with the rest of the setups. Earlier work (Azad and Ryan, 2010) also showed that if we increase the selection pressure for maximal tuning by reducing the population size to 50, the average tree size becomes indistinct from that of standard GP; however, it still manages smaller tree sizes despite using a smaller population size than GP. Also, Figure 6 only reports the results at the end of the runs, for consistency and due to space restrictions; however, the detailed results show that, given the same number of generations, the average tree size with maximal tuning is significantly smaller than that with standard GP.

Figure 6 also shows that when the runs terminate after processing a fixed number of nodes, the numbers of generations taken by probabilistic tuning and standard GP are very similar. Thus, that probabilistic tuning outperforms GP while keeping the trees smaller shows that it successfully exploits the potential of smaller trees over the problems examined. Moreover, it answers the question raised earlier in this section: lifetime learning is indeed an incentive attractive enough to promote smaller trees that benefit from tuning. This is also significant because fitness is not decreased owing to the size of individuals: if a large individual performs better than the smaller, tuned individuals, then it deserves to retain its competitive edge, and so it does. However, we do not suggest that this is the best way of controlling bloat in GP; comparing bloat control techniques is beyond the scope of this paper.

Figure 6 also shows that with a fixed number of fitness evaluations, the number of generations that the two standard GP setups take far exceeds that taken by all the variants of Chameleon. For GP, the numbers of generations with and without linear scaling are identical; hence, the corresponding legends lie exactly on top of each other. This also explains why the tree size for standard GP with fixed evaluations is much greater than that with a fixed number of nodes: the number of generations with the former is much higher than with the latter.

Also, significantly, the probabilistic Chameleon combination outperforms GP enhanced with linear scaling and NSM selection. Thus, probabilistic tuning improves over a known improvement to GP and not just over standard GP.

Although the results show that tuning with Chameleon is both feasible and beneficial, questions remain about its expense. From the above discussion, we recommend that one should always reasonably budget the fitness evaluations or the processed nodes. Thus, when the trees grow larger, the budget also exhausts faster during each generation; this keeps the total number of generations low. If, instead, we fix the total number of generations, as is common with GP, then because maximal tuning potentially favours large trees (as they have more tuning sites), the computational cost can be high. However, probabilistic tuning is robust, as it successfully contains tree growth despite going through the same number of generations as standard GP. Another question then is: can we ascertain the maximum number of tuning events in a population, given the average tree size? Clearly, with maximal tuning, we cannot do that, since this number depends upon the evolutionary dynamics of every individual run: the maximum number of tuning events in a population is equal to the number of internal nodes in the largest tree in that population. However, we show in Section 5 that with the probabilistic tuning setups, we can give a more reliable bound on the maximum number of tuning events. This analysis is also useful as a template for analysing another function that might replace Equation (1).

5  How Expensive Can Probabilistic Tuning Get?

The previous section raised the question of how expensive tuning can get. Here we answer this by identifying the tree size that gets the maximum number of tuning events (a tuning event occurring when a node is selected for tuning) and then, for that size, calculating that number. Note that by counting the tuning events instead of the iterations (through the functions set) while tuning a node, we abstract away the exact implementation details. For example, instead of exhaustively iterating over the functions set at a selected node, we can probabilistically iterate over a subset, and the analysis in this section still holds.

For this analysis, we consider the average size of the population at a given time, the possible shapes that the trees can take, and Equation (1), which calculates the probability that each node in a tree of size s will be tuned. To aid readability, we repeat Equation (1) with c=1.5 substituted; for the rest of the section, we assume this substitution whenever we refer to Equation (1):
\[
p(s) = \begin{cases} 1, & s \le 0.5\,\bar{s} \\[4pt] \dfrac{1.5\,\bar{s} - s}{\bar{s}}, & 0.5\,\bar{s} < s < 1.5\,\bar{s} \\[4pt] 0, & s \ge 1.5\,\bar{s} \end{cases}
\]
where p(s) (shown in Figure 2) is the per-node probability of tuning a tree of size s, and $\bar{s}$ is the average tree size of the population; $p(s) = 0.5$ when $s = \bar{s}$.

Let T(s) be the number of nodes tuned (or the number of tuning events) in a tree of size s, and smax be the tree size with the highest number of nodes tuned in the population under consideration. Thus, T(smax) represents the maximum tuning expense in a given population. With the subsequent analysis, we want to characterise T(smax) as a function of the average tree size.

Although we do not tune leaves in this study, we first analyse the case where the leaves are also tuned, as it is easier to do so. Next, in Section 5.2, we analyse the present case, where leaves are not tuned.

5.1  If Leaves Are Also Tuned

If the leaves are also tuned, then the number of tuning events is $T(s) = s \times p(s)$; note that $\times$ represents ordinary multiplication. Substituting Equation (1),
\[
T(s) = \begin{cases} s, & s \le 0.5\,\bar{s} \\[4pt] \dfrac{s\,(1.5\,\bar{s} - s)}{\bar{s}}, & 0.5\,\bar{s} < s < 1.5\,\bar{s} \\[4pt] 0, & s \ge 1.5\,\bar{s} \end{cases} \tag{2}
\]
Figure 7 exemplifies Equation (2) when $\bar{s} = 10$; the maximum of Equation (2) is at $s_{max}$. To quantify $s_{max}$, we find the maxima of both the linear and the nonlinear (quadratic) regions in Figure 7 and then compare T(s) at these two maxima. In the linear region, $p(s) = 1$ and $T(s) = s$; thus, the maximum of the linear region is at its endpoint, that is, at $s = 0.5\,\bar{s}$, where $T(s) = 0.5\,\bar{s}$. The maximum of the nonlinear region ($0.5\,\bar{s} < s < 1.5\,\bar{s}$) is where $\frac{dT}{ds} = \frac{1.5\,\bar{s} - 2s}{\bar{s}} = 0$. Thus,
\[
s_{max} = 0.75\,\bar{s}, \qquad T(s_{max}) = 0.5625\,\bar{s}, \tag{3}
\]
because the maximum number of tuning events for the nonlinear region, that is, $0.5625\,\bar{s}$, is greater than that for the linear region, that is, $0.5\,\bar{s}$. Therefore, even if we also tune the leaves, the maximum number of tuning events is only a fraction of the average size of the population, that is, $0.5625\,\bar{s}$. This is useful in two ways. First, a smaller than average sized tree gets the most out of tuning; this can promote smaller than average sized trees. Second, the maximum tuning expense is a function of the average size of the population instead of the maximum size, as in maximal tuning. Clearly, the average size increases more slowly than the maximum size in a population and can even be regulated (Poli and McPhee, 2008). In the following section, we show that these benefits also hold when the leaves are not tuned.
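
The numbers in Equation (3) and Figure 7 are easy to verify by brute force; a small check (ours) for an average tree size of 10:

```python
def p(s, s_bar, c=1.5):
    return min(1.0, max(0.0, (c * s_bar - s) / s_bar))

s_bar = 10.0
grid = [k / 100.0 for k in range(2001)]            # s in [0, 20]
s_max = max(grid, key=lambda s: s * p(s, s_bar))   # maximise T(s) = s * p(s)
print(s_max, s_max * p(s_max, s_bar))              # -> 7.5 5.625, i.e., 0.75 and
                                                   #    0.5625 times s_bar
```
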
Figure 7:

Function T(s) from Equation (2) is depicted: $\bar{s} = 10$, $s_{max} = 7.5$. The function is linear until $0.5\,\bar{s}$ and quadratic afterward, until it drops to 0.


5.2  If Leaves Are Not Tuned

Typically, we only tune the internal nodes with Chameleon. Let i(s) be the number of internal nodes in a tree of size s, so that:
\[
T(s) = i(s) \times p(s). \tag{4}
\]
The crucial question then is: can we express i(s) as a function of s alone? In other words, can we compute the number of internal nodes from just the tree size? To answer this, first, we consider full binary trees, such that each node has exactly zero or two children. Then, in Section 5.2.2 we also consider the general case where the arity of the nodes varies. We show that the results for the general case are bounded by those for full binary trees.

For full binary trees, clearly, we have two cases: (1) when the tree is perfect (that is, all leaf nodes are at the same depth) or can be restructured into a perfect binary tree, and (2) otherwise. Next, we show that we can tackle both these cases together.

5.2.1  Full Binary Trees

The analysis in this section applies to both cases (1) and (2) because we show that the same relationship holds for i(s); however, we begin with case (1). We define a perfect convertible tree to be one that can be restructured into a perfect tree without changing the number of internal or external nodes. During this restructuring the semantics of the tree may change; however, that does not concern this analysis because, for a full binary tree, the number of internal nodes i(s) is only a function of the tree size s. Thus, i(s) is the same for both the perfect and the perfect convertible trees. To compute i(s) for these two types of trees, we consider that $s = 2^{d_{max}+1} - 1$, where $d_{max}$ is the maximum depth of the tree and the depth of the root node is 0. Since, in a full tree, the maximum depth of the internal nodes is $d_{max} - 1$,
\[
i(s) = 2^{d_{max}} - 1 = \frac{s - 1}{2}. \tag{5}
\]
Appendix B shows that Equation (5) also holds for those full binary trees which cannot be restructured into a perfect tree; thus, the following analysis also applies to such trees.
From Equations (4) and (1) we get
\[
T(s) = \frac{s-1}{2}\,p(s) = \begin{cases} \dfrac{s-1}{2}, & s \le 0.5\,\bar{s} \\[4pt] \dfrac{(s-1)(1.5\,\bar{s} - s)}{2\,\bar{s}}, & 0.5\,\bar{s} < s < 1.5\,\bar{s} \\[4pt] 0, & s \ge 1.5\,\bar{s} \end{cases} \tag{6}
\]
Again, the maximum of the linear part is at its endpoint, $s = 0.5\,\bar{s}$, where $T = (0.5\,\bar{s} - 1)/2$. To find the maximum of the nonlinear part, we set $\frac{dT}{ds} = \frac{1.5\,\bar{s} - 2s + 1}{2\,\bar{s}} = 0$ and solve for s, so that
\[
s_{max} = \frac{1.5\,\bar{s} + 1}{2} = 0.75\,\bar{s} + 0.5, \tag{7}
\]
and correspondingly, $T(s_{max}) = \frac{(1.5\,\bar{s} - 1)^2}{8\,\bar{s}}$, which also exceeds the linear-part maximum $(0.5\,\bar{s} - 1)/2$.
We can confirm that $s_{max}$ indeed corresponds to the maximum by verifying that $\left.\frac{d^2T}{ds^2}\right|_{s=s_{max}} < 0$:
\[
\frac{d^2 T}{ds^2} = -\frac{1}{\bar{s}} < 0.
\]

Thus, as in Section 5.1, $s_{max}$ is a fraction of the average tree size. Also, the maximum tuning expense is a function of the average tree size. Therefore, again the smaller trees get the most out of the tuning; this also keeps the tuning expense manageable. Furthermore, Figure 8 plots $T(s_{max})$ and its rate of change with the average tree size ($\frac{d\,T(s_{max})}{d\bar{s}}$), both when the leaves are tuned and when they are not. The figure shows that when leaves are not tuned, the increase in the tuning expense with an increasing average size ($\bar{s}$) is asymptotic: $T(s_{max})$ grows very slowly with $\bar{s}$. When the leaves are also tuned, the growth rate is constant. This is unlike maximal tuning, which can grow prohibitively expensive with increasing size.
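
The same brute-force check (ours) confirms Equation (7) and the corresponding maximum for an average tree size of 10:

```python
def p(s, s_bar, c=1.5):
    return min(1.0, max(0.0, (c * s_bar - s) / s_bar))

s_bar = 10.0
grid = [k / 100.0 for k in range(2001)]
t = lambda s: (s - 1) / 2 * p(s, s_bar)   # T(s) = i(s) * p(s) with i(s) = (s - 1) / 2
s_max = max(grid, key=t)
print(s_max, t(s_max))                    # -> 8.0 2.45, matching 0.75 * s_bar + 0.5
                                          #    and (1.5 * s_bar - 1)**2 / (8 * s_bar)
```
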

Figure 8:

(Left): $T(s_{max})$ is plotted with and without tuning the leaves. (Right): Corresponding rates of change (first derivatives) are plotted, that is, 0.5625 and $(9 - 4/\bar{s}^2)/32$. The maximum number of tuning events increases ever so slightly when the leaves are not tuned; it increases at a constant rate when the leaves are also tuned.

5.2.2  General Case: Variable Arity

When the nodes have variable arity, we cannot formulate $i(s)$ exactly as in Equation (5); however, we know that at most $i(s) = s - 1$ (when every function has arity 1). Thus, we have an upper limit on $i(s)$. Substituting $i(s) = s - 1$ into Equation (4), we again get $s_{max}$ as in Equation (7). Therefore, the results for the binary trees also provide an upper limit for the general case when the arity varies.
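The bound is attained by a chain of unary functions, as a small sketch (same hypothetical tuple representation as earlier) confirms:

def chain(n):
    # A chain of n unary functions applied to a single leaf, e.g.,
    # sin(sin(...sin(x)...)).
    tree = 'x'
    for _ in range(n):
        tree = ('sin', tree)
    return tree

def count(tree):
    # Return (size, number of internal nodes) for unary-chain trees.
    if tree == 'x':
        return 1, 0
    _, child = tree
    s, i = count(child)
    return s + 1, i + 1

for n in range(1, 50):
    s, i = count(chain(n))
    assert i == s - 1  # the upper bound i(s) = s - 1 is attained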

In summary, we quantify the tuning expense under probabilistic tuning given Equation (1) and show that the expense increases conservatively as a function of the average size of the population. The average size of the population not only grows more slowly than the maximum size but is also more manageable.

6  Conclusions and Future Work

In this paper we describe a simple approach to lifetime learning in GP and apply it to some difficult, well-known problems from the symbolic regression domain. We show that it provides better performance on unseen test data and produces smaller individuals than the current state of the art of GP. The results also show that our hill climbing approach improves not only over standard GP but also over some known improvements to GP in the symbolic regression domain, that is, linear scaling and NSM tournament selection.

We dispel the fear of computational expense with our proposed method by taking a three-pronged approach. First, we show that adding to the functions set does not necessarily increase the number of tuning steps. Next, we show (and quantify) that, with the use of a simple cache, the computational costs associated with the proposed hill climbing approach decrease substantially; we analytically show that the cost decreases by 50% for skinny trees that use only unary functions, and (asymptotically) by up to 100% for maximally grown binary trees. Finally, we propose to tune probabilistically by preferring smaller than average sized individuals, so that the number of tuning opportunities for an individual is a function of its size. The results show that by preferring smaller trees in this way, Chameleon generates trees smaller than those in standard GP; keeping trees small further reduces the computational cost.

While exhaustively using Chameleon also produces competitive results, such a use potentially promotes larger trees in the population. Also, an exhaustive approach cannot guarantee an upper limit on the tuning expense. Probabilistic Chameleon, however, contains tree growth and can guarantee an upper limit on the tuning expense as the average tree size in the population grows.

In the future, we want to explore the effect of varying GP parameters on the performance of Chameleon. In particular, we aim to ascertain the effects of different probability models in probabilistic Chameleon because, although the selected model is an intuitive choice, it is by no means necessarily the optimal one.

References

Anderson, D. R., Sweeney, D. J., and Williams, T. A. (2001). Statistics for business and economics. Belmont, CA: South-Western College Pub.

Angeline, P. J., and Pollack, J. B. (1992). The evolutionary induction of subroutines. In Proceedings of the Fourteenth Annual Conference of the Cognitive Science Society, pp. 236–241.

Archetti, F., Lanzeni, S., Messina, E., and Vanneschi, L. (2006). Genetic programming for human oral bioavailability of drugs. In M. Keijzer et al. (Eds.), GECCO 2006: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, Vol. 1, pp. 255–262.

Azad, R. M. A., and Ryan, C. (2010). Abstract functions and lifetime learning in genetic programming for symbolic regression. In J. Branke et al. (Eds.), GECCO ’10: Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation, pp. 893–900.

Banzhaf, W., Nordin, P., Keller, R. E., and Francone, F. D. (1998). Genetic programming—An introduction; On the automatic evolution of computer programs and its applications. San Mateo, CA: Morgan Kaufmann.

Bayley, M. J., Jones, P. W. G., and Williamson, M. (1998). Genfold: A genetic algorithm for folding protein structures using NMR restraints. Protein Science, 7(2):491–499.

Blickle, T., and Thiele, L. (1994). Genetic programming and redundancy. In J. Hopf (Ed.), Genetic Algorithms within the Framework of Evolutionary Computation (Workshop at KI-94, Saarbrücken), pp. 33–38.

Costelloe, D., and Ryan, C. (2009). On improving generalisation in genetic programming. In L. Vanneschi, S. Gustafson, A. Moraglio, I. De Falco, and M. Ebner (Eds.), Proceedings of the 12th European Conference on Genetic Programming, EuroGP 2009. Lecture notes in computer science, Vol. 5481 (pp. 61–72). Berlin: Springer-Verlag.

Curran, D., and O'Riordan, C. (2006). Increasing population diversity through cultural learning. Adaptive Behaviour, 14(4):315–338.

Daida, J. (2004). Considering the roles of structure in problem solving by a computer. In U.-M. O'Reilly, T. Yu, R. L. Riolo, and B. Worzel (Eds.), Genetic Programming Theory and Practice II, Chap. 5 (pp. 67–86). Berlin: Springer.

Dandekar, T., and Argos, P. (1996). Identifying the tertiary fold of small proteins with different topologies from sequence and secondary structure using the genetic algorithm and extended criteria specific for strand regions. Molecular Biology, 256(3):645–660.

Dawkins, R. (1990). The selfish gene. Oxford, UK: Oxford University Press.

de Souza, P. A. Jr., Garg, R., and Garg, V. K. (1998). Automation of the analysis of Mossbauer spectra. Hyperfine Interactions, 112:275–278.

Downey, C., and Zhang, M. (2011). Execution trace caching for linear genetic programming. In A. E. Smith (Ed.), Proceedings of the 2011 IEEE Congress on Evolutionary Computation, pp. 1191–1198.

Downing, K. L. (2001). Reinforced genetic programming. Genetic Programming and Evolvable Machines, 2(3):259–288.

Eskridge, B. E., and Hougen, D. F. (2004). Memetic crossover for genetic programming: Evolution through imitation. In K. Deb et al. (Eds.), Genetic and Evolutionary Computation, GECCO-2004, Part II. Lecture notes in computer science, Vol. 3103 (pp. 459–470). Berlin: Springer-Verlag.

Frank, A., and Asuncion, A. (2010). UCI Machine Learning Repository. Irvine, CA: University of California at Irvine, School of Information and Computer Science.

Goldberg, D. E. (1989). Genetic algorithms in search, optimization, and machine learning, 1st ed. Boston: Addison-Wesley Professional.

Gustafson, S., Burke, E. K., and Krasnogor, N. (2005). On improving genetic programming for symbolic regression. In D. Corne et al. (Eds.), Proceedings of the 2005 IEEE Congress on Evolutionary Computation, Vol. 1, pp. 912–919.

Harries, K., and Smith, P. (1997). Exploring alternative operators and search strategies in genetic programming. In J. R. Koza et al. (Eds.), Genetic Programming 1997: Proceedings of the Second Annual Conference, pp. 147–155.

Harrison, D., and Rubinfeld, D. L. (1978). Hedonic prices and the demand for clean air. Journal of Environmental Economics & Management, 5:81–102.

Hiden, H. G., Willis, M. J., Tham, M. T., and Montague, G. A. (1999). Non-linear principal components analysis using genetic programming. Computers and Chemical Engineering, 23(3):413–425.

Holland, J. H. (1975). Adaptation in natural and artificial systems. Ann Arbor, MI: University of Michigan Press.

Howley, T., and Madden, M. G. (2005). The genetic kernel support vector machine: Description and evaluation. Artificial Intelligence Review, 24(3–4):379–395.

Hu, J., Goodman, E., Seo, K., Fan, Z., and Rosenberg, R. (2005). The hierarchical fair competition framework for sustainable evolutionary algorithms. Evolutionary Computation, 13(2):241–277.

Hunter, A. (2002). Using multiobjective genetic programming to infer logistic polynomial regression models. In F. Van Harmelen (Ed.), 15th European Conference on Artificial Intelligence, pp. 193–197.

Iba, H., de Garis, H., and Sato, T. (1994a). Genetic programming using a minimum description length principle. In K. E. Kinnear, Jr. (Ed.), Advances in Genetic Programming, Chap. 12 (pp. 265–284). Cambridge, MA: MIT Press.

Iba, H., de Garis, H., and Sato, T. (1994b). Genetic programming with local hill-climbing. In Y. Davidor, H.-P. Schwefel, and R. Männer (Eds.), Parallel Problem Solving from Nature III. Lecture notes in computer science, Vol. 866 (pp. 334–343). Berlin: Springer-Verlag.

Iba, H., and Nikolaev, N. (2000). Genetic programming polynomial models of financial data series. In Proceedings of the 2000 Congress on Evolutionary Computation CEC00, pp. 1459–1466.

Keijzer, M. (2003). Improving symbolic regression with interval arithmetic and linear scaling. In C. Ryan, T. Soule, M. Keijzer, E. Tsang, R. Poli, and E. Costa (Eds.), Genetic Programming, Proceedings of EuroGP’2003. Lecture notes in computer science, Vol. 2610 (pp. 70–82). Berlin: Springer-Verlag.

Keijzer, M. (2004a). Alternatives in subtree caching for genetic programming. In M. Keijzer, U.-M. O'Reilly, S. M. Lucas, E. Costa, and T. Soule (Eds.), Proceedings of Genetic Programming 7th European Conference, EuroGP 2004. Lecture notes in computer science, Vol. 3003 (pp. 328–337). Berlin: Springer-Verlag.

Keijzer, M. (2004b). Scaled symbolic regression. Genetic Programming and Evolvable Machines, 5(3):259–269.

Keijzer, M., Ryan, C., Murphy, G., and Cattolico, M. (2005). Undirected training of run transferable libraries. In M. Keijzer, A. Tettamanzi, P. Collet, J. I. van Hemert, and M. Tomassini (Eds.), Proceedings of the 8th European Conference on Genetic Programming. Lecture notes in computer science, Vol. 3447 (pp. 361–370). Berlin: Springer-Verlag.

Khan, G. M., and Miller, J. F. (2009). Evolution of cartesian genetic programs capable of learning. In G. Raidl et al. (Eds.), GECCO ’09: Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, pp. 707–714.

Korns, M. F. (2009). Symbolic regression using abstract expression grammars. In L. Xu, E. D. Goodman, G. Chen, D. Whitley, and Y. Ding (Eds.), GEC ’09: Proceedings of the First ACM/SIGEVO Summit on Genetic and Evolutionary Computation, pp. 859–862.

Koza, J. R. (1992). Genetic programming: On the programming of computers by means of natural selection. Cambridge, MA: MIT Press.

Krasnogor, N. (2004). Self generating metaheuristics in bioinformatics: The proteins structure comparison case. Genetic Programming and Evolvable Machines, 5(2):181–201.

Krawiec, K. (2001). Genetic programming with local improvement for visual learning from examples. In W. Skarbek (Ed.), Proceedings of the 9th International Conference on Computer Analysis of Images and Patterns, CAIP 2001. Lecture notes in computer science, Vol. 2124 (pp. 209–216). Berlin: Springer-Verlag.

Langdon, W. B. (2000). Quadratic bloat in genetic programming. In D. Whitley, D. Goldberg, E. Cantu-Paz, L. Spector, I. Parmee, and H.-G. Beyer (Eds.), Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2000), pp. 451–458.

Luke, S., and Panait, L. (2002). Lexicographic parsimony pressure. In W. B. Langdon et al. (Eds.), GECCO 2002: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 829–836.

Majeed, H. (2007). The importance of semantic context in tree based GP and its application in defining a less destructive, context aware crossover for GP. PhD thesis, University of Limerick, Ireland.

Majeed, H., and Ryan, C. (2006). A re-examination of a real world blood flow modeling problem using context-aware crossover. In R. L. Riolo, T. Soule, and B. Worzel (Eds.), Genetic Programming Theory and Practice IV, Chap. 17 (pp. 279–298). Berlin: Springer-Verlag.

McKay, B., Willis, M., Searson, D., and Montague, G. (1999). Non-linear continuum regression using genetic programming. In W. Banzhaf et al. (Eds.), Proceedings of the Genetic and Evolutionary Computation Conference, Vol. 2, pp. 1106–1111.

McPhee, N. F., and Hopper, N. J. (1999). Analysis of genetic diversity through population history. In W. Banzhaf et al. (Eds.), Proceedings of the Genetic and Evolutionary Computation Conference, Vol. 2, pp. 1112–1120.

Meuth, R. J. (2010). Meta-learning genetic programming. In D. Tauritz (Ed.), GECCO 2010 Late Breaking Abstracts, pp. 2101–2102.

Mitchell, T. M. (1996). Machine learning. New York: McGraw-Hill.

Moscato, P. (1989). On evolution, search, optimization, genetic algorithms and martial arts: Towards memetic algorithms. Caltech Concurrent Computation Program Report 826, Caltech, Pasadena, CA.

Moscato, P. (2003). A gentle introduction to memetic algorithms. In Handbook of Metaheuristics (pp. 105–144). Dordrecht, The Netherlands: Kluwer Academic Publishers.

Murphy, G., Ryan, C., and Howard, D. (2007). [Seeding methods for run transferable libraries] Capturing domain relevant functionality through schematic manipulation for genetic programming. In Proceedings of the 2007 International Conference Frontiers in the Convergence of Bioscience and Information Technologies (FBIT 2007), pp. 769–772.

Nel, G. (2004). A memetic genetic program for knowledge discovery. Master's thesis, University of Pretoria, Pretoria, South Africa.

Nikolaev, N. Y., and Iba, H. (2001). Regularization approach to inductive genetic programming. IEEE Transactions on Evolutionary Computation, 5(4):359–375.

Norman, M. G., and Moscato, P. (1991). A competitive and cooperative approach to complex combinatorial search. In Proceedings of the 20th Informatics and Operations Research Meeting, pp. 3.15–3.29.

O'Neill, M., and Ryan, C. (2003). Grammatical evolution: Evolutionary automatic programming in an arbitrary language. Vol. 4 of Genetic Programming. Dordrecht, The Netherlands: Kluwer Academic Publishers.

O'Reilly, U.-M., and Oppacher, F. (1994). Program search with a hierarchical variable length representation: Genetic programming, simulated annealing and hill climbing. Tech. Rep. 94-04-021, Santa Fe Institute, Santa Fe, New Mexico.

O'Reilly, U.-M., and Oppacher, F. (1996). A comparative analysis of GP. In P. J. Angeline and K. E. Kinnear, Jr. (Eds.), Advances in Genetic Programming 2, Chap. 2 (pp. 23–44). Cambridge, MA: MIT Press.

Östermark, R. (1999). Solving a nonlinear non-convex trim loss problem with a genetic hybrid algorithm. Computers & Operations Research, 26(6):623–635.

Ozcan, E., and Mohan, C. K. (1998). Steady state memetic algorithm for partial shape matching. In Evolutionary Programming VII: 7th International Conference, EP98, pp. 25–27.

Poli, R. (2003). A simple but theoretically-motivated method to control bloat in genetic programming. In C. Ryan, T. Soule, M. Keijzer, E. Tsang, R. Poli, and E. Costa (Eds.), Genetic Programming, Proceedings of EuroGP’2003. Lecture notes in computer science, Vol. 2610 (pp. 204–217). Berlin: Springer-Verlag.

Poli, R., and McPhee, N. (2008). Parsimony pressure made easy. In M. Keijzer et al. (Eds.), GECCO ’08: Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computation, pp. 1267–1274.

Raja, A., Azad, R. M. A., Flanagan, C., and Ryan, C. (2007). Real-time, non-intrusive evaluation of VOIP. In M. Ebner, M. O'Neill, A. Ekárt, L. Vanneschi, and A. I. Esparcia-Alcázar (Eds.), Proceedings of the 10th European Conference on Genetic Programming. Lecture notes in computer science, Vol. 4445 (pp. 217–228). Berlin: Springer-Verlag.

Raja, A., Azad, R. M. A., Flanagan, C., and Ryan, C. (2008). A methodology for deriving VOIP equipment impairment factors for a mixed NB/WB context. IEEE Transactions on Multimedia, 10(6):1046–1058.

Rosca, J. P., and Ballard, D. H. (1996). Discovery of subroutines in genetic programming. In P. J. Angeline and K. E. Kinnear, Jr. (Eds.), Advances in Genetic Programming 2, Chap. 9 (pp. 177–202). Cambridge, MA: MIT Press.

Ryan, C., Collins, J. J., and O'Neill, M. (1998). Grammatical evolution: Evolving programs for an arbitrary language. In W. Banzhaf, R. Poli, M. Schoenauer, and T. C. Fogarty (Eds.), Proceedings of the First European Workshop on Genetic Programming. Lecture notes in computer science, Vol. 1391 (pp. 83–95). Berlin: Springer-Verlag.

Ryan, C., and Keijzer, M. (2003). An analysis of diversity of constants of genetic programming. In C. Ryan, T. Soule, M. Keijzer, E. Tsang, R. Poli, and E. Costa (Eds.), Genetic Programming, Proceedings of EuroGP’2003. Lecture notes in computer science, Vol. 2610 (pp. 404–413). Berlin: Springer-Verlag.

Ryan, C., Keijzer, M., and Cattolico, M. (2004). Favorable biasing of function sets using run transferable libraries. In U.-M. O'Reilly, T. Yu, R. L. Riolo, and B. Worzel (Eds.), Genetic Programming Theory and Practice II, Chap. 7 (pp. 103–120). Berlin: Springer-Verlag.

Silva, S., and Costa, E. (2005). Comparing tree depth limits and resource-limited GP. In D. Corne et al. (Eds.), Proceedings of the 2005 IEEE Congress on Evolutionary Computation, Vol. 1, pp. 920–927.

Soule, T., and Foster, J. A. (1998). Effects of code growth and parsimony pressure on populations in genetic programming. Evolutionary Computation, 6(4):293–309.

Spector, L., and Luke, S. (1996). Cultural transmission of information in genetic programming. In J. R. Koza, D. E. Goldberg, D. B. Fogel, and R. L. Riolo (Eds.), Genetic Programming 1996: Proceedings of the First Annual Conference, pp. 209–214.

Topchy, A., and Punch, W. F. (2001). Faster genetic programming based on local gradient search of numeric leaf values. In L. Spector et al. (Eds.), Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001), pp. 155–162.

Wang, P., Tang, K., Tsang, E., and Yao, X. (2011). A memetic genetic programming with decision tree-based local search for classification problems. In A. E. Smith (Ed.), Proceedings of the 2011 IEEE Congress on Evolutionary Computation, pp. 916–923.

Wong, P., and Zhang, M. (2007). Effects of program simplification on simple building blocks in genetic programming. In D. Srinivasan and L. Wang (Eds.), 2007 IEEE Congress on Evolutionary Computation, pp. 1570–1577.

Xie, H., Zhang, M., and Andreae, P. (2007a). An analysis of constructive crossover and selection pressure in genetic programming. In D. Thierens et al. (Eds.), GECCO ’07: Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, Vol. 2, pp. 1739–1748.

Xie, H., Zhang, M., and Andreae, P. (2007b). Another investigation on tournament selection: Modelling and visualisation. In D. Thierens et al. (Eds.), GECCO ’07: Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, Vol. 2, pp. 1468–1475.

Yeh, I. C. (1998). Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research, 28(12):1797–1808.

Zannoni, E., and Reynolds, R. G. (1997). Learning to control the program evolution process with cultural algorithms. Evolutionary Computation, 5(2):181–211.

Zhang, M., Gao, X., and Lou, W. (2007). A new crossover operator in genetic programming for object classification. IEEE Transactions on Systems, Man and Cybernetics, Part B, 37(5):1332–1343.

Zhang, M., and Smart, W. (2004). Genetic programming with gradient descent search for multiclass object classification. In M. Keijzer, U.-M. O'Reilly, S. M. Lucas, E. Costa, and T. Soule (Eds.), Proceedings of Genetic Programming 7th European Conference, EuroGP 2004. Lecture notes in computer science, Vol. 3003 (pp. 399–408). Berlin: Springer-Verlag.

Zhang, M., and Smart, W. (2005). Learning weights in genetic programs using gradient descent for object recognition. In F. Rothlauf et al. (Eds.), Applications of Evolutionary Computing, EvoWorkshops 2005: EvoBIO, EvoCOMNET, EvoHOT, EvoIASP, EvoMUSART, EvoSTOC. Lecture notes in computer science, Vol. 3449 (pp. 417–427). Berlin: Springer-Verlag.

Appendix  A

Savings with a Cache when Arity of the Functions Set Varies

We define F, Cmax, Fj, N1, N2, i, s, and dk as in Section 3. Let $Q(k) = |F_j|$ if the kth node contains a function with arity j, where $F_j \subseteq F$ is the subset of functions with arity j. Then,

$$N_1 = \sum_{k=1}^{i} (Q(k) - 1)\,s = i\,\overline{(Q-1)}\,s,$$

where $\overline{(Q-1)}$ represents the average of $(Q(k) - 1)$ over the $i$ internal nodes. Similarly,

$$N_2 = \sum_{k=1}^{i} (Q(k) - 1)(d_k + 1) = i\,\overline{(Q-1)(d+1)}.$$

Therefore, even when the arity of the functions in the functions set varies, N2 is still only a fraction of N1.
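Under our reading of the cache (a cached trial re-evaluates only the $d_k + 1$ nodes on the path from the tuned node to the root, instead of all $s$ nodes), a short sketch reproduces the savings quoted in the conclusions: exactly 50% for unary chains and approaching 100% for maximally grown binary trees. This is an interpretation of Section 3, not verbatim Chameleon code.

def savings(depths, s):
    # depths: depth of each internal node (root at depth 0). Without a
    # cache, each trial function at a node costs a full re-evaluation of
    # all s nodes; with a cache, only the tuned node and its ancestors
    # (d_k + 1 nodes) must be re-evaluated. Assuming every node offers the
    # same number of alternative functions, that count cancels in the ratio.
    n1 = len(depths) * s
    n2 = sum(d + 1 for d in depths)
    return 1.0 - n2 / n1

# Skinny tree (unary chain) of size s: internal nodes at depths 0..s-2.
s = 1000
print(savings(list(range(s - 1)), s))          # 0.5, i.e., 50% saved

# Perfect binary tree of depth d: 2**k internal nodes at each depth k < d.
d = 15
s = 2 ** (d + 1) - 1
depths = [k for k in range(d) for _ in range(2 ** k)]
print(savings(depths, s))                      # close to 1.0 (100% saved)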

Appendix  B

Non Perfect Convertible Full Binary Trees

Consider a tree which is perfect only until a depth level df and contains a few leaves at the level df+1, for example, (+ (+ x y) 1). This is an example of a tree which cannot be converted into a perfect binary tree. Such a tree is also called a complete binary tree. The trees which cannot be converted into a perfect tree can, however, be converted into a complete binary tree, such that every level except the deepest is completely filled.

Let $d_f$ be the deepest level at which the tree is completely filled and $i_f$ be the number of internal nodes above depth $d_f$ (all nodes at depths $0, \ldots, d_f - 1$ are internal), so that:

$$i_f = 2^{d_f} - 1.$$

Let $s_f$ be the size of the tree down to depth $d_f$ (i.e., the fraction of the tree that is completely filled), so that $s_f = 2^{d_f+1} - 1$, and let $s$ be the total size of the tree. Thus, the number of nodes at depth $d_f + 1$ is:

$$s - s_f,$$

and (because we consider only full binary trees) the number of internal nodes at depth $d_f$ is $(s - s_f)/2$. Thus,

$$i(s) = i_f + \frac{s - s_f}{2} = 2^{d_f} - 1 + \frac{s - (2^{d_f+1} - 1)}{2} = \frac{s - 1}{2},$$

which is the same result as in Equation (5). Hence, the properties of T(s) remain the same as with perfect or perfect-convertible trees.
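A quick numeric check of this decomposition (our illustration): for every completely filled depth $d_f$ and every number of leaf pairs added at depth $d_f + 1$, the count reduces to (s − 1)/2. The example (+ (+ x y) 1) corresponds to $d_f = 1$ with one pair.

for d_f in range(0, 8):
    s_f = 2 ** (d_f + 1) - 1              # size of the completely filled part
    i_f = 2 ** d_f - 1                    # internal nodes above depth d_f
    for pairs in range(1, 2 ** d_f + 1):  # leaf pairs added at depth d_f + 1
        s = s_f + 2 * pairs               # total tree size
        i = i_f + pairs                   # each pair makes one more node internal
        assert i == (s - 1) // 2          # matches Equation (5)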

Appendix  C

Test Set Results for Complete Runs

This section presents the mean test set performance of the best individual (best on training score) throughout the evolutionary run. Note that the best individual may vary during the run; the result at the end of the run corresponds to the best individual that appeared during the entire run.

As highlighted in Section 4.3, we investigate whether Chameleon over-trains the evolving individuals, that is, whether their test set performance degrades over time even more than with standard GP. The end of the run results in Section 4.3 dismissed such a notion. However, the end of the run samples the overall trend only at a single point in time. Unlike training scores and tree sizes, which typically either increase or stay the same over time, test set results may fluctuate, so a question arises as to whether the end of run was just a lucky point in time for Chameleon. To address this question, we present the test set results from the entire runs in Figures 9 and 10.

Figure 9:

Normalised mean test score for the best fit individual is plotted at various stages of evolution. Runs end after processing a fixed number of GP tree nodes.

Figure 10:

Normalised mean test score for the best fit individual is plotted at various stages of evolution. Runs end after processing 75,000 fitness evaluations.

Figure 9 shows the results for when the runs terminated after traversing a fixed number of nodes; Figure 10 shows the same results for when the runs terminated after exhausting 75,000 fitness evaluations. The trends in the figures show that:

  • Except for one problem in Figure 9, Chameleon setups are never consistently inferior to their standard GP counterparts throughout the runs; and

  • On some problems, when training begins, standard GP setups are indeed better than Chameleon, for example, on three of the problems in Figure 10. However, while the performance of standard GP setups degrades later, the performance of Chameleon setups improves.

Hence, except in one case, for the problems studied in this paper, Chameleon tuning does not harm the test set performance, contrary to what one might fear.

Notes

1

When implementing it, we require two caches: the additional cache holds the results from the last iteration of tuning. We swap these results back into the main cache if the current iteration does not improve fitness.
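A hypothetical sketch of this bookkeeping (the names, the fitness comparison, and the dictionary-based cache are ours, not Chameleon's actual implementation):

def tuning_iteration(individual, cache, evaluate):
    # 'cache' maps subtrees to their evaluated outputs from the last
    # accepted iteration; 'scratch' collects the current iteration's
    # results. Structure is illustrative only.
    scratch = dict(cache)
    trial_fitness = evaluate(individual, scratch)  # fills/updates scratch
    if trial_fitness < individual.fitness:         # assuming lower is better
        individual.fitness = trial_fitness
        return scratch       # adopt the new results as the main cache
    return cache             # no improvement: keep the previous results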

2

A. Kordon presented this challenge at EvoStar 2010. For details see: http://casnew.iti.upv.es/index.php/evocompetitions/105-symregcompetition

3

Some exploratory runs showed that maximal tuning can indeed be very expensive if we force the evolution through a fixed number of generations instead of terminating the runs after a fixed number of nodes or fitness evaluations. We consistently recorded individuals with over 20,000 nodes, which is an order of magnitude larger than the largest tree size reported in this study.