## Abstract

The Gene-pool Optimal Mixing Evolutionary Algorithm (GOMEA) is a model-based EA framework that has been shown to perform well in several domains, including Genetic Programming (GP). Differently from traditional EAs where variation acts blindly, GOMEA learns a model of interdependencies within the genotype, that is, the linkage, to estimate what patterns to propagate. In this article, we study the role of Linkage Learning (LL) performed by GOMEA in Symbolic Regression (SR). We show that the non-uniformity in the distribution of the genotype in GP populations negatively biases LL, and propose a method to correct for this. We also propose approaches to improve LL when ephemeral random constants are used. Furthermore, we adapt a scheme of interleaving runs to alleviate the burden of tuning the population size, a crucial parameter for LL, to SR. We run experiments on 10 real-world datasets, enforcing a strict limitation on solution size, to enable interpretability. We find that the new LL method outperforms the standard one, and that GOMEA outperforms both traditional and semantic GP. We also find that the small solutions evolved by GOMEA are competitive with tuned decision trees, making GOMEA a promising new approach to SR.

## 1 Introduction

Symbolic Regression (SR) is the task of finding a function that explains hidden relationships in data, without prior knowledge on the form of such function. Genetic Programming (GP) (Koza, 1992) is particularly suited for SR, as it can generate solutions of arbitrary form using basic functional components.

Much work has been done in GP for SR, proposing novel algorithms (Krawiec, 2015; Zhong et al., 2018; De Melo, 2014),
hybrids (Žegklitz and Pošík, 2017; Icke and Bongard, 2013),
and other forms of enhancement (Keijzer, 2003; Chen et al., 2015). What is
recently receiving a lot of attention is the use of so-called *semantic-aware* operators, which enhance the variation process
of GP by considering intermediate solution outputs (Pawlak et al., 2015; Chen et al., 2018; Moraglio et al., 2012). The use of semantic-aware operators has proven to enable the
discovery of very accurate solutions, but often at the cost of complexity: solution
size can range from hundreds to billions of components (Pawlak et al., 2015; Martins et al., 2018). These solutions are consequently impossible to
interpret, a fact that complicates or even prohibits the use of GP in many
real-world applications because many practitioners desire to understand what a
solution means before trusting its use (Lipton, 2018; Guidotti et al., 2018). The
use of GP to discover uninterpretable solutions can even be considered to be
questionable in many domains, as many alternative machine learning algorithms exist
that can produce competitive solutions much faster (Orzechowski et al., 2018).

We therefore focus on SR when GP is *explicitly constrained* to
generate small-sized solutions, that is, mathematical expressions consisting of a
small number of basic functional components, to increase the level of
interpretability. With size limitation, finding accurate solutions is particularly
hard. It is not without reason that many effective algorithms work instead by
growing solution size, for example, by iteratively stacking components (Moraglio et
al., 2012; Chen and Guestrin, 2016).

A recurring hypothesis in the GP literature is that the evolutionary search can be made
effective if *salient patterns*, occurring in the representation of
solutions (i.e., the genotype), are identified and preserved during variation (Poli
et al., 2008). It is worth studying whether this holds for SR, in particular when the aim is to find accurate small solutions.

The hypothesis that salient patterns in the genotype can be found and exploited is
what motivates the design of Model-Based Evolutionary Algorithms (MBEAs). Among
them, the Gene-pool Optimal Mixing Evolutionary Algorithm (GOMEA) is a recent EA
that has proven to perform competitively in different domains: discrete optimization
(Thierens and Bosman, 2011; Luong et al., 2014), real-valued optimization (Bouter
et al., 2017), but also grammatical evolution
(Medvet, Bartoli et al., 2018), and, the
focus of this article, GP (Virgolin et al., 2017, 2018). GOMEA embodies
within each generation a model-learning phase, where *linkage*, that
is, the interdependency within parts of the genotype, is modeled. During variation,
the linkage information is used to propagate genotype patterns and avoid their
disruption.

The aim of this article is to understand the role of linkage learning when dealing with SR, and consequently improve the GP variant of GOMEA (GP-GOMEA), to find small and accurate SR solutions for realistic problems. We present three main contributions. First, we propose an improved linkage learning approach, that, differently from the original one, is unbiased with respect to the way the population is initialized. Second, we analyze how linkage learning is influenced by the presence of many different constant values, sampled by Ephemeral Random Constant (ERC) nodes (Poli et al., 2008), and explore strategies to handle them. Third, we introduce improvements upon GP-GOMEA's Interleaved Multistart Scheme (IMS), a strategy of multiple evolutionary runs of increasing evolutionary budget that executes them in an interleaved fashion, to better deal with SR and learning tasks in general.

The structure of this article is as follows. In Section 2, we briefly discuss related work on MBEAs for GP. In Section 3, we explain how GP-GOMEA and linkage learning work. Before proceeding with the description of the new contributions and experiments, Section 4 shows general parameter settings and datasets that will be used in the article. Next, we proceed by interleaving our findings on current limitations of GP-GOMEA followed by proposals to overcome such limitations, and respective experiments. In other words, we describe how we improve linkage learning one step at a time. In particular, Section 5 presents current limitations of linkage learning, and describes how we improve linkage learning. Strategies to learn linkage efficiently and effectively when ERCs are used are described in Section 6. We propose a new IMS for SR in Section 7, and use it in Section 8 to benchmark GP-GOMEA with competing algorithms: traditional GP, GP using a state-of-the-art semantic-aware operator, and the very popular decision tree for regression (Breiman et al., 1984). Lastly, we discuss our findings and draw conclusions in Section 9.

## 2 Related Work

We differentiate today's MBEAs into two classes: Estimation-of-Distribution Algorithms (EDAs), and Linkage-based Mixing EAs (LMEAs). EDAs work by iteratively updating a probabilistic model of good solutions, and sampling new solutions from that model. LMEAs attempt to capture linkage, that is, interdependencies between parts of the genotype, and proceed by varying solutions with mechanisms that avoid the disruption of patterns with strong linkage.

Several EDAs for GP have been proposed so far. Hauschild and Pelikan (2011) and Kim et al. (2014) are relatively recent surveys on the matter. Two
categories of EDAs for GP have mostly emerged in the years: one where the shape of
solutions adheres to some template to be able to estimate probabilities of what
functions and terminals appear in what locations (called *prototype
tree* for tree-based GP) (Salustowicz and Schmidhuber, 1997; Sastry and Goldberg, 2003; Yanai and Iba, 2003; Hemberg et al., 2012), and one where the probabilistic model is used to sample grammars
of rules which, in turn, determine how solutions are generated (Shan et al., 2004; Bosman and De Jong, 2004; Wong et al., 2014;
Sotto and de Melo, 2017). Research on EDAs
for GP appears to be limited. The review of Kim et al. (2014) admits, quoting, that “*Unfortunately, the
latter research [EDAs for GP] has been sporadically carried out, and reported in
several different research streams, limiting substantial communication and
discussion.*”

Concerning symbolic regression, we crucially found no works where it is attempted on
realistic datasets (we searched among the work reported by the surveys and other
recent work cited here). Many contributions on EDAs for GP have instead been validated on
hard problems of an artificial nature, such as *Royal Tree* and *Deceptive Max* (Hasegawa and Iba, 2009). Some real-world problems have been explored, but
concerning only a limited number of variables (Tanev, 2007; Li et al., 2010).
When considering symbolic regression, at most synthetic functions or small physical
equations with only a few ($\leq 5$) variables have been considered (e.g., by Ratle and Sebag, 2001 and Sotto and de Melo, 2017).

The study of LMEAs emerged in the first decade of the millennium in the field of binary optimization, where it remains mostly explored to date (Chen et al., 2007; Thierens and Bosman, 2013; Goldman and Punch, 2014; Hsu and Yu, 2015). Concerning GP, GOMEA is the first state-of-the-art LMEA ever brought to GP (Virgolin et al., 2017).

GP-GOMEA was first introduced in Virgolin et al. (2017), to tackle classic yet artificial benchmark problems of GP (including some of the ones mentioned before), where the optimum is known. The IMS, largely inspired by the work of Harik and Lobo (1999), was also proposed, to relieve the user of the need to tune the population size. Population sizing is particularly crucial for MBEAs in general: the population needs to be big enough for probability or linkage models to be reliable, yet small enough to allow efficient search (Harik et al., 1999).

GP-GOMEA has also seen a first adaptation to SR, to find small and accurate solutions for a clinical problem where interpretability is important (Virgolin et al., 2018). There, GP-GOMEA was engineered for the particular problem, and no analysis of what linkage learning brings to SR was performed. Also, instead of using the IMS, a fixed population size was used. This is because the IMS was originally designed by Virgolin et al. (2017) to enable benchmark problems to be solved to optimality. No concern on generalization of solutions to unseen test cases was incorporated.

As to combining LMEAs with grammatical evolution, Medvet, Bartoli et al. (2018) also employed GOMEA, to attempt to learn and exploit linkage when dealing with different types of predefined grammars. In that work, only one synthetic function was considered for symbolic regression, among four other benchmark problems.

There is a need for assessing whether MBEAs can bring an advantage to real-world symbolic regression problems. This work attempts to do this, by exploring possible limitations of GP-GOMEA and ways to overcome them, and validating experiments upon realistic datasets with dozens of features and thousands of observations.

## 3 Gene-Pool Optimal Mixing Evolutionary Algorithm for GP

Three main concepts are at the base of (GP-)GOMEA: solution representation (genotype), linkage learning, and linkage-based variation. These components are arranged in a common outline that encompasses all algorithms of the GOMEA family.

Algorithm 1 shows the outline of GOMEA. Like most EAs, GOMEA starts by initializing a population $P$, given the desired population size $n_{pop}$. The generational loop is then started and continues until a termination criterion is met, for example, a limit on the number of generations or evaluations, or a maximum time. Lines 4 to 8 represent a generation. First, the linkage model, called Family of Subsets (FOS), is learned (explained in Section 3.2). Second, each solution $P_i$ is used to generate an offspring solution $O_i$ by the variation operator Gene-pool Optimal Mixing (GOM). Last, the offspring replace the parent population. Note the lack of a separate selection operator. This is because GOM performs variation and selection at the same time (see Section 3.3).
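The outline just described can be sketched in a few lines. This is a minimal, hypothetical Python sketch of our own: `initialize`, `learn_fos`, and `gom` stand in for the components described in Sections 3.1–3.3, and the termination criterion is reduced to a fixed number of generations.

```python
def gomea(initialize, learn_fos, gom, n_pop, max_generations):
    """Minimal sketch of the GOMEA outline (Algorithm 1).

    initialize(n_pop) -> population; learn_fos(population) -> FOS;
    gom(solution, fos, population) -> offspring. GOM performs both
    variation and selection, so no separate selection step appears.
    """
    population = initialize(n_pop)
    for _ in range(max_generations):           # termination criterion
        fos = learn_fos(population)            # e.g., learn a Linkage Tree
        offspring = [gom(p, fos, population) for p in population]
        population = offspring                 # offspring replace parents
    return population
```

Note how the FOS is re-learned every generation before variation takes place.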

For GP-GOMEA, an extra parameter is needed, the tree height (or, equivalently, tree depth) $h$. This is necessary to determine the representation of solutions, as described in the following Section 3.1.

### 3.1 Solution Representation in GP-GOMEA

GP-GOMEA uses a modification of the tree-based representation (Koza, 1992) which is similar to the one used by Salustowicz and Schmidhuber (1997). While typical GP trees can have any shape, GP-GOMEA uses a fixed template that allows linkage learning and linkage-based variation to be performed in a similar fashion as for other, fixed string-length versions of GOMEA.

All solutions are generated as *perfect* $r$-ary
trees of height $h$,
that is, such that all non-leaf nodes have exactly $r$ children, and all leaves are at maximum depth $h$,
with $r$ being the maximum number of inputs accepted by the functions (arity) provided in
the function set (e.g., for $\{+,-,\times\}$, $r=2$),
and $h$ chosen by the user. Note that, for any node that is not at maximum depth, $r$ child nodes are appended anyway: no matter if the node is a terminal, or if it
is a function requiring fewer than $r$ inputs (in this case, the leftmost nodes are used as inputs). Some nodes are
thus *introns*; that is, they are not executed to compute the
output of the tree. It follows that while trees are *syntactically* redundant, they are not necessarily *semantically* so. All trees of GP-GOMEA have the same number
of nodes, equal to $\ell = \sum_{i=0}^{h} r^i = \frac{r^{h+1}-1}{r-1}$.
Figure 1 shows a tree used by GP-GOMEA.
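For instance, the function set $\{+,-,\times\}$ gives $r=2$, so a height-$4$ template contains $\ell=31$ nodes. A one-line check of the node-count formula (a trivial sketch of our own; the function name is not from the article):

```python
def num_nodes(r, h):
    """Number of nodes in a perfect r-ary tree of height h:
    sum_{i=0}^{h} r^i = (r^(h+1) - 1) / (r - 1)."""
    return (r ** (h + 1) - 1) // (r - 1)
```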

### 3.2 Linkage Learning

In GOMEA, linkage learning corresponds to building a FOS. Different types of FOS
exist in the literature; the one recommended as default is the *Linkage Tree* (LT), by, for example, Thierens and Bosman
(2013) and Virgolin et al. (2017). The LT captures linkage on
hierarchical levels. An LT is learned from the population every generation. To
assess whether linkage learning plays a key role, that is, whether it is better
than randomly choosing linkage relations, we also consider the Random Tree (RT)
(Virgolin et al., 2017).

#### 3.2.1 Linkage Tree

The LT arranges the FOS subsets in a binary tree structure representing hierarchical levels of linkage strength among genotype locations. The LT is built bottom-up, that is, from the leaves to the root. The bottom level of the LT, that is, the leaves, assumes that all genotype locations are independent (no linkage), and is realized by instantiating FOS subsets as singletons, each containing one genotype location $i$, $\forall i \in \{1,\cdots,\ell\}$.

To build the next levels, mutual information is used as a proxy for linkage
strength. Mutual information is a sensible choice to represent linkage
strength because it expresses, considering, for example, the pair $(i,j)$ of genotype locations as random variables, the amount of information gained
on $i$ given observations on $j$ (and vice versa). In this light, the population can be considered as a set
of realizations of the genotype. In particular, the realizations of each
genotype location $i$ are what *symbols* appear at location $i$ in the population. In a binary genetic algorithm, symbols are either
“0” or “1” while in GP, symbols correspond to
the types of function and terminal nodes, for example,
“$+$”, “$-$”, “$x_1$”, “$x_2$”.
In other words, random variables can assume as many values as there are
possible symbols in the instruction set.^{1}

Given mutual information between location pairs, we approximate linkage among higher orders of locations using the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) (Gronau and Moran, 2007). To ease understanding, we now provide an explanation of how UPGMA is used to build the rest of the LT that is primarily meant to be intuitive. In practice, we do not use an implementation that strictly adheres to the following explanation, but we use a more advanced algorithm that achieves the same result while having lower time complexity, called the Reciprocal Nearest Neighbor algorithm (RNN). For details on RNN, see Gronau and Moran (2007).

With the efficient implementation of UPGMA by RNN, the time complexity of building the LT remains bounded by $O(n_{pop}\ell^2)$.
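To make the construction concrete, the following is a naive sketch of our own illustration, not the article's implementation: it estimates mutual information from symbol frequencies in the population, and then merges clusters greedily by average mutual information as in UPGMA, at $O(\ell^3)$ cost rather than the $O(n_{pop}\ell^2)$ of the RNN-based version. The population is assumed to be given as a list of genotypes, each a sequence of $\ell$ symbols.

```python
import math
from collections import Counter
from itertools import combinations

def entropy(column):
    """Shannon entropy (in bits) of the symbols observed in a column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def mutual_information(population, i, j):
    """MI(i, j) = H(i) + H(j) - H(i, j), estimated from the symbols
    appearing at genotype locations i and j across the population."""
    col_i = [g[i] for g in population]
    col_j = [g[j] for g in population]
    return entropy(col_i) + entropy(col_j) - entropy(list(zip(col_i, col_j)))

def build_linkage_tree(population, n_locations):
    """Naive UPGMA sketch: start from singleton subsets (the LT leaves)
    and repeatedly merge the pair of clusters with the highest average
    mutual information; every merged cluster becomes a FOS subset."""
    mi = {(i, j): mutual_information(population, i, j)
          for i, j in combinations(range(n_locations), 2)}

    def avg_mi(a, b):
        return sum(mi[min(x, y), max(x, y)] for x in a for y in b) / (len(a) * len(b))

    clusters = [frozenset([i]) for i in range(n_locations)]
    fos = list(clusters)
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda pair: avg_mi(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
        fos.append(a | b)
    return fos
```

For $\ell$ locations this produces $2\ell - 1$ FOS subsets, the last being the root subset of all locations, which GOM ignores during variation.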

#### 3.2.2 Random Tree

While linkage learning assumes an inherent structural interdependency to be present within the genotype that can be captured in an LT, this hypothesis may not hold. In such a scenario, using the LT might not be better than building a similar FOS in a completely random fashion. The RT is therefore considered to test this. The RT shares the same tree-like structure of the LT, but is built randomly rather than using mutual information (taking $O(\ell)$ time). We use the RT as an alternative FOS for GP-GOMEA.

### 3.3 Gene-Pool Optimal Mixing

Once the FOS is learned, the variation operator GOM generates the offspring population. GOM varies a given solution $P_i$ in iterative steps, by overriding the nodes at the locations specified by each $F_j$ in the FOS with the nodes at the same locations taken from random donors in the population. Selection is performed within GOM in a hill-climbing fashion, that is, variation attempts that result in worse fitness are undone.

Note that if a change results in $f_{O_i} = f_{B_i}$,
the change is kept. This allows random walks in the neutral fitness landscape
(Ebner et al., 2001; Sadowski et al., 2013). Note also that, differently
from traditional subtree crossover and subtree mutation (Koza, 1992), GOM can change unconnected nodes at
the same time, and keeps tree height limited to the initially specified
parameter $h$.
Finally, GOM *does not consider* any FOS subset that contains all
node locations, that is, $F_j = \{1,\cdots,\ell\}$,
as using such a subset would mean entirely replacing $O_i$ with $D$.
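A minimal sketch of GOM for a minimization problem may help. This is our own simplification: evaluation caching and the detection of non-meaningful changes are omitted, and `fitness` is a plain callable.

```python
import random

def gom(parent, fos, population, fitness, n_locations):
    """Sketch of Gene-pool Optimal Mixing (GOM), for minimization.

    For each FOS subset, override the offspring's genotype at those
    locations with a random donor's genes; keep the change only if
    fitness does not worsen (equal fitness is kept, allowing random
    walks in neutral parts of the fitness landscape)."""
    offspring = list(parent)
    f_cur = fitness(offspring)
    for subset in fos:
        if len(subset) == n_locations:       # never use the root subset
            continue
        donor = random.choice(population)
        idxs = sorted(subset)
        backup = [offspring[i] for i in idxs]
        for i in idxs:
            offspring[i] = donor[i]
        f_new = fitness(offspring)
        if f_new <= f_cur:                   # keep equal-or-better changes
            f_cur = f_new
        else:                                # revert worsening changes
            for i, v in zip(idxs, backup):
                offspring[i] = v
    return offspring
```

Because acceptance happens per subset, GOM interleaves variation with selection: a separate survivor-selection step is unnecessary.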

## 4 General Experimental Settings

We now describe the general parameters that will be used in this article. Table 1 reports the parameter settings which are typically used in the following experiments, unless specified otherwise. The notation $x$ represents the matrix of feature values. We use the Analytic Quotient (AQ) (Ni et al., 2013) instead of protected division. This is because the AQ is continuous in 0 for the second operand: $x_1 \div_{AQ} x_2 := x_1 / \sqrt{1 + x_2^2}$. Although continuity is not needed by many GP variation operators (including GOM), it is useful at prediction time: Ni et al. (2013) show that using the AQ helps generalization (whereas using protected division does not). However, the AQ may be considered relatively hard to interpret.
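The AQ is a direct transcription of the formula above (a small sketch; the function name is ours):

```python
import math

def analytic_quotient(x1, x2):
    """Analytic Quotient (Ni et al., 2013): x1 / sqrt(1 + x2^2).
    Continuous in x2 = 0, unlike protected division."""
    return x1 / math.sqrt(1.0 + x2 * x2)
```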

| Parameter | Setting |
|---|---|
| Function set | $\{+,-,\times,\div_{AQ}\}$ |
| Terminal set | $x \cup \{\text{ERC}\}$ |
| ERC bounds | $[\min x, \max x]$ |
| Initialization for GP-GOMEA | Half-and-Half as in Virgolin et al. (2018) |
| Tree height $h$ | 4 |
| Train-validation-test split | 50%–25%–25% |
| Experiment repetitions | 30 |


As mentioned in the introduction, we focus on the evolution of solutions that are
constrained to be small, to *enable* interpretability. We choose $h=4$ because this results in relatively balanced trees with up to 31 nodes (since $r=2$).
We consider this size limitation a critical value: for the given function set, we
found solutions to be already borderline interpretable for us (this is discussed
further in Section 9). Larger values for $h$ would therefore play against the aim of this study. When benchmarking GP-GOMEA in
Section 8, we also consider $h=3$ and $h=5$ for completeness.

We consider 10 real-world benchmark datasets from literature (Martins et al., 2018) that can be found on the UCI
repository^{2} (Asuncion and
Newman, 2007) and other sources.^{3} The characteristics of the datasets
are summarized in Table 2.

| Name | Abbreviation | # Features | # Examples |
|---|---|---|---|
| Airfoil | Air | 5 | 1503 |
| Boston housing | Bos | 13 | 506 |
| Concrete compres. str. | Con | 8 | 1030 |
| Dow Chemical | Dow | 57 | 1066 |
| Energy cooling | EnC | 8 | 768 |
| Energy heating | EnH | 8 | 768 |
| Tower | Tow | 25 | 4999 |
| Wine red | WiR | 11 | 1599 |
| Wine white | WiW | 11 | 4898 |
| Yacht hydrodynamics | Yac | 6 | 308 |


We use the linearly scaled Mean Squared Error (MSE) to measure solution fitness (Keijzer, 2003), as it can be particularly beneficial when evolving small solutions. This means a fast (cost $O(n)$, with $n$ the number of dataset examples) linear regression is applied between the target $y$ and the solution prediction $\tilde{y}$ prior to computing the MSE. We present our results in terms of variance-Normalized MSE (NMSE), that is, $\frac{\mathrm{MSE}(y,\tilde{y})}{\mathrm{var}(y)}$, so that results from different datasets are on a similar scale.
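A sketch of this fitness computation (our own illustration): the optimal intercept and slope of the linear scaling have a closed form, so the whole procedure stays $O(n)$.

```python
def scaled_nmse(y, y_pred):
    """Variance-normalized MSE after optimal linear scaling (Keijzer, 2003):
    fit a + b * y_pred to y in closed form (a simple linear regression),
    then return MSE(y, a + b * y_pred) / var(y)."""
    n = len(y)
    mean_y = sum(y) / n
    mean_p = sum(y_pred) / n
    cov = sum((p - mean_p) * (t - mean_y) for p, t in zip(y_pred, y)) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    var_y = sum((t - mean_y) ** 2 for t in y) / n
    b = cov / var_p if var_p > 0 else 0.0   # slope (0 for constant predictions)
    a = mean_y - b * mean_p                 # intercept
    mse = sum((t - (a + b * p)) ** 2 for p, t in zip(y_pred, y)) / n
    return mse / var_y
```

A perfect prediction up to a linear transformation yields an NMSE of 0, while a constant prediction yields exactly 1, which is what makes results comparable across datasets.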

To assess statistical significance when comparing the results of multiple executions
of two algorithms (or configurations) on a certain dataset, we use the Wilcoxon
signed-rank test (Demšar, 2006). This
test is set up to compare competing algorithms based on the same prior conditions.
In particular, we employ pairs of executions where the dataset is split into
identical training, validation, and test sets for both algorithms being tested. This
is because the particular split of data determines the fitness function (based on
the training set), and the achievable generalization error (for the validation and
test sets). We consider a difference to be significant if a $p$-value
smaller than $0.05/\beta$ is found, with $\beta$ the Bonferroni correction coefficient, used to prevent false positives. If more than
two algorithms need to be compared, we first perform a Friedman test on mean
performance over all datasets (Demšar, 2006). We use the symbols ▴ and ▾ to indicate significant superiority and inferiority, respectively (absence of a
symbol means no significant difference). The result *next* to the
symbol ▴ (▾)
signifies a result being better (worse) than the result obtained by the algorithm
that has the same color as the symbol. Algorithms and/or configurations are color
coded in each table reporting results (colors are color-blind safe).

## 5 Improving Linkage Learning for GP

In previous work on GP-GOMEA, learning the LT was performed the same way it is done
for any discrete GOMEA implementation, that is, by computing the mutual information
between pairs of locations $(i,j)$ in the genotype (Eq. 1) (Virgolin et
al., 2017). However, the distribution of node
types is typically not uniform when a GP population is initialized (e.g., function
nodes never appear as leaves). In fact, this depends on the cardinality of the
function and terminal sets, on the arity of the functions, and on the population
initialization method (e.g., *Full*, *Grow*, *Half-and-Half*, *Ramped Half-and-Half* (Luke and
Panait, 2001)). Note that it does not depend
on the particular dataset in consideration (except in that the number of features
determines the size of the terminal set). The lack of uniformity in the distribution
leads to the emergence of mutual information between particular parts of the
genotype. Crucially, this mutual information is natural to the solution
representation, the sets of symbols and the initialization process.

If mutual information is used to represent linkage, then linkage will already be
observed at initialization. However, it is reasonable to expect no linkage to be
present in an initialized population, as evolution did not take place yet. Figure 3 shows the mutual information matrix
between pairs of node locations in an initial population of 1,000,000 solutions with
maximum height $h=2$,
using *Half-and-Half*, a function set of size 4 with maximum number
of inputs $r=2$,
and a terminal set of size 6 (no ERCs are used). Each tree contains exactly 7 nodes.
We index node locations with pre-order tree traversal; that is, 1 is the root, 2
its first child, 5 its second child, 3,4 are (leaves) children of 2, and 6,7 are
(leaves) children of 5. Nodes at locations 2 and 5 can be functions only if a
function is sampled at node 1. It can be seen that the mutual information matrix of
location pairs (correctly) captures the non-uniformity in the initial distribution
(i.e., larger mutual information values are present between non-leaf nodes). Using
mutual information directly as a proxy for linkage may be undesirable.

### 5.1 Biasing Mutual Information to Represent Linkage

We propose to overcome the aforementioned problem by measuring linkage with a modified version of the mutual information, such that no linkage is measured at initialization. Our hypothesis is that, if we apply such a correction so that no patterns are identified at initialization, the truly salient patterns will have a bigger chance of emerging during evolution, and better results will be achieved.

Let us consider the scenario where, at initialization, symbols are uniformly distributed. For example, this typically happens in binary genetic algorithms. The mutual information between pairs of genotype locations that is expected at initialization, that is, at generation $g=1$ before variation and selection, will then correspond to the identity matrix: $\mathrm{MI}^g|_{g=1} = I$ (assuming binary symbols and mutual information in bits, as well as a sufficiently large population size). This mutual information matrix is suitable to represent linkage as no linkage should be present at initialization.

Determining the $\beta$ coefficients *exactly* requires knowing the
true distribution induced by the sampling process used to sample the initial
population, and thus the true initial entropy for each genotype location.
However, this is generally not trivial to determine for GP, since a number of
factors need to be considered. For example, if the *Ramped
Half-and-Half* initialization method is used, what symbol is sampled
at a location depends on the chance to use *Full* or *Grow*, the chance to pick the function or the terminal set
based on the depth, the size of these sets, and possibly other problem-specific
factors. Hence, we propose to simply approximate the $\beta$ coefficients by using the $H^g(i)|_{g=1}$ measured on the initial population, assuming the population to be large
enough.
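As a purely illustrative stand-in for the $\beta$-based correction (this is *not* the article's exact formula), one can subtract the mutual information matrix measured on the initial population and add back the identity. Like the article's correction, this yields the identity matrix at generation 1 by construction, and it can produce entries below 0 or above 1:

```python
def corrected_mi(mi_now, mi_init):
    """Illustrative correction (an assumption, not the article's exact
    beta-based formula): remove the mutual information that is already
    present at initialization, so that the matrix equals the identity
    at generation 1. Matrices are given as lists of lists."""
    size = len(mi_now)
    return [[mi_now[i][j] - mi_init[i][j] + (1.0 if i == j else 0.0)
             for j in range(size)] for i in range(size)]
```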

### 5.2 Estimation of Linkage by $\mathrm{MI}_{\tilde{b}}$

As a preliminary step, we observe what linkage values are obtained between pairs of genotype locations by using $\mathrm{MI}_{\tilde{b}}$. For conciseness, in the following we denote $\mathrm{MI}_{\tilde{b}}^g(i,j)|_{g=\Gamma}$ with $\mathrm{MI}_{\tilde{b}}^{\Gamma}(i,j)$. We show the MI matrix computed at the second generation of a GP-GOMEA run on the dataset Yac ($\mathrm{MI}_{\tilde{b}}^{1} = I$ by construction). We do this for two population sizes, $n_{pop}=10$ and $n_{pop}=10^6$. We expect that the bigger $n_{pop}$ is, the closer $\mathrm{MI}_{\tilde{b}}^{2}$ is to $I$.

We use the parameters of Table 1, a terminal set of size 6 (the features of Yac, no ERC), and $h=2$; that is, $\ell=7$ nodes per tree. Figure 4 shows the biased mutual information matrix between location pairs, for the two population sizes. It can be seen that the values can be lower than 0 or bigger than 1. However, while this is particularly marked for $n_{pop}=10$, with a minimum of $-0.787$ and a maximum of $1.032$, it becomes less evident for $n_{pop}=10^6$, with a minimum of $-0.018$ and a maximum of $0.989$. The fact that $\mathrm{MI}_{\tilde{b}}^{2} \approx I$ for $n_{pop}=10^6$ is because, with such a large population size, considerable diversity is still present in the second generation.

### 5.3 Experiment: LT–$\mathrm{MI}_{\tilde{b}}$ vs. LT–MI vs. RT

We now test the use of $\mathrm{MI}_{\tilde{b}}$ over the standard MI for GP-GOMEA with the LT. We denote the two configurations with LT–$\mathrm{MI}_{\tilde{b}}$ and LT–MI. We also consider the RT, to see if mutual information drives variation better than random information.

We set the population size to 2000 as a compromise between having enough samples for linkage to be learned and meeting typical literature values, which range from hundreds to a few thousands. We use the function set of Table 1 and a tree height $h=4$ (thus $\ell=31$). We set a limit of 20 generations, which corresponds to approximately 1200 generations of traditional GP, as each solution is evaluated up to $2\ell-2$ times (the size of the LT minus its root and non-meaningful changes; see Sections 3.2 and 3.3).

### 5.4 Results: LT–$\mathrm{MI}_{\tilde{b}}$ vs. LT–MI vs. RT

The training and test NMSE performances are reported in Table 3. The Friedman test results in significant differences in both training and test performance. GP-GOMEA with LT–$\mathrm{MI}_{\tilde{b}}$ is clearly the best performing algorithm, with significantly lower NMSE compared to LT–MI on 8/10 datasets when training, and 7/10 at test time. It is always better than using the RT when training, and in 9/10 cases when testing. The LT–MI is comparable with the RT for these problems.

| Split | LT ($n_{pop}=2000$) | RT ($n_{pop}=2000$) | LT ($n_{pop}=4000$) | RT ($n_{pop}=4000$) |
|---|---|---|---|---|
| 1 | 36 | 18 | 44 | 15 |
| 2 | 42 | 12 | 49 | 21 |
| 3 | 40 | 7 | 43 | 8 |
| 4 | 43 | 8 | 45 | 25 |
| 5 | 36 | 16 | 49 | 16 |
| Avg. | 39 | 12 | 46 | 17 |


In the remainder, when we write “LT”, we refer to LT–$\mathrm{MI}_{\tilde{b}}$.

### 5.5 Experiment: Assessing Propagation of Node Patterns

The previous experiment showed that using linkage-driven variation (LT) can be favorable compared to random variation (RT). This seems to confirm the hypothesis that, in certain SR problems, salient underlying patterns of nodes exist in the genotype that can be exploited. Another aspect that can be considered with regard to such hypothesis is how final solutions look: if linkage learning identifies specific node patterns, it can be expected that their propagation will lead to the discovery of similar solutions over different runs.

Therefore, we now want to assess whether the use of the LT has a bigger chance of leading to the discovery of a particular best-of-run solution, compared to the use of the RT. We use the same parameter setting as described in Section 5.3, but perform 100 repetitions. While each run uses a different random seed (e.g., for population initialization), we fix the dataset split, as changing the training set results in changing the fitness function. We repeat the 100 runs on 5 random dataset splits, on the smallest dataset Yac. Together with $n_{pop}=2000$ as in the previous experiment, we also consider a doubled population size of $n_{pop}=4000$.

Duplicate best-of-run solutions were identified automatically,^{4} followed by manual inspection. It can be seen that the LT finds more duplicate solutions than the RT, by a margin of around 30% (difference between averages). Figure 6 shows the distribution of solutions found for the second dataset split with $n_{pop}=4000$, that is, where both the LT and the RT found a large number of duplicates. The LT has a marked chance of leading to the discovery of a particular solution, up to one-fourth of the time. When the RT is used, the same solution is found only up to 6 times out of 100.

This confirms the hypothesis that, should salient node patterns exist, linkage-based variation can propagate them more than random variation does, enhancing the likelihood of discovering particular solutions.

## 6 Ephemeral Random Constants and Linkage

In many GP problems, and in particular in SR, the use of ERCs can be very beneficial (Poli et al., 2008). An ERC is a terminal which is set to a constant only when instantiated in a solution. In SR, this constant is commonly sampled uniformly at random from a user-defined interval.

Because every instance of an ERC node is a different constant, linkage learning needs to deal with a large number of different symbols. This can lead to two shortcomings. First, a very large population size may be needed for salient node patterns to emerge. Second, the data structures used to store the frequencies of symbols (e.g., hash maps) grow very large and become slow.

We explore three strategies to deal with this:

- *all-const*: Ignore the shortcomings, and consider all different constants as different symbols during linkage learning.
- *no-const*: Skip all constants during linkage learning, that is, set their frequency to zero. This approximation is reasonable since all constants are unique at initialization, and the respective frequency is almost zero. However, during evolution some constants will be propagated while others will be discarded, making this approximation less and less accurate over time.
- *bin-const*: Perform on-line binning. We set a maximum number $\gamma$ of constants to consider. After $\gamma$ different constants have been encountered in frequency counting, any further constant is considered to fall into the same bin as the closest constant among the first $\gamma$. The closest constant can be determined with binary search in $\log_2(\gamma)$ steps. Contrary to no-const, we expect the error of this approximation to decrease over time, because selection lowers diversity, meaning that the total number of different constants will be reduced as generations pass.
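The bin-const bookkeeping can be sketched as follows; the class and method names are illustrative and not from the original implementation. The first $\gamma$ distinct constants become bin representatives, and every later constant is mapped to the closest representative via binary search:

```python
import bisect

class ConstantBinner:
    """Sketch of on-line binning of ERC constants (bin-const).

    The first `gamma` distinct constants become bin representatives;
    any later constant is mapped to the closest representative,
    found by binary search in log2(gamma) steps.
    """

    def __init__(self, gamma=100):
        self.gamma = gamma
        self.bins = []  # sorted list of representative constants

    def symbol_for(self, c):
        if len(self.bins) < self.gamma and c not in self.bins:
            bisect.insort(self.bins, c)  # still collecting representatives
            return c
        # Map c to the closest representative: binary search gives the
        # insertion point, so the closest value is one of its neighbors.
        i = bisect.bisect_left(self.bins, c)
        candidates = self.bins[max(0, i - 1): i + 1]
        return min(candidates, key=lambda r: abs(r - c))
```

For example, with `gamma=3` and representatives `[1.0, 5.0, 10.0]` already collected, a newly sampled constant `4.2` would be counted under the symbol `5.0`.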

### 6.1 Experiment: Linkage Learning with ERCs

We use the same parameter setup as the experiment in Section 5.3, this time adding an ERC terminal to the terminal set. We compare the three strategies to handle ERCs when learning the LT. For this experiment and in the rest of the article, we use $\gamma=100$ in bin-const. We observed that for problems with a small number of features (e.g., Air and Yac), that is, where ERC sampling is more likely and thus more constants are produced, this choice reduces the number of constant symbols to be considered by linkage learning in the first generations by a factor of $\sim 50$. We also report the results obtained with the RT as a baseline, under the hypothesis that using ERCs compromises linkage learning to the point that random variation becomes equally good or better.

The results of this experiment are shown in Table 5 (training and test NMSE) and Table 6 (running time). The Friedman test reveals significant differences among the configurations for train, test, and time performance. Note that the use of ERCs leads to lower errors compared with not using them (compare with Table 3).

In terms of training error, the RT is always outperformed by the use of the LT, no matter the strategy. The all-const strategy is significantly better than no-const in half of the problems, and never worse. Overall, bin-const performs best, with 6 out of 10 significantly better results than all-const. The fact that all-const can be outperformed by bin-const supports the hypothesis that linkage learning can be compromised by the presence of too many constants to consider, which hide the true salient patterns. Test results are overall similar to the training ones, but fewer comparisons are significant.

In terms of time, all-const is almost always significantly worse than the other methods, often by a substantial margin. This is particularly marked for problems with a small number of features (i.e., Air, Yac). There, more random constants are present in the initial population, since the probability of sampling the ERC from the terminal set is inversely proportional to the number of features.

Interestingly, despite the lack of a linkage-learning overhead, using the RT is not always the fastest option. This is because random variation leads to a slower convergence of the population compared with the linkage-based one, where salient patterns are quickly propagated, and fewer variation attempts result in changes of the genotype that require a fitness evaluation (see Section 3.3). The slower convergence caused by the RT can also be seen in Figure 5 (for the previous experiment), and was also observed in other work, in terms of diversity preservation (Medvet, Virgolin et al., 2018).

Among the LT-based strategies, the fastest is no-const, at the cost of a larger training error. Although consistently slower than no-const, bin-const is still quite fast, and achieves the lowest training errors. We found bin-const to be preferable in test NMSE as well. In the following, we always use bin-const, with $\gamma=100$.

## 7 Interleaved Multistart Scheme

The Interleaved Multistart Scheme (IMS) is a wrapper for evolutionary runs largely inspired by the work of Harik and Lobo (1999) on genetic algorithms. It works by interleaving the execution of several runs of increasing resources (e.g., population size). The main motivation for using the IMS is to make an EA much more robust to parameter settings, and alleviate the need for practitioners to tinker with parameters. In fact, the whole design of GP-GOMEA attempts to promote the aspects of ease-of-use and robustness: the EA has no need for parameters that specify how to conduct variation (e.g., crossover or mutation rates), nor how to conduct selection (e.g., tournament size). The IMS or similar schemes are often used with MBEAs (Lin and Yu, 2018; Goldman and Punch, 2014), where population size plays a crucial role in determining the quality of model building. Note that although the IMS has potential to be parallelized, here it is used in a sequential manner.

An IMS for GP-GOMEA was first proposed in Virgolin et al. (2017), and its outline is as follows. A collection of parameter settings $\sigma_{base}$ is given as input, which will be used in the first run $R_1$. The IMS runs until a termination criterion is met (e.g., number of generations, time budget). The run $R_i$ performs one generation if no run that precedes it exists (e.g., because it is the first run or because all previous runs have been terminated), or if the previous run $R_{i-1}$ has executed $g$ generations. The first time $R_i$ is about to execute a generation, it is initialized using the parameter settings $\sigma_{base}$ scaled by the index $i$. For example, the population size can be set to $2^{i-1} n^{base}_{pop}$ (i.e., doubling the population size of the previous run). Finally, when a run completes a generation, a check is done to determine whether the run should be terminated (explained below).
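Under this scheduling rule, run $R_{i+1}$ executes one generation each time $R_i$ completes another $g$ generations. A minimal simulation of the schedule (ignoring run termination, with illustrative names, assuming $g \geq 2$) is:

```python
def ims_schedule(total_base_gens, g):
    """Return the order in which runs execute generations under the IMS.

    Run 1 steps freely; run i+1 performs one generation every time run i
    has completed a multiple of g generations. Run termination checks
    are omitted; assumes g >= 2. A sketch, not the paper's code.
    """
    gens = {}    # generations executed so far, per run index
    order = []   # run index of each executed generation, in order

    def step(i):
        gens[i] = gens.get(i, 0) + 1
        order.append(i)
        if gens[i] % g == 0:  # run i just completed another g generations,
            step(i + 1)       # so the next (larger) run steps once

    for _ in range(total_base_gens):
        step(1)
    return order
```

For example, `ims_schedule(4, 2)` yields `[1, 1, 2, 1, 1, 2, 3]`: run 2 starts after run 1's second generation, and run 3 after run 2's second generation.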

### 7.1 An IMS for Supervised Learning Tasks

The first implementation of the IMS for GP-GOMEA was designed to deal with GP benchmark problems of pure optimization. That implementation therefore scaled both the population size and the height of trees in an attempt to find the optimal solution (of unknown size) (Virgolin et al., 2017).

In this work, we use the IMS as follows. *Scaling of parameter settings*: We scale only the population size. For run $R_i$, the population size is set to $n^i_{pop}=2^{i-1} n^{base}_{pop}$. *Run termination*: A run $R_i$ is terminated if the fitness of its best solution is worse than that of a run $R_j$ initialized later, that is, with $j>i$, or if its population converges to all identical solutions.

Differently from Virgolin et al. (2017), we no longer scale the tree height $h$ because in SR, and in supervised learning tasks in general, no optimum is known beforehand, and it is rather desired to find a solution that generalizes well to unseen examples. Moreover, $h$ bounds the maximum solution size, which influences interpretability. Hence, $h$ is left as a parameter for the user to set, and we recommend $h \leq 4$ to increase the chance that solutions will be interpretable (see Section 9).

We base the run-termination criterion upon the fitness of best solutions instead of the mean population fitness as done by Harik and Lobo (1999) and Virgolin et al. (2017), because in SR the error of a few solutions can become so large that it compromises the mean population fitness. This can trigger termination even if solutions exist that are competitive with those of other runs. Also differently from the other versions of the IMS, when terminating a run, we do not automatically terminate all previous runs. Indeed, some runs with smaller parameter settings may still be very competitive (e.g., due to the fortunate sampling of particular constants when using ERCs).

We lastly propose to exploit the fact that many runs are performed within the IMS to tackle a central problem of learning tasks: generalization. Instead of discarding the best solutions of terminating runs, we store them in an archive. When the IMS terminates, we recompute the fitness of each solution in the archive using a set of examples different from the training set, that is, the validation set, and return the new best-performing one, that is, the solution that generalizes best. The final test performance is measured on a third, separate set of examples (the test set).
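The run-termination rule and the archive mechanism can be sketched together as follows; the dictionary fields, and the convention that lower fitness is better, are illustrative assumptions rather than the paper's implementation:

```python
def prune_and_archive(runs, archive):
    """Sketch of IMS run termination with solution archiving.

    A run is terminated if its best fitness is worse than that of any
    run initialized later, or if its population has converged; its best
    solution is archived instead of being discarded. `runs` is ordered
    by start index; lower fitness is assumed to be better.
    """
    survivors = []
    for i, run in enumerate(runs):
        beaten = any(later["best_fitness"] < run["best_fitness"]
                     for later in runs[i + 1:])
        if beaten or run["converged"]:
            archive.append(run["best_solution"])  # kept for validation re-scoring
        else:
            survivors.append(run)
    return survivors
```

When the IMS ends, the archived solutions would be re-evaluated on the validation set, and the one with the lowest validation error returned.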

## 8 Benchmarking GP-GOMEA

We compare GP-GOMEA (using the new LT) with tree-based GP with traditional subtree crossover and subtree mutation (GP-Trad), tree-based GP using the state-of-the-art, semantic-aware operator Random Desired Operator (GP-RDO) (Pawlak et al., 2015), and Decision Tree for Regression (DTR) (Breiman et al., 1984).

We consider RDO because, as mentioned in the introduction, semantic-aware operators have been studied with interest in recent years. Several works either built upon RDO, or used RDO as a baseline for comparison (see, e.g., Chen et al., 2018; Pawlak and Krawiec, 2018; Virgolin et al., 2019). Yet, consistently large solutions were found. It is interesting to assess how RDO fares when rather strict solution size limits are enforced. Because of such limits, we remark that we cannot consider another popular set of semantic-aware operators, that is, the operators used by Geometric Semantic Genetic Programming (GSGP) (Moraglio et al., 2012). These operators work by stacking entire solutions together, necessarily causing extremely large solution growth (even when smart simplifications are attempted; Martins et al., 2018).

We consider DTR because it is regarded as one of the state-of-the-art algorithms to learn interpretable models (Doshi-Velez and Kim, 2017; Guidotti et al., 2018). We remark that DTR ensembles (e.g., Breiman, 2001; Chen and Guestrin, 2016) are typically markedly more accurate than single DTRs, but are considered not interpretable.

### 8.1 Experimental Setup

For the EAs, we use a fixed time limit of 1,000 seconds.^{5} We choose a time-based comparison because GP-GOMEA performs more evaluations per generation than the other GP algorithms (up to $2\ell-2$ evaluations per generation with the LT), and so that the overhead of learning the LT (which does not involve evaluations) is taken into account.

We consider maximum solution sizes $\ell=15,31,63$ (tree nodes), corresponding to $h=3,4,5$ respectively for full $r$-ary trees. The EAs are run with a typical fixed population size $n_{pop}=1000$ and also with the IMS, considering three values for the number of generations in between runs, $g$: 4, 6, and 8. For the fixed population size, if the population of GP-GOMEA converges before the time limit, since there is no mutation, it is randomly restarted. Choices of $g$ between 4 and 8 are standard in the literature (Bouter et al., 2017; Virgolin et al., 2017).
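The correspondence between the height limit $h$ and the node limit $\ell$ for full $r$-ary trees follows from the geometric series $\ell = \sum_{d=0}^{h} r^d$; a one-line sketch (function name is illustrative):

```python
def full_tree_size(r, h):
    """Number of nodes in a full r-ary tree of height h: (r^(h+1) - 1) / (r - 1)."""
    return (r ** (h + 1) - 1) // (r - 1)

# For binary subfunctions (r = 2), the size limits used in this section follow:
# h = 3 -> 15 nodes, h = 4 -> 31 nodes, h = 5 -> 63 nodes.
```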

Our implementation of GP-Trad and GP-RDO mostly follows the one of Pawlak et al.
(2015). The population is initialized
with the *Ramped Half-and-Half* method, with tree height between
2 and $h$.
Selection is performed with tournament of size 7. GP-Trad uses a rate of 0.9 for
subtree crossover, and of 0.1 for subtree mutation. GP-RDO uses the
population-based library of subtrees, a rate of 0.9 for RDO, and of 0.1 for
subtree mutation. Subtree roots to be varied are chosen with the *uniform depth mutation* method, which makes nodes of all
depths equally likely to be selected (Pawlak et al., 2015). Elitism is ensured by cloning the best solution into
the next generation. All EAs are implemented in C^{++}, and the code is
available at: https://goo.gl/15tMV7.

For GP-Trad we consider two versions, to account for different types of solution size limitation. In the first version, called GP-Trad^{$h$}, we force trees to be constrained within a maximum height ($h=3,4$), as done for GP-GOMEA. This way, we can see which algorithm searches better in the same representation space. In the second version, GP-Trad^{$\ell$}, we allow more freedom in tree shape, by only bounding the number of tree nodes. This limit is set to the maximum number of nodes obtainable in a full $r$-ary tree of height $h$ ($\ell=15$ for $h=3$, $\ell=31$ for $h=4$). As indicated by previous literature (Gathercole and Ross, 1996; Langdon and Poli, 1997), and as will be shown later in the results, GP-Trad^{$\ell$} outperforms GP-Trad^{$h$}. We found that the same holds for GP-RDO, and present here only its best configuration, that is, a version where the number of tree nodes is limited as for GP-Trad^{$\ell$}.

We use the Python Scikit-learn implementation of DTR (Pedregosa et al., 2011), with 5-fold cross-validation grid-search over the training set to tune the following hyper-parameters: *splitter* $\in$ {'*best*', '*random*'}; *max_features* $\in$ {$\frac{1}{2}, \frac{3}{4}, 1$}; *max_depth* $\in$ {3, 4, 5, 6} (documentation available at http://goo.gl/hbyFq2). We do not allow larger depth values because, as for GP solutions, excessively large decision trees are uninterpretable. The best generalizing model found by cross-validation is then used on the test set.
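The tuning procedure above can be sketched with Scikit-learn as follows; the synthetic dataset is a stand-in for illustration, not one of the paper's 10 real-world datasets:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Stand-in regression data (the paper uses real-world datasets).
X, y = make_regression(n_samples=200, n_features=8, noise=0.1, random_state=0)

# 5-fold cross-validated grid search over the hyper-parameters listed above.
param_grid = {
    "splitter": ["best", "random"],
    "max_features": [0.5, 0.75, 1.0],  # fractions of the available features
    "max_depth": [3, 4, 5, 6],         # kept small for interpretability
}
search = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)  # best generalizing configuration under CV
```

The model refit on the whole training set with `best_params_` is then the one evaluated on the test set.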

### 8.2 Results: Benchmarking GP-GOMEA

We consider validation and test NMSE. We now show validation rather than training error because the IMS returns the solution that generalizes best on the validation set among the ones found by different runs (similarly for DTR due to cross-validation). Tables 7, 8, and 9 show the results for maximum sizes $\ell=15,31,63$ ($h=3,4,5$), respectively. On each set of results, the Friedman test reveals significant differences among the algorithms. As we are only interested in benchmarking GP-GOMEA, we test whether significant performance differences exist only between GP-GOMEA and the other algorithms (with the Bonferroni-corrected Wilcoxon signed-rank test).

We begin with some general results. Overall, error magnitudes are lower for larger values of $\u2113$. This is not surprising: limiting solution size limits the complexity of relationships that can be modeled. Another general result is that errors on validation and test set are generally close. Likely, the validation data is a sufficiently accurate surrogate of the test data in these datasets, and solution size limitations make over-fitting unlikely. Finally, note that the results for DTR are the same in all tables.

We now compare GP-GOMEA with GP-Trad^{$h$}, focusing on statistical significance tests (see rows “B/W” of the tables), over all size limit configurations. Recall that these two algorithms work with the same type of limitation, that is, based on maximum tree height. No matter the population sizing method, GP-GOMEA is almost always significantly better than GP-Trad^{$h$}. GP-GOMEA relies on the LT with improved linkage learning, which we showed to be superior to using the RT, that is, blind variation, in the previous series of experiments (Sections 5.3, 6.1). Subtree crossover and subtree mutation are blind as well, and can only swap subtrees, which may be a limitation.

GP-GOMEA and GP-Trad^{$\ell$} are compared next. Recall that GP-Trad^{$\ell$} is allowed to evolve any tree shape, as long as the limit on the number of nodes is respected. Having this extra freedom, GP-Trad^{$\ell$} performs better than GP-Trad^{$h$} (not explicitly reported in the tables), which confirms previous literature results (Gathercole and Ross, 1996; Langdon and Poli, 1997). No marked difference exists between GP-GOMEA and GP-Trad^{$\ell$} across the different configurations. Counting the number of times one EA is found to be significantly better than the other over *all* 240 comparisons, GP-GOMEA beats GP-Trad^{$\ell$} by a small margin: 87 significantly lower error distributions vs. 65 (88 draws).

For the traditional use of a single population ($n_{pop}=1000$), GP-Trad^{$\ell$} is slightly better than GP-GOMEA for $\ell=15$ (Table 7), slightly worse for $\ell=31$ (Table 8), and similar for $\ell=63$ (Table 9), on both validation and test errors. The performance of the two (and also of the other EAs) improves when using the IMS. Although not explicitly shown in the tables, using the IMS is typically significantly better than not using it.

With a single fixed population size and a single run, only one best-found solution is obtained. Depending on the configuration of that run, in particular the size of the population, that final solution may be underfitted or overfitted. With a scheme such as the IMS, multiple solutions are marked best in the different interleaved runs. These solutions can subsequently be compared in terms of generalization merits, that is, by observing the associated performance on the validation set, and the best-performing solution can ultimately be returned. Essentially, this provides a means to mitigate underfitting and overfitting to some extent. It should be noted, however, that the extent to which the setup of the IMS, particularly its growing population sizes, contributes to this is not immediately clear. This could be studied by comparing against a scheme in which multiple runs are also performed, but all with a single population size, whose final results are likewise first tested against the validation set. Likely, a scheme like the IMS has an advantage because multiple population sizes are tried, producing a larger variety of results to test against the validation set; a closer examination of this impact is left for future work.

The comparisons between GP-Trad^{$\ell$} and GP-GOMEA tend to shift in favor of the latter when using the IMS, particularly for larger values of $g$. For $g=4$, outcomes are still overall mixed along different $\ell$ limits. For $g=8$, GP-GOMEA is preferable, with moderately more significant wins for $\ell=15$, several more wins for $\ell=31$, and slightly more wins for $\ell=63$.

GP-Trad^{$\ell$} performs well for small values of $g$ due to huge populations being instantiated with trees of various shapes, that is, expensive random search. Note that this behavior may be problematic when limited memory is available, especially if caching mechanisms are desirable to reduce the number of expensive evaluations (e.g., caching the output of each node as in Pawlak et al., 2015 and Virgolin et al., 2017). On the other hand, GP-GOMEA works fairly well with much smaller populations, as long as they are big enough to enable effective linkage learning (the fixed $n_{pop}=1000$ is smaller than the population sizes reached with the IMS). Despite the disadvantage of adhering to a specific tree shape, GP-GOMEA is typically preferable to GP-Trad^{$\ell$} for larger values of $g$. Furthermore, Figure 7 shows that GP-GOMEA's population scaling behaves sensibly with regard to $g$: populations do not grow abruptly when $g$ becomes small, nor shrink excessively when $g$ becomes large. This is because in GP-GOMEA, populations ultimately converge to a same solution and are terminated, allowing bigger runs to start. In GP-Trad^{$\ell$} this is unlikely to happen, because of the use of mutation and stochastic (tournament) selection, which stalls the IMS. For the larger $g=8$, GP-GOMEA reaches on average 1.6 times bigger populations than GP-Trad^{$\ell$}.

GP-RDO, although allowed to evolve trees of different shape like GP-Trad^{$\ell$}, performs poorly on all problems, with all settings. It performs significantly worse than GP-GOMEA almost everywhere (it is also worse than GP-Trad^{$\ell$}). It is known that GP-RDO normally finds big solutions, and it is also reasonable to expect that it needs big solutions to work well, for example, to build a large set of diverse subtrees for the internal library queried by RDO (Virgolin et al., 2019). The strict size limitation essentially breaks GP-RDO. However, we remark that this EA was never designed to work under these circumstances. In fact, when solution size is not strictly limited, GP-RDO achieves excellent performance (Pawlak et al., 2015).

DTR is compared with GP-GOMEA using the IMS with $g=8$. Although GP-GOMEA is not optimized (e.g., by tuning the function set), it performs on par with tuned DTR for $\ell=15$, and better for $\ell=31,63$, on both validation and test sets. Where one algorithm outperforms the other, the magnitudes of the differences in error are relatively large compared to those between EAs. This is because GP and DTR synthesize models of completely different nature (decision trees only use if-then-else statements).

## 9 Discussion and Conclusion

We built upon previous work on model-based GP, in particular on GP-GOMEA, to find accurate solutions when a strict limitation on their size is imposed, in the domain of SR. We focused on small solutions, in particular much smaller solutions than typically reported in literature, to prevent solutions becoming too large to be (easily) interpretable, a key reason to justify the use of GP in many practical applications.

A first limitation of this work is that to truly *achieve* interpretability may well require different measures. Interpretation is mostly
subjective, and many other factors besides solution size are important, including
the intuitiveness of the subfunctions composing the solution, potential
decompositions into understandable repeating sub-modules, the number of features
considered, and the meaning of these features (Lipton, 2018; Doshi-Velez and Kim, 2017). Nonetheless, much current research on GP for SR is far from
delivering any interpretable results precisely because the size of solutions is far
too large (see, e.g., the work of Martins et al., 2018).

We considered solution sizes up to $\ell=63$ (corresponding to $h=5$ for GP-GOMEA with subfunctions of arity $\leq 2$). In our opinion, the limit of $\ell=31$ ($h=4$) is particularly interesting, as interpreting some solutions at this level can already be non-trivial at times. For example, we show the (manually simplified) best test solutions found by GP-GOMEA (IMS, $g=8$) for Tower and Yacht, that is, the biggest and smallest dataset respectively, in Figure 8. The solution for Tower is arguably easier to understand than the one for Yacht. We found solutions with $\ell=63$ ($h=5$) to be overly long to attempt interpreting, and solutions with $\ell=15$ ($h=3$) to be mostly readable and understandable. We report other example solutions at: http://bit.ly/2IrUFyQ.

We believe future work should address the aforementioned limitation: effort should be put towards notions of interpretability that go beyond solution size or other custom metrics (e.g., Vladislavleva et al., 2009). User studies involving the end users of the model (e.g., medical doctors for a diagnosis model) could guide the design of such notions. If an objective that represents interpretability can be defined, the design of multiobjective (model-based) GP algorithms may lead to very interesting results.

Another limitation of this work lies in the fact that we did not study in depth how linkage learning behaves in GP for SR. In fact, it would be interesting to assess when linkage learning is beneficial, and when it is superfluous or harmful. To this end, a regime of experiments where linkage-related outcomes are predefined, such as the emergence of specific patterns, needs to be designed. Simple problems where the true function to regress is known may need to be considered. Studies of this kind could provide more insights on how to improve linkage learning in GP for SR (and other learning tasks), and are an interesting direction for future work.

Another crucial point to base future research upon is enabling linkage learning and linkage-based mixing in GP with trees of arbitrary shape. In fact, GP-GOMEA was not found to be markedly better than GP-Trad^{$\ell$}, and a large performance gap was found between GP-Trad^{$\ell$} and GP-Trad^{$h$}. This indicates that there is added value in performing evolution directly on non-templated trees, which, from this perspective, may be considered a limitation of GP-GOMEA. Going beyond the use of a fixed tree template, while still enabling linkage identification and exploitation, is a challenging open problem that could bring very rewarding results. On the other hand, we believe it is interesting to see that when GP-GOMEA and GP-Trad are set to work on the same search space, that is, when GP-Trad^{$h$} is used, GP-GOMEA performs markedly better.

In summary and conclusion, we have identified limits and presented ways to improve a
key component of a state-of-the-art model-based EA, that is, *GP-GOMEA*, to competently deal with realistic SR datasets, when
small solutions are desired. This key component is linkage learning. We showed that
solely and directly relying on mutual information to identify linkage may be
undesirable, because the genotype is not uniformly distributed in GP populations,
and we provided an approximate biasing method to tackle this problem. We furthermore
explored how to incorporate ERCs into linkage learning, and found that online
binning of constants is an efficient and effective strategy. Lastly, we introduced a
new form of the IMS, to relieve practitioners from setting a population size, and
from finding a good generalizing solution. Ultimately, our contributions proved
successful in improving the performance of GP-GOMEA, leading to the best overall
performance against competing EAs, as well as tuned decision trees. We believe our
findings set an important first step for the design of better model-based GP
algorithms capable of learning interpretable solutions in real-world data.

## Notes

^{1}

More symbols can be possible than the number of instructions in case ERCs are used, since instantiating an ERC in a solution results in a constant being randomly sampled.

^{4}

Including the use of symbolic simplification with https://andrewclausen.net/computing/deriv.html.

^{5}

Experiments were run on an Intel^{®} Xeon^{®} Processor E5-2650 v2.

## Acknowledgments

The authors thank the Foundation Kinderen Kankervrij for financial support (project no. 187), and SURFsara for granting access to the Lisa Compute Cluster.

## References

*International Conference on Parallel Problem Solving from Nature*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Classification and regression trees*

*IEEE Congress on Evolutionary Computation*

*IEEE Transactions on Evolutionary Computation*

*Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*

*IlliGAL Report*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Journal of Machine Learning Research*

*Complexity*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Information Processing Letters*

*ACM Computing Surveys (CSUR)*

*Evolutionary Computation*

*Genetic and Evolutionary Computation Conference (GECCO)*

*IEEE Transactions on Evolutionary Computation*

*Swarm and Evolutionary Computation*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Genetic and Evolutionary Computation Conference (GECCO) 2015*

*IEEE Congress on Evolutionary Computation*

*European Conference on Genetic Programming*

*Genetic Programming and Evolvable Machines*

*Genetic programming: On the programming of computers by means of natural selection*

*Behavioral program synthesis with genetic programming*

*Genetic Programming*

*IEEE Congress on Evolutionary Computation*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Queue*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Genetic and Evolutionary Computation Conference (GECCO)*

*International Conference on Parallel Problem Solving from Nature*

*Genetic Programming and Evolvable Machines*

*International Conference on Parallel Problem Solving from Nature*

*IEEE Transactions on Evolutionary Computation*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Evolutionary Computation*

*Transactions on Evolutionary Computation*

*Journal of Machine Learning Research*

*A field guide to genetic programming*

*International Conference on Artificial Evolution*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Evolutionary Computation*

*Genetic Programming Theory and Practice*

*IEEE Congress on Evolutionary Computation*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Genetic Programming and Evolvable Machines*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Genetic and Evolutionary Computation Conference (GECCO)*

*IEEE Transactions on Evolutionary Computation*

*IEEE Congress on Evolutionary Computation*

*IEEE Congress on Evolutionary Computation*

*IEEE Transactions on Systems, Man, and Cybernetics: Systems*