## Abstract

Program semantics is a promising recent research thread in Genetic Programming (GP). Over a dozen semantic-aware search, selection, and initialization operators for GP have been proposed to date. Some of these operators are designed to exploit the geometric properties of semantic space, while others focus on making offspring *effective*, that is, semantically different from their parents. Only a small fraction of previous works aimed at addressing both of these features simultaneously. In this article, we propose a suite of *competent operators* that combine effectiveness with geometry for population initialization, mate selection, mutation, and crossover. We present a theoretical rationale behind these operators and compare them experimentally to operators known from the literature on symbolic regression and Boolean function synthesis benchmarks. We analyze each operator in isolation as well as verify how they fare together in an evolutionary run, concluding that the competent operators are superior on a wide range of performance indicators, including best-of-run fitness, test-set fitness, and program size.

## 1 Introduction

Recent research in Semantic Genetic Programming (SGP) (Vanneschi et al., 2014) pushed the envelope of EC-based program synthesis by shifting the focus from program syntax to program behavior. An important thread in this area is Geometric Semantic GP (GSGP) (Moraglio et al., 2012), where program semantics is interpreted as a point representing its output on the training set in a multidimensional space, dubbed *semantic space*.

Research in GSGP focuses on designing search operators that have a well-defined impact on the geometric relationships between programs in the semantic space (see Section 2). However, practice shows that they often produce offspring solutions that, though formally geometric, fail to be useful. For instance, an offspring program that is semantically equivalent to its parent does not advance search and is a waste of computing resources. The empirical evidence we presented in a previous study (Pawlak et al., 2015a) shows that this happens frequently. In other words, conventional GSGP *constrains* the distribution of offspring semantics, but is in principle not concerned with *shaping* the resulting distribution. In this article, we scrutinize that distribution with respect to *effectiveness*, meant as the prospective likelihood of search progress. To control and improve effectiveness, we propose a suite of algorithmic components that, apart from ensuring geometric properties (in an exact or approximate sense), are also *effective*, that is, produce candidate programs that are likely to advance search. We achieve this goal by two means: by using improved variants of effective search operators (mutation and crossover) proposed in our earlier works (Pawlak et al., 2015b; Krawiec and Pawlak, 2013a; Wieloch and Krawiec, 2013), and by augmenting the conventional GSGP workflow with semantic-aware and geometric components: effective geometric initialization and effective geometric mate selection (Sect. 4, Pawlak, 2015). These contributions together form a coherent approach dubbed *competent* GSGP.

We start with a brief formalization in Section 2, and in Section 3 review the related work. In the experimental part, we analyze each of the proposed components in isolation (most of Sect. 5), and then assess the joint performance of the competent suite (Sect. 5.6) and discuss the results and prospects (Sects. 6 and 7).

## 2 Semantics in GP

For the sake of this study, we assume that programs are functions, that is, they feature no side effects, nor persistent state between applications to input data. Let $I$ denote the set of all program inputs, and $O$ the set of all program outputs. A program $p$ thus has the signature $p: I \to O$, and we write $p(in) = out$ to state that $p$ applied to an input $in \in I$ returns an output $out \in O$. The set of all programs is denoted by $P$.

Given an $n$-tuple of inputs $I' = (in_1, \ldots, in_n)$, a *semantics* is a tuple of $n$ corresponding outputs, and $S = O^n$ is the set of all semantics. Note that a semantics is *any* combination of elements of $O$, including combinations that cannot be generated by any program in $P$. *The semantics of a program* $s(p)$ is the tuple of outputs it produces for the elements of $I'$, i.e., $s(p) = (p(in_1), \ldots, p(in_n))$. By definition $s(p) \in S$. Though $s(p)$ depends on the inputs in $I'$, we write $s(p)$ for brevity.

A *program induction problem* is defined by a program space $P$, a tuple of inputs $I'$, a *target* semantics $t \in S$, and a minimized fitness function $f: P \to \mathbb{R}_{\ge 0}$. A pair of corresponding elements from $I'$ and $t$, that is, $(in_i, t_i)$, forms a *fitness case*. We assume that $f(p) = d(s(p), t)$ for the target $t$ and some metric $d$. A program $p^*$ such that $s(p^*) = t$ is an *optimal solution* to such a program induction problem, and minimizes $f$, that is, $f(p^*) = 0$.
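To make the notation concrete, a minimal sketch of semantics and fitness, under the assumptions of this sketch that programs are plain Python callables and that $d$ is the $L_2$ metric:

```python
import math

def semantics(program, inputs):
    """The semantics s(p): the tuple of outputs p produces for the inputs in I'."""
    return tuple(program(x) for x in inputs)

def fitness(program, inputs, target):
    """Fitness f(p) = d(s(p), t) under the L2 metric; 0 iff p is an optimal solution."""
    s = semantics(program, inputs)
    return math.sqrt(sum((si - ti) ** 2 for si, ti in zip(s, target)))
```

For inputs `(0, 1, 2)` and target `(0, 1, 4)`, the program `lambda x: x * x` has fitness 0 and is thus an optimal solution.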

In the following subsections, we define properties for each operator involved in GP search: population initialization, mate selection, and search operators.

### 2.1 Geometric Search Operators and Convex Hulls

Moraglio et al. (2012) formalized geometric mutation and crossover operators for semantic GP. We adapt those definitions to the notation used in this article.

A mutation operator is a (random) function $M: P \to P$. A mutation operator is *$r$-geometric* iff the semantics of the produced offspring $p' = M(p)$ belongs to a ball of radius $r$ centered at the parent's semantics $s(p)$; that is, $d(s(p'), s(p)) \le r$.

Moraglio (2011) showed that the convex hull of the offspring population is a (proper or not) subset of the convex hull of the parent population. Starting from that observation, we extend the conventional two-parent geometric crossover to an arbitrary number of parents. Let $C(S')$ denote the convex hull of a set $S' \subseteq S$, that is, the smallest convex set that includes $S'$. For brevity, we abuse the notation by writing $C(P')$ to denote the convex hull in $S$ spanned by the semantics of the programs from $P' \subseteq P$, and refer to it as a convex hull of *programs*.

A $k$-ary crossover ($k \ge 2$) is a (random) function $X: P^k \to P$. A $k$-ary crossover is *geometric* iff the semantics of the offspring $p' = X(p_1, \ldots, p_k)$ belongs to the convex hull of the semantics of its parents; that is, $s(p') \in C(\{p_1, \ldots, p_k\})$.

For $k = 2$, the convex hull is equivalent to a segment, and Definition 2 becomes identical to that presented by Moraglio et al. (2012). Convex hulls embrace different sets of semantics under different metrics. For more details and rationale see Pawlak and Krawiec (2016a) and Pawlak (2015).
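For $k = 2$, the geometric property can be verified directly: under a metric $d$, the segment between two semantics is the set of points for which the triangle inequality holds with equality. A small check, assuming tuple-valued semantics and the $L_2$ metric:

```python
def l2(a, b):
    """L2 distance between two semantics."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def is_geometric_offspring(s1, s2, so, dist=l2, tol=1e-9):
    """True iff offspring semantics `so` lies in the metric segment between the
    parents' semantics s1 and s2, i.e. dist(s1, so) + dist(so, s2) = dist(s1, s2)."""
    return abs(dist(s1, so) + dist(so, s2) - dist(s1, s2)) <= tol
```

For example, `(1, 1)` lies on the $L_2$ segment between `(0, 0)` and `(2, 2)`, while `(2, 0)` does not.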

Following the proof by Moraglio (2011), the relevance of convex hulls for geometric crossover is illustrated for two fitness cases in Figure 1. The black points mark the semantics of seven programs. An application of a $k$-ary geometric crossover to any $k$ programs from that set (left inset) can produce offspring in the convex hull of that set only (the shaded area). If geometric crossover is the only search operator employed, all resulting offspring lie in this convex hull (central inset). With each generation, the convex hull shrinks (or at most remains unchanged). In the limit, the convex hull collapses to a single point (semantics).

What follows from this example is that a necessary condition for a geometric crossover to produce an offspring with semantics $t$ is that $t$ must be included in the convex hull of the population. We formalize this observation as follows.

A geometric $k$-ary crossover applied to a set of programs $P'$ may produce a program $p'$ such that $s(p') = t$ iff $t \in C(P')$.

The proof follows from Definition 2; that is, a geometric $k$-ary crossover must produce only offspring in the convex hull of the parents, hence if $t$ lies outside this convex hull, the operator is unable to reach $t$.

As already noted by Moraglio et al. (2012), geometric semantic mutation does not require a similar assumption; that is, it ultimately converges to the target regardless of whether $t \in C(P')$ holds. However, that convergence may require many iterations, which depends, among other factors, on the radius $r$ of the ball in Definition 1. Geometric crossover can make long "leaps" in $S$, reducing the number of required iterations.

### 2.2 Target-Geometric Initialization and Mate Selection

Whether the target $t$ remains within the convex hull of the population depends on (i) the initial population and (ii) the choice of the parents to mate when producing consecutive populations. We define the desirable properties of these components below. The first property originated in our previous work (Pawlak and Krawiec, 2016b).

An initialization operator is a (random) function that produces a multiset $P_0$ of elements from $P$. An initialization operator is *$t$-geometric* iff the convex hull of the programs in $P_0$ includes $t$; that is, $t \in C(P_0)$.

A $k$-mate selection operator is a (random) function $Sel: 2^P \to P^k$ that selects $k$ programs from a population. A $k$-mate selection operator is *$t$-geometric* iff the convex hull of the selected programs $(p_1, \ldots, p_k)$ includes $t$; that is, $t \in C(\{p_1, \ldots, p_k\})$.

The practical upshot of this is that GP equipped with a target-geometric initialization, a target-geometric mate selection, and geometric search operators is likely to maintain the target in the population's convex hull in consecutive iterations^{1} (assuming it was present in the initial population's convex hull). The usefulness of these components depends on their effectiveness, which we cover in the next subsection, and on the technical possibility of constructing geometric and effective operators for a given semantic space and metric, which we investigate in Section 4.

### 2.3 Effective Operators

We consider an operator *effective* if it avoids producing programs that are semantically equal to already considered programs. The meaning of the phrase “already considered programs” depends on the context, so we define effectiveness separately for initialization, selection, and search operators.

An initialization operator is *effective* iff each initial population $P_0$ it produces does not contain semantically equal programs; that is, $\forall_{p_1, p_2 \in P_0}\; p_1 \neq p_2 \Rightarrow s(p_1) \neq s(p_2)$.

A $k$-mate selection operator is *effective* iff, for any population it is applied to, no two programs in the selected tuple $(p_1, \ldots, p_k)$ have the same semantics; that is, $\forall_{i \neq j}\; s(p_i) \neq s(p_j)$.

A $k$-ary search operator $O: P^k \to P$ is *effective* iff, for any parents $p_1, \ldots, p_k \in P$, each offspring $p' = O(p_1, \ldots, p_k)$ is semantically distinct from all its parents; that is, $\forall_i\; s(p') \neq s(p_i)$.

Effective operators lower the risk of revisiting the same solutions and are beneficial for the diversification of the population. We expect them to increase the odds of solving program induction problems, and verify this hypothesis in Section 5.
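The rejection-loop idea behind effective search operators (used, e.g., by Sdx and Sdm reviewed in Section 3) can be sketched as a generic wrapper; `operator` and `semantics` are placeholders for any $k$-ary search operator and semantics function:

```python
def make_effective(operator, semantics, max_retries=100):
    """Wrap a k-ary search operator so that it is effective in the sense of
    Definition 8: retry until the offspring is semantically distinct from
    every parent, or give up after max_retries attempts."""
    def effective_operator(*parents):
        parent_semantics = {semantics(p) for p in parents}
        for _ in range(max_retries):
            offspring = operator(*parents)
            if semantics(offspring) not in parent_semantics:
                return offspring
        return None  # no effective offspring found within the budget
    return effective_operator
```

The retry budget matters in practice: an operator that can only reproduce its parents would otherwise loop forever.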

## 3 Previous Work

This article involves initialization, mate selection, and search operators, so we structure the review around these categories.

### 3.1 Semantic-Aware Search Operators

Design of **geometric** search operators is arguably the main research area within GSGP. The key reference methods here are Semantic Geometric Crossover (Sgx) and Mutation (Sgm) proposed by Moraglio et al. (2012). Sgx works by linearly combining parent programs. Sgm builds an offspring from a parent program and a carefully designed random expression, so that in effect the semantics of the offspring is only slightly disturbed (up to a controlled degree) with respect to the parent's semantics (see Pawlak, 2016 for a proof).

Sgm was thoroughly studied by Moraglio et al. (2013) and Moraglio and Mambrini (2013), in the Boolean and symbolic regression domains, respectively. Pawlak and Krawiec (2016a) analyzed bounds on fitness change in a single application of Sgx and Sgm under different metrics and fitness functions. Moraglio et al. (2014) adapted Sgx to grammatical evolution. Sgx and Sgm are geometric, but not effective in the sense of Definition 8.

To our knowledge, Krawiec and Lichocki (2009) proposed the first *approximately* geometric crossover for GP. This operator involves brood selection: a set of offspring candidates is produced, and the one that minimizes a custom measure of divergence from geometry is returned. This operator is characterized by a high probability of producing offspring semantically equal to one of their parents (see Pawlak et al., 2015a). This probability was reduced to zero by Pawlak (2014), making this operator effective with regards to Definition 8.

Krawiec and Pawlak (2013b) proposed Locally Geometric Crossover (Lgx) that approximates geometric crossover by replacing parent subprograms using a library of known programs. Lgx does not guarantee that an offspring is semantically different from its parents; thus, it is not effective.

Pawlak et al. (2015b), Krawiec and Pawlak (2013a), and Wieloch and Krawiec (2013) developed approximately geometric, but not effective, mutation (Rdo) and crossover (Agx) operators that are based on the idea of inverse execution of parent program(s). The competent mutation and crossover proposed in Sections 4.3 and 4.4 are, respectively, variants of Rdo and Agx extended for effectiveness.

Nguyen et al. (2016) proposed Subtree Semantic Geometric Crossover (Ssgx) that, given parent programs, picks from each of them one subprogram that is semantically most similar to the parent program, and linearly combines them in the same manner as Sgx (Moraglio et al., 2012). Ssgx produces programs a few orders of magnitude smaller than Sgx, while maintaining similar fitness-based characteristics. Ssgx is approximately geometric, but not effective.

Concerning semantic-aware operators that aim at being *effective*, Beadle and Johnson (2008, 2009b) proposed, respectively, Semantically Driven Crossover (Sdx) and Mutation (Sdm) for the Boolean and artificial ant domains. Sdx and Sdm repeat in a loop, respectively, the conventional Tree Crossover (Tx) and Tree Mutation (Tm) (Koza, 1992), until the produced offspring candidate is semantically different from its parents. Sdx and Sdm are thus effective with regards to Definition 8.

Jackson (2010b) also experimented with an effective crossover based on Tx (Koza, 1992), which admits an offspring only if it is semantically different from both its parents.

The series of publications by the same authors (Nguyen et al., 2009; Uy et al., 2011, 2013) proposes three crossover and one mutation operator. The common feature of all these operators is that they breed offspring by replacing subprograms in parents with semantically different subprograms. These operators are not effective in the sense of Definition 8, because other parts of the offspring program may still cause it to be semantically equal to one of its parents.

### 3.2 Semantic-Aware Population Initialization

Most works on initialization in SGP focus on providing semantic diversity in the initial population. In one of the earliest attempts of that type, Looks (2007) proposed Semantic Sampling (Ss)—a population initialization heuristic that defines bins for programs of particular sizes and fills them up to assumed capacity by semantically distinct programs.

Semantically Driven Initialization (Sdi), presented by Beadle and Johnson (2009a) for the Boolean and artificial ant domains, is effective in the sense of Definition 6. Sdi starts by seeding a population with single-node programs. Then, it iteratively picks a random instruction and combines it with programs drawn from the population. The resulting program is added to the population if no program already in the population has equal semantics.

Jackson (2010a) proposed Behavioral Initialization (Bi) that acts as a wrapper on the Ramped Half and Half method (Rhh; Koza, 1992). Bi invokes Rhh in a loop and admits an initialized program to the population if it is semantically different from all programs already present in the population. Thus, Bi is effective, but not geometric.

The recent Semantic Geometric Initialization (Sgi) (Pawlak and Krawiec, 2016b) is geometric and effective in the sense of Definitions 4 and 6. Sgi surrounds the target with a set of semantics that form a convex hull including the target. Then, it applies domain-specific exact algorithms that explicitly construct programs for the abovementioned semantics. Sgi turns out to significantly improve the performance of GSGP.

### 3.3 Semantic-Aware Selection Operators

The works on semantic-aware selection, including mate selection, are few and far between. Galván-López et al. (2013) proposed Semantic Tournament Selection (Sts), a two-mate selection operator. Sts selects the first parent using Tournament Selection (Ts; Miller and Goldberg, 1995). The second parent is the best one from a set of candidates drawn from the population, where the candidates that are semantically equal to the first parent are discarded. If all candidates in this set are worse than the first parent, Sts by definition allows the same program to be selected twice and act as both parents. Hence, Sts is not effective with regards to Definition 7.

For a comprehensive survey of semantic methods in GP, the reader is referred to Vanneschi et al. (2014) and Pawlak et al. (2015a).

Of all the abovementioned initialization operators, only the one by Pawlak and Krawiec (2016b) is $t$-geometric. However, programs initialized by that operator by design memorize the correct outputs for particular inputs (by storing them explicitly in the program), and as such are unable to generalize well to unseen inputs. In contrast, the competent initialization (Ci) proposed in this article searches for semantically unique programs that expand the population's convex hull while not explicitly storing the correct outputs in program code. As will become clear in the following section, the competent tournament selection (Cts) is also significantly different from past work: it takes into account the relative locations of the target and the mates' semantics, which none of the former semantic-aware selection methods did. The semantic-aware mutation and crossover operators that complement our competent framework are effective counterparts of their ancestors presented in Pawlak et al. (2015b), Krawiec and Pawlak (2013a), and Wieloch and Krawiec (2013).

## 4 Proposed Competent Operators

The definitions presented in the previous section determine only the desirable properties of GSGP operators, but do not propose concrete algorithms. In this section we present a suite of *competent operators* for GP, i.e., algorithms that are effective (Defs. 6–8) and *designed to approximate* the geometric characteristics (Defs. 1–5). The latter characteristics are achieved only approximately or in the limit by some of the proposed components, for the reasons that we detail in the following. This, however, does not prevent them from performing significantly better than the original GSGP, as we demonstrate in Section 5.

### 4.1 Competent Initialization (Ci)

Algorithm 1 presents *Competent Initialization* (Ci) for tree-based programs. Like the conventional GP search operators, Ci constructs new programs from the programs already present in the population.

An implementation of Algorithm 1 requires additional safeguards that warrant termination of the loop in lines 3–8, because Ci may be unable to produce a program that expands the convex hull of the population, for instance when the available instructions are not sufficient to express certain semantics, or when the convex hull already incorporates the entire semantic space; that is, $C(P_0) = S$. Thus, if Ci cannot fulfill the condition in line 7 in an assumed number of attempts, it supplements the population with a random program and starts the loop over. Moreover, Ci discards candidate programs that return the same value for all fitness cases, because in most problems such programs are of little use.

Ci is effective in the sense of Definition 6, because it explicitly forbids adding to the population a program that is semantically equal to any program already in it. Ci does not require knowledge of the target $t$ (note the absence of $t$ among Ci's arguments in Algorithm 1), and can thus be applied to problems with an unknown $t$. By the same token, Ci is not target-geometric (Def. 4): the resulting convex hull is in general not guaranteed to include $t$. Achieving such a guarantee would require a domain-specific algorithm, which we did not want to involve in this study. Also, even a target-geometric initialization operator would fail to produce a convex hull that encloses $t$ if the population size was small compared to the dimensionality $n$ of the semantic space; for instance, it may be hard to find two programs whose convex hull (a segment) encloses $t$ even in a two-dimensional semantic space. Nevertheless, we show in Section 5.2 that in practice Ci is likely to produce an initial population whose convex hull includes $t$.
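A domain-agnostic sketch of the Ci loop and its safeguard, under two assumptions of this sketch: `random_program` and `semantics` are placeholders for the domain at hand, and convex-hull expansion is approximated by a coordinate-wise bounding-box test (a cheap necessary condition, not the exact test of Algorithm 1):

```python
import random

def expands_bounding_box(s, seen):
    """Proxy for hull expansion (assumption of this sketch): s falls outside
    the coordinate-wise bounding box of the semantics collected so far."""
    return any(si < min(q[i] for q in seen) or si > max(q[i] for q in seen)
               for i, si in enumerate(s))

def competent_init(pop_size, random_program, semantics, max_attempts=100):
    """Admit only semantically unique, non-constant programs that (apparently)
    expand the hull; fall back to any unique random program when stuck."""
    population, seen = [], set()
    while len(population) < pop_size:
        admitted = False
        for _ in range(max_attempts):
            p = random_program()
            s = semantics(p)
            if s in seen or len(set(s)) == 1:  # duplicate or constant program
                continue
            if not seen or expands_bounding_box(s, seen):
                population.append(p)
                seen.add(s)
                admitted = True
                break
        if not admitted:  # safeguard: supplement with any semantically unique program
            p = random_program()
            s = semantics(p)
            if s not in seen:
                population.append(p)
                seen.add(s)
    return population
```

Here programs are represented by their semantics directly (`semantics` is the identity) purely for illustration.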

### 4.2 Competent Tournament Selection (Cts)

The intent behind the $t$-geometric $k$-mate selection (Def. 5) is to select from a population $P'$ parent programs that form a convex hull that includes the target $t$. This makes it possible for the subsequent $k$-ary geometric crossover to generate an offspring in the proximity of $t$. Such programs are guaranteed to exist in $P'$ if $t \in C(P')$ and $k \ge n + 1$, where $n$ is the dimensionality of the semantic space (Carathéodory, 1911). It would thus be tempting to consider such $k$-ary selection operators, *if only* that inclusion could be guaranteed. However, our competent framework, though strongly motivated by geometric relationships between programs, does not guarantee that the target is within the convex hull of the population throughout the run. For instance, already Ci, which supplies the population with initial programs, does not provide such a guarantee. In this context, there is limited motivation for designing $k$-ary selection operators for $k > 2$, and the *Competent Tournament Selection* (Cts) that we propose below is designed to approximate the $t$-geometric 2-mate selection.

Cts (Algorithm 2) selects the first parent $p_1$ using tournament selection (Miller and Goldberg, 1995). Then, it draws uniformly without replacement a set of candidate programs and selects the candidate $m$ that minimizes the expression in line 5, which is a heuristic measure of desirability of a candidate mate for $p_1$ with respect to three aspects:

1. The distance between $s(m)$ and the target $t$, to be minimized in order to prefer better programs,

2. The distance between $s(m)$ and $s(p_1)$, to be maximized in order to penalize candidates that are semantically too similar to $p_1$ (and so prefer distinct programs) and to make effective selection more likely (cf. Def. 7),

3. The difference between the distances of $s(m)$ and $s(p_1)$ from the target $t$, to be minimized to promote candidates that are close to equidistant with $p_1$.

Terms (2) and (3) make it likely for the center of the segment connecting $s(p_1)$ and $s(m)$ to be close to $t$. Figure 2 presents visualizations of the expression for the $L_1$ and $L_2$ metrics in a two-dimensional semantic space. The minimal value of 0 is attained for candidates having the target semantics, that is, $s(m) = t$, as mandated by the positive bias toward good candidates (term 1). For points close to $s(p_1)$, the expression approaches infinity, so such candidates are discouraged (term 2). Crucially, for candidates located "on the opposite side" of $t$ and at the same distance to $t$ as $p_1$, that is, $d(s(m), t) = d(s(p_1), t)$, the expression attains a relatively low value (term 3).

Cts is effective (Def. 7) because the drawing procedure in line 2 prevents repeated selection of the same candidate (provided the population contains sufficiently many semantically distinct programs). Cts is not $t$-geometric, since it does not guarantee selecting a mate $m$ such that $t$ belongs to the convex hull $C(\{p_1, m\})$. Note that for $k = 2$ that convex hull is a segment connecting $s(p_1)$ and $s(m)$, so picking a pair of programs with the above property from a finite population is particularly unlikely. Nevertheless, as we demonstrate later in Section 5.3, simultaneous optimization of the three above terms allows Cts to select parents that are likely to result in geometric and well-performing offspring.

Contrary to what Algorithm 2 may suggest, Cts does not require access to the target $t$. Note that the terms that appear in the expression (line 5) are actually *fitness values*: $d(s(m), t)$ is nothing else than the fitness of $m$, while $d(s(p_1), t)$ is the fitness of $p_1$ (assuming the fitness function uses the metric $d$). Therefore, even though Cts attempts to keep $t$ within the convex hull of the parents (or otherwise expand it toward $t$), explicit knowledge of its location is not essential, as the necessary information is conveyed by the fitness values. Thus, similarly to Ci, Cts can be applied to problems with an unknown $t$.
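The mate-choice step can be sketched as follows. The desirability expression used here is an illustrative stand-in (an assumption of this sketch, not the exact expression of Algorithm 2, which is not reproduced above): it is zero when $s(m) = t$ (term 1), grows without bound as $s(m)$ approaches $s(p_1)$ (term 2), and stays low for candidates roughly equidistant from $t$ (term 3):

```python
import math
import random

def l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cts_mate(population, semantics, target, first_parent, n_candidates=7,
             rng=random):
    """Pick the mate for `first_parent` that minimizes an illustrative
    desirability expression combining the three terms discussed above."""
    sp1 = semantics(first_parent)

    def desirability(m):
        sm = semantics(m)
        d_mt, d_p1t, d_mp1 = l2(sm, target), l2(sp1, target), l2(sm, sp1)
        if d_mp1 == 0:
            return math.inf  # semantic duplicate of the first parent
        return d_mt * (1.0 + abs(d_mt - d_p1t)) / d_mp1

    candidates = rng.sample(population, min(n_candidates, len(population)))
    return min(candidates, key=desirability)
```

As in Ci, programs are represented by their semantics here purely for illustration.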

### 4.3 Competent Mutation (Cm)

*Competent Mutation* (Cm) relies on the concept of *semantic backpropagation* (SB) we introduced in the Random Desired Operator (Rdo) (Pawlak et al., 2015b; Wieloch and Krawiec, 2013; Wieloch, 2013). The SB algorithm, given a semantics $t$, a program $p$, and a subprogram $a$ in $p$ (i.e., its location in $p$), computes the *combinatorial semantics* $D = (D_1, \ldots, D_n)$, a tuple of sets, each corresponding to a single fitness case. $D_i$ stores the values that, when substituted for $a$ in $p$, cause the resulting program to return the $i$th element of the target ($t_i$) when evaluated on the $i$th fitness case. Because program execution on each fitness case is independent, any combination of values from the $D_i$s causes the program to have semantics equal to $t$. In this way, combinatorial semantics captures the (in general, exponentially many) semantics of subprograms that, when substituted for $a$, would make the resulting program attain semantics $t$. Because SB involves domain-specific inversion of instructions, we detail it in Appendix A to maintain smooth narration here.
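A minimal regression-domain illustration of the inversion underlying SB, for two invertible instructions, addition and multiplication (the full treatment, including the multi-valued and inconsistent cases, is in Appendix A; the path encoding below is an assumption of this sketch):

```python
def invert_step(op, desired, sibling):
    """Values the child node must return so that op(child, sibling) = desired.
    Returns a set: empty means no value works, '*' means any value works."""
    if op == '+':
        return {desired - sibling}
    if op == '*':
        if sibling == 0:
            return {'*'} if desired == 0 else set()
        return {desired / sibling}
    raise ValueError('uninvertible instruction: %s' % op)

def backpropagate(path, target_value):
    """Propagate the desired output for one fitness case through a chain of
    (instruction, sibling_value) steps, from the root down to the chosen node."""
    desired = {target_value}
    for op, sibling in path:
        nxt = set()
        for d in desired:
            nxt |= {'*'} if d == '*' else invert_step(op, d, sibling)
        desired = nxt
    return desired
```

For the program `(a + 3) * 2` and a desired output of 10 on a given fitness case, backpropagation yields the desired value set `{2.0}` for the node `a`.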

Given a parent $p$ and a target $t$, Cm (Algorithm 3) starts by drawing a node $a$ from $p$. Then, it calls $SB(t, p, a)$ to obtain the *desired semantics*, that is, the combinatorial semantics of subprograms that would change $p$'s semantics to $t$ when substituted for $a$. Next, Cm calls $SB(s(p), p, a)$, that is, backpropagates through the parent program *its own semantics*. The resulting combinatorial semantics represents the *forbidden semantics*: the subprograms that, when substituted for $a$, do not change the semantics of $p$. Then, a library of subprograms is searched for a subprogram that matches the desired semantics as closely as possible and does not match the forbidden semantics. An efficient algorithm for performing that search is presented in Appendix B. Finally, Cm replaces $a$ in $p$ with the found subprogram to produce an offspring.

By aiming at the desired semantics while simultaneously avoiding the forbidden semantics, Cm attempts to modify the parent so that the offspring's semantics becomes equal to $t$. When that succeeds, Cm solves the program induction problem in a single step; this is, however, infrequent in practice, due to the possible absence of an adequate subprogram in the library.

Because the distance of the parent from the offspring can be arbitrary, Cm is not geometric in the sense of Definition 1.

Because the library search must not return a subprogram that matches the forbidden semantics, the offspring is guaranteed to be semantically different from the parent. Thus, Cm is effective with regards to Definition 8.

Note that, in contrast to all other operators in our competent suite, Cm requires knowledge of the location of the target $t$.

### 4.4 Competent Crossover (Cx)

*Competent Crossover* (Cx) operates analogously to Approximately Geometric Semantic Crossover (Agx) (Pawlak et al., 2015b; Krawiec and Pawlak, 2013a) and is augmented with extensions that make it effective. Like Cm, Cx employs semantic backpropagation, this time, however, to produce an offspring that is geometric with regards to the parent programs: Cx constructs the semantics of the midpoint between the parents' semantics, backpropagates it through one of the parents to obtain the desired semantics, and searches the library for a subprogram that matches the desired semantics while avoiding the forbidden semantics of both parents.

The motivation for using the midpoint as a temporary goal in the above procedure is to provide better exploration. A candidate offspring located near a parent has by definition very similar semantics, and as such usually does not advance the search process much. Promoting locations near the midpoint helps explore new areas of the semantic space.

Note also that Cx is the only operator in the competent suite that explicitly refers to the domain's specifics (in Equation (4)). This is because Cx *constructs* a new semantics (the midpoint), for which a metric alone is not sufficient. The other operators do not require such constructive means, as they depend only on the metric's values for the semantics of *existing* programs.
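For illustration, two plausible domain-specific midpoint constructions (assumptions of this sketch; the article's Equation (4) is not reproduced here):

```python
import random

def midpoint_regression(s1, s2):
    """Midpoint of two semantics under the L2 metric (symbolic regression):
    the coordinate-wise average."""
    return tuple((a + b) / 2 for a, b in zip(s1, s2))

def midpoint_boolean(s1, s2, rng=random):
    """A semantics 'between' two Boolean semantics: keep the coordinates on
    which the parents agree, toss a coin on the remaining ones."""
    return tuple(a if a == b else rng.choice((False, True))
                 for a, b in zip(s1, s2))
```

The regression variant lies exactly halfway along the $L_2$ segment; the Boolean variant lies in the Hamming-metric segment between the parents.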

If the library contains a subprogram that matches the desired semantics, Cx is guaranteed to produce a geometric offspring and is thus geometric with regards to Definition 2. However, that is not guaranteed for the finite libraries used in practice, so in the absence of a match, the semantically closest subprogram from the library is returned and planted into the parent program. In such cases Cx can be considered approximately geometric, as it produces a good approximation of the midpoint between the parents' semantics, given the programs available in the library.

Cx is effective, because the library search cannot return a subprogram whose semantics matches the forbidden semantics of either parent, that is, one that would result in an offspring semantically equal to either parent.

## 5 Experiments

### 5.1 Setup

We compare the operators from the suite presented in Section 4 (Competent Initialization (Ci), Competent Tournament Selection (Cts), Competent Mutation (Cm), and Competent Crossover (Cx)) to the reference operators in Table 1, that is, canonical GP operators and operators that are known to be semantic-aware but not simultaneously effective and geometric. In the first four experiments, we characterize individual operators in isolation from other operators and outside evolutionary runs. The characteristics in question include fitness distribution, effectiveness, geometry, and program size. The fifth experiment (Sect. 5.6) concerns the joint performance of the operators and verifies whether the competent operators lead to better performance than the reference operators.

We use five univariate and four bivariate symbolic regression problems, and nine Boolean program synthesis problems from Table 3. When choosing the problems, we aimed, on the one hand, at using a diversified suite of problems (different domains, varying numbers of input variables, various difficulty) taken from a range of sources (the GPBenchmarks repository and previous works) and, on the other, at keeping their number within a reasonable limit. In univariate regression problems, 20 Chebyshev nodes (Burden and Faires, 2010) in the given range form the training set; the test set comprises 20 points drawn uniformly from the same range. For bivariate problems, 10 values of each variable are selected analogously (Chebyshev nodes for training, uniform drawing for testing), and their Cartesian products form the respective data sets. In the Boolean domain, the training set incorporates all possible inputs and there is no test set.
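The training points for univariate problems can be generated as follows (a sketch; the interval bounds depend on the particular benchmark):

```python
import math

def chebyshev_nodes(n, a, b):
    """n Chebyshev nodes in [a, b]:
    x_i = (a + b)/2 + (b - a)/2 * cos((2i - 1) * pi / (2n)), i = 1..n.
    The nodes cluster near the interval ends, which keeps interpolation
    error low compared to equispaced points."""
    return [(a + b) / 2 + (b - a) / 2 * math.cos((2 * i - 1) * math.pi / (2 * n))
            for i in range(1, n + 1)]
```

For example, `chebyshev_nodes(20, -1.0, 1.0)` yields the 20 training inputs for a univariate problem defined on $[-1, 1]$.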

Evolutionary runs are set up according to Table 2. Parameters not presented there are set to ECJ's defaults (Luke, 2010). Methods' parameters are not fine-tuned to the benchmarks, because in this experiment we aim at side-by-side comparison rather than maximization of performance. The only exception is the mutation step of Sgm in the symbolic regression domain, because we found that the default value (1.0) led to very poor performance. In symbolic regression, two floating-point numbers in semantics are considered equal if they differ by less than the machine epsilon (the difference between 1 and the closest IEEE 754 double-precision number). The experimental software is available online at www.cs.put.poznan.pl/tpawlak/link/?ECJCompetent.
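The semantic-equality test for the regression domain can be written directly in terms of the machine epsilon:

```python
import sys

EPS = sys.float_info.epsilon  # difference between 1.0 and the next representable double

def semantics_equal(s1, s2, eps=EPS):
    """Two regression semantics count as equal iff every pair of
    corresponding outputs differs by less than eps."""
    return all(abs(a - b) < eps for a, b in zip(s1, s2))
```

This tolerance absorbs rounding noise (e.g., `0.1 + 0.2` versus `0.3`) while still distinguishing any larger semantic difference.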

### 5.2 Analysis of Initialization

Our hypothesis is that the effective and geometric characteristics of Ci are superior to those of Rhh and Sdi. To verify this, we run these operators 30 times to produce a population of size 1000 for each problem and analyze the resulting initial populations.

For brevity, in this and the following experiments we present only aggregate results, while the details and statistical assessment can be found in Appendix C. Unless stated otherwise, the characteristic in question is first averaged over a number of runs for a given operator and problem, then the averages obtained by particular operators are ranked on that problem, and finally the ranks are averaged over the entire problem suite. In Table 4 we present the average ranks of the number of semantically unique programs in initial populations, and the ranks of the average number of programs produced until the target is included in the population's convex hull. Detailed results are shown in Tables 17–18 in Appendix C.

The results confirm that both Ci and Sdi produce populations composed entirely of semantically unique programs. For the semantic-unaware Rhh, the fraction of such programs is lower for all problems. This is reflected by the outranking graph in the first row of Table 4. Since Sdi and Ci provide semantic uniqueness by definition, there is no need to statistically assess these differences.

Concerning the number of programs needed to include the target in the convex hull of a population, the lowest rank is achieved by Ci. Table 18 in Appendix C reveals that in 12 out of 18 problems Ci needs the lowest number of programs to attain that goal. In the remaining 6 problems, Rhh achieves the lowest average, and for one problem Rhh and Ci are on par on this characteristic. Friedman’s test (Kanji, 1999) is conclusive, suggesting that some pairwise differences are significant. Thus, we conduct a post-hoc analysis using the symmetry test (Hothorn et al., 2015) and present the resulting outranking graph in the second row of Table 4 (the significant differences are shown as arcs). The graph reveals that Ci requires significantly fewer programs to include the target in the convex hull than Sdi.
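The Friedman statistic underlying this kind of analysis can be computed directly from the per-problem averages; the data below are made up for illustration, with lower values treated as better:

```python
def friedman_statistic(*samples):
    """Friedman chi-square for k related samples (one sample per
    operator, one entry per problem). No tie correction: the sketch
    assumes no ties, as in the made-up data below."""
    k, n = len(samples), len(samples[0])
    rank_sums = [0.0] * k
    for row in zip(*samples):
        # rank operators within one problem, rank 1 = lowest value
        order = sorted(range(k), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    chi2 = (12.0 / (n * k * (k + 1))) * sum(r * r for r in rank_sums)
    return chi2 - 3.0 * n * (k + 1)

# Fabricated per-problem averages for three operators on eight problems:
rhh = [412, 388, 590, 377, 401, 455, 392, 410]
sdi = [455, 430, 612, 398, 433, 470, 415, 441]
ci  = [301, 290, 498, 310, 305, 350, 299, 312]
stat = friedman_statistic(rhh, sdi, ci)
print(f"chi-square = {stat:.2f}")  # compare against chi2 with k-1 = 2 dof
```

A conclusive omnibus result then licenses post-hoc pairwise comparisons (the article uses the symmetry test from Hothorn et al., 2015).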

All operators manage to build the desired convex hull with much smaller numbers of programs than the population size considered here (1000). None of the 540 runs performed here failed to produce such a convex hull. However, the observed differences may become critical for smaller populations. We also hypothesize that in semantic spaces of higher dimensionality (i.e., with more fitness cases), it would be much harder for Rhh and Sdi to produce a convex hull that includes the target.

For additional insight, we collect the histograms of fitness values of the generated programs for a representative sample of problems and present them in Figure 3. The distributions vary heavily across the problems, while having overall similar shapes for individual operators. Rhh produces very rugged distributions with a few peaks that correspond to the most common semantics. Sdi features a lower number of such peaks, and they are typically less prominent. The distributions for Ci are smooth and unimodal for most problems. We hypothesize that the further the produced programs are from the target on average (i.e., the greater the average fitness), the more programs Ci has to produce to include the target in the convex hull. This hypothesis is supported by Pearson’s correlation coefficient of these two factors, which amounts to 0.49, while the t-test for significance of correlation results in a p-value of 0.01.
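The correlation analysis can be sketched as follows; the t statistic follows the standard formula t = r·sqrt((n−2)/(1−r²)) with n−2 degrees of freedom, and the data points below are fabricated for illustration:

```python
import math

def pearson_r_and_t(xs, ys):
    """Pearson's r and the t statistic for H0: rho = 0, with
    t = r * sqrt((n - 2) / (1 - r^2)) on n - 2 degrees of freedom."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return r, t

# Made-up (average fitness, programs-to-convex-hull) pairs:
data = [(2.1, 140), (3.5, 210), (1.2, 90), (4.0, 260), (2.8, 170)]
r, t = pearson_r_and_t([a for a, _ in data], [b for _, b in data])
print(f"r = {r:.3f}, t = {t:.3f}")
```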

### 5.3 Analysis of Selection

In this experiment, we verify whether Cts (Sect. 4.2) is effective and more likely to be geometric than Sts or Ts. To this aim, we apply these operators to populations initialized by Rhh and Ci. There are thus three setups for Rhh’s populations: RTs, RSts, and RCts, and three for Ci’s populations: CTs, CSts, and CCts. There are 30 runs for each problem and setup, and each run involves producing an initial population of size 1000 by the given initialization operator, followed by 500 acts of selecting two parents from that population.

Table 5 presents the average ranks over problems on the empirical probability that an application of a given selection operator is effective (first row) and geometric under the L_1 metric (second row); see Tables 19–22 in Appendix C for detailed results. Cts achieves the best average ranks; its likelihood of selecting programs effectively is close to 1.0 for all problems, and for populations initialized by both algorithms. Cts returned two semantically equal programs only eight times out of over a quarter of a million selection acts performed in the experiment. Sts is second best, and Ts ranks last for both initialization methods. Notably, the probabilities for Sts and Ts are higher for the populations initialized by Ci. A conclusive Friedman’s test indicates that the only setup not surpassed by both RCts and CCts is CSts.

Concerning the characteristic of being a geometric selection, recall that for binary selection operators the convex hull (Def. 5) is a segment connecting the semantics of the selected programs. For the L_2 metric, preliminary analysis showed the empirical probability of such a segment including the target to be negligible. Thus, in this experiment we use the L_1 metric. An L_1-segment (a hyperrectangle) may have a significant chance of including the target in low-dimensional semantic spaces, but that becomes much less likely for higher dimensionality (and thus for all problems considered here).
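Under L_1, testing whether the target lies in the segment spanned by two parent semantics reduces to a per-dimension bounds check, since the L_1-segment is exactly the axis-aligned hyperrectangle spanned by the two points; a minimal sketch:

```python
def in_l1_segment(target, a, b):
    """True iff `target` lies in the L1-segment (axis-aligned
    hyperrectangle) spanned by parent semantics `a` and `b`."""
    return all(min(x, y) <= t <= max(x, y)
               for t, x, y in zip(target, a, b))

# In 2D the L1-segment of (0,0) and (2,3) is the rectangle [0,2] x [0,3]:
print(in_l1_segment((1, 2), (0, 0), (2, 3)))  # True
print(in_l1_segment((1, 4), (0, 0), (2, 3)))  # False: 4 > 3
```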

All selection operators achieve higher probabilities of including the target in the parents’ convex hull for the populations initialized by Rhh; however, a conclusive Friedman’s test shows that the difference between the initialization methods is significant only for Ts. In 13 out of 18 problems, RCts achieves the highest probability, which makes RCts significantly more geometric than Sts and Cts. CCts achieves the highest probability for three problems. Within the groups of setups that use a given initialization method, Cts is the most geometric operator.

### 5.4 Analysis of Mutation

We compare the mutation operators from Table 1, that is, Tm, Sdm, Sgm, and Cm, with respect to effectiveness, geometry, and impact on program size, by applying them to programs drawn uniformly from populations initialized by Rhh and Ci. There are four setups operating on Rhh-initialized populations: RTm, RSdm, RSgm, and RCm, and four operating on Ci-initialized populations: CTm, CSdm, CSgm, and CCm. In each of 30 runs per problem, we first initialize a population of 1000 programs and then apply the mutation operator to each of them.

Table 6 presents the aggregated ranks on the empirical probability that a mutation is effective (top row), and on the offspring-to-parent distance (bottom row). Sdm features the highest probabilities of effectiveness and ranks best for both initialization methods. Cm is the runner-up, and Tm is the least effective operator for both initialization methods. A conclusive Friedman’s test confirms the significance of these findings: Sdm outranks all other operators on effectiveness.

The definition of geometric mutation is parameterized by the radius r (Def. 1). Rather than setting this parameter in advance and counting how many times a given mutation act is r-geometric, we abstract from r and record the distance between offspring and parent. In Table 6, we present the ranks on the median of that distance. The setups that use Ci produce offspring closer to their parents than the corresponding setups with Rhh, except for Sgm. CTm achieves the overall lowest rank, with CCm as the runner-up. According to a conclusive Friedman’s test, these two setups are indiscernible and produce offspring closer to the parent than RTm and RSdm do, and in the case of CTm also closer than RSgm and CSgm. Among the setups involving Rhh, RCm is the best, attaining the lowest rank. The offspring produced by RSdm are the most distant from their parents. Note that Sgm is parameterized with a mutation step (Moraglio et al., 2012) (which we set as reported in Table 2), so the comparisons with this operator should be taken with a grain of salt.

To compare the operators with respect to offspring size, we record the ratio of the number of nodes in an offspring to the number of nodes in its parent. The ratio of 1.0 indicates that an operator produces an average offspring that is as large as the parent. This indicator abstracts from the absolute sizes of programs, which may vary heavily already in the initial populations considered here.

Table 7 presents the average and .95-confidence interval of the offspring-to-parent size ratio. The setups involving Ci lead to ratios closer to 1.0 than the corresponding setups with Rhh; the exception is Sdm in the Boolean domain, where the opposite holds. The reason for this deviation may be that Ci, required to provide semantic diversity in the initial population, tends to build larger programs than Rhh, and single-subtree mutations affect a relatively smaller fraction of larger programs than of smaller ones. The ratio closest to 1.0 is provided by CSdm in the regression domain and by CCm in the Boolean domain. The second-closest setup is CTm in the regression domain and CSgm in the Boolean domain. Note that the size of offspring produced by Cm can be controlled by restricting the size of programs in the library.

### 5.5 Analysis of Crossover

We compare the crossover operators from Table 1, that is, Tx, Sdx, Sgx, and Cx proposed in Section 4.4 by applying them to the programs drawn uniformly from populations initialized by Rhh and Ci. There are four setups using Rhh: RTx, RSdx, RSgx, and RCx and four using Ci: CTx, CSdx, CSgx, and CCx. We run each setup 30 times on each problem, and each run executes 500 applications of crossover.

Table 8 presents the average ranks on the empirical probability that a crossover application is effective, and on the empirical probability that it is effective and geometric (see Tables 27–30 in Appendix C for details). Setups that use Ci lead to lower ranks than the corresponding Rhh setups, except for Tx, where the opposite holds. CSdx ranks first and CSgx is the close runner-up. For each initialization method, Tx is the least likely to be effective, and Cx comes last but one in the ranking. It is, however, worth noting that the probabilities achieved by Cx (Table 27) are still high: over .99 for symbolic regression and in the range .47–.87 for the Boolean domain. A Friedman’s test confirms the significance of these results.

Concerning the empirical probability that crossover is effective and geometric, we exclude the neutral applications because they do not advance search, even though they are geometric according to Definition 2. Unsurprisingly, the exactly geometric Sgx scores the lowest ranks for both initialization methods and is thus the most geometric operator, with probabilities in the range .74–.999 (and 1.0 when including the neutral applications; see Table 29). The conclusive result of Friedman’s test shows that this result is better than the results of all other setups except RCx. The ranks achieved by Sgx are lower when a population is initialized by Ci, which confirms that geometric initialization helps Sgx produce geometric effective offspring. The runner-up is Cx, for which the rank is lower for the population initialized by Rhh. The least geometric operator is Tx.

Table 9 presents the average and .95-confidence interval of the ratio of the number of nodes in the offspring to the average number of nodes in the parents (for the rationale on ratio-based measures, see Sect. 5.4). RTx diverges the least from the neutral value of 1.0 in both considered domains. Next most divergent are, in order, RSdx, CSdx, and CTx. Note that the differences between the ratios of these four setups are minute. Cx maintains higher ratios in Rhh’s populations than in Ci’s populations. As in the case of mutation, the size of the offspring produced by Cx can be tuned by restricting the size of programs in the library. Because Sgx combines parents using an extra piece of code, it produces offspring that are on average over twice as large as the parents.

### 5.6 Joint Performance of Operators

We assess the joint performance of the operators analyzed in the previous sections. We compare four configurations of the generational GP algorithm equipped with the operators from Table 1: (i) the canonical operators Rhh, Ts, Tx, and Tm in proportions 90:10 (RTsTxm); (ii) the effective nongeometric operators Sdi, Sts, Sdx, and Sdm in proportions 50:50 (SStsSdxm); (iii) the geometric non-effective operators Rhh, Ts, Sgx, and Sgm in proportions 50:50 (RTsSgxm); and (iv) the competent operators Ci, Cts, Cx, and Cm in proportions 50:50 (CCtsCxm). In this experiment, we replace the Mux20 problem with the 16-bit Comparator (Cmp16), because calculating the 2^{20}-element semantics in Mux20 for every (sub)program in every generation required over 24 hours per run on high-performance servers, while the 2^{16}-element semantics in Cmp16 results in substantially shorter runtimes.

Figure 4 and Table 10 present the average and .95-confidence interval of the best-of-generation and the best-of-run fitness, respectively. In symbolic regression, CCtsCxm is the unquestionable winner in terms of best-of-run fitness and speed of convergence. No other setup outperforms CCtsCxm throughout entire runs. In the Boolean domain, CCtsCxm attains the best best-of-run fitness in four out of nine problems. In the remaining five problems, RTsSgxm is the best. The outcome of Friedman’s test (see Table 11) confirms the superiority of CCtsCxm over all other setups except SStsSdxm.

Table 12 shows the median and .95-confidence interval of the test-set fitness of the best-of-run program (selected on the training set). In seven out of nine problems, CCtsCxm outperforms all other setups, while RTsTxm proves best twice. Friedman’s test (see Table 13) shows that RTsSgxm generalizes significantly worse than all setups except the canonical one, RTsTxm. Notice that all setups overfit on the Pg1 problem, with CCtsCxm overfitting particularly strongly. We explain this phenomenon by the use of Chebyshev nodes as training-set points (cf. Sect. 5.1). This choice samples the target function more densely at the extremes of the training range, while undersampling the center of that range. The values of the Pg1 function (cf. Table 3) vary much more strongly in the center of the training range than close to its boundaries. Thus, only a small fraction of the training data reflects the most varying (and thus most decisive for the fitness value) part of the sought function.
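The uneven sampling follows directly from the definition of Chebyshev nodes, which cluster toward the interval boundaries. A quick density check (the interval and node count below are illustrative, not the article’s exact setup):

```python
import math

def chebyshev_nodes(n, lo=-1.0, hi=1.0):
    """n Chebyshev nodes on [lo, hi]:
    x_k = mid + half * cos((2k - 1) * pi / (2n)), k = 1..n."""
    mid, half = (lo + hi) / 2.0, (hi - lo) / 2.0
    return [mid + half * math.cos((2 * k - 1) * math.pi / (2 * n))
            for k in range(1, n + 1)]

nodes = chebyshev_nodes(20)
# Count nodes in the outer thirds vs. the middle third of [-1, 1];
# a uniform grid would put about 13 of 20 in the outer thirds.
outer = sum(1 for x in nodes if abs(x) > 1 / 3)
print(f"{outer} of {len(nodes)} nodes lie in the outer thirds")
```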

Table 14 presents the average and .95-confidence interval of the number of nodes in the best-of-run program. CCtsCxm produces the smallest programs in all Boolean problems and in Ng9, while RTsTxm produces the smallest programs in seven out of nine symbolic regression problems. RTsSgxm produces programs that are significantly larger (Friedman’s test; see Table 15) than those produced by all other setups.

## 6 Discussion

The consistency of the analytical investigations in Sections 5.2–5.5 with the overall good results produced by the competent suite in Section 5.6 demonstrates that the effectiveness and geometric character of semantic GP operators are viable performance predictors. The particular indicators and quantities proposed and reported in Sections 5.2–5.5 can thus be used as guides for designing initialization, selection, and search operators.

Although the components proposed here do not adhere strictly to the formal definitions of geometric properties (Defs. 2–4), they are designed with geometry and effectiveness in mind, and trade geometric perfection for a practically acceptable computational overhead. In this sense, we find them a viable alternative to the exact geometric operators proposed by Moraglio et al. (2012). As elegant as they are, the exact operators represent one extreme on the spectrum of the above trade-off, and often pay for good training-set performance with poor generalization (see Table 12) and skyrocketing program size (see Table 14). Though program size can be managed to an extent by simplification or reuse of subexpressions (Vanneschi et al., 2013), generalization remains a challenge. In machine learning terms, the exact operators as proposed by Moraglio et al. (2012) have no bias and, consequently, exhibit high variance (Geman et al., 1992) when confronted with a test set. This problem may be addressed by *geometric block mutations* (Moraglio et al., 2013), which are biased by modifying program semantics on several fitness cases simultaneously. However, these operators have been proposed for the Boolean domain only, and assessment of their generalization performance was beyond the scope of that study. The competent operators also have nonzero bias, which apparently helps them generalize well, as verified empirically in this study.

An interesting side result of the experiment conducted in this work is the worse-than-expected performance of the exact geometric approach (RTsSgxm) in the regression domain. In contrast to previous works that reported excellent results (including the original Moraglio et al. (2012) paper), we found it hard to optimize RTsSgxm’s configuration for performance in the continuous domain. Figure 4 and the tables reflect the best configuration obtained by trying various mutation steps and sizes of the random “mixing” tree in Sgm (Sgx is parameter-free). This tuning made it possible for RTsSgxm to catch up with the remaining configurations in three out of nine benchmarks, but did not help generalization: in terms of test-set fitness, RTsSgxm clearly ranks last in Table 12. Our working explanation for the observed underperformance points to two possible causes. Firstly, the continuous benchmarks in our suite are hard (see Table 3): apart from one polynomial (Ng12), all problems involve rational, transcendental, or periodic functions. Secondly, the instruction set for regression problems includes the trigonometric, exponential, and logarithmic functions and is thus broader than in most previous studies, which use polynomial instructions only (Moraglio et al., 2012; Vanneschi et al., 2013; Zhu et al., 2013; Castelli et al., 2012; Castelli, Castaldi et al., 2013; Castelli, Vanneschi et al., 2013; Castelli et al., 2014, 2015). The latter characteristic may make it harder to provide the magnitude of variation necessary for Sgm to safely converge to the target.

## 7 Conclusions

We demonstrated that the capabilities of GSGP can be substantially boosted by extending geometric considerations beyond the basic framework of search operators to population initialization and selection, and by equipping those components with mechanisms that make them effective. The empirical results clearly show that these extensions prove successful in practice for the Boolean and symbolic regression domains (Sect. 5.6), while the experiments in Sections 5.2–5.5 revealed the causes of these observations.

The proposed suite of competent operators represents a “holistic” approach to semantic GP, where *all* key components of the evolutionary search infrastructure are designed with semantic and geometric aspects in mind. As the experimental results demonstrate, this is beneficial for the overall search performance in the two domains considered in this article, which proves that there is more to GSGP than just search operators. On the other hand, it is clear that the particular components proposed here form just one of many conceivable suites, and the quest for other, possibly even better performing toolkits should continue. Designing population initialization and selection operators that guarantee—in a deterministic or stochastic sense—that the convex hull of the population includes the target seems particularly appealing. Another interesting challenge for future research is defining competent operators tailored to other domains, and their empirical assessment.

## Acknowledgments

This work was supported by grant no. 2014/15/B/ST6/05205, funded by the National Science Centre, Poland.

## References

Luke, S. (2010). *The ECJ Owner’s Manual: A User Manual for the ECJ Evolutionary Computation Library*. Online version 0.2.


## Notes

^{1}

This likelihood would become certainty if the search operator *guaranteed* that the resulting offspring (i.e., the next population) forms a convex hull that includes the target. However, each application of a typical search operator is independent and returns only one offspring, which precludes ensuring this property.

^{2}

For multi-argument instructions, we consider bijectivity with regard to the argument that is to be found by inversion, assuming the other arguments are fixed. For instance, addition is bijective w.r.t. its first argument assuming its second argument is fixed.

## Appendix A Combinatorial Semantics and Semantic Backpropagation

*Semantic backpropagation* (SB; Pawlak et al., 2015b; Krawiec and Pawlak, 2013a; Wieloch and Krawiec, 2013) is an algorithm employed by Cm and Cx (cf. Sects. 4.3 and 4.4) that, given a semantics s, a program p, and a subprogram a in p (i.e., its location in p), computes the *combinatorial semantics*: a formal object that captures the semantics which, when substituted for a in p, cause the resulting program to have semantics equal to s. A combinatorial semantics is a tuple of sets D = (D_1, …, D_n), where D_i corresponds to the ith fitness case (i.e., the ith dimension of the considered semantic space). D_i stores the values that, when plugged in at the location of a while p is applied to the ith fitness case, cause p to return the ith element of the semantics s. Because the application of p to every fitness case is independent, D implicitly represents the set of semantics resulting from the Cartesian product of the D_i’s, i.e., D_1 × … × D_n. This allows us to capture an exponential number of semantics within a compact formal object. For instance, for n = 2, the combinatorial semantics ({1, 2}, {3, 4}) is equivalent to the set of four semantics {(1, 3), (1, 4), (2, 3), (2, 4)}. The transformations between semantics and combinatorial semantics will be assumed implicit in the following. A program *matches* the desired combinatorial semantics D if its semantics is in the set of semantics represented by D, i.e., belongs to D_1 × … × D_n.
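The Cartesian-product structure makes the membership test cheap: each dimension is checked independently, without ever enumerating the exponential product. A sketch under our own naming conventions:

```python
from itertools import product

class CombinatorialSemantics:
    """A tuple of per-fitness-case value sets (D_1, ..., D_n) that
    compactly represents the Cartesian product D_1 x ... x D_n."""
    def __init__(self, *dims):
        self.dims = [frozenset(d) for d in dims]

    def __contains__(self, semantics):
        # Membership is decided per dimension, in n set lookups,
        # never by enumerating the (exponentially large) product.
        return (len(semantics) == len(self.dims) and
                all(v in d for v, d in zip(semantics, self.dims)))

    def enumerate(self):
        """Explicit expansion -- exponential, for illustration only."""
        return [tuple(s) for s in product(*self.dims)]

D = CombinatorialSemantics({1, 2}, {3, 4})
print((1, 4) in D)          # True
print((3, 3) in D)          # False
print(len(D.enumerate()))   # 4 semantics, as in the text's example
```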

SB (Algorithm 5) calculates the combinatorial semantics by inverting program execution for the considered semantics. For each component of s, SB traverses the path of instructions from the program root node to the designated node a (where the subtree to calculate the combinatorial semantics for is rooted). In each node on that path, given the node’s desired output, SB inverts the execution of the node’s instruction w.r.t. the output of the child that lies next on the path toward a. This inversion executes the instruction “backward”: it computes the set of values of the considered argument for which the instruction returns the desired output, given the outputs of the node’s remaining children. The resulting set becomes the desired output for the next node on the path. SB terminates when a is reached. The time complexity of SB is proportional to the product of the dimensionality n of the semantic space and the number of nodes in program p.

The realization of instruction inversion is domain-dependent. Table 16 defines it for the common instructions from the regression and Boolean domains. Depending on the properties of the instruction f that is subject to inversion, four cases can be delineated.

If f is a bijection,^{2} it is fully invertible and the inversion returns a single value of the argument in question, calculated from the output and the remaining arguments. Examples are addition, subtraction, negation, and xor.

If f is a noninjective instruction, that is, maps multiple argument values to the same output, there are many (possibly infinitely many) inversions and the inversion outputs a set. An example is the absolute value.

If the considered argument has no impact on f’s output and the actual output of f is equal to the given one, the inversion returns the entire domain, because whatever value is fed into the considered argument, f’s output remains correct. Examples are multiplication by 0 and Boolean conjunction with false.

Finally, if f is a nonsurjective instruction and the requested output value is not in its image and thus cannot be returned by f, the inversion returns the empty set.

Note that all four cases may co-occur for the same instruction for different values of the requested output, that is, for different fitness cases.
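These four cases can be made concrete for a few regression-domain instructions. The sketch below uses conventions of ours, not the article’s: a Python set of candidate values, a token ANY for the entire domain, and the empty set when no inversion exists:

```python
ANY = "ANY"  # stands for the entire domain of real values

def invert_add(output, other):
    """Solve other + x == output for x: a bijection, single value."""
    return {output - other}

def invert_abs(output):
    """Solve abs(x) == output for x."""
    if output < 0:
        return set()                      # nonsurjective: no preimage
    if output == 0:
        return {0.0}
    return {output, -output}              # noninjective: two preimages

def invert_mul(output, other):
    """Solve other * x == output for x."""
    if other != 0:
        return {output / other}           # bijective given fixed `other`
    return ANY if output == 0 else set()  # 0*x == 0 for any x; 0*x != c

print(invert_add(5.0, 2.0))   # {3.0}
print(invert_abs(2.0))        # {2.0, -2.0}
print(invert_mul(0.0, 0.0))   # ANY: the argument has no impact
print(invert_mul(1.0, 0.0))   # set(): no inversion exists
```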

## Appendix B Efficient Library Search

The set of programs searched by the library is a union of a set of programs that come from an arbitrary source (e.g., are created in advance) and the set of all single-node constant programs. Programs that match any forbidden semantics are excluded from the library (see the definition of matching in Appendix A). Equation (5) calculates the minimum distance of a program in the library to each semantics that comes from the Cartesian product of the components of the combinatorial semantics (Appendix A).

In this way, the overall set of constants is greatly reduced. For the L_2 metric, the minimized term in Equation (8) is nonlinear and we cannot reduce the set this way.

For domains where the set of constants has low cardinality, for example, the Boolean one, it is relatively cheap to search that set exhaustively. For the regression domain, the set may still be infinite, and it may be necessary to calculate the value that minimizes Equation (8) using, for example, a numerical optimization technique, or to approximate the result by searching the reduced set described above (which is what we do in the experiment).
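The reduction of the candidate constant set can be illustrated for the L_1 metric, where the objective is piecewise linear and attains its minimum at one of the desired values, so scanning their union suffices (a sketch of ours, under that assumption):

```python
def best_constant_l1(dims):
    """Find the constant c minimizing sum_i min_{v in D_i} |c - v|.
    Under L1 the objective is piecewise linear with its minimum at
    one of the desired values, so the union of all D_i suffices as
    the candidate set."""
    candidates = set().union(*dims)
    def cost(c):
        return sum(min(abs(c - v) for v in d) for d in dims)
    return min(candidates, key=cost)

# Desired combinatorial semantics over three fitness cases:
dims = [{1.0, 5.0}, {2.0}, {2.5, 9.0}]
c = best_constant_l1(dims)
print(c)  # 2.0, with total L1 distance 1.5
```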