## Abstract

Genetic Programming is a method to automatically create computer programs based on the principles of evolution. Deceptiveness caused by complex dependencies among components of programs is a challenging problem: it can misguide Genetic Programming into creating suboptimal programs. Moreover, a minor modification of a program may lead to a notable change in program behaviour and affect the final output. This article presents Grammar-Based Genetic Programming with Bayesian Classifiers (GBGPBC), in which the probabilistic dependencies among components of programs are captured using a set of Bayesian network classifiers. Our system was evaluated on a set of benchmark problems (the deceptive maximum problems, the royal tree problems, and the bipolar asymmetric royal tree problems). It was often more robust and more efficient than other related Genetic Programming approaches in searching for the best programs, in terms of the total number of fitness evaluations. We studied the factors affecting the performance of GBGPBC and discovered that the performance of robust variants of GBGPBC was consistently weakly correlated with some program complexity measures. Furthermore, our approach was applied to learn a ranking program over a set of customers in direct marketing. Our suggested solutions help companies earn significantly more than solutions produced by several well-known machine learning algorithms, such as neural networks, logistic regression, and Bayesian networks.

## 1  Introduction

In conventional Genetic Programming (GP) (Koza, 1992), the evolution of programs involves two genetic operators: mutation and crossover. However, these operators are limited in some situations. First, in order to perform mutation and crossover between any two points in a parse tree, Koza introduced the closure property to the function set: every function in the function set should be well defined for any combination of input arguments (Koza, 1992). The closure property limits the applications of GP and is relaxed in Grammar-Based Genetic Programming (GBGP) (Whigham, 1995a, 1995b; Wong and Leung, 1995), which was proposed to rigorously define the set of admissible parse trees via a grammar (see McKay et al., 2010 for a review). For example, context-free grammars and logic grammars use a set of production rules to explicitly ensure the closure relations of functions. The crossover and mutation operators are redesigned so that new parse trees satisfy the grammar.

In addition, the performance of GP degrades if there is deceptiveness and there are dependencies in the problem. The concept of deceptiveness originated in the Genetic Algorithm (GA) literature (Goldberg, 1987) and was further generalized to the concept of the deceptive attractor (Whitley, 1991). In this class of problems, there is a strong interaction between partial solutions: the contribution of a subtree of a parse tree varies depending on the configuration of the other subtrees. The search space of a deceptive problem contains local optima which can misguide the search algorithm away from the global optima. Deceptiveness can be caused by dependencies among the components of programs (e.g., the coexistence or order of appearance of subtrees of a parse tree).

Nevertheless, context-free grammar is limited in its expressiveness when dealing with dependencies and uncertainty, while logic grammar cannot capture probabilistic dependencies. Besides, some common types of dependencies in a program, such as data dependencies and control dependencies (Allen and Kennedy, 2002), cannot be naturally modelled by a context-free grammar. For example, the reading memory instruction refers to the data from a preceding writing memory instruction. Swapping the reading memory instruction and writing memory instruction can alter the output of a program. The flow of execution can affect the final output if the instructions are not executed in a specific order. Some combinations are preferable because they are logically or semantically sound in good parse trees which preserve the program dependencies.

As discussed in Kim et al. (2014), this issue has recently been investigated in Probabilistic Model Building Genetic Programming (PMBGP) approaches, which weaken the context-free assumption by adopting probabilistic models in the grammar, as in the Probabilistic Model Building Genetic Algorithm (PMBGA). In these approaches, probabilistic models are estimated from different structural and behavioural properties of parse trees in the past generations. When a good solution is discovered, the final estimate of the distribution reflects the principles (or the knowledge) needed to generate good parse trees and gives insights into the nature of the problem.

Our recent works have developed a system which learns probabilistic models to evolve the best program. As in GBGP, a context-free grammar imposes hard syntactic constraints on the search space. The first approach adopted Bayesian networks (Wong et al., 2014a). Alternatively, the current work adopts Bayesian network classifiers, which differentiate the population of parse trees based on their fitness. In the current work and the previous work (Wong et al., 2014b), each production rule in the grammar is attached to a Bayesian network classifier to model the probabilistic dependencies among the nonterminals in the production rules and the fitness function. The inclusion of the class variable inspired us to invent a new genetic operator which refines the generation probabilities in a class-dependent manner. Bayesian network classifiers can perform a search for structural dependencies, which is further extended by adding a nonterminal-based context variable. Context variables attempt to model how we write a program. This article also provides comprehensive comparison experiments using three benchmark problems (Wong et al., 2014a and Wong et al., 2014b used only two of them). Grammatical Evolution (GE), a popular approach in the GP research community, is also added to the comparison. Analysis is also performed on the Bayesian network classifiers so as to study how the evolved grammar encodes the context and structure information of the optimal parse trees via probabilistic dependencies. In addition, we propose a new perspective for understanding how the search cost of PMBGP approaches and the structures of the programs are related: six program complexity measures are identified and then related to the search cost.

In order to demonstrate the usefulness of our system, we apply our novel Grammar-Based Genetic Programming with Bayesian Classifiers (GBGPBC) system to evolve a solution to a direct marketing problem, which was not done in the previous works (Wong et al., 2014a, 2014b). In the context of direct marketing, the grammar may encode the interdependencies among the attributes of customers (such as previous shopping behaviours), the direct marketers' promotion strategies, and the characteristics of the products, and helps a company identify customers with high loyalty. This information is valuable for devising future marketing plans.

To summarize, this article expands on our previous work (Wong et al., 2014b) and has the following objectives:

1. To provide a comprehensive introduction of GBGPBC, which captures probabilistic contextual and structural dependencies;

2. To improve our understanding of GBGPBC by providing new analysis on the evaluation results of three benchmark problems; and

3. To demonstrate a potential application of GBGPBC in the context of direct marketing.

The rest of the article is organized as follows. Section 2 summarizes existing works related to GP approaches, PMBGP approaches, and Bayesian network classifiers. Next, Section 3.1 formalizes our grammar model. The details of the procedures of the proposed system are presented in Section 3.2. To demonstrate the robustness and efficiency of our system, the results of the system performance on three benchmark problems are presented and discussed in Section 4. The interplay between search cost and program complexity is studied in Section 4.3. We then apply the system to solve a real-world data mining problem in direct marketing. Lastly, we discuss other potential uses of the GBGPBC system.

## 2  Related Works

Our approach is a PMBGP approach; therefore, we summarize the related works in these areas in this section. Related works on Bayesian network classifiers can be found in Supplementary Section 2 (supplement available at https://doi.org/10.1162/evco_a_00280).

### 2.1  GAs, GPs, and GBGPs

A GA (Goldberg, 1989; Davis, 1991; Holland, 1992; Michalewicz and Schoenauer, 1996) represents each solution (or individual) by a chromosome (i.e., a vector of symbols). The population of individuals is evaluated by a fitness function. Evolution is an iterative process: in each round, genetic operators, such as the mutation operator and the crossover operator, are applied to the current population to produce new individuals. Then, some existing individuals and new individuals are selected to form the next population. After several rounds of evolution, the best individual(s) are chosen as the final solution(s) to the problem.
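As a concrete sketch of this iterative process, the following minimal GA uses truncation selection, one-point crossover, and bit-flip mutation on the classic OneMax problem. It is an illustrative toy, not an implementation from any of the cited works; all parameter values are arbitrary.

```python
import random

def evolve(fitness, length=8, pop_size=20, generations=50,
           mutation_rate=0.1, seed=0):
    """Minimal generational GA over binary chromosomes (illustrative)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]            # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)       # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g ^ 1 if rng.random() < mutation_rate else g
                     for g in child]             # bit-flip mutation
            children.append(child)
        pop = parents + children                 # next population
    return max(pop, key=fitness)                 # best individual

# OneMax: fitness is simply the number of ones in the chromosome
best = evolve(sum)
```

Because the parents are carried over unchanged, the best fitness never decreases from one generation to the next.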

GP (Koza, 1992) represents a candidate program using a parse tree composed of elements from a function set and a terminal set, which define the internal nodes and the leaf nodes of all parse trees, respectively. The crossover operator swaps subtrees across two trees. The mutation operator on a function node randomly chooses a new function node from the function set; it can be applied similarly to the terminal nodes.

GBGP (Whigham, 1995a; Wong and Leung, 1995) solves the closure problem of GP using grammars which formally define a set of parse trees. One of the most widely applied GBGP approaches is GE (Ryan et al., 1998; O'Neill and Ryan, 2003). The original version of GE adopted the Backus-Naur Form for expressing the grammar. Unlike GP, each individual is represented by a linear chromosome. Using a mapping function, the chromosome is translated into a parse tree by telling the system how to apply the rules of the grammar. GE has been extended and improved in terms of the genotype-phenotype mapping rules (Keijzer et al., 2002; Lourenço et al., 2016), the locality of the representation (Rothlauf and Oetzel, 2006; Thorhauer and Rothlauf, 2014), and search operators and search strategies (O'Neill and Brabazon, 2006).

### 2.2  PMBGAs and PMBGPs

In PMBGA, also known as the Estimation of Distribution Algorithm (EDA) (Larrañaga and Lozano, 2001), sampling from and updating of a probabilistic model together substitute for the mutation and crossover operators. Sampling from the probabilistic model generates new individuals, while updating the probabilistic model adjusts and controls the generation probability of some forms of individuals. More specifically, the objective of PMBGA or EDA is to learn a probabilistic model which can (frequently) generate the optimal solution(s) of a problem. The probabilistic model can be univariate (e.g., Population-Based Incremental Learning (PBIL) (Baluja, 1994) and Probabilistic Incremental Program Evolution (PIPE) (Salustowicz and Schmidhuber, 1997)), bivariate (e.g., Mutual Information Maximization for Input Clustering (MIMIC) (De Bonet et al., 1997)), or multivariate (e.g., the Bayesian Optimization Algorithm (BOA) (Pelikan, 2005) and the Linkage Detection Factorization Algorithm (LDFA) (Mühlenbein and Mahnig, 1999)). Combinatorial Optimization with Coincidence (COIN) rewards good solutions and punishes poor solutions (Wattanapornprom et al., 2009). Besides, it performs adaptive selection to maintain diversity.
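To illustrate how sampling and model updating can replace crossover and mutation, the following is a minimal PBIL-style sketch on OneMax: a univariate probability vector is sampled to produce individuals and then shifted towards the best sample. Parameter values are arbitrary, and this is a simplification of PBIL, not code from the cited works.

```python
import random

def pbil_onemax(length=10, pop_size=30, lr=0.1, generations=60, seed=0):
    """PBIL-style loop: the probability vector replaces crossover/mutation."""
    rng = random.Random(seed)
    p = [0.5] * length                      # univariate model: P(bit_i = 1)
    for _ in range(generations):
        pop = [[1 if rng.random() < pi else 0 for pi in p]
               for _ in range(pop_size)]    # sampling generates individuals
        best = max(pop, key=sum)            # select the best sample
        p = [(1 - lr) * pi + lr * b         # updating biases the model
             for pi, b in zip(p, best)]
    return p

model = pbil_onemax()
```

After enough generations, most entries of the probability vector drift towards 1, so sampling it generates near-optimal OneMax solutions with high probability.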

When compared with GA, GP adopts a different representation of the solution: the search space consists of programs, which are often tree-structured. A probabilistic model can also be used in GP, and many PMBGP approaches have been realized. There are two classes of methods: probabilistic prototype tree (PPT) model-based methods and probabilistic grammar model-based methods. A PPT model-based method designs an encoding to represent a tree structure in a fixed-length chromosome and utilizes probabilistic models. Salustowicz and Schmidhuber (1997) proposed the first PMBGP, called PIPE, which introduces the PPT and adopts a univariate model; the PPT has subsequently been used and extended by Estimation of Distribution Programming (EDP) (Yanai and Iba, 2003), eCGA (Sastry and Goldberg, 2003), BOA programming (Looks et al., 2005), POLE (Hasegawa and Iba, 2006, 2008), POLE on binary encoded PPT (Yanase et al., 2009), and POLE-BP (Sato et al., 2012).

Alternatively, probabilistic grammar model-based methods adopt the Probabilistic Context-Free Grammar (PCFG) (Booth and Thompson, 1973). A context-free grammar can be transformed into a Stochastic CFG (SCFG) by adding a weight to each production rule so that the derivation choices can be biased towards different structures automatically for different problems. The weights are normalized to 1 at the level of each nonterminal. Stochastic Grammar-based Genetic Programming (SG-GP) (Ratle and Sebag, 2001) combines PBIL (Baluja, 1994) with Grammar-Guided Genetic Programming (GGGP) (Whigham, 1995a). SG-GP can keep the best individual at an almost constant size and does not accumulate "junk code." Later, Program Evolution with Explicit Learning (PEEL) (Shan et al., 2003) allowed SG-GP to use the depth and the relative location in the tree. PEEL learns a Search Space Description Table (SSDT) from the population, which is updated via Ant Colony Optimization (ACO) (Bonabeau et al., 1999). The Minimum Description Length (MDL) (Solomonoff, 1978) score can help in merging the nonterminals of the production rules of an SCFG, as demonstrated in GMPE (Shan et al., 2004). GT-EDA (Bosman and de Jong, 2004) transforms the grammar so as to express a higher order of dependency. Adding latent annotations increases the flexibility of a grammar model. In PAGE-EM (Hasegawa and Iba, 2009), the latent parameters for annotations are estimated from promising individuals using the expectation-maximization (EM) algorithm (Dempster et al., 1977) so as to express the tree structure of the optimum. Larger annotation sizes may also improve the search performance at the expense of computational cost. PAGE-VB improves on PAGE-EM because the number of annotations can be approximated using Variational Bayes (VB) learning (Attias, 1999). The PCFG-LA mixture model can incorporate local dependencies and global contexts (Hasegawa, 2012).
Bayesian automatic programming (Regolin and Pozo, 2005) extends GE and applies a learned Bayesian network to a codified population of individuals. Tanev's work utilizes a Probabilistic Context-Sensitive Grammar (PCSG) in GP to evolve the locomotion gaits of a Snakebot; in this work, a predefined context triggers a variation in the probabilities learned from the evolved best individuals. Other models have also been proposed. In Probabilistic Model Building Genetic Network Programming, useful substructures are discovered from infeasible individuals (Li et al., 2011). A Bayesian network can also be locally attached to each production rule (Wong et al., 2014a), and this approach can be applied to deep neuroevolution (Wong et al., 2019; Wong, 2019). Apart from modelling programs as tree structures, probabilistic models can be applied to other program representations as well. For example, DAE-GP (Wittenberg et al., 2020) uses denoising autoencoder long short-term memory networks to explore latent dependencies in linearized GP. (See Supplementary Section 15 for a detailed comparison.)

These studies show different ways to capture dependencies. In general, these approaches do not include, in their dependency models, the contextual and structural dependencies of individuals with low and high fitness values. Some approaches use sophisticated estimation for model learning and are hard to interpret. In our methodology, a Bayesian network classifier is locally attached to each production rule. In order to make the probabilistic description more explicit and interpretable to humans, GBGPBC imposes a structure on the probabilistic description using a probabilistic context-sensitive grammar and Bayesian network classifiers. The classifiers are responsible for capturing the contextual and structural dependencies among individuals having low or high fitness values. Furthermore, the interpretability of the dependencies in the adapted grammar can be controlled by the design of the grammar. The adapted probabilistic context-sensitive grammar summarizes the collective characteristics of good computer programs and can be used to evaluate GBGPBC. Note that our current approach does not include any latent variable as proposed in PAGE-EM and PAGE-VB (Hasegawa and Iba, 2009); the lack of latent variables reduces the model capacity of the Bayesian network classifiers.

## 3  Research Methodology

Our system evolves a population of computer programs together with the grammar. A fitness function and a grammar are provided to GBGPBC as input. The fitness function assigns a score to each computer program; in data-driven applications, it is often associated with an error function over the collected data. The system takes in a PCSG, which is a generalization of a context-free grammar. The grammar confines the search space of the computer programs. To model the probabilistic dependencies within the structures of a program, the grammar is also associated with several Bayesian network classifiers, which serve as the probabilistic model discriminating the set of programs based on their fitness. During evolution, GBGPBC generates a population of computer programs following the grammar, and it updates the probabilistic model associated with the grammar in every generation. The best computer program(s) are selected as the solution(s). The details of each part are presented in the following sections.

### 3.1  Grammar Model

A grammar is required to specify the set of feasible computer programs. The proposed grammar is composed of several production rules, each associated with a Bayesian network classifier. It confines the individuals to valid structures and models the probabilistic dependencies among different components of the programs. A PCSG is composed of seven components: a set of nonterminal symbols $N$; a set of terminal symbols $T$, where $N \cap T = \emptyset$; a start nonterminal symbol $S \in N$; a set of production rules $R$, each of the form $A \rightarrow \alpha$, where the left-hand side $A \in N$ and the right-hand side $\alpha \in (N \cup T)^{*}$; a set of context variables $C$ (which are elaborated in Section 3.3.2); a set of class labels $X$; and finally a set of production probability distributions $P$. The probability distributions specify the probabilistic dependencies among the structures. A probability distribution can be represented by $p(w, c, x \mid r)$, where $c \subseteq C$, $x \in X$, $r \in R$, $w \in L_{r}$, $L_{r}^{i} = \{s \in R : LHS(s) = \text{the } i\text{th element of } RHS(r)\}$, and $L_{r} = L_{r}^{1} \times L_{r}^{2} \times \cdots \times L_{r}^{|RHS(r)|}$ (the set of all possible combinations of nonterminals for a rule $r$).

This PCSG imposes hard and soft constraints on the structures of the set of programs. The first four elements ($N$, $T$, $S$, and $R$) in the tuple impose hard constraints, as in a CFG. The set of production probabilities $P$ represents the soft constraints, which depend on four parameters: $w$, $c$, $x$, and $r$. A right-hand side function $RHS: R \rightarrow N^{*}$ maps a production rule to the ordered set of nonterminals on the right-hand side of that rule. Similarly, a left-hand side function $LHS: R \rightarrow N$ returns the left-hand side of a production rule. The set of all possible ways to derive the $i$th nonterminal is denoted by $L_{r}^{i}$. Their Cartesian product $L_{r}$ contains all valid combinations for the whole rule $r$. Therefore, $p(w, c, x \mid r)$ gives the joint distribution of the expansion choices of the nonterminals for a rule $r$, the context variables, and the class label. This joint distribution is encoded by a Bayesian network classifier; since every rule has its own joint distribution, every rule has a Bayesian network classifier. Note that if a rule $r'$ does not have any nonterminal on its right-hand side, probabilities of the form $p(w, c, x \mid r')$ are unimportant and ignored because there is only one way to derive this rule.
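The sets $L_{r}^{i}$ and $L_{r}$ can be computed mechanically from the rule set. The sketch below uses a hypothetical three-rule toy grammar (terminal symbols omitted for brevity), not the grammar from our experiments:

```python
from itertools import product

# Rules as (name, LHS, RHS nonterminals); terminals omitted for brevity.
RULES = [
    ("Start.0", "Start", ["Exp"]),
    ("Exp.0",   "Exp",   ["Exp", "Exp"]),   # e.g. Exp -> [+] Exp Exp
    ("Exp.1",   "Exp",   []),               # e.g. Exp -> [0.95]
    ("Exp.2",   "Exp",   []),               # e.g. Exp -> [lambda]
]

def L_i(rule_name, i):
    """L_r^i: rules whose LHS equals the i-th nonterminal on r's RHS."""
    rhs = next(r[2] for r in RULES if r[0] == rule_name)
    return [r[0] for r in RULES if r[1] == rhs[i]]

def L(rule_name):
    """L_r: Cartesian product of L_r^i over all RHS nonterminals of r."""
    rhs = next(r[2] for r in RULES if r[0] == rule_name)
    return list(product(*(L_i(rule_name, i) for i in range(len(rhs)))))
```

For instance, `L("Exp.0")` enumerates all 3 × 3 = 9 combinations of derivation choices for the two `Exp` nonterminals on its right-hand side.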

An example of the PCSG for a deceptive maximum problem (Hasegawa and Iba, 2008) is presented in Table 1. In this example, we omit the context variables and assume there are only two classes of programs (classified by their fitness values): good programs and bad programs, labelled $good$ and $bad$, respectively. The grammar of this problem has two nonterminals: $Exp$ (i.e., EXPression) and $Start$. Four terminals are specified for the two operators ($*$ and $+$) and the two values ($0.95$ and $\lambda$). We use a pair of square brackets (e.g., $[*]$ and $[\lambda]$) to enclose the terminals, while the nonterminals (e.g., $Start$ and $Exp$) are not enclosed.

Table 1:

(Left) PCSG for a deceptive maximum problem. There are three $Exp$ nonterminals on the right-hand side of the rules $Exp.0$ and $Exp.1$ because this grammar tackles a DMax(3, r) problem. (Right) A parse tree generated using the grammar on the left.

In the beginning, the probability distributions for the rules are assumed to be independent and uniform. There are three Bayesian network classifiers in the whole grammar. Recall that the joint distribution of a given rule $r$ (i.e., $p(w, c, x \mid r)$) involves three kinds of random variables: the expansion choices of the nonterminals of the rule, the context variables, and the class label. For notational simplicity, we write $p(w, x \mid r)$ for $p(w, \emptyset, x \mid r)$ when there is no context variable (i.e., $C = \emptyset$). We also need to distinguish the nonterminals on the right-hand side of each rule; referring to the grammar in Figure 1, we use subscripts to distinguish the $Exp$ nonterminals on the right-hand side of each rule. The structures of the Bayesian network classifiers of these rules are also shown in the grammar in Figure 1. Since the variables in the initial joint probability distributions are all independent, there is no directed edge between any pair of variables. Given the grammar, our system derives an individual from the rule whose left-hand side is the symbol $Start$ and then chooses the production rules for each nonterminal on the right-hand side recursively until all the nonterminals are expanded.
Figure 1:

Collection of instantiation records from a valid parse tree using the deceptive maximum grammar shown in the top left of this figure.


As mentioned in Section 2, a number of probabilistic models have been investigated in PMBGP (e.g., PIPE: a position-dependent model; POLE: a global dependence model for all nodes in a parse tree). They did not adopt classifiers, nor did they learn their models from the bad (in terms of fitness) individuals. In contrast, GBGPBC moves a step forward by applying Bayesian network classifiers for the choice dependency (i.e., $p(w, c, x \mid r)$) and at the same time updating the parameters of all Bayesian network classifiers based on the class label. The probabilistic dependencies among the production probability distributions, the structures of the programs, and the classes of individuals are jointly captured by our formalism.

### 3.2  Proposed Algorithm

GBGPBC iteratively searches for the optimal solutions. Information on how the structures of the individuals associated with the fitness values guides the search in subsequent generations. The whole process of GBGPBC involves eight steps.

1. Initialize a Bayesian network classifier with independent uniform distribution for each production rule (as represented by the circles next to each rule in Figure 1).

2. Produce a population of individuals, some of which are derived following the probability distributions of the Bayesian network classifiers, while the others are individuals selected from the previous generation through elitism. See Supplementary Section 4 for details.

3. Compute the fitness values for every individual.

4. Collect samples of the fitter and the poorer individuals based on their fitness values and label these individuals as "good" and "bad," respectively. Each sample becomes an instantiation record (i.e., an observed choice of derivation) as shown in Figure 1.

5. Compute the frequency counts for possible choices of derivation of each production rule under different class labels among the samples.

6. For each production rule in the grammar, check if the number of samples accumulated exceeds a predefined threshold; if so, its Bayesian network classifier is reconstructed from the accumulated samples and the samples are removed afterward.

7. Refine the probability distributions in the Bayesian network classifier based on the class labels so as to reduce the probability of generating bad individuals in the current generation.

8. Repeat from Step 2; the top-ranked individuals survive to the next generation.

At the beginning, all variables are assumed to be independent and the joint probability distribution is uniform; therefore, edges are absent between any pair of nodes in the Bayesian network classifiers. Steps 2 to 8 are repeated until certain criteria are met, such as reaching the maximum number of generations or successfully discovering an individual achieving the maximum fitness value.
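The eight steps can be summarized as a control loop. In the sketch below, every callable (`sample_individual`, `rebuild_classifier`, `refine`) is a placeholder for the corresponding operation described above, and all parameter values are arbitrary; this is a structural outline, not the authors' implementation.

```python
import random

def gbgpbc_loop(fitness, sample_individual, rebuild_classifier, refine,
                pop_size=100, n_select=20, threshold=200,
                max_generations=50, target=None, seed=0):
    """Skeleton of the eight-step GBGPBC loop (illustrative)."""
    rng = random.Random(seed)
    elites, buffer = [], []                            # Step 1 done by caller
    for _ in range(max_generations):
        pop = elites + [sample_individual(rng)         # Step 2: sample + elitism
                        for _ in range(pop_size - len(elites))]
        scored = sorted(pop, key=fitness, reverse=True)    # Step 3: evaluate
        good, bad = scored[:n_select], scored[-n_select:]  # Step 4: label
        buffer += [(ind, "good") for ind in good]          # Step 5: count
        buffer += [(ind, "bad") for ind in bad]
        if len(buffer) >= threshold:                   # Step 6: rebuild model
            rebuild_classifier(buffer)
            buffer.clear()
        refine(good, bad)                              # Step 7: refinement
        elites = scored[:n_select]                     # Step 8: elitism
        if target is not None and fitness(scored[0]) >= target:
            return scored[0]
    return max(elites, key=fitness)

# Dummy run: individuals are random floats, fitness is the identity.
best = gbgpbc_loop(lambda x: x, lambda rng: rng.random(),
                   lambda buf: None, lambda g, b: None,
                   pop_size=10, n_select=3, threshold=20, max_generations=5)
```

Because the elites are carried over, the best fitness found so far is never lost between generations.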

### 3.3  Individual Generation

The right-hand side of Table 1 shows the parse tree of an individual derived from the grammar. To begin with, we construct a parse tree from the only rule of the start symbol (i.e., $Start$), which corresponds to the root node of the parse tree. Next, the nonterminal $Exp$ in the rule $Start.0$ is expanded. There are four choices: $Exp.0$, $Exp.1$, $Exp.2$, and $Exp.3$. The system randomly picks a choice following the probability distribution $q(W) = \frac{p(W, x \mid Start.0)}{p(x \mid Start.0)}$, where $W$ is the random variable over the four choices of the nonterminal in the rule $Start.0$. We choose a class label $x$ from $\{good, bad\}$, where $good$ represents the class of "good" individuals (high fitness values) and $bad$ represents the class of "bad" individuals (low fitness values). The prior probabilities of the $good$ and $bad$ classes are $prob_{good}$ and $prob_{bad}$, respectively, where $prob_{good} + prob_{bad} = 1$.

To continue the derivation, we derive the $Exp_0$ node (enclosed in a dotted line) under the $Start$ root node. First, we instantiate the rule $r = Exp \rightarrow [*]\, Exp_1\, Exp_2\, Exp_3$, which is associated with a Bayesian network classifier $B_r$. The three nonterminals are instantiated following the topological order of the network in the Bayesian network classifier, for instance, $X$, $Exp_2$, $Exp_1$, and $Exp_3$. Since $B_r$ is a tree augmented Bayesian network classifier (Friedman et al., 1997), $Exp_1$, $Exp_2$, and $Exp_3$ must form a tree structure. Firstly, we choose a class label $x$ (such as $good$) based on the prior probabilities $prob_{good}$ and $prob_{bad}$ to instantiate the variable $X$. Next, a state for $Exp_2$ is selected. It has four possible states representing the four rules (i.e., $Exp.0$, $Exp.1$, $Exp.2$, $Exp.3$) of $Exp$. To instantiate $Exp_2$, we choose a state according to the distribution $p(Exp_2 \mid good)$, such as the state $Exp.3$, which means picking the production rule $Exp.3$. After the instantiation of $Exp_2$, the next variable in the topological order is $Exp_1$. We instantiate $Exp_1$ according to the distribution $p(Exp_1 \mid Exp_2 = Exp.3, good)$, obtaining, for instance, the rule $Exp.1$. Following the same procedure for $Exp_3$, we get $Exp.0$. Therefore, the states are $(Exp.1, Exp.3, Exp.0)$ for the nonterminals $Exp_1$, $Exp_2$, and $Exp_3$, respectively. The rule is derived to $(Exp.1, Exp.3, Exp.0)$ and the parse tree grows by one level. This procedure is applied recursively until all nonterminals in the parse tree are instantiated. Finally, a parse tree is obtained.
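The derivation order described above amounts to ancestral sampling from the classifier: instantiate $X$ first, then each nonterminal variable given its parent in the tree. The sketch below uses hand-made, degenerate conditional probability tables purely for illustration; the CPT values and dictionary representation are our assumptions, not learned parameters.

```python
import random

def sample_rule_choices(class_prior, cpts, order, rng):
    """Ancestral sampling from a TAN-style classifier (illustrative).

    class_prior: {label: prob}; order: variables in topological order;
    cpts: var -> (parent or None, {(parent_state, label): {state: prob}}).
    """
    def draw(dist):
        r, acc = rng.random(), 0.0
        for state, p in dist.items():
            acc += p
            if r < acc:
                return state
        return state                      # guard against rounding error
    x = draw(class_prior)                 # instantiate the class variable X
    assignment = {}
    for var in order:                     # then each Exp_i in topological order
        parent, table = cpts[var]
        assignment[var] = draw(table[(assignment.get(parent), x)])
    return x, assignment

# Exp2 is the root of the tree; Exp1 and Exp3 are its children.
cpts = {
    "Exp2": (None,   {(None, "good"):    {"Exp.3": 1.0},
                      (None, "bad"):     {"Exp.0": 1.0}}),
    "Exp1": ("Exp2", {("Exp.3", "good"): {"Exp.1": 1.0},
                      ("Exp.0", "bad"):  {"Exp.2": 1.0}}),
    "Exp3": ("Exp2", {("Exp.3", "good"): {"Exp.0": 1.0},
                      ("Exp.0", "bad"):  {"Exp.1": 1.0}}),
}
x, w = sample_rule_choices({"good": 0.99, "bad": 0.01}, cpts,
                           ["Exp2", "Exp1", "Exp3"], random.Random(1))
```

With these degenerate tables, sampling the $good$ class deterministically yields the derivation $(Exp.1, Exp.3, Exp.0)$ described in the walk-through.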

The prior probabilities $prob_{good}$ and $prob_{bad}$ decide whether structures are inherited from "good" or "bad" individuals, respectively. A new generation should include more structures from "good" individuals, so the value of the prior probability $prob_{good}$ should be close to 1; individuals then mainly inherit the structures learned from "good" individuals. If the system follows the conditional probabilities on the bad samples to generate a new individual, the individual inherits the structures residing in "bad" individuals. Therefore, the prior probability $prob_{bad}$ should be close to 0; it is set to 0.01 to encourage exploration.

#### 3.3.1  Learning Bayesian Network Classifier

After the individual generation and fitness evaluation steps, the individuals are sorted by their fitness values, with fitter individuals ranked higher. The number of individuals selected for each class is $n_{select} = \lfloor r_{select} \times pop \rfloor$, where $n_{select} \le \lfloor pop/2 \rfloor$, and $r_{select}$ and $pop$ are the selection rate and the population size, respectively. The first $n_{select}$ individuals are labelled as $good$, and the last $n_{select}$ individuals are labelled as $bad$. In short, a set of $good$ individuals and a set of $bad$ individuals are taken out of the population (and the others in the middle are discarded). The numbers of training examples collected for the two classes are the same, ensuring that the two classes are balanced. With these categorized individuals as training examples, a number of Tree Augmented Naïve Bayes (TAN) (Friedman et al., 1997) network classifiers are learned. The details can be found in Supplementary Section 5.
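A minimal sketch of this labelling step, assuming only that individuals are ranked by fitness and that the top and bottom $n_{select}$ are kept while the middle is discarded:

```python
from math import floor

def label_population(individuals, fitnesses, r_select):
    """Label the top and bottom n_select individuals; discard the middle."""
    pop = len(individuals)
    n_select = min(floor(r_select * pop), pop // 2)    # n_select <= pop/2
    ranked = sorted(range(pop), key=lambda i: fitnesses[i], reverse=True)
    good = [individuals[i] for i in ranked[:n_select]]
    bad = [individuals[i] for i in ranked[-n_select:]] if n_select else []
    return good, bad

# Ten individuals whose fitness equals their own value.
good, bad = label_population(list(range(10)), list(range(10)), 0.2)
```

Taking equally many individuals from each end guarantees a balanced training set for the classifiers.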

#### 3.3.2  Context Variables during Derivation

As described in Section 3.1, the proposed grammar model can be extended to include contextual information in the conditional probability distributions via context variables. This affects the derivation of a parse tree because the choices of the nonterminals will depend on the states of the context variables. A program code fragment cannot be inserted randomly into any part of a program; it should be invoked only in a particular situation to yield the desired output. There are often dependencies between the various statements in program code; for example, changing the order of some statements may lead to wrong programs (producing wrong outputs). In the Bayesian network classifier, a new variable node corresponding to each context variable is inserted. Three context variables are introduced to GBGPBC and described below. Supplementary Section 6 provides more details.

Depth-based context:  The substructure of the solution may vary according to tree depth. The function nodes near the output node may perform high-level data manipulations while those near the input nodes perform low-level data manipulations. Therefore, introducing the probabilistic dependencies between the nonterminals and the depth of the current level can be useful.

Rule-based context:  In the derivation of a parse tree, a set of rules needs to be chosen according to the grammar. Sometimes, the preference in this decision is affected by which parent rule is calling the rule. To model these probabilistic dependencies, we annotate the grammar rules with respect to every nonterminal symbol $s$. Only the subset of rules (i.e., the parent rules of symbol $s$) in which the nonterminal symbol $s$ appears (at least once) on the right-hand side is relevant. We index these rules (from 0 to $k_{s} - 1$) by their order of appearance in the grammar, where $k_{s}$ is the total number of parent rules.

Nonterminal-based context:  We globally annotate instances of the same nonterminal in the grammar in order to capture special and global dependencies among the nonterminals. In other words, every nonterminal is treated as a distinct instance in a global manner.
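To make the three context variables concrete, the following sketch computes their values for a single derivation step. The toy grammar encoding (lists of `(lhs, rhs)` pairs, loosely modelled on the DMax grammar) and the function names are our assumptions, not the authors' implementation.

```python
# Toy grammar, one (lhs, rhs) pair per production rule.
GRAMMAR = [
    ("Start", ["Exp"]),                      # Start.0
    ("Exp",   ["*", "Exp", "Exp", "Exp"]),   # Exp.0
    ("Exp",   ["+", "Exp", "Exp", "Exp"]),   # Exp.1
    ("Exp",   ["0.95"]),                     # Exp.2
    ("Exp",   ["lambda"]),                   # Exp.3
]

def parent_rules(symbol):
    """Rules whose right-hand side contains `symbol`, indexed
    0..k_s-1 by order of appearance (rule-based context)."""
    return [i for i, (_, rhs) in enumerate(GRAMMAR) if symbol in rhs]

def context(symbol, depth, parent_rule, instance_id):
    """Context values for expanding `symbol` at one derivation step:
    depth-based (current tree level), rule-based (index of the calling
    rule among the parent rules of the symbol), and nonterminal-based
    (a globally unique instance id for the nonterminal)."""
    rule_ctx = parent_rules(symbol).index(parent_rule)
    return {"depth": depth, "rule": rule_ctx, "nonterminal": instance_id}
```

Each of these values would appear as an extra variable node in the corresponding Bayesian network classifier.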

#### 3.3.3  Class-Dependent Refinement

An extra advantage of using Bayesian network classifiers is that they enable us to define meaningful classes (in a supervised manner). The probability distribution in the classifiers can be intentionally altered based on the meaning of the class variables so as to increase the chance of generating useful structures of an individual. After this meta-modification on the probability distributions, the fitness values in the next generation may be improved. Instead of changing the generated individual one by one, we fundamentally change the generation schemes by refining the generation probabilities. The details can be found in Supplementary Section 7.
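One plausible reading of this refinement (a hypothetical sketch only; the actual scheme is in Supplementary Section 7) is to blend the class-conditional rule distributions with weights $prob_{good}$ and $prob_{bad}$ from Table 2, so that sampling favours structures observed in good individuals:

```python
def refined_distribution(p_given_good, p_given_bad,
                         prob_good=0.99, prob_bad=0.01):
    """Bias the rule-generation probabilities toward the 'good' class.

    Hypothetical sketch: blend P(rule | good) and P(rule | bad) with
    weights prob_good and prob_bad and renormalize. The defaults are
    the parameter values listed in Table 2.
    """
    blended = {r: prob_good * p_given_good[r] + prob_bad * p_given_bad[r]
               for r in p_given_good}
    total = sum(blended.values())
    return {r: v / total for r, v in blended.items()}
```

Refining the generation distribution in this way changes the whole generation scheme at once, rather than modifying generated individuals one by one.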

## 4  Experiments on Benchmark Problems

We have implemented a GBGPBC system using C++ 11. We tested GBGPBC and other approaches on three benchmark problems: the royal tree (RT) problems, the deceptive maximum (DMax) problems, and the bipolar asymmetric royal tree (BART) problems. Assuming that fitness evaluation is computationally expensive, each method is evaluated by the number of fitness evaluations needed to find at least one optimal solution; a smaller value implies a better learning method. We argue that a better learning method can identify the association between fitness values and parse-tree structure, and thereby deduce a better way to construct parse trees with improved fitness. In the case where executing the fitness function takes at least an order of magnitude longer than the learning algorithm, the number of fitness evaluations is also a measure of the computational time required. To give a better picture of the performance, we divided the benchmark experiments into two parts.

In the first part, three types of algorithms were compared with our approach using depth-based and rule-based context variables. Firstly, since GBGPBC is a Probabilistic Model-Building Genetic Programming (PMBGP) method, it was compared with three relatively new PMBGP methods: POLE (Hasegawa and Iba, 2006, 2008), PAGE-EM (Hasegawa and Iba, 2009), and PAGE-VB (Hasegawa and Iba, 2009). These three approaches were shown to be superior to a univariate model (PIPE), an adjacent model (the EDP model), and simple GP (Hasegawa and Iba, 2009). Multiple annotation sizes were tested on PAGE-EM, but only the best results are reported. Furthermore, GBGPBC was compared with GE implemented in GEVA (O'Neill et al., 2008), which is a widely applied GBGP method, as well as with GBGP, which is a conventional evolutionary computation technique that applies grammars in GP (Whigham, 1995a; Wong and Leung, 1995).

The second part of the experiments compared different variants of GBGPBC. GBGPBC can be equipped with any combination of depth-based, rule-based, and nonterminal-based context variables, so we performed experiments for all eight combinations of context variables on each benchmark problem. The structures of the adapted grammar were examined. In this way, we can understand how different combinations of context variables affect the search behaviour under different levels of program complexity.

### 4.1  Experiments across Different Approaches

The parameters and their values for each configuration are listed in Table 2. The parameters for the different approaches were obtained either from the original papers or from the default parameters provided by the software. Throughout the experiments, all parameters were kept constant except those derivable from the given benchmark problems, such as the depth of a parse tree. Due to time constraints, we declared that an approach failed if it could not obtain the optimal solution within 200,000 fitness evaluations. The average number of fitness evaluations was calculated over the best 45 runs out of 50 if there were at least 45 successful runs. If the total number of successful runs was less than 45, the numbers of fitness evaluations of all successful runs were included in the average, which is then enclosed in brackets; failed runs are therefore excluded from the calculation. The symbol “x” means that none of the runs succeeded. Because of the stochastic nature of the approaches, our results also indicate whether an approach attained a successful rate of at least 90%, as suggested by Hasegawa and Iba (2009). A t-test was also applied to the average numbers of fitness evaluations of our approach and each of the other approaches. The plus sign (+) indicates that our approach is better than the compared approach with statistical significance (i.e., p-value $<$ 0.05).
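The bookkeeping of this protocol can be sketched as follows (a hypothetical helper, not the authors' code); a failed run is represented by an infinite cost:

```python
def summarize_runs(costs, max_evals=200000, best_k=45, min_success=45):
    """Average fitness-evaluation counts following the protocol above.

    `costs` holds the evaluations each run needed to find an optimum
    (float('inf') for a run that failed within max_evals). Returns
    (mean, bracketed): `bracketed` is True when fewer than
    `min_success` runs succeeded, so the mean is over all successful
    runs and shown in brackets in the tables; (None, False) stands
    for the 'x' entries where no run succeeds.
    """
    successes = sorted(c for c in costs if c <= max_evals)
    if not successes:
        return None, False                      # "x" in the tables
    if len(successes) >= min_success:
        best = successes[:best_k]               # best 45 of 50 runs
        return sum(best) / len(best), False
    return sum(successes) / len(successes), True
```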

Table 2:

Parameter values for three benchmark problems.

**GBGP Configuration**

| Parameter | Value |
| --- | --- |
| Population size | 1000 |
| Generation | 1000 |
| Crossover rate | 0.9 |
| Mutation rate | 0.09 |

**POLE Configuration**

| Parameter | Value |
| --- | --- |
| Population size | 1000 |
| Parent range | |
| Selection rate | 0.1 |
| Elite rate | 0.005 |

**GE Configuration**

| Parameter | Value |
| --- | --- |
| Population size | 100 |
| Crossover rate | 0.9 |
| Elite size | 50 |

**PAGE-EM Configuration**

| Parameter | Value |
| --- | --- |
| Population size | 1000 |
| Annotation size | 8, 16 (RT, BART); 16 (DMax) |
| Selection rate | 0.1 |
| Elite rate | 0.1 |

**PAGE-VB Configuration**

| Parameter | Value |
| --- | --- |
| Population size | 1000 |
| Annotations to explore | {1,2,4,8,16} |
| Annotation exploration range | |
| Selection rate | 0.1 |
| Elite rate | 0.1 |

**GBGPBC Configuration**

| Parameter | Value |
| --- | --- |
| Population size | 250 |
| Accumulation size | 100 |
| Context variables | depth-and-rule-based |
| Class labels | good, bad |
| Elitism rate | 0.1 |
| Selection rate | 0.2 |
| $prob_{bad}$ | 0.01 |
| $prob_{good}$ | 0.99 |
| Percentile of samples for the classes sorted by fitness values in ascending order | good: 80th-100th percentile; bad: 20th-40th percentile |

#### 4.1.1  Deceptive Maximum Problem

The Deceptive Max (DMax) problem (Hasegawa and Iba, 2008) is a modified version of the original Max problem studied in GP (Gathercole and Ross, 1996). Given a terminal set, a function set, and a depth limit, the goal of the Max problem is to find a program which returns the maximum value. The DMax problem adds deceptiveness by introducing complex numbers into the terminal set while maximizing only the real part of the output. Throughout the evolution, the optimization algorithm knows nothing about the imaginary part of the score, so interactions between subtrees must be considered in order to achieve this goal. This abstract problem is relevant in practice, for example, in controlling the amplitude of voltage when synthesizing analog electric circuits (Koza et al., 1997).

Formally, the DMax problem has a function set $\{+_m,\times_m\}$ and a terminal set $\{0.95,\lambda\}$, where $\lambda^r=1$. The function $+_m$ returns the sum of all $m$ numbers in its arguments. The function $\times_m$ is defined similarly. For example, $+_3(1,3,5)=9$ and $\times_4(2,2,2,2)=16$. One advantage of the DMax problem is that it is parametrized: we can generate many problems to test different algorithms. We use the notation DMax($m$,$r$) to represent this class of problems. The grammar of DMax(3,$r$) has already been shown in Table 1. In the current article, we have tested a number of DMax problems with different $m$ and $r$: DMax(3,4), DMax(3,5), and DMax(5,3). A maximum depth level $d$ is specified to limit the height of the tree for a DMax problem. More information about this problem can be found in the Supplementary Materials. The experimental results across different approaches are presented in Table 3.
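Assuming the tuple-based tree encoding below (a sketch of our own; the paper's system works on grammar-derived parse trees), the DMax fitness can be computed directly from the definition:

```python
import cmath

def dmax_fitness(tree, m, r):
    """Evaluate a DMax(m, r) parse tree and return its fitness.

    A tree is either a terminal ('0.95' or 'lambda') or a tuple
    (op, child_1, ..., child_m) with op in {'+', '*'}; 'lambda'
    denotes the r-th root of unity, and the fitness is the real
    part of the evaluated (complex) value.
    """
    lam = cmath.exp(2j * cmath.pi / r)   # lambda satisfies lambda**r == 1

    def value(node):
        if node == "0.95":
            return 0.95
        if node == "lambda":
            return lam
        op, *children = node
        assert len(children) == m
        vals = [value(c) for c in children]
        total = vals[0]
        for v in vals[1:]:
            total = total + v if op == "+" else total * v
        return total

    return value(tree).real
```

For DMax(3,5), the sum $+_3(\lambda,\lambda,\lambda)$ has a larger magnitude (3) than $+_3(0.95,0.95,0.95)=2.85$, yet products of such sums can have a negative real part; this mismatch between magnitude and real part is the source of the deception.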

DMax(3,4):  In this problem, the value of $r$ is larger than that of $m$. The optimal solutions for level 4 and level 5 can be calculated and are $(3^9)\times 0.95$ and $(3^{27})\times(0.95^3)$, respectively. At level 4, GBGPBC needed 3373 fitness evaluations on average to obtain an optimal solution; PAGE-EM and PAGE-VB used about twice as many. POLE solved the problem in some runs, with a successful rate of only 6%, while GE and GBGP simply failed. At level 5, GBGPBC was even better than PAGE-EM and PAGE-VB: it used 6181 fitness evaluations on average to obtain an optimal solution, whereas PAGE-EM and PAGE-VB used 16521 and 39022 fitness evaluations on average, respectively. The total number of fitness evaluations of GBGPBC was only about one-third of that of PAGE-EM. Again, POLE, GE, and GBGP could not attain the minimum successful rate.

DMax(3,5):  GBGPBC performed well compared with the other approaches at level 4: only 2770 fitness evaluations were required on average. PAGE-EM was the second-best method and needed 6400 fitness evaluations on average. The performance of PAGE-VB was similar, requiring 8514 fitness evaluations on average. POLE could find an optimal solution and needed 7960 fitness evaluations on average, while GE and GBGP could not find any optimal solution. At level 5, GBGPBC was again more effective than the other approaches. POLE, which builds a global dependence model over all nodes in a parse tree, did not work, most likely because it needs a large number of samples to estimate the probability distributions. PAGE-VB needed a large number of fitness evaluations because it has to find the correct number of annotations.

DMax(5,3):  DMax(5,3) is called the strong DMax problem (Hasegawa and Iba, 2009). GBGPBC needed 10204 fitness evaluations on average to find an optimal solution, while POLE, PAGE-EM, and PAGE-VB needed 16521, 16043, and 18293 fitness evaluations on average, respectively. Their performances were close to that of GBGPBC. However, GE and GBGP could not find an optimal solution in any of the 50 runs.

To summarize, the PMBGP methods performed well in general, and GBGPBC performed the best on this benchmark problem.

#### 4.1.2  Royal Tree Problem

The royal tree problem (Punch, 1998) is chosen to benchmark the different methods. The goal is to construct a “perfect royal tree” given a series of functions with increasing arity within a given height. The problem is difficult because of the increasing arity when the level of the tree increases. It captures the idea of order of dependencies of steps in a solution; that is, some steps should come after some basic steps. Details about the evaluation and the grammar can be found in the Supplementary Materials.

Table 3:

Results for the deceptive maximum problem, the royal tree problem, and the bipolar asymmetric royal tree problem, across different methods. The table shows the statistics of the numbers of fitness evaluations needed under different configurations in 50 runs. The numbers before and after $±$ are the mean and standard deviation, respectively. Numbers enclosed in parentheses indicate that the approach failed to reach a successful rate of at least 90% in the 50 independent runs; “x” means none of the runs succeeds. The plus ($+$) mark indicates that GBGPBC is significantly better according to the t-test over the successful runs (p-value $<$ 0.05).

**Deceptive Maximum Problem**

| m | r | d | GBGPBC | POLE | PAGE-EM | PAGE-VB | GE | GBGP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 4 | 4 | 3373 ± 343 | (12000 ± 2828) | 7700 ± 735+ | 8304 ± 940+ | x | x |
| 3 | 4 | 5 | 6181 ± 719 | (29200 ± 2486) | 16521 ± 1473+ | 39022 ± 3361+ | x | x |
| 3 | 5 | 4 | 2770 ± 386 | 7960 ± 605+ | 6400 ± 606+ | 8514 ± 972+ | x | x |
| 3 | 5 | 5 | 6750 ± 845 | x | 18040 ± 1814+ | 43733 ± 15611+ | x | x |
| 5 | 3 |  | 10204 ± 3452 | 16521 ± 1473 | 16043 ± 1189 | 18293 ± 1688 | x | x |

**Royal Tree Problem**

| Level | GBGPBC | POLE | PAGE-EM | PAGE-VB | GE | GBGP |
| --- | --- | --- | --- | --- | --- | --- |
| D | 3909 ± 161 | 12720 ± 1126+ | 6237 ± 18 | (11240 ± 7631)+ | x | (33716 ± 0) |
| E | 6459 ± 6825 | x | 61809 ± 12341+ | (163720 ± 63528)+ | x | x |

**Bipolar Asymmetric Royal Tree Problem**

| Level | GBGPBC | POLE | PAGE-EM | PAGE-VB | GE | GBGP |
| --- | --- | --- | --- | --- | --- | --- |
| D | 2176 ± 1152 | 8360 ± 1998 | (6048 ± 1987) | 15060 ± 9079 | 5823 ± 4535 | 11249 ± 2240 |
| E | 18045 ± 13228 | 9420 ± 6204 | (20846 ± 10915) | (63000 ± 33586) | (16023 ± 8556) | (19960 ± 3138) |

Table 3 shows the results of the different methods. GE failed to return the optimal value at level D and level E, while GBGP and POLE failed at level E only. At level D, GBGPBC performed better than POLE and PAGE-VB: GBGPBC needed 3909 fitness evaluations on average, whereas POLE and PAGE-VB required 12720 and 11240, respectively. PAGE-EM needed 6237 fitness evaluations; the performances of GBGPBC and PAGE-EM were not significantly different. At level E, GBGPBC required 6459 fitness evaluations on average, while PAGE-EM and PAGE-VB needed 61809 and 163720, respectively; POLE, GE, and GBGP could not find an optimal solution at level E at all. The numbers of fitness evaluations of the other methods grew rapidly due to the exponential growth of the search space: PAGE-EM's count grew about tenfold even for the best annotation, and the performance of PAGE-VB, which searches for the best annotations, deteriorated quickly. GBGPBC significantly outperformed the other approaches, which shows that GBGPBC can search effectively.

#### 4.1.3  Bipolar Asymmetric Royal Tree Problem

The bipolar asymmetric royal tree problem (Wong et al., 2014b) is an extension of the traditional royal tree problem (Punch, 1998). We gradually increased the dependencies in order to test the ability to learn dependencies among subtrees of the same level. In the original royal tree problem, all siblings of a node in the perfect tree have to be identical, so the optimal choice for a node depends only on its parent. Briefly, in the extended problem, a perfect subtree of $A$ needs to have $y$ and $z$ as its siblings. Similarly, a perfect subtree of $B$ needs to have $z$ and a perfect subtree of $A$ as its siblings. Some examples of the solutions are shown in Supplementary Figure 9.

The results are reported in Table 3. All approaches can solve this problem in at least one run, but their successful rates vary. At level E, the successful rates of PAGE-EM and PAGE-VB dropped, while POLE still solved the problem in all 50 runs. PAGE-EM and PAGE-VB may fail to capture the essential probabilistic dependencies due to their simplified probabilistic model, which assumes that the same annotation is shared by all right-hand-side nonterminal symbols. In contrast, POLE can capture global dependencies and controls the complexity of the Bayesian network via a scoring function. The canonical GBGP and GE systems can evolve a globally optimal solution because the number of optimal solutions in the search space increases and the children of a perfect subtree are interchangeable, so the crossover operator can be effective. In summary, POLE is the best method among the compared approaches.

In this experiment, GBGPBC achieved a high successful rate: 100% at level D and 98% at level E. Besides, it needed about 2176 fitness evaluations on average to find a solution at level D, whereas POLE required 8360 evaluations. PAGE-EM frequently failed to find an optimal solution, and PAGE-VB used 15060 fitness evaluations. At level D, GE performed better than POLE, PAGE-EM, and PAGE-VB, needing 5823 fitness evaluations on average. Lastly, GBGP achieved a high successful rate and needed 11249 fitness evaluations at level D. But at level E, POLE was superior to GBGPBC and used 9420 fitness evaluations on average. The other approaches, except GBGPBC, often failed and could not attain the minimum 90% successful rate. GBGPBC used 18045 fitness evaluations on average and was the second-best method on this benchmark problem. One potential reason is that, since POLE uses an expanded parse tree and the number of branches of the function nodes (i.e., A, B, C, D, and E) is a constant 2, POLE can capture long-range relationships that rely strongly on the positions of the function nodes in the expanded parse tree in the early stage of the search. In contrast, GBGPBC needs time to collect more samples in order to find the dependencies between multiple rules, which are captured by the depth-based context. In Section 4.2.4, experimental results show that good results can be achieved on this problem if different combinations of context variables are used.

In summary, GBGPBC performed well on problems with strong local dependencies and deceptiveness. Among the tested problems, GBGPBC was often better than the existing approaches. In the next section, the variants of GBGPBC are further analysed.

### 4.2  Experiments on Different GBGPBC Variants

In this section, we repeated the three benchmark problems on different variants of GBGPBC. Different variants make different assumptions about the dependencies. In the following experiments, we demonstrate that the context variables involved in the dependency model (the Bayesian network classifiers) can affect the performance. For clarity of discussion, we simplify the labels of the variants of GBGPBC. The format of the label is GBGPBC/ContextVariables, where GBGPBC is the name of our method and ContextVariables denotes the context variables used in the variant. The depth-based, rule-based, and nonterminal-based context variables are abbreviated as D, R, and T, respectively. For example, the variant GBGPBC/D uses only a depth-based context variable in the Bayesian network classifiers, while GBGPBC/DR uses both depth-based and rule-based context variables. The variant GBGPBC/DR was used in Section 4.1 when comparing with the other approaches. If none of the context variables is used, we label the variant GBGPBC/plain. The other parameters are kept constant in this study.

#### 4.2.1  Deceptive Maximum Problem

The experimental results across the different variants are presented in Table 4. GBGPBC/plain performed poorly on the different DMax problems. When we examined each context variable individually, that is, GBGPBC/D, GBGPBC/R, and GBGPBC/T, we observed that GBGPBC/D performed the best; it could always find an optimal solution while the other two could not. Among the variants using two context variables at the same time (that is, GBGPBC/DR, GBGPBC/RT, and GBGPBC/DT), GBGPBC/DR performed the best. In all cases, GBGPBC/RT could not achieve the required successful rate within the given computational constraints. When all three context variables were included, GBGPBC/DRT could not attain the same effectiveness as GBGPBC/DR. Notice that GBGPBC/D, GBGPBC/DR, and GBGPBC/DT performed well on DMax(3,4) and DMax(3,5), so we argue that the presence of the depth-based context variable is important. If we increased the computational budget, GBGPBC/DRT would eventually find an optimal solution.

Table 4:

Results for the deceptive maximum problem, the royal tree problem, and the bipolar asymmetric royal tree problem, across different variants. The table shows the statistics of the numbers of fitness evaluations needed under different configurations in 50 runs. The numbers before and after $±$ are the mean and standard deviation, respectively. Numbers enclosed in parentheses indicate that the variant failed to reach a successful rate of at least 90% in the 50 independent runs; “x” means none of the runs succeeds.

**Deceptive Maximum Problem**

| m | r | d | Plain | D | R | T | DR | RT | DT | DRT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 4 | 4 | (8057 ± 3336) | 4052 ± 6164 | (10372 ± 17638) | (16281 ± 11246) | 3373 ± 343 | (20955 ± 19138) | 9341 ± 8193 | (16543 ± 9293) |
| 3 | 4 | 5 | x | 6116 ± 943 | x | x | 6181 ± 719 | x | 10057 ± 6660 | (37474 ± 28603) |
| 3 | 5 | 4 | (8600 ± 3799) | 2725 ± 499 | (12072 ± 25564) | 8614 ± 12630 | 2770 ± 386 | (9373 ± 13278) | 12425 ± 7633 | 13135 ± 6133 |
| 3 | 5 | 5 | x | 6416 ± 849 | (151363 ± 31700) | (68925 ± 0) | 6750 ± 845 | (192678 ± 0) | 45803 ± 49994 | (66444 ± 44239) |
| 5 | 3 |  | 53802 ± 12747 | 6327 ± 889 | 8942 ± 1370 | (14419 ± 3381) | 10204 ± 3452 | (62468 ± 50770) | (76823 ± 34065) | (58162 ± 37819) |

**Royal Tree Problem**

| Level | Plain | D | R | T | DR | RT | DT | DRT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| D | 6861 ± 5611 | 2018 ± 170 | 2861 ± 2610 | 6109 ± 9490 | 2025 ± 136 | 6429 ± 9932 | 2041 ± 125 | 2020 ± 114 |
| E | 21148 ± 21398 | 6213 ± 6659 | 7319 ± 7567 | 17124 ± 17363 | 6459 ± 6825 | 20357 ± 20606 | 6218 ± 6468 | 5964 ± 6214 |

**Bipolar Asymmetric Royal Tree Problem**

| Level | Plain | D | R | T | DR | RT | DT | DRT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| D | 997 ± 348 | 2029 ± 744 | 1060 ± 438 | 1167 ± 494 | 2176 ± 1152 | 1283 ± 896 | 2649 ± 1828 | 2485 ± 2628 |
| E | 2348 ± 1735 | 17084 ± 12558 | 4833 ± 5713 | (11051 ± 12368) | 18045 ± 13228 | 10539 ± 12489 | 12357 ± 7135 | 13582 ± 10180 |

#### 4.2.2  Analysis of the Evolution When Solving Deceptive Maximum Problem

Furthermore, we performed additional analysis on the behaviours of GBGPBC during evolution when solving this deceptive problem.

Improvement in fitness during evolution:  Figure 2 shows how the evolution progresses. The two plots show the changes in the average and best fitness values when context information was provided to the search. A typical unsuccessful run of GBGPBC/plain is shown; in this run, GBGPBC/plain was trapped in a local optimum due to the deceptiveness. In the runs obtained from GBGPBC/D and GBGPBC/DR, the fitness values of the best tree(s) improved starting from the 20th generation, while the best tree(s) improved slightly more slowly with GBGPBC/DT. Occasionally, large jumps in the best fitness values were observed in GBGPBC/DR, whereas GBGPBC/D searched progressively and reached the optimum through a sequence of small jumps. In all three configurations, the mean fitness values of the population remained very low even though the fitness values of the best individual(s) were high. This suggests that the interactions among subtrees are essential.
Figure 2:

Example of a failure run for GBGPBC/plain; successful runs for GBGPBC/D, GBGPBC/DR, and GBGPBC/DT in the DMax problem (m = 3, r = 5, d = 5).


We also studied the Bayesian network classifiers when the best fitness of the whole population remained unchanged. It was observed that the conditional probabilities of the set of Bayesian network classifiers changed more frequently than the structure of the classifiers. There were two reasons. Firstly, only the top 10% of parse trees from the previous generation were carried over to the next generation. Secondly, the Bayesian network classifiers were learned from only part of the population: the good and bad parse trees were taken from the 80th-100th and the 20th-40th percentiles, respectively, and the other parse trees were ignored. Nevertheless, a similar set of conditional probabilities reappeared aperiodically after some generations when the set of Bayesian network classifiers was relearned. This suggests that multiple groups of Bayesian network classifiers may be involved in searching for the optimal programs.

Probabilistic dependencies of the rules at the end of a successful evolution:  We also analysed the Bayesian network classifiers which produced the first optimal program, so as to understand how our system represents the deceptive problem in our framework. The data were collected from a typical successful run of GBGPBC/DR; the grammar is given in Figure 1. The fitness value of the optimal structure is equal to $\mathrm{Re}\{(3\times\lambda)^{25}\times(3\times 0.95)^{2}\}\approx 6.88\times 10^{12}$, where $\lambda=\exp(\frac{2\pi i}{5})$. Three rules, $Start.0$, $Exp.0$, and $Exp.1$, were studied because these rules contain at least one nonterminal and are controlled by a Bayesian network classifier. The corresponding Bayesian network classifiers are shown in Figures 3, 4, and 5, respectively. In each figure, each nonterminal variable (e.g., $Exp_0$ and $Exp_1$) is associated with a node of a Bayesian network classifier. Rule $Exp.0$ will almost certainly be chosen for the nonterminal $Exp_0$ in the rule $Start.0$ (Figure 3). The probabilities for both the $Good$ class and the $Bad$ class are high because GBGPBC discovered that rule $Exp.0$ was useful in an early generation.
Figure 3:

Probabilistic dependencies learned for the rule Start.0.

Figure 4:

Probabilistic dependencies learned for the rule Exp.0.

Figure 5:

Probabilistic dependencies learned for the rule Exp.1.

The Bayesian network classifier of the rule $Exp.0$ (Figure 4) shows the conditional probability tables of $Exp_1$, $Exp_2$, and $Exp_3$. The first two columns indicate the states of the parents. The remaining columns show the probability of each possible choice of derivation for the particular configuration of parent states in that row; rows with uniform distributions are omitted for clarity. When the class is $Good$ and the depth is 1 or 2, the rule generates the configuration $(Exp.0,Exp.0,Exp.0)$ for the three nonterminals with high probability. This constructs three layers of the multiplicative structure in the optimal tree (Supplementary Figure 10). Besides, the last row of the conditional probability table of $Exp_1$ (Figure 4) shows that $Exp_1$ was expanded by rule $Exp.0$ in bad programs, while rule $Exp.1$ was correctly identified in good programs.
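The role of such a conditional probability table during derivation can be illustrated with a toy sketch. The probabilities below are invented for illustration and are not the learned values from Figure 4:

```python
import random

# Illustrative CPT indexed by (class, depth), as in Figure 4: when
# the class is Good and the depth is 1 or 2, the multiplicative rule
# Exp.0 is chosen with high probability. All numbers are made up.
CPT = {
    ("Good", 1): {"Exp.0": 0.97, "Exp.1": 0.03},
    ("Good", 2): {"Exp.0": 0.95, "Exp.1": 0.05},
    ("Good", 3): {"Exp.0": 0.05, "Exp.1": 0.95},
}

def choose_rule(cls, depth, rng):
    """Sample a derivation rule from the CPT row for (cls, depth)."""
    dist = CPT[(cls, depth)]
    rules = list(dist)
    return rng.choices(rules, weights=[dist[r] for r in rules], k=1)[0]
```

Sampling derivation choices from such tables is how the learned classifier steers tree construction toward structures seen in good programs.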

The rule $Exp.1$ sums three terminals in the good programs. Referring to the conditional probability tables in Figure 5, rule $Exp.3$ was chosen with a higher probability than rule $Exp.2$ for $Exp_4$ in good programs when the depth was 4. The remaining two nonterminals ($Exp_5$ and $Exp_6$) will also derive with rule $Exp.3$ if $Exp_4$ is derived using rule $Exp.3$. Due to the deceptiveness of the DMax problem, the correct ratio of the number of occurrences of the $+_3(\lambda,\lambda,\lambda)$ subtree to that of the $+_3(0.95,0.95,0.95)$ subtree is $25:2$, so as to attain the best fitness value at depth 4. Among the $2^{27}$ permutations, $\frac{27\times 26}{2}$ permutations satisfy this requirement. From the conditional probability tables, the probabilities of producing $+_3(\lambda,\lambda,\lambda)$ and $+_3(0.95,0.95,0.95)$ at depth 4 conditioned on the good class were 0.721 and 0.0141, respectively. Therefore, the probability of generating the optimal layer at depth 4 was $\frac{27\times 26}{2}\times 0.721^{25}\times 0.0141^{2}=1.969\times 10^{-5}$. Hence, the joint probability of generating the optimal layer at depth 4 and all 27 instantiations being conditioned on the good class was $1.501\times 10^{-5}$. This was a dramatic improvement because the probability of finding these permutations via a uniformly random search is $6.004\times 10^{-47}$.
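The arithmetic above can be checked directly (using the rounded probabilities 0.721 and 0.0141 quoted in the text):

```python
from math import comb

# Probability of generating the optimal bottom layer: choose the 2
# positions (out of 27) for the 0.95-sum subtrees, then weight each
# position by the good-class probability of its subtree.
n_layouts = comb(27, 2)                      # 27*26/2 = 351 valid layouts
p_layer = n_layouts * 0.721**25 * 0.0141**2  # ~1.96e-5 with these rounded
                                             # values; the text reports
                                             # 1.969e-5 (unrounded inputs)
```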

#### 4.2.3  Royal Tree Problem

The experimental results across the different variants are presented in Table 4. This problem can be solved successfully by all variants. Using one context variable at level D, GBGPBC/D and GBGPBC/R completed the task within 3000 fitness evaluations, less than half of the fitness evaluations required by GBGPBC/plain. At level D, using multiple context variables did not improve the performance significantly. At level E, the weakness of GBGPBC/plain was clearly demonstrated: any variant that utilized at least one context variable outperformed GBGPBC/plain. Among these variants, GBGPBC/D, GBGPBC/DR, GBGPBC/DT, and GBGPBC/DRT performed as well as one another, each using fewer than 6500 fitness evaluations on average. This agrees with the nature of the royal tree problem, which relies heavily on depth information to correctly derive an optimal solution.

#### 4.2.4  Bipolar Asymmetric Royal Tree Problem

The experimental results across the different variants are presented in Table 4. This problem can be solved by all variants, although GBGPBC/T, with a successful rate of 88% at level E, fell below the 90% threshold. The plain version of GBGPBC was sufficient to solve this problem at both level D and level E, meaning that modelling only the dependencies among the nonterminals within a production rule provides enough information to derive an optimal solution. At level E, the addition of an extra context variable always increased the number of evaluations because the system needed more samples to learn the dependency model. GBGPBC/plain, GBGPBC/DR, and GBGPBC/DRT attained a successful rate over 95%. GBGPBC/plain used the fewest fitness evaluations, which implies that a simpler model is sufficient to model the dependencies among the nonterminals. Additional context variables sometimes led the search in a less favourable direction, and extra fitness evaluations were needed for the system to acquire the correct dependencies.

### 4.3  Searching Cost and Program Complexity

In this section, we discuss the factors that affect the performance of GBGPBC. We identified six Program Complexity (PC) measures, features extracted or computed from the problem instances that reflect the difficulty of constructing the optimal programs. A correlation analysis was then conducted between the PC measures and the performance of all variants of GBGPBC. The correlation coefficients were used to classify the algorithms according to how their searching cost (i.e., the total number of fitness evaluations) grows as the program complexity grows.

Table 5:

All subtrees fed to the arguments of every function existing in the optimal programs of DMax(3,5) at depth 5.

| Function | Subtree Sequence | Type | Function | Subtree Sequence | Type | Function | Subtree Sequence | Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $+$ | 0.95 0.95 0.95 | $A1$ | $*$ | $A1$ $A1$ $A2$ | $B2$ | $*$ | $B3$ $B3$ $B2$ | $C1$ |
| $+$ | $λ$ $λ$ $λ$ | $A2$ | $*$ | $A2$ $A2$ $A2$ | $B3$ | $*$ | $B3$ $B3$ $B3$ | $C2$ |
| $*$ | $A1$ $A2$ $A2$ | $B1$ | $*$ | $B1$ $B1$ $B3$ | $C1$ | $*$ | $C1$ $C2$ $C2$ | $Optimum$ |
| $*$ | $A2$ $A1$ $A2$ | $B1$ | $*$ | $B1$ $B3$ $B1$ | $C1$ | $*$ | $C2$ $C1$ $C2$ | $Optimum$ |
| $*$ | $A2$ $A2$ $A1$ | $B1$ | $*$ | $B3$ $B1$ $B1$ | $C1$ | $*$ | $C2$ $C2$ $C1$ | $Optimum$ |
| $*$ | $A2$ $A1$ $A1$ | $B2$ | $*$ | $B2$ $B3$ $B3$ | $C1$ | | | |
| $*$ | $A1$ $A2$ $A1$ | $B2$ | $*$ | $B3$ $B2$ $B3$ | $C1$ | | | |
SubtreeSubtreeSubtreeSubtreeSubtreeSubtree
FunctionSequenceTypeFunctionSequenceTypeFunctionSequenceType
$+$ 0.95 0.95 0.95 $A1$ $*$ $A1$$A1$$A2$ $B2$ $*$ $B3$$B3$$B2$ $C1$
$+$ $λ$$λ$$λ$ $A2$ $*$ $A2$$A2$$A2$ $B3$ $*$ $B3$$B3$$B3$ $C2$
$*$ $A1$$A2$$A2$ $B1$ $*$ $B1$$B1$$B3$ $C1$ $*$ $C1$$C2$$C2$ $Optimum$
$*$ $A2$$A1$$A2$ $B1$ $*$ $B1$$B3$$B1$ $C1$ $*$ $C2$$C1$$C2$ $Optimum$
$*$ $A2$$A2$$A1$ $B1$ $*$ $B3$$B1$$B1$ $C1$ $*$ $C2$$C2$$C1$ $Optimum$
$*$ $A2$$A1$$A1$ $B2$ $*$ $B2$$B3$$B3$ $C1$
$*$ $A1$$A2$$A1$ $B2$ $*$ $B3$$B2$$B3$ $C1$
Table 6:

A list of program complexity measures.

| Program Complexity Measure | Description |
| --- | --- |
| PC1 | The total count of subtree sequences |
| PC2 | The total count of distinct subtree types |
| PC3 | The count of subtree sequences over subtree types, i.e., $PC1/PC2$ |
| PC4 | The maximum count of subtree sequences per subtree type |
| PC5 | The entropy of the distribution of subtree types |
| PC6 | The maximum entropy of the distribution of subtree types per function |

The PC measures are obtained from the optimal structures in two stages. In the first stage, we construct a table of subtree patterns, which stores the structures in the optimal programs. To illustrate the idea, the following problem instance is chosen as an example: DMax(3,5) at depth 5, where $m$ and $r$ are 3 and 5, respectively; the output of the procedure is shown in Table 5. To construct this table, we first identify all optimal programs for this problem instance. The optimal programs of our example represent the expression $(\lambda+\lambda+\lambda)^{25} \times (0.95+0.95+0.95)^{2}$. Next, we extract all possible sequences of subtrees (instead of values) of the arguments of every function existing in the optimal programs. Then, a preliminary version of the table of subtree patterns is created by assigning each of these subtrees a unique subtree type, except that the optimal subtree sequences are associated with a special type Optimum. Lastly, we compress the table by grouping equivalent subtree types: two subtree types $T1$ and $T2$ from the same table of subtree patterns are said to be equivalent if a new table of subtree patterns, generated by replacing $T1$ by $T2$ and $T2$ by $T1$, still represents all the structures in the optimal programs. A new subtree type is then introduced to replace the equivalent subtree types. This compression procedure is repeated until no pair of equivalent subtree types can be found. Table 5 shows the result after all these procedures. The tables for the royal tree problem and the bipolar asymmetric royal tree problem can be found in Supplementary Tables 7 and 8, respectively. In the second stage, we compute the six PC measures (described in Table 6) from the table of subtree patterns produced in the first stage. They measure the complexity of dependencies in the optimal structures. Note that PC5 and PC6 refer to Shannon entropy (Shannon, 2001).
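To make the second stage concrete, the sketch below computes the six PC measures of Table 6 from a table of subtree patterns represented as rows of (function, subtree sequence, subtree type). The miniature table is made up for illustration and is not the DMax(3,5) table; only the measure definitions follow the text.

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    """Shannon entropy (base 2) of a count distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def pc_measures(patterns):
    """Compute PC1-PC6 from rows of (function, subtree sequence, subtree type)."""
    pc1 = len(patterns)                               # PC1: total subtree sequences
    type_counts = Counter(t for _, _, t in patterns)  # sequences per subtree type
    pc2 = len(type_counts)                            # PC2: distinct subtree types
    pc3 = pc1 / pc2                                   # PC3: PC1 / PC2
    pc4 = max(type_counts.values())                   # PC4: max sequences per type
    pc5 = entropy(type_counts.values())               # PC5: entropy of type distribution
    by_function = defaultdict(Counter)                # type distribution per function
    for func, _, stype in patterns:
        by_function[func][stype] += 1
    pc6 = max(entropy(c.values()) for c in by_function.values())  # PC6
    return pc1, pc2, pc3, pc4, pc5, pc6

# A made-up miniature table (NOT the DMax(3,5) table shown in Table 5).
toy = [("+", ("x", "x"), "T1"), ("*", ("x", "y"), "T2"),
       ("*", ("y", "x"), "T2"), ("*", ("y", "y"), "Optimum")]
```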
The PC measures of all benchmark problems are extracted and depicted in Table 7.

Table 7:

(Left) The extracted PC measures of all benchmark problems. (Right) A correlation matrix showing the relationship between PC measures and the performance of different variants of GBGPBC.

How do the PC measures relate to the performance of GBGPBC? The correlation matrix between each PC measure and the searching cost of each variant of GBGPBC is reported in Table 7. Spearman's rank correlation is adopted because it captures the strength and direction of a monotonic relationship. Spearman's rank correlation coefficient for each pair is shown in the centre of each cell in Table 7. The lower the magnitude of a coefficient, the better the performance when searching for complex programs: if the searching cost of a search algorithm is less correlated with a complexity measure, the algorithm behaves more like a constant-searching-cost algorithm (see Note 1) with respect to that measure.
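As a minimal illustration of the statistic behind Table 7, Spearman's rank correlation is the Pearson correlation of the ranks, with tied values receiving their average rank. The pure-Python sketch below assumes two paired lists of observations (e.g., a PC measure and a variant's searching cost across problem instances); the study itself may have used a statistics package.

```python
def rank(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1                 # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rank correlation = Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```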

Referring to Tables 4 and 7, all PC measures are positively correlated with the searching cost, but the magnitudes of the correlation coefficients vary. PC1, PC2, PC3, PC4, and PC6 are strongly correlated with the searching cost of GBGPBC/R compared with the other variants. As discussed in Section 4.2, GBGPBC/D, GBGPBC/DR, and GBGPBC/DT were more successful in finding the optimal solutions; these variants have relatively low correlation coefficients with PC6. Furthermore, as shown in the experiments, GBGPBC/D and GBGPBC/DR attained a success rate of at least 90% in 50 independent runs. The correlation coefficients of the PC measures with these two variants are noticeably different from the others: they are among the lowest. GBGPBC/D and GBGPBC/DR are less susceptible to the increase in program complexity, which leads to more robust search performance.

## 5  Direct Marketing Application

In this section, GBGPBC is applied to a real-world data mining application in direct marketing. The amount of digital data collected has been increasing. Data owners, who may not be data scientists, are interested in extracting knowledge from this vast amount of data, which often characterizes the behaviours of a group of entities of interest. From the data, the owners want to understand, explain, forecast, and manipulate the behaviours of these entities to achieve their goals. For example, direct marketers advertise their products or services to customers through direct communication, such as direct mail, catalogues, and email advertisements. Direct marketers would like to save promotion expenses and minimize the chance of annoying customers who are likely to ignore their advertisements. Since they have a fixed and limited budget, only a small portion of customers (e.g., 10%) will be contacted. The success of a direct marketing campaign is mainly determined by the responsiveness of the customers.

Many factors affect the response from customers. First, whether a product is appealing to the customers: customers purchase a product only when they are interested in it. Second, some customers purchase frequently because they are highly loyal to the company. Third, some purchases are seasonal; say, customers are busy at work during Christmas but are willing to travel during summer, so a travel agency will plan when and to whom to promote travel and tourism related services using direct marketing. Because the relationships among the customers, direct marketers, and products are complex, many factors may affect the success of a direct marketing campaign. Direct marketers are interested in constructing a mathematical model to increase the total profit earned from a campaign. They often maintain a massive database of their customers' information and carry out direct marketing campaigns based on knowledge extracted from this valuable information, such as contacts, records of purchases, and responses to previous marketing campaigns. Even for data analysts with knowledge of direct marketing, the vast amount of data cannot be easily comprehended without the assistance of computers. In this section, we apply GBGPBC to perform symbolic regression so as to construct a formula that ranks the customers given the customers' information and the budget constraints.

### 5.1  Direct Marketing Database

A large direct marketing dataset from a U.S.-based catalogue company was obtained from the Direct Marketing Educational Foundation. The company has a wide selection of merchandise, such as gifts, apparel, and consumer electronics. Catalogues are regularly sent via mail to customers to show them the latest promotions of the products. The dataset contains the records of 106,284 consumers with their purchase history over 12 years. In addition, demographic information from the 1995 U.S. Census and credit information from a commercial vendor give extra information about the customers. In total, there are 361 variables in each record, including dichotomous, nominal, and continuous variables. The most recent promotion sent a catalogue to every customer in this dataset, and the response rate was 5.4%; the data are highly skewed. The direct marketers want to know how to increase earnings given the historical data of the customers. During data preprocessing, a direct marketing expert was invited to select relevant variables based on domain knowledge. Moreover, endogenous variables were corrected following the procedure proposed by Blundell and Powell (2004), because they are correlated with the model error and can bias the parameter estimates. For a fair comparison of learning capability, all machine learning algorithms used the same preprocessed data as their input during evaluation.

Figure 6:

Flow of applying machine learning to the direct marketing problem. Ten-fold cross-validation was performed on the ranked list of customers.


### 5.2  Evaluation

The goal is to achieve a higher total profit in the first decile (i.e., the top 10%). The flow is shown in Figure 6. To achieve this goal, we applied our algorithms to learn a scoring function from the customers in the training set to predict the ranking of the customers. Ten-fold cross-validation was performed to make the results more reliable. The higher a customer's rank, the more money the direct marketers expect to earn from him or her, so the scoring function needs to accurately identify the customers who should be ranked high in order to increase the total profit. To evaluate the learned scoring function, we applied decile analysis: we first applied the function to all customers in the testing set and ranked the customers by their scores. Because only 10% of customers will be contacted by the direct marketers due to the limited budget, only the first decile (top 10%) of customers in the ranked list were selected. Finally, the total profit was calculated as the sum of the profits earned (unknown to the scoring function) from these top 10% of customers. A better algorithm thus learns a function that yields a higher total profit.
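The decile analysis described above can be sketched as follows. The representation of a customer and the scoring function are placeholders for whatever the real pipeline uses; only the select-the-top-10%-and-sum-held-out-profit logic follows the text.

```python
def first_decile_profit(customers, score):
    """Decile analysis: rank the test-set customers by the learned scoring
    function, keep the top 10% (the contact budget), and sum the held-out
    profits of the selected customers.

    `customers` is a list of (features, profit) pairs; the representation
    of `features` and the `score` function are placeholders.
    """
    ranked = sorted(customers, key=lambda c: score(c[0]), reverse=True)
    cutoff = max(1, len(ranked) // 10)      # first decile under the budget
    return sum(profit for _, profit in ranked[:cutoff])
```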

GBGPBC was compared with three existing evolutionary-algorithm-based methods: GBGP, GE, and a Genetic Algorithm specialized for this problem (GA) (Bhattacharyya, 1999). The GBGPBC/DR variant was adopted. Other PMBGP approaches, such as POLE, PAGE-EM, and PAGE-VB, were not included in the comparison because they do not support random number generation and are therefore unsuitable for evolving ranking functions containing many coefficients. For GA, the population size was set to 100; we tried different numbers of generations and report the results when the number of generations was set to 100. For GBGPBC, GBGP, and GE, the population size and the number of generations were set to 100 and 50, respectively. The other parameters for GBGPBC are shown in Table 2. To eliminate the possibility that the problem has little structure, a Random method was also applied: a baseline that uses the grammar to generate as many random grammar-conforming solutions as there are fitness evaluations.
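The Random baseline can be illustrated by sampling derivations from a context-free grammar with a depth bound. The toy grammar below is hypothetical (the study's actual grammar for ranking functions is not reproduced here); at the depth limit the sampler is forced onto the least recursive rules so every derivation terminates.

```python
import random

# A hypothetical toy grammar for ranking expressions; the grammar used
# in the study is NOT reproduced here.
GRAMMAR = {
    "<expr>": [["<expr>", "+", "<expr>"], ["<expr>", "*", "<expr>"],
               ["<var>"], ["<const>"]],
    "<var>": [["recency"], ["frequency"], ["monetary"]],
    "<const>": [["0.5"], ["1.0"], ["2.0"]],
}

def random_program(symbol="<expr>", depth=0, max_depth=4):
    """Derive one random sentence from the grammar, bounding recursion depth."""
    if symbol not in GRAMMAR:
        return symbol                               # terminal symbol
    rules = GRAMMAR[symbol]
    if depth >= max_depth:                          # force the least recursive rules
        fewest = min(sum(s in GRAMMAR for s in r) for r in rules)
        rules = [r for r in rules if sum(s in GRAMMAR for s in r) == fewest]
    rule = random.choice(rules)
    return " ".join(random_program(s, depth + 1, max_depth) for s in rule)
```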

Besides, based on the results in the previous section, GBGPBC serves as the representative of PMBGP approaches. The population size and the maximum number of generations were set to 100 and 50, respectively, for all evolutionary algorithms. The other parameters were the same as those presented in Table 2. Six well-known data mining algorithms, namely logistic regression, Bayesian network, naive Bayes, neural network, bagging, and support vector machine, were also included for comparison. Table 8 shows the average total profit of the two sets of methods. A number in bold font indicates that the method achieves the best result in the first decile compared with the other methods.

Table 8:

Average total profit (in US$) of ten-fold cross-validation compared with other methods. A p-value in bold font indicates that the average total profit is significantly smaller than that of GBGPBC at the 0.05 level using a paired t-test.

| Method | Average | Standard Deviation | p-value |
| --- | --- | --- | --- |
| GBGPBC | 9265 | 1139 | |
| GBGP | 9069 | 1040 | 0.0019 |
| GE | 9021 | 1106 | 0.0002 |
| GA | 8539 | 1308 | 0.0004 |
| Random | 8895 | 1058 | 0.0062 |
| Logistic Regression | 8673 | 821 | 0.0065 |
| Bayesian Network | 8295 | 694 | 0.0017 |
| Naive Bayes | 7027 | 857 | 0.0000 |
| Neural Network | 8787 | 1000 | 0.0013 |
| Bagging | 8595 | 853 | 0.0038 |
| Support Vector Machine | 8299 | 1056 | 0.0001 |

Referring to Table 8, GBGPBC achieves an average total profit of US$9265 by contacting only the first decile of customers in the testing sets. Each test set has about 10,000 records, so the first decile contains about 1,000 customers. In other words, if the direct marketing company contacts the first decile alone, it earns US$9265 on average. Both GBGP and GE also generate strategies that allow the direct marketer to earn over US$9000. When the models are applied in production, there will be many more customers (e.g., 1,000,000 records), scaling up the difference in average total profit accordingly [e.g., GBGPBC would earn about $(9265-9069)\times 100 = \text{US\$}19{,}600$ more than GBGP, based on the rounded averages] for each direct marketing campaign. Moreover, GBGPBC performed significantly better than GBGP, GE, and the GA specialized to the direct marketing problem, as shown in Table 8. The comparison between GBGPBC and the Random method demonstrates that grammar-based probabilistic models are useful in capturing the underlying properties of the direct marketing problem.
From the results, the six data mining techniques do not perform well for this task (see Note 2). Logistic regression, neural network, and bagging generate an average total profit of over US\$ 8500. Bayesian network and support vector machine are slightly worse. Naive Bayes, which does not consider the interactions among the variables, performed poorly.

To further validate the results, we compared our approach with the traditional data mining algorithms under different combinations of parameters. One hundred and fifteen configurations of bagging were attempted; our approach significantly outperforms 83 of them, while no configuration of bagging significantly outperforms our approach. Eleven configurations of Bayesian networks were attempted; our approach significantly outperforms 9 of them, while no configuration of Bayesian networks significantly outperforms our approach. Thirteen configurations of support vector machines were attempted, and our approach significantly outperforms all of them. The detailed results can be found in the Supplementary Materials. In summary, GBGPBC generates more earnings than the other learning algorithms.
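The significance claims in this section rest on paired t-tests over the ten cross-validation folds. A minimal sketch of the test statistic is shown below; converting it to a p-value requires the t-distribution CDF with $n-1$ degrees of freedom (e.g., from a statistics library), which is omitted here.

```python
import math

def paired_t_statistic(a, b):
    """Paired t-test statistic for matched samples, e.g. the per-fold total
    profits of GBGPBC versus another method over the ten folds."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance of diffs
    return mean / math.sqrt(var / n)                     # df = n - 1
```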

## 6  Conclusions and Future Work

In the current study, the GBGPBC system is proposed. We demonstrated that probabilistic classification (the TAN classifier) can be advantageous for modelling probabilistic dependencies in PMBGP. The Bayesian network classifiers are capable of capturing the probabilistic dependencies among the fitness of the parse trees, the contextual information about the structure of the programs, and the choices of derivations in the grammar. We also conducted a comprehensive study of the GBGPBC system through three benchmark experiments. GBGPBC has been shown to be competitive and well-rounded compared with POLE, PAGE-EM, PAGE-VB, GE, and GBGP. As shown in Section 4, GBGPBC is more robust than the other approaches because it obtained the optimal solution with a probability of at least 90% across the benchmark problems, and it often uses a smaller number of fitness evaluations than the other approaches. The tested problems require the capability of learning dependencies from the structures of the optimal solutions. Our analysis of the history of evolution also suggested that the system can dynamically utilize multiple probability distributions to improve the search power. In addition, we proposed six PC measures and a correlation analysis to study how the search performance of GBGPBC is related to program complexity. Interestingly, we found that consistently weak correlations between the program complexity measures and the searching cost are associated with a high success rate. GBGPBC/D and GBGPBC/DR are less susceptible to the increase in program complexity, which leads to more robust search performance. Overall, GBGPBC/DR is recommended since it can learn dependencies among rules.

The system was applied to support the selection of customers so as to raise the profit in direct marketing, which shows its business value. Using GBGPBC as a decision support system has several advantages over previous approaches. Grammar-based approaches can incorporate expert knowledge in the grammar (Wong and Leung, 2000): it is possible to add new grammar rules (e.g., rules extracted from customer interviews) customized for the problem of interest. Furthermore, analysis of the Bayesian network classifiers may reveal hidden probabilistic dependencies among the attributes affecting the decision making (e.g., why some customers are not responsive to the direct marketing) under the constraints imposed by the grammar. In the context of direct marketing, this may inspire the direct marketers to design new strategies and new options to change the behaviours of their customers, which cannot be directly extracted from the data.

There are plenty of future directions. First of all, the effects of bloat due to unevaluated structures in GBGPBC can be studied. If the nonterminals associated with the unevaluated structures are always disjoint from the nonterminals associated with the evaluated structures, bloat may not have a strong impact on the Bayesian network classifier learning, because the joint probability distribution can be factorized into two independent parts: $p(X_{\bar{E}}, X_E) = p(X_{\bar{E}})\,p(X_E)$, where $X_{\bar{E}}$ and $X_E$ are the nonterminals associated with the unevaluated and evaluated branches, respectively. If the sets of nonterminals are not disjoint, the number of relevant samples will be reduced, and the unevaluated structures will be inherited by the new generation. New techniques need to be invented to mitigate these effects.

Secondly, tracking how the Bayesian network classifiers perform during evolution may be interesting. Performance tracking could enable us to control the evolution behaviour more precisely, for instance, by dynamically deciding which context information to include in the Bayesian network classifiers. However, evaluating the performance of the classifiers requires a test set, and determining the test set is not trivial because the class labels of the programs are assigned relative to the other programs within the same population. Besides, a program is generated by a group of classifiers, so the accuracy of an individual classifier may not be a meaningful measure for predicting its performance in the next generation.

Instead of learning the Bayesian network classifiers from scratch, adopting online training for them is another direction for future work. In online training, each new observation triggers an update of the Bayesian network classifiers, which would reduce the time to learn dependencies from the structures of programs. Moreover, it may help retain the information from previous generations. However, adopting online training in GBGPBC is not trivial due to the relative nature of the class labels, as mentioned previously: a program labelled as good in an early generation may become bad as the evolution progresses. A suitable online training algorithm needs a mechanism to unlearn the past.

## Acknowledgments

We thank Dr. Leung-Yau Lo, who gave advice on the implementation and the evaluation of the system. This research is supported by the LEO Dr. David P. Chan Institute of Data Science, the General Research Fund LU310111 from the Research Grants Council of the Hong Kong Special Administrative Region, the Direct Grant of The Chinese University of Hong Kong, and the Institute of Future Cities of The Chinese University of Hong Kong.

## Notes

1. If a search algorithm runs in $O(1)$, its searching cost is bounded (i.e., it reaches its success rate within a maximum number of fitness evaluations) independently of the complexity measure $n$, and the Spearman's coefficient tends to 0. The Spearman's coefficient tends to 1 if the searching cost grows as $O(n)$ or $O(n^2)$.

2. Some feature selection and feature transformation methods may improve the performance of the other models.

## References

Allen, R., and Kennedy, K. (2002). Optimizing compilers for modern architectures: A dependence-based approach, Vol. 289. San Francisco: Morgan Kaufmann.

Attias, H. (1999). Inferring parameters and structure of latent variable models by variational Bayes. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 21–30.

Baluja, S. (1994). Population-based incremental learning: A method for integrating genetic search based function optimization and competitive learning. Technical Report CMU-CS-94-163, Carnegie Mellon University, Pittsburgh, PA.

Bhattacharyya, S. (1999). Direct marketing performance modeling using genetic algorithms. INFORMS Journal on Computing, 11(3):248–257.

Blundell, R. W., and Powell, J. L. (2004). Endogeneity in semiparametric binary response models. The Review of Economic Studies, 71(3):655–679.

Bonabeau, E., Dorigo, M., and Theraulaz, G. (1999). Swarm intelligence: From natural to artificial systems. Number 1. Oxford: Oxford University Press.

Booth, T., and Thompson, R. (1973). Applying probability measures to abstract languages. IEEE Transactions on Computers, 22(5):442–450.

Bosman, P. A. N., and de Jong, E. D. (2004). Grammar transformations in an EDA for genetic programming. In Proceedings of the 2004 Genetic and Evolutionary Computation Conference Workshop, pp. 26–30.

Davis, L. (1991). Handbook of genetic algorithms. New York: Van Nostrand Reinhold.

De Bonet, J. S., Isbell Jr., C. L., and Viola, P. A. (1997). MIMIC: Finding optima by estimating probability densities. In Advances in Neural Information Processing Systems, pp. 424–430.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B, 39:1–38.

Friedman, N., Geiger, D., and Goldszmidt, M. (1997). Bayesian network classifiers. Machine Learning, 29(2–3):131–163.

Gathercole, C., and Ross, P. (1996). An adverse interaction between crossover and restricted tree depth in genetic programming. In Proceedings of the First Annual Conference on Genetic Programming, pp. 291–296.

Goldberg, D. E. (1987). Simple genetic algorithms and the minimal, deceptive problem. In L. Davis (Ed.), Genetic algorithms and simulated annealing, pp. 74–88. London: Pitman.

Goldberg, D. E. (1989). Genetic algorithms in search, optimization, and machine learning. London: Addison-Wesley Longman Publishing Co., Inc., 1st ed.

Hasegawa, Y. (2012). Programming with annotated grammar estimation. In S. Ventura (Ed.), Genetic programming: New approaches and successful applications, chapter 3, pp. 49–74. London: InTech.

Hasegawa, Y., and Iba, H. (2006). Estimation of Bayesian network for program generation. In Proceedings of the Third Asian-Pacific Workshop on Genetic Programming, pp. 35–46.

Hasegawa, Y., and Iba, H. (2008). A Bayesian network approach to program generation. IEEE Transactions on Evolutionary Computation, 12(6):750–764.

Hasegawa, Y., and Iba, H. (2009). Latent variable model for estimation of distribution algorithm based on a probabilistic context-free grammar. IEEE Transactions on Evolutionary Computation, 13(4):858–878.

Holland, J. H. (1992). Genetic algorithms. Scientific American, 267(1):66–72.

Keijzer, M., O'Neill, M., Ryan, C., and Cattolico, M. (2002). Grammatical evolution rules: The mod and the bucket rule. In Proceedings of the Fifth European Conference on Genetic Programming, pp. 123–130.

Kim, K., Shan, Y., Nguyen, X. H., and McKay, R. I. (2014). Probabilistic model building in genetic programming: A critical review. Genetic Programming and Evolvable Machines, 15(2):115–167.

Koza, J. R. (1992). Genetic programming: On the programming of computers by means of natural selection. Cambridge, MA: MIT Press.

Koza, J. R., Bennett, F. H., Andre, D., Keane, M. A., and Dunlap, F. (1997). Automated synthesis of analog electrical circuits by means of genetic programming. IEEE Transactions on Evolutionary Computation, 1(2):109–128.

Larrañaga, P., and Lozano, J. A. (2001). Estimation of distribution algorithms: A new tool for evolutionary computation. Norwell, MA: Kluwer Academic Publishers.

Li, X., Mabu, S., and Hirasawa, K. (2011). Use of infeasible individuals in probabilistic model building genetic network programming. In Proceedings of the 2011 Annual Conference on Genetic and Evolutionary Computation, pp. 601–608.

Looks, M., Goertzel, B., and Pennachin, C. (2005). Learning computer programs with the Bayesian optimization algorithm. In Proceedings of the 2005 Annual Conference on Genetic and Evolutionary Computation, pp. 747–748.

Lourenço, N., Pereira, F. B., and Costa, E. (2016). Unveiling the properties of structured grammatical evolution. Genetic Programming and Evolvable Machines, 17(3):251–289.

McKay, R. I., Hoai, N. X., Whigham, P. A., Shan, Y., and O'Neill, M. (2010). Grammar-based genetic programming: A survey. Genetic Programming and Evolvable Machines, 11(3–4):365–396.

Michalewicz, Z., and Schoenauer, M. (1996). Evolutionary algorithms for constrained parameter optimization problems. Evolutionary Computation, 4(1):1–32.

Mühlenbein, H., and Mahnig, T. (1999). FDA: A scalable evolutionary algorithm for the optimization of additively decomposed functions. Evolutionary Computation, 7(4):353–376.

O'Neill, M., and Brabazon, A. (2006). Grammatical differential evolution. In Proceedings of the 2006 International Conference on Artificial Intelligence, Vol. 1, pp. 231–236.

O'Neill, M., Hemberg, E., Gilligan, C., Bartley, E., McDermott, J., and Brabazon, A. (2008). GEVA: Grammatical evolution in Java. ACM SIGEVOlution, 3(2):17–22.

O'Neill, M., and Ryan, C. (2003). Grammatical evolution: Evolutionary automatic programming in an arbitrary language. Boston: Springer.

Pelikan, M. (2005). Hierarchical Bayesian optimization algorithm: Toward a new generation of evolutionary algorithms. Berlin, Heidelberg: Springer.

Punch, W. F. (1998). How effective are multiple populations in genetic programming. In Proceedings of the Third Annual Conference on Genetic Programming, pp. 308–313.

Ratle, A., and Sebag, M. (2001). Avoiding the bloat with stochastic grammar-based genetic programming. In Selected Papers from the 5th European Conference on Artificial Evolution, pp. 255–266.

Regolin, E. N., and Pozo, A. T. R. (2005). Bayesian automatic programming. In Proceedings of the 2005 European Conference on Genetic Programming, pp. 38–49.

Rothlauf, F., and Oetzel, M. (2006). On the locality of grammatical evolution. In Proceedings of the 2006 European Conference on Genetic Programming, pp. 320–330.

Ryan, C., Collins, J., and O'Neill, M. (1998). Grammatical evolution: Evolving programs for an arbitrary language. In Proceedings of the 1998 European Conference on Genetic Programming, pp. 83–96.

Salustowicz, R., and Schmidhuber, J. (1997). Probabilistic incremental program evolution. Evolutionary Computation, 5(2):123–141.

Sastry, K., and Goldberg, D. E. (2003). Probabilistic model building and competent genetic programming. In Genetic Programming Theory and Practice, pp. 205–220.

Sato, H., Hasegawa, Y., Bollegala, D., and Iba, H. (2012). Probabilistic model building GP with belief propagation. In Proceedings of the 2012 IEEE Congress on Evolutionary Computation, pp. 1–8.

Shan, Y., McKay, R. I., Abbass, H. A., and Essam, D. (2003). Program evolution with explicit learning. In Proceedings of the 2003 Congress on Evolutionary Computation, Vol. 3, pp. 1639–1646.

Shan, Y., McKay, R. I., Baxter, R., Abbass, H., Essam, D., and Nguyen, H. X. (2004). Grammar model-based program evolution. In Proceedings of the 2004 Congress on Evolutionary Computation, Vol. 1, pp. 478–485.

Shannon, C. E. (2001). A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review, 5(1):3–55.

Solomonoff, R. (1978). Complexity-based induction systems: Comparison and convergence theorems. IEEE Transactions on Information Theory, 24(4):422–432.

Thorhauer, A., and Rothlauf, F. (2014). On the locality of standard search operators in grammatical evolution. In Proceedings of the 13th International Conference on Parallel Problem Solving from Nature, pp. 465–475.

Wattanapornprom, W., Olanviwitchai, P., Chutima, P., and Chongstitvatana, P. (2009). Multi-objective combinatorial optimisation with coincidence algorithm. In Proceedings of the 2009 IEEE Congress on Evolutionary Computation, pp. 1675–1682.

Whigham, P. A. (1995a). Grammatically-based genetic programming. In Proceedings of the Workshop on Genetic Programming: From Theory to Real-world Applications, Vol. 16, pp. 33–41.

Whigham, P. A. (1995b). Inductive bias and genetic programming. In First International Conference on Genetic Algorithms in Engineering Systems: Innovations and Applications, pp. 461–466.

Whitley, L. D. (1991). Fundamental principles of deception. In Proceedings of the Foundations of Genetic Algorithms, pp. 221–241.

Wong, M. L., and Leung, K. S. (1995). Applying logic grammars to induce sub-functions in genetic programming. In Proceedings of the 1995 IEEE Conference on Evolutionary Computation, Vol. 2, pp. 737–740.

Wong, M. L., and Leung, K. S. (2000). Data mining using grammar based genetic programming and applications. Norwell, MA: Kluwer Academic Publishers.

Wong, P. K. (2019). Semantics-aware grammar-based genetic programming and its applications. PhD thesis, The Chinese University of Hong Kong.

Wong, P. K., Leung, K. S., and Wong, M. L. (2019). Probabilistic grammar-based neuroevolution for physiological signal classification of ventricular tachycardia. Expert Systems with Applications, 135:237–248.

Wong, P. K., Lo, L. Y., Wong, M. L., and Leung, K. S. (2014a). Grammar-based genetic programming with Bayesian network. In Proceedings of the 2014 IEEE Congress on Evolutionary Computation, pp. 739–746.

Wong, P. K., Lo, L. Y., Wong, M. L., and Leung, K. S. (2014b). Grammar-based genetic programming with dependence learning and Bayesian network classifier. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, pp. 959–966.

Yanai, K., and Iba, H. (2003). Estimation of distribution programming based on Bayesian network. In Proceedings of the 2003 Congress on Evolutionary Computation, pp. 1618–1625.

Yanase, T., Hasegawa, Y., and Iba, H. (2009). Binary encoding for prototype tree of probabilistic model building GP. In Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, pp. 1147–1154.