Abstract
Genetic variation operators in grammar-guided genetic programming are fundamental for guiding the evolutionary process in search and optimization problems. However, they show some limitations, mainly derived from an unbalanced trade-off between exploration and local search. This paper presents an estimation of distribution algorithm for grammar-guided genetic programming to overcome this difficulty and thus increase the performance of the evolutionary algorithm. Our proposal employs an extended dynamic stochastic context-free grammar to encode and calculate the estimation of the distribution of the search space from some promising individuals in the population. Unlike traditional estimation of distribution algorithms, the proposed approach improves exploratory behavior by smoothing the estimated distribution model. Therefore, this algorithm is referred to as SEDA, a smoothed estimation of distribution algorithm. Experiments have been conducted to compare overall performance using a typical genetic programming crossover operator, an incremental estimation of distribution algorithm, and the proposed approach after tuning their hyperparameters. These experiments involve challenging problems that test the local search and exploration features of the three evolutionary systems. The results show that grammar-guided genetic programming with SEDA achieves the most accurate solutions with an intermediate convergence speed.
1 Introduction
Evolutionary computation (Bäck et al., 1997) is a subfield of natural computing (Kari and Rozenberg, 2008) that borrows ideas from natural evolution (Darwin, 1859) to perform search and optimization processes on populations of individuals representing candidate solutions to a specific problem. The individuals belong to the search space, and the set of all solutions encoded by the individuals is the solution space. An encoding scheme establishes the relationship between these two spaces. Genetic Programming (GP) (Koza, 1992) and Grammar-Guided Genetic Programming (GGGP) (Whigham, 1995) are evolutionary algorithms that belong to this discipline. GP employs programs of variable size to encode possible solutions to a problem. Grammar-Guided Genetic Programming is an extension of GP designed to optimize programs belonging to a search space defined by a context-free grammar (CFG) (Hopcroft et al., 2006; Sipser, 2013; Krithivasan, 2009; Moll et al., 2012). The CFG establishes the set of syntactical restrictions that all individuals (derivation trees) must satisfy. Hence, GGGP addresses the closure requirement (or solves the closure problem), which means that all individuals generated during the evolutionary process match the problem restrictions (Vanyi and Zvada, 2003; Poli et al., 2008); that is, trees are derivations of the grammar.
GGGP has become crucial as a method for formalizing constraints in GP. It is a branch of interest in evolutionary algorithms research due to its successful applications in different areas (McKay et al., 2010), resulting, in some cases, in patentable inventions (Koza et al., 2006). GGGP has shown great potential in designing knowledge bases of rules and fuzzy rules for medical purposes (Font et al., 2010), and other types of intelligent systems highly useful today, such as Bayesian networks (Font et al., 2011). Neuroevolution is another important field where GGGP has successfully been applied (Barrios Rolanía et al., 2018), especially with the rise of deep artificial neural networks and deep learning. GGGP applications cover a wide diversity of domains, ranging from rule extraction for medical purposes (Wong and Leung, 2000) and architecture (Hemberg et al., 2008) to circuit design (Tetteh et al., 2022) and ecological modeling (McKay et al., 2006).
Like other evolutionary algorithms, GGGP begins by generating the initial population of derivation trees, usually at random, following a particular distribution (Koza et al., 2006; García-Arnau et al., 2007; Ramos Criado et al., 2020). The evolutionary process then takes place, which comprises three primary operations: selection, variation, and replacement. The selection operation chooses some promising individuals of the population, known as parents. The variation operation generates a set of new individuals with inherited characteristics from the parents, known as offspring. Finally, the replacement operator inserts these new individuals into the population and removes other less adapted ones. Algorithm 1 describes the general operation of GGGP, where $c_i$ is an individual of the population $P_t$, $t$ is the iteration number (generation), and $f(c_i)$ is its fitness value. A minimization problem is assumed: if $f(c_i) < f(c_j)$, then $c_i$ is better adapted, or has a better fitness, than $c_j$.
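As a minimal illustration of this select–vary–replace cycle, the sketch below expresses Algorithm 1 in Python; the `initialize`, `select`, `vary`, and `replace` callbacks are hypothetical stand-ins for the grammar-aware operators described above, not the authors' implementation.

```python
def gggp(fitness, initialize, select, vary, replace,
         pop_size=100, generations=500):
    """Sketch of Algorithm 1 for a minimization problem."""
    population = [initialize() for _ in range(pop_size)]  # e.g., grammatically uniform
    for _ in range(generations):
        parents = select(population, fitness)        # choose promising individuals
        offspring = vary(parents)                    # crossover/mutation or EDA sampling
        population = replace(population, offspring, fitness)
        # the population size stays constant across generations
    return min(population, key=fitness)              # lower fitness = better adapted
```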
The variation operators are crucial to guide the evolutionary process towards the optimum solution or its neighborhood. Therefore, they must perform a local search in promising areas of the search space, focusing on the parent individuals' characteristics to produce new individuals with similar characteristics. Nevertheless, variation operators must also explore the search space (global search) to generate new, diverse individuals that differ from those already existing in the current population. If the variation operators are excessively local, the evolutionary algorithm is likely to converge early to suboptimal solutions and have difficulty escaping from them. If, on the other hand, the variation operators are very explorative when generating new individuals, the evolutionary algorithm will take too long to find the optimal solution. Therefore, the performance of the evolutionary process relies to a large extent on its variation operators having an adequate balance between exploitation and exploration; namely, an exploitation–exploration trade-off (McKay et al., 2010). They must be able to explore the search space to find promising areas and eventually escape from local optima, but also focus on those promising areas to find the global optimum.
Crossover and mutation are two variation operators employed in GGGP (Whigham, 1995). They do not provide any control over the exploitation–exploration trade-off. A crossover operation may produce utterly different derivation trees given two similar parents. The result of a mutation operation completely varies depending on the mutated derivation tree node. Mutation of nodes close to derivation tree leaves usually produces small variations, while modification of nodes more proximate to the root produces significant changes. However, the latter is less likely than the former given the structure of a tree (there are more nodes close to the leaves than close to the root) and the fact that the choice of the mutation node is random. Additionally, the mutation operation acts with very low probability. Therefore, its influence on the evolutionary process is minimal, and evolution mainly relies on the crossover operator. Whigham's (1995) crossover (WX) has been widely tested and, in most cases, achieves satisfactory results (Couchet et al., 2007). However, there is still margin for improvement (White et al., 2013) since the offspring produced by this operator might not be similar to its parents. Because of this fact, the optimization may focus not on exploiting promising individuals but on exploring new search space areas (Ramos Criado, 2017). Thus, the evolutionary process may show erratic behavior, and some difficulties in progressing towards the optimal solution arise. This common issue in GGGP is related to the genotypic or syntactic locality of crossover operators (Uy et al., 2010), namely, the crossover's ability to perform small changes to the genotype (Galván-López et al., 2009, 2011; Galván et al., 2013).
GGGP has also been criticized for being a very restrictive environment: new variation operators are rarely designed, since they have to deal with derivation trees with a fixed structure and grammar constraints (McKay et al., 2010). Several research lines have focused on addressing these limitations, such as the linearization of CFG derivation trees (O'Neill and Ryan, 2003; Ryan et al., 1998) and the replacement of CFGs by tree-adjoining grammars (Joshi and Schabes, 1997). These techniques change the encoding scheme and provide an enhanced environment in which new variation operators can be designed. Other approaches are related to the development of improved GGGP optimization methods, as in ant-colony (Dorigo et al., 2006) or grammatical swarm (O'Neill and Brabazon, 2006) algorithms. Another promising optimization technique for GGGP is the family of estimation of distribution algorithms (EDA) (Hauschild and Pelikan, 2011).
1.1 Estimation of Distribution Algorithms
Estimation of distribution algorithms (EDA) use probabilistic models to drive the evolutionary process towards promising solutions. In each iteration, an EDA learns a probabilistic model that shapes the distribution of the current population or of some selected individuals' characteristics. Then, it replaces the existing population, or a subset of it, with a new one sampled according to the previously learned probabilistic model. Therefore, EDAs base the evolutionary process on learning probabilistic models from selected individuals to guide the generation of new ones.
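A generic EDA generation can be sketched as follows; `learn_model` and `sample` are hypothetical placeholders for the model-specific estimation and sampling routines.

```python
def eda_generation(population, fitness, learn_model, sample,
                   n_select, n_offspring):
    """One EDA iteration: estimate a distribution from promising
    individuals, then sample the replacements from it (minimization)."""
    promising = sorted(population, key=fitness)[:n_select]   # selected individuals
    model = learn_model(promising)                           # probabilistic model
    offspring = [sample(model) for _ in range(n_offspring)]  # new individuals
    survivors = sorted(population, key=fitness)[:len(population) - n_offspring]
    return survivors + offspring                             # replace the worst
```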
EDA-GP approaches employ probabilistic models to improve the performance of genetic programming algorithms and scale up with the problem size (Sastry and Goldberg, 2006; Shan et al., 2003). Tree encoding is the primary technique for representing probabilistic models in GP (Kim et al., 2014). Some approaches based on Bayesian principles (Hasegawa and Iba, 2008) learn computer programs that control the distribution of instances generated through a tree representation (Looks et al., 2005). This latter work represents programs as binary trees, called zigzag trees, which possess limited representation capabilities. Probabilistic incremental program evolution (PIPE) (Salustowicz and Schmidhuber, 1997) uses standard GP functions and generates successive populations according to an adaptive probability distribution over the search space. It assumes the independence of tree nodes to reduce the complexity of the probabilistic model and employs pruning methods. However, these assumptions may cause genetic loss, since promising subtrees may not be produced. In most cases, GP tree approaches suffer from the closure problem and may generate infeasible individuals.
Some EDA-GGGP methods apply stochastic CFG to learn the probabilistic model that drives the GGGP evolutionary process (Ratle and Sebag, 2001, 2002; Tanev, 2004). However, the probability of selecting any production rule of a CFG or stochastic CFG at a precise depth of a derivation not only depends on the likelihood of choosing that production rule but also involves the probabilities of selecting the previous production rules to reach the current depth (McKay et al., 2010). Therefore, the probabilistic model may be biased, especially in recursive CFG.
In general terms, the EDA optimization process displays fast convergence to optimal solutions (Kim and McKay, 2013; Kim et al., 2014). However, there are still limitations related to the exploitation–exploration trade-off that prevent substantial improvements in its performance (O'Neill et al., 2010). Unlike GGGP crossover operators, there is no subtree recombination in EDA-GGGP; instead, new individuals are generated from the root to the leaves, following the estimated distribution of the parent derivation trees. Consequently, EDA, and even incremental EDA, shows a reduction of population diversity that may lead to an excessively local search and increase the probability of converging to local optima (Ramos Criado, 2017).
1.2 Contributions
This paper presents a smoothed estimation of distribution algorithm for grammar-guided genetic programming, SEDA. It has two essential features that distinguish it from other EDA approaches to provide an adequate trade-off between exploration and exploitation (local search). First, SEDA calculates an extended version of the CFG that encodes the solution space. This context-free grammar expansion (CFGE) represents the search space of the problem at hand in a graph-like structure that stores the probabilities of the distribution model, maintaining the dependence of node information. These probabilities are calculated from a subset of selected promising individuals of the current population. This definition simplifies the representation of the search space and facilitates its application or extrapolation to different search techniques. Second, SEDA applies a smoothing method to reduce the spikes in the calculated probability distribution by slightly increasing the low probabilities and decreasing the high probabilities accordingly. Then, SEDA generates new individuals with representative characteristics present in the population by following the estimated smoothed probability distribution.
SEDA adopts an EDA approach to perform a local search, since it generates new individuals from a probabilistic model learned from the current population. At the same time, SEDA's smoothing method increases the genetic diversity to avoid premature convergence to suboptimal solutions. Moreover, a hyperparameter is provided to control the smoothing process, and therefore the exploitation–exploration trade-off in the optimization process. These two features, working together, provide crucial support to adequately guide the evolutionary process.
Experiments have been conducted to compare the overall performance of the GGGP evolutionary process when using WX, an incremental EDA, and SEDA. The probabilistic model of the incremental EDA approach is based on the CFGE employed in SEDA to consider the dependence of nodes (dependency-aware). Thus, the incremental EDA involved in the experiments is similar to SEDA but does not employ the smoothing method, which makes its positive impact visible. The results show that, although SEDA performs more evaluations than incremental EDA, the former achieves more accurate results. Hence, incremental EDA exhibits a lack of exploration that leads the evolutionary process to fall into local optima, a limitation that SEDA overcomes, especially when large derivation trees are involved. WX achieves, in most cases, unsatisfactory results in both convergence speed and accuracy of the final solutions when dealing with large search spaces, thus revealing its excessive exploration.
The rest of the paper is structured as follows: Section 2 defines the cardinality of a production rule and symbol as the basis for discussing the concept of locality in genetic variation operators and its relation to the exploitation–exploration trade-off problem. Following this, Section 3 details the proposed smoothed EDA, along with an example. The experimentation process, in Section 4, is broken down into two stages and involves three different challenging problems. Section 4.1 describes the setup stage, consisting of tuning the hyperparameters of the three GGGP approaches to be compared. Then, Section 4.2 discusses the performance comparison results gathered from six optimization experiments for each of the three problems under study. Finally, Section 5 provides some concluding remarks, contributions, and future lines of research.
2 Locality and the Exploitation–Exploration Trade-Off Problem
This section discusses two techniques commonly used in EA from the point of view of their exploration and local search capabilities: GGGP with WX and EDA. The former tends to overly explore the search space, which might hinder the convergence of the algorithm, while the latter boosts local search, which might lead the evolutionary process to local optima.
Let $G = (N, T, S, P)$ be a CFG, where $N$ is the set of nonterminal symbols, $T$ is the set of terminal symbols, $S \in N$ is the axiom of the grammar, and $P$ is the set of production rules of the form $A \to \alpha$, such that $A \in N$ and $\alpha \in (N \cup T)^{*}$. The asterisk represents the Kleene closure operation.
The cardinality of a production rule or symbol is the number of different terminal strings $x \in T^{*}$ that can be derived starting from that production rule or symbol, where $\Rightarrow^{+}$ denotes the transitive closure of the one-step derivation relation, in which one or more production rules are applied. The cardinality of a production rule or symbol remains finite, even if the derivation involves recursive productions, because GGGP generally sets restrictions on the sizes of the derivation trees to avoid code bloat.
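For instance, consider the toy productions $S \to aS \mid b$ with a recursion bound of 3 (an illustrative grammar, not one used in the experiments). The terminal strings derivable from $S$ are $b$, $ab$, $aab$, and $aaab$, so the cardinality of $S$ is 4, while the non-recursive production $S \to b$ alone has cardinality 1.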
For each derivation $x$, the cardinality pair $(N_x, \mu_x)$ is defined, where $N_x \subseteq N$ is the set of nonterminal symbols occurring in $x$, and the function $\mu_x$ gives the number of occurrences of each nonterminal symbol in $x$.
Variation operators are intended to lead the evolutionary process towards new promising individuals, so that each new population is expected to improve the previous population's overall fitness. However, this requirement is not easy to meet and, in fact, many variation operators do not achieve it. GGGP crossover operators, such as WX, are genetic-based variation operators that usually rely on swapping subtrees of parent derivation trees to produce new offspring. In the case of WX, the crossover nodes (the roots of the swapped subtrees) must contain the same nonterminal symbol to ensure a syntactically feasible offspring (the closure requirement). Given two parent derivations $x_1$ and $x_2$, with cardinality pairs $(N_{x_1}, \mu_{x_1})$ and $(N_{x_2}, \mu_{x_2})$, $N_{x_1} \cap N_{x_2}$ is the set of nonterminal symbols that belong to both parent derivations and, therefore, can be selected as crossover nodes. If the cardinality of the nonterminal symbol within a crossover node is low, then the offspring individuals are likely to be similar to their parents, as small changes to the genotype are usually expected from nonterminals with low cardinality. The property of a variation operator that produces small changes in the genotype of the offspring (with respect to the parents) is known as genotypic locality. An exploitative behavior appears when locality is maintained, and an exploratory behavior when it is not. When the cardinality of the crossover node is high, locality is less likely to be maintained. According to this reasoning, genotypic locality is unlikely to be maintained for large search spaces, which are actually the most useful in real-world applications.
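The following sketch illustrates a Whigham-style subtree swap under these constraints. The `Node` class and the `is_nonterminal` predicate are illustrative assumptions, and practical details such as depth limits and defensive tree copying are omitted.

```python
import random

class Node:
    """Derivation-tree node: a grammar symbol plus its subtrees."""
    def __init__(self, symbol, children=None):
        self.symbol = symbol
        self.children = children or []

def nonterminal_nodes(tree, is_nonterminal):
    """Collect every node labeled with a nonterminal (preorder)."""
    found = [tree] if is_nonterminal(tree.symbol) else []
    for child in tree.children:
        found += nonterminal_nodes(child, is_nonterminal)
    return found

def whigham_crossover(parent_a, parent_b, is_nonterminal, rng=random):
    """Swap subtrees rooted at nodes sharing the same nonterminal,
    so both offspring remain valid derivations (closure requirement)."""
    pairs = [(x, y)
             for x in nonterminal_nodes(parent_a, is_nonterminal)
             for y in nonterminal_nodes(parent_b, is_nonterminal)
             if x.symbol == y.symbol]
    if not pairs:
        return parent_a, parent_b            # no compatible crossover node
    x, y = rng.choice(pairs)
    x.children, y.children = y.children, x.children  # symbols already match
    return parent_a, parent_b                # parents are modified in place
```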
2.1 Estimation of Distribution Algorithms Behavior
The performance of the EDA optimization process mainly relies on the probabilistic model abstraction. A more detailed representation of the population distribution, which stores more information about the derivation tree structure, typically the location or dependence of the nodes (McKay et al., 2010), tends to produce individuals that are likely to be similar to previous promising individuals. As a result, it boosts the exploitative (local search) behavior. Therefore, it is more probable that it will converge to local optima, since close search space areas are more likely to be explored. Conversely, a vaguer representation of the population distribution, which stores less information about the derivation tree structure, for example, assuming independence between nodes (Salustowicz and Schmidhuber, 1997), facilitates the production of individuals extremely different from their ancestors. Consequently, the optimization process shows a more exploratory behavior that reduces the probability of converging to local optima, but also reduces the convergence speed. In most cases, the EDA approaches for GP provide a more detailed representation of the population distribution that reduces the exploratory behavior of the optimization process. Accordingly, applying additional methods that increase exploratory behavior, such as mutation, plays an essential role in EDA approaches for GP.
EDAs sample new individuals following a probabilistic model. When adopting a generational replacement strategy, unlike the WX approach, EDAs produce a whole new population that replaces the previous generation. As a result, only the characteristics most widespread in the population are likely to be transferred to the next generation, and the probabilistic model iteratively focuses on these characteristics. Furthermore, using a detailed representation of the population distribution can produce a substantial diversity loss, which means that EDA may lose promising individuals that will probably not be reproduced in the following generations.
Table 1 compares the average number of evaluations performed to reach a stop criterion, using GGGP with an incremental EDA approach and with WX as variation operators, when searching for specific uniformly generated target derivation trees with 5 and 10 recursions. The stop criterion is met for both GGGP approaches when the target derivation tree is found or 10,000 evaluations have passed. WX selects two individuals from the population and produces two new ones by swapping subtrees of the parents. Incremental EDA selects 50% of the population to generate the same number of new individuals as offspring. Since WX generates fewer individuals than EDA in each generation, it would be unfair to use generations as the time unit. Instead, the number of fitness-function evaluations performed in each generation is used, which matches the number of new individuals produced.
| | Evaluations (5 recursions) | Fitness (5 recursions) | Evaluations (10 recursions) | Fitness (10 recursions) |
|---|---|---|---|---|
| EDA | 3,200 (4,664.8) | 3.89 (0.35) | 9,800 (1,400) | 5.79 (1.21) |
| WX | 9.72 (20.23) | 0.32 (0.47) | 136.24 (58.89) | 3.76 (1.23) |
The fitness of an individual measures the distance between the sentence it encodes and the target; the evolutionary process seeks to minimize this fitness, whose optimum value is 0.
The population size is 100 individuals, initially generated by following the production rules until reaching a terminal string derivation. When several production rules share the same nonterminal symbol on their left-hand side, each is selected equiprobably. Both GGGP approaches employ tournament selection of size 5, which selects the best-fit individual out of five. EDA performs a number of tournaments equivalent to 50% of the population size to calculate the probabilistic model, while WX executes the two necessary to carry out the crossover. Neither uses mutation, in order to reduce the random component of the evolutionary process and facilitate the comparison. Replacement substitutes the worst individuals with the offspring, keeping the population size constant. 100 executions have been performed for each experiment. Table 2 summarizes the hyperparameters employed.
| Pop. | Initialization | Selection | Replacement | Execs. | Stop criteria |
|---|---|---|---|---|---|
| 100 | Equiprobable | Tournament (5) | Sub. the worst indiv. with off. | 100 | Target found or 10,000 evaluations |
Table 1 reports that WX consistently outperforms EDA, since WX achieves lower fitness values than EDA with fewer evaluations. WX can recombine parent derivation trees to produce new derivation trees with a different number of recursions, which may improve the individuals' fitness. The EDA probabilistic model is likely to generate small trees because the population initialization is biased towards derivation trees with few recursions; note that the likelihood of generating a derivation tree with five or more recursions in the initial population is very low. Therefore, EDA is unable to produce larger derivation trees even if more evaluations are performed.
In addition, EDA never finds the target derivation trees when their size is increased to 20 or more recursions, always reaching the 10,000-evaluation stop criterion. This result suggests that EDA gets trapped in local optima: shallow derivation trees with fitnesses greater than zero (the optimum), surrounded by others with a similar number of recursions but worse fitness, from which it cannot escape because of its excessively local search. On the contrary, although WX takes more evaluations to stop when dealing with large target derivation trees, it can find the solution. However, the average fitness and number of evaluations of the solutions WX achieves grow exponentially as the size of the target derivation tree increases, which indicates that excessive exploration cannot deal with large search spaces either.
3 The Smoothed Estimation of Distribution Algorithm
As an estimation of distribution algorithm, SEDA estimates the population distribution to learn or adapt a probabilistic model that encodes the most promising individuals' characteristics. After doing so, SEDA follows the probabilistic model to produce a set of individuals as the new offspring. SEDA also utilizes a smoothing method that aims to create an adequate balance between exploration and exploitation. It increases the offspring diversity to reduce the likelihood of premature convergence. Simultaneously, SEDA's probabilistic-model-based production of individuals enhances the local search capabilities to speed up the evolutionary process.
First, an overview of how SEDA works is provided; the following subsections present a more detailed description. Before the evolutionary process begins, SEDA constructs a tree graph from the specific CFG that defines the problem's search space: all derivation trees that the CFG can produce. This graph is the context-free grammar expansion (CFGE). Then, in each generation, SEDA produces the offspring from a set of parent derivation trees selected by the selection operator, which serves as a sample of the current population. This process comprises two main steps. First, SEDA annotates the CFGE according to the parent derivation trees. A probability is assigned to each production rule depending on its location in the CFGE. Thus, the annotated CFGE yields a probabilistic model from which SEDA produces the offspring. These probabilities are calculated from the absolute frequencies with which the production rules were applied at each specific node to generate the parent individuals. Then, SEDA applies the smoothing method to the previously calculated frequencies. This method reduces the spikes in the probability distribution and avoids null probabilities for non-sampled production rules. The final probability value of a production rule at a given location depends on the calculated frequency of that production relative to the frequencies of the other production rules that could have been applied at the same node. The second step of SEDA comprises the generation of the offspring. The annotated CFGE is traversed from the root to the leaf nodes according to the previously calculated probabilities to generate each individual. The resulting path represents a new derivation tree of the offspring.
3.1 The Context-Free Grammar Expansion
The CFG employed in a given GGGP algorithm is extended to represent a probabilistic model that approximates the current population distribution. The CFG search space, together with the production rules applied to generate each sentence of the grammar's language, is represented as a tree graph. This graph-like representation of all derivation trees produced by a CFG is the CFG expansion (CFGE), which also encodes the probabilistic model employed to generate SEDA's offspring. Each node of this tree is labeled by coordinates $(d, n)$, where $d$ is the depth of the node, from the root at $d = 0$ to the maximum depth (the number of nodes in the path from the root to the deepest leaf, not counting the root), and $n$ is the node number at depth $d$, numbered from left ($n = 0$) to right. The maximum depth of a CFGE is limited by a recursion bound, which establishes the maximum number of recursive productions (recursions) in each possible derivation tree encoded by the CFGE. Once the bound is reached, only the finite number of non-recursive productions needed to reach the leaves remain applicable. Therefore, although a recursive CFG could generate an infinite CFGE, the recursion bound prevents it from doing so.
A CFGE node with coordinates $(d, n)$ contains either a terminal symbol $t \in T$, denoted $t_{(d,n)}$, or a nonterminal $A \in N$, denoted $A_{(d,n)}$. In the latter case, the node also includes the set $P_A$ of production rules whose left-hand side is $A$. As with terminal symbols, each rule $p \in P_A$ is denoted $p_{(d,n)}$ to indicate its integration in the node. This last type of node is called a meta-node. The root of the CFGE is a meta-node containing the axiom of the CFG, $S$, and the set of production rules $P_S$. The leaves are always nodes containing a terminal symbol. The intermediate nodes are always meta-nodes similar to the root: each contains a nonterminal $A$ together with its set $P_A$. For every rule $p: A \to \alpha$ in a meta-node, each symbol of $\alpha$ generates a child node or meta-node. If the symbol is a terminal $t$, then the child node contains this terminal symbol. If the symbol is a nonterminal $B$ and the recursion bound has not been reached, then the child meta-node contains $B$ and the set $P_B$. If the recursion bound has been reached for some production rules, they are removed from $P_B$.
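A simplified construction of the CFGE can be sketched as follows, assuming the grammar is given as a dictionary from nonterminals to lists of right-hand sides, and treating a production as recursive when its right-hand side mentions its own left-hand nonterminal (a simplification of general recursion detection).

```python
def build_cfge(grammar, symbol, recursion_budget):
    """Expand `symbol` into a CFGE subtree (illustrative sketch).

    `grammar` maps nonterminals to right-hand sides, e.g.
    {"S": [["a", "S"], ["b"]]}; any other symbol is a terminal.
    Recursive productions are dropped once the budget is exhausted,
    which keeps the expansion finite. Node coordinates (d, n) are
    implicit in the structure and omitted for brevity.
    """
    if symbol not in grammar:                       # terminal: a leaf node
        return {"symbol": symbol, "rules": None}
    rules = []
    for rhs in grammar[symbol]:
        recursive = symbol in rhs                   # simplified recursion test
        if recursive and recursion_budget == 0:
            continue                                # recursion bound reached
        budget = recursion_budget - 1 if recursive else recursion_budget
        children = [build_cfge(grammar, s, budget) for s in rhs]
        rules.append({"rhs": rhs, "children": children, "freq": 0.0})
    return {"symbol": symbol, "rules": rules}       # a meta-node
```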
3.2 The Annotated Context-Free Grammar Expansion
Given any CFGE, such as the example of Figure 1a, SEDA calculates and annotates a probability for each production rule within the meta-nodes. These probabilities, together with their locations in the CFGE, represent the distribution of the current population. The annotated probability of a production rule $p_{(d,n)}$ is the probability of applying $p$, located at the meta-node $(d, n)$, to generate a specific derivation for the offspring, among all possible derivations starting from the nonterminal in that meta-node. The construction of the annotated CFGE comprises three major stages: the calculation of the frequencies, the smoothing of the obtained frequencies, and the calculation of the probabilities of the model. This process begins after the selection operator chooses the subset of parent individuals (derivation trees) as representative of the current population to build the probabilistic model:
The calculation of the frequencies involves traversing each selected parent individual from the root (located at depth 0) to the leaves. This process checks which production rule has been applied at each depth level to each nonterminal node. Each time a production rule $p$ is applied in a parent individual at depth $d$ and node number $n$, the associated frequency $f_{(d,n)}(p)$ of the corresponding rule in the CFGE (initially set to 0) is increased by 1. Thus, once every selected parent derivation tree has been traversed, the CFGE contains the absolute frequencies of the applied rules according to the depth and node number where they were used. In the example of Figure 1a, considering the two derivation trees shown in Figure 1b to sample the current population (supposedly chosen by the selection operator), and letting $p_1$ denote the recursive production rule of meta-node (1,0) and $p_2$ the other rule of that meta-node, the values of these frequencies are the following:
$f_{(1,0)}(p_1) = 2$ is the frequency of the recursive production rule $p_1$, located at the meta-node (1,0) of the CFGE. The value is 2 because $p_1$ has been applied at meta-node (1,0) in both parent derivation trees.
$f_{(1,0)}(p_2) = 0$ is the frequency of the production rule $p_2$, also located at the meta-node (1,0) of the CFGE. The value is 0 because $p_2$ has never been applied at meta-node (1,0) to produce either parent derivation tree.
This example shows that the frequency value of $p_2$ is zero. Calculating the probabilistic model directly from these absolute frequencies would generate new individuals without any chance of applying production rule $p_2$ at (1,0). This scenario causes a loss of exploration of the search space, increasing the probability that the evolutionary algorithm reaches a suboptimal solution.
- A smoothing rate $\varepsilon$ is applied in (4) to the previously calculated frequencies to obtain a smoothed frequency, or weight, associated with each rule of the CFGE:
$$w_{(d,n)}(p_i) = f_{(d,n)}(p_i) + \varepsilon \sum_{p_j \in P_{(d,n)}} f_{(d,n)}(p_j), \qquad (4)$$
where $P_{(d,n)}$ denotes the set of production rules of the meta-node $(d, n)$. As pointed out in Sections 3.3 and 4, fair values for $\varepsilon$ are within the range [0, 0.1].
Following the example of Figure 1, the weights for the meta-node (1,0) are calculated using (4) with $\varepsilon = 0.1$: $w_{(1,0)}(p_1) = 2 + 0.1 \cdot (2 + 0) = 2.2$, since $f_{(1,0)}(p_1) = 2$. Similarly, $w_{(1,0)}(p_2) = 0 + 0.1 \cdot (2 + 0) = 0.2$.
Note that the smoothed frequency of $p_2$ is now 0.2, instead of the absolute frequency $f_{(1,0)}(p_2) = 0$ calculated in the previous step. In most cases, if the smoothing rate $\varepsilon > 0$, then all weights are strictly positive. Calculating the probabilistic model from the smoothed frequencies provides a non-zero probability for rule $p_2$ at (1,0), which permits SEDA to apply it when generating the offspring even though it has not been involved in the generation of the parent individuals.
- The last stage of the construction of the annotated CFGE comprises the calculation of the probabilities $P_{(d,n)}(p_i)$ associated with each production rule of the CFGE. This annotated CFGE forms the basis of the probabilistic model from which SEDA obtains the offspring:
$$P_{(d,n)}(p_i) = \frac{w_{(d,n)}(p_i)}{\sum_{p_j \in P_{(d,n)}} w_{(d,n)}(p_j)}. \qquad (5)$$
If the denominator $\sum_{p_j \in P_{(d,n)}} w_{(d,n)}(p_j) = 0$, an option is to assign the same probability to all rules of the meta-node, so that
$$P_{(d,n)}(p_i) = \frac{1}{|P_{(d,n)}|}, \qquad (6)$$
where $|P_{(d,n)}|$ is the cardinality of $P_{(d,n)}$. It is also possible, under the same condition, to apply any other approach to generate the rest of the derivation tree, such as Grammatically Uniform Population Initialization, which uniformly generates derivation trees (Ramos Criado et al., 2020).
The probabilities for the ongoing example in the meta-node (1,0) are $P_{(1,0)}(p_1) = 2.2/2.4 \approx 0.92$, while $P_{(1,0)}(p_2) = 0.2/2.4 \approx 0.08$. This means that, instead of an equiprobable scenario, the recursive rule will most likely be applied at depth 1 to generate the offspring.
Another interesting result is that production rule $p_2$ at (1,0) has a probability of 0.08 of being chosen to generate the offspring, despite not intervening in the generation of the parent individuals. If $\varepsilon = 0$, then $P_{(1,0)}(p_1) = 1$ and $P_{(1,0)}(p_2) = 0$. Conversely, the higher $\varepsilon$, the lower $P_{(1,0)}(p_1)$ and the higher $P_{(1,0)}(p_2)$, both tending to the value 0.5 in the limit. Therefore, the smoothing factor controls the variety of the offspring with respect to their parents, helping to balance the exploitation–exploration trade-off.
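In code, the smoothing of (4) and the normalization of (5) and (6) for a single meta-node reduce to a few lines. The sketch below reproduces the values of the running example under the equations as reconstructed above.

```python
def smoothed_probabilities(freqs, epsilon):
    """Smooth the absolute rule frequencies of one meta-node (Eq. 4)
    and normalize them into probabilities (Eqs. 5 and 6)."""
    total = sum(freqs)
    if total == 0:                                  # Eq. (6): uniform fallback
        return [1.0 / len(freqs)] * len(freqs)
    weights = [f + epsilon * total for f in freqs]  # Eq. (4): no null weights
    weight_sum = sum(weights)
    return [w / weight_sum for w in weights]        # Eq. (5)

# Meta-node (1,0) of the running example: frequencies (2, 0), epsilon = 0.1
print(smoothed_probabilities([2, 0], 0.1))          # [0.9166..., 0.0833...] ~ (0.92, 0.08)
```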
3.3 The Offspring
According to the probabilistic model, the annotated CFGE is traversed from the root meta-node to the leaf nodes to generate each offspring individual. For every meta-node visited, a production rule is applied according to its annotated probability. The resulting traversed path in the annotated CFGE represents a new derivation tree of the offspring. This process is executed as many times as there are derivation trees to be generated for the offspring. Figure 2 shows an example of derivation tree generation using the annotated CFGE after its probabilistic model has been updated according to the two parent derivation trees selected to sample the current population. Figure 2a shows the updated annotated CFGE, in which the probability associated with each production rule of the meta-nodes has been updated according to the parent derivation trees. The resulting annotated CFGE is traversed a single time to generate the new derivation tree of Figure 2b. Firstly, in this example, the production rule located at the root meta-node (0,0) of the CFGE is chosen with a probability of 1 according to the probabilistic model, and applied. Then, the rules at meta-nodes (1,0), (2,1), and (3,1) are consecutively selected and applied. The resulting traversed path is highlighted with bold lines on the annotated CFGE, and represents the derivation tree of Figure 2b.
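Given a CFGE annotated with a probability per rule (following the dictionary structure sketched in Section 3.1, with an assumed `prob` field holding the smoothed probabilities), one offspring derivation tree can be sampled as follows.

```python
import random

def sample_derivation(node, rng=random):
    """Walk the annotated CFGE from the root, drawing one production
    per meta-node according to its annotated probability, and return
    the resulting derivation tree (illustrative sketch)."""
    if node["rules"] is None:                       # terminal leaf
        return node["symbol"]
    weights = [rule["prob"] for rule in node["rules"]]
    rule = rng.choices(node["rules"], weights=weights, k=1)[0]
    subtrees = [sample_derivation(child, rng) for child in rule["children"]]
    return (node["symbol"], subtrees)               # new derivation-tree node
```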
Algorithm 2 describes SEDA at generation $t$, producing an offspring of any number of individuals. The CFGE, the smoothing rate $\varepsilon$, and the learning rate are the inputs, together with a set of parent derivations previously chosen by the selection operator to sample the current population. The algorithm returns as the offspring a new set of derivation trees based on the parents' characteristics.
The smoothing rate in stage 2 of the construction of the annotated CFGE reduces the probability spikes in the estimated population distribution computed in the third stage. This rate tunes SEDA's behavior regarding its ability to explore the search space to avoid local optima versus its local search of similar individuals to speed up convergence. Within $\varepsilon \in [0, 0.1]$, the smaller the value, the deeper the local search; conversely, SEDA's exploration ability increases as $\varepsilon$ grows. In the experiments carried out, if $\varepsilon > 0.1$, SEDA's exploration capability is so high that the algorithm might not converge. The following scenario reveals the essential impact of SEDA's smoothing factor.
Suppose a production rule $p$ located at a meta-node $(d, n)$ of the CFGE that has not been used, or has a low frequency compared to the rest of the productions of the same meta-node, when producing the parent derivation trees. In this case, the frequency value of $p$ in the annotated CFGE, $f_{(d,n)}(p)$, is zero or very low compared to the rest of the frequency values. Suppose also that some or all of the other production rules of the same meta-node are involved in the generation of the parent derivation trees and, therefore, have frequencies greater than zero. According to (5), applying $p$ at $(d, n)$ to generate the offspring is unlikely in this scenario. Under these conditions, the offspring are strictly based on the parents' characteristics, which means that the algorithm performs a local search and tends to converge prematurely to a local optimum. This scenario is usual in EDA, which is equivalent to SEDA with $\varepsilon = 0$.
4 Experimental Results
The experimentation conducted comprises two stages to compare the performance of the overall GGGP evolutionary process when using WX, an incremental EDA, or SEDA in the sixth line of Algorithm 1 (Section 1). WX swaps subtrees of two parent derivation trees whose roots must contain the same nonterminal symbol to generate a new offspring of size two. Given the definition of SEDA in Algorithm 2, EDA can be considered a particular case where the smoothing rate $\varepsilon = 0$.
It is standard practice in EDA to impose upper and lower thresholds on probabilities to prevent values of 0 and 1 in the offspring generation. In contrast, SEDA uniformly transforms all probabilities of each meta-node according to the population distribution, regardless of whether any bound is exceeded. Although bounds in EDA may be a valid alternative, the experiments conducted do not apply them, as their intermittent application might introduce noise into the evolutionary process.
Similar to the comparison carried out in Subsection 2.1, the common GGGP hyperparameters employed with the three approaches are the following: population size of 100 individuals, size 5 tournament selection, replacement of worst individuals with offspring, and no mutation.
The first stage consists of tuning the hyperparameters of each evolutionary algorithm. In the case of GGGP with WX and EDA-GGGP, this stage studies whether the replacement rate influences the quality of the solutions achieved. The replacement rate denotes the proportion of the population to be replaced every generation, that is, the size of the offspring to be generated by WX, EDA, or SEDA. In the case of WX, which generates offspring of size 2, the selection method (line 5 of Algorithm 1) and crossover (line 6 of Algorithm 1) must be repeated as necessary. A generational replacement strategy in WX, which implies high replacement rates, boosts the exploration ability of the evolutionary algorithm. On the contrary, the same replacement approach drives EDA to a diversity loss since most individuals in the next generation follow the same probability distribution. In the case of SEDA, besides the replacement rate, the study also involves the smoothing rate since it is an essential hyperparameter to control the balance between exploration and local search.
The second stage employs the best hyperparameters configuration achieved in the previous phase for each GGGP approach to compare their performance in searching for specific derivation trees of different depths with several search space sizes. Performance is measured in terms of fitness of the final solutions and number of evaluations needed to reach the stop criterion. Evaluations are employed instead of generations, since the computational cost of WX generations is lower than EDA-GGGP and SEDA generations. Nevertheless, evaluations of the new generated individuals are comparable in terms of computational cost for the three approaches.
Both stages employ three different benchmark problems to tune the hyperparameters of each evolutionary algorithm and to compare their performance, respectively. Varied and difficult real-world problems have been selected to provide a set of representative experiments. Table 3 shows the CFGs that define these problems, with three levels of complexity and different features, to observe how the proposed algorithm, SEDA, behaves in common scenarios of evolutionary algorithms. The first grammar encodes dense, deep feedforward neural network architectures. It contains two recursive production rules: one to determine the number of hidden layers, and another to define the number of neurons per layer. The training dataset determines the number of input and output neurons. The slash terminal separates layers, and the other terminal represents a neuron. The second problem employs a CFG that encodes knowledge bases of rules. This grammar also defines two recursive production rules: one determines the rules included in the knowledge base, and the other the clauses in each rule. The knowledge-base grammar includes a higher number of terminal symbols than the neural-network grammar (13 vs. 2), which increases the size of the search space and hinders the search for an optimal solution. Finally, the third grammar shapes the expressions of a symbolic regression optimization problem with two recursive productions that create unary and binary operations, respectively. Moreover, the binary-operation production is doubly recursive, which considerably increases the complexity of the search space.
These three benchmark problems have been used in other research to solve real-world problems. The neural-network grammar has been successfully employed in neuroevolution to approximate some sequences related to the theory of orthogonal polynomials (Barrios Rolanía et al., 2018); usual applications include the interpolation and approximation of functions, as well as the construction of quadrature formulas and other problems related to numerical integration. Font et al. (2010) apply a more specific version of the knowledge-base grammar for the evolutionary construction of self-adapting knowledge-based systems for the detection of knee injuries. Finally, Ramos Criado (2017) and Ramos Criado et al. (2020) employ the symbolic-regression grammar as a benchmark problem to highlight the limitations of GGGP when dealing with large search spaces. It is important to note that no hierarchical optimization has been performed, in order to avoid bias from, for example, learning the neural network parameters in the neural-network problem.
4.1 Hyperparameter Setup
The first stage of the experimentation aims to tune the hyperparameters of the three GGGP approaches in order to compare WX, EDA, and SEDA. The replacement rate is one of the hyperparameters under study, since it appears to influence GGGP performance; rates of 20%, 40%, 60%, 80%, and 100% are considered. The smoothing rate is the other hyperparameter tuned in this stage, for SEDA only, since it is one of the primary contributions of the proposed approach. Four values within (0, 0.1] are tested to cover both high values, which increase SEDA's exploration ability, and values close to zero, which lead the evolutionary process towards local search. $\varepsilon = 0$ means that the offspring are always generated following the CFGE branches already traversed by the parents; it is the usual strategy in general EDA, which is also adopted in the experiments conducted with GGGP-EDA. Therefore, $\varepsilon = 0$ is not considered in the SEDA smoothing-factor tuning process.
Statistical analyses were run to compare the fitness of the final solutions using different hyperparameter configurations for each evolutionary approach. The problem involves searching for known specific target derivation trees, each of which encodes the optimal solution. A different target derivation tree with 50 recursions is uniformly generated at random in each execution for each CFG under study. Since these grammars define two recursive production rules, the target derivation trees may contain any combination of the two recursive production rules totaling 50 recursions. For example, in the neural-network grammar, a target derivation tree may have 50 recursions of the layer-related nonterminal, 50 of the neuron-related one, or a mix of both with 50 recursions in total. For the three evolutionary approaches, the fitness function calculates the Levenshtein distance from the sentence encoded by the target derivation tree to the sentence encoded by the individual to be evaluated, as defined in (3).
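For reference, the Levenshtein distance used as the fitness function can be computed with the standard dynamic program; the following is a generic implementation, not the authors' code.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    turning sentence `a` into sentence `b` (used here as fitness)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

assert levenshtein("kitten", "sitting") == 3  # classic sanity check
```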
There are 5 possible configurations for the replacement rate using WX or EDA, while there are 20 combinations of replacement rates and smoothing factors for SEDA. For each hyperparameter setup, 100 executions have been performed. Any evolutionary process stops when either the target derivation tree is found or the average population fitness improves less than 1% after 25 evaluations. The same 100 target derivation trees are employed for all evolutionary algorithms and hyperparameter configurations to avoid biases. Table 4 summarizes the configurations employed in the experiments.
| Hyperparameter | Value |
|---|---|
| GGGP approaches | WX, EDA, and SEDA |
| Population size | 100 |
| Initialization algorithm | Grammatically uniform |
| Selection operator | Tournament of size 5 |
| Replacement | Substitutes the worst individuals with the offspring |
| Mutation | None |
| Replacement rates | 20%, 40%, 60%, 80%, and 100% |
| SEDA smoothing factors | Four values within (0, 0.1] |
| Fitness function | Levenshtein distance (3) |
| Executions run | 100 for each hyperparameter setup and GGGP approach |
| Target derivation tree size | 50 recursions |
| Target derivation trees | The same 100 for each hyperparameter setup and GGGP approach |
| Stop criteria | 1. The target derivation tree is found. 2. The avg. population fitness improves less than 1% in 25 evaluations. |
The hyperparameter configuration served as the independent variable for each evolutionary algorithm and CFG, with 5 levels in the case of WX and EDA, corresponding to the 5 replacement rates studied, and 20 for SEDA, considering all combinations of replacement rates and smoothing factors. The dependent variable was the fitness of the final solution (the best fitness in the last generation) achieved by each of the three evolutionary approaches with each of the three grammars.
A one-way between-groups analysis of variance (ANOVA) was conducted to gather empirical evidence on whether the differences between the final fitness means achieved by each evolutionary algorithm and CFG are statistically significant for the different hyperparameter setups considered. One of the conditions of an ANOVA is that the variances of the groups should be equivalent; ANOVA is robust to violations of this condition when the groups are of equal or near-equal size. This condition holds for the current study, since 100 executions were run for every CFG, evolutionary algorithm, and hyperparameter configuration (n = 100). The size of the groups also allows the assumption of normality.
The ANOVA test results for WX and EDA working with the neural-network grammar reveal that the null hypothesis (the means of the best fitnesses achieved by the evolutionary algorithm when searching for a specific derivation tree with 50 recursions are equal) cannot be rejected, with significance levels above 0.05 for both WX and EDA. These two results indicate that the final solutions achieved by GGGP-WX and GGGP-EDA provide a statistically similar average fitness when varying the replacement rate. Instead, the null hypothesis can be rejected for SEDA ($p < 0.05$). Thus, the quality of the solutions that SEDA achieves in this case depends on the replacement rate and smoothing factor hyperparameters.
Similarly, ANOVA test results for the three evolutionary algorithms were gathered using the knowledge-base and symbolic-regression grammars. The results for the knowledge-base grammar reveal statistically significant differences between the mean fitnesses of the solutions achieved by WX when varying the replacement rate ($p < 0.05$). The same occurs using SEDA with different replacement rates and smoothing factors. However, the null hypothesis cannot be rejected for EDA working with the knowledge-base grammar. In the case of the symbolic-regression grammar, the ANOVA tests reported significant differences for all three evolutionary algorithms, with $p < 0.05$ for WX, EDA, and SEDA.
When there were significant differences between the groups regarding the hyperparameters, the Tukey HSD (honestly significant difference) test was used to make post hoc comparisons and locate the statistically significant differences. This is the case for SEDA with the neural-network grammar, WX and SEDA with the knowledge-base grammar, and the three evolutionary algorithms with the symbolic-regression grammar.
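This analysis pipeline corresponds to standard statistical routines; a sketch with hypothetical fitness samples (not the experimental data) is shown below using SciPy and statsmodels.

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical final-fitness samples: 100 runs per hyperparameter setup.
rng = np.random.default_rng(42)
groups = {"20%": rng.normal(1.9, 1.0, 100),
          "60%": rng.normal(1.7, 1.0, 100),
          "100%": rng.normal(1.3, 1.0, 100)}

f_stat, p_value = f_oneway(*groups.values())   # one-way between-groups ANOVA
if p_value < 0.05:                             # means differ: locate the differences
    values = np.concatenate(list(groups.values()))
    labels = np.repeat(list(groups.keys()), 100)
    print(pairwise_tukeyhsd(values, labels))   # Tukey HSD post hoc comparisons
```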
Tables 5 and 6 summarize the data gathered by showing the hyperparameter configurations included in the homogeneous subset with the best fitness means for each evolutionary algorithm and grammar pair. Table 5 corresponds to SEDA working with the neural-network grammar, WX with the knowledge-base grammar, and SEDA with the knowledge-base grammar, while Table 6 refers to the three evolutionary algorithms working with the symbolic-regression grammar. An algorithm does not appear in any column of Tables 5 and 6 when the quality (fitness) of the solutions obtained does not statistically depend on its hyperparameters; this is the case for WX and EDA working with the neural-network grammar, and for EDA with the knowledge-base grammar.
| SEDA, neural-network grammar | Mean | WX, knowledge-base grammar | Mean | SEDA, knowledge-base grammar | Mean |
|---|---|---|---|---|---|
| 100%, | 0.39 | 100% (*) | 63.90 | 100%, | 15.12 |
| 80%, (*) | 0.49 | 80% | 65.88 | 80%, (*) | 16.69 |
| 60%, | 0.57 | 60% | 70.58 | 80%, | 17.02 |
| 40%, | 0.73 | | | 60%, | 17.65 |
| 20%, | 0.85 | | | 60%, | 18.77 |
| 40%, | 1.29 | | | 40%, | 19.15 |
| 100%, | 1.32 | | | 40%, | 22.33 |
| 80%, | 1.44 | | | | |
| 80%, | 1.49 | | | | |
| 40%, | 1.59 | | | | |
| 60%, | 1.66 | | | | |
| 60%, | 1.85 | | | | |
| 20%, | 1.97 | | | | |
| 100%, | 2.02 | | | | |
| WX, symbolic-regression grammar | Mean | EDA, symbolic-regression grammar | Mean | SEDA, symbolic-regression grammar | Mean |
|---|---|---|---|---|---|
| 100% (*) | 99.27 | 80% (*) | 116.26 | 80%, (*) | 97.77 |
| 80% | 102.17 | 60% | 116.88 | 60%, | 98.54 |
| | | 40% | 117.67 | 80%, | 99.60 |
| | | 20% | 118.62 | 100%, | 99.79 |
It is important to note that the evolutionary algorithms increasingly depend on their hyperparameters as the problems become more challenging: the mean fitness values are greater for the symbolic-regression grammar than for the knowledge-base grammar, and greater for the latter than for the neural-network grammar. However, SEDA is robust to change, meaning that the replacement rate chosen or the problem at hand plays little part as long as the best-performing smoothing rates are employed, since these are the most repeated values in the first and third columns of Table 5 and the third column of Table 6. Likewise, EDA is also robust to change. It is the most robust algorithm in terms of replacement rate variation: it achieves comparable results for the neural-network and knowledge-base grammars regardless of the replacement rate chosen, and, in the case of the symbolic-regression grammar, it achieves comparable results for replacement rates ranging from 20% to 80%.
This first stage of the experimentation aims to discover the best-performing hyperparameter set for each evolutionary algorithm, in order to obtain fair comparative results in the second stage. Tables 5 and 6 together allow this goal to be achieved by choosing, for each evolutionary algorithm, those hyperparameters in the intersection of the subsets corresponding to the three CFGs; thus, the optimal hyperparameters are not problem-dependent. In the case of WX, the best-performing replacement rates for the three grammars are 100% and 80%: these values appear in both the second column of Table 5 and the first column of Table 6, while WX does not appear in the first column of Table 5, meaning that the replacement rate does not affect the quality of the solutions achieved with the neural-network grammar. With EDA, replacement rates of 20%, 40%, 60%, or 80% perform the best with the three CFGs. Finally, in the case of SEDA, four hyperparameter pairs (replacement rate, smoothing factor) yield the best results with the three grammars.
From the results shown in these tables, it is possible to obtain a set of hyperparameters that is the same for each evolutionary algorithm across the three grammars employed in the study. In the case of WX, any replacement rate is suitable for the neural-network grammar, and 100% achieves the best average results for the knowledge-base and symbolic-regression grammars; therefore, 100% is the replacement rate chosen for WX in the performance comparison. EDA performance is statistically equivalent for any replacement rate with the neural-network and knowledge-base grammars, and it achieves statistically better results with any replacement rate other than 100% with the symbolic-regression grammar; 80% is the replacement rate chosen for EDA, since it yields the best (although statistically equivalent) average results. Finally, from the four pairs that statistically yield the best means with SEDA for the three CFGs, the pair marked with an asterisk in Tables 5 and 6, with an 80% replacement rate, is chosen because it provides the best and the second-best average results.
4.2 Performance Comparison
This stage aims to show and compare the performance of the three GGGP approaches when trying to obtain specific derivation trees of different sizes for the three CFGs of Table 3. The search problem is the same as defined in the hyperparameter setup stage, but extended to 10, 20, 50, 75, and 100 recursions to observe each evolutionary algorithm's behavior when dealing with small and large derivation trees and search spaces. The experiments conducted employ the same configurations reported in the previous stage and summarized in Table 4, except for the target derivation tree sizes (now extended), the replacement rate, and the SEDA smoothing factor. In this study, the replacement rate adopted is 100% for WX and 80% for EDA and SEDA, together with the SEDA smoothing factor obtained in the hyperparameter setup stage.
The experiments performed in this stage are not concerned with each problem's semantics, such as searching for the best multilayer perceptron or knowledge base of rules to perform a specific task. Instead, these ad-hoc experiments are intentionally provided to avoid the noise produced by other optimization processes, such as learning the parameters of each multilayer perceptron generated during the evolutionary process. They aim to reveal the advantages and limitations of the three GGGP algorithms, especially when dealing with large search spaces.
Comparisons between the three GGGP approaches have been performed in terms of the fitness of the final solutions and of the convergence speed. Two pairs of plots, one for the final fitnesses and the other for the convergence speed, present the statistical results for each of the three CFGs studied. Each pair shares the abscissa axis, which represents the number of recursions of the target derivation trees to be found. The lower chart in each pair shows descriptive statistics for each target derivation tree size through box-and-whisker plots; its ordinate axis corresponds to the fitness of the final solutions on a logarithmic scale or to the number of evaluations performed by each evolutionary algorithm until meeting one of the stop criteria (Table 4). The upper chart in each pair shows on the ordinate axis the significance levels achieved by the three one-way analysis of variance (ANOVA) tests on two groups (equivalent to t-tests) conducted for each target derivation tree size to compare the three evolutionary algorithms: EDA vs. SEDA, EDA vs. WX, and SEDA vs. WX. The dependent variable is the fitness of the final solutions when comparing the quality of the solutions achieved, or the number of evaluations performed when comparing the convergence speed. In both cases, the independent variable is the evolutionary algorithm employed for each target derivation tree size and CFG.
The aim is to study the impact of the evolutionary approaches on the fitness of the final solutions achieved and on the number of evaluations performed for shallow and deep target derivation trees using different CFGs, by gathering empirical evidence of whether the differences between the means of the dependent variable are statistically significant. Again, the null hypothesis (the means of the dependent variable are equal for the different values of the independent variable) is rejected when the achieved significance level falls below the threshold represented by a dashed horizontal line in the upper plot of each pair of graphs. All groups are of the same size, with 100 observations for each target derivation tree size, evolutionary algorithm, and CFG; since ANOVA is robust to departures from homoscedasticity and normality under equally sized, large groups, neither the Levene nor the Shapiro-Wilk test was considered.
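As an illustration, the following Python sketch reproduces the kind of pairwise two-group test described above on toy data; the sample values and the 0.05 threshold are assumptions for the example, not figures taken from this study.

```python
# Minimal sketch of the pairwise two-group tests described above.
# The fitness samples and the 0.05 threshold are illustrative assumptions.
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-ins for the 100 final-fitness observations per algorithm
# for one target derivation tree size and CFG.
final_fitness = {
    "WX": rng.normal(3.0, 1.0, 100),
    "EDA": rng.normal(4.0, 1.0, 100),
    "SEDA": rng.normal(2.5, 1.0, 100),
}

ALPHA = 0.05  # assumed significance threshold (dashed line in the plots)

for a, b in combinations(final_fitness, 2):
    # A one-way ANOVA on two groups is equivalent to an independent t-test.
    _, p = stats.ttest_ind(final_fitness[a], final_fitness[b])
    verdict = "significant" if p < ALPHA else "not significant"
    print(f"{a} vs. {b}: p = {p:.4f} ({verdict})")
```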
Table 7: Average fitness of the final solutions and average evaluations performed (in thousands) with the first CFG.

Average fitness of the final solutions
| 10 recursions | | | 20 recursions | | | 50 recursions | | | 75 recursions | | | 100 recursions | | |
| WX | EDA | SEDA | WX | EDA | SEDA | WX | EDA | SEDA | WX | EDA | SEDA | WX | EDA | SEDA |
| 0.32 | 0.07 | 1.09 | 0.23 | 0.4 | 2.47 | 0.49 | 1.87 | 4.25 | 3.3 | 5.87 | | | |
Average evaluations performed (in thousands)
| 0.79 | 0.5 | 0.98 | 1.79 | 1.21 | 3.97 | 2.81 | 2.56 | 4.79 | 2.91 | 3.1 | 6.21 | 3.56 |
Three main conclusions can be deduced from the results obtained with the first CFG. First, WX rapidly meets a stop criterion and achieves fair solutions for target derivation trees of up to 20 recursions. WX performs a more exploratory search than the other two algorithms, which proves suitable for small search spaces. However, as the target derivation trees and, therefore, the search space grow larger, EDA performs the fewest evaluations, although it stops in poor local optima. Unlike WX, EDA boosts the local search, which makes it stop prematurely, and this becomes even more pronounced as the problem becomes more challenging. Finally, SEDA, which tries to balance exploration and exploitation, takes an intermediate number of evaluations to stop and improves the quality of the solutions compared to WX and EDA as the target derivation tree sizes increase, being the best of the three for 75 and 100 recursions.
The second CFG is a recursive grammar that encodes rule knowledge bases of variable size in the form if Antecedent then Consequent. The Antecedent of a rule may include one or more clauses. An interesting feature of this grammar is that it can generate arbitrarily broad, not only arbitrarily deep, derivation trees by applying its recursive production rules. This feature makes the problem even more difficult because the words to compare for the fitness calculation (target and candidate solutions) are long. Consequently, the evolutionary algorithms take the highest number of evaluations of the three CFGs to reach a stop criterion.
Table 8: Average fitness of the final solutions and average evaluations performed (in thousands) with the second CFG.

Average fitness of the final solutions
| 10 recursions | | | 20 recursions | | | 50 recursions | | | 75 recursions | | | 100 recursions | | |
| WX | EDA | SEDA | WX | EDA | SEDA | WX | EDA | SEDA | WX | EDA | SEDA | WX | EDA | SEDA |
| 7.34 | 11.1 | 13.2 | 21.64 | 63.9 | 32.4 | 47.8 | 60.9 | 114.8 | 98.77 | | | | |
Average evaluations performed (in thousands)
| 8.7 | 3.4 | 3.3 | 12.1 | 5.4 | 17.5 | 8.3 | 20.3 | 9.1 | 18.8 | 10.4 |
The challenge with the third CFG lies in the fact that this grammar includes a doubly recursive production rule in addition to a singly recursive one. With this grammar, the evolutionary algorithms achieve the worst solutions of the three CFGs.
Table 9: Average fitness of the final solutions and average evaluations performed (in thousands) with the third CFG.

Average fitness of the final solutions
| 10 recursions | | | 20 recursions | | | 50 recursions | | | 75 recursions | | | 100 recursions | | |
| WX | EDA | SEDA | WX | EDA | SEDA | WX | EDA | SEDA | WX | EDA | SEDA | WX | EDA | SEDA |
| 11.4 | 16.2 | 30.3 | 40.5 | 99.3 | 116.3 | 151.2 | 181.9 | 249.1 | 223.3 | | | | |
Average evaluations performed (in thousands)
| 6.7 | 4.7 | 7.2 | 5.5 | 8.2 | 7.4 | 7.9 | 7.1 | 7.6 | 5.4 |
The results achieved regarding the average fitness of the final solutions and the number of evaluations needed to meet a stop criterion reveal that the search problem defined by the first CFG is simpler to solve than those defined by the second and the third. According to Tables 7, 8, and 9, the three evolutionary approaches achieve their lowest (best) average final fitness with the first CFG and their highest (worst) with the third, and the same tables report that the second CFG demands the largest average number of evaluations before stopping.
Additional tests were performed with the mutation operator, the only operator commonly used in GGGP that is not present in the experiments reported above, using mutation probabilities of 0.01 and 0.05. After generating the offspring, if the mutation probability states that the operator is applied to one of the offspring individuals (derivation trees), it randomly chooses a node with a nonterminal symbol and replaces the subtree rooted at that node with another one randomly generated by the grammatically uniform initialization method (Table 4), as sketched below.
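The following Python sketch illustrates this subtree mutation under a simple node representation; the Node class and the random_subtree generator are hypothetical stand-ins, the latter playing the role of the grammatically uniform initialization method.

```python
import random

class Node:
    """Hypothetical derivation tree node; not the paper's implementation."""
    def __init__(self, symbol, children=None, terminal=False):
        self.symbol = symbol
        self.children = children or []
        self.terminal = terminal

def nonterminal_nodes(tree):
    # Collect every node labeled with a nonterminal symbol.
    nodes = [] if tree.terminal else [tree]
    for child in tree.children:
        nodes.extend(nonterminal_nodes(child))
    return nodes

def mutate(tree, p_mutation, random_subtree):
    # With probability p_mutation (0.01 or 0.05 in the tests above), pick a
    # random nonterminal node and replace the subtree rooted at it with a
    # freshly generated one for the same nonterminal symbol.
    if random.random() >= p_mutation:
        return tree
    node = random.choice(nonterminal_nodes(tree))
    replacement = random_subtree(node.symbol)
    node.children = replacement.children
    node.terminal = replacement.terminal
    return tree
```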
However, the results achieved with both mutation probabilities were statistically similar to those already reported without the mutation operator. EDA slightly improves (decreases) the average fitness of the final solutions only when the mutation probability is 0.05, with almost no increase in the average number of evaluations needed to meet a stop criterion. Nevertheless, this enhancement is not statistically significant.
The interpretation of these results is that WX is a highly explorative variation operator; therefore, adding mutation has no noticeable effect on the evolutionary process. SEDA includes the smoothing factor to control the exploration and local search capabilities; thus, adding a mutation operator has an effect similar to increasing this parameter. Finally, GGGP with EDA boosts the local search, which might lead the algorithm to converge prematurely to local optima. In this case, the mutation operator can benefit the evolutionary process by increasing the exploration of the search space towards new individuals that allow it to escape from local optima.
5 Conclusions
This work presents a smoothed estimation of distribution algorithm (SEDA) for grammar-guided genetic programming that provides a trade-off between exploration and local search to efficiently guide the evolutionary process towards accurate solutions. The proposed approach calculates the context-free grammar expansion (CFGE) to generate the offspring. The CFGE is a tree-like graph representation of all the derivation trees that the CFG can generate, together with a probabilistic model of the whole search space that approximates the current population distribution. SEDA implements a code-bloat control mechanism by setting up a recursion bound in recursive CFGs to limit the CFGE size: derivation trees exceeding this bound cannot be generated.
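As a toy illustration of such a recursion bound (the grammar and encoding below are illustrative assumptions, not the paper's CFGE implementation), expansion simply stops applying recursive rules once the bound is reached:

```python
from itertools import product

# Illustrative toy grammar: S -> a S | a (one recursive, one terminating rule).
GRAMMAR = {"S": [("a", "S"), ("a",)]}

def expand(symbol, depth, bound):
    """Enumerate the words derivable from `symbol` within `bound` recursions."""
    if symbol not in GRAMMAR:   # terminal symbol
        yield symbol
        return
    if depth > bound:           # recursion bound reached: prune this branch
        return
    for production in GRAMMAR[symbol]:
        # Expand each symbol of the production, incrementing the depth
        # only for nonterminals, and combine the partial words.
        parts = [list(expand(s, depth + 1 if s in GRAMMAR else depth, bound))
                 for s in production]
        for combination in product(*parts):
            yield "".join(combination)

print(list(expand("S", 0, 3)))  # -> ['aaaa', 'aaa', 'aa', 'a']
```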
However, under such exploitative conditions, SEDA would stop prematurely. The evolutionary process would be rapidly guided to converge to an optimum in very few generations, but likely to a local one. To overcome this difficulty, SEDA smooths the estimated distribution model, so that new individuals may be generated from outside the current population distribution. These out-of-distribution individuals increase the population diversity and allow the evolutionary process to explore new promising areas of the search space.
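One common way to realize such smoothing, shown below as a hypothetical sketch rather than SEDA's exact CFGE update, is to mix each nonterminal's estimated rule probabilities with the uniform distribution using the smoothing factor:

```python
def smooth(rule_probabilities, factor):
    """Mix estimated production-rule probabilities with the uniform
    distribution; `factor` plays the role of SEDA's smoothing factor.
    This is an assumed, illustrative formulation."""
    n = len(rule_probabilities)
    return [(1.0 - factor) * p + factor / n for p in rule_probabilities]

# With factor = 0 the estimate is used as-is (pure exploitation); with
# factor = 1 sampling becomes uniform (pure exploration). Rules that the
# estimate assigned zero probability become samplable again:
print(smooth([0.9, 0.1, 0.0], 0.01))  # approximately [0.894, 0.102, 0.003]
```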
Statistical results gathered in terms of the average fitness of the final solutions achieved by EDA, WX, and SEDA reveal that EDA finds the worst solutions, showing an excessive local search capability that leads the evolutionary process to stop prematurely. On the other hand, WX achieves the best solutions only with the first CFG and, in general, with small search spaces, where a good solution is more likely to be found by exploration. This kind of tree-swapping variation operator may be highly exploratory and barely exploitative; therefore, it is not suitable for problems involving large search spaces with large derivation trees. Finally, SEDA significantly achieves the best solutions on the most challenging problems: the three CFGs with target derivation tree sizes of 75 and 100 recursions.
Regarding the number of evaluations needed to meet a stop criterion, EDA generally takes significantly fewer evaluations to stop, with few exceptions in some of the most straightforward experiments. Therefore, EDA is the fastest algorithm, although it also finds the worst solutions. This behavior is usual in algorithms that, like EDA, boost local search excessively. In contrast, SEDA obtains intermediate results: it is slower than EDA but faster than WX, and it generally finds the best-quality solutions.
The experimental results lead us to conclude that SEDA is the most balanced algorithm. While WX is the slowest approach owing to its excessive exploration capability, EDA meets a stop criterion very early and obtains the worst-quality solutions. The smoothing factor hyperparameter regulates SEDA's exploration-exploitation trade-off: the higher it is, the more SEDA behaves like WX (explorative); the lower it is, the more it resembles a regular EDA (exploitative). From the results reported, a smoothing factor of 0.01 makes SEDA balance exploration and local search, achieving the best-quality solutions, especially in the most challenging problems, with an intermediate average speed.
This research work shows the difficulty of finding the proper balance between exploratory and exploitative behavior. Although SEDA presents a fit exploration-exploitation trade-off, the development of new specific mutation operators may benefit the overall evolutionary process when working with large search space problems. Such mutation operators should focus on searching unexplored areas of the search space, avoiding the re-exploration of already visited areas. A controlled balance between exploring spatially close and distant areas, according to the state of the evolutionary process, may also be beneficial.
New evolutionary optimization techniques may also arise from the presented work. The CFGE represents the search space in a graph-like structure that directly encodes derivation trees, which can be obtained by traversing the CFGE. This search space model can be modified to represent, in addition to trees, other, more complex graph-encoded search spaces. Hence, a new evolutionary optimization technique may be developed to optimize graph-encoded problems using regular evolutionary operators.
Upper and lower probability thresholds can be helpful for GP and GGGP EDA approaches, where probabilities can easily reach values of 0 and 1. Nevertheless, the side effects of these bounds on the evolutionary process are not yet fully understood. Since the bounds are only enforced when exceeded, further studies should investigate the potential biases that may arise from the resulting inconsistent shaping of the probability distribution.
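A hypothetical sketch of such bounding follows; the 0.05/0.95 limits are illustrative assumptions, and the renormalization step shows how enforcing the bounds only when exceeded can reshape the distribution unevenly:

```python
def bound(rule_probabilities, lower=0.05, upper=0.95):
    """Clamp probabilities into [lower, upper] only when they fall outside,
    then renormalize. The limits are illustrative assumptions."""
    clamped = [min(max(p, lower), upper) for p in rule_probabilities]
    total = sum(clamped)
    # Renormalization can push clamped values past the bounds again, which
    # is one source of the inconsistent shaping mentioned above.
    return [p / total for p in clamped]

print(bound([1.0, 0.0, 0.0]))  # no rule is ever fully frozen in or out
```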
Acknowledgments
This work was partially supported by research grant PID2021-122154NB-100 of the Ministerio de Economía, Industria y Competitividad, Spain.
The authors thank the reviewers and editors for their valuable comments and suggestions, which have improved this paper.