## Abstract

To maintain the population diversity of genetic algorithms (GAs), an appropriate population diversity measure is required. However, commonly used population diversity measures designed for permutation problems do not consider the dependencies between the variables of the individuals in the population. We propose three types of population diversity measures that address high-order dependencies between the variables to investigate the effectiveness of considering such dependencies. The first is formulated as the entropy of the probability distribution of individuals estimated from the population based on an $m$-th order Markov model. The second is an extension of the first. The third is similar to the first, but it is based on a variable-order Markov model. The proposed population diversity measures are incorporated into the evaluation function of a GA for the traveling salesman problem to maintain population diversity. Experimental results demonstrate the effectiveness of the three types of high-order entropy-based population diversity measures compared with commonly used population diversity measures.

## 1 Introduction

The maintenance of population diversity is recognized as an important factor for fully exercising the capability of evolutionary algorithms (EAs), and a wide variety of population management strategies for promoting diversity inside the population have been proposed. A survey of methodologies for promoting population diversity in EAs can be found in Squillero and Tonda (2016).

Several population diversity management methodologies utilize measures of population diversity, which can be used to analyze the behavior of EAs (Yao, 1993; Tsai et al., 2004; Wang et al., 2010), to select individuals to maintain population diversity in a positive manner (Maekawa et al., 1996; Zhang et al., 2006; Nagata, 2006; Nagata and Kobayashi, 2013), and as a trigger to activate diversification procedures (Tsujimura and Gen, 1998; Vallada and Ruiz, 2010). The pairwise Hamming distance (the average of the Hamming distance between all possible pairs of the population members) is the most commonly used measure of population diversity. Another commonly used population diversity measure is based on entropy. In information theory, entropy, defined as $-\sum_{s \in S} p_s \log p_s$, is a measure of the uncertainty of a probability distribution $p_s\ (s \in S)$, where $S$ is the set of all possible events. This definition, however, cannot be directly used to measure population diversity because the population size is typically considerably smaller than the number of all possible solution candidates in the search space $S$. Therefore, to the best of our knowledge, the entropy-based population diversity measures proposed in previous works are all defined as the sum of the entropies of the univariate marginal distributions of all variables. For example, let the solution space $S$ be defined as $(x_1, \ldots, x_n)$, where $x_i$ is a variable taking values in a discrete set $A_i$. The entropy of the $i$-th variable is defined as $H_i = -\sum_{j \in A_i} p_{ij} \log p_{ij}$, where $p_{ij}$ is the probability that $x_i$ has a value $j$ in the population. The commonly used entropy-based population diversity measure is then defined as $H = \sum_{i=1}^{n} H_i$. In this article, we refer to an entropy-based population diversity measure defined in this manner as an independent entropy measure.
In previous works, the independent entropy measure was incorporated into EAs applied to the knapsack problem (Mori et al., 1996), binary quadratic programming problem (Wang et al., 2010), traveling salesman problem (Yao, 1993; Maekawa et al., 1996; Tsujimura and Gen, 1998; Tsai et al., 2004; Nagata, 2006; Nagata and Kobayashi, 2013), and others (Zhang et al., 2006).

The independent entropy measure (and other commonly used population diversity measures), however, cannot consider the dependencies between the variables of the individuals in the population, which creates situations where population diversity cannot be evaluated appropriately. For example, consider an extreme example on the $n$-dimensional binary solution space where half of the population members are “$00\cdots00$” and the other half are “$11\cdots11$.” The value of the independent entropy measure of this population is virtually the same as that of a randomly generated population because $p_{i0} \simeq p_{i1} \simeq 0.5\ (i = 1, \ldots, n)$ for both populations, even though the “true” population diversity is extremely low in the former. Therefore, our motivation herein is to design a more appropriate entropy-based population diversity measure by considering dependencies between the variables of the individuals in the population. We refer to such a population diversity measure as a high-order entropy measure.
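This failure mode is easy to reproduce numerically. Below is a minimal sketch (Python; the function name is ours) of the independent entropy measure applied to exactly this degenerate population:

```python
import math
from collections import Counter

def independent_entropy(population):
    """H_ind: sum over positions of the entropy of the univariate
    marginal distribution of the symbols at that position."""
    Np = len(population)
    total = 0.0
    for i in range(len(population[0])):
        for count in Counter(ind[i] for ind in population).values():
            p = count / Np
            total -= p * math.log(p)
    return total

# Half "0000", half "1111": the "true" diversity is extremely low ...
degenerate = ["0000"] * 4 + ["1111"] * 4
# ... yet H_ind equals that of a maximally mixed population: n * log 2
print(independent_entropy(degenerate))  # ≈ 4 * log(2) ≈ 2.7726
```

The measure cannot distinguish this two-solution population from a uniformly random one, which is precisely the motivation for the high-order measures below.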

In this article, we propose several high-order entropy measures for the traveling salesman problem (TSP) to investigate the advantages of using entropy-based population diversity measures that consider the dependencies between the variables. We first formulate high-order entropy measures based on a fixed-order Markov model, where we assume that the probability of observing each vertex at a certain position in an individual (tour) of the population depends on the sequence of $m\ (\geq 1)$ preceding vertices. We further extend this model into a variable-order Markov model, where the value of $m$ varies depending on the situation.

We tested the proposed high-order entropy measures on a genetic algorithm (GA) developed by Nagata and Kobayashi (2013), which is known as one of the most effective heuristic algorithms for the TSP. In this GA, one important feature for achieving top performance is maintaining population diversity by evaluating offspring solutions with an evaluation function that incorporates a population diversity measure in addition to the original evaluation function (tour length). An independent entropy measure is used for evaluating the population diversity. In this article, we run this GA with the original independent entropy measure in the evaluation function replaced by each of the proposed high-order entropy measures.

Preliminary reports on the high-order entropy measures proposed in this article were presented in previous works of the author (Nagata and Ono, 2013; Nagata, 2016). This article provides a full description and instructive analysis of the proposed high-order entropy measures; Section 6.3 and the Appendix are completely new, and the other parts significantly extend the contents of the conference proceedings. The remainder of this article is organized as follows. In Section 2, we describe the background of this study. In Section 3, we propose two types of high-order entropy measures based on a fixed-order Markov model. In Section 4, we propose a high-order entropy measure based on a variable-order Markov model. In Section 5, the GA framework into which the proposed population diversity measures are incorporated is described. Computational results are presented in Section 6. Finally, conclusions are provided in Section 7.

## 2 Background

We first consider commonly used population diversity measures (the independent entropy measure and the pairwise Hamming distance) in a general case and then describe the independent entropy measure for the TSP. We also discuss the difficulty of designing an entropy-based population diversity measure that considers the dependencies between the variables.

### 2.1 Commonly Used Population Diversity Measures

We first consider a general case where an individual is represented as a string of length $n$, consisting of symbols in a set $L$. Let the population consist of $N_p$ individuals, denoted as $p_1, p_2, \ldots, p_{N_p}$. Let $X_i\ (i = 1, \ldots, n)$ be a random variable representing the symbol in the $i$-th position of an individual (string) selected randomly from the population. The joint probability distribution $P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)$ then represents the probability of finding individual $(x_1, x_2, \ldots, x_n)$ when an individual is selected randomly from the population. We denote $P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)$ as $P(x_1, x_2, \ldots, x_n)$ for simplicity. The marginal probability distribution $P(X_i = x_i)$, or $P(x_i)$, then represents the probability of finding symbol $x_i$ in the $i$-th position of an individual selected randomly from the population. The probability $P(X_i = l)$ is estimated as $n_{il}/N_p$, where $n_{il}$ is the number of individuals that have the symbol $l$ at position $i$. The independent entropy measure, which we denote as $H_{\mathrm{ind}}$, is then defined by
$$H_{\mathrm{ind}} = -\sum_{i=1}^{n} \sum_{x_i \in L} P(x_i) \log P(x_i).$$
(1)
Another commonly used population diversity measure is the pairwise Hamming distance (the average of the Hamming distance between all possible pairs of the population members). Let the symbol in the $i$-th position of individual $p$ be denoted as $(p)_i$. The Hamming distance between two individuals (strings) $p$ and $q$, denoted as $d(p,q)$, is then defined by
$$d(p,q) = n - \sum_{i=1}^{n} I((p)_i, (q)_i),$$
(2)
where the function $I(a,b)$ returns 1 if $a$ and $b$ are the same, and 0 otherwise.
In fact, the pairwise Hamming distance can be expressed as a function of $P(x_i)$ (Wineberg and Oppacher, 2003). To derive this expression, we rewrite the Hamming distance as $d(p,q) = n - \sum_{i=1}^{n} \sum_{l \in L} I_l((p)_i, (q)_i)$, where $I_l(a,b)$ returns 1 if both $a$ and $b$ are symbol $l$, and 0 otherwise. The pairwise Hamming distance, which we denote as $D$, is then calculated as follows:
$$\begin{aligned}
D &= \frac{2}{N_p(N_p-1)} \sum_{j=1}^{N_p-1} \sum_{k=j+1}^{N_p} d(p_j, p_k) = \frac{1}{N_p(N_p-1)} \sum_{j=1}^{N_p} \sum_{k=1}^{N_p} d(p_j, p_k) \\
&\propto \sum_{j=1}^{N_p} \sum_{k=1}^{N_p} \Big\{ n - \sum_{i=1}^{n} \sum_{l \in L} I_l((p_j)_i, (p_k)_i) \Big\} = N_p^2 n - \sum_{i=1}^{n} \sum_{l \in L} \sum_{j=1}^{N_p} \sum_{k=1}^{N_p} I_l((p_j)_i, (p_k)_i) \\
&= N_p^2 n - \sum_{i=1}^{n} \sum_{l \in L} n_{il}^2 = N_p^2 n - \sum_{i=1}^{n} \sum_{l \in L} \{N_p P(x_i = l)\}^2 \\
&\propto n - \sum_{i=1}^{n} \sum_{l \in L} P(x_i = l)^2 = n - \sum_{i=1}^{n} \sum_{x_i \in L} P(x_i)^2.
\end{aligned}$$
(3)
Note that $\sum_{j=1}^{N_p} \sum_{k=1}^{N_p} I_l((p_j)_i, (p_k)_i) = n_{il}^2$ because the number of individuals that have symbol $l$ in the $i$-th position is $n_{il}$.
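The identity underlying Eq. (3) can be verified numerically. The following sketch (Python; function names are ours) computes the total pairwise Hamming distance both directly and from the position-wise symbol counts $n_{il}$:

```python
from collections import Counter

def total_pairwise_hamming(pop):
    # Direct double sum over all ordered pairs, including j == k (d = 0 there)
    return sum(sum(a != b for a, b in zip(p, q)) for p in pop for q in pop)

def via_marginal_counts(pop):
    # The same quantity from the counts n_il: Np^2 * n - sum_i sum_l n_il^2
    n, Np = len(pop[0]), len(pop)
    sq = sum(c * c
             for i in range(n)
             for c in Counter(ind[i] for ind in pop).values())
    return Np * Np * n - sq

pop = ["abc", "abd", "xbc", "abd"]
print(total_pairwise_hamming(pop))  # 14
print(via_marginal_counts(pop))     # 14
```

Both computations agree, which is the content of the derivation above: $D$ depends on the population only through the marginals $P(x_i)$.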

The similarity between the independent entropy measure $H_{\mathrm{ind}}$ and the pairwise Hamming distance $D$ was discussed in Wineberg and Oppacher (2003). As suggested in Nagata and Kobayashi (2013), one advantage of the independent entropy measure over the pairwise Hamming distance is its sensitivity to changes in rare elements of the population. This feature makes $H_{\mathrm{ind}}$ a more appropriate population diversity measure than $D$.

### 2.2 TSP Case

Let an asymmetric TSP (ATSP) be defined on a complete directed graph $(V, E)$ with a set of vertices $V = \{1, \ldots, n\}$ and a set of edges $E = \{(i,j) \mid i, j \in V\}$. In the asymmetric case, the distance (or cost) between two vertices depends on the travel direction. If the distance is the same in both directions, the TSP is called a symmetric TSP (STSP).

In the majority of GAs applied to the TSP, an individual is represented as the order of vertices on the tour. It is also possible to represent an individual such that the variable $x_i$ represents the vertex subsequent to vertex $i$ in the tour, which is well suited to the definition of the population diversity measures $D$ and $H_{\mathrm{ind}}$. Using this representation, $P(X_i = l)\ (l \in \{1, 2, \ldots, n\})$ is defined as the probability distribution of the vertices subsequent to vertex $i$ in the population, and the population diversity measures $H_{\mathrm{ind}}$ and $D$ are defined according to Eqs. (1) and (3), respectively. Here, we need to pay attention to the STSP case. In this case, we must consider both travel directions for each tour because the population diversity should not depend on the travel direction. Therefore, $P(X_i = l)\ (l \in \{1, 2, \ldots, n\})$ is defined as the probability distribution of the vertices linked to vertex $i$ in the population. The population diversity measure $H_{\mathrm{ind}}$ defined in this manner is used in GAs (Maekawa et al., 1996; Tsai et al., 2004; Nagata, 2006; Nagata and Kobayashi, 2013) for the STSP to control the diversity of the population.
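As a minimal illustration of the successor representation (Python; the function name is ours, and the single-orientation ATSP case is assumed), a tour given as a vertex order can be converted as follows:

```python
def successor_representation(tour):
    """Map a tour (vertex order) to x, where x[i] is the vertex
    that follows vertex i when traversing the tour cyclically."""
    n = len(tour)
    x = [0] * n
    for pos, v in enumerate(tour):
        x[v] = tour[(pos + 1) % n]  # successor of v, wrapping at the end
    return x

# Tour 0 -> 2 -> 1 -> 3 -> 0
print(successor_representation([0, 2, 1, 3]))  # [2, 3, 1, 0]
```

With this encoding, position $i$ of every individual refers to the same vertex $i$, so the marginals $P(X_i = l)$ in Eqs. (1) and (3) are directly comparable across individuals.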

### 2.3 Difficulty in Considering Dependencies

The population diversity measures $H_{\mathrm{ind}}$ and $D$ do not consider the dependencies between the variables of the individuals in the population because both are expressed as functions of $P(x_i)$. The most naive entropy-based population diversity measure that considers the dependencies between the variables would be defined by
$$H = -\sum_{x_1 \in L} \cdots \sum_{x_n \in L} P(x_1, \ldots, x_n) \log P(x_1, \ldots, x_n).$$
(4)
Clearly, this population diversity measure is of no value in practical use (unless $n$ is extremely small) because it is impossible to obtain a sufficient number of samples from the population to estimate $P(x_1, \ldots, x_n)$. Therefore, the joint probability distribution $P(x_1, \ldots, x_n)$ must be estimated under certain assumptions. For example, when $P(x_1, \ldots, x_n)$ is modeled as $\prod_{i=1}^{n} P(x_i)$, the entropy $H$ reduces to the sum of the marginal entropies, i.e., $H = H_{\mathrm{ind}}$.

A Bayesian network could also be useful for modeling the joint probability distribution. It is represented as $P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_{\mathrm{parent}(i)})$, where each variable $x_i$ is conditioned only on its parent variables $x_{\mathrm{parent}(i)}$ (if this set is empty, $P(x_i \mid x_{\mathrm{parent}(i)})$ means the prior probability distribution $P(x_i)$). However, it is typically difficult to detect an appropriate conditional dependency structure between the variables, represented as a directed acyclic graph (DAG), in advance. Moreover, it is difficult to compute Eq. (4) efficiently even if a DAG is given in advance.

## 3 High-Order Entropy Measures Based on a Fixed-Order Markov Model

We propose two high-order entropy measures based on a fixed-order Markov model to measure population diversity of GAs for the TSP.

### 3.1 High-Order Entropy Measure $H_m$

Let an individual (tour) be represented as a permutation of the vertices. We model the joint probability distribution of the population on the assumption that the probability of observing each vertex at a certain position in an individual of the population depends on the sequence of $m\ (\geq 1)$ preceding vertices in the tour. This assumption is reasonable for the TSP: for example, when solving a TSP instance manually, the choice of the next vertex strongly depends on the partial path created so far near the current vertex, and the sequence of $m$ preceding vertices contains much of this information. Let $S_i\ (i = 1, \ldots, n)$ be a random variable representing the $i$-th vertex in an individual selected randomly from the population. Given that a tour has a cyclic structure, $P(S_1 = s_1, S_2 = s_2, \ldots, S_n = s_n)$, which we denote as $P(s_1, s_2, \ldots, s_n)$ for short, is represented by
$$P(s_1, s_2, \ldots, s_n) = \prod_{i=1}^{n} P(s_{i+m} \mid s_i, \ldots, s_{i+m-1}),$$
(5)
where index $i+n\ (1 \leq i \leq m)$ corresponds to index $i$. The entropy $H$ of this joint probability distribution is then calculated as follows:
$$\begin{aligned}
H &= -\sum_{s_1} \cdots \sum_{s_n} P(s_1, \ldots, s_n) \log P(s_1, \ldots, s_n) \\
&= -\sum_{s_1} \cdots \sum_{s_n} P(s_1, \ldots, s_n) \sum_{i=1}^{n} \log P(s_{i+m} \mid s_i, \ldots, s_{i+m-1}) \\
&= -\sum_{i=1}^{n} \sum_{s_1} \cdots \sum_{s_n} P(s_1, \ldots, s_n) \log P(s_{i+m} \mid s_i, \ldots, s_{i+m-1}) \\
&= -\sum_{i=1}^{n} \sum_{s_i} \cdots \sum_{s_{i+m}} P(s_i, \ldots, s_{i+m}) \log P(s_{i+m} \mid s_i, \ldots, s_{i+m-1}).
\end{aligned}$$
(6)
Given that each tour can start from an arbitrary vertex, the joint probability distribution of any $m+1$ consecutive random variables should be invariant with respect to shifts in the index; that is, $P(s_i, \ldots, s_{i+m})$ and $P(s_{i+m} \mid s_i, \ldots, s_{i+m-1})$ should be equivalent to $P(s_1, \ldots, s_{m+1})$ and $P(s_{m+1} \mid s_1, \ldots, s_m)$, respectively, regardless of the value of $i$. This assumption is justified by considering $n$ tours with all different start vertices for each individual (tour) in the population. Then, Eq. (6) can be simplified as follows:
$$\begin{aligned}
H &= -n \sum_{s_1} \cdots \sum_{s_{m+1}} P(s_1, \ldots, s_{m+1}) \log P(s_{m+1} \mid s_1, \ldots, s_m) \\
&= -n \sum_{s_1} \cdots \sum_{s_{m+1}} P(s_1, \ldots, s_{m+1}) \log \frac{P(s_1, \ldots, s_{m+1})}{P(s_1, \ldots, s_m)} \\
&= -n \Big\{ \sum_{s_1} \cdots \sum_{s_{m+1}} P(s_1, \ldots, s_{m+1}) \log P(s_1, \ldots, s_{m+1}) - \sum_{s_1} \cdots \sum_{s_m} P(s_1, \ldots, s_m) \log P(s_1, \ldots, s_m) \Big\} \\
&= n(\bar{H}_{m+1} - \bar{H}_m),
\end{aligned}$$
(7)
where
$$\bar{H}_k = -\sum_{s_1} \cdots \sum_{s_k} P(s_1, \ldots, s_k) \log P(s_1, \ldots, s_k).$$
(8)
The average entropy per symbol of the probability distribution of Eq. (5), which we denote as $H_m$, is then given by
$$H_m = \bar{H}_{m+1} - \bar{H}_m.$$
(9)
We use $H_m$ as a high-order entropy measure for the TSP. Note that $H_1\ (= \bar{H}_2 - \bar{H}_1)$ is essentially equivalent to $H_{\mathrm{ind}}$ because $P(S_1 = i, S_2 = j) \propto P(X_i = j)$ and $\bar{H}_1$ is a constant value for the TSP (because all symbols appear exactly once in each individual of the population). We have the following relation:
$$H_1 = \frac{1}{n} H_{\mathrm{ind}} + \mathrm{const}.$$
(10)

In information theory, $H_m$ is known as the entropy rate of an $m$-th order Markov information source modeled by the conditional probability distributions $P(s_{m+1} \mid s_1, \ldots, s_m)$, where the entropy rate of a data source is defined as the average information per symbol obtained from the data source. A central theorem of information theory states that the entropy rate of a data source indicates the average number of bits per symbol required to encode it. Therefore, the existence of identical sequences consisting of up to $m+1$ vertices in the population decreases the value of $H_m$; this effect is more prominent when the length of the high-frequency sequences increases.

To compute $H_m$, we must estimate $P(s_1, \ldots, s_k)$ for $k = m, m+1$ by sampling sequences of symbols (vertices) from individuals in the population; $P(s_1, \ldots, s_k)$ is estimated by $N(s_1, \ldots, s_k)/N_{\mathrm{sample}}$, where $N(s_1, \ldots, s_k)$ is the number of occurrences of the sequence of symbols $\{s_1, \ldots, s_k\}$ in the population and $N_{\mathrm{sample}}$ is the total number of all possible samples of the sequences in the population. Sequences of length $k$ are sampled from each individual while shifting the start of the sequence one by one, resulting in $n$ samples from each individual, i.e., $N_{\mathrm{sample}} = nN_p$. Note that in the STSP case, sequences of length $k$ are sampled from each individual for both travel directions, resulting in $2n$ samples from each individual, i.e., $N_{\mathrm{sample}} = 2nN_p$. Figure 1 illustrates the method used to sample the sequences from an individual (tour) in the ATSP and STSP cases.
Figure 1:

Illustration of the method to sample sequences of length $k\ (= 3)$ from a tour; sequences are sampled for the original (both) orientation(s) in the ATSP (STSP) case.

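The sampling and estimation just described can be sketched as follows (Python; function names are ours, and the single-orientation ATSP case is assumed):

```python
import math
from collections import Counter

def block_entropy(tours, k):
    """H̄_k: entropy of the empirical distribution of length-k vertex
    sequences, sampled cyclically by shifting the start one by one
    (n samples per tour, one orientation)."""
    counts = Counter()
    for tour in tours:
        n = len(tour)
        for i in range(n):
            counts[tuple(tour[(i + j) % n] for j in range(k))] += 1
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())

def high_order_entropy(tours, m):
    """H_m = H̄_{m+1} - H̄_m (Eq. (9))."""
    return block_entropy(tours, m + 1) - block_entropy(tours, m)

# Two identical tours: zero average entropy per symbol
pop = [[0, 1, 2, 3], [0, 1, 2, 3]]
print(high_order_entropy(pop, 2))  # 0.0
```

For a population of identical tours every length-$(m+1)$ sequence determines its successor, so $H_m$ vanishes, while a diverse population yields a strictly positive value.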

We store the values of $N(s_1, \ldots, s_k)\ (k \leq m+1)$ in the form of a tree because it is impractical to store all entries in a table for a large value of $m$. Let $T$ be a tree for storing $N(s_1, \ldots, s_k)\ (k \leq m+1)$. Figure 2 illustrates an example of the tree $T$ for $k \leq 4$ in the ATSP case (because it is easier to understand). The tree $T$ stores only the values $N(s_1, \ldots, s_k)$ that are nonzero; for example, $N(a,b) = 6$, $N(a,b,c) = 4$, and $N(a,c) = 0$ in this example. Note that in the TSP, $N(s_1)$ must be the same for all $s_1$ and the same symbol does not reappear in a sequence. For a population consisting of relatively high-quality solutions for the TSP (e.g., tours obtained by a local search algorithm with the 2-opt neighborhood), we have observed that the number of branches at each node is at most about ten (usually less than five).
Figure 2:

Tree representation of $N(s_1, \ldots, s_k)\ (k \leq 4)$.

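A minimal sketch of such a count tree (Python; the class and method names are ours) stores each nonzero prefix count at the corresponding node:

```python
class CountTrie:
    """Store nonzero counts N(s1,...,sk) for k <= depth as a trie; each
    node holds the count of the sequence from the root down to it."""
    def __init__(self):
        self.count = 0
        self.children = {}

    def add(self, seq):
        # Increment the count of every prefix of `seq`
        node = self
        for s in seq:
            node = node.children.setdefault(s, CountTrie())
            node.count += 1

    def get(self, seq):
        # Absent branches mean a count of zero
        node = self
        for s in seq:
            node = node.children.get(s)
            if node is None:
                return 0
        return node.count

trie = CountTrie()
for tour in [[0, 1, 2, 3], [0, 1, 3, 2]]:
    n = len(tour)
    for i in range(n):  # insert all cyclic length-3 windows
        trie.add([tour[(i + j) % n] for j in range(3)])

print(trie.get([0, 1]))     # N(0,1) = 2
print(trie.get([0, 1, 2]))  # N(0,1,2) = 1
```

Because only observed sequences create branches, the memory used grows with the diversity of the population rather than with $|L|^{m+1}$.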

As the value of $m$ is increased, $H_m$ captures higher-order dependencies in the sequences of symbols included in the population. In this sense, we want to increase the value of $m$. However, $H_m$ is of no value if the value of $m$ is overly large because it is unlikely that a sufficient number of samples (sequences of symbols) can be obtained from the population to estimate the conditional probability distributions $P(s_{i+m} \mid s_i, \ldots, s_{i+m-1}) = \frac{P(s_1, \ldots, s_{m+1})}{P(s_1, \ldots, s_m)} = \frac{N(s_1, \ldots, s_{m+1})}{N(s_1, \ldots, s_m)}$ for computing $H_m$; an estimated conditional probability distribution is unreliable if the count in the denominator is small. Therefore, there is a tradeoff between the potential ability to capture higher-order dependencies and the estimation accuracy of the conditional probability distributions, and we need to determine an appropriate value of $m$.

### 3.2 High-Order Entropy Measure $H_m^{\mathrm{adj}}$

One might think that $\bar{H}_{m+1}$ can also be used as a population diversity measure. This is equivalent to the entropy of the probability distribution $P(s_1, \ldots, s_{m+1})$ defined under the assumption that blocks of $m+1$ consecutive symbols $s_1, \ldots, s_{m+1}$ appear in the population according to this probability distribution and that these occurrences are independent of each other. However, these occurrences are actually correlated, and the definition of $\bar{H}_{m+1}$ neglects such dependencies. In information theory, $\bar{H}_{m+1}$ is known as the entropy of the adjoint source of the $(m+1)$-th extension of the original Markov source. From this point onward, we call $\bar{H}_{m+1}$ the high-order entropy measure $H_m^{\mathrm{adj}}$ to distinguish it from $H_m$.

Although the definition of $H_m^{\mathrm{adj}}$ ($= \bar{H}_{m+1}$) is somewhat ad hoc as a population diversity measure, we can find the following simple relationship between $H_m^{\mathrm{adj}}$ and $H_k\ (k = 1, \ldots, m)$:
$$H_1 + H_2 + \cdots + H_m = (\bar{H}_2 - \bar{H}_1) + (\bar{H}_3 - \bar{H}_2) + \cdots + (\bar{H}_{m+1} - \bar{H}_m) = H_m^{\mathrm{adj}} - \bar{H}_1.$$
(11)
Because $\bar{H}_1$ is a constant value (for the TSP), $H_m^{\mathrm{adj}}$ is equivalent to $H_1 + H_2 + \cdots + H_m$. As can be expected, the best value of $m$ for $H_m^{\mathrm{adj}}$ will be greater than that for $H_m$ because $H_m^{\mathrm{adj}}$ mixes the high-order entropy measures $H_k\ (k = 1, \ldots, m)$ equally. We discuss the advantages of this population diversity measure in Section 6.3.
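The telescoping identity of Eq. (11) can be checked numerically on a small population, estimating the block entropies $\bar{H}_k$ by cyclic sampling (Python sketch; function names are ours, single-orientation ATSP case):

```python
import math
from collections import Counter

def block_entropy(tours, k):
    """Empirical H̄_k from cyclic length-k windows (one orientation)."""
    counts = Counter()
    for tour in tours:
        n = len(tour)
        for i in range(n):
            counts[tuple(tour[(i + j) % n] for j in range(k))] += 1
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())

pop = [[0, 1, 2, 3, 4], [0, 2, 4, 1, 3], [0, 1, 3, 2, 4]]
m = 3
# H_1 + ... + H_m telescopes to H_m^adj - H̄_1 (with H_m^adj = H̄_{m+1})
lhs = sum(block_entropy(pop, k + 1) - block_entropy(pop, k)
          for k in range(1, m + 1))
rhs = block_entropy(pop, m + 1) - block_entropy(pop, 1)
assert abs(lhs - rhs) < 1e-12
```

The check confirms that, up to the constant $\bar{H}_1$, maximizing $H_m^{\mathrm{adj}}$ is the same as maximizing the equally weighted sum of $H_1, \ldots, H_m$.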

## 4 High-Order Entropy Measure Based on a Variable-Order Markov Model

We propose a high-order entropy measure based on a variable-order Markov model to measure population diversity of GAs for the TSP.

### 4.1 Motivation

The high-order entropy measure $H_m$ defined in the previous section is derived from the assumption that the probability of occurrence of a symbol appearing in individuals in the population obeys a Markov process of order $m$. In the following equations, we use random variables $S_i\ (i = -m, \ldots, -2, -1, 0)$ to represent a Markov process of order $m$, where $S_0$ represents the symbol to be observed next and $S_{-i}\ (1 \leq i \leq m)$ represents the $i$-th preceding symbol. The expression of $H_m$ is then given by the following formula:
$$H_m = -\sum_{s_{-m}} \cdots \sum_{s_{-1}} \sum_{s_0} P(s_{-m}, \ldots, s_{-1}, s_0) \log P(s_0 \mid s_{-m}, \ldots, s_{-1}).$$
(12)
In theory, the value of $H_m$ is equivalent to the entropy rate of a Markov process of order $k$ as long as $k \leq m$ (i.e., $H_m = H_k$) because $P(s_0 \mid s_{-m}, \ldots, s_{-k}, \ldots, s_{-1}) = P(s_0 \mid s_{-k}, \ldots, s_{-1})$ in a Markov process of order $k$.

To capture higher-order dependencies without suffering from a lack of sufficient statistics, we model the probability of the occurrence of symbols (vertices) appearing in individuals in the population as a variable-order Markov process. The variable-order Markov model was first suggested in Rissanen (1983) for data compression purposes. Later variants of variable-order Markov models have been successfully applied to areas such as statistical analysis, classification, and prediction (Shmilovici and Ben-Gal, 2007; Begleiter et al., 2004; Ben-Gal et al., 2003). In a variable-order Markov process, the probability distribution of observing the next symbol $s_0$ depends on the preceding symbols of variable length $k$. The basic idea is to determine the value of $k$ adaptively such that the number of samples $N(s_{-k}, \ldots, s_{-1})$ is a sufficient statistic for estimating the conditional probability distribution $P(s_0 \mid s_{-k}, \ldots, s_{-1}) = \frac{N(s_{-k}, \ldots, s_{-1}, s_0)}{N(s_{-k}, \ldots, s_{-1})}$.

### 4.2 A High-Order Entropy Measure $H_m^{\mathrm{vari}}$

A variable-order Markov process is characterized by a set of conditional probability distributions $\{P(s_0 \mid s_c)\}_{s_c \in S}$, where $S$ is a set of sequences of symbols for the conditioning variables. We set an upper limit $m$ on the length of the sequences for the conditioning variables because it is impractical to store all conditional probability components if $m$ is overly large (e.g., $m > 10$). For any sequence of symbols $\{s_{-m}, \ldots, s_{-2}, s_{-1}, s_0\}$, the length of the sequence assigned to the conditioning variables must be uniquely determined. To represent a set $S$ that satisfies this requirement, a context tree (Rissanen, 1983) is useful. Let $\tilde{s}_c$ be the reverse sequence of $s_c$ and let $\tilde{S} = \{\tilde{s}_c \mid s_c \in S\}$. Then, the elements of $\tilde{S}$ are represented as the leaf nodes of a context tree $\tilde{S}$ (we use this symbol to refer to the context tree as well) as illustrated in Figure 3, where each number represents the number of occurrences of the corresponding sequence in the population. Note that the symbol “$\#$” means any symbol other than the symbols of its sibling nodes. For a given sequence $\{s_{-m}, \ldots, s_{-2}, s_{-1}\}$ ($s_0$ is observed next), the conditioning part is determined by tracing the context tree $\tilde{S}$ to the maximum extent possible from top to bottom according to $s_{-1}, s_{-2}, \ldots, s_{-m}$. For example, if $\{s_{-3}, s_{-2}, s_{-1}\} = \{h, c, a\}\ (m = 3)$, the conditioning part is determined as $\{s_{-2}, s_{-1}\} = \{c, a\}$ and the conditional probability distribution of observing the next symbol $s_0$ is given by $P(S_0 = s_0 \mid S_{-2} = c, S_{-1} = a)$, or $P(s_0 \mid c, a)$ for short. In another example, if $\{s_{-3}, s_{-2}, s_{-1}\} = \{d, b, a\}$, the conditioning part is determined as $\{s_{-2}, s_{-1}\} = \{\#, a\}$ (“$\#$” means any symbol other than $c$ and $f$) and the conditional probability distribution of observing the next symbol $s_0$ is given by $P(S_0 = s_0 \mid S_{-2} = \#, S_{-1} = a)$, or $P(s_0 \mid \#, a)$. This conditional probability distribution is estimated by $\frac{P(\#, a, s_0)}{P(\#, a)} = \frac{P(a, s_0) - P(c, a, s_0) - P(f, a, s_0)}{P(a) - P(c, a) - P(f, a)} = \frac{N(a, s_0) - N(c, a, s_0) - N(f, a, s_0)}{N(a) - N(c, a) - N(f, a)}$.
Figure 3:

Context tree representation of $\tilde{S}$.

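The lookup of the conditioning part can be sketched as follows (Python; the function name and the nested-dict tree encoding are ours, with a small hypothetical context tree mirroring the two worked examples):

```python
def conditioning_context(context_tree, preceding):
    """Trace the context tree from the root with s_{-1}, s_{-2}, ... as far
    as possible; '#' matches any symbol absent from its sibling nodes.
    `context_tree` is a nested dict keyed by symbols (or '#'); a value of
    None marks a leaf. Returns the matched context, most recent symbol first."""
    node, ctx = context_tree, []
    for s in reversed(preceding):  # start from s_{-1}
        if node is None:
            break  # reached a leaf: the context is complete
        if s in node:
            ctx.append(s)
            node = node[s]
        elif '#' in node:
            ctx.append('#')
            node = node['#']
        else:
            break
    return ctx

# Hypothetical context tree with contexts {b} and {c,a}, {f,a}, {#,a}
tree = {'a': {'c': None, 'f': None, '#': None}, 'b': None}
print(conditioning_context(tree, ['h', 'c', 'a']))  # ['a', 'c'] → context {c, a}
print(conditioning_context(tree, ['d', 'b', 'a']))  # ['a', '#'] → context {#, a}
```

Because every path from the root ends at a leaf (possibly through “$\#$”), the conditioning length is uniquely determined for any preceding sequence, as required.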

The entropy rate of the variable-order Markov process, which we denote as $H_m^{\mathrm{vari}}$, is then defined by
$$\begin{aligned}
H_m^{\mathrm{vari}} &= -\sum_{s_c \in S} \sum_{s_0 \in L} P(s_c, s_0) \log P(s_0 \mid s_c) = -\sum_{s_c \in S} \sum_{s_0 \in L} P(s_c, s_0) \log \frac{P(s_c, s_0)}{P(s_c)} \\
&= -\sum_{s_c \in S} \sum_{s_0 \in L} P(s_c, s_0) \log P(s_c, s_0) + \sum_{s_c \in S} P(s_c) \log P(s_c).
\end{aligned}$$
(13)
If the context tree $\tilde{S}$ is a perfect tree of depth $m$, $H_m^{\mathrm{vari}}$ is equivalent to $H_m$. Conversely, $H_m^{\mathrm{vari}}$ can be viewed as an approximation of $H_m$ because $H_m^{\mathrm{vari}}$ is obtained from $H_m$ by replacing $P(s_0 \mid s_{-m}, \ldots, s_{-k}, \ldots, s_{-1})$ with a lower-order conditional probability distribution $P(s_0 \mid s_{-k}, \ldots, s_{-1})$ for all $\{s_{-k}, \ldots, s_{-1}\} \in S$.

Next, we describe the method to determine $\tilde{S}$ (and equivalently $S$), which is essentially equivalent to the learning algorithms used in Rissanen (1983), Ron et al. (1996), Mächler and Bühlmann (2004), and Schulz et al. (2008). Let $\tilde{N}(s_{-1}, \ldots, s_{-k})$ be the number of occurrences of the reverse sequence of symbols $\{s_{-k}, \ldots, s_{-1}\}$ in the population, i.e., $\tilde{N}(s_{-1}, \ldots, s_{-k}) = N(s_{-k}, \ldots, s_{-1})$. As with tree $T$, let $\tilde{T}$ be a tree for storing the values of $\tilde{N}(s_{-1}, \ldots, s_{-k})\ (k \leq m)$. Figure 4 illustrates an example of the tree $\tilde{T}$, which corresponds to the tree $T$ presented in Figure 2 (e.g., $N(a, f, b) = \tilde{N}(b, f, a) = 4$). Note that in the case of the STSP, $T = \tilde{T}$ (except for the maximum depth of the tree), but the displayed example shows the case of the ATSP (because it is easier to understand). For a current population, the context tree $\tilde{S}$ is constructed by the following procedure, where $ratio$ is a parameter taking a value between zero and one.

Construction of context tree $\tilde{S}$
1. $\tilde{S}$ is initialized as the perfect tree of depth one, i.e., $\tilde{S} = \{s_{-1} \mid s_{-1} \in L\}$.

2. For each of the leaf nodes $\{s_{-1}, \ldots, s_{-k}\} \in \tilde{S}$ with $k < m$, if there exists a symbol(s) $s'_{-(k+1)} \in L$ such that $ratio \times N_p \leq \tilde{N}(s_{-1}, \ldots, s_{-k}, s'_{-(k+1)})$, this node is expanded to generate a new leaf node(s) $\{s_{-1}, \ldots, s_{-k}, s'_{-(k+1)}\}$. Expansions of the leaf nodes are iterated until no further expansion is possible.

3. For every node that has been expanded, generate a child node with the symbol “$\#$”.
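The three steps above can be sketched as follows (Python; the function name, the nested-dict tree encoding, and the example counts are ours, loosely following Figures 3 and 4):

```python
def build_context_tree(counts, symbols, m, threshold):
    """Steps 1-3 of the construction procedure. `counts` maps reverse
    sequences (s_-1, ..., s_-k) to N~ values; missing keys mean zero.
    `threshold` plays the role of ratio * Np."""
    def expand(prefix, depth):
        node = {}
        # Step 2: expand while k < m and the child's count reaches the threshold
        if depth < m:
            for s in symbols:
                if counts.get(prefix + (s,), 0) >= threshold:
                    node[s] = expand(prefix + (s,), depth + 1)
        # Step 3: every expanded node gets a catch-all '#' child
        if node:
            node['#'] = None
        return node or None  # None marks a leaf of the context tree

    # Step 1: the perfect tree of depth one (one context per symbol)
    return {s: expand((s,), 1) for s in symbols}

# Hypothetical counts (threshold = ratio * Np = 8)
counts = {('a',): 20, ('a', 'c'): 10, ('a', 'f'): 9, ('b',): 20}
tree = build_context_tree(counts, ['a', 'b', 'c', 'f'], m=3, threshold=8)
print(tree['a'])  # {'c': None, 'f': None, '#': None}
print(tree['b'])  # None
```

Here context $\{a\}$ is expanded because $\tilde{N}(a,c)$ and $\tilde{N}(a,f)$ reach the threshold, while $\{b\}$ remains a leaf; unseen second symbols after $a$ fall into the “$\#$” child.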

Figure 4:

Tree representation of $\tilde{N}(s_{-1}, \ldots, s_{-k})\ (k \leq 3)$.


According to the above procedure, the context tree $\tilde{S}$ presented in Figure 3 is obtained from the tree $\tilde{T}$ presented in Figure 4 (if $ratio \times N_p = 8$).

The aim behind the expansion of a leaf node $\{s_{-1}, \ldots, s_{-k}\}$ (Step 2 of the above procedure) is to capture the higher-order dependency expressed as the conditional probability distribution $P(s_0 \mid s'_{-(k+1)}, s_{-k}, \ldots, s_{-1})$ only when a sufficient statistic for estimating this conditional probability distribution is judged to be available. The parameter $ratio$ balances the tradeoff between the potential ability to capture higher-order dependencies and the estimation accuracy of the conditional probability distributions.

In Rissanen (1983), Ron et al. (1996), Mächler and Bühlmann (2004), and Schulz et al. (2008), the context tree constructed by the above algorithm is then pruned based on the magnitude of the effect on the stochastic model. However, we do not use this pruning procedure because we confirmed in a preliminary experiment that the performance of the GA deteriorated when it was introduced.

## 5 GA Framework

To evaluate the ability of the proposed population diversity measures $H_m$, $H_m^{\mathrm{adj}}$, and $H_m^{\mathrm{vari}}$, we run the GA proposed in Nagata and Kobayashi (2013) using each of the three types of population diversity measures with different values of $m$ and $ratio$. This GA is one of the most effective heuristic algorithms for the TSP. One important factor for achieving top performance is maintaining population diversity by evaluating offspring solutions based on the change in population diversity when they are selected to survive in the population, as well as on the tour length. The independent entropy measure $H_{\mathrm{ind}}$ was originally used for evaluating population diversity.

Algorithm 1 depicts the GA framework. The population consists of $N_p$ individuals. The initial population is generated by a greedy local search algorithm with the 2-opt neighborhood (Line 1). At each generation (Lines 3–8) of the GA, each of the population members is selected, once as parent $p_A$ and once as parent $p_B$, in random order (Lines 3 and 5). For each pair of parents, the edge assembly crossover (EAX) operator generates $N_{ch}$ (e.g., 30) offspring solutions (Line 6). Then, the best solution among the generated offspring solutions and $p_A$ in terms of a given evaluation function is selected, and the selected individual replaces the population member selected as $p_A$ (Line 7). Therefore, no replacement occurs if all offspring solutions are worse than $p_A$. Note that only parent $p_A$ is replaced, to better maintain population diversity, because EAX typically generates offspring solutions similar to $p_A$. Generations are repeated until a termination condition is achieved (Line 9).

Figure 5 illustrates an outline of EAX. From a selected pair of parent solutions (tours), EAX can generate a number of offspring solutions (tours) through two phases. In the first phase, intermediate solutions are constructed by assembling edges of the parent solutions (tours) under the relaxed constraint that exactly two edges are linked to every vertex; an intermediate solution typically consists of several subcycles. In the second phase, each intermediate solution is modified into a Hamiltonian cycle by merging the subcycles heuristically. There are several strategies for generating offspring solutions. For example, EAX with the “single” strategy typically generates an intermediate solution like (a) in the figure, where the majority of the edges are inherited from parent $p_A$. Conversely, EAX with the “block 2” strategy typically generates an intermediate solution like (b), where a number of edges are inherited from each parent. EAX with the single strategy is used from the beginning to almost the end of the search, and EAX with the block 2 strategy is used only at the final stage of the search. The timing of the switch from the single strategy to the block 2 strategy and the termination condition are determined based on the number of consecutive generations for which the best individual of the population is not updated. For more details, we refer the reader to the original paper (Nagata and Kobayashi, 2013).
Figure 5:

Outline of EAX for the TSP.


We define the evaluation function for selecting the individual that replaces $x_{r(i)}\ (= p_A)$ (Line 7) in a somewhat different manner from the one used in Nagata and Kobayashi (2013) because this evaluation function is more natural than the original. Let $L$ be the average tour length of the population and $H$ the selected population diversity measure ($H_m$, $H_m^{\mathrm{adj}}$, or $H_m^{\mathrm{vari}}$). First, as in Nagata and Kobayashi (2013), only offspring solutions that do not deteriorate the tour length of $x_{r(i)}$ can be selected because, without this restriction, we would need a careful cooling schedule for a parameter $T$ (described later) for the population to converge. The best one is then selected among them such that $L - TH$ is minimized after the replacement, where $T$ is a parameter that controls the balance between the exploration (maximization of $H$) and exploitation (minimization of $L$) of the search. Therefore, offspring solutions are evaluated by the following evaluation function (a lower value is better):
$Eval(y)=\begin{cases}\Delta L(y)-T\,\Delta H(y) & (\Delta L(y)\le 0)\\ \infty & (\Delta L(y)>0),\end{cases}$
(14)
where $ΔL(y)$ and $ΔH(y)$ denote the changes in $L$ and $H$, respectively, when an offspring solution $y$ replaces $xr(i)$. After evaluating all offspring solutions, the one with the smallest evaluation value is selected to replace $xr(i)$. Note that if all evaluation values of the offspring solutions are greater than zero, $pA$ is retained (i.e., no replacement occurs) because $Eval(pA)=0$. The parameter $T$ is updated in the following manner. At the beginning of the GA, $T$ is set to $∞$ (a sufficiently large value). The value of $T$ is updated every time the replacement of $xr(i)$ has been performed $Np$ times. Let $L'$ and $H'$ be the values of $L$ and $H$, respectively, at the last update of $T$. The value of $T$ is then updated by $T=\frac{L-L'}{H-H'}$. Figure 6 illustrates the meaning of this evaluation function. In this figure, the solid curve represents a trajectory of $(L,H)$ in the $L$-$H$ plane during the search. If an offspring solution $y$ replaces $xr(i)$, the current position $(L,H)$ moves by the vector $(ΔL(y),ΔH(y))$ as illustrated in the figure ($y$ is omitted in the figure). The value of $ΔL(y)-TΔH(y)$ is proportional to the component of the vector $(ΔL(y),ΔH(y))$ in the direction (dotted arrow in the figure) perpendicular to the vector $(L-L',H-H')$. Note that the value of $ΔL(y)-TΔH(y)$ is negative in the example in the figure, and a lower value is preferable.
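The evaluation rule and the temperature update can be sketched as follows (a minimal illustration only; the function names are ours, and the GA's bookkeeping of $L$, $H$, and the replacement counter is omitted):

```python
import math

def evaluate(dL, dH, T):
    """Eq. (14): offspring that worsen the tour length of x_r(i)
    (dL > 0) are rejected outright; otherwise the change in average
    tour length is traded against the change in diversity via T."""
    return dL - T * dH if dL <= 0 else math.inf

def update_T(L, H, L_prev, H_prev):
    """T is reset to the slope of the recent (L, H) trajectory,
    T = (L - L') / (H - H'), every Np replacements."""
    return (L - L_prev) / (H - H_prev)
```

For example, an offspring with $ΔL=-2$ and $ΔH=-1$ evaluates to $-2-T(-1)=1$ when $T=3$, so it is worse than keeping $pA$ (for which $Eval(pA)=0$), even though it shortens the tours: the diversity loss outweighs the length gain at this temperature.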
Figure 6:

Meaning of the evaluation function $ΔL(y)-TΔH(y)$.


For every offspring solution $y$, we must compute $ΔL(y)$ and $ΔH(y)$ to obtain the value of $Eval(y)$. However, the computational cost of $ΔH(y)$ for $Hm$, $Hmadj$, and $Hmvari$ is not negligible, especially for large values of $m$. Although this cost is partially alleviated by computing $ΔH(y)$ only when $ΔL(y)≤0$, an efficient algorithm for computing $ΔH(y)$ is still required. An outline of the efficient computation of $ΔH(y)$ is presented in the Appendix.

When the population diversity measure $Hmvari$ is used, the context tree $S˜$ should, in principle, be updated each time $xr(i)$ is replaced. However, we decided to update the context tree at the beginning of each generation (before Line 3) because we confirmed that the results did not change significantly with this update scheme, and an efficient algorithm for immediately updating the context tree is complicated (although it can be implemented).

## 6 Experimental Results

We now present experimental results to analyze the ability of the proposed high-order entropy measures. The GA was implemented in C++ on Ubuntu 14.04, and the program was executed on PCs with an Intel Core i7-4790 (3.60 GHz) CPU.

### 6.1 Experimental Settings

To investigate the ability of the three types of high-order entropy measures $Hm$, $Hmadj$, and $Hmvari$, we ran the GA described in Section 5 using each of the population diversity measures in the evaluation function (14). We used the default configuration of the GA except for the population diversity measure ($Hind$ was used in the original GA), where $Np=300$ and $Nch=30$ in the default configuration (Nagata and Kobayashi, 2013). In addition, we tested $Np=50$ to investigate the effect of the population size. We also tested the GA using either the pairwise Hamming distance $D$ (Eq. (3)) or no population diversity measure ($T=0$ in the evaluation function) to establish a baseline for $Hind$. The population diversity measures tested are summarized as follows:

• Greedy (no population diversity measure), $D$, $Hind$

• $Hm(m=2,3,4,5,6,8)$

• $Hmadj(m=2,3,4,5,6,8)$

• $Hmvari(m=6;ratio=0.02,0.05,0.1,0.2,0.4)$

Note that $H1$ and $H1adj$ are equivalent to $Hind$. For $Hmvari$, the value of $m$ was fixed to six, and we show the results of $H6vari$ with different values of the parameter $ratio$, because presenting all combinations of $m$ and $ratio$ would require too much data.

For each population diversity measure, we performed 30 independent runs of the GA on instances from three well-known benchmark sets for the STSP, where we selected all 54 instances with sizes ranging from 4,000 to 30,000.2

### 6.2 Results

Table 1 lists the results of the GA using different population diversity measures under the population size $Np=300$; detailed results are presented only for selected population diversity measures, including $Greedy$ and $Hind$, to avoid displaying a huge amount of data. For each of $Hm$, $Hmadj$, and $H6vari$, the parameter value of $m$ or $ratio(=0.1)$ that achieved relatively good results is selected. Each line presents the instance name (Instance), the optimal or best-known solution (Opt. or UB), the number of runs that succeeded in finding the optimal or best-known solution (Su), and the average error of the best tour lengths in the 30 runs from the optimal or best-known solutions (A-Err), where one unit in the column A-Err is $10^{-5}$%. Full results for all population diversity measures are provided in the online supplementary file, available at https://www.mitpressjournals.org/doi/suppl/10.1162/evco_a_00268.

Table 1:
Solution quality of GA using different population diversity measures.
$Greedy$ | $Hind$ | $H3$ | $H6adj$ | $H6vari$
Instance | Opt. (UB) | Su A-Err | Su A-Err | Su A-Err | Su A-Err | Su A-Err
fnl4461 182566 858 30 30 30 30
rl5915 565530 1294 30 30 30 23 82
rl5934 556045 4130 23 342 28 64* 26 164 27 109
pla7397 23260728 894 251 585 425 502
rl11849 923288 835 24 12 25 12 28 28
usa13509 19982859 816 14 18 23 9* 16 13 25 7*
brd14051 469385 820 29 22 15 26 24 12
d15112 1573084 512 18 14 11 23 16
d18512 645238 747 18 21 12 25 3* 19 10
ca4663 1290319 555 27 24 30 0* 30 0* 30 0*
pm4951 114855 4600 24 87 255 21 240 25 142
tz6117 394718 717 12 99 18 72 14 99 19 72
ar6723 837479 1822 25 19 24 23 23 27 25 19
ho7103 177092 717 24 22 21 39 19 43 11 84
eg7146 172386 446 23 19 21 34 29 3* 30 0*
ym7663 238314 1759 29 22 288 27 28 92
ei8246 206171 994 13 79 28 9* 30 0* 30 0*
ja9847 491924 2598 83 12 43* 58* 15 48*
gr9882 300899 614 12 163 13 55 16 46 19 37*
kz9976 1061881 1283 27 29 29 30 0*
fi10639 520527 796 15 33 25 10* 29 0* 25 10*
mo14185 (427377) 627 13 28 18 18 24 8* 21 13*
it16862 557315 1065 57 44 20* 24*
vm22775 569288 1330 116 138 115 92
sw24978 855597 1014 14 29 10 31 15 18 14 30
bgb4355 12723 4034 30 30 30 30
bgd4396 13009 19 486 27 76 25 128 30 0* 30 0*
frv4410 10711 2116 30 30 30 30
bgf4475 13221 1159 25 126 30 0* 30 0* 30 0*
xqd4966 15316 1654 30 30 30 30
fqm5087 13029 1074 30 29 25 29 25 30
fea5557 15445 17 841 29 21 30 30 30
xsc6880 21535 3451 22 216 27 46* 28 30* 25 77
bnd7168 21834 18 305 30 30 30 30
lap7454 19535 784 30 30 30 30
ida8197 22338 1910 29 14 28 29 30 30
dga9698 27724 1262 28 36 30 30 30
xmc10150 28387 1585 15 258 24 82* 30 0* 25 58*
xvb13584 37083 1240 21 80 27 26* 27 26* 29 8*
xrb14233 (45462) 1649 461 381* 300* 15 227*
xia16928 (52850) 1255 19 113 19 145 17 145 20 88
pjh17845 (48092) 1919 12 138 19 97 16 97 16 97
frh19289 (55798) 1409 27 23 29 30 0* 30 0*
fnc19402 (59287) 1388 19 67 20 73 26 22* 21 50
ido21215 (63517) 1548 20 104 22 68 28 15* 19 78
fma21553 (66527) 1332 12 135 22 45* 14 95 23 35*
lsb22777 (60977) 754 19 65 26 21* 26 27* 29 5*
xrh24104 (69294) 1279 28 29 27 14 30
bbz25234 (69335) 1413 23 38 28 14* 29 4* 28 9*
irx28268 (72607) 1221 28 27 13 26 18 27 13
fyg28534 (78562) 1251 12 106 19 50* 18 59* 24 29*
icx28698 (78088) 1494 217 170* 13 93* 10 102*
boa28924 (79622) 1486 121 117 117 108
ird29514 (80353) 2152 215 12 99* 14 87* 19 58*
Statistical test W=0: L=54 — W=17: L=4 W=22: L=1 W=22: L=4

For each population diversity measure $H$, we performed a one-sided Wilcoxon rank sum test for the null hypothesis that the median of the distribution of the tour lengths (of the best solution of each run) obtained with $H$ is greater than that obtained with $Hind$. If the null hypothesis was rejected at a significance level of 0.05, the corresponding value in the column A-Err is marked with an asterisk (indicating that $H$ was significantly superior to $Hind$). Conversely, if the opposite null hypothesis was rejected, the corresponding value is marked with a dagger (indicating that $H$ was significantly worse than $Hind$). For each population diversity measure, the numbers of asterisks and daggers are presented in the bottom line (Statistical test) of the table, where “W=a: L=b” indicates that the numbers of asterisks and daggers are $a$ and $b$, respectively.
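The test applied above is the standard rank-sum test. As an illustration only (our own pure-Python sketch using the normal approximation with midranks for ties, not the authors' code, and omitting the tie correction to the variance for brevity), a one-sided p-value can be computed as follows:

```python
import math

def rank_sum_test_less(x, y):
    """One-sided Wilcoxon rank-sum (Mann-Whitney) test, normal
    approximation: p-value for the alternative that values in x
    tend to be smaller than values in y."""
    n1, n2 = len(x), len(y)
    data = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rsum_x = 0.0
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j][0] == data[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        rsum_x += midrank * sum(1 for k in range(i, j) if data[k][1] == 0)
        i = j
    mu = n1 * (n1 + n2 + 1) / 2.0            # mean rank sum under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (rsum_x - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # P(Z <= z)
```

In practice one would use a library routine (e.g., an exact or tie-corrected implementation); the sketch only shows the shape of the one-sided comparison used in the tables.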

We performed the same one-sided Wilcoxon rank sum test between $Hind$ and each of the other population diversity measures under the population sizes $Np=50$ and 300, and the summarized results (W and L) are presented in Table 2. Note that when the GA using $Hind$ found optimal (or best-known) solutions in most of the 30 runs, no statistically significant difference can be detected even if the GA using the population diversity measure $H$ finds optimal (or best-known) solutions in all 30 runs. This situation is particularly noticeable when the number of vertices is small (e.g., $n<10,000$). Therefore, the results in Table 2 are presented separately for all 54 instances, the set of 27 instances with $n<10,000$, and the set of 27 instances with $10,000≤n$. When the population size is 300, the numbers of instances for which a statistically significant difference can be detected (provided the optimal (best-known) solution is found in all runs) are 15 ($n<10,000$) and 24 ($10,000≤n$). This situation does not occur when the population size is 50.

Table 2:
Results of Wilcoxon rank sum test between $Hind$ and other population diversity measures.
$Np=300$ | $Np=50$
All | $n<10^4$ | $10^4≤n$ | All | $n<10^4$ | $10^4≤n$
Population diversity | W L | W L | W L | W L | W L | W L
$Greedy$ — 54 27 27 54 27 27
$D$ — 43 22 21 54 27 27
$m=2$ 11 11
$m=3$ 17 11 10
$Hm$ $m=4$ 15 14 10
$m=5$ 13 11 31 10 21
$m=6$ 14 13 42 15 27
$m=8$ 21 14 49 22 27
$m=2$ 11 13
$m=3$ 17 12 26 10 16
$Hmadj$ $m=4$ 23 15 29 13 16
$m=5$ 22 16 33 17 16
$m=6$ 22 15 24 15
$m=8$ 18 11 20 14
$r=0.02$ 17 42 16 26
$r=0.05$ 20 12 27 20
$H6vari$ $r=0.1$ 22 14 13
($r=ratio$$r=0.2$ 19 13 31 13 18
$r=0.4$ 17 13 21 15

Table 2 indicates that the GA using the independent entropy measure $Hind$ clearly outperforms the GA using either no population diversity measure or the pairwise Hamming distance $D$. In the following, we first compare the results of the three types of high-order entropy measures when the population size is 300.

Results of $Hm$. Table 2 shows that the results of $Hm$ gradually improve as the value of $m$ increases from one ($H1$ is equivalent to $Hind$) to three or four. For greater values of $m$, however, the results of the statistical test gradually deteriorate as $m$ increases. These results demonstrate that the ability to evaluate population diversity can be improved by considering high-order dependencies between consecutive symbols (vertices) in the population. As expected, however, there is a tradeoff between the potential ability to capture higher-order dependencies and the estimation accuracy of the conditional probability distributions required for computing $Hm$. The experimental results indicate that $m=3$ or 4 achieves an appropriate tradeoff between the two for this population size ($Np=300$).

Results of $Hmadj$. Table 2 shows that the results of $Hmadj$ gradually improve as the value of $m$ increases from one ($H1adj$ is equivalent to $Hind$) to four, five, or six. However, the result of $H8adj$ is worse than that of $H6adj$, though it is still better than that of $Hind$. As can be predicted (see Section 3.2) and as the results support, the best value of $m$ for $Hmadj$ is greater than that for $Hm$. More importantly, the best result of $Hmadj$ (obtained with $m=4$, 5, or 6) is superior to the best result of $Hm$ (obtained with $m=3$ or 4). This suggests that $Hmadj$ (with an appropriate value of $m$) is a better population diversity measure than $Hm$. We analyze the reason for this in the next subsection.

Results of $H6vari$. In preliminary experiments on selected instances, better results were obtained with $m=6$ than with $m=4$, but there was no significant difference between $m=6$ and $m=8$. Table 2 shows that the GA using $H6vari$ outperforms the GA using $Hind$ for all values of $ratio$ ranging from 0.02 to 0.4, with the best value of $ratio$ around 0.1. When $H6vari(ratio=0.1)$ is compared with $Hm$ and $Hmadj$, the result of $H6vari(ratio=0.1)$ is better than the results of $H3$ and $H4$ (the best results of $Hm$). This indicates that the variable-order Markov model (see Section 4.1) works as expected. However, the result of $H6vari(ratio=0.1)$ is slightly worse than the results of $H4adj$, $H5adj$, and $H6adj$ (the best results of $Hmadj$).

Results on hard instances. The result of $Hind$ is poor (e.g., $Su<5$) on some instances. This is because these instances have a deceptive structure; that is, some edges of the optimal solution are not included in the majority of near-optimal solutions. In general, finding the optimal solution of such instances is more difficult. Table 1 shows that the results of $Hind$ on these instances are more or less improved by using any of the high-order entropy measures $H3$, $H6adj$, and $H6vari$, except on instance pla7397.

Effect of the population size. Next, we describe the results of the three types of high-order entropy measures when the population size is 50, focusing mainly on the effect of the population size. Table 2 shows that the use of $Hm$ improves on the result of $Hind$ only when $m=2$, whereas the result of $Hind$ is improved by $Hm(m=2,3,4,5)$ under the population size of 300. This makes sense because, with a smaller population size, it is more difficult to obtain a sufficient number of samples (sequences of symbols) from the population to estimate the conditional probability distributions for all values of $m$. On the other hand, the superiority of $Hmadj$ over $Hind$ is retained for large values of $m$ even with the small population size ($Np=50$). As for the high-order entropy measure $H6vari$, the best value of $ratio$ is around 0.2, which is greater than that ($=0.1$) when $Np=300$. Given the role of $ratio$ in constructing the context tree $S˜$, this is an expected result. Therefore, the value of $ratio$ should be set appropriately depending on the population size.

Execution time. Next, we discuss the difference in the computation time of the GA when using the different population diversity measures. Table 3 lists the average computation time in seconds of the GA using the different population diversity measures. Results are presented for the six instances listed in the table to avoid displaying a huge amount of data, where two instances ($10,000≤n$) were randomly selected from each of the three benchmark sets. In addition, the bottom line (Ave. ratio) of the table shows the ratio of the execution time to that of the GA using $Hind$, averaged over the six instances. The table shows that the execution time tends to increase as the value of $m$ increases for the population diversity measures $Hm$ and $Hmadj$. This is mainly because of the increased computational effort of calculating $ΔH(y)$ in the evaluation function (14). Note that the execution times of $Hm$ are smaller than those of $Hmadj$ (for the same value of $m$) even though the calculation of $Hmadj$ is simpler than that of $Hm$. The reason is that the number of generations the GA required to complete the search was smaller when $Hm$ was used than when $Hmadj$ was used. As for the population diversity measure $H6vari$ with various values of $ratio$, the results are similar to those of $H6adj$.

Table 3:
Average execution time (in seconds) of the GA using different population diversity measures.
$H6vari(ratio)$
Instance | $Hind$ | $Greedy$ | $D$ | (0.02) | (0.05) | (0.1) | (0.2) | (0.4)
usa13509 2429 1097 2442 3458 3704 3620 3475 3068
d15112 3482 1932 3640 5743 5543 5127 4841 4284
it16862 2958 1547 3009 5201 4803 4600 4341 4104
pjh17845 1636 947 1677 2803 2685 2600 2497 2351
fma21553 2053 1094 2084 3518 3354 3285 3154 2982
sw24978 5930 3351 6142 10388 9769 9386 8744 8199
Ave. ratio — 0.53 1.02 1.69 1.61 1.55 1.47 1.36
$Hm$ $Hmadj$
$m=2$ $m=3$ $m=4$ $m=5$ $m=6$ $m=8$ $m=2$ $m=3$ $m=4$ $m=5$ $m=6$ $m=8$
usa13509 2402 2352 2451 2614 2726 3069 2619 2717 2951 3218 3564 4471
d15112 3562 3550 3634 3982 4161 4901 3718 3762 4106 4567 4967 6632
it16862 3077 3087 3208 3406 3720 4294 3162 3290 3649 4139 4614 5936
pjh17845 1675 1729 1866 1969 2108 2488 1759 1861 2014 2257 2443 3033
fma21553 2109 2133 2248 2385 2582 3037 2214 2294 2528 2789 3072 3821
sw24978 6259 6065 6433 6849 7340 8382 6295 6729 7504 8169 9168 11600
Ave. ratio 1.03 1.03 1.08 1.15 1.23 1.42 1.07 1.12 1.23 1.36 1.50 1.90

### 6.3 Analysis

We analyze the behavior of the GA using different population diversity measures to investigate their influence on preserving population diversity. Here, we focus mainly on why the GA using $H6adj$ achieves superior results compared with the GA using $Hm$ with various values of $m$. For this purpose, we depict the typical behavior of a single run of the GA using each of six population diversity measures (Greedy, $Hind$, $H3$, $H4$, $H6$, and $H6adj$) on instance usa13509 under the population size of 300. We omit results on other instances because the same tendency was observed. The first graph in Figure 7 displays the plots of the average tour length $L$ of the population against the number of generations of the GA. This graph is presented as a reference for the following four graphs. The remaining four graphs display the plots of $H1$, $H3$, $H4$, and $H6$ against the average tour length $L$.
Figure 7:

Behavior of GA using different population diversity measures on instance usa13509 $(Np=300)$.

Figure 7:

Behavior of GA using different population diversity measures on instance usa13509 $(Np=300)$.

Figure 7 indicates that each of $H1$, $H3$, $H4$, and $H6$ is maintained at its highest value at each value of $L$ when the same diversity measure is incorporated into the evaluation function (14) of the GA. However, maintaining the value of $Hm$ at the highest level for a small value of $m$ does not necessarily keep $Hk$ $(m<k)$ at a high level. For example, if $Hind$ (equivalently $H1$) is incorporated into the evaluation function, the population evolves such that duplication of sequences of length two in the population is suppressed, without regard to the increased duplication of longer sequences. On the other hand, maintaining the value of $Hm$ at the highest level for a large value of $m$ (e.g., $m=6$) leads to “overfitting”: the population becomes specialized in keeping the value of $Hm$ high. For example, if it is possible to exclude any duplication of sequences of length $m+1$ from the population, such a population is preferred because it keeps the value of $Hm$ as high as possible, even if the duplication of certain shorter sequences increases excessively.

To alleviate the overfitting problem of $Hm$ for a large value of $m$, not only the value of $Hm$ but also the values of $Hk(k=1,…,m-1)$ should be maintained at a high level. The high-order entropy measure $Hmadj$ is suitable for this purpose because $Hmadj$ is equivalent to $H1+⋯+Hm$; the population therefore evolves such that each of the values of $Hk(k≤m)$ is maintained high, although there is no guarantee that all of them are maintained near their highest levels. Fortunately, as can be observed from Figure 7, when $H6adj$ is incorporated into the evaluation function, the values of $Hk(k≤6)$ are all maintained near their highest levels, especially for $3≤k$. We therefore conclude that this is one reason why the GA with the high-order entropy measure $H6adj$ achieves superior results compared with the GA using each of $Hm(m=1,…,6)$.
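The relation $Hmadj=H1+⋯+Hm$ can be made concrete with a small sketch. The estimator below is our own illustration (the paper's exact estimator, e.g., its handling of tour direction in the STSP and its normalization by $N_{sample}$, may differ): it computes $Hm$ as the conditional entropy of the next vertex given the preceding $m$ vertices, using circular windows collected over the population.

```python
import math
from collections import Counter

def entropy_rate(population, m):
    """Estimate the m-th order entropy rate H_m from a population of
    tours, each treated as a circular sequence of vertices.

    H_m = -sum_w P(w) * log P(w[m] | w[0..m-1]), where the sum runs
    over all length-(m+1) windows w observed in the population.
    """
    ctx, full = Counter(), Counter()
    for tour in population:
        n = len(tour)
        for i in range(n):  # circular windows of length m+1
            w = tuple(tour[(i + k) % n] for k in range(m + 1))
            full[w] += 1
            ctx[w[:m]] += 1
    total = sum(full.values())
    return -sum((c / total) * math.log(c / ctx[w[:m]])
                for w, c in full.items())

def adjusted_entropy(population, m):
    """H_m^adj, which the analysis above shows is equivalent to
    H_1 + ... + H_m."""
    return sum(entropy_rate(population, k) for k in range(1, m + 1))
```

A population of identical tours yields $Hm=0$ for every $m$, while maximizing $Hmadj$ rewards suppressing duplicated subsequences of every length up to $m+1$, not only the longest ones.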

## 7 Conclusions

We proposed three types of entropy-based population diversity measures to evaluate population diversity of a GA for the TSP. These measures consider high-order dependencies between variables of individuals in the population (high-order entropy measures). To derive these, we considered dependencies between consecutive variables and assumed that an individual is represented as a circular sequence of symbols, which is well suited for the TSP. Under these conditions, the entropy of the probability distribution of individuals in the population (used as a population diversity measure) is defined as the entropy rate of a Markov process estimated from the sequences of symbols sampled from the population.

The high-order entropy measure $Hm$ is equivalent to the entropy rate of the $m$-th-order Markov process. It has the potential ability to capture dependencies between consecutive variables of length up to $m+1$. We demonstrated that $Hm$ with an appropriate value of $m$ ($=3$ or 4) is significantly superior to $Hind$, the commonly used entropy-based population diversity measure that does not consider dependencies between variables, in the ability to evaluate population diversity. Although the high-order entropy measure $Hmadj$ is defined in a somewhat ad hoc manner, it is essentially equivalent to $H1+H2+⋯+Hm$. It reduces the overfitting problem of $Hm$ for a large value of $m$ (e.g., $m=6$) while considering dependencies between consecutive variables of length up to $m+1$. Consequently, $Hmadj$ with an appropriate value of $m$ ($=4$, 5, or 6) further improves on $Hm$ in the ability to measure population diversity. The high-order entropy measure $Hmvari$ is equivalent to the entropy rate of the variable-order Markov process. It also reduces the overfitting problem of $Hm$ with an appropriate parameter setting (e.g., $m=6$ and $ratio=0.1$) and improves on $Hm$. Overall, the high-order entropy measure $Hmadj$ with an appropriate value of $m$ ($=4$, 5, or 6) is the best population diversity measure among all the population diversity measures tested.

We have demonstrated the effectiveness of considering high-order dependencies between the variables of individuals in evaluating population diversity, at least for the TSP. The development of other high-order entropy measures and their application to other combinatorial optimization problems remain directions for future research.

## Acknowledgments

This work was supported by JSPS KAKENHI Grant Number 17K00342.

## Notes

1

This strategy is useful because many of the offspring solutions generated by EAX improve (or do not change) the tour length of $xr(i)$. Otherwise, offspring solutions would have to be evaluated simply by $ΔL(y)-TΔH(y)$ while reducing the value of $T$ over the course of the search.

2

At present, several TSP generators are available, and it is possible to make comparisons on hundreds or thousands of TSP instances (Kotthoff et al., 2015; Kerschke et al., 2018). However, doing so for instances of similar size would take a huge amount of time, and we did not perform such comparisons.

## References

Begleiter, R., El-Yaniv, R., and Yona, G. (2004). On prediction using variable order Markov models. Journal of Artificial Intelligence Research, 22:385–421.

Ben-Gal, I., Morag, G., and Shmilovici, A. (2003). Context-based statistical process control: A monitoring procedure for state-dependent processes. Technometrics, 45(4):293–311.

Kerschke, P., Kotthoff, L., Bossek, J., Hoos, H. H., and Trautmann, H. (2018). Leveraging TSP solver complementarity through machine learning. Evolutionary Computation, 26(4):597–620.

Kotthoff, L., Kerschke, P., Hoos, H., and Trautmann, H. (2015). Improving the state of the art in inexact TSP solving using per-instance algorithm selection. In Proceedings of the 9th International Conference on Learning and Intelligent Optimization, pp. 202–217. Lecture Notes in Computer Science, Vol. 8994.

Mächler, M., and Bühlmann, P. (2004). Variable length Markov chains: Methodology, computing, and software. Journal of Computational and Graphical Statistics, 13(2):435–455.

Maekawa, K., Mori, N., Tamaki, H., Kita, H., and Nishikawa, Y. (1996). A genetic solution for the traveling salesman problem by means of a thermodynamical selection rule. In Proceedings of the 3rd IEEE Conference on Evolutionary Computation, pp. 529–534.

Mori, N., Kita, H., and Nishikawa, Y. (1996). Adaptation to a changing environment by means of the thermodynamical genetic algorithm. In Proceedings of the International Conference on Parallel Problem Solving from Nature, pp. 513–522. Lecture Notes in Computer Science, Vol. 4193.

Nagata, Y. (2006). Fast EAX algorithm considering population diversity for traveling salesman problems. In Proceedings of the 6th International Conference on Evolutionary Computation in Combinatorial Optimization, pp. 171–182. Lecture Notes in Computer Science, Vol. 3906.

Nagata, Y. (2016). Population diversity measures based on variable-order Markov models for the traveling salesman problem. In Proceedings of the 14th International Conference on Parallel Problem Solving from Nature, pp. 973–983. Lecture Notes in Computer Science, Vol. 9921.

Nagata, Y., and Kobayashi, S. (2013). A powerful genetic algorithm using edge assembly crossover for the traveling salesman problem. INFORMS Journal on Computing, 25(2):346–363.

Nagata, Y., and Ono, I. (2013). High-order sequence entropies for measuring population diversity in the traveling salesman problem. In Proceedings of the 13th European Conference on Evolutionary Computation in Combinatorial Optimization, pp. 179–190. Lecture Notes in Computer Science, Vol. 7832.

Rissanen, J. (1983). A universal data compression system. IEEE Transactions on Information Theory, 29(5):656–664.

Ron, D., Singer, Y., and Tishby, N. (1996). The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 25(2–3):117–149.

Schulz, M. H., Weese, D., Rausch, T., Döring, A., Reinert, K., and Vingron, M. (2008). Fast and adaptive variable order Markov chain construction. In Proceedings of the 8th International Workshop on Algorithms in Bioinformatics, pp. 306–317. Lecture Notes in Bioinformatics, Vol. 5251.

Shmilovici, A., and Ben-Gal, I. (2007). Using a VOM model for reconstructing potential coding regions in EST sequences. Computational Statistics, 22(1):49–69.

Squillero, G., and Tonda, A. (2016). Divergence of character and premature convergence: A survey of methodologies for promoting diversity in evolutionary optimization. Information Sciences, 329:782–799.

Tsai, H., Yang, J., Tsai, Y., and Kao, C. (2004). An evolutionary algorithm for large traveling salesman problems. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 34(4):1718–1729.

Tsujimura, Y., and Gen, M. (1998). Entropy-based genetic algorithm for solving TSP. In Proceedings of the 2nd International Conference on Knowledge-Based Intelligent Electronic Systems, pp. 285–290.

Vallada, E., and Ruiz, R. (2010). Genetic algorithms with path relinking for the minimum tardiness permutation flowshop problem. Omega, 38(1):57–67.

Wang, Y., Lü, Z., and Hao, J. (2010). A study of multi-parent crossover operators in a memetic algorithm. In Proceedings of the 11th International Conference on Parallel Problem Solving from Nature, pp. 556–565. Lecture Notes in Computer Science, Vol. 6238.

Wineberg, M., and Oppacher, F. (2003). The underlying similarity of diversity measures used in evolutionary computation. In Proceedings of the 2003 International Conference on Genetic and Evolutionary Computation, p. 206.

Yao, X. (1993). An empirical study of genetic operators in genetic algorithms. Microprocessing and Microprogramming, 38:707–714.

Zhang, C., Su, S., and Chen, J. (2006). Efficient population diversity handling genetic algorithm for QoS-aware web services selection. In Proceedings of the 6th International Conference on Computational Science, pp. 104–111.

## Appendix: Efficient Computation of $ΔH(y)$

An efficient computation of $ΔH(y)$ in the evaluation function (14) is crucial for the execution of the GA using the proposed high-order entropy measures ($Hm$, $Hmadj$, and $Hmvari$). We present an outline of the efficient computation of $ΔH(y)$ in the ATSP case, which includes the STSP case as a special case (see Section 3.1).

Recall that $T$ is a tree for storing the values of $N(s_1,\ldots,s_k)$ $(k \le m+1)$ (see Figure 2). For an offspring solution $y$, let $E_{re}$ (resp. $E_{ad}$) be the set of edges that are removed from (resp. added to) the parent solution $p_A$ to generate this offspring solution. For each edge of $E_{re}$, $p_A$ has $m$ sequences of length $m+1$ including this edge, and these sequences disappear from the population if $p_A$ is replaced with $y$. Figure 8 illustrates the sequences of length $m+1$ $(=4)$ to remove for $E_{re}$. In a similar manner, we can determine the sequences of length $m+1$ to add for $E_{ad}$, as illustrated in Figure 8. For each sequence $\{s_1,\ldots,s_m,s_{m+1}\}$ to remove or add, we can determine the values of $N(s_1,\ldots,s_k)$ $(h \le k \le m+1)$ affected by the replacement by tracing $T$ according to the sequence, where the value of $h$ depends on the situation. For example, let a sequence of length $m+1$ $(=4)$ to remove be $\{b,g,c,a\}$, where only edge $(g,c)$ is included in $E_{re}$. In this case, $N(b,g,c)$ and $N(b,g,c,a)$ are decremented (if $p_A$ is replaced with $y$), whereas $N(b)$ and $N(b,g)$ remain the same; that is, $h=3$.
Figure 8:

Illustration of sequences of length $m+1$ $(=4)$ to remove (add) for $E_{re}$ ($E_{ad}$).

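To make the window enumeration concrete, the following sketch lists, for a single edge of $E_{re}$ or $E_{ad}$, the $m$ sequences (windows) of length $m+1$ that contain it. This is an illustrative stand-in, not the paper's implementation: it assumes a tour stored as a Python list of city indices, with the edge given by a city `u` and its cyclic successor in the tour.

```python
def windows_through_edge(tour, u, m):
    """Return the m cyclic windows of length m+1 containing the edge
    (u, successor of u) in the tour. A window covers m edges, so exactly
    m distinct windows contain any given edge."""
    n = len(tour)
    i = tour.index(u)  # edge is (tour[i], tour[(i + 1) % n])
    windows = []
    for offset in range(m):
        start = (i - m + 1 + offset) % n  # earliest window starts m-1 cities back
        windows.append([tour[(start + k) % n] for k in range(m + 1)])
    return windows
```

For each removed edge these windows are the sequences whose counts $N(\cdot)$ decrease, and for each added edge the analogous windows give the counts that increase.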

The computations of $\Delta H_m(y)$ and $\Delta H_m^{adj}(y)$ are similar; hence, we describe only the method to compute $\Delta H_m(y)$. For every sequence $\{s_1,\ldots,s_{m+1}\}$ to remove and add, the changes to $N(s_1,\ldots,s_m)$ and $N(s_1,\ldots,s_{m+1})$ are accumulated in $\Delta N(s_1,\ldots,s_m)$ and $\Delta N(s_1,\ldots,s_{m+1})$, respectively. We store these values in the tree $T$. Therefore, if a sequence $\{s_1,\ldots,s_k\}$ to add does not exist in $T$, we must create a temporary new node of $T$ with $N(s_1,\ldots,s_k)=0$ to store the necessary data ($T$ must be restored after computing $\Delta H_m(y)$).

Let $S_m$ and $S_{m+1}$ be the sets of the sequences of length $m$ and $m+1$, respectively, that are removed or added. Let $P'(s)$ be defined as $\frac{N(s)+\Delta N(s)}{N_{sample}}$ for $s \in S_m \cup S_{m+1}$, where $P(s)$ is already defined as $\frac{N(s)}{N_{sample}}$ (see Section 3.1). The value of $\Delta H_m(y)$ is then computed by
$\Delta H_m(y) = -\sum_{s \in S_{m+1}} \left( P'(s)\log P'(s) - P(s)\log P(s) \right) + \sum_{s \in S_m} \left( P'(s)\log P'(s) - P(s)\log P(s) \right).$
(15)
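Eq. (15) can be sketched as follows, using plain dicts keyed by symbol tuples as a hypothetical stand-in for the counts stored in the tree $T$ (the function names and data layout are illustrative):

```python
from math import log

def changed_terms(N, dN, n_sample):
    """Sum of P'(s) log P'(s) - P(s) log P(s) over the changed sequences,
    with the convention 0 log 0 = 0. N holds counts, dN the accumulated
    deltas (only changed sequences appear in dN)."""
    total = 0.0
    for s, d in dN.items():
        c = N.get(s, 0)
        for count, sign in ((c + d, 1.0), (c, -1.0)):
            p = count / n_sample
            if p > 0:
                total += sign * p * log(p)
    return total

def delta_Hm(N_m, dN_m, N_m1, dN_m1, n_sample):
    """Eq. (15): unchanged sequences cancel, so only the entries of the
    delta dicts contribute to the change in H_m."""
    return (-changed_terms(N_m1, dN_m1, n_sample)
            + changed_terms(N_m, dN_m, n_sample))
```

Because every term for a sequence with $\Delta N(s)=0$ cancels, the cost depends only on the number of changed sequences, not on the population size.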
The computation of $\Delta H_m^{vari}(y)$ is somewhat complicated. Before describing it, we first describe a method to compute $H_m^{vari}$ (Eq. (13)). Recall that $S$ is the set of the sequences of symbols for the conditioning variables of the variable-order Markov process used to define $H_m^{vari}$, and that $\tilde{S} = \{\tilde{s}_c \mid s_c \in S\}$ with $\tilde{s}_c$ being the reverse sequence of $s_c$. The context tree $\tilde{S}$ is then introduced to represent the set $\tilde{S}$ (see Figure 3), where the elements of the set $\tilde{S}$ are represented as the leaf nodes of the context tree $\tilde{S}$. In the actual implementation, each node with a symbol "$\#$" in the context tree $\tilde{S}$ is removed and is associated with its parent node. Figure 9 (Left) shows a context tree obtained in this manner from the context tree $\tilde{S}$ presented in Figure 3. For example, node $a$ in the modified context tree represents a sequence $a\#$, where "$\#$" means any symbol other than $c$ and $f$. From this point onward, we refer to the modified context tree simply as the context tree $\tilde{S}$ unless otherwise stated. We classify the nodes of the context tree $\tilde{S}$ into two classes as follows (see also Figure 9): (Class 1) leaf nodes and (Class 2) internal nodes. Further, we define a tree $S$ to represent the set $S$, where the nodes of $S$ correspond one-to-one with the nodes of $\tilde{S}$. The nodes of the tree $S$ are also classified into the two classes such that each node in $S$ has the same class as the corresponding node in $\tilde{S}$. Figure 9 (Right) shows the tree $S$ that corresponds to the context tree $\tilde{S}$ on the left side of the figure. For example, node "cgb" in $\tilde{S}$ (i.e., the node reached by following the symbols $c$, $g$, $b$ in this order) and node "bgc" in $S$ both belong to Class 1. Note that only the white nodes of $S$ can be determined from the information of $\tilde{S}$ presented in the figure because the other nodes (gray nodes) depend on the nodes of $\tilde{S}$ omitted in the figure.
In the actual implementation, all data of the trees $S$ and $\tilde{S}$ are stored in the trees $T$ and $\tilde{T}$, respectively, because $S$ (resp. $\tilde{S}$) is a subtree of $T$ (resp. $\tilde{T}$).
Figure 9:

Context tree representation of $\tilde{S}$ and $S$. The numbers represent $\tilde{N}_I(\tilde{s}_c)$ and $N_I(s_c)$.


Let $S_1$ and $S_2$ be the sets of the nodes of Class 1 and Class 2, respectively, in the tree $S$. Similarly, $\tilde{S}_1$ and $\tilde{S}_2$ are defined for $\tilde{S}$. For each node $s_c \in S$, we denote the number of occurrences in the population of the corresponding sequence of symbols as $N_I(s_c)$. The values of $N_I(s_c)$ are given by
$N_I(s_c) = \begin{cases} \tilde{N}(\tilde{s}_c)\ (= N(s_c)) & (s_c \in S_1) \\ \tilde{N}(\tilde{s}_c) - \sum_{s \in L,\, (\tilde{s}_c,s) \in \tilde{S}_1} \tilde{N}(\tilde{s}_c,s) & (s_c \in S_2). \end{cases}$
(16)
We need to pay attention to the case $s_c \in S_2$. For example, if $s_c = a$ (equivalently $\tilde{s}_c = a$) in the example, $\tilde{s}_c$ represents the sequences "$a\#$," but $\tilde{s}_c$ represents "$a$" when it is an argument of the functions $N(\cdot)$ and $\tilde{N}(\cdot)$ because these functions are defined on the trees $T$ and $\tilde{T}$. The values of $N_I(s_c)$ are stored in the tree $S$.
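The case analysis of Eq. (16) can be sketched with dicts keyed by symbol tuples as a hypothetical stand-in for the trees (not the paper's pointer-based implementation): for a Class 2 context, the occurrences already covered by its Class 1 children in the context tree are subtracted.

```python
def N_I_value(s_c, node_class, N_tilde, L, S1_tilde):
    """Eq. (16). N_tilde maps reversed sequences (tuples) to their counts
    N~(.); S1_tilde is the set of Class 1 nodes of the context tree;
    L is the alphabet of symbols."""
    rev = tuple(reversed(s_c))
    if node_class == 1:
        # Leaf node: the context itself is counted directly.
        return N_tilde[rev]
    # Internal (Class 2) node: subtract counts of Class 1 extensions.
    return N_tilde[rev] - sum(N_tilde.get(rev + (s,), 0)
                              for s in L if rev + (s,) in S1_tilde)
```

For the running example, context $a$ (Class 2, meaning "$a\#$") subtracts the counts of "$ac$" and "$af$" from the count of "$a$".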
Further, for $s_c \in S$ and $s_0 \in L$, we define the following function:
$N_I^+(s_c,s_0) = \begin{cases} N(s_c,s_0) & (s_c \in S_1) \\ N(s_c,s_0) - \sum_{s \in L,\, (s,s_c) \in S_1} N(s,s_c,s_0) & (s_c \in S_2). \end{cases}$
(17)
The values of $N_I^+(s_c,s_0)$ are not stored and are computed as necessary. Note that for $s_c \in S_2$, the set of the nodes $(s,s_c) \in S_1$ is obtained by introducing a function that returns, for each node in $S$ (resp. $\tilde{S}$), the address of the corresponding node in $\tilde{S}$ (resp. $S$), because we can easily access the nodes $(\tilde{s}_c,s) \in \tilde{S}_1$ in the context tree $\tilde{S}$. We define $P_I(s_c) = \frac{N_I(s_c)}{N_{sample}}$ $(s_c \in S)$ and $P_I^+(s_c,s_0) = \frac{N_I^+(s_c,s_0)}{N_{sample}}$ $(s_c \in S, s_0 \in L)$. Then, $H_m^{vari}$ is computed by
$H_m^{vari} = -\sum_{s_c \in S} \sum_{s_0 \in L} P_I^+(s_c,s_0)\log P_I^+(s_c,s_0) + \sum_{s_c \in S} P_I(s_c)\log P_I(s_c).$
(18)
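Eq. (18) is a conditional entropy: each context contributes the entropy of its next-symbol distribution, weighted by the context probability. A minimal sketch, with the probabilities supplied as dicts (illustrative names, not the paper's data structures):

```python
from math import log

def H_vari(P_I, P_I_plus):
    """Eq. (18). P_I maps a context s_c to P_I(s_c); P_I_plus maps a pair
    (s_c, s_0) to P_I+(s_c, s_0). Terms with zero probability are skipped
    (0 log 0 = 0)."""
    h = -sum(p * log(p) for p in P_I_plus.values() if p > 0)
    return h + sum(p * log(p) for p in P_I.values() if p > 0)
```

As a sanity check, a context whose next symbol is uniform over two choices contributes $\log 2$ times its probability, while a deterministic context contributes nothing.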

Now, $\Delta H_m^{vari}(y)$ is computed by calculating the differences in $N_I(s_c)$ and $N_I^+(s_c,s_0)$ when $p_A$ is replaced with $y$. We denote the differences in $N_I(s_c)$ and $N_I^+(s_c,s_0)$ as $\Delta N_I(s_c)$ and $\Delta N_I^+(s_c,s_0)$, respectively. Let $S_I$ and $S_I^+$ be the sets of the nodes $s_c$ and $(s_c,s_0)$ whose values of $N_I(s_c)$ and $N_I^+(s_c,s_0)$ change, respectively. At the beginning, an initialization procedure is performed as follows: $S_I = \emptyset$, $S_I^+ = \emptyset$, and $\Delta N_I(s) = \Delta N_I^+(s) = 0$ $(s \in S)$. For every sequence $\{s_1,\ldots,s_{m+1}\}$ to remove or add (if $p_A$ is replaced with $y$), the following procedure is performed.

1. Trace $S$ according to the sequence $\{s_1,\ldots,s_{m+1}\}$, and for each sequence $\{s_1,\ldots,s_k\}$ $(h \le k \le m+1)$, perform procedures 2 and 3. The value of $h$ is explained earlier in this appendix.

2. Update of $\Delta N_I$:

• If $\{s_1,\ldots,s_k\}$ is a Class 1 node of $S$, add this sequence to $S_I$ and increment (decrement) $\Delta N_I(s_1,\ldots,s_k)$ by one if this sequence is added (removed).

• Else if $\{s_2,\ldots,s_k\}$ is a Class 2 node of $S$, add $\{s_2,\ldots,s_k\}$ to $S_I$ and increment (decrement) $\Delta N_I(s_2,\ldots,s_k)$ by one if $\{s_1,\ldots,s_k\}$ is added (removed).

3. Update of $\Delta N_I^+$:

• If $\{s_1,\ldots,s_{k-1}\}$ is a Class 1 node of $S$, add $\{s_1,\ldots,s_k\}$ to $S_I^+$ and increment (decrement) $\Delta N_I^+(s_1,\ldots,s_k)$ by one if $\{s_1,\ldots,s_k\}$ is added (removed).

• Else if $\{s_2,\ldots,s_{k-1}\}$ is a Class 2 node of $S$, add $\{s_2,\ldots,s_k\}$ to $S_I^+$ and increment (decrement) $\Delta N_I^+(s_2,\ldots,s_k)$ by one if $\{s_1,\ldots,s_k\}$ is added (removed).
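The four checks above can be sketched with the node classes of the tree $S$ represented as a plain dict (a hypothetical stand-in for the actual tree traversal; the sets $S_I$ and $S_I^+$ are implicit as the key sets of the delta dicts):

```python
def classify_updates(S_cls, seq, h, sign, dNI, dNIp):
    """One pass of the procedure for a single length-(m+1) sequence.
    S_cls maps a node of S (a tuple of symbols) to its class (1 or 2);
    sign is +1 if the sequence is added, -1 if it is removed."""
    for k in range(h, len(seq) + 1):
        # Update of Delta N_I
        if S_cls.get(tuple(seq[:k])) == 1:
            key = tuple(seq[:k])
            dNI[key] = dNI.get(key, 0) + sign
        elif S_cls.get(tuple(seq[1:k])) == 2:
            key = tuple(seq[1:k])
            dNI[key] = dNI.get(key, 0) + sign
        # Update of Delta N_I^+ (context is the length-(k-1) prefix)
        if S_cls.get(tuple(seq[:k - 1])) == 1:
            key = tuple(seq[:k])
            dNIp[key] = dNIp.get(key, 0) + sign
        elif S_cls.get(tuple(seq[1:k - 1])) == 2:
            key = tuple(seq[1:k])
            dNIp[key] = dNIp.get(key, 0) + sign
```

For the earlier example sequence $\{b,g,c,a\}$ with $h=3$ and context $bgc$ a Class 1 node, this decrements exactly $\Delta N_I(b,g,c)$ and $\Delta N_I^+(b,g,c,a)$.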

After completing the above procedure, we compute $P_I'(s_c) = \frac{N_I(s_c)+\Delta N_I(s_c)}{N_{sample}}$ $(s_c \in S_I)$ and $P_I^{+\prime}(s) = \frac{N_I^+(s)+\Delta N_I^+(s)}{N_{sample}}$ $(s \in S_I^+)$. The value of $\Delta H_m^{vari}(y)$ is then computed by
$\Delta H_m^{vari}(y) = -\sum_{s \in S_I^+} \left( P_I^{+\prime}(s)\log P_I^{+\prime}(s) - P_I^+(s)\log P_I^+(s) \right) + \sum_{s_c \in S_I} \left( P_I'(s_c)\log P_I'(s_c) - P_I(s_c)\log P_I(s_c) \right).$
(19)

After the parent solution $p_A$ is replaced with the selected offspring solution, we need to update the trees $S$ and $\tilde{S}$ accordingly. This can be done efficiently, but we omit the details because the procedure is complicated to explain. In the actual implementation, the values of $N_I(s_c)$ were updated immediately, but the structure of the trees was reconstructed at the beginning of each generation of the GA (see Section 5).