Abstract

To maintain the population diversity of genetic algorithms (GAs), we are required to employ an appropriate population diversity measure. However, commonly used population diversity measures designed for permutation problems do not consider the dependencies between the variables of the individuals in the population. We propose three types of population diversity measures that address high-order dependencies between the variables to investigate the effectiveness of considering high-order dependencies. The first is formulated as the entropy of the probability distribution of individuals estimated from the population based on an $m$-th-order Markov model. The second is an extension of the first. The third is similar to the first, but it is based on a variable-order Markov model. The proposed population diversity measures are incorporated into the evaluation function of a GA for the traveling salesman problem to maintain population diversity. Experimental results demonstrate the effectiveness of the three types of high-order entropy-based population diversity measures compared with the commonly used population diversity measures.

1 Introduction

The maintenance of population diversity is recognized as an important factor for fully exercising the capability of evolutionary algorithms (EAs), and a wide variety of population management strategies for promoting diversity inside the population have been proposed. A survey of methodologies for promoting population diversity in EAs can be found in Squillero and Tonda (2016).

Several population diversity management methodologies utilize measures of population diversity, which can be used to analyze the behavior of EAs (Yao, 1993; Tsai et al., 2004; Wang et al., 2010), to select individuals to maintain population diversity in a positive manner (Maekawa et al., 1996; Zhang et al., 2006; Nagata, 2006; Nagata and Kobayashi, 2013), and as a trigger to activate diversification procedures (Tsujimura and Gen, 1998; Vallada and Ruiz, 2010). The pairwise Hamming distance (the average of the Hamming distance between all possible pairs of the population members) is the most commonly used measure of population diversity. Another commonly used population diversity measure is based on entropy. In information theory, entropy, defined as $-\sum_{s \in S} p_s \log p_s$, is a measure of the uncertainty of a probability distribution $p_s$ $(s \in S)$, where $S$ is the set of all possible events. This definition, however, cannot be directly used to measure population diversity because the population size is typically considerably smaller than the number of all possible solution candidates in the search space $S$. Therefore, to the best of our knowledge, the entropy-based population diversity measures proposed in previous works are all defined as the sum of the entropies of the univariate marginal distributions of all variables. For example, let the solution space $S$ be defined as $(x_1, \ldots, x_n)$, where $x_i$ is a variable taking values in a discrete set $A_i$. The entropy of the $i$-th variable is defined as $H_i = -\sum_{j \in A_i} p_{ij} \log p_{ij}$, where $p_{ij}$ is the probability that $x_i$ has a value $j$ in the population. Then, the commonly used entropy-based population diversity measure is defined as $H = \sum_{i=1}^{n} H_i$. In this article, we refer to an entropy-based population diversity measure defined in this manner as an independent entropy measure. In previous works, the independent entropy measure was incorporated into EAs applied to the knapsack problem (Mori et al., 1996), binary quadratic programming problem (Wang et al., 2010), traveling salesman problem (Yao, 1993; Maekawa et al., 1996; Tsujimura and Gen, 1998; Tsai et al., 2004; Nagata, 2006; Nagata and Kobayashi, 2013), and others (Zhang et al., 2006).
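As a concrete illustration of this definition, the following Python sketch (our own illustration, not code from the works cited above; function and variable names are hypothetical) computes the independent entropy measure from a population of equal-length strings.

```python
import math
from collections import Counter

def independent_entropy(population):
    """Independent entropy measure: the sum over positions i of the entropy
    of the univariate marginal distribution of symbols at position i."""
    n = len(population[0])                  # string length
    size = len(population)                  # population size Np
    h = 0.0
    for i in range(n):
        counts = Counter(ind[i] for ind in population)  # n_il for each symbol l
        for c in counts.values():
            p = c / size                                 # p_ij = n_il / Np
            h -= p * math.log(p)
    return h
```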

The independent entropy measure (and other commonly used population diversity measures), however, is not able to consider the dependencies between the variables of the individuals in the population, which creates a situation where population diversity cannot be evaluated appropriately. For example, consider an extreme example on the $n$-dimensional binary solution space where half of the population members are “$00\cdots0$” and the other half are “$11\cdots1$.” The value of the independent entropy measure of this population is virtually the same as that of a randomly generated population because $p_{i0} \approx p_{i1} \approx 0.5$ $(i = 1, \ldots, n)$ for both populations, even though the “true” population diversity is extremely low in the former. Therefore, our motivation herein is to design a more appropriate entropy-based population diversity measure by considering dependencies between the variables of the individuals in the population. We refer to such a population diversity measure as a high-order entropy measure.
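This failure mode is easy to reproduce. The toy check below (our own illustration, reusing the independent_entropy sketch above with $n = 8$) shows that the degenerate two-string population attains essentially the same, near-maximal value as a uniformly random population.

```python
import random

n = 8
degenerate = ["0" * n] * 50 + ["1" * n] * 50   # half "00...0", half "11...1"
random_pop = ["".join(random.choice("01") for _ in range(n)) for _ in range(100)]

print(independent_entropy(degenerate))  # exactly n * log(2) ~ 5.545, the maximum
print(independent_entropy(random_pop))  # ~ 5.5 as well, despite far higher true diversity
```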

In this article, we propose several high-order entropy measures for the traveling salesman problem (TSP) to investigate the advantages of using entropy-based population diversity measures that consider the dependencies between the variables. We first formulate high-order entropy measures based on a fixed-order Markov model, where we assume that the probability of observing each vertex at a certain position in an individual (tour) of the population depends on the sequence of $m$ $(\geq 1)$ precedent vertices. We further extend this model into a variable-order Markov model, where the value of $m$ varies depending on the situation.

We tested the proposed high-order entropy measures on a genetic algorithm (GA) developed by Nagata and Kobayashi (2013), which is known as one of the most effective heuristic algorithms for the TSP. In this GA, one important feature for achieving top performance is maintaining population diversity by evaluating offspring solutions based on an evaluation function that incorporates a population diversity measure as well as the original evaluation function (tour length). An independent entropy measure is used for evaluating the population diversity. In this article, we run this GA with the original independent entropy measure in the evaluation function replaced by each of the proposed high-order entropy measures.

Preliminary reports on the high-order entropy measures proposed in this article were presented in previous works of the author (Nagata and Ono, 2013; Nagata, 2016). This article provides a full description and instructive analysis of the proposed high-order entropy measures; Section 6.3 and the Appendix are completely new, and other parts significantly extend the contents of the conference proceedings. The remainder of this article is organized as follows. In Section 2, we describe the background of this study. In Section 3, we propose two types of high-order entropy measures based on a fixed-order Markov model. In Section 4, we propose a high-order entropy measure based on a variable-order Markov model. In Section 5, the GA framework, into which the proposed population diversity measures are incorporated, is described. Computational results are presented in Section 6. Finally, conclusions are provided in Section 7.

2 Background

We first consider commonly used population diversity measures (the independent entropy measure and the pairwise Hamming distance) in a general case and then describe the independent entropy measure for the TSP. We also discuss the difficulty of designing an entropy-based population diversity measure that considers the dependencies between the variables.

2.1 Commonly Used Population Diversity Measures

We first consider a general case where an individual is represented as a string of length $n$, consisting of symbols in a set $L$. Let the population consist of $N_p$ individuals, denoted as $p_1, p_2, \ldots, p_{N_p}$. Let $X_i$ $(i = 1, \ldots, n)$ be a random variable representing the symbol in the $i$-th position of an individual (string) selected randomly from the population. The joint probability distribution $P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)$ then represents the probability of finding individual $(x_1, x_2, \ldots, x_n)$ when an individual is selected randomly from the population. We denote $P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)$ as $P(x_1, x_2, \ldots, x_n)$ for simplicity. The marginal probability distribution $P(X_i = x_i)$, or $P(x_i)$, then represents the probability of finding symbol $x_i$ in the $i$-th position of an individual selected randomly from the population. The probability $P(X_i = l)$ is estimated as $n_{il}/N_p$, where $n_{il}$ is the number of individuals that have the symbol $l$ at position $i$. The independent entropy measure, which we denote as $H_{\mathrm{ind}}$, is then defined by
$$H_{\mathrm{ind}} = -\sum_{i=1}^{n} \sum_{x_i \in L} P(x_i) \log P(x_i). \qquad (1)$$
Another commonly used population diversity measure is the pairwise Hamming distance (the average of the Hamming distance between all possible pairs of the population members). Let the symbol in the $i$-th position of individual $p$ be denoted as $(p)_i$. The Hamming distance between two individuals (strings) $p$ and $q$, denoted as $d(p,q)$, is then defined by
$$d(p,q) = n - \sum_{i=1}^{n} I((p)_i, (q)_i), \qquad (2)$$
where the function $I(a,b)$ returns 1 if $a$ and $b$ are the same, and 0 otherwise.
In fact, the pairwise Hamming distance can be expressed as a function of $P(x_i)$ (Wineberg and Oppacher, 2003). To derive this expression, we rewrite the Hamming distance as $d(p,q) = n - \sum_{i=1}^{n} \sum_{l \in L} I_l((p)_i, (q)_i)$, where $I_l(a,b)$ returns 1 if both $a$ and $b$ are symbol $l$, and 0 otherwise. The pairwise Hamming distance, which we denote as $D$, is then calculated as follows:
$$\begin{aligned} D &= \frac{2}{N_p(N_p-1)} \sum_{j=1}^{N_p-1} \sum_{k=j+1}^{N_p} d(p_j, p_k) = \frac{1}{N_p(N_p-1)} \sum_{j=1}^{N_p} \sum_{k=1}^{N_p} d(p_j, p_k) \\ &\approx \frac{1}{N_p^2} \sum_{j=1}^{N_p} \sum_{k=1}^{N_p} \Big\{ n - \sum_{i=1}^{n} \sum_{l \in L} I_l((p_j)_i, (p_k)_i) \Big\} = \frac{1}{N_p^2} \Big\{ N_p^2 n - \sum_{i=1}^{n} \sum_{l \in L} \sum_{j=1}^{N_p} \sum_{k=1}^{N_p} I_l((p_j)_i, (p_k)_i) \Big\} \\ &= \frac{1}{N_p^2} \Big\{ N_p^2 n - \sum_{i=1}^{n} \sum_{l \in L} n_{il}^2 \Big\} = \frac{1}{N_p^2} \Big\{ N_p^2 n - \sum_{i=1}^{n} \sum_{l \in L} \{ N_p P(x_i = l) \}^2 \Big\} \\ &= n - \sum_{i=1}^{n} \sum_{l \in L} P(x_i = l)^2 = n - \sum_{i=1}^{n} \sum_{x_i \in L} P(x_i)^2. \qquad (3) \end{aligned}$$
Note that $\sum_{j=1}^{N_p} \sum_{k=1}^{N_p} I_l((p_j)_i, (p_k)_i) = n_{il}^2$ because the number of individuals that have symbol $l$ in the $i$-th position is $n_{il}$.
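The final identity in Eq. (3) is easy to verify numerically. The following sketch (our own illustration, with hypothetical names) compares the $1/N_p^2$-normalized average Hamming distance, i.e., the quantity obtained after the approximation step, with $n - \sum_i \sum_{x_i} P(x_i)^2$; the two agree exactly.

```python
from collections import Counter

def mean_hamming(population):
    """Average Hamming distance over all ordered pairs (j, k), including j = k,
    divided by Np^2 (the normalization used after the approximation in Eq. (3))."""
    size = len(population)
    total = sum(sum(a != b for a, b in zip(p, q))
                for p in population for q in population)
    return total / (size * size)

def marginal_form(population):
    """Right-hand side of Eq. (3): n - sum_i sum_l P(x_i = l)^2."""
    n, size = len(population[0]), len(population)
    result = float(n)
    for i in range(n):
        for c in Counter(ind[i] for ind in population).values():
            result -= (c / size) ** 2
    return result

pop = ["abc", "acb", "bac", "abc"]
assert abs(mean_hamming(pop) - marginal_form(pop)) < 1e-12
```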

The similarity between the independent entropy measure Hind and the pairwise Hamming distance D was discussed in Wineberg and Oppacher (2003). As suggested in Nagata and Kobayashi (2013), one advantage of the independent entropy measure over the pairwise Hamming distance is the sensitivity to the change of rare elements in the population. This feature makes Hind a more appropriate population diversity measure than D.

2.2 TSP Case

Let an asymmetric TSP (ATSP) be defined on a complete directed graph $(V, E)$ with a set of vertices $V = \{1, \ldots, n\}$ and a set of edges $E = \{(i,j) \mid i, j \in V\}$. In the asymmetric case, the distance (or cost) between two vertices depends on the travel direction. If the distance is the same in both directions, the TSP is called a symmetric TSP (STSP).

In the majority of GAs applied to the TSP, an individual is represented as the order of vertices on the tour. It is also possible to represent an individual such that the variable $x_i$ represents the vertex subsequent to vertex $i$ in the tour, which is well suited to the definition of the population diversity measures $D$ and $H_{\mathrm{ind}}$. Using this representation, $P(X_i = l)$ $(l \in \{1, 2, \ldots, n\})$ is defined as the probability distribution of the vertices subsequent to vertex $i$ in the population, and the population diversity measures $H_{\mathrm{ind}}$ and $D$ are defined according to Eqs. (1) and (3), respectively. Here, we need to give attention to the STSP case. In this case, we must consider both travel directions for each tour because the population diversity should not depend on the travel direction. Therefore, $P(X_i = l)$ $(l \in \{1, 2, \ldots, n\})$ is defined as the probability distribution of the vertices linked to vertex $i$ in the population. The population diversity measure $H_{\mathrm{ind}}$ defined in this manner is used in GAs (Maekawa et al., 1996; Tsai et al., 2004; Nagata, 2006; Nagata and Kobayashi, 2013) for the STSP to control the diversity of the population.

2.3 Difficulty in Considering Dependencies

The population diversity measures $H_{\mathrm{ind}}$ and $D$ do not consider the dependencies between the variables of the individuals in the population because both are expressed as functions of $P(x_i)$. The most naive entropy-based population diversity measure that considers the dependencies between the variables would be defined by
$$H = -\sum_{x_1 \in L} \cdots \sum_{x_n \in L} P(x_1, \ldots, x_n) \log P(x_1, \ldots, x_n). \qquad (4)$$
Clearly, this population diversity measure is of no value in practical use (unless $n$ is extremely small) because it is impossible to obtain a sufficient number of samples from the population to estimate $P(x_1, \ldots, x_n)$. Therefore, the joint probability distribution $P(x_1, \ldots, x_n)$ must be estimated under a certain assumption(s). For example, when $P(x_1, \ldots, x_n)$ is modeled as $\prod_{i=1}^{n} P(x_i)$, the entropy $H$ reduces to the independent entropy measure, i.e., $H = H_{\mathrm{ind}}$.

A Bayesian network could also be used to model the joint probability distribution. It is represented as $P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_{\mathrm{parent}(i)})$, where each variable $x_i$ is conditioned only on its parent variables $x_{\mathrm{parent}(i)}$ (if this set is empty, $P(x_i \mid x_{\mathrm{parent}(i)})$ means the prior probability distribution $P(x_i)$). However, it is typically difficult to detect an appropriate conditional dependency structure between the variables, which is represented as a directed acyclic graph (DAG), in advance. Moreover, it is difficult to compute Eq. (4) efficiently even if a DAG is given in advance.

3 High-Order Entropy Measures Based on a Fixed-Order Markov Model

We propose two high-order entropy measures based on a fixed-order Markov model to measure population diversity of GAs for the TSP.

3.1 High-Order Entropy Measure $H_m$

Let an individual (tour) be represented as a permutation of the vertices. We model the joint probability distribution of the population on the assumption that the probability of observing each vertex at a certain position in an individual of the population depends on the sequence of $m$ $(\geq 1)$ precedent vertices in the tour. This assumption is reasonable for the TSP. For example, when solving a TSP manually, the choice of the next vertex at each vertex strongly depends on the partial path created so far near this vertex, and the sequence of $m$ precedent vertices contains a lot of information about this. Let $S_i$ $(i = 1, \ldots, n)$ be a random variable representing the $i$-th vertex in an individual selected randomly from the population. Given that a tour has a cyclic structure, $P(S_1 = s_1, S_2 = s_2, \ldots, S_n = s_n)$, which we denote as $P(s_1, s_2, \ldots, s_n)$ for short, is represented by
$$P(s_1, s_2, \ldots, s_n) = \prod_{i=1}^{n} P(s_{i+m} \mid s_i, \ldots, s_{i+m-1}), \qquad (5)$$
where index $i + n$ $(1 \le i \le m)$ is identified with index $i$ because of the cyclic structure. The entropy $H$ of this joint probability distribution is then calculated as follows:
$$\begin{aligned} H &= -\sum_{s_1} \cdots \sum_{s_n} P(s_1, \ldots, s_n) \log P(s_1, \ldots, s_n) \\ &= -\sum_{s_1} \cdots \sum_{s_n} P(s_1, \ldots, s_n) \sum_{i=1}^{n} \log P(s_{i+m} \mid s_i, \ldots, s_{i+m-1}) \\ &= -\sum_{i=1}^{n} \sum_{s_1} \cdots \sum_{s_n} P(s_1, \ldots, s_n) \log P(s_{i+m} \mid s_i, \ldots, s_{i+m-1}) \\ &= -\sum_{i=1}^{n} \sum_{s_i} \cdots \sum_{s_{i+m}} P(s_i, \ldots, s_{i+m}) \log P(s_{i+m} \mid s_i, \ldots, s_{i+m-1}). \qquad (6) \end{aligned}$$
Given that each tour can start from an arbitrary vertex, the joint probability distribution of any $m+1$ consecutive random variables should be invariant with respect to shifts in the index; that is, $P(s_i, \ldots, s_{i+m})$ and $P(s_{i+m} \mid s_i, \ldots, s_{i+m-1})$ should be equivalent to $P(s_1, \ldots, s_{m+1})$ and $P(s_{m+1} \mid s_1, \ldots, s_m)$, respectively, regardless of the value of $i$. This assumption is justified by considering $n$ tours with all different start vertices for each individual (tour) in the population. Then, Eq. (6) can be simplified as follows:
$$\begin{aligned} H &= -n \sum_{s_1} \cdots \sum_{s_{m+1}} P(s_1, \ldots, s_{m+1}) \log P(s_{m+1} \mid s_1, \ldots, s_m) \\ &= -n \sum_{s_1} \cdots \sum_{s_{m+1}} P(s_1, \ldots, s_{m+1}) \log \frac{P(s_1, \ldots, s_{m+1})}{P(s_1, \ldots, s_m)} \\ &= -n \Big\{ \sum_{s_1} \cdots \sum_{s_{m+1}} P(s_1, \ldots, s_{m+1}) \log P(s_1, \ldots, s_{m+1}) - \sum_{s_1} \cdots \sum_{s_m} P(s_1, \ldots, s_m) \log P(s_1, \ldots, s_m) \Big\} \\ &= n \, (\bar{H}_{m+1} - \bar{H}_m), \qquad (7) \end{aligned}$$
where
$$\bar{H}_k = -\sum_{s_1} \cdots \sum_{s_k} P(s_1, \ldots, s_k) \log P(s_1, \ldots, s_k). \qquad (8)$$
The average entropy per symbol of the probability distribution of Eq. (5), which we denote as $H_m$, is then given by
$$H_m = \bar{H}_{m+1} - \bar{H}_m. \qquad (9)$$
We use $H_m$ as a high-order entropy measure for the TSP. Note that $H_1$ $(= \bar{H}_2 - \bar{H}_1)$ is essentially equivalent to $H_{\mathrm{ind}}$ because $P(S_1 = i, S_2 = j) \propto P(X_i = j)$ and $\bar{H}_1$ is a constant value for the TSP (because all symbols appear exactly once in each individual of the population). We have the following relation:
$$H_1 = \frac{1}{n} H_{\mathrm{ind}} + \mathrm{const.} \qquad (10)$$

In information theory, $H_m$ is known as the entropy rate of an $m$-th-order Markov information source modeled by the conditional probability distributions $P(s_{m+1} \mid s_1, \ldots, s_m)$, where the entropy rate of a data source is defined as the average information per symbol obtained from the data source. A central theorem of information theory states that the entropy rate of a data source indicates the average number of bits per symbol required to encode it. Therefore, the existence of the same sequence consisting of up to $m+1$ vertices in the population decreases the value of $H_m$; this effect is more prominent when the length of the high-frequency sequences increases.

To compute $H_m$, we must estimate $P(s_1, \ldots, s_k)$ for $k = m, m+1$ by sampling sequences of symbols (vertices) from individuals in the population; $P(s_1, \ldots, s_k)$ is estimated by $N(s_1, \ldots, s_k)/N_{\mathrm{sample}}$, where $N(s_1, \ldots, s_k)$ is the number of occurrences of the sequence of symbols $\{s_1, \ldots, s_k\}$ in the population and $N_{\mathrm{sample}}$ is the total number of all possible samples of the sequences in the population. Sequences of length $k$ are sampled from each individual while shifting the start of the sequence one by one, resulting in $n$ samples from each individual, i.e., $N_{\mathrm{sample}} = n N_p$. Note that in the STSP case, sequences of length $k$ are sampled from each individual for both travel directions, resulting in $2n$ samples from each individual, i.e., $N_{\mathrm{sample}} = 2n N_p$. Figure 1 illustrates the method to sample the sequences from an individual (tour) in the ATSP and STSP cases.
Figure 1:

Illustration of the method to sample sequences of length $k$ $(= 3)$ from a tour; sequences are sampled for the original orientation in the ATSP case and for both orientations in the STSP case.


We store the values of $N(s_1, \ldots, s_k)$ $(k \le m+1)$ in the form of a tree because it is impractical to store all entries in a table for a large value of $m$. Let $T$ be a tree for storing $N(s_1, \ldots, s_k)$ $(k \le m+1)$. Figure 2 illustrates an example of the tree $T$ for $k \le 4$ in the ATSP case (because it is easier to understand). The tree $T$ stores only $N(s_1, \ldots, s_k)$ with nonzero values; for example, $N(a,b) = 6$, $N(a,b,c) = 4$, and $N(a,c) = 0$ in this example. Note that in the TSP, $N(s_1)$ must be the same for all $s_1$ and the same symbol does not reappear in a sequence. For a population consisting of relatively high-quality solutions for the TSP (e.g., tours obtained by a local search algorithm with the 2-opt neighborhood), we have observed that the number of branches at each node is at most about ten (usually less than five).
Figure 2:

Tree representation of $N(s_1, \ldots, s_k)$ $(k \le 4)$.


As the value of $m$ is increased, $H_m$ captures higher-order dependencies in the sequences of symbols included in the population. In this sense, we want to increase the value of $m$. However, $H_m$ is of no value if the value of $m$ is overly large because it is then unlikely that a sufficient number of samples (sequences of symbols) can be obtained from the population to estimate the conditional probability distributions $P(s_{i+m} \mid s_i, \ldots, s_{i+m-1}) = \frac{P(s_1, \ldots, s_{m+1})}{P(s_1, \ldots, s_m)} = \frac{N(s_1, \ldots, s_{m+1})}{N(s_1, \ldots, s_m)}$ required for computing $H_m$; the estimated conditional probability distribution is unreliable if the count in the denominator is small. Therefore, there is a tradeoff between the potential ability to capture higher-order dependencies and the estimation accuracy of the conditional probability distributions, and we need to determine an appropriate value of $m$.
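To make the above concrete, the following sketch (our own illustration; a flat dictionary of counts stands in for the tree $T$ of Figure 2, which matters only for efficiency) samples cyclic sequences from the tours as in Figure 1 and computes $H_m = \bar{H}_{m+1} - \bar{H}_m$ according to Eq. (9).

```python
import math
from collections import Counter

def sample_counts(tours, k, symmetric=True):
    """N(s_1,...,s_k): counts of length-k subsequences sampled from every start
    position of every (cyclic) tour, in both orientations for the STSP."""
    counts = Counter()
    for tour in tours:
        for t in ([tour, tour[::-1]] if symmetric else [tour]):
            n = len(t)
            for start in range(n):                      # shift the start one by one
                counts[tuple(t[(start + j) % n] for j in range(k))] += 1
    return counts

def block_entropy(tours, k, symmetric=True):
    """H_bar_k of Eq. (8), estimated from the sampled counts."""
    counts = sample_counts(tours, k, symmetric)
    n_sample = sum(counts.values())                     # n*Np (ATSP) or 2*n*Np (STSP)
    return -sum((c / n_sample) * math.log(c / n_sample) for c in counts.values())

def high_order_entropy(tours, m, symmetric=True):
    """H_m = H_bar_{m+1} - H_bar_m (Eq. (9))."""
    return block_entropy(tours, m + 1, symmetric) - block_entropy(tours, m, symmetric)
```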

3.2 High-Order Entropy Measure $H_m^{\mathrm{adj}}$

One might think that $\bar{H}_{m+1}$ can also be used as a population diversity measure. This is equivalent to the entropy of the probability distribution $P(s_1, \ldots, s_{m+1})$ defined under the assumption that blocks of $m+1$ consecutive symbols $s_1, \ldots, s_{m+1}$ appear in the population according to this probability distribution and that these occurrences are independent of each other. However, these occurrences are actually correlated, and the definition of $\bar{H}_{m+1}$ neglects such dependencies. In information theory, $\bar{H}_{m+1}$ is known as the entropy of the adjoint source of the $(m+1)$-th extension of the original Markov source. From this point onward, we call $\bar{H}_{m+1}$ the high-order entropy measure $H_m^{\mathrm{adj}}$, to distinguish it from $H_m$.

Although the definition of $H_m^{\mathrm{adj}}$ $(= \bar{H}_{m+1})$ is somewhat ad hoc as a population diversity measure, we can find the following simple relationship between $H_m^{\mathrm{adj}}$ and $H_k$ $(k = 1, \ldots, m)$:
$$H_1 + H_2 + \cdots + H_m = (\bar{H}_2 - \bar{H}_1) + (\bar{H}_3 - \bar{H}_2) + \cdots + (\bar{H}_{m+1} - \bar{H}_m) = H_m^{\mathrm{adj}} - \bar{H}_1. \qquad (11)$$
Because $\bar{H}_1$ is a constant value (for the TSP), $H_m^{\mathrm{adj}}$ is equivalent to $H_1 + H_2 + \cdots + H_m$. As can be predicted, the best value of $m$ for $H_m^{\mathrm{adj}}$ will be greater than that for $H_m$ because $H_m^{\mathrm{adj}}$ mixes the high-order entropy measures $H_k$ $(k = 1, \ldots, m)$ equally. We discuss the advantages of this population diversity measure in Section 6.3.
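Reusing the sketch from Section 3.1, $H_m^{\mathrm{adj}}$ is simply $\bar{H}_{m+1}$, and the telescoping identity of Eq. (11) can be checked numerically on a toy population (an illustration under the same assumptions as before):

```python
def adjoint_entropy(tours, m, symmetric=True):
    """H_m^adj = H_bar_{m+1}, the entropy of the adjoint source."""
    return block_entropy(tours, m + 1, symmetric)

tours = [[0, 1, 2, 3, 4], [0, 2, 1, 3, 4], [0, 1, 3, 2, 4], [0, 3, 1, 2, 4]]
m = 3
lhs = sum(high_order_entropy(tours, k) for k in range(1, m + 1))  # H_1 + ... + H_m
rhs = adjoint_entropy(tours, m) - block_entropy(tours, 1)         # H_m^adj - H_bar_1
assert abs(lhs - rhs) < 1e-9                                      # Eq. (11)
```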

4 High-Order Entropy Measure Based on a Variable-Order Markov Model

We propose a high-order entropy measure based on a variable-order Markov model to measure population diversity of GAs for the TSP.

4.1 Motivation

The high-order entropy measure $H_m$ defined in the previous section is derived from the assumption that the occurrence of symbols in the individuals of the population obeys a Markov process of order $m$. In the following equations, we use random variables $S_i$ $(i = -m, \ldots, -2, -1, 0)$ to represent a Markov process of order $m$, where $S_0$ represents the symbol to be observed next and $S_{-i}$ $(1 \le i \le m)$ represents the $i$-th preceding symbol. The expression of $H_m$ is then given by the following formula:
$$H_m = -\sum_{s_{-m}} \cdots \sum_{s_{-1}} \sum_{s_0} P(s_{-m}, \ldots, s_{-1}, s_0) \log P(s_0 \mid s_{-m}, \ldots, s_{-1}). \qquad (12)$$
In theory, the value of $H_m$ is equivalent to the entropy rate of a Markov process of order $k$ as long as $k \le m$ (i.e., $H_m = H_k$) because $P(s_0 \mid s_{-m}, \ldots, s_{-k}, \ldots, s_{-1}) = P(s_0 \mid s_{-k}, \ldots, s_{-1})$ in a Markov process of order $k$.

To capture higher-order dependencies without suffering from a lack of sufficient statistics, we model the occurrence of symbols (vertices) in the individuals of the population as a variable-order Markov process. The variable-order Markov model was first suggested in Rissanen (1983) for data compression purposes. Later, variants of variable-order Markov models have been successfully applied to areas such as statistical analysis, classification, and prediction (Shmilovici and Ben-Gal, 2007; Begleiter et al., 2004; Ben-Gal et al., 2003). In a variable-order Markov process, the probability distribution of observing the next symbol $s_0$ depends on the preceding symbols of variable length $k$. The basic idea is to determine the value of $k$ adaptively such that the number of samples $N(s_{-k}, \ldots, s_{-1})$ is a sufficient statistic for estimating the conditional probability distribution $P(s_0 \mid s_{-k}, \ldots, s_{-1}) = \frac{N(s_{-k}, \ldots, s_{-1}, s_0)}{N(s_{-k}, \ldots, s_{-1})}$.

4.2 A High-Order Entropy Measure $H_m^{\mathrm{vari}}$

A variable-order Markov process is characterized by a set of conditional probability distributions $\{P(s_0 \mid s_c)\}_{s_c \in \mathcal{S}}$, where $\mathcal{S}$ is a set of sequences of symbols for the conditioning variables. We set an upper limit $m$ on the length of the sequences for the conditioning variables because it is impractical to store all conditional probability components if $m$ is overly large (e.g., $m > 10$). For any sequence of symbols $\{s_{-m}, \ldots, s_{-2}, s_{-1}, s_0\}$, the length of the sequence assigned to the conditioning variables must be uniquely determined. To represent a set $\mathcal{S}$ that satisfies this requirement, a context tree (Rissanen, 1983) is useful. Let $\tilde{s}_c$ be the reverse sequence of $s_c$ and define $\tilde{\mathcal{S}} = \{\tilde{s}_c \mid s_c \in \mathcal{S}\}$. Then, the elements of $\tilde{\mathcal{S}}$ are represented as the leaf nodes of a context tree $\tilde{\mathcal{S}}$ (we use this symbol to refer to the context tree as well) as illustrated in Figure 3, where each number represents the count of the corresponding sequence existing in the population. Note that the symbol “#” means any symbol other than the symbols of its sibling nodes. For a given sequence $\{s_{-m}, \ldots, s_{-2}, s_{-1}\}$ ($s_0$ is observed next), the conditioning part is determined by tracing the context tree $\tilde{\mathcal{S}}$ to the maximum extent possible from top to bottom according to $s_{-1}, s_{-2}, \ldots, s_{-m}$. For example, if $\{s_{-3}, s_{-2}, s_{-1}\} = \{h, c, a\}$ $(m = 3)$, the conditioning part is determined as $\{s_{-2}, s_{-1}\} = \{c, a\}$ and the conditional probability distribution of observing the next symbol $s_0$ is given by $P(S_0 = s_0 \mid S_{-2} = c, S_{-1} = a)$, or $P(s_0 \mid c, a)$ for short. In another example, if $\{s_{-3}, s_{-2}, s_{-1}\} = \{d, b, a\}$, the conditioning part is determined as $\{s_{-2}, s_{-1}\} = \{\#, a\}$ (“#” means any symbol other than $c$ and $f$) and the conditional probability distribution of observing the next symbol $s_0$ is given by $P(S_0 = s_0 \mid S_{-2} = \#, S_{-1} = a)$, or $P(s_0 \mid \#, a)$. This conditional probability distribution is estimated by $\frac{P(\#, a, s_0)}{P(\#, a)} = \frac{P(a, s_0) - P(c, a, s_0) - P(f, a, s_0)}{P(a) - P(c, a) - P(f, a)} = \frac{N(a, s_0) - N(c, a, s_0) - N(f, a, s_0)}{N(a) - N(c, a) - N(f, a)}$.
Figure 3:

Context tree representation of $\tilde{\mathcal{S}}$.


The entropy rate of the variable-order Markov process, which we denote as $H_m^{\mathrm{vari}}$, is then defined by
$$\begin{aligned} H_m^{\mathrm{vari}} &= -\sum_{s_c \in \mathcal{S}} \sum_{s_0 \in L} P(s_c, s_0) \log P(s_0 \mid s_c) = -\sum_{s_c \in \mathcal{S}} \sum_{s_0 \in L} P(s_c, s_0) \log \frac{P(s_c, s_0)}{P(s_c)} \\ &= -\sum_{s_c \in \mathcal{S}} \sum_{s_0 \in L} P(s_c, s_0) \log P(s_c, s_0) + \sum_{s_c \in \mathcal{S}} P(s_c) \log P(s_c). \qquad (13) \end{aligned}$$
If the context tree $\tilde{\mathcal{S}}$ is a perfect tree of depth $m$, $H_m^{\mathrm{vari}}$ is equivalent to $H_m$. Conversely, $H_m^{\mathrm{vari}}$ can be viewed as an approximation of $H_m$ because $H_m^{\mathrm{vari}}$ is obtained from $H_m$ by replacing $P(s_0 \mid s_{-m}, \ldots, s_{-k}, \ldots, s_{-1})$ with the lower-order conditional probability distribution $P(s_0 \mid s_{-k}, \ldots, s_{-1})$ for all $\{s_{-k}, \ldots, s_{-1}\} \in \mathcal{S}$.

Next, we describe the method to determine $\tilde{\mathcal{S}}$ (and equivalently $\mathcal{S}$), which is essentially equivalent to the learning algorithms used in Rissanen (1983), Ron et al. (1996), Mächler and Bühlmann (2004), and Schulz et al. (2008). Let $\tilde{N}(s_{-1}, \ldots, s_{-k})$ be the number of occurrences of the reverse sequence of symbols $\{s_{-k}, \ldots, s_{-1}\}$ in the population, i.e., $\tilde{N}(s_{-1}, \ldots, s_{-k}) = N(s_{-k}, \ldots, s_{-1})$. As with the tree $T$, let $\tilde{T}$ be a tree for storing the values of $\tilde{N}(s_{-1}, \ldots, s_{-k})$ $(k \le m)$. Figure 4 illustrates an example of the tree $\tilde{T}$, which corresponds to the tree $T$ presented in Figure 2 (e.g., $N(a,f,b) = \tilde{N}(b,f,a) = 4$). Note that in the case of the STSP, $T = \tilde{T}$ (except for the maximum depth of the tree), but the displayed example shows the ATSP case (because it is easier to understand). For a current population, the context tree $\tilde{\mathcal{S}}$ is constructed by the following procedure, where $ratio$ is a parameter taking a value between zero and one.

Construction of context tree $\tilde{\mathcal{S}}$

  1. $\tilde{\mathcal{S}}$ is initialized as the perfect tree of depth one, i.e., $\tilde{\mathcal{S}} = \{s_{-1} \mid s_{-1} \in L\}$.

  2. For each of the leaf nodes $\{s_{-1}, \ldots, s_{-k}\} \in \tilde{\mathcal{S}}$ with $k < m$, if there exists a symbol(s) $s'_{-(k+1)} \in L$ such that $ratio \times N_p \le \tilde{N}(s_{-1}, \ldots, s_{-k}, s'_{-(k+1)})$, this node is expanded to generate a new leaf node(s) $\{s_{-1}, \ldots, s_{-k}, s'_{-(k+1)}\}$. Expansions of the leaf nodes are iterated until no further expansion is possible.

  3. For every node that has been expanded, generate a child node with the symbol “#”.

Figure 4:

Tree representation of $\tilde{N}(s_{-1}, \ldots, s_{-k})$ $(k \le 3)$.


According to the above procedure, the context tree $\tilde{\mathcal{S}}$ presented in Figure 3 is obtained from the tree $\tilde{T}$ presented in Figure 4 (if $ratio \times N_p = 8$).

The aim behind the expansion of a leaf node $\{s_{-1}, \ldots, s_{-k}\}$ (Step 2 of the above procedure) is to capture the higher-order dependency expressed as the conditional probability distribution $P(s_0 \mid s'_{-(k+1)}, s_{-k}, \ldots, s_{-1})$ only when there is judged to be a sufficient statistic for estimating this conditional probability distribution. The parameter $ratio$ balances the tradeoff between the potential ability to capture higher-order dependencies and the estimation accuracy of the conditional probability distributions.
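A minimal sketch of this construction is given below (our own illustration, reusing sample_counts from the sketch in Section 3.1; flat dictionaries of counts stand in for the tree $\tilde{T}$, and the “#” children of Step 3 are realized implicitly at lookup time rather than stored).

```python
def build_context_tree(tours, m, ratio, symmetric=True):
    """Steps 1-2 of the construction: returns a dict mapping each context
    (stored in reverse order s_-1, s_-2, ...) to its explicitly expanded children."""
    counts = {k: sample_counts(tours, k, symmetric) for k in range(2, m + 1)}
    threshold = ratio * len(tours)                 # ratio * Np
    symbols = sorted({s for tour in tours for s in tour})
    tree, frontier = {}, [(s,) for s in symbols]   # Step 1: perfect tree of depth one
    while frontier:
        ctx = frontier.pop()
        tree[ctx] = []
        if len(ctx) == m:
            continue
        for s in symbols:          # Step 2: expand children with sufficient counts
            child = ctx + (s,)
            # N_tilde(s_-1,...,s_-(k+1)) = N(s_-(k+1),...,s_-1)
            if counts[len(child)].get(tuple(reversed(child)), 0) >= threshold:
                tree[ctx].append(s)
                frontier.append(child)
        # Step 3 ('#' child of every expanded node) is handled in find_context.
    return tree

def find_context(tree, history):
    """Trace the tree along s_-1, s_-2, ... as far as possible (cf. Figure 3);
    a mismatch below an expanded node corresponds to following its '#' child."""
    ctx = (history[-1],)
    for s in reversed(history[:-1]):
        if s not in tree.get(ctx, []):
            break
        ctx = ctx + (s,)
    return ctx
```

For a history that falls off the explicit tree below an expanded node (the $\{\#, a\}$ case above), the conditional distribution would then be estimated by the subtraction formula given earlier.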

In Rissanen (1983), Ron et al. (1996), Mächler and Bühlmann (2004), and Schulz et al. (2008), the context tree constructed by the above algorithm is then pruned based on the magnitude of the effect on the stochastic model. However, we do not use this pruning procedure because we confirmed in a preliminary experiment that the performance of the GA deteriorated when this pruning procedure was introduced.

5 GA Framework

To evaluate the ability of the proposed population diversity measures $H_m$, $H_m^{\mathrm{adj}}$, and $H_m^{\mathrm{vari}}$, we run the GA proposed in Nagata and Kobayashi (2013) using each of the three types of population diversity measures with different values of $m$ and $ratio$. This GA is one of the most effective heuristic algorithms for the TSP. One important factor in achieving top performance is maintaining population diversity by evaluating offspring solutions based on the change in population diversity when they are selected to survive in the population, as well as on the tour length. The independent entropy measure $H_{\mathrm{ind}}$ was originally used for evaluating population diversity.

[Algorithm 1: Framework of the GA (pseudocode figure not reproduced here).]

Algorithm 1 depicts the GA framework. The population consists of $N_p$ individuals. The initial population is generated by a greedy local search algorithm with the 2-opt neighborhood (Line 1). At each generation (Lines 3–8) of the GA, each of the population members is selected, once as parent $p_A$ and once as parent $p_B$, in random order (Lines 3 and 5). For each pair of parents, the edge assembly crossover (EAX) operator generates $N_{ch}$ (e.g., 30) offspring solutions (Line 6). Then, the best solution is selected from the generated offspring solutions and $p_A$ in terms of a given evaluation function, and the selected individual replaces the population member selected as $p_A$ (Line 7). Therefore, no replacement occurs if all offspring solutions are worse than $p_A$. Note that only parent $p_A$ is replaced, to better maintain population diversity, because EAX typically generates offspring solutions similar to $p_A$. Generations are iterated until a termination condition is met (Line 9).
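In outline, one generation of this framework can be sketched as follows (an illustration only; eax and select_survivor stand for the EAX operator and the selection rule based on the evaluation function described later in this section, and both are placeholders).

```python
import random

def ga_generation(population, n_children, eax, select_survivor):
    """One generation (Lines 3-8 of Algorithm 1): every member serves once as
    parent pA and once as pB in a random order; only pA may be replaced."""
    r = list(range(len(population)))
    random.shuffle(r)                                   # Line 3: random order r(1..Np)
    for i in range(len(population)):
        pa = population[r[i]]                           # parent pA = x_r(i)
        pb = population[r[(i + 1) % len(population)]]   # parent pB = x_r(i+1)
        offspring = eax(pa, pb, n_children)             # Line 6: Nch offspring by EAX
        population[r[i]] = select_survivor(pa, offspring)  # Line 7: replace pA (or keep it)
    return population
```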

Figure 5 illustrates an outline of EAX. From a selected pair of parent solutions (tours), EAX can generate a number of offspring solutions (tours) through two phases. In the first phase, intermediate solutions are constructed by assembling edges of the parent solutions (tours) under the relaxed constraint that exactly two edges are linked to every vertex; an intermediate solution typically consists of several subcycles. In the second phase, each intermediate solution is modified into a Hamiltonian cycle by merging the subcycles heuristically. There are several strategies for generating offspring solutions. For example, EAX with the “single” strategy typically generates an intermediate solution like (a) in the figure, where the majority of the edges are inherited from parent $p_A$. Conversely, EAX with the “block 2” strategy typically generates an intermediate solution like (b), where a number of edges are inherited from each parent. EAX with the single strategy is used from the beginning to almost the end of the search, and EAX with the block 2 strategy is used only at the final stage of the search. The timing of the switch from the single strategy to the block 2 strategy and the termination condition are determined based on the number of consecutive generations for which the best individual of the population is not updated. For more details, we refer the reader to the original paper (Nagata and Kobayashi, 2013).
Figure 5:

Outline of EAX for the TSP.


We define the evaluation function for selecting the individual that replaces $x_{r(i)}$ $(= p_A)$ (Line 7) in a somewhat different manner from the one used in Nagata and Kobayashi (2013) because this evaluation function is more natural than the original. Let $L$ be the average tour length of the population and $H$ the selected population diversity measure ($H_m$, $H_m^{\mathrm{adj}}$, or $H_m^{\mathrm{vari}}$). First of all (as in Nagata and Kobayashi, 2013), only offspring solutions that do not deteriorate the tour length of $x_{r(i)}$ can be selected, because without this restriction we would need a careful cooling schedule for a parameter $T$ (described later) to converge the population. The best one is then selected among them such that $L - TH$ is minimized after the replacement, where $T$ is a parameter that controls the balance between the exploration (maximization of $H$) and exploitation (minimization of $L$) of the search. Therefore, offspring solutions are evaluated by the following evaluation function (a lower value is better):1
$$\mathrm{Eval}(y) = \begin{cases} \Delta L(y) - T \, \Delta H(y) & (\Delta L(y) \le 0) \\ \infty & (\Delta L(y) > 0), \end{cases} \qquad (14)$$
where $\Delta L(y)$ and $\Delta H(y)$ denote the differences in $L$ and $H$, respectively, when an offspring solution $y$ replaces $x_{r(i)}$. After evaluating all offspring solutions, the one with the smallest evaluation value is selected to replace $x_{r(i)}$. Note that if all evaluation values of the offspring solutions are greater than zero, $p_A$ is selected (i.e., no replacement occurs) because $\mathrm{Eval}(p_A) = 0$. We update the parameter $T$ in the following manner. At the beginning of the GA, $T$ is set to a sufficiently large value. The value of $T$ is updated every time the replacement of $x_{r(i)}$ has been performed $N_p$ times. Let $L'$ and $H'$ be the values of $L$ and $H$, respectively, when the value of $T$ was last updated. The value of $T$ is then updated by $T = \frac{L - L'}{H - H'}$. Figure 6 illustrates the meaning of this evaluation function. In this figure, the solid curve represents a trajectory of $(L, H)$ in the $L$-$H$ plane during the search. If an offspring solution $y$ replaces $x_{r(i)}$, the current position $(L, H)$ moves by the vector $(\Delta L(y), \Delta H(y))$ as illustrated in the figure ($y$ is omitted in the figure). The value of $\Delta L(y) - T \Delta H(y)$ is proportional to the magnitude of the component of the vector $(\Delta L(y), \Delta H(y))$ in the direction (dotted arrow in the figure) perpendicular to the vector $(L - L', H - H')$. Note that the value of $\Delta L(y) - T \Delta H(y)$ is negative in the example of the figure, and a lower value is preferable.
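The selection rule and the update of $T$ can be summarized in code as follows (a sketch under the definitions above; delta_length and delta_entropy, which return $\Delta L(y)$ and $\Delta H(y)$ for a candidate replacement, are placeholders).

```python
def make_survivor_rule(delta_length, delta_entropy, temperature):
    """Builds the Line-7 selection rule from Eq. (14): offspring that worsen the
    tour length are rejected outright; otherwise the smallest
    Delta L(y) - T * Delta H(y) wins. pA scores 0, so it survives unless
    some offspring achieves a negative evaluation value."""
    def select_survivor(pa, offspring):
        best, best_val = pa, 0.0
        for y in offspring:
            dl = delta_length(y)
            if dl > 0:
                continue                     # Eval(y) = infinity in Eq. (14)
            val = dl - temperature * delta_entropy(y)
            if val < best_val:
                best, best_val = y, val
        return best
    return select_survivor

def update_temperature(L, H, L_prev, H_prev):
    """T = (L - L') / (H - H'), re-estimated every Np replacements."""
    return (L - L_prev) / (H - H_prev)
```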
Figure 6:

Meaning of the evaluation function $\Delta L(y) - T \, \Delta H(y)$.


For every offspring solution $y$, we must compute $\Delta L(y)$ and $\Delta H(y)$ to obtain the value of $\mathrm{Eval}(y)$. However, the computational cost of $\Delta H(y)$ for $H_m$, $H_m^{\mathrm{adj}}$, and $H_m^{\mathrm{vari}}$ is not negligible, especially for large values of $m$. Although this problem is partially alleviated by computing $\Delta H(y)$ only when $\Delta L(y) \le 0$, we require an efficient algorithm for computing $\Delta H(y)$. An outline of the efficient computation of $\Delta H(y)$ is presented in the Appendix.
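While the full algorithm is in the Appendix, the core observation can be illustrated as follows: a replacement changes only the counts of the sequences along the edges that differ between the old and new tour, and because each tour keeps length $n$, the total sample count $N_{\mathrm{sample}}$ is unchanged. The change in $\bar{H}_k$ therefore involves only the changed counts (a sketch with hypothetical names):

```python
import math

def block_entropy_delta(count_changes, n_sample):
    """Change in H_bar_k when a few sequence counts change while the total
    sample count n_sample stays fixed. count_changes maps each affected
    sequence to a pair (old_count, new_count)."""
    def term(c):
        return 0.0 if c == 0 else (c / n_sample) * math.log(c / n_sample)
    return -sum(term(new) - term(old) for old, new in count_changes.values())

# Delta H_m(y) is then block_entropy_delta for k = m + 1 minus that for k = m (Eq. (9)).
```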

When the population diversity measure $H_m^{\mathrm{vari}}$ is used, the context tree $\tilde{\mathcal{S}}$ should be updated each time $x_{r(i)}$ is replaced. However, we decided to update the context tree at the beginning of each generation (before Line 3) because we confirmed that the results did not change significantly with this update method, and an efficient algorithm for the immediate update of the context tree is complicated (although it can be implemented).

6 Experimental Results

We now present experimental results to analyze the ability of the proposed high-order entropy measures. The GA was implemented in C++ on Ubuntu 14.04 and the program code was executed on PCs with Intel Core i7-4790 CPU/3.60 GHz processor.

6.1 Experimental Settings

To investigate the ability of the three types of high-order entropy measures $H_m$, $H_m^{\mathrm{adj}}$, and $H_m^{\mathrm{vari}}$, we ran the GA described in Section 5 using each of the population diversity measures in the evaluation function (14). We used the default configuration of the GA except for the population diversity measure ($H_{\mathrm{ind}}$ was used in the original GA), where $N_p = 300$ and $N_{ch} = 30$ in the default configuration (Nagata and Kobayashi, 2013). In addition, we tested $N_p = 50$ to investigate the effect of the population size. We also tested the GA using either the pairwise Hamming distance $D$ (Eq. (3)) or no population diversity measure (setting $T = 0$ in the evaluation function) to establish a baseline for the ability of $H_{\mathrm{ind}}$. The population diversity measures tested are summarized as follows:

  • Greedy (no population diversity measure), $D$, $H_{\mathrm{ind}}$

  • $H_m$ $(m = 2, 3, 4, 5, 6, 8)$

  • $H_m^{\mathrm{adj}}$ $(m = 2, 3, 4, 5, 6, 8)$

  • $H_m^{\mathrm{vari}}$ $(m = 6;\ ratio = 0.02, 0.05, 0.1, 0.2, 0.4)$

Note that $H_1$ and $H_1^{\mathrm{adj}}$ are equivalent to $H_{\mathrm{ind}}$. For $H_m^{\mathrm{vari}}$, the value of $m$ was set to six, and we show the results of $H_6^{\mathrm{vari}}$ with different values of the parameter $ratio$ because the amount of data for all possible combinations of $m$ and $ratio$ is very large.

For each population diversity measure, we performed 30 independent runs of the GA on instances from three well-known benchmark sets for the STSP (TSPLIB instances, national TSP instances, and VLSI TSP instances), where we selected all 54 instances of sizes ranging from 4,000 to 30,000.2

6.2 Results

Table 1 lists the results of the GA using different population diversity measures under the population size $N_p = 300$; detailed results are presented only for some selected population diversity measures, including Greedy and $H_{\mathrm{ind}}$, to avoid displaying a huge amount of data. For each of $H_m$, $H_m^{\mathrm{adj}}$, and $H_6^{\mathrm{vari}}$, the parameter value of $m$ or $ratio$ $(= 0.1)$ that achieved relatively good results is selected. Each line presents the instance name (Instance), the optimal or best-known solution (Opt. or UB), the number of runs that succeeded in finding the optimal or best-known solution (Su), and the average error of the best tour lengths in the 30 runs from the optimal or best-known solutions (A-Err), where one unit in the column A-Err is $10^{-5}\%$. Full results for all population diversity measures are provided in the online supplementary file, available at https://www.mitpressjournals.org/doi/suppl/10.1162/evco_a_00268.

Table 1:
Solution quality of GA using different population diversity measures.
Instance | Opt. (UB) | Greedy (Su, A-Err) | $H_{\mathrm{ind}}$ (Su, A-Err) | $H_3$ (Su, A-Err) | $H_6^{\mathrm{adj}}$ (Su, A-Err) | $H_6^{\mathrm{vari}}$ (Su, A-Err)
fnl4461 182566 858 30 30 30 30 
rl5915 565530 1294 30 30 30 23 82 
rl5934 556045 4130 23 342 28 64* 26 164 27 109 
pla7397 23260728 894 251 585 425 502 
rl11849 923288 835 24 12 25 12 28 28 
usa13509 19982859 816 14 18 23 9* 16 13 25 7* 
brd14051 469385 820 29 22 15 26 24 12 
d15112 1573084 512 18 14 11 23 16 
d18512 645238 747 18 21 12 25 3* 19 10 
ca4663 1290319 555 27 24 30 0* 30 0* 30 0* 
pm4951 114855 4600 24 87 255 21 240 25 142 
tz6117 394718 717 12 99 18 72 14 99 19 72 
ar6723 837479 1822 25 19 24 23 23 27 25 19 
ho7103 177092 717 24 22 21 39 19 43 11 84 
eg7146 172386 446 23 19 21 34 29 3* 30 0* 
ym7663 238314 1759 29 22 288 27 28 92 
ei8246 206171 994 13 79 28 9* 30 0* 30 0* 
ja9847 491924 2598 83 12 43* 58* 15 48* 
gr9882 300899 614 12 163 13 55 16 46 19 37* 
kz9976 1061881 1283 27 29 29 30 0* 
fi10639 520527 796 15 33 25 10* 29 0* 25 10* 
mo14185 (427377) 627 13 28 18 18 24 8* 21 13* 
it16862 557315 1065 57 44 20* 24* 
vm22775 569288 1330 116 138 115 92 
sw24978 855597 1014 14 29 10 31 15 18 14 30 
bgb4355 12723 4034 30 30 30 30 
bgd4396 13009 19 486 27 76 25 128 30 0* 30 0* 
frv4410 10711 2116 30 30 30 30 
bgf4475 13221 1159 25 126 30 0* 30 0* 30 0* 
xqd4966 15316 1654 30 30 30 30 
fqm5087 13029 1074 30 29 25 29 25 30 
fea5557 15445 17 841 29 21 30 30 30 
xsc6880 21535 3451 22 216 27 46* 28 30* 25 77 
bnd7168 21834 18 305 30 30 30 30 
lap7454 19535 784 30 30 30 30 
ida8197 22338 1910 29 14 28 29 30 30 
dga9698 27724 1262 28 36 30 30 30 
xmc10150 28387 1585 15 258 24 82* 30 0* 25 58* 
xvb13584 37083 1240 21 80 27 26* 27 26* 29 8* 
xrb14233 (45462) 1649 461 381* 300* 15 227* 
xia16928 (52850) 1255 19 113 19 145 17 145 20 88 
pjh17845 (48092) 1919 12 138 19 97 16 97 16 97 
frh19289 (55798) 1409 27 23 29 30 0* 30 0* 
fnc19402 (59287) 1388 19 67 20 73 26 22* 21 50 
ido21215 (63517) 1548 20 104 22 68 28 15* 19 78 
fma21553 (66527) 1332 12 135 22 45* 14 95 23 35* 
lsb22777 (60977) 754 19 65 26 21* 26 27* 29 5* 
xrh24104 (69294) 1279 28 29 27 14 30 
bbz25234 (69335) 1413 23 38 28 14* 29 4* 28 9* 
irx28268 (72607) 1221 28 27 13 26 18 27 13 
fyg28534 (78562) 1251 12 106 19 50* 18 59* 24 29* 
icx28698 (78088) 1494 217 170* 13 93* 10 102* 
boa28924 (79622) 1486 121 117 117 108 
ird29514 (80353) 2152 215 12 99* 14 87* 19 58* 
Statistical test W=0: L=54 — W=17: L=4 W=22: L=1 W=22: L=4 

For each population diversity measure $H$, we performed the one-sided Wilcoxon rank sum test for the null hypothesis that the median of the distribution of the tour lengths (of the best solution of each run) obtained with $H$ is greater than that obtained with $H_{\mathrm{ind}}$. If the null hypothesis was rejected at a significance level of 0.05, the corresponding value in the column A-Err is marked with an asterisk (indicating that $H$ was significantly superior to $H_{\mathrm{ind}}$). Conversely, if the opposite null hypothesis was rejected, the corresponding value is marked with a dagger (indicating that $H$ was significantly worse than $H_{\mathrm{ind}}$). For each population diversity measure, the numbers of asterisks and daggers are presented in the bottom line (Statistical test) of the table, where “W=a: L=b” indicates that the numbers of asterisks and daggers are $a$ and $b$, respectively.

We performed the same one-sided Wilcoxon rank sum test between $H_{\mathrm{ind}}$ and each of the other population diversity measures under the population sizes $N_p = 50$ and 300, and the summarized results (W and L) are presented in Table 2. Note that when the GA using $H_{\mathrm{ind}}$ found optimal (or best-known) solutions in most trials of the 30 runs, we cannot find a statistically significant difference even if the GA using the population diversity measure $H$ finds optimal (or best-known) solutions in all 30 runs. This situation is particularly noticeable when the number of vertices is small (e.g., $n < 10{,}000$). Therefore, the results in Table 2 are presented separately for all 54 instances, the set of the 27 instances with $n < 10{,}000$, and the set of the 27 instances with $10{,}000 \le n$. When the population size is 300, the numbers of instances for which a statistically significant difference can be detected (if the optimal (best-known) solution is found in all runs) are 15 ($n < 10{,}000$) and 24 ($10{,}000 \le n$). On the other hand, this situation does not occur when the population size is 50.

Table 2:
Results of Wilcoxon rank sum test between Hind and other population diversity measures.
Population diversity | $N_p = 300$: All (W, L), $n < 10^4$ (W, L), $10^4 \le n$ (W, L) | $N_p = 50$: All (W, L), $n < 10^4$ (W, L), $10^4 \le n$ (W, L)
Greedy — 54 27 27 54 27 27 
D — 43 22 21 54 27 27 
 m=2 11 11 
 m=3 17 11 10 
Hm m=4 15 14 10 
 m=5 13 11 31 10 21 
 m=6 14 13 42 15 27 
 m=8 21 14 49 22 27 
 m=2 11 13 
 m=3 17 12 26 10 16 
Hmadj m=4 23 15 29 13 16 
 m=5 22 16 33 17 16 
 m=6 22 15 24 15 
 m=8 18 11 20 14 
 r=0.02 17 42 16 26 
 r=0.05 20 12 27 20 
H6vari r=0.1 22 14 13 
(r = ratio) r=0.2 19 13 31 13 18 
 r=0.4 17 13 21 15 

Table 2 indicates that the GA using the independent entropy measure $H_{\mathrm{ind}}$ clearly outperforms the GA using either no population diversity measure or the pairwise Hamming distance $D$. In the following, we first compare the results of the three types of high-order entropy measures when the population size is 300.

Results of $H_m$. Table 2 shows that the results of $H_m$ gradually improve as the value of $m$ increases from one ($H_1$ is equivalent to $H_{\mathrm{ind}}$) to three or four. For greater values of $m$, however, the results of the statistical test gradually deteriorate as the value of $m$ increases. These results demonstrate that the ability to evaluate population diversity can be improved by considering high-order dependencies between consecutive symbols (vertices) in the population. As we expected, however, there is a tradeoff between the potential ability to capture higher-order dependencies and the estimation accuracy of the conditional probability distributions required for computing $H_m$. The experimental results indicate that $m = 3$ or 4 achieves an appropriate tradeoff for this population size ($N_p = 300$).

Results of $H_m^{\mathrm{adj}}$. Table 2 shows that the results of $H_m^{\mathrm{adj}}$ gradually improve as the value of $m$ increases from one ($H_1^{\mathrm{adj}}$ is equivalent to $H_{\mathrm{ind}}$) to four, five, or six. However, the result of $H_8^{\mathrm{adj}}$ is worse than that of $H_6^{\mathrm{adj}}$, though it is still better than that of $H_{\mathrm{ind}}$. As can be predicted (see Section 3.2) and as supported by the results, the best value of $m$ for $H_m^{\mathrm{adj}}$ is greater than that for $H_m$. More importantly, the best result of $H_m^{\mathrm{adj}}$ (obtained with $m = 4$, 5, or 6) is superior to the best result of $H_m$ (obtained with $m = 3$ or 4). This suggests that $H_m^{\mathrm{adj}}$ (with an appropriate value of $m$) is a better population diversity measure than $H_m$. We analyze the reason for this in the next subsection.

Results of $H_6^{\mathrm{vari}}$. In preliminary experiments on selected instances, better results were obtained with $m = 6$ than with $m = 4$, but there was no significant difference between the results for $m = 6$ and $m = 8$. Table 2 shows that the GA using $H_6^{\mathrm{vari}}$ outperforms the GA using $H_{\mathrm{ind}}$ for all values of $ratio$ ranging from 0.02 to 0.4, where the best value of $ratio$ is around 0.1. When $H_6^{\mathrm{vari}}$ $(ratio = 0.1)$ is compared with $H_m$ and $H_m^{\mathrm{adj}}$, the result of $H_6^{\mathrm{vari}}$ $(ratio = 0.1)$ is better than the results of $H_3$ and $H_4$ (the best results of $H_m$). This indicates that the variable-order Markov model introduced to estimate $H_m$ (see Section 4.1) works as expected. However, the result of $H_6^{\mathrm{vari}}$ $(ratio = 0.1)$ is slightly worse than the results of $H_4^{\mathrm{adj}}$, $H_5^{\mathrm{adj}}$, and $H_6^{\mathrm{adj}}$ (the best results of $H_m^{\mathrm{adj}}$).

Results on hard instances. The result of $H_{\mathrm{ind}}$ is poor (e.g., Su < 5) for some instances. This is because these instances have a deceptive structure, i.e., some edges of the optimal solution are not included in the majority of near-optimal solutions. In general, finding the optimal solution for such instances is more difficult. Table 1 shows that the results of $H_{\mathrm{ind}}$ for these instances are more or less improved by using any of the high-order entropy measures $H_3$, $H_6^{\mathrm{adj}}$, and $H_6^{\mathrm{vari}}$, except for instance pla7397.

Effect of the population size. Next, we describe the results of the three types of high-order entropy measures when the population size is 50, focusing mainly on the effect of the population size. Table 2 shows that the use of $H_m$ improves the result of $H_{\mathrm{ind}}$ only when $m = 2$, whereas the result of $H_{\mathrm{ind}}$ is improved by using $H_m$ $(m = 2, 3, 4, 5)$ under the population size of 300. This makes sense because, for a smaller population size, it is more difficult to obtain a sufficient number of samples (sequences of symbols) from the population to estimate the conditional probability distributions for all values of $m$. On the other hand, the superiority of $H_m^{\mathrm{adj}}$ over $H_{\mathrm{ind}}$ is retained for large values of $m$ even with the small population size ($N_p = 50$). As for the high-order entropy measure $H_6^{\mathrm{vari}}$, the best value of $ratio$ is around 0.2, which is greater than that $(= 0.1)$ when $N_p = 300$. Given the role of $ratio$ in constructing the context tree $\tilde{\mathcal{S}}$, this is an expected result. Therefore, the value of $ratio$ should be set appropriately depending on the population size.

Execution time. Next, we discuss the difference in the computation time of the GA when using the different population diversity measures. Table 3 lists the average computation time in seconds for the GA using the different population diversity measures. Results are presented for the six instances listed in the table to avoid displaying a huge amount of data, where two instances ($10{,}000 < n$) were randomly selected from each of the three benchmark sets. In addition, the bottom line (Ave. ratio) of the table shows the ratio of the execution time to that of the GA using $H_{\mathrm{ind}}$, averaged over the six instances. The table shows that the execution time tends to increase as the value of $m$ increases for the population diversity measures $H_m$ and $H_m^{\mathrm{adj}}$. This is mainly because of the increase in the computational effort of calculating $\Delta H(y)$ in the evaluation function (14). Note that the execution times of $H_m$ are smaller than those of $H_m^{\mathrm{adj}}$ (for the same value of $m$) even though the calculation of $H_m^{\mathrm{adj}}$ is simpler than that of $H_m$. The reason for this is that the number of generations of the GA required to complete the search was smaller when $H_m$ was used than when $H_m^{\mathrm{adj}}$ was used. As for the population diversity measure $H_6^{\mathrm{vari}}$ with various values of $ratio$, the results are similar to those of $H_6^{\mathrm{adj}}$.

Table 3:
Average execution time (in seconds) of the GA using different population diversity measures.
Instance | $H_{\mathrm{ind}}$ | Greedy | $D$ | $H_6^{\mathrm{vari}}$ with $ratio =$ (0.02) | (0.05) | (0.1) | (0.2) | (0.4)
 usa13509 2429 1097 2442 3458 3704 3620 3475 3068   
 d15112 3482 1932 3640 5743 5543 5127 4841 4284   
 it16862 2958 1547 3009 5201 4803 4600 4341 4104   
 pjh17845 1636 947 1677 2803 2685 2600 2497 2351   
 fma21553 2053 1094 2084 3518 3354 3285 3154 2982   
 sw24978 5930 3351 6142 10388 9769 9386 8744 8199   
 Ave. ratio — 0.53 1.02 1.69 1.61 1.55 1.47 1.36   
 Hm Hmadj 
 m=2 m=3 m=4 m=5 m=6 m=8 m=2 m=3 m=4 m=5 m=6 m=8 
usa13509 2402 2352 2451 2614 2726 3069 2619 2717 2951 3218 3564 4471 
d15112 3562 3550 3634 3982 4161 4901 3718 3762 4106 4567 4967 6632 
it16862 3077 3087 3208 3406 3720 4294 3162 3290 3649 4139 4614 5936 
pjh17845 1675 1729 1866 1969 2108 2488 1759 1861 2014 2257 2443 3033 
fma21553 2109 2133 2248 2385 2582 3037 2214 2294 2528 2789 3072 3821 
sw24978 6259 6065 6433 6849 7340 8382 6295 6729 7504 8169 9168 11600 
Ave. ratio 1.03 1.03 1.08 1.15 1.23 1.42 1.07 1.12 1.23 1.36 1.50 1.90 

6.3 Analysis

We analyze the behavior of the GA using different population diversity measures to investigate their influence on preserving population diversity. Here, we focus mainly on why the GA using H6adj achieves superior results compared with the GA using Hm with various values of m. For this purpose, we depict the typical behavior of a single run of the GA using each of six population diversity measures (Greedy, Hind, H3, H4, H6, and H6adj) on instance usa13509 under the population size of 300. We omit results on other instances because the same tendency was observed. The first graph in Figure 7 displays the plots of the average tour length L of the population against the number of generations of the GA. This graph is presented as a reference for the following four graphs. The remaining four graphs display the plots of H1, H3, H4, and H6 against the average tour length L.
Figure 7:

Behavior of the GA using different population diversity measures on instance usa13509 ($N_p = 300$).


Figure 7 indicates that each of $H_1$, $H_3$, $H_4$, and $H_6$ is maintained at the highest value at each value of $L$ when the same diversity measure is incorporated into the evaluation function (14) of the GA. However, maintaining the value of $H_m$ at the highest level for a small value of $m$ does not necessarily lead to maintaining $H_k$ $(m < k)$ at a high level. For example, if $H_{\mathrm{ind}}$ (equivalently $H_1$) is incorporated into the evaluation function, the population is evolved such that duplication of sequences of length two in the population is suppressed, without considering the increase in duplication of longer sequences. On the other hand, maintaining the value of $H_m$ at the highest level for a large value of $m$ (e.g., $m = 6$) leads to “overfitting” of the population, which becomes specialized in keeping the value of $H_m$ high. For example, if it is possible to exclude any duplication of sequences of length $m+1$ in the population, such a population is preferred so as to maintain the value of $H_m$ as high as possible, even if the duplication of certain shorter sequences increases excessively.

To alleviate the overfitting problem of $H_m$ for a large value of $m$, not only the value of $H_m$ but also the values of $H_k$ $(k = 1, \ldots, m-1)$ should be maintained at a high level. The high-order entropy measure $H_m^{\mathrm{adj}}$ is suitable for this purpose because $H_m^{\mathrm{adj}}$ is equivalent to $H_1 + \cdots + H_m$, and therefore the population is evolved such that each of the values of $H_k$ $(k \le m)$ is kept high, although there is no guarantee that all values are maintained near their highest levels. Fortunately, as can be observed from Figure 7, when $H_6^{\mathrm{adj}}$ is incorporated into the evaluation function, the values of $H_k$ $(k \le 6)$ are all maintained near their highest levels, especially for $k \ge 3$. Therefore, we conclude that this is one reason why the GA with the high-order entropy measure $H_6^{\mathrm{adj}}$ achieves superior results compared with the GA using each of $H_m$ $(m = 1, \ldots, 6)$.

7 Conclusions

We proposed three types of entropy-based population diversity measures to evaluate population diversity of a GA for the TSP. These measures consider high-order dependencies between variables of individuals in the population (high-order entropy measures). To derive these, we considered dependencies between consecutive variables and assumed that an individual is represented as a circular sequence of symbols, which is well suited for the TSP. Under these conditions, the entropy of the probability distribution of individuals in the population (used as a population diversity measure) is defined as the entropy rate of a Markov process estimated from the sequences of symbols sampled from the population.

The high-order entropy measure $H_m$ is equivalent to the entropy rate of the $m$-th-order Markov process. It has the potential ability to capture dependencies between consecutive variables of length up to $m+1$. We demonstrated that $H_m$ with an appropriate value of $m$ $(= 3$ or $4)$ is significantly superior to $H_{\mathrm{ind}}$, the commonly used entropy-based population diversity measure that does not consider dependencies between variables, in the ability to evaluate population diversity. Although the high-order entropy measure $H_m^{\mathrm{adj}}$ is defined in a somewhat ad hoc manner, it is essentially equivalent to $H_1 + H_2 + \cdots + H_m$. It reduces the overfitting problem of $H_m$ for a large value of $m$ (e.g., $m = 6$) while considering dependencies between consecutive variables of length up to $m+1$. Consequently, $H_m^{\mathrm{adj}}$ with an appropriate value of $m$ $(= 4$, 5, or $6)$ further improves on $H_m$ in the ability to measure population diversity. The high-order entropy measure $H_m^{\mathrm{vari}}$ is equivalent to the entropy rate of the variable-order Markov process. It also reduces the overfitting problem of $H_m$ with an appropriate parameter setting (e.g., $m = 6$ and $ratio = 0.1$) and improves on $H_m$. Overall, the high-order entropy measure $H_m^{\mathrm{adj}}$ with an appropriate value of $m$ $(= 4$, 5, or $6)$ is the best population diversity measure among all the population diversity measures tested.

We have demonstrated the effectiveness of considering high-order dependencies between variables of individuals in evaluating population diversity at least for the TSP. Development of other high-order entropy measures and their application to other combinatorial optimization problems remains as a future research direction.

Acknowledgments

This work was supported by JSPS KAKENHI Grant Number 17K00342.

Notes

1. This strategy is useful because many of the offspring solutions generated by EAX improve (or do not change) the tour length of $x_{r(i)}$. Otherwise, offspring solutions should be evaluated simply by $\Delta L(y) - T \Delta H(y)$ while reducing the value of $T$ in the course of the search.

2. At present, several TSP generators are available, and it is possible to make comparisons on hundreds or thousands of TSP instances (Kotthoff et al., 2015; Kerschke et al., 2018). However, performing such comparisons on instances of the sizes considered here would take a huge amount of time, so we did not do so.

References

Begleiter, R., El-Yaniv, R., and Yona, G. (2004). On prediction using variable order Markov models. Journal of Artificial Intelligence Research, 22:385–421.

Ben-Gal, I., Morag, G., and Shmilovici, A. (2003). Context-based statistical process control: A monitoring procedure for state-dependent processes. Technometrics, 45(4):293–311.

Kerschke, P., Kotthoff, L., Bossek, J., Hoos, H. H., and Trautmann, H. (2018). Leveraging TSP solver complementarity through machine learning. Evolutionary Computation, 26(4):597–620.

Kotthoff, L., Kerschke, P., Hoos, H., and Trautmann, H. (2015). Improving the state of the art in inexact TSP solving using per-instance algorithm selection. In Proceedings of the 9th International Conference on Learning and Intelligent Optimization, pp. 202–217. Lecture Notes in Computer Science, Vol. 8994.

Mächler, M., and Bühlmann, P. (2004). Variable length Markov chains: Methodology, computing, and software. Journal of Computational and Graphical Statistics, 13(2):435–455.

Maekawa, K., Mori, N., Tamaki, H., Kita, H., and Nishikawa, Y. (1996). A genetic solution for the traveling salesman problem by means of a thermodynamical selection rule. In Proceedings of the 3rd IEEE Conference on Evolutionary Computation, pp. 529–534.

Mori, N., Kita, H., and Nishikawa, Y. (1996). Adaptation to a changing environment by means of the thermodynamical genetic algorithm. In Proceedings of the 9th International Conference on Parallel Problem Solving from Nature, pp. 513–522. Lecture Notes in Computer Science, Vol. 4193.

Nagata, Y. (2006). Fast EAX algorithm considering population diversity for traveling salesman problems. In Proceedings of the 6th International Conference on Evolutionary Computation in Combinatorial Optimization, pp. 171–182. Lecture Notes in Computer Science, Vol. 3906.

Nagata, Y. (2016). Population diversity measures based on variable-order Markov models for the traveling salesman problem. In Proceedings of the 14th International Conference on Parallel Problem Solving from Nature, pp. 973–983. Lecture Notes in Computer Science, Vol. 9921.

Nagata, Y., and Kobayashi, S. (2013). A powerful genetic algorithm using edge assembly crossover for the traveling salesman problem. INFORMS Journal on Computing, 25(2):346–363.

Nagata, Y., and Ono, I. (2013). High-order sequence entropies for measuring population diversity in the traveling salesman problem. In Proceedings of the 13th European Conference on Evolutionary Computation in Combinatorial Optimization, pp. 179–190. Lecture Notes in Computer Science, Vol. 7832.

Rissanen, J. (1983). A universal data compression system. IEEE Transactions on Information Theory, 29(5):656–664.

Ron, D., Singer, Y., and Tishby, N. (1996). The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 25(2–3):117–149.

Schulz, M. H., Weese, D., Rausch, T., Döring, A., Reinert, K., and Vingron, M. (2008). Fast and adaptive variable order Markov chain construction. In Proceedings of the 8th International Workshop on Algorithms in Bioinformatics, pp. 306–317. Lecture Notes in Bioinformatics, Vol. 5251.

Shmilovici, A., and Ben-Gal, I. (2007). Using a VOM model for reconstructing potential coding regions in EST sequences. Computational Statistics, 22(1):49–69.

Squillero, G., and Tonda, A. (2016). Divergence of character and premature convergence: A survey of methodologies for promoting diversity in evolutionary optimization. Information Sciences, 329:782–799.

Tsai, H., Yang, J., Tsai, Y., and Kao, C. (2004). An evolutionary algorithm for large traveling salesman problems. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 34(4):1718–1729.

Tsujimura, Y., and Gen, M. (1998). Entropy-based genetic algorithm for solving TSP. In Proceedings of the 2nd International Conference on Knowledge-Based Intelligent Electronic Systems, pp. 285–290.

Vallada, E., and Ruiz, R. (2010). Genetic algorithms with path relinking for the minimum tardiness permutation flowshop problem. Omega, 38(1):57–67.

Wang, Y., Lü, Z., and Hao, J. (2010). A study of multi-parent crossover operators in a memetic algorithm. In Proceedings of the 11th International Conference on Parallel Problem Solving from Nature, pp. 556–565. Lecture Notes in Computer Science, Vol. 6238.

Wineberg, M., and Oppacher, F. (2003). The underlying similarity of diversity measures used in evolutionary computation. In Proceedings of the 2003 International Conference on Genetic and Evolutionary Computation, p. 206.

Yao, X. (1993). An empirical study of genetic operators in genetic algorithms. Microprocessing and Microprogramming, 38:707–714.

Zhang, C., Su, S., and Chen, J. (2006). Efficient population diversity handling genetic algorithm for QoS-aware web services selection. In Proceedings of the 6th International Conference on Computational Science, pp. 104–111.

Appendix: Efficient Computation of ΔH(y)

An efficient computation of $\Delta H(y)$ in the evaluation function (14) is crucial for the execution of the GA using the proposed high-order entropy measures ($H_m$, $H_m^{adj}$, and $H_m^{vari}$). We present an outline of the efficient computation of $\Delta H(y)$ in the ATSP case, which includes the STSP case as a special case (see Section 3.1).

Remember that $T$ is a tree that stores the values of $N(s_1, \ldots, s_k)$ ($k \le m+1$) (see Figure 2). For an offspring solution $y$, let $E_{re}$ (resp. $E_{ad}$) be the set of edges that are removed from (resp. added to) the parent solution $p_A$ to generate this offspring solution. For each edge of $E_{re}$, $p_A$ has $m$ sequences of length $m+1$ that include this edge, and these disappear from the population if $p_A$ is replaced with $y$. Figure 8 illustrates the sequences of length $m+1$ ($=4$) to remove for $E_{re}$. In a similar manner, we can determine the sequences of length $m+1$ to add for $E_{ad}$, as illustrated in Figure 8. For each sequence $\{s_1, \ldots, s_m, s_{m+1}\}$ to remove or add, we can determine the values of $N(s_1, \ldots, s_k)$ ($h \le k \le m+1$) affected by the replacement by tracing $T$ according to the sequence, where the value of $h$ depends on the situation. For example, let a sequence of length $m+1$ ($=4$) to remove be $\{b, g, c, a\}$, where only edge $(g, c)$ is included in $E_{re}$. In this case, $N(b, g, c)$ and $N(b, g, c, a)$ are decremented (if $p_A$ is replaced with $y$), whereas $N(b)$ and $N(b, g)$ remain the same; that is, $h = 3$.
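To make the tracing step concrete, the following is a minimal sketch (in Python, not the authors' implementation) of how the counts stored in $T$ can be updated for one removed or added sequence. The tree $T$ is modeled as a nested dictionary, and the identifiers (trace_and_update, h, delta) are illustrative assumptions.

def trace_and_update(T, seq, h, delta):
    """A hypothetical helper: add `delta` (+1 for an added sequence,
    -1 for a removed one) to N(s1,...,sk) for every affected prefix,
    i.e., h <= k <= m+1. Prefixes with k < h are left unchanged because
    they contain no removed or added edge."""
    node = T
    for k, symbol in enumerate(seq, start=1):
        child = node.setdefault(symbol, {"count": 0, "children": {}})
        if k >= h:                    # only prefixes containing a changed edge
            child["count"] += delta   # update N(s1,...,sk)
        node = child["children"]

# For the example in the text: removing {b,g,c,a} with h = 3 decrements
# N(b,g,c) and N(b,g,c,a) while leaving N(b) and N(b,g) intact.
T = {}
trace_and_update(T, ["b", "g", "c", "a"], h=1, delta=+1)  # suppose it was counted once
trace_and_update(T, ["b", "g", "c", "a"], h=3, delta=-1)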
Figure 8: Illustration of the sequences of length $m+1$ ($=4$) to remove (add) for $E_{re}$ ($E_{ad}$).


The computations of $\Delta H_m(y)$ and $\Delta H_m^{adj}(y)$ are similar; hence, we describe only the method to compute $\Delta H_m(y)$. For every sequence $\{s_1, \ldots, s_{m+1}\}$ to remove or add, the changes in $N(s_1, \ldots, s_m)$ and $N(s_1, \ldots, s_{m+1})$ are accumulated in $\Delta N(s_1, \ldots, s_m)$ and $\Delta N(s_1, \ldots, s_{m+1})$, respectively. We store these values in the tree $T$. Therefore, if an added sequence $\{s_1, \ldots, s_k\}$ does not exist in $T$, we must create a temporary new node of $T$ with $N(s_1, \ldots, s_k) = 0$ to store the necessary data ($T$ must be restored after computing $\Delta H_m(y)$).

Let $S^m$ and $S^{m+1}$ be the sets of the sequences of length $m$ and $m+1$, respectively, that are removed or added. Let $P'(s)$ be defined as $\frac{N(s) + \Delta N(s)}{N_{sample}}$ for $s \in S^m \cup S^{m+1}$, where $P(s)$ is already defined as $\frac{N(s)}{N_{sample}}$ (see Section 3.1). The value of $\Delta H_m(y)$ is then computed by
\[
\Delta H_m(y) = -\sum_{s \in S^{m+1}} \bigl( P'(s) \log P'(s) - P(s) \log P(s) \bigr) + \sum_{s \in S^{m}} \bigl( P'(s) \log P'(s) - P(s) \log P(s) \bigr)
\]
(15)
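As an illustration, Eq. (15) can be evaluated directly from the accumulated count changes. The following minimal sketch assumes the affected sequences and their changes have already been collected (e.g., by the tracing sketch above) into dictionaries mapping each sequence to the pair $(N(s), \Delta N(s))$; the identifiers delta_entropy_m and n_sample are illustrative.

import math

def delta_entropy_m(changed_m1, changed_m, n_sample):
    """A sketch of Eq. (15). `changed_m1` (resp. `changed_m`) maps each
    affected sequence of length m+1 (resp. m) to the pair (N(s), dN(s))."""
    def term(n, dn):
        def plogp(count):              # P log P with the convention 0 log 0 = 0
            p = count / n_sample
            return p * math.log(p) if p > 0.0 else 0.0
        return plogp(n + dn) - plogp(n)
    return (-sum(term(n, dn) for n, dn in changed_m1.values())
            + sum(term(n, dn) for n, dn in changed_m.values()))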
The computation of $\Delta H_m^{vari}(y)$ is somewhat complicated. Before describing it, we first describe a method to compute $H_m^{vari}$ (Eq. (13)). Remember that $S$ is the set of the sequences of symbols for the conditioning variables of the variable-order Markov process that defines $H_m^{vari}$, and that $\tilde{S} = \{\tilde{s_c} \mid s_c \in S\}$, where $\tilde{s_c}$ is the reverse sequence of $s_c$. The context tree $\tilde{S}$ is then introduced to represent the set $\tilde{S}$ (see Figure 3), where the elements of the set $\tilde{S}$ are represented as the leaf nodes of the context tree $\tilde{S}$. In the actual implementation, each node with the symbol “#” in the context tree $\tilde{S}$ is removed and associated with its parent node. Figure 9 (left) shows the context tree obtained in this manner from the context tree $\tilde{S}$ presented in Figure 3. For example, node $a$ in the modified context tree represents the sequence $a\#$, where “#” means any symbol other than $c$ and $f$. From this point onward, we refer to the modified context tree simply as the context tree $\tilde{S}$ unless otherwise stated.

We classify the nodes of the context tree $\tilde{S}$ into two classes (see also Figure 9): leaf nodes (Class 1) and internal nodes (Class 2). Further, we define a tree $S$ to represent the set $S$, where the nodes of $S$ correspond one-to-one with the nodes of $\tilde{S}$. The nodes of the tree $S$ are also classified into the two classes such that each node in $S$ has the same class as the corresponding node in $\tilde{S}$. Figure 9 (right) shows the tree $S$ that corresponds to the context tree $\tilde{S}$ on the left side of the figure. For example, node “cgb” in $\tilde{S}$ (i.e., the node reached by following $\tilde{S}$ in this order) and node “bgc” in $S$ both belong to Class 1. Note that only the white nodes of $S$ can be determined from the information of $\tilde{S}$ presented in the figure, because the other nodes (gray nodes) depend on the nodes of $\tilde{S}$ omitted from the figure. In the actual implementation, all data of the trees $S$ and $\tilde{S}$ are stored in the trees $T$ and $\tilde{T}$, respectively, because $S$ (resp. $\tilde{S}$) is a subtree of $T$ (resp. $\tilde{T}$).
Figure 9: Context tree representation of $\tilde{S}$ and $S$. The numbers represent $\tilde{N}_I(\tilde{s_c})$ and $N_I(s_c)$.


Let $S_1$ and $S_2$ be the sets of the nodes of Class 1 and Class 2, respectively, in the tree $S$. Similarly, $\tilde{S}_1$ and $\tilde{S}_2$ are defined for $\tilde{S}$. For each node $s_c \in S$, we denote the number of occurrences of the corresponding sequence of symbols in the population as $N_I(s_c)$. The values of $N_I(s_c)$ are given by
\[
N_I(s_c) =
\begin{cases}
\tilde{N}(\tilde{s_c}) \;(= N(s_c)) & (s_c \in S_1) \\[2pt]
\tilde{N}(\tilde{s_c}) - \sum_{s \in L,\, (\tilde{s_c}, s) \in \tilde{S}_1} \tilde{N}(\tilde{s_c}, s) & (s_c \in S_2)
\end{cases}
\]
(16)
We need to pay attention when $s_c \in S_2$. For example, if $s_c = a$ (equivalently, $\tilde{s_c} = a$) in the example, $\tilde{s_c}$ represents the sequences “a#,” but $\tilde{s_c}$ represents “a” when it is an argument of the functions $N(\cdot)$ and $\tilde{N}(\cdot)$, because these functions are defined on the trees $T$ and $\tilde{T}$. The values of $N_I(s_c)$ are stored in the tree $S$.
Further, for $s_c \in S$ and $s_0 \in L$, we define the following function:
\[
N_I^{+}(s_c, s_0) =
\begin{cases}
N(s_c, s_0) & (s_c \in S_1) \\[2pt]
N(s_c, s_0) - \sum_{s \in L,\, (s, s_c) \in S_1} N(s, s_c, s_0) & (s_c \in S_2)
\end{cases}
\]
(17)
The values of $N_I^{+}(s_c, s_0)$ are not stored and are computed as necessary. Note that for $s_c \in S_2$, the set of nodes $(s, s_c) \in S_1$ is obtained by introducing a function that, for each node in $S$ (resp. $\tilde{S}$), returns the address of the corresponding node in $\tilde{S}$ (resp. $S$), because we can easily access the nodes $(\tilde{s_c}, s) \in \tilde{S}_1$ in the context tree $\tilde{S}$. We define $P_I(s_c) = \frac{N_I(s_c)}{N_{sample}}$ ($s_c \in S$) and $P_I^{+}(s_c, s_0) = \frac{N_I^{+}(s_c, s_0)}{N_{sample}}$ ($s_c \in S$, $s_0 \in L$). Then, $H_m^{vari}$ is computed by
\[
H_m^{vari} = -\sum_{s_c \in S} \sum_{s_0 \in L} P_I^{+}(s_c, s_0) \log P_I^{+}(s_c, s_0) + \sum_{s_c \in S} P_I(s_c) \log P_I(s_c)
\]
(18)
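To clarify how Eqs. (16)–(18) fit together, the following minimal sketch computes $H_m^{vari}$ from the stored counts. It assumes contexts are represented as tuples of symbols, `N` and `N_tilde` are dictionaries holding the counts stored in the trees $T$ and $\tilde{T}$, `class1` is the set of Class 1 contexts, and `alphabet` plays the role of $L$; all identifiers are illustrative, not the authors' implementation.

import math

def n_I(sc, N_tilde, class1, alphabet):
    """Eq. (16): occurrences attributed to the context sc itself."""
    rev = tuple(reversed(sc))
    if sc in class1:                      # Class 1 (leaf) context
        return N_tilde[rev]
    # Class 2 (internal) context: subtract the occurrences claimed by
    # the longer Class 1 contexts (s, sc)
    return N_tilde[rev] - sum(N_tilde.get(rev + (s,), 0)
                              for s in alphabet if (s,) + sc in class1)

def n_I_plus(sc, s0, N, class1, alphabet):
    """Eq. (17): occurrences of context sc followed by symbol s0."""
    if sc in class1:
        return N.get(sc + (s0,), 0)
    return (N.get(sc + (s0,), 0)
            - sum(N.get((s,) + sc + (s0,), 0)
                  for s in alphabet if (s,) + sc in class1))

def entropy_vari(contexts, class1, N, N_tilde, alphabet, n_sample):
    """Eq. (18): -sum P_I+ log P_I+ + sum P_I log P_I."""
    def plogp(count):
        p = count / n_sample
        return p * math.log(p) if p > 0.0 else 0.0
    return (-sum(plogp(n_I_plus(sc, s0, N, class1, alphabet))
                 for sc in contexts for s0 in alphabet)
            + sum(plogp(n_I(sc, N_tilde, class1, alphabet))
                  for sc in contexts))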

Now, $\Delta H_m^{vari}(y)$ is computed by calculating the differences in $N_I(s_c)$ and $N_I^{+}(s_c, s_0)$ when $p_A$ is replaced with $y$. We denote these differences as $\Delta N_I(s_c)$ and $\Delta N_I^{+}(s_c, s_0)$, respectively. Let $S_I$ and $S_I^{+}$ be the sets of nodes $s_c$ and $(s_c, s_0)$ whose values of $N_I(s_c)$ and $N_I^{+}(s_c, s_0)$ change, respectively. At the beginning, an initialization procedure is performed: $S_I = \emptyset$, $S_I^{+} = \emptyset$, and $\Delta N_I(s) = \Delta N_I^{+}(s) = 0$ ($s \in S$). For every sequence $\{s_1, \ldots, s_{m+1}\}$ to remove or add (if $p_A$ is replaced with $y$), the following procedure is performed.

1. Trace $S$ according to the sequence $\{s_1, \ldots, s_{m+1}\}$, and for each prefix $\{s_1, \ldots, s_k\}$ ($h \le k \le m+1$), perform steps 2 and 3. The value of $h$ is explained earlier in this appendix.

2. If $\{s_1, \ldots, s_k\}$ is a Class 1 node of $S$, add this sequence to $S_I$ and increment (decrement) $\Delta N_I(s_1, \ldots, s_k)$ by one if this sequence is added (removed). Else, if $\{s_2, \ldots, s_k\}$ is a Class 2 node of $S$, add $\{s_2, \ldots, s_k\}$ to $S_I$ and increment (decrement) $\Delta N_I(s_2, \ldots, s_k)$ by one if $\{s_1, \ldots, s_k\}$ is added (removed).

3. If $\{s_1, \ldots, s_{k-1}\}$ is a Class 1 node of $S$, add $\{s_1, \ldots, s_k\}$ to $S_I^{+}$ and increment (decrement) $\Delta N_I^{+}(s_1, \ldots, s_k)$ by one if $\{s_1, \ldots, s_k\}$ is added (removed). Else, if $\{s_2, \ldots, s_{k-1}\}$ is a Class 2 node of $S$, add $\{s_2, \ldots, s_k\}$ to $S_I^{+}$ and increment (decrement) $\Delta N_I^{+}(s_2, \ldots, s_k)$ by one if $\{s_1, \ldots, s_k\}$ is added (removed).

After completing the above procedure, we compute $P_I'(s_c) = \frac{N_I(s_c) + \Delta N_I(s_c)}{N_{sample}}$ ($s_c \in S_I$) and $P_I^{+\prime}(s) = \frac{N_I^{+}(s) + \Delta N_I^{+}(s)}{N_{sample}}$ ($s \in S_I^{+}$). The value of $\Delta H_m^{vari}(y)$ is then computed by
\[
\Delta H_m^{vari}(y) = -\sum_{s \in S_I^{+}} \bigl( P_I^{+\prime}(s) \log P_I^{+\prime}(s) - P_I^{+}(s) \log P_I^{+}(s) \bigr) + \sum_{s_c \in S_I} \bigl( P_I'(s_c) \log P_I'(s_c) - P_I(s_c) \log P_I(s_c) \bigr)
\]
(19)
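Read together, steps 1–3 and Eq. (19) admit the following minimal sketch, under the same illustrative assumptions as before. Here `seqs` is the list of (sequence, h, delta) triples with delta = +1 for an added sequence and -1 for a removed one, `class1` and `class2` are the sets of Class 1 and Class 2 contexts, and `n_I` and `n_I_plus` are functions evaluating Eqs. (16) and (17) (e.g., the earlier sketches with their count tables bound).

import math

def delta_entropy_vari(seqs, class1, class2, n_I, n_I_plus, n_sample):
    """A sketch of steps 1-3 and Eq. (19)."""
    d_ctx, d_ext = {}, {}                # Delta N_I and Delta N_I+ (S_I and S_I+)
    for seq, h, delta in seqs:
        for k in range(h, len(seq) + 1):
            prefix = tuple(seq[:k])
            # Step 2: which context absorbs this occurrence?
            if prefix in class1:
                d_ctx[prefix] = d_ctx.get(prefix, 0) + delta
            elif prefix[1:] in class2:
                d_ctx[prefix[1:]] = d_ctx.get(prefix[1:], 0) + delta
            # Step 3: which (context, next-symbol) pair changes?
            if prefix[:-1] in class1:
                d_ext[prefix] = d_ext.get(prefix, 0) + delta
            elif prefix[1:-1] in class2:
                d_ext[prefix[1:]] = d_ext.get(prefix[1:], 0) + delta
    def term(n, dn):
        def plogp(count):
            p = count / n_sample
            return p * math.log(p) if p > 0.0 else 0.0
        return plogp(n + dn) - plogp(n)
    # Eq. (19): changes in P log P summed over S_I+ and S_I
    return (-sum(term(n_I_plus(s[:-1], s[-1]), dn) for s, dn in d_ext.items())
            + sum(term(n_I(sc), dn) for sc, dn in d_ctx.items()))

In the actual implementation the temporary bookkeeping is done in the trees $S$ and $\tilde{S}$ themselves, as described above; the dictionaries here merely stand in for that storage.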

After the parent solution $p_A$ is replaced with the selected offspring solution, we need to update the trees $S$ and $\tilde{S}$ accordingly. This can be done efficiently, but we omit the details because the procedure is complicated to explain. In the actual implementation, the values of $N_I(s_c)$ were updated immediately, but the structures of the trees were reconstructed at the beginning of each generation of the GA (see Section 5).
