## Abstract

Understanding the behaviour of heuristic search methods is a challenge. This even holds for simple local search methods such as 2-OPT for the Travelling Salesperson Problem (TSP). In this article, we present a general framework that is able to construct a diverse set of instances which are hard or easy for a given search heuristic. Such a diverse set is obtained by using an evolutionary algorithm for constructing hard or easy instances which are diverse with respect to different features of the underlying problem. Examining the constructed instance sets, we show that many combinations of two or three features give a good classification of the TSP instances in terms of whether they are hard for 2-OPT to solve.

## 1 Introduction

Heuristic search methods such as local search, simulated annealing, evolutionary algorithms, and ant colony optimization have been shown to be very successful for various combinatorial optimization problems. Although they usually do not come with performance guarantees on their runtime and/or approximation behaviour, they often perform very well in practice. Understanding the conditions under which optimization algorithms perform well is essential for automatic algorithm selection, configuration, and effective algorithm design. In both the artificial intelligence (AI) (Xu et al., 2008; Hutter et al., 2014; Vilalta and Drissi, 2002; Eggensperger et al., 2015; Feurer et al., 2015) and operational research (Smith-Miles and Lopes, 2012; van Hemert, 2006) communities, this topic has become a major point of interest.

The feature-based analysis of heuristic search algorithms has become an important part of understanding such types of algorithms (Mersmann et al., 2013; Smith-Miles et al., 2010). This approach characterizes algorithms and their performance for a given problem based on features of problem instances. Thereby, it provides an important tool for bridging the gap between purely experimental investigations and mathematical methods for analysing the performance of search algorithms (Neumann and Witt, 2010; Kötzing et al., 2012; Englert et al., 2014). Current methods for feature-based analysis construct hard and easy instances for an investigated search heuristic and a given optimization problem by evolving instances with an evolutionary algorithm (Mersmann et al., 2013; Nallaperuma et al., 2013, 2014). This evolutionary algorithm constructs problem instances on which the examined algorithm either shows a bad (good) approximation behaviour and/or requires a large (small) computational effort to come up with good or optimal solutions. Although the evolutionary algorithm for constructing such instances is usually run several times to obtain a large set of hard (easy) instances, the question arises whether the features of the instances obtained give a good characterization of problem difficulty.

In this article, we propose a new approach to constructing hard and easy instances. Following recent work on using evolutionary algorithms for generating diverse sets of instances that are all of high quality (Ulrich, Bader, and Thiele, 2010; Ulrich, Bader, and Zitzler, 2010), we introduce an evolutionary algorithm which maximizes the diversity of the obtained instances in terms of a given feature. The recent work of Smith-Miles and Bowly (2015) also considers the diversity of the instances. However, our notion of diversity is fundamentally different from this work: it follows from evolutionary diversity optimization mechanisms, whereas theirs is based on the distance from manually specified target points. Our approach allows the generation of a set of instances that is much more diverse with respect to the problem feature at hand. Carrying out this process for several features of the considered problem and algorithm gives a much better classification of instances according to their difficulty of being solved by the considered algorithm.

To show the benefit of our approach compared with previous methods, we consider the classical 2-OPT algorithm for the TSP. Previous feature-based analyses have already considered hard and easy instances in terms of approximation ratio and analyzed the features of such hard (easy) instances obtained by an evolutionary algorithm. The experimental results of our new approach show that diversity optimization of the features results in an improved coverage of the feature space over classical instance generation methods. In particular, the results show that for some combinations of two features, it is possible to classify hard and easy instances into two clusters with a wider coverage of the feature space compared to the classical methods. Moreover, three-feature combinations further improve the classification of hard and easy instances in most cases. Furthermore, a classification model is built from these diverse instances that classifies TSP instances based on their hardness for 2-OPT.

This article is extended from its conference version, which was published at PPSN 2016 (Gao et al., 2016). The approach proposed in the conference paper has been applied to the analysis of artistic image variants (Alexander et al., 2017) and examined with different diversity measurements such as discrepancy (Neumann et al., 2018) and popular indicators from the area of evolutionary multiobjective optimisation (Neumann et al., 2019). The further research presented in this article focuses on a weighted version of the feature-based population diversity measurement and on using problem hardness as an extra feature value, which are detailed in Sections 5.2 and 7.

The remainder of this article is organized as follows. First, we introduce the Euclidean TSP and the background on feature-based analysis in Section 2. Afterwards, we state our diversity optimization approach for evolving instances according to feature values in Section 3 and report on the impact of diversity optimization on the range of feature values in Section 4. As feature values can be very diverse both for easy and hard instances, we consider combinations of several features for instance classification in Section 5. We then build a classification model that classifies instances based on hardness in Section 6. Section 7 discusses how to maximize the diversity over problem hardness and the resulting classification. The article concludes in Section 8.

## 2 Background

We consider the classical NP-hard Euclidean Travelling Salesperson Problem (TSP) as the example problem for evolving hard and easy instances which have a diverse set of features. Our methodology can be applied to any optimization problem, but using the TSP in our study has the advantage that it has already been investigated extensively from different perspectives including the area of feature-based analysis.

For our investigations, cities are always in the normalized plane $[0,1]^2$; that is, each city has an $x$- and $y$-coordinate in the interval $[0,1]$. In the following, a TSP instance always consists of a set of $n$ points in $[0,1]^2$ and the Euclidean distances between them.
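As a small illustration of this representation, the following R sketch (our own illustrative code, not the implementation used in the experiments; the function names are ours) samples an instance of $n$ cities from $[0,1]^2$ and computes the matrix of pairwise Euclidean distances.

```r
# Illustrative only (not the authors' code): sample a TSP instance of n cities
# uniformly at random from the unit square and compute all pairwise Euclidean
# distances.
random_instance <- function(n) {
  matrix(runif(2 * n), ncol = 2)          # one row per city: (x, y) in [0,1]^2
}

distance_matrix <- function(coords) {
  as.matrix(dist(coords, method = "euclidean"))
}

set.seed(1)
coords <- random_instance(25)
D <- distance_matrix(coords)
```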

Local search heuristics have been shown to be very successful when dealing with the TSP and the most prominent local search operator is the 2-OPT operator (Croes, 1958). The resulting local search algorithm starts with a random permutation of the cities and repeatedly checks whether removing two edges and reconnecting the two resulting paths by two other edges leads to a shorter tour. If no improvement can be found by carrying out any 2-OPT operation, the tour is called locally optimal and the algorithm terminates.
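The following R sketch illustrates this local search procedure; it is a straightforward, unoptimized version written purely for illustration, and the implementation used in the experiments may differ.

```r
# Sketch of 2-OPT local search: start from a random tour and repeatedly apply
# improving 2-OPT moves (reverse a segment of the tour) until no improvement
# is possible, i.e., the tour is locally optimal.
tour_length <- function(tour, D) {
  nxt <- c(tour[-1], tour[1])             # successor of every city on the tour
  sum(D[cbind(tour, nxt)])
}

two_opt <- function(D) {
  n <- nrow(D)
  tour <- sample(n)                       # random initial permutation
  improved <- TRUE
  while (improved) {
    improved <- FALSE
    for (i in 1:(n - 1)) {
      for (j in (i + 1):n) {
        candidate <- tour
        candidate[i:j] <- rev(tour[i:j])  # 2-OPT move: reverse segment i..j
        if (tour_length(candidate, D) < tour_length(tour, D)) {
          tour <- candidate
          improved <- TRUE
        }
      }
    }
  }
  list(tour = tour, length = tour_length(tour, D))
}

res <- two_opt(D)                          # D from the previous sketch
```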

The key factor in the area of feature-based analysis is to identify the problem features and their contribution to the problem hardness for a particular algorithm and problem combination. This can be achieved through investigating hard and easy instances of the problem. Using an evolutionary algorithm, it is possible to evolve sets of hard and easy instances by maximizing or minimizing the fitness (tour length in the case of the TSP) of each instance (van Hemert, 2006; Smith-Miles and Lopes, 2012; Mersmann et al., 2013; Smith-Miles et al., 2010). However, none of these approaches have considered the diversity of the instances explicitly. Within this study, we expect to improve the evolutionary algorithm-based instance generation approach by introducing diversity optimization.

The structural features are dependent on the underlying problem. In Mersmann et al. (2013), there are 47 features in different feature classes used to provide an understanding of algorithm performance for the TSP. The different feature classes established are distance features, mode features, cluster features, centroid features, MST features, angle features, and convex hull features. The feature values are regarded as indicators which allow prediction of the performance of a given algorithm on a given instance.

## 3 Feature-Based Diversity Optimization

In this section, we introduce our approach of evolving a set of easy or hard instances which are diverse with respect to important problem features. As in previous studies, we measure the hardness of a given instance by the ratio of the solution quality obtained by the considered algorithm to the value of an optimal solution.
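A minimal sketch of this hardness measure in R is given below; `opt_length` (the optimal tour length, e.g. obtained with an exact solver) is assumed to be given, `two_opt` refers to the sketch in Section 2, and averaging over several runs matches the experimental setup described later in this section.

```r
# Sketch of the hardness measure: average 2-OPT tour length over several runs
# divided by the optimal tour length. Obtaining the optimum is assumed here
# (`opt_length`); `two_opt` is the illustrative sketch from Section 2.
approximation_ratio <- function(D, opt_length, repeats = 5) {
  mean(replicate(repeats, two_opt(D)$length)) / opt_length
}
```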

We propose to use an evolutionary algorithm to construct sets of instances of the TSP that are quantified as either easy or hard in terms of approximation and are diverse with respect to underlying features of the produced problem instances. Our evolutionary algorithm (shown in Algorithm 1) evolves instances which are diverse with respect to given features and meet given approximation ratio thresholds.

The algorithm is initialized with a population $P$ consisting of $\mu$ TSP instances which have an approximation ratio of at least $\alpha_h$ in the case of generating a diverse set of hard instances. In the case of easy instances, we start with a population where all instances have an approximation ratio of at most $\alpha_e$, and only instances with an approximation ratio of at most $\alpha_e$ can be accepted for the next iteration. In each iteration, $\lambda \leq \mu$ offspring are produced by selecting $\lambda$ parents and applying mutation to the selected individuals. Offspring that do not meet the approximation threshold are rejected immediately.

The new parent population is formed by reducing the set consisting of parents and offspring satisfying the approximation threshold until a set of $\mu $ solutions is achieved. This is done by removing instances one by one based on their contribution to the diversity according to the considered feature.

The core of our algorithm is the selection among individuals meeting the threshold values for the approximation quality according to feature values. Let $I_1, \ldots, I_k$ be the elements of $P$ and $f(I_i)$ be their feature values. Furthermore, assume that $f(I_i) \in [0,R]$, that is, feature values are non-negative and upper bounded by $R$.

We assume that $f(I_1) \leq f(I_2) \leq \ldots \leq f(I_k)$ holds. The feature-based diversity contribution of an instance $I$ to a population of instances $P$ is denoted by $d(I,P)$ and is based on the other individuals in the population.

Furthermore, an individual with the unique smallest or largest feature value always stays in the population when working with $\mu \geq 2$.
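For illustration, the following R sketch shows one possible way to realize this contribution-based survivor selection. The specific contribution used here (the product of the gaps to the nearest smaller and larger feature values, with infinite contribution for the extreme individuals) is an assumption of the sketch; it is consistent with the property stated above but is not necessarily the exact measure used in the article.

```r
# Illustrative contribution-based survivor selection (assumed contribution:
# product of the gaps to the nearest smaller and larger feature value; extreme
# individuals receive infinite contribution and are therefore never removed).
diversity_contribution <- function(fvals) {
  ord <- order(fvals)
  f <- fvals[ord]
  k <- length(f)
  contrib <- numeric(k)
  contrib[c(1, k)] <- Inf                  # extreme individuals stay in the population
  if (k > 2) {
    for (i in 2:(k - 1)) {
      contrib[i] <- (f[i] - f[i - 1]) * (f[i + 1] - f[i])
    }
  }
  contrib[order(ord)]                      # map back to the original ordering
}

# Remove instances one by one until mu of them remain.
reduce_population <- function(fvals, mu) {
  keep <- seq_along(fvals)
  while (length(keep) > mu) {
    contrib <- diversity_contribution(fvals[keep])
    keep <- keep[-which.min(contrib)]
  }
  keep
}
```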

In Mersmann et al. (2013), 47 features of TSP instances for characterizing easy and hard TSP instances have been studied. We consider 7 features coming from different feature classes which have been shown to be well suited for classification and prediction. These features are:

- *angle_mean*: mean value of the angles made by each point with its two nearest neighbor points
- *centroid_mean_distance_to_centroid*: mean value of the distances from the points to the centroid
- *chull_area*: area covered by the convex hull
- *cluster_10pct_mean_distance_to_centroid*: mean value of the distances to cluster centroids at 10% levels of reachability
- *mst_depth_mean*: mean depth of the minimum spanning tree
- *nnds_mean*: mean distance between nearest neighbours
- *mst_dists_mean*: mean distance of the minimum spanning tree

We refer the reader to Mersmann et al. (2013) for a detailed explanation for each feature. We carry out our diversity optimization approach for these features and use the evolutionary algorithm to evolve for each feature a diverse population of instances that meets the approximation criteria for hard/easy instances given by the approximation ratio thresholds.

All programs in our experiments are written in R and run in the R environment (R Core Team, 2015). We use the functions in the tspmeta package (Mersmann et al., 2013) to compute the feature values.
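A minimal sketch of the feature computation is given below, assuming the `tsp_instance()` and `features()` interface of the tspmeta package; please consult the package documentation for the exact function names and the names of the returned features.

```r
# Sketch of the feature computation with tspmeta (interface assumed; check the
# package documentation for exact names).
library(tspmeta)

coords <- matrix(runif(2 * 50), ncol = 2)  # 50 cities in [0,1]^2
inst   <- tsp_instance(coords)
feats  <- features(inst)
str(feats)                                 # inspect the available feature names
```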

The setting of the evolutionary algorithm for diversity optimization used in our experiments is as follows. We use $\mu = 30$ and $\lambda = 5$ for the parent and offspring population size, respectively. The 2-OPT algorithm is executed on each instance $I$ five times with different initial solutions and we set $A(I)$ to the average tour length obtained. The examined instance sizes $n$ are 25, 50, and 100, where the size denotes the number of cities in one instance. Based on previous investigations in Mersmann et al. (2013) and initial experimental investigations, we set $\alpha_e = 1$ for instances of size 25 and 50, and $\alpha_e = 1.03$ for instances of size 100. For evolving hard instances, we use $\alpha_h = 1.15, 1.18, 1.2$ for instances of size $n = 25, 50, 100$, respectively. The mutation operator picks in each step one city of the given parent uniformly at random and changes its $x$- and $y$-coordinates by adding an offset chosen according to a normal distribution with standard deviation $\sigma$. Coordinates that fall outside the interval are reset to the value of the parent. Based on initial experiments, we use two mutation operators with different values of $\sigma$: $\sigma = 0.025$ is used with probability 0.9 and $\sigma = 0.05$ with probability 0.1 in a mutation step. The evolutionary algorithm terminates after 10,000 generations, which allows it to obtain a good diversity for the considered features. For each $n = 25, 50, 100$ and each of the 7 features, a set of easy and a set of hard instances are generated, which results in 42 independent runs of the $(\mu+\lambda)$-EA$_D$.
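The mutation operator described above can be sketched in R as follows (illustrative code; `coords` denotes the coordinate matrix of the parent instance).

```r
# Sketch of the mutation operator: pick one city uniformly at random, add a
# normal offset to its coordinates (sigma = 0.025 with probability 0.9,
# sigma = 0.05 otherwise), and reset coordinates leaving [0,1] to the parent's
# values.
mutate_instance <- function(coords) {
  child <- coords
  i     <- sample(nrow(coords), 1)
  sigma <- if (runif(1) < 0.9) 0.025 else 0.05
  child[i, ] <- coords[i, ] + rnorm(2, mean = 0, sd = sigma)
  out_of_range <- child[i, ] < 0 | child[i, ] > 1
  child[i, out_of_range] <- coords[i, out_of_range]
  child
}
```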

## 4 Range of Feature Values

We first evaluate our diversity optimization approach in terms of the diversity that is obtained with respect to a single feature. Focusing on a single feature in each run provides the insight of the possible range of a certain feature value for hard or easy instances. The previous study (Mersmann et al., 2013) suggests that there are some differences in the possible range of feature values for easy and hard instances. We study the effect of the diversity optimization on the range of features by comparing the instances generated by diversity optimization to the instances generated by the conventional approach in (Mersmann et al., 2013). Evolving hard instances based on the conventional evolutionary algorithm, the obtained instances have mean approximation ratios of 1.12 for $n=25$, 1.16 for $n=50$, and 1.18 for $n=100$. For easy instances, the mean approximation ratios are 1 for $n=25,50$, and 1.03 for $n=100$.

Figure 1 (left) presents the variation of the mean distance between the points and the centroid (*centroid_mean_distance_to_centroid*) for hard and easy instances of the three considered sizes 25, 50, and 100. Each set consists of 100 instances generated by independent runs (Mersmann et al., 2013). As shown in Figure 1 (left), the feature values of easy and hard instances are in different ranges. For example, for the hard instances of size 100 the median value (indicated by the red line) is 0.4157 while it is only 0.4032 for the easy instances. The respective range of the feature value is 0.0577 for the hard instances and 0.0645 for the easy instances. For the instances generated by diversity optimization (easy and hard instances are indicated by e (b) and h (b), respectively), there is a difference in the median feature values for the hard and easy instances similar to the instances generated by the conventional approach. Additionally, the range of the feature values for both the hard and easy instances has significantly increased. For example, for the instance size 100, the median value for easy instances is 0.4028 and the range is 0.2382. For the hard instances of the same size, the median is 0.4157 while the range is 0.1917 (see Figure 1, left).

Similarly, Figure 1 (right) presents the variation of the cluster 10% mean distance to centroid feature (*cluster_10pct_mean_distance_to_centroid*) for the hard and easy instances generated by the conventional approach (indicated by e (a) and h (a)) and for the hard and easy instances generated by diversity optimization (indicated by e (b) and h (b)). The general observations from these box plots are quite similar to those from the feature shown in Figure 1 (left). For the easy instances of size 100, the range of the feature value is 0.0919 for conventional instances and 0.3471 for the instances generated by diversity optimization. Similarly, for the hard instances the range of the feature values has increased from 0.0577 to 0.1776 with the diversity optimization approach. As shown in Figure 1 (right), there is a significant increase in the range for other instance sizes as well. Improved ranges in feature values are observed for all considered features.

Table 1 includes some statistics showing the different ranges of the feature values of the instances found by the conventional approach and by the diversity maximization EA. Since the conventional approach uses elitism to keep the best solution in each run, the data are gathered from 50 independent runs of the conventional approach that generate TSP instances of acceptable quality for a certain problem size. The hard and easy instances here are classified with the same approximation ratio bounds discussed previously in this section. The data for the feature-based diversity optimization approach are obtained from a single run with population size 50. In the table, three different features are selected from different categories to show the difference in range. From the table, it is clear that the diversity optimization approach generates TSP instances with feature values in a wider range than the conventional approach.

The above results suggest that the diversity optimization approach results in a significant increase in the coverage of the feature space. Through the thresholds on the approximation ratio ($\alpha_e$ and $\alpha_h$), our method guarantees the hardness of the instances. These approximation thresholds are more extreme than the mean approximation values obtained by the conventional method. Furthermore, starting with an initial population of duplicated instances and a hard-coded threshold, the modified $(\mu+\lambda)$-EA is able to obtain hard instances with approximation ratios of 1.347, 1.493, and 1.259 for instance sizes 25, 50, and 100, respectively. The majority of the instances are clustered in a small region of the feature space while some other points are dispersed across the whole space. This is evident from the median values, which are similar to those of the instances from the conventional approach, in combination with the significantly larger range of feature values. The conventional approach fails to explore certain regions of the feature space and misses the instances located in those regions. By discovering instances spread across the whole feature space, our approach provides a strong basis for more effective feature-based prediction.

As a result of the increased ranges and the similar gap in median feature values for hard and easy instances compared with the conventional instances, there is a strong overlap in the ranges of the features for easy and hard instances generated by the diversity optimization. This is observed in the results for *mst_dists_mean* and *cluster_10pct_mean_distance_to_centroid* shown in Figure 1. A similar pattern holds for the other features as well. This prevents a good classification of problem instances based on a single feature value.

Table 1: Feature-value statistics (min, max, avg, std) of easy and hard instances generated by the conventional approach and by feature-based diversity optimization (FB-div) for the features *centroid_mean_distance_to_centroid* (cmdc), *distance_mean* (dm), and *nnd_mean* (nnd).

| n | instances | algorithm | cmdc min | cmdc max | cmdc avg | cmdc std | dm min | dm max | dm avg | dm std | nnd min | nnd max | nnd avg | nnd std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25 | easy | conventional | 0.4603 | 0.5708 | 0.5303 | 0.0214 | 0.6125 | 0.7613 | 0.7078 | 0.0276 | 0.1923 | 0.4053 | 0.2881 | 0.0497 |
| 25 | easy | FB-div | 0.0788 | 0.6812 | 0.4570 | 0.0810 | 0.1204 | 0.8570 | 0.6139 | 0.1025 | 0.0471 | 0.7736 | 0.2884 | 0.0934 |
| 25 | hard | conventional | 0.4117 | 0.4669 | 0.4446 | 0.0116 | 0.5697 | 0.6512 | 0.6126 | 0.0156 | 0.2387 | 0.4186 | 0.3161 | 0.0336 |
| 25 | hard | FB-div | 0.1076 | 0.5571 | 0.4090 | 0.0589 | 0.1641 | 0.7456 | 0.5598 | 0.0768 | 0.1033 | 0.7381 | 0.3359 | 0.0932 |
| 50 | easy | conventional | 0.4176 | 0.5349 | 0.4624 | 0.0287 | 0.5756 | 0.7061 | 0.6311 | 0.0310 | 0.1834 | 0.3990 | 0.3036 | 0.0479 |
| 50 | easy | FB-div | 0.2480 | 0.6456 | 0.4486 | 0.0625 | 0.3565 | 0.8200 | 0.6086 | 0.0720 | 0.1121 | 0.5538 | 0.3037 | 0.0749 |
| 50 | hard | conventional | 0.4115 | 0.5185 | 0.4497 | 0.0197 | 0.5638 | 0.6752 | 0.6067 | 0.0202 | 0.2450 | 0.3852 | 0.3039 | 0.0329 |
| 50 | hard | FB-div | 0.2662 | 0.6039 | 0.4401 | 0.0506 | 0.3802 | 0.7756 | 0.5905 | 0.0611 | 0.1504 | 0.6089 | 0.3174 | 0.0551 |
| 100 | easy | conventional | 0.3736 | 0.4355 | 0.4047 | 0.0137 | 0.5225 | 0.5837 | 0.5540 | 0.0150 | 0.2686 | 0.3905 | 0.3241 | 0.0270 |
| 100 | easy | FB-div | 0.2789 | 0.5171 | 0.4028 | 0.0312 | 0.4027 | 0.6834 | 0.5485 | 0.0377 | 0.1909 | 0.4662 | 0.3301 | 0.0468 |
| 100 | hard | conventional | 0.3954 | 0.4468 | 0.4153 | 0.0096 | 0.5408 | 0.6027 | 0.5658 | 0.0114 | 0.2971 | 0.3539 | 0.3241 | 0.0162 |
| 100 | hard | FB-div | 0.3193 | 0.5110 | 0.4146 | 0.0282 | 0.4436 | 0.6781 | 0.5628 | 0.0348 | 0.2336 | 0.4193 | 0.3254 | 0.0263 |


## 5 Classification Based on Multiple Features

As a single feature is not capable of clearly classifying the hard/easy instances, combinations of two or three different features are examined in the following. Our analysis mainly focuses on combinations of the 7 previously introduced features.

### 5.1 Diversity Maximization over Single Feature Value

First, we represent the instances according to the combination of two different features in the 2-dimensional feature value space (see Figure 3 for an example).

For the instances generated by the conventional approach, *distance_max* and *angle_mean* can be considered together to provide an accurate classification of the hard and easy instances (see Figure 2). However, after increasing the diversity over the seven different feature values, a wider coverage of the 2D space is achieved and the separation of easy and hard instances is no longer as obvious. Comparing with Figures 3 and 4, the clear separation shown in Figure 2 is mainly due to the instances generated by the conventional approach not having good coverage of the feature space. The clusters of dots representing hard and easy instances overlap, as shown in the left graphs of Figure 3, and there are large overlapping areas lying between the two groups of instances. Another example of some separation given by a two-feature combination is *mst_dists_mean* and *chull_area*, which measure the mean distance of the minimum spanning tree and the area of the convex hull. However, as the number of cities in an instance increases, the overlapping area becomes larger, which makes classification based on this combination difficult.

After examining the 21 different combinations of two features out of the seven features, we found that some combinations of two features provide a fair separation between hard and easy instances after increasing the diversity over different feature values. As shown in Figure 3, taking both the *mst_dists_mean* and *chull_area* features into consideration, some separation can be spotted between hard and easy instances. However, most combinations are not able to give a clear classification between hard and easy instances. For example, in Figure 4, neither the combination of *nnds_mean* and *centroid_mean_distance_to_centroid* nor that of *mst_depth_mean* and *chull_area* shows a clear classification between instances of different hardness. Moreover, as the instance size increases, the overlapping area of the dots representing hard and easy instances grows.

Since the majority of two-feature combinations are not capable of classifying easy and hard instances, the idea of combining three different features is put forward. As in the analysis of two-feature combinations, the values of the three selected features are plotted in 3D space.

By considering a third feature in the combination, it becomes clear from the 3D plots that among the 35 different combinations there are some that separate the two groups of 210 instances. A good selection of features results in an accurate classification of the instances. The three-feature combinations including features that measure statistics of the minimum spanning tree always provide a good separation between hard and easy instances, as shown in Figures 5 and 6. Although there is some overlap in the area between the two clusters of hard and easy instances, we can spot areas in the 3D plots that contain only dots for instances of a certain hardness.

By taking another feature value into consideration, a two-feature combination that is not able to provide a good separation can give a clearer classification of hard and easy instances. An example illustrating this is included as Figure 7, where, together with the additional feature *mst_dists_mean*, the two-feature combination of *mst_depth_mean* and *chull_area* shows a clear separation between easy and hard instances compared with the results shown in the left graph of Figure 4.

From the investigation of both the two-feature and three-feature combinations, we found that the range of feature values for larger TSP instances is smaller. Some of the combinations that are good for classifying the hardness of smaller instances may not work for larger instances; for example, in our study the centroid features perform well in combination with another feature in classifying the hardness of instances of 25 cities, but do not show a clear separation for instance sizes 50 and 100. However, there exist some three-feature combinations that give a good classification of easy and hard instances regardless of the instance size, for example *mst_dists_mean*, *chull_area*, and *nnds_mean*, and *mst_dists_mean*, *chull_area*, and *mst_depth_mean*.

### 5.2 Diversity Maximization over Multiple Feature Values

The weighted population diversity is used in Algorithm 1 to gain some insight into the relationship between feature combinations and instance quality. The same parent and offspring population sizes are used for these experiments, namely $\mu = 30$ and $\lambda = 5$. The examined instance sizes are again 25, 50, and 100. The 2-OPT algorithm is executed five times to obtain the approximation quality. As before, the $(\mu+\lambda)$-EA$_D$ runs for 10,000 generations. Since it is shown in Section 5.1 that a combination of three features is able to provide a good separation between hard and easy instances, some of the good three-feature combinations are chosen for exploration. The weight distributions for $\{f_1, f_2, f_3\}$ considered in the experiments are $\{1,1,1\}$, $\{2,1,1\}$, $\{1,2,1\}$, $\{1,1,2\}$, $\{2,2,1\}$, $\{2,1,2\}$, and $\{1,2,2\}$. The same hardness thresholds as before are used in these experiments. After the seven independent runs for easy and hard instances, the final solution sets are put together, so the resulting set has 210 instances for each instance size and hardness type, the same as in the previous experiments. The results are plotted in 3D space and compared with the previous single-feature experiments discussed in Sections 3 and 4.

The weighted population diversity offers a way to examine the overlapping area of hard and easy instances. The weighting technique takes into account the relationship between the different features examined. Since most of these features are not independent of each other and the weighted population diversity considers multiple features at the same time, it is to be expected that the extreme values of each single feature may not be reached.
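Purely as an illustration of how such a weighted measure can be formed, the sketch below combines normalized single-feature contributions (using `diversity_contribution` from the sketch in Section 3) into a weighted sum; the exact weighted population diversity measure used in the experiments is not reproduced here.

```r
# Illustrative weighted multi-feature contribution (assumed form, not the
# article's exact measure): per-feature contributions are normalized by each
# feature's range in the population and combined as a weighted sum.
weighted_contribution <- function(feature_matrix, weights) {
  per_feature <- sapply(seq_along(weights), function(j) {
    f <- feature_matrix[, j]
    contrib <- diversity_contribution(f)   # from the sketch in Section 3
    r <- diff(range(f))
    if (r > 0) contrib / r else contrib
  })
  as.numeric(per_feature %*% weights)      # one value per instance
}
```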

One combination examined is *mst_dists_mean*, *nnds_mean*, and *chull_area*, which has been shown to be a good combination for separating the hard and easy instances. From the comparison between Figure 5 and Figure 8, we can see that although the results from maximizing the weighted population diversity do not cover a wider search space, they provide a detailed insight into the intersection between the hard and easy instances. The 3D plots for the different instance sizes show that the combination of these three features provides a clear separation between hard and easy instances. There are some overlapping areas in the search space, but it is clear that this combination of features provides some hints for predicting whether instances are hard or easy. It is expected that additional features might further increase the separability, since each additional feature contributes an additional constraint to the hard/easy classification. However, due to visualization limitations we consider at most three features for the graphical analysis.

## 6 Instance Classification Using Support Vector Machines

Support vector machines (SVMs) are well-known supervised learning models in machine learning which can be used for classification, regression, and outlier detection (Cortes and Vapnik, 1995; Gunn, 1998). In order to quantify the separation between instances of different hardness based on the feature values, SVM models are constructed for each combination of features.

### 6.1 Linear SVM

A linear classifier is the first model tried for classifying the dataset. In SVMs, the linear classifier that separates the data with maximum margin is termed the optimal separating hyperplane. From the plots in Figures 3, 4, 5, and 6, it is clear that none of the datasets are linearly separable. Taking the trade-off between maximizing the margin and minimizing the number of misclassified data points into consideration, the soft-margin SVM is used for classification.

Let $ACC_n$ be the training accuracy of a feature combination in separating the hard and easy instances of size $n$. We define $ACC_n$ as the ratio of the number of instances correctly classified by the model to the total number of instances in the dataset. All classification experiments are done in R with the library e1071 (Meyer et al., 2015). The training data of the SVM models are the population of 420 instances generated as in Section 3, and the training accuracy is regarded as a quantified measurement of the separation between hard and easy instances. The feature combinations used for classification are the 21 two-feature combinations and 35 three-feature combinations discussed in Section 5.
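A sketch of how such a training accuracy can be computed with the e1071 library is given below; the data frame `train_set`, with the chosen feature columns and a factor column `hardness`, is an assumption of the sketch.

```r
# Sketch of the training-accuracy computation with a linear soft-margin SVM
# from e1071. `train_set` (one row per instance, feature columns plus a factor
# column `hardness` in {easy, hard}) is assumed.
library(e1071)

model <- svm(hardness ~ mst_dists_mean + chull_area, data = train_set,
             kernel = "linear", cost = 1)
acc   <- mean(predict(model, train_set) == train_set$hardness)  # ACC_n
```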

From the experimental results, the values of $ACC_{25}$ for two-feature combinations lie in the range of 0.5095 to 0.7548 with an average accuracy of 0.6672, while $ACC_{25}$ for three-feature combinations lies between 0.6286 and 0.7786 with an average value of 0.7079. For instances with 50 cities, two-feature combinations result in $ACC_{50}$ values in the range of 0.5286 to 0.7738 with an average of 0.6544, while the $ACC_{50}$ values of three-feature combinations range from 0.5381 to 0.85 with an average accuracy of 0.6969. For the larger instance size, $ACC_{100}$ is in the range between 0.5738 and 0.8119 with an average of 0.6986 for two-feature combinations, whereas for three-feature combinations it lies in the range of 0.6238 to 0.8524 with an average of 0.7382.

Although three-feature combinations show better accuracy in separating hard and easy instances than two-feature combinations, there is no significant difference in $ACC$ between two-feature and three-feature combinations. Moreover, the generally low accuracy suggests that linear models are not suitable for separating the hard and easy instances based on most of the feature combinations.

We therefore move on to applying a kernel function for a nonlinear mapping of the feature combinations.

### 6.2 Nonlinear Classification with RBF Kernel

Linearly non-separable data can become linearly separable after being mapped to a higher-dimensional feature space. The Radial Basis Function (RBF) kernel is one of the well-known kernel functions used in SVM classification.

There are two parameters to be selected when applying the RBF kernel: the cost $C$ and $\gamma$. The parameter setting for the RBF kernel is crucial, since increasing $C$ and $\gamma$ leads to an accurate separation of the training data but at the same time causes over-fitting (Cherkassky and Ma, 2004). The SVMs here are generated to quantify the separation between hard and easy instances rather than to construct a model for classifying other instances. After some initial trials, $(C,\gamma)$ is set to $(100, 2)$ in all tests to avoid over-fitting to the selected training set. This parameter setting may not be the best for every feature combination, but it helps us to gain some understanding of the separation of the hard and easy instances generated in the previous experiments under the same conditions.
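Continuing the sketch from Section 6.1, the RBF kernel with this parameter setting would be used as follows (again with the assumed `train_set` data frame).

```r
# Same setup as the previous sketch, now with the RBF kernel and the fixed
# parameters (C, gamma) = (100, 2).
model_rbf <- svm(hardness ~ mst_dists_mean + chull_area + nnds_mean,
                 data = train_set, kernel = "radial", cost = 100, gamma = 2)
acc_rbf   <- mean(predict(model_rbf, train_set) == train_set$hardness)
```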

Tables 2 and 3 show the accuracy of the different two-feature and three-feature combinations in separating hard and easy instances. With the RBF kernel, an SVM with this parameter setting can generate a model separating the dataset in the 2D feature space with average accuracies of 0.8170, 0.8244, and 0.8346 for instance sizes 25, 50, and 100, respectively. With three features, the SVM with the same parameter setting provides a separation with average accuracies of 0.9503, 0.9584, and 0.9422 for instance sizes 25, 50, and 100, respectively.

From these results, it can be concluded that there is a better separation between hard and easy instances in the 3D feature space.

## 7 Diversity Optimization for Instance Hardness

In the experiments of the previous sections, the main focus of diversity maximization is the feature values, while the quality of the solutions is guaranteed by a predefined threshold. In order to gain insight into the relationship between feature values and problem hardness, we conduct another experiment in which the population diversity is optimized with respect to instance hardness. The experiment is again based on the TSP and Algorithm 1. In this case, the approximation ratio is taken as the feature value, and maximizing the diversity over the approximation ratio results in a set of individuals of different quality. By doing this, we obtain knowledge about the relationship between feature values and instance hardness from another point of view.

In this case, the algorithm needs to be modified in the solution quality check phase. The evolutionary algorithm without a hard-coded solution quality threshold is shown in Algorithm 2. The population diversity measurement $d_{ar}(I,P)$ in step 4 follows the formulation proposed in Section 3 with the approximation ratio as the target feature value. In the survivor selection phase, the individual that contributes least to the population diversity over the approximation ratio is removed.
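As a sketch, the survivor selection of Algorithm 2 can reuse the contribution-based reduction from the sketch in Section 3, with the approximation ratios taking the role of the feature values; `ratios` (the approximation ratio of every instance in the current population) is assumed to be given.

```r
# Sketch: reuse the contribution-based reduction from Section 3 with the
# approximation ratio as the target value; `ratios` is assumed, and 400 is the
# parent population size used in this experiment.
survivors <- reduce_population(ratios, mu = 400)
```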

Table 2: Training accuracy of the SVM with RBF kernel in separating hard and easy instances for the two-feature combinations.

| Feature 1 | Feature 2 | $ACC_{25}$ | $ACC_{50}$ | $ACC_{100}$ |
|---|---|---|---|---|
| angle_mean | centroid_mean_distance_to_centroid | 0.8476 | 0.8071 | 0.8071 |
| angle_mean | chull_area | 0.7857 | 0.7810 | 0.7929 |
| angle_mean | cluster_10pct_mean_distance_to_centroid | 0.7810 | 0.7786 | 0.8000 |
| angle_mean | mst_depth_mean | 0.7524 | 0.7381 | 0.8000 |
| angle_mean | nnds_mean | 0.8167 | 0.8833 | 0.8452 |
| angle_mean | mst_dists_mean | 0.8119 | 0.8024 | 0.8405 |
| centroid_mean_distance_to_centroid | chull_area | 0.8619 | 0.7667 | 0.8381 |
| centroid_mean_distance_to_centroid | cluster_10pct_mean_distance_to_centroid | 0.8524 | 0.8357 | 0.7548 |
| centroid_mean_distance_to_centroid | mst_depth_mean | 0.8381 | 0.7643 | 0.8095 |
| centroid_mean_distance_to_centroid | nnds_mean | 0.8786 | 0.9524 | 0.8476 |
| centroid_mean_distance_to_centroid | mst_dists_mean | 0.8905 | 0.8571 | 0.8762 |
| chull_area | cluster_10pct_mean_distance_to_centroid | 0.8000 | 0.7881 | 0.8548 |
| chull_area | mst_depth_mean | 0.7429 | 0.7429 | 0.7571 |
| chull_area | nnds_mean | 0.8071 | 0.8905 | 0.8452 |
| chull_area | mst_dists_mean | 0.8619 | 0.8643 | 0.9024 |
| cluster_10pct_mean_distance_to_centroid | mst_depth_mean | 0.7619 | 0.7714 | 0.7929 |
| cluster_10pct_mean_distance_to_centroid | nnds_mean | 0.8190 | 0.8833 | 0.8643 |
| cluster_10pct_mean_distance_to_centroid | mst_dists_mean | 0.8095 | 0.8095 | 0.8738 |
| mst_depth_mean | nnds_mean | 0.7786 | 0.8595 | 0.8405 |
| mst_depth_mean | mst_dists_mean | 0.8095 | 0.8214 | 0.8810 |
| nnds_mean | mst_dists_mean | 0.8500 | 0.9143 | 0.9024 |


To achieve reasonable coverage of the whole space, the parent and offspring population sizes are set to 400 and 10, respectively, in this experiment. All other parameters are kept the same as in the previous experiments. The program is then run for 30,000 generations to obtain stable results.

Table 3: Training accuracy of the SVM with RBF kernel in separating hard and easy instances for the three-feature combinations (*cluster* abbreviates *cluster_10pct_mean_distance_to_centroid*).

| Feature 1 | Feature 2 | Feature 3 | $ACC_{25}$ | $ACC_{50}$ | $ACC_{100}$ |
|---|---|---|---|---|---|
| angle_mean | centroid_mean_distance_to_centroid | chull_area | 0.9500 | 0.9190 | 0.9452 |
| angle_mean | centroid_mean_distance_to_centroid | cluster | 0.9405 | 0.9357 | 0.8214 |
| angle_mean | centroid_mean_distance_to_centroid | mst_depth_mean | 0.9548 | 0.9548 | 0.9214 |
| angle_mean | centroid_mean_distance_to_centroid | nnds_mean | 0.9452 | 0.9952 | 0.9833 |
| angle_mean | centroid_mean_distance_to_centroid | mst_dists_mean | 0.9571 | 0.9500 | 0.9524 |
| angle_mean | chull_area | cluster | 0.9524 | 0.9310 | 0.8881 |
| angle_mean | chull_area | mst_depth_mean | 0.9357 | 0.9238 | 0.9500 |
| angle_mean | chull_area | nnds_mean | 0.9405 | 0.9714 | 0.9571 |
| angle_mean | chull_area | mst_dists_mean | 0.9667 | 0.9619 | 0.9143 |
| angle_mean | cluster | mst_depth_mean | 0.9214 | 0.9143 | 0.9810 |
| angle_mean | cluster | nnds_mean | 0.9476 | 0.9690 | 0.9333 |
| angle_mean | cluster | mst_dists_mean | 0.9571 | 0.9143 | 0.9405 |
| angle_mean | mst_depth_mean | nnds_mean | 0.9310 | 0.9762 | 0.9238 |
| angle_mean | mst_depth_mean | mst_dists_mean | 0.9476 | 0.9262 | 0.9476 |
| angle_mean | nnds_mean | mst_dists_mean | 0.9429 | 0.9762 | 0.8833 |
| centroid_mean_distance_to_centroid | chull_area | cluster | 0.9476 | 0.9333 | 0.9310 |
| centroid_mean_distance_to_centroid | chull_area | mst_depth_mean | 0.9595 | 0.8762 | 0.9762 |
| centroid_mean_distance_to_centroid | chull_area | nnds_mean | 0.9667 | 0.9881 | 0.9929 |
| centroid_mean_distance_to_centroid | chull_area | mst_dists_mean | 0.9714 | 0.9714 | 0.8381 |
| centroid_mean_distance_to_centroid | cluster | mst_depth_mean | 0.9476 | 0.9286 | 0.8571 |
| centroid_mean_distance_to_centroid | cluster | nnds_mean | 0.9643 | 0.9905 | 0.8810 |
| centroid_mean_distance_to_centroid | cluster | mst_dists_mean | 0.9500 | 0.9595 | 0.9190 |
| centroid_mean_distance_to_centroid | mst_depth_mean | nnds_mean | 0.9500 | 0.9881 | 0.9595 |
| centroid_mean_distance_to_centroid | mst_depth_mean | mst_dists_mean | 0.9548 | 0.9548 | 0.9595 |
| centroid_mean_distance_to_centroid | nnds_mean | mst_dists_mean | 0.9667 | 1.0000 | 0.9952 |
| chull_area | cluster | mst_depth_mean | 0.9286 | 0.9524 | 0.9333 |
| chull_area | cluster | nnds_mean | 0.9524 | 0.9667 | 0.9667 |
| chull_area | cluster | mst_dists_mean | 0.9595 | 0.9595 | 0.9929 |
| chull_area | mst_depth_mean | nnds_mean | 0.9381 | 0.9857 | 0.9476 |
| chull_area | mst_depth_mean | mst_dists_mean | 0.9476 | 0.9738 | 0.9833 |
| chull_area | nnds_mean | mst_dists_mean | 0.9714 | 0.9857 | 0.9667 |
| cluster | mst_depth_mean | nnds_mean | 0.9214 | 0.9857 | 0.9738 |
| cluster | mst_depth_mean | mst_dists_mean | 0.9500 | 0.9476 | 0.9643 |
| cluster | nnds_mean | mst_dists_mean | 0.9643 | 0.9833 | 0.9976 |
| mst_depth_mean | nnds_mean | mst_dists_mean | 0.9429 | 0.9929 | 0.9929 |


The plots in Figure 9 present the resulting instances in the space of the features *chull_area*, *mst_dists_mean*, and *mst_depth_mean*. There is a clear separation between the red dots and the yellow dots in the figure for the different problem sizes; the red and yellow dots refer to the instances with extreme approximation ratios in each case. For larger instances, there is no clear cluster of instances of different hardness, but we can still find areas where hard or easy instances gather. On the other hand, for problem instances with approximation ratios lying in the ranges of 1.1 to 1.2, 1.2 to 1.4, and 1.15 to 1.2, which are represented by orange dots in the plots, the locations are hard to classify.

The plots in Figure 10 show the relationship of problem hardness with the feature combination *chull_area*, *angle_mean*, and *cluster_10pct_mean_distance_to_centroid*. Similar observations can be made. The separation for problem size 25 is clear. However, when the number of cities in each instance increases, the overlapping area becomes larger.

Algorithm 2 allows us to generate a set of solutions of different hardness. In this experiment, the approximation ratio reaches 1.434, 1.739, and 1.430 for problem sizes 25, 50, and 100, respectively, which implies that the populations cover an even wider range than those from the previous experiments.

## 8 Conclusions

In this article, we have introduced a new methodology of evolving easy/hard instances which are diverse with respect to feature sets of the optimization problem at hand. Using our diversity optimization approach, we have shown that the easy and hard instances obtained cover a much wider range of the feature space than those of previous methods. The proposed population diversity measurements provide a good evaluation of the diversity over single or multiple feature values. Our experimental investigations for 2-OPT on the TSP provide evidence that the large set of diverse instances can be classified quite well into easy and hard instances when a suitable combination of multiple features is considered, which provides some guidance for prediction as a next step. In particular, the SVM classification model built from the diverse instances, which classifies TSP instances based on problem hardness, provides a strong basis for future performance prediction models leading to automatic algorithm selection and configuration. Building such models would require further experimentation to determine the minimal set of strong features that can predict performance accurately.

## Acknowledgments

This research has been supported by ARC grants DP140103400 and DP190103894.