## Abstract

The fitness landscape of the travelling salesman problem is investigated for 11 different types of the problem. The types differ in how the distances between cities are generated. Many different properties of the landscape are studied. The properties chosen are all potentially relevant to choosing an appropriate search algorithm. The analysis includes a scaling study of the time to reach a local optimum, the number of local optima, the expected probability of reaching a local optimum as a function of its fitness, the expected fitness found by local search and the best fitness, the probability of reaching a global optimum, the distance between the local optima and the global optimum, the expected fitness as a function of the distance from an optimum, their basins of attraction and a principal component analysis of the local optima. The principal component analysis shows the correlation of the local optima in the component space. We show how the properties of the principal components of the local optima change from one problem type to another.

## 1 Introduction

The travelling salesman problem (TSP) is probably the most famous and best studied combinatorial optimisation problem. In this paper we describe a detailed analysis of the properties of the fitness landscape for 11 different types of problems, ranging from random problems to those motivated by real-world constraints. The problems are artificial in that they are all generated by an instance generator; however, many of them share the constraints that would be expected to arise in real problems. For example, we consider instances that could plausibly occur in a drilling problem, or instances arising from a scheduling problem.

This paper follows two previous papers, where we investigated the fitness landscape properties of the maximum satisfiability (MAX-SAT) problem (Prügel-Bennett and Tayarani, 2011) and graph colouring (Tayarani and Prügel-Bennett, 2015). This current paper expands on those previous papers, not only in looking at a different problem but in studying how properties vary over different types of problems. This is intended to be a self-contained paper and not a comparison with MAX-SAT or graph colouring; however, the previous papers informed the present study about what properties are interesting.

“The relation between the Fourier expansion of a landscape and the computational complexity of the optimization problem on the landscape is of great importance for devising practical optimization heuristics,” Stadler (1996) wrote. “Disappointingly, there seems to be no simple relationship.” We have seen little evidence that it is possible to characterise the complexity of a problem in just a few measures. It is our view that combinatorial optimisation is a tough field requiring an understanding of many different aspects of a problem. This paper examines a large number of properties, as it is our belief that all these properties could be relevant in the design and choice of heuristic search algorithms. Alas, this does not lead to short, punchy papers but requires us to perform large empirical studies. This paper provides an illuminating insight into the structure of the fitness landscape and shows which properties are universal and which depend critically on the details of the instances.

We introduce a new method of analysing the fitness landscape. That is, we use principal component analysis (PCA) to study the structure of the local optima. The motivation for this is that many different local optima often have parts of a tour in common. Thus, the set of local optima does not necessarily span the whole search space but is confined to a low-dimensional subspace. PCA extracts this information. To the best of our knowledge, this is the first time PCA has been used to study the structure of the fitness landscape in this way.

### 1.1 Previous Landscape Analysis

The analysis of the fitness landscape goes back to the seminal work of Sewall Wright (1932) and has attracted considerable attention. Examples of measures that have attempted to capture the ruggedness of the landscape include the autocorrelation (Weinberger, 1990; Angel and Zissimopoulos, 1998) and fitness distance correlation (Jones and Forrest, 1995; Merz and Freisleben, 2000). But it was soon realised that these measures do not necessarily capture the hardness of the problems, when a counterexample was proposed: an easy problem that shows no relation between the fitness of the solutions and their distance to the global optimum (Altenberg, 1997). The question of what makes a fitness landscape hard continues (Forrest and Mitchell, 1993; Jones and Forrest, 1995), and new methods of capturing the hardness of problems have been proposed, such as algebraic properties of the solutions in the landscape (Grover, 1992; Stadler, 1995), modality (the number of local optima) (Horn and Goldberg, 1995), and the fractal dimension of the fitness landscape (Hoshino et al., 1998). Some researchers try to explain when a problem becomes hard by studying the area in the landscape called the Olympus, in which the better local optima are located (Verel et al., 2007, 2008). The fitness cloud is another method proposed to visualise the fitness landscape, representing properties of the landscape that reflect problem hardness (Collard et al., 2007; Lu et al., 2011; Vanneschi et al., 2007). Problems can be difficult for many different reasons. The aim of this paper and its sister papers (Prügel-Bennett and Tayarani, 2011; Tayarani and Prügel-Bennett, 2015) is to understand why real-world problems, exemplified by some of the classic combinatorial optimisation problems, become hard. We have not sought a single measure of hardness (we believe there can be many reasons for problems to be hard) but rather attempted to provide a full picture of the fitness landscape.

Explaining why and how problems become hard has not been the only aim of landscape analysis. Some researchers use landscape analysis to study parameters of evolutionary algorithms, such as the population size (Alander, 1999), operators like mutation and crossover (Mathias and Whitley, 1992; Suzuki and Iwasa, 1997), recombination operators (Hornby, 1996), or the perturbation operator (Martin et al., 1999). Another line of research examines the role that the representation of solutions plays in the landscape properties, seeking the best representation for each problem (Tavares et al., 2006; Riley and Ciesielski, 2010). Some researchers have used landscape analysis to explain why certain algorithms, like local search algorithms (Fonlupt et al., 1997), memetic algorithms (hybrid genetic algorithms) (Merz, 2004), or metaheuristics based on local search (Watson, 2010), work better on particular landscapes. Other works have tried to categorise problems based on their landscape properties (Merz and Freisleben, 2000).

Several papers try to exploit the concept of the landscape and landscape analysis in developing new sets of algorithms for different problems. The fitness landscape analysis is used to propose memetic algorithms for graph bi-partitioning (Merz and Freisleben, 1998), resource allocation (Huang et al., 2009), and maximum satisfiability (Zhang et al., 2003; Zhang, 2004), or improving the performance of evolutionary algorithms by landscape approximation (Ratle, 1998; Pošík and Franc, 2007; Shen and He, 2010). In a more recent work, landscape analysis is used to propose a new population-based algorithm (Qasem and Prügel-Bennett, 2010). The landscapes of many other problems have been studied, including the landscape of graph drawing (Lehn and Kuntz, 2001), graph colouring (Hertz et al., 1994; Bouziri et al., 2011; Tayarani and Prügel-Bennett, 2015), antenna design (Alander et al., 2002), maximum satisfiability (Prügel-Bennett and Tayarani, 2011), flow-shop scheduling (Czogalla and Fink, 2012), and Bayesian network structure (Wu et al., 2011).

The landscape of the TSP has also been studied in a number of different papers. In the first major effort at understanding the landscape of TSP, a big valley structure was observed (Stadler and Schnabl, 1992). A similar big valley structure was subsequently observed in other TSP problem instances (Boese, 1995), in graph partitioning (Wiles and Tonkes, 2006), and in graph bisection (Boese et al., 1994). The most recent work on the big valley hypothesis studies different problems of different sizes using the Concorde algorithm (Hains et al., 2011). Our paper expands on this previous work in two directions. First, we examine considerably more properties, particularly those related to the structure of the local minima. Second, we study how these properties differ depending on the structure of the distance matrix.

This paper is organised as follows. In the next section we introduce the travelling salesman problem and describe the local search algorithm we use to find the optima in the landscape. Section 3 describes some statistical properties of random solutions, autocorrelation, and some properties of global and local optima, including the number of steps a local search algorithm takes to get to a local optimum, number of local and global optima, and distance between the optima. We also examine the expected fitness of configurations in Hamming spheres of different radii from a local optimum. We also consider the probability of returning to a local optimum starting from a randomly chosen configuration in the Hamming sphere. Principal component analysis of the local optima is presented in Section 4. Sections 5 and 6 conclude the paper.

## 2 Travelling Salesman Problem

In the travelling salesman problem one is given a list of cities and the distances between every pair of them. The task is to find the shortest tour that visits each city exactly once and then returns to the start. Despite its deceptively simple description, the problem is surprisingly difficult to solve. In general it is NP-hard, although for some problem types such as Euclidean TSP, there is a polynomial time approximation scheme (Arora, 1998). In this paper, we investigate 11 different types of instances, many inspired by real-world problems. Our chosen categories have been studied in other papers (Cirasella et al., 2001; Rardin et al., 1993). The problems differ in how the distance matrix is generated.

### 2.1 Problem Types

Let $d_{ij}$ denote the distance between city *i* and city *j*, and let $T_{ij} \in \{0, 1\}$ be an indicator function that equals 1 if and only if the tour visits city *j* immediately after city *i*, so that the length of a tour is $c = \sum_{i,j} d_{ij}\, T_{ij}$. For a legal tour we require that the edges selected by $T_{ij}$ form a single cycle through all the cities. We consider 11 different problem types, which differ in the construction of the distance matrix **M**. The only common assumption is that the distances, $d_{ij}$, are not negative.

#### 2.1.1 Random Asymmetric Matrices

In this set of problems, each off-diagonal element of the distance matrix is drawn independently and uniformly at random, so that in general $d_{ij} \neq d_{ji}$. We refer to this set of problems as the *Random* problems.

#### 2.1.2 Random Asymmetric Matrices Closed under Shortest Paths

A matrix **M** is closed under shortest paths if and only if

$$ d_{ij} \leq d_{ik} + d_{kj} \quad \text{for all } i, j, k. $$

That is, the distances satisfy the triangle inequality. Closing a random matrix can easily be performed by applying the following operator to all the values in the matrix,

$$ d_{ij} \leftarrow \min_{k} \left( d_{ij},\; d_{ik} + d_{kj} \right). $$

This process is applied until all the values in the matrix satisfy the triangle inequality. We refer to this set of problems as the *Random C* problems.
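As a concrete sketch, the closure operator above can be implemented as a single Floyd–Warshall sweep, after which every triple satisfies the triangle inequality. The uniform-integer generator below is an illustrative stand-in for the paper's instance generator, whose exact distance range is not specified here.

```python
import random

def close_under_shortest_paths(d):
    """Apply d[i][j] <- min(d[i][j], d[i][k] + d[k][j]) for all k
    (a Floyd-Warshall sweep), after which the triangle inequality
    holds for every triple of cities."""
    n = len(d)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def random_asymmetric(n, seed=0):
    """Illustrative Random-type instance: independent uniform distances."""
    rng = random.Random(seed)
    return [[0 if i == j else rng.randint(1, 1000) for j in range(n)]
            for i in range(n)]

d = close_under_shortest_paths(random_asymmetric(30))
# After closure every triple satisfies the triangle inequality.
assert all(d[i][j] <= d[i][k] + d[k][j]
           for i in range(30) for j in range(30) for k in range(30))
```

Since Floyd–Warshall computes all-pairs shortest paths, one sweep suffices; repeatedly applying the min-operator, as described above, converges to the same closed matrix.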

#### 2.1.3 Random Symmetric Matrices

In this set of problems, the distance matrix is generated as for the *Random* problems, but the matrix is made symmetric so that $d_{ij} = d_{ji}$. We refer to this set of problems as the *Random S* problems.

#### 2.1.4 Random Symmetric Matrices Closed under Shortest Paths

In this set of problems, the matrices are both closed and symmetric. We refer to this set of problems as the *Random SC* problems. In this case, the distances between cities form a proper metric distance.

#### 2.1.5 Random Two-Dimensional Rectilinear Problems

In this set of problems, the cities are placed uniformly at random in a square, and the distance between two cities is the rectilinear (Manhattan) distance, $d_{ij} = |x_i - x_j| + |y_i - y_j|$. We refer to this set of problems as the *Rectilinear* problems.

#### 2.1.6 Tilted Drilling Machine Instances with Additive Norm

These instances model a tilted drilling machine in which the drill is moved by two motors: the first motor moves the drill in the *x*-coordinate, and the second motor moves it in the *y*-coordinate. Because of the drill weight, the second motor needs less energy when moving the drill down than when moving it up. The problem generator places the holes uniformly in a square and has three parameters: *u* is a coefficient that shows how much energy the first motor needs to move the drill one unit in the *x* direction; $v_d$ is a coefficient that shows the energy needed to move the drill one unit down; and $v_u$ is the energy needed when moving upward. The problem matrix is generated according to

$$ d_{ij} = u\,|x_i - x_j| + v_d \max(y_i - y_j,\, 0) + v_u \max(y_j - y_i,\, 0). $$

We consider *u* = 1, with fixed values of $v_d$ and $v_u$. We refer to this set of problems as the *Additive Drilling* problems.

#### 2.1.7 Tilted Drilling Machine Instances with Sup Norm

These instances are like the previous type except that the energy needed to move the drill between two holes is the maximum of the energies needed in the *x* and *y* directions rather than their sum. For this problem type the holes are placed just as in the previous problem, but the distances are specified as follows,

$$ d_{ij} = \max\big( u\,|x_i - x_j|,\; v_d \max(y_i - y_j,\, 0) + v_u \max(y_j - y_i,\, 0) \big). $$

For this problem type we consider *u* = 2, with the two vertical coefficients set to 4 and 1. The points are randomly and uniformly placed in a square and take integer values. This problem type is referred to as the *Sup Drilling* problems.
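The two drilling distance functions can be sketched as follows. The direction convention (a decreasing *y* means moving down) and the vertical coefficient values used below are illustrative assumptions, not the paper's exact settings.

```python
import random

def drilling_distance(p, q, u, v_down, v_up, norm):
    """Energy to move the drill from hole p to hole q.  Horizontal cost
    is u per unit in x; vertical cost differs for downward (v_down) and
    upward (v_up) moves.  norm is 'additive' or 'sup'."""
    (x1, y1), (x2, y2) = p, q
    horizontal = u * abs(x2 - x1)
    # Exactly one of the two vertical terms is nonzero for any move.
    vertical = v_down * max(y1 - y2, 0) + v_up * max(y2 - y1, 0)
    return horizontal + vertical if norm == 'additive' else max(horizontal, vertical)

rng = random.Random(1)
holes = [(rng.randint(0, 100), rng.randint(0, 100)) for _ in range(20)]
# u follows the text (1 for additive, 2 for sup); the vertical
# coefficients 1 and 4 are illustrative placeholders.
D_add = [[drilling_distance(p, q, 1, 1, 4, 'additive') for q in holes] for p in holes]
D_sup = [[drilling_distance(p, q, 2, 1, 4, 'sup') for q in holes] for p in holes]
assert all(D_add[i][i] == 0 for i in range(20))
```

Because the up and down coefficients differ, both matrices are asymmetric, which is what distinguishes these instances from the Euclidean-like types.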

#### 2.1.8 Random Euclidean Stacker Crane Problem

In these problem instances a crane has to move *n* objects. Each object is at a (source) location $s_i$ and has to be moved to a destination location $t_i$. The problem is to choose the order of the tasks so as to minimise the Euclidean distance between the destination location of one task and the source location of the next. That is, we take the distance to be $d_{ij} = \| t_i - s_j \|$, where $\| \cdot \|$ denotes the Euclidean distance. The sources are uniformly picked from a square, while the destinations are chosen to be close to the sources. Here *n* is the number of source-destination pairs (the problem size). We denote this problem by *Crane*. In Figure 1 we show an example of a possible tour made by the crane.
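A minimal generator for such crane instances might look like this; the side of the square and the offset scale used to keep destinations near their sources are illustrative assumptions.

```python
import math
import random

def crane_matrix(n, side=100.0, spread=10.0, seed=0):
    """Distance matrix for stacker crane instances: d[i][j] is the
    Euclidean distance from the destination of task i to the source of
    task j.  Sources are uniform in a side x side square; destinations
    are offset from their sources by at most `spread` in each coordinate
    (the offset scale is an assumption)."""
    rng = random.Random(seed)
    src = [(rng.uniform(0, side), rng.uniform(0, side)) for _ in range(n)]
    dst = [(x + rng.uniform(-spread, spread), y + rng.uniform(-spread, spread))
           for x, y in src]
    return [[math.dist(dst[i], src[j]) for j in range(n)] for i in range(n)]

D = crane_matrix(15)
# The matrix is asymmetric: dest_i -> src_j differs from dest_j -> src_i.
assert any(abs(D[i][j] - D[j][i]) > 1e-9 for i in range(15) for j in range(15))
```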

#### 2.1.9 Disk Drive Problem

These instances are generated by an idealised model of the scheduling job of the read head of a computer disk. The task is to extract *n* records from a disk. This problem is similar to the stacker crane problem in that the files to be read have a start position and an end position in their tracks. The position is represented in polar coordinates, with the first coordinate representing the radius and the second coordinate the angle. The destination is at the same radius as the source. This is illustrated in Figure 2, which shows an example of five records stored on a disk drive. The read head has to move from the destination of the current record to the source of the next record. We assume that changing the position of the read head (its radius) is 10 times slower than disk rotation.

Although this is clearly idealised (for example, no account is made of the acceleration and deceleration times of the read head), it nevertheless captures much of the structure that a real disk problem would have. We refer to this problem as *Disk Drive*.

#### 2.1.10 Euclidean Problem

This is the well-known Euclidean TSP, the description of which can be found in Cirasella et al. (2001). To generate instances, we uniformly and randomly select the cities in a by square, and the distance between the cities is the Euclidean distance between them. We refer to this problem type as *Euclidean*.
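Generating a Euclidean instance and evaluating a tour is straightforward; the helper names below are our own.

```python
import math
import random

def euclidean_instance(n, side=1.0, seed=0):
    """Cities placed uniformly at random in a side x side square."""
    rng = random.Random(seed)
    return [(rng.uniform(0, side), rng.uniform(0, side)) for _ in range(n)]

def tour_cost(cities, tour):
    """Length of the closed tour visiting cities in the given order."""
    n = len(tour)
    return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % n]])
               for i in range(n))

cities = euclidean_instance(50)
identity = list(range(50))
# A tour and its reversal have the same length in a symmetric problem.
assert abs(tour_cost(cities, identity) - tour_cost(cities, identity[::-1])) < 1e-9
```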

#### 2.1.11 No-Wait Job Scheduling

These instances model a no-wait flow-shop scheduling problem. We are given *k* processors and a set of *n* tasks, each of which must use the *k* processors in order, with no waiting time allowed between finishing on one processor and starting on the next. The no-waiting condition would be realistic if, for example, the processing required the material to be kept hot, or alternatively, if there were no storage space between two consecutive tasks. We denote the time taken to complete the *l*th process for the *i*th task as $t_{i,l}$. Thus each task can be represented by a list of subtask times $(t_{i,1}, t_{i,2}, \ldots, t_{i,k})$. What makes the scheduling problem nontrivial is that one can start the next task before the previous task is finished, provided that none of the subtasks have to wait. Thus the distance between tasks *i* and *j* is the minimum delay between starting the two tasks such that no subtask of task *j* has to wait. We illustrate this wait time for two problems in Figure 3. We consider a fixed number of processors, and the times are made by a uniform random number generator between 1 and 1,000.

### 2.2 Local Search Operator

To define the landscape of a problem requires some definition of the neighbourhood of the configurations. The neighbourhood is usually defined in terms of the configurations that can be reached by some local search operator. We consider *k*-opt moves, the move set most commonly used by heuristic algorithms for TSP. A *k*-opt move consists of dividing the tour into *k* segments and then recombining the segments in such a way as to obtain a legal tour. The simplest *k*-opt move is 2-opt. For most of the analysis we carried out, we found 2-opt impractical, for two reasons. First, for tours of moderate size (*n* around 50) we found that the number of local optima was so large that we were unable to store them. Second, the 2-opt move reverses one segment of the tour. For asymmetric problems this would disrupt a large number of edge distances, so 2-opt cannot be viewed as a minimal move for this set of problems. Thus, we focus on 3-opt moves, which both substantially reduce the number of local optima and contain moves that change the tour without necessarily changing the direction in which a segment is traversed. We also considered using 4-opt; however, we found this had such a large neighbourhood that it became computationally very expensive to ensure that a local optimum had been reached.
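One direction-preserving 3-opt reconnection, cutting the tour into segments and exchanging two of them without reversal, can be sketched as:

```python
def three_opt_exchange(tour, i, j, k):
    """One 3-opt reconnection that preserves traversal direction: cut the
    tour into segments A = tour[:i], B = tour[i:j], C = tour[j:k],
    D = tour[k:] and reassemble as A + C + B + D.  Exactly three edges
    change and no segment is reversed; this is one of several possible
    3-opt reconnections."""
    assert 0 < i < j < k <= len(tour)
    return tour[:i] + tour[j:k] + tour[i:j] + tour[k:]

tour = list(range(8))
new = three_opt_exchange(tour, 2, 4, 6)
assert new == [0, 1, 4, 5, 2, 3, 6, 7]
assert sorted(new) == sorted(tour)  # still a legal tour
```

Because no segment is reversed, this move leaves every remaining edge traversed in its original direction, which is why it is a minimal move even for asymmetric problems.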

Note that there is a difference in the interpretation of distance between symmetric and asymmetric problems. In symmetric problems the edges $(i, j)$ and $(j, i)$ are treated as the same, while they are treated as different edges for asymmetric problems. If we swap three bonds in a 3-opt move, this corresponds to making a move of size 3 in Hamming distance for symmetric problems. For asymmetric problems the distance would also depend on whether any of the segments of the tour were reversed. Reversing a segment would on average change a number of edges of order *n* for asymmetric problems.

Although it is convenient to consider *k*-opt neighbourhoods, it is often easier to use the Hamming distance in the space of edges rather than the minimum number of *k*-opt moves needed to go from one configuration to another. That is, we can represent a tour as a binary vector in the space of edges, with 1 denoting the existence of an edge in the tour. For symmetric problems there are $n(n-1)/2$ possible edges, while for asymmetric problems there are $n(n-1)$. Each tour consists of exactly *n* nonzero components. The Hamming distance counts the number of edges that differ between two tours.
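The edge-space Hamming distance can be computed directly from the edge sets of the two tours. Here we count the edges of one tour missing from the other (symmetric in the two tours, since both contain exactly *n* edges); a direction-preserving 3-opt move then has distance 3, as in the text.

```python
def tour_edges(tour, symmetric=True):
    """Edge set of a closed tour; for symmetric problems (a, b) and
    (b, a) are the same edge, so each edge is stored in sorted order."""
    n = len(tour)
    return {tuple(sorted((tour[i], tour[(i + 1) % n]))) if symmetric
            else (tour[i], tour[(i + 1) % n]) for i in range(n)}

def hamming(t1, t2, symmetric=True):
    """Number of edges of t1 absent from t2."""
    return len(tour_edges(t1, symmetric) - tour_edges(t2, symmetric))

t1 = [0, 1, 2, 3, 4, 5]
t2 = [0, 1, 4, 5, 2, 3]      # t1 after one direction-preserving 3-opt move
assert hamming(t1, t2) == 3                          # a 3-opt move has size 3
assert hamming(t1, t1[::-1]) == 0                    # reversal keeps all edges
assert hamming(t1, t1[::-1], symmetric=False) == 6   # ... unless edges are directed
```

The last two assertions illustrate the point made above: reversing a whole tour is free for symmetric problems but changes every one of the *n* directed edges for asymmetric ones.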

### 2.3 Exhaustive Local Search Algorithm

One way of studying the topology of the local optima is to check every solution in the landscape and find every local optimum. This strategy is feasible for problem sizes up to 20 or 30. However, as the search space for TSP grows as the factorial of the system size, it is impossible to do such an experiment on larger problems. In order to study the local optima for larger problem sizes, we use a local search algorithm that exhaustively explores the neighbours of the current solution, and any neighbouring solutions at the same cost, to find an improving move. Tayarani and Prügel-Bennett (2015) give a comprehensive description of this exhaustive local search algorithm, which guarantees that the solution it returns is a local optimum and assigns a unique ID to each local optimum. The algorithm starts from a random configuration and uses the 3-opt local search process to find a better solution until a local optimum is reached. It then returns the unique ID of the local optimum it has found.
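A simplified descent in this spirit can be sketched as follows. It uses best-improvement over direction-preserving segment-exchange moves only; the paper's exhaustive algorithm additionally walks across equal-cost neighbours to escape plateaus and assigns each optimum a unique ID.

```python
import math
import random

def tour_cost(dist, t):
    """Length of the closed tour t under distance matrix dist."""
    return sum(dist[t[i]][t[(i + 1) % len(t)]] for i in range(len(t)))

def descend(dist, tour):
    """Greedy descent: repeatedly take the best improving segment-exchange
    3-opt move until none exists.  Returns a local optimum (with respect
    to this reduced move set) and the number of improving moves taken."""
    n, steps = len(tour), 0
    while True:
        best, best_c = None, tour_cost(dist, tour)
        for i in range(1, n - 1):
            for j in range(i + 1, n):
                for k in range(j + 1, n + 1):
                    # Reconnect as A + C + B + D; no segment is reversed.
                    cand = tour[:i] + tour[j:k] + tour[i:j] + tour[k:]
                    c = tour_cost(dist, cand)
                    if c < best_c - 1e-12:
                        best, best_c = cand, c
        if best is None:
            return tour, steps
        tour, steps = best, steps + 1

rng = random.Random(0)
pts = [(rng.random(), rng.random()) for _ in range(9)]
dist = [[math.dist(p, q) for q in pts] for p in pts]
start = list(range(9)); rng.shuffle(start)
opt, steps = descend(dist, start)
assert sorted(opt) == list(range(9))            # still a legal tour
assert tour_cost(dist, opt) <= tour_cost(dist, start)
```

Since each accepted move strictly lowers the cost, the loop terminates; `steps` is the quantity studied as "time to local optimum" in Section 3.3.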

## 3 Landscape Analysis

This section explores many properties of the landscape of TSP, and in particular it focuses on how these properties scale as the problem size increases.

### 3.1 Density of States

We start our analysis by studying the statistical properties of randomly drawn solutions in the landscape. In our experience these properties do not correlate well with problem hardness, but they provide information on the probability of finding an improving neighbour at a given cost level. This might provide useful information, for example, in deciding the size of the neighbourhood in local search.

The density of states shows the number of configurations at each cost level. The average cost of the solutions in TSP is determined by the average distance between the cities. By randomly sampling solutions and forming a histogram of their costs, we can compute the spread of costs around the mean. Figure 4 shows the natural logarithm of the histogram of costs around the average cost, suitably rescaled, for *n* = 100, 500, and 1,000 for the Euclidean problem, where *n* is the size of the problem. The results are similar for every randomly drawn problem instance. The cost of a random solution is approximately normally distributed around the average cost with a variance that grows approximately as *n*. We did the same experiment on all 11 problem types. In each case we averaged over 50 different problem instances of each type. We observed very similar behaviour for all the problem types.
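This sampling experiment can be sketched as follows; the instance size, sample count, and unit square are illustrative choices.

```python
import math
import random
import statistics

def random_tour_costs(dist, samples, rng):
    """Costs of uniformly random tours under the distance matrix dist."""
    n = len(dist)
    tour, costs = list(range(n)), []
    for _ in range(samples):
        rng.shuffle(tour)
        costs.append(sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n)))
    return costs

rng = random.Random(0)
n = 100
pts = [(rng.random(), rng.random()) for _ in range(n)]   # Euclidean instance
dist = [[math.dist(p, q) for q in pts] for p in pts]
costs = random_tour_costs(dist, 2000, rng)
mean, var = statistics.fmean(costs), statistics.variance(costs)
# The mean is close to n times the average inter-city distance
# (about 0.52 for the unit square), and the variance is of order n.
assert 0.45 < mean / n < 0.60
```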

#### 3.1.1 Statistical Measures

To quantify the shape of the distribution we use its cumulants, $\kappa_m$, where *c* is the cost of the tour, and the expectation is over all instances of the problem. The third cumulant is equal to the third central moment, $\kappa_3 = \mathbb{E}\big[(c - \mathbb{E}[c])^3\big]$, while the fourth cumulant is equal to the fourth central moment minus 3 times the variance squared, $\kappa_4 = \mathbb{E}\big[(c - \mathbb{E}[c])^4\big] - 3\,\kappa_2^2$. In a normal distribution the third and fourth cumulants are zero. The higher-order cumulants are invariant to changing the definition of the costs by an additive constant but are not invariant to the scale of the cost. A more useful set of measures for understanding how much a distribution differs from a normal distribution is the scaled cumulants, where we scale the *m*th cumulant by a factor of $\kappa_2^{m/2}$. The scaled cumulants are then invariant to an affine transformation of the costs (for $m \geq 3$). The third scaled cumulant is known as skewness, denoted by $\gamma_1 = \kappa_3 / \kappa_2^{3/2}$, while the fourth scaled cumulant is known as kurtosis and is denoted by $\gamma_2 = \kappa_4 / \kappa_2^{2}$.

In principle, some of these statistics can be computed *ab initio*; we did this for graph colouring in a previous paper (Tayarani and Prügel-Bennett, 2015). For example, the mean is equal to the expected length of a random tour. A random tour consists of *n* edges whose lengths are (approximately independent) draws from the distribution of distances, so that $\mathbb{E}[c] = n\,\bar{d}$, where $\bar{d}$ is the mean of the distance values. We can thus compute the mean length of a random tour averaged over all instances by computing the average distance for the 11 problems. For example, for a 2D Euclidean tour with cities randomly placed in a unit square, $\bar{d} \approx 0.5214$, so the mean tour length is approximately $0.5214\,n$. Such calculations, however, are not very informative. Furthermore, for higher cumulants they rapidly become very complicated. We therefore do not pursue these calculations but rather show empirically obtained results.

In TSP, as the size of the problem grows, the skewness and kurtosis of the cost of random solutions change. Figure 5 shows the relation between the skewness and kurtosis of the distribution of costs for the Euclidean problem for different problem sizes. For small problems the skewness and kurtosis are negative. The negative skewness means that for small problems the bulk of the costs lies to the right of the mean, and the negative kurtosis means that the distribution has lighter tails than a normal distribution. The only problem type that has a positive skewness is the random problem class. For random symmetric problems the skewness is around zero for all problem sizes.

The skewness versus the kurtosis of the distribution of costs for different problem types for *n* = 10 and *n* = 1,000 is represented in Figure 6. The results are averaged over 50 different problem instances, with 10 sets of samples for each. It is clear that for all the problem types, as the system size increases, the distribution of costs converges toward a normal distribution. This might seem an inevitable consequence of the central limit theorem, but that is not so; for example, in graph colouring the skewness and kurtosis remain nonzero as $n \to \infty$ (Tayarani and Prügel-Bennett, 2015).
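The scaled cumulants can be estimated from a sample of costs with the usual moment estimators (biased, but adequate for large samples):

```python
import statistics

def scaled_cumulants(xs):
    """Sample skewness and excess kurtosis:
    gamma1 = k3 / k2^{3/2},  gamma2 = k4 / k2^2  with  k4 = m4 - 3 k2^2."""
    m = statistics.fmean(xs)
    k2 = statistics.fmean((x - m) ** 2 for x in xs)   # variance
    k3 = statistics.fmean((x - m) ** 3 for x in xs)   # third central moment
    m4 = statistics.fmean((x - m) ** 4 for x in xs)   # fourth central moment
    return k3 / k2 ** 1.5, m4 / k2 ** 2 - 3.0

# A symmetric sample has zero skewness; this flat one has negative kurtosis.
g1, g2 = scaled_cumulants([-2, -1, 0, 1, 2])
assert abs(g1) < 1e-12
assert abs(g2 + 1.3) < 1e-12
```

Applied to the sampled random-tour costs of the previous subsection, these estimators give the points plotted in Figures 5 and 6.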

### 3.2 Autocorrelation

The autocorrelation function, $R(t)$, measures the correlation between the costs of the configurations visited at times separated by *t* steps along a random walk in the landscape, normalised by the variance in the cost for random solutions. We computed the autocorrelation for different sizes of different types of problems. The curves representing the autocorrelation for particular instances of the Euclidean problem for several problem sizes (*n* = 200 and *n* = 1,000 among them) for three different walks, 2-opt, 3-opt, and 4-opt, are shown in Figure 7. Note that for large values of *t* the estimated value of $R(t)$ is unreliable because of statistical fluctuations, even though we used a large number of steps in our estimate. Under the rescaling used in this figure, the graphs for all the problems are very similar. The autocorrelation function appears to fall off approximately exponentially as $R(t) \approx e^{-t/l}$, where *l* is known as the correlation length (Stadler, 1996). The gradient of the curves for 2-opt in Figure 7 is about −1.94; therefore for 2-opt the correlation length grows in proportion to *n*. The correlation length is considered a measure of the landscape ruggedness: the smaller *l*, the more rugged the landscape. As *n* increases, the ruggedness decreases; that is, on a proportionally larger instance we must make proportionally more random steps to decorrelate the cost to the same degree as a single step achieves on the smaller instance. The correlation length shows that the landscape of TSP is relatively smooth with long-range correlation. It is also clear that the correlation between the solutions in a *k*-opt walk is weaker than that in a $(k-1)$-opt walk. This is because as we increase *k*, we are making larger steps and thus decorrelating faster.

Figure 8 shows the autocorrelation function for all 11 problem types at a fixed problem size, where we use a 3-opt walk. We observe that the problems split into two major groups. For the random symmetric, random symmetric closed, rectilinear, additive drilling, crane, disk drive, and Euclidean problems, the autocorrelation is a straight line, while for the random, random closed, sup drilling, and job scheduling problems, the curves have a distinct kink at the beginning of the walk.

The kink occurs in those problems with asymmetric weights. The explanation of the kink is that typically a random 3-opt move will reverse the direction in which a tour is visited. For asymmetric problems this causes a rapid decrease in the correlation. However, later moves can reverse the direction in which these cities are visited to the original direction, thus restoring the correlation.
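An autocorrelation estimate along a random walk can be sketched as below. For brevity the walk uses 2-opt reversals on a small symmetric instance, whereas the paper's Figure 8 uses 3-opt walks; walk length and instance size are illustrative.

```python
import math
import random
import statistics

def walk_autocorrelation(dist, t_max, walk_len, rng):
    """Estimate R(t) = Cov(c_s, c_{s+t}) / Var(c) along a random
    2-opt walk in the landscape of the instance dist."""
    n = len(dist)
    tour = list(range(n)); rng.shuffle(tour)
    costs = []
    for _ in range(walk_len):
        costs.append(sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n)))
        i, j = sorted(rng.sample(range(n), 2))
        tour[i:j] = reversed(tour[i:j])          # one random 2-opt move
    mean = statistics.fmean(costs)
    var = statistics.fmean((c - mean) ** 2 for c in costs)
    return [statistics.fmean((costs[s] - mean) * (costs[s + t] - mean)
                             for s in range(walk_len - t)) / var
            for t in range(t_max)]

rng = random.Random(0)
pts = [(rng.random(), rng.random()) for _ in range(40)]
dist = [[math.dist(p, q) for q in pts] for p in pts]
R = walk_autocorrelation(dist, 20, 4000, rng)
assert abs(R[0] - 1.0) < 1e-9     # perfect correlation at lag 0
assert R[1] > R[10]               # correlation decays with the lag
```

Fitting a straight line to $\ln R(t)$ then yields the correlation length *l* discussed above.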

### 3.3 Time to Local Optimum

In the study of the landscape of TSP, we mainly focus on the properties of local and global optima. Our analysis begins by studying the number of steps taken by the local search algorithm to reach a local optimum. We count the number of 3-opt moves performed on the starting configuration until a local optimum is reached. Note that for symmetric problems the 3-opt operator changes three edges at each step, while for asymmetric problems many of the edges may change. To check whether the local search algorithm has reached a local optimum, the exhaustive local search algorithm is used.

Figure 10 shows the mean time to reach a local optimum versus *n* for all the problem types. It is clear that for all the problem types the time to a local optimum increases linearly with the system size (with the possible exception of job scheduling, which seems to grow slightly faster). From the point of view of time to a local optimum, then, the problem becomes only linearly harder as the system size grows. But of course it is not only the time that matters; the number of local optima is another important property that determines how hard the problem is. We discuss this in the next sections.

Although the average time to reach a local optimum is quite similar for different problem types, the distribution of runtimes differs markedly between them. This is illustrated in Figure 11, where we show the histograms of runtimes for the sup drilling problem and the Euclidean problem. We have plotted the log-frequency for the sup drilling problem to show the rare events. Although the average number of steps is approximately the same for the two problem types (34.73 for sup drilling and 31.67 for Euclidean), the sup drilling problem shows a long tail to the right of the mean, while the Euclidean problem shows a more symmetric shape with no long tail. The reason for this difference lies in the values of the distance matrices. For the Euclidean problem the distances between the cities are real numbers, so even if truncated, it is exponentially rare for two edges to have the same length. This makes the fitness of almost all solutions in the search space distinct, so the landscape tends to have no plateaus, and starting from a random solution, the local search algorithm descends directly to a local optimum. For the sup drilling problem the edges take integer values; consequently there can be many different tours of the same length. This produces plateau regions that have to be explored, thus increasing the runtime.

So far we have studied the number of steps to reach a local optimum, but we have not shown how the cost reduces during the run. In Figure 12 we show the cost versus the number of steps taken averaged over 1,000 descents for an instance of the sup drilling problem. We truncate the curves after the optima are reached. We plot the same data using a density plot and on a logarithmic scale in Figure 13. The most prominent characteristic is that the improvement in cost decreases roughly exponentially with the number of steps. However, there is clearly a large deviation from this for a large number of steps caused by the relatively small number of runs that take a long time. These characteristics are shared by all the problems, although the exceptionally large tail is a feature only of those problems with large plateaus.

### 3.4 Number of Local Optima

What makes many combinatorial optimisation problem instances difficult is the number of local optima. To find the number of local optima in the landscape we use the exhaustive local search algorithm and store the local optimum it returns. The exhaustive local search algorithm is repeated a large number of times from randomly chosen starting configurations. Each local optimum and the number of times it is hit are stored. Of course, there is no guarantee that all the local optima in the landscape are found, particularly if a local optimum has a small basin of attraction. Typically there are many local optima that have a very small probability of being visited.
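Counting distinct local optima requires mapping each returned tour to a canonical ID that is invariant to rotating the tour (and, for symmetric problems, to reversing it). A minimal version of such an ID, in the spirit of the exhaustive local search's unique IDs, is:

```python
def tour_id(tour, symmetric=True):
    """Canonical ID for a tour: rotate so city 0 comes first and, for
    symmetric problems, pick the lexicographically smaller of the two
    traversal directions, since rotations (and, for symmetric problems,
    reflections) describe the same cyclic tour."""
    i = tour.index(0)
    fwd = tuple(tour[i:] + tour[:i])
    if not symmetric:
        return fwd
    rev = (0,) + fwd[1:][::-1]
    return min(fwd, rev)

assert tour_id([2, 3, 0, 1]) == tour_id([0, 1, 2, 3])   # rotation
assert tour_id([0, 3, 2, 1]) == tour_id([0, 1, 2, 3])   # reflection
assert tour_id([0, 3, 2, 1], symmetric=False) != tour_id([0, 1, 2, 3], symmetric=False)
```

Estimating the number of local optima and their basin sizes then amounts to keeping a `collections.Counter` keyed by `tour_id` over many independent descents.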

The logarithm of the number of times each local optimum is hit versus the cost of the local optimum for the Euclidean problem is shown in Figure 14. The data are based on a single instance and a large number of descents. Of course, there are likely to be local optima that we have not found. However, note that the local optimum with the highest cost (the worst local optimum) is visited 18 times, and the minimum number of times any local optimum is visited is 13. Thus, with high probability, any unfound local optimum with a cost greater than that of the worst local optimum would have a basin of attraction 18 times smaller than those of the optima we have found. Although we cannot rule out the existence of such local optima, theoretically or practically, we strongly doubt their existence, as we have not found any example of them in a very large number of trials.

We observe that the line fitted in Figure 14 has a negative gradient, which implies that the probability of ending up in a local optimum of cost *c* decays exponentially with *c*. Extrapolating this fit, we find that over all our descents the expected number of visits to any local optimum with a cost beyond the observed range would be less than 1, if any such optima exist. The probability of finding the global optimum was 16.7%, while the probability of finding the second best optimum was 14.9%.

Although for all the problem types the probability of getting to a local optimum decays exponentially as the cost of the local optimum grows, the decay rate is different for different problem types. Since the average cost of the local optima is different for different problem types, in order to be able to compare the decay rate of different problems, the cost of the local optima has to be normalized. To do so, the cost on the horizontal axis in Figure 14 is divided by the average cost of the local optima. After normalization, the gradient of the fitting line to the data shows the decay rate of the probability of getting to a local optimum as a function of its cost. The probability of getting to the global optimum versus the decay rate for all the problem types is shown in Figure 15. For the random problems we found the best solution on average 1.07 times in descents. Thus for this problem we can have no confidence that we have found the global optimum. There seems little evidence of a relation between the cost of the optima and the size of their basin of attraction. When the random problems become symmetric, the probability of getting to the global optimum grows; on average the best solution is hit 20 times. When the random problems are forced to satisfy the triangle inequality (random closed), the correlation between the cost of a local optimum and the probability of its being visited increase further. For this problem type the average number of times the best found optimum is hit is 80 times in descents. Going from random problems to random closed problems, the decay rate grows as well. When the random problems are both symmetric and closed, then the number of times the global optimum is hit grows to 4,119 times in descents. For the case of the job scheduling problem, the best observed solution on average is hit 4 times in descents. 
This means that, in comparison with the Euclidean problem, for example, in which the best observed solution is visited more than 10,000 times, the best observed local optimum in the job scheduling problem shows no significant superiority over the other local optima in terms of the size of its basin of attraction. As the job scheduling problem is not a random problem, we might expect to observe a correlation between the cost of a local optimum and the size of its basin of attraction, but Figure 15 shows no such relation.

The sup drilling and crane problem types show quite similar behaviour. For the sup drilling problem the global optimum is hit on average 695 times, and for the crane problem it is hit 853 times in descents. For the four remaining types the decay rate is similar, but the probability of getting to the global optimum differs. The Euclidean problem shows the highest probability of getting to the global optimum; for the instance in Figure 14, the best observed optimum is hit 16,752 times in descents.
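The kind of measurement behind these visit counts can be sketched as follows: run many independent descents from random starts and count how often each local optimum is reached; the counts estimate the relative sizes of the basins of attraction. This toy sketch uses 2-opt on a small random symmetric instance rather than the paper's 3-opt and problem generator, purely for brevity:

```python
import random
from collections import Counter

random.seed(0)
n = 8
# Toy instance: a random symmetric distance matrix (not one of the paper's types).
d = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        d[i][j] = d[j][i] = random.random()

def cost(tour):
    return sum(d[tour[i]][tour[(i + 1) % n]] for i in range(n))

def two_opt_descent(tour):
    # Greedy 2-opt: keep reversing segments while an improving move exists.
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                cand = tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]
                if cost(cand) < cost(tour) - 1e-12:
                    tour, improved = cand, True
    return tour

def canonical(tour):
    # Fix rotation and direction so equivalent tours compare equal.
    k = tour.index(0)
    t = tour[k:] + tour[:k]
    return min(tuple(t), tuple([t[0]] + t[1:][::-1]))

# Count how often each local optimum is reached over many descents.
hits = Counter()
for _ in range(200):
    start = list(range(n))
    random.shuffle(start)
    hits[canonical(two_opt_descent(start))] += 1
```

The fraction `hits[opt] / 200` then estimates the probability of reaching each optimum, and plotting it against `cost(opt)` gives the kind of data shown in Figure 14.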

The histograms of the number of local optima at each cost for two different problem types are shown in Figure 16. The histogram shows different behaviour for the different problem types: for the sup drilling problem it is close to a normal distribution, while for the Euclidean problem it is closer to a uniform distribution. In some cases, like job scheduling, the distribution has a long tail on the right side. A tail on the right side of the distribution means that there can be high-cost local optima, but with a smaller probability of being visited. The distribution for the Euclidean problem shows no tail on either side; there is quite a large proportion of local optima at both high and low cost levels.

One way of measuring the size of the tails is to consider the distribution of costs of the local optima and to measure the proportion of local optima in the 10% of highest and lowest costs. This is shown in Figure 17. There is clearly a strong correlation indicating some degree of symmetry. The Euclidean problem has the highest proportion of local optima at the extremes of the cost distribution, while the random problems have the lowest proportion.
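One plausible way to compute the tail proportions just described, assuming "the 10% of highest and lowest costs" refers to the extremes of the observed cost range (the function name is ours):

```python
def tail_proportions(costs):
    """Fraction of local optima whose cost lies in the lowest and the highest
    10% of the observed cost range (one plausible reading of the measure)."""
    lo, hi = min(costs), max(costs)
    span = hi - lo
    low = sum(c <= lo + 0.1 * span for c in costs) / len(costs)
    high = sum(c >= hi - 0.1 * span for c in costs) / len(costs)
    return low, high
```

Applied to the costs of all sampled local optima of an instance, this yields one point of the scatter plot in Figure 17.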

#### 3.4.1 Growth in the Number of Local Optima

*n*, and for the sup drilling and crane problems the scaling is . For all four random types the vast majority of the local optima are hit only once in descents, so it is hard to estimate the relation between the system size and the number of local optima, although for all the problem types the number of local optima increases exponentially with some function of the system size.

It is interesting to note that the difficulty of the problem is not obviously correlated with the growth in the number of local optima. For example, Euclidean TSP has a larger probability of finding the global optimum than sup drilling despite having significantly greater growth in the number of local optima.

#### 3.4.2 Probability of Visiting an Optimum

We have seen that for some problems the probability of finding a local optimum is correlated with its cost, while this is not true for other problems. We illustrate this in Figure 19, which shows the proportion of local optima at each cost level and the probability of finding a local optimum at that cost level. For the sup drilling problem the lower cost solutions tend to have significantly larger basins of attraction. In contrast, in the job scheduling problem the sizes of the basins of attraction are almost uncorrelated with the cost.

Clearly, it makes the problems easier to solve when their basins of attraction are larger for lower-cost solutions. To measure how the size of the basin of attraction of the local optima changes with the cost, we define the bias to be the difference between the expected cost of all the optima and the expected cost of the optima found by local search, divided by the standard deviation in the cost of all the optima. A large bias indicates that the local search is much more likely to find a fit local optimum than a less fit one. Figure 20 shows the probability of getting to the global optimum versus the bias averaged over 10 different instances for each problem type. There is clearly some correlation between the probability of finding the global optimum and the bias, but interestingly it is not that strong.
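The bias defined above can be computed directly from two samples of optimum costs. A minimal illustration with hypothetical numbers (`bias` is our name for the function, not the paper's):

```python
from statistics import mean, stdev

def bias(all_optima_costs, found_costs):
    # Bias as defined in the text: the difference between the expected cost of
    # all the optima and the expected cost of the optima found by local search,
    # divided by the standard deviation of the cost of all the optima.
    return (mean(all_optima_costs) - mean(found_costs)) / stdev(all_optima_costs)
```

A positive bias means local search finds fitter-than-average optima; zero means the basin sizes are uncorrelated with cost.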

### 3.5 Reaching the Global Optima

As we increase the system size, the gap between the expected cost found by the local search and the minimum cost increases. This is illustrated in Figure 21, where the expected minimum cost (the cost of the global optima averaged over 20 problem instances) and the expected cost of the local optima found by the local search algorithm over the same ensemble of randomly drawn instances are shown. We also show the expected cost of the local optima found by local search plus and minus one standard deviation. It is clear that as the system size grows, the gap between the expected cost and the cost of the global optimum solutions grows. For problem sizes *n* up to 30, the cost of the global optimum solution lies between the expected cost of the local optima and the expected cost minus one standard deviation. For problems of size 80, it lies between the expected cost and the expected cost minus two standard deviations. Clearly, as the system size grows, finding the global optimum becomes less likely.

Note that we cannot be certain that we have found the global optimum; by global optimum we simply mean the best observed solution. However, as shown for many of the problem types, the better local optima have exponentially larger basins of attraction. As in our experiments the best found local optimum has been visited thousands of times, we believe it is highly unlikely that there exists an optimum with better fitness whose basin of attraction is thousands of times smaller. Therefore the best observed solution is, with high probability, the global optimum (although this does not hold for the random problems or the job scheduling problem).

We can also compute the probability of finding the best found local optimum directly. This is shown in Figure 22, where we plot the log-probability of finding the best optimum versus system size for four different problem types. The data are averaged over 20 different problem instances and the number of descents is . The straight-line fit is consistent with the hypothesis that finding global optima becomes exponentially unlikely as the system size grows. Using the straight-line fit in Figure 22 to extrapolate to large *n*, the probability of the local search algorithm finding a global optimum for an instance of size 1,000 of the Euclidean problem would be and for the sup drilling problem . Although such extrapolation is unlikely to provide a precise value, it nevertheless provides a strong indication that for larger problem instances, multiple runs of local search cease to be a useful strategy. For the random and job scheduling problems, the number of times the best optimum is visited rapidly shrinks to 1 in descents. For these problem types, for *n* larger than 40, the best visited solution is found just once, so we can be almost sure that for this system size there are better optima that have not been found in this number of descents. The data show that for all the problem types, the probability of finding the global optimum solution decreases exponentially as the system size grows, although at different rates.
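The straight-line extrapolation used above amounts to a least-squares fit of log *P* against *n*. A minimal sketch with made-up data (the function names are ours):

```python
import math

def fit_log_probability(sizes, probs):
    # Least-squares fit of log P(n) = a + b * n; returns (a, b).
    xs = sizes
    ys = [math.log(p) for p in probs]
    m = len(xs)
    xbar, ybar = sum(xs) / m, sum(ys) / m
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar
    return a, b

def extrapolate(a, b, n):
    # Predicted probability of finding the global optimum at size n.
    return math.exp(a + b * n)
```

With empirical probabilities measured at several small sizes, `extrapolate(a, b, 1000)` gives the kind of large-*n* estimate quoted in the text.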

Figure 21 shows that the gap between the expected cost and the minimum cost increases with the system size, and Figure 22 shows that the probability of finding the global optimum decreases exponentially with system size. These two figures show two different aspects of the same property of the fitness landscape: how the problem becomes harder as the system size grows. To show this for all the problem types, Figure 23 plots the gradient of the fitting line in Figure 22 against the rate at which the gap in Figure 21 grows with the system size. The data clearly show a correlation between the gap and the probability of finding the global optimum. The job scheduling problem and the random problem can be considered the hardest problems, as the decay rate of the probability of finding a global optimum is the highest among all the problem types. When the random problems become symmetric or closed, the decay rate decreases, meaning the random symmetric and random closed problems are easier than completely random problems. Among these problem types, the additive drilling and the Euclidean problems, which show similar behaviour, are the two easiest.

### 3.6 Distance between Optima

One important property of the fitness landscape is the distance between the local optima, which shows how the local optima are correlated. The distance between two solutions is simply defined as the number of noncommon edges in the two solutions (note we are using here the Hamming distance rather than the number of 3-opt moves between configurations). For symmetric problems the edges are undirected, so both directions of an edge are considered the same. The histogram of the distances between local optima for two different problem types is shown in Figure 24. The results are averaged over 20 randomly drawn problem instances for *n* = 50. The interesting property of the histogram is the double peak in the histogram of the additive drilling problem: one higher and wider peak on the left, and one lower and narrower on the right. This peculiar property is due to the nature of this problem. As described in Section 2.1.6, in the definition of the drilling problem, the energy needed for the machine to move the drill down is zero. Because of this property, although this problem is not symmetric, some subtours in a solution can be reversed without significantly changing the cost of the solution. The change in cost caused by reversing a subtour is the difference between the *y*-coordinates of the starting and ending points of the subtour. In this sense the problem can be considered a partly symmetric problem, and this is what causes the peculiar behaviour in the histogram of the distances between the local optima.
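The distance measure defined above, the number of noncommon edges with undirected edges for symmetric problems, can be computed directly. A small sketch (function names are ours):

```python
def edge_set(tour, symmetric=True):
    # Collect the edges of a cyclic tour; undirected edges are stored as
    # frozensets so that (a, b) and (b, a) compare equal.
    n = len(tour)
    edges = set()
    for i in range(n):
        a, b = tour[i], tour[(i + 1) % n]
        edges.add(frozenset((a, b)) if symmetric else (a, b))
    return edges

def tour_distance(t1, t2, symmetric=True):
    # Hamming distance in edge space: edges of t1 not shared with t2.
    return len(edge_set(t1, symmetric) - edge_set(t2, symmetric))
```

For a symmetric problem a tour and its reversal are at distance 0, whereas for an asymmetric problem they share no directed edges (beyond coincidences) and are far apart.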

The histogram of the distances has a bell shape for the Euclidean problem. This bell-shaped behaviour is seen in all the problem types except the additive drilling problem and, to some degree, the sup drilling problem. In all the problem types the peak of the histogram is located to the left of the expected distance between two randomly drawn solutions. Note that for problems of size *n* = 50, the expected distance between two randomly drawn solutions is around 49.5 (we found this by randomly generating pairs of configurations and calculating the average distance). Therefore the histograms of Figure 24 clearly suggest that the local optima are much closer to each other than two random solutions would be.

Figure 24 shows the histogram of distances for all local optima irrespective of their cost. It is also interesting to see how the distances between local optima depend on their cost. We show this information for the sup drilling problem in Figure 25, where a three-dimensional histogram is plotted. We observe that the best (lowest cost) local optima are closer to each other than the less fit local optima. Half of the best 10% of local optima are within a distance of 5 of each other. The less fit local optima are much further from each other than the fittest optima.

### 3.7 Distance to Global Optimum

The mean distance to the global optimum from a local optimum of cost *c*, for different sizes of the Euclidean problem, is shown in Figure 26. The scaling on the vertical axis demonstrates that the distance from a local optimum to a global optimum scales linearly with the problem size. The scaling behaviour of the curve is the same for all the problem types. The distance required to make the final step, from an optimum of cost down to *c _{min}*, clearly grows with *n*, although it is difficult to be sure how this distance grows, as answering this question depends on extrapolating the curve in Figure 26 to the *y*-axis.

Another way of viewing this information is as a density plot of the cost versus distance to the global optimum. We show these plots for a Euclidean problem and a random problem in Figure 27. There is clearly a strong fitness-distance correlation for the Euclidean problem, but very little for the random problem. Many population-based algorithms attempt to exploit the fitness-distance correlation. Clearly, there is some hope in the case of Euclidean problems but much less for random TSP problems.

### 3.8 Expected Cost in Hamming Sphere

We noted earlier that all the problems had a relatively slow decline in the autocorrelation function, corresponding to a large correlation length. This is a result of the cost being the sum of all the edges in the tour: if we make a local move that changes only a few edges, then there will be only a relatively small change in the cost. To examine this, we study the mean cost within a Hamming sphere around a configuration, the Hamming sphere of radius *h* being the set of configurations that differ from the central configuration by exactly *h* edges.

Figure 28 shows the expected cost of a configuration in a distance sphere of radius *h* from a global optimum. To construct the graph, we performed a random number of 3-opt moves from the global optimum and then measured the distance from the global optimum. The process was repeated a large number of times, and the costs at each Hamming distance were averaged. The data are well described by a straight line: a simple approximation to the expected cost at Hamming distance *h* about a tour *s* is *c*(*s*) + (*h*/*n*)(*c̄* − *c*(*s*)), where *c*(*s*) is the cost of the current tour and *c̄* is the average cost of a random tour.

### 3.9 Return Probabilities

Another interesting property is the probability of returning to a local optimum when starting at a fixed Hamming distance from it. We can empirically measure the "basin of attraction" of a local optimum by repeatedly jumping to a configuration at a certain distance from the local optimum and running the local search algorithm until it reaches a local optimum. We then check whether this is the same local optimum from which we started, and can thus measure the return probability starting from a configuration in a given distance sphere. In Figure 29 we show these return probabilities for four problems starting from the global optimum using 3-opt moves. The curves are not monotonic because 3-opt moves can change some numbers of edges more easily than others; for example, it is easier to change three edges than five. As we would expect, the return probabilities fall off faster as the problems become harder.
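The return-probability measurement can be sketched as follows: start from a local optimum, perturb it by a number of random moves, descend, and check whether we fall back into the same basin. This sketch uses 2-opt descents and random segment reversals on a toy Euclidean instance, rather than the paper's exact 3-opt procedure and distance spheres:

```python
import random

random.seed(1)
n = 10
pts = [(random.random(), random.random()) for _ in range(n)]

def dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def cost(tour):
    return sum(dist(pts[tour[i]], pts[tour[(i + 1) % n]]) for i in range(n))

def descend(tour):
    # Greedy 2-opt descent to a local optimum.
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                cand = tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]
                if cost(cand) < cost(tour) - 1e-12:
                    tour, improved = cand, True
    return tour

def canonical(t):
    k = t.index(0)
    t = t[k:] + t[:k]
    return min(tuple(t), tuple([t[0]] + t[1:][::-1]))

def return_probability(opt, kicks, trials=100):
    # Apply `kicks` random segment reversals to the optimum, descend,
    # and count how often we fall back into the same basin.
    target, returns = canonical(opt), 0
    for _ in range(trials):
        t = list(opt)
        for _ in range(kicks):
            i, j = sorted(random.sample(range(n), 2))
            t[i:j + 1] = t[i:j + 1][::-1]
        if canonical(descend(t)) == target:
            returns += 1
    return returns / trials

start = list(range(n))
random.shuffle(start)
opt = descend(start)
```

In general one expects the measured return probability to fall as the perturbation grows, though, as the text notes, not necessarily monotonically.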

## 4 Principal Component Analysis

Although we have seen that the number of local optima grows exponentially with the system size, many of these may be formed by combining other optimal tours. For example, we could have a tour consisting of 10 segments. There may be two routes through the cities in each segment, both of which are local optima. In this case there would be 2^10 = 1,024 local optima that are a consequence of these 10 binary choices. To investigate the underlying variety in the local optima, we use principal component analysis to find the subspaces with a large variation. To do this we consider each tour as a binary vector in the space of edges. That is, for a symmetric problem the tour is represented as a binary vector consisting of *n* nonzero components out of a total of *n*(*n* − 1)/2 possible components. Similarly, for asymmetric problems the tour is represented as a binary vector consisting of *n* nonzero components out of a total of *n*(*n* − 1) possible components. To perform PCA we subtract the mean vector of all the optima and then compute the eigenvalues and eigenvectors of the covariance matrix (we can compute this efficiently using singular value decomposition).
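A minimal sketch of this PCA, representing each symmetric tour as a binary vector over the *n*(*n* − 1)/2 possible edges and applying SVD to the mean-centred matrix (function names are ours):

```python
import numpy as np

def tours_to_edge_matrix(tours, n):
    # Each symmetric tour becomes a binary row vector over the
    # n * (n - 1) / 2 possible undirected edges.
    index = {}
    for i in range(n):
        for j in range(i + 1, n):
            index[(i, j)] = len(index)
    X = np.zeros((len(tours), len(index)))
    for r, t in enumerate(tours):
        for k in range(n):
            a, b = t[k], t[(k + 1) % n]
            X[r, index[(min(a, b), max(a, b))]] = 1.0
    return X

def pca_of_optima(tours, n):
    X = tours_to_edge_matrix(tours, n)
    Xc = X - X.mean(axis=0)                 # subtract the mean tour vector
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigenvalues = s ** 2 / len(tours)       # eigenvalues of the covariance matrix
    projections = Xc @ Vt[:2].T             # coordinates in the top-2 subspace
    return eigenvalues, Vt, projections
```

The rows of `Vt` are the eigentours, and `projections` gives the two-dimensional coordinates used for plots like Figure 33.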

To illustrate this approach we show the average solution of the local optima in Figure 30 for a Euclidean problem with 30 cities. The width of the edges shows the number of times they appear in different local optima. It is clear from the graph that there are many edges that do not appear in any of the local optima, and there are some that appear in all the local optima. This property means that the search algorithms do not need to involve such edges in their search process, and a huge proportion of the landscape can be ignored in the search process.

The eigenvectors correspond to orthogonal directions in edge space. The eigenvectors with larger eigenvalues correspond to directions where there is a large variance. That is, the local optima vary significantly in these directions. We can think of these eigenvectors, or eigentours, as showing the most significant way in which the local optima differ. In Figure 31 we show the eigenvector corresponding to the largest eigenvalue (i.e., the principal component) for the same problem as shown in Figure 30. Some components are positive, and others are negative. We can (roughly) interpret an eigentour as a choice between choosing the positive edges and removing the negative edges, or the other way around.

By plotting the eigenvalues against their rank we can see the dimensionality of the subspace in which there is significant variation between local optima. The size of an eigenvalue is equal to the mean squared reconstruction error of the optima if we project out that dimension. In Figure 32 we plot the spectrum of eigenvalues for four instances from different problem types. For the Euclidean and rectilinear problems there is a relatively small number of nonzero eigenvalues, and the variance is small (indicating that the eigentours do not involve many edges). In contrast, the scheduling and sup drilling problems show many more directions of variation and considerably larger variance.

One final use of PCA is that it allows us to project the local optima into a meaningful low-dimensional subspace, allowing some visualisation of their structure. We did this for six problems in Figure 33, where we plotted the local optima in the space spanned by the first two principal eigenvectors (the eigenvectors with the two largest eigenvalues). Furthermore, we plotted the data so that the size of a dot represents the fitness of the local optimum; the better local optima are represented with larger dots.

The first graph is for a random problem instance, where there is no particular structure in the local optima. For the crane problem we almost always see a square, clustered structure in the local optima. The squares may form any angle with respect to the horizontal, and may consist of a different number of clusters for different problem instances, but the common feature among all the problem types (except the random problems) and problem instances is that the local optima form clusters by fitness: some clusters consist of better local optima and others contain inferior ones. The third graph is for the Euclidean problem, for which the projected local optima usually form two parallel, elongated, football-shaped clusters. Not all problem instances show exactly the same behaviour: the angle of the clusters to the horizontal differs from one instance to another, the clusters are occasionally not parallel to each other, and sometimes there are more than two clusters. For the additive drilling problem we see behaviour similar to that of the Euclidean problem, but the number of clusters is almost always 2 (in the over 50 problem instances we studied), and the gap between the clusters is much wider. The next graph shows the disk drive problem, where the local optima usually form parallel lines of clusters. In this problem type the local optima sometimes form two or more parallel lines, and in some cases 10 parallel lines. The lines are sometimes so close and mixed together that they form a square shape. Finally, the last graph shows the job scheduling problem, where for almost all the problem instances we observed (more than 50 instances of different sizes) the local optima formed squares, although the squares sometimes consist of several clusters.

## 5 Discussion

In this paper we use 3-opt for our studies. Clearly, many of the properties we report depend on the neighbourhood structure and the local search operator. This is particularly true for properties, like autocorrelation, that measure the local ruggedness of the landscape via a random walk. However, the degree to which these properties depend on the local search operator varies from one property to another. For example, using an operator that takes longer steps and offers a larger number of neighbours to a solution (like 4-opt) would change some of the data presented here. For the time to reach a local optimum and the plateaus, for example, such an operator would take fewer steps to reach a local optimum. It might also escape plateaus more easily, so the size of the plateaus might decrease. Using such an operator also decreases the number of local optima in the landscape, as many basins of attraction merge to form a single basin. For the same reason, the probability of finding a local optimum or the global optimum would increase, although this does not imply that such an operator is better to use, as each step of the algorithm is more time-consuming.

We believe some properties, however, would not change if different operators were used. These include the shape of the distribution of local optima at each cost level, the property that the probability of finding the global optimum decreases exponentially with the problem size, the property that optima closer to the global optimum have, in expectation, lower cost, and the PCA properties. We believe these properties reflect the nature of these problems and so are operator-independent. We have two major reasons to believe this. First, these properties are also seen in other NP-hard combinatorial problems, including MAX-SAT (Prügel-Bennett and Tayarani, 2011), graph colouring (Tayarani and Prügel-Bennett, 2015) and quadratic assignment (Tayarani and Prügel-Bennett, 2013).

Second, some properties, like the relation between the probability of finding the global optimum and the time to reach a local optimum, should not change if a different operator were used. To make this concrete, consider the following example. We showed that the time to reach a local optimum increases linearly, and the probability of finding the global optimum decreases exponentially, with the problem size. So solving the problem becomes exponentially more time-consuming as the problem size increases, and this should be the case for any operator. Suppose instead there were an operator for which the probability of finding the global optimum decreased only polynomially, while the time to reach a local optimum still increased linearly with problem size. This would mean that the time needed (to keep the expected number of times the global optimum is found equal to 1) would grow only polynomially with problem size; in one sense, an NP-hard problem would be solved in polynomial time. Although this might be the case, we believe there is more evidence for the opposite.

The PCA properties, and the property that local optima closer to the global optimum are on average fitter than those further away, are other examples of properties that we believe are independent of the local search operator, because they reflect the structure of the problem and the relations between good solutions, and it does not matter how these good solutions are found.

The other issue that can affect the analyses in this paper is the isotropy of the landscape. Greenwood and Hu (1998) showed that studying anisotropic landscapes requires some care. A landscape is anisotropic if the fitness of solutions in some regions of the landscape is considerably different from that in other regions. This happens, for example, when some regions of the landscape do not satisfy some problem constraints and are at markedly lower fitness than other regions. These low-fitness regions are called sink holes. For this type of landscape, an analysis that uses a long random walk to estimate ruggedness or correlation does not provide accurate results, as some walks may hit sink holes. For several reasons, we believe that many of our analyses are valid with respect to the isotropy of the landscape. First, as discussed earlier, we believe such sink holes do not exist in the landscape of TSP. Second, to study the proposed properties we do not use one single long random walk; we use a large number of local searches starting from different configurations (although properties like autocorrelation are prone to this problem, which is why we believe these types of properties are not truly representative of problem hardness). Third, even if there were sink hole regions in the fitness landscape, they would not affect many of the results presented in this paper. Sink holes affect random walks because walks that pass over the holes observe different ruggedness. Our local search, on the other hand, always chooses the best neighbour, so it never passes over a sink unless it starts from one. Even if the search starts from a sink, it still has a good chance of escaping from it and finding a local optimum outside the sink.

## 6 Conclusion

Fitness landscapes are complicated objects that are not easily characterised by a small number of properties. In consequence we have to settle for a drawn-out conclusion. For the sake of clarity, we enumerate some of the most striking properties:

For TSP the number of steps to reach a minimum appears to grow linearly for most problem types. However, each step of 3-opt naively takes O(*n*³) time to compute (as we potentially have to search all edge triples to find one that can be optimised), so this can be time-consuming. This does not in itself make the problems difficult. What makes the problems difficult is that the local optimum we find is unlikely to be the global optimum.

For all problem types we observed that the number of local optima grows exponentially as some function of the size of the instance, but the rate of this growth varied considerably between problem types.

In most TSP types that we examined, the fitter local optima tended to have a significantly larger probability of being found than the less fit local optima. Thus, the number of local optima does not necessarily determine the probability of finding the global optimum.

The probability of reaching a global optimum decreases exponentially with the system size. This property is correlated with a widening gap between the cost of the global optimum and the expected cost of an optimum found by local search, measured in units of the number of standard deviations in the cost of the optima found by local search.

There is a correlation between the fitness of a local optimum and its distance to the global optimum, with fitter solutions lying, on average, closer to the global optimum. The mean distance appears to scale linearly with the problem size. However, this correlation depends on the problem type, and it is especially low for random problems.

The difference between the cost of a reference configuration and the average cost of the set of configurations at a fixed Hamming distance *h* from the reference is almost exactly linearly correlated with *h*. This is a result of the fact that the objective function consists of a sum of terms, where each term involves only a few variables; thus when we change a small number of variables, we only produce a small change in the cost. This function is almost exactly linear between the cost of the tour we start from and the expected cost of a random tour at a Hamming distance of *n*. This long-range correlation means that there is considerable similarity in the long-range structure of the fitness landscape.

Although it is not the intention of this paper to make a comparison with other problems, it is perhaps worth noting that very similar properties to those described here were also observed in MAX-SAT (Prügel-Bennett and Tayarani, 2011) and graph colouring (Tayarani and Prügel-Bennett, 2015). The current paper adds new texture to our description of fitness landscapes by showing how the properties vary between different problem types. Most of the properties of the landscape that we have analysed are qualitatively similar for all problem types, but they vary significantly quantitatively. In particular, the more random problems tend to have significantly more local optima, with the cost of a local optimum significantly less correlated with its proximity to the global optimum tour. One important lesson of this analysis is that Euclidean TSP is exceptional among the problem types we have analysed in that it is relatively easy to solve. One should therefore be on guard against extrapolating results from Euclidean TSP to more general TSP problems.

A final contribution of this paper was the introduction of PCA for analysing the structure of the fitness landscape. This technique provides a qualitative understanding of the degrees of variability between local optima in terms of the spectrum of eigenvalues. It clearly shows that although there is an exponential number of local optima, there are actually far fewer distinct ways in which the local optima differ from each other. Again, this is a property that depends on the problem type. We have also seen that PCA can provide a means of visualising the structure of the local optima by allowing a meaningful projection onto a low-dimensional subspace. This often reveals considerable clustering in the local optima that is not otherwise apparent. Although we have shown PCA results for small problem instances, the technique is not limited to small instances. The complexity of PCA depends linearly on the number of edges that differ between the local optima and cubically on the number of local optima. Thus, provided the number of local optima found is not too large, it is feasible to perform PCA on large problems. We believe that PCA could be used to direct the search of a new algorithm. The usefulness of PCA is clearly problem-dependent, but as we saw in the crane example in Figure 33, the fittest optima are often confined to particular clusters. PCA does not require the solutions to be optimal and can use good solutions returned by an optimiser. Although it may seem expensive to compute many good solutions, on large problems this may be necessary. Finally, it is worth pointing out that PCA is not limited to TSP but can be used on almost any problem.

## References

*k*-coloring fitness landscape

*k*-coloring problem