## Abstract

This paper describes the evolutionary split and merge for expectation maximization (ESM-EM) algorithm and eight of its variants, which are based on the use of split and merge operations to evolve Gaussian mixture models. Asymptotic time complexity analysis shows that the proposed algorithms are competitive with the state-of-the-art genetic-based expectation maximization (GA-EM) algorithm. Experiments performed on 35 data sets showed that ESM-EM can be computationally more efficient than the widely used multiple runs of EM (for different numbers of components and initializations). Moreover, a variant of ESM-EM free from critical parameters was shown to provide results competitive with those of GA-EM, even when the GA-EM parameters were fine-tuned a priori.

## 1 Introduction

Gaussian mixture models (GMMs) are powerful modeling tools widely used in several real-world applications (McLachlan and Peel, 2000; Figueiredo and Jain, 2002; Zhang et al., 2003), such as speaker identification (Reynolds and Rose, 1995) and handwritten digit recognition (Chen et al., 2011). Since in practice the values of the GMM parameters are usually a priori unknown, there is a need for algorithms capable of efficiently estimating them. To that end, the expectation maximization (EM) type of algorithm (Dempster et al., 1977; McLachlan and Krishnan, 1997) is usually the option of choice. Such algorithms perform iterative computations of maximum likelihood estimates and are known to converge in a finite number of iterations under some regularity conditions. However, it is well known that the EM algorithm does not necessarily lead to a global optimal solution (Zhang et al., 2003). Aimed at circumventing this problem, a variety of approaches have been proposed (e.g., see McLachlan and Krishnan, 1997; Ueda and Nakano, 1998; Figueiredo and Jain, 2002). Here, we focus on the *split and merge* type of algorithm (Ueda et al., 2000), which typically works by iteratively splitting and merging components of the mixture model and then using the EM algorithm in the resulting components. By doing so, it is expected to escape from local optima, which often involves having too many components in one part of the space and too few in another, widely separated part of the feature space (Ueda et al., 2000).

We are particularly interested in using GMMs for data clustering (Xu and Wunsch, 2009). In this case, a common approach is to use a one-to-one correspondence between components’ distributions and clusters. A probabilistic data partition (clustering) is obtained from soft assignments of objects to clusters, based on the posterior probabilities that a particular component was responsible for generating a given object (McLachlan and Peel, 2000; Bishop, 2006). Probabilistic clustering approaches are quite useful to model overlapping clusters, which are common in practice.

From an optimization perspective, clustering can be formally considered a particular type of NP-hard grouping problem (Falkenauer, 1998). This has stimulated the search for efficient approximation algorithms, including the use of not only ad hoc heuristics for particular classes or instances of problems but also general-purpose metaheuristics. In particular, evolutionary algorithms are metaheuristics widely believed to be effective on NP-hard problems, being able to provide near-optimal solutions to such problems in reasonable time. Under this assumption, a large number of evolutionary algorithms for solving clustering problems have been proposed in the literature. These algorithms are based on the optimization of some objective function (i.e., the so-called fitness function) that guides the evolutionary search (Hruschka et al., 2009).

Many clustering algorithms assume that the number of clusters is known and fixed a priori by the user. Aimed at relaxing this assumption, which is often unrealistic in practical applications, a common approach involves estimating the number of clusters from data. The use of mixture models offers a principled approach to perform such an estimation (Figueiredo and Jain, 2002). A widely used procedure is to run EM multiple times for an incremental number of clusters and select the best partition according to a model selection criterion. Although this ordered repetitive procedure may be efficient if the optimal number of clusters is small, this is not the case in several applications. In addition, even when this is the case, one hardly knows that in advance (Naldi et al., 2011). In the case of GMMs with all covariance matrices equal to the identity matrix and hard assignments, the EM for GMM reduces to the well-known *k*-means algorithm (Bishop, 2006). For this setting, a thorough assessment comparing an evolutionary algorithm (EA) and multiple repetitions was conducted by Naldi et al. (2011). More specifically, a well-designed EA was shown to be significantly faster than two commonly used strategies based on multiple runs of *k*-means. The extension of that study for the case of nonrestricted covariance matrices and overlapping clusters is one of our contributions.

In our work, we combine model selection with EAs that use specific, problem-dependent operators particularly designed for model selection. Although there are several EAs for clustering capable of estimating the number of clusters (for a detailed review, see Hruschka et al., 2009), including some algorithms based on GMMs (Pernkopf and Bouchaffra, 2005; Nguyen and Cios, 2008), they do not make use of split and merge operations for building GMMs. Therefore, they tend not to be very effective at avoiding local optima because they are usually not able to move a component from a denser region to a sparser region without passing through positions that give a lower likelihood (Ueda et al., 2000). Aimed at circumventing these problems, we propose a more principled algorithm, *evolutionary split and merge for expectation maximization* (ESM-EM). A preliminary version of this algorithm was introduced in Covões and Hruschka (2011) and compared with well-known greedy algorithms in Covões and Hruschka (2013). We first provide a detailed description of ESM-EM and then investigate eight new variants of ESM-EM,^{1} which are based on different mutation operators. Also, an original extensive empirical analysis that considers 35 data sets is reported. The obtained results are compared to the state-of-the-art EA for evolving GMMs proposed in Pernkopf and Bouchaffra (2005). Finally, original time complexity analyses are reported.

The remainder of this paper is organized as follows. Section 2 briefly reviews the EM algorithm for GMMs and introduces the adopted notation. A review of EAs and split and merge approaches to learning GMMs are presented in Section 3. In Section 4 we describe the ESM-EM algorithm and its variants. In Section 5 we analyze the empirical results of the experiments with 35 data sets, comparing them to related approaches, that is, multiple runs of EM and the EA proposed in Pernkopf and Bouchaffra (2005). Finally, Section 6 concludes the paper and points out some future work.

## 2 Gaussian Mixture Models

A GMM is a linear combination of *k* Gaussians, leading to a probability density of the form

$$p(\mathbf{x} \mid \Theta(k)) = \sum_{j=1}^{k} \pi_j \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j), \tag{1}$$

where the $\pi_j$ are the mixing coefficients subject to $\sum_{j=1}^{k} \pi_j = 1$ and $0 \le \pi_j \le 1$, $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$ is a multivariate Gaussian probability distribution, and $\Theta(k) = \{\pi_1, \dots, \pi_k, \boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_1, \dots, \boldsymbol{\Sigma}_k\}$ represents the complete set of parameters (mixing coefficients, means, and covariance matrices of the components, respectively) for a given *k*.

Initially, we define two sets, $\mathbf{X}$ and $\mathbf{Z}$. The former refers to the observed data, $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$, $\mathbf{x}_i \in \mathbb{R}^M$, where *M* is the number of features. We assume that the *N* samples are i.i.d. (independent and identically distributed). The latter set contains the indicator vectors, $\mathbf{Z} = \{\mathbf{z}_1, \dots, \mathbf{z}_N\}$, which specify the component to which the *i*th sample (object) belongs; that is, $\mathbf{z}_i$ is a *k*-dimensional binary random variable having a 1-of-*k* representation. Then $\{\mathbf{X}, \mathbf{Z}\}$ is the complete data set.

Let $\gamma_{ij}$ denote the responsibility (posterior probability) of the *j*th component for data point (object) $\mathbf{x}_i$:

$$\gamma_{ij} = \frac{\pi_j \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}{\sum_{l=1}^{k} \pi_l \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)}. \tag{2}$$

The EM algorithm can then be summarized as follows. In the *expectation* step, the expected responsibilities in Eq. (2) are computed. Then, in the *maximization* step, the model parameters are reestimated using the current responsibilities in such a way that the log-likelihood function is maximized. The expected complete-data log-likelihood function is defined as

$$\mathbb{E}[\log p(\mathbf{X}, \mathbf{Z} \mid \Theta)] = \sum_{i=1}^{N} \sum_{j=1}^{k} \gamma_{ij} \left[ \log \pi_j + \log \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) \right]. \tag{3}$$

The expectation (E) and maximization (M) steps are iteratively repeated until convergence. In practice, the algorithm is deemed to have converged when the change in the log-likelihood function, or alternatively in the parameters, falls below some threshold (Bishop, 2006). Note that initial estimates for the parameters ($\Theta$) are necessary for the first EM iteration. In clustering problems, a common approach to initialize such parameters involves running *k*-means. By doing so, the covariance matrices can conveniently be initialized to the sample covariance of the clusters, and the mixing coefficients can be set to the fractions of objects assigned to the respective clusters (McLachlan and Peel, 2000; Bishop, 2006).
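The E and M steps above can be sketched in a few lines. The following is a minimal illustrative implementation (not the paper's code); `gauss_pdf` and `em_gmm` are our own names, and the initialization is assumed to be given by the caller (e.g., from *k*-means, as described above).

```python
import numpy as np

def gauss_pdf(X, mu, S):
    """Multivariate normal density N(x | mu, S) evaluated at the rows of X."""
    M = len(mu)
    d = X - mu
    e = np.einsum('ij,jk,ik->i', d, np.linalg.inv(S), d)  # Mahalanobis terms
    return np.exp(-0.5 * e) / np.sqrt((2 * np.pi) ** M * np.linalg.det(S))

def em_gmm(X, means, covs, weights, tol=1e-6, max_iter=200):
    """Plain EM for a GMM; stops when the log-likelihood change falls below tol."""
    N, k = len(X), len(weights)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: responsibilities gamma[i, j] = p(component j | x_i), as in Eq. (2)
        dens = np.column_stack([weights[j] * gauss_pdf(X, means[j], covs[j])
                                for j in range(k)])
        ll = np.log(dens.sum(axis=1)).sum()
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and covariances from gamma
        Nj = gamma.sum(axis=0)
        weights = Nj / N
        means = (gamma.T @ X) / Nj[:, None]
        covs = [((gamma[:, j, None] * (X - means[j])).T @ (X - means[j])) / Nj[j]
                for j in range(k)]
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return means, covs, weights, ll
```

On well-separated data this converges in a handful of iterations; in a production setting one would compute the densities in log space to avoid underflow.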

A major drawback of EM is that the user must define the number of components (clusters) a priori. However, this value is not known in most applications. One common approach is to run EM repeatedly for values of *k* in a range $[2, k_{\max}]$ and then select the best model obtained according to a model selection criterion (see Figueiredo and Jain, 2002; McLachlan and Peel, 2000, for a detailed discussion on the topic). A possible systematic procedure for doing that is denominated ordered multiple runs (OMR-EM). In OMR-EM, the number of clusters is increased at each iteration, and for each number of clusters in $[2, k_{\max}]$, $n_p$ random initializations are performed. This procedure is described in Algorithm 1, where *s* is any model selection criterion to be minimized (an example of such a criterion is presented in Section 4), and *S* is a stopping criterion for EM, for instance, a maximum number of iterations. In practice, $n_p$ is usually defined in an ad hoc fashion or by taking into account the available computational resources. Assuming that $n_p$ is proportional to $k_{\max}$, the overall computational cost of the procedure is $O(n_p k_{\max}^2 N M^2)$.^{2}
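The OMR-EM loop just described can be sketched as follows. This is a minimal sketch, not Algorithm 1 itself: `fit_em` (one EM run from a random initialization) and `criterion` (the model selection criterion *s*, to be minimized) are assumed callables supplied by the user.

```python
def omr_em(X, k_max, n_p, fit_em, criterion):
    """Ordered multiple runs of EM: for each k in 2..k_max, run EM from n_p
    random initializations and keep the model minimizing the criterion."""
    best = None
    for k in range(2, k_max + 1):        # ordered: k increases at each iteration
        for _ in range(n_p):
            model = fit_em(X, k)         # one EM run (random initialization)
            score = criterion(model, X)  # model selection criterion s, minimized
            if best is None or score < best[0]:
                best = (score, model)
    return best[1]
```

Note that the inner runs are independent, so in practice they parallelize trivially.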

## 3 Related Work

As discussed, both model building and model selection are difficult optimization problems. The search space is, in general, nondifferentiable and multimodal. Evolutionary algorithms have been shown to be successful tools for related problems in a variety of applications (Falkenauer, 1998; Bäck et al., 2000). As stated in Section 1, we focus on the development of EAs for learning GMMs. For a more comprehensive review of clustering-related evolutionary algorithms, see Hruschka et al. (2009).

Although the EM algorithm for learning GMMs has been widely used in practice and several of its variants have been developed, only a few EAs have been proposed for this problem (Pernkopf and Bouchaffra, 2005; Tohka et al., 2007; Tawfick et al., 2008; Nguyen and Cios, 2008). Among them, the algorithms introduced in Tohka et al. (2007) and Tawfick et al. (2008) are of more limited applicability. In particular, the algorithm proposed in Tohka et al. (2007) assumes that the number of components is a priori known, and the work in Tawfick et al. (2008) is limited to hyperspherical clusters, that is, only to identity covariance matrices. The algorithm proposed in Nguyen and Cios (2008), named genetic algorithm *K*-means logarithmic regression expectation maximization (GAKREM), uses a medoid-based representation. For the fitness function, a regression model is employed. Specifically, after a small prespecified number of EM iterations, the regression coefficients are estimated and then used to predict the future likelihood. The uniform crossover is employed along with a mutation operator, which randomly creates or removes a cluster. GAKREM has some conceptual problems. For instance, the codification adopted is not suitable to represent a GMM because there is information loss when converting from the phenotype (GMM) to the genotype (medoids). Moreover, the lack of a refinement step to improve existing solutions along with the inclusion of at most one new individual per generation may cause the algorithm to have a slow convergence or even be unable to escape from local optima. Finally, the adopted fitness function is not based on a principled approach to estimate the number of clusters.

The genetic-based expectation maximization (GA-EM) (Pernkopf and Bouchaffra, 2005) uses a *cluster description-based representation* (Hruschka et al., 2009). The fitness function employed is the log-likelihood function (Equation (3)) along with a minimum description length penalty, defined in Section 4. A disadvantage of this algorithm is that its mutation and crossover operators are employed in a random fashion, favoring *random search* over *guided search*. There are also two critical parameters that the user needs to define and that may have a great impact on the algorithm’s behavior: a maximum correlation between the responsibilities of two Gaussians and the minimum sum of responsibilities that a component must have.

Aimed at overcoming the limitations of the foregoing algorithms, we focus on the combination of EAs with split and merge algorithms. A difficulty faced by EM-based algorithms in avoiding local optima comes from the fact that they are usually not able to move a component from a denser region to a sparser one without passing through positions that provide a lower likelihood value. The basic underlying idea of split and merge algorithms is to merge components in dense regions and split components in sparse regions (Ueda et al., 2000).

There has been in the last decade an increasing interest in split and merge algorithms (e.g., see Ueda et al., 2000; Zhang et al., 2003; Li and Li, 2009). We focus on the procedures proposed in Zhang et al. (2003) and also adopted in Li and Li (2009). Differently from the operations proposed in Ueda et al. (2000), which are based mainly on the linear combinations of the parameters, the merge and split operations described in Zhang et al. (2003) and Li and Li (2009) take into account the GMM formalism. More specifically, as observed in Zhang et al. (2003), it is of particular interest that the split and merge operators consider the relation between the parameters of the component, namely, mean and covariance matrix, in that they are the first and second moments of the distribution represented by the component, respectively. However, the algorithms proposed in Ueda et al. (2000) and Zhang et al. (2003) assume that the number of components is provided by the user. Although the approach described in Li and Li (2009) can estimate the number of components, such estimation is done by using greedy search-based procedures, which are frequently subject to local optima problems.

The pair of components to be merged is chosen by means of a correlation criterion (Ueda et al., 2000). Let *j* and *l* be two component indices. The correlation criterion is defined as

$$\mathrm{corr}(j, l) = \frac{\boldsymbol{\gamma}_j^{\mathsf{T}} \boldsymbol{\gamma}_l}{\lVert \boldsymbol{\gamma}_j \rVert \, \lVert \boldsymbol{\gamma}_l \rVert}, \tag{4}$$

where $\boldsymbol{\gamma}_j = (\gamma_{1j}, \dots, \gamma_{Nj})^{\mathsf{T}}$. The pair of components that has the greatest absolute correlation value is the chosen one. Assuming that $\gamma_{ij}$ is computed for all $i, j$, the computational cost of Equation (4) is $O(k^2 N)$.
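The correlation criterion amounts to normalized inner products between the columns of the responsibility matrix. A possible vectorized sketch (function name ours), with the $O(k^2 N)$ cost concentrated in one matrix product:

```python
import numpy as np

def merge_candidates(gamma):
    """Pairwise correlation between responsibility vectors, in the spirit of
    Eq. (4); gamma is N x k. Returns the pair (j, l) with the largest
    absolute correlation, together with its correlation value."""
    norms = np.linalg.norm(gamma, axis=0)
    corr = (gamma.T @ gamma) / np.outer(norms, norms)  # k x k, O(k^2 N)
    np.fill_diagonal(corr, 0.0)                        # ignore self-correlation
    j, l = np.unravel_index(np.argmax(np.abs(corr)), corr.shape)
    return (min(j, l), max(j, l)), corr[j, l]
```

Components whose responsibility profiles rise and fall together over the same objects are precisely the ones modeling the same dense region, which is why they are the natural merge candidates.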

The component to be split is chosen by means of the *local Kullback divergence* (LKD) (Ueda et al., 2000; Zhang et al., 2003), which, for the *j*th component, is defined as

$$D_{\mathrm{LKD}}(j; \Theta) = \int f_j(\mathbf{x}; \Theta) \log \frac{f_j(\mathbf{x}; \Theta)}{\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} \, d\mathbf{x}, \tag{5}$$

where

$$f_j(\mathbf{x}; \Theta) = \frac{\sum_{i=1}^{N} \delta(\mathbf{x} - \mathbf{x}_i) \, \gamma_{ij}}{\sum_{i=1}^{N} \gamma_{ij}}$$

is the empirical distribution modified by weighting the frequencies by the *posterior* probabilities, in such a way that objects closer to the *j*th component have more importance, and $\delta$ is the Kronecker function. Eq. (5) defines the distance between two distributions, that is, the local density of the data around the *j*th component, $f_j(\mathbf{x}; \Theta)$, and its respective density considering the current parameters (Ueda et al., 2000). A high value for $D_{\mathrm{LKD}}$ indicates a poor estimate of the local density. Thus, the component with the highest $D_{\mathrm{LKD}}$ is chosen to be split. Assuming that the normal probabilities and responsibilities were already computed (from the EM algorithm), the computational cost of this criterion is $O(kN)$.
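Because $f_j$ places mass only on the observed objects, the integral in Eq. (5) collapses to a sum over the data. A minimal sketch of the resulting per-component computation (function name ours), reusing the quantities already available from the EM pass:

```python
import numpy as np

def local_kullback_divergence(gamma_j, dens_j):
    """Empirical local Kullback divergence for one component, in the spirit of
    Eq. (5). gamma_j holds the responsibilities of component j for the N objects;
    dens_j holds N(x_i | mu_j, Sigma_j) at the N objects. Both are assumed
    precomputed by EM, so the cost is O(N) per component, O(kN) for all k."""
    f = gamma_j / gamma_j.sum()     # responsibility-weighted empirical density
    mask = f > 0                    # 0 * log(0) is taken as 0
    return np.sum(f[mask] * np.log(f[mask] / dens_j[mask]))
```

When the component's Gaussian matches the local responsibility-weighted density, the ratio inside the logarithm is close to 1 and the divergence is small; a large value flags the component as a split candidate.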

To merge components *j* and *l* and obtain *g*, the following operations were proposed by Zhang et al. (2003):

$$\pi_g = \pi_j + \pi_l,$$

$$\boldsymbol{\mu}_g = \frac{\pi_j \boldsymbol{\mu}_j + \pi_l \boldsymbol{\mu}_l}{\pi_g},$$

$$\boldsymbol{\Sigma}_g = \frac{\pi_j \left(\boldsymbol{\Sigma}_j + \boldsymbol{\mu}_j \boldsymbol{\mu}_j^{\mathsf{T}}\right) + \pi_l \left(\boldsymbol{\Sigma}_l + \boldsymbol{\mu}_l \boldsymbol{\mu}_l^{\mathsf{T}}\right)}{\pi_g} - \boldsymbol{\mu}_g \boldsymbol{\mu}_g^{\mathsf{T}}.$$

These operations can be seen as an extension of the ones proposed in Ueda et al. (2000), which aim at preserving the relation between the mean and the covariance matrix. The computational cost of performing all these operations is $O(M^2)$, where *M* is the number of features.
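The moment-preserving merge translates directly into code (a sketch; function name ours):

```python
import numpy as np

def merge_components(pi_j, mu_j, S_j, pi_l, mu_l, S_l):
    """Moment-preserving merge of components j and l into g: the merged
    component keeps the pair's total weight and its first and second moments."""
    pi_g = pi_j + pi_l
    mu_g = (pi_j * mu_j + pi_l * mu_l) / pi_g
    S_g = (pi_j * (S_j + np.outer(mu_j, mu_j)) +
           pi_l * (S_l + np.outer(mu_l, mu_l))) / pi_g - np.outer(mu_g, mu_g)
    return pi_g, mu_g, S_g   # O(M^2) overall
```

Note the sanity check this design enforces: merging two identical components returns the same component with doubled weight, so a merge can never distort a region the model already fits well.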

Now suppose that a component, *g*, should be split. As a result, assume that we obtain components *j* and *l*. A method that uses singular value decomposition (Golub and Van Loan, 1996) is described in Zhang et al. (2003). Initially, the covariance matrix, $\boldsymbol{\Sigma}_g$, is decomposed into three matrices as $\boldsymbol{\Sigma}_g = \mathbf{U} \mathbf{S} \mathbf{V}^{\mathsf{T}}$, where $\mathbf{S}$ is the matrix of singular values in decreasing order and $\mathbf{U}$ and $\mathbf{V}$ are two orthogonal matrices of eigenvectors (Golub and Van Loan, 1996). Having done that, the matrix $\mathbf{A} = \mathbf{U} \mathbf{S}^{1/2}$ is computed (Li and Li, 2009). Following Zhang et al. (2003), the problem of determining $\boldsymbol{\Sigma}_j$ and $\boldsymbol{\Sigma}_l$, given $\boldsymbol{\Sigma}_g$, can be transformed into the problem of finding $\mathbf{A}_j$ and $\mathbf{A}_l$, given $\mathbf{A}$. Considering that $\mathbf{a}_i$ is the *i*th column of matrix $\mathbf{A}$, the mixing coefficients, means, and the columns of $\mathbf{A}_j$ and $\mathbf{A}_l$ are then adjusted by perturbing the parameters of *g* along $\mathbf{a}_1$ (Zhang et al., 2003), with the magnitudes of the perturbations controlled by user-defined parameters.^{3} Although any other vector, $\mathbf{a}_i$, can be used in place of $\mathbf{a}_1$, this choice is justified because it corresponds to the first principal axis (Zhang et al., 2003). Because of the SVD decomposition, the computational cost of this splitting method is $O(M^3)$.
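To make the mechanics concrete, here is a deliberately simplified instantiation of an SVD-based split along the first principal axis. The exact update rules and the role of the user-defined parameters are those of Zhang et al. (2003); `alpha` and `beta` below are illustrative stand-ins, and the function name is ours.

```python
import numpy as np

def svd_split(pi_g, mu_g, S_g, alpha=0.5, beta=0.5):
    """Simplified SVD-based split: perturb the mean along a1, the first column
    of A = U S^(1/2), and shrink the covariance along that direction."""
    U, s, Vt = np.linalg.svd(S_g)                  # O(M^3) decomposition
    a1 = np.sqrt(s[0]) * U[:, 0]                   # first principal axis of A
    pi_j, pi_l = alpha * pi_g, (1.0 - alpha) * pi_g
    mu_j, mu_l = mu_g + beta * a1, mu_g - beta * a1
    S_split = S_g - beta ** 2 * np.outer(a1, a1)   # remove the explained spread
    return (pi_j, mu_j, S_split.copy()), (pi_l, mu_l, S_split.copy())
```

Splitting along $\mathbf{a}_1$ places the two new means where the component is most elongated, which is exactly where a single Gaussian is most likely to be covering two density centers.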

## 4 Evolutionary Split and Merge for Expectation Maximization

We propose an evolutionary algorithm that makes use of split and merge operations and is capable of estimating the number of components. Our algorithm is named evolutionary split and merge for expectation maximization (ESM-EM). Essentially, we combine some good properties of the GA-EM algorithm (Pernkopf and Bouchaffra, 2005) with split and merge procedures (described in Section 3). More specifically, differently from other algorithms in the literature (i.e., Pernkopf and Bouchaffra, 2005; Nguyen and Cios, 2008), we adopt context-specific operators guided by quality functions related to the fitness of the individual. Thus, our approach tends to favor a more (probabilistic) guided search in comparison to the existing approaches, whose search procedures are more random in nature.

In order to encode the individuals (also called genotypes or chromosomes), a representation is required that allows dealing with a variable number of clusters and that accurately identifies the GMM represented by the individual. From these observations and assuming that *k*_{max} is the maximum acceptable number of clusters, we adopted a cluster-based encoding (Hruschka et al., 2009) that, for our present case, involves the use of individuals that represent the GMM parameters ($\Theta$). Note that the apparent limitation of fixing the maximum number of clusters can be removed by using a variable-length encoding, if so desired.

The fitness (objective) function is based on the minimum description length (MDL)^{4} principle (Rissanen, 1989). More precisely, considering an individual that represents *k* components and $v(k) = k\left(M + \frac{M(M+1)}{2} + 1\right) - 1$ being the number of parameters in the GMM, the adopted objective function is given by

$$\mathrm{MDL}(\Theta(k)) = -\log p(\mathbf{X} \mid \Theta(k)) + \frac{v(k)}{2} \log N. \tag{14}$$

This function has been widely used for model selection (McLachlan and Peel, 2000; Figueiredo and Jain, 2002; Pernkopf and Bouchaffra, 2005).
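An MDL-style objective of this form is a one-liner once the log-likelihood is known. A minimal sketch (function name ours), with the parameter count covering means, covariances, and the free mixing coefficients:

```python
import numpy as np

def mdl_fitness(log_likelihood, k, M, N):
    """MDL-style objective in the spirit of Eq. (14): negative log-likelihood
    plus a (v/2) log N penalty, where v counts the free GMM parameters."""
    # k means (M each), k covariances (M(M+1)/2 each), k-1 free mixing weights
    v = k * (M + M * (M + 1) // 2 + 1) - 1
    return -log_likelihood + 0.5 * v * np.log(N)
```

The penalty grows linearly with *k*, so two individuals with similar fits are ranked in favor of the one using fewer components.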

The main steps of ESM-EM, summarized in Algorithm 2, are (1) initialization, step 2; (2) refinement, step 5; (3) individual selection for mutation and survival, steps 6 and 11; and (4) mutation operator selection and guidance, steps 8 and 9. In the next subsections we discuss each of these steps as well as possible components that can be used, giving rise to different variants of ESM-EM.

### 4.1 Initialization

Each individual is initialized to represent a clustering solution obtained by *k*-means using randomly selected objects as cluster seeds. We adopted a deterministic approach to select the number of clusters. Specifically, we initialized the *P*_{size} individuals with numbers of clusters linearly spaced between 2 and *k*_{max}, ensuring that the initial population contains solutions spread evenly over the possible numbers of clusters. Differently from Pernkopf and Bouchaffra (2005), who if necessary enlarged the initial population to ensure that there would be exactly one individual for each possible number of clusters from 2 to *k*_{max}, the proposed initialization aims at providing a trade-off between data partition diversity and computational cost.
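The deterministic spacing of the initial cluster counts can be sketched in one line (function name ours):

```python
import numpy as np

def initial_cluster_counts(p_size, k_max):
    """Numbers of clusters for the initial population: p_size values linearly
    spaced between 2 and k_max (duplicates may occur when p_size > k_max - 1)."""
    return np.linspace(2, k_max, p_size).round().astype(int)
```

Each individual then runs *k*-means with its assigned count, so no region of the [2, *k*_{max}] range is left unexplored at generation zero.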

### 4.2 Refinement

In step 5 of Algorithm 2, ESM-EM runs a number (*t*_{EM}) of EM iterations on each individual of the population, and the individuals are assessed by the objective function in Eq. (14). Crossover operators are not used in ESM-EM, but the refinement over the solutions of each individual can be seen as a local search procedure, favoring exploitation over the search space.

### 4.3 Selection for Mutation and Survival

The well-known $(\mu + \lambda)$-selection (Bäck et al., 2000) is adopted to determine the individuals to be mutated and the ones that will survive to the next generation. In our case, $\mu = \lambda = P_{\text{size}}$. In this sense, at each generation, $P_{\text{size}}$ offspring are generated by mutating each individual of the current population. The best $P_{\text{size}}$ individuals of the population formed by parents and offspring survive and form the next generation; a similar approach was used by Pernkopf and Bouchaffra (2005). There are two main reasons to adopt this procedure: (1) since the EM algorithm is known to converge slowly (Redner and Walker, 1984), a steadier approach favors the cumulative refinement needed for a better fit, and (2) given that our operators are relatively aggressive, in the sense that they may perform big jumps in the solution space, the major drawback of a more conservative selection, premature convergence, is lessened.
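One generation of this selection scheme can be sketched as follows (a minimal sketch; `mutate` and `fitness` are assumed callables, and fitness is minimized as in Eq. (14)):

```python
def mu_plus_lambda_select(population, mutate, fitness, p_size):
    """(mu + lambda)-selection with mu = lambda = p_size: every parent is
    mutated once, and the best p_size of parents + offspring survive."""
    offspring = [mutate(ind) for ind in population]
    pool = population + offspring          # parents compete with offspring
    return sorted(pool, key=fitness)[:p_size]
```

Because parents stay in the pool, a destructive mutation can never discard a good solution, which is the "steadier" behavior argued for above.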

### 4.4 Mutation Operator Selection and Guidance

Following the taxonomy presented in Hruschka et al. (2009), the ESM-EM mutation operators are both *guided* and *cluster-oriented*. In brief, cluster-oriented operators, such as operators that copy, split, merge, and eliminate clusters, are task-dependent, in contrast to conventional evolutionary operators, which only exchange or switch bits with no regard to their task-dependent meaning. Guided operators are guided by some type of information about the quality of individual clusters, the quality of the overall data partition, or their performance in previous applications, such as operators that are more likely to be applied to poor-quality clusters and operators whose probability of application is proportional to their success (or failure) in previous generations.

The ESM-EM mutation operators^{5} consist of the application of the merge and split operations. Specifically, for merging components, we consider a procedure from Zhang et al. (2003) that is described in Algorithm 3. For splitting components, two operators are considered: the SVD-based operator (Zhang et al., 2003), described in Algorithm 4, and a new *variance-based splitting* (VS) operator.

The main idea of the VS operator is to select the feature that will determine the directions of the new means, with probability directly proportional to the feature's variance. After these means have been obtained, *k*-means is applied to estimate the remaining parameters (covariance matrices and mixing coefficients) and to update the means. For *k*-means, only the objects more likely to be generated by the component are considered, and a maximum number of iterations ($t_{\text{KM}}$) is performed. The main steps of the operator are described in Algorithm 5. The computational cost of this operator is $O(t_{\text{KM}} N M + N M^2)$. VS also has some extra advantages over the SVD-based operator, whose complexity is cubic in *M*. In particular, the new components' parameters are, to some extent, better fitted to the data in comparison to the SVD-based operator. Also, the SVD decomposition can present numerical issues with high-dimensional covariance matrices; replacing it with a simpler operation can thus be useful in such scenarios. Moreover, unlike the SVD-based approach, ours does not have user-defined parameters.
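The variance-proportional choice of the split direction, the first step of the operator, can be sketched as follows (function name ours; the subsequent *k*-means refinement of Algorithm 5 is omitted):

```python
import numpy as np

def variance_based_direction(S, rng=None):
    """Variance-based choice of the split direction: a feature is drawn with
    probability proportional to its variance, and the new means are offset by
    one standard deviation along that feature (the k-means refinement then
    updates means, covariances, and mixing coefficients)."""
    rng = rng or np.random.default_rng()
    var = np.diag(S)
    f = rng.choice(len(var), p=var / var.sum())   # P(feature) proportional to variance
    direction = np.zeros(len(var))
    direction[f] = np.sqrt(var[f])                # one standard deviation
    return f, direction
```

Only the diagonal of the covariance matrix is touched, which is what makes the operator cheap and numerically benign in high dimensions.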

To illustrate the application of these operators, consider the scenario depicted in Figure 1(a), which corresponds to a mixture with two components fitted to data generated by a three-component Gaussian mixture. The larger cluster on the left-hand side is the most probable cluster to be mutated according to both measures, PLL and LKD. This is expected, as it models an area where there are clearly two centers of density apart from each other. Let this cluster be *C*_{1}. The resulting mixture after applying the SVD operator on this component is shown in Figure 1(b). Now, consider the application of the VS operator on the same cluster. The variance of the second feature of *C*_{1} accounts for 60% of the total variance; thus, the second feature has a 60% chance of being selected as the direction for the new means. Assuming this feature has been chosen, the corresponding standard deviation defines the offset, that is, the two initial prototypes for *k*-means are the old center moved three units up and three units down, respectively. After five iterations of *k*-means on the objects for which *C*_{1} was the component with the largest responsibility, the resulting GMM is better fitted to the data, as shown in Figure 1(c).

In order to choose the mutation operator (either split or merge) to be used, we followed the approach described in Horta and Campello (2009). If the fitness of a particular individual has improved from generation *t* to generation *t* + 1, then the same operator applied in the previous generation is reapplied, whereas if the fitness of the individual has worsened, the other operator is used. In other cases (e.g., first generation), the operators are randomly selected.
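This win-stay, lose-shift scheduling rule can be sketched in a few lines (function name ours; fitness is minimized):

```python
import random

def next_operator(prev_op, prev_fitness, curr_fitness):
    """Operator scheduling between generations: keep the operator that improved
    the (minimized) fitness, switch after a worsening, pick randomly otherwise."""
    if prev_op is None or curr_fitness == prev_fitness:
        return random.choice(("split", "merge"))   # first generation / tie
    if curr_fitness < prev_fitness:                # improvement: reapply
        return prev_op
    return "merge" if prev_op == "split" else "split"
```

The rule needs no extra parameters, which keeps the operator-selection layer consistent with the paper's goal of avoiding critical user-defined settings.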

An alternative criterion for choosing the component to be split is the *partial log-likelihood* (PLL), that is, the portion of the log-likelihood attributable to a given component. There are several advantages to using the partial log-likelihood instead of the LKD. For instance, the partial log-likelihood is part of the fitness function, thereby establishing a direct synergy between mutation and adaptation. The PLL values are already computed when assessing the fitness of each individual, thus straightforwardly favoring an efficient implementation. For simplicity, we used the negative log-likelihood, so that both PLL and LKD provide high values for clusters that should be split.

Another option for selecting the pair of components to be merged is an overlap measure based on most probable assignments. Let $A_{ij}$ be the set of objects for which *i* and *j* are the indices of the first and second most probable Gaussians, respectively, that is, $A_{ij} = \{\mathbf{x}_l \in \mathbf{X} : i = \arg\max_c \gamma_{lc} \text{ and } j = \arg\max_{c \neq i} \gamma_{lc}\}$. Analogously, let $A_{ji}$ be the set of objects for which *j* and *i* are the first and second most probable Gaussians, respectively. Then, the overlap measure can be defined as

$$\mathrm{overlap}(i, j) = \frac{|A_{ij}| + |A_{ji}|}{N}. \tag{16}$$

Pairs of clusters with higher overlap values are more likely to be merged. The main difference between the correlation and overlapping measures is that the latter focuses on the objects that would be assigned to each cluster under a most probable cluster scheme. The computational cost of the overlapping measure is $O(kN + k^2)$.
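The overlap measure reduces to counting, for each object, its two most responsible components. A possible sketch (function name ours):

```python
import numpy as np

def overlap_matrix(gamma):
    """Pairwise overlap in the spirit of Eq. (16): the fraction of objects whose
    two most responsible components are a given pair. gamma is N x k; the cost
    is dominated by the per-object sort over the k responsibilities."""
    N, k = gamma.shape
    order = np.argsort(-gamma, axis=1)   # components by decreasing responsibility
    first, second = order[:, 0], order[:, 1]
    O = np.zeros((k, k))
    for i, j in zip(first, second):
        O[min(i, j), max(i, j)] += 1.0 / N   # accumulate in the upper triangle
    return O
```

Because every object contributes to exactly one pair, the upper-triangular entries of the returned matrix sum to 1, which makes overlap values directly comparable across pairs.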

### 4.5 Note on Computational Complexity

As expected, the overall computational complexity of ESM-EM depends on the choice of the operators just discussed. However, the dominant cost of ESM-EM is attributable to running the EM algorithm. From this perspective, the overall computational cost of ESM-EM is estimated as $O(G P_{\text{size}} (t_{\text{EM}} k_{\max} N M^2 + k_{\max}^2 N))$, where *G* is the number of generations, $t_{\text{EM}}$ is the number of iterations of EM performed at each refinement step, the first term of the sum is the EM cost, and the last term arises because of the computation of the probabilities of merging each pair of clusters in each generation. This cost is the same as that of the GA-EM algorithm (Pernkopf and Bouchaffra, 2005) and similar to that of GAKREM^{6} (Nguyen and Cios, 2008). Assuming that the user has no a priori knowledge of the range for the number of clusters, which is the scenario where evolutionary algorithms are more applicable, the common rule of thumb of $k_{\max} = O(\sqrt{N})$ can be employed (Campello et al., 2009). In this case the overall computational cost can be simplified to $O(G P_{\text{size}} (t_{\text{EM}} N^{3/2} M^2 + N^2))$, as $k_{\max}^2 N = O(N^2)$. Although this cost is higher than that of multiple runs of EM (see Section 2), there is empirical evidence that the additional (asymptotic) complexity per iteration, necessary to guide the search for better data partitions, actually provides better solutions in shorter running times (see Section 5).

### 4.6 Summary of ESM-EM Variants

The adoption of the different components described in the previous section gives rise to different variants of the ESM-EM algorithm. In order to refer to such variants, we use combinations of the operator acronyms. For instance, VS-OVER-PLL refers to the algorithm obtained using the variance-based splitting operator (VS), the overlapping measure (OVER) to indicate clusters to be merged, and the partial log-likelihood function (PLL) to indicate the component to be split. A summary of all the assessed variants is provided in Table 1. For simplicity, the components that are common to all variants, namely initialization (Section 4.1), $(\mu + \lambda)$-selection, and the merging operator (Algorithm 3), are not mentioned.

**Table 1:** Summary of the assessed ESM-EM variants.

| Name | Split Operator | Split Measure | Merge Measure |
|---|---|---|---|
| SVD | SVDS — Alg. 4 | LKD — Eq. (5) | Correlation — Eq. (4) |
| OVER | SVDS — Alg. 4 | LKD — Eq. (5) | Overlapping — Eq. (16) |
| VS | VS — Alg. 5 | LKD — Eq. (5) | Correlation — Eq. (4) |
| PLL | SVDS — Alg. 4 | PLL — Eq. (15) | Correlation — Eq. (4) |
| VS-OVER | VS — Alg. 5 | LKD — Eq. (5) | Overlapping — Eq. (16) |
| VS-PLL | VS — Alg. 5 | PLL — Eq. (15) | Correlation — Eq. (4) |
| OVER-PLL | SVDS — Alg. 4 | PLL — Eq. (15) | Overlapping — Eq. (16) |
| VS-OVER-PLL | VS — Alg. 5 | PLL — Eq. (15) | Overlapping — Eq. (16) |


## 5 Empirical Results

We used 35 data sets for the assessment of the different variants of ESM-EM. Three are well known in the literature: (1) *3-components* (Ueda and Nakano, 1998), formed by 900 objects distributed over three bi-dimensional Gaussian components; (2) *9Gauss* (Campello et al., 2009), which has 900 objects from nine overlapping bi-dimensional Gaussian components; and (3) *Pendigits*, a classification data set composed of handwritten digit images (Asuncion and Newman, 2007). For the latter, we considered only the classes representing digits 0, 1, 2, 3, and 4, thus having 5,629 objects. Similarly to what was done by Pernkopf and Bouchaffra (2005), we also performed PCA, which reduced the number of features from 16 to 2. We also considered an extended version of 9Gauss (Campello et al., 2009), called 49Gauss, which has overlapping areas similar to those of 9Gauss but with 49 Gaussian components and 4,900 objects. Besides these data sets, we also conducted experiments with seven other data sets that can be modeled by means of GMMs, even though they were not actually generated from GMMs; Figure 2 illustrates these data sets. The remaining 24 data sets are synthetic data sets generated according to two well-known criteria for component separation: the *c*-separation criterion (Dasgupta, 1999) and the pairwise overlap criterion (Maitra and Melnykov, 2010).

A summary of the main characteristics of the data sets used in our comparisons is provided in Table 2. An important remark about the *c*-separation criterion is that it acts as a lower bound on the complexity of the data set. Thus, a mixture generated as $c_1$-separated^{7} can turn out to be less complex than one generated as $c_2$-separated, even for $c_1 < c_2$. To alleviate this, we used the approach described in Maitra (2009). Specifically, we restricted the components so that at least one pair had a separation of exactly *c*. Figure 3 shows an example of the separation values for the Gaussian pairs of a GMM according to the *c*-separation and overlap criteria, respectively. We used three different separation values for each criterion. More specifically, we adopted $c \in \{0.5, 1, 2\}$ and $\bar{\omega} \in \{0.001, 0.01, 0.1\}$, where $\bar{\omega}$ is the average overlap for all pairs of components. For each of these, we generated data sets with $k \in \{4, 12\}$ and $M \in \{4, 8\}$, and $N = 100k$ samples (objects). Table 3 shows the separation values for each generated data set. We thus have data sets for both *well separated* and *poorly separated* configurations, as discussed in Dasgupta (1999) and Maitra and Melnykov (2010).

Table 2: Main characteristics of the data sets (N: number of objects; M: number of features; k: number of components).

| Data Set | N | M | k | Data Set | N | M | k |
|---|---|---|---|---|---|---|---|
| 9Gauss | 900 | 2 | 9 | c1.0-k12M4 | 1,200 | 4 | 12 |
| 49Gauss | 4,900 | 2 | 49 | c2.0-k12M4 | 1,200 | 4 | 12 |
| 3-components | 900 | 2 | 3 | c0.5-k12M8 | 1,200 | 8 | 12 |
| Triangle | 500 | 2 | 5 | c1.0-k12M8 | 1,200 | 8 | 12 |
| Letter A | 300 | 2 | 3 | c2.0-k12M8 | 1,200 | 8 | 12 |
| Letter B | 400 | 2 | 5 | 0.1-k4M4 | 400 | 4 | 4 |
| Letter M | 400 | 2 | 4 | 0.01-k4M4 | 400 | 4 | 4 |
| Letter Y | 300 | 2 | 3 | 0.001-k4M4 | 400 | 4 | 4 |
| In X | 500 | 2 | 5 | 0.1-k4M8 | 400 | 8 | 4 |
| Two bars | 300 | 2 | 3 | 0.01-k4M8 | 400 | 8 | 4 |
| Pendigits | 5,629 | 2 | — | 0.001-k4M8 | 400 | 8 | 4 |
| c0.5-k4M4 | 400 | 4 | 4 | 0.1-k12M4 | 1,200 | 4 | 12 |
| c1.0-k4M4 | 400 | 4 | 4 | 0.01-k12M4 | 1,200 | 4 | 12 |
| c2.0-k4M4 | 400 | 4 | 4 | 0.001-k12M4 | 1,200 | 4 | 12 |
| c0.5-k4M8 | 400 | 8 | 4 | 0.1-k12M8 | 1,200 | 8 | 12 |
| c1.0-k4M8 | 400 | 8 | 4 | 0.01-k12M8 | 1,200 | 8 | 12 |
| c2.0-k4M8 | 400 | 8 | 4 | 0.001-k12M8 | 1,200 | 8 | 12 |
| c0.5-k12M4 | 1,200 | 4 | 12 | | | | |


Table 3: Separation values (c and pairwise overlap ω) for the generated data sets.

| Data Set | min(c) | mean(c) | max(c) | mean(ω) | max(ω) |
|---|---|---|---|---|---|
| c0.5-k4M4 | 0.50 | 1.90 | 2.80 | 0.117 | 0.554 |
| c1.0-k4M4 | 1.00 | 2.70 | 3.30 | 0.035 | 0.206 |
| c2.0-k4M4 | 2.00 | 2.80 | 3.90 | 0.006 | 0.021 |
| c0.5-k4M8 | 0.50 | 1.40 | 1.80 | 0.055 | 0.311 |
| c1.0-k4M8 | 1.00 | 1.90 | 2.20 | 0.013 | 0.077 |
| c2.0-k4M8 | 2.00 | 3.40 | 4.00 | 0.000 | 0.000 |
| c0.5-k12M4 | 0.50 | 2.90 | 4.00 | 0.017 | 0.549 |
| c1.0-k12M4 | 1.00 | 2.90 | 4.00 | 0.014 | 0.236 |
| c2.0-k12M4 | 2.00 | 2.90 | 4.00 | 0.004 | 0.022 |
| c0.5-k12M8 | 0.50 | 2.10 | 4.00 | 0.007 | 0.312 |
| c1.0-k12M8 | 1.00 | 3.10 | 4.00 | 0.001 | 0.061 |
| c2.0-k12M8 | 2.00 | 3.20 | 4.00 | 0.000 | 0.000 |
| 0.1-k4M4 | 0.60 | 1.10 | 1.50 | 0.100 | 0.239 |
| 0.01-k4M4 | 1.20 | 1.50 | 2.10 | 0.010 | 0.023 |
| 0.001-k4M4 | 1.70 | 1.90 | 2.10 | 0.001 | 0.003 |
| 0.1-k4M8 | 0.40 | 0.70 | 0.90 | 0.100 | 0.241 |
| 0.01-k4M8 | 0.90 | 1.20 | 1.60 | 0.010 | 0.036 |
| 0.001-k4M8 | 1.20 | 1.60 | 2.00 | 0.001 | 0.005 |
| 0.1-k12M4 | 0.10 | 1.00 | 1.90 | 0.100 | 0.502 |
| 0.01-k12M4 | 0.70 | 2.40 | 5.50 | 0.010 | 0.176 |
| 0.001-k12M4 | 1.10 | 5.50 | 10.10 | 0.001 | 0.065 |
| 0.1-k12M8 | 0.30 | 0.70 | 1.00 | 0.100 | 0.300 |
| 0.01-k12M8 | 0.60 | 1.30 | 2.10 | 0.010 | 0.102 |
| 0.001-k12M8 | 0.80 | 1.80 | 2.60 | 0.001 | 0.020 |


We report two kinds of experimental analyses. First, in Section 5.1, we assess the different ESM-EM variants discussed in Section 4. Based on the obtained results, we then selected the best-performing ESM-EM variant and compared it with the state-of-the-art algorithm proposed in Pernkopf and Bouchaffra (2005). In these comparisons, we also used OMR-EM (Algorithm 1) as a baseline method. These analyses are described in Section 5.2.

### 5.1. ESM-EM Variants

We report the average over 10 runs of each ESM-EM variant on each data set. Their parameters were set as follows: (1) the stopping criterion (*S*_{C}) is a maximum of 50 generations or 10 consecutive generations without improvement in the best fitness; (2) the population size (*P*_{size}) is equal to 10; (3) the maximum number of EM iterations (*t*_{EM}) is equal to 5; (4) the maximum number of *k*-means iterations (*t*_{KM}, in the VS operator) is equal to 5; (5) the parameters of the SVDS operator are equal to 0.5 (the same values used in Zhang et al., 2003); (6) the maximum number of clusters (*k*_{max}) is set to a common value for all variants. Table 4 shows the average ranks obtained by each variant. We summarize the results for three criteria: (1) the objective function (MDL) value; (2) the number of EM iterations performed during the evolutionary search; (3) the accuracy in estimating the number of clusters. The ranks were derived by sorting, for each data set, the values obtained by each algorithm for a given criterion, and then averaging the resulting ranks over all data sets. For instance, if a given algorithm provided the best MDL value on every data set, its average rank would be 1. Considering the MDL values, the variant that combines variance-based splitting with the partial log-likelihood function (VS-PLL) presented better results more consistently, that is, with the smallest average rank. Since MDL is the criterion of main interest in this study, if one had to choose among these variants we would recommend VS-PLL, which is also the method of choice if one seeks the best trade-off among all criteria.

Table 4: Average ranks of the ESM-EM variants according to the MDL value, the number of EM iterations (EMs), and the accuracy in estimating the number of clusters (k).

| Variant | MDL | EMs | k | Average Rank |
|---|---|---|---|---|
| SVD | 5.03 | 3.57 | 4.99 | 4.53 |
| OVER | 4.50 | 4.40 | 4.79 | 4.56 |
| VS | 4.63 | 4.77 | 5.20 | 4.87 |
| PLL | 4.31 | 4.94 | 4.16 | 4.47 |
| VS-OVER | 4.56 | 4.39 | 4.61 | 4.52 |
| VS-PLL | 4.01 | 3.83 | 3.84 | 3.90 |
| OVER-PLL | 4.36 | 4.74 | 4.86 | 4.65 |
| VS-OVER-PLL | 4.60 | 5.36 | 3.56 | 4.50 |


After making these general observations, we assess the algorithms by inspecting their results on the more complex data sets. To verify the quality of the GMMs obtained by the different variants, we also assess the log-likelihood on a held-out sample of 10,000 objects drawn from the ground-truth GMM. Figures 4–7 show the per-object differences in held-out log-likelihood between each variant and SVD, which was chosen as the baseline for comparison purposes. From this perspective, the smaller the value, the better the variant. In each figure, the differences correspond to a specific separation criterion and a specific number of clusters. For instance, in Figure 4 the two plots in the top row refer to the case where the *c*-separation value equals 0.5, whereas those in the bottom row refer to *c*-separation equal to 2. The two left-hand plots refer to the data sets with four features (M = 4), whereas the two right-hand plots refer to the data sets with eight features (M = 8). In each plot, the bars correspond to the differences obtained by each variant.

Some key observations can be drawn from Figures 4–7. First, when the numbers of features and clusters are low (k = 4, M = 4), there are few differences between the models obtained by the variants, except in the c = 0.5 case. This can be explained by two factors: (1) the search space does not have many local optima, so the different operators can all reach the same (best) solution; (2) the data set generated with c = 0.5 is more complex than the others because it contains a highly overlapping pair of components (see max(ω) in Table 3), and is thus likely to have a larger number of local optima. By increasing only the number of features (M = 8), differences between the variants emerge more frequently, and all variants were able to obtain a better model than SVD in at least five out of six data sets. In the opposite scenario (increasing only the number of clusters: k = 12), the differences among the models are generally negligible (log-likelihood differences smaller than 0.01). A common thread in these results is that most variants were able to reach the same (best) solution, except for SVD. This suggests that, in such cases, choosing the simplest (fastest) variant would be enough to have a good chance of escaping local optima. For the most appealing scenario, with high numbers of features and clusters (k = 12, M = 8), this does not seem to hold. Specifically, the VS-PLL variant (the one that, as previously discussed, provided the best results on average over all evaluation criteria) obtained models equal to or better than those of the other variants on all data sets. Considering all the collected empirical evidence, we conclude that among the ESM-EM variants assessed, VS-PLL is the most promising one; we therefore chose it for further comparisons with related algorithms.
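The held-out quantity underlying Figures 4–7 is simply the average log-likelihood per object of each fitted model on the 10,000-object sample. The following sketch computes it for a GMM; for brevity it is restricted to diagonal covariances (the function and argument names are ours), whereas the paper uses full covariance matrices.

```python
import math

def heldout_loglik_per_object(X, weights, means, variances):
    """Average log-likelihood per object under a diagonal-covariance GMM.

    X: list of M-dimensional points; weights: mixing proportions;
    means/variances: one list of M values per component. Subtracting this
    quantity for two models gives a per-object held-out difference of the
    kind plotted in Figures 4-7.
    """
    total = 0.0
    for x in X:
        # Mixture density: sum over components of weight * Gaussian density.
        density = sum(
            w * math.exp(sum(
                -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
                for xi, m, v in zip(x, mu, var)))
            for w, mu, var in zip(weights, means, variances))
        total += math.log(density)
    return total / len(X)
```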

### 5.2. Comparisons with Related Algorithms

In this section we compare the best variant of ESM-EM, namely, VS-PLL, with the state-of-the-art GA-EM algorithm (Pernkopf and Bouchaffra, 2005). For the sake of simplicity, hereafter we use ESM-EM to refer to this variant. In our comparisons, we also use the OMR-EM algorithm (Section 2) as a baseline, because it has been widely used in practice.

Some of the GA-EM parameters are common to other EAs, such as the mutation and crossover probabilities. GA-EM also has additional parameters, such as the maximum correlation threshold and the minimum amount of likelihood for a Gaussian component. These parameters are hard to set in practice, especially when using EAs for clustering problems. Under controlled experiments, however, we can (optimistically) adopt a fine-tuning procedure to obtain good values for each parameter and then proceed with our comparisons. To that end, we adopted the SPOT framework (Bartz-Beielstein et al., 2005) to tune the GA-EM parameters. Specifically, the following parameters were optimized, each within a predefined range: (1) the annihilation threshold; (2) the mutation probability; (3) the crossover probability; (4) the correlation threshold. The population size, number of EM iterations, and maximum number of clusters were set to 10, 5, and the same maximum used for ESM-EM, respectively. For simplicity, only the best results obtained by GA-EM are reported. We emphasize, however, that such results were obtained only after considerable tuning and computational effort for each data set. This (computationally intensive) fine-tuning procedure could hardly be performed in practical applications.

The ESM-EM algorithm does not have critical parameters. More precisely, all of them (i.e., population size, maximum number of iterations for EM/*k*-means, and maximum number of clusters) can be defined according to the available computational resources. For instance, with a smaller population size, the algorithm is expected to require more time to converge (Naldi et al., 2011). Similarly, the number of EM/*k*-means iterations can be set by considering the available running time. Regarding the maximum number of clusters, the user may have expectations about it and set this value accordingly. If this is not the case, and assuming that the minimum number of clusters is two, we note that the higher the maximum number of clusters, the longer the time needed to explore and exploit the search space. The user can therefore set a range for *k* by taking into account either computational limitations or expectations about the data partition to be induced. In this sense, ESM-EM can be seen as an improvement over GA-EM: its parameters need only be set according to the available computational resources rather than fine-tuned for the learning process, which makes it quite handy in practice. From a more theoretical viewpoint, since the ESM-EM operators are guided split and merge operations, ESM-EM can be deemed a more principled algorithm than GA-EM, whose operators perform a more random search.

We assessed the algorithms (ESM-EM, GA-EM, and OMR-EM) considering the trade-off between model quality and running time. To perform fair running time comparisons, all of them were implemented in MATLAB and run on the same computer (Opteron 2 GHz, 24 GB of RAM) executing only the operating system in parallel.

A relevant practical aspect to be illustrated is that using an EA to optimize the parameters of a GMM, including the estimation of the model structure, can be computationally more efficient and more accurate than multiple repetitions of EM. To the best of our knowledge, this aspect has not been properly addressed in the literature. To do so, we initially ran OMR-EM with *n*_{p} = 10 and saved the execution time of each complete pass over the number of clusters, as well as the best model found so far. Thus, we had the computational cost and the quality of the obtained model for each value of *n*_{p}. Then, we ran the EAs (ESM-EM and GA-EM) until their running times reached the computational cost required by OMR-EM, saving the intermediate solutions. That way, we could compare the best solutions found by each algorithm under a given time constraint, considering the time spent by OMR-EM with different values of *n*_{p}. In other words, we used *n*_{p} as the time unit measure. Note that we do not count the time spent on parameter tuning for GA-EM as part of the reported running times; in this sense, GA-EM's results can be considered very optimistic.

Table 5 summarizes the overall performance of each algorithm by reporting the average rankings over 10 trials across all data sets. For instance, the smallest value obtained by GA-EM for *n*_{p} = 1 (1.47 in the second row, fourth column) indicates that, averaging over all data sets, GA-EM obtained the smallest MDL values under that time constraint. From these results, one can note a consistent ranking of the three algorithms for *n*_{p} ≤ 7: GA-EM, ESM-EM, and OMR-EM. For *n*_{p} ≥ 8, ESM-EM obtains the best rankings. One can also see that the EAs can be computationally more efficient than multiple repetitions of EM. Finally, ESM-EM, which does not have critical parameters, is competitive with GA-EM even when GA-EM's parameters have been fine-tuned beforehand. As discussed, this fine-tuning procedure requires both some a priori knowledge about the learning problem (usually not available in practice) and considerable computational effort. As expected, as more computational resources become available, that is, as *n*_{p} increases, the difference between OMR-EM and the EAs is reduced, since all algorithms are given enough time (or chances, in the case of OMR-EM) to reach or approach the global optimum.
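The time-budget protocol used for these comparisons can be sketched as follows: each algorithm logs (elapsed time, MDL) checkpoints, and the EAs are evaluated at the elapsed times OMR-EM needed for each complete pass. The function and variable names below are illustrative, not the paper's implementation.

```python
def best_under_budgets(trace, budgets):
    """Best (smallest) MDL an algorithm reached within each time budget.

    trace: chronologically sorted (elapsed_seconds, mdl) checkpoints logged
           during one run of an algorithm.
    budgets: elapsed times of OMR-EM after each complete pass (n_p = 1, 2, ...),
             used here as the time unit for comparison.
    Returns one best-so-far MDL per budget, or None if no checkpoint fits.
    """
    results = []
    for b in budgets:
        feasible = [mdl for t, mdl in trace if t <= b]
        results.append(min(feasible) if feasible else None)
    return results

# Example: an EA run with three checkpoints, compared at two OMR-EM budgets.
print(best_under_budgets([(1.0, 5.2), (3.0, 4.8), (9.0, 4.1)], [2.0, 10.0]))
# -> [5.2, 4.1]
```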

Table 5: Average rankings (over 10 trials and all data sets) of the MDL values obtained by each algorithm under the time constraint given by *n*_{p}.

| n_{p} | OMR-EM | ESM-EM | GA-EM |
|---|---|---|---|
| 1 | 2.51 | 2.01 | 1.47 |
| 2 | 2.51 | 1.83 | 1.66 |
| 3 | 2.46 | 1.89 | 1.66 |
| 4 | 2.41 | 1.90 | 1.69 |
| 5 | 2.30 | 1.93 | 1.77 |
| 6 | 2.24 | 1.94 | 1.81 |
| 7 | 2.16 | 1.94 | 1.90 |
| 8 | 2.14 | 1.89 | 1.97 |
| 9 | 2.14 | 1.87 | 1.99 |
| 10 | 2.14 | 1.90 | 1.96 |


We now address some specific results of particular interest. Figure 8 shows representative results obtained on six data sets (the results for the other data sets are omitted due to space constraints). In these graphs, the *y*-axis shows the MDL value divided by the number of objects, and the *x*-axis represents the number of repetitions, *n*_{p}, performed by OMR-EM.

For the 49Gauss data set (Figure 8(a)), ESM-EM outperformed both OMR-EM and GA-EM. Similar results were observed for 9Gauss. Even at small values of *n*_{p}, ESM-EM obtained better results with smaller variances. As such a time budget can be considered very small in practice, especially for a problem with 4,900 objects, the computational efficiency of ESM-EM is appealing. In Pendigits (Figure 8(b)), OMR-EM shows higher variances for small values of *n*_{p} than the EAs. The EAs' capability to consistently generate good solutions in such a short time is again evidenced, and once more ESM-EM provided better solutions than the (fine-tuned) GA-EM. The results for the k12M4 (Figure 8(c)) and k12M8 (Figure 8(d)) data sets show a similar trend, but with GA-EM showing equal or slightly better results than ESM-EM. Again, since ESM-EM does not have critical parameters, it can be more attractive than GA-EM even in such scenarios. For the k12M4 data set in Figure 8(e), ESM-EM was able to obtain good solutions in a short amount of time and competitive results when more processing time was available; similar results were obtained for the corresponding k12M8 data set. The results for the k12M8 data set in Figure 8(f) exemplify how inefficient OMR-EM can be compared to evolutionary (guided) search. A similar trend can be observed when comparing ESM-EM and GA-EM: once more ESM-EM converged quickly, and in this case no other algorithm was able to obtain better solutions.
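For reference, the MDL score plotted on the *y*-axis of Figure 8 (before division by the number of objects) takes, in its usual form for GMMs, the value -log L + (p/2) log N, where p is the number of free parameters. The sketch below uses our own parameter count for unconstrained (full) covariance matrices; the paper's exact constant may differ.

```python
import math

def gmm_free_parameters(k, M):
    """Free parameters of a k-component GMM over M features with full
    covariances: (k - 1) mixing weights, k mean vectors, and k symmetric
    M-by-M covariance matrices (M*(M+1)/2 entries each)."""
    return (k - 1) + k * M + k * M * (M + 1) // 2

def mdl_score(log_likelihood, k, M, N):
    """MDL in its usual form for a GMM fitted to N objects (lower is better)."""
    return -log_likelihood + 0.5 * gmm_free_parameters(k, M) * math.log(N)
```

Dividing `mdl_score(...)` by N gives the per-object quantity on the *y*-axis; the (p/2) log N penalty is what lets MDL trade data fit against the number of components k.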

## 6. Final Remarks

This paper has described the evolutionary split and merge for expectation maximization (ESM-EM) algorithm, which is based on the use of split and merge operations to evolve Gaussian mixture models (GMMs). Along with the algorithm and its asymptotic complexity analysis, different variants of ESM-EM were proposed by considering different split and merge operations. In particular, besides adapting and evaluating some principled approaches from the literature, we proposed a new splitting procedure, named variance-based splitting (VS), and a measure of overlap between components (OVER), and we analyzed the use of the partial log-likelihood (PLL) to indicate clusters that should be split.

We performed an extensive empirical evaluation of ESM-EM and its variants on 35 data sets. The ESM-EM variant based on the proposed VS and PLL approaches consistently provided the best results. We compared this variant with both the ordered multiple runs of expectation maximization (OMR-EM) algorithm and the genetic-based expectation maximization (GA-EM) algorithm (Pernkopf and Bouchaffra, 2005). Our experimental results showed that evolutionary algorithms can be computationally more efficient than OMR-EM, which is widely used in practice. To the best of our knowledge, such a systematic experimental evaluation of this practically relevant aspect had not been performed before, so this is also a contribution of our work. Moreover, the proposed ESM-EM provided results competitive with those of GA-EM even when GA-EM's critical parameters were fine-tuned a priori through a computationally intensive procedure. Since ESM-EM does not have critical parameters, it can be considered an improvement over GA-EM. Taking all the obtained results into account, we believe that ESM-EM is eligible to join the pool of state-of-the-art algorithms for learning GMMs in practice.

As future work, improvements on the mutation operators are promising, for instance, performing *k*-means on the components obtained by the SVDS operator.

## Acknowledgments

This work was supported by NSF Grants (IIS-0713142 and IIS-1016614), by São Paulo State Research Foundation (FAPESP), and by the Brazilian National Research Council (CNPq).

## References


## Notes

^{1}

Software is available upon request from the authors.

^{2}

A detailed analysis of these results, in the *k*-means context, can be seen in Naldi et al. (2011). The adaptation for the more general case of unrestricted covariance matrices is straightforward.

^{3}

We adopt the same values used in Zhang et al. (2003), i.e., .

^{5}

Crossover operators were not used because it is not clear how to design them in such a way that they are both context sensitive and computationally efficient; see Hruschka et al. (2009) for a discussion on this subject. Roughly speaking, in the context of our work, if standard crossover operators are applied to two identical cluster-based encoding individuals, they may produce a different individual because of the identifiability problem.

^{6}

GAKREM has an overall cost of when the EM algorithm with full covariance matrices is used.

^{7}

A *c*-separated mixture is one in which all pairs of Gaussians satisfy the *c*-separation criterion.