Abstract

This paper describes the evolutionary split and merge for expectation maximization (ESM-EM) algorithm and eight of its variants, which are based on the use of split and merge operations to evolve Gaussian mixture models. Asymptotic time complexity analysis shows that the proposed algorithms are competitive with the state-of-the-art genetic-based expectation maximization (GA-EM) algorithm. Experiments performed on 35 data sets showed that ESM-EM can be computationally more efficient than the widely used multiple runs of EM (over different numbers of components and initializations). Moreover, a variant of ESM-EM that is free from critical parameters provided results competitive with those of GA-EM, even when the GA-EM parameters were fine-tuned a priori.

1  Introduction

Gaussian mixture models (GMMs) are powerful modeling tools widely used in several real-world applications (McLachlan and Peel, 2000; Figueiredo and Jain, 2002; Zhang et al., 2003), such as speaker identification (Reynolds and Rose, 1995) and handwritten digit recognition (Chen et al., 2011). Since in practice the values of the GMM parameters are usually unknown a priori, there is a need for algorithms capable of estimating them efficiently. To that end, the expectation maximization (EM) family of algorithms (Dempster et al., 1977; McLachlan and Krishnan, 1997) is usually the option of choice. Such algorithms perform iterative computations of maximum likelihood estimates and are known to converge in a finite number of iterations under some regularity conditions. However, it is well known that the EM algorithm does not necessarily lead to a globally optimal solution (Zhang et al., 2003). Aimed at circumventing this problem, a variety of approaches have been proposed (e.g., see McLachlan and Krishnan, 1997; Ueda and Nakano, 1998; Figueiredo and Jain, 2002). Here, we focus on the split and merge type of algorithm (Ueda et al., 2000), which typically works by iteratively splitting and merging components of the mixture model and then running the EM algorithm on the resulting components. By doing so, it is expected to escape from local optima, which often involve having too many components in one part of the feature space and too few in another, widely separated part (Ueda et al., 2000).

We are particularly interested in using GMMs for data clustering (Xu and Wunsch, 2009). In this case, a common approach is to use a one-to-one correspondence between components’ distributions and clusters. A probabilistic data partition (clustering) is obtained from soft assignments of objects to clusters, based on the posterior probabilities that a particular component was responsible for generating a given object (McLachlan and Peel, 2000; Bishop, 2006). Probabilistic clustering approaches are quite useful to model overlapping clusters, which are common in practice.

From an optimization perspective, clustering can be formally considered a particular type of NP-hard grouping problem (Falkenauer, 1998). This has stimulated the search for efficient approximation algorithms, including the use of not only ad hoc heuristics for particular classes or instances of problems but also general-purpose metaheuristics. In particular, evolutionary algorithms are metaheuristics widely believed to be effective on NP-hard problems, being able to provide near-optimal solutions to such problems in reasonable time. Under this assumption, a large number of evolutionary algorithms for solving clustering problems have been proposed in the literature. These algorithms are based on the optimization of some objective function (i.e., the so-called fitness function) that guides the evolutionary search (Hruschka et al., 2009).

Many clustering algorithms assume that the number of clusters is known and fixed a priori by the user. Aimed at relaxing this assumption, which is often unrealistic in practical applications, a common approach involves estimating the number of clusters from data. The use of mixture models offers a principled approach to perform such an estimation (Figueiredo and Jain, 2002). A widely used procedure is to run EM multiple times for an incremental number of clusters and select the best partition according to a model selection criterion. Although this ordered repetitive procedure may be efficient if the optimal number of clusters is small, this is not the case in several applications. In addition, even when this is the case, one hardly knows that in advance (Naldi et al., 2011). In the case of GMMs with all covariance matrices equal to the identity matrix and hard assignments, the EM for GMM reduces to the well-known k-means algorithm (Bishop, 2006). For this setting, a thorough assessment comparing an evolutionary algorithm (EA) and multiple repetitions was conducted by Naldi et al. (2011). More specifically, a well-designed EA was shown to be significantly faster than two commonly used strategies based on multiple runs of k-means. The extension of that study for the case of nonrestricted covariance matrices and overlapping clusters is one of our contributions.

In our work, we combine model selection with EAs that use specific, problem-dependent operators particularly designed for model selection. Although there are several EAs for clustering capable of estimating the number of clusters (for a detailed review, see Hruschka et al., 2009), including some algorithms based on GMMs (Pernkopf and Bouchaffra, 2005; Nguyen and Cios, 2008), they do not make use of split and merge operations for building GMMs. Therefore, they tend not to be very effective at avoiding local optima because they are usually not able to move a component from a denser region to a sparser region without passing through positions that give a lower likelihood (Ueda et al., 2000). Aimed at circumventing these problems, we propose a more principled algorithm, evolutionary split and merge for expectation maximization (ESM-EM). A preliminary version of this algorithm was introduced in Covões and Hruschka (2011) and compared with well-known greedy algorithms in Covões and Hruschka (2013). We first provide a detailed description of ESM-EM and then investigate eight new variants of ESM-EM,1 which are based on different mutation operators. Also, an original extensive empirical analysis that considers 35 data sets is reported. The obtained results are compared to the state-of-the-art EA for evolving GMMs proposed in Pernkopf and Bouchaffra (2005). Finally, original time complexity analyses are reported.

The remainder of this paper is organized as follows. Section 2 briefly reviews the EM algorithm for GMMs and introduces the adopted notation. A review of EAs and of split and merge approaches to learning GMMs is presented in Section 3. In Section 4 we describe the ESM-EM algorithm and its variants. In Section 5 we analyze the empirical results of the experiments with 35 data sets, comparing them to related approaches, that is, multiple runs of EM and the EA proposed in Pernkopf and Bouchaffra (2005). Finally, Section 6 concludes the paper and points out some future work.

2  Gaussian Mixture Models

A Gaussian mixture model can be written as a linear superposition of k Gaussians, leading to a probability density of the form

$p(\mathbf{x} \mid \boldsymbol{\Theta}_k) = \sum_{j=1}^{k} \pi_j\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$,  (1)

where $\pi_j$ are the mixing coefficients subject to $0 \le \pi_j \le 1$ and $\sum_{j=1}^{k} \pi_j = 1$, $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$ is a multivariate Gaussian probability distribution, and $\boldsymbol{\Theta}_k = \{\pi_j, \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j\}_{j=1}^{k}$ represents the complete set of parameters (mixing coefficients, means, and covariance matrices of the components, respectively) for a given k.

Initially, we define two sets, $\mathbf{X}$ and $\mathbf{Z}$. The former refers to the observed data, $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, with $\mathbf{x}_i \in \mathbb{R}^{M}$, where M is the number of features. We assume that the N samples are i.i.d. (independent and identically distributed). The latter set contains the indicator vectors, $\mathbf{Z} = \{\mathbf{z}_1, \ldots, \mathbf{z}_N\}$, which specify the component to which the ith sample (object) belongs; that is, $\mathbf{z}_i$ is a k-dimensional binary random variable having a one-of-k representation. Then $\{\mathbf{X}, \mathbf{Z}\}$ is the complete data set.

The likelihood function defines the probability of observing the complete data set, conditioned on the parameters, namely, $p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\Theta}_k)$. In practice, since the values of the latent variables in $\mathbf{Z}$ are not available, we work with the expectation of the complete-data log-likelihood with respect to the posterior distribution of the latent variables. The expected value of the indicator variables under the posterior distribution is given by (McLachlan and Peel, 2000; Bishop, 2006)

$\gamma_{ij} \equiv \mathbb{E}[z_{ij}] = \dfrac{\pi_j\, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}{\sum_{l=1}^{k} \pi_l\, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)}$.  (2)

This quantity can be seen as the responsibility of the jth component for data point (object) $\mathbf{x}_i$. The EM algorithm can then be summarized as follows. In the expectation (E) step, the responsibilities in Eq. (2) are computed. Then, in the maximization (M) step, the model parameters are re-estimated using the current responsibilities in such a way that the log-likelihood function is maximized. The expected complete-data log-likelihood function is defined as

$\mathbb{E}\big[\log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\Theta}_k)\big] = \sum_{i=1}^{N} \sum_{j=1}^{k} \gamma_{ij} \big[\log \pi_j + \log \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)\big]$.  (3)

The parameters of each component are updated as

$N_j = \sum_{i=1}^{N} \gamma_{ij}, \qquad \pi_j = \frac{N_j}{N}, \qquad \boldsymbol{\mu}_j = \frac{1}{N_j} \sum_{i=1}^{N} \gamma_{ij}\, \mathbf{x}_i, \qquad \boldsymbol{\Sigma}_j = \frac{1}{N_j} \sum_{i=1}^{N} \gamma_{ij} (\mathbf{x}_i - \boldsymbol{\mu}_j)(\mathbf{x}_i - \boldsymbol{\mu}_j)^{T}$.

The expectation (E) and maximization (M) steps are iteratively repeated until convergence. In practice, the algorithm is deemed to have converged when the change in the log-likelihood function, or alternatively in the parameters, falls below some threshold (Bishop, 2006). Note that initial estimates of the parameters ($\boldsymbol{\Theta}_k$) are necessary for the first EM iteration. In clustering problems, a common approach to initialize such parameters involves running k-means. By doing so, the covariance matrices can conveniently be initialized to the sample covariances of the clusters, and the mixing coefficients can be set to the fractions of objects assigned to the respective clusters (McLachlan and Peel, 2000; Bishop, 2006).
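To make the E and M steps above concrete, the following minimal NumPy sketch implements one possible EM loop for a full-covariance GMM. The function name, the initialization strategy, and the regularization term are our own illustrative choices, not part of ESM-EM.

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, k, n_iter=100, tol=1e-6, seed=0):
        """Minimal EM for a full-covariance GMM (illustrative sketch only)."""
        rng = np.random.default_rng(seed)
        N, M = X.shape
        # Crude initialization: k random objects as means, pooled covariance, uniform weights.
        mu = X[rng.choice(N, size=k, replace=False)]
        Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(M) for _ in range(k)])
        pi = np.full(k, 1.0 / k)
        prev_ll = -np.inf
        for _ in range(n_iter):
            # E-step: responsibilities gamma_{ij} of Eq. (2).
            weighted = np.column_stack(
                [pi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j]) for j in range(k)])
            ll = np.sum(np.log(weighted.sum(axis=1)))
            gamma = weighted / weighted.sum(axis=1, keepdims=True)
            # M-step: re-estimate mixing coefficients, means, and covariances.
            Nj = gamma.sum(axis=0)
            pi = Nj / N
            mu = (gamma.T @ X) / Nj[:, None]
            for j in range(k):
                d = X - mu[j]
                Sigma[j] = (gamma[:, j, None] * d).T @ d / Nj[j] + 1e-6 * np.eye(M)
            if abs(ll - prev_ll) < tol:   # convergence check on the log-likelihood
                break
            prev_ll = ll
        return pi, mu, Sigma, gamma, ll

A bounded number of iterations of this kind is precisely what the refinement step of ESM-EM performs on each individual (Section 4.2).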

A major drawback of EM is that the user must define the number of components (clusters) a priori. However, this value is not known in most applications. One common approach is to run EM repeatedly for different values of k within a range of interest and then select the best model according to a model selection criterion (see Figueiredo and Jain, 2002; McLachlan and Peel, 2000, for a detailed discussion of the topic). A possible systematic procedure for doing so is called ordered multiple runs of EM (OMR-EM). In OMR-EM, the number of clusters is increased at each iteration and, for each number of clusters in the range, np random initializations are performed. This procedure is described in Algorithm 1, where s is any model selection criterion to be minimized (an example of such a criterion is presented in Section 4) and SC is a stopping criterion for EM, for instance, a maximum number of iterations. In practice, np is usually defined in an ad hoc fashion or by taking into account the available computational resources. Assuming that np is proportional to kmax, the total number of EM runs performed by the procedure grows quadratically with kmax.2

formula
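The following sketch mirrors the OMR-EM loop just described, assuming a generic EM routine (such as the illustrative one above) and a model selection criterion to be minimized; all names are ours.

    def omr_em(X, k_max, n_p, run_em, model_selection):
        """Ordered multiple runs of EM: n_p random restarts for each k in [2, k_max]."""
        best_model, best_score = None, float("inf")
        for k in range(2, k_max + 1):            # ordered (incremental) number of clusters
            for r in range(n_p):                 # n_p random initializations for this k
                model = run_em(X, k, seed=r)     # run EM until its stopping criterion SC
                score = model_selection(X, model, k)
                if score < best_score:           # keep the model minimizing the criterion s
                    best_model, best_score = model, score
        return best_model, best_score

Here run_em and model_selection stand for any EM routine and any model selection criterion to be minimized (e.g., the MDL function of Section 4).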

3  Related Work

As discussed, both model building and model selection are difficult optimization problems. The search space is, in general, nondifferentiable and multimodal. Evolutionary algorithms have been shown to be successful tools for related problems in a variety of applications (Falkenauer, 1998; Bäck et al., 2000). As stated in Section 1, we focus on the development of EAs for learning GMMs. For a more comprehensive review of clustering-related evolutionary algorithms, see Hruschka et al. (2009).

Although the EM algorithm for learning GMMs has been widely used in practice and several of its variants have been developed, only a few EAs have been proposed for this problem (Pernkopf and Bouchaffra, 2005; Tohka et al., 2007; Tawfick et al., 2008; Nguyen and Cios, 2008). Among them, the algorithms introduced in Tohka et al. (2007) and Tawfick et al. (2008) are of more limited applicability. In particular, the algorithm proposed in Tohka et al. (2007) assumes that the number of components is known a priori, and the work in Tawfick et al. (2008) is limited to hyperspherical clusters, that is, to identity covariance matrices. The algorithm proposed in Nguyen and Cios (2008), named genetic algorithm K-means logarithmic regression expectation maximization (GAKREM), uses a medoid-based representation. For the fitness function, a regression model is employed: after a small prespecified number of EM iterations, regression coefficients are estimated and then used to predict the future likelihood. Uniform crossover is employed along with a mutation operator that randomly creates or removes a cluster. GAKREM has some conceptual problems. For instance, the adopted encoding is not suitable to represent a GMM, because information is lost when converting from the phenotype (GMM) to the genotype (medoids). Moreover, the lack of a refinement step to improve existing solutions, along with the inclusion of at most one new individual per generation, may cause the algorithm to converge slowly or even be unable to escape from local optima. Finally, the adopted fitness function is not based on a principled approach to estimating the number of clusters.

The genetic-based expectation maximization (GA-EM) (Pernkopf and Bouchaffra, 2005) uses a cluster description-based representation (Hruschka et al., 2009). The fitness function employed is the log-likelihood function (Equation (3)) along with a minimum description length penalty, defined in Section 4. A disadvantage of this algorithm is that its mutation and crossover operators are employed in a random fashion, favoring random search over guided search. There are also two critical parameters that the user needs to define and that may have a great impact on the algorithm’s behavior: a maximum correlation between the responsibilities of two Gaussians and the minimum sum of responsibilities that a component must have.

Aimed at overcoming the limitations of the foregoing algorithms, we focus on the combination of EAs with split and merge algorithms. A difficulty faced by EM-based algorithms in avoiding local optima comes from the fact that they are usually not able to move a component from a denser region to a sparser one without passing through positions that provide a lower likelihood value. The basic underlying idea of split and merge algorithms is to merge components in dense regions and split components in sparse regions (Ueda et al., 2000).

Interest in split and merge algorithms has been increasing over the last decade (e.g., see Ueda et al., 2000; Zhang et al., 2003; Li and Li, 2009). We focus on the procedures proposed in Zhang et al. (2003), which are also adopted in Li and Li (2009). Differently from the operations proposed in Ueda et al. (2000), which are based mainly on linear combinations of the parameters, the merge and split operations described in Zhang et al. (2003) and Li and Li (2009) take into account the GMM formalism. More specifically, as observed in Zhang et al. (2003), it is of particular interest that the split and merge operators consider the relation between the parameters of a component, namely, its mean and covariance matrix, since they are, respectively, the first and second moments of the distribution represented by the component. However, the algorithms proposed in Ueda et al. (2000) and Zhang et al. (2003) assume that the number of components is provided by the user. Although the approach described in Li and Li (2009) can estimate the number of components, this estimation is done using greedy search-based procedures, which are frequently subject to local optima problems.

Before delving into the details of the operations that compute the parameters resulting from split and merge operations, we address the criteria used in Zhang et al. (2003) to identify which components should be modified. The most commonly used criterion to identify the pair of components to be merged is the correlation between the component responsibilities (Ueda et al., 2000; Zhang et al., 2003). Let j and l be two component indices. The correlation criterion is defined as

$J_{\mathrm{corr}}(j, l) = \dfrac{\boldsymbol{\gamma}_j^{T} \boldsymbol{\gamma}_l}{\lVert \boldsymbol{\gamma}_j \rVert\, \lVert \boldsymbol{\gamma}_l \rVert}$,  (4)

where $\boldsymbol{\gamma}_j = (\gamma_{1j}, \ldots, \gamma_{Nj})^{T}$ is the vector of responsibilities of the jth component. The pair of components with the greatest absolute correlation value is chosen. Assuming that the responsibilities $\gamma_{ij}$ have already been computed for all objects and components, evaluating Equation (4) for all pairs of components has computational cost $O(Nk^2)$.
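Assuming Eq. (4) is the normalized inner product of the responsibility vectors, as in the split-and-merge literature, a direct (if naive) way of locating the merge candidate is sketched below; the function name is ours.

    import numpy as np

    def merge_candidate(gamma):
        """Return the pair of components whose responsibility vectors are most correlated.
        gamma is the (N, k) matrix of responsibilities gamma_{ij}."""
        k = gamma.shape[1]
        norms = np.linalg.norm(gamma, axis=0)
        best_pair, best_val = None, -np.inf
        for j in range(k):
            for l in range(j + 1, k):
                corr = abs(gamma[:, j] @ gamma[:, l]) / (norms[j] * norms[l])
                if corr > best_val:
                    best_pair, best_val = (j, l), corr
        return best_pair, best_val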
In order to identify which component should be split into two components, a widely used criterion is the local Kullback divergence (LKD) (Ueda et al., 2000; Zhang et al., 2003), which, for the jth component, is defined as

$D_{\mathrm{LKD}}(j) = \displaystyle\int f_j(\mathbf{x}; \boldsymbol{\Theta}_k) \log \frac{f_j(\mathbf{x}; \boldsymbol{\Theta}_k)}{\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}\, d\mathbf{x}$,  (5)

$f_j(\mathbf{x}; \boldsymbol{\Theta}_k) = \dfrac{\sum_{i=1}^{N} \delta(\mathbf{x} - \mathbf{x}_i)\, \gamma_{ij}}{\sum_{i=1}^{N} \gamma_{ij}}$,  (6)

$\delta(\mathbf{x} - \mathbf{x}_i) = \begin{cases} 1 & \text{if } \mathbf{x} = \mathbf{x}_i, \\ 0 & \text{otherwise}, \end{cases}$  (7)

where $f_j(\mathbf{x}; \boldsymbol{\Theta}_k)$ is the empirical distribution modified by weighting the frequencies by the posterior probabilities, in such a way that objects closer to the jth component have more importance, and $\delta(\cdot)$ is the Kronecker function. Eq. (5) measures the distance between two distributions, that is, between the local density of the data around the jth component, $f_j(\mathbf{x}; \boldsymbol{\Theta}_k)$, and the density given by the current component parameters (Ueda et al., 2000). A high value of $D_{\mathrm{LKD}}$ indicates a poor estimate of the local density; thus, the component with the highest $D_{\mathrm{LKD}}$ is chosen to be split. Assuming that the normal probabilities and responsibilities have already been computed (from the EM algorithm), evaluating this criterion for all components has computational cost $O(Nk)$.
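In practice, the integral in Eq. (5) is evaluated on the observed data points through the empirical weights of Eq. (6). A minimal sketch of that empirical evaluation is given below (names and the numerical safeguards are ours).

    import numpy as np

    def local_kullback_divergence(gamma, dens, eps=1e-300):
        """Empirical split scores in the spirit of Eq. (5), one value per component.
        gamma: (N, k) responsibilities; dens: (N, k) densities N(x_i | mu_j, Sigma_j),
        both assumed to be available from the last EM pass."""
        f = gamma / np.maximum(gamma.sum(axis=0, keepdims=True), eps)   # weights of Eq. (6)
        return np.sum(f * np.log(np.maximum(f, eps) / np.maximum(dens, eps)), axis=0)

    # The component with the largest score is the split candidate.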
After defining which component (or pair of components) should be modified using these functions, a split (merge) procedure can be performed. To merge components j and l into a new component g, the following operations were proposed by Zhang et al. (2003):

$\pi_g = \pi_j + \pi_l, \qquad \boldsymbol{\mu}_g = \dfrac{\pi_j \boldsymbol{\mu}_j + \pi_l \boldsymbol{\mu}_l}{\pi_g}$,  (8)

$\boldsymbol{\Sigma}_g = \dfrac{\pi_j \big[\boldsymbol{\Sigma}_j + (\boldsymbol{\mu}_j - \boldsymbol{\mu}_g)(\boldsymbol{\mu}_j - \boldsymbol{\mu}_g)^{T}\big] + \pi_l \big[\boldsymbol{\Sigma}_l + (\boldsymbol{\mu}_l - \boldsymbol{\mu}_g)(\boldsymbol{\mu}_l - \boldsymbol{\mu}_g)^{T}\big]}{\pi_g}$.  (9)

These operations can be seen as an extension of the ones proposed in Ueda et al. (2000), aiming at preserving the relation between the mean and the covariance matrix (the first and second moments of the merged component). The computational cost of performing all these operations is $O(M^2)$, where M is the number of features.
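A small sketch of the merge operation, written directly from Eqs. (8) and (9) as reconstructed above (the function name is ours):

    import numpy as np

    def merge_components(pi_j, mu_j, Sig_j, pi_l, mu_l, Sig_l):
        """Merge two Gaussian components into one, following Eqs. (8)-(9)."""
        pi_g = pi_j + pi_l
        mu_g = (pi_j * mu_j + pi_l * mu_l) / pi_g
        d_j, d_l = mu_j - mu_g, mu_l - mu_g
        # The merged covariance accounts for both within- and between-component spread.
        Sig_g = (pi_j * (Sig_j + np.outer(d_j, d_j)) + pi_l * (Sig_l + np.outer(d_l, d_l))) / pi_g
        return pi_g, mu_g, Sig_g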

Consider now that a given component, g, should be split. As a result, assume that we obtain components j and l. A method that uses singular value decomposition (SVD) (Golub and Van Loan, 1996) is described in Zhang et al. (2003). Initially, the covariance matrix $\boldsymbol{\Sigma}_g$ is decomposed into three matrices as $\boldsymbol{\Sigma}_g = \mathbf{U} \boldsymbol{\Lambda} \mathbf{V}^{T}$, where $\boldsymbol{\Lambda}$ is the diagonal matrix of singular values in decreasing order and $\mathbf{U}$ and $\mathbf{V}$ are two orthogonal matrices (Golub and Van Loan, 1996). Having done that, the matrix $\mathbf{A} = \mathbf{U} \boldsymbol{\Lambda}^{1/2}$ is computed (Li and Li, 2009). Following Zhang et al. (2003), the problem of determining the covariance matrices of j and l, given $\boldsymbol{\Sigma}_g$, can be transformed into the problem of finding the corresponding factors, given $\mathbf{A}$. Considering that $\mathbf{a}_i$ is the ith column of matrix $\mathbf{A}$, the parameters can be adjusted as (Zhang et al., 2003)
formula
10
formula
11
formula
12
formula
13
where the split coefficients are user-defined parameters.3 Although any other column $\mathbf{a}_i$ could be used in place of $\mathbf{a}_1$, this choice is justified because $\mathbf{a}_1$ corresponds to the first principal axis (Zhang et al., 2003). Because of the SVD decomposition, the computational cost of this splitting method is $O(M^3)$.
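For intuition only, the sketch below splits a component along its first principal axis obtained via SVD. It does not reproduce the exact update rules of Eqs. (10)-(13); the displacement and shrinkage factors (alpha, beta, u) are placeholder names for the user-defined coefficients mentioned above.

    import numpy as np

    def svd_split(pi_g, mu_g, Sig_g, alpha=0.5, beta=0.5, u=0.5):
        """Illustrative split of one component along its first principal axis.
        alpha, beta, u are placeholder names; the exact updates of Eqs. (10)-(13)
        in Zhang et al. (2003) are not reproduced here."""
        U, s, _ = np.linalg.svd(Sig_g)
        a1 = U[:, 0] * np.sqrt(s[0])               # first column of A (first principal axis)
        pi_j, pi_l = alpha * pi_g, (1.0 - alpha) * pi_g
        mu_j, mu_l = mu_g - u * a1, mu_g + u * a1  # displace the new means along a1
        Sig_j = Sig_l = beta * Sig_g               # heuristic shrinkage of the covariance
        return (pi_j, mu_j, Sig_j), (pi_l, mu_l, Sig_l)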

4  Evolutionary Split and Merge for Expectation Maximization

We propose an evolutionary algorithm that makes use of split and merge operations and is capable of estimating the number of components. Our algorithm is named evolutionary split and merge for expectation maximization (ESM-EM). Essentially, we combine some good properties of the GA-EM algorithm (Pernkopf and Bouchaffra, 2005) with split and merge procedures (described in Section 3). More specifically, differently from other algorithms in the literature (i.e., Pernkopf and Bouchaffra, 2005; Nguyen and Cios, 2008), we adopt context-specific operators guided by quality functions related to the fitness of the individual. Thus, our approach tends to favor a more (probabilistic) guided search in comparison to the existing approaches, whose search procedures are more random in nature.

In order to encode the individuals (also called genotypes or chromosomes), a representation is required that can deal with a variable number of clusters and that unambiguously identifies the GMM represented by the individual. From these observations, and assuming that kmax is the maximum acceptable number of clusters, we adopted a cluster-based encoding (Hruschka et al., 2009) that, in our case, involves individuals that directly represent the GMM parameters ($\boldsymbol{\Theta}_k$). Note that the apparent limitation of fixing the maximum number of clusters can be removed by using a variable-length encoding, if so desired.

Provided that we are dealing with GMMs with a variable number of components (clusters), the log-likelihood function in Eq. (3) cannot be used as our objective (fitness) function because it is nondecreasing with the number of clusters (Figueiredo and Jain, 2002). Among the available alternatives, we chose an objective function based on the minimum description length (MDL)4 principle (Rissanen, 1989). More precisely, considering an individual that represents k components and $\eta_k$ being the number of free parameters of the GMM, the adopted objective function is given by

$f_{\mathrm{MDL}}(\boldsymbol{\Theta}_k) = -\log p(\mathbf{X} \mid \boldsymbol{\Theta}_k) + \dfrac{\eta_k}{2} \log N$.  (14)
This function has been widely used for model selection (McLachlan and Peel, 2000; Figueiredo and Jain, 2002; Pernkopf and Bouchaffra, 2005).
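A minimal sketch of Eq. (14) for a GMM with unrestricted covariance matrices; the parameter count eta_k (k-1 mixing coefficients, kM mean entries, and kM(M+1)/2 covariance entries) is the standard one for this model class.

    import numpy as np

    def mdl_score(log_likelihood, k, M, N):
        """MDL/BIC-style objective of Eq. (14), to be minimized."""
        eta_k = (k - 1) + k * M + k * M * (M + 1) // 2   # free parameters of a full-covariance GMM
        return -log_likelihood + 0.5 * eta_k * np.log(N)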

The main steps of ESM-EM, summarized in Algorithm 2, are (1) initialization, step 2; (2) refinement, step 5; (3) individual selection for mutation and survival, steps 6 and 11; and (4) mutation operator selection and guidance, steps 8 and 9. In the next subsections we discuss each of these steps as well as possible components that can be used, giving rise to different variants of ESM-EM.

formula

4.1  Initialization

Each individual is initialized to represent a clustering solution obtained by k-means using randomly selected objects as cluster seeds. We adopted a deterministic approach to select the number of clusters: the Psize individuals are initialized with numbers of clusters linearly spaced between 2 and kmax, ensuring that the solutions in the initial population are evenly spread over the possible numbers of clusters. Differently from Pernkopf and Bouchaffra (2005), who if necessary enlarge the initial population to ensure that there is exactly one individual for each possible number of clusters from 2 to kmax, the proposed algorithm aims at providing a trade-off between data partition diversity and computational cost.
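A small sketch of this deterministic choice of the initial numbers of clusters (the helper name is ours); each individual is then built by running k-means with the corresponding number of randomly selected seeds.

    import numpy as np

    def initial_cluster_counts(p_size, k_max):
        """Numbers of clusters for the initial population: linearly spaced in [2, k_max]."""
        return np.unique(np.linspace(2, k_max, p_size).round().astype(int))

    # Example: initial_cluster_counts(10, 30) -> array([ 2,  5,  8, 11, 14, 18, 21, 24, 27, 30])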

4.2  Refinement

In step 5 of Algorithm 2, ESM-EM runs a number (tEM) of EM iterations on each individual of the population, and the individuals are then assessed by the objective function in Eq. (14). Crossover operators are not used in ESM-EM; instead, the refinement of each individual's solution can be seen as a local search procedure, favoring exploitation of the search space.

4.3  Selection for Mutation and Survival

The well-known (μ + λ)-selection (Bäck et al., 2000) is adopted to determine the individuals to be mutated and the ones that will survive to the next generation. In our case, μ = λ = Psize. At each generation, Psize new individuals are generated by mutating each individual of the current population P(t), obtaining P′(t). The best Psize individuals of the population formed by P(t) ∪ P′(t) survive and form P(t+1); a similar approach was used by Pernkopf and Bouchaffra (2005). There are two main reasons to adopt this procedure: (1) since the EM algorithm is known to converge slowly (Redner and Walker, 1984), a steadier approach favors the cumulative refinement needed for a better fit; and (2) given that our operators are relatively aggressive, in the sense that they may perform big jumps in the solution space, the major drawback of such a conservative selection scheme, namely premature convergence, is lessened.
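A minimal sketch of the survivor selection just described, assuming that fitness (the MDL value) is to be minimized:

    def plus_selection(parents, offspring, fitness, p_size):
        """(mu + lambda)-selection with mu = lambda = p_size: the best p_size individuals
        among parents and their mutated offspring survive (fitness is minimized)."""
        pool = list(parents) + list(offspring)
        pool.sort(key=fitness)            # smaller MDL value means a fitter individual
        return pool[:p_size]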

4.4  Mutation Operator Selection and Guidance

Following the taxonomy presented in Hruschka et al. (2009), the ESM-EM mutation operators are both guided and cluster-oriented. In brief, cluster-oriented operators, such as operators that copy, split, merge, and eliminate clusters, are task-dependent, in contrast to conventional evolutionary operators, which only exchange or switch bits with no regard to their task-dependent meaning. Guided operators are guided by some type of information about the quality of individual clusters, the quality of the overall data partition, or their performance in previous applications, such as operators that are more likely to be applied to poor-quality clusters and operators whose probability of application is proportional to their success (or failure) in previous generations.

The ESM-EM mutation operators5 consist of the application of the merge and split operations. Specifically, for merging components, we consider a procedure from Zhang et al. (2003) that is described in Algorithm 3. For splitting components, two operators are considered: the SVD-based operator (Zhang et al., 2003), described in Algorithm 4, and a new variance-based splitting (VS) operator.

formula
formula

The main idea of the VS operator is to select the feature that will determine the direction of the new means with probability directly proportional to the feature's variance. After these means have been obtained, k-means is applied to estimate the remaining parameters (covariance matrices and mixing coefficients) and to update the means. For k-means, only the objects more likely to have been generated by the component are considered, and a maximum number of iterations (tKM) is performed. The main steps of the operator are described in Algorithm 5. The computational cost of this operator is dominated by the k-means iterations and by the estimation of the new covariance matrices, and is therefore at most quadratic in M. VS also has some extra advantages over the SVD-based operator, whose complexity is cubic in M. In particular, the new components' parameters are, to some extent, better fitted to the data than those produced by the SVD-based operator. Also, the SVD decomposition can present numerical issues with high-dimensional covariance matrices, so replacing it by a simpler operation can be useful in such scenarios. Moreover, unlike the SVD-based approach, ours does not have user-defined parameters.

formula
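The sketch below conveys the spirit of the VS operator, not Algorithm 5 itself: the displacement magnitude, the handling of empty clusters, and the regularization are our own assumptions.

    import numpy as np

    def variance_split(X_g, mu_g, rng, t_km=5):
        """Illustrative variance-based split of one component.
        X_g holds only the objects most likely generated by the component."""
        var = X_g.var(axis=0)
        f = rng.choice(len(var), p=var / var.sum())        # feature drawn prop. to its variance
        shift = np.zeros_like(mu_g)
        shift[f] = np.sqrt(var[f])                          # assumed displacement: one std. dev.
        centers = np.vstack([mu_g - shift, mu_g + shift])   # two initial prototypes
        for _ in range(t_km):                               # short plain 2-means refinement
            dists = ((X_g[:, None, :] - centers[None]) ** 2).sum(axis=-1)
            labels = dists.argmin(axis=1)
            for c in range(2):
                if np.any(labels == c):
                    centers[c] = X_g[labels == c].mean(axis=0)
        new_components = []
        for c in range(2):                                  # remaining parameters from the assignment
            Xc = X_g[labels == c] if np.any(labels == c) else X_g   # fall back if one side is empty
            new_components.append((len(Xc) / len(X_g), centers[c],
                                   np.cov(Xc.T) + 1e-6 * np.eye(X_g.shape[1])))
        return new_components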

To illustrate the application of these operators, consider the scenario depicted in Figure 1(a), which corresponds to a mixture with two components fitted to data generated by a three-component Gaussian mixture. The larger cluster on the left-hand side is the most probable cluster to be mutated according to both measures, PLL and LKD. This is expected, as it models an area that clearly contains two separate centers of density. Let this cluster be C1. The resulting mixture after applying the SVD operator to this component is shown in Figure 1(b). Now, consider the application of the VS operator to the same cluster. Given the covariance matrix of C1, the second feature has a 60% chance of being selected as the direction for the new means. Assuming this feature has been chosen, the two initial prototypes for k-means are the old center moved three units up and three units down along that feature, respectively. After five iterations of k-means on the objects for which C1 was the component with the largest responsibility, the resulting GMM is better fitted to the data, as shown in Figure 1(c).

Figure 1: Example of the application of splitting procedures.

In order to choose the mutation operator (either split or merge) to be used, we followed the approach described in Horta and Campello (2009). If the fitness of a particular individual has improved from generation t to generation t+1, then the operator applied in the previous generation is reapplied, whereas if the fitness of the individual has worsened, the other operator is used. In other cases (e.g., first generation), the operator is randomly selected.
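A direct sketch of this operator-selection rule (function and operator names are ours):

    import random

    def choose_operator(prev_op, prev_fitness, curr_fitness):
        """Pick the next mutation operator following Horta and Campello (2009):
        keep the previous operator if fitness (minimized) improved, switch if it
        worsened, and choose randomly otherwise (e.g., in the first generation)."""
        ops = ("split", "merge")
        if prev_op is None or prev_fitness is None or curr_fitness == prev_fitness:
            return random.choice(ops)
        if curr_fitness < prev_fitness:          # improvement: reapply the same operator
            return prev_op
        return ops[1 - ops.index(prev_op)]       # deterioration: use the other operator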

For the selection of the components (clusters) to be mutated (step 9 of Algorithm 2), ESM-EM uses a probabilistic approach that is biased toward less fit components (clusters): the less fit the component, the higher the probability that it will be mutated. We adopted a ranking-normalized roulette to do so. For each type of mutation (split or merge), we considered two criteria to assess the components' quality. For splitting, we considered the local Kullback divergence, in Eq. (5), and investigated the use of the partial log-likelihood (PLL):

$\mathrm{PLL}(j) = \sum_{i=1}^{N} \gamma_{ij} \big[\log \pi_j + \log \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)\big]$.  (15)

The PLL of a component is its contribution to the expected complete-data log-likelihood (Eq. (3)).

There are several advantages to using the partial log-likelihood instead of the LKD. For instance, the partial log-likelihood is part of the fitness function, thereby establishing a direct synergy between mutation and fitness evaluation. The PLL values are also already computed when assessing the fitness of each individual, which straightforwardly favors an efficient implementation. For simplicity, we use the negative of the partial log-likelihood, so that both PLL and LKD assign high values to clusters that should be split.
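The two ingredients just discussed, the (negative) partial log-likelihood and the ranking-normalized roulette, can be sketched as follows; the exact rank normalization used by ESM-EM is our assumption.

    import numpy as np

    def negative_pll(gamma, pi, dens, eps=1e-300):
        """Negative per-component contribution to Eq. (3); higher means a worse fit.
        gamma, dens: (N, k) responsibilities and component densities; pi: (k,) weights."""
        return -np.sum(gamma * (np.log(pi[None, :]) + np.log(np.maximum(dens, eps))), axis=0)

    def rank_roulette(scores, rng):
        """Ranking-normalized roulette: the worse the component (higher score),
        the larger its probability of being selected for mutation."""
        ranks = np.argsort(np.argsort(scores)) + 1      # 1 = best fitted, k = worst fitted
        return rng.choice(len(scores), p=ranks / ranks.sum())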

For the merge operator, the selection can be based on two criteria. The first is the correlation criterion, in Eq. (4). We also propose a new measure based on the overlapping of components. Let $O_{ij}$ be the set of objects for which i and j are the indices of the first and second most probable Gaussians, respectively. Analogously, let $O_{ji}$ be the set of objects for which j and i are the first and second most probable Gaussians, respectively. Then, the overlap measure can be defined as
formula
16

Pairs of clusters with higher overlap values are more likely to be merged. The main difference between the correlation and overlapping measures is that the latter focuses on the objects that would be assigned to each cluster under a most-probable-cluster scheme. The computational cost of computing the overlapping measure for all pairs of components is $O(Nk)$, since it only requires identifying the two most probable components of each object.
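The sets O_ij and O_ji can be obtained directly from the responsibilities, as sketched below. How their sizes are normalized in Eq. (16) is not reproduced here, so this is only a partial illustration.

    import numpy as np

    def overlap_counts(gamma, i, j):
        """Sizes of the sets O_ij and O_ji: objects whose most probable component is i (j)
        and whose second most probable component is j (i)."""
        order = np.argsort(-gamma, axis=1)       # components sorted by decreasing responsibility
        first, second = order[:, 0], order[:, 1]
        o_ij = int(np.sum((first == i) & (second == j)))
        o_ji = int(np.sum((first == j) & (second == i)))
        return o_ij, o_ji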

4.5  Note on Computational Complexity

As expected, the overall computational complexity of ESM-EM depends on the choice of the operators just discussed. However, the dominant cost of ESM-EM is attributable to running the EM algorithm. From this perspective, the overall computational cost of ESM-EM is governed by two terms per generation: the cost of the tEM EM iterations performed on each of the Psize individuals at the refinement step, and the cost of computing the probabilities of merging each pair of clusters, the whole process being repeated for G generations. This cost is the same as that of the GA-EM algorithm (Pernkopf and Bouchaffra, 2005) and similar to that of GAKREM6 (Nguyen and Cios, 2008). Assuming that the user has no a priori knowledge of the range for the number of clusters, which is the scenario where evolutionary algorithms are most applicable, the common rule of thumb of $k_{\max} = \sqrt{N}$ can be employed (Campello et al., 2009), and the overall computational cost simplifies accordingly. Although this cost is higher than that of multiple runs of EM (see Section 2), there is empirical evidence that the additional (asymptotic) complexity per iteration, necessary to guide the search for better data partitions, actually provides better solutions in shorter running times (see Section 5).

4.6  Summary of ESM-EM Variants

The adoption of the different components described in the previous sections gives rise to different variants of the ESM-EM algorithm. In order to refer to such variants, we use combinations of the operator acronyms. For instance, VS-OVER-PLL refers to the algorithm obtained using the variance-based splitting operator (VS), the overlapping measure (OVER) to indicate the clusters to be merged, and the partial log-likelihood function (PLL) to indicate the component to be split. A summary of all the assessed variants is provided in Table 1. For simplicity, the components that are common to all variants, namely, initialization (Section 4.1), (μ + λ)-selection (Section 4.3), and the merging operator (Algorithm 3), are not mentioned.

Table 1:
Summary of variants.
Name          Split Operator    Split Measure     Merge Measure
SVD           SVDS (Alg. 4)     LKD (Eq. (5))     Correlation (Eq. (4))
OVER          SVDS (Alg. 4)     LKD (Eq. (5))     Overlapping (Eq. (16))
VS            VS (Alg. 5)       LKD (Eq. (5))     Correlation (Eq. (4))
PLL           SVDS (Alg. 4)     PLL (Eq. (15))    Correlation (Eq. (4))
VS-OVER       VS (Alg. 5)       LKD (Eq. (5))     Overlapping (Eq. (16))
VS-PLL        VS (Alg. 5)       PLL (Eq. (15))    Correlation (Eq. (4))
OVER-PLL      SVDS (Alg. 4)     PLL (Eq. (15))    Overlapping (Eq. (16))
VS-OVER-PLL   VS (Alg. 5)       PLL (Eq. (15))    Overlapping (Eq. (16))

5.  Empirical Results

We used 35 data sets for the assessment of the different variants of ESM-EM. Three are well known in the literature, namely: (1) 3-components (Ueda and Nakano, 1998), formed by 900 objects distributed over three bi-dimensional Gaussian components, (2) 9Gauss (Campello et al., 2009), which has 900 objects from nine overlapping bi-dimensional Gaussian components, and (3) Pendigits, which is a classification data set composed of handwritten digit images (Asuncion and Newman, 2007). For the latter, we considered only the classes representing digits 0, 1, 2, 3, and 4, thus having 5,629 objects. Similarly to what was done by Pernkopf and Bouchaffra (2005), we also performed PCA, which reduced the number of features from 16 to 2. We also considered an extended version of 9Gauss (Campello et al., 2009), called 49Gauss, which has similarly overlapping areas as 9Gauss but with 49 Gaussian components and 4,900 objects. Besides these data sets, we also conducted experiments with seven other data sets that can be modeled by means of GMMs, even though they were not actually generated from GMMs. Figure 2 illustrates these data sets. The remaining 24 data sets are synthetic data sets generated according to two well-known criteria for component separation:

  • c-separation (Dasgupta, 1999): Gaussians $\mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$ and $\mathcal{N}(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$ are c-separated if
    $\lVert \boldsymbol{\mu}_i - \boldsymbol{\mu}_j \rVert \ge c \sqrt{M \max\big(\lambda_{\max}(\boldsymbol{\Sigma}_i), \lambda_{\max}(\boldsymbol{\Sigma}_j)\big)}$,  (17)
    where $\lambda_{\max}(\boldsymbol{\Sigma})$ is the largest eigenvalue of $\boldsymbol{\Sigma}$ (a small check of this condition is sketched right after this list).

    Figure 2: Scatter plot of the data sets employed in the experiments.

  • ω-separation (Maitra and Melnykov, 2010): the sum of the misclassification probabilities between two Gaussians $\mathcal{N}(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$ and $\mathcal{N}(\boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)$ of a GMM:
    $\omega_{jl} = \omega_{j \mid l} + \omega_{l \mid j}$,  (18)
    where $\omega_{j \mid l}$ denotes the probability of misclassifying an object drawn from the lth component as belonging to the jth component, that is,
    $\omega_{j \mid l} = \Pr\big[\pi_j\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) > \pi_l\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)\big], \quad \mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)$.  (19)
    formula
    20
    formula
    21
    formula
    22
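The c-separation condition of Eq. (17) is straightforward to check for a pair of Gaussians, as sketched below (the function name is ours; M is taken as the dimensionality of the mean vectors).

    import numpy as np

    def is_c_separated(mu_i, Sig_i, mu_j, Sig_j, c):
        """Check the c-separation condition of Eq. (17) for a pair of Gaussians."""
        m = len(mu_i)                                         # dimensionality M
        lam = max(np.linalg.eigvalsh(Sig_i).max(), np.linalg.eigvalsh(Sig_j).max())
        return np.linalg.norm(np.asarray(mu_i) - np.asarray(mu_j)) >= c * np.sqrt(m * lam)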

A summary of the main characteristics of the data sets used in our comparisons is provided in Table 2. An important remark about the c-separation criterion is that it acts only as a lower bound on the separation of the data set, and hence does not fully determine its complexity. Thus, a mixture generated with a smaller nominal c can turn out to be less complex than one generated with a larger c.7 To alleviate this, we used the approach described in Maitra (2009); specifically, we restricted the components so that at least one pair had a separation value close to the specified c. Figure 3 shows an example of the separation values obtained for the Gaussian pairs of a GMM according to the c-separation and ω-separation criteria, respectively. We used three different separation values for each criterion. More specifically, we adopted $c \in \{0.5, 1.0, 2.0\}$ and $\bar{\omega} \in \{0.1, 0.01, 0.001\}$, where $\bar{\omega}$ is the average of $\omega_{jl}$ over all pairs of components. For each of these values, we generated data sets with $k \in \{4, 12\}$ components, $M \in \{4, 8\}$ features, and $N = 400$ (for $k = 4$) or $N = 1{,}200$ (for $k = 12$) samples (objects). Table 3 shows the separation values for each generated data set. Hence, we have data sets for both the well-separated and the poorly separated configurations discussed in Dasgupta (1999) and Maitra and Melnykov (2010).

Figure 3: Example of the separation criterion for each pair of Gaussians of a GMM.

Table 2:
Summary of data sets. The correct number of clusters (k) for the Pendigits data set is not known.
Data Set   N   M   k   Data Set   N   M   k
9Gauss 900  c1.0-k12M4 1,200  12
49Gauss 4,900 49  c2.0-k12M4 1,200  12
3-components 900  c0.5-k12M8 1,200  12
Triangle 500  c1.0-k12M8 1,200  12
Letter A 300  c2.0-k12M8 1,200  12
Letter B 400  0.1-k4M4 400
Letter M 400  0.01-k4M4 400
Letter Y 300  0.001-k4M4 400
In X 500  0.1-k4M8 400
Two bars 300  0.01-k4M8 400
Pendigits 5,629 —  0.001-k4M8 400
c0.5-k4M4 400  0.1-k12M4 1,200  12
c1.0-k4M4 400  0.01-k12M4 1,200  12
c2.0-k4M4 400  0.001-k12M4 1,200  12
c0.5-k4M8 400  0.1-k12M8 1,200  12
c1.0-k4M8 400  0.01-k12M8 1,200  12
c2.0-k4M8 400  0.001-k12M8 1,200  12
c0.5-k12M4 1,200  12
Table 3:
Values of the c and ω separation criteria for the generated data sets.
Data Set       min(c)   mean(c)   max(c)   mean(ω)   max(ω)
c0.5-k4M4      0.50     1.90      2.80     0.117     0.554
c1.0-k4M4      1.00     2.70      3.30     0.035     0.206
c2.0-k4M4      2.00     2.80      3.90     0.006     0.021
c0.5-k4M8      0.50     1.40      1.80     0.055     0.311
c1.0-k4M8      1.00     1.90      2.20     0.013     0.077
c2.0-k4M8      2.00     3.40      4.00     0.000     0.000
c0.5-k12M4     0.50     2.90      4.00     0.017     0.549
c1.0-k12M4     1.00     2.90      4.00     0.014     0.236
c2.0-k12M4     2.00     2.90      4.00     0.004     0.022
c0.5-k12M8     0.50     2.10      4.00     0.007     0.312
c1.0-k12M8     1.00     3.10      4.00     0.001     0.061
c2.0-k12M8     2.00     3.20      4.00     0.000     0.000
0.1-k4M4       0.60     1.10      1.50     0.100     0.239
0.01-k4M4      1.20     1.50      2.10     0.010     0.023
0.001-k4M4     1.70     1.90      2.10     0.001     0.003
0.1-k4M8       0.40     0.70      0.90     0.100     0.241
0.01-k4M8      0.90     1.20      1.60     0.010     0.036
0.001-k4M8     1.20     1.60      2.00     0.001     0.005
0.1-k12M4      0.10     1.00      1.90     0.100     0.502
0.01-k12M4     0.70     2.40      5.50     0.010     0.176
0.001-k12M4    1.10     5.50      10.10    0.001     0.065
0.1-k12M8      0.30     0.70      1.00     0.100     0.300
0.01-k12M8     0.60     1.30      2.10     0.010     0.102
0.001-k12M8    0.80     1.80      2.60     0.001     0.020

We report two kinds of experimental analyses. First, in Section 5.1, we assess the different ESM-EM variants discussed in Section 4. Considering the obtained results, we then selected the best-performing ESM-EM variant and compared it with the state-of-the-art algorithm proposed in Pernkopf and Bouchaffra (2005). Along with these comparisons, and to serve as a baseline method, we used OMR-EM (Algorithm 1). These analyses are described in Section 5.2.

5.1.  ESM-EM Variants

We report the average over 10 runs of each ESM-EM variant on each data set. The parameters were set as follows: (1) the stopping criterion (SC) is a maximum of 50 generations or 10 consecutive generations without improvement of the best fitness; (2) the population size (Psize) is equal to 10; (3) the maximum number of EM iterations (tEM) is equal to 5; (4) the maximum number of k-means iterations (tKM, in the VS operator) is equal to 5; (5) the user-defined parameters of the SVDS operator are equal to 0.5 (the same values used in Zhang et al., 2003); (6) the maximum number of clusters (kmax) follows the rule of thumb discussed in Section 4.5. Table 4 shows the average ranks obtained by each variant. We summarize the results for three different criteria: (1) the objective function (MDL) value; (2) the number of EM iterations performed during the evolutionary search; and (3) the accuracy in estimating the number of clusters. The ranks were derived by sorting the values of each criterion obtained by each variant and then averaging over all data sets. For instance, if a given variant provided the best results according to the MDL criterion in all data sets, then its rank would be #1. Considering the MDL values obtained by each variant, the adoption of variance-based splitting and the partial log-likelihood function, that is, the VS-PLL variant, presented better results more consistently (smaller average rank). As this is the criterion of main interest in this study, if one had to choose among these variants, we would recommend VS-PLL, which is also the method of choice if one seeks the best trade-off among all criteria.

Table 4:
Average ranks for each variant considering the fitness value (MDL), the number of EM iterations (EMS), and the absolute difference between the estimated and the true number of clusters (ΔK).
Variant        MDL    EMS    ΔK     Average Rank
SVD            5.03   3.57   4.99   4.53
OVER           4.50   4.40   4.79   4.56
VS             4.63   4.77   5.20   4.87
PLL            4.31   4.94   4.16   4.47
VS-OVER        4.56   4.39   4.61   4.52
VS-PLL         4.01   3.83   3.84   3.90
OVER-PLL       4.36   4.74   4.86   4.65
VS-OVER-PLL    4.60   5.36   3.56   4.50

After making these general observations, we assess the variants by inspecting their results on the more complex data sets. To verify the quality of the GMMs obtained by the different variants, we also assess the log-likelihood on a held-out sample with 10,000 objects drawn from the ground-truth GMM. Figures 4-7 show the per-object differences in held-out log-likelihood between each variant and SVD, which was chosen as the baseline for comparison purposes. From this perspective, note that the smaller the value, the better the variant. In each figure, the differences correspond to those observed for a specific separation criterion and a specific number of clusters. For instance, in Figure 4 the two plots in the top row refer to the case where the c-separation criterion equals 0.5, whereas those in the bottom row refer to the case where it equals 2. The two left-hand plots refer to the data sets with four features (M = 4), whereas the two right-hand plots refer to the data sets with eight features (M = 8). In each plot, the bars correspond to the differences obtained by each variant.

Figure 4: Differences in per-object held-out log-likelihood between each ESM-EM variant and the SVD variant on c-sep data sets (the smaller the value, the better the variant). Results for data sets in which no difference was observed are omitted.

Figure 5: Differences in per-object held-out log-likelihood between each ESM-EM variant and the SVD variant on ω-sep data sets (the smaller the value, the better the variant).

Figure 6: Differences in per-object held-out log-likelihood between each ESM-EM variant and the SVD variant on c-sep data sets (the smaller the value, the better the variant).

Figure 7: Differences in per-object held-out log-likelihood between each ESM-EM variant and the SVD variant on ω-sep data sets (the smaller the value, the better the variant).

Some key observations can be drawn from Figures 4-7. First, when the numbers of features and clusters are low, there are few differences between the models obtained by the variants, except in one case. This can be explained by two factors: (1) the search space does not have many local optima, so the different operators can all reach the same (best) solution; and (2) the data set in question is more complex than the others because of the existence of a highly overlapping pair of components (see the max(ω) column in Table 3), and is thus more likely to have a larger number of local optima. By increasing only the number of features (M = 8), differences between the variants emerge more frequently, and all variants were able to obtain a better model than SVD in at least five out of six data sets. In the opposite scenario (increasing only the number of clusters, k = 12), the differences among the models are generally negligible (per-object log-likelihood differences smaller than 0.01). A common pattern in these results is that most variants were able to reach the same (best) solution, except for SVD. This observation suggests that, for such cases, choosing the simplest (fastest) variant would be enough to have a good chance of escaping from local optima. For the most appealing scenario, with high numbers of features and clusters (k = 12, M = 8), this does not seem to hold. Specifically, the VS-PLL variant, the one that, as previously discussed, provided the best results on average over all evaluation criteria, obtained models equal to or better than those of the other variants in all data sets. Considering all the collected empirical evidence, we conclude that among the ESM-EM variants assessed, VS-PLL is the most promising one; therefore, we chose it for further comparisons with related algorithms.

5.2.  Comparisons with Related Algorithms

In this section we compare the best variant of ESM-EM, namely, VS-PLL, with the state-of-the-art GA-EM algorithm (Pernkopf and Bouchaffra, 2005). For the sake of simplicity, hereafter we use ESM-EM to refer to this variant. In our comparisons, we also use the OMR-EM algorithm (Section 2) as a baseline, because it has been widely used in practice.

Some of the GA-EM parameters are common to other EAs, such as the mutation and crossover probabilities. GA-EM also has additional parameters, namely the maximum correlation threshold and the minimum amount of likelihood that a Gaussian component must have. These parameters are hard to set in practice, especially when using EAs for clustering problems. Under controlled experiments, however, we can (optimistically) adopt a fine-tuning procedure to obtain good values for each parameter and then proceed with our comparisons. In this sense, we adopted the SPOT framework (Bartz-Beielstein et al., 2005) to tune the GA-EM parameters. Specifically, four parameters were optimized: (1) the annihilation threshold; (2) the mutation probability; (3) the crossover probability; and (4) the correlation threshold. The population size and the number of EM iterations were set to 10 and 5, respectively, and the maximum number of clusters was set to the same value adopted for ESM-EM. For simplicity, only the best results obtained by GA-EM are reported. However, we emphasize that such results were obtained only after considerable tuning and computational effort for each data set; this (computationally intensive) fine-tuning procedure could hardly be performed in practical applications.

The ESM-EM algorithm does not have critical parameters. More precisely, all of them (i.e., the population size, the maximum numbers of EM/k-means iterations, and the maximum number of clusters) can be defined according to the available computational resources. For instance, with a smaller population size, the algorithm is expected to require more time to converge (Naldi et al., 2011). Similarly, the number of EM/k-means iterations can be set by considering the available running time. Regarding the maximum number of clusters, the user may have expectations about it and can set this value accordingly. If this is not the case, and assuming that the minimum number of clusters is two, we note that the higher the maximum number of clusters, the higher the running time for exploring and exploiting the search space. In summary, the user can set a range for k by taking into account either computational limitations or expectations about the data partition to be induced. In this sense, ESM-EM can be seen as an improvement over GA-EM because its parameters need to be adjusted only to the available computational resources, not fine-tuned for the learning process itself, which is quite handy in practice. From a more theoretical viewpoint, since the ESM-EM operators are guided and based on split and merge operations, ESM-EM can be deemed a more principled algorithm in comparison to GA-EM, which relies on a more random search through its operators.

We assessed the algorithms (ESM-EM, GA-EM, and OMR-EM) considering the trade-off between model quality and running time. To perform fair running time comparisons, all of them were implemented in MATLAB and run on the same computer (Opteron 2 GHz, 24 GB of RAM) with only the operating system running in parallel.

A relevant practical aspect to be illustrated is that the use of an EA to optimize the parameters of a GMM, including the estimation of the model structure, can be both more computationally efficient and more accurate than the use of multiple repetitions of EM. To the best of our knowledge, this aspect has not been properly addressed in the literature. To do so, we initially ran OMR-EM with np = 10 and saved the execution time for each complete pass over the number of clusters, as well as the best model found so far. Thus, we had the computational cost and the quality of the obtained models for each value of np in {1, ..., 10}. Then, we ran the EAs (ESM-EM and GA-EM) until their running times reached the computational cost required by OMR-EM (with np = 10), saving the intermediate solutions. That way, we could compare the best solutions found by each algorithm under a given time constraint, considering the time spent by OMR-EM with different values of np; in other words, we used np as the time unit. Note that we do not count the time spent on parameter tuning for GA-EM as part of the reported running times. In this sense, GA-EM's results can be considered very optimistic.

Table 5 summarizes the overall performance of each algorithm by reporting the average rankings over 10 trials and all data sets. For instance, the smallest value obtained by GA-EM for np = 1 (1.47 in the second row, fourth column) indicates that, averaging over all data sets, GA-EM obtained the smallest MDL values under that time constraint. From these results, one can note that there is a consistent ranking ordering of the three algorithms for small values of np: GA-EM, ESM-EM, and OMR-EM. For the largest values of np, ESM-EM obtains the best rankings. One can also see that the EAs can be more computationally efficient than multiple repetitions of EM. Finally, ESM-EM, which does not have critical parameters, is competitive with GA-EM even when GA-EM's parameters have been carefully fine-tuned beforehand. As discussed, this fine-tuning procedure requires both some a priori knowledge about the learning problem (usually not available in practice) and considerable computational effort. As expected, as more computational resources become available, that is, as np increases, the difference between OMR-EM and the EAs is reduced, since all algorithms are given enough time (or chances, in the case of OMR-EM) to reach or approach the global optimum.

Table 5:
Average ranks obtained by each algorithm considering the MDL value of the best model for each value of np.
np    OMR-EM   ESM-EM   GA-EM
1     2.51     2.01     1.47
2     2.51     1.83     1.66
3     2.46     1.89     1.66
4     2.41     1.90     1.69
5     2.30     1.93     1.77
6     2.24     1.94     1.81
7     2.16     1.94     1.90
8     2.14     1.89     1.97
9     2.14     1.87     1.99
10    2.14     1.90     1.96

We now address some specific results of particular interest. Figure 8 shows representative results obtained on six data sets (results for the other data sets are omitted due to space constraints). In these graphs, the y-axis shows the MDL value divided by the number of objects, and the x-axis represents the number of repetitions, np, performed by OMR-EM.

Figure 8: Box plots of the MDL values (divided by the number of objects) obtained by each algorithm. The x-axis indicates the number of repetitions (np) performed by OMR-EM.

For the 49Gauss data set (Figure 8(a)), ESM-EM outperformed both OMR-EM and GA-EM; similar results were observed for 9Gauss. Even for small values of np, ESM-EM obtained better results with smaller variances. As such values can be considered very small in practice, especially for a problem with 4,900 objects, the computational efficiency of ESM-EM is appealing. For Pendigits (Figure 8(b)), OMR-EM shows higher variances than the EAs for small values of np. The EAs' capability of consistently generating good solutions in such a short time is again evidenced, and once more ESM-EM provided better solutions than the (fine-tuned) GA-EM. The results for the k12M4 (Figure 8(c)) and k12M8 (Figure 8(d)) data sets show a similar trend, but with GA-EM showing equal or slightly better results than ESM-EM; even in such scenarios, however, ESM-EM can be more attractive than GA-EM because it does not have critical parameters. For the k12M4 data set of Figure 8(e), ESM-EM was able to obtain good solutions in a short amount of time and competitive results when more processing time was available; similar results were obtained for the corresponding k12M8 data set. Finally, the results for the k12M8 data set in Figure 8(f) are an example of how inefficient OMR-EM can be when compared to evolutionary (guided) search. A similar trend can be observed when comparing ESM-EM and GA-EM: once more ESM-EM converged quickly, and in this case no other algorithm was able to obtain better solutions.

6.  Final Remarks

This paper has described the evolutionary split and merge for expectation maximization (ESM-EM) algorithm, which is based on the use of split and merge operations to evolve Gaussian mixture models (GMMs). Along with the algorithm and its asymptotic complexity analysis, different variants of ESM-EM were proposed by considering different split and merge operations. In particular, besides adapting and evaluating some principled approaches from the literature, we also proposed a new splitting procedure, named variance-based splitting (VS), and a new measure of the overlapping between components (OVER), and we analyzed the use of the partial log-likelihood (PLL) to indicate clusters that should be split.

We performed an extensive empirical evaluation of ESM-EM and its variants using 35 data sets. The ESM-EM variant based on the proposed VS and PLL approaches consistently provided the best results. We compared this variant with both the ordered multiple runs of expectation maximization (OMR-EM) procedure and the genetic-based expectation maximization (GA-EM) algorithm (Pernkopf and Bouchaffra, 2005). Our experimental results showed that evolutionary algorithms can be more computationally efficient than OMR-EM, which is widely used in practice. To the best of our knowledge, such a systematic experimental evaluation of this relevant practical aspect had not been performed before, so this is also a contribution of our work. Moreover, the proposed ESM-EM provided results competitive with GA-EM (Pernkopf and Bouchaffra, 2005), even when GA-EM's critical parameters were fine-tuned a priori by means of a computationally intensive procedure. Since ESM-EM does not have critical parameters, it can be considered an improvement over GA-EM. Taking into account all the obtained results, we believe that ESM-EM is eligible to join the pool of state-of-the-art algorithms for learning GMMs in practice.

As future work, improvements to the mutation operators are a promising direction, for instance, performing k-means on the components obtained by the SVDS operator.

Acknowledgments

This work was supported by NSF Grants (IIS-0713142 and IIS-1016614), by São Paulo State Research Foundation (FAPESP), and by the Brazilian National Research Council (CNPq).

References

Asuncion, A., and Newman, D. (2007). UCI machine learning repository. Irvine, CA: University of California, Dept. of Information and Computer Science. http://www.ics.uci.edu/~mlearn/MLRepository.html

Bäck, T., Fogel, D. B., and Michalewicz, Z. (2000). Evolutionary computation 1: Basic algorithms and operators. Bristol, UK: Institute of Physics Publishing.

Bartz-Beielstein, T., Lasarczyk, C., and Preuss, M. (2005). Sequential parameter optimization. In Proceedings of the IEEE Congress on Evolutionary Computation, pp. 773-780.

Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.

Campello, R. J., Hruschka, E. R., and Alves, V. S. (2009). On the efficiency of evolutionary fuzzy clustering. Journal of Heuristics, 15:43-75.

Chen, X., Liu, X., and Jia, Y. (2011). Discriminative structure selection method of Gaussian mixture models with its application to handwritten digit recognition. Neurocomputing, 74(6):954-961.

Covões, T. F., and Hruschka, E. R. (2011). Splitting and merging Gaussian mixture model components: An evolutionary approach. In Proceedings of the International Conference on Machine Learning and Applications, pp. 106-111.

Covões, T. F., and Hruschka, E. R. (2013). Unsupervised learning of Gaussian mixture models: Evolutionary create and eliminate for expectation maximization algorithm. In Proceedings of the IEEE Congress on Evolutionary Computation, pp. 3206-3213.

Dasgupta, S. (1999). Learning mixtures of Gaussians. In Proceedings of the Annual Symposium on Foundations of Computer Science, pp. 634-644.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1-38.

Falkenauer, E. (1998). Genetic algorithms and grouping problems. New York: Wiley.

Figueiredo, M., and Jain, A. (2002). Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):381-396.

Golub, G. H., and Van Loan, C. F. (1996). Matrix computations (3rd ed.). Baltimore: Johns Hopkins University Press.

Horta, D., and Campello, R. J. (2009). Fast evolutionary algorithms for relational clustering. In Proceedings of the International Conference on Intelligent Systems Design and Applications, pp. 1456-1462.

Hruschka, E. R., Campello, R. J., Freitas, A., and de Carvalho, A. C. (2009). A survey of evolutionary algorithms for clustering. IEEE Transactions on Systems, Man and Cybernetics, Part C: Applications and Reviews, 39(2):133-155.

Li, Y., and Li, L. (2009). A novel split and merge EM algorithm for Gaussian mixture models. In Proceedings of the International Conference on Natural Computation, pp. 479-483.

Maitra, R. (2009). Initializing partition-optimization algorithms. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(1):144-157.

Maitra, R., and Melnykov, V. (2010). Simulating data to study performance of finite mixture modeling and clustering algorithms. Journal of Computational and Graphical Statistics, 19(2):354-376.

McLachlan, G., and Peel, D. (2000). Finite mixture models. New York: Wiley.

McLachlan, G. J., and Krishnan, T. (1997). The EM algorithm and extensions. New York: Wiley.

Naldi, M. C., Campello, R. J., Hruschka, E. R., and Carvalho, A. C. (2011). Efficiency issues of evolutionary k-means. Applied Soft Computing, 11(2):1938-1952.

Nguyen, C. D., and Cios, K. J. (2008). GAKREM: A novel hybrid clustering algorithm. Information Sciences, 178(22):4205-4227.

Pernkopf, F., and Bouchaffra, D. (2005). Genetic-based EM algorithm for learning Gaussian mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:1344-1348.

Redner, R. A., and Walker, H. F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195-239.

Reynolds, D., and Rose, R. (1995). Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing, 3(1):72-83.

Rissanen, J. (1989). Stochastic complexity in statistical inquiry theory. Singapore: World Scientific.

Tawfick, M. M., Abbas, H. M., and Shahein, H. I. (2008). An integer-coded evolutionary approach for mixture maximum likelihood clustering. Pattern Recognition Letters, 29:515-524.

Tohka, J., Krestyannikov, E., Dinov, I., Graham, A., Shattuck, D., Ruotsalainen, U., and Toga, A. (2007). Genetic algorithms for finite mixture model based voxel classification in neuroimaging. IEEE Transactions on Medical Imaging, 26(5):696-711.

Ueda, N., and Nakano, R. (1998). Deterministic annealing EM algorithm. Neural Networks, 11(2):271-282.

Ueda, N., Nakano, R., Ghahramani, Z., and Hinton, G. E. (2000). SMEM algorithm for mixture models. Neural Computation, 12:2109-2128.

Xu, R., and Wunsch, D. (2009). Clustering. New York: Wiley.

Zhang, Z., Chen, C., Sun, J., and Chan, K. L. (2003). EM algorithms for Gaussian mixtures with split-and-merge operation. Pattern Recognition, 36(9):1973-1983.

Notes

1. Software is available upon request from the authors.

2. A detailed analysis of these results, in the k-means context, can be seen in Naldi et al. (2011). The adaptation to the more general case of unrestricted covariance matrices is straightforward.

3. We adopt the same values used in Zhang et al. (2003).

4. Note that this function is formally equivalent to the well-known Bayesian information criterion (BIC) (Figueiredo and Jain, 2002; McLachlan and Peel, 2000).

5. Crossover operators were not used because it is not clear how to design them in such a way that they are both context sensitive and computationally efficient; see Hruschka et al. (2009) for a discussion of this subject. Roughly speaking, in the context of our work, if standard crossover operators are applied to two identical cluster-based encoded individuals, they may produce a different individual because of the identifiability problem.

6. GAKREM has an overall cost of the same order when the EM algorithm with full covariance matrices is used.

7. A c-separated mixture is one in which all pairs of Gaussians satisfy the c-separation criterion.