## Abstract

α-integration and α-GMM have been recently proposed for integrated stochastic modeling. However, there has not been an approach to date for estimating model parameters for α-GMM in a statistical way, based on a set of training data. In this letter, parameter updating formulas are mathematically derived based on maximum likelihood criterion using an adapted expectation-maximization algorithm. With this method, model parameters for α-GMM are reestimated in an iterative way. The updating formulas were found to be simple and systematically compatible with the GMM equations. This advantage renders the α-GMM a superset of the GMM but with similar computational complexity. This method has been effectively applied to realistic speaker recognition applications.

## 1. Introduction

Gaussian mixture model (GMM) has been well established for decades and is a dominant stochastic modeling technique for a variety of pattern recognition applications. Although conventional GMM has good capacity in stochastic modeling, it often runs into problems when it is applied to robust recognition in adverse conditions. For instance, in speaker recognition, GMM excels at modeling distribution characteristics of data from high-band clean speech but is inferior in modeling distributions from low-band or noisy (convolutional or additive noise) data (Reynolds, 1995; Wu, Morris, & Koreman, 2005; Wu, 2006). This has become a big issue in realistic applications. This is the general background of robust speaker recognition, the topic that this letter addresses. In this letter, we often discuss α-GMM in the context of a speaker recognition application. However, GMM and α-GMM can be applied to other domains as well.

α-GMM is one approach to the problem of robust modeling; it was recently proposed to extend conventional GMM into a new framework (Wu, 2008). α-GMM is a more sophisticated model that integrates stochastic modeling components in a nonlinear way, whereas conventional GMM can be regarded as combining its components in a linear way. The procedure of nonlinear combination is referred to as α-integration, which introduces an additional factor α to each component in an integrated stochastic model. With α set to −1, the α-GMM degenerates into conventional GMM. Therefore, α-GMM is in fact a superset of traditional GMM, and the new framework of α-GMM is a natural extension of canonical GMM. With α set smaller than −1, the integrated probability density function (pdf) favors larger component values and deemphasizes smaller component values. The integrated pdf therefore possesses a flatter distribution than classical GMM. This feature, referred to as α-warping, is discussed in section 3.

α-GMM has a variety of advantages over conventional GMM, besides being a superset of GMM. α-GMM combines stochastic modeling components with α-integration, while α-integration has been proved optimum in a sense to minimize the extended Kullback-Leibler distance (also referred to as α-divergence; see definition 9) between an integrated stochastic model and its components. Moreover, it was also found that α-integration is very similar to the nonlinear way of combining multiple source channels that occurs in human brains (Amari, 2007). Hence, α-GMM can be considered a more intelligent modeling technique that uses a bio-inspired mechanism. Traditional GMM does not have these advantages.

To the best of our knowledge, no algorithm has been proposed to date to address the issue of estimating model parameters for α-GMM given a data set. Although Amari (2007) mentioned a conceptual idea of gradient descent, it is still far from applying α-GMM to realistic tasks like speaker recognition. This motivates us to propose a training algorithm that can automatically estimate model parameters on a given training set in an iterative way. The proposed algorithm is mathematically derived by solving an optimization problem based on a maximum likelihood criterion with the application of the expectation-maximization (EM) algorithm adaptively (Baum & Sell, 1968; Baum, Petrie, Soules, & Weiss, 1970; Dempster, Laird, & Rubin, 1977). The reestimation equations were found to be simple and compatible with the conventional GMM equations. This property makes α-GMM an ideal method to extend the conventional gaussian mixture model.

Although this letter began with the topic of speaker recognition, the method proposed here is quite general. In fact, the focus of this letter is the mathematical derivation of a theorem to reestimate model parameters for α-GMM, with a rigorous proof. The theory is also supported by preliminary speaker recognition experiments.

This letter has two parts. First, it presents a theorem concerning the reestimation formulas for α-GMM. This is the main result. The rest of the letter proves this theorem.

The proof is based on applying the EM algorithm adaptively. In the proof, following the general framework of the EM algorithm, the two steps of expectation (E-step) and maximization (M-step) will be carried out. In the E-step, the expectation of an objective function will be given based on the maximum likelihood criterion; in the M-step, the expectation will be maximized to obtain the reestimation formulae. This is an iterative procedure that eventually converges to a local optimum.

Besides the theoretical proof of the main theorem, we also present experimental results on a simple but moderately sized realistic application, a robust speaker identity task, in order to show the difference between the proposed learning algorithm for α-GMM and conventional GMM. The experiments were carried out on a corpus of telephony speech, NTIMIT (Fisher, Doddington, & Goudie-Marshall, 1986). A moderate number of speakers (162) were tested for α-GMM and traditional GMM. A wide range of values of the factor α was evaluated. It was found that the accuracy with all tested values of α smaller than −1 was higher than the baseline. In particular, α = −6 gave the largest improvement in accuracy, 3.8% (a relative error reduction of 7.8%). This basically confirms that the proposed learning algorithm is valid.

The rest of the letter is organized as follows. In section 2, we introduce some elementary notations that are used throughout this letter. Section 3 presents the basic concepts of α-GMM and clarifies the relationship between α-GMM and conventional GMM. Section 4 presents the main theorem concerning the learning algorithm for α-GMM, followed by a detailed proof. Experimental results for a speaker identity task on NTIMIT are described in section 5. Section 6 notes some advantages and limitations of the proposed method. Conclusions are drawn in section 7.

## 2. Notations

Before delving into detail to present the theory of α-GMM, we first define some basic concepts and notations that will be used throughout this letter:

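- **x** = (x_{1}, x_{2}, …, x_{d}). A d-dimensional vector **x** represents a random variable, which often stands for a frame in speaker recognition.
- **X** = {**x**_{t}}, t ∈ [1, N]. A set of random vectors **x**_{t}. In speaker recognition, we often use it to represent a speech utterance spoken by a certain speaker.
- p(**x**). A probability density function (pdf) for a random variable **x**.
- Pr(**x**). A probability distribution function for a random variable **x**.
- f_{α}(·). A warping function (Amari, 2007) for a pdf p(**x**) with a factor α, which is defined in two cases:

  $$f_\alpha(p(\mathbf{x})) = \frac{2}{1-\alpha}\, p(\mathbf{x})^{\frac{1-\alpha}{2}}, \quad \alpha \neq 1, \tag{2.2}$$

  $$f_\alpha(p(\mathbf{x})) = \log p(\mathbf{x}), \quad \alpha = 1. \tag{2.3}$$

- α-integration. We call the following equation α-integration (Hardy, Littlewood, & Polya, 1952; Petz & Temesi, 2005), for a set of pdfs p_{i}(**x**), i ∈ [1, K], using equation 2.2 and equation 2.3,

  $$\tilde{p}_\alpha(\mathbf{x}) = c\, f_\alpha^{-1}\!\left(\sum_{i=1}^{K} w_i\, f_\alpha\big(p_i(\mathbf{x})\big)\right), \tag{2.4}$$

  where the weights w_{i} need to satisfy w_{i} ⩾ 0 and ∑^{K}_{i=1} w_{i} = 1, and c is a normalization constant that makes the integrated $\tilde{p}_\alpha(\mathbf{x})$ a pdf:

  $$c = \left[\int f_\alpha^{-1}\!\left(\sum_{i=1}^{K} w_i\, f_\alpha\big(p_i(\mathbf{x})\big)\right) d\mathbf{x}\right]^{-1}. \tag{2.5}$$

- α-divergence. The α-divergence (Chernoff, 1952) between two pdfs, p(**x**) and q(**x**), is denoted as

  $$D_\alpha\big(p(\mathbf{x})\,\|\,q(\mathbf{x})\big) = \frac{4}{1-\alpha^2}\left(1 - \int p(\mathbf{x})^{\frac{1-\alpha}{2}}\, q(\mathbf{x})^{\frac{1+\alpha}{2}}\, d\mathbf{x}\right), \quad \alpha \neq \pm 1.$$

Clearly, α-divergence has four fundamental properties:

- *D*_{α}(*p*(**x**)||*q*(**x**)) ⩾ 0
- *D*_{α}(*p*(**x**)||*p*(**x**)) = 0
- *D*_{−1}(*p*(**x**)||*q*(**x**)) = *KL*(*p*(**x**)||*q*(**x**))
- *D*_{1}(*p*(**x**)||*q*(**x**)) = *KL*(*q*(**x**)||*p*(**x**))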

The *KL*[•] above is the well-known Kullback-Leibler (KL) divergence.
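As a small numerical illustration of these notations, the following Python sketch evaluates the warping function, its inverse, the unnormalized α-integration at a single point, and a discretized α-divergence. It assumes the Amari-style forms f_α(z) = 2/(1 − α) · z^((1 − α)/2) for α ≠ 1 and log z for α = 1, and an α-divergence of the form 4/(1 − α²) · (1 − ∫ p^((1 − α)/2) q^((1 + α)/2) dx) for α ≠ ±1. The function names and the grid-based integration are illustrative choices, not part of the original letter.

```python
import math

def f_alpha(z, alpha):
    """Warping function (assumed form): log z if alpha == 1,
    otherwise 2/(1-alpha) * z**((1-alpha)/2)."""
    if alpha == 1:
        return math.log(z)
    return 2.0 / (1.0 - alpha) * z ** ((1.0 - alpha) / 2.0)

def f_alpha_inv(u, alpha):
    """Inverse of f_alpha."""
    if alpha == 1:
        return math.exp(u)
    return ((1.0 - alpha) / 2.0 * u) ** (2.0 / (1.0 - alpha))

def alpha_integrate(pdf_values, weights, alpha):
    """Unnormalized alpha-integration of component pdf values at one point."""
    warped = sum(w * f_alpha(p, alpha) for w, p in zip(weights, pdf_values))
    return f_alpha_inv(warped, alpha)

def alpha_divergence(p, q, alpha, dx):
    """Discretized alpha-divergence between two densities sampled on a
    grid with spacing dx (a Riemann-sum approximation of the integral)."""
    if alpha == -1:   # limit case: KL(p || q)
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)) * dx
    if alpha == 1:    # limit case: KL(q || p)
        return sum(qi * math.log(qi / pi) for pi, qi in zip(p, q)) * dx
    overlap = sum(pi ** ((1.0 - alpha) / 2.0) * qi ** ((1.0 + alpha) / 2.0)
                  for pi, qi in zip(p, q)) * dx
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - overlap)
```

With α = −1 the integration reduces to the ordinary weighted sum of the component values, and with α = 1 it becomes their weighted geometric mean.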

Bearing in mind these notations, we present the theory of α-GMM.

## 3. α-Integrated Gaussian Mixture Model

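The α-gaussian mixture model (α-GMM) is a gaussian mixture model that applies α-integration. GMM integrates individual mixtures with an affine combination, with $\sum_i w_i = 1$ and $w_i > 0$ for the weight $w_i$ of the $i$th gaussian mixture. α-GMM instead integrates its components with α-integration:

Given K multiple-dimensional gaussian distributions $N_i(\mathbf{x})$, i ∈ [1, K], and a sequence of weights $\{w_i\}$, where $\sum_{i=1}^{K} w_i = 1$, $w_i > 0$, the α-GMM for a random variable $\mathbf{x}$ is denoted as the α-integration of $N_i(\mathbf{x})$,

$$\tilde{p}_\alpha(\mathbf{x}) = \begin{cases} c \left(\displaystyle\sum_{i=1}^{K} w_i\, N_i(\mathbf{x})^{\frac{1-\alpha}{2}}\right)^{\frac{2}{1-\alpha}}, & \alpha \neq 1, \\[2ex] c\, \exp\left(\displaystyle\sum_{i=1}^{K} w_i \log N_i(\mathbf{x})\right), & \alpha = 1, \end{cases} \tag{3.1}$$

where c is a normalization factor with the form of equation 2.5 to make $\tilde{p}_\alpha(\mathbf{x})$ a pdf.

Setting α = −1 recovers the conventional GMM. Taking *K* = 2 for simplicity:

$$\tilde{p}_{-1}(\mathbf{x}) = w_1 N_1(\mathbf{x}) + w_2 N_2(\mathbf{x}), \quad c = 1.$$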

The α-GMM can therefore be regarded as a superset of GMM: we have GMM ∈ α-GMM, and GMM is a special case of α-GMM. This is one of the important properties of α-GMM.
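The reduction to GMM at α = −1 can be checked numerically. The sketch below is a one-dimensional illustration with hypothetical parameter values. It assumes the α-GMM has the form c · (Σ w_i N_i(x)^((1 − α)/2))^(2/(1 − α)) for α ≠ 1, and it approximates the normalization constant c by a Riemann sum on a grid, which is only practical in low dimensions.

```python
import math

def gauss(x, mean, var):
    """One-dimensional gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def unnormalized_alpha_gmm(x, weights, means, variances, alpha):
    """alpha-integration of the gaussian components at x, without the constant c."""
    comps = [gauss(x, m, v) for m, v in zip(means, variances)]
    if alpha == 1:
        return math.exp(sum(w * math.log(p) for w, p in zip(weights, comps)))
    beta = (1.0 - alpha) / 2.0
    return sum(w * p ** beta for w, p in zip(weights, comps)) ** (1.0 / beta)

def alpha_gmm_on_grid(grid, weights, means, variances, alpha):
    """Return (pdf values on the grid, normalization constant c).
    c is estimated with a Riemann sum, so the grid must cover the mass."""
    dx = grid[1] - grid[0]
    raw = [unnormalized_alpha_gmm(x, weights, means, variances, alpha) for x in grid]
    c = 1.0 / (sum(raw) * dx)
    return [c * r for r in raw], c

# Hypothetical two-component example on [-10, 10].
grid = [-10.0 + 0.01 * i for i in range(2001)]
pdf_gmm, c_gmm = alpha_gmm_on_grid(grid, [0.5, 0.5], [-1.0, 2.0], [1.0, 0.5], -1)
pdf_a6, c_a6 = alpha_gmm_on_grid(grid, [0.5, 0.5], [-1.0, 2.0], [1.0, 0.5], -6)
```

For α = −1 the estimated c is 1 up to discretization error and the values coincide with the ordinary mixture. For α < −1 the unnormalized function is pointwise no smaller than the GMM (a power-mean inequality), so c falls below 1.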

Another important property of α-GMM worth noting is that it is an optimal integration method for all of its components in the sense of minimizing the α-divergence between the integrated function and its components (Amari, 2007). This is stated as theorem 3.1, whose proof is in the appendix.

It is also worth noting the role of the parameter α in affecting the integrated modeling capacity. With different values of α, a broad set of integrated functions can be constructed, including conventional GMM. With α larger than 1, the effect of the warping function on each pdf component is like warping large values toward zero and warping small values toward infinity (we refer to this feature as α-warping). Therefore, in this case, α-GMM emphasizes the small values of its pdf components and deemphasizes the large values. When α < 1, the effect is just the opposite: it emphasizes large values and deemphasizes small values. In particular, when α < −1, the integrated pdf is flatter than the pdf of conventional GMM, which employs a linear integration. This is demonstrated in Figure 1 with α set to −6. To generate the two graphs, we used two gaussian distributions with equal weights, *w*_{1} = *w*_{2} = 0.5, for both α-GMM and GMM. The two gaussian distributions have variances **var**_{1} = [1; 1] and **var**_{2} = [0.5; 0.5], each with its own mean vector.

The freedom introduced by factor α allows a wider range of integrated functions to be selected to address different applications. Robust speaker recognition is addressed in a later section as an example.

However, for the family of α-integration functions, no method has been proposed concerning the estimation of model parameters to the best of our knowledge. Because many applications that use statistical modeling, such as GMM, critically rely on a learning algorithm to reestimate model parameters based on a given training data set, we adopt a similar strategy to deal with the issue of model parameter estimation for α-GMM. This is in fact the main purpose of this letter. In the next section, we present the main theorem to reestimate the parameters of α-GMM and provide the theorem proof.

## 4. Parameter Estimation Based on Maximum Likelihood with EM

This section has two parts. First, we present the main theorem to the problem of parameter estimation of α-GMM, based on a given data set. The second part presents the detailed proof.

### 4.1. The Main Theorem.

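**Theorem** (parameter reestimation of α-GMM). Define $\Theta^{(n-1)}_{\alpha} = \{\alpha, \boldsymbol{\mu}^{(n-1)}_{i}, \boldsymbol{\Sigma}^{(n-1)}_{i}, w^{(n-1)}_{i}\}$, i ∈ [1, K], as the (n − 1)th setting of the parameters of an α-GMM and $\Theta^{(n)}_{\alpha} = \{\alpha, \boldsymbol{\mu}^{(n)}_{i}, \boldsymbol{\Sigma}^{(n)}_{i}, w^{(n)}_{i}\}$ as the nth setting of the parameters. Let $N^{(n-1)}_{l}(\mathbf{x})$ be the lth gaussian mixture of the α-GMM under the (n − 1)th setting $\Theta^{(n-1)}_{\alpha}$. For a given data sample $\mathbf{x}_t$ at time t, where t ∈ [1, N], and the (n − 1)th parameter setting of the α-GMM, denote

$$\Pr(l \mid \mathbf{x}_t, \Theta^{(n-1)}_{\alpha}) = \frac{w^{(n-1)}_{l} \left[N^{(n-1)}_{l}(\mathbf{x}_t)\right]^{\frac{1-\alpha}{2}}}{\sum_{i=1}^{K} w^{(n-1)}_{i} \left[N^{(n-1)}_{i}(\mathbf{x}_t)\right]^{\frac{1-\alpha}{2}}} \tag{4.1}$$

as the posterior probability of the data sample $\mathbf{x}_t$ being allocated in the lth gaussian mixture of the α-GMM of $\Theta^{(n-1)}_{\alpha}$. Then, for α ≠ 1, the parameters are reestimated as

$$w^{(n)}_{l} = \frac{1}{N}\sum_{t=1}^{N} \Pr(l \mid \mathbf{x}_t, \Theta^{(n-1)}_{\alpha}),$$

$$\boldsymbol{\mu}^{(n)}_{l} = \frac{\sum_{t=1}^{N} \Pr(l \mid \mathbf{x}_t, \Theta^{(n-1)}_{\alpha})\, \mathbf{x}_t}{\sum_{t=1}^{N} \Pr(l \mid \mathbf{x}_t, \Theta^{(n-1)}_{\alpha})},$$

$$\boldsymbol{\Sigma}^{(n)}_{l} = \frac{\sum_{t=1}^{N} \Pr(l \mid \mathbf{x}_t, \Theta^{(n-1)}_{\alpha})\, (\mathbf{x}_t - \boldsymbol{\mu}^{(n)}_{l})(\mathbf{x}_t - \boldsymbol{\mu}^{(n)}_{l})^{T}}{\sum_{t=1}^{N} \Pr(l \mid \mathbf{x}_t, \Theta^{(n-1)}_{\alpha})}.$$

### 4.2. Proof.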

The proof of the main theorem uses the EM algorithm based on the criterion of maximum likelihood estimation (MLE). As is known, EM is composed of a step of expectation (E-step) and a step of maximization (M-step). For the presentation, we use a strategy similar to the one described in Bilmes (1997), but also based on other work on EM algorithms (Baum & Sell, 1968; Baum et al., 1970; Dempster et al., 1977; Jiang, 2007) to describe the E-step and M-step, respectively.

#### 4.2.1. Objective Function of Maximum Likelihood.

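The objective function is the likelihood $L(\mathbf{X} \mid \Theta)$ of the data set $\mathbf{X}$. By taking the definition of α-GMM, equation 3.1, into equation 4.6 according to the two cases of α ≠ 1 and α = 1, we have the log likelihood of the data set $\mathbf{X}$:

$$\log L(\mathbf{X} \mid \Theta) = N \log c + \frac{2}{1-\alpha} \sum_{t=1}^{N} \log \sum_{i=1}^{K} w_i\, N_i(\mathbf{x}_t)^{\frac{1-\alpha}{2}}, \quad \alpha \neq 1,$$

$$\log L(\mathbf{X} \mid \Theta) = N \log c + \sum_{t=1}^{N} \sum_{i=1}^{K} w_i \log N_i(\mathbf{x}_t), \quad \alpha = 1.$$

This objective is optimized by introducing a hidden variable *y* and using lemma 1.

Let $y_t \in \{1, \ldots, K\}$ be a hidden (or unseen) variable that indicates the gaussian index that a data sample $\mathbf{x}_t$ is allocated to, and let $\mathbf{y}$ be an instance of a random variable $\mathbf{Y}$, that is, $\mathbf{y} = (y_1, \ldots, y_N)$ for a given data set $\mathbf{X}$, which shows a possible sequence of gaussian indices allocated to the data samples $\mathbf{x}_t$.

By applying lemma 1, we can transform the problem of optimizing $\log L(\mathbf{X} \mid \Theta)$ into a problem of optimizing *Q*(Θ ∣ Θ^{(n − 1)}).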

#### 4.2.2. E-Step.

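The expectation $Q(\Theta \mid \Theta^{(n-1)})$ of the training data can be rewritten as equation 4.16. By throwing away the constant $c'$, we can simplify equation 4.16 to equation 4.17. In this form, $Q(\Theta \mid \Theta^{(n-1)})$ appears computationally challenging. However, as in Bilmes (1997), it can be simplified if we notice that for $l \in \{1, 2, \ldots, K\}$,

$$\sum_{y_1=1}^{K} \cdots \sum_{y_N=1}^{K} \delta_{l, y_i} \prod_{j=1}^{N} \Pr(y_j \mid \mathbf{x}_j, \Theta^{(n-1)}) = \Pr(l \mid \mathbf{x}_i, \Theta^{(n-1)}). \tag{4.20}$$

This is because $\sum_{j=1}^{K} \Pr(j \mid \mathbf{x}_i, \Theta^{(n-1)}) = 1$. Using equation 4.20, we can rewrite equation 4.17 as equation 4.21, which for α ≠ 1 takes the form

$$Q(\Theta \mid \Theta^{(n-1)}) = \sum_{l=1}^{K} \sum_{i=1}^{N} \log(w_l)\, \Pr(l \mid \mathbf{x}_i, \Theta^{(n-1)}) + \frac{1-\alpha}{2} \sum_{l=1}^{K} \sum_{i=1}^{N} \log N_l(\mathbf{x}_i)\, \Pr(l \mid \mathbf{x}_i, \Theta^{(n-1)}),$$

where $\Pr(l \mid \mathbf{x}_i, \Theta^{(n-1)})$ is the posterior probability introduced in the main theorem.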
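The E-step quantity that matters in practice is the posterior of each mixture given a sample. The sketch below computes it in the log domain for one-dimensional gaussians. It assumes the posterior has the form w_l · N_l(x)^((1 − α)/2) divided by the same expression summed over all mixtures (α ≠ 1); the function names are hypothetical and the log-sum-exp step is an implementation choice for numerical stability, not something taken from the letter.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a one-dimensional gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def alpha_posteriors(x, weights, means, variances, alpha):
    """Posterior probability of each mixture for sample x (alpha != 1).

    Each component log score is scaled by beta = (1 - alpha) / 2 before
    normalization; alpha = -1 gives beta = 1, the usual GMM responsibilities.
    """
    beta = (1.0 - alpha) / 2.0
    logs = [math.log(w) + beta * log_gauss(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)  # subtract the maximum before exponentiating
    log_norm = top + math.log(sum(math.exp(l - top) for l in logs))
    return [math.exp(l - log_norm) for l in logs]
```

Because the posteriors are normalized over the mixtures, they sum to 1 for any admissible α, which is the property used to justify equation 4.20.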

#### 4.2.3. M-Step.

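In the M-step, we maximize the expectation obtained in the E-step for two cases: the weights $w_l$ and the gaussian parameters $\boldsymbol{\mu}_l$ and $\boldsymbol{\Sigma}_l$.

For $w_l$, we have

$$\frac{\partial}{\partial w_l}\left[\sum_{l=1}^{K} \sum_{i=1}^{N} \log(w_l)\, \Pr(l \mid \mathbf{x}_i, \Theta^{(n-1)}) + \lambda \left(\sum_{l=1}^{K} w_l - 1\right)\right] = 0,$$

where λ is a Lagrange multiplier. Solving under the constraint $\sum_{l} w_l = 1$ yields

$$w_l = \frac{1}{N} \sum_{i=1}^{N} \Pr(l \mid \mathbf{x}_i, \Theta^{(n-1)}). \tag{4.28}$$

For $\boldsymbol{\mu}_l$ and $\boldsymbol{\Sigma}_l$, taking equation 4.29 into $\Psi(\Theta_l)$, we can obtain equation 4.30. If we ignore the constant terms (since they disappear after taking derivatives), we get equation 4.31. Therefore, taking the derivative of equation 4.31 with respect to $\boldsymbol{\mu}_l$ and setting it equal to zero, we get equation 4.32, which, solving for $\boldsymbol{\mu}_l$, yields

$$\boldsymbol{\mu}_l = \frac{\sum_{i=1}^{N} \Pr(l \mid \mathbf{x}_i, \Theta^{(n-1)})\, \mathbf{x}_i}{\sum_{i=1}^{N} \Pr(l \mid \mathbf{x}_i, \Theta^{(n-1)})}. \tag{4.33}$$

And similarly, as in Bilmes (1997), we also get

$$\boldsymbol{\Sigma}_l = \frac{\sum_{i=1}^{N} \Pr(l \mid \mathbf{x}_i, \Theta^{(n-1)})\, (\mathbf{x}_i - \boldsymbol{\mu}_l)(\mathbf{x}_i - \boldsymbol{\mu}_l)^{T}}{\sum_{i=1}^{N} \Pr(l \mid \mathbf{x}_i, \Theta^{(n-1)})}. \tag{4.34}$$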
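The M-step updates can be sketched in a few lines. The code below is an illustration for scalar (one-dimensional) data only, and it assumes the reestimation formulas have the same form as the GMM updates in Bilmes (1997), with α entering solely through the posteriors passed in. The function name and the list-of-lists layout of the posteriors are hypothetical.

```python
def m_step(samples, posteriors):
    """One M-step for scalar data.

    samples:    list of N floats
    posteriors: posteriors[t][l] is the posterior of mixture l for sample t
                (computed with whatever alpha is in use)
    Returns (weights, means, variances), one entry per mixture.
    """
    n = len(samples)
    k = len(posteriors[0])
    weights, means, variances = [], [], []
    for l in range(k):
        # Soft count of samples assigned to mixture l.
        occupancy = sum(post[l] for post in posteriors)
        mean = sum(post[l] * x for post, x in zip(posteriors, samples)) / occupancy
        # The variance uses the freshly updated mean, as in the GMM update.
        var = sum(post[l] * (x - mean) ** 2
                  for post, x in zip(posteriors, samples)) / occupancy
        weights.append(occupancy / n)
        means.append(mean)
        variances.append(var)
    return weights, means, variances
```

Alternating a posterior computation with this update gives one iteration of the training loop; a practical implementation would also guard against mixtures whose occupancy approaches zero.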

## 5. Experiments

In this section, we present experiments on robust speaker recognition as an example to demonstrate the performance difference between α-GMM trained with the proposed algorithm and conventional GMM.

Speaker recognition is one of the tasks based on pattern recognition techniques, mainly on statistical modeling, to recognize a speaker's identity by voice characteristics. There are two types of applications in speaker recognition: speaker identification (SI) and speaker verification (SV). An SI task recognizes a speaker identity from a given set of speakers enrolled in the system, and an SV task verifies a speaker's identity by answering a binary question with a yes or no. In our experiments, we selected the SI task to show the effectiveness of the application of α-GMM without losing the generality on its application to other pattern recognition tasks such as SV.

In our experiments, as in Wu et al. (2005), Mel frequency cepstral coefficient features, obtained using the hidden Markov model toolkit (Young et al., 2002), were used, with 20 ms windows and 10 ms shift, a preemphasis factor of 0.97, a Hamming window, and 20 Mel scaled feature bands. All 20 MFCC coefficients were used except c0. On this database, silence removal, cepstral mean subtraction, and time difference features did not increase performance, so these were not used.

We trained a GMM with 32 gaussians for each speaker in a given set (162 speakers) as a baseline, a moderately sized task. We correspondingly trained an α-GMM for each speaker for comparison. The training and test data were based on the NTIMIT database, a telephony corpus (Campbell & Reynolds, 1999). Six utterances were used as training materials for GMM and α-GMM, and two other utterances were used for testing. The recognition criterion is to select the largest score from a given model group, that is, select the most probable speaker model as the target speaker identity. In our configuration, we evaluated the performance of α-GMM by assigning different values to the parameter of α so as to select the different integration functions. The values of α were chosen correspondingly from the range of [−12, 0.5]; the range higher than this set was not tested.

The results are shown in Figure 2. We can see from this figure that when α ∈ [−12, −1), all the integrated functions were experimentally better than the linear integration of the conventional GMM on the telephony corpus. Values of α higher than −1 degraded integration performance for telephony speech. The best performance was attained with α = −6. (Due to the specific purpose of this letter, we shall not present more experiments.) The simple experiment presented here was intended only to show the difference between the proposed training algorithm for α-GMM and that of conventional GMM. (For experimental results for α-GMM applied to robust speaker recognition, see Wu, 2008.)

## 6. Discussion

Here we discuss some issues that have not been covered already.

First, the reestimation formulas given in theorem 1 apply to the case of α ≠ 1. For α = 1, the exponential integration, theorem 1 is not applicable because *w*_{i} and *N*_{i}(**x**) are not separable in equation 4.21 (simple calculus can show this). So in this case, it might be that EM learning cannot be used. How to derive a reestimation formula for this case is a topic for future work. Nevertheless, this does not undermine the most important point of this letter: the proposed reestimation formulas are applicable to most values of α and therefore satisfy the requirements of most applications, such as SI and SV.

The second point is a convergence issue. The proposed algorithm is a recursive training approach: the model parameters are attained in an iterative way. At the beginning of the training stage, initial values have to be set for the parameters of a given α-GMM. Generally there are two methods available for initialization of model parameters: K-means clustering and mixture splitting. The K-means method sets up initial values directly from an assigned number of clusters, whereas the mixture splitting method splits clusters from a relatively low number to a higher one recursively. Both are often used in realistic applications. In our method, we adopted the K-means clustering algorithm for initialization, as does the baseline GMM, which achieved good performance (Wu, 2006; Reynolds, 1995). After initialization, at each iterative training step, the model parameters are updated to better values with the proposed method, that is, according to the formulas given in equations 4.28, 4.33, and 4.34. These procedures continue until the training process converges to an optimum point in a solution region. This point is reflected by lemma 1.

However, the algorithm does not necessarily guarantee attaining a global optimum solution. Because the proposed method is adapted from the EM algorithm, it has properties similar to those of the conventional EM algorithm. The EM algorithm cannot guarantee finding a global optimum point in a solution region given a data set; it is likely that the algorithm stops at a local optimum point. Therefore, a good initialization is extremely important for the effectiveness of models trained using the EM algorithm. This point is also valid for the method proposed for α-GMM. In conventional GMM training, the K-means method was found to be one of the effective methods to facilitate EM training. Since the updating formulas of α-GMM agree with those of the conventional GMM, except for the factor α introduced into the posterior probabilities (see equation 4.20), we still use the K-means algorithm as a clustering algorithm for initializing model parameters. However, other initialization methods, such as mixture splitting, are worth investigating.
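A K-means initialization of the kind described above might look as follows for scalar data. This is a minimal sketch, not the configuration used in the experiments: the deterministic seeding from the sorted data, the fixed iteration count, and the variance floor are all assumptions made here to keep the example short and reproducible.

```python
def kmeans_init(samples, k, iterations=20):
    """Initialize (weights, means, variances) for a k-mixture model from scalar data."""
    ordered = sorted(samples)
    # Assumed seeding: spread the initial centers evenly over the sorted data.
    centers = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in samples:
            nearest = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            clusters[nearest].append(x)
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    weights = [len(c) / len(samples) for c in clusters]
    # Assumed safeguards: a small variance floor, and unit variance for empty clusters.
    variances = [max(sum((x - m) ** 2 for x in c) / len(c), 1e-6) if c else 1.0
                 for c, m in zip(clusters, centers)]
    return weights, centers, variances
```

The returned triple can serve as the starting parameter setting for the iterative reestimation, for α-GMM and for the baseline GMM alike.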

Third, we emphasize the role of the factor of α in the function of integration. From theorem 1, we see that the reestimation formulas are mathematically simple, although the deriving procedure is somewhat complex. Compared with the updating formulas for conventional GMM training, equations 4.28, 4.33, and 4.34 for the α-GMM look very similar to the corresponding ones in GMM except the parts in equation 4.1. The essential difference in equation 4.1 is that an α factor is given to each component in a gaussian mixture so as to warp their contributions to the final probability score of the composite model. By choosing different values of α, the integrated model can emphasize either small values or large values of the component score, while it is more likely to suppress the effect of noisy components from data by deemphasizing small values. This is indeed the essence of α-GMM, which we refer to as α-warping at score levels. Furthermore, the effect of α-warping is reflected not only from each component score, but also from its final score of the sum of its component ones. This step is like a further normalization.

The fourth point is the comparison between GMM and α-GMM in terms of complexity. Although the parameter estimation equations of α-GMM closely resemble those of conventional GMM, α-GMM has several advantages. First, α-GMM is a superset of conventional GMM: GMM is the special case of α-GMM with α = −1. Therefore, α-GMM has a modeling capacity at least as good as that of conventional GMM. Second, the computational complexity of α-GMM is similar to that of GMM. It is easy to see from equation 4.1 that the only additional cost in calculating probability scores is raising the mixture scores to a power determined by the factor α. Considering that logarithmic probability is normally used instead of probability, the computation cost for α-GMM is raised only by *M* + 1 multiplications, where *M* is the number of mixtures in α-GMM. Therefore, the computational costs of α-GMM and traditional GMM are at the same complexity level. For these reasons, α-GMM can be viewed as a more powerful modeling tool than conventional GMM.
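The *M* + 1 extra multiplications can be seen directly in a log-domain scoring routine. The sketch below is for one-dimensional gaussians and α ≠ 1; it assumes the α-GMM score is (Σ w_i N_i(x)^β)^(1/β) with β = (1 − α)/2, and it leaves out the normalization term log c, which is an assumption of this illustration rather than a statement about how the experiments scored models.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a one-dimensional gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def log_score(x, weights, means, variances, alpha):
    """Unnormalized log score of an alpha-GMM at x (alpha != 1, log c omitted)."""
    beta = (1.0 - alpha) / 2.0
    # One extra multiplication per mixture: beta * log N_i(x).
    terms = [math.log(w) + beta * log_gauss(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)
    log_sum = top + math.log(sum(math.exp(t - top) for t in terms))
    # One final multiplication by 1 / beta.
    return log_sum / beta
```

With α = −1 the factor β equals 1 and the routine returns the ordinary GMM log likelihood of the sample.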

The fifth point concerns the selection of the value of α. The algorithm in this letter derives the reestimation formulas for a fixed α value; that is, throughout the optimization procedure, the value of α is assumed constant. However, a more sophisticated question is whether it is possible to optimize the parameter α as well. This is in fact a problem of selecting the optimal integration method for a specific application scenario. This is a possible extension for future work.

Finally, we give another possible extension—this one on the criterion used for optimization. The current criterion is to use maximum likelihood as an objective function for parameter optimization. Many other criteria can also possibly be employed. Among these, maximum a posteriori (MAP), maximum mutual information (MMI), and other discriminant training methods are likely to be useful. Future work will investigate these ideas for training α-GMM.

## 7. Conclusion

This letter presented a theorem concerning parameter reestimation for α-GMM. In the proof of this theorem, the expectation-maximization algorithm was applied to solve an objective function based on maximizing the likelihood of a given data set. The proof was given in two separate steps: the E-step and the M-step. The resultant formulas to reestimate model parameters for α-GMM were found to be simple and compatible with those of GMM. This advantage gives α-GMM the same level of computational complexity as GMM in both training and test stages. In addition, experiments on a moderately sized speaker recognition task confirmed the effectiveness of the learning algorithm for α-GMM.

## Appendix: Proof of Theorem 3.1

This proof of theorem 3.1 follows the method proposed in Amari (2007). The essential idea of the proof is to employ an optimization method by differentiating the objective function, equation 3.4, with respect to the integrated function.
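The variational argument can be sketched as follows. This sketch assumes that the objective function is the weighted sum of α-divergences from the components $p_i(\mathbf{x})$ to the integrated function $q(\mathbf{x})$, that the α-divergence has the form $\frac{4}{1-\alpha^2}\big(1 - \int p^{\frac{1-\alpha}{2}} q^{\frac{1+\alpha}{2}} d\mathbf{x}\big)$ for α ≠ ±1, and that λ is a Lagrange multiplier (introduced here for illustration) enforcing $\int q(\mathbf{x})\, d\mathbf{x} = 1$.

```latex
\begin{aligned}
J[q] &= \sum_{i=1}^{K} w_i\, D_\alpha\big(p_i(\mathbf{x}) \,\|\, q(\mathbf{x})\big)
        + \lambda \left( \int q(\mathbf{x})\, d\mathbf{x} - 1 \right), \\
\frac{\delta J}{\delta q(\mathbf{x})}
     &= -\frac{2}{1-\alpha}\, q(\mathbf{x})^{\frac{\alpha-1}{2}}
        \sum_{i=1}^{K} w_i\, p_i(\mathbf{x})^{\frac{1-\alpha}{2}} + \lambda = 0, \\
q(\mathbf{x})^{\frac{1-\alpha}{2}}
     &\propto \sum_{i=1}^{K} w_i\, p_i(\mathbf{x})^{\frac{1-\alpha}{2}}, \\
q(\mathbf{x})
     &= c \left( \sum_{i=1}^{K} w_i\, p_i(\mathbf{x})^{\frac{1-\alpha}{2}} \right)^{\frac{2}{1-\alpha}} .
\end{aligned}
```

The coefficient in the second line comes from $\frac{4}{1-\alpha^2} \cdot \frac{1+\alpha}{2} = \frac{2}{1-\alpha}$, and the constant c absorbs λ and is fixed by the normalization constraint.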

Hence, the optimum is the α-integration for any α.

## Acknowledgments

I sincerely thank the anonymous reviewers who made important comments on this manuscript and substantially improved its quality.