Abstract

Reducing the dimensionality of high-dimensional data without losing its essential information is an important task in information processing. When class labels of training data are available, Fisher discriminant analysis (FDA) has been widely used. However, the optimality of FDA is guaranteed only in a very restricted ideal circumstance, and it is often observed that FDA does not provide a good classification surface for many real problems. This letter treats the problem of supervised dimensionality reduction from the viewpoint of information theory and proposes a framework of dimensionality reduction based on class-conditional entropy minimization. The proposed linear dimensionality-reduction technique is validated both theoretically and experimentally. Then, through kernel Fisher discriminant analysis (KFDA), the multiple kernel learning problem is treated in the proposed framework, and a novel algorithm, which iteratively optimizes the parameters of the classification function and the kernel combination coefficients, is proposed. The algorithm is experimentally shown to be comparable to, or to outperform, KFDA on large-scale benchmark data sets, and to be comparable to other multiple kernel learning techniques on the yeast protein function annotation task.

1.  Introduction

Dimensionality reduction is a technique for obtaining a compact data representation that keeps as much of the intrinsic information of the original data as possible. During the past decades, the importance of dimensionality reduction has grown as the size and dimensionality of available target data have increased. When we deal with extremely high-dimensional data, such as images, sounds, texts, and gene expressions, an appropriate dimensionality reduction of the raw data reduces the computational burden and, as a data visualization technique, also helps capture the intrinsic structure of the target data.

Fisher discriminant analysis (FDA; Fisher, 1936), one of the most famous supervised classification techniques, finds a projection axis that separates the classes well based on the ratio of between-class covariance to within-class covariance, and the projected values on the obtained axis can be used as a new feature variable that compactly represents the class information of the data. It is known that the optimality of FDA is assured when all the class-conditional distributions are gaussians with the same covariance structure. However, in practice, FDA often fails to find the optimal axis because this assumption rarely holds. To overcome this problem, local Fisher discriminant analysis (LFDA; Sugiyama, 2007), which utilizes local information of the data by means of an affinity matrix, has been proposed. Sugiyama (2007) successfully used LFDA as a preprocessing technique for classification, and it was shown to be superior to the original FDA in some experiments.

In this letter, we explore dimensionality reduction in a supervised setting from the viewpoint of information theory. We also consider nonlinearization of the proposed framework by kernel methods. We argue that the proposed framework can be used as a criterion for kernel optimization and propose a novel method of multiple kernel learning. First, as a typical technique of supervised dimensionality reduction, we interpret FDA in terms of entropy. Then we propose to use the conditional entropy as an objective for supervised dimensionality reduction. So far, many dimensionality-reduction techniques have been proposed from the viewpoint of information theory. Some of them approximate the data distribution by a mixture of gaussians, and others use surrogates of the Shannon differential entropy. Our proposed framework makes no assumption on the data distribution; therefore, it is expected to work well under any data distribution. We carried out a simple experiment with two synthetic dichotomy problems to show that the proposed technique works properly even when conventional FDA fails to find a reasonable projection axis for classification. The result of this experiment is illustrated in Figure 1. The dashed lines are the classification axes found by FDA, and the solid lines are those found by the proposed technique. For the simpler data set depicted in Figure 1a, the projected samples are nicely separated into the two classes on both the axis found by FDA and the one found by the proposed method. Figure 1b depicts a bimodal data set; that is, the samples in one class form two distinct clusters. In this case, FDA collapses the samples from different classes into a single cluster, while the proposed technique gives a perfect separation.

Figure 1:

Examples of dimensionality reduction by FDA and the proposed technique. Two-dimensional dichotomy samples are projected onto an axis. The lines in these figures denote the axes on which the data samples are projected.

The proposed framework of conditional entropy minimization is quite general, and it can be easily extended to dimensionality reduction with nonlinear transformations. In this letter, we propose a technique based on kernel Fisher discriminant analysis (KFDA; Mika, Rätsch, Weston, Schölkopf, & Müller, 1999). Our proposal is optimizing data projection axes and kernel combination coefficients in the context of multiple kernel learning (MKL; Lanckriet, Cristianini, Bartlett, Ghaoui, & Jordan, 2004; Lanckriet, Deng, Cristianini, Jordan, & Noble, 2004; Lewis, Jebara, & Noble, 2006a, 2006b; Do, Kalousis, Woznica, & Hilario, 2009). While there has been much research on MKL based on support vector machines (SVM; Cristianini & Shawe-Taylor, 2000) and maximum margin criteria, our proposal is based on class-conditional entropy minimization.

The rest of the letter is organized as follows. Section 2 describes an information-theoretic understanding of FDA. Section 3 argues the validity of the conditional entropy for dimensionality reduction, and a novel supervised dimensionality-reduction framework is proposed based on the conditional entropy minimization criterion. In this section, an entropy estimation method and an optimization method are also explained. For linear dimensionality reduction, it is experimentally shown that the proposed framework performs comparably to, or even better than, other dimensionality-reduction methods. In section 4, the proposed technique is utilized to derive a novel multiple kernel learning method. Section 5 is devoted to a discussion of information-theoretic dimensionality-reduction techniques. The last section gives concluding remarks.

2.  Information-Theoretic Aspect of Dimensionality Reduction

Given a set of vector data $D = \{x_i\}_{i=1}^{N}$, $x_i \in \mathbb{R}^n$, the problem of dimensionality reduction is formulated as finding a good transformation $f$ that maps a datum $x \in \mathbb{R}^n$ to an m-dimensional vector $z \in \mathbb{R}^m$, where $m < n$. In linear dimensionality reduction, the transformation f can be represented by a matrix $A \in \mathbb{R}^{n \times m}$ as
$$z = f(x) = A^\top x. \tag{2.1}$$
In this section, we consider the dimensionality-reduction problem from the viewpoint of information theory (Cover & Thomas, 1991). When we refer to the term entropy in this letter, we mean the Shannon differential entropy for a random variable X defined as
$$H(X) = -\int p(x) \log p(x)\, dx, \tag{2.2}$$
where p is the probability density function of X.
As a supervised dimensionality-reduction technique, FDA is commonly used. Given a sample data set $D = \{x_i\}_{i=1}^{N}$ and their class labels $\{y_i\}_{i=1}^{N}$, FDA finds a linear projection of the data that is suitable for a classification task. Let $D_y$ be the set of data that belong to class $y$, $N_y = |D_y|$ be the number of data in class $y$, and $N = \sum_y N_y$ be the total number of data. We denote the mean vector and covariance matrix of the data in $D_y$ by $\mu_y$ and $\Sigma_y$, respectively, and the mean vector of all the data in $D$ by $\mu$. In FDA, the transformation matrix A is found by maximizing the ratio of the between-class covariance to the within-class covariance of the transformed data, where the matrices $\Sigma_w$ and $\Sigma_b$ are defined by
$$\Sigma_w = \frac{1}{N}\sum_{y}\sum_{i \in D_y}(x_i - \mu_y)(x_i - \mu_y)^\top, \qquad \Sigma_b = \frac{1}{N}\sum_{y} N_y\,(\mu_y - \mu)(\mu_y - \mu)^\top.$$
We note that, with some abuse of notation, $D$ (and $D_y$) indicates both the data set and the corresponding index set of the data. Then the objective of FDA is minimizing the log ratio of the transformed matrices, that is, $\log\bigl(|A^\top \Sigma_w A| / |A^\top \Sigma_b A|\bigr)$, where $|M|$ denotes the determinant of a square matrix M. Since multiplying both the denominator and the numerator of this ratio by a nonzero scalar does not change the value of the objective, FDA is defined as the following constrained minimization problem:
$$\min_{A}\ \log\bigl|A^\top \Sigma_w A\bigr| \quad \text{subject to} \quad \bigl|A^\top \Sigma_b A\bigr| = 1. \tag{2.3}$$
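As a concrete illustration (our own sketch, not code from the letter), problem 2.3 can be solved through the usual generalized eigenvalue formulation; the sketch below assumes numpy, integer class labels, and an invertible $\Sigma_w$.

```python
# Illustrative sketch (not the authors' code): FDA via the generalized
# eigenvalue problem Sigma_b a = lambda Sigma_w a, assuming Sigma_w is invertible.
import numpy as np

def fda(X, y, m):
    """Return an n x m projection matrix A for labeled data (X: N x n, y: N)."""
    N, n = X.shape
    mu = X.mean(axis=0)
    Sw = np.zeros((n, n))
    Sb = np.zeros((n, n))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c) / N                 # within-class covariance
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu) / N    # between-class covariance
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-eigvals.real)                         # keep the m leading directions
    return eigvecs[:, order[:m]].real
```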
Now we give an information-theoretic understanding of the FDA optimization problem 2.3. Consider the class-conditional entropy
$$H(A^\top X \mid Y) = \sum_{y} p(y)\, H(A^\top X \mid Y = y) \tag{2.4}$$
of the transformed random variable $A^\top X$. Let $H_G(X)$ be the entropy of a gaussian distribution with the same covariance structure as the random variable X. The relationship between the conditional entropy and the FDA criterion 2.3 is described by the following inequalities:
$$H(A^\top X \mid Y) \le \sum_{y} p(y)\, H_G(A^\top X \mid Y = y) \tag{2.5}$$
$$= \sum_{y} p(y)\, \frac{1}{2}\log\bigl((2\pi e)^m \bigl|A^\top \Sigma_y A\bigr|\bigr) \tag{2.6}$$
$$\le \frac{1}{2}\log\bigl((2\pi e)^m \bigl|A^\top \Sigma_w A\bigr|\bigr), \tag{2.7}$$
where e is the base of the natural logarithm. The first inequality, 2.5, comes from the fact that among infinite-support distributions with a fixed covariance matrix, the maximum entropy is achieved by the gaussian distribution (Cover & Thomas, 1991). The second inequality, 2.7, comes from the definition of $\Sigma_w$ and Jensen's inequality.

In general, the projection found by FDA is not Bayes optimal (Duda, Hart, & Stork, 2000). However, when a datum X in each class y is subject to a gaussian distribution with the same covariance Σc, that is, all Σy's are equal to Σc and thus Σw is equal to Σc, inequalities 2.5 and 2.7 become equalities. From the above argument, we can conclude that FDA is a minimization problem for an upper bound of the class-conditional entropy of a variable on the projected axes.

In the next section, we propose a framework for supervised dimensionality reduction through minimizing the class-conditional entropy.

3.  Proposed Framework of Dimensionality Reduction

For supervised dimensionality reduction, the transformed data in the lower-dimensional space should be compactly aggregated within each class. A random variable that is concentrated on a small region has small entropy. Taking into account the fact that FDA minimizes an upper bound of the class-conditional entropy, we now propose a framework of supervised dimensionality reduction that constructs a transformation that minimizes the class-conditional entropy H(Z|Y). We note that the conditional entropy H(Z|Y) is minimized by any function that maps all data X to a single point. Furthermore, if the representational power of the transformation f is too high, optimization with respect to f might result in overfitting. To avoid trivial solutions and overfitting, we need restriction or regularization in optimizing H(Z|Y). In this letter, we introduce a parameter $\eta$ to control the extent of regularization and a regularization functional $R(f; D)$, which may depend on both the function f and the given data D. Therefore, the regularized conditional entropy minimization is defined as
$$\min_{f}\; H\bigl(f(X) \mid Y\bigr) + \eta\, R(f; D). \tag{3.1}$$
The form of the regularization functional should be appropriately designed for each problem at hand. For example, in the linear transformation formulation, equation 2.1, we can use a regularization functional that constrains the determinant of the between-class covariance matrix of the transformed data to be constant, say 1.

In the following sections, we describe a method for estimating and optimizing the entropy.

3.1.  Entropy Estimation.

We first consider the minimization problem, 3.1, where the transformation is linear and expressed by $z = A^\top x$. To minimize the entropy, we estimate the entropy of one-dimensional data in a nonparametric manner. Nonparametric entropy estimation methods are roughly divided into two categories: methods based on kernel density estimators (Beirlant, Dudewicz, Györfi, & Meulen, 1997) and methods based on k-nearest neighbors (k-NN) (Kozachenko & Leonenko, 1987). In this letter, we adopt a k-NN-based method proposed by Faivishevsky and Goldberger (2009) because of the small variance of its estimates, fast computation, and implementation simplicity. For an n-dimensional random variable X, it is known that the Shannon differential entropy has an unbiased k-NN estimator,
$$H_k(X) = \frac{n}{N}\sum_{i=1}^{N}\log \varepsilon_{i,k} + \log c_n + \psi(N) - \psi(k), \tag{3.2}$$
where $\psi$ is the digamma function, $c_n$ is the volume of the n-dimensional unit ball (i.e., $c_n = \pi^{n/2}/\Gamma(n/2 + 1)$), and $\varepsilon_{i,k}$ is the distance from $x_i$ to its kth nearest neighbor. Since the k-NN estimator 3.2 is valid for all $k \in \{1, \dots, N-1\}$, the differential entropy can also be estimated by the average of the estimators with different values of k. Then, averaging all $H_k(X)$, Faivishevsky and Goldberger (2009) proposed a novel entropy estimator, called MeanNN:
$$\hat H_{\mathrm{MNN}}(X) = \frac{1}{N-1}\sum_{k=1}^{N-1} H_k(X) = \frac{n}{N(N-1)}\sum_{i \neq j}\log\|x_i - x_j\| + \mathrm{const}. \tag{3.3}$$
In this letter, we estimate all entropies by the MeanNN (MNN for short) estimator 3.3 and omit the subscript MNN in $\hat H$.
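For one-dimensional projected data, the MNN estimate reduces, up to an additive constant that does not affect the optimization, to the average of the log pairwise distances. The following sketch is our illustration of that computation (numpy assumed; the small `eps` guards against coincident points).

```python
# Illustrative sketch: MeanNN entropy estimate of a 1-D sample,
# up to an additive constant that depends only on N.
import numpy as np

def mnn_entropy_1d(z, eps=1e-12):
    z = np.asarray(z, dtype=float)
    d = np.abs(z[:, None] - z[None, :])        # pairwise distances |z_i - z_j|
    mask = ~np.eye(len(z), dtype=bool)         # drop the i == j terms
    return np.mean(np.log(d[mask] + eps))      # (1/(N(N-1))) sum_{i != j} log|z_i - z_j|
```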
Since it is difficult to estimate the joint entropy of a high-dimensional multivariate random variable with enough accuracy, we propose to estimate an upper bound of the joint entropy given by the sum of its marginal entropies (Cover & Thomas, 1991). Let $a_l$ be the lth column vector of the transformation matrix A, and consider the lth element $z_l = a_l^\top x$ of the transformed vector z. Then the marginal entropy of $z_l$ is estimated by
$$\hat H(a_l^\top X) = \frac{1}{N(N-1)}\sum_{i \neq j}\log\bigl|a_l^\top (x_i - x_j)\bigr| + \mathrm{const},$$
and the sum of the marginal entropies gives the upper bound of the joint entropy of $z$ as
$$H(A^\top X) \le \sum_{l=1}^{m} H(a_l^\top X). \tag{3.4}$$
The upper bound of the class-conditional entropy for a class Y=y is also defined by
$$H(A^\top X \mid Y = y) \le \sum_{l=1}^{m} H(a_l^\top X \mid Y = y), \tag{3.5}$$
and the weighted sum of these bounds with the class prior probability $p(y)$ gives the upper bound of the class-conditional entropy as
$$H(A^\top X \mid Y) \le \sum_{y} p(y) \sum_{l=1}^{m} H(a_l^\top X \mid Y = y), \tag{3.6}$$
where the class prior probability p(y) is estimated by $N_y/N$. Henceforth, when we minimize the entropy of a multivariate random variable, we in fact minimize the sum of its marginal entropies. We will show a simple experimental result that supports this upper-bound argument in appendix A.
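To illustrate how the bound 3.6 is evaluated in practice, the following sketch (ours, reusing `mnn_entropy_1d` from the previous sketch) sums the marginal MNN estimates over the columns of A and over the classes, weighted by the empirical priors $N_y/N$.

```python
# Illustrative sketch: MNN estimate of the upper bound 3.6.
import numpy as np

def conditional_entropy_bound(X, y, A):
    """Weighted sum over classes and columns of A of the marginal MNN entropies."""
    N = len(y)
    total = 0.0
    for c in np.unique(y):
        Xc = X[y == c]
        for a_l in A.T:                                  # columns a_l of A
            total += (len(Xc) / N) * mnn_entropy_1d(Xc @ a_l)
    return total
```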

3.2.  Optimization Using Gradient Descent.

Now we minimize the class-conditional entropy by gradient descent. We calculate the gradient vector of the class-conditional entropies of the transformed data and update each column of A.

The optimization problem to be solved for linear dimensionality reduction is
$$\min_{A}\; \hat H(A^\top X \mid Y) + \eta\, R(A; D), \tag{3.7}$$
where the conditional entropy is calculated as
$$\hat H(A^\top X \mid Y) = \sum_{y}\frac{N_y}{N}\sum_{l=1}^{m}\hat H(a_l^\top X \mid Y = y).$$
Since the MNN estimate of $H(a_l^\top X \mid Y = y)$ is given by
$$\hat H(a_l^\top X \mid Y = y) = \frac{1}{N_y(N_y - 1)}\sum_{\substack{i \neq j \\ i, j \in D_y}}\log\bigl|a_l^\top (x_i - x_j)\bigr| + \mathrm{const},$$
the derivative of the marginal class-conditional entropy with respect to $a_l$ is given by
$$\frac{\partial \hat H(a_l^\top X \mid Y = y)}{\partial a_l} = \frac{1}{N_y(N_y - 1)}\sum_{\substack{i \neq j \\ i, j \in D_y}}\frac{x_i - x_j}{a_l^\top (x_i - x_j)},$$
and we can minimize the objective 3.7 by gradient descent.
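A direct transcription of this derivative (our sketch, not the authors' code) is given below; it assumes no projected pairwise difference is exactly zero.

```python
# Illustrative sketch: gradient of the MNN estimate of H(a_l^T X | Y=y)
# with respect to a_l (the additive constant plays no role).
import numpy as np

def mnn_conditional_entropy_grad(Xc, a_l):
    """Xc holds the data of one class (N_y x n); returns a vector of length n."""
    Ny, n = Xc.shape
    grad = np.zeros(n)
    for i in range(Ny):
        for j in range(Ny):
            if i == j:
                continue
            diff = Xc[i] - Xc[j]
            grad += diff / (a_l @ diff)      # d/da log|a^T diff| = diff / (a^T diff)
    return grad / (Ny * (Ny - 1))
```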

3.3.  Quasi-Orthogonalization.

Since we minimize the sum of the marginal entropies, a naïve optimization for each marginal entropy may lead us to the same single transformation vector for all the marginal entropies. To avoid this, we apply quasi-orthogonalization to the transformation matrix in each iteration of gradient descent. In order to simplify operations and accelerate convergence of the algorithm, we propose to prewhiten the data in advance. A random vector X is said to be white if its covariance matrix is the unit matrix. Let the eigenvalue decomposition of the covariance matrix of X be $\Sigma = U \Lambda U^\top$; then a whitened vector is given by $\tilde x = \Lambda^{-1/2} U^\top x$.

Now, under the assumption that the given data are whitened, the matrix obtained in each step of marginal entropy minimization is modified so as to approximately satisfy $A^\top A = I_m$, that is, to make $\|A^\top A - I_m\|_F$ small, where $I_m$ is the identity matrix and $\|\cdot\|_F$ denotes the Frobenius norm. This quasi-orthogonalization corresponds to defining the regularization function in equation 3.7 in terms of $\|A^\top A - I_m\|_F$. The quasi-orthogonalization of A is realized by iterating the following three steps until convergence:

  1. Divide A by the square root of the largest eigenvalue of $A^\top A$.

  2. $A \leftarrow \frac{3}{2}A - \frac{1}{2}A A^\top A$.

  3. Normalize the norm of each column of A to 1.

This procedure for quasi-orthogonalization is validated as follows (Hyvärinen, Karhunen, & Oja, 2001). Let $A^\top A = E D E^\top$ be the eigenvalue decomposition of the symmetric matrix $A^\top A$, where $E$ is an orthogonal matrix and $D$ is a diagonal matrix with the eigenvalues of $A^\top A$. Then, by step 2 of the procedure, $A^\top A$ is modified as
$$A^\top A \;\longrightarrow\; E\Bigl(\tfrac{3}{2}I_m - \tfrac{1}{2}D\Bigr) D \Bigl(\tfrac{3}{2}I_m - \tfrac{1}{2}D\Bigr) E^\top.$$
Noting that the maximum eigenvalue of the matrix $A^\top A$ is normalized to one in step 1, each eigenvalue $d$ of $A^\top A$ after this transformation becomes
$$d \;\longrightarrow\; \tfrac{1}{4}\, d\, (3 - d)^2.$$
Because $d \le \tfrac{1}{4} d (3 - d)^2 \le 1$ for $0 < d \le 1$, the eigenvalues of $A^\top A$ converge to 1 by iterating these three steps. In actual experiments, we iterate these three steps a fixed small number of times to obtain an approximately orthogonalized matrix.
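The three-step iteration can be written compactly as follows (our sketch; the number of inner iterations is an illustrative choice).

```python
# Illustrative sketch of the quasi-orthogonalization iteration of section 3.3.
import numpy as np

def quasi_orthogonalize(A, n_iter=10):
    """Drive A^T A toward the identity matrix."""
    for _ in range(n_iter):
        A = A / np.sqrt(np.linalg.eigvalsh(A.T @ A).max())   # step 1: largest eigenvalue -> 1
        A = 1.5 * A - 0.5 * A @ A.T @ A                      # step 2
        A = A / np.linalg.norm(A, axis=0, keepdims=True)     # step 3: unit-norm columns
    return A
```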

Summarizing the above discussion, for a linear transformation $z = A^\top x$, we obtain an algorithm for minimizing the class-conditional entropy, depicted in Algorithm 1. We call this algorithm LCEM: linear dimensionality-reduction algorithm based on conditional entropy minimization. We have already shown a simple example of dimensionality reduction using FDA and LCEM in Figure 1.

Before showing experimental results, we note that there is a study on supervised distance metric learning based on a probabilistic extension of the k-NN method with an objective function seemingly similar to ours. Supervised distance metric learning aims at obtaining an appropriate distance metric matrix $W = A A^\top$ for classification. This is equivalent to learning a transformation A so that the transformed data in the same class are concentrated in a small region and data in different classes are separated as much as possible. Goldberger, Roweis, Hinton, and Salakhutdinov (2005) defined the probability that a datum $x_j$ is in the neighborhood of a datum $x_i$ by a Boltzmann-type distribution:
$$p_A(x_j \mid x_i) = \frac{\exp\bigl(-\|A^\top x_i - A^\top x_j\|^2\bigr)}{\sum_{k \neq i}\exp\bigl(-\|A^\top x_i - A^\top x_k\|^2\bigr)}, \qquad p_A(x_i \mid x_i) = 0. \tag{3.8}$$
Then, defining a set $C_i$ whose elements belong to the same class as $x_i$, they proposed to maximize the objective function
$$f(A) = \sum_{i=1}^{N}\sum_{j \in C_i} p_A(x_j \mid x_i) \tag{3.9}$$
using gradient ascent with respect to A. This metric learning algorithm is named NCA (neighborhood component analysis).

Input: Training data $D = \{x_i\}_{i=1}^{N}$ and class label data $\{y_i\}_{i=1}^{N}$. The dimension m of the transformed data. A gradient step-size parameter $\varepsilon$.

Initialization: Choose an initial transformation matrix $A \in \mathbb{R}^{n \times m}$ so that $\operatorname{rank} A = m$. Whiten the given data using its empirical mean and covariance.

Iteration: Until convergence:

Gradient step: Update each column of the transformation matrix:
$$a_l \leftarrow a_l - \varepsilon \sum_{y}\frac{N_y}{N}\,\frac{\partial \hat H(a_l^\top X \mid Y = y)}{\partial a_l}, \qquad l = 1, \dots, m.$$

Quasi-orthogonalization step: Until convergence:

  1. Divide A by square root of the largest eigenvalue of ATA.

  2. $A \leftarrow \frac{3}{2}A - \frac{1}{2}A A^\top A$.

  3. Normalize the norm of each column of A to 1.

Output: Converged transformation matrix A.

Algorithm 1: Linear dimensionality-reduction algorithm based on conditional entropy minimization. In the gradient step, the marginal entropies are minimized by gradient descent for each column of the transformation A, and in the quasi-orthogonalization step, the columns of A are quasi-orthogonalized.
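Putting the pieces together, a minimal sketch of the LCEM loop (ours, reusing `mnn_conditional_entropy_grad` and `quasi_orthogonalize` from the previous sketches) looks as follows; the step size, iteration count, and fixed-iteration stopping rule are illustrative choices, not the settings used in the experiments.

```python
# Illustrative sketch of Algorithm 1 (LCEM).
import numpy as np

def lcem(X, y, m, step=0.1, n_outer=100):
    N, n = X.shape
    # Whiten the data with the empirical mean and covariance.
    Xc = X - X.mean(axis=0)
    d, U = np.linalg.eigh(np.cov(Xc, rowvar=False))
    Xw = Xc @ U @ np.diag(1.0 / np.sqrt(d))
    A = np.linalg.qr(np.random.randn(n, m))[0]     # rank-m initialization
    for _ in range(n_outer):
        for l in range(m):
            g = np.zeros(n)
            for c in np.unique(y):                 # weighted class-conditional gradients
                Xy = Xw[y == c]
                g += (len(Xy) / N) * mnn_conditional_entropy_grad(Xy, A[:, l])
            A[:, l] -= step * g                    # gradient step on column l
        A = quasi_orthogonalize(A)                 # quasi-orthogonalization step
    return A, Xw
```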

Globerson and Roweis (2006) proposed an alternative method, MCML (maximally collapsing metric learning). Let $p_0(x_j \mid x_i)$ be the ideal distribution corresponding to $p_A(x_j \mid x_i)$, defined by
$$p_0(x_j \mid x_i) \propto \begin{cases} 1 & (y_j = y_i), \\ 0 & (y_j \neq y_i). \end{cases} \tag{3.10}$$
Then the objective function of MCML is defined by the Kullback-Leibler divergence between $p_0$ and $p_A$ as
$$f(A) = \sum_{i=1}^{N} \mathrm{KL}\bigl[\,p_0(\cdot \mid x_i)\,\|\,p_A(\cdot \mid x_i)\,\bigr]. \tag{3.11}$$
Since the distance metric matrix $A A^\top$ must be positive semidefinite, the objective function is optimized by gradient descent under the positive semidefiniteness constraint.
The objective function of MCML and the class-conditional entropy are similar in appearance. In MCML, we optimize the sum of Kullback-Leibler divergences:
$$\sum_{i=1}^{N} \mathrm{KL}\bigl[\,p_0(\cdot \mid x_i)\,\|\,p_A(\cdot \mid x_i)\,\bigr] = -\sum_{i=1}^{N}\sum_{j \in C_i} p_0(x_j \mid x_i)\log p_A(x_j \mid x_i) + \mathrm{const}. \tag{3.12}$$
On the other hand, in LCEM, we optimize the class-conditional entropy:
$$H(Z \mid Y) = -\sum_{y} p(y)\int p(z \mid y)\log p(z \mid y)\, dz. \tag{3.13}$$
Although they look similar, MCML uses a probability pA(xj|xi) that a datum xi selects another datum xj as its neighbor, which is different from the distribution of data themselves. Moreover, in MCML, the probability pA(xj|xi) is restricted to the form of a Boltzmann distribution to make the objective function simple and convex.

3.4.  Experimental Study.

We apply the proposed dimensionality-reduction technique as a preprocessing step in a classification task. As a measure of the separability of the data in the transformed space, we adopt the one-nearest-neighbor classifier. We employ the IDA data sets (http://ida.first.fraunhofer.de/projects/bench/benchmarks.htm), which are standard binary classification data sets originally used in Rätsch, Onoda, and Müller (2001). Table 1 lists the names of the data sets, the dimensionalities of the feature vectors, the numbers of training and test data, and the numbers of realizations (pairs of training and test data sets) of the data. The dimensionalities of the original data are reduced by principal component analysis (PCA), FDA, MCML, LFDA, and LCEM. We estimated suitable embedding dimensionalities for PCA, MCML, LFDA, and LCEM in the same manner as Rätsch et al. (2001). That is, we ran five-fold cross-validation on the first five realizations of the training sets and estimated the reduced dimensionality by the median over the five estimates for each data set. Denoting the class-conditional entropy after the tth iteration by $H_t$, we stopped the iteration in LCEM when $|H_t - H_{t-1}|/|H_{t-1}| < 10^{-4}$ held. Table 2 shows the means and standard deviations of the misclassification rates in percentages. The best results and those comparable to them based on the t-test with a significance level of 5% are shown in boldface. The chosen embedding dimensionality Dim is written in the table as [Dim]. Table 2 tells us that the classification accuracies obtained by LCEM are superior to those of PCA, FDA, and MCML for many data sets and comparable to those of LFDA. We also show the classification results in the original spaces in the column labeled "Euclidean." Compared to the classification results in the Euclidean spaces, LCEM preserves the classification accuracy for most data sets and even improves it for some data sets. From this experiment, we can speculate that there are the following tendencies among the linear supervised dimensionality-reduction methods we tested. Among the IDA data sets, Banana, Thyroid, and Waveform are multimodal, while the others are not. Fisher discriminant analysis is not appropriate for multimodal data, as Sugiyama (2007) pointed out and as the simple example in Figure 1 also shows. Maximally collapsing metric learning was originally proposed as a distance metric learning technique, and it seems not to be appropriate for dimensionality reduction. This conclusion is drawn from the fact that the optimal reduced dimensionality of MCML found by cross-validation is relatively high compared to that of the other dimensionality-reduction methods. As for LFDA and our proposed LCEM, they show similar results.

Table 1:
IDA Data Specifications.
Data Name       Input Dimensionality   Training Samples   Test Samples   Realizations
Banana          2                      400                4900           100
Breast-cancer   9                      200                77             100
Diabetes        8                      468                300            100
Flare-solar     9                      666                400            100
German          20                     700                300            100
Heart           13                     170                100            100
Image           18                     1300               1010           20
Ringnorm        20                     400                7000           100
Splice          60                     1000               2175           20
Thyroid         5                      140                75             100
Titanic         3                      150                2051           100
Twonorm         20                     400                7000           100
Waveform        21                     1000               1000           100
Table 2:
Average and Standard Deviation of Misclassification Rates (in Percent) of Linear Dimensionality-Reduction Techniques.
Data Name       PCA              FDA         MCML            LFDA            LCEM            Euclidean
Banana          14.0(0.8)[2]     38.3(4.0)   39.6(1.3)[1]    13.7(0.8)[2]    13.6(0.8)[2]    13.6(0.8)
Breast-cancer   40.7(7.1)[3]     34.9(5.1)   34.5(4.4)[4]    33.3(4.6)[6]    33.6(4.4)[4]    32.7(4.8)
Diabetes        38.4(5.0)[4]     31.3(2.8)   31.3(1.9)[7]    32.3(2.6)[3]    30.1(2.1)[3]    30.1(2.1)
Flare-solar     48.6(6.9)[5]     36.4(1.9)   36.6(2.0)[5]    36.8(1.9)[2]    36.5(1.9)[3]    36.5(1.9)
German          41.8(4.5)[2]     32.0(2.6)   31.4(2.4)[17]   30.2(2.47)[11]  31.2(2.6)[9]    29.5(2.5)
Heart           46.3(23.9)[4]    22.9(4.1)   24.5(3.4)[10]   21.6(4.3)[5]    22.7(4.0)[3]    23.2(3.7)
Image           37.3(9.5)[2]     22.1(0.9)   4.1(0.6)[15]    3.7(1.0)[13]    3.4(1.0)[16]    3.4(0.5)
Ringnorm        28.0(5.1)[10]    31.7(1.0)   23.5(1.1)[8]    20.4(1.0)[6]    19.7(0.8)[8]    35.0(1.4)
Splice          43.9(4.9)[2]     20.4(0.8)   27.0(0.7)[43]   16.4(0.8)[5]    20.6(0.6)[2]    28.8(1.5)
Thyroid         9.1(4.4)[2]      17.9(4.9)   4.9(2.1)[4]     4.3(2.3)[3]     4.4(2.2)[4]     4.4(2.2)
Titanic         26.4(8.4)[1]     22.5(1.1)   22.5(1.1)[1]    22.6(1.5)[1]    22.5(1.1)[1]    22.5(1.1)
Twonorm         7.6(18.8)[3]     3.5(0.5)    8.0(0.7)[19]    3.5(0.4)[6]     3.6(0.4)[2]     6.7(0.7)
Waveform        31.7(18.7)[9]    18.6(1.2)   17.8(0.7)[17]   11.7(0.7)[2]    16.3(1.0)[17]   15.8(0.7)

Notes: The numbers in parentheses denote standard deviations, and the numbers in square brackets denote the chosen embedding dimensionalities. The best results and those comparable to them based on the t-test with a significance level of 5% are shown in boldface type.

Sugiyama (2007) claims that LFDA is appropriate for multimodal data. Our LCEM performs as well as LFDA when applied to Banana and Thyroid but not to Waveform. At this point, it is difficult to draw a general conclusion on which method is preferable for which kinds of data, and this remains future work for us. We show further experimental results in appendix C.

4.  Multiple Kernel Learning Based on Conditional Entropy Minimization

Fisher discriminant analysis has been extended to a nonlinear variant known as kernel Fisher discriminant analysis (KFDA; Mika et al., 1999) and has been shown to work well for data that are not linearly separable. In this section, through KFDA and the proposed dimensionality-reduction framework, we propose a novel method of multiple kernel learning (Lanckriet, Deng et al., 2004; Lanckriet, Cristianini et al., 2004; Lewis et al., 2006b; Do et al., 2009).

4.1.  Kernel Fisher Discriminant Analysis.

In this section, we consider dimensionality reduction to only one dimension for simplicity. Let $f(x) = a^\top x$ be a linear projection from $\mathbb{R}^n$ to $\mathbb{R}$. Since a datum x is classified by comparing the value of this function with a certain threshold value, we call f(x) a classification function. Suppose a datum is mapped to a high-dimensional feature space by a map $\phi$; the classification function then becomes a projection from the feature space to $\mathbb{R}$ as $f(x) = a^\top \phi(x)$. We note that the projection vector a in the expression $f(x) = a^\top x$ is an n-dimensional vector, while a in $f(x) = a^\top \phi(x)$ is a vector in the feature space. Now we use the fact that in the kernel method, with some appropriate regularity condition, we can apply the representer theorem (Shawe-Taylor & Cristianini, 2004) to obtain the expression $a = \sum_{i=1}^{N}\alpha_i \phi(x_i)$ with real-valued weight parameters $\alpha_i$. In this case, the inner product in the feature space is written by a kernel function as $\phi(x)^\top \phi(x') = k(x, x')$, and we obtain the kernel expression of the classification function as
$$f(x) = a^\top \phi(x) = \sum_{i=1}^{N}\alpha_i\, k(x_i, x). \tag{4.1}$$
Mika et al. (1999) proposed KFDA, a nonlinear extension of FDA by the kernel method. Let $K \in \mathbb{R}^{N \times N}$ be the Gram matrix of the given data set such that $K_{ij} = k(x_i, x_j)$, and let $k_i$ be the ith column vector of K. With this Gram matrix and its column vectors, a sample mean vector of each class is given by
$$m_y = \frac{1}{N_y}\sum_{i \in D_y} k_i,$$
and a sample mean vector of all the data is given by
$$m = \frac{1}{N}\sum_{i=1}^{N} k_i.$$
Then the between-class covariance matrix Vb and the within-class covariance matrix Vw in the feature space are written as
$$V_b = \frac{1}{N}\sum_{y} N_y\,(m_y - m)(m_y - m)^\top, \qquad V_w = \frac{1}{N}\sum_{y}\sum_{i \in D_y}(k_i - m_y)(k_i - m_y)^\top.$$
Then the objective of KFDA is minimizing $\log\bigl(\alpha^\top V_w \alpha / \alpha^\top V_b \alpha\bigr)$, and in the same way as FDA, it is formulated as a minimization problem of $\alpha^\top V_w \alpha$ under the constraint that $\alpha^\top V_b \alpha$ is constant. When we use a kernel function to represent the inner product in the high-dimensional feature space, minimizing $\alpha^\top V_w \alpha$ sometimes results in overfitting. In this letter, we replace the within-class covariance matrix $V_w$ by the regularized within-class covariance matrix $V_w + \lambda I_N$, where $\lambda$ is a nonnegative regularization parameter. Now the KFDA problem is formulated as
$$\min_{\alpha}\; \alpha^\top (V_w + \lambda I_N)\,\alpha \quad \text{subject to} \quad \alpha^\top V_b\, \alpha = 1. \tag{4.2}$$
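For concreteness, the regularized KFDA problem 4.2 can be solved, for a one-dimensional projection, by a generalized eigenvalue computation on the Gram matrix; the sketch below is our illustration (numpy assumed; `lam` stands for the regularization parameter $\lambda$), not the authors' implementation.

```python
# Illustrative sketch: regularized KFDA (equation 4.2) for one projection direction.
import numpy as np

def kfda(K, y, lam=1e-3):
    """K: N x N Gram matrix, y: labels; returns the weight vector alpha."""
    N = K.shape[0]
    m_all = K.mean(axis=1)
    Vb = np.zeros((N, N))
    Vw = np.zeros((N, N))
    for c in np.unique(y):
        Kc = K[:, y == c]                          # columns k_i for the class-c data
        m_c = Kc.mean(axis=1)
        Vb += Kc.shape[1] * np.outer(m_c - m_all, m_c - m_all) / N
        Vw += (Kc - m_c[:, None]) @ (Kc - m_c[:, None]).T / N
    # Leading generalized eigenvector of (Vb, Vw + lam I).
    w, V = np.linalg.eig(np.linalg.solve(Vw + lam * np.eye(N), Vb))
    return V[:, np.argmax(w.real)].real
```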

4.2.  Multiple Kernel Learning Algorithm with Conditional Entropy Criterion.

In this section, we combine multiple kernel functions with a coefficient vector $\beta$, and we optimize the coefficients $\beta$ and the weights $\alpha$ of the classification function by minimizing the conditional entropy.

The difficulty in choosing a suitable kernel function and kernel parameter for a given data set is a serious drawback of the kernel method. One of the approaches proposed to address this problem is multiple kernel learning (MKL), in which several kernels are adaptively combined for a given data set. For large classes of combinations of kernel functions that preserve symmetry and positive definiteness, the resulting function also becomes a new valid kernel function (Shawe-Taylor & Cristianini, 2004).

Consider a parameterized family of kernel functions
$$\bigl\{\, k_\theta(x, x') \mid \theta \in \Theta \,\bigr\}, \tag{4.3}$$
where $\theta$ is a parameter that takes a value in a parameter space $\Theta$ and characterizes the kernel function. For example, when we consider a family of gaussian kernels,
$$k_\theta(x, x') = \exp\Bigl(-\frac{\theta}{2}\,\|x - x'\|^2\Bigr), \tag{4.4}$$
$\theta$ is an accuracy parameter of the gaussian kernel, and $\Theta = \{\theta \mid \theta > 0\}$. Then, noting that any convex combination of kernel functions becomes a kernel function again, we choose S kernel functions with fixed parameters $\theta_1, \dots, \theta_S$ from this family and define a new kernel function by a convex combination of them:
$$k_\beta(x, x') = \sum_{s=1}^{S}\beta_s\, k_{\theta_s}(x, x'), \qquad \beta_s \ge 0, \quad \sum_{s=1}^{S}\beta_s = 1. \tag{4.5}$$
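As a small illustration (ours; the gaussian parameterization below matches equation 4.4 as reconstructed above), the combined Gram matrix of equation 4.5 can be built from precomputed element Gram matrices.

```python
# Illustrative sketch: element gaussian Gram matrices and their convex combination (eq. 4.5).
import numpy as np

def gaussian_gram(X, theta):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # squared distances
    return np.exp(-0.5 * theta * sq)

def combined_gram(grams, beta):
    """K(beta) = sum_s beta_s K_s; beta is rescaled to sum to one (nonnegativity assumed)."""
    beta = np.asarray(beta, dtype=float)
    beta = beta / beta.sum()
    return sum(b * K for b, K in zip(beta, grams))
```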
The idea of our proposed MKL technique is to solve the following optimization problem:
$$\min_{\alpha, \beta}\; H\bigl(f_{\alpha,\beta}(X) \mid Y\bigr) \quad \text{subject to} \quad \alpha^\top V_b(\beta)\,\alpha = 1, \quad \sum_{s=1}^{S}\beta_s = 1, \quad \beta_s \ge 0, \tag{4.6}$$
where the classification function $f_{\alpha,\beta}(x) = \sum_{i=1}^{N}\alpha_i\, k_\beta(x_i, x)$ depends on both $\alpha$ and $\beta$. We note that, formally, this is equivalent to equation 3.1 with a regularization function that encodes the above constraints. Since the direct simultaneous optimization of equation 4.6 with respect to both $\alpha$ and $\beta$ is apparently difficult, we adopt an iterative optimization approach. We denote the parameters after the tth iteration by $\alpha^{(t)}$ and $\beta^{(t)}$.
First, let us consider the optimization of $\alpha$ for fixed $\beta$. We write the within- and between-class covariance matrices in the feature space as $V_w(\beta)$ and $V_b(\beta)$ to show the dependency on $\beta$ explicitly, and we omit the regularization parameter $\lambda$ in equation 4.2 for simplicity of description. As noted in section 2, KFDA minimizes an upper bound of the class-conditional entropy; thus, the relationship between the class-conditional entropy and the KFDA objective function is written as
$$H\bigl(f_{\alpha,\beta}(X) \mid Y\bigr) \le \sum_{y} p(y)\, H_G\bigl(f_{\alpha,\beta}(X) \mid Y = y\bigr) \tag{4.7}$$
$$\le \frac{1}{2}\log\bigl(2\pi e\, \alpha^\top V_w(\beta)\,\alpha\bigr), \tag{4.8}$$
where the KFDA constraint $\alpha^\top V_b(\beta)\,\alpha = 1$ is imposed. The right-hand side of 4.8 is an upper bound of the class-conditional entropy, and for fixed $\beta$, the solution that minimizes this upper bound is given by KFDA.
We next minimize the conditional entropy with respect to the kernel combination coefficients $\beta$ with fixed $\alpha$. The regularization term contains the entropy of the transformed data, and as a new optimization objective, we can put this entropy term and the conditional entropy term together using a tuning parameter $\gamma$, such as
$$\min_{\beta}\; H\bigl(f_{\alpha,\beta}(X) \mid Y\bigr) - \gamma\, H\bigl(f_{\alpha,\beta}(X)\bigr). \tag{4.9}$$
In this letter, we take a simple strategy to avoid adding the parameter $\gamma$. The minimization problem considered here is as follows:
$$\beta^{(t)} = \arg\min_{\beta}\; \hat H\bigl(f_{\alpha^{(t)},\,\beta}(X) \mid Y\bigr) \tag{4.10}$$
$$\text{subject to} \quad \sum_{s=1}^{S}\beta_s = 1, \qquad \beta_s \ge 0. \tag{4.11}$$
As a result of this optimization step, we obtain a new kernel function, equation 4.5, with the updated coefficients $\beta^{(t)}$. With this new kernel function, we can update the covariance matrices $V_w(\beta^{(t)})$ and $V_b(\beta^{(t)})$; then we again minimize the updated objective function of KFDA with respect to $\alpha$.

These two steps are iterated until both $\alpha$ and $\beta$ converge or until some predetermined stopping criterion is satisfied. We name this algorithm MCEM (multiple kernel learning algorithm based on conditional entropy minimization). It is summarized in Algorithm 2. An intuitive explanation of the algorithm is given in Figure 2.

Figure 2:

A conceptual diagram of the proposed multiple kernel learning algorithm. Dashed curves denote level curves of the conditional entropy. Solid curves denote level curves of the upper bound of the conditional entropy, which are equivalent to the objective functions of KFDA. The proposed algorithm iterates the upper-bounding approximation followed by KFDA, which minimizes the conditional entropy with respect to $\alpha$ for fixed $\beta$, and the minimization of the conditional entropy with respect to $\beta$ for fixed $\alpha$.

The method used to optimize the conditional entropy with respect to $\beta$ is arbitrary. In this letter, we devised two methods: the first is a random search algorithm, and the second is based on a convex (quadratic) approximation. In the former random search algorithm, we generate P candidates of $\beta$ by a gaussian random number generator with the current $\beta$ as the mean vector. Then we calculate the conditional entropy with these candidates and adopt the one that minimizes the conditional entropy. Although this algorithm is naïve, it works well and is applicable to arbitrary forms of kernel combination other than a convex combination. The latter algorithm is described in appendix D. Depending on the method used in the $\beta$ optimization step, we call the random search version MCEM.R and the quadratic approximation version MCEM.Q.
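A minimal sketch of the random-search β update of MCEM.R is given below (our illustration, reusing `combined_gram` and `mnn_entropy_1d` from the earlier sketches); the clipping and renormalization used to keep the candidates on the simplex are our own assumption about how the gaussian perturbations are constrained.

```python
# Illustrative sketch: random-search update of the kernel combination coefficients.
import numpy as np

def update_beta_random(grams, alpha, y, beta, n_cand=50, scale=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng

    def cond_entropy(b):
        f = combined_gram(grams, b) @ alpha            # projected values f(x_i)
        return sum((np.sum(y == c) / len(y)) * mnn_entropy_1d(f[y == c])
                   for c in np.unique(y))

    best_beta, best_h = beta, cond_entropy(beta)
    for _ in range(n_cand):
        cand = beta + scale * rng.standard_normal(len(beta))   # gaussian perturbation
        cand = np.clip(cand, 0.0, None)                        # keep coefficients nonnegative
        if cand.sum() == 0:
            continue
        cand = cand / cand.sum()                               # back onto the simplex
        h = cond_entropy(cand)
        if h < best_h:
            best_beta, best_h = cand, h
    return best_beta
```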

Input: Training data $D = \{x_i\}_{i=1}^{N}$ and class label data $\{y_i\}_{i=1}^{N}$. Kernel parameters $\theta_1, \dots, \theta_S$ for the S element kernels, regularization parameter $\lambda$ for KFDA.

Initialization: Initialize the combination coefficients of the element kernels by random values such that $\beta^{(0)}_s \ge 0$ and $\sum_{s=1}^{S}\beta^{(0)}_s = 1$.

Repetition: Until convergence, from t=1:

α optimization step: Solve the KFDA minimization problem for fixed $\beta^{(t-1)}$ to get $\alpha^{(t)}$:
$$\alpha^{(t)} = \arg\min_{\alpha}\; \alpha^\top\bigl(V_w(\beta^{(t-1)}) + \lambda I_N\bigr)\alpha \quad \text{subject to} \quad \alpha^\top V_b(\beta^{(t-1)})\,\alpha = 1.$$

β optimization step: Minimize the conditional entropy of the classification function for fixed $\alpha^{(t)}$ to get $\beta^{(t)}$:
$$\beta^{(t)} = \arg\min_{\beta}\; \hat H\bigl(f_{\alpha^{(t)},\,\beta}(X) \mid Y\bigr) \quad \text{subject to} \quad \sum_{s=1}^{S}\beta_s = 1, \; \beta_s \ge 0.$$

Output: Converged parameters $\alpha$ and $\beta$, used to construct the classification function as $f(x) = \sum_{i=1}^{N}\alpha_i\, k_\beta(x_i, x)$.

Algorithm 2: Multiple kernel learning algorithm based on conditional entropy minimization. The algorithm iteratively optimizes the classification function that defines a one-dimensional classification axis.

4.2.1.  Related Works on Multiple Kernel Learning.

Several attempts have been made to learn kernel functions from the given data. The most popular approach in the context of MKL considers a finite set of predefined element kernels that are combined so that the margin-based objective function of SVM is optimized. Lanckriet, Cristianini et al. (2004) and Lanckriet, Deng et al. (2004) have proposed a framework to combine multiple kernel functions for support vector machines (SVMs). They have modified the classification function of SVM,
$$f(x) = \sum_{i=1}^{N}\alpha_i\, y_i\, k_\beta(x_i, x) + b, \tag{4.12}$$
with
$$k_\beta(x, x') = \sum_{s=1}^{S}\beta_s\, k_s(x, x'), \tag{4.13}$$
and maximized the margin of the SVM classifier with respect to $\alpha$ and $\beta$ simultaneously by using semidefinite programming (SDP).

Recently, Do et al. (2009) proposed a novel MKL method considering the fact that the theoretical error bound of SVM depends on both the margin and the radius of the smallest sphere that contains all the training samples. They derived an iterative algorithm named R-MKL to optimize the margin and the radius with respect to the weight vector $\alpha$ and the combination parameter $\beta$.

In the next section, we compare the performance of our proposed MCEM algorithms to the representative MKL method using SDP (Lanckriet, Cristianini et al., 2004; Lanckriet, Deng et al., 2004) and R-MKL (Do et al., 2009).

4.3.  Experimental Study.

In the same manner as we did in the linear dimensionality reduction, we conduct experiments with one-nearest-neighbor (1-NN) classifiers. As a comparative study of kernel combination techniques, we also tackle the yeast protein function annotation task.

In Table 3, we show the classification results of the KFDA, KLFDA, MCEM.R, and MCEM.Q algorithms, where KLFDA is a kernelized version of LFDA (Sugiyama, 2007). Except for KFDA, the reduced dimensionality can be chosen arbitrarily; however, we fixed it to one for the sake of simplicity. We used gaussian kernels for all algorithms. For KFDA, we applied two methods of determining the kernel parameter. One is the so-called Jaakkola heuristic, which uses the median of the smallest Euclidean distances between the feature vectors in one class and those in the other class (Jaakkola, Diekhans, & Haussler, 1999) (KFDA(H) in Table 3). The other is cross-validation on the first five realizations of each data set, in the same manner as in the linear case (KFDA(CV) in Table 3). The regularization parameter $\lambda$ in equation 4.2 is fixed for all experiments. In the proposed MCEM algorithms, we used 20 gaussian kernels with different values of the kernel parameter $\theta$.
To see the effect of optimizing the kernel combination, we also show the classification results of KFDA with an unweighted combination of the kernel functions (KFDA(UC) in Table 3).
Table 3:
Misclassification Rates (in Percentages) by KFDA, KLFDA, and MCEMs.
Data Name       KFDA(CV)      KFDA(H)       KFDA(UC)      KLFDA(CV)      MCEM.R        MCEM.Q        Euclidean
Banana          31.26(3.40)   15.00(0.98)   16.25(1.48)   36.79(4.44)    16.18(1.23)   17.78(2.18)   13.64(0.76)
Breast-cancer   31.76(4.84)   31.8(4.91)    32.34(5.24)   35.89(5.01)    32.44(4.29)   28.13(4.93)   32.73(4.82)
Diabetes        30.23(2.44)   29.44(2.21)   27.34(2.63)   36.22(2.75)    26.96(2.26)   26.18(2.46)   30.12(2.05)
Flare-solar     35.48(2.09)   35.66(2.18)   36.24(2.48)   37.01(1.83)    36.19(1.96)   35.46(1.99)   36.47(1.88)
German          28.91(2.88)   29.31(2.67)   26.06(2.83)   41.90(2.72)    26.10(2.48)   25.30(2.27)   29.46(2.47)
Heart           21.12(3.72)   21.39(3.58)   20.71(5.23)   33.93(4.85)    19.86(4.73)   17.48(3.79)   23.16(3.74)
Image           12.86(1.23)   11.8(1.4)     13.13(1.33)   28.46(1.84)    10.53(1.53)   18.77(1.44)   3.38(0.54)
Ringnorm        2.06(0.45)    2.06(0.38)    2.93(1.49)    2.28(0.51)     2.91(1.46)    2.69(1.27)    35.03(1.36)
Splice          18.14(0.76)   20.16(1.13)   17.76(1.69)   37.95(14.84)   18.10(2.25)   24.15(1.33)   28.77(1.52)
Thyroid         5.45(2.27)    5.93(2.39)    7.97(3.78)    11.12(3.61)    7.16(2.90)    7.87(3.27)    4.36(2.210)
Titanic         22.61(1.05)   22.37(1.06)   22.57(1.30)   22.44(1.03)    22.26(1.04)   22.46(1.08)   22.50(1.057)
Twonorm         3.21(0.45)    3.21(0.45)    3.74(1.08)    44.57(5.36)    3.24(0.49)    3.24(0.49)    6.68(0.72)
Waveform        11.67(0.74)   12.03(0.82)   10.94(1.15)   28.85(1.88)    10.98(1.01)   12.26(1.35)   15.83(0.65)

Notes: The best results and those comparable to them based on the t-test with a significance level of 5% are shown in boldface type. Results that improved on the classification results in Euclidean space based on the t-test with a significance level of 5% are marked.

Table 3 shows that the nonlinear dimensionality-reduction techniques based on the kernel method outperform the linear dimensionality-reduction techniques shown in Table 2 for many data sets. We note that KLFDA does not work well for these data sets. We conjecture that the reason the kernel methods do not work well for some data sets is that those data sets are easily separated in the original Euclidean space. It is sometimes observed that nonlinearization by the kernel method degrades the separability when the raw data are easily separated by linear methods. From Table 3, the kernel methods perform worse than 1-NN classification in Euclidean space for the Banana, Image, and Thyroid data. We can observe that in Euclidean space, the classification errors of these three data sets are relatively small, and we think this is the reason the kernel methods do not work well for them.

One of the favorable features of the proposed MCEM algorithms is that they do not need to determine kernel parameters by cross-validation as KFDA does. We only need to prepare several kernel functions with different kernel parameters and relegate the optimization of the kernel parameter to the optimization of the combination coefficients. We can see that without kernel parameter tuning, the MCEM algorithms perform comparably to or better than KFDA.

As an experiment to compare with other MKL techniques, we apply the MCEM algorithms to the problem of yeast protein function annotation. We compare the proposed techniques against SVMs using a single kernel and against the two MKL techniques of Lanckriet, Cristianini et al. (2004) and Do et al. (2009), using data available from the supporting Web site of Lanckriet, Deng et al. (2004). We use three attributes of the yeast protein data, each represented by a kernel: gene expression, protein domain content, and protein sequence similarity. We train 12 binary classifiers, one for each of the 12 functional classes of yeast genes. We randomly sample from the data set to reduce its size to 500 genes and then perform three-fold cross-validation, repeating the entire procedure five times. Table 4 summarizes the mean area under the ROC curve (AUC) over the 15 trials for all techniques. We tested various values of the soft margin parameter for SDP and R-MKL by running full classification experiments and adopted the best values. We note that the proposed MCEM framework is intended to learn a kernel function that regulates the distribution of the given data in the feature space so that the data are compactly aggregated in each class. The MCEM framework is flexible enough to be used as a kernel learning preprocessing step, and it can be combined with classifiers other than KFDA. In this experiment, we also show the results of SVM classification with kernel matrices learned by the MCEM algorithms. In this experiment, we simply fixed the regularization parameter $\lambda$ for KFDA used in the MCEM algorithms and set the soft margin parameter for SVM to one. From Table 4, we see that the proposed methods show accuracy comparable to that of other MKL methods, such as SDP and R-MKL.

Table 4:
Comparison of MKL Techniques on the Yeast Protein Function Annotation Task.
Function   Exp     Dom     Seq     SDP     R-MKL   MCEM.R   MCEM.Q   MCEM.R+SVM   MCEM.Q+SVM
1          0.682   0.767   0.774   0.778   0.778   0.784    0.766    0.796        0.776
2          0.708   0.676   0.689   0.737   0.725   0.736    0.748    0.713        0.728
3          0.619   0.689   0.688   0.683   0.699   0.699    0.692    0.697        0.695
4          0.706   0.733   0.758   0.786   0.776   0.769    0.771    0.769        0.770
5          0.854   0.789   0.777   0.856   0.874   0.804    0.817    0.803        0.834
6          0.590   0.655   0.688   0.692   0.680   0.690    0.682    0.692        0.688
7          0.570   0.678   0.708   0.714   0.703   0.710    0.695    0.714        0.704
8          0.612   0.635   0.669   0.711   0.716   0.700    0.746    0.684        0.726
9          0.686   0.744   0.741   0.783   0.775   0.752    0.768    0.750        0.784
10         0.622   0.658   0.701   0.698   0.660   0.705    0.673    0.703        0.674
11         0.612   0.585   0.608   0.586   0.593   0.613    0.611    0.597        0.582
12         0.657   0.911   0.883   0.875   0.895   0.885    0.832    0.886        0.848

Notes: The table lists, for each functional class (row) and each classification technique (column), the mean AUC from five times three-fold cross-validation. The optimal mean AUC per data set is shown in boldface type. The first three columns correspond to SVMs with single kernels (gene expression, protein domain content, and sequence similarity, respectively). The SDP and R-MKL columns correspond to SVMs with the combined kernel optimized by methods in Lanckriet, Cristianini et al. (2004) and Do et al. (2009), respectively.

5.  Discussions on Information-Theoretic Dimensionality Reduction Methods

Since there is an enormous number of studies on supervised dimensionality reduction or feature extraction, we devote this section to a literature survey. Supervised dimensionality-reduction techniques can be divided into two categories: one based on margin maximization (Weston et al., 2000; Tao, Chu, & Wang, 2008) and the other based on covariance structures, such as FDA and LCEM. In FDA, an equivalent gaussian distribution for each class is assumed. By considering entropy or mutual information, covariance-based approaches can be generalized. Since $I(Z; Y) = H(Z) - H(Z \mid Y)$ holds, maximizing the mutual information between the transformed data Z and the class label Y is equivalent to minimizing the class-conditional entropy $H(Z \mid Y)$ with a regularization term, for example, $R = -H(Z)$, in our approach. These general dimensionality-reduction methods, referred to as information-theoretic dimensionality reduction, are based on the Shannon entropy. Basically, these methods need to estimate the joint entropy or mutual information. Since entropy is calculated from the density function of the transformed data, there are various methods depending on how the density is estimated. Methods of estimating density functions can be divided into parametric and nonparametric approaches. In the parametric approach, a gaussian mixture model (GMM) is often adopted to approximate the distribution of $Z = A^\top X$. For example, by gradient ascent, Leiva-Murillo and Artes-Rodriguez (2004) maximized the mutual information I(Z; Y) calculated by means of density estimation with a GMM. Kaski and Peltonen (2003), Sajama and Orlitsky (2005), and Goldberger, Peltonen, and Kaski (2007) also used a GMM to estimate the conditional probability $p(A^\top x \mid y)$. Then, using Bayes's theorem, they estimated $p(y \mid A^\top x)$ and maximized a conditional likelihood,
$$\prod_{i=1}^{N} p(y_i \mid A^\top x_i), \tag{5.1}$$
by gradient ascent. In the nonparametric approach, no assumption is made on distributions, and in general, only a small number of tuning parameters such as kernel bandwidth are predefined. Recently He, Hu, and Yuan (2009) have proposed a supervised dimensionality-reduction method by means of entropy maximization:
formula
5.2
This entropy maximization criterion is similar to our framework proposed in section 3 because our quasi-orthogonalization corresponds to holding the covariance of the transformed data constant under the assumption that all data are whitened. He et al. (2009) also gave a theoretical validation of their proposal based on the relationship between the class-conditional entropy and the objective function of FDA. However, they made a strong assumption on the data distribution in the projected space in order to reduce the constrained entropy maximization problem to a generalized eigenvalue problem. Furthermore, their approach makes use of the Renyi quadratic entropy (Renyi, 1960), which is defined as
$$H_R(X) = -\log\int p(x)^2\, dx, \tag{5.3}$$
instead of the Shannon entropy. The use of the Renyi entropy for unsupervised learning such as ICA is proposed in Fisher and Principe (1997), and there is a lot of related work using the Renyi entropy to avoid the difficulty in estimating the Shannon entropy (Principe & Dongxin, 1999; Torkkola & Campbell, 2000; Torkkola, 2003; Hild, Erdogmus, Torkkola, & Principe, 2006, for instance). The Renyi quadratic entropy gives a lower bound of the Shannon entropy as
$$H_R(X) \le H(X), \tag{5.4}$$
and most existing work makes use of this property to validate maximizing the Renyi quadratic entropy instead of the Shannon entropy. However, the relationship between the class-conditional entropy and the objective function of FDA is described in terms of Shannon's original definition of entropy, and other theoretical validations of the proposed framework depicted in appendix B also rely on the definition of the Shannon entropy.

As noted, many studies on information-theoretic dimensionality reduction exist. However, most existing nonparametric approaches tackle manipulation of the Renyi entropy instead of the original Shannon entropy. By introducing a simple entropy estimator (Faivishevsky & Goldberger, 2009) and upper-bounding the joint entropy by the sum of marginal entropies, we can estimate and optimize the Shannon entropy efficiently.

6.  Conclusion

In this letter, we treated dimensionality reduction as an information-theoretic optimization problem and proposed a general framework of supervised dimensionality reduction based on conditional entropy minimization. With simple experiments, we showed that the proposed framework can find the optimal classification surface even when conventional Fisher discriminant analysis fails to do so. We also clarified the mechanism responsible for the discriminative dimensionality-reduction effect obtained by the proposed criterion. We implemented a linear dimensionality-reduction technique based on the proposed framework and applied it to large-scale benchmark data sets. We demonstrated that the classification accuracy after reducing dimensionality by LCEM is better than that of conventional dimensionality-reduction techniques such as PCA and FDA and comparable to that of state-of-the-art methods such as LFDA.

There has been an increase in research on dimensionality reduction or manifold learning techniques that take account of the local metric structure of data distribution. Besides LFDA considered in this letter, the locality preserving projection (LPP; He & Niyogi, 2003) and the Laplacian eigenmap (LE; Belkin & Niyogi, 2003) are well known as examples for techniques that use the affinity between data points explicitly. In LFDA, the FDA criterion was generalized to reflect the affinity of data points. In LPP and LE, data points are projected onto a low-dimensional space so that the points close to each other in the original space are kept close in the projected space. The optimization problem of LPP and LE is formalized with the Laplacian matrix defined by the affinity matrix of data points and reduced to the generalized eigenvalue problems. On the other hand, the objective function of the proposed framework is the class-conditional entropy of the transformed data and does not explicitly consider the locality of the data. However, in estimating entropy (Faivishevsky & Goldberger, 2009), locality of the data distribution is naturally reflected, and thus comparable performance to locality-conscious methods, such as LFDA, is obtained. It will be important future work to investigate the probability model underlying other dimensionality-reduction methods and the relationship to LCEM from the viewpoint of information theory.

We also considered multiple kernel learning within the conditional entropy minimization framework. To the best of our knowledge, there is no existing MKL technique based on the conditional entropy criterion. The proposed algorithm is not only novel; it also worked well for real-world data. As shown in Tables 3 and 4, it can achieve accuracy comparable or superior to KFDA without kernel parameter tuning, and it is also comparable to other MKL methods. Furthermore, it was shown that the proposed MCEM framework can be used as a kernel optimization process for other classifiers such as SVM. To keep the optimization step simple, we omitted the entropy regularization term and optimized only the conditional entropy term in equation 4.10. At the cost of optimizing with respect to the tuning parameter $\gamma$ in equation 4.9, we may obtain improved classification results.

We considered only a linear combination of kernels in this letter; however, there are other kernel combinations. Lewis et al. (2006a) generalized a way of kernel combination that allows the coefficients of kernels to depend on data points. Our proposed framework is also applicable to such a combination to improve classification accuracy.

In future work, we would like to address the relationship between the proposed framework and research on sufficient dimensionality reduction (SDR). The problem of SDR is to find a subspace such that the projection of the data vector X onto the subspace captures the statistical dependency of the class y (the response, in the regression literature) on X as much as possible. It is of great interest to develop procedures for estimating this subspace, and it has been studied extensively (Li, 1991; Cook & Yin, 2001; Fukumizu, Bach, & Jordan, 2009). The subspace obtained by the proposed framework is, by definition, the one in which the projected data distribution has low conditional entropy. As stated in section 3 and illustrated in Figure 1b, the data must be locally distributed when conditioned on the class label in the projected subspace. Another way of characterizing the subspace obtained by conditional entropy minimization in the context of SDR is important future work for us.

The convergence properties of the proposed MCEM algorithms have not been investigated yet. The study of the properties and conditions of convergence remains interesting future work. We would also like to examine techniques for simultaneously optimizing the weight parameters $\alpha$ and the coefficients $\beta$ in problem 4.6, as Lanckriet, Cristianini et al. (2004) did.

Appendix A:  On Entropy Estimation and Approximation Methods

In this appendix, we support our selection of the entropy estimator and our approximation approach for the joint entropy with the sum of marginal entropies.

We first compare two nonparametric entropy estimation methods. The first is a traditional leave-one-out (LOO) method based on kernel density estimation (Beirlant et al., 1997). Given a data set $D = \{x_i\}_{i=1}^{N}$, we first estimate the probability density function p(x) by
$$\hat p(x) = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{\sqrt{2\pi}\,h}\exp\!\left(-\frac{(x - x_i)^2}{2 h^2}\right), \tag{A.1}$$
where h is a kernel bandwidth parameter. We determine the bandwidth by a simple heuristic, Silverman's rule of thumb (Wand & Jones, 1994). The estimated probability density function can be used to approximate the entropy of X as
$$H(X) = -\mathbb{E}\bigl[\log p(X)\bigr]. \tag{A.2}$$
Then we replace $p$ by $\hat p$ and approximate the expectation operation by the LOO method as
$$\hat H_{\mathrm{LOO}}(X) = -\frac{1}{N}\sum_{i=1}^{N}\log \hat p_{-i}(x_i), \tag{A.3}$$
where $\hat p_{-i}$ denotes the kernel density estimate constructed from all data except $x_i$.

We compare the LOO and the MNN entropy estimators on exponentially distributed data. The density function of the exponential distribution is $p(x) = \lambda e^{-\lambda x}$ for $x \ge 0$, and in this case the entropy can be analytically calculated as $H(X) = 1 - \log\lambda$. We generate N = 500 samples from exponential distributions with 10 different values of the parameter $\lambda$. Table 5 shows the mean squared errors and standard deviations of the entropy estimates. From this table, the MNN estimator is more accurate than the LOO estimator in terms of mean squared error. It is notable that the standard deviation of the MNN estimator is far smaller than that of LOO. This property is favorable when we evaluate the gradient of the entropy.
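The comparison can be reproduced in a few lines; the sketch below (ours, not the original experimental code) evaluates the LOO estimator A.3 on exponential samples against the analytic entropy $1 - \log\lambda$, using one common form of Silverman's rule of thumb for the bandwidth. The MNN estimate from section 3.1 can be compared in the same way once its N-dependent additive constant is included.

```python
# Illustrative sketch: LOO kernel-density entropy estimate on exponential data.
import numpy as np

def loo_entropy_1d(x, h):
    d2 = (x[:, None] - x[None, :]) ** 2
    dens = np.exp(-d2 / (2.0 * h ** 2)) / (np.sqrt(2.0 * np.pi) * h)
    np.fill_diagonal(dens, 0.0)                       # leave one out
    p_loo = dens.sum(axis=1) / (len(x) - 1)           # density at x_i without x_i
    return -np.mean(np.log(p_loo))

rng = np.random.default_rng(0)
lam = 2.0
x = rng.exponential(scale=1.0 / lam, size=500)
h = 1.06 * x.std() * len(x) ** (-0.2)                 # a common form of Silverman's rule
print(loo_entropy_1d(x, h), 1.0 - np.log(lam))        # estimate vs. analytic entropy
```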

Table 5:
Performance of the MNN Entropy Estimator in Comparison with an LOO Entropy Estimator.
                     LOO          MNN
Mean square error    0.08154036   0.0155177328
Standard deviation   0.02514127   0.0001276099
We next show a simple experimental result to support our approach of entropy estimation. Entropy estimation of high-dimensional random variables is prone to giving a poor result because of the curse of dimensionality. To avoid this problem, the joint entropy is bounded from above by the sum of marginal entropies as
$$H(Z) = H(A^\top X) \le \sum_{l=1}^{m} H(a_l^\top X). \tag{A.4}$$
In general, even after decorrelating each dimension of the transformed vector z by whitening and quasi-orthogonalization, there exists a gap between H(Z) and the sum of the marginal entropies that stems from higher-order moments. Because our objective in entropy estimation is to find the transformation matrix A that minimizes the joint entropy $H(Z) = H(A^\top X)$, it is important that the minimizer of H(Z) and that of the sum of the marginal entropies be close enough. It is difficult to show a general result, but we show the following simple experimental result. We transform a nongaussian three-dimensional variate X to a two-dimensional subspace by a family of matrices $A(\theta)$ with a parameter $\theta$. Let the minimizers of the estimated Shannon joint entropy, the sum of the Shannon marginal entropies, and the sum of the Renyi marginal entropies be $\hat\theta_J$, $\hat\theta_S$, and $\hat\theta_R$, respectively. Then we experimentally show that $|\hat\theta_J - \hat\theta_S| < |\hat\theta_J - \hat\theta_R|$ holds. For a nongaussian multivariate distribution, we adopt a three-dimensional gaussian mixture distribution with two components,
formula
A.5
formula
A.6
formula
A.7
where r = 0.3. We define the column-orthonormal transformation matrix $A(\theta)$ by its two column vectors,
formula
A.8
formula
A.9
We vary the parameter $\theta$ over a range of values and transform the original data by $A(\theta)$. We generate N = 5000 data points from the gaussian mixture, equation A.5; the joint entropy is estimated with all 5000 points, and the two sums of marginal entropies are estimated with 500 points. We note that it is difficult to calculate the joint entropy analytically, so we used the MNN estimator with a large sample (N = 5000) for the joint entropy. We repeat this procedure 100 times and report the mean squared differences between the estimated minimizers in Table 6. From this table, we can see that the minimizer $\theta_S$ is closer to $\theta_J$ than $\theta_R$ is, and its estimates are more stable. We also plot one run of this experiment in Figure 3, in which the minimum of each estimate is marked by a circle on the corresponding curve. The parameter values that minimize the estimated joint entropy and the sum of marginal Shannon entropies are close, so this simple experimental result supports our upper-bounding approach.
Figure 3:

Estimated Shannon joint entropy (solid line), the sum of marginal Shannon entropies (dashed line), and the sum of marginal Renyi entropies (dotted line). The minimum point is indicated by a circle on each curve.


Table 6:
Mean Squared Error and Standard Deviation of the Differences $|\theta_J - \theta_S|$ and $|\theta_J - \theta_R|$.

                      $|\theta_J - \theta_S|$   $|\theta_J - \theta_R|$
Mean squared error    0.09047787                0.1146681
Standard deviation    0.05246031                0.09060831
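As a complementary check that does not depend on any estimator, the short Python sketch below illustrates the bound of equation A.4 with a two-dimensional gaussian whose joint and marginal entropies are available in closed form. The covariance matrix is an arbitrary choice for illustration and is unrelated to the mixture of equations A.5 to A.7.

import numpy as np

Sigma = np.array([[1.0, 0.8],
                  [0.8, 2.0]])        # illustrative covariance of Z
d = Sigma.shape[0]

# Gaussian entropies in closed form: H(Z) = 0.5*log((2*pi*e)^d |Sigma|),
# H(Z_k) = 0.5*log(2*pi*e*Sigma_kk).
joint = 0.5 * np.log((2.0 * np.pi * np.e) ** d * np.linalg.det(Sigma))
marginal_sum = np.sum(0.5 * np.log(2.0 * np.pi * np.e * np.diag(Sigma)))

print("H(Z)             :", joint)
print("sum_k H(Z_k)     :", marginal_sum)
print("gap (nonnegative):", marginal_sum - joint)   # equals -0.5*log det(correlation matrix)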

Appendix B:  Validations of the Proposed Framework

In this appendix, we validate the proposed framework of supervised dimensionality reduction from a point of view different from that of section 2.

In dimensionality-reduction problems, it is desirable that the transformed data be compactly aggregated in each class. Viewing the notion of compact representation from the perspective of information theory, a good transformation for dimensionality reduction is one with small mutual information,
$$I(X; Z) = H(Z) - H(Z \mid X), \qquad \mathrm{(B.1)}$$
because small I(X; Z) indicates a high compression rate when we regard the transformation as a data compression process (Cover & Thomas, 1991). Since the mutual information, equation B.1, is determined only by the distribution of the original data X and transformed data Z, this equation can be regarded as a criterion for unsupervised dimensionality reduction. In supervised dimensionality reduction, the data are required to be compactly distributed in each class. In this case, it is natural to measure the goodness of the transformation by the class-conditional mutual information,
$$I(X; Z \mid Y) = H(Z \mid Y) - H(Z \mid X, Y). \qquad \mathrm{(B.2)}$$
It is also natural to suppose that the transformation is deterministic, and in such cases, H(Z|X, Y) is equal to 0, and the goodness of the transformation can be essentially measured by the class-conditional entropy H(Z|Y). From the above discussion, we claim that the proposed framework is reasonable in the context of data compression theory.
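The reduction from equation B.2 to H(Z|Y) can be checked on a small discrete example. The following Python sketch (a toy construction of ours, not the experimental setting of this letter) draws a discrete X that depends on the class Y, applies a deterministic mapping z = g(x), and verifies numerically that H(Z|X, Y) is zero, so that I(X; Z|Y) coincides with H(Z|Y).

import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=10000)                    # class labels
x = rng.integers(0, 8, size=10000) + 8 * y            # X depends on Y
z = x % 4                                             # deterministic transformation g(x)

def entropy(labels):
    # Plug-in Shannon entropy (in nats) of a discrete sample.
    p = np.array(list(Counter(labels).values()), dtype=float)
    p /= p.sum()
    return -(p * np.log(p)).sum()

def cond_entropy(a, b):
    # H(A | B) = H(A, B) - H(B) for discrete samples.
    return entropy(list(zip(a, b))) - entropy(b)

H_z_given_y = cond_entropy(z, y)
H_z_given_xy = cond_entropy(z, list(zip(x, y)))       # 0, since z is a function of x
I_xz_given_y = H_z_given_y - H_z_given_xy             # I(X;Z|Y) = H(Z|Y) - H(Z|X,Y)
print(H_z_given_xy, H_z_given_y, I_xz_given_y)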
We next consider the negentropy and the class-conditional negentropy of a random variable X (Hyvärinen et al., 2001), defined as
$$J(X) = H_G(X) - H(X) \qquad \mathrm{(B.3)}$$
and
$$J(X \mid Y) = H_G(X \mid Y) - H(X \mid Y), \qquad \mathrm{(B.4)}$$
respectively, where $H_G$ denotes the entropy of a gaussian distribution with the same covariance structure. With this negentropy expression, we get
$$H_G(A^{\top}X) = \tfrac{1}{2}\log\bigl((2\pi e)^{d}\,|A^{\top}\Sigma A|\bigr), \qquad H_G(A^{\top}X \mid Y) = \sum_{y} p(y)\,\tfrac{1}{2}\log\bigl((2\pi e)^{d}\,|A^{\top}\Sigma_{y} A|\bigr),$$
where Σ and Σ_y are the covariance matrices of D and D_y, respectively, and p(y) is the class prior distribution. Then the conditional entropy H(Z|Y) can be divided into three terms as
$$H(Z \mid Y) = H(A^{\top}X \mid Y) = H_G(A^{\top}X) - J(A^{\top}X \mid Y) - \frac{1}{2}\sum_{y} p(y)\log\frac{|A^{\top}\Sigma A|}{|A^{\top}\Sigma_{y} A|}. \qquad \mathrm{(B.5)}$$
In the following, we consider the meanings of these three terms.

B.1.  Joint Entropy Under Gaussian Assumption.

The first term $H_G(A^{\top}X)$ of equation B.5 is the entropy of the transformed data under the assumption that their overall distribution is gaussian. The value of this term is completely determined by the determinant of the covariance matrix Σ of all the data, and minimizing it amounts to representing all the data compactly. However, this term can be made arbitrarily small by scalar multiplication of the data, so by itself it does not influence classification ability. Consequently, under the assumption that the transformation for dimensionality reduction is regularized in some manner, minimizing this first term does not make a significant contribution to classification accuracy.
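To make the scale argument explicit: since $H_G(A^{\top}X) = \tfrac{1}{2}\log\bigl((2\pi e)^{d}|A^{\top}\Sigma A|\bigr)$, multiplying the projected data by a scalar c shifts this term by d log c. The small Python sketch below (with an arbitrary example covariance of our choosing) shows the term decreasing without bound as c shrinks, while class separability is unaffected by such a rescaling.

import numpy as np

def gaussian_entropy(cov):
    # H_G = 0.5 * log((2*pi*e)^d * det(cov)) for a gaussian with covariance cov.
    d = cov.shape[0]
    return 0.5 * np.log((2.0 * np.pi * np.e) ** d * np.linalg.det(cov))

Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])                    # example covariance of the projected data
for c in (1.0, 0.1, 0.01):
    # Scaling the data by c scales the covariance by c^2 and shifts H_G by d*log(c).
    print(c, gaussian_entropy(c ** 2 * Sigma))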

B.2.  Conditional Negentropy.

The negentropy of a distribution is a normalized version of entropy, defined by equation B.3, and is mainly used in research on independent component analysis (ICA; Comon, 1994; Hyvärinen, 1999; Hyvärinen et al., 2001) as a measure of nongaussianity. In this sense, negentropy is also used as a measure of how interesting a data distribution is.

From equation B.5, we can see that minimizing the conditional entropy H(ATX|Y) contributes to maximizing the conditional negentropy J(ATX|Y). As a result, we can expect the obtained transformation to increase nongaussianity in each class of the transformed data.

B.3.  Heterogeneous Discriminant Analysis Criteria.

The last term of equation B.5 is exactly the objective function of heteroscedastic discriminant analysis (HDA) defined by Kumar and Andreou (1998), and optimizing this term leads to good class discrimination. In HDA, the covariance structure can differ for each class; such heteroscedastic models have been investigated by many researchers (Kumar & Andreou, 1998; Hastie & Tibshirani, 1996; Loog & Duin, 2004; Zhang & Yeung, 2009) to overcome the strict assumption of FDA.

Appendix C:  Further Experimental Result on Linear Dimensionality Reduction

We show further experimental results on linear dimensionality-reduction techniques. We first compare classification accuracy when the dimensionality of the data is reduced to one (Table 7). From this table, for most of the data sets, we can conclude that LCEM performs comparably to or slightly better than the other supervised dimensionality-reduction methods when the dimensionality is reduced to one.

Table 7:
Average Misclassification Rate (in Percentage) of Linear Techniques When Dimensionality Is Reduced to One.
Data Name PCA FDA MCML LFDA LCEM 
Banana 36.5(0.6) 38.3(4.0) 39.4(1.3) 36.2(1.2) 34.4(1.6) 
Breast-Cancer 38.9(5.5) 34.9(5.1) 33.8(5.4) 33.9(4.7) 34.5(4.8) 
Diabetes 40.0(4.2) 31.3(2.8) 40.6(2.2) 34.1(2.4) 30.7(2.5) 
Flare-Solar 43.8(5.7) 36.4(1.9) 36.2(2.6) 36.8(1.9) 36.6(2.0) 
German 42.0(2.3) 32.0(2.6) 39.9(3.3) 38.4(3.3) 31.8(2.8) 
Heart 44.0(25.9) 22.9(4.1) 41.8(5.6) 22.5(3.2) 22.9(3.2) 
Image 44.0(9.0) 22.1(0.9) 29.3(1.5) 31.2(1.6) 22.6(1.4) 
Ringnorm 36.1(8.4) 31.7(1.0) 41.9(0.9) 31.6(1.6) 31.9(1.1) 
Splice 42.7(4.3) 20.4(0.8) 45.4(2.0) 20.9(0.9) 20.6(0.6) 
Thyroid 9.3(3.8) 17.9(4.9) 19.6(3.3) 7.4(3.4) 17.2(4.2) 
Titanic 23.0(1.6) 22.5(1.1) 22.2(1.0) 22.6(1.5) 22.5(1.0) 
Twonorm 3.6(0.3) 3.5(0.5) 40.9(1.2) 3.4(0.4) 3.5(0.5) 
Waveform 36.8(19.0) 18.6(1.2) 40.4(1.2) 18.6(1.1) 18.7(1.1) 

Note: The best results and comparable ones based on the t-test with a significance level of 5% are shown in boldface type.

We next show the misclassification rates as functions of the reduced dimensionality. The results show that LCEM works well, but overall there is no single best method that consistently outperforms the others. As seen from Figure 4, the misclassification rate generally decreases as the dimensionality increases. However, the Ringnorm data attain their minimum misclassification rate near d = 7, which suggests the need for some model selection procedure to find the best dimensionality; a minimal sketch of such a procedure is given after Figure 4.

Figure 4:

Mean misclassification rates as functions of reduced dimensionalities. Four linear dimensionality-reduction methods are used to map data into spaces lower than or equal to the original dimensionality and classified by one-nearest-neighbor classifiers.

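The selection of the reduced dimensionality can be automated by cross-validation. The following Python sketch assumes scikit-learn is available and uses PCA and the breast-cancer data only as stand-ins, since the LCEM implementation and the benchmark suite used above are not reproduced here; it chooses the dimensionality that minimizes the cross-validated one-nearest-neighbor error.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)            # stand-in benchmark data
best_d, best_err = None, np.inf
for d in range(1, 11):
    # Project to d dimensions (PCA as a placeholder projection), then classify by 1-NN.
    pipe = make_pipeline(StandardScaler(), PCA(n_components=d),
                         KNeighborsClassifier(n_neighbors=1))
    err = 1.0 - cross_val_score(pipe, X, y, cv=5).mean()
    if err < best_err:
        best_d, best_err = d, err
print("selected dimensionality:", best_d, "CV error:", round(best_err, 3))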

Appendix D:  Quadratic Optimization for MCEM.Q Algorithm

We give the detailed derivation of the approximate optimization step in the MCEM.Q algorithm.

We again consider the relationship between the conditional entropy and its upper bound optimized in KFDA, represented by equations 4.7 and 4.8. The final right-hand side of equation 4.8 is equivalent to the objective function of KFDA. In KFDA, this upper bound on the conditional entropy is minimized with respect to the weight parameter. For the kernel combination coefficients, we can derive the same kind of upper bound on the conditional entropy.

We first rewrite this quantity explicitly using the element kernels. Let K^(s) be the Gramian matrix of the sth element kernel function and k_i^(s) be the ith column vector of K^(s). Then it can be written as
formula
D.1
formula
D.2
Now we ignore a constant term and multiplicative factors and obtain an upper bound of the conditional entropy as
formula
where we bundle the ith column vectors of the S Gramian matrices column-wise into a single matrix, and likewise bundle the class-y average column vectors of the S Gramian matrices column-wise, with the remaining quantities defined from these bundled matrices. We used Jensen's inequality to derive the last inequality.
As a result, minimizing the upper bound of the conditional entropy with respect to the kernel combination coefficients is formulated as the following optimization problem:
formula
D.3
formula
D.4
This is a quadratic optimization problem, and a unique solution can be obtained efficiently by, for example, an interior point method.
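For completeness, a minimal Python sketch of this optimization step follows, using scipy's SLSQP solver (assumed available) on a randomly generated positive-definite matrix in place of the quantities built from the bundled Gramian matrices; the simplex constraint (nonnegative coefficients summing to one) is our assumption about the form of constraint D.4.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
S = 4                                              # number of element kernels
M = rng.standard_normal((S, S))
H = M @ M.T + 1e-3 * np.eye(S)                     # placeholder symmetric positive-definite matrix
c = rng.standard_normal(S)                         # placeholder linear term

obj = lambda b: 0.5 * b @ H @ b + c @ b            # quadratic objective in the coefficients
grad = lambda b: H @ b + c
cons = ({"type": "eq", "fun": lambda b: b.sum() - 1.0},)   # coefficients sum to one
bounds = [(0.0, None)] * S                                  # nonnegative coefficients
beta0 = np.full(S, 1.0 / S)                                 # uniform initialization

res = minimize(obj, beta0, jac=grad, bounds=bounds, constraints=cons, method="SLSQP")
print("optimal kernel combination coefficients:", np.round(res.x, 4))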

Acknowledgments

We are grateful to Nima Reyhani for helpful suggestions. We also express special thanks to the anonymous reviewers whose comments led to valuable improvements of this letter.

References

Beirlant, J., Dudewicz, E. J., Györfi, L., & Meulen, E. C. (1997). Nonparametric entropy estimation: An overview. International Journal of the Mathematical Statistics Sciences, 6, 17-39.

Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373-1396.

Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36(3), 287-314.

Cook, D. R., & Yin, X. (2001). Dimension reduction and visualization in discriminant analysis (with discussion). Australian and New Zealand Journal of Statistics, 43(2), 147-199.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. Hoboken, NJ: Wiley.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press.

Do, H., Kalousis, A., Woznica, A., & Hilario, M. (2009). Margin and radius based multiple kernel learning. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (Vol. 1, pp. 330-343). Los Alamitos, CA: IEEE Computer Society Press.

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification. Hoboken, NJ: Wiley-Interscience.

Faivishevsky, L., & Goldberger, J. (2009). ICA based on a smooth estimation of the differential entropy. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems, 21 (pp. 433-440). Cambridge, MA: MIT Press.

Fisher, J. W., & Principe, J. C. (1997). Entropy manipulation of arbitrary nonlinear mappings. In Proceedings of the IEEE Workshop on Neural Networks for Signal Processing (pp. 14-23). Piscataway, NJ: IEEE Press.

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179-188.

Fukumizu, K., Bach, F. R., & Jordan, M. I. (2009). Kernel dimension reduction in regression. Annals of Statistics, 37, 1871-1905.

Globerson, A., & Roweis, S. (2006). Metric learning by collapsing classes. In Y. Weiss, B. Schölkopf, & J. Platt (Eds.), Advances in neural information processing systems, 18 (pp. 451-458). Cambridge, MA: MIT Press.

Goldberger, J., Peltonen, J., & Kaski, S. (2007). Fast semi-supervised discriminative component analysis. In Proceedings of Machine Learning for Signal Processing (pp. 312-317). CSREA Press.

Goldberger, J., Roweis, S., Hinton, G., & Salakhutdinov, R. (2005). Neighborhood component analysis. In K. L. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 513-520). Cambridge, MA: MIT Press.

Hastie, T., & Tibshirani, R. (1996). Discriminant analysis by gaussian mixtures. Journal of the Royal Statistical Society, Series B, 58, 155-176.

He, R., Hu, B. G., & Yuan, Z. (2009). Robust discriminant analysis based on nonparametric maximum entropy. In Proceedings of the First Asian Conference on Machine Learning (pp. 120-134). Berlin: Springer.

He, X., & Niyogi, P. (2003). Locality preserving projections. In S. Thrün, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16 (pp. 153-160). Cambridge, MA: MIT Press.

Hild, K. E., Erdogmus, D., Torkkola, K., & Principe, J. C. (2006). Feature extraction using information-theoretic learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9), 1385-1392.

Hyvärinen, A. (1999). Survey on independent component analysis. Neural Computing Surveys, 2, 94-128.

Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. Hoboken, NJ: Wiley.

Jaakkola, T., Diekhans, M., & Haussler, D. (1999). Using the Fisher kernel method to detect remote protein homologies. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology (pp. 149-158). Menlo Park, CA: AAAI Press.

Kaski, S., & Peltonen, J. (2003). Informative discriminant analysis. In Proceedings of the 20th International Conference on Machine Learning (pp. 329-336). Menlo Park, CA: AAAI Press.

Kozachenko, L. F., & Leonenko, N. N. (1987). Sample estimate of entropy of a random vector. Problems of Information Transmission, 23, 95-101.

Kumar, N., & Andreou, A. G. (1998). Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition. Speech Communication, 26(4), 283-297.

Lanckriet, G. R. G., Cristianini, N., Bartlett, P., Ghaoui, L. E., & Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27-72.

Lanckriet, G. R. G., Deng, M., Cristianini, N., Jordan, M. I., & Noble, W. S. (2004). Kernel-based data fusion and its application to protein function prediction in yeast. In Proceedings of the Pacific Symposium on Biocomputing (pp. 300-311). Singapore: World Scientific.

Leiva-Murillo, J. M., & Artes-Rodriguez, A. (2004). A gaussian mixture based maximization of mutual information for supervised feature extraction. In Proceedings of the Fifth International Conference on Independent Component Analysis and Blind Signal Separation (pp. 271-278). Berlin: Springer.

Lewis, D. P., Jebara, T., & Noble, W. S. (2006a). Nonstationary kernel combination. In Proceedings of the 23rd International Conference on Machine Learning (pp. 553-560). San Francisco: Morgan Kaufmann.

Lewis, D. P., Jebara, T., & Noble, W. S. (2006b). Support vector machine learning from heterogeneous data: An empirical analysis using protein sequence and structure. Bioinformatics, 22, 2753-2760.

Li, K-C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414), 316-327.

Loog, M., & Duin, R. P. W. (2004). Linear dimensionality reduction via a heteroscedastic extension of LDA: The Chernoff criterion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6), 732-739.

Mika, S., Rätsch, G., Weston, J., Schölkopf, B., & Müller, K. R. (1999). Fisher discriminant analysis with kernels. In Proceedings of the 1999 IEEE Signal Processing Society Workshop (pp. 41-48). Piscataway, NJ: IEEE Press.

Principe, J. C., & Dongxin, X. (1999). An introduction to information theoretic learning. In Proceedings of the International Joint Conference on Neural Networks (pp. 1783-1787). Cambridge, MA: MIT Press.

Rätsch, G., Onoda, T., & Müller, K. R. (2001). Soft margins for Adaboost. Machine Learning, 42(3), 287-320.

Renyi, A. (1960). On measures of information and entropy. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability (pp. 547-561). Berkeley: University of California Press.

Sajama, & Orlitsky, A. (2005). Supervised dimensionality reduction using mixture models. In Proceedings of the 22nd International Conference on Machine Learning (pp. 768-775). Menlo Park, CA: AAAI Press.

Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge: Cambridge University Press.

Sugiyama, M. (2007). Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. Journal of Machine Learning Research, 8, 1027-1061.

Tao, Q., Chu, D., & Wang, J. (2008). Recursive support vector machines for dimensionality reduction. IEEE Transactions on Neural Networks, 19(1), 189-193.

Torkkola, K. (2003). Feature extraction by non-parametric mutual information maximization. Journal of Machine Learning Research, 3, 1415-1438.

Torkkola, K., & Campbell, W. M. (2000). Mutual information in learning feature transformations. In Proceedings of the 17th International Conference on Machine Learning (pp. 1015-1022). San Francisco: Morgan Kaufmann.

Wand, M. P., & Jones, M. C. (1994). Kernel smoothing. London: Chapman & Hall/CRC.

Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., & Vapnik, V. (2000). Feature selection for SVMs. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 668-674). Cambridge, MA: MIT Press.

Zhang, Y., & Yeung, D-Y. (2009). Heteroscedastic probabilistic linear discriminant analysis with semi-supervised extension. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (Vol. 2, pp. 602-616). San Francisco: Morgan Kaufmann.