## Abstract

Kernel methods are known to be effective for nonlinear multivariate analysis. One of the main issues in their practical use is the selection of the kernel, and there have been many studies on kernel selection and kernel learning. Multiple kernel learning (MKL) is one of the promising kernel optimization approaches. Kernel methods are applied to various classifiers, including Fisher discriminant analysis (FDA). FDA gives the Bayes optimal classification axis if the data distribution of each class in the feature space is a gaussian with a shared covariance structure. Based on this fact, an MKL framework based on the notion of gaussianity is proposed. As a concrete implementation, an empirical characteristic function is adopted to measure gaussianity in the feature space associated with a convex combination of kernel functions, and two MKL algorithms are derived. Experimental results on several data sets show that the proposed kernel learning followed by FDA offers strong classification power.

## 1. Introduction

Kernel methods such as support vector machines (SVMs) have been shown to be successful for a wide range of data analysis problems (Cristianini & Shawe-Taylor, 2000). However, the difficulty in choosing a suitable kernel function and its parameters for a given data set is a serious drawback of these methods. A number of efforts have been made to solve kernel selection problems. For example, Cristianini, Shawe-Taylor, and Kandola (2001) proposed using the ideal kernel made from the class labels of the training data and aligning the base kernel matrix to the ideal kernel matrix. Amari and Wu (1999) proposed magnifying the feature space around the classification surface of the SVM using conformal transformation. In various kernel optimization methods, one of the notable approaches is multiple kernel learning (MKL), in which the optimal convex combination of a set of given kernels is considered. In that framework, kernel combination coefficients are learned so that the separability of data in different classes is maximized in the feature space associated with combined kernels. Lanckriet, Cristianini, Bartlett, Ghaoui, and Jordan (2004) proposed a framework to combine multiple kernel functions with several different loss functions and to optimize weights for data and coefficients for kernels by using semidefinite programming (SDP). Starting from this pioneering work, studies have been devoted to improving classification accuracy, learning efficiency, and feature interpretability (Sonnenburg, Rätsch, Schäfer, & Schölkopf, 2006; Rakotomamonjy, Bach, Canu, & Grandvalet, 2008; Do, Kalousis, Woznica, & Hilario, 2009; Kim, Magnani, & Boyd, 2006; Yan, Kittler, Mikolajczyk, & Tahir, 2009; Suzuki & Tomioka, 2011).

In statistics and multivariate analysis literature, Fisher discriminant analysis (FDA; Fisher, 1936) is one of the most popular linear classification methods. It is also regarded as a supervised dimensionality-reduction method and is capable of a wide range of applications, such as preprocessing for other data analyses or visualizations. In FDA, if the data distribution of each class is a gaussian with the same covariance structure in the feature space, we can obtain a Bayes optimal classifier (Duda, Hart, & Stork, 2000). (See appendix A for a brief proof of this fact.)

In this study, we propose optimizing the kernel combination coefficients so that the data distributions of the individual classes in the feature space are as close as possible to gaussian distributions sharing the same covariance structure. In the feature space associated with the optimally combined kernel functions, we then expect to obtain a Bayes optimal classifier by applying FDA. Figure 1 shows a conceptual diagram of desirable and undesirable data distributions in different feature spaces. Intuitively, the data in the original space are mapped to feature spaces by a family of maps defined by a given finite set of kernel functions and parameterized by a combination parameter in the (*S*−1)-simplex. The mathematical symbols in the figure are defined in the next section. The kernel combination coefficients are optimized so that the distribution of the mapped data in each class is as close as possible to a gaussian. We first propose a general framework of MKL with a gaussianity measure; algorithmically, MKL is the problem of finding the best kernel combination coefficients in a certain sense. We then show a simple formulation of the framework using the empirical characteristic function for measuring gaussianity and develop two algorithms.

Most existing MKL methods are oriented toward sparse representation or interpretable feature selection under sparseness constraints on the kernel combination coefficients. Since a kernel function determines a feature space, MKL can be regarded as a method to tailor the distribution of the data by modifying the feature space. Though feature selection has always been an important issue in statistical data analysis, our focus is not on selecting features but on combining features to achieve better discriminative power.

The rest of the letter is organized as follows. Section 2 briefly reviews the problem of MKL. Then a framework of MKL based on the gaussianity measure is proposed. In section 3, technical preliminaries to realize MKL methods based on the gaussianity measure are explained. The empirical characteristic function and its useful properties are utilized to define a gaussianity measure of the data distribution in the feature space. In section 4, two MKL algorithms with concrete algorithm descriptions are derived from the proposed MKL framework. Experimental results with artificial data and various kinds of benchmark data are given in section 5. The last section offers concluding remarks and notes future directions for research.

## 2. Gaussian Multiple Kernel Learning

In this section, multiple kernel learning for the binary classification problem is briefly reviewed. Then, the concept of multiple kernel learning with a gaussianity measure (Gaussian MKL; GMKL) is explained.

### 2.1. Multiple Kernel Learning Overview.

Suppose we are given a set of *n* training data pairs (*x*_{i}, *y*_{i}), where *x*_{i} belongs to some input space and *y*_{i} belongs to the class label set {+1, −1}. We denote by *X*_{+} and *X*_{−} the subsets of the training data that belong to class +1 and class −1, respectively. When learning with multiple kernels, we are given *S* different feature mappings Φ_{s}, *s* = 1, …, *S*, from the input space to corresponding feature spaces. The dimensionality of these feature spaces is arbitrary, and they can be function spaces in general. We consider the case that each mapping gives a reproducing kernel *k*_{s} such that *k*_{s}(*x*, *x*′) = ⟨Φ_{s}(*x*), Φ_{s}(*x*′)⟩. We denote the inner product in the feature space by ⟨·, ·⟩, omitting the subscript if there is no possible confusion, and we write *K*_{s} for the kernel matrix of *k*_{s} on the training data. Then, denoting the (*S*−1)-simplex by Δ, we aim at finding an appropriate convex combination of kernels, written in kernel matrix form as *K*_{β} = Σ_{s} β_{s}*K*_{s} with β ∈ Δ (equation 2.2), together with parameters (a weight vector and a bias) for a linear classification function. We note that the feature space associated with the combined kernel functions becomes a direct sum of the given feature spaces, where the corresponding feature mapping is formed by stacking the scaled mappings √β_{s}Φ_{s}.
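The convex combination of precomputed kernel matrices described above can be sketched in a few lines. This is an illustrative Python fragment, not code from the letter; the function name is ours:

```python
import numpy as np

def combine_kernels(kernel_mats, beta):
    """Convex combination K_beta = sum_s beta_s * K_s of precomputed kernel matrices,
    with the coefficient vector beta constrained to the probability simplex."""
    beta = np.asarray(beta, dtype=float)
    assert np.all(beta >= 0) and np.isclose(beta.sum(), 1.0), "beta must lie in the simplex"
    return sum(b * K for b, K in zip(beta, kernel_mats))
```

Any classifier that consumes a Gram matrix can then be run on `combine_kernels(...)` unchanged, which is what makes the combination coefficients the natural optimization variable.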

A remarkable characteristic of MKL is the *l*_{1}-norm constraint on the combination coefficients, which usually results in sparse solutions lying on the boundary of Δ. Under appropriate regularity conditions on the classification function, we can apply the representer theorem (Shawe-Taylor & Cristianini, 2004) to express the classifier through weights on the training data. Then, with the combined kernel matrix in equation 2.2, *l*_{1}-norm constrained MKL is generally formulated as an optimization problem (equation 2.4) consisting of an arbitrary convex loss function, a strictly monotonically increasing function regularizing the complexity of the classifier, and a regularization parameter. The kernel combination coefficient β is constrained to lie in the simplex Δ, which is equivalent to an *l*_{1}-norm constraint combined with a positivity constraint. A general result on the convexity of kernel learning of the form in equation 2.4 has been established in Micchelli and Pontil (2005). For the squared loss and the hinge loss, the classifiers behind the optimization problem are known to be kernel regularized least squares and an SVM, respectively. On the other hand, when the objective is the Fisher discriminant criterion defined by the within- and between-class covariance matrices in the feature space, the classifier behind the optimization problem is kernel Fisher discriminant analysis (Mika, Rätsch, Weston, Schölkopf, & Müller, 1999).

We note that conventional MKL methods optimize both the classifier parameters and the kernel combination coefficients β. Since our proposed kernel learning method tailors the distributions of the data in the feature space by optimizing β, we concentrate on optimization with respect to β alone. This is one of the features of the proposed MKL framework: there is no need to perform simultaneous or alternating optimization of the classifier parameters and β.

### 2.2. Learning Kernel Combination Based on Gaussianity.

In FDA, it is desirable that each class-conditional distribution is a gaussian with the same covariance structure in the feature space. That is, we would like the distributions of the mapped data to be as close to gaussian distributions as possible and the covariance matrices of the two classes to be as close to each other as possible. We therefore propose minimizing an objective that combines two terms with a balancing parameter (equation 2.6): *M*_{G}, a distance between a gaussian distribution and the empirical distribution of the observed data in the feature space, and *M*_{V}, a distance between the empirical covariance matrices of the two classes of data in the feature space.

In this framework, the problem is which distance measures to select for *M*_{G} and *M*_{V}. There are some requirements for these measures. For example, they must be easily estimated from the given data, and they must be bounded below. It is also preferable that they be differentiable with respect to β. For *M*_{G}, there are numerous quantities for measuring the gaussianity of the given data. In this letter, we choose a quantity based on the empirical characteristic function, because it can be estimated using only kernel matrices, it is bounded below, and it is differentiable; we discuss other possibilities for *M*_{G} later. For *M*_{V}, there are also many candidates. Here, we simply take natural distance measures based on the empirical characteristic function.

## 3. Technical Preliminary

This section presents technical preliminaries to construct the GMKL framework.

### 3.1. Empirical Characteristic Function.

The characteristic function of a *d*-dimensional random variable *X* is defined as *c*(**t**) = E[exp(i**t**^{⊤}*X*)], where the expectation is taken with respect to *p*_{X}, the probability density function of *X*, and **t** is a *d*-dimensional vector. In this letter, we consider only the case where the distribution has a density function. We note that *c*(**t**) is nothing but a Fourier transformation of the probability density function of *X*. For independent and identically distributed (i.i.d.) samples *x*_{1}, …, *x*_{n} from *p*_{X}, we construct an empirical distribution, and the empirical characteristic function is then defined by ĉ(**t**) = (1/*n*) Σ_{j} exp(i**t**^{⊤}*x*_{j}). The empirical characteristic function has several preferable properties. Specifically, under some general restrictions, ĉ(**t**) converges uniformly almost surely to the population characteristic function *c*(**t**) (Feuerverger & Mureika, 1977). For arbitrary dimension *d*, the convergence of the empirical characteristic function to the characteristic function is proved using the law of large numbers and the Glivenko-Cantelli theorem (Vapnik, 1998). In section 3.2, we define the empirical characteristic function in the feature space, which might be a function space. Even in such cases, the characteristic functions in the feature space completely characterize the probability distributions in that space. Furthermore, under some regularity conditions, the same convergence property as in the finite-dimensional case holds for empirical characteristic functions in infinite-dimensional Hilbert spaces (see Ledoux & Talagrand, 1991, for details). Because of its computational ease and theoretical soundness, the empirical characteristic function has been applied, for example, to goodness-of-fit tests (Koutrouvelis, 1980), tests for the shape of distributions (Murota & Takeuchi, 1981), and ICA (Murata, 2001; Eriksson & Koivunen, 2003).
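The empirical characteristic function defined above is directly computable from samples. The following minimal numpy sketch (our notation, not code from the letter) also illustrates the property exploited later: for gaussian data, the modulus of ĉ(**t**) approaches exp(−**t**^{⊤}Σ**t**/2):

```python
import numpy as np

def empirical_cf(X, t):
    """Empirical characteristic function: (1/n) * sum_j exp(i * t^T x_j).
    X is an (n, d) array of samples; t is a length-d vector."""
    return np.mean(np.exp(1j * X @ t))

# For standard gaussian samples (covariance = identity), |c_hat(t)| should be
# close to exp(-||t||^2 / 2) for a moderate sample size.
rng = np.random.default_rng(0)
X = rng.standard_normal((20000, 2))
t = np.array([0.5, -0.3])
```

Taking the modulus discards the mean-dependent phase, which is why the letter works with the squared modulus when measuring gaussianity.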

Using the following theorem, when we consider the gaussian distribution, we have to calculate only the modulus (i.e., absolute value) of the characteristic function, ignoring its argument:

The proof of the theorem is given in appendix B.

### 3.2. Kernelized Empirical Characteristic Function.

Consider the random variable Φ(*X*) in the feature space obtained by applying a feature map Φ to *X*. Let ℋ* be the dual space of the feature space ℋ. For the random variable Φ(*X*) and an arbitrary element *T* of ℋ*, the characteristic function in the feature space is defined as *c*(*T*) = E[exp(i*T*(Φ(*X*)))]. We are considering the case that the feature space is associated with an inner product; hence, it is a Hilbert space. By the Riesz lemma (Reed & Simon, 1981), for any *T* ∈ ℋ*, there exists some *t* ∈ ℋ such that *T*(*f*) = ⟨*f*, *t*⟩ for any *f* ∈ ℋ. Then the characteristic function of the random variable Φ(*X*) is written as *c*(*t*) = E[exp(i⟨Φ(*X*), *t*⟩)] for *t* ∈ ℋ, and the empirical characteristic function becomes ĉ(*t*) = (1/*n*) Σ_{j} exp(i⟨Φ(*x*_{j}), *t*⟩), where the empirical distribution in the feature space is constructed from Φ(*x*_{1}), …, Φ(*x*_{n}). When the values of two characteristic functions coincide at every point of their domain, the corresponding two random variables have the same distribution; that is, the empirical characteristic function should, in principle, be evaluated at all points. However, to express the empirical characteristic function in the feature space using only values of the kernel function, we consider only points *t* such that there exists *v* in the input space satisfying *t* = Φ(*v*). Some arguments to validate this restriction are presented later. By the reproducing property of the kernel, ⟨Φ(*x*_{j}), Φ(*v*)⟩ = *k*(*v*, *x*_{j}), and, using only the values of the kernel functions, the empirical characteristic function in the feature space is written as ĉ(*v*) = (1/*n*) Σ_{j} exp(i*k*(*v*, *x*_{j})). To simplify notation and to show the dependence on β explicitly, we write the combined kernel value as *k*_{β}(*v*, *x*_{j}) henceforth. The squared modulus of the empirical characteristic function is then

|ĉ(*v*)|² = (1/*n*²) Σ_{j,l} cos(*k*_{β}(*v*, *x*_{j}) − *k*_{β}(*v*, *x*_{l})),   (3.9)

which is expressed through the vector of kernel values *k*_{β}(*v*, *x*_{j}) and an associated matrix. The most notable advantage of using the empirical characteristic function is that it enables us to estimate gaussianity using the values of the kernel function only. Furthermore, though the probability density function of the gaussian distribution involves the inverse of the covariance matrix, we can measure gaussianity without estimating this inverse when we use the empirical characteristic function. For a gaussian distribution in the feature space with mean μ and covariance Σ, the characteristic function is *c*(*t*) = exp(i⟨μ, *t*⟩ − ⟨*t*, Σ*t*⟩/2), and its squared modulus is |*c*(*t*)|² = exp(−⟨*t*, Σ*t*⟩). Hence −log |*c*(*t*)|² = ⟨*t*, Σ*t*⟩ holds (equation 3.11), and the squared modulus of the characteristic function of the gaussian distribution is transformed to a simple quadratic form by the logarithmic transformation. Thus, we also transform the squared modulus of the empirical characteristic function defined in equation 3.9 by −log. By equation 3.11, the resulting quantity is regarded as a projection of the covariance matrix Σ by the vector *t*; hence, we will call these quantities projected variances henceforth.
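The projected variance −log |ĉ|² of equation 3.9 needs only the kernel values between the evaluation point and the data. A minimal sketch (the helper name is ours; `k_v[j]` stands for *k*_{β}(*v*, *x*_{j})):

```python
import numpy as np

def projected_variance(k_v):
    """Projected variance -log |c_hat|^2 at t = Phi(v), computed from kernel
    values k_v[j] = k(v, x_j) via
    |c_hat|^2 = (1/n^2) * sum_{j,l} cos(k_v[j] - k_v[l])."""
    k_v = np.asarray(k_v, dtype=float)
    n = len(k_v)
    sq_mod = np.cos(k_v[:, None] - k_v[None, :]).sum() / n**2
    return -np.log(sq_mod)
```

If all kernel values coincide, |ĉ|² = 1 and the projected variance is 0; spread among the kernel values increases it, mirroring the role of ⟨*t*, Σ*t*⟩ for a gaussian.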

The choice of the point at which to evaluate the empirical characteristic function is a difficult problem. Theoretically, to claim that two distributions are the same, the corresponding two characteristic functions must coincide at all points in the feature space. A test statistic for ICA has been proposed based on a weighted integral of the characteristic function (Eriksson & Koivunen, 2003). Another work claims that, empirically, evaluation at only one point is enough for a test of the shape of distributions (Murota & Takeuchi, 1981). It is natural to use every training data point to evaluate the characteristic function; however, using all of the training data might be computationally inefficient. In this study, to save computational cost, we sample only one point *v* from the training data and take *t* = Φ(*v*) as the evaluation point. In appendix C, we show an empirical evaluation of the effect of this choice on both classification accuracy and the resulting kernel combination parameter.

Given the evaluation point *t* obtained from the data, we define the measure of gaussianity *M*_{G} in our framework of MKL in equation 2.6 in terms of the projected variances, where the projected variances of each class are estimated using only the data in that class. It is easy to see that this measure is bounded below and smooth with respect to β. Figure 2 provides a conceptual diagram of the relationship between the space of probability density functions, the space of characteristic functions, the space of the squared moduli of the characteristic functions, and the space of projected variances.

### 3.3. Sharing Covariance Matrices.

Given the same evaluation point *t*, we define a distance measure *M*_{V} between the empirical covariance matrices of the two classes based on their projected variances. As another variant of this distance measure, we define a quadratic form in β, which is useful for deriving a quadratic approximation of the objective function. We note that, under the positivity constraint on β, the minimizers of *M*_{V} and its quadratic variant are different in general.

## 4. Algorithm Description

### 4.1. Gradient-Based GMKL Algorithm.
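The gradient-based update can be pictured as projected gradient descent over the simplex Δ: take a step along the negative gradient of the objective and project the coefficients back onto Δ. The following Python sketch is illustrative only; `grad` stands in for the gradient of the GMKL objective, which the sketch assumes is supplied:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def ggmkl_step(beta, grad, lr=0.1):
    """One projected-gradient update of the kernel combination coefficients."""
    return project_simplex(beta - lr * grad(beta))
```

Iterating `ggmkl_step` until the coefficients stabilize yields a feasible β at every iteration, since the projection enforces both positivity and the sum-to-one constraint.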

### 4.2. Sequential Quadratic Programming for GMKL.

In SQGMKL, *M*_{G} is defined by equation 3.16, and *M*_{V} is replaced with its quadratic variant. Let the objective function be the sum of these two terms. Then, at each step of the SQGMKL algorithm, we solve a quadratic program iteratively until convergence.

Many researchers have studied kernel optimization methods, including MKLs (Bach, Lanckriet, & Jordan, 2004; Bennett, Momma, & Embrechts, 2002; Bi, Zhang, & Bennett, 2004; Bousquet & Herrmann, 2002; Cristianini et al., 2001; Crammer, Keshet, & Singer, 2002). In these studies, one of the main emphases is on formulating kernel learning as a tractable convex optimization problem. Our gaussianity measure *M*_{G} is not convex in general, but a straightforward calculation verifies that its Hessian matrix is bounded. Since the difference measure of the covariance matrices is a quadratic form with a positive semidefinite symmetric matrix, adding a small positive constant to the diagonal of this matrix makes the quadratic term strictly convex. By the concave-convex procedure (CCCP) theorem (Yuille & Rangarajan, 2003, theorem 1), optimization problem 4.1 becomes a convex minimization problem for a sufficiently large constant, and the sequential quadratic programming method for convex programming is guaranteed to converge to the global optimum. We note that since the matrix is positive semidefinite, the small constant can be chosen arbitrarily; in the experiments in this letter, we set it to a small fixed value. We also note that the quadratic term is minimized when the combination coefficients are equal, so this additional term can be seen as a regularization that avoids too sparse a solution, which might lead to poor generalization capability (Kloft, Brefeld, Laskov, & Sonnenburg, 2008; Yan et al., 2009).

### 4.3. Empirical Kernel Feature Map.

In kernel methods, linear procedures such as PCA, SVM, and FDA are applied in the feature space. In general, the dimensionality of the feature space is very large, or the space may even be a function space. We use the notion of an empirical kernel feature map (Schölkopf et al., 1999; Xiong, Swamy, & Ahmad, 2005) to alleviate the difficulty of handling the distribution of the data in feature spaces and to obtain feature vectors explicitly.

Consider a kernel function *k* and a given set of *n* data points with kernel matrix *K*. We assume rank *K* = *r*, so that *K* can be decomposed as *K* = *U*Λ*U*^{⊤}, where Λ is a diagonal matrix containing the *r* positive eigenvalues of *K* in decreasing order and *U* contains the corresponding eigenvectors. The empirical kernel feature map is defined by

ψ(*x*) = Λ^{−1/2}*U*^{⊤}(*k*(*x*, *x*_{1}), …, *k*(*x*, *x*_{n}))^{⊤}.

Letting *z*_{i} = ψ(*x*_{i}), we obtain ⟨*z*_{i}, *z*_{j}⟩ = *K*_{ij}, and we see that the empirical kernel feature map gives the same result as the kernel method applied to the data in the input space directly. We perform FDA on the transformed data set to obtain a classification axis. That is, in the mapped space, we calculate the mean vectors of the two classes and the between-class and total within-class covariance matrices. Then, by solving the associated generalized eigenvalue problem and taking the first eigenvector, we obtain the classification axis.
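For two classes, the first generalized eigenvector reduces to the familiar direction *S*_{w}^{−1}(*m*_{+} − *m*_{−}). The sketch below (our helper names, not code from the letter) computes the empirical kernel feature map, with the rank chosen by a 90% eigenvalue-mass rule, and the FDA axis in the mapped space:

```python
import numpy as np

def empirical_kernel_map(K, power=0.90):
    """Map training point i to row i of U_r * Lambda_r^{1/2}, where r keeps
    `power` of the total eigenvalue mass; rows then reproduce K as inner products."""
    w, U = np.linalg.eigh(K)
    order = np.argsort(w)[::-1]
    w, U = np.clip(w[order], 0.0, None), U[:, order]
    r = int(np.searchsorted(np.cumsum(w) / w.sum(), power)) + 1
    return U[:, :r] * np.sqrt(w[:r])

def fda_axis(Z, y):
    """Two-class Fisher discriminant axis w = S_w^{-1} (m_+ - m_-) in the mapped space."""
    Zp, Zm = Z[y == 1], Z[y == -1]
    Sw = (Zp - Zp.mean(0)).T @ (Zp - Zp.mean(0)) + (Zm - Zm.mean(0)).T @ (Zm - Zm.mean(0))
    # Tiny ridge for numerical stability when S_w is rank-deficient.
    return np.linalg.solve(Sw + 1e-8 * np.eye(Z.shape[1]), Zp.mean(0) - Zm.mean(0))
```

Projecting the mapped data onto the returned axis gives the one-dimensional representation on which the final linear classification is performed.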

If the kernel function is optimized so that the data distribution of each class is a gaussian with the same covariance matrix, the resulting projection axis by discriminant analysis realizes a Bayes optimal classifier. We note that the result of FDA using the empirical kernel feature map is different from that of conventional KFDA (Mika et al., 1999), because KFDA requires a regularization to avoid rank degeneration. In our experiments, we let the rank *r* of a kernel matrix be the number of eigenvalues containing 90% of the total power.

We also note that, in principle, it is possible to optimize test statistics of gaussianity, such as the Kolmogorov-Smirnov or Shapiro-Wilk test statistics, calculated using the empirical kernel feature map. However, the effects of the kernel combination coefficients on those test statistics can be highly nonlinear and complicated. With the empirical characteristic function, by contrast, we can evaluate gaussianity in the feature space easily and thus perform the optimization with respect to β with ease.

## 5. Experimental Results

Numerical experiments for both artificial data and real-world data are presented in this section.

### 5.1. Illustrative Experiments with Artificial Data.

We first show that the proposed GMKL methods work properly on artificial data. We generated two-dimensional dichotomy data from gaussian distributions with different mean vectors and the same covariance matrix for each class; 500 samples were generated for each class. Five kernel functions — a linear kernel *k*_{1}, RBF kernels, a Bessel kernel (where *J*_{2} is the Bessel function of the first kind), and a polynomial kernel — were prepared for the combination. Figures 3a and 3e show the distribution of the first two dimensions of the empirical kernel map of each class's data, transformed with a uniform initial combination of the five kernels. Under this initial uniform combination, the data distribution in the empirical kernel space is not gaussian, and the two classes are completely mixed. We ran the GGMKL algorithm on the combined kernel matrix to optimize the combination coefficients. Over the iterations of the algorithm, the distribution of the data in the empirical kernel feature space gets closer to a gaussian distribution, and the combination coefficients for all kernels but the first, linear kernel get smaller. Finally, as shown in Figures 3d and 3h, the data distribution of each class becomes gaussian, with a combination coefficient of 1 for the kernel *k*_{1} and 0 for the other four kernels. From this experiment, we see that the GGMKL algorithm is able to optimize the kernel combination parameter so that the distribution of the data becomes gaussian. Finally, Figure 4 shows the *p*-values obtained by the Shapiro-Wilk test for the first two dimensions of the resulting empirical kernel feature map, together with the objective function values over the iterations of the algorithm. The horizontal dashed line shows 0.05, the usual significance level for the test of gaussianity. The objective function value decreased monotonically, and the *p*-values increased as the algorithm progressed. We note that a similar result was obtained by SQGMKL.

### 5.2. IDA Data Set.

We employed the IDA data sets, standard binary classification data sets originally used in Rätsch, Onoda, and Müller (2001). The specifications of the data sets are shown in Table 1. We optimized the kernel function by the GMKL algorithms; the test data were then projected onto the axis found by FDA and classified by a large-margin linear classifier in the same manner as described in Mika et al. (2000). We compared the GMKL algorithms to an equally weighted combination of kernels (uniform), the best single kernel among the candidate kernels, simpleMKL (Rakotomamonjy et al., 2008), SpicyMKL (Suzuki & Tomioka, 2011) with the logit loss function, RMKL (Do et al., 2009), and *l*_{2}MK-FDA (Yan et al., 2009). SimpleMKL is one of the best-known MKL methods, and SpicyMKL is an improved variant that is especially effective when the number of kernel functions is large; publicly available Matlab implementations exist for both. RMKL optimizes the kernel combination coefficients by both maximizing the margin and minimizing the radius of the feature space; its concept is similar to GMKL in the sense that it aims at optimizing the geometry of the feature space. The *l*_{2}MK-FDA is a nonsparse MKL method based on Fisher's discriminant criterion; it optimizes both the kernel combination coefficients and a weight for each datum *x*_{i} using semi-infinite programming (Hettich & Kortanek, 1993), whereas our proposed methods optimize only the kernel combination coefficients.

Table 1: Specifications of the IDA Data Sets.

| Data Name | Input Data Dimensionality | Number of Train Samples | Number of Test Samples | Number of Realizations |
|---|---|---|---|---|
| Banana | 2 | 400 | 4900 | 100 |
| Breast-Cancer | 9 | 200 | 77 | 100 |
| Diabetes | 8 | 468 | 300 | 100 |
| Flare-Solar | 9 | 666 | 400 | 100 |
| German | 20 | 700 | 300 | 100 |
| Heart | 13 | 170 | 100 | 100 |
| Image | 18 | 1300 | 1010 | 20 |
| Ringnorm | 20 | 400 | 7000 | 100 |
| Splice | 60 | 1000 | 2175 | 20 |
| Thyroid | 5 | 140 | 75 | 100 |
| Titanic | 3 | 150 | 2051 | 100 |
| Twonorm | 20 | 400 | 7000 | 100 |
| Waveform | 21 | 1000 | 1000 | 100 |


We implemented all methods except simpleMKL, SpicyMKL, and *l*_{2}MK-FDA in the R language (R Development Core Team, 2010). The quadratic programs in SQGMKL are solved by the interior point method provided by the "kernlab" package (Karatzoglou, Smola, Hornik, & Zeileis, 2004) for R.

The combined kernel functions are the following 20 kernels:

- A linear kernel.
- Gaussian kernels.
- Polynomial kernels.
- Laplace kernels.

For all MKL methods, tuning parameters are chosen from candidate value sets so that the training error is minimized.

From Table 2, we can see that GGMKL and SQGMKL compare favorably with the other methods on many data sets in terms of classification accuracy.

Table 2: Classification Results on the IDA Data Sets (standard deviations in parentheses). GGMKL, SQGMKL, Uniform, and Best Single classify in the empirical kernel feature space; SimpleMKL, SpicyMKL, RMKL, and *l*_{2}MK-FDA are conventional MKLs.

| Data Name | GGMKL | SQGMKL | Uniform | Best Single | SimpleMKL | SpicyMKL | RMKL | *l*_{2}MK-FDA |
|---|---|---|---|---|---|---|---|---|
| Banana | 10.05(0.42) | 12.62(0.96) | 18.12(6.08) | 11.10(0.55) | 11.47(0.60) | 13.17(1.25) | 11.08(0.53) | 11.79(0.53) |
| Breast-Cancer | 31.43(6.88) | 23.06(3.62) | 35.51(6.73) | 28.97(4.31) | 25.77(4.45) | 26.99(4.77) | 26.58(4.10) | 28.60(4.29) |
| Diabetes | 27.29(2.42) | 24.55(1.81) | 28.88(2.59) | 26.78(2.83) | 24.55(3.53) | 24.15(1.68) | 24.55(1.71) | 24.08(1.89) |
| Flare-Solar | 34.88(1.78) | 33.74(1.63) | 37.68(5.19) | 33.69(1.84) | 41.01(7.64) | 35.07(1.76) | 34.35(2.06) | 34.43(1.55) |
| German | 22.37(1.83) | 23.03(2.10) | 27.58(7.40) | 23.98(2.64) | 36.14(6.72) | 27.03(2.09) | 23.71(2.30) | 23.55(2.23) |
| Heart | 14.83(3.27) | 15.83(3.28) | 17.17(3.25) | 16.11(2.92) | 17.05(4.49) | 15.75(3.25) | 16.71(3.17) | 16.81(3.68) |
| Image | 2.47(0.52) | 2.40(0.41) | 3.35(1.50) | 4.20(0.97) | 10.90(1.13) | 3.46(0.66) | 3.87(0.74) | 10.62(0.68) |
| Ringnorm | 2.02(0.47) | 1.59(0.12) | 13.74(20.48) | 1.39(0.31) | 1.53(0.09) | 1.63(1.55) | 1.63(0.12) | 2.67(0.44) |
| Splice | 14.30(1.65) | 13.17(0.76) | 19.24(10.63) | 14.43(1.75) | 16.51(0.70) | 12.7(0.72) | 13.88(0.68) | 15.92(0.87) |
| Thyroid | 3.65(2.12) | 3.44(1.93) | 12.49(10.82) | 5.35(2.53) | 4.64(2.15) | 5.29(2.83) | 4.97(2.33) | 5.17(2.23) |
| Titanic | 22.32(1.12) | 21.92(1.03) | 32.24(2.79) | 30.66(3.58) | 23.20(3.00) | 22.89(3.04) | 22.93(0.99) | 22.72(0.86) |
| Twonorm | 3.83(3.75) | 2.31(0.11) | 2.46(0.15) | 2.43(0.13) | 2.55(0.15) | 2.37(0.11) | 2.49(0.15) | 2.64(0.29) |
| Waveform | 9.44(0.38) | 9.48(0.52) | 11.61(3.89) | 10.36(0.90) | 10.33(2.42) | 13.16(0.60) | 9.75(0.45) | 9.73(0.63) |


Note: The best and second-best results among the methods in this table are shown in boldface.

Computational cost is another important aspect of learning methods. We implemented the proposed methods and RMKL in the R programming language, while the other MKL methods are mainly implemented in Matlab, so a direct comparison is not possible; as a reference, however, we show comparative results on the IDA data sets executed on the same computer in Figure 5.^{1} We do not report the computational costs for GGMKL, RMKL, and *l*_{2}MK-FDA because they are more than five times slower than the slowest of the other methods. Figure 5 shows that the computational efficiency of SQGMKL is superior to that of simpleMKL for many data sets and comparable to that of SpicyMKL when a moderate number of kernels are combined.

## 6. Conclusion and Future Directions

In this study, we proposed optimizing kernel combination coefficients based on gaussianity in the feature space associated with the combined kernel functions. To our knowledge, no previous MKL method is based on the notion of gaussianity in the feature space. Simple implementations of the proposed framework were given based on the empirical characteristic function, which can be estimated using only a given set of kernel matrices and offers a differentiable and bounded optimization objective. Through an experiment with artificial data, we saw that the proposed algorithm maximizes gaussianity in the feature space associated with the learned kernel matrix. Using a number of benchmark data sets, we also showed that the classification performance of the proposed GMKL methods is comparable or superior to that of other conventional MKL methods. One characteristic of the proposed GMKL framework is that the learning of kernel combination coefficients is performed independently of the weights for the individual training data. Intuitively, simultaneous optimization of both sets of parameters should lead to better classification accuracy; however, a Bayes optimal classifier is obtained by FDA if the data distribution is tailored to be gaussian by GMKL, and the classification performance of GMKL is comparable or superior to that of other methods. The proposed method is thus a natural choice for kernel learning when FDA is used after learning the kernel. Application of the proposed GMKL methods followed by FDA to feature selection and visualization remains future work. Although we can obtain the Bayes optimal classifier by FDA under a certain condition on the data distributions, if the means of the two classes are closely located, the Bayes error will be large. In this study, we proposed a basic concept of GMKL with a simple implementation; adding a term that separates the class means would improve classification accuracy at the additional cost of parameter tuning. It is important to investigate how to impose regularization on the proposed GMKL framework to improve class separability.

In this letter, we adopted simple measures of gaussianity and of the difference of covariance matrices, both of which are written in terms of projected variances. There are other possibilities for these measures. For example, gaussianity and the difference of covariance matrices can be measured in terms of the squared modulus of characteristic functions. Furthermore, it is also possible to define *M*_{G} and *M*_{V} using quantities from different spaces. The classification accuracy and algorithmic stability might be improved by finding the optimal combination of measures.

Let the differential entropy of a random variable *X* be *h*(*X*). In information theory, the following entropy power inequality is known for *n* independent *d*-dimensional random variables:
$$
e^{2h(X_1+\cdots+X_n)/d} \;\geq\; \sum_{i=1}^{n} e^{2h(X_i)/d}.
$$
The equality holds if and only if the distributions of all the random variables are gaussians with proportional covariance matrices (Shannon, 1948; Stam, 1959). Letting two random variables *X*_{1} and *X*_{2} correspond to the two classes of data, we can use the difference between the two sides of the inequality as an objective function for gaussianity. The difficulty of this approach is that it requires entropy estimation. However, there are many entropy estimators based on the Euclidean distances between the observed data, and they might be formulated using only kernel function values. In addition to this alternative measure of gaussianity, the exploration of measures of the gap between two covariance matrices other than *M*_{V} is an interesting direction.

In the experiments, we obtained good classification accuracy, and the SQGMKL algorithm worked efficiently with a moderate number of kernel functions. Computational efficiency is one of the benefits of the GMKL formulation, which concentrates only on learning the kernel combination coefficients. In this study, we did not consider combining a huge number of kernels because the motivation for our GMKL framework was not feature selection via sparse representation but the construction of accurate classifiers by tailoring the distribution of the given data in the feature space. Unfortunately, the computational cost of SQGMKL would increase rapidly with the number of kernels to be combined because the method iteratively solves quadratic programs, whereas increasing the number of kernels might improve the ability to tailor the feature space. Hence, the development of a scalable MKL method based on the proposed framework will be explored in our future work.
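The entropy-based criterion discussed above, using the gap in the entropy power inequality as a gaussianity objective estimated from Euclidean distances, can be illustrated numerically. The sketch below is not part of the proposed method; it assumes the Kozachenko-Leonenko 1-nearest-neighbor entropy estimator as one concrete choice of distance-based estimator, and checks that the inequality is (nearly) tight for gaussian inputs.

```python
import numpy as np
from math import log

EULER_GAMMA = 0.5772156649015329

def kl_entropy_1d(x):
    """Kozachenko-Leonenko 1-nearest-neighbor estimator of differential
    entropy (in nats) for a 1-D sample, using only Euclidean distances."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    # 1-NN distance of each point in the sorted sample.
    left = np.full(n, np.inf); left[1:] = x[1:] - x[:-1]
    right = np.full(n, np.inf); right[:-1] = x[1:] - x[:-1]
    eps = np.maximum(np.minimum(left, right), 1e-12)  # guard against ties
    # d = 1, so the unit-ball volume is V_1 = 2.
    return np.mean(np.log(eps)) + log(2.0) + EULER_GAMMA + log(n - 1)

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(0.0, 1.0, n)
x2 = rng.normal(0.0, 1.0, n)

# Entropy power N(X) = exp(2 h(X)) in one dimension; the entropy power
# inequality states N(X1 + X2) >= N(X1) + N(X2), with equality for
# independent gaussians (proportional variances).
N = lambda h: np.exp(2.0 * h)
lhs = N(kl_entropy_1d(x1 + x2))
rhs = N(kl_entropy_1d(x1)) + N(kl_entropy_1d(x2))
print(lhs, rhs)  # close to each other for gaussian inputs
```

For non-gaussian inputs (e.g., uniform or exponential samples), the gap between the two sides opens up, which is what makes the gap usable as a gaussianity objective.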

## Appendix A: Bayes Optimality of the Fisher Linear Discriminant

We show the following well-known fact for a two-class discriminant problem for the sake of completeness:

*If the distributions of data in classes C_{+1} and C_{−1} are gaussians with means μ_{+1} and μ_{−1} and the same covariance matrix Σ, then Fisher discriminant analysis finds the best linear projection in the sense of Bayes error minimization.*

To prove the theorem, we recall that the linear projection vector **w** obtained by minimizing Fisher's criterion is given by **w** ∝ Σ_{W}^{−1}(μ_{+1} − μ_{−1}), where Σ_{W} and Σ_{B} are the within-class and between-class covariance matrices, respectively. By the assumption of the theorem, Σ_{W} = Σ and Σ_{B} = (μ_{+1} − μ_{−1})(μ_{+1} − μ_{−1})^{⊤}. Hence **w** ∝ Σ^{−1}(μ_{+1} − μ_{−1}), which is exactly the projection direction of the Bayes optimal classifier for two gaussians sharing the covariance matrix Σ; thresholding this projection therefore attains the Bayes error.
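The coincidence of the two directions can be checked numerically. The following is an illustrative sketch only (the covariance matrix, means, and sample sizes are arbitrary choices): the Fisher direction estimated from samples of two gaussian classes with a shared covariance is compared against the Bayes optimal direction Σ^{−1}(μ_{+1} − μ_{−1}).

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 20000

# Two gaussian classes sharing the covariance matrix Sigma.
A = rng.normal(size=(d, d))
Sigma = 0.5 * A @ A.T + np.eye(d)
mu_p = np.array([1.0, 0.5, 0.0, -0.5, 1.0])
mu_m = -mu_p

L = np.linalg.cholesky(Sigma)
Xp = mu_p + rng.normal(size=(n, d)) @ L.T
Xm = mu_m + rng.normal(size=(n, d)) @ L.T

# Fisher direction estimated from the samples: w ∝ S_W^{-1}(m_+ - m_-).
Sw = 0.5 * (np.cov(Xp, rowvar=False) + np.cov(Xm, rowvar=False))
w_fda = np.linalg.solve(Sw, Xp.mean(axis=0) - Xm.mean(axis=0))

# Bayes optimal direction for shared-covariance gaussians.
w_bayes = np.linalg.solve(Sigma, mu_p - mu_m)

cos = abs(w_fda @ w_bayes) / (np.linalg.norm(w_fda) * np.linalg.norm(w_bayes))
print(cos)  # close to 1: the two directions coincide up to sampling noise
```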

## Appendix B: Proof of Theorem 1

The necessity is obvious. Conversely, for a positive semidefinite kernel matrix, suppose that the gaussianity measure vanishes; this means that |*c*(**t**)|^{2} coincides with the characteristic function of a gaussian distribution. Note that |*c*(**t**)|^{2} is nothing but the characteristic function of the difference of two independent random variables with a common distribution, and it is decomposed into the two factors *c*(**t**) and *c*(−**t**). Cramér's theorem claims that if the characteristic function of a gaussian distribution is factorized into characteristic functions, then each factor is also the characteristic function of a gaussian distribution (Lukacs, 1960). Hence *c*(**t**) is itself the characteristic function of a gaussian distribution, and the sufficiency of theorem 1 is proved.

## Appendix C: Effect of Evaluation Point in Empirical Characteristic Function

We show an empirical evaluation of the effect of the evaluation point in the empirical characteristic function. As explained in section 3.2, we reserve one point in the given data set as the evaluation point of the empirical characteristic function. Using the first realization of each of the IDA data sets, we conduct a classification experiment with different evaluation points taken from the given data set, in the same setting as section 5.2. The mean and standard deviation of the classification accuracies are shown in the upper-left panel of Figure 6, and the means of the obtained kernel combination coefficients in each element, with one-standard-deviation error bars, are shown in the other panels of Figure 6. From these results, we conclude that the selection of the evaluation point from the given data does not affect the result much.
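As a rough sketch, and not the estimator defined in section 3.2 (the Gaussian kernel, the choice of the first data point as the evaluation point, and all names below are assumptions for illustration), the following shows how a characteristic-function value at an evaluation point taken from the data can be computed from kernel values alone: if the evaluation point is the feature image of a reserved data point x_0, the required inner products reduce to one row of the kernel matrix.

```python
import numpy as np

def gauss_kernel(X, Y, s=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * s * s))

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))

# Reserve the first point: with t0 = phi(x0), the inner product
# <t0, phi(x_j)> equals k(x0, x_j), so the empirical characteristic
# function in the feature space needs only one row of the kernel matrix.
x0, rest = X[:1], X[1:]
k_row = gauss_kernel(x0, rest)[0]
c_hat = np.mean(np.exp(1j * k_row))
print(abs(c_hat))  # the modulus of an empirical cf is at most 1
```

The boundedness of |ĉ| is what makes the resulting objective bounded, as noted in the conclusion.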

## Appendix D: Gradient and Hessian of Objective Function

We give the vector *c*_{t} and the matrix *H*_{t} in equation 4.3, which are the gradient vector and the Hessian matrix of the objective function evaluated at the current iterate of the kernel combination coefficients. Explicit forms of *c*_{t} and *H*_{t} are obtained by direct differentiation of the objective function.
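The role that a gradient *c*_{t} and a Hessian *H*_{t} play in a sequential quadratic update can be sketched generically. Everything below is illustrative only: the toy quadratic objective and the crude clip-and-renormalize step standing in for constraint handling are assumptions, not the actual objective or update rule of equation 4.3.

```python
import numpy as np

# Toy smooth objective over kernel combination coefficients beta.
A = np.array([[2.0, 0.3],
              [0.3, 1.0]])
b = np.array([1.0, 2.0])
f = lambda beta: 0.5 * beta @ A @ beta - b @ beta
grad = lambda beta: A @ beta - b   # plays the role of c_t
hess = lambda beta: A              # plays the role of H_t

beta = np.array([0.5, 0.5])
for _ in range(20):
    # Newton direction from the current gradient and Hessian.
    step = np.linalg.solve(hess(beta), grad(beta))
    beta = beta - step
    # Crude renormalization to keep a convex combination (a stand-in for
    # whatever constraint handling the actual algorithm uses).
    beta = np.clip(beta, 0.0, None)
    beta = beta / beta.sum()
print(beta, f(beta))
```

Each iteration solves a linear system in *H*_{t}, which is the source of the per-iteration cost that grows with the number of combined kernels.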

_{t}## Acknowledgments

We are grateful to A. Rakotomamonjy, Y. Grandvalet, F. Bach, and S. Canu for providing the simpleMKL Matlab implementation and to T. Suzuki and R. Tomioka for providing the SpicyMKL Matlab implementation. We are also grateful to F. Yan for providing the *l _{p}* MK-FDA Matlab implementation. Parts of the experiments were done with the help of T. Aritake. We express our special thanks to the editor and reviewers, whose comments led to valuable improvements of the manuscript. Part of this work was supported by JSPS Grant-in-Aid for Research Activity Start-up No. 22800067.

## Note

^{1} All numerical experiments in this letter were run on an Intel machine with dual 2.93 GHz processors and 8 GB of memory. The operating system is Mac OS X version 10.6.8.