Abstract

Multiple kernel learning (MKL) partially solves the kernel selection problem in support vector machines and similar classifiers by minimizing the empirical risk over a subset of linear combinations of given kernel matrices. For large sample sets, the size of the kernel matrices becomes a numerical issue. In many cases, the kernel matrix is of low efficient rank. However, this low-rank property is not efficiently utilized in MKL algorithms. Here, we suggest multiple spectral kernel learning, which efficiently uses the low-rank property by finding a kernel matrix from a set of Gram matrices of a few eigenvectors from all given kernel matrices, called a spectral kernel set. We provide a new bound for the gaussian complexity of the proposed kernel set, which depends on both the geometry of the kernel set and the number of Gram matrices. This characterization of the complexity implies that in an MKL setting, adding more kernels may not monotonically increase the complexity, while previous bounds show otherwise.

1.  Introduction

Kernel methods such as support vector machines (SVMs) usually perform well in many prediction problems (Steinwart & Christmann, 2008); however, the performance of the kernel methods heavily depends on the choice of kernel function, which is left to the practitioners (Lanckriet, Cristianini, Bartlett, Ghaoui, & Jordan, 2004; Vapnik & Chapelle, 2000; Chapelle, Vapnik, Bousquet, & Mukherjee, 2002).

A common approach for selecting a suitable kernel function is through cross-validation or other resampling techniques (Schölkopf & Smola, 2002) over a set of candidate kernel functions. The selection consists of training a kernel machine, such as a support vector machine (SVM), with a particular kernel using a subset of the training samples and validating the model by measuring the empirical risk over the rest of the training samples (called a validation set). This procedure is repeated a number of times. The kernel function with the best average performance on the validation sets is selected (Steinwart & Christmann, 2008; Chapelle et al., 2002). The computational load in this approach increases with the size of the candidate set and the number of training samples, which clearly limits the applicability of such approaches.
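
As a concrete illustration of this selection procedure, the following sketch runs a precomputed-kernel SVM over a few candidate RBF widths and keeps the width with the best cross-validated accuracy. The synthetic data, the candidate widths, and the use of scikit-learn are illustrative assumptions, not the protocol of the cited works.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import KFold

def rbf_gram(X, width):
    # Gram matrix of the RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 width^2))
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * width ** 2))

def cv_score(K, y, n_splits=5):
    # Average validation accuracy of an SVM using the precomputed kernel K
    scores = []
    for tr, va in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(y):
        clf = SVC(kernel="precomputed", C=1.0).fit(K[np.ix_(tr, tr)], y[tr])
        scores.append(clf.score(K[np.ix_(va, tr)], y[va]))
    return np.mean(scores)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=200))       # synthetic labels
widths = [0.5, 1.0, 2.0, 4.0]                           # hypothetical candidate kernels
best = max(widths, key=lambda w: cv_score(rbf_gram(X, w), y))
print("selected kernel width:", best)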

A different framework for kernel selection, multiple kernel learning (MKL), searches for a kernel matrix (function) in the convex hull of kernel functions (matrices)—defined by linear combinations of the given kernel matrices—that minimizes the empirical risk of the plugged-in kernel matrix (Lanckriet et al., 2004; Rakotomamonjy, Bach, Canu, & Grandvalet, 2008; Sonnenburg, Rätsch, Schäfer, & Schölkopf, 2006; Hino, Reyhani, & Murata, 2012). The given set of original kernel functions or matrices is called a kernel base. It is common to restrict the space of linear combinations so that the coefficients belong to the ℓ1-norm simplex, the concern in this letter, whereas some work suggests ℓp-norm constraints (see Kloft, Brefeld, Sonnenburg, & Zien, 2010). In addition, it is usually assumed that a set of kernel matrices for a fixed sample set is provided. This renders the setup more convenient for algorithm design. Other views suggest optimizing criteria other than the empirical risk, such as maximum entropy, kernel Fisher discriminant analysis performance (Hino, Reyhani, & Murata, 2010), or maximum gaussianity (Hino et al., 2012), to find the kernel combination.

Lanckriet et al. (2004) cast linear MKL in terms of semidefinite programming (SDP) for the class of positive-definite linear combinations of kernel bases with bounded trace. The main issue with the SDP formulation of MKL is the inefficiency of SDP solvers for large kernel matrices (Sonnenburg et al., 2006; Rakotomamonjy et al., 2008). This limitation has stimulated further research on the numerical issues involved in MKL. One of the main directions is to use the differentiability of the dual of the penalized empirical risk minimization (ERM) at the maximal point, which usually leads to an alternating-optimization algorithm. In the alternating optimization, the inner optimization (i.e., the dual of the ERM) is solved first, while the outer optimization variables (i.e., mixture coefficients) are fixed, and vice versa, until some convergence is reached. In addition, the class of kernel matrices is usually reduced to the convex hull of the kernel bases compared to the original MKL in Lanckriet et al. (2004). Semi-infinite linear programming (SILP; Sonnenburg et al., 2006), the extended-level method (Xu, Jin, King, & Lyu, 2009), simpleMKL (Rakotomamonjy et al., 2008), and alternating gradient descent (Bousquet & Herrmann, 2003) are different implementations of this framework. These methods can be used for most of the important loss functions. Another direction has been to restrict the loss function to the ℓ2-loss and change the set of linear coefficients from the simplex to the positive subset of the ℓ2 ball. This reduction leads to alternating steps between two closed forms: one for estimating the mixing coefficients and the other for the ℓ2-SVM parameters (Cortes, Mohri, & Rostamizadeh, 2009).

A common issue with all of these methods for MKL is the need to solve an iterative optimization whose size grows with both n and L, where L>1 is the number of kernel matrices in the base set. Here, n is the number of training samples in an inductive setting or of the training plus test sets in a transductive setting. Thus, regardless of the algorithmic efficiencies, the memory requirement grows very fast, leading to a significant increase in computational load, especially when both n and L are large. This naturally limits the use of MKL methods in large sample settings.

A slightly different framework for kernel learning is to adjust the eigenvalues of the given kernel matrix so that the empirical risk is minimized. In other words, the kernel base here is a set of rank 1 kernels produced by the outer product of eigenvectors of given (single or multiple) kernel matrices.

This type of kernel learning has been studied by Bousquet and Herrmann (2003), Lanckriet et al. (2004), and Bach (2008); however, their approaches end up with block-coordinate optimization (Bousquet & Herrmann, 2003; Cortes et al., 2009) or cone programming (Lanckriet et al., 2004). In particular, Bach (2008) suggests hierarchical MKL (HMKL), which assumes that the eigenfunctions of the kernels are provided analytically. HMKL constructs a set of rank 1 kernels by evaluating these eigenfunctions on all possible subsets of the input feature space. It finds the best mixing by first restricting the set of candidate kernels and then finding the best coefficients using other MKL methods such as SILP. HMKL is more about kernel learning for feature selection, and the main part of the algorithm is to select a subset of rank 1 kernels using a greedy algorithm. In this letter, we consider a similar setup with more of a focus on kernel learning than feature selection and propose an efficient and simpler optimization that improves the scalability in both memory and numerical computation. In addition, we do not assume that the bases' eigenfunctions are provided.

Another direction in MKL research is to find tight upper bounds for the complexity of the hypothesis set generated by a linear combination of kernels. Most recent work (e.g., Cortes, Mohri, & Rostamizadeh, 2010; Srebro & Ben-David, 2006) presents bounds characterized by the number of kernel bases. These results imply that introducing any new kernel matrix to the base set would necessarily increase the complexity, no matter how close the added kernel is to the existing ones. The second part of this letter partially addresses this issue with previous bounds.

This letter is organized as follows. A definition of linear MKL is given in section 2. Section 3 outlines our proposed method for MKL using low-rank approximation. To extend the proposed MKL framework to the inductive setting, we also suggest using the Nyström extension, which is explained in section 4. The gaussian complexity of the proposed kernel set is presented in section 5. The empirical results are given in section 6, followed by a few concluding remarks.

2.  Linear MKL

Let us assume that the set S contains independent and identically distributed samples (xi, yi) for i = 1, …, n, where xi takes values in an input space X and yi takes values in a subset of the real line. Also assume that the responses are generated by some unknown smooth function g and that there is a nonlinear mapping Φ, called a feature map, from X into some Hilbert space H. The inner product of H, denoted by ⟨·,·⟩, can be computed using a bounded kernel function k, so that k(x, x′) = ⟨Φ(x), Φ(x′)⟩ for x, x′ ∈ X. A kernel k defines a unique Hilbert space (Steinwart & Christmann, 2008). Here, for technical reasons, we assume that for each kernel, there exists at least one feature map, such as the canonical feature map (Steinwart & Christmann, 2008). We define the kernel matrix K by
formula
For simplicity, we use the notations K and k interchangeably.
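
The following small check, in Python, illustrates the relation between a kernel and its feature map and the construction of the kernel matrix K; the degree-2 polynomial kernel and the random data are chosen only for illustration and are not part of the setup above.

import numpy as np

def poly2_kernel(x, z):
    # k(x, z) = (x . z)^2 for x, z in R^2
    return float(np.dot(x, z)) ** 2

def poly2_feature_map(x):
    # An explicit feature map for this kernel: (x1^2, x2^2, sqrt(2) x1 x2)
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
assert np.isclose(poly2_kernel(x, z), poly2_feature_map(x) @ poly2_feature_map(z))

X = np.random.default_rng(1).normal(size=(4, 2))
K = np.array([[poly2_kernel(xi, xj) for xj in X] for xi in X])   # K = [k(x_i, x_j)]
print(K.shape, np.allclose(K, K.T))                              # (4, 4) True
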
The main goal in the learning problem is to estimate g by a function f, given a sample set S, such that the prediction performance on both the training samples and test samples is sufficiently high. The performance is usually measured by some loss function, which measures the discrepancy between the true value and the prediction. We assume that f belongs to some set of functions F so that the true prediction function g can be well approximated by members of F. Here, we assume F contains linear functions in the feature space. A general framework for learning is through empirical risk minimization, which can be formulated as
formula
2.1
where ℓ is a loss function, for example, the hinge loss ℓ(y, t) = max(0, 1 − yt). The scalar λ > 0 is the penalization parameter, ‖·‖ denotes the norm in the space H, and the design matrix contains the feature vectors Φ(xi) for i = 1, …, n. Note that we can rewrite equation 2.1 as
formula
where is some user-defined parameter and denotes the empirical loss.
Let us assume that a set of kernel functions or matrices k1, …, kL is given. We call the ki kernel bases. We consider the convex hull of the kernel bases,
formula
2.2
where the simplex is defined as the set of coefficient vectors with nonnegative entries summing to 1. Kernel learning over this kernel set is defined by minimizing the empirical risk with respect to the positive-definite kernel matrix,
formula
2.3
which is
formula
A common approach to solving the optimization problem, equation 2.3, is to alternate between the minimization over the mixing coefficients and the minimization over the classifier parameters, keeping the parameters of the other fixed (for details, see Sonnenburg et al., 2006; Rakotomamonjy et al., 2008; Xu et al., 2009). The memory requirement for the outer optimization grows with both the number of kernels and the square of the number of samples, which could become a barrier in large-scale problems. Also, every other alternation consists of solving one round of, for example, an SVM for the hinge loss function.
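
To make the alternating structure concrete, the sketch below performs one such alternation for the hinge loss: it forms the combined kernel for fixed mixing weights, solves the SVM dual, and then takes a projected gradient step on the weights over the simplex. The step size, the simplex projection, and the use of scikit-learn's SVC are simplifying assumptions; the cited methods use more refined update and stopping rules.

import numpy as np
from sklearn.svm import SVC

def simplex_projection(v):
    # Euclidean projection of v onto the probability simplex
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def mkl_alternation(Ks, y, theta, C=1.0, step=1e-3):
    K = sum(t * Km for t, Km in zip(theta, Ks))         # combined kernel for fixed theta
    clf = SVC(kernel="precomputed", C=C).fit(K, y)      # inner problem: SVM with the hinge loss
    sv, beta = clf.support_, clf.dual_coef_.ravel()     # beta_i = y_i * alpha_i on support vectors
    # gradient of the optimal dual value with respect to each mixing coefficient
    grad = np.array([-0.5 * beta @ Km[np.ix_(sv, sv)] @ beta for Km in Ks])
    return simplex_projection(theta - step * grad), clf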

3.  Multiple Spectral Kernel Learning

A common phenomenon observed in many large-size kernels is that these matrices are of low efficient rank (Drineas & Mahoney, 2005); that is, a small subset of eigenvalues is large and distant, while the rest have small and similar values. This property has been utilized for improving the scalability of many learning methods and for efficient memory representation of the kernel matrices (Drineas & Mahoney, 2005; Talwalkar & Rostamizadeh, 2010). Thus, it is reasonable to utilize the low-rank assumption of the given kernel matrices to improve the numerical efficiency in the MKL framework.

Let us define the multiple spectral kernel set, which is basically the set of rank 1 kernels, by
formula
3.1
where the set of vectors used to form the rank 1 kernels is called the spectral dictionary, or dictionary for short.

The dictionary can be constructed from a small set of eigenvectors extracted from each given kernel matrix. We can select some or all of the eigenvectors with large eigenvalues from each kernel matrix. This can be performed in a similar way as in principal component analysis (PCA; Schölkopf & Smola, 2002), where we start selecting eigenvectors by taking those with large eigenvalues and stop when the eigenvalues become small and similar to each other. In this way, we basically take the components that are involved in the low-rank approximation of the kernel matrices without carrying the eigenvalue information. Note that the elements of the dictionary are not necessarily orthogonal to each other.
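
A possible construction of the dictionary, following the description above, is sketched below: the leading eigenvectors of each base kernel matrix are stacked as columns, the eigenvalue information is dropped, and the number of eigenvectors per kernel is a user choice (here a hypothetical default of 20).

import numpy as np

def spectral_dictionary(kernel_matrices, r=20):
    # Stack the top-r eigenvectors of each base kernel matrix as columns of U
    columns = []
    for K in kernel_matrices:
        w, V = np.linalg.eigh(K)             # eigenvalues in ascending order
        columns.append(V[:, -r:][:, ::-1])   # keep the r largest, largest first
    return np.hstack(columns)

# Each column u of U defines one rank 1 element u u^T of the spectral kernel set.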

Another way to construct vectors is to take a set of orthonormal basis functions (e.g., Hermite polynomials) with sufficient smoothness and construct basis vectors by evaluating those functions on the sample points.

Kernel learning over the spectral kernel class takes the form
formula
3.2
which we call multiple spectral kernel learning (multiple SKL).

Theorems 1 and 2 present simple optimization problems for multiple SKL with the ℓ2-loss and with a more general loss function, aiming to solve problem 3.2 in one round (i.e., with no alternating optimization and a significantly smaller memory requirement).

Theorem 1. 
Under the assumptions presented in section 2, for the ℓ2-loss, the optimization 3.2 can be approximated by
formula
3.3
where the vector contains the label information. The primal variables can be recovered from the value of the dual variables at the optimum.

Thus, instead of an alternating least-squares-type solution, as appears in Cortes et al. (2009), we end up with a basis pursuit (Chen, Donoho, & Saunders, 1998) approximation that solves the kernel learning problem in one round of linear programming. The proof is provided in the appendix.
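
For illustration, the sketch below solves a textbook basis pursuit problem, min ||theta||_1 subject to U theta = y, by linear programming with the split theta = p - q, p, q >= 0. This is only a stand-in for equation 3.3, whose exact constraint set (and any sign or simplex restrictions on the coefficients) is the one given in the theorem; the random dictionary U and the label vector y are placeholders.

import numpy as np
from scipy.optimize import linprog

def basis_pursuit(U, y):
    # min ||theta||_1  s.t.  U theta = y, written as an LP in (p, q) with theta = p - q
    n, m = U.shape
    c = np.ones(2 * m)                                   # sum of p_i + q_i equals ||theta||_1
    res = linprog(c, A_eq=np.hstack([U, -U]), b_eq=y,
                  bounds=[(0, None)] * (2 * m), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:m] - res.x[m:]

rng = np.random.default_rng(0)
U = rng.normal(size=(30, 50))                            # placeholder dictionary
y = np.sign(rng.normal(size=30))                         # placeholder label vector
theta = basis_pursuit(U, y)
print("nonzero coefficients:", int(np.sum(np.abs(theta) > 1e-8)))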

Remark 1. 
The constraint term in equation 3.3 is well defined if the vector is in the range of matrix . Otherwise, we can extend the dictionary by adding coordinate vectors to . Equivalently, we can rewrite the minimization in equation 3.3 as
formula
where . The preceding minimization is equivalent to
formula
3.4
The primal values can be recovered through and , where , , and is the value of equation 3.4 at the optimum. Then the solution becomes
formula

The extension of theorem 1 to a general loss function is as follows.

Theorem 2. 
Under the assumption presented in section 2 for a bounded and convex loss function , the multiple SKL optimization, equation 3.2, reduces to
formula
3.5
where is defined in equation 2.1.

The proof is provided in the appendix.

Using the above results, we can achieve a significant decrease in memory and computational load compared to MKL algorithms such as hierarchical MKL (Bach, 2008) or simpleMKL (Rakotomamonjy et al., 2008). The spectral decomposition of the kernel matrices is an expensive computational task. However, it happens only once, and we need only the top eigenvectors of the kernel matrices, which can be efficiently computed using power methods, such as the Lanczos method (see Bhatia, 1997, for more details).
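
A short sketch of this step with a Lanczos-type solver (scipy.sparse.linalg.eigsh); the synthetic low-rank matrix below is only a stand-in for a real kernel matrix.

import numpy as np
from scipy.sparse.linalg import eigsh

def top_eigenvectors(K, r=20):
    # Top-r eigenvectors of a symmetric kernel matrix via Lanczos iterations (requires r < n)
    w, V = eigsh(K, k=r, which="LA")         # "LA" = largest algebraic eigenvalues
    order = np.argsort(w)[::-1]
    return V[:, order]

A = np.random.default_rng(0).normal(size=(500, 40))
K = A @ A.T                                  # synthetic positive semidefinite matrix of low rank
U = top_eigenvectors(K, r=20)
print(U.shape)                               # (500, 20)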

It is natural to study the differences between the solution of SVM with the original kernel matrix and the one with its low-rank approximation. For readability, we postpone the presentation of this study to the appendix. The study shows that the difference between solutions of SVMs depends on the magnitude of the largest eigenvalue that is not included in low-rank approximation. Thus, the approximation does not significantly change the solution of SVMs for kernels with efficient low rank.

4.  Application of the Nyström Extension for an Inductive Multiple SKL

The formulations presented in section 3 apply only to the transductive setting where both training and test sets are available at the time of learning. To use the results of multiple SKL for classifying new test samples, access to entries of eigenvectors at those points is crucial. A naive approach to resolve this issue is to add the test samples to each of the original kernel matrices and then take the eigenvalue decomposition of the extended matrices. However, this solution gradually becomes impractical as the number of test samples increases. Instead, we suggest using the Nyström extension (Drineas & Mahoney, 2005; Talwalkar & Rostamizadeh, 2010), which provides an approximate eigenvector extension of a bounded kernel matrix.

For a sample set, let us assume that the eigenvectors of the kernel matrix K for some kernel function k are available. Let us consider the integral operator Tk that maps a function f to the integral of k(·, x)f(x) with respect to a probability measure P. The eigenvectors and eigenvalues of K are basically an approximation of the eigenfunctions and eigenvalues of Tk for the eigenvalue problem Tkφ = λφ. Here, P is a probability measure on the input space, and λ denotes an eigenvalue of the linear operator Tk.

The Nyström extension approximates the eigenfunction at the point by extending the eigenvector to by
formula
where and are the solution to .

Thus, for any test sample, we first extend the eigenvectors with nonzero entries in the mixing vector, generate the mixture using the learned coefficients and the extended eigenvectors, and then evaluate the learned classifier on the extended matrix.
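
A minimal sketch of the extension step, assuming the simple scaling "extended entry = cross-kernel row times eigenvector, divided by the eigenvalue"; normalization conventions for the Nyström extension differ across references, so the constant factor should be adapted to the one used in the text.

import numpy as np

def nystrom_extend(K_test_train, u, lam):
    # Approximate entries of the eigenvector u (eigenvalue lam > 0) at the test points
    return K_test_train @ u / lam

# Usage: extend each dictionary vector with a nonzero mixing weight, rebuild the learned
# rank 1 combination on the test block, and evaluate the trained classifier there.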

5.  Error Bound and Complexity Computations for the Multiple SKL

In prediction problems, it is natural to ask about an upper bound for the error of prediction for a test point, called generalization error or risk. In a common characterization, it is shown that risk can be controlled by the empirical risk and the complexity of the hypothesis set. The hypothesis set is the collection of functions that construct the search space in empirical risk minimization (e.g., the set in section 2). We next recall some definitions and a generalization bound, which relates the error to the complexity of the function classes.

Definition 1 
(Rademacher and gaussian complexity; Bartlett & Mendelson, 2003).  Let us assume that x1, …, xn are independent and identically distributed random vectors and that σ1, …, σn are independent random variables taking the values −1 and +1 with probability 1/2 each. Also, we assume that g1, …, gn are independent standard normal random variables. Both the gi and the σi are assumed to be independent of the samples. σi is called a Rademacher random variable. For a set of functions, the Rademacher and gaussian complexities are defined as
formula
respectively.

These two measures of complexity are related to each other:

Theorem 3 

(Tomczak-Jaegermann, 1989).  There are absolute constants c and C such that for every class and every integer n, .

Theorem 4 
(Bartlett & Mendelson, 2003, corollary 15 and Theorem 8).  Consider a loss function and a function that dominates the loss function , that is, and , . We assume that is a subset of . Let be a class of functions , and let , where are independent copies of random vector . Then for any integer n and any , with probability of at least , the following holds:
formula
where is the Lipschitz norm of the loss function and denotes the gaussian complexity of the hypothesis set .
Bartlett and Mendelson (2003) present a similar result where the gaussian complexity is replaced by the Rademacher complexity, under the assumption of theorem 4:
formula
5.1
Both the Rademacher and gaussian complexities measure the supremum of the correlation between any function from the class and pure independent noise, described as either independent normal or Rademacher random variables. Usually it is hard to compute the values of these complexities. In some special cases, however, there are ways to find an upper bound for the so-called empirical gaussian complexity and empirical Rademacher complexity, defined by
formula
and then connect these empirical quantities to the gaussian and Rademacher complexities, respectively, using a concentration inequality. (For a comprehensive treatment of the subject, see, for example, Bartlett & Mendelson, 2003.)
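
For a finite function class represented by its values on the sample, these empirical quantities can be estimated by Monte Carlo, as in the sketch below. The 2/n normalization and the absolute value follow the convention of Bartlett and Mendelson (2003); other conventions differ by constant factors.

import numpy as np

def empirical_complexities(F, n_draws=2000, seed=0):
    # F is an (m, n) array whose rows are (f(x_1), ..., f(x_n)) for each candidate f
    rng = np.random.default_rng(seed)
    m, n = F.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))    # Rademacher noise
    g = rng.normal(size=(n_draws, n))                     # gaussian noise
    rademacher = np.mean(np.max(np.abs(sigma @ F.T), axis=1)) * 2.0 / n
    gaussian = np.mean(np.max(np.abs(g @ F.T), axis=1)) * 2.0 / n
    return rademacher, gaussian
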
Let us denote the hypothesis set in the general MKL setup for some kernel set by , which is defined by . By the representer theorem (Steinwart & Christmann, 2008), the solution of the ERM, equation 2.1, for the kernel function k, takes the form having , where and . Therefore, the set can equivalently be written as . For the empirical gaussian complexity of , we have
formula
5.2
formula
5.3
where Z is the vector of independent standard normal variables and the feature map corresponds to some feature space for the kernel K. A similar result holds for the empirical Rademacher complexity, where the normal vector Z is replaced by a vector of Rademacher variables. (A comprehensive comparison of previously derived bounds for the Rademacher complexity of MKL can be found in Cortes et al., 2010.) Cortes et al. (2010) also provide a tight bound, stated below, which significantly improves the previous results.
Theorem 5 
(Cortes et al., 2010).  The empirical Rademacher complexity of the hypothesis set can be bounded as follows:
formula
5.4
where , , and for .
Note that for the spectral kernel set, the relevant trace quantity in equation 5.4 equals 1 for each rank 1 base; therefore, by equation 5.4 and some straightforward computation for finding the best r, the empirical Rademacher complexity for the spectral kernel set is
formula
5.5
where the ceiling of x denotes the smallest integer greater than or equal to x. In the next section, we provide a bound for the spectral kernel set, which shows the dependency of the gaussian complexity on the geometry of the kernel set.

5.1.  A Geometrical Bound for Gaussian Complexity.

The right-hand side of equation 5.5 depends only on the number of kernel bases, or the size of the dictionary in the case of a spectral class, no matter how similar the bases are to each other. In the case of a spectral kernel set, the kernels are of rank 1, and there is no trace or eigenvalue information; therefore, the relation between the elements of the dictionary and the complexity is lost.

This missing connection between the dictionary and the complexity bound in theorem 5 motivates us to derive a new bound for the empirical gaussian complexity. There we are able to use Slepian's lemma (see appendix A.3) in order to bring the geometry of the dictionary into the complexity's upper bound. The results are summarized in the following theorem.

Theorem 6. 
Let us consider the spectral kernel set defined in equation 3.1. Then, for sufficiently large L, we have
formula
5.6
where and is the empirical gaussian complexity for the hypothesis set .
In addition, by assuming that , we have
formula
5.7
where C>0 is a constant.

In theorem 6, the bound depends on the geometry of the dictionary. The proof relies on the decoupling technique (Levina & Vershynin, 2011; De la Peña & Giné, 1999) after centering the gaussian term. For centering, we need the expected value of the gaussian quadratic term, which can be computed easily. The gaussian chaos term can be split into a nested maximal form. Both maximal forms may be computed using the maximal inequality for gaussian random variables and Slepian's lemma. More details about the proof are provided in section A.3. The results in theorem 6 can be extended to general MKL over full-rank kernel matrices.

The bound in equation 5.6 shows the dependency of the complexity on the logarithm of the number of kernel bases, in a similar way to the empirical Rademacher complexity bound presented in equation 5.5. However, we obtain a larger constant, which is to be expected from theorem 3.

In addition, equation 5.7 relates the geometry of the dictionary to the empirical complexity via a geometric term. The geometric term measures the similarity between the bases and achieves its maximum when at least two orthogonal vectors are present in the dictionary. It suggests that the complexity can also be increased by the angle between the bases, which is additional information that cannot be obtained from any of the previous bounds. The additional term in both bounds in equations 5.6 and 5.7 is due to the decoupling technique applied to the gaussian chaos.

6.  Empirical Results

6.1.  UCI Data Sets.

In this section, we present empirical comparisons of the classification accuracy between multiple SKL and state-of-the-art MKL approaches, such as simpleMKL, the extended-level method (Xu et al., 2009), and SILP (Sonnenburg et al., 2006). We perform this study on a selection of UCI classification data sets available for download from http://archive.ics.uci.edu/ml/. By classification accuracy, we mean the performance of an SVM with a kernel matrix obtained by an MKL algorithm. The classification accuracy using kernels from different methods is summarized in Table 1. For multiple SKL, we employ the linear programming toolbox of Mosek (www.mosek.com) in the Matlab environment (www.matlab.com). For other MKL methods, we used the toolbox provided in Sonnenburg et al. (2006), Rakotomamonjy et al. (2008), and Xu et al. (2009).

Table 1:
UCI Data Sets: Classification Accuracy and Learning Time.
Data Set Name | SimpleMKL | SILP | Level Method | Multiple SKL
Iono (n=175) | 92.10 ± 2.0 | 92.10 ± 1.9 | 92.10 ± 1.3 | 95.61 ± 2.51
   Time | 33.5 ± 11.6 | 1161.0 ± 344.2 | 7.1 ± 4.3 | 1.55 ± 0.15
Pima (n=384) | 76.5 ± 1.9 | 76.9 ± 2.8 | 76.9 ± 2.1 | 98.19 ± 1.45
   Time | 39.4 ± 8.8 | 62.0 ± 15.2 | 9.1 ± 1.6 | 2.12 ± 0.75
Sonar (n=104) | 79.1 ± 4.5 | 79.3 ± 4.2 | 79.0 ± 4.7 | 90.96 ± 2.97
   Time | 60.1 ± 29.6 | 1964.3 ± 68.4 | 24.9 ± 10.6 | 0.10 ± 0.08
Heart (n=135) | 82.2 ± 2.2 | 82.2 ± 2.2 | 82.2 ± 2.2 | 82.87 ± 2.81
   Time | 4.7 ± 2.8 | 79.2 ± 38.1 | 2.1 ± 0.4 | 1.51 ± 0.45
Wpbc (n=198) | 77.0 ± 2.9 | 77.0 ± 2.8 | 76.9 ± 2.9 | 72.92 ± 2.13
   Time | 7.8 ± 2.4 | 142.0 ± 122.3 | 5.3 ± 1.3 | 1.45 ± 0.34
Wdbc (n=285) | 95.7 ± 0.8 | 96.4 ± 0.9 | 96.4 ± 0.8 | 88.41 ± 0.35
   Time | 122.9 ± 38.2 | 146.3 ± 48.3 | 15.5 ± 7.5 | 1.33 ± 0.62
Vote (n=218) | 96.0 ± 1.1 | 95.7 ± 1.0 | 95.7 ± 1.0 | 95.03 ± 0.11
   Time | 23.7 ± 9.7 | 26.3 ± 12.4 | 4.1 ± 1.3 | 1.11 ± 0.43

Notes: The first line in each row displays the accuracy of each method; the numbers in the second line show the learning time in seconds. Values in boldface indicate that they pass a two-sample t-test at the 95% confidence level.

In each experiment, 50% of all samples are used for training and the rest for testing; this random split is repeated a number of times. In Table 1, n shows the size of the training set. The second row for each data set shows the time in seconds spent finding the mixing coefficients. (We ran the code on a MacBook with Mac OS X 10.5 and 4 GB of memory.) The computational time for multiple SKL shows the time spent learning the kernel. We used RBF kernels for each data set with a range of kernel widths. The singular values of the kernel matrices over this width range are in line with the low-rank assumption, so we can apply multiple SKL. We took 20 eigenvectors, corresponding to the 20 largest eigenvalues of the RBF kernel matrices, to build the spectral dictionary. The learned kernel is then plugged into an SVM with both the ℓ2-loss and the hinge loss functions. Both showed almost identical performance. For all sample sets, the penalization parameter of the SVM is selected using five-fold cross-validation, separately for all MKL methods as well as for multiple SKL.
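
Once the mixing weights are learned, the kernel plugged into the SVM is the weighted sum of the rank 1 dictionary kernels, that is, U times the diagonal of the weights times the transpose of U. The sketch below shows this final step with placeholder U and theta standing in for the learned quantities.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, m = 120, 40
U = rng.normal(size=(n, m))                    # placeholder spectral dictionary
theta = np.maximum(rng.normal(size=m), 0.0)    # placeholder nonnegative mixing weights
y = np.sign(rng.normal(size=n))                # placeholder labels

K_hat = (U * theta) @ U.T                      # sum_j theta_j u_j u_j^T without forming a diagonal matrix
clf = SVC(kernel="precomputed", C=1.0).fit(K_hat, y)
print("training accuracy:", clf.score(K_hat, y))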

The results show that multiple SKL significantly improves the classification accuracy on four data sets and matches the state of the art on two others. In all cases, however, the computational time is significantly reduced.

6.2.  Automatic Flower Recognition.

Flower recognition is an image-based classification task with a large number of classes, where multiple kernel learning has been shown to improve the classification rate (Nilsback & Zisserman, 2008). In our experiments, we use multiple SKL on the flower images provided by www.robots.ox.ac.uk/~vgg/data/flowers/index.html. The Web page provides a set of distance matrices that contain the distances between image samples. We used these matrices to generate RBF kernel matrices with a range of kernel widths. Four features are used to compute the distance matrices over the samples, consisting of histograms of gradient orientations, HSV values of the pixels, and scale-invariant feature transform descriptors, sampled on both the foreground region and its boundary. The training set contains 1000 images from 17 classes, and the test set contains 361 samples. We set aside 340 samples from the training set for validation.
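
The conversion from a precomputed distance matrix to an RBF-style kernel can be done as sketched below. The convention of exponentiating the negative distance divided by the mean pairwise distance is a common heuristic and only a stand-in for the widths actually used in this experiment, which are not reproduced here.

import numpy as np

def distance_to_kernel(D, sigma=None):
    # Map a symmetric distance matrix D to a kernel matrix exp(-D / sigma)
    if sigma is None:
        sigma = np.mean(D)                     # heuristic width (assumption)
    return np.exp(-D / sigma)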

In this experiment, we consider two settings, inductive and transductive. In the transductive setting, the classification accuracy (MSE) over all classes is . The accuracy in the inductive setting is . The best result reported in Varma and Ray (2007) and Nilsback and Zisserman (2008) by MKL methods is (for the inductive setting), which is significantly lower than our result.

6.3.  Protein Subcellular Localization.

The prediction of the subcellular localization of proteins is important in cell biology. MKL has been successfully applied to this problem (Zien & Ong, 2007). A set of kernels is designed for this data set for four organisms: plants, nonplant eukaryotes, and Gram-positive and Gram-negative bacteria. There are 69 kernels available for download from the project's Web site (http://www.fml.tuebingen.mpg.de/raetsch/suppl/protsubloc; for details, see Kloft et al., 2010). We used the resampling index provided with the data set to design our numerical experiments. Kernels are normalized with multiplicative normalization, as formulated in Kloft et al. (2010) and Zien and Ong (2007). Then we take the top 20 eigenvectors of each kernel matrix to build the spectral dictionary. The results of the classification, in terms of the Matthews correlation coefficient (MCC; Zien & Ong, 2007), are reported in Table 2. The results from Kloft et al. (2010) are obtained using SILP.

Table 2:
The MCC for Multiple SKL and Full Kernels in Kloft et al. (2010).
Organism | Kloft et al. (2010) | Multiple SKL
Plant | 8.18 ± 0.47 | 8.18 ± 0.47
Nonplant | 8.97 ± 0.26 | 9.01 ± 0.25
Positive Gram | 9.99 ± 0.35 | 9.87 ± 0.34
Negative Gram | 13.07 ± 0.66 | 13.07 ± 0.07

The accuracy of the classification is similar to the result of SILP (as it appears in Kloft et al., 2010), whereas the computation and memory demands of multiple SKL are greatly reduced, as shown in Table 1. Here, multiple SKL does not uniformly improve the classification rate; however, it shows that the concept of low-rank kernel learning is valid and can be utilized in problems of similar size.

7.  Discussion

Multiple kernel learning has been widely used for kernel selection in SVMs and similar kernel methods. However, the size of the kernel matrices increases with the sample size, which limits the use of MKL methods due to high memory and computational demands. Nevertheless, large kernel matrices are typically of low efficient rank. In this letter, we proposed an efficient approach for rank 1 kernel learning that exploits this property. Empirical results show that multiple SKL improves both the computational time and the accuracy of prediction in most cases.

In addition, we derived a bound for the gaussian complexity of the spectral kernel set and showed that the complexity of the hypothesis set generated by that kernel set is controlled by its diameter and the size of the dictionary, while previous bounds depend only on the number of kernel bases. The proof can be extended to general kernel sets. In that case, the extra term in theorem 6 will be replaced by the trace of the kernel matrices when all kernel bases have the same trace (e.g., by normalization).

Appendix:  Proofs

Here, we present some computations needed for the proof of theorems 1 and 2. Straightforward dual computations show that we can rewrite equation 2.1 as
formula
A.1
The superscript denotes the convex conjugate (Ruszczynski, 2011), which is defined by
formula
for given function . We denote the identity matrix by I.

A.1.  Proof of Theorem 1.

By taking the dual of and replacing by K, we obtain
formula
A.2
where is the dual-vector variable of . By plugging equation A.2 into equation 2.3 for class , we have
formula
Applying the scaling trick, which is , such that and , to , results in
formula
By solving the preceding display over variable , we obtain
formula
Note that by rewriting to , we get
formula
The Lagrangian of is , where is the dual vector of . By checking the optimality conditions (Steinwart & Christmann, 2008), we can remove the variable ; therefore, reads
formula
A.3
The primal variables can be obtained by checking the optimality conditions, which result in .

A.2.  Proof of Theorem 2.

Let us fix the kernel set to , defined in equation 2.2, and consider the dual of equation 2.1 for kernel matrix :
formula
where the additional variables are dual variables, and the superscript denotes the conjugate dual (Steinwart & Christmann, 2008). By strong duality, we can interchange the min and max in the previous equation and then remove the inner variable, obtaining
formula
A.4
Remark 2. 
If we apply the spectral decomposition to equation A.4, we obtain
formula
which implies that the ERM returns a vector that minimizes the risk and is as dissimilar as possible to the eigenvectors. Here, λi denotes the ith largest eigenvalue of the kernel matrix K. But the ERM tends to find a vector such that the summands corresponding to small eigenvalues are zero.
For a kernel set , plugging equation A.4 into 2.3 reads
formula
A.5
where
formula
By taking the convex conjugate of , we obtain
formula
Again, we apply the kernel trick by replacing the variable by such that
formula
The penalization term can then be simplified to
formula
Thus, we obtain
formula
The function can be simplified to
formula
For the multiple spectral kernel set , defined in equation 3.1, we obtain
formula
A.6
which is a linear program. Therefore, instead of quadratic constraints, as in Lanckriet et al. (2004), we get linear constraints. Moreover, in a similar way as for the ℓ2-loss, by taking the Lagrange dual of equation A.6, we can simplify to
formula
If we now plug back into equation A.5, we obtain
formula
A.7
Along the same lines as for equation 3.3, the original variables can be recovered by checking the optimality conditions.

A.3.  Proof for Theorem 6.

The proof of theorem 6 relies on decoupling of the gaussian chaos, Slepian's lemma, and maximal inequalities, which we introduce shortly.

Lemma 1 
(Ledoux & Talagrand, 2011).  Let be a gaussian random variable in . Then we have
formula
This result also holds for subgaussian random variables where the coefficient 3 is replaced by a constant C>0.
Lemma 2 
(Slepian's lemma, Ledoux & Talagrand, 2011; Bartlett & Mendelson, 2003).  Let be random variables defined by
formula
for . Then there exists a constant C>0 such that
formula
Lemma 3 
(decoupling of gaussian quadratic form, Levina & Vershynin, 2011; De la Peña & Giné, 1999).  Let Z be a centered normal random vector in , and let be an independent copy of random vector Z. Let be a set of symmetric matrices. Then,
formula
Proof of theorem 6. 
Let denote independent standard normal random variables that are independent of the samples . Also we define . In deriving the upper bound in equation 5.3, the set was selected arbitrarily. So for the empirical gaussian complexity of the spectral kernel class with the hypothesis set , we have
formula
A.8
Before applying the decoupling lemma, we first remove the mean of , which is equal to
formula
Therefore, from equation A.8, we obtain
formula
A.9
In the third line, we used lemma 3. Both terms and are gaussian random variables and can be treated independently. In the following, we compute the first term on the right-hand side of equation A.9 in two different ways. First,
formula
A.10
In the first line, we used the gaussian maximal inequality. We can further assume that , and the bound will be reduced to 9ln L. Second,
formula
In the first line, we used the gaussian maximal inequality, and in the second line, we used Slepian's lemma. Under the assumption that the norm of is 1, we have
formula
A.11
By replacing the expectation terms in equation A.9 by the upper bounds derived in equation A.10 or A.11, we obtain the claimed results, equations 5.6 or 5.7, respectively.

A.4.  On the Effect of Low-Rank Approximation in the SVM Solution.

Let us consider a kernel matrix K and a perturbation of it. We want to see the difference between the solutions of the SVM trained with K and with its perturbation. The following proposition provides an estimate of such an error, which appeared in Bousquet and Elisseeff (2002). For completeness, we also provide a similar proof here.

Theorem 7. 

Let us consider kernel functions k and and denote their feature maps by and . For sample set , we denote the feature maps with and , respectively. We assume that the feature maps are finite-dimensional.

Let us consider , where is a convex function with respect to the second argument, with Lipschitz norm . Let us further denote the solution of and by and . Then the following holds:
formula
Proof. 
We follow a similar derivation as in Bousquet and Elisseeff (2002) and Zhang (2001). Let us define
formula
and . Also, assume that . Since and attain the minimum of and , respectively, we have
formula
and, similarly,
formula
Combining the above inequalities reads
formula
which is equal to
formula
By assumption, the loss function is a convex function, and therefore we can expand the right-hand side of the above inequality:
formula
Note that we divided both sides by t. By the Lipschitz assumption, we have
formula
Let us take the limit of the above expression as t tends to 0. Then, by the Cauchy-Schwarz inequality, we obtain
formula
We assume that the kernel matrix K has eigenvalue decomposition
formula
We define the feature map:
formula
Let us assume that the low-rank approximation of contains the top R eigenvectors of the kernel matrix K. Then it has a similar feature map:
formula
Using theorem 7, we have
formula
Then,
formula
On the other hand, both and have representation for different values of for each weight vector. If we further assume that , for T>0, we obtain,
formula
The dual of equation 2.1 for the hinge loss is , where . An additional constraint is required if we add a bias term to the decision function (i.e., , where . Therefore, , and for the differences between weight vectors, we obtain,
formula

Acknowledgments

I thank Laurent el Ghaoui for introducing the rank 1 kernel learning problem and extensive discussions, and Peter Bickel for his valuable comments. Part of this research was done while I was visiting Peter Bickel in the Department of Statistics at the University of California, Berkeley, in 2008–2009. I also acknowledge the anonymous referees’ very helpful comments.

References

Bach, F. (2008). Exploring large feature spaces with hierarchical multiple kernel learning. arXiv preprint arXiv:0809.1493.

Bartlett, P. L., & Mendelson, S. (2003). Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3, 463–482.

Bhatia, R. (1997). Matrix analysis. New York: Springer-Verlag.

Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. Journal of Machine Learning Research, 2, 499–526.

Bousquet, O., & Herrmann, D. J. (2003). On the complexity of learning the kernel matrix. In S. Becker, S. Thrün, & K. Obermayer (Eds.), Advances in neural information processing systems (pp. 415–422). Cambridge, MA: MIT Press.

Chapelle, O., Vapnik, V., Bousquet, O., & Mukherjee, S. (2002). Choosing multiple parameters for support vector machines. Machine Learning, 46(1), 131–159.

Chen, S. S., Donoho, D. L., & Saunders, M. A. (1998). Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1), 33–61.

Cortes, C., Mohri, M., & Rostamizadeh, A. (2009, June). L2 regularization for learning kernels. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (pp. 109–116). N.p.: AUAI Press.

Cortes, C., Mohri, M., & Rostamizadeh, A. (2010). Generalization bounds for learning kernels. In Proceedings of the 27th International Conference on Machine Learning. Madison, WI: Omnipress.

De la Peña, V., & Giné, E. (1999). Decoupling: From dependence to independence. New York: Springer-Verlag.

Drineas, P., & Mahoney, M. W. (2005). On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6, 2153–2175.

Hino, H., Reyhani, N., & Murata, N. (2010, December). Multiple kernel learning by conditional entropy minimization. In Proceedings of the 2010 Ninth International Conference on Machine Learning and Applications (pp. 223–228). Piscataway, NJ: IEEE.

Hino, H., Reyhani, N., & Murata, N. (2012). Multiple kernel learning with gaussianity measures. Neural Computation, 24(7), 1853–1881.

Kloft, M., Brefeld, U., Sonnenburg, S., & Zien, A. (2010). Non-sparse regularization and efficient training with multiple kernels. arXiv preprint.

Lanckriet, G. R., Cristianini, N., Bartlett, P., Ghaoui, L. E., & Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27–72.

Ledoux, M., & Talagrand, M. (2011). Probability in Banach spaces: Isoperimetry and processes. New York: Springer.

Levina, E., & Vershynin, R. (2011). Partial estimation of covariance matrices. Probability Theory and Related Fields, 153, 1–15.

Nilsback, M. E., & Zisserman, A. (2008, December). Automated flower classification over a large number of classes. In Proceedings of the Sixth Indian Conference on Computer Vision, Graphics and Image Processing, 2008 (pp. 722–729). Piscataway, NJ: IEEE.

Rakotomamonjy, A., Bach, F., Canu, S., & Grandvalet, Y. (2008). SimpleMKL. Journal of Machine Learning Research, 9, 2491–2521.

Ruszczynski, A. (2011). Nonlinear optimization. Princeton, NJ: Princeton University Press.

Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press.

Sonnenburg, S., Rätsch, G., Schäfer, C., & Schölkopf, B. (2006). Large scale multiple kernel learning. Journal of Machine Learning Research, 7, 1531–1565.

Srebro, N., & Ben-David, S. (2006). Learning bounds for support vector machines with learned kernels. Learning Theory, 4005, 169–183.

Steinwart, I., & Christmann, A. (2008). Support vector machines. New York: Springer.

Talwalkar, A., & Rostamizadeh, A. (2010). Matrix coherence and the Nystrom method. arXiv preprint arXiv:1004.2008.

Tomczak-Jaegermann, N. (1989). Banach-Mazur distances and finite-dimensional operator ideals. Harlow, UK: Longman Scientific & Technical.

Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for support vector machines. Neural Computation, 12(9), 2013–2036.

Varma, M., & Ray, D. (2007, October). Learning the discriminative power-invariance trade-off. In Proceedings of the IEEE 11th International Conference on Computer Vision, 2007 (pp. 1–8). Piscataway, NJ: IEEE.

Xu, Z., Jin, R., King, I., & Lyu, M. R. (2009). An extended level method for efficient multiple kernel learning. In D. Köller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems, 21 (pp. 1825–1832). Cambridge, MA: MIT Press.

Zhang, T. (2001). Convergence of large margin separable linear classification. In T. K. Leen, T. G. Dietterich, & V. Trosp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.

Zien, A., & Ong, C. S. (2007, June). Multiclass multiple kernel learning. In Proceedings of the 24th International Conference on Machine Learning (pp. 1191–1198). New York: ACM.