## Abstract

Multiple kernel learning (MKL) partially solves the kernel selection problem in support vector machines and similar classifiers by minimizing the empirical risk over a subset of linear combinations of given kernel matrices. For large sample sets, the size of the kernel matrices becomes a numerical issue. In many cases, the kernel matrix is of low efficient rank. However, the low-rank property is not efficiently utilized in MKL algorithms. Here, we suggest multiple spectral kernel learning, which efficiently uses the low-rank property by finding a kernel matrix from a set of Gram matrices of a few eigenvectors from all given kernel matrices, called a spectral kernel set. We provide a new bound for the gaussian complexity of the proposed kernel set, which depends on both the geometry of the kernel set and the number of Gram matrices. This characterization of the complexity implies that in an MKL setting, adding more kernels may not monotonically increase the complexity, while previous bounds show otherwise.

## 1. Introduction

Kernel methods such as support vector machines (SVMs) usually perform well in many prediction problems (Steinwart & Christmann, 2008); however, the performance of the kernel methods heavily depends on the choice of kernel function, which is left to the practitioners (Lanckriet, Cristianini, Bartlett, Ghaoui, & Jordan, 2004; Vapnik & Chapelle, 2000; Chapelle, Vapnik, Bousquet, & Mukherjee, 2002).

A common approach for selecting a suitable kernel function is through cross-validation or other resampling techniques (Schölkopf & Smola, 2002) over a set of candidate kernel functions. The selection consists of training a kernel machine, such as a support vector machine (SVM), with a particular kernel using a subset of training samples and validating the model by measuring the empirical risk over the rest of training samples (called a validation set). This procedure is repeated a number of times. A kernel function with the best average performance on the validation sets will be selected (Steinwart & Christmann, 2008; Chapelle et al., 2002). The computational load in this approach increases with the size of the candidate set and the number of training samples, which clearly limits the applicability of such approaches.
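As a concrete illustration, the resampling procedure above can be sketched as follows. This is a minimal sketch under our own assumptions: the candidate kernels are precomputed matrices, and kernel ridge regression (closed form) stands in for the SVM trainer; the function names are ours, not from the literature.

```python
import numpy as np

def krr_predict(K_train, y_train, K_test_train, lam=1e-2):
    # Kernel ridge regression stands in for the SVM here (closed-form solve).
    alpha = np.linalg.solve(K_train + lam * np.eye(len(y_train)), y_train)
    return K_test_train @ alpha

def cv_select_kernel(kernels, y, n_folds=5, lam=1e-2):
    """Pick the candidate kernel with the lowest average validation error."""
    n = len(y)
    folds = np.array_split(np.arange(n), n_folds)
    errors = []
    for K in kernels:
        fold_err = []
        for val in folds:
            tr = np.setdiff1d(np.arange(n), val)
            pred = krr_predict(K[np.ix_(tr, tr)], y[tr], K[np.ix_(val, tr)], lam)
            fold_err.append(np.mean(np.sign(pred) != y[val]))
        errors.append(float(np.mean(fold_err)))
    return int(np.argmin(errors)), errors
```

The cost of this loop grows with both the number of candidates and the sample size, which is exactly the scalability issue noted above.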

A different framework for kernel selection, multiple kernel learning (MKL), searches for a kernel matrix (function) in the convex hull of kernel functions (matrices)—defined by linear combinations of the given kernel matrices—that minimizes the empirical risk of the plugged-in kernel matrix (Lanckriet et al., 2004; Rakotomamonjy, Bach, Canu, & Grandvalet, 2008; Sonnenburg, Rätsch, Schäfer, & Schölkopf, 2006; Hino, Reyhani, & Murata, 2012). The given set of original kernel functions or matrices is called a kernel base. It is common to restrict the space of linear combinations so that the coefficients belong to the $\ell_1$-norm simplex, the concern in this letter, whereas some work suggests an $\ell_p$-norm constraint (see Kloft, Brefeld, Sonnenburg, & Zien, 2010). In addition, it is usually assumed that a set of kernel matrices for a fixed sample set is provided. This renders the setup more convenient for algorithm design. Other views suggest optimizing criteria other than the empirical risk, such as maximum entropy, kernel Fisher discriminant analysis performance (Hino, Reyhani, & Murata, 2010), or maximum gaussianity (Hino et al., 2012), to find the kernel combinations.

Lanckriet et al. (2004) cast linear MKL in terms of semidefinite programming (SDP) for the class of positive-definite linear combinations of kernel bases with bounded trace. The main issue with the SDP formulation of MKL is the inefficiency of SDP solvers for large kernel matrices (Sonnenburg et al., 2006; Rakotomamonjy et al., 2008). This limitation has stimulated further research on numerical issues involved in MKL. One of the main directions is to use the differentiability of the dual of the penalized empirical risk minimization (ERM) at the maximal point, which usually leads to an alternating-optimization algorithm. In the alternating optimization, the inner optimization (i.e., the dual of the ERM) is solved first while the outer optimization variables (i.e., mixture coefficients) are fixed, and vice versa, until some convergence criterion is reached. In addition, the class of kernel matrices is usually reduced to the convex hull of the kernel bases compared to the original MKL in Lanckriet et al. (2004). Semi-infinite linear programming (SILP; Sonnenburg et al., 2006), the extended-level method (Xu, Jin, King, & Lyu, 2009), simpleMKL (Rakotomamonjy et al., 2008), and alternating gradient descent (Bousquet & Herrmann, 2003) are different implementations of this framework. These methods can be used for most of the important loss functions. Another direction has been to restrict the loss function to the $\ell_2$-loss and change the set of linear coefficients from the simplex to the positive subset of the $\ell_2$ ball. This reduction leads to alternating steps between two closed forms: one for estimating the mixing coefficients and the other for the $\ell_2$-SVM parameters (Cortes, Mohri, & Rostamizadeh, 2009).
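To make the alternating scheme concrete, here is a minimal sketch under our own simplifications (the names are ours): kernel ridge regression replaces the SVM dual as the inner closed-form solver, and the outer step is a projected gradient step on the simplex of mixing coefficients.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = idx[u - css / idx > 0][-1]
    return np.maximum(v - css[rho - 1] / rho, 0.0)

def alternating_mkl(kernels, y, lam=1e-1, lr=0.1, n_iter=50):
    """Alternate an inner closed-form fit (KRR stands in for the SVM dual)
    with an outer projected-gradient step on the mixing coefficients."""
    L, n = len(kernels), len(y)
    theta = np.full(L, 1.0 / L)
    for _ in range(n_iter):
        K = sum(t * Kl for t, Kl in zip(theta, kernels))
        alpha = np.linalg.solve(K + lam * np.eye(n), y)      # inner problem
        grad = np.array([-alpha @ Kl @ alpha for Kl in kernels])
        theta = project_simplex(theta - lr * grad)           # outer step
    return theta
```

Each outer iteration must hold all $L$ kernel matrices in memory, which illustrates the scalability issue discussed next.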

A common issue with all of these methods for MKL is the need to solve an iterative optimization with a memory footprint of order $O(n^2 L)$, where $L > 1$ is the number of kernel matrices in the base set. Here, $n$ is the number of training samples in an inductive setting, or of training plus test samples in a transductive setting. Thus, regardless of the algorithmic efficiencies, the memory requirement grows very fast, leading to a significant increase in computational load, especially when both $n$ and $L$ are large. This naturally limits the use of MKL methods in large sample settings.

A slightly different framework for kernel learning is to adjust the eigenvalues of the given kernel matrix so that the empirical risk is minimized. In other words, the kernel base here is a set of rank 1 kernels produced by the outer product of eigenvectors of given (single or multiple) kernel matrices.

This type of kernel learning has been studied by Bousquet and Herrmann (2003), Lanckriet et al. (2004), and Bach (2008); however, their approaches end up with block-coordinate optimization (Bousquet & Herrmann, 2003; Cortes et al., 2009) or cone programming (Lanckriet et al., 2004). In particular, Bach (2008) suggests hierarchical MKL (HMKL), which assumes that the eigenfunctions of the kernels are provided analytically. HMKL constructs a set of rank 1 kernels by evaluating these eigenfunctions on all possible subsets of input feature space. It finds the best mixing by first restricting the set of candidate kernels and then finding the best coefficients using other MKL methods such as SILP. HMKL is more about kernel learning for feature selection, and the main part of the algorithm is to select a subset of rank 1 kernels using a greedy algorithm. In this letter, we consider a similar setup with more of a focus on kernel learning than feature selection and propose an efficient and simpler optimization that improves the scalability in both memory and numerical computation. In addition, we do not assume that the bases' eigenfunctions are provided.

Another direction in MKL research is to find tight upper bounds for the complexity of the hypothesis set generated by a linear combination of kernels. Most recent work (e.g., Cortes, Mohri, & Rostamizadeh, 2010; Srebro & Ben-David, 2006) presents bounds characterized by, for example, a $\sqrt{\lceil \log L \rceil}$ dependence on the number of kernels $L$. These results imply that introducing any new kernel matrix to the base set would necessarily increase the complexity, no matter how close the added kernel is to the existing ones. The second part of this letter partially addresses this issue with previous bounds.

This letter is organized as follows. A definition of linear MKL is given in section 2. Section 3 outlines our proposed method for MKL using low-rank approximation. To extend the proposed MKL framework to the inductive setting, we also suggest using the Nyström extension, which is explained in section 4. The gaussian complexity of the proposed kernel set is presented in section 5. The empirical results are given in section 6, followed by a few concluding remarks.

## 2. Linear MKL

Assume that the data are i.i.d. pairs $(x_i, y_i)$, where $x_i \in \mathcal{X} \subseteq \mathbb{R}^d$ and $y_i$ takes values in a subset of $\mathbb{R}$. Also assume that $y_i = g(x_i) + \varepsilon_i$ for some unknown smooth function $g$, and that there is a nonlinear mapping, called a feature map, $\Phi: \mathcal{X} \to \mathcal{H}$, for some Hilbert space $\mathcal{H}$. The inner product of $\mathcal{H}$, denoted by $\langle \cdot, \cdot \rangle_{\mathcal{H}}$, can be computed using a bounded kernel function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, so that $k(x, x') = \langle \Phi(x), \Phi(x') \rangle_{\mathcal{H}}$ for $x, x' \in \mathcal{X}$. A kernel $k$ defines a unique Hilbert space (Steinwart & Christmann, 2008). Here, for technical reasons, we assume that for each kernel, there exists at least one feature map, such as the canonical feature map (Steinwart & Christmann, 2008). We define the kernel matrix $K$ by $K_{ij} = k(x_i, x_j)$. For simplicity, we use the notations $K$ and $k$ interchangeably.

The goal of learning is to approximate $g$ by a function $f$, given a sample set $\{(x_i, y_i)\}_{i=1}^n$, such that the prediction performance on both the training samples and the test samples is sufficiently high. The performance is usually measured by some loss function, which measures the discrepancy between the true value and the prediction. We assume that $f$ belongs to some set of functions $\mathcal{F}$ so that the true prediction function $g$ can be well approximated by members of $\mathcal{F}$. Here, we assume $\mathcal{F}$ contains linear functions in the feature space. A general framework for learning is through empirical risk minimization, which can be formulated as
$$\min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, f(x_i)\big) + \lambda \,\lVert f \rVert_{\mathcal{H}}^2, \tag{2.1}$$
where $\ell$ is a loss function, for example, the hinge loss $\ell(y, t) = \max(0, 1 - y t)$. The scalar $\lambda > 0$ is the penalization parameter, $\lVert \cdot \rVert_{\mathcal{H}}$ denotes the norm in the space $\mathcal{H}$, and $\Phi = [\Phi(x_1), \ldots, \Phi(x_n)]$ is the matrix containing the feature vectors $\Phi(x_i)$ for $i = 1, \ldots, n$. Note that we can rewrite equation 2.1 in the constrained form
$$\min_{f \in \mathcal{F}} \; \widehat{L}(f) \quad \text{subject to} \quad \lVert f \rVert_{\mathcal{H}} \le B, \tag{2.2}$$
where $B > 0$ is some user-defined parameter and $\widehat{L}$ denotes the empirical loss.

Assume we are given $L$ kernel bases $K_1, \ldots, K_L$. We consider the convex hull of the kernel bases,
$$\mathcal{K} = \Big\{ K_\theta = \sum_{l=1}^L \theta_l K_l \;:\; \theta \in \Delta_L \Big\},$$
where the simplex is defined by $\Delta_L = \{\theta \in \mathbb{R}^L : \theta_l \ge 0,\ \sum_{l=1}^L \theta_l = 1\}$. Kernel learning over the kernel set $\mathcal{K}$ is defined by minimizing the empirical risk with respect to the positive-definite kernel matrix $K_\theta \in \mathcal{K}$, which is
$$\min_{\theta \in \Delta_L} \; \min_{f} \; \widehat{L}(f) \quad \text{subject to} \quad \lVert f \rVert \le B. \tag{2.3}$$
A common approach to solving the optimization problem, equation 2.3, is to alternate between the minimization in $f$ and in $\theta$, fixing the parameters of one while optimizing the other (for details, see Sonnenburg et al., 2006; Rakotomamonjy et al., 2008; Xu et al., 2009). The memory requirement for the outer optimization is of order $O(n^2 L)$, which could become a barrier in large-scale problems. Also, every other alternation consists of solving one round of, for example, an SVM for the hinge loss function.

## 3. Multiple Spectral Kernel Learning

A common phenomenon observed in many large-size kernels is that these matrices are of low efficient rank (Drineas & Mahoney, 2005); that is, a small subset of eigenvalues is large and distant, while the rest have small and similar values. This property has been utilized for improving the scalability of many learning methods and for efficient memory representation of the kernel matrices (Drineas & Mahoney, 2005; Talwalkar & Rostamizadeh, 2010). Thus, it is reasonable to utilize the low-rank assumption of the given kernel matrices to improve the numerical efficiency in the MKL framework.

The dictionary can be constructed from a small set of eigenvectors extracted from each given kernel matrix. We can select some or all of the eigenvectors with large eigenvalues from each kernel matrix. This can be performed in a similar way as in principal component analysis (PCA; Schölkopf & Smola, 2002), where we start selecting eigenvectors by taking those with large eigenvalues and stop when the eigenvalues become small and similar to each other. In this way, we basically take components that are involved in low-rank approximation of the kernel matrices without carrying the eigenvalue information. Note that the elements of the dictionary are not necessarily orthogonal to each other.
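A minimal sketch of this dictionary construction (the function name is ours): take the top eigenvectors of each base kernel matrix and stack them, deliberately discarding the eigenvalues.

```python
import numpy as np

def spectral_dictionary(kernels, n_eig=20):
    """Collect the top eigenvectors of each kernel matrix (PCA-style cutoff)."""
    dictionary = []
    for K in kernels:
        vals, vecs = np.linalg.eigh(K)          # ascending eigenvalue order
        order = np.argsort(vals)[::-1][:n_eig]  # indices of the largest ones
        dictionary.append(vecs[:, order])       # eigenvalue info is dropped
    return np.hstack(dictionary)                # n x (L * n_eig) basis matrix
```

The columns of the returned matrix are the dictionary elements; columns coming from different kernels need not be orthogonal to one another.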

Another way to construct vectors is to take a set of orthonormal basis functions (e.g., Hermite polynomials) with sufficient smoothness and construct basis vectors by evaluating those functions on the sample points.

Theorems 1 and 2 present simple optimizations for multiple SKL, for the $\ell_2$-loss and for a more general loss function, respectively, aiming to solve problem 3.2 in one round (i.e., no alternating optimization) with a significantly smaller memory requirement.

Thus, instead of an alternating least-squares-type solution, as appears in Cortes et al. (2009), we end up with a basis pursuit (Chen, Donoho, & Saunders, 1998) approximation that solves the kernel learning problem in one round with linear programming. The proof is provided in the appendix.
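For illustration only (this is the canonical basis pursuit problem, not the exact program of theorem 1): the problem $\min \lVert\theta\rVert_1$ subject to $D\theta = y$ can be solved in one round as a linear program by splitting $\theta = u - v$ with $u, v \ge 0$.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(D, y):
    """Solve min ||theta||_1  s.t.  D @ theta = y  as a linear program.

    With theta = u - v and u, v >= 0, the objective becomes sum(u + v).
    """
    n, p = D.shape
    c = np.ones(2 * p)
    A_eq = np.hstack([D, -D])
    res = linprog(c, A_eq=A_eq, b_eq=y,
                  bounds=[(0, None)] * (2 * p), method="highs")
    uv = res.x
    return uv[:p] - uv[p:]
```

A single LP solve replaces the alternating loop, which is the computational point of theorems 1 and 2.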

The extension of theorem 1 for general loss function is as follows.

The proof is provided in the appendix.

Using the above results, we can achieve a significant decrease in memory and computation load compared to MKL algorithms such as hierarchical MKL (Bach, 2008) or simpleMKL (Rakotomamonjy et al., 2008). The spectral decomposition of kernel matrices is an expensive computational task. However, it happens only once, and we need only the top eigenvectors of the kernel matrices, which can be computed efficiently using iterative methods such as the Lanczos method (see Bhatia, 1997, for more details).
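For instance, in Python the top eigenvectors can be obtained with SciPy's iterative (Lanczos-type) eigensolver rather than a full $O(n^3)$ decomposition; a sketch:

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def top_eigenvectors(K, r=20):
    """Top-r eigenpairs of a symmetric kernel matrix via the iterative
    Lanczos-type solver in SciPy (avoids a full eigendecomposition)."""
    r = min(r, K.shape[0] - 1)                # eigsh requires r < n
    vals, vecs = eigsh(K, k=r, which="LA")    # largest algebraic eigenvalues
    order = np.argsort(vals)[::-1]            # sort descending
    return vals[order], vecs[:, order]
```

This step is performed once per base kernel, before the LP round.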

It is natural to study the differences between the solution of SVM with the original kernel matrix and the one with its low-rank approximation. For readability, we postpone the presentation of this study to the appendix. The study shows that the difference between solutions of SVMs depends on the magnitude of the largest eigenvalue that is not included in low-rank approximation. Thus, the approximation does not significantly change the solution of SVMs for kernels with efficient low rank.

## 4. Application of the Nyström Extension for an Inductive Multiple SKL

The formulations presented in section 3 apply only to the transductive setting where both training and test sets are available at the time of learning. To use the results of multiple SKL for classifying new test samples, access to entries of eigenvectors at those points is crucial. A naive approach to resolve this issue is to add the test samples to each of the original kernel matrices and then take the eigenvalue decomposition of the extended matrices. However, this solution gradually becomes impractical as the number of test samples increases. Instead, we suggest using the Nyström extension (Drineas & Mahoney, 2005; Talwalkar & Rostamizadeh, 2010), which provides an approximate eigenvector extension of a bounded kernel matrix.

For a sample set $\{x_1, \ldots, x_n\}$, let us assume that the eigenvectors of the kernel matrix $K$ for some kernel function $k$ are available. Let us consider the integral operator
$$(T_k f)(x) = \int_{\mathcal{X}} k(x, s)\, f(s)\, dP(s).$$
The eigenvectors and eigenvalues of $K$ are basically an approximation of the eigenfunctions and eigenvalues of $T_k$ for the eigenvalue problem $T_k \phi = \lambda \phi$. Here, $P$ is a probability measure on $\mathcal{X}$, and $\lambda$ denotes an eigenvalue of the linear operator $T_k$. The Nyström extension approximates the $j$th eigenfunction at a new point $x$ by
$$\widehat{\phi}_j(x) = \frac{1}{\lambda_j} \sum_{i=1}^n k(x, x_i)\, u_{ij},$$
where $u_{ij}$ is the $i$th entry of the $j$th eigenvector of $K$ and $\lambda_j$ is the corresponding eigenvalue.

Thus, for any test sample, we first extend the eigenvectors with nonzero entries in the mixing vector $\theta$, generate the mixture using $\theta$ and the extended eigenvectors, and then evaluate the learned classifier for the extended matrix.
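A sketch of the extension step (the function and argument names are ours): given kernel evaluations between new points and the training points, each eigenvector is extended by a kernel-weighted sum scaled by its eigenvalue.

```python
import numpy as np

def nystrom_extend(k_new_train, eigvecs, eigvals):
    """Approximate eigenvector entries at new points via the Nystrom extension.

    k_new_train: (m, n) kernel evaluations between m new and n training points.
    eigvecs:     (n, r) top eigenvectors of the training kernel matrix.
    eigvals:     (r,)   corresponding eigenvalues (assumed nonzero).
    """
    # Each column j is k_new_train @ u_j / lambda_j (broadcast divide).
    return k_new_train @ eigvecs / eigvals
```

A useful sanity check: evaluating the extension at the training points themselves reproduces the original eigenvectors, since $K u_j / \lambda_j = u_j$.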

## 5. Error Bound and Complexity Computations for the Multiple SKL

In prediction problems, it is natural to ask about an upper bound for the error of prediction for a test point, called generalization error or risk. In a common characterization, it is shown that risk can be controlled by the empirical risk and the complexity of the hypothesis set. The hypothesis set is the collection of functions that construct the search space in empirical risk minimization (e.g., the set in section 2). We next recall some definitions and a generalization bound, which relates the error to the complexity of the function classes.

**Definition** *(Rademacher and gaussian complexity; Bartlett & Mendelson, 2003). Let us assume that $x_1, \ldots, x_n$ are independent and identically distributed random vectors in $\mathcal{X}$ and $\sigma_1, \ldots, \sigma_n$ are independent random variables taking values in $\{-1, +1\}$, each with probability $1/2$. Also, we assume that $g_1, \ldots, g_n$ are independent standard normal random variables. Both $(\sigma_i)$ and $(g_i)$ are independent of $(x_i)$; $\sigma_i$ is called a Rademacher random variable. For a set of functions $\mathcal{F}$, the Rademacher and gaussian complexities are defined as*
$$R_n(\mathcal{F}) = \mathbb{E} \sup_{f \in \mathcal{F}} \Big| \frac{2}{n} \sum_{i=1}^n \sigma_i f(x_i) \Big|, \qquad G_n(\mathcal{F}) = \mathbb{E} \sup_{f \in \mathcal{F}} \Big| \frac{2}{n} \sum_{i=1}^n g_i f(x_i) \Big|,$$
*respectively.*
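For a finite function class, these empirical complexities can be estimated by Monte Carlo; a sketch (our own helper, assuming each function is represented by its vector of values on the sample):

```python
import numpy as np

def empirical_complexities(F_vals, n_draws=2000, seed=0):
    """Monte Carlo estimates of the empirical Rademacher and gaussian
    complexities of a finite class; F_vals is (num_functions, n),
    row f holding the values f(x_1), ..., f(x_n)."""
    rng = np.random.default_rng(seed)
    m, n = F_vals.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))  # Rademacher draws
    g = rng.standard_normal((n_draws, n))               # gaussian draws
    rad = np.mean(np.max(np.abs(sigma @ F_vals.T), axis=1)) * 2.0 / n
    gauss = np.mean(np.max(np.abs(g @ F_vals.T), axis=1)) * 2.0 / n
    return rad, gauss
```

Such estimates are useful for checking the theoretical bounds below on small examples.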

These two measures of complexity are related to each other as follows.

**Theorem 3** *(Tomczak-Jaegermann, 1989). There are absolute constants $c$ and $C$ such that for every class $\mathcal{F}$ and every integer $n$,*
$$c\, R_n(\mathcal{F}) \;\le\; G_n(\mathcal{F}) \;\le\; C \,\log n \; R_n(\mathcal{F}).$$

**Theorem 4** *(Bartlett & Mendelson, 2003, corollary 15 and theorem 8). Consider a loss function $\ell: \mathcal{Y} \times \mathbb{R} \to [0, 1]$ and a function $\phi$ that dominates the loss function $\ell$, that is, $\phi(y, t) \ge \ell(y, t)$ for all $y \in \mathcal{Y}$ and $t \in \mathbb{R}$. We assume that $\phi$ is Lipschitz in its second argument. Let $\mathcal{F}$ be a class of functions $f: \mathcal{X} \to \mathbb{R}$, and let $(x_i, y_i)_{i=1}^n$ be independent copies of the random pair $(x, y)$. Then for any integer $n$ and any $\delta \in (0, 1)$, with probability at least $1 - \delta$, the following holds:*
$$\mathbb{E}\,\ell\big(y, f(x)\big) \;\le\; \frac{1}{n} \sum_{i=1}^n \phi\big(y_i, f(x_i)\big) + L_\phi\, G_n(\mathcal{F}) + \sqrt{\frac{8 \ln(2/\delta)}{n}},$$
*where $L_\phi$ is the Lipschitz norm of the dominating function $\phi$ and $G_n(\mathcal{F})$ denotes the gaussian complexity of the hypothesis set $\mathcal{F}$.*

The hypothesis set associated with a kernel $k$ takes the form $\mathcal{F}_K = \{x \mapsto \langle w, \Phi(x) \rangle : \lVert w \rVert \le B\}$, where $\Phi$ is some feature map for $k$. Therefore, the hypothesis set for the kernel class $\mathcal{K}$ can equivalently be written as $\mathcal{F}_{\mathcal{K}} = \{x \mapsto \langle w, \Phi_\theta(x) \rangle : \lVert w \rVert \le B,\ K_\theta \in \mathcal{K}\}$. For the empirical gaussian complexity of $\mathcal{F}_{\mathcal{K}}$, we have
$$\widehat{G}_n(\mathcal{F}_{\mathcal{K}}) = \frac{2B}{n}\, \mathbb{E} \sup_{K \in \mathcal{K}} \big( Z^\top K Z \big)^{1/2},$$
where $Z = (g_1, \ldots, g_n)^\top$ is a standard normal vector and $\Phi_\theta$ is some feature map for the kernel $K_\theta$. A similar result holds for the empirical Rademacher complexity, where the normal vector $Z$ is replaced by the Rademacher vector $\sigma = (\sigma_1, \ldots, \sigma_n)^\top$. (A comprehensive comparison of previously derived bounds for the Rademacher complexity of MKL can be found in Cortes et al., 2010.) They also provide a tight bound, stated below, which significantly improves the previous results.

**Theorem 5** *(Cortes et al., 2010). The empirical Rademacher complexity of the hypothesis set $\mathcal{F}_{\mathcal{K}}$ can be bounded as follows:*
$$\widehat{R}_n(\mathcal{F}_{\mathcal{K}}) \;\le\; B \sqrt{\frac{\eta_0\, e\, \lceil \log L \rceil \, \max_{1 \le l \le L} \mathrm{tr}(K_l)}{n^2}},$$
*where $\eta_0 = 23/22$, $\lceil x \rceil$ denotes the smallest integer greater than or equal to $x$, and the bound holds for $L > 1$.*

For the spectral kernel set, where each base is a rank 1 Gram matrix $v_l v_l^\top$ of a unit-norm vector (so that $\mathrm{tr}(v_l v_l^\top) = 1$), theorem 5 gives the empirical Rademacher complexity bound
$$\widehat{R}_n(\mathcal{F}) \;\le\; \frac{B}{n} \sqrt{\eta_0\, e\, \lceil \log L \rceil}, \tag{5.5}$$
where $\lceil x \rceil$ denotes the smallest integer greater than or equal to $x$. In the next section, we provide a bound for the spectral kernel set that shows the dependence of the gaussian complexity on the geometry of the kernel set.

### 5.1. A Geometrical Bound for Gaussian Complexity.

The right-hand side of equation 5.5 depends only on the number of kernel bases, or the size of the dictionary in the case of a spectral class, no matter how similar the bases are to each other. In the case of a spectral kernel set, the kernels are of rank 1 and carry no trace or eigenvalue information; therefore, the relation between the elements of the dictionary and the complexity is lost.

This missing connection between the dictionary and the complexity bound in theorem 5 motivates us to derive a new bound for the empirical gaussian complexity. There, we are able to use Slepian's lemma (see section A.3) in order to bring the geometry of the dictionary into the complexity's upper bound. The results are summarized in the following theorem.

**Theorem 6.** *Let us consider the spectral kernel set defined in equation 3.1. Then, for sufficiently large $L$, the empirical gaussian complexity $\widehat{G}_n(\mathcal{F})$ of the induced hypothesis set $\mathcal{F}$ satisfies the bounds stated in equations 5.6 and 5.7.*

In theorem 6, the bound depends on the geometry of the dictionary. The proof relies on the decoupling technique (Levina & Vershynin, 2011; De la Peña & Giné, 1999) after centering the gaussian term. For centering, we need to compute the expected value of the quadratic term, which can be done easily. The gaussian chaos term can then be split into a nested maximal form. Both maximal forms may be bounded using the maximal inequality for gaussian random variables and Slepian's lemma. More details about the proof are provided in section A.3. The results in theorem 6 can be extended to general MKL over full-rank kernel matrices.

The bound in equation 5.6 shows a dependence of the complexity on the logarithm of the number of kernel bases, similar to the empirical Rademacher complexity bound presented in equation 5.5. However, we obtain a larger constant, which is to be expected from theorem 3.

In addition, equation 5.7 relates the geometry of the dictionary to the empirical complexity via a geometric term. This term reflects the similarity between the bases and achieves its maximum when at least two orthogonal vectors are present in the dictionary. It suggests that the complexity can also be increased by the angles between the bases, which is additional information; this conclusion cannot be drawn from any of the previous bounds. The additional term in both bounds, equations 5.6 and 5.7, is due to the decoupling technique applied to the gaussian chaos.

## 6. Empirical Results

### 6.1. UCI Data Sets.

In this section, we present empirical comparisons of the classification accuracy between multiple SKL and state-of-the-art MKL approaches, such as simpleMKL, the extended-level method (Xu et al., 2009), and SILP (Sonnenburg et al., 2006). We perform this study on a selection of UCI classification data sets available for download from http://archive.ics.uci.edu/ml/. By classification accuracy, we mean the performance of an SVM with a kernel matrix obtained by an MKL algorithm. The classification accuracy using kernels from different methods is summarized in Table 1. For multiple SKL, we employ the linear programming toolbox of Mosek (www.mosek.com) in the Matlab environment (www.matlab.com). For other MKL methods, we used the toolbox provided in Sonnenburg et al. (2006), Rakotomamonjy et al. (2008), and Xu et al. (2009).

| Data Set Name | SimpleMKL | SILP | Level Method | Multiple SKL |
|---|---|---|---|---|
| Iono (n=175) | 92.10 ± 2.0 | 92.10 ± 1.9 | 92.10 ± 1.3 | 95.61 ± 2.51 |
| Time | 33.5 ± 11.6 | 1161.0 ± 344.2 | 7.1 ± 4.3 | 1.55 ± 0.15 |
| Pima (n=384) | 76.5 ± 1.9 | 76.9 ± 2.8 | 76.9 ± 2.1 | 98.19 ± 1.45 |
| Time | 39.4 ± 8.8 | 62.0 ± 15.2 | 9.1 ± 1.6 | 2.12 ± 0.75 |
| Sonar (n=104) | 79.1 ± 4.5 | 79.3 ± 4.2 | 79.0 ± 4.7 | 90.96 ± 2.97 |
| Time | 60.1 ± 29.6 | 1964.3 ± 68.4 | 24.9 ± 10.6 | 0.10 ± 0.08 |
| Heart (n=135) | 82.2 ± 2.2 | 82.2 ± 2.2 | 82.2 ± 2.2 | 82.87 ± 2.81 |
| Time | 4.7 ± 2.8 | 79.2 ± 38.1 | 2.1 ± 0.4 | 1.51 ± 0.45 |
| Wpbc (n=198) | 77.0 ± 2.9 | 77.0 ± 2.8 | 76.9 ± 2.9 | 72.92 ± 2.13 |
| Time | 7.8 ± 2.4 | 142.0 ± 122.3 | 5.3 ± 1.3 | 1.45 ± 0.34 |
| Wdbc (n=285) | 95.7 ± 0.8 | 96.4 ± 0.9 | 96.4 ± 0.8 | 88.41 ± 0.35 |
| Time | 122.9 ± 38.2 | 146.3 ± 48.3 | 15.5 ± 7.5 | 1.33 ± 0.62 |
| Vote (n=218) | 96.0 ± 1.1 | 95.7 ± 1.0 | 95.7 ± 1.0 | 95.03 ± 0.11 |
| Time | 23.7 ± 9.7 | 26.3 ± 12.4 | 4.1 ± 1.3 | 1.11 ± 0.43 |


Notes: The first line in each row displays the accuracy (mean ± standard deviation) of each method; the numbers in the second line show the learning time in seconds. Values in boldface indicate that they pass a two-sample *t*-test at the 95% confidence level.

In each experiment, 50% of all samples are randomly chosen for training and the rest for testing; this split is repeated a number of times. In Table 1, *n* shows the size of the training set. The second row for each data set shows the time in seconds spent finding the mixing coefficients. (We ran the code on a MacBook with Mac OS X 10.5 and 4 GB of memory.) The computational time for multiple SKL is the time spent learning the kernel. We used RBF kernels for each data set over a range of kernel widths. The singular values of the kernel matrices in this width range are in line with the low-rank assumption, so we can apply multiple SKL. We took 20 eigenvectors, corresponding to the 20 largest eigenvalues of the RBF kernel matrices, to build the spectral dictionary. The learned kernel is then plugged into an SVM with both the $\ell_2$-loss and the hinge-loss functions; both showed almost identical performance. For all sample sets, the penalization parameter of the SVM is selected using five-fold cross-validation, for all MKL methods as well as for multiple SKL separately.
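For reference, a family of RBF kernel matrices over several widths can be generated as sketched below; the exact widths used in the experiments are not reproduced here, and the standard Gaussian parameterization is assumed.

```python
import numpy as np

def rbf_kernels(X, widths):
    """One RBF kernel matrix per width: K_ij = exp(-||x_i - x_j||^2 / (2 s^2))."""
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # squared distances
    return [np.exp(-D2 / (2.0 * s**2)) for s in widths]
```

Each matrix in the returned list then contributes its top eigenvectors to the spectral dictionary.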

The results show that multiple SKL significantly improves the classification accuracy on four data sets and matches the state of the art on two others. Moreover, the computational time is significantly reduced in all cases.

### 6.2. Automatic Flower Recognition.

Flower recognition is an image-based classification task with a large number of classes, where multiple kernel learning has been shown to improve the classification rate (Nilsback & Zisserman, 2008). In our experiments, we use multiple SKL on the flower images provided at www.robots.ox.ac.uk/~vgg/data/flowers/index.html. The Web page provides a set of distance matrices containing the distances between image samples. We used these matrices to generate RBF kernel matrices over a range of kernel widths. Four features are used to compute the distance matrices, consisting of histograms of gradient orientations, HSV values of the pixels, and the scale-invariant feature transform, sampled on both the foreground region and its boundary. The training set contains 1000 images with 17 classes, and the test set contains 361 samples. We set aside 340 samples from the training set for validation.

In this experiment, we consider two settings, inductive and transductive. The classification accuracy over all classes is higher in the transductive setting than in the inductive setting. The best result reported by MKL methods in Varma and Ray (2007) and Nilsback and Zisserman (2008) for the inductive setting is significantly lower than our result.

### 6.3. Protein Subcellular Localization.

The prediction of the subcellular localization of proteins is important in cell biology. MKL has been successfully applied to this problem (Zien & Ong, 2007). A set of kernels is designed for this data set for four organism groups: plants, nonplant eukaryotes, and Gram-positive and Gram-negative bacteria. There are 69 kernels available for download from the project's Web site (http://www.fml.tuebingen.mpg.de/raetsch/suppl/protsubloc; for details, see Kloft et al., 2010). We used the resampling index provided with the data set to design our numerical experiments. Kernels are normalized with multiplicative normalization, as formulated in Kloft et al. (2010) and Zien and Ong (2007). Then we take the top 20 eigenvectors of each kernel matrix to build the spectral dictionary. The classification results, in terms of Matthews's correlation coefficient (MCC; Zien & Ong, 2007), are shown in Table 2. The results from Kloft et al. (2010) are obtained using SILP.

| Organism | Kloft et al. (2010) | Multiple SKL |
|---|---|---|
| Plant | 8.18 ± 0.47 | 8.18 ± 0.47 |
| Nonplant | 8.97 ± 0.26 | 9.01 ± 0.25 |
| Gram-positive | 9.99 ± 0.35 | 9.87 ± 0.34 |
| Gram-negative | 13.07 ± 0.66 | 13.07 ± 0.07 |


The classification accuracy is similar to that of SILP (as reported in Kloft et al., 2010), whereas the computation and memory demands of multiple SKL are greatly reduced, as shown in Table 1. Here, multiple SKL does not uniformly improve the classification rate; however, it shows that the concept of low-rank kernel learning is valid and can be utilized in problems of similar size.
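The kernel normalization used in this experiment can be sketched as follows. This shows cosine (multiplicative) normalization, a common variant; the exact scheme in Kloft et al. (2010) and Zien and Ong (2007) may differ in its details.

```python
import numpy as np

def normalize_kernel(K):
    """Cosine (multiplicative) normalization: K_ij / sqrt(K_ii * K_jj),
    so that every sample has unit self-similarity."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)
```

Normalizing all base kernels to a common scale prevents any single kernel from dominating the mixture purely through its magnitude.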

## 7. Discussion

Multiple kernel learning has been widely used for kernel selection in SVMs and similar kernel methods. However, the size of the kernel matrices increases with the sample size, which limits the use of MKL methods due to high memory and computational demands. Many large kernel matrices, however, are of low efficient rank. In this letter, we proposed an efficient approach for rank 1 kernel learning that exploits this property. Empirical results show that multiple SKL improves the computational time in all cases and the accuracy of prediction in most cases.

In addition, we derived a bound for the gaussian complexity of the spectral kernel set and showed that the complexity of the hypothesis set generated by that kernel set is controlled by its diameter and the size of the dictionary, while previous bounds depend only on the number of kernel bases. The proof can be extended to general kernel sets; in that case, the extra term in theorem 6 is replaced by the trace of the kernel matrices when all kernel bases have the same trace (e.g., after normalization).

## Appendix: Proofs

### A.1. Proof of Theorem 1.

*K*, we obtain where is the dual-vector variable of . By plugging equation A.2 into equation 2.3 for class , we have Applying the scaling trick, which is , such that and , to , results in By solving the preceding display over variable , we obtain Note that by rewriting to , we get The Lagrangian of is , where is the dual vector of . By checking the optimality conditions (Steinwart & Christmann, 2008), we can remove the variable ; therefore, reads The primal variables can be obtained by checking the optimality conditions, which result in .

### A.2. Proof of Theorem 2.

Here, $\lambda_i$ denotes the $i$th largest eigenvalue of the kernel matrix $K$. The ERM tends to find a vector such that the summands corresponding to small eigenvalues are zero.

### A.3. Proof for Theorem 6.

The proof of theorem 6 relies on decoupling of the gaussian chaos, Slepian's lemma, and maximal inequalities, which we introduce shortly.

**Lemma** *(Ledoux & Talagrand, 2011). Let $g_1, \ldots, g_L$ be gaussian random variables with variances at most $\sigma^2$. Then we have*
$$\mathbb{E} \max_{1 \le i \le L} g_i \;\le\; 3\, \sigma \sqrt{\log L}.$$
*This result also holds for subgaussian random variables, where the coefficient 3 is replaced by a constant $C > 0$.*

First, the centered gaussian chaos is decoupled, which is valid for sufficiently large $L$. Second, the resulting nested maxima are bounded: in the first step, we use the gaussian maximal inequality, and in the second step, we use Slepian's lemma. Under the assumption that the elements of the dictionary have norm 1, the remaining expectation terms can be bounded explicitly. By replacing the expectation terms in equation A.9 with the upper bounds derived in equations A.10 and A.11, we obtain the claimed results, equations 5.6 and 5.7, respectively.

### A.4. On the Effect of Low-Rank Approximation in the SVM Solution.

Let us consider a kernel matrix $K$ and a perturbation of it, say $\widetilde{K}$. We want to quantify the difference between the SVM solutions obtained with $K$ and with $\widetilde{K}$. The following result (theorem 7) provides an estimate of this error, which appeared in Bousquet and Elisseeff (2002). For completeness, we also provide a similar proof here.

**Theorem 7.** *Let us consider kernel functions $k$ and $\widetilde{k}$ and denote their feature maps by $\Phi$ and $\widetilde{\Phi}$, respectively. For the sample set, we denote the corresponding feature matrices by $\Phi$ and $\widetilde{\Phi}$ as well. We assume that the feature maps are finite-dimensional.*

By the Lipschitz assumption, we can bound the difference of the objective values along the segment between the two solutions. Let us take the limit of this expression as the step size tends to zero. Then, by the Cauchy-Schwarz inequality, we obtain the stated estimate.

The kernel matrix $K$ has the eigenvalue decomposition $K = \sum_{i=1}^n \lambda_i u_i u_i^\top$, which defines a feature map whose $i$th coordinate is $\sqrt{\lambda_i}\, u_i$. Let us assume that the low-rank approximation $\widetilde{K}$ contains the top $R$ eigenvectors of the kernel matrix $K$; then it has a similar feature map, truncated after the first $R$ coordinates. Using theorem 7, the difference between the two SVM solutions is controlled by $\lambda_{R+1}$, the largest eigenvalue excluded from the approximation. On the other hand, both solutions admit representations in terms of the same eigenvectors with different weights. If we further assume that these weights are bounded by some $T > 0$, we obtain the stated estimate.

## Acknowledgments

I thank Laurent El Ghaoui for introducing the rank 1 kernel learning problem and for extensive discussions, and Peter Bickel for his valuable comments. Part of this research was done while I was visiting Peter Bickel in the Department of Statistics at the University of California, Berkeley, in 2008–2009. I also acknowledge the anonymous referees' very helpful comments.