Abstract

Estimating the Rademacher chaos complexity of order two is important for understanding the performance of multikernel learning (MKL) machines. In this letter, we develop a novel entropy integral for Rademacher chaos complexities. Compared with previous bounds, our result improves on them by introducing an adjustable parameter ε that prevents the divergence of the involved integral. Using the iteration technique of Steinwart and Scovel (2007), we also apply our Rademacher chaos complexity bound to MKL problems and improve existing learning rates.

1.  Introduction

During the last few years, kernel-based methods such as support vector machines have found great success in solving supervised learning problems like classification and regression (Schölkopf & Smola, 2002; Shawe-Taylor & Cristianini, 2004). For these tasks, the performance of the learning algorithms largely depends on the data representation via the choice of kernel functions, which are typically handcrafted and fixed in advance (Varma & Babu, 2009; Ying & Campbell, 2010). While this provides an opportunity to incorporate prior knowledge into the learning process, it can also be difficult in practice to find prior justification for the use of one kernel instead of another (Lanckriet, Cristianini, Bartlett, Ghaoui, & Jordan, 2004). In this context, multikernel learning (MKL) has been introduced to tackle this issue (Rakotomamonjy, Bach, Canu, & Grandvalet, 2007; Micchelli & Pontil, 2005; Lanckriet et al., 2004). Given a set of candidate (base) kernels, MKL searches for the most appropriate kernel using the training data. This strategy can significantly enhance the interpretability of the decision function, and it also reflects the fact that typical learning problems often involve multiple, heterogeneous data sources (Rakotomamonjy et al., 2007; Sonnenburg, Rätsch, & Schäfer, 2006).

Although learning the kernel from data allows greater flexibility in matching the target function (Srebro & Ben-David, 2006), this alone is not sufficient to guarantee the quality of the obtained model, since one also needs to take into account the capacity of the hypothesis space. In this letter, we seek to characterize this capacity by the so-called Rademacher chaos complexity of order two.

Rademacher complexities have proved to be a powerful tool for measuring the complexity of function classes (Bartlett & Mendelson, 2002; Mendelson, 2003). In comparison with other complexity measures, such as covering numbers, VC dimension, and pseudodimension, analysis based on the Rademacher complexity can often lead to slightly faster learning rates, since it is directly related to the behavior of the associated uniform deviation and can capture precisely the quantity we are interested in (Bousquet, 2003). Consequently, characterizing the Rademacher complexity is of great significance in learning theory. Recently, Ying and Campbell (2010) introduced Rademacher chaos complexities of order two into the discussion of MKL machines' learning rates. It has been demonstrated that the Rademacher chaos complexity inherits the advantage of the Rademacher complexity in that it can also yield an improvement in error analysis (Ying & Campbell, 2009, 2010; Chen & Li, 2010). In their work, Ying and Campbell provided a comprehensive study of Rademacher chaos complexities and offered several novel results, such as the connection between Rademacher chaos complexities and MKL, structural results on Rademacher chaos complexities, and the related entropy integrals, thus justifying the use of Rademacher chaos complexities as an appropriate tool for treating learning rates. Other theoretical studies on the generalization performance of MKL can be found in Bousquet and Herrmann (2002), Lanckriet et al. (2004), Srebro and Ben-David (2006), Wu, Ying, and Zhou (2007), and Ying and Zhou (2007).

However, the Rademacher chaos complexity bounds in Ying and Campbell (2010), which are based on the standard entropy integral, may not be useful when the involved integral diverges, which occurs naturally if the logarithm of the covering number or the associated fat-shattering dimension grows at a polynomial rate with an exponent not less than 1 (Mendelson, 2003). This letter tackles this issue by presenting some refined entropy integrals for Rademacher chaos complexities. We use the standard Cauchy-Schwarz inequality to relate the Rademacher chaos complexity of the original class to that of an ε-cover, which can then be tackled by a chaining argument to yield a bound of the desired form. Compared with the previous results, our bound admits some possible superiority since it allows the parameter ε to traverse an interval to attain a balance between the accuracy of the approximation by a cover and the complexity of that cover. With the aid of the iteration trick proposed by Steinwart and Scovel (2007), we also consider the application of our results to MKL problems and improve the existing learning rates.

This letter is organized as follows. Section 2 introduces the statement of the problem. In section 3, we establish the refined entropy integral for Rademacher chaos complexities and consider its application to some specific function classes. Section 4 illustrates the effectiveness of our result by investigating MKL machines’ learning rates. Our conclusions are presented in section 5.

2.  Statement of the Problem

Before formulating our problem, we introduce some notation used throughout this letter. For a function f, the sign function sgn(f) is defined by sgn(f)(x)=1 if f(x)≥0 and sgn(f)(x)=−1 if f(x)<0. We use c to denote positive constants whose values may change from line to line or even within the same line.

2.1.  Rademacher Chaos Complexities.

Definition 1
 (Rademacher chaos complexities). Let P be a probability measure, defined on the underlying space, from which the examples are independently drawn. Let a class of functions on that space be given, together with n independent Rademacher random variables, that is, random variables taking the values +1 and −1 each with probability 1/2. The homogeneous Rademacher chaos process of order two, with respect to the Rademacher variables, is the random variable system indexed by the function class that couples the function values at pairs of examples with products of the corresponding Rademacher variables. The empirical Rademacher chaos complexity of order two for the class is defined as the expectation of its supremum:
formula

For simplicity, we always mean empirical Rademacher chaos complexities of order two when referring to Rademacher chaos complexities.
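To make definition 1 concrete, the following is a minimal numerical sketch (not code from the cited works) that estimates the order-two chaos complexity of a small, finite dictionary of gaussian kernels by Monte Carlo. It assumes the unnormalized convention in which the chaos sums over pairs i < j without a 1/n factor; the normalization used in this letter may differ by such a factor.

import numpy as np

# Monte Carlo sketch of the empirical Rademacher chaos complexity of order two
# for a finite dictionary of kernels.  Convention assumed (may differ from the
# letter by a normalization factor):
#   U_hat = E_eps  max_k | sum_{i<j} eps_i eps_j K_k(x_i, x_j) |.

def chaos_complexity(kernel_mats, n_trials=2000, seed=0):
    """Estimate E_eps max_k |sum_{i<j} eps_i eps_j K_k[i, j]| by Monte Carlo."""
    rng = np.random.default_rng(seed)
    n = kernel_mats[0].shape[0]
    iu = np.triu_indices(n, k=1)                  # index pairs with i < j
    vals = np.empty(n_trials)
    for t in range(n_trials):
        eps = rng.choice([-1.0, 1.0], size=n)     # Rademacher variables
        signs = np.outer(eps, eps)[iu]            # eps_i * eps_j for i < j
        vals[t] = max(abs(np.sum(signs * K[iu])) for K in kernel_mats)
    return vals.mean()

# Usage: three gaussian kernels with different bandwidths on random inputs.
x = np.random.default_rng(1).normal(size=(50, 2))
sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
kernels = [np.exp(-sq_dists / (2.0 * s ** 2)) for s in (0.5, 1.0, 2.0)]
print(chaos_complexity(kernels))

The same routine extends to larger kernel dictionaries; for infinite classes, the maximum over the dictionary becomes a supremum, which is exactly what the entropy integrals of section 3 are designed to control.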

Definition 2
 (Rademacher complexities). Let P be a probability measure, defined on the underlying space, from which the examples are independently drawn, and let there be n independent Rademacher random variables. For a class of functions, the (ordinary) empirical Rademacher complexity is defined as
formula

It is worth mentioning that the Rademacher process, indexed by the functions in the class, can also be regarded as the homogeneous Rademacher chaos process of order one. Beyond Rademacher chaos complexities of order one or two, De la Peña and Giné (1999) also provided a definition of Rademacher chaos complexities of arbitrary order.
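For concreteness, and in generic notation that may differ from the one used in this letter, the homogeneous Rademacher chaos processes of order one, order two, and a general order d indexed by a function f take the forms
\[
\sum_{i=1}^{n}\varepsilon_i f(x_i), \qquad
\sum_{1\le i<j\le n}\varepsilon_i\varepsilon_j f(x_i,x_j), \qquad
\sum_{1\le i_1<\cdots<i_d\le n}\varepsilon_{i_1}\cdots\varepsilon_{i_d} f(x_{i_1},\ldots,x_{i_d}),
\]
possibly up to a normalization factor depending on n.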

Definition 3
 (covering numbers). Let a metric space be given and fix a subset of it. For any r>0, a set is called an r-cover of the subset if for every element of the subset we can find an element of the cover within distance r of it. The covering number is the cardinality of a minimal r-cover:
formula
For brevity, when the underlying space is equipped with a norm, we also write the covering number with respect to the metric induced by that norm.
The quantitative relationship between Rademacher complexities and covering numbers can be formulated by the following far-reaching inequality (Srebro, Sridharan, & Tewari, 2010; Srebro & Sridharan, 2010),
formula
2.1
where the norm is the empirical norm induced by the sample. Inequality 2.1 can be considered a refinement of Dudley's original entropy integral, which takes the following form (Dudley, 1999):
formula
2.2

As compared to equation 2.2, the refined version, equation 2.1, is more effective and can be applied to the case where the integral in equation 2.2 diverges. The first main purpose of this letter is to derive a novel characterization analogous to equation 2.1 for Rademacher chaos complexities. In this respect, our work can be considered a generalization of the refined entropy integral equation 2.1, from Rademacher complexities to Rademacher chaos complexities.
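As a point of reference, and up to universal constants and notational differences with the letter's own statements, the two entropy integrals take the following forms for a class \(\mathcal{F}\) of diameter D under an empirical metric \(d_x\):
\[
\hat{R}_n(\mathcal{F}) \;\lesssim\; \int_{0}^{D}\sqrt{\frac{\log \mathcal{N}(\mathcal{F}, r, d_x)}{n}}\, dr
\qquad \text{(Dudley-type bound, cf. equation 2.2)},
\]
\[
\hat{R}_n(\mathcal{F}) \;\lesssim\; \inf_{\epsilon\in(0,D)}\left\{\epsilon + \int_{\epsilon}^{D}\sqrt{\frac{\log \mathcal{N}(\mathcal{F}, r, d_x)}{n}}\, dr\right\}
\qquad \text{(refined bound, cf. equation 2.1)}.
\]
Letting \(\epsilon \to 0\) in the second display recovers the first, while a positive \(\epsilon\) keeps the lower limit of integration away from the region where the entropy may blow up.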

2.2.  The Kernel Learning Problem.

In the context of classification, we are usually given an input space and an output space. The product space is always assumed to be equipped with a probability measure that governs the relationship between the input variable X and the output variable Y. Given a sequence of examples drawn independently from this measure, the goal of learning is to construct a discriminant function f so that the associated classifier sgn(f) is able to predict as accurately as possible when new data arrive (Anthony & Bartlett, 1999). The error incurred by using f to classify a point (x, y) can be quantified by a nonnegative loss function evaluated at y and f(x). Consequently, the quality of a classifier can be assessed by the expected loss incurred when it is used for prediction:
formula
2.3
which is usually called the generalization error or risk of f (with respect to the loss). However, since the generalization error involves an unknown probability measure, it is not computable, and its empirical counterpart is used instead. This empirical counterpart, obtained by averaging the loss over the sample, is called the empirical error, and it is used to guide the process of choosing a prediction rule from a candidate function class called the hypothesis space. This letter considers only the hinge loss, which is a standard choice due to its computational and statistical merits (Zhang, 2004).
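In symbols, and with notation introduced here for illustration (the letter's own symbols may differ), the hinge loss, the generalization error, and the empirical error over a sample \(\{(x_i, y_i)\}_{i=1}^{n}\) drawn from a measure \(\rho\) can be written as
\[
\phi(t) = \max\{0,\, 1-t\}, \qquad
\mathcal{E}(f) = \int \phi\big(y f(x)\big)\, d\rho(x,y), \qquad
\mathcal{E}_{\mathbf{z}}(f) = \frac{1}{n}\sum_{i=1}^{n}\phi\big(y_i f(x_i)\big).
\]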
We say that a function K of two input variables is a Mercer kernel (Cucker & Smale, 2002; Zhou, 2002) if it is continuous, symmetric, and positive semidefinite; that is, for any finite set of distinct points and any real coefficients, the associated quadratic form is nonnegative. The reproducing kernel Hilbert space (RKHS) associated with a Mercer kernel K is defined to be the closure of the linear span of the functions K(x, ·), equipped with the inner product determined by ⟨K(x, ·), K(x′, ·)⟩ = K(x, x′). An important property making the RKHS an appropriate choice as the candidate hypothesis space is the so-called reproducing property (Cucker & Zhou, 2007), which says that
formula
In the context of multikernel learning, we are given a class of kernels, and our purpose is to learn a function from one of the RKHSs that is most suitable to model the data. We consider here the regularized learning scheme to avoid the possibility of overfitting. Formally, the general MKL scheme can be cast as a two-layer minimization problem,
formula
2.4
where the last term involves the regularization parameter. This parameter should be chosen according to the sample size n to balance the goodness of fit, as assessed by the empirical error, and the simplicity of the model, as assessed by the RKHS norm.
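A hedged sketch of the two-layer scheme 2.4, in notation introduced here for illustration (the letter's own symbols may differ): given a candidate kernel class \(\mathcal{K}\), a regularization parameter \(\lambda > 0\), and the hinge loss \(\phi\), the regularized MKL estimator can be written as
\[
f_{\mathbf{z}} \;=\; \arg\min_{K\in\mathcal{K}}\;\min_{f\in\mathcal{H}_K}\;\left\{\frac{1}{n}\sum_{i=1}^{n}\phi\big(y_i f(x_i)\big) \;+\; \lambda\,\|f\|_{K}^{2}\right\},
\]
where \(\mathcal{H}_K\) denotes the RKHS of the kernel K and \(\|\cdot\|_K\) its norm; whether the penalty enters as the squared norm or the norm itself follows the convention of the cited MKL literature and may differ from the display above.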

Our second goal is to use the refined bounds on Rademacher chaos complexities to study the generalization ability of the resulting classifier; this is the theme of section 4.

3.  Refined Bounds on Rademacher Chaos Complexities

This section is devoted to establishing a refined bound on Rademacher chaos complexities when information on covering numbers is available. Let a set of functions and a sample of n points be given. Then one can define a pseudometric on the function set, with respect to the sample, as follows:
formula
3.1
Under appropriate boundedness assumptions on the class, Ying and Campbell (2010) offered a characterization of the form1
formula
3.2
The inequality 3.2, which can be viewed as a generalization of Dudley's entropy integral, equation 2.2, is novel in that it builds a connection between Rademacher chaos complexities and covering numbers. Ying and Campbell (2010) also used such bounds to estimate learning rates for the kernel learning problem and obtained some interesting results. However, such a bound may not be useful when the involved integral diverges, which can happen in some interesting situations (Mendelson, 2003).

Noticing that the divergence of the entropy integral is due to the rapid growth of the covering number as r approaches zero, we naturally ask whether a bound exists in which the lower limit of integration is kept away from 0. As we will see, this can be achieved by introducing a threshold ε to prevent the divergence that would result from too fine an approximation.

For this purpose, we first consider the Rademacher chaos complexity of an ε-cover of the original class. Viewed as an empirical process indexed by the elements of this finite cover, the Rademacher chaos complexity is simply the L1-norm of the maximum of a finite class of random variables, and it can be addressed by the powerful maximal inequalities of empirical process theory.
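As a generic illustration of such a maximal inequality (this is not the specific inequality 3.6 invoked below), if \(X_1,\ldots,X_N\) are mean-zero sub-gaussian random variables with common parameter \(\sigma\), then
\[
\mathbb{E}\Big[\max_{1\le j\le N}|X_j|\Big] \;\le\; \sigma\sqrt{2\log(2N)},
\]
so the price of maximizing over a finite class enters only through the logarithm of its cardinality. For Rademacher chaos variables the sub-gaussian tail must be replaced by a heavier one, but the logarithmic dependence on the cardinality of the cover persists, which is what makes the chaining argument below work.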

3.1.  Refined Rademacher Chaos Complexity Bounds.

The following theorem provides an example of a maximal inequality sufficient for our purpose; it can be considered a modification of equation 3.2 in that we provide a maximal inequality for the ε-cover rather than for the whole class. Its proof is based on the chaining argument (Dudley, 1999; Anthony & Bartlett, 1999), whose basic idea is to split each function of interest into a sum of functions chosen from a sequence of classes of progressively increasing complexity. The desired maximal inequality for the cover is then obtained by combining maximal inequalities for all of these classes. The underlying reason that we can allow the lower limit of integration to stay away from 0 is that elements of the ε-cover can be represented exactly without resorting to covers of radius smaller than ε.
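In generic notation (introduced here only for illustration), the chaining decomposition underlying the proof is the telescoping identity
\[
f \;=\; f_0 + \sum_{j=1}^{N}\big(f_j - f_{j-1}\big), \qquad f_j \in \mathcal{F}_j,\quad d_x(f, f_j)\le 2^{-j}D,
\]
where \(\mathcal{F}_j\) is a cover of the class at the dyadic radius \(2^{-j}D\), \(f_j\) is a closest element of \(\mathcal{F}_j\) to f, the coarsest level is taken to be \(\{0\}\), and the finest level N is chosen so that f itself belongs to \(\mathcal{F}_N\). The supremum of the chaos process is then bounded by summing maximal inequalities over the finite classes of links \(\{f_j - f_{j-1}\}\).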

Theorem 1. 
Let a class of functions be given, satisfying the boundedness assumptions above, where the metric dx is defined in equation 3.1. For any ε>0, fix a minimal ε-cover of the class with respect to dx. Then we have the following inequality controlling the Rademacher chaos complexity of the cover:
formula
3.3
Proof. 
Let N be the minimal integer not less than log2(D/ε). For each j = 0, 1, …, N, fix a minimal cover of the class at radius 2^{-j}D with respect to the metric dx. Without loss of generality, we can assume that the coarsest cover consists of the zero function alone and that the finest cover is the given ε-cover itself. For any f in the ε-cover and any j, we denote by fj the closest point to f in the jth cover; by the definition of a cover, the distance from f to fj is at most 2^{-j}D. Now, for any f in the ε-cover, one can rewrite it in telescoping form,
formula
from which, together with the fact that f0=0, one can derive that
formula
3.4
The definition of a cover, coupled with the triangle inequality, implies that for any j we have
formula
3.5
Ying and Campbell (2010) established the following maximal inequality for any t>1 and any finite class of functions:2
formula
3.6
Combining inequalities 3.5 and 3.6 and noticing that the cardinality of the involved finite class is controlled by the relevant covering numbers, we obtain
formula
3.7
When we plug inequality 3.7 back into 3.4, the monotonicity of the covering number in the radius implies that
formula
In the last step of the deduction, we have used two elementary facts that are obvious from the construction of the covers. Since the above inequality holds for any t>1, the desired result, equation 3.3, is immediate.

With theorem 1 as a stepping-stone, we are now ready to present the first main result of this letter. We first fix an ε temporarily and construct an ε-cover for the initial function class, thus splitting the original problem into two parts: the chaos complexity of the cover and the approximation error incurred by passing to the cover. The former can be directly controlled by theorem 1 to yield a bound involving an integral of the desired form, while the latter can be addressed by the standard Cauchy-Schwarz inequality. The following theorem is motivated by the recent work of Srebro and Sridharan (2010), who established the improvement, equation 2.1, of the standard entropy integral for the ordinary Rademacher complexity.
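The Cauchy-Schwarz step can be sketched as follows, in the unnormalized convention and with generic notation introduced for illustration: for any function g (here g plays the role of the difference between f and its closest element in the cover),
\[
\Big|\sum_{1\le i<j\le n}\varepsilon_i\varepsilon_j\, g(x_i,x_j)\Big|
\;\le\;\Big(\sum_{1\le i<j\le n}(\varepsilon_i\varepsilon_j)^2\Big)^{1/2}
\Big(\sum_{1\le i<j\le n}g(x_i,x_j)^2\Big)^{1/2}
\;=\;\sqrt{\tfrac{n(n-1)}{2}}\;\Big(\sum_{1\le i<j\le n}g(x_i,x_j)^2\Big)^{1/2},
\]
and the last factor is controlled by ε through the definition of the metric dx; the exact power of n that appears in inequality 3.8 depends on the normalization adopted in equation 3.1.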

Theorem 2. 
Let a class of functions be given, satisfying the same assumptions as in theorem 1, with the metric dx defined in equation 3.1. Then for any ε in (0, D), there holds
formula
3.8
Proof. 
For any ε in (0, D), fix a minimal ε-cover of the class with respect to the metric dx, and for any f in the class, consider its closest element in the cover. To tackle the quantity of interest, we first decompose it into two parts:
formula
3.9
The first term can be addressed through the Cauchy-Schwarz inequality:
formula
3.10
As for the remaining term, one can take t=2 in equation 3.3 to derive that
formula
3.11
When we put inequalities 3.10 and 3.11 back into 3.9, the promised inequality 3.8 follows directly.
Remark 1. 

As can be seen from inequality 3.8, our bound can be at least as sharp as inequality 3.2, since the parameter ε in inequality 3.8 can be made arbitrarily close to 0. Moreover, ε is permitted to traverse the interval (0, D) to achieve a trade-off between the accuracy of the approximation by the cover and the complexity of that cover, which effectively prevents the divergence of the integral over the interval (0, D). Theorem 2 can be readily extended to Rademacher chaos complexities of order larger than two.

3.2.  Rademacher Chaos Complexities for Some Specific Classes.

As a specific example to show the possible superiority of our result over the previous one, we consider the Rademacher chaos complexities of function classes for which the logarithm of the covering number grows as a polynomial of 1/r.

Corollary 1. 
Let a class of functions be given, satisfying the same boundedness assumptions as above. Suppose that there exist constants c>0 and p>0 such that the logarithm of the covering number at radius r is bounded by c r^{-p} for every r>0 and every sample x. Then the homogeneous Rademacher chaos complexity of order two can be bounded as follows:
formula
3.12
Proof. 

We proceed with our discussion by distinguishing three cases (a simplified computation illustrating how the three regimes arise is sketched after the list):

  • Case p<1. In this case, one can invoke theorem 2 here to show that
    formula
    where the infimum is bounded by a suitable choice of ε and the last inequality follows from the condition p<1.
  • Case p=1. Now it follows from theorem 2 that
    formula
    where the infimum is bounded by a suitable choice of ε.
  • Case p>1. For such p, arguing analogously to the case p<1, one obtains that
    formula
    where the last inequality is obvious since p>1.
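To see how the three regimes arise, consider the simplified optimization obtained by inserting the polynomial entropy condition into a bound of the generic form \(\inf_{\epsilon\in(0,D)}\{A\epsilon + B\int_{\epsilon}^{D} r^{-p}\,dr\}\), where the constants A and B absorb the sample-size factors (the actual constants in theorem 2 differ):
\[
\int_{\epsilon}^{D} r^{-p}\, dr \;=\;
\begin{cases}
\dfrac{D^{1-p}-\epsilon^{1-p}}{1-p}, & p<1 \ (\text{bounded as } \epsilon\to 0),\\[1.5ex]
\log(D/\epsilon), & p=1,\\[1ex]
\dfrac{\epsilon^{1-p}-D^{1-p}}{p-1}, & p>1 \ (\text{divergent as } \epsilon\to 0).
\end{cases}
\]
Hence for p<1 one may let ε tend to 0; for p=1 the balance \(\epsilon\asymp B/A\) costs only a logarithmic factor; and for p>1 minimizing \(A\epsilon + B\epsilon^{1-p}/(p-1)\) gives \(\epsilon\asymp(B/A)^{1/p}\) and a bound of order \(A^{1-1/p}B^{1/p}\).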

Notice that a bound of the form 3.2 cannot tackle the case p≥1, since the involved integral diverges for such p. Here are some explicit function classes that satisfy the entropy conditions in corollary 1.

Example 1. 
Micchelli and Pontil (2005) considered the problem of finding an optimal kernel from the convex hull of a basic kernel class :
formula
If the base class admits a finite pseudodimension d, then the covering numbers of its convex hull can be controlled by the exponential of a power of 1/r (Van der Vaart & Wellner, 1996, corollary 2.6.12):
formula
3.13
where K is a constant independent of r. If d>2, the exponent of 1/r in equation 3.13 is larger than 1, and thus a direct application of the traditional entropy integral, equation 3.2, is not possible in this situation. Although corollary 1 can be applied here, a better strategy is to employ the structural property (Ying & Campbell, 2010), which reduces the discussion of the convex hull to that of the base class. Interested readers are referred to Srebro and Ben-David (2006) for some interesting base classes with finite pseudodimension (e.g., gaussian kernels parameterized by covariance matrices).
Example 2. 
We consider here an example that can itself be viewed as a ball in an RKHS on the space of kernels. Specifically, consider the compounded index set consisting of pairs of input points, and let a Mercer kernel be defined on this compounded set. Its associated RKHS is then a Hilbert space of functions of pairs of inputs, and we define the candidate class as a ball of radius R in this Hilbert space, that is,
formula
3.14
Rejchel (2012) considered function spaces of the form 3.14 in the context of ranking, where estimating their Rademacher chaos complexity is important for understanding the behavior of the associated estimators. Ong, Smola, and Williamson (2005) also provided a novel approach to learning the kernel in such a space by minimizing a regularized quality functional; the kernel defined on the compounded index set is called a hyperkernel in their setting.3
If the input space is a closed subset of a d-dimensional Euclidean space with piecewise smooth boundary and the Mercer kernel on the compounded set belongs to a generalized Lipschitz space with smoothness parameter s, then for any 0<r<R, the covering numbers satisfy the growth rate (Cucker & Zhou, 2007, theorem 5.8):
formula
3.15
Consequently, the exponent 4d/s in equation 3.15 would exceed 1 if the smoothness parameter s is smaller than 4d.
Example 3. 
Let a sequence of Mercer kernels Ki be given, together with a sequence of positive numbers, subject to suitable boundedness and summability conditions. Consider the class of kernels that can be expressed as an (infinite) nonnegative combination of the Ki under the “l2-constraint”:
formula
3.16
Reformulating the class 3.16 as
formula
and applying corollary 3 in Zhang (2002) to tackle its covering numbers, we obtain4
formula

4.  Applications to the Multikernel Learning Problem

In this section, we demonstrate the effectiveness of the previous results on Rademacher chaos complexities by showing how they can be applied to the kernel learning problem. The foundation of our discussion is a novel connection (theorem 4) between the uniform deviation and the Rademacher chaos complexities, due to Ying and Campbell (2010).

Our main aim is to show the Bayes consistency of the multikernel regularized classifier defined in equation 2.4 by providing satisfactory estimates for the excess misclassification error, that is, the difference between the risk of the prediction rule sgn(f), as assessed by the 0-1 loss, and the risk of the Bayes rule. The following theorem (Zhang, 2004; Bartlett, Jordan, & McAuliffe, 2006), called the comparison inequality, implies that the excess misclassification error can be controlled by the excess hinge risk, that is, the difference between the hinge risk of f and the minimal hinge risk over measurable functions. Consequently, a discussion of the excess hinge risk is sufficient for our purpose.

Theorem 3
 (Zhang, 2004). For the hinge loss, we have the following inequality for any measurable function f:
formula
4.1
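Up to the notational conventions of the letter, a standard form of inequality 4.1 for the hinge loss reads as follows: writing \(\mathcal{R}(\cdot)\) for the misclassification risk, \(f_c\) for the Bayes rule, and \(\mathcal{E}(\cdot)\) for the hinge risk with minimizer \(f_\phi\),
\[
\mathcal{R}\big(\mathrm{sgn}(f)\big) - \mathcal{R}(f_c) \;\le\; \mathcal{E}(f) - \mathcal{E}(f_\phi),
\]
so that any bound on the excess hinge risk immediately yields a bound of the same order on the excess misclassification error.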
For any value of the regularization parameter, we take
formula
A standard approach to addressing the excess hinge risk is to relate it to two associated errors through the following error decomposition (Wu et al., 2007),
formula
4.2
where the regularization error,
formula
expresses the approximation power of the involved space, and the sample error
formula
reflects the difficulty of selecting the optimal classifier in a given hypothesis space; the minimizer defining the regularization error is called the regularization function. For simplicity, we assume the existence of the regularization function and of the empirical classifier (Wu et al., 2007). Throughout this section, we suppose that the regularization error decays polynomially in the regularization parameter, that is, that it is bounded by a constant c times a positive power of the regularization parameter. This is a standard assumption, and there are many sufficient conditions guaranteeing the polynomial decay of the regularization error (Wu et al., 2007; Chen, Wu, Ying, & Zhou, 2004; Chen, Pan, Li, & Tang, 2013). We also assume that the kernels under consideration are uniformly bounded.
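A hedged sketch of the error decomposition 4.2, in notation introduced here for illustration: writing \(f_\lambda\) for the regularization function, \(f_{\mathbf{z}}\) for the empirical classifier, and \(\Omega(\cdot)\) for the regularization penalty, the definition of \(f_{\mathbf{z}}\) as an empirical minimizer yields
\[
\mathcal{E}(f_{\mathbf{z}}) - \mathcal{E}(f_\phi)
\;\le\;
\underbrace{\Big[\mathcal{E}(f_\lambda) + \lambda\,\Omega(f_\lambda) - \mathcal{E}(f_\phi)\Big]}_{\text{regularization error}}
\;+\;
\underbrace{\Big[\big(\mathcal{E}(f_{\mathbf{z}}) - \mathcal{E}_{\mathbf{z}}(f_{\mathbf{z}})\big) + \big(\mathcal{E}_{\mathbf{z}}(f_\lambda) - \mathcal{E}(f_\lambda)\big)\Big]}_{\text{sample error}},
\]
where the term \(\mathcal{E}_{\mathbf{z}}(f_{\mathbf{z}}) + \lambda\Omega(f_{\mathbf{z}}) - \mathcal{E}_{\mathbf{z}}(f_\lambda) - \lambda\Omega(f_\lambda)\) has been dropped because it is nonpositive by the definition of \(f_{\mathbf{z}}\); the exact form of 4.2 in the letter may differ in how the regularization terms are grouped.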
Let
formula
4.3
be the union of the R-balls in the candidate RKHSs. For any R>0, denote by
formula
the set of samples for which the corresponding empirical classifier lies in this union of R-balls.
Theorem 4
 (Ying & Campbell, 2010). Assume that the loss function is the hinge loss. Then the following inequality holds with high probability over the sample:
formula
4.4

With theorem 4 as a stepping-stone, we can now present the second result of this letter. Theorem 6 provides a general result on the learning rates of MKL algorithms under the hinge loss and shows that the excess generalization error can be controlled by the regularization error for a suitably restricted regularization parameter. Its proof is based on the iteration trick due to Steinwart and Scovel (2007). For this purpose, we first give a lemma illustrating how the bound on the norm of the empirical classifier improves as a function of the radius R.

Lemma 5. 
Let the loss be the hinge loss, and let c0 and c1 be two positive constants. Assume that the regularization parameter satisfies
formula
4.5
and suppose that . Then for any there is a set with such that
formula
where a and b are two constants independent of R:
formula
4.6
Proof. 
From the definition of the empirical classifier, it follows that
formula
from which we know that the norm of the empirical classifier is controlled by the regularization parameter. Consequently, for any function in the corresponding ball, the reproducing property implies a uniform bound on its values. Now, one can apply the Hoeffding inequality (see Boucheron, Lugosi, & Bousquet, 2004) to show that, with high probability, there holds
formula
4.7
Utilizing inequalities 4.2 and 4.7 and applying theorem 4 to the corresponding set of samples, we know that there is a set of samples, of high probability, such that for any sample in it, there holds
formula
4.8
where in the second step of the above deduction we have used the stated assumptions and equation 4.5.
Consequently, for any such sample and some constant c depending on c0, c1, and the remaining problem parameters, we have
formula
That is, there exists a set VR of high probability on which the claimed bound holds.
Theorem 6. 
Under the conditions of lemma 5, the generalization performance of the empirical classifier can be characterized by the following inequality, which holds with high probability:
formula
4.9
Proof. 
For any sample, there holds (Ying & Campbell, 2010) a crude bound on the norm of the empirical classifier, which provides an initial radius for the iteration. Define a sequence of radii by taking this crude bound as the initial value and setting
formula
4.10
where the numbers a and b are given by equation 4.6. From lemma 5, we know that there exists a sequence of sample sets, each of high probability, on which the corresponding norm bounds hold. Consequently, for any sample in their intersection, there holds
formula
4.11
Utilizing the recurrence relation 4.10 and an elementary inequality for square roots, it is not hard to check that
formula
4.12
According to equation 4.11, with high confidence the empirical classifier lies in the ball of radius R(J), with R(J) controlled by equation 4.12. Utilizing this fact and applying inequality 4.8 with R replaced by R(J), the following inequality holds with high probability:
formula
For a suitable choice of J, with high probability there holds
formula
The above inequality can be rewritten as equation 4.9. The proof is complete.
Remark 2. 
Noticing that the relevant functions are always contained in a ball whose radius is determined by the crude norm bound, Ying and Campbell (2010) simply applied theorem 4 with that radius to obtain
formula
4.13
Compared with this result, inequality 4.9 is sharper and can yield faster learning rates. The underlying reason for the improvement is that the iteration technique essentially shows that the empirical classifier lies in a ball of much smaller radius, and thus an application of theorem 4 with this smaller radius is sufficient.
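To give some intuition for the iteration technique (this is a simplified model, not the exact recurrence 4.10, whose constants a and b come from lemma 5), suppose the analysis yields a self-improving radius bound of the following form: if the empirical classifier lies in the ball of radius R, then with high probability it lies in the ball of radius \(a + b\sqrt{R}\). Iterating from a crude initial radius and passing to the fixed point \(R^*\) of the map \(R \mapsto a + b\sqrt{R}\) gives
\[
R^* = a + b\sqrt{R^*}
\;\Longrightarrow\;
\sqrt{R^*} = \frac{b + \sqrt{b^2 + 4a}}{2} \le b + \sqrt{a}
\;\Longrightarrow\;
R^* \le 2a + 2b^2,
\]
so after finitely many iterations the working radius is essentially of order \(a + b^2\), which is typically much smaller than the crude radius available without iteration; this is what allows theorem 4 to be applied with a much smaller R.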
Corollary 2. 
Let the loss be the hinge loss, and suppose the regularization error admits a polynomial decay. Let a prescribed set of candidate kernels be given, uniformly bounded and satisfying the entropy condition of corollary 1 with constants c and p. Then, choosing the regularization parameter appropriately in each of the cases p<1, p=1, and p>1, the excess misclassification error of the classifier in equation 2.4 can be controlled with high probability as follows:
formula
Proof. 

By the reproducing property, it is not hard to check the required uniform boundedness of the candidate function classes. We proceed with the proof by distinguishing three cases according to the magnitude of p:

  • Case p<1. In this case, corollary 1 yields a bound on the Rademacher chaos complexity that is independent of n. Consequently, the stated choice of the regularization parameter satisfies condition 4.5, and thus inequality 4.9 reduces to
    formula
    4.14
  • Case p=1. For such p, one can derive the corresponding bound from corollary 1, and the stated choice of the regularization parameter meets condition 4.5. Now, from equation 4.9, it follows directly that
    formula
  • Case p>1. In this case, corollary 1 gives the corresponding bound on the Rademacher chaos complexity. It can be directly checked that the stated choice of the regularization parameter satisfies condition 4.5, and thus inequality 4.9 reduces to
    formula

Putting the three cases together and using inequality 4.1 to connect the excess misclassification error with the excess generalization error, the stated inequality is immediate.

Remark 3. 

The analysis based on theorem 6 and the previous entropy integral, 3.2, can tackle only the case p<1, while our discussion yields satisfactory learning rates for all p>0. One can refer to the examples in section 3.2 for some specific function classes satisfying the entropy conditions in corollary 2. The condition also holds (Mendelson, 2003) if the fat-shattering dimension grows as a polynomial of 1/r with exponent p.

Remark 4. 
When the Rademacher chaos complexity is bounded by a constant independent of n (which happens if the exponent p is less than 1), Ying and Campbell (2010, example 1) derived the following inequality, holding with high probability,
formula
by applying equation 4.13 with a particular assignment of the regularization parameter. Comparing this generalization bound with equation 4.14, one can see that corollary 2 yields a clear improvement in the relevant regime. Furthermore, in the limiting regime of the decay exponent, our learning rate can roughly attain a fast rate (we ignore logarithmic factors for simplicity), a substantial improvement. For the remaining range of p, a direct calculation shows that the learning rates given by corollary 2 are also sharper than the bounds that can be deduced from equation 4.13 and the refined entropy integral. Notice that the improvement is due to the iteration technique, since we are using the same entropy integral to control the Rademacher chaos complexities here.

5.  Conclusion

In this letter, we provide a refined entropy integral for the Rademacher chaos complexity that can be extremely useful when the standard integral diverges. We apply our bounds to some function classes and obtain some satisfactory results. We also combine our Rademacher chaos complexity bounds and the iteration technique in Steinwart and Scovel (2007) to improve the learning rates of MKL machines.

We consider the use of the Rademacher chaos complexity only in the MKL context. However, our discussion can also have direct applications in other learning problems in which the goal is to construct a function f defined on pairs of inputs. A well-known example is ranking (Rejchel, 2012), where the function f is used to predict the ordering between objects. Specifically, Rejchel (2012) derived generalization bounds for ranking rules under the assumption that the logarithm of the covering number grows as a polynomial with exponent less than 1 (Rejchel, 2012, assumption B); this assumption stems from the use of equation 3.2 to control Rademacher chaos complexities (Rejchel, 2012, equation 18). If one uses the refined entropy integral instead, this assumption can be safely removed.

Acknowledgments

This work is supported in part by the National Natural Science Foundation of China (grants 60975050 and 61165004). We also thank the anonymous referees for their insightful and constructive comments, which greatly improved the quality of this letter.

References

Anthony, M., & Bartlett, P. (1999). Neural network learning: Theoretical foundations. New York: Cambridge University Press.
Bartlett, P., Jordan, M., & McAuliffe, J. (2006). Convexity, classification, and risk bounds. J. Am. Stat. Assoc., 101(473), 138–156.
Bartlett, P., & Mendelson, S. (2002). Rademacher and gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3, 463–482.
Boucheron, S., Lugosi, G., & Bousquet, O. (2004). Concentration inequalities. In O. Bousquet, U. Von Luxburg, & G. Rätsch (Eds.), Advanced lectures on machine learning (pp. 208–240). New York: Springer-Verlag.
Bousquet, O. (2003). New approaches to statistical learning theory. Ann. Inst. Stat. Math., 55(2), 371–389.
Bousquet, O., & Herrmann, D. (2002). On the complexity of learning the kernel matrix. In S. Becker, S. Thrün, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 399–406). Cambridge, MA: MIT Press.
Chen, D., Wu, Q., Ying, Y., & Zhou, D. (2004). Support vector machine soft margin classifiers: Error analysis. J. Mach. Learn. Res., 5, 1143–1175.
Chen, H., & Li, L. (2010). Learning rates of multi-kernel regularized regression. J. Stat. Plan. Infer., 140(9), 2562–2568.
Chen, H., Pan, Z., Li, L., & Tang, Y. (2013). Error analysis of coefficient-based regularized algorithm for density-level detection. Neural Comput., 25(4), 1107–1121.
Cucker, F., & Smale, S. (2002). On the mathematical foundations of learning. Bull. Am. Math. Soc., 39(1), 1–50.
Cucker, F., & Zhou, D. (2007). Learning theory: An approximation theory viewpoint. Cambridge: Cambridge University Press.
De la Peña, V., & Giné, E. (1999). Decoupling: From dependence to independence. New York: Springer-Verlag.
Dudley, R. (1999). Uniform central limit theorems. Cambridge: Cambridge University Press.
Lanckriet, G., Cristianini, N., Bartlett, P., Ghaoui, L., & Jordan, M. (2004). Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res., 5, 27–72.
Mendelson, S. (2003). A few notes on statistical learning theory. In S. Mendelson & A. Smola (Eds.), Advanced lectures on machine learning (pp. 1–40). Berlin: Springer-Verlag.
Micchelli, C., & Pontil, M. (2005). Learning the kernel function via regularization. J. Mach. Learn. Res., 6, 1099–1125.
Ong, C. S., Smola, A. J., & Williamson, R. C. (2005). Learning the kernel with hyperkernels. J. Mach. Learn. Res., 6, 1043–1071.
Rakotomamonjy, A., Bach, F., Canu, S., & Grandvalet, Y. (2007). More efficiency in multiple kernel learning. In Z. Ghahramani (Ed.), Proceedings of the 24th International Conference on Machine Learning (pp. 775–782). New York: ACM Press.
Rejchel, W. (2012). On ranking and generalization bounds. J. Mach. Learn. Res., 13, 1373–1392.
Schölkopf, B., & Smola, A. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press.
Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. New York: Cambridge University Press.
Sonnenburg, S., Rätsch, G., & Schäfer, C. (2006). A general and efficient multiple kernel learning algorithm. In Y. Weiss, B. Schölkopf, & J. Platt (Eds.), Advances in neural information processing systems, 18 (pp. 1273–1280). Cambridge, MA: MIT Press.
Srebro, N., & Ben-David, S. (2006). Learning bounds for support vector machines with learned kernels. In G. Lugosi & H. Simon (Eds.), Learning theory (pp. 169–183). Berlin: Springer-Verlag.
Srebro, N., & Sridharan, K. (2010). Note on refined Dudley integral covering number bound. Unpublished results. http://ttic.uchicago.edu/karthik/dudley.pdf
Srebro, N., Sridharan, K., & Tewari, A. (2010). Smoothness, low-noise and fast rates. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, & A. Culotta (Eds.), Advances in neural information processing systems, 23 (pp. 2199–2207). Red Hook, NY: Curran.
Steinwart, I., & Scovel, C. (2007). Fast rates for support vector machines using gaussian kernels. Ann. Stat., 35(2), 575–607.
Van der Vaart, A., & Wellner, J. (1996). Weak convergence and empirical processes: With applications to statistics. New York: Springer-Verlag.
Varma, M., & Babu, B. (2009). More generality in efficient multiple kernel learning. In A. Danyluk, L. Bottou, & M. Littman (Eds.), Proceedings of the 26th Annual International Conference on Machine Learning (pp. 1065–1072). New York: ACM.
Wu, Q., Ying, Y., & Zhou, D. (2007). Multi-kernel regularized classifiers. J. Complex., 23(1), 108–134.
Ying, Y., & Campbell, C. (2009). Generalization bounds for learning the kernel. In S. Dasgupta & A. Klivans (Eds.), Proceedings of the 22nd Annual Conference on Learning Theory. New York: Springer-Verlag.
Ying, Y., & Campbell, C. (2010). Rademacher chaos complexities for learning the kernel problem. Neural Comput., 22(11), 2858–2886.
Ying, Y., & Zhou, D. (2007). Learnability of gaussians with flexible variances. J. Mach. Learn. Res., 8, 249–276.
Zhang, T. (2002). Covering number bounds of certain regularized linear function classes. J. Mach. Learn. Res., 2, 527–550.
Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Stat., 32(1), 56–85.
Zhou, D. (2002). The covering number in learning theory. J. Complex., 18(3), 739–767.

Notes

1
Indeed, Ying and Campbell (2010) obtained the inequality
formula
However, it can be shown that the additional term D can be removed under the assumptions adopted in this letter.
2
Ying and Campbell (2010) obtained the inequality
formula
However, it is not hard to derive the slightly stronger inequality 3.6 through their deduction.
3
The hyperkernel should satisfy a symmetry condition in its arguments. One can always achieve this by introducing a new, symmetrized kernel,
formula
when necessary. The kernel K learned in such a space is not necessarily positive semidefinite, and one needs to impose an additional constraint to guarantee that the kernel matrix [K(x_i, x_j)]_{i,j=1}^{n} is positive semidefinite (Ong et al., 2005).
4

The metric dx can be considered (up to some factors) as an empirical L2 metric with respect to the data, so the associated covering numbers are empirical L2-covering numbers.