## Abstract

Estimating the Rademacher chaos complexity of order two is important for understanding the performance of multikernel learning (MKL) machines. In this letter, we develop a novel entropy integral for Rademacher chaos complexities. Compared to the previous bounds, our result is much improved in that it introduces an adjustable parameter $\epsilon$ to prevent the divergence of the involved integral. With the use of the iteration technique in Steinwart and Scovel (2007), we also apply our Rademacher chaos complexity bound to MKL problems and improve existing learning rates.

## 1. Introduction

During the last few years, kernel-based methods such as support vector machines have found great success in solving supervised learning problems like classification and regression (Schölkopf & Smola, 2002; Shawe-Taylor & Cristianini, 2004). For these tasks, the performance of the learning algorithms largely depends on the data representation via the choice of kernel functions—ones that are typically handcrafted and fixed in advance (Varma & Babu, 2009; Ying & Campbell, 2010). While this provides opportunities for us to incorporate the prior knowledge into the learning process, it can also be difficult in practice to find prior justifications for the use of one kernel instead of another (Lanckriet, Cristianini, Bartlett, Ghaoui, & Jordan, 2004). In this context, multikernel learning (MKL) has been introduced into the learning community to tackle this issue (Rakotomamonjy, Bach, Canu, & Grandvalet, 2007; Micchelli & Pontil, 2005; Lanckriet et al., 2004). Given a set of candidate (base) kernels, MKL tries to search for the most appropriate kernel from the training data. This strategy can significantly enhance the interpretability of the decision function, and it also reflects the fact that typical learning problems often involve multiple, heterogeneous data sources (Rakotomamonjy et al., 2007; Sonnenburg, Rätsch, & Schäfer, 2006).

Although learning the kernel from data allows greater flexibility in matching the target function (Srebro & Ben-David, 2006), this alone is not sufficient to guarantee the quality of the obtained model, since one also needs to take into account the capacity of the involved hypothesis space. In this letter, we seek to characterize this capacity by the so-called Rademacher chaos complexity of order two.

Rademacher complexities have proved to be a powerful tool to measure the complexity of function classes (Bartlett & Mendelson, 2002; Mendelson, 2003). In comparison with other complexity measures, such as covering numbers, VC dimension, and pseudodimension, the analysis based on the Rademacher complexity can always lead to slightly faster learning rates since it is directly related to the behavior of the associated uniform deviation and can capture precisely the quantity we are interested in (Bousquet, 2003). Consequently, characterizing the Rademacher complexity is of great significance in learning theory. Recently, Ying and Campbell (2010) introduced Rademacher chaos complexities of order two into the discussion of MKL machines' learning rates. It has been demonstrated that the Rademacher chaos complexity has inherited the advantage of the Rademacher complexity in that it can also yield an improvement on error analysis (Ying & Campbell, 2009, 2010; Chen & Li, 2010). In their work, Ying and Campbell provided a comprehensive study on Rademacher chaos complexities and offered some novel results, such as the connection between Rademacher chaos complexities and MKL, the structural results on Rademacher chaos complexities, and the related entropy integrals, thus justifying the use of Rademacher chaos complexities as an appropriate tool to treat the learning rates. Other theoretical studies on the generalization performance for MKL can be found in Bousquet and Herrmann (2002), Lanckriet et al. (2004), Srebro and Ben-David (2006), Wu, Ying, and Zhou (2007), and Ying and Zhou (2007).

However, the Rademacher chaos complexity bounds in Ying and Campbell (2010), which are based on the standard entropy integral, may not be useful when the involved integral diverges, which occurs naturally if the logarithm of the covering number or the associated fat-shattering dimension grows at a polynomial rate with an exponent not less than 1 (Mendelson, 2003). This letter tackles this issue by presenting some refined entropy integrals for Rademacher chaos complexities. We use the standard Cauchy-Schwarz inequality to relate the Rademacher chaos complexity of a class to that of an $\epsilon$-cover of it, which can in turn be tackled by a chaining argument to yield a bound of the desirable form. As compared to the previous results, our bound admits some possible superiority since it allows the parameter $\epsilon$ to traverse an interval to attain a balance between the accuracy of the approximation by a cover and the complexity of that cover. With the aid of the iteration trick proposed by Steinwart and Scovel (2007), we also consider the application of our results to MKL problems and improve the existing learning rates.

This letter is organized as follows. Section 2 introduces the statement of the problem. In section 3, we establish the refined entropy integral for Rademacher chaos complexities and consider its application to some specific function classes. Section 4 illustrates the effectiveness of our result by investigating MKL machines’ learning rates. Our conclusions are presented in section 5.

## 2. Statement of the Problem

Before formulating our problem, we introduce some notations that we use throughout this letter. For a function *f*, the sign function sgn(*f*) is defined as sgn(*f*)(*x*) = 1 if $f(x) \geq 0$ and sgn(*f*)(*x*) = −1 if *f*(*x*) < 0. We use *c* to denote positive constants, whose values may change from line to line or even within the same line.

### 2.1. Rademacher Chaos Complexities.

**Definition** *(Rademacher chaos complexities). Let P be a probability measure defined on $\mathcal{X}$ from which the examples $\mathbf{x} = \{x_1, \dots, x_n\}$ are independently drawn. Let $\mathcal{F}$ be a class of functions on $\mathcal{X} \times \mathcal{X}$, and let $\sigma_1, \dots, \sigma_n$ be n independent Rademacher random variables, that is, $\Pr(\sigma_i = 1) = \Pr(\sigma_i = -1) = 1/2$. The homogeneous Rademacher chaos process of order two, with respect to the Rademacher variables $\sigma = (\sigma_1, \dots, \sigma_n)$, is a random variable system of the form $\{\sum_{i<j} \sigma_i \sigma_j f(x_i, x_j) : f \in \mathcal{F}\}$. The empirical Rademacher chaos complexity of order two for $\mathcal{F}$ is defined as the expectation of its suprema:*

$$\hat{U}_n(\mathcal{F}) := \mathbb{E}_{\sigma} \sup_{f \in \mathcal{F}} \Big| \sum_{i < j} \sigma_i \sigma_j f(x_i, x_j) \Big|.$$

For simplicity, we always mean empirical Rademacher chaos complexities of order two when referring to Rademacher chaos complexities.
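Since the definition above is an expectation over finitely many sign patterns, it can be approximated numerically for a finite class of candidate kernels by straightforward Monte Carlo sampling. The following is a minimal sketch assuming exactly the definition above; the function and variable names are illustrative, not from the letter:

```python
import numpy as np

def empirical_chaos_complexity(gram_matrices, n_mc=1000, seed=0):
    """Monte Carlo estimate of the order-two Rademacher chaos complexity
    E_sigma sup_K | sum_{i<j} sigma_i sigma_j K(x_i, x_j) | for a finite
    class of candidate kernels, each given by its (n, n) Gram matrix."""
    rng = np.random.default_rng(seed)
    n = gram_matrices[0].shape[0]
    estimates = np.empty(n_mc)
    for t in range(n_mc):
        sigma = rng.choice([-1.0, 1.0], size=n)  # Rademacher variables
        # sum_{i<j} sigma_i sigma_j G_ij = (sigma^T G sigma - trace(G)) / 2,
        # because sigma_i^2 = 1 leaves only the diagonal to subtract.
        chaos = [0.5 * (sigma @ G @ sigma - np.trace(G)) for G in gram_matrices]
        estimates[t] = max(abs(c) for c in chaos)  # supremum over the class
    return estimates.mean()

# Usage: two toy Gram matrices (linear and gaussian kernels) on 5 points.
X = np.linspace(0.0, 1.0, 5)[:, None]
gram_lin = X @ X.T
gram_rbf = np.exp(-(X - X.T) ** 2)
print(empirical_chaos_complexity([gram_lin, gram_rbf], n_mc=500))
```

For infinite classes, such a direct supremum is of course unavailable, which is precisely why the entropy integral bounds studied in this letter are needed.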

It is worth mentioning that the Rademacher process $\{\sum_{i=1}^{n} \sigma_i f(x_i) : f \in \mathcal{F}\}$, indexed by *f*, can also be regarded as the homogeneous Rademacher chaos process of order one. Other than the Rademacher chaos complexities of order one or two, De la Peña and Giné (1999) also provided a definition for Rademacher chaos complexities of general order *d*.

**Definition** *(covering numbers). Let $(\mathcal{F}, d)$ be a metric space. For any $\epsilon > 0$, a set $\mathcal{F}_\epsilon \subset \mathcal{F}$ is called an $\epsilon$-cover of $\mathcal{F}$ if for every $f \in \mathcal{F}$ we can find an element $g \in \mathcal{F}_\epsilon$ satisfying $d(f, g) \leq \epsilon$. The covering number $\mathcal{N}(\epsilon, \mathcal{F}, d)$ is the cardinality of the minimal $\epsilon$-cover of $\mathcal{F}$:*

$$\mathcal{N}(\epsilon, \mathcal{F}, d) := \min\big\{|\mathcal{F}_\epsilon| : \mathcal{F}_\epsilon \text{ is an } \epsilon\text{-cover of } \mathcal{F}\big\}.$$

*For brevity, when $\mathcal{F}$ is a normed space with norm $\|\cdot\|$, we also denote by $\mathcal{N}(\epsilon, \mathcal{F}, \|\cdot\|)$ the covering number of $\mathcal{F}$ with respect to the metric $d(f, g) = \|f - g\|$.*
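For intuition, covering numbers at a fixed scale can be upper-bounded constructively: any greedy $\epsilon$-cover is a valid cover, so its size bounds $\mathcal{N}(\epsilon, \mathcal{F}, d)$ from above. Below is a minimal sketch under the Euclidean metric on a finite point set; all names are illustrative:

```python
import numpy as np

def greedy_cover(points, eps):
    """Greedily build an eps-cover of a finite point set under the Euclidean
    metric; the returned size is an upper bound on the covering number."""
    centers = []
    uncovered = list(range(len(points)))
    while uncovered:
        c = uncovered[0]                      # pick any uncovered point
        centers.append(c)
        # drop every point within distance eps of the new center
        uncovered = [i for i in uncovered
                     if np.linalg.norm(points[i] - points[c]) > eps]
    return centers

# Usage: the cover size grows as the scale eps shrinks.
pts = np.random.default_rng(0).uniform(size=(200, 2))
for eps in (0.5, 0.25, 0.125):
    print(eps, len(greedy_cover(pts, eps)))
```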

As compared to equation 2.2, the refined version, equation 2.1, is more effective and can be applied to the case where the integral in equation 2.2 diverges. The first main purpose of this letter is to derive a novel characterization analogous to equation 2.1 for Rademacher chaos complexities. In this respect, our work can be considered a generalization of the refined entropy integral equation 2.1, from Rademacher complexities to Rademacher chaos complexities.
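For reference, the two entropy integrals for the ordinary empirical Rademacher complexity $\hat{R}_n(\mathcal{F})$ take roughly the following forms (a sketch assuming the versions in Dudley, 1999, and Srebro & Sridharan, 2010; the constants shown are illustrative, not the letter's exact equations 2.1 and 2.2, and $D$ denotes the diameter of $\mathcal{F}$):

```latex
% Standard (Dudley) entropy integral, cf. equation 2.2:
\[
  \hat{R}_n(\mathcal{F}) \;\le\; c \int_{0}^{D}
      \sqrt{\frac{\log \mathcal{N}(r, \mathcal{F}, d)}{n}} \, dr .
\]
% Refined version with an adjustable lower limit, cf. equation 2.1:
\[
  \hat{R}_n(\mathcal{F}) \;\le\; \inf_{\epsilon \in (0, D)}
      \Bigl\{ 4\epsilon + c \int_{\epsilon}^{D}
      \sqrt{\frac{\log \mathcal{N}(r, \mathcal{F}, d)}{n}} \, dr \Bigr\} .
\]
```

The refined version never does worse (take $\epsilon \to 0$), and it remains finite even when the integral down to 0 diverges.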

### 2.2. The Kernel Learning Problem.

Let $\mathcal{X}$ be the input space and $\mathcal{Y} = \{-1, 1\}$ the label set for the input variable *X* and the output variable *Y*. Given a sequence of examples $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^{n}$ drawn independently from a probability measure $\rho$ on $\mathcal{X} \times \mathcal{Y}$, the goal of learning is to construct a discriminant function $f: \mathcal{X} \to \mathbb{R}$ so that the associated classifier sgn(*f*) is able to do the prediction as accurately as possible when new data arrive (Anthony & Bartlett, 1999). The error incurred by using *f* to classify a point (*x*, *y*) can be quantified by the term $\phi(y f(x))$, where $\phi$ is a nonnegative function called the loss function. Consequently, the quality of a classifier can be assessed by the expected loss incurred when it is used to do the prediction,

$$\mathcal{E}^{\phi}(f) := \mathbb{E}\, \phi(Y f(X)),$$

which is always called the generalization error or risk of *f* (with respect to $\phi$). However, since the generalization error involves an unknown probability measure, it is not computable, and its discretization

$$\mathcal{E}^{\phi}_{\mathbf{z}}(f) := \frac{1}{n} \sum_{i=1}^{n} \phi(y_i f(x_i))$$

is used instead. The term $\mathcal{E}^{\phi}_{\mathbf{z}}(f)$ is called the empirical error, and it is always used to guide the process of choosing a prediction rule from a candidate function class called the hypothesis space. This letter considers only the hinge loss $\phi(t) = \max(0, 1 - t)$, which is a standard choice due to its computational and statistical merits (Zhang, 2004).
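As a concrete illustration, the empirical error under the hinge loss is directly computable from a sample (a minimal sketch; the toy discriminant and all names here are ours, not the letter's):

```python
import numpy as np

def hinge_loss(margin):
    """phi(t) = max(0, 1 - t), applied to the margins y_i * f(x_i)."""
    return np.maximum(0.0, 1.0 - margin)

def empirical_error(f, X, y):
    """The computable proxy (1/n) * sum_i phi(y_i * f(x_i)) for the risk."""
    return hinge_loss(y * f(X)).mean()

# Usage with a toy linear discriminant f(x) = 2x - 1 on four labeled points.
X = np.array([0.1, 0.4, 0.6, 0.9])
y = np.array([-1.0, -1.0, 1.0, 1.0])
print(empirical_error(lambda x: 2.0 * x - 1.0, X, y))
```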

For a Mercer kernel $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, the reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated with *K* is defined to be the closure of the linear span of the functions $\{K(x, \cdot) : x \in \mathcal{X}\}$, equipped with the inner product given by $\langle K(x, \cdot), K(y, \cdot) \rangle_K = K(x, y)$. An important property making the RKHS an appropriate choice as the candidate hypothesis space is the so-called reproducing property (Cucker & Zhou, 2007), which says that $f(x) = \langle f, K(x, \cdot) \rangle_K$ for all $f \in \mathcal{H}_K$ and $x \in \mathcal{X}$.

Given a prescribed set $\mathcal{K}$ of candidate kernels, the multikernel regularized classifier studied in this letter is defined by the scheme

$$f_{\mathbf{z}} := \arg\min_{K \in \mathcal{K}} \min_{f \in \mathcal{H}_K} \Big\{ \mathcal{E}^{\phi}_{\mathbf{z}}(f) + \lambda \|f\|_K^2 \Big\}, \tag{2.4}$$

where the regularization parameter $\lambda > 0$ typically depends on the sample size *n* to balance the goodness of the fit, as assessed by $\mathcal{E}^{\phi}_{\mathbf{z}}(f)$, and the simplicity of the model, as assessed by $\|f\|_K^2$.

Our second goal is to use the refined bounds on Rademacher chaos complexities to study the generalization ability of the classifier $f_{\mathbf{z}}$, which is the theme of section 4.

## 3. Refined Bounds on Rademacher Chaos Complexities

The inequality in equation 3.2,^{1} known as a generalization of Dudley's entropy integral, equation 2.2, is novel since it builds the connection between Rademacher chaos complexities and covering numbers. Ying and Campbell (2010) also used such bounds to estimate learning rates for the kernel learning problem and obtained some interesting results. However, such a bound may not be useful when the involved integral diverges, which can happen in some interesting situations (Mendelson, 2003).

Noticing that the divergence of the entropy integral is due to the rapid growth of the covering number as *r* approaches zero, we naturally ask whether a bound exists for which the lower limit of the integration is kept at a distance from 0. As we will see, this can be achieved by introducing a threshold $\epsilon$ to prevent the divergence that would result from too fine an approximation.

For this purpose, we first consider the Rademacher chaos complexity of an $\epsilon$-cover $\mathcal{F}_\epsilon$ of the original class $\mathcal{F}$. Viewing the Rademacher chaos process as an empirical process indexed by the finitely many elements of $\mathcal{F}_\epsilon$, the Rademacher chaos complexity $\hat{U}_n(\mathcal{F}_\epsilon)$ is indeed the *L*_{1}-norm of the maxima of a finite class of random variables, and it can be addressed by the powerful maximal inequalities in empirical process theory.

### 3.1. Refined Rademacher Chaos Complexity Bounds.

The following theorem provides an example of maximal inequalities sufficient for our purpose; it can be considered a modification of equation 3.2 in that we provide a maximal inequality for the $\epsilon$-cover rather than for the whole class. Its proof is based on the chaining argument (Dudley, 1999; Anthony & Bartlett, 1999), whose basic idea is to split each function of interest into a sum of functions chosen from a sequence of classes of progressively increasing complexity. The desired maximal inequality for the cover is then obtained by combining maximal inequalities for all of these classes. The underlying reason that we can allow the lower limit of the integration to stay away from 0 is that elements of $\mathcal{F}_\epsilon$ can be exactly represented without resorting to *r*-covers with $r < \epsilon$.

**Theorem 1.** *Let $\mathcal{F}$ be a class of functions defined on $\mathcal{X} \times \mathcal{X}$ with $0 \in \mathcal{F}$ and $\sup_{f \in \mathcal{F}} d_{\mathbf{x}}(f, 0) \leq D$, where the metric $d_{\mathbf{x}}$ is defined as in equation 3.1. For any $\epsilon \in (0, D)$, let $\mathcal{F}_\epsilon$ be a minimal $\epsilon$-cover of $\mathcal{F}$ with respect to the metric $d_{\mathbf{x}}$. Then the inequality of equation 3.3 controls the Rademacher chaos complexity of $\mathcal{F}_\epsilon$.*

**Proof.** For each $j = 0, 1, \dots, N$, let $\mathcal{F}_j$ be a minimal $2^{-j}D$-cover of $\mathcal{F}_\epsilon$ with respect to the metric $d_{\mathbf{x}}$, where $N$ is the smallest integer with $2^{-N}D \leq \epsilon$. Without loss of generality, we can assume that $\mathcal{F}_0 = \{0\}$ and $\mathcal{F}_N = \mathcal{F}_\epsilon$. For any $f \in \mathcal{F}_\epsilon$ and any *j*, we denote by $f_j$ the closest point to *f* in $\mathcal{F}_j$. From the definition of a cover, we know that $d_{\mathbf{x}}(f, f_j) \leq 2^{-j}D$. Now, for any $f \in \mathcal{F}_\epsilon$, one can rewrite *f* in the telescoping form $f = f_0 + \sum_{j=1}^{N} (f_j - f_{j-1})$, from which and the fact $f_0 = 0$ one can derive a level-by-level bound on the chaos process. The definition of a cover shows that $d_{\mathbf{x}}(f, f_j) \leq 2^{-j}D$, which, coupled with the triangle inequality, implies that for any *j* we have $d_{\mathbf{x}}(f_j, f_{j-1}) \leq 3 \cdot 2^{-j}D$. Applying, at each level of the chain, a standard maximal inequality valid for any *t* > 1 and any finite class of functions on $\mathcal{X} \times \mathcal{X}$, and then optimizing over *t* > 1, the desired result, equation 3.3, is immediate.

With theorem 1 as a stepping-stone, we are now ready to present the first main result of this letter. We first temporarily fix an $\epsilon \in (0, D)$ and construct an $\epsilon$-cover $\mathcal{F}_\epsilon$ for the initial function class, thus transferring the original problem to the discussion of $\hat{U}_n(\mathcal{F}_\epsilon)$ and of the deviation between $\mathcal{F}$ and its cover. The complexity $\hat{U}_n(\mathcal{F}_\epsilon)$ can be directly controlled by theorem 1 to yield a bound involving an integral of the desired form, while the deviation term can be addressed by the standard Cauchy-Schwarz inequality. The following theorem is motivated by the recent work of Srebro and Sridharan (2010), who established an improvement, equation 2.1, of the standard entropy integral for the ordinary Rademacher complexity.
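In outline, the decomposition behind this strategy reads as follows (a sketch with constants suppressed; equation 3.8 gives the precise final form):

```latex
% Two-part decomposition: cover complexity plus approximation error.
\[
  \hat{U}_n(\mathcal{F})
    \;\le\; \hat{U}_n(\mathcal{F}_\epsilon)
    \;+\; \mathbb{E}_{\sigma} \sup_{f \in \mathcal{F}}
          \Bigl| \sum_{i<j} \sigma_i \sigma_j \, (f - f_\epsilon)(x_i, x_j) \Bigr| .
\]
% For each fixed sign vector, Cauchy-Schwarz over the binom{n}{2} summands
% (each sigma_i sigma_j has modulus 1) gives
\[
  \Bigl| \sum_{i<j} \sigma_i \sigma_j \, (f - f_\epsilon)(x_i, x_j) \Bigr|
    \;\le\; \binom{n}{2}^{1/2}
       \Bigl( \sum_{i<j} (f - f_\epsilon)^2(x_i, x_j) \Bigr)^{1/2} ,
\]
% and the right-hand side is of order n * epsilon whenever
% d_x(f, f_epsilon) <= epsilon for the empirical L2-type metric of equation 3.1.
```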

**Theorem 2.** *Let $\mathcal{F}$ be a class of functions defined on $\mathcal{X} \times \mathcal{X}$ with $0 \in \mathcal{F}$ and $\sup_{f \in \mathcal{F}} d_{\mathbf{x}}(f, 0) \leq D$, where the metric $d_{\mathbf{x}}$ is defined as in equation 3.1. Then for any $\epsilon \in (0, D)$, the refined entropy integral bound of equation 3.8 holds.*

**Proof.** Let $\mathcal{F}_\epsilon$ be a minimal $\epsilon$-cover of $\mathcal{F}$ with respect to the metric $d_{\mathbf{x}}$. For any $f \in \mathcal{F}$, denote by $f_\epsilon$ the closest element to *f* in $\mathcal{F}_\epsilon$. To tackle the quantity $\hat{U}_n(\mathcal{F})$, we first decompose it into two parts: the complexity $\hat{U}_n(\mathcal{F}_\epsilon)$ of the cover, which theorem 1 controls, and the deviation term, which the Cauchy-Schwarz inequality bounds in terms of $\epsilon$. Combining the two bounds and taking the infimum over $\epsilon$ yields equation 3.8.

As can be seen clearly from inequality 3.8, our bound is at least as sharp as inequality 3.2, since the $\epsilon$ in inequality 3.8 can be made arbitrarily close to 0. Also, the parameter $\epsilon$ is permitted to traverse the interval (0, *D*) to achieve a trade-off between the accuracy of the approximation by the cover and the complexity of that cover, which efficiently prevents the divergence of the integration over the interval (0, *D*). Theorem 2 can be readily extended to the setting of Rademacher chaos complexities of order larger than two.

### 3.2. Rademacher Chaos Complexities for Some Specific Classes.

As a specific example to show the possible superiority of our result over the previous one, we consider the Rademacher chaos complexities of function classes for which the logarithm of the covering number grows as a polynomial of 1/*r*.

We proceed with our discussion by distinguishing three cases according to the growth exponent: *p* < 1, *p* = 1, and *p* > 1.
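To see how the three cases arise, one can evaluate the integral appearing in the refined bound under the entropy condition $\log \mathcal{N}(r, \mathcal{F}, d_{\mathbf{x}}) \leq c\,(1/r)^{p}$ (a sketch with constants suppressed; corollary 1 gives the precise statement):

```latex
\[
  \int_{\epsilon}^{D} \log \mathcal{N}(r, \mathcal{F}, d_{\mathbf{x}}) \, dr
    \;\le\; c \int_{\epsilon}^{D} r^{-p} \, dr
    \;=\;
  \begin{cases}
    \dfrac{c\,(D^{1-p} - \epsilon^{1-p})}{1-p} = O(1), & p < 1, \\[2ex]
    c \log (D / \epsilon), & p = 1, \\[1ex]
    \dfrac{c\,(\epsilon^{1-p} - D^{1-p})}{p-1} = O(\epsilon^{1-p}), & p > 1.
  \end{cases}
\]
```

Only for *p* < 1 does the integral remain bounded as $\epsilon \to 0$; for $p \geq 1$, keeping $\epsilon$ away from 0 and then optimizing over it, as theorem 2 permits, is essential.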

Notice that the bound of the form 3.2 cannot tackle the cases $p \geq 1$, since the involved integral diverges for such *p*. Here are some explicit function classes that satisfy the entropy conditions in corollary 1.

If the candidate class $\mathcal{K}$ is the convex hull of a base class of kernels with finite pseudodimension *d*, then the covering numbers of $\mathcal{K}$ can be controlled by an exponential of 1/*r* (Van der Vaart & Wellner, 1996, corollary 2.6.12):

$$\log \mathcal{N}(r, \mathcal{K}, d_{\mathbf{x}}) \leq K \big(1/r\big)^{2d/(d+2)}, \tag{3.13}$$

where *K* is a constant independent of *r*. If *d* > 2, the exponent of 1/*r* in equation 3.13 is larger than 1, and thus the direct application of the traditional entropy integral, equation 3.2, is not possible in this situation. Although corollary 1 can be applied here, a better strategy is to employ the structural property (Ying & Campbell, 2010), which reduces the discussion of the convex hull to that of the base class. Interested readers are referred to Srebro and Ben-David (2006) for some interesting base classes with finite pseudodimensions (e.g., gaussian kernels with covariance matrices).

Another example is the ball of radius *R* in a Hilbert space of kernel functions, that is, the class given in equation 3.14. Rejchel (2012) considered function spaces of the form 3.14 in the context of ranking, where the estimation of the associated Rademacher chaos complexity is important for understanding the behavior of the associated estimators. Ong, Smola, and Williamson (2005) also provided a novel approach to learning the kernel in such a space by minimizing the regularized quality functional, and the kernel of the space is called a hyperkernel in their setting.^{3}

When the kernel class is a ball of radius *R* in a Sobolev-smooth function space on $\mathcal{X} \times \mathcal{X}$ with $\mathcal{X} \subset \mathbb{R}^{d}$, then for any 0 < *r* < *R*, the covering numbers satisfy the growing rate (Cucker & Zhou, 2007, theorem 5.8):

$$\log \mathcal{N}(r, B_R) \leq C \Big(\frac{R}{r}\Big)^{4d/s}. \tag{3.15}$$

Consequently, the exponent 4*d*/*s* in equation 3.15 would exceed 1 if the smoothness parameter *s* is smaller than 4*d*.

The last example concerns linear combinations of base kernels *K*_{i} under the "*l*_{2}-constraint," that is, the class given in equation 3.16. Reformulating the class 3.16 as a class of linear functions and applying corollary 3 in Zhang (2002) to tackle its covering numbers, we obtain an entropy bound of the polynomial form required by corollary 1.^{4}

## 4. Applications to the Multikernel Learning Problem

In this section, we demonstrate the effectiveness of the previous results on Rademacher chaos complexities by showing how they can be applied to the kernel learning problem. The foundation on which our discussion is based is a novel connection (theorem 4) between the uniform deviation and the Rademacher chaos complexities, attributed to Ying and Campbell (2010).

Our main aim is to show the Bayes consistency of the multikernel regularized classifiers defined in equation 2.4 by providing some satisfactory estimates for the excess misclassification error $\mathcal{R}(\mathrm{sgn}(f)) - \mathcal{R}(f_c)$, where $\mathcal{R}(\mathrm{sgn}(f))$ is the risk of the prediction rule sgn(*f*) as assessed by the 0-1 loss and $f_c$ is the Bayes rule. The following theorem (Zhang, 2004; Bartlett, Jordan, & McAuliffe, 2006), called the comparison inequality, implies that the excess misclassification error can be controlled by the excess $\phi$-risk $\mathcal{E}^{\phi}(f) - \mathcal{E}^{\phi}(f_{\phi})$, where $f_{\phi}$ is a measurable function minimizing the $\phi$-risk. Consequently, the discussion of the excess $\phi$-risk is sufficient for our purpose.

**Theorem 3** *(Zhang, 2004). For the hinge loss $\phi(t) = \max(0, 1 - t)$, we have the following inequality for any measurable function f:*

$$\mathcal{R}(\mathrm{sgn}(f)) - \mathcal{R}(f_c) \leq \mathcal{E}^{\phi}(f) - \mathcal{E}^{\phi}(f_{\phi}). \tag{4.1}$$

Throughout this section, we assume that the regularization error admits a polynomial decay $\mathcal{D}(\lambda) \leq c\,\lambda^{\beta}$ for some constant *c* and exponent $\beta \in (0, 1]$. This is a standard assumption, and there are many sufficient conditions guaranteeing the polynomial decay of the regularization error (Wu et al., 2007; Chen, Wu, Ying, & Zhou, 2004; Chen, Pan, Li, & Tang, 2013). We also assume that the quantity $\kappa := \sup_{K \in \mathcal{K}} \sup_{x \in \mathcal{X}} \sqrt{K(x, x)}$ is finite.

**Theorem 4** *(Ying & Campbell, 2010). Assume that the loss function is the hinge loss. Then for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, the uniform deviation between the generalization error and the empirical error over the involved hypothesis space is controlled in terms of the Rademacher chaos complexity $\hat{U}_n(\mathcal{K})$.*

With theorem 4 as a stepping-stone, we can now present the second result of this letter. Theorem 6 provides a general result on MKL algorithms' learning rates under the hinge loss and shows that the excess generalization error can be controlled by the regularization error for a suitably restricted regularization parameter $\lambda$. Its proof is based on the iteration trick due to Steinwart and Scovel (2007). For this purpose, we first give a lemma illustrating how the norm of the solution decreases as a function of the radius *R*.

**Proof of Theorem 6.** Here *a* and *b* are given by equation 4.6. From lemma 5, we know that there exists a nested sequence of subsets of the hypothesis space with associated radii $R^{(0)} \geq R^{(1)} \geq \cdots$, each containing $f_{\mathbf{z}}$. Consequently, for any *J*, the radius $R^{(J)}$ is controlled by equation 4.12. Utilizing this fact and applying inequality 4.8 with *R* replaced by $R^{(J)}$, the resulting inequality holds with probability at least $1 - \delta$. Taking *J* sufficiently large, the contribution of the initial radius becomes negligible, and thus, with probability at least $1 - \delta$, the resulting bound can be rewritten as equation 4.9. The proof is complete.
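To illustrate the iteration technique in the abstract, the following is a generic sketch under illustrative assumptions; the letter's own constants *a* and *b* come from equation 4.6, and the iterate bound from equation 4.12:

```latex
% Generic contraction step of the Steinwart--Scovel iteration:
\[
  R^{(j+1)} \;\le\; a + b\,\bigl(R^{(j)}\bigr)^{\theta}, \qquad 0 < \theta < 1 .
\]
% Any fixed point R* of the map r -> a + b r^theta satisfies
\[
  R^{*} \;\le\; \max\bigl\{\, 2a,\ (2b)^{1/(1-\theta)} \,\bigr\},
\]
% since either R* <= 2a, or a <= R*/2 forces (R*)^{1-theta} <= 2b.  After J
% iterations, the influence of the crude initial radius R^{(0)} decays doubly
% exponentially, entering only through the factor (R^{(0)})^{theta^J}.
```

The gain over a single application of the radius bound is that the final radius no longer depends (up to a negligible term) on the crude initial estimate.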

**Corollary 2.** *Let $\phi$ be the hinge loss, and suppose the regularization error admits a polynomial decay $\mathcal{D}(\lambda) \leq c\,\lambda^{\beta}$. Let $\mathcal{K}$ be a prescribed set of candidate kernels with $\kappa < \infty$ and $\log \mathcal{N}(r, \mathcal{K}, d_{\mathbf{x}}) \leq c\,(1/r)^{p}$ for any r > 0, where c and p are any two positive constants. Then, by choosing the regularization parameter $\lambda = \lambda(n)$ appropriately in each of the cases p < 1, p = 1, and p > 1, for any $\delta \in (0, 1)$, the excess misclassification error of the classifier $f_{\mathbf{z}}$ in equation 2.4 can be controlled, with probability at least $1 - \delta$, as in equation 4.14.*

**Proof.** By the reproducing property, it is not hard to check that $\|f\|_{\infty} \leq \kappa \|f\|_{K}$. We can proceed with this proof by distinguishing three cases according to the magnitude of *p*.

When we put the three cases together and use inequality 4.1 to connect the excess misclassification error with the excess generalization error, the stated inequality is immediate.

The analysis based on theorem 6 and the previous entropy integral, 3.2, can tackle only the case *p* < 1, while our discussion yields satisfactory learning rates for all *p* > 0. One can refer to the examples in section 3.2 for some specific function classes satisfying the entropy conditions in corollary 2. The entropy condition also holds (Mendelson, 2003) if the fat-shattering dimension grows as a polynomial of 1/*r* with exponent *p*.

When the Rademacher chaos complexity grows at most polynomially in *n* (which happens if the exponent *p* is less than 1), Ying and Campbell (2010, example 1) derived a learning-rate bound holding with probability at least $1 - \delta$ by applying equation 4.13 with a specific assignment of the regularization parameter. Comparing this generalization bound with equation 4.14, one can see that corollary 2 is much improved. Furthermore, as *p* approaches 0, our learning rate can roughly attain a substantially faster speed (we ignore the logarithmic factor for simplicity), a great improvement. For the case $p \geq 1$, a direct calculation shows that the learning rates given by corollary 2 are also sharper than the bounds that can be induced from equation 4.13 and the refined entropy integral. Notice that this improvement is due to the iteration technique, since we are using the same entropy integral to control the Rademacher chaos complexities here.

## 5. Conclusion

In this letter, we provide a refined entropy integral for the Rademacher chaos complexity that can be extremely useful when the standard integral diverges. We apply our bounds to some function classes and obtain some satisfactory results. We also combine our Rademacher chaos complexity bounds and the iteration technique in Steinwart and Scovel (2007) to improve the learning rates of MKL machines.

We consider the use of the Rademacher chaos complexity only in the MKL context. However, our discussion can also have direct applications in other learning problems whose goal is to construct a function *f* defined on pairs of examples. A well-known example is ranking (Rejchel, 2012), where the function *f* is used to predict the ordering between objects. Specifically, Rejchel (2012) considered generalization bounds for ranking rules under the assumption that the logarithm of the covering number grows as a polynomial with exponent less than 1 (Rejchel, 2012, assumption B); this assumption is due to the fact that Rejchel used equation 3.2 to control Rademacher chaos complexities (Rejchel, 2012, equation 18). If one uses the refined entropy integral instead, however, this assumption can be safely removed.

## Acknowledgments

This work is supported in part by the National Natural Science Foundation of China (grants 60975050 and 61165004). We also thank the anonymous referees for their insightful and constructive comments, which greatly improved the quality of the letter.

## References

## Notes

^{1} The constant *D* can be removed under an additional assumption on the function class.

^{3} The kernel *K* learned in such a space is not necessarily positive semidefinite, and one needs to impose an additional constraint to guarantee that the kernel matrix $[K(x_i, x_j)]_{i,j=1}^{n}$ is positive semidefinite (Ong et al., 2005).

^{4} Covering numbers with respect to the metric $d_{\mathbf{x}}$ can be considered (up to some factors) as empirical *L*_{2}-covering numbers with respect to the data.