Estimating the Rademacher chaos complexity of order two is important for understanding the performance of multikernel learning (MKL) machines. In this letter, we develop a novel entropy integral for Rademacher chaos complexities. Compared with previous bounds, our result is a substantial improvement in that it introduces an adjustable parameter ε to prevent the divergence of the involved integral. Using the iteration technique of Steinwart and Scovel (2007), we also apply our Rademacher chaos complexity bound to MKL problems and improve existing learning rates.
During the last few years, kernel-based methods such as support vector machines have found great success in solving supervised learning problems like classification and regression (Schölkopf & Smola, 2002; Shawe-Taylor & Cristianini, 2004). For these tasks, the performance of the learning algorithms largely depends on the data representation via the choice of kernel functions—ones that are typically handcrafted and fixed in advance (Varma & Babu, 2009; Ying & Campbell, 2010). While this provides opportunities to incorporate prior knowledge into the learning process, it can also be difficult in practice to find prior justifications for the use of one kernel instead of another (Lanckriet, Cristianini, Bartlett, Ghaoui, & Jordan, 2004). In this context, multikernel learning (MKL) has been introduced into the learning community to tackle this issue (Rakotomamonjy, Bach, Canu, & Grandvalet, 2007; Micchelli & Pontil, 2005; Lanckriet et al., 2004). Given a set of candidate (base) kernels, MKL tries to learn the most appropriate kernel from the training data. This strategy can significantly enhance the interpretability of the decision function, and it also reflects the fact that typical learning problems often involve multiple, heterogeneous data sources (Rakotomamonjy et al., 2007; Sonnenburg, Rätsch, & Schäfer, 2006).
Although learning the kernel from data allows greater flexibility in matching the target function (Srebro & Ben-David, 2006), this alone does not guarantee the quality of the obtained model, since one also needs to take into account the capacity of the hypothesis space. In this letter, we seek to characterize this capacity by the so-called Rademacher chaos complexity of order two.
Rademacher complexities have proved to be a powerful tool for measuring the complexity of function classes (Bartlett & Mendelson, 2002; Mendelson, 2003). In comparison with other complexity measures, such as covering numbers, VC dimension, and pseudodimension, the analysis based on the Rademacher complexity can always lead to slightly faster learning rates, since it is directly related to the behavior of the associated uniform deviation and can precisely capture the quantity we are interested in (Bousquet, 2003). Consequently, characterizing the Rademacher complexity is of great significance in learning theory. Recently, Ying and Campbell (2010) introduced Rademacher chaos complexities of order two into the discussion of MKL machines' learning rates. It has been demonstrated that the Rademacher chaos complexity inherits the advantage of the Rademacher complexity in that it can also yield an improvement in error analysis (Ying & Campbell, 2009, 2010; Chen & Li, 2010). In their work, Ying and Campbell provided a comprehensive study of Rademacher chaos complexities and offered some novel results, such as the connection between Rademacher chaos complexities and MKL, structural results on Rademacher chaos complexities, and the related entropy integrals, thus justifying the use of Rademacher chaos complexities as an appropriate tool for treating learning rates. Other theoretical studies on the generalization performance of MKL can be found in Bousquet and Herrmann (2002), Lanckriet et al. (2004), Srebro and Ben-David (2006), Wu, Ying, and Zhou (2007), and Ying and Zhou (2007).
However, the Rademacher chaos complexity bounds in Ying and Campbell (2010), which are based on the standard entropy integral, may not be useful when the involved integral diverges; this occurs naturally if the logarithm of the covering number or the associated fat-shattering dimension grows at a polynomial rate with an exponent not less than 1 (Mendelson, 2003). This letter tackles this issue by presenting some refined entropy integrals for Rademacher chaos complexities. We use the standard Cauchy-Schwarz inequality to relate the Rademacher chaos complexity of the original class to that of an ε-cover, which can in turn be tackled by a chaining argument to yield a bound of the desirable form. Compared with previous results, our bound offers a potential advantage in that it allows the parameter ε to traverse an interval so as to balance the accuracy of the approximation by a cover against the complexity of that cover. With the aid of the iteration trick proposed by Steinwart and Scovel (2007), we also consider the application of our results to MKL problems and improve the existing learning rates.
This letter is organized as follows. Section 2 introduces the statement of the problem. In section 3, we establish the refined entropy integral for Rademacher chaos complexities and consider its application to some specific function classes. Section 4 illustrates the effectiveness of our result by investigating MKL machines’ learning rates. Our conclusions are presented in section 5.
2. Statement of the Problem
Before formulating our problem, we introduce some notation that we use throughout this letter. For a function f, the sign function sgn(f) is defined as sgn(f)(x)=1 if f(x)≥0 and sgn(f)(x)=−1 if f(x)<0. We use c to denote positive constants whose values may change from line to line or even within the same line.
2.1. Rademacher Chaos Complexities.
For simplicity, we always mean empirical Rademacher chaos complexities of order two when referring to Rademacher chaos complexities.
It is worth mentioning that the Rademacher process, indexed by the underlying function class, can also be regarded as the homogeneous Rademacher chaos process of order one. Beyond the Rademacher chaos complexities of order one or two, De la Peña and Giné (1999) also provided a definition of Rademacher chaos complexities of arbitrary order.
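As a concrete illustration of the quantity being bounded (not the letter's formal definition, whose normalization may differ by constant factors), the empirical Rademacher chaos complexity of order two for a finite family of base kernels can be approximated by Monte Carlo averaging over the Rademacher signs:

```python
import numpy as np

def rademacher_chaos_order2(kernel_mats, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher chaos
    complexity of order two for a finite kernel family.

    kernel_mats: list of (n, n) Gram matrices, one per candidate
    kernel. The 1/n normalization and the absolute value follow one
    common convention and are an assumption here.
    """
    rng = np.random.default_rng(seed)
    n = kernel_mats[0].shape[0]
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)  # Rademacher signs
        best = 0.0
        for K in kernel_mats:
            # homogeneous chaos of order two sums over i < j only:
            # sum_{i<j} s_i s_j K_ij = (s'Ks - tr K) / 2
            chaos = 0.5 * (sigma @ K @ sigma - np.trace(K)) / n
            best = max(best, abs(chaos))          # sup over the kernel class
        total += best
    return total / n_draws

# toy example: Gram matrices of two Gaussian kernels of different widths
X = np.linspace(0.0, 1.0, 20)[:, None]
sq = (X - X.T) ** 2
mats = [np.exp(-sq / w) for w in (0.1, 1.0)]
print(rademacher_chaos_order2(mats))
```

In an MKL context, each Gram matrix would come from one candidate base kernel evaluated on the training sample.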
As compared to equation 2.2, the refined version, equation 2.1, is more effective and can be applied to the case where the integral in equation 2.2 diverges. The first main purpose of this letter is to derive a novel characterization analogous to equation 2.1 for Rademacher chaos complexities. In this respect, our work can be considered a generalization of the refined entropy integral equation 2.1, from Rademacher complexities to Rademacher chaos complexities.
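To make the comparison concrete, the two patterns of bound can be written schematically as follows; the constants, the 1/√n normalization, and the metric are the generic Dudley-type choices and may differ from the letter's equations 2.1 and 2.2:

```latex
% Standard entropy integral (the pattern of equation 2.2): the lower
% limit is 0, so the bound is vacuous when the integral diverges.
\widehat{R}_n(\mathcal{F}) \;\le\; \frac{c}{\sqrt{n}}
   \int_{0}^{D} \sqrt{\log \mathcal{N}(\mathcal{F}, r)}\, \mathrm{d}r .

% Refined entropy integral (the pattern of equation 2.1): the lower
% limit \varepsilon is adjustable, at the price of an additive
% approximation term.
\widehat{R}_n(\mathcal{F}) \;\le\; \inf_{0 < \varepsilon < D}
   \Big\{ c\,\varepsilon \;+\; \frac{c}{\sqrt{n}}
   \int_{\varepsilon}^{D} \sqrt{\log \mathcal{N}(\mathcal{F}, r)}\, \mathrm{d}r \Big\} .
```

Taking ε → 0 in the refined form recovers the standard one whenever the latter converges, so the refinement can never be worse by more than constants.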
2.2. The Kernel Learning Problem.
Our second goal is to use the refined bounds on Rademacher chaos complexities to study the generalization ability of the resulting classifier, which is the theme of section 4.
3. Refined Bounds on Rademacher Chaos Complexities
Noticing that the divergence of the entropy integral is due to the rapid growth of the covering number as r approaches zero, we naturally ask whether a bound exists in which the lower limit of the integration is kept at a distance from 0. As we will see, this can be achieved by introducing a threshold ε to prevent the divergence that would result from too fine an approximation.
For this purpose, we first consider the Rademacher chaos complexity of an ε-cover of the original class. Viewed as an empirical process indexed by the finite cover, the associated Rademacher chaos complexity is simply the L1-norm of the maximum of a finite class of random variables, and it can be addressed by the powerful maximal inequalities of empirical process theory.
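Since a homogeneous Rademacher chaos of order two has sub-exponential (Orlicz ψ1) tails, the relevant maximal inequality for a finite class is of the following classical form (this is the textbook version, as in van der Vaart and Wellner's treatment of Orlicz norms, and the constant c may differ from the one in theorem 1):

```latex
% Maximal inequality for finitely many sub-exponential variables:
% if X_1, \dots, X_N have Orlicz \psi_1-norms bounded by \sigma,
% then, for an absolute constant c,
\mathbb{E}\, \max_{1 \le i \le N} |X_i|
   \;\le\; c\, \sigma \, \log(1 + N) .
```

The logarithmic dependence on the cardinality N is what makes an ε-cover, whose size is the covering number at scale ε, a useful intermediate object.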
3.1. Refined Rademacher Chaos Complexity Bounds.
The following theorem provides an example of maximal inequalities sufficient for our purpose; it can be considered a modification of equation 3.2 in that we provide a maximal inequality for the ε-cover rather than for the whole class. Its proof is based on the chaining argument (Dudley, 1999; Anthony & Bartlett, 1999), whose basic idea is to split each function of interest into a sum of functions chosen from a sequence of classes of progressively increasing complexity. The desired maximal inequality is then obtained by combining the maximal inequalities for all of these classes. The underlying reason that we can allow the lower limit of the integration to stay away from 0 is that elements of the ε-cover can be represented exactly without resorting to covers at scales finer than ε.
With theorem 1 as a stepping-stone, we are now ready to present the first main result of this letter. We temporarily fix an ε and construct an ε-cover of the initial function class, thereby reducing the original problem to controlling the chaos complexity of the cover and the approximation error incurred by the cover. The former can be controlled directly by theorem 1, yielding a bound involving an integral of the desired form, while the latter can be addressed by the standard Cauchy-Schwarz inequality. The following theorem is motivated by the recent work of Srebro and Sridharan (2010), who established an improvement, equation 2.1, of the standard entropy integral for the ordinary Rademacher complexity.
As can be seen clearly from inequality 3.8, our bound is at least as sharp as inequality 3.2, since the ε in inequality 3.8 can be made arbitrarily close to 0. Moreover, the parameter ε is permitted to traverse the interval (0, D) to achieve a trade-off between the accuracy of the approximation by the cover and the complexity of that cover, which effectively prevents the divergence of the integral over the interval (0, D). Theorem 2 can be readily extended to the setting of Rademacher chaos complexities of order larger than two.
3.2. Rademacher Chaos Complexities for Some Specific Classes.
As a specific example to show the possible superiority of our result over the previous one, we consider the Rademacher chaos complexities of function classes for which the logarithm of the covering number grows as a polynomial of 1/r.
We proceed with our discussion by distinguishing three cases:
Notice that a bound of the form 3.2 cannot tackle the case p ≥ 1, since the involved integral diverges for such p. Here are some explicit function classes that satisfy the entropy conditions in corollary 1.
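As a numeric sketch of why the truncation matters, suppose the logarithm of the covering number is proportional to r^{-p} (the proportionality constant is dropped here). The truncated integral over [ε, D] is finite for every ε > 0, but its ε → 0 limit is finite only when p < 1:

```python
import math

def truncated_entropy_integral(p, eps, D=1.0):
    """Closed form of the integral of r**(-p) over [eps, D], modeling
    a class with log-covering number proportional to r**(-p)."""
    if abs(p - 1.0) < 1e-12:
        return math.log(D / eps)                      # logarithmic divergence
    return (D ** (1.0 - p) - eps ** (1.0 - p)) / (1.0 - p)

# p < 1: converges as eps -> 0; p >= 1: blows up, so a positive eps
# (balanced against the approximation error) is essential.
for p in (0.5, 1.0, 2.0):
    vals = [truncated_entropy_integral(p, e) for e in (1e-2, 1e-4, 1e-6)]
    print(p, [round(v, 3) for v in vals])
```

For p = 0.5 the values approach the finite limit 2, while for p = 1 they grow logarithmically and for p = 2 they grow like 1/ε, mirroring the three cases of the corollary.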
4. Applications to the Multikernel Learning Problem
In this section, we demonstrate the effectiveness of the previous results on Rademacher chaos complexities by showing how they can be applied to the kernel learning problem. The foundation of our discussion is a novel connection (theorem 4) between the uniform deviation and the Rademacher chaos complexities, due to Ying and Campbell (2010).
Our main aim is to show the Bayes consistency of the multikernel regularized classifiers defined in equation 2.4 by providing satisfactory estimates for the excess misclassification error, that is, the difference between the risk of the prediction rule sgn(f), as assessed by the 0-1 loss, and the risk of the Bayes rule. The following theorem (Zhang, 2004; Bartlett, Jordan, & McAuliffe, 2006), called the comparison inequality, implies that the excess misclassification error can be controlled by the excess risk with respect to the surrogate loss, where the reference function is a measurable minimizer of the surrogate risk. Consequently, a discussion of the excess surrogate risk is sufficient for our purpose.
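The comparison inequality takes a particularly simple form under the hinge loss used later. In the display below, R denotes the misclassification risk, ℰφ the risk under the hinge loss φ(t) = max(0, 1 − t), f_c the Bayes rule, and f_φ a measurable minimizer of the φ-risk; the symbols f_c and f_φ are our notation and may not match the letter's:

```latex
% Zhang's (2004) comparison inequality for the hinge loss
% \phi(t) = \max(0, 1 - t): the excess misclassification error is
% bounded by the excess \phi-risk with constant 1.
R(\operatorname{sgn} f) - R(f_c)
   \;\le\; \mathcal{E}_\phi(f) - \mathcal{E}_\phi(f_\phi) .
```

Because the constant is 1 for the hinge loss, any rate obtained for the excess φ-risk transfers directly to the excess misclassification error.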
With theorem 4 as a stepping-stone, we can now present the second result of this letter. Theorem 6 provides a general result on the learning rate of MKL algorithms under the hinge loss and shows that the excess generalization error can be controlled by the regularization error for a restricted range of the regularization parameter. Its proof is based on the iteration trick due to Steinwart and Scovel (2007). For this purpose, we first give a lemma illustrating how the relevant norm decreases as a function of the radius R.
By the reproducing property, the required norm bound is not hard to check. We proceed with the proof by distinguishing three cases according to the magnitude of p:
Putting the three cases together and using inequality 4.1 to connect the excess misclassification error with the excess generalization error, the stated inequality is immediate.
The analysis based on theorem 6 and the previous entropy integral, 3.2, can tackle only the case p<1, whereas our discussion yields satisfactory learning rates for all p>0. One can refer to the examples in section 3.2 for specific function classes satisfying the entropy conditions in corollary 2. The entropy condition also holds (Mendelson, 2003) if the fat-shattering dimension grows as a polynomial of 1/r with exponent p.
In this letter, we provide a refined entropy integral for the Rademacher chaos complexity that can be extremely useful when the standard integral diverges. We apply our bounds to some function classes and obtain some satisfactory results. We also combine our Rademacher chaos complexity bounds and the iteration technique in Steinwart and Scovel (2007) to improve the learning rates of MKL machines.
We consider the use of the Rademacher chaos complexity only in the MKL context. However, our discussion can also have direct applications in other learning problems in which the goal is to construct a function f defined on pairs of instances. A well-known example is ranking (Rejchel, 2012), where the function f is used to predict the ordering between objects. Specifically, Rejchel (2012) derived generalization bounds for ranking rules under the assumption that the logarithm of the covering number grows as a polynomial with exponent less than 1 (Rejchel, 2012, assumption B); this assumption stems from the fact that Rejchel used equation 3.2 to control Rademacher chaos complexities (Rejchel, 2012, equation 18). If one uses the refined entropy integral instead, this assumption can be safely removed.
This work is supported in part by the National Natural Science Foundation of China (grants 60975050 and 61165004). We also thank the anonymous referees for their insightful and constructive comments, which greatly improved the quality of this letter.
Covering numbers with respect to the metric dx can be considered (up to some factors) as empirical L2-covering numbers with respect to the data.