We present an improved theoretical foundation for support vector machines with polynomial kernels. The sample error is estimated under Tsybakov's noise assumption. In bounding the approximation error, we take advantage of a geometric noise assumption that was originally introduced to analyze gaussian kernels. In contrast with the previous literature, the error analysis in this note does not require any regularity of the marginal distribution or smoothness of Bayes' rule. We thus establish learning rates for polynomial kernels for a wide class of distributions.
Support vector machines (SVMs), a special kind of kernel-based method, were introduced by Boser, Guyon, and Vapnik (1992) with polynomial kernels and by Cortes and Vapnik (1995) with general kernels. Since then the theoretical foundation of SVMs has been studied by many authors; a far from complete list of papers containing results of this kind includes Chen, Wu, Ying, and Zhou (2004); Cucker and Zhou (2007); Evgeniou, Pontil, and Poggio (2000); Steinwart (2002); Steinwart and Christmann (2008); Steinwart and Scovel (2007); Vapnik (1998); Wu, Ying, and Zhou (2007); Wu and Zhou (2006); Xiang and Zhou (2009); Zhang (2004). In this note, we investigate SVM classifiers with polynomial kernels, probably one of the most popular families of kernels used in SVMs and other kernel-based learning algorithms. Our goal is to establish explicit learning rates for these algorithms under very mild conditions.
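For concreteness, here is a minimal sketch of the kind of classifier we have in mind, assuming scheme 1.2 is the usual regularized hinge-loss SVM with a polynomial kernel; the toy data, the kernel parameters, and the use of scikit-learn are illustrative choices and not taken from the note.

import numpy as np
from sklearn.svm import SVC

# Toy binary classification data with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))
y = np.where(X[:, 0] + X[:, 1] > 1.0, 1, -1)

# SVM with the polynomial kernel K(x, x') = (gamma * <x, x'> + coef0)^degree.
# The degree of the kernel polynomial and the regularization strength
# (C, roughly 1/(m * lambda) in regularization notation) are the two quantities
# whose joint choice drives the learning rates studied in this note.
clf = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=10.0)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))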
Although the polynomial kernels are the original and probably among the most important kernels used in SVMs, only a few papers deal with the learning behavior of scheme 1.2. The main difficulty in studying algorithms with polynomial kernels is probably the approximation error, which depends intricately on the degree of the kernel polynomial. Zhou and Jetter (2006) partly overcame this difficulty in the univariate case. An extension of the result to the multivariate setting was obtained by Tong, Chen, and Peng (2008). However, these works require restrictive assumptions on the distribution. For example, it is assumed that the marginal distribution has a finite distortion with respect to Lebesgue measure on X and that Bayes' rule fc satisfies a certain smoothness condition measured by a modulus of smoothness. These assumptions cannot be easily guaranteed in practice or verified in theory. In the case of gaussian kernels, a more realistic assumption on the distribution is the geometric noise assumption (see definition 3), proposed in Steinwart and Scovel (2007), which requires neither smoothness of fc nor any regularity condition on the marginal distribution. A natural question is whether the geometric noise assumption can still be used to bound the approximation error for polynomial kernels. In this note, we present a positive answer to this question by virtue of the connection between Bernstein polynomials (see definition 6) and probability theory. We then derive explicit learning rates for SVMs with multivariate polynomial kernels under the geometric noise assumption and Tsybakov's noise assumption (see definition 1).
2 Definitions, Assumptions and Preliminaries
In this section we introduce some notation, definitions, and basic facts that will be used in this note.
According to the no-free-lunch theorem (Devroye et al., 1997, theorem 7.2), learning rates are impossible without restrictions on the underlying distribution. We shall introduce two kinds of such restrictions in this note. One is the Tsybakov noise condition (see Tsybakov, 2004).
It is easy to see that all distributions have noise exponent at least 0. Deterministic distributions, for which the label y is determined by x almost surely, have noise exponent q = ∞.
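In the formulation common in the cited literature (see Tsybakov, 2004; Steinwart & Scovel, 2007), a distribution has Tsybakov noise exponent q ≥ 0 if there is a constant c_q > 0 such that
\[
\rho_X\big(\{x \in X : |f_\rho(x)| \le t\}\big) \;\le\; c_q\, t^{q} \qquad \text{for all } t > 0,
\]
where \rho_X denotes the marginal distribution and f_\rho(x) = P(y = 1 \mid x) - P(y = -1 \mid x) the regression function, so that fc = sgn(f_\rho); definition 1 and equation 2.1 are presumably of this form, up to the choice of constants and the range of t.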
The geometric noise condition describes the concentration of the measure near the decision boundary and does not imply any smoothness of fc or regularity of the marginal distribution with respect to Lebesgue measure on X. However, one can show, as in Steinwart and Scovel (2007, theorem 2.6), that if the distribution has Tsybakov noise exponent q and satisfies an envelope condition near the decision boundary for some constants, then it has a geometric noise exponent determined by q and the envelope constants. (A detailed discussion of this assumption can be found in section 8.2 of Steinwart and Christmann, 2008.)
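In the notation of Steinwart and Scovel (2007), which definition 3 presumably follows up to constants, the distribution has geometric noise exponent α > 0 if there is a constant C > 0 with
\[
\int_X |f_\rho(x)|\, \exp\!\Big(-\frac{\tau_x^2}{t}\Big)\, d\rho_X(x) \;\le\; C\, t^{\alpha d/2} \qquad \text{for all } t > 0,
\]
where τ_x denotes the distance from x to the decision boundary and d is the dimension of the input space X.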
The geometric noise condition, equation 2.2, will be used in section 3 to estimate the approximation error defined just below the proof of proposition 5. To the best of our knowledge, this is the first time that this assumption has been applied to kernels other than the gaussian.
To get better error estimates, one usually makes full use of the projection operator introduced in Chen et al. (2004).
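Explicitly, the projection operator clips a real-valued function to the interval [−1, 1]:
\[
\pi(f)(x) \;=\;
\begin{cases}
1, & \text{if } f(x) > 1,\\
f(x), & \text{if } -1 \le f(x) \le 1,\\
-1, & \text{if } f(x) < -1.
\end{cases}
\]
Since the labels take values ±1, projecting never increases the hinge loss, which is what makes the error decomposition below possible.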
The first term and the remaining terms on the right-hand side of equation 2.4 are called the approximation error and the sample error (with respect to f0), respectively.
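For orientation, a decomposition of this kind (cf. Chen et al., 2004) typically reads as follows, for the SVM output f_z and an arbitrary comparison function f_0 in the RKHS H_K:
\[
\mathcal{E}(\pi(f_z)) - \mathcal{E}(f_c)
\;\le\;
\Big\{\mathcal{E}(f_0) - \mathcal{E}(f_c) + \lambda \|f_0\|_K^2\Big\}
+ \Big\{\mathcal{E}(\pi(f_z)) - \mathcal{E}_z(\pi(f_z))\Big\}
+ \Big\{\mathcal{E}_z(f_0) - \mathcal{E}(f_0)\Big\},
\]
where \mathcal{E} and \mathcal{E}_z denote the generalization error and the empirical error with respect to the hinge loss; equation 2.4 is presumably of this form.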
According to proposition 5, choosing a regularization function is key to bounding the excess generalization error. To this end, we introduce the Bernstein polynomials on a simplex (see Lorentz, 1986).
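The probabilistic reading of Bernstein polynomials mentioned in the introduction is easy to check numerically. The sketch below, written for the univariate case with an arbitrary test function (both are illustrative choices, not taken from the note), compares the Bernstein polynomial (B_n f)(x) = Σ_{k=0}^n f(k/n) C(n,k) x^k (1−x)^(n−k) with the expectation E[f(S_n/n)] for S_n distributed as Binomial(n, x); on a simplex the binomial distribution is replaced by a multinomial one.

import numpy as np
from math import comb

def bernstein(f, n, x):
    # Degree-n Bernstein polynomial of f evaluated at x in [0, 1].
    k = np.arange(n + 1)
    weights = np.array([comb(n, int(i)) for i in k]) * x**k * (1.0 - x)**(n - k)
    return float(np.sum(f(k / n) * weights))

f = lambda t: np.abs(t - 0.3)          # an arbitrary continuous test function
x, n = 0.7, 200

exact = bernstein(f, n, x)

# Monte Carlo check of the identity (B_n f)(x) = E[f(S_n / n)], S_n ~ Binomial(n, x).
rng = np.random.default_rng(0)
samples = rng.binomial(n, x, size=200_000)
mc = float(np.mean(f(samples / n)))

print(exact, mc)   # the two values agree up to Monte Carlo error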
We shall estimate the approximation error and the sample error with respect to this choice in the next two sections.
3 Approximation Error
In this section we estimate the approximation error. We first compute the relevant RKHS norm.
Second, we apply the geometric noise condition, equation 2.2, to bound the remaining term. To this end, we need some further preparation.
Now we can bound the approximation error.
4 Sample Error
Another term of the sample error involves the sample z and thus runs over a set of functions, so we need a probability inequality for uniform convergence, given by means of covering numbers.
For a subset F of C(X) and ε > 0, the covering number N(F, ε) is defined to be the minimal integer l such that there exist l balls with radius ε covering F.
The following probability inequality was verified in Wu and Zhou (2005):
5 Learning Rates
In this section we derive explicit learning rates for equation 1.2 by appropriately choosing the regularization parameter and the degree of the kernel polynomial.
When the Tsybakov noise condition, equation 2.1, is not assumed, one can still use theorem 15 by setting q = 0 and obtain a corresponding learning rate. When q tends to infinity, the power index of m in equation 5.1 approaches a limit that improves as the geometric noise exponent increases, so the learning rate in theorem 15 can be made arbitrarily close to this limiting rate when q and the geometric noise exponent are large enough.
I thank Qiang Wu for helpful discussions. This work was completed while I was a visiting scholar at Middle Tennessee State University. It was partly supported by the NSF of China under grant 11501380 and the Fundamental Research Funds for the Central Universities in UIBE (13YBLG01).