## Abstract

We present an improved theoretical foundation for support vector machines with polynomial kernels. The sample error is estimated under Tsybakov’s noise assumption. In bounding the approximation error, we take advantage of a geometric noise assumption that was introduced to analyze gaussian kernels. In contrast with the previous literature, the error analysis in this note does not require any regularity of the marginal distribution or smoothness of Bayes’ rule. We thus establish learning rates for polynomial kernels for a wide class of distributions.

## 1  Introduction

Support vector machines (SVMs), as a special kind of kernel-based method, were introduced by Boser, Guyon, and Vapnik (1992) with polynomial kernels, and by Cortes and Vapnik (1995) with general kernels. Since then the theoretical foundation of SVMs has been considered by many authors; a far from complete list of papers containing results of this kind includes Chen, Wu, Ying, and Zhou (2004); Cucker and Zhou (2007); Evgeniou, Pontil, and Poggio (2000); Steinwart (2002); Steinwart and Christmann (2008); Steinwart and Scovel (2007); Vapnik (1998); Wu, Ying, and Zhou (2007); Wu and Zhou (2006); Xiang and Zhou (2009); Zhang (2004). In this note, we investigate SVM classifiers with polynomial kernels, probably one of the most popular families of kernels used in SVMs and other kernel-based learning algorithms. Our goal is to establish explicit learning rates for these algorithms under very mild conditions.

We focus on a binary classification problem. Let X be a compact subset of $\mathbb{R}^n$ and $Y = \{-1, 1\}$. A binary classifier is a function $f: X \to Y$, which labels every $x \in X$ with some $y \in Y$. If $\rho$ is an unknown probability measure on $Z := X \times Y$, then the misclassification error, which is often used to measure the prediction power of a classifier f, can be defined by
$$\mathcal{R}(f) = \mathrm{Prob}\{f(x) \neq y\} = \int_X \rho(y \neq f(x) \mid x)\, d\rho_X(x).$$
Here $\rho_X$ is the marginal distribution on X, and $\rho(\cdot \mid x)$ is the conditional probability measure at $x \in X$ induced by $\rho$. If we define the regression function of $\rho$ as
$$f_\rho(x) = \int_Y y\, d\rho(y \mid x) = \rho(y = 1 \mid x) - \rho(y = -1 \mid x), \qquad x \in X,$$
then we know (Devroye, Györfi, & Lugosi, 1997) that the classifier minimizing the misclassification error, called Bayes’ rule, is given by $f_c = \mathrm{sgn}(f_\rho)$, the sign of $f_\rho$. Here, for a function $f: X \to \mathbb{R}$, the sign function is defined as $\mathrm{sgn}(f)(x) = 1$ if $f(x) \geq 0$ and $\mathrm{sgn}(f)(x) = -1$ if $f(x) < 0$.
As $\rho$ is unknown, $f_c$ cannot be found directly. What we have is a set of samples $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^m$ independently drawn according to $\rho$. Define the hinge loss function as $\phi(t) = \max\{1 - t, 0\}$. Then the SVM classifier is defined as $\mathrm{sgn}(f_{\mathbf{z}})$, where $f_{\mathbf{z}}$ is a kernel-based regularized empirical risk minimizer; more precisely,
$$f_{\mathbf{z}} = \arg\min_{f \in \mathcal{H}_K} \bigl\{\mathcal{E}_{\mathbf{z}}(f) + \lambda \|f\|_K^2\bigr\}, \tag{1.1}$$
where $\mathcal{H}_K$ is a reproducing kernel Hilbert space (RKHS) with Mercer kernel K (see Aronszajn, 1950); $\lambda > 0$ is a regularization parameter, which often depends on m; and $\mathcal{E}_{\mathbf{z}}(f)$ denotes the empirical risk of a function f associated with the hinge loss and the samples $\mathbf{z}$, that is,
$$\mathcal{E}_{\mathbf{z}}(f) = \frac{1}{m} \sum_{i=1}^m \phi(y_i f(x_i)).$$
Denote by $|\cdot|$ and $x \cdot y$ the Euclidean norm and inner product of $\mathbb{R}^n$. In most applications two families of kernels are used in equation 1.1. One is the gaussian kernel $K_\sigma(x, y) = \exp(-|x - y|^2/\sigma^2)$ with a width parameter $\sigma > 0$. There has been a rich study of SVMs with gaussian kernels in the literature (e.g., Steinwart & Christmann, 2008; Steinwart & Scovel, 2007; Wu et al., 2007; Xiang & Zhou, 2009; Ying & Zhou, 2007). Another important Mercer kernel is the polynomial kernel
$$K_d(x, y) = (1 + x \cdot y)^d,$$
where d is the degree of the kernel polynomial. It is known (Cucker & Smale, 2001) that the corresponding RKHS $\mathcal{H}_d$ is the set of n-variable polynomials of degree at most d, and the dimension of $\mathcal{H}_d$ is $N = \binom{n+d}{d}$.
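As a quick sanity check of this dimension count (an illustration, not part of the original analysis), one can enumerate the exponent vectors $k$ with $|k| \leq d$ directly and compare against the closed form $\binom{n+d}{d}$:

```python
from itertools import product
from math import comb

def poly_space_dim(n: int, d: int) -> int:
    """Count the monomials x^k in n variables with total degree |k| <= d
    by brute-force enumeration of the exponent vectors k."""
    return sum(1 for k in product(range(d + 1), repeat=n) if sum(k) <= d)

# The count agrees with the closed form C(n + d, d) for small n and d.
for n in range(1, 5):
    for d in range(6):
        assert poly_space_dim(n, d) == comb(n + d, d)
```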
In this note, we restrict our attention to the SVM classifier generated with polynomial kernels; it is defined as $\mathrm{sgn}(f_{\mathbf{z}})$, where
$$f_{\mathbf{z}} = \arg\min_{f \in \mathcal{H}_d} \Bigl\{\frac{1}{m} \sum_{i=1}^m \phi(y_i f(x_i)) + \lambda \|f\|_d^2\Bigr\}, \tag{1.2}$$
and $\|\cdot\|_d$ denotes the norm of $\mathcal{H}_d$.
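For concreteness, here is a minimal numerical sketch of scheme 1.2 (an illustration only; the note does not prescribe a solver). It assumes the standard representer expansion $f = \sum_j \alpha_j K_d(x_j, \cdot)$, so that $\|f\|_d^2 = \alpha^\top K \alpha$, and minimizes the regularized empirical hinge risk by plain subgradient descent; the toy data and all parameter values below are made up:

```python
import numpy as np

def poly_kernel(X1, X2, d=3):
    """Polynomial kernel K_d(x, y) = (1 + x . y)^d."""
    return (1.0 + X1 @ X2.T) ** d

def svm_poly_fit(X, y, d=3, lam=0.001, lr=0.002, steps=20000):
    """Minimize (1/m) * sum_i hinge(y_i f(x_i)) + lam * ||f||_d^2 over
    f = sum_j alpha_j K_d(x_j, .), by plain subgradient descent;
    late iterates are averaged, as is standard for subgradient methods."""
    m = len(y)
    K = poly_kernel(X, X, d)
    alpha = np.zeros(m)
    avg = np.zeros(m)
    for t in range(steps):
        margins = y * (K @ alpha)
        active = margins < 1        # samples with a nonzero hinge subgradient
        grad = -(K[:, active] @ y[active]) / m + 2 * lam * (K @ alpha)
        alpha -= lr * grad
        if t >= steps // 2:
            avg += alpha
    return avg / (steps - steps // 2)

# Toy data inside the simplex x1 + x2 <= 1, labeled by a nonlinear rule.
rng = np.random.default_rng(0)
X = rng.random((60, 2)) / 2
y = np.where(X[:, 0] * X[:, 1] > 0.05, 1.0, -1.0)
alpha = svm_poly_fit(X, y)
accuracy = np.mean(np.sign(poly_kernel(X, X) @ alpha) == y)
```

Any off-the-shelf SVM solver could replace the subgradient loop; the point is only that, once the representer expansion is fixed, scheme 1.2 is a finite-dimensional convex problem.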

Although polynomial kernels are the original and probably among the most important kernels used in SVMs, only a few papers deal with the learning behavior of scheme 1.2. Probably the main difficulty in studying algorithms with polynomial kernels is the approximation error, which deeply involves the degree of the kernel polynomial. Zhou and Jetter (2006) partly overcame this difficulty in the univariate case. An extended version of the result in the multivariate setting was obtained by Tong, Chen, and Peng (2008). However, these works require restrictive assumptions on the distribution $\rho$. For example, the marginal distribution $\rho_X$ is assumed to have a finite distortion with respect to the Lebesgue measure on X, and Bayes’ rule $f_c$ to satisfy a certain smoothness measured by a modulus of smoothness. These assumptions cannot be easily guaranteed in practice or in theory. In the case of gaussian kernels, a more realistic assumption on $\rho$ is the geometric noise assumption (see definition 2), proposed in Steinwart and Scovel (2007), which does not require any kind of smoothness of $f_c$ or any regularity condition on $\rho_X$. A natural question is whether the geometric noise assumption is still valid for bounding the approximation error for polynomial kernels. In this note, we present a positive answer to this question by virtue of the connection of Bernstein polynomials (see definition 4) with probability theory. We then derive explicit learning rates of SVMs with multivariate polynomial kernels under the geometric noise assumption and Tsybakov’s noise assumption (see definition 1).

## 2  Definitions, Assumptions and Preliminaries

In this section we introduce some notation, definitions, and basic facts that will be used in this note.

Throughout the note, we assume X to be the simplex in $\mathbb{R}^n$, which is defined by
$$X = \Bigl\{x = (x^{(1)}, \ldots, x^{(n)}) \in \mathbb{R}^n : x^{(j)} \geq 0, \ \sum_{j=1}^n x^{(j)} \leq 1\Bigr\}.$$
We use the standard multi-index notation: for $k = (k_1, \ldots, k_n) \in \mathbb{N}_0^n$ and $x \in \mathbb{R}^n$, we write $|k| = k_1 + \cdots + k_n$, $k! = k_1! \cdots k_n!$, and $x^k = (x^{(1)})^{k_1} \cdots (x^{(n)})^{k_n}$.

According to the no-free-lunch theorem (Devroye et al., 1997, theorem 7.2), learning rates are impossible without any restriction on $\rho$. We shall introduce in this note two kinds of such restrictions. One is the Tsybakov noise condition (see Tsybakov, 2004).

Definition 1.
Let $0 \leq q \leq \infty$. We say that $\rho$ satisfies the Tsybakov noise condition with exponent q if there exists a constant $c_q > 0$ such that
$$\rho_X\bigl(\{x \in X : |f_\rho(x)| \leq t\}\bigr) \leq c_q t^q, \qquad \forall\, t > 0. \tag{2.1}$$

It is easy to see that all distributions have at least noise exponent 0. Deterministic distributions (which satisfy $|f_\rho(x)| = 1$ almost surely) have noise exponent $q = \infty$ with $c_q = 1$.
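As a worked example (not from the note): take $\rho_X$ uniform on $[0, 1]$ and $f_\rho(x) = 2x - 1$. Then $\rho_X(\{x : |f_\rho(x)| \leq t\}) = \min(t, 1)$, so the Tsybakov condition holds with exponent $q = 1$ and $c_q = 1$. A short check:

```python
def tsybakov_mass(t: float) -> float:
    """rho_X({x in [0, 1]: |2x - 1| <= t}) for rho_X uniform on [0, 1]:
    the set is the interval [(1 - t) / 2, (1 + t) / 2] clipped to [0, 1]."""
    lo, hi = max(0.0, (1 - t) / 2), min(1.0, (1 + t) / 2)
    return max(0.0, hi - lo)

# The mass equals min(t, 1), i.e., c_q * t^q with q = 1, c_q = 1 for t <= 1.
for t in [0.1, 0.25, 0.5, 1.0, 2.0]:
    assert abs(tsybakov_mass(t) - min(t, 1.0)) < 1e-12
```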

Define the generalization error of a function f as
$$\mathcal{E}(f) = \int_Z \phi(yf(x))\, d\rho.$$
The Tsybakov noise condition plays an important role because it makes it possible to bound the variance of the relative loss by a power of its expectation. The following lemma will be applied to estimate the sample error in section 4.
Lemma 1
(Wu and Zhou, 2005). If $\rho$ satisfies equation 2.1, then for every function $f : X \to [-1, 1]$, there exists some constant $c > 0$ such that
$$\mathbb{E}\Bigl[\bigl(\phi(yf(x)) - \phi(yf_c(x))\bigr)^2\Bigr] \leq c\, \bigl(\mathcal{E}(f) - \mathcal{E}(f_c)\bigr)^{\frac{q}{q+1}}.$$
Another restriction we assume on $\rho$ is the geometric noise condition, introduced in Steinwart and Scovel (2007) in the case of gaussian kernels. To formulate this assumption, we define the classes of X by $X_1 = \{x \in X : f_\rho(x) > 0\}$ and $X_{-1} = \{x \in X : f_\rho(x) < 0\}$. We also define a distance function $\tau_x$ by
$$\tau_x = \begin{cases} d(x, X_{-1}), & \text{if } x \in X_1, \\ d(x, X_1), & \text{if } x \in X_{-1}, \\ 0, & \text{otherwise}, \end{cases}$$
where $d(x, A)$ denotes the distance of $x$ to a set A with respect to the Euclidean norm. With this function, we can define the following geometric noise condition for distributions.
Definition 2.
Let $\alpha > 0$. We say that $\rho$ satisfies the geometric noise condition with exponent $\alpha$ if there exists a constant $c_\alpha > 0$ such that
$$\int_X \exp\left(-\frac{\tau_x^2}{t}\right) |f_\rho(x)|\, d\rho_X \leq c_\alpha t^{\frac{\alpha n}{2}} \tag{2.2}$$
holds for all $t > 0$.

The geometric noise condition describes the concentration of the measure near the decision boundary and does not imply any smoothness of $f_c$ or regularity of $\rho_X$ with respect to the Lebesgue measure on X. However, one can show, as in Steinwart and Scovel (2007, theorem 2.6), that if $\rho$ has a Tsybakov noise exponent q and satisfies an envelope condition bounding $|f_\rho(x)|$ by a power of the distance $\tau_x$, then $\rho$ has a corresponding geometric noise exponent determined by q and that power. (A detailed discussion of this assumption can be found in section 8.2 of Steinwart and Christmann, 2008.)

Geometric noise condition 2.2 will be used in section 3 to estimate the approximation error, defined just below the proof of proposition 1. To the best of our knowledge, this is the first time this assumption has been applied to kernels other than the gaussian ones.

To get better error estimates, one usually makes full use of the projection operator introduced in Chen et al. (2004).

Definition 3.
The projection operator $\pi$ on the space of measurable functions on X is defined by
$$\pi(f)(x) = \begin{cases} 1, & \text{if } f(x) > 1, \\ f(x), & \text{if } -1 \leq f(x) \leq 1, \\ -1, & \text{if } f(x) < -1. \end{cases}$$
Trivially $\mathcal{R}(\mathrm{sgn}(f)) = \mathcal{R}(\mathrm{sgn}(\pi(f)))$. This, together with the comparison theorem proved in Zhang (2004), gives
$$\mathcal{R}(\mathrm{sgn}(f)) - \mathcal{R}(f_c) \leq \mathcal{E}(\pi(f)) - \mathcal{E}(f_c) \tag{2.3}$$
for any measurable function $f : X \to \mathbb{R}$. Equation 2.3 asserts that the excess misclassification error can be bounded by means of the excess generalization error, which in turn can be estimated by the following error decomposition technique.
Proposition 1.
Let $f_{\mathbf{z}}$ be defined by equation 1.2. Then for any $f_0 \in \mathcal{H}_d$,
$$\mathcal{E}(\pi(f_{\mathbf{z}})) - \mathcal{E}(f_c) + \lambda \|f_{\mathbf{z}}\|_d^2 \leq \mathcal{D}(f_0) + \bigl\{\mathcal{E}(\pi(f_{\mathbf{z}})) - \mathcal{E}_{\mathbf{z}}(\pi(f_{\mathbf{z}}))\bigr\} + \bigl\{\mathcal{E}_{\mathbf{z}}(f_0) - \mathcal{E}(f_0)\bigr\}, \tag{2.4}$$
where $\mathcal{D}(f_0) = \mathcal{E}(f_0) - \mathcal{E}(f_c) + \lambda \|f_0\|_d^2$.
Proof.
Write $\mathcal{E}(\pi(f_{\mathbf{z}})) - \mathcal{E}(f_c) + \lambda \|f_{\mathbf{z}}\|_d^2$ as
$$\bigl\{\mathcal{E}(\pi(f_{\mathbf{z}})) - \mathcal{E}_{\mathbf{z}}(\pi(f_{\mathbf{z}}))\bigr\} + \bigl\{\mathcal{E}_{\mathbf{z}}(\pi(f_{\mathbf{z}})) + \lambda \|f_{\mathbf{z}}\|_d^2 - \mathcal{E}_{\mathbf{z}}(f_0) - \lambda \|f_0\|_d^2\bigr\} + \bigl\{\mathcal{E}_{\mathbf{z}}(f_0) - \mathcal{E}(f_0)\bigr\} + \mathcal{D}(f_0).$$
It is easy to check that $\phi(y\,\pi(f)(x)) \leq \phi(yf(x))$ for every f; thus, $\mathcal{E}_{\mathbf{z}}(\pi(f_{\mathbf{z}})) \leq \mathcal{E}_{\mathbf{z}}(f_{\mathbf{z}})$. This in connection with the definition of $f_{\mathbf{z}}$ implies that the second term is at most zero. The proposition is proved.

The first term and the remaining terms on the right-hand side of equation 2.4 are called the approximation error and the sample error (with respect to $f_0$), respectively.

According to proposition 1, choosing a regularization function $f_0$ is key to bounding the excess generalization error. To this end, we introduce the Bernstein polynomials on a simplex (see Lorentz, 1986).

Definition 4.
The Bernstein polynomial for a function f on the simplex X is defined as
$$B_d(f)(x) = \sum_{|k| \leq d} f\!\left(\frac{k}{d}\right) \frac{d!}{k!\,(d - |k|)!}\, x^k \bigl(1 - x^{(1)} - \cdots - x^{(n)}\bigr)^{d - |k|},$$
where $\frac{k}{d} = \bigl(\frac{k_1}{d}, \ldots, \frac{k_n}{d}\bigr)$.

One can easily see that $B_d(f) \in \mathcal{H}_d$. We shall estimate the approximation error and the sample error with respect to $f_0 = B_d(f_c)$ in the next two sections.
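To see the role the Bernstein polynomial will play as the regularization function, here is a small illustration in the univariate case $n = 1$ (where the simplex is $[0, 1]$); the non-smooth target mimics the fact that no smoothness of $f_c$ is assumed:

```python
from math import comb

def bernstein(f, d, x):
    """Bernstein polynomial B_d(f)(x) on [0, 1] (the simplex for n = 1)."""
    return sum(f(k / d) * comb(d, k) * x**k * (1 - x)**(d - k)
               for k in range(d + 1))

f = lambda t: abs(t - 0.5)   # a non-smooth target function
# Uniform error on a grid shrinks as the degree d grows.
errs = [max(abs(bernstein(f, d, i / 100) - f(i / 100)) for i in range(101))
        for d in (5, 20, 80)]
```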

## 3  Approximation Error

In this section we estimate the approximation error $\mathcal{D}(B_d(f_c)) = \mathcal{E}(B_d(f_c)) - \mathcal{E}(f_c) + \lambda \|B_d(f_c)\|_d^2$. We first compute the RKHS norm $\|B_d(f_c)\|_d$.

Proposition 2.
If $|f(x)| \leq 1$ for all $x \in X$, then
Proof.
Tong et al. (2008, theorem 3.1) showed that, for $p = \sum_{|k| \leq d} c_k x^k \in \mathcal{H}_d$,
$$\|p\|_d^2 = \sum_{|k| \leq d} \frac{k!\,(d - |k|)!}{d!}\, c_k^2.$$
Since $|f(x)| \leq 1$, we have a bound on the coefficients of $B_d(f)$ in the monomial basis. Here $N = \binom{n+d}{d}$ is the dimension of $\mathcal{H}_d$. Note that
$$\sum_{|k| \leq d} \frac{d!}{k!\,(d - |k|)!} = (n+1)^d.$$
The proposition is proved.

Second, we apply the geometric noise condition, equation 2.2, to bound $\mathcal{E}(B_d(f_c)) - \mathcal{E}(f_c)$. To this end, we need some preparation.

Let $x \in X$ and $x^{(n+1)} = 1 - \sum_{j=1}^n x^{(j)}$. Recall that in probability theory, a random vector $T$ (in $\mathbb{R}^n$) is said to follow a multinomial distribution with parameters d and $(x^{(1)}, \ldots, x^{(n)}, x^{(n+1)})$ if it has the probability mass function
$$\mathrm{Prob}\{T = k\} = \frac{d!}{k!\,(d - |k|)!}\, x^k \bigl(x^{(n+1)}\bigr)^{d - |k|}, \qquad |k| \leq d.$$
Denote by $e_0$ the zero vector and by $e_1, \ldots, e_n$ the unit vectors of $\mathbb{R}^n$. Let $\xi_1, \ldots, \xi_d$ be independent and identically distributed random vectors taking values from $\{e_0, e_1, \ldots, e_n\}$ with probability
$$\mathrm{Prob}\{\xi_i = e_j\} = x^{(j)}, \quad 1 \leq j \leq n, \qquad \mathrm{Prob}\{\xi_i = e_0\} = x^{(n+1)}.$$
Then one can see that $T = \sum_{i=1}^d \xi_i$ follows the multinomial distribution with parameters d and $(x^{(1)}, \ldots, x^{(n)}, x^{(n+1)})$. We also need the following Bennett inequality for vector-valued random variables, given in Smale and Zhou (2007).
Let $\mathcal{H}$ be a Hilbert space and $\eta_1, \ldots, \eta_m$ be m independent random variables with values in $\mathcal{H}$. Suppose that for each i, $\|\eta_i\| \leq M$ almost surely. Denote $\sigma^2 = \sum_{i=1}^m \mathbb{E}\|\eta_i - \mathbb{E}\eta_i\|^2$. Then for every $\varepsilon > 0$,
$$\mathrm{Prob}\left\{\Bigl\|\sum_{i=1}^m (\eta_i - \mathbb{E}\eta_i)\Bigr\| \geq \varepsilon\right\} \leq 2 \exp\left(-\frac{\varepsilon^2}{2\bigl(\sigma^2 + \frac{M\varepsilon}{3}\bigr)}\right).$$
Applying this probability inequality to $\{\xi_i\}_{i=1}^d$, we find:
Lemma 2.
Let $x \in X$. For any $\varepsilon > 0$, there holds
$$\sum_{\left|\frac{k}{d} - x\right| \geq \varepsilon} \frac{d!}{k!\,(d - |k|)!}\, x^k \bigl(x^{(n+1)}\bigr)^{d - |k|} \leq 2 \exp\left(-\frac{d\varepsilon^2}{2\bigl(1 + \frac{2\varepsilon}{3}\bigr)}\right),$$
where the sum is taken over all values of k satisfying $|k| \leq d$ and $\bigl|\frac{k}{d} - x\bigr| \geq \varepsilon$. Notation of this type is used in the sequel without explanation.
Proof.
The case $\varepsilon > 2$ is trivial, since the sum is then empty. For $0 < \varepsilon \leq 2$, we apply the Bennett inequality to $\{\xi_i\}_{i=1}^d$. Each of them satisfies $\|\xi_i - \mathbb{E}\xi_i\| \leq 2$ and $\mathbb{E}\|\xi_i - \mathbb{E}\xi_i\|^2 \leq 1$. Therefore,
$$\mathrm{Prob}\{|T - dx| \geq d\varepsilon\} \leq 2 \exp\left(-\frac{d^2\varepsilon^2}{2\bigl(d + \frac{2d\varepsilon}{3}\bigr)}\right) = 2 \exp\left(-\frac{d\varepsilon^2}{2\bigl(1 + \frac{2\varepsilon}{3}\bigr)}\right).$$
The lemma thus follows from $\mathrm{Prob}\bigl\{\bigl|\frac{T}{d} - x\bigr| \geq \varepsilon\bigr\} = \mathrm{Prob}\{|T - dx| \geq d\varepsilon\}$.

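A numerical sanity check of this multinomial connection (an illustration only; the point $x$ and the parameters are arbitrary): the probabilities $\frac{d!}{k!(d-|k|)!} x^k (x^{(n+1)})^{d-|k|}$ sum to 1 over $\{k : |k| \leq d\}$, and the mean of $T/d$ is $x$:

```python
from itertools import product
from math import factorial

def multinomial_pmf(k, d, x):
    """P(T = k) for T multinomial with d trials and cell probabilities
    (x_1, ..., x_n, 1 - sum(x)); the last cell absorbs the count d - |k|."""
    rem = d - sum(k)
    coef = factorial(d)
    for kj in k:
        coef //= factorial(kj)
    coef //= factorial(rem)
    p = coef * (1 - sum(x)) ** rem
    for kj, xj in zip(k, x):
        p *= xj ** kj
    return p

d, x = 4, (0.2, 0.3)    # a point in the simplex, n = 2
support = [k for k in product(range(d + 1), repeat=2) if sum(k) <= d]
total = sum(multinomial_pmf(k, d, x) for k in support)
assert abs(total - 1.0) < 1e-12
# The mean of T/d equals x, as expected for a multinomial random vector.
mean = [sum(k[j] * multinomial_pmf(k, d, x) for k in support) / d
        for j in range(2)]
assert all(abs(mean[j] - x[j]) < 1e-12 for j in range(2))
```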
Now we can bound $\mathcal{E}(B_d(f_c)) - \mathcal{E}(f_c)$.

Proposition 3.
If $\rho$ satisfies condition 2.2, then
Proof.
Since $|f_c(x)| \leq 1$ for all $x \in X$, it follows that $|B_d(f_c)(x)| \leq 1$. Hence the identity
$$\mathcal{E}(f) - \mathcal{E}(f_c) = \int_X |f(x) - f_c(x)|\, |f_\rho(x)|\, d\rho_X, \qquad \|f\|_\infty \leq 1,$$
given in Zhang (2004) ensures that
$$\mathcal{E}(B_d(f_c)) - \mathcal{E}(f_c) = \int_X |B_d(f_c)(x) - f_c(x)|\, |f_\rho(x)|\, d\rho_X. \tag{3.1}$$
In order to estimate $|B_d(f_c)(x) - f_c(x)|$ for $x \in X$, we observe that
Note that for all $x \in X$ we can obtain by lemma 2:
for all Since for we can analogously obtain and
we conclude
3.2
for all such x. In the remaining case, one can check that equation 3.2 still holds. Consequently, it follows from equations 3.1, 3.2, and 2.2 that:
This proves the proposition.

Combining the results of propositions 2 and 3, we obtain the estimate of the approximation error $\mathcal{D}(B_d(f_c))$.

Theorem 1.
Let X be the simplex in $\mathbb{R}^n$. If $\rho$ has geometric noise exponent $\alpha$ with constant $c_\alpha$ in equation 2.2, then there exists a constant independent of $\lambda$ and d such that for all $\lambda > 0$ and $d \in \mathbb{N}$,

## 4  Sample Error

In this section, we estimate the sample error in equation 2.4 with $f_0 = B_d(f_c)$. We first bound $\mathcal{E}_{\mathbf{z}}(f_0) - \mathcal{E}(f_0)$ by the following one-sided Bernstein inequality (see, e.g., Cucker & Smale, 2001; Cucker & Zhou, 2007).

Let $\xi$ be a random variable on a probability space Z with mean $\mu$ and variance $\sigma^2$. If $|\xi - \mu| \leq M$ almost everywhere, then for every $\varepsilon > 0$, there holds
$$\mathrm{Prob}\left\{\frac{1}{m}\sum_{i=1}^m \xi(z_i) - \mu \geq \varepsilon\right\} \leq \exp\left(-\frac{m\varepsilon^2}{2\bigl(\sigma^2 + \frac{M\varepsilon}{3}\bigr)}\right).$$
Proposition 4.
If $\rho$ satisfies equation 2.1, then for any $\delta \in (0, 1)$, with confidence at least $1 - \delta$, there holds
Proof.
Consider the random variable $\xi(z) = \phi(y B_d(f_c)(x)) - \phi(y f_c(x))$ on Z. It satisfies
These bounds give values for M and $\sigma^2$. Applying the one-sided Bernstein inequality to $\xi$ yields, with confidence at least $1 - \delta$,
Lemma 1 tells us that $\sigma^2 \leq c\,\bigl(\mathcal{E}(B_d(f_c)) - \mathcal{E}(f_c)\bigr)^{\frac{q}{q+1}}$. This, together with Young’s inequality, implies
Therefore, with confidence at least $1 - \delta$,

The other term of the sample error involves the sample $\mathbf{z}$ itself and thus runs over a set of functions, so we need a probability inequality for uniform convergence, expressed by means of covering numbers.

Definition 5.

For a subset $\mathcal{F}$ of $C(X)$ and $\varepsilon > 0$, the covering number $\mathcal{N}(\mathcal{F}, \varepsilon)$ is defined to be the minimal integer $l \in \mathbb{N}$ such that there exist l balls with radius $\varepsilon$ covering $\mathcal{F}$.

Note that $\mathcal{H}_d \subset C(X)$. Let $B_R = \{f \in \mathcal{H}_d : \|f\|_d \leq R\}$. Tong et al. (2008) showed that
$$\log \mathcal{N}(B_R, \varepsilon) \leq N \log \frac{4R}{\varepsilon}, \tag{4.1}$$
where $N$ is the dimension of $\mathcal{H}_d$.

The following probability inequality was verified in Wu and Zhou (2005):

Lemma 3.
Let $\mathcal{F}$ be a set of functions on Z such that, for some constants $B, c > 0$ and $\beta \in (0, 1]$, every $f \in \mathcal{F}$ satisfies $|f - \mathbb{E}f| \leq B$ almost everywhere and $\mathbb{E}f^2 \leq c\,(\mathbb{E}f)^\beta$. Then for every $\varepsilon > 0$,
In order to apply lemma 3 to estimate $\mathcal{E}(\pi(f_{\mathbf{z}})) - \mathcal{E}_{\mathbf{z}}(\pi(f_{\mathbf{z}}))$, we need to find a ball containing $f_{\mathbf{z}}$. The definition of $f_{\mathbf{z}}$ tells us that, by taking $f = 0$ in equation 1.2,
$$\lambda \|f_{\mathbf{z}}\|_d^2 \leq \mathcal{E}_{\mathbf{z}}(f_{\mathbf{z}}) + \lambda \|f_{\mathbf{z}}\|_d^2 \leq \mathcal{E}_{\mathbf{z}}(0) = \phi(0) = 1.$$
So $\|f_{\mathbf{z}}\|_d \leq \frac{1}{\sqrt{\lambda}}$. It means that $f_{\mathbf{z}} \in B_R$ with $R = \frac{1}{\sqrt{\lambda}}$ for all $\mathbf{z} \in Z^m$. Applying lemma 3 to the following function set,
$$\mathcal{F}_R = \bigl\{\phi(y\,\pi(f)(x)) - \phi(y f_c(x)) : f \in B_R\bigr\},$$
we can find:
Proposition 5.
If $\rho$ satisfies equation 2.1, then for any $\delta \in (0, 1)$, with confidence at least $1 - \delta$, there holds
Proof.
Each function $g \in \mathcal{F}_R$ has the form $g(z) = \phi(y\,\pi(f)(x)) - \phi(y f_c(x))$ for some $f \in B_R$. Since $|\pi(f)(x)| \leq 1$, one has $|g| \leq 2$ and hence $|g - \mathbb{E}g| \leq 4$, and lemma 1 asserts that $\mathbb{E}g^2 \leq c\,(\mathbb{E}g)^{\frac{q}{q+1}}$. Now applying lemma 3 to $\mathcal{F}_R$, we can get
It remains to bound the covering number $\mathcal{N}(\mathcal{F}_R, \varepsilon)$. Observe that for any $f_1, f_2 \in B_R$ and $z \in Z$,
This in connection with equation 4.1 means that
Therefore, if we set $\varepsilon^\ast$ to be the unique positive solution of the equation
then, with confidence at least $1 - \delta$,
Here the third inequality follows from Young’s inequality.
What is left is to bound $\varepsilon^\ast$. To this end, let
One can see that $h_2$ is a strictly decreasing function on $(0, +\infty)$.
Set
If then
If then
Thus, in either case we have
On the other hand, since it follows that
Therefore, . The proof of the proposition is complete.

## 5  Learning Rates

In this section we derive explicit learning rates for equation 1.2 by appropriately choosing the regularization parameter $\lambda$ and the degree d of the kernel polynomial.

Theorem 2.
Assume that $\rho$ satisfies equations 2.1 and 2.2. Then for all $\delta \in (0, 1)$, with confidence at least $1 - \delta$, we have
5.1
where C is some constant independent of m or $\delta$.
Proof.
Putting theorem 1 and propositions 4 and 5 into proposition 1 with $f_0 = B_d(f_c)$, we can find that with confidence at least $1 - 2\delta$,
Since by taking , we have with the same confidence,
It is easy to see . Recall an elementary inequality
5.2
This inequality is verified by considering the function , which is maximized at . Applying equation 5.2 with we have
Applying equation 5.2 with again, we have
Therefore, with confidence at least
where
By taking the conclusion then follows from equation 2.3 and .

When the Tsybakov noise condition, equation 2.1, is not assumed, one can still use theorem 2 by setting $q = 0$ and obtain a corresponding learning rate. When q tends to infinity, the power index of m in equation 5.1 has a limit which can be very close to $-1$ for large $\alpha$. So the learning rate in theorem 2 can be $m^{\epsilon - 1}$ for arbitrarily small $\epsilon > 0$ when q and $\alpha$ are large enough.

## Acknowledgments

I thank Qiang Wu for his helpful discussions. This work was completed while I was a visiting scholar at Middle Tennessee State University. It was partly supported by the NSF of China under grant 11501380 and the Fundamental Research Funds for the Central Universities in UIBE (13YBLG01).

## References

Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc., 68, 337–404.

Boser, B. E., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory (pp. 144–152). New York: ACM.

Chen, D. R., Wu, Q., Ying, Y., & Zhou, D. X. (2004). Support vector machine soft margin classifiers: Error analysis. J. Mach. Learn. Res., 5, 1143–1175.

Cortes, C., & Vapnik, V. (1995). Support-vector networks. Mach. Learning, 20, 273–297.

Cucker, F., & Smale, S. (2001). On the mathematical foundations of learning theory. Bull. Amer. Math. Soc., 39, 1–49.

Cucker, F., & Zhou, D. X. (2007). Learning theory: An approximation theory viewpoint. Cambridge: Cambridge University Press.

Devroye, L., Györfi, L., & Lugosi, G. (1997). A probabilistic theory of pattern recognition. New York: Springer-Verlag.

Evgeniou, T., Pontil, M., & Poggio, T. (2000). Regularization networks and support vector machines. Adv. Comput. Math., 13, 1–50.

Lorentz, G. G. (1986). Bernstein polynomials. New York: Chelsea.

Smale, S., & Zhou, D. X. (2007). Learning theory estimates via integral operators and their applications. Constr. Approx., 26, 153–172.

Steinwart, I. (2002). Support vector machines are universally consistent. J. Complexity, 18, 768–791.

Steinwart, I., & Christmann, A. (2008). Support vector machines. New York: Springer.

Steinwart, I., & Scovel, C. (2007). Fast rates for support vector machines using gaussian kernels. Ann. Statist., 35, 575–607.

Tong, H. Z., Chen, D. R., & Peng, L. Z. (2008). Learning rates for regularized classifiers using multivariate polynomial kernels. J. Complexity, 24, 619–631.

Tsybakov, A. B. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist., 32, 135–166.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Wu, Q., Ying, Y., & Zhou, D. X. (2007). Multi-kernel regularized classifiers. J. Complexity, 23, 108–134.

Wu, Q., & Zhou, D. X. (2005). SVM soft margin classifiers: Linear programming versus quadratic programming. Neural Comput., 17, 1160–1187.

Wu, Q., & Zhou, D. X. (2006). Analysis of support vector machine classification. J. Comput. Anal. Appl., 8, 99–119.

Xiang, D. H., & Zhou, D. X. (2009). Classification with gaussian and convex loss. J. Mach. Learn. Res., 10, 1447–1468.

Ying, Y., & Zhou, D. X. (2007). Learnability of gaussians with flexible variances. J. Mach. Learn. Res., 8, 249–276.

Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Statist., 32, 56–85.

Zhou, D. X., & Jetter, K. (2006). Approximation with polynomial kernels and SVM classifiers. Adv. Comput. Math., 25, 323–344.