## Abstract

The selection of the penalty functional is critical to the performance of a regularized learning algorithm, and thus it deserves special attention. In this article, we present a least square regression algorithm based on lp-coefficient regularization. Compared with the classical regularized least square regression, the new algorithm differs in the regularization term. Our primary focus is on the error analysis of the algorithm. An explicit learning rate is derived under some ordinary assumptions.

## 1.  Introduction

In this letter, we provide an error analysis for regularized least square regression (RLSR) with an lp-coefficient regularizer.

Let X be a compact subset of $\mathbb{R}^n$, $Y = [-M, M]$ for some M > 0, and ρ be an unknown probability distribution on $Z = X \times Y$. Given a sample $z = \{(x_i, y_i)\}_{i=1}^m \in Z^m$ drawn independently according to ρ, the regression problem in learning theory is to find a function $f_z: X \to \mathbb{R}$, such that $f_z(x)$ is a satisfactory estimate of the output y when a new input x is given. The prediction ability of a measurable function $f: X \to \mathbb{R}$ is measured by the generalization error

$$\mathcal{E}(f) = \int_Z (f(x) - y)^2 \, d\rho = \int_X \int_Y (f(x) - y)^2 \, d\rho(y|x) \, d\rho_X(x),$$

where $\rho_X$ is the marginal distribution on X and $\rho(\cdot|x)$ is the conditional probability measure at x induced by ρ. The function minimizing $\mathcal{E}(f)$ is called the regression function, which is defined by

$$f_\rho(x) = \int_Y y \, d\rho(y|x), \quad x \in X.$$

Undoubtedly, $f_\rho$ is the ideal estimator, but it is unavailable because ρ, and hence $f_\rho$, is unknown. What we can do is to find good approximations of $f_\rho$ from the random sample.

Let $\mathcal{H}_K$ be a reproducing kernel Hilbert space (RKHS) associated with a Mercer kernel K. A function $K: X \times X \to \mathbb{R}$ is called a Mercer kernel if it is continuous, symmetric, and positive semidefinite, that is, for any finite set of distinct points $\{x_1, \ldots, x_l\} \subset X$, the matrix $(K(x_i, x_j))_{i,j=1}^l$ is positive semidefinite. The RKHS $\mathcal{H}_K$ is then defined (see Aronszajn, 1950) as the closure of the linear span of the set of functions $\{K_x := K(x, \cdot) : x \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_K$ satisfying

$$\langle K_x, K_u \rangle_K = K(x, u).$$

The reproducing property is given by

$$f(x) = \langle f, K_x \rangle_K, \quad \forall x \in X, \ f \in \mathcal{H}_K.$$

Denote C(X) as the space of continuous functions on X with the norm $\|\cdot\|_\infty$. Because of the continuity of K and the compactness of X, we have

$$\kappa := \sup_{x \in X} \sqrt{K(x, x)} < \infty.$$

So the above reproducing property tells us

$$\|f\|_\infty \le \kappa \|f\|_K, \quad \forall f \in \mathcal{H}_K. \tag{1.1}$$
The classical RLSR (or regularized networks) (e.g., see Cucker & Smale, 2001; Evgeniou, Pontil, & Poggio, 2000; Girosi, Jones, & Poggio, 1995) is the minimization of the following regularized empirical error:

$$f_{z,\lambda} = \arg\min_{f \in \mathcal{H}_K} \left\{ \mathcal{E}_z(f) + \lambda \|f\|_K^2 \right\}. \tag{1.2}$$

Here $\mathcal{E}_z(f) = \frac{1}{m} \sum_{i=1}^m (f(x_i) - y_i)^2$ is the empirical error with respect to z, $\|f\|_K^2$ is called the regularization term, and λ > 0 is a regularization parameter, which is usually chosen to be some function of m with $\lambda(m) \to 0$ as $m \to \infty$.

According to the representer theorem in learning (see, e.g., Schölkopf, Herbrich, & Smola, 2001; Schölkopf & Smola, 2002; Shawe-Taylor & Cristianini, 2004; Wahba, 1990), the minimizer $f_{z,\lambda}$ in equation 1.2 admits a representation of the form

$$f_{z,\lambda} = \sum_{i=1}^m \alpha_i K_{x_i}, \quad \alpha_i \in \mathbb{R}, \ i = 1, \ldots, m. \tag{1.3}$$

In this letter, we consider a new RLSR algorithm in which the regularizer is not the RKHS norm but an lp-norm of the coefficients in the kernel ensembles. More precisely:

Definition 1.
Let $1 \le p \le 2$. RLSR with lp-coefficient regularization is the learning algorithm

$$f_z = \sum_{i=1}^m \alpha_i^z K_{x_i},$$

where $\alpha^z = (\alpha_1^z, \ldots, \alpha_m^z)$ is given by

$$\alpha^z = \arg\min_{\alpha \in \mathbb{R}^m} \left\{ \frac{1}{m} \sum_{i=1}^m \Big( \sum_{j=1}^m \alpha_j K(x_i, x_j) - y_i \Big)^2 + \lambda \sum_{i=1}^m |\alpha_i|^p \right\}.$$

If we denote

$$\mathcal{H}_{K,z} = \Big\{ f = \sum_{i=1}^m \alpha_i K_{x_i} : \alpha_i \in \mathbb{R} \Big\}$$

and

$$\Omega_z^p(f) = \inf \Big\{ \sum_{i=1}^m |\alpha_i|^p : f = \sum_{i=1}^m \alpha_i K_{x_i} \Big\},$$

then the algorithm given in definition 1 can be rewritten as

$$f_z = \arg\min_{f \in \mathcal{H}_{K,z}} \left\{ \mathcal{E}_z(f) + \lambda \Omega_z^p(f) \right\}. \tag{1.4}$$
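To make the scheme concrete, the following sketch (ours, not from the letter) minimizes the empirical error plus the coefficient penalty $\lambda \sum_i |\alpha_i|^p$ (our reading of the scheme) for a Gaussian kernel by plain gradient descent; the kernel width, sample size, and all parameter values are illustrative assumptions.

```python
import numpy as np

def gaussian_gram(x, sigma=0.5):
    # Gram matrix (K(x_i, x_j))_{i,j} for the Gaussian kernel exp(-|u - v|^2 / (2 sigma^2)).
    return np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))

def objective(K, y, a, lam, p):
    # Empirical error (1/m) sum_i (f(x_i) - y_i)^2 plus the penalty lam * sum_i |a_i|^p.
    return np.mean((K @ a - y) ** 2) + lam * np.sum(np.abs(a) ** p)

def lp_coefficient_rlsr(K, y, lam=1e-3, p=1.5, n_iter=2000):
    # Plain gradient descent started at a = 0; for p > 1 the penalty is differentiable.
    m = len(y)
    # Safe step size: 1 / (Lipschitz constant of the smooth data term + 1).
    step = 1.0 / ((2.0 / m) * np.linalg.eigvalsh(K).max() ** 2 + 1.0)
    a = np.zeros(m)
    for _ in range(n_iter):
        grad = (2.0 / m) * (K @ (K @ a - y)) + lam * p * np.sign(a) * np.abs(a) ** (p - 1)
        a -= step * grad
    return a

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1.0, 1.0, 40))
y = np.sin(np.pi * x) + 0.05 * rng.normal(size=40)
K = gaussian_gram(x)
a = lp_coefficient_rlsr(K, y)   # approximate coefficients alpha^z of f_z
```

Gradient descent is only one of many possible solvers; it is used here because the objective is convex and, for p > 1, smooth away from zero.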

Algorithms like equation 1.4 are also popular in a wide range of settings. Daubechies, Defrise, and De Mol (2004) pointed out that by taking p < 2, and especially for the limit value p = 1, the minimization procedure in equation 1.4 can promote the sparsity of the solution. Therefore, this scheme may be useful in fields such as signal processing, data compression, and feature selection.
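For the limit value p = 1, the iterative soft-thresholding algorithm of Daubechies et al. (2004) is a standard solver. The following toy sketch (our illustration; the matrix and the sparse coefficient vector are hypothetical) shows the sparsity-promoting effect on a small linear system:

```python
import numpy as np

def ista(A, b, lam, n_iter=1000):
    # Iterative soft-thresholding for min_a 0.5*||A a - b||^2 + lam*||a||_1.
    t = 1.0 / np.linalg.norm(A, 2) ** 2            # step size 1/L, L = ||A||_2^2
    a = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = a - t * A.T @ (A @ a - b)              # gradient step on the smooth part
        a = np.sign(g) * np.maximum(np.abs(g) - t * lam, 0.0)   # soft-threshold
    return a

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 10))
a_true = np.zeros(10)
a_true[[2, 7]] = [1.0, -2.0]                       # sparse ground truth
a_hat = ista(A, A @ a_true, lam=0.01)
```

With a small l1 weight, the recovered vector is sparse and matches the two-spike ground truth closely.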

Mathematical analysis of learning algorithm 1.2 is by now well understood (see Caponnetto & De Vito, 2007; Smale & Zhou, 2005, 2007; Wu, Ying, & Zhou, 2006). However, there are essential differences between algorithms 1.4 and 1.2. On one hand, the penalty functional $\Omega_z^p$ is not a Hilbert space norm, which causes a technical difficulty for the mathematical analysis. On the other hand, both $\Omega_z^p$ and the hypothesis space $\mathcal{H}_{K,z}$ depend on the sample z, and thus the classical error decomposition approach (see Cucker & Smale, 2001; Niyogi & Girosi, 1996) cannot be applied directly to algorithm 1.4. In a word, the standard error analysis methods for scheme 1.2 are not appropriate for scheme 1.4. As a result, the theoretical analysis of scheme 1.4 is not yet very rich.

As we know, coefficient regularization learning algorithms were first considered in linear programming support vector machine (SVM) classification (see, e.g., Bradley & Mangasarian, 2000; Kecman & Hadzic, 2000; Vapnik, 1998). Tarigan and Van de Geer (2006) studied SVM-type classifiers that are linear combinations of given base functions with an l1 penalty imposed on the coefficients. Under some conditions on the margin and the base functions, an oracle inequality was given. But they did not consider the approximation error (see the definition in section 2) and thus did not give any learning rate for the algorithm. Wu and Zhou (2005) considered the linear programming SVM on an RKHS. By setting up a stepping-stone, they bridged the linear programming SVM with the classical quadratic programming SVM and consequently derived an explicit learning rate for the former. Furthermore, Wu and Zhou (2008) established a general analysis framework for learning with sample-dependent hypothesis spaces. In particular, they proposed the concept of hypothesis error (see also the definition in section 2) and introduced a novel error decomposition technique (involving the hypothesis error) to overcome the difficulties mentioned above.

In this letter, we adopt ideas from Wu and Zhou (2005, 2008) and give an error analysis for algorithm 1.4. Under some standard assumptions, we finally derive an explicit learning rate of this algorithm. The rest of the letter is organized as follows. In section 2, we present an error decomposition approach tailored to algorithm 1.4 and estimate the hypothesis error. The analysis of the sample error is given in section 3. We derive the learning rate in section 4.

## 2.  Error Decomposition

The error decomposition method is widely used in the error analysis of algorithms based on empirical risk minimization and regularization. For scheme 1.2, Wu et al. (2006) gave the following error decomposition,

$$\mathcal{E}(f_{z,\lambda}) - \mathcal{E}(f_\rho) \le \big\{ \mathcal{E}(f_{z,\lambda}) - \mathcal{E}_z(f_{z,\lambda}) \big\} + \big\{ \mathcal{E}_z(f_\lambda) - \mathcal{E}(f_\lambda) \big\} + \mathcal{D}(\lambda) =: \mathcal{S}(z, \lambda) + \mathcal{D}(\lambda), \tag{2.1}$$

where $f_\lambda$ is defined by

$$f_\lambda = \arg\min_{f \in \mathcal{H}_K} \left\{ \mathcal{E}(f) + \lambda \|f\|_K^2 \right\} \tag{2.2}$$

and

$$\mathcal{D}(\lambda) = \mathcal{E}(f_\lambda) - \mathcal{E}(f_\rho) + \lambda \|f_\lambda\|_K^2. \tag{2.3}$$

Since $\mathcal{D}(\lambda)$ is independent of the sample and characterizes the approximation ability of $\mathcal{H}_K$ with respect to $f_\rho$, we often call it an approximation error. Things are completely different for scheme 1.4, where both the hypothesis space $\mathcal{H}_{K,z}$ and the penalty term $\Omega_z^p$ depend on the sample z; bounds for an approximation error independent of z are thus not available. To overcome this difficulty, we apply a modified error decomposition approach proposed by Wu and Zhou (2008). But before we do so, we introduce a projection operator.

Note that $|y| \le M$ almost surely. We have $|f_\rho(x)| \le M$ for all $x \in X$, so it is natural to restrict the approximations of $f_\rho$ to be contained in [−M, M] as well. The idea of projection was introduced in the analysis of classification algorithms (see, e.g., Bartlett, 1998) and subsequently applied to regression, too (see, e.g., Györfi, Kohler, Krzyzak, & Walk, 2002).

Definition 2.
The projection operator $\pi_M$ is defined on the space of measurable functions $f: X \to \mathbb{R}$ as

$$\pi_M(f)(x) = \begin{cases} M, & \text{if } f(x) > M, \\ f(x), & \text{if } -M \le f(x) \le M, \\ -M, & \text{if } f(x) < -M. \end{cases}$$

It is easy to see that

$$\mathcal{E}(\pi_M(f)) \le \mathcal{E}(f) \quad \text{and} \quad \mathcal{E}_z(\pi_M(f)) \le \mathcal{E}_z(f). \tag{2.4}$$

So $\pi_M(f)$ is more suitable than f to approximate $f_\rho$. By virtue of this, we take $\pi_M(f_z)$ instead of $f_z$ as our empirical target function and analyze the related learning rates.
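In code, the projection is just coordinatewise clipping; a one-line sketch (our illustration):

```python
import numpy as np

def project(f_values, M):
    # pi_M(f)(x): replace f(x) by M when f(x) > M and by -M when f(x) < -M.
    return np.clip(f_values, -M, M)
```

Clipping predictions into [−M, M] can only decrease the squared distance to targets lying in [−M, M], which is the content of equation 2.4.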

Proposition 1.
Let $f_z$ and $f_\lambda$ be defined by equations 1.4 and 2.2. Then

$$\mathcal{E}(\pi_M(f_z)) - \mathcal{E}(f_\rho) \le \mathcal{S}(z, \lambda) + \mathcal{H}(z, \lambda) + \mathcal{D}(\lambda), \tag{2.5}$$

where $\mathcal{D}(\lambda)$ is given by equation 2.3 and

$$\mathcal{S}(z, \lambda) = \big\{ \mathcal{E}(\pi_M(f_z)) - \mathcal{E}_z(\pi_M(f_z)) \big\} + \big\{ \mathcal{E}_z(f_\lambda) - \mathcal{E}(f_\lambda) \big\},$$

$$\mathcal{H}(z, \lambda) = \big\{ \mathcal{E}_z(\pi_M(f_z)) + \lambda \Omega_z^p(f_z) \big\} - \big\{ \mathcal{E}_z(f_\lambda) + \lambda \|f_\lambda\|_K^2 \big\}.$$

Proof.
Since

$$\mathcal{E}(\pi_M(f_z)) - \mathcal{E}(f_\rho) = \mathcal{S}(z, \lambda) + \mathcal{H}(z, \lambda) + \mathcal{D}(\lambda) - \lambda \Omega_z^p(f_z) \le \mathcal{S}(z, \lambda) + \mathcal{H}(z, \lambda) + \mathcal{D}(\lambda),$$

the conclusion is trivial.

In comparison with equation 2.1, the extra term $\mathcal{H}(z, \lambda)$ in equation 2.5 is called a hypothesis error, which we now estimate by the following theorem:

Theorem 1.
If $1 < p \le 2$ and $\lambda > 0$, then we have

$$\mathcal{H}(z, \lambda) \le M^p (m\lambda)^{1-p}. \tag{2.6}$$

Proof.
Let $f_{z,\lambda}$ be the solution to equation 1.2 and $\alpha^* = (\alpha_1^*, \ldots, \alpha_m^*)$ be its coefficient vector in equation 1.3. It is easy to check that $\alpha^*$ satisfies the following linear system of equations,

$$(m\lambda I + K[x]) \alpha^* = y,$$

where I is the identity matrix, $K[x] = (K(x_i, x_j))_{i,j=1}^m$, and $y = (y_1, \ldots, y_m)^T$. Therefore, for $i = 1, \ldots, m$,

$$\alpha_i^* = \frac{y_i - f_{z,\lambda}(x_i)}{m\lambda}.$$

By the Hölder inequality, we get

$$\sum_{i=1}^m |\alpha_i^*|^p \le m^{1 - \frac{p}{2}} \Big( \sum_{i=1}^m |\alpha_i^*|^2 \Big)^{\frac{p}{2}} = m^{1 - \frac{p}{2}} (m\lambda)^{-p} \big( m \, \mathcal{E}_z(f_{z,\lambda}) \big)^{\frac{p}{2}} = m^{1-p} \lambda^{-p} \big( \mathcal{E}_z(f_{z,\lambda}) \big)^{\frac{p}{2}}. \tag{2.7}$$

Noting that $f_{z,\lambda} \in \mathcal{H}_{K,z}$, we can derive from equations 2.4, 1.4, and 2.7 that

$$\mathcal{H}(z, \lambda) \le \mathcal{E}_z(f_{z,\lambda}) + \lambda \Omega_z^p(f_{z,\lambda}) - \mathcal{E}_z(f_\lambda) - \lambda \|f_\lambda\|_K^2 \le \mathcal{E}_z(f_{z,\lambda}) - \mathcal{E}_z(f_\lambda) - \lambda \|f_\lambda\|_K^2 + (m\lambda)^{1-p} \big( \mathcal{E}_z(f_{z,\lambda}) \big)^{\frac{p}{2}}. \tag{2.8}$$

By taking $f = f_\lambda$ and f = 0 in equation 1.2, respectively, we have

$$\mathcal{E}_z(f_{z,\lambda}) + \lambda \|f_{z,\lambda}\|_K^2 \le \mathcal{E}_z(f_\lambda) + \lambda \|f_\lambda\|_K^2, \tag{2.9}$$

$$\mathcal{E}_z(f_{z,\lambda}) + \lambda \|f_{z,\lambda}\|_K^2 \le \mathcal{E}_z(0) = \frac{1}{m} \sum_{i=1}^m y_i^2 \le M^2. \tag{2.10}$$

Putting equations 2.9 and 2.10 into 2.8, one has

$$\mathcal{H}(z, \lambda) \le (m\lambda)^{1-p} (M^2)^{\frac{p}{2}} = M^p (m\lambda)^{1-p}.$$

This proves the theorem.
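The stepping-stone function $f_{z,\lambda}$ used in the proof is cheap to compute; the sketch below (ours; a Gaussian kernel and toy data are assumed) solves the linear system $(m\lambda I + K[x])\alpha^* = y$ directly:

```python
import numpy as np

def classical_rlsr(x, y, lam, sigma=0.5):
    # Coefficients of f_{z,lambda} = sum_i alpha_i K_{x_i}:
    # solve (m*lam*I + K[x]) alpha = y.
    m = len(x)
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))
    return np.linalg.solve(m * lam * np.eye(m) + K, y), K

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-1.0, 1.0, 30))
y = np.sin(np.pi * x) + 0.05 * rng.normal(size=30)
alpha, K = classical_rlsr(x, y, lam=1e-3)
```

Since $m\lambda I + K[x]$ has smallest eigenvalue at least $m\lambda$, the solve is well posed, and the resulting coefficient vector obeys $\|\alpha^*\|_2 \le M/(\sqrt{m}\,\lambda)$ with $M = \max_i |y_i|$.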

Remark 1.

The result of theorem 1 for p=2 was also obtained by Wu and Zhou (2008).

## 3.  Sample Error

In this section, we mainly estimate the sample error $\mathcal{S}(z, \lambda)$. Since

$$\mathcal{S}(z, \lambda) = \big\{ \mathcal{E}(\pi_M(f_z)) - \mathcal{E}(f_\rho) - \big( \mathcal{E}_z(\pi_M(f_z)) - \mathcal{E}_z(f_\rho) \big) \big\} + \big\{ \mathcal{E}_z(f_\lambda) - \mathcal{E}_z(f_\rho) - \big( \mathcal{E}(f_\lambda) - \mathcal{E}(f_\rho) \big) \big\} =: \mathcal{S}_1(z) + \mathcal{S}_2(z),$$

we can bound $\mathcal{S}(z, \lambda)$ by bounding $\mathcal{S}_1(z)$ and $\mathcal{S}_2(z)$ below.

Applying proposition 2.1 in Wu et al. (2006), we obtain the following estimate for $\mathcal{S}_2(z)$:

Lemma 1.
Let $\mathcal{D}(\lambda)$ be given by equation 2.3. For any t > 0, with confidence $1 - e^{-t}$, there holds

$$\mathcal{S}_2(z) \le \frac{1}{2} \mathcal{D}(\lambda) + \frac{5 (3M + \|f_\lambda\|_\infty)^2 t}{3m}. \tag{3.1}$$

Notice that $f_z$ involves the sample z and thus runs over a set of functions as z changes. In order to estimate $\mathcal{S}_1(z)$, we need a complexity measure for such a set. For simplicity, in this letter, we use only the uniform covering number.

Definition 3.
Let $\mathcal{F}$ be a subset of C(X). For any $\eta > 0$, the uniform covering number of $\mathcal{F}$ is defined as

$$\mathcal{N}(\mathcal{F}, \eta) = \min \big\{ l \in \mathbb{N} : \mathcal{F} \text{ can be covered by } l \text{ balls in } C(X) \text{ with radius } \eta \big\}.$$

Let $B_1 = \{ f \in \mathcal{H}_K : \|f\|_K \le 1 \}$. By equation 1.1, $\mathcal{H}_K$ is a subset of C(X) with continuous inclusion, and thus the uniform covering number is well defined. We denote the uniform covering number of the unit ball $B_1$ as

$$\mathcal{N}_1(\eta) := \mathcal{N}(B_1, \eta), \quad \eta > 0.$$

Definition 4.
We say an RKHS $\mathcal{H}_K$ has polynomial complexity exponent s > 0 if there is some constant $c_s > 0$ such that

$$\log \mathcal{N}_1(\eta) \le c_s \Big( \frac{1}{\eta} \Big)^s, \quad \forall \eta > 0. \tag{3.2}$$

The uniform covering number has been extensively studied in learning theory. Zhou (2003) showed that equation 3.2 always holds with $s = 2n/r$ if the kernel K is $C^r$ with r > 0 on a bounded subset X of $\mathbb{R}^n$. In particular, for a $C^\infty$ kernel (such as a Gaussian kernel or a polynomial kernel), equation 3.2 is valid for any s > 0.

The following lemma is also adapted from Wu et al. (2006). It is a uniform law of large numbers for a class of functions:

Lemma 2.
Let $\mathcal{F}$ be a set of functions on Z such that, for any $f \in \mathcal{F}$, $Ef \ge 0$, $|f| \le B$, and $Ef^2 \le c \, Ef$. Then for any $\varepsilon > 0$,

$$\mathrm{Prob}_{z \in Z^m} \left\{ \sup_{f \in \mathcal{F}} \frac{Ef - \frac{1}{m} \sum_{i=1}^m f(z_i)}{\sqrt{Ef + \varepsilon}} > 4\sqrt{\varepsilon} \right\} \le \mathcal{N}(\mathcal{F}, \varepsilon) \exp \left\{ - \frac{m \varepsilon}{2c + \frac{2B}{3}} \right\}.$$

We now apply lemma 2 to the set of functions $\mathcal{F}_R$ with R > 0, where

$$\mathcal{F}_R = \Big\{ (\pi_M(f)(x) - y)^2 - (f_\rho(x) - y)^2 : f \in \mathcal{B}_R \Big\}, \quad \mathcal{B}_R = \Big\{ \sum_{i=1}^l a_i K_{u_i} : l \in \mathbb{N}, \ u_i \in X, \ \sum_{i=1}^l |a_i| \le R \Big\}.$$

Proposition 2.
If equation 3.2 is satisfied, then for any t > 0, with confidence $1 - e^{-t}$, there holds

$$\mathcal{E}(\pi_M(f)) - \mathcal{E}(f_\rho) - \big( \mathcal{E}_z(\pi_M(f)) - \mathcal{E}_z(f_\rho) \big) \le \frac{1}{2} \big( \mathcal{E}(\pi_M(f)) - \mathcal{E}(f_\rho) \big) + C_1 \left( \frac{t}{m} + \Big( \frac{R^s}{m} \Big)^{\frac{1}{1+s}} \right)$$

for all $f \in \mathcal{B}_R$, where $C_1$ is a constant independent of m, t, and R.

Proof.
Each function $g \in \mathcal{F}_R$ has a form

$$g(z) = (\pi_M(f)(x) - y)^2 - (f_\rho(x) - y)^2$$

with some $f \in \mathcal{B}_R$. Since $|\pi_M(f)(x)| \le M$, $|f_\rho(x)| \le M$, and $|y| \le M$, we have that $Eg = \mathcal{E}(\pi_M(f)) - \mathcal{E}(f_\rho) \ge 0$ and $|g| \le 8M^2$. Moreover,

$$g(z) = \big( \pi_M(f)(x) - f_\rho(x) \big) \big( \pi_M(f)(x) + f_\rho(x) - 2y \big),$$

so $Eg^2 \le 16M^2 \int_X (\pi_M(f) - f_\rho)^2 \, d\rho_X = 16M^2 Eg$. So applying lemma 2 to $\mathcal{F}_R$ with $B = 8M^2$ and $c = 16M^2$, we get

$$\mathrm{Prob} \left\{ \sup_{g \in \mathcal{F}_R} \frac{Eg - \frac{1}{m} \sum_{i=1}^m g(z_i)}{\sqrt{Eg + \varepsilon}} > 4\sqrt{\varepsilon} \right\} \le \mathcal{N}(\mathcal{F}_R, \varepsilon) \exp \left\{ - \frac{3 m \varepsilon}{112 M^2} \right\}.$$

Furthermore, for any $g_1, g_2 \in \mathcal{F}_R$ with corresponding $f_1, f_2 \in \mathcal{B}_R$,

$$|g_1(z) - g_2(z)| \le 4M \|f_1 - f_2\|_\infty,$$

and since every $f \in \mathcal{B}_R$ satisfies $\|f\|_K \le \kappa R$, we get

$$\mathcal{N}(\mathcal{F}_R, \varepsilon) \le \mathcal{N}\Big( \mathcal{B}_R, \frac{\varepsilon}{4M} \Big) \le \mathcal{N}_1 \Big( \frac{\varepsilon}{4M\kappa R} \Big).$$

Hence,

$$\mathrm{Prob} \left\{ \sup_{g \in \mathcal{F}_R} \frac{Eg - \frac{1}{m} \sum_{i=1}^m g(z_i)}{\sqrt{Eg + \varepsilon}} > 4\sqrt{\varepsilon} \right\} \le \exp \left\{ c_s \Big( \frac{4M\kappa R}{\varepsilon} \Big)^s - \frac{3 m \varepsilon}{112 M^2} \right\}.$$

Here the last inequality is followed by equation 3.2. Let τ* be the unique positive solution to

$$\frac{3 m \tau}{112 M^2} - c_s \Big( \frac{4M\kappa R}{\tau} \Big)^s = t.$$

Then for any $g \in \mathcal{F}_R$, with confidence at least $1 - e^{-t}$, there holds

$$Eg - \frac{1}{m} \sum_{i=1}^m g(z_i) \le 4 \sqrt{\tau^* (Eg + \tau^*)} \le \frac{1}{2} Eg + 12 \tau^*.$$

It remains to estimate τ*. By lemma 7 of Cucker and Smale (2002), we know that

$$\tau^* \le \max \left\{ \frac{224 M^2 t}{3m}, \ \Big( \frac{224 M^2 c_s (4M\kappa R)^s}{3m} \Big)^{\frac{1}{1+s}} \right\} \le \frac{224 M^2 t}{3m} + \Big( \frac{224 M^2 c_s (4M\kappa)^s}{3} \Big)^{\frac{1}{1+s}} \Big( \frac{R^s}{m} \Big)^{\frac{1}{1+s}}.$$

So the proposition follows by taking $C_1 = 12 \max \big\{ \frac{224 M^2}{3}, \big( \frac{224 M^2 c_s (4M\kappa)^s}{3} \big)^{\frac{1}{1+s}} \big\}$.

Now we need to find a ball containing fz:

Lemma 3.
Let $f_z = \sum_{i=1}^m \alpha_i^z K_{x_i}$ be given by definition 1. Then for any $z \in Z^m$, we have

$$\sum_{i=1}^m |\alpha_i^z| \le m^{\frac{p-1}{p}} \Big( \frac{M^2}{\lambda} \Big)^{\frac{1}{p}}.$$

Proof.
By taking f = 0 in equation 1.4, we can see that

$$\mathcal{E}_z(f_z) + \lambda \sum_{i=1}^m |\alpha_i^z|^p \le \mathcal{E}_z(0) = \frac{1}{m} \sum_{i=1}^m y_i^2 \le M^2. \tag{3.3}$$

It implies

$$\sum_{i=1}^m |\alpha_i^z|^p \le \frac{M^2}{\lambda}.$$

By equation 3.3 and the Hölder inequality, we obtain

$$\sum_{i=1}^m |\alpha_i^z| \le m^{1 - \frac{1}{p}} \Big( \sum_{i=1}^m |\alpha_i^z|^p \Big)^{\frac{1}{p}} \le m^{\frac{p-1}{p}} \Big( \frac{M^2}{\lambda} \Big)^{\frac{1}{p}}.$$
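The argument above only uses the fact that the minimizer does no worse than f = 0, so the coefficient bound already holds for any descent method started at zero whose final objective does not exceed the objective at zero: since $\lambda \sum_i |\alpha_i|^p$ cannot exceed the objective value, which is at most $M^2 = \max_i y_i^2$, Hölder gives $\sum_i |\alpha_i| \le m^{(p-1)/p}(M^2/\lambda)^{1/p}$. A quick numerical check (our illustration; Gaussian kernel, p = 3/2, and all parameter values are assumptions):

```python
import numpy as np

def solve_scheme(K, y, lam, p, n_iter=2000):
    # Gradient descent for (1/m)||K a - y||^2 + lam * sum |a_i|^p, started at a = 0.
    m = len(y)
    step = 1.0 / ((2.0 / m) * np.linalg.eigvalsh(K).max() ** 2 + 1.0)
    a = np.zeros(m)
    for _ in range(n_iter):
        grad = (2.0 / m) * (K @ (K @ a - y)) + lam * p * np.sign(a) * np.abs(a) ** (p - 1)
        a -= step * grad
    return a

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-1.0, 1.0, 50))
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.5)   # Gaussian gram, sigma = 0.5
y = np.sin(np.pi * x) + 0.05 * rng.normal(size=50)

m, lam, p = 50, 1e-2, 1.5
a = solve_scheme(K, y, lam, p)
M = np.abs(y).max()
# Lemma 3-style bound on the l1 norm of the coefficients.
bound = m ** ((p - 1) / p) * (M ** 2 / lam) ** (1 / p)
```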

Lemma 3 ensures us that $f_z \in \mathcal{B}_R$ with $R = m^{(p-1)/p} (M^2/\lambda)^{1/p}$. So by proposition 2 and lemma 3, we derive:

Proposition 3.
If equation 3.2 is satisfied, then for any t > 0, with confidence $1 - e^{-t}$, it holds that

$$\mathcal{S}_1(z) \le \frac{1}{2} \big( \mathcal{E}(\pi_M(f_z)) - \mathcal{E}(f_\rho) \big) + \widetilde{C}_1 \left( \frac{t}{m} + \Big( m^{\frac{s(p-1)}{p} - 1} \lambda^{-\frac{s}{p}} \Big)^{\frac{1}{1+s}} \right), \tag{3.4}$$

where $\widetilde{C}_1 = C_1 \max \big\{ 1, M^{\frac{2s}{p(1+s)}} \big\}$ is independent of m, t, and λ.

## 4.  Learning Rates

Combining the estimations in sections 2 and 3, we can derive an explicit learning rate for algorithm 1.4:

Theorem 2.
Suppose equation 3.2 is satisfied, and $\mathcal{D}(\lambda) \le c_r \lambda^{2r}$ for some $0 < r \le \frac{1}{2}$ and $c_r > 0$. Let

$$\theta = \frac{s + p(1-s)}{s + 2rp(1+s)}.$$

Then by taking $\lambda = m^{-\theta}$, for any $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds

$$\mathcal{E}(\pi_M(f_z)) - \mathcal{E}(f_\rho) \le C_2 \log \frac{2}{\delta} \, m^{-\gamma}, \quad \gamma = \min \big\{ 2r\theta, \ (1-\theta)(p-1), \ 1 - (1-2r)\theta \big\},$$

where $C_2$ is a constant independent of m and δ.

Proof.
Putting equations 2.6, 3.1, and 3.4 into 2.5, we can see that under the assumption $\mathcal{D}(\lambda) \le c_r \lambda^{2r}$, with confidence $1 - 2e^{-t}$,

$$\mathcal{E}(\pi_M(f_z)) - \mathcal{E}(f_\rho) \le 2 \widetilde{C}_1 \left( \frac{t}{m} + \Big( m^{\frac{s(p-1)}{p} - 1} \lambda^{-\frac{s}{p}} \Big)^{\frac{1}{1+s}} \right) + \frac{10 (3M + \|f_\lambda\|_\infty)^2 t}{3m} + 2 M^p (m\lambda)^{1-p} + 3 c_r \lambda^{2r};$$

here, $\|f_\lambda\|_\infty \le \kappa \|f_\lambda\|_K \le \kappa \sqrt{\mathcal{D}(\lambda)/\lambda} \le \kappa \sqrt{c_r} \, \lambda^{r - \frac{1}{2}}$.

By choosing $\lambda = m^{-\theta}$, we can easily check that each term on the right-hand side is bounded by $C m^{-\gamma} (t + 1)$ with a constant C independent of m and t.

So our theorem follows by taking $t = \log \frac{2}{\delta}$ and $C_2 = 8C / \log 2$.

We end this letter with some remarks.

Remark 2.
The estimate of the approximation error $\mathcal{D}(\lambda)$ can be derived through the knowledge of approximation theory (see, e.g., Smale & Zhou, 2003). In particular, the kernel K defines an integral operator $L_K$ on $L^2_{\rho_X}(X)$ by

$$(L_K f)(x) = \int_X K(x, u) f(u) \, d\rho_X(u), \quad x \in X.$$

We know from Smale and Zhou (2005) that when $f_\rho$ lies in the range of $L_K^r$ for some $0 < r \le \frac{1}{2}$, there holds

$$\mathcal{D}(\lambda) \le \lambda^{2r} \| L_K^{-r} f_\rho \|_{L^2_{\rho_X}}^2.$$

Remark 3.

When p is close to 1, we can see that the learning rate exponent in theorem 2 is close to 0, and it increases monotonically with respect to p. So it is a reasonable conjecture that one should sacrifice some sparsity in order to improve the learning rate.

Remark 4.

To derive a nontrivial learning rate, s + p(1 − s) is required to be positive in theorem 2. We are not sure whether there is some inherent difficulty in using a larger value of p on an RKHS with a larger complexity exponent. However, at least for a $C^\infty$ kernel such as a Gaussian or a polynomial kernel, p can be chosen arbitrarily in (1, 2].

Remark 5.

The case p = 1 is not included in our result. To derive a learning rate in this case, we conjecture that some additional assumptions on X and ρ are necessary, just as was done in Xiao and Zhou (in press).

## Acknowledgments

We thank the anonymous referees for their careful review and helpful suggestions. This work is partially supported by NSF of China under grants 10871015 and 10872009, and the National Basic Research Program of China under grant 973-2006CB303102.

## References

Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc., 68, 337–404.

Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Trans. Inform. Theory, 44, 525–536.

Bradley, P. S., & Mangasarian, O. L. (2000). Massive data discrimination via linear support vector machines. Optimization Methods and Software, 13, 1–10.

Caponnetto, A., & De Vito, E. (2007). Optimal rates for the regularized least-squares algorithm. Found. Comput. Math., 7, 331–368.

Cucker, F., & Smale, S. (2001). On the mathematical foundations of learning. Bull. Amer. Math. Soc., 39, 1–49.

Cucker, F., & Smale, S. (2002). Best choices for regularization parameters in learning theory: On the bias-variance problem. Found. Comput. Math., 2, 413–428.

Daubechies, I., Defrise, M., & De Mol, C. (2004). An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. Pure Appl. Math., 57, 1413–1457.

Evgeniou, T., Pontil, M., & Poggio, T. (2000). Regularization networks and support vector machines. Advances in Computational Mathematics, 13, 1–50.

Girosi, F., Jones, M., & Poggio, T. (1995). Regularization theory and neural network architectures. Neural Comput., 7, 219–269.

Györfi, L., Kohler, M., Krzyzak, A., & Walk, H. (2002). A distribution-free theory of nonparametric regression. Berlin: Springer.

Kecman, V., & Hadzic, I. (2000). Support vector selection by linear programming. In Proceedings of the International Joint Conference on Neural Networks (Vol. 5, pp. 193–198). Washington, DC: IEEE Computer Society Press.

Niyogi, P., & Girosi, F. (1996). On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Comput., 8, 819–842.

Schölkopf, B., Herbrich, R., & Smola, A. J. (2001). A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory and 5th European Conference on Computational Learning Theory (pp. 416–426). Berlin: Springer-Verlag.

Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.

Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge: Cambridge University Press.

Smale, S., & Zhou, D. X. (2003). Estimating the approximation error in learning theory. Anal. Appl., 1, 17–41.

Smale, S., & Zhou, D. X. (2005). Shannon sampling II: Connections to learning theory. Appl. Comput. Harmonic Anal., 19, 285–302.

Smale, S., & Zhou, D. X. (2007). Learning theory estimates via integral operators and their applications. Constr. Approx., 26, 153–172.

Tarigan, B., & Van de Geer, S. (2006). Classifiers of support vector machine type with l1 complexity regularization. Bernoulli, 12(6), 1045–1076.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Wahba, G. (1990). Spline models for observational data. Philadelphia: SIAM.

Wu, Q., Ying, Y., & Zhou, D. X. (2006). Learning rates of least-square regularized regression. Found. Comput. Math., 6, 171–192.

Wu, Q., & Zhou, D. X. (2005). SVM soft margin classifiers: Linear programming versus quadratic programming. Neural Comput., 17, 1160–1187.

Wu, Q., & Zhou, D. X. (2008). Learning with sample dependent hypothesis spaces. Comput. Math. Appl., 56, 2896–2907.

Xiao, Q. W., & Zhou, D. X. (in press). Learning by nonsymmetric kernels with data dependent spaces and l1-regularizer. Taiwanese J. Math.

Zhou, D. X. (2003). Capacity of reproducing kernel spaces in learning theory. IEEE Trans. Inform. Theory, 49, 1743–1752.