The selection of the penalty functional is critical to the performance of a regularized learning algorithm and thus deserves special attention. In this article, we present a least square regression algorithm based on lp-coefficient regularization. Compared with classical regularized least square regression, the new algorithm differs in its regularization term. Our primary focus is the error analysis of the algorithm; an explicit learning rate is derived under standard assumptions.
In this letter, we provide an error analysis for regularized least square regression (RLSR) with an lp-coefficient regularizer.
We consider a new RLSR algorithm in which the regularizer is not the RKHS norm but an lp-norm of the coefficients in the kernel ensembles. More precisely:
Algorithms like equation 1.4 are popular in a wide range of settings. Daubechies, Defrise, and De Mol (2004) pointed out that taking p<2, and especially the limit value p=1, makes the minimization procedure in equation 1.4 promote sparsity of the solution. This scheme may therefore be useful in signal processing, data compression, feature selection, and related fields.
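For concreteness, the coefficient-regularized scheme just described can be sketched numerically: minimize the empirical square loss of a kernel expansion plus an lp penalty on its coefficients. The Gaussian kernel, the plain gradient-descent solver, and all step-size and iteration choices below are illustrative assumptions of ours, not part of the letter; for p > 1 the penalty is differentiable, so gradient descent applies.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    # Pairwise Gaussian kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)).
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

def lp_coef_rlsr(X, y, p=1.5, lam=1e-3, lr=1e-2, steps=2000):
    """Minimize (1/m) sum_i (f(x_i) - y_i)^2 + lam * sum_j |c_j|^p
    over f = sum_j c_j K(., x_j), by plain gradient descent.
    Assumes p > 1 so the penalty term has a continuous gradient."""
    m = len(y)
    K = gaussian_kernel(X, X)
    c = np.zeros(m)
    for _ in range(steps):
        resid = K @ c - y
        # Gradient of the empirical risk plus the lp penalty term.
        grad = (2.0 / m) * (K @ resid) + lam * p * np.abs(c) ** (p - 1) * np.sign(c)
        c -= lr * grad
    return c, K
```

On a small synthetic sample, the fitted expansion drives the empirical risk well below that of the zero function; nothing here is tuned, and the solver is only a sketch of the minimization in equation 1.4, not the analysis object of the letter.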
Mathematical analysis of learning algorithm 1.2 is by now well understood (see Caponnetto & De Vito, 2007; Smale & Zhou, 2005, 2007; Wu, Ying, & Zhou, 2006). However, there are essential differences between algorithms 1.4 and 1.2. On the one hand, the penalty functional is not a Hilbert space norm, which causes a technical difficulty for the mathematical analysis. On the other hand, both the penalty term and the hypothesis space depend on the sample z, so the classical error decomposition approach (see Cucker & Smale, 2001; Niyogi & Girosi, 1996) cannot be applied directly to algorithm 1.4. In short, the standard error analysis methods for scheme 1.2 are not appropriate for scheme 1.4, and as a result the theoretical analysis of scheme 1.4 is not yet very rich.
As far as we know, coefficient regularization learning algorithms were first considered in linear programming support vector machine (SVM) classification (see, e.g., Bradley & Mangasarian, 2000; Kecman & Hadzic, 2000; Vapnik, 1998). Tarigan and van de Geer (2006) studied SVM-type classifiers that are linear combinations of given base functions with an l1 penalty on the coefficients. Under some conditions on the margin and the base functions, they gave an oracle inequality, but they did not consider the approximation error (see the definition in section 2) and thus could not give a learning rate for the algorithm. Wu and Zhou (2005) considered linear programming SVM on an RKHS. By constructing a stepping-stone function, they bridged the linear programming SVM with the classical quadratic programming SVM and consequently derived an explicit learning rate for the former. Furthermore, Wu and Zhou (2008) established a general analysis framework for learning with sample-dependent hypothesis spaces. In particular, they proposed the concept of hypothesis error (see also the definition in section 2) and introduced a novel error decomposition technique (involving a hypothesis error) to overcome the difficulties mentioned above.
In this letter, we adopt ideas from Wu and Zhou (2005, 2008) and give an error analysis for algorithm 1.4. Under some standard assumptions, we derive an explicit learning rate for this algorithm. The rest of the letter is organized as follows. In section 2, we present an error decomposition approach tailored to algorithm 1.4 and estimate the hypothesis error. The analysis of the sample error is given in section 3. We derive the learning rate in section 4.
2. Error Decomposition
Since this quantity is independent of the samples and characterizes the approximation ability of the hypothesis space with respect to the regression function, we often call it an approximation error. Things are completely different in scheme 1.4, where both the hypothesis space and the penalty term depend on the sample z; bounds for an approximation error independent of z are therefore not available. To overcome this difficulty, we apply a modified error decomposition approach proposed by Wu and Zhou (2008). Before doing so, we introduce a projection operator.
Note that |y| <= M almost surely, so the regression function satisfies |f_rho(x)| <= M for all x, and it is natural to require the approximations of f_rho to take values in [−M, M] as well. The idea of projection was introduced in the analysis of classification algorithms (see, e.g., Bartlett, 1998) and subsequently applied to regression (see, e.g., Györfi et al., 2002).
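As a minimal sketch, the projection operator acts on function values by truncation at level M; the function name `project` is ours, not the letter's notation.

```python
import numpy as np

def project(f_values, M):
    # Projection operator pi_M applied pointwise:
    # pi_M(f)(x) = f(x) if |f(x)| <= M, and M * sign(f(x)) otherwise.
    return np.clip(f_values, -M, M)
```

Since |y| <= M, projecting an estimator onto [−M, M] can only decrease its distance to the regression function, which is why the projection is harmless in the error analysis.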
The result of theorem 1 for p=2 was also obtained by Wu and Zhou (2008).
3. Sample Error
Applying proposition 2.1 of Wu et al. (2006), we obtain the following estimate:
Notice that fz involves the sample z and thus runs over a set of functions in the hypothesis space. To estimate the sample error, we need a complexity measure of this set of functions. For simplicity, in this letter we use only its uniform covering number.
The uniform covering number has been extensively studied in learning theory. Zhou (2003) showed that equation 3.2 always holds with s = 2n/r if the kernel K is C^r with r > 0 on a bounded subset X of R^n. In particular, for a C^infinity kernel (such as a Gaussian kernel or a polynomial kernel), equation 3.2 is valid for any s > 0.
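To make the notion concrete, the covering number of a finite function class (functions represented by their values on a fixed sample) can be upper-bounded by a greedy cover in the sup norm. This toy estimator is our own illustration of the definition, not a construction from the letter.

```python
import numpy as np

def greedy_cover_size(F, eta):
    """Greedy upper bound on the eta-covering number of a finite function
    class F (each row of F holds one function's values on a fixed point set),
    with distances measured in the sup norm over those points."""
    remaining = list(range(len(F)))
    centers = 0
    while remaining:
        c = remaining[0]          # pick an uncovered function as a new center
        centers += 1
        # Discard every function within sup-distance eta of the new center.
        remaining = [i for i in remaining
                     if np.max(np.abs(F[i] - F[c])) > eta]
    return centers
```

Greedy covering is only an upper bound on the true covering number, but it already exhibits the key behavior: the count grows as eta shrinks, which is exactly what the complexity exponent s in equation 3.2 quantifies.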
The following lemma is also adapted from Wu et al. (2006). It is a uniform law of large numbers for a class of functions:
Now we need to find a ball containing fz:
4. Learning Rates
Our theorem then follows by balancing these bounds with an appropriate choice of the parameters.
We end this letter with some remarks.
When p is close to 1, the derived learning rate is slow, and the rate exponent is monotonically increasing with respect to p. So it is a reasonable conjecture that one should sacrifice some sparsity in order to improve the learning rate.
To derive a nontrivial learning rate, s+p(1−s) is required to be positive in theorem 2. We are not sure whether there is some inherent difficulty in using a larger value of p on an RKHS that has a larger complexity exponent. However, at least for a kernel such as a Gaussian or a polynomial kernel, p can be chosen arbitrarily in (1, 2].
The case p=1 is not included in our result. To obtain a learning rate in this case, we conjecture that some additional assumptions on X and rho are necessary, as was done in Xiao and Zhou (in press).
We thank the anonymous referees for their careful review and helpful suggestions. This work is partially supported by NSF of China under grants 10871015 and 10872009, and the National Basic Research Program of China under grant 973-2006CB303102.