Kernelized elastic net regularization (KENReg) is a kernelization of the well-known elastic net regularization (Zou & Hastie, 2005). The kernel in KENReg is not required to be a Mercer kernel since it learns from a kernelized dictionary in the coefficient space. Feng, Yang, Zhao, Lv, and Suykens (2014) showed that KENReg has some nice properties, including stability, sparseness, and generalization. In this letter, we continue our study of KENReg by conducting a refined learning theory analysis. This letter makes three main contributions. First, we present a refined error analysis of the generalization performance of KENReg. The main difficulty in analyzing the generalization error of KENReg lies in characterizing the population version of its empirical target function. We overcome this by introducing a weighted Banach space associated with the elastic net regularization. We are then able to conduct an elaborated learning theory analysis and obtain fast convergence rates under proper complexity and regularity assumptions. Second, we study the sparse recovery problem in KENReg with fixed design and show that the kernelization may improve the sparse recovery ability compared to the classical elastic net regularization. Finally, we discuss the interplay among different properties of KENReg, including sparseness, stability, and generalization. We show that the stability of KENReg leads to generalization and that its sparseness confidence can be derived from generalization. Moreover, KENReg can be simultaneously sparse and stable, which makes it attractive both theoretically and practically.
Kernels and related kernel-based learning paradigms have become popular tools in the machine learning community owing to their great successes in theoretical interpretation and computational efficiency. The key idea is to implicitly map observations from the input space to some feature space in which simple (often linear) algorithms can be applied (the so-called kernel trick). Based on kernels, a variety of learning machines have been developed in both supervised and unsupervised learning. Several canonical examples are kernel principal component analysis in unsupervised learning, and the support vector machine for classification (SVC) and support vector regression (SVR) in supervised learning.
1.1 Objectives of the Letter
As a continuation of our previous study on KENReg, the main scope of this letter is to study the following aspects: generalization ability, sparse recovery ability, and the interplay among sparseness, stability, and generalization.
First, generalization ability is one of the major concerns when designing a learning machine and also contributes to the major theme of statistical learning theory. Under the random design setting, the generalization analysis is concerned with the expected distance between the output estimator and the regression function, where the expectation is taken with respect to the observations and the distance is measured under the marginal distribution of the input variable on the input space. The generalization ability of KENReg was analyzed in Feng et al. (2014) by introducing an instrumental empirical target function, following the path of Wu and Zhou (2008).
However, as Feng et al. (2014) remarked, the generalization bounds provided there are in fact derived with respect to a truncated version of the empirical target function instead of the empirical target function itself, which is a compromise since the analysis of the truncated version can be significantly simplified. On the other hand, according to the generalization analysis given in Feng et al. (2014), to ensure the convergence of KENReg, the regularization parameter is required to go to zero as the sample size m tends to infinity. Nevertheless, the stability arguments in Feng et al. (2014) show that to ensure the convergence of KENReg, the same parameter should instead go to infinity in accordance with m. This contradiction is again caused by the fact that the analysis of the convergence rates of KENReg in Feng et al. (2014) was carried out with respect to a truncated version of the empirical target function.
In fact, the generalization analysis that Feng et al. (2014) presented cannot be easily tailored to the empirical target function defined in equation 1.5. As Feng et al. (2014) detailed, and as we also explain in section 3, from a function approximation point of view, the main difficulties encountered in the analysis lie in defining the hypothesis space and characterizing the approximation error associated with the two regularization parameters. Given that the analysis of generalization bounds plays an important role in studying a learning machine, the first major concern of this letter is to present a refined analysis of the generalization bounds of KENReg with respect to the empirical target function itself.
Next, we look at sparse recovery ability. KENReg advocates sparse solutions due to the use of a sparsity-producing penalty term. Following the literature on sparse recovery (Candès & Tao, 2005; Candès, Romberg, & Tao, 2006; Zhang, 2011), it is natural to ask whether KENReg can identify the zero pattern of the true solution if it is assumed to be sparse in some sense. This is the second major concern in this work, and a positive answer is presented in section 4 with the fixed design setting.
Finally, we look at the interplay among sparseness, stability, and generalization. Sparseness and stability are the two motivating properties for which KENReg was proposed, as Feng et al. (2014) stated. These properties, together with its generalization ability, make KENReg attractive for regression. However, it is generally thought that sparse learning algorithms cannot be stable. Moreover, it is not clear whether, in the setting of learning with a kernelized dictionary, algorithmic stability can also lead to generalization. We are also concerned with the relation between generalization and sparseness. Thus, the third purpose of this letter is to illustrate the interplay among these different and important properties of KENReg.
This letter is organized as follows. In section 2, we present results on generalization bounds. Section 3 is dedicated to outlining the steps and main ideas in doing error analysis. We study the sparse recovery ability of the KENReg model in section 4. Section 5 presents discussion on the interplay among the three different properties of the KENReg model: sparseness, stability, and generalization. We end this letter by summarizing contributions in section 6. Proofs of propositions and theorems are provided in the appendix if they are not presented immediately in the following sections.
2 Preliminaries and Main Results on Generalization Bounds
This section presents our main results on generalization bounds, which refer to the convergence rates, in probability, of the KENReg estimator to the regression function under the $L^2$-metric with random design. Here the regression function is the conditional mean of the response given the input, which follows from the zero-mean noise assumption. To this end, we first discuss the difficulties encountered in the error analysis and introduce some notation and assumptions.
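For concreteness, under the standard least-squares setting with zero-mean noise, the regression function and the error term of interest take the following familiar forms (the notation here, with $f_{\mathbf{z}}$ for the KENReg estimator, $\rho$ for the joint distribution of $(X, Y)$, and $\rho_X$ for its marginal on the input space $\mathcal{X}$, is illustrative and may differ from the letter's):
\[
f_\rho(x) = \mathbb{E}\,[\,Y \mid X = x\,],
\qquad
\big\| f_{\mathbf{z}} - f_\rho \big\|_{L^2_{\rho_X}}^2
= \int_{\mathcal{X}} \big( f_{\mathbf{z}}(x) - f_\rho(x) \big)^2 \, d\rho_X(x).
\]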
2.1 Difficulties in Error Analysis and Proposed Method
KENReg learns with a kernelized dictionary that depends on the data. Therefore, the hypothesis space of KENReg drifts with the varying observations. This leads to the so-called hypothesis error, which also contributes to the variance of the estimator. On the other hand, without a fixed hypothesis space, one is not able to characterize the population version of the empirical target function, and this makes it difficult to conduct an error analysis via the classical bias-variance trade-off approach. As Feng et al. (2014) commented, besides the varying observations, such difficulty also stems from the combined penalty term in KENReg and the stepping-stone technique employed there to bypass it.
In this letter, we overcome this difficulty by first constructing an instrumental hypothesis space to help characterize the population version of the empirical target function. We then construct an instrumental empirical target function that mimics the KENReg estimator well and is easier to analyze. The last step is to conduct an error analysis with respect to the newly constructed instrumental empirical target function. This process is illustrated in Figure 1.
In Figure 1, the instrumental hypothesis space is the space constructed to characterize the population version of the KENReg estimator. The instrumental empirical target function has a generalization performance similar to that of the KENReg estimator, and its population version lies in the instrumental hypothesis space. A second, closely related space is introduced to include both the instrumental hypothesis space and the hypothesis space of KENReg. In our analysis, instead of bounding the distance between the KENReg estimator and the regression function directly, we bound the distance between the instrumental empirical target function and the regression function and show that this can upper-bound the former. Rigorous definitions of these spaces and functions are given below. Here, it should be mentioned that the instrumental empirical target function essentially plays a stepping-stone role; this technique was proposed by Wu and Zhou (2005, 2008) and employed in much follow-up work (see Shi, 2013; Tong et al., 2010; and Chen et al., 2013).
2.2 Construction of the Instrumental Hypothesis Space and the Instrumental Empirical Target Function
2.3 Assumptions and Generalization Bounds
In order to state the main results on generalization bounds, we introduce several assumptions. The generalization bounds in our study are derived by controlling the capacity of the hypothesis space that KENReg works in. Its hypothesis space is spanned by the kernelized dictionary, and its capacity assumption is given as follows:
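As a hedged sketch in illustrative notation (which may differ from equation 2.5), the standard polynomial capacity assumption used in this line of work (cf. Shi et al., 2011) reads as follows: letting $\mathcal{B}_1$ denote the unit ball, with respect to the $\ell_1$-norm of the coefficients, of the function class spanned by the kernelized dictionary, and $\mathcal{N}_2(\mathcal{B}_1, \varepsilon)$ its $\ell_2$-empirical covering number, one assumes that there exist an exponent $p \in (0, 2)$ and a constant $c_p > 0$ such that
\[
\log \mathcal{N}_2(\mathcal{B}_1, \varepsilon) \le c_p\, \varepsilon^{-p}, \qquad \forall\, \varepsilon > 0.
\]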
Shi et al. (2011) showed that the above capacity assumption holds under certain regularity conditions on the kernel. This indicates that the polynomial behavior in equation 2.5 is typical, since it covers, for instance, the case of the hypothesis space induced by the gaussian kernel. Assumption 16 concerns the boundedness of the response variable Y, which is again a canonical assumption in the statistical learning theory literature (Cucker & Zhou, 2007; Steinwart & Christmann, 2008; Hang & Steinwart, 2014; Lv, 2015) and is also adopted in Feng et al. (2014).
(Boundedness Assumption). Assume that the response variable is bounded almost surely by some positive constant, which, without loss of generality, we take to be 1.
Following assumption 16, it is easy to see that the regression function is also bounded. As Feng et al. (2014) remarked, assumption 16 can be easily relaxed to the assumption that the response variable has a subgaussian tail without introducing essential difficulties into the analysis. Moreover, by replacing the least squares loss with some robust loss function, it can be further relaxed to the assumption that the response variable satisfies certain moment conditions, as shown in Huber and Ronchetti (2009), Vapnik (1998), Györfi, Kohler, Krzyżak, and Walk (2002), and Feng, Huang, Shi, Yang, and Suykens (2015).
This model assumption specifies the approximation ability of the model. In fact, from its definition, it is easy to see that the approximation error depends on the regularity of the regression function; an explicit decay rate follows when the regression function is sufficiently smooth. More discussion of the model assumption is provided in section 3.2. Now we are ready to state our main results on the generalization bounds.
As a consequence of theorem 1, we can present the convergence rates of KENReg in the following corollary in a more explicit manner when the regression function is smooth enough.
The convergence rates stated above indicate their dependence on the regularization parameters and further confirm the involvement of the ℓ2-regularization term in generalization. They are also optimal in the sense that, as p tends to zero, they can be arbitrarily close to $O(m^{-1})$, which is regarded as the fastest learning rate in the learning theory literature. On the other hand, the convergence rates are stated with respect to the empirical target function itself instead of its projected version as presented in Feng et al. (2014). Noticing that this is exactly the empirical target function of interest in our study, in this sense we say that the two types of convergence rates are of a different nature and that refined generalization bounds are presented in this study.
The convergence rates in corollary 2 are obtained under the condition that the ℓ2-regularization parameter goes to infinity when the sample size m tends to infinity. As we will detail, this is consistent with the observation on generalization made from the stability arguments in Feng et al. (2014).
In theorem 1 and corollary 2, the two regularization parameters are selected to achieve the theoretical convergence rates by balancing the bias and variance terms. In practice, they are more frequently chosen by data-driven techniques (e.g., cross-validation). To reduce the computational burden, a heuristic approach to selecting the two parameters can be conducted as follows: one first chooses one of the parameters via cross-validation with the other set to zero; with the first parameter fixed, one then carries out cross-validation again to determine an appropriate value for the second. (We refer readers to Feng et al., 2014, for a more detailed discussion on the model selection problem and numerical studies.)
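To make the heuristic concrete, the following is a minimal sketch (not the letter's implementation) of the two-stage selection, assuming a gaussian kernelized dictionary and using scikit-learn's ElasticNet as a stand-in solver for the coefficient-space problem (up to scikit-learn's internal rescaling of the penalties); the function names and parameter grids below are illustrative assumptions.

```python
# A minimal sketch of the two-stage parameter selection heuristic for KENReg.
# Assumptions: gaussian kernelized dictionary, scikit-learn's ElasticNet as a
# stand-in solver, illustrative data and grids.
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold

def gaussian_gram(X, Z, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||X_i - Z_j||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def cv_error(X, y, lam1, lam2, n_splits=5):
    """Cross-validated squared error of the kernelized elastic net with
    l1 weight lam1 and l2 weight lam2 (up to scikit-learn's rescaling)."""
    # Map (lam1, lam2) to scikit-learn's (alpha, l1_ratio) parameterization.
    alpha = lam1 + 2.0 * lam2
    l1_ratio = lam1 / alpha
    errs = []
    for tr, te in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        K_tr = gaussian_gram(X[tr], X[tr])   # dictionary evaluated on training points
        K_te = gaussian_gram(X[te], X[tr])   # dictionary evaluated on held-out points
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                           fit_intercept=False, max_iter=50_000)
        model.fit(K_tr, y[tr])
        errs.append(np.mean((K_te @ model.coef_ - y[te]) ** 2))
    return np.mean(errs)

# Illustrative data.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)

# Stage 1: choose the l1 parameter with the l2 parameter switched off.
lam1_grid = np.logspace(-4, 0, 9)
lam1 = min(lam1_grid, key=lambda l1: cv_error(X, y, l1, 0.0))

# Stage 2: with the l1 parameter fixed, choose the l2 parameter.
lam2_grid = np.logspace(-4, 0, 9)
lam2 = min(lam2_grid, key=lambda l2: cv_error(X, y, lam1, l2))
print(f"selected lam1 = {lam1:.4g}, lam2 = {lam2:.4g}")
```

The two one-dimensional searches replace a full two-dimensional grid search, which is the computational saving the heuristic aims at.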
3 Generalization Error Analysis
This section presents the main analysis used in deriving the generalization bounds given in section 2. In the literature on learning with a kernelized dictionary, the generalization error consists of the sample error, the approximation error, and the hypothesis error.
3.1 Decomposing Generalization Error into Bias-Variance Terms
The first two terms are called the sample error, which is caused by random sampling. Notice that the estimation of the sample error involves the empirical target function, which varies with the observations; thus, we need concentration techniques from empirical process theory for bounding it. The same observation also applies to the instrumental empirical target function due to its randomness. The next two terms are called the hypothesis error, since the two empirical target functions may lie in different hypothesis spaces that drift with the varying observations. The remaining term stands for the approximation error, which corresponds to the bias term and is independent of the random sampling. According to the error decomposition in proposition 3, to bound the generalization error of KENReg, it suffices to bound these five terms, respectively.
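Schematically, and in generic notation rather than that of proposition 3 (which splits further into five terms and keeps track of the penalty terms, omitted here), such a decomposition has the following flavor. Writing $\mathcal{E}(f) = \int (f(x) - y)^2 \, d\rho$ for the risk, $\mathcal{E}_{\mathbf{z}}$ for its empirical counterpart, $f_{\mathbf{z}}$ for the estimator, and $f_\lambda$ for an instrumental population-level function, one has the identity
\[
\mathcal{E}(f_{\mathbf{z}}) - \mathcal{E}(f_\rho)
= \underbrace{\big[\mathcal{E}(f_{\mathbf{z}}) - \mathcal{E}_{\mathbf{z}}(f_{\mathbf{z}})\big]
+ \big[\mathcal{E}_{\mathbf{z}}(f_\lambda) - \mathcal{E}(f_\lambda)\big]}_{\text{sample error}}
+ \underbrace{\big[\mathcal{E}_{\mathbf{z}}(f_{\mathbf{z}}) - \mathcal{E}_{\mathbf{z}}(f_\lambda)\big]}_{\text{hypothesis error}}
+ \underbrace{\big[\mathcal{E}(f_\lambda) - \mathcal{E}(f_\rho)\big]}_{\text{approximation error}}.
\]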
3.2 Approximation Error
The approximation error reflects the ability of the hypothesis space to approximate the underlying ground truth. The model assumption introduced in section 2.3 assumes that the approximation error is of polynomial order with respect to the regularization parameter. In this section, by using techniques introduced in Xiao and Zhou (2010), we show that the model assumption is typical when a certain regularity assumption on the regression function holds.
3.3 Bounding the Hypothesis Error Terms
One of the two hypothesis error terms can be estimated by applying the classical one-sided Bernstein inequality. The estimation of the other, however, involves function-space-valued random variables. Therefore, to estimate it, we need to introduce the following concentration inequality for random variables taking values in a Hilbert space, which can be found in Pinelis (1994).
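For reference, the classical one-sided Bernstein inequality invoked here reads as follows (in generic notation): if $\xi_1, \ldots, \xi_m$ are i.i.d. real random variables with mean $\mu$, variance $\sigma^2$, and $|\xi_i - \mu| \le M$ almost surely, then for every $\varepsilon > 0$,
\[
\mathrm{Prob}\left\{ \frac{1}{m}\sum_{i=1}^m \xi_i - \mu \ge \varepsilon \right\}
\le \exp\left( - \frac{m \varepsilon^2}{2\left(\sigma^2 + \tfrac{1}{3} M \varepsilon\right)} \right).
\]
Lemma 5 below is, roughly speaking, the analogue of such a bound for random variables taking values in a Hilbert space.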
The following estimates on the two hypothesis error terms can be derived by applying lemma 5.
3.4 Bounding the Sample Error Terms
In this part, we bound the two sample error terms, which are more involved due to the dependence of the two empirical target functions on the randomized observations. In the learning theory literature, this is typically done by applying concentration inequalities to empirical processes indexed by a class of functions and by using classical tools from empirical process theory such as peeling and symmetrization. The key idea is to show that the supremum of an empirical process is close enough to its expectation. A crucial step in the estimation is bounding the complexity of the function space that gives rise to the empirical processes. To this end, in our study, we introduce the following local Rademacher complexity.
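In generic notation, which may differ slightly from the letter's, the empirical Rademacher average of a function class $\mathcal{F}$ and its local version read
\[
\widehat{R}_m(\mathcal{F})
= \mathbb{E}_{\sigma}\left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^m \sigma_i f(x_i) \right],
\qquad
\widehat{R}_m\big\{ f \in \mathcal{F} : \mathbb{E} f^2 \le r \big\},
\]
where $\sigma_1, \ldots, \sigma_m$ are i.i.d. Rademacher variables (taking the values $\pm 1$ with probability $1/2$ each) that are independent of the sample; the local version restricts the supremum to the functions in $\mathcal{F}$ with small second moment.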
It is known that the generalization error bound based on the global Rademacher complexity is of the order $O(1/\sqrt{m})$. In practice, however, the hypothesis selected by a learning algorithm usually performs better than the worst case and belongs to a more favorable subset of all the hypotheses. The advantage of the local version of the Rademacher average is that it can be considerably smaller than the global one, and, compared with other complexity measures, distributional information is also taken into account. Therefore, we employ the local Rademacher complexity to measure the complexity of smaller subsets, which ultimately leads to sharper learning rates.
In general, a sub-root function is used as an upper bound for the local Rademacher complexity. A function $\psi: [0, \infty) \to [0, \infty)$ is sub-root if it is nonnegative and nondecreasing and if $r \mapsto \psi(r)/\sqrt{r}$ is nonincreasing for $r > 0$. The following theorem is due to Bartlett, Bousquet, and Mendelson (2005) with minor changes.
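Before stating it, recall from Bartlett et al. (2005) that every nontrivial sub-root function $\psi$ has a unique positive fixed point $r^{*}$:
\[
\psi(r^{*}) = r^{*}, \qquad \text{and} \qquad \psi(r) \le r \;\Longleftrightarrow\; r \ge r^{*} \quad \text{for all } r > 0.
\]
It is the order of this fixed point that ultimately governs the rates obtained from lemma 7.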
Lemma 7 tells us that to get better bounds for the empirical term, one needs to study the properties of the fixed point of the sub-root function. Although there is no general method for choosing the sub-root function, tight bounds for the local Rademacher complexity have been established for various function spaces such as RKHSs. The following lemma provides a connection between the local Rademacher complexity and the entropy integral; it is an immediate consequence of the proof of theorem A7 in Bartlett et al. (2005).
By making use of the relation between the local Rademacher complexity and the covering number given in lemma 8, we can upper-bound the quantity in lemma 7. Moreover, with some additional shorthand notation, we arrive at the following upper bounds for the two sample error terms.
3.5 Bounding the Local Rademacher Complexity: A By-Product
In learning theory analysis, to bound the generalization error, it is crucial to take into account the complexity of the hypothesis space. Various notions of complexity have been employed, including the VC dimension, covering numbers, Rademacher complexity, and local Rademacher complexity. One advantage of using the local Rademacher complexity over the other notions is that it can be computed directly from the data (Bartlett et al., 2005). In our preceding analysis, we bounded the local Rademacher complexity by applying the relation in lemma 8. In fact, as a by-product, in the following proposition we provide another upper bound for the local Rademacher complexity when learning with the kernelized dictionary.
Let the eigenfunctions of the integral operator associated with the kernel, corresponding to its eigenvalues, form an orthogonal basis of the underlying function space; for simplicity, we adopt the corresponding shorthand notation in what follows.
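For comparison only, and as a classical benchmark rather than the letter's proposition 10: when the kernel is a Mercer kernel whose integral operator has eigenvalues $(\mu_j)_{j \ge 1}$, the local Rademacher complexity of the unit ball $B_{\mathcal{H}}$ of the associated RKHS satisfies, up to an absolute constant $C$ (Bartlett et al., 2005),
\[
\mathbb{E}\, \widehat{R}_m\big\{ f \in B_{\mathcal{H}} : \mathbb{E} f^2 \le r \big\}
\le C \left( \frac{1}{m} \sum_{j \ge 1} \min\{ r, \mu_j \} \right)^{1/2}.
\]
Proposition 10 plays an analogous role in the kernelized dictionary setting, where the kernel is not required to be a Mercer kernel.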
4 Sparse Recovery via Kernelized Elastic Net Regularization
The learning theory estimates presented in section 2 state that, with overwhelming confidence, the KENReg estimator can be a good estimator of the regression function. In this section, we focus on the inference aspect of the kernelized elastic net estimator, with specific emphasis on its sparse recovery ability.
In recent years, compressed sensing and related sparse recovery schemes have become hot research topics along with the advent of big data. Essentially, the main concern of these sparse recovery schemes is to what extent an algorithm can recover the underlying true signal, which is assumed to be sparsely representable in some sense. Given that KENReg also produces a sparse approximation estimator and parallels those sparse recovery schemes (Candès & Tao, 2005; Candès et al., 2006; Zhang, 2011), we now study the sparse recovery property of KENReg by assuming that the ground truth possesses a sparse representation or can be sparsely approximated.
In the above sparse representation assumption, it is assumed that the regression function can be sparsely represented by the kernelized dictionary, which depends on the design points. Therefore, to study the sparse recovery ability of the KENReg estimator, in this section we adopt the fixed design setting in order to avoid a drifting ground truth. In fact, the fixed design setting has been commonly adopted in the sparse learning literature (Zhang, 2010; Huang & Zhang, 2010; Huang, Zhang, & Metaxas, 2011; Koltchinskii, 2011; Bach, 2013) and can be more practical in certain real-world applications (Koltchinskii, Sakhanenko, & Cai, 2007). In what follows, for any coefficient vector, we write its support set for the set of indices of its nonzero entries. Under the sparse representation assumption, the main concerns of sparse recovery, tailored to our context, are the following two aspects:
The approximation ability of the solution of KENReg to the true sparse coefficient vector, that is, the distance between the two under a certain metric
The sparse recovery ability of KENReg: under what conditions KENReg can correctly identify the zero pattern of the true sparse coefficient vector, that is, recover its support set
In the literature on compressed sensing (Donoho & Huo, 2001; Donoho & Elad, 2003; Tropp & Gilbert, 2007) and sparse linear regression (Zhao & Yu, 2006; Tibshirani, 2013), the coherence criterion and the irrepresentable condition have been introduced to examine correlations between pairs of atoms in a dictionary, and the two are known to be closely related. In the setting of learning with a kernelized dictionary, the kernelized irrepresentable assumption introduced above also measures the correlation between atoms that belong to different sets of dictionaries. In fact, we note that a kernelized coherence criterion has been introduced within this setting (Richard, Bermudez, & Honeine, 2009; Honeine, 2014, 2015). In Richard et al. (2009), the kernelized coherence is defined as the largest normalized correlation between distinct atoms of the kernelized dictionary, and dictionaries are selected so that the coherence stays below a given threshold. Simple computations show that, with properly chosen regularization parameters and a sufficiently small coherence, assumption 119 also holds. On the other hand, it is easy to see that the kernel and the ℓ2-regularization term do have an influence on the kernelized irrepresentable assumption.
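In the notation of Richard et al. (2009) (sketched here up to normalization conventions), the kernelized coherence of the dictionary $\{K(x_i, \cdot)\}_{i=1}^m$ is
\[
\mu = \max_{i \ne j} \frac{\big| K(x_i, x_j) \big|}{\sqrt{ K(x_i, x_i)\, K(x_j, x_j) }},
\]
and a candidate atom is admitted into the dictionary only if its coherence with the atoms already selected remains below a prescribed threshold $\mu_0$.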
Theorem 12 states that low correlation between the relevant and irrelevant kernel-spanned spaces leads to good model selection ability. Conditions 4.1 require that the sample size be large enough; meanwhile, the ratio between the two regularization parameters should be upper-bounded. Under the kernelized irrepresentable assumption, theorem 12 tells us with overwhelming confidence that the zero pattern of the true sparse coefficient vector can be correctly identified.
We first construct a sparse candidate estimator for a reduced-order form of KENReg; based on this, we construct an augmented candidate estimator. Eventually we show that the constructed candidate estimator is the unique solution to the original optimization problem in KENReg.
5 Interplay among Sparseness, Stability, and Generalization
KENReg possesses both algorithmic stability and the sparseness property; indeed, achieving sparseness and stability simultaneously was the main motivation for its introduction. Here, sparseness means that some components of the solution to KENReg are zero. This section is dedicated to discussing the sparseness, stability, and generalization properties of KENReg and their interplay. The conclusions drawn in this section are threefold: first, in the setting of learning with a kernelized dictionary, stability can also lead to generalization ability, just as it does for learning machines over reproducing kernel Hilbert spaces; second, confidence bounds on the sparseness of the output can be derived from the generalization results; third, the two properties of KENReg, sparseness and stability, are not mutually exclusive, which makes KENReg interesting because, in general, sparse learning schemes are considered not to be stable (Xu et al., 2012). We next detail these arguments.
5.1 Stability Brings Generalization
Stability and generalization ability are two different but relevant important properties of a learning machine. In connection to the sensitivity analysis, the stability of a learning algorithm means that its output does not change much under small changes in the input observations, while generalization refers to its prediction ability on future unseen observations.
It is now commonly accepted that, when learning within reproducing kernel Hilbert spaces, for some kernel-based learning machines including SVC, SVR, and kernel ridge regression, stability in some sense is equivalent to generalization ability (Bousquet & Elisseeff, 2002). Moving attention to KENReg, as mentioned in section 1, it enjoys the algorithmic stability property. Moreover, this property has been theoretically verified in Feng et al. (2014) under the notion of uniform β-stability.
In the regression setup, the uniform β-stability of a learning algorithm (Bousquet & Elisseeff, 2002) is defined as follows.
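Sketched in the notation of Bousquet and Elisseeff (2002), and possibly differing from the letter's statement in constants, a learning algorithm $A$ is uniformly $\beta$-stable with respect to a loss $\ell$ if, for every training set $S$ of size $m$ and every index $i \in \{1, \ldots, m\}$,
\[
\sup_{z} \big| \ell(A_S, z) - \ell(A_{S^{\setminus i}}, z) \big| \le \beta,
\]
where $A_S$ denotes the output of the algorithm trained on $S$, and $S^{\setminus i}$ denotes $S$ with the $i$th observation removed.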
The following fact is due to Feng et al. (2014).
KENReg is uniformly β-stable, with a stability coefficient that decreases as the sample size and the ℓ2-regularization parameter grow.
Simple calculations show that, with the parameter choice as in corollary 2, the stability coefficient in fact 1 decays sufficiently fast with the sample size m. Therefore, according to Bousquet and Elisseeff (2002), nontrivial generalization bounds can be derived from the above algorithmic stability results. This coincides with the information provided by the generalization bounds presented in theorem 1; that is, KENReg does generalize when the ℓ2-regularization parameter tends to infinity in accordance with m. However, this is not the case for the convergence rates reported in Feng et al. (2014), which are stated with respect to a projected version of the estimator and require that the regularization parameter go to zero as m tends to infinity to ensure convergence. This distinguishes the two types of convergence rates, and the seeming contradiction is in fact caused by the projection operator.
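For reference, the stability-to-generalization bound of Bousquet and Elisseeff (2002) underlying this argument states (in their notation, with the loss bounded by $M$) that, with probability at least $1 - \delta$ over the draw of the $m$ observations,
\[
R(A_S) \le R_{\mathrm{emp}}(A_S) + 2\beta + \big( 4 m \beta + M \big) \sqrt{ \frac{\ln(1/\delta)}{2m} },
\]
so that nontrivial bounds require $\beta$ to decay faster than $m^{-1/2}$, which is the regime described in fact 1 under the parameter choice of corollary 2.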
It should be mentioned that the convergence rates of KENReg derived from algorithmic stability are not as fast as those derived by using the concentration arguments stated in theorem 1. This is mainly because the derivation of the convergence rates in theorem 1 takes the second-order information of the noise term into account. On the other hand, in Feng et al. (2014), generalization bounds of KENReg with respect to its projected output are also derived, which roughly state that to ensure the generalization ability of the projected output, the regularization parameter should go to zero when m tends to infinity. This obviously contradicts the stability results in fact 1, since with such a choice the algorithmic stability coefficient is useless in deriving convergence rates for KENReg. This makes the generalization bounds in theorem 1 more attractive, as they are compatible with the stability arguments stated above.
In short, KENReg enjoys the algorithmic stability property, which can also bring us nontrivial generalization bounds.
5.2 Sparseness Confidence from Generalization
Besides the property of stability, another important property of KENReg lies in the fact that it advocates sparse output, which is attributed to the sparsity-inducing penalty it employs. More specifically, as Feng et al. (2014) showed, the sparseness of the solution to KENReg is characterizable. This can be indicated by the following fact:
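Fact 2 gives such a characterization in the letter's own notation; for orientation, a generic version, stated here as a sketch for an elastic-net-type objective over a kernelized dictionary in illustrative notation that need not match fact 2 exactly, reads as follows. For the minimizer $\hat{\alpha}$ of
\[
\frac{1}{m} \sum_{i=1}^m \Big( y_i - \sum_{j=1}^m \alpha_j K(x_j, x_i) \Big)^2
+ \lambda \sum_{j=1}^m |\alpha_j| + \eta \sum_{j=1}^m \alpha_j^2 ,
\qquad \lambda, \eta > 0,
\]
the subgradient optimality conditions give $\hat{\alpha}_j = 0$ if and only if
\[
\Big| \frac{2}{m} \sum_{i=1}^m K(x_j, x_i) \Big( y_i - \sum_{k=1}^m \hat{\alpha}_k K(x_k, x_i) \Big) \Big| \le \lambda .
\]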
Obviously, fact 2 provides a necessary and sufficient condition that characterizes the zero pattern of the KENReg solution. Starting from fact 2, we now show that, by using standard learning theory arguments, probabilistic confidence bounds that ensure the sparseness of KENReg can be established, as done in Shi et al. (2011).
To conclude, the sparseness of the solution to KENReg can be theoretically and probabilistically ensured as a consequence of the results on generalization bounds in theorem 1.
5.3 Sparseness and Stability Are Not Mutually Exclusive
As we have shown, KENReg possesses the sparseness and stability properties, both of which are arguably important for a good learning machine. However, Xu et al. (2012) argued that sparse learning algorithms are in some sense not stable, which seems to contradict our conclusion on the coexistence of sparseness and stability. In this section, we clarify this point.
To illustrate, we first recall that KENReg is a kernel-based regression model. Like classic kernel-based regression methods, learning with KENReg is essentially an instance selection procedure. In this context, sparseness refers to the fact that some of the instances do not contribute to the output. However, the learning schemes that Xu et al. (2012) discuss place more emphasis on the feature selection problem, where an algorithm is said to be sparse if it can identify redundant features (IRF). According to Xu et al. (2012), "Being IRF means that at least one solution of the algorithm does not select both features if they are identical." This obviously excludes the well-known elastic net regularization scheme, since the IRF requirement is at odds with the grouping effect property, which is the main advantage of elastic net regularization. In this sense, we say that the sparseness of a learning algorithm as defined in Xu et al. (2012) is somewhat more restrictive and differs from the notion of sparseness in our context. This explains the apparent contradiction between the conclusions we draw for KENReg and those in Xu et al. (2012).
From a statistical point of view, the stability of a learning machine is more related to its robustness, since "robustness theories can be viewed as stability theories of statistical inference" (Hampel, Ronchetti, Rousseeuw, & Stahel, 2011). For parametric learning models, robustness is frequently pursued by applying robust loss functions, which control the influence of large residuals (Huber & Ronchetti, 2009; Maronna, Martin, & Yohai, 2006; Hampel et al., 2011); similar results for kernel-based learning models can also be derived (Debruyne et al., 2008; Steinwart & Christmann, 2008; Feng et al., 2015). On the other hand, for a regularization model, the sparseness is more frequently brought about by its penalty term, which places restrictions on the structure of the hypothesis space. Therefore, more emphasis should be placed on designing the penalty term when pursuing a parsimonious model. In this sense, the stability and sparseness of a learning machine are of different natures and are not mutually exclusive.
6 Conclusion

In this letter, we studied the KENReg model, in which a combined penalty term is employed. A learning theory analysis was presented, with emphasis placed on its generalization bounds as well as its sparse recovery ability. The following three main results were presented:
By constructing a new hypothesis space, we presented a concentration estimate for the KENReg model. As a result, refined generalization bounds were obtained, which are optimal in an asymptotic sense. Moreover, the newly obtained bounds were shown to be consistent with those obtained via the algorithmic stability analysis.
The KENReg model is a sparse model, so we studied its sparse recovery ability by assuming that the regression function has a sparse representation with respect to the kernelized dictionary. Theoretically, we showed that under the above assumption the KENReg model can correctly pick up the sparse pattern with overwhelming probability.
We discussed different properties of the KENReg model, including sparseness, stability, and generalization, with special emphasis on their interplay. Roughly speaking, we showed that for the KENReg model, algorithmic stability also ensured that it can generalize, while its generalization bounds can be used to derive probabilistic bounds on the sparseness. Moreover, we also showed that for the KENReg model, its algorithmic stability and sparseness properties are not mutually exclusive.
Appendix: Technical Proofs
To bound the two hypothesis error terms by applying lemma 5, we need to upper-bound the relevant quantities in terms of their infinity norms:
A.2 Proof of Proposition 6
A.3 Proof of Proposition 9
A.4 Proof of Proposition 10
We thank the editor and the reviewers for their insightful comments and helpful suggestions that helped to improve the quality of this letter. We would also like to thank Dr. Yunwen Lei for pointing out a mistake in an early version of this letter. The corresponding author is S.-G. L. The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013)/ERC AdG A-DATADRIVE-B (290923). This letter reflects our own views; the EU is not liable for any use that may be made of the contained information. Funding has also been received from Research Council KUL: GOA/10/09 MaNet, CoE PFV/10/002 (OPTEC), BIL12/11T; PhD/postdoc grants; Flemish Government: FWO: PhD/postdoc grants, projects: G.0377.12 (Structured systems), G.088114N (Tensor-Based Data Similarity); IWT: PhD/postdoc grants, projects: SBO POM (100031); iMinds Medical Information Technologies SBO 2014; Belgian Federal Science Policy Office: IUAP P7/19 (DYSCO, Dynamical systems, Control and Optimization, 2012–2017). S.-G.L is supported partially by the National Natural Science Foundation of China (no. 11301421), and Fundamental Research Funds for the Central Universities of China (grants JBK141111, 14TD0046, and JBK151134).