Abstract

Kernelized elastic net regularization (KENReg) is a kernelization of the well-known elastic net regularization (Zou & Hastie, 2005). The kernel in KENReg is not required to be a Mercer kernel since it learns from a kernelized dictionary in the coefficient space. Feng, Yang, Zhao, Lv, and Suykens (2014) showed that KENReg has some nice properties including stability, sparseness, and generalization. In this letter, we continue our study on KENReg by conducting a refined learning theory analysis. This letter makes the following three main contributions. First, we present refined error analysis on the generalization performance of KENReg. The main difficulty of analyzing the generalization error of KENReg lies in characterizing the population version of its empirical target function. We overcome this by introducing a weighted Banach space associated with the elastic net regularization. We are then able to conduct elaborated learning theory analysis and obtain fast convergence rates under proper complexity and regularity assumptions. Second, we study the sparse recovery problem in KENReg with fixed design and show that the kernelization may improve the sparse recovery ability compared to the classical elastic net regularization. Finally, we discuss the interplay among different properties of KENReg that include sparseness, stability, and generalization. We show that the stability of KENReg leads to generalization, and its sparseness confidence can be derived from generalization. Moreover, KENReg is stable and can be simultaneously sparse, which makes it attractive theoretically and practically.

1  Introduction

Kernels and related kernel-based learning paradigms have become popular tools in the machine learning community owing to their great successes in theoretical interpretation and computational efficiency. The key idea is to implicitly map observations from the input space to some feature space in which simple (often linear) algorithms can then be applied, the so-called kernel trick. Based on kernels, a variety of learning machines have been developed in both supervised and unsupervised learning. Canonical examples include kernel principal component analysis in unsupervised learning, and the support vector machine for classification (SVC) and support vector regression (SVR) in supervised learning.

In this letter, we focus on the supervised learning problem. Specifically, we are interested in learning for the regression problem, the main goal of which is to infer a functional relation between input and output that gives good prediction performance on future observations. To be more specific, suppose that 𝒳 and 𝒴 are the explanatory variable space and the response variable space, respectively, with explanatory variable X and response variable Y. We assume that 𝒳 is a compact metric space and that the observations are generated by

Y = f_ρ(X) + ε,

where ε is the additive zero-mean noise. Suppose that we are given a set of observations z = {(x_i, y_i)}_{i=1}^m that are independent and identically distributed (i.i.d.) copies of (X, Y) drawn from an unknown probability distribution ρ on 𝒳 × 𝒴. In the statistical learning literature, there are two typical settings according to the different forms of the given design points x_1, …, x_m. When x_1, …, x_m are randomly drawn from the marginal distribution ρ_𝒳 of ρ on 𝒳, the setting is termed the random design setting, in contrast to the fixed design setting, where x_1, …, x_m are fixed and only the responses y_1, …, y_m are treated as random. The purpose of regression is to approximate the ground truth f_ρ in some function space based on the given observations with random or fixed design.
In the statistical learning literature, kernel-based learning models for regression have been extensively studied. One form can be expressed by the following generic formula,
f_z = arg min_{f ∈ H_K} { (1/m) Σ_{i=1}^m L(y_i, f(x_i)) + λ ‖f‖_K² },
1.1
where the first term stands for the empirical risk associated with a loss function L, H_K is a reproducing kernel Hilbert space (RKHS) induced by a Mercer kernel K, and λ > 0 is a regularization parameter. Depending on the choice of the loss function L, equation 1.1 yields different learning machines. For example, equipped with an ε-insensitive loss, this equation gives the formula for SVR. The representer theorem ensures that the optimal solution to equation 1.1 admits the representation
f_z(x) = Σ_{i=1}^m α_i K(x, x_i).
1.2
Besides being amenable to analysis and computationally efficient, another appealing property of these learning schemes is that the coefficient vector α = (α_1, …, α_m) in equation 1.2 is sparse. As a result, only a fraction of the observations contribute to the final output, which is crucial for pursuing a parsimonious model. In this sense, the above two kernel-based learning models can be cast as instance-selection learning machines. This also explains the terminology support vector. Here the sparseness is delivered by the special mechanism of the loss function.
However, in some practical regression problems, the output predictor learned from SVR may not be sufficiently sparse (Drezet & Harrison, 1998). Sparsity might be improved, so as to meet the requirement of a parsimonious model, by choosing a wider zero zone in the ε-insensitive loss. Nevertheless, with such a choice of loss function, the generalization ability might also be weakened. As an alternative, another kernel-based sparsity-producing learning model, the kernelized dictionary learning model, has recently been extensively studied in the statistical learning literature; there, sparseness is delivered via the penalty term instead of the loss function. Mathematically, the kernelized dictionary learning model takes the form
min_{α ∈ ℝ^m} (1/m) Σ_{i=1}^m L(y_i, f_α(x_i)) + λ Ω(α),  where f_α(x) = Σ_{j=1}^m α_j K(x, x_j),
1.3
where the first term denotes the empirical risk associated with a loss function L and the kernelized dictionary {K(·, x_j)}_{j=1}^m, Ω(α) is the penalty term, and λ is a positive regularization parameter. Notice that there is no representer theorem for equation 1.3. However, due to the utilization of the kernelized dictionary, it is a finite-dimensional optimization problem, and the output predictor also admits the same form as equation 1.2.
It can be seen from equation 1.3 that the penalization is applied directly to the coefficients. This differentiates the learning scheme, equation 1.3, from the previous kernel-based ones and also brings more flexibility. A notable point is that the kernel K in equation 1.3 is merely required to be a continuous function and does not need to be a Mercer kernel. This makes the learning machine, equation 1.3, more widely applicable, considering that the Mercer condition on the kernel can in some cases be cumbersome to verify. For more detailed explanations of the flexibility of learning with a kernelized dictionary, see the short review in Feng, Yang, Zhao, Lv, and Suykens (2014). Note that the learning paradigm, equation 1.3, belongs to the category of learning with kernels in the coefficient space that was introduced in Schölkopf and Smola (2001) and Suykens, Van Gestel, De Brabanter, De Moor, and Vandewalle (2002) and recently has been extensively studied empirically and theoretically in Roth (2004), Wu and Zhou (2005, 2008), Wang, Yeung, and Lochovsky (2007), Xiao and Zhou (2010), Sun and Wu (2011), Shi, Feng, and Zhou (2011), Tong, Chen, and Yang (2010), Lin, Zeng, Fang, and Xu (2014), Chen, Pan, Li, and Tang (2013), and Feng et al. (2014). On the other hand, sparseness can be easily delivered in equation 1.3 by choosing corresponding penalty terms. A frequent choice of sparsity-producing penalty term is the ℓ1-norm Ω(α) = ‖α‖_1, with which equation 1.3 is termed the kernelized Lasso. However, Xu, Caramanis, and Mannor (2012) and Feng et al. (2014) have argued that the kernelized Lasso is not stable from both a computational and an algorithmic viewpoint. Alternatively, Feng et al. (2014) proposed a stabilized regularization scheme by introducing an additional squared ℓ2-norm penalty (with respect to the coefficients) into the kernelized Lasso, namely, the kernelized elastic net regularization (KENReg) model, which can be stated as
α_z = arg min_{α ∈ ℝ^m} { (1/m) ‖y − Kα‖₂² + λ ‖α‖₁ + μ ‖α‖₂² },
1.4
where K denotes the m × m matrix with entries K_{ij} = K(x_i, x_j) for i, j = 1, …, m, y = (y_1, …, y_m)ᵀ, and λ, μ > 0 are regularization parameters. Obviously KENReg learns in a finite-dimensional space, and the corresponding empirical target function can be expressed as
f_z(x) = Σ_{j=1}^m (α_z)_j K(x, x_j).
1.5
Theoretically, Feng et al. (2014) showed that the KENReg model has advantages over the kernelized Lasso when stability and characterizable sparsity are taken into account. Meanwhile, generalization bounds, which indicate the consistency as well as the convergence rates of KENReg, were also derived there.
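Although equation 1.4 is analyzed only theoretically in this letter, its finite-dimensional form makes it easy to prototype. The following sketch (not the authors' implementation; the kernel, data, and parameter values are illustrative assumptions) minimizes an unnormalized version of the KENReg objective, ‖y − Kα‖² + λ₁‖α‖₁ + λ₂‖α‖₂², by cyclic coordinate descent with soft-thresholding:

```python
import math

def soft(u, t):
    """Soft-thresholding operator S(u, t) = sign(u) * max(|u| - t, 0)."""
    return math.copysign(max(abs(u) - t, 0.0), u)

def kenreg(K, y, lam1, lam2, iters=200):
    """Minimize ||y - K a||^2 + lam1*||a||_1 + lam2*||a||_2^2 by cyclic
    coordinate descent.  The 1/m factor of equation 1.4 is assumed to be
    absorbed into lam1 and lam2.  K is an m x m list-of-lists built from a
    (not necessarily Mercer) kernel; y is a length-m list of responses."""
    m = len(y)
    a = [0.0] * m
    for _ in range(iters):
        for j in range(m):
            # partial residual excluding coordinate j
            r = [y[i] - sum(K[i][k] * a[k] for k in range(m) if k != j)
                 for i in range(m)]
            rho = sum(K[i][j] * r[i] for i in range(m))
            norm2 = sum(K[i][j] ** 2 for i in range(m))
            # exact minimizer of the objective in coordinate j
            a[j] = soft(rho, lam1 / 2.0) / (norm2 + lam2)
    return a

# toy data: an asymmetric "kernel" dictionary is allowed here
xs = [0.0, 0.5, 1.0, 1.5]
kernel = lambda x, u: math.exp(-(x - u)) * (1.0 + x * u)   # not a Mercer kernel
K = [[kernel(x, u) for u in xs] for x in xs]
y = [1.0, 0.2, 0.1, 0.0]
alpha = kenreg(K, y, lam1=0.5, lam2=0.1)
```

Because the penalty acts on the coefficients rather than on a function-space norm, nothing in the code requires K to be symmetric or positive definite, which mirrors the flexibility discussed above.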

1.1  Objectives of the Letter

As a continuation of our previous study on KENReg, the main scope of this letter is to study the following aspects: generalization ability, sparse recovery ability, and the interplay among sparseness, stability, and generalization.

First, generalization ability is one of the major concerns when designing a learning machine and also contributes to the major theme of statistical learning theory. In the random design setting, the generalization analysis is concerned with the distance between f_z and the regression function, measured with respect to the marginal distribution ρ_𝒳 of ρ on 𝒳. The generalization ability of KENReg was analyzed in Feng et al. (2014) by introducing an instrumental empirical target function, following the path of Wu and Zhou (2008).

However, as Feng et al. (2014) remarked, the generalization bounds provided there are in fact derived with respect to a truncated version of the empirical target function instead of f_z itself, which is a compromise since the analysis of the truncated version can be significantly simplified. On the other hand, according to the generalization analysis given in Feng et al. (2014), to ensure the convergence of KENReg, the regularization parameter is required to go to zero as the sample size m tends to infinity. Nevertheless, the stability arguments in Feng et al. (2014) show that to ensure the convergence of KENReg, it should go to infinity in accordance with m. This contradiction is again caused by the fact that the analysis of the convergence rates of KENReg in Feng et al. (2014) was carried out with respect to a truncated version of f_z.

In fact, the generalization analysis that Feng et al. (2014) presented cannot be easily tailored to the empirical target function f_z in equation 1.5. As Feng et al. (2014) detailed and as we also explain in section 3, from a function approximation point of view, the main difficulties encountered in the analysis lie in defining the hypothesis space and characterizing the approximation error associated with the regularization parameters λ and μ. Given that the analysis of generalization bounds plays an important role in studying a learning machine, the first major concern of this letter is to present a refined analysis of the generalization bounds of KENReg with respect to the empirical target function itself.

Next, we look at sparse recovery ability. KENReg advocates sparse solutions due to the use of a sparsity-producing penalty term. Following the literature on sparse recovery (Candès & Tao, 2005; Candès, Romberg, & Tao, 2006; Zhang, 2011), it is natural to ask whether KENReg can identify the zero pattern of the true solution if it is assumed to be sparse in some sense. This is the second major concern in this work, and a positive answer is presented in section 4 with the fixed design setting.

Finally, we look at the interplay among sparseness, stability, and generalization. Sparseness and stability are the two motivating properties for which KENReg was proposed, as Feng et al. (2014) stated. These properties, together with its generalization ability, make KENReg attractive for regression. However, it is generally thought that sparse learning algorithms cannot be stable. Moreover, it is not clear whether, in the kernelized dictionary setting, algorithmic stability can also lead to generalization. We are also concerned with the relation between generalization and sparseness. Thus, the third purpose of this letter is to illustrate the interplay among these different and important properties of KENReg.

This letter is organized as follows. In section 2, we present results on generalization bounds. Section 3 is dedicated to outlining the steps and main ideas in doing error analysis. We study the sparse recovery ability of the KENReg model in section 4. Section 5 presents discussion on the interplay among the three different properties of the KENReg model: sparseness, stability, and generalization. We end this letter by summarizing contributions in section 6. Proofs of propositions and theorems are provided in the appendix if they are not presented immediately in the following sections.

2  Preliminaries and Main Results on Generalization Bounds

This section presents our main results on generalization bounds, which refer to high-confidence convergence rates of f_z to the regression function under the L²-metric with random design. Here the regression function is the conditional mean of Y given X, which coincides with the ground truth due to the zero-mean noise assumption. To this end, we first discuss difficulties encountered in the error analysis and introduce some notation and assumptions.

2.1  Difficulties in Error Analysis and Proposed Method

KENReg learns with the kernelized dictionary , which depends on the data . Therefore, the hypothesis space of KENReg is drifting with the varying observations . This leads to the so-called hypothesis error, which also contributes to the variance of the estimator . On the other hand, without a fixed hypothesis space, one is not able to characterize the population version of , and this makes it difficult to conduct an error analysis via the classical bias-variance trade-off approach. As Feng et al. (2014) commented, besides the varying observations , such difficulty also stems from the combined penalty term in KENReg and the stepping-stone technique employed there to bypass it.

In this letter, we overcome this difficulty by first constructing an instrumental hypothesis space to help characterize the population version of . We then construct an instrumental empirical target function that mimics well and is easier to analyze. The last step is to conduct an error analysis with respect to the newly constructed instrumental empirical target function. This process is illustrated in Figure 1.

Figure 1:

An illustration on the idea of generalization error analysis in our study.


In Figure 1, is the instrumental hypothesis space that is constructed to characterize the population version of : . is an instrumental empirical target function that has a similar generalization performance as . is the population version of . is another space that is closely related to and is introduced to include both and . In our analysis, instead of bounding the distance between and directly, we bound the distance between and and show that this can upper-bound the previous one. Rigorous definitions of these spaces and functions are given below. Here, it should be mentioned that the instrumental empirical target function essentially plays a stepping-stone role; this technique was proposed by Wu and Zhou (2005, 2008) and employed in much follow-up work (see Shi, 2013; Tong et al., 2010; and Chen et al., 2013).

2.2  Construction of the Instrumental Hypothesis Space and the Instrumental Empirical Target Function

For any function f defined on , the usual -norm is defined as for any . For notational simplicity, we denote . The instrumental hypothesis space we construct is the following Banach function space with a weighted norm,
formula
where w is a weight parameter related to the penalty parameter that will be specified below. For any bounded kernel function , we define the integral operator as follows:
(L_K f)(x) = ∫_𝒳 K(x, t) f(t) dρ_𝒳(t).
It is easily seen from the boundedness of that the operator is well defined on . With this notation, we define the range of as another Banach space,
formula
where the norm is given by
formula
Obviously can be embedded into some subset of following the continuity of .
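To make the integral operator concrete, one can discretize it by Monte Carlo: assuming the standard form (L_K f)(x) = ∫ K(x, t) f(t) dρ_𝒳(t), an i.i.d. sample from the marginal distribution yields the plug-in estimate below. The kernel, distribution, and test function are illustrative assumptions, not objects from the letter.

```python
import random

def L_K(kernel, f, x, sample):
    """Monte Carlo approximation of the integral operator
    (L_K f)(x) = integral of K(x, t) f(t) drho_X(t),
    where sample is drawn i.i.d. from rho_X."""
    return sum(kernel(x, t) * f(t) for t in sample) / len(sample)

random.seed(0)
sample = [random.random() for _ in range(20000)]   # rho_X = uniform on [0, 1]
kernel = lambda x, t: x * t                        # a bounded kernel on [0, 1]^2
f = lambda t: 1.0                                  # constant test function
# exact value for this choice: (L_K f)(x) = integral of x*t dt = x / 2
approx = L_K(kernel, f, 0.8, sample)               # close to 0.4
```

Boundedness of the kernel is what makes the plug-in average stable here, which parallels the well-definedness remark above.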
For ease of error analysis, we now introduce the regularization function that is associated with parameter w:
formula
2.1
The instrumental empirical target function is constructed as follows
formula
2.2
To characterize the approximation error, we also denote the quantity D(w) as
formula
2.3
To characterize the population version of , we also denote
formula
2.4
It is easy to see that the quantity in equation 2.3 represents the approximation ability of the function space to the regression function. The functional in equation 2.4 is strictly convex and is defined on a reflexive Banach space. From classical functional analysis, we know that the minimizer in equation 2.4 exists. Moreover, there exists a function in the instrumental hypothesis space attaining the corresponding infimum.

2.3  Assumptions and Generalization Bounds

In order to state the main results on generalization bounds, we introduce several assumptions. The generalization bounds in our study are derived by controlling the capacity of the hypothesis space that KENReg works in. Its hypothesis space is spanned by the kernelized dictionary , and its capacity assumption is given as follows:

Assumption 1
(Capacity Assumption). There exists a positive constant p such that
formula
2.5
where cp is a positive constant independent of , and
formula
and N(B_R, η) denotes the covering number of B_R, that is, the smallest integer N such that there exist N disks with radius η and centers in B_R covering B_R.

Shi et al. (2011) showed that the above capacity assumption holds under certain regularity conditions on the kernel. This indicates that the behavior in equation 2.5 is typical; for example, it covers the hypothesis space induced by the gaussian kernel. Assumption 2 concerns the boundedness of the response variable Y, which is a canonical assumption in the statistical learning theory literature (Cucker & Zhou, 2007; Steinwart & Christmann, 2008; Hang & Steinwart, 2014; Lv, 2015) and is also adopted in Feng et al. (2014).

Assumption 2

(Boundedness Assumption). Assume that |Y| ≤ M almost surely for some M > 0, and without loss of generality we let M = 1 and sup_{x,u ∈ 𝒳} |K(x, u)| ≤ 1.

Following assumption 2, it is easy to see that the regression function is also bounded. As Feng et al. (2014) remarked, assumption 2 can be easily relaxed to the assumption that the response variable has a subgaussian tail without leading to essential difficulties in the analysis. Moreover, by replacing the least squares loss with some robust loss function, it can be further relaxed to the assumption that the response variable satisfies certain moment conditions, as shown in Huber and Ronchetti (2009), Vapnik (1998), Györfi, Kohler, Krzyżak, and Walk (2002), and Feng, Huang, Shi, Yang, and Suykens (2015).

Assumption 3
(Model Assumption). There exist positive constants c₀ and β such that for any w > 0, the quantity D(w) defined in equation 2.3 satisfies
D(w) ≤ c₀ w^β.

This model assumption specifies the approximation ability of the model. In fact, from the definition in equation 2.3, it is easy to see that it depends on the regularity of the regression function. More discussion of the model assumption is provided in section 3.2. Now we are ready to state our main results on the generalization bounds.

Theorem 1.
Suppose that 𝒳 is a compact subset of ℝ^d. Assume that the boundedness assumption, the capacity assumption, and the model assumption hold. For any 0 < δ < 1, with confidence 1 − δ, there holds
formula
where
formula
and the above rate is derived by choosing
formula
with the parameters chosen as above and a positive constant independent of m or δ.

As a consequence of theorem 1, we can present the convergence rates of KENReg in the following corollary in a more explicit manner when the regression function is smooth enough.

Corollary 1.
Suppose that the assumptions of theorem 1 hold and additionally assume that the regression function is sufficiently smooth. For any 0 < δ < 1, with confidence 1 − δ, there holds
formula
where the choice of the parameter pair is .

Proofs of theorem 1 and corollary 1 are provided in the appendix. Here we give several remarks:

  • The convergence rates stated above indicate their dependence on the regularization parameters and further confirm the involvement of the ℓ2-penalty term in generalization. They are also optimal in the sense that when p tends to zero, they can be arbitrarily close to O(m⁻¹), which is regarded as the fastest learning rate in the learning theory literature. On the other hand, the convergence rates are derived with respect to f_z instead of its projected version as presented in Feng et al. (2014). Noticing that f_z is exactly the empirical target function of interest in our study, we say in this sense that the two types of convergence rates are of a different nature and that refined generalization bounds are presented in this study.

  • The convergence rates in corollary 1 are obtained under the condition that the regularization parameter goes to infinity when the sample size m tends to infinity. As we will detail, this is consistent with the observation on generalization made from the stability arguments in Feng et al. (2014).

  • In theorem 1 and corollary 1, the regularization parameters λ and μ are selected to achieve the theoretical convergence rates by balancing the bias and the variance terms. In practice, they are more frequently chosen by using data-driven techniques (e.g., cross-validation). To reduce the computational burden, a heuristic approach to selecting the two parameters can be conducted as follows: one first chooses one parameter via cross-validation with the other set to zero; with the first parameter fixed at its selected value, one then carries out cross-validation again to determine an appropriate value of the other. (We refer readers to Feng et al., 2014, for more detailed discussion of the model selection problem and numerical studies.)
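The two-stage heuristic can be sketched as a small grid search. Here cv_score is a hypothetical placeholder for a cross-validation routine, and the grids and toy score function are illustrative assumptions:

```python
def two_stage_select(cv_score, grid1, grid2):
    """Heuristic two-stage parameter selection: tune the first
    regularization parameter with the second fixed at zero, then tune
    the second with the first held at its selected value.
    cv_score(p1, p2) returns a cross-validation error (lower is better)."""
    p1 = min(grid1, key=lambda v: cv_score(v, 0.0))   # stage 1: second param = 0
    p2 = min(grid2, key=lambda v: cv_score(p1, v))    # stage 2: first param fixed
    return p1, p2

# toy score with a unique minimum at (0.1, 0.01), standing in for CV error
score = lambda a, b: (a - 0.1) ** 2 + (b - 0.01) ** 2
best = two_stage_select(score, [0.01, 0.1, 1.0], [0.0, 0.01, 0.1])
```

This reduces a two-dimensional search over all parameter pairs to two one-dimensional searches, which is the computational saving the heuristic aims at.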

3  Generalization Error Analysis

This section presents the main analysis in deriving the generalization bounds given in section 2. In the literature on learning with kernelized dictionaries, the generalization error consists of the sample error, the approximation error, and the hypothesis error.

3.1  Decomposing Generalization Error into Bias-Variance Terms

For any function f, denote its empirical risk and expected risk under the least squares loss by
ℰ_z(f) = (1/m) Σ_{i=1}^m (f(x_i) − y_i)²  and  ℰ(f) = ∫_{𝒳×𝒴} (f(x) − y)² dρ.
In what follows, we set the weight and . The following error decomposition splits the generalization error into the above-mentioned three parts: sample error, approximation error, and hypothesis error.
Proposition 1.
Let be produced by equation 1.5, be given by equation 2.2, and be defined as in equation 2.4. Denote , where the norms and are defined with respect to the coefficients of . For any , the following error decomposition holds,
formula
where
formula

The terms labeled sample error are caused by random sampling. Notice that their estimation involves f_z, which varies with the observations z; thus, we need concentration techniques from empirical process theory to bound them. The same observation also applies to the terms involving the instrumental empirical target function, due to the randomness of z. The terms labeled hypothesis error arise since f_z and the instrumental empirical target function may lie in different hypothesis spaces with the varying observations z. The remaining term stands for the approximation error, which corresponds to the bias and is independent of the random sampling. According to the error decomposition in proposition 1, to bound the generalization error of KENReg, it suffices to bound each of these terms, respectively.

3.2  Approximation Error

The approximation error reflects the approximation ability of the hypothesis space to the underlying ground truth. The model assumption introduced in section 2.3 assumes that the approximation error is of polynomial order with respect to w. In this section, by using techniques introduced in Xiao and Zhou (2010), we show that the model assumption is typical when a certain regularity assumption on the regression function holds.

In the learning theory literature, to characterize the regularity of the regression function, it is usually assumed that it belongs to the range of a compact, symmetric, and positive-definite linear operator associated with the kernel. Note that the kernel K in our study is not necessarily positive or symmetric. Xiao and Zhou (2010) noticed that a positive-definite kernel K̃ can be constructed from K as follows:
K̃(x, y) = ∫_𝒳 K(x, u) K(y, u) dρ_𝒳(u).
Consequently, a positive-definite integral operator L_K̃ can be defined. Due to the compactness of 𝒳 and the continuity of K̃, the integral operator L_K̃, as well as its fractional power, is compact and well defined. Based on the above notation, we come to the following conclusion:
Proposition 2.
Suppose that there exists such that for some . If for some , then there holds
formula
Proof.
We first denote as eigenvalues of with and as the corresponding eigenfunctions. From the spectral theorem for a positive compact operator, we know that forms an orthogonal basis of . Following the regularity assumption on , one has
formula
3.1
We now bound by considering the cases when lies in different intervals.
When , then there exists such that . Denoting , the fact that are eigenvalues of implies
formula
This in connection with the definition of the norm tells us that
formula
where the first inequality is due to the Cauchy-Schwarz inequality and the fact that is an orthogonal basis of , while the second inequality is based on the assumption that . On the other hand, for any , yields
formula
From the above estimates and the definition of , when , there holds
formula
When , we choose and obtain
formula
By combining the above estimates, we accomplish the proof.
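Assuming the construction of Xiao and Zhou (2010) takes the form K̃(x, y) = ∫ K(x, u) K(y, u) dρ_𝒳(u), the positive-definiteness can be checked numerically: the Monte Carlo Gram matrix is (1/n)BBᵀ, which is symmetric and positive semidefinite by construction even when K itself is neither. The asymmetric kernel and design points below are illustrative assumptions.

```python
import random

random.seed(1)
kernel = lambda x, u: (x - u) ** 3 + 1.0          # neither symmetric nor PSD
xs = [0.1, 0.4, 0.7, 1.0]                          # design points
us = [random.random() for _ in range(5000)]        # sample from rho_X = U[0, 1]
n = len(us)

# Monte Carlo Gram matrix of K~(x, y) = integral of K(x,u) K(y,u) drho_X(u):
# G = (1/n) * B * B^T with B[i][k] = K(x_i, u_k), hence symmetric PSD.
B = [[kernel(x, u) for u in us] for x in xs]
G = [[sum(B[i][k] * B[j][k] for k in range(n)) / n for j in range(4)]
     for i in range(4)]
```

The factorized form makes every quadratic form vᵀGv a Monte Carlo average of squares, so positive semidefiniteness holds for any sample size, not only in the limit.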

3.3  Bounding the Hypothesis Error Terms and

The first hypothesis error term can be estimated by applying the classical one-sided Bernstein inequality. However, the estimation of the second involves function-space-valued random variables. Therefore, we introduce the following concentration inequality for random variables with values in a Hilbert space, which can be found in Pinelis (1994).

Lemma 1.
Let H be a Hilbert space and ξ be a random variable on Z with values in H. Assume that ‖ξ‖ ≤ M̃ < ∞ almost surely. Denote σ²(ξ) = E‖ξ‖². Let {z_i}_{i=1}^m be an independent random sample from ρ. Then for any 0 < δ < 1, with confidence 1 − δ, there holds
‖(1/m) Σ_{i=1}^m ξ(z_i) − E ξ‖ ≤ (2M̃ log(2/δ))/m + √(2σ²(ξ) log(2/δ)/m).

The following estimates on the two hypothesis error terms can be derived by applying lemma 1.

Proposition 3.
With the choice of w specified above, for any 0 < δ < 1, with confidence at least 1 − δ, there holds
formula
and
formula
where C3 and are positive constants independent of m or .

3.4  Bounding the Sample Error Terms and

In this part, we bound the two sample error terms, which are more involved due to the dependence of the involved functions on the randomized observations z. In the learning theory literature, this is typically done by applying concentration inequalities to empirical processes indexed by a class of functions and also by using classical tools from empirical process theory such as peeling and symmetrization techniques. The key idea is to show that the supremum of an empirical process is close enough to its expectation. A crucial step in the estimation is bounding the complexity of the function space that gives rise to the empirical processes. To this end, in our study we introduce the following local Rademacher complexity.

Let {σ_i}_{i=1}^m be an i.i.d. sequence of Rademacher variables (taking the values ±1 with equal probability), and let {x_i}_{i=1}^m be an i.i.d. sequence of random variables from 𝒳, drawn according to some distribution. Let F be a class of functions on 𝒳, and for each f ∈ F let Var(f²) be the variance of f² with respect to the probability distribution on 𝒳. Define the Rademacher complexity of the function class F as
R_m(F) = E sup_{f ∈ F} (1/m) |Σ_{i=1}^m σ_i f(x_i)|,
and for each r > 0 call the expression R_m({f ∈ F : Var(f²) ≤ r}) the local Rademacher average of the class F.

It is known that the generalization error bound based on the global Rademacher complexity is of the order O(m^{-1/2}). In practice, however, the hypothesis selected by a learning algorithm usually performs better than the worst case and belongs to a more favorable subset of all the hypotheses. The advantage of using the local version of the Rademacher average is that it can be considerably smaller than the global one, and distributional information is also taken into account compared with other complexity measurements. Therefore, we employ the local Rademacher complexity to measure the complexity of smaller subsets, which ultimately leads to sharper learning rates.
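A small numerical sketch may help fix ideas: for a finite class evaluated on m sample points, both the global Rademacher average and its local counterpart can be estimated by Monte Carlo over random signs. The function class below is an illustrative assumption; it shows how a variance constraint discards the high-variance member that dominates the global average.

```python
import random

def rademacher(fvals_list, trials=2000, rng=random.Random(0)):
    """Monte Carlo estimate of E sup_f |(1/m) sum_i sigma_i f(x_i)| over a
    finite class; each element of fvals_list holds (f(x_1), ..., f(x_m))."""
    m = len(fvals_list[0])
    total = 0.0
    for _ in range(trials):
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(m)]
        total += max(abs(sum(s * v for s, v in zip(sigma, fv))) / m
                     for fv in fvals_list)
    return total / trials

def variance(fv):
    mu = sum(fv) / len(fv)
    return sum((v - mu) ** 2 for v in fv) / len(fv)

# a small illustrative class evaluated on m = 8 sample points
big   = [3.0, -3.0, 3.0, -3.0, 3.0, -3.0, 3.0, -3.0]   # high-variance member
small = [0.1, -0.1, 0.1, 0.0, -0.1, 0.1, -0.1, 0.0]    # low-variance member
F = [big, small]
r = 1.0
local_F = [fv for fv in F if variance(fv) <= r]   # local class {f : Var(f) <= r}
global_avg = rademacher(F)
local_avg = rademacher(local_F)
```

The local average is computed over the subclass that survives the variance constraint, so it is never larger than the global one; in this toy class it is far smaller.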

In general, a sub-root function is used as an upper bound for the local Rademacher complexity. A function ψ : [0, ∞) → [0, ∞) is sub-root if it is nonnegative and nondecreasing and if r ↦ ψ(r)/√r is nonincreasing for r > 0. The following lemma is due to Bartlett, Bousquet, and Mendelson (2005) with minor changes.

Lemma 2.
Let F be a class of measurable, square-integrable functions that is uniformly bounded. Let ψ be a sub-root function, A be some positive constant, and r* be the unique positive solution to ψ(r) = r. Assume that
formula
Then for all f ∈ F and all 0 < δ < 1, with probability at least 1 − δ, there holds
formula
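The fixed point in lemma 2 can be computed by simple iteration, since a sub-root function crosses the identity exactly once on (0, ∞). The particular ψ below is an illustrative choice whose fixed point is known in closed form, which makes the sketch easy to check.

```python
import math

def fixed_point(psi, r0=1.0, iters=200):
    """Iterate r <- psi(r); for a sub-root psi this converges to the
    unique positive solution of psi(r) = r."""
    r = r0
    for _ in range(iters):
        r = psi(r)
    return r

# illustrative sub-root function: psi(r) = 2*sqrt(r) + 3
# (nonnegative, nondecreasing, and psi(r)/sqrt(r) is nonincreasing)
psi = lambda r: 2.0 * math.sqrt(r) + 3.0
# exact fixed point: sqrt(r) solves s^2 - 2s - 3 = 0, so s = 3 and r* = 9
r_star = fixed_point(psi)
```

Monotonicity of ψ makes the iteration converge from any positive starting point, so no bracketing or derivative information is needed.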

Lemma 2 tells us that to get better bounds for the empirical term, one needs to study properties of the fixed point r* of a sub-root function ψ. Although there does not exist a general method for choosing ψ, tight bounds for the local Rademacher complexity have been established in various function spaces such as RKHSs. The following lemma provides a connection between the local Rademacher complexity and the entropy integral, which is an immediate result from the proof of theorem A7 in Bartlett et al. (2005).

Lemma 3.
The local Rademacher complexity is upper-bounded as
formula
where A is some constant and N(F, u) is the covering number of F at radius u in the empirical metric.

By making use of the relation between the local Rademacher complexity and the covering number given by lemma 3, we can upper-bound the quantity r* in lemma 2. Moreover, we come to the following upper bounds for the two sample error terms.

Proposition 4.
Assume that the boundedness assumption and the capacity assumption hold. For any 0 < δ < 1, with confidence 1 − δ, there holds
formula
where C1 and are positive constants independent of m or .
Proposition 5.
Assume that the boundedness assumption and the capacity assumption hold. For any 0 < δ < 1, with confidence 1 − δ, there holds
formula
where C4 is a positive constant independent of m or .

3.5  Bounding the Local Rademacher Complexity: A By-Product

In learning theory analysis, to bound the generalization error, it is crucial to take into account the complexity of the hypothesis space. Various notions of complexity measurement have been employed, including the VC-dimension, the covering number, the Rademacher complexity, and the local Rademacher complexity. One advantage of using the local Rademacher complexity over the others is that it can be computed directly from the data (Bartlett et al., 2005). In our preceding analysis, we bounded the local Rademacher complexity by applying the relation in lemma 3. In fact, as a by-product, in the following proposition we provide another upper bound for the local Rademacher complexity when learning with the kernelized dictionary.

To this end, let be any m-size i.i.d. copies of X drawn from . For any , denote . Let be the function set given by
formula
Proposition 6.
Let be the l-th largest eigenvalue of and assume that for . Then there holds
formula
Proof.

Let be the eigenfunction of that corresponds to and be an orthogonal basis of . For simplicity, we denote , , , and further denote .

For any , there holds
formula
3.2
Note that
formula
3.3
where the inequality follows from Jensen’s inequality. Since ’s are independent random variables with zero mean and , there holds
formula
This, in connection with equation 3.3, implies that
formula
On the other hand, we have
formula
This in connection with equation 3.2, and Jensen’s inequality implies that
formula
Following the subadditivity of and taking the supremum on , we accomplish the proof.

4  Sparse Recovery via Kernelized Elastic Net Regularization

The learning theory estimates presented in section 2 state with overwhelming confidence that f_z can be a good estimator of the regression function. In this section, we focus on the inference aspect of the kernelized elastic net estimator with specific emphasis on its sparse recovery ability.

In recent years, compressed sensing and related sparse recovery schemes have become hot research topics along with the advent of big data. Essentially, the main concern of these sparse recovery schemes is to what extent an algorithm can recover the underlying true signal, which is assumed to be sparsely representable in some sense. Given that KENReg also produces a sparse approximation to the regression function, paralleling those sparse recovery schemes (Candès & Tao, 2005; Candès et al., 2006; Zhang, 2011), we now study its sparse recovery property by assuming that the regression function possesses a sparse representation or can be sparsely approximated.

We start by introducing some notation and assumptions. Recall that the regression model we study in this letter is given by
formula
where more generally in this section, we assume that . For the regression function, we assume that the following sparse representation assumption holds, which has also been employed in the machine learning literature (see Xu, Jin, Shen, & Zhu, 2015).
Assumption 4
(Sparse Representation Assumption). Let S (possibly unknown) be a subset of with cardinality . We assume that the regression function has the following sparse representation:
formula

In the above sparse representation assumption, it is assumed that can be sparsely represented by the kernelized dictionary , which depends on the design points . Therefore, to study the sparse recovery ability of the KENReg estimator, in this section we adopt the fixed design setting to avoid a drifting ground truth. In fact, the fixed design setting has been commonly adopted in the sparse learning literature (Zhang, 2010; Huang & Zhang, 2010; Huang, Zhang, & Metaxas, 2011; Koltchinskii, 2011; Bach, 2013) and can be more practical in certain real-world applications (Koltchinskii, Sakhanenko, & Cai, 2007). In what follows, we denote . For any , we denote as the support set of with . Under the sparse representation assumption, the main concerns of sparse recovery, tailored to our context, are the following two aspects:

  • The approximation ability of the solution of KENReg to , that is, the distance between and under a certain metric

  • The sparse recovery ability of KENReg: under what conditions KENReg can correctly find the zero pattern of , that is,
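To make the setup concrete, here is a minimal sketch of a regression function satisfying the sparse representation assumption with respect to a kernelized dictionary. The Gaussian kernel, the design, the support set, and the coefficients are all hypothetical choices of ours, not taken from the letter.

```python
import numpy as np

def k(x, y, gamma=2.0):
    # A Gaussian kernel atom K(x, .) evaluated at y
    return np.exp(-gamma * np.sum((x - y) ** 2))

rng = np.random.default_rng(1)
m = 50
X = rng.uniform(-1.0, 1.0, size=(m, 1))      # fixed design points

S = [3, 17, 40]                              # true (in practice unknown) support
beta = np.zeros(m)
beta[S] = [1.5, -2.0, 0.8]                   # sparse coefficient vector

def f_star(x):
    # Sparse representation: f*(x) = sum over i in S of beta_i * K(x_i, x)
    return sum(beta[i] * k(X[i], x) for i in S)

support = set(np.flatnonzero(beta))
```

The two questions listed above then become: how close an estimator gets to f_star, and whether its estimated support recovers the index set S.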

In fact, under assumption 4, if we denote and further define the seminorm as
formula
it is easy to see that
formula
where the expectation is taken over the randomized sampling from . Therefore, for the more general randomized setting, the approximation ability of the solution of KENReg to has been studied in section 2. With respect to the second aspect, the following results address the sparse recovery property of .
Assumption 5
(Kernelized Irrepresentable Assumption). There exists a constant , such that
formula
where Sc denotes the complement of S and is the matrix acting on the subset S with , , .

In the literature of compressed sensing (Donoho & Huo, 2001; Donoho & Elad, 2003; Tropp & Gilbert, 2007) and sparse linear regression (Zhao & Yu, 2006; Tibshirani, 2013), the coherence criterion and the irrepresentable condition have been introduced to examine correlations between pairs of atoms in a dictionary, and the two are known to be closely related. In the setting of learning with the kernelized dictionary, the kernelized irrepresentable assumption introduced above likewise measures the correlation between atoms belonging to different sets of dictionaries. In fact, we notice that a kernelized coherence criterion has been introduced within this setting (Richard, Bermudez, & Honeine, 2009; Honeine, 2014, 2015). In Richard et al. (2009), the kernelized coherence is defined as , and dictionaries are selected to ensure that for a given threshold. Simple computations show that with properly chosen and , when , assumption 5 also holds. On the other hand, it is easy to see that both the kernel and the -regularization term have an influence on the kernelized irrepresentable assumption.
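As an illustration of how the kernel shapes these correlation quantities, the sketch below (entirely our own; the kernels and design are hypothetical) computes a kernelized coherence of the form max over i ≠ j of |K(x_i, x_j)| / sqrt(K(x_i, x_i) K(x_j, x_j)), and shows that a narrower Gaussian kernel yields less correlated atoms.

```python
import numpy as np

def kernel_coherence(K):
    # Largest normalized off-diagonal entry of the Gram matrix: the maximal
    # correlation between two distinct atoms of the kernelized dictionary
    d = np.sqrt(np.diag(K))
    C = np.abs(K) / np.outer(d, d)
    np.fill_diagonal(C, 0.0)
    return float(C.max())

rng = np.random.default_rng(2)
X = rng.uniform(-3.0, 3.0, size=(30, 1))
d2 = (X - X.T) ** 2
mu_wide = kernel_coherence(np.exp(-0.1 * d2))    # wide kernel: overlapping atoms
mu_narrow = kernel_coherence(np.exp(-50.0 * d2)) # narrow kernel: near-orthogonal atoms
```

A smaller coherence makes irrepresentable-type conditions easier to satisfy, which is one way the kernel choice enters the sparse recovery analysis.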

Theorem 2.
Suppose that assumptions 4 and 5 hold and is chosen to satisfy
formula
4.1
Then, with probability at least , KENReg has a unique solution with its support contained within the true support, that is, .

Theorem 2 states that low correlation between relevant and irrelevant kernel-spanned spaces leads to good model selection ability. Condition 4.1 requires that the sample size be large enough; meanwhile, the ratio between the two regularization parameters and should be upper-bounded. Under the kernelized irrepresentable assumption, theorem 2 tells us with overwhelming confidence that the zero pattern of can be correctly identified.
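This statement can also be probed numerically. The sketch below fits the coefficient-space elastic net over a Gaussian kernelized dictionary and inspects the sparsity of the solution. The coordinate descent solver is an illustrative implementation of ours, not the authors' algorithm, and the kernel, design, regularization parameters, and noise level are all hypothetical.

```python
import numpy as np

def soft(u, t):
    # Soft-thresholding operator: sign(u) * max(|u| - t, 0)
    return np.sign(u) * max(abs(u) - t, 0.0)

def kenreg_cd(K, y, lam1, lam2, n_sweeps=300):
    # Coordinate descent for the coefficient-space elastic net
    #   min over c of 1/(2m) ||y - K c||^2 + lam1 ||c||_1 + (lam2/2) ||c||^2
    m, p = K.shape
    c = np.zeros(p)
    z = (K ** 2).sum(axis=0) / m
    r = y - K @ c
    for _ in range(n_sweeps):
        for j in range(p):
            r += K[:, j] * c[j]            # partial residual without atom j
            rho = K[:, j] @ r / m
            c[j] = soft(rho, lam1) / (z[j] + lam2)
            r -= K[:, j] * c[j]
    return c

rng = np.random.default_rng(3)
m = 60
X = np.linspace(-1.0, 1.0, m)[:, None]
K = np.exp(-5.0 * (X - X.T) ** 2)          # Gram matrix = kernelized dictionary
S = [10, 30, 50]                           # true support
c_star = np.zeros(m)
c_star[S] = [2.0, -1.5, 1.0]
y = K @ c_star + 0.01 * rng.normal(size=m)

c_hat = kenreg_cd(K, y, lam1=0.01, lam2=0.001)
n_active = int(np.count_nonzero(np.abs(c_hat) > 1e-8))
```

The penalty forces a sparse fit, but note that exact recovery of S additionally requires the irrepresentable-type condition of assumption 5, so the estimated support need not coincide with S for an arbitrary kernel and design.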

Proof of Theorem 2.

We first construct a sparse candidate estimator for a reduced-order form of KENReg; based on it, we then construct an augmented candidate estimator. Finally, we show that the constructed candidate estimator is the unique solution to the original optimization problem in KENReg.

We first construct a sparse candidate estimator by solving the following reduced-order elastic net regularization model,
formula
4.2
where . Due to its strict convexity, the solution to this reduced-order convex problem is guaranteed to be unique, regardless of the reduced matrix . From the optimality conditions for convex programs (Boyd & Vandenberghe, 2004), we know that the vector is the solution to equation 4.2 if and only if there exists a subgradient such that
formula
4.3
and moreover, there holds .
We then construct by augmenting such that with , which is taken as a candidate solution to KENReg. Then we choose that satisfies the following zero-subgradient optimality condition of the original KENReg,
formula
4.4
where .
We now turn to proving that the estimator constructed above is the unique solution to the original KENReg. In fact, by the construction of and the optimality conditions for strictly convex programs, it suffices to prove that . This can be obtained by first rewriting condition 4.4 in block form, as follows:
formula
4.5
Recalling that the pair is obtained by solving the convex program in equation 4.3, it must satisfy the top block of these equations. In addition, from the bottom block of equation 4.5, we obtain that
formula
The previous formula, together with equation 4.3, implies
formula
where stands for the identity matrix of size :
formula
Therefore, to bound , it suffices to bound and separately. Under the kernelized irrepresentable condition and recalling that , we have
formula
4.6
where the second inequality is due to the choice of the regularization parameters in equation 4.1. On the other hand, to bound , we define the orthogonal projection matrix by
formula
Thus, can be rewritten element-wise as
formula
where Kj denotes the jth column of . According to lemma 1.7 in Buldygin and Kozachenko (2000), we see that each is a zero-mean subgaussian noise variable with variance,
formula
where we use the fact that the projection matrix has spectral norm one and the boundedness assumption that . As a result, by the subgaussian tail bound together with the union bound, we have
formula
By letting , we further obtain
formula
4.7
where the last inequality follows from the relation .
Finally, by combining estimates in equations 4.6 and 4.7, we conclude that with probability at least , there holds
formula
This shows that is a subgradient of , and thus equation 4.4 is a zero-subgradient equation of KENReg. According to the optimality conditions for strictly convex programs (Boyd & Vandenberghe, 2004), we conclude that is the unique solution of KENReg, and by the construction of , its support is obviously contained within the true support. This completes the proof of theorem 2.
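The subgaussian tail plus union bound step used in the proof can be checked by simulation. In the gaussian special case (with our own illustrative parameters p, sigma, and delta), the threshold sigma * sqrt(2 log(2p / delta)) solves 2p exp(-t^2 / (2 sigma^2)) = delta, so the maximum of p absolute noise coordinates should exceed it with probability at most delta.

```python
import numpy as np

rng = np.random.default_rng(4)
p, sigma, delta = 200, 1.0, 0.05
# Union-bound threshold: 2 p exp(-t^2 / (2 sigma^2)) = delta
t = sigma * np.sqrt(2.0 * np.log(2.0 * p / delta))

trials = 2000
Z = rng.normal(0.0, sigma, size=(trials, p))   # p-dimensional gaussian noise
exceed = float((np.abs(Z).max(axis=1) >= t).mean())
```

The empirical exceedance frequency stays well below delta, reflecting the conservativeness of the union bound.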

5  Interplay among Sparseness, Stability, and Generalization

KENReg possesses both algorithmic stability and the sparseness property, and this simultaneous sparseness and stability was indeed the main motivation for its introduction. Here, sparseness means that some components of the solution to KENReg are zero. This section is dedicated to discussing the sparseness, stability, and generalization properties of KENReg and their interplay. The conclusions drawn in this section are threefold: first, in the setting of learning with the kernelized dictionary, stability leads to generalization ability just as it does for learning machines over reproducing kernel Hilbert spaces; second, confidence bounds on the sparseness of the output can be derived from the generalization results; third, the sparseness and stability of KENReg are not mutually exclusive, which makes KENReg interesting because sparse learning schemes are in general considered to be unstable (Xu et al., 2012). We next detail these arguments.

5.1  Stability Brings Generalization

Stability and generalization are two different but related important properties of a learning machine. In connection with sensitivity analysis, the stability of a learning algorithm means that its output does not change much under small changes in the input observations, while generalization refers to its prediction ability on future unseen observations.

It is now commonly accepted that when learning within reproducing kernel Hilbert spaces, for some kernel-based learning machines, including SVC, SVR, and kernel ridge regression, stability in some sense is equivalent to generalization ability (Bousquet & Elisseeff, 2002). Turning to KENReg, as mentioned in section 1, it enjoys algorithmic stability. Moreover, under the notion of uniform -stability, this algorithmic stability property has been theoretically verified in Feng et al. (2014).

In the regression setup, the uniform -stability of a learning algorithm (Bousquet & Elisseeff, 2002) is defined as follows.

Definition 1.
Let . Denote as the modified samples by replacing the instance pair with in , for a fixed , and . Let be a learning algorithm with as the input and as the output. We say that has uniform stability with respect to the loss function if for any , , and with , for all , there holds
formula

The following fact is due to Feng et al. (2014).

Fact 1.

KENReg is uniform -stable with .

Simple calculations show that with the choice as in corollary 2, the coefficient in fact 1 is of the type . Therefore, according to Bousquet and Elisseeff (2002), nontrivial generalization bounds can be derived starting from the above algorithmic stability results. This coincides with the information conveyed by the generalization bounds presented in theorem 1; that is, KENReg does generalize when tends to infinity in accordance with m. However, this is not the case for the convergence rates reported in Feng et al. (2014), which are derived with respect to a projected version of and require that go to zero as m tends to infinity to ensure convergence. This distinguishes the two types of convergence rates, and the seeming contradiction is in fact caused by the projection operator.

It should be mentioned that the convergence rates of KENReg derived from algorithmic stability are not as fast as those derived by using the concentration arguments stated in theorem 1. This is mainly because the derivation of the convergence rates in theorem 1 takes the second-order information of the noise term into account. On the other hand, in Feng et al. (2014), generalization bounds of KENReg with respect to its projected output, , are also derived, which roughly state that to ensure the generalization ability of , the regularization parameter should go to zero when m tends to infinity. This obviously conflicts with the stability results in fact 1, since with such a choice of , the algorithmic stability coefficient is useless for deriving convergence rates of KENReg. This makes the generalization bounds in theorem 1 more attractive, as they are compatible with the stability arguments stated above.

In short, KENReg enjoys the algorithmic stability property, which can also bring us nontrivial generalization bounds.
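A quick empirical sanity check on this point is possible for the special case of KENReg without the sparsity penalty, where the coefficient vector has a closed form. The sketch below is entirely our own construction (the kernel, design, replacement scheme, and parameter values are hypothetical): it replaces one observation at a time and measures the largest change in the fitted function on the design points; a larger ridge parameter yields a more stable output, consistent with a stability coefficient that decreases as that parameter grows.

```python
import numpy as np

def fit_l2(K, y, lam2):
    # Closed-form minimizer of 1/(2m) ||y - K c||^2 + (lam2 / 2) ||c||^2,
    # i.e., KENReg with the l1 penalty switched off
    m = K.shape[0]
    return np.linalg.solve(K.T @ K / m + lam2 * np.eye(m), K.T @ y / m)

def stability_proxy(K, y, lam2, repl):
    # Max sup-norm change of the fit (on the design points) when one
    # response is replaced: an empirical proxy for uniform stability
    c = fit_l2(K, y, lam2)
    diffs = []
    for i in range(0, len(y), 10):       # a few single-point replacements
        y2 = y.copy()
        y2[i] = repl[i]
        diffs.append(np.abs(K @ (c - fit_l2(K, y2, lam2))).max())
    return max(diffs)

rng = np.random.default_rng(5)
m = 80
X = np.linspace(-1.0, 1.0, m)[:, None]
K = np.exp(-5.0 * (X - X.T) ** 2)
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=m)
repl = rng.normal(size=m)                # fixed replacement observations

weakly_reg = stability_proxy(K, y, 0.01, repl)
strongly_reg = stability_proxy(K, y, 1.0, repl)
```

Stronger regularization damps the influence of any single observation, which is exactly the mechanism behind stability coefficients of the type in fact 1.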

5.2  Sparseness Confidence from Generalization

Besides stability, another important property of KENReg lies in the fact that it advocates sparse outputs, owing to the sparsity-inducing penalty it employs. More specifically, as Feng et al. (2014) show, the sparseness of the solution to KENReg can be characterized, as the following fact indicates:

Fact 2.
For , if and only if
formula
5.1

Obviously, fact 2 provides a necessary and sufficient condition that characterizes the zero pattern of . Starting from fact 2, we now show that by using standard learning theory arguments, probabilistic confidence bounds that ensure the sparseness of KENReg can be established, as done in Shi et al. (2011).
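The flavor of fact 2 (whose precise constants and symbols are not reproduced in this text) is the familiar soft-threshold characterization of elastic net coordinates: a coefficient vanishes exactly when its correlation statistic falls below the l1 penalty level. A one-coordinate sketch of ours, with hypothetical names rho and z for the correlation and curvature terms:

```python
import numpy as np

def en_coordinate(rho, z, lam1, lam2):
    # One-dimensional elastic net minimizer of
    #   (z/2) c^2 - rho c + lam1 |c| + (lam2/2) c^2,
    # i.e., soft-threshold rho at lam1, then shrink by z + lam2
    return np.sign(rho) * max(abs(rho) - lam1, 0.0) / (z + lam2)

lam1, lam2, z = 0.5, 0.1, 1.0
below = en_coordinate(0.4, z, lam1, lam2)    # |rho| <= lam1: coefficient is zero
above = en_coordinate(0.6, z, lam1, lam2)    # |rho| >  lam1: coefficient survives
```

The zero pattern is thus decided by whether the correlation statistic exceeds the threshold, which is what the probabilistic argument below controls at the population level.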

The main idea is to derive a probabilistic confidence bound within which equation 5.1 holds. To this end, we recall that the population version of the empirical event
formula
5.2
is given by
formula
5.3
Applying the Cauchy-Schwarz inequality to equality 5.3, one gets
formula
The term involves only the kernel function and can be bounded under regularity restrictions on , while the term is the generalization error of , which is given in theorem 1. Therefore, to derive a probabilistic bound that ensures equation 5.1, it remains to bound the difference between equations 5.2 and 5.3. Observe that equation 5.2 is in fact an empirical process associated with the randomized sampling . Standard learning theory arguments involving empirical processes indexed by a class of functions can be applied, and a concentration estimate on the difference between equations 5.2 and 5.3 can be obtained. Here we omit the details.

To conclude, the sparseness of the solution to KENReg can be theoretically and probabilistically ensured as a consequence of the results on generalization bounds in theorem 1.

5.3  Sparseness and Stability Are Not Mutually Exclusive

As we have shown, KENReg possesses both the sparseness and stability properties, each of which is arguably important for a good learning machine. However, Xu et al. (2012) argued that sparse learning algorithms are in some sense not stable, which seems to contradict our conclusion on the coexistence of sparseness and stability. In this section, we clarify this point.

To illustrate, we first recall that KENReg is a kernel-based regression model. Like classic kernel-based regression methods, learning with KENReg is essentially an instance selection procedure. In this context, sparseness refers to the fact that some of the instances do not contribute to the output . However, the learning schemes that Xu et al. (2012) discuss place more emphasis on the feature selection problem, where an algorithm is said to be sparse if it can identify redundant features (IRF). According to Xu et al. (2012), “Being IRF means that at least one solution of the algorithm does not select both features if they are identical.” This obviously excludes the well-known elastic net regularization scheme, since elastic net does encourage the grouping effect property, which is in fact its main advantage. In this sense, the notion of sparseness of a learning algorithm defined in Xu et al. (2012) is somewhat more restrictive and differs from the notion of sparseness in our context. This explains the apparent contradiction between the conclusions we draw for KENReg and those in Xu et al. (2012).

From a statistical point of view, the stability of a learning machine is closely related to its robustness, since “robustness theories can be viewed as stability theories of statistical inference” (Hampel, Ronchetti, Rousseeuw, & Stahel, 2011). For parametric learning models, robustness is frequently pursued by applying robust loss functions, which control the influence of large residuals (Huber & Ronchetti, 2009; Maronna, Martin, & Yohai, 2006; Hampel et al., 2011); similar results for kernel-based learning models can also be derived (Debruyne et al., 2008; Steinwart & Christmann, 2008; Feng et al., 2015). On the other hand, for a regularization model, the sparseness is more frequently brought about by its penalty term, which restricts the structure of the hypothesis space. Therefore, more emphasis should be placed on designing the penalty term when pursuing a parsimonious model. In this sense, the stability and sparseness of a learning machine are of different natures and are not mutually exclusive.

6  Conclusion

In this letter, we studied the KENReg model in which a combined penalty term was employed. A learning theory analysis was presented, where emphasis was placed on its generalization bounds as well as the sparse recovery ability. The following three main results were presented:

  • By constructing a new hypothesis space, we presented a concentration estimate for the KENReg model. As a result, refined generalization bounds were obtained, which are optimal in an asymptotic sense. Moreover, the newly obtained bounds were shown to coincide with those obtained via algorithmic stability analysis.

  • The KENReg model is a sparse model, so we studied its sparse recovery ability by assuming that the regression function has a sparse representation with respect to the kernelized dictionary. Theoretically, we showed that under this assumption, the KENReg model can correctly recover the sparse pattern with overwhelming probability.

  • We discussed different properties of the KENReg model, including sparseness, stability, and generalization, with special emphasis on their interplay. Roughly speaking, we showed that for the KENReg model, algorithmic stability also ensures generalization, while the generalization bounds can be used to derive probabilistic bounds on the sparseness. Moreover, we also showed that the algorithmic stability and sparseness properties of the KENReg model are not mutually exclusive.

Appendix:  Technical Proofs

A.1  Proofs of Proposition 1 and Lemma 4

Proof of Proposition 1.
According to the definitions, it is easy to see that . For any , is equal to
formula
From the definition of , we know that . Consequently, the desired conclusion follows.

To bound and by applying lemma 5, we need to upper-bound and in terms of their infinity norm:

Lemma 4
Let the function be defined as in equation 2.1; then
formula
and
formula
Proof.
To proceed with the proof, we first introduce a univariate convex auxiliary function on , that is, for any fixed function h,
formula
According to the definition of in equation 2.1, we see that for any fixed h, where denotes the subgradient of at . Due to the arbitrariness of h, this further implies that
formula
A.1
where we use the fact that the subgradient of is , given by
formula
If for some t, the result is trivial. Otherwise, there holds ; by equation A.1, we have
formula
A.2
where this inequality follows from the Cauchy-Schwarz inequality and equation 2.3. Besides, it is clear from equation 2.1 that . Since is set in our setting, this, together with equation A.2, yields our first desired result.
To conclude the proof, we need to bound . Following from the boundedness assumption and the definition of , we have
formula
This completes the proof of lemma 4.

A.2  Proof of Proposition 6

We first bound the hypothesis error term , which can be split as follows,
formula
where
formula
Note that in the above expressions, is denoted under the choice that .
Bounding the quantities and involves concentration arguments with respect to random variables associated with the deterministic function . This can be dealt with by applying lemma 5. To this end, we first introduce the random variable on with values in . From lemma 4, we know that . In addition, we see from equation 2.1 that
formula
Therefore, lemma 5 tells us that with probability at least , there holds
formula
An argument similar to that of also tells us that with confidence , there holds
formula
On the other hand, it is easy to see that
formula
Combining the above estimates for , , and , we obtain the desired estimate of .
We now turn to bounding the term . Note that
formula
A.3
Denoting as the random variable on with values in , we then apply lemma 5 again to the random variable . In this case, and . Besides, it is easy to verify that
formula
Therefore, by applying lemma 5, we know that for any , with confidence at least , there holds
formula
and thus the desired upper bound for follows from equations 2.1 and A.3 and lemma 4.

A.3  Proof of Proposition 9

We now present a concentration estimate with respect to the random variable to bound the sample error term . Obviously, varies over a set of functions as the sample varies. In this case, we turn to investigating the function set
formula
To apply lemma 7 to the function set BR, it suffices to verify the conditions in lemma 7.
We first check the boundedness of functions in . In fact, for any , the definition of the projection operator and the boundedness of Y tell us that
formula
The variance of function-valued variables taken from can be bounded as follows,
formula
where the above inequality again follows from the boundedness of Y. We now evaluate the capacity of based on the capacity assumption. Observe that for any , with
formula
for any functions and , , . The supremum norm of the difference between g1 and g2 can be bounded as
formula
Consequently, it follows from the capacity assumption that
formula
which together with lemma 8 yields that the fixed point of lemma 7 will be of the order
formula
where c0 is a positive constant independent of m or p.
Now applying lemma 7 and substituting the corresponding coefficients and indices, we see that for any and , with confidence , there holds
formula
A.4
where is a positive constant depending only on p. To complete the proof of proposition 9, we need the upper bound of R. By the definition of in equation 1.4, we know that
formula
On the other hand, Hölder’s inequality tells us that . Consequently, we can take . By letting and substituting R with in inequality A.4, we come to the conclusion that for any , with confidence , there holds
formula
where C1 and are positive constants independent of m or . This completes the proof of proposition 9.

A.4  Proof of Proposition 10

We apply lemma 7 to prove proposition 10. To this end, we first define the function set
formula
and then we denote
formula
It is easy to verify that for any , there holds
formula
where c4 is a positive constant. That is, for any , there holds . Consequently, it is obvious that
formula
On the other hand, for any associated with , we have
formula
A.5
Noticing that with , and following from equation A.5 and the capacity assumption, we know that
formula
where . By letting and applying lemma 7 to the function , we see that for any , with confidence at least , there holds
formula
A.6
Substituting b, R, D, and K with the corresponding estimate given above into equation A.6 and recalling the following relation,
formula
we obtain the desired estimate in proposition 10 after simple computations. This completes the proof.

A.5  Proofs of Theorem 1 and Corollary 2

Proof of Theorem 1.
As a consequence of proposition 3, to bound , it remains to combine the upper bounds of , , , and . With simple computations, we obtain that for any , with confidence , can be upper-bounded by
formula
where C0 is a positive constant independent of m or .
Recalling the model assumption that , when , it is easy to verify that in the above estimate, the third term dominates the second one. For this case, simple computations show that with confidence , there holds
formula
where is a positive constant independent of m or . On the other hand, when , with confidence , we have
formula
with the choice of . To ensure nontrivial convergence rates, we let . This completes the proof of theorem 1.
Proof of Corollary 2.

When , the desired estimate in corollary 2 is a direct result of theorem 1 by taking .

Acknowledgments

We thank the editor and the reviewers for their insightful comments and helpful suggestions that helped to improve the quality of this letter. We would also like to thank Dr. Yunwen Lei for pointing out a mistake in an early version of this letter. The corresponding author is S.-G. L. The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013)/ERC AdG A-DATADRIVE-B (290923). This letter reflects our own views; the EU is not liable for any use that may be made of the contained information. Funding has also been received from Research Council KUL: GOA/10/09 MaNet, CoE PFV/10/002 (OPTEC), BIL12/11T; PhD/postdoc grants; Flemish Government: FWO: PhD/postdoc grants, projects: G.0377.12 (Structured systems), G.088114N (Tensor-Based Data Similarity); IWT: PhD/postdoc grants, projects: SBO POM (100031); iMinds Medical Information Technologies SBO 2014; Belgian Federal Science Policy Office: IUAP P7/19 (DYSCO, Dynamical systems, Control and Optimization, 2012–2017). S.-G.L is supported partially by the National Natural Science Foundation of China (no. 11301421), and Fundamental Research Funds for the Central Universities of China (grants JBK141111, 14TD0046, and JBK151134).

References

Bach, F. (2013). Sharp analysis of low-rank kernel matrix approximations. In Proceedings of the Twenty-Sixth Annual Conference on Computational Learning Theory. New York: Springer.

Bartlett, P. L., Bousquet, O., & Mendelson, S. (2005). Local Rademacher complexities. Annals of Statistics, 33(4), 1497–1537.

Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. Journal of Machine Learning Research, 2, 499–526.

Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.

Buldygin, V. V., & Kozachenko, Y. V. (2000). Metric characterization of random variables and random processes. Providence, RI: American Mathematical Society.

Candès, E. J., Romberg, J. K., & Tao, T. (2006). Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8), 1207–1223.

Candès, E. J., & Tao, T. (2005). Decoding by linear programming. IEEE Transactions on Information Theory, 51(12), 4203–4215.

Chen, H., Pan, Z., Li, L., & Tang, Y. (2013). Error analysis of coefficient-based regularized algorithm for density-level detection. Neural Computation, 25(4), 1107–1121.

Cucker, F., & Zhou, D.-X. (2007). Learning theory: An approximation theory viewpoint. Cambridge: Cambridge University Press.

Debruyne, M., Hubert, M., & Suykens, J. A. K. (2008). Model selection in kernel based regression using the influence function. Journal of Machine Learning Research, 9, 2377–2400.

Donoho, D. L., & Elad, M. (2003). Optimally sparse representation in general (nonorthogonal) dictionaries via minimization. Proceedings of the National Academy of Sciences, 100(5), 2197–2202.

Donoho, D. L., & Huo, X. (2001). Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory, 47(7), 2845–2862.

Drezet, P. M. L., & Harrison, R. R. (1998). Support vector machines for system identification. In Proceedings of UKACC International Conference on Control (vol. 1, pp. 688–692). London: Institution of Electrical Engineers.

Dudley, R. M. (1999). Uniform central limit theorems. Cambridge: Cambridge University Press.

Feng, Y., Huang, X., Shi, L., Yang, Y., & Suykens, J. A. K. (2015). Learning with the maximum correntropy criterion induced losses for regression. Journal of Machine Learning Research, 16, 993–1034.

Feng, Y., Yang, Y., Zhao, Y., Lv, S., & Suykens, J. A. K. (2014). Learning with kernelized elastic net regularization (Internal Report, ESAT-STADIUS). Leuven, Belgium: KU Leuven. ftp://ftp.esat.kuleuven.ac.be/stadius/yfeng/KENReg2014.pdf

Györfi, L., Kohler, M., Krzyżak, A., & Walk, H. (2002). A distribution-free theory of nonparametric regression. New York: Springer.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (2011). Robust statistics: The approach based on influence functions. Hoboken, NJ: Wiley.

Hang, H., & Steinwart, I. (2014). Fast learning from -mixing observations. Journal of Multivariate Analysis, 127, 184–199.

Honeine, P. (2014). Analyzing sparse dictionaries for online learning with kernels. arXiv:1409.6045.

Honeine, P. (2015). Approximation errors of online sparsification criteria. IEEE Transactions on Signal Processing, 63(37), 4700–4709.

Huang, J., & Zhang, T. (2010). The benefit of group sparsity. Annals of Statistics, 38(4), 1978–2004.

Huang, J., Zhang, T., & Metaxas, D. (2011). Learning with structured sparsity. Journal of Machine Learning Research, 12, 3371–3412.

Huber, P. J., & Ronchetti, E. (2009). Robust statistics. Hoboken, NJ: Wiley.

Koltchinskii, V. (2011). Oracle inequalities in empirical risk minimization and sparse recovery problems. New York: Springer.

Koltchinskii, V., Sakhanenko, L., & Cai, S. (2007). Integral curves of noisy vector fields and statistical problems in diffusion tensor imaging: Nonparametric kernel estimation and hypotheses testing. Annals of Statistics, 35(4), 1576–1607.