Abstract

Sparse regularization such as ℓ1 regularization is a powerful and widely used strategy for high-dimensional learning problems. The effectiveness of sparse regularization has been supported practically and theoretically by several studies. However, one of the biggest issues in sparse regularization is that its performance is quite sensitive to correlations between features. Ordinary ℓ1 regularization selects variables correlated with each other under weak regularization, which results in deterioration of not only its estimation error but also interpretability. In this letter, we propose a new regularization method, independently interpretable lasso (IILasso), for generalized linear models. Our proposed regularizer suppresses selecting correlated variables, so that each active variable affects the response independently in the model. Hence, we can interpret regression coefficients intuitively, and the performance is also improved by avoiding overfitting. We analyze the theoretical properties of the IILasso and show that the proposed method is advantageous for sign recovery and achieves an almost minimax optimal convergence rate. Synthetic and real data analyses also indicate the effectiveness of the IILasso.

1  Introduction

High-dimensional data appear in many fields such as biology, economy, and industry. A common approach for high-dimensional regression is the sparse regularization strategy, such as the Lasso (least absolute shrinkage and selection operator; Tibshirani, 1996). Since sparse regularization performs parameter estimation and variable selection simultaneously, it offers interpretable results by identifying informative variables, and it can effectively avoid overfitting by discarding redundant variables. Because of these properties, sparse regularization has had huge success in a wide range of data analyses in science and engineering. In addition, several theoretical studies have been developed to support the effectiveness of sparse regularization, and efficient optimization methods have also been proposed so that sparse learning is easily executed.

However, one of the biggest issues is that the performance of sparse regularization is guaranteed only under “small correlation” assumptions, that is, when the variables are not strongly correlated with each other. Indeed, typical theoretical supports rely on small-correlation assumptions such as the restricted eigenvalue condition (Bickel, Ritov, & Tsybakov, 2009; Bühlmann & Van De Geer, 2011). When the variables are highly correlated, redundant variables tend to be selected, and the estimation error deteriorates because of overfitting.

In addition, correlated variables hinder interpretation of models. A standard interpretation of a (generalized) linear model comes from regression coefficients and effects (products of regression coefficients and standard deviations). A single coefficient represents the degree to which an increase of its variable by one unit increases the prediction “when all other active variables remain fixed.” When an output model contains correlated variables, we must consider the influence of all other correlated active variables simultaneously, which can be intractable. A single effect represents the degree to which its variable varies the prediction on average. However, the value of the effect may be misleading in the presence of strong correlations because it is unrealistic to increase a variable by one standard deviation without changing other correlated variables. Similar properties were described as disadvantages of linear models in section 4.1 in Molnar (2018). This kind of interpretability was also referred to as “decomposability” in Lipton (2018), which means whether we can decompose a model into parts and interpret each component independently. In this sense, correlated variables degrade model decomposability. Therefore, it is beneficial to construct a model with uncorrelated variables for both estimation error and interpretability.

One line of research that takes correlations among variables into account is based on a strategy in which correlated variables are either all selected or not selected at all. Examples of this line are the elastic net (Zou & Hastie, 2005), pairwise elastic net (Lorbert, Eis, Kostina, Blei, & Ramadge, 2010), and trace Lasso (Grave, Obozinski, & Bach, 2011). These methods select not only important variables but also unimportant variables that are strongly correlated with important ones. Although these methods often give stable generalization error, this strategy makes it hard to interpret the model, because many correlated variables are incorporated into the output model and we can no longer distinguish which variables truly affect the response.

To facilitate interpretability of the model, another line of research, including ours, is based on the strategy in which uncorrelated variables are selected. The uncorrelated Lasso (Chen, Ding, Luo, & Xie, 2013) intends to construct a model with uncorrelated variables. However, it still tends to select negatively correlated variables; hence, the correlation problem is not resolved. The exclusive group Lasso (Kong, Fujimaki, Liu, Nie, & Ding, 2014) is also in this line, but it requires grouping correlated variables beforehand. The authors suggest that we group variables whose correlations are over a certain threshold. However, determination of the threshold is not a trivial problem, and it causes unstable results.

In this letter, we propose a new regularization method, the independently interpretable Lasso (IILasso), for linear models with quadratic loss and its extension to generalized linear models (GLMs) with general loss. Our method offers efficient variable selection, does not select negatively correlated variables, and is free from specific preprocessing such as grouping. Our proposed regularization formulation aggressively induces sparsity among the active variables and reduces the correlations among them. Hence, we can independently interpret the effects of active variables in the output model, and the estimation error is also much improved. To support the usefulness of our proposal, we make the following contributions:

  • We show a necessary and sufficient condition for the sign consistency of variable selection. We show that our method achieves sign consistency under a milder condition than the Lasso for correlated designs, as long as the true important variables are uncorrelated.

  • The convergence rate of the estimation error is analyzed. We show that our estimation error achieves almost the minimax optimal rate and has an advantage over the Lasso under certain conditions.

  • We propose a coordinate descent algorithm to find a local optimum of the objective function, which is guaranteed to converge to a stationary point. In addition, we show that every local optimal solution achieves the same statistical error rate as the global optimal solution and thus is almost minimax optimal.

  • We extend the IILasso from linear models to GLMs and derive its convergence rate.

The rest of this paper is as follows: In section 2, we propose a new regularization formulation for linear models and introduce its optimization method. We also state the relationship to existing work. In section 3, we show theoretical results on sign recovery and estimation error. In section 4, we extend the IILasso to GLMs and show its estimation error. In section 5, we illustrate both synthetic and real-world data experiments, including 10 microarray data sets. In section 6, we summarize the properties of IILasso.

1.1  Notations

Let v ∈ R^p. Diag(v) ∈ R^{p×p} is the diagonal matrix whose jth diagonal element is v_j. |v| is the element-wise absolute vector whose jth element is |v_j|. sgn(v) is the sign vector whose elements are 1 for v_j > 0, −1 for v_j < 0, and 0 for v_j = 0. supp(v) is the support set of v: supp(v) = {j ∈ {1,…,p} | v_j ≠ 0}. ‖v‖_q is the ℓ_q norm: ‖v‖_q = (Σ_{j=1}^p |v_j|^q)^{1/q}. Let M ∈ R^{n×p}. We use subscripts for the columns of M; for example, M_j denotes the jth column. ‖M‖_q is the operator norm (induced norm): ‖M‖_q = sup_{v∈R^p} ‖Mv‖_q/‖v‖_q. Specifically, ‖M‖_2 = sup_{v∈R^p} ‖Mv‖_2/‖v‖_2 is the spectral norm (the largest singular value of M), and ‖M‖_∞ = max_i Σ_j |M_{ij}| is the maximum absolute row sum norm. ‖M‖_max is the max norm: ‖M‖_max = max_{ij} |M_{ij}|. Let M ∈ R^{p×p}. M ⪰ O and M ≻ O denote a positive semidefinite matrix and a positive definite matrix: v^⊤Mv ≥ 0 for all v ∈ R^p and v^⊤Mv > 0 for all v ≠ 0, respectively. Let S be a subset of {1,…,p}. |S| is the number of elements in S. S^c is the complement of S: S^c = {1,…,p}∖S. v_S is the vector v restricted to the index set S. M_{S1S2} is the matrix whose row indices are restricted to S1 and column indices to S2.

2  Method

2.1  Problem Setting

Consider a problem of predicting a response yRn, given a design matrix XRn×p, assuming a linear model
y=Xβ+ɛ,
where β ∈ R^p is a regression coefficient vector and ɛ ∈ R^n is a noise vector. We assume that the variables are standardized such that Σ_{i=1}^n X_{ij} = 0, Σ_{i=1}^n X_{ij}²/n = 1, and Σ_{i=1}^n y_i = 0. The Lasso solves the following optimization problem,
min_β (1/(2n))‖y − Xβ‖₂² + λ‖β‖₁,
where λ>0 is a regularization parameter. Since the last penalty term induces the sparsity of estimated values, the output model contains few variables and hence is tractable, which is critically important for interpretability. However, as we described in section 1, when variables are highly correlated, the Lasso can select correlated variables especially for a small λ, resulting in worse interpretability and poor estimation error.

2.2  Proposed Method

In this letter, we propose a new regularization formulation as follows:
min_β L(β), L(β) := (1/(2n))‖y − Xβ‖₂² + λ‖β‖₁ + (λα/2)|β|^⊤R|β|,
(2.1)
where α > 0 is a regularization parameter for the new regularization term, and R ∈ R^{p×p} is a symmetric matrix whose component R_jk ≥ 0 is a monotonically increasing function of the absolute correlation r_jk = (1/n)|X_j^⊤X_k| for j ≠ k. Some concrete definitions of R are described later. The last term, (λα/2)|β|^⊤R|β| = (λα/2)Σ_{j=1}^p Σ_{k=1}^p R_jk|β_j||β_k|, is an additional term to the Lasso. Since R_jk represents the similarity between X_j and X_k, correlated variables are unlikely to be selected simultaneously in our formulation. In particular, when X_j and X_k are strongly correlated, the squared error barely changes as long as β_j + β_k is held constant, but the penalty R_jk|β_j||β_k| strongly induces either β_j = 0 or β_k = 0. On the contrary, if X_j and X_k are uncorrelated (i.e., R_jk is small), then the penalty for selecting both β_j and β_k is negligible, and the formulation reduces to the ordinary Lasso.

We can consider several variations of the similarity matrix R. One natural choice is R_jk = r_jk². In this case, R is positive semidefinite because the Hadamard product of positive semidefinite matrices is also positive semidefinite. Hence, problem 2.1 turns out to be convex, and the global optimal solution is easy to obtain. However, this choice may not reduce correlations enough. Another choice is R_jk = |r_jk|, which reduces correlations more strongly. Yet another effective choice is R_jk = |r_jk|/(1 − |r_jk|) for j ≠ k and R_jk = 0 for j = k. In this case, if the correlation between two variables becomes high (i.e., r_jk → 1), then the penalty term diverges to infinity, and the IILasso cannot select both of them simultaneously. We use the last choice in our numerical experiments because it is favorable in light of the theoretical analyses in sections 3 and 4.
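To make these choices concrete, the following minimal numpy sketch (the data and names are illustrative, not from the letter) builds the three similarity matrices from a standardized design and evaluates the penalty |β|^⊤R|β|/2 (up to the factor λα) on a correlated pair:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy standardized design: X1 and X2 strongly correlated, X3 independent.
n = 100
z = rng.standard_normal(n)
X = np.column_stack([z + 0.1 * rng.standard_normal(n),
                     z + 0.1 * rng.standard_normal(n),
                     rng.standard_normal(n)])
X = (X - X.mean(axis=0)) / X.std(axis=0)      # column-wise standardization

r = np.abs(X.T @ X) / n                        # absolute correlations r_jk

# Three choices of the similarity matrix R discussed in the text.
R_sq = r ** 2                                  # R_jk = r_jk^2 (convex case)
R_abs = r.copy()                               # R_jk = |r_jk|
R_ratio = r / (1.0 - np.minimum(r, 1.0 - 1e-12))
np.fill_diagonal(R_ratio, 0.0)                 # R_jj = 0 for the last choice

def penalty(beta, R):
    """Quadratic IILasso penalty term |beta|^T R |beta| / 2."""
    b = np.abs(beta)
    return 0.5 * b @ R @ b

# Splitting weight across the correlated pair costs far more than
# concentrating it on one variable: the exclusive effect.
both = penalty(np.array([0.5, 0.5, 0.0]), R_ratio)
one = penalty(np.array([1.0, 0.0, 0.0]), R_ratio)
print(both > one)   # True
```

With R_ratio, the penalty on a correlated pair diverges as r_jk → 1, so the two variables are essentially never selected together.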

Constraint regions corresponding to our regularization term indicate how our method provides an exclusive effect. Figure 1 illustrates the constraint regions of ‖β‖₁ + |β|^⊤R|β|/2 for the case p = 2. As the diagonal elements of R increase (from the top to the bottom panel), the contours become smooth at the coordinate axes. Because of this, the solution tends to select both variables if the two variables are strongly correlated. This is the grouping effect of the elastic net, as we describe later. But as the off-diagonal elements of R increase (from the left to the right panel), the contours become pointed at the coordinate axes, and the solution tends to be sparser. This is the exclusive effect for correlated variables. The ℓ_q (0 < q < 1) penalty, SCAD (Fan & Li, 2001), MCP (Zhang, 2010), and other methods also have contours pointed at the axes; however, the contours of the IILasso adapt to the correlations among variables, so our penalty achieves both sparse and stable solutions. (See appendix A for further comparison with various regularization contours.)

Figure 1:

Contours of IILasso regularization terms.


The exclusive effect among correlated variables offers great advantages. First, it produces uncorrelated models, which are easy to interpret, as we describe in section 2.3. Second, it is favorable for sign recovery and estimation error thanks to the assumption that the truth is well interpretable, as we describe in our theoretical and numerical analyses.

2.3  Interpretability

A linear model looks as if it could be perfectly interpreted, but that is not always the case. Here, we discuss difficulties of linear model interpretation and the advantages of our proposed method. We note that similar difficulties were referred to in Lipton (2018) and section 4.1 in Molnar (2018), but they did not offer any workaround.

One usually understands a linear model through its regression coefficients and effects. A single coefficient represents how strongly a unit change of a variable affects the prediction when all other active variables remain fixed. A single effect is calculated as a regression coefficient times the standard deviation of its variable, so it represents the degree to which the variable affects the prediction on average. Coefficients and effects coincide if the variables are standardized in advance.

Correlated variables in an output model make interpreting coefficients and effects harder. When the model does not contain correlated variables, the proviso “when all other active variables remain fixed” is reasonable. When the model contains correlated variables, we must consider the influence of all other correlated active variables simultaneously. Neglecting other active variables may lead to misinterpretation because increasing a variable by one standard deviation without changing the others can be counterfactual.

As an example, Figure 2 shows scatter plots of uncorrelated and correlated variables in output models. The two variables, X1 and X2, have mean 0 and standard deviation 1 in both cases, but the correlations between them differ. In the uncorrelated case, increasing X1 by 1 (standard deviation) without changing X2 is reasonable, because fixing X2 does not affect the range of X1. In the correlated case, fixing X2 strongly restricts the range of X1. Thus, increasing X1 by 1 without changing X2 falls into an unrealistic (counterfactual) setting; that is, it gets out of the support of the distribution. For this reason, the effect no longer represents the average influence on the response with the others fixed in the correlated case. This example implies that correlations among active variables hinder interpretation of linear models and lead to misinterpretation. These difficulties are mitigated in our proposed method by inducing low correlations.

Figure 2:

Standard deviations for uncorrelated variables (left) and correlated variables (right).


Moreover, even when we have a moderately interpretable correlated model, it might be possible to obtain an uncorrelated model with refined interpretability. Let us consider a simple example with p = 3. Suppose X = [X1, X2, X3] ∈ R^{n×3} is standardized, X1 and X2 are orthogonal, and X3 = (X1 + X2)/√2. Consider two models: (A) y = 2X1 + X2 and (B) y = X1 + √2 X3. Figure 3 illustrates the X_j's and y. Both models output the same prediction. However, model A seems more interpretable than model B. This is because in model A, the active variables (X1 and X2) are uncorrelated; hence, we can decompose the model (2X1 + X2) into its components (2X1 and X2) and interpret each coefficient as an independent contribution from each variable to the response. In model B, the active variables (X1 and X3) are correlated; hence, each coefficient is no longer an independent contribution. For example, imagine predicting traffic congestion using three variables, X1: Saturday, X2: Sunday, and X3: Weekend. We can interpret the model with “Saturday” and “Sunday” more intuitively than that with “Saturday” and “Weekend,” because “Weekend” includes “Saturday.” Of course, we can easily interpret the “Saturday” and “Weekend” model since their relation is clear. However, in general, it is not easy to unravel complicated interactions among correlated variables since they might involve confounders, mediators, or cause-and-effect relationships. The Lasso selects model B because the ℓ1 norm of its coefficients is smaller (1 + √2 < 3), while the IILasso tends to select model A because our regularization term excludes correlations. This perfectly collinear example is not merely an extreme case: it has been reported that there are many good solutions (solutions with almost the same error) around the Lasso solution in real applications (Hara & Maehara, 2017), so severe collinearity often occurs in high-dimensional data.
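This example can be checked numerically. The sketch below (assuming the √2 scaling that keeps X3 standardized) verifies that the two models produce identical predictions while model B has the smaller ℓ1 norm:

```python
import numpy as np

# p = 3 example: X1 and X2 orthogonal and standardized,
# X3 = (X1 + X2)/sqrt(2) so that X3 is standardized as well.
X1 = np.array([1.0, 1.0, -1.0, -1.0])
X2 = np.array([1.0, -1.0, 1.0, -1.0])          # orthogonal to X1
X3 = (X1 + X2) / np.sqrt(2)

pred_A = 2 * X1 + X2                           # model A: uncorrelated actives
pred_B = X1 + np.sqrt(2) * X3                  # model B: correlated actives

print(np.allclose(pred_A, pred_B))             # True: identical predictions
l1_A, l1_B = 2 + 1, 1 + np.sqrt(2)             # l1 norms: 3 vs ~2.41
print(l1_B < l1_A)                             # True: the Lasso prefers model B
```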

Figure 3:

Uncorrelated model (A) versus correlated model (B).


2.4  Optimization

To solve problem 2.1, we introduce the coordinate descent algorithm (CDA), which was originally proposed for the Lasso (the case α = 0 of the IILasso; Friedman, Hastie, Höfling, & Tibshirani, 2007; Friedman, Hastie, & Tibshirani, 2010). CDA is a simple and efficient algorithm, particularly for high-dimensional data. CDA proceeds simply: for each j ∈ {1,…,p}, we optimize the objective function with respect to β_j, with the remaining elements of β fixed at their most recently updated values. CDA is applicable even when the quadratic penalty is included.

To derive the update equation, when β_j ≠ 0, differentiating L(β) with respect to β_j yields
∂L(β)/∂β_j = −(1/n)X_j^⊤(y − X_{−j}β_{−j}) + (1 + λαR_jj)β_j + λ(1 + αR_{j,−j}|β_{−j}|)sgn(β_j),
where β_{−j} denotes β without the jth component, X_{−j} denotes X without the jth column, and R_{j,−j} denotes the jth row vector of R without the jth component. Solving ∂L(β)/∂β_j = 0, we obtain the update rule as
β_j ← (1/(1 + λαR_jj)) S( (1/n)X_j^⊤(y − X_{−j}β_{−j}), λ(1 + αR_{j,−j}|β_{−j}|) ),
(2.2)
where S(z,γ) is a soft thresholding function:
S(z, γ) := sgn(z)(|z| − γ)_+ = z − γ if z > 0 and γ < |z|; z + γ if z < 0 and γ < |z|; 0 if |z| ≤ γ.

The algorithm for solving the IILasso is described in algorithm 1. We search over several values of λ from λ_max to λ_min. β is initialized at each λ in one of several ways, such as all zeros, the solution at the previous λ, or the solution of the ordinary Lasso.

[Algorithm 1: Coordinate descent algorithm for the IILasso.]
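A minimal Python sketch of this coordinate descent scheme is given below. The soft-thresholding step and the division by 1 + λαR_jj follow update 2.2; the function and argument names are our own illustrative choices, and X is assumed standardized column-wise:

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator S(z, gamma) = sgn(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def iilasso_cd(X, y, lam, alpha, R, n_iter=200):
    """Cyclic coordinate descent for the IILasso objective (a sketch of
    algorithm 1), repeatedly applying update 2.2 to each coordinate."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual y - X_{-j} beta_{-j}.
            resid = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ resid / n
            # Threshold lambda * (1 + alpha * R_{j,-j} |beta_{-j}|).
            cross = R[j] @ np.abs(beta) - R[j, j] * abs(beta[j])
            beta[j] = soft_threshold(z, lam * (1.0 + alpha * cross)) \
                / (1.0 + lam * alpha * R[j, j])
    return beta
```

Each coordinate update decreases the objective monotonically (proposition 1); in practice, one sweeps a decreasing grid of λ values with warm starts, as described above.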

In algorithm 1, the objective function monotonically decreases at each update, and the estimate converges to a stationary point.

Proposition 1.

Let {β^t}_{t=0,1,…} be the sequence of β generated by algorithm 1. Then every cluster point of the subsequence {β^t}_{t ≡ p−1 (mod p)} is a stationary point.

Proof.

The proof is based on theorem 4.1 in Tseng (2001). First, we can see that the level set {β | L(β) ≤ L(β^0)} is compact and L(β) is continuous. Moreover, L(β) has a unique minimum in terms of β_j, given by equation 2.2. Therefore, every cluster point of {β^t}_{t ≡ p−1 (mod p)} is a coordinatewise minimum point. In addition, since L(β) can be seen as a locally quadratic function in any direction, L(β) is regular at the cluster point. Hence, theorem 4.1(c) in Tseng (2001) concludes the assertion.

2.5  Related Work

Some work has taken correlations among variables into account. We can divide it into two main directions: grouping selection and exclusive selection. The former groups correlated variables and selects them together. The latter excludes correlated variables and selects uncorrelated variables.

The representative method of grouping selection is the elastic net (Zou & Hastie, 2005). The objective function is constructed by a squared 2 penalty in addition to a 1 penalty:
min_β ‖y − Xβ‖₂² + λ₁‖β‖₁ + λ₂‖β‖₂².
Due to the squared ℓ2 penalty, if the variables X_j and X_k are strongly correlated (correlation tending to 1), then the estimated values of β_j and β_k get closer. If X_j and X_k are equal, then β_j and β_k must be equal. This behavior is called the grouping effect. The pairwise elastic net (Lorbert et al., 2010) and trace Lasso (Grave et al., 2011) follow the same direction of research and improve prediction accuracy. These methods tend to include many correlated variables, and each coefficient no longer indicates an independent contribution to the response. As a result, it is hard to understand which variables are truly active and how the variables affect the response.
Another direction of research is exclusive selection. The uncorrelated Lasso (ULasso; Chen et al., 2013) aims to reduce correlations among active variables. It optimizes the following objective function,
min_β ‖y − Xβ‖₂² + λ₁‖β‖₁ + λ₂β^⊤Rβ,
(2.3)
where R ∈ R^{p×p} with each element R_jk = ((1/n)X_j^⊤X_k)². Although the ULasso resembles our formulation, there is a critical difference: it uses β instead of |β| in the objective function, equation 2.3. We found that the ULasso does not necessarily select uncorrelated variables. For example, consider the case X = [X1, X2]. The last term of equation 2.3 is λ₂(β₁² + β₂² + 2R₁₂β₁β₂). If R₁₂ ≠ 0, minimizing R₁₂β₁β₂ drives β₁β₂ < 0 and |β₁β₂| larger. This implies that the ULasso tends to select correlated variables and give their coefficients opposite signs. In particular, if X1 and X2 are strongly correlated, the last term reduces to λ₂(β₁ + β₂)², which induces β₁ = −β₂. This is not a major problem when X1 and X2 are positively correlated, but it is a significant problem when X1 and X2 are negatively correlated. This problem is overcome in our method. Therefore, the difference between the ULasso and our IILasso is essential and crucial.
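The sign issue can be seen in a two-variable numerical check (the value of R12 below is illustrative): the ULasso cross term β^⊤Rβ rewards opposite signs, whereas the IILasso term |β|^⊤R|β| is sign-invariant:

```python
import numpy as np

R12 = 0.8                                  # hypothetical positive similarity
beta_same = np.array([1.0, 1.0])
beta_opp = np.array([1.0, -1.0])

def ulasso_term(b):
    # beta^T R beta = b1^2 + b2^2 + 2 R12 b1 b2
    return b[0] ** 2 + b[1] ** 2 + 2 * R12 * b[0] * b[1]

def iilasso_term(b):
    # |beta|^T R |beta| does not depend on the signs of b
    a = np.abs(b)
    return a[0] ** 2 + a[1] ** 2 + 2 * R12 * a[0] * a[1]

print(ulasso_term(beta_opp) < ulasso_term(beta_same))    # True
print(iilasso_term(beta_opp) == iilasso_term(beta_same)) # True
```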
The exclusive group Lasso (EGLasso; Kong et al., 2014) is also in the same direction of exclusive selection. It optimizes the following objective function:
min_β ‖y − Xβ‖₂² + λ₁‖β‖₁ + λ₂ Σ_{k=1}^K ‖β^(k)‖₁²,
(2.4)
where β^(k) consists of the variables of β within a group of predictors g_k ⊆ {1,…,p} and K is the number of groups. The last squared-ℓ1 group penalty term acts on exclusive variable selection. For example, when p = 3, g1 = {1, 2}, and g2 = {3}, the last term becomes λ₂((|β₁| + |β₂|)² + β₃²). This enforces sparsity within each group. The authors suggest putting highly correlated variables into the same group in order to select uncorrelated variables, using |r_ij| > θ with θ = 0.90 as a threshold. EGLasso can be seen as a special case of the IILasso. Let R be a group indicator matrix such that R_jk = 1 if X_j and X_k belong to the same group and R_jk = 0 otherwise. Then the IILasso reduces to EGLasso. For the above example, if we define the similarity matrix R = [1, 1, 0; 1, 1, 0; 0, 0, 1], then the last term of the IILasso objective function, 2.1, becomes (λα/2)(β₁² + 2|β₁||β₂| + β₂² + β₃²), which coincides with the last term of equation 2.4 up to constants. As we see, EGLasso needs to determine the threshold θ and group the variables beforehand, which causes severely unstable estimation.
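This reduction is easy to verify numerically for the p = 3 example above (the coefficient vector is arbitrary):

```python
import numpy as np

# Group-indicator similarity matrix for p = 3, g1 = {1, 2}, g2 = {3}.
R = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

def iilasso_quad(beta):
    b = np.abs(beta)
    return b @ R @ b                       # |beta|^T R |beta|

def eglasso_quad(beta):
    # EGLasso group term: (|b1| + |b2|)^2 + b3^2
    return (abs(beta[0]) + abs(beta[1])) ** 2 + beta[2] ** 2

beta = np.array([0.3, -0.7, 1.2])
print(np.isclose(iilasso_quad(beta), eglasso_quad(beta)))   # True
```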

3  Theoretical Properties

In this section, we show the sign recovery condition and the estimation error bound of the IILasso for linear models. These results explain the effectiveness of the IILasso in terms of interpretability and estimation error. Moreover, we show a property of local minima, which implies that every local optimal solution achieves the same statistical error rate as the global optimal solution. In this section, let β^* denote the true parameter. Let S denote the true active set: S = supp(β^*) = {j ∈ {1,…,p} | β_j^* ≠ 0}, and let s = |S|.

3.1  Sign Recovery

First, we give a necessary and sufficient condition of sign recovery. We define
U := (1/n)X_S^⊤X_S + λα Diag(sgn(β_S^*)) R_SS Diag(sgn(β_S^*)),
w := sgn(β_S^*) + α Diag(sgn(β_S^*)) R_SS Diag(sgn(β_S^*)) β_S^*.
Theorem 1.
Assume U is invertible. Then there exists a critical point β^ of equation 2.1 with correct sign recovery sgn(β^)=sgn(β*) if and only if the following two conditions hold:
sgn( β_S^* − U^{−1}( λw − (1/n)X_S^⊤ɛ ) ) = sgn(β_S^*),
(3.1)
| (1/n)X_{S^c}^⊤X_S U^{−1}( λw − (1/n)X_S^⊤ɛ ) + (1/n)X_{S^c}^⊤ɛ | ≤ λ( 1 + αR_{S^cS}| β_S^* − U^{−1}( λw − (1/n)X_S^⊤ɛ ) | ),
(3.2)

where both of these vector inequalities are taken elementwise.

The proof is given in section B.1. The sign recovery condition is derived from the standard optimality conditions. We note that α = 0 reduces the condition to the ordinary Lasso condition in Wainwright (2009). The invertibility assumption on U is not restrictive because it holds for almost all λ if X_S^⊤X_S is invertible, which is the same assumption as in standard analyses of the Lasso.

According to theorem 1, the IILasso has an advantage in sign recovery for correlated designs compared to the Lasso as long as the truth is “well interpretable,” that is, the true nonzero components are independent. This is because when R_SS is small enough, equation 3.1 is the same as for the Lasso, and equation 3.2 is easier to satisfy unless αR_{S^cS} = 0. In addition, the condition gets milder as R_{S^cS} gets larger.

Note that theorem 1 does not give the condition for a global optimal solution. However, if the global optimal solutions are finite in number, each must be a critical point. Hence, there exists a global optimal solution with correct sign recovery only if equations 3.1 and 3.2 hold.

Next, we give a sufficient condition for sign recovery. The following theorem clarifies the sign recovery conditions on the design matrix X and the true parameter β*, since it does not depend on the realization of the noise ɛ. We prepare some assumptions and definitions.

Assumption 1

(Subgaussian). The noise sequence {ɛ_i}_{i=1}^n is an i.i.d. subgaussian sequence with parameter σ > 0; that is, E[exp(tɛ_i)] ≤ exp(σ²t²/2) for all t ∈ R.

We give subgaussian properties in section B.2 for our theoretical analyses.

Assumption 2.
There exist constants D > 0, φ > 0, and ψ such that 1 + λ_nαψ > 0,
‖R_SS‖_max ≤ D, (1/n)X_S^⊤X_S ⪰ φI, and ((1/n)X_S^⊤X_S)^{−1/2} Diag(sgn(β_S^*)) R_SS Diag(sgn(β_S^*)) ((1/n)X_S^⊤X_S)^{−1/2} ⪰ ψI.
(3.3)

We note that ψ is not necessarily positive, but it must satisfy ψ > −1/(λ_nα). Condition 3.3 is satisfied if R_SS ⪰ ψ_R I, X_S^⊤X_S/n ⪯ φ_max I, ψ_R/φ ≥ ψ, and ψ_R/φ_max ≥ ψ.

Definition 1
(Generalized Incoherence Condition). We say that the generalized incoherence condition holds if there exists some incoherence parameter κ(0,1] such that
‖(1/n)X_j^⊤X_S U^{−1}‖₁ ≤ (1 + α‖R_{Sj}‖₁Δ_min)/(1 + αD‖β_S^*‖₁) · (1 − κ),
(3.4)
for j ∈ S^c, where β*_min := min_{j∈S}|β_j^*| and
Δ_min := β*_min − λ_n(1 + αD‖β_S^*‖₁)( ‖U^{−1}‖_∞ + 4σ/(φ(1 + λ_nαψ)) ) > 0.
(3.5)

The generalized incoherence condition (see definition 1) is a generalized notion of the incoherence condition (Fuchs, 2005; Tropp, 2006; Wainwright, 2009). In fact, when α = 0, the generalized incoherence condition reduces to the ordinary incoherence condition. The ordinary incoherence condition is quite restrictive and is much stronger than the restricted eigenvalue condition. For example, if X_S^⊤X_S/n = I, then the ordinary incoherence condition requires that max_{j∈S^c} Σ_{k∈S} |Σ_i X_{ij}X_{ik}/n| < 1. This condition is hard to satisfy if there are correlations between informative and uninformative variables.
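A quick synthetic check (coefficients are illustrative) shows how easily the ordinary condition fails: the quantity max_{j∈S^c} ‖(1/n)X_j^⊤X_S((1/n)X_S^⊤X_S)^{−1}‖₁ exceeds 1 as soon as an inactive column is correlated with the active ones:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 500
X_S = rng.standard_normal((n, 2))          # active columns (S = {1, 2})
# Inactive column correlated with both active columns, then standardized.
X_j = 0.7 * X_S[:, 0] + 0.7 * X_S[:, 1] + 0.2 * rng.standard_normal(n)
X_j = (X_j - X_j.mean()) / X_j.std()

G_SS = X_S.T @ X_S / n
incoherence = np.abs((X_j @ X_S / n) @ np.linalg.inv(G_SS)).sum()
print(incoherence > 1.0)   # True: the ordinary incoherence condition fails
```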

On the other hand, the generalized incoherence condition offers a great advantage because the right-hand side of equation 3.4 is not upper bounded by 1 when α0. Specifically, consider a case where the true model is well interpretable: RSS=O. The generalized incoherence condition reduces to
‖(1/n)X_j^⊤X_S ((1/n)X_S^⊤X_S)^{−1}‖₁ ≤ (1 + α‖R_{Sj}‖₁Δ_min)(1 − κ),
(3.6)
for j ∈ S^c, where
Δ_min := β*_min − λ_n( ‖((1/n)X_S^⊤X_S)^{−1}‖_∞ + 4σ/(φ(1 + λ_nαψ)) ) > 0.
This condition is milder than the ordinary incoherence condition since the right-hand side of equation 3.6 is larger than 1-κ.
Now, we fix any 0<δ<1 and define γn as
γ_n = γ_n(δ) := σ√( 2 log(2p/δ) / n ).
(3.7)
Then we obtain a sufficient condition for sign recovery.
Theorem 2.
Suppose that assumptions 1 and 2 and definition 1 (the generalized incoherence condition) are satisfied. Suppose that the regularization parameter satisfies
λ_n ≥ max{ 14σ, 2/κ, (λ_nαψ/(1 + λ_nαψ))·(2√2/κ) } γ_n,
(3.8)
with γn in equation 3.7. Then there exists a critical point β^ of equation 2.1 with correct sign recovery sgn(β^)=sgn(β*) with probability at least 1-2δ.

The proof is given in section B.3. We note that α = 0 recovers the result for the ordinary Lasso (Wainwright, 2009). When α ≠ 0, our method has a clear advantage for sign recovery as long as the true model is well interpretable: R_SS = O. This is because the generalized incoherence condition (definition 1) for the IILasso is milder than that for the ordinary Lasso, as described above, and assumption 2 and equation 3.8 reduce to the ordinary Lasso conditions. In addition, Meinshausen and Bühlmann (2006) and Zhao and Yu (2006) gave various designs of X satisfying the incoherence condition. This implies that the generalized incoherence condition also holds in various situations as long as R_SS is small enough.

3.2  Estimation Error

We give an estimation error bound of approximately global minimum solutions of our method.

Definition 2
(Generalized Restricted Eigenvalue Condition for Linear Models (GRE(S,C,C'))). Let a set of vectors B(S,C,C') be
B(S,C,C′) := { v ∈ R^p : ‖v_{S^c}‖₁ + C′(α/2)|v_{S^c}|^⊤R_{S^cS^c}|v_{S^c}| + C′α|v_{S^c}|^⊤R_{S^cS}|v_S + β_S^*| ≤ C‖v_S‖₁ }.
(3.9)
We say that the generalized restricted eigenvalue condition holds if we have φGRE>0 where
φ_GRE = φ_GRE(S,C,C′) := inf_{v∈B(S,C,C′)} v^⊤((1/n)X^⊤X)v / ‖v‖₂².
(3.10)

The generalized restricted eigenvalue (GRE) condition in definition 2 is a generalized notion of the restricted eigenvalue (RE) condition (Bickel et al., 2009; Bühlmann & Van De Geer, 2011) tailored to our regularization. One can see that if α = 0 or C′ = 0, then φ_GRE(S,C,C′) reduces to the ordinary restricted eigenvalue (Bickel et al., 2009; Raskutti, Wainwright, & Yu, 2010) used in the analysis of the Lasso.

The GRE condition is not restrictive. Since there are additional terms related to |β|^⊤R|β|, the set B(S,C,C′) is smaller than that for the ordinary restricted eigenvalue if the same C is used. In particular, the term |β_{S^c}|^⊤R_{S^cS}|β_S + β_S^*| strongly restricts the amplitude of coefficients for unimportant variables (especially the variables with large R_{S^cS}). Hence, assumption GRE(S,C,C′) is milder than assumption RE(S,C). Because the RE(S,C) condition is satisfied by the general class of gaussian designs (Raskutti et al., 2010), the GRE condition also holds for the general class of gaussian designs.

Then we obtain the convergence rate of an approximately global minimum solution of IILasso as follows:

Theorem 3.
Suppose that assumption 1 is satisfied. Suppose R satisfies
‖R_SS‖_max ≤ D,
(3.11)
for some positive constant D, and the estimator β^ of equation 2.1 is approximately minimizing the objective function so that
L(β̂) ≤ L(β^*).
(3.12)
Suppose that the regularization parameters satisfy
3γ_n ≤ λ_n and α ≤ 1/(4D‖β_S^*‖₁),
with γ_n in equation 3.7. Suppose that assumption GRE(S,3,3/2) (see definition 2) is satisfied. Then it holds that
‖β̂ − β^*‖₂² ≤ 216 s λ_n² / φ_GRE²,
(3.13)
with probability at least 1-δ.
The proof is given in section B.4. The obtained convergence rate is roughly evaluated as
‖β̂ − β^*‖₂² = O_p( s log(p)/n )
by taking λ_n = O_p(√(log p / n)), which is almost the minimax optimal rate (Raskutti, Wainwright, & Yu, 2011).
We compare the convergence rate of the Lasso and IILasso. For comparison, we have a somewhat bit stricter bound,
‖β̂ − β^*‖₂² ≤ 8( 3 + 5αD‖β_S^*‖₁ + (3/4)(αD‖β_S^*‖₁)² )² s λ_n² / φ_GRE²,
under assumption GRE(S,C,3/2) (see definition 2), where C = 2 + 15αD‖β_S^*‖₁/4 + 9(αD‖β_S^*‖₁)²/16, with high probability. (The proof is also given in section B.5.) We can easily see that when α = 0, the convergence rate analysis reduces to the standard one for the ordinary Lasso (Bickel et al., 2009; Bühlmann & Van De Geer, 2011). We note that our theorem includes the additional assumption, equation 3.11, compared to the ordinary Lasso theorem and relaxes the restricted eigenvalue condition. This indicates that our theorem holds under a milder condition in terms of correlations among all variables, in exchange for the additional condition on the true active variables. In well-interpretable cases where the true nonzero components S are independent (i.e., R_SS = O), the error bounds for the Lasso and IILasso are the same except for the term φ_GRE. Since R_{S^cS} and R_{S^cS^c} shrink the set of vectors B(S,C,C′) in definition 2, the φ_GRE of the IILasso is larger than that of the Lasso. Therefore, our approximately global minimum solution has a better error bound than ordinary ℓ1 regularization in this situation. In addition, our method is more advantageous when the variables are correlated between informative and noninformative variables.

3.3  Local Optimality

The objective function of the IILasso is not necessarily convex, in exchange for the better statistical properties observed above. Our next theoretical interest is the local optimality of our optimization algorithm, algorithm 1. Since our optimization method is greedy, there is no guarantee that it reaches the global optimum. However, as we see in this section, any local solution achieves almost the same estimation error as the global optimum satisfying equation 3.12. For theoretical simplicity, we assume the following stronger condition.

Assumption 3.
There exist $\phi > 0$ and $q_n \le p$ such that for all $V \subseteq \{1,\ldots,p\}$ satisfying $|V| \le q_n$ and $V \cap S = \emptyset$, it holds that
$$\frac{1}{n}X_{S\cup V}^\top X_{S\cup V} \succeq \phi I.$$
Moreover, there exists $\bar{D}$ such that the maximum absolute eigenvalue of $R$ restricted to $S \cup V$ is bounded as
$$\sup_u \left|u^\top R_{S\cup V,S\cup V}\,u\right| \le \bar{D}\|u\|_2^2.$$

Then we obtain the convergence rate of local minimum solutions of IILasso as follows:

Theorem 4.
Suppose assumptions 1 and 3 are satisfied. Suppose that $\hat{\beta}$ is a local optimal solution of equation 2.1 satisfying $|\mathrm{supp}(\hat{\beta})| \le |S|+q_n$. Let the regularization parameters satisfy
$$\gamma_n < \lambda_n \quad \text{and} \quad \alpha < \min\left\{\frac{\sqrt{s}}{2\bar{D}\|\beta^*\|_2},\ \frac{\phi}{2\bar{D}\lambda_n}\right\},$$
with $\gamma_n$ as in equation 3.7. Then $\hat{\beta}$ satisfies
$$\|\hat{\beta}-\beta^*\|_2^2 \le \frac{25s\lambda_n^2}{\phi^2},$$
with probability at least $1-\delta$.

The proof is given in section B.6. Theorem 10 indicates that every local optimal solution achieves the same convergence rate as the ideal optimal solution. In other words, there is no local optimal solution with sparsity level at most $|S|+q_n$ that is far from the true vector $\beta^*$.

4  Extensions to Generalized Linear Models

4.1  Method for GLMs

The idea of the IILasso is not restricted to linear models with quadratic loss. In this section, we extend the method and theory to a more general setting, in particular to generalized linear models (GLMs) with a general loss, without assuming that the true model is contained in the GLM class.

Consider independent and identically distributed data $\{(X_i,Y_i)\}_{i=1}^n$ with $(X_i,Y_i) \in \mathcal{X}\times\mathcal{Y}$, where $X_i$ is a covariate in some space $\mathcal{X}$ and $Y_i$ is a response variable in some space $\mathcal{Y} \subseteq \mathbb{R}$. Note that we suppose the $X_i$'s are deterministic variables. Let $\bar{\mathcal{F}}$ be the space of all measurable functions. Let $\rho_f : \mathcal{X}\times\mathcal{Y} \to \mathbb{R}$ be a general loss function for a model $f \in \bar{\mathcal{F}}$, where we denote $\rho_f(x,y) := \rho(f(x),y)$. Typical examples of loss functions include the quadratic loss $\rho_f(x,y) = (y-f(x))^2$ for $y \in \mathbb{R}$ and the logistic loss $\rho_f(x,y) = -yf(x)+\log(1+\exp(f(x)))$ for $y \in \{0,1\}$. Consider a GLM subspace:
$$\mathcal{F} := \left\{f_\beta(x) := \sum_{j=1}^p \beta_j\psi_j(x) : \beta \in \mathbb{R}^p,\ x \in \mathcal{X}\right\} \subseteq \bar{\mathcal{F}}.$$
(4.1)
Here, the map $x \mapsto (\psi_1(x),\ldots,\psi_p(x)) \in \mathbb{R}^p$ is an arbitrary measurable feature map from the space $\mathcal{X}$ to $\mathbb{R}^p$. We define the population mean risk,
$$P\rho_f := \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{Y_i|X_i}\left[\rho_f(X_i,Y_i)\,\middle|\,X_i\right],$$
(4.2)
and the empirical average risk,
$$P_n\rho_f := \frac{1}{n}\sum_{i=1}^n \rho_f(X_i,Y_i).$$
(4.3)
We further define the target as the minimizer of the theoretical mean risk:
$$f^0 := \operatorname*{argmin}_{f\in\bar{\mathcal{F}}}\ P\rho_f.$$
(4.4)
Note that we allow model misspecification; in other words, it may hold that $f^0 \notin \mathcal{F}$. In this setting, our interest is to estimate a GLM close to the target, equation 4.4. The IILasso for GLMs is formulated as
$$\hat{\beta} = \operatorname*{argmin}_{\beta}\left\{P_n\rho_{f_\beta}+\lambda\left(\|\beta\|_1+\frac{\alpha}{2}|\beta|^\top R|\beta|\right)\right\} =: \operatorname*{argmin}_{\beta}\ L(\beta),$$
(4.5)
where $\lambda > 0$ and $\alpha > 0$ are regularization parameters and $R \in \mathbb{R}^{p\times p}$ is a symmetric matrix whose component $R_{jk} \ge 0$ is typically a monotonically increasing function of the absolute correlation $r_{jk} = \left|\sum_{i=1}^n \psi_j(X_i)\psi_k(X_i)\right|/n$ for $j \ne k$, with $R_{jk} = 0$ for $j = k$.
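To make the regularizer concrete, the following minimal Python sketch (ours, not the authors' implementation) evaluates the IILasso penalty term for a given feature map, using the choice $R_{jk} = |r_{jk}|/(1-|r_{jk}|)$ adopted later in section 5.1. It assumes the feature columns are standardized so that $\sum_i \psi_j(X_i)^2/n = 1$.

```python
import numpy as np

def iilasso_penalty(beta, Psi, lam, alpha):
    """Evaluate lambda * (||beta||_1 + alpha/2 * |beta|' R |beta|)."""
    n = Psi.shape[0]
    r = np.abs(Psi.T @ Psi) / n               # absolute empirical correlations
    np.fill_diagonal(r, 0.0)                  # R_jj = 0 by definition
    R = r / (1.0 - np.minimum(r, 1 - 1e-12))  # monotone transform of |r_jk|
    b = np.abs(beta)
    return lam * (b.sum() + 0.5 * alpha * b @ R @ b)

rng = np.random.default_rng(0)
Psi = rng.standard_normal((100, 5))
Psi /= np.sqrt((Psi ** 2).mean(axis=0))       # standardize columns
beta = np.array([1.0, -0.5, 0.0, 0.0, 0.2])
print(iilasso_penalty(beta, Psi, lam=0.1, alpha=1.0))
```

Because $R$ weights pairs of correlated features, the quadratic term grows when two nonzero coefficients correspond to correlated columns, which is exactly the behavior the regularizer is designed to discourage.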

4.2  Estimation Error for GLMs

We introduce some additional notation. Let $\|f\|_\infty$ be the $L_\infty$-norm, that is, $\|f\|_\infty = \sup_z|f(z)|$. Let $\|f\|$ be the $L_2(Q_n)$-norm with $Q_n$ the empirical measure of $\{z_i\}_{i=1}^n$, that is, $\|f\| := \|f\|_n := \sqrt{\sum_{i=1}^n f^2(z_i)/n}$. We let $S$ denote an arbitrary variable index set and $S_\beta := \mathrm{supp}(\beta)$. Note that we do not use $S$ for the true active variables in this section because it may hold that $f^0 \notin \mathcal{F}$. We define the excess risk as
$$\mathcal{E}(f) := P(\rho_f-\rho_{f^0}).$$

Next we prepare some key assumptions and definitions.

Assumption 4.
We assume the loss function is Lipschitz with constant $L$, that is,
$$|\rho(a,y)-\rho(a',y)| \le L|a-a'|, \quad \forall a,a',y.$$
Definition 3
(Quadratic Margin Condition). We say that the quadratic margin condition holds with $\kappa$ for $\mathcal{F}_\eta := \{f \in \mathcal{F} : \|f-f^0\|_\infty \le \eta\}$ if we have
$$\mathcal{E}(f) \ge \kappa\|f-f^0\|^2, \quad \forall f \in \mathcal{F}_\eta.$$
Definition 4
(Generalized Restricted Eigenvalue Condition for GLMs (GRE(S, C, C'))). Let a set of vectors $B(S,C,C')$ be
$$B(S,C,C') := \left\{v : \|v_{S^c}\|_1+\frac{C'\alpha}{2}|v_{S^c}|^\top R_{S^cS^c}|v_{S^c}|+\alpha C'|v_{S^c}|^\top R_{S^cS}\left|v_S+\beta^*(S)_S\right| \le C\|v_S\|_1\right\},$$
where $\beta^*(S) := \operatorname*{argmin}_{\beta:S_\beta=S}\mathcal{E}(f_\beta)$. We say that the generalized restricted eigenvalue condition for GLMs holds with a set of vectors $B(S,C,C')$ if we have $\phi_{GRE} > 0$, where
$$\phi_{GRE} = \phi_{GRE}(S,C,C') := \inf_{v\in B(S,C,C')}\frac{\|f_v\|^2}{\|v\|_2^2}.$$
Definition 5
(Oracle). Let $\mathcal{S}$ be a collection of variable index sets. We define the oracle $\beta^*$ as
$$\beta^* = \beta^*(\mathcal{S}) := \operatorname*{argmin}_{\beta:S_\beta\in\mathcal{S}}\left\{3\mathcal{E}(f_\beta)+\frac{9\lambda_n^2|S_\beta|}{\kappa\phi^*_{GRE}}\right\},$$
(4.6)
where $\phi^*_{GRE} = \phi^*_{GRE}(\mathcal{S}) := \inf_{S\in\mathcal{S}}\phi_{GRE}(S,C,C')$.

These assumptions and definitions basically follow Bühlmann and Van De Geer (2011) and are tailored to our proposed method. The quadratic margin condition (see definition 12) means that the excess risk of every model $f$ near the target $f^0$ is bounded below by a quadratic function of $\|f-f^0\|$. The GRE condition for GLMs (see definition 13) is a natural extension of the GRE condition for linear models. The oracle $\beta^*$ (see definition 14) represents an ideal solution because it minimizes the sum of the excess risk and a penalty proportional to the number of selected variables.

Now we derive an estimation error bound for GLMs:

Theorem 5.
Let assumption 4 with a constant $L$ be satisfied. Suppose that $\|\psi_j\|_\infty \le 1$ for all $j$. Let $\mathcal{S}$ be an arbitrary collection of variable index sets. For every variable index set $S \in \mathcal{S}$, suppose that assumption GRE(S, 3, 1) (see definition 13) is satisfied. Define the oracle $\beta^*$ as in equation 4.6 (see definition 14) for $\mathcal{S}$ and some constant $\kappa > 0$. Fix any $0 < \delta < 1$. Define
$$\gamma_n = \gamma_n(\delta) := 2L\left(4\sqrt{\frac{2\log 2p}{n}}+\sqrt{\frac{2\log(1/\delta)}{n}}\right), \quad M^* := \frac{1}{\gamma_n}\left(3\mathcal{E}(f_{\beta^*})+\frac{9\lambda_n^2|S_{\beta^*}|}{\kappa\phi^*_{GRE}}\right).$$
(4.7)
Suppose the quadratic margin condition (see definition 12) holds with $\kappa$ for $\mathcal{F}_\eta$. In addition, assume $f_\beta \in \mathcal{F}_\eta$ for all $\|\beta-\beta^*\|_1 \le M^*$, as well as $f_{\beta^*} \in \mathcal{F}_\eta$. Let $\hat{\beta}$ be an approximate minimizer of the IILasso such that $L(\hat{\beta}) \le L(\beta)$ for all $\|\beta-\beta^*\|_1 \le M^*$. Let the regularization parameters satisfy
$$\alpha \le \frac{1}{12D\|\beta^*\|_1} \quad \text{and} \quad \frac{20\gamma_n}{3} \le \lambda_n,$$
where $D := \|R_{S^*S^*}\|_{\max}$ and $S^* := \mathrm{supp}(\beta^*)$. Then we have
$$\|\hat{\beta}-\beta^*\|_2^2 \le \frac{22\,\mathcal{E}(f_{\beta^*})}{\kappa\phi^*_{GRE}}+\frac{60\lambda_n^2|S^*|}{\kappa^2\phi^{*2}_{GRE}},$$
with probability at least $1-\delta$.
The proof is given in section B.7. The convergence rate obtained in theorem 15 is roughly evaluated as
$$\|\hat{\beta}-\beta^*\|_2^2 = O_P\!\left(\frac{|S^*|\log p}{n}\right),$$
under $\mathcal{E}(f_{\beta^*}) = o_p(1)$, $1/\phi^*_{GRE} = O_p(1)$, and $\lambda_n = O_p(\sqrt{\log p/n})$, which is almost the minimax optimal rate (Raskutti et al., 2011).

If $D = 0$, then $\phi^*_{GRE}$ gets larger as $R_{S^{*c}S^*}$ and $R_{S^{*c}S^{*c}}$ become large. Hence, the IILasso for a general loss with quadratic margin is preferable when the true active variables are uncorrelated and the other variables are strongly correlated.

The collection of variable index sets S is quite arbitrary. Taking S as specific definitions yields the following corollaries:

Corollary 1.
Define the best linear approximation,
$$\beta^0_{GLM} := \operatorname*{argmin}_{\beta}\ P\rho_{f_\beta}.$$
Take $\mathcal{S} = \{S^0_{GLM}\} := \{\mathrm{supp}(\beta^0_{GLM})\}$. Then, under the same assumptions as in theorem 15, we have $\beta^* = \beta^0_{GLM}$ and
$$\|\hat{\beta}-\beta^0_{GLM}\|_2^2 \le \frac{22\,\mathcal{E}(f_{\beta^0_{GLM}})}{\kappa\phi_{GRE}(S^0_{GLM},3,1)}+\frac{60\lambda_n^2|S^0_{GLM}|}{\kappa^2\phi_{GRE}(S^0_{GLM},3,1)^2}.$$
Proof.
$$\beta^* = \operatorname*{argmin}_{\beta:S_\beta=S^0_{GLM}}\left\{3\mathcal{E}(f_\beta)+\frac{9\lambda_n^2|S^0_{GLM}|}{\kappa\phi^*_{GRE}}\right\} = \beta^0_{GLM},$$
and $\phi^*_{GRE} = \phi_{GRE}(S^0_{GLM},3,1)$ concludes the assertion.
Corollary 2.
Suppose the true model is linear:
$$f^0 = f_{\beta^0} \quad \text{and} \quad \mathcal{E}(f_{\beta^0}) = 0.$$
Take $\mathcal{S} = \{S^0\} := \{\mathrm{supp}(\beta^0)\}$. Then, under the same assumptions as in theorem 15, we have $\beta^* = \beta^0$ and
$$\|\hat{\beta}-\beta^0\|_2^2 \le \frac{60\lambda_n^2|S^0|}{\kappa^2\phi_{GRE}(S^0,3,1)^2}.$$
Proof.
$$\beta^* = \operatorname*{argmin}_{\beta:S_\beta=S^0}\left\{3\mathcal{E}(f_\beta)+\frac{9\lambda_n^2|S^0|}{\kappa\phi^*_{GRE}}\right\} = \beta^0,$$
and $\phi^*_{GRE} = \phi_{GRE}(S^0,3,1)$ concludes the assertion.

Corollary 16 shows that the error between the IILasso estimate and the best linear approximation is bounded by the excess risk of the best linear approximation, the number of active variables in the best linear approximation, and constants depending on $n$ and $p$. In addition, corollary 17 shows that the error between the IILasso estimate and the true linear model (if it exists) is bounded by the number of true active variables and constants depending on $n$ and $p$.

4.3  Estimation Error for Logistic Models

We examine logistic regression models as an illustration. Consider data $\{(X_i,Y_i)\}_{i=1}^n$ with $(X_i,Y_i) \in \mathbb{R}^p\times\{0,1\}$. Define $\pi(x) := P(Y=1|X=x)$. The logistic model is
$$\log\frac{\hat{\pi}(x)}{1-\hat{\pi}(x)} = f(x) = \mu+\sum_{j=1}^p \beta_jx_j,$$
and the logistic loss is
$$\rho_f(x,y) = \rho(f(x),y) := -yf(x)+\log(1+\exp(f(x))).$$
We take the target as
$$f^0(x) := \log\frac{\pi(x)}{1-\pi(x)}$$
because it minimizes EY|X[ρf(X,Y)|X=x]. Then we obtain the following estimation error bound:
Corollary 3.
Suppose that for some constant $0 < \epsilon_0 < 1$,
$$\epsilon_0 < \pi(x) < 1-\epsilon_0,$$
for all $x \in \mathcal{X}$. Suppose that $\|X_j\|_\infty \le 1$ for all $j$. Suppose that assumption GRE(S, 3, 1) (see definition 13) is satisfied with a GRE constant $\phi^*_{GRE}$. Define the oracle $\beta^*$ as in equation 4.6 (see definition 14). Fix any $0 < \delta < 1$ and $\eta > 0$. Then define
$$\gamma_n := \gamma_n(\delta) := 2\left(4\sqrt{\frac{2\log 2p}{n}}+\sqrt{\frac{2\log(1/\delta)}{n}}\right), \quad M^* := \frac{1}{\gamma_n}\left(3\mathcal{E}(f_{\beta^*})+\frac{9(\exp(\eta)+\epsilon_0)^2\lambda_n^2|S_{\beta^*}|}{\epsilon_0^2\phi^*_{GRE}}\right).$$
Let the regularization parameters satisfy
$$\alpha \le \frac{1}{12D\|\beta^*\|_1} \quad \text{and} \quad \frac{20\gamma_n}{3} \le \lambda_n \le T\gamma_n$$
for some constant $T$. In addition, suppose that
$$\frac{T^2(\exp(\eta)+\epsilon_0)^2\gamma_n|S_{\beta^*}|}{\epsilon_0^2\phi^*_{GRE}} \le \frac{\eta}{27}, \quad \|f^*-f^0\|_\infty \le \frac{\eta}{3}, \quad \text{and} \quad \mathcal{E}(f^*) \le \frac{\gamma_n\eta}{9}.$$
Let $\hat{\beta}$ be an approximate minimizer of the IILasso such that $L(\hat{\beta}) \le L(\beta)$ for all $\|\beta-\beta^*\|_1 \le M^*$. Then we have
$$\|\hat{\beta}-\beta^*\|_2^2 \le \frac{22(\exp(\eta)+\epsilon_0)^2\mathcal{E}(f_{\beta^*})}{\epsilon_0^2\phi^*_{GRE}}+\frac{60(\exp(\eta)+\epsilon_0)^4\lambda_n^2|S_{\beta^*}|}{\epsilon_0^4\phi^{*2}_{GRE}},$$
with probability at least $1-\delta$.

The proof is given in section B.8.

5  Numerical Experiments

5.1  Synthetic Data Experiments for Linear Models

First, we validated the effectiveness of the IILasso for linear models. We considered the case in which the true active variables are uncorrelated with each other while many inactive variables are strongly correlated with the active variables. If all active and inactive variables are uncorrelated, it is easy to determine which variables are active. But if the inactive variables are strongly correlated with the active ones, distinguishing the active variable is hard. We simulated such a situation.

We generated a design matrix $X \in \mathbb{R}^{n\times p}$ from the gaussian distribution $N(0,\Sigma)$, where $\Sigma = \mathrm{Diag}(\Sigma^{(1)},\ldots,\Sigma^{(b)})$ was a block diagonal matrix whose blocks $\Sigma^{(l)} \in \mathbb{R}^{q\times q}$ had $\Sigma^{(l)}_{jk} = 0.95$ for $j \ne k$ and $\Sigma^{(l)}_{jk} = 1$ for $j = k$. We set $n = 50$, $p = 100$, $b = 10$, and $q = 10$. Thus, there were 10 groups, each containing 10 strongly correlated variables. Next, we generated a response $y$ from the true active variables $X_1, X_{11}, X_{21}, \ldots, X_{91}$ as $y = 10X_1-9X_{11}+8X_{21}-7X_{31}+\cdots+2X_{81}-X_{91}+\epsilon$, with standard gaussian noise $\epsilon$, so that each group included one active variable. We independently generated three data sets for training, validation, and test following this procedure.
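The data-generating process above can be sketched as follows (our reproduction under the stated parameters, not the authors' code):

```python
import numpy as np

def make_data(n=50, p=100, b=10, q=10, rho=0.95, seed=0):
    """Block-diagonal gaussian design: b blocks of q variables with
    within-block correlation rho, and one active variable per block."""
    rng = np.random.default_rng(seed)
    block = np.full((q, q), rho)
    np.fill_diagonal(block, 1.0)
    L = np.linalg.cholesky(block)
    # Sigma = Diag(block, ..., block): draw each block independently
    X = np.hstack([rng.standard_normal((n, q)) @ L.T for _ in range(b)])
    beta = np.zeros(p)
    beta[np.arange(0, p, q)] = [10, -9, 8, -7, 6, -5, 4, -3, 2, -1]
    y = X @ beta + rng.standard_normal(n)
    return X, y, beta

X, y, beta = make_data()
r = np.corrcoef(X, rowvar=False)
# within-block correlations are high, between-block roughly zero
print(r[0, 1].round(2), r[0, 10].round(2))
```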

Then we compared the performance of the Lasso, SCAD (Fan & Li, 2001), MCP (Zhang, 2010), EGLasso (Kong et al., 2014), and IILasso. Evaluation criteria were prediction error (mean squared error), estimation error ($\ell_2$ norm between the true and estimated coefficients), and model size (the number of nonzero coefficients). The SCAD and MCP are representative folded concave penalties, so their objective functions are nonconvex, as is ours. They have a tuning parameter $\gamma$; we set $\gamma = 2.5, 3.7, 10, 20, 100, 1000$ for the SCAD and $\gamma = 1.5, 3, 10, 20, 100, 1000$ for the MCP. The EGLasso has a parameter $\lambda_2$; we set $\lambda_2/\lambda_1 = 0.01, 0.1, 1, 10, 100, 1000$. For the EGLasso, we provided the true group information beforehand. We used the R package ncvreg (Breheny & Huang, 2011) for the Lasso, SCAD, and MCP, and sparsenet (Mazumder, Friedman, & Hastie, 2011) for the MCP; one can solve the MCP with either ncvreg or sparsenet, which differ in their optimization algorithms and initialization. For the IILasso, we defined $R_{jk} = |r_{jk}|/(1-|r_{jk}|)$ for $j \ne k$ and $R_{jk} = 0$ for $j = k$. Hence, $R_{SS}$ takes small values if the active variables are independent, and $R_{SS^c}$ and $R_{S^cS^c}$ take large values if the inactive variables are strongly correlated with other variables, which is favorable given the theoretical results obtained in section 3. We set $\alpha = 0.01, 0.1, 1, 10, 100, 1000$. We tuned these parameters on the validation data and computed errors on the test data. We repeated this procedure 500 times and report the averages and standard errors.

Table 1 shows the performances with their standard error in parentheses. The IILasso achieved the best prediction and estimation among all of them. This was because our penalty term excluded the correlations and avoided overfitting. Moreover, the model size of the IILasso was much smaller than those of the Lasso and EGLasso and comparable to MCP. As a whole, the IILasso could estimate the most accurate model with a few variables.

Table 1:
Results of Synthetic Data for Linear Models.
Method            Prediction Error  Estimation Error  Model Size
Lasso (ncvreg)    2.67 (0.05)       4.44 (0.06)       34.1 (0.46)
SCAD (ncvreg)     1.52 (0.02)       1.79 (0.04)       14.6 (0.23)
MCP (ncvreg)      1.53 (0.02)       1.79 (0.04)       14.6 (0.24)
MCP (sparsenet)   2.41 (0.11)       3.15 (0.13)       13.4 (0.28)
EGLasso           2.60 (0.04)       4.36 (0.05)       33.3 (0.32)
IILasso (ours)    1.45 (0.02)       1.40 (0.04)       13.5 (0.23)

Note: Bold characters indicate the best performer within one standard error.

5.2  Synthetic Data Experiments for Logistic Models

Next, we validated the performance of the IILasso for logistic regression models. We generated a response $y$ from Bernoulli($\pi(x)$), where $\log(\pi(x)/(1-\pi(x))) = 10X_1-9X_{11}+8X_{21}-7X_{31}+\cdots+2X_{81}-X_{91}$. For the remaining settings, we used the same parameters and procedures as in section 5.1, except that $n = 100$ instead of $n = 50$, because all methods have difficulty estimating logistic regression models from small samples. Logistic regression with the Lasso, SCAD, and MCP is supported by ncvreg (sparsenet does not support logistic regression). The EGLasso and the IILasso can also be formulated for logistic regression. (For algorithm details, see appendix C.) We evaluated the negative log likelihood, misclassification error (the rate of misclassified samples), estimation error, and model size.
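The logistic response generation can be sketched as follows (our illustration; the design matrix here is simplified to independent gaussian columns rather than the block design of section 5.1):

```python
import numpy as np

def make_logistic_response(X, beta, seed=0):
    """Draw y ~ Bernoulli(pi(x)) where the log-odds are X @ beta."""
    rng = np.random.default_rng(seed)
    eta = X @ beta                   # log-odds: log(pi / (1 - pi))
    pi = 1.0 / (1.0 + np.exp(-eta))  # inverse logit
    return rng.binomial(1, pi)

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 100))
beta = np.zeros(100)
beta[np.arange(0, 100, 10)] = [10, -9, 8, -7, 6, -5, 4, -3, 2, -1]
y = make_logistic_response(X, beta)
print(y[:10], y.mean())
```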

Table 2 shows the performances with their standard error in parentheses. The IILasso outperformed other methods significantly in terms of misclassification error and estimation error. In addition, the IILasso presented a much smaller model size than other methods. Therefore, the IILasso could efficiently estimate the accurate logistic models with a few variables.

Table 2:
Results of Synthetic Data for Logistic Models.
Method            Negative Log Likelihood  Misclassification Error  Estimation Error  Model Size
Lasso (ncvreg)    0.484 (0.006)            0.107 (0.002)            16.8 (0.06)       22.9 (0.14)
SCAD (ncvreg)     0.486 (0.007)            0.105 (0.002)            16.00 (0.10)      18.8 (0.25)
MCP (ncvreg)      0.489 (0.007)            0.106 (0.002)            16.7 (0.10)       18.1 (0.26)
EGLasso           0.477 (0.006)            0.104 (0.002)            16.7 (0.06)       25.7 (0.18)
IILasso (ours)    0.471 (0.009)            0.096 (0.002)            14.4 (0.14)       12.3 (0.20)

Note: Bold characters indicate the best performer within one standard error.

5.3  Real Data Experiments: Gene Expression Data

We applied our method to various gene expression data to validate its effectiveness in real applications. We used the following 10 data sets: alon (Alon et al., 1999; colon cancer), chiaretti (Chiaretti et al., 2004; leukemia), gordon (Gordon et al., 2002; lung cancer), gravier (Gravier et al., 2010; breast cancer), pomeroy (Pomeroy et al., 2002; central nervous system disorders), shipp (Shipp et al., 2002; lymphoma), singh (Singh et al., 2002; prostate cancer), subramanian (Subramanian et al., 2005; miscellaneous), tian (Tian et al., 2003; myeloma), and west (West et al., 2001; breast cancer). All of these data sets are provided by the R package datamicroarray. A summary of these data sets is given in Table 3. All data sets are small-sample, high-dimensional DNA microarray data. Since the response is binary, logistic regression was applied. We used the same settings for the regularization parameters as described in section 5.1. We evaluated the negative log likelihood, misclassification error, model size, and maximum correlation among active variables, using twenty-fold cross validation.

Table 3:
Abstract of Microarray Data.
Data          Number of Samples  Number of Dimensions  Number of Positive Labels
Alon          62                 2000                  22
Chiaretti     111                12,625                10
Gordon        181                12,533                150
Gravier       168                2905                  111
Pomeroy       60                 7128                  21
Shipp         77                 7129                  58
Singh         102                12,600                50
Subramanian   50                 10,100                33
Tian          173                12,625                137
West          49                 7129                  25

The results are given in Figure 4. In terms of negative log likelihood, 8 of the 10 data sets gave similar performances across all methods, while on the other two, subramanian and west, the IILasso attained the smallest negative log likelihood. The MCP was comparable to but slightly behind the IILasso. We see similar tendencies for misclassification error: the IILasso won in 8 out of 10 cases, including 5 ties. Although the number of variables selected by the IILasso was slightly worse than for the MCP (the IILasso won in 5 cases, including 3 ties; the MCP won in 6, including 2 ties), its maximum correlations among active variables were superior to the others (the IILasso won in 8, including 5 ties; the MCP won in 5, including 5 ties). Overall, the IILasso constructed accurate models with small correlations.

Figure 4:

Results of the 10 microarray data sets.


We further investigated the influence of the regularization parameters on the data sets subramanian and west because they showed clear differences among methods. Figure 5 shows the results for the Lasso, MCP, and IILasso. Compared with the Lasso, the IILasso mitigated overfitting even when $\lambda$ was small and hence achieved low negative log likelihood and misclassification error. Moreover, the IILasso suppressed its model size and the correlations among active variables. The MCP rapidly decreased the negative log likelihood as $\lambda$ decreased, and overfitting occurred early. We note that the sequences for the MCP are truncated because its algorithm reached the maximum number of iterations defined in ncvreg.

Figure 5:

Results of subramanian and west along with the sequences of λ.


6  Conclusion

We have proposed a new regularization method, the IILasso. The IILasso reduces correlations among the active variables; hence, it is easy to decompose and interpret the model. We showed that the sign recovery condition of the IILasso is milder than that of the Lasso for correlated designs as long as the true important variables are uncorrelated with each other. The convergence rate of the IILasso is also better than that of the Lasso. Moreover, we extended the IILasso to GLMs and analyzed its convergence rate in that setting as well. Finally, we verified the effectiveness of the IILasso by synthetic and real data analyses using 10 gene expression data sets and showed that the IILasso was superior in many cases on high-dimensional data.

Appendix A:  Shapes of Regularization Contours

We show contours of various regularization terms in Figure 6, including the SCAD (Fan & Li, 2001), MCP (Zhang, 2010), $\ell_q$-norm, log penalty (Candes, Wakin, & Boyd, 2008), and ordered weighted $\ell_1$-norm (Bogdan, Van Den Berg, Sabatti, Su, & Candès, 2015; Figueiredo & Nowak, 2016; Zeng & Figueiredo, 2014). We compare them with the IILasso (see Figure 1) and summarize their properties as follows:

  • The shapes of the SCAD and MCP are similar to each other; they behave like the Lasso for large γ or small magnitude of β, while the contours are pointed at the axes and parallel to the axes for small γ and large magnitude of β. Their shapes are adaptive for the magnitude of β depending on their hyperparameter γ, but they are not adaptive for correlations among variables in contrast to the IILasso.

  • The shapes of the $\ell_q$-norm for $0 < q < 1$ are pointed at the axes, so they resemble the contour of the IILasso for correlated variables.

  • The log penalty looks similar to the $\ell_q$-norm for $0 < q < 1$, and hence also to the IILasso for correlated variables.

  • The ordered weighted $\ell_1$-norm is similar to the elastic net, but its contours are pointed at $|\beta_1| = |\beta_2|$, resulting in a grouping effect.

Figure 6:

Contours of various regularization terms.

Figure 6:

Contours of various regularization terms.

Appendix B:  Proofs

B.1  Proof of Theorem 2

By standard optimality conditions, $\hat{\beta}$ is a critical point if and only if there exists a subgradient $\hat{z} \in \partial\|\hat{\beta}\|_1 := \{\hat{z} \in \mathbb{R}^p \mid \hat{z}_j = \mathrm{sgn}(\hat{\beta}_j) \text{ for } \hat{\beta}_j \ne 0,\ |\hat{z}_j| \le 1 \text{ otherwise}\}$ such that $\partial_\beta L(\hat{\beta}) = 0$. Because $\partial_\beta\frac{1}{2}|\beta|^\top R|\beta| = \mathrm{Diag}(R|\beta|)z$, the condition $\partial_\beta L(\hat{\beta}) = 0$ yields
$$-\frac{1}{n}X^\top(y-X\hat{\beta})+\lambda\hat{z}+\lambda\alpha\,\mathrm{Diag}(R|\hat{\beta}|)\hat{z} = 0.$$
(B.1)
Substituting $y = X\beta^*+\epsilon$ in equation B.1, we have
$$-\frac{1}{n}X^\top\left(X(\beta^*-\hat{\beta})+\epsilon\right)+\lambda\hat{z}+\lambda\alpha\,\mathrm{Diag}(R|\hat{\beta}|)\hat{z} = 0.$$
(B.2)
Let the true active set be $S = \{1,\ldots,s\}$ and the inactive set $S^c = \{s+1,\ldots,p\}$ without loss of generality; then equation B.2 becomes
$$\frac{1}{n}X_S^\top X_S(\hat{\beta}_S-\beta^*_S)+\frac{1}{n}X_S^\top X_{S^c}\hat{\beta}_{S^c}-\frac{1}{n}X_S^\top\epsilon+\lambda\hat{z}_S+\lambda\alpha\,\mathrm{Diag}(R_{SS}|\hat{\beta}_S|)\hat{z}_S = 0,$$
(B.3)
$$\frac{1}{n}X_{S^c}^\top X_S(\hat{\beta}_S-\beta^*_S)+\frac{1}{n}X_{S^c}^\top X_{S^c}\hat{\beta}_{S^c}-\frac{1}{n}X_{S^c}^\top\epsilon+\lambda\hat{z}_{S^c}+\lambda\alpha\,\mathrm{Diag}(R_{S^cS}|\hat{\beta}_S|)\hat{z}_{S^c} = 0.$$
(B.4)
Hence, there exists a critical point with correct sign recovery if and only if there exist $\hat{\beta}$ and $\hat{z}$ satisfying equations B.3 and B.4 with $\hat{z} \in \partial\|\hat{\beta}\|_1$ and $\mathrm{sgn}(\hat{\beta}) = \mathrm{sgn}(\beta^*)$. The latter two conditions can be written as
$$\hat{z}_S = \mathrm{sgn}(\beta^*_S),$$
(B.5)
$$\|\hat{z}_{S^c}\|_\infty \le 1,$$
(B.6)
$$\mathrm{sgn}(\hat{\beta}_S) = \mathrm{sgn}(\beta^*_S),$$
(B.7)
$$\hat{\beta}_{S^c} = 0.$$
(B.8)
Conditions B.5 and B.8 yield
$$\frac{1}{n}X_S^\top X_S(\hat{\beta}_S-\beta^*_S)-\frac{1}{n}X_S^\top\epsilon+\lambda\,\mathrm{sgn}(\beta^*_S)+\lambda\alpha\,\mathrm{Diag}(R_{SS}|\hat{\beta}_S|)\,\mathrm{sgn}(\beta^*_S) = 0,$$
(B.9)
$$\frac{1}{n}X_{S^c}^\top X_S(\hat{\beta}_S-\beta^*_S)-\frac{1}{n}X_{S^c}^\top\epsilon+\lambda\hat{z}_{S^c}+\lambda\alpha\,\mathrm{Diag}(R_{S^cS}|\hat{\beta}_S|)\hat{z}_{S^c} = 0.$$
(B.10)
Since
$$\mathrm{Diag}(R_{SS}|\hat{\beta}_S|)\,\mathrm{sgn}(\beta^*_S) = \mathrm{Diag}(\mathrm{sgn}(\beta^*_S))R_{SS}|\hat{\beta}_S| = \mathrm{Diag}(\mathrm{sgn}(\beta^*_S))R_{SS}\,\mathrm{Diag}(\mathrm{sgn}(\beta^*_S))\hat{\beta}_S,$$
equation B.9 can be rewritten as
$$U(\hat{\beta}_S-\beta^*_S)+w-\frac{1}{n}X_S^\top\epsilon = 0,$$
where
$$U := \frac{1}{n}X_S^\top X_S+\lambda\alpha\,\mathrm{Diag}(\mathrm{sgn}(\beta^*_S))R_{SS}\,\mathrm{Diag}(\mathrm{sgn}(\beta^*_S)), \quad w := \lambda\,\mathrm{sgn}(\beta^*_S)+\lambda\alpha\,\mathrm{Diag}(\mathrm{sgn}(\beta^*_S))R_{SS}\,\mathrm{Diag}(\mathrm{sgn}(\beta^*_S))\beta^*_S.$$
If we assume $U$ is invertible, we obtain
$$\hat{\beta}_S = \beta^*_S-U^{-1}\left(w-\frac{1}{n}X_S^\top\epsilon\right).$$
(B.11)
Substituting this into equation B.10, we have
$$\frac{1}{n}X_{S^c}^\top X_S\left(-U^{-1}\left(w-\frac{1}{n}X_S^\top\epsilon\right)\right)-\frac{1}{n}X_{S^c}^\top\epsilon+\lambda\hat{z}_{S^c}+\lambda\alpha\,\mathrm{Diag}\left(R_{S^cS}\left|\beta^*_S-U^{-1}\left(w-\frac{1}{n}X_S^\top\epsilon\right)\right|\right)\hat{z}_{S^c} = 0,$$
that is,
$$\left(I+\alpha\,\mathrm{Diag}\left(R_{S^cS}\left|\beta^*_S-U^{-1}\left(w-\frac{1}{n}X_S^\top\epsilon\right)\right|\right)\right)\lambda\hat{z}_{S^c}$$
(B.12)
$$= \frac{1}{n}X_{S^c}^\top X_SU^{-1}\left(w-\frac{1}{n}X_S^\top\epsilon\right)+\frac{1}{n}X_{S^c}^\top\epsilon.$$
(B.13)
Combining equations B.6, B.7, B.11, and B.13, we conclude the assertion.
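The subgradient identity $\partial_\beta\frac{1}{2}|\beta|^\top R|\beta| = \mathrm{Diag}(R|\beta|)z$ used at the start of this proof can be verified numerically at a point with no zero coordinates, where the penalty is differentiable; a small sketch of ours:

```python
import numpy as np

# Numerical check of the gradient of g(beta) = 0.5 * |beta|' R |beta|
# at a point where all coordinates are nonzero. The claimed gradient is
# Diag(R |beta|) sgn(beta) = sgn(beta) * (R @ |beta|)  (R symmetric).

rng = np.random.default_rng(0)
A = rng.random((4, 4))
R = (A + A.T) / 2
np.fill_diagonal(R, 0.0)              # zero diagonal as in the paper

def g(beta):
    b = np.abs(beta)
    return 0.5 * b @ R @ b

beta = np.array([0.7, -1.2, 0.3, -0.5])
grad_claimed = np.sign(beta) * (R @ np.abs(beta))

eps = 1e-6                            # central finite differences
grad_fd = np.array([
    (g(beta + eps * e) - g(beta - eps * e)) / (2 * eps)
    for e in np.eye(4)
])
print(np.max(np.abs(grad_claimed - grad_fd)))  # close to 0
```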

B.2  Subgaussian Tail Bounds

We briefly summarize the definition and properties of subgaussian random variables, since they play a key role in our nonasymptotic analyses (see Wainwright, 2009, and Rigollet & Hütter, 2015, for details). Let $\epsilon$ be a zero-mean random variable. We say that $\epsilon$ is subgaussian with parameter $\sigma > 0$ if for all $t \in \mathbb{R}$,
$$\mathbb{E}[\exp(t\epsilon)] \le \exp\left(\frac{\sigma^2t^2}{2}\right).$$
(B.14)
Applying the Chernoff bound to equation B.14, we have a subgaussian tail bound for $z > 0$:
$$P(|\epsilon| > z) \le 2\exp\left(-\frac{z^2}{2\sigma^2}\right).$$
(B.15)
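As a sanity check (our illustration, not part of the proof), the tail bound B.15 can be verified by Monte Carlo for a standard gaussian, which is subgaussian with $\sigma = 1$:

```python
import numpy as np

# Empirical check that P(|eps| > z) <= 2 exp(-z^2 / 2) for eps ~ N(0, 1).
rng = np.random.default_rng(0)
eps = rng.standard_normal(1_000_000)
for z in [0.5, 1.0, 2.0, 3.0]:
    emp = np.mean(np.abs(eps) > z)
    bound = 2 * np.exp(-z ** 2 / 2)
    print(f"z={z}: empirical {emp:.4f} <= bound {bound:.4f}")
```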

We obtain the following lemma for a sequence of subgaussian variables. This is useful for our theoretical analyses of sign recovery.

Lemma 1.
Let $\{\epsilon_i\}_{i=1}^n$ be i.i.d. zero-mean subgaussian variables with parameter $\sigma$. Then we have, for $a \in \mathbb{R}^n$ and $z > 0$,
$$P\left(\left|\sum_{i=1}^n a_i\epsilon_i\right| > z\right) \le 2\exp\left(-\frac{z^2}{2\|a\|_2^2\sigma^2}\right).$$
Proof.
From the definition of subgaussianity, we have
$$\mathbb{E}\left[\exp\left(t\sum_{i=1}^n a_i\epsilon_i\right)\right] = \prod_{i=1}^n\mathbb{E}[\exp(ta_i\epsilon_i)] \le \prod_{i=1}^n\exp\left(\frac{a_i^2\sigma^2t^2}{2}\right) = \exp\left(\frac{\|a\|_2^2\sigma^2t^2}{2}\right).$$
Therefore, the subgaussian tail bound, equation B.15, concludes the assertion.

In addition, we prepare the following corollary using lemma 19. This is useful for our theoretical analyses of estimation error and local optimality.

Corollary 4.
Suppose that assumption 1 holds and that $\sum_{i=1}^n X_{ij}^2/n \le 1$ for $j = 1,\ldots,p$. For $\delta > 0$, define $\gamma_n := \gamma_n(\delta)$ as in equation 3.7. Then we have
$$P\left(\left\|\frac{1}{n}X^\top\epsilon\right\|_\infty \ge \gamma_n\right) \le \delta.$$
Proof.
Notice that
$$P\left(\left\|\frac{1}{n}X^\top\epsilon\right\|_\infty \ge \gamma_n\right) = P\left(\max_{1\le j\le p}\left|\frac{1}{n}\sum_{i=1}^n X_{ij}\epsilon_i\right| \ge \gamma_n\right) = P\left(\bigcup_{1\le j\le p}\left\{\left|\frac{1}{n}\sum_{i=1}^n X_{ij}\epsilon_i\right| \ge \gamma_n\right\}\right) \le \sum_{j=1}^p P\left(\left|\frac{1}{n}\sum_{i=1}^n X_{ij}\epsilon_i\right| \ge \gamma_n\right) \le p\max_{1\le j\le p}P\left(\left|\frac{1}{n}\sum_{i=1}^n X_{ij}\epsilon_i\right| \ge \gamma_n\right) \le 2p\max_{1\le j\le p}\exp\left(-\frac{n^2\gamma_n^2}{2\sigma^2\|X_j\|_2^2}\right) \le \exp\left(-\frac{n\gamma_n^2}{2\sigma^2}+\log(2p)\right),$$
where we used lemma 19 in the fifth step. Since we set $\delta = \exp(-n\gamma_n^2/2\sigma^2+\log(2p))$, we conclude the assertion.
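The union bound above can likewise be illustrated numerically (our sketch): choosing $\gamma_n$ so that $\delta = 2p\exp(-n\gamma_n^2/2\sigma^2)$, the event $\{\|X^\top\epsilon/n\|_\infty \ge \gamma_n\}$ should occur in at most a $\delta$ fraction of trials.

```python
import numpy as np

# Monte Carlo check of corollary 4 with standardized columns and
# standard gaussian noise (sigma = 1).
rng = np.random.default_rng(0)
n, p, delta = 200, 50, 0.05
gamma_n = np.sqrt(2 * np.log(2 * p / delta) / n)

X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).mean(axis=0))   # ensure sum_i X_ij^2 / n = 1

trials = 2000
hits = 0
for _ in range(trials):
    eps = rng.standard_normal(n)
    if np.max(np.abs(X.T @ eps) / n) >= gamma_n:
        hits += 1
print(hits / trials, "<=", delta)
```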

B.3  Proof of Theorem 6

We derive sufficient conditions for equations 3.1 and 3.2 in theorem 2.

In terms of equation 3.1, it is sufficient if
$$\beta^*_{\min} > \left\|U^{-1}\left(w-\frac{1}{n}X_S^\top\epsilon\right)\right\|_\infty.$$
By the triangle inequality, we have
$$\left\|U^{-1}\left(w-\frac{1}{n}X_S^\top\epsilon\right)\right\|_\infty \le \left\|U^{-1}w\right\|_\infty+\left\|U^{-1}\frac{1}{n}X_S^\top\epsilon\right\|_\infty.$$
(B.16)
The first term on the right-hand side of equation B.16 is bounded as
$$\left\|U^{-1}w\right\|_\infty \le \lambda_n\left(1+\alpha D\|\beta^*_S\|_1\right)\left\|U^{-1}\right\|_\infty.$$
Consider the $j$th element of the random variable in the second term on the right-hand side of equation B.16,
$$T_j := e_j^\top U^{-1}\frac{1}{n}X_S^\top\epsilon,$$
where $e_j \in \mathbb{R}^s$ is the unit vector with 1 in the $j$th element and 0 elsewhere. From lemma 19, we have for $t > 0$,
$$P(|T_j| > t) \le 2\exp\left(-\frac{t^2n^2}{2\sigma^2\|X_SU^{-1}e_j\|_2^2}\right).$$
Using assumption 2, we have
$$\left(\frac{1}{n}X_S^\top X_S\right)^{-1/2}U\left(\frac{1}{n}X_S^\top X_S\right)^{-1/2} = I+\lambda_n\alpha\left(\frac{1}{n}X_S^\top X_S\right)^{-1/2}\mathrm{Diag}(\mathrm{sgn}(\beta^*_S))R_{SS}\,\mathrm{Diag}(\mathrm{sgn}(\beta^*_S))\left(\frac{1}{n}X_S^\top X_S\right)^{-1/2} \succeq (1+\lambda_n\alpha\psi)I,$$
so that
$$U\left(\frac{1}{n}X_S^\top X_S\right)^{-1}U = \left(\frac{1}{n}X_S^\top X_S\right)^{1/2}\left[\left(\frac{1}{n}X_S^\top X_S\right)^{-1/2}U\left(\frac{1}{n}X_S^\top X_S\right)^{-1/2}\right]^2\left(\frac{1}{n}X_S^\top X_S\right)^{1/2} \succeq \phi(1+\lambda_n\alpha\psi)^2I,$$
and hence
$$\|X_SU^{-1}e_j\|_2^2 = n\,e_j^\top U^{-1}\left(\frac{1}{n}X_S^\top X_S\right)U^{-1}e_j \le \frac{n}{\phi(1+\lambda_n\alpha\psi)^2}.$$
Hence, we obtain
$$P\left(\max_{j\in S}|T_j| > t\right) \le 2s\exp\left(-\frac{t^2n\phi(1+\lambda_n\alpha\psi)^2}{2\sigma^2}\right).$$
Setting $t = 4\lambda_n\sigma/(\sqrt{\phi}(1+\lambda_n\alpha\psi))$, we have
$$P\left(\max_{j\in S}|T_j| > \frac{4\lambda_n\sigma}{\sqrt{\phi}(1+\lambda_n\alpha\psi)}\right) \le \exp\left(-8\lambda_n^2n+\log(2s)\right).$$
Therefore, if $\Delta_{\min} > 0$, where
$$\Delta_{\min} := \beta^*_{\min}-\lambda_n\left[\left(1+\alpha D\|\beta^*_S\|_1\right)\left\|U^{-1}\right\|_\infty+\frac{4\sigma}{\sqrt{\phi}(1+\lambda_n\alpha\psi)}\right],$$
then equation 3.1 is satisfied with probability at least $1-\exp(-8\lambda_n^2n+\log(2s))$.
In terms of equation 3.2, it is sufficient if for $j \in S^c$,
$$\frac{1}{n}\left|X_j^\top X_SU^{-1}w\right|+\left|\frac{1}{n}X_j^\top\left(I-\frac{1}{n}X_SU^{-1}X_S^\top\right)\epsilon\right| \le \lambda_n\left(1+\alpha\|R_{Sj}\|_1\Delta_{\min}\right),$$
(B.17)
since we now have, elementwise,
$$\left|\beta^*_S-U^{-1}\left(w-\frac{1}{n}X_S^\top\epsilon\right)\right| > \Delta_{\min} > 0.$$
The first term on the left-hand side of equation B.17 is bounded as
$$\frac{1}{n}\left|X_j^\top X_SU^{-1}w\right| \le \frac{\lambda_n}{n}\left(1+\alpha D\|\beta^*_S\|_1\right)\left\|X_j^\top X_SU^{-1}\right\|_1.$$
Consider the random variable
$$Z_j := \frac{1}{n}X_j^\top\left(I-\frac{1}{n}X_SU^{-1}X_S^\top\right)\epsilon$$
in the second term on the left-hand side of equation B.17. From lemma 19, we have for $z_j > 0$,
$$P(|Z_j| > z_j) \le 2\exp\left(-\frac{n^2z_j^2}{2\sigma^2\left\|\left(I-\frac{1}{n}X_SU^{-1}X_S^\top\right)X_j\right\|_2^2}\right).$$
Since $U \succeq (1+\lambda_n\alpha\psi)\frac{1}{n}X_S^\top X_S \succ O$, we have
$$I-\frac{1}{n}X_SU^{-1}X_S^\top \preceq I.$$
(B.18)
In addition, we have
$$I-\frac{1}{n}X_SU^{-1}X_S^\top = I-X_S\left(X_S^\top X_S\right)^{-1/2}\left[\left(\frac{1}{n}X_S^\top X_S\right)^{-1/2}U\left(\frac{1}{n}X_S^\top X_S\right)^{-1/2}\right]^{-1}\left(X_S^\top X_S\right)^{-1/2}X_S^\top \succeq I-\frac{1}{1+\lambda_n\alpha\psi}X_S\left(X_S^\top X_S\right)^{-1}X_S^\top \succeq \left(1-\frac{1}{1+\lambda_n\alpha\psi}\right)I,$$
(B.19)
where we used the fact that $X_S(X_S^\top X_S)^{-1}X_S^\top$ is the projection matrix onto the image of $X_S$ in the last inequality. Equations B.18 and B.19 give
$$\left\|I-\frac{1}{n}X_SU^{-1}X_S^\top\right\|_2^2 \le \max\left\{1,\left(\frac{\lambda_n\alpha\psi}{1+\lambda_n\alpha\psi}\right)^2\right\} =: \nu^2.$$
(B.20)
Hence, we obtain
$$P\left(\bigcup_{j\in S^c}\{|Z_j| > z_j\}\right) \le \sum_{j\in S^c}2\exp\left(-\frac{nz_j^2}{2\sigma^2\nu^2}\right).$$
Setting $z_j = \lambda_n(1+\alpha\|R_{Sj}\|_1\Delta_{\min})\kappa/2$, we have
$$P\left(\bigcup_{j\in S^c}\left\{|Z_j| > \frac{\lambda_n(1+\alpha\|R_{Sj}\|_1\Delta_{\min})\kappa}{2}\right\}\right) \le 2\sum_{j\in S^c}\exp\left(-\frac{n\lambda_n^2\kappa^2(1+\alpha\|R_{Sj}\|_1\Delta_{\min})^2}{8\sigma^2\nu^2}\right) \le 2(p-s)\exp\left(-\frac{n\lambda_n^2\kappa^2}{8\sigma^2\nu^2}\right) = \exp\left(-\frac{n\lambda_n^2\kappa^2}{8\sigma^2\nu^2}+\log(2(p-s))\right).$$
Therefore, the generalized incoherence condition (see definition 5) yields condition 3.2 with probability at least $1-\exp(-n\lambda_n^2\kappa^2/8\sigma^2\nu^2+\log(2(p-s)))$.

Overall, conditions 3.1 and 3.2 hold with probability at least $1-\exp(-8\lambda_n^2n+\log(2s))-\exp(-n\lambda_n^2\kappa^2/8\sigma^2\nu^2+\log(2(p-s)))$. Since we set $\lambda_n \ge \max\{1/(4\sigma),2\nu/\kappa\}\gamma_n$ and $\delta = \exp(-n\gamma_n^2/2\sigma^2+\log(2p))$, this probability is bounded below by $1-2\delta$.

B.4  Proof of Theorem 8

By $L(\hat{\beta}) \le L(\beta^*)$ and $y = X\beta^*+\epsilon$, it holds that
$$\frac{1}{2n}\|X(\hat{\beta}-\beta^*)\|_2^2+\lambda_n\left(\|\hat{\beta}\|_1+\frac{\alpha}{2}|\hat{\beta}|^\top R|\hat{\beta}|\right) \le \frac{1}{n}\epsilon^\top X(\hat{\beta}-\beta^*)+\lambda_n\left(\|\beta^*\|_1+\frac{\alpha}{2}|\beta^*|^\top R|\beta^*|\right).$$
(B.21)
By corollary 20, it holds that
$$P\left(\left\|\frac{1}{n}X^\top\epsilon\right\|_\infty > \gamma_n\right) \le \delta.$$
Hereafter, we assume that the event $\{\|\frac{1}{n}X^\top\epsilon\|_\infty \le \gamma_n\}$ occurs. Then, if $\gamma_n \le \lambda_n/3$, by equation B.21,
$$\frac{1}{2n}\|X(\hat{\beta}-\beta^*)\|_2^2+\lambda_n\left(\|\hat{\beta}\|_1+\frac{\alpha}{2}|\hat{\beta}|^\top R|\hat{\beta}|\right) \le \left\|\frac{1}{n}X^\top\epsilon\right\|_\infty\|\beta^*-\hat{\beta}\|_1+\lambda_n\left(\|\beta^*\|_1+\frac{\alpha}{2}|\beta^*|^\top R|\beta^*|\right) \le \frac{1}{3}\lambda_n\|\beta^*-\hat{\beta}\|_1+\lambda_n\left(\|\beta^*\|_1+\frac{\alpha}{2}|\beta^*|^\top R|\beta^*|\right).$$
(B.22)
Since
$$\|\hat{\beta}-\beta^*\|_1 = \|\hat{\beta}_S-\beta^*_S\|_1+\|\hat{\beta}_{S^c}-\beta^*_{S^c}\|_1 = \|\hat{\beta}_S-\beta^*_S\|_1+\|\hat{\beta}_{S^c}\|_1$$
and
$$|\beta^*_S|^\top R_{SS}|\beta^*_S|-|\hat{\beta}_S|^\top R_{SS}|\hat{\beta}_S| \le \sum_{(j,k)\in S\times S}R_{jk}\left|\beta^*_j\beta^*_k-\hat{\beta}_j\hat{\beta}_k\right| \le 2\sum_{(j,k)\in S\times S}R_{jk}\left|\beta^*_j(\beta^*_k-\hat{\beta}_k)\right|+\sum_{(j,k)\in S\times S}R_{jk}\left|(\beta^*_j-\hat{\beta}_j)(\beta^*_k-\hat{\beta}_k)\right| = 2|\beta^*_S|^\top R_{SS}|\beta^*_S-\hat{\beta}_S|+|\beta^*_S-\hat{\beta}_S|^\top R_{SS}|\beta^*_S-\hat{\beta}_S| \le 2\left\|R_{SS}|\beta^*_S|\right\|_\infty\|\beta^*_S-\hat{\beta}_S\|_1+D\|\beta^*_S-\hat{\beta}_S\|_1^2,$$
we obtain that
$$\frac{1}{2n}\|X(\hat{\beta}-\beta^*)\|_2^2+\lambda_n\left(\|\hat{\beta}_S\|_1+\|\hat{\beta}_{S^c}\|_1+\frac{\alpha}{2}|\hat{\beta}_S|^\top R_{SS}|\hat{\beta}_S|+\frac{\alpha}{2}\sum_{(j,k)\notin S\times S}R_{jk}|\hat{\beta}_j\hat{\beta}_k|\right) \le \frac{1}{3}\lambda_n\left(\|\hat{\beta}_S-\beta^*_S\|_1+\|\hat{\beta}_{S^c}\|_1\right)+\lambda_n\left(\|\beta^*_S\|_1+\frac{\alpha}{2}|\beta^*_S|^\top R_{SS}|\beta^*_S|\right),$$
and hence
$$\frac{1}{2n}\|X(\hat{\beta}-\beta^*)\|_2^2+\lambda_n\left(\frac{2}{3}\|\hat{\beta}_{S^c}\|_1+\frac{\alpha}{2}\sum_{(j,k)\notin S\times S}R_{jk}|\hat{\beta}_j\hat{\beta}_k|\right) \le \frac{1}{3}\lambda_n\|\hat{\beta}_S-\beta^*_S\|_1+\lambda_n\left(\|\beta^*_S\|_1-\|\hat{\beta}_S\|_1+\alpha\left\|R_{SS}|\beta^*_S|\right\|_\infty\|\beta^*_S-\hat{\beta}_S\|_1+\frac{\alpha D}{2}\|\beta^*_S-\hat{\beta}_S\|_1^2\right) \le \lambda_n\left(\frac{4}{3}\|\hat{\beta}_S-\beta^*_S\|_1+\alpha\left\|R_{SS}|\beta^*_S|\right\|_\infty\|\beta^*_S-\hat{\beta}_S\|_1+\frac{\alpha D}{2}\|\beta^*_S-\hat{\beta}_S\|_1^2\right).$$
(B.23)
On the other hand, equation B.22 also gives
$$\|\hat{\beta}_S\|_1+\|\hat{\beta}_{S^c}\|_1 \le \frac{1}{3}\left(\|\hat{\beta}_S-\beta^*_S\|_1+\|\hat{\beta}_{S^c}\|_1\right)+\|\beta^*_S\|_1+\frac{\alpha}{2}|\beta^*_S|^\top R_{SS}|\beta^*_S|,$$
so that
$$\frac{2}{3}\|\hat{\beta}_S-\beta^*_S\|_1+\frac{2}{3}\|\hat{\beta}_{S^c}\|_1 \le 2\|\beta^*_S\|_1+\frac{\alpha}{2}|\beta^*_S|^\top R_{SS}|\beta^*_S|,$$
and therefore
$$\|\hat{\beta}_S-\beta^*_S\|_1 \le 3\|\beta^*_S\|_1+\frac{3}{4}\alpha|\beta^*_S|^\top R_{SS}|\beta^*_S| \le \left(3+\frac{3}{4}\alpha\left\|R_{SS}|\beta^*_S|\right\|_\infty\right)\|\beta^*_S\|_1.$$
Therefore, equation B.23 gives
$$\frac{2}{3}\|\hat{\beta}_{S^c}\|_1+\frac{\alpha}{2}\sum_{(j,k)\notin S\times S}R_{jk}|\hat{\beta}_j\hat{\beta}_k| \le \left[\frac{4}{3}+\alpha\left\|R_{SS}|\beta^*_S|\right\|_\infty+\frac{3}{2}\alpha D\|\beta^*_S\|_1\left(1+\frac{\alpha}{4}\left\|R_{SS}|\beta^*_S|\right\|_\infty\right)\right]\|\hat{\beta}_S-\beta^*_S\|_1.$$
(B.24)
The second term on the left-hand side is evaluated as
$$\sum_{(j,k)\notin S\times S}R_{jk}|\hat{\beta}_j\hat{\beta}_k| = \sum_{j\in S^c,k\in S^c}R_{jk}|\hat{\beta}_j\hat{\beta}_k|+2\sum_{j\in S^c,k\in S}R_{jk}\left|\hat{\beta}_j(\hat{\beta}_k-\beta^*_k+\beta^*_k)\right| = |\hat{\beta}_{S^c}|^\top R_{S^cS^c}|\hat{\beta}_{S^c}|+2|\hat{\beta}_{S^c}|^\top R_{S^cS}|\hat{\beta}_S-\beta^*_S+\beta^*_S|.$$
Hence, equation B.24 gives, using $\|R_{SS}|\beta^*_S|\|_\infty \le D\|\beta^*_S\|_1$,
$$\frac{2}{3}\|\hat{\beta}_{S^c}\|_1+\frac{\alpha}{2}|\hat{\beta}_{S^c}|^\top R_{S^cS^c}|\hat{\beta}_{S^c}|+\alpha|\hat{\beta}_{S^c}|^\top R_{S^cS}|\hat{\beta}_S-\beta^*_S+\beta^*_S| \le \left[\frac{4}{3}+\alpha\left\|R_{SS}|\beta^*_S|\right\|_\infty+\frac{3}{2}\alpha D\|\beta^*_S\|_1\left(1+\frac{\alpha}{4}\left\|R_{SS}|\beta^*_S|\right\|_\infty\right)\right]\|\hat{\beta}_S-\beta^*_S\|_1,$$
and hence
$$\|\hat{\beta}_{S^c}\|_1+\frac{3}{4}\alpha|\hat{\beta}_{S^c}|^\top R_{S^cS^c}|\hat{\beta}_{S^c}|+\frac{3}{2}\alpha|\hat{\beta}_{S^c}|^\top R_{S^cS}|\hat{\beta}_S-\beta^*_S+\beta^*_S| \le \left[2+\frac{15}{4}\alpha D\|\beta^*_S\|_1+\frac{9}{16}(\alpha D\|\beta^*_S\|_1)^2\right]\|\hat{\beta}_S-\beta^*_S\|_1.$$
(B.25)
If $\alpha \le \frac{1}{4D\|\beta^*_S\|_1}$, we have
$$\|\hat{\beta}_{S^c}\|_1+\frac{3}{4}\alpha|\hat{\beta}_{S^c}|^\top R_{S^cS^c}|\hat{\beta}_{S^c}|+\frac{3}{2}\alpha|\hat{\beta}_{S^c}|^\top R_{S^cS}|\hat{\beta}_S-\beta^*_S+\beta^*_S| \le 3\|\hat{\beta}_S-\beta^*_S\|_1.$$
Therefore, we see that
$$\hat{v} \in B(S,C,C'),$$
where $\hat{v} = \hat{\beta}-\beta^*$, $C = 3$, and $C' = \frac{3}{2}$. Applying the definition of $\phi_{GRE}$ to equation B.23, it holds that
$$\frac{\phi_{GRE}}{2}\|\hat{\beta}-\beta^*\|_2^2 \le \lambda_n\left(\frac{4}{3}+\frac{5}{2}\alpha D\|\beta^*_S\|_1+\frac{3}{8}(\alpha D\|\beta^*_S\|_1)^2\right)\|\hat{\beta}_S-\beta^*_S\|_1.$$
Because $\|\hat{\beta}_S-\beta^*_S\|_1 \le \sqrt{s}\|\hat{\beta}_S-\beta^*_S\|_2 \le \sqrt{s}\|\hat{\beta}-\beta^*\|_2$, we have
$$\|\hat{\beta}-\beta^*\|_2 \le \left(\frac{8}{3}+5\alpha D\|\beta^*_S\|_1+\frac{3}{4}(\alpha D\|\beta^*_S\|_1)^2\right)\frac{\sqrt{s}\lambda_n}{\phi_{GRE}},$$
so that
$$\|\hat{\beta}-\beta^*\|_2^2 \le \left(\frac{8}{3}+5\alpha D\|\beta^*_S\|_1+\frac{3}{4}(\alpha D\|\beta^*_S\|_1)^2\right)^2\frac{s\lambda_n^2}{\phi_{GRE}^2} \le \frac{16s\lambda_n^2}{\phi_{GRE}^2}.$$
(B.26)
This concludes the assertion.

B.5  Corollary of Theorem 8

For the comparison of the IILasso and the Lasso, we use the following estimation error bound, which is slightly stricter than theorem 8.

Corollary 5.
Suppose the same assumptions as in theorem 8 except for assumption GRE(S, 3, 3/2). Instead, suppose that assumption GRE(S, C, 3/2) (see definition 7) with $C = 2+\frac{15}{4}\alpha D\|\beta^*_S\|_1+\frac{9}{16}(\alpha D\|\beta^*_S\|_1)^2$ is satisfied. Then it holds that
$$\|\hat{\beta}-\beta^*\|_2^2 \le \left(\frac{8}{3}+5\alpha D\|\beta^*_S\|_1+\frac{3}{4}(\alpha D\|\beta^*_S\|_1)^2\right)^2\frac{s\lambda_n^2}{\phi_{GRE}^2},$$
with probability at least $1-\delta$.
Proof.
This is derived in basically the same way as theorem 8. From equation B.25, we can directly see that
$$\hat{v} \in B(S,C,C'),$$
where $\hat{v} = \hat{\beta}-\beta^*$, $C = 2+\frac{15}{4}\alpha D\|\beta^*_S\|_1+\frac{9}{16}(\alpha D\|\beta^*_S\|_1)^2$, and $C' = \frac{3}{2}$. This and equation B.26 conclude the assertion.

From this corollary, we compare the IILasso with $R_{SS} = O$ and the Lasso.

  1. If $\alpha = 0$, we have
$$\|\hat{\beta}-\beta^*\|_2^2 \le \frac{64s\lambda_n^2}{9\phi_{GRE}^2},$$
with $B(S,C,C')$ where $C = 2$ and $C' = 0$. This is a standard Lasso result.
  2. If $D = 0$, we have
$$\|\hat{\beta}-\beta^*\|_2^2 \le \frac{64s\lambda_n^2}{9\phi_{GRE}^2},$$
with $B(S,C,C')$ where $C = 2$ and $C' = \frac{3}{2}$. Since $\phi_{GRE}$ is the minimum eigenvalue restricted to $B(S,C,C')$, the $\phi_{GRE}$ of the IILasso is larger than that of the Lasso.

B.6  Proof of Theorem 10

Let
βˇ:=argminβRp:βSc=0y-Xβ22.
That is, βˇ is the least squares estimator with the true nonzero coefficients. Let β˜ be a local optimal solution. For 0<h<1, let β(h):=β˜+h(βˇ-β˜); then it holds that
Lλn(β(h))-Lλn(β˜)=h2-2h2nX(β˜-βˇ)22-hn(Xβˇ-y)X(β˜-βˇ)+λn(β(h)1-β˜1)+λnα2(|β(h)|R|β(h)|-|β˜|R|β˜|).
(B.27)

First, we evaluate the term 1n(Xβˇ-y)X(β˜-βˇ)=1n(Xβˇ-y)XS(β˜S-βˇS)+1n(Xβˇ-y)XSc(β˜Sc-βˇSc) as follows:

1. Since βˇ is the least squares estimator and 1nXSXS is invertible by the assumption, we have
βˇS=(XSXS)-1XSy,βˇSc=0.
Therefore,
1nXS(Xβˇ-y)=1nXS(XS(XSXS)-1XS-I)y.
Here, I-XS(XSXS)-1XS is the projection matrix to the orthogonal complement of the image of XS. Hence, 1n(Xβˇ-y)XS(β˜S-βˇS)=0.
2. Notice that
1nXSc(Xβˇ-y)=-1nXSc(I-XS(XSXS)-1XS)y=-1nXSc(I-XS(XSXS)-1XS)(XSβS*+ɛ)=-1nXSc(I-XS(XSXS)-1XS)ɛ,
where we used (I-XS(XSXS)-1XS)XSc=0 in the last line. Because (I-XS(XSXS)XS) is a projection matrix, we have (I-XS(XSXS)-1XS)Xj22Xj22. This and corollary 20 give
1nXSc(Xβˇ-y)γn,
with probability 1-δ. Hence, let V:=supp(β˜)S. Then we have
1n(β˜Sc-βˇSc)XSc(Xβˇ-y)γnβ˜Sc-βˇSc1=γnβ˜V1,
where we used the assumption VSc and βˇV=0.
Combining these inequalities and the assumption λnγn, we have that
1n(Xβˇ-y)X(β˜-βˇ)λnβ˜V1.
(B.28)
As for the regularization term, we evaluate each term of λn(β(h)1-β˜1)+λn2(|β(h)|R|β(h)|-|β˜|R|β˜|) in the following.
i. Evaluation of β(h)1-β˜1. Because of the definition of β(h), it holds that
β(h)1-β˜1=β˜+h(βˇ-β˜)1-β˜1=β˜S+h(βˇS-β˜S)1-β˜S1+β˜V+h(βˇV-β˜V)1-β˜V1=β˜S+h(βˇS-β˜S)1-β˜S1+(1-h)β˜V1-β˜V1hβˇS-β˜S1-hβ˜V1.
(B.29)
ii. Evaluation of |β(h)|R|β(h)|-|β˜|R|β˜|. Note that
|β(h)j|Rjk|β(h)k|-|β˜j|Rjk|β˜k|=|(1-h)β˜j+hβˇj|Rjk|(1-h)β˜k+hβˇk|-|β˜j|Rjk|β˜k|(1-h)2|β˜j|Rjk|β˜k|+h(1-h)(|βˇj|Rjk|β˜k|+|β˜j|Rjk|βˇk|)+h2|βˇj|Rjk|βˇk|-|β˜j|Rjk|β˜k|=-2h|β˜j|Rjk|β˜k|+h(|βˇj|Rjk|β˜k|+|β˜j|Rjk|βˇk|)+O(h2)=h[(|βˇj|-|β˜j|)Rjk|β˜k|+|β˜j|Rjk(|βˇk|-|β˜k|)]+O(h2).
(B.30)
If j,kS, then the right hand side of equation B.30 is bounded by
h(|βˇj-β˜j|Rjk|βˇk-β˜k|+|βˇj-β˜j|Rjk|βˇk-β˜k|)+h(|βˇj-β˜j|Rjk|βˇk|+|βˇj|Rjk|βˇk-β˜k|)+O(h2).
If jV and kS, then the right hand side of equation B.30 is bounded by
h|β˜j|Rjk(|βˇk|-|β˜k|)+O(h2)h|β˜j|Rjk|βˇk-β˜k|+O(h2).
If jV and kV, then the right-hand side of equation B.30 is bounded by
0+O(h2)=O(h2).
Based on these evaluations, we have
|β(h)|R|β(h)|-|β˜|R|β˜|2h|βˇS-β˜S|RSS|βˇS-β˜S|+|βˇS-β˜S|RSS|βˇS|+|β˜V|RVS|βˇS-β˜S|+O(h2)2h|βˇ-β˜|R|βˇ-β˜|+|βˇS-β˜S|RSS|βˇS|+O(h2)2hD¯(βˇ-β˜22+βˇ2βˇS-β˜S2)+O(h2).
We will show in equation B.32 that βˇ-β*2sλn/φ, and thus it follows that
βˇ2β*2+sλn/φ.
Therefore, we obtain that
|β(h)|R|β(h)|-|β˜|R|β˜|2hD¯βˇ-β˜22+(β*2+sλn/φ)βˇS-β˜S2+O(h2).
(B.31)
Applying the inequalities B.28, B.29, and B.31 to B.27 yields
\begin{align*}
L_{\lambda_n}(\beta(h)) - L_{\lambda_n}(\tilde{\beta})
&\le h\Big\{-\frac{1}{n}\|X(\check{\beta}-\tilde{\beta})\|_2^2 + \lambda_n\|\tilde{\beta}_S-\check{\beta}_S\|_1 - (\lambda_n-\gamma_n)\|\tilde{\beta}_V\|_1 + \lambda_n\alpha\bar{D}\big[\|\check{\beta}-\tilde{\beta}\|_2^2 + (\|\beta^*\|_2+\sqrt{s}\lambda_n/\phi)\|\check{\beta}_S-\tilde{\beta}_S\|_2\big]\Big\} + O(h^2) \\
&\le h\Big\{-\phi\|\check{\beta}-\tilde{\beta}\|_2^2 + \lambda_n\|\tilde{\beta}_S-\check{\beta}_S\|_1 + \lambda_n\alpha\bar{D}\big[\|\check{\beta}-\tilde{\beta}\|_2^2 + (\|\beta^*\|_2+\sqrt{s}\lambda_n/\phi)\|\check{\beta}_S-\tilde{\beta}_S\|_2\big]\Big\} + O(h^2) \\
&\le h\Big\{(-\phi+\lambda_n\alpha\bar{D})\|\check{\beta}-\tilde{\beta}\|_2^2 + \lambda_n\|\tilde{\beta}_S-\check{\beta}_S\|_1 + \lambda_n\alpha\bar{D}(\|\beta^*\|_2+\sqrt{s}\lambda_n/\phi)\|\check{\beta}_S-\tilde{\beta}_S\|_2\Big\} + O(h^2),
\end{align*}
where we used the assumption $\lambda_n > \gamma_n$ in the second inequality.
Since we have assumed $\alpha < \min\Big\{\dfrac{\sqrt{s}}{2\bar{D}\|\beta^*\|_2},\ \dfrac{\phi}{2\bar{D}\lambda_n}\Big\}$, the right-hand side is further bounded by
\[
h\Big(-\frac{\phi}{2}\|\check{\beta}-\tilde{\beta}\|_2^2 + 2\lambda_n\sqrt{s}\|\check{\beta}_S-\tilde{\beta}_S\|_2\Big) + O(h^2).
\]
Because of this, if $\|\check{\beta}-\tilde{\beta}\|_2 > \frac{4\sqrt{s}\lambda_n}{\phi}$, then the expression in parentheses becomes negative, and we conclude that for sufficiently small $\eta>0$, it holds that
\[
L_{\lambda_n}(\beta(h)) < L_{\lambda_n}(\tilde{\beta}),
\]
for all $0<h<\eta$. In other words, $\tilde{\beta}$ is not a local optimal solution. Therefore, we must have
\[
\|\check{\beta}-\tilde{\beta}\|_2 \le \frac{4\sqrt{s}\lambda_n}{\phi}.
\]
Finally, notice that $\|\tilde{\beta}-\beta^*\|_2^2 \le \big(\|\tilde{\beta}-\check{\beta}\|_2+\|\beta^*-\check{\beta}\|_2\big)^2$ and
\begin{align*}
\|\check{\beta}-\beta^*\|_2^2
&= \|(X_S^\top X_S)^{-1}X_S^\top y - \beta_S^*\|_2^2
= \|(X_S^\top X_S)^{-1}X_S^\top(X_S\beta_S^*+\varepsilon) - \beta_S^*\|_2^2 \\
&= \|(X_S^\top X_S)^{-1}X_S^\top\varepsilon\|_2^2
\le \phi^{-2}\Big\|\frac{1}{n}X_S^\top\varepsilon\Big\|_2^2
\le \phi^{-2}s\gamma_n^2 \le \phi^{-2}s\lambda_n^2, \tag{B.32}
\end{align*}
which concludes the assertion.
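The deterministic part of the chain in equation B.32, $\|(X_S^\top X_S)^{-1}X_S^\top\varepsilon\|_2^2 \le \phi^{-2}\|\frac{1}{n}X_S^\top\varepsilon\|_2^2$ with $\phi$ the smallest eigenvalue of $\frac{1}{n}X_S^\top X_S$, can be checked numerically. The sketch below uses synthetic data of our own choosing; it only illustrates the matrix inequality, not the probabilistic bound on $\gamma_n$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, s = 200, 5
XS = rng.standard_normal((n, s))
beta_star = rng.standard_normal(s)
eps = 0.1 * rng.standard_normal(n)
y = XS @ beta_star + eps

# Restricted OLS on the support S and smallest eigenvalue of (1/n) XS' XS.
beta_check = np.linalg.solve(XS.T @ XS, XS.T @ y)
phi = np.linalg.eigvalsh(XS.T @ XS / n).min()

# ||beta_check - beta*||_2^2 equals ||(XS'XS)^{-1} XS' eps||_2^2 exactly ...
lhs = np.linalg.norm(beta_check - beta_star) ** 2
assert np.isclose(lhs, np.linalg.norm(np.linalg.solve(XS.T @ XS, XS.T @ eps)) ** 2)
# ... and is bounded by phi^{-2} ||(1/n) XS' eps||_2^2.
rhs = phi ** -2 * np.linalg.norm(XS.T @ eps / n) ** 2
assert lhs <= rhs + 1e-10
```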

B.7  Proof of Theorem 15

In the proof, we use the shorthand notation $S^*:=S_{\beta^*}$. Let $c_1, c_2$ be constants, which are defined concretely at the end of the proof. Define the empirical process as
\[
\nu_n(\beta) := (P_n - P)\rho_{f_\beta},
\]
and let
\[
Z_M := \sup_{\|\beta-\beta^*\|_1 \le M} |\nu_n(\beta)-\nu_n(\beta^*)|.
\]

First, we prepare the following lemma.

Lemma 2.

Let assumption 4 hold. Suppose $\|\psi_j\|_\infty \le 1$ for all $j$. Then we have $P(\{Z_M \le \gamma_n M\}) \ge 1-\delta$, where $\gamma_n$ is defined as in equation 4.7.

Proof.

See example 14.2 in section 14.8 of Bühlmann and Van De Geer (2011) for details.

Following the lemma, we have $Z_{M^*} \le \gamma_n M^*$ with high probability, where
\[
M^* := \frac{1}{\gamma_n}\Big(c_1 E(f_{\beta^*}) + \frac{c_2\lambda_n^2|S^*|}{\kappa\phi_{GRE}^*}\Big),
\]
with some positive constants $c_1$ and $c_2$. Hereafter, we assume this event holds. If $\|\hat{\beta}-\beta^*\|_1 \le M^*$, which we will show later, then we have
\begin{align*}
E(f_{\hat{\beta}}) + \lambda_n\Big(\|\hat{\beta}\|_1 + \frac{\alpha}{2}|\hat{\beta}|^\top R|\hat{\beta}|\Big)
&\le -\big(\nu_n(\hat{\beta})-\nu_n(\beta^*)\big) + E(f_{\beta^*}) + \lambda_n\Big(\|\beta^*\|_1 + \frac{\alpha}{2}|\beta^*|^\top R|\beta^*|\Big) \\
&\le Z_{M^*} + E(f_{\beta^*}) + \lambda_n\Big(\|\beta^*\|_1 + \frac{\alpha}{2}|\beta^*|^\top R|\beta^*|\Big) \\
&\le \gamma_n M^* + E(f_{\beta^*}) + \lambda_n\Big(\|\beta^*\|_1 + \frac{\alpha}{2}|\beta^*|^\top R|\beta^*|\Big).
\end{align*}
Decomposing $\beta$ as $\beta = \beta_{S^*} + \beta_{S^{*c}}$, we have
\[
E(f_{\hat{\beta}}) + \lambda_n\Big(\|\hat{\beta}_{S^*}\|_1 + \|\hat{\beta}_{S^{*c}}\|_1 + \frac{\alpha}{2}|\hat{\beta}_{S^*}|^\top R_{S^*S^*}|\hat{\beta}_{S^*}| + \alpha|\hat{\beta}_{S^{*c}}|^\top R_{S^{*c}S^*}|\hat{\beta}_{S^*}| + \frac{\alpha}{2}|\hat{\beta}_{S^{*c}}|^\top R_{S^{*c}S^{*c}}|\hat{\beta}_{S^{*c}}|\Big)
\le \gamma_n M^* + E(f_{\beta^*}) + \lambda_n\Big(\|\beta_{S^*}^*\|_1 + \frac{\alpha}{2}|\beta_{S^*}^*|^\top R_{S^*S^*}|\beta_{S^*}^*|\Big). \tag{B.33}
\]
Using this inequality, we can evaluate $E(f_{\hat{\beta}}) + \lambda_n\|\hat{\beta}-\beta^*\|_1$ as
\[
E(f_{\hat{\beta}}) + \lambda_n\|\hat{\beta}-\beta^*\|_1
= E(f_{\hat{\beta}}) + \lambda_n\|\hat{\beta}_{S^*}-\beta_{S^*}^*\|_1 + \lambda_n\|\hat{\beta}_{S^{*c}}\|_1
\le \gamma_n M^* + E(f_{\beta^*}) + \lambda_n\Big(2\|\beta_{S^*}^*\|_1 + \frac{\alpha}{2}|\beta_{S^*}^*|^\top R_{S^*S^*}|\beta_{S^*}^*|\Big). \tag{B.34}
\]
To obtain a tighter bound, we need to use quadratic terms of $\hat{\beta}$. Hereafter, we write $\hat{v} := \hat{\beta}-\beta^*$. Then we have, from equation B.33,
\[
E(f_{\hat{\beta}}) + \lambda_n\Big(\|\hat{v}_{S^{*c}}\|_1 + \alpha|\hat{v}_{S^{*c}}|^\top R_{S^{*c}S^*}|\hat{v}_{S^*}+\beta_{S^*}^*| + \frac{\alpha}{2}|\hat{v}_{S^{*c}}|^\top R_{S^{*c}S^{*c}}|\hat{v}_{S^{*c}}|\Big)
\le \gamma_n M^* + E(f_{\beta^*}) + \lambda_n\Big(\|\hat{v}_{S^*}\|_1 + \frac{\alpha}{2}\big(|\beta_{S^*}^*|^\top R_{S^*S^*}|\beta_{S^*}^*| - |\hat{\beta}_{S^*}|^\top R_{S^*S^*}|\hat{\beta}_{S^*}|\big)\Big).
\]
We can evaluate the last two terms as
\begin{align*}
|\beta_{S^*}^*|^\top R_{S^*S^*}|\beta_{S^*}^*| - |\hat{\beta}_{S^*}|^\top R_{S^*S^*}|\hat{\beta}_{S^*}|
&= |\beta_{S^*}^*|^\top R_{S^*S^*}|\beta_{S^*}^*| - |\beta_{S^*}^*|^\top R_{S^*S^*}|\hat{\beta}_{S^*}| + |\beta_{S^*}^*|^\top R_{S^*S^*}|\hat{\beta}_{S^*}| - |\hat{\beta}_{S^*}|^\top R_{S^*S^*}|\hat{\beta}_{S^*}| \\
&\le |\beta_{S^*}^*|^\top R_{S^*S^*}|\beta_{S^*}^*-\hat{\beta}_{S^*}| + |\hat{\beta}_{S^*}-\beta_{S^*}^*|^\top R_{S^*S^*}|\hat{\beta}_{S^*}| \\
&\le |\beta_{S^*}^*|^\top R_{S^*S^*}|\beta_{S^*}^*-\hat{\beta}_{S^*}| + |\hat{\beta}_{S^*}-\beta_{S^*}^*|^\top R_{S^*S^*}\big(|\beta_{S^*}^*| + |\beta_{S^*}^*-\hat{\beta}_{S^*}|\big) \\
&\le \big(2|\beta_{S^*}^*| + |\beta_{S^*}^*-\hat{\beta}_{S^*}|\big)^\top R_{S^*S^*}|\beta_{S^*}^*-\hat{\beta}_{S^*}| \\
&\le 2\big\|R_{S^*S^*}|\beta_{S^*}^*|\big\|_\infty\|\hat{v}_{S^*}\|_1 + D\|\hat{v}_{S^*}\|_1^2,
\end{align*}
where $D := \|R_{S^*S^*}\|_{\max}$. Hence, we obtain
\[
E(f_{\hat{\beta}}) + \lambda_n\Big(\|\hat{v}_{S^{*c}}\|_1 + \alpha|\hat{v}_{S^{*c}}|^\top R_{S^{*c}S^*}|\hat{v}_{S^*}+\beta_{S^*}^*| + \frac{\alpha}{2}|\hat{v}_{S^{*c}}|^\top R_{S^{*c}S^{*c}}|\hat{v}_{S^{*c}}|\Big)
\le \gamma_n M^* + E(f_{\beta^*}) + \lambda_n\Big(\|\hat{v}_{S^*}\|_1 + \alpha\big\|R_{S^*S^*}|\beta_{S^*}^*|\big\|_\infty\|\hat{v}_{S^*}\|_1 + \frac{\alpha D}{2}\|\hat{v}_{S^*}\|_1^2\Big). \tag{B.35}
\]
We further characterize this bound by dividing it into two cases:
  • Case 1. If $\lambda_n\big(\|\hat{v}_{S^*}\|_1 + \alpha\|R_{S^*S^*}|\beta_{S^*}^*|\|_\infty\|\hat{v}_{S^*}\|_1 + \frac{\alpha D}{2}\|\hat{v}_{S^*}\|_1^2\big) \le \gamma_n M^*$, then the inequality, equation B.35, reduces to
    \[
    E(f_{\hat{\beta}}) + \lambda_n\Big(\|\hat{v}_{S^{*c}}\|_1 + \alpha|\hat{v}_{S^{*c}}|^\top R_{S^{*c}S^*}|\hat{v}_{S^*}+\beta_{S^*}^*| + \frac{\alpha}{2}|\hat{v}_{S^{*c}}|^\top R_{S^{*c}S^{*c}}|\hat{v}_{S^{*c}}|\Big) \le 2\gamma_n M^* + E(f_{\beta^*}).
    \]
    Hence, we obtain
    \[
    E(f_{\hat{\beta}}) + \lambda_n\|\hat{v}\|_1 = E(f_{\hat{\beta}}) + \lambda_n\|\hat{v}_{S^{*c}}\|_1 + \lambda_n\|\hat{v}_{S^*}\|_1 \le 3\gamma_n M^* + E(f_{\beta^*}) = (3c_1+1)E(f_{\beta^*}) + \frac{3c_2\lambda_n^2|S^*|}{\kappa\phi_{GRE}^*},
    \]
    where we use $\lambda_n\|\hat{v}_{S^*}\|_1 \le \gamma_n M^*$ in the second line.
  • Case 2. If $\lambda_n\big(\|\hat{v}_{S^*}\|_1 + \alpha\|R_{S^*S^*}|\beta_{S^*}^*|\|_\infty\|\hat{v}_{S^*}\|_1 + \frac{\alpha D}{2}\|\hat{v}_{S^*}\|_1^2\big) \ge \gamma_n M^*$, we need the compatibility condition and the margin condition. In this case, the inequality, equation B.35, reduces to
    \begin{align*}
    E(f_{\hat{\beta}}) + \lambda_n\Big(\|\hat{v}_{S^{*c}}\|_1 + \alpha|\hat{v}_{S^{*c}}|^\top R_{S^{*c}S^*}|\hat{v}_{S^*}+\beta_{S^*}^*| + \frac{\alpha}{2}|\hat{v}_{S^{*c}}|^\top R_{S^{*c}S^{*c}}|\hat{v}_{S^{*c}}|\Big)
    &\le \gamma_n M^* + E(f_{\beta^*}) + \lambda_n\Big(\|\hat{v}_{S^*}\|_1 + \alpha\|R_{S^*S^*}|\beta_{S^*}^*|\|_\infty\|\hat{v}_{S^*}\|_1 + \frac{\alpha D}{2}\|\hat{v}_{S^*}\|_1^2\Big) \\
    &\le \Big(1+\frac{1}{c_1}\Big)\gamma_n M^* + \lambda_n\Big(\|\hat{v}_{S^*}\|_1 + \alpha\|R_{S^*S^*}|\beta_{S^*}^*|\|_\infty\|\hat{v}_{S^*}\|_1 + \frac{\alpha D}{2}\|\hat{v}_{S^*}\|_1^2\Big) \\
    &\le \Big(2+\frac{1}{c_1}\Big)\lambda_n\Big(\|\hat{v}_{S^*}\|_1 + \alpha\|R_{S^*S^*}|\beta_{S^*}^*|\|_\infty\|\hat{v}_{S^*}\|_1 + \frac{\alpha D}{2}\|\hat{v}_{S^*}\|_1^2\Big) \\
    &= \Big(2+\frac{1}{c_1}\Big)\lambda_n\Big(1 + \alpha\|R_{S^*S^*}|\beta_{S^*}^*|\|_\infty + \frac{\alpha D}{2}\|\hat{v}_{S^*}\|_1\Big)\|\hat{v}_{S^*}\|_1. \tag{B.36}
    \end{align*}
    Hence, we obtain
    \[
    E(f_{\hat{\beta}}) + \lambda_n\|\hat{v}\|_1 = E(f_{\hat{\beta}}) + \lambda_n\|\hat{v}_{S^{*c}}\|_1 + \lambda_n\|\hat{v}_{S^*}\|_1
    \le \lambda_n\Big[3+\frac{1}{c_1} + \Big(2+\frac{1}{c_1}\Big)\alpha\|R_{S^*S^*}|\beta_{S^*}^*|\|_\infty + \Big(2+\frac{1}{c_1}\Big)\frac{\alpha D}{2}\|\hat{v}_{S^*}\|_1\Big]\|\hat{v}_{S^*}\|_1. \tag{B.37}
    \]
    To characterize $\|\hat{v}_{S^*}\|_1$ in the brackets, we can see from equation B.34 that
    \[
    \lambda_n(\|\hat{v}_{S^*}\|_1 + \|\hat{v}_{S^{*c}}\|_1)
    \le \gamma_n M^* + E(f_{\beta^*}) + \lambda_n\Big(2\|\beta_{S^*}^*\|_1 + \frac{\alpha}{2}|\beta_{S^*}^*|^\top R_{S^*S^*}|\beta_{S^*}^*|\Big)
    \le \Big(1+\frac{1}{c_1}\Big)\gamma_n M^* + \lambda_n\Big(2\|\beta_{S^*}^*\|_1 + \frac{\alpha}{2}|\beta_{S^*}^*|^\top R_{S^*S^*}|\beta_{S^*}^*|\Big). \tag{B.38}
    \]
We further divide case 2 into two cases:

Case 2a. If $\lambda_n\big(2\|\beta_{S^*}^*\|_1 + \frac{\alpha}{2}|\beta_{S^*}^*|^\top R_{S^*S^*}|\beta_{S^*}^*|\big) \le 2\gamma_n M^*$, then equation B.34 reduces to
\[
E(f_{\hat{\beta}}) + \lambda_n\|\hat{v}\|_1 \le 3\gamma_n M^* + E(f_{\beta^*}) = (3c_1+1)E(f_{\beta^*}) + \frac{3c_2\lambda_n^2|S^*|}{\kappa\phi_{GRE}^*}.
\]
Case 2b. If $\lambda_n\big(2\|\beta_{S^*}^*\|_1 + \frac{\alpha}{2}|\beta_{S^*}^*|^\top R_{S^*S^*}|\beta_{S^*}^*|\big) \ge 2\gamma_n M^*$, then the inequality, equation B.38, reduces to
\[
\lambda_n(\|\hat{v}_{S^*}\|_1 + \|\hat{v}_{S^{*c}}\|_1)
\le \frac{1}{2}\Big(3+\frac{1}{c_1}\Big)\lambda_n\Big(2\|\beta_{S^*}^*\|_1 + \frac{\alpha}{2}|\beta_{S^*}^*|^\top R_{S^*S^*}|\beta_{S^*}^*|\Big)
\le \frac{1}{2}\Big(3+\frac{1}{c_1}\Big)\lambda_n\Big(2 + \frac{\alpha}{2}\big\|R_{S^*S^*}|\beta_{S^*}^*|\big\|_\infty\Big)\|\beta_{S^*}^*\|_1,
\]
which indicates
\[
\|\hat{v}_{S^*}\|_1 \le \frac{1}{2}\Big(3+\frac{1}{c_1}\Big)\Big(2 + \frac{\alpha}{2}\big\|R_{S^*S^*}|\beta_{S^*}^*|\big\|_\infty\Big)\|\beta_{S^*}^*\|_1. \tag{B.39}
\]
Hence, incorporating equations B.37 and B.39 yields
\begin{align*}
E(f_{\hat{\beta}}) + \lambda_n\|\hat{v}\|_1
&\le \lambda_n\Big[3+\frac{1}{c_1} + \Big(2+\frac{1}{c_1}\Big)\alpha\big\|R_{S^*S^*}|\beta_{S^*}^*|\big\|_\infty + \frac{1}{2}\Big(2+\frac{1}{c_1}\Big)\Big(3+\frac{1}{c_1}\Big)\frac{\alpha D}{2}\Big(2+\frac{\alpha}{2}\big\|R_{S^*S^*}|\beta_{S^*}^*|\big\|_\infty\Big)\|\beta_{S^*}^*\|_1\Big]\|\hat{v}_{S^*}\|_1 \\
&\le \lambda_n\Big[3+\frac{1}{c_1} + \frac{1}{2}\Big(2+\frac{1}{c_1}\Big)\Big(5+\frac{1}{c_1}\Big)\alpha D\|\beta^*\|_1 + \frac{1}{8}\Big(2+\frac{1}{c_1}\Big)\Big(3+\frac{1}{c_1}\Big)\big(\alpha D\|\beta^*\|_1\big)^2\Big]\|\hat{v}_{S^*}\|_1.
\end{align*}
If we take $\alpha'$ such that $\alpha D\|\beta^*\|_1 \le \alpha'$ and define
\[
c_3 := 3+\frac{1}{c_1} + \frac{1}{2}\Big(2+\frac{1}{c_1}\Big)\Big(5+\frac{1}{c_1}\Big)\alpha' + \frac{1}{8}\Big(2+\frac{1}{c_1}\Big)\Big(3+\frac{1}{c_1}\Big)\alpha'^2,
\]
we have
\[
E(f_{\hat{\beta}}) + \lambda_n\|\hat{v}\|_1 \le c_3\lambda_n\|\hat{v}_{S^*}\|_1. \tag{B.40}
\]
We can restrict the feasible region for $\hat{v}$ by equations B.36 and B.39 as
\begin{align*}
\|\hat{v}_{S^{*c}}\|_1 + \alpha|\hat{v}_{S^{*c}}|^\top R_{S^{*c}S^*}|\hat{v}_{S^*}+\beta_{S^*}^*| + \frac{\alpha}{2}|\hat{v}_{S^{*c}}|^\top R_{S^{*c}S^{*c}}|\hat{v}_{S^{*c}}|
&\le \Big(2+\frac{1}{c_1}\Big)\Big(1 + \alpha\big\|R_{S^*S^*}|\beta_{S^*}^*|\big\|_\infty + \frac{\alpha D}{2}\cdot\frac{1}{2}\Big(3+\frac{1}{c_1}\Big)\Big(2+\frac{\alpha}{2}\big\|R_{S^*S^*}|\beta_{S^*}^*|\big\|_\infty\Big)\|\beta_{S^*}^*\|_1\Big)\|\hat{v}_{S^*}\|_1 \\
&\le \Big(2+\frac{1}{c_1}\Big)\Big(1 + \frac{1}{2}\Big(5+\frac{1}{c_1}\Big)\alpha D\|\beta_{S^*}^*\|_1 + \frac{1}{8}\Big(3+\frac{1}{c_1}\Big)\big(\alpha D\|\beta_{S^*}^*\|_1\big)^2\Big)\|\hat{v}_{S^*}\|_1 \\
&\le \Big(2+\frac{1}{c_1}\Big)\Big(1 + \frac{1}{2}\Big(5+\frac{1}{c_1}\Big)\alpha' + \frac{1}{8}\Big(3+\frac{1}{c_1}\Big)\alpha'^2\Big)\|\hat{v}_{S^*}\|_1.
\end{align*}
If we take $c_1$ and $\alpha'$ satisfying
\[
\Big(2+\frac{1}{c_1}\Big)\Big(1 + \frac{1}{2}\Big(5+\frac{1}{c_1}\Big)\alpha' + \frac{1}{8}\Big(3+\frac{1}{c_1}\Big)\alpha'^2\Big) \le 3,
\]
then we have $\hat{v}\in B(S^*,3,1)$, where
\[
B(S^*,3,1) := \Big\{v : \|v_{S^{*c}}\|_1 + \alpha|v_{S^{*c}}|^\top R_{S^{*c}S^*}|v_{S^*}+\beta_{S^*}^*| + \frac{\alpha}{2}|v_{S^{*c}}|^\top R_{S^{*c}S^{*c}}|v_{S^{*c}}| \le 3\|v_{S^*}\|_1\Big\}.
\]
Hence, we have only to impose the restricted eigenvalue condition on the set $B(S^*,3,1)$.
According to the restricted eigenvalue condition, we have
\[
\|v\|_2^2 \le \frac{\|f_v\|^2}{\phi_{GRE}^*}, \quad \forall v\in B(S^*,3,1).
\]
Incorporating $\|\hat{v}_{S^*}\|_1 \le \sqrt{|S^*|}\|\hat{v}_{S^*}\|_2 \le \sqrt{|S^*|}\|\hat{v}\|_2$, we have
\[
\|\hat{v}_{S^*}\|_1 \le \sqrt{\frac{|S^*|}{\phi_{GRE}^*}}\,\|f_{\hat{v}}\|.
\]
Hence, from equation B.40, we have
\begin{align*}
E(f_{\hat{\beta}}) + \lambda_n\|\hat{v}\|_1
&\le c_3\lambda_n\sqrt{\frac{|S^*|}{\phi_{GRE}^*}}\,\|f_{\hat{\beta}}-f_{\beta^*}\| \\
&\le \frac{4c_3\lambda_n^2|S^*|}{\kappa\phi_{GRE}^*} + \frac{c_3\kappa}{16}\|f_{\hat{\beta}}-f_{\beta^*}\|^2 \\
&\le \frac{4c_3\lambda_n^2|S^*|}{\kappa\phi_{GRE}^*} + \frac{c_3\kappa}{16}\big(\|f_{\hat{\beta}}-f_0\| + \|f_{\beta^*}-f_0\|\big)^2 \\
&\le \frac{4c_3\lambda_n^2|S^*|}{\kappa\phi_{GRE}^*} + \frac{c_3\kappa}{8}\|f_{\hat{\beta}}-f_0\|^2 + \frac{c_3\kappa}{8}\|f_{\beta^*}-f_0\|^2 \\
&\le \frac{4c_3\lambda_n^2|S^*|}{\kappa\phi_{GRE}^*} + \frac{c_3}{8}E(f_{\hat{\beta}}) + \frac{c_3}{8}E(f_{\beta^*}),
\end{align*}
where we use the restricted eigenvalue condition in the first line, $uv \le 4u^2 + v^2/16$ in the second line, the triangle inequality in the third line, $(u+v)^2 \le 2(u^2+v^2)$ in the fourth line, and the margin condition with $f_{\hat{\beta}}\in\mathcal{F}_\eta$ and $f_{\beta^*}\in\mathcal{F}_\eta$ in the last line. This implies
\[
\Big(1-\frac{c_3}{8}\Big)E(f_{\hat{\beta}}) + \lambda_n\|\hat{v}\|_1 \le \frac{c_3}{8}E(f_{\beta^*}) + \frac{4c_3\lambda_n^2|S^*|}{\kappa\phi_{GRE}^*};
\]
hence, we obtain
\[
E(f_{\hat{\beta}}) + \lambda_n\|\hat{v}\|_1 \le \frac{c_3}{8-c_3}E(f_{\beta^*}) + \frac{32c_3\lambda_n^2|S^*|}{(8-c_3)\kappa\phi_{GRE}^*},
\]
where we assume $c_3 < 8$, which is satisfied by taking appropriate $c_1$ and $\alpha'$.
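For completeness, the elementary inequality $uv \le 4u^2 + v^2/16$ invoked above is a complete-the-square (weighted AM–GM) bound:

```latex
0 \le \Big(2u - \frac{v}{4}\Big)^2 = 4u^2 - uv + \frac{v^2}{16}
\quad\Longrightarrow\quad
uv \le 4u^2 + \frac{v^2}{16}.
```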
Incorporating cases 1, 2a, and 2b, we have shown that
\[
E(f_{\hat{\beta}}) + \lambda_n\|\hat{v}\|_1 \le c_4\gamma_n M^*, \tag{B.41}
\]
where
\[
c_4 := \max\Big\{3+\frac{1}{c_1},\ \frac{c_3}{c_1(8-c_3)},\ \frac{32c_3}{c_2(8-c_3)}\Big\}.
\]
On the other hand, we have
\[
E(f_{\hat{\beta}}) \ge \kappa\|f_{\hat{\beta}}-f_0\|^2 \ge \frac{\kappa}{2}\|f_{\hat{\beta}}-f_{\beta^*}\|^2 - \kappa\|f_{\beta^*}-f_0\|^2 \ge \frac{\kappa}{2}\phi_{GRE}^*\|\hat{\beta}-\beta^*\|_2^2 - E(f_{\beta^*}),
\]
where we used $(u+v)^2 \le 2(u^2+v^2)$ in the second inequality and the generalized restricted eigenvalue and margin conditions in the last inequality. Therefore, we have
\[
\frac{\kappa}{2}\phi_{GRE}^*\|\hat{\beta}-\beta^*\|_2^2 \le (c_1c_4+1)E(f_{\beta^*}) + \frac{c_2c_4\lambda_n^2|S^*|}{\kappa\phi_{GRE}^*},
\]
that is,
\[
\|\hat{\beta}-\beta^*\|_2^2 \le \frac{2(c_1c_4+1)}{\kappa\phi_{GRE}^*}E(f_{\beta^*}) + \frac{2c_2c_4\lambda_n^2|S^*|}{\kappa^2\phi_{GRE}^{*2}}.
\]
This concludes the assertion of theorem 15.
Now we show $\|\hat{\beta}-\beta^*\|_1 \le M^*$. Define
\[
\tilde{\beta} := \beta^* + t(\hat{\beta}-\beta^*), \qquad t := \frac{M^*}{M^* + \|\hat{\beta}-\beta^*\|_1}.
\]
Then $\|\tilde{\beta}-\beta^*\|_1 \le M^*$, and hence all the above inequalities hold with $\tilde{\beta}$ substituted for $\hat{\beta}$. In particular, equation B.41 implies
\[
\|\tilde{\beta}-\beta^*\|_1 \le c_4\frac{\gamma_n}{\lambda_n}M^* \le \frac{M^*}{2},
\]
where we assume $c_4\gamma_n/\lambda_n \le 1/2$. Because $\|\tilde{\beta}-\beta^*\|_1 = t\|\hat{\beta}-\beta^*\|_1$, we have
\[
\frac{M^*}{M^* + \|\hat{\beta}-\beta^*\|_1}\|\hat{\beta}-\beta^*\|_1 \le \frac{M^*}{2}.
\]
Hence, we obtain $\|\hat{\beta}-\beta^*\|_1 \le M^*$.
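The last implication is elementary algebra; rearranging the preceding display gives

```latex
\frac{M^*\,\|\hat{\beta}-\beta^*\|_1}{M^* + \|\hat{\beta}-\beta^*\|_1} \le \frac{M^*}{2}
\;\Longrightarrow\;
2\|\hat{\beta}-\beta^*\|_1 \le M^* + \|\hat{\beta}-\beta^*\|_1
\;\Longrightarrow\;
\|\hat{\beta}-\beta^*\|_1 \le M^*.
```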

Finally, we take $c_1 = 3$, $c_2 = 9$, and $\alpha' = 1/12$, which satisfy all the aforementioned assumptions.

B.8  Proof of Corollary 18

We can see that the logistic loss is Lipschitz continuous with Lipschitz constant 1 because it holds that
\[
\Big|\frac{\partial}{\partial f}l(f,y)\Big| = \Big|-y + \frac{\exp(f)}{1+\exp(f)}\Big| \le 1.
\]
Next, we show that the logistic regression satisfies the quadratic margin condition. For $f\in\mathcal{F}_\eta$,
\begin{align*}
\frac{\partial^2}{\partial f^2}l(f,\cdot) = \frac{\exp(f)}{(1+\exp(f))^2}
&\ge \frac{\exp(|f_0|+\eta)}{(1+\exp(|f_0|+\eta))^2}
\ge \frac{1}{(1+\exp(|f_0|+\eta))^2} \\
&= \frac{1}{\Big(1+\exp(\eta)\max\big\{\frac{\pi}{1-\pi},\frac{1-\pi}{\pi}\big\}\Big)^2}
\ge \frac{1}{\Big(1+\exp(\eta)\frac{1-\varepsilon_0}{\varepsilon_0}\Big)^2}
\ge \frac{1}{(\exp(\eta)/\varepsilon_0+1)^2}.
\end{align*}
Hence, the quadratic margin condition holds with $\kappa := (\exp(\eta)/\varepsilon_0+1)^{-2}$.
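Both claims (the Lipschitz constant 1 and the curvature lower bound $\kappa$) are easy to probe numerically on a grid over $\mathcal{F}_\eta$. The sketch below uses illustrative values $\eta=1$ and $\varepsilon_0=0.2$ of our own choosing.

```python
import numpy as np

def dloss(f, y):
    # First derivative of the logistic loss in f: -y + exp(f)/(1+exp(f))
    return -y + np.exp(f) / (1.0 + np.exp(f))

def d2loss(f):
    # Second derivative: exp(f)/(1+exp(f))^2
    return np.exp(f) / (1.0 + np.exp(f)) ** 2

eta, eps0 = 1.0, 0.2                       # illustrative eta and epsilon_0
f0 = np.log((1 - eps0) / eps0)             # |f0| is largest when pi hits eps0 or 1-eps0
kappa = (np.exp(eta) / eps0 + 1) ** -2     # margin constant from the derivation above

f = np.linspace(-(abs(f0) + eta), abs(f0) + eta, 1001)
# Lipschitz constant 1 for y in {0, 1}.
assert np.all(np.abs(dloss(f, 0)) <= 1) and np.all(np.abs(dloss(f, 1)) <= 1)
# Quadratic margin: the curvature is bounded below by kappa on this range.
assert np.all(d2loss(f) >= kappa)
```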
On the other hand, we have
\[
\|f_\beta - f_0\|_\infty \le \|f_\beta - f_{\beta^*}\|_\infty + \|f_{\beta^*} - f_0\|_\infty.
\]
For $\|\beta-\beta^*\|_1 \le M^*$, it holds that
\[
\|f_\beta - f_{\beta^*}\|_\infty = \|(\beta-\beta^*)^\top X\|_\infty \le \|\beta-\beta^*\|_1\|X\|_{\max} \le M^* = \frac{1}{\gamma_n}\Big(3E(f_{\beta^*}) + \frac{9(\exp(\eta)/\varepsilon_0+1)^2\lambda_n^2|S_{\beta^*}|}{\phi_{GRE}^*}\Big).
\]
Since we assume
\[
\|f_{\beta^*}-f_0\|_\infty \le \frac{\eta}{3}, \qquad \frac{E(f_{\beta^*})}{\gamma_n} \le \frac{\eta}{9}, \qquad \frac{T^2(\exp(\eta)/\varepsilon_0+1)^2\gamma_n|S_{\beta^*}|}{\phi_{GRE}^*} \le \frac{\eta}{27},
\]
with $\lambda_n \le T\gamma_n$, we obtain
\[
\|f_\beta - f_0\|_\infty \le \eta.
\]

From the above, all of the assumptions in theorem 15 have been verified.

Appendix C:  Optimization for Logistic Regression

We derive the coordinate descent algorithm of the IILasso for a binary response. The objective function is
\[
L(\beta) = -\frac{1}{n}\sum_i\big[y_iX_i\beta - \log(1+\exp(X_i\beta))\big] + \lambda\Big(\|\beta\|_1 + \frac{\alpha}{2}|\beta|^\top R|\beta|\Big),
\]
where $X_i$ is the $i$th row of $X=[1,X_1,\ldots,X_p]$ and $\beta=[\beta_0,\beta_1,\ldots,\beta_p]^\top$. Forming a quadratic approximation around the current estimate $\bar{\beta}$, we have
\[
\bar{L}(\beta) = \frac{1}{2n}\sum_{i=1}^n w_i(z_i - X_i\beta)^2 + C(\bar{\beta}) + \lambda\Big(\|\beta\|_1 + \frac{\alpha}{2}|\beta|^\top R|\beta|\Big),
\]
where
\[
z_i = X_i\bar{\beta} + \frac{y_i - \bar{p}(X_i)}{\bar{p}(X_i)(1-\bar{p}(X_i))}, \qquad w_i = \bar{p}(X_i)(1-\bar{p}(X_i)), \qquad \bar{p}(X_i) = \frac{1}{1+\exp(-X_i\bar{\beta})}.
\]
To derive the update equation, when $\beta_j \ne 0$, differentiating the quadratic objective function with respect to $\beta_j$ yields
\begin{align*}
\frac{\partial}{\partial\beta_j}\bar{L}(\beta)
&= -\frac{1}{n}\sum_{i=1}^n w_i(z_i - X_i\beta)X_{ij} + \lambda\,\mathrm{sgn}(\beta_j) + \lambda\alpha R_j^\top|\beta|\,\mathrm{sgn}(\beta_j) \\
&= -\frac{1}{n}\sum_{i=1}^n w_i\big(z_i - X_{i,-j}\beta_{-j}\big)X_{ij} + \Big(\frac{1}{n}\sum_{i=1}^n w_iX_{ij}^2 + \lambda\alpha R_{jj}\Big)\beta_j + \lambda\big(1 + \alpha R_{j,-j}|\beta_{-j}|\big)\mathrm{sgn}(\beta_j).
\end{align*}
This yields the update
\[
\beta_j \leftarrow \frac{1}{\frac{1}{n}\sum_{i=1}^n w_iX_{ij}^2 + \lambda\alpha R_{jj}}\, S\Big(\frac{1}{n}\sum_{i=1}^n w_i\big(z_i - X_{i,-j}\beta_{-j}\big)X_{ij},\ \lambda\big(1+\alpha R_{j,-j}|\beta_{-j}|\big)\Big),
\]
where $S(z,\gamma) := \mathrm{sgn}(z)\max\{|z|-\gamma,0\}$ is the soft-thresholding operator.
These procedures amount to a sequence of nested loops. The whole algorithm is described in algorithm 2.
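The nested loops above (an outer IRLS reweighting step and an inner coordinate descent pass) can be sketched as follows. This is a minimal illustrative implementation, not the authors' reference code: it omits the intercept for brevity, floors the weights to avoid division by zero, and all function names, iteration counts, and tolerances are assumptions of this sketch.

```python
import numpy as np

def soft_threshold(z, g):
    """S(z, g) = sgn(z) * max(|z| - g, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def iilasso_logistic(X, y, R, lam, alpha, n_outer=20, n_inner=50, tol=1e-6):
    """Coordinate descent sketch for the logistic IILasso (no intercept, for brevity).

    X: (n, p) design, y: {0, 1} responses, R: (p, p) correlation-based penalty matrix.
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_outer):                        # outer IRLS loop
        f = X @ beta
        prob = 1.0 / (1.0 + np.exp(-f))             # p_bar(X_i)
        w = np.maximum(prob * (1.0 - prob), 1e-5)   # weights w_i, floored for stability
        z = f + (y - prob) / w                      # working responses z_i
        for _ in range(n_inner):                    # inner coordinate descent
            delta = 0.0
            for j in range(p):
                r_j = z - X @ beta + X[:, j] * beta[j]   # z_i - X_{i,-j} beta_{-j}
                num = np.mean(w * r_j * X[:, j])
                denom = np.mean(w * X[:, j] ** 2) + lam * alpha * R[j, j]
                thr = lam * (1.0 + alpha * (R[j] @ np.abs(beta) - R[j, j] * np.abs(beta[j])))
                b_new = soft_threshold(num, thr) / denom
                delta = max(delta, abs(b_new - beta[j]))
                beta[j] = b_new
            if delta < tol:
                break
    return beta
```

With `R` set to the zero matrix the inner update reduces to the standard lasso coordinate update for weighted least squares, which is a convenient sanity check.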

Notes

1

This letter is an extended version of the work published in Takada, Suzuki, and Fujisawa (2018); we extend our method from linear models to GLMs.

Acknowledgments

T.S. was partially supported by MEXT KAKENHI (25730013, 25120012, 26280009, 15H01678, 15H05707), JST-PRESTO, and JST-CREST. H.F. was partially supported by MEXT KAKENHI 17K00065.

References

Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., & Levine, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12), 6745–6750.
Bickel, P. J., Ritov, Y., & Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37(4), 1705–1732.
Bogdan, M., Van Den Berg, E., Sabatti, C., Su, W., & Candès, E. J. (2015). SLOPE: Adaptive variable selection via convex optimization. Annals of Applied Statistics, 9(3), 1103.
Breheny, P., & Huang, J. (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics, 5(1), 232.
Bühlmann, P., & Van De Geer, S. (2011). Statistics for high-dimensional data: Methods, theory and applications. New York: Springer Science & Business Media.
Candes, E. J., Wakin, M. B., & Boyd, S. P. (2008). Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14(5–6), 877–905.
Chen, S.-B., Ding, C., Luo, B., & Xie, Y. (2013). Uncorrelated Lasso. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence (pp. 166–172). Palo Alto, CA: AAAI.
Chiaretti, S., Li, X., Gentleman, R., Vitale, A., Vignetti, M., Mandelli, F., Ritz, J., & Foa, R. (2004). Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival. Blood, 103(7), 2771–2778.
Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.
Figueiredo, M., & Nowak, R. (2016). Ordered weighted ℓ1 regularized regression with strongly correlated covariates: Theoretical aspects. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (pp. 930–938).
Friedman, J., Hastie, T., Höfling, H., & Tibshirani, R. (2007). Pathwise coordinate optimization. Annals of Applied Statistics, 1(2), 302–332.
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1.
Fuchs, J.-J. (2005). Recovery of exact sparse representations in the presence of bounded noise. IEEE Transactions on Information Theory, 51(10), 3601–3608.
Gordon, G. J., Jensen, R. V., Hsiao, L.-L., Gullans, S. R., Blumenstock, J. E., Ramaswamy, S., … Bueno, R. (2002). Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Research, 62(17), 4963–4967.
Grave, E., Obozinski, G. R., & Bach, F. R. (2011). Trace Lasso: A trace norm regularization for correlated designs. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 24 (pp. 2187–2195). Red Hook, NY: Curran.
Gravier, E., Pierron, G., Vincent-Salomon, A., Gruel, N., Raynal, V., Savignoni, A., … Fourquet, A. (2010). A prognostic DNA signature for T1T2 node-negative breast cancer patients. Genes, Chromosomes and Cancer, 49(12), 1125–1134.
Hara, S., & Maehara, T. (2017). Enumerate Lasso solutions for feature selection. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence.
Kong, D., Fujimaki, R., Liu, J., Nie, F., & Ding, C. (2014). Exclusive feature learning on arbitrary structures via ℓ1,2-norm. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems (pp. 1655–1663). Red Hook, NY: Curran.
Lipton, Z. C. (2018). The mythos of model interpretability. Queue, 16(3), 31–57.
Lorbert, A., Eis, D., Kostina, V., Blei, D., & Ramadge, P. (2010). Exploiting covariate similarity in sparse regression via the pairwise elastic net. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (pp. 477–484).
Mazumder, R., Friedman, J. H., & Hastie, T. (2011). SparseNet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 106(495), 1125–1138.
Meinshausen, N., & Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34(3), 1436–1462.
Molnar, C. (2018). Interpretable machine learning: A guide for making black box models explainable. Leanpub.
Pomeroy, S. L., Tamayo, P., Gaasenbeek, M., Sturla, L. M., Angelo, M., McLaughlin, M. E., … Allen, J. C. (2002). Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415(6870), 436–442.
Raskutti, G., Wainwright, M. J., & Yu, B. (2010). Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11, 2241–2259.
Raskutti, G., Wainwright, M. J., & Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10), 6976–6994.
Rigollet, P., & Hütter, J.-C. (2015). High dimensional statistics. Lecture notes for course 18S997, MIT.
Shipp, M. A., Ross, K. N., Tamayo, P., Weng, A. P., Kutok, J. L., Aguiar, R. C., … Ray, T. S. (2002). Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine, 8(1), 68–74.
Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., … Sellers, R. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2), 203–209.
Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., … Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43), 15545–15550.
Takada, M., Suzuki, T., & Fujisawa, H. (2018). Independently interpretable lasso: A new regularizer for sparse regression with uncorrelated variables. In Proceedings of the International Conference on Artificial Intelligence and Statistics (pp. 454–463).
Tian, E., Zhan, F., Walker, R., Rasmussen, E., Ma, Y., Barlogie, B., & Shaughnessy, J. D., Jr. (2003). The role of the WNT-signaling antagonist DKK1 in the development of osteolytic lesions in multiple myeloma. New England Journal of Medicine, 349(26), 2483–2494.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.
Tropp, J. A. (2006). Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3), 1030–1051.
Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3), 475–494.
Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5), 2183–2202.
West, M., Blanchette, C., Dressman, H., Huang, E., Ishida, S., Spang, R., … Nevins, J. R. (2001). Predicting the clinical status of human breast cancer by using gene expression profiles. Proceedings of the National Academy of Sciences, 98(20), 11462–11467.
Zeng, X., & Figueiredo, M. A. (2014). The ordered weighted ℓ1 norm: Atomic formulation, projections, and algorithms. arXiv:1409.4271.
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38(2), 894–942.
Zhao, P., & Yu, B. (2006). On model selection consistency of lasso. Journal of Machine Learning Research, 7, 2541–2563.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.