## Abstract

Sparse regularization such as $\ell_1$ regularization is a powerful and widely used strategy for high-dimensional learning problems. The effectiveness of sparse regularization has been supported both practically and theoretically by several studies. However, one of the biggest issues in sparse regularization is that its performance is quite sensitive to correlations between features. Ordinary $\ell_1$ regularization selects variables that are correlated with each other under weak regularization, which degrades not only the estimation error but also the interpretability. In this letter, we propose a new regularization method, the independently interpretable lasso (IILasso), for generalized linear models. Our proposed regularizer suppresses the selection of correlated variables, so that each active variable affects the response independently in the model. Hence, we can interpret regression coefficients intuitively, and performance is also improved by avoiding overfitting. We analyze the theoretical properties of the IILasso and show that the proposed method is advantageous for sign recovery and achieves an almost minimax optimal convergence rate. Synthetic and real data analyses also indicate the effectiveness of the IILasso.

## 1  Introduction

High-dimensional data appear in many fields, such as biology, economics, and industry. A common approach to high-dimensional regression is a sparse regularization strategy such as the Lasso (least absolute shrinkage and selection operator; Tibshirani, 1996). Since sparse regularization performs parameter estimation and variable selection simultaneously, it offers interpretable results by identifying informative variables, and it can effectively avoid overfitting by discarding redundant variables. Because of these properties, sparse regularization has had huge success in a wide range of data analyses in science and engineering. In addition, several theoretical studies have been developed to support the effectiveness of sparse regularization, and efficient optimization methods have also been proposed, so sparse learning is easy to carry out.

However, one of the biggest issues is that the performance of sparse regularization is guaranteed only under "small correlation" assumptions, that is, assumptions that the variables are not strongly correlated with each other. Indeed, typical theoretical supports are based on some kind of small correlation assumption, such as the restricted eigenvalue condition (Bickel, Ritov, & Tsybakov, 2009; Bühlmann & Van De Geer, 2011). When the variables are highly correlated, redundant variables tend to be selected, and the estimation error deteriorates because of overfitting.

In addition, correlated variables hinder the interpretation of models. A standard interpretation of a (generalized) linear model comes from regression coefficients and effects (products of regression coefficients and standard deviations). A single coefficient represents the degree to which an increase of its variable by one unit increases the prediction "when all other active variables remain fixed." When an output model contains correlated variables, we must consider the influence of all other correlated active variables simultaneously, which can be intractable. A single effect represents the degree to which its variable varies the prediction on average. However, the value of the effect may be misleading in the presence of strong correlations because it is unrealistic to increase a variable by one standard deviation without changing other correlated variables. Similar properties were described as disadvantages of linear models in section 4.1 of Molnar (2018). This kind of interpretability was also referred to as "decomposability" in Lipton (2018), meaning whether we can decompose a model into parts and interpret each component independently. In this sense, correlated variables degrade model decomposability. Therefore, it is beneficial to construct a model with uncorrelated variables for both estimation error and interpretability.

One line of research that takes correlations among variables into account is based on the strategy that correlated variables are either all selected or all excluded. Examples in this line are the elastic net (Zou & Hastie, 2005), the pairwise elastic net (Lorbert, Eis, Kostina, Blei, & Ramadge, 2010), and the trace Lasso (Grave, Obozinski, & Bach, 2011). These methods select not only important variables but also unimportant variables that are strongly correlated with important ones. Although they often give stable generalization error, this strategy makes it hard to interpret the model, because many correlated variables are incorporated into the output model, and we can no longer distinguish which variables truly affect the response.

To facilitate interpretability of the model, another line of research, including ours, is based on the strategy in which uncorrelated variables are selected. The uncorrelated Lasso (Chen, Ding, Luo, & Xie, 2013) intends to construct a model with uncorrelated variables. However, it still tends to select negatively correlated variables; hence, the correlation problem is not resolved. The exclusive group Lasso (Kong, Fujimaki, Liu, Nie, & Ding, 2014) is also in this line, but it requires grouping correlated variables beforehand. The authors suggest that we group variables whose correlations are over a certain threshold. However, determination of the threshold is not a trivial problem, and it causes unstable results.

In this letter, we propose a new regularization method, the independently interpretable Lasso (IILasso), for linear models with quadratic loss, together with its extension to generalized linear models (GLMs) with general losses. Our method offers efficient variable selection, does not select negatively correlated variables, and is free from specific preprocessing such as grouping. Our proposed regularization formulation more aggressively induces sparsity among the active variables and reduces the correlations among them. Hence, we can independently interpret the effects of the active variables in the output model, and the estimation error is also much improved. To support the usefulness of our proposal, we make the following contributions:

• We derive the necessary and sufficient condition for the sign consistency of variable selection and show that our method achieves sign consistency under a milder condition than the Lasso for correlated designs as long as the true important variables are uncorrelated.

• The convergence rate of the estimation error is analyzed. We show that our estimation error achieves almost the minimax optimal rate and has an advantage over the Lasso under certain conditions.

• We propose a coordinate descent algorithm to find a local optimum of the objective function, which is guaranteed to converge to a stationary point. In addition, we show that every local optimal solution achieves the same statistical error rate as the global optimal solution and thus is almost minimax optimal.

• We extend the method from linear models to GLMs and derive its convergence rate.

The rest of this letter is organized as follows. In section 2, we propose the new regularization formulation for linear models and introduce its optimization method. We also state the relationship to existing work. In section 3, we show theoretical results on sign recovery and estimation error. In section 4, we extend the IILasso to GLMs and show its estimation error. In section 5, we present both synthetic and real-world data experiments, including 10 microarray data sets. In section 6, we summarize the properties of the IILasso.

### 1.1  Notations

Let $v\in\mathbb{R}^p$. $\mathrm{Diag}(v)\in\mathbb{R}^{p\times p}$ is the diagonal matrix whose $j$th diagonal element is $v_j$. $|v|$ is the elementwise absolute vector whose $j$th element is $|v_j|$. $\mathrm{sgn}(v)$ is the sign vector whose elements are $1$ for $v_j>0$, $-1$ for $v_j<0$, and $0$ for $v_j=0$. $\mathrm{supp}(v)$ is the support set of $v$: $\mathrm{supp}(v)=\{j\in\{1,\ldots,p\}\mid v_j\ne0\}$. $\|v\|_q$ is the $\ell_q$-norm: $\|v\|_q=(\sum_{j=1}^p|v_j|^q)^{1/q}$. Let $M\in\mathbb{R}^{n\times p}$. We use subscripts for the columns of $M$; for example, $M_j$ denotes the $j$th column. $\|M\|_q$ is the operator norm (induced norm): $\|M\|_q=\sup_{v\in\mathbb{R}^p}\|Mv\|_q/\|v\|_q$. Specifically, $\|M\|_2=\sup_{v\in\mathbb{R}^p}\|Mv\|_2/\|v\|_2$ is the spectral norm (the largest singular value of $M$), and $\|M\|_\infty=\max_i\sum_j|M_{ij}|$ is the maximum absolute row sum norm. $\|M\|_{\max}$ is the max norm: $\|M\|_{\max}=\max_{ij}|M_{ij}|$. Let $M\in\mathbb{R}^{p\times p}$. $M\succeq O$ and $M\succ O$ denote that $M$ is positive semidefinite and positive definite, that is, $v^\top Mv\ge0$ for $\forall v\in\mathbb{R}^p$ and $v^\top Mv>0$ for $\forall v\in\mathbb{R}^p\setminus\{0\}$, respectively. Let $S$ be a subset of $\{1,\ldots,p\}$. $|S|$ is the number of elements in $S$. $S^c$ is the complement of $S$: $S^c=\{1,\ldots,p\}\setminus S$. $v_S$ is the vector $v$ restricted to the index set $S$. $M_{S_1S_2}$ is the matrix whose row indexes are restricted to $S_1$ and column indexes to $S_2$.

## 2  Method

### 2.1  Problem Setting

Consider a problem of predicting a response $y∈Rn$, given a design matrix $X∈Rn×p$, assuming a linear model
$y=Xβ+ɛ,$
where $\beta\in\mathbb{R}^p$ is a regression coefficient vector and $\varepsilon\in\mathbb{R}^n$ is a noise vector. We assume that the variables are standardized such that $\sum_{i=1}^nX_{ij}=0$, $\sum_{i=1}^nX_{ij}^2/n=1$, and $\sum_{i=1}^ny_i=0$. The Lasso solves the following optimization problem,
$\min_\beta\ \frac{1}{2n}\|y-X\beta\|_2^2+\lambda\|\beta\|_1,$
where $λ>0$ is a regularization parameter. Since the last penalty term induces the sparsity of estimated values, the output model contains few variables and hence is tractable, which is critically important for interpretability. However, as we described in section 1, when variables are highly correlated, the Lasso can select correlated variables especially for a small $λ$, resulting in worse interpretability and poor estimation error.

### 2.2  Proposed Method

In this letter, we propose a new regularization formulation as follows:
$\min_\beta\ L(\beta),\qquad L(\beta):=\frac{1}{2n}\|y-X\beta\|_2^2+\lambda\left(\|\beta\|_1+\frac{\alpha}{2}|\beta|^\top R|\beta|\right),$
(2.1)
where $\alpha>0$ is a regularization parameter for the new regularization term, and $R\in\mathbb{R}^{p\times p}$ is a symmetric matrix whose component $R_{jk}\ge0$ is a monotonically increasing function of the absolute correlation $r_{jk}=\frac{1}{n}|X_j^\top X_k|$ for $j\ne k$. Some concrete definitions of $R$ are described below. The last term, $\frac{\lambda\alpha}{2}|\beta|^\top R|\beta|=\frac{\lambda\alpha}{2}\sum_{j=1}^p\sum_{k=1}^pR_{jk}|\beta_j||\beta_k|$, is an addition to the Lasso. Since $R_{jk}$ represents the similarity between $X_j$ and $X_k$, correlated variables are unlikely to be selected simultaneously in our formulation. In particular, when $X_j$ and $X_k$ are strongly correlated, the squared error barely changes as long as $\beta_j+\beta_k$ is constant, but the penalty $R_{jk}|\beta_j||\beta_k|$ strongly induces either $\beta_j=0$ or $\beta_k=0$. On the contrary, if $X_j$ and $X_k$ are uncorrelated (i.e., $R_{jk}$ is small), then the penalty for selecting both $\beta_j$ and $\beta_k$ is negligible, and the formulation reduces to the ordinary Lasso.

We can consider some variations in the definition of the similarity matrix $R$. One natural choice is $R_{jk}=r_{jk}^2$. In this case, $R$ is positive semidefinite because the Hadamard product of positive semidefinite matrices is also positive semidefinite. Hence, problem 2.1 turns out to be convex, and the global optimal solution is easy to obtain. However, it may not reduce correlations enough. Another choice is $R_{jk}=|r_{jk}|$, which reduces correlations more strongly. Yet another effective choice is $R_{jk}=|r_{jk}|/(1-|r_{jk}|)$ for $j\ne k$ and $R_{jk}=0$ for $j=k$. In this case, if the correlation between two variables becomes higher (i.e., $r_{jk}\to1$), then the penalty term diverges, and the IILasso cannot select both of them simultaneously. We use the last choice in our numerical experiments because it is favorable in light of the theoretical results in sections 3 and 4.
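For concreteness, the following minimal sketch (our own illustrative code, not the authors' implementation; the function name and interface are ours) computes the three variants of $R$ from a standardized design matrix:

```python
import numpy as np

def similarity_matrix(X, kind="ratio"):
    # A minimal sketch: builds the similarity matrix R from a standardized
    # design X (columns with mean 0 and variance 1), for the three choices above.
    n, _ = X.shape
    r = np.abs(X.T @ X) / n                  # absolute correlations |r_jk|
    if kind == "square":                     # R_jk = r_jk^2 (keeps R PSD, so 2.1 is convex)
        return r ** 2
    if kind == "abs":                        # R_jk = |r_jk|
        return r
    if kind == "ratio":                      # R_jk = |r_jk| / (1 - |r_jk|), with R_jj = 0
        R = r / np.maximum(1.0 - r, 1e-12)   # guard against division by zero
        np.fill_diagonal(R, 0.0)
        return R
    raise ValueError(f"unknown kind: {kind}")
```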

Constraint regions corresponding to our regularization term indicate how our method provides an exclusive effect. Figure 1 illustrates the constraint regions of $\|\beta\|_1+|\beta|^\top R|\beta|/2$ for the case $p=2$. As the diagonal elements of $R$ increase (from the top to the bottom panel), the contours become smooth at the coordinate axes. Because of this, the solution tends to select both variables if the two variables are strongly correlated. This is the grouping effect of the elastic net, as we describe later. But as the off-diagonal elements of $R$ increase (from the left to the right panel), the contours become pointed at the coordinate axes, and the solution tends to be sparser. This is the exclusive effect for correlated variables. The $\ell_q$ ($0<q<1$) penalty, SCAD (Fan & Li, 2001), MCP (Zhang, 2010), and other methods also have contours pointed at the axes; however, the contours of the IILasso adapt to the correlations among variables, so our penalty achieves both sparse and stable solutions. (See appendix A for further comparison with various regularization contours.)

Figure 1: Contours of IILasso regularization terms.

The exclusive effect among correlated variables offers great advantages. First, it produces uncorrelated models, which are easy to interpret, as we describe in section 2.3. Second, it is favorable for sign recovery and estimation error under the assumption that the truth is well interpretable, as we show in our theoretical and numerical analyses.

### 2.3  Interpretability

A linear model looks as if it could be perfectly interpreted, but that is not always the case. Here, we discuss difficulties of linear model interpretation and the advantages of our proposed method. We note that similar difficulties were referred to in Lipton (2018) and section 4.1 in Molnar (2018), but they did not offer any workaround.

One usually understands a linear model through its regression coefficients and effects. A single coefficient represents how strongly a unit change of a variable affects the prediction when all other active variables remain fixed. A single effect is calculated as a regression coefficient times the standard deviation of its variable, so it represents the degree to which the variable affects the prediction on average. Coefficients and effects coincide if the variables are standardized in advance.

Correlated variables in an output model make interpreting coefficients and effects harder. When the model does not contain correlated variables, the proviso "when all other active variables remain fixed" is reasonable. When the model contains correlated variables, we must consider the influence of all other correlated active variables simultaneously. Neglecting other active variables may lead to misinterpretation because increasing a variable by one standard deviation without changing the others can be counterfactual.

As an example, Figure 2 shows scatter plots of uncorrelated and correlated variables in output models. The two variables, $X_1$ and $X_2$, have mean 0 and standard deviation 1 in both cases, but their correlations differ. In the uncorrelated case, increasing $X_1$ by 1 (standard deviation) without changing $X_2$ is reasonable, because fixing $X_2$ does not affect the range of $X_1$. In the correlated case, fixing $X_2$ strongly affects the range of $X_1$. Thus, increasing $X_1$ by 1 without changing $X_2$ falls into an unrealistic (counterfactual) setting; that is, it leaves the support of the distribution. For this reason, the effect no longer represents the average influence on the response with the others fixed in the correlated case. This example implies that correlations among active variables hinder the interpretation of linear models and can lead to misinterpretation. These difficulties are mitigated in our proposed method by inducing low correlations.

Figure 2: Standard deviations for uncorrelated variables (left) and correlated variables (right).

Moreover, even when we have a moderately interpretable correlated model, it might be possible to obtain an uncorrelated model with refined interpretability. Consider a simple example with $p=3$. Suppose $X=[X_1,X_2,X_3]\in\mathbb{R}^{n\times3}$ is standardized, $X_1$ and $X_2$ are orthogonal, and $X_3=(X_1+X_2)/\sqrt{2}$. Consider two models: (A) $y=2X_1+X_2$ and (B) $y=X_1+\sqrt{2}X_3$. Figure 3 illustrates the $X_j$'s and $y$. Both models output the same prediction. However, model A seems more interpretable than model B. This is because in model A, the active variables ($X_1$ and $X_2$) are uncorrelated; hence, we can decompose the model ($2X_1+X_2$) into its components ($2X_1$ and $X_2$) and interpret each coefficient as an independent contribution from each variable to the response. In model B, the active variables ($X_1$ and $X_3$) are correlated; hence, each coefficient is no longer an independent variable contribution. For example, imagine predicting traffic congestion using three variables, $X_1$: Saturday, $X_2$: Sunday, and $X_3$: Weekend. We can interpret the model with "Saturday" and "Sunday" more intuitively than that with "Saturday" and "Weekend," because "Weekend" includes "Saturday." Of course, we can easily interpret the "Saturday" and "Weekend" model since their relation is clear. However, in general, it is not easy to unravel complicated interactions among correlated variables since they might imply confounders, mediators, or cause-and-effect relationships. The Lasso selects model B because the $\ell_1$ norm of its coefficients is smaller, while the IILasso tends to select model A because our regularization term excludes correlations. This perfectly collinear example is not merely an extreme case: it has been reported that there are many good solutions (solutions with almost the same error) around the Lasso solution in real applications (Hara & Maehara, 2017), so severe collinearity often occurs in high-dimensional data.
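To make the Lasso's preference concrete, compare the $\ell_1$ norms of the two coefficient vectors (a short check of our own, using the standardized scaling $X_3=(X_1+X_2)/\sqrt{2}$ above):
$\|\beta^{(A)}\|_1=2+1=3,\qquad\|\beta^{(B)}\|_1=1+\sqrt{2}\approx2.41,$
so the $\ell_1$ penalty alone favors model B. The IILasso adds the cross term proportional to $R_{13}|\beta_1||\beta_3|$ with $r_{13}=|X_1^\top X_3|/n=1/\sqrt{2}$, which penalizes model B but vanishes for model A since $r_{12}=0$.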

Figure 3: Uncorrelated model (A) versus correlated model (B).

### 2.4  Optimization

To solve problem 2.1, we introduce the coordinate descent algorithm (CDA), which was originally proposed for the Lasso (i.e., $\alpha=0$ in the IILasso; Friedman, Hastie, Höfling, & Tibshirani, 2007; Friedman, Hastie, & Tibshirani, 2010). CDA is a simple and efficient algorithm, particularly for high-dimensional data. CDA proceeds simply: for each $j\in\{1,\ldots,p\}$, we optimize the objective function with respect to $\beta_j$, with the remaining elements of $\beta$ fixed at their most recently updated values. CDA is applicable even when the quadratic penalty is included.

To derive the update equation, when $\beta_j\ne0$, differentiating $L(\beta)$ with respect to $\beta_j$ yields
$\partial_{\beta_j}L(\beta)=-\frac{1}{n}X_j^\top\left(y-X_{-j}\beta_{-j}\right)+\left(1+\lambda\alpha R_{jj}\right)\beta_j+\lambda\left(1+\alpha R_{j,-j}|\beta_{-j}|\right)\mathrm{sgn}(\beta_j),$
where $\beta_{-j}$ denotes $\beta$ without the $j$th component, $X_{-j}$ denotes $X$ without the $j$th column, and $R_{j,-j}$ denotes the $j$th row of $R$ without the $j$th column. Solving $\partial_{\beta_j}L(\beta)=0$, we obtain the update rule
$\beta_j\leftarrow\frac{1}{1+\lambda\alpha R_{jj}}S\left(\frac{1}{n}X_j^\top\left(y-X_{-j}\beta_{-j}\right),\ \lambda\left(1+\alpha R_{j,-j}|\beta_{-j}|\right)\right),$
(2.2)
where $S(z,\gamma)$ is the soft thresholding function:
$S(z,\gamma):=\mathrm{sgn}(z)(|z|-\gamma)_+=\begin{cases}z-\gamma&\text{if }z>0\text{ and }\gamma<|z|,\\z+\gamma&\text{if }z<0\text{ and }\gamma<|z|,\\0&\text{if }|z|\le\gamma.\end{cases}$

The algorithm for solving the IILasso is described in algorithm 1. We search over a sequence of $\lambda$ values from $\lambda_{\max}$ down to $\lambda_{\min}$. At each $\lambda$, $\beta$ is initialized in one of several ways, such as all zeros, the solution for the previous $\lambda$, or the solution of the ordinary Lasso.
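For concreteness, the following minimal NumPy sketch runs full sweeps of update 2.2 (our own illustrative code, not the authors' reference implementation; `soft_threshold` implements $S(z,\gamma)$, and the design is assumed standardized as in section 2.1):

```python
import numpy as np

def soft_threshold(z, gamma):
    # S(z, gamma) = sgn(z) * (|z| - gamma)_+
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def iilasso_cda(X, y, R, lam, alpha, n_sweeps=1000, tol=1e-8):
    # Coordinate descent for problem 2.1, assuming X has standardized
    # columns (X_j' X_j / n = 1) and y is centered.
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        beta_old = beta.copy()
        for j in range(p):
            # partial residual excluding coordinate j: y - X_{-j} beta_{-j}
            r = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r / n
            # correlation-adaptive threshold: lambda * (1 + alpha * R_{j,-j} |beta_{-j}|)
            gamma = lam * (1.0 + alpha * (R[j] @ np.abs(beta) - R[j, j] * abs(beta[j])))
            beta[j] = soft_threshold(z, gamma) / (1.0 + lam * alpha * R[j, j])
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta
```

With $R=O$ (or $\alpha=0$), this reduces to the standard Lasso coordinate descent.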

In algorithm 1, the objective function monotonically decreases at each update, and the estimate converges to a stationary point.

Proposition 1.

Let $\{\beta^t\}_{t=0,1,\ldots}$ be the sequence generated by algorithm 1. Then every cluster point of $\{\beta^t\}_{t\equiv(p-1)\bmod p}$ is a stationary point.

Proof.

The proof is based on theorem 4.1 in Tseng (2001). First, we can see that the level set $\{\beta\mid L(\beta)\le L(\beta^0)\}$ is compact and $L(\beta)$ is continuous. Moreover, $L(\beta)$ has a unique minimum, given by equation 2.2, in terms of each $\beta_j$. Therefore, every cluster point of $\{\beta^t\}_{t\equiv(p-1)\bmod p}$ is a coordinatewise minimum point. In addition, since $L(\beta)$ can be seen as a locally quadratic function in any direction, $L(\beta)$ is regular at the cluster point. Hence, theorem 4.1(c) in Tseng (2001) concludes the assertion.

### 2.5  Related Work

Some work has taken correlations among variables into account. We can divide them into mainly two directions: grouping selection and exclusive selection. The former groups correlated variables and selects correlated variables together. The latter excludes correlated variables and selects uncorrelated variables.

The representative method of grouping selection is the elastic net (Zou & Hastie, 2005). The objective function adds a squared $\ell_2$ penalty to the $\ell_1$ penalty:
$\min_\beta\ \|y-X\beta\|_2^2+\lambda_1\|\beta\|_1+\lambda_2\|\beta\|_2^2.$
Due to the squared $\ell_2$ penalty, if the correlation between variables $X_j$ and $X_k$ tends to 1, then the estimated values of $\beta_j$ and $\beta_k$ get closer. If $X_j$ and $X_k$ are equal, then $\beta_j$ and $\beta_k$ must be equal. This behavior is called the grouping effect. The pairwise elastic net (Lorbert et al., 2010) and trace Lasso (Grave et al., 2011) follow the same direction of research and improve prediction accuracy. These methods tend to include many correlated variables, and each coefficient no longer indicates an independent variable contribution to the response. As a result, it is hard to tell which variables are truly active and how the variables affect the response.
Another direction of research is exclusive selection. The uncorrelated Lasso (ULasso; Chen et al., 2013) aims to reduce correlations among active variables. It optimizes the following objective function,
$\min_\beta\ \|y-X\beta\|_2^2+\lambda_1\|\beta\|_1+\lambda_2\beta^\top R\beta,$
(2.3)
where $R\in\mathbb{R}^{p\times p}$ with each element $R_{jk}=\left(\frac{1}{n}X_j^\top X_k\right)^2$. Although the ULasso resembles our formulation, there is a critical difference: it uses $\beta$ instead of $|\beta|$ in the objective function, equation 2.3. We found that the ULasso does not necessarily select uncorrelated variables. For example, consider the case $X=[X_1,X_2]$. The last term of equation 2.3 is $\lambda_2(\beta_1^2+\beta_2^2+2R_{12}\beta_1\beta_2)$. Minimizing the cross term $2R_{12}\beta_1\beta_2$ drives $\beta_1\beta_2<0$ and makes $|\beta_1\beta_2|$ larger if $R_{12}\ne0$. This implies that the ULasso tends to select correlated variables and give their coefficients opposite signs. In particular, if $X_1$ and $X_2$ are strongly correlated, the last term approaches $\lambda_2(\beta_1+\beta_2)^2$, which induces $\beta_1=-\beta_2$. This is not a major problem when $X_1$ and $X_2$ are positively correlated, but it is a significant problem when $X_1$ and $X_2$ are negatively correlated. This problem is overcome in our method. Therefore, the difference between the ULasso and our IILasso is essential and crucial.
The exclusive group Lasso (EGLasso; Kong et al., 2014) is also in the same direction of exclusive selection. It optimizes the following objective function:
$\min_\beta\ \|y-X\beta\|_2^2+\lambda_1\|\beta\|_1+\lambda_2\sum_{k=1}^K\|\beta^{(k)}\|_1^2,$
(2.4)
where $\beta^{(k)}$ consists of the elements of $\beta$ within a group of predictors $g_k\subset\{1,\ldots,p\}$ and $K$ is the number of groups. The last $\ell_1/\ell_2$ penalty term enforces exclusive variable selection. For example, when $p=3$, $g_1=\{1,2\}$, and $g_2=\{3\}$, the last term becomes $\lambda_2\left((|\beta_1|+|\beta_2|)^2+|\beta_3|^2\right)$. This enforces sparsity within each group. The authors suggest putting highly correlated variables into the same group in order to select uncorrelated variables, using $|r_{ij}|>\theta$ with $\theta\approx0.90$ as a threshold. EGLasso can be seen as a special case of the IILasso. Let $R$ be a group indicator matrix such that $R_{jk}=1$ if $X_j$ and $X_k$ belong to the same group and $R_{jk}=0$ otherwise. Then the IILasso reduces to EGLasso. For the above example, if we define the similarity matrix $R=[1,1,0;1,1,0;0,0,1]$, then the last term of the IILasso objective function 2.1 becomes $\frac{\lambda\alpha}{2}\left(\beta_1^2+2|\beta_1||\beta_2|+\beta_2^2+\beta_3^2\right)$, which coincides with the last term of equation 2.4 with $\lambda_2=\lambda\alpha/2$, as the check below illustrates. As we see, EGLasso needs to determine the threshold $\theta$ and group the variables beforehand, which can cause severely unstable estimation.
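This reduction can be verified numerically (a small illustrative check of our own; the helper function is hypothetical, and the factor $\lambda_2=\lambda\alpha/2$ matches the two penalties):

```python
import numpy as np

def group_indicator_R(groups, p):
    # R_jk = 1 if j and k share a group, 0 otherwise (illustrative helper)
    R = np.zeros((p, p))
    for g in groups:
        idx = np.array(g)
        R[np.ix_(idx, idx)] = 1.0
    return R

# The p = 3 example from the text: g1 = {1, 2}, g2 = {3} (0-indexed below)
R = group_indicator_R([[0, 1], [2]], p=3)
b = np.abs(np.array([0.5, -0.3, 1.0]))             # |beta| for an arbitrary beta
iilasso = 0.5 * b @ R @ b                          # (1/2) |beta|' R |beta|
eglasso = 0.5 * ((b[0] + b[1]) ** 2 + b[2] ** 2)   # (1/2) sum_k ||beta_(k)||_1^2
assert np.isclose(iilasso, eglasso)
```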

## 3  Theoretical Properties

In this section, we show the sign recovery condition and the estimation error bound of the IILasso for linear models. These results explain the effectiveness of the IILasso in terms of interpretability and estimation error. Moreover, we show a property of local minima, which implies that every local optimal solution achieves the same statistical error rate as the global optimal solution. In this section, let $\beta^*$ denote the true parameter, let $S$ denote the true active set, $S=\mathrm{supp}(\beta^*)=\{j\in\{1,\ldots,p\}\mid\beta_j^*\ne0\}$, and let $s=|S|$.

### 3.1  Sign Recovery

First, we give a necessary and sufficient condition of sign recovery. We define
$U:=\frac{1}{n}X_S^\top X_S+\lambda\alpha\,\mathrm{Diag}(\mathrm{sgn}(\beta_S^*))R_{SS}\mathrm{Diag}(\mathrm{sgn}(\beta_S^*)),\qquad w:=\mathrm{sgn}(\beta_S^*)+\alpha\,\mathrm{Diag}(\mathrm{sgn}(\beta_S^*))R_{SS}\mathrm{Diag}(\mathrm{sgn}(\beta_S^*))\beta_S^*.$
Theorem 1.
Assume $U$ is invertible. Then there exists a critical point $\hat\beta$ of equation 2.1 with correct sign recovery $\mathrm{sgn}(\hat\beta)=\mathrm{sgn}(\beta^*)$ if and only if the following two conditions hold:
$\mathrm{sgn}\left(\beta_S^*-U^{-1}\left(\lambda w-\frac{1}{n}X_S^\top\varepsilon\right)\right)=\mathrm{sgn}(\beta_S^*),$
(3.1)
$\left|\frac{1}{n}X_{S^c}^\top X_SU^{-1}\left(\lambda w-\frac{1}{n}X_S^\top\varepsilon\right)+\frac{1}{n}X_{S^c}^\top\varepsilon\right|\le\lambda\left(1+\alpha R_{S^cS}\left|\beta_S^*-U^{-1}\left(\lambda w-\frac{1}{n}X_S^\top\varepsilon\right)\right|\right),$
(3.2)

where both of these vector inequalities are taken elementwise.

The proof is given in section B.1. The sign recovery condition is derived from the standard optimality conditions. We note that $\alpha=0$ reduces the condition to the ordinary Lasso condition in Wainwright (2009). The invertibility assumption on $U$ is not restrictive because it holds for almost all $\lambda$ if $X_S^\top X_S$ is invertible, which is the same assumption as in the standard analysis of the Lasso.

According to theorem 1, the IILasso has an advantage in sign recovery for correlated designs compared to the Lasso as long as the truth is "well interpretable," that is, the true nonzero components are independent. This is because when $R_{SS}$ is small enough, equation 3.1 is the same as for the Lasso, and equation 3.2 is easier to satisfy unless $\alpha R_{S^cS}=0$. In addition, the condition gets milder as $R_{S^cS}$ gets larger.

Note that theorem 1 is not a condition for a global optimal solution. However, any finite global optimal solution must be a critical point. Hence, there exists a global optimal solution with correct sign recovery only if equations 3.1 and 3.2 hold.

Next, we give a sufficient condition for sign recovery. The following theorem clarifies the sign recovery conditions on the design matrix $X$ and the true parameter $β*$, since it does not depend on the realization of the noise $ɛ$. We prepare some assumptions and definitions.

Assumption 1

(Subgaussian). The noise sequence $\{\varepsilon_i\}_{i=1}^n$ is an i.i.d. subgaussian sequence with parameter $\sigma>0$, that is, $\mathbb{E}[\exp(t\varepsilon_i)]\le\exp(\sigma^2t^2/2)$ for $\forall t\in\mathbb{R}$.

We give subgaussian properties in section B.2 for our theoretical analyses.

Assumption 2.
There exist constants $D>0$, $\phi>0$, and $\psi$ with $1+\lambda_n\alpha\psi>0$ such that
$\|R_{SS}\|_{\max}\le D,\quad\frac{1}{n}X_S^\top X_S\succeq\phi I,\quad\left(\frac{1}{n}X_S^\top X_S\right)^{-1/2}\mathrm{Diag}(\mathrm{sgn}(\beta_S^*))R_{SS}\mathrm{Diag}(\mathrm{sgn}(\beta_S^*))\left(\frac{1}{n}X_S^\top X_S\right)^{-1/2}\succeq\psi I.$
(3.3)

We note that $\psi$ is not necessarily positive, but it must satisfy $\psi>-1/(\lambda_n\alpha)$. Condition 3.3 is satisfied if $R_{SS}\succeq\psi_RI$, $X_S^\top X_S/n\preceq\phi_{\max}I$, $\psi_R/\phi\ge\psi$, and $\psi_R/\phi_{\max}\ge\psi$.

Definition 1
(Generalized Incoherence Condition). We say that the generalized incoherence condition holds if there exists some incoherence parameter $κ∈(0,1]$ such that
$\left\|\frac{1}{n}X_j^\top X_SU^{-1}\right\|_1\le\frac{1+\alpha\|R_{Sj}\|_1\Delta_{\min}}{1+\alpha D\|\beta_S^*\|_1}(1-\kappa),$
(3.4)
for $\forall j\in S^c$, where $\beta^*_{\min}:=\min_{j\in S}|\beta_j^*|$ and
$\Delta_{\min}:=\beta^*_{\min}-\lambda_n\left[\left(1+\alpha D\|\beta_S^*\|_1\right)\|U^{-1}\|_\infty+\frac{4\sigma}{\sqrt{\phi}\,(1+\lambda_n\alpha\psi)}\right]>0.$
(3.5)

The generalized incoherence condition (see definition 1) is a generalized notion of the incoherence condition (Fuchs, 2005; Tropp, 2006; Wainwright, 2009). In fact, when $\alpha=0$, the generalized incoherence condition reduces to the ordinary incoherence condition. The ordinary incoherence condition is quite restrictive and is much stronger than the restricted eigenvalue condition. For example, if $X_S^\top X_S/n=I$, then the ordinary incoherence condition requires that $\max_{j\in S^c}\sum_{k\in S}\left|\frac{1}{n}\sum_iX_{ij}X_{ik}\right|<1$. This condition is hard to satisfy if there are correlations between informative and uninformative variables.

On the other hand, the generalized incoherence condition offers a great advantage because the right-hand side of equation 3.4 is not upper bounded by 1 when $\alpha\ne0$. Specifically, consider a case where the true model is well interpretable: $R_{SS}=O$. The generalized incoherence condition reduces to
$\left\|\frac{1}{n}X_j^\top X_S\left(\frac{1}{n}X_S^\top X_S\right)^{-1}\right\|_1\le\left(1+\alpha\|R_{Sj}\|_1\Delta_{\min}\right)(1-\kappa),$
(3.6)
for $\forall j\in S^c$, where
$\Delta_{\min}:=\beta^*_{\min}-\lambda_n\left[\left\|\left(\frac{1}{n}X_S^\top X_S\right)^{-1}\right\|_\infty+\frac{4\sigma}{\sqrt{\phi}\,(1+\lambda_n\alpha\psi)}\right]>0.$
This condition is milder than the ordinary incoherence condition since the right-hand side of equation 3.6 is larger than $1-κ$.
Now, we fix any $0<δ<1$ and define $γn$ as
$\gamma_n=\gamma_n(\delta):=\sigma\sqrt{\frac{2\log(2p/\delta)}{n}}.$
(3.7)
Then we obtain a sufficient condition for sign recovery.
Theorem 2.
Suppose that assumptions 1 and 2 and definition 1 (generalized incoherence condition) are satisfied. Suppose that the regularization parameter satisfies
$\lambda_n\ge\max\left\{\frac{1}{4\sigma},\ \frac{2}{\kappa},\ \left|\frac{\lambda_n\alpha\psi}{1+\lambda_n\alpha\psi}\right|\frac{2}{\kappa}\right\}\gamma_n,$
(3.8)
with $\gamma_n$ in equation 3.7. Then there exists a critical point $\hat\beta$ of equation 2.1 with correct sign recovery $\mathrm{sgn}(\hat\beta)=\mathrm{sgn}(\beta^*)$ with probability at least $1-2\delta$.

The proof is given in section B.3. We note that $\alpha=0$ reduces to the result for the ordinary Lasso (Wainwright, 2009). When $\alpha\ne0$, our method has a substantial advantage for sign recovery as long as the true model is well interpretable: $R_{SS}=O$. This is because the generalized incoherence condition (definition 1) for the IILasso is milder than that for the ordinary Lasso, as described above, while assumption 2 and equation 3.8 reduce to the ordinary Lasso conditions. In addition, Meinshausen and Bühlmann (2006) and Zhao and Yu (2006) gave various designs of $X$ satisfying the incoherence condition. This implies that the generalized incoherence condition also holds in various situations as long as $R_{SS}$ is small enough.

### 3.2  Estimation Error

We give an estimation error bound of approximately global minimum solutions of our method.

Definition 2
(Generalized Restricted Eigenvalue Condition for Linear Models ($GRE(S,C,C')$)). Let a set of vectors $B(S,C,C')$ be
$B(S,C,C'):=\left\{v\in\mathbb{R}^p:\|v_{S^c}\|_1+\frac{C'\alpha}{2}|v_{S^c}|^\top R_{S^cS^c}|v_{S^c}|+C'\alpha|v_{S^c}|^\top R_{S^cS}|v_S+\beta_S^*|\le C\|v_S\|_1\right\}.$
(3.9)
We say that the generalized restricted eigenvalue condition holds if we have $φGRE>0$ where
$\varphi_{\mathrm{GRE}}=\varphi_{\mathrm{GRE}}(S,C,C'):=\inf_{v\in B(S,C,C')}\frac{v^\top\left(\frac{1}{n}X^\top X\right)v}{\|v\|_2^2}.$
(3.10)

The generalized restricted eigenvalue (GRE) condition in definition 2 is a generalized notion of the restricted eigenvalue (RE) condition (Bickel et al., 2009; Bühlmann & Van De Geer, 2011) tailored to our regularization. One can see that if $\alpha=0$ or $C'=0$, then $\varphi_{\mathrm{GRE}}(S,C,C')$ reduces to the ordinary restricted eigenvalue (Bickel et al., 2009; Raskutti, Wainwright, & Yu, 2010) used in the analysis of the Lasso.

The GRE condition is not restrictive. Since there are additional terms related to $|\beta|^\top R|\beta|$, the set $B(S,C,C')$ is smaller than that for the ordinary restricted eigenvalue if the same $C$ is used. In particular, the term $|v_{S^c}|^\top R_{S^cS}|v_S+\beta_S^*|$ strongly restricts the amplitude of coefficients of unimportant variables (especially the variables with large $R_{S^cS}$). Hence, assumption GRE$(S,C,C')$ is milder than assumption RE$(S,C)$. Because the RE$(S,C)$ condition is satisfied in a general class of gaussian designs (Raskutti et al., 2010), the GRE condition also holds in a general class of gaussian designs.

Then we obtain the convergence rate of an approximately global minimum solution of IILasso as follows:

Theorem 3.
Suppose that assumption 1 is satisfied. Suppose $R$ satisfies
$\|R_{SS}\|_{\max}\le D,$
(3.11)
for some positive constant $D$, and suppose the estimator $\hat\beta$ approximately minimizes the objective function of equation 2.1 so that
$L(\hat\beta)\le L(\beta^*).$
(3.12)
Suppose that the regularization parameters satisfy
$3\gamma_n\le\lambda_n\quad\text{and}\quad\alpha\le\frac{1}{4D\|\beta_S^*\|_1},$
with $\gamma_n$ in equation 3.7. Suppose that assumption GRE$(S,3,3/2)$ (see definition 2) is satisfied. Then it holds that
$\|\hat\beta-\beta^*\|_2^2\le\frac{16s\lambda_n^2}{\varphi_{\mathrm{GRE}}^2},$
(3.13)
with probability at least $1-δ$.
The proof is given in section B.4. The obtained convergence rate is roughly evaluated as
$\|\hat\beta-\beta^*\|_2^2=O_p\!\left(\frac{s\log p}{n}\right)$
by taking $\lambda_n=O_p\left(\sqrt{\log p/n}\right)$, which is almost the minimax optimal rate (Raskutti, Wainwright, & Yu, 2011).
We now compare the convergence rates of the Lasso and the IILasso. For comparison, we derive a somewhat stricter bound,
$\|\hat\beta-\beta^*\|_2^2\le8\left(3+5\alpha D\|\beta_S^*\|_1+\frac{3}{4}\left(\alpha D\|\beta_S^*\|_1\right)^2\right)^2\frac{s\lambda_n^2}{\varphi_{\mathrm{GRE}}^2},$
under assumption GRE$(S,C,3/2)$ (see definition 2), where $C=2+\frac{15}{4}\alpha D\|\beta_S^*\|_1+\frac{9}{16}\left(\alpha D\|\beta_S^*\|_1\right)^2$, with high probability. (The proof is given in section B.5.) We can easily see that when $\alpha=0$, the convergence rate analysis reduces to the standard one for the ordinary Lasso (Bickel et al., 2009; Bühlmann & Van De Geer, 2011). We note that our theorem includes the additional assumption, equation 3.11, compared to the ordinary Lasso theorem and relaxes the restricted eigenvalue condition. This indicates that our theorem holds under a milder condition on the correlations among all variables, in exchange for an additional condition on the true active variables. In well-interpretable cases where the true nonzero components $S$ are independent (i.e., $R_{SS}=O$), the error bounds for the Lasso and IILasso are the same except for the term $\varphi_{\mathrm{GRE}}$. Since $R_{S^cS}$ and $R_{S^cS^c}$ shrink the set of vectors $B(S,C,C')$ in definition 2, $\varphi_{\mathrm{GRE}}$ of the IILasso is larger than that of the Lasso. Therefore, our approximately global minimum solution has a better error bound than that of ordinary $\ell_1$ regularization in this situation. In addition, our method is more advantageous when there are correlations between informative and noninformative variables.

### 3.3  Local Optimality

The objective function of the IILasso is not necessarily convex, in exchange for the better statistical properties observed above. Our next theoretical interest is the local optimality of our optimization algorithm, algorithm 1. Since our optimization method is greedy, there is no guarantee that it achieves the global optimum. However, as we see in this section, a local solution achieves almost the same estimation error as a global optimum satisfying equation 3.12. For theoretical simplicity, we assume the following stronger condition.

Assumption 3.
There exist $\varphi>0$ and $q_n\le p$ such that for all $V\subset\{1,\ldots,p\}$ satisfying $|V|\le q_n$ and $V\cap S=\emptyset$, it holds that
$\frac{1}{n}X_{S\cup V}^\top X_{S\cup V}\succ\varphi I.$
Moreover, there exists $\bar{D}$ such that the maximum absolute eigenvalue of $R_{S\cup V,S\cup V}$ is bounded as
$\sup_{u\in\mathbb{R}^{|S\cup V|}}u^\top R_{S\cup V,S\cup V}\,u\le\bar{D}\|u\|_2^2.$

Then we obtain the convergence rate of local minimum solutions of IILasso as follows:

Theorem 4.
Suppose assumptions 1 and 3 are satisfied. Suppose that $\hat\beta$ is a local optimal solution of equation 2.1 satisfying $|\mathrm{supp}(\hat\beta)|\le|S|+q_n$. Let the regularization parameters satisfy
$\gamma_n<\lambda_n\quad\text{and}\quad\alpha<\cdots,$
with $\gamma_n$ in equation 3.7. Then $\hat\beta$ satisfies
$\|\hat\beta-\beta^*\|_2^2\le\frac{25s\lambda_n^2}{\varphi^2},$
with probability at least $1-δ$.

The proof is given in section B.6. Theorem 4 indicates that every local optimal solution achieves the same convergence rate as the ideal global optimal solution. In other words, there is no local optimal solution with sparsity level at most $|S|+q_n$ that is far from the true vector $\beta^*$.

## 4  Extensions to Generalized Linear Models

### 4.1  Method for GLMs

The idea of the IILasso is not restricted to linear models with quadratic loss. In this section, we extend the method and theory to a more general setting, especially to generalized linear models (GLMs) with general losses, without assuming that the truth is contained in the GLM class.

Consider independent and identically distributed data $\{(X_i,Y_i)\}_{i=1}^n$ with $(X_i,Y_i)\in\mathcal{X}\times\mathcal{Y}$, where $X_i$ is a fixed covariable in some space $\mathcal{X}$ and $Y_i$ is a response variable in some space $\mathcal{Y}\subset\mathbb{R}$. Note that we suppose the $X_i$'s are deterministic. Let $\mathcal{F}$ be the space of all measurable functions. Let $\rho_f:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$ be a general loss function for a model $f\in\mathcal{F}$. We also write $\rho_f(x,y):=\rho(f(x),y)$. Typical examples of loss functions include the quadratic loss $\rho_f(x,y)=(y-f(x))^2$ for $y\in\mathbb{R}$ and the logistic loss $\rho_f(x,y)=-yf(x)+\log(1+\exp(f(x)))$ for $y\in\{0,1\}$. Consider a GLM subspace:
$F:=\left\{f_\beta(x):=\sum_{j=1}^p\beta_j\psi_j(x):\beta\in\mathbb{R}^p,\ x\in\mathcal{X}\right\}\subset\mathcal{F}.$
(4.1)
Here, the map $x\mapsto(\psi_1(x),\ldots,\psi_p(x))^\top\in\mathbb{R}^p$ is an arbitrary measurable feature map from the space $\mathcal{X}$ to $\mathbb{R}^p$. We define the population mean risk,
$P\rho_f:=\frac{1}{n}\sum_{i=1}^n\mathbb{E}_{Y_i|X_i}\left[\rho_f(X_i,Y_i)\mid X_i\right],$
(4.2)
and the empirical average risk,
$P_n\rho_f:=\frac{1}{n}\sum_{i=1}^n\rho_f(X_i,Y_i).$
(4.3)
We further define the target as the minimizer of the theoretical mean risk:
$f^0:=\arg\min_{f\in\mathcal{F}}P\rho_f.$
(4.4)
Note that we allow model misspecification; in other words, it may hold that $f^0\notin F$. In this setting, our interest is to estimate a GLM close to the target, equation 4.4. The IILasso for GLMs takes the form
$\hat\beta=\arg\min_\beta L(\beta),\qquad L(\beta):=P_n\rho_{f_\beta}+\lambda\left(\|\beta\|_1+\frac{\alpha}{2}|\beta|^\top R|\beta|\right),$
(4.5)
where $\lambda>0$ and $\alpha>0$ are regularization parameters, and $R\in\mathbb{R}^{p\times p}$ is a symmetric matrix whose component $R_{jk}\ge0$ is typically a monotonically increasing function of the absolute correlation $r_{jk}=\frac{1}{n}\left|\sum_{i=1}^n\psi_j(X_i)\psi_k(X_i)\right|$ for $j\ne k$, and $R_{jk}=0$ for $j=k$.
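As a concrete illustration of formulation 4.5 with the logistic loss, the penalized empirical risk can be written in a few lines (a minimal sketch of our own; the function name and interface are hypothetical, and the dedicated coordinate descent used to optimize it is described in appendix C):

```python
import numpy as np

def iilasso_logistic_objective(beta, Psi, y, R, lam, alpha):
    # A minimal sketch of the penalized empirical risk (4.5) with the
    # logistic loss; Psi is the n x p matrix of features psi_j(X_i),
    # and y is a 0/1 response vector.
    f = Psi @ beta
    # P_n rho_{f_beta} = (1/n) sum_i [ -y_i f(X_i) + log(1 + exp(f(X_i))) ]
    risk = np.mean(-y * f + np.logaddexp(0.0, f))   # stable log(1 + e^f)
    b = np.abs(beta)
    penalty = lam * (b.sum() + 0.5 * alpha * (b @ R @ b))
    return risk + penalty
```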

### 4.2  Estimation Error for GLMs

We introduce some additional notation. Let $\|f\|_\infty$ be the $L_\infty$-norm, that is, $\|f\|_\infty=\sup_z|f(z)|$. Let $\|f\|$ be the $L_2(Q_n)$-norm with $Q_n$ the empirical measure of $\{z_i\}_{i=1}^n$, that is, $\|f\|:=\|f\|_n:=\sqrt{\frac{1}{n}\sum_{i=1}^nf^2(z_i)}$. We write $S$ for an arbitrary variable index set and $S_\beta:=\mathrm{supp}(\beta)$. Note that $S$ does not denote the true active set in this section because it may hold that $f^0\notin F$. We define the excess risk as
$\mathcal{E}(f):=P(\rho_f-\rho_{f^0}).$

Next we prepare some key assumptions and definitions.

Assumption 4.
We assume that the loss function is Lipschitz with constant $L$, that is,
$|\rho(a,y)-\rho(a',y)|\le L|a-a'|,\quad\forall a,a',y.$
Definition 3
(Quadratic Margin Condition). We say that the quadratic margin condition holds with $\kappa$ on $F_\eta:=\{f\in F:\|f-f^0\|_\infty\le\eta\}$ if we have
$\mathcal{E}(f)\ge\kappa\|f-f^0\|^2,\quad\forall f\in F_\eta.$
Definition 4
(Generalized Restricted Eigenvalue Condition for GLMs (GRE($S,C,C'$))). Let a set of vectors $B(S,C,C')$ be
$B(S,C,C'):=\left\{v:\|v_{S^c}\|_1+\frac{C'\alpha}{2}|v_{S^c}|^\top R_{S^cS^c}|v_{S^c}|+C'\alpha|v_{S^c}|^\top R_{S^cS}|v_S+\beta_S^*(S)|\le C\|v_S\|_1\right\},$
where $\beta^*(S):=\arg\min_{\beta:S_\beta=S}\mathcal{E}(f_\beta)$. We say that the generalized restricted eigenvalue condition for GLMs holds with a set of vectors $B(S,C,C')$ if we have $\varphi_{\mathrm{GRE}}>0$, where
$\varphi_{\mathrm{GRE}}=\varphi_{\mathrm{GRE}}(S,C,C'):=\inf_{v\in B(S,C,C')}\frac{\|f_v\|^2}{\|v\|_2^2}.$
Definition 5
(Oracle). Let $\mathcal{S}$ be a collection of variable index sets. We define the oracle $\beta^*$ as
$\beta^*=\beta^*(\mathcal{S}):=\arg\min_{\beta:S_\beta\in\mathcal{S}}\left\{3\mathcal{E}(f_\beta)+\frac{9\lambda_n^2|S_\beta|}{\kappa\varphi^*_{\mathrm{GRE}}}\right\},$
(4.6)
where $\varphi^*_{\mathrm{GRE}}=\varphi_{\mathrm{GRE}}(\mathcal{S}):=\inf_{S\in\mathcal{S}}\varphi_{\mathrm{GRE}}(S,C,C')$.

This assumption and these definitions are basically based on Bühlmann and Van De Geer (2011) and tailored to our proposed method. The quadratic margin condition (see definition 3) means that the excess risk of every model $f$ near the target $f^0$ is bounded below by a quadratic function of $\|f-f^0\|$. The GRE condition for GLMs (see definition 4) is a natural extension of the GRE condition for linear models. The oracle $\beta^*$ (see definition 5) represents an ideal solution because it minimizes the sum of the excess risk and a penalty proportional to the number of selected variables.

Now we derive an estimation error bound for GLMs:

Theorem 5.
Let assumption 4 be satisfied with constant $L$. Suppose that $\|\psi_j\|\le1$ for $\forall j$. Let $\mathcal{S}$ be an arbitrary collection of variable index sets. For every variable index set $S\in\mathcal{S}$, suppose that assumption GRE$(S,3,1)$ (see definition 4) is satisfied. Define the oracle $\beta^*$ as in equation 4.6 (see definition 5) for $\mathcal{S}$ and some constant $\kappa>0$. Fix any $0<\delta<1$. Define
$\gamma_n=\gamma_n(\delta):=2L\left(4\sqrt{\frac{2\log2p}{n}}+\sqrt{\frac{2\log(1/\delta)}{n}}\right),\quad M^*:=\frac{1}{\gamma_n}\left(3\mathcal{E}(f_{\beta^*})+\frac{9\lambda_n^2|S_{\beta^*}|}{\kappa\varphi^*_{\mathrm{GRE}}}\right).$
(4.7)
Suppose the quadratic margin condition (see definition 3) holds with $\kappa$ on $F_\eta$. In addition, assume $f_\beta\in F_\eta$ for all $\|\beta-\beta^*\|_1\le M^*$, as well as $f_{\beta^*}\in F_\eta$. Let $\hat\beta$ be an approximate minimizer of the IILasso such that $L(\hat\beta)\le L(\beta)$ for all $\|\beta-\beta^*\|_1\le M^*$. Let the regularization parameters satisfy
$\alpha\le\frac{1}{12D\|\beta^*\|_1}\quad\text{and}\quad20\gamma_n\le3\lambda_n,$
where $D:=\|R_{S^*S^*}\|_{\max}$ and $S^*:=\mathrm{supp}(\beta^*)$. Then we have
$\|\hat\beta-\beta^*\|_2^2\le\frac{22\,\mathcal{E}(f_{\beta^*})}{\kappa\varphi^*_{\mathrm{GRE}}}+\frac{60\lambda_n^2|S^*|}{\kappa^2\varphi^{*2}_{\mathrm{GRE}}},$
with probability at least $1-δ$.
The proof is given in section B.7. The obtained convergence rate in theorem 5 is roughly evaluated as
$\|\hat\beta-\beta^*\|_2^2=O_P\!\left(\frac{|S^*|\log p}{n}\right),$
under $\mathcal{E}(f_{\beta^*})=o_p(1)$, $\varphi^*_{\mathrm{GRE}}=O_p(1)$, and $\lambda_n=O_p\left(\sqrt{\log p/n}\right)$, which is almost the minimax optimal rate (Raskutti et al., 2011).

If $D=0$, then $\varphi^*_{\mathrm{GRE}}$ gets larger as $R_{S^{*c}S^*}$ and $R_{S^{*c}S^{*c}}$ become large. Hence, the IILasso for general losses with quadratic margin is preferable when the true active variables are uncorrelated and the others are strongly correlated.

The collection of variable index sets $\mathcal{S}$ is quite arbitrary. Taking specific choices of $\mathcal{S}$ yields the following corollaries:

Corollary 1.
Define the best linear approximation,
$\beta^0_{\mathrm{GLM}}:=\arg\min_\beta P\rho_{f_\beta}.$
Take $\mathcal{S}=\{S^0_{\mathrm{GLM}}\}:=\{\mathrm{supp}(\beta^0_{\mathrm{GLM}})\}$. Then, under the same assumptions as in theorem 5, we have $\beta^*=\beta^0_{\mathrm{GLM}}$ and
$\|\hat\beta-\beta^0_{\mathrm{GLM}}\|_2^2\le\frac{22\,\mathcal{E}(f_{\beta^0_{\mathrm{GLM}}})}{\kappa\varphi_{\mathrm{GRE}}(S^0_{\mathrm{GLM}},3,1)}+\frac{60\lambda_n^2|S^0_{\mathrm{GLM}}|}{\kappa^2\varphi_{\mathrm{GRE}}(S^0_{\mathrm{GLM}},3,1)^2}.$
Proof.
$\beta^*=\arg\min_{\beta:S_\beta=S^0_{\mathrm{GLM}}}\left\{3\mathcal{E}(f_\beta)+\frac{9\lambda_n^2|S^0_{\mathrm{GLM}}|}{\kappa\varphi^*_{\mathrm{GRE}}}\right\}=\beta^0_{\mathrm{GLM}},$
and $\varphi^*_{\mathrm{GRE}}=\varphi_{\mathrm{GRE}}(S^0_{\mathrm{GLM}},3,1)$ concludes the assertion.
Corollary 2.
Suppose the true model is linear:
$f^0=f_{\beta^0}\quad\text{and}\quad\mathcal{E}(f_{\beta^0})=0.$
Take $\mathcal{S}=\{S^0\}:=\{\mathrm{supp}(\beta^0)\}$. Then, under the same assumptions as in theorem 5, we have $\beta^*=\beta^0$ and
$\|\hat\beta-\beta^0\|_2^2\le\frac{60\lambda_n^2|S^0|}{\kappa^2\varphi_{\mathrm{GRE}}(S^0,3,1)^2}.$
Proof.
$\beta^*=\arg\min_{\beta:S_\beta=S^0}\left\{3\mathcal{E}(f_\beta)+\frac{9\lambda_n^2|S^0|}{\kappa\varphi^*_{\mathrm{GRE}}}\right\}=\beta^0,$
and $\varphi^*_{\mathrm{GRE}}=\varphi_{\mathrm{GRE}}(S^0,3,1)$ concludes the assertion.

Corollary 1 shows that the error between the IILasso estimate and the best linear approximation is bounded by the excess risk of the best linear approximation, the number of active variables in the best linear approximation, and constants depending on $n$ and $p$. In addition, corollary 2 shows that the error between the IILasso estimate and the true linear model (if it exists) is bounded by the number of true active variables and constants depending on $n$ and $p$.

### 4.3  Estimation Error for Logistic Models

We examine logistic regression models as an illustration. Consider data $\{(X_i,Y_i)\}_{i=1}^n$ with $(X_i,Y_i)\in\mathbb{R}^p\times\{0,1\}$. Define $\pi(x):=P(Y=1\mid X=x)$. The logistic model is
$\log\frac{\hat\pi(x)}{1-\hat\pi(x)}=f(x)=\mu+\sum_{j=1}^p\beta_jx_j,$
and the logistic loss is
$\rho_f(x,y)=\rho(f(x),y):=-yf(x)+\log(1+\exp(f(x))).$
We take the target as
$f^0(x):=\log\frac{\pi(x)}{1-\pi(x)}$
because it minimizes $\mathbb{E}_{Y|X}[\rho_f(X,Y)\mid X=x]$. Then we obtain the following estimation error bound:
Corollary 3.
Suppose that for some constant $0<\varepsilon_0<1$,
$\varepsilon_0<\pi(x)<1-\varepsilon_0,$
for $\forall x\in\mathcal{X}$. Suppose that $\|X_j\|\le1$ for $\forall j$. Suppose that assumption GRE$(S,3,1)$ (see definition 4) is satisfied with a GRE constant $\varphi^*_{\mathrm{GRE}}$. Define the oracle $\beta^*$ as in equation 4.6 (see definition 5). Fix any $0<\delta<1$ and $\eta>0$. Then define
$\gamma_n=\gamma_n(\delta):=2\left(4\sqrt{\frac{2\log2p}{n}}+\sqrt{\frac{2\log(1/\delta)}{n}}\right),\quad M^*:=\frac{1}{\gamma_n}\left(3\mathcal{E}(f_{\beta^*})+\frac{9(\exp(\eta)+\varepsilon_0)^2\lambda_n^2|S_{\beta^*}|}{\varepsilon_0^2\varphi^*_{\mathrm{GRE}}}\right).$
Let the regularization parameters satisfy
$\alpha\le\frac{1}{12D\|\beta^*\|_1}\quad\text{and}\quad\frac{20\gamma_n}{3}\le\lambda_n\le T\gamma_n$
for some constant $T$. In addition, suppose that
$\frac{T^2(\exp(\eta)+\varepsilon_0)^2\gamma_n|S_{\beta^*}|}{\varepsilon_0^2\varphi^*_{\mathrm{GRE}}}\le\frac{\eta}{27},\quad\|f^*-f^0\|_\infty\le\frac{\eta}{3},\quad\text{and}\quad\frac{\mathcal{E}(f^*)}{\gamma_n}\le\frac{\eta}{9},$
where $f^*:=f_{\beta^*}$. Let $\hat\beta$ be an approximate minimizer of the IILasso such that $L(\hat\beta)\le L(\beta)$ for all $\|\beta-\beta^*\|_1\le M^*$. Then we have
$\|\hat\beta-\beta^*\|_2^2\le\frac{22(\exp(\eta)+\varepsilon_0)^2\mathcal{E}(f_{\beta^*})}{\varepsilon_0^2\varphi^*_{\mathrm{GRE}}}+\frac{60(\exp(\eta)+\varepsilon_0)^4\lambda_n^2|S_{\beta^*}|}{\varepsilon_0^4\varphi^{*2}_{\mathrm{GRE}}},$
with probability at least $1-δ$.

The proof is given in section B.8.

## 5  Numerical Experiments

### 5.1  Synthetic Data Experiments for Linear Models

First, we validated the effectiveness of the IILasso for linear models. We considered the case in which the true active variables are uncorrelated and many inactive variables are strongly correlated with the active variables. If all active and inactive variables are uncorrelated, it is easy to tell which are active and which are inactive. But if the inactive variables are strongly correlated with the active variables, it is hard to distinguish which one is active. We simulated such a situation.

We generated a design matrix $X\in\mathbb{R}^{n\times p}$ from the gaussian distribution $N(0,\Sigma)$, where $\Sigma=\mathrm{Diag}(\Sigma^{(1)},\ldots,\Sigma^{(b)})$ was a block diagonal matrix whose blocks $\Sigma^{(l)}\in\mathbb{R}^{q\times q}$ had elements $\Sigma^{(l)}_{jk}=0.95$ for $j\ne k$ and $\Sigma^{(l)}_{jk}=1$ for $j=k$. We set $n=50$, $p=100$, $b=10$, and $q=10$. Thus, there were 10 groups, each containing 10 strongly correlated variables. Next, we generated a response $y$ from the true active variables $X_1,X_{11},X_{21},\ldots,X_{91}$ as $y=10X_1-9X_{11}+8X_{21}-7X_{31}+\cdots+2X_{81}-X_{91}+\varepsilon$, with standard gaussian noise $\varepsilon$, so that each group included exactly one active variable. We generated three data sets, for training, validation, and test, independently by the above procedure.
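This design can be reproduced in a few lines (a sketch of our own under the settings above; the seed choice is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, b, q = 50, 100, 10, 10            # samples, dimension, blocks, block size

# Block-diagonal covariance: within-block correlation 0.95, unit variances
Sigma_block = np.full((q, q), 0.95) + 0.05 * np.eye(q)
Sigma = np.kron(np.eye(b), Sigma_block)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# True coefficients: one active variable per block (X_1, X_11, ..., X_91)
beta_true = np.zeros(p)
beta_true[::q] = [10, -9, 8, -7, 6, -5, 4, -3, 2, -1]
y = X @ beta_true + rng.standard_normal(n)
```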

Then we compared the performance of the Lasso, SCAD (Fan & Li, 2001), MCP (Zhang, 2010), EGLasso (Kong et al., 2014), and IILasso. The evaluation criteria are prediction error (mean squared error), estimation error ($\ell_2$ norm between the true and estimated coefficients), and model size (the number of nonzero coefficients). The SCAD and MCP are representative folded concave penalties, so their objective functions are nonconvex, as is ours. They have a tuning parameter $\gamma$; we set $\gamma=2.5$, 3.7, 10, 20, 100, 1000 for the SCAD and $\gamma=1.5$, 3, 10, 20, 100, 1000 for the MCP. The EGLasso has a parameter $\lambda_2$; we set $\lambda_2/\lambda_1=0.01$, 0.1, 1, 10, 100, 1000. For the EGLasso, we used the true group information beforehand. We used the R packages ncvreg (Breheny & Huang, 2011) for the Lasso, SCAD, and MCP, and sparsenet (Mazumder, Friedman, & Hastie, 2011) for the MCP. One can solve the MCP using either ncvreg or sparsenet; they differ in their optimization algorithms and initialization. For the IILasso, we defined $R_{jk}=|r_{jk}|/(1-|r_{jk}|)$ for $j\ne k$ and $R_{jk}=0$ for $j=k$. Hence, $R_{SS}$ takes small values if the active variables are independent, and $R_{SS^c}$ and $R_{S^cS^c}$ take large values if the inactive variables are strongly correlated with other variables, which is favorable in light of the theoretical results in section 3. We set $\alpha=0.01$, 0.1, 1, 10, 100, 1000. We tuned these parameters using the validation data and calculated errors using the test data. We repeated this procedure 500 times and evaluated the averages and standard errors.

Table 1 shows the performances with their standard error in parentheses. The IILasso achieved the best prediction and estimation among all of them. This was because our penalty term excluded the correlations and avoided overfitting. Moreover, the model size of the IILasso was much smaller than those of the Lasso and EGLasso and comparable to MCP. As a whole, the IILasso could estimate the most accurate model with a few variables.

Table 1: Results of Synthetic Data for Linear Models.

| Method | Prediction Error | Estimation Error | Model Size |
| --- | --- | --- | --- |
| Lasso (ncvreg) | 2.67 (0.05) | 4.44 (0.06) | 34.1 (0.46) |
| SCAD (ncvreg) | 1.52 (0.02) | 1.79 (0.04) | 14.6 (0.23) |
| MCP (ncvreg) | 1.53 (0.02) | 1.79 (0.04) | 14.6 (0.24) |
| MCP (sparsenet) | 2.41 (0.11) | 3.15 (0.13) | **13.4 (0.28)** |
| EGLasso | 2.60 (0.04) | 4.36 (0.05) | 33.3 (0.32) |
| IILasso (ours) | **1.45 (0.02)** | **1.40 (0.04)** | **13.5 (0.23)** |

Note: Bold characters indicate the best performer within one standard error.

### 5.2  Synthetic Data Experiments for Logistic Models

Next, we validated the performance of the IILasso for logistic regression models. We generated a response $y$ from Bernoulli($\pi(x)$), where $\log\pi(x)/(1-\pi(x))=10X_1-9X_{11}+8X_{21}-7X_{31}+\cdots+2X_{81}-X_{91}$. For the remaining settings, we used the same parameters and procedures as in section 5.1, except that $n=100$ instead of $n=50$, because it is difficult for all methods to estimate logistic regression models with small samples. Logistic regression for the Lasso, SCAD, and MCP is supported by ncvreg (sparsenet does not support logistic regression). The EGLasso and the IILasso can also be formulated for logistic regression. (For algorithm details, see appendix C.) We evaluated the negative log likelihood, misclassification error (the rate of misclassification), estimation error, and model size.

Table 2 shows the performances with their standard error in parentheses. The IILasso outperformed other methods significantly in terms of misclassification error and estimation error. In addition, the IILasso presented a much smaller model size than other methods. Therefore, the IILasso could efficiently estimate the accurate logistic models with a few variables.

Table 2: Results of Synthetic Data for Logistic Models.

| Method | Negative Log Likelihood | Misclassification Error | Estimation Error | Model Size |
| --- | --- | --- | --- | --- |
| Lasso (ncvreg) | 0.484 (0.006) | 0.107 (0.002) | 16.8 (0.06) | 22.9 (0.14) |
| SCAD (ncvreg) | 0.486 (0.007) | 0.105 (0.002) | 16.00 (0.10) | 18.8 (0.25) |
| MCP (ncvreg) | 0.489 (0.007) | 0.106 (0.002) | 16.7 (0.10) | 18.1 (0.26) |
| EGLasso | 0.477 (0.006) | 0.104 (0.002) | 16.7 (0.06) | 25.7 (0.18) |
| IILasso (ours) | **0.471 (0.009)** | **0.096 (0.002)** | **14.4 (0.14)** | **12.3 (0.20)** |

Note: Bold characters indicate the best performer within one standard error.

### 5.3  Real Data Experiments: Gene Expression Data

We applied our method to various gene expression data sets to validate its effectiveness in real applications. We used the following 10 data sets: alon (Alon et al., 1999; colon cancer), chiaretti (Chiaretti et al., 2004; leukemia), gordon (Gordon et al., 2002; lung cancer), gravier (Gravier et al., 2010; breast cancer), pomeroy (Pomeroy et al., 2002; central nervous system disorders), shipp (Shipp et al., 2002; lymphoma), singh (Singh et al., 2002; prostate cancer), subramanian (Subramanian et al., 2005; miscellaneous), tian (Tian et al., 2003; myeloma), and west (West et al., 2001; breast cancer). All of these data sets are provided by the R package datamicroarray. A summary of the data sets is given in Table 3. All are small-sample, high-dimensional DNA microarray data. Since the response is binary, logistic regression was applied. We used the same settings for the regularization parameters as described in section 5.1. We evaluated the negative log likelihood, misclassification error, model size, and maximum correlation among active variables using twenty-fold cross-validation.

Table 3: Summary of Microarray Data.

| Data | Number of Samples | Number of Dimensions | Number of Positive Labels |
| --- | --- | --- | --- |
| Alon | 62 | 2000 | 22 |
| Chiaretti | 111 | 12,625 | 10 |
| Gordon | 181 | 12,533 | 150 |
| Gravier | 168 | 2905 | 111 |
| Pomeroy | 60 | 7128 | 21 |
| Shipp | 77 | 7129 | 58 |
| Singh | 102 | 12,600 | 50 |
| Subramanian | 50 | 10,100 | 33 |
| Tian | 173 | 12,625 | 137 |
| West | 49 | 7129 | 25 |

The results are given in Figure 4. In terms of negative log likelihood, 8 out of 10 data sets gave similar performance across all methods, and on the other two data sets, subramanian and west, the IILasso achieved the smallest negative log likelihood. The MCP was comparable to, but fell behind, the IILasso. We can see similar tendencies for the misclassification errors: the IILasso won in 8 out of 10 cases, including 5 ties. Although the IILasso selected slightly more variables than the MCP (the IILasso won in 5 cases, including 3 ties; the MCP won in 6, including 2 ties), its maximum correlations among active variables were superior to the others (the IILasso won in 8, including 5 ties; the MCP won in 5, including 5 ties). As a whole, the IILasso could construct accurate models with small correlations.

Figure 4: Results of the 10 microarray data sets.

We further investigated the influence of the regularization parameters on the data sets subramanian and west because they showed clear differences among methods. Figure 5 shows the results for the Lasso, MCP, and IILasso. Compared with the Lasso, the IILasso mitigated overfitting even when $\lambda$ was small, and hence achieved low negative log likelihood and misclassification error. Moreover, the IILasso suppressed its model size and the correlations among active variables. The MCP rapidly decreased the negative log likelihood as $\lambda$ decreased, and overfitting occurred early. We note that the MCP paths terminate early because the algorithm reached the maximum number of iterations defined in ncvreg.

Figure 5: Results of subramanian and west along the sequences of $\lambda$.

## 6  Conclusion

We have proposed a new regularization method, the IILasso. The IILasso reduces correlations among the active variables; hence, it is easy to decompose and interpret the model. We showed that the sign recovery condition of the IILasso is milder than that of the Lasso for correlated designs as long as the true important variables are uncorrelated with each other. The convergence rate of the IILasso also compares favorably with that of the Lasso. Moreover, we extended the IILasso to GLMs and analyzed its convergence rate. Finally, we verified the effectiveness of the IILasso by synthetic and real data analyses using 10 gene expression data sets, showing that the IILasso was superior in many cases on high-dimensional data.

## Appendix A:  Shapes of Regularization Contours

We show contours of various regularization terms in Figure 6, including the SCAD (Fan & Li, 2001), MCP (Zhang, 2010), $\ell_q$-norm, log penalty (Candes, Wakin, & Boyd, 2008), and ordered weighted $\ell_1$-norm (Bogdan, Van Den Berg, Sabatti, Su, & Candès, 2015; Figueiredo & Nowak, 2016; Zeng & Figueiredo, 2014). We compare them with the IILasso (see Figure 1) and summarize their properties as follows:

• The shapes of the SCAD and MCP are similar to each other; they behave like the Lasso for large $γ$ or small magnitude of $β$, while the contours are pointed at the axes and parallel to the axes for small $γ$ and large magnitude of $β$. Their shapes are adaptive for the magnitude of $β$ depending on their hyperparameter $γ$, but they are not adaptive for correlations among variables in contrast to the IILasso.

• The shape of the $\ell_q$-norm for $0<q<1$ is pointed at the axes, so it resembles that of the IILasso for correlated variables.

• The log penalty looks similar to the $\ell_q$-norm for $0<q<1$, as well as to the IILasso for correlated variables.

• The ordered weighted $\ell_1$-norm is similar to the elastic net, but its contours are pointed at $|\beta_1|=|\beta_2|$, resulting in a grouping effect.

Figure 6: Contours of various regularization terms.

## Appendix B:  Proofs

### B.1  Proof of Theorem 1

By standard conditions for optimality, $\hat\beta$ is a critical point if and only if there exists a subgradient $\hat{z}\in\partial\|\hat\beta\|_1:=\{\hat{z}\in\mathbb{R}^p\mid\hat{z}_j=\mathrm{sgn}(\hat\beta_j)$ for $\hat\beta_j\ne0$, $|\hat{z}_j|\le1$ otherwise$\}$ such that $\partial_\beta L(\hat\beta)=0$. Because $\partial_\beta\left(\frac{1}{2}|\beta|^\top R|\beta|\right)=\mathrm{Diag}(R|\beta|)z$ with $z\in\partial\|\beta\|_1$, the condition $\partial_\beta L(\hat\beta)=0$ yields
$-\frac{1}{n}X^\top(y-X\hat\beta)+\lambda\hat{z}+\lambda\alpha\,\mathrm{Diag}(R|\hat\beta|)\hat{z}=0.$
(B.1)
Substituting $y=X\beta^*+\varepsilon$ into equation B.1, we have
$-\frac{1}{n}X^\top\left(X(\beta^*-\hat\beta)+\varepsilon\right)+\lambda\hat{z}+\lambda\alpha\,\mathrm{Diag}(R|\hat\beta|)\hat{z}=0.$
(B.2)
Let the true active set be $S=\{1,\ldots,s\}$ and the inactive set $S^c=\{s+1,\ldots,p\}$ without loss of generality; then equation B.2 turns into
$\frac{1}{n}X_S^\top X_S\left(\hat\beta_S-\beta_S^*\right)+\frac{1}{n}X_S^\top X_{S^c}\hat\beta_{S^c}-\frac{1}{n}X_S^\top\varepsilon+\lambda\hat{z}_S+\lambda\alpha\,\mathrm{Diag}\left(R_{SS}|\hat\beta_S|\right)\hat{z}_S=0,$
(B.3)
$\frac{1}{n}X_{S^c}^\top X_S\left(\hat\beta_S-\beta_S^*\right)+\frac{1}{n}X_{S^c}^\top X_{S^c}\hat\beta_{S^c}-\frac{1}{n}X_{S^c}^\top\varepsilon+\lambda\hat{z}_{S^c}+\lambda\alpha\,\mathrm{Diag}\left(R_{S^cS}|\hat\beta_S|\right)\hat{z}_{S^c}=0.$
(B.4)
Hence, there exists a critical point with correct sign recovery if and only if there exist $\hat\beta$ and $\hat{z}$ satisfying equations B.3 and B.4, $\hat{z}\in\partial\|\hat\beta\|_1$, and $\mathrm{sgn}(\hat\beta)=\mathrm{sgn}(\beta^*)$. The latter two conditions can be written as
$\hat{z}_S=\mathrm{sgn}(\beta_S^*),$
(B.5)
$|\hat{z}_{S^c}|\le1,$
(B.6)
$\mathrm{sgn}(\hat\beta_S)=\mathrm{sgn}(\beta_S^*),$
(B.7)
$\hat\beta_{S^c}=0.$
(B.8)
Conditions B.5 and B.8 yield
$\frac{1}{n}X_S^\top X_S\left(\hat\beta_S-\beta_S^*\right)-\frac{1}{n}X_S^\top\varepsilon+\lambda\,\mathrm{sgn}(\beta_S^*)+\lambda\alpha\,\mathrm{Diag}\left(R_{SS}|\hat\beta_S|\right)\mathrm{sgn}(\beta_S^*)=0,$
(B.9)
$\frac{1}{n}X_{S^c}^\top X_S\left(\hat\beta_S-\beta_S^*\right)-\frac{1}{n}X_{S^c}^\top\varepsilon+\lambda\hat{z}_{S^c}+\lambda\alpha\,\mathrm{Diag}\left(R_{S^cS}|\hat\beta_S|\right)\hat{z}_{S^c}=0.$
(B.10)
Since
$\mathrm{Diag}\left(R_{SS}|\hat\beta_S|\right)\mathrm{sgn}(\beta_S^*)=\mathrm{Diag}(\mathrm{sgn}(\beta_S^*))R_{SS}|\hat\beta_S|=\mathrm{Diag}(\mathrm{sgn}(\beta_S^*))R_{SS}\mathrm{Diag}(\mathrm{sgn}(\beta_S^*))\hat\beta_S,$
equation B.9 can be rewritten as
$U\left(\hat\beta_S-\beta_S^*\right)+w-\frac{1}{n}X_S^\top\varepsilon=0,$
where
$U:=\frac{1}{n}X_S^\top X_S+\lambda\alpha\,\mathrm{Diag}(\mathrm{sgn}(\beta_S^*))R_{SS}\mathrm{Diag}(\mathrm{sgn}(\beta_S^*)),\quad w:=\lambda\,\mathrm{sgn}(\beta_S^*)+\lambda\alpha\,\mathrm{Diag}(\mathrm{sgn}(\beta_S^*))R_{SS}\mathrm{Diag}(\mathrm{sgn}(\beta_S^*))\beta_S^*$
(so that $w$ here equals $\lambda w$ in the statement of theorem 1). If we assume $U$ is invertible, we obtain
$\hat\beta_S=\beta_S^*-U^{-1}\left(w-\frac{1}{n}X_S^\top\varepsilon\right).$
(B.11)
Substituting this into equation B.10, we have
$-\frac{1}{n}X_{S^c}^\top X_SU^{-1}\left(w-\frac{1}{n}X_S^\top\varepsilon\right)-\frac{1}{n}X_{S^c}^\top\varepsilon+\lambda\hat{z}_{S^c}+\lambda\alpha\,\mathrm{Diag}\left(R_{S^cS}\left|\beta_S^*-U^{-1}\left(w-\frac{1}{n}X_S^\top\varepsilon\right)\right|\right)\hat{z}_{S^c}=0,$
that is,
$\lambda\left(I+\alpha\,\mathrm{Diag}\left(R_{S^cS}\left|\beta_S^*-U^{-1}\left(w-\frac{1}{n}X_S^\top\varepsilon\right)\right|\right)\right)\hat{z}_{S^c}$
(B.12)
$=\frac{1}{n}X_{S^c}^\top X_SU^{-1}\left(w-\frac{1}{n}X_S^\top\varepsilon\right)+\frac{1}{n}X_{S^c}^\top\varepsilon.$
(B.13)
Combining equations B.6, B.7, B.11, and B.13, we conclude the assertion. $\square$

### B.2  Subgaussian Tail Bounds

We briefly summarize the definition and properties of subgaussian variables, since they play a key role in our nonasymptotic analyses (see Wainwright, 2009, and Rigollet & Hütter, 2015, for details). Let $\varepsilon$ be a zero-mean random variable. We say that $\varepsilon$ is a subgaussian variable with parameter $\sigma>0$ if it holds for $\forall t\in\mathbb{R}$ that
$\mathbb{E}[\exp(t\varepsilon)]\le\exp\left(\frac{\sigma^2t^2}{2}\right).$
(B.14)
By applying the Chernoff bound to equation B.14, we have the subgaussian tail bound for $\forall z>0$:
$P\left(|\varepsilon|>z\right)\le2\exp\left(-\frac{z^2}{2\sigma^2}\right).$
(B.15)

We obtain the following lemma for a sequence of subgaussian variables. This is useful for our theoretical analyses of sign recovery.

Lemma 1.
Let $\{\varepsilon_i\}_{i=1}^n$ be i.i.d. zero-mean subgaussian variables with parameter $\sigma$. Then we have for $\forall a\in\mathbb{R}^n$ and $\forall z>0$:
$P\left(\left|\sum_{i=1}^na_i\varepsilon_i\right|>z\right)\le2\exp\left(-\frac{z^2}{2\|a\|_2^2\sigma^2}\right).$
Proof.
From the definition of subgaussianity, we have
$\mathbb{E}\left[\exp\left(t\sum_{i=1}^na_i\varepsilon_i\right)\right]=\prod_{i=1}^n\mathbb{E}\left[\exp(ta_i\varepsilon_i)\right]\le\prod_{i=1}^n\exp\left(\frac{a_i^2\sigma^2t^2}{2}\right)=\exp\left(\frac{\|a\|_2^2\sigma^2t^2}{2}\right).$
Therefore, the subgaussian tail bound, equation B.15, concludes the assertion.

In addition, we prepare the following corollary using lemma 19. This is useful for our theoretical analyses of estimation error and local optimality.

Corollary 20.
Suppose that assumption 1 and $\sum_{i=1}^n X_{ij}^2/n \le 1$ for all $j = 1, \dots, p$ are satisfied. For $\delta > 0$, define $\gamma_n := \gamma_n(\delta)$ as in equation 3.7. Then we have
$P\Big(\Big\|\frac{1}{n}X^\top\varepsilon\Big\|_\infty \ge \gamma_n\Big) \le \delta.$
Proof.
Notice that
$P\Big(\Big\|\frac{1}{n}X^\top\varepsilon\Big\|_\infty \ge \gamma_n\Big) = P\Big(\max_{1\le j\le p}\Big|\frac{1}{n}\sum_{i=1}^n X_{ij}\varepsilon_i\Big| \ge \gamma_n\Big) = P\Big(\bigcup_{1\le j\le p}\Big\{\Big|\frac{1}{n}\sum_{i=1}^n X_{ij}\varepsilon_i\Big| \ge \gamma_n\Big\}\Big) \le \sum_{j=1}^p P\Big(\Big|\frac{1}{n}\sum_{i=1}^n X_{ij}\varepsilon_i\Big| \ge \gamma_n\Big) \le p\max_{1\le j\le p} P\Big(\Big|\frac{1}{n}\sum_{i=1}^n X_{ij}\varepsilon_i\Big| \ge \gamma_n\Big) \le 2p\max_{1\le j\le p}\exp\Big(-\frac{n^2\gamma_n^2}{2\sigma^2\|X_j\|_2^2}\Big) \le \exp\Big(-\frac{n\gamma_n^2}{2\sigma^2} + \log(2p)\Big),$
where we used lemma 19 in the second-to-last inequality and $\|X_j\|_2^2 \le n$ in the last. Since we set $\delta = \exp(-n\gamma_n^2/(2\sigma^2) + \log(2p))$, we conclude the assertion.
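The corollary can be illustrated empirically. The sketch below (assuming NumPy, Gaussian noise, and columns normalized so that $\sum_i X_{ij}^2/n = 1$) uses the closed form $\gamma_n = \sigma\sqrt{2\log(2p/\delta)/n}$ implied by the relation $\delta = \exp(-n\gamma_n^2/(2\sigma^2) + \log(2p))$, and checks that the event $\{\|\frac{1}{n}X^\top\varepsilon\|_\infty \ge \gamma_n\}$ occurs with frequency at most $\delta$.

```python
import numpy as np

# Empirical frequency of ||X^T eps / n||_inf >= gamma_n versus delta.
rng = np.random.default_rng(2)
n, p, sigma, delta, trials = 200, 50, 1.0, 0.05, 2000
gamma_n = sigma * np.sqrt(2 * np.log(2 * p / delta) / n)

X = rng.standard_normal((n, p))
X /= np.sqrt((X**2).sum(axis=0) / n)   # enforce sum_i X_ij^2 / n = 1

hits = 0
for _ in range(trials):
    eps = rng.normal(0.0, sigma, n)
    hits += np.max(np.abs(X.T @ eps) / n) >= gamma_n
print(f"empirical {hits / trials:.4f} <= delta {delta}")
```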

### B.3  Proof of Theorem 6

We derive sufficient conditions for equations 3.1 and 3.2 in theorem 2.

In terms of equation 3.1, it is sufficient if
$\beta_{\min}^* > \Big\|U^{-1}\Big(\lambda_n w - \frac{1}{n}X_S^\top\varepsilon\Big)\Big\|_\infty.$
By the triangle inequality, we have
$\Big\|U^{-1}\Big(\lambda_n w - \frac{1}{n}X_S^\top\varepsilon\Big)\Big\|_\infty \le \lambda_n\|U^{-1}w\|_\infty + \Big\|U^{-1}\frac{1}{n}X_S^\top\varepsilon\Big\|_\infty.$
(B.16)
The first term on the right-hand side of equation B.16 is bounded as
$\lambda_n\|U^{-1}w\|_\infty \le \lambda_n\big(1 + \alpha D\|\beta_S^*\|_1\big)\|U^{-1}\|_\infty.$
Consider the $j$th element of the random variable in the second term on the right-hand side of equation B.16,
$T_j := e_j^\top U^{-1}\frac{1}{n}X_S^\top\varepsilon,$
where $e_j \in \mathbb{R}^s$ denotes the unit vector with 1 in the $j$th element and 0 elsewhere. From lemma 19, we have for all $t > 0$,
$P(|T_j| > t) \le 2\exp\Big(-\frac{t^2 n^2}{2\sigma^2\|X_S U^{-1}e_j\|_2^2}\Big).$
Using assumption 2, we have
$\Big(\frac{1}{n}X_S^\top X_S\Big)^{-1/2} U \Big(\frac{1}{n}X_S^\top X_S\Big)^{-1/2} = I + \lambda_n\alpha\Big(\frac{1}{n}X_S^\top X_S\Big)^{-1/2}\mathrm{Diag}(\mathrm{sgn}(\beta_S^*))\,R_{SS}\,\mathrm{Diag}(\mathrm{sgn}(\beta_S^*))\Big(\frac{1}{n}X_S^\top X_S\Big)^{-1/2} \succeq (1 + \lambda_n\alpha\psi)I$
$\Rightarrow U\Big(\frac{1}{n}X_S^\top X_S\Big)^{-1}U = \Big(\frac{1}{n}X_S^\top X_S\Big)^{1/2}\Big[\Big(\frac{1}{n}X_S^\top X_S\Big)^{-1/2}U\Big(\frac{1}{n}X_S^\top X_S\Big)^{-1/2}\Big]^2\Big(\frac{1}{n}X_S^\top X_S\Big)^{1/2} \succeq \phi(1 + \lambda_n\alpha\psi)^2 I$
$\Rightarrow \|X_S U^{-1}e_j\|_2^2 = e_j^\top U^{-1}X_S^\top X_S U^{-1}e_j \le \frac{n}{\phi(1 + \lambda_n\alpha\psi)^2}.$
Hence, we obtain
$P\Big(\max_{j\in S}|T_j| > t\Big) \le 2s\exp\Big(-\frac{t^2 n\phi(1 + \lambda_n\alpha\psi)^2}{2\sigma^2}\Big).$
Setting $t = 4\lambda_n\sigma/(\sqrt{\phi}(1 + \lambda_n\alpha\psi))$, we have
$P\Big(\max_{j\in S}|T_j| > \frac{4\lambda_n\sigma}{\sqrt{\phi}(1 + \lambda_n\alpha\psi)}\Big) \le \exp\big(-8\lambda_n^2 n + \log(2s)\big).$
Therefore, if $\Delta_{\min} > 0$ holds, where
$\Delta_{\min} := \beta_{\min}^* - \lambda_n\Big[\big(1 + \alpha D\|\beta_S^*\|_1\big)\|U^{-1}\|_\infty + \frac{4\sigma}{\sqrt{\phi}(1 + \lambda_n\alpha\psi)}\Big],$
then equation 3.1 is satisfied with probability at least $1 - \exp(-8\lambda_n^2 n + \log(2s))$.
In terms of equation 3.2, it is sufficient if for all $j \in S^c$,
$\Big|\frac{\lambda_n}{n}X_j^\top X_S U^{-1}w\Big| + \Big|\frac{1}{n}X_j^\top\Big(I - \frac{1}{n}X_S U^{-1}X_S^\top\Big)\varepsilon\Big| \le \lambda_n\big(1 + \alpha\|R_{Sj}\|_1\Delta_{\min}\big),$
(B.17)
since we now have, elementwise,
$\Big|\beta_S^* - U^{-1}\Big(\lambda_n w - \frac{1}{n}X_S^\top\varepsilon\Big)\Big| > \Delta_{\min} > 0.$
The first term on the left-hand side of equation B.17 is bounded as
$\Big|\frac{\lambda_n}{n}X_j^\top X_S U^{-1}w\Big| \le \frac{\lambda_n}{n}\big(1 + \alpha D\|\beta_S^*\|_1\big)\big\|X_j^\top X_S U^{-1}\big\|_1.$
Consider the random variable
$Z_j := \frac{1}{n}X_j^\top\Big(I - \frac{1}{n}X_S U^{-1}X_S^\top\Big)\varepsilon$
in the second term on the left-hand side of equation B.17. From lemma 19, we have for all $z_j > 0$,
$P(|Z_j| > z_j) \le 2\exp\Big(-\frac{n^2 z_j^2}{2\sigma^2\big\|\big(I - \frac{1}{n}X_S U^{-1}X_S^\top\big)X_j\big\|_2^2}\Big).$
Since $U \succeq (1 + \lambda_n\alpha\psi)I \succeq O$, we have
$I - \frac{1}{n}X_S U^{-1}X_S^\top \preceq I.$
(B.18)
On the other hand,
$I - \frac{1}{n}X_S U^{-1}X_S^\top = I - X_S(X_S^\top X_S)^{-1/2}\Big[\Big(\frac{1}{n}X_S^\top X_S\Big)^{-1/2}U\Big(\frac{1}{n}X_S^\top X_S\Big)^{-1/2}\Big]^{-1}(X_S^\top X_S)^{-1/2}X_S^\top \succeq I - \frac{1}{1 + \lambda_n\alpha\psi}X_S(X_S^\top X_S)^{-1}X_S^\top \succeq \Big(1 - \frac{1}{1 + \lambda_n\alpha\psi}\Big)I,$
(B.19)
where we used the fact that $X_S(X_S^\top X_S)^{-1}X_S^\top$ is the projection matrix onto the column space of $X_S$ in the last inequality. Equations B.18 and B.19 give
$\Big\|I - \frac{1}{n}X_S U^{-1}X_S^\top\Big\|_2^2 \le \max\Big\{1, \Big(\frac{\lambda_n\alpha\psi}{1 + \lambda_n\alpha\psi}\Big)^2\Big\} =: \nu^2.$
(B.20)
Hence, we obtain
$P\Big(\bigcup_{j\in S^c}\{|Z_j| > z_j\}\Big) \le \sum_{j\in S^c}2\exp\Big(-\frac{n z_j^2}{2\sigma^2\nu^2}\Big).$
Setting $z_j = \lambda_n\big(1 + \alpha\|R_{Sj}\|_1\Delta_{\min}\big)\kappa/2$, we have
$P\Big(\bigcup_{j\in S^c}\Big\{|Z_j| > \frac{\lambda_n(1 + \alpha\|R_{Sj}\|_1\Delta_{\min})\kappa}{2}\Big\}\Big) \le 2\sum_{j\in S^c}\exp\Big(-\frac{n\lambda_n^2\kappa^2(1 + \alpha\|R_{Sj}\|_1\Delta_{\min})^2}{8\sigma^2\nu^2}\Big) \le 2(p-s)\exp\Big(-\frac{n\lambda_n^2\kappa^2}{8\sigma^2\nu^2}\Big) = \exp\Big(-\frac{n\lambda_n^2\kappa^2}{8\sigma^2\nu^2} + \log(2(p-s))\Big).$
Therefore, the generalized incoherence condition (see definition 5) yields condition 3.2 with probability at least $1 - \exp(-n\lambda_n^2\kappa^2/(8\sigma^2\nu^2) + \log(2(p-s)))$.

Overall, conditions 3.1 and 3.2 hold with probability at least $1 - \exp(-8\lambda_n^2 n + \log(2s)) - \exp(-n\lambda_n^2\kappa^2/(8\sigma^2\nu^2) + \log(2(p-s)))$. Since we set $\lambda_n \ge \max\{1/(4\sigma),\ 2\nu/\kappa\}\gamma_n$ and $\delta = \exp(-n\gamma_n^2/(2\sigma^2) + \log(2p))$, this probability is at least $1 - 2\delta$. $\square$

### B.4  Proof of Theorem 8

By $L(\hat{\beta}) \le L(\beta^*)$ and $y = X\beta^* + \varepsilon$, it holds that
$\frac{1}{2n}\|X(\hat{\beta} - \beta^*)\|_2^2 + \lambda_n\Big(\|\hat{\beta}\|_1 + \frac{\alpha}{2}|\hat{\beta}|^\top R|\hat{\beta}|\Big) \le \frac{1}{n}\varepsilon^\top X(\hat{\beta} - \beta^*) + \lambda_n\Big(\|\beta^*\|_1 + \frac{\alpha}{2}|\beta^*|^\top R|\beta^*|\Big).$
(B.21)
By corollary 20, it holds that
$P\Big(\Big\|\frac{1}{n}X^\top\varepsilon\Big\|_\infty > \gamma_n\Big) \le \delta.$
Hereafter, we assume that the event $\{\|\frac{1}{n}X^\top\varepsilon\|_\infty \le \gamma_n\}$ occurs.
Then, if $\gamma_n \le \lambda_n/3$, by equation B.21,
$\frac{1}{2n}\|X(\hat{\beta} - \beta^*)\|_2^2 + \lambda_n\Big(\|\hat{\beta}\|_1 + \frac{\alpha}{2}|\hat{\beta}|^\top R|\hat{\beta}|\Big) \le \frac{1}{n}\|\varepsilon^\top X\|_\infty\|\beta^* - \hat{\beta}\|_1 + \lambda_n\Big(\|\beta^*\|_1 + \frac{\alpha}{2}|\beta^*|^\top R|\beta^*|\Big) \le \gamma_n\|\beta^* - \hat{\beta}\|_1 + \lambda_n\Big(\|\beta^*\|_1 + \frac{\alpha}{2}|\beta^*|^\top R|\beta^*|\Big) \le \frac{\lambda_n}{3}\|\beta^* - \hat{\beta}\|_1 + \lambda_n\Big(\|\beta^*\|_1 + \frac{\alpha}{2}|\beta^*|^\top R|\beta^*|\Big).$
(B.22)
Since
$\|\hat{\beta} - \beta^*\|_1 = \|\hat{\beta}_S - \beta_S^*\|_1 + \|\hat{\beta}_{S^c} - \beta_{S^c}^*\|_1 = \|\hat{\beta}_S - \beta_S^*\|_1 + \|\hat{\beta}_{S^c}\|_1$
and
$|\beta_S^*|^\top R_{SS}|\beta_S^*| - |\hat{\beta}_S|^\top R_{SS}|\hat{\beta}_S| \le \sum_{(j,k)\in S\times S} R_{jk}|\beta_j^*\beta_k^* - \hat{\beta}_j\hat{\beta}_k| \le 2\sum_{(j,k)\in S\times S} R_{jk}|\beta_j^*(\beta_k^* - \hat{\beta}_k)| + \sum_{(j,k)\in S\times S} R_{jk}|(\beta_j^* - \hat{\beta}_j)(\beta_k^* - \hat{\beta}_k)| = 2|\beta_S^*|^\top R_{SS}|\beta_S^* - \hat{\beta}_S| + |\beta_S^* - \hat{\beta}_S|^\top R_{SS}|\beta_S^* - \hat{\beta}_S| \le 2\big\|R_{SS}|\beta_S^*|\big\|_\infty\|\beta_S^* - \hat{\beta}_S\|_1 + D\|\beta_S^* - \hat{\beta}_S\|_1^2,$
we obtain
$\frac{1}{2n}\|X(\hat{\beta} - \beta^*)\|_2^2 + \lambda_n\Big(\|\hat{\beta}_S\|_1 + \|\hat{\beta}_{S^c}\|_1 + \frac{\alpha}{2}|\hat{\beta}_S|^\top R_{SS}|\hat{\beta}_S| + \frac{\alpha}{2}\sum_{(j,k)\notin S\times S}R_{jk}|\hat{\beta}_j\hat{\beta}_k|\Big) \le \frac{\lambda_n}{3}\big(\|\hat{\beta}_S - \beta_S^*\|_1 + \|\hat{\beta}_{S^c}\|_1\big) + \lambda_n\Big(\|\beta_S^*\|_1 + \frac{\alpha}{2}|\beta_S^*|^\top R_{SS}|\beta_S^*|\Big)$
$\Rightarrow \frac{1}{2n}\|X(\hat{\beta} - \beta^*)\|_2^2 + \lambda_n\Big(\frac{2}{3}\|\hat{\beta}_{S^c}\|_1 + \frac{\alpha}{2}\sum_{(j,k)\notin S\times S}R_{jk}|\hat{\beta}_j\hat{\beta}_k|\Big) \le \frac{\lambda_n}{3}\|\hat{\beta}_S - \beta_S^*\|_1 + \lambda_n\Big(\|\beta_S^*\|_1 - \|\hat{\beta}_S\|_1 + \alpha\big\|R_{SS}|\beta_S^*|\big\|_\infty\|\beta_S^* - \hat{\beta}_S\|_1 + \frac{\alpha D}{2}\|\beta_S^* - \hat{\beta}_S\|_1^2\Big)$
$\Rightarrow \frac{1}{2n}\|X(\hat{\beta} - \beta^*)\|_2^2 + \lambda_n\Big(\frac{2}{3}\|\hat{\beta}_{S^c}\|_1 + \frac{\alpha}{2}\sum_{(j,k)\notin S\times S}R_{jk}|\hat{\beta}_j\hat{\beta}_k|\Big) \le \lambda_n\Big(\frac{4}{3}\|\hat{\beta}_S - \beta_S^*\|_1 + \alpha\big\|R_{SS}|\beta_S^*|\big\|_\infty\|\beta_S^* - \hat{\beta}_S\|_1 + \frac{\alpha D}{2}\|\beta_S^* - \hat{\beta}_S\|_1^2\Big).$
(B.23)
On the other hand, equation B.22 also gives
$\|\hat{\beta}_S\|_1 + \|\hat{\beta}_{S^c}\|_1 \le \frac{1}{3}\big(\|\hat{\beta}_S - \beta_S^*\|_1 + \|\hat{\beta}_{S^c}\|_1\big) + \|\beta_S^*\|_1 + \frac{\alpha}{2}|\beta_S^*|^\top R_{SS}|\beta_S^*| \Rightarrow \frac{2}{3}\|\hat{\beta}_S - \beta_S^*\|_1 + \frac{2}{3}\|\hat{\beta}_{S^c}\|_1 \le 2\|\beta_S^*\|_1 + \frac{\alpha}{2}|\beta_S^*|^\top R_{SS}|\beta_S^*| \Rightarrow \|\hat{\beta}_S - \beta_S^*\|_1 \le 3\|\beta_S^*\|_1 + \frac{3\alpha}{4}|\beta_S^*|^\top R_{SS}|\beta_S^*| \Rightarrow \|\hat{\beta}_S - \beta_S^*\|_1 \le \Big(3 + \frac{3\alpha}{4}\big\|R_{SS}|\beta_S^*|\big\|_\infty\Big)\|\beta_S^*\|_1.$
Therefore, equation B.23 gives
$\frac{2}{3}\|\hat{\beta}_{S^c}\|_1 + \frac{\alpha}{2}\sum_{(j,k)\notin S\times S}R_{jk}|\hat{\beta}_j\hat{\beta}_k| \le \Big[\frac{4}{3} + \alpha\big\|R_{SS}|\beta_S^*|\big\|_\infty + \frac{3\alpha D}{2}\|\beta_S^*\|_1\Big(1 + \frac{\alpha}{4}\big\|R_{SS}|\beta_S^*|\big\|_\infty\Big)\Big]\|\hat{\beta}_S - \beta_S^*\|_1.$
(B.24)
The second term on the left-hand side is evaluated as
$\sum_{(j,k)\notin S\times S}R_{jk}|\hat{\beta}_j\hat{\beta}_k| = \sum_{j\in S^c,\,k\in S^c}R_{jk}|\hat{\beta}_j\hat{\beta}_k| + 2\sum_{j\in S,\,k\in S^c}R_{jk}|(\hat{\beta}_j - \beta_j^* + \beta_j^*)\hat{\beta}_k| = |\hat{\beta}_{S^c}|^\top R_{S^cS^c}|\hat{\beta}_{S^c}| + 2|\hat{\beta}_{S^c}|^\top R_{S^cS}|\hat{\beta}_S - \beta_S^* + \beta_S^*|.$
Hence, equation B.24 gives
$\frac{2}{3}\|\hat{\beta}_{S^c}\|_1 + \frac{\alpha}{2}|\hat{\beta}_{S^c}|^\top R_{S^cS^c}|\hat{\beta}_{S^c}| + \alpha|\hat{\beta}_{S^c}|^\top R_{S^cS}|\hat{\beta}_S - \beta_S^* + \beta_S^*| \le \Big[\frac{4}{3} + \alpha\big\|R_{SS}|\beta_S^*|\big\|_\infty + \frac{3\alpha D}{2}\|\beta_S^*\|_1\Big(1 + \frac{\alpha}{4}\big\|R_{SS}|\beta_S^*|\big\|_\infty\Big)\Big]\|\hat{\beta}_S - \beta_S^*\|_1$
$\Rightarrow \|\hat{\beta}_{S^c}\|_1 + \frac{3\alpha}{4}|\hat{\beta}_{S^c}|^\top R_{S^cS^c}|\hat{\beta}_{S^c}| + \frac{3\alpha}{2}|\hat{\beta}_{S^c}|^\top R_{S^cS}|\hat{\beta}_S - \beta_S^* + \beta_S^*| \le \Big[2 + \frac{15}{4}\alpha D\|\beta_S^*\|_1 + \frac{9}{16}(\alpha D\|\beta_S^*\|_1)^2\Big]\|\hat{\beta}_S - \beta_S^*\|_1.$
(B.25)
If $\alpha \le \frac{1}{4D\|\beta_S^*\|_1}$, we have
$\|\hat{\beta}_{S^c}\|_1 + \frac{3\alpha}{4}|\hat{\beta}_{S^c}|^\top R_{S^cS^c}|\hat{\beta}_{S^c}| + \frac{3\alpha}{2}|\hat{\beta}_{S^c}|^\top R_{S^cS}|\hat{\beta}_S - \beta_S^* + \beta_S^*| \le 3\|\hat{\beta}_S - \beta_S^*\|_1.$
Therefore, we can see that
$\hat{v} \in B(S, C, C'),$
where $\hat{v} = \hat{\beta} - \beta^*$, $C = 3$, and $C' = \frac{3}{2}$. By applying the definition of $\varphi_{\mathrm{GRE}}$ to equation B.23, it holds that
$\frac{\varphi_{\mathrm{GRE}}}{2}\|\hat{\beta} - \beta^*\|_2^2 \le \lambda_n\Big[\frac{4}{3} + \frac{5}{2}\alpha D\|\beta_S^*\|_1 + \frac{3}{8}(\alpha D\|\beta_S^*\|_1)^2\Big]\|\hat{\beta}_S - \beta_S^*\|_1.$
Because $\|\hat{\beta}_S - \beta_S^*\|_1^2 \le s\|\hat{\beta}_S - \beta_S^*\|_2^2$, we have
$\|\hat{\beta} - \beta^*\|_2 \le \Big[\frac{8}{3} + 5\alpha D\|\beta_S^*\|_1 + \frac{3}{4}(\alpha D\|\beta_S^*\|_1)^2\Big]\frac{\sqrt{s}\lambda_n}{\varphi_{\mathrm{GRE}}} \Rightarrow \|\hat{\beta} - \beta^*\|_2^2 \le \Big[\frac{8}{3} + 5\alpha D\|\beta_S^*\|_1 + \frac{3}{4}(\alpha D\|\beta_S^*\|_1)^2\Big]^2\frac{s\lambda_n^2}{\varphi_{\mathrm{GRE}}^2} \le \frac{16s\lambda_n^2}{\varphi_{\mathrm{GRE}}^2}.$
(B.26)
This concludes the assertion. $□$
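To make the estimator analyzed in theorem 8 concrete, here is a minimal coordinate-descent sketch for the linear-model IILasso; the update mirrors the logistic one derived in appendix C with unit weights $w_i \equiv 1$ and working response $z = y$. The iteration counts, stopping rule, and function names are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def soft_threshold(x, t):
    """S(x, t) = sgn(x) * max(|x| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def iilasso_linear(X, y, R, lam, alpha, n_iter=200, tol=1e-8):
    """Coordinate descent for
       (1/2n)||y - X beta||^2 + lam * (||beta||_1 + (alpha/2)|beta|^T R |beta|)."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y.copy()                          # residual y - X beta
    col_sq = (X**2).sum(axis=0) / n       # (1/n)||X_j||^2
    for _ in range(n_iter):
        beta_old = beta.copy()
        for j in range(p):
            r += X[:, j] * beta[j]        # remove coordinate j from the fit
            rho = X[:, j] @ r / n         # (1/n) X_j^T (y - X_{-j} beta_{-j})
            abs_b = np.abs(beta)
            abs_b[j] = 0.0
            penalty = lam * (1.0 + alpha * (R[j] @ abs_b))
            beta[j] = soft_threshold(rho, penalty) / (col_sq[j] + lam * alpha * R[j, j])
            r -= X[:, j] * beta[j]        # put coordinate j back
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta
```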

### B.5  Corollary of Theorem 8

To compare the IILasso with the Lasso, we use the following estimation error bound, which is slightly tighter than that of theorem 8.

Corollary 5.
Suppose the same assumptions as theorem 8 except for assumption GRE$(S, 3, \frac{3}{2})$. Instead, suppose that assumption GRE$(S, C, \frac{3}{2})$ (see definition 7) with $C = 2 + \frac{15}{4}\alpha D\|\beta_S^*\|_1 + \frac{9}{16}(\alpha D\|\beta_S^*\|_1)^2$ is satisfied. Then it holds that
$\|\hat{\beta} - \beta^*\|_2^2 \le \Big[\frac{8}{3} + 5\alpha D\|\beta_S^*\|_1 + \frac{3}{4}(\alpha D\|\beta_S^*\|_1)^2\Big]^2\frac{s\lambda_n^2}{\varphi_{\mathrm{GRE}}^2},$
with probability at least $1 - \delta$.
Proof.
This is derived in essentially the same way as theorem 8. From equation B.25, we can directly see that
$\hat{v} \in B(S, C, C'),$
where $\hat{v} = \hat{\beta} - \beta^*$, $C = 2 + \frac{15}{4}\alpha D\|\beta_S^*\|_1 + \frac{9}{16}(\alpha D\|\beta_S^*\|_1)^2$, and $C' = \frac{3}{2}$. This and equation B.26 conclude the assertion.

Using this corollary, we compare the IILasso with $R_{SS} = O$ against the Lasso.

1. If $\alpha = 0$ (the Lasso), we have
$\|\hat{\beta} - \beta^*\|_2^2 \le \frac{64s\lambda_n^2}{9\varphi_{\mathrm{GRE}}^2},$
with $B(S, C, C')$ where $C = 2$ and $C' = 0$. This is a standard Lasso result.
2. If $D = 0$ (i.e., $R_{SS} = O$), we have
$\|\hat{\beta} - \beta^*\|_2^2 \le \frac{64s\lambda_n^2}{9\varphi_{\mathrm{GRE}}^2},$
with $B(S, C, C')$ where $C = 2$ and $C' = \frac{3}{2}$. Since $\varphi_{\mathrm{GRE}}$ is the minimum eigenvalue restricted to $B(S, C, C')$, and the additional constraint with $C' = \frac{3}{2}$ shrinks this set, $\varphi_{\mathrm{GRE}}$ of the IILasso is larger than that of the Lasso.

### B.6  Proof of Theorem 10

Let
$\check{\beta} := \mathop{\mathrm{argmin}}_{\beta\in\mathbb{R}^p:\,\beta_{S^c}=0}\|y - X\beta\|_2^2.$
That is, $\check{\beta}$ is the least squares estimator restricted to the true active set. Let $\tilde{\beta}$ be a local optimal solution. For $0 < h < 1$, let $\beta(h) := \tilde{\beta} + h(\check{\beta} - \tilde{\beta})$; then it holds that
$L_{\lambda_n}(\beta(h)) - L_{\lambda_n}(\tilde{\beta}) = \frac{h^2 - 2h}{2n}\|X(\tilde{\beta} - \check{\beta})\|_2^2 - \frac{h}{n}(X\check{\beta} - y)^\top X(\tilde{\beta} - \check{\beta}) + \lambda_n\big(\|\beta(h)\|_1 - \|\tilde{\beta}\|_1\big) + \frac{\lambda_n\alpha}{2}\big(|\beta(h)|^\top R|\beta(h)| - |\tilde{\beta}|^\top R|\tilde{\beta}|\big).$
(B.27)

First, we evaluate the term $\frac{1}{n}(X\check{\beta} - y)^\top X(\tilde{\beta} - \check{\beta}) = \frac{1}{n}(X\check{\beta} - y)^\top X_S(\tilde{\beta}_S - \check{\beta}_S) + \frac{1}{n}(X\check{\beta} - y)^\top X_{S^c}(\tilde{\beta}_{S^c} - \check{\beta}_{S^c})$ as follows:

1. Since $\check{\beta}$ is the least squares estimator and $\frac{1}{n}X_S^\top X_S$ is invertible by assumption, we have
$\check{\beta}_S = (X_S^\top X_S)^{-1}X_S^\top y, \quad \check{\beta}_{S^c} = 0.$
Therefore,
$\frac{1}{n}X_S^\top(X\check{\beta} - y) = \frac{1}{n}X_S^\top\big(X_S(X_S^\top X_S)^{-1}X_S^\top - I\big)y.$
Here, $I - X_S(X_S^\top X_S)^{-1}X_S^\top$ is the projection matrix onto the orthogonal complement of the column space of $X_S$. Hence, $\frac{1}{n}(X\check{\beta} - y)^\top X_S(\tilde{\beta}_S - \check{\beta}_S) = 0$.
2. Notice that
$\frac{1}{n}X_{S^c}^\top(X\check{\beta} - y) = -\frac{1}{n}X_{S^c}^\top\big(I - X_S(X_S^\top X_S)^{-1}X_S^\top\big)y = -\frac{1}{n}X_{S^c}^\top\big(I - X_S(X_S^\top X_S)^{-1}X_S^\top\big)(X_S\beta_S^* + \varepsilon) = -\frac{1}{n}X_{S^c}^\top\big(I - X_S(X_S^\top X_S)^{-1}X_S^\top\big)\varepsilon,$
where we used $(I - X_S(X_S^\top X_S)^{-1}X_S^\top)X_S = 0$ in the last equality. Because $I - X_S(X_S^\top X_S)^{-1}X_S^\top$ is a projection matrix, we have $\|(I - X_S(X_S^\top X_S)^{-1}X_S^\top)X_j\|_2^2 \le \|X_j\|_2^2$. This and corollary 20 give
$\Big\|\frac{1}{n}X_{S^c}^\top(X\check{\beta} - y)\Big\|_\infty \le \gamma_n,$
with probability $1 - \delta$. Hence, letting $V := \mathrm{supp}(\tilde{\beta})\setminus S$, we have
$\frac{1}{n}(\tilde{\beta}_{S^c} - \check{\beta}_{S^c})^\top X_{S^c}^\top(X\check{\beta} - y) \le \gamma_n\|\tilde{\beta}_{S^c} - \check{\beta}_{S^c}\|_1 = \gamma_n\|\tilde{\beta}_V\|_1,$
where we used $V \subseteq S^c$ and $\check{\beta}_V = 0$.
Combining these inequalities with the assumption $\lambda_n \ge \gamma_n$, we have
$\frac{1}{n}(X\check{\beta} - y)^\top X(\tilde{\beta} - \check{\beta}) \le \lambda_n\|\tilde{\beta}_V\|_1.$
(B.28)
As for the regularization term, we evaluate each term of $\lambda_n(\|\beta(h)\|_1 - \|\tilde{\beta}\|_1) + \frac{\lambda_n\alpha}{2}(|\beta(h)|^\top R|\beta(h)| - |\tilde{\beta}|^\top R|\tilde{\beta}|)$ in the following.
i. Evaluation of $\|\beta(h)\|_1 - \|\tilde{\beta}\|_1$. By the definition of $\beta(h)$, it holds that
$\|\beta(h)\|_1 - \|\tilde{\beta}\|_1 = \|\tilde{\beta} + h(\check{\beta} - \tilde{\beta})\|_1 - \|\tilde{\beta}\|_1 = \|\tilde{\beta}_S + h(\check{\beta}_S - \tilde{\beta}_S)\|_1 - \|\tilde{\beta}_S\|_1 + \|\tilde{\beta}_V + h(\check{\beta}_V - \tilde{\beta}_V)\|_1 - \|\tilde{\beta}_V\|_1 = \|\tilde{\beta}_S + h(\check{\beta}_S - \tilde{\beta}_S)\|_1 - \|\tilde{\beta}_S\|_1 + (1 - h)\|\tilde{\beta}_V\|_1 - \|\tilde{\beta}_V\|_1 \le h\|\check{\beta}_S - \tilde{\beta}_S\|_1 - h\|\tilde{\beta}_V\|_1.$
(B.29)
ii. Evaluation of $|\beta(h)|^\top R|\beta(h)| - |\tilde{\beta}|^\top R|\tilde{\beta}|$. Note that
$|\beta(h)_j|R_{jk}|\beta(h)_k| - |\tilde{\beta}_j|R_{jk}|\tilde{\beta}_k| = |(1 - h)\tilde{\beta}_j + h\check{\beta}_j|R_{jk}|(1 - h)\tilde{\beta}_k + h\check{\beta}_k| - |\tilde{\beta}_j|R_{jk}|\tilde{\beta}_k| \le (1 - h)^2|\tilde{\beta}_j|R_{jk}|\tilde{\beta}_k| + h(1 - h)\big(|\check{\beta}_j|R_{jk}|\tilde{\beta}_k| + |\tilde{\beta}_j|R_{jk}|\check{\beta}_k|\big) + h^2|\check{\beta}_j|R_{jk}|\check{\beta}_k| - |\tilde{\beta}_j|R_{jk}|\tilde{\beta}_k| = -2h|\tilde{\beta}_j|R_{jk}|\tilde{\beta}_k| + h\big(|\check{\beta}_j|R_{jk}|\tilde{\beta}_k| + |\tilde{\beta}_j|R_{jk}|\check{\beta}_k|\big) + O(h^2) = h\big[(|\check{\beta}_j| - |\tilde{\beta}_j|)R_{jk}|\tilde{\beta}_k| + |\tilde{\beta}_j|R_{jk}(|\check{\beta}_k| - |\tilde{\beta}_k|)\big] + O(h^2).$
(B.30)
If $j, k \in S$, then the right-hand side of equation B.30 is bounded by
$2h|\check{\beta}_j - \tilde{\beta}_j|R_{jk}|\check{\beta}_k - \tilde{\beta}_k| + h\big(|\check{\beta}_j - \tilde{\beta}_j|R_{jk}|\check{\beta}_k| + |\check{\beta}_j|R_{jk}|\check{\beta}_k - \tilde{\beta}_k|\big) + O(h^2).$
If $j \in V$ and $k \in S$, then the right-hand side of equation B.30 is bounded by
$h|\tilde{\beta}_j|R_{jk}(|\check{\beta}_k| - |\tilde{\beta}_k|) + O(h^2) \le h|\tilde{\beta}_j|R_{jk}|\check{\beta}_k - \tilde{\beta}_k| + O(h^2).$
If $j \in V$ and $k \in V$, then the right-hand side of equation B.30 is bounded by
$0 + O(h^2) = O(h^2).$
Based on these evaluations, we have
$|\beta(h)|^\top R|\beta(h)| - |\tilde{\beta}|^\top R|\tilde{\beta}| \le 2h\big(|\check{\beta}_S - \tilde{\beta}_S|^\top R_{SS}|\check{\beta}_S - \tilde{\beta}_S| + |\check{\beta}_S - \tilde{\beta}_S|^\top R_{SS}|\check{\beta}_S| + |\tilde{\beta}_V|^\top R_{VS}|\check{\beta}_S - \tilde{\beta}_S|\big) + O(h^2) \le 2h\big(|\check{\beta} - \tilde{\beta}|^\top R|\check{\beta} - \tilde{\beta}| + |\check{\beta}_S - \tilde{\beta}_S|^\top R_{SS}|\check{\beta}_S|\big) + O(h^2) \le 2h\bar{D}\big(\|\check{\beta} - \tilde{\beta}\|_2^2 + \|\check{\beta}\|_2\|\check{\beta}_S - \tilde{\beta}_S\|_2\big) + O(h^2).$
We will show in equation B.32 that $\|\check{\beta} - \beta^*\|_2 \le \sqrt{s}\lambda_n/\varphi$, and thus it follows that
$\|\check{\beta}\|_2 \le \|\beta^*\|_2 + \sqrt{s}\lambda_n/\varphi.$
Therefore, we obtain
$|\beta(h)|^\top R|\beta(h)| - |\tilde{\beta}|^\top R|\tilde{\beta}| \le 2h\bar{D}\big[\|\check{\beta} - \tilde{\beta}\|_2^2 + (\|\beta^*\|_2 + \sqrt{s}\lambda_n/\varphi)\|\check{\beta}_S - \tilde{\beta}_S\|_2\big] + O(h^2).$
(B.31)
Applying inequalities B.28, B.29, and B.31 to B.27 yields
$L_{\lambda_n}(\beta(h)) - L_{\lambda_n}(\tilde{\beta}) \le h\Big\{-\frac{1}{n}\|X(\check{\beta} - \tilde{\beta})\|_2^2 + \lambda_n\|\tilde{\beta}_S - \check{\beta}_S\|_1 - (\lambda_n - \gamma_n)\|\tilde{\beta}_V\|_1 + \lambda_n\alpha\bar{D}\big[\|\check{\beta} - \tilde{\beta}\|_2^2 + (\|\beta^*\|_2 + \sqrt{s}\lambda_n/\varphi)\|\check{\beta}_S - \tilde{\beta}_S\|_2\big]\Big\} + O(h^2) \le h\Big\{-\varphi\|\check{\beta} - \tilde{\beta}\|_2^2 + \lambda_n\|\tilde{\beta}_S - \check{\beta}_S\|_1 + \lambda_n\alpha\bar{D}\big[\|\check{\beta} - \tilde{\beta}\|_2^2 + (\|\beta^*\|_2 + \sqrt{s}\lambda_n/\varphi)\|\check{\beta}_S - \tilde{\beta}_S\|_2\big]\Big\} + O(h^2) \le h\Big\{-(\varphi - \lambda_n\alpha\bar{D})\|\check{\beta} - \tilde{\beta}\|_2^2 + \lambda_n\big[\|\tilde{\beta}_S - \check{\beta}_S\|_1 + \alpha\bar{D}(\|\beta^*\|_2 + \sqrt{s}\lambda_n/\varphi)\|\check{\beta}_S - \tilde{\beta}_S\|_2\big]\Big\} + O(h^2),$
where we used the assumption $\lambda_n > \gamma_n$ in the second inequality.
Since we have assumed that $\alpha$ is small enough that $\lambda_n\alpha\bar{D} \le \varphi/2$ and $\alpha\bar{D}(\|\beta^*\|_2 + \sqrt{s}\lambda_n/\varphi) \le \sqrt{s}$, the right-hand side is further bounded, using $\|\tilde{\beta}_S - \check{\beta}_S\|_1 \le \sqrt{s}\|\tilde{\beta}_S - \check{\beta}_S\|_2$, by
$h\Big[-\frac{\varphi}{2}\|\check{\beta} - \tilde{\beta}\|_2^2 + 2\lambda_n\sqrt{s}\|\check{\beta}_S - \tilde{\beta}_S\|_2\Big] + O(h^2).$
Because of this, if $\|\check{\beta} - \tilde{\beta}\|_2 > \frac{4\sqrt{s}\lambda_n}{\varphi}$, then the first-order term becomes negative, and we conclude that for sufficiently small $\eta > 0$, it holds that
$L_{\lambda_n}(\beta(h)) < L_{\lambda_n}(\tilde{\beta})$
for all $0 < h < \eta$. In other words, $\tilde{\beta}$ is not a local optimal solution. Therefore, we must have
$\|\check{\beta} - \tilde{\beta}\|_2 \le \frac{4\sqrt{s}\lambda_n}{\varphi}.$
Finally, notice that $\|\tilde{\beta} - \beta^*\|_2^2 \le (\|\tilde{\beta} - \check{\beta}\|_2 + \|\beta^* - \check{\beta}\|_2)^2$ and
$\|\check{\beta} - \beta^*\|_2^2 = \|(X_S^\top X_S)^{-1}X_S^\top y - \beta_S^*\|_2^2 = \|(X_S^\top X_S)^{-1}X_S^\top(X_S\beta_S^* + \varepsilon) - \beta_S^*\|_2^2 = \|(X_S^\top X_S)^{-1}X_S^\top\varepsilon\|_2^2 \le \varphi^{-2}\Big\|\frac{1}{n}X_S^\top\varepsilon\Big\|_2^2 \le \varphi^{-2}s\gamma_n^2 \le \varphi^{-2}s\lambda_n^2,$
(B.32)
which concludes the assertion. $\square$

### B.7  Proof of Theorem 15

In the proof, we use the shorthand notation $S^* := S_{\beta^*}$. Let $c_1, c_2, \dots$ be positive constants, which are fixed concretely at the end of the proof. Define the empirical process
$v_n(\beta) := (P_n - P)\rho_{f_\beta},$
and let
$Z_M := \sup_{\|\beta - \beta^*\|_1 \le M}|v_n(\beta) - v_n(\beta^*)|.$

First, we prepare the following lemma.

Lemma 2.

Let assumption 4 hold, and suppose $\|\psi_j\| \le 1$ for all $j$. Then we have $P(Z_M \le \gamma_n M) \ge 1 - \delta$, where $\gamma_n$ is defined in equation 4.7.

Proof.

See example 14.2 in section 14.8 of Bühlmann and Van De Geer (2011) for details.

Following the lemma, we have $Z_{M^*} \le \gamma_n M^*$ with high probability, where
$M^* := \frac{1}{\gamma_n}\Big(c_1 E(f_{\beta^*}) + c_2\frac{\lambda_n^2|S^*|}{\kappa\varphi_{\mathrm{GRE}}^*}\Big),$
with some positive constants $c_1$ and $c_2$. Hereafter, we assume this event holds. If $\|\hat{\beta} - \beta^*\|_1 \le M^*$, which we will show later, then we have
$E(f_{\hat{\beta}}) + \lambda_n\Big(\|\hat{\beta}\|_1 + \frac{\alpha}{2}|\hat{\beta}|^\top R|\hat{\beta}|\Big) \le -\big(v_n(\hat{\beta}) - v_n(\beta^*)\big) + E(f_{\beta^*}) + \lambda_n\Big(\|\beta^*\|_1 + \frac{\alpha}{2}|\beta^*|^\top R|\beta^*|\Big) \le Z_{M^*} + E(f_{\beta^*}) + \lambda_n\Big(\|\beta^*\|_1 + \frac{\alpha}{2}|\beta^*|^\top R|\beta^*|\Big) \le \gamma_n M^* + E(f_{\beta^*}) + \lambda_n\Big(\|\beta^*\|_1 + \frac{\alpha}{2}|\beta^*|^\top R|\beta^*|\Big).$
Substituting $\beta = \beta_{S^*} + \beta_{S^{*c}}$, we have
$E(f_{\hat{\beta}}) + \lambda_n\Big(\|\hat{\beta}_{S^*}\|_1 + \|\hat{\beta}_{S^{*c}}\|_1 + \frac{\alpha}{2}|\hat{\beta}_{S^*}|^\top R_{S^*S^*}|\hat{\beta}_{S^*}| + \alpha|\hat{\beta}_{S^{*c}}|^\top R_{S^{*c}S^*}|\hat{\beta}_{S^*}| + \frac{\alpha}{2}|\hat{\beta}_{S^{*c}}|^\top R_{S^{*c}S^{*c}}|\hat{\beta}_{S^{*c}}|\Big) \le \gamma_n M^* + E(f_{\beta^*}) + \lambda_n\Big(\|\beta^*_{S^*}\|_1 + \frac{\alpha}{2}|\beta^*_{S^*}|^\top R_{S^*S^*}|\beta^*_{S^*}|\Big).$
(B.33)
Using this inequality, we can evaluate $E(f_{\hat{\beta}}) + \lambda_n\|\hat{\beta} - \beta^*\|_1$ as
$E(f_{\hat{\beta}}) + \lambda_n\|\hat{\beta} - \beta^*\|_1 = E(f_{\hat{\beta}}) + \lambda_n\|\hat{\beta}_{S^*} - \beta^*_{S^*}\|_1 + \lambda_n\|\hat{\beta}_{S^{*c}}\|_1 \le \gamma_n M^* + E(f_{\beta^*}) + \lambda_n\Big(2\|\beta^*_{S^*}\|_1 + \frac{\alpha}{2}|\beta^*_{S^*}|^\top R_{S^*S^*}|\beta^*_{S^*}|\Big).$
(B.34)
To obtain a tighter bound, we need to use quadratic terms of $\hat{\beta}$. Hereafter, we write $\hat{v} := \hat{\beta} - \beta^*$. Then we have, from equation B.33,
$E(f_{\hat{\beta}}) + \lambda_n\Big(\|\hat{v}_{S^{*c}}\|_1 + \alpha|\hat{v}_{S^{*c}}|^\top R_{S^{*c}S^*}|\hat{v}_{S^*} + \beta^*_{S^*}| + \frac{\alpha}{2}|\hat{v}_{S^{*c}}|^\top R_{S^{*c}S^{*c}}|\hat{v}_{S^{*c}}|\Big) \le \gamma_n M^* + E(f_{\beta^*}) + \lambda_n\Big(\|\hat{v}_{S^*}\|_1 + \frac{\alpha}{2}\big(|\beta^*_{S^*}|^\top R_{S^*S^*}|\beta^*_{S^*}| - |\hat{\beta}_{S^*}|^\top R_{S^*S^*}|\hat{\beta}_{S^*}|\big)\Big).$
We can evaluate the last two terms as
$|\beta^*_{S^*}|^\top R_{S^*S^*}|\beta^*_{S^*}| - |\hat{\beta}_{S^*}|^\top R_{S^*S^*}|\hat{\beta}_{S^*}| = |\beta^*_{S^*}|^\top R_{S^*S^*}|\beta^*_{S^*}| - |\beta^*_{S^*}|^\top R_{S^*S^*}|\hat{\beta}_{S^*}| + |\beta^*_{S^*}|^\top R_{S^*S^*}|\hat{\beta}_{S^*}| - |\hat{\beta}_{S^*}|^\top R_{S^*S^*}|\hat{\beta}_{S^*}| \le |\beta^*_{S^*}|^\top R_{S^*S^*}|\beta^*_{S^*} - \hat{\beta}_{S^*}| + |\hat{\beta}_{S^*} - \beta^*_{S^*}|^\top R_{S^*S^*}|\hat{\beta}_{S^*}| \le |\beta^*_{S^*}|^\top R_{S^*S^*}|\beta^*_{S^*} - \hat{\beta}_{S^*}| + |\hat{\beta}_{S^*} - \beta^*_{S^*}|^\top R_{S^*S^*}\big(|\beta^*_{S^*}| + |\beta^*_{S^*} - \hat{\beta}_{S^*}|\big) \le \big(2|\beta^*_{S^*}| + |\beta^*_{S^*} - \hat{\beta}_{S^*}|\big)^\top R_{S^*S^*}|\beta^*_{S^*} - \hat{\beta}_{S^*}| \le 2\big\|R_{S^*S^*}|\beta^*_{S^*}|\big\|_\infty\|\hat{v}_{S^*}\|_1 + D\|\hat{v}_{S^*}\|_1^2,$
where $D := \|R_{S^*S^*}\|_{\max}$. Hence, we obtain
$E(f_{\hat{\beta}}) + \lambda_n\Big(\|\hat{v}_{S^{*c}}\|_1 + \alpha|\hat{v}_{S^{*c}}|^\top R_{S^{*c}S^*}|\hat{v}_{S^*} + \beta^*_{S^*}| + \frac{\alpha}{2}|\hat{v}_{S^{*c}}|^\top R_{S^{*c}S^{*c}}|\hat{v}_{S^{*c}}|\Big) \le \gamma_n M^* + E(f_{\beta^*}) + \lambda_n\Big(\|\hat{v}_{S^*}\|_1 + \alpha\big\|R_{S^*S^*}|\beta^*_{S^*}|\big\|_\infty\|\hat{v}_{S^*}\|_1 + \frac{\alpha D}{2}\|\hat{v}_{S^*}\|_1^2\Big).$
(B.35)
We further characterize this bound by dividing it into two cases:
• Case 1. If $\lambda_n\big(\|\hat{v}_{S^*}\|_1 + \alpha\|R_{S^*S^*}|\beta^*_{S^*}|\|_\infty\|\hat{v}_{S^*}\|_1 + \frac{\alpha D}{2}\|\hat{v}_{S^*}\|_1^2\big) \le \gamma_n M^*$, then inequality B.35 reduces to
$E(f_{\hat{\beta}}) + \lambda_n\Big(\|\hat{v}_{S^{*c}}\|_1 + \alpha|\hat{v}_{S^{*c}}|^\top R_{S^{*c}S^*}|\hat{v}_{S^*} + \beta^*_{S^*}| + \frac{\alpha}{2}|\hat{v}_{S^{*c}}|^\top R_{S^{*c}S^{*c}}|\hat{v}_{S^{*c}}|\Big) \le 2\gamma_n M^* + E(f_{\beta^*}).$
Hence, we obtain
$E(f_{\hat{\beta}}) + \lambda_n\|\hat{v}\|_1 = E(f_{\hat{\beta}}) + \lambda_n\|\hat{v}_{S^{*c}}\|_1 + \lambda_n\|\hat{v}_{S^*}\|_1 \le 3\gamma_n M^* + E(f_{\beta^*}) = (3c_1 + 1)E(f_{\beta^*}) + 3c_2\frac{\lambda_n^2|S^*|}{\kappa\varphi_{\mathrm{GRE}}^*},$
where we used $\lambda_n\|\hat{v}_{S^*}\|_1 \le \gamma_n M^*$ in the inequality.
• Case 2. If $\lambda_n\big(\|\hat{v}_{S^*}\|_1 + \alpha\|R_{S^*S^*}|\beta^*_{S^*}|\|_\infty\|\hat{v}_{S^*}\|_1 + \frac{\alpha D}{2}\|\hat{v}_{S^*}\|_1^2\big) \ge \gamma_n M^*$, we need the compatibility condition and the margin condition. In this case, inequality B.35 reduces to
$E(f_{\hat{\beta}}) + \lambda_n\Big(\|\hat{v}_{S^{*c}}\|_1 + \alpha|\hat{v}_{S^{*c}}|^\top R_{S^{*c}S^*}|\hat{v}_{S^*} + \beta^*_{S^*}| + \frac{\alpha}{2}|\hat{v}_{S^{*c}}|^\top R_{S^{*c}S^{*c}}|\hat{v}_{S^{*c}}|\Big) \le \gamma_n M^* + E(f_{\beta^*}) + \lambda_n\Big(\|\hat{v}_{S^*}\|_1 + \alpha\big\|R_{S^*S^*}|\beta^*_{S^*}|\big\|_\infty\|\hat{v}_{S^*}\|_1 + \frac{\alpha D}{2}\|\hat{v}_{S^*}\|_1^2\Big) \le \Big(1 + \frac{1}{c_1}\Big)\gamma_n M^* + \lambda_n\Big(\|\hat{v}_{S^*}\|_1 + \alpha\big\|R_{S^*S^*}|\beta^*_{S^*}|\big\|_\infty\|\hat{v}_{S^*}\|_1 + \frac{\alpha D}{2}\|\hat{v}_{S^*}\|_1^2\Big) \le \Big(2 + \frac{1}{c_1}\Big)\lambda_n\Big(\|\hat{v}_{S^*}\|_1 + \alpha\big\|R_{S^*S^*}|\beta^*_{S^*}|\big\|_\infty\|\hat{v}_{S^*}\|_1 + \frac{\alpha D}{2}\|\hat{v}_{S^*}\|_1^2\Big) = \Big(2 + \frac{1}{c_1}\Big)\lambda_n\Big(1 + \alpha\big\|R_{S^*S^*}|\beta^*_{S^*}|\big\|_\infty + \frac{\alpha D}{2}\|\hat{v}_{S^*}\|_1\Big)\|\hat{v}_{S^*}\|_1.$
(B.36)
Hence, we obtain
$E(f_{\hat{\beta}}) + \lambda_n\|\hat{v}\|_1 = E(f_{\hat{\beta}}) + \lambda_n\|\hat{v}_{S^{*c}}\|_1 + \lambda_n\|\hat{v}_{S^*}\|_1 \le \lambda_n\Big[3 + \frac{1}{c_1} + \Big(2 + \frac{1}{c_1}\Big)\alpha\big\|R_{S^*S^*}|\beta^*_{S^*}|\big\|_\infty + \Big(2 + \frac{1}{c_1}\Big)\frac{\alpha D}{2}\|\hat{v}_{S^*}\|_1\Big]\|\hat{v}_{S^*}\|_1.$
(B.37)
To characterize the $\|\hat{v}_{S^*}\|_1$ inside the brackets, we can see from equation B.34 that
$\lambda_n\big(\|\hat{v}_{S^*}\|_1 + \|\hat{v}_{S^{*c}}\|_1\big) \le \gamma_n M^* + E(f_{\beta^*}) + \lambda_n\Big(2\|\beta^*_{S^*}\|_1 + \frac{\alpha}{2}|\beta^*_{S^*}|^\top R_{S^*S^*}|\beta^*_{S^*}|\Big) \le \Big(1 + \frac{1}{c_1}\Big)\gamma_n M^* + \lambda_n\Big(2\|\beta^*_{S^*}\|_1 + \frac{\alpha}{2}|\beta^*_{S^*}|^\top R_{S^*S^*}|\beta^*_{S^*}|\Big).$
(B.38)

We further divide case 2 into two cases:

Case 2a. If $\lambda_n\big(2\|\beta^*_{S^*}\|_1 + \frac{\alpha}{2}|\beta^*_{S^*}|^\top R_{S^*S^*}|\beta^*_{S^*}|\big) \le 2\gamma_n M^*$, then equation B.34 reduces to
$E(f_{\hat{\beta}}) + \lambda_n\|\hat{v}\|_1 \le 3\gamma_n M^* + E(f_{\beta^*}) \le (3c_1 + 1)E(f_{\beta^*}) + 3c_2\frac{\lambda_n^2|S^*|}{\kappa\varphi_{\mathrm{GRE}}^*}.$
Case 2b. If $\lambda_n\big(2\|\beta^*_{S^*}\|_1 + \frac{\alpha}{2}|\beta^*_{S^*}|^\top R_{S^*S^*}|\beta^*_{S^*}|\big) \ge 2\gamma_n M^*$, then inequality B.38 reduces to
$\lambda_n\big(\|\hat{v}_{S^*}\|_1 + \|\hat{v}_{S^{*c}}\|_1\big) \le \frac{1}{2}\Big(3 + \frac{1}{c_1}\Big)\lambda_n\Big(2\|\beta^*_{S^*}\|_1 + \frac{\alpha}{2}|\beta^*_{S^*}|^\top R_{S^*S^*}|\beta^*_{S^*}|\Big) \le \frac{1}{2}\Big(3 + \frac{1}{c_1}\Big)\lambda_n\Big(2 + \frac{\alpha}{2}\big\|R_{S^*S^*}|\beta^*_{S^*}|\big\|_\infty\Big)\|\beta^*_{S^*}\|_1,$
which indicates
$\|\hat{v}_{S^*}\|_1 \le \frac{1}{2}\Big(3 + \frac{1}{c_1}\Big)\Big(2 + \frac{\alpha}{2}\big\|R_{S^*S^*}|\beta^*_{S^*}|\big\|_\infty\Big)\|\beta^*_{S^*}\|_1.$
(B.39)
Hence, incorporating equations B.37 and B.39 yields
$E(f_{\hat{\beta}}) + \lambda_n\|\hat{v}\|_1 \le \lambda_n\Big[3 + \frac{1}{c_1} + \Big(2 + \frac{1}{c_1}\Big)\alpha\big\|R_{S^*S^*}|\beta^*_{S^*}|\big\|_\infty + \frac{1}{2}\Big(2 + \frac{1}{c_1}\Big)\Big(3 + \frac{1}{c_1}\Big)\frac{\alpha D}{2}\Big(2 + \frac{\alpha}{2}\big\|R_{S^*S^*}|\beta^*_{S^*}|\big\|_\infty\Big)\|\beta^*_{S^*}\|_1\Big]\|\hat{v}_{S^*}\|_1 \le \lambda_n\Big[3 + \frac{1}{c_1} + \frac{1}{2}\Big(2 + \frac{1}{c_1}\Big)\Big(5 + \frac{1}{c_1}\Big)\alpha D\|\beta^*\|_1 + \frac{1}{8}\Big(2 + \frac{1}{c_1}\Big)\Big(3 + \frac{1}{c_1}\Big)(\alpha D\|\beta^*\|_1)^2\Big]\|\hat{v}_{S^*}\|_1.$
If we take $\alpha'$ such that $\alpha \le \frac{\alpha'}{D\|\beta^*\|_1}$ and define
$c_3 := 3 + \frac{1}{c_1} + \frac{1}{2}\Big(2 + \frac{1}{c_1}\Big)\Big(5 + \frac{1}{c_1}\Big)\alpha' + \frac{1}{8}\Big(2 + \frac{1}{c_1}\Big)\Big(3 + \frac{1}{c_1}\Big)\alpha'^2,$
we have
$E(f_{\hat{\beta}}) + \lambda_n\|\hat{v}\|_1 \le c_3\lambda_n\|\hat{v}_{S^*}\|_1.$
(B.40)
We can restrict the feasible region for $\hat{v}$ by equations B.36 and B.39 as
$\|\hat{v}_{S^{*c}}\|_1 + \alpha|\hat{v}_{S^{*c}}|^\top R_{S^{*c}S^*}|\hat{v}_{S^*} + \beta^*_{S^*}| + \frac{\alpha}{2}|\hat{v}_{S^{*c}}|^\top R_{S^{*c}S^{*c}}|\hat{v}_{S^{*c}}| \le \Big(2 + \frac{1}{c_1}\Big)\Big(1 + \alpha\big\|R_{S^*S^*}|\beta^*_{S^*}|\big\|_\infty + \frac{\alpha D}{2}\cdot\frac{1}{2}\Big(3 + \frac{1}{c_1}\Big)\Big(2 + \frac{\alpha}{2}\big\|R_{S^*S^*}|\beta^*_{S^*}|\big\|_\infty\Big)\|\beta^*_{S^*}\|_1\Big)\|\hat{v}_{S^*}\|_1 \le \Big(2 + \frac{1}{c_1}\Big)\Big(1 + \frac{1}{2}\Big(5 + \frac{1}{c_1}\Big)\alpha D\|\beta^*_{S^*}\|_1 + \frac{1}{8}\Big(3 + \frac{1}{c_1}\Big)(\alpha D\|\beta^*_{S^*}\|_1)^2\Big)\|\hat{v}_{S^*}\|_1 \le \Big(2 + \frac{1}{c_1}\Big)\Big(1 + \frac{1}{2}\Big(5 + \frac{1}{c_1}\Big)\alpha' + \frac{1}{8}\Big(3 + \frac{1}{c_1}\Big)\alpha'^2\Big)\|\hat{v}_{S^*}\|_1.$
If we take $c_1$ and $\alpha'$ satisfying
$\Big(2 + \frac{1}{c_1}\Big)\Big(1 + \frac{1}{2}\Big(5 + \frac{1}{c_1}\Big)\alpha' + \frac{1}{8}\Big(3 + \frac{1}{c_1}\Big)\alpha'^2\Big) \le 3,$
then we have $\hat{v} \in B(S^*, 3, 1)$, where
$B(S^*, 3, 1) = \Big\{v : \|v_{S^{*c}}\|_1 + \alpha|v_{S^{*c}}|^\top R_{S^{*c}S^*}|v_{S^*} + \beta^*_{S^*}| + \frac{\alpha}{2}|v_{S^{*c}}|^\top R_{S^{*c}S^{*c}}|v_{S^{*c}}| \le 3\|v_{S^*}\|_1\Big\}.$
Hence, we have only to impose the restricted eigenvalue condition on the set $B(S^*, 3, 1)$.
According to the restricted eigenvalue condition, we have
$\|v\|_2^2 \le \frac{\|f_v\|^2}{\varphi_{\mathrm{GRE}}^*}, \quad \forall v \in B(S^*, 3, 1).$
Incorporating $\|\hat{v}_{S^*}\|_1 \le \sqrt{|S^*|}\|\hat{v}_{S^*}\|_2 \le \sqrt{|S^*|}\|\hat{v}\|_2$, we have
$\|\hat{v}_{S^*}\|_1 \le \sqrt{|S^*|}\,\frac{\|f_{\hat{v}}\|}{\sqrt{\varphi_{\mathrm{GRE}}^*}}.$
Hence, from equation B.40, we have
$E(f_{\hat{\beta}}) + \lambda_n\|\hat{v}\|_1 \le c_3\lambda_n\sqrt{|S^*|}\,\frac{\|f_{\hat{\beta}} - f_{\beta^*}\|}{\sqrt{\varphi_{\mathrm{GRE}}^*}} \le \frac{4c_3\lambda_n^2|S^*|}{\kappa\varphi_{\mathrm{GRE}}^*} + \frac{c_3\kappa}{16}\|f_{\hat{\beta}} - f_{\beta^*}\|^2 \le \frac{4c_3\lambda_n^2|S^*|}{\kappa\varphi_{\mathrm{GRE}}^*} + \frac{c_3\kappa}{16}\big(\|f_{\hat{\beta}} - f_0\| + \|f_{\beta^*} - f_0\|\big)^2 \le \frac{4c_3\lambda_n^2|S^*|}{\kappa\varphi_{\mathrm{GRE}}^*} + \frac{c_3\kappa}{8}\|f_{\hat{\beta}} - f_0\|^2 + \frac{c_3\kappa}{8}\|f_{\beta^*} - f_0\|^2 \le \frac{4c_3\lambda_n^2|S^*|}{\kappa\varphi_{\mathrm{GRE}}^*} + \frac{c_3}{8}E(f_{\hat{\beta}}) + \frac{c_3}{8}E(f_{\beta^*}),$
where we used the restricted eigenvalue condition in the first inequality, $uv \le 4u^2 + v^2/16$ in the second, the triangle inequality in the third, $(u + v)^2 \le 2(u^2 + v^2)$ in the fourth, and the margin condition with $f_{\hat{\beta}} \in F_\eta$ and $f_{\beta^*} \in F_\eta$ in the last. This implies
$\Big(1 - \frac{c_3}{8}\Big)\big(E(f_{\hat{\beta}}) + \lambda_n\|\hat{v}\|_1\big) \le \frac{c_3}{8}E(f_{\beta^*}) + \frac{4c_3\lambda_n^2|S^*|}{\kappa\varphi_{\mathrm{GRE}}^*};$
hence, we obtain
$E(f_{\hat{\beta}}) + \lambda_n\|\hat{v}\|_1 \le \frac{c_3}{8 - c_3}E(f_{\beta^*}) + \frac{32c_3\lambda_n^2|S^*|}{(8 - c_3)\kappa\varphi_{\mathrm{GRE}}^*},$
where we assume $c_3 < 8$, which is satisfied by taking appropriate $c_1$ and $\alpha'$.
Incorporating cases 1, 2a, and 2b, we have shown that
$E(f_{\hat{\beta}}) + \lambda_n\|\hat{v}\|_1 \le c_4\gamma_n M^*,$
(B.41)
where
$c_4 := \max\Big\{3 + \frac{1}{c_1},\ \frac{c_3}{c_1(8 - c_3)},\ \frac{32c_3}{c_2(8 - c_3)}\Big\}.$
On the other hand, we have
$E(f_{\hat{\beta}}) \ge \kappa\|f_{\hat{\beta}} - f_0\|^2 \ge \frac{\kappa}{2}\|f_{\hat{\beta}} - f_{\beta^*}\|^2 - \kappa\|f_{\beta^*} - f_0\|^2 \ge \frac{\kappa\varphi_{\mathrm{GRE}}^*}{2}\|\hat{\beta} - \beta^*\|_2^2 - E(f_{\beta^*}),$
where we used $(u + v)^2 \le 2(u^2 + v^2)$ in the second inequality and the generalized restricted eigenvalue and margin conditions in the others. Therefore, we have
$\frac{\kappa\varphi_{\mathrm{GRE}}^*}{2}\|\hat{\beta} - \beta^*\|_2^2 \le (c_1c_4 + 1)E(f_{\beta^*}) + c_2c_4\frac{\lambda_n^2|S^*|}{\kappa\varphi_{\mathrm{GRE}}^*},$
that is,
$\|\hat{\beta} - \beta^*\|_2^2 \le \frac{2(c_1c_4 + 1)}{\kappa\varphi_{\mathrm{GRE}}^*}E(f_{\beta^*}) + \frac{2c_2c_4\lambda_n^2|S^*|}{\kappa^2\varphi_{\mathrm{GRE}}^{*2}}.$
This concludes the assertion of theorem 15.
Now we show $\|\hat{\beta} - \beta^*\|_1 \le M^*$. Define
$\tilde{\beta} := \beta^* + t(\hat{\beta} - \beta^*), \quad t := \frac{M^*}{M^* + \|\hat{\beta} - \beta^*\|_1}.$
Then $\|\tilde{\beta} - \beta^*\|_1 \le M^*$, and hence all the above inequalities hold with $\tilde{\beta}$ substituted for $\hat{\beta}$. In particular, equation B.41 implies
$\|\tilde{\beta} - \beta^*\|_1 \le \frac{c_4\gamma_n}{\lambda_n}M^* \le \frac{M^*}{2},$
where we assume $\frac{c_4\gamma_n}{\lambda_n} \le \frac{1}{2}$. Because $\|\tilde{\beta} - \beta^*\|_1 = t\|\hat{\beta} - \beta^*\|_1$, we have
$\frac{M^*}{M^* + \|\hat{\beta} - \beta^*\|_1}\|\hat{\beta} - \beta^*\|_1 \le \frac{M^*}{2}.$
Hence, we obtain $\|\hat{\beta} - \beta^*\|_1 \le M^*$.

Finally, we take $c_1 = 3$, $c_2 = 9$, and $\alpha' = 1/12$, which satisfy all the aforementioned assumptions. $\square$

### B.8  Proof of Corollary 18

We can see that the logistic loss is Lipschitz with Lipschitz constant 1 because it holds that
$\Big|\frac{\partial}{\partial f}l(f, y)\Big| = \Big|-y + \frac{\exp(f)}{1 + \exp(f)}\Big| \le 1.$
Next, we show that logistic regression satisfies the quadratic margin condition. For $f \in F_\eta$,
$\frac{\partial^2}{\partial f^2}l(f, \cdot) = \frac{\exp(f)}{(1 + \exp(f))^2} \ge \frac{\exp(|f_0| + \eta)}{(1 + \exp(|f_0| + \eta))^2} \ge \frac{1}{(1 + \exp(|f_0| + \eta))^2} = \frac{1}{\Big(1 + \exp(\eta)\max\big\{\frac{\pi}{1 - \pi}, \frac{1 - \pi}{\pi}\big\}\Big)^2} \ge \frac{1}{\Big(1 + \exp(\eta)\frac{1 - \varepsilon_0}{\varepsilon_0}\Big)^2} \ge \frac{1}{(\exp(\eta)/\varepsilon_0 + 1)^2}.$
Hence, the quadratic margin condition holds with $\kappa := (\exp(\eta)/\varepsilon_0 + 1)^{-2}$.
On the other hand, we have
$\|f_\beta - f_0\|_\infty \le \|f_\beta - f_{\beta^*}\|_\infty + \|f_{\beta^*} - f_0\|_\infty.$
For $\|\beta - \beta^*\|_1 \le M^*$, it holds that
$\|f_\beta - f_{\beta^*}\|_\infty = \|X(\beta - \beta^*)\|_\infty \le \|\beta - \beta^*\|_1\|X\|_{\max} \le M^* = \frac{1}{\gamma_n}\Big(3E(f_{\beta^*}) + 9(\exp(\eta)/\varepsilon_0 + 1)^2\frac{\lambda_n^2|S_{\beta^*}|}{\varphi_{\mathrm{GRE}}^*}\Big).$
Since we assume
$\|f_{\beta^*} - f_0\|_\infty \le \frac{\eta}{3}, \quad \frac{E(f_{\beta^*})}{\gamma_n} \le \frac{\eta}{9}, \quad \text{and} \quad T^2(\exp(\eta)/\varepsilon_0 + 1)^2\frac{\gamma_n|S_{\beta^*}|}{\varphi_{\mathrm{GRE}}^*} \le \frac{\eta}{27},$
with $\lambda_n \le T\gamma_n$, we obtain
$\|f_\beta - f_0\|_\infty \le \eta.$

From the above, all of the assumptions in theorem 15 have been verified. $\square$

## Appendix C:  Optimization for Logistic Regression

We derive the coordinate descent algorithm of the IILasso for a binary objective variable. The objective function is
$L(\beta) = -\frac{1}{n}\sum_i\big[y_iX_i\beta - \log(1 + \exp(X_i\beta))\big] + \lambda\Big(\|\beta\|_1 + \frac{\alpha}{2}|\beta|^\top R|\beta|\Big),$
where $X_i$ is the $i$th row of $X = [1, X_1, \dots, X_p]$ and $\beta = [\beta_0, \beta_1, \dots, \beta_p]$. Forming a quadratic approximation around the current estimate $\bar{\beta}$, we have
$\bar{L}(\beta) = \frac{1}{2n}\sum_{i=1}^n w_i(z_i - X_i\beta)^2 + C(\bar{\beta}) + \lambda\Big(\|\beta\|_1 + \frac{\alpha}{2}|\beta|^\top R|\beta|\Big),$
where
$z_i = X_i\bar{\beta} + \frac{y_i - \bar{p}(X_i)}{\bar{p}(X_i)(1 - \bar{p}(X_i))}, \quad w_i = \bar{p}(X_i)(1 - \bar{p}(X_i)), \quad \bar{p}(X_i) = \frac{1}{1 + \exp(-X_i\bar{\beta})}.$
To derive the update equation, when $\beta_j \neq 0$, differentiating the quadratic objective function with respect to $\beta_j$ yields
$\partial_{\beta_j}\bar{L}(\beta) = -\frac{1}{n}\sum_{i=1}^n w_i(z_i - X_i\beta)X_{ij} + \lambda\,\mathrm{sgn}(\beta_j) + \lambda\alpha R_j^\top|\beta|\,\mathrm{sgn}(\beta_j) = -\frac{1}{n}\sum_{i=1}^n w_i\big(z_i - X_{i,-j}\beta_{-j}\big)X_{ij} + \Big(\frac{1}{n}\sum_{i=1}^n w_iX_{ij}^2 + \lambda\alpha R_{jj}\Big)\beta_j + \lambda\big(1 + \alpha R_{j,-j}|\beta_{-j}|\big)\mathrm{sgn}(\beta_j).$
This yields the update
$\beta_j \leftarrow \frac{S\Big(\frac{1}{n}\sum_{i=1}^n w_i\big(z_i - X_{i,-j}\beta_{-j}\big)X_{ij},\ \lambda\big(1 + \alpha R_{j,-j}|\beta_{-j}|\big)\Big)}{\frac{1}{n}\sum_{i=1}^n w_iX_{ij}^2 + \lambda\alpha R_{jj}},$
where $S(x, t) := \mathrm{sgn}(x)\max\{|x| - t, 0\}$ denotes the soft-thresholding operator. These procedures amount to a sequence of nested loops. The whole algorithm is described in algorithm 2.
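Below is a minimal sketch of those nested loops, assuming NumPy. The function name, the clipping of the weights, the omission of the intercept, and the stopping rules are illustrative assumptions; this is not the paper's algorithm 2 verbatim.

```python
import numpy as np

def iilasso_logistic(X, y, R, lam, alpha, n_outer=50, n_inner=100, tol=1e-8):
    """Coordinate descent for the IILasso logistic objective via iteratively
    reweighted least squares. For brevity the intercept is omitted and every
    coefficient is penalized."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_outer):
        # Quadratic approximation at the current estimate: working response z
        # and weights w, as in appendix C.
        eta = X @ beta
        prob = 1.0 / (1.0 + np.exp(-eta))
        w = np.clip(prob * (1.0 - prob), 1e-5, None)  # guard tiny weights
        z = eta + (y - prob) / w
        beta_outer = beta.copy()
        for _ in range(n_inner):
            beta_prev = beta.copy()
            for j in range(p):
                r_j = z - X @ beta + X[:, j] * beta[j]   # partial residual
                rho = (w * r_j) @ X[:, j] / n
                abs_b = np.abs(beta)
                abs_b[j] = 0.0
                thresh = lam * (1.0 + alpha * (R[j] @ abs_b))
                denom = (w * X[:, j]**2).sum() / n + lam * alpha * R[j, j]
                beta[j] = np.sign(rho) * max(abs(rho) - thresh, 0.0) / denom
            if np.max(np.abs(beta - beta_prev)) < tol:
                break
        if np.max(np.abs(beta - beta_outer)) < tol:
            break
    return beta
```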

## Notes

1

This is an extended version of the work published in Takada, Suzuki, and Fujisawa (2018). We extend our method from linear models to GLMs in this letter.

## Acknowledgments

T.S. was partially supported by MEXT kakenhi (25730013, 25120012, 26280009, 15H01678, 15H05707), JST-PRESTO, and JST-CREST. H.F. was partially supported by MEXT kakenhi 17K00065.

## References

Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., & Levine, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12), 6745–6750.

Bickel, P. J., Ritov, Y., & Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37(4), 1705–1732.

Bogdan, M., Van Den Berg, E., Sabatti, C., Su, W., & Candès, E. J. (2015). SLOPE: Adaptive variable selection via convex optimization. Annals of Applied Statistics, 9(3), 1103.

Breheny, P., & Huang, J. (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics, 5(1), 232.

Bühlmann, P., & Van De Geer, S. (2011). Statistics for high-dimensional data: Methods, theory and applications. New York: Springer.

Candes, E. J., Wakin, M. B., & Boyd, S. P. (2008). Enhancing sparsity by reweighted $ℓ1$ minimization. Journal of Fourier Analysis and Applications, 14(5–6), 877–905.

Chen, S.-B., Ding, C., Luo, B., & Xie, Y. (2013). Uncorrelated Lasso. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence (pp. 166–172). Palo Alto, CA: AAAI.

Chiaretti, S., Li, X., Gentleman, R., Vitale, A., Vignetti, M., Mandelli, F., Ritz, J., & Foa, R. (2004). Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival. Blood, 103(7), 2771–2778.

Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.

Figueiredo, M., & Nowak, R. (2016). Ordered weighted L1 regularized regression with strongly correlated covariates: Theoretical aspects. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (pp. 930–938).

Friedman, J., Hastie, T., Höfling, H., & Tibshirani, R. (2007). Pathwise coordinate optimization. Annals of Applied Statistics, 1(2), 302–332.

Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1.

Fuchs, J.-J. (2005). Recovery of exact sparse representations in the presence of bounded noise. IEEE Transactions on Information Theory, 51(10), 3601–3608.

Gordon, G. J., Jensen, R. V., Hsiao, L.-L., Gullans, S. R., Blumenstock, J. E., Ramaswamy, S., … Bueno, R. (2002). Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Research, 62(17), 4963–4967.

Grave, E., Obozinski, G. R., & Bach, F. R. (2011). Trace Lasso: A trace norm regularization for correlated designs. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 24 (pp. 2187–2195). Red Hook, NY: Curran.

Gravier, E., Pierron, G., Vincent-Salomon, A., Gruel, N., Raynal, V., Savignoni, A., … Fourquet, A. (2010). A prognostic DNA signature for T1T2 node-negative breast cancer patients. Genes, Chromosomes and Cancer, 49(12), 1125–1134.

Hara, S., & Maehara, T. (2017). Enumerate Lasso solutions for feature selection. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence.

Kong, D., Fujimaki, R., Liu, J., Nie, F., & Ding, C. (2014). Exclusive feature learning on arbitrary structures via $ℓ1,2$-norm. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems (pp. 1655–1663). Red Hook, NY: Curran.

Lipton, Z. C. (2018). The mythos of model interpretability. Queue, 16(3), 31–57.

Lorbert, A., Eis, D., Kostina, V., Blei, D., & Ramadge, P. (2010). Exploiting covariate similarity in sparse regression via the pairwise elastic net. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (pp. 477–484).

Mazumder, R., Friedman, J. H., & Hastie, T. (2011). SparseNet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 106(495), 1125–1138.

Meinshausen, N., & Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34(3), 1436–1462.

Molnar, C. (2018). Interpretable machine learning: A guide for making black box models explainable. Leanpub.

Pomeroy, S. L., Tamayo, P., Gaasenbeek, M., Sturla, L. M., Angelo, M., McLaughlin, M. E., … Allen, J. C. (2002). Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415(6870), 436–442.

Raskutti, G., Wainwright, M. J., & Yu, B. (2010). Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11, 2241–2259.

Raskutti, G., Wainwright, M. J., & Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over $ℓq$-balls. IEEE Transactions on Information Theory, 57(10), 6976–6994.

Rigollet, P., & Hütter, J.-C. (2015). High dimensional statistics. Lecture notes for course 18S997, MIT.

Shipp, M. A., Ross, K. N., Tamayo, P., Weng, A. P., Kutok, J. L., Aguiar, R. C., … Ray, T. S. (2002). Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine, 8(1), 68–74.

Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., … Sellers, R. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2), 203–209.

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., … Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43), 15545–15550.

Takada, M., Suzuki, T., & Fujisawa, H. (2018). Independently interpretable lasso: A new regularizer for sparse regression with uncorrelated variables. In Proceedings of the International Conference on Artificial Intelligence and Statistics (pp. 454–463).

Tian, E., Zhan, F., Walker, R., Rasmussen, E., Ma, Y., Barlogie, B., & Shaughnessy Jr., J. D. (2003). The role of the WNT-signaling antagonist DKK1 in the development of osteolytic lesions in multiple myeloma. New England Journal of Medicine, 349(26), 2483–2494.

Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.

Tropp, J. A. (2006). Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3), 1030–1051.

Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3), 475–494.

Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using $ℓ1$-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5), 2183–2202.

West, M., Blanchette, C., Dressman, H., Huang, E., Ishida, S., Spang, R., … Nevins, J. R. (2001). Predicting the clinical status of human breast cancer by using gene expression profiles. Proceedings of the National Academy of Sciences, 98(20), 11462–11467.

Zeng, X., & Figueiredo, M. A. (2014). The ordered weighted $ℓ1$ norm: Atomic formulation, projections, and algorithms. arXiv:1409.4271.

Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38(2), 894–942.

Zhao, P., & Yu, B. (2006). On model selection consistency of Lasso. Journal of Machine Learning Research, 7, 2541–2563.

Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.