## Abstract

The goal of supervised feature selection is to find a subset of input features that are responsible for predicting output values. The least absolute shrinkage and selection operator (Lasso) allows computationally efficient feature selection based on linear dependency between input features and output values. In this letter, we consider a feature-wise kernelized Lasso for capturing nonlinear input-output dependency. We first show that with particular choices of kernel functions, nonredundant features with strong statistical dependence on output values can be found in terms of kernel-based independence measures such as the Hilbert-Schmidt independence criterion. We then show that the globally optimal solution can be efficiently computed; this makes the approach scalable to high-dimensional problems. The effectiveness of the proposed method is demonstrated through feature selection experiments for classification and regression with thousands of features.

## 1. Introduction

Finding a subset of features in high-dimensional supervised learning is an important problem with many real-world applications such as gene selection from microarray data (Xing, Jordan, & Karp, 2001; Ding & Peng, 2005; Suzuki, Sugiyama, Kanamori, & Sese, 2009; Huang, Horowitz, & Wei, 2010), document categorization (Forman, 2008), and prosthesis control (Shenoy, Miller, Crawford, & Rao, 2008).

### 1.1. Problem Description.

The goal of supervised feature selection is to find *m* features (*m* < *d*) of input vector **x** that are responsible for predicting output *y*.

### 1.2. Lasso.

The least absolute shrinkage and selection operator (Lasso) (Tibshirani, 1996) allows computationally efficient feature selection based on the assumption of linear dependency between input features and output values.

Let $\{(\boldsymbol{x}_i, y_i)\}_{i=1}^{n}$ be training samples, where $\boldsymbol{x}_i = (x_{1,i}, \ldots, x_{d,i})^\top$, $\boldsymbol{y} = (y_1, \ldots, y_n)^\top$, and $X = [\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n]$. The Lasso optimization problem is given as

$$\min_{\boldsymbol{w}\in\mathbb{R}^d} \frac{1}{2}\left\|\boldsymbol{y} - X^\top \boldsymbol{w}\right\|_2^2 + \lambda\|\boldsymbol{w}\|_1,$$

where $w_k$ is the regression coefficient of the *k*th feature, $\|\cdot\|_1$ and $\|\cdot\|_2$ are the $\ell_1$- and $\ell_2$-norms, and $\lambda \ge 0$ is the regularization parameter. The $\ell_1$-regularizer in Lasso tends to produce a sparse solution, which means that the regression coefficients for irrelevant features become zero. Lasso is particularly useful when the number of features is larger than the number of training samples (Tibshirani, 1996). Furthermore, various optimization software packages have been developed for efficiently computing the Lasso solution (Boyd & Vandenberghe, 2004; Daubechies, Defrise, & De Mol, 2004; Combettes & Wajs, 2005; Kim, Koh, Lustig, Boyd, & Gorinevsky, 2007; Yin, Osher, Goldfarb, & Darbon, 2008; Wright, Nowak, & Figueiredo, 2009; Tomioka, Suzuki, & Sugiyama, 2011).
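As a concrete illustration of this sparsity-inducing behavior, the following minimal NumPy sketch solves the Lasso problem by iterative soft-thresholding (a proximal-gradient scheme in the spirit of the iterative-thresholding solvers cited above, e.g., Daubechies et al., 2004). The toy data, step size, and regularization value are illustrative assumptions, not the letter's setup.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=3000):
    """Minimize 0.5/n * ||y - X w||_2^2 + lam * ||w||_1 by
    iterative soft-thresholding (proximal gradient)."""
    n, d = X.shape
    w = np.zeros(d)
    step = n / (np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = w - step * grad
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft-threshold
    return w

# Toy problem: only features 0 and 3 influence y; Lasso zeroes out the rest.
rng = np.random.RandomState(0)
n, d = 100, 10
X = rng.randn(n, d)
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.randn(n)
w = lasso_ista(X, y, lam=0.1)
print(np.flatnonzero(np.abs(w) > 1e-6))  # indices of the selected features
```

Note the mild shrinkage of the surviving coefficients toward zero, which is the price paid for the $\ell_1$-induced sparsity.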

However, a critical limitation of Lasso is that it cannot capture nonlinear dependency.

### 1.3. Instance-Wise Nonlinear Lasso.

In the instance-wise nonlinear Lasso, the input vector $\boldsymbol{x}$ is transformed by a nonlinear function $\boldsymbol{\psi}(\cdot)$. The Lasso optimization problem is then expressed as

$$\min_{\boldsymbol{\beta}\in\mathbb{R}^n} \frac{1}{2}\sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{n}\beta_j A(\boldsymbol{x}_i, \boldsymbol{x}_j)\right)^2 + \lambda\|\boldsymbol{\beta}\|_1,$$

where $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_n)^\top$ is a regression coefficient vector, $\beta_j$ is a coefficient of the *j*th basis $A(\cdot, \boldsymbol{x}_j)$, and $A(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{\psi}(\boldsymbol{x})^\top\boldsymbol{\psi}(\boldsymbol{x}')$ is the kernel function induced by $\boldsymbol{\psi}$.

The instance-wise nonlinear Lasso gives a sparse solution in terms of instances but not features. Therefore, it cannot be used for feature selection.

### 1.4. Feature-Wise Nonlinear Lasso (Feature Vector Machine).

To obtain sparsity in terms of features, the feature-wise nonlinear Lasso was proposed (Li, Yang, & Xing, 2006).

The key idea is to apply a nonlinear transformation in a feature-wise manner. Let $\boldsymbol{u}_k = (x_{k,1}, \ldots, x_{k,n})^\top$ be the vector of the *k*th feature for all samples. Then the feature vector $\boldsymbol{u}_k$ and the output vector $\boldsymbol{y}$ are transformed by a nonlinear function $\boldsymbol{\phi}(\cdot)$. The Lasso optimization problem in the transformed space is given as

$$\min_{\boldsymbol{w}\in\mathbb{R}^d} \frac{1}{2}\left\|\boldsymbol{\phi}(\boldsymbol{y}) - \sum_{k=1}^{d} w_k\, \boldsymbol{\phi}(\boldsymbol{u}_k)\right\|_2^2 + \lambda\|\boldsymbol{w}\|_1, \tag{1.1}$$

where $\boldsymbol{w} = (w_1, \ldots, w_d)^\top$ is a regression coefficient vector and $w_k$ denotes the regression coefficient of the *k*th feature. By using the kernel trick (Schölkopf & Smola, 2002), equation 1.1 was shown to be equivalently expressed as the following quadratic programming (QP) problem:

$$\min_{\boldsymbol{w}\in\mathbb{R}^d} \frac{1}{2}\sum_{k,l=1}^{d} w_k w_l D_{k,l} \quad \text{s.t.} \quad \left|\sum_{l=1}^{d} w_l D_{k,l} - D(\boldsymbol{u}_k, \boldsymbol{y})\right| \le \lambda, \quad k = 1, \ldots, d, \tag{1.2}$$

where $D_{k,l} = D(\boldsymbol{u}_k, \boldsymbol{u}_l)$ and $D(\boldsymbol{u}, \boldsymbol{u}') = \boldsymbol{\phi}(\boldsymbol{u})^\top\boldsymbol{\phi}(\boldsymbol{u}')$ is the kernel function. This formulation is called the feature vector machine (FVM). Note that since FVM uses the *d*-dimensional Hessian matrix $D$, it is especially useful when the number of training samples *n* is much bigger than the number of features *d*.

In the original FVM, mutual information (Cover & Thomas, 2006) was used as the kernel function $D(\cdot, \cdot)$. However, the matrix $D$ obtained from mutual information is not necessarily positive definite (Seeger, 2002), and thus the objective function, equation 1.2, can be nonconvex. Furthermore, when the number of training samples is smaller than the number of features (which is often the case in high-dimensional feature selection scenarios), the matrix $D$ is singular. This can cause numerical instability. Another restriction of FVM is that, regardless of regression or classification, output $\boldsymbol{y}$ must be transformed by the same nonlinear function as the feature vectors $\boldsymbol{u}_k$. This highly limits the flexibility of capturing nonlinear dependency. Finally, it is not statistically clear what kinds of features are found by the FVM formulation.

### 1.5. Contribution of This Paper.

To overcome the limitations of FVM, we propose an alternative feature-wise nonlinear Lasso. More specifically, we propose to use particular forms of universal reproducing kernels (Steinwart, 2001) as feature and output transformations and solve the optimization problem in the primal space.

An advantage of this new formulation is that the global optimal solution can be computed efficiently. Thus, it is scalable to high-dimensional feature selection problems. To the best of our knowledge, this is the first convex feature selection method that is able to deal with high-dimensional nonlinearly related features. Furthermore, this new formulation has a clear statistical interpretation that nonredundant features with strong statistical dependence on output values are found using kernel-based independence measures such as the Hilbert-Schmidt independence criterion (HSIC) (Gretton, Bousquet, Smola, & Schölkopf, 2005) and the criterion based on the normalized cross-covariance operator (NOCCO) (Fukumizu, Gretton, Sun, & Schölkopf, 2008). Thus, the proposed methods can be regarded as a minimum redundancy maximum relevance–based feature selection method (Peng, Long, & Ding, 2005). In addition, the proposed methods are simple to implement, a highly preferable property for practitioners.

We also discuss the relation between the proposed method and existing feature selection approaches such as minimum redundancy maximum relevance (mRMR) (Peng et al., 2005), HSIC-based greedy feature selection (Song, Smola, Gretton, Bedo, & Borgwardt, 2012), quadratic programming feature selection (QPFS) (Rodriguez-Lujan, Huerta, Elkan, & Cruz, 2010), kernel target alignment (KTA) (Shawe-Taylor & Kandola, 2002; Cortes, Mohri, & Rostamizadeh, 2012), Hilbert-Schmidt feature selection (HSFS) (Masaeli, Fung, & Dy, 2010), and sparse additive models (SpAM) (Ravikumar, Lafferty, Liu, & Wasserman, 2009; Liu, Lafferty, & Wasserman, 2009; Raskutti, Wainwright, & Yu, 2012). See Table 1 for the summary of feature selection methods.

Table 1: Summary of feature selection methods.

| Method | Dependency | Optimization | Primal/Dual | Scalability with Regard to Number of Features |
| --- | --- | --- | --- | --- |
| Lasso | Linear | Convex | Primal | Highly scalable |
| mRMR | Nonlinear | Greedy | — | Scalable |
| Greedy HSIC | Nonlinear | Greedy | — | Scalable |
| HSFS | Nonlinear | Nonconvex | — | Not scalable |
| FVM | Nonlinear | Nonconvex^{a} | Dual | Not scalable |
| QPFS/KTA | Nonlinear | Nonconvex^{a} | Dual | Not scalable |
| SpAM | Additive nonlinear | Convex | Primal | Scalable |
| **Proposed** | **Nonlinear** | **Convex** | **Primal** | **Highly scalable** |


Notes: The methods that have preferable properties for high-dimensional feature selection problems are in bold.

^{a}In practice, positive constants may be added to the diagonal elements of the Hessian matrix to guarantee the convexity, although the validity of selected features by this modification is not statistically clear.

Through experiments on real-world feature selection problems, we show that the proposed methods compare favorably with existing feature selection methods.

## 2. Proposed Methods

In this section, we propose alternative implementations of the nonlinear feature-wise Lasso.

### 2.1. HSIC Lasso.

We propose the following feature-wise nonlinear Lasso, which we call HSIC Lasso:^{2}

$$\min_{\boldsymbol{\alpha}\in\mathbb{R}^d} \frac{1}{2}\left\|\bar{L} - \sum_{k=1}^{d} \alpha_k \bar{K}^{(k)}\right\|_{\mathrm{Frob}}^2 + \lambda\|\boldsymbol{\alpha}\|_1 \quad \text{s.t.} \quad \alpha_1, \ldots, \alpha_d \ge 0, \tag{2.1}$$

where $\|\cdot\|_{\mathrm{Frob}}$ is the Frobenius norm, $\bar{K}^{(k)} = \Gamma K^{(k)} \Gamma$ and $\bar{L} = \Gamma L \Gamma$ are centered Gram matrices, $K^{(k)}_{i,j} = K(x_{k,i}, x_{k,j})$ and $L_{i,j} = L(y_i, y_j)$ are Gram matrices, $K(x, x')$ and $L(y, y')$ are kernel functions, $\Gamma = I_n - \frac{1}{n}\boldsymbol{1}_n\boldsymbol{1}_n^\top$ is the centering matrix, $I_n$ is the *n*-dimensional identity matrix, and $\boldsymbol{1}_n$ is the *n*-dimensional vector with all ones. Note that we employ the nonnegativity constraint on $\boldsymbol{\alpha}$ so that meaningful features are selected (see section 2.2 for details). In addition, since we use the output Gram matrix $L$ to select features in HSIC Lasso, we can naturally incorporate structured outputs via kernels. Moreover, we can perform feature selection even if the training data set consists of input $\boldsymbol{x}$ and its affinity information $L$, such as link structures between inputs.

The differences from the original formulation, equation 1.1, are that we allow the kernel functions *K* and *L* to be different and that the nonnegativity constraint is imposed. The first term in equation 2.1 means that we are regressing the output kernel matrix $\bar{L}$ by a linear combination of the feature-wise input kernel matrices $\{\bar{K}^{(k)}\}_{k=1}^{d}$.
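To make the formulation concrete, the following minimal NumPy sketch builds the centered Gram matrices and solves the nonnegative Lasso fit of equation 2.1 by projected gradient on the vectorized form. The toy data, gaussian width, regularization value, and the projected-gradient solver are illustrative assumptions; the letter itself uses the DAL solver.

```python
import numpy as np

def centered_gauss_gram(x, sigma=1.0):
    """Centered Gram matrix K_bar = Gamma K Gamma of the gaussian kernel for one feature."""
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))
    n = len(x)
    Gamma = np.eye(n) - np.ones((n, n)) / n
    return Gamma @ K @ Gamma

def hsic_lasso(X, y, lam, n_iter=1000, sigma=1.0):
    """Minimize 0.5 * ||L_bar - sum_k alpha_k K_bar^(k)||_F^2 + lam * sum_k alpha_k
    subject to alpha >= 0, via projected gradient on the vectorized problem."""
    n, d = X.shape
    # Stack vec(K_bar^(k)) as columns: the problem becomes a plain nonnegative Lasso.
    Phi = np.column_stack([centered_gauss_gram(X[:, k], sigma).ravel() for k in range(d)])
    t = centered_gauss_gram(y, sigma).ravel()  # vec(L_bar), gaussian output kernel
    alpha = np.zeros(d)
    step = 1.0 / (np.linalg.norm(Phi, 2) ** 2)
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ alpha - t)
        alpha = np.maximum(alpha - step * (grad + lam), 0.0)  # nonnegative prox step
    return alpha

# Toy data: only feature 0 is (nonlinearly) related to y.
rng = np.random.RandomState(1)
n, d = 60, 5
X = rng.randn(n, d)
y = X[:, 0] ** 2 + 0.05 * rng.randn(n)
alpha = hsic_lasso(X, y, lam=1.0)
print(np.argmax(alpha))  # the relevant feature receives the largest weight
```

Because the dependence of *y* on feature 0 is quadratic, a linear Lasso would miss it; the gaussian kernels capture it.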

### 2.2. Interpretation of HSIC Lasso.

Here we show that HSIC Lasso can be regarded as a minimum redundancy maximum relevancy (mRMR)–based feature selection method (Peng et al., 2005), a popular feature selection strategy in machine learning and artificial intelligence communities.

The first term of equation 2.1 can be expanded as

$$\frac{1}{2}\left\|\bar{L} - \sum_{k=1}^{d}\alpha_k\bar{K}^{(k)}\right\|_{\mathrm{Frob}}^2 = \frac{1}{2}\mathrm{HSIC}(\boldsymbol{y}, \boldsymbol{y}) - \sum_{k=1}^{d}\alpha_k\,\mathrm{HSIC}(\boldsymbol{u}_k, \boldsymbol{y}) + \frac{1}{2}\sum_{k,l=1}^{d}\alpha_k\alpha_l\,\mathrm{HSIC}(\boldsymbol{u}_k, \boldsymbol{u}_l),$$

where $\mathrm{HSIC}(\boldsymbol{u}_k, \boldsymbol{y}) = \mathrm{tr}(\bar{K}^{(k)}\bar{L})$ is the empirical Hilbert-Schmidt independence criterion (Gretton et al., 2005) and $\mathrm{tr}(\cdot)$ denotes the trace.

If the *k*th feature $\boldsymbol{u}_k$ has high dependence on output $\boldsymbol{y}$, $\mathrm{HSIC}(\boldsymbol{u}_k, \boldsymbol{y})$ takes a large value, and thus $\alpha_k$ should also take a large value so that equation 2.1 is minimized. On the other hand, if $\boldsymbol{u}_k$ is independent of $\boldsymbol{y}$, $\mathrm{HSIC}(\boldsymbol{u}_k, \boldsymbol{y})$ is close to zero, and thus such $\alpha_k$ tends to be eliminated by the $\ell_1$-regularizer. This means that relevant features that have strong dependence on output $\boldsymbol{y}$ tend to be selected by HSIC Lasso.

Furthermore, if $\boldsymbol{u}_k$ and $\boldsymbol{u}_l$ are strongly dependent (i.e., redundant features), $\mathrm{HSIC}(\boldsymbol{u}_k, \boldsymbol{u}_l)$ takes a large value, and thus either of $\alpha_k$ and $\alpha_l$ tends to be zero. This means that redundant features tend to be eliminated by HSIC Lasso.

Overall, HSIC Lasso tends to find nonredundant features with strong dependence on output $\boldsymbol{y}$, which is the idea of feature selection methods based on minimum redundancy maximum relevancy (mRMR) (Peng et al., 2005). This is a preferable property in feature selection.

Note that it is possible to remove the nonnegativity constraint in equation 2.1 and select the features that have nonzero coefficients $\alpha_k$. However, if we allow negative values in $\boldsymbol{\alpha}$, it becomes hard to interpret the selected features. Since interpretability is one of the most important properties in feature selection, we include the nonnegativity constraint in HSIC Lasso.

### 2.3. Kernel Selection.

In theory, a universal kernel such as the gaussian kernel or the Laplace kernel permits HSIC to detect dependence between two random variables (Gretton et al., 2005). Moreover, it has been proposed to use the delta kernel for multiclass classification problems (Song et al., 2012). Thus, in this letter, we employ the gaussian kernel

$$K(x, x') = \exp\left(-\frac{(x - x')^2}{2\sigma_x^2}\right)$$

for inputs. For outputs, we use the gaussian kernel $L(y, y') = \exp\left(-\frac{(y - y')^2}{2\sigma_y^2}\right)$ in regression cases. In classification scenarios (i.e., *y* is categorical), we use the delta kernel for *y*,

$$L(y, y') = \begin{cases} 1/n_y & \text{if } y = y', \\ 0 & \text{otherwise}, \end{cases}$$

where $n_y$ is the number of samples in class *y*. Note that it is also possible to use the gaussian kernel in classification scenarios, but it tends to perform poorly (see Figure 4).

### 2.4. Computational Properties of HSIC Lasso.

HSIC Lasso, equation 2.1, can be seen as a plain Lasso problem: using the vectorization operator $\mathrm{vec}(\cdot)$, the first term can be rewritten as

$$\frac{1}{2}\left\|\mathrm{vec}(\bar{L}) - \left[\mathrm{vec}(\bar{K}^{(1)}), \ldots, \mathrm{vec}(\bar{K}^{(d)})\right]\boldsymbol{\alpha}\right\|_2^2,$$

which is a least-squares fit with *n*^{2} samples and *d* features. Thus, standard Lasso optimization tools can be used for solving HSIC Lasso.

If *d*>*n*^{2} (i.e., high-dimensional feature selection from a small number of training samples), the Lasso optimization technique called the dual augmented Lagrangian (DAL) was shown to be computationally highly efficient (Tomioka et al., 2011).^{3} Because DAL can also incorporate the nonnegativity constraint without losing its computational advantages, we can directly use DAL to solve our HSIC Lasso problem. In contrast, when *d*<*n*^{2}, we may use either the cKTA or FVM (i.e., dual) formulation.

If the number of samples *n* is relatively large, the gaussian kernel computation in HSIC Lasso is expensive. Therefore, the overall computation cost of HSIC Lasso is high. In addition, since naive implementation of HSIC Lasso requires *n*^{2}*d* memory space, it is not practical if both *d* and *n* are large (e.g., *d*>10000 and *n*>1000). Here, we propose a table lookup approach to reduce the computation time and memory size.

More specifically, we first normalize each input *x* to have mean zero and unit standard deviation and discretize the difference of two input values into *B* values (we use *B* = 4096 in our implementation). Then we prepare in advance a lookup table of *B* elements that contains the corresponding gaussian kernel values and refer to these values when we compute the gaussian kernels. Moreover, the centered Gram matrix can be rewritten as

$$\bar{K}^{(k)} = K^{(k)} - \frac{1}{n}\boldsymbol{1}_n\boldsymbol{m}^\top - \frac{1}{n}\boldsymbol{m}\boldsymbol{1}_n^\top + \frac{s}{n^2}\boldsymbol{1}_n\boldsymbol{1}_n^\top,$$

where $\boldsymbol{m} = K^{(k)}\boldsymbol{1}_n$ and $s = \boldsymbol{1}_n^\top K^{(k)}\boldsymbol{1}_n$. Thus, to compute $\bar{K}^{(k)}$, we only need to store $\boldsymbol{m}$ and *s* in addition to the normalized data and the lookup table. Namely, the required memory size for the table lookup–based approach is *O*(*dn* + *B*), which is much smaller than the memory size required for the naive implementation, *O*(*dn*^{2}).
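The lookup trick can be sketched as follows; *B* = 4096 is from the letter, while the clipping range `MAX_DIFF` for standardized differences is an illustrative assumption.

```python
import numpy as np

B = 4096        # number of discretized difference values, as in the letter
MAX_DIFF = 8.0  # assumed range for |x - x'| after standardization (illustrative)

# Lookup table of gaussian kernel values exp(-t^2 / 2) on a grid of B differences.
grid = np.linspace(0.0, MAX_DIFF, B)
table = np.exp(-grid ** 2 / 2.0)

def gram_by_lookup(x):
    """Approximate gaussian Gram matrix of one standardized feature via the table."""
    diff = np.abs(x[:, None] - x[None, :])
    idx = np.minimum(np.round(diff / MAX_DIFF * (B - 1)).astype(int), B - 1)
    return table[idx]

def centered_gram_by_lookup(x):
    """Centering via m = K 1 and s = 1^T K 1, so that only m, s, the data,
    and the lookup table need to be stored (O(dn + B) memory overall)."""
    K = gram_by_lookup(x)
    n = len(x)
    m = K.sum(axis=1)
    s = m.sum()
    return K - m[None, :] / n - m[:, None] / n + s / n ** 2

# Accuracy check against the exact centered Gram matrix.
rng = np.random.RandomState(0)
x = rng.randn(50)
x = (x - x.mean()) / x.std()
K_exact = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2.0)
G = np.eye(50) - np.ones((50, 50)) / 50
err = np.abs(centered_gram_by_lookup(x) - G @ K_exact @ G).max()
print(err)  # small discretization error
```

With 4096 bins, the discretization error of each kernel value is far below the statistical noise in the Gram matrices themselves.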

### 2.5. Variation: NOCCO Lasso.

Following the kernel-based independence measure based on the normalized cross-covariance operator (NOCCO) (Fukumizu et al., 2008), we can replace $\bar{K}^{(k)}$ and $\bar{L}$ in equation 2.1 with

$$\tilde{K}^{(k)} = \bar{K}^{(k)}\left(\bar{K}^{(k)} + n\epsilon I_n\right)^{-1} \quad \text{and} \quad \tilde{L} = \bar{L}\left(\bar{L} + n\epsilon I_n\right)^{-1},$$

where $\epsilon > 0$ is a regularization parameter. We call this variant NOCCO Lasso. Because the NOCCO-based dependence measure was shown to be asymptotically independent of the choice of kernels, NOCCO Lasso is expected to be less sensitive to the kernel parameter choice than HSIC Lasso, although $\epsilon$ needs to be tuned in practice.
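A sketch of the NOCCO-style normalization, assuming the form $\tilde{K} = \bar{K}(\bar{K} + n\epsilon I_n)^{-1}$ from Fukumizu et al. (2008); the test matrix and $\epsilon$ are illustrative.

```python
import numpy as np

def nocco_gram(Kbar, eps):
    """Normalized centered Gram matrix K~ = Kbar (Kbar + n*eps*I)^{-1}."""
    n = Kbar.shape[0]
    return Kbar @ np.linalg.inv(Kbar + n * eps * np.eye(n))

# The normalization maps each eigenvalue lam of Kbar to lam / (lam + n*eps),
# squashing the spectrum into [0, 1); this is what reduces kernel sensitivity.
rng = np.random.RandomState(0)
A = rng.randn(30, 30)
Kbar = A @ A.T  # any symmetric positive semidefinite matrix
Ktilde = nocco_gram(Kbar, eps=0.01)
print(np.linalg.eigvalsh((Ktilde + Ktilde.T) / 2).max())  # strictly below 1
```

Because $\bar{K}$ and $(\bar{K} + n\epsilon I_n)^{-1}$ share an eigenbasis, $\tilde{K}$ is symmetric and its spectrum is exactly the squashed one.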

### 2.6. Other Types of Regularizers.

The $\ell_1$-regularizer in equation 2.1 can also be replaced with structured regularizers such as the group Lasso regularizer $\sum_{g=1}^{G}\|\boldsymbol{\alpha}_g\|_2$, where $\boldsymbol{\alpha}_g$ is the coefficient vector of the *g*th group of variables and *G* is the number of groups. This group Lasso problem can also be efficiently solved by the DAL package with the nonnegativity constraint.

### 2.7. Relation to Two-Stage Multiple Kernel Learning.

The proposed method is closely related to the two-stage multiple kernel learning (MKL) method called centered kernel target alignment (cKTA) (Cortes et al., 2012), which was originally proposed for learning a kernel Gram matrix (not used for feature selection problems).

An advantage of cKTA is that its Hessian matrix $D$ is *d*-dimensional, and thus it is computationally efficient for feature selection problems with large *n* and small *d*. However, the Hessian matrix $D$ is not necessarily positive definite (Seeger, 2002) and is singular in high-dimensional problems. More specifically, $D$ can be written as $D = V^\top V$ with $V = [\mathrm{vec}(\bar{K}^{(1)}), \ldots, \mathrm{vec}(\bar{K}^{(d)})] \in \mathbb{R}^{n^2 \times d}$, and $D$ is singular when $d > n^2$ (i.e., high-dimensional feature selection from a small number of training samples). Thus, solving the nonnegative least-squares problem in high-dimensional feature selection problems can be cumbersome in practice.

For the above reason, the proposed feature selection method is more suited than cKTA for high-dimensional feature selection problems. In contrast, if we want to solve a small *d* and large *n* feature selection problem, cKTA is more suited than the HSIC Lasso.

## 3. Existing Methods

In this section, we review existing feature selection methods and discuss their relation to the proposed approach. (See Table 1 for the summary of feature selection methods.)

### 3.1. Minimum Redundancy Maximum Relevance.

Minimum redundancy maximum relevance (mRMR) (Peng et al., 2005) is a mutual information–based feature selection criterion.

Let $V = [\boldsymbol{v}_1, \ldots, \boldsymbol{v}_m]$ denote a submatrix of the input data in which *m* features are extracted from *d* features. Then mRMR for $V$ is defined as

$$\mathrm{mRMR}(V) = \frac{1}{m}\sum_{k=1}^{m}\widehat{\mathrm{MI}}(\boldsymbol{v}_k, \boldsymbol{y}) - \frac{1}{m^2}\sum_{k,l=1}^{m}\widehat{\mathrm{MI}}(\boldsymbol{v}_k, \boldsymbol{v}_l), \tag{3.1}$$

where $\widehat{\mathrm{MI}}(\cdot, \cdot)$ is an empirical approximator of mutual information computed from $\hat{p}(\boldsymbol{v}, \boldsymbol{y})$, a Parzen window estimator of the joint density of $\boldsymbol{v}$ and $\boldsymbol{y}$, and $\hat{p}(\boldsymbol{v})$ and $\hat{p}(\boldsymbol{y})$, Parzen window estimators of the marginal densities of $\boldsymbol{v}$ and $\boldsymbol{y}$, respectively.

The first term of mRMR measures the dependency between chosen feature $\boldsymbol{v}_k$ and output $\boldsymbol{y}$, while the second term is a penalty for selecting redundant features. Thus, mRMR finds nonredundant features with strong dependence on outputs. A very fast implementation of mRMR is available and can deal with high-dimensional feature selection problems.

mRMR-based feature selection is performed by finding a submatrix $V$ that maximizes equation 3.1. However, since there are $2^d$ possible feature subsets, the brute-force approach is computationally intractable. Hence, greedy search strategies such as forward selection or backward elimination are used in practice (Peng et al., 2005). However, the greedy approaches tend to produce a locally optimal feature set.

Another potential weakness of mRMR is that mutual information is approximated by Parzen window estimation. Parzen window–based mutual information estimation is unreliable when the number of training samples is small (Suzuki et al., 2009).
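The greedy forward-selection mechanics can be sketched as follows. For simplicity, the absolute Pearson correlation stands in for the Parzen-window mutual information estimator, so this only illustrates the relevance-minus-redundancy trade-off, not mRMR's actual dependency measure.

```python
import numpy as np

def mrmr_forward(X, y, m, dep=None):
    """Greedy forward selection maximizing an mRMR-style score:
    relevance to y minus mean redundancy with already-selected features.
    dep(a, b) is the dependency measure (|Pearson correlation| by default,
    a stand-in for the mutual information estimator)."""
    if dep is None:
        dep = lambda a, b: abs(np.corrcoef(a, b)[0, 1])
    d = X.shape[1]
    relevance = np.array([dep(X[:, k], y) for k in range(d)])
    selected = [int(np.argmax(relevance))]          # most relevant feature first
    while len(selected) < m:
        best, best_score = None, -np.inf
        for k in range(d):
            if k in selected:
                continue
            redundancy = np.mean([dep(X[:, k], X[:, l]) for l in selected])
            score = relevance[k] - redundancy
            if score > best_score:
                best, best_score = k, score
        selected.append(best)
    return selected

# Toy data: feature 1 duplicates feature 0; feature 2 is independent but relevant.
rng = np.random.RandomState(0)
n = 300
x0 = rng.randn(n)
x1 = x0 + 0.01 * rng.randn(n)   # redundant near-duplicate of x0
x2 = rng.randn(n)
y = x0 + 0.8 * x2
X = np.column_stack([x0, x1, x2])
sel = mrmr_forward(X, y, m=2)
print(sel)  # the duplicate is skipped in favor of the nonredundant feature 2
```

Even though the duplicate feature is highly relevant, its redundancy penalty pushes the second pick to the independent feature, which is exactly the behavior the text describes.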

### 3.2. Greedy Feature Selection with HSIC.

Greedy feature selection with HSIC (Song et al., 2012) evaluates a candidate feature subset $V$ by the criterion

$$\mathrm{HSIC}(V) = \mathrm{tr}(\bar{M}\bar{L}), \tag{3.2}$$

where $M_{i,j} = M(\boldsymbol{v}_i, \boldsymbol{v}_j)$ is a Gram matrix over the selected features, $M(\cdot, \cdot)$ is a kernel function, and $\bar{M} = \Gamma M \Gamma$.

HSIC-based greedy feature selection is performed by finding a submatrix ** V** that maximizes equation 3.2. An advantage of HSIC-based feature selection is its simplicity; it can be implemented very easily. However, since the maximization problem equation 3.2 is NP-hard, forward selection or backward elimination strategies are used for finding a locally optimal solution in practice (Song et al., 2012).
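The forward-selection variant can be sketched as follows, scoring candidates by $\mathrm{tr}(\bar{M}\bar{L})$ as in equation 3.2; the gaussian kernels, their widths, and the toy data are illustrative assumptions.

```python
import numpy as np

def _centered(K):
    """Gamma K Gamma with the centering matrix Gamma = I - (1/n) 1 1^T."""
    n = K.shape[0]
    G = np.eye(n) - np.ones((n, n)) / n
    return G @ K @ G

def hsic_forward(X, Lbar, m, sigma=1.0):
    """Forward selection: greedily add the feature maximizing tr(M_bar L_bar),
    where M is the gaussian Gram matrix of the currently selected features."""
    n, d = X.shape
    selected = []
    for _ in range(m):
        best, best_score = None, -np.inf
        for k in range(d):
            if k in selected:
                continue
            V = X[:, selected + [k]]
            sq = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
            Mbar = _centered(np.exp(-sq / (2 * sigma ** 2)))
            score = np.trace(Mbar @ Lbar)
            if score > best_score:
                best, best_score = k, score
        selected.append(best)
    return selected

# Toy data: y depends only on feature 1.
rng = np.random.RandomState(0)
n, d = 80, 4
X = rng.randn(n, d)
y = X[:, 1]
Lbar = _centered(np.exp(-(y[:, None] - y[None, :]) ** 2 / 2.0))
sel = hsic_forward(X, Lbar, m=1)
print(sel)
```

Each greedy step costs one Gram matrix and one trace per remaining feature, which is what makes this approach simple but only locally optimal.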

### 3.3. Hilbert-Schmidt Feature Selection.

HSFS can be regarded as a continuously relaxed version of HSIC-based feature selection (Song et al., 2012). Thanks to this continuous formulation, the HSFS optimization problem can be solved by limited-memory BFGS (L-BFGS) (Nocedal & Wright, 2003). However, since HSFS is a nonconvex method, restarting from many different initial points would be necessary to select good features, which is computationally expensive. Moreover, HSFS attempts to optimize a projection matrix that has *d*^{2} parameters. That is, following the original HSFS implementation based on a quasi-Newton method, the total computational complexity of HSFS is *O*(*d*^{4}), which can be unacceptably large in high-dimensional feature selection problems. To reduce the computational cost, it may be possible to approximately solve the HSFS optimization problem by reducing the size of the transformation matrix. However, this approximation introduces an additional tuning parameter that cannot be chosen objectively.

### 3.4. Quadratic Programming Feature Selection.

Quadratic programming feature selection (QPFS) (Rodriguez-Lujan et al., 2010) tries to find features by solving a QP problem.

The QPFS optimization problem is given as

$$\min_{\boldsymbol{\alpha}\in\mathbb{R}^d} \frac{1-\gamma}{2}\sum_{k,l=1}^{d}\alpha_k\alpha_l D(\boldsymbol{u}_k, \boldsymbol{u}_l) - \gamma\sum_{k=1}^{d}\alpha_k D(\boldsymbol{u}_k, \boldsymbol{y}) \quad \text{s.t.} \quad \boldsymbol{\alpha} \ge \boldsymbol{0}, \; \sum_{k=1}^{d}\alpha_k = 1,$$

where $D_{k,l} = D(\boldsymbol{u}_k, \boldsymbol{u}_l)$, $D(\cdot, \cdot)$ is a dependency measure, and $\gamma \in [0, 1]$ is a tuning parameter. In QPFS, an empirical estimator of mutual information is used as the dependency measure. Note that if we employ HSIC as the dependency measure in QPFS and remove the sum-to-one constraint, QPFS is equivalent to centered KTA (cKTA) (Cortes et al., 2012), a multiple kernel learning method originally proposed for learning a kernel matrix (see section 2.7 for details).

Similar to cKTA, an advantage of QPFS is that feature selection can be performed just by solving a QP problem. Moreover, since QPFS has a *d*-dimensional Hessian matrix, it is computationally efficient for feature selection problems with large *n* and small *d*. However, the Hessian matrix $D$ is not necessarily positive definite (Seeger, 2002) and is singular in high-dimensional problems.

### 3.5. Sparse Additive Models.

The sparse additive model (SpAM) is useful for high-dimensional feature selection (Ravikumar et al., 2009; Liu et al., 2009; Raskutti et al., 2012; Suzuki & Sugiyama, 2013).

An advantage of SpAM is that it is a convex method and can be efficiently optimized by the backfitting algorithm. Moreover, statistical properties of the SpAM estimator are well studied (Ravikumar et al., 2009). A potential weakness of SpAM is that it can deal only with additive models. That is, if data follow a nonadditive model, SpAM may not work well (see Figure 1b). Another weakness of SpAM is that it needs to optimize *nd* variables, while the proposed methods contain only *d* variables. Thus, SpAM optimization tends to be computationally more expensive than the proposed methods. Finally, an output *y* should be a real number in SpAM, meaning that SpAM cannot deal with structured outputs such as multilabel and graph data.

## 4. Experiments

In this section, we experimentally investigate the performance of the proposed and existing feature selection methods using synthetic and real-world data sets.

### 4.1. Setup.

We compare the performance of the proposed methods with mRMR (Peng et al., 2005), QPFS (Rodriguez-Lujan et al., 2010), cKTA (Cortes et al., 2012), forward selection with HSIC (FHSIC), FVM (Li et al., 2006), and SpAM (Ravikumar et al., 2009).^{4} Note that since it has been reported that the performance of FHSIC is comparable to HSFS and HSFS is computationally expensive for high-dimensional data, we compare the proposed method only to FHSIC. For FVM, QPFS, and mRMR, the C++ implementation of a mutual information estimator is used.^{5} Then the QP solver SeDuMi is used to solve the FVM and QPFS optimization problems.^{6} We observed that the matrices $D$ in FVM, QPFS, and cKTA tend not to be positive definite. In our experiments, we therefore added a small constant to the diagonal elements of $D$ so that the objective function becomes convex. For all experiments, the tuning parameters of FVM, QPFS, cKTA, and NOCCO Lasso, as well as the kernel widths of the proposed methods, were fixed to constant values.

### 4.2. Synthetic Data Sets.

First, we illustrate the behavior of the proposed HSIC Lasso and NOCCO Lasso using the following two synthetic data sets:

- Data1: outputs are generated from an additive model of a few relevant features.
- Data2: outputs are generated from a nonadditive model of a few relevant features.

Figure 1 shows the feature selection accuracy of each method over 30 runs as functions of the number of samples, where the accuracy is measured by the fraction of correctly selected features under the assumption that the number of true features is known. As the figure clearly shows, the proposed HSIC Lasso and NOCCO Lasso methods select good features in both additive and nonadditive model cases. SpAM also works very well for Data1, but it performs poorly for Data2 because the additivity assumption is violated in Data2. QPFS and FVM behave similarly but tend to be outperformed by the proposed methods.

Next, we compare the computation time of each method. Here we increase the number of features in Data2 while fixing the number of samples to *n* = 100. Figure 1c shows the average computation time for each method over 30 runs. As can be observed, the computation time of HSIC Lasso and NOCCO Lasso increases mildly with respect to the number of features compared to that of SpAM, FVM, and QPFS. Figure 1d shows the average computation time of HSIC Lasso over 30 runs. In this experiment, we use the table lookup trick for HSIC Lasso to deal with a relatively large number of samples. The result shows that the lookup trick allows us to handle relatively large data sets.

### 4.3. Real-World Data Sets.

Next, we compare the performance of feature selection methods using real-world data sets.

#### 4.3.1. Multi-Class Classification.

We use four image data sets and two microarray data sets.^{7} Detailed information on the data sets is summarized in Table 2.

Table 2: Summary of real-world data sets.

| Type | Data Set | Number of Features (*d*) | Number of Samples (*n*) | Number of Classes |
| --- | --- | --- | --- | --- |
| Image | AR10P | 2,400 | 130 | 10 |
| Image | PIE10P | 2,400 | 210 | 10 |
| Image | PIX10P | 10,000 | 100 | 10 |
| Image | ORL10P | 10,000 | 100 | 10 |
| Microarray | TOX | 5,748 | 171 | 4 |
| Microarray | CLL-SUB | 11,340 | 111 | 3 |


In this experiment, we use 80% of the samples for training and the rest for testing. We repeat the experiments 100 times by randomly shuffling training and test samples and evaluate the performance of feature selection methods by the average classification accuracy. We use multiclass $\ell_2$-regularized kernel logistic regression (KLR) (Hastie, Tibshirani, & Friedman, 2001; Yamada, Sugiyama, & Matsui, 2010) with the gaussian kernel to evaluate the classification accuracy when the top *m* features selected by each method are used. In this letter, we first choose 50 features and then use the top *m* features having the largest absolute regression coefficients. In KLR, all tuning parameters, such as the gaussian width and the regularization parameter, are chosen based on three-fold cross-validation.

We also evaluate the redundancy rate (RED) (Zhao et al., 2010),^{8}

$$\mathrm{RED} = \frac{1}{m(m-1)}\sum_{k \ne l} |\rho_{k,l}|,$$

where $\rho_{k,l}$ is the correlation coefficient between the *k*th and *l*th selected features. A large RED score indicates that the selected features are strongly correlated with each other, that is, many redundant features are selected. Thus, a small redundancy rate is preferable for a feature selection method.
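A sketch of the RED computation, assuming the mean-absolute-correlation form above; the toy features are illustrative.

```python
import numpy as np

def redundancy_rate(V):
    """RED: average absolute pairwise correlation among the m selected features
    (columns of V); small values indicate a less redundant selection."""
    m = V.shape[1]
    C = np.abs(np.corrcoef(V, rowvar=False))
    return (C.sum() - np.trace(C)) / (m * (m - 1))  # off-diagonal mean

rng = np.random.RandomState(0)
a = rng.randn(100)
b = rng.randn(100)
r_dup = redundancy_rate(np.column_stack([a, a]))   # duplicated feature: RED near 1
r_ind = redundancy_rate(np.column_stack([a, b]))   # unrelated features: RED near 0
print(r_dup, r_ind)
```

Taking the absolute value of each correlation matters here: without it, negatively correlated (still redundant) pairs would spuriously lower the score, as footnote 8 notes.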

*Results*. Figure 2 shows the mean classification accuracy over 100 runs as functions of the number of selected features. Table 3 shows the average classification accuracy rates for the top *m*=50 features selected by each method. In this experiment, since the computation cost of FVM was too high for data sets with a large number of features, we included the FVM results only for the data sets with a small number of features (i.e., AR10P, PIE10P, and TOX). The graphs in Figure 2 clearly show that HSIC Lasso and NOCCO Lasso compare favorably with existing methods for the image data sets (i.e., AR10P, PIE10P, PIX10P, and ORL10P) in terms of the classification accuracy, and they are comparable to existing methods for the microarray data sets (i.e., TOX and CLL-SUB).

Table 3: Mean classification accuracy for the top *m* = 50 features selected by each method.

| Data Set | HSIC Lasso | NOCCO Lasso | SpAM | FVM | mRMR | QPFS | cKTA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AR10P | .85 (.11) | .85 (.11) | .55 (.11) | .80 (.12) | .75 (.14) | .74 (.14) | .71 (.21) |
| PIE10P | .97 (.03) | .97 (.03) | .90 (.11) | .96 (.06) | .90 (.12) | .95 (.07) | .88 (.22) |
| PIX10P | .96 (.04) | .96 (.04) | .86 (.15) | — | .77 (.12) | .92 (.07) | .92 (.09) |
| ORL10P | .94 (.07) | .94 (.07) | .67 (.12) | — | .78 (.14) | .88 (.10) | .85 (.14) |
| TOX | .78 (.12) | .79 (.11) | .69 (.09) | .69 (.09) | .71 (.08) | .77 (.08) | .80 (.11) |
| CLL-SUB | .77 (.09) | .77 (.08) | .52 (.11) | — | .64 (.10) | .76 (.12) | .71 (.10) |


Note: Standard deviations are given in parentheses.

Table 4 shows the RED values for the top *m*=50 features selected by each method. As can be observed, HSIC Lasso and NOCCO Lasso tend to have smaller RED values, and thus they select less redundant features.

Table 4: Redundancy rate (RED) for the top *m* = 50 features selected by each method.

| Data Set | HSIC Lasso | NOCCO Lasso | SpAM | FVM | mRMR | QPFS | cKTA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AR10P | .20 (.03) | .20 (.03) | .26 (.04) | .26 (.04) | .27 (.04) | .22 (.05) | .24 (.03) |
| PIE10P | .14 (.01) | .14 (.02) | .25 (.042) | .19 (.03) | .23 (.04) | .18 (.03) | .16 (.02) |
| PIX10P | .18 (.02) | .17 (.02) | .34 (.11) | — | .20 (.07) | .29 (.06) | .20 (.04) |
| ORL10P | .19 (.03) | .19 (.03) | .30 (.05) | — | .29 (.10) | .20 (.03) | .19 (.03) |
| TOX | .38 (.03) | .38 (.03) | .39 (.03) | .42 (.03) | .39 (.03) | .38 (.03) | .37 (.04) |
| CLL-SUB | .34 (.03) | .35 (.03) | .40 (.06) | — | .33 (.04) | .32 (.03) | .28 (.05) |


Note: Standard deviations are given in parentheses.

*Role of the Gaussian Width and the Output Kernel*. In the proposed methods, the gaussian width and the output kernel must be chosen manually. We carried out a set of experiments to examine the sensitivity to the gaussian width and to the output kernel choice (see Figures 3 and 4). Since the performances of HSIC Lasso and NOCCO Lasso are comparable, we evaluate only HSIC Lasso here. As Figure 3 shows, the proposed method is not highly sensitive to the gaussian width. From the output kernel comparison in Figure 4, we found that HSIC Lasso with the delta kernel clearly outperforms that with the gaussian kernel. Thus, using different input and output kernels is important for feature selection in classification scenarios.

#### 4.3.2. High-Dimensional Regression.

We also evaluate our proposed method with the Affymetrix GeneChip Rat Genome 230 2.0 Array data set (Scheetz et al., 2006). The data set consists of 120 rat subjects with 31,098 genes, measured from eye tissue. Similar to Huang et al. (2010), we focus on finding genes related to the TRIM32 gene, which was recently found to cause Bardet-Biedl syndrome. Since TRIM32 takes real values, this is a regression problem.

In this experiment, we use 80% of the samples for training and the rest for testing. We repeat the experiments 100 times by randomly shuffling training and test samples and evaluate the performance of feature selection methods by the mean squared error. In addition, we use the correlation coefficient between the predicted and true TRIM32 values, a popular performance metric in the biology community. We use kernel regression (KR) (Schölkopf & Smola, 2002) with the gaussian kernel to evaluate the mean squared error and the mean correlation when the top *m* features selected by each method are used. We first choose 50 features and then use the top *m* features having the largest absolute regression coefficients. In KR, all tuning parameters, such as the gaussian width and the regularization parameter, are chosen based on three-fold cross-validation.

*Results.* Figure 5 shows the mean squared error and the mean correlation coefficient over 100 runs as functions of the number of selected features. As can be observed, the proposed HSIC Lasso and NOCCO Lasso compare favorably with existing methods.

## 5. Conclusion

In this letter, we proposed novel nonlinear feature selection methods called HSIC Lasso and NOCCO Lasso. In the proposed methods, the globally optimal solution can be obtained by solving a Lasso optimization problem with a nonnegativity constraint, which can be performed efficiently by the dual augmented Lagrangian algorithm (Tomioka et al., 2011). Furthermore, the proposed methods have a clear statistical interpretation: nonredundant features with strong statistical dependence on output values are found via kernel-based independence measures (Gretton et al., 2005; Fukumizu et al., 2008). We applied the proposed methods to real-world image and biological feature selection tasks and experimentally showed that they are promising.

The usefulness of the proposed methods will be investigated in more real-world applications, such as computer vision, bioinformatics, and speech and signal processing, in our future work. Moreover, extending the proposed model to multitask learning and investigating the theoretical properties of the proposed formulation are important directions for future work.

## Acknowledgments

We thank Pradeep Ravikumar for providing us the SpAM code and Junming Yin and Kenji Fukumizu for their valuable comments. M.Y. acknowledges the JST PRESTO Program and the PLIP Program, W.J. acknowledges the Okazaki Kaheita International Scholarship and MEXT KAKENHI 23120004, and M.S. acknowledges the FIRST program.

## References

## Notes

^{1} Output *y* could be either continuous (i.e., regression) or categorical (i.e., classification). Structured outputs can also be handled by our proposed methods.

^{2} A Matlab implementation of the proposed algorithm is available online at http://www.makotoyamada-ml.com/hsiclasso.html.

^{4} We thank the authors of Ravikumar et al. (2009) for providing us the code used in their paper.

^{8} The original redundancy rate was defined with a plain correlation coefficient (Pearson, 1920), not the absolute correlation coefficient (Zhao et al., 2010). However, this is not appropriate as an error metric because negative correlations decrease RED. For this reason, we use the absolute correlation coefficient.