## Abstract

Recently, graph-based unsupervised feature selection (GUFS) algorithms have been shown to handle prevalent high-dimensional unlabeled data efficiently. One common drawback of existing graph-based approaches is that they tend to be time-consuming and require large storage, especially as data sets grow. Research has begun to use anchors to accelerate graph-based learning models for feature selection, but the hard linear constraint between the data matrix and the lower-dimensional representation is usually overstrict in many applications. In this letter, we propose a flexible linearization model with an anchor graph and $\ell_{2,1}$-norm regularization, which can deal with large-scale data sets and improve the performance of the existing anchor-based method. In addition, the anchor-based graph Laplacian is constructed to characterize the manifold embedding structure by means of a parameter-free adaptive neighbor assignment strategy. An efficient iterative algorithm is developed to address the optimization problem, and we also prove the convergence of the algorithm. Experiments on several public data sets demonstrate the effectiveness and efficiency of the proposed method.

## 1  Introduction

In many applications, such as computer vision, data mining, and pattern recognition, we are often confronted with data represented by high-dimensional features, which require significant time and space to process (Chang, Nie, Yang, Zhang, & Huang, 2016; Freeman, Kulic, & Basir, 2013). To address this issue, feature selection techniques reduce the dimensionality of high-dimensional data by selecting a smaller set of representative features (Romero & Sopena, 2008; Nie, Xiang, Jia, Zhang, & Yan, 2008; Ling, Qiang, & Min, 2013; Wang, Nie, Yang, Gao, & Yao, 2015; Luo et al., 2018; Paul & Drineas, 2016). In general, feature selection algorithms can be classified into three categories: supervised, semisupervised, and unsupervised feature selection. Because rapidly developing technology produces large amounts of unlabeled data and annotating these data is an expensive and time-consuming process (Cheng, Zhou, & Cheng, 2011; Dy & Brodley, 2004; Wu, Jia, Liu, Ghanem, & Lyu, 2018), we focus on the problem of selecting features in unsupervised learning scenarios.

Since many real-world data sets, such as face images, handwritten digits, and text, are distributed on low-dimensional manifolds (Roweis & Saul, 2000; Cheng, Nie, Sun, & Gong, 2017), methods that consider the manifold structure usually perform better (Liu, Jiang, Luo, & Chang, 2011; Yang et al., 2016). In this letter, we focus on the graph-based family of unsupervised feature selection methods, in which the manifold geometry of the entire feature set is represented in graph form. Existing GUFS methods (Zhao & Liu, 2007; Cai, Zhang, & He, 2010; Li, Yang, Liu, Zhou, & Lu, 2012; Hou, Nie, Li, Yi, & Wu, 2014) determine feature relevance by evaluating a feature's correlation with a pseudo-label derived from spectral analysis and use regression coefficients to rank the features. However, conventional graph-based approaches usually use a kernel-based neighbor assignment strategy (e.g., the gaussian similarity function) to construct a similarity graph, which typically requires extra parameters and may adversely affect learning performance (Nie, Wang, & Huang, 2014). Moreover, relaxing these pseudo-labels into a continuous embedding by imposing orthogonal constraints on the cluster indicator matrix inevitably introduces noise into the estimated cluster labels (Shi, Du, & Shen, 2014; Liu & Chang, 2009).

In addition, as data grow, traditional graph-based approaches tend to be inefficient in handling large-scale data sets due to their high computational complexity, which stems mainly from two aspects: the construction of the similarity graph and the decomposition of the Laplacian matrix. These two processes are time-consuming for large-scale data and have at least $O(n^2d)$ time complexity, where $n$ and $d$ represent the number of samples and features, respectively. To solve this problem, recent work (Liu, He, & Chang, 2010; Liu, Wang, & Chang, 2012; Deng, Ji, Liu, Tao, & Gao, 2013; Deng, Ji, Tao, Gao, & Li, 2014) attempts to accelerate graph-based learning models using anchors. Since the number of anchors can be much smaller than the number of original data points, both the graph construction and the learning process can be much faster than in traditional graph-based approaches. The work in Hu, Wang, Nie, Yang, and Yu (2018) first applied anchors to accelerate a graph-based learning model for feature selection, but the hard linear constraint between the data matrix and the lower-dimensional representation is usually overstrict in many real applications (Nie, Xu, Tsang, & Zhang, 2010), which may degrade the learning performance.

In light of the noted limitations of GUFS, in this letter we propose a novel algorithm: scalable and flexible unsupervised feature selection (SFUFS). SFUFS simultaneously performs manifold embedding learning and flexible regression analysis based on an anchor graph, which allows it to deal with large-scale data sets and improve the performance of the existing anchor-based method. At the same time, an anchor-based graph construction strategy is applied to characterize the manifold embedding structure by means of a parameter-free adaptive neighbor assignment strategy. As with conventional unsupervised feature selection methods, the goal of SFUFS is to learn a discriminative projection matrix; thus, it is more suitable to set orthogonal constraints on the projection matrix instead of a cluster indicator matrix to select the most discriminative features robustly (Wang, Nie, & Huang, 2014). For feature selection, it is preferable to learn a row-sparse projection matrix, which is formulated as an $\ell_{2,1}$-norm minimization term (Nie, Huang, Cai, & Ding, 2010; Nie, Zhu, & Li, 2016; Xing, Dong, Jiang, & Tang, 2018). Moreover, an efficient iterative algorithm is proposed to solve the optimization problem of SFUFS. Extensive experiments are conducted on four benchmark data sets, demonstrating the efficiency and effectiveness of the proposed SFUFS algorithm.

The rest of the letter is organized as follows. In section 2, we formulate the proposed SFUFS framework and introduce a parameter-free construction approach of an anchor-based graph. Section 3 presents an efficient iterative algorithm to tackle the problem and some analysis about the proposed method, including computation complexity analysis and convergence behavior. Section 4 provides some comparison results on various data sets, followed by the conclusion in section 5.

## 2  The SFUFS Algorithm

In this section, we present the formulation of the proposed SFUFS. This method performs manifold embedding learning and flexible regression jointly on the Laplacian graph, where the graph is constructed with an anchor-based strategy via a parameter-free algorithm.

### 2.1  The Proposed Framework

Given a data matrix $X=[x_1,x_2,\ldots,x_n]^T\in\mathbb{R}^{n\times d}$, graph construction approaches calculate a similarity matrix $A=\{a_{ij}\}\in\mathbb{R}^{n\times n}$, where $a_{ij}$ measures the weight between $x_i$ and $x_j$. According to manifold learning theory, data lie on or near a smooth manifold of lower dimensionality (Roweis & Saul, 2000). We represent the original data $x_i$ by a low-dimensional embedding $f_i\in\mathbb{R}^{l\times 1}$, where $l$ is the dimensionality of the embedding $(l\ll d)$. With this replacement, the most valuable information is retained and feature redundancies are eliminated (Hou et al., 2014). Because of the intuition that nearby points are more likely to have similar properties (Hou et al., 2014; Nie, Wang, Jordan, & Huang, 2016), nearby points should be assigned a larger similarity. Roweis and Saul (2000) assume that the same similarity that reconstructs the data points in the high-dimensional space $\mathbb{R}^d$ should also reconstruct their embedded manifold coordinates in the low-dimensional space $\mathbb{R}^l$. Therefore, a natural way to obtain the manifold embedding is by minimizing
$\min_F \frac{1}{2}\sum_{i,j=1}^{n} a_{ij}\|f_i-f_j\|_2^2=\min_F \mathrm{Tr}(F^TLF),$
(2.1)
where $F\in\mathbb{R}^{n\times l}$ has $f_i$ as its $i$th row, $L=D-A$ is called a Laplacian matrix, and $D$ is a diagonal matrix with the $i$th diagonal entry defined as $\sum_j a_{ij}$.
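As a minimal numerical check of equation 2.1, the following sketch (with a random symmetric similarity matrix standing in for a real graph) verifies that the weighted pairwise sum equals the trace form:

```python
# Check: (1/2) * sum_ij a_ij ||f_i - f_j||^2 == Tr(F^T L F), with L = D - A.
import numpy as np

rng = np.random.default_rng(0)
n, l = 6, 2
A = rng.random((n, n))
A = (A + A.T) / 2                      # similarity matrix must be symmetric
np.fill_diagonal(A, 0)
F = rng.random((n, l))                 # low-dimensional embedding, one row per sample

D = np.diag(A.sum(axis=1))             # degree matrix
L = D - A                              # graph Laplacian

lhs = 0.5 * sum(A[i, j] * np.sum((F[i] - F[j]) ** 2)
                for i in range(n) for j in range(n))
rhs = np.trace(F.T @ L @ F)
assert np.isclose(lhs, rhs)
```

The identity holds for any symmetric $A$, which is why minimizing the trace form is equivalent to pulling similar points close in the embedding.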
Our previous work in Hu et al. (2018) proposed a fast unsupervised feature selection (FUFS) algorithm, which assumes that there is always a transformation matrix $W\in\mathbb{R}^{d\times l}$ that preserves the manifold structure after transforming the original high-dimensional data $x_i$ into the lower-dimensional form $f_i=W^Tx_i$. $W$ can be regarded as the combination coefficients for different features, and the importance of the $i$th feature can be valued by $\|w_i\|_2$, where $w_i$ is the $i$th row of $W$. To perform feature selection, we enforce sparsity on the rows of $W$ by adding an $\ell_{2,1}$-norm regularization term to decide which features to keep (Nie, Huang et al., 2010; Chang, Nie, & Wang, 2017). Concretely, the objective function of FUFS can be written as
$\min_{W^TW=I}\ \mathrm{Tr}(W^TX^TLXW)+\gamma\|W\|_{2,1},$
(2.2)
where $\gamma$ is a regularization parameter, and the orthogonal constraint $W^TW=I$ is imposed to avoid arbitrary scaling, which makes it more reasonable to rank features by the $\ell_2$-norm of $w_i$ for feature selection.
Commonly, the linearization technique of manifold embedding is computationally efficient for projection learning; however, the rigidly constrained method is overstrict, and its performance may degrade in cases with a nonlinearly distributed manifold (Nie, Xu et al., 2010). Thus, we introduce a flexible penalty term (i.e., the regression residue $\|XW-F\|_F^2$) to solve this problem. By performing manifold embedding learning and flexible regression analysis jointly, the formulation of SFUFS is given by
$\min_{W^TW=I,\,F}\ \mathrm{Tr}(F^TLF)+\alpha\|XW-F\|_F^2+\gamma\|W\|_{2,1},$
(2.3)
where $\alpha$ is a regularization parameter used to balance the different terms. We can thus learn $F$ and $W$ simultaneously in a unified objective to obtain a more robust projection matrix $W$ and select more discriminative features.

### 2.2  Graph Construction

Unlike traditional graph-based methods that require painstaking work to compute all pairwise similarities between $n$ data points, the anchor-based method establishes the adjacency relationship between original data points through anchors, which greatly reduces the complexity of graph construction as well as the memory requirement. We give a detailed illustration next.

#### 2.2.1  Traditional Graph Construction

The most critical step in all graph-based approaches is building a similarity graph, which usually consists of two steps. The first step is determining the edge connections of the graph. There are three common connecting strategies: the $\varepsilon$-neighborhood graph, the $k$-nearest neighbor ($k$NN) graph, and the fully connected graph. Exploiting the local geometric structure of the data usually achieves better performance than a global one, and the positive integer $k$ is easier to tune than $\varepsilon$, so one broadly used connecting strategy is the $k$NN graph, which connects every point to its $k$ nearest neighbors. The second step is to compute a similarity score for connected points. Traditional GUFS methods typically define the weight $a_{ij}$ by means of the gaussian kernel function. The similarity of the $k$NN graph can then be defined as
$a_{ij}=\begin{cases}\exp\left(-\frac{\|x_i-x_j\|_2^2}{\sigma^2}\right) & x_i\in N(x_j)\ \text{or}\ x_j\in N(x_i),\\ 0 & \text{otherwise},\end{cases}$
where $σ$ is the width parameter and $N(x)$ represents the set of $k$-nearest neighbors of $x$.
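For concreteness, a minimal sketch of the traditional gaussian-weighted $k$NN graph follows. The width `sigma` is a hypothetical choice made only for illustration; in practice it must be tuned per data set, which is precisely the drawback the parameter-free strategy below avoids.

```python
# Naive O(n^2 d) gaussian kNN graph: the construction this letter seeks to replace.
import numpy as np

def gaussian_knn_graph(X, k=5, sigma=1.0):
    n = X.shape[0]
    # pairwise squared Euclidean distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = np.sum(X ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0)
    np.fill_diagonal(D2, np.inf)               # exclude self-neighbors
    knn = np.argsort(D2, axis=1)[:, :k]        # indices of the k nearest neighbors
    A = np.zeros((n, n))
    for i in range(n):
        for j in knn[i]:
            w = np.exp(-D2[i, j] / sigma ** 2)
            A[i, j] = A[j, i] = w              # connect if either point is a neighbor
    return A

X = np.random.default_rng(1).random((50, 8))
A = gaussian_knn_graph(X, k=5)
assert np.allclose(A, A.T) and (A >= 0).all()
```

Note that both the $O(n^2d)$ distance matrix and the $n\times n$ output make this approach infeasible at scale, motivating the anchor graph of section 2.2.2.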

The widely used $k$NN graph based on the gaussian kernel function, however, still has several limitations. First, the extra gaussian kernel parameter $\sigma$ is very sensitive and varies across data sets (Xiang, Nie, Zhang, & Zhang, 2009; Nie et al., 2014), which makes it difficult to adjust in practice over a large range. Moreover, this traditional graph construction approach is infeasible for large-scale data sets, because the time cost of similarity graph construction is $O(n^2d)$ and the $n\times n$ matrix $A$ must be stored in memory. To overcome these drawbacks, we adopt a parameter-free method to construct the similarity matrix with an anchor-based strategy.

#### 2.2.2  Anchor Graph Construction

Instead of calculating the similarity matrix $A$ directly, the anchor-based strategy constructs a matrix $Z$ to measure the relationship between the original data points and the anchor points and uses $Z$ to establish links among the original data points. Generally, there are two steps to construct the matrix $Z$: (1) generating $m$ $(m\ll n)$ anchors to cover all data points and (2) measuring the similarity between data points and the obtained anchors by the matrix $Z\in\mathbb{R}^{n\times m}$.

The generation of anchor points can be realized by means of random selection or the $k$-means method (Liu et al., 2010, 2012; Deng, Ji, Liu, Tao, & Gao, 2013; Deng, Ji, Tao, Gao, & Li, 2014). Random selection samples $m$ anchor points from the original data points at random, with computational complexity $O(1)$. The $k$-means method uses the $m$ cluster centers as more representative anchors for better performance, but it takes $O(ndmt)$ time, where $t$ is the number of iterations, which is impractical for large-scale data sets. Although random selection does not guarantee that the selected $m$ anchor points are always representative, it is very fast for large-scale data sets. Therefore, we prefer to generate anchor points by random selection.

After the anchors are generated, a parameter-free strategy is adopted to obtain the similarity matrix $Z$. Let $U=[u_1,\ldots,u_m]^T\in\mathbb{R}^{m\times d}$ represent the set of anchor points. We regard the similarity $z_{ij}$ between $x_i$ and $u_j$ as the probability that $u_j$ is a neighbor of $x_i$, and the squared Euclidean distance $\|x_i-u_j\|_2^2$ is applied as the distance measure. A natural way to obtain the neighbor probabilities of the $i$th sample is by solving the following problem,
$\min_{z_i^T\mathbf{1}=1,\ z_{ij}\ge 0}\ \sum_{j=1}^{m}\|x_i-u_j\|_2^2\,z_{ij}+\mu\sum_{j=1}^{m}z_{ij}^2,$
(2.4)
where $\mu$ is the regularization parameter, $z_i^T$ represents the $i$th row of $Z$, and $z_{ij}$ is the $j$th element of $z_i^T$. The regularization is added to avoid the trivial solution (Nie et al., 2014). Let $d_{ij}=\|x_i-u_j\|_2^2$, and denote by $d_i\in\mathbb{R}^{m\times 1}$ the vector whose $j$th element is $d_{ij}$. We rewrite equation 2.4 in vector form as
$\min_{z_i}\ \frac{1}{2}\left\|z_i+\frac{d_i}{2\mu}\right\|_2^2\quad \mathrm{s.t.}\ z_i^T\mathbf{1}=1,\ z_{ij}\ge 0.$
(2.5)
It is easy to see that problem 2.5 has a closed-form solution. Accordingly, the similarity matrix $A$ can be obtained as (Liu et al., 2010)
$A=Z\Delta^{-1}Z^T,$
(2.6)
where $\Delta\in\mathbb{R}^{m\times m}$ is a diagonal matrix with the $j$th diagonal element defined as $\Delta_{jj}=\sum_{i=1}^{n}z_{ij}$. Instead of storing the $n\times n$ matrix $A$ in memory, we only need to store $Z$, which can be built with time cost $O(ndm)$, linear in the data size $n$.
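A small sketch of the identity in equation 2.6: whenever each row of $Z$ sums to 1, the implied graph $A=Z\Delta^{-1}Z^T$ has unit row sums, which is what later gives $D=I$ and $L=I-BB^T$ with $B=Z\Delta^{-1/2}$ (a random row-stochastic $Z$ stands in for a learned one):

```python
# Anchor graph A = Z Δ^{-1} Z^T: unit degrees and the B B^T factorization.
import numpy as np

rng = np.random.default_rng(2)
n, m = 200, 10
Z = rng.random((n, m))
Z /= Z.sum(axis=1, keepdims=True)        # each row is a probability vector (z_i^T 1 = 1)

delta = Z.sum(axis=0)                    # Δ_jj = sum_i z_ij
A = Z @ np.diag(1.0 / delta) @ Z.T       # equation 2.6 (never materialized at scale)
assert np.allclose(A.sum(axis=1), 1.0)   # degrees D_ii = 1, hence D = I

B = Z / np.sqrt(delta)                   # B = Z Δ^{-1/2}
assert np.allclose(A, B @ B.T)           # A = B B^T, so L = I - B B^T
```

In the actual algorithm only $Z$ (or $B$) is ever stored; forming $A$ here is purely to verify the identity on a toy size.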

#### 2.2.3  The Solution of Problem 2.5

The difficulty with solving problem 2.5 is that the regularization parameter $\mu$ is difficult to set and varies by data set. We therefore present an effective method to determine it. Following Nie et al. (2014) and Nie, Wang et al. (2016), we construct the Lagrangian function of equation 2.5 to determine the value of $\mu$,
$\mathcal{L}(z_i,\eta,\varphi_i)=\frac{1}{2}\left\|z_i+\frac{d_i}{2\mu}\right\|_2^2-\eta(z_i^T\mathbf{1}-1)-\varphi_i^Tz_i,$
(2.7)
where $\eta$ is a scalar and $\varphi_i$ is a vector, both of which are Lagrange multipliers to be determined.
According to the KKT condition, the optimal solution $zij$ should be
$z_{ij}=\left(-\frac{d_{ij}}{2\mu}+\eta\right)_+,$
(2.8)
where $(x)_+=\max(x,0)$. Without loss of generality, suppose $d_{i1},d_{i2},\ldots,d_{im}$ are ordered from small to large. To learn a sparse $z_i$ that has only $k$ nonzero elements, according to equation 2.8, we require $z_{ik}>0$ and $z_{i,k+1}=0$. Therefore, we have
$-\frac{d_{ik}}{2\mu}+\eta>0,\qquad -\frac{d_{i,k+1}}{2\mu}+\eta\le 0.$
(2.9)
According to the constraint $ziT1=1$ and equation 2.8, we have
$\eta=\frac{1}{k}+\frac{1}{2k\mu}\sum_{j=1}^{k}d_{ij}.$
(2.10)
So we have the following inequality according to equations 2.9 and 2.10:
$\frac{k}{2}d_{ik}-\frac{1}{2}\sum_{j=1}^{k}d_{ij}<\mu\le\frac{k}{2}d_{i,k+1}-\frac{1}{2}\sum_{j=1}^{k}d_{ij},$
(2.11)
and $μ$ could be set as
$\mu=\frac{k}{2}d_{i,k+1}-\frac{1}{2}\sum_{j=1}^{k}d_{ij}.$
(2.12)
Thus, the solution to problem 2.5 is
$z_{ij}=\frac{d_{i,k+1}-d_{ij}}{k\,d_{i,k+1}-\sum_{h=1}^{k}d_{ih}}\ \ \text{for}\ j\le k,\qquad z_{ij}=0\ \ \text{otherwise},$
(2.13)
which has the following desirable properties:
1. The learned $z_{ij}$ is scale invariant. If all the distances $d_{ij}$ are scaled by an arbitrary positive scalar, the $z_{ij}$ computed by equation 2.13 does not change, since the scalar cancels between the numerator and the denominator.

2. From equation 2.12, we see that the parameter $k$ is much easier to tune than $\mu$, since $k$ is a positive integer with an explicit meaning. In most cases, $k<10$ produces reasonable results. This property is important since hyperparameter tuning is still a difficult and open problem in many learning tasks.

3. The learned $Z$ is naturally $k$-sparse, so the computational burden of the subsequent optimization process can be greatly reduced.
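The closed-form assignment of equation 2.13 can be sketched for a single sample as follows; the asserts exercise the three properties above (sum to one, nonnegativity, $k$-sparsity). The function name and toy sizes are illustrative only.

```python
# Parameter-free adaptive neighbor assignment (equation 2.13) for one sample.
import numpy as np

def adaptive_neighbors(x, U, k):
    """Similarity vector z_i between sample x and the anchor rows of U."""
    d = np.sum((U - x) ** 2, axis=1)          # squared Euclidean distances d_ij
    idx = np.argsort(d)                       # order distances from small to large
    d_sorted = d[idx]
    z = np.zeros_like(d)
    # denominator k * d_{i,k+1} - sum_{h<=k} d_{ih}; numerator d_{i,k+1} - d_{ij}
    denom = k * d_sorted[k] - d_sorted[:k].sum()
    z[idx[:k]] = (d_sorted[k] - d_sorted[:k]) / denom
    return z

rng = np.random.default_rng(3)
x, U = rng.random(4), rng.random((20, 4))     # one sample, 20 anchors, k = 5
z = adaptive_neighbors(x, U, k=5)
assert np.isclose(z.sum(), 1.0)               # z_i^T 1 = 1
assert (z >= 0).all() and np.count_nonzero(z) <= 5
```

Scale invariance (property 1) follows directly from the code: multiplying `d` by any positive constant cancels in the ratio.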

## 3  Optimization and Analysis for SFUFS

In this section, we first present a novel iterative algorithm to solve optimization problem 2.3 and then show how an anchor-based approach speeds up the algorithm. Finally, we discuss the computational complexity and the convergence behavior.

### 3.1  Optimization Algorithm

Since problem 2.3 contains the $\ell_{2,1}$-norm regularization, which is difficult to handle directly, we construct another optimization problem with an auxiliary variable to tackle it indirectly. We introduce a diagonal matrix $Q\in\mathbb{R}^{d\times d}$ with the $i$th diagonal element defined as
$Q_{ii}=\frac{1}{2\sqrt{w_iw_i^T+\varepsilon}},$
(3.1)
where $\varepsilon$ is a small enough constant. Solving equation 2.3 corresponds to solving
$\min_{W^TW=I,\,F}\ \mathrm{Tr}(F^TLF)+\alpha\|XW-F\|_F^2+\gamma\,\mathrm{Tr}(W^TQW),$
(3.2)
and we will prove that the objective function of SFUFS also decreases when the approximated problem in equation 3.2 is solved. The solution of equation 3.2 is divided into two steps in each iteration. First, fix $F$ and $W$ and optimize $Q$: when $W$ is fixed, $Q$ is obtained by equation 3.1. Second, fix $Q$ and optimize $F$ and $W$. Taking the derivative of equation 3.2 with respect to $F$ and setting it to zero, we have
$LF-\alpha XW+\alpha F=0\ \Rightarrow\ F=\alpha(L+\alpha I)^{-1}XW.$
(3.3)
By substituting equation 3.3 into equation 3.2 and letting $P$ denote $\alpha(L+\alpha I)^{-1}X$, problem 3.2 is transformed into
$\min_{W^TW=I}\ \mathrm{Tr}(W^TP^TLPW)+\alpha\,\mathrm{Tr}\big(W^T(X^T-P^T)(X-P)W\big)+\gamma\,\mathrm{Tr}(W^TQW)=\min_{W^TW=I}\ \mathrm{Tr}\big[W^T\big(P^TLP+\alpha(X^T-P^T)(X-P)+\gamma Q\big)W\big].$
(3.4)
Considering the constraint $W^TW=I$, $W$ can be obtained from the $l$ eigenvectors of $\big(P^TLP+\alpha(X^T-P^T)(X-P)+\gamma Q\big)$ corresponding to the $l$ smallest eigenvalues.
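The alternating updates above can be sketched compactly. This is a naive version that still inverts the full $n\times n$ matrix (the anchor-based speed-up comes in section 3.2); all sizes, the iteration count, and the random graph are toy choices for illustration.

```python
# Alternating optimization for problem 3.2: update Q (eq. 3.1), then F and W (eqs. 3.3-3.4).
import numpy as np

rng = np.random.default_rng(4)
n, d, l, alpha, gamma, eps = 60, 10, 3, 1.0, 1.0, 1e-8
X = rng.random((n, d))
A = rng.random((n, n)); A = (A + A.T) / 2; np.fill_diagonal(A, 0)
L = np.diag(A.sum(axis=1)) - A                 # graph Laplacian
W = np.linalg.qr(rng.random((d, l)))[0]        # orthonormal initialization

for _ in range(10):
    # step 1: Q from the current W (equation 3.1)
    Q = np.diag(1.0 / (2 * np.sqrt(np.sum(W ** 2, axis=1) + eps)))
    # step 2: eliminate F via P = α(L + αI)^{-1} X, then solve problem 3.4 for W
    P = alpha * np.linalg.solve(L + alpha * np.eye(n), X)
    M = P.T @ L @ P + alpha * (X - P).T @ (X - P) + gamma * Q
    vals, vecs = np.linalg.eigh(M)             # ascending eigenvalues
    W = vecs[:, :l]                            # l eigenvectors of smallest eigenvalues

scores = np.linalg.norm(W, axis=1)             # rank features by ||w_i||_2
assert np.allclose(W.T @ W, np.eye(l), atol=1e-8)
```

Features are then ranked by `scores`, exactly as the $\ell_2$-norms of the rows of $W$ described in section 2.1.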

### 3.2  Anchor-Based Approach to Accelerate the Above Algorithm

The most time-consuming part of the proposed iterative algorithm is the matrix inversion, whose computational complexity is $O(n^3)$. An anchor-based graph is applied to accelerate the algorithm. From equation 2.6, let $B=Z\Delta^{-1/2}$; then the matrix $A$ can be rewritten as $A=BB^T$. For the degree of each data point, we have $D_{ii}=\sum_j\sum_s z_{is}(\Delta_{ss})^{-1}z_{js}=\sum_s z_{is}=1$. Therefore, we obtain the diagonal matrix $D=I$ and $L=I-BB^T$. The formula for $P$ thus becomes
$P=\alpha(L+\alpha I)^{-1}X=\frac{\alpha}{1+\alpha}\left(I-\frac{BB^T}{1+\alpha}\right)^{-1}X.$
(3.5)
According to the Woodbury matrix identity,
$(H-UCV)^{-1}=H^{-1}+H^{-1}U(C^{-1}-VH^{-1}U)^{-1}VH^{-1}.$
(3.6)
Thus, the following special case can be obtained:
$(I-UV^T)^{-1}=I+U(I-V^TU)^{-1}V^T.$
(3.7)
Using equation 3.7 to solve the matrix inversion problem in equation 3.5, we obtain
$P=\frac{\alpha}{1+\alpha}\left[I+B\big((1+\alpha)I-B^TB\big)^{-1}B^T\right]X.$
(3.8)
The computational complexity of calculating the matrix $P$ can thus be reduced to $O(nm^2+nmd)$. At the same time, $W$ can be obtained from the $l$ eigenvectors of $\big(P^TP-P^TBB^TP+\alpha(X^T-P^T)(X-P)+\gamma Q\big)$ corresponding to the smallest eigenvalues. The details of this algorithm are summarized in algorithm 1.
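A numerical check of the Woodbury-based acceleration: computing $P$ via equation 3.8 (which only solves an $m\times m$ system) matches the direct $n\times n$ inverse of equation 3.5. Sizes are toy values; a random row-stochastic $Z$ stands in for the learned anchor graph.

```python
# Woodbury acceleration: P from eq. 3.8 equals P from the direct inverse in eq. 3.5.
import numpy as np

rng = np.random.default_rng(5)
n, m, d, alpha = 300, 15, 8, 1.0
Z = rng.random((n, m)); Z /= Z.sum(axis=1, keepdims=True)
B = Z / np.sqrt(Z.sum(axis=0))                 # B = Z Δ^{-1/2}, so L = I - B B^T
X = rng.random((n, d))

# direct route: α(L + αI)^{-1} X, an O(n^3) solve
L = np.eye(n) - B @ B.T
P_direct = alpha * np.linalg.solve(L + alpha * np.eye(n), X)

# accelerated route (equation 3.8): only an m x m system is solved
inner = np.linalg.solve((1 + alpha) * np.eye(m) - B.T @ B, B.T @ X)
P_fast = alpha / (1 + alpha) * (X + B @ inner)

assert np.allclose(P_direct, P_fast)
```

Since $m\ll n$, the accelerated route replaces the $O(n^3)$ inversion with the $O(nm^2+nmd)$ cost quoted above.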

### 3.3  Complexity Analysis

Our proposed method, SFUFS, consists of three steps:

1. We need $O(1)$ to generate $m$ anchors by random selection.

2. We need $O(ndm)$ to learn the matrix $Z$.

3. We need $O(dl)$ and $O(nd^2+ndm+d^2m+d^2l)$ to solve the problems in equations 3.1 and 3.4, respectively, in each iteration.

Considering that $d\ll m\ll n$ for very large-scale data sets and that the projection dimension $l$ is usually quite small, the overall computational complexity of SFUFS is $O(ndmt)$, where $t$, the number of iterations of our algorithm, is also quite small. Our algorithm can thus significantly reduce the computational complexity of graph-based methods. The memory complexity of SFUFS is $O(nm)+O(d^2)$, which is dominated by $O(nm)$ for very large-scale data sets. Note that the number of anchor points is much smaller than the number of samples, whereas conventional GUFS methods require at least $O(n^2)$ memory. Our method can therefore reduce computational and storage costs simultaneously.

### 3.4  Convergence Analysis

In this section, we provide the convergence analysis of the iterative algorithm shown in algorithm 1. First, we provide a lemma:

Lemma 1.
The following inequality holds for any positive real numbers $u$ and $v$:
$\sqrt{u}-\frac{u}{2\sqrt{v}}\le\sqrt{v}-\frac{v}{2\sqrt{v}}.$
(3.9)

The convergence behavior of SFUFS is summarized in the following theorem:

Theorem 1.

The alternating updating rules in algorithm 1 monotonically decrease the objective value of problem 2.3 until convergence.

Proof.
As seen from algorithm 1, suppose the updated $W$ and $F$ are $\tilde{W}$ and $\tilde{F}$, respectively. We have the following inequality:
$\mathrm{Tr}(\tilde{F}^TL\tilde{F})+\alpha\|X\tilde{W}-\tilde{F}\|_F^2+\gamma\,\mathrm{Tr}(\tilde{W}^TQ\tilde{W})\le \mathrm{Tr}(F^TLF)+\alpha\|XW-F\|_F^2+\gamma\,\mathrm{Tr}(W^TQW).$
(3.10)
We add $\gamma\sum_i\frac{\varepsilon}{2\sqrt{w_i^Tw_i+\varepsilon}}$ to both sides of this inequality. Substituting the definition of $Q$, we can rewrite the inequality as
$\mathrm{Tr}(\tilde{F}^TL\tilde{F})+\alpha\|X\tilde{W}-\tilde{F}\|_F^2+\gamma\sum_i\frac{\tilde{w}_i^T\tilde{w}_i+\varepsilon}{2\sqrt{w_i^Tw_i+\varepsilon}}\le \mathrm{Tr}(F^TLF)+\alpha\|XW-F\|_F^2+\gamma\sum_i\frac{w_i^Tw_i+\varepsilon}{2\sqrt{w_i^Tw_i+\varepsilon}}.$
(3.11)
Recalling the result in lemma 1 with $u=\tilde{w}_i^T\tilde{w}_i+\varepsilon$ and $v=w_i^Tw_i+\varepsilon$, we know that
$\gamma\sum_i\sqrt{\tilde{w}_i^T\tilde{w}_i+\varepsilon}-\gamma\sum_i\frac{\tilde{w}_i^T\tilde{w}_i+\varepsilon}{2\sqrt{w_i^Tw_i+\varepsilon}}\le\gamma\sum_i\sqrt{w_i^Tw_i+\varepsilon}-\gamma\sum_i\frac{w_i^Tw_i+\varepsilon}{2\sqrt{w_i^Tw_i+\varepsilon}}.$
(3.12)
Adding inequalities 3.11 and 3.12, we arrive at
$\mathrm{Tr}(\tilde{F}^TL\tilde{F})+\alpha\|X\tilde{W}-\tilde{F}\|_F^2+\gamma\sum_i\sqrt{\tilde{w}_i^T\tilde{w}_i+\varepsilon}\le \mathrm{Tr}(F^TLF)+\alpha\|XW-F\|_F^2+\gamma\sum_i\sqrt{w_i^Tw_i+\varepsilon}.$
(3.13)
This inequality indicates that the objective function in equation 2.3 will monotonically decrease in each iteration, which completes the proof.

$□$

## 4  Experiment

In this section, we experimentally demonstrate the efficiency and effectiveness of the proposed method on four benchmark data sets and then show several analyses of experimental results.

### 4.1  Experimental Setup

We conduct clustering experiments on four commonly used benchmark data sets: one face data set, MSRA25 (Nie et al., 2014), and three handwritten digit image data sets, USPS (Hull, 1994), MNIST (LeCun, Bottou, Bengio, & Haffner, 1998), and Extended MNIST (Karlen, Weston, Erkan, & Collobert, 2008), which can be categorized into small, medium, and large sizes. In our experiments, we regard MSRA25 and USPS as small data sets, MNIST as a medium-sized data set, and Extended MNIST (E-MNIST) as a large data set. The details of these data sets are presented in Table 1. Three clustering evaluation metrics are used to measure the learning performance of all the methods: standard clustering accuracy (ACC), normalized mutual information (NMI), and purity.

Table 1:
Data Set Description.
| Data Set | Samples | Features | Classes |
| --- | --- | --- | --- |
| MSRA25 | 1799 | 256 | 12 |
| USPS | 9298 | 256 | 10 |
| MNIST | 70,000 | 784 | 10 |
| E-MNIST | 630,000 | 900 | 10 |

In the clustering experiment, we set the number of clusters to the ground-truth number of classes $c$ in each data set and the projection dimension $l$ to $c$ as well. To obtain reasonable accuracy, anchors need to be sufficiently dense to build effective adjacency relationships; therefore, we set the number of anchor points to 500 for the small data sets, 1000 for the medium-sized data set, and 2000 for the large data set. To ensure a fair comparison between different unsupervised feature selection algorithms, we fix $k=5$ for all data sets to specify the size of neighborhoods. The number of selected features is set from half of the total number to the full feature size, all other regularization parameters for all methods are tuned over $\{10^{-3},10^{-2},10^{-1},1,10,10^2,10^3\}$ by a grid-search strategy, and the best clustering results with the optimal parameters are reported for all the algorithms. After the feature selection process, the $k$-means algorithm is applied to cluster the samples in the selected feature subspace. Because $k$-means depends on initialization, we run it 30 times with random starting points and report the average value to alleviate the stochastic effect. Experiments in this letter are implemented in Matlab R2017a and run on a Windows 10 machine with a 3.60 GHz i7-7700 CPU and 32 GB main memory.

### 4.2  Compared Algorithms

To illustrate the efficiency and effectiveness of our method, we have compared SFUFS with one baseline and several unsupervised feature selection approaches as follows:

• Baseline: All features are selected for clustering.

• LS: Laplacian Score (He, Cai, & Niyogi, 2005) ranks features in descending order according to their locality-preserving power.

• UDFS: Unsupervised discriminative feature selection (Yang, Shen, Ma, Huang, & Zhou, 2011) integrates discriminative analysis and $\ell_{2,1}$-norm minimization into one optimization problem.

• RUFS: Robust unsupervised feature selection (Qian & Zhai, 2013) selects features by robust nonnegative matrix factorization and robust regression.

• JELSR: Joint embedding learning and sparse regression (Hou et al., 2014) performs embedding learning with sparse regression for feature selection.

• FUFS: Fast unsupervised feature selection (Hu et al., 2018) with anchor graph and $\ell_{2,1}$-norm regularization, where anchor points are selected randomly.

### 4.3  Results and Analysis

The average running time of all the comparison methods is shown in Table 2, and the clustering performance results are shown in Tables 3 and 4. Several observations can be made from the experimental results. First, feature selection makes subsequent processing more efficient by selecting a subset of the original features and can significantly improve learning performance: owing to the removal of redundant and noisy features, all feature selection methods outperform the baseline. Second, SFUFS obtains better performance than FUFS on all the data sets, which demonstrates the need to introduce a flexible regression term. Third, for small data sets, the anchor-based methods (FUFS and SFUFS) have no obvious time-cost advantage over traditional graph-based unsupervised feature selection methods, but for medium-sized and large data sets, FUFS and the proposed SFUFS achieve significant improvements in running time. SFUFS needs only 97.7 seconds on the medium-sized data set (MNIST), seven times faster than the third-fastest method (LS), and is 105 times faster than LS on the large data set (E-MNIST). The anchor-based methods dominate the other methods in computational complexity, especially for large-scale data sets. Finally, SFUFS achieves competitive performance on almost all data sets, which verifies that the proposed algorithm is able to select more informative features, for three main reasons: (1) a parameter-free graph construction strategy is performed to obtain a more accurate and stable anchor graph; (2) the orthogonal constraint is imposed on the projection matrix, making the selection matrix more accurate and representative; and (3) a regression method is added to improve the robustness of graph manifold embedding.

Table 2:
Running Time (in Seconds) for Different Selection Methods.
| Data Set | LS | UDFS | RUFS | JELSR | FUFS | SFUFS |
| --- | --- | --- | --- | --- | --- | --- |
| MSRA25 | **0.108** | 2.974 | 3.311 | 10.691 | **0.691** | 1.062 |
| USPS | **1.254** | 84.64 | 42.35 | 82.10 | **1.640** | 1.940 |
| MNIST | 779.5 | OM | OM | OM | **31.13** | **97.73** |
| E-MNIST | 34,496 | OM | OM | OM | **197.1** | **325.7** |

Notes: The top two highest-ranked methods are highlighted in bold. OM: Out-of-memory error.

Table 3:
Clustering Results of Each Method on MSRA25 and USPS.
| Method | MSRA25 ACC | MSRA25 NMI | MSRA25 Purity | USPS ACC | USPS NMI | USPS Purity |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | 53.2 | 59.3 | 56.1 | 62.2 | 60.1 | 71.1 |
| LS | 56.6 | 63.5 | 59.5 | 65.3 | 60.4 | 71.8 |
| UDFS | 57.7 | 64.1 | 60.4 | 66.9 | 61.3 | 72.7 |
| RUFS | 59.3 | 64.4 | 60.9 | 67.9 | 61.5 | 73.8 |
| JELSR | 59.7 | 64.7 | 61.0 | 68.2 | 61.6 | 74.4 |
| FUFS | 58.1 | 64.4 | 60.5 | 67.0 | 61.5 | 73.4 |
| SFUFS | 59.8 | 64.5 | 61.2 | 67.8 | 62.3 | 74.2 |
Table 4:
Clustering Results of Each Method on MNIST and E-MNIST.
| Method | MNIST ACC | MNIST NMI | MNIST Purity | E-MNIST ACC | E-MNIST NMI | E-MNIST Purity |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | 43.7 | 36.4 | 46.7 | 46.3 | 39.8 | 52.0 |
| LS | 44.5 | 36.8 | 46.8 | 50.2 | 41.8 | 54.1 |
| FUFS | 45.3 | 37.1 | 47.1 | 50.4 | 41.9 | 54.4 |
| SFUFS | 45.9 | 38.2 | 48.0 | 51.4 | 42.4 | 55.4 |

### 4.4  Studies on Parameter Sensitivity and Convergence

In this section, we evaluate the parameter sensitivity of SFUFS. Due to space limitations, we report only the results in terms of ACC on the USPS data set to illustrate the influence of the regularization parameters and the number of features on learning performance. In our algorithm, there are two regularization parameters ($\alpha$ and $\gamma$) to discuss. From the results shown in Figures 1 and 2, we can see that our algorithm is stable with respect to $\alpha$ and $\gamma$ over wide ranges and comparatively sensitive to the number of selected features.

Figure 1:

Clustering accuracy (ACC) of SFUFS with different $α$ and feature numbers while keeping $γ=1$.


Figure 2:

Clustering accuracy (ACC) of SFUFS with different $γ$ and feature numbers while keeping $α=1$.


To solve the objective function, we proposed an efficient iterative algorithm whose convergence was fully proved in the convergence analysis of algorithm 1. Here we experimentally verify its speed of convergence. Figure 3 shows the relationship between the objective value and the number of iterations on different data sets. The convergence curves decrease quickly, and our algorithm reaches the optimum within about five iterations.

Figure 3:

Convergence curves of the proposed SFUFS.


As noted, there are two ways to generate anchor points: SFUFS-R denotes SFUFS with randomly selected anchors, and SFUFS-K denotes the variant that generates anchor points with the $k$-means method. We also study the sensitivity of both variants to the number of anchor points. As shown in Figure 4, increasing the number of anchors does not always improve performance, while the running time increases linearly. Thus, we can choose a proper number of anchor points to balance computational complexity and learning performance. In theory, for large data sets, the optimal number of anchor points would be correspondingly large; how to determine the optimal number for different data sets remains an open problem. Even though its anchor points are chosen randomly, SFUFS-R performs almost as well as SFUFS-K, and its running time is far shorter. Compared with SFUFS-R, the extra running time of SFUFS-K is spent generating anchor points by $k$-means, which may require many iterations to converge. Taking both accuracy and efficiency into consideration, SFUFS-R is the best choice among all compared approaches, especially for large-scale data sets.

Figure 4:

ACC and running time with different numbers of anchor points on the USPS data set.

## 5  Conclusion

In this letter, we have proposed a fast graph-based unsupervised feature selection method that jointly performs manifold embedding learning and flexible regression analysis on an anchor graph. The manifold embedding structure is characterized by a parameter-free adaptive neighbor assignment strategy. To solve the optimization problem of SFUFS, an efficient iterative algorithm computes the selection matrix. Extensive experiments have demonstrated the efficiency and effectiveness of the proposed method.

## Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under grant numbers 61772427 and 61751202.

## References

Cai, D., Zhang, C., & He, X. (2010). Unsupervised feature selection for multi-cluster data. In Proceedings of the Conference on Knowledge Discovery and Data Mining (pp. 333–342). New York: ACM.

Chang, Y., Nie, F., & Wang, M. (2017). Multiview feature analysis via structured sparsity and shared subspace discovery. Neural Computation, 29(7), 1986–2003.

Chang, X., Nie, F., Yang, Y., Zhang, C., & Huang, H. (2016). Convex sparse PCA for unsupervised feature learning. ACM Transactions on Knowledge Discovery from Data, 11(1), 3:1–3:16.

Cheng, D., Nie, F., Sun, J., & Gong, Y. (2017). A weight-adaptive Laplacian embedding for graph-based clustering. Neural Computation, 29(7), 1902–1918.

Cheng, Q., Zhou, H., & Cheng, J. (2011). The Fisher-Markov selector: Fast selecting maximally separable feature subset for multiclass classification with applications to high-dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(6), 1217–1233.

Deng, C., Ji, R., Liu, W., Tao, D., & Gao, X. (2013). Visual reranking through weakly supervised multi-graph learning. In Proceedings of the 2013 IEEE International Conference on Computer Vision (pp. 2600–2607). Piscataway, NJ: IEEE.

Deng, C., Ji, R., Tao, D., Gao, X., & Li, X. (2014). Weakly supervised multi-graph learning for robust image reranking. IEEE Transactions on Multimedia, 16(3), 785–795.

Dy, J. G., & Brodley, C. E. (2004). Feature selection for unsupervised learning. Journal of Machine Learning Research, 5, 845–889.

Freeman, C., Kulic, D., & Basir, O. (2013). Feature-selected tree-based classification. IEEE Transactions on Cybernetics, 43(6), 1990–2004.

He, X., Cai, D., & Niyogi, P. (2005). Laplacian score for feature selection. In Proceedings of the Conference on Neural Information Processing Systems (pp. 507–514). Cambridge, MA: MIT Press.

Hou, C., Nie, F., Li, X., Yi, D., & Wu, Y. (2014). Joint embedding learning and sparse regression: A framework for unsupervised feature selection. IEEE Transactions on Cybernetics, 44(6), 793–804.

Hu, H., Wang, R., Nie, F., Yang, X., & Yu, W. (2018). Fast unsupervised feature selection with anchor graph and $ℓ_{2,1}$-norm regularization. Multimedia Tools and Applications, 77(17), 22099–22113.

Hull, J. J. (1994). A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5), 550–554.

Karlen, M., Weston, J., Erkan, A., & Collobert, R. (2008). Large scale manifold transduction. In Proceedings of the 25th International Conference on Machine Learning (pp. 448–455). New York: ACM.

Lecun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.

Li, Z., Yang, Y., Liu, J., Zhou, X., & Lu, H. (2012). Unsupervised feature selection using nonnegative spectral analysis. In Proceedings of the AAAI Conference on Artificial Intelligence (pp. 1026–1032). Cambridge, MA: AAAI Press.

Ling, X., Qiang, M. A., & Min, Z. (2013). Tensor semantic model for an audio classification system. Science China, 56(6), 1–9.

Liu, W., & Chang, S.-F. (2009). Robust multi-class transductive learning with graphs. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (pp. 381–388). Piscataway, NJ: IEEE.

Liu, W., He, J., & Chang, S.-F. (2010). Large graph construction for scalable semisupervised learning. In Proceedings of the International Conference on Machine Learning (pp. 679–686). Washington, DC: IEEE Computer Society.

Liu, W., Jiang, Y.-G., Luo, J., & Chang, S.-F. (2011). Noise resistant graph ranking for improved web image search. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (pp. 849–856). Piscataway, NJ: IEEE.

Liu, W., Wang, J., & Chang, S.-F. (2012). Robust and scalable graph-based semisupervised learning. Proceedings of the IEEE, 100(9), 2624–2638.

Luo, M., Nie, F., Chang, X., Yi, Y., Hauptmann, A. G., & Zheng, Q. (2018). Adaptive unsupervised feature selection with structure regularization. IEEE Transactions on Neural Networks and Learning Systems, 29, 944–956.

Nie, F., Huang, H., Cai, X., & Ding, C. (2010). Efficient and robust feature selection via joint $ℓ_{2,1}$-norms minimization. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, & A. Culotta (Eds.), Advances in neural information processing systems, 23 (pp. 1813–1821). Red Hook, NY: Curran.

Nie, F., Wang, X., & Huang, H. (2014). Clustering and projected clustering with adaptive neighbors. In Proceedings of the Conference on Knowledge Discovery and Data Mining (pp. 977–986). New York: ACM.

Nie, F., Wang, X., Jordan, M., & Huang, H. (2016). The constrained Laplacian rank algorithm for graph-based clustering. In Proceedings of the AAAI Conference on Artificial Intelligence (pp. 1969–1976). Cambridge, MA: AAAI Press.

Nie, F., Xiang, S., Jia, Y., Zhang, C., & Yan, S. (2008). Trace ratio criterion for feature selection. In Proceedings of the AAAI Conference on Artificial Intelligence (pp. 671–676). Cambridge, MA: AAAI Press.

Nie, F., Xu, D., Tsang, I. W.-H., & Zhang, C. (2010). Flexible manifold embedding: A framework for semi-supervised and unsupervised dimension reduction. IEEE Transactions on Image Processing, 19(7), 1921–1932.

Nie, F., Zhu, W., & Li, X. (2016). Unsupervised feature selection with structured graph optimization. In Proceedings of the AAAI Conference on Artificial Intelligence (pp. 1302–1308). Cambridge, MA: AAAI Press.

Paul, S., & Drineas, P. (2016). Feature selection for ridge regression with provable guarantees. Neural Computation, 28(4), 716–742.

Qian, M., & Zhai, C. (2013). Robust unsupervised feature selection. In Proceedings of the International Joint Conference on Artificial Intelligence (pp. 1621–1627). Cambridge, MA: AAAI Press.

Romero, E., & Sopena, J. M. (2008). Performing feature selection with multilayer perceptrons. IEEE Transactions on Neural Networks, 19(3), 431–441.

Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323–2326.

Shi, L., Du, L., & Shen, Y. D. (2014). Robust spectral learning for unsupervised feature selection. In Proceedings of the 2014 IEEE International Conference on Data Mining (pp. 977–982). Piscataway, NJ: IEEE.

Wang, D., Nie, F., & Huang, H. (2014). Unsupervised feature selection via unified trace ratio formulation and k-means clustering (track). In T. Calders, F. Esposito, E. Hüllermeier, & R. Meo (Eds.), Machine Learning and Knowledge Discovery in Databases (pp. 306–321). Berlin: Springer.

Wang, R., Nie, F., Yang, X., Gao, F., & Yao, M. (2015). Robust 2DPCA with nongreedy $ℓ_1$-norm maximization for image analysis. IEEE Transactions on Cybernetics, 45(5), 1108–1112.

Wu, B., Jia, F., Liu, W., Ghanem, B., & Lyu, S. (2018). Multi-label learning with missing labels using mixed dependency graphs. International Journal of Computer Vision, 126(8), 875–896.

Xiang, S., Nie, F., Zhang, C., & Zhang, C. (2009). Nonlinear dimensionality reduction with local spline embedding. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1285–1298.

Xing, L., Dong, H., Jiang, W., & Tang, K. (2018). Nonnegative matrix factorization by joint locality-constrained and $ℓ_{2,1}$-norm regularization. Multimedia Tools and Applications, 77(3), 3029–3048.

Yang, P., Zhao, P., Hai, Z., Liu, W., Hoi, S. C. H., & Li, X.-L. (2016). Efficient multi-class selective sampling on graphs. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence (pp. 805–814). Cambridge, MA: AAAI Press.

Yang, Y., Shen, H. T., Ma, Z., Huang, Z., & Zhou, X. (2011). $ℓ_{2,1}$-norm regularized discriminative feature selection for unsupervised learning. In Proceedings of the International Joint Conference on Artificial Intelligence (pp. 1589–1594). Cambridge, MA: AAAI Press.

Zhao, Z., & Liu, H. (2007). Spectral feature selection for supervised and unsupervised learning. In Proceedings of the International Conference on Machine Learning (pp. 1151–1157). New York: ACM.