## Abstract

Graph-based unsupervised feature selection (GUFS) algorithms have recently been shown to handle prevalent high-dimensional unlabeled data efficiently. One common drawback of existing graph-based approaches is that they tend to be time-consuming and require large storage, especially as data sets grow. Recent research has used anchors to accelerate graph-based learning models for feature selection, but the hard linear constraint between the data matrix and the lower-dimensional representation is often overstrict in many applications. In this letter, we propose a flexible linearization model with an anchor graph and $\ell_{2,1}$-norm regularization, which can deal with large-scale data sets and improves on the existing anchor-based method. In addition, an anchor-based graph Laplacian is constructed to characterize the manifold embedding structure by means of a parameter-free adaptive neighbor assignment strategy. An efficient iterative algorithm is developed to solve the optimization problem, and we prove its convergence. Experiments on several public data sets demonstrate the effectiveness and efficiency of the proposed method.

## 1 Introduction

In many applications, such as computer vision, data mining, and pattern recognition, we are often confronted with data represented by high-dimensional features, which require significant time and space to process (Chang, Nie, Yang, Zhang, & Huang, 2016; Freeman, Kulic, & Basir, 2013). To address this issue, feature selection techniques reduce the dimensionality of high-dimensional data by selecting a smaller set of representative features (Romero & Sopena, 2008; Nie, Xiang, Jia, Zhang, & Yan, 2008; Ling, Qiang, & Min, 2013; Wang, Nie, Yang, Gao, & Yao, 2015; Luo et al., 2018; Paul & Drineas, 2016). In general, feature selection algorithms can be classified into three categories: supervised, semisupervised, and unsupervised feature selection. Because the rapid development of technology produces large amounts of unlabeled data, and annotating these data is a dramatically expensive and time-consuming process (Cheng, Zhou, & Cheng, 2011; Dy & Brodley, 2004; Wu, Jia, Liu, Ghanem, & Lyu, 2018), we focus on the problem of selecting features in unsupervised learning scenarios.

Since many real-world data sets, such as face image, handwritten digit, and text data sets, are distributed on low-dimensional manifolds (Roweis & Saul, 2000; Cheng, Nie, Sun, & Gong, 2017), methods that consider the manifold structure usually perform better (Liu, Jiang, Luo, & Chang, 2011; Yang et al., 2016). In this letter, we focus on the graph-based family of unsupervised feature selection methods, in which the manifold geometry of the entire feature set is represented in graph form. Existing GUFS methods (Zhao & Liu, 2007; Cai, Zhang, & He, 2010; Li, Yang, Liu, Zhou, & Lu, 2012; Hou, Nie, Li, Yi, & Wu, 2014) determine feature relevance by evaluating a feature's correlation with a pseudo-label derived from spectral analysis and use regression coefficients to rank the features. However, conventional graph-based approaches usually rely on a kernel-based neighbor assignment strategy (e.g., a gaussian similarity function) to construct the similarity graph, which typically requires extra parameters and may adversely affect learning performance (Nie, Wang, & Huang, 2014). Moreover, relaxing these pseudo-labels into a continuous embedding by imposing orthogonal constraints on the cluster indicator matrix inevitably introduces noise into the estimated cluster labels (Shi, Du, & Shen, 2014; Liu & Chang, 2009).

In addition, as data sets grow, traditional graph-based approaches tend to be inefficient at handling large-scale data due to high computational complexity. This complexity stems mainly from two aspects: the construction of the similarity graph and the decomposition of the Laplacian matrix. These two processes are time-consuming for large-scale data and have at least $O(n^2 d)$ time complexity, where $n$ and $d$ represent the number of samples and features, respectively. To solve this problem, recent work (Liu, He, & Chang, 2010; Liu, Wang, & Chang, 2012; Deng, Ji, Liu, Tao, & Gao, 2013; Deng, Ji, Tao, Gao, & Li, 2014) accelerates the graph-based learning model using anchors. Since the number of anchors can be much smaller than the number of original data points, both graph construction and the learning process can be much faster than in traditional graph-based approaches. The work in Hu, Wang, Nie, Yang, and Yu (2018) first applied anchors to accelerate a graph-based learning model for feature selection, but the hard linear constraint between the data matrix and the lower-dimensional representation is usually overstrict in many real applications (Nie, Xu, Tsang, & Zhang, 2010), which may degrade learning performance.

In light of the noted limitations of GUFS, in this letter we propose a novel algorithm: scalable and flexible unsupervised feature selection (SFUFS). SFUFS simultaneously performs manifold embedding learning and flexible regression analysis based on an anchor graph, so it can deal with large-scale data sets and improves on the existing anchor-based method. At the same time, an anchor-based graph construction strategy is applied to characterize the manifold embedding structure by means of a parameter-free adaptive neighbor assignment strategy. As in conventional unsupervised feature selection methods, the goal of SFUFS is to learn a discriminative projection matrix; it is therefore more suitable to impose orthogonal constraints on the projection matrix instead of on a cluster indicator matrix so as to select the most discriminative features robustly (Wang, Nie, & Huang, 2014). For feature selection, it is preferable to learn a row-sparse projection matrix, which is formulated as an $\ell_{2,1}$-norm minimization term (Nie, Huang, Cai, & Ding, 2010; Nie, Zhu, & Li, 2016; Xing, Dong, Jiang, & Tang, 2018). Moreover, an efficient iterative algorithm is proposed to solve the optimization problem of SFUFS. Extensive experiments are conducted on four benchmark data sets, demonstrating the efficiency and effectiveness of the proposed SFUFS algorithm.

The rest of the letter is organized as follows. In section 2, we formulate the proposed SFUFS framework and introduce a parameter-free construction approach of an anchor-based graph. Section 3 presents an efficient iterative algorithm to tackle the problem and some analysis about the proposed method, including computation complexity analysis and convergence behavior. Section 4 provides some comparison results on various data sets, followed by the conclusion in section 5.

## 2 The SFUFS Algorithm

In this section, we exploit the formulation of the proposed SFUFS. This method performs manifold embedding learning and flexible regression jointly on the Laplacian graph, where the graph is constructed using an anchor-based strategy via a parameter-free algorithm.

### 2.1 The Proposed Framework

### 2.2 Graph Construction

Unlike traditional graph-based methods, which require painstaking work to compute all pairwise similarities between $n$ data points, the anchor-based method establishes the adjacency relationships among the original data points through anchors, which greatly reduces both the complexity of graph construction and the memory requirements. We next provide a detailed illustration.

#### 2.2.1 Traditional Graph Construction

The widely used $k$-NN graph based on the gaussian kernel function, however, has several limitations. First, the extra gaussian kernel parameter $\sigma$ is very sensitive and varies across data sets (Xiang, Nie, Zhang, & Zhang, 2009; Nie et al., 2014), so it is difficult to tune in practice over a large range. Moreover, this traditional graph construction approach is infeasible for large-scale data sets because building the similarity graph costs $O(n^2 d)$ time and requires storing an $n \times n$ matrix $A$ in memory. To overcome these drawbacks, we adopt a parameter-free method to construct the similarity matrix with an anchor-based strategy.
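For reference, the conventional construction can be sketched as follows. This is a generic illustration of why the dense gaussian $k$-NN graph costs $O(n^2 d)$ time and $O(n^2)$ memory; the function and parameter names are ours, not from the letter:

```python
import numpy as np

def gaussian_knn_graph(X, k=5, sigma=1.0):
    """Dense k-NN similarity graph with a gaussian kernel.

    Computing all pairwise distances costs O(n^2 d) time, and the
    resulting matrix A needs O(n^2) memory, which is what makes this
    construction infeasible at large scale.
    """
    n = X.shape[0]
    # All pairwise squared euclidean distances: O(n^2 d).
    sq = np.sum(X**2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.maximum(dist2, 0.0, out=dist2)  # clamp tiny negatives from roundoff

    A = np.zeros((n, n))
    for i in range(n):
        # Keep only the k nearest neighbors of point i (index 0 is i itself).
        idx = np.argsort(dist2[i])[1:k + 1]
        A[i, idx] = np.exp(-dist2[i, idx] / (2.0 * sigma**2))
    # Symmetrize the neighbor relation.
    return np.maximum(A, A.T)
```

Note that both the sensitivity to `sigma` and the $n \times n$ allocation are visible here, which is exactly what the anchor-based strategy below avoids.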

#### 2.2.2 Anchor Graph Construction

Instead of calculating the similarity matrix $A$ directly, the anchor-based strategy constructs a matrix $Z$ that measures the relationship between the original data points and the anchor points and then uses $Z$ to establish links among the original data points. In general, there are two steps to construct the matrix $Z$: (1) generating $m$ ($m \ll n$) anchors to cover all data points and (2) measuring the similarity between the data points and the obtained anchors by the matrix $Z \in \mathbb{R}^{n \times m}$.

The generation of anchor points can be realized by random selection or by the $k$-means method (Liu et al., 2010, 2012; Deng, Ji, Liu, Tao, & Gao, 2013; Deng, Ji, Tao, Gao, & Li, 2014). Random selection draws $m$ anchor points by random sampling from the original data points at negligible computational cost. The $k$-means method uses $m$ cluster centers as more representative anchors for better performance, but it takes $O(ndmt)$ time, where $t$ is the number of iterations, which makes it impractical for large-scale data sets. Although random selection does not guarantee that the selected $m$ anchor points are representative, it is very fast for large-scale data sets. Therefore, we generate anchor points by random selection.
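The two anchor-generation options can be sketched as follows. This is an illustrative implementation under our own naming, with a plain Lloyd iteration standing in for $k$-means; it is not the letter's code:

```python
import numpy as np

def generate_anchors(X, m, method="random", rng=None, iters=10):
    """Pick m anchor points from the n x d data matrix X.

    'random' samples m rows of X at negligible cost; 'kmeans' runs a
    naive Lloyd iteration costing O(n d m t), which is far more
    expensive but yields more representative anchors.
    """
    rng = np.random.default_rng(rng)
    if method == "random":
        idx = rng.choice(X.shape[0], size=m, replace=False)
        return X[idx]
    # Naive k-means (Lloyd's algorithm) for illustration only.
    centers = X[rng.choice(X.shape[0], size=m, replace=False)]
    for _ in range(iters):
        # Assign every point to its nearest center: O(n m d) per pass.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        for j in range(m):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return centers
```

The cost asymmetry between the two branches is the reason the letter prefers random selection for very large $n$.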

#### 2.2.3 The Solution of Problem 2.5

The learned $z_{ij}$ is scale invariant: if the data points and anchor points are scaled by an arbitrary scalar $\alpha$, then $d_{ij}$ is changed to $\alpha \cdot d_{ij}$ for each $j$, but the $z_{ij}$ computed by equation 2.13 does not change.

From equation 2.12, we see that the parameter $k$ is much easier to tune than $\mu$, since $k$ is a positive integer with an explicit meaning. In most cases, $k<10$ produces reasonable results. This property is important because hyperparameter tuning remains a difficult and open problem in many learning tasks.

The learned $Z$ is naturally $k$-sparse, so the computational burden of the subsequent optimization process can be greatly reduced.
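As an illustration, the parameter-free assignment can be sketched with the closed form popularized by Nie et al. (2014) for adaptive neighbors; whether this matches the letter's equation 2.13 exactly is our assumption, and all names here are ours:

```python
import numpy as np

def anchor_graph_Z(X, anchors, k=5):
    """k-sparse similarity matrix Z (n x m) between data and anchors.

    Uses the parameter-free adaptive-neighbor closed form
        z_ij = (d_{i,k+1} - d_ij) / (k * d_{i,k+1} - sum_{j'<=k} d_ij'),
    applied to the k nearest anchors of each point; it is invariant to
    scaling every d_ij by the same positive factor.
    """
    n, m = X.shape[0], anchors.shape[0]
    # Squared distances from every point to every anchor: O(n m d).
    d = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    Z = np.zeros((n, m))
    for i in range(n):
        order = np.argsort(d[i])
        dk1 = d[i, order[k]]                 # (k+1)-th smallest distance
        num = dk1 - d[i, order[:k]]          # nonnegative by construction
        denom = k * dk1 - d[i, order[:k]].sum()
        Z[i, order[:k]] = num / max(denom, 1e-12)  # guard against ties
    return Z
```

Each row of `Z` sums to 1, has at most $k$ nonzero entries, and is unchanged when all distances are scaled by the same factor, matching the three properties stated above.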

## 3 Optimization and Analysis for SFUFS

In this section, we first present a novel iterative algorithm to solve optimization problem 2.3 and then show how an anchor-based approach speeds up the algorithm. Finally, we discuss its computational complexity and convergence behavior.

### 3.1 Optimization Algorithm

### 3.2 Anchor-Based Approach to Accelerate the Above Algorithm

### 3.3 Complexity Analysis

Our proposed method, SFUFS, consists of three steps:

Considering that $d \ll m \ll n$ for very large-scale data sets and that the projection dimension $l$ is usually quite small, the overall computational complexity of SFUFS is $O(ndmt)$, where $t$ is the number of iterations of our algorithm and is also quite small. Our algorithm can therefore significantly reduce the computational complexity of graph-based methods. The memory complexity of SFUFS is $O(nm) + O(d^2)$, which is dominated by $O(nm)$ for very large-scale data sets. Note that the number of anchor points is much smaller than the number of samples, whereas conventional GUFS methods require at least $O(n^2)$ memory. Our method thus reduces computational and storage costs simultaneously.
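To make the memory argument concrete, here is a sketch of evaluating a Laplacian quadratic form through the anchor graph without materializing any $n \times n$ matrix. It assumes the standard anchor-graph similarity $A = Z \Delta^{-1} Z^{\top}$ with $\Delta = \mathrm{diag}(Z^{\top}\mathbf{1})$ from Liu et al. (2010), which may differ in detail from the letter's construction:

```python
import numpy as np

def anchor_laplacian_quadratic(Z, F):
    """Evaluate tr(F^T L F) for L = I - Z D^{-1} Z^T without forming
    the n x n similarity matrix.

    Z is the n x m anchor graph (rows sum to 1), D = diag(Z^T 1), and
    F is an n x l embedding. Memory stays O(nm + ml) instead of O(n^2).
    """
    Dinv = 1.0 / np.maximum(Z.sum(axis=0), 1e-12)  # m-vector
    ZF = Z.T @ F                                   # m x l projection
    # tr(F^T F) - tr((Z^T F)^T D^{-1} (Z^T F))
    return np.trace(F.T @ F) - np.trace(ZF.T @ (Dinv[:, None] * ZF))
```

All heavy products involve only $n \times m$ or $m \times l$ factors, which is where the $O(nm)$ term in the memory complexity above comes from.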

### 3.4 Convergence Analysis

In this section, we provide the convergence analysis of the iterative algorithm shown in algorithm 1. First, we provide a lemma:

The convergence behavior of SFUFS is summarized in the following theorem:

The alternative updating rules in algorithm 1 monotonically decrease the objective value of problem 2.3 until convergence.


## 4 Experiment

In this section, we experimentally demonstrate the efficiency and effectiveness of the proposed method on four benchmark data sets and then show several analyses of experimental results.

### 4.1 Experimental Setup

We conduct clustering experiments on four commonly used benchmark data sets: one face data set, MSRA25 (Nie et al., 2014), and three handwritten digit image data sets, USPS (Hull, 1994), MNIST (Lecun, Bottou, Bengio, & Haffner, 1998), and Extended MNIST (Karlen, Weston, Erkan, & Collobert, 2008), which can be categorized into small, medium, and large sizes. In our experiments, we regard MSRA25 and USPS as small data sets, MNIST as a medium-sized data set, and Extended MNIST (E-MNIST) as a large data set. The details of these data sets are presented in Table 1. Three clustering evaluation metrics are used to measure the learning performance of all methods: standard clustering accuracy (ACC), normalized mutual information (NMI), and Purity.

| Data Set | Samples | Features | Classes |
| --- | --- | --- | --- |
| MSRA25 | 1799 | 256 | 12 |
| USPS | 9298 | 256 | 10 |
| MNIST | 70,000 | 784 | 10 |
| E-MNIST | 630,000 | 900 | 10 |


In the clustering experiments, we set the number of clusters to the ground-truth number of classes $c$ in each data set and the projection dimension $l$ to $c$ as well. To obtain reasonable accuracy, anchors need to be sufficiently dense to build effective adjacency relationships, so we set the number of anchor points to 500 for small data sets, 1000 for the medium-sized data set, and 2000 for the large data set. To ensure a fair comparison between different unsupervised feature selection algorithms, we fix $k=5$ for all data sets to specify the neighborhood size. The number of selected features ranges from half of the total number to the full feature size, all other regularization parameters for all methods are tuned over $\{10^{-3}, 10^{-2}, 10^{-1}, 1, 10, 10^{2}, 10^{3}\}$ by grid search, and the best clustering results obtained with the optimal parameters are reported for all algorithms. After feature selection, the $k$-means algorithm is applied to cluster the samples in the selected feature subspace. Because $k$-means depends on initialization, we run it 30 times with random starting points and report the average to alleviate the stochastic effect. All experiments are implemented in Matlab R2017a and run on a Windows 10 machine with a 3.60 GHz i7-7700 CPU and 32 GB of main memory.
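The protocol above relies on best-match clustering accuracy. A common way to compute it (a generic sketch, not the authors' code) is to align predicted cluster labels to the ground truth with the Hungarian algorithm:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Best-match accuracy: find the label permutation that maximizes
    agreement between predicted clusters and ground-truth classes, then
    score as plain accuracy. Labels are assumed to be 0-indexed ints.
    """
    n_labels = max(y_true.max(), y_pred.max()) + 1
    # Contingency counts: rows are predicted clusters, columns true classes.
    count = np.zeros((n_labels, n_labels), dtype=int)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1
    # Hungarian algorithm on the negated counts maximizes total matches.
    row, col = linear_sum_assignment(-count)
    return count[row, col].sum() / len(y_true)
```

The 30-run averaging then reduces to running $k$-means with 30 different random initializations and averaging this score (and NMI and Purity analogously).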

### 4.2 Compared Algorithms

To illustrate the efficiency and effectiveness of our method, we compare SFUFS with a baseline and several unsupervised feature selection approaches:

- Baseline: All features are used for clustering.
- LS: Laplacian score (He, Cai, & Niyogi, 2005) ranks features in descending order of their locality-preserving power.
- UDFS: Unsupervised discriminative feature selection (Yang, Shen, Ma, Huang, & Zhou, 2011) integrates discriminative analysis and $\ell_{2,1}$-norm minimization into one optimization problem.
- RUFS: Robust unsupervised feature selection (Qian & Zhai, 2013) selects features via robust nonnegative matrix factorization and robust regression.
- JELSR: Joint embedding learning and sparse regression (Hou et al., 2014) performs embedding learning jointly with sparse regression for feature selection.
- FUFS: Fast unsupervised feature selection (Hu et al., 2018) uses an anchor graph and $\ell_{2,1}$-norm regularization, with anchor points selected randomly.

### 4.3 Results and Analysis

The average running time of all the comparison methods is shown in Table 2, and the clustering performance results are shown in Tables 3 and 4. Several observations can be drawn from the experimental results. First, feature selection makes subsequent processing more efficient by selecting a subset of the original features and can significantly improve learning performance: owing to the removal of redundant and noisy features, all feature selection methods outperform the baseline. Second, SFUFS obtains better performance than FUFS on all data sets, which demonstrates the benefit of introducing a flexible regression term. Third, for small data sets, the anchor-based methods (FUFS and SFUFS) have no obvious time-cost advantage over traditional graph-based unsupervised feature selection methods, but for medium-sized and large data sets, FUFS and the proposed SFUFS achieve significant improvements in running time. SFUFS needs only 97.7 seconds, seven times faster than the third-fastest method (LS) on the medium-sized data set (MNIST), and is 105 times faster than LS on the large data set (E-MNIST). The anchor-based methods thus dominate the other methods in computational cost, especially for large-scale data sets. Fourth, SFUFS achieves competitive performance on almost all data sets, which verifies that the proposed SFUFS algorithm selects more informative features, for three main reasons: (1) a parameter-free graph construction strategy yields a more accurate and stable anchor graph; (2) the orthogonal constraint on the projection matrix makes the selection matrix more accurate and representative; and (3) a flexible regression term improves the robustness of graph manifold embedding.

| Data Set | LS | UDFS | RUFS | JELSR | FUFS | SFUFS |
| --- | --- | --- | --- | --- | --- | --- |
| MSRA25 | **0.108** | 2.974 | 3.311 | 10.691 | **0.691** | 1.062 |
| USPS | **1.254** | 84.64 | 42.35 | 82.10 | **1.640** | 1.940 |
| MNIST | 779.5 | OM | OM | OM | **31.13** | **97.73** |
| E-MNIST | 34,496 | OM | OM | OM | **197.1** | **325.7** |


Notes: The top two highest-ranked methods are highlighted in bold. OM: Out-of-memory error.

| Method | MSRA25 ACC | MSRA25 NMI | MSRA25 Purity | USPS ACC | USPS NMI | USPS Purity |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | 53.2 | 59.3 | 56.1 | 62.2 | 60.1 | 71.1 |
| LS | 56.6 | 63.5 | 59.5 | 65.3 | 60.4 | 71.8 |
| UDFS | 57.7 | 64.1 | 60.4 | 66.9 | 61.3 | 72.7 |
| RUFS | 59.3 | 64.4 | 60.9 | 67.9 | 61.5 | 73.8 |
| JELSR | 59.7 | 64.7 | 61.0 | 68.2 | 61.6 | 74.4 |
| FUFS | 58.1 | 64.4 | 60.5 | 67.0 | 61.5 | 73.4 |
| SFUFS | 59.8 | 64.5 | 61.2 | 67.8 | 62.3 | 74.2 |


| Method | MNIST ACC | MNIST NMI | MNIST Purity | E-MNIST ACC | E-MNIST NMI | E-MNIST Purity |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | 43.7 | 36.4 | 46.7 | 46.3 | 39.8 | 52.0 |
| LS | 44.5 | 36.8 | 46.8 | 50.2 | 41.8 | 54.1 |
| FUFS | 45.3 | 37.1 | 47.1 | 50.4 | 41.9 | 54.4 |
| SFUFS | 45.9 | 38.2 | 48.0 | 51.4 | 42.4 | 55.4 |


### 4.4 Studies on Parameter Sensitivity and Convergence

In this section, we evaluate the parameter sensitivity of SFUFS. Due to space limitations, we report only the ACC results on the USPS data set to illustrate the influence of the regularization parameters and the number of selected features on learning performance. Our algorithm has two regularization parameters, $\alpha$ and $\gamma$, to discuss. From the results shown in Figures 1 and 2, we can see that our algorithm is stable with respect to $\alpha$ and $\gamma$ over wide ranges and comparatively sensitive to the number of selected features.

To solve the objective function, we developed an efficient iterative algorithm whose convergence was established in the convergence analysis of algorithm 1. Here we experimentally verify its speed of convergence. Figure 3 shows the objective value versus the number of iterations on different data sets. The convergence curve decreases quickly, and our algorithm reaches the optimum within about five iterations.

As noted earlier, there are two ways to generate anchor points: we denote SFUFS with randomly selected anchors as SFUFS-R, while SFUFS-K denotes the other commonly used approach, the $k$-means method, applied in our framework. We also study sensitivity to the number of anchor points for these two schemes. As shown in Figure 4, increasing the number of anchors does not always improve performance, while the running time grows roughly linearly. Thus, we can choose a proper number of anchor points to balance computational complexity and learning performance. In theory, for large data sets, the optimal number of anchor points should be correspondingly large; however, determining the optimal number of anchor points for a given data set remains an open problem. Even when the anchor points are chosen randomly, SFUFS-R performs almost as well as SFUFS-K, while its running time is far shorter. Compared with SFUFS-R, the extra running time of SFUFS-K is spent generating anchor points by $k$-means, which in some circumstances requires many iterations to converge. Taking both accuracy and efficiency into consideration, SFUFS-R is the best choice for unsupervised feature selection among all compared approaches, especially for large-scale data sets.

## 5 Conclusion

In this letter, we have proposed a fast graph-based unsupervised feature selection method that performs manifold embedding learning and flexible regression analysis jointly on an anchor graph. The manifold embedding structure is characterized by a parameter-free adaptive neighbor assignment strategy. To solve the optimization problem of SFUFS, an efficient iterative algorithm is developed to obtain the selection matrix. Extensive experiments have demonstrated the efficiency and effectiveness of the proposed method.

## Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under grant numbers 61772427 and 61751202.