## Abstract

Graph-based clustering methods perform clustering on a fixed input data graph, so the clustering results are sensitive to the particular graph construction. If this initial graph is of low quality, the resulting clustering may also be of low quality. We address this drawback by allowing the data graph itself to be adaptively adjusted during the clustering procedure. In particular, our proposed weight-adaptive Laplacian (WAL) method learns a new data similarity matrix that adaptively adjusts the initial graph according to the similarity weights in the input data graph. We develop three versions of the method, based on the L2-norm, a fuzzy entropy regularizer, and an exponential-based weight strategy, which yield three new graph-based clustering objectives, and we derive optimization algorithms to solve these objectives. Experimental results on synthetic data sets and real-world benchmark data sets demonstrate the effectiveness of these new graph-based clustering methods.

## 1 Introduction

Clustering is an important task in computer vision and machine learning, with many applications such as image segmentation (Shi & Malik, 2000), image categorization (Grauman & Darrell, 2006; Chang, Yang, Long, Zhang, & Hauptmann, 2016), scene analysis (Koppal & Narasimhan, 2006; Chang, Nie, Yang, & Huang, 2014), document clustering (Steinbach, Karypis, & Kumar, 2000; Xu, Liu, & Gong, 2003), motion modeling (Ochs & Brox, 2012), and medical image analysis (Brun, Knutsson, Park, Shenton, & Westin, 2004). Many clustering methods have been proposed (Cai, Nie, & Huang, 2013; Chang, Nie, Wang et al., 2015; Hagen & Kahng, 1992; Huang, Nie, & Huang, 2013; Li & Ding, 2006; Ng, Jordan, & Weiss, 2002; Nie, Zeng, Tsang, Wu, & Zhang, 2011; Huang, Nie, & Huang, 2015). Among these methods, graph-based clustering is a popular choice: it is easy to implement efficiently and often outperforms traditional clustering methods such as K-means. Graph-based clustering methods model the data as a weighted undirected graph based on pairwise similarities (Nie, Wang, Deng et al., 2016). The goal of clustering is to separate the data vectors into different clusters according to their similarities. For the similarity graph constructed from the data, we want to find a partition of the graph such that the edges between different clusters have low weights and the edges within the same cluster have high weights. That is, data vectors within the same cluster should be similar to each other, and vectors in different clusters should be dissimilar from each other.

State-of-the-art clustering methods are often based on a graphical model of the relationships among data points. For instance, nonnegative matrix factorization (Lee & Seung, 2001; Hoyer, 2004; Li & Ding, 2006), spectral clustering (Ng et al., 2002; Von Luxburg, 2007), normalized cut (Shi & Malik, 2000; Dhillon, Guan, & Kulis, 2004), and ratio cut (Hagen & Kahng, 1992; Chan, Schlag, & Zien, 1994) all transform the data into a weighted, undirected graph based on pairwise similarities. Clustering is then accomplished by spectral or graph-theoretic optimization procedures. All of these graph-based methods share two stages: a data graph is formed from the input data, and optimization procedures are then invoked on this fixed input data graph. Because the two stages are independent, the clustering result depends on the quality of the input affinity matrix, which makes it sensitive to the particular graph construction method. If the initially constructed graph is of low quality, the resulting clustering may also be of low quality.
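To make the two-stage pipeline concrete, the sketch below (an illustrative example, not the letter's method; the toy affinity matrix `A` and function name are ours) computes the spectral embedding of a fixed affinity matrix, which would then be fed to k-means:

```python
import numpy as np

def spectral_embedding(A, k):
    """Second stage of the classic pipeline: given a FIXED affinity matrix A,
    embed the nodes with the k eigenvectors of the graph Laplacian that have
    the smallest eigenvalues (these rows are then clustered, e.g., by k-means)."""
    d = A.sum(axis=1)
    L = np.diag(d) - A            # unnormalized graph Laplacian L = D - A
    _, vecs = np.linalg.eigh(L)   # eigh returns eigenvalues in ascending order
    return vecs[:, :k]

# Toy affinity with two disconnected blocks: the embedding is constant on each block,
# so any clustering of the embedded rows recovers the two components.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], float)
F = spectral_embedding(A, 2)
```

Whatever structure `A` misses, the downstream optimization cannot recover, which is precisely the drawback the letter targets.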

To address these drawbacks, we aim to learn another data similarity matrix that can be adaptively adjusted during the spectral clustering procedure. In this letter, we propose a novel adaptive optimization process for graph-based clustering that learns the graph via a weight-adaptive Laplacian algorithm. In the new model, instead of fixing the input data graph associated with the affinity matrix, we learn a new data similarity matrix that adaptively adjusts during the optimization procedure, and we use this learned similarity matrix to guide the optimization for the spectral clustering task. Based on the weight-adaptive Laplacian algorithm, we propose three weight strategies, based on the L2-norm (Nie, Wang, Jordan, & Huang, 2016), a fuzzy entropy regularizer (Li, Ng, Cheung, & Huang, 2008), and an exponential-based weight strategy (Cai, Nie, Cai, & Huang, 2013), which yield three new graph-based clustering objectives. Finally, we derive optimization algorithms for the three objective functions. We conduct empirical studies on simulated data sets and seven real-world benchmark data sets to validate the effectiveness of the proposed methods.

In the rest of the letter, we first introduce the three proposed weight-adaptive Laplacian algorithms for graph-based clustering. Next, we derive the optimization method for each of the three objective functions. We then conduct experiments on both synthetic data sets and seven real-world benchmark data sets to illustrate the effectiveness of the proposed clustering methods and provide some analysis of the results. We conclude with additional observations and future work.

Throughout the letter, matrices are written in capital letters. For a matrix M, the i-th row and the (i, j)-th element of M are denoted by m_i and m_ij, respectively. The L2-norm of a vector v is denoted by ||v||_2, and Tr(M) denotes the trace of the matrix M. An identity matrix is denoted by I, and 1 denotes a column vector with all elements equal to one. For a vector v and a matrix M, v >= 0 and M >= 0 mean that all the elements of v and M are equal to or larger than zero.

## 2 Weight-Adaptive Laplacian Algorithm for Clustering

As defined in equations 2.2 to 2.4, we denote the three objectives as WAL_L2, WAL_Ln, and WAL_R, respectively. The proposed weight-constrained algorithm adaptively optimizes the spectral clustering objective and can be interpreted as follows. Taking equation 2.2 as an example, its first term is the weight-adaptive graph term, and its second term is a regularization term that keeps the learned graph smooth. We focus on the weight-adaptive graph term. In the following, a_ij denotes the affinity between the i-th and j-th data nodes, which is fixed during the optimization and can be constructed from the data points by other methods; f_i is the value assigned by the indicator function to the i-th node, and d_ij = ||f_i - f_j||_2^2 is the distance between the indicator values of nodes i and j. The learned weights s_ij form a probability distribution. From the weight-adaptive graph term, we can see that when the product a_ij d_ij becomes larger, the learned weight s_ij becomes smaller; that is, s_ij becomes small only when both a_ij and d_ij become large. Because a_ij is an element of the fixed input affinity matrix and f_i is the learned indicator value, a small d_ij indicates that nodes i and j belong approximately to the same class, a larger a_ij means a closer distance between nodes i and j, and a larger s_ij means a higher probability that nodes i and j belong to the same class. Based on this analysis, the proposed weight-adaptive Laplacian algorithms for graph-based clustering have the following two properties.

First, when d_ij is small in equation 2.2 (the optimized indicator assigns nodes i and j to the same class), there are two cases for the affinity a_ij:

- When a_ij is very large, the input affinity and the optimized indicator consistently assign nodes i and j to the same class, although the two quantities differ in magnitude (d_ij is small while a_ij is large). The product a_ij d_ij stays moderate, so this pair has no effect on the value of s_ij in the optimization process.

- When a_ij is also small, the optimized indicator is not consistent with the input affinity: the indicator assigns i and j to the same class, but the affinity does not. Since both a_ij and d_ij are very small, the product a_ij d_ij becomes much smaller, which leads to a larger s_ij in the optimization process. Adding this larger weight strengthens the correctly connected edge within the same class.

Second, when d_ij is very large, which means the optimized indicator assigns examples i and j to different classes, there are again two cases for a_ij, analogous to the previous explanation:

- When a_ij is very small, the affinity and the indicator consistently assign i and j to different classes, although the two quantities differ in magnitude (a_ij is small while d_ij is large). The product stays moderate, so the pair has no effect on the value of s_ij.

- When a_ij is also large, the optimized indicator is not consistent with the input affinity: the affinity suggests the same class, but the indicator assigns i and j to different classes. Since both a_ij and d_ij are very large, the product a_ij d_ij becomes much larger, which leads to a smaller s_ij in the optimization process. Assigning this smaller weight restrains the misconnected edge between different classes.

Based on the above analysis of the weight-adaptive Laplacian algorithm according to equation 2.2, we can clearly see that the proposed weight strategy adaptively adjusts the optimization of the clustering process. The same properties hold for the other two weight-adaptive clustering methods, given in equations 2.3 and 2.4.
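To make the three weight strategies concrete, the sketch below maps a penalty e_ij (the fixed affinity times the squared indicator distance) to adaptive weights under an exponential rule, an entropy-regularized rule, and an L2-regularized rule. The function names and closed forms are our illustrative assumptions in the spirit of the cited works, not the exact objectives of equations 2.2 to 2.4; all three share the key property that a larger penalty yields a smaller weight.

```python
import numpy as np

def exp_weights(e, sigma):
    """Exponential strategy: weight decays exponentially with the penalty
    (illustrative sketch; cf. the reweighting idea in Cai, Nie, Cai, & Huang, 2013)."""
    return np.exp(-e / sigma)

def entropy_weights(e, gamma):
    """Fuzzy-entropy strategy: softmax of -e/gamma over each row, so each row
    of weights is a probability distribution (cf. Li et al., 2008)."""
    w = np.exp(-e / gamma)
    return w / w.sum(axis=1, keepdims=True)

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(v.size) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def l2_weights(e, gamma):
    """L2-norm strategy: row-wise simplex projection of -e/(2*gamma), which is
    the closed-form minimizer of sum_j w_j e_j + gamma * ||w||^2 on the simplex
    (cf. Nie, Wang, Jordan, & Huang, 2016); it produces naturally sparse rows."""
    return np.vstack([project_simplex(-row / (2.0 * gamma)) for row in e])
```

Note the different sparsity behavior: the entropy weights are strictly positive everywhere, while the L2 weights can set high-penalty entries exactly to zero.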

## 3 Optimization Algorithm to Solve the Weight-Adaptive Graph-Based Clustering Problem

### 3.1 Optimization Algorithm for Solving Problem WAL_L2 in Equation 2.2

### 3.2 Optimization Algorithm for Solving Problem WAL_Ln in Equation 2.3

The optimization process for equation 2.3 is analogous to that for equation 2.2. When the weight matrix is fixed, the optimal indicator matrix is obtained in the same way as in equation 3.1.

The alternating minimization procedure between the indicator matrix and the weight matrix can be applied to the objective function (Li et al., 2008). Finally, we use the obtained optimal weights for clustering. A detailed description of the optimization method is provided in algorithm 2.

### 3.3 Optimization Algorithm for Solving Problem WAL_R in Equation 2.4

To optimize the objective function in equation 2.4, we also use an alternating algorithm to obtain the optimal solution. When the weight matrix is fixed, the optimal indicator matrix can be obtained just as equation 3.1 illustrates; the only difference lies in the definition of the weights, where a scalar parameter controls the distribution of the different weights.

By the above two steps, we alternately update the indicator matrix and the weights, repeating them iteratively until the objective function converges (Cai, Nie, Cai, & Huang, 2013). Finally, we obtain the optimal indicator and weights for clustering. The optimization algorithm is described in algorithm 3.
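All three solvers share the same alternating structure, which can be sketched as the generic skeleton below. Here `update_F`, `update_S`, and `objective` are problem-specific plug-ins (e.g., the closed-form steps derived above); the separable quadratic at the end is only a toy illustration of the convergence behavior, not one of the WAL objectives.

```python
import numpy as np

def alternate_minimize(update_F, update_S, S0, objective, n_iter=100, tol=1e-6):
    """Generic alternating scheme: with the weights S fixed, update the
    indicator F; with F fixed, update S; stop when the objective stalls.
    Each step decreases the objective, so the iteration converges."""
    S, prev = S0, np.inf
    for _ in range(n_iter):
        F = update_F(S)
        S = update_S(F)
        cur = objective(F, S)
        if abs(prev - cur) < tol:   # objective stalled -> converged
            break
        prev = cur
    return F, S

# Toy illustration: minimize (f - s)^2 + (s - 3)^2 by alternating exact steps.
# The alternating fixed point is f = s = 3.
F, S = alternate_minimize(update_F=lambda s: s,
                          update_S=lambda f: (f + 3.0) / 2.0,
                          S0=0.0,
                          objective=lambda f, s: (f - s) ** 2 + (s - 3.0) ** 2)
```

The monotone decrease of the objective under exact coordinate updates is what guarantees the convergence claimed for algorithms 1 to 3.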

## 4 Experiments

In this section, we explore the performance of our clustering methods on synthetic and real-world benchmark data sets. For the synthetic data set, we use the block diagonal synthetic data and two-moon synthetic data to analyze the properties of our proposed weight-adaptive Laplacian algorithm for graph-based clustering. Seven real-world benchmark data sets are also used.

### 4.1 Initial Graph Affinity Matrix Learning

In the proposed algorithms, an initial graph-based affinity matrix is required before learning the new normalized similarity matrix. We use the graph construction method proposed in Nie, Wang, Jordan, et al. (2016), in which the learned matrix is naturally sparse and computationally efficient for graph-based learning tasks such as clustering and semisupervised classification. Given the data points, the learned affinity values are such that a smaller distance between two data points corresponds to a larger affinity value between them.
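A sketch of such a sparse k-nearest-neighbor construction is given below. The closed-form row weights follow the adaptive-neighbors idea of the cited paper (each row supported on the k nearest neighbors, shrinking with squared distance, summing to one), though the exact formulation there may differ; the function name is ours.

```python
import numpy as np

def knn_affinity(X, k=5):
    """Sparse k-NN affinity in the spirit of Nie, Wang, Jordan, et al. (2016):
    row i is supported on the k nearest neighbors of x_i, with weights that
    decrease with squared Euclidean distance and sum to one per row."""
    n = X.shape[0]
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # squared distances
    A = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(D[i])
        idx = idx[idx != i][:k + 1]      # k nearest neighbors plus the (k+1)-th
        d = D[i, idx]
        den = k * d[k] - d[:k].sum()     # normalizer from the sum-to-one constraint
        A[i, idx[:k]] = (d[k] - d[:k]) / den if den > 0 else 1.0 / k
    return (A + A.T) / 2                 # symmetrize for spectral use
```

Because the weight of the (k+1)-th neighbor is exactly zero by construction, the resulting graph is naturally sparse, with no extra thresholding parameter.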

### 4.2 Experiments on Synthetic Data Sets

#### 4.2.1 Block Diagonal Synthetic Data

The block diagonal synthetic data set we used is a matrix with four block matrices arranged along the diagonal. The entries within each block denote the affinity of two corresponding points in one cluster, and the entries outside all blocks denote noise. The affinity values within each block are randomly generated in the range (0, 1), while the noise values are randomly generated in the range (0, c), where the noise level c is set to 0.6 and 0.7, respectively. To make this clustering task more challenging, we randomly pick out 25 noise entries and set their value to 1.0.
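Assuming a block size of 25 (the letter does not state it), the construction described above can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

def block_diag_affinity(block=25, blocks=4, noise=0.6, outliers=25):
    """Synthetic affinity matrix as described above: `blocks` diagonal blocks of
    within-cluster affinities drawn from (0, 1), off-block noise drawn from
    (0, noise), and `outliers` randomly chosen off-block entries forced to 1.0.
    The block size is an assumption; the letter does not state it."""
    n = block * blocks
    A = rng.uniform(0.0, noise, (n, n))                  # background noise
    for b in range(blocks):
        s = b * block
        A[s:s + block, s:s + block] = rng.uniform(0.0, 1.0, (block, block))
    labels = np.repeat(np.arange(blocks), block)
    off = np.argwhere(labels[:, None] != labels[None, :])
    pick = off[rng.choice(len(off), size=outliers, replace=False)]
    A[pick[:, 0], pick[:, 1]] = 1.0                      # hard noise entries
    return A

A = block_diag_affinity()
```

The 25 hardened entries carry the maximum possible affinity between points of different clusters, which is exactly the kind of misconnection the adaptive weights are meant to restrain.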

Figure 1 illustrates the original random matrix and the clustering results obtained by the three proposed weight strategies under the two settings; the proposed clustering methods exhibit good performance on this task. We also compared the clustering accuracy with that of other graph-based clustering methods, as illustrated in Table 1.

#### 4.2.2 Two-Moon Synthetic Data

The second toy data set is a randomly generated two-moon data set, in which two clusters of data are distributed in moon shapes. Each cluster contains 100 samples, and the noise percentage is set to 0.12. Our goal is to recompute the similarity matrix such that the number of connected components in the learned similarity matrix is exactly two. We tested our three proposed methods on this data set and obtained good results with all of them, as illustrated in Figure 2, which shows the effectiveness of our proposed methods.
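The letter does not give its exact generator; a standard two-moon construction with the stated sample size and noise level might look like the following (the function name and moon offsets are our choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def two_moons(n_per_cluster=100, noise=0.12):
    """Two interleaving half-circles with gaussian noise, matching the setup in
    section 4.2.2 (100 samples per cluster, noise level 0.12)."""
    t = rng.uniform(0.0, np.pi, n_per_cluster)
    upper = np.c_[np.cos(t), np.sin(t)]                  # upper half-circle
    lower = np.c_[1.0 - np.cos(t), 0.5 - np.sin(t)]      # shifted lower half-circle
    X = np.vstack([upper, lower]) + rng.normal(0.0, noise, (2 * n_per_cluster, 2))
    y = np.repeat([0, 1], n_per_cluster)
    return X, y

X, y = two_moons()
```

The interleaved shape is exactly the case where K-means fails (the clusters are not linearly separable) but a good similarity graph with two connected components succeeds.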

### 4.3 Experimental Results on Real Benchmark Data Sets

We also evaluated the proposed clustering methods on seven real-world benchmark data sets: 20news, Umist, Orl, yaleb, Coil20, Jaffe, and Dig0689. All of these data sets are from the UCI Machine Learning Repository (Asuncion & Newman, 2007) or some other image data sets. The descriptions of these seven data sets are summarized in Table 2.

| Algorithm | 20news | Umist | Orl | yaleb | Coil20 | Jaffe | Dig0689 |
|---|---|---|---|---|---|---|---|
| **ACC** |  |  |  |  |  |  |  |
| K-means | 0.2599 | 0.4087 | 0.6275 | 0.1127 | 0.6208 | 0.6761 | 0.6297 |
| RCut | 0.2554 | 0.6573 | 0.7775 | 0.4279 | 0.7895 | 0.8439 | 0.7943 |
| NCut | 0.2554 | 0.6713 | 0.7250 | 0.4275 | 0.7902 | 0.8439 | 0.7943 |
| NMF | 0.2572 | 0.4278 | 0.7525 | 0.4693 | 0.7917 | 0.8685 | 0.7770 |
| CLR_L1 | 0.2641 | 0.7182 | 0.7750 | 0.4675 | 0.8749 | 0.8743 |  |
| CLR_L2 | 0.2632 | 0.7291 | 0.7725 | 0.4703 | 0.8117 | 0.8755 | 0.8770 |
| WAL-R | 0.7009 | 0.5350 | 0.3169 | 0.7438 | 0.8404 | 0.7055 |  |
| WAL-Ln | 0.2678 | 0.7217 | 0.7675 | 0.7431 | 0.8873 | 0.8857 |  |
| WAL-L2 | 0.2677 | 0.4544 | 0.7944 |  |  |  |  |
| **NMI** |  |  |  |  |  |  |  |
| K-means | 0.1190 | 0.6510 | 0.8004 | 0.1604 | 0.7773 | 0.7272 | 0.7286 |
| RCut | 0.0579 | 0.8387 | 0.8805 | 0.6450 | 0.8721 | 0.9144 | 0.7780 |
| NCut | 0.0579 | 0.8426 | 0.8746 | 0.6390 | 0.8721 | 0.9144 | 0.7780 |
| NMF | 0.088 | 0.6094 | 0.8719 | 0.6745 | 0.8974 | 0.8703 | 0.7668 |
| CLR_L1 | 0.128 | 0.8594 | 0.8729 | 0.8952 | 0.9213 | 0.8568 |  |
| CLR_L2 | 0.118 | 0.8532 | 0.8749 | 0.8934 | 0.9233 | 0.8598 |  |
| WAL-R | 0.8369 | 0.7460 | 0.4953 | 0.8643 | 0.8734 | 0.6844 |  |
| WAL-Ln | 0.0370 | 0.8697 | 0.8533 | 0.6698 | 0.9110 | 0.8449 |  |
| WAL-L2 | 0.0367 | 0.6694 | 0.8776 |  |  |  |  |


Note: The best performances are in bold.

We compared our proposed clustering methods with the K-means, Ratio Cut (RCut), Normalized Cut (NCut), NMF, CLR_L1, and CLR_L2 (Nie, Wang, Jordan et al., 2016) methods in Table 3. For all the compared methods, we used the same algorithm (Nie, Wang, Jordan et al., 2016) to construct the initial affinity matrix as the input, setting the number of neighbors to five. For the clustering methods, we determined the regularization parameters by line search to find the optimal parameter settings. Moreover, we set the number of clusters to the ground-truth number in each data set for all the methods. For all of these methods, we record the average performance, and the standard clustering accuracy (ACC), normalized mutual information (NMI), and purity metrics are used to evaluate all of the clustering methods.
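The two main metrics can be computed with their standard definitions: ACC is the accuracy under the best one-to-one mapping between predicted and true labels (found with the Hungarian algorithm), and NMI here uses the common square-root normalization. The helper names below are ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_acc(y_true, y_pred):
    """ACC: accuracy under the best one-to-one matching of cluster labels
    to class labels, found by the Hungarian algorithm."""
    k = int(max(y_true.max(), y_pred.max())) + 1
    C = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        C[p, t] += 1                          # confusion counts
    row, col = linear_sum_assignment(-C)      # maximize total matched count
    return C[row, col].sum() / len(y_true)

def nmi(y_true, y_pred):
    """NMI with the common sqrt(H(U) * H(V)) normalization."""
    n = len(y_true)
    J = np.zeros((int(y_true.max()) + 1, int(y_pred.max()) + 1))
    for t, p in zip(y_true, y_pred):
        J[t, p] += 1
    J /= n                                    # joint label distribution
    pt, pp = J.sum(axis=1), J.sum(axis=0)     # marginals
    nz = J > 0
    mi = (J[nz] * np.log(J[nz] / np.outer(pt, pp)[nz])).sum()
    ht = -(pt[pt > 0] * np.log(pt[pt > 0])).sum()
    hp = -(pp[pp > 0] * np.log(pp[pp > 0])).sum()
    return mi / np.sqrt(ht * hp)
```

Both metrics are invariant to permutations of the cluster labels, which is why they are standard for unsupervised evaluation.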

| Algorithm | 20news | Umist | Orl | yaleb | Coil20 | Jaffe | Dig0689 |
|---|---|---|---|---|---|---|---|
| **Purity** |  |  |  |  |  |  |  |
| K-means | 0.2625 | 0.4957 | 0.6725 | 0.1326 | 0.6799 | 0.8685 | 0.7532 |
| RCut | 0.2565 | 0.7625 | 0.7900 | 0.4718 | 0.8118 | 0.8873 | 0.7942 |
| NCut | 0.2565 | 0.7704 | 0.7625 | 0.4581 | 0.8118 | 0.8873 | 0.7940 |
| NMF | 0.2597 | 0.4765 | 0.7650 | 0.4925 | 0.8160 | 0.6901 | 0.7770 |
| CLR_L1 | 0.2647 | 0.8004 | 0.7822 | 0.8124 | 0.9013 | 0.8740 |  |
| CLR_L2 | 0.2747 | 0.8065 | 0.7853 | 0.4915 | 0.8143 | 0.8991 | 0.8790 |
| WAL-R | 0.7339 | 0.5551 | 0.3587 | 0.7777 | 0.8544 | 0.7335 |  |
| WAL-Ln | 0.2680 | 0.8000 | 0.4850 | 0.8000 | 0.8873 | 0.8875 |  |
| WAL-L2 | 0.2677 | 0.5550 | 0.4933 |  |  |  |  |


Note: The best performances are in bold.

Since all the methods involve a k-means step, including K-means, RCut, NCut, CLR_L1, CLR_L2, and our three proposed methods (WAL-L2, WAL-Ln, and WAL-R), we used the same initialization for the k-means clustering in all methods and report their average results over 10 repetitions in Table 3. From Table 3, we can conclude that our proposed methods outperform the compared methods on most of the benchmark data sets, with WAL-L2 performing best in most cases, which illustrates the effectiveness of the proposed methods for graph-based clustering. The results of our methods are insensitive to the initialization and remain stable under a given parameter setting.

## 5 Conclusion

We have proposed a novel graph-based clustering algorithm that learns the graph via a weight-adaptive Laplacian algorithm. In the proposed algorithm, instead of fixing the input data graph associated with the affinity matrix, we learn a new data similarity matrix that adaptively adjusts during the optimization procedure, and we use this learned similarity matrix for the clustering task. Based on the weight-constrained Laplacian algorithm, we consider three regularizers on the proposed weight-constrained graph, propose three new clustering objectives, and derive optimization algorithms to solve them. Extensive experiments on both synthetic data and seven real-world benchmark data sets demonstrate the performance of our models. In future work, we will extend the weight-constrained strategy to other graph-based clustering methods.

## Acknowledgments

This work was supported by the National Basic Research Program of China (grant 2015CB351705) and the State Key Program of National Natural Science Foundation of China (grant 61332018).