## Abstract

Recovering intrinsic low-dimensional subspaces from data distributed on them is a key preprocessing step for many applications. In recent years, much work has modeled subspace recovery as low-rank minimization problems. We find that some representative models, such as robust principal component analysis (R-PCA), robust low-rank representation (R-LRR), and robust latent low-rank representation (R-LatLRR), are actually deeply connected. More specifically, we discover that once a solution to one of the models is obtained, the solutions to the other models can be obtained in closed form. Since R-PCA is the simplest, our discovery makes it the center of low-rank subspace recovery models. Our work has two important implications. First, R-PCA has a solid theoretical foundation. Under certain conditions, we can find globally optimal solutions to these low-rank models with overwhelming probability, although the models are nonconvex. Second, we can obtain significantly faster algorithms for these models by solving R-PCA first. The computation cost can be further cut by applying low-complexity randomized algorithms, for example, our novel ℓ_{2,1} filtering algorithm, to R-PCA. Although a formal proof of our ℓ_{2,1} filtering algorithm is not yet available, experiments verify the advantages of our algorithm over other state-of-the-art methods based on the alternating direction method.

## 1 Introduction

Subspaces are commonly assumed structures for high-dimensional data due to their simplicity and effectiveness. For example, motion (Tomasi & Kanade, 1992), face (Belhumeur, Hespanha, & Kriegman, 1997; Belhumeur & Kriegman, 1998; Basri & Jacobs, 2003), and texture (Ma, Derksen, Hong, & Wright, 2007) data have been known to be well characterized by low-dimensional subspaces. A lot of effort has been devoted to robustly recovering the underlying subspaces of data. The most widely adopted approach is principal component analysis (PCA). Unfortunately, PCA is known to be fragile to large noise or outliers in the data, and much work has been devoted to improving its robustness (Gnanadesikan & Kettenring, 1972; Huber, 2011; Fischler & Bolles, 1981; De La Torre & Black, 2003; Ke & Kanade, 2005; McCoy & Tropp, 2011; Zhang & Lerman, 2014; Lerman, McCoy, Tropp, & Zhang, 2014; Hardt & Moitra, 2012), among which the robust PCA (R-PCA) (Wright, Ganesh, Rao, Peng, & Ma, 2009; Chandrasekaran, Sanghavi, Parrilo, & Willsky, 2011; Candès, Li, Ma, & Wright, 2011) is one of the models with theoretical guarantee. Candès et al. (2011), Chandrasekaran et al. (2011) and Wright et al. (2009) proved that under certain conditions, the ground truth subspace can be exactly recovered with overwhelming probability. Later work (Hsu, Kakade, & Zhang, 2011) gave a justification of R-PCA in the case where the spatial pattern of the corruptions is deterministic.

Although R-PCA has found wide application, such as video denoising, background modeling, image alignment, photometric stereo, and texture representation (see e.g., Wright et al., 2009; De La Torre & Black, 2003, Ji, Liu, Shen, & Xu, 2010; Peng, Ganesh, Wright, Xu, & Ma, 2010; Zhang, Ganesh, Liang, & Ma, 2012), it only aims at recovering a single subspace that spans the entire data. To identify finer structures of data, the multiple subspaces recovery problem is considered, which aims at clustering data according to the subspaces they lie in. This problem has attracted a lot of attention in recent years (Vidal, 2011), and much work has provided a strong theoretical guarantee for the problem (see e.g., Soltanolkotabi & Candès, 2012). Rank minimization methods account for a large class of subspace clustering algorithms, where rank is connected to the dimensions of subspaces. Representative rank minimization–based methods include low-rank representation (LRR) (Liu & Yan, 2011; Liu, Lin, Yan, Sun, & Ma, 2013), robust low-rank representation (R-LRR) (Wei & Lin, 2010; Vidal & Favaro, 2014),^{1} latent low-rank representation (LatLRR) (Liu, Lin, & Yu, 2010; Zhang, Lin, & Zhang, 2013), and its robust version (R-LatLRR) (Zhang, Lin, Zhang, & Gao, 2014). Subspace clustering algorithms, including these low-rank methods, have been widely applied to motion segmentation (Gear, 1998; Costeira & Kanade, 1998; Vidal & Hartley, 2004; Yan & Pollefeys, 2006; Rao, Tron, Vidal, & Ma, 2010), image segmentation (Yang, Wright, Ma, & Sastry, 2008; Cheng, Liu, Wang, Li, & Yan, 2011), face classification (Ho, Yang, Lim, Lee, & Kriegman, 2003; Vidal, Ma, & Sastry, 2005; Liu & Yan, 2011; Liu et al., 2013), and system identification (Vidal, Soatto, Ma, & Sastry, 2003; Zhang & Bitmead, 2005; Paoletti, Juloski, Ferrari-Trecate, & Vidal, 2007).

### 1.1 Our Contributions

In this letter, we show that some of the low-rank subspace recovery models are actually deeply connected, even though they were proposed independently and targeted different problems (single or multiple subspaces recovery). Our discoveries are based on a characteristic of low-rank recovery models: they may have closed-form solutions. Such a characteristic has not been found in sparsity-based models for subspace recovery, such as sparse subspace clustering (Elhamifar & Vidal, 2009).

There are two main contributions of this letter. First, we find a close relation between R-LRR (Wei & Lin, 2010; Vidal & Favaro, 2014) and R-PCA (Wright et al., 2009; Candès et al., 2011), showing that, surprisingly, their solutions are mutually expressible. Similarly, R-LatLRR (Zhang et al., 2014) and R-PCA are closely connected too: their solutions are also mutually expressible. Our analysis allows an arbitrary regularizer for the noise term.

Second, since R-PCA is the simplest low-rank recovery model, our analysis naturally positions it at the center of existing low-rank recovery models. In particular, we propose to first apply R-PCA to the data and then use the solution of R-PCA to obtain the solutions for the other models. This approach has two important implications. First, although R-LRR and R-LatLRR are nonconvex problems, under certain conditions we can obtain their globally optimal solutions with overwhelming probability (see remark 9). Namely, if the noiseless data are sampled from a union of independent subspaces and the dimension of the subspace containing the union of subspaces is much smaller than the dimension of the ambient space, we are able to recover the exact subspace structure as long as the noise is sparse (even if the magnitudes of the noise are arbitrarily large). Second, solving R-PCA is much faster than solving other models. The computation cost can be further cut if we solve R-PCA by randomized algorithms. For example, we propose the ℓ_{2,1} filtering algorithm to solve R-PCA when the noise term uses the ℓ_{2,1} norm (see Table 1 for a definition). Experiments verify the significant advantages of our algorithms.

Table 1: Main notations used in this letter.

| Notation | Meaning |
|---|---|
| Capital letter | A matrix |
| m, n | Size of the data matrix M |
| m_{(1)}, m_{(2)} | max(m, n), min(m, n) |
| log | Natural logarithm |
| I, 0, **1** | The identity matrix, all-zero matrix, and all-one vector |
| e_{i} | Vector whose ith entry is 1 and others are 0 |
| M_{:j} | The jth column of matrix M |
| M_{ij} | The entry at the ith row and jth column of matrix M |
| M^{T} | Transpose of matrix M |
| M^{†} | Moore-Penrose pseudo-inverse of matrix M |
| ‖v‖ | Euclidean norm for a vector, (Σ_{i} v_{i}²)^{1/2} |
| ‖M‖_{*} | Nuclear norm of a matrix (the sum of its singular values) |
| ‖M‖_{ℓ_0} | ℓ_0 norm of a matrix (the number of nonzero entries) |
| ‖M‖_{2,0} | ℓ_{2,0} norm of a matrix (the number of nonzero columns) |
| ‖M‖_{ℓ_1} | ℓ_1 norm of a matrix, Σ_{ij} \|M_{ij}\| |
| ‖M‖_{2,1} | ℓ_{2,1} norm of a matrix, Σ_{j} ‖M_{:j}‖ |
| ‖M‖_{ℓ_∞} | ℓ_∞ norm of a matrix, max_{ij} \|M_{ij}\| |
| ‖M‖_{2,∞} | ℓ_{2,∞} norm of a matrix, max_{j} ‖M_{:j}‖ |
| ‖M‖_{F} | Frobenius norm of a matrix, (Σ_{ij} M_{ij}²)^{1/2} |
| ‖M‖ | Matrix operator norm, the largest singular value of a matrix |


The remainder of this letter is organized as follows. Section 2 reviews the representative low-rank models for subspace recovery. Section 3 gives our theoretical results—the interexpressibility among the solutions of R-PCA, R-LRR, and R-LatLRR. In section 4, we present detailed proofs of our theoretical results. Section 5 gives two implications of our theoretical analysis: better solutions and faster algorithms. We show the experimental results on both synthetic and real data in section 6. Section 7 concludes the letter.

## 2 Related Work

In this section, we review a number of existing low-rank models for subspace recovery.

### 2.1 Notations and Naming Conventions

We first define some notations. Table 1 summarizes the main notations that appear in this letter.

Since this letter involves multiple subspace recovery models, to minimize confusion, we name the models that minimize rank functions and nuclear norms as the *original* model and the *relaxed* model, respectively. We also name the models that use the denoised data matrices for dictionaries as *robust* models, with the prefix “R-”.

### 2.2 Robust Principal Component Analysis

R-PCA assumes that the observed data matrix *X* is the sum of a low-rank matrix *A*_{0} and a sparse matrix *E*_{0}, which model the ground truth data and the noise, respectively. The mathematical formulation of R-PCA is as follows:

min_{A,E} rank(A) + λ‖E‖_{ℓ_0}, s.t. X = A + E,  (2.1)

where X ∈ ℝ^{m×n} is the observation with data samples being its columns and λ > 0 is a regularization parameter. Since problem 2.1 is NP-hard, its convex relaxation is commonly solved instead, replacing the rank with the nuclear norm and the ℓ_0 norm with the ℓ_1 norm:

min_{A,E} ‖A‖_{*} + λ‖E‖_{ℓ_1}, s.t. X = A + E.  (2.2)

Candès et al. (2011) proved that if the rank of *A*_{0} is O(m_{(2)}/log² m_{(1)}), *A*_{0} is nonsparse (see the incoherence conditions, equations 5.1 and 5.2), and the nonzeros of the noise matrix *E*_{0} are uniformly distributed with cardinality O(mn) (it is remarkable that the magnitudes of the noise can be arbitrarily large), then with the regularization parameter λ = 1/√m_{(1)}, the solution of the convex relaxed R-PCA problem, equation 2.2, perfectly recovers the ground truth data matrix *A*_{0} and the noise matrix *E*_{0} with overwhelming probability.
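As an illustration, the relaxed R-PCA problem, equation 2.2, can be solved by the inexact augmented Lagrangian method (ALM). The following is a minimal sketch, not the authors' implementation; the parameter choices (the initial penalty μ and its growth rate) are common defaults from the R-PCA literature.

```python
import numpy as np

def soft_threshold(M, tau):
    """Entrywise shrinkage: proximal operator of tau * ||.||_{l1}."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def svt(M, tau):
    """Singular value thresholding: proximal operator of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ (np.maximum(s - tau, 0.0)[:, None] * Vt)

def rpca_ialm(X, lam=None, tol=1e-7, max_iter=500):
    """Inexact ALM for min ||A||_* + lam ||E||_{l1}, s.t. X = A + E."""
    m, n = X.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))   # the standard choice
    Y = np.zeros_like(X)                 # Lagrange multiplier
    E = np.zeros_like(X)
    mu = 1.25 / np.linalg.norm(X, 2)     # initial penalty
    rho = 1.6                            # penalty growth rate
    nX = np.linalg.norm(X, 'fro')
    for _ in range(max_iter):
        A = svt(X - E + Y / mu, 1.0 / mu)            # update low-rank part
        E = soft_threshold(X - A + Y / mu, lam / mu)  # update sparse part
        R = X - A - E                                 # constraint residual
        Y += mu * R
        mu *= rho
        if np.linalg.norm(R, 'fro') < tol * nX:
            break
    return A, E
```

Each iteration costs one SVD of an m×n matrix, which is the simplicity that the REDU-EXPR strategy of section 5 later exploits.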

### 2.3 Low-Rank Representation

Given observed data *X* drawn from a union of subspaces, LRR seeks the lowest-rank representation of all samples with the data themselves as the dictionary:

min_{Z} ‖Z‖_{*}, s.t. X = XZ.

The pattern of the optimal representation matrix *Z* (i.e., its block diagonal structure) can help identify the subspaces. To make the model robust to outliers, Liu et al. (2010, 2013) added an ℓ_{2,1} regularization term to the LRR model, supposing that the corruptions are column sparse:

min_{Z,E} ‖Z‖_{*} + λ‖E‖_{2,1}, s.t. X = XZ + E.

Under certain conditions, the row space of the ground truth *Z* and the indices of the nonzero columns of the ground truth *E* can be exactly recovered (Liu et al., 2013).

### 2.4 Robust Low-Rank Representation (Robust Shape Interaction and Low-Rank Subspace Clustering)

The robust shape interaction (RSI) model (Wei & Lin, 2010) replaces the noisy dictionary *X* in LRR with the denoised data *A*. Its relaxed version is

min_{Z,A,E} ‖Z‖_{*} + λ‖E‖_{2,1}, s.t. X = A + E, A = AZ.  (2.6)

Note that the relaxed RSI is still nonconvex due to its bilinear constraint, which may cause difficulty in finding its globally optimal solution. Wei and Lin (2010) first proved the following result on the relaxed noiseless LRR, which is also the noiseless version of the relaxed RSI: the unique minimizer of min_{Z} ‖Z‖_{*}, s.t. X = XZ, is Z* = V_{X}V_{X}^{T}, where U_{X}Σ_{X}V_{X}^{T} is the skinny SVD of *X* (proposition 1).

The matrix V_{A}V_{A}^{T}, where V_{A} comes from the skinny SVD A = U_{A}Σ_{A}V_{A}^{T}, is called the shape interaction matrix in the field of structure from motion (Costeira & Kanade, 1998). Hence model 2.6 is named robust shape interaction. V_{A}V_{A}^{T} is block diagonal when the column vectors of *A* lie strictly on independent subspaces. The block diagonal pattern reveals the structure of each subspace and therefore offers the possibility of subspace clustering.

Wei and Lin (2010) heuristically generalized proposition 1 to the noisy case by first applying R-PCA (i.e., the column-sparse relaxed R-PCA, min_{A,E} ‖A‖_{*} + λ‖E‖_{2,1}, s.t. X = A + E, equation 2.10) to denoise the data and then taking the shape interaction matrix of the denoised data as the representation matrix. In this way, they deduced the optimal solution. We prove in section 4 that this is indeed true and that it actually holds for arbitrary functions on *E*.

It is worth noting that Xu, Caramanis, and Sanghavi (2012) proved that problem 2.10 is capable of exactly recognizing the sparse outliers and simultaneously recovering the column space of the ground truth data under rather broad conditions. Our previous work (Zhang, Lin, Zhang, & Chang, 2015) further showed that the parameter λ = 1/√(log n) guarantees the success of the model even when the rank of the intrinsic matrix and the number of nonzero columns of the noise matrix are almost linear in *n* (up to a logarithmic factor), where *n* is the column number of the input.

Low-rank subspace clustering (LRSC) (Vidal & Favaro, 2014) differs from RSI mainly in that it penalizes the noise term with the squared Frobenius norm rather than the ℓ_{2,1} norm on *E*. The other difference is that LRSC adopts the alternating direction method (ADMM) (Lin, Liu, & Su, 2011) to solve equation 2.12.

In order not to confuse readers and to highlight that both RSI and LRSC are robust versions of LRR, we call them robust LRR (R-LRR) instead, as our theoretical analysis allows for arbitrary functions on *E*.
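To see why the shape interaction matrix V V^{T} is useful for clustering, here is a small numerical check (the sizes are our own illustrative choice): for data drawn from independent subspaces, the off-diagonal blocks of V V^{T} vanish.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two independent 2-dimensional subspaces in R^10, 15 samples each.
basis1 = rng.standard_normal((10, 2))
basis2 = rng.standard_normal((10, 2))
A = np.hstack([basis1 @ rng.standard_normal((2, 15)),
               basis2 @ rng.standard_normal((2, 15))])

# Skinny SVD of A and the shape interaction matrix V V^T.
_, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt[s > 1e-10].T          # keep only the rank-r right singular vectors
Z = V @ V.T                  # shape interaction matrix

off_block = Z[:15, 15:]      # entries coupling samples of different subspaces
```

For independent subspaces the off-diagonal block is numerically zero, so the nonzero pattern of Z directly reveals the cluster membership.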

### 2.5 Robust Latent Low-Rank Representation

LRR may degrade when the observed samples are insufficient. To address this issue, Liu and Yan (2011) proposed introducing unobserved hidden data *X*_{H} into the dictionary:

min_{Z} ‖Z‖_{*}, s.t. X = [X, X_{H}]Z.  (2.13)

Obviously it is impossible to solve problem 2.13 directly because *X*_{H} is unobserved. Nevertheless, by utilizing proposition 1, Liu and Yan (2011) proved that *X* can be written as X = XZ + LX, where both *Z* and *L* are low rank, resulting in the following latent low-rank representation (LatLRR) model, where sparse corruptions are considered:

min_{Z,L,E} rank(Z) + rank(L) + λ‖E‖_{ℓ_0}, s.t. X = XZ + LX + E.

As a common practice, its relaxed version is solved instead:

min_{Z,L,E} ‖Z‖_{*} + ‖L‖_{*} + λ‖E‖_{ℓ_1}, s.t. X = XZ + LX + E.

Like LRR, LatLRR uses the noisy data *X* itself as the dictionary. So Zhang et al. (2014) borrowed the idea of R-LRR to use denoised data as the dictionary, giving rise to the following robust latent LRR (R-LatLRR) model,

min_{Z,L,A,E} rank(Z) + rank(L) + λ‖E‖_{ℓ_0}, s.t. X = A + E, A = AZ + LA,

and its relaxed version,

min_{Z,L,A,E} ‖Z‖_{*} + ‖L‖_{*} + λ‖E‖_{ℓ_1}, s.t. X = A + E, A = AZ + LA.

Again, the relaxed R-LatLRR model is nonconvex. Zhang, Lin et al. (2013) proved that when there is no noise, both the original R-LatLRR and the relaxed R-LatLRR have nonunique closed-form solutions, and they described the complete solution sets, among which there are a large number of solutions inappropriate for subspace clustering. In order to choose an appropriate solution, note that, according to Wright, Ma, Mairal, Sapiro, and Huang (2010), an informative similarity matrix should have an adaptive neighborhood, high discriminating power, and high sparsity (Zhuang et al., 2012). As the graphs constructed by LatLRR already have high discriminating power and an adaptive neighborhood (Liu & Yan, 2011), Zhang, Lin et al. (2013) considered using high sparsity to choose the optimal solution. In particular, like RSI, Zhang, Lin et al. (2013) proposed applying R-PCA to separate *X* into A + E. Next, they found the sparsest solution among the solution set of the relaxed noiseless R-LatLRR,

min_{Z,L} ‖Z‖_{*} + ‖L‖_{*}, s.t. A = AZ + LA,  (2.18)

with *A* being the denoised data. Equation 2.18 is a relaxed version of the original noiseless R-LatLRR model:

min_{Z,L} rank(Z) + rank(L), s.t. A = AZ + LA.  (2.19)

### 2.6 Other Low-Rank Models for Subspace Clustering

In this section, we mention more low-rank subspace recovery models, although they are not our focus in this letter. Also aiming at addressing the small-sample issue, Liu et al. (2012) proposed fixed rank representation by requiring that the representation matrix be as close to a rank *r* matrix as possible, where *r* is a prescribed rank. Then the best rank *r* matrix, which still has a block diagonal structure, is used for subspace clustering. Wang, Saligrama, and Castañón (2011) extended LRR to address nonlinear multimanifold segmentation, where the error *E* is regularized by the square of the Frobenius norm so that the kernel trick can be used. Ni, Sun, Yuan, Yan, and Cheong (2010) augmented the LRR model with a semidefiniteness constraint on the representation matrix *Z*. In contrast, the representation matrices of R-LRR and R-LatLRR are both naturally positive semidefinite, as they are shape interaction matrices.

## 3 Main Results: Relations Among Low-Rank Models

In this section we present our main results. We generalize the models reviewed in section 2 so that the regularization on the noise term *E* can be arbitrary. More specifically, the generalized models are:

min_{A,E} rank(A) + λf(E), s.t. X = A + E,  (3.1)

min_{A,E} ‖A‖_{*} + λf(E), s.t. X = A + E,  (3.2)

min_{Z,A,E} rank(Z) + λf(E), s.t. X = A + E, A = AZ,  (3.3)

min_{Z,A,E} ‖Z‖_{*} + λf(E), s.t. X = A + E, A = AZ,  (3.4)

min_{Z,L,A,E} rank(Z) + rank(L) + λf(E), s.t. X = A + E, A = AZ + LA,  (3.5)

min_{Z,L,A,E} ‖Z‖_{*} + ‖L‖_{*} + λf(E), s.t. X = A + E, A = AZ + LA,  (3.6)

where *f* is any function. For brevity, we still call equations 3.1 to 3.6 the original R-PCA, relaxed R-PCA, original R-LRR, relaxed R-LRR, original R-LatLRR, and relaxed R-LatLRR, respectively, without mentioning "generalized."

We show that the solutions to the above models are mutually expressible; if we have a solution to one of the models, we will obtain the solutions to other models in closed-form formulations. We further show in section 5 that such mutual expressibility is useful.

It suffices to show that the solutions of the original R-PCA and those of other models are mutually expressible (i.e., letting the original R-PCA hinge all the above models). We summarize our results as the following theorems.

(connection between the original R-PCA and the original R-LRR). For any minimizer (A*, E*) of the original R-PCA problem, equation 3.1, suppose U_{A}Σ_{A}V_{A}^{T} is the skinny SVD of the matrix A* = X − E*. Then ((V_{A} + S)V_{A}^{T}, E*) is an optimal solution to the original R-LRR problem, equation 3.3, where *S* is any matrix such that V_{A}^{T}S = 0. Conversely, provided that (Z*, E*) is an optimal solution to the original R-LRR problem, equation 3.3, (X − E*, E*) is a minimizer of the original R-PCA problem, equation 3.1.

(connection between the original R-PCA and the relaxed R-LRR). For any minimizer (A*, E*) of the original R-PCA problem, equation 3.1, the relaxed R-LRR problem, equation 3.4, has an optimal solution (V_{A}V_{A}^{T}, E*), where V_{A} comes from the skinny SVD of A*. Conversely, suppose that the relaxed R-LRR problem, equation 3.4, has a minimizer (Z*, E*); then (X − E*, E*) is an optimal solution to the original R-PCA problem, equation 3.1.

According to theorem 4, the relaxed R-LRR can be viewed as denoising the data first by the original R-PCA and then adopting the shape interaction matrix of the denoised data matrix as the affinity matrix. Such a procedure is exactly the same as that in Wei and Lin (2010), which was proposed out of heuristics and for which no proof was provided.
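The two-step procedure just described can be sketched in a few lines. The denoiser below is a pluggable stand-in for an R-PCA solver (we deliberately do not fix a particular one here).

```python
import numpy as np

def shape_interaction(A, tol=1e-10):
    """V V^T from the skinny SVD A = U S V^T."""
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    V = Vt[s > tol].T
    return V @ V.T

def redu_expr_rlrr(X, denoise):
    """Reduce-then-express for relaxed R-LRR: denoise X first
    (ideally by R-PCA), then express the solution in closed form."""
    A = denoise(X)            # A*: the denoised data
    E = X - A                 # E* = X - A*
    Z = shape_interaction(A)  # Z*: shape interaction matrix of A*
    return Z, A, E
```

With clean low-rank data the identity map is already a valid "denoiser," and the returned Z satisfies the bilinear constraint A = AZ exactly.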

where *S*_{1} and *S*_{2} are any matrices satisfying the two conditions of lemma 11: the compatibility condition and the rank constraints on S_{1} and S_{2}.

Conversely, let (Z*, L*, A*, E*) be any optimal solution to the original R-LatLRR, equation 3.5. Then (X − E*, E*) is a minimizer of the original R-PCA problem, equation 3.1.

Its blocks are compatible with Σ_{A}, i.e., if [Σ_{A}]_{ii} ≠ [Σ_{A}]_{jj}, then [Ŵ]_{ij} = 0.

Both Ŵ and I − Ŵ are positive semidefinite.

Conversely, let (Z*, L*, A*, E*) be any optimal solution to the relaxed R-LatLRR, equation 3.6. Then (X − E*, E*) is a minimizer of the original R-PCA problem, equation 3.1.

Figure 1 illustrates our theorems by putting the original R-PCA at the center of the low-rank subspace clustering models under consideration.

By the above theorems, we easily have the following corollary:

According to the above results, once we obtain a globally optimal solution to the original R-PCA, equation 3.1, we can obtain globally optimal solutions to the original and relaxed R-LRR and R-LatLRR problems. Although in general solving the original R-PCA is NP hard, under certain conditions (see section 5.1), its globally optimal solution can be obtained with an overwhelming probability by solving the relaxed R-PCA, equation 3.2. If one solves the original and relaxed R-LRR or R-LatLRR directly (e.g., by ADMM), there is no analysis on whether their globally optimal solutions can be attained due to their nonconvex nature. In this sense, we say that we can obtain a better solution for the original and relaxed R-LRR and R-LatLRR if we reduce them to the original R-PCA. Our numerical experiments in section 6.1 testify to our claims.

## 4 Proofs of Main Results

In this section, we provide detailed proofs of the four theorems in the previous section.

### 4.1 Connection between R-PCA and R-LRR

The following lemma is useful throughout the proof of theorem 3.

Using lemma 10, we can prove theorem 3.

By lemma 10, we fix *E* and have ... On the other hand, ... From equations 4.1 to 4.3, we have ..., which leads to a contradiction with the optimality of (A*, E*) for R-PCA, equation 3.1.

Choosing *E* as in equation 3.3, by lemma 10 and the optimality of (Z*, E*), we infer that ... On the other hand, ..., where we have utilized another property of the Moore-Penrose pseudo-inverse: rank(Y^{†}) = rank(Y). Combining equations 4.5 to 4.7, we have ... Notice that the pair satisfies the constraint of the original R-LRR problem, equation 3.3. The inequality, equation 4.8, leads to a contradiction with the optimality of the pair (Z*, E*) for R-LRR.

Thus we finish the proof.

Now we prove theorem 4. Proposition 1 is critical for the proof.

Viewing *E* as a fixed matrix, by proposition 1 we have ... On the other hand, ... So we derive ..., which is a contradiction because (A*, E*) is already an optimal solution to the R-PCA problem 3.1, thus proving the first part of the theorem.

### 4.2 Connection between R-PCA and R-LatLRR

Now we prove the mutual expressibility between the solutions of R-PCA and R-LatLRR. Our previous work (Zhang, Lin et al., 2013) gives the complete closed-form solutions to noiseless R-LatLRR problems 2.19 and 2.18, which are both critical to our proofs.

Now we are ready to prove theorem 6.

We first prove the first part of the theorem. Since equation 4.13 is the minimizer to problem 2.19 with , it naturally satisfies the constraint: . Together with the fact that based on the assumption of the theorem, we conclude that satisfies the constraint of the original R-LatLRR, equation 3.5.

By lemma 11, fixing ... and ..., respectively, we have ... From equations 4.14 to 4.16, we finally obtain ..., which leads to a contradiction with our assumption that (A*, E*) is optimal for R-PCA.

The following lemma is helpful for proving the connection between the R-PCA, equation 3.1, and the relaxed R-LatLRR, equation 3.6.

Let U_{A}Σ_{A}V_{A}^{T} be the skinny SVD of *A*. Then the complete optimal solutions to the relaxed noiseless R-LatLRR problem, equation 2.18, are as follows:

Z* = V_{A}ŴV_{A}^{T} and L* = U_{A}(I − Ŵ)U_{A}^{T},

where Ŵ is any block diagonal matrix satisfying:

1. Its blocks are compatible with Σ_{A}, that is, if [Σ_{A}]_{ii} ≠ [Σ_{A}]_{jj}, then [Ŵ]_{ij} = 0.

2. Both Ŵ and I − Ŵ are positive semidefinite.

Now we are ready to prove theorem 7.

By lemma 12, (Z*, L*) can be written in the form 4.19: ..., where Ŵ satisfies all the conditions in lemma 12. Substituting equation 4.21 into the objective function of problem 3.6, we have ..., where conditions 1 and 2 in lemma 12 guarantee .... On the other hand, substituting equation 3.8 into the objective function of problem 3.6 and using conditions 1 and 2 in the theorem, we have .... Thus we obtain a contradiction by considering equations 4.20, 4.22, and 4.23.

Finally, viewing R-PCA as a hinge, we connect all the models considered in section 3. We now prove corollary 8.

According to theorems 3, 4, 6, and 7, the solutions to R-PCA and those of other models are mutually expressible. Next, we build the relationships among equations 3.3 to 3.6. For simplicity, we take only equations 3.3 and 3.4 as examples. The proofs of the remaining connections are similar.

Suppose (Z*, E*) is optimal to the original R-LRR problem, equation 3.3. Then, based on theorem 3, (X − E*, E*) is an optimal solution to the R-PCA problem, equation 3.1. Then theorem 4 concludes that (V_{A}V_{A}^{T}, E*) is a minimizer of the relaxed R-LRR problem, equation 3.4. Conversely, suppose that (Z*, E*) is optimal to the relaxed R-LRR problem. By theorems 3 and 4, we conclude that ((V_{A} + S)V_{A}^{T}, E*) is an optimal solution to the original R-LRR problem, equation 3.3, where V_{A} is the matrix of right singular vectors in the skinny SVD of X − E* and *S* is any matrix satisfying V_{A}^{T}S = 0.

## 5 Applications of the Theoretical Analysis

In this section, we discuss the pragmatic value of our theoretical results in section 3. As one can see in Figure 1, we put R-PCA at the center of all the low-rank models under consideration because it is the simplest one, which implies that we prefer deriving the solutions of other models from that of R-PCA. For brevity, we refer to our two-step approach, first reducing to R-PCA and then expressing the desired solution via the solution of R-PCA, as the REDU-EXPR method. There are two advantages of REDU-EXPR. First, we could obtain better solutions to other low-rank models (see remark 9). R-PCA has a solid theoretical foundation. Candès et al. (2011) proved that under certain conditions, solving the relaxed R-PCA, equation 2.2, which is convex, recovers the ground truth solution with overwhelming probability (see section 5.1.1). Xu et al. (2012) and Zhang et al. (2015) also proved similar results for column-sparse relaxed R-PCA, equation 2.10 (see section 5.1.2). Then, by the mutual expressibility of solutions, we could also obtain globally optimal solutions to the other models. In contrast, the optimality of a solution is uncertain if we solve the other models using specific algorithms, such as ADMM (Lin et al., 2011), due to their nonconvex nature.

The second advantage is that we could have much faster algorithms for other low-rank models. Due to the simplicity of R-PCA, solving R-PCA is much faster than solving other models. In particular, the expensive matrix-matrix multiplications (between *X* and *Z* or *L*) can be avoided. Moreover, there are low-complexity randomized algorithms for solving R-PCA, making the computational cost of solving other models even lower. In particular, we propose an ℓ_{2,1} filtering algorithm for column-sparse relaxed R-PCA (equation 3.2 with *f* being the ℓ_{2,1} norm). If one is directly faced with other models, it is nontrivial to design low-complexity algorithms (either deterministic or randomized).^{2}

In summary, based on our analysis, we could achieve low rankness–based subspace clustering with better performance and faster speed.

### 5.1 Better Solution for Subspace Recovery

Reducing to R-PCA could help overcome the nonconvexity issue of the low-rank recovery models we consider (see remark 9). We defer the numerical verification of this claim until section 6.1. In this section, we discuss the theoretical conditions under which reducing to R-PCA succeeds for the subspace clustering problem.

We focus on the application of theorem 4, which shows that given the solution (A*, E*) to R-PCA problem 3.1, the optimal solution to the relaxed R-LRR problem 3.4 is given by the shape interaction matrix V_{A}V_{A}^{T} of the denoised data. Note that V_{A}V_{A}^{T} is called the shape interaction matrix in the field of structure from motion and has been proven to be block diagonal by Costeira and Kanade (1998) when the column vectors of A* lie strictly on independent subspaces and the number of samples from each subspace is larger than the subspace dimension (Liu et al., 2013). The block diagonal pattern reveals the structure of each subspace and hence offers the possibility of subspace clustering. Thus, to illustrate the success of our approach, we show under which conditions the R-PCA problem exactly recovers the noiseless data matrix or correctly recognizes the indices of noise. We discuss the cases where the corruptions are sparse element-wise noise, sparse column-wise noise, and dense gaussian noise, respectively. As for the application of theorems 3, 6, and 7, given that the solutions to problems 3.3, 3.5, and 3.6 are all nonunique, and thus one optimal solution may perform better or worse (e.g., in clustering accuracy) than another, one should adopt an additional criterion (e.g., the sparsity constraint) to select the most suitable solution for specific application tasks (see e.g., Zhang et al., 2014).

#### 5.1.1 Sparse Element-Wise Noises

Suppose each column of the data matrix is an observation. In the case where the corruptions are sparse element-wise noise, we assume that the positions of the corrupted elements are sparse and uniformly distributed over the input matrix. In this case, we consider the use of the ℓ_1 norm, model 2.2, to remove the corruption.

Candès et al. (2011) provided conditions for exactly recovering the ground truth *A*_{0} from the corrupted observations X = A_{0} + E_{0}. We apply them to the success conditions of our approach. First, to avoid the possibility that the low-rank part *A*_{0} is sparse, *A*_{0} needs to satisfy the following incoherence conditions:

max_{i} ‖U^{T}e_{i}‖² ≤ μr/m, max_{i} ‖V^{T}e_{i}‖² ≤ μr/n,  (5.1)

‖UV^{T}‖_{ℓ_∞} ≤ √(μr/(mn)),  (5.2)

where UΣV^{T} is the skinny SVD of *A*_{0}, r = rank(A_{0}), and μ is a constant. The second assumption for the success of the algorithm is that the dimension of the sum of the subspaces is sufficiently low and the support number *s* of the noise matrix *E*_{0} is not too large (it is remarkable that the magnitudes of the noise can be arbitrarily large), namely,

rank(A_{0}) ≤ ρ_{r} m_{(2)} μ^{-1}(log m_{(1)})^{-2} and s ≤ ρ_{s} mn,  (5.3)

where ρ_{r} and ρ_{s} are numerical constants, m_{(1)} = max(m, n), and m_{(2)} = min(m, n). Under these conditions, Candès et al. (2011) justified that relaxed R-PCA, equation 2.2, with λ = 1/√m_{(1)} exactly recovers the noiseless data *A*_{0}. Thus, the algorithm of reducing to R-PCA succeeds as long as the subspaces are independent and the number of samples from each subspace is larger than the subspace dimension (Liu et al., 2013).
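The incoherence parameter μ in equations 5.1 and 5.2 can be estimated numerically. The following helper is our own illustration, not part of the original analysis; it returns the smallest μ satisfying both conditions for a given matrix.

```python
import numpy as np

def incoherence(A, tol=1e-10):
    """Smallest mu satisfying the incoherence conditions 5.1 and 5.2
    for the skinny SVD A = U Sigma V^T (illustrative helper)."""
    m, n = A.shape
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > tol))
    U, V = U[:, :r], Vt[:r].T
    mu_u = (m / r) * np.max(np.sum(U**2, axis=1))     # condition 5.1, U side
    mu_v = (n / r) * np.max(np.sum(V**2, axis=1))     # condition 5.1, V side
    mu_uv = (m * n / r) * np.max(np.abs(U @ V.T))**2  # condition 5.2
    return max(mu_u, mu_v, mu_uv)
```

Random low-rank matrices typically have small μ, which is why R-PCA recovers them; spiky matrices whose energy concentrates on a few entries have large μ and fall outside the guarantee.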

#### 5.1.2 Sparse Column-Wise Noise

In the more general case, the noise exists in a small number of columns; each nonzero column of *E*_{0} corresponds to a corruption. In this case, we consider the use of the ℓ_{2,1} norm, model 2.10, to remove the corruption.

Note that in this case we cannot expect to exactly recover a corrupted sample, because we cannot even decide whether a corrupted column of the observation belongs to the range space of the ground truth data or to that of the noise. So we remove the corrupted observations identified by the algorithm, rather than exactly recovering their ground truth, and use the remaining noiseless data to reveal the real structure of the subspaces.

The success conditions are as follows. Relative to the column number *n*, the fraction of corrupted columns cannot be too large: we assume γ ≤ γ*, where γ is the fraction of nonzero columns of *E*_{0} and γ* is a numerical constant (the precise condition involves a projection onto the complement of the column support of *E*_{0}). Note that, for example, when the columns of *E*_{0} are independent subgaussian isotropic random vectors, the constraint holds. So the constraint is feasible even though the number of nonzero columns of *E*_{0} is comparable to *n*. The dimension of the sum of the subspaces is also required to be low and the column support number *s* of the noise matrix *E*_{0} to not be too large. More specifically,

rank(A_{0}) ≤ ρ_{r} n μ^{-1}(log n)^{-1} and s ≤ ρ_{s} n μ^{-1}(log n)^{-1},  (5.5)

where ρ_{r} and ρ_{s} are numerical constants. Note that the range of the successful rank in equation 5.5 is broader than that of equation 5.3 and has been proved to be tight (Zhang et al., 2015). Moreover, to avoid corrupted samples lying in an incorrect subspace, a mild separation condition on the noise columns is assumed. Under these conditions, our theorem justifies that column-sparse relaxed R-PCA, equation 2.10, with λ = 1/√(log n) exactly recognizes the indices of the noise. Thus our approach succeeds.

#### 5.1.3 Dense Gaussian Noises

In the case of dense gaussian noise, we assume that the columns of the noiseless data *A*_{0} lie in an *r*-dimensional subspace, where *r* is relatively small. For dense gaussian noise, we consider the use of the squared Frobenius norm, leading to the following relaxed R-LRR problem:

min_{Z,A} ‖Z‖_{*} + (λ/2)‖X − A‖_{F}², s.t. A = AZ.  (5.6)

We quote the following result from Favaro et al. (2011), which gave the closed-form solution to problem 5.6: the optimal solution is Z* = V_{1}V_{1}^{T}, A* = U_{1}Σ_{1}V_{1}^{T}, and E* = X − A*, where Σ_{1}, U_{1}, and V_{1} correspond to the top singular values (those exceeding √(2/λ)) and the associated singular vectors of *X*, respectively. Based on our results in section 3, we give a new proof.

By theorem 4, it suffices to solve the original R-PCA problem with *f* being the squared Frobenius norm, min_{A} rank(A) + (λ/2)‖X − A‖_{F}², whose solution is the truncated SVD A* = U_{1}Σ_{1}V_{1}^{T}. This can be easily seen by probing the different ranks *k* of *A* and observing that, for each fixed *k*, the best choice of *A* is the rank-*k* truncated SVD of *X*; retaining a singular value σ_{i} costs 1 in rank but saves (λ/2)σ_{i}² in residual, so exactly the singular values with σ_{i} > √(2/λ) are kept.

Next, according to theorem 4, where *f* is chosen as the squared Frobenius norm, the optimal solution to problem 5.6 is given by Z* = V_{1}V_{1}^{T}, A* = U_{1}Σ_{1}V_{1}^{T}, and E* = X − A*, as claimed.

Corollary 13 offers insight into the relaxed R-LRR, equation 5.6. We can first solve the classical PCA problem with parameter *r* and then adopt the shape interaction matrix of the denoised data matrix as the affinity matrix for subspace clustering. This is consistent with the well-known fact that, empirically and theoretically, PCA is capable of dealing effectively with small, dense gaussian noise. Note that one needs to tune the parameter λ in problem 5.6 in order to obtain a suitable parameter *r* for the PCA problem.
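The recipe of corollary 13 can be written compactly. The truncation threshold √(2/λ) below is our reading of the rank-versus-residual tradeoff; treat this as a sketch rather than the authors' code.

```python
import numpy as np

def rlrr_frobenius(X, lam):
    """Closed-form relaxed R-LRR when the noise is penalized by the
    squared Frobenius norm: PCA-style truncation of X, then the shape
    interaction matrix of the denoised data (sketch)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    keep = s > np.sqrt(2.0 / lam)      # rank-vs-residual tradeoff
    U1, s1, V1t = U[:, keep], s[keep], Vt[keep]
    A = U1 @ (s1[:, None] * V1t)       # denoised data A* = U1 S1 V1^T
    Z = V1t.T @ V1t                    # shape interaction matrix V1 V1^T
    return Z, A, X - A
```

Note that no iterative solver is needed at all: one SVD of X yields Z*, A*, and E* simultaneously, which is the source of the speedups reported in section 6.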

#### 5.1.4 Other Cases

Although our approach works well under rather broad conditions, it might fail in some cases (e.g., when the noiseless data matrix is not low rank). However, for certain data structures, the following numerical experiment shows that reducing to R-PCA correctly identifies the indices of noise even though the ground truth data matrix is of full rank. The synthetic data are generated as follows. In the linear space ℝ^{5D}, we construct five independent *D*-dimensional subspaces, whose bases are randomly generated column orthonormal matrices. Then points are sampled from each subspace by multiplying its basis matrix with a gaussian random matrix, whose entries are independent and identically distributed (i.i.d.). Thus, we obtain a structured sample matrix without noise, and the noiseless data matrix is of rank 5D. We then add column-wise gaussian noise, whose entries are i.i.d., to the noiseless matrix and solve model 2.10. Table 2 reports the Hamming distance between the ground truth indices and the indices identified by model 2.10 under different input sizes. It shows that reducing to R-PCA succeeds for structured data distributions even when the dimension of the sum of the subspaces is equal to that of the ambient space. In contrast, the algorithm fails for unstructured data distributions, for example, when the noiseless data matrix is a random gaussian matrix whose entries are totally random and i.i.d. Since the main focus of this letter is the relations among several low-rank models and the success conditions are within the scope of research on R-PCA, the theoretical analysis of how the data distribution influences the success of R-PCA will be our future work.

| *D* | dist | *D* | dist | *D* | dist | *D* | dist |
|---|---|---|---|---|---|---|---|
| 5 | 0 | 10 | 0 | 50 | 0 | 100 | 0 |

Note: dist denotes the Hamming distance between the indices obtained by solving model 2.10 and the ground truth indices of the noise.

### 5.2 Fast Algorithms for Subspace Recovery

Representative low-rank subspace recovery models like LRR and LatLRR are solved by ADMM (Lin et al., 2011), and the overall complexity^{3} is high (Liu et al., 2010; Liu & Yan, 2011; Liu et al., 2013). For LRR, by employing linearized ADMM (LADMM) and some advanced tricks for computing partial SVDs, the overall complexity can be reduced, where *r* is the rank of the optimal *Z*. We show that our REDU-EXPR approach can be much faster.

We record the running times and the clustering accuracies^{4} of relaxed LRR (Liu et al., 2010, 2013) and relaxed R-LRR (Favaro et al., 2011; Wei & Lin, 2010). LRR is solved by ADMM. For R-LRR, we test three algorithms. The first is traditional ADMM, which updates *A*, *E*, and *Z* alternately by minimizing the augmented Lagrangian function of relaxed R-LRR. The second is partial ADMM, which updates *A*, *E*, and *Z* by minimizing the partial augmented Lagrangian function subject to the constraint. This method is adopted by Favaro et al. (2011). A key difference between partial ADMM and traditional ADMM is that the former updates *A* and *Z* simultaneously by using corollary 13 (for more details, refer to Favaro et al., 2011). The third method is REDU-EXPR, adopted by Wei and Lin (2010). Except for the ADMM method for solving R-LRR, we run the code provided by the respective authors.

One can see from Table 3 that REDU-EXPR is significantly faster than the ADMM-based methods. In fact, solving R-LRR by traditional ADMM did not converge. We point out that the partial ADMM method uses the closed-form solution given in corollary 13; however, it is still much slower than REDU-EXPR.

| Model | Method | Accuracy | CPU Time (h) |
|---|---|---|---|
| LRR | ADMM | - | 10 |
| R-LRR | ADMM | - | Did not converge |
| R-LRR | Partial ADMM | - | 10 |
| R-LRR | REDU-EXPR | 61.6365% | 0.4603 |

For large-scale data, neither of these complexities is low enough. Fortunately, for R-PCA it is relatively easy to design low-complexity randomized algorithms to further reduce the computational load. Liu, Lin, Su, and Gao (2014) reported an efficient randomized algorithm, ℓ_{1} filtering, to solve R-PCA in the entry-sparse case. ℓ_{1} filtering is completely parallel, and its complexity is linear in the matrix size. In the following, we sketch the ℓ_{1} filtering algorithm (Liu et al., 2014) and, in the same spirit, propose a novel ℓ_{2,1} filtering algorithm for solving column-sparse R-PCA, equation 2.11, that is, R-PCA with the ℓ_{2,1} norm on the noise term.

#### 5.2.1 Outline of the ℓ_{1} Filtering Algorithm (Liu et al., 2014)

The ℓ_{1} filtering algorithm aims at solving the R-PCA problem, equation 3.1, with entry-sparse noise. There are two main steps. The first is to recover a seed matrix. The second is to process the rest of the data matrix by ℓ_{1}-norm-based linear regression.

*Step 1: Recovery of a Seed Matrix.* Assume that the target rank *r* of the low-rank component *A* is very small compared with the size of the data matrix. By randomly sampling a submatrix *X*^{s} from *X*, whose numbers of rows and columns are proportional to *r* with certain oversampling rates, we partition the data matrix *X*, together with the underlying matrix *A* and the noise *E*, into four corresponding blocks (for simplicity, we assume that *X*^{s} is at the top left corner of *X*). A small-scale R-PCA on *X*^{s} then recovers the corresponding seed blocks of *A* and *E*.

#### 5.2.2 The ℓ_{2,1} Filtering Algorithm

ℓ_{1} filtering is designed for entry-sparse R-PCA. For R-LRR, we need to solve column-sparse R-PCA instead. Unlike the ℓ_{1} case, which breaks the full matrix into four blocks, the ℓ_{2,1} norm requires viewing each column holistically, so we can only partition the whole matrix into two blocks. We inherit the idea of ℓ_{1} filtering and propose a randomized algorithm, called ℓ_{2,1} filtering, to solve column-sparse R-PCA. It also consists of two steps. We first recover a seed matrix and then process the remaining columns via ℓ_{2,1}-norm-based linear regression, which turns out to be a least squares problem.

*Recovery of a Seed Matrix.* This step is nearly the same as in ℓ_{1} filtering, except that we partition the whole matrix into only two blocks. Suppose the rank of *A* is *r*. We randomly sample *s*_{c}*r* columns of *X*, where *s*_{c} is an oversampling rate. These columns form a submatrix *X*_{l}. For brevity, we assume that *X*_{l} is the leftmost submatrix of *X*. Then we may partition *X*, *A*, and *E* as

*X* = [*X*_{l}, *X*_{r}], *A* = [*A*_{l}, *A*_{r}], *E* = [*E*_{l}, *E*_{r}],

respectively. We can first recover *A*_{l} from *X*_{l} by solving a small-scale relaxed column-sparse R-PCA problem (equation 5.15) with the regularization parameter suggested by Zhang et al. (2015).

*ℓ_{2,1} Filtering.* After the seed matrix *A*_{l} is obtained, since rank(*A*) = *r* and, with an overwhelming probability, rank(*A*_{l}) = *r*, the columns of *A*_{r} must be linear combinations of the columns of *A*_{l}. So there exists a representation matrix *Q* such that *A*_{r} = *A*_{l}*Q*. The part *E*_{r} of the noise should still be column sparse, however, so we have the following ℓ_{2,1}-norm-based linear regression problem:

min_{*Q*} ||*X*_{r} − *A*_{l}*Q*||_{2,1}. (5.17)

If equation 5.17 were solved directly by ADMM (Liu et al., 2012), the complexity of our algorithm would be nearly the same as that of solving the whole original problem. Fortunately, we can solve equation 5.17 column-wise independently, due to the separability of the ℓ_{2,1} norm.

Let *x*_{i}, *q*_{i}, and *e*_{i} be the *i*th columns of *X*_{r}, *Q*, and *E*_{r}, respectively (*i* = 1, …, *n* − *s*_{c}*r*). Then problem 5.17 can be decomposed into *n* − *s*_{c}*r* subproblems:

min_{*q*_{i}} ||*x*_{i} − *A*_{l}*q*_{i}||_{2}, *i* = 1, …, *n* − *s*_{c}*r*. (5.18)

As least squares problems, the subproblems in equation 5.18 have closed-form solutions *q*_{i} = *A*_{l}^{†}*x*_{i}. Then *e*_{i} = *x*_{i} − *A*_{l}*q*_{i}, and the solution to the original problem, equation 5.17, is obtained by stacking the *q*_{i} and *e*_{i} as columns. Interestingly, the solution is the same if the ℓ_{2,1} norm in equation 5.17 is replaced with the Frobenius norm.
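The column-wise least squares step can be sketched as follows. The sizes and noise level are illustrative; since *A*_{l} has more columns than its rank, we use the minimum-norm (pseudo-inverse) solution, and the column-wise solution coincides with the Frobenius-norm solution:

```python
import numpy as np

rng = np.random.default_rng(1)
m, r, sc, nr = 60, 4, 3, 25                       # illustrative sizes
A_l = rng.standard_normal((m, r)) @ rng.standard_normal((r, sc * r))  # seed, rank r
X_r = A_l @ rng.standard_normal((sc * r, nr)) + 0.01 * rng.standard_normal((m, nr))

# Column-wise l2 regression: q_i = A_l^+ x_i (minimum-norm least squares).
pinv = np.linalg.pinv(A_l)
Q_cols = np.stack([pinv @ X_r[:, i] for i in range(nr)], axis=1)

# The Frobenius-norm problem min_Q ||X_r - A_l Q||_F solves the
# same normal equations, so it yields the same (minimum-norm) solution.
Q_fro = np.linalg.lstsq(A_l, X_r, rcond=None)[0]

E_r = X_r - A_l @ Q_cols                           # residual noise estimate
print(np.max(np.abs(Q_cols - Q_fro)))
```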

To compute *A*_{l}^{†} efficiently, we can use the skinny SVD of *A*_{l}, *A*_{l} = *U*_{l}Σ_{l}*V*_{l}^{T}, which is available when solving equation 5.15. Then *A*_{l}^{†} can be written as *V*_{l}Σ_{l}^{−1}*U*_{l}^{T}. We may first compute *U*_{l}^{T}*X*_{r} and then multiply by Σ_{l}^{−1} and *V*_{l}. This little trick reduces the complexity of computing *A*_{r}.

*The Complete Algorithm.* Algorithm 1 summarizes our ℓ_{2,1} filtering algorithm for solving column-sparse R-PCA.

Once the solution to column-sparse R-PCA is obtained, we can obtain the representation matrix *Z* of R-LRR in closed form. Note that we should not compute *Z* naively as it is written, whose complexity would be too high. A cleverer way is as follows. Suppose *U*Σ*V*^{T} is the skinny SVD of *A*; then *Z* = *VV*^{T}. On the other hand, *A* = [*A*_{l}, *A*_{l}*Q*]. So we only have to compute the row space of a short, fat factor of *A*, which has been saved in step 3 of algorithm 1. This can easily be done by its LQ decomposition (Golub & Van Loan, 2012), where *L* is lower triangular and the orthogonal factor spans the row space, from which *Z* follows. Since LQ decomposition is much cheaper than SVD, the above trick is very efficient, and all the matrix-matrix multiplications involved are cheap. The complete procedure for solving the R-LRR problem, equation 2.7, is described in algorithm 2.
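The equivalence between the SVD-based and LQ-based row-space projectors can be checked on a toy short, fat matrix (a stand-in for the factor saved in algorithm 1); the LQ decomposition is obtained via QR of the transpose:

```python
import numpy as np

rng = np.random.default_rng(3)
r, n = 5, 40
M = rng.standard_normal((r, n))        # short, fat, full row rank

# Row-space projector via skinny SVD: P = V V^T.
_, _, Vt = np.linalg.svd(M, full_matrices=False)
P_svd = Vt.T @ Vt

# LQ via QR of M^T: M^T = Q R  =>  M = R^T Q^T, i.e., L = R^T.
Qf, _ = np.linalg.qr(M.T)              # Qf is n x r with orthonormal columns
P_lq = Qf @ Qf.T

# Both project onto the row space of M, so they agree (up to rounding).
print(np.max(np.abs(P_svd - P_lq)))
```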

Unlike LRR, the optimal solution *Z* to the R-LRR problem, equation 2.7, is symmetric, and thus we can directly use its absolute value as the affinity matrix, without symmetrization. After that, we can apply spectral clustering algorithms, such as normalized cuts, to assign each data point to its corresponding subspace.
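A toy numpy illustration of why such an approximately block-diagonal affinity works for spectral clustering: the multiplicity of the zero eigenvalue of the normalized graph Laplacian equals the number of connected components, that is, the number of subspaces here. The block sizes and weights below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
# A toy symmetric, block-diagonal affinity (two subspaces, 10 points each).
B1 = np.abs(rng.standard_normal((10, 10)))
B2 = np.abs(rng.standard_normal((10, 10)))
W = np.zeros((20, 20))
W[:10, :10] = B1 + B1.T                # symmetrize each block
W[10:, 10:] = B2 + B2.T

# Normalized graph Laplacian: L = I - D^{-1/2} W D^{-1/2}.
d = W.sum(axis=1)
Dm = np.diag(1.0 / np.sqrt(d))
L = np.eye(20) - Dm @ W @ Dm

# Count (numerically) zero eigenvalues = number of connected components.
evals = np.linalg.eigvalsh(L)
n_clusters = int(np.sum(evals < 1e-8))
print(n_clusters)   # 2
```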

Although a formal proof for our ℓ_{2,1} filtering algorithm is not yet available, the algorithm intuitively works well. To this end, we assume two conditions to guarantee its exact recoverability. First, to guarantee the exact recovery of the ground truth subspace from the whole matrix *X*, we need to ensure that the same subspace can be fully recovered from the seed matrix *X*_{l}. Applying the result of Zhang et al. (2015) to the seed matrix, we assume that *E*_{0} has sufficiently few nonzero columns, which guarantees that *E*_{l} has small column support with an overwhelming probability; the incoherence conditions and an appropriate regularization parameter are also required. Second, as for the ℓ_{2,1} filtering step, to represent the rest of the whole matrix by *A*_{l}, the seed matrix should span the whole ground truth subspace: Range(*A*) = Range(*A*_{l}). Under these conditions, our ℓ_{2,1} filtering algorithm intuitively succeeds with an overwhelming probability. We leave the rigorous analysis as future work.

*Complexity Analysis.* In algorithm 1, the costs of steps 2 and 3 grow only linearly with the matrix size for a fixed rank *r*. Thus the whole complexity of the ℓ_{2,1} filtering algorithm for solving column-sparse R-PCA is linear in the matrix size. In algorithm 2, for solving the relaxed R-LRR problem, equation 2.7, step 1 has the complexity just analyzed, the LQ decomposition in step 2 is cheap (Golub & Van Loan, 2012), and computing *VV*^{T} in step 3 requires *rn*^{2} time. Thus the whole complexity for solving equation 2.7 is dominated by the *rn*^{2} term.^{5} As most low-rank subspace clustering models require cubic time to solve, due to the SVD or matrix-matrix multiplications in every iteration, our algorithm is significantly faster than state-of-the-art methods.

## 6 Experiments

In this section, we use experiments to illustrate the applications of our theoretical analysis.

### 6.1 Comparison of Optimality on Synthetic Data

We compare the two algorithms, partial ADMM^{6} (Favaro et al., 2011) and REDU-EXPR (Wei & Lin, 2010), mentioned in section 5.2, for solving the nonconvex relaxed R-LRR problem, equation 2.7. Since the traditional ADMM is not convergent, we do not compare with it. Because we want to compare only the quality of solutions produced by the two methods, for REDU-EXPR we temporarily do not use the ℓ_{2,1} filtering algorithm introduced in section 5 to solve column-sparse R-PCA.

The synthetic data are generated as follows. In the ambient linear space, we construct five independent four-dimensional subspaces, whose bases are randomly generated column-orthonormal matrices. Then 200 points are sampled from each subspace by multiplying its basis matrix with a gaussian coefficient matrix, whose entries are independent and identically distributed (i.i.d.). We thus obtain a sample matrix without noise.

We compare the clustering accuracies^{7} as the percentage of corruption increases, where uniformly distributed noise is added at uniformly distributed column positions. We run the test 10 times and report the mean clustering accuracy. Figure 2 presents the comparison of accuracy, where all the parameters are tuned to be the same for both methods. One can see that R-LRR solved by REDU-EXPR is much more robust to column-sparse corruptions than that solved by partial ADMM.
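The accuracy metric described in footnote 7 (relabel each cluster by the ground-truth class that contributes the most samples to it, then score) can be sketched as follows; the toy labels are for illustration only:

```python
import numpy as np

def clustering_accuracy(pred, truth):
    """Relabel each predicted cluster by the majority ground-truth class
    among its members, then report the fraction of correct labels."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    relabeled = np.empty_like(truth)
    for c in np.unique(pred):
        mask = pred == c
        vals, counts = np.unique(truth[mask], return_counts=True)
        relabeled[mask] = vals[np.argmax(counts)]   # majority vote
    return float(np.mean(relabeled == truth))

truth = [0, 0, 0, 1, 1, 1, 2, 2]
pred  = [5, 5, 5, 7, 7, 9, 9, 9]   # cluster ids are arbitrary
print(clustering_accuracy(pred, truth))   # 0.875
```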

| Noise Percentage (%) | 0 | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|---|
| Rank(*Z*) (partial ADMM) | 20 | 30 | 30 | 30 | 30 | 30 |
| Rank(*Z*) (REDU-EXPR) | 20 | 20 | 20 | 20 | 20 | 20 |
| No. of detected noisy columns (partial ADMM) | 0 | 99 | 200 | 300 | 400 | 500 |
| No. of detected noisy columns (REDU-EXPR) | 0 | 100 | 200 | 300 | 400 | 500 |
| Objective (partial ADMM) | 20.00 | 67.67 | 106.10 | 144.14 | 182.19 | 220.24 |
| Objective (REDU-EXPR) | 20.00 | 58.05 | 96.10 | 134.14 | 172.19 | 210.24 |
| Time (s, partial ADMM) | 4.89 | 124.33 | 126.34 | 119.12 | 115.20 | 113.94 |
| Time (s, REDU-EXPR) | 10.67 | 9.60 | 8.34 | 8.60 | 9.00 | 12.86 |

Notes: All the experiments are run 10 times, and the regularization parameter is set to be the same for both methods, where *n* is the data size. The numbers in bold refer to the better results between the two methods: partial ADMM and REDU-EXPR.

### 6.2 Comparison of Speed on Synthetic Data

In this section, we show the great speed advantage of our REDU-EXPR algorithm in solving low-rank recovery models. We compare the algorithms for solving relaxed R-LRR, equation 2.7. We also present the results of solving LRR by ADMM for reference, although it is a slightly different model. Except for our ℓ_{2,1} filtering algorithm, all the code run in this test was provided by Liu et al. (2013), Liu and Yan (2011), and Favaro et al. (2011).

The parameter is set for each method so that the highest accuracy is obtained. We generate clean data as in section 6.1; the only differences are the dimension of the ambient space and the number of points sampled from the subspaces. We compare the speed of the algorithms on corrupted data, where the noise is added in the same way as in Liu et al. (2010) and Liu et al. (2013): 5% of the columns are subjected to gaussian noise with zero mean and standard deviation proportional to the norm of the corresponding clean vector in the subspace. For REDU-EXPR, with or without ℓ_{2,1} filtering, the rank is estimated at its exact value, 20, and the oversampling parameter *s*_{c} is set to 10. The CPU times for increasing data sizes are shown in Table 5. When the corruptions are not heavy, all the methods in this test achieve 100% accuracy. We can see that REDU-EXPR consistently outperforms the ADMM-based methods. With ℓ_{2,1} filtering, the computation time is further reduced, and the advantage of ℓ_{2,1} filtering is more salient when the data size is larger.

| Data Size | LRR (ADMM) | R-LRR (partial ADMM) | R-LRR (REDU-EXPR) | R-LRR (ℓ_{2,1} filtering REDU-EXPR) |
|---|---|---|---|---|
| | 33.0879 | 4.9581 | 1.4315 | 0.6843 |
| | 58.9177 | 7.2029 | 1.8383 | 1.0917 |
| | 370.1058 | 24.5236 | 6.1054 | 1.5429 |
| | 3600 | 124.3417 | 28.3048 | 2.4426 |
| | 3600 | 411.8664 | 115.7095 | 3.4253 |

Note: In this test, REDU-EXPR with ℓ_{2,1} filtering is significantly faster than the other methods, and its computation time grows at most linearly with the data size.

### 6.3 Test on Real Data: AR Face Database

We adopt a simple linear classifier based on ridge regression, where *F* is the feature data and *H* is the label matrix. The classifier is trained as follows. We first run LRR or R-LRR on the original input data and obtain an approximately block-diagonal matrix *Z*. Viewing each column of *Z* as a new observation,^{8} we separate the columns of *Z* into two parts, where one part corresponds to the training data and the other to the test data. We train the ridge regression model on the training samples and use the obtained projection matrix *W* to classify the test samples.
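A minimal numpy sketch of such a ridge regression classifier. The feature construction, sizes, and regularization weight `lam` below are hypothetical stand-ins for the columns of *Z* and the tuned parameter:

```python
import numpy as np

rng = np.random.default_rng(5)
d, n_train, n_test, k = 30, 90, 30, 3   # feature dim, sample counts, classes
lam = 0.1                               # hypothetical regularization weight

# Toy features: well-separated class means (stand-ins for columns of Z).
means = 5.0 * rng.standard_normal((d, k))
y_train = rng.integers(0, k, n_train)
y_test = rng.integers(0, k, n_test)
F = means[:, y_train] + 0.1 * rng.standard_normal((d, n_train))   # feature data
H = np.eye(k)[:, y_train]                                         # 0/1 label matrix

# Ridge regression: W = argmin ||H - W F||_F^2 + lam ||W||_F^2
#                     = H F^T (F F^T + lam I)^{-1}
W = H @ F.T @ np.linalg.inv(F @ F.T + lam * np.eye(d))

# Classify each test column by its largest response.
F_test = means[:, y_test] + 0.1 * rng.standard_normal((d, n_test))
pred = np.argmax(W @ F_test, axis=0)
print(np.mean(pred == y_test))
```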

Unlike the existing literature (Liu et al., 2010, 2013), which manually removed severely corrupted images and shrank the input images to small sizes in order to reduce the computational load, our experiment uses all the full-sized face images. The size of our data matrix is thus 19,800 × 2,574, where each image is reshaped as a column of the matrix, 19,800 is the number of pixels in each image, and 2,574 is the total number of face images. We test LRR (Liu et al., 2010, 2013), solved by ADMM, and relaxed R-LRR, solved by partial ADMM (Favaro et al., 2011), REDU-EXPR (Wei & Lin, 2010), and REDU-EXPR with ℓ_{2,1} filtering, for both classification accuracy and speed. Table 6 shows the results, where the parameters have been tuned to be the best. Since the ADMM-based method requires too much time to converge, we terminate it after 60 hours. This experiment testifies to the great speed advantage of REDU-EXPR and ℓ_{2,1} filtering. Note that with ℓ_{2,1} filtering, REDU-EXPR is more than three times faster than without it, and the accuracy is not compromised.

| Model | Method | Accuracy | CPU Time (h) |
|---|---|---|---|
| LRR | ADMM | - | 10 |
| R-LRR | partial ADMM | 86.3371% | 53.5165 |
| R-LRR | REDU-EXPR | 90.1648% | 0.5639 |
| R-LRR | REDU-EXPR with ℓ_{2,1} filtering | 90.5901% | 0.1542 |

Notes: For fair comparison of both the accuracy and the speed for different algorithms, the parameters are tuned to be the best according to the classification accuracy, and we observe the CPU time. The figures in bold refer to the best results.

## 7 Conclusion and Future Work

In this letter, we investigate the connections among the solutions of some representative low-rank subspace recovery models: R-PCA, R-LRR, R-LatLRR, and their convex relaxations. We show that their solutions can be mutually expressed in closed form. Since R-PCA is the simplest model, it naturally becomes a hinge for all low-rank subspace recovery models. Based on our theoretical findings, under certain conditions we are able to find better solutions to low-rank subspace recovery models, and also to significantly speed up their numerical solution by first solving R-PCA and then expressing their solutions in closed form via that of R-PCA. Since there are randomized algorithms for R-PCA (for example, our proposed ℓ_{2,1} filtering algorithm for column-sparse R-PCA), the computational complexity of solving existing low-rank subspace recovery models can be much lower than that of existing algorithms. Extensive experiments on both synthetic and real-world data testify to the utility of our theory.

As shown in section 5.1.4, our approach may succeed even when the conditions of sections 5.1.1, 5.1.2, and 5.1.3 do not hold. The theoretical analysis of how the data distribution influences the success of our approach, together with the theoretical guarantee of our ℓ_{2,1} filtering algorithm, will be our future work.

## Acknowledgments

We thank Rene Vidal for valuable discussions. H. Zhang and C. Zhang are supported by National Key Basic Research Project of China (973 Program) (nos. 2015CB352303 and 2011CB302400) and National Nature Science Foundation (NSF) of China (nos. 61071156 and 61131003). Z. Lin is supported by NSF China (nos. 61231002 and 61272341), 973 Program of China (no. 2015CB352502), and Microsoft Research Asia Collaborative Research Program. J. Gao is partially supported under Australian Research Council’s Discovery Projects funding scheme (project DP130100364).

## References

*m*-estimator for robust PCA

## Notes

^{1}

Note that Wei and Lin (2010) and Vidal and Favaro (2014) called R-LRR "robust shape interaction" (RSI) and "low-rank subspace clustering" (LRSC), respectively. The two models are essentially the same, differing only in the optimization algorithms. To remind readers that they are both robust versions of LRR that use a denoised dictionary, in this letter we call them "robust low-rank representation" (R-LRR).

^{2}

We emphasize that although there are linear-time SVD algorithms (Avron, Maymounkov, & Toledo, 2010; Mahoney, 2011) for computing the SVD at low cost, which is typically needed in the existing solvers for all the models, linear-time SVD is known to have nonnegligible error. Moreover, even when adopting linear-time SVD, the whole complexity could still be high due to the matrix-matrix multiplications outside the SVD in each iteration, if there is no careful treatment.

^{3}

All complexity appearing in our letter refers to the overall complexity, that is, taking the iteration complexity into account.

^{4}

Liu et al. (2010) reported an accuracy of 62.53% by LRR, but there were only 10 classes in their data set. In contrast, there are 38 classes in our data set.

^{5}

Here we highlight the difference between a cost of exactly *rn*^{2} and one of *O*(*rn*^{2}). The former is independent of the numerical precision: it comes from the few matrix-matrix multiplications needed to form *Z*. In contrast, the constant in the big *O* usually grows with the numerical precision: the more iterations there are, the larger the constant is.

^{6}

The partial ADMM method of Favaro et al. (2011) was designed for a different norm on the noise matrix *E*; here we have adapted it for the ℓ_{2,1} norm.

^{7}

Just as Liu et al. (2010) did, given the ground truth labeling, we set the label of a cluster to be the index of the ground truth that contributes the maximum number of samples to the cluster. Then all these labels are used to compute the clustering accuracy after comparing with the ground truth.

^{8}

Since *Z* is approximately block diagonal, each column of *Z* has few nonzero coefficients, and thus the new observations are suitable for classification.