## Abstract

This letter examines the problem of robust subspace discovery from input data samples (instances) in the presence of overwhelming outliers and corruptions. A typical example is the case where we are given a set of images; each image contains, for example, a face at an unknown location of an unknown size; our goal is to identify or detect the face in the image and simultaneously learn its model. We employ a simple generative subspace model and propose a new formulation to simultaneously infer the label information and learn the model using low-rank optimization. Solving this problem enables us to simultaneously identify the ownership of instances to the subspace and learn the corresponding subspace model. We give an efficient and effective algorithm based on the alternating direction method of multipliers and provide extensive simulations and experiments to verify the effectiveness of our method. The proposed scheme can also be used to tackle many related high-dimensional combinatorial selection problems.

## 1. Introduction

Subspace learning algorithms have recently been adopted for analyzing high-dimensional data in various problems (Jenatton, Obozinski, & Bach, 2010; Wright, Yang, Ganesh, Sastry, & Ma, 2009; Wagner, Wright, Ganesh, Zhou, & Ma, 2009). Assuming the data are well aligned and lie in a low-dimensional linear subspace, these methods can deal with large sparse errors and learn the low-rank subspace of the data. Other approaches (e.g., Elhamifar & Vidal, 2009; Liu, Lin, & Yu, 2010; Luo, Nie, Ding, & Huang, 2011; Favaro, Vidal, & Ravichandran, 2011) have been proposed to cluster data into different subspaces. However, these methods may have difficulty dealing with a class of unsupervised learning scenarios in which a large number of outliers exist. In this letter, we propose a method to discover a low-dimensional linear subspace from a set of data containing both inliers and a significant number of outliers. Figure 1 gives a typical problem setting of this letter, as well as the pipeline of our proposed solution. Here, we are given a set of images, and each image contains a common object (pattern). Our goal is to automatically identify the object and learn its subspace model.

In an abstract sense, we are given a set of data containing both inliers, which lie in a relatively low-dimensional linear subspace, and overwhelming outliers; in addition, the inliers may be corrupted by sparse errors. We make use of two constraints that have been adopted in the multiple instance learning (MIL) literature: the data are divided into different bags, and at least one inlier exists in each bag. These two constraints usually coexist, as shown in Figures 1a and 1b. We may turn each image into a bag, consider image patches containing objects of the same category as inliers, and treat image patches from the background or other categories as outliers. We aim to find the low-dimensional subspace and identify which data belong to it. Obviously, this problem is highly combinatorial and high dimensional. Here we borrow the MIL concept but assume no given negative bags in the training process, as in Zhu, Wu, Wei, Chang, and Tu (2012). The original problem becomes a weakly supervised subspace (pattern) discovery problem. We then transform this problem into a convex optimization formulation, which can be effectively solved by the alternating direction method of multipliers (ADMM) (Gabay & Mercier, 1976; Boyd, Parikh, Chu, Peleato, & Eckstein, 2011). In the proposed formulation, each instance is associated with an indicator of whether the instance is an inlier or an outlier; this is illustrated in Figure 1b. The indicators of instances are treated as latent variables, and our objective is to minimize both the rank of the subspace spanned by the selected instances and the norm of the errors in the selected instances. Thus, by solving this optimization problem, we achieve the goal of discovering the low-dimensional subspace and identifying the instances belonging to it. In Figure 1c, we show the discovered face subspace and the errors of each face image.
We demonstrate the advantage of our algorithm on various object discovery tasks in the experiments. In the remainder of this section, we review work related to our method.

### 1.1. Relations to Existing Work.

In a nutshell, we are addressing a very challenging subspace learning problem. The existing scalable robust subspace learning methods such as robust principal component analysis (RPCA) (Candes, Li, Ma, & Wright, 2011; Xu, Sanghavi, & Caramanis, 2012) can handle a sparse number of errors or outliers, while the dense error correction method in Wright and Ma (2010) can deal with dense corruptions under some restricted conditions. These do not, however, apply to our case since here the inlying instances are very few compared to the outliers and the inliers might even be partially corrupted. Nevertheless, our problem assumes an important additional structure: we know that there is at least one inlier in each set of samples. We will demonstrate that this extra information assists us in solving a seemingly impossible subspace discovery problem.

Robust principal component analysis (Candes et al., 2011) has been successfully applied in background modeling (Candes et al., 2011), texture analysis (Zhang, Ganesh, Liang, & Ma, 2012), and face recognition (Jia, Chan, & Ma, 2012). It requires the input data to be well aligned, a prohibitive requirement in many real-world situations. To overcome this limitation, robust alignment by sparse and low rank (RASL) (Peng, Ganesh, Wright, Xu, & Ma, 2012) was proposed to automatically refine the alignment of input data (e.g., a set of images with a common pattern). However, RASL demands a good initialization of the common pattern at the same scale, whereas here we are dealing with a much less constrained problem in which the common pattern (object) exhibits large scale differences and appears at unknown locations in the images.

Robustly learning a model from noisy input data is a central problem in machine learning. In the multiple instance learning (MIL) literature (Dietterich, Lathrop, & Lozano-Pérez, 1997), the input data are given in the form of bags. Each positive bag contains at least one positive instance, and each negative bag consists of all negative instances. MIL falls into the class of weakly supervised learning problems. In the MIL setting, the two central subtasks are to infer the missing label information and to learn the correct model for the instances. The EM algorithm (Dempster, Laird, & Rubin, 1977) has been widely adopted for inferring missing labels for such MIL problems (Zhang & Goldman, 2001), and likewise for the latent SVM (Yu & Joachims, 2009). One could modify these methods; however, as we will see in our comparison, they lead to greedy iterative optimization that often produces suboptimal solutions. Recently, Lerman, McCoy, Tropp, and Zhang (2012) proposed a convex optimization method, called REAPER, to learn subspace structure in data sets with large fractions of outliers. Compared to an approach like RPCA, this multiplicative approach has better thresholds for recovery in gaussian outlier clouds. However, it is not robust to additional sparse corruption of the data points. The bMCL algorithm (Zhu et al., 2012) deals with both one-class and multiclass object assumptions in object discovery with weak supervision. In previous work, Sankaranarayanan and Davis (2012) proposed one-class multiple instance learning following the idea of discriminative models for target tracking. Wang et al. (2012) proposed an EM approach to learning a low-rank subspace for MIL with the one-class assumption. In this letter, we emphasize the task of robust subspace discovery with an explicit generative model for global optimization.

In the rest of the letter, we refer to the common pattern as an object and focus on the problem of object discovery. Given a set of images, our goal is to automatically discover the common object across all the images, which might appear at an unknown location with an unknown size.

Along the line of object discovery, many methods have also been proposed. Russell, Freeman, Efros, Sivic, and Zisserman (2006) treated each image as a bag of visual words in which the common topics are discovered by latent Dirichlet allocation. Other systems, such as those of Grauman and Darrell (2006) and Lee and Grauman (2009), perform clustering on an affinity or correspondence matrix established by different approaches using different cues in the images. Although these existing methods achieve promising results on several benchmark data sets, they have notable limitations from several perspectives: they often lack a clear generative formulation and performance guarantee; some systems are quite complicated, with many cues, classifiers, and components involved; and they depend strongly on discriminative models.

In contrast, this letter explores a different direction by employing a simple generative subspace model and proposes a new formulation to simultaneously infer the label information and learn the model using low-rank optimization. Unlike EM-like approaches, our method is not sensitive to the initial conditions and is robust to severe corruption. Although our method differs from the classical robust PCA methods, it inherits the same kind of robustness and efficiency from its convex formulation and solution. Extensive simulations and experiments demonstrate the advantages of our method.

## 2. Formulation of Subspace Discovery

Given $K$ bags of candidate object instances, we denote the number of instances in the $k$th bag as $n_k$. The total number of instances is $N = n_1 + \cdots + n_K$. Each instance is represented by a $d$-dimensional vector $x^{(k)}_i \in \mathbb{R}^d$. We may represent all the instances from one bag as columns of a matrix $X^{(k)} = [x^{(k)}_1, \ldots, x^{(k)}_{n_k}] \in \mathbb{R}^{d \times n_k}$. Furthermore, we define $X = [X^{(1)}, \ldots, X^{(K)}] \in \mathbb{R}^{d \times N}$. By default, we assume that each bag contains at least one common object, and the rest are unrelated. To be concrete, we associate each instance $x^{(k)}_i$ with a binary label $z^{(k)}_i \in \{0, 1\}$; $z^{(k)}_i = 1$ indicates that $x^{(k)}_i$ is the common object. Similarly, we define $z^{(k)} = [z^{(k)}_1, \ldots, z^{(k)}_{n_k}]^T$ and $Z = [z^{(1)}; \ldots; z^{(K)}] \in \{0, 1\}^N$. Since each bag contains at least one common object, we have $\vee_i\, z^{(k)}_i = 1$ for all $k \in [K]$, where $\vee$ is an "or" operator and $[K]$ is the set of positive integers less than or equal to $K$.

In general, different instances of the same object are highly correlated. It is reasonable to assume that such instances lie on a low-dimensional subspace. This assumption can be verified empirically for real data. Figure 4c compares the spectrum of a set of instances from the same object with that of random image patches. Even if one applies a robust dimensionality reduction to the set of random image patches, their spectrum remains much flatter (that is, of much higher effective rank) than that of instances from a common object.

However, due to many practical nuisance factors in real images, such as pose variation, cast shadows, and occlusion, the observed instances of the common objects may no longer lie exactly in a low-dimensional subspace. We may model all of these contaminations as sparse errors added to the instances, so we model each instance as $x = a + e$, where $a$ lies in the subspace and $e$ is a sparse vector.

Given $K$ bags of instances $\{X^{(k)}\}_{k=1}^K$, our goal is to find one or more instances from each bag so that all the selected instances form a low-rank matrix $A$, subject to some sparse errors $E$. Equivalently, we need to solve the following problem:

$$\min_{A, E, Z}\ \operatorname{rank}(A) + \lambda \|E\|_0 \quad \text{s.t.} \quad X \operatorname{diag}(Z) = A + E,\quad z^{(k)}_i \in \{0,1\},\quad \vee_i\, z^{(k)}_i = 1,\quad k \in [K], \tag{2.1}$$

where $\operatorname{diag}(Z)$ is an $N \times N$ block-diagonal matrix with $K$ blocks $\operatorname{diag}(z^{(1)}), \ldots, \operatorname{diag}(z^{(K)})$. To distinguish it from the conventional (robust) subspace learning problems, we refer to this problem as subspace discovery.
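As a concrete illustration of the selection operator $X \operatorname{diag}(Z)$, here is a minimal numpy sketch; the sizes and variable names are illustrative, not from the letter:

```python
import numpy as np

# Toy sizes: K = 2 bags with 3 instances each, feature dimension d = 4.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))        # columns are instances

# Binary indicators: select instance 2 of bag 1 and instance 1 of bag 2.
Z = np.array([0, 1, 0, 1, 0, 0], dtype=float)

# X @ diag(Z) keeps the selected columns and zeroes out the rest, so the
# rank term in equation 2.1 only "sees" the chosen instances.
selected = X @ np.diag(Z)
kept = np.flatnonzero(np.linalg.norm(selected, axis=0) > 1e-12)
```

With binary indicators, the unselected columns vanish entirely; the relaxation of section 3.2 replaces these hard zeros with continuous weights.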

## 3. Solution via Convex Relaxation

The problem in equation 2.1 is highly combinatorial and intractable when $d$ and $N$ are large. The recent theory of RPCA (Candes et al., 2011) suggests that rank and sparsity can be effectively minimized via their convex surrogates. Therefore, we replace $\operatorname{rank}(\cdot)$ in the objective function with the nuclear norm $\|\cdot\|_*$ and the $\ell_0$ norm with the $\ell_1$ norm. Thus, equation 2.1 is replaced with the following program:

$$\min_{A, E, Z}\ \|A\|_* + \lambda \|E\|_1 \quad \text{s.t.} \quad X \operatorname{diag}(Z) = A + E,\quad z^{(k)}_i \in \{0,1\},\quad \vee_i\, z^{(k)}_i = 1,\quad k \in [K]. \tag{3.1}$$

Notice that although the objective function is now convex, the constraints on all the binary variables $z^{(k)}_i$ make this program intractable.

### 3.1. A Naive Iterative Solution.

A natural heuristic is to alternate between guessing the labels $Z$ and minimizing the objective with respect to the low-rank $A$ and sparse $E$, in a spirit similar to the EM algorithm. With $Z$ fixed, equation 3.1 becomes a convex optimization problem and can be solved by the RPCA method (Candes et al., 2011). Once the low-rank matrix $A$ is known, one could perform $\ell_1$-regression to evaluate the distance between each point and the subspace:

$$\operatorname{err}(x^{(k)}_i) = \min_{w}\ \|x^{(k)}_i - A w\|_1.$$

Then within each bag, we reassign 1 to a number of instances with errors below a certain threshold and mark the rest as 0. One can iterate this process until convergence. Because there are many outliers, this naive iterative method is very sensitive to initialization, so we have to run it many times with random initializations and pick the best solution. This is similar to the popular RANSAC scheme for robust model estimation. Suppose there are $m_k$ positive instances within the $k$th bag; then the probability that RANSAC would succeed in selecting only the common objects is $\prod_{k=1}^{K} (m_k / n_k)$. Typically $m_k \ll n_k$, so the probability that RANSAC succeeds vanishes exponentially as the number of bags increases. Even if the correct instances are selected, the above regression is not always guaranteed to work well when $A$ contains errors. Nevertheless, with careful initialization and tuning, this method can be made to work for some relatively easy cases and data sets. It can be used as a baseline method to evaluate the improved effectiveness of any new algorithm.
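The vanishing success probability of this RANSAC-style selection is easy to quantify; a small sketch with hypothetical bag sizes:

```python
import math

def ransac_success_prob(m, n):
    """Probability that one random draw per bag selects only common
    objects, when bag k holds m[k] positives among n[k] instances."""
    p = 1.0
    for mk, nk in zip(m, n):
        p *= mk / nk
    return p

# Hypothetical setting: 1 positive among 10 instances per bag.
for K in (5, 20, 50):
    print(K, ransac_success_prob([1] * K, [10] * K))  # decays as 10**-K
```

Even for moderate numbers of bags, the probability of a lucky all-correct draw is negligible, which is why the naive method needs many random restarts.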

### 3.2. Relaxing *Z*.

Instead of requiring each entry of $Z$ to be binary, $z^{(k)}_i \in \{0,1\}$, we relax it to take real values in $\mathbb{R}$. Also, the constraint $\vee_i\, z^{(k)}_i = 1$ can be relaxed to its continuous version $\mathbf{1}^T z^{(k)} = 1$, which is linear. So the optimization problem becomes

$$\min_{A, E, Z}\ \|A\|_* + \lambda \|E\|_1 \quad \text{s.t.} \quad X \operatorname{diag}(Z) = A + E,\quad \mathbf{1}^T z^{(k)} = 1,\quad k \in [K]. \tag{3.3}$$

Although we do not explicitly require $Z$ to be nonnegative, it turns out that the optimal solution to the above program always ensures $Z \geq 0$, as theorem 1 shows. This is due to some special properties of the nuclear norm and the $\ell_1$ norm. For our problem, this is incredibly helpful since the efficiency of the proposed algorithm based on the augmented Lagrangian method decreases quickly as the number of constraints increases. This fact saves us from imposing $N$ extra inequality constraints on the convex program.

**Theorem 1.** *If none of the columns of $X$ is zero, the optimal solution of equation 3.3 is always nonnegative.*

**Proof.** Suppose we have an optimal solution $(A, E, Z)$ where $Z$ has negative entries. Let us consider a triple $(\bar{A}, \bar{E}, \bar{Z})$ constructed in the following way. First, flip signs: let $\tilde{A}$ be the matrix whose $i$th column is $\tilde{a}_i = \operatorname{sign}(z_i)\, a_i$, and construct $\tilde{E}$ likewise. Flipping the sign of any column of a matrix does not change its singular values and thus has no effect on its nuclear norm (if the SVD of $A$ is $U \Sigma V^T$ and $D$ is the diagonal sign matrix, then $AD = U \Sigma (DV)^T$, and $DV$ is still an orthogonal matrix), so $\|\tilde{A}\|_* = \|A\|_*$ and, similarly, $\|\tilde{E}\|_1 = \|E\|_1$. Then, for each bag $k$, set

$$\bar{z}^{(k)} = \frac{|z^{(k)}|}{\mathbf{1}^T |z^{(k)}|}, \qquad \bar{A}^{(k)} = \frac{\tilde{A}^{(k)}}{\mathbf{1}^T |z^{(k)}|}, \qquad \bar{E}^{(k)} = \frac{\tilde{E}^{(k)}}{\mathbf{1}^T |z^{(k)}|}.$$

Since $X \operatorname{diag}(Z) = A + E$, obviously $X \operatorname{diag}(\bar{Z}) = \bar{A} + \bar{E}$; thus, $(\bar{A}, \bar{E}, \bar{Z})$ is a feasible solution, and $\bar{Z}$ is nonnegative. We will show that $\|\bar{A}\|_* + \lambda \|\bar{E}\|_1 < \|A\|_* + \lambda \|E\|_1$, contradicting the fact that $(A, E, Z)$ is optimal. By construction, $\bar{A}$ and $\bar{E}$ are just column-wise downscaled versions of $\tilde{A}$ and $\tilde{E}$. Since for the $k$th bag $\mathbf{1}^T z^{(k)} = 1$, we have $\mathbf{1}^T |z^{(k)}| > 1$ if and only if some entry of $z^{(k)}$ is negative; otherwise, $\mathbf{1}^T |z^{(k)}| = 1$. The columns of $\bar{A}$ and $\bar{E}$ in the bags with negative entries of $z^{(k)}$ are therefore downscaled by a scalar $1 / (\mathbf{1}^T |z^{(k)}|) < 1$. It can be proved that any downscaling of a nonzero column of a matrix will decrease its nuclear norm.

**Lemma 1.** *Given any matrix $Q \in \mathbb{R}^{m \times n}$ with a nonzero column $q_j$, if $\bar{Q}$ is $Q$ with that column scaled by some scalar $\alpha \in [0, 1)$, then $\|\bar{Q}\|_* < \|Q\|_*$.*

**Proof.** Without loss of generality, we assume that the last column $q_n$ gets scaled. Let $Q = [Q_{n-1}, q_n]$, and let $Q_0 = [Q_{n-1}, \mathbf{0}]$ be the matrix obtained by setting the last column to 0. The singular values of $Q_0$ are just the union of the singular values of $Q_{n-1}$ and an additional 0. Let $\bar{Q} = [Q_{n-1}, \alpha q_n] = \alpha Q + (1 - \alpha) Q_0$. According to Horn and Johnson (2012, theorem 7.3.9), $\|\bar{Q}\|_* \leq \alpha \|Q\|_* + (1 - \alpha) \|Q_0\|_*$. So naturally $\|\bar{Q}\|_* \leq \|Q\|_*$, and the equality holds only if $\|Q_0\|_* = \|Q\|_*$. This is impossible since $q_n \neq 0$. We must have $\|\bar{Q}\|_* < \|Q\|_*$.

The construction above can be viewed as a sequence of downscalings on different columns of the sign-flipped matrix $\tilde{A}$, and each downscaling decreases the nuclear norm; the same holds, trivially, for the $\ell_1$ norm of the sparse error. This shows that $\|\bar{A}\|_* + \lambda \|\bar{E}\|_1 < \|A\|_* + \lambda \|E\|_1$, which contradicts the assumption that $(A, E, Z)$ is optimal.
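The column-downscaling property used in this argument is easy to check numerically; a quick sketch with a random matrix (variable names are ours):

```python
import numpy as np

def nuclear_norm(M):
    """Sum of singular values."""
    return np.linalg.svd(M, compute_uv=False).sum()

rng = np.random.default_rng(1)
Q = rng.standard_normal((6, 4))

# Downscale column j by alpha < 1; the nuclear norm strictly decreases
# as long as that column is nonzero.
alpha, j = 0.5, 2
Q_bar = Q.copy()
Q_bar[:, j] *= alpha

# Convexity bound used in the proof: Q_bar = alpha*Q + (1-alpha)*Q0,
# where Q0 is Q with column j zeroed out.
Q0 = Q.copy()
Q0[:, j] = 0.0
bound = alpha * nuclear_norm(Q) + (1 - alpha) * nuclear_norm(Q0)
assert nuclear_norm(Q_bar) <= bound + 1e-9
assert nuclear_norm(Q_bar) < nuclear_norm(Q)
```

The assertions mirror the two steps of the proof: the triangle-inequality bound and the strict decrease for a nonzero column.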

### 3.3. Solving equation 3.3 via Alternating Direction Method of Multipliers.

We solve equation 3.3 by forming its augmented Lagrangian,

$$L_\mu(A, E, Z, Y) = \|A\|_* + \lambda \|E\|_1 + \langle Y,\, X \operatorname{diag}(Z) - A - E \rangle + \frac{\mu}{2} \|X \operatorname{diag}(Z) - A - E\|_F^2,$$

and alternately minimizing it with respect to $A$, $E$, and $Z$ (keeping the bag constraints $\mathbf{1}^T z^{(k)} = 1$ explicit), followed by a dual update of $Y$. Fortunately, the three minimization subproblems all have closed-form solutions: the $A$-update is singular value thresholding, the $E$-update is entrywise soft thresholding, and the $Z$-update is a constrained least-squares problem. We next provide the details.

With $A$, $E$, and $Y$ fixed, the $Z$-update minimizes $\frac{\mu}{2} \|X \operatorname{diag}(Z) - P\|_F^2$ subject to $\mathbf{1}^T z^{(k)} = 1$, where $P = A + E - \frac{1}{\mu} Y$. Denote the $i$th column of $P^{(k)}$ as $p^{(k)}_i$. Since the columns of $X \operatorname{diag}(Z)$ are $z^{(k)}_i x^{(k)}_i$, the problem decouples over bags:

$$\min_{z^{(k)}}\ \sum_{i=1}^{n_k} \|z^{(k)}_i x^{(k)}_i - p^{(k)}_i\|^2 \quad \text{s.t.} \quad \mathbf{1}^T z^{(k)} = 1.$$

Directly applying the standard least-squares technique to the vectorized system would require us to compute the pseudo-inverse of a high-dimensional matrix. However, the Hessian of each bag's subproblem is the diagonal matrix $\operatorname{diag}(\|x^{(k)}_1\|^2, \ldots, \|x^{(k)}_{n_k}\|^2)$, so the KKT conditions reduce to solving for one Lagrange multiplier $\nu_k$ per bag; a pseudo-inverse is needed only for a matrix in $\mathbb{R}^{K \times K}$, which is itself diagonal. Explicitly,

$$z^{(k)}_i = \frac{\langle x^{(k)}_i, p^{(k)}_i \rangle - \nu_k}{\|x^{(k)}_i\|^2}, \qquad \nu_k = \frac{\sum_i \langle x^{(k)}_i, p^{(k)}_i \rangle / \|x^{(k)}_i\|^2 - 1}{\sum_i 1 / \|x^{(k)}_i\|^2}.$$

The alternating minimization process in equation 3.7 is known as the alternating direction method of multipliers (ADMM) (Gabay & Mercier, 1976). A comprehensive survey of ADMM is given in Boyd et al. (2011); Lin et al. (2010) introduced it to the field of low-rank optimization. ADMM is not always guaranteed to converge to the optimal solution. If there are only two alternating terms, its convergence has been well studied and established in Gabay and Mercier (1976). However, less is known about convergence when there are more than two alternating terms, despite strong empirical evidence (Zhang et al., 2012). Tao and Yuan (2011) obtained convergence for a certain family of three-term alternation functions (applied to the noisy principal component pursuit problem). However, the scheme they proposed is different from the direct ADMM in equation 3.7, and it is also computationally heavy in practice. The convergence of general ADMM remains an open problem, although in practice it admits a simple and fast implementation. Nevertheless, while this letter was under review, there were developments in the study of ADMM (Ma & Zou, 2013) suggesting that one can design a convergent ADMM algorithm for the problem studied here. For instance, we could simply group the variables E and Z together and apply the proximal ADMM algorithm suggested in Ma and Zou (2013), which results in a slight modification to the proposed algorithm. Such a proximal ADMM is guaranteed to converge. However, in practice, it might not converge faster than the proposed algorithm, which exploits the natural separable structure of the augmented Lagrangian function among the three sets of variables A, E, and Z. In our experience, the proposed algorithm works extremely well in practice and meets our application goals.
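The overall alternation can be sketched as follows, assuming standard proximal steps (singular value thresholding for the nuclear norm, soft thresholding for the $\ell_1$ norm) and the per-bag least-squares update for the indicators; the parameter choices and iteration count here are ours, not the letter's:

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft(M, tau):
    """Entrywise soft thresholding: proximal operator of the l1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def subspace_discovery(X, bag_sizes, lam=0.1, mu=1.0, iters=100):
    """Alternate closed-form updates of A, E, and Z, then a dual ascent
    step on Y, for the relaxed program in equation 3.3 (sketch)."""
    d, N = X.shape
    A = np.zeros((d, N)); E = np.zeros((d, N)); Y = np.zeros((d, N))
    Z = np.full(N, 1.0 / max(bag_sizes))
    starts = np.cumsum([0] + list(bag_sizes))
    for _ in range(iters):
        XZ = X * Z                                    # X @ diag(Z), column-wise
        A = svt(XZ - E + Y / mu, 1.0 / mu)
        E = soft(XZ - A + Y / mu, lam / mu)
        P = A + E - Y / mu
        for s0, s1 in zip(starts[:-1], starts[1:]):   # per-bag Z update
            Xk, Pk = X[:, s0:s1], P[:, s0:s1]
            h = (Xk * Xk).sum(axis=0)                 # diagonal Hessian
            c = (Xk * Pk).sum(axis=0)
            nu = (np.sum(c / h) - 1.0) / np.sum(1.0 / h)
            Z[s0:s1] = (c - nu) / h
        Y = Y + mu * (X * Z - A - E)
    return A, E, Z

# Tiny demo run (sizes are illustrative): 4 bags of 3 instances, d = 8.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((8, 12))
A, E, Z = subspace_discovery(X_demo, [3, 3, 3, 3], iters=30)
```

Each bag's indicator weights sum to 1 after every sweep, since the $Z$-update enforces the linear constraint exactly.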

## 4. Simulations and Experiments

In this section, we conduct simulations and experiments on both synthetic and real data, for different object discovery applications, to verify the effectiveness of our method. We name the method described in section 3.1 the naive iterative method (NIM) and call the relaxed method ADMM. In all our experiments, we set the weighting parameter λ as a function of *d*, the dimension of the instance feature.

### 4.1. Robust Subspace Learning Simulation.

In order to investigate the ability of the proposed ADMM method to recover the indicators of inlier instances, in this experiment we generate synthetic data with 50 bags; each bag contains 10 instances: 1 positive instance and 9 negative instances. The dimension of each instance is *d*=500. First, the positive instances are generated by linearly combining *r* randomly generated vectors whose entries are independent and identically distributed (i.i.d.) standard gaussian, and the negative instances are independently generated random vectors following an i.i.d. normal distribution. Then every instance (whether positive or negative) is normalized to unit norm. Finally, large sparse errors are added to all instances; the sparsity ratio of the error is *s*, and the values of the error are uniformly distributed in the range [−1, 1].
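A sketch of this simulation protocol (smaller sizes in the demo call to keep it fast; function and variable names are ours):

```python
import numpy as np

def make_bags(K=50, n=10, d=500, r=5, s=0.1, rng=None):
    """One positive + (n-1) negative instances per bag; positives are
    random combinations of r shared gaussian basis vectors, every
    instance is l2-normalized, then sparse errors in [-1, 1] with
    sparsity ratio s are added to all instances."""
    if rng is None:
        rng = np.random.default_rng(0)
    basis = rng.standard_normal((d, r))
    X, labels = [], []
    for _ in range(K):
        pos = basis @ rng.standard_normal(r)
        neg = rng.standard_normal((d, n - 1))
        bag = np.column_stack([pos, neg])
        bag /= np.linalg.norm(bag, axis=0)            # unit l2-norm columns
        mask = rng.random(bag.shape) < s              # sparse support
        bag += mask * rng.uniform(-1, 1, bag.shape)   # sparse errors
        X.append(bag)
        labels += [1] + [0] * (n - 1)
    return np.hstack(X), np.array(labels)

X, labels = make_bags(K=5, n=10, d=50)
```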

We investigate the performance as *r* (the rank of the subspace) and *s* (the sparsity level of the error) vary; *r* ranges from 1 to 31, and *s* ranges from 0 to 0.3. For each test, we denote the set of indexes whose corresponding ground-truth indicator values are 1 as $S$ and the set of indexes whose corresponding recovered indicator values are larger than a threshold $\tau$ as $S_\tau$. The accuracy of the recovered indicators is defined as $|S \cap S_\tau| / |S|$. Given the sparsity ratio and the rank of the subspace, we run five random tests and report the average accuracy of the recovered indicators for ADMM (under different $\tau$) and NIM (randomly initialized) in Figure 2. In Figure 2a, $\tau = 0.5$; 0.5 is a fair value, since there is only one recovered indicator per bag with a value larger than 0.5. In Figure 2b, $\tau = 0.99$, a very strict value; the accuracy of the recovered indicators under this threshold shows how exact our relaxation in section 3.2 is. The solution of NIM in Figure 2c is discrete, so no threshold is needed there. From the results in Figure 2, we see that NIM works only when the positive instances lie in a very low-rank subspace and there is no error. Whether the threshold is 0.5 or 0.99, the working range of ADMM is strikingly larger than that of NIM. Comparing the results in Figures 2a and 2b, we find that exactly recovering the indicators (say, with indicator values of recovered instances larger than 0.99) requires the positive instances to lie in a lower-dimensional subspace and to contain less error.
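The accuracy measure can be sketched as follows, assuming the natural set-based definition $|S \cap S_\tau| / |S|$ (values below are illustrative):

```python
import numpy as np

def indicator_accuracy(z_true, z_rec, tau):
    """|S ∩ S_tau| / |S|: fraction of ground-truth positives whose
    recovered indicator exceeds the threshold tau."""
    S = set(np.flatnonzero(z_true == 1))
    S_tau = set(np.flatnonzero(z_rec > tau))
    return len(S & S_tau) / len(S)

z_true = np.array([1, 0, 0, 1, 0, 0])
z_rec  = np.array([0.995, 0.02, 0.02, 0.55, 0.25, 0.20])
print(indicator_accuracy(z_true, z_rec, 0.5))   # 1.0: both positives found
print(indicator_accuracy(z_true, z_rec, 0.99))  # 0.5: only one exceeds 0.99
```

The gap between the two thresholds illustrates the difference between Figures 2a and 2b: a strict threshold counts only near-exact recoveries.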

#### 4.1.1. Multiple Positive Instances in One Bag.

The simulations so far focus on the situation where only one positive instance exists in each bag. Now we study how ADMM deals with multiple positive instances per bag. We put three positive instances in each bag; they are randomly drawn from the same subspace and corrupted with large sparse errors, so the three positive instances in each bag are not identical. For different values of *r* and *s*, we run ADMM five times. The values of the recovered indicators are used for plotting the precision-recall curve; results are shown in Figure 3. Given a threshold $\tau$, precision and recall are calculated as $|S_\tau \cap S| / |S_\tau|$ and $|S_\tau \cap S| / |S|$, respectively. As shown in Figure 3a, the performance of ADMM increases as the error becomes sparser; when there is no error, ADMM is able to perfectly identify all positive instances. Figure 3b shows that a higher subspace rank is required when more positive instances exist in each bag. When the rank of the subspace is 15 and the sparsity level of the error is 10%, ADMM is able to recover the indicators of 99% of the positive instances with 100% precision. We observe that the current formulation for the subspace discovery problem in equation 3.3 has difficulty dealing with multiple positive instances in some other settings.
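Precision and recall at a threshold $\tau$ can be computed as below, assuming the standard set-based definitions (values are illustrative):

```python
import numpy as np

def precision_recall(z_true, z_rec, tau):
    """Assumed definitions: precision = |S_tau ∩ S| / |S_tau| and
    recall = |S_tau ∩ S| / |S| at indicator threshold tau."""
    S = set(np.flatnonzero(z_true == 1))
    S_tau = set(np.flatnonzero(z_rec > tau))
    hits = len(S & S_tau)
    precision = hits / len(S_tau) if S_tau else 1.0
    recall = hits / len(S)
    return precision, recall

z_true = np.array([1, 1, 0, 0, 1, 0])
z_rec  = np.array([0.9, 0.8, 0.7, 0.1, 0.6, 0.2])
p, r = precision_recall(z_true, z_rec, 0.65)
```

Sweeping the threshold over the range of recovered indicator values traces out a precision-recall curve like those in Figure 3.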

### 4.2. Aligned Face Discovery Among Random Image Patches.

We illustrate the effect of ADMM for object discovery by finding well-aligned face images among many randomly selected image patches. Face images are from the Yale face data set (Georghiades, Belhumeur, & Kriegman, 2001), which consists of 165 frontal faces of 15 individuals. The other image patches are randomly selected from the PASCAL image data set (Everingham, Van Gool, Williams, Winn, & Zisserman, 2011). We design bags and instances as follows: the 165 face images are placed in 165 bags; besides the face image, each bag contains 9 image patches from the PASCAL data set; every image or patch is normalized to 64 × 64 pixels and then vectorized to a 4096-dimensional feature. Some of the images in the bags are shown in Figure 4a.

To evaluate performance on this face discovery task, we take the image with the maximum indicator value in each bag and then calculate the percentage of Yale faces among these images as the accuracy of face discovery. Because the negative instances are randomly selected, we run the experiment five times. The average accuracy and standard deviation of ADMM and NIM (randomly initialized) are 99.5 ± 0.5% and 77.8 ± 3.5%, respectively. Some of the faces discovered by ADMM are shown in Figure 4b. As the figure shows, facial expressions and glasses are removed from the original images, so the repaired faces are better approximated by a low-dimensional subspace.

### 4.3. Object Discovery on Real-World Images.

The task of object discovery has become a major topic for reducing the manual labeling effort required to learn object classes, and it is a challenging task. We are given a set of images, each containing one or more instances of the same object class; in contrast to the fully supervised scenario, the locations of the objects are not given. Unlike subspace learning with simulated data, the appearance of an object varies greatly in real-world images, which requires image descriptors that are somewhat robust to substantial pose variations, for example, HoG and LBP. Moreover, the location and scale of the objects are unknown, which means the number of instances can rise to the millions. To address this problem, we use an existing unsupervised salient object detection algorithm (e.g., Feng, Wei, Tao, Zhang, & Sun, 2011) to reduce the number of instances per bag or image. We choose the HoG and LBP descriptors for characterizing objects because objects from the same category with the same view may not have similar color or texture, but they often have similar shapes. Both HoG and LBP show good performance in supervised object detection (Felzenszwalb, Girshick, McAllester, & Ramanan, 2010; Ahonen, Hadid, & Pietikainen, 2006). The common shape structures of objects are the subspaces we want to discover.
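The actual HoG and LBP implementations are involved; purely to illustrate shape-based description, here is a toy single-cell gradient-orientation histogram (not the descriptors used in the experiments, which add cells, blocks, and block normalization):

```python
import numpy as np

def orientation_histogram(img, bins=9):
    """Toy HoG-like descriptor: one histogram of unsigned gradient
    orientations, weighted by gradient magnitude, l2-normalized."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # unsigned orientation
    hist, _ = np.histogram(ang, bins=bins, range=(0, np.pi), weights=mag)
    n = np.linalg.norm(hist)
    return hist / n if n > 0 else hist

img = np.tile(np.arange(16.0), (16, 1))   # pure horizontal intensity ramp
h = orientation_histogram(img)            # all mass falls in one bin
```

Because the descriptor depends on edge directions rather than raw intensities, patches of the same shape produce similar vectors regardless of color, which is what makes a low-rank model of shape plausible.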

In the experiments of object discovery on real-world images, we evaluate the proposed ADMM algorithm on four diverse data sets: the PASCAL 2006 data set (Everingham, Zisserman, Williams, & Van Gool, 2006), the PASCAL 2007 data set (Everingham, Van Gool, Williams, Winn, & Zisserman, 2007), the face detection data set and benchmark (FDDB) subset (Jain & Learned-Miller, 2010), and the ETHZ Apple logo class (Ferrari, Tuytelaars, & Van Gool, 2006), and compare ADMM with the state-of-the-art object discovery methods. Because different performance evaluation protocols are used, we give the experimental results for the PASCAL 2006 and 2007 data sets and for the FDDB subset and ETHZ Apple logo class in two different parts.

#### 4.3.1. PASCAL 2006 and 2007 Data Sets.

The PASCAL 2006 and 2007 data sets are challenging and have been widely used as benchmarks for evaluating supervised object detection and image classification systems. For the object discovery task, we follow the protocol of Deselaers, Alexe, and Ferrari (2012). Performance is evaluated by the CorLoc measure, the percentage of correctly localized objects according to the PASCAL criterion (window intersection-over-union > 0.5). Four subsets are taken from the PASCAL 2006 and 2007 data sets: PASCAL06-6×2, PASCAL06-all, PASCAL07-6×2, and PASCAL07-all. PASCAL06-6×2 contains 779 images from 12 classes or views; PASCAL06-all contains 2184 images from 33 classes or views; PASCAL07-6×2 contains 463 images from 12 classes or views; and PASCAL07-all contains 2047 images from 45 classes or views. (For more details about the data sets, as well as the evaluation protocol, refer to Deselaers et al., 2012.)

Each image is considered a bag, and a patch in the image detected by the salient object detector in Feng et al. (2011) is considered an instance. The score threshold parameter in Feng et al. (2011) controls the number of salient objects detected; its value is set separately for the PASCAL06-6×2 and PASCAL06-all data sets and for the PASCAL07-6×2 and PASCAL07-all data sets. Standard HoG and LBP features are then extracted for each image patch. We run the proposed ADMM method on these images and report the image patch with the maximum indicator value as the detected object. The results of ADMM are reported in Table 1 and compared with the results of other methods (Pandey & Lazebnik, 2011; Deselaers et al., 2012; Chum & Zisserman, 2007; Russell, Freeman, Efros, Sivic, & Zisserman, 2006; Lampert, Blaschko, & Hofmann, 2009).

| Method | PASCAL06-6×2 | PASCAL06-all | PASCAL07-6×2 | PASCAL07-all |
|---|---|---|---|---|
| ESS (Lampert et al., 2009) | 24 | 21 | 27 | 14 |
| Russell et al. (2006) | 28 | 27 | 22 | 14 |
| Chum and Zisserman (2007) | 45 | 34 | 33 | 19 |
| ADMM (our method) | 57 | 43 | 40 | 27 |
| Deselaers et al. (2012) | **64** | **49** | 50 | 28 |
| Pandey and Lazebnik (2011) | NA | NA | **61** | **30** |


Note: The numbers in bold indicate the best results in each column.

Table 1 shows favorable results by our method compared with those of Chum and Zisserman (2007), Russell et al. (2006), and Lampert et al. (2009). The state-of-the-art performances are reported in Pandey and Lazebnik (2011) and Deselaers et al. (2012), which either use extra bounding-box annotations or adopt complicated object models (Felzenszwalb et al., 2010). Here we study a generative model of subspace learning with a clean and effective solution. Figure 5 shows some discovered objects on the PASCAL-all data set.

#### 4.3.2. FDDB Subset and ETHZ Apple Logo Class.

The FDDB subset contains 440 face images; the ETHZ Apple logo class contains 36 images with Apple logos. The appearance of the objects and backgrounds in the two data sets is quite diverse. On these two data sets, we use HoG only as the descriptor. In terms of the formulation in this letter, the low-rank term corresponds to the common shape structure of the faces or Apple logos (since we use HoG as the descriptor), and the sparse error term corresponds to occlusions and appearance variations of the faces or Apple logos. We run ADMM and obtain the indicator value of each instance. For each image, the indicator value is normalized by dividing by the maximum indicator value in the bag; the normalized indicator value is used as the score of each patch.

A selected patch is correct if it intersects the ground-truth object by more than half of their union (the PASCAL criterion). Object discovery performance is evaluated by precision-recall curves (Everingham et al., 2011), generated by varying the score threshold, and by average precision (AP) (Everingham et al., 2011), computed by averaging the precisions corresponding to different recalls at regular intervals.
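The AP computation by averaging precisions at regular recall levels can be sketched as follows; we assume the PASCAL-style 11-point interpolated variant here:

```python
import numpy as np

def average_precision_11pt(recalls, precisions):
    """11-point interpolated AP: mean of the maximum precision attained
    at recall >= t, for t in {0.0, 0.1, ..., 1.0}."""
    recalls = np.asarray(recalls)
    precisions = np.asarray(precisions)
    ap = 0.0
    for t in np.linspace(0.0, 1.0, 11):
        mask = recalls >= t
        ap += precisions[mask].max() if mask.any() else 0.0
    return ap / 11.0

# A perfect detector: precision 1 at every recall level.
print(average_precision_11pt([0.0, 0.5, 1.0], [1.0, 1.0, 1.0]))  # 1.0
```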

We compare ADMM with four methods: the baseline saliency detection method (SD) in Feng et al. (2011), the state-of-the-art discriminative object discovery approach named bMCL in Zhu et al. (2012), the naive iterative method initialized with the saliency score (NIM-SD), and the naive iterative method with random initialization (NIM-Rand). The parameters of the four methods are adjusted to ensure they achieve their best performance. The AP of NIM-Rand is the average value over three rounds. The APs of all methods on both data sets are given in Table 2. As we can see, ADMM significantly improves on the saliency detection results and outperforms all the other competing methods. The precision-recall curves of the four methods in Figure 6 confirm this as well. The SD method is a purely bottom-up approach. The other three methods assume that all of the input images contain a common object class of interest. The bMCL method (Zhu et al., 2012) is a discriminative method; it obtains state-of-the-art performance on image data sets with simple backgrounds, such as the SIVAL data set (Rahmani, Goldman, Zhang, Krettek, & Fritts, 2005). The images in the FDDB data set are more cluttered, posing additional difficulty. Our methods, both ADMM and NIM-SD, are able to deal with a cluttered background since they do not seek to discriminate the object from the background, an important property in tackling the problem of object discovery and subspace learning. The patches with maximum scores by SD, bMCL, NIM-SD, and ADMM are shown in Figure 7.

| Method | FDDB Subset | ETHZ Apple Logo |
|---|---|---|
| SD | 0.148 | 0.532 |
| bMCL | 0.619 | 0.697 |
| NIM-SD | 0.671 | 0.826 |
| NIM-Rand | 0.669 | 0.726 |
| ADMM (our method) | **0.745** | **0.836** |

Note: The numbers in bold indicate the best results in each column.

In the experiments, we observe two situations in which ADMM might fail: the objects are not contained in the detected salient image windows, or the objects exhibit large variations due to articulation or nonrigid transformation and therefore do not reside in a common low-rank subspace. Note that in this letter, we focus on the problem of subspace learning and assume that the common pattern spans a low-rank subspace.

### 4.4. Instance Selection for Multiple Instance Learning.

In this experiment, we show how to apply the proposed ADMM to the traditional MIL problem (Dietterich et al., 1997). Our basic idea is to use ADMM to directly distinguish the positive instances from the negative instances in positive bags; the positive instances, together with all the negative instances from negative bags, are used to train an instance-level classifier (e.g., an SVM with an RBF kernel for the MIL task). In the testing stage, we use the learned instance-level SVM classifier for bag classification based on a noisy-or model: if any instance in a bag is predicted positive, the bag is identified as positive, and otherwise as negative.
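The noisy-or bag rule can be sketched minimally as follows, assuming instance scores have already been produced by the trained instance-level classifier (the function name and the zero decision threshold are illustrative, not from the letter):

```python
import numpy as np

def classify_bag(instance_scores, threshold=0.0):
    """Noisy-or bag rule: a bag is positive iff at least one of its
    instances scores above the decision threshold."""
    return 1 if np.max(np.asarray(instance_scores)) > threshold else 0
```

For example, a bag whose instances score `[-1.0, 0.5]` is labeled positive because one instance exceeds the threshold, while `[-1.0, -0.2]` yields a negative bag.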

To use ADMM to distinguish positive from negative instances, we follow the assumption made previously in this letter: positive instances lie in a low-dimensional subspace. In practice, we collect all positive bags as input to the ADMM algorithm in algorithm 1 and obtain the indicator value of each instance. Within each bag, the indicator values are normalized by dividing by the maximum indicator value in the bag. The instances whose normalized indicator values are larger than an upper threshold are labeled as positive instances; the instances whose normalized indicator values are less than a lower threshold are labeled as negative instances. In this experiment, we fix the upper threshold to 0.7 and the lower threshold to 0.3. The instances with normalized indicator values between 0.3 and 0.7 are omitted and not used for training the instance SVM classifier. When training the RBF kernel SVM, we adopt LibSVM (Chang & Lin, 2011).
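The per-bag normalization and thresholding step described above can be sketched as follows (function and variable names are ours; the 0.3/0.7 thresholds follow the values stated in the text):

```python
import numpy as np

def select_instances(bags_indicators, upper=0.7, lower=0.3):
    """For each bag, normalize indicator values by the bag maximum,
    then label instances above `upper` as positive (+1) and below
    `lower` as negative (-1); instances in between are omitted.
    Returns, per bag, a list of (instance_index, label) pairs."""
    selected = []
    for ind in bags_indicators:
        ind = np.asarray(ind, dtype=float)
        norm = ind / ind.max()  # divide by the bag's maximum indicator
        labels = []
        for i, v in enumerate(norm):
            if v > upper:
                labels.append((i, +1))   # confident positive instance
            elif v < lower:
                labels.append((i, -1))   # confident negative instance
            # values in [lower, upper] are ambiguous and dropped
        selected.append(labels)
    return selected
```

The selected instances would then serve as training data for the RBF kernel SVM; since normalization divides by the bag maximum, at least one instance per positive bag is always retained as positive.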

We evaluate the proposed method on five popular benchmark data sets: Musk1, Musk2, Elephant, Fox, and Tiger. Detailed descriptions of the data sets can be found in Dietterich and Lathrop (1997) and Andrews et al. (2003). We compare our method with MI-SVM and mi-SVM (Andrews et al., 2003), MILES (Chen et al., 2006), EM-DD (Zhang & Goldman, 2001), PPMM kernel (Wang et al., 2008), MIGraph and miGraph (Zhou et al., 2009), and MI-CRF (Deselaers & Ferrari, 2010) via 10 rounds of 10-fold cross-validation and report the average accuracy and the standard deviation in Table 3. Some of the results were obtained from different studies, for which the standard deviations are not available. The average accuracy over the five tested data sets is reported in the far right column. The best performance in each column is noted in bold.

| Data Sets | Musk1 | Musk2 | Elephant | Fox | Tiger | Average |
|---|---|---|---|---|---|---|
| MI-SVM | 77.9 | 84.3 | 81.4 | 59.4 | 84.0 | 77.4 |
| mi-SVM | 87.4 | 83.6 | 82.0 | 58.2 | 78.9 | 78.0 |
| MILES | 86.3 | 87.7 | – | – | – | – |
| EM-DD | 84.8 | 84.9 | 78.3 | 56.1 | 72.1 | 75.2 |
| PPMM kernel | **95.6** | 81.2 | 82.4 | 60.3 | 80.2 | 79.9 |
| MI-CRF | 87.0 | 78.4 | 85.0 | 65.0 | 79.5 | 79.0 |
| ADMM (our method) | 89.9 ± 0.7 | 85.0 ± 1.6 | 79.6 ± 0.9 | **65.4 ± 1.2** | 81.5 ± 1.0 | 80.3 |
| MIGraph | 90.0 ± 3.8 | 90.0 ± 2.7 | 85.1 ± 2.8 | 61.2 ± 1.7 | 81.9 ± 1.5 | 81.6 |
| miGraph | 88.9 ± 3.3 | **90.3 ± 2.6** | **86.8 ± 0.7** | 61.6 ± 2.8 | **86.0 ± 1.6** | **82.7** |

Notes: Comparisons on five benchmark data sets. MI-SVM and mi-SVM are from Andrews, Tsochantaridis, and Hofmann (2003); MILES from Chen, Bi, and Wang (2006); EM-DD from Zhang and Goldman (2001); PPMM kernel from Wang, Yang, and Zha (2008); MIGraph and miGraph from Zhou, Sun, and Li (2009); and MI-CRF from Deselaers and Ferrari (2010). ADMM (our method) refers to the approach proposed in this letter. The numbers in bold indicate the best results in each column.

As shown in Table 3, the best results are reported by MIGraph and miGraph, which exploit graph structure based on instance affinities. We focus on comparing with mi-SVM, which learns to weigh instances by maximizing the margin between the positive and negative instances under the MIL constraints via an iterative SVM. That problem is nonconvex, and the optimization method of mi-SVM does not guarantee even a local optimum. In contrast, our method selects the instances of a common subspace with a convex formulation and obtains promising results.

## 5. Conclusion

In this letter, we have proposed a robust formulation for unsupervised subspace discovery. We relax the highly combinatorial high-dimensional problem into a convex program and solve it efficiently with the augmented Lagrangian multiplier method. Unlike approaches based on discriminative training, our method discovers objects of interest by exploiting common patterns across the input data. We demonstrate a clear advantage of our method over competing algorithms on a variety of benchmark data sets. Our results suggest that an explicit low-rank subspace assumption with a robust formulation naturally handles subspace discovery in the presence of overwhelming outliers, which broadens the scope of applications for the emerging family of subspace learning methods, including RPCA-based approaches.

## Acknowledgments

This work was supported by Microsoft Research Asia, NSF IIS-1216528 (IIS-1360566), NSF CAREER award IIS-0844566 (IIS-1360568), NSFC 61173120, NSFC 61222308, and the Chinese Program for New Century Excellent Talents in University. X.W. was supported by Microsoft Research Asia Fellowship 2012. We thank John Wright for encouraging discussions, David Wipf for valuable comments, and Jun Sun for his helpful discussion on the proof of theorem 1.

## References

*Proceedings of the International Conference on Computer Vision*(pp.