## Abstract

Since combining features from heterogeneous data sources can significantly boost classification performance in many applications, it has attracted much research attention over the past few years. Most existing multiview feature analysis approaches learn features in each view separately, ignoring knowledge shared by multiple views. Different views of features may have intrinsic correlations that are beneficial to feature learning. Therefore, it is assumed that multiple views share subspaces from which common knowledge can be discovered. In this letter, we propose a new multiview feature learning algorithm that aims to exploit common features shared by different views. To achieve this goal, we propose a feature learning algorithm in a batch mode, by which the correlations among different views are taken into account. Multiple transformation matrices for different views are simultaneously learned in a joint framework. In this way, our algorithm can exploit potential correlations among views as supplementary information that further improves performance. Since the proposed objective function is nonsmooth and difficult to solve directly, we propose an iterative algorithm for effective optimization. Extensive experiments have been conducted on a number of real-world data sets. Experimental results demonstrate superior classification performance against all the compared approaches. The convergence guarantee has also been validated experimentally.

## 1 Introduction

With the proliferation of social networks, such as Facebook, Twitter, and Flickr, the volume of multimedia data increases exponentially, inevitably posing challenges to effective and efficient management of big media data (Yang et al., 2012; Chang, Nie, Wang et al., 2016; Chang, Nie, Yang, Zhang, & Huang, 2016; Fan, Chang, & Tao, 2017). Content-based categorization using visual features has long been regarded as a practical solution to this problem. In computer vision, a number of low-level visual features have been designed to describe visual information in a compact way. Meanwhile, machine learning researchers have proposed many feature analysis approaches to exploit these features further. One direction is to utilize multiple heterogeneous features from different views (Chang, Yu, Yang, & Xing, 2016a, 2016b; Wang, Chang, Li, Long et al., 2016; Xue et al., 2017). Intuitively, an object can be characterized in various ways from different perspectives. For example, to describe visual objects in images, histogram of oriented gradients (HOG) (Dalal & Triggs, 2005), speeded-up robust features (SURF) (Bay, Tuytelaars, & Gool, 2006), and scale-invariant feature transform (SIFT) (Lowe, 2004) have been widely used to extract local visual features. Features in various views describe the data from different perspectives, each emphasizing particular properties; for example, SIFT is robust to noise, changes of illumination, and rotation. If the heterogeneous features are properly integrated into a well-designed algorithm, better performance can be expected. Multiview feature learning stems from this motivation. Previous work (Wang et al., 2014; Cai, Nie, & Huang, 2013; Chen et al., 2012; Conrad & Mester, 2013; Luo et al., 2017) has demonstrated that multiview feature learning can reduce noise and improve statistical significance.
A number of multiple kernel learning (MKL) algorithms (Sonnenburg, Rätsch, Schäfer, & Schölkopf, 2006; Suykens & Vandewalle, 1999; Kloft, Brefeld, Laskov, & Sonnenburg, 2008; Yu et al., 2010; Ye, Ji, & Chen, 2008; Lanckriet, Cristianini, Bartlett, Ghaoui, & Jordan, 2004; Wang, Chang, Li, Sheng, & Chen, 2016) also learn different multiple kernels from heterogeneous features in multiview learning frameworks. Normally, MKL-based approaches learn an ensemble of kernels for a certain application to achieve better performance.

Although multiview learning approaches have achieved good performance in various applications, they neglect two problems that may be essential to improve the performance of multiview feature analysis further. First, features from each view are separately learned, which makes the learning framework unable to exploit the shared knowledge across multiple views. For example, texture-based and shape-based visual features should have some intrinsic correlations when capturing a certain object (e.g., the sun in images). It is presumed that exploiting shared knowledge is beneficial to feature learning. Second, most of the existing multiview feature analysis methods simply assume that features in all the views are equally important for different classes. They cannot balance the weights of heterogeneous features when combining them together.

In this letter, we propose a novel multiview feature learning algorithm that exploits correlations among different views for visual-based applications. Our solution to the first problem mentioned above is to assume that there exist shared subspaces that contain shared knowledge across multiple views. Based on this assumption, our objective function squeezes out common components in the shared subspaces as much as possible in iterations until convergence is reached. Regarding the second challenge, we propose to use the group ℓ1-norm (G1-norm) for regularization (Wang, Nie, & Huang, 2013; Chang, Nie, Yang, & Huang, 2014). By definition, the G1-norm applies the ℓ2-norm within each view and the ℓ1-norm among different views. Hence, once a specific view of features is recognized as uninformative for certain groups of objects, the algorithm downgrades its importance by assigning zero weight to features in that view; otherwise, large weights are given. Consequently, the G1-norm is capable of capturing relationships among different views. In this letter, we name our proposed algorithm multiview correlation feature learning (MVCL).

Existing multiview manifold learning algorithms (e.g., Wang & Mahadevan, 2013; Ham, Lee, & Saul, 2005) construct mapping functions that project data instances from different input domains to a new lower-dimensional space. However, they fail to take the shared component into consideration. In addition, Ham et al. (2005) design their approach in a semisupervised fashion.

The main contributions of this work can be summarized as follows:

We propose a novel multiview feature learning algorithm that takes the correlations among different views into consideration to improve subsequent classification performance. Based on shared subspace learning, our algorithm can automatically exploit shared knowledge buried across the multiple views.

In our algorithm, the importance of features in different views is not treated equally. The weights of features are dynamically adapted according to the sparsity condition across views. Nonzero weights indicate that the corresponding views are more informative than those assigned zero weight. In this way, our algorithm has more flexibility when learning multiview features than others.

Although the proposed objective function is nonsmooth, we derive an iterative algorithm to optimize the objective function effectively and efficiently. The convergence is theoretically guaranteed and empirically tested. From the experimental results, the proposed algorithm converges within 10 iterations in most of the cases.

Extensive experiments are conducted on several real-world data sets. The evaluation results show that our algorithm yields superior performance against all other compared multiview learning approaches.

The rest of this letter is organized as follows. We describe in detail the proposed feature learning framework in section 2, followed by a corresponding optimization algorithm in section 3 and convergence analysis in section 4. The experimental results are reported and discussed in section 5. We conclude in section 6.

## 2 Multiview Correlated Feature Learning Framework

We begin by summarizing the notations and definitions used in this letter. Matrices and vectors are written as boldface uppercase letters and boldface lowercase letters, respectively. For an arbitrary matrix M, its ith row and jth column are denoted as m^i and m_j, respectively. Given n training samples in the vth view, the training data matrix is represented by X^v ∈ R^{d_v × n}, where d_v is the feature dimension in the vth view, 1 ≤ v ≤ V, and V is the number of views. The label matrix is denoted as Y ∈ {0, 1}^{n × c}, where c is the number of classes.

We assume that there exists a subspace shared by all the views. Each view's features are projected by its own transformation matrix into this shared subspace, and the learned per-view representations are then concatenated to form the final representation.
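The projection-and-concatenation step described above can be sketched as follows. The dimensions and random data here are hypothetical, and the transformation matrices would in practice be learned by the proposed algorithm rather than sampled:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 20                 # hypothetical sample count and subspace dimension
view_dims = [64, 144, 73]      # hypothetical per-view feature dimensions

# One data matrix and one transformation matrix per view.
Xs = [rng.standard_normal((n, d)) for d in view_dims]
Ws = [rng.standard_normal((d, k)) for d in view_dims]

# Each view is mapped into a k-dimensional space; the learned per-view
# representations are then concatenated into a single feature matrix.
Zs = [X @ W for X, W in zip(Xs, Ws)]
Z = np.hstack(Zs)
print(Z.shape)  # (100, 60)
```

Every per-view representation has the same dimensionality k, which is what allows a common component to be shared across views before concatenation.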

The sparsity between different views is enhanced by the G1-norm, since it adopts the ℓ2-norm within each view and the ℓ1-norm between different views. For example, the objective function assigns very small values to a specific view of features when they are not discriminative for certain tasks; otherwise, their weights are assigned large values. In this way, the correlations between different views can be captured.
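The view-wise grouping described above can be sketched in NumPy. The per-column, per-view grouping and the `view_dims` partition below are illustrative assumptions, not necessarily the letter's exact formulation:

```python
import numpy as np

def group_l1_norm(W, view_dims):
    """G1-norm sketch: l2-norm over each view's block of rows (taken per
    class column), summed (l1) across views and columns."""
    total = 0.0
    start = 0
    for d in view_dims:
        block = W[start:start + d, :]                 # rows of one view
        total += np.linalg.norm(block, axis=0).sum()  # l2 per column, l1 across
        start += d
    return float(total)

# A view whose rows are all zero contributes nothing to the norm, which is
# how the regularizer can switch off an uninformative view entirely.
W = np.array([[3.0], [4.0], [0.0]])
print(group_l1_norm(W, view_dims=[2, 1]))  # 5.0: ||(3, 4)||_2 + ||(0)||_2
```

The block structure is what makes the penalty select or discard whole views rather than individual features.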

## 3 Optimization Algorithm

The difficulty of solving the objective function in equation 2.4 stems from the concatenation of the learned intermediate representations and the G1-norm. We propose to tackle this problem efficiently in the following steps.

## 4 Convergence Analysis

In this section, we prove that algorithm 1 converges. We begin with a lemma.

With all but one block of variables fixed, the global solution for the remaining block can be obtained; the same holds symmetrically for each of the other blocks.

By fixing one block of variables, we can convert the objective function into a convex optimization problem with regard to the remaining ones. Hence, their global solutions can be obtained by setting the derivative of equation 3.4 to zero. In the same manner, we can prove that the global solution for each of the other blocks follows with the rest fixed.

The objective function value shown in equation 3.4 monotonically decreases until convergence by applying the proposed algorithm.
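The monotone-decrease argument can be illustrated on a simple bi-convex surrogate (a rank-k least squares factorization), which shares the structure of algorithm 1: each block update is a global minimizer with the other block fixed, so the objective value never increases. This is an illustrative stand-in under that assumption, not the letter's actual updates:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 30))
U = rng.standard_normal((40, 5))
V = rng.standard_normal((5, 30))

def objective(U, V):
    return np.linalg.norm(X - U @ V) ** 2

values = [objective(U, V)]
for _ in range(10):
    U = X @ np.linalg.pinv(V)     # global minimizer over U with V fixed
    V = np.linalg.pinv(U) @ X     # global minimizer over V with U fixed
    values.append(objective(U, V))

# Each alternating step solves its subproblem exactly, so the sequence of
# objective values is nonincreasing.
assert all(b <= a + 1e-9 for a, b in zip(values, values[1:]))
```

Since the objective is bounded below by zero, such a nonincreasing sequence must converge, which mirrors the proof strategy of the theorem.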


From equation 4.3, we can clearly see that the objective function value decreases after each iteration. Thus, the theorem is proved.

## 5 Experiment

In this section, we experimentally evaluate the performance of the proposed method MVCL for classification tasks. We first compare our method with related state-of-the-art multiview methods on four real-world data sets, where the effect of the shared component is also evaluated. Then we study the performance variance with regard to parameter sensitivity and the convergence of algorithm 1.

### 5.1 Data Set Description

We evaluate our new multiview learning framework on four broadly used benchmark data sets. Each data set has a certain number of types of features (views), summarized in Table 1.

| Feature ID | NUS-WIDE-Object | Outdoor Scene | MSRC-v1 | Handwritten Digit |
|---|---|---|---|---|
| 1 | Color histogram (64-D) | GIST (512-D) | Color moment (48-D) | FOU (76-D) |
| 2 | Color correlogram (144-D) | Color moment (432-D) | LBP (256-D) | FAC (216-D) |
| 3 | Edge direction histogram (73-D) | HOG (256-D) | HOG (100-D) | KAR (64-D) |
| 4 | Wavelet texture (128-D) | LBP (48-D) | SIFT (1230-D) | PIX (240-D) |
| 5 | Block-wise color moments (225-D) | - | GIST (512-D) | ZER (47-D) |
| 6 | BoW SIFT (500-D) | - | CENTRIST (1320-D) | - |
| Number of classes | 31 | 8 | 8 | 10 |
| Size | 30,000 | 2688 | 210 | 2000 |


We used four data sets:

- *NUS-WIDE-OBJECT data set* (Chua et al., 2009). This is a light version of the NUS-WIDE data set, consisting of 30,000 real-world object images in 31 object categories. It is widely used to compare multiview learning methods in terms of image annotation. In our experiment, the official split is adopted: 17,927 training images and 12,073 testing images. Each image contains 1134 features within six views.
- *OUTDOOR SCENE* (Monadjemi, Thomas, & Mirmehdi, 2002). This data set contains 2688 color images belonging to eight outdoor scene categories: street, coast, forest, mountain, open country, inside city, highways, and tall buildings. Each image has 1248 features within four views.
- *MSRC* (Grauman & Darrell, 2006). MSRC is a scene recognition data set consisting of 240 images in eight classes. Following Grauman and Darrell (2006), we select seven classes, each with 30 images: building, tree, airplane, cow, face, car, and bicycle. Each image has 3466 features within six views.
- *Handwritten Digits*. This data set consists of 2000 data instances of the ten digit classes (0-9). In our experiment, we use five publicly available features to represent multiple views, and each instance has 643 features within five views.

### 5.2 Experiment Setup

To evaluate the performance of our method for classification tasks, we compare MVCL with several single-view and multiview learning methods. For single-view learning methods, we select the popular standard support vector machine (SVM) to compare the multiview classification performance of the proposed algorithm with its single-view counterparts and the concatenation of all types of features. In all our experiments, SVM is implemented with the LIBSVM software package (Chang & Lin, 2011).

For multiview learning methods, several kinds of state-of-the-art multiview methods are adopted for comparison. First, we compare the results of our method with several multiple kernel learning (MKL) methods: SVM MKL, LSSVM MKL, LPboost, and the GP method. SVM MKL and LSSVM MKL are SVM-based and least squares SVM-based (LSSVM) MKL methods, respectively, which extend MKL with different norms. According to Sonnenburg et al. (2006), different kernel normalizations can have a significant impact on the performance of MKL. In our experiments, we adopt the popular ℓ1-, ℓ2-, and ℓ∞-norms to normalize the MKL methods: the SVM MKL methods of Lanckriet et al. (2004), Kloft et al. (2008), and Sonnenburg et al. (2006), and the LSSVM MKL methods of Suykens and Vandewalle (1999), Yu et al. (2010), and Ye et al. (2008). Different from SVM-based MKL methods, LPboost combines boosting approaches with MKL to mix different kernels. Here we adopt two versions of LPboost: LPboost-β (Gehler & Nowozin, 2009) and LPboost-B (Gehler & Nowozin, 2009). In LPboost-β, a single weight vector is used to define a combination that works well for all classes, while in LPboost-B, each class has its own weight vector. The GP method (Kapoor, Grauman, Urtasun, & Darrell, 2010) adopts the gaussian process method and the pyramid match kernel to combine multiple kernels to boost classification performance.

Besides the MKL methods, we also compare our method with multiview correlated algorithms: multiview CCA (Foster, Kakade, & Zhang, 2008), multirelational classification (Guo & Viktor, 2008), and intra-view and inter-view supervised correlation analysis for multiview feature learning (Jing et al., 2014). All of these methods exploit the correlations among views to improve task performance. Moreover, we compare our method with a multiview manifold learning method, namely manifold alignment preserving global geometry (MAPGG) (Wang & Mahadevan, 2013).

As we evaluate the effectiveness of our method on multiclass classification problems, we employ mean average precision (MAP), which averages the ranking-based precision computed under each label, to measure classification performance.
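A minimal NumPy sketch of this metric, assuming the common per-label average-precision definition (the mean of the precisions at each positive example's rank):

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one label: rank samples by score (descending), then average
    the precision at every position where a positive example occurs."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    precisions = np.cumsum(labels) / np.arange(1, len(labels) + 1)
    return float((precisions * labels).sum() / max(labels.sum(), 1))

def mean_average_precision(scores, labels):
    """MAP: mean of the per-label APs (columns index labels)."""
    return float(np.mean([average_precision(scores[:, c], labels[:, c])
                          for c in range(labels.shape[1])]))

scores = np.array([[0.9], [0.8], [0.7]])
labels = np.array([[1], [0], [1]])
print(mean_average_precision(scores, labels))  # 0.833... = (1/1 + 2/3) / 2
```

Because MAP is computed per label and then averaged, it remains informative even when class frequencies are highly imbalanced, as in NUS-WIDE-Object.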

### 5.3 Experimental Results

We present the performance comparison of different algorithms in Table 2. It can be seen that the proposed algorithm MVCL consistently outperforms the other compared algorithms for different applications.

| Method | NUS-WIDE-Object | Scene | MSRC-v1 | Handwritten Digit |
|---|---|---|---|---|
| SVM (type 1) | | | | |
| SVM (type 2) | | | | |
| SVM (type 3) | | | | |
| SVM (type 4) | | | | |
| SVM (type 5) | | - | | |
| SVM (type 6) | | - | | - |
| SVM (all by concatenation) | | | | |
| SVM MKL method (Sonnenburg et al., 2006) | | | | |
| SVM MKL method (Lanckriet et al., 2004) | | | | |
| SVM MKL method (Kloft et al., 2008) | | | | |
| LSSVM MKL method (Ye et al., 2008) | | | | |
| LSSVM MKL method (Suykens & Vandewalle, 1999) | | | | |
| LSSVM MKL method (Yu et al., 2010) | | | | |
| GP method (Kapoor et al., 2010) | | | | |
| LPboost-β (Gehler & Nowozin, 2009) | | | | |
| LPboost-B (Gehler & Nowozin, 2009) | | | | |
| Multiview CCA (Foster et al., 2008) | | | | |
| Multirelational classification (Guo & Viktor, 2008) | | | | |
| MAPGG (Wang & Mahadevan, 2013) | | | | |
| KSCA (Jing et al., 2014) | | | | |
| MVCL (no shared subspace) | | | | |
| MVCL (shared subspace) | | | | |


Our observations from the experimental results are as follows:

MVCL consistently outperforms single-view SVM classification and multiview feature learning algorithms, which indicates the effectiveness of the proposed feature learning algorithm in supervised multiview classification.

The approaches that use multiple views generally get better performance than SVM with each single type of features. From this observation, we can confirm that multiview feature learning contributes to the classification performance improvement.

Both multiview CCA and multirelational classification generally perform better than the other multiview algorithms. This demonstrates that considering correlations among different views facilitates further classification.

The proposed algorithm outperforms the other compared algorithms. Moreover, when the shared component is considered, the proposed algorithm gets better performance. This indicates that it is beneficial to mine the shared component of different views.

### 5.4 Parameter Sensitivity and Convergence

In this section, experiments are conducted to study the performance variance with regard to the two regularization parameters and to validate how fast the proposed algorithm monotonically decreases the objective function value until convergence.

## 6 Conclusion

In this letter, a novel multiview feature learning algorithm has been proposed to efficiently learn an intermediate representation of individual features and combine them for subsequent tasks (e.g., classification). Shared components among different views are taken into consideration. Our algorithm can mine the correlations among different views by incorporating the G1-norm into the proposed framework. To solve the objective function, we propose an effective and efficient algorithm that reaches convergence in an iterative manner. We have tested and compared our algorithm with a range of other approaches over a number of real-world data sets. Our experimental results show that our algorithm is superior to all the approaches compared in this letter.

Our framework is built on the assumption that multiple views of features have some shared component. How to automatically determine the dimension of the shared component is still an open problem. Normally, we estimate the shared component dimension manually. However, human estimates are sometimes unreliable: even if humans believe that two views of features have some shared components, there may be no common subspace at all. From the experiments conducted in this letter, we can see that performance drops if we share a subspace between nonshared views. Based on this discussion, we will investigate how to automatically estimate the dimension of shared components among different views in future work.

## Acknowledgments

The research is supported by the Science Foundation of the China (Xi'an) Institute for Silk Road Research (2016SY10, 2016SY18) and the Research Foundation of XAUFE (15XCK14).