Abstract

We present a supervised model for tensor dimensionality reduction called large margin low rank tensor analysis (LMLRTA). In contrast to traditional vector representation-based dimensionality reduction methods, LMLRTA can take tensors of any order as input. And unlike previous tensor dimensionality reduction methods, which can learn only low-dimensional embeddings with a priori specified dimensionality, LMLRTA can automatically and jointly learn the dimensionality and the low-dimensional representations from data. Moreover, LMLRTA delivers low rank projection matrices, while encouraging data of the same class to be close and data of different classes to be separated by a large margin of distance in the low-dimensional tensor space. LMLRTA can be optimized using an iterative fixed-point continuation algorithm, which is guaranteed to converge to a local optimal solution of the optimization problem. We evaluate LMLRTA on an object recognition application, where the data are represented as 2D tensors, and a face recognition application, where the data are represented as 3D tensors. Experimental results show the superiority of LMLRTA over state-of-the-art approaches.

1.  Introduction

Dimensionality reduction is one of the most fundamental problems in several related areas, including machine learning, pattern recognition, and data mining. Effective dimensionality reduction techniques dramatically facilitate subsequent visualization, classification, and retrieval tasks. Over the past few decades, plenty of dimensionality reduction methods have been proposed and successfully used in many applications, such as optical character recognition (OCR), face recognition, and image and video retrieval. Traditional linear dimensionality reduction methods, such as principal component analysis (PCA) (Jolliffe, 2002) and linear discriminant analysis (LDA) (Fukunaga, 1991), are simple to implement. However, they are guaranteed only to discover the true structure of data lying on or near a linear subspace of the high-dimensional input space. To alleviate this problem, kernel extensions of these methods have been proposed: kernel principal component analysis (KPCA) (Schölkopf, Smola, & Müller, 1998) and generalized discriminant analysis (GDA) (Baudat & Anouar, 2000). Seung and Lee (2000) argued that the human brain represents real-world perceptual stimuli on low-dimensional manifolds. Around the same time and afterward, numerous manifold learning algorithms, such as isometric feature mapping (Isomap) (Tenenbaum, de Silva, & Langford, 2000) and locally linear embedding (LLE) (Roweis & Saul, 2000), were proposed for discovering the manifold structure of data embedded in a high-dimensional space. These manifold learning methods can faithfully preserve the geometrical structure of data. Unfortunately, most existing dimensionality-reduction methods, either linear or nonlinear, can work only on vectorized data representations, although data in many applications are more naturally represented as high-order tensors, such as 2D images and 3D textures.

To generalize traditional vector representation–based dimensionality-reduction methods to applications involving high-order tensors, several tensor dimensionality reduction approaches have been proposed; representative examples are 2DPCA (Yang, Zhang, Frangi, & Yang, 2004) and 2DLDA (Ye, Janardan, & Li, 2004). These approaches learn the low-dimensional representations of tensor data in an unsupervised or a supervised way. In particular, the approach presented in Wang, Yan, Huang, and Tang (2007) is theoretically guaranteed to converge to a local optimal solution of the learning problem. However, these approaches share a common issue: the dimensionality of the low-dimensional tensor space must be specified manually before they can be applied. Therefore, they may fail to discover the genuine manifold structure of the tensor data.

To address these problems of previous approaches, in this letter we propose a novel tensor dimensionality reduction model, large margin low rank tensor analysis (LMLRTA), which is aimed at learning the low-dimensional representations of tensors using techniques from multilinear algebra (Northcott, 1984) and graph theory (Bondy & Murty, 1976). Compared to traditional vector representation-based dimensionality reduction approaches, LMLRTA can take tensors of any order as input, including 1D vectors (first-order tensors), 2D matrices (second-order tensors), and higher. Furthermore, unlike previous tensor dimensionality reduction approaches (Yang et al., 2004; Ye et al., 2004; Wang et al., 2007), which can learn only low-dimensional embeddings with a priori specified dimensionality, LMLRTA can automatically and jointly learn the optimal dimensionality and the low-dimensional representations from data. In addition, LMLRTA enforces low-dimensional embeddings of the same class to be close and those of different classes to be separated by a large margin of distance. More important, it delivers low rank projection matrices for mapping high-dimensional data to the learned low-dimensional tensor space. We optimize LMLRTA using an iterative fixed-point continuation (FPC) algorithm, which is guaranteed to converge to a local optimal solution of the learning problem.

The rest of this letter is organized as follows. In section 2, we provide a brief overview of previous work on dimensionality reduction. In section 3, we present our proposed model, LMLRTA, in detail, including its formulation and optimization. In particular, we theoretically prove that LMLRTA can converge to a local optimal solution of the optimization problem. Section 4 shows the experimental results on real-world applications, including object recognition and face recognition, which involve 2D and 3D tensors, respectively. We conclude in section 5 with remarks and future work.

2.  Previous Work

In order to find effective low-dimensional representations of data, many dimensionality reduction approaches have been proposed. The most representative are principal component analysis (PCA) and linear discriminant analysis (LDA), for the unsupervised and supervised learning paradigms, respectively. They are widely used in many applications due to their simplicity and efficiency. However, it is well known that they are optimal only if the relation between the latent space and the observed space can be described by a linear function. To address this issue, nonlinear extensions based on kernel methods have been proposed: kernel principal component analysis (KPCA) (Schölkopf et al., 1998) and generalized discriminant analysis (GDA) (Baudat & Anouar, 2000).

Over the past decade or so, many manifold learning approaches have been proposed. These approaches, including isometric feature mapping (Isomap) (Tenenbaum et al., 2000) and locally linear embedding (LLE) (Roweis & Saul, 2000), can faithfully preserve global or local geometrical properties of the nonlinear structure of data. However, these methods work only on a given set of data points and cannot be easily extended to out-of-sample data (Bengio et al., 2003). To alleviate this problem, locality preserving projections (LPP) (He & Niyogi, 2003) and local Fisher discriminant analysis (LFDA) (Sugiyama, 2007) were proposed to approximate the manifold structure in a linear subspace by preserving local similarity between data points. In particular, Yan et al. (2007) proposed a general framework known as graph embedding for dimensionality reduction. Most of the spectral learning-based approaches, either linear or nonlinear, supervised or unsupervised, are contained in this framework. Furthermore, based on this framework, the authors proposed the marginal Fisher analysis (MFA) algorithm for supervised linear dimensionality reduction. In research on probabilistic learning models, Lawrence (2005) proposed the gaussian process latent variable model (GPLVM), which extends PCA to a probabilistic nonlinear formulation. Combining a gaussian Markov random field prior with GPLVM, Zhong, Li, Yeung, Hou, and Liu (2010) proposed the gaussian process latent random field model, which can be considered a supervised variant of GPLVM. In the area of neural network research, Hinton and Salakhutdinov (2006) proposed a deep neural network model, the autoencoder, for dimensionality reduction.
To exploit the effect of deep architectures for dimensionality reduction, some other deep neural network models were introduced, such as deep belief nets (DBN) (Hinton, Osindero, & Teh, 2006), the stacked autoencoder (SAE) (Bengio, Lamblin, Popovici, & Larochelle, 2006), and the stacked denoising autoencoder (SDAE) (Vincent, Larochelle, Lajoie, Bengio, & Manzagol, 2010). These studies show that deep neural networks can generally learn high-level representations of data, which can benefit subsequent recognition tasks.

All of the above approaches assume that the input data are in the form of vectors. In many real-world applications, however, objects are more naturally represented as high-order tensors, such as 2D images or 3D textures. One has to unfold these tensors into one-dimensional vectors before such dimensionality-reduction approaches can be applied, so some useful structural information in the original data may not be sufficiently preserved. Moreover, high-dimensional vectorized representations suffer from the curse of dimensionality, as well as high computational cost. To address these problems, 2DPCA (Yang et al., 2004) and 2DLDA (Ye et al., 2004) were proposed to extend the original PCA and LDA algorithms to work directly on 2D matrices rather than 1D vectors. In recent years, many other approaches (Yan et al., 2007; Tao, Li, Wu, & Maybank, 2007; Fu & Huang, 2008; Liu, Liu, & Chan, 2010; Liu, Liu, Wonka, & Ye, 2012) were also proposed to deal with high-order tensor problems. In particular, Wang et al. (2007) proposed a tensor dimensionality reduction method based on the graph embedding framework, the first method to give a convergent solution. However, all of these previous tensor dimensionality reduction approaches have a common shortcoming: the dimensionality of the low-dimensional representations must be specified manually before the approaches can be applied.

To address these issues in both vector representation–based and tensor representation–based dimensionality reduction approaches, we propose a novel model for tensor dimensionality reduction, large margin low rank tensor analysis (LMLRTA). It can take tensors of any order as input and automatically learn the dimensionality of the low-dimensional representations.

3.  Large Margin Low Rank Tensor Analysis

In this section, we introduce the notation and some basic terminologies on tensor operations (Kolda & Bader, 2009; Dai & Yeung, 2006). We then detail our model, LMLRTA, including its formulation and optimization. Specifically, we prove that LMLRTA can converge to a local optimal solution of the learning problem.

3.1.  Notation and Terminology.

We denote vectors by bold lowercase letters such as $\mathbf{v}$, matrices by bold uppercase letters such as $\mathbf{M}$, and tensors by calligraphic capital letters such as $\mathcal{A}$. The $i$th row and $j$th column of a matrix $\mathbf{M}$ are denoted as $\mathbf{M}_{i:}$ and $\mathbf{M}_{:j}$, respectively. $M_{ij}$ denotes the element of $\mathbf{M}$ at the $i$th row and $j$th column. $v_i$ is the $i$th element of a vector $\mathbf{v}$. We use $\mathbf{M}^T$ to denote the transpose of $\mathbf{M}$, and $\mathrm{tr}(\mathbf{M})$ to denote the trace of $\mathbf{M}$. Suppose $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_L}$ is a tensor of size $I_1 \times I_2 \times \cdots \times I_L$. The order of $\mathcal{A}$ is $L$, and the $l$th dimension (or mode) of $\mathcal{A}$ is of size $I_l$. In addition, we denote the index of a single entry within a tensor by subscripts, such as $\mathcal{A}_{i_1 i_2 \cdots i_L}$.

Definition 1.

The scalar product of two tensors $\mathcal{A}, \mathcal{B} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_L}$ is defined as $\langle \mathcal{A}, \mathcal{B} \rangle = \sum_{i_1, \ldots, i_L} \mathcal{A}_{i_1 i_2 \cdots i_L} \mathcal{B}^{*}_{i_1 i_2 \cdots i_L}$, where $*$ denotes complex conjugation. Furthermore, the Frobenius norm of a tensor $\mathcal{A}$ is defined as $\|\mathcal{A}\|_F = \sqrt{\langle \mathcal{A}, \mathcal{A} \rangle}$.
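For real-valued tensors, the conjugation is a no-op and both quantities reduce to simple element-wise computations. A minimal NumPy sketch (the function names are ours, not the authors'):

```python
import numpy as np

def tensor_inner(A, B):
    """Scalar product of two real tensors of identical shape."""
    return np.sum(A * B)  # complex conjugation is a no-op for real tensors

def tensor_fro_norm(A):
    """Frobenius norm of a tensor: square root of its scalar product with itself."""
    return np.sqrt(tensor_inner(A, A))
```

Note that the tensor Frobenius norm coincides with the Euclidean norm of the vectorized tensor, a fact used later for nearest-neighbor classification.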

Definition 2.

The $l$-mode product of a tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_L}$ and a matrix $\mathbf{U} \in \mathbb{R}^{J_l \times I_l}$ is an $I_1 \times \cdots \times I_{l-1} \times J_l \times I_{l+1} \times \cdots \times I_L$ tensor denoted as $\mathcal{A} \times_l \mathbf{U}$, where the corresponding entries are given by $(\mathcal{A} \times_l \mathbf{U})_{i_1 \cdots i_{l-1} j_l i_{l+1} \cdots i_L} = \sum_{i_l} \mathcal{A}_{i_1 i_2 \cdots i_L} U_{j_l i_l}$.

Definition 3.

Let $\mathcal{A}$ be an $I_1 \times I_2 \times \cdots \times I_L$ tensor and $(l, l_2, \ldots, l_L)$ be any permutation of the entries of the set $\{1, 2, \ldots, L\}$. The $l$-mode unfolding of the tensor $\mathcal{A}$ into an $I_l \times \prod_{k=2}^{L} I_{l_k}$ matrix, denoted as $\mathbf{A}^{(l)}$, is defined by $\mathbf{A}^{(l)}_{i_l j} = \mathcal{A}_{i_1 i_2 \cdots i_L}$, where $j = 1 + \sum_{k=2}^{L} (i_{l_k} - 1) J_k$ with $J_k = \prod_{m=2}^{k-1} I_{l_m}$.
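Definitions 2 and 3 can be realized in a few lines of NumPy under one fixed (row-major) ordering of the remaining modes; any consistent permutation is admissible per definition 3. The sketch below (our own helper names) also illustrates the standard identity that the $l$-mode unfolding of $\mathcal{A} \times_l \mathbf{U}$ equals $\mathbf{U}\mathbf{A}^{(l)}$:

```python
import numpy as np

def unfold(A, l):
    """l-mode unfolding: mode-l fibers become the columns of an
    I_l x (product of the remaining dimensions) matrix."""
    return np.moveaxis(A, l, 0).reshape(A.shape[l], -1)

def mode_product(A, U, l):
    """l-mode product A x_l U: multiply every mode-l fiber of A by U."""
    moved = np.moveaxis(A, l, 0)               # bring mode l to the front
    out = np.tensordot(U, moved, axes=(1, 0))  # contract over mode l
    return np.moveaxis(out, 0, l)              # restore the original mode order
```

The identity $(\mathcal{A} \times_l \mathbf{U})^{(l)} = \mathbf{U}\mathbf{A}^{(l)}$ holds as long as `unfold` orders the remaining modes consistently on both sides.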

Definition 4.
The multilinear rank of a tensor $\mathcal{A}$ is a set of nonnegative numbers, $(R_1, R_2, \ldots, R_L)$, such that
$R_l = \dim\big(\mathrm{range}(\mathbf{A}^{(l)})\big) = \mathrm{rank}(\mathbf{A}^{(l)})$, $l = 1, \ldots, L$, where $\mathrm{range}(\mathbf{A})$ is the range space of the matrix $\mathbf{A}$, and $\mathrm{rank}(\mathbf{A})$ is the matrix rank.

The multilinear rank of tensors, along with other rank concepts, is elegantly discussed in de Silva and Lim (2008). In this letter, we focus only on the multilinear rank of tensors and call it "rank" for short.
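Since the $l$th component of the multilinear rank is just the matrix rank of the $l$-mode unfolding, it can be computed numerically as follows (a sketch; the `tol` parameter guards against floating-point noise):

```python
import numpy as np

def multilinear_rank(A, tol=1e-10):
    """Multilinear rank of A: the matrix rank of each l-mode unfolding."""
    ranks = []
    for l in range(A.ndim):
        A_l = np.moveaxis(A, l, 0).reshape(A.shape[l], -1)  # l-mode unfolding
        ranks.append(int(np.linalg.matrix_rank(A_l, tol=tol)))
    return tuple(ranks)
```

For example, a sum of two linearly independent rank-1 tensors has multilinear rank (2, 2, 2).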

3.2.  Formulation of LMLRTA.

As researchers in the area of cognitive psychology have pointed out, humans learn based on the similarity of examples (Rosch, 1973). Here, we formulate our model based on the local similarity of tensor data. In addition, thanks to the existence of many “teachers,” we can generally obtain the categorical information of the examples before or during learning. Therefore, we formulate our learning model in a supervised scheme.

Given a set of $N$ tensor data, $\{\mathcal{X}_i \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_L}\}_{i=1}^{N}$, with the associated class labels $\{c_i \in \{1, \ldots, C\}\}_{i=1}^{N}$, where $L$ is the order of the tensors and $C$ is the number of classes, we learn $L$ low rank projection matrices $\{\mathbf{U}_l\}_{l=1}^{L}$, such that $N$ embedded data points can be obtained as $\mathcal{Y}_i = \mathcal{X}_i \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2 \times_3 \cdots \times_L \mathbf{U}_L$.

The learning objective function of LMLRTA can be written as

$$\min_{\{\mathbf{U}_l\}} \; \sum_{l=1}^{L} \mathrm{rank}(\mathbf{U}_l) + \alpha \sum_{i,j} W_{ij} \|\mathcal{Y}_i - \mathcal{Y}_j\|_F^2 + \beta \sum_{i,j,k} W_{ij} \tilde{W}_{ik} \big[ 1 + \|\mathcal{Y}_i - \mathcal{Y}_j\|_F^2 - \|\mathcal{Y}_i - \mathcal{Y}_k\|_F^2 \big]_+ , \quad (3.1)$$

where $\mathrm{rank}(\mathbf{U}_l)$ is the rank of matrix $\mathbf{U}_l$, $\|\mathcal{Y}\|_F$ is the Frobenius norm of a tensor $\mathcal{Y}$, $[z]_+ = \max(z, 0)$ is the so-called hinge loss, which is aimed at maximizing the margin between classes, and $\alpha, \beta > 0$ are trade-off parameters. Here, we define two similarity matrices, $\mathbf{W}$ and $\tilde{\mathbf{W}}$. If $\mathcal{X}_i$ and $\mathcal{X}_j$ have the same class label and $\mathcal{X}_j$ is one of the $k_1$-nearest neighbors of $\mathcal{X}_i$ or $\mathcal{X}_i$ is one of the $k_1$-nearest neighbors of $\mathcal{X}_j$, then $W_{ij} = 1$; otherwise, $W_{ij} = 0$. If $\mathcal{X}_i$ and $\mathcal{X}_j$ have different class labels but $\mathcal{X}_j$ is one of the $k_2$-nearest neighbors of $\mathcal{X}_i$ or $\mathcal{X}_i$ is one of the $k_2$-nearest neighbors of $\mathcal{X}_j$, then $\tilde{W}_{ij} = 1$; otherwise, $\tilde{W}_{ij} = 0$:

$$W_{ij} = \begin{cases} 1, & c_i = c_j \text{ and } \big(\mathcal{X}_j \in N_{k_1}(\mathcal{X}_i) \text{ or } \mathcal{X}_i \in N_{k_1}(\mathcal{X}_j)\big), \\ 0, & \text{otherwise}; \end{cases} \qquad \tilde{W}_{ij} = \begin{cases} 1, & c_i \neq c_j \text{ and } \big(\mathcal{X}_j \in N_{k_2}(\mathcal{X}_i) \text{ or } \mathcal{X}_i \in N_{k_2}(\mathcal{X}_j)\big), \\ 0, & \text{otherwise}, \end{cases} \quad (3.2)$$

where $N_k(\mathcal{X}_i)$ stands for the $k$-nearest neighbors of $\mathcal{X}_i$. Like the binary matrix $\mathbf{W}$, the matrix $\tilde{\mathbf{W}}$ is fixed and does not change during learning.
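The two indicator matrices can be built directly from pairwise tensor Frobenius distances. The sketch below takes same-class k1-nearest neighbors and different-class k2-nearest neighbors and symmetrizes with the "or" rule of equation 3.2; the function name and the class-restricted neighbor search are our own reading of the definition:

```python
import numpy as np

def build_graphs(X, labels, k1, k2):
    """Build the within-class (W) and between-class (W_tilde) binary kNN graphs.

    X: array of N tensors stacked along axis 0; labels: length-N class labels.
    """
    N = X.shape[0]
    flat = X.reshape(N, -1)  # tensor Frobenius distance = Euclidean on vectorizations
    D = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)  # a point is not its own neighbor
    order = np.argsort(D, axis=1)
    W = np.zeros((N, N), dtype=int)
    Wt = np.zeros((N, N), dtype=int)
    for i in range(N):
        same = [j for j in order[i] if labels[j] == labels[i]][:k1]
        diff = [j for j in order[i] if labels[j] != labels[i]][:k2]
        for j in same:
            W[i, j] = W[j, i] = 1   # same class, within k1-NN (symmetric "or")
        for j in diff:
            Wt[i, j] = Wt[j, i] = 1  # different class, within k2-NN
    return W, Wt
```

The O(N^2) distance matrix is fine at the scale of the data sets used here; a k-d tree or similar index would be preferable for larger N.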

The minimization of the first term of the objective function, $\sum_{l=1}^{L} \mathrm{rank}(\mathbf{U}_l)$, is to learn low rank $\mathbf{U}_l$'s, which straightforwardly leads to low-dimensional representations of the tensors. The second term of the objective function enforces neighboring data in each class to be close in the low-dimensional tensor space. It can be considered a graph Laplacian-parameterized loss function with respect to the low-dimensional embeddings (Chung, 1997; Belkin & Niyogi, 2003; Tenenbaum, Kemp, Griffiths, & Goodman, 2011), where each node corresponds to one tensor datum in the given data set. For each tensor datum $\mathcal{X}_i$, suppose it belongs to class $c_i$. The hinge loss in the third term is incurred by a differently labeled datum within the $k_2$-nearest neighbors of $\mathcal{X}_i$ if its distance to $\mathcal{X}_i$ does not exceed by 1 the distance from $\mathcal{X}_i$ to any of its $k_1$-nearest neighbors within the class $c_i$. This third term thereby favors projection matrices under which different classes maintain a large margin of distance. Furthermore, it encourages nearby data from different classes to be pushed far apart in the low-dimensional tensor space.

$\mathrm{rank}(\mathbf{U}_l)$ is a nonconvex function with respect to $\mathbf{U}_l$ and is difficult to optimize. Following recent work in matrix completion (Candès & Tao, 2010; Candès & Recht, 2012), we replace it with its convex envelope, the nuclear norm of $\mathbf{U}_l$, defined as the sum of its singular values, $\|\mathbf{U}_l\|_* = \sum_{i=1}^{r} \sigma_i$, where the $\sigma_i$'s are the singular values of $\mathbf{U}_l$, and $r$ is the rank of $\mathbf{U}_l$. Thus, the resulting formulation of our model can be written as

$$\min_{\{\mathbf{U}_l\}} \; \sum_{l=1}^{L} \|\mathbf{U}_l\|_* + \alpha \sum_{i,j} W_{ij} \|\mathcal{Y}_i - \mathcal{Y}_j\|_F^2 + \beta \sum_{i,j,k} W_{ij} \tilde{W}_{ik} \big[ 1 + \|\mathcal{Y}_i - \mathcal{Y}_j\|_F^2 - \|\mathcal{Y}_i - \mathcal{Y}_k\|_F^2 \big]_+ . \quad (3.3)$$
Since the hinge loss in problem 3.3 is not convex with respect to each individual $\mathbf{U}_l$, this problem is not convex with respect to each $\mathbf{U}_l$ either, $l = 1, \ldots, L$. In order to find a suboptimal solution of $\mathbf{U}_l$, we relax problem 3.3 to a convex problem with respect to each individual $\mathbf{W}_l = \mathbf{U}_l^T \mathbf{U}_l$. Using slack variables $\xi_{ijk}$, we can express the new formulation of our learning model as

$$\begin{aligned} \min_{\mathbf{W}_l, \, \xi} \;\; & \|\mathbf{W}_l\|_* + \alpha \sum_{i,j} W_{ij} \, \mathrm{tr}\big((\mathbf{Y}^{(l)}_i - \mathbf{Y}^{(l)}_j)(\mathbf{Y}^{(l)}_i - \mathbf{Y}^{(l)}_j)^T \mathbf{W}_l\big) + \beta \sum_{i,j,k} W_{ij} \tilde{W}_{ik} \, \xi_{ijk} \\ \text{s.t.} \;\; & \mathrm{tr}\big((\mathbf{Y}^{(l)}_i - \mathbf{Y}^{(l)}_k)(\mathbf{Y}^{(l)}_i - \mathbf{Y}^{(l)}_k)^T \mathbf{W}_l\big) - \mathrm{tr}\big((\mathbf{Y}^{(l)}_i - \mathbf{Y}^{(l)}_j)(\mathbf{Y}^{(l)}_i - \mathbf{Y}^{(l)}_j)^T \mathbf{W}_l\big) \ge 1 - \xi_{ijk}, \\ & \xi_{ijk} \ge 0, \quad \mathbf{W}_l \succeq 0, \end{aligned} \quad (3.4)$$

where $\mathbf{Y}^{(l)}_i$ is the $l$-mode unfolding matrix of the tensor obtained by projecting $\mathcal{X}_i$ along all modes but the $l$th (with the other projection matrices held fixed). For the second term of the objective function and the first constraint in problem 3.4, we have used the property of the trace function: $\mathrm{tr}\big(\mathbf{U}_l(\mathbf{Y}^{(l)}_i - \mathbf{Y}^{(l)}_j)(\mathbf{Y}^{(l)}_i - \mathbf{Y}^{(l)}_j)^T \mathbf{U}_l^T\big) = \mathrm{tr}\big((\mathbf{Y}^{(l)}_i - \mathbf{Y}^{(l)}_j)(\mathbf{Y}^{(l)}_i - \mathbf{Y}^{(l)}_j)^T \mathbf{U}_l^T \mathbf{U}_l\big)$.

Problem 3.4 is not jointly convex with respect to all the $\mathbf{W}_l$'s. However, it is convex with respect to each of them. This is guaranteed by the following lemma:

Lemma 1.

Problem 3.4 is convex with respect to each Wl.

Proof.

First, the nuclear norm of $\mathbf{W}_l$, $\|\mathbf{W}_l\|_*$, is a convex function with respect to $\mathbf{W}_l$. Second, the other terms of the objective function and the constraints in problem 3.4 are all linear functions with respect to $\mathbf{W}_l$. Hence, problem 3.4 is convex with respect to each $\mathbf{W}_l$.

Remark 1 (relation to previous work).

1. LMLRTA can be considered a supervised multilinear extension of locality preserving projections (LPP) (He & Niyogi, 2003), in that the second term of the objective function in problem 3.4 forces neighboring data in the same class to be close in the low-dimensional tensor space.

2. LMLRTA can also be considered a reformulation of tensor marginal Fisher analysis (TMFA) (Yan et al., 2007). However, TMFA is not guaranteed to converge to a local optimum of the optimization problem (Wang et al., 2007), but LMLRTA is guaranteed as proved in section 3.3.

3. Problem 3.4 can be considered a variant of the large margin nearest neighbor (LMNN) algorithm (Weinberger, Blitzer, & Saul, 2005) for distance metric learning in tensor space. Moreover, we can learn low-rank distance matrices via the formulation of problem 3.4, which the LMNN algorithm cannot.

4. In contrast to previous approaches for tensor dimensionality reduction, which can learn projection matrices only with prespecified dimensionality of the low-dimensional representations, LMLRTA can automatically learn the dimensionality of the low-dimensional representations from the given data. We show this in section 3.3.

5. Unlike most deep neural network models (Hinton et al., 2006; Bengio et al., 2006; Vincent et al., 2010), which can take only vectorized representations of data as input, LMLRTA can take tensors of any order as input. Moreover, with their large number of parameters, deep neural network models generally need a large amount of training data. If the size of the training set is small, deep neural network models may fail to learn the intrinsic structure of data; in this case, LMLRTA can perform much better than deep neural network models. Experimental results in section 4 demonstrate this effect.

3.3.  Optimization.

Similar to previous approaches to tensor dimensionality reduction (Dai & Yeung, 2006; Wang et al., 2007), here we solve problem 3.4 using an iterative optimization algorithm. In each iteration, we refine one projection matrix while fixing the others. For each $\mathbf{W}_l$, problem 3.4 is a semidefinite programming problem, which can be solved using off-the-shelf solvers, such as SeDuMi (http://sedumi.ie.lehigh.edu/) and CVX (Grant & Boyd, 2008). However, the computational cost of semidefinite programming approaches is generally very high. Instead, we solve the problem using a fixed-point continuation (FPC) algorithm (Ma, Goldfarb, & Chen, 2011).

FPC is an iterative optimization method. In the $t$th iteration, it uses two alternating steps:

• Gradient step: $\mathbf{Z}^t_l = \mathbf{W}^t_l - \tau g(\mathbf{W}^t_l)$.

• Shrinkage step: $\mathbf{W}^{t+1}_l = S_{\tau\mu}(\mathbf{Z}^t_l)$.

In the gradient step, $g(\mathbf{W}^t_l)$ is the subgradient of the objective function in problem 3.4 with respect to $\mathbf{W}^t_l$ (excluding the nuclear norm term), and $\tau$ is the step size. Writing $\mathbf{D}_{ij} = (\mathbf{Y}^{(l)}_i - \mathbf{Y}^{(l)}_j)(\mathbf{Y}^{(l)}_i - \mathbf{Y}^{(l)}_j)^T$, we can express this part of the objective as a function with respect to $\mathbf{W}^t_l$:

$$F(\mathbf{W}^t_l) = \alpha \sum_{i,j} W_{ij} \, \mathrm{tr}(\mathbf{D}_{ij} \mathbf{W}^t_l) + \beta \sum_{i,j,k} W_{ij} \tilde{W}_{ik} \big[ 1 + \mathrm{tr}(\mathbf{D}_{ij} \mathbf{W}^t_l) - \mathrm{tr}(\mathbf{D}_{ik} \mathbf{W}^t_l) \big]_+ . \quad (3.5)$$

Note that the hinge loss is not differentiable, but we can compute its subgradient and use a standard descent algorithm to optimize the problem. Thus, we can calculate $g(\mathbf{W}^t_l)$ as

$$g(\mathbf{W}^t_l) = \alpha \sum_{i,j} W_{ij} \, \mathbf{D}_{ij} + \beta \sum_{(i,j,k) \in \mathcal{T}} W_{ij} \tilde{W}_{ik} \, (\mathbf{D}_{ij} - \mathbf{D}_{ik}), \quad (3.6)$$

where $\mathcal{T}$ is the set of triplets whose corresponding slack variable exceeds zero, that is, $\mathcal{T} = \{(i, j, k) : 1 + \mathrm{tr}(\mathbf{D}_{ij} \mathbf{W}^t_l) - \mathrm{tr}(\mathbf{D}_{ik} \mathbf{W}^t_l) > 0\}$.

In the shrinkage step, $S_{\nu}(\mathbf{Z}^t_l) = \mathbf{P} \max(\mathbf{\Lambda} - \nu \mathbf{I}, 0) \mathbf{P}^T$ is a matrix shrinkage operator on $\mathbf{Z}^t_l = \mathbf{P} \mathbf{\Lambda} \mathbf{P}^T$, where max is element-wise and $\nu \mathbf{I}$ is a diagonal matrix with all the diagonal elements set to $\nu$. Here, since $\mathbf{W}^t_l$ is supposed to be a symmetric and positive semidefinite matrix, its eigenvalues should also be its singular values and nonnegative. Therefore, we adopt the eigenvalue decomposition method to shrink the rank of $\mathbf{Z}^t_l$. To this end, the shrinkage operator shifts the eigenvalues down and truncates any eigenvalue less than $\nu$ to zero. This step reduces the nuclear norm of $\mathbf{W}^t_l$. If some eigenvalues are truncated to zeros, this step reduces the rank of $\mathbf{W}^t_l$ as well. When the algorithm converges, the rank of the learned $\mathbf{W}^t_l$ represents the optimal dimensionality of the $l$th mode of the low-dimensional representations.
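The eigenvalue-based shrinkage can be sketched in a few lines, with `nu` the shrinkage threshold (the step size times the continuation parameter in the FPC literature); this is a hypothetical helper, not the authors' code:

```python
import numpy as np

def shrink_psd(Z, nu):
    """Eigenvalue shrinkage for a symmetric matrix: shift eigenvalues down by nu
    and truncate negatives to zero, reducing the nuclear norm and possibly the rank."""
    Z = (Z + Z.T) / 2.0                   # symmetrize against round-off
    evals, P = np.linalg.eigh(Z)
    evals = np.maximum(evals - nu, 0.0)   # element-wise max(Lambda - nu*I, 0)
    return (P * evals) @ P.T              # reassemble P diag(evals) P^T
```

Because the input is symmetric positive semidefinite, eigenvalue decomposition plays the role that singular value decomposition plays in the general FPC setting.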

To improve the speed of the applied FPC algorithm, we follow Ma et al. (2011) and use a continuation optimization procedure. This involves beginning with a large value $\mu_1$ and solving a sequence of subproblems, each with a decreasing value $\mu_k$ and using the previous solution as its initial point. The sequence of values is determined by a decay parameter $\eta_\mu \in (0, 1)$: $\mu_{k+1} = \max(\mu_k \eta_\mu, \bar{\mu})$, $k = 1, \ldots, K - 1$, where $\bar{\mu}$ is the final value to use and $K$ is the number of rounds of continuation.
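The continuation schedule is just a geometric decay clipped at the final value; a small sketch with our own parameter names:

```python
def mu_schedule(mu1, mu_bar, eta):
    """Decreasing continuation sequence: mu_{k+1} = max(eta * mu_k, mu_bar),
    starting from mu1 and stopping once the final value mu_bar is reached."""
    mus = [mu1]
    while mus[-1] > mu_bar:
        mus.append(max(eta * mus[-1], mu_bar))
    return mus
```

Each value in the sequence defines one subproblem, warm-started from the previous solution.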

When $\mathbf{W}^t_l$ gets close to an optimal solution $\mathbf{W}^{\star}_l$, the distance between $\mathbf{W}^t_l$ and $\mathbf{W}^{t+1}_l$ should become very small. We use the following condition as a stopping criterion (Ma et al., 2011),

$$\frac{\|\mathbf{W}^{t+1}_l - \mathbf{W}^t_l\|_F}{\max(1, \|\mathbf{W}^t_l\|_F)} < \mathrm{tol}, \quad (3.7)$$

where $\|\mathbf{M}\|_F$ denotes the Frobenius norm of a matrix $\mathbf{M}$ and tol is a small, positive number. In our experiments, tol $= 10^{-6}$ was used. For clarity, we present the complete FPC algorithm in algorithm 1.

For the convergence of the FPC algorithm, we present the following theorem:

Theorem 1.

For fixed $\mathbf{W}_k$, $k \neq l$, the sequence $\{\mathbf{W}^t_l\}$ generated by the FPC algorithm with step size $\tau \in (0, 2/\lambda_{\max})$ converges to the optimal solution $\mathbf{W}^{\star}_l$ of problem 3.4, where $\lambda_{\max}$ is the maximum eigenvalue of $g(\mathbf{W}_l)$.

The proof of this theorem is similar to that of theorem 4 in Ma et al. (2011). A minor difference is that since $\mathbf{W}_l = \mathbf{U}_l^T \mathbf{U}_l$ is a symmetric and positive semidefinite matrix, we use eigenvalue decomposition instead of the singular value decomposition used in the proof of theorem 4 in Ma et al. (2011). Nevertheless, the derivation and results are the same.

Based on lemma 1 and theorem 1, we have the following theorem on the convergence of the proposed model, LMLRTA:

Theorem 2.

LMLRTA converges to a local optimal solution of problem 3.4.

Proof.

To prove theorem 2, we need only to prove that the objective function has a lower bound and that the iterative optimization procedure monotonically decreases the value of the objective function.

It is easy to see that the value of the objective function in problem 3.4 is always larger than or equal to 0. Hence, 0 is a lower bound of this objective function. For the optimization of each $\mathbf{W}_l$, $l = 1, \ldots, L$, from theorem 1, we know that the FPC algorithm minimizes the value of the objective function in problem 3.4. Therefore, the iterative procedure of LMLRTA monotonically decreases the value of the objective function, and LMLRTA is guaranteed to converge to a local optimal solution of problem 3.4.

For problems with first-order tensors (vector input), we have the following corollary:

Corollary 1.

If the given data are first-order tensors, the LMLRTA algorithm converges to the global optimal solution of problem 3.4.

3.4.  Generalization to New Tensor Data.

For the recognition of unseen test tensors, we employ the tensor Frobenius norm-based k-nearest neighbor classifier as the recognizer, in that it measures the local similarity between training data and test data in the low-dimensional tensor space (Rosch, 1973).
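Since the Frobenius norm of a tensor equals the Euclidean norm of its vectorization, the classifier can be sketched as follows (a minimal version; the function and argument names are ours):

```python
import numpy as np

def knn_predict(train, train_labels, test, k=1):
    """k-nearest-neighbor prediction under the tensor Frobenius norm.

    train: (N, ...) stack of low-dimensional training tensors;
    test:  (M, ...) stack of test tensors of the same shape.
    """
    Xtr = train.reshape(train.shape[0], -1)
    Xte = test.reshape(test.shape[0], -1)
    preds = []
    for x in Xte:
        d = np.linalg.norm(Xtr - x, axis=1)  # Frobenius distances to all training tensors
        nn = np.argsort(d)[:k]
        vals, counts = np.unique(np.asarray(train_labels)[nn], return_counts=True)
        preds.append(vals[np.argmax(counts)])  # majority vote among the k neighbors
    return np.array(preds)
```

With k = 1, as used in the experiments below, the majority vote reduces to copying the label of the single nearest training tensor.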

4.  Experiments

In this section, we report the experimental results obtained on two real-world applications: object recognition and face recognition. Particularly for the face recognition application, we used 3D Gabor transformation of the face images as input signals, as the kernels of the Gabor filters resemble the receptive field profiles of the mammalian cortical simple cells (Daugman, 1988). In the following, we report the parameter settings and experimental results in detail.

4.1.  Parameter Settings.

To demonstrate the effectiveness of our model, LMLRTA, for intrinsic tensor representation learning, we conducted experiments on the COIL-20 data set (http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php) and the ORL face data set (http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html). The COIL-20 data set has 20 classes of objects, with 72 samples in each class. The size of the images is . The ORL data set contains 400 images of 40 subjects, where each image was normalized to a size of . For each face image, we used 28 Gabor filters to extract textural features. To this end, each face image was represented as a tensor. For the COIL-20 data set, we randomly selected one-third of the samples for testing and used the others for training. Because each subject has only 10 images in the ORL data set, we evaluated the proposed model, LMLRTA, based on the average over five random partitions of the data. Here, a variety of scenarios (different numbers of data from each class used to compose the training set) were tested. In our experiments on both data sets, the feature values of the data were normalized to [0, 1] by dividing by the largest pixel value or the largest value of the Gabor features.

To show the advantage of our proposed method, LMLRTA, we compared it with two classic vector representation-based dimensionality-reduction approaches, linear discriminant analysis (LDA) (Fisher, 1936) and marginal Fisher analysis (MFA) (Yan et al., 2007); one deep neural network model, the stacked denoising autoencoder (SDAE) (Vincent et al., 2010); and two state-of-the-art tensor dimensionality-reduction methods, convergent multilinear discriminant analysis (CMDA) and convergent tensor marginal Fisher analysis (CTMFA) (Wang et al., 2007). For comparison, we also provide the classification results obtained in the original data space. In the LMLRTA algorithm, k1 and k2 were set to 7 and 15, respectively, for the COIL-20 data set, while for the ORL data set, they were set to ntrain−1 and , respectively, where ntrain denotes the number of data of each class used to compose the training set. In implementing the FPC method for the optimization of each projection matrix, we initialized with a random full-rank symmetric and positive semidefinite matrix and set . Following Ma et al. (2011), we adopted values starting at , where was the largest eigenvalue of the initialized matrix, and decreased until . Furthermore, was selected from using the holdout strategy. In particular, for the COIL-20 data set, half of the training data were randomly selected for model learning and the other half for validation. For the ORL data set, we conducted model selection for each scenario of random partition of the data. For concreteness, in each scenario, we selected one sample of each class of the training data to compose the validation set and used the others for model learning. For CMDA and CTMFA, we adopted the best setting learned by LMLRTA to specify the dimensionality of the low-dimensional tensor space. We used the code of SDAE from a public deep learning toolbox (https://github.com/rasmusbergpalm/DeepLearnToolbox).
For all the methods but SDAE, tensor Frobenius norm-based 1-nearest neighbor classifier was used for the recognition of the test data.

4.2.  Visualization.

Figures 1a and 1b illustrate the 2D embeddings of the object images from the COIL-20 data set and of the 3D Gabor transformations of the face images from the ORL data set, respectively. The t-distributed stochastic neighbor embedding (t-SNE) algorithm (van der Maaten & Hinton, 2008) was employed to learn these 2D embeddings, where the distances between data were measured based on the tensor Frobenius norm. From Figure 1, we can see that in the original space of these two data sets, most of the classes align on a submanifold embedded in the ambient space. However, for some classes, the data are scattered over a large area of the data space and even lie close to data of other classes. As a result, local similarity-based classifiers may predict the labels of some unseen data incorrectly in both of these original representation spaces. Hence, it is necessary to learn intrinsic and compact representations of the given tensor data.

Figure 1:

2D embeddings of the tensors from the COIL-20 and the ORL data sets, where different classes are denoted with different shading. (a) Images from the COIL-20 data set. (b) Gabor transformation of the face images from the ORL data set. We can see that in the original space of these two data sets, some data of the same class are far apart, and at the same time, some data are close to those of different classes. (Best viewed in color; see the online supplement.)


Figures 2a and 2b illustrate the 2D embeddings of the low-dimensional tensor representations for the COIL-20 and the ORL data sets, respectively. LMLRTA was used to learn the low-dimensional tensor representations, while the t-SNE algorithm was used to generate the 2D embeddings. It is easy to see that LMLRTA successfully discovered the manifold structure of these two data sets. In both Figures 2a and 2b, the similarity of data of the same class is faithfully preserved, while the discrimination between classes is maximized.

Figure 2:

2D embeddings of the low-dimensional tensor representations for the COIL-20 and the ORL data sets. LMLRTA was used to learn the low-dimensional tensor representations. (a) Corresponding low-dimensional tensor representations of the images shown in Figure 1a. (b) Corresponding low-dimensional tensor representations of the 3D Gabor transformation of the face images shown in Figure 1b. We can see that in the low-dimensional tensor space learned by LMLRTA, the data points of the same class are close to each other, while data of different classes are relatively far apart. (Best viewed in color; see the online supplement.)


Figure 3 shows some low-dimensional tensor representations of the images from the COIL-20 data set, learned by CMDA, CTMFA, and LMLRTA, respectively. Five classes were randomly selected, and the low-dimensional representations of five images from each class were further randomly selected for display. In each panel of Figure 3, each row shows the low-dimensional tensor representations of images from one class. In contrast to the dimensionality of the original image, , the dimensionality of the low-dimensional representations here is . We can see that all three methods preserve the similarity of data of the same class faithfully. However, the discrimination between classes in the low-dimensional tensor space learned by LMLRTA is much better than that learned by CMDA and CTMFA. Recognition results in section 4.3 also support this observation.

Figure 3:

Learned low-dimensional tensor representations of images from the COIL-20 data set. (a) Those learned by CMDA. Here, each row shows the low-dimensional representations of images from one class. Five classes are shown. (b) Low-dimensional tensor representations of the same images as in panel (a) learned by CTMFA. (c) Low-dimensional tensor representations of the same images as in panel (a) learned by LMLRTA. We can see that in the learned low-dimensional tensor space, all three methods preserve the similarity of data of each class faithfully. However, recognition results show that the discrimination between classes in the tensor space learned by LMLRTA is much better than those learned by CMDA and CTMFA.

4.3.  Object Recognition Results on the COIL-20 Data Set (2D Tensors).

In this experiment, we compare LMLRTA with several related approaches on the object recognition application: LDA, MFA, SDAE, CMDA, CTMFA, and classification in the original space. We conducted the experiments on the COIL-20 data set. For LMLRTA, CMDA, and CTMFA, we followed the settings introduced in section 4.1. For LDA and MFA, we used the same model selection strategy as for LMLRTA; the dimensionality of the LDA subspace and that of the MFA subspace were each selected from a set of candidate values. In order to reduce the noise level of the data, before applying LDA and MFA for dimensionality reduction, we first projected the data onto a PCA subspace with 99% of the variance retained. Other than PCA, no other preprocessing was applied. For the SDAE algorithm, we used a six-layer neural network model. The sizes of the layers were 1024, 512, 256, 64, 32, and 20, respectively. The batch size was set to 80 for both pretraining and fine-tuning. The number of epochs for pretraining was set to 400, while that for fine-tuning was set to 5000.
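The PCA preprocessing step used before LDA and MFA can be sketched as follows. The function name and the synthetic data are illustrative; only the 99%-variance criterion comes from the text.

```python
import numpy as np

def pca_99(X, var_ratio=0.99):
    """Project data onto the PCA subspace retaining the given
    fraction of the variance (the preprocessing used before LDA/MFA)."""
    Xc = X - X.mean(axis=0)                       # center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s**2 / np.sum(s**2)                     # variance ratio per component
    # Smallest number of leading components whose cumulative variance >= ratio.
    k = int(np.searchsorted(np.cumsum(var), var_ratio)) + 1
    return Xc @ Vt[:k].T                          # reduced representation

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 50))  # synthetic stand-in for vectorized images
Z = pca_99(X)
print(Z.shape)
```

LDA or MFA would then be run on Z rather than on the raw vectorized images.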

Figure 4 shows the classification accuracy obtained by LMLRTA and the compared methods on the COIL-20 data set. The dimensionality of the low-dimensional representations learned by LMLRTA is [21, 16], which was used for the learning of CMDA and CTMFA as well. Comparing the tested methods, we see that LMLRTA performed best: it achieved 100% accuracy on this data set. Due to the loss of the local structural information of the images, the vector representation-based approaches, LDA and MFA, performed worst on this problem. Due to the limited training sample size and the vectorized representations of the data, the deep neural network model, SDAE, did not outperform LMLRTA on this problem and obtained only a recognition accuracy similar to that of classification in the original data space. The state-of-the-art tensor dimensionality-reduction approaches, CMDA and CTMFA, can converge to a local optimal solution of the learning problem but did not outperform LMLRTA either.

Figure 4:

Recognition results obtained by LMLRTA and the compared methods on the COIL-20 data set. Note that LMLRTA obtained 100% accuracy on this data set, while none of the other approaches achieved the same perfect performance.

To show the convergence of the FPC algorithm while learning the projection matrices, Figure 5 plots the value of the objective function against the iteration number during the optimization of LMLRTA on the COIL-20 data set. As the figure shows, the FPC algorithm converges to a stationary point of the problem as the iterations proceed.
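The rank reduction in FPC comes from a matrix shrinkage (singular value soft-thresholding) step applied at each iteration (cf. Ma, Goldfarb, & Chen, 2011). A minimal sketch of that operator alone, applied to a random matrix rather than to LMLRTA's actual iterates (the threshold value is an arbitrary placeholder):

```python
import numpy as np

def svd_shrink(W, tau):
    """Matrix shrinkage operator used in FPC: soft-threshold the
    singular values of W by tau, zeroing out the small ones."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s = np.maximum(s - tau, 0.0)   # shrink; small singular values become 0
    return U @ np.diag(s) @ Vt

rng = np.random.default_rng(2)
W = rng.standard_normal((30, 20))
W_shrunk = svd_shrink(W, tau=3.0)
# Shrinkage never increases the rank of the iterate.
print(np.linalg.matrix_rank(W_shrunk) <= np.linalg.matrix_rank(W))  # True
```

Alternating this shrinkage with a gradient step on the large margin loss is what gradually lowers the ranks of the projection matrices during optimization.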

Figure 5:

Objective function values during optimization of the projection matrices for the two orders of the tensors. These two curves show that the FPC algorithm converges to a stationary point of the optimization problem.

4.4.  Face Recognition Results on the ORL Data Set (3D Tensors).

Figure 6 shows the classification accuracy and standard deviation obtained on the ORL data set. Due to the high computational complexity of LDA, MFA, and SDAE (the vector representations of the tensors are of very high dimensionality), here we compared LMLRTA only to CMDA, CTMFA, and classification in the original data space. From Figure 6, we can see that LMLRTA consistently outperforms the compared convergent tensor dimensionality reduction approaches and classification in the original data space. More important, as LMLRTA gradually reduces the ranks of the projection matrices during optimization, it can automatically learn the dimensionality of the intrinsic low-dimensional tensor space from data. In contrast, for the compared tensor dimensionality reduction approaches, the dimensionality must be manually specified before they can be applied.

Figure 6:

Recognition results for the ORL face images, where ntrain denotes the number of data of each class used to compose the training set.

In Table 1, we report the dimensionality of the low-dimensional representations learned by LMLRTA on the ORL data set. It shows that LMLRTA generally delivers low rank projection matrices and compact representations of the tensor data.

Table 1:
Dimensionality of the Low-Dimensional Representations Learned by LMLRTA on the ORL Data Set.
ntrain   3             4             5             6             7
Dims     [15, 18, 16]  [17, 16, 16]  [15, 17, 15]  [12, 15, 16]  [14, 14, 13]
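Because LMLRTA's learned dimensionality is the tuple of ranks of its low rank projection matrices, entries such as [15, 18, 16] in Table 1 can be read off as matrix ranks. A small illustration with synthetic low rank stand-ins (the matrix sizes here are hypothetical, not the actual ORL Gabor tensor sizes):

```python
import numpy as np

rng = np.random.default_rng(3)

def low_rank(n, r):
    """Synthetic rank-r square matrix (a stand-in for a learned projection)."""
    return rng.standard_normal((n, r)) @ rng.standard_normal((r, n))

# Stand-ins for the three projection matrices of a 3D tensor model.
U = [low_rank(32, 15), low_rank(32, 18), low_rank(40, 16)]

# The learned dimensionality is the tuple of ranks, as reported in Table 1.
dims = [int(np.linalg.matrix_rank(Uk)) for Uk in U]
print(dims)  # [15, 18, 16]
```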

5.  Conclusion

In this letter, we have proposed a supervised tensor dimensionality reduction model, large margin low rank tensor analysis (LMLRTA), which can automatically and jointly learn the dimensionality and the low-dimensional representations of tensors. We optimize LMLRTA using an iterative fixed-point continuation (FPC) algorithm, which is guaranteed to converge to a local optimal solution of the optimization problem. Experiments on object recognition and face recognition show the superiority of LMLRTA over classic vector representation-based dimensionality reduction approaches, deep neural network models, and existing tensor dimensionality reduction approaches. In future work, we will attempt to extend LMLRTA to the transfer learning (Pan & Yang, 2010) and active learning (Cohn, Ladner, & Waibel, 1994) scenarios. Furthermore, we plan to combine LMLRTA with deep neural networks (LeCun, Bottou, Bengio, & Haffner, 2001) and nonnegative matrix factorization models (Lee & Seung, 1999) to solve challenging large-scale problems.

Acknowledgments

We thank two anonymous reviewers for their valuable comments and constructive suggestions. This work is supported by the Social Sciences and Humanities Research Council of Canada and the Natural Sciences and Engineering Research Council of Canada.

References

Baudat, G., & Anouar, F. (2000). Generalized discriminant analysis using a kernel approach. Neural Computation, 12(10), 2385–2404.
Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373–1396.
Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2006). Greedy layer-wise training of deep networks. In B. Schölkopf, J. C. Platt, & T. Hoffman (Eds.), Advances in neural information processing systems, 19 (pp. 153–160). Cambridge, MA: MIT Press.
Bengio, Y., Paiement, J.-F., Vincent, P., Delalleau, O., Roux, N. L., & Ouimet, M. (2003). Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. In S. Thrün, L. K. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16. Cambridge, MA: MIT Press.
Bondy, J. A., & Murty, U.S.R. (1976). Graph theory with applications. Amsterdam: North-Holland.
Candès, E., & Recht, B. (2012). Exact matrix completion via convex optimization. Commun. ACM, 55(6), 111–119.
Candès, E., & Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5), 2053–2080.
Chung, F. R. K. (1997). Spectral graph theory. Providence, RI: American Mathematical Society.
Cohn, D., Ladner, R., & Waibel, A. (1994). Improving generalization with active learning. Machine Learning, 15, 201–221.
Dai, G., & Yeung, D.-Y. (2006). Tensor embedding methods. In Proceedings of the 21st National Conference on Artificial Intelligence and the 18th Innovative Applications of Artificial Intelligence Conference (pp. 330–335). Cambridge, MA: MIT Press.
Daugman, J. G. (1988). Complete discrete 2D Gabor transforms by neural networks for image analysis and compression. IEEE Transactions on Acoustics, Speech and Signal Processing, 36(7), 1169–1179.
de Silva, V., & Lim, L.-H. (2008). Tensor rank and the ill-posedness of the best low-rank approximation problem. SIAM J. Matrix Analysis Applications, 30(3), 1084–1127.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(7), 179–188.
Fu, Y., & Huang, T. S. (2008). Image classification using correlation tensor analysis. IEEE Transactions on Image Processing, 17(2), 226–234.
Fukunnaga, K. (1991). Introduction to statistical pattern recognition (2nd ed.). Orlando, FL: Academic Press.
Grant, M., & Boyd, S. (2008). Graph implementations for nonsmooth convex programs. In V. Blondel, S. Boyd, & H. Kimura (Eds.), Recent advances in learning and control (pp. 95–110). New York: Springer-Verlag.
He, X., & Niyogi, P. (2003). Locality preserving projections. In S. Thrün, L. K. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16. Cambridge, MA: MIT Press.
Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527–1554.
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507.
Jolliffe, I. (2002). Principal component analysis (2nd ed.). New York: Springer.
Kolda, T. G., & Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51(3), 455–500.
Lawrence, N. D. (2005). Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6, 1783–1816.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (2001). Gradient-based learning applied to document recognition. In S. Haykin & B. Kosko (Eds.), Intelligent signal processing (pp. 306–351). Piscataway, NJ: IEEE Press.
Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755), 788–791.
Liu, Y., Liu, Y., & Chan, K. C. C. (2010). Tensor distance based multilinear locality-preserved maximum information embedding. IEEE Transactions on Neural Networks, 21(11), 1848–1854.
Liu, J., Liu, J., Wonka, P., & Ye, J. (2012). Sparse non-negative tensor factorization using columnwise coordinate descent. Pattern Recognition, 45(1), 649–656.
Ma, S., Goldfarb, D., & Chen, L. (2011). Fixed point and Bregman iterative methods for matrix rank minimization. Math. Program., 128(1–2), 321–353.
Northcott, D. G. (1984). Multilinear algebra. Cambridge: Cambridge University Press.
Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE Trans. Knowl. Data Eng., 22(10), 1345–1359.
Rosch, E. (1973). Natural categories. Cognitive Psychol., 4, 328–350.
Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323–2326.
Schölkopf, B., Smola, A. J., & Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1299–1319.
Seung, H. S., & Lee, D. D. (2000). The manifold ways of perception. Science, 290(5500), 2268–2269.
Sugiyama, M. (2007). Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. Journal of Machine Learning Research, 8, 1027–1061.
Tao, D., Li, X., Wu, X., & Maybank, S. J. (2007). General tensor discriminant analysis and Gabor features for gait recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(10), 1700–1715.
Tenenbaum, J. B., de Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323.
Tenenbaum, J. B., Kemp, C., Griffiths, T. L., & Goodman, N. D. (2011). How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022), 1279–1285.
van der Maaten, L., & Hinton, G. E. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., & Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11, 3371–3408.
Wang, H., Yan, S., Huang, T. S., & Tang, X. (2007). A convergent solution to tensor subspace learning. In Proceedings of the 20th International Joint Conference on Artificial Intelligence. San Francisco: Morgan Kaufmann.
Weinberger, K. Q., Blitzer, J., & Saul, L. K. (2005). Distance metric learning for large margin nearest neighbor classification. In Y. Weiss, B. Schölkopf, & J. Platt (Eds.), Advances in neural information processing systems, 18. Cambridge, MA: MIT Press.
Yan, S., Xu, D., Zhang, B., Zhang, H.-J., Yang, Q., & Lin, S. (2007). Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Trans. Pattern Anal. Mach. Intell., 29(1), 40–51.
Yang, J., Zhang, D., Frangi, A. F., & Yang, J.-Y. (2004). Two-dimensional PCA: A new approach to appearance-based face representation and recognition. IEEE Trans. Pattern Anal. Mach. Intell., 26(1), 131–137.
Ye, J., Janardan, R., & Li, Q. (2004). Two-dimensional linear discriminant analysis. In L. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17. Cambridge, MA: MIT Press.
Zhong, G., Li, W.-J., Yeung, D.-Y., Hou, X., & Liu, C.-L. (2010). Gaussian process latent random field. In Proceedings of the 24th AAAI Conference on Artificial Intelligence. Cambridge, MA: MIT Press.

Author notes

Color versions of some figures in this letter are presented in the online supplement available at http://www.mitpressjournals.org/doi/suppl/10.1162/NECO_a_00570.