## Abstract

We present a supervised model for tensor dimensionality reduction, called large margin low rank tensor analysis (LMLRTA). In contrast to traditional vector representation-based dimensionality reduction methods, LMLRTA can take tensors of any order as input. And unlike previous tensor dimensionality reduction methods, which can learn only low-dimensional embeddings with a priori specified dimensionality, LMLRTA can automatically and jointly learn the dimensionality and the low-dimensional representations from data. Moreover, LMLRTA delivers low rank projection matrices, while it encourages data of the same class to be close and data of different classes to be separated by a large margin of distance in the low-dimensional tensor space. LMLRTA can be optimized using an iterative fixed-point continuation algorithm, which is guaranteed to converge to a local optimal solution of the optimization problem. We evaluate LMLRTA on an object recognition application, where the data are represented as 2D tensors, and on a face recognition application, where the data are represented as 3D tensors. Experimental results show the superiority of LMLRTA over state-of-the-art approaches.

## 1. Introduction

Dimensionality reduction is one of the most fundamental problems in several related areas, including machine learning, pattern recognition, and data mining. Effective dimensionality reduction techniques dramatically facilitate subsequent visualization, classification, and retrieval tasks. Over the past few decades, plenty of dimensionality reduction methods have been proposed and successfully used in many applications, such as optical character recognition (OCR), face recognition, and image and video retrieval. Traditional linear dimensionality reduction methods, such as principal component analysis (PCA) (Jolliffe, 2002) and linear discriminant analysis (LDA) (Fukunaga, 1991), are simple to implement. However, they are guaranteed only to discover the true structure of data lying on or near a linear subspace of the high-dimensional input space. To alleviate this problem, kernel extensions of these methods have been proposed: kernel principal component analysis (KPCA) (Schölkopf, Smola, & Müller, 1998) and generalized discriminant analysis (GDA) (Baudat & Anouar, 2000). Seung and Lee (2000) argue that the human brain represents real-world perceptual stimuli in a manifold way. Concurrently with and after their work, numerous manifold learning algorithms, such as isometric feature mapping (Isomap) (Tenenbaum, de Silva, & Langford, 2000) and locally linear embedding (LLE) (Roweis & Saul, 2000), have been proposed for discovering the manifold structure of data embedded in a high-dimensional space. These manifold learning methods can faithfully preserve the geometrical structure of data. Unfortunately, most of the existing dimensionality-reduction methods, either linear or nonlinear, can work only on vectorized data representations, although data in many applications are more naturally represented as high-order tensors, such as 2D images and 3D textures.

To generalize traditional vector representation–based dimensionality-reduction methods to applications involving high-order tensors, some tensor dimensionality reduction approaches have been proposed. Two representative examples are 2DPCA (Yang, Zhang, Frangi, & Yang, 2004) and 2DLDA (Ye, Janardan, & Li, 2004). These approaches learn the low-dimensional representations of tensor data in an unsupervised or a supervised way. In particular, the approach presented in Wang, Yan, Huang, and Tang (2007) is theoretically guaranteed to converge to a local optimal solution of the learning problem. However, these approaches share a common issue: the dimensionality of the low-dimensional tensor space must be manually specified before they can be applied. Therefore, these approaches may not necessarily discover the genuine manifold structure of the tensor data.

To address these problems of previous approaches, in this letter, we propose a novel tensor dimensionality reduction model, large margin low rank tensor analysis (LMLRTA), which is aimed at learning the low-dimensional representations of tensors using techniques of multilinear algebra (Northcott, 1984) and graph theory (Bondy & Murty, 1976). Compared to traditional vector representation-based dimensionality reduction approaches, LMLRTA can take tensors of any order as input, including 1D vectors (one-order tensors), 2D matrices (two-order tensors), and more. Furthermore, unlike previous tensor dimensionality reduction approaches (Yang et al., 2004; Ye et al., 2004; Wang et al., 2007), which can learn only low-dimensional embeddings with a priori specified dimensionality, LMLRTA can automatically and jointly learn the optimal dimensionality and the low-dimensional representations from data. In addition, LMLRTA enforces the low-dimensional embeddings of the same class to be close and those of different classes to be separated by a large margin of distance. More important, it delivers low rank projection matrices for mapping high-dimensional data to the learned low-dimensional tensor space. We optimize LMLRTA using an iterative fixed-point continuation (FPC) algorithm, which is guaranteed to converge to a local optimal solution of the learning problem.

The rest of this letter is organized as follows. In section 2, we provide a brief overview of previous work on dimensionality reduction. In section 3, we present our proposed model, LMLRTA, in detail, including its formulation and optimization. In particular, we theoretically prove that LMLRTA can converge to a local optimal solution of the optimization problem. Section 4 shows the experimental results on real-world applications, including object recognition and face recognition, which involve 2D and 3D tensors, respectively. We conclude in section 5 with remarks and future work.

## 2. Previous Work

In order to find effective low-dimensional representations of data, many dimensionality reduction approaches have been proposed. The most representative approaches are principal component analysis (PCA) and linear discriminant analysis (LDA) for unsupervised and supervised learning paradigms, respectively. They are widely used in many applications due to their simplicity and efficiency. However, it is well known that they are optimal only if the relation between the latent and the observed space can be described with a linear function. To address this issue, nonlinear extensions based on kernel method have been proposed to provide nonlinear formulations: kernel principal component analysis (KPCA) (Schölkopf et al., 1998) and generalized discriminant analysis (GDA) (Baudat & Anouar, 2000).

Over the past decade, many manifold learning approaches have been proposed. These approaches, including isometric feature mapping (Isomap) (Tenenbaum et al., 2000) and locally linear embedding (LLE) (Roweis & Saul, 2000), can faithfully preserve global or local geometrical properties of the nonlinear structure of data. However, these methods work only on a given set of data points and cannot be easily extended to out-of-sample data (Bengio et al., 2003). To alleviate this problem, locality preserving projections (LPP) (He & Niyogi, 2003) and local Fisher discriminant analysis (LFDA) (Sugiyama, 2007) were proposed to approximate the manifold structure in a linear subspace by preserving local similarity between data points. In particular, Yan et al. (2007) proposed a general framework known as graph embedding for dimensionality reduction. Most of the spectral learning-based approaches, either linear or nonlinear, supervised or unsupervised, are contained in this framework. Furthermore, based on this framework, the authors proposed the marginal Fisher analysis (MFA) algorithm for supervised linear dimensionality reduction. In the research of probabilistic learning models, Lawrence (2005) proposed the gaussian process latent variable model (GPLVM), which extends PCA to a probabilistic nonlinear formulation. Combining a gaussian Markov random field prior with GPLVM, Zhong, Li, Yeung, Hou, and Liu (2010) proposed the gaussian process latent random field model, which can be considered a supervised variant of GPLVM. In the area of neural network research, Hinton and Salakhutdinov (2006) proposed a deep neural network model, the autoencoder, for dimensionality reduction.
To exploit the effect of deep architectures for dimensionality reduction, some other deep neural network models were introduced, such as deep belief nets (DBN) (Hinton, Osindero, & Teh, 2006), the stacked autoencoder (SAE) (Bengio, Lamblin, Popovici, & Larochelle, 2006), and the stacked denoising autoencoder (SDAE) (Vincent, Larochelle, Lajoie, Bengio, & Manzagol, 2010). These studies show that deep neural networks can generally learn high-level representations of data, which can benefit subsequent recognition tasks.

All of the above approaches assume that the input data are in the form of vectors. In many real-world applications, however, the objects are essentially represented as high-order tensors, such as 2D images or 3D textures. One has to unfold these tensors into one-dimensional vectors before the dimensionality reduction approaches can be applied, and some useful information in the original data may not be sufficiently preserved. Moreover, high-dimensional vectorized representations suffer from the curse of dimensionality, as well as high computational cost. To address these problems, 2DPCA (Yang et al., 2004) and 2DLDA (Ye et al., 2004) were proposed to extend the original PCA and LDA algorithms to work directly on 2D matrices rather than 1D vectors. In recent years, many other approaches (Yan et al., 2007; Tao, Li, Wu, & Maybank, 2007; Fu & Huang, 2008; Liu, Liu, & Chan, 2010; Liu, Liu, Wonka, & Ye, 2012) were also proposed to deal with high-order tensor problems. In particular, Wang et al. (2007) proposed a tensor dimensionality reduction method based on the graph embedding framework, the first method to give a convergent solution. However, all of these previous tensor dimensionality reduction approaches have a common shortcoming: the dimensionality of the low-dimensional representations must be specified manually before the approaches can be applied.

To address these issues in both vector representation–based and tensor representation–based dimensionality reduction approaches, we propose a novel model for tensor dimensionality reduction, large margin low rank tensor analysis (LMLRTA). It can take any order of tensors as input and automatically learn the dimensionality of the low-dimensional representations.

## 3. Large Margin Low Rank Tensor Analysis

In this section, we introduce the notation and some basic terminologies on tensor operations (Kolda & Bader, 2009; Dai & Yeung, 2006). We then detail our model, LMLRTA, including its formulation and optimization. Specifically, we prove that LMLRTA can converge to a local optimal solution of the learning problem.

### 3.1. Notation and Terminology.

We denote a vector by a bold lowercase letter such as **v**, a matrix by a bold uppercase letter such as **M**, and a tensor by a calligraphic capital letter such as $\mathcal{A}$. The *i*th row and *j*th column of a matrix **M** are denoted $\mathbf{M}_{i\cdot}$ and $\mathbf{M}_{\cdot j}$, respectively. $\mathbf{M}_{ij}$ denotes the element of **M** at the *i*th row and *j*th column. $\mathbf{v}_i$ is the *i*th element of a vector **v**. We use $\mathbf{M}^T$ to denote the transpose of **M**, and $\mathrm{tr}(\mathbf{M})$ to denote the trace of **M**. Suppose $\mathcal{A}$ is a tensor of size $I_1 \times I_2 \times \cdots \times I_L$. The order of $\mathcal{A}$ is *L*, and the *l*th dimension (or mode) of $\mathcal{A}$ is of size $I_l$. In addition, we denote the index of a single entry within a tensor by subscripts, such as $\mathcal{A}_{i_1 i_2 \cdots i_L}$.

*The scalar product of two tensors $\mathcal{A}$ and $\mathcal{B}$ is defined as $\langle \mathcal{A}, \mathcal{B} \rangle = \sum_{i_1, \ldots, i_L} \mathcal{A}_{i_1 \cdots i_L} \mathcal{B}^{*}_{i_1 \cdots i_L}$, where $*$ denotes complex conjugation. Furthermore, the Frobenius norm of a tensor $\mathcal{A}$ is defined as $\|\mathcal{A}\|_F = \sqrt{\langle \mathcal{A}, \mathcal{A} \rangle}$.*

*The l-mode product of a tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_L}$ and a matrix $\mathbf{U} \in \mathbb{R}^{J_l \times I_l}$ is an $I_1 \times \cdots \times I_{l-1} \times J_l \times I_{l+1} \times \cdots \times I_L$ tensor denoted as $\mathcal{A} \times_l \mathbf{U}$, where the corresponding entries are given by $(\mathcal{A} \times_l \mathbf{U})_{i_1 \cdots i_{l-1} j_l i_{l+1} \cdots i_L} = \sum_{i_l} \mathcal{A}_{i_1 \cdots i_L} \mathbf{U}_{j_l i_l}$.*

*Let $\mathcal{A}$ be an $I_1 \times \cdots \times I_L$ tensor and $(\pi_1, \ldots, \pi_{L-1})$ be any permutation of the entries of the set $\{1, \ldots, L\} \setminus \{l\}$. The l-mode unfolding of the tensor $\mathcal{A}$ into an $I_l \times \prod_{k=1}^{L-1} I_{\pi_k}$ matrix, denoted as $\mathbf{A}^{(l)}$, is defined by $\mathbf{A}^{(l)}_{i_l, j} = \mathcal{A}_{i_1 i_2 \cdots i_L}$, where $j = 1 + \sum_{k=1}^{L-1} (i_{\pi_k} - 1) J_k$ with $J_k = \prod_{m=1}^{k-1} I_{\pi_m}$.*
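The two operations above can be sketched in NumPy. This is an illustrative implementation of the definitions (the function names are ours, not from the letter), using the convention that the remaining modes keep their original order in the unfolding:

```python
import numpy as np

def unfold(A, mode):
    """l-mode unfolding: move `mode` to the front and flatten the rest."""
    return np.moveaxis(A, mode, 0).reshape(A.shape[mode], -1)

def mode_product(A, U, mode):
    """l-mode product A x_mode U for a tensor A and a matrix U of shape (J, I_mode)."""
    # Contract U's second index with A's `mode` index, then restore the axis order.
    out = np.tensordot(U, A, axes=([1], [mode]))
    return np.moveaxis(out, 0, mode)
```

With this ordering convention, the familiar identity $(\mathcal{A} \times_l \mathbf{U})^{(l)} = \mathbf{U}\mathbf{A}^{(l)}$ holds, which is a quick sanity check for the implementation.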

The multilinear rank of tensors, as well as other rank concepts, is elegantly discussed in de Silva and Lim (2008). In this letter, we focus only on the multilinear rank of tensors and call it "rank" for short.

### 3.2. Formulation of LMLRTA.

As researchers in the area of cognitive psychology have pointed out, humans learn based on the similarity of examples (Rosch, 1973). Here, we formulate our model based on the local similarity of tensor data. In addition, thanks to the existence of many “teachers,” we can generally obtain the categorical information of the examples before or during learning. Therefore, we formulate our learning model in a supervised scheme.

Given a set of *N* tensor data, $\{\mathcal{X}_i \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_L}\}_{i=1}^{N}$, with the associated class labels $\{c_i \in \{1, 2, \ldots, C\}\}_{i=1}^{N}$, where *L* is the order of the tensors and *C* is the number of classes, we learn *L* low rank projection matrices $\{\mathbf{U}_l \in \mathbb{R}^{P_l \times I_l}\}_{l=1}^{L}$, such that *N* embedded data points can be obtained as $\mathcal{Y}_i = \mathcal{X}_i \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2 \cdots \times_L \mathbf{U}_L$.
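As an illustration, this multilinear projection is just a chain of *l*-mode products; a minimal NumPy sketch (the function name and shapes are our own choices):

```python
import numpy as np

def project(X, Us):
    """Compute Y = X x_1 U_1 x_2 U_2 ... x_L U_L by successive l-mode products."""
    Y = X
    for mode, U in enumerate(Us):
        # l-mode product: contract U's columns with the current `mode` axis,
        # then move the new axis back into place.
        Y = np.moveaxis(np.tensordot(U, Y, axes=([1], [mode])), 0, mode)
    return Y
```

For instance, projecting a $4 \times 5 \times 6$ tensor with matrices of shapes $2 \times 4$, $3 \times 5$, and $2 \times 6$ yields a $2 \times 3 \times 2$ tensor.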

Concretely, LMLRTA is formulated as the following optimization problem:

$$\min_{\mathbf{U}_1, \ldots, \mathbf{U}_L} \; \sum_{l=1}^{L} \mathrm{rank}(\mathbf{U}_l) \;+\; \alpha \sum_{i,j} \mathbf{S}_{ij} \, \|\mathcal{Y}_i - \mathcal{Y}_j\|_F^2 \;+\; \beta \sum_{i,j,k} \mathbf{S}_{ij} \widetilde{\mathbf{S}}_{ik} \, \big[ 1 + \|\mathcal{Y}_i - \mathcal{Y}_j\|_F^2 - \|\mathcal{Y}_i - \mathcal{Y}_k\|_F^2 \big]_+ \, ,$$

where $\mathrm{rank}(\mathbf{U}_l)$ is the rank of matrix $\mathbf{U}_l$, $\|\cdot\|_F$ is the Frobenius norm of a tensor, $[z]_+ = \max(z, 0)$ is the so-called hinge loss, which is aimed at maximizing the margin between classes, and $\alpha$ and $\beta$ are trade-off parameters. Here, we define two similarity matrices, $\mathbf{S}$ and $\widetilde{\mathbf{S}}$. If $\mathcal{X}_i$ and $\mathcal{X}_j$ have the same class label and $\mathcal{X}_i$ is one of the $k_1$-nearest neighbors of $\mathcal{X}_j$ or $\mathcal{X}_j$ is one of the $k_1$-nearest neighbors of $\mathcal{X}_i$, then $\mathbf{S}_{ij} = 1$; otherwise, $\mathbf{S}_{ij} = 0$. If $\mathcal{X}_i$ and $\mathcal{X}_j$ have different class labels but $\mathcal{X}_i$ is one of the $k_2$-nearest neighbors of $\mathcal{X}_j$ or $\mathcal{X}_j$ is one of the $k_2$-nearest neighbors of $\mathcal{X}_i$, then $\widetilde{\mathbf{S}}_{ij} = 1$; otherwise, $\widetilde{\mathbf{S}}_{ij} = 0$. Nearest neighbors are determined by the tensor Frobenius norm-based distance. Like the binary matrix $\mathbf{S}$, the matrix $\widetilde{\mathbf{S}}$ is fixed and does not change during learning.
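A minimal NumPy sketch of how such binary similarity matrices could be built (our own helper, assuming the symmetric OR rule and Frobenius-distance neighbors described above):

```python
import numpy as np

def knn_similarity(X, labels, k1, k2):
    """Build binary similarity matrices: S marks same-class k1-NN pairs,
    S_til marks different-class k2-NN pairs; both symmetrized by an OR rule."""
    n = len(X)
    # Pairwise tensor Frobenius distances (norm of the raveled difference).
    D = np.array([[np.linalg.norm(X[i] - X[j]) for j in range(n)] for i in range(n)])
    S = np.zeros((n, n))
    S_til = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(D[i])
        same = [j for j in order if j != i and labels[j] == labels[i]][:k1]
        diff = [j for j in order if labels[j] != labels[i]][:k2]
        for j in same:
            S[i, j] = S[j, i] = 1.0      # same class, within k1-NN of each other
        for j in diff:
            S_til[i, j] = S_til[j, i] = 1.0  # different class, within k2-NN
    return S, S_til
```

Both matrices are computed once from the input tensors and kept fixed during learning, mirroring the description above.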

The minimization of the first term of the objective function, $\sum_{l=1}^{L} \mathrm{rank}(\mathbf{U}_l)$, is to learn low rank $\mathbf{U}_l$'s, which straightforwardly leads to low-dimensional representations of the tensors. The second term of the objective function enforces the neighboring data in each class to be close in the low-dimensional tensor space. It can be considered a graph Laplacian-parameterized loss function with respect to the low-dimensional embeddings (Chung, 1997; Belkin & Niyogi, 2003; Tenenbaum, Kemp, Griffiths, & Goodman, 2011), where each node corresponds to one tensor datum in the given data set. For each tensor datum $\mathcal{X}_i$, suppose it belongs to class $c_i$. The hinge loss in the third term will be incurred by a differently labeled datum $\mathcal{X}_k$ within the $k_2$-nearest neighbors of $\mathcal{X}_i$ if its distance to $\mathcal{X}_i$ does not exceed by 1 the distance from $\mathcal{X}_i$ to any of its $k_1$-nearest neighbors within the class $c_i$. This third term thereby favors projection matrices under which different classes maintain a large margin of distance. Furthermore, it encourages nearby data in different classes to be far apart in the low-dimensional tensor space.

However, $\mathrm{rank}(\mathbf{U}_l)$ is a nonconvex function with respect to $\mathbf{U}_l$ and difficult to optimize. Following recent work in matrix completion (Candès & Tao, 2010; Candès & Recht, 2012), we replace it with its convex envelope, the nuclear norm of $\mathbf{U}_l$, defined as the sum of its singular values, $\|\mathbf{U}_l\|_* = \sum_{i=1}^{r} \sigma_i$, where the $\sigma_i$'s are the singular values of $\mathbf{U}_l$ and *r* is the rank of $\mathbf{U}_l$. Thus, the resulting formulation of our model can be written as

$$\min_{\mathbf{U}_1, \ldots, \mathbf{U}_L} \; \sum_{l=1}^{L} \|\mathbf{U}_l\|_* \;+\; \alpha \sum_{i,j} \mathbf{S}_{ij} \, \|\mathcal{Y}_i - \mathcal{Y}_j\|_F^2 \;+\; \beta \sum_{i,j,k} \mathbf{S}_{ij} \widetilde{\mathbf{S}}_{ik} \, \big[ 1 + \|\mathcal{Y}_i - \mathcal{Y}_j\|_F^2 - \|\mathcal{Y}_i - \mathcal{Y}_k\|_F^2 \big]_+ \, . \tag{3.3}$$

Although the nuclear norm is convex with respect to $\mathbf{U}_l$, this problem is not convex with respect to each $\mathbf{U}_l$ as well, $l = 1, \ldots, L$. In order to find a suboptimal solution of $\mathbf{U}_l$, we relax problem 3.3 to a convex problem with respect to each individual $\mathbf{W}_l = \mathbf{U}_l^T \mathbf{U}_l$. Using the slack variables $\xi_{ijk}$, we can express the new formulation of our learning model as

$$\begin{aligned} \min_{\mathbf{W}_l \succeq 0, \, \boldsymbol{\xi}} \;\; & \mu \, \mathrm{tr}(\mathbf{W}_l) + \alpha \sum_{i,j} \mathbf{S}_{ij} \, \mathrm{tr}\big( \mathbf{D}_{ij}^{(l)} \mathbf{W}_l \big) + \beta \sum_{i,j,k} \mathbf{S}_{ij} \widetilde{\mathbf{S}}_{ik} \, \xi_{ijk} \\ \mathrm{s.t.} \;\; & \mathrm{tr}\big( ( \mathbf{D}_{ik}^{(l)} - \mathbf{D}_{ij}^{(l)} ) \, \mathbf{W}_l \big) \ge 1 - \xi_{ijk}, \quad \xi_{ijk} \ge 0, \end{aligned} \tag{3.4}$$

where $\mathbf{D}_{ij}^{(l)} = (\mathbf{Y}_i^{(l)} - \mathbf{Y}_j^{(l)})(\mathbf{Y}_i^{(l)} - \mathbf{Y}_j^{(l)})^T$, and $\mathbf{Y}_i^{(l)}$ is the *l*-mode unfolding matrix of the tensor $\mathcal{X}_i \times_1 \mathbf{U}_1 \cdots \times_{l-1} \mathbf{U}_{l-1} \times_{l+1} \mathbf{U}_{l+1} \cdots \times_L \mathbf{U}_L$. Note that for a symmetric and positive semidefinite $\mathbf{W}_l$, the nuclear norm reduces to the trace, $\|\mathbf{W}_l\|_* = \mathrm{tr}(\mathbf{W}_l)$. For the second term of the objective function and the first constraint in problem 3.4, we have used the property of the trace function: $\mathrm{tr}(\mathbf{U}_l (\mathbf{Y}_i^{(l)} - \mathbf{Y}_j^{(l)})(\mathbf{Y}_i^{(l)} - \mathbf{Y}_j^{(l)})^T \mathbf{U}_l^T) = \mathrm{tr}((\mathbf{Y}_i^{(l)} - \mathbf{Y}_j^{(l)})(\mathbf{Y}_i^{(l)} - \mathbf{Y}_j^{(l)})^T \mathbf{U}_l^T \mathbf{U}_l)$.
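The cyclic trace identity used here is easy to verify numerically; the following is a small, self-contained check (the shapes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.standard_normal((3, 5))   # projection matrix U_l
Yi = rng.standard_normal((5, 7))  # l-mode unfoldings of two tensors
Yj = rng.standard_normal((5, 7))

Dij = (Yi - Yj) @ (Yi - Yj).T     # 5 x 5, symmetric positive semidefinite
lhs = np.trace(U @ Dij @ U.T)     # squared distance in the projected space
rhs = np.trace(Dij @ U.T @ U)     # same value, expressed through W_l = U_l^T U_l
assert np.isclose(lhs, rhs)
```

This is precisely what allows the distance terms to be written as linear functions of $\mathbf{W}_l$, making each subproblem convex.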

Problem 3.4 is not jointly convex
with respect to all the **W**_{l}s. However, it is convex with respect to each of them. This is guaranteed
by the following lemma:

**Lemma 1.** *Problem 3.4 is convex with respect to each $\mathbf{W}_l$.*

**Remark** (relation to previous work).

LMLRTA can be considered a supervised multilinear extension of locality preserving projections (LPP) (He & Niyogi, 2003), in that the second term of the objective function in problem 3.4 forces neighboring data in the same class to be close in the low-dimensional tensor space.

LMLRTA can also be considered a reformulation of tensor marginal Fisher analysis (TMFA) (Yan et al., 2007). However, TMFA is not guaranteed to converge to a local optimum of the optimization problem (Wang et al., 2007), but LMLRTA is guaranteed as proved in section 3.3.

Problem 3.4 can also be considered a variant of the large margin nearest neighbor (LMNN) algorithm (Weinberger, Blitzer, & Saul, 2005) for distance metric learning in tensor space. Moreover, via the formulation of problem 3.4, we can learn low-rank distance matrices, which the LMNN algorithm cannot.

In contrast to previous approaches for tensor dimensionality reduction, which can learn projection matrices only with prespecified dimensionality of the low-dimensional representations, LMLRTA can automatically learn the dimensionality of the low-dimensional representations from the given data. We show this in section 3.3.

Unlike most deep neural network models (Hinton et al., 2006; Bengio et al., 2006; Vincent et al., 2010), which can take only vectorized representations of data as input, LMLRTA can take tensors of any order as input. Moreover, with their large number of parameters, deep neural network models generally require a large amount of training data. If the size of the training set is small, deep neural network models may fail to learn the intrinsic structure of data, whereas LMLRTA can perform much better in this case. Experimental results in section 4 demonstrate this effect.

### 3.3. Optimization.

Similar to previous approaches on tensor dimensionality reduction (Dai & Yeung, 2006; Wang et al., 2007), here we solve problem 3.4 using an iterative optimization algorithm. In each iteration, we refine one projection matrix while fixing the others. For each $\mathbf{W}_l$, problem 3.4 is a semidefinite programming problem, which can be solved using off-the-shelf solvers, such as SeDuMi (http://sedumi.ie.lehigh.edu/) and CVX (Grant & Boyd, 2008). However, the computational cost of semidefinite programming approaches is generally very high. Instead, we solve the problem using a fixed-point continuation (FPC) algorithm (Ma, Goldfarb, & Chen, 2011).

FPC is an iterative optimization method. In the *t*th iteration, it uses two alternating steps:

- Gradient step: $\mathbf{Z}_l^t = \mathbf{W}_l^t - \tau \, g(\mathbf{W}_l^t)$.
- Shrinkage step: $\mathbf{W}_l^{t+1} = S_{\tau\mu}(\mathbf{Z}_l^t)$.

Here, $g(\mathbf{W}_l^t)$ is the subgradient of the objective function in problem 3.4 with respect to $\mathbf{W}_l^t$ (excluding the nuclear norm term), and $\tau$ is the step size. Note that the hinge loss is not differentiable, but we can compute its subgradient and use a standard descent algorithm to optimize the problem. Thus, we can calculate $g(\mathbf{W}_l^t)$ as

$$g(\mathbf{W}_l^t) = \alpha \sum_{i,j} \mathbf{S}_{ij} \, \mathbf{D}_{ij}^{(l)} + \beta \sum_{(i,j,k) \in \mathcal{T}} \mathbf{S}_{ij} \widetilde{\mathbf{S}}_{ik} \, \big( \mathbf{D}_{ij}^{(l)} - \mathbf{D}_{ik}^{(l)} \big),$$

where $\mathcal{T}$ is the set of triplets whose corresponding slack variable exceeds zero, that is, $\mathcal{T} = \{(i, j, k) : \xi_{ijk} > 0\}$.

In the shrinkage step, $S_{\tau\mu}(\cdot)$ is a matrix shrinkage operator on $\mathbf{Z}_l^t$: $S_{\tau\mu}(\mathbf{Z}_l^t) = \mathbf{V} \max(\boldsymbol{\Lambda} - \tau\mu\mathbf{I}, 0) \mathbf{V}^T$, where $\mathbf{Z}_l^t = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^T$ is the eigenvalue decomposition of $\mathbf{Z}_l^t$, max is element-wise, and $\tau\mu\mathbf{I}$ is a diagonal matrix with all the diagonal elements set to $\tau\mu$. Here, since $\mathbf{W}_l^t$ is supposed to be a symmetric and positive semidefinite matrix, its eigenvalues should also be its singular values and nonnegative. Therefore, we adopt the eigenvalue decomposition method to shrink the rank of $\mathbf{Z}_l^t$. To this end, the shrinkage operator shifts the eigenvalues down and truncates any eigenvalue less than $\tau\mu$ to zero. This step reduces the nuclear norm of $\mathbf{W}_l^{t+1}$. If some eigenvalues are truncated to zeros, this step reduces the rank of $\mathbf{W}_l^{t+1}$ as well. When the algorithm converges, the rank of the learned $\mathbf{W}_l$ represents the optimal dimensionality of the *l*th mode of the low-dimensional representations.
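A sketch of this eigenvalue-based shrinkage operator in NumPy (our own minimal implementation of the step just described):

```python
import numpy as np

def shrink_psd(Z, nu):
    """Shift the eigenvalues of a symmetric matrix down by nu and truncate
    negatives to zero; this reduces both the trace and (possibly) the rank."""
    vals, vecs = np.linalg.eigh((Z + Z.T) / 2)  # symmetrize for numerical safety
    vals = np.maximum(vals - nu, 0.0)
    return (vecs * vals) @ vecs.T
```

Eigenvalues that fall below `nu` are truncated to zero, so repeated shrinkage gradually lowers the rank of the iterate, which is exactly the mechanism by which the dimensionality is learned.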

To improve the speed of the applied FPC algorithm, we follow Ma et al. (2011) and use a continuation optimization procedure. This involves beginning with a large value $\mu_1$ and solving a sequence of subproblems, each with a decreasing $\mu$ value, using the previous solution as the initial point. The sequence of $\mu$ values is determined by a decay parameter $\eta_\mu$: $\mu_{k+1} = \max(\mu_k \eta_\mu, \bar{\mu})$, $k = 1, \ldots, K - 1$, where $\bar{\mu}$ is the final value to use and *K* is the number of rounds of continuation.
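The continuation schedule can be sketched as follows (a hypothetical helper; the parameter names mirror the description above):

```python
def mu_schedule(mu1, eta, mu_bar, K):
    """Return [mu_1, ..., mu_K] with mu_{k+1} = max(eta * mu_k, mu_bar)."""
    mus = [mu1]
    for _ in range(K - 1):
        mus.append(max(eta * mus[-1], mu_bar))
    return mus
```

The sequence decays geometrically and is clamped at the final value $\bar{\mu}$, so later subproblems apply progressively lighter shrinkage.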

As $\mathbf{W}_l^t$ gets close to an optimal solution $\mathbf{W}_l^*$, the distance between $\mathbf{W}_l^t$ and $\mathbf{W}_l^{t+1}$ should become very small. We use the following condition as a stopping criterion (Ma et al., 2011):

$$\frac{\|\mathbf{W}_l^{t+1} - \mathbf{W}_l^t\|_F}{\max(1, \|\mathbf{W}_l^t\|_F)} < tol,$$

where $\|\mathbf{M}\|_F$ denotes the Frobenius norm of a matrix $\mathbf{M}$ and *tol* is a small, positive number. In our experiments, $tol = 10^{-6}$ was used. For clarity, we present the complete FPC algorithm in algorithm 1.
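This criterion is a one-liner in NumPy (our own sketch):

```python
import numpy as np

def has_converged(W_prev, W_next, tol=1e-6):
    """Relative Frobenius-norm change between successive FPC iterates."""
    return np.linalg.norm(W_next - W_prev) / max(1.0, np.linalg.norm(W_prev)) < tol
```

The `max(1, ...)` denominator keeps the test meaningful even when the iterate itself has a small norm.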

For the convergence of the FPC algorithm, we present the following theorem:

**Theorem 1.** *For fixed $\mathbf{W}_k$, $k \ne l$, the sequence $\{\mathbf{W}_l^t\}$ generated by the FPC algorithm with $\tau \in (0, 2/\lambda_{\max})$ converges to the optimal solution $\mathbf{W}_l^*$ of problem 3.4, where $\lambda_{\max}$ is the maximum eigenvalue of $g(\mathbf{W}_l)$.*

The proof of this theorem is similar to that of theorem 4 in Ma et al. (2011). A minor difference is that since **W**_{l}=**U**^{T}_{l}**U**_{l} is a symmetric and positive semidefinite matrix, we use eigenvalue
decomposition instead of singular value decomposition as used for the proof of
theorem 4 in Ma et al. (2011).
Nevertheless, the derivation and results are the same.

Based on lemma 1 and theorem 1, we have the following theorem on the convergence of the proposed model, LMLRTA:

**Theorem 2.** *LMLRTA converges to a local optimal solution of problem 3.4.*

To prove theorem 2, we need only to prove that the objective function has a lower bound and that the iterative optimization procedure monotonically decreases the value of the objective function.

It is easy to see that the value of the objective function in problem 3.4 is always larger than or equal to 0. Hence, 0 is a lower bound of this objective function. For the optimization of each $\mathbf{W}_l$, $l = 1, \ldots, L$, from theorem 1, we know that the FPC algorithm minimizes the value of the objective function in problem 3.4. Therefore, the iterative procedures of LMLRTA monotonically decrease the value of the objective function, and LMLRTA is guaranteed to converge to a local optimal solution of problem 3.4.

For problems with one-order tensors (vector input), we have the following corollary:

**Corollary 1.** *If the given data are one-order tensors, the LMLRTA algorithm converges to the global optimal solution of problem 3.4.*

### 3.4. Generalization to New Tensor Data.

For the recognition of unseen test tensors, we employ the tensor Frobenius
norm-based *k*-nearest neighbor classifier as the recognizer, in
that it measures the local similarity between training data and test data in the
low dimensional tensor space (Rosch, 1973).
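A minimal sketch of this tensor Frobenius norm-based classifier (our own illustrative helper, defaulting to 1-NN as used in the experiments):

```python
import numpy as np

def knn_predict(train_tensors, train_labels, test_tensor, k=1):
    """k-NN in the low-dimensional tensor space, with the tensor
    Frobenius norm as the distance measure."""
    d = np.array([np.linalg.norm(test_tensor - T) for T in train_tensors])
    nearest = np.argsort(d)[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)  # majority vote
```

An unseen test tensor is first projected with the learned matrices and then labeled by its nearest training embeddings.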

## 4. Experiments

In this section, we report the experimental results obtained on two real-world applications: object recognition and face recognition. Particularly for the face recognition application, we used 3D Gabor transformation of the face images as input signals, as the kernels of the Gabor filters resemble the receptive field profiles of the mammalian cortical simple cells (Daugman, 1988). In the following, we report the parameter settings and experimental results in detail.

### 4.1. Parameter Settings.

To demonstrate the effectiveness of our model, LMLRTA, for intrinsic tensor representation learning, we conducted experiments on the COIL-20 data set (http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php) and the ORL face data set (http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html). The COIL-20 data set has 20 classes of objects, with 72 samples in each class. The size of the images is $32 \times 32$. The ORL data set contains 400 images of 40 subjects, where each image was normalized to the same size. For each face image, we used 28 Gabor filters to extract textural features. In the end, each face image was represented as a 3D tensor. For the COIL-20 data set, we randomly selected one-third of the samples for testing and used the others for training. Because each subject has only 10 images in the ORL data set, we evaluated the proposed model, LMLRTA, based on the average over five random partitions of the data. Here, a variety of scenarios—different numbers of data from each class used to compose the training set—were tested. In our experiments on both data sets, the feature values of the data were normalized to [0, 1] by dividing by the largest pixel value or the largest value of the Gabor features.

To show the advantage of our proposed method, LMLRTA, we compared it with two classic vector representation-based dimensionality-reduction approaches—linear discriminant analysis (LDA) (Fisher, 1936) and marginal Fisher analysis (MFA) (Yan et al., 2007)—one deep neural network model, the stacked denoising autoencoder (SDAE) (Vincent et al., 2010), and two state-of-the-art tensor dimensionality-reduction methods—convergent multilinear discriminant analysis (CMDA) and convergent tensor marginal Fisher analysis (CTMFA) (Wang et al., 2007). For comparison, we also provide the classification results obtained in the original data space. In the LMLRTA algorithm, $k_1$ and $k_2$ were set to 7 and 15, respectively, for the COIL-20 data set, while for the ORL data set, $k_1$ was set to $n_{train} - 1$, where $n_{train}$ denotes the number of data of each class used to compose the training set. In implementing the FPC method for the optimization of each projection matrix, we initialized $\mathbf{W}_l$ with a random full-rank symmetric and positive semidefinite matrix and set the step size $\tau$ as required by theorem 1. Following Ma et al. (2011), we adopted $\mu$ values starting at $\mu_1 = \lambda_{\max}$, where $\lambda_{\max}$ was the largest eigenvalue of the initialized matrix, and decreased them until the final value $\bar{\mu}$ was reached. Furthermore, the trade-off parameter was selected using the holdout strategy. In particular, for the COIL-20 data set, half of the training data were randomly selected for model learning and the other half for validation. For the ORL data set, we conducted model selection for each scenario of random partition of the data. For concreteness, in each scenario of random partition of the data, we selected one sample of each class of the training data to compose the validation set and used the others for model learning. For CMDA and CTMFA, we adopted the best setting learned by LMLRTA to specify the dimensionality of the low-dimensional tensor space. We used the code of SDAE from a public deep learning toolbox (https://github.com/rasmusbergpalm/DeepLearnToolbox). For all the methods but SDAE, the tensor Frobenius norm-based 1-nearest neighbor classifier was used for the recognition of the test data.

### 4.2. Visualization.

Figures 1a and 1b illustrate the 2D embeddings of the object images from the COIL-20 data set and those of the 3D Gabor transformations of the face images from the ORL data set, respectively. The *t*-distribution-based stochastic neighbor embedding (t-SNE) algorithm (van der Maaten & Hinton, 2008) was employed to learn these 2D embeddings, where the distances between data were measured based on the tensor Frobenius norm. From Figure 1, we can see that in the original space of these two data sets, most of the classes align on a submanifold embedded in the ambient space. However, for some classes, the data are scattered over a large area of the data space and even lie close to data of other classes. As a result, local similarity-based classifiers may predict the labels of some unseen data incorrectly in both of these original representation spaces. Hence, it is necessary to learn intrinsic and compact representations of the given tensor data.

Figures 2a and 2b illustrate the 2D embeddings of the low-dimensional tensor representations for the COIL-20 and the ORL data sets, respectively. LMLRTA was used to learn the low-dimensional tensor representations, while the t-SNE algorithm was used to generate the 2D embeddings. It is easy to see that LMLRTA successfully discovered the manifold structure of these two data sets. In both Figures 2a and 2b, the similarity of data of the same class is faithfully preserved, while the discrimination between classes is maximized.

Figure 3 shows some low-dimensional tensor representations of the images from the COIL-20 data set, which were learned by CMDA, CTMFA, and LMLRTA, respectively. Five classes were randomly selected, and the low-dimensional representations of five images from each class were further randomly selected for display. In particular, in each panel of Figure 3, each row shows the low-dimensional tensor representations of images from one class. In contrast to the dimensionality of the original images, $32 \times 32$, the dimensionality of the low-dimensional representations here is $21 \times 16$. We can see that all three methods can preserve the similarity of data of the same class faithfully. However, the discrimination between classes in the low-dimensional tensor space learned by LMLRTA is much better than that learned by CMDA and CTMFA. The recognition results in section 4.3 also support this observation.

### 4.3. Object Recognition Results on the COIL-20 Data Set (2D Tensors).

In this experiment, we compare LMLRTA with related approaches on the object recognition application: LDA, MFA, SDAE, CMDA, CTMFA, and classification in the original space. We conducted experiments on the COIL-20 data set. For LMLRTA, CMDA, and CTMFA, we followed the settings introduced in section 4.1. For LDA and MFA, we used the same model selection strategy as for LMLRTA; specifically, the dimensionalities of the LDA and MFA subspaces were selected using the holdout strategy. In order to reduce the noise level of the data, before using LDA and MFA for dimensionality reduction, we first projected the data onto a PCA subspace with 99% of the variance retained. Other than PCA, no other preprocessing was applied. For the SDAE algorithm, we used a six-layer neural network model. The sizes of the layers were 1024, 512, 256, 64, 32, and 20, respectively. The batch size was set to 80 for both pretraining and fine-tuning. The number of epochs for pretraining was set to 400, while that for fine-tuning was set to 5000.

Figure 4 shows the classification accuracies obtained by LMLRTA and the compared methods on the COIL-20 data set. The dimensionality of the low-dimensional representations learned by LMLRTA is [21, 16], which was used for the learning of CMDA and CTMFA as well. From the comparison of the tested methods, it is easy to see that LMLRTA performed best among all the methods: it achieved 100% accuracy on this data set. Due to the loss of local structural information of the images, the vector representation–based approaches, LDA and MFA, performed worst on this problem. Due to the limited training sample size and the vectorized representations of the data, the deep neural network model, SDAE, did not outperform LMLRTA on this problem and obtained only a recognition accuracy similar to that of classification in the original data space. The state-of-the-art tensor dimensionality-reduction approaches, CMDA and CTMFA, can converge to a local optimal solution of the learning problem but did not outperform LMLRTA either.

To show the convergence process of the FPC algorithm during learning the projection matrices, Figure 5 illustrates the values of the objective function against iterations during the optimization of LMLRTA on the COIL-20 data set. As we can see, the FPC algorithm converges to a stationary point of the problem as the iteration continues.

### 4.4. Face Recognition Results on the ORL Data Set (3D Tensors).

Figure 6 shows the classification accuracies and standard deviations obtained on the ORL data set. Due to the high computational complexity of LDA, MFA, and SDAE on this problem (the vectorized representations of the 3D Gabor tensors are of very high dimensionality), here we compared LMLRTA only to CMDA, CTMFA, and classification in the original data space. From Figure 6, we can see that LMLRTA consistently outperforms the compared convergent tensor dimensionality reduction approaches and classification in the original data space. More importantly, as LMLRTA gradually reduces the ranks of the projection matrices during optimization, it can automatically learn the dimensionality of the intrinsic low-dimensional tensor space from data. For the compared tensor dimensionality reduction approaches, however, this dimensionality must be manually specified before they can be applied.

In Table 1, we report the dimensionality of the low-dimensional representations learned by LMLRTA on the ORL data set. It shows that LMLRTA generally delivers low rank projection matrices and compact representations of the tensor data.

## 5. Conclusion

In this letter, we propose a supervised tensor dimensionality reduction model, large margin low rank tensor analysis (LMLRTA), which can automatically and jointly learn the dimensionality and the low-dimensional representations of tensors. We optimize LMLRTA using an iterative fixed-point continuation (FPC) algorithm, which is guaranteed to converge to a local optimal solution of the optimization problem. Experiments on object recognition and face recognition show the superiority of LMLRTA over classic vector representation-based dimensionality-reduction approaches, deep neural network models, and existing tensor dimensionality reduction approaches. In future work, we will attempt to extend LMLRTA to the transfer learning (Pan & Yang, 2010) and active learning (Cohn, Ladner, & Waibel, 1994) scenarios. Furthermore, we plan to combine LMLRTA with deep neural networks (LeCun, Bottou, Bengio, & Haffner, 2001) and nonnegative matrix factorization models (Lee & Seung, 1999) to solve challenging large-scale problems.

## Acknowledgments

We thank two anonymous reviewers for their valuable comments and constructive suggestions. This work is supported by the Social Sciences and Humanities Research Council of Canada and the Natural Sciences and Engineering Research Council of Canada.

## References

## Author notes

Color versions of some figures in this letter are presented in the online supplement available at http://www.mitpressjournals.org/doi/suppl/10.1162/NECO_a_00570.