## Abstract

We theoretically and experimentally investigate tensor-based regression and classification. Our focus is regularization with various tensor norms, including the overlapped trace norm, the latent trace norm, and the scaled latent trace norm. We first give dual optimization methods using the alternating direction method of multipliers, which is computationally efficient when the number of training samples is moderate. We then theoretically derive an excess risk bound for each tensor norm and clarify their behavior. Finally, we perform extensive experiments using simulated and real data and demonstrate the superiority of tensor-based learning methods over vector- and matrix-based learning methods.

## 1  Introduction

A wide range of real-world data takes the format of matrices and tensors, for example, recommendation (Karatzoglou, Amatriain, Baltrunas, & Oliver, 2010), video sequences (Kim, Wong, & Cipolla, 2007), climates (Bahadori, Yu, & Liu, 2014), genomes (Sankaranarayanan, Schomay, Aiello, & Alter, 2015), and neuroimaging (Zhou, Li, & Zhu, 2013). A naive way to learn from such matrix and tensor data is to vectorize them and apply ordinary regression or classification methods designed for vectorial data. However, such a vectorization approach would lead to loss in structural information of matrices and tensors such as low-rankness.

The objective of this letter is to investigate regression and classification methods that directly handle tensor data without vectorization. Low-rank structure of data has been successfully utilized in various applications, such as missing data imputation (Cai, Candès, & Shen, 2010), robust principal component analysis (Candès, Li, Ma, & Wright, 2011), and subspace clustering (Liu, Lin, & Yu, 2010). Instead of low-rankness of the data itself, in this letter we consider its dual: low-rankness of the learning coefficients of a regressor or a classifier. Low-rankness in learning coefficients means that only a subspace of the feature space is used for regression and classification.

For matrices, regression and classification have been studied in Tomioka and Aihara (2007) and Zhou and Li (2014) in the context of EEG data analysis. It was experimentally demonstrated that directly learning matrix data by low-rank regularization can significantly improve performance compared to learning after vectorization. Another advantage of using low-rank regularization in the context of EEG data analysis is that analyzing singular value spectra of learning coefficients is useful in understanding activities of brain regions.

More recently, an inductive learning method for tensors has been explored (Signoretto, Dinh, De Lathauwer, & Suykens, 2013). Compared to the matrix case, learning with tensors is inherently more complex. For example, a tensor has one rank per mode (its multilinear rank), which makes it harder to define an appropriate notion of low-rankness than for a matrix, which has only one rank. So far, several tensor norms such as the overlapped trace norm or the tensor nuclear norm (Liu, Musialski, Wonka, & Ye, 2009), the latent trace norm (Tomioka & Suzuki, 2013), and the scaled latent trace norm (Wimalawarne, Sugiyama, & Tomioka, 2014) have been proposed and demonstrated to perform well for various tensor structures. However, theoretical analysis of tensor learning in inductive learning settings has received little attention so far. Another challenge in inductive tensor learning is the design of efficient optimization strategies, since tensor data often have much higher dimensionalities than matrix and vector data.

We theoretically and experimentally investigate tensor-based regression and classification with regularization by the overlapped trace norm, the latent trace norm, and the scaled latent trace norm. We first provide their dual formulations and propose optimization procedures using the alternating direction method of multipliers (Bertsekas, 1996), which is computationally efficient when the number of data samples is moderate. We then derive an excess risk bound for each tensor regularization, which allows us to theoretically understand the behavior of tensor norm regularization. More specifically, we elucidate that the excess risk of the overlapped trace norm is bounded with the average multilinear ranks of each mode, that of the latent trace norm is bounded with the minimum multilinear rank among all modes, and that of the scaled latent trace norm is bounded with the minimum ratio between multilinear ranks and mode dimensions. Finally, for simulated and real tensor data, we experimentally investigate the behavior of tensor-based regression and classification methods. The experimental results are in concordance with our theoretical findings, and tensor-based learning methods compare favorably with vector- and matrix-based methods.

The remainder of this letter is organized as follows. In section 2, we formulate the problem of tensor-based supervised learning and review the overlapped trace norm, the latent trace norm, and the scaled latent trace norm. In section 3, we derive dual optimization algorithms based on the alternating direction method of multipliers. In section 4, we theoretically give an excess risk bound for each tensor norm. In section 5, we give experimental results on both artificial and real-world data and illustrate the advantage of tensor-based learning methods. In section 6, we conclude.

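Throughout the paper, we use standard tensor notation following Kolda and Bader (2009). We represent a K-way tensor as $\mathcal{W} \in \mathbb{R}^{n_1 \times \cdots \times n_K}$ that consists of $N = \prod_{k=1}^{K} n_k$ elements. A mode-k fiber of $\mathcal{W}$ is an $n_k$-dimensional vector that can be obtained by fixing all except the kth index. The mode-k unfolding of tensor $\mathcal{W}$ is represented as $W_{(k)} \in \mathbb{R}^{n_k \times N/n_k}$, which is obtained by concatenating all the mode-k fibers along its columns. The spectral norm of a matrix X is denoted by $\|X\|_{\mathrm{op}}$, the maximum singular value of X. The operator $\langle \mathcal{W}, \mathcal{X} \rangle$ is the sum of element-wise multiplications of $\mathcal{W}$ and $\mathcal{X}$, that is, $\langle \mathcal{W}, \mathcal{X} \rangle = \mathrm{vec}(\mathcal{W})^\top \mathrm{vec}(\mathcal{X})$. The Frobenius norm of a tensor $\mathcal{X}$ is defined as $\|\mathcal{X}\|_F = \sqrt{\langle \mathcal{X}, \mathcal{X} \rangle}$.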
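
As a concrete illustration of this notation, the following NumPy sketch computes mode-k unfoldings, the inner product, and the Frobenius norm of a small random tensor. The helper name `unfold`, the 0-indexed modes, and the particular column ordering of the fibers are assumptions made for illustration; any fixed ordering of the columns leaves the singular values of the unfolding unchanged.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-`mode` unfolding (0-indexed, hypothetical helper): rows follow the
    chosen mode, columns enumerate all remaining index combinations."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4, 5))
X = rng.standard_normal((3, 4, 5))

inner = float(np.sum(W * X))               # <W, X>: sum of element-wise products
frobenius = float(np.sqrt(np.sum(W * W)))  # ||W||_F = sqrt(<W, W>)
unfolding_shapes = [unfold(W, k).shape for k in range(W.ndim)]
```

Each column of `unfold(W, k)` is one mode-k fiber, so the unfolding of an $n_1 \times n_2 \times n_3$ tensor along mode k has $n_k$ rows and $N / n_k$ columns.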

## 2  Learning with Tensor Regularization

In this section, we put forward inductive tensor learning models with tensor regularization and review different tensor norms used for low-rank regularization.

### 2.1  Problem Formulation

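Our focus in this letter is regression and classification of tensor data. We consider a data set $(\mathcal{X}_i, y_i)$, $i = 1, \ldots, m$, where $\mathcal{X}_i \in \mathbb{R}^{n_1 \times \cdots \times n_K}$ is a covariate tensor and $y_i \in \mathbb{R}$ for regression, while $y_i \in \{-1, +1\}$ for classification. We consider the following learning model for a tensor norm $\|\cdot\|_{\star}$:

$$\min_{\mathcal{W},\, b} \; \sum_{i=1}^{m} l(\mathcal{X}_i, y_i, \mathcal{W}, b) + \lambda \|\mathcal{W}\|_{\star}, \tag{2.1}$$

where $l(\cdot)$ is the loss function. The squared loss,

$$l(\mathcal{X}_i, y_i, \mathcal{W}, b) = \bigl(y_i - (\langle \mathcal{W}, \mathcal{X}_i \rangle + b)\bigr)^2, \tag{2.2}$$

is used for regression, and the logistic loss,

$$l(\mathcal{X}_i, y_i, \mathcal{W}, b) = \log\bigl(1 + \exp\bigl(-y_i (\langle \mathcal{W}, \mathcal{X}_i \rangle + b)\bigr)\bigr), \tag{2.3}$$

is used for classification. $b \in \mathbb{R}$ is the bias term, and $\lambda \geq 0$ is the regularization parameter. If $\|\cdot\|_{\star} = \|\cdot\|_F$ or $\|\cdot\|_{\star} = \|\cdot\|_1$, then the above problem is equivalent to ordinary vector-based l2- or l1-regularization.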
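
To make the learning model concrete, the sketch below evaluates a regularized objective of this form for an arbitrary norm. The function names, the use of a plain sum over samples, and the unscaled squared loss are assumptions for illustration, not a statement of the exact constants used in the letter.

```python
import numpy as np

def squared_loss(y, z):
    # (y - z)^2 for a real-valued target y and prediction z
    return (y - z) ** 2

def logistic_loss(y, z):
    # log(1 + exp(-y z)) for labels y in {-1, +1}, computed stably
    return np.logaddexp(0.0, -y * z)

def objective(W, b, Xs, ys, lam, norm_fn, loss_fn):
    """Sum of losses plus lam * norm_fn(W); Xs has shape (m, n1, ..., nK)."""
    scores = np.tensordot(Xs, W, axes=W.ndim) + b  # <W, X_i> + b for every i
    return float(np.sum(loss_fn(ys, scores)) + lam * norm_fn(W))

rng = np.random.default_rng(0)
Xs = rng.standard_normal((6, 2, 3, 4))
ys = np.array([1.0, -1.0, 1.0, 1.0, -1.0, -1.0])
W0 = np.zeros((2, 3, 4))
value_at_zero = objective(W0, 0.0, Xs, ys, lam=0.5,
                          norm_fn=np.linalg.norm, loss_fn=logistic_loss)
```

Passing `np.linalg.norm` as `norm_fn` gives the Frobenius norm, which corresponds to the vector-based l2 case; a tensor norm can be substituted without changing the rest of the code.
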
To understand the effect of tensor-based regularization, it is important to investigate the low-rankness of tensors. When a matrix $W \in \mathbb{R}^{n_1 \times n_2}$ is being considered, its trace norm is defined as

$$\|W\|_{\mathrm{tr}} = \sum_{j=1}^{J} \sigma_j, \tag{2.4}$$

where $\sigma_j$ is the jth singular value and J is the number of nonzero singular values ($J \leq \min(n_1, n_2)$). A matrix is called low rank if $J < \min(n_1, n_2)$. The matrix trace norm, equation 2.4, is a convex envelope of the matrix rank, and it is commonly used in matrix low-rank approximation (Recht, Fazel, & Parrilo, 2010).

As in matrices, the rank property is also available for tensors, but it is more complicated due to their multidimensional structure. The mode-k rank $r_k$ of a tensor $\mathcal{W}$ is defined as the rank of the mode-k unfolding $W_{(k)}$, and the multilinear rank of $\mathcal{W}$ is given as $(r_1, \ldots, r_K)$. The mode-i of a tensor $\mathcal{W}$ is called low rank if $r_i < n_i$.
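
The multilinear rank can be computed directly from the unfoldings. The following sketch builds a tensor with a known multilinear rank and recovers it; the helper names and the chosen dimensions are hypothetical.

```python
import numpy as np

def unfold(tensor, mode):
    # Mode-`mode` unfolding (0-indexed, hypothetical helper)
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def multilinear_rank(tensor):
    # Tuple of the matrix ranks of all mode-k unfoldings
    return tuple(int(np.linalg.matrix_rank(unfold(tensor, k)))
                 for k in range(tensor.ndim))

rng = np.random.default_rng(0)
core = rng.standard_normal((2, 3, 4))
U1, U2, U3 = (rng.standard_normal((n, r)) for n, r in [(6, 2), (7, 3), (8, 4)])
# Multiply the core by one factor matrix per mode
W = np.einsum('abc,ia,jb,kc->ijk', core, U1, U2, U3)

ranks = multilinear_rank(W)
low_rank_modes = [r < n for r, n in zip(ranks, W.shape)]
```

Here every mode is low rank because each mode-k rank is smaller than the corresponding mode dimension.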

### 2.2  Overlapped Trace Norm

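One of the earliest definitions of a tensor norm is the tensor nuclear norm (Liu, Musialski, Wonka, & Ye, 2009) or the overlapped trace norm (Tomioka & Suzuki, 2013), which can be represented for a tensor $\mathcal{W} \in \mathbb{R}^{n_1 \times \cdots \times n_K}$ as

$$\|\mathcal{W}\|_{\mathrm{overlap}} = \sum_{k=1}^{K} \|W_{(k)}\|_{\mathrm{tr}}. \tag{2.5}$$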
The overlapped trace norm can be viewed as a direct extension of the matrix trace norm since it unfolds a tensor on each of its modes and computes the sum of trace norms of the unfolded matrices. Regularization with the overlapped trace norm can also be seen as an overlapped group regularization due to the fact that the same tensor is unfolded over different modes and regularized with the trace norm.
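
The description above, unfolding on each mode and summing the matrix trace norms, translates directly into code. This is a minimal sketch with hypothetical helper names, checked on a rank-one tensor where the value is known in closed form.

```python
import numpy as np

def unfold(tensor, mode):
    # Mode-`mode` unfolding (0-indexed, hypothetical helper)
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def trace_norm(matrix):
    # Sum of singular values
    return float(np.linalg.svd(matrix, compute_uv=False).sum())

def overlapped_trace_norm(tensor):
    # Sum of the trace norms of all mode-k unfoldings
    return sum(trace_norm(unfold(tensor, k)) for k in range(tensor.ndim))

a = np.array([3.0, 4.0])              # Euclidean norm 5
b = np.array([1.0, 2.0, 2.0])         # Euclidean norm 3
c = np.array([0.0, 5.0, 12.0, 0.0])   # Euclidean norm 13
rank_one = np.einsum('i,j,k->ijk', a, b, c)
value = overlapped_trace_norm(rank_one)
```

Every unfolding of a rank-one tensor has a single nonzero singular value equal to the product of the three vector norms (here 5 × 3 × 13 = 195), so `value` equals 3 × 195 = 585 up to floating-point error.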

One of the popular applications of the overlapped trace norm is tensor completion (Gandy, Recht, & Yamada, 2011; Liu et al., 2009), where missing entries of a tensor are imputed. Another application is multilinear multitask learning (Romera-Paredes, Aung, Bianchi-Berthouze, & Pontil, 2013), where multiple vector-based linear learning tasks with a common feature space are arranged as a tensor feature structure and the multiple tasks are solved together with constraints to minimize the multilinear ranks of the tensor feature.

Theoretical analyses on the overlapped norm have been carried out for both tensor completion (Tomioka & Suzuki, 2013) and multilinear multitask learning (Wimalawarne et al., 2014). They have shown that the prediction error of overlapped trace norm regularization is bounded by the average mode-k ranks, which can be large if some modes are close to full rank even if there are low-rank modes. Thus, these studies imply that the overlapped trace norm performs well when the multilinear ranks have small variations, and it may result in poor performance when the multilinear ranks have high variations.

To overcome the weakness of the overlapped trace norm, recent research in tensor norms has led to new norms such as the latent trace norm (Tomioka & Suzuki, 2013) and the scaled latent trace norm (Wimalawarne et al., 2014).

### 2.3  Latent Trace Norm

Tomioka and Suzuki (2013) proposed the latent trace norm as

$$\|\mathcal{W}\|_{\mathrm{latent}} = \inf_{\mathcal{W}^{(1)} + \cdots + \mathcal{W}^{(K)} = \mathcal{W}} \; \sum_{k=1}^{K} \|W^{(k)}_{(k)}\|_{\mathrm{tr}}.$$

The latent trace norm takes a mixture of K latent tensors, where K is the number of modes, and regularizes each of them separately. In contrast to the overlapped trace norm, the latent trace norm regularizes a different latent tensor for each unfolded mode, so it tends to pick the latent tensor with the lowest rank.

In general, the latent trace norm results in a mixture of latent tensors, and the content of each latent tensor would depend on the rank of its unfolding. In an extreme case, for a tensor with all its modes full rank except one, regularization with the latent trace norm would make the latent tensor corresponding to the low-rank mode prominent while the others become zero.

### 2.4  Scaled Latent Trace Norm

Recently Wimalawarne et al. (2014) proposed the scaled latent trace norm as an extension of the latent trace norm:

$$\|\mathcal{W}\|_{\mathrm{scaled}} = \inf_{\mathcal{W}^{(1)} + \cdots + \mathcal{W}^{(K)} = \mathcal{W}} \; \sum_{k=1}^{K} \frac{1}{\sqrt{n_k}} \|W^{(k)}_{(k)}\|_{\mathrm{tr}}.$$

Compared to the latent trace norm, the scaled latent trace norm takes the rank relative to the mode dimension. A major drawback of the latent trace norm is its inability to identify the rank of a mode relative to its dimension. If a tensor has a mode whose dimension is smaller than those of the other modes, yet whose rank relative to its dimension is high compared to the other modes, the latent trace norm could incorrectly pick the smallest mode.

The scaled latent norm has the ability to overcome this problem by its scaling with the mode dimensions such that it is able to work with the relative ranks of the tensor. In the context of multilinear multitask learning, it has been shown that the scaled latent trace norm works well for tensors with high variations in multilinear ranks and mode dimensions compared to the overlapped trace norm and the latent trace norm (Wimalawarne et al., 2014).
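
The effect of the scaling can be seen on a small example. Evaluating the latent norms exactly requires solving an optimization problem over the latent tensors, so the sketch below only computes simple upper bounds obtained by placing the whole tensor in a single latent tensor: the trace norm of each unfolding, and the same quantity divided by the square root of the mode dimension. The helper names, dimensions, and ranks are hypothetical.

```python
import numpy as np

def unfold(tensor, mode):
    # Mode-`mode` unfolding (0-indexed, hypothetical helper)
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def trace_norm(matrix):
    return float(np.linalg.svd(matrix, compute_uv=False).sum())

def single_mode_bounds(tensor):
    """Upper bounds on the latent and scaled latent trace norms obtained by
    putting the whole tensor into one latent tensor (not the norms themselves)."""
    norms = np.array([trace_norm(unfold(tensor, k)) for k in range(tensor.ndim)])
    scaled = norms / np.sqrt(np.array(tensor.shape, dtype=float))
    return norms, scaled

rng = np.random.default_rng(0)
dims, ranks = (4, 50, 50), (3, 6, 8)   # a "flat" tensor; first mode nearly full rank
core = rng.standard_normal(ranks)
U = [np.linalg.qr(rng.standard_normal((n, r)))[0] for n, r in zip(dims, ranks)]
W = np.einsum('abc,ia,jb,kc->ijk', core, *U)

norms, scaled = single_mode_bounds(W)
mode_unscaled = int(np.argmin(norms))   # mode favored without scaling
mode_scaled = int(np.argmin(scaled))    # mode favored with 1/sqrt(n_k) scaling
```

In this example the unscaled quantity is smallest for the first mode, which has the smallest rank but is nearly full rank, whereas the scaled quantity is smallest for one of the large modes whose rank is low relative to its dimension.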

The inductive learning setting mentioned in equation 2.1 with the overlapped trace norm has been studied previously in Signoretto et al. (2013). However, theoretical analysis and performance comparison with other tensor norms have not been conducted yet. Similarly to tensor decomposition (Tomioka & Suzuki, 2013) and multilinear multitask learning (Wimalawarne et al., 2014), tensor-based regression and classification may also be improved by regularization methods that can work with high variations in multilinear ranks and mode dimensions.

In the following sections, to make tensor-based learning more practical and to improve its performance, we consider formulation 2.1 with the overlapped trace norm, the latent trace norm, and the scaled latent trace norm, and we give computationally efficient optimization algorithms and excess risk bounds.

## 3  Optimization

In this section, we consider the dual formulation for equation 2.1 and propose computationally efficient optimization algorithms. Since optimization of equation 2.1 with regularization using the overlapped trace norm has already been studied in Signoretto et al. (2013), we do not discuss it here. Our main focus in this section is optimization of equation 2.1 with regularization using the latent trace norm and the scaled latent trace norm.

Let us consider the formulation equation 2.1 for a data set with latent and scaled latent trace norm regularization as follows:
3.1
where, for and for any given regularization parameter , in the case of the latent trace norm and in the case of the scaled latent trace norm, respectively. is the unfolding of on its kth mode. It is worth noticing that the application of the latent and scaled latent trace norms requires optimizing over K latent tensors, which contain variables in total. For large K and N, solving the primal problem, equation 3.1, can be computationally expensive, especially in nonlinear problems such as logistic regression, since they require computationally expensive optimization methods such as gradient descent or the Newton method. If the number of training samples m is , solving the dual problem of equation 3.1 could be computationally more efficient. For this reason, we focus on optimization in the dual below.
The dual formulation of equation 3.1 can be written as follows (its detailed derivation is given in appendix  A):
3.2
where are dual variables corresponding to the training data set , is the conjugate loss function defined as
in the case of regression with the squared loss (Tomioka, Suzuki, & Sugiyama, 2011), and
with constraint in the case of classification with the logistic loss (Tomioka et al., 2011) and is the indicator function defined as if and otherwise. The constraint is due to the bias term b. Here, the auxiliary variables are introduced to remove the coupling between the indicator functions in the objective function (see appendix  A for details).

The alternating direction method of multipliers (ADMM) (Gabay & Mercier, 1976; Boyd, Parikh, Chu, Peleato, & Eckstein, 2011) has been previously used to solve primal problems of tensor decomposition (Tomioka, Suzuki, Hayashi, & Kashima, 2011) and multilinear multitask learning (Romera-Paredes et al., 2013) with the overlapped trace norm regularization. Optimization in the dual for tensor decomposition problems with the latent and scaled latent trace norm regularization has been solved using ADMM in Tomioka, Suzuki, Hayashi et al. (2011). Here, we also adopt ADMM to solve equation 3.2 and describe the formulation and the optimization steps in detail.

With the introduction of dual variables (corresponding to the primal variables of equation 3.1), , and parameter , the augmented Lagrangian function for equation 3.2 is defined as:
This ADMM formulation is solved for variables , , , and b by considering subproblems for each variable. Below, we give the solution for each variable at iterative step .
The first subproblem to solve is for at step :
where , and bt are the solutions obtained at step t.
Depending on the conjugate loss , the solution for differs. In the case of regression with the squared loss, equation 2.2, the augmented Lagrangian can be minimized with respect to by solving the following linear equation:
where , , , , and is the m-dimensional vector of all ones. Note that in the above system of equations, the coefficient matrix multiplied with does not change during optimization. Thus, it can be efficiently solved at each iteration by precomputing the Cholesky factorization of the matrix.
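
The remark about precomputing the Cholesky factorization can be sketched generically. The matrix below is only a stand-in for a fixed symmetric positive-definite coefficient matrix (a Gram matrix plus a multiple of the identity is an assumption, not the coefficient matrix of the linear system above); the point is that the factorization is computed once and reused for every new right-hand side.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
features = rng.standard_normal((m, 200))      # stand-in for vectorized covariates
A = features @ features.T + 1.0 * np.eye(m)   # hypothetical fixed SPD matrix

L = np.linalg.cholesky(A)                     # A = L L^T, computed once

def solve_with_factor(L, rhs):
    # Two solves with the triangular factors. np.linalg.solve does not exploit
    # the triangular structure; a dedicated triangular solver would.
    y = np.linalg.solve(L, rhs)
    return np.linalg.solve(L.T, y)

solutions = []
for _ in range(3):                            # e.g., three iterations
    rhs = rng.standard_normal(m)              # right-hand side changes each time
    solutions.append((rhs, solve_with_factor(L, rhs)))
```

With a triangular solver, each iteration costs on the order of $m^2$ operations instead of the $m^3$ needed to solve the system from scratch.
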
For classification with the logistic loss, equation 2.3, the Newton method is used to find the solution for , which requires the gradient and the Hessian of :
Next, we update at step by solving the following subproblem:
3.3
where and .
Finally, we update the dual variables and b at step as
3.4
3.5
Note that step 3.3 and step 3.4 can be combined as
where and . This allows us to avoid computing singular values and the associated singular vectors that are smaller than the threshold in equation 3.3.
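
The saving mentioned here can be illustrated with a generic sketch. It assumes that the update of each auxiliary variable is a projection onto a spectral-norm ball, as suggested by the indicator functions in the dual; the residual of such a projection is a singular value soft-thresholding, which involves only the singular values above the threshold. Function names and sizes are hypothetical, and the full SVD is used for simplicity where a truncated SVD would be used in practice.

```python
import numpy as np

def project_onto_spectral_ball(M, radius):
    # Clip every singular value at `radius`
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.minimum(s, radius)) @ Vt

def soft_threshold_singular_values(M, threshold):
    # Keep only singular values above `threshold`, shrunk by `threshold`
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    keep = s > threshold
    return (U[:, keep] * (s[keep] - threshold)) @ Vt[keep, :]

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 30))
tau = 5.0
P = project_onto_spectral_ball(M, tau)
S = soft_threshold_singular_values(M, tau)
```

Because `M = P + S`, computing `S` from the leading singular triplets alone is enough to recover `P` as `M - S`, without the singular values below the threshold.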

### 3.1  Optimality Condition

As a stopping condition, we use the relative duality gap (Tomioka, Hayashi, & Kashima, 2011), which can be expressed as
where is the primal solution at step t of equation 3.1 and is a predefined tolerance value. is the dual solution at step t of equation 3.2 with obtained by multiplying with , where and is the largest singular value of V.

## 4  Theoretical Risk Analysis

In this section, we theoretically analyze the excess risk for regularization with the overlapped trace norm, the latent trace norm, and the scaled latent trace norm.

We consider a loss function l, which is Lipschitz continuous with constant . Note that this condition is true for both the squared loss and logistic loss functions. Let the training data set be given as , where for regression and for classification. In our theoretical analysis, we assume that elements of independently follow the standard gaussian distribution.

As the standard formulation (Maurer & Pontil, 2013), the empirical risk without the bias term is defined as
and the expected risk is defined as
where is the probability distribution from which are sampled.
The optimal that minimizes the expected risk is given as
4.1
where is either the overlapped trace norm, the latent trace norm, or the scaled latent trace norm. The optimal that minimizes the empirical risk is denoted as
4.2

Lemma 1 provides an upper bound of the excess risk for tensor-based learning problems (see appendix  B for its proof), where is the dual norm of for :

Lemma 1.
For a given -Lipschitz continuous loss function l and for any such that for problems 4.1 and 4.2, the excess risk for a given training data set is bounded with probability at least as
4.3
where and are Rademacher random variables.

Theorem 1 gives an excess risk bound for overlapped trace norm regularization (its proof is also included in appendix B), which is based on the inequality given in Tomioka and Suzuki (2013):

Theorem 1.
With probability at least , the excess risk of learning using the overlapped trace norm regularization for any , with , multilinear ranks , and estimator with is bounded as
4.4
where .

In theorem 2, we give an excess risk bound for the latent trace norm (its proof is also included in appendix B), which uses the inequality given in Tomioka and Suzuki (2013):

Theorem 2.
With probability at least , the excess risk of learning using the latent norm regularization for any , with , multilinear ranks , and estimator with is bounded as
4.5
where .

Theorem 2 shows that the excess risk for the latent trace norm, equation 4.5, is bounded by the minimum multilinear rank. If , the latent trace norm is always better than the overlapped trace norm in terms of the excess risk bounds because . If the dimensions are not the same, the overlapped trace norm could be better.

Finally, we bound the excess risk for the scaled latent trace norm based on the inequality given in Wimalawarne et al. (2014):

Theorem 3.
With probability at least , the excess risk of learning using the scaled latent trace norm regularization for any , with , multilinear ranks , and estimator with is bounded as
4.6

Theorem 3 shows that the excess risk for regularization with the scaled latent trace norm is bounded with the minimum of multilinear ranks relative to their mode dimensions. Similar to the latent trace norm, the scaled latent trace norm would also perform better than the overlapped norm when the multilinear ranks have large variations. If we consider a flat tensor, the modes with small dimensions may have ranks comparable to their dimensions. Although these modes have the lowest mode-k rank, they do not impose a low-rank structure. In such cases, our theory predicts that the scaled latent trace norm performs better because it is sensitive to the mode-k rank relative to its dimension.

As a variation, we can also consider a mode-wise scaled version of the overlapped trace norm defined as . It can be easily seen that holds, and with the same conditions as in theorem 1, we can upper-bound the excess risk for the scaled overlapped trace norm regularization as
4.7
Note that when all modes have the same dimensions, equation 4.7 coincides with equation 4.4. Compared with bound 4.6, the scaled latent norm would perform better than the scaled overlapped norm regularization since .
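
The qualitative comparison in this section can be made concrete by computing only the rank-dependent factors described in the text. The sketch assumes that the ranks enter through square roots (the average of $\sqrt{r_k}$ for the overlapped norm, the minimum of $\sqrt{r_k}$ for the latent norm, and $\sqrt{r_k / n_k}$ for the scaled variants); the remaining terms of bounds 4.4 to 4.7 are not reproduced, and the dimensions and ranks below are hypothetical.

```python
import math

def rank_factors(dims, ranks):
    """Rank-dependent factors only (an assumption about how ranks enter the
    bounds); constants and dimension-dependent terms are omitted."""
    roots = [math.sqrt(r) for r in ranks]
    scaled_roots = [math.sqrt(r / n) for r, n in zip(ranks, dims)]
    return {
        "overlapped": sum(roots) / len(roots),
        "latent": min(roots),
        "latent_mode": roots.index(min(roots)),
        "scaled_overlapped": sum(scaled_roots) / len(scaled_roots),
        "scaled_latent": min(scaled_roots),
        "scaled_latent_mode": scaled_roots.index(min(scaled_roots)),
    }

equal = rank_factors(dims=(10, 10, 10), ranks=(3, 3, 3))
unequal = rank_factors(dims=(10, 10, 10), ranks=(3, 5, 8))
flat = rank_factors(dims=(4, 50, 50), ranks=(3, 6, 8))
```

With equal ranks the overlapped and latent factors coincide; with unequal ranks the minimum is strictly smaller than the average; and for the flat tensor the unscaled minimum is attained at the small, nearly full-rank first mode, while the scaled minimum moves to a mode whose rank is low relative to its dimension.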

## 5  Experiments

We conducted several experiments using simulated and real-world data to evaluate the performance of tensor-based regression and classification methods with regularizations using different tensor norms. We discuss simulations for tensor-based regression in section 5.1 and experiments with real-world data for tensor classification in section 5.2. For all experiments, we use a Matlab environment on a 2.10 GHz (2×8 cores) Intel Xeon E5-2450 server machine with 128 GB memory.

### 5.1  Tensor Regression with Artificial Data

We report the results of artificial data experiments on tensor-based regression.

We generated three different three-mode tensors as weight tensors with different multilinear ranks and mode dimensions. We created two homogeneous tensors with equal mode dimensions of with different multilinear ranks and . The third weight tensor is an inhomogeneous case with mode dimensions of , and multilinear ranks . To generate these weight tensors, we use the Tucker decomposition (Kolda & Bader, 2009) of a tensor as , where is the core tensor and are component matrices. We sample elements of the core tensor from a standard gaussian distribution, choose component matrices to be orthogonal matrices, and generate by mode-wise multiplication of the core tensor and component matrices.
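
The generation procedure can be sketched as follows. The function name, the dimensions, and the ranks are hypothetical, and orthonormal columns are obtained here from a QR factorization of a gaussian matrix, which is one possible way to choose orthogonal component matrices.

```python
import numpy as np

def random_low_rank_tensor(dims, ranks, rng):
    """Tucker-style construction: gaussian core multiplied on every mode by a
    component matrix with orthonormal columns."""
    core = rng.standard_normal(ranks)
    factors = [np.linalg.qr(rng.standard_normal((n, r)))[0]
               for n, r in zip(dims, ranks)]
    W = core
    for k, U in enumerate(factors):
        # Mode-k product: contract U's columns with mode k of W, then put the
        # new axis back at position k
        W = np.moveaxis(np.tensordot(U, W, axes=(1, k)), 0, k)
    return W, core

rng = np.random.default_rng(0)
dims, ranks = (10, 10, 10), (3, 5, 8)
W, core = random_low_rank_tensor(dims, ranks, rng)
```

Because the component matrices have orthonormal columns, the unfoldings of `W` have the same nonzero singular values as the unfoldings of the core, so the multilinear rank of `W` equals the size of the core.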

To create training samples , we first create the random tensors generated with each element independently sampled from the standard gaussian distribution and obtain , where is noise drawn from the gaussian distribution with mean zero and variance 0.1. In our experiments, we use cross-validation to select the regularization parameter from the range 0.01 to 100 at intervals of 0.1. For comparison, we have also simulated matrix regularized regressions for each mode unfolding. Also, we experimented with cross-validation among matrix regularization on each unfolded matrix to understand whether it can find the correct mode for regularization. As the baseline vector-based learning method, we use ridge regression (i.e., l2-regularized least-squares).
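
A minimal sketch of the sample generation and of the vector-based baseline is given below. The weight tensor here is a gaussian stand-in rather than a low-rank tensor, the sample size and regularization parameter are arbitrary, and the bias term is omitted; all of these are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W_true = rng.standard_normal((4, 5, 6))             # stand-in weight tensor
m = 400

Xs = rng.standard_normal((m,) + W_true.shape)       # covariates, i.i.d. N(0, 1)
noise = rng.normal(0.0, np.sqrt(0.1), size=m)       # mean zero, variance 0.1
ys = np.tensordot(Xs, W_true, axes=W_true.ndim) + noise   # y_i = <W, X_i> + noise

# Ridge regression on the vectorized covariates (no bias term in this sketch)
A = Xs.reshape(m, -1)
lam = 1.0
w_ridge = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ ys)

rel_err = float(np.linalg.norm(w_ridge - W_true.ravel()) / np.linalg.norm(W_true))
```

Ridge regression treats all entries of the weight tensor as unrelated coefficients, so it needs more samples than entries to estimate them well, whereas the tensor norms can exploit low-rank structure when it is present.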

Figure 1 shows the performance of homogeneous tensors with equal mode dimensions and equal multilinear ranks . We see that the overlapped trace norm and the scaled overlapped trace norm (due to equal mode dimensions) perform the best equally, while both latent norms perform equally (since mode dimensions are equal) but inferior to the overlapped norm. Also, the regression results from all matrix regularizations with individual modes perform better than the latent and the scaled latent norm regularized regression models. Due to the equal multilinear ranks and equal mode dimensions, it results in equal performance with cross-validation among each mode-wise unfolded matrix regularization.

Figure 1:

Simulation results of tensor regression based on homogeneous weight tensor of equal mode dimensions and equal multilinear ranks

Figure 2 shows the performance of homogeneous tensors with equal mode dimensions and unequal multilinear ranks . In this case, both the latent and the scaled latent norms also perform equally since tensor dimensions are the same. The mode-1 regularized regression models give the best performance since they have the lowest rank; regularization with the latent and scaled latent norms gives the next best performance. The mode-wise cross-validation correctly coincides with the mode-1 regularization. The overlapped trace norm and the scaled overlapped trace norm (due to equal mode dimensions) perform equally poorly compared to the latent and the scaled latent trace norms.

Figure 2:

Simulation results of tensor regression based on homogeneous weight tensor of equal mode sizes and unequal multilinear rank

Figure 3 shows the performance of inhomogeneous tensors with mode dimensions , and multilinear ranks . In this case, we can see that the scaled latent trace norm outperforms all other tensor norms. The latent trace norm performs poorly since it fails to find the mode with the lowest rank. This agrees well with our theoretical analysis. As shown in equation 4.5, the excess risk of the latent trace norm is bounded with the minimum of multilinear ranks, which is on the first mode in the current setup and is high ranked. The scaled latent trace norm is able to find the mode with the lowest rank since it takes the relative rank with respect to the mode dimension as in equation 4.6. If we look at the individual mode regularizations, we see that the best performance is given with the second mode, which has the lowest rank with respect to the mode dimension, and the worst performance is given with the first mode, which is high ranked compared to other modes. Here, the mode-wise cross-validation is again as good as mode-2 regularization. The overlapped trace norm performs poorly compared to the scaled latent trace norm, and the scaled overlapped trace norm performs worse than the overlapped trace norm.

Figure 3:

Simulation results of tensor regression based on inhomogeneous weight tensor of mode sizes , and multilinear rank

It is also worth noticing in these experiments that ridge regression performed worse than all the tensor regularized learning models. This highlights the need to employ low-rank-inducing norms for learning with tensor data without vectorization to get the best performance.

Figure 4 shows the computation time for the toy regression experiment with inhomogeneous tensors with mode dimensions , and multilinear ranks (computation time for other setups showed similar tendency and thus we omit the results). For each data set, we measured the computation time of training regression models, cross-validation for model selection, and predicting output values for test data. We can see that methods based on tensor norms and matrix norms are computationally much more expensive compared to ridge regression. However, as we saw, they achieve higher accuracy than ridge regression. It is worth noticing that mode-wise cross-validation is computationally more expensive compared to the scaled latent trace norm and other tensor norms. This computational advantage and comparable performance with respect to the best mode-wise regularization make the scaled latent trace norm a useful regularization method for tensor-based regression, especially for tensors with high variations in its multilinear ranks.

Figure 4:

Computation times in seconds for toy experiment with inhomogeneous tensors with mode dimensions , and multilinear rank

### 5.2  Tensor Classification for Hand Gesture Recognition

Next, we report the results of experiments on tensor classification with the Cambridge hand gesture data set (Kim et al., 2007).

The Cambridge hand gesture data set contains image sequences from nine gesture classes. These gesture classes combine three primitive hand shapes (flat, spread, and V-shape) with three different hand motions (rightward, leftward, and contract). Each class has 100 image sequences with different illumination conditions and arbitrary motions of two people. Previously, the tensor canonical correlation (Kim et al., 2007) was used to classify these hand gestures.

To apply tensor classification, we first build action sequences as tensor data by sampling S images at equal time intervals from each sequence. This makes each sequence a tensor of dimensions $20 \times 20 \times S$, where the first two modes are downsampled images as in Kim et al. (2007) and S is the number of sampled images. In our experiments, we set S to 5 or 10. We consider binary classification and choose the visually similar sequences of left/flat and left/spread (see Figure 5), which we found to be difficult to classify. We standardize all the data by mean removal and variance normalization. We randomly split the data into a training set of 120 elements, a validation set of 40 elements used to select the regularization parameter, and a test set of 40 elements used to evaluate the learned classifier. In addition to the tensor regularized learning models, we also trained classifiers with matrix regularization on the unfolding of each mode separately, and we selected among these individual mode regularizations by cross-validation (mode-wise CV). As a baseline vector-based learning method, we used l2-regularized logistic regression. We selected regularization parameters from 50 values on a logarithmic scale from 0.01 to 500. We repeated the learning procedure for 10 sample sets for each classifier; the results are shown in Table 1.
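
The data handling in this protocol can be sketched as follows. Random arrays stand in for the gesture tensors, the per-entry standardization uses statistics of the training split only, and the variable names are hypothetical; these are assumptions, since the text does not specify how the normalization statistics are computed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_total, S = 200, 5                                  # two classes x 100 sequences
data = rng.standard_normal((n_total, 20, 20, S))     # placeholder for gesture tensors

# Random split into 120 training, 40 validation, and 40 test elements
perm = rng.permutation(n_total)
train_idx, val_idx, test_idx = perm[:120], perm[120:160], perm[160:]

# Mean removal and variance normalization (statistics from the training split)
mean = data[train_idx].mean(axis=0)
std = np.maximum(data[train_idx].std(axis=0), 1e-12)
train = (data[train_idx] - mean) / std
val = (data[val_idx] - mean) / std
test = (data[test_idx] - mean) / std

# 50 candidate regularization parameters, log-spaced between 0.01 and 500
lambdas = np.logspace(np.log10(0.01), np.log10(500.0), 50)
```

Each candidate in `lambdas` would be used to train on `train` and be scored on `val`, and the selected model would then be evaluated once on `test`.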

Figure 5:

Samples of hand motion sequences of left/flat and left/spread.

Table 1: Classification Error of Experiments with the Hand Gesture Data Set.

| Norm | Tensor dimensions (20, 20, 5) | Tensor dimensions (20, 20, 10) |
| --- | --- | --- |
| Overlapped trace norm | 0.1375 (0.0530) | **0.0775 (0.0343)** |
| Latent trace norm | **0.1275 (0.0416)** | **0.0875 (0.0429)** |
| Scaled latent trace norm | **0.1075 (0.0409)** | **0.1000 (0.0500)** |
| Scaled overlapped trace norm | **0.1275 (0.0416)** | **0.0850 (0.0444)** |
| Mode-1 | **0.1050 (0.0438)** | **0.0975 (0.0463)** |
| Mode-2 | **0.1275 (0.0777)** | **0.0850 (0.0489)** |
| Mode-3 | **0.1175 (0.0409)** | 0.1075 (0.0602) |
| Mode-wise CV | 0.1475 (0.0671) | 0.1025 (0.0381) |
| Logistic regression (l2) | 0.1500 (0.0565) | 0.1425 (0.0457) |

Note: The bold figures indicate comparable accuracies among classifiers after a t-test with a significance of 0.05.

In both experiments, for S = 5 and S = 10, tensor norm regularized classification performs better than the vectorized learning method. With a tensor structure of (20, 20, 5), mode-1 regularization gives the best performance; the scaled latent trace norm, latent trace norm, scaled overlapped trace norm, mode-2, and mode-3 are comparable. We observed that with the tensor structure of (20, 20, 5), the third mode of the learned weight tensor becomes full rank. The scaled latent trace norm performed as well as mode-1 since it could identify the mode with the minimum rank relative to its mode dimension, which is the first mode in the current setup. The overlapped trace norm performs poorly due to large variations in the multilinear ranks and tensor dimensions.

With the tensor structure (20, 20, 10), the overlapped trace norm gives the best performance. In this case, we found that the multilinear ranks are close to each other, which made the overlapped trace norm give better performance. The scaled latent trace norm, latent trace norm, scaled overlapped trace norm, mode-1, and mode-2 gave a performance comparable to that with the overlapped trace norm.

### 5.3  Tensor Classification for Brain Computer Interface

As our second tensor classification experiment, we considered a motor-imagery EEG classification problem in the context of brain-computer interface (BCI). The objective was to classify the movement imagined by a person from the EEG signals captured at that instant. For our experiments, we used the data from the BCI competition IVa (Dornhege, Blankertz, Curio, & Müller, 2004). Previous research by Tomioka and Aihara (2007) treated the EEG signal as a channel × channel matrix and classified it using logistic regression with low-rank matrix regularization. Our objective is to model EEG data as tensors to incorporate more information and to learn to classify using tensor regularization methods.

The BCI competition IVa data set consists of BCI experiments of five people. Though the BCI experiments used 256 channels, we use signals from only 49 channels following Tomioka and Aihara (2007) and preprocess the signal from each channel with $Z$ different band-pass filters (Butterworth filters). Let $S_i \in \mathbb{R}^{C \times T}$, where $C$ denotes the number of channels and $T$ denotes the time, be the matrix obtained by processing with the $i$th filter. As in Tomioka and Aihara (2007), each $S_i$ is further centered and scaled to give $\tilde{S}_i$. Then we obtain $X_i = \tilde{S}_i \tilde{S}_i^\top$, a channel × channel matrix (in our setting, it is $49 \times 49$). We arrange all $X_i$, $i = 1, \ldots, Z$, to form a tensor of dimensions $Z \times 49 \times 49$.
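The construction described above can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the authors' preprocessing code: the inputs are taken to be already band-pass filtered, centering is assumed to remove each channel's temporal mean, scaling is assumed to divide by the square root of T - 1 (so each slice is a sample covariance matrix), and the trial length of 300 samples is hypothetical.

```python
import numpy as np


def covariance_tensor(filtered_signals):
    """Stack one channel-by-channel matrix per filter into a Z x C x C tensor.

    filtered_signals: sequence of Z arrays of shape (C, T), assumed to be
    already band-pass filtered.
    """
    slices = []
    for signal in filtered_signals:
        n_samples = signal.shape[1]
        # Assumed centering: remove each channel's temporal mean.
        centered = signal - signal.mean(axis=1, keepdims=True)
        # Assumed scaling: divide by sqrt(T - 1).
        scaled = centered / np.sqrt(n_samples - 1)
        slices.append(scaled @ scaled.T)  # C x C and symmetric
    return np.stack(slices, axis=0)


# Synthetic stand-in for one trial: Z = 5 filters, C = 49 channels,
# T = 300 time samples (T is a made-up value for illustration).
rng = np.random.default_rng(0)
trial = [rng.standard_normal((49, 300)) for _ in range(5)]
x = covariance_tensor(trial)
print(x.shape)
```

Under these assumptions each slice coincides with `np.cov` of the corresponding filtered signal, and the frequency band indexes the first mode of the tensor.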

For our experiments, we used $Z = 5$ different band-pass Butterworth filters with cutoff frequencies of (7, 10), (9, 12), (11, 14), (13, 16), and (15, 18) Hz with scaling by 50, which converted each signal into a tensor of dimensions $5 \times 49 \times 49$. We split the training data used in the competition into training and validation sets with a proportion of 80:20; we used the rest of the data for testing. As in the previous experiment, we used logistic regression with all the tensor norms, individual mode unfolded matrix regularizations, and cross-validation with unfolded matrix regularization. We also used vector-based logistic regression with l2-regularization for comparison. To compare tensor-based methods with the previously proposed matrix approach (Tomioka & Aihara, 2007), we averaged tensor data over the frequency mode and applied classification with matrix trace norm regularization. For all experiments, we selected each regularization parameter from 50 values spaced on a logarithmic scale from 0.01 to 500. We show the validation and test errors for the tensor norms in Figure 6 in appendix C.
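The parameter grid and the 80:20 split can be written down directly. The sketch below assumes a uniformly random split and uses a hypothetical sample count of 100; the text does not say how the samples were assigned, and `train_validation_split` is an illustrative helper name.

```python
import numpy as np

# 50 candidate regularization parameters, log-spaced between 0.01 and 500.
lambdas = np.logspace(np.log10(0.01), np.log10(500.0), num=50)


def train_validation_split(n_samples, train_fraction=0.8, seed=0):
    """Randomly split sample indices into training and validation parts.

    The random assignment is an assumption made for this sketch.
    """
    order = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(round(train_fraction * n_samples))
    return order[:n_train], order[n_train:]


# Hypothetical number of labeled trials, for illustration only.
train_idx, val_idx = train_validation_split(100)
print(lambdas[0], lambdas[-1], len(train_idx), len(val_idx))
```

Each candidate value would then be scored on the validation indices, and the value with the lowest validation error would be used for the test set.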

Figure 6:

Plots of validation error and test error for BCI data subjects.

The results of the experiment are given in Table 2. They indicate that vector-based logistic regression is outperformed by the overlapped and scaled latent trace norms. Also, in most cases, the averaged matrix method performs poorly compared to the best tensor-structured regularization methods. Mode-1 regularization performs poorly since mode-1 has a high rank compared to the other modes. Similarly, the latent trace norm performs poorly because it does not consider the rank relative to the mode dimension and therefore cannot regularize properly. For all subjects, mode-2 and mode-3 unfolded regularizations result in the same performance, since the symmetry of each Xi gives the same rank along the mode-2 and mode-3 unfoldings. For subject aa, the scaled latent norm, mode-1, mode-2, and mode-wise cross-validation give the best or comparable performance. For subject al, the scaled overlapped trace norm gives the best performance, and for subject av, the overlapped trace norm and the scaled overlapped trace norm give comparable performance. For subjects aw and ay, the overlapped trace norm gives the best performance.
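The symmetry argument for mode-2 and mode-3 can be checked numerically. The sketch below uses a random tensor with symmetric slices as a stand-in for the BCI data, and `unfold` is an illustrative helper: when every slice is symmetric, the two unfoldings are the same matrix, so any regularizer that depends only on their singular values treats them identically.

```python
import numpy as np


def unfold(tensor, mode):
    """Mode unfolding: rows index the chosen mode, columns index all other modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


# Random 5 x 49 x 49 tensor whose slices x[z] are symmetric (synthetic data).
rng = np.random.default_rng(0)
a = rng.standard_normal((5, 49, 49))
x = a + np.transpose(a, (0, 2, 1))

# Axis 1 is mode-2 and axis 2 is mode-3 in the 1-based convention of the text.
mode2 = unfold(x, 1)
mode3 = unfold(x, 2)
sv_mode2 = np.linalg.svd(mode2, compute_uv=False)
sv_mode3 = np.linalg.svd(mode3, compute_uv=False)
same_unfolding = np.allclose(mode2, mode3)
same_spectrum = np.allclose(sv_mode2, sv_mode3)
print(same_unfolding, same_spectrum)
```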

Table 2: Classification Error of Experiments with the BCI Competition IVa Data Set.

| Norm | Subject aa | Subject al | Subject av | Subject aw | Subject ay | Average Time (seconds) |
| --- | --- | --- | --- | --- | --- | --- |
| Overlapped trace norm | 0.2205 (0.0139) | 0.0178 (0.0) | 0.3244 (0.0132) | 0.0603 (0.0071) | 0.1254 (0.0190) | 17,986 (1489) |
| Scaled overlapped trace norm | 0.2295 (0.0270) | 0.0018 (0.0056) | 0.3235 (0.0160) | 0.1022 (0.0192) | 0.2532 (0.0312) | 18,118 (1608) |
| Latent trace norm | 0.3107 (0.0210) | 0.0339 (0.0056) | 0.3735 (0.0218) | 0.1549 (0.0381) | 0.4008 (0.0) | 20,021 (14024) |
| Scaled latent trace norm | 0.2080 (0.0043) | 0.0179 (0.0) | 0.3694 (0.0182) | 0.0804 (0.0) | 0.1980 (0.0476) | 77,123 (149024) |
| Mode-1 | 0.3205 (0.0174) | 0.0339 (0.0056) | 0.3739 (0.0211) | 0.1450 (0.0070) | 0.4020 (0.0038) | 5,737 (3238) |
| Mode-2 | 0.2035 (0.0124) | 0.0285 (0.0225) | 0.3653 (0.0186) | 0.0790 (0.0042) | 0.1794 (0.0025) | 5,195 (1446) |
| Mode-3 | 0.2035 (0.0124) | 0.0285 (0.0225) | 0.3653 (0.0186) | 0.0790 (0.0042) | 0.1794 (0.0025) | 5,223 (1452) |
| Mode-wise CV | 0.2080 (0.0369) | 0.0428 (0.0305) | 0.3545 (0.0125) | 0.1008 (0.0227) | 0.1452 (0.0224) | 14,473 (4142) |
| Averaged matrix | 0.2732 (0.0286) | 0.0178 (0.0) | 0.4030 (0.2487) | 0.1366 (0.0056) | 0.1825 (0.0) | 1,936 (472) |
| Logistic regression (l2) | 0.3161 (0.0075) | 0.0179 (0.0) | 0.3684 (0.0537) | 0.2241 (0.0432) | 0.4040 (0.0640) | 72 (62) |

Note: The bold numbers in columns aa, al, av, aw, and ay indicate comparable accuracies among classifiers after a t-test with a significance of 0.05.

In contrast to the regression experiments, in this experiment the tensor trace norm regularizations take longer to compute than the mode-wise regularizations. Also, mode-wise cross-validation is computationally less expensive than the scaled latent trace norm and the other tensor trace norms. This is a slight drawback of the tensor norms, though they tend to have higher classification accuracy.

## 6  Conclusion and Future Work

In this letter, we have studied tensor-based regression and classification with regularization using the overlapped trace norm, the latent trace norm, and the scaled latent trace norm. We have provided dual optimization methods, theoretical analysis, and experimental evaluations to understand tensor-based inductive learning. Our theoretical analysis on excess risk bounds showed the relationship of excess risks with the multilinear ranks and dimensions of the weight tensor. Our experimental results on both simulated and real data sets further confirmed the validity of our theoretical analyses. From the theoretical and empirical results, we can conclude that the performance of regularization with tensor norms depends on the multilinear ranks and mode dimensions, where the latent and scaled latent norms are more robust in tensors with large variations of multilinear ranks.

Our research opens up many future research directions. For example, an important direction is improvement of optimization methods. The latent trace norm and the scaled latent trace norm require optimization over the latent tensors, which increases the computational cost compared to the vectorized methods. Also, computing multiple singular value decompositions and solving Newton optimization subproblems (for logistic regression) at each iterative step are computationally expensive. This is evident from our experimental results on computation time for regression and classification. Developing computationally more efficient methods would make learning with tensor data more practical.

Regularization with a mixture of norms is common in both vector-based (e.g., the elastic net; Zou & Hastie, 2003) and matrix-based regularizations (Savalle, Richard, & Vayatis, 2012). It would be an interesting research direction to combine sparse regularization (the l1-norm) with existing tensor norms. There is also a recent research direction to develop new composite norms such as the $(k, q)$-trace norm (Richard, Obozinski, & Vert, 2014). Development of composite tensor norms can be useful for inductive tensor learning to obtain sparse and low-rank solutions.

### Appendix A:  Dual Formulations

In this appendix, we derive the dual formulation of the latent trace norms. We consider a training data set , where . To derive the dual for the latent trace norms, we rewrite the primal for the regression of equation 3.1 as
Its Lagrangian can be written by introducing variables as
We introduce auxiliary variables to remove the coupling between the indicator functions. Then the above dual solutions can be restated as
A.1
Similarly, we can derive the dual formulation for logistic regression.

### Appendix B:  Proofs of Theorems in Section 4

We prove the following useful lemma.

### Appendix C:  Test and Validation Curves for BCI data

We show in Figure 6 the validation errors and test errors for BCI data sets.

## Acknowledgments

K.W. acknowledges the Monbukagakusho MEXT Scholarship and KAKENHI 23120004, and M.S. acknowledges the JST CREST program.

## References

Bahadori, M. T., Yu, Q. R., & Liu, Y. (2014). Fast multivariate spatio-temporal analysis via low rank tensor learning. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 27 (pp. 3491–3499). Red Hook, NY: Curran.

Bertsekas, D. P. (1996). Constrained optimization and Lagrange multiplier methods. Belmont, MA: Athena Scientific.

Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 1–122.

Cai, J., Candès, E. J., & Shen, Z. (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20, 1956–1982.

Candès, E. J., Li, X., Ma, Y., & Wright, J. (2011). Robust principal component analysis? Journal of the ACM, 58(3), 1–37.

Dornhege, G., Blankertz, B., Curio, G., & Müller, K.-R. (2004). Boosting bit rates in noninvasive EEG single-trial classifications by feature combination and multiclass paradigms. IEEE Transactions on Biomedical Engineering, 51(6), 993–1002.

Gabay, D., & Mercier, B. (1976). A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers and Mathematics with Applications, 2(1), 17–40.

Gandy, S., Recht, B., & Yamada, I. (2011). Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems, 27(2), 025010.

Karatzoglou, A., Amatriain, X., Baltrunas, L., & Oliver, N. (2010). Multiverse recommendation: N-dimensional tensor factorization for context-aware collaborative filtering. In Proceedings of the Fourth ACM Conference on Recommender Systems (pp. 79–86). New York: ACM.

Kim, T.-K., Wong, S.-F., & Cipolla, R. (2007). Tensor canonical correlation analysis for action classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1–8). Piscataway, NJ: IEEE.

Kolda, T. G., & Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51(3), 455–500.

Liu, G., Lin, Z., & Yu, Y. (2010). Robust subspace segmentation by low-rank representation. In Proceedings of the 27th International Conference on Machine Learning (pp. 663–670). Madison, WI: Omnipress.

Liu, J., Musialski, P., Wonka, P., & Ye, J. (2009). Tensor completion for estimating missing values in visual data. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2114–2121). Piscataway, NJ: IEEE.

Maurer, A., & Pontil, M. (2013). Excess risk bounds for multitask learning with trace norm regularization. In Proceedings of the Annual Conference on Learning Theory 2013 (pp. 55–76). JMLR.org.

Recht, B., Fazel, M., & Parrilo, P. (2010). Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3), 471–501.

Richard, E., Obozinski, G. R., & Vert, J.-P. (2014). Tight convex relaxations for sparse matrix factorization. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 27 (pp. 3284–3292). Red Hook, NY: Curran.

Romera-Paredes, B., Aung, H., Bianchi-Berthouze, N., & Pontil, M. (2013). Multilinear multitask learning. In Proceedings of the 30th International Conference on Machine Learning (pp. 1444–1452). JMLR.org.

Sankaranarayanan, P., Schomay, T. E., Aiello, K. A., & Alter, O. (2015). Tensor GSVD of patient- and platform-matched tumor and normal DNA copy-number profiles uncovers chromosome arm-wide patterns of tumor-exclusive platform-consistent alterations encoding for cell transformation and predicting ovarian cancer survival. PLoS ONE, 10(4), e0121396.

Savalle, P., Richard, E., & Vayatis, N. (2012). Estimation of simultaneously sparse and low rank matrices. In Proceedings of the 29th International Conference on Machine Learning (pp. 1351–1358). Madison, WI: Omnipress.

Signoretto, M., Dinh, Q. T., De Lathauwer, L., & Suykens, J. A. K. (2013). Learning with tensors: A framework based on convex optimization and spectral regularization. Machine Learning, 94(3), 303–351.

Tomioka, R., & Aihara, K. (2007). Classifying matrices with a spectral regularization. In Proceedings of International Conference on Machine Learning (pp. 895–902). New York: ACM.

Tomioka, R., Hayashi, K., & Kashima, H. (2011). Estimation of low-rank tensors via convex optimization (Technical report). arXiv 1010.0789.

Tomioka, R., & Suzuki, T. (2013). Convex tensor decomposition via structured Schatten norm regularization. In Advances in neural information processing systems, 26 (pp. 1331–1339). Red Hook, NY: Curran.

Tomioka, R., Suzuki, T., Hayashi, K., & Kashima, H. (2011). Statistical performance of convex tensor decomposition. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 24 (pp. 972–980). Red Hook, NY: Curran.

Tomioka, R., Suzuki, T., & Sugiyama, M. (2011). Super-linear convergence of dual augmented-Lagrangian algorithm for sparsity regularized estimation. Journal of Machine Learning Research, 12, 1537–1586.

Wimalawarne, K., Sugiyama, M., & Tomioka, R. (2014). Multitask learning meets tensor factorization: Task imputation via convex optimization. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 27 (pp. 2825–2833). Red Hook, NY: Curran.

Zhou, H., & Li, L. (2014). Regularized matrix regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(2), 463–483.

Zhou, H., Li, L., & Zhu, H. (2013). Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502), 540–552.

Zou, H., & Hastie, T. (2003). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.