## Abstract

We theoretically and experimentally investigate tensor-based regression and classification. Our focus is regularization with various tensor norms, including the overlapped trace norm, the latent trace norm, and the scaled latent trace norm. We first give dual optimization methods using the alternating direction method of multipliers, which is computationally efficient when the number of training samples is moderate. We then theoretically derive an excess risk bound for each tensor norm and clarify their behavior. Finally, we perform extensive experiments using simulated and real data and demonstrate the superiority of tensor-based learning methods over vector- and matrix-based learning methods.

## 1 Introduction

A wide range of real-world data takes the format of matrices and tensors, for example, recommendation (Karatzoglou, Amatriain, Baltrunas, & Oliver, 2010), video sequences (Kim, Wong, & Cipolla, 2007), climates (Bahadori, Yu, & Liu, 2014), genomes (Sankaranarayanan, Schomay, Aiello, & Alter, 2015), and neuroimaging (Zhou, Li, & Zhu, 2013). A naive way to learn from such matrix and tensor data is to vectorize them and apply ordinary regression or classification methods designed for vectorial data. However, such a vectorization approach would lead to loss in structural information of matrices and tensors such as low-rankness.

The objective of this letter is to investigate regression and classification methods that directly handle tensor data without vectorization. Low-rank structure of data has been successfully utilized in various applications, such as missing data imputation (Cai, Candès, & Shen, 2010), robust principal component analysis (Candès, Li, Ma, & Wright, 2011), and subspace clustering (Liu, Lin, & Yu, 2010). Instead of low-rankness of the data itself, in this letter we consider its dual: low-rankness of the learning coefficients of a regressor and a classifier. Low-rankness in learning coefficients means that only a subspace of the feature space is used for regression and classification.

For matrices, regression and classification have been studied in Tomioka and Aihara (2007) and Zhou and Li (2014) in the context of EEG data analysis. It was experimentally demonstrated that directly learning matrix data by low-rank regularization can significantly improve performance compared to learning after vectorization. Another advantage of using low-rank regularization in the context of EEG data analysis is that analyzing singular value spectra of learning coefficients is useful in understanding activities of brain regions.

More recently, an inductive learning method for tensors has been explored (Signoretto, Dinh, De Lathauwer, & Suykens, 2013). Compared to the matrix case, learning with tensors is inherently more complex. For example, the multilinear ranks of tensors make it more complicated to find a proper low-rankness of a tensor compared to a matrix, which has only one rank. So far, several tensor norms such as the overlapped trace norm or the tensor nuclear norm (Liu, Musialski, Wonka, & Ye, 2009), the latent trace norm (Tomioka & Suzuki, 2013), and the scaled latent trace norm (Wimalawarne, Sugiyama, & Tomioka, 2014) have been proposed and demonstrated to perform well for various tensor structures. However, theoretical analysis of tensor learning in inductive learning settings has not been much investigated yet. Another challenge in inductive tensor learning is efficient optimization strategies, since tensor data often have much higher dimensionalities than matrix and vector data.

We theoretically and experimentally investigate tensor-based regression and classification with regularization by the overlapped trace norm, the latent trace norm, and the scaled latent trace norm. We first provide their dual formulations and propose optimization procedures using the alternating direction method of multipliers (Bertsekas, 1996), which is computationally efficient when the number of data samples is moderate. We then derive an excess risk bound for each tensor regularization, which allows us to theoretically understand the behavior of tensor norm regularization. More specifically, we elucidate that the excess risk of the overlapped trace norm is bounded with the average multilinear ranks of each mode, that of the latent trace norm is bounded with the minimum multilinear rank among all modes, and that of the scaled latent trace norm is bounded with the minimum ratio between multilinear ranks and mode dimensions. Finally, for simulated and real tensor data, we experimentally investigate the behavior of tensor-based regression and classification methods. The experimental results are in concordance with our theoretical findings, and tensor-based learning methods compare favorably with vector- and matrix-based methods.

The remainder of this letter is organized as follows. In section 2, we formulate the problem of tensor-based supervised learning and review the overlapped trace norm, the latent trace norm, and the scaled latent trace norm. In section 3, we derive dual optimization algorithms based on the alternating direction method of multipliers. In section 4, we theoretically give an excess risk bound for each tensor norm. In section 5, we give experimental results on both artificial and real-world data and illustrate the advantage of tensor-based learning methods. In section 6, we conclude.

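Throughout the paper, we use standard tensor notation following Kolda and Bader (2009). We represent a *K*-way tensor as $\mathcal{W} \in \mathbb{R}^{n_1 \times \cdots \times n_K}$ that consists of $N = \prod_{k=1}^{K} n_k$ elements. A mode-*k* fiber of $\mathcal{W}$ is an $n_k$-dimensional vector that can be obtained by fixing all except the *k*th index. The mode-*k* unfolding of tensor $\mathcal{W}$ is represented as $W_{(k)} \in \mathbb{R}^{n_k \times N/n_k}$, which is obtained by concatenating all the $N/n_k$ mode-*k* fibers along its columns. The spectral norm of a matrix *X* is denoted by $\|X\|_{\mathrm{op}}$, the maximum singular value of *X*. The operator $\langle \mathcal{W}, \mathcal{X} \rangle$ is the sum of element-wise multiplications of $\mathcal{W}$ and $\mathcal{X}$, that is, $\langle \mathcal{W}, \mathcal{X} \rangle = \mathrm{vec}(\mathcal{W})^{\top} \mathrm{vec}(\mathcal{X})$. The Frobenius norm of a tensor $\mathcal{X}$ is defined as $\|\mathcal{X}\|_{F} = \sqrt{\langle \mathcal{X}, \mathcal{X} \rangle}$.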
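The notation above maps directly onto array operations. The following is a minimal NumPy sketch; the helper names `unfold`, `inner`, and `frobenius_norm` are illustrative and do not come from the original work, and modes are indexed from zero. The column ordering produced by `unfold` follows NumPy's row-major layout rather than the ordering of Kolda and Bader (2009); this only permutes columns, so the rank and singular values of the unfolding are unaffected.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-k unfolding: rows index `mode`, columns enumerate the mode-k fibers."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def inner(a, b):
    """Inner product <A, B>: the sum of element-wise products."""
    return float(np.sum(a * b))

def frobenius_norm(a):
    """Frobenius norm ||A||_F = sqrt(<A, A>)."""
    return float(np.sqrt(inner(a, a)))

W = np.arange(24, dtype=float).reshape(2, 3, 4)  # a small 3-way tensor with N = 24
W2 = unfold(W, 1)                                # second mode: a 3 x 8 matrix
```

Each column of `W2` is one fiber of the second mode, for example `W[0, :, 0]`.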

## 2 Learning with Tensor Regularization

In this section, we put forward inductive tensor learning models with tensor regularization and review different tensor norms used for low-rank regularization.

### 2.1 Problem Formulation

*l*_{2}- or *l*_{1}-regularization.

*J* is the number of nonzero singular values ($J \le \min(n_1, n_2)$). A matrix is called low rank if $J < \min(n_1, n_2)$. The matrix trace norm, equation 2.4, is a convex envelope to the matrix rank, and it is commonly used in matrix low-rank approximation (Recht, Fazel, & Parrilo, 2010).

As in matrices, the rank property is also available for tensors, but it is more complicated due to their multidimensional structure. The mode-*k* rank $r_k$ of a tensor $\mathcal{W}$ is defined as the rank of the mode-*k* unfolding $W_{(k)}$, and the multilinear rank of $\mathcal{W}$ is given as $(r_1, \ldots, r_K)$. Mode *k* of a tensor is called low rank if $r_k < n_k$.
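As an illustration of the multilinear rank, the sketch below computes the rank of every mode-wise unfolding numerically. The function names are illustrative, modes are indexed from zero, and the numerical rank depends on the tolerance used by `np.linalg.matrix_rank`.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-k unfolding (row-major column ordering)."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def multilinear_rank(tensor, tol=None):
    """Tuple of mode-k ranks, one per mode."""
    return tuple(
        int(np.linalg.matrix_rank(unfold(tensor, k), tol=tol))
        for k in range(tensor.ndim)
    )

# Outer product of three vectors: every unfolding has rank one.
rank_one = np.einsum("i,j,k->ijk",
                     np.ones(4), np.arange(1.0, 6.0), np.array([1.0, -1.0, 2.0]))

# A generic gaussian tensor is full rank along every mode.
generic = np.random.default_rng(0).standard_normal((3, 4, 5))
```

Here `multilinear_rank(rank_one)` gives `(1, 1, 1)`, so every mode is low rank, whereas the generic tensor has multilinear rank equal to its mode dimensions.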

### 2.2 Overlapped Trace Norm

The first norm we review is the tensor nuclear norm (Liu et al., 2009), also known as the *overlapped trace norm* (Tomioka & Suzuki, 2013), which can be represented for a tensor $\mathcal{W}$ as

$$\|\mathcal{W}\|_{\mathrm{overlap}} = \sum_{k=1}^{K} \|W_{(k)}\|_{\mathrm{tr}},$$

where $\|\cdot\|_{\mathrm{tr}}$ denotes the matrix trace norm. The overlapped trace norm can be viewed as a direct extension of the matrix trace norm since it unfolds a tensor on each of its modes and computes the sum of trace norms of the unfolded matrices. Regularization with the overlapped trace norm can also be seen as an overlapped group regularization because the same tensor is unfolded over different modes and regularized with the trace norm.
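The description above (unfold on each mode, then sum the trace norms of the unfolded matrices) can be sketched in a few lines of NumPy. The function names are illustrative; `np.linalg.norm(..., ord="nuc")` returns the sum of singular values of a matrix.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-k unfolding (row-major column ordering)."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def trace_norm(matrix):
    """Matrix trace (nuclear) norm: the sum of singular values."""
    return float(np.linalg.norm(matrix, ord="nuc"))

def overlapped_trace_norm(tensor):
    """Sum of the trace norms of all mode-k unfoldings."""
    return sum(trace_norm(unfold(tensor, k)) for k in range(tensor.ndim))

# Rank-one example: each unfolding has the single singular value
# ||a|| * ||b|| * ||c|| = 5 * 1 * 2 = 10, so the norm is 3 * 10.
a, b, c = np.array([3.0, 4.0]), np.array([1.0, 0.0, 0.0]), np.array([0.0, 2.0])
T = np.einsum("i,j,k->ijk", a, b, c)
```

Because the same tensor appears in every term, a mode that is close to full rank contributes a large trace norm even when the other modes are low rank.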

One of the popular applications of the overlapped trace norm is tensor completion (Gandy, Recht, & Yamada, 2011; Liu et al., 2009), where missing entries of a tensor are imputed. Another application is *multilinear multitask learning* (Romera-Paredes, Aung, Bianchi-Berthouze, & Pontil, 2013), where multiple vector-based linear learning tasks with a common feature space are arranged as a tensor feature structure and the multiple tasks are solved together with constraints to minimize the multilinear ranks of the tensor feature.

Theoretical analyses on the overlapped norm have been carried out for both tensor completion (Tomioka & Suzuki, 2013) and multilinear multitask learning (Wimalawarne et al., 2014). They have shown that the prediction error of overlapped trace norm regularization is bounded by the average mode-*k* ranks, which can be large if some modes are close to full rank even if there are low-rank modes. Thus, these studies imply that the overlapped trace norm performs well when the multilinear ranks have small variations, and it may result in poor performance when the multilinear ranks have high variations.

### 2.3 Latent Trace Norm

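The *latent trace norm* (Tomioka & Suzuki, 2013) is defined as

$$\|\mathcal{W}\|_{\mathrm{latent}} = \inf_{\mathcal{W}^{(1)} + \cdots + \mathcal{W}^{(K)} = \mathcal{W}} \sum_{k=1}^{K} \|W^{(k)}_{(k)}\|_{\mathrm{tr}},$$

where $W^{(k)}_{(k)}$ is the mode-*k* unfolding of the latent tensor $\mathcal{W}^{(k)}$. The latent trace norm decomposes a tensor into a mixture of *K* latent tensors, which is equal to the number of modes, and regularizes each of them separately. In contrast to the overlapped trace norm, the latent trace norm regularizes a different latent tensor for each unfolded mode, and as a result it tends to pick the latent tensor with the lowest rank.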

In general, the latent trace norm results in a mixture of latent tensors, and the content of each latent tensor depends on the rank of its unfolding. In an extreme case, for a tensor with all modes full rank except one, regularization with the latent trace norm makes the latent tensor of the lowest-rank mode prominent while the others become zero.

### 2.4 Scaled Latent Trace Norm

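The latent trace norm considers only the ranks of the modes and not their dimensions, so it can select a mode whose rank is small in absolute terms but close to its mode dimension. The *scaled latent trace norm* (Wimalawarne et al., 2014),

$$\|\mathcal{W}\|_{\mathrm{scaled}} = \inf_{\mathcal{W}^{(1)} + \cdots + \mathcal{W}^{(K)} = \mathcal{W}} \sum_{k=1}^{K} \frac{1}{\sqrt{n_k}} \|W^{(k)}_{(k)}\|_{\mathrm{tr}},$$

overcomes this problem by scaling each term with its mode dimension, so that it works with the ranks of the tensor relative to the mode dimensions. In the context of multilinear multitask learning, it has been shown that the scaled latent trace norm works well for tensors with high variations in multilinear ranks and mode dimensions compared to the overlapped trace norm and the latent trace norm (Wimalawarne et al., 2014).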
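Evaluating the latent norms exactly requires an optimization over the latent tensors, but simple upper bounds illustrate how the scaling changes which mode is favored. The sketch below assumes the usual definitions, namely an infimum over decompositions into *K* latent tensors of the summed trace norms of their mode-*k* unfoldings, with each term divided by $\sqrt{n_k}$ in the scaled case. Placing the whole tensor in a single latent component is a feasible decomposition, so the minimum over modes of the resulting value bounds each norm from above. The function names are illustrative, and these are bounds rather than the norms themselves.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-k unfolding (row-major column ordering)."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_trace_norms(tensor):
    """Trace norm of every mode-k unfolding."""
    return [float(np.linalg.norm(unfold(tensor, k), ord="nuc"))
            for k in range(tensor.ndim)]

def latent_upper_bound(tensor):
    """Upper bound on the latent trace norm from single-mode decompositions."""
    return min(mode_trace_norms(tensor))

def scaled_latent_upper_bound(tensor):
    """Same bound with each mode's trace norm divided by sqrt(n_k)."""
    norms = mode_trace_norms(tensor)
    return min(t / np.sqrt(n) for t, n in zip(norms, tensor.shape))

# Rank-one tensor with mode dimensions (2, 3, 4): every unfolding has trace norm 10.
a = np.array([3.0, 4.0])
b = np.array([1.0, 0.0, 0.0])
c = np.array([0.0, 2.0, 0.0, 0.0])
T = np.einsum("i,j,k->ijk", a, b, c)
```

For this example the unscaled bound is 10 for every mode, while the scaled bound is attained on the largest mode, 10 / sqrt(4) = 5, which reflects the preference for low rank relative to the mode dimension.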

The inductive learning setting mentioned in equation 2.1 with the overlapped trace norm has been studied previously in Signoretto et al. (2013). However, theoretical analysis and performance comparison with other tensor norms have not been conducted yet. Similarly to tensor decomposition (Tomioka & Suzuki, 2013) and multilinear multitask learning (Wimalawarne et al., 2014), tensor-based regression and classification may also be improved by regularization methods that can work with high variations in multilinear ranks and mode dimensions.

In the following sections, to make tensor-based learning more practical and to improve its performance, we consider formulation 2.1 with the overlapped trace norm, the latent trace norm, and the scaled latent trace norm, and we give computationally efficient optimization algorithms and excess risk bounds.

## 3 Optimization

In this section, we consider the dual formulation for equation 2.1 and propose computationally efficient optimization algorithms. Since optimization of equation 2.1 with regularization using the overlapped trace norm has already been studied in Signoretto et al. (2013), we do not discuss it here. Our main focus in this section is optimization of equation 2.1 with regularization using the latent trace norm and the scaled latent trace norm.

*k*th mode. It is worth noticing that the application of the latent and scaled latent trace norms requires optimizing over *K* latent tensors, which contain $KN$ variables in total. For large *K* and *N*, solving the primal problem, equation 3.1, can be computationally expensive, especially in nonlinear problems such as logistic regression, since they require computationally expensive optimization methods such as gradient descent or the Newton method. If the number of training samples *m* satisfies $m \ll KN$, solving the dual problem of equation 3.1 could be computationally more efficient. For this reason, we focus on optimization in the dual below.

*b*. Here, the auxiliary variables are introduced to remove the coupling between the indicator functions in the objective function (see appendix A for details).

The alternating direction method of multipliers (ADMM) (Gabay & Mercier, 1976; Boyd, Parikh, Chu, Peleato, & Eckstein, 2011) has been previously used to solve primal problems of tensor decomposition (Tomioka, Suzuki, Hayashi, & Kashima, 2011) and multilinear multitask learning (Romera-Paredes et al., 2013) with overlapped trace norm regularization. Optimization in the dual for tensor decomposition problems with latent and scaled latent trace norm regularization has also been carried out using ADMM (Tomioka et al., 2011). Here, we also adopt ADMM to solve equation 3.2 and describe the formulation and the optimization steps in detail.

*b* by considering subproblems for each variable. Below, we give the solution for each variable at each iterative step.

*m*-dimensional vector of all ones. Note that in the above system of equations, the coefficient matrix does not change during optimization. Thus, the system can be solved efficiently at each iteration by precomputing the Cholesky factorization of the matrix.
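The reuse of a precomputed factorization can be sketched as follows. This is a generic illustration, not the system from the derivation: the matrix `A` is a hypothetical symmetric positive-definite stand-in for the fixed coefficient matrix, and the right-hand sides are random placeholders for the vectors that change at each ADMM iteration. For brevity the two triangular systems are solved with the general-purpose `np.linalg.solve`; a dedicated routine such as `scipy.linalg.cho_factor` with `scipy.linalg.cho_solve` would exploit the triangular structure.

```python
import numpy as np

def make_cholesky_solver(A):
    """Factor A = L L^T once and return a function that solves A x = rhs."""
    L = np.linalg.cholesky(A)  # raises LinAlgError if A is not positive definite

    def solve(rhs):
        y = np.linalg.solve(L, rhs)     # forward step:  L y = rhs
        return np.linalg.solve(L.T, y)  # backward step: L^T x = y

    return solve

rng = np.random.default_rng(0)
m = 6
B = rng.standard_normal((m, m))
A = B @ B.T + m * np.eye(m)      # hypothetical fixed coefficient matrix
solve = make_cholesky_solver(A)  # the factorization happens once, outside the loop

residuals = []
for _ in range(3):               # stands in for the ADMM iterations
    rhs = rng.standard_normal(m) # only the right-hand side changes
    x = solve(rhs)
    residuals.append(float(np.max(np.abs(A @ x - rhs))))
```

The factorization costs on the order of $m^3$ operations once, after which each iteration needs only two triangular solves.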

### 3.1 Optimality Condition

*t* of equation 3.1 and is a predefined tolerance value. is the dual solution at step *t* of equation 3.2 with obtained by multiplying with , where and is the largest singular value of *V*.

## 4 Theoretical Risk Analysis

In this section, we theoretically analyze the excess risk for regularization with the overlapped trace norm, the latent trace norm, and the scaled latent trace norm.

We consider a loss function *l* that is Lipschitz continuous with constant $\Lambda$. Note that this condition holds for both the squared loss and the logistic loss. Let the training data set be given as $\{(\mathcal{X}_i, y_i)\}_{i=1}^{m}$, where $y_i \in \mathbb{R}$ for regression and $y_i \in \{-1, 1\}$ for classification. In our theoretical analysis, we assume that the elements of $\mathcal{X}_i$ independently follow the standard gaussian distribution.

Lemma 1 provides an upper bound of the excess risk for tensor-based learning problems (see appendix B for its proof), where $\|\cdot\|_{\star^{*}}$ is the dual norm of $\|\cdot\|_{\star}$ for $\star \in \{\mathrm{overlap}, \mathrm{latent}, \mathrm{scaled}\}$:

Theorem 2 gives an excess risk bound for overlapped trace norm regularization (its proof is also included in appendix B), which is based on the inequality given in Tomioka and Suzuki (2013):

In theorem 3, we give an excess risk bound for the latent trace norm (its proof is also included in appendix B), which uses the inequality given in Tomioka and Suzuki (2013):

Theorem 3 shows that the excess risk for the latent trace norm, equation 4.5, is bounded by the minimum multilinear rank. If $n_1 = \cdots = n_K$, the latent trace norm is always better than the overlapped trace norm in terms of the excess risk bounds because the minimum multilinear rank never exceeds the average of the multilinear ranks. If the dimensions are not the same, the overlapped trace norm could be better.

Finally, we bound the excess risk for the scaled latent trace norm based on the inequality given in Wimalawarne et al. (2014):

Theorem 4 shows that the excess risk for regularization with the scaled latent trace norm is bounded with the minimum of multilinear ranks relative to their mode dimensions. Similar to the latent trace norm, the scaled latent trace norm would also perform better than the overlapped norm when the multilinear ranks have large variations. If we consider a flat tensor, the modes with small dimensions may have ranks comparable to their dimensions. Although these modes have the lowest mode-*k* rank, they do not impose a low-rank structure. In such cases, our theory predicts that the scaled latent trace norm performs better because it is sensitive to the mode-*k* rank relative to its dimension.
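The flat-tensor argument can be made concrete with a small calculation of the rank summaries that the three bounds are described as depending on: the average multilinear rank, the minimum multilinear rank, and the minimum rank relative to the mode dimension. The sketch below does not reproduce the constants or the exact functional form of the bounds, and the mode dimensions and ranks in the example are hypothetical values chosen only for illustration.

```python
def rank_summaries(ranks, dims):
    """Rank statistics associated with the overlapped, latent, and scaled latent norms."""
    ranks, dims = list(ranks), list(dims)
    relative = [r / n for r, n in zip(ranks, dims)]
    return {
        "average_rank": sum(ranks) / len(ranks),              # overlapped trace norm
        "min_rank": min(ranks),                               # latent trace norm
        "min_rank_mode": ranks.index(min(ranks)),
        "min_relative_rank": min(relative),                   # scaled latent trace norm
        "min_relative_rank_mode": relative.index(min(relative)),
    }

# Hypothetical flat tensor: the first mode is small and nearly full rank.
flat = rank_summaries(ranks=[3, 5, 8], dims=[4, 50, 50])
```

In this example the smallest rank belongs to the first mode (index 0), although its relative rank is 0.75, while the second mode (index 1) has the smallest relative rank, 0.1. This is the situation in which the latent trace norm focuses on a mode that carries no real low-rank structure.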

As in theorem 2, we can upper-bound the excess risk for the scaled overlapped trace norm regularization (equation 4.7). Note that when all modes have the same dimensions, equation 4.7 coincides with equation 4.4. Compared with bound 4.6, the scaled latent norm would perform better than the scaled overlapped norm regularization since the minimum of the relative ranks never exceeds their average.

## 5 Experiments

We conducted several experiments using simulated and real-world data to evaluate the performance of tensor-based regression and classification methods with regularization using different tensor norms. We discuss simulations for tensor-based regression in section 5.1 and experiments with real-world data for tensor classification in sections 5.2 and 5.3. For all experiments, we use a Matlab environment on a 2.10 GHz (2×8 cores) Intel Xeon E5-2450 server machine with 128 GB memory.

### 5.1 Tensor Regression with Artificial Data

We report the results of artificial data experiments on tensor-based regression.

We generated three different three-mode tensors as weight tensors with different multilinear ranks and mode dimensions. We created two homogeneous tensors with equal mode dimensions, one with equal multilinear ranks and one with unequal multilinear ranks. The third weight tensor is an inhomogeneous case with unequal mode dimensions. To generate these weight tensors, we use the Tucker decomposition (Kolda & Bader, 2009) of a tensor as $\mathcal{W} = \mathcal{C} \times_1 U_1 \times_2 U_2 \times_3 U_3$, where $\mathcal{C} \in \mathbb{R}^{r_1 \times r_2 \times r_3}$ is the core tensor and $U_k \in \mathbb{R}^{n_k \times r_k}$ are component matrices. We sample elements of the core tensor from a standard gaussian distribution, choose component matrices to be orthogonal matrices, and generate $\mathcal{W}$ by mode-wise multiplication of the core tensor and component matrices.
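The generation procedure can be sketched in NumPy as follows. The function name is illustrative, the component matrices are drawn here as the Q factor of a gaussian matrix (one possible way to obtain orthonormal columns, assumed for this sketch), and the mode dimensions and ranks in the example are hypothetical rather than the values used in the experiments.

```python
import numpy as np

def random_tucker_tensor(dims, ranks, rng):
    """Weight tensor from a gaussian core and component matrices with orthonormal columns."""
    tensor = rng.standard_normal(ranks)              # core tensor, shape (r_1, ..., r_K)
    for k, (n, r) in enumerate(zip(dims, ranks)):
        U, _ = np.linalg.qr(rng.standard_normal((n, r)))  # n x r, orthonormal columns
        # Mode-k product: contract the columns of U with axis k of the tensor,
        # then move the new axis of length n back to position k.
        tensor = np.moveaxis(np.tensordot(U, tensor, axes=(1, k)), 0, k)
    return tensor

rng = np.random.default_rng(0)
W = random_tucker_tensor(dims=(10, 10, 10), ranks=(3, 5, 8), rng=rng)
```

With a generic core, the resulting tensor has mode dimensions `dims` and multilinear rank `ranks`, provided each rank does not exceed its mode dimension or the product of the other ranks.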

To create training samples $\{(\mathcal{X}_i, y_i)\}_{i=1}^{m}$, we first create random tensors $\mathcal{X}_i$ with each element independently sampled from the standard gaussian distribution and obtain $y_i = \langle \mathcal{W}, \mathcal{X}_i \rangle + \epsilon_i$, where $\epsilon_i$ is noise drawn from the gaussian distribution with mean zero and variance 0.1. In our experiments, we use cross-validation to select the regularization parameter from the range 0.01 to 100 at intervals of 0.1. For comparison, we have also simulated matrix regularized regressions for each mode unfolding. We also applied cross-validation among the matrix regularizations on each unfolded matrix (mode-wise cross-validation) to understand whether it can find the correct mode for regularization. As the baseline vector-based learning method, we use ridge regression (i.e., *l*_{2}-regularized least-squares).
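A minimal sketch of the data generation and of the vectorized ridge baseline is given below. The function names, the omission of a bias term, and the closed-form ridge solver are simplifying assumptions of this sketch; the weight tensor in the example is a random placeholder rather than a Tucker-generated tensor.

```python
import numpy as np

def make_regression_data(W, m, noise_var, rng):
    """Gaussian input tensors X_i and outputs y_i = <W, X_i> + noise."""
    X = rng.standard_normal((m,) + W.shape)
    noise = np.sqrt(noise_var) * rng.standard_normal(m)
    y = np.tensordot(X, W, axes=W.ndim) + noise   # contracts every mode of W
    return X, y

def ridge_fit(X, y, lam):
    """Ridge regression on vectorized inputs; returns weights in tensor shape."""
    A = X.reshape(len(X), -1)                     # m x N design matrix
    N = A.shape[1]
    w = np.linalg.solve(A.T @ A + lam * np.eye(N), A.T @ y)
    return w.reshape(X.shape[1:])

rng = np.random.default_rng(0)
W_true = rng.standard_normal((2, 3, 2))           # placeholder weight tensor
X, y = make_regression_data(W_true, m=50, noise_var=0.1, rng=rng)
W_ridge = ridge_fit(X, y, lam=1.0)
```

The ridge estimate treats all *N* entries of the weight tensor as unrelated coefficients, which is why it cannot exploit low multilinear rank.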

Figure 1 shows the performance for the homogeneous tensor with equal mode dimensions and equal multilinear ranks. The overlapped trace norm and the scaled overlapped trace norm perform best and equally well (they coincide because the mode dimensions are equal), while the latent and scaled latent norms also perform equally (again because the mode dimensions are equal) but are inferior to the overlapped norm. The matrix regularizations on individual modes all perform better than the latent and scaled latent norm regularized regression models. Because the multilinear ranks and mode dimensions are equal, mode-wise cross-validation performs the same as each mode-wise unfolded matrix regularization.

Figure 2 shows the performance for the homogeneous tensor with equal mode dimensions and unequal multilinear ranks. In this case, the latent and scaled latent norms again perform equally since the mode dimensions are the same. The mode-1 regularized regression models give the best performance since mode 1 has the lowest rank; regularization with the latent and scaled latent norms gives the next best performance. Mode-wise cross-validation correctly coincides with mode-1 regularization. The overlapped trace norm and the scaled overlapped trace norm perform equally (due to equal mode dimensions) and poorly compared to the latent and scaled latent trace norms.

Figure 3 shows the performance for the inhomogeneous tensor with unequal mode dimensions. In this case, the scaled latent trace norm outperforms all other tensor norms. The latent trace norm performs poorly since it fails to find the mode with the lowest rank relative to its dimension. This agrees well with our theoretical analysis. As shown in equation 4.5, the excess risk of the latent trace norm is bounded with the minimum of the multilinear ranks, which is attained on the first mode in the current setup, and this mode is high ranked relative to its dimension. The scaled latent trace norm is able to find the mode with the lowest relative rank since it takes the rank relative to the mode dimension, as in equation 4.6. Among the individual mode regularizations, the best performance is given by the second mode, which has the lowest rank with respect to the mode dimension, and the worst performance is given by the first mode, which is high ranked compared to the other modes. Here, mode-wise cross-validation is again as good as mode-2 regularization. The overlapped trace norm performs poorly compared to the scaled latent trace norm, and the scaled overlapped trace norm performs worse than the overlapped trace norm.

It is also worth noticing in these experiments that ridge regression performed worse than all the tensor regularized learning models. This highlights the need to employ low-rank-inducing norms for learning with tensor data without vectorization to get the best performance.

Figure 4 shows the computation time for the toy regression experiment with the inhomogeneous tensor (computation times for the other setups showed a similar tendency, and thus we omit the results). For each data set, we measured the computation time of training regression models, cross-validation for model selection, and predicting output values for test data. We can see that methods based on tensor norms and matrix norms are computationally much more expensive than ridge regression. However, as we saw, they achieve higher accuracy than ridge regression. It is worth noticing that mode-wise cross-validation is computationally more expensive than the scaled latent trace norm and the other tensor norms. This computational advantage and the comparable performance with respect to the best mode-wise regularization make the scaled latent trace norm a useful regularization method for tensor-based regression, especially for tensors with high variations in their multilinear ranks.

### 5.2 Tensor Classification for Hand Gesture Recognition

Next, we report the results of experiments on tensor classification with the *Cambridge hand gesture data set* (Kim et al., 2007).

The Cambridge hand gesture data set contains image sequences from nine gesture classes. These gesture classes combine three primitive hand shapes (flat, spread, and V-shape) with three different hand motions (rightward, leftward, and contract). Each class has 100 image sequences with different illumination conditions and arbitrary motions of two people. Previously, the tensor canonical correlation (Kim et al., 2007) was used to classify these hand gestures.

To apply tensor classification, we first build action sequences as tensor data by sampling *S* images with equal time intervals from each sequence. This makes each sequence a tensor of dimensions $20 \times 20 \times S$, where the first two modes are downsampled images as in Kim et al. (2007) and *S* is the number of sampled images. In our experiments, we set *S* to 5 or 10. We consider binary classification and choose the visually similar sequences of left/flat and left/spread (see Figure 5), which we found to be difficult to classify. We standardize all the data by mean removal and variance normalization. We randomly split the data into a training set of 120 data elements, a validation set of 40 data elements used to select the optimal regularization parameter, and a test set of 40 data elements used to evaluate the learned classifier. In addition to the tensor regularized learning models, we also trained classifiers with matrix regularization with unfolding on each mode separately, as well as cross-validation among these individual mode regularizations (mode-wise CV). As a baseline vector-based learning method, we used *l*_{2}-regularized logistic regression. We selected regularization parameters from 50 splits in logarithmic scale from 0.01 to 500. We repeated the learning procedure for 10 sample sets for each classifier; the results are shown in Table 1.
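The protocol above (standardization, a random 120/40/40 split, and a logarithmic grid of regularization parameters) can be sketched as follows. Several details are assumptions of this sketch rather than statements about the original experiments: standardization is applied per tensor entry across all samples, "50 splits in logarithmic scale" is read as 50 logarithmically spaced values, and the data array is a random placeholder for the 200 gesture sequences.

```python
import numpy as np

def standardize(X, eps=1e-12):
    """Mean removal and variance normalization of every tensor entry across samples."""
    flat = X.reshape(len(X), -1)
    flat = (flat - flat.mean(axis=0)) / (flat.std(axis=0) + eps)
    return flat.reshape(X.shape)

def split_indices(n, sizes, rng):
    """Randomly partition range(n) into consecutive chunks of the given sizes."""
    perm = rng.permutation(n)
    bounds = np.cumsum(sizes)[:-1]
    return np.split(perm[:sum(sizes)], bounds)

def log_grid(low, high, num):
    """`num` regularization parameters spaced evenly on a logarithmic scale."""
    return np.logspace(np.log10(low), np.log10(high), num)

rng = np.random.default_rng(0)
X = standardize(rng.standard_normal((200, 20, 20, 5)))  # placeholder for 200 sequences
train, val, test = split_indices(200, (120, 40, 40), rng)
lambdas = log_grid(0.01, 500, 50)
```

For each value in `lambdas`, a classifier would be trained on `X[train]`, scored on `X[val]`, and the best value then evaluated once on `X[test]`.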

| Norm | Tensor dimensions (20, 20, 5) | Tensor dimensions (20, 20, 10) |
|---|---|---|
| Overlapped trace norm | 0.1375 (0.0530) | 0.0775 (0.0343) |
| Latent trace norm | 0.1275 (0.0416) | 0.0875 (0.0429) |
| Scaled latent trace norm | 0.1075 (0.0409) | 0.1000 (0.0500) |
| Scaled overlapped trace norm | 0.1275 (0.0416) | 0.0850 (0.0444) |
| Mode-1 | 0.1050 (0.0438) | 0.0975 (0.0463) |
| Mode-2 | 0.1275 (0.0777) | 0.0850 (0.0489) |
| Mode-3 | 0.1175 (0.0409) | 0.1075 (0.0602) |
| Mode-wise CV | 0.1475 (0.0671) | 0.1025 (0.0381) |
| Logistic regression (*l*_{2}) | 0.1500 (0.0565) | 0.1425 (0.0457) |

Note: The bold figures indicate comparable accuracies among classifiers after a *t*-test with a significance of 0.05.

In both experiments for *S* = 5 and 10, we see that tensor norm regularized classification performs better than the vectorized learning method. With the tensor structure (20, 20, 5), mode-1 regularization gives the best performance; the scaled latent trace norm, latent trace norm, scaled overlapped trace norm, mode-2, and mode-3 are comparable. We observed that with the tensor structure (20, 20, 5), the third mode of the learned weight tensor becomes full rank. The scaled latent trace norm performed as well as mode-1 since it could identify the mode with the minimum rank relative to its mode dimension, which is the first mode in the current setup. The overlapped trace norm performs poorly due to large variations in the multilinear ranks and tensor dimensions.

With the tensor structure (20, 20, 10), the overlapped trace norm gives the best performance. In this case, we found that the multilinear ranks are close to each other, which made the overlapped trace norm give better performance. The scaled latent trace norm, latent trace norm, scaled overlapped trace norm, mode-1, and mode-2 gave a performance comparable to that with the overlapped trace norm.

### 5.3 Tensor Classification for Brain Computer Interface

As our second tensor classification task, we experimented with a motor-imagery EEG classification problem in the context of brain-computer interface (BCI). The objective of the experiments was to classify movements imagined by a person using the EEG signals captured in that instance. For our experiments, we used the data from the BCI competition IVa (Dornhege, Blankertz, Curio, & Müller, 2004). Previous research by Tomioka and Aihara (2007) has considered channel × channel as a matrix of the EEG signal and classified it using logistic regression with low-rank matrix regularization. Our objective is to model EEG data as tensors to incorporate more information and learn to classify using tensor regularization methods.

The BCI competition IVa data set consists of BCI experiments of five people. Though the BCI experiments used 256 channels, we use signals from only 49 channels following Tomioka and Aihara (2007) and preprocess the signal from each channel with *Z* different band-pass filters (Butterworth filters). Let $S_i \in \mathbb{R}^{C \times T}$, where *C* denotes the number of channels and *T* denotes the time, be the signal matrix obtained by processing with the *i*th filter. As in Tomioka and Aihara (2007), each $S_i$ is further centered and scaled. Then we obtain $X_i$, a channel × channel matrix (in our setting, it is $49 \times 49$). We arrange all $X_1, \ldots, X_Z$ to form a tensor of dimensions $Z \times C \times C$.
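The construction of the input tensor can be sketched as follows. The exact centering and scaling used in the experiments is not reproduced here; this sketch assumes that each filtered signal is centered over time and scaled by $1/\sqrt{T-1}$, so that each slice is the sample covariance between channels. The band-pass filtering step is omitted, and the input array is a random placeholder for the *Z* filtered signals.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-k unfolding (row-major column ordering)."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def covariance_tensor(filtered):
    """Stack one channel x channel covariance matrix per band-pass filter.

    filtered: array of shape (Z, C, T), one C x T signal matrix per filter.
    Returns an array of shape (Z, C, C).
    """
    T = filtered.shape[2]
    centered = filtered - filtered.mean(axis=2, keepdims=True)  # center over time
    scaled = centered / np.sqrt(T - 1)                          # assumed scaling
    return np.einsum("zct,zdt->zcd", scaled, scaled)

rng = np.random.default_rng(0)
signals = rng.standard_normal((5, 49, 300))  # placeholder: Z = 5 filters, C = 49 channels
X = covariance_tensor(signals)               # tensor of dimensions 5 x 49 x 49
```

Because every slice is symmetric, the unfoldings along the two channel modes contain the same entries, so matrix regularization on either of them behaves identically.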

For our experiments, we used *Z* = 5 different band-pass Butterworth filters with cutoff frequencies of (7, 10), (9, 12), (11, 14), (13, 16), and (15, 18) Hz with scaling by 50, which resulted in a signal converted into a tensor of dimensions $5 \times 49 \times 49$. We split the data used in the competition into training and validation sets with a proportion of 80:20; we used the rest of the data for testing. As in the previous experiment, we used logistic regression with all the tensor norms, individual mode unfolded matrix regularizations, and cross-validation with unfolded matrix regularization. We also used vector-based logistic regression with *l*_{2}-regularization for comparison. To compare tensor-based methods with the previously proposed matrix approach (Tomioka & Aihara, 2007), we averaged tensor data over the frequency mode and applied classification with matrix trace norm regularization. For all experiments, we selected the regularization parameters from 50 splits in logarithmic scale from 0.01 to 500. We show the validation and test errors for the tensor norms in Figure 6 in appendix C.

The results of the experiment are given in Table 2, which indicates that vector-based logistic regression is outperformed by the overlapped and scaled latent trace norms. Also, in most cases, the averaged matrix method performs poorly compared to the best tensor-structured regularization methods. Mode-1 regularization performs poorly since mode 1 was high ranked compared to the other modes. Similarly, the latent trace norm gives poor performance since it does not consider the rank relative to the mode dimension and therefore cannot regularize properly. For all subjects, mode-2 and mode-3 unfolded regularizations result in the same performance because the symmetry of each $X_i$ results in the same rank along the mode-2 and mode-3 unfoldings. For subject *aa*, the scaled latent norm, mode-2, mode-3, and mode-wise cross-validation give the best or comparable performance. For subject *al*, the scaled overlapped trace norm gives the best performance, and for subject *av*, both the overlapped trace norm and the scaled overlapped trace norm give comparable performances. For subjects *aw* and *ay*, the overlapped trace norm gives the best performance.

| Norm | Subject *aa* | Subject *al* | Subject *av* | Subject *aw* | Subject *ay* | Average time (seconds) |
|---|---|---|---|---|---|---|
| Overlapped trace norm | 0.2205 (0.0139) | 0.0178 (0.0) | 0.3244 (0.0132) | 0.0603 (0.0071) | 0.1254 (0.0190) | 17,986 (1,489) |
| Scaled overlapped trace norm | 0.2295 (0.0270) | 0.0018 (0.0056) | 0.3235 (0.0160) | 0.1022 (0.0192) | 0.2532 (0.0312) | 18,118 (1,608) |
| Latent trace norm | 0.3107 (0.0210) | 0.0339 (0.0056) | 0.3735 (0.0218) | 0.1549 (0.0381) | 0.4008 (0.0) | 20,021 (14,024) |
| Scaled latent trace norm | 0.2080 (0.0043) | 0.0179 (0.0) | 0.3694 (0.0182) | 0.0804 (0.0) | 0.1980 (0.0476) | 77,123 (149,024) |
| Mode-1 | 0.3205 (0.0174) | 0.0339 (0.0056) | 0.3739 (0.0211) | 0.1450 (0.0070) | 0.4020 (0.0038) | 5,737 (3,238) |
| Mode-2 | 0.2035 (0.0124) | 0.0285 (0.0225) | 0.3653 (0.0186) | 0.0790 (0.0042) | 0.1794 (0.0025) | 5,195 (1,446) |
| Mode-3 | 0.2035 (0.0124) | 0.0285 (0.0225) | 0.3653 (0.0186) | 0.0790 (0.0042) | 0.1794 (0.0025) | 5,223 (1,452) |
| Mode-wise CV | 0.2080 (0.0369) | 0.0428 (0.0305) | 0.3545 (0.0125) | 0.1008 (0.0227) | 0.1452 (0.0224) | 14,473 (4,142) |
| Averaged matrix | 0.2732 (0.0286) | 0.0178 (0.0) | 0.4030 (0.2487) | 0.1366 (0.0056) | 0.1825 (0.0) | 1,936 (472) |
| Logistic regression (*l*_{2}) | 0.3161 (0.0075) | 0.0179 (0.0) | 0.3684 (0.0537) | 0.2241 (0.0432) | 0.4040 (0.0640) | 72 (62) |

Note: The bold numbers in columns *aa*, *al*, *av*, *aw*, and *ay* indicate comparable accuracies among classifiers after a *t*-test with a significance of 0.05.

In contrast to the regression experiments, in this experiment regularization with the tensor trace norms is computationally more expensive than the mode-wise regularizations. Also, mode-wise cross-validation is computationally less expensive than the scaled latent trace norm and the other tensor trace norms. This is a slight drawback of the tensor norms, though they tend to have higher classification accuracy.

## 6 Conclusion and Future Work

In this letter, we have studied tensor-based regression and classification with regularization using the overlapped trace norm, the latent trace norm, and the scaled latent trace norm. We have provided dual optimization methods, theoretical analysis, and experimental evaluations to understand tensor-based inductive learning. Our theoretical analysis on excess risk bounds showed the relationship of excess risks with the multilinear ranks and dimensions of the weight tensor. Our experimental results on both simulated and real data sets further confirmed the validity of our theoretical analyses. From the theoretical and empirical results, we can conclude that the performance of regularization with tensor norms depends on the multilinear ranks and mode dimensions, where the latent and scaled latent norms are more robust in tensors with large variations of multilinear ranks.

Our research opens up many future research directions. For example, an important direction is the improvement of optimization methods. Optimization over the latent tensors, which results from the use of the latent trace norm and the scaled latent trace norm, increases the computational cost compared to the vectorized methods. Also, computing multiple singular value decompositions and solving Newton optimization subproblems (for logistic regression) at each iterative step are computationally expensive. This is evident from our experimental results on computation time for regression and classification. It would be an important direction to develop computationally more efficient methods for learning with tensor data to make it more practical.

Regularization with a mixture of norms is common in both vector-based (e.g., the elastic net; Zou & Hastie, 2003) and matrix-based regularizations (Savalle, Richard, & Vayatis, 2012). It would be an interesting research direction to combine sparse regularization (the *l*_{1}-norm) with existing tensor norms. There is also a recent research direction to develop new composite norms such as the $(k, q)$-trace norm (Richard, Obozinski, & Vert, 2014). Development of composite tensor norms can be useful for inductive tensor learning to obtain sparse and low-rank solutions.

### Appendix A: Dual Formulations

### Appendix B: Proofs of Theorems in Section 4

We prove the following useful lemma.

### Appendix C: Test and Validation Curves for BCI data

We show in Figure 6 the validation errors and test errors for BCI data sets.

## Acknowledgments

K.W. acknowledges the Monbukagakusho MEXT Scholarship and KAKENHI 23120004, and M.S. acknowledges the JST CREST program.

## References

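Gandy, S., Recht, B., & Yamada, I. (2011). Tensor completion and low-*n*-rank tensor recovery via convex optimization. *Inverse Problems*, 27(2), 025010.

Karatzoglou, A., Amatriain, X., Baltrunas, L., & Oliver, N. (2010). Multiverse recommendation: *N*-dimensional tensor factorization for context-aware collaborative filtering. In *Proceedings of the 4th ACM Conference on Recommender Systems*.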