## Abstract

We propose a set of convex low-rank inducing norms for coupled matrices and tensors (hereafter referred to as coupled tensors), in which information is shared between the matrices and tensors through common modes. More specifically, we first propose a mixture of the overlapped trace norm and the latent norms with the matrix trace norm, and then propose a completion model regularized using these norms to impute coupled tensors. A key advantage of the proposed norms is that they are convex and can be used to find a globally optimal solution, whereas existing methods for coupled learning are nonconvex. We also analyze the excess risk bounds of the completion model regularized using our proposed norms and show that they can exploit the low-rankness of coupled tensors, leading to better bounds compared to those obtained using uncoupled norms. Through synthetic and real-data experiments, we show that the proposed completion model compares favorably with existing ones.

## 1  Introduction

Learning from a matrix or a tensor has long been an important problem in machine learning. In particular, matrix and tensor factorization using low-rank inducing norms has been studied extensively, and many applications have been considered, such as missing value imputation (Signoretto, Dinh, De Lathauwer, & Suykens, 2013; Liu, Musialski, Wonka, & Ye, 2009), multitask learning (Argyriou, Evgeniou, & Pontil, 2006; Romera-Paredes, Aung, Bianchi-Berthouze, & Pontil, 2013; Wimalawarne, Sugiyama, & Tomioka, 2014), subspace clustering (Liu, Lin, & Yu, 2010), and inductive learning (Signoretto et al., 2013; Wimalawarne, Tomioka, & Sugiyama, 2016). Though useful in many applications, factorization based on an individual matrix or tensor tends to perform poorly under the cold-start condition (Singh & Gordon, 2008), when, for example, it is not possible to observe click information for new users in collaborative filtering; it therefore cannot be used to recommend items for new users. A potential way to address this issue is matrix or tensor factorization with side information (Narita, Hayashi, Tomioka, & Kashima, 2011). Both approaches have been applied to recommendation systems (Singh & Gordon, 2008; Gunasekar, Yamada, Yin, & Chang, 2015) and personalized medicine (Khan & Kaski, 2014).

Both matrix and tensor factorization with side information can be regarded as the joint factorization of coupled matrices and tensors (hereafter referred to as coupled tensors; see Figure 1). Acar, Kolda, and Dunlavy (2011) introduced a coupled factorization method based on CANDECOMP/PARAFAC (CP) decomposition that simultaneously factorizes matrices and tensors by sharing the low-rank structures in the matrices and tensors. The coupled factorization approach has been applied to joint analysis of fluorescence and proton nuclear magnetic resonance (NMR) measurements (Acar, Nilsson, & Saunders, 2014) and joint NMR and liquid chromatography-mass spectrometry (LCMS; Acar, Bro, and Smilde, 2015). More recently, a Bayesian approach proposed by Ermis, Acar, and Cemgil (2015) was applied to link prediction problems. However, existing coupled factorization methods are nonconvex and may converge to a poor local optimum. Moreover, the ranks of the coupled tensors need to be determined beforehand, and in practice it is difficult to specify the true ranks of the tensor and the matrix without prior knowledge. Furthermore, existing algorithms lack theoretical guarantees.

Figure 1:

Illustration of information sharing between a matrix and a tensor in a coupled tensor, through the customer mode.


We propose in this letter convex norms for coupled tensors that overcome the nonconvexity problem. The norms are a mixture of tensor norms: the overlapped trace norm (Tomioka, Suzuki, Hayashi, & Kashima, 2011), the latent trace norm (Tomioka & Suzuki, 2013), the scaled latent norm (Wimalawarne et al., 2014), and the matrix trace norm (Argyriou et al., 2006). A key advantage of the proposed norms is that they are convex and thus can be used to find a globally optimal solution, whereas existing coupled factorization approaches are nonconvex. Furthermore, we analyze the excess risk bounds of the completion model regularized using our proposed norms. Through synthetic and real-data experiments, we show that it compares favorably with existing ones.

In this letter, we:

• Propose a set of convex coupled norms for matrices and tensors that extend low-rank tensor and matrix norms.

• Propose mixed norms that combine features from both the overlapped norm and latent norms.

• Propose a convex completion model regularized using the proposed coupled norms.

• Analyze the excess risk bounds for the proposed completion model with respect to the proposed norms and show that it leads to lower excess risk.

• Show through synthetic and real-data experiments that our norms lead to performance comparable to that of existing nonconvex methods.

• Show that our norms are applicable to coupled tensors based on both the CP rank and the multilinear rank without prior assumptions about their low-rankness.

• Show that the convexity of the proposed norms leads to global solutions, eliminating the need to deal with local optimal solutions as is necessary with nonconvex methods.

The remainder of the letter is organized as follows. In section 2, we discuss related work on coupled tensor completion. In section 3, we present our proposed method, first introducing a coupled completion model and then proposing a set of norms called coupled norms. In section 4, we give optimization methods for solving the coupled completion model. In section 5, we theoretically analyze it using excess risk bounds for the proposed coupled norms. In section 6, we present the results of our evaluation using synthetic and real-world data experiments. Finally, in section 7, we summarize the key points and suggest future work.

## 2  Related Work

Most of the models proposed for learning with multiple matrices or tensors use joint factorization of matrices and tensors. The regularization-based model proposed by Acar et al. (2011) for completion of coupled tensors, which was further studied (Acar, Nilsson et al., 2014; Acar, Papalexakis et al., 2014; Acar et al., 2015), uses CP decomposition (Carroll & Chang, 1970; Harshman, 1970; Hitchcock, 1927; Kolda & Bader, 2009) to factorize the tensor and operates under the assumption that the factorized components of its coupled mode are in common with the factorized components of the matrix on the same mode. Bayesian models have also been proposed for imputing missing values with applications in link prediction (Ermis et al., 2015) and nonnegative factorization (Takeuchi, Tomioka, Ishiguro, Kimura, & Sawada, 2013), which use similar factorization models. Applications that have used collective factorization of tensors include multiview factorization (Khan & Kaski, 2014) and multiway clustering (Banerjee, Basu, & Merugu, 2007). Due to their use of factorization-based learning, all of these models are nonconvex.

The use of common adjacency graphs has more recently been proposed for incorporating similarities among heterogeneous tensor data (Li, Zhao, Li, Cichocki, & Guo, 2015). Though this method does not require assumptions about rank for explicit factorization of tensors, it depends on the modeling of the common adjacency graph and does not incorporate the low-rankness created by the coupling of tensors.

## 3  Proposed Method

We investigate the coupling of a matrix and a tensor, which forms when they share a common mode (Acar et al., 2015; Acar, Nilsson et al., 2014; Acar, Papalexakis et al., 2014). An example of the most basic coupling is shown in Figure 1, where a three-way (third-order) tensor is attached to a matrix on a specific mode. As depicted, we may have the problem of predicting recommendations for customers on the basis of their preferences for restaurants in different locations, and we may also have side information about the characteristics of each customer. We can utilize this side information by coupling the customer-characteristic matrix with the sparse customer-restaurant-location tensor on the customer mode and then imputing the missing values in the tensor.

Let us consider a partially observed matrix $\hat{M} \in \mathbb{R}^{n_1 \times m}$ and a partially observed three-way tensor $\hat{\mathcal{T}} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ with mappings to observed elements indexed by $\Omega_M$ and $\Omega_T$, respectively, and let us assume that they are coupled on the first mode. The ultimate goal of this letter is to introduce convex coupled norms $\|\mathcal{T}, M\|_{\mathrm{cn}}$ for use in solving
$\min_{\mathcal{T}, M} \frac{1}{2}\|\Omega_M(M - \hat{M})\|_F^2 + \frac{1}{2}\|\Omega_T(\mathcal{T} - \hat{\mathcal{T}})\|_F^2 + \lambda \|\mathcal{T}, M\|_{\mathrm{cn}},$
(3.1)
where $\lambda \geq 0$ is the regularization parameter. We also investigate the theoretical properties of problem 3.1.

The mode-$k$ unfolding of a tensor $\mathcal{T} \in \mathbb{R}^{n_1 \times \cdots \times n_K}$ is represented as $T_{(k)} \in \mathbb{R}^{n_k \times \prod_{j \neq k} n_j}$, which is obtained by concatenating, along its columns, all the $\prod_{j \neq k} n_j$ vectors of dimension $n_k$ obtained by fixing all indices but the $k$th. We use $\mathrm{vec}(\cdot)$ to indicate the conversion of a matrix or a tensor into a vector and $\mathrm{unvec}(\cdot)$ to represent the reverse operation. The spectral norm (operator norm) of a matrix $X$, denoted $\|X\|_{\mathrm{op}}$, is the largest singular value of $X$. The Frobenius norm of a tensor $\mathcal{T}$ is defined as $\|\mathcal{T}\|_F = \sqrt{\langle \mathcal{T}, \mathcal{T} \rangle} = \sqrt{\mathrm{vec}(\mathcal{T})^\top \mathrm{vec}(\mathcal{T})}$. We use $[M; N]$ to denote the concatenation of matrices $M \in \mathbb{R}^{m_1 \times m_2}$ and $N \in \mathbb{R}^{m_1 \times m_3}$ along their shared first mode, giving a matrix in $\mathbb{R}^{m_1 \times (m_2 + m_3)}$.
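The notation above can be made concrete with a few lines of NumPy. This is an illustrative sketch, not part of the original letter: `unfold` and `fold` are our names, and the column ordering of the unfolding is one fixed convention (the trace norm of an unfolding is invariant to column permutations, so this choice does not affect any of the norms below).

```python
import numpy as np

def unfold(T, k):
    """Mode-k unfolding T_(k): an (n_k x prod_{j != k} n_j) matrix obtained
    by bringing mode k to the rows and flattening the remaining modes."""
    return np.moveaxis(T, k, 0).reshape(T.shape[k], -1)

def fold(M, k, shape):
    """unvec-style inverse of unfold for the same column-ordering convention."""
    rest = [s for j, s in enumerate(shape) if j != k]
    return np.moveaxis(M.reshape([shape[k]] + rest), 0, k)

T = np.arange(24).reshape(2, 3, 4)
T1 = unfold(T, 1)                      # shape (3, 8): n_2 rows, n_1*n_3 columns
T_back = fold(T1, 1, T.shape)          # round trip recovers T
frob = np.sqrt(T.ravel() @ T.ravel())  # ||T||_F via vec(T)^T vec(T)
```

The round trip `fold(unfold(T, k), k, T.shape)` recovers the original tensor, and the Frobenius norm agrees with the vectorized inner product as defined above.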

### 3.1  Existing Matrix and Tensor Norms

Before we introduce our new norms, we first briefly review the existing low-rank inducing matrix and tensor norms. For matrices, the trace norm (Argyriou et al., 2006) is a commonly used convex relaxation of matrix rank minimization. For a given matrix $M \in \mathbb{R}^{n_1 \times m}$ with rank $J$, the trace norm is defined as
$\|M\|_{\mathrm{tr}} = \sum_{j=1}^{J} \sigma_j,$
where $\sigma_j$ is the $j$th nonzero singular value of $M$.
Low-rank inducing norms for tensors have received revived attention in recent years. One of the earliest low-rank inducing tensor norms is the tensor nuclear norm (Liu et al., 2009), also known as the overlapped trace norm (Tomioka & Suzuki, 2013), which can be expressed for a tensor $\mathcal{T} \in \mathbb{R}^{n_1 \times \cdots \times n_K}$ as
$\|\mathcal{T}\|_{\mathrm{overlap}} = \sum_{k=1}^{K} \|T_{(k)}\|_{\mathrm{tr}}.$
(3.2)
Tomioka and Suzuki (2013) proposed the latent trace norm:
$\|\mathcal{T}\|_{\mathrm{latent}} = \inf_{\mathcal{T}^{(1)} + \cdots + \mathcal{T}^{(K)} = \mathcal{T}} \sum_{k=1}^{K} \|T^{(k)}_{(k)}\|_{\mathrm{tr}}.$
(3.3)
The scaled latent trace norm was proposed as an extension of the latent trace norm (Wimalawarne et al., 2014):
$\|\mathcal{T}\|_{\mathrm{scaled}} = \inf_{\mathcal{T}^{(1)} + \cdots + \mathcal{T}^{(K)} = \mathcal{T}} \sum_{k=1}^{K} \frac{1}{\sqrt{n_k}} \|T^{(k)}_{(k)}\|_{\mathrm{tr}}.$
(3.4)

The behaviors of these tensor norms have been studied in multitask learning (Wimalawarne et al., 2014) and inductive learning (Wimalawarne et al., 2016). The results show that for a tensor $\mathcal{T} \in \mathbb{R}^{n_1 \times \cdots \times n_K}$ with multilinear rank $(r_1, \ldots, r_K)$, the excess risk of regularization with the overlapped trace norm is bounded above by $O\big(\sum_{k=1}^{K} \sqrt{r_k}\big)$, with the latent trace norm by $O\big(\min_k \sqrt{r_k}\big)$, and with the scaled latent trace norm by $O\big(\min_k \sqrt{r_k / n_k}\big)$.
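As a concrete numerical aid (illustrative only; `trace_norm` and `overlapped_trace_norm` are our names), the matrix trace norm and the overlapped trace norm of equation 3.2 can be evaluated directly. The latent and scaled latent trace norms involve an infimum over decompositions and have no closed form, so they are not computed here.

```python
import numpy as np

def unfold(T, k):
    # mode-k unfolding under one fixed column-ordering convention
    return np.moveaxis(T, k, 0).reshape(T.shape[k], -1)

def trace_norm(M):
    """Matrix trace norm (nuclear norm): the sum of singular values."""
    return np.linalg.svd(M, compute_uv=False).sum()

def overlapped_trace_norm(T):
    """Equation 3.2: sum of trace norms over all mode-k unfoldings."""
    return sum(trace_norm(unfold(T, k)) for k in range(T.ndim))

rng = np.random.default_rng(0)
u, v, w = (rng.standard_normal(d) for d in (4, 5, 6))
T = np.einsum('i,j,k->ijk', u, v, w)   # a rank-1 tensor u ∘ v ∘ w
```

For a rank-1 tensor, every mode-$k$ unfolding is a rank-1 matrix whose single singular value equals $\|\mathcal{T}\|_F$, so the overlapped trace norm equals $3\|\mathcal{T}\|_F$; this gives a quick sanity check of the implementation.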

### 3.2  Coupled Tensor Norms

As with individual matrices and tensors, having convex low-rank inducing norms for coupled tensors would be useful in achieving global solutions for coupled tensor completion with theoretical guarantees. To this end, we propose a set of norms for tensors coupled with matrices on specific modes, built from the existing matrix and tensor trace norms. We first define a new coupled norm with the format $\|\cdot\|^{a}_{(b,c,d)}$, where the superscript $a$ specifies the mode on which the tensor and matrix are coupled and the subscripts $b, c, d \in \{O, L, S, -\}$ indicate how the modes are regularized. The notations for $b, c, d$ are defined as follows:

• $O$: The mode is regularized with the trace norm; the same tensor is regularized on the other modes, as in the overlapped trace norm.

• $L$: The mode is considered to be a latent tensor that is regularized using the trace norm only with respect to that mode.

• $S$: The mode is regularized as a latent tensor, but it is scaled as in the scaled latent trace norm.

• $-$: The mode is not regularized.

Given a matrix $M∈Rn1×m$ and a tensor $T∈Rn1×n2×n3$, we introduce three norms that are coupled extensions of the overlapped trace norm, the latent trace norm, and the scaled latent trace norm, respectively.

Coupled overlapped trace norm:
$\|\mathcal{T}, M\|^{1}_{(O,O,O)} := \|[T_{(1)}; M]\|_{\mathrm{tr}} + \sum_{k=2}^{3} \|T_{(k)}\|_{\mathrm{tr}}.$
(3.5)
Coupled latent trace norm:
$\|\mathcal{T}, M\|^{1}_{(L,L,L)} = \inf_{\mathcal{T}^{(1)} + \mathcal{T}^{(2)} + \mathcal{T}^{(3)} = \mathcal{T}} \|[T^{(1)}_{(1)}; M]\|_{\mathrm{tr}} + \sum_{k=2}^{3} \|T^{(k)}_{(k)}\|_{\mathrm{tr}}.$
(3.6)
Coupled scaled latent trace norm:
$\|\mathcal{T}, M\|^{1}_{(S,S,S)} = \inf_{\mathcal{T}^{(1)} + \mathcal{T}^{(2)} + \mathcal{T}^{(3)} = \mathcal{T}} \frac{1}{\sqrt{n_1}}\|[T^{(1)}_{(1)}; M]\|_{\mathrm{tr}} + \sum_{k=2}^{3} \frac{1}{\sqrt{n_k}}\|T^{(k)}_{(k)}\|_{\mathrm{tr}}.$
(3.7)
In addition to these norms, we can also create norms as mixtures of overlapped and latent or scaled latent norms. For example, if we want to create a norm that is regularized using the scaled latent trace norm on the second mode while the other modes are regularized using the overlapped trace norm, we can define it as
$\|\mathcal{T}, M\|^{1}_{(O,S,O)} = \inf_{\mathcal{T}^{(1)} + \mathcal{T}^{(2)} = \mathcal{T}} \|[T^{(1)}_{(1)}; M]\|_{\mathrm{tr}} + \frac{1}{\sqrt{n_2}}\|T^{(2)}_{(2)}\|_{\mathrm{tr}} + \|T^{(1)}_{(3)}\|_{\mathrm{tr}}.$
(3.8)
This norm has two latent tensors, $\mathcal{T}^{(1)}$ and $\mathcal{T}^{(2)}$. Tensor $\mathcal{T}^{(1)}$ is regularized in the overlapped manner on modes 1 and 3, while tensor $\mathcal{T}^{(2)}$ is regularized as a scaled latent tensor on mode 2. Given this use of a mixture of regularization methods, we call the resulting norm a mixed norm.

In a similar manner, we can create other mixed norms distinguished by their subscripts: $(L,O,O)$, $(O,L,O)$, $(O,O,L)$, $(S,O,O)$, $(O,S,O)$, and $(O,O,S)$. The main advantage gained by using these mixed norms is the additional freedom to regularize low-rank constraints among coupled tensors. Other combinations in which two modes are latent tensors, such as $(L,L,O)$, would make the third mode a latent tensor as well, since overlapped regularization requires that more than one mode of the same tensor be regularized. Though we have also considered the latent trace norm, in practice it has been shown to be weaker in performance than the scaled latent trace norm (Wimalawarne et al., 2014, 2016). Therefore, in our experiments, we considered only mixed norms based on the scaled latent trace norm.
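Among the proposed norms, the coupled overlapped trace norm of equation 3.5 involves no infimum and can be evaluated directly. The following is an illustrative sketch (our naming and unfolding convention, which affects trace norms only up to a column permutation and hence not at all):

```python
import numpy as np

def unfold(T, k):
    return np.moveaxis(T, k, 0).reshape(T.shape[k], -1)

def trace_norm(M):
    return np.linalg.svd(M, compute_uv=False).sum()

def coupled_overlapped_norm(T, M):
    """Equation 3.5: the matrix M is concatenated to the mode-1 unfolding
    before the trace norm is taken; modes 2 and 3 are penalized as in the
    overlapped trace norm."""
    return (trace_norm(np.hstack([unfold(T, 0), M]))   # ||[T_(1); M]||_tr
            + trace_norm(unfold(T, 1))
            + trace_norm(unfold(T, 2)))

rng = np.random.default_rng(1)
T = rng.standard_normal((4, 5, 6))
M = rng.standard_normal((4, 3))
val = coupled_overlapped_norm(T, M)
```

Since appending the columns of $M$ cannot decrease any singular value of the mode-1 unfolding, the coupled norm always dominates the uncoupled overlapped trace norm of $\mathcal{T}$, and like any norm it is positively homogeneous.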

#### 3.2.1  Extensions for Multiple Matrices and Tensors

Our newly defined norms can be extended to multiple matrices coupled to a tensor on different modes. For instance, we can couple two matrices $M1∈Rn1×m1$ and $M2∈Rn3×m2$ to a three-way tensor $T$ on its first and third modes. If we regularize the coupled tensor with the overlapped trace norm on modes 1 and 3 and the scaled latent trace norm on mode 2, we obtain a mixed norm:
$\|\mathcal{T}, M_1, M_2\|^{1,3}_{(O,S,O)} = \inf_{\mathcal{T}^{(1)} + \mathcal{T}^{(2)} = \mathcal{T}} \|[T^{(1)}_{(1)}; M_1]\|_{\mathrm{tr}} + \frac{1}{\sqrt{n_2}}\|T^{(2)}_{(2)}\|_{\mathrm{tr}} + \|[T^{(1)}_{(3)}; M_2]\|_{\mathrm{tr}}.$

Coupled norms for multiple three-way or higher-order tensors could also be designed using our approach, though such extensions may require further development of the coupled norms. Extensions to coupled norms for multiple tensors are a promising direction for future research.

### 3.3  Dual Norms

We now briefly look at dual norms for the coupled norms. Dual norms are useful in deriving the excess risk bounds discussed in section 5. Due to space limitations, we derive dual norms for only two coupled norms to better understand their nature. To derive them, we first need the Schatten norm (Tomioka & Suzuki, 2013) for the coupled tensor norms. We first define the Schatten-$(p,q)$ norm for the coupled norm $\|\mathcal{T}, M\|^{1}_{(O,O,O)}$ with an additional subscript notation $S_{\underline{p/q}}$:
$\|\mathcal{T}, M\|^{1}_{(O,O,O), S_{\underline{p/q}}} := \Bigg( \Big(\sum_{i}^{r_1} \sigma_i\big([T_{(1)}; M]\big)^{p}\Big)^{\frac{q}{p}} + \Big(\sum_{j}^{r_2} \sigma_j\big(T_{(2)}\big)^{p}\Big)^{\frac{q}{p}} + \Big(\sum_{k}^{r_3} \sigma_k\big(T_{(3)}\big)^{p}\Big)^{\frac{q}{p}} \Bigg)^{\frac{1}{q}},$
(3.9)
where $p$ and $q$ are constants; $r_1$, $r_2$, and $r_3$ are the ranks; and $\sigma_i$, $\sigma_j$, and $\sigma_k$ are the singular values of each unfolding.

The following theorem presents the dual norm of $∥T,M∥(O,O,O),Sp/q̲1$ (see appendix A for proof).

Theorem 1.
Let a matrix $M \in \mathbb{R}^{n_1 \times m}$ and a tensor $\mathcal{T} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ be coupled on their first modes. The dual norm of $\|\mathcal{T}, M\|^{1}_{(O,O,O), S_{\underline{p/q}}}$ with $1/p + 1/p^* = 1$ and $1/q + 1/q^* = 1$ is
$\|\mathcal{T}, M\|^{1}_{(O,O,O), S_{\overline{p^*/q^*}}} = \inf_{\mathcal{T}^{(1)} + \mathcal{T}^{(2)} + \mathcal{T}^{(3)} = \mathcal{T}} \Bigg( \Big(\sum_{i}^{r_1} \sigma_i\big([T^{(1)}_{(1)}; M]\big)^{p^*}\Big)^{\frac{q^*}{p^*}} + \Big(\sum_{j}^{r_2} \sigma_j\big(T^{(2)}_{(2)}\big)^{p^*}\Big)^{\frac{q^*}{p^*}} + \Big(\sum_{k}^{r_3} \sigma_k\big(T^{(3)}_{(3)}\big)^{p^*}\Big)^{\frac{q^*}{p^*}} \Bigg)^{\frac{1}{q^*}},$
where $r_1$, $r_2$, and $r_3$ are the ranks for each mode and $\sigma_i$, $\sigma_j$, and $\sigma_k$ are the singular values for each unfolding of the coupled tensor.

In the special case of $p = 1$ and $q = 1$, we see that $\|\mathcal{T}, M\|^{1}_{(O,O,O), S_{\underline{1/1}}} = \|\mathcal{T}, M\|^{1}_{(O,O,O)}$. Its dual norm is the spectral norm, as shown in the following corollary:

Corollary 1.
Let a matrix $M \in \mathbb{R}^{n_1 \times m}$ and a tensor $\mathcal{T} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ be coupled on their first mode. The dual norm of $\|\mathcal{T}, M\|^{1}_{(O,O,O), S_{\underline{1/1}}}$ is
$\|\mathcal{T}, M\|^{1}_{(O,O,O), S_{\overline{\infty/\infty}}} = \inf_{\mathcal{T}^{(1)} + \mathcal{T}^{(2)} + \mathcal{T}^{(3)} = \mathcal{T}} \max\Big(\|[T^{(1)}_{(1)}; M]\|_{\mathrm{op}},\ \|T^{(2)}_{(2)}\|_{\mathrm{op}},\ \|T^{(3)}_{(3)}\|_{\mathrm{op}}\Big).$
The Schatten-$(p,q)$ norm for the mixed norm $\|\cdot\|^{1}_{(L,O,O)}$ is defined as
$\|\mathcal{T}, M\|^{1}_{(L,O,O), S_{\underline{p/q}}} = \inf_{\mathcal{T}^{(1)} + \mathcal{T}^{(2)} = \mathcal{T}} \Bigg( \Big(\sum_{i}^{r_1} \sigma_i\big([T^{(1)}_{(1)}; M]\big)^{p}\Big)^{\frac{q}{p}} + \Big(\sum_{j}^{r_2} \sigma_j\big(T^{(2)}_{(2)}\big)^{p}\Big)^{\frac{q}{p}} + \Big(\sum_{k}^{r_3} \sigma_k\big(T^{(2)}_{(3)}\big)^{p}\Big)^{\frac{q}{p}} \Bigg)^{\frac{1}{q}}.$
Its dual norm is defined by the following theorem (see appendix A for the proof):
Theorem 2.
Let a matrix $M \in \mathbb{R}^{n_1 \times m}$ and a tensor $\mathcal{T} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ be coupled on their first mode. The dual norm of the mixed coupled norm $\|\mathcal{T}, M\|^{1}_{(L,O,O), S_{\underline{p/q}}}$ with $1/p + 1/p^* = 1$ and $1/q + 1/q^* = 1$ is
$\|\mathcal{T}, M\|^{1}_{(L,O,O), S_{\overline{p^*/q^*}}} = \Bigg( \Big(\sum_{i}^{r_1} \sigma_i\big([T_{(1)}; M]\big)^{p^*}\Big)^{\frac{q^*}{p^*}} + \inf_{\hat{\mathcal{T}}^{(1)} + \hat{\mathcal{T}}^{(2)} = \mathcal{T}} \Big(\sum_{j}^{r_2} \sigma_j\big(\hat{T}^{(1)}_{(2)}\big)^{p^*}\Big)^{\frac{q^*}{p^*}} + \Big(\sum_{k}^{r_3} \sigma_k\big(\hat{T}^{(2)}_{(3)}\big)^{p^*}\Big)^{\frac{q^*}{p^*}} \Bigg)^{\frac{1}{q^*}},$
where $r_1$, $r_2$, and $r_3$ are the ranks of $T_{(1)}$, $\hat{T}^{(1)}_{(2)}$, and $\hat{T}^{(2)}_{(3)}$, respectively, and $\sigma_i$, $\sigma_j$, and $\sigma_k$ are their singular values.

The dual norms of other mixed norms can be similarly derived.

## 4  Optimization

In this section, we discuss optimization of the proposed completion model, equation 3.1. The model can be solved for each coupled norm using a state-of-the-art optimization method such as the alternating direction method of multipliers (ADMM; Boyd, Parikh, Chu, Peleato, & Eckstein, 2011). Below, we derive the ADMM optimization steps for the coupled norm $\|\mathcal{T}, M\|^{1}_{(S,O,O)}$; the steps for the other norms can be derived similarly.

We express equation 3.1 using the $\|\mathcal{T}, M\|^{1}_{(S,O,O)}$ norm:
$\min_{\mathcal{T}^{(1)}, \mathcal{T}^{(2)}, M} \frac{1}{2}\|\Omega_M(M - \hat{M})\|_F^2 + \frac{1}{2}\|\Omega_T(\mathcal{T}^{(1)} + \mathcal{T}^{(2)} - \hat{\mathcal{T}})\|_F^2 + \lambda\Big(\frac{1}{\sqrt{n_1}}\|[T^{(1)}_{(1)}; M]\|_{\mathrm{tr}} + \|T^{(2)}_{(2)}\|_{\mathrm{tr}} + \|T^{(2)}_{(3)}\|_{\mathrm{tr}}\Big).$
(4.1)
By introducing auxiliary variables $X \in \mathbb{R}^{n_1 \times m}$ and $\mathcal{Y}^{(k)} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$, $k = 1, 2, 3$, we can formulate the objective function of ADMM for equation 4.1:
$\min_{\mathcal{T}^{(1)}, \mathcal{T}^{(2)}, M} \frac{1}{2}\|\Omega_M(M - \hat{M})\|_F^2 + \frac{1}{2}\|\Omega_T(\mathcal{T}^{(1)} + \mathcal{T}^{(2)} - \hat{\mathcal{T}})\|_F^2 + \lambda\Big(\frac{1}{\sqrt{n_1}}\|[Y^{(1)}_{(1)}; X]\|_{\mathrm{tr}} + \|Y^{(2)}_{(2)}\|_{\mathrm{tr}} + \|Y^{(3)}_{(3)}\|_{\mathrm{tr}}\Big) \quad \mathrm{s.t.}\ X = M,\ \mathcal{Y}^{(1)} = \mathcal{T}^{(1)},\ \mathcal{Y}^{(k)} = \mathcal{T}^{(2)},\ k = 2, 3.$
(4.2)
We introduce Lagrangian multipliers $W_M \in \mathbb{R}^{n_1 \times m}$ and $\mathcal{W}_T^{(k)} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$, $k = 1, 2, 3$, and formulate the augmented Lagrangian as
$\min_{\mathcal{T}^{(1)}, \mathcal{T}^{(2)}, M} \frac{1}{2}\|\Omega_M(M - \hat{M})\|_F^2 + \frac{1}{2}\|\Omega_T(\mathcal{T}^{(1)} + \mathcal{T}^{(2)} - \hat{\mathcal{T}})\|_F^2 + \lambda\Big(\frac{1}{\sqrt{n_1}}\|[Y^{(1)}_{(1)}; X]\|_{\mathrm{tr}} + \|Y^{(2)}_{(2)}\|_{\mathrm{tr}} + \|Y^{(3)}_{(3)}\|_{\mathrm{tr}}\Big) + \langle W_M, M - X \rangle + \langle \mathcal{W}_T^{(1)}, \mathcal{T}^{(1)} - \mathcal{Y}^{(1)} \rangle + \sum_{k=2}^{3} \langle \mathcal{W}_T^{(k)}, \mathcal{T}^{(2)} - \mathcal{Y}^{(k)} \rangle + \frac{\beta}{2}\|M - X\|_F^2 + \frac{\beta}{2}\|\mathcal{T}^{(1)} - \mathcal{Y}^{(1)}\|_F^2 + \frac{\beta}{2}\sum_{k=2}^{3}\|\mathcal{T}^{(2)} - \mathcal{Y}^{(k)}\|_F^2,$
(4.3)
where $\beta$ is a proximity parameter. Using this Lagrangian formulation, we can iteratively obtain solutions for the unknown variables $M$, $\mathcal{T}^{(1)}$, $\mathcal{T}^{(2)}$, $W_M$, $\mathcal{W}_T^{(k)}$ $(k = 1, 2, 3)$, $X$, and $\mathcal{Y}^{(k)}$ $(k = 1, 2, 3)$. We use superscripts $[t]$ and $[t-1]$ to represent the variables at iteration steps $t$ and $t-1$, respectively.
The solution for $M$ at each iteration can be obtained by solving the following subproblem:
$M^{[t]} = \mathrm{unvec}\Big(\big(\Omega_M^\top \Omega_M + \beta I_M\big)^{-1} \mathrm{vec}\big(\Omega_M(\hat{M}) - W_M^{[t-1]} + \beta X^{[t-1]}\big)\Big).$
Solutions for $\mathcal{T}^{(1)}$ and $\mathcal{T}^{(2)}$ at iteration step $t$ can be obtained from the following subproblem:
$\begin{pmatrix} \Omega_T^\top \Omega_T + \beta I_T & \Omega_T^\top \Omega_T \\ \Omega_T^\top \Omega_T & \Omega_T^\top \Omega_T + 2\beta I_T \end{pmatrix} \begin{pmatrix} \mathrm{vec}(\mathcal{T}^{(1)[t]}) \\ \mathrm{vec}(\mathcal{T}^{(2)[t]}) \end{pmatrix} = \begin{pmatrix} \mathrm{vec}\big(\Omega_T(\hat{\mathcal{T}}) - \mathcal{W}_T^{(1)[t-1]} + \beta \mathcal{Y}^{(1)[t-1]}\big) \\ \mathrm{vec}\big(\Omega_T(\hat{\mathcal{T}}) - \sum_{k=2}^{3} \mathcal{W}_T^{(k)[t-1]} + \beta \sum_{k=2}^{3} \mathcal{Y}^{(k)[t-1]}\big) \end{pmatrix},$
(4.4)
where $I_M$ and $I_T$ are identity matrices of dimensions $n_1 m \times n_1 m$ and $n_1 n_2 n_3 \times n_1 n_2 n_3$, respectively.
The updates for $X$ and $\mathcal{Y}^{(k)}$, $k = 1, 2, 3$, at iteration step $t$ are given as
$\big[Y^{(1)[t]}_{(1)}; X^{[t]}\big] = \mathrm{prox}_{\lambda/(\sqrt{n_1}\beta)}\Big(\Big[\tfrac{1}{\beta}\big(\mathcal{W}_T^{(1)[t-1]}\big)_{(1)}; \tfrac{1}{\beta}W_M^{[t-1]}\Big] + \big[T^{(1)[t]}_{(1)}; M^{[t]}\big]\Big)$
(4.5)
and
$Y^{(k)[t]}_{(k)} = \mathrm{prox}_{\lambda/\beta}\Big(\tfrac{1}{\beta}\big(\mathcal{W}_T^{(k)[t-1]}\big)_{(k)} + T^{(2)[t]}_{(k)}\Big), \quad k = 2, 3,$
(4.6)
where $\mathrm{prox}_\lambda(X) = U(S - \lambda)_+ V^\top$ for the singular value decomposition $X = USV^\top$.
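The proximal operator just defined is singular value soft-thresholding, which is simple to state in code. An illustrative sketch (`prox_trace` is our name):

```python
import numpy as np

def prox_trace(X, lam):
    """Proximal operator of lam * ||.||_tr: for X = U S V^T,
    return U (S - lam)_+ V^T (soft-threshold the singular values)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 4))
Y = prox_trace(X, 0.5)
```

Each singular value of the output is the corresponding singular value of the input shrunk by exactly the threshold (floored at zero), which makes the operator easy to verify numerically.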
The update rules for the dual variables are
$W_M^{[t]} = W_M^{[t-1]} + \beta\big(M^{[t]} - X^{[t]}\big), \quad \mathcal{W}_T^{(1)[t]} = \mathcal{W}_T^{(1)[t-1]} + \beta\big(\mathcal{T}^{(1)[t]} - \mathcal{Y}^{(1)[t]}\big), \quad \mathcal{W}_T^{(k)[t]} = \mathcal{W}_T^{(k)[t-1]} + \beta\big(\mathcal{T}^{(2)[t]} - \mathcal{Y}^{(k)[t]}\big),\ k = 2, 3.$
We can modify the above optimization procedure by replacing the variables in equation 4.1 in accordance with the norm used to regularize the tensor and by adjusting the operations in equations 4.2 and 4.4 to 4.6. For example, for the norm $\|\cdot\|^{1}_{(O,O,O)}$, there is only a single tensor $\mathcal{T}$, so the subproblem for equation 4.4 becomes
$\big(\Omega_T^\top \Omega_T + 3\beta I_T\big)\mathrm{vec}(\mathcal{T}^{[t]}) = \mathrm{vec}\Big(\Omega_T(\hat{\mathcal{T}}) - \sum_{k=1}^{3} \mathcal{W}_T^{(k)[t-1]} + \beta \sum_{k=1}^{3} \mathcal{Y}^{(k)[t-1]}\Big),$
that for equation 4.5 becomes
$\big[Y^{(1)[t]}_{(1)}; X^{[t]}\big] = \mathrm{prox}_{\lambda/\beta}\Big(\Big[\tfrac{1}{\beta}\big(\mathcal{W}_T^{(1)[t-1]}\big)_{(1)}; \tfrac{1}{\beta}W_M^{[t-1]}\Big] + \big[T^{[t]}_{(1)}; M^{[t]}\big]\Big),$
and that for equation 4.6 becomes
$Y^{(k)[t]}_{(k)} = \mathrm{prox}_{\lambda/\beta}\Big(\tfrac{1}{\beta}\big(\mathcal{W}_T^{(k)[t-1]}\big)_{(k)} + T^{[t]}_{(k)}\Big), \quad k = 2, 3.$
Additionally, the dual update rule with $\mathcal{T}$ becomes
$\mathcal{W}_T^{(k)[t]} = \mathcal{W}_T^{(k)[t-1]} + \beta\big(\mathcal{T}^{[t]} - \mathcal{Y}^{(k)[t]}\big), \quad k = 1, 2, 3.$
The optimization procedures for the other norms can be similarly derived.
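To make the procedure concrete, here is a compact, self-contained sketch of the ADMM loop for the $\|\cdot\|^{1}_{(O,O,O)}$-regularized completion model. It is an illustrative implementation under our reconstruction of the updates, not the authors' reference code; function and variable names, the fixed choice of $\beta$, and the iteration count are ours. Because $\Omega^\top\Omega$ is a diagonal 0-1 mask, the primal subproblems reduce to elementwise divisions, and each trace norm term contributes one singular value soft-thresholding step.

```python
import numpy as np

def unfold(T, k):
    return np.moveaxis(T, k, 0).reshape(T.shape[k], -1)

def fold(M, k, shape):
    rest = [s for j, s in enumerate(shape) if j != k]
    return np.moveaxis(M.reshape([shape[k]] + rest), 0, k)

def prox_trace(X, lam):
    # singular value soft-thresholding: U (S - lam)_+ V^T
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

def coupled_admm_ooo(T_hat, mask_T, M_hat, mask_M, lam=0.1, beta=1.0, iters=100):
    """ADMM sketch for min 0.5||mask_M*(M - M_hat)||^2
    + 0.5||mask_T*(T - T_hat)||^2 + lam * ||T, M||^1_(O,O,O)."""
    n, m = T_hat.shape, M_hat.shape[1]
    T, M, X = np.zeros(n), np.zeros(M_hat.shape), np.zeros(M_hat.shape)
    Y = [np.zeros(n) for _ in range(3)]      # auxiliary tensors Y^(k)
    W = [np.zeros(n) for _ in range(3)]      # dual tensors W_T^(k)
    W_M = np.zeros(M_hat.shape)              # dual matrix
    for _ in range(iters):
        # primal updates: the 0-1 mask makes the linear systems elementwise
        M = (mask_M * M_hat - W_M + beta * X) / (mask_M + beta)
        T = (mask_T * T_hat - sum(W) + beta * sum(Y)) / (mask_T + 3 * beta)
        # coupled proximal step on [T_(1); M]
        conc = np.hstack([unfold(W[0], 0) / beta + unfold(T, 0),
                          W_M / beta + M])
        P = prox_trace(conc, lam / beta)
        Y[0], X = fold(P[:, :-m], 0, n), P[:, -m:]
        # remaining overlapped modes
        for k in (1, 2):
            Y[k] = fold(prox_trace(unfold(W[k], k) / beta + unfold(T, k),
                                   lam / beta), k, n)
        # dual updates
        W_M += beta * (M - X)
        for k in range(3):
            W[k] += beta * (T - Y[k])
    return T, M
```

On a small synthetic coupled structure (low-rank tensor and matrix sharing mode-1 factors, 70% of entries observed), the loop steadily fits the observed entries while the trace norm terms keep the iterates low rank.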

## 5  Theoretical Analysis

In this section, we analyze the excess risk bounds of the completion model introduced in equation 3.1 for the coupled norms defined in section 3 using transductive Rademacher complexity (El-Yaniv & Pechyony, 2007; Shamir & Shalev-Shwartz, 2014). Let us again consider matrix $M$ and tensor $\mathcal{T}$ and use them as a single structure $\mathcal{X} = \mathcal{T} \cup M$ with a training sample index set $S_{\mathrm{Train}}$ and a testing sample index set $S_{\mathrm{Test}}$, with the total set of observed samples $S = S_{\mathrm{Train}} \cup S_{\mathrm{Test}}$. We rewrite equation 3.1 with our new notation as an equivalent model:
$\min_{\mathcal{W}} \frac{1}{|S_{\mathrm{Train}}|} \sum_{(i_1,i_2,i_3) \in S_{\mathrm{Train}}} l\big(\mathcal{X}_{i_1,i_2,i_3}, \mathcal{W}_{i_1,i_2,i_3}\big) \quad \mathrm{s.t.}\ \|\mathcal{W}\|_{\mathrm{cn}} \le B,$
(5.1)
where $l(a,b) = (a - b)^2$; $\mathcal{W} = \mathcal{W}_T \cup W_M$ is the learned coupled structure consisting of the tensor component $\mathcal{W}_T$ and the matrix component $W_M$; $B$ is a constant; and $\|\cdot\|_{\mathrm{cn}}$ is any norm defined in section 3.2.
Given that $l(\cdot,\cdot)$ is a $\Lambda$-Lipschitz loss function bounded as $\sup_{i_1,i_2,i_3} |l(\mathcal{X}_{i_1,i_2,i_3}, \mathcal{W}_{i_1,i_2,i_3})| \le b_l$ and assuming that $|S_{\mathrm{Train}}| = |S_{\mathrm{Test}}| = |S|/2$, we can obtain the following excess risk bound based on transductive Rademacher complexity theory (El-Yaniv & Pechyony, 2007; Shamir & Shalev-Shwartz, 2014) with probability $1 - \delta$:
$\frac{1}{|S_{\mathrm{Test}}|} \sum_{(i_1,i_2,i_3) \in S_{\mathrm{Test}}} l(\mathcal{X}_{i_1,i_2,i_3}, \mathcal{W}_{i_1,i_2,i_3}) - \frac{1}{|S_{\mathrm{Train}}|} \sum_{(i_1,i_2,i_3) \in S_{\mathrm{Train}}} l(\mathcal{X}_{i_1,i_2,i_3}, \mathcal{W}_{i_1,i_2,i_3}) \le 4R(\mathcal{W}) + b_l \sqrt{\frac{11 + 4\log\frac{1}{\delta}}{|S_{\mathrm{Train}}|}},$
(5.2)
where $R(\mathcal{W})$ is the transductive Rademacher complexity defined as
$R(\mathcal{W}) = \frac{1}{|S|}\,\mathbb{E}_{\sigma}\Big[\sup_{\|\mathcal{W}\|_{\mathrm{cn}} \le B} \sum_{(i_1,i_2,i_3) \in S} \sigma_{i_1,i_2,i_3}\, l(\mathcal{W}_{i_1,i_2,i_3}, \mathcal{X}_{i_1,i_2,i_3})\Big],$
(5.3)
where $\sigma_{i_1,i_2,i_3} \in \{-1, 1\}$, each with probability 0.5 if $(i_1,i_2,i_3) \in S$ and 0 otherwise (see appendix B for the derivation).

Next we give the bounds for equation 5.3 with respect to different coupled norms. We assume that $|S_{\mathrm{Train}}| = |S_{\mathrm{Test}}|$, as in Shamir and Shalev-Shwartz (2014), but our theorems can be extended to more general cases. Detailed proofs of the theorems in this section are given in appendix B.

The following two theorems give the Rademacher complexities for coupled completion regularized using the coupled norms $∥·∥(O,O,O)1$ and $∥·∥(S,S,S)1$.

Theorem 3.
Let $\|\cdot\|_{\mathrm{cn}} = \|\cdot\|^{1}_{(O,O,O)}$; then, with probability $1 - \delta$,
$R(\mathcal{W}) \le \frac{3\Lambda}{2|S|}\Big(\sqrt{r_{(1)}}\,(B_T + B_M) + \sum_{k=2}^{3} \sqrt{r_k}\,B_T\Big)\max\Big(C_2\big(\sqrt{n_1} + \sqrt{\textstyle\prod_{j=2}^{3} n_j + m}\big),\ \min_{k \in \{2,3\}} C_1\big(\sqrt{n_k} + \sqrt{\textstyle\prod_{j \neq k}^{3} n_j}\big)\Big),$
where $(r_1, r_2, r_3)$ is the multilinear rank of $\mathcal{W}$, $r_{(1)}$ is the rank of the coupled unfolding on mode 1, and $B_M$, $B_T$, $C_1$, and $C_2$ are constants.
Theorem 4.
Let $\|\cdot\|_{\mathrm{cn}} = \|\cdot\|^{1}_{(S,S,S)}$. Then, with probability $1 - \delta$,
$R(\mathcal{W}) \le \frac{3\Lambda}{2|S|}\Big(\sqrt{\tfrac{r_{(1)}}{n_1}}\,(B_M + B_T) + \min_{k \in \{2,3\}} \sqrt{\tfrac{r_k}{n_k}}\,B_T\Big)\max\Big(C_2\big(n_1 + \sqrt{\textstyle\prod_{i=1}^{3} n_i + n_1 m}\big),\ C_1\big(\max_{k=2,3} n_k + \sqrt{\textstyle\prod_{i=1}^{3} n_i}\big)\Big),$
where $(r_1, r_2, r_3)$ is the multilinear rank of $\mathcal{W}$, $r_{(1)}$ is the rank of the coupled unfolding on mode 1, and $B_M$, $B_T$, $C_1$, and $C_2$ are constants.

We can see that in both of these theorems, the Rademacher complexity of the coupled tensor is divided by the total number of observed samples of both the matrix and the tensor. If the tensor or the matrix is completed separately, the Rademacher complexity is divided only by the number of its own samples (see theorems 8 to 10 in appendix B and the discussion in Shamir & Shalev-Shwartz, 2014). This means that coupled tensor learning can lead to better performance than separate matrix or tensor learning. We can also see that due to coupling, the excess risks are bounded by the ranks of both the tensor and the concatenated matrix of the unfolded tensor on the coupled mode. Additionally, the maximum term on the right involves dimensions of both the tensor and the concatenated matrix of the unfolded tensor on the coupled mode.

Finally, we consider the Rademacher complexity of the mixed norm $\|\cdot\|_{\mathrm{cn}} = \|\cdot\|^{1}_{(S,O,O)}$:

Theorem 5.
Let $\|\cdot\|_{\mathrm{cn}} = \|\cdot\|^{1}_{(S,O,O)}$. Then, with probability $1 - \delta$,
$R(\mathcal{W}) \le \frac{3\Lambda}{2|S|}\Big(\sqrt{\tfrac{r_{(1)}}{n_1}}\,(B_M + B_T) + \sum_{i=2,3} \sqrt{r_i}\,B_T\Big)\max\Big(C_2\big(n_1 + \sqrt{\textstyle\prod_{i=1}^{3} n_i + n_1 m}\big),\ \min_{k=2,3} C_1\big(\sqrt{n_k} + \sqrt{\textstyle\prod_{i \neq k}^{3} n_i}\big)\Big),$
where $(r_1, r_2, r_3)$ is the multilinear rank of $\mathcal{W}$, $r_{(1)}$ is the rank of the coupled unfolding on mode 1, and $B_M$, $B_T$, $C_1$, and $C_2$ are constants.

We see that for the mixed norm $\|\cdot\|_{\mathrm{cn}} = \|\cdot\|^{1}_{(S,O,O)}$, the excess risk is bounded by the scaled rank of the coupled unfolding along the first mode. For this norm, the terms related to ranks are smaller than those in theorem 3, and the maximum term could be smaller than that in theorem 4. This means that this norm can perform better than $\|\cdot\|^{1}_{(O,O,O)}$ and $\|\cdot\|^{1}_{(S,S,S)}$, depending on the ranks and mode dimensions of the coupled tensor. The bounds of the other two mixed norms can be derived and explained in a manner similar to theorem 5.

## 6  Evaluation

We evaluated our proposed method experimentally using synthetic and real-world data.

### 6.1  Synthetic Data

Our main objective was to evaluate how the proposed norms perform depending on the ranks and dimensions of the coupled tensors. We used simulation data based on the CP rank and the Tucker rank in these experiments.

#### 6.1.1  Experiments Using CP Rank

To create coupled tensors with the CP rank, we first generated a three-mode tensor $\mathcal{T} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ with CP rank $r$ using CP decomposition (Kolda & Bader, 2009) as $\mathcal{T} = \sum_{i=1}^{r} c_i\, u_i \circ v_i \circ w_i$, where $u_i \in \mathbb{R}^{n_1}$, $v_i \in \mathbb{R}^{n_2}$, $w_i \in \mathbb{R}^{n_3}$, and $c_i \in \mathbb{R}_+$. For our experiments, we used two approaches to create CP-rank-based tensors, in which the component vectors $u_i$, $v_i$, and $w_i$ were either nonorthogonal or orthogonal. We coupled a matrix $X \in \mathbb{R}^{n_1 \times m}$ with rank $r$ to $\mathcal{T}$ on mode 1 by generating $X = USV^\top$ with $U = [u_1, \ldots, u_r]$, diagonal $S \in \mathbb{R}^{r \times r}$, and an orthogonal matrix $V \in \mathbb{R}^{m \times r}$. We also added noise sampled from a gaussian distribution with mean zero and variance 0.01 to the elements of the matrix and the tensor.
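The generation procedure above can be sketched as follows. This is an illustrative reconstruction (function and variable names are ours, and the nonorthogonal-factor variant is shown): a CP-rank-$r$ tensor and a rank-$r$ matrix that share all $r$ mode-1 component vectors.

```python
import numpy as np

def make_cp_coupled(n=(20, 20, 20), m=30, r=5, noise_var=0.01, seed=0):
    """Generate a CP-rank-r tensor T = sum_i c_i u_i ∘ v_i ∘ w_i and a
    rank-r matrix X = U S V0^T sharing the mode-1 factors u_1..u_r."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((n[0], r))            # shared factors u_1..u_r
    V = rng.standard_normal((n[1], r))
    W = rng.standard_normal((n[2], r))
    c = rng.random(r) + 0.5                       # positive weights c_i
    T = np.einsum('r,ir,jr,kr->ijk', c, U, V, W)
    S = np.diag(rng.random(r) + 0.5)              # matrix singular values
    V0, _ = np.linalg.qr(rng.standard_normal((m, r)))  # orthogonal V
    X = U @ S @ V0.T
    sd = np.sqrt(noise_var)                       # gaussian noise, variance noise_var
    T = T + sd * rng.standard_normal(T.shape)
    X = X + sd * rng.standard_normal(X.shape)
    return T, X
```

With the noise switched off, the matrix and the mode-1 unfolding of the tensor both have rank exactly $r$ and span the same $r$-dimensional column space, which is precisely the shared structure the coupled norms are designed to exploit.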

In our experiments using synthetic data, we considered coupled structures of tensors with dimension $20×20×20$ and matrices with dimension $20×30$ coupled on their first modes. To simulate completion, we randomly selected observed samples with percentages of 30, 50, and 70 of the total number of elements in both the matrix and the tensor; selected a validation set with a percentage of 10; and took the remainder as test samples. We performed coupled completion using the proposed coupled norms of $∥·∥(O,O,O)1$, $∥·∥(S,S,S)1$, $∥·∥(S,O,O)1$, $∥·∥(O,S,O)1$, and $∥·∥(O,O,S)1$. For all the learning models with these norms, we cross-validated their regularization parameters ranging from 0.01 to 5.0 with intervals of 0.05. We ran our experiments with 10 random selections and plotted the mean square error (MSE) for the test samples.

As benchmark methods, we used the overlapped trace norm (OTN) and the scaled latent trace norm (SLTN) for the individual tensor and the matrix trace norm (MTN) for the individual matrix. For all these norms, we cross-validated the regularization parameters over the range 0.01 to 5.0 with intervals of 0.05. We also compared our results with those of advanced coupled matrix-tensor factorization (ACMTF; Acar, Papalexakis et al., 2014), for which the regularization parameters were selected using cross-validation in the range $0, 0.0001, 0.001, \ldots, 1$. To select ranks for the ACMTF method, we first ran experiments using ranks of $1, 3, 5, \ldots, 19$ and selected the rank that gave the best performance. Due to the nonconvex nature of ACMTF, we ran experiments with five random initializations and selected the best local optimal solution.

We first ran experiments on coupled tensor completion based on CP rank in different settings. In the first experiment, we considered coupled tensors with no shared components. In this experiment, we created a tensor with CP rank 5 in which the component vectors were nonorthogonal and generated from a normal distribution. We also created a matrix of rank 5 and without any components in common with the tensor. Figure 2 shows that the coupled norms did not perform better than individual matrix completion using the matrix trace norm. However, for tensor completion, the coupled norm $∥·∥(O,O,O)1$ had performance comparable to that of the overlapped trace norm.

Figure 2:

Completion performance of a matrix with dimension $20×30$ and rank 5 with no sharing and of a tensor with dimension $20×20×20$ and CP rank 5 with nonorthogonal component vectors.


We next ran experiments on coupled tensors with components in common, using both orthogonal and nonorthogonal component vectors. We first created coupled tensors with a CP rank of 5 in which the tensor and the matrix shared all components along mode 1, generating the tensor with orthogonal component vectors. As shown in Figure 3, the coupled norm $\|\cdot\|^{1}_{(O,O,O)}$ had good performance for both the matrix and the tensor.

Figure 3:

Completion performance of a matrix with dimension $20×30$ and rank 5 with all components shared and of a tensor with dimension $20×20×20$ and CP rank 5 with orthogonal component vectors.


Figure 4 shows the performance of coupled tensors with the same rank as in the previous experiment with tensors created from nonorthogonal component vectors. Again, the coupled norm $∥·∥(O,O,O)1$ had better performance than individual matrix and tensor completions.

Figure 4:

Completion performance of a matrix with dimension $20×30$ and rank 5 with all component vectors shared and of a tensor with dimension $20×20×20$ and CP rank 5 and nonorthogonal component vectors.


In our final experiment, we created tensors with CP rank 10 and coupled them with a matrix of rank 5, sharing five component vectors along mode 1. Figures 5 and 6 show the results for tensors created with nonorthogonal and orthogonal component vectors, respectively. In both cases, the coupled norms $\|\cdot\|^{1}_{(O,O,O)}$, $\|\cdot\|^{1}_{(S,S,S)}$, and $\|\cdot\|^{1}_{(S,O,O)}$ had better matrix completion performance than individual completion with the matrix trace norm. As in the previous experiments, the overlapped trace norm and the coupled norm $\|\cdot\|^{1}_{(O,O,O)}$ had comparable performance.

Figure 5:

Completion performance of a matrix with dimension $20×30$ and rank 5 and of a tensor with dimension $20×20×20$ with CP rank 10 and nonorthogonal component vectors that shared five components.


Figure 6:

Completion performance of a matrix with dimension $20×30$ and rank 5 and of a tensor with dimension $20×20×20$ and CP rank 10 and orthogonal component vectors that shared five components.


#### 6.1.2  Simulations Using Tucker Rank

To create coupled tensors with the Tucker rank, we first generated a tensor $\mathcal{T} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ using Tucker decomposition (Kolda & Bader, 2009) as $\mathcal{T} = \mathcal{C} \times_1 U_1 \times_2 U_2 \times_3 U_3$, where the core tensor $\mathcal{C} \in \mathbb{R}^{r_1 \times r_2 \times r_3}$, generated from a normal distribution, specifies the multilinear rank $(r_1, r_2, r_3)$, and the component matrices $U_1 \in \mathbb{R}^{n_1 \times r_1}$, $U_2 \in \mathbb{R}^{n_2 \times r_2}$, and $U_3 \in \mathbb{R}^{n_3 \times r_3}$ were orthogonal matrices. Next, we generated a matrix coupled with mode 1 of the tensor using the singular value decomposition $X = USV^\top$, where we specified its rank $r$ through the diagonal matrix $S$ and generated $U$ and $V$ as orthogonal matrices. For sharing between the matrix and the tensor, we computed $T_{(1)} = U_n S_n V_n^\top$, replaced the first $s$ singular values of $S$ with the first $s$ singular values of $S_n$, replaced the first $s$ basis vectors of $U$ with the first $s$ basis vectors of $U_n$, and recomputed $X = USV^\top$ so that the coupled structure shared $s$ common components. We also added noise sampled from a gaussian distribution with mean zero and variance 0.01 to the elements of the coupled tensor.
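The Tucker-based tensor generation can be sketched as follows. This is an illustrative reconstruction of the first step only (the matrix-coupling and singular-vector-sharing steps are omitted, and the names are ours):

```python
import numpy as np

def make_tucker_tensor(n=(20, 20, 20), ranks=(5, 5, 5), seed=0):
    """T = C x1 U1 x2 U2 x3 U3 with a normal core C of size r1 x r2 x r3
    and orthogonal factors U_k in R^{n_k x r_k}, so that T has multilinear
    rank (r1, r2, r3)."""
    rng = np.random.default_rng(seed)
    C = rng.standard_normal(ranks)                       # core tensor
    Us = [np.linalg.qr(rng.standard_normal((n[k], ranks[k])))[0]
          for k in range(3)]                             # orthogonal factors
    T = np.einsum('abc,ia,jb,kc->ijk', C, Us[0], Us[1], Us[2])
    return T, Us
```

For a generic random core, the rank of each mode-$k$ unfolding equals $r_k$, so the construction realizes the specified multilinear rank exactly before noise is added.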

As in the synthetic experiments using the CP rank, we considered coupled structures with tensors with dimension $20×20×20$ and matrices with dimension $20×30$ coupled on their mode 1. We considered different multilinear ranks of tensors, ranks of matrices, and degrees of sharing among them. We used the same percentages in selecting the training, testing, and validation sets as we did in the CP rank experiments. We again compared our results with those of ACMTF.

We also used an additional nonconvex coupled learning model to incorporate multilinear ranks of the coupled tensor by considering Tucker decomposition under the assumption that the components of the coupled mode were shared between both the matrix and tensor. We used the Tensorlab framework (Vervliet, Debals, Sorber, Van Barel, & De Lathauwer, 2016) to implement this model. We regularized the factorized components of the tensor (including the core tensor) and the matrix using the Frobenius norm. We used a regularization parameter selected from the range 0.01 to 50 in logarithmic linear scale with five divisions (in Matlab syntax exp(linspace(log(0.01), log(50), 5))). We refer to this benchmark method as NC-Tucker. Due to the nonconvex nature of the model, we ran 5 to 10 simulations with different random initializations and selected the best local optimal solution. Specifying the multilinear rank a priori for this model would be challenging in real applications, but since we knew the rank in our simulations, we could specify the multilinear ranks to be used to create the tensors.
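For reference, the same regularization grid can be generated in numpy (a direct equivalent of the Matlab call above):

```python
import numpy as np

# Five regularization parameters from 0.01 to 50, linearly spaced in log
# space: the numpy equivalent of exp(linspace(log(0.01), log(50), 5)).
grid = np.exp(np.linspace(np.log(0.01), np.log(50), 5))
# Consecutive values differ by the constant factor (50 / 0.01) ** (1 / 4).
```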

In our first simulations, we considered a coupled tensor with a matrix rank of 5 and a tensor multilinear rank $(5,5,5)$ with no shared components. Figure 7 shows that with this setting, individual matrix and tensor completion had better performance than that of the coupled norms. The nonconvex NC-Tucker benchmark method had the best performance for the tensor but performed poorly in matrix completion compared to the coupled norms.

Figure 7:

Completion performance of a matrix with dimension $20×30$ and rank 5 and of a tensor with dimension $20×20×20$ and multilinear rank $(5,5,5)$ with no sharing.


In our next simulation, we considered coupling of tensors and matrices with some degree of sharing among them. We created a matrix of rank 5 and a tensor of multilinear rank $(5,5,5)$ and let them share all five singular components along mode 1. Figure 8 shows that the coupled norm $\|\cdot\|^{1}_{(O,O,O)}$ had the best performance among the coupled norms for both matrix and tensor completion. Individual tensor completion with the overlapped trace norm had the same performance as $\|\cdot\|^{1}_{(O,O,O)}$. The NC-Tucker method performed better than the coupled norms for both tensor and matrix completion.

Figure 8:

Completion performance of a matrix with dimension $20×30$ and rank 5 and of a tensor with dimension $20×20×20$ and multilinear rank $(5,5,5)$ that shared five components.


In our next simulation, we considered a matrix of rank 5 and a tensor of multilinear rank $(5,15,5)$ that shared all five singular components along mode 1. Figure 9 shows that with this setting, although the coupled norm $\|\cdot\|^{1}_{(O,O,S)}$ had the best performance among the coupled norms and individual tensor completion, it was outperformed by the NC-Tucker method. However, the NC-Tucker method performed poorly in matrix completion compared to the coupled norms. For matrix completion, individual completion with the matrix trace norm had the best performance, while the coupled norms $\|\cdot\|^{1}_{(O,O,S)}$ and $\|\cdot\|^{1}_{(S,O,O)}$ had the next best performance.

Figure 9:

Completion performance of a matrix with dimension $20×30$ and rank 5 and of a tensor with dimension $20×20×20$ and multilinear rank $(5,15,5)$ that shared five components.


For our final simulation, we created a coupled matrix of rank 5 and a tensor of multilinear rank $(15,5,5)$, sharing five singular components along mode 1. Figure 10 shows that the mixed coupled norms $\|\cdot\|^{1}_{(O,S,O)}$ and $\|\cdot\|^{1}_{(O,O,S)}$ performed equally well and gave better tensor completion performance than individual tensor completion. The NC-Tucker method had better performance than the coupled norms for tensor completion, while its performance was comparable for matrix completion. For matrix completion with a small percentage of training samples, the coupled norms $\|\cdot\|^{1}_{(O,O,O)}$ and $\|\cdot\|^{1}_{(S,O,O)}$ had better performance. As the percentage of training samples increased, the performance of individual matrix completion improved, while $\|\cdot\|^{1}_{(O,S,O)}$ and $\|\cdot\|^{1}_{(O,O,S)}$ remained close behind.

Figure 10:

Completion performance of a matrix with dimension $20×30$ and rank 5 and of a tensor with dimension $20×20×20$ and multilinear rank $(15,5,5)$ that shared five components.


Across these simulations, ACMTF performed poorly compared with our proposed methods.

### 6.2  Real-World Data

As a real-world data experiment, we applied our proposed method to the UCLAF data set (Zheng, Cao, Zheng, Xie, & Yang, 2010), which consists of GPS data for 164 users in 168 locations performing five activities, resulting in a sparse user-location-activity tensor $\mathcal{T}\in\mathbb{R}^{164\times168\times5}$. This data set also has a user-location matrix $X\in\mathbb{R}^{164\times168}$, which we used as side information coupled to the user mode of $\mathcal{T}$. Using observed-element percentages similar to those in the synthetic data simulations, we performed completion experiments on $\mathcal{T}$. We treated all elements of the user-location matrix as observed and used them as training data. We repeated the evaluation for 10 random sample selections. We selected the regularization parameters by cross-validation over 50 values from 0.01 to 500 on a logarithmic linear scale. As a baseline method, we again used the ACMTF method (Acar, Papalexakis et al., 2014) with CP rank 5. Additionally, we used the coupled (Tucker) method (Ermis et al., 2015) and the NC-Tucker method with multilinear rank $(3,3,3)$, where we selected the best performance among 5 random initializations. Figure 11 shows the completion performances for the coupled tensor.
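The sampling protocol above can be sketched as follows (our own illustration, not the authors' implementation; the tensor is enumerated densely here only to build the index sets):

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (164, 168, 5)   # user x location x activity tensor of UCLAF

# Enumerate all tensor indices and split them into training and test
# sets at a chosen observation percentage.
idx = np.array(list(np.ndindex(*shape)))
rng.shuffle(idx)
n_train = int(0.7 * len(idx))            # e.g., 70% observed
train_idx, test_idx = idx[:n_train], idx[n_train:]

# Fifty cross-validation values for the regularization parameter,
# spaced log-linearly between 0.01 and 500.
reg_grid = np.exp(np.linspace(np.log(0.01), np.log(500), 50))
```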

Figure 11:

Completion performance for UCLAF data.


We can see that the best performance among the coupled norms was that of the mixed coupled norm $\|\cdot\|^{1}_{(S,O,O)}$, indicating that learning with side information as a coupled structure improves tensor completion performance compared to completion using tensor norms alone. This also indicates that mode 1 may have a lower rank than the other modes and that modes 2 and 3 may have ranks close to each other. The nonconvex coupled (Tucker) method and the NC-Tucker method had better performance than $\|\cdot\|^{1}_{(S,O,O)}$ when the number of observed samples was less than 70 percent of the total elements.

## 7  Conclusion and Future Work

We have proposed a new set of convex norms for the completion problem of coupled tensors. We restricted our study to coupling a three-way tensor with a matrix and defined low-rank inducing norms by extending trace norms such as the overlapped trace norm and scaled latent trace norm of tensors and the matrix trace norm. We also introduced the concept of mixed norms, which combine the features of both the overlapped and latent trace norms. We analyzed the theoretical properties of our convex completion model and evaluated it using synthetic and real-world data. We found that the proposed coupled norms perform comparably to existing nonconvex methods. However, our norms lead to globally optimal solutions and eliminate the need to specify the ranks of the coupled tensors beforehand. While many aspects remain to be studied, we believe that our work is a first step in modeling convex norms for coupled tensors.

Although coupling can occur among many tensors with different dimensions and multiple matrices on different modes, this study focused on a three-mode tensor and a single matrix. The methodology used to create coupled norms can be extended to any of those settings, but mere extensions may not lead to an optimal design of norms for them. In particular, the square tensor norm (Mu, Huang, Wright, & Goldfarb, 2014) has been shown to be better suited to tensors with more than three modes and could thus be used to model novel coupled norms in the future. Furthermore, theoretical analysis using methods such as the gaussian width (Amelunxen, Lotz, McCoy, & Tropp, 2014) may provide a deeper understanding of coupled tensors, which should enable the design of better norms. Such studies could be interesting directions for future research.

## Appendix A:  Proofs of Dual Norms

We first provide the proofs of the dual norms in theorems 1 and 2.

Proof of Theorem 1.
We use lemma 3 of Tomioka and Suzuki (2013) to prove the duality. Consider a linear operator $\Phi$ such that $\Phi(\mathcal{T},M)=[\mathrm{vec}(M);\mathrm{vec}(T_{(1)});\mathrm{vec}(T_{(2)});\mathrm{vec}(T_{(3)})]\in\mathbb{R}^{d_1+3d_2}$, where $d_1=n_1m$ and $d_2=n_1n_2n_3$. We define
$\|z\|_{*}=\Big(\big\|[Z^{(1)}_{(1)};X]\big\|_{S_p}^{q}+\sum_{k=2}^{3}\big\|Z^{(k)}_{(k)}\big\|_{S_p}^{q}\Big)^{1/q},$
(A.1)
where $Z^{(k)}$ is the inverse vectorization of the elements $z_{(d_1+(k-1)d_2+1):(d_1+kd_2)}$ and $X$ is the inverse vectorization of $z_{1:d_1}$. The dual of the above norm is expressed as
$\|z\|_{**}=\Big(\big\|[Z^{(1)}_{(1)};X]\big\|_{S_{p^*}}^{q^*}+\sum_{k=2}^{3}\big\|Z^{(k)}_{(k)}\big\|_{S_{p^*}}^{q^*}\Big)^{1/q^*}.$
Let
$\Phi^{\top}(z)=\{\mathcal{T},M\}=\Big\{\sum_{k=1}^{3}Z^{(k)},\,X\Big\}.$
Then, following lemma 3 of Tomioka and Suzuki (2013), we write
$\overline{|||[\mathcal{T},M]|||}_{*}(\Phi)=\inf_{z}\|z\|_{*}\quad\mathrm{s.t.}\quad\Phi^{\top}(z)=\{\mathcal{T},M\}.$
Given that
$\underline{|||[\mathcal{T},M]|||}(\Phi):=\|[\mathcal{T},M]\|^{1}_{\underline{(O,O,O),S_p/q}},$
and following lemma 3 in Tomioka and Suzuki (2013), we obtain the dual of $\|[\mathcal{T},M]\|^{1}_{\underline{(O,O,O),S_p/q}}$ as $\|[\mathcal{T},M]\|^{1}_{\overline{(L,L,L),S_{p^*}/q^*}}$.

$□$

Proof of Theorem 2.
We can apply theorem 1 to the latent tensors $\mathcal{T}^{(1)}$ and $\mathcal{T}^{(2)}$, as well as the dual of the overlapped norm to $\mathcal{T}$. First, consider the dual with respect to $\mathcal{T}^{(1)}$ and $\mathcal{T}^{(2)}$. By applying theorem 1, we obtain
$\|\mathcal{T},M\|^{1}_{\overline{(L,O,O),S_{p^*}/q^*}}=\Big(\Big(\sum_{i}^{r_1}\sigma_i\big([T_{(1)};M]\big)^{p^*}\Big)^{q^*/p^*}+\Big(\|\mathcal{T}\|^{1}_{(-,O,O),S_{p^*}}\Big)^{q^*}\Big)^{1/q^*}.$
Next, by applying lemma 1 of Tomioka and Suzuki (2013) to $\|\mathcal{T}\|_{(-,O,O)}$, we obtain
$\|\mathcal{T},M\|^{1}_{\overline{(L,O,O),S_{p^*}/q^*}}=\Big(\Big(\sum_{i}^{r_1}\sigma_i\big([T_{(1)};M]\big)^{p^*}\Big)^{q^*/p^*}+\inf_{\hat{\mathcal{T}}^{(1)}+\hat{\mathcal{T}}^{(2)}=\mathcal{T}}\Big(\Big(\sum_{j}^{r_2}\sigma_j\big(\hat{T}^{(1)}_{(2)}\big)^{p^*}\Big)^{q^*/p^*}+\Big(\sum_{k}^{r_3}\sigma_k\big(\hat{T}^{(2)}_{(3)}\big)^{p^*}\Big)^{q^*/p^*}\Big)\Big)^{1/q^*},$
where $\sigma_i(\cdot)$ denotes the $i$th singular value.

$□$

## Appendix B:  Proofs of Excess Risk Bounds

Here we derive the excess risk bounds for the coupled completion problem.

From previous work (El-Yaniv & Pechyony, 2007; Shamir & Shalev-Shwartz, 2014), we know that for a loss function $l(\cdot,\cdot)$ that is $\Lambda$-Lipschitz and bounded as $\sup_{i_1,i_2,i_3}|l(X_{i_1,i_2,i_3},W_{i_1,i_2,i_3})|\le b_l$, and under the assumption that $|S_{\mathrm{Train}}|=|S_{\mathrm{Test}}|=|S|/2$, the following bound holds for equation 5.1, based on transductive Rademacher complexity theory (El-Yaniv & Pechyony, 2007; Shamir & Shalev-Shwartz, 2014), with probability $1-\delta$:
$\frac{1}{|S_{\mathrm{Test}}|}\sum_{(i_1,i_2,i_3)\in S_{\mathrm{Test}}}l(X_{i_1,i_2,i_3},W_{i_1,i_2,i_3})-\frac{1}{|S_{\mathrm{Train}}|}\sum_{(i_1,i_2,i_3)\in S_{\mathrm{Train}}}l(X_{i_1,i_2,i_3},W_{i_1,i_2,i_3})\le 4R(\mathcal{W})+b_l\,\frac{11+4\sqrt{\log\frac{1}{\delta}}}{\sqrt{|S_{\mathrm{Train}}|}},$
where $R(\mathcal{W})$ is the transductive Rademacher complexity, defined as
$R(\mathcal{W})=\frac{1}{|S|}\mathbb{E}_{\sigma}\Big[\sup_{\|W\|_{cn}\le B}\sum_{(i_1,i_2,i_3)\in S}\sigma_{i_1,i_2,i_3}\,l(W_{i_1,i_2,i_3},X_{i_1,i_2,i_3})\Big],$
(B.1)
where $\sigma_{i_1,i_2,i_3}\in\{-1,1\}$, each value with probability 0.5, if $(i_1,i_2,i_3)\in S$, and $\sigma_{i_1,i_2,i_3}=0$ otherwise.
We can rewrite equation B.1 as
$R(\mathcal{W})=\frac{1}{|S|}\mathbb{E}_{\sigma}\sup_{\|W\|_{cn}\le B_M+B_T}\sum_{(i_1,i_2,i_3)\in S}\sigma_{i_1,i_2,i_3}l(W_{i_1,i_2,i_3},X_{i_1,i_2,i_3})\le\frac{\Lambda}{|S|}\mathbb{E}_{\sigma}\sup_{\|W\|_{cn}\le B_M+B_T}\sum_{(i_1,i_2,i_3)\in S}\sigma_{i_1,i_2,i_3}W_{i_1,i_2,i_3}\ \text{(Rademacher contraction)}\le\frac{\Lambda}{|S|}\mathbb{E}_{\sigma}\sup_{\|W\|_{cn}\le B_M+B_T}\|W\|_{cn}\|\Sigma\|^{*}_{cn}\ \text{(Hölder's inequality)},$
where we have used $\|\mathcal{W}\|_F\le B_T$ and $\|W_M\|_F\le B_M$, and $\Sigma$ has the same dimensions as the coupled tensor and consists of Rademacher variables ($\Sigma_{i_1,i_2,i_3}=\sigma_{i_1,i_2,i_3}$ if $(i_1,i_2,i_3)\in S$, and $\Sigma_{i_1,i_2,i_3}=0$ otherwise).
Proof of Theorem 3.
Let $W=\mathcal{W}\cup W_M$, where $\mathcal{W}$ and $W_M$ are the completions of $\mathcal{T}$ and $M$, and let $\Sigma=\Sigma_T\cup\Sigma_M$, where $\Sigma_T$ and $\Sigma_M$ consist of the corresponding Rademacher variables ($\sigma_{i_1,i_2,i_3}$) for $\mathcal{T}$ and $M$. Since we use an overlapped norm, we have $\|W\|_{cn}=\|\mathcal{W},W_M\|^{1}_{(O,O,O)}$, from which we obtain
$\|\mathcal{W},W_M\|^{1}_{(O,O,O)}=\big\|[W_{(1)};W_M]\big\|_{\mathrm{tr}}+\sum_{k=2}^{3}\|W_{(k)}\|_{\mathrm{tr}}\le\sqrt{r_{(1)}}(B_T+B_M)+\sum_{k=2}^{3}\sqrt{r_k}B_T,$
where $(r_1,r_2,r_3)$ is the multilinear rank of $\mathcal{W}$ and $r_{(1)}$ is the rank of the matrix obtained by concatenating the mode-1 unfolding of the tensor with the matrix. To obtain the above inequality, we used the fact that for any matrix $U$ with rank $r$, $\|U\|_{\mathrm{tr}}\le\sqrt{r}\|U\|_F$ (Tomioka & Suzuki, 2013).
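The inequality $\|U\|_{\mathrm{tr}}\le\sqrt{r}\|U\|_F$ is the Cauchy-Schwarz inequality applied to the vector of the $r$ nonzero singular values; a quick numerical check (ours, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
r = 5

# A generic rank-5 matrix: the product of two gaussian factors.
U = rng.standard_normal((20, r)) @ rng.standard_normal((r, 30))

tr = np.linalg.norm(U, 'nuc')   # trace (nuclear) norm: l1 of singular values
fro = np.linalg.norm(U, 'fro')  # Frobenius norm: l2 of singular values
assert tr <= np.sqrt(r) * fro + 1e-9
```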
Using Latała's theorem (Latała, 2005; Shamir & Shalev-Shwartz, 2014) for the mode-$k$ unfolding, we can bound $\mathbb{E}\|\Sigma_{T(k)}\|_{\mathrm{op}}$ as
$\mathbb{E}\|\Sigma_{T(k)}\|_{\mathrm{op}}\le C_1\Big(\sqrt{n_k}+\sqrt{\prod_{j\ne k}^{3}n_j}+\sqrt[4]{|\Sigma_{T(k)}|}\Big),$
and since $\sqrt[4]{|\Sigma_{T(k)}|}\le\sqrt[4]{\prod_{i=1}^{3}n_i}\le\frac{1}{2}\big(\sqrt{n_k}+\sqrt{\prod_{j\ne k}^{3}n_j}\big)$, where the last step is the AM-GM inequality applied to $\sqrt{n_k}$ and $\sqrt{\prod_{j\ne k}^{3}n_j}$, we have
$\mathbb{E}\|\Sigma_{T(k)}\|_{\mathrm{op}}\le\frac{3C_1}{2}\Big(\sqrt{n_k}+\sqrt{\prod_{j\ne k}^{3}n_j}\Big).$
Similarly, using Latała's theorem, we obtain
$\mathbb{E}\big\|[\Sigma_{T(1)};\Sigma_M]\big\|_{\mathrm{op}}\le\frac{3C_2}{2}\Big(\sqrt{n_1}+\sqrt{\prod_{j=2}^{3}n_j+m}\Big).$
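As a sanity check on the scale of these operator norms (a Monte Carlo illustration of ours, not part of the proof), a fully observed Rademacher coupling $[\Sigma_{T(1)};\Sigma_M]$ with the dimensions used in the experiments has an operator norm on the order of $\sqrt{n_1}+\sqrt{n_2n_3+m}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, n3, m = 20, 20, 20, 30

# Fully observed case: every entry is an independent Rademacher sign.
Sigma_T = rng.choice([-1.0, 1.0], size=(n1, n2, n3))
Sigma_M = rng.choice([-1.0, 1.0], size=(n1, m))

# Coupled mode-1 unfolding [Sigma_T(1); Sigma_M] and its operator norm.
coupled = np.hstack([Sigma_T.reshape(n1, -1), Sigma_M])
op = np.linalg.norm(coupled, 2)

scale = np.sqrt(n1) + np.sqrt(n2 * n3 + m)  # the Latala-type scale
assert op < 2 * scale                       # same order as the bound
```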
To bound $\mathbb{E}\|\Sigma_T,\Sigma_M\|^{1*}_{(O,O,O)}$, we use the duality relationship from theorem 1 and corollary 2:
$\|\Sigma_T,\Sigma_M\|^{1*}_{(O,O,O)}=\inf_{\Sigma_T^{(1)}+\Sigma_T^{(2)}+\Sigma_T^{(3)}=\Sigma_T}\max\Big(\big\|[\Sigma^{(1)}_{T(1)};\Sigma_M]\big\|_{\mathrm{op}},\big\|\Sigma^{(2)}_{T(2)}\big\|_{\mathrm{op}},\big\|\Sigma^{(3)}_{T(3)}\big\|_{\mathrm{op}}\Big).$
Since we can take any $\Sigma_T^{(k)}$ to be equal to $\Sigma_T$, the above norm can be upper-bounded as
$\|\Sigma_T,\Sigma_M\|^{1*}_{(O,O,O)}\le\max\Big(\big\|[\Sigma_{T(1)};\Sigma_M]\big\|_{\mathrm{op}},\min\big(\|\Sigma_{T(2)}\|_{\mathrm{op}},\|\Sigma_{T(3)}\|_{\mathrm{op}}\big)\Big),$
so that
$\mathbb{E}\|\Sigma_T,\Sigma_M\|^{1*}_{(O,O,O)}\le\mathbb{E}\max\Big(\big\|[\Sigma_{T(1)};\Sigma_M]\big\|_{\mathrm{op}},\min\big(\|\Sigma_{T(2)}\|_{\mathrm{op}},\|\Sigma_{T(3)}\|_{\mathrm{op}}\big)\Big)\le\max\Big(\mathbb{E}\big\|[\Sigma_{T(1)};\Sigma_M]\big\|_{\mathrm{op}},\min\big(\mathbb{E}\|\Sigma_{T(2)}\|_{\mathrm{op}},\mathbb{E}\|\Sigma_{T(3)}\|_{\mathrm{op}}\big)\Big).$
Finally, we have
$R(\mathcal{W})\le\frac{3\Lambda}{2|S|}\Big(\sqrt{r_{(1)}}(B_T+B_M)+\sum_{k=2}^{3}\sqrt{r_k}B_T\Big)\max\Big(C_2\big(\sqrt{n_1}+\sqrt{\prod_{j=2}^{3}n_j+m}\big),\min_{k\in\{2,3\}}C_1\big(\sqrt{n_k}+\sqrt{\prod_{j\ne k}^{3}n_j}\big)\Big).$

$□$

Before we give the excess risk bound for $\|\cdot\|^{1}_{(S,S,S)}$, we give, in the following theorem, the excess risk of coupled completion with $\|\cdot\|^{1}_{(L,L,L)}$.

Theorem 6.
Let $\|\cdot\|_{cn}=\|\cdot\|^{1}_{(L,L,L)}$. Then, with probability $1-\delta$,
$R(\mathcal{W})\le\frac{3\Lambda}{2|S|}\Big(\sqrt{r_{(1)}}B_M+\sqrt{\min\big(r_{(1)},\min_{k=2,3}r_k\big)}B_T\Big)\max\Big(C_2\big(\sqrt{n_1}+\sqrt{\prod_{j=2}^{3}n_j+m}\big),\max_{k=2,3}C_1\big(\sqrt{n_k}+\sqrt{\prod_{j\ne k}^{3}n_j}\big)\Big),$
where $(r_1,r_2,r_3)$ is the multilinear rank of $\mathcal{W}$, $r_{(1)}$ is the rank of the coupled unfolding on mode 1, and $B_M$, $B_T$, $C_1$, and $C_2$ are constants.
Proof.
Let $W=\mathcal{W}\cup W_M$, where $\mathcal{W}$ and $W_M$ are the completions of $\mathcal{T}$ and $M$, and let $\Sigma=\Sigma_T\cup\Sigma_M$, where $\Sigma_T$ and $\Sigma_M$ consist of the corresponding Rademacher variables. We can see that
$\|\mathcal{W}\|^{1}_{(L,L,L)}=\inf_{W^{(1)}+W^{(2)}+W^{(3)}=\mathcal{W}}\big\|[W^{(1)}_{(1)};W_M]\big\|_{\mathrm{tr}}+\sum_{k=2}^{3}\big\|W^{(k)}_{(k)}\big\|_{\mathrm{tr}},$
which can be bounded as
$\|\mathcal{W}\|^{1}_{(L,L,L)}\le\sqrt{r_{(1)}}B_M+\sqrt{\min\big(r_{(1)},\min_{k=2,3}r_k\big)}B_T,$
where the last term is derived by taking the infimum over placing the tensor in the mode-1 component or in one of the components $W^{(2)}$ and $W^{(3)}$.
Using the duality result given in theorem 1 (corollary 2) and Latała's theorem, we obtain
$\mathbb{E}\|\Sigma_T,\Sigma_M\|^{1*}_{(L,L,L)}\le\max\Big(\mathbb{E}\big\|[\Sigma_{T(1)};\Sigma_M]\big\|_{\mathrm{op}},\mathbb{E}\|\Sigma_{T(2)}\|_{\mathrm{op}},\mathbb{E}\|\Sigma_{T(3)}\|_{\mathrm{op}}\Big)\le\frac{3}{2}\max\Big(C_2\big(\sqrt{n_1}+\sqrt{\prod_{j=2}^{3}n_j+m}\big),\max_{k=2,3}C_1\big(\sqrt{n_k}+\sqrt{\prod_{j\ne k}^{3}n_j}\big)\Big).$
Finally, we have
$R(\mathcal{W})\le\frac{3\Lambda}{2|S|}\Big(\sqrt{r_{(1)}}B_M+\sqrt{\min\big(r_{(1)},\min_{k=2,3}r_k\big)}B_T\Big)\max\Big(C_2\big(\sqrt{n_1}+\sqrt{\prod_{j=2}^{3}n_j+m}\big),\max_{k=2,3}C_1\big(\sqrt{n_k}+\sqrt{\prod_{j\ne k}^{3}n_j}\big)\Big).$

$□$

Proof of Theorem 4.
By definition, we have
$\|\mathcal{W}\|^{1}_{(S,S,S)}=\inf_{W^{(1)}+W^{(2)}+W^{(3)}=\mathcal{W}}\frac{1}{\sqrt{n_1}}\big\|[W^{(1)}_{(1)};W_M]\big\|_{\mathrm{tr}}+\sum_{k=2,3}\frac{1}{\sqrt{n_k}}\big\|W^{(k)}_{(k)}\big\|_{\mathrm{tr}},$
which results in
$\|\mathcal{W}\|^{1}_{(S,S,S)}\le\sqrt{\frac{r_{(1)}}{n_1}}(B_M+B_T)+\min_{k\in\{2,3\}}\sqrt{\frac{r_k}{n_k}}B_T.$
Using the duality result given in theorem 1 and Latała's theorem, we obtain
$\mathbb{E}\|\Sigma_T,\Sigma_M\|^{1*}_{(S,S,S)}=\mathbb{E}\max\Big(\sqrt{n_1}\big\|[\Sigma_{T(1)};\Sigma_M]\big\|_{\mathrm{op}},\sqrt{n_2}\|\Sigma_{T(2)}\|_{\mathrm{op}},\sqrt{n_3}\|\Sigma_{T(3)}\|_{\mathrm{op}}\Big)\le\frac{3}{2}\max\Big(C_2\big(n_1+\sqrt{\prod_{i=1}^{3}n_i+n_1m}\big),C_1\max_{k=2,3}\big(n_k+\sqrt{\prod_{i=1}^{3}n_i}\big)\Big).$
Finally, we have
$R(\mathcal{W})\le\frac{3\Lambda}{2|S|}\Big(\sqrt{\frac{r_{(1)}}{n_1}}(B_M+B_T)+\min_{k\in\{2,3\}}\sqrt{\frac{r_k}{n_k}}B_T\Big)\max\Big(C_2\big(n_1+\sqrt{\prod_{i=1}^{3}n_i+n_1m}\big),C_1\max_{k=2,3}\big(n_k+\sqrt{\prod_{i=1}^{3}n_i}\big)\Big).$

$□$

Proof of Theorem 5.
First, let us look at $\|\mathcal{W}\|^{1}_{(S,O,O)}$, which is expressed as
$\|\mathcal{W}\|^{1}_{(S,O,O)}=\inf_{W^{(1)}+W^{(2)}=\mathcal{W}}\frac{1}{\sqrt{n_1}}\big\|[W^{(1)}_{(1)};W_M]\big\|_{\mathrm{tr}}+\big\|W^{(2)}_{(2)}\big\|_{\mathrm{tr}}+\big\|W^{(2)}_{(3)}\big\|_{\mathrm{tr}}.$
This norm can be upper-bounded as
$\|\mathcal{W}\|^{1}_{(S,O,O)}\le\sqrt{\frac{r_{(1)}}{n_1}}(B_M+B_T)+\sum_{i=2,3}\sqrt{r_i}B_T.$
Now we are left with bounding $\mathbb{E}\|\Sigma_T,\Sigma_M\|^{1*}_{(S,O,O)}$. Using the same argument as in the proof of theorem 3, we obtain
$\|\Sigma_T,\Sigma_M\|^{1*}_{(S,O,O)}\le\max\Big(\sqrt{n_1}\big\|[\Sigma_{T(1)};\Sigma_M]\big\|_{\mathrm{op}},\min\big(\|\Sigma_{T(2)}\|_{\mathrm{op}},\|\Sigma_{T(3)}\|_{\mathrm{op}}\big)\Big).$
We then have
$\mathbb{E}\|\Sigma_T,\Sigma_M\|^{1*}_{(S,O,O)}\le\frac{3}{2}\max\Big(C_2\big(n_1+\sqrt{\prod_{i=1}^{3}n_i+n_1m}\big),\min_{k=2,3}C_1\big(\sqrt{n_k}+\sqrt{\prod_{i\ne k}^{3}n_i}\big)\Big).$
The final resulting bound is
$R(\mathcal{W})\le\frac{3\Lambda}{2|S|}\Big(\sqrt{\frac{r_{(1)}}{n_1}}(B_M+B_T)+\sum_{i=2,3}\sqrt{r_i}B_T\Big)\max\Big(C_2\big(n_1+\sqrt{\prod_{i=1}^{3}n_i+n_1m}\big),\min_{k=2,3}C_1\big(\sqrt{n_k}+\sqrt{\prod_{i\ne k}^{3}n_i}\big)\Big).$

$□$

In addition to the above transductive bounds for completion with coupled norms, we also provide bounds for individual matrix and tensor completion with norms such as the overlapped trace norm, the latent trace norm, and the scaled latent trace norm. We can consider equation 5.1 for a matrix or a tensor alone, without coupling and with low-rank regularization. We then have the transductive bound for a matrix $M\in\mathbb{R}^{n_1\times m}$ (Shamir & Shalev-Shwartz, 2014) as
$R(W_M)\le\frac{cB_M\Lambda}{|S_M|}\sqrt{\hat{r}}\big(\sqrt{n_1}+\sqrt{m}\big),$
(B.2)
where $S_M$ is the index set of observed samples of matrix $M$, $\hat{r}$ is the rank induced by matrix trace norm regularization, and $c$ is a constant.

Next, we consider the transductive bounds for a tensor $\mathcal{T}\in\mathbb{R}^{n_1\times n_2\times n_3}$ with regularization using norms such as the overlapped trace norm (Tomioka & Suzuki, 2013), the latent trace norm (Tomioka & Suzuki, 2013), and the scaled latent trace norm (Wimalawarne et al., 2014) in the following three theorems. We denote the index set of observed samples of $\mathcal{T}$ by $S_T$.

Theorem 7.
Using the overlapped trace norm regularization given as $\|\mathcal{W}\|_{\mathrm{overlap}}=\|\mathcal{W}\|_{(O,O,O)}$, we obtain
$R(\mathcal{W})\le\frac{c_1B_T\Lambda}{|S_T|}\Big(\sum_{k=1}^{3}\sqrt{\hat{r}_k}\Big)\min_{k}\Big(\sqrt{n_k}+\sqrt{\prod_{j\ne k}^{3}n_j}\Big),$
for some constant $c_1$, where $(\hat{r}_1,\hat{r}_2,\hat{r}_3)$ is the multilinear rank of $\mathcal{W}$.
Proof.
Using the same procedure as for theorem 4, we obtain
$\mathbb{E}\|\Sigma_T\|^{*}_{\mathrm{overlap}}\le\mathbb{E}\min_{k}\|\Sigma_{T(k)}\|_{\mathrm{op}}\le\min_{k}\mathbb{E}\|\Sigma_{T(k)}\|_{\mathrm{op}}\le\frac{3c_1}{2}\min_{k}\Big(\sqrt{n_k}+\sqrt{\prod_{j\ne k}^{3}n_j}\Big).$
Since $\|\mathcal{W}\|_{\mathrm{overlap}}\le\sum_{k=1}^{3}\sqrt{\hat{r}_k}B_T$, where $\|\mathcal{W}\|_F\le B_T$ (Tomioka & Suzuki, 2013), we have
$R(\mathcal{W})\le\frac{3c_1B_T\Lambda}{2|S_T|}\Big(\sum_{k=1}^{3}\sqrt{\hat{r}_k}\Big)\min_{k}\Big(\sqrt{n_k}+\sqrt{\prod_{j\ne k}^{3}n_j}\Big).$

$□$

Theorem 8.
Using the latent trace norm regularization given by $\|\mathcal{W}\|_{\mathrm{latent}}=\|\mathcal{W}\|_{(L,L,L)}$, we obtain
$R(\mathcal{W})\le\frac{c_2\Lambda B_T\sqrt{\min_{k}\hat{r}_k}}{|S_T|}\max_{k}\Big(\sqrt{n_k}+\sqrt{\prod_{j\ne k}^{3}n_j}\Big),$
for some constant $c_2$, where $(\hat{r}_1,\hat{r}_2,\hat{r}_3)$ is the multilinear rank of $\mathcal{W}$.
Proof.
Using the duality result from Wimalawarne et al. (2014), we have
$\|\Sigma_T\|^{*}_{\mathrm{latent}}=\max_{k}\|\Sigma_{T(k)}\|_{\mathrm{op}}.$
Using Latała's theorem, we obtain
$\mathbb{E}\|\Sigma_T\|^{*}_{\mathrm{latent}}\le\frac{3c_2}{2}\max_{k}\Big(\sqrt{n_k}+\sqrt{\prod_{j\ne k}^{3}n_j}\Big).$
Finally, using the known bound $\|\mathcal{W}\|_{\mathrm{latent}}\le\sqrt{\min_{i}\hat{r}_i}B_T$ (Wimalawarne et al., 2014), where $\|\mathcal{W}\|_F\le B_T$, we obtain the excess risk
$R(\mathcal{W})\le\frac{3c_2\Lambda B_T\sqrt{\min_{i}\hat{r}_i}}{2|S_T|}\max_{k}\Big(\sqrt{n_k}+\sqrt{\prod_{j\ne k}^{3}n_j}\Big).$

$□$

Theorem 9.
Using the scaled latent trace norm regularization given by $\|\mathcal{W}\|_{\mathrm{scaled}}=\|\mathcal{W}\|_{(S,S,S)}$, we obtain
$R(\mathcal{W})\le\frac{3c_3\Lambda B_T}{2|S_T|}\sqrt{\min_{i}\frac{\hat{r}_i}{n_i}}\max_{k}\Big(n_k+\sqrt{\prod_{j=1}^{3}n_j}\Big),$
for some constant $c_3$, where $(\hat{r}_1,\hat{r}_2,\hat{r}_3)$ is the multilinear rank of $\mathcal{W}$.
Proof.
From previous work (Wimalawarne et al., 2014), we can derive
$\|\Sigma_T\|^{*}_{\mathrm{scaled}}=\max_{k}\sqrt{n_k}\|\Sigma_{T(k)}\|_{\mathrm{op}}.$
Using an approach similar to that for theorem 8, with the additional scaling of $\sqrt{n_k}$, and using Latała's theorem, we arrive at the following bound:
$R(\mathcal{W})\le\frac{3c_3\Lambda B_T}{2|S_T|}\sqrt{\min_{i}\frac{\hat{r}_i}{n_i}}\max_{k}\Big(n_k+\sqrt{\prod_{j=1}^{3}n_j}\Big).$

$□$

## Acknowledgments

M.Y. was supported by the JST PRESTO program JPMJPR165A. H.M. has been partially supported by JST ACCEL grant JPMJAC1503 (Japan), MEXT Kakenhi 16H02868 (Japan), FiDiPro by Tekes (currently Business Finland), and AIPSE programme by Academy of Finland.

## References

Acar, E., Bro, R., & Smilde, A. K. (2015). Data fusion in metabolomics using coupled matrix and tensor factorizations. Proceedings of the IEEE, 103(9), 1602–1620.
Acar, E., Kolda, T. G., & Dunlavy, D. M. (2011). All-at-once optimization for coupled matrix and tensor factorizations. CoRR, abs/1105.3422.
Acar, E., Nilsson, M., & Saunders, M. (2014). A flexible modeling framework for coupled matrix and tensor factorizations. In Proceedings of the 22nd Signal Processing Conference (pp. 111–115). Piscataway, NJ: IEEE.
Acar, E., Papalexakis, E. E., Gürdeniz, G., Rasmussen, M. A., Lawaetz, A. J., Nilsson, M., & Bro, R. (2014). Structure-revealing data fusion. BMC Bioinformatics, 15, 239.
Amelunxen, D., Lotz, M., McCoy, M. B., & Tropp, J. A. (2014). Living on the edge: Phase transitions in convex programs with random data. Information and Inference, 3, 224–294.
Argyriou, A., Evgeniou, T., & Pontil, M. (2006). Multi-task feature learning. In B. Schölkopf, J. C. Platt, & T. Hoffman (Eds.), Advances in neural information processing systems, 19 (pp. 41–48). Cambridge, MA: MIT Press.
Banerjee, A., Basu, S., & Merugu, S. (2007). Multi-way clustering on relation graphs. In Proceedings of the 2007 SIAM International Conference on Data Mining (pp. 145–156). SIAM.
Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 1–122.
Carroll, J. D., & Chang, J.-J. (1970). Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition. Psychometrika, 35(3), 283–319.
El-Yaniv, R., & Pechyony, D. (2007). Transductive Rademacher complexity and its applications. In N. H. Bshouty & C. Gentile (Eds.), Lecture Notes in Computer Science: Vol. 4539. Learning Theory, COLT 2007 (pp. 157–171). Berlin: Springer.
Ermis, B., Acar, E., & Cemgil, A. T. (2015). Link prediction in heterogeneous data via generalized coupled tensor factorization. Data Mining and Knowledge Discovery, 29(1), 203–236.
Gunasekar, S., Yamada, M., Yin, D., & Chang, Y. (2015). Consistent collective matrix completion under joint low rank structure. In Proceedings of the International Conference on Artificial Intelligence and Statistics.
Harshman, R. A. (1970). Foundations of the PARAFAC procedure: Models and conditions for an explanatory multimodal factor analysis. UCLA Working Papers in Phonetics, 16, 1–84.
Hitchcock, F. L. (1927). The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6(1), 164–189.
Khan, S. A., & Kaski, S. (2014). Bayesian multi-view tensor factorization. In T. Calders, F. Esposito, E. Hüllermeier, & R. Meo (Eds.), Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (pp. 656–671). Berlin: Springer.
Kolda, T. G., & Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51(3), 455–500.
Latała, R. (2005). Some estimates of norms of random matrices. Proceedings of the American Mathematical Society, 133(5), 1273–1282.
Li, C., Zhao, Q., Li, J., Cichocki, A., & Guo, L. (2015). Multi-tensor completion with common structures. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (pp. 2743–2749). Palo Alto, CA: AAAI Press.
Liu, G., Lin, Z., & Yu, Y. (2010). Robust subspace segmentation by low-rank representation. In Proceedings of the International Conference on Machine Learning. Omnipress.
Liu, J., Musialski, P., Wonka, P., & Ye, J. (2009). Tensor completion for estimating missing values in visual data. In Proceedings of the International Conference on Computer Vision (pp. 2114–2121). Piscataway, NJ: IEEE.
Mu, C., Huang, B., Wright, J., & Goldfarb, D. (2014). Square deal: Lower bounds and improved relaxations for tensor recovery. In Proceedings of the International Conference on Machine Learning (pp. 73–81).
Narita, A., Hayashi, K., Tomioka, R., & Kashima, H. (2011). Tensor factorization using auxiliary information. Berlin: Springer.
Romera-Paredes, B., Aung, H., Bianchi-Berthouze, N., & Pontil, M. (2013). Multilinear multitask learning. In Proceedings of the International Conference on Machine Learning (pp. 1444–1452).
Shamir, O., & Shalev-Shwartz, S. (2014). Matrix completion with the trace norm: Learning, bounding, and transducing. Journal of Machine Learning Research, 15, 3401–3423.
Signoretto, M., Dinh, Q. T., De Lathauwer, L., & Suykens, J. A. K. (2013). Learning with tensors: A framework based on convex optimization and spectral regularization. Machine Learning, 94(3), 303–351.
Singh, A. P., & Gordon, G. J. (2008). Relational learning via collective matrix factorization. In Proceedings of the SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM.
Takeuchi, K., Tomioka, R., Ishiguro, K., Kimura, A., & Sawada, H. (2013). Non-negative multiple tensor factorization. In Proceedings of the International Conference on Data Mining (pp. 1199–1204). Piscataway, NJ: IEEE.
Tomioka, R., & Suzuki, T. (2013). Convex tensor decomposition via structured Schatten norm regularization. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 26. Red Hook, NY: Curran.
Tomioka, R., Suzuki, T., Hayashi, K., & Kashima, H. (2011). Statistical performance of convex tensor decomposition. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 24 (pp. 972–980). Red Hook, NY: Curran.
Vervliet, N., Debals, O., Sorber, L., Van Barel, M., & De Lathauwer, L. (2016). Tensorlab 3.0. https://www.tensorlab.net/
Wimalawarne, K., Sugiyama, M., & Tomioka, R. (2014). Multitask learning meets tensor factorization: Task imputation via convex optimization. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 27. Red Hook, NY: Curran.
Wimalawarne, K., Tomioka, R., & Sugiyama, M. (2016). Theoretical and experimental analyses of tensor-based regression and classification. Neural Computation, 28(4), 686–715.
Zheng, V. W., Cao, B., Zheng, Y., Xie, X., & Yang, Q. (2010). Collaborative filtering meets mobile recommendation: A user-centered approach. In AAAI.