## Abstract

We consider the problem of extracting a common structure from multiple tensor data sets. For this purpose, we propose multilinear common component analysis (MCCA) based on Kronecker products of mode-wise covariance matrices. MCCA constructs a common basis represented by linear combinations of the original variables that lose little information of the multiple tensor data sets. We also develop an estimation algorithm for MCCA that guarantees mode-wise global convergence. Numerical studies are conducted to show the effectiveness of MCCA.

## 1 Introduction

Various statistical methodologies for extracting useful information from large amounts of data have been studied over recent decades, spurred by the appearance of big data. In the present era, it is important to discover a common structure across multiple data sets. In an early study, Flury (1984) focused on the structure of the covariance matrices of multiple data sets and discussed the heterogeneity of that structure, reporting that population covariance matrices differ among multiple data sets in practical applications. Many methodologies have been developed for treating the heterogeneity between covariance matrices of multiple data sets (see Flury, 1986, 1988; Flury & Gautschi, 1986; Pourahmadi, Daniels, & Park, 2007; Wang, Banerjee, & Boley, 2011; Park & Konishi, 2020).

Among such methodologies, common component analysis (CCA; Wang et al., 2011) is an effective statistical tool. The central idea of CCA is to reduce the number of dimensions of the data while losing as little information from the multiple data sets as possible. To reduce the number of dimensions, CCA reconstructs the data with a few new variables that are linear combinations of the original variables. To account for the heterogeneity between covariance matrices of multiple data sets, CCA assumes a different covariance matrix for each data set. Many statistical methodologies have been built on multiple covariance matrices: discriminant analysis (Bensmail & Celeux, 1996), spectral decomposition (Boik, 2002), and a likelihood ratio test for multiple covariance matrices (Manly & Rayner, 1987). It should be noted that principal component analysis (PCA) (Pearson, 1901; Jolliffe, 2002) is a technique similar to CCA. In fact, CCA is a generalization of PCA: PCA can be applied only to one data set, whereas CCA can be applied to multiple data sets.

Meanwhile, in various fields of research, including machine learning and computer vision, the main interest has been in tensor data, which has a multidimensional array structure. In order to apply the conventional statistical methodologies, such as PCA, to tensor data, a simple approach is to first transform the tensor data into vector data and then apply the methodology. However, such an approach causes the following problems:

- In losing the tensor structure of the data, the approach ignores the higher-order inherent relationships in the original tensor data.

- Transforming tensor data into vector data substantially increases the number of features, which in turn incurs a high computational cost.

To overcome these problems, statistical methodologies for tensor data analyses have been proposed that take the tensor structure of the data into consideration. Such methods enable us to accurately extract higher-order inherent relationships in a tensor data set. In particular, many existing statistical methodologies have been extended for tensor data, for example, multilinear principal component analysis (MPCA) (Lu et al., 2008) and sparse PCA for tensor data analysis (Allen, 2012; Wang, Sun, Chen, Pang, & Zhou, 2012; Lai, Xu, Chen, Yang, & Zhang, 2014), as well as others (see Carroll & Chang, 1970; Harshman, 1970; Kiers, 2000; Badeau & Boyer, 2008; Kolda & Bader, 2009).

In this letter, we extend CCA to tensor data analysis, proposing multilinear common component analysis (MCCA). MCCA discovers the common structure of multiple tensor data sets while losing as little of the information in the data sets as possible. To identify the common structure, we estimate a common basis constructed as linear combinations of the original variables. For estimating the common basis, we develop a new estimation algorithm based on the idea of CCA. In developing the estimation algorithm, two issues must be addressed: the convergence properties of the algorithm and its computational cost. To determine the convergence properties, we first investigate the relationship between the initial values of the parameters and the global optimal solution, and then the monotonic convergence of the estimation algorithm. These analyses reveal that our proposed algorithm guarantees convergence to the mode-wise global optimal solution under some conditions. To analyze the computational efficacy, we calculate the computational cost of our proposed algorithm.

The rest of the letter is organized as follows. In section 2, we review the formulation and the minimization problem of CCA. In section 3, we formulate the MCCA model by constructing the covariance matrices of tensor data, based on a Kronecker product representation. Then we formulate the estimation algorithm for MCCA in section 4. In section 5, we present the theoretical properties for our proposed algorithm and analyze the computational cost. The efficacy of the MCCA is demonstrated through numerical experiments in section 6. Concluding remarks are presented in section 7. Technical proofs are provided in the appendixes. Our implementation of MCCA and supplementary materials are available at https://github.com/yoshikawa-kohei/MCCA.

## 2 Common Component Analysis

In this section, we briefly review the formulation of *common component analysis* (CCA).

where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. A crucial issue in solving the maximization problem 2.4 is its nonconvexity: the problem is defined on the set of orthogonal matrices, which is a nonconvex set. In general, it is difficult to find the global optimal solution of a nonconvex optimization problem. To overcome this drawback, Wang et al. (2011) proposed an estimation algorithm in which the estimated parameters are guaranteed to constitute the global optimal solution under some conditions.
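To make the trace criterion concrete, the sketch below evaluates one common reconstruction-error formulation of CCA, in which the common orthonormal basis $V$ maximizes $\sum_g \|V^\top S_g V\|_F^2$ over the group covariance matrices $S_g$. The data are hypothetical and the feasible point is arbitrary; this illustrates the shape of the objective, not the authors' estimation algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
P, R, G = 6, 2, 3  # original dim, reduced dim, number of groups

# Hypothetical group-wise sample covariance matrices S_g.
S = []
for _ in range(G):
    A = rng.standard_normal((20, P))
    S.append(np.cov(A, rowvar=False))

def cca_objective(V, S_list):
    """f(V) = sum_g ||V^T S_g V||_F^2, maximized over orthonormal V.
    The feasible set (matrices with orthonormal columns) is nonconvex,
    which is the source of the difficulty discussed in the text."""
    return sum(np.linalg.norm(V.T @ Sg @ V, "fro") ** 2 for Sg in S_list)

# A feasible point: orthonormal columns from a (reduced) QR decomposition.
V, _ = np.linalg.qr(rng.standard_normal((P, R)))
val = cca_objective(V, S)

# Projection can only contract Frobenius norm, so the objective is
# bounded above by sum_g ||S_g||_F^2 (attained when R = P).
assert val <= sum(np.linalg.norm(Sg, "fro") ** 2 for Sg in S) + 1e-9
```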

## 3 Multilinear Common Component Analysis

In this section, we introduce a mathematical formulation of the MCCA, an extension of the CCA in terms of tensor data analysis. Moreover, we formulate an optimization problem of MCCA and investigate its convergence properties.

We refer to this formulation as *multilinear common component analysis* (MCCA).

To solve the maximization problem efficiently and identify the inherent relationships, the maximization problem 3.5 can be decomposed into the mode-wise maximization problems represented in the following lemma.

**Lemma 1.**

## 4 Estimation

Our estimation algorithm consists of two steps: initializing the parameters and iteratively updating the parameters. The initialization step gives us the initial values of the parameters near the global optimal solution for each mode. Next, by iteratively updating the parameters, we can monotonically increase the value of the objective function 3.7 until convergence.

### 4.1 Initialization

Using the initial value of $w^{(k)}$, we can obtain the initial value of the parameter $V_0^{(k)}$ by maximizing $f_k'(V^{(k)})$ for each mode. The maximizer consists of the $R_k$ eigenvectors, corresponding to the $R_k$ largest eigenvalues, obtained by the eigenvalue decomposition of $M_I^{(k)}$. The theoretical justification for this initialization is discussed in section 5.
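As a concrete sketch of this initialization step, the snippet below uses a stand-in for $M_I^{(k)}$, built here as a hypothetical weighted sum of mode-wise covariance matrices (the actual construction follows equation 4.2, which is not reproduced in this excerpt), and extracts the eigenvectors for the $R_k$ largest eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
Pk, Rk, G = 8, 3, 4  # mode-k dim, reduced dim, number of groups

# Stand-in for M_I^(k): a weighted sum of mode-k covariance matrices.
# The weights w^(k) are hypothetical placeholders (uniform here).
w = np.full(G, 1.0 / G)
covs = []
for _ in range(G):
    A = rng.standard_normal((30, Pk))
    covs.append(np.cov(A, rowvar=False))
M_I = sum(wg * Sg for wg, Sg in zip(w, covs))

# Initial V_0^(k): eigenvectors of M_I for the R_k largest eigenvalues.
evals, evecs = np.linalg.eigh(M_I)   # eigh returns ascending eigenvalues
V0 = evecs[:, -Rk:][:, ::-1]         # keep top-R_k, largest first

assert V0.shape == (Pk, Rk)
assert np.allclose(V0.T @ V0, np.eye(Rk))  # orthonormal columns
```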

### 4.2 Iterative Update of Parameters

Our estimation procedure comprises the above estimation steps. The procedure is summarized as algorithm 1.

## 5 Theory

This section presents the theoretical and computational analyses of algorithm 1. The theoretical analyses consist of two steps. First, we prove that the initial values of the parameters obtained in section 4.1 are relatively close to the global optimal solution. If the initial values are close to the global maximum, then we can obtain the global optimal solution even though the maximization problem is nonconvex. Second, we prove that the iterative updates of the parameters in section 4.2 monotonically increase the value of objective function 3.7 by solving the surrogate problem 4.3. From this monotonically increasing property, the estimated parameters always converge to a stationary point. The combination of these two results enables us to obtain the mode-wise global optimal solution. In the computational analysis, we calculate the computational cost of MCCA and compare it with that of conventional methods to investigate the computational efficacy of MCCA.

### 5.1 Analysis of Upper and Lower Bounds

This section aims to provide the upper and lower bounds of the maximization problem 3.7. From the bounds, we find that the initial values in section 4.1 are relatively close to the global optimal solution. Before providing the bounds, we define a contraction ratio.

**Definition 1.**

Note that a contraction ratio $\alpha^{(k)}$ satisfies $0 \leq \alpha^{(k)} \leq 1$, and $\alpha^{(k)} = 1$ if and only if $R_k = P_k$.

Using $f_k'^{\max}$ and the contraction ratio $\alpha^{(k)}$, we have the following theorem, which reveals the upper and lower bounds of the global maximum in problem 3.7.

**Theorem 1.**

This theorem indicates that $f_k'^{\max} \to f_k^{\max}$ as $\alpha^{(k)} \to 1$. Thus, it is important to obtain an $\alpha^{(k)}$ that is as close to one as possible. Note that $\alpha^{(k)}$ depends on $V_0^{(k)}$ and $w^{(k)}$, and $V_0^{(k)}$ in turn depends on $w^{(k)}$. From this dependency, if we set the initial value of $w^{(k)}$ such that $\alpha^{(k)}$ is as large as possible, then we obtain an initial value of $V_0^{(k)}$ that attains a value near $f_k^{\max}$. The following theorem shows that we can compute the initial value of $w^{(k)}$ such that $\alpha^{(k)}$ is maximized.

**Theorem 2.**

Let $\lambda_0^{(k)}$ and $\lambda_1^{(k)}$ be the vectors consisting of the eigenvalues defined in equation 4.2. For $w^{(k)} = \bigl(w_{(1)}^{(-k)}, w_{(2)}^{(-k)}, \ldots, w_{(G)}^{(-k)}\bigr)$ $(k = 1, 2, \ldots, M)$, suppose that the estimate $\hat{w}^{(k)}$ is obtained by solving equation 4.1 for $k = 1, 2, \ldots, M$. Then $\hat{w}^{(k)}$ maximizes $\alpha^{(k)}$.

In fact, $\alpha^{(k)}$ is very close to one with the initial values given in theorem 2, even if $R_k$ is small. This resembles the cumulative contribution ratio in PCA.
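To illustrate why such a ratio approaches one quickly even for small $R_k$, the sketch below computes the analogous cumulative contribution ratio in PCA on hypothetical data. This is the PCA analogy mentioned above, not Definition 1 itself, whose exact form is not reproduced in this excerpt.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((100, 10))
S = np.cov(A, rowvar=False)  # 10 x 10 sample covariance matrix

def contribution_ratio(S, R):
    """Cumulative contribution ratio: the share of total variance
    captured by the R largest eigenvalues. Like the contraction ratio
    alpha^(k), it lies in (0, 1] and equals 1 when R equals the full
    dimension."""
    lam = np.sort(np.linalg.eigvalsh(S))[::-1]  # descending eigenvalues
    return lam[:R].sum() / lam.sum()

ratios = [contribution_ratio(S, R) for R in range(1, 11)]
assert all(0 < r <= 1 for r in ratios)
assert abs(ratios[-1] - 1.0) < 1e-9          # R = P gives ratio exactly 1
assert all(b >= a for a, b in zip(ratios, ratios[1:]))  # monotone in R
```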

### 5.2 Convergence Analysis

We next verify that our proposed procedure for iteratively updating the parameters solves the optimization problem 3.7. In algorithm 1, the parameter $V_{s+1}^{(k)}$ is obtained by solving the surrogate maximization problem 4.3. Theorem 3 shows that algorithm 1 monotonically increases the value of the function $f_k(V^{(k)})$ in equation 3.7.

**Theorem 3.**

From theorem 1, we obtain initial values of the parameters that are near the global optimal solution. By combining theorems 1 and 3, the solution from algorithm 1 can be characterized by the following corollary.

**Corollary 1.**

Consider the maximization problem 3.7. Suppose that the initial value of the parameter is obtained by $V_0^{(k)} = \mathop{\mathrm{arg\,max}}_{V^{(k)}} \tilde{f}_k'(V^{(k)})$, and the parameter $V_s^{(k)}$ is repeatedly updated by algorithm 1. Then the mode-wise global maximum of the maximization problem 3.7 is achieved when all the contraction ratios $\alpha^{(k)}$ for $k = 1, 2, \ldots, M$ go to one.

Algorithm 1 does not guarantee the global solution, owing to the fundamental problem of nonconvexity, but it is sufficient for practical purposes. We investigate the issue of convergence to the global solution through numerical studies in section 6.3.

### 5.3 Computational Analysis

First, we analyze the computational cost. To simplify the analysis, we assume $P = \max_{j} P_j$ for $j = 1, 2, \ldots, M$. This implies that $P$ is an upper bound of $R_j$ for all $j$. We then calculate the upper bound of the computational complexity.

The expensive computations in each iteration of algorithm 1 consist of three parts: forming $M(V_s^{(k)})$, the eigenvalue decomposition of $M(V_s^{(k)})$, and updating the latent covariance matrices $\Lambda_g^{(k)}$. These steps cost $O(GM^2P^3)$, $O(P^3)$, and $O(GMP^3)$, respectively. The total computational complexity per iteration is then $O(GM^2P^3)$.

Next, we analyze the memory requirement of algorithm 1. MCCA represents the original tensor data with fewer parameters by projecting the data onto a lower-dimensional space. This requires the $P_k \times R_k$ projection matrices $V^{(k)}$ for $k = 1, 2, \ldots, M$. MCCA projects the data of size $N\prod_{k=1}^{M} P_k$ to size $N\prod_{k=1}^{M} R_k$, where $N = \sum_{g=1}^{G} N_g$. Thus, the required size for the parameters is $\sum_{k=1}^{M} P_k R_k + N\prod_{k=1}^{M} R_k$. MPCA requires the same amount of memory as MCCA. Meanwhile, CCA and PCA need a projection matrix of size $R\prod_{k=1}^{M} P_k$, so the required size for the parameters is $R\prod_{k=1}^{M} P_k + NR$.
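These parameter counts can be checked with a few lines of arithmetic. The sizes below are hypothetical examples; the formulas are the ones just derived.

```python
from math import prod

# Hypothetical example sizes: M = 3 modes, N samples pooled over groups.
P = [30, 41, 3]   # original mode dimensions P_k
R = [5, 5, 2]     # reduced dimensions R_k
N = 500           # total sample size N = sum_g N_g
R_vec = prod(R)   # rank used by PCA/CCA on the vectorized data

# MCCA / MPCA: mode-wise projection matrices plus the compressed data,
# i.e., sum_k P_k R_k + N prod_k R_k.
mcca_mem = sum(p * r for p, r in zip(P, R)) + N * prod(R)

# PCA / CCA on vectorized data: one large projection matrix plus scores,
# i.e., R prod_k P_k + N R.
pca_mem = R_vec * prod(P) + N * R_vec

assert mcca_mem < pca_mem  # mode-wise storage is far smaller here
```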

To compare the computational cost clearly, the upper bounds of computational complexity and the memory requirement are summarized in Table 1. Table 1 shows that the computational complexity of MCCA is superior to that of the other algorithms and the complexity of MCCA is not limited by sample size. In contrast, the MPCA algorithm is affected by the sample size (Lu, Plataniotis, & Venetsanopoulos, 2008). Additionally, MCCA and MPCA require a large amount of memory when the number of modes in a data set is large, but their memory requirements are much smaller than those of PCA and CCA.

| Method | Computational Complexity | Memory Requirement |
|---|---|---|
| PCA | $O(P^{3M})$ | $R\prod_{k=1}^{M} P_k + NR$ |
| CCA | $O(GP^{3M})$ | $R\prod_{k=1}^{M} P_k + NR$ |
| MPCA | $O(NMP^{M+1})$ | $\sum_{k=1}^{M} P_k R_k + N\prod_{k=1}^{M} R_k$ |
| MCCA | $O(GM^{2}P^{3})$ | $\sum_{k=1}^{M} P_k R_k + N\prod_{k=1}^{M} R_k$ |


## 6 Experiment

To demonstrate the efficacy of MCCA, we applied MCCA, PCA, CCA, and MPCA to image compression tasks.

### 6.1 Experimental Setting

For the experiments, we prepared the following three image data sets:

- **MNIST data set** consists of handwritten digits $0, 1, \ldots, 9$ at image sizes of $28 \times 28$ pixels. The data set includes a training set of 60,000 images and a test set of 10,000 images. We used the first 10 training images of the data set for each group. The MNIST data set (Lecun, Bottou, Bengio, & Haffner, 1998) is available at http://yann.lecun.com/exdb/mnist/.

- **AT&T (ORL) face data set** contains gray-scale facial images of 40 people, with 10 images of size $92 \times 112$ pixels for each person. We used images resized by a factor of 0.5 to improve the efficiency of the experiment. The AT&T face data set is available at https://git-disl.github.io/GTDLBench/datasets/att_face_dataset/. All credit for this data set goes to AT&T Laboratories Cambridge.

- **Cropped AR database** has color facial images of 100 people, cropped around the face, with images of size $120 \times 165 \times 3$ pixels. The data set contains 26 images in each group, 12 of which are images of people wearing sunglasses or scarves. We used the cropped facial images of 50 males who were not wearing sunglasses or scarves. Due to memory limitations, we resized these images by a factor of 0.25. The AR database (Martinez & Benavente, 1998; Martinez & Kak, 2001) is available at http://www2.ece.ohio-state.edu/∼aleix/ARdatabase.html.

The data set characteristics are summarized in Table 2.

| Data Set | Group Size | Sample Size (/Group) | Number of Dimensions | Number of Groups |
|---|---|---|---|---|
| MNIST | Small | 10 | $28 \times 28 = 784$ | 10 |
| AT&T (ORL) | Small | 10 | $46 \times 56 = 2576$ | 10 |
| | Medium | 20 | | |
| | Large | 40 | | |
| Cropped AR | Small | 14 | $30 \times 41 \times 3 = 3690$ | 10 |
| | Medium | 25 | | |
| | Large | 50 | | |


To compress these images, we performed dimensionality reduction using MCCA, PCA, CCA, and MPCA as follows. We vectorized the tensor data sets before performing PCA and CCA. In MCCA, the images were compressed and reconstructed according to the following steps:

1. Prepare the multiple image data sets $X^{(g)} \in \mathbb{R}^{P_1 \times P_2 \times \cdots \times P_M \times N_g}$ for $g = 1, 2, \ldots, G$.

2. Compute the covariance matrix of $X^{(g)}$ for $g = 1, 2, \ldots, G$.

3. From these covariance matrices, compute the linear transformation matrices $V_i \in \mathbb{R}^{P_i \times R_i}$ for $i = 1, 2, \ldots, M$ for mapping to the $(R_1, R_2, \ldots, R_M)$-dimensional latent space.

4. Map the $i$th sample $X_i^{(g)}$ to $X_i^{(g)} \times_1 V_1 \times_2 V_2 \cdots \times_M V_M \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_M}$, where the operator $\times_i$ is the $i$-mode product of a tensor (Kolda & Bader, 2009).

5. Reconstruct the $i$th sample as $\tilde{X}_i^{(g)} = X_i^{(g)} \times_1 V_1 V_1^\top \times_2 V_2 V_2^\top \cdots \times_M V_M V_M^\top$.
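The compression and reconstruction steps above can be sketched with mode products in numpy. Note that mapping with $V_i \in \mathbb{R}^{P_i \times R_i}$ amounts to multiplying along mode $i$ by $V_i^\top$ in the Kolda and Bader (2009) convention; the projection matrices here are random orthonormal stand-ins, not estimates from algorithm 1.

```python
import numpy as np

def mode_product(X, A, k):
    """k-mode product: multiply tensor X along mode k by matrix A,
    where A has shape (new_dim, X.shape[k]); Kolda & Bader (2009)."""
    return np.moveaxis(np.tensordot(A, X, axes=(1, k)), 0, k)

rng = np.random.default_rng(3)
P, R = (6, 5, 4), (3, 2, 2)
X = rng.standard_normal(P)  # one hypothetical sample X_i^{(g)}

# Random orthonormal stand-ins for the estimated V_i (P_i x R_i).
Vs = [np.linalg.qr(rng.standard_normal((p, r)))[0] for p, r in zip(P, R)]

# Compression: X x_1 V1^T x_2 V2^T x_3 V3^T -> core of size (R1, R2, R3).
Z = X
for k, V in enumerate(Vs):
    Z = mode_product(Z, V.T, k)

# Reconstruction: Z x_1 V1 x_2 V2 x_3 V3, i.e., X x_k V_k V_k^T per mode.
Xhat = Z
for k, V in enumerate(Vs):
    Xhat = mode_product(Xhat, V, k)

assert Z.shape == R and Xhat.shape == P
```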

Meanwhile, PCA and MPCA each require a single data set. Thus, we aggregated the data sets as $X = [X^{(1)}, X^{(2)}, \ldots, X^{(G)}] \in \mathbb{R}^{P_1 \times P_2 \times \cdots \times P_M \times \sum_{g=1}^{G} N_g}$ and performed PCA and MPCA on the data set $X$.

### 6.2 Performance Assessment

For MCCA and MPCA, the reduced dimensions $R_1$ and $R_2$ were set to the same value, and $R_3$ was fixed at two. All computations were performed in R (version 3.6) (R Core Team, 2019). In the initialization of MCCA, the quadratic programming problem was solved using the function ipop in the package kernlab. MPCA was implemented as the function mpca in the package rTensor. (The implementations of MCCA, PCA, and CCA are available at https://github.com/yoshikawa-kohei/MCCA.)

From Figure 1, we observe that the RER of MCCA is the smallest for any value of CR, indicating that MCCA performs better than the other methods. In addition, note that CCA performs better than MPCA only for fairly small values of CR, even though CCA is a method for vector data, whereas MPCA performs better for larger values of CR. This suggests the limitations of applying vector-based CCA to tensor data.

Next, we consider group size by comparing panels a, b, and c in Figure 1. The value of CR at the intersection of CCA and MPCA increases as the group size increases. This indicates that MPCA has more trouble extracting an appropriate latent space as the group size increases. Because MPCA does not consider the group structure, it cannot properly estimate the covariance structure when the group size is large.

### 6.3 Behavior of Contraction Ratio

We examined the behavior of the contraction ratio $\alpha^{(k)}$. We performed MCCA on the AT&T (ORL) data set with the medium group size and computed $\alpha^{(1)}$ and $\alpha^{(2)}$ for the various pairs of reduced dimensions $(R_1, R_2) \in \{1, 2, \ldots, 25\} \times \{1, 2, \ldots, 25\}$.

### 6.4 Efficacy of Solving the Quadratic Programming Problem

We investigated the usefulness of determining the initial value of $w^{(k)}$ by solving the quadratic programming problem 4.1. We applied MCCA to the AT&T (ORL) data set with the small, medium, and large numbers of groups. In addition, we used a smaller group size of three. For determining the initial value of $w^{(k)}$, we considered three methods: solving the quadratic programming problem 4.1 (MCCA:QP); setting all values of $w^{(k)}$ to one (MCCA:FIX); and setting the values by random sampling from the uniform distribution $U(0, 1)$ (MCCA:RANDOM). We computed $\alpha^{(k)}$ with the reduced dimensions $R_1 = R_2 \in \{1, 2, \ldots, 10\}$ for each of these methods.

To evaluate the performance of these methods, we compared the values of $\alpha^{(k)}$ and the number of iterations required for estimation. The number of iterations is the number of repetitions of lines 7 to 9 in algorithm 1. For MCCA:RANDOM, we performed 50 trials and report the average of each index.

## 7 Conclusion

We have developed multilinear common component analysis (MCCA) by introducing a covariance structure based on the Kronecker product. To efficiently solve the nonconvex optimization problem for MCCA, we have proposed an iterative updating algorithm that exhibits superior theoretical convergence properties. Numerical experiments have shown the usefulness of MCCA.

Specifically, the QP-based initialization was shown to be competitive among the initialization methods in terms of the number of iterations. As the number of groups increases, the overall number of samples increases, which may be why all methods required almost the same number of iterations for the small, medium, and large numbers of groups. Note that in this study, we used the Kronecker product representation to estimate the covariance matrix for tensor data sets. Greenewald, Zhou, and Hero (2019) used the Kronecker sum representation for estimating the covariance matrix, and it would be interesting to extend MCCA to this and other covariance representations.

## Appendix A: Proof of Lemma 1

We provide two basic lemmas about Kronecker products before we prove lemma 1.

**Lemma 2.**

**Lemma 3.**

These lemmas are known as the mixed-product property and the spectrum property, respectively. See Harville (1998) for detailed proofs.

**Proof of Lemma 1.**

## Appendix B: Proof of Theorem 1

Theorem 1 can be easily shown from the following lemma.

**Lemma 4.**

**Proof of Lemma 4.**

**Proof of Theorem 1.**

## Appendix C: Proof of Theorem 2

**Proof of Theorem 2.**

## Appendix D: Proof of Theorem 3

**Proof of Theorem 3.**

## Acknowledgments

We thank the reviewer for helpful comments and constructive suggestions. S.K. was supported by JSPS KAKENHI grants JP19K11854 and JP20H02227, and MEXT KAKENHI grants JP16H06429, JP16K21723, and JP16H06430.

## References

*Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics*

*SIAM Journal on Matrix Analysis and Applications*

*Journal of the American Statistical Association*

*Biometrika*

*Psychometrika*

*Journal of the American Statistical Association*

*Annals of Statistics*

*Common principal components and related multivariate models*

*SIAM Journal on Scientific and Statistical Computing*

*Journal of the Royal Statistical Society: Series B (Statistical Methodology)*

*UCLA Working Papers in Phonetics*

*Matrix algebra from a statistician's perspective*

*Principal component analysis*

*IEEE Journal on Selected Areas in Communications*

*Journal of Chemometrics*

*SIAM Review*

*IEEE Transactions on Neural Networks and Learning Systems*

*Proceedings of the IEEE*

*IEEE Transactions on Neural Networks*

*Biometrika*

*The AR face database*

*IEEE Transactions on Pattern Analysis and Machine Intelligence*

*Statistical Papers*

*London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science*

*Journal of Multivariate Analysis*

*R: A language and environment for statistical computing*

*Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*

*Proceedings of the 21st International Conference on Pattern Recognition*

*IEEE Transactions on Signal Processing*

*IEEE Transactions on Vehicular Technology*