Multilinear Common Component Analysis via Kronecker Product Representation

Abstract. We consider the problem of extracting a common structure from multiple tensor datasets. For this purpose, we propose multilinear common component analysis (MCCA) based on Kronecker products of mode-wise covariance matrices. MCCA constructs a common basis, represented by linear combinations of the original variables, that loses little information from the multiple tensor datasets. We also develop an estimation algorithm for MCCA that guarantees mode-wise global convergence. Numerical studies show the effectiveness of MCCA.


Introduction
Various statistical methodologies for extracting useful information from large amounts of data have been studied for decades, driven by the rise of big data. In the present era, it is important to discover a common structure shared by multiple datasets. In an early study, Flury (1984) focused on the structure of the covariance matrices of multiple datasets and discussed the heterogeneity of that structure. In practice, that study reported that the population covariance matrices differ between multiple datasets. Therefore, it is important to develop methodologies that account for the heterogeneity between the covariance matrices of multiple datasets (see, e.g., Flury (1986, 1988); Flury and Gautschi (1986); Pourahmadi et al. (2007); Wang et al. (2011); Park and Konishi (2018)).
Among such methodologies, common component analysis (CCA) (Wang et al., 2011) is an effective statistical tool. The central idea of CCA is to reduce the dimension of the data while losing as little information from the multiple datasets as possible. To reduce the dimension, CCA reconstructs the data with a few new variables that are linear combinations of the original variables. To account for the heterogeneity between covariance matrices of multiple datasets, CCA assumes a different covariance matrix for each dataset. Many statistical methodologies use multiple covariance matrices: discriminant analysis (Bensmail and Celeux, 1996), spectral decomposition (Boik, 2002), and likelihood ratio tests for multiple covariance matrices (Manly and Rayner, 1987). It should be noted that principal component analysis (PCA) (Pearson, 1901; Jolliffe, 2002) is a technique similar to CCA. In fact, CCA is a generalization of PCA: PCA can be applied to only one dataset, while CCA can be applied to multiple datasets.
Meanwhile, in various research fields such as machine learning and computer vision, the main interest has been directed to tensor data, that is, multidimensional arrays. When we apply conventional statistical methodologies, such as PCA, to tensor data, a simple approach is to transform the tensor data into vector data and then apply those methodologies.
However, such an analysis causes the following problems:
1. Since the structure of the tensor data is not preserved, the analysis ignores the higher-order inherent relationships in the original tensor data.
2. Transforming tensor data into vector data makes the number of features large, which causes a high computational cost.
To overcome these problems, statistical methodologies for tensor data analysis that take the structure of tensor data into consideration have been proposed. These methods enable us to accurately extract the higher-order inherent relationships in a tensor dataset.
In this paper, we extend CCA to tensor data analysis and propose multilinear common component analysis (MCCA). MCCA discovers the common structure of multiple tensor datasets while losing as little information from each dataset as possible. To identify the common structure, we estimate a common basis constructed from linear combinations of the original variables. For estimating the common basis, we develop a new algorithm that estimates the basis faster and more accurately than the estimation algorithm of CCA. In developing the estimation algorithm, two issues need to be discussed: one is the convergence property of the algorithm, and the other is its computational cost. Regarding the convergence property, we guarantee that our proposed algorithm converges to the mode-wise global optimum under some conditions. To analyze the computational efficiency, we calculate the computational cost of our proposed algorithm and compare it with that of multilinear principal component analysis (MPCA; Lu et al., 2008).
The rest of the paper is organized as follows. In Section 2, we review the formulation and the minimization problem of CCA. In Section 3, we formulate the MCCA model by constructing the covariance matrices of tensor data based on the Kronecker product representation. Then, we develop the estimation algorithm for MCCA in Section 4. In Section 5, we provide the theoretical properties of our proposed algorithm and analyze its computational cost. The efficacy of MCCA is demonstrated by experimental analyses in Section 6. Concluding remarks are presented in Section 7. Supplementary materials and the source code of the MCCA method are available at https://github.com/yoshikawa-kohei/MCCA.

Common Component Analysis
Suppose that we obtain data matrices X^(g) = [x_1^(g), ..., x_{N_g}^(g)]^T in R^{N_g x P} with N_g observations and P variables for g = 1, ..., G, where x_i^(g) is the P-dimensional vector in the i-th row of X^(g) and G is the number of datasets. Then, the sample covariance matrix in group g is obtained by

S^(g) = (1/N_g) sum_{i=1}^{N_g} (x_i^(g) - xbar^(g)) (x_i^(g) - xbar^(g))^T,   (2.1)

where S^(g) is in S_+^P, in which S_+^P is the set of symmetric positive definite matrices of size P x P, and xbar^(g) = (1/N_g) sum_{i=1}^{N_g} x_i^(g) is the P-dimensional mean vector in group g. The main idea of CCA is to find the common structure of multiple datasets by projecting the data onto a lower-dimensional space with a shared basis. Wang et al. (2011) assumed that the covariance matrices S^(g) for g = 1, ..., G can be decomposed into a product of latent covariance matrices and an orthogonal matrix for the linear transformation as follows:

S^(g) = V Lambda^(g) V^T + E^(g),   V^T V = I_R,   (2.2)

where Lambda^(g) in S_+^R is a latent covariance matrix in group g, V in R^{P x R} is an orthogonal matrix for the linear transformation, E^(g) in R^{P x P} is an error matrix in group g, and I_R is the identity matrix of size R x R. V determines the R-dimensional common subspace shared by the multiple datasets. Wang et al. (2011) called the above model common component analysis (CCA).
The estimator of CCA can be obtained by solving the minimization problem

min_{V, Lambda^(1), ..., Lambda^(G)} sum_{g=1}^{G} || S^(g) - V Lambda^(g) V^T ||_F^2   subject to V^T V = I_R,   (2.3)

where ||.||_F denotes the Frobenius norm. For fixed V, the estimators of the latent covariance matrices Lambda^(g) for g = 1, ..., G are obtained in closed form as Lambdahat^(g) = V^T S^(g) V.
By substituting the estimates Lambdahat^(g), Wang et al. (2011) showed that the minimization problem can be reformulated as the following maximization problem:

max_V sum_{g=1}^{G} tr( (V^T S^(g) V)(V^T S^(g) V) )   subject to V^T V = I_R,   (2.4)

where tr(.) denotes the trace of a matrix. A crucial issue in solving the maximization problem (2.4) is its non-convexity: the problem is defined over the set of orthogonal matrices V, which is a non-convex set. Generally speaking, it is difficult to find the global maximum of a non-convex optimization problem, and hence it is difficult to obtain the global maximum of (2.4). To overcome this drawback, Wang et al. (2011) proposed an estimation algorithm in which the estimated parameters are guaranteed to attain the global maximum under some conditions.
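As a quick numerical sanity check of this reformulation, the following Python sketch (synthetic matrices, not from the paper) verifies that, with Lambda^(g) = V^T S^(g) V, the Frobenius residual of the CCA model equals ||S^(g)||_F^2 - ||V^T S^(g) V||_F^2, so minimizing the residual over V is the maximization problem (2.4).

```python
import numpy as np

rng = np.random.default_rng(0)
P, R, G = 6, 2, 3

# Synthetic symmetric positive definite "covariance" matrices, one per group.
S = []
for _ in range(G):
    A = rng.standard_normal((P, P))
    S.append(A @ A.T + P * np.eye(P))

# A random column-orthogonal V (P x R).
V, _ = np.linalg.qr(rng.standard_normal((P, R)))

# With Lambda_g = V^T S_g V, the residual satisfies
#   ||S_g - V Lambda_g V^T||_F^2 = ||S_g||_F^2 - ||V^T S_g V||_F^2,
# so minimizing over V is equivalent to maximizing sum_g ||V^T S_g V||_F^2.
for Sg in S:
    Lam = V.T @ Sg @ V
    lhs = np.linalg.norm(Sg - V @ Lam @ V.T, "fro") ** 2
    rhs = np.linalg.norm(Sg, "fro") ** 2 - np.linalg.norm(Lam, "fro") ** 2
    assert np.isclose(lhs, rhs)
```

The identity holds because V^T V = I_R, so the cross term tr(S^(g) V Lambda^(g) V^T) collapses to ||Lambda^(g)||_F^2.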

Multilinear Common Component Analysis
In this section, we introduce the mathematical formulation of the MCCA model, which is an extension of the CCA model to tensor data analysis. Moreover, we formulate the optimization problem of MCCA and raise the issue of its convergence properties.
Suppose that we independently obtain M-th order tensor data X_i^(g) in R^{P_1 x P_2 x ... x P_M}. Here, we set the tensor datasets X^(g) = {X_1^(g), ..., X_{N_g}^(g)} for g = 1, ..., G, where G is the number of datasets. Then, the sample covariance matrix in group g for the tensor dataset is defined by the Kronecker product representation

S*^(g) = S^(M)(g) ox S^(M-1)(g) ox ... ox S^(1)(g),   (3.1)

where S*^(g) is in S_+^P, in which P = prod_{k=1}^{M} P_k, ox denotes the Kronecker product operator, and S^(k)(g) in S_+^{P_k} is the sample covariance matrix for mode k in group g, computed from the mode-k unfoldings of the centered tensors in the group. Here, the mode-k unfolding X_(k) maps the tensor element (p_1, p_2, ..., p_M) to the matrix element (p_k, l), where l = 1 + sum_{t=1, t != k}^{M} (p_t - 1) L_t with L_t = prod_{m=1, m != k}^{t-1} P_m, in which p_1, p_2, ..., p_M denote the indices of the M-th order tensor X. For more details of tensor operations, see Kolda and Bader (2009). The representation of a tensor covariance matrix by Kronecker products is often used in various studies (Kermoal et al., 2002; Yu et al., 2004; Werner et al., 2008).
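The mode-k unfolding and the mode-wise sample covariance can be sketched in Python as follows. This is a minimal illustration with synthetic data: the helper names (`unfold`, `mode_cov`) and the 1/N_g scaling are assumptions, and NumPy's C-order reshape may order the unfolding columns differently from the index map above, which does not affect the covariance U U^T.

```python
import numpy as np

def unfold(T, k):
    """Mode-k unfolding: a P_k x (prod of the other dimensions) matrix.
    NumPy's C-order reshape may permute columns relative to the text's
    index map; the covariance below is invariant to column order."""
    return np.moveaxis(T, k, 0).reshape(T.shape[k], -1)

def mode_cov(X, k):
    """Assumed 1/N_g-scaled mode-k sample covariance of a group of tensors."""
    Xbar = X.mean(axis=0)
    C = np.zeros((X.shape[k + 1], X.shape[k + 1]))
    for Xi in X:
        U = unfold(Xi - Xbar, k)
        C += U @ U.T
    return C / X.shape[0]

rng = np.random.default_rng(1)
N, P1, P2, P3 = 20, 3, 4, 5
X = rng.standard_normal((N, P1, P2, P3))   # one group of third-order tensors

S1 = mode_cov(X, 0)
assert S1.shape == (P1, P1)
assert np.allclose(S1, S1.T)                      # symmetric
assert np.all(np.linalg.eigvalsh(S1) >= -1e-10)   # positive semi-definite
```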
To construct CCA in terms of tensor data analysis, we consider CCA for the mode-k covariance matrix in group g as follows:

S^(k)(g) = V^(k) Lambda^(k)(g) V^(k)T + E^(k)(g),   V^(k)T V^(k) = I_{R_k},   (3.2)

where Lambda^(k)(g) in S_+^{R_k} is a latent covariance matrix for mode k in group g, V^(k) in R^{P_k x R_k} is an orthogonal matrix for the linear transformation, and E^(k)(g) in R^{P_k x P_k} is an error matrix in group g. Here, since S*^(g) can be decomposed into the Kronecker product of S^(k)(g) for k = 1, ..., M in formula (3.1), we obtain the following model:

S*^(g) = V* Lambda*^(g) V*^T + E*^(g),   (3.3)

where V* = V^(M) ox V^(M-1) ox ... ox V^(1), Lambda*^(g) = Lambda^(M)(g) ox ... ox Lambda^(1)(g), and E*^(g) is an error matrix in group g. We call this model multilinear common component analysis (MCCA).
To find the R-dimensional common subspace between the multiple tensor datasets, where R = prod_{k=1}^{M} R_k, MCCA determines V^(1), V^(2), ..., V^(M). As with CCA, the estimates of Lambda*^(g) for g = 1, ..., G are obtained as Lambdahat*^(g) = V*^T S*^(g) V*. With respect to V*, we can obtain the estimate by solving the following optimization problem, which is similar to (2.4):

max_{V*} sum_{g=1}^{G} tr( (V*^T S*^(g) V*)(V*^T S*^(g) V*) )   subject to V*^T V* = I_R.   (3.5)
However, the number of parameters becomes very large when we solve this problem directly, which causes a high computational cost. Moreover, it is not possible to discover the inherent relationships between the variables in each mode simply by solving the problem (3.5).
To solve the optimization problem efficiently and identify the inherent relationships, the maximization problem (3.5) can be decomposed into the following mode-wise maximization problems.
Lemma 1. The optimal values of the parameters V^(k) for k = 1, 2, ..., M can be obtained by solving the following maximization problem for each mode:

max_{V^(k)} sum_{g=1}^{G} prod_{j=1}^{M} || V^(j)T S^(j)(g) V^(j) ||_F^2   subject to V^(k)T V^(k) = I_{R_k}.   (3.6)

By collecting the terms unrelated to V^(k) in the maximization problem (3.6), we obtain the following maximization problem for mode k:

max_{V^(k)} f_k(V^(k)) = sum_{g=1}^{G} c_g^(k) || V^(k)T S^(k)(g) V^(k) ||_F^2   subject to V^(k)T V^(k) = I_{R_k},   (3.7)

where c_g^(k) = prod_{j != k} || V^(j)T S^(j)(g) V^(j) ||_F^2 collects the contributions of the other modes. Although the estimate of V^(k) can be obtained by solving the maximization problem (3.7), the problem is non-convex since V^(k) is assumed to be an orthogonal matrix. Therefore, the maximization problem may have several local maxima. Meanwhile, by choosing an initial point near the global maximum, we can reach the global maximum in the estimation. In Section 4, we develop not only an estimation algorithm but also an initialization method that chooses an initial point near the global maximum. The initialization method helps our algorithm converge to the mode-wise global maximum.
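The decomposition in Lemma 1 rests on standard Kronecker-product identities. The following sketch checks the two facts that make the objective separate across modes: the squared Frobenius norm of a Kronecker product factorizes, and conjugating a Kronecker product by Kronecker factors acts mode-wise (the mixed-product property). All matrices are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((4, 4))

# 1) The squared Frobenius norm of a Kronecker product factorizes.
lhs = np.linalg.norm(np.kron(A, B), "fro") ** 2
rhs = np.linalg.norm(A, "fro") ** 2 * np.linalg.norm(B, "fro") ** 2
assert np.isclose(lhs, rhs)

# 2) Mixed-product property: (V1 ox V2)^T (A ox B) (V1 ox V2)
#    equals (V1^T A V1) ox (V2^T B V2), so the joint objective
#    splits into a product of mode-wise terms.
V1, _ = np.linalg.qr(rng.standard_normal((3, 2)))
V2, _ = np.linalg.qr(rng.standard_normal((4, 2)))
lhs2 = np.kron(V1, V2).T @ np.kron(A, B) @ np.kron(V1, V2)
rhs2 = np.kron(V1.T @ A @ V1, V2.T @ B @ V2)
assert np.allclose(lhs2, rhs2)
```

Together these give ||V*^T S*^(g) V*||_F^2 = prod_k ||V^(k)T S^(k)(g) V^(k)||_F^2, which is exactly the product form appearing in (3.6).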

Estimation
Our estimation algorithm for MCCA consists of two steps: the first step initializes the parameters, and the second step iteratively updates them. We describe each step in turn.

Initialization
The first step is to initialize the parameters V^(k) for each mode. To initialize V^(k) near the global maximum, instead of the problem (3.7) we prepare the surrogate maximization problem

max_{V^(k)} tr( V^(k)T M_I^(k) V^(k) )   subject to V^(k)T V^(k) = I_{R_k},   (4.1)

where M_I^(k) aggregates the mode-k group covariance matrices with weights formed from the eigenvalues lambda_i^(g) of S^(j)(g) S^(j)(g) for the other modes j != k; the objective of (4.1) is an upper bound of the objective of (3.7). The initial value V_0^(k) can be obtained by solving (4.1) for each mode: the solution is the R_k leading eigenvectors obtained by the eigenvalue decomposition of M_I^(k). Note that this initialization yields an initial value near the global maximum; the theoretical property is described in Section 5.
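A simplified sketch of this eigenvalue-based initialization is below. The exact weighting inside M_I^(k) is not reproduced here; the sketch aggregates the squared mode-k group covariances without weights, which keeps the key property that V_0^(k) comes from a single eigenvalue decomposition. The function name `init_basis` and the unweighted aggregate are assumptions for illustration.

```python
import numpy as np

def init_basis(S_list, R):
    """Unweighted sketch of the initialization: leading R eigenvectors of
    an aggregate of squared mode-k group covariances (the paper's M_I^(k)
    additionally weights each group using the other modes' eigenvalues)."""
    M = sum(Sg @ Sg for Sg in S_list)
    w, U = np.linalg.eigh(M)      # eigenvalues in ascending order
    return U[:, ::-1][:, :R]      # top-R eigenvectors

rng = np.random.default_rng(3)
P, R, G = 5, 2, 3
S_list = []
for _ in range(G):
    A = rng.standard_normal((P, P))
    S_list.append(A @ A.T)        # positive semi-definite group covariances

V0 = init_basis(S_list, R)
assert V0.shape == (P, R)
assert np.allclose(V0.T @ V0, np.eye(R))   # orthonormal columns
```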

Iterative Updating of the Parameters
The second step is to update the parameters V^(k) for each mode. To update the parameters, we maximize the objective f_k. However, the updated value cannot be obtained by a single eigenvalue decomposition as in the initialization, because the matrix M(V^(k)) to be eigendecomposed contains the parameters to be estimated. Therefore, denoting by V_s^(k) the value of V^(k) at step s, we solve the following surrogate maximization problem:

max_{V^(k)} tr( V^(k)T M(V_s^(k)) V^(k) )   subject to V^(k)T V^(k) = I_{R_k}.   (4.3)

The solution of (4.3) is the R_k leading eigenvectors obtained by the eigenvalue decomposition of M(V_s^(k)), since the parameters to be estimated are no longer included in the matrix. Iteratively updating the parameters monotonically improves the objective function and allows it to be maximized; the details of this theory are described in Section 5. Our estimation procedure is constructed from these two steps and is summarized as Algorithm 1.
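A single-mode sketch of this surrogate iteration follows. It uses M(V_s) = sum_g S^(g) V_s V_s^T S^(g), the form arising for vector data; the mode-wise weights c_g^(k) of MCCA are omitted as an assumption. Because the objective sum_g ||V^T S^(g) V||_F^2 is convex in V V^T for positive semi-definite S^(g) and the surrogate is its linearization, each eigenvector update cannot decrease the objective.

```python
import numpy as np

rng = np.random.default_rng(4)
P, R, G = 6, 2, 3
S = []
for _ in range(G):
    A = rng.standard_normal((P, P))
    S.append(A @ A.T)                  # positive semi-definite

def objective(V):
    return sum(np.linalg.norm(V.T @ Sg @ V, "fro") ** 2 for Sg in S)

def top_eigvecs(M, R):
    _, U = np.linalg.eigh(M)           # ascending eigenvalues
    return U[:, ::-1][:, :R]

# Surrogate step: tr(V^T M(V_s) V) is maximized over orthonormal V by the
# top-R eigenvectors of M(V_s), which no longer contains the unknown V.
V = top_eigvecs(sum(Sg @ Sg for Sg in S), R)   # simple initialization
f_start = objective(V)
for _ in range(50):
    M = sum(Sg @ V @ V.T @ Sg for Sg in S)
    V = top_eigvecs(M, R)
f_end = objective(V)
assert f_end >= f_start - 1e-8         # monotone non-decreasing ascent
```

MCCA applies the same update mode by mode, cycling over k = 1, ..., M.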

Theory
The analyses in this section give theoretical justification for Algorithm 1. They consist of two parts. The first analyzes the relationship between the initial value of the parameters and the global optimum; it shows that the initial value obtained by the proposed initialization is relatively close to the global maximum.
If the initial value is close to the global maximum, we can obtain the global maximum even though the maximization problem is non-convex. The second part analyzes the convergence of the iterative parameter updates: solving the surrogate problem (4.3) monotonically improves the objective function (3.7), so the estimated parameters always converge to a stationary point. The combination of these two results enables us to obtain the mode-wise global maximum.

Analysis of Upper and Lower Bounds
We analyze the upper and lower bounds of the maximization problem (3.7). This analysis shows that the initial value is relatively close to the global optimum. The bound for the global maximum of f_k(V^(k)) depends on the alternative optimization problem (4.1) and a contraction ratio defined below.

Definition 1. For each mode k, we define alpha^(k) as the contraction ratio of the data for mode k, where 0 <= alpha^(k) <= 1; the ratio is formed from the objective of the alternative maximization problem (4.1) used for the initialization.

We have the following theorem, which reveals the upper and lower bounds of the global maximum of the problem (3.7).

Theorem 1. Let g_k(V_0^(k)) be the optimal value of the alternative problem (4.1). Then

alpha^(k) g_k(V_0^(k)) <= max f_k(V^(k)) <= g_k(V_0^(k)),

where alpha^(k) is the contraction ratio defined in Definition 1.

In fact, alpha^(k) is almost one if we use only a small number of latent variables, because alpha^(k) represents the total amount of information retained by PCA. If alpha^(k) is sufficiently large, the data are mostly explained by the principal components, and in PCA it is well known that most of the data can be represented by the first few principal components. Therefore, it follows from Theorem 1 that, in such practical situations, our proposed initialization takes an initial value close enough to the global maximum of f_k(V^(k)).
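The role of alpha^(k) can be illustrated with the analogous PCA quantity, the fraction of eigenvalue mass captured by the top R_k components (the paper's exact definition involves the surrogate problem (4.1) and differs in detail). For covariance spectra that decay quickly, as is typical for image data, this ratio is close to one even for small R_k:

```python
import numpy as np

rng = np.random.default_rng(5)
P, R = 8, 2

# A covariance matrix with a geometrically decaying spectrum.
w = 2.0 ** -np.arange(P)
Q, _ = np.linalg.qr(rng.standard_normal((P, P)))
S = Q @ np.diag(w) @ Q.T

# Fraction of eigenvalue mass captured by the top-R components; the
# contraction ratio alpha^(k) plays an analogous role in Theorem 1.
eig = np.sort(np.linalg.eigvalsh(S))[::-1]
alpha = eig[:R].sum() / eig.sum()
assert 0.0 <= alpha <= 1.0
assert alpha > 0.7   # a few components already explain most of the mass
```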

Convergence Analysis
Theorem 1 means that we obtain a good initial value when alpha^(k) is large enough.
Next, we show that our iterative updating of the parameters maximizes the optimization problem (3.7). In our algorithm, the parameter V_{s+1}^(k) is obtained by solving the surrogate maximization problem (4.3). The following Theorem 2 shows that the update monotonically improves the function f_k(V^(k)).

Theorem 2. Let V_{s+1}^(k) be the solution of the surrogate problem (4.3) given V_s^(k). Then

f_k(V_s^(k)) <= f_k(V_{s+1}^(k)).   (5.3)

By repeatedly updating the parameters, the parameters reach at least a local maximum. Fortunately, we can obtain an initial value of the parameter near the global maximum, so the iterations can converge to the global maximum. By combining Theorem 1 and Theorem 2, we can characterize the solution as the following corollary.

Corollary 1. Consider the maximization problem (3.7). Set the initial value of the parameter to V_0^(k) obtained from (4.1) and repeatedly update the parameter V^(k) by solving (4.3). Then, as alpha^(k) goes to one, the estimated parameter attains the mode-wise global maximum.

Although our proposed algorithm does not guarantee global optimality, due to the fundamental difficulty of non-convex optimization, it can approximately obtain the mode-wise global maximum. This property is sufficient in practice.

Computational Analysis
The total computational complexity per iteration of Algorithm 1 is O(GM^2 P^3): the formation of M(V_s^(k)), its eigenvalue decomposition, and the remaining updates take O(GM^2 P^3), O(P^3), and O(GMP^3), respectively. This indicates that the MCCA algorithm does not depend on the sample size of the dataset.
On the other hand, since the MPCA algorithm depends on the sample size of the dataset (Lu et al., 2008), the MCCA algorithm scales better with the size of the dataset.
Next, we analyze the memory requirement of Algorithm 1. MCCA represents the original tensor data with fewer parameters by projecting the data into a lower-dimensional space. It needs the P_k x R_k projection matrices V^(k) for k = 1, 2, ..., M.
MCCA projects the data from size N * prod_{k=1}^{M} P_k to N * prod_{k=1}^{M} R_k, where N = sum_{g=1}^{G} N_g. Thus, the required number of parameters is sum_{k=1}^{M} P_k R_k + N * prod_{k=1}^{M} R_k. MPCA requires the same amount of memory as MCCA. Similarly, CCA and PCA need a projection matrix of R * prod_{k=1}^{M} P_k entries; therefore, the required number of parameters is R * prod_{k=1}^{M} P_k + NR. It should be noted that MCCA and MPCA require a large amount of memory when the number of modes in a dataset is large, but their memory requirements are much smaller than those of CCA and PCA.
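The parameter counts above can be compared concretely. The following sketch evaluates both formulas for a hypothetical dataset of 400 images of size 28 x 28, with comparable compressed representations (8 x 8 = 64 latent variables for MCCA/MPCA versus R = 64 for CCA/PCA); the numbers are illustrative only.

```python
def mcca_params(P_modes, R_modes, N):
    """Sum of mode-wise projection matrices plus the projected data."""
    proj = sum(p * r for p, r in zip(P_modes, R_modes))
    core = N
    for r in R_modes:
        core *= r
    return proj + core

def cca_params(P_modes, R, N):
    """One large R x (prod P_k) projection plus N R-dimensional scores."""
    full = 1
    for p in P_modes:
        full *= p
    return R * full + N * R

# Hypothetical example: 400 images of 28 x 28 pixels, with comparable
# compressed sizes (8 x 8 = 64 latent variables either way).
m = mcca_params((28, 28), (8, 8), 400)
c = cca_params((28, 28), 64, 400)
assert m < c   # mode-wise factors are far cheaper than one big projection
```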

Experimental Settings
To demonstrate the efficacy of MCCA, we applied MCCA, PCA, CCA, and MPCA to image compression tasks.For the experiments, we prepared the following three image datasets.
The MNIST dataset consists of handwritten digit images (0, 1, ..., 9) of size 28 x 28 pixels. The dataset includes a training set of 60,000 images and a test set of 10,000 images. We used the first 10 training images of each class. The MNIST dataset (Lecun et al., 1998) is available at http://yann.lecun.com/exdb/mnist/.
The AT&T (ORL) face dataset contains grayscale facial images of 40 people, with 10 images of size 92 x 112 pixels for each person. We resized the images to half their original size in advance to improve the efficiency of the experiment.
The Cropped AR database has color facial images of 100 people, cropped around the face. The size of the images is 120 x 165 x 3 pixels. It contains 26 images per class, 12 of which are images of people wearing sunglasses or scarves. We used the cropped facial images of 50 males, excluding those 12 images. Due to memory limitations, we resized these images by a ratio of 0.25. The AR database (Martinez and Benavente, 1998; Martinez and Kak, 2001) is available at http://www2.ece.ohio-state.edu/~aleix/ARdatabase.html.
The detailed description is summarized in Table 1. When both R_1 and R_2 are more than eight, we can verify that both alpha^(1) and alpha^(2) go to one, as shown in Figure 2.

Experimental Analysis for the Performance of MCCA
In this experiment, we examined the performance of MCCA by comparing it with PCA, CCA, and MPCA. We applied these methods to the datasets described in Table 1.
For MCCA and MPCA, the reduced dimensions R_1 and R_2 were chosen to be equal, and we fixed R_3 at two. All computations were done in the software R (ver. 3.6) (R Core Team, 2019). MPCA was implemented with the function mpca from the package rTensor, and the implementations of MCCA, PCA, and CCA are available at https://github.com/yoshikawa-kohei/MCCA.
To assess the performance of the methods, we calculated the reconstruction error rate (RER) under the same compression ratio (CR). The RER is defined by

RER = || X - Xhat ||_F^2 / || X ||_F^2,   (6.1)

where Xhat = [Xhat^(1), Xhat^(2), ..., Xhat^(G)] is the aggregated dataset of reconstructed tensors Xhat^(g) = [Xhat_1^(g), Xhat_2^(g), ..., Xhat_{N_g}^(g)] for g = 1, 2, ..., G, and ||X||_F is the norm of a tensor X in R^{P_1 x P_2 x ... x P_M} computed by

||X||_F = ( sum_{p_1, p_2, ..., p_M} x_{p_1, p_2, ..., p_M}^2 )^{1/2},   (6.2)

in which x_{p_1, p_2, ..., p_M} is the (p_1, p_2, ..., p_M) element of X. In addition, we defined the compression ratio (CR) as the ratio of the number of parameters after compression to the number of parameters of the original data. The results show that the RER of MCCA is always the smallest at the same value of CR. The lowest approximation error indicates that MCCA performs better than any other method. Besides, even though CCA is a method for vector data, it performs better than MPCA when the value of CR is small. However, when the value of CR becomes slightly larger, MPCA attains a smaller RER than CCA. This implies the importance of developing analytical methods for tensor data.
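The evaluation metric can be sketched as follows, assuming RER = ||X - Xhat||_F^2 / ||X||_F^2 (an assumed form) on synthetic data. The mode-wise projection used for Xhat here is a plain eigenvector projection, a stand-in rather than the actual MCCA estimate.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((10, 6, 6))   # aggregated dataset of tensors

# Stand-in reconstruction: project each slice onto the top-3 eigenvectors
# of simple mode covariances (not the actual MCCA estimate).
S1 = sum(Xi @ Xi.T for Xi in X)
S2 = sum(Xi.T @ Xi for Xi in X)
V1 = np.linalg.eigh(S1)[1][:, -3:]
V2 = np.linalg.eigh(S2)[1][:, -3:]
Xhat = np.einsum("nab,ac,bd,ec,fd->nef", X, V1, V2, V1, V2)

# Assumed RER form: relative squared Frobenius error of the reconstruction.
rer = np.linalg.norm(X - Xhat) ** 2 / np.linalg.norm(X) ** 2
assert 0.0 <= rer <= 1.0
```

Because Xhat is an orthogonal projection of X, the RER lies in [0, 1] by construction.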
In addition, comparing (a), (b), and (c) in Figure 3, the value of CR at the intersection of CCA and MPCA gradually increases. This indicates that MPCA cannot extract a suitable latent space as the group size increases. Since MPCA does not consider the group structure, it cannot properly estimate the covariance structure when the group size is large.
In summary, our estimation algorithm consists of two steps: the first step initializes the parameters and the second step iteratively updates them. The initialization step gives initial values of the parameters near the global maximum for each mode. Then, by iteratively updating the parameters, we monotonically improve the objective function (3.7) until convergence.
Consequently, we obtain the mode-wise global maximum of the maximization problem (3.7) when the contraction ratio alpha^(k) goes to one for k = 1, 2, ..., M.

Regarding the computational cost, the expensive calculation steps of each iteration in Algorithm 1 are the formation of M(V_s^(k)) and its eigenvalue decomposition. To simplify the analysis of the computational complexity, we let P = max_j P_j for j = 1, 2, ..., M; then P is an upper bound of R_j for all j, and the complexities stated in Section 5 are upper bounds computed under this assumption.
The numbers of parameters required for MCCA and MPCA are sum_{k=1}^{M} P_k R_k + N * prod_{k=1}^{M} R_k, and those for CCA and PCA are R * prod_{k=1}^{M} P_k + NR. Figure 3 shows the line charts of RER obtained by estimating various ranks for the AT&T (ORL) dataset with group sizes of small, medium, and large. The points on the line charts are the RERs obtained by the actual estimations. Since the figures for the other datasets are similar to Figure 3, we provide them in the Supplementary Materials.

Table 1: The summary of the datasets.