## Abstract

Multiview alignment, achieving one-to-one correspondence of multiview inputs, is critical in many real-world multiview applications, especially for cross-view data analysis problems. An increasing amount of work has studied this alignment problem with canonical correlation analysis (CCA). However, existing CCA models are prone to misalign the multiple views due to either the neglect of uncertainty or the inconsistent encoding of the multiple views. To tackle these two issues, this letter studies multiview alignment from a Bayesian perspective. Delving into the impairments of inconsistent encodings, we propose to recover correspondence of the multiview inputs by matching the marginalization of the joint distribution of multiview random variables under different forms of factorization. To realize our design, we present adversarial CCA (ACCA), which achieves consistent latent encodings by matching the marginalized latent encodings through the adversarial training paradigm. Our analysis, based on conditional mutual information, reveals that ACCA is flexible for handling implicit distributions. Extensive experiments on correlation analysis and cross-view generation under noisy input settings demonstrate the superiority of our model.

## 1 Introduction

Multiview learning is the subfield of machine learning that considers learning from data with multiple feature sets. This paradigm has attracted increasing attention due to the emerging multiview data that have facilitated various real-world applications, such as video surveillance (Wang, 2013), information retrieval (Elkahky, Song, & He, 2015), and recommender systems (Elkahky et al., 2015). In these applications, it is critical to achieve instance-level multiview alignment, such that the multiple data streams exhibit accurate one-to-one correspondence (Li, Yang, & Zhang, 2018). For example, considering traditional multiview learning tasks, such as multiview classification (Qi et al., 2016) or multiview clustering (Chaudhuri, Kakade, Livescu, & Sridharan, 2009) on face images in video surveillance, the input data correspond to face images taken from different angles. In these cases, input feature sets with low one-to-one correspondence degrade the alignment of the multiple views, thus severely affecting the performance of the desired tasks. Multiview alignment plays an even more critical role in cross-view data analysis (Jia & Ruan, 2016) problems, namely, analyzing one view of the data given the input from the other view. For example, the cross-view retrieval task (Elkahky et al., 2015) is, given a query from one view, to search for the corresponding object in the other view; cross-view generation (Regmi & Borji, 2018) seeks to generate target objects given the cross-view inputs. Both are promising real-world applications in which alignment of the incorporated views is critical to performance.

Canonical correlation analysis (CCA) (Hotelling, 1936) provides a primary tool to study instance-level multiview alignment under a subspace learning mechanism (Xu, Tao, & Xu, 2013). In this setting, the instances of two views, $X$ and $Y$, are assumed to be generated from a common latent subspace $Z$; the alignment problem is to find two mapping functions, namely, $F(X)$ and $G(Y)$, such that the embeddings of corresponding input pairs are close to each other regarding the linear correlation. The instance $(x_i, y_i)$ is in exact correspondence if and only if $F(x_i) = G(y_i)$ (Ma & Fu, 2011). However, existing CCA models are prone to misalignment due to the neglect of uncertainty or the inconsistent encoding of the multiple views.
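The exact-correspondence condition $F(x_i) = G(y_i)$ suggests a simple diagnostic for instance-level alignment: check how often the embedding of $x_i$ has the embedding of $y_i$ as its nearest neighbor. A minimal numpy sketch (the helper name and the toy embeddings are ours, for illustration only):

```python
import numpy as np

def alignment_accuracy(zx, zy):
    """Fraction of instances whose embedding F(x_i) has G(y_i) as its
    nearest neighbor among all G(y_j): instance-level correspondence."""
    # pairwise squared Euclidean distances between the two embedding sets
    d2 = ((zx[:, None, :] - zy[None, :, :]) ** 2).sum(-1)
    return float((d2.argmin(axis=1) == np.arange(len(zx))).mean())

rng = np.random.default_rng(0)
z = rng.normal(size=(100, 5))
# perfectly aligned embeddings: F(x_i) == G(y_i)
print(alignment_accuracy(z, z))        # 1.0
# second view permuted, i.e., misaligned
print(alignment_accuracy(z, z[::-1]))  # 0.0
```

The same check underlies the retrieval-style alignment verification used later in the experiments.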

Following the principle of classic CCA, vanilla CCA models study multiview alignment with deterministic mapping functions (Oh et al., 2018). Such CCA models are prone to misaligning the multiple views since uncertainty is not considered. To be specific, the classic CCA obtains the shared latent space by maximally correlating the deterministic point embeddings, achieved with a linear mapping of the two views. Some work, such as kernel CCA (KCCA) (Lai & Fyfe, 2000), deep CCA (DCCA) (Andrew, Arora, Bilmes, & Livescu, 2013), and multiview autoencoder (MVAE) (Ngiam et al., 2011), extends the classic CCA with nonlinear mapping or through cross-view reconstruction to exploit nonlinear correlation for the alignment. The mapping functions $F(\cdot)$ and $G(\cdot)$ are nonlinear in these models. As depicted in Figure 1a, these methods all exploit the subspace $Z$ with deterministic point embeddings; namely, $z_x = F(x)$ and $z_y = G(y)$ are points in $\mathbb{R}^d$. Without an inference mechanism to evaluate the quality of the obtained latent codes, the mapping functions obtained in these models are susceptible to noisy inputs (Kendall & Gal, 2017), which can consequently result in misalignment of the multiple views. For example, for the observation circled 1 in Figure 1a, the inputs in the two views are clearly projected far apart in the embedding space: they are projected into different clusters, 5 and 2, respectively, while they are supposed to be close to each other around the ground-truth cluster 7. Moreover, without prior regularization on the shared subspace, these models do not allow easy latent interpolation, since their latent spaces are discontinuous. In such cases, the training samples are encoded into nonoverlapping zones chaotically scattered across the space, with "holes" between the zones where the model has never been trained (Tolstikhin, Bousquet, Gelly, & Schoelkopf, 2017). Therefore, these models cannot facilitate the cross-view generation task, since the generation results are quite likely to be unrealistic.

Generative CCA models, such as probabilistic CCA (PCCA) (Bach & Jordan, 2005), variational CCA (VCCA) (Wang, Yan, Lee, & Livescu, 2016), and the multichannel variational autoencoder (MCVAE) (Antelmi, Ayache, Robert, & Lorenzi, 2019), overcome the issue with probability. However, they suffer from misalignment due to the impairments of inconsistent encodings. Specifically, these models adopt the Kullback-Leibler divergence (KL-divergence) between the encodings of individual input examples, that is, $Q(Z|X=x)$ and $Q(Z|Y=y)$, and the prior $P_0(Z)$ as the criterion to match the latent encodings of different views. However, this constraint simply forces the matching of the encodings of individual inputs to the common prior (Tolstikhin et al., 2017). Even if the constraint is satisfied, the encodings of the data samples from the two views can intersect. In this way, the correspondence between the latent codes of paired inputs is violated. Such inconsistent latent encodings cause one-to-many correspondence between the instances of the incorporated views, indicating that the multiple views are misaligned. As depicted in Figure 1b, although all of these latent encodings match the prior, the encodings of the instances from the two views intersect in the common latent space. This creates ambiguity about the correspondence between the instances in the two views; for example, both 1 and 2 (the circled numbers) exhibit one-to-many correspondence. Such inconsistency not only weakens the alignment of the two spaces but also influences the quality of data reconstruction. Moreover, to achieve a tractable solution for the inference, these models restrict the latent space with a simple gaussian prior, $p_0(z) \sim \mathcal{N}(0, I_d)$, so that the constraint can be computed analytically. However, such a prior is not expressive enough to capture the true posterior distributions (Mescheder, Nowozin, & Geiger, 2017). Therefore, the latent space may not be expressive enough to preserve the instance-level correspondence of the data samples. These impairments lead to an inferior alignment of the multiple views and thus also degrade the models' performance in cross-view generation tasks.

To tackle the issues we have noted, in this letter we study instance-level multiview alignment from a Bayesian perspective. With an in-depth analysis of existing CCA models with respect to latent distribution matching, we identify the impairments of inconsistent encodings in the existing CCA models. We then propose to recover the consistency of the multiple views, and thereby boost cross-view generation performance, by matching the marginalization of the joint distribution of multiview random variables under different forms of factorization (see equation 3.3). To realize our marginalization design, we present adversarial CCA (ACCA), which achieves consistent latent encoding of the multiple views by matching the marginalized posteriors to flexible prior distributions through the adversarial training paradigm. Analyzing the conditional independence assumption in CCA with conditional mutual information (CMI), we reveal that, compared with existing CCA methods, our ACCA is flexible for handling implicit distributions. The contributions of this work can be summarized as follows:

We provide a systematic study of CCA-based instance-level multiview alignment. We identify the impairments of inconsistent encodings in the existing CCA models and propose to study multiview alignment based on the marginalization principle of Bayesian inference, to recover the consistency of the multiple views.

We design adversarial CCA (ACCA), which achieves consistent latent encoding of the multiple views and is flexible for handling implicit distributions. To the best of our knowledge, we are the first to elaborate on the superiority of adversarial learning in the multiview alignment scenario.

We analyze the connection of ACCA and existing CCA models based on CMI and reveal the superiority of ACCA, which benefits from consistent latent encoding. Our CMI-based analysis and consistent latent encoding can provide insights for a flexible design of other CCA models for multiview alignment.

The rest of this letter is organized as follows. In section 2, we review the existing CCA models regarding latent distribution matching. In section 3, we elaborate on our design to study multiview alignment through marginalization and present our ACCA design. In section 4, we discuss the advantages of our model by comparing existing models based on CMI. In section 5, we demonstrate the superior alignment performance of ACCA with model verification and various real-world applications. Section 6 concludes the letter and envisions future work.

## 2 Deficiencies of Existing CCA Models

In this section, we review the multiview alignment achieved with existing CCA models in terms of latent distribution matching.

### 2.1 Vanilla CCA Models and the Neglect of Uncertainty

Vanilla CCA models are prone to misalignment since data uncertainty is not considered.

Canonical correlation analysis (CCA) (Hotelling, 1936) is a powerful statistical tool for multiview data analysis. Let $\{x^{(i)}, y^{(i)}\}_{i=1}^{N}$ denote the collection of $N$ independent and identically distributed (i.i.d.) samples with pairwise correspondence in a multiview scenario. (In the following, we use $(x, y)$ to denote any one instance in this set, for simplicity.) Classic CCA aims to find linear projections of the two views, $(W_x^\top X, W_y^\top Y)$, such that the correlation between the projections is mutually maximized, namely,
$$\max_{W_x, W_y} \operatorname{corr}(W_x^\top X, W_y^\top Y) = \frac{W_x^\top \Sigma_{xy} W_y}{\sqrt{W_x^\top \Sigma_{xx} W_x}\,\sqrt{W_y^\top \Sigma_{yy} W_y}},$$
where $\Sigma_{xx}$ and $\Sigma_{yy}$ are the covariances of $X$ and $Y$, and $\Sigma_{xy}$ denotes the cross-covariance. With linear projections, classic CCA simply exploits the linear correlation among the multiple views to achieve alignment. This is often insufficient for analyzing complex real-world data that exhibit higher-order correlations (Suzuki & Sugiyama, 2010).
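For reference, the classic CCA objective above can be solved via the singular value decomposition of the whitened cross-covariance $\Sigma_{xx}^{-1/2}\Sigma_{xy}\Sigma_{yy}^{-1/2}$, whose singular values are the canonical correlations. A hedged numpy sketch (the regularization constant and the toy two-view data are our choices, not from the letter):

```python
import numpy as np

def linear_cca(X, Y, reg=1e-8):
    """Classic CCA: canonical correlations are the singular values of
    Sxx^{-1/2} Sxy Syy^{-1/2} (small ridge `reg` for stability)."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = len(X)
    Sxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n

    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)          # S is symmetric positive definite
        return V @ np.diag(w ** -0.5) @ V.T

    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.linalg.svd(T, compute_uv=False)

# toy two-view data generated from one shared latent signal z
rng = np.random.default_rng(0)
z = rng.normal(size=(5000, 1))
X = np.hstack([z, rng.normal(size=(5000, 2))])   # view 1: z plus noise dims
Y = np.hstack([z, rng.normal(size=(5000, 2))])   # view 2: z plus noise dims
rho = linear_cca(X, Y)
print(rho[0])   # top canonical correlation, close to 1
```

The top canonical correlation recovers the shared signal, while the remaining correlations stay near zero since the other dimensions are independent noise.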

Without an inference mechanism that can evaluate the quality of the obtained embeddings, these methods are vulnerable to misaligning the multiple views when given noisy inputs (Tolstikhin et al., 2017). As depicted in Figure 1a, for noisy halved images of the digit 7, the two views are misaligned in the latent space, since their embeddings are scattered far apart and are even chaotically embedded in the different clusters of 2 and 5, respectively. Moreover, these models cannot facilitate cross-view generation tasks very well, since the obtained subspace is discontinuous under such deterministic mappings. Consequently, interpolations of the latent space would lead to unrealistic generation results.

### 2.2 Generative CCA Models and Inconsistent Latent Encodings

Generative CCA models overcome the uncertainty issue by modeling probability. However, they still suffer from misalignment due to the impairments of inconsistent encodings, caused by the limitation of the KL-divergence alignment criterion.

*VCCA and Bi-VCCA*. Both methods minimize a reconstruction cost together with the KL-divergence to regularize the alignment. VCCA penalizes the discrepancy between a single-view encoding and the prior, $D_{KL}(Q(Z|X=x)\,\|\,P_0(Z))$, based on a preference for one of the two views. The two views are not well aligned, since the information in the other view is not exploited. It also cannot handle the cross-view generation task due to this missing encoding. Bi-VCCA overcomes the limitation by a heuristic combination of the KL-divergence terms obtained with both encodings, $Q(Z|X=x)$ and $Q(Z|Y=y)$, with $\lambda$ controlling the trade-off. To achieve a tractable solution for the inference, the latent space is restricted to being gaussian distributed, $P_0(Z) \sim \mathcal{N}(\mu, \Sigma)$, so that the KL-divergence can be computed analytically. Its objective is given as
$$\mathcal{L}_{\text{Bi-VCCA}} = \lambda \Big( \mathbb{E}_{q(z|x)}\big[\log p(x|z) + \log p(y|z)\big] - D_{KL}\big(q(z|x)\,\|\,p_0(z)\big) \Big) + (1-\lambda) \Big( \mathbb{E}_{q(z|y)}\big[\log p(x|z) + \log p(y|z)\big] - D_{KL}\big(q(z|y)\,\|\,p_0(z)\big) \Big).$$

#### 2.2.1 Impairments of Inconsistent Latent Encodings

Since there exists an encoding and decoding mechanism for each of the views in generative CCA models, the instance-level alignment of the views can be verified by cross-view generation. Specifically, if the two views are well aligned, the encoding from one view can then recover the corresponding data in the other view. In such circumstances, we define the encoding of the two views to be consistent. Therefore, the consistency of the multiview encodings is a necessary condition for multiview alignment in generative CCA models.

However, the methods we have mentioned misalign the multiple views due to the inconsistent latent encodings caused by the inferior alignment criterion, $D_{KL}(Q(Z|X)\,\|\,Q(Z|Y))$. First, this criterion can only match the encodings of individual data samples, while causing inconsistent encoding of the views. As depicted in Figure 1b, in the multiview learning scenario, it simply forces the encoding from each view, $Q(Z|X=x)$ and $Q(Z|Y=y)$, of all the different input examples to individually match the common prior $P_0(Z)$. In this way, the latent encodings from the two views intersect in the common latent space. This intersection disorganizes the consistency of the encodings in the latent space and thus reduces the instance-level alignment of the two input views. The misalignment also influences the quality of data reconstruction and generation. Both deficiencies are crucial for cross-view generation tasks. In addition, to compute the KL-divergence analytically, all of these methods require the incorporated distributions, that is, the prior $P_0(Z)$ and the posteriors of each view, $Q(Z|X)$ and $Q(Z|Y)$, to be simple. However, such a restriction can lead to inferior inference models that are not expressive enough to capture the true posterior distribution (Mescheder et al., 2017). The inexpressiveness of the latent space further limits the models' ability to preserve the instance-level correspondence of the data samples.
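The "simple distributions" requirement is concrete: with a diagonal gaussian posterior $\mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ and a standard normal prior, the per-sample KL term reduces to the closed form $\frac{1}{2}\sum_j(\sigma_j^2 + \mu_j^2 - 1 - \log\sigma_j^2)$, which is what these models actually optimize. A small sketch (the function name is ours):

```python
import numpy as np

def kl_diag_gauss_to_std_normal(mu, logvar):
    """Closed-form D_KL( N(mu, diag(exp(logvar))) || N(0, I) ):
    0.5 * sum( sigma^2 + mu^2 - 1 - log sigma^2 ).
    This is the per-sample regularizer that forces simple posteriors."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

print(kl_diag_gauss_to_std_normal(np.zeros(4), np.zeros(4)))  # 0.0
print(kl_diag_gauss_to_std_normal(np.ones(4), np.zeros(4)))   # 2.0
```

The closed form exists only for this restricted pairing of prior and posterior, which is exactly the expressiveness limitation discussed above.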

## 3 Multiview Alignment via Consistent Latent Encoding

### 3.1 Multiview Alignment through Marginalization

The KL-divergence criterion adopted in existing CCA models causes the impairments of inconsistent encodings in two ways:

Primarily, it causes inconsistent latent encoding of the two views, since it simply matches the encodings of individual data samples.

It further restricts the expressiveness of the latent space regarding the instance-level correspondence, since it can only incorporate simple priors directly.

To exploit a better criterion that benefits the alignment, that is, the instance-level consistency, of the multiple views, we study multiview alignment from a Bayesian perspective.

Based on the marginalization principle of Bayesian inference (Tipping, 2003; Jaynes, 1978), we propose to facilitate consistent latent encoding by simultaneously matching the multiview encodings whose condition variables are all integrated out. We first eliminate the misalignment induced by the intersection of the individual sample encodings by marginalizing the encodings from multiple views and then constrain the marginalized encodings to overlap with the prior $p0(z)$ simultaneously.

Compared with the KL-divergence, which harshly matches the conditional distribution of each sample to the prior, our proposed constraint matches the marginal distributions, $\int q(z|x) p(x)\,dx \approx p_0(z)$. Since we take the input of the conditional variables into consideration, this constraint is tolerant to variability in the input data. This property also makes it well suited to matching multiview encodings: $\int q(z|x) p(x)\,dx \approx \int q(z|y) p(y)\,dy \approx \iint q(z|x,y) p(x,y)\,dx\,dy \approx p_0(z)$. The multiview alignment can be further improved by expanding the expressiveness of the latent space with more complex prior distributions (Mathieu, Rainforth, Siddharth, & Teh, 2019).
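The marginal-matching constraint can be approximated sample-wise by ancestral sampling: draw $x \sim p(x)$ from the data, then $z \sim q(z|x)$ from the stochastic encoder; the resulting codes are samples from the marginalized posterior $\int q(z|x)p(x)\,dx$ and can be compared directly against prior samples. A toy sketch (the gaussian encoder below is an assumed stand-in, not a model from the letter):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_marginal(xs, encode):
    """Draw z ~ q(z) = ∫ q(z|x) p(x) dx by ancestral sampling:
    x comes from the data, z from the stochastic encoder q(z|x)."""
    return np.array([encode(x) for x in xs])

# toy stochastic encoder (assumed for illustration): q(z|x) = N(tanh(x), 0.1^2)
encode = lambda x: rng.normal(np.tanh(x), 0.1)

xs = rng.normal(size=2000)            # samples from p(x)
z_marg = sample_marginal(xs, encode)  # samples from the marginalized posterior
# z_marg can now be compared against samples of the prior p0(z),
# e.g., by a discriminator or an MMD estimate
print(z_marg.shape)   # (2000,)
```

No density evaluation is needed at any point, which is what lets the constraint handle implicit distributions.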

### 3.2 Adversarial CCA with Consistent Latent Encoding

To realize our design, we present ACCA, which provides consistent latent encoding by matching the marginalized latent encodings to flexible priors through the adversarial training paradigm. We adopt two schemes to facilitate consistent latent encodings in ACCA.

First, to provide different factorization forms for the joint distribution of multiview data, we provide holistic information for the latent encodings, $q(z|x,y)$, $q(z|x)$ and $q(z|y)$, in ACCA. Besides the two principal encodings, $q(z|x)$ and $q(z|y)$, which support cross-view analysis, we explicitly model $q(z|x,y)$ by encoding an auxiliary view $XY$ that contains all the information of the two views. With the encoding from this auxiliary view, the latent space is more expressive for the correspondence of the multiple views.

Second, we match the marginalization of these holistic encodings simultaneously with the adversarial learning technique. Adversarial learning minimizes the Jensen-Shannon (JS) divergence between two distributions through binary classification directly on samples of the two distributions (Goodfellow et al., 2014). Consequently, any two distributions can be matched as long as their samples are provided; the explicit forms of the distributions are not required. We adopt adversarial learning as the criterion to match the marginalization of all three encodings to an arbitrary fixed prior $p_0(z)$ in ACCA. To be specific, we apply an adversarial distribution matching scheme on the common latent space. Within this scheme, each encoder acts as a generator that defines a marginalized posterior over $z$ (Makhzani, Shlens, Jaitly, Goodfellow, & Frey, 2015) in equation 3.2. The obtained latent codes of individual data instances are samples of the corresponding marginalized posteriors, $q^*(z)$. The three marginalized posteriors are constrained to match one another by simultaneously matching the same prior $p_0(z)$, namely equation 3.3, with a shared discriminator (Hoang, Nguyen, Le, & Phung, 2018). We present the formulation of the proposed constraints in section 3.2.1.
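The shared-discriminator idea can be illustrated with the simplest possible discriminator: a logistic regression trained to separate prior samples from samples of one marginalized posterior. (In ACCA the discriminator is an MLP shared by all three posteriors and the encoders are updated to fool it; everything below is a toy stand-in with names of our own.)

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

class SharedDiscriminator:
    """Toy logistic-regression discriminator D(z): outputs the
    probability that a latent code z came from the prior p0(z)."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.b = 0.0
        self.lr = lr

    def prob_prior(self, z):
        return sigmoid(z @ self.w + self.b)

    def update(self, z_prior, z_post):
        # one gradient-ascent step on log D(z_prior) + log(1 - D(z_post))
        for z, target in ((z_prior, 1.0), (z_post, 0.0)):
            g = target - self.prob_prior(z)   # logistic log-likelihood gradient
            self.w += self.lr * (g[:, None] * z).mean(axis=0)
            self.b += self.lr * g.mean()

# samples of the prior p0(z) and of a mismatched marginalized posterior q*(z)
z_prior = rng.normal(0.0, 1.0, size=(4000, 2))
z_post = rng.normal(2.0, 1.0, size=(4000, 2))   # encoder output, off target

D = SharedDiscriminator(dim=2)
for _ in range(200):
    D.update(z_prior, z_post)

# a trained D separates the two sets; its output on encoder samples is the
# adversarial signal that would push the encoders toward the prior
print(D.prob_prior(z_prior).mean())   # high: recognized as prior
print(D.prob_prior(z_post).mean())    # low: recognized as posterior
```

In the full model, the encoder step would update $q^*(z)$ to raise $D$'s output on its samples, and at equilibrium all marginalized posteriors become indistinguishable from $p_0(z)$.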

Consequently, our ACCA realizes the proposed marginalization design by adversarially matching the marginalized posteriors with a common and flexible prior distribution. As shown in Table 1, our ACCA improves on the existing generative CCA models in three ways:

It recovers the consistency of the multiple views by matching the marginalization of the holistic encodings. This contributes to the consistent latent encoding of the multiple views, which benefits the multiview alignment.

It avoids the gaussian distribution restriction on $p(z)$. Instead of computing the criterion analytically, adversarial learning provides an efficient estimation of the JS-divergence between the encodings (Goodfellow et al., 2014). This helps ACCA handle expressive latent space with flexible prior distributions.

It does not require explicit distribution assumptions on the posterior $p(z|x,y)$. The adversarial learning scheme matches the incorporated distributions implicitly. Thus, it can benefit the model by omitting the sampling operation required in other generative CCA models (e.g., VCCA and MCVAE).

| Category | Methods | Nonlinear Mapping | Criterion | Consistent Encoding | Avoids Gaussian Restriction on $p(z)$ | Implicit Posteriors $p(z|x,y)$ |
| --- | --- | --- | --- | --- | --- | --- |
| Vanilla CCA models | CCA | ✗ | Linear correlation | ✗ | ✗ | - |
| | KCCA | ✓ | Linear correlation | ✗ | ✗ | - |
| | DCCA | ✓ | Linear correlation | ✗ | ✗ | - |
| | DCCAE | ✓ | Linear correlation | ✗ | ✗ | - |
| | MVAE | ✓ | - | - | - | - |
| Generative CCA models | PCCA | ✗ | KL-divergence | ✗ | ✗ | ✗ |
| | VCCA | ✓ | KL-divergence | ✗ | ✗ | ✗ |
| | Bi-VCCA | ✓ | KL-divergence | ✗ | ✗ | ✗ |
| | MCVAE | ✓ | KL-divergence | ✗ | ✗ | ✗ |
| | ACCA (ours) | ✓ | Adversarial learning | ✓ | ✓ | ✓ |


A diagram of ACCA is presented in Figure 2d. Note that the three encodings are all essential in ACCA. First, the encodings of the principal views, $q(z|x)$ and $q(z|y)$, are essential to facilitate cross-view analysis with generative CCA methods. Second, the encoding of the auxiliary view, $q(z|x,y)$, contributes to a latent space that better encodes the correspondence of the multiple views and thus benefits the multiview alignment achieved in ACCA. Indeed, one can achieve expressive representations for the multiview data with only the auxiliary encoding. However, this is not the focus of our work. We further emphasize the significance of the auxiliary view and the superiority achieved with the adversarial learning in section 4.2.

#### 3.2.1 Formulation

In practice, ACCA is jointly trained by alternately updating the reconstruction and regularization phases. In the reconstruction phase, we update the encoders and the decoders to minimize the reconstruction error of the two principal views. In the regularization phase, the adversarial networks, with multiple encoders or generators, are trained following the same alternating procedure as in Hoang et al. (2018). Once the training procedure is done, the encoders define expressive encodings for each view.

## 4 Connection to Other Models

In this section, we discuss the connection between ACCA and other models.

### 4.1 Understanding CCA Models with CMI

From a Bayesian perspective, general CCA models come with the assumption that the two views, $X$ and $Y$, are conditionally independent given the latent variable $Z$ (i.e., equation 2.1) to achieve a tractable solution for inference. However, this assumption is hard to verify in real multiview analysis problems that incorporate complex distributions. Here, we analyze this inherent assumption of CCA with conditional mutual information (CMI).

Although equation 4.2 presents an objective for minimizing CMI, it is hard to optimize, since the posterior $p_\theta(z|x,y)$ is unknown or intractable for practical multiview learning problems. Consequently, existing methods make different assumptions on the incorporated distributions (e.g., prior, likelihood, and posterior) and adopt approximate inference methods to achieve tractable solutions for multiview analysis.

### 4.2 ACCA versus Existing CCA Methods

Based on our analysis, we emphasize the superiority of the proposed ACCA (Figure 1c) over the other CCA prototypes:

The adversarial learning criterion enables ACCA to achieve a tractable solution for multiview analysis with many flexible prior and posterior distributions. This benefits the expressiveness of the obtained aligned latent space.

The adversarial learning criterion leads to consistent latent encoding in ACCA by matching the marginalization of the incorporated distributions and thus helps ACCA achieve better instance-level alignment for the multiple views.

Appending $q(z|x,y)$ with the auxiliary view $XY$, our ACCA can better estimate the CMI minimization objective (equation 4.2) compared with other variants that simply adopt the encodings from individual views, that is, $q(z|x)$ and $q(z|y)$.

Some works adopt additional penalties, such as a sparsity constraint (Shi, Xu, Pan, Tsang, & Pan, 2019), on these prototypes to enhance multiview alignment. For instance, Kidron, Schechner, and Elad (2007) extend classic CCA with sparsity to enhance its performance on cross-modal localization tasks. Jia, Salzmann, and Darrell (2010) introduce structured sparsity into MVAE. Virtanen, Klami, and Kaski (2011) propose a generative CCA variant that also adopts the KL-divergence as the criterion, with only an additional group sparsity assumption to improve the variational approximation. Note that we can also extend ACCA with corresponding structural priors to enhance the alignment of the multiple views (Mathieu et al., 2019).

Some other work extends these prototypes by further exploiting view-specific information. Besides the multiview shared information, these variants also consider specific information in each view to benefit the alignment task. For example, VCCA-private extends Bi-VCCA by introducing two hidden variables, $hx$ and $hy$, to capture the private information that is not captured with the common variable, $Z$. It adopts two extra KL-divergence constraints to match the encoding of the private variables (see equation 4.3) in Wang et al. (2016). Our ACCA can also be extended with such private variables and additional discriminators to further enhance the alignment.

In this sense, the multiview information bottleneck (MVIB) can be regarded as an extension of multiview alignment that further incorporates superfluous information to handle supervised downstream tasks. We can also apply the inference method developed in ACCA to solve the generative variant of the MVIB objective.

Note that in this work, we focus on studying the classic CCA prototypes in terms of multiview alignment for data generation. Consequently, the CCA variants with additional penalties or view-specific variables are not the main baselines for comparison here. MVIB is not comparable here, since it cannot facilitate generation.

### 4.3 ACCA versus Adversarial Autoencoders

Also highly relevant to our ACCA are *adversarial autoencoders* (AAEs) (Makhzani et al., 2015). AAEs adopt adversarial distribution matching, in place of the KL-divergence term of variational autoencoders (VAEs), to regularize the latent space of autoencoders. Compared with AAEs, our ACCA extends adversarial distribution matching to the multiview scenario to facilitate multiview alignment, especially for cross-view generation tasks. By analyzing the conditional independence assumption in CCA with CMI, we also show why our model achieves superior alignment of multiple views through consistent latent encoding.

### 4.4 Instance-Level Alignment versus Distribution-Level Alignment

In this work, we study instance-level multiview alignment with CCA to achieve correspondence of the instance embeddings obtained from each view. There is also work that studies the distribution-level alignment of multiple views. This other work focuses on aligning the marginal distributions of the views, $P(X)$ and $P(Y)$, without considering the pairwise correspondence for each instance. Cross-view generation in such a setting is regarded as a style transfer task (Ganin et al., 2016). For example, cycle-GAN (Zhu, Park, Isola, & Efros, 2017) studies unsupervised image translation in two domains by modeling cycle consistency. UNIT (Liu, Breuel, & Kautz, 2017) and MUNIT (Huang, Liu, Belongie, & Kautz, 2018) study the same task by incorporating a common latent space into cycle-GAN. Conditional GANs are adopted to facilitate the cross-view image synthesis task (Regmi & Borji, 2018). Because this is not the focus of our letter, we do not discuss them further.

## 5 Experiments

In this section, we evaluate the performance of our ACCA regarding multiview alignment and generation. We first discuss the advantages of ACCA in section 5.1. Then we show the superiority of ACCA in achieving multiview alignment in three respects. We conduct correlation analysis to show that ACCA captures more of the nonlinear correlation among the multiple views in section 5.2. We present alignment verification to show that ACCA achieves better instance-level correspondence in the latent space in section 5.3. We conduct several cross-view analysis tasks with noisy inputs to show the robustness of ACCA in achieving instance-level alignment of the multiple views in section 5.4.

We also evaluate the quality of obtained embeddings regarding downstream supervised tasks, to demonstrate that our ACCA facilitates superior alignment without sacrificing the discriminative property of the representation. The experiments regarding clustering and classification are presented in sections 5.4.3 and 5.5, respectively.

Note that our work targets instance-level multiview alignment and generation. Consequently, we emphasize the evaluation in the first few subsections: the correspondence preserved in the latent embeddings and how well the correspondence can be recovered from the obtained latent spaces through cross-view generation. The evaluation of the discriminative property of the latent embeddings is presented for illustration.

In section 5.6, we present a preliminary study on the influence of view-specific variables for alignment and generation as future work.

### 5.1 Superiority of Adversarial Criterion for Multiview Alignment

We first examine the benefits achieved with the adversarial learning alignment criterion: consistently matching the marginalized latent encodings with flexible priors.

#### 5.1.1 Consistent Encoding in ACCA

We verify the consistent encoding in ACCA with one of the most commonly used multiview learning data sets: the MNIST left/right halved data set (MNIST_LR) (Andrew et al., 2013). Details about the data set and network design are shown in Table 2.

| Data set | Statistics | Dimension of $z$ | Network Setting (MLPs), $\hat{D}=\{1024,1024,1024\}$ | Parameters |
| --- | --- | --- | --- | --- |
| Toy data set (simulated) | #Tr = 8,000; #Te = 2,000 | d = 10 | $E_x=\{1024,1024\}$; $E_{xy}=\{1024,1024\}$; $E_y=\{1024,1024\}$ | For all data sets: learning rate = 0.001, epoch = 100. For each data set: batch size tuned over $\{16,32,128,256,500,512,1000\}$; $d$ tuned over $\{10,30,50,100\}$ |
| MNIST L/R halved data set (MNIST_LR) (Andrew et al., 2013) | #Tr = 60,000; #Te = 10,000 | d = 30 | $E_x=\{2308,1024,1024\}$; $E_{xy}=\{3916,1024,1024\}$; $E_y=\{1608,1024,1024\}$ | |
| MNIST noisy data set (MNIST_Noisy) (Wang et al., 2016) | #Tr = 60,000; #Te = 10,000 | d = 50 | $E_x=\{1024,1024,1024\}$; $E_{xy}=\{1024,1024,1024\}$; $E_y=\{1024,1024,1024\}$ | |
| Wisconsin X-ray microbeam database (XRMB) (Wang et al., 2016) | #Tr = 1.4M; #Te = 0.1M | d = 112 | $E_x=\{1811,1811\}$; $E_{xy}=\{3091,3091\}$; $E_y=\{1280,1280\}$ | |

Data set . | Statistics . | Dimension of $z$ . | Network Setting (MLPs) $D^={1024,1024,1024}$ . | Parameters . |
---|---|---|---|---|

Toy data set (Simulated) | # Tr = 8,000# Te = 2,000 | d = 10 | $Ex={1024,1024}$;$Exy={1024,1024}$;$Ey={1024,1024}$ | For all the data set: learning rate = 0.001, epoch = 100. For each data set: batch size tuned over ${16,32,128,256,500,512,1000}$; $d$ tuned over ${10,30,50,100}$ |

MNIST L/R halved data set (MNIST_LR) (Andrew et al., 2013) | # Tr = 60,000# Te = 10,000 | d = 30 | $Ex={2308,1024,1024}$;$Exy={3916,1024,1024}$;$Ey={1608,1024,1024}$ | |

MNIST noisy data set (MNIST_Noisy) (Wang et al., 2016) | # Tr = 60,000# Te = 10,000 | d = 50 | $Ex={1024,1024,1024}$;$Exy={1024,1024,1024}$;$Ex={1024,1024,1024}$ | |

Wisconsin X-ray microbeam database (XRMB) (Wang et al., 2016) | # Tr = 1.4M# Te = 0.1M | d = 112 | $Ex={1811,1811}$;$Exy={3091,3091}$;$Ey={1280,1280}$ |

To verify that the three encodings in ACCA approximate one another, we estimate the distribution distances among the three posterior distributions with kernel maximum mean discrepancy (MMD) (Gretton, Borgwardt, Rasch, Schölkopf, & Smola, 2012). We assign a gaussian mixture prior (see equation 5.1) for ACCA and then calculate the sum of the MMD distances between the three encodings and the prior $p_0(z)$ in equation 3.2 during the training process. Figure 4 shows that the distance gradually decreases as ACCA converges. This trend verifies that ACCA can facilitate the matching of nongaussian marginalized posteriors, that is, consistent encoding (see equation 3.3).
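For reference, the MMD estimate used here can be sketched as follows. This is a minimal biased estimator with an RBF kernel; the bandwidth `sigma` and the sample shapes are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel matrix between the rows of a and b."""
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma**2))

def mmd2(x, y, sigma=1.0):
    """Biased estimate of the squared kernel MMD between two sample sets."""
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2 * rbf_kernel(x, y, sigma).mean())
```

In the experiment, the reported quantity would then be the sum of `mmd2` between each encoder's output samples and samples drawn from the gaussian mixture prior.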

We also estimate CMI during the model training process with an open-source nonparametric entropy estimation toolbox.^{1} The right panel of Figure 4 illustrates that the CMI gradually decreases during the training of ACCA and reaches zero at a relatively early stage in the convergence of ACCA. The trend indicates that ACCA implicitly minimizes CMI, and the optimal, $I(X;Y|Z)=0$, can be achieved at its convergence. Consequently, the explicit conditional independent constraint (see equation 2.1) of CCA can be automatically satisfied in our ACCA.

#### 5.1.2 Flexibility of Prior Encoding in Alignment

We conduct correlation analysis on a toy data set with a nongaussian prior to verify that ACCA benefits from handling implicit distributions for multiview alignment.

##### Toy data set.

##### Dependency metric.

The Hilbert-Schmidt independence criterion (HSIC) (Gretton, Bousquet, Smola, & Schölkopf, 2005) is a commonly used measure of the overall dependency among variables. In this work, we adopt the normalized estimate of HSIC (nHSIC) (Wu, Zhao, Tsai, Yamada, & Salakhutdinov, 2018) as the metric to measure the dependency captured by the embeddings of the test set ($Z_X^{Te}$ and $Z_Y^{Te}$) of each method. We report the nHSIC computed with both the linear and the RBF kernel ($\sigma$ is set with the F-H distance between the points).
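As a reference point, a minimal nHSIC estimate in its normalized, centered-kernel form might look like the following; the fixed bandwidth is a placeholder rather than the F-H distance heuristic used in the experiments.

```python
import numpy as np

def _rbf(a, sigma):
    """RBF kernel matrix of one sample set with itself."""
    sq = np.sum(a**2, 1)[:, None] + np.sum(a**2, 1)[None, :] - 2 * a @ a.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma**2))

def center(k):
    """Double-center a kernel matrix: HKH with H = I - (1/n) 11^T."""
    n = k.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    return h @ k @ h

def nhsic(zx, zy, kernel="linear", sigma=1.0):
    """Normalized HSIC between paired embeddings (rows = paired samples)."""
    if kernel == "linear":
        kx, ky = zx @ zx.T, zy @ zy.T
    else:
        kx, ky = _rbf(zx, sigma), _rbf(zy, sigma)
    kxc, kyc = center(kx), center(ky)
    return float(np.sum(kxc * kyc) / (np.linalg.norm(kxc) * np.linalg.norm(kyc)))
```

By construction `nhsic(z, z)` equals 1, and the score is invariant to rescaling of either view under the linear kernel.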

##### Baselines.

We compare^{4} ACCA with several state-of-the-art vanilla CCA variants here.

- **CCA** (Hotelling, 1936): Linear CCA model that learns linear projections of the two views that are maximally correlated.
- **PCCA** (Bach & Jordan, 2005): Probabilistic variant of linear CCA.
- **DCCA** (Andrew et al., 2013): Deep CCA, a nonlinear CCA extension with DNNs.
- **MVAE** (Ngiam et al., 2011): Multiview autoencoders, a CCA variant that discovers the dependency among the data via multiview reconstruction.
- **Bi-VCCA** (Wang et al., 2016): Bi-deep variational CCA, a representative generative nonlinear CCA model restricted to a gaussian prior.
- **ACCA_NoCV**: A variant of ACCA designed without the encoding of the complementary view $XY$, used to verify the efficiency of the holistic encoding scheme in ACCA.
- **ACCA (G)**: ACCA implemented with the standard gaussian prior.
- **ACCA (GM)**: ACCA implemented with the exact gaussian mixture prior.

Since ACCA handles posterior distributions implicitly, its latent space can be more expressive in revealing the correspondence of the multiple views than those of the baselines, which can directly handle only simple gaussian priors. (An additional sampling procedure is required for these methods to handle other complex distributions.) Consequently, higher nonlinear dependency is expected from ACCA, especially when it is given the exact prior of the multiview data set. Table 3 reports the dependency captured in the common latent space of each method. The results are revealing in several ways:

Both CCA and PCCA achieve low nHSIC value on the toy data set, due to their inability to capture nonlinear dependency.

DCCA achieves higher nHSIC scores than the other baselines because its objective directly targets higher linear correlations. However, its results are still inferior to those of our methods.

The results of MVAE and Bi-VCCA are unsatisfactory. MVAE performs poorly because it lacks an inference mechanism to regularize the encodings; Bi-VCCA obtains inferior results mainly because of the inconsistent encoding problem caused by its inferior alignment criterion.

Our ACCA model achieves good performance here. This indicates that the consistent encoding imposed by the adversarial distribution matching benefits the model's ability to capture nonlinear dependency.

ACCA (GM) achieves the best result in both settings. This verifies that ACCA benefits from its ability to handle implicit distributions.

| Metric | Data Set | CCA | PCCA | DCCA | MVAE | Bi-VCCA | ACCA_NoCV | ACCA (G) | ACCA (GM) |
|---|---|---|---|---|---|---|---|---|---|
| nHSIC (linear kernel) | Toy | 0.0010 | 0.1037 | 0.5353 | 0.1428 | 0.1035 | 0.8563 | 0.7296 | **0.9595** |
| | MNIST_LR | 0.4210 | 0.3777 | 0.6699 | 0.2500 | 0.4612 | 0.5233 | 0.5423 | **0.6823** |
| | MNIST_Noisy | 0.0817 | 0.1037 | 0.1460 | 0.4089 | 0.1912 | 0.3343 | 0.3285 | **0.4133** |
| | XRMB | 0.0574 | 0.0416 | 0.2970 | 0.2637 | 0.1046 | 0.1244 | 0.2903 | **0.3482** |
| | Maps | - | - | 0.3465 | 0.4423 | 0.1993 | **0.7324** | 0.5157 | 0.7043 |
| nHSIC (RBF kernel) | Toy | 0.0029 | 0.2037 | 0.7685 | 0.2358 | 0.2543 | 0.8737 | 0.5870 | **0.8764** |
| | MNIST_LR | 0.4416 | 0.3568 | 0.6877 | 0.1499 | 0.3804 | 0.5799 | 0.6318 | **0.7387** |
| | MNIST_Noisy | 0.0948 | 0.0993 | 0.1605 | 0.4133 | 0.2076 | 0.2697 | 0.3099 | **0.4326** |
| | XRMB | 0.0534 | 0.0318 | **0.3180** | 0.0224 | 0.0846 | 0.1456 | 0.2502 | 0.2989 |
| | Maps | - | - | 0.5905 | 0.5624 | 0.3956 | 0.8171 | 0.6285 | **0.8658** |

Note: Higher is better for dependency. The best are in bold.

### 5.2 Correlation Analysis

We next conduct correlation analysis on four commonly used multiview data sets to assess the alignment achieved with each method. Higher correlations are expected from latent embeddings that preserve better data correspondence. Details about the data sets are presented in Tables 2 and 9 (Table 9 is in section 6). For XRMB, we follow the setting of DCCA (Wang et al., 2016): we divide the data set into five folds and report the average nHSIC scores for comparison. For ACCA (GM), we adopt the same gaussian mixture prior as for the toy data set (see equation 5.1), a simple arbitrary selection of nongaussian prior. The results are presented in Table 3. We can see that:

DCCA achieves a higher correlation compared with the baselines that do not support data generation, CCA and PCCA. This is because it adopts nonlinear mapping, which enables it to exploit nonlinear correlations in the input for alignment.

The correlation achieved with MVAE is inferior to that of DCCA in most settings. This is because MVAE seeks embeddings that yield better view reconstruction, whereas DCCA directly targets embeddings that achieve maximum linear correlation scores in our evaluation.

Our methods, ACCA_NoCV and ACCA, outperform Bi-VCCA in all settings, and our results are comparable to and even better than those of DCCA in some settings. This indicates that our consistent encoding design benefits the consistency preserved in the latent space. Since ACCA supports data generation while DCCA does not, the comparison of ACCA with DCCA and MVAE indicates that ACCA balances data correspondence and reconstruction quality. This argument is further supported by the data generation results in section 5.4.

Among our three ACCA variants, ACCA (GM) achieves the best result in almost all settings. This observation indicates that the preserved latent correspondence can be enhanced by incorporating a more expressive latent space with more flexible priors. It also verifies the superiority of our ACCA in directly handling flexible priors without extra sampling procedures (see section 4.2).

In addition to the quantitative correlation analysis, we conduct t-SNE visualization to demonstrate the quality of the obtained embeddings. In Figure 5, we compare the embeddings of the two individual views obtained with DCCA, MVAE, Bi-VCCA, and ACCA (G). For the two vanilla CCA models, DCCA and MVAE, the embeddings of each view fail to preserve a distinguishable clustering structure. This observation agrees with our analysis that they lack an inference mechanism to regularize the obtained embeddings. For Bi-VCCA, the embeddings of view $X$ present a clear clustering structure, but the embeddings of view $Y$ are scattered randomly in the common latent space. This implies that the instances do not exhibit the desired correspondence in the latent space, that is, the two views are not well aligned with Bi-VCCA. The observation also implies that the left halves of the MNIST digits potentially preserve more label information than the right halves. For our ACCA, the embeddings of both views present a clear clustering structure, indicating that the two views are better aligned with the proposed ACCA.

### 5.3 Alignment Verification

We adopt the *misalignment degree* as the metric for alignment performance. We take the origin point $O$ as the reference and adopt the angular difference to measure the distance between paired embeddings: $\varphi(z_x, z_y) = \angle z_x O z_y$. The misalignment degree of the multiview embeddings is then given by the average of this angular difference over all test pairs.
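A sketch of this angular metric follows, assuming the misalignment degree averages the per-pair angles (our reading of the definition; the function names are ours):

```python
import numpy as np

def angular_difference(zx, zy):
    """Per-pair angle (radians) between embeddings, with the origin O as reference."""
    cos = np.sum(zx * zy, axis=1) / (
        np.linalg.norm(zx, axis=1) * np.linalg.norm(zy, axis=1))
    # clip guards against floating-point values slightly outside [-1, 1]
    return np.arccos(np.clip(cos, -1.0, 1.0))

def misalignment_degree(zx, zy):
    """Average angular difference over all paired embeddings (assumed aggregation)."""
    return float(np.mean(angular_difference(zx, zy)))
```

Perfectly aligned pairs give a degree of 0; orthogonal pairs give $\pi/2$; opposed pairs give $\pi$.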

The results are presented in Figure 6. We make the following observations:

For DCCA, the latent embeddings of two views are clearly scattered apart, indicating inferior instance correspondence in the latent space.

The regions of the paired embeddings of Bi-VCCA do not overlap, and the misalignment degree of Bi-VCCA is $\delta = 2.3182$, much higher than those of the others. This indicates that Bi-VCCA suffers severely from the misaligned encoding problem.

ACCA and ACCA_NoCV achieve superior alignment performance compared with DCCA, MVAE, and Bi-VCCA. This shows the effectiveness of the consistency constraint on the marginalization for view alignment in ACCA.

The embeddings of ACCA are more uniformly distributed in the latent space than those of ACCA_NoCV. This indicates that the complementary view, $XY$, provides additional information for the holistic encoding.

### 5.4 Applications of Cross-View Generation

We design several cross-view generation tasks to reflect the superior multiview alignment achieved in ACCA. We first apply ACCA to an image recovery task to conduct whole-image recovery given partial images as input for one of the views. We then test ACCA on a face alignment task to annotate facial landmarks given the face images. Since MVAE and Bi-VCCA are the baseline models that support cross-view generation, we compare against these two methods. We do not compare ACCA_NoCV here since it is a variant of our ACCA and would be comparable to ACCA due to the consistent encoding. We adopt a gaussian prior for ACCA here for a fair comparison.

#### 5.4.1 Image Recovery

We evaluate the image recovery (Sohn, Lee, & Yan, 2015) performance of ACCA on the MNIST handwritten digit data set and the CelebFaces Attributes data set (CelebA) (Liu, Luo, & Tang, 2015), both commonly used image generation data sets. The performance is evaluated based on the quality of the generated images (e.g., is the image blurred? Does the image show apparent misalignment at the junctions in the middle?).

##### Image recovery on handwritten digits.

We train the models with the original data while adding noise to the test data of the MNIST data set, to test the robustness of the alignment achieved with each model. We divide the test data in each view into four quadrants, mask one, two, or three quadrants of the input with gray color (Sohn et al., 2015), and use the noisy images as the input for testing. The experimental results are evaluated qualitatively and quantitatively.

Qualitative analysis. Figure 7 presents some of the recovered images (columns 3 to 5) obtained with one-quadrant input. The figure clearly illustrates that, given the noisy input, the images generated with ACCA are more realistic and recognizable than those of MVAE and Bi-VCCA:

The images generated with MVAE show the worst quality; they contain much more noise than those of the other methods. In many cases, the “digit” is hard to identify (e.g., case b). In addition, the generated images of MVAE show clear misalignment at the junctions of the halved images (e.g., case a).

The images generated by Bi-VCCA are much more blurred and less recognizable than those of ACCA, especially in cases a and b.

ACCA can successfully recover the noisy half images, which are confusing even for humans to recognize. For example, in case b, the left-half image of the digit 5 looks similar to a 4, yet ACCA succeeds in recovering the true digit.

Quantitative evidence. We compare the pixel-level accuracy computed from the root mean square error (RMSE), that is, $1-\mathrm{RMSE}$. The results in Table 4 show that our ACCA consistently outperforms Bi-VCCA given the different levels of masked input images. It is interesting to note that the whole images generated from the right-half inputs tend to be more accurate than those generated from the left half. A probable reason is that the left-half images contain more information than the right-half images. This finding coincides with our discovery in Figure 5b. This imbalance of information between the two views would drive the decoder of the less informative view to generate high-quality images, while sacrificing the alignment with the other view.
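The accuracy metric above can be sketched as follows; the scaling of images to $[0,1]$ is our assumption about the convention behind the $1-\mathrm{RMSE}$ scores.

```python
import numpy as np

def pixel_accuracy(ref, gen):
    """Pixel-level accuracy as 1 - RMSE; images assumed scaled to [0, 1]."""
    rmse = np.sqrt(np.mean((np.asarray(ref, float) - np.asarray(gen, float)) ** 2))
    return 1.0 - rmse
```

A perfect reconstruction scores 1.0, and uniformly larger pixel errors lower the score.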

| Input (Halved Image) | Method | 1 Quadrant Overlaid | 2 Quadrants Overlaid | 3 Quadrants Overlaid |
|---|---|---|---|---|
| Left | MVAE | 64.94 | 61.81 | 56.15 |
| | Bi-VCCA | 73.14 | 69.29 | 63.05 |
| | ACCA | **77.66** | **72.91** | **67.08** |
| Right | MVAE | 73.57 | 67.57 | 59.69 |
| | Bi-VCCA | 75.66 | 69.72 | 65.52 |
| | ACCA | **80.16** | **74.60** | **66.80** |

Note: The best results are in bold.

##### Image recovery on human faces.

For human face recovery on the CelebA data set, we halve the RGB images into top and bottom halves and design a CNN architecture to handle this task. Details of the network design are reported in Table 9.

Qualitative analysis. Figure 8 shows the image samples recovered for the CelebA data set. We have mainly three observations:

The samples generated by MVAE show clear misalignment at the junctions, especially when the backgrounds of the images are in color. Some of the images are too blurred to show any details (e.g., the samples circled in red).

The samples generated by Bi-VCCA are generally more blurred than the other two. The observation is quite obvious in the image generated with the top-half image, which contains many fewer details than the bottom-half image.

The images generated by ACCA show better quality compared with the others, considering both the clarity and the alignment of junctions.

Quantitative evidence. We quantitatively assess the quality of the generated images with the Fréchet inception distance (FID) (Heusel, Ramsauer, Unterthiner, Nessler, & Hochreiter, 2017) and estimate the sharpness of the generated test images using image gradients.^{2} The results are reported in Table 5. Figure 8 shows that for image recovery with the top-half face images, the images generated with ACCA are of much better quality than those of MVAE and Bi-VCCA; Bi-VCCA is the worst in terms of both metrics. For the experiment with the bottom-half face images, the FID score of our ACCA is slightly inferior to that of MVAE; however, the generated images of ACCA are still sharper. Comparing the results of these two experiments, we can see that image recovery with the top-half input is better, as it presents lower FID and higher image sharpness. This observation coincides with our qualitative evaluation in Figure 8, where the images generated with the bottom-half input (left column), especially their generated top halves, are commonly blurrier than those generated with the top-half input (right column). This phenomenon also agrees with our discovery in the handwritten digit recovery task, where the input view with more information obtains worse results.
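One common gradient-based sharpness proxy is the mean gradient magnitude, sketched below; the exact estimator used in the experiments may differ, so treat this as an illustrative assumption.

```python
import numpy as np

def sharpness(img):
    """Mean gradient magnitude of a grayscale image: a simple sharpness proxy."""
    gy, gx = np.gradient(np.asarray(img, dtype=np.float64))
    return float(np.mean(np.sqrt(gx ** 2 + gy ** 2)))
```

A flat image scores 0, while strong edges raise the score, which matches the intuition that blurrier generations should score lower.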

| Input (Halved Image) | Method | FID | Sharpness |
|---|---|---|---|
| Top | MVAE | 61.3360 | 8.9645 |
| | Bi-VCCA | 78.0752 | 7.0069 |
| | ACCA | **58.7983** | **11.9026** |
| Bottom | MVAE | **63.6921** | 8.5428 |
| | Bi-VCCA | 84.7122 | 6.7574 |
| | ACCA | 68.1467 | **8.7249** |

Notes: Smaller is better for FID and larger is better for sharpness. The sharpness of the real images is 14.6722. Numbers in bold are the best results.

##### Unconditional human face generation.

To illustrate how ACCA benefits the image generation quality, we further evaluate unconditional generation performance with the trained models on the CelebA data set. Specifically, we randomly sample a batch of $z$ from the prior distribution $p(z)$ and adopt the two decoders to generate both views.

The results are presented in Figure 9. It is clear that the images generated with ACCA are much more realistic than those generated with Bi-VCCA, since the facial boundaries of these images are clearer. It is also remarkable that ACCA generates images with more details, due to the superior correspondence achieved between the input and the latent space. The images in the red box present remarkable details, such as a cap, a hoodie, glasses, and backgrounds.

#### 5.4.2 Face Alignment

We further evaluate the multiview alignment performance of ACCA with a face alignment task (Kazemi & Sullivan, 2014) on CelebA (Liu et al., 2015). We train ACCA with paired face images and ground-truth facial landmark annotations as input for the two views. The better the multiple views are aligned, the better the facial landmark prediction (generation) results that can be achieved on the test face images.

Since the landmark annotations of the original CelebA data set contain only five landmark locations, this data set may be insufficient to assess the performance of models that can handle more complicated applications (Regmi & Borji, 2018). Instead, we construct a more challenging data set with 68 landmark locations as the face annotation. Specifically, we extract the annotations with the state-of-the-art facial landmark localization method Super-FAN (Bulat & Tzimiropoulos, 2018) with the S$^3$FD face detector.^{3} We drop the images in which no face can be detected and construct a data set with 202,405 samples. Figure 10 presents several samples from our data set. Details of the settings for the face alignment experiment are presented in Table 9.

##### Qualitative analysis.

To verify the robustness of ACCA in achieving multiview alignment, we adopt the complete data samples for training while adopting partial or noisy images as input to evaluate the alignment performance of each model. Specifically, we randomly occlude the input images with blocks of different sizes (50 × 50, 60 × 60, 70 × 70 pixels). This setting simulates real face alignment scenarios with occluded faces.

Figure 11 demonstrates the face alignment results. It is clear that our proposed ACCA outperforms the baselines under both settings, with clearer, human-interpretable facial landmark annotations. We can also observe that:

Most of the generated results of MVAE are noisy and blurred to human perception, which indicates that MVAE is susceptible to noisy input. The problem is even more obvious with larger occlusions: as shown in the right panel of Figure 11, most of the results of MVAE are not recognizable.

The results of Bi-VCCA are commonly more blurred than those of MVAE and our ACCA. However, Bi-VCCA is more robust to noisy input than MVAE, since its results are more interpretable under the 70 × 70 blocked setting. This verifies that a latent distribution matching constraint benefits the robustness of the multiview alignment.

Our proposed ACCA achieves clear and human interpretable facial landmark annotations under both settings. This indicates that the multiview alignment achieved with ACCA is the most robust among these three models and verifies that the consistent encoding achieved in ACCA contributes to a better and more robust alignment of the multiple views.

##### Quantitative evidence.

We also quantitatively evaluate the generated annotations with PSNR and SSIM; the results are presented in Table 6.

| Evaluation Metric | Method | 50 × 50 Occlusion | 60 × 60 Occlusion | 70 × 70 Occlusion |
|---|---|---|---|---|
| PSNR | MVAE | 63.0074 | 62.5455 | 62.1448 |
| | Bi-VCCA | 63.0289 | 62.6468 | 62.3351 |
| | ACCA | 62.2924 | 62.4175 | 62.0975 |
| SSIM | MVAE | 0.9982 | 0.9978 | 0.9975 |
| | Bi-VCCA | 0.9981 | 0.9979 | 0.9976 |
| | ACCA | 0.9984 | 0.9979 | 0.9981 |

Notes: Smaller is better for PSNR, and larger is better for SSIM. The best results are in bold.
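For reference, PSNR as reported above can be computed as follows; the 8-bit peak value is an assumption, and SSIM is available in libraries such as scikit-image rather than sketched here.

```python
import numpy as np

def psnr(ref, gen, peak=255.0):
    """Peak signal-to-noise ratio in dB; `peak` assumes 8-bit images."""
    mse = np.mean((np.asarray(ref, float) - np.asarray(gen, float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```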

#### 5.4.3 Cross-View Generation for High-Dimensional Data

To evaluate the capacity of our ACCA for cross-view generation, we further validate its performance with high-resolution image inputs. We adopt the Google Maps data set (Maps) (Isola, Zhu, Zhou, & Efros, 2017), one of the benchmark data sets for cross-view synthesis applications (Regmi & Borji, 2018). To ensure the quality of the generated images, we equip the autoencoder structure in each method with skip connections, that is, a U-Net (Ronneberger, Fischer, & Brox, 2015). We adopt least squares GANs (Mao et al., 2017) as the marginal matching constraint (see equation 3.3) for ACCA. The results are presented in Figure 12. It is clear that our ACCA outperforms the baselines regarding the quality of the generated images.

We also analyze the quality of the obtained embeddings with t-SNE visualization. Specifically, we cluster the embeddings of the aerial photos (view $X$) with k-means (MacQueen et al., 1967) (n_clusters $=$ 3) and perform t-SNE visualization to analyze whether the results present human-interpretable properties. We mark the centroid of each cluster and highlight the top 3 data samples nearest to each centroid. The results are presented in Figure 13. The comparison is analyzed in two aspects:

The clusters of our ACCA are compact and present clear boundaries, while those of MVAE and Bi-VCCA show overlap between the clusters (marked with red dashed circles). The comparison is especially obvious for Bi-VCCA. This indicates that the embeddings of our ACCA preserve more discriminative information than those of Bi-VCCA.

The clustering results of our ACCA present human-interpretable properties. According to Figure 13c, the three clusters present distinct properties. Among the test images, a large proportion are blocks (points in blue), a small proportion are bodies of water or vegetation (points in purple), and the rest are hybrid zones (e.g., highways, railways; points in red). This discovery coincides with the data statistics of the original data set. The clustering results of the MVAE embeddings are not interpretable compared with our ACCA, since the samples nearest the centroids are not distinguishable and no interpretable patterns are presented by the top 3 data samples of each cluster. The cluster centroids obtained with Bi-VCCA are interpretable to some extent, since the data samples nearest the centroids present unique properties. However, the clustering of the data sample "(2-3)," circled in red, is not clear.

Consequently, on the Google Maps data set, our ACCA outperforms the baselines regarding the alignment of the multiple views. The obtained latent embeddings are also informative about the multiview data.
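The clustering-and-projection pipeline described above (k-means on the embeddings, then t-SNE, then inspecting the samples nearest each centroid) can be sketched with scikit-learn; the helper name and parameter values here are ours, not the paper's.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

def cluster_and_embed(z, n_clusters=3, seed=0):
    """K-means cluster latent embeddings, project them to 2-D with t-SNE,
    and collect the 3 samples nearest each centroid for inspection."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(z)
    z2d = TSNE(n_components=2, random_state=seed, perplexity=20).fit_transform(z)
    nearest = [np.argsort(np.linalg.norm(z - c, axis=1))[:3]
               for c in km.cluster_centers_]
    return km.labels_, z2d, nearest
```

The returned `nearest` indices correspond to the "top 3 data samples" highlighted per centroid in Figure 13.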

To further support the superior alignment and generation performance of ACCA, we compare its results with pix2pix, the state-of-the-art GAN-based cross-view generation baseline. The comparison is shown in Figure 14. We can see that the image quality of our ACCA is comparable to and even better than that of pix2pix. This indicates that our ACCA has good alignment and generation ability.

### 5.5 Alignment and Discriminative Property of the Representation

Multiview representation learning is an important part of CCA. In this section, we conduct classification tasks to verify that the alignment achieved in ACCA does not greatly influence the discriminative property of the learned representations.

We follow the settings in Table 2 and perform classification on the three labeled data sets: MNIST_LR, MNIST_Noisy, and XRMB. We train linear SVM classifiers on the concatenation of the obtained embeddings and then evaluate their accuracy on the projected test set, $[Z_X^{Te}, Z_Y^{Te}]$. For iteratively optimized nonlinear CCA models, we select the embeddings obtained from the last five epochs for evaluation. We compare ACCA with the gaussian prior, namely ACCA (G), for a fair comparison. PCCA is not evaluated regarding classification since it should be comparable to linear CCA.
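The evaluation protocol above (a linear SVM on the concatenated view embeddings) might be sketched as follows; the function name and the regularization constant are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

def embedding_accuracy(zx_tr, zy_tr, y_tr, zx_te, zy_te, y_te):
    """Train a linear SVM on concatenated view embeddings; return test accuracy."""
    clf = LinearSVC(C=1.0, max_iter=10000)
    clf.fit(np.hstack([zx_tr, zy_tr]), y_tr)
    return clf.score(np.hstack([zx_te, zy_te]), y_te)
```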

Table 7 presents the classification results. Our ACCA achieves comparable and even better classification performance among the CCA variants with a generative mechanism. The results of ACCA are better than those of Bi-VCCA in all settings and comparable to those of MVAE in most settings. This reveals that our ACCA preserves considerable discriminative property in the embeddings while achieving superior alignment of the multiple views. Our ACCA is inferior to DCCA regarding classification on MNIST_Noisy and XRMB. The reason is that instead of targeting alignment for discriminative representation learning as in DCCA, our model focuses on reconstruction for data generation. On the MNIST_LR data set, however, ACCA outperforms DCCA to a large extent. This indicates that reconstruction can benefit discriminative representation learning in certain scenarios. The finding also coincides with the outstanding performance of MVAE here.

| Data Set | CCA | DCCA | MVAE | Bi-VCCA | ACCA (G) |
|---|---|---|---|---|---|
| MNIST_LR | 50.65 | 73.67 ± 0.15 | 84.44 ± 0.76 | 74.32 ± 0.19 | **85.81 ± 0.71** |
| MNIST_Noisy | 75.48 | **91.60 ± 0.36** | 90.78 ± 1.12 | 85.81 ± 0.44 | 86.93 ± 1.46 |
| XRMB | 32.04 | **62.14 ± 0.52** | 58.57 ± 0.30 | 56.58 ± 0.35 | 60.37 ± 0.40 |

Note: The best results are in bold.

### 5.6 ACCA versus CCA Variants with View-Specific Information

In this section, we compare ACCA with CCA variants that additionally exploit view-specific information to further demonstrate its alignment capacity. This is also a preliminary study on the influence of private information on multiview alignment and generation. We choose Bi-VCCA-private as a representative baseline for comparison.

We evaluate the alignment of Bi-VCCA-private with regard to both correlation analysis and conditional/unconditional data generation, comparing it with our model with the simple gaussian prior, ACCA (G). We adopt the same settings (i.e., network settings and evaluation metrics) as in the previous experiments for consistency. For Bi-VCCA-private, the dimensions of the private variables are set as $d_{Hx}=d_{Hy}=30$ for all data sets. The networks of the private encoders are set the same as the corresponding principal encoders in Table 2.

The results of correlation analysis are presented in Table 8. It is clear that our ACCA outperforms Bi-VCCA-private and Bi-VCCA in all settings. This indicates that it is the KL-divergence constraint, $D_{KL}(q(z|*)\,\|\,p(z))$, that mainly hinders these models from achieving instance-level multiview alignment (see Figure 1b). Our ACCA overcomes this limitation with the marginalized matching constraint, $D_{JS}(\int q(z|*)p(*)\,d* \,\|\, p(z))$, and thus preserves better correspondence for paired inputs. The results of Bi-VCCA-private are slightly better than those of Bi-VCCA, indicating that private variables can enhance multiview alignment to some extent. The comparisons regarding cross-view generation and unconditional data generation are presented in Figures 15 and 16, respectively. The results show that our ACCA outperforms the others in terms of both image sharpness and recognizable object details. In cross-view generation, the images generated with ACCA are much clearer and sharper than those of Bi-VCCA-private. In addition, although Bi-VCCA-private generates faces with more details than Bi-VCCA (e.g., beards and glasses; see Figure 16), these details are not as clear and recognizable as those of ACCA. These generation results coincide with our finding in the correlation analysis: Bi-VCCA-private achieves inferior multiview alignment compared with ACCA, which consequently downgrades its performance in cross-view data generation.

Table 8: Results of correlation analysis.

| Metrics | Methods | MNIST_LR | MNIST_Noisy | XRMB |
|---|---|---|---|---|
| nHSIC (linear kernel) | Bi-VCCA-private | 0.2818 | 0.2235 | 0.1227 |
| | Bi-VCCA | 0.4612 | 0.1912 | 0.1046 |
| | ACCA (G) (ours) | **0.5423** | **0.3285** | **0.2903** |
| nHSIC (RBF kernel) | Bi-VCCA-private | 0.2853 | 0.2386 | 0.0893 |
| | Bi-VCCA | 0.3804 | 0.2076 | 0.0846 |
| | ACCA (G) (ours) | **0.6318** | **0.3099** | **0.2502** |


Note: The best results are in bold. For dependency, higher is better.
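For reference, the nHSIC score reported above can be estimated from paired latent encodings along the following lines (a minimal NumPy sketch, assuming the standard normalized HSIC estimator $\mathrm{tr}(\bar{K}\bar{L})/\sqrt{\mathrm{tr}(\bar{K}\bar{K})\,\mathrm{tr}(\bar{L}\bar{L})}$ with centered Gram matrices; the paper's exact estimator and RBF bandwidth may differ):

```python
import numpy as np

def _gram(X, kernel="linear", sigma=1.0):
    """Gram matrix of the rows of X under a linear or RBF kernel."""
    if kernel == "linear":
        return X @ X.T
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma**2))

def nhsic(X, Y, kernel="linear", sigma=1.0):
    """Normalized HSIC between paired samples X and Y (rows correspond)."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    K = H @ _gram(X, kernel, sigma) @ H
    L = H @ _gram(Y, kernel, sigma) @ H
    return np.trace(K @ L) / np.sqrt(np.trace(K @ K) * np.trace(L @ L))
```

Evaluating `nhsic` on the paired latent encodings of the two views yields a score in $[0, 1]$; identical encodings score 1, so higher values indicate better instance-level alignment.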

| Data Set | Statistics | Dimension of $z$ | Architecture (Conv all with batch normalization before LReLU) | Discriminator | Parameters |
|---|---|---|---|---|---|
| CelebA (Liu et al., 2015) | Image resolution: 64 × 64. Image recovery: #Tr = 201,599, #Te = 1,000. Face alignment: #Tr = 201,599, #Te = 1,000 | $d$ = 100 | Encoders: Conv 64×5×5 (stride 2), Conv 128×5×5 (stride 2), Conv 256×5×5 (stride 2), Conv 512×5×5 (stride 2); dense 100. Decoders (image recovery): dense 8192, ReLU; deConv 256×5×5 (stride 2), deConv 128×5×5 (stride 2), deConv 64×5×5 (stride 2), deConv 3×2×5 (stride 1×2); Tanh. Decoders (face alignment): dense 8192, ReLU; deConv 256×5×5 (stride 2), deConv 128×5×5 (stride 2), deConv 64×5×5 (stride 2), deConv 3×2×5 (stride 2); Tanh. | $D$: dense 128→64→1, sigmoid | Epoch = 10; Batch size = 64; lr = 0.0002; Beta1 = 0.05 |
| Google Maps dataset (Maps) (Isola et al., 2017) | Image resolution: 256 × 256. Cross-view generation: #Tr = 1,096, #Te = 1,098 | $d$ = 100 | Encoders: Conv 64×5×5 (stride 2), Conv 128×5×5 (stride 2), Conv 256×5×5 (stride 2), Conv 256×5×5 (stride 2), Conv 512×5×5 (stride 2); dense 100. Decoders (with skip connections): dense 32768, ReLU; deConv 256×5×5 (stride 2), deConv 256×5×5 (stride 2), deConv 128×5×5 (stride 2), deConv 64×5×5 (stride 2), deConv 3×2×5 (stride 2); Tanh. | $D$: dense 128→64→1, tanh | Epoch = 15; Batch size = 16; lr = 0.0002; Beta1 = 0.5 |


Furthermore, it is interesting to note that for Bi-VCCA-private, the quality of the images generated from the two different views shows only slight differences, compared with the baselines that extract only the shared information (see Figure 8). This indicates that incorporating view-specific variables also contributes to a balanced cross-view generation capacity when the multiple input views contain an imbalanced amount of information. (In section 5.4.1, we find that the imbalance of information between the two input views can influence image generation quality.)

## 6 Conclusion

In this letter, we present a systematic analysis of instance-level multiview alignment with CCA. Based on the marginalization principle of Bayesian inference, we study multiview alignment via consistent latent encoding and present ACCA, which facilitates superior alignment of the multiple views and thereby benefits the performance of various multiview and cross-view analysis tasks. Since it matches multiple encodings, ACCA can also be adapted to other tasks, such as image captioning and translation. Furthermore, owing to its flexible architecture design, ACCA can be easily extended to multiview tasks with $n$ views, using $n+1$ encoders and $n$ decoders.

In this work, we mainly exploit ACCA with predefined priors. In future work, we will explore more powerful inference techniques for ACCA to further boost its alignment performance. For example, normalizing flows (Rezende & Mohamed, 2015) provide an efficient, data-driven tool to learn a data-dependent prior for complex data sets. They can be employed in ACCA to boost multiview alignment by providing a more expressive latent space, that is, a complex data-dependent prior with better preservation of instance-level correspondence, through invertible mappings.
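As a sketch of this direction, a single planar-flow step from Rezende and Mohamed (2015) can be written as follows (NumPy; the parameter values below are illustrative and would be learned in practice, and how the flow is wired into ACCA's prior is our assumption):

```python
import numpy as np

def planar_forward(z, u, w, b):
    """One planar-flow step f(z) = z + u * tanh(w.z + b) with its log-det."""
    a = np.tanh(z @ w + b)                    # (n,)
    z_new = z + np.outer(a, u)                # (n, d)
    psi = (1.0 - a**2)[:, None] * w           # h'(w.z + b) * w, shape (n, d)
    log_det = np.log(np.abs(1.0 + psi @ u))   # |det df/dz| of a rank-1 update
    return z_new, log_det

# Stacking K such invertible steps warps a simple base prior q_0 = N(0, I)
# into a richer, data-dependent prior with exact density:
#   log q_K(z_K) = log q_0(z_0) - sum_k log|det df_k / dz_{k-1}|.
rng = np.random.default_rng(0)
z0 = rng.normal(size=(5, 2))
zk, log_det = planar_forward(z0, u=np.array([0.5, -0.3]),
                             w=np.array([1.0, 0.2]), b=0.1)
```

In ACCA, samples from such a flowed prior could replace the $N(0, I)$ samples fed to the discriminator, leaving the rest of the adversarial matching unchanged.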

Our analysis based on the CMI and consistent encoding also provides insights for the flexible design of other CCA models. In the future, we will conduct a more in-depth analysis of multiview alignment with CMI and propose other variants of CCA with other alignment criteria (e.g., the MMD distance or the Wasserstein distance; Arjovsky, Chintala, & Bottou, 2017). It is also interesting to note that inputs with different levels of detail can influence the result of cross-view generation; this direction is also worth further study.
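To illustrate one such alternative criterion, a biased sample estimate of the squared MMD between the marginalized encoding and the prior could be computed as follows (a NumPy sketch with an RBF kernel; the function name and bandwidth are illustrative):

```python
import numpy as np

def mmd_rbf(X, Y, sigma=1.0):
    """Biased estimate of squared MMD between samples X and Y (RBF kernel)."""
    def k(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
        return np.exp(-d2 / (2.0 * sigma**2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()
```

Replacing the discriminator-based JS matching with `mmd_rbf(encoded_z, prior_z)` would give a sample-based, adversary-free alignment penalty: the estimate is zero when the two samples coincide and grows as their distributions separate.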

## Notes

^{2}

We evaluate the sharpness of each test image using its gradients and average these values over the 1,000 test images.
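One plausible implementation of this metric (a NumPy sketch; we assume the average squared gradient magnitude, as the note does not specify the exact form):

```python
import numpy as np

def sharpness(img):
    """Average squared gradient magnitude of a grayscale image in [0, 1]."""
    gy, gx = np.gradient(img.astype(float))
    return float(np.mean(gx**2 + gy**2))

# A hard edge scores higher than a smoothed version of the same edge.
sharp = np.zeros((32, 32)); sharp[:, 16:] = 1.0
soft = sharp.copy(); soft[:, 15] = 1/3; soft[:, 16] = 2/3
```

For color images, the score could be computed per channel and averaged; the dataset-level number is then the mean over the test images.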

## Acknowledgments

The work was supported in part by the Australian Research Council grants DP180100106, DP200101328 and the China Scholarship Council (201706330075).

## References

*Proceedings of the International Conference on Machine Learning*(pp.
