Abstract

Multiview alignment, achieving one-to-one correspondence of multiview inputs, is critical in many real-world multiview applications, especially for cross-view data analysis problems. An increasing amount of work has studied this alignment problem with canonical correlation analysis (CCA). However, existing CCA models are prone to misalign the multiple views due to either the neglect of uncertainty or the inconsistent encoding of the multiple views. To tackle these two issues, this letter studies multiview alignment from a Bayesian perspective. Delving into the impairments of inconsistent encodings, we propose to recover correspondence of the multiview inputs by matching the marginalization of the joint distribution of multiview random variables under different forms of factorization. To realize our design, we present adversarial CCA (ACCA), which achieves consistent latent encodings by matching the marginalized latent encodings through the adversarial training paradigm. Our analysis, based on conditional mutual information, reveals that ACCA is flexible for handling implicit distributions. Extensive experiments on correlation analysis and cross-view generation under noisy input settings demonstrate the superiority of our model.

1  Introduction

Multiview learning is the subfield of machine learning that considers learning from data with multiple feature sets. This paradigm has attracted increasing attention due to the emerging multiview data that have facilitated various real-world applications, such as video surveillance (Wang, 2013), information retrieval (Elkahky, Song, & He, 2015), and recommender systems (Elkahky et al., 2015). In these applications, it is critical to achieve instance-level multiview alignment, such that the multiple data streams maintain precise one-to-one correspondence (Li, Yang, & Zhang, 2018). For example, in traditional multiview learning tasks, such as multiview classification (Qi et al., 2016) or multiview clustering (Chaudhuri, Kakade, Livescu, & Sridharan, 2009) on face images in video surveillance, the input data correspond to face images taken from different angles. In these cases, input feature sets with low one-to-one correspondence degrade the alignment of the multiple views, thus severely affecting the performance of the desired tasks. Multiview alignment plays an even more critical role in cross-view data analysis (Jia & Ruan, 2016) problems, namely, analyzing one view of the data given input from the other view. For example, the cross-view retrieval task (Elkahky et al., 2015) is, given a query from one view, to search for the corresponding object in the other view; cross-view generation (Regmi & Borji, 2018) seeks to generate target objects given the cross-view inputs. Both are promising real-world applications in which alignment of the incorporated views is critical to performance.

Canonical correlation analysis (CCA) (Hotelling, 1936) provides a primary tool to study instance-level multiview alignment under a subspace learning mechanism (Xu, Tao, & Xu, 2013). In this setting, the instances of two views, X and Y, are assumed to be generated from a common latent subspace Z; the alignment problem is to find two mapping functions, F(X) and G(Y), such that the embeddings of corresponding input pairs are close to each other with respect to linear correlation. The instance pair $(x_i, y_i)$ is in exact correspondence if and only if $F(x_i) = G(y_i)$ (Ma & Fu, 2011). However, existing CCA models are prone to misalignment due to the neglect of uncertainty or the inconsistent encoding of the multiple views.

Following the principle of classic CCA, vanilla CCA models study multiview alignment with deterministic mapping functions (Oh et al., 2018). Such CCA models are prone to misaligning the multiple views since uncertainty is not considered. To be specific, classic CCA obtains the shared latent space by maximally correlating the deterministic point embeddings, achieved with a linear mapping of the two views. Some work, such as kernel CCA (KCCA) (Lai & Fyfe, 2000), deep CCA (DCCA) (Andrew, Arora, Bilmes, & Livescu, 2013), and the multiview autoencoder (MVAE) (Ngiam et al., 2011), extends classic CCA with nonlinear mappings or cross-view reconstruction to exploit nonlinear correlation for the alignment. The mapping functions $F(\cdot)$ and $G(\cdot)$ are nonlinear in these models. As depicted in Figure 1a, these methods all exploit the subspace Z with deterministic point embeddings; namely, $z_x = F(x)$ and $z_y = G(y)$ are points in $\mathbb{R}^d$. Without an inference mechanism to evaluate the quality of the obtained latent codes, the mapping functions obtained in these models are susceptible to noisy inputs (Kendall & Gal, 2017), which can consequently result in misalignment of the multiple views. For example, for the observation marked by the circled 1 in Figure 1a, the inputs from the two views are projected far apart in the embedding space: they fall into different clusters, 5 and 2, respectively, while they are supposed to be close to each other around the ground-truth cluster 7. Moreover, without prior regularization on the shared subspace, these models do not allow easy latent interpolation, since their latent spaces are discontinuous. In such cases, the training samples are encoded into nonoverlapping zones chaotically scattered across the space, with "holes" between the zones where the model has never been trained (Tolstikhin, Bousquet, Gelly, & Schoelkopf, 2017). Therefore, these models cannot facilitate the cross-view generation task, since the generation results are quite likely to be unrealistic.

Figure 1:

The motivation of Adversarial CCA. (a) Vanilla CCA models misalign the multiple views with discontinuous latent space and unrealistic generated data. (b) The latent encodings matched with KL-divergence are inconsistent, leading to misalignment of the multiple views. (c) Adversarial learning facilitates consistent encodings for the multiple views by matching marginalized latent encodings with flexible priors.

Generative CCA models, such as probabilistic CCA (PCCA) (Bach & Jordan, 2005), variational CCA (VCCA) (Wang, Yan, Lee, & Livescu, 2016), and the multichannel variational autoencoder (MCVAE) (Antelmi, Ayache, Robert, & Lorenzi, 2019), overcome this issue with probability. However, they suffer from misalignment due to the impairments of inconsistent encodings. Specifically, these models adopt the Kullback-Leibler divergence (KL-divergence) between the encodings of individual input examples, that is, Q(Z|X=x) and Q(Z|Y=y), and the prior P0(Z) as the criterion to match the latent encodings of the different views. However, this constraint simply forces the encodings of individual inputs to match the common prior (Tolstikhin et al., 2017). Even if the constraint is satisfied, the encodings of the data samples from the two views can intersect. In this way, the correspondence between the latent codes of paired inputs is violated. Such inconsistent latent encodings cause one-to-many correspondence between the instances of the incorporated views, indicating that the multiple views are misaligned. As depicted in Figure 1b, although all of these latent encodings match the prior, the encodings of the instances from the two views intersect in the common latent space. This creates confusion about the correspondence between the instances in the two views; for example, both 1 and 2 (the circled numbers) exhibit one-to-many correspondence. Such inconsistency not only weakens the alignment of the two spaces but also degrades the quality of data reconstruction. Moreover, to achieve a tractable solution for the inference, these models restrict the latent space with a simple gaussian prior, $p_0(z) \sim \mathcal{N}(0, I_d)$, so that the constraint can be computed analytically. However, such a prior is not expressive enough to capture the true posterior distributions (Mescheder, Nowozin, & Geiger, 2017). Therefore, the latent space may not be expressive enough to preserve the instance-level correspondence of the data samples. These impairments lead to an inferior alignment of the multiple views and thus also degrade the models' performance in cross-view generation tasks.

To tackle these issues, in this letter we study instance-level multiview alignment from a Bayesian perspective. With an in-depth analysis of existing CCA models with respect to latent distribution matching, we identify the impairments of inconsistent encodings in the existing CCA models. We then propose to recover the consistency of the multiple views, and thereby boost cross-view generation performance, by matching the marginalization of the joint distribution of multiview random variables under different forms of factorization (see equation 3.3). To realize our marginalization design, we present adversarial CCA (ACCA), which achieves consistent latent encoding of the multiple views by matching the marginalized posteriors to flexible prior distributions through the adversarial training paradigm. Analyzing the conditional independence assumption in CCA with conditional mutual information (CMI), we reveal that compared with existing CCA methods, our ACCA is flexible in handling implicit distributions. The contributions of this work can be summarized as follows:

  • We provide a systematic study of CCA-based instance-level multiview alignment. We identify the impairments of inconsistent encodings in the existing CCA models and propose to study multiview alignment based on the marginalization principle of Bayesian inference, so as to recover the consistency of the multiple views.

  • We design adversarial CCA (ACCA), which achieves consistent latent encoding of the multiple views and is flexible in handling implicit distributions. To the best of our knowledge, we are the first to elaborate on the superiority of adversarial learning in the multiview alignment scenario.

  • We analyze the connection between ACCA and existing CCA models based on CMI and reveal the superiority of ACCA, which benefits from consistent latent encoding. Our CMI-based analysis and consistent latent encoding can provide insights for the flexible design of other CCA models for multiview alignment.

The rest of this letter is organized as follows. In section 2, we review the existing CCA models regarding latent distribution matching. In section 3, we elaborate on our design to study multiview alignment through marginalization and present our ACCA design. In section 4, we discuss the advantages of our model by comparing it with existing models based on CMI. In section 5, we demonstrate the superior alignment performance of ACCA with model verification and various real-world applications. Section 6 concludes the letter and envisions future work.

2  Deficiencies of Existing CCA Models

In this section, we review the multiview alignment achieved with existing CCA models in terms of latent distribution matching.

2.1  Vanilla CCA Models and the Neglect of Uncertainty

Vanilla CCA models are prone to misalignment since data uncertainty is not considered.

Canonical correlation analysis (CCA) (Hotelling, 1936) is a powerful statistical tool for multiview data analysis. Let $\{x^{(i)}, y^{(i)}\}_{i=1}^N$ denote the collection of N independent and identically distributed (i.i.d.) samples with pairwise correspondence in a multiview scenario. (In the following, we use (x, y) to denote any one instance in this set, for simplicity.) Classic CCA aims to find linear projections of the two views, $(W_x^\top X, W_y^\top Y)$, such that the correlations between the projections are mutually maximized, namely,
$$\max \operatorname{corr}(W_x^\top X, W_y^\top Y) = \frac{W_x^\top \Sigma_{xy} W_y}{\sqrt{W_x^\top \Sigma_{xx} W_x}\,\sqrt{W_y^\top \Sigma_{yy} W_y}},$$
where $\Sigma_{xx}$ and $\Sigma_{yy}$ are the covariances of X and Y, and $\Sigma_{xy}$ denotes the cross-covariance. With linear projections, classic CCA simply exploits the linear correlation among the multiple views to achieve alignment. This is often insufficient for analyzing complex real-world data that exhibit higher-order correlations (Suzuki & Sugiyama, 2010).
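For concreteness, the closed-form solution can be sketched in a few lines of numpy. The following is a minimal sketch under our own naming; the small ridge term eps is added for numerical stability and is not part of the original formulation:

```python
import numpy as np

def inv_sqrt(S):
    """Symmetric inverse square root of a PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def linear_cca(X, Y, d, eps=1e-8):
    """Classic CCA: top-d canonical directions from the whitened
    cross-covariance. X: (N, dx), Y: (N, dy) paired observations."""
    X, Y = X - X.mean(0), Y - Y.mean(0)
    N = X.shape[0]
    Sxx = X.T @ X / N + eps * np.eye(X.shape[1])  # Sigma_xx (regularized)
    Syy = Y.T @ Y / N + eps * np.eye(Y.shape[1])  # Sigma_yy (regularized)
    Sxy = X.T @ Y / N                             # Sigma_xy
    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)       # whitened cross-covariance
    U, s, Vt = np.linalg.svd(T)
    Wx = inv_sqrt(Sxx) @ U[:, :d]                 # projection W_x
    Wy = inv_sqrt(Syy) @ Vt.T[:, :d]              # projection W_y
    return Wx, Wy, s[:d]                          # s[:d]: canonical correlations
```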

Various CCA models have been proposed to exploit nonlinear correlation for multiview alignment with deterministic nonlinear mappings. Kernel CCA (KCCA) and deep CCA (DCCA) exploit nonlinear correlation by extending CCA with nonlinear mappings implemented with kernel methods and deep neural networks (DNNs), respectively. Other work, for example, deep canonically correlated autoencoders (DCCAE) (Wang, Arora, Livescu, & Bilmes, 2015), extends nonlinear CCA with self-reconstruction for each view. However, since there is a trade-off between the canonical correlation of the learned bottleneck representations and the reconstruction, the cross-view relationship captured in the common subspace is often inferior to that of DCCA (Wang et al., 2016). The multiview autoencoder (MVAE) aims to establish a strong connection between the views through cross-view reconstruction. Without adopting a specific alignment criterion, its objective is given as
$$\min_{F,G} \frac{1}{N} \sum_{\{x,y\}} \|x - F^{-1}(F(x))\|^2 + \|x - F^{-1}(G(y))\|^2 + \|y - G^{-1}(G(y))\|^2 + \|y - G^{-1}(F(x))\|^2,$$
where $F(\cdot)$ and $G(\cdot)$ represent the nonlinear mappings of X and Y, respectively, and $F^{-1}(\cdot)$ and $G^{-1}(\cdot)$ denote the corresponding decoders for view reconstruction.
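The objective translates almost directly into code. Below is a minimal PyTorch sketch of the MVAE loss; the module names enc_x, enc_y, dec_x, dec_y are our stand-ins for F, G, F^{-1}, and G^{-1}:

```python
import torch.nn.functional as F

def mvae_loss(x, y, enc_x, enc_y, dec_x, dec_y):
    """MVAE objective: self- and cross-view reconstruction from the
    deterministic codes of the two views (a sketch)."""
    zx, zy = enc_x(x), enc_y(y)
    return (F.mse_loss(dec_x(zx), x) + F.mse_loss(dec_x(zy), x) +
            F.mse_loss(dec_y(zy), y) + F.mse_loss(dec_y(zx), y))
```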

Without an inference mechanism that can evaluate the quality of the obtained embeddings, these methods are prone to misaligning the multiple views when given noisy inputs (Tolstikhin et al., 2017). As depicted in Figure 1a, for noisy halved images of the digit 7, the two views are misaligned in the latent space: their embeddings are scattered far apart and even fall into the different clusters of 2 and 5, respectively. Moreover, these models cannot facilitate cross-view generation tasks very well, since the obtained subspace is discontinuous under such deterministic mappings. Consequently, interpolation of the latent space would lead to unrealistic generation results.

2.2  Generative CCA Models and Inconsistent Latent Encodings

Generative CCA models overcome the uncertainty issue by modeling probability. However, they still suffer from misalignment due to the impairments of inconsistent encodings, caused by the limitation of the KL-divergence alignment criterion.

Let the two input views correspond to random variables X and Y, each distributed according to an unknown generative process with densities p(x) and p(y), from which we have observations $\{x^{(i)}, y^{(i)}\}_{i=1}^N$. Probabilistic CCA (PCCA) (Bach & Jordan, 2005), as a generative version of classic CCA, aligns multiview data by maximizing the correlation between the linearly projected views in a common latent space with a gaussian prior, namely, $z \sim \mathcal{N}(0, I_d)$, $x|z \sim \mathcal{N}(W_x z + \mu_x, \Phi_x)$, $y|z \sim \mathcal{N}(W_y z + \mu_y, \Phi_y)$, where d denotes the dimension of the projected space. The KL-divergence is tractable in this case, since the conjugacy of the prior and the likelihood in PCCA leads to two favorable conditions. First, the conditional distribution p(x,y|z) can be modeled with the joint covariance matrix, with which the conditional independence constraint for CCA can be easily imposed (Drton, Sturmfels, & Sullivant, 2008):
$$p(x, y|z) = p(x|z)\,p(y|z).$$
(2.1)
Second, the posterior, $p(z|x,y) = \frac{p(x,y|z)\,p(z)}{p(x,y)}$, can be calculated analytically (Tipping & Bishop, 1999).
To exploit nonlinear correlation for alignment, some work extends PCCA with nonlinear mappings. Inspired by variational inference, Wang et al. (2016) proposed two generative CCA variants: variational CCA (VCCA) and bi-deep variational CCA (Bi-VCCA). Both methods minimize a reconstruction cost together with a KL-divergence that regularizes the alignment. VCCA penalizes the discrepancy between a single-view encoding and the prior, $D_{KL}(Q(Z|X=x)\,\|\,P_0(Z))$, based on a preference for one of the two views. The two views are not well aligned, since the information in the other view is not exploited; VCCA also cannot handle the cross-view generation task due to this missing encoding. Bi-VCCA overcomes this limitation with a heuristic combination of the KL-divergence terms obtained with both encodings, Q(Z|X=x) and Q(Z|Y=y), with λ controlling the trade-off. To achieve a tractable solution for the inference, the latent space is restricted to being gaussian distributed, $P_0(Z) \sim \mathcal{N}(\mu, \Sigma)$, so that the KL-divergence can be computed analytically. Its objective is given as
$$\min_{\theta,\phi} \frac{1}{N} \sum_{\{x,y\}} \Big[ \lambda \big[ -\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z) + \log p_\theta(y|z)] + D_{KL}(q_\phi(z|x)\,\|\,p_0(z)) \big] + (1-\lambda) \big[ -\mathbb{E}_{q_\phi(z|y)}[\log p_\theta(x|z) + \log p_\theta(y|z)] + D_{KL}(q_\phi(z|y)\,\|\,p_0(z)) \big] \Big],$$
(2.2)
where θ denotes the generative model parameters and ϕ denotes the variational parameters. The prototype proposed in Antelmi et al. (2019), the multichannel variational autoencoder (MCVAE), constrains the expectation of the KL-divergence between the encoding of each view, Q(Z|X=x), Q(Z|Y=y), and Q(Z|X=x, Y=y), and the target posterior distribution over all the data samples as the criterion for alignment. However, with an explicit conditional independence assumption (see equation 2.1), MCVAE achieves the same objective as Bi-VCCA.
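For reference, the analytic KL term that ties these models to gaussian priors is the standard closed form below (a sketch; mu and logvar are assumed to be the outputs of a gaussian encoder):

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """Closed-form D_KL(N(mu, diag(exp(logvar))) || N(0, I)).
    This analytic form is why Bi-VCCA and MCVAE restrict p0(z) to be gaussian."""
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
```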

2.2.1  Impairments of Inconsistent Latent Encodings

Since there exists an encoding and decoding mechanism for each of the views in generative CCA models, the instance-level alignment of the views can be verified by cross-view generation. Specifically, if the two views are well aligned, the encoding from one view can then recover the corresponding data in the other view. In such circumstances, we define the encoding of the two views to be consistent. Therefore, the consistency of the multiview encodings is a necessary condition for multiview alignment in generative CCA models.

However, the methods we have mentioned misalign the multiple views due to inconsistent latent encodings caused by the inferior alignment criterion, $D_{KL}(Q(Z|X=x)\,\|\,P_0(Z))$ and $D_{KL}(Q(Z|Y=y)\,\|\,P_0(Z))$. First, this criterion can only match the encodings of individual data samples, and thereby causes inconsistent encoding of the views. As depicted in Figure 1b, in the multiview learning scenario, it simply forces the encoding from each view, Q(Z|X=x) and Q(Z|Y=y), of all the different input examples to individually match the common prior P0(Z). In this way, the latent encodings from the two views intersect in the common latent space. This intersection disorganizes the consistency of the encodings in the latent space and thus reduces the instance-level alignment of the two input views. This misalignment also degrades the quality of data reconstruction and generation. Both deficiencies are crucial for cross-view generation tasks. In addition, to compute the KL-divergence analytically, all of these methods require the incorporated distributions, that is, the prior P0(Z) and the posteriors of each view, Q(Z|X) and Q(Z|Y), to be simple. However, such a restriction can lead to inferior inference models that are not expressive enough to capture the true posterior distribution (Mescheder et al., 2017). Inexpressiveness of the latent space further limits the models' ability to preserve the instance-level correspondence of the data samples.

3  Multiview Alignment via Consistent Latent Encoding

In this section, we study multiview alignment from a Bayesian perspective. In section 3.1 we elaborate on the design that achieves consistency of the multiple views through marginalization. We then present our design of adversarial CCA in section 3.2.

3.1  Multiview Alignment through Marginalization

The KL-divergence criterion adopted in existing CCA models causes the impairments of inconsistent encodings in two ways:

  • Primarily, it causes inconsistent latent encoding of the two views, since it simply matches the encodings of individual data samples.

  • It further restricts the expressiveness of the latent space regarding the instance-level correspondence, since it can only incorporate simple priors directly.

To exploit a better criterion that benefits the alignment, that is, the instance-level consistency of the multiple views, we study multiview alignment from a Bayesian perspective.

From the Bayesian perspective, the primary reason for inconsistent encoding is that the KL-divergence criterion measures the disagreement of the posterior distributions q(z|x) and q(z|y) without considering the condition variable. That is, it simply matches the encodings of individual data in each view to the prior $p_0(z)$ via a heuristic combination of the KL-divergence between each encoding and the prior,
$$\lambda\, D_{KL}(q(z|x)\,\|\,p_0(z)) + (1-\lambda)\, D_{KL}(q(z|y)\,\|\,p_0(z)).$$
(3.1)
Without considering the condition variables X and Y, the encodings of instances from the two views can overlap in a disorganized way. This degrades the one-to-one correspondence of the multiview data in the corresponding models.

Based on the marginalization principle of Bayesian inference (Tipping, 2003; Jaynes, 1978), we propose to facilitate consistent latent encoding by simultaneously matching the multiview encodings whose condition variables are all integrated out. We first eliminate the misalignment induced by the intersection of the individual sample encodings by marginalizing the encodings from multiple views and then constrain the marginalized encodings to overlap with the prior p0(z) simultaneously.

First, within the CCA-based multiview learning scenario, the joint distribution of multiview random variables can be factorized into three forms: $q(x,z) = q(z|x)\,p(x)$, $q(y,z) = q(z|y)\,p(y)$, and $q(x,y,z) = q(z|x,y)\,p(x,y)$. Marginalizing out the observed variables in these joint distributions yields three marginalized posterior distributions over z:
$$q_x(z) = \int q(z|x)\,p(x)\,dx, \quad q_y(z) = \int q(z|y)\,p(y)\,dy, \quad q_{xy}(z) = \iint q(z|x,y)\,p(x,y)\,dx\,dy.$$
(3.2)
Then we propose to match these three marginalized encodings simultaneously to provide consistent latent encodings that benefit the multiview alignment. Since it is nontrivial to specify a distribution measure among the prior $p_0(z)$ and the surrogate distributions marginalized from the different views, we represent this idea as
$$q_x(z) \approx q_y(z) \approx q_{xy}(z) \approx p_0(z).$$
(3.3)

Compared with the KL-divergence, which harshly matches the conditional distribution of each sample to the prior, our proposed constraint matches the marginal distributions, $\int q(z|x)\,p(x)\,dx \approx p_0(z)$. Since we take the condition variables into consideration, this constraint is tolerant of variability in the input data. This property also makes it well suited to matching multiview encodings: $\int q(z|x)\,p(x)\,dx \approx \int q(z|y)\,p(y)\,dy \approx \iint q(z|x,y)\,p(x,y)\,dx\,dy \approx p_0(z)$. The multiview alignment can be further improved by expanding the expressiveness of the latent space with more complex prior distributions (Mathieu, Rainforth, Siddharth, & Teh, 2019).
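Operationally, samples of a marginalized posterior are cheap to obtain: encoding minibatches drawn from the data distribution integrates the condition variable out. A minimal sketch (the helper name and the assumption that the encoder returns one code per input are ours):

```python
import torch

def sample_marginalized_posterior(encoder, batches):
    """Draw samples of q_*(z) = ∫ q(z|·) p(·) d· . Encoding minibatches
    drawn from the data distribution marginalizes the condition variable
    out, so the pooled codes are samples of the marginalized posterior."""
    return torch.cat([encoder(b) for b in batches], dim=0)
```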

3.2  Adversarial CCA with Consistent Latent Encoding

To realize this design, we present ACCA, which provides consistent latent encoding by matching the marginalized latent encodings to flexible priors through the adversarial training paradigm. We adopt two schemes to facilitate consistent latent encodings in ACCA.

First, to provide different factorization forms for the joint distribution of multiview data, we provide holistic information for the latent encodings, q(z|x,y), q(z|x) and q(z|y), in ACCA. Besides the two principal encodings, q(z|x) and q(z|y), which support cross-view analysis, we explicitly model q(z|x,y) by encoding an auxiliary view XY that contains all the information of the two views. With the encoding from this auxiliary view, the latent space is more expressive for the correspondence of the multiple views.

Second, we match the marginalizations of these holistic encodings simultaneously with the adversarial learning technique. Adversarial learning minimizes the Jensen-Shannon (JS) divergence between two distributions through binary classification directly on samples of the two distributions (Goodfellow et al., 2014). Consequently, any two distributions can be matched as long as their samples are available (the explicit forms of the distributions are not required). We adopt adversarial learning as the criterion to match the marginalizations of all three encodings to an arbitrary fixed prior $p_0(z)$ in ACCA. To be specific, we apply an adversarial distribution matching scheme on the common latent space. Within this scheme, each encoder acts as a generator that defines a marginalized posterior over z (Makhzani, Shlens, Jaitly, Goodfellow, & Frey, 2015) in equation 3.2. The obtained latent codes of individual data instances are samples of the corresponding marginalized posteriors, $q_*(z)$. The three marginalized posteriors are constrained to match one another by simultaneously matching the same prior $p_0(z)$, namely equation 3.3, with a shared discriminator (Hoang, Nguyen, Le, & Phung, 2018). We present the formulation of the proposed constraints in section 3.2.1.

Consequently, our ACCA realizes the proposed marginalization design by adversarially matching the marginalized posteriors with a common and flexible prior distribution. As shown in Table 1, our ACCA improves on the existing generative CCA models in three ways:

  • We recover the consistency of multiple views by matching the marginalization of holistic encodings. This contributes to the consistent latent encoding of multiple views, which benefits the multiview alignment.

  • It avoids the gaussian distribution restriction on p(z). Instead of computing the criterion analytically, adversarial learning provides an efficient estimation of the JS-divergence between the encodings (Goodfellow et al., 2014). This helps ACCA handle expressive latent space with flexible prior distributions.

  • It does not require explicit distribution assumptions on the posterior p(z|x,y). The adversarial learning scheme matches the incorporated distributions implicitly. Thus, it can benefit the model by omitting the sampling operation required in other generative CCA models (e.g., VCCA and MCVAE).

Table 1:
Comparison of CCA Methods for Multiview Alignment.
| Category | Method | Nonlinear Mapping | Criterion | Consistent Encoding | Avoids Gaussian Restriction on p(z) | Implicit Posterior p(z\|x,y) |
| --- | --- | --- | --- | --- | --- | --- |
| Vanilla CCA models | CCA | ✗ | Linear correlation | ✗ | ✗ | — |
| | KCCA | ✓ | Linear correlation | ✗ | ✗ | — |
| | DCCA | ✓ | Linear correlation | ✗ | ✗ | — |
| | DCCAE | ✓ | Linear correlation | ✗ | ✗ | — |
| | MVAE | ✓ | — | — | — | — |
| Generative CCA models | PCCA | ✗ | KL-divergence | ✗ | ✗ | ✗ |
| | VCCA | ✓ | KL-divergence | ✗ | ✗ | ✗ |
| | Bi-VCCA | ✓ | KL-divergence | ✗ | ✗ | ✗ |
| | MCVAE | ✓ | KL-divergence | ✗ | ✗ | ✗ |
| | ACCA (ours) | ✓ | Adversarial learning | ✓ | ✓ | ✓ |

A diagram of ACCA is presented in Figure 2d. Note that the three encodings are all essential in ACCA. First, the encodings of the principal views, q(z|x) and q(z|y), are essential to facilitate cross-view analysis with generative CCA methods. Second, the encoding of the auxiliary view, q(z|x,y), contributes to a latent space that better encodes the correspondence of the multiple views and thus benefits the multiview alignment achieved in ACCA. Indeed, one could achieve expressive representations for the multiview data with only the auxiliary encoding; however, this is not the focus of our work. We further emphasize the significance of the auxiliary view and the superiority achieved with adversarial learning in section 4.2.

Figure 2:

Graphical diagrams for generative nonlinear CCA variants. The solid lines in each diagram denote the generative models pθ(z)pθ(*|z). The dashed lines denote the approximation qϕ(z|*) to the intractable posterior pθ(z|*). The * indicates x or y.

3.2.1  Formulation

Based on the design, the objective of our ACCA consists of two components: (1) the log likelihood (reconstruction) terms for fitting the multiview data and (2) the adversarial learning constraint that contributes to consistent latent encoding. The objective of our ACCA is given as
$$\min_{\Theta,\Phi} \mathcal{L}_{ACCA}(x,y) = \frac{1}{N} \sum_{\{x,y\}} \Big[ -\mathbb{E}_{q_{\phi_{xy}}(z|x,y)}[\log p_{\theta_x}(x|z) + \log p_{\theta_y}(y|z)] - \mathbb{E}_{q_{\phi_x}(z|x)}[\log p_{\theta_x}(x|z) + \log p_{\theta_y}(y|z)] - \mathbb{E}_{q_{\phi_y}(z|y)}[\log p_{\theta_x}(x|z) + \log p_{\theta_y}(y|z)] + \mathcal{R}_{GAN} \Big],$$
(3.4)
where Θ and Φ denote the parameters of the decoders and the encoders, respectively; that is, $\Theta = \{\theta_x, \theta_y\}$ and $\Phi = \{\phi_{xy}, \phi_x, \phi_y\}$.
ACCA, illustrated in Figure 3, consists of six subnetworks. The three encoders, $\{E_x, E_{xy}, E_y\}$, and the two decoders, $\{D_x, D_y\}$, constitute the view-reconstruction scheme, which corresponds to the first three terms in equation 3.4. The three encoders (generators), together with the shared discriminator $\hat{D}$, compose the adversarial distribution matching scheme. These subnetworks, $\{E_x, E_{xy}, E_y, \hat{D}\}$, form the adversarial regularizer $\mathcal{R}_{GAN}$,
$$\mathcal{R}_{GAN}(E_x, E_y, E_{xy}, \hat{D}) = \mathbb{E}_{z \sim p_0(z)} \log \hat{D}(z) + \mathbb{E}_{z_{xy} \sim q_{\phi_{xy}}(z|x,y)} \log(1 - \hat{D}(z_{xy})) + \mathbb{E}_{z_x \sim q_{\phi_x}(z|x)} \log(1 - \hat{D}(z_x)) + \mathbb{E}_{z_y \sim q_{\phi_y}(z|y)} \log(1 - \hat{D}(z_y)).$$
(3.5)
Here, we add subscripts to distinguish the latent codes z encoded from the different views X, Y, and XY. This distinction is critical for the experiments.
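Equation 3.5 translates directly into code. A minimal PyTorch sketch, assuming the shared discriminator outputs probabilities in (0, 1):

```python
import torch

def r_gan(disc, z_prior, z_x, z_y, z_xy):
    """Adversarial regularizer of equation 3.5: the shared discriminator
    scores prior samples as real and the codes from the three encoders
    (acting as generators) as fake."""
    loss = torch.log(disc(z_prior)).mean()
    for z in (z_x, z_y, z_xy):
        loss = loss + torch.log(1 - disc(z)).mean()
    return loss
```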
Figure 3:

Overall structure of ACCA. The left panel represents encoding with the holistic information scheme. The top right panel corresponds to the cross-view reconstruction. The bottom right panel illustrates the adversarial learning criterion.

In practice, ACCA is jointly trained by alternately updating the reconstruction and regularization phases. In the reconstruction phase, we update the encoders and the decoders to minimize the reconstruction error of the two principal views. In the regularization phase, the adversarial networks, with multiple encoders acting as generators, are trained following the same alternating procedure as in Hoang et al. (2018). Once the training procedure is done, the encoders define expressive encodings for each view.
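A sketch of one such joint iteration is given below. All names (enc_x, enc_y, enc_xy, dec_x, dec_y, disc, opt_ae, opt_disc, opt_enc, prior, recon_loss) are illustrative rather than from the paper, and r_gan refers to the sketch above:

```python
import torch

for x, y in loader:
    # Reconstruction phase: update encoders and decoders on the
    # negative log-likelihood (reconstruction) terms of equation 3.4.
    opt_ae.zero_grad()
    recon_loss(x, y, enc_x, enc_y, enc_xy, dec_x, dec_y).backward()
    opt_ae.step()

    # Regularization phase, alternating as in Hoang et al. (2018):
    # (1) the shared discriminator ascends R_GAN on prior samples
    #     versus (detached) codes from the three encoders;
    z_x, z_y, z_xy = enc_x(x), enc_y(y), enc_xy(x, y)
    opt_disc.zero_grad()
    d_loss = -r_gan(disc, prior.sample((x.size(0),)),
                    z_x.detach(), z_y.detach(), z_xy.detach())
    d_loss.backward()
    opt_disc.step()

    # (2) the encoders (generators) update to fool the discriminator.
    opt_enc.zero_grad()
    g_loss = -sum(torch.log(disc(z)).mean()
                  for z in (enc_x(x), enc_y(y), enc_xy(x, y)))
    g_loss.backward()
    opt_enc.step()
```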

4  Connection to Other Models

In this section, we discuss the connection between ACCA and other models.

4.1  Understanding CCA Models with CMI

From a Bayesian perspective, general CCA models come with the assumption that the two views, X and Y, are conditionally independent given the latent variable Z (i.e., equation 2.1) to achieve a tractable solution for inference. However, this assumption is hard to verify in real multiview analysis problems that incorporate complex distributions. Here, we analyze this inherent assumption of CCA with conditional mutual information (CMI).

Given random variables X, Y, and Z, the CMI is the expected KL-divergence between the conditional joint distribution p(x,y|z) and the product of the conditional marginal distributions, p(x|z) and p(y|z) (Zhang, Zhao, Hao, Zhao, & Chen, 2014):
$$I(X;Y|Z) = \mathbb{E}_{p(z)} \big[ D_{KL}(p(x,y|z)\,\|\,p(x|z)\,p(y|z)) \big] \ge 0.$$
(4.1)
The minimum, I(X;Y|Z) = 0, is achieved only when X and Y are conditionally independent given Z. Consequently, the conditional independence criterion of CCA (see equation 2.1) can be achieved by minimizing the CMI. The objective can be given as
$$\begin{aligned} I_\theta(X;Y|Z) &= \iiint p(z)\,p(x,y|z) \log \frac{p(x,y|z)}{p(x|z)\,p(y|z)} \,dz\,dx\,dy \\ &= \iiint p(z|x,y)\,p(x,y) \Big[ \log \frac{p(x,y|z)}{p(x|z)\,p(y|z)} - \log p(x,y) + \log p(x,y) \Big] dz\,dx\,dy \\ &= \iiint p(z|x,y)\,p(x,y) \Big[ \log \frac{p(z|x,y)}{p(z)} - \log p(x|z) - \log p(y|z) + \log p(x,y) \Big] dz\,dx\,dy \\ &= H(X,Y) + \mathbb{E}_{p_\theta(x,y)} \Big[ -\mathbb{E}_{p(z|x,y)}[\log p_\theta(x|z) + \log p_\theta(y|z)] + D_{KL}(p_\theta(z|x,y)\,\|\,p(z)) \Big], \end{aligned}$$
where H(X,Y) is a constant and has no effect on the optimization (Gao, Brekelmans, Steeg, & Galstyan, 2018). Therefore, the minimum of CMI can be achieved by minimizing the remaining terms, namely
$$\min_\theta \mathbb{E}_{p(x,y)}[\mathcal{F}_\theta(x,y)] \approx \min_\theta \frac{1}{N} \sum_{\{x,y\}} \mathcal{F}_\theta(x,y),$$
(4.2)
where $\mathcal{F}_\theta(x,y) = -\mathbb{E}_{p_\theta(z|x,y)}[\log p_\theta(x|z) + \log p_\theta(y|z)] + D_{KL}(p_\theta(z|x,y)\,\|\,p(z))$.

Although equation 4.2 presents an objective for minimizing the CMI, it is hard to optimize, since the posterior $p_\theta(z|x,y)$ is unknown or intractable in practical multiview learning problems. Consequently, existing methods make different assumptions on the incorporated distributions (e.g., prior, likelihood, and posterior) and adopt approximate inference methods to achieve tractable solutions for multiview analysis.

Example 1: PCCA
(Bach & Jordan, 2005). With an explicit conditional independence assumption, PCCA adopts gaussian assumptions for both the likelihood and the prior to achieve a tractable solution for the inference in linear CCA. Under the conditional independence constraint, the minimum of the CMI, I(X;Y|Z) = 0, is naturally satisfied. Due to the conjugacy of the prior and the likelihood, the posterior in equation 4.1 admits an analytic solution, with which the model parameters can be directly estimated with expectation-maximization algorithms:
$$z \sim \mathcal{N}(0, I_d), \quad x|z \sim \mathcal{N}(W_x z + \mu_x, \Phi_x), \quad y|z \sim \mathcal{N}(W_y z + \mu_y, \Phi_y).$$
Example 2: MVAE
(Ngiam et al., 2011). If we consider gaussian models with $z \sim \mathcal{N}(\mu, 0)$, $p_\theta(x|z) = \mathcal{N}(F_{\phi_x}^{-1}(z_x), I)$, and $p_\theta(y|z) = \mathcal{N}(G_{\phi_y}^{-1}(z_y), I)$, then $z_x$ and $z_y$ are point embeddings obtained with the deterministic mappings F and G, that is, $z_x = F_{\theta_x}(x)$ and $z_y = G_{\theta_y}(y)$ (see section 2.1). The reconstruction terms in equation 4.2 then measure the $\ell_2$ reconstruction error of the two inputs from the latent code z through the DNNs defined by $F^{-1}$ and $G^{-1}$. The objective of MVAE is
$$\min_{\theta,\phi} \frac{1}{2N} \sum_{\{x,y\}} \|x - F_{\phi_x}^{-1}(F_{\theta_x}(x))\|^2 + \|y - G_{\phi_y}^{-1}(G_{\theta_y}(y))\|^2.$$
Note that MVAE is a simple AE, with no regularization on posterior and prior matching.
Example 3: VCCA
(Wang et al., 2016). Considering a model where the latent codes $z \sim \mathcal{N}(\mu, \Sigma)$ and the observations x|z and y|z both follow implicit distributions, VCCA adopts variational inference to obtain the approximate posterior for equation 4.1 with two additional assumptions: (1) a single input view provides sufficient information for the multiview encoding, namely, $q_\phi(z|x,y) \approx q_\phi(z|x)$, and (2) the variational approximate posterior is $q_\phi(z|x) \sim \mathcal{N}(z; \mu, \Sigma)$, where $\Sigma = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_d^2)$. In this case, the KL-divergence term can be computed explicitly as $D_{KL}(q_\phi(z|x)\,\|\,p_\theta(z)) = -\frac{1}{2} \sum_{j=1}^d (1 - \sigma_j^2 - \mu_j^2 + \log \sigma_j^2)$. Note that $p_0(z)$ is defined with an explicit form, and the encoding functions actually model the distribution parameters. The latent codes are then obtained by drawing L samples from the posterior distribution, $z^l \sim q_\phi(z|x)$, with the reparameterization trick (Kingma & Welling, 2013). The objective of VCCA is given as
$$\min_{\theta,\phi} \frac{1}{N} \sum_{\{x,y\}} \Big[ -\frac{1}{L} \sum_{l=1}^L [\log p_\theta(x|z^l) + \log p_\theta(y|z^l)] + D_{KL}(q_\phi(z|x)\,\|\,p_\theta(z)) \Big], \quad \text{s.t. } z_x^l = \mu_x + \Sigma_x \varepsilon^l, \text{ where } \varepsilon^l \sim \mathcal{N}(0, I_d),\ l = 1, \ldots, L.$$
(4.3)
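The sampling constraint in equation 4.3 is the standard reparameterization trick, which can be sketched as:

```python
import torch

def reparameterize(mu, logvar, L=1):
    """z^l = mu + sigma * eps^l, eps^l ~ N(0, I): draws the L posterior
    samples of equation 4.3 in a differentiable way (a sketch)."""
    eps = torch.randn(L, *mu.shape)               # eps^l ~ N(0, I)
    return mu + torch.exp(0.5 * logvar) * eps     # broadcasts over L samples
```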
Example 4: Bi-VCCA
(Wang et al., 2016). Bi-VCCA adopts the encodings of both $q_\phi(z|x)$ and $q_\phi(z|y)$ to approximate $q_\phi(z|x,y)$. Its objective is given as a heuristic combination of equation 4.3 derived with each encoding,
$$\min_{\theta,\phi} \frac{1}{N} \sum_{\{x,y\}} \Big[ -\frac{\lambda}{L} \sum_{l=1}^L [\log p_\theta(x|z_x^l) + \log p_\theta(y|z_x^l)] + D_{KL}(q_\phi(z|x)\,\|\,p_\theta(z))$$
(4.4)
$$-\frac{1-\lambda}{L} \sum_{l=1}^L [\log p_\theta(x|z_y^l) + \log p_\theta(y|z_y^l)] + D_{KL}(q_\phi(z|y)\,\|\,p_\theta(z)) \Big], \quad \text{s.t. } z_x^l = \mu_x + \Sigma_x \varepsilon^l,\ z_y^l = \mu_y + \Sigma_y \varepsilon^l, \text{ where } \varepsilon^l \sim \mathcal{N}(0, I_d),\ l = 1, \ldots, L,$$
(4.5)
where $\lambda \in [0, 1]$ is the trade-off factor between the two encodings.

4.2  ACCA versus Existing CCA Methods

Based on our analysis, we emphasize the superiority of the proposed ACCA (Figure 1c) over the other CCA prototypes:

  • The adversarial learning criterion enables ACCA to achieve a tractable solution for multiview analysis with many flexible prior and posterior distributions. This benefits the expressiveness of the obtained aligned latent space.

  • The adversarial learning criterion leads to consistent latent encoding in ACCA by matching the marginalization of the incorporated distributions and thus helps ACCA achieve better instance-level alignment for the multiple views.

  • By appending q(z|x,y) with the auxiliary view XY, our ACCA can better estimate the CMI-minimization objective, equation 4.2, compared with other variants that simply adopt the encodings of individual views, that is, q(z|x) and q(z|y).

Some works adopt additional penalties, such as sparsity constraints (Shi, Xu, Pan, Tsang, & Pan, 2019), on these prototypes to enhance multiview alignment. For instance, Kidron, Schechner, and Elad (2007) extend classic CCA with sparsity to enhance its performance on cross-modal localization tasks. Jia, Salzmann, and Darrell (2010) introduce structured sparsity into MVAE. Virtanen, Klami, and Kaski (2011) propose a generative CCA variant that also adopts KL-divergence as the criterion, with only an additional group sparsity assumption to improve the variational approximation. Note that we can also extend ACCA with corresponding structural priors to enhance the alignment of the multiple views (Mathieu et al., 2019).

Some other work extends these prototypes by further exploiting view-specific information. Besides the multiview shared information, these variants also consider the specific information in each view to benefit the alignment task. For example, VCCA-private extends Bi-VCCA by introducing two hidden variables, $h_x$ and $h_y$, to capture the private information that is not captured by the common variable Z. It adopts two extra KL-divergence constraints to match the encodings of the private variables (Wang et al., 2016). Our ACCA can also be extended with such private variables and additional discriminators to further enhance the alignment.

There is also generative CCA work that incorporates additional information (e.g., supervision) to benefit the multiview alignment. For example, the multiview information bottleneck (MVIB) (Federici, Dutta, Forré, Kushmann, & Akata, 2020) aligns the two views in a supervised manner in order to obtain multiview data representations that are maximally informative about the downstream prediction task, $X_1, X_2 \to Y$. (Note that Y denotes the label here.) Consequently, its motivation differs from that of our ACCA, which targets an alignment of the two views for generation tasks: $X_1 \leftrightarrow X_2$. In fact, MVIB cannot even facilitate our targeted cross-view generation task, due to the lack of a generation mechanism. In addition, besides the minimum CMI criterion, equation 4.1, which formulates ACCA, MVIB adopts an additional superfluous-information minimization objective to discard the input information that is irrelevant to the label:
$$\mathcal{L}_{MIB}(\theta; \lambda) = \underbrace{I_\theta(Z; X_1 | X_2)}_{\text{superfluous information}} + \lambda\, I_\theta(X_1; X_2 | Z).$$

In this sense, MVIB can be regarded as an extension of multiview alignment that further incorporates a superfluous-information objective to handle supervised downstream tasks. We could also apply the inference method developed for ACCA to solve a generative variant of the MVIB objective.

Note that in this work, we focus on studying the classic CCA prototypes in terms of multiview alignment for data generation. Consequently, the CCA variants with additional penalties or view-specific variables are not included in the main comparisons here. MVIB is not comparable here, since it cannot facilitate generation.

4.3  ACCA versus Adversarial Autoencoders

Adversarial autoencoders (AAEs) (Makhzani et al., 2015) are also highly relevant to our ACCA. AAEs adopt adversarial distribution matching to promote the reconstruction of autoencoders, building on variational autoencoders (VAEs). Compared with AAEs, our ACCA extends adversarial distribution matching to the multiview scenario to facilitate multiview alignment, especially for cross-view generation tasks. By analyzing the conditional independence assumption in CCA with CMI, we also justify that our model achieves superior alignment of the multiple views through consistent latent encoding.

4.4  Instance-Level Alignment Versus Distribution-Level Alignment

In this work, we study instance-level multiview alignment with CCA to achieve correspondence between the instance embeddings obtained from each view. There is also work that studies the distribution-level alignment of multiple views. This work focuses on aligning the marginal distributions of the views, P(X) and P(Y), without considering the pairwise correspondence of each instance. Cross-view generation in such a setting is regarded as a style transfer task (Ganin et al., 2016). For example, cycle-GAN (Zhu, Park, Isola, & Efros, 2017) studies unsupervised image translation between two domains by modeling cycle consistency. UNIT (Liu, Breuel, & Kautz, 2017) and MUNIT (Huang, Liu, Belongie, & Kautz, 2018) study the same task by incorporating a common latent space into cycle-GAN. Conditional GANs have been adopted to facilitate the cross-view image synthesis task (Regmi & Borji, 2018). Because this is not the focus of our letter, we do not discuss them further.

5  Experiments

In this section, we evaluate the performance of our ACCA regarding multiview alignment and generation. We first discuss the advantages of ACCA in section 5.1. Then we show the superiority of ACCA in achieving multiview alignment in three respects. We conduct correlation analysis to show that ACCA captures higher nonlinear correlation among the multiple views in section 5.2. We present alignment verification to show that ACCA achieves better instance-level correspondence in the latent space in section 5.3. We conduct several cross-view analysis tasks with noisy inputs to show the robustness of ACCA in achieving instance-level alignment of the multiple views in section 5.4.

We also evaluate the quality of obtained embeddings regarding downstream supervised tasks, to demonstrate that our ACCA facilitates superior alignment without sacrificing the discriminative property of the representation. The experiments regarding clustering and classification are presented in sections 5.4.3 and 5.5, respectively.

Note that our work targets instance-level multiview alignment and generation. Consequently, we emphasize the evaluation in the first few subsections: the correspondence preserved in the latent embeddings and how well the correspondence can be recovered from the obtained latent spaces, that is, cross-view generation. The evaluation of the discriminative property of the latent embeddings is presented for illustration.

In section 5.6, we present a preliminary study on the influence of view-specific variables for alignment and generation as future work.

5.1  Superiority of Adversarial Criterion for Multiview Alignment

We first examine the benefits achieved with the adversarial learning alignment criterion: consistently matching the marginalized latent encodings with flexible priors.

5.1.1  Consistent Encoding in ACCA

We verify the consistent encoding in ACCA with one of the most commonly used multiview learning data sets: the MNIST left/right halved data set (MNIST_LR) (Andrew et al., 2013). Details about the data set and network design are shown in Table 2.

Table 2:
Details of the Data Sets and Network Settings with MLPs.
| Data set | Statistics | Dimension of z | Network setting (MLPs), with $\hat{D}$ = {1024, 1024, 1024} | Parameters |
| --- | --- | --- | --- | --- |
| Toy data set (simulated) | # Tr = 8,000; # Te = 2,000 | d = 10 | Ex = {1024, 1024}; Exy = {1024, 1024}; Ey = {1024, 1024} | For all data sets: learning rate = 0.001, epochs = 100. For each data set: batch size tuned over {16, 32, 128, 256, 500, 512, 1000}; d tuned over {10, 30, 50, 100} |
| MNIST L/R halved data set (MNIST_LR) (Andrew et al., 2013) | # Tr = 60,000; # Te = 10,000 | d = 30 | Ex = {2308, 1024, 1024}; Exy = {3916, 1024, 1024}; Ey = {1608, 1024, 1024} | |
| MNIST noisy data set (MNIST_Noisy) (Wang et al., 2016) | # Tr = 60,000; # Te = 10,000 | d = 50 | Ex = {1024, 1024, 1024}; Exy = {1024, 1024, 1024}; Ey = {1024, 1024, 1024} | |
| Wisconsin X-ray microbeam database (XRMB) (Wang et al., 2016) | # Tr = 1.4M; # Te = 0.1M | d = 112 | Ex = {1811, 1811}; Exy = {3091, 3091}; Ey = {1280, 1280} | |

To verify the approximation of the three encodings in ACCA, we estimate the distances among the three posterior distributions with kernel maximum mean discrepancy (MMD) (Gretton, Borgwardt, Rasch, Schölkopf, & Smola, 2012). We assign the gaussian mixture prior (see equation 5.1) for ACCA and then calculate the sum of the MMD distances between the three encodings in equation 3.2 and the prior $p_0(z)$ during the training process. Figure 4 shows that the distance gradually decreases as ACCA converges. This trend verifies that ACCA can facilitate the matching of nongaussian marginalized posteriors, that is, consistent encoding (see equation 3.3).
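For reference, the biased kernel MMD² estimate used for this check can be sketched as follows (a simplified single-bandwidth RBF version):

```python
import torch

def mmd2_rbf(z_p, z_q, sigma=1.0):
    """Biased estimate of kernel MMD^2 between two sets of latent samples
    (Gretton et al., 2012), with an RBF kernel of bandwidth sigma."""
    def gram(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return (gram(z_p, z_p).mean() + gram(z_q, z_q).mean()
            - 2 * gram(z_p, z_q).mean())
```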

Figure 4:

Verification of consistent encoding in ACCA. Left: The holistic encodings are approximated (see equation 3.3) during the training of ACCA. Right: The minimum CMI (see equation 4.1) is implicitly achieved in ACCA.

We also estimate the CMI during model training with an open-source nonparametric entropy estimation toolbox.1 The right panel of Figure 4 illustrates that the CMI gradually decreases during the training of ACCA and reaches zero at a relatively early stage of convergence. This trend indicates that ACCA implicitly minimizes the CMI and that the optimum, I(X;Y|Z) = 0, can be achieved at convergence. Consequently, the explicit conditional independence constraint of CCA (see equation 2.1) is automatically satisfied in our ACCA.

5.1.2  Flexibility of Prior Encoding in Alignment

We conduct correlation analysis on a toy data set with a nongaussian prior to verify that ACCA benefits from handling implicit distributions for multiview alignment.

Toy data set. 
Following William (2000), we construct a toy multiview data set in which the two views exhibit a nonlinear relationship. Let $X = W_1 Z$ and $Y = W_2 Z^T Z$, where Z denotes a 10-D vector with each dimension $z \sim p(z)$, and $W_1 \in \mathbb{R}^{10 \times 50}$ and $W_2 \in \mathbb{R}^{10 \times 50}$ are random projection matrices used to construct the data. Details of the setting are presented in Table 2. As we consider nonlinear dependency with a nongaussian prior, we set $p_0(z)$ to a mixture of gaussians in this experiment:
$$z \sim p(z) = 0.2 \times \mathcal{N}(0, 1) + 0.5 \times \mathcal{N}(8, 2) + 0.3 \times \mathcal{N}(3, 1.5).$$
(5.1)
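A minimal numpy sketch of sampling this prior per dimension (we read the second mixture parameters as standard deviations, which is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gm_prior(n, d=10):
    """Per-dimension draws from the gaussian mixture of equation 5.1."""
    comp = rng.choice(3, size=(n, d), p=[0.2, 0.5, 0.3])  # mixture component
    mu = np.array([0.0, 8.0, 3.0])[comp]
    sd = np.array([1.0, 2.0, 1.5])[comp]                  # assumed std devs
    return mu + sd * rng.standard_normal((n, d))
```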
Dependency metric. 

The Hilbert-Schmidt independence criterion (HSIC) (Gretton, Bousquet, Smola, & Schölkopf, 2005) is a commonly used measure of the overall dependency among variables. In this work, we adopt the normalized estimate of HSIC (nHSIC) (Wu, Zhao, Tsai, Yamada, & Salakhutdinov, 2018) as the metric to measure the dependency captured by the embeddings of the test set ($Z_X^{Te}$ and $Z_Y^{Te}$) for each method. We report the nHSIC computed with both the linear and the RBF kernel (σ is set with the F-H distance between the points).
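A minimal sketch of the nHSIC estimate with a linear kernel (an RBF gram matrix can be substituted for the RBF variant; the normalization follows HSIC(X,Y) / √(HSIC(X,X) · HSIC(Y,Y))):

```python
import numpy as np

def nhsic(Zx, Zy, kernel=lambda A: A @ A.T):
    """Normalized HSIC between paired embedding sets (a sketch)."""
    n = Zx.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    Kx = H @ kernel(Zx) @ H
    Ky = H @ kernel(Zy) @ H
    # HSIC(X,Y) ∝ tr(Kx Ky), and tr(K K) = ||K||_F^2 for symmetric K.
    return np.trace(Kx @ Ky) / (np.linalg.norm(Kx) * np.linalg.norm(Ky))
```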

Baselines. 

We compare ACCA with several state-of-the-art vanilla CCA variants here.

  • CCA (Hotelling, 1936): Linear CCA model that learns linear projections of the two views that are maximally correlated.

  • PCCA (Bach & Jordan, 2005): Probabilistic variant of linear CCA.

  • DCCA (Andrew et al., 2013): Deep CCA, a nonlinear CCA extension with DNNs.

  • MVAE (Ngiam et al., 2011): Multiview autoencoder, a CCA variant that discovers the dependency among the data via multiview reconstruction.

  • Bi-VCCA (Wang et al., 2016): Bi-deep variational CCA, a representative generative nonlinear CCA model restricted to a gaussian prior.

  • ACCA_NoCV: A variant of ACCA designed without the encoding of the complementary view XY. This is used to verify the effectiveness of the holistic encoding scheme in ACCA.

  • ACCA (G): ACCA implemented with the standard gaussian prior.

  • ACCA(GM): ACCA implemented with the exact gaussian mixture prior.

Since ACCA handles posterior distributions implicitly, its latent space can be more expressive in revealing the correspondence of the multiple views than those of the baselines, which can only directly handle simple gaussian priors. (An additional sampling procedure is required for these methods to handle other, more complex distributions.) Consequently, higher nonlinear dependency is expected with ACCA, especially when given the exact prior of the multiview data set. Table 3 reports the dependency captured in the common latent space of each method. The results are revealing in several ways:

  • Both CCA and PCCA achieve low nHSIC value on the toy data set, due to their inability to capture nonlinear dependency.

  • DCCA achieves higher nHSIC scores than the other baselines due to its objective, which directly targets higher linear correlations. However, its results are still inferior to those of our methods.

  • The results of MVAE and Bi-VCCA are unsatisfactory. MVAE performs poorly because it lacks an inference mechanism to evaluate the quality of the encodings. Bi-VCCA gets inferior results mainly because of the inconsistent encoding problem caused by its inferior alignment criterion.

  • Our ACCA model achieves good performance here. This indicates that the consistent encoding imposed by the adversarial distribution matching benefits the model's ability to capture nonlinear dependency.

  • ACCA (GM) achieves the best result in both settings. This verifies that ACCA benefits from its ability to handle implicit distributions.

Table 3:
The Dependency of Latent Embeddings.
| Metric | Data Set | CCA | PCCA | DCCA | MVAE | Bi-VCCA | ACCA_NoCV | ACCA (G) | ACCA (GM) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nHSIC (linear kernel) | toy | 0.0010 | 0.1037 | 0.5353 | 0.1428 | 0.1035 | 0.8563 | 0.7296 | **0.9595** |
| | MNIST_LR | 0.4210 | 0.3777 | 0.6699 | 0.2500 | 0.4612 | 0.5233 | 0.5423 | **0.6823** |
| | MNIST_Noisy | 0.0817 | 0.1037 | 0.1460 | 0.4089 | 0.1912 | 0.3343 | 0.3285 | **0.4133** |
| | XRMB | 0.0574 | 0.0416 | 0.2970 | 0.2637 | 0.1046 | 0.1244 | 0.2903 | **0.3482** |
| | Maps | — | — | 0.3465 | 0.4423 | 0.1993 | **0.7324** | 0.5157 | 0.7043 |
| nHSIC (RBF kernel) | toy | 0.0029 | 0.2037 | 0.7685 | 0.2358 | 0.2543 | 0.8737 | 0.5870 | **0.8764** |
| | MNIST_LR | 0.4416 | 0.3568 | 0.6877 | 0.1499 | 0.3804 | 0.5799 | 0.6318 | **0.7387** |
| | MNIST_Noisy | 0.0948 | 0.0993 | 0.1605 | 0.4133 | 0.2076 | 0.2697 | 0.3099 | **0.4326** |
| | XRMB | 0.0534 | 0.03184 | **0.3180** | 0.0224 | 0.0846 | 0.1456 | 0.2502 | 0.2989 |
| | Maps | — | — | 0.5905 | 0.5624 | 0.3956 | 0.8171 | 0.6285 | **0.8658** |

Note: Higher is better for dependency. The best are in bold.

5.2  Correlation Analysis

We next conduct correlation analysis on four commonly used multiview data sets to verify the alignment achieved with each method. Higher correlations are expected for latent embeddings that better preserve the data correspondence. Details about the data sets are presented in Tables 2 and 9 (Table 9 is in section 6). For XRMB, we follow the setting of DCCA (Wang et al., 2016): we divide the data set into 5 folds and report the average nHSIC scores for comparison. For ACCA (GM), we adopt the same prior as for the toy data set, that is, the mixture in equation 5.1, as a simple arbitrary choice of nongaussian prior. The results are presented in Table 3. We can see that:

  • DCCA achieves a higher correlation compared with the baselines that do not support data generation, CCA and PCCA. This is because it adopts nonlinear mapping, which enables it to exploit nonlinear correlations in the input for alignment.

  • The correlation achieved with MVAE is inferior to that of DCCA in most settings. This is because MVAE seeks embeddings that yield better view reconstruction, whereas DCCA directly targets embeddings that maximize the linear correlation score used in our evaluation.

  • Our methods, ACCA_NoCV and ACCA, outperform Bi-VCCA in all the settings. Our results are comparable to, and in some settings even better than, those of DCCA. This indicates that our consistent encoding design benefits the consistency preserved in the latent space. Since ACCA can facilitate data generation, unlike DCCA, the comparison of ACCA with DCCA and MVAE indicates that ACCA balances data correspondence and reconstruction quality. This argument is further supported by the data generation results in section 5.4.

  • Among our three ACCA variants, the ACCA (GM) achieves the best result in almost all of the settings. This observation indicates that the preserved latent correspondence can be enhanced by incorporating a more expressive latent space with more flexible priors. It also verifies the superiority of our ACCA for directly handling flexible priors without extra sampling procedures (see section 4.2).

In addition to the quantitative correlation analysis, we use t-SNE visualization to examine the quality of the obtained embeddings. In Figure 5, we compare the embeddings of the two individual views obtained with DCCA, MVAE, Bi-VCCA, and ACCA (G). It is clear that for the two vanilla CCA models, DCCA and MVAE, the embeddings of each view fail to preserve a distinguishable clustering structure. This observation accords with our analysis that they lack an inference mechanism to evaluate the quality of the obtained embeddings. For Bi-VCCA, the embedding of view X presents a clear clustering structure, but the embeddings of view Y are scattered randomly in the common latent space. This implies that the instances do not exhibit the desired correspondence in the latent space, meaning that the two views are not well aligned with Bi-VCCA. The observation also implies that the left part of the MNIST data potentially preserves more label information than the right part. For our ACCA, the embeddings of both views present a good clustering structure. This indicates that the two views are better aligned with the proposed ACCA.

Figure 5:

t-SNE visualization of the embeddings of X (left) and Y(right) for MNIST_LR, obtained with DCCA, MVAE, Bi-VCCA, and ACCA, respectively. The colors represent label information.

5.3  Alignment Verification

We conduct alignment verification to evaluate the instance-level correspondence achieved in the common latent space of ACCA. Specifically, we project the paired testing data of the MNIST_LR data set into a two-dimensional latent space with a gaussian prior. We define the misalignment degree as the metric for alignment performance. We take the origin O as the reference and adopt the angular difference to measure the distance between paired embeddings: $\psi(z_x, z_y) = \angle z_x O z_y$. The misalignment degree of the multiple views is given by
$$\delta = \frac{1}{N} \sum_{\{x,y\}} \frac{\psi(z_x, z_y)}{\Psi},$$
(5.2)
where N denotes the number of data pairs and Ψ is the maximum angle among the paired embeddings (see Figure 6d). We compare ACCA with DCCA, MVAE, Bi-VCCA, and ACCA_NoCV here, since these baselines provide encodings for both views.
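A minimal sketch of this metric for 2-D embeddings (the angle-wrapping trick is ours):

```python
import numpy as np

def misalignment_degree(Zx, Zy):
    """Misalignment degree delta of equation 5.2: mean angular difference
    of paired 2-D embeddings about the origin O, normalized by the
    maximum pairwise angle Psi."""
    theta = lambda Z: np.arctan2(Z[:, 1], Z[:, 0])
    diff = theta(Zx) - theta(Zy)
    psi = np.abs(np.angle(np.exp(1j * diff)))   # wrap angles to [0, pi]
    return psi.mean() / psi.max()               # delta = (1/N sum psi) / Psi
```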
Figure 6:

Visualization of the embeddings obtained for the two views. Each row represents the embeddings obtained with view X and view Y, respectively. (zx, zy) denotes a pair of correspondent embedding. δ indicates the misalignment degree of each method. Methods with a smaller value of δ are better.

The results are presented in Figure 6. We make the following observations:

  • For DCCA, the latent embeddings of two views are clearly scattered apart, indicating inferior instance correspondence in the latent space.

  • The regions of the paired embeddings of Bi-VCCA do not overlap, and the misalignment degree of Bi-VCCA is δ = 2.3182, which is much higher than that of the others. This indicates that Bi-VCCA suffers severely from the misaligned encoding problem.

  • ACCA and ACCA_NoCV achieve superior alignment performance compared with DCCA, MVAE, and Bi-VCCA. This shows the effectiveness of the consistency constraint on the marginalization for view alignment in ACCA.

  • The embeddings of ACCA are more uniformly distributed in the latent space than those of ACCA_NoCV. This indicates that the complementary view, XY, provides additional information for the holistic encoding.

5.4  Applications of Cross-View Generation

We design several cross-view generation tasks to reflect the superior multiview alignment achieved in ACCA. We first apply ACCA to an image recovery task to conduct whole-image recovery, given partial images as the input for one of the views. We then test ACCA on a face alignment task to annotate facial landmarks given the face images. Since MVAE and Bi-VCCA are the baseline models that support cross-view generation, we compare against these two methods. We do not compare ACCA_NoCV here since it is a variant of our ACCA and performs comparably to ACCA due to its consistent encoding. We adopt a gaussian prior for ACCA here for a fair comparison.

5.4.1  Image Recovery

We evaluate the image recovery (Sohn, Lee, & Yan, 2015) performance of ACCA on the MNIST handwritten digit data set and the CelebFaces Attributes data set (CelebA) (Liu et al., 2015). Both are commonly used image generation data sets. The performance is evaluated based on the quality of the generated images (e.g., is the image blurred? does the image show apparent misalignment at the junctions in the middle?).

Image recovery on handwritten digits. 

We train the models with the original data while adding noise to the test data of the MNIST data set, to verify the robustness of the alignment achieved with each model. We divide the test data in each view into four quadrants, mask one, two, or three quadrants of the input with gray color (Sohn et al., 2015), and use the noisy images as the input for testing (a sketch of this masking follows). The experimental results are evaluated both qualitatively and quantitatively.
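The sketch below illustrates one way to implement this masking, assuming the test images are (N, H, W) arrays in [0, 1]; the gray value and the quadrant indexing are assumptions, not the exact settings of our implementation.

```python
# Sketch: overlay gray color on one, two, or three quadrants of each image.
import numpy as np

def mask_quadrants(imgs, quadrants, gray=0.5):
    out = imgs.copy()
    h, w = imgs.shape[1] // 2, imgs.shape[2] // 2
    boxes = {1: (slice(0, h), slice(w, None)),     # top-right
             2: (slice(0, h), slice(0, w)),        # top-left
             3: (slice(h, None), slice(0, w)),     # bottom-left
             4: (slice(h, None), slice(w, None))}  # bottom-right
    for q in quadrants:
        rows, cols = boxes[q]
        out[:, rows, cols] = gray
    return out

# e.g., noisy = mask_quadrants(test_imgs, quadrants=[1, 2, 3])  # 3 quadrants
```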

Qualitative analysis. Figure 7 presents some of the recovered images (columns 3 to 5) obtained with one-quadrant input. The figure clearly illustrates that given the noisy input, the images generated with ACCA are more realistic and recognizable than those of MVAE and Bi-VCCA:

Figure 7:

Generated samples given one-quadrant noisy images as input. The first column is the ground truth. The next three columns show the input for view X and the images generated with Bi-VCCA and ACCA, respectively. The last three columns show those of view Y.

  • The images generated with MVAE show the worst quality; they contain much more noise than those of the other methods. In many cases, the "digit" is hard to identify (e.g., case b). In addition, the images generated with MVAE show clear misalignment at the junctions of the halved images (e.g., case a).

  • The images generated by Bi-VCCA are much more blurred and less recognizable than those of ACCA, especially in cases a and b.

  • ACCA can successfully recover the noisy half images, which are confusing even for humans to recognize. For example, in case b, the left-half image of digit 5 looks similar to digit 4; ACCA succeeds in recovering the true digit.

Quantitative evidence. We measure pixel-level accuracy based on the root mean square error (RMSE), that is, 1 - RMSE (a sketch of this measure follows). The results in Table 4 show that our ACCA consistently outperforms Bi-VCCA given the different levels of masking of the input images. It is interesting to note that the whole images generated with the left-half images tend to be more realistic than those generated using the right half. A probable reason is that the right-half images contain more information than the left-half images. This finding coincides with our discovery in Figure 5b. This imbalance of information between the two views would drive the decoder of the less informative view to generate high-quality images while sacrificing the alignment with the other view.
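The following minimal sketch assumes pixel values scaled to [0, 1] and percentage reporting, matching the scale of Table 4; the names are hypothetical.

```python
# Sketch of the pixel-level accuracy used in Table 4: 1 - RMSE, in percent.
import numpy as np

def pixel_accuracy(recovered, target):
    rmse = np.sqrt(np.mean((recovered - target) ** 2))
    return (1.0 - rmse) * 100.0
```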

Table 4:
Pixel-Level Accuracy for Image Recovery with Noisy Inputs on the MNIST Data Set.

Input (halved image)   Method    Gray Color Overlaid
                                 1 Quadrant   2 Quadrants   3 Quadrants
Left                   MVAE      64.94        61.81         56.15
                       Bi-VCCA   73.14        69.29         63.05
                       ACCA      77.66        72.91         67.08
Right                  MVAE      73.57        67.57         59.69
                       Bi-VCCA   75.66        69.72         65.52
                       ACCA      80.16        74.60         66.80

Note: The best results are in bold.

Image recovery on human faces. 

For the human face recovery on the CelebA data set, we halve the RGB images into top and bottom halves and design a CNN architecture to handle this task. Details of the network design are reported in Table 9.

Qualitative analysis. Figure 8 shows the image samples recovered for the CelebA data set. We make three main observations:

  • The samples generated by MVAE show clear misalignment at the junctions, especially when the backgrounds of the images are in color. Some of the images are too blurred to show the details (e.g., the samples circled in red).

  • The samples generated by Bi-VCCA are generally more blurred than those of the other two methods. The observation is quite obvious in the images generated with the top-half input, which contain many fewer details than the bottom-half images.

  • The images generated by ACCA show better quality compared with the others, considering both the clarity and the alignment of junctions.

Figure 8:

The images generated with different methods on CelebA.

Quantitative evidence. We quantitatively assess the quality of the generated images with the Fréchet inception distance (FID) (Heusel, Ramsauer, Unterthiner, Nessler, & Hochreiter, 2017) and estimate the sharpness of the generated test images using the image gradients.2 The results are reported in Table 5. They show that for the image recovery with the top-half face images, the images generated with ACCA are of much better quality than those of MVAE and Bi-VCCA; Bi-VCCA is the worst in terms of the two metrics. For the experiment with the bottom-half face images, the FID score of our ACCA is slightly inferior to that of MVAE; however, the generated images of ACCA are still sharper. Comparing the results of these two experiments, we can see that the image recovery with the top-half image is better than that with the bottom-half image, as it presents a lower FID and higher image sharpness. This observation coincides with our qualitative evaluation in Figure 8, where the images generated with the bottom-half input (left column), especially their generated top halves, are commonly blurrier than those generated with the top-half input (right column). This phenomenon also agrees with our discovery in the handwritten digit recovery task, where the input view with more information obtains worse results.
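As note 2 states, sharpness is estimated from the image gradients and averaged over the test images. The exact estimator is not spelled out in the text, so the following is only an illustrative sketch based on the mean gradient magnitude.

```python
# Illustrative sketch of a gradient-based sharpness score (see note 2).
import numpy as np

def sharpness(images):
    # images: (N, H, W) grayscale arrays.
    scores = []
    for img in images:
        gy, gx = np.gradient(img.astype(np.float64))
        scores.append(np.mean(np.sqrt(gx ** 2 + gy ** 2)))
    return float(np.mean(scores))  # averaged over the test set
```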

Table 5:
FID and Sharpness Scores for Image Recovery on CelebA.

Input (halved image)   Method    FID       Sharpness
Top                    MVAE      61.3360    8.9645
                       Bi-VCCA   78.0752    7.0069
                       ACCA      58.7983   11.9026
Bottom                 MVAE      63.6921    8.5428
                       Bi-VCCA   84.7122    6.7574
                       ACCA      68.1467    8.7249

Notes: Smaller is better for FID and larger is better for sharpness. The sharpness of the real images is 14.6722. Numbers in bold are the best results.

Unconditional human face generation. 

To illustrate how ACCA benefits the image generation quality, we further evaluate the unconditional generation performance of the trained models on the CelebA data set. Specifically, we randomly sample a batch of z from the prior distribution p(z) and use the two decoders to generate both views.
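In code, this sampling step amounts to the following minimal sketch, assuming trained decoder modules dec_x and dec_y (hypothetical names) and the 100-dimensional gaussian prior listed in Table 9.

```python
# Sketch: unconditional generation by decoding shared prior samples.
import torch

@torch.no_grad()
def sample_both_views(dec_x, dec_y, n=64, d=100, device="cpu"):
    z = torch.randn(n, d, device=device)  # z ~ p(z) = N(0, I)
    return dec_x(z), dec_y(z)             # one shared code, two views
```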

The results are presented in Figure 9. It is clear that the images generated with ACCA are much more realistic than those generated with Bi-VCCA, since the facial boundaries of these images are clearer. Notably, ACCA also generates images with more details, owing to the superior correspondence achieved between the input and the latent space. The images in the red box present distinguishable details, such as a cap, a hoodie, glasses, and backgrounds.

Figure 9:

Comparison of Bi-VCCA (left) and ACCA (right) on unconditional generation. The images marked with a red box present distinguishable details.

5.4.2  Face Alignment

We further evaluate the multiview alignment performance of ACCA with a face alignment task (Kazemi & Sullivan, 2014) on CelebA (Liu et al., 2015). We train ACCA with paired face images and ground-truth facial landmark annotations as the inputs for the two views. The better the multiple views are aligned, the better the facial landmark prediction (generation) results that can be achieved given the test face images.

Since the landmark annotations of the original CelebA data set contain only five landmark locations, this data set may be insufficient to assess the performance of models that can handle more complicated applications (Regmi & Borji, 2018). Instead, we construct a more challenging data set with 68 landmark locations as the face annotation. Specifically, we extract the annotations with the state-of-the-art facial landmark localization method Super-FAN (Bulat & Tzimiropoulos, 2018), with the s3fd face detector.3 We drop the images in which no face can be detected and construct a data set with 202,405 samples. Figure 10 presents several samples of our data set. Details of the settings for the face alignment experiment are presented in Table 9.

Figure 10:

Sample images of CelebA for the face alignment experiment.

Qualitative analysis. 

To verify the robustness of ACCA in achieving multiview alignment, we train on the complete data samples while adopting partially occluded images as input at test time to evaluate the alignment performance of each model. Specifically, we randomly occlude the input images with blocks of different sizes (50 × 50, 60 × 60, and 70 × 70), as sketched below. This setting simulates real face alignment scenarios with occluded faces.
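A minimal sketch of this occlusion follows, assuming square blocks placed uniformly at random; the zero fill value is an assumption, not necessarily the value used in our experiments.

```python
# Sketch: occlude a face image with one random s x s block (s in {50, 60, 70}).
import numpy as np

def occlude(img, s, rng=np.random):
    out = img.copy()
    h, w = img.shape[:2]
    top = rng.randint(0, max(h - s, 0) + 1)
    left = rng.randint(0, max(w - s, 0) + 1)
    out[top:top + s, left:left + s] = 0  # blocked region (fill value assumed)
    return out
```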

Figure 11 presents the face alignment results. It is clear that our proposed ACCA outperforms the baselines under both settings, with clearer and more human-interpretable facial landmark annotations. We also observe the following:

  • Most of the generated results of MVAE are noisy and blurred to human perception, which indicates that MVAE is susceptible to noisy input. The problem is even more obvious with larger occlusions: as shown in the right panel of Figure 11, most of the results of MVAE are not recognizable.

  • The results of Bi-VCCA are commonly more blurred than those of MVAE and our ACCA. However, Bi-VCCA is more robust to noisy input than MVAE, since its results are more interpretable under the 70 × 70 blocked setting. This verifies that a latent distribution matching constraint benefits the robustness of the multiview alignment.

  • Our proposed ACCA achieves clear and human-interpretable facial landmark annotations under both settings. This indicates that the multiview alignment achieved with ACCA is the most robust among these three models and verifies that the consistent encoding achieved in ACCA contributes to a better and more robust alignment of the multiple views.

Figure 11:

Performance of face alignment with different levels of face occlusion. Left: the results with 60 × 60 blocked inputs. Right: the results with 70 × 70 blocked inputs.

Quantitative evidence. 

We further analyze the results with two standard metrics for image alignment, peak signal-to-noise ratio (PSNR) (Bulat & Tzimiropoulos, 2018) and structural similarity (SSIM) (Zhang et al., 2018). Table 6 shows that our ACCA is superior to the other two models with respect to the two criteria.
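Both metrics are available off the shelf; a minimal sketch with scikit-image follows (the channel_axis argument assumes a recent scikit-image release, and the names are hypothetical).

```python
# Sketch: PSNR and SSIM between a generated image and its ground truth.
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(pred, target):
    psnr = peak_signal_noise_ratio(target, pred)
    ssim = structural_similarity(target, pred, channel_axis=-1)  # RGB images
    return psnr, ssim
```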

Table 6:
PSNR and SSIM of Face Alignment with Random Occlusions of Different Sizes.

Evaluation Metrics   Methods   Inputs (occluded face images)
                               50 × 50   60 × 60   70 × 70
PSNR                 MVAE      63.0074   62.5455   62.1448
                     Bi-VCCA   63.0289   62.6468   62.3351
                     ACCA      62.2924   62.4175   62.0975
SSIM                 MVAE      0.9982    0.9978    0.9975
                     Bi-VCCA   0.9981    0.9979    0.9976
                     ACCA      0.9984    0.9979    0.9981

Notes: Smaller is better for PSNR, and larger is better for SSIM. The best results are in bold.

5.4.3  Cross-View Generation for High-Dimensional Data

To evaluate the capacity of our ACCA for cross-view generation, we further validate its performance with high-resolution image inputs. We adopt the Google Maps data set (Maps) (Isola, Zhu, Zhou, & Efros, 2017), one of the benchmark data sets for cross-view synthesis applications (Regmi & Borji, 2018). To ensure the quality of the generated images, we equip the autoencoder of each method with skip connections, that is, a U-Net structure (Ronneberger, Fischer, & Brox, 2015). We adopt the least squares GAN (Mao et al., 2017) as the marginal matching constraint of equation 3.3 for ACCA, as sketched below. The results are presented in Figure 12. It is clear that our ACCA outperforms the baselines regarding the generated image quality.
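For concreteness, the least squares GAN objective of Mao et al. (2017), applied to latent codes as the marginal matching constraint, can be sketched as follows; D, z_prior, and z_post are hypothetical names for the latent critic, prior samples, and marginalized posterior samples.

```python
# Sketch: least squares GAN losses on latent codes (marginal matching).
import torch
import torch.nn.functional as F

def lsgan_d_loss(D, z_prior, z_post):
    # Critic regresses prior samples toward 1 and posterior samples toward 0.
    return 0.5 * (F.mse_loss(D(z_prior), torch.ones_like(D(z_prior)))
                  + F.mse_loss(D(z_post), torch.zeros_like(D(z_post))))

def lsgan_g_loss(D, z_post):
    # Encoders are trained so marginalized codes are scored like prior samples.
    return 0.5 * F.mse_loss(D(z_post), torch.ones_like(D(z_post)))
```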

Figure 12:

Comparison of cross-view generation results on the Maps data set.

We also analyze the quality of the obtained embeddings with t-SNE visualization. Specifically, we cluster the embeddings of the aerial photos (view X) with k-means (MacQueen, 1967) (n_clusters = 3) and apply t-SNE visualization to examine whether the results present human-interpretable properties. We mark the centroids of each cluster and highlight the top 3 data samples nearest to each centroid (a sketch of this procedure follows the list below). The results are presented in Figure 13. The comparison is analyzed in two aspects:

  • The clusters of our ACCA are compact and present clear boundaries, while those of MVAE and Bi-VCCA show overlap between the clusters (marked with red dashed circles). The comparison is most obvious with regard to Bi-VCCA. This indicates that the embeddings of our ACCA preserve more discriminative information than those of Bi-VCCA.

  • The clustering results of our ACCA present human-interpretable properties. According to Figure 13c, the three clusters exhibit distinct characteristics. Among the test images, a large proportion are blocks (points in blue), and a small proportion are bodies of water or vegetation (points in purple); the rest are hybrid zones (e.g., highways and railways; points in red). This discovery coincides with the data statistics of the original data set. The clustering results of the MVAE embeddings are not interpretable compared with our ACCA, since the samples nearest the centroids are not distinguishable and no interpretable patterns are presented by the top 3 data samples of each cluster. The cluster centroids obtained with Bi-VCCA are interpretable to some extent, since the data samples nearest the centroids present unique properties. However, the clustering result for the data sample "(2-3)," circled in red, is not clear.
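The clustering-and-ranking step behind Figure 13 can be sketched as below, assuming the view-X embeddings are an (N, d) NumPy array; the names are hypothetical.

```python
# Sketch: k-means on view-X embeddings and the top-3 samples per centroid.
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_rank(zx, n_clusters=3, top_k=3):
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(zx)
    nearest = {}
    for c in range(n_clusters):
        dist = np.linalg.norm(zx - km.cluster_centers_[c], axis=1)
        nearest[c] = np.argsort(dist)[:top_k]  # indices of the top-k samples
    return km.labels_, km.cluster_centers_, nearest
```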

Figure 13:

Comparison of K-means clustering results on the embeddings of the Maps data set. The centroids are marked with gray spots. The blue triangles represent the top 3 data points nearest to each centroid, with the order given in the annotations. For example, 1_1 is the data point nearest the centroid within the first cluster.

Consequently, on the Google Maps data set, our ACCA outperforms the baselines regarding the alignment of the multiple views. The obtained latent embeddings are also informative about the multiview data.

To further support the superior alignment and generation performance of ACCA, we compare its results with those of pix2pix (Isola et al., 2017), the state-of-the-art GAN-based cross-view generation baseline. The comparison is shown in Figure 14. We can see that the image quality of our ACCA is comparable to and even better than that of pix2pix. This indicates that our ACCA has good alignment and generation ability.

Figure 14:

Comparison of pix2pix and ACCA on the Maps data set.

5.5  Alignment and Discriminative Property of the Representation

Multiview representation learning is an important application of CCA. In this section, we conduct classification tasks to verify that the alignment achieved in ACCA does not greatly influence the discriminative property of the learned representations.

We follow the settings in Table 2 and perform classification on the three labeled data sets: MNIST_LR, MNIST_Noisy, and XRMB. We train linear SVM classifiers on the concatenation of the obtained embeddings and then evaluate their accuracy on the projected test set, $[Z_X^{Te}, Z_Y^{Te}]$ (a sketch of this protocol follows). For the iteratively optimized nonlinear CCA models, we select the embeddings obtained from the last five epochs for evaluation. We compare ACCA with the gaussian prior, namely ACCA (G), here for a fair comparison. PCCA is not evaluated regarding classification since it should be comparable to linear CCA.
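A minimal sketch of this protocol, assuming the projected embeddings and labels are NumPy arrays (names hypothetical):

```python
# Sketch: linear SVM on the concatenated embeddings [Z_X, Z_Y].
import numpy as np
from sklearn.svm import LinearSVC

def svm_accuracy(zx_tr, zy_tr, y_tr, zx_te, zy_te, y_te):
    clf = LinearSVC().fit(np.hstack([zx_tr, zy_tr]), y_tr)
    return clf.score(np.hstack([zx_te, zy_te]), y_te)
```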

Table 7 presents the classification results. Our ACCA achieves comparable and even better classification performance among the CCA variants with a generative mechanism. The results of ACCA are better than those of Bi-VCCA in all settings and comparable to those of MVAE in most of the settings. This reveals that our ACCA preserves a considerable discriminative property in the embeddings while achieving superior alignment of the multiple views. Our ACCA is inferior to DCCA regarding classification on MNIST_Noisy and XRMB. The reason is that instead of targeting alignment for discriminative representation learning as in DCCA, our model focuses on reconstruction for data generation. For the MNIST_LR data set, however, ACCA outperforms DCCA by a large margin. This indicates that reconstruction can benefit discriminative representation learning in certain scenarios. The finding also coincides with the outstanding performance of MVAE here.

Table 7:
Classification Accuracy and Standard Deviation (in %) with Obtained Latent Embeddings.

Data Sets     CCA     DCCA           MVAE           Bi-VCCA        ACCA (G)
MNIST_LR      50.65   73.67 ± 0.15   84.44 ± 0.76   74.32 ± 0.19   85.81 ± 0.71
MNIST_Noisy   75.48   91.60 ± 0.36   90.78 ± 1.12   85.81 ± 0.44   86.93 ± 1.46
XRMB          32.04   62.14 ± 0.52   58.57 ± 0.30   56.58 ± 0.35   60.37 ± 0.40

Note: The best results are in bold.

5.6  ACCA versus CCA Variants with View-Specific Information

In this section, we compare ACCA with CCA variants that additionally exploit view-specific information, to further demonstrate its alignment capacity. This also serves as a preliminary study of the influence of private information on multiview alignment and generation. We choose Bi-VCCA-private as the representative baseline:

  • Bi-VCCA-private (Wang et al., 2016): An extension of Bi-VCCA that additionally extracts view-specific (private) variables for each view (see Figure 2 in Wang et al., 2016).

We evaluate the alignment of Bi-VCCA-private with regard to both correlation analysis and conditional/unconditional data generation. We compare it with our model with the simple gaussian prior, ACCA (G). We adopt the same settings (i.e., network settings and evaluation metrics) as in the previous experiments for consistency. For Bi-VCCA-private, the dimensions of the private variables are set as $d_{H_x} = d_{H_y} = 30$ for all data sets. The networks of the private encoders are set the same as the corresponding principal encoders in Table 2.

The results of the correlation analysis are presented in Table 8. It is clear that our ACCA outperforms Bi-VCCA-private and Bi-VCCA in all the settings. This indicates that it is the KL-divergence constraint, $D_{KL}(q(z|*) \,\|\, p(z))$, that mainly hinders these models from achieving instance-level multiview alignment (see Figure 1b). Our ACCA overcomes this limitation with the marginalized matching constraint, $D_{JS}\big(\int q(z|*)\,p(*)\,d* \,\big\|\, p(z)\big)$, and thus preserves better correspondence for paired inputs. The results of Bi-VCCA-private are slightly better than those of Bi-VCCA, indicating that the private variables can help to enhance multiview alignment to some extent. The comparisons regarding cross-view generation and unconditional data generation are presented in Figures 15 and 16, respectively. The results show that our ACCA outperforms the others in terms of both image sharpness and recognizable object details. In cross-view generation, the images generated with ACCA are much clearer and sharper than those of Bi-VCCA-private. In addition, although Bi-VCCA-private generates faces with more details than Bi-VCCA (e.g., beards and glasses; see Figure 16), these details are not as clear and recognizable as those of ACCA. These generation results coincide with our finding in the correlation analysis: Bi-VCCA-private achieves inferior multiview alignment compared with ACCA, and this inferior alignment consequently downgrades its performance in cross-view data generation.
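For reference, the nHSIC values in Table 8 measure the dependency between paired embeddings. The following is an illustrative sketch using the centered-kernel-alignment normalization; the exact estimator behind Table 8 may differ.

```python
# Illustrative sketch of a normalized HSIC (nHSIC) between paired embeddings.
import numpy as np

def nhsic(zx, zy, kernel="linear", sigma=1.0):
    def gram(z):
        if kernel == "linear":
            return z @ z.T
        d2 = (np.sum(z**2, 1)[:, None] + np.sum(z**2, 1)[None, :]
              - 2.0 * z @ z.T)
        return np.exp(-d2 / (2.0 * sigma**2))  # RBF kernel
    n = zx.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    Kc, Lc = H @ gram(zx) @ H, H @ gram(zy) @ H
    return np.sum(Kc * Lc) / np.sqrt(np.sum(Kc * Kc) * np.sum(Lc * Lc))
```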

Table 8:
Dependency of the Latent Embeddings.

Metrics                 Methods           MNIST_LR   MNIST_Noisy   XRMB
nHSIC (linear kernel)   Bi-VCCA-private   0.2818     0.2235        0.1227
                        Bi-VCCA           0.4612     0.1912        0.1046
                        ACCA (G) (ours)   0.5423     0.3285        0.2903
nHSIC (RBF kernel)      Bi-VCCA-private   0.2853     0.2386        0.0893
                        Bi-VCCA           0.3804     0.2076        0.0846
                        ACCA (G) (ours)   0.6318     0.3099        0.2502

Note: The best results are in bold. For dependency, higher is better.

Table 9:
Details of the Cross-View Generation Data Sets.

CelebA (Liu et al., 2015)
  Statistics: Image resolution 64 × 64. Image recovery: # Tr = 201,599, # Te = 1,000. Face alignment: # Tr = 201,599, # Te = 1,000.
  Dimension of z: d = 100.
  Architecture (all Conv with batch normalization before LReLU):
    Encoders: Conv 64×5×5 (stride 2); Conv 128×5×5 (stride 2); Conv 256×5×5 (stride 2); Conv 512×5×5 (stride 2); dense: 100.
    Decoders (image recovery): dense: 8192, ReLU; deConv 256×5×5 (stride 2); deConv 128×5×5 (stride 2); deConv 64×5×5 (stride 2); deConv 3×2×5 (stride 1×2); Tanh.
    Decoders (face alignment): dense: 8192, ReLU; deConv 256×5×5 (stride 2); deConv 128×5×5 (stride 2); deConv 64×5×5 (stride 2); deConv 3×2×5 (stride 2); Tanh.
    Discriminator D^: dense: 128641, sigmoid.
  Parameters: Epoch = 10; Batch size = 64; lr = 0.0002; Beta1 = 0.05.

Google Maps data set (Maps) (Isola et al., 2017)
  Statistics: Image resolution 256 × 256. Cross-view generation: # Tr = 1,096, # Te = 1,098.
  Dimension of z: d = 100.
  Architecture (all Conv with batch normalization before LReLU):
    Encoders: Conv 64×5×5 (stride 2); Conv 128×5×5 (stride 2); Conv 256×5×5 (stride 2); Conv 256×5×5 (stride 2); Conv 512×5×5 (stride 2); dense: 100.
    Decoders (with skip connections): dense: 32768, ReLU; deConv 256×5×5 (stride 2); deConv 256×5×5 (stride 2); deConv 128×5×5 (stride 2); deConv 64×5×5 (stride 2); deConv 3×2×5 (stride 2); Tanh.
    Discriminator D^: dense: 128641, tanh.
  Parameters: Epoch = 15; Batch size = 16; lr = 0.0002; Beta1 = 0.5.
Figure 15:

Comparison of Bi-VCCA-private and ACCA regarding face recovery. The results of Bi-VCCA-private are commonly blurrier than those of our ACCA.

Figure 16:

Analysis of the face recovery results of Bi-VCCA-private. (Left) The unconditional generation results of Bi-VCCA-private. They are inferior to those of our ACCA although superior to those of Bi-VCCA (see Figure 9). (Right) For Bi-VCCA-private, the image quality generated from the different views shows only slight differences. This indicates that incorporating private variables contributes to a balanced capacity for data generation from each view.

Furthermore, it is interesting to note that for Bi-VCCA-private, the image quality generated from the different views shows only slight differences, compared with the baselines that extract only the shared information (see Figure 8). This indicates that incorporating view-specific variables also contributes to a balanced cross-view generation capacity when the multiple input views contain an imbalanced amount of information. (In section 5.4.1, we found that the imbalance of information between the two input views can influence the image generation quality.)

6  Conclusion

In this letter, we present a systematic analysis of instance-level multiview alignment with CCA. Based on the marginalization principle of Bayesian inference, we study multiview alignment via consistent latent encoding and present ACCA, which achieves superior alignment of the multiple views and thereby benefits the performance of various multiview analysis and cross-view analysis tasks. By matching multiple encodings, ACCA can also be adapted to other tasks, such as image captioning and translation. Furthermore, owing to its flexible architecture design, ACCA can easily be extended to multiview tasks with n views, using n + 1 encoders and n decoders.

In this work, we mainly exploit our ACCA with predefined priors. For future work, we will explore more powerful inference techniques for ACCA to further boost its alignment performance. For example, normalizing flows (Rezende & Mohamed, 2015) provide an efficient, data-driven tool to learn a data-dependent prior for complex data sets. They can be employed in ACCA to boost multiview alignment by providing a more expressive latent space, that is, a complex data-dependent prior and better preservation of instance-level correspondence, through invertible mappings.

Our analysis based on the CMI and consistent encoding also provides insights for the flexible design of other CCA models. In the future, we will conduct a more in-depth analysis of multiview alignment with CMI and propose other variants of CCA with alternative alignment criteria (e.g., the MMD distance or the Wasserstein distance; Arjovsky, Chintala, & Bottou, 2017). It is also interesting to note that inputs with different levels of detail can influence the result of cross-view generation; this direction is also worth further study.

Notes

2

We evaluate the sharpness of each test image using the gradients and average these values over the 1000 test images.

Acknowledgments

The work was supported in part by the Australian Research Council grants DP180100106, DP200101328 and the China Scholarship Council (201706330075).

References

Andrew, G., Arora, R., Bilmes, J., & Livescu, K. (2013). Deep canonical correlation analysis. In Proceedings of the International Conference on Machine Learning (pp. 1247–1255).

Antelmi, L., Ayache, N., Robert, P., & Lorenzi, M. (2019). Sparse multi-channel variational autoencoder for the joint analysis of heterogeneous data. In Proceedings of the International Conference on Machine Learning (pp. 302–311).

Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein GAN. CoRR, abs/1701.07875.

Bach, F., & Jordan, M. (2005). A probabilistic interpretation of canonical correlation analysis (Technical Report 688). Department of Statistics, University of California, Berkeley.

Bulat, A., & Tzimiropoulos, G. (2018). Super-FAN: Integrated facial landmark localization and super-resolution of real-world low resolution faces in arbitrary poses with GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 109–117). Piscataway, NJ: IEEE.

Chaudhuri, K., Kakade, S. M., Livescu, K., & Sridharan, K. (2009). Multi-view clustering via canonical correlation analysis. In Proceedings of the International Conference on Machine Learning (pp. 129–136).

Drton, M., Sturmfels, B., & Sullivant, S. (2008). Lectures on algebraic statistics. Berlin: Springer.

Elkahky, A. M., Song, Y., & He, X. (2015). A multi-view deep learning approach for cross domain user modeling in recommendation systems. In Proceedings of the International Conference on the World Wide Web (pp. 278–288). New York: ACM.

Federici, M., Dutta, A., Forré, P., Kushmann, N., & Akata, Z. (2020). Learning robust representations via multi-view information bottleneck. In Proceedings of the International Conference on Learning Representations. OpenReview.net.

Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., … Lempitsky, V. (2016). Domain-adversarial training of neural networks. JMLR, 17(1), 2096–2030.

Gao, S., Brekelmans, R., Steeg, G. V., & Galstyan, A. (2018). Auto-encoding total correlation explanation. arXiv:1802.05822.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 27 (pp. 2672–2680). Red Hook, NY: Curran.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. J. (2012). A kernel two-sample test. JMLR, 13, 723–773.

Gretton, A., Bousquet, O., Smola, A., & Schölkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. In Proceedings of the International Conference on Algorithmic Learning Theory (pp. 63–77). Berlin: Springer-Verlag.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems, 30 (pp. 6626–6637). Red Hook, NY: Curran.

Hoang, Q., Nguyen, T. D., Le, T., & Phung, D. (2018). MGAN: Training generative adversarial nets with multiple generators. In Proceedings of the 6th International Conference on Learning Representations. OpenReview.

Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28(3/4), 321–377.

Huang, X., Liu, M., Belongie, S. J., & Kautz, J. (2018). Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (pp. 179–196). Berlin: Springer.

Isola, P., Zhu, J., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1125–1134). Piscataway, NJ: IEEE.

Jia, J., & Ruan, Q. (2016). Cross-view analysis by multi-feature fusion for person re-identification. In Proceedings of the International Conference on Signal Processing (pp. 107–112). Piscataway, NJ: IEEE.

Jia, Y., Salzmann, M., & Darrell, T. (2010). Factorized latent spaces with structured sparsity. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, & A. Culotta (Eds.), Advances in neural information processing systems, 23 (pp. 982–990). Red Hook, NY: Curran.

Kazemi, V., & Sullivan, J. (2014). One millisecond face alignment with an ensemble of regression trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE.

Kendall, A., & Gal, Y. (2017). What uncertainties do we need in Bayesian deep learning for computer vision? In U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems, 30 (pp. 5574–5584). Red Hook, NY: Curran.

Kidron, E., Schechner, Y., & Elad, M. (2007). Cross-modal localization via sparsity. IEEE Transactions on Signal Processing, 55(4), 1390–1404.

Kingma, D., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv:1312.6114.

Lai, P., & Fyfe, C. (2000). Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems, 10(5), 365–377.

Li, Y., Yang, M., & Zhang, Z. (2018). A survey of multi-view representation learning. In Proceedings of the IEEE Conference on Transactions on Knowledge and Data Engineering. Washington, DC: IEEE Computer Society.

Liu, M., Breuel, T., & Kautz, J. (2017). Unsupervised image-to-image translation networks. In U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems, 30 (pp. 700–708). Red Hook, NY: Curran.

Liu, Z., Luo, P., Wang, X., & Tang, X. (2015). Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (pp. 3730–3738). Washington, DC: IEEE Computer Society.

Ma, Y., & Fu, Y. (2011). Manifold learning theory and applications. Boca Raton, FL: CRC Press.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1, pp. 281–297). Berkeley: University of California.

Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., & Frey, B. (2015). Adversarial autoencoders. arXiv:1511.05644.

Mao, X., Li, Q., Xie, H., Lau, R. Y., Wang, Z., & Smolley, S. P. (2017). Least squares generative adversarial networks. In Proceedings of the International Conference on Computer Vision (pp. 2794–2802). Washington, DC: IEEE Computer Society.

Mathieu, E., Rainforth, T., Siddharth, N., & Teh, Y. W. (2019). Disentangling disentanglement in variational autoencoders. In Proceedings of the International Conference on Machine Learning (pp. 4402–4412).

Mescheder, L., Nowozin, S., & Geiger, A. (2017). Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. arXiv:1701.04722.

Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A. (2011). Multimodal deep learning. In Proceedings of the International Conference on Machine Learning (pp. 689–696).

Oh, S. J., Murphy, K., Pan, J., Roth, J., Schroff, F., & Gallagher, A. (2018). Modeling uncertainty with hedged instance embedding. arXiv:1810.00319.

Qi, C. R., Su, H., Niessner, M., Dai, A., Yan, M., & Guibas, L. J. (2016). Volumetric and multi-view CNNs for object classification on 3D data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5648–5656). Piscataway, NJ: IEEE.

Regmi, K., & Borji, A. (2018). Cross-view image synthesis using conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3501–3510). Piscataway, NJ: IEEE.

Rezende, D. J., & Mohamed, S. (2015). Variational inference with normalizing flows. arXiv:1505.05770.

Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 234–241). Berlin: Springer.

Shi, Y., Xu, D., Pan, Y., Tsang, I., & Pan, S. (2019). Label embedding with partial heterogeneous contexts. In Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press.

Sohn, K., Lee, H., & Yan, X. (2015). Learning structured output representation using deep conditional generative models. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems, 28 (pp. 3483–3491). Red Hook, NY: Curran.

Suzuki, T., & Sugiyama, M. (2010). Sufficient dimension reduction via squared-loss mutual information estimation. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (pp. 804–811).

Tipping, M. E. (2003). Bayesian inference: An introduction to principles and practice in machine learning. In O. Bousquet, U. von Luxburg, & G. Rätsch (Eds.), Lecture Notes in Computer Science: Vol. 3176. Advanced Lectures on Machine Learning (pp. 41–62). Berlin: Springer.

Tipping, M., & Bishop, C. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society B (Statistical Methodology), 61(3), 611–622.

Tolstikhin, I., Bousquet, O., Gelly, S., & Schölkopf, B. (2017). Wasserstein autoencoders. arXiv:1711.01558.

Virtanen, S., Klami, A., & Kaski, S. (2011). Bayesian CCA via group sparsity. In Proceedings of the International Conference on Machine Learning (pp. 457–464).

Wang, W., Arora, R., Livescu, K., & Bilmes, J. (2015). On deep multi-view representation learning. In Proceedings of the International Conference on Machine Learning (pp. 1083–1092).

Wang, W., Yan, X., Lee, H., & Livescu, K. (2016). Deep variational canonical correlation analysis. arXiv:1610.03454.

Wang, X. (2013). Intelligent multi-camera video surveillance: A review. Pattern Recognition Letters, 34(1), 3–19.

William, W. (2000). Nonlinear canonical correlation analysis by neural networks. Neural Networks, 13(10), 1095–1105.

Wu, D., Zhao, Y., Tsai, Y., Yamada, M., & Salakhutdinov, R. (2018). "Dependency bottleneck" in auto-encoding architectures: An empirical study. arXiv:1802.05408.

Xu, C., Tao, D., & Xu, C. (2013). A survey on multi-view learning. arXiv:1304.5634.

Zhang, X., Zhao, J., Hao, J., Zhao, X., & Chen, L. (2014). Conditional mutual inclusive information enables accurate quantification of associations in gene regulatory networks. Nucleic Acids Research, 43(5), e31.

Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., & Fu, Y. (2018). Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision. Berlin: Springer.

Zhu, J., Park, T., Isola, P., & Efros, A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2223–2232). Washington, DC: IEEE Computer Society.