Abstract
Disentanglement is a useful property in representation learning that increases the interpretability of generative models such as variational autoencoders (VAEs), generative adversarial networks, and their many variants. Typically in such models, an increase in disentanglement performance is traded off against generation quality. In the context of latent space models, this work presents a representation learning framework that explicitly promotes disentanglement by encouraging orthogonal directions of variation. The proposed objective is the sum of an autoencoder error term and a principal component analysis reconstruction error in the feature space. This has an interpretation as a restricted kernel machine with an eigenvector matrix valued on the Stiefel manifold. Our analysis shows that such a construction promotes disentanglement by matching the principal directions in the latent space with the directions of orthogonal variation in data space. In an alternating minimization scheme, we use the Cayley ADAM algorithm, a stochastic optimization method on the Stiefel manifold, together with the Adam optimizer. Our theoretical discussion and various experiments show that the proposed model is an improvement over many VAE variants in terms of both generation quality and disentangled representation learning.
1 Introduction
Latent space models are popular tools for sampling from high-dimensional distributions. Often, only a small number of latent factors are sufficient to describe data variations. These models exploit the underlying structure of the data and learn explicit representations that are faithful to the data-generating factors. Popular latent space models are variational autoencoders (VAEs; Kingma & Welling, 2014), restricted Boltzmann machines (RBMs; Salakhutdinov & Hinton, 2009), normalizing flows (Rezende & Mohamed, 2015), and their many variants.
Latent-space traversal decoded into image space. Green and black dashed lines represent walks along two principal directions; at every step of the walk, the output of the decoder generates an image in the input space. The images were generated by St-RKM on the 3DShapes data set. See Figure 5 for traversals along other components.
The rest of the article is organized as follows. In section 2, we discuss closely related work on disentangled representation learning and generation in the context of autoencoders. In section 3, we describe the proposed model along with the connection between PCA and disentanglement, and in section 3.2 we summarize our contributions. In section 4, we derive the evidence lower bound of the proposed model and show connections with probabilistic models. In section 5, we describe our experiments and discuss the results.
2 Related Work
Related works can be broadly classified into two categories: Variational autoencoders (VAE) in the context of disentanglement and Restricted Kernel Machines (RKM), a recently proposed modeling framework that integrates kernel methods with deep learning.
2.1 VAE
As discussed in section 1, Higgins et al. (2017) suggested that a stronger emphasis on the posterior to match the factorized unit Gaussian prior puts further constraints on the implicit capacity of the latent bottleneck. Burgess et al. (2017) further analyzed the effect of this constraint in depth. Later, Chen, Li, Grosse, and Duvenaud (2018) showed that the KL term includes the mutual information gap, which encourages disentanglement. Recently, several variants of VAEs promoting disentanglement have been proposed by adding extra terms to the ELBO. For instance, FactorVAE (Kim & Mnih, 2018) augments the ELBO with a new term enforcing factorization of the marginal posterior (or aggregate posterior). Rolínek et al. (2019) analyzed the reason for the alignment of the latent space with the coordinate axes, since the design of VAE itself does not suggest any such mechanism. The authors argue that the diagonal approximation in the encoder, together with the inherent stochasticity, forces local orthogonality of the decoder. Locatello et al. (2020) considered adding an extra term that accounts for partial label information to improve disentanglement. Later, Ghosh, Sajjadi, Vergari, Black, and Schölkopf (2020) studied deterministic AEs, where another quadratic regularization on the latent vectors was proposed. In contrast to Rolínek et al. (2019), where the implicit orthogonality of VAE was studied, our proposed model has orthogonality by design due to the introduction of the Stiefel manifold.
2.2 RKM
Restricted kernel machines (RKM; Suykens, 2017) provide a representation of kernel methods with visible and hidden variables similar to the energy function of restricted Boltzmann machines (RBM; LeCun, Huang, & Bottou, 2004; Hinton, 2005), thus linking kernel methods with RBMs. Training and prediction schemes are characterized by the stationary points of the unknowns in the objective. These stationarity conditions lead to solving a linear system or a matrix decomposition for training. Suykens (2017) presents various RKM formulations for classification, regression, kernel PCA, and singular value decomposition. Later, the kernel PCA formulation of RKM was extended to a multiview generative model called generative RKM (Gen-RKM), which uses convolutional neural networks as explicit feature maps (Pandey, Schreurs, & Suykens, 2020, 2021). For joint feature selection and subspace learning, the proposed training procedure performs an eigendecomposition of the kernel/covariance matrix in every minibatch of the optimization scheme. Intuitively, the model can be seen as learning an autoencoder with kernel PCA in the bottleneck. As a result, the computational complexity scales cubically with the minibatch size and is proportional to the number of minibatches. Moreover, backpropagation through the eigendecomposition can be numerically unstable due to the possibility of small eigenvalues. All these limitations are addressed by our proposed model.
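To make this computational bottleneck concrete, the following sketch shows the kernel-PCA-in-the-bottleneck idea with a per-minibatch eigendecomposition, in the spirit of Gen-RKM; `phi` is a generic PyTorch encoder, and the function is an illustration rather than the reference Gen-RKM implementation.

```python
import torch

def minibatch_kpca(phi, x, s):
    """Per-minibatch PCA on encoder features, as in Gen-RKM-style training.

    phi : encoder network mapping inputs to l-dimensional features
    x   : minibatch of shape (m, ...)
    s   : number of principal components to keep
    """
    f = phi(x)                           # (m, l) feature matrix
    f = f - f.mean(dim=0, keepdim=True)  # center the features
    cov = f.T @ f / f.shape[0]           # (l, l) covariance matrix
    # Full eigendecomposition at every minibatch: cubic cost in the matrix
    # size, and backpropagation through it can be unstable when some
    # eigenvalues are small.
    evals, evecs = torch.linalg.eigh(cov)
    U = evecs[:, -s:]                    # top-s principal directions
    h = f @ U                            # latent codes in the principal subspace
    return h, U, evals[-s:]
```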
Schematic illustration of the St-RKM training problem. The length of the dashed line represents the reconstruction error (see the autoencoder term in equation 3.3), and the length of the component orthogonal to the hyperplane represents the PCA reconstruction error. After training, the projected points tend to be distributed normally on the hyperplane.
3 Proposed Mechanism
Proposition 1. Let $C$ be an $\ell \times \ell$ symmetric matrix. Let $\lambda_1 \le \dots \le \lambda_s$ be its $s$ smallest eigenvalues, possibly including multiplicities, with associated orthonormal eigenvectors $v_1, \dots, v_s$. Let $V = [v_1 \dots v_s]$ be an $\ell \times s$ matrix whose columns are these eigenvectors. Then the optimization problem $\min_{U \in \mathrm{St}(\ell, s)} \operatorname{Tr}(U^\top C U)$ has a minimizer at $U^\star = V$, and we have $\operatorname{Tr}\big((U^\star)^\top C U^\star\big) = \sum_{i=1}^{s} \lambda_i$ with $U^\star \in \mathrm{St}(\ell, s)$.
A few remarks follow. First, if $U^\star$ is a minimizer of the optimization problem in proposition 1, then $U^\star Q$ with $Q$ orthogonal is also a minimizer, but $(U^\star Q)^\top C\, U^\star Q$ is not necessarily diagonal. Second, notice that if the eigenvalues of $C$ in proposition 1 have a multiplicity larger than 1, there can exist several sets of eigenvectors associated with the $s$ smallest eigenvalues, spanning distinct linear subspaces. Nevertheless, in practice, the eigenvalues of the matrices considered in this article are numerically distinct.
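As a concrete check of proposition 1, the following NumPy snippet compares the trace objective at the eigenvector basis of a random symmetric matrix with its value at many random points of the Stiefel manifold; the matrix, its size, and the variable names are arbitrary and not tied to the article's notation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric matrix C (stands in for the feature covariance).
l, s = 8, 3
A = rng.standard_normal((l, l))
C = (A + A.T) / 2

# Eigenpairs sorted in ascending order.
evals, evecs = np.linalg.eigh(C)
V = evecs[:, :s]                      # eigenvectors of the s smallest eigenvalues

# Trace objective evaluated at the eigenvector basis ...
obj_at_V = np.trace(V.T @ C @ V)

# ... and at many random points of St(l, s), obtained from QR factorizations.
best_random = min(
    np.trace(Q.T @ C @ Q)
    for Q in (np.linalg.qr(rng.standard_normal((l, s)))[0] for _ in range(2000))
)

print(obj_at_V, sum(evals[:s]))        # equal up to rounding
print(best_random >= obj_at_V - 1e-9)  # no random Stiefel point does better
```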
for all points in the latent space. In equation 3.2, the index refers to a component of the decoded image. To sketch this idea, we study local motions in the latent space.
3.1 Objective Function
We now describe the training objective of the Stiefel-restricted kernel machine (St-RKM), equation 3.3, in view of our discussion in section 2. The basic idea is to design different AE losses together with a regularization term that penalizes the component of the feature map lying in the orthogonal complement of the learned subspace. The choice of the AE losses is motivated by the expression of the regularized AE in equation 1.2 and by the following lemma, which extends the result of Rolínek et al. (2019). Here we adapt it to the context of optimization on the Stiefel manifold (see the appendix for the proof).
The bound in lemma 1 involves a constant expressed in terms of Euler's gamma function $\Gamma$.
In lemma 1, the first term on the right-hand side of equation 3.4 plays the role of the classical AE loss. The second term is proportional to the trace of equation 3.2; this relates to our discussion above, where we argue that jointly diagonalizing both matrices helps to enforce disentanglement. However, determining the behavior of the third term in equation 3.4 is difficult, because for a typical neural network architecture it is unclear in practice whether the corresponding function has a Lipschitz-continuous Hessian at every point. Hence we propose another AE loss (the split loss) in order to cancel the third term in equation 3.4. Nevertheless, the assumption in lemma 1 is used to provide a meaningful bound on the remainder in equation 3.4. In light of these remarks, we propose two stochastic AE losses.
3.1.1 AE Losses
Visualizing the matrix of equation 3.6 for St-RKM models after training on three data sets. The first two rows show equation 3.6 with $U$ given by the output of algorithm 1; these matrices are effectively close to being diagonal, especially for the split loss (St-RKM-sl), as expected. In contrast, the third row shows the same matrix of equation 3.6 with $U$ sampled uniformly at random (see Table 6 for the corresponding normalized diagonalization errors).
Note that we do not simply propose another encoder-decoder architecture. Instead, our objective assumes that the neural network defining the encoder provides a better embedding if we impose that it maps training points onto a lower-dimensional linear subspace of the latent space. In other words, the optimization of the parameters in the last layer of the encoder does not play a redundant role, since the second term in equation 3.3 clearly also depends on these parameters. The full training involves an alternating minimization procedure, which is described in algorithm 1.
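To illustrate the structure of such an objective, here is a minimal sketch combining an AE reconstruction error with a feature-space PCA reconstruction error, where `U` has orthonormal columns; `phi`, `psi`, and the weighting are generic placeholders rather than the exact losses of equation 3.3 (in particular, the split loss is not shown).

```python
import torch

def st_rkm_style_loss(phi, psi, U, x, recon_weight=1.0):
    """Sketch of an AE loss plus a feature-space PCA reconstruction error,
    with U a point on the Stiefel manifold (orthonormal columns).

    phi : encoder, maps x to l-dimensional features
    psi : decoder/preimage map, maps l-dimensional features back to input space
    U   : (l, s) matrix with orthonormal columns
    (Feature centering is omitted for brevity.)
    """
    f = phi(x)                                   # (m, l) features
    proj = f @ U @ U.T                           # projection of phi(x) onto span(U)
    # Kernel PCA term: energy of the features orthogonal to the subspace.
    pca_err = ((f - proj) ** 2).sum(dim=1).mean()
    # AE term: reconstruct the input from the projected features.
    x_hat = psi(proj)
    ae_err = ((x - x_hat) ** 2).flatten(1).sum(dim=1).mean()
    return recon_weight * ae_err + pca_err
```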
3.2 Contributions
4 Connections with the Evidence Lower Bound
5 Experiments
In this section, we investigate whether St-RKM can simultaneously achieve accurate reconstructions of training data, good random generation, and good disentanglement performance. We use the standard data sets MNIST (LeCun & Cortes, 2010), Fashion-MNIST (fMNIST; Xiao, Rasul, & Vollgraf, 2017), and SVHN (Netzer et al., 2011). To evaluate disentanglement, we use data sets with known ground-truth generating factors: dSprites (Matthey, Higgins, Hassabis, & Lerchner, 2017), 3DShapes (Burgess & Kim, 2018), and 3D cars (Reed, Zhang, Zhang, & Lee, 2015). All figures and tables report average errors with 1 standard deviation over 10 experiments.
5.1 Algorithm
We use an alternating minimization scheme, as shown in algorithm 1. First, the Adam optimizer is used to update the encoder-decoder parameters; then, the Cayley ADAM optimizer (Li et al., 2020) is used to update the Stiefel matrix $U$. Finally, at the end of training, we recompute $U$ from the singular value decomposition (SVD) of the covariance matrix as a final correction step for the kernel PCA term in our objective (step 10 of algorithm 1). Since the covariance matrix is typically small, this decomposition is fast (see Table 3). In practice, our training procedure only marginally increases the computation cost, which can be seen from the training times in Table 1.
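The following is a schematic of this alternating scheme, assuming a `CayleyAdam` optimizer class exposing the usual PyTorch optimizer interface (the reference code of Li et al., 2020, provides such an optimizer), a loss function with the signature sketched in section 3.1, a data loader yielding (input, label) batches, and placeholder learning rates; it is an illustration of algorithm 1, not the released training script.

```python
import torch

def train_st_rkm(phi, psi, U, loader, loss_fn, cayley_adam_cls,
                 epochs=10, lr_nets=2e-4, lr_stiefel=1e-4):
    """Alternating minimization: Adam on encoder/decoder, Cayley ADAM on U."""
    opt_nets = torch.optim.Adam(
        list(phi.parameters()) + list(psi.parameters()), lr=lr_nets)
    opt_stiefel = cayley_adam_cls([U], lr=lr_stiefel)  # assumed optimizer interface

    for _ in range(epochs):
        for x, *_ in loader:               # loader yields (input, label) batches
            loss = loss_fn(phi, psi, U, x)
            opt_nets.zero_grad()
            opt_stiefel.zero_grad()
            loss.backward()
            opt_nets.step()                # update encoder/decoder parameters
            opt_stiefel.step()             # update U while staying on the manifold

    # Final correction step: recompute U from the feature covariance
    # (eigendecomposition of the covariance, equivalent to an SVD here).
    with torch.no_grad():
        feats = torch.cat([phi(x) for x, *_ in loader])
        feats = feats - feats.mean(dim=0, keepdim=True)
        cov = feats.T @ feats / feats.shape[0]
        evals, evecs = torch.linalg.eigh(cov)
        s = U.shape[1]
        U.copy_(evecs[:, -s:])             # principal directions as the corrected U
    return U
```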
Training Time in Minutes (for 1000 Epochs, Mean with 1 Standard Deviation over 10 Runs) and the Number of Parameters (Nb) of the Generative Models on the MNIST Data Set.
| Model | St-RKM | (β-)VAE | FactorVAE | Info-GAN |
| --- | --- | --- | --- | --- |
| Nb parameters | 4,165,589 | 8,182,591 | 4,713,478 | |
| Training time | 21.93 (1.3) | (0.8) | 33.31 (2.7) | 45.96 (1.6) |
5.2 Experimental Setup
We consider four baselines for comparison: VAE, β-VAE, FactorVAE, and Info-GAN. An ablation study with Gen-RKM is shown in section A.4 in the appendix; extensive experimentation with Gen-RKM was not computationally feasible, since the evaluation and decomposition of kernel matrices scale quadratically and cubically, respectively, with the data set size (see the discussion in section 2).
5.3 Inductive Biases
To be consistent in evaluation, we keep the same encoder (discriminator) and decoder (generator) architecture and the same latent dimension across the models. We use convolutional neural networks due to the choice of image data sets for evaluating generation and disentanglement. In the case of Info-GAN, batch normalization is added for training stability (see section A.3 in the appendix for details). To determine the hyperparameters of the other models, we start from values in the range suggested in the authors' reference implementations. After trying various values, we noticed that one fixed value of the regularization weight for β-VAE and one for FactorVAE work well across the data sets that we considered. Furthermore, in all the experiments on St-RKM, we keep the reconstruction weight fixed. All models are trained on the entire data set. Note that for the same encoder-decoder network, the St-RKM model has the fewest parameters compared with the VAE variants and Info-GAN (see Table 1).
Fréchet inception distance (FID; Heusel, Ramsauer, Unterthiner, Nessler, & Hochreiter, 2017) and sliced Wasserstein distance (SWD) scores (mean and 1 standard deviation) for 8000 randomly generated samples (smaller is better).
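For reference, the sliced Wasserstein distance can be estimated with random one-dimensional projections, as in the following sketch; the exact evaluation protocol used for the reported SWD scores (e.g., patch extraction at several scales) may differ.

```python
import torch

def sliced_wasserstein(x, y, n_projections=512):
    """Monte Carlo estimate of the sliced Wasserstein-1 distance between two
    sample sets x, y of shape (n, d) with the same number of samples:
    project onto random unit directions, sort, and average the absolute
    differences of the sorted projections."""
    d = x.shape[1]
    theta = torch.randn(d, n_projections)
    theta = theta / theta.norm(dim=0, keepdim=True)   # unit-norm directions
    px = torch.sort(x @ theta, dim=0).values          # sorted 1-D projections
    py = torch.sort(y @ theta, dim=0).values
    return (px - py).abs().mean()
```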
Traversals along the principal components. The first two rows show the ground-truth and reconstructed images. Each subsequent row shows the generated images by traversing along a principal component in the latent space. The last column in each subimage indicates the dominant factor of variation.
To evaluate the disentanglement performance, various metrics have been proposed. A comprehensive review by Locatello et al. (2019) shows that the various disentanglement metrics are correlated, albeit with a different degree of correlation across data sets. In this article, we use three metrics to evaluate disentanglement: the framework of Eastwood and Williams (2018), the mutual information gap (MIG; Chen et al., 2018), and the separated attribute predictability (SAP; Kumar et al., 2018) scores. Eastwood's framework comprises three measures: disentanglement, the degree to which a representation factorizes the underlying factors of variation, with each variable capturing at most one generative factor; completeness, the degree to which each underlying factor is captured by a single code variable; and informativeness, the amount of information that a representation captures about the underlying factors of variation. Furthermore, we use the slightly modified version of the MIG score proposed by Locatello et al. (2019). Figure 6 shows that the St-RKM variants have better disentanglement and completeness scores (higher mean scores). However, the informativeness scores are higher for St-RKM when using a lasso regressor, in contrast to mixed scores with a random forest regressor. Figure 7 further complements these observations by showing MIG and SAP scores; here, the St-RKM-sl model has the highest mean scores on every data set. A qualitative assessment can be made from Figure 5, which shows the images generated by traversing along the principal components in the latent space. On the 3DShapes data set, the St-RKM model captures floor hue, wall hue, and orientation perfectly but shows slight entanglement in the other factors. This is worse for β-VAE, which shows entanglement in all dimensions except the floor hue, along with noise in some generated images. Similar trends can be observed on the dSprites and 3D cars data sets.
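The latent traversals used for this qualitative assessment can be reproduced with a sketch of the following form, where `phi`, `psi`, and `U` denote a trained encoder, decoder, and Stiefel matrix as in the earlier sketches; the step range and count are arbitrary illustrative choices.

```python
import torch

@torch.no_grad()
def traverse_component(phi, psi, U, x, component, steps=8, span=3.0):
    """Walk along one principal direction in the latent space and decode.

    Returns a stack of decoded images obtained by varying a single
    component of the code h = U^T phi(x) while keeping the others fixed."""
    f = phi(x.unsqueeze(0))                 # (1, l) feature of one image
    h = f @ U                               # (1, s) code in the principal subspace
    images = []
    for t in torch.linspace(-span, span, steps):
        h_t = h.clone()
        h_t[0, component] = h_t[0, component] + t   # move along one direction only
        images.append(psi(h_t @ U.T))               # decode back to input space
    return torch.cat(images, dim=0)
```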
Eastwood's framework (Eastwood & Williams, 2018) disentanglement metrics with lasso and random forest (RF) regressors. The plot shows the mean and 1 standard deviation of the scores over 10 iterations. For disentanglement and completeness, a higher score is better; for informativeness, lower is better. "Info." indicates the (average) root-mean-square error in predicting the ground-truth factors.
MIG (Chen et al., 2018; Locatello et al., 2019) and SAP (Kumar, Sattigeri, & Balakrishnan, 2018) scores to evaluate disentanglement performance showing the mean (standard deviation) over 10 random seeds.
6 Conclusion
This article proposes the St-RKM model for disentangled representation learning and generation based on manifold optimization. For training, we use the Cayley ADAM algorithm of Li et al. (2020) for stochastic optimization on the Stiefel manifold. Computationally, St-RKM increases the training time by only a reasonably small amount compared with, for instance, β-VAE. Furthermore, we propose several autoencoder objectives and argue that the combination of a stochastic AE loss with an explicit optimization on the Stiefel manifold promotes disentanglement. In addition, we establish connections with probabilistic models, formulate an evidence lower bound, and discuss the independence of latent factors. Whereas the considered baselines trade off generation quality against disentanglement, the proposed model improves on both aspects, as illustrated through various experiments. The proposed model has some limitations. A first limitation is hyperparameter selection: the number of components in the KPCA, the neural network architecture, and the final size of the feature map. When additional knowledge about the data is available, we suggest that the user select a number of components close to the number of underlying generating factors. The final size of the feature map should be large enough so that KPCA extracts meaningful components. Second, we interpret disentanglement as orthogonal changes in the latent space corresponding to orthogonal changes in the input space. Although not perfect, we believe this is a reasonable mathematical approximation of the loosely defined notion of disentanglement, and the experimental results support it. Among the possible regularizers on the hidden features, the model associated with the squared Euclidean norm was analyzed in detail, while a deeper study of other regularizers, in particular for the case of spherical units, is a prospect for further research.
Appendix
A.1 Proof of Lemma 1
A.2 Details on Evidence Lower Bound for St-RKM model
Data Sets and Hyperparameters Used for the Experiments.
| Data Set | Training samples | Input dimension | Subspace dimension | Minibatch size |
| --- | --- | --- | --- | --- |
| MNIST | 60,000 | | 10 | 256 |
| fMNIST | 60,000 | | 10 | 256 |
| SVHN | 73,257 | | 10 | 256 |
| dSprites | 737,280 | | 5 | 256 |
| 3DShapes | 480,000 | | 6 | 256 |
| 3D cars | 17,664 | | 3 | 256 |
Note: the columns give, for each data set, the number of training samples, the input dimension (of the resized images), the subspace dimension, and the minibatch size.
Model Architectures.
| Data Set | Architecture |
| --- | --- |
| MNIST / fMNIST / SVHN / 3DShapes / dSprites / 3D cars | |
Notes: All convolutions and transposed convolutions use stride 2 and padding 1. Unless stated otherwise, layers have parametric-ReLU activation functions, except the output layers of the preimage maps, which have sigmoid activation functions (since the input data are normalized to [0, 1]). The Adam and Cayley ADAM optimizers use separate learning rates. The preimage map/decoder network is always taken as the transpose of the feature map/encoder network. Some kernel sizes differ between 3D cars and the other data sets, and a stride of 1 is used for MNIST, fMNIST, SVHN, and 3DShapes and a different value for the others. SVHN and 3DShapes are resized to the input dimensions used for training.
A.3 Data Sets and Hyperparameters
We refer to Tables 2 and 3 for specific details on the model architectures, data sets, and hyperparameters used in this article. All models were trained on the full data sets for a maximum of 1000 epochs. Furthermore, all data sets are scaled to [0, 1] and, except for dSprites and 3D cars, resized. The models were implemented in Python using the PyTorch library (single precision) and trained on an 8 GB NVIDIA Quadro P4000 GPU. See algorithm 1 for training the St-RKM model. In the case of FactorVAE, the discriminator architecture is the same as proposed in the original paper (Kim & Mnih, 2018).
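A preprocessing pipeline consistent with this description is sketched below for MNIST; the 64 x 64 resize target is only an illustrative choice and not necessarily the resolution used in the article, while the minibatch size of 256 matches Table 2.

```python
import torch
from torchvision import datasets, transforms

# Images are scaled to [0, 1] by ToTensor and resized; the 64x64 target is
# an illustrative choice, not necessarily the dimensions used in the paper.
transform = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.ToTensor(),          # scales pixel values to [0, 1]
])

train_set = datasets.MNIST("data/", train=True, download=True, transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True)
```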
A.3.1 Disentanglement Metrics
MIG was originally proposed by Chen et al. (2018); however, we use the modified metric proposed by Locatello et al. (2019). We evaluate this score on 5000 test points for all the considered data sets. SAP and Eastwood's metrics use classifiers to compute the importance of each dimension of the learned representation for predicting a ground-truth factor. For these metrics, we randomly sample 5000 training and 3000 testing points. To compute these metrics, we use the open source library available at github.com/google-research/disentanglement_lib.
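For illustration, a self-contained version of the MIG computation is sketched below using histogram binning of the latent codes; it follows the spirit of the metric but is not guaranteed to match the exact estimator of disentanglement_lib.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mig(latents, factors, n_bins=20):
    """Mutual information gap sketch: latents is (n, s) real-valued codes,
    factors is (n, k) integer ground-truth factors. Latents are discretized
    by equal-width histogram binning."""
    n, s = latents.shape
    k = factors.shape[1]
    # Discretize each latent dimension into n_bins equal-width bins.
    discretized = np.stack(
        [np.digitize(latents[:, j],
                     np.histogram_bin_edges(latents[:, j], n_bins)[1:-1])
         for j in range(s)], axis=1)
    gaps = []
    for i in range(k):
        f = factors[:, i]
        mis = np.array([mutual_info_score(f, discretized[:, j]) for j in range(s)])
        # Entropy of the factor (in nats), used for normalization.
        _, counts = np.unique(f, return_counts=True)
        h = -(counts / n * np.log(counts / n)).sum()
        top2 = np.sort(mis)[-2:]
        gaps.append((top2[1] - top2[0]) / h)   # gap between best and second best
    return float(np.mean(gaps))
```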
A.4 Ablation Studies
A.4.1 Significance of the KPCA Loss
In this section, we show an ablation study on the KPCA loss and evaluate its effect on disentanglement. We repeat the experiments of section 5 on the mini-3DShapes data set (floor hue, wall hue, object hue, and scale: 8000 samples), where we consider three different variants of the proposed model:
St-RKM: The KPCA loss is optimized in a stochastic manner using the Cayley ADAM optimizer, as proposed in this article.
Gen-RKM: The KPCA loss is optimized exactly at each step by performing an eigendecomposition in each minibatch (this corresponds to the algorithm in Pandey et al., 2021).
AE-PCA: A standard AE is used, and a reconstruction loss is minimized for the training. As a postprocessing step, a PCA is performed on the latent embedding of the training data.
The encoder/decoder maps are the same across all the models, and for the AE-PCA model, additional linear layers are used to map the latent space to the subspace. From Table 4, we conclude that optimizing the KPCA loss during training improves disentanglement. Moreover, using a stochastic algorithm improves computation time and scalability with only a slight decrease in disentanglement score. Note that calculating the exact eigendecomposition at each step (Gen-RKM) comes with numerical difficulties. In particular, double floating-point precision has to be used together with a careful selection of the number of principal components to avoid ill-conditioned kernel matrices. This problem is not encountered when using the St-RKM training algorithm.
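The AE-PCA post-processing step can be sketched as follows, where `phi` denotes the trained autoencoder's encoder and the loader yields (input, label) batches; this is an illustration of the baseline, not the exact evaluation code.

```python
import torch

@torch.no_grad()
def pca_postprocess(phi, loader, s):
    """AE-PCA baseline: after training a plain autoencoder, fit PCA on the
    latent embeddings of the training data and keep the top-s directions."""
    z = torch.cat([phi(x) for x, *_ in loader])      # all latent embeddings
    mean = z.mean(dim=0, keepdim=True)
    cov = (z - mean).T @ (z - mean) / z.shape[0]
    evals, evecs = torch.linalg.eigh(cov)            # ascending eigenvalues
    components = torch.flip(evecs[:, -s:], dims=[1]) # top-s principal directions
    codes = (z - mean) @ components
    return components, mean, codes
```

Because the decoder was trained in the original latent basis, rotating that basis after the fact leaves the decoder unaware of the new coordinates, which is consistent with the degraded disentanglement reported in Table 4.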
Training Times per Epoch (in Minutes) and Disentanglement Scores (Eastwood & Williams, 2018) for Different Variants of RKM When Trained on the mini-3DShapes Data Set.
| Metric | Regressor | St-RKM | Gen-RKM | AE-PCA |
| --- | --- | --- | --- | --- |
| Training time | | 3.01 (0.71) | 9.21 (0.54) | 2.87 (0.33) |
| Disentanglement score | Lasso | 0.40 (0.02) | 0.44 (0.01) | 0.35 (0.01) |
| | RF | 0.27 (0.01) | 0.31 (0.02) | 0.22 (0.02) |
| Compliance score | Lasso | 0.64 (0.01) | 0.51 (0.01) | 0.42 (0.01) |
| | RF | 0.67 (0.02) | 0.58 (0.01) | 0.45 (0.02) |
| Information score | Lasso | 1.01 (0.02) | 1.11 (0.02) | 1.20 (0.01) |
| | RF | 0.98 (0.01) | 1.09 (0.01) | 1.17 (0.02) |
Notes: Gen-RKM has the worst training time but gets the highest disentanglement scores. This is due to the exact eigendecomposition of the kernel matrix at every iteration. This computationally expensive step is approximated by the St-RKM model, which achieves significant speed-up and scalability to large data sets. Finally, the AE-PCA model has the fastest training time due to the absence of eigendecompositions in the training loop. However, using PCA in the postprocessing step alters the basis of the latent space. This basis is unknown to the decoder network, resulting in degraded disentanglement performance.
A.4.2 Smaller Encoder/Decoder Architecture
FID Scores Computed on 8000 Randomly Generated Images When Trained with the Smaller Architecture and Hyperparameters.
| Data Set | St-RKM | VAE | β-VAE | FactorVAE | InfoGAN |
| --- | --- | --- | --- | --- | --- |
| MNIST | 24.63 (0.22) | 36.11 (1.01) | 42.81 (2.01) | 35.48 (0.07) | 45.74 (2.93) |
| fMNIST | 61.44 (1.02) | 73.47 (0.73) | 75.21 (1.11) | 69.73 (1.54) | 84.11 (2.58) |
Notes: Lower is better; standard deviations are given in parentheses. Adapted from Dupont (2018).
Computing the Diagonalization Scores (see Figure 3).
| Models | dSprites | 3DShapes | 3D cars |
| --- | --- | --- | --- |
| St-RKM-sl (trained $U$) | 0.17 (0.05) | 0.23 (0.03) | 0.21 (0.04) |
| St-RKM (trained $U$) | 0.26 (0.05) | 0.30 (0.10) | 0.31 (0.09) |
| St-RKM (random $U$) | 0.61 (0.02) | 0.72 (0.01) | 0.69 (0.03) |
Notes: Let $D$ denote the matrix in equation 3.6 and let $\mathrm{diag}(D)$ be the operator setting the off-diagonal elements of $D$ to zero. The reported score measures the relative size of $D - \mathrm{diag}(D)$ with respect to $D$. The scores are computed for each model over 10 random seeds and reported as mean (standard deviation); lower scores indicate better diagonalization.
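A minimal implementation of such a diagonalization score is sketched below; the Frobenius-norm normalization is an assumption on our part and may differ from the exact formula used for Table 6.

```python
import torch

def diagonalization_score(D):
    """Normalized off-diagonal energy of a square matrix D, used here as a
    proxy for the diagonalization score of Table 6 (assumed normalization):
    0 means perfectly diagonal."""
    off_diag = D - torch.diag(torch.diagonal(D))   # zero out the diagonal
    return (off_diag.norm() / D.norm()).item()     # Frobenius-norm ratio
```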
Samples from a randomly generated batch of images used to compute the FID and SWD scores (see Figure 4).
Samples of randomly generated images used to compute the FID scores. See Table 5.
(a) Evolution of the loss during the training of equation A.2 over 1000 epochs, once with the Cayley ADAM optimizer (green curve) and once without (blue curve). (b) Traversals along the principal components when the model was trained with a fixed $U$, that is, with the objective given by equation A.2. There is no clear isolation of a feature along any of the principal components, further indicating that optimizing over $U$ is key to better disentanglement.
A.4.3 Analysis of St-RKM with a Fixed $U$
for all ,
for all .
Thanks to the push-through identity, we have an alternative expression for this quantity, and the two expressions coincide, as they should. In our experiments, if the regularization parameter is chosen too small, the regularized PCA objective in equation A.2 takes negative values after a few epochs due to the numerical instability mentioned above.
Figure 10a displays the evolution of the training objective A.2. It can be seen that the final objective attains a lower value when $U$ is optimized than with its fixed counterpart, showing the merit of optimizing over the Stiefel manifold for the same regularization parameter. Hence, the subspace determined by $U$ has to be adapted to the encoder and decoder networks; in other words, training the encoder and decoder with Adam alone is not sufficient to minimize the objective. Figure 10b further explores the latent traversals in the context of this ablation study. In the top row of Figure 10b (latent traversal along one principal direction), both the shape of the object and the wall hue change. A coupling between wall hue and shape is also visible in the bottom row of the figure.
Notes
A typical implementation of a VAE includes another neural network (after the primary network) for parameterizing the covariance matrix. To simplify this introductory discussion, this matrix is here chosen as a constant diagonal matrix.
The source code is available at http://bit.ly/StRKM_code.
Using a random $U$ can be interpreted as sketching the encoder map in the spirit of randomized orthogonal systems (ROS) sketches (see Yang, Pilanci, & Wainwright, 2017).
Acknowledgments
Most of this work was done when M.F. was at KU Leuven.
EU: The research leading to these results received funding from the European Research Council under the European Union's Horizon 2020 research and innovation program/ERC Advanced Grant E-DUALITY (787960). This article reflects only the authors' views, and the EU is not liable for any use that may be made of the contained information.
Research Council KUL: Optimization frameworks for deep kernel machines C14/18/068.
Flemish government: (a) FWO: projects: GOA4917N (Deep Restricted Kernel Machines: Methods and Foundations), PhD/postdoc grant. (b) This research received funding from the Flemish government (AI Research Program). We are affiliated with Leuven.AI-KU Leuven institute for AI, B-3000, Leuven, Belgium.
Ford KU Leuven Research Alliance Project: KUL0076 (stability analysis and performance improvement of deep reinforcement learning algorithms).
Vlaams Supercomputer Centrum: The computational resources and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by the Research Foundation–Flanders (FWO) and the Flemish government department EWI.