## Abstract

Disentanglement is a useful property in representation learning that increases the interpretability of generative models such as variational autoencoders (VAEs), generative adversarial networks, and their many variants. Typically in such models, an increase in disentanglement performance is traded off against generation quality. In the context of latent space models, this work presents a representation learning framework that explicitly promotes disentanglement by encouraging orthogonal directions of variation. The proposed objective is the sum of an autoencoder error term and a principal component analysis reconstruction error in the feature space. This has an interpretation as a restricted kernel machine with the eigenvector matrix valued on the Stiefel manifold. Our analysis shows that such a construction promotes disentanglement by matching the principal directions in the latent space with the directions of orthogonal variation in data space. In an alternating minimization scheme, we use the Cayley Adam algorithm, a stochastic optimization method on the Stiefel manifold, along with the Adam optimizer. Our theoretical discussion and various experiments show that the proposed model is an improvement over many VAE variants in terms of both generation quality and disentangled representation learning.

## 1 Introduction

Latent space models are popular tools for sampling from high-dimensional distributions. Often, only a small number of latent factors are sufficient to describe data variations. These models exploit the underlying structure of the data and learn explicit representations that are faithful to the data-generating factors. Popular latent space models are variational autoencoders (VAEs; Kingma & Welling, 2014), restricted Boltzmann machines (RBMs; Salakhutdinov & Hinton, 2009), normalizing flows (Rezende & Mohamed, 2015), and their many variants.

^{1} is given by the neural network $\varphi_\theta$, parametrized by $\theta$. A random decoder $p(x \mid z) = \mathcal{N}(x \mid \psi_\xi(z), \sigma_0^2 I)$ is associated with the decoder neural network $\psi_\xi$, parameterized by $\xi$, which maps latent codes to data points. A VAE is trained by maximizing the lower bound to the idealized log-likelihood as:

The rest of the article is organized as follows. In section 2, we discuss closely related work on disentangled representation learning and generation in the context of autoencoders. In section 3, we describe the proposed model along with the connection between PCA and disentanglement, and in section 3.2 we summarize our contributions. In section 4, we derive the evidence lower bound of the proposed model and show connections with probabilistic models. In section 5, we describe our experiments and discuss the results.

## 2 Related Work

Related works can be broadly classified into two categories: Variational autoencoders (VAE) in the context of disentanglement and Restricted Kernel Machines (RKM), a recently proposed modeling framework that integrates kernel methods with deep learning.

### 2.1 VAE

As discussed in section 1, Higgins et al. (2017) suggested that a stronger emphasis on the posterior matching the factorized unit Gaussian prior puts further constraints on the implicit capacity of the latent bottleneck. Burgess et al. (2017) further analyzed the effect of the $\beta$ term in depth. Later, Chen, Li, Grosse, and Duvenaud (2018) showed that the KL term includes the mutual information gap, which encourages disentanglement. Recently, several variants of VAEs promoting disentanglement have been proposed by adding extra terms to the ELBO. For instance, FactorVAE (Kim & Mnih, 2018) augments the ELBO with a new term enforcing factorization of the marginal posterior (or aggregate posterior). Rolínek et al. (2019) analyzed why the latent space aligns with the coordinate axes, as the design of the VAE itself does not suggest any such mechanism. The authors argue that the diagonal approximation in the encoder, together with the inherent stochasticity, forces local orthogonality of the decoder. Locatello et al. (2020) considered adding an extra term that accounts for partial label information to improve disentanglement. Later, Ghosh, Sajjadi, Vergari, Black, and Schölkopf (2020) studied deterministic AEs, where another quadratic regularization on the latent vectors was proposed. In contrast to Rolínek et al. (2019), where the implicit orthogonality of the VAE was studied, our proposed model has orthogonality by design due to the introduction of the Stiefel manifold.

### 2.2 RKM

Restricted kernel machines (RKMs; Suykens, 2017) provide a representation of kernel methods with visible and hidden variables similar to the energy function of restricted Boltzmann machines (RBMs; LeCun, Huang, & Bottou, 2004; Hinton, 2005), thus linking kernel methods with RBMs. Training and prediction schemes are characterized by the stationary points of the unknowns in the objective. The equations in these stationary points lead to solving a linear system or a matrix decomposition for training. Suykens (2017) shows various RKM formulations for classification, regression, kernel PCA, and singular value decomposition. Later, the kernel PCA formulation of the RKM was extended to a multiview generative model called the generative RKM (Gen-RKM), which uses convolutional neural networks as explicit feature maps (Pandey, Schreurs, & Suykens, 2020, 2021). For joint feature selection and subspace learning, the proposed training procedure performs an eigendecomposition of the kernel/covariance matrix in every minibatch of the optimization scheme. Intuitively, the model can be seen as learning an autoencoder with kernel PCA in the bottleneck. As a result, the computational complexity scales cubically with the minibatch size and is proportional to the number of minibatches. Moreover, backpropagation through the eigendecomposition can be numerically unstable due to the possibility of small eigenvalues. All these limitations are addressed by our proposed model.
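To make the bottleneck picture concrete, the following is a minimal NumPy sketch (not the authors' implementation; the feature matrix is a random stand-in for encoder outputs) of the kernel PCA step that Gen-RKM performs on every minibatch, whose eigendecomposition is the cubic cost incurred at each iteration:

```python
import numpy as np

# Illustrative sketch: the Gen-RKM bottleneck performs kernel PCA on every
# minibatch; its eigendecomposition costs O(M^3) in the minibatch size M.
rng = np.random.default_rng(1)
M_batch, ell, m = 64, 10, 4                 # minibatch size, feature dim, components

Phi = rng.standard_normal((M_batch, ell))   # stand-in for encoder features
Phi -= Phi.mean(axis=0)                     # center the features
K = Phi @ Phi.T                             # linear kernel matrix, M x M

# Top-m kernel principal components give the latent codes of the minibatch.
vals, vecs = np.linalg.eigh(K)              # O(M_batch^3) at every iteration
H = vecs[:, -m:] * np.sqrt(np.clip(vals[-m:], 0.0, None))
print(H.shape)                              # (64, 4)
```

Repeating this decomposition for every minibatch, and backpropagating through it when eigenvalues are small, is exactly the cost and stability issue the proposed model avoids.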

## 3 Proposed Mechanism

*Orthogonal latent directions.* Naturally, given an $m \times m$ orthogonal matrix $O$ and a matrix $U \in \mathrm{St}(\ell, m)$, we have

Let $M$ be an $\ell \times \ell$ symmetric matrix. Let $\nu_1, \ldots, \nu_m$ be its $m$ smallest eigenvalues, possibly including multiplicities, with associated orthonormal eigenvectors $v_1, \ldots, v_m$. Let $V$ be a matrix whose columns are these eigenvectors. Then the optimization problem $\min_{U \in \mathrm{St}(\ell, m)} \operatorname{Tr}(U^\top M U)$ has a minimizer at $U_\star = V$, and we have $U_\star^\top M U_\star = \operatorname{diag}(\nu)$, with $\nu = (\nu_1, \ldots, \nu_m)^\top$.

A few remarks follow. First, if $U_\star$ is a minimizer of the optimization problem in proposition 1, then $U'_\star = U_\star O$ with $O$ orthogonal is also a minimizer, but ${U'_\star}^\top M U'_\star$ is not necessarily diagonal. Second, notice that if the eigenvalues of $M$ in proposition 1 have a multiplicity larger than 1, there can exist several sets of eigenvectors $v_1, \ldots, v_m$, associated with the $m$ smallest eigenvalues, spanning distinct linear subspaces. Nevertheless, in practice, the eigenvalues of the matrices considered in this article are numerically distinct.
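Proposition 1 and the first remark can be checked numerically on a random symmetric matrix (a NumPy sketch for illustration, not part of the training algorithm): stacking the eigenvectors of the $m$ smallest eigenvalues into $U_\star$ diagonalizes $U_\star^\top M U_\star$ and attains the minimal trace, while right-multiplying by an orthogonal $O$ preserves the trace but not diagonality.

```python
import numpy as np

rng = np.random.default_rng(0)
ell, m = 8, 3

# Random symmetric matrix M (distinct eigenvalues almost surely).
A = rng.standard_normal((ell, ell))
M = (A + A.T) / 2

# Eigenvectors of the m smallest eigenvalues -> a point on St(ell, m).
eigvals, eigvecs = np.linalg.eigh(M)        # eigenvalues in ascending order
U_star = eigvecs[:, :m]

D = U_star.T @ M @ U_star
print(np.allclose(D, np.diag(np.diag(D))))         # diagonal, as in Prop. 1
print(np.isclose(np.trace(D), eigvals[:m].sum()))  # minimal trace value

# Right-multiplying by an orthogonal O stays on St(ell, m) and keeps the
# same trace, but generally destroys diagonality (first remark above).
O, _ = np.linalg.qr(rng.standard_normal((m, m)))
D2 = (U_star @ O).T @ M @ (U_star @ O)
print(np.isclose(np.trace(D2), eigvals[:m].sum()))
```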

*Orthogonal directions of variation in input space.* We want the lines defined by the orthonormal vectors $\{u_{\star,1}, \ldots, u_{\star,m}\}$ to provide directions associated with different generative factors of our model. In other words, we conjecture that a possible formalization of disentanglement is that the principal directions in latent space match orthogonal directions of variation in the data space (see Figure 2). That is, we would like that

for all the points in latent space $y_i = P_U \varphi_\theta(x_i)$ for $i = 1, \ldots, n$. In equation 3.2, $\psi_a(y)$ refers to the $a$th component of the image $\psi(y) \in \mathbb{R}^d$. To sketch this idea, we study local motions in the latent space.

### 3.1 Objective Function

the training of a Stiefel-restricted kernel machine, equation 3.3, in view of our discussion in section 2. The basic idea is to design different AE losses with a regularization term that penalizes the feature map in the orthogonal subspace $U_\perp$. The choice of the AE losses is motivated by the expression of the regularized AE in equation 1.2 and by the following lemma, which extends the result of Rolínek et al. (2019). Here we adapt it to the context of optimization on the Stiefel manifold (see the appendix for the proof).

with $|R_a(\sigma)| \le \frac{1}{6}\sigma^3 L_a \sqrt{2}\,(m+1)\frac{\Gamma((m+1)/2)}{\Gamma(m/2)}$, where $\Gamma$ is Euler's gamma function.

In lemma 1, the first term on the right-hand side of equation 3.4 plays the role of the classical AE loss. The second term is proportional to the trace of equation 3.2. This relates to our discussion above, where we argue that jointly diagonalizing both $U^\top \nabla\psi(y)\nabla\psi(y)^\top U$ and $U^\top C_\theta U$ helps to enforce disentanglement. However, determining the behavior of the third term in equation 3.4 is difficult: for a typical neural network architecture, it is unclear in practice whether the function $[x - \psi(\cdot)]_a^2$ has an $L_a$-Lipschitz continuous Hessian for all $a \in [d]$. Hence, we propose another AE loss (the split loss) in order to cancel the third term in equation 3.4. Nevertheless, the assumption in lemma 1 is used to provide a meaningful bound on the remainder in equation 3.4. In light of these remarks, we propose two stochastic AE losses.

#### 3.1.1 AE Losses

Note that we do not simply propose another encoder-decoder architecture, given by $U^\top \varphi_\theta(\cdot)$ and $\psi_\xi(U\,\cdot)$. Instead, our objective assumes that the neural network defining the encoder provides a better embedding if we impose that it map training points onto a linear subspace of dimension $m < \ell$ in the $\ell$-dimensional latent space. In other words, the optimization of the parameters in the last layer of the encoder does not play a redundant role, since the second term in equation 3.3 clearly also depends on $P_{U_\perp}\varphi_\theta(\cdot)$. The full training involves an alternating minimization procedure, described in algorithm 1.
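The two ingredients of the objective can be sketched in a few lines of NumPy (a simplified illustration, not the authors' implementation: the encoder and decoder are hypothetical stand-ins for the networks $\varphi_\theta$ and $\psi_\xi$, and the stochastic perturbation of the AE loss is omitted):

```python
import numpy as np

rng = np.random.default_rng(2)
ell, m, d, n = 10, 4, 16, 32

# Hypothetical stand-ins for the encoder phi_theta and decoder psi_xi;
# in the model these are convolutional neural networks.
W_enc = rng.standard_normal((ell, d)) / np.sqrt(d)
W_dec = rng.standard_normal((d, m)) / np.sqrt(m)
encode = lambda x: np.tanh(W_enc @ x)
decode = lambda y: W_dec @ y

U, _ = np.linalg.qr(rng.standard_normal((ell, m)))   # a point on St(ell, m)
X = rng.standard_normal((n, d))

ae_loss, pca_loss = 0.0, 0.0
for x in X:
    h = encode(x)                  # feature vector in R^ell
    y = U.T @ h                    # latent code in the m-dim subspace
    ae_loss += np.sum((x - decode(y)) ** 2)   # AE reconstruction term
    pca_loss += np.sum((h - U @ y) ** 2)      # ||P_{U_perp} phi_theta(x)||^2
loss = (ae_loss + pca_loss) / n    # reconstruction weight lambda = 1
print(loss >= 0.0)
```

The second term penalizes the component of the features outside $\mathrm{range}(U)$, which is why the encoder's last layer is not redundant: it must place the training points near the $m$-dimensional subspace.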

### 3.2 Contributions

## 4 Connections with the Evidence Lower Bound

## 5 Experiments

In this section, we investigate if St-RKM^{2} can simultaneously achieve accurate reconstructions on training data, good random generations, and good disentanglement performance. We use the standard data sets: MNIST (LeCun & Cortes, 2010), Fashion-MNIST (fMNIST; Xiao, Rasul, & Vollgraf, 2017), and SVHN (Netzer et al., 2011). To evaluate disentanglement, we use data sets with known ground-truth generating factors such as dSprites (Matthey, Higgins, Hassabis, & Lerchner, 2017), 3DShapes (Burgess & Kim, 2018), and 3D cars (Reed, Zhang, Zhang, & Lee, 2015). Further, all figures and tables report average errors with 1 standard deviation over 10 experiments.

### 5.1 Algorithm

We use an alternating minimization scheme, as shown in algorithm 1. First, the Adam optimizer with a learning rate of $2 \times 10^{-4}$ is used to update the encoder-decoder parameters; then, the Cayley Adam optimizer (Li et al., 2020) with a learning rate of $10^{-4}$ is used to update $U$. Finally, at the end of training, we recompute $U$ from the singular value decomposition (SVD) of the covariance matrix as a final correction step for the kernel PCA term in our objective (step 10 of algorithm 1). Since the $\ell \times \ell$ covariance matrix is typically small, this decomposition is fast (see Table 3). In practice, our training procedure only marginally increases the computation cost, as can be seen from the training times in Table 1.
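The final correction step can be sketched as follows (illustrative NumPy code, not the authors' implementation; the feature matrix stands in for trained encoder outputs): after the alternating updates, $U$ is recomputed from the decomposition of the small $\ell \times \ell$ feature covariance.

```python
import numpy as np

rng = np.random.default_rng(3)
ell, m, n = 10, 4, 500

Phi = rng.standard_normal((n, ell))   # stand-in for trained encoder features
Phi -= Phi.mean(axis=0)               # center the features
C = Phi.T @ Phi / n                   # small ell x ell covariance matrix

# SVD of the symmetric covariance; singular vectors come ordered by
# decreasing singular value, so the leading m columns span the
# principal subspace.
Uc, s, _ = np.linalg.svd(C)
U = Uc[:, :m]
print(np.allclose(U.T @ U, np.eye(m)))   # U lies on St(ell, m)
```

Because $\ell$ is small (50 in our architectures), this decomposition is cheap regardless of the data set size.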

| Model | St-RKM | ($\beta$)-VAE | FactorVAE | Info-GAN |
|---|---|---|---|---|
| Number of parameters | 4,164,519 | 4,165,589 | 8,182,591 | 4,713,478 |
| Training time | 21.93 (1.3) | 19.83 (0.8) | 33.31 (2.7) | 45.96 (1.6) |


### 5.2 Experimental Setup

We consider four baselines for comparison: VAE, $\beta$-VAE, FactorVAE, and Info-GAN. An ablation study with the Gen-RKM is shown in section A.4 in the appendix. Extensive experimentation with the Gen-RKM was not computationally feasible, since the evaluation and decomposition of kernel matrices scale as $O(n^2)$ and $O(n^3)$, respectively, with the data set size (see the discussion in section 2).

### 5.3 Inductive Biases

To be consistent in evaluation, we keep the same encoder (discriminator) and decoder (generator) architecture and the same latent dimension across the models. We use convolutional neural networks due to the choice of image data sets for evaluating generation and disentanglement. In the case of Info-GAN, batch normalization is added for training stability (see section A.3 in the appendix for details). For the hyperparameters of the other models, we start from values in the range suggested in the authors' reference implementations. After trying various values, we noticed that $\beta = 3$ and $\gamma = 12$ work well across the considered data sets for $\beta$-VAE and FactorVAE, respectively. Furthermore, in all the experiments on St-RKM, we keep the reconstruction weight $\lambda = 1$. All models are trained on the entire data set. Note that for the same encoder-decoder network, the St-RKM model has the fewest parameters compared to the VAE variants and Info-GAN (see Table 1).

To evaluate the disentanglement performance, various metrics have been proposed. A comprehensive review by Locatello et al. (2019) shows that the various disentanglement metrics are correlated, albeit with a different degree of correlation across data sets. In this article, we use three metrics to evaluate disentanglement: Eastwood's framework (Eastwood & Williams, 2018), mutual information gap (MIG; Chen et al., 2018), and separated attribute predictability (SAP; Kumar et al., 2018) scores. Eastwood's framework further comprises three scores: *disentanglement*, the degree to which a representation factorizes the underlying factors of variation, with each variable capturing at most one generative factor; *completeness*, the degree to which each underlying factor is captured by a single code variable; and *informativeness*, the amount of information that a representation captures about the underlying factors of variation. Furthermore, we use a slightly modified version of the MIG score, as proposed by Locatello et al. (2019). Figure 6 shows that the St-RKM variants have better disentanglement and completeness scores (higher mean scores). However, the informativeness scores are higher for St-RKM when using a lasso regressor, in contrast to mixed scores with a random forest regressor. Figure 7 further complements these observations by showing the MIG and SAP scores. Here, the St-RKM-sl model has the highest mean scores for every data set. A qualitative assessment can be made from Figure 5, which shows images generated by traversing along the principal components in the latent space. On the 3DShapes data set, the St-RKM model captures floor hue, wall hue, and orientation perfectly but shows slight entanglement in the other factors. This is worse in $\beta$-VAE, which has entanglement in all dimensions except the floor hue, along with noise in some generated images. Similar trends can be observed on the dSprites and 3D cars data sets.

## 6 Conclusion

This article proposes the St-RKM model for disentangled representation learning and generation based on manifold optimization. For training, we use the Cayley Adam algorithm of Li et al. (2020) for stochastic optimization on the Stiefel manifold. Computationally, St-RKM increases the training time by only a small amount compared to, for instance, $\beta$-VAE. Furthermore, we propose several autoencoder objectives and show that the combination of a stochastic AE loss with an explicit optimization on the Stiefel manifold promotes disentanglement. In addition, we establish connections with probabilistic models, formulate an evidence lower bound, and discuss the independence of latent factors. Whereas the considered baselines trade off generation quality against disentanglement, we improve on both of these aspects, as illustrated through various experiments. The proposed model has some limitations. A first limitation is hyperparameter selection: the number of components in the KPCA, the neural network architecture, and the final size of the feature map. When additional knowledge of the data is available, we suggest that the user select a number of components close to the number of underlying generating factors. The final size of the feature map should be large enough that KPCA extracts meaningful components. Second, we interpret disentanglement as two orthogonal changes in the latent space corresponding to two orthogonal changes in input space. Although not perfect, we believe this is a reasonable mathematical approximation of the loosely defined notion of disentanglement. Moreover, experimental results support this assumption. Among the possible regularizers on the hidden features, the model associated with the squared Euclidean norm was analyzed in detail, while a deeper study of other regularizers, in particular spherical units, is a prospect for further research.

## Appendix

### A.1 Proof of Lemma 1

### A.2 Details on Evidence Lower Bound for St-RKM model

| Data Set | $N$ | $d$ | $m$ | $M$ |
|---|---|---|---|---|
| MNIST | 60,000 | $28 \times 28$ | 10 | 256 |
| fMNIST | 60,000 | $28 \times 28$ | 10 | 256 |
| SVHN | 73,257 | $32 \times 32 \times 3$ | 10 | 256 |
| dSprites | 737,280 | $64 \times 64$ | 5 | 256 |
| 3DShapes | 480,000 | $64 \times 64 \times 3$ | 6 | 256 |
| 3D cars | 17,664 | $64 \times 64 \times 3$ | 3 | 256 |


Note: $N$ is the number of training samples, $d$ the input dimension (resized images), $m$ the subspace dimension, and $M$ the minibatch size.

| Data Set | Architecture | |
|---|---|---|
| MNIST/fMNIST/SVHN/3DShapes/dSprites/3D cars | $\varphi_\theta(\cdot) = \mathrm{Conv}[c]\times 4\times 4;\ \mathrm{Conv}[c\times 2]\times 4\times 4;\ \mathrm{Conv}[c\times 4]\times \hat{k}\times \hat{k};\ \mathrm{FC}\,256;\ \mathrm{FC}\,50$ (Linear) | $\psi_\xi(\cdot) = \mathrm{FC}\,256;\ \mathrm{FC}\,[c\times 4]\times \hat{k}\times \hat{k};\ \mathrm{Conv}[c\times 2]\times 4\times 4;\ \mathrm{Conv}[c]\times 4\times 4;\ \mathrm{Conv}[c]$ (Sigmoid) |


Notes: All convolutions and transposed convolutions have stride 2 and padding 1. Unless stated otherwise, layers have parametric ReLU ($\alpha = 0.2$) activation functions, except the output layers of the preimage maps, which have sigmoid activations (since input data are normalized to [0, 1]). The Adam and Cayley Adam optimizers have learning rates $2 \times 10^{-4}$ and $10^{-4}$, respectively. The preimage map/decoder network is always taken as the transpose of the feature map/encoder network. $c = 48$ for 3D cars, and $c = 64$ for all others. Further, $\hat{k} = 3$ with stride 1 for MNIST, fMNIST, SVHN, and 3DShapes; $\hat{k} = 4$ for the others. SVHN and 3DShapes are resized to $28 \times 28$ input dimensions.

### A.3 Data Sets and Hyperparameters

We refer to Tables 2 and 3 for specific details on the model architectures, data sets, and hyperparameters used in this article. All models were trained on the full data sets for a maximum of 1000 epochs. Furthermore, all data sets are scaled to [0, 1] and resized to $28 \times 28$ dimensions, except dSprites and 3D cars. Models were implemented in Python with the PyTorch library (single precision) and trained on an 8 GB NVIDIA Quadro P4000 GPU. See algorithm 1 for training the St-RKM model. In the case of FactorVAE, the discriminator architecture is the same as proposed in the original paper (Kim & Mnih, 2018).

#### A.3.1 Disentanglement Metrics

MIG was originally proposed by Chen et al. (2018); however, we use the modified metric proposed in Locatello et al. (2019). We evaluate this score on 5000 test points across all the considered data sets. SAP and Eastwood's metrics use classifiers to compute the importance of each dimension of the learned representation for predicting a ground-truth factor. For these metrics, we randomly sample 5000 training and 3000 testing points. To compute these metrics, we use the open source library available at github.com/google-research/disentanglement_lib.
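To illustrate the idea behind the MIG family of scores, here is a simplified histogram-based sketch (our own stand-in for exposition; the reported numbers use the disentanglement_lib implementation, and details such as discretization differ):

```python
import numpy as np

def _mi(code, factor, bins=20):
    # Discretize the continuous code, then estimate mutual information
    # (in nats) with the discrete ground-truth factor from the joint histogram.
    joint, _, _ = np.histogram2d(code, factor, bins=(bins, np.unique(factor).size))
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px * py)[nz])).sum())

def _entropy(factor):
    _, counts = np.unique(factor, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def mig_score(latents, factors):
    # latents: (n, l) learned codes; factors: (n, k) discrete generative factors.
    # For each factor: the normalized gap between the two largest mutual
    # informations over latent dimensions, averaged over all factors.
    gaps = []
    for j in range(factors.shape[1]):
        mis = sorted((_mi(latents[:, i], factors[:, j])
                      for i in range(latents.shape[1])), reverse=True)
        gaps.append((mis[0] - mis[1]) / _entropy(factors[:, j]))
    return float(np.mean(gaps))
```

A representation in which exactly one latent dimension tracks each factor drives the gap toward 1, while a factor spread over several dimensions yields a small gap.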

### A.4 Ablation Studies

#### A.4.1 Significance of the KPCA Loss

In this section, we show an ablation study on the KPCA loss and evaluate its effect on disentanglement. We repeat the experiments of section 5 on the mini-3DShapes data set (floor hue, wall hue, object hue, and scale: 8000 samples), where we consider three different variants of the proposed model:

- **St-RKM** ($\sigma = 0$): The KPCA loss is optimized in a stochastic manner using the Cayley Adam optimizer, as proposed in this article.
- **Gen-RKM**: The KPCA loss is optimized exactly at each step by performing an eigendecomposition in each minibatch (this corresponds to the algorithm in Pandey et al., 2021).
- **AE-PCA**: A standard AE is used, and a reconstruction loss is minimized for the training. As a postprocessing step, a PCA is performed on the latent embedding of the training data.

The encoder/decoder maps are the same across all the models, and for the AE-PCA model, additional linear layers are used to map the latent space to the subspace. From Table 4, we conclude that optimizing the KPCA loss during training improves disentanglement. Moreover, using a stochastic algorithm improves computation time and scalability with only a slight decrease in disentanglement score. Note that calculating the exact eigendecomposition at each step (Gen-RKM) comes with numerical difficulties. In particular, double floating-point precision has to be used together with a careful selection of the number of principal components to avoid ill-conditioned kernel matrices. This problem is not encountered when using the St-RKM training algorithm.

| Metric | Regressor | St-RKM ($\sigma = 0$) | Gen-RKM | AE-PCA |
|---|---|---|---|---|
| Training time | | 3.01 (0.71) | 9.21 (0.54) | 2.87 (0.33) |
| Disentanglement score | Lasso | 0.40 (0.02) | 0.44 (0.01) | 0.35 (0.01) |
| | RF | 0.27 (0.01) | 0.31 (0.02) | 0.22 (0.02) |
| Completeness score | Lasso | 0.64 (0.01) | 0.51 (0.01) | 0.42 (0.01) |
| | RF | 0.67 (0.02) | 0.58 (0.01) | 0.45 (0.02) |
| Informativeness score | Lasso | 1.01 (0.02) | 1.11 (0.02) | 1.20 (0.01) |
| | RF | 0.98 (0.01) | 1.09 (0.01) | 1.17 (0.02) |


Notes: Gen-RKM has the worst training time but gets the highest disentanglement scores. This is due to the exact eigendecomposition of the kernel matrix at every iteration. This computationally expensive step is approximated by the St-RKM model, which achieves significant speed-up and scalability to large data sets. Finally, the AE-PCA model has the fastest training time due to the absence of eigendecompositions in the training loop. However, using PCA in the postprocessing step alters the basis of the latent space. This basis is unknown to the decoder network, resulting in degraded disentanglement performance.

#### A.4.2 Smaller Encoder/Decoder Architecture

| Data Set | St-RKM | VAE | $\beta$-VAE | FactorVAE | InfoGAN |
|---|---|---|---|---|---|
| MNIST | 24.63 (0.22) | 36.11 (1.01) | 42.81 (2.01) | 35.48 (0.07) | 45.74 (2.93) |
| fMNIST | 61.44 (1.02) | 73.47 (0.73) | 75.21 (1.11) | 69.73 (1.54) | 84.11 (2.58) |


Notes: Lower is better; standard deviations in parentheses. Adapted from Dupont (2018).

| Models | dSprites | 3DShapes | 3D cars |
|---|---|---|---|
| St-RKM-sl ($\sigma = 10^{-3}$, $U_\star$) | 0.17 (0.05) | 0.23 (0.03) | 0.21 (0.04) |
| St-RKM ($\sigma = 10^{-3}$, $U_\star$) | 0.26 (0.05) | 0.30 (0.10) | 0.31 (0.09) |
| St-RKM ($\sigma = 10^{-3}$, random $U$) | 0.61 (0.02) | 0.72 (0.01) | 0.69 (0.03) |


Notes: Denote $M = \frac{1}{|C|}\sum_{i \in C} U_\star^\top \nabla\psi(y_i)\nabla\psi(y_i)^\top U_\star$, with $y_i = P_U \varphi_\theta(x_i)$ (cf. equation 3.6). Then we compute the score as $\|M - \operatorname{diag}(M)\|_F / \|M\|_F$, where $\operatorname{diag} : \mathbb{R}^{m \times m} \mapsto \mathbb{R}^{m \times m}$ sets the off-diagonal elements of a matrix to zero. The scores are computed for each model over 10 random seeds and reported as mean (standard deviation). Lower scores indicate better diagonalization.
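The normalized off-diagonal score used in the note above is straightforward to compute (a sketch with a random matrix standing in for $M$):

```python
import numpy as np

def diag_score(M):
    # ||M - diag(M)||_F / ||M||_F: 0 for a perfectly diagonal matrix,
    # approaching 1 as the off-diagonal mass dominates.
    off = M - np.diag(np.diag(M))
    return np.linalg.norm(off) / np.linalg.norm(M)

print(diag_score(np.diag([1.0, 2.0, 3.0])))   # 0.0 for a diagonal matrix
```

Because the diagonal and off-diagonal parts are orthogonal in the Frobenius inner product, the score always lies in [0, 1].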

#### A.4.3 Analysis of St-RKM with a Fixed $U$

^{3}and kept fixed during the training of the following optimization problem,

$P_{\tilde{U}_\perp}(\epsilon)\,u_\perp = u_\perp$ for all $u_\perp \in (\operatorname{range}(U))^\perp$,

$P_{\tilde{U}_\perp}(\epsilon)\,u = \epsilon\,u$ for all $u \in \operatorname{range}(U)$.

Thanks to the push-through identity, we have the alternative expression $P_{\tilde{U}_\perp}(\epsilon) = I - U(U^\top U + \epsilon I_m)^{-1}U^\top$. Therefore, it holds that $\lim_{\epsilon \to 0} P_{\tilde{U}_\perp}(\epsilon) = P_{\tilde{U}_\perp}$, as it should. In our experiments, we set $\epsilon = 10^{-5}$. If $\epsilon \le 10^{-6}$, the regularized PCA objective in equation A.2 takes negative values after a few epochs due to the numerical instability mentioned above.
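The limit and the behavior on the two subspaces can be verified numerically (NumPy sketch; note that on $\operatorname{range}(U)$ this expression yields the eigenvalue $\epsilon/(1+\epsilon) \approx \epsilon$ for small $\epsilon$):

```python
import numpy as np

rng = np.random.default_rng(4)
ell, m, eps = 10, 4, 1e-5

U, _ = np.linalg.qr(rng.standard_normal((ell, m)))   # U on St(ell, m)
# Regularized projector via the push-through identity.
P_eps = np.eye(ell) - U @ np.linalg.inv(U.T @ U + eps * np.eye(m)) @ U.T
P_perp = np.eye(ell) - U @ U.T                       # exact projector

print(np.linalg.norm(P_eps - P_perp) < 1e-4)         # close to the eps -> 0 limit

u = U[:, 0]                                          # a vector in range(U)
# Since U^T U = I_m here, P_eps acts as eps/(1+eps) ~ eps on range(U).
print(np.allclose(P_eps @ u, (eps / (1 + eps)) * u))
```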

In Figure 10a, the evolution of the training objective A.2 is displayed. It can be seen that the final objective has a lower value [$\exp(6.78) \approx 881$] when $U$ is optimized compared to its fixed counterpart [$\exp(6.81) \approx 905$], showing the merit of optimizing over the Stiefel manifold for the same parameter $\epsilon$. Hence, the subspace determined by $\operatorname{range}(U)$ has to be adapted to the encoder and decoder networks. In other words, training over $\theta, \xi$ alone is not sufficient to minimize the objective over $\mathrm{St}(\ell, m)$ with Adam. Figure 10b further explores the latent traversals in the context of this ablation study. In the top row of Figure 10b (latent traversal in the direction of $u_1$), both the shape of the object and the wall hue change. A coupling between wall hue and shape is also visible in the bottom row of the figure.

## Notes

^{1}

A typical implementation of a VAE includes another neural network (after the primary network) for parametrizing the covariance matrix. To simplify this introductory discussion, this matrix is here chosen as a constant diagonal $\gamma^2 I$.

^{2}

The source code is available at http://bit.ly/StRKM_code.

^{3}

Using a random $\tilde{U} \in \mathrm{St}(\ell, m)$ can be interpreted as sketching the encoder map in the spirit of randomized orthogonal systems (ROS) sketches (see Yang, Pilanci, & Wainwright, 2017).

## Acknowledgments

Most of this work was done when M.F. was at KU Leuven.

EU: The research leading to these results received funding from the European Research Council under the European Union's Horizon 2020 research and innovation program/ERC Advanced Grant E-DUALITY (787960). This article reflects only the authors' views, and the EU is not liable for any use that may be made of the contained information.

Research Council KUL: Optimization frameworks for deep kernel machines C14/18/068.

Flemish government: (a) FWO: projects: GOA4917N (Deep Restricted Kernel Machines: Methods and Foundations), PhD/postdoc grant. (b) This research received funding from the Flemish government (AI Research Program). We are affiliated with Leuven.AI-KU Leuven institute for AI, B-3000, Leuven, Belgium.

Ford KU Leuven Research Alliance Project: KUL0076 (stability analysis and performance improvement of deep reinforcement learning algorithms).

Vlaams Supercomputer Centrum: The computational resources and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by the Research Foundation–Flanders (FWO) and the Flemish government department EWI.

## References

*Optimization algorithms on matrix manifolds*

*Advances in neural information processing systems, 27*

*IEEE Transactions on Pattern Analysis and Machine Intelligence*

*NIPS 2017 Workshop on Learning Disentangled Representations: From Perception to Control.*

*Advances in neural information processing systems*

*Advances in neural information processing systems*

*Proceedings of the International Conference on Learning Representations*

*Proceedings of the International Conference on Learning Representations*

*Advances in neural information processing systems, 30*

*Proceedings of the International Conference on Learning Representations*

*Proceedings of the 19th International Joint Conference on Artificial Intelligence*

*Proceedings of the International Conference on Learning Representations.*

*Proceedings of the Thirty-Fifth International Conference on Machine Learning*

*Proceedings of the International Conference on Learning Representations.*

*Proceedings of the International Conference on Learning Representations.*

*Proceedings of the Conference on Computer Vision and Pattern Recognition.*

*Proceedings of the International Conference on Learning Representations*

*Proceedings of the International Conference on Machine Learning.*

*International Conference on Learning Representations.*

*dSprites: Disentanglement testing Sprites dataset.*

*Introductory lectures on convex optimization: A basic course.*

*NIPS Workshop on Deep Learning and Unsupervised Feature Learning.*

*Proceedings of the Sixth International Conference on Machine Learning, Optimization, and Data Science*

*Neural Networks*

*Advances in neural information processing systems*

*Proceedings of the International Conference on Machine Learning*

*Proceedings of the 2019 IEEE/CVF conference on Computer Vision and Pattern Recognition*

*Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics*

*Neural Computation*

*Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms.*

*Annals of Statistics*