Abstract
UMAP is a nonparametric graph-based dimensionality reduction algorithm using applied Riemannian geometry and algebraic topology to find low-dimensional embeddings of structured data. The UMAP algorithm consists of two steps: (1) computing a graphical representation of a data set (fuzzy simplicial complex) and (2) through stochastic gradient descent, optimizing a low-dimensional embedding of the graph. Here, we extend the second step of UMAP to a parametric optimization over neural network weights, learning a parametric relationship between data and embedding. We first demonstrate that parametric UMAP performs comparably to its nonparametric counterpart while conferring the benefit of a learned parametric mapping (e.g., fast online embeddings for new data). We then explore UMAP as a regularization, constraining the latent distribution of autoencoders, parametrically varying global structure preservation, and improving classifier accuracy for semisupervised learning by capturing structure in unlabeled data.1
1 Introduction
Current nonlinear dimensionality reduction algorithms can be divided broadly into nonparametric algorithms, which rely on the efficient computation of probabilistic relationships from neighborhood graphs to extract structure in large data sets (UMAP (McInnes, Healy, & Melville, 2018), t-SNE (van der Maaten & Hinton, 2008), LargeVis (Tang, Liu, Zhang, & Mei, 2016)), and parametric algorithms, which, driven by advances in deep learning, optimize an objective function related to capturing structure in a data set over neural network weights (Ding, Condon, & Shah, 2018; Ding & Regev, 2019; Hinton & Salakhutdinov, 2006; Kingma & Welling, 2013; Szubert, Cole, Monaco, & Drozdov, 2019).
In recent years, a number of parametric dimensionality reduction algorithms have been developed to wed these two classes of methods, learning a structured graphical representation of the data and using a deep neural network to capture that structure (discussed in section 3). In particular, over the past decade, several works have proposed parameterized variants of the t-SNE algorithm (Bunte, Biehl, & Hammer, 2012; Gisbrecht, Lueks, Mokbel, & Hammer, 2012; Gisbrecht, Schulz, & Hammer, 2015; van der Maaten, 2009). Parametric t-SNE (van der Maaten, 2009), for example, trains a deep neural network to minimize loss over a t-SNE graph. However, the t-SNE loss function itself is not well suited to neural network training paradigms. In particular, t-SNE's optimization requires normalization over the entire data set at each step of optimization, making batch-based optimization and online learning of large data sets difficult. In contrast, UMAP is optimized using negative sampling (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013; Tang et al., 2016) to sparsely sample edges during optimization, making it, in principle, better suited to batch-wise training as is common in deep learning applications. Our proposed method, Parametric UMAP, brings the nonparametric graph-based dimensionality reduction algorithm UMAP into an emerging class of parametric topologically inspired embedding algorithms.
In the following section, we broadly outline the algorithm underlying UMAP to explain why our proposed algorithm, Parametric UMAP, is particularly well suited to deep learning applications. We contextualize our discussion of UMAP relative to t-SNE to outline the advantages that UMAP confers over t-SNE in the domain of parametric neural-network-based embedding. We then perform experiments comparing our algorithm, Parametric UMAP, to parametric and nonparametric algorithms. Finally, we show a novel extension of Parametric UMAP to semisupervised learning.
2 Parametric and Nonparametric UMAP
UMAP and t-SNE have the same goal: given a $D$-dimensional data set $X$, produce a $d$-dimensional embedding $Z$ (with $d \ll D$) such that points that are close together in $X$ (e.g., $x_i$ and $x_j$) are also close together in $Z$ ($z_i$ and $z_j$).
2.1 Graph Construction
2.1.1 Computing Probabilities in
The first step in both UMAP and t-SNE is to compute a distribution of probabilities between pairs of points in based on the distances between points in data space. Probabilities are initially computed as local, one-directional probabilities between a point and its neighbors in data space, then symmetrized to yield a final probability representing the relationship between pairs of points.
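As an illustration (not the optimized UMAP-learn implementation), the local one-directional probabilities and their symmetrization can be sketched in a few lines of NumPy; the bandwidth `sigma_i` and nearest-neighbor offset `rho_i` are assumed to have been found by UMAP's usual binary search over the k-nearest-neighbor distances.

```python
import numpy as np

def local_probabilities(dists, sigma_i, rho_i):
    """One-directional UMAP edge probabilities from point i to its k nearest
    neighbors, given their distances `dists` (shape (k,)), the distance to
    the nearest neighbor `rho_i`, and the local bandwidth `sigma_i`."""
    return np.exp(-np.maximum(dists - rho_i, 0.0) / sigma_i)

def symmetrize(P_directed):
    """Fuzzy set union (probabilistic t-conorm) used by UMAP to turn the
    directed k-NN probabilities into symmetric edge probabilities:
    p_ij = p_{j|i} + p_{i|j} - p_{j|i} * p_{i|j}."""
    return P_directed + P_directed.T - P_directed * P_directed.T
```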
2.2 Graph Embedding
After constructing a distribution of probabilistically weighted edges between points in $X$, UMAP and t-SNE initialize an embedding in $Z$ corresponding to each data point, where a probability distribution ($q_{ij}$) is computed between points as was done with the distribution ($p_{ij}$) in the input space. The objective of UMAP and t-SNE is then to optimize that embedding to minimize the difference between $p_{ij}$ and $q_{ij}$.
2.2.1 Computing Probabilities in
In embedding space, the pairwise probabilities $q_{ij}$ are computed directly without first computing local, one-directional probabilities.
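In UMAP, the low-dimensional similarity between a pair of embedded points takes the form $q_{ij} = (1 + a\|z_i - z_j\|^{2b})^{-1}$, with $a$ and $b$ fit from the min_dist hyperparameter. A minimal sketch (the default values of $a$ and $b$ are assumed away here; $a = b = 1$ recovers a Student-t-like kernel):

```python
import numpy as np

def umap_low_dim_prob(z_i, z_j, a=1.0, b=1.0):
    """Unnormalized low-dimensional UMAP similarity
    q_ij = 1 / (1 + a * ||z_i - z_j||^(2b))."""
    sq_dist = np.sum((z_i - z_j) ** 2)
    return 1.0 / (1.0 + a * sq_dist ** b)
```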
2.2.2 Cost Function
Finally, the distribution of embeddings in $Z$ is optimized to minimize the difference between $p_{ij}$ and $q_{ij}$.
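For reference, the standard forms of the two objectives (notation follows the original UMAP and t-SNE papers, with $p_{ij}$ and $q_{ij}$ the high- and low-dimensional edge probabilities defined above): UMAP minimizes a fuzzy-set cross-entropy, whereas t-SNE minimizes a KL divergence over normalized probabilities.

```latex
% UMAP's fuzzy cross-entropy (attractive + repulsive terms) versus
% t-SNE's KL divergence over normalized probabilities.
\begin{align}
C_{\mathrm{UMAP}} &= \sum_{i \neq j} \left[ p_{ij} \log \frac{p_{ij}}{q_{ij}}
  + (1 - p_{ij}) \log \frac{1 - p_{ij}}{1 - q_{ij}} \right], \\
C_{t\text{-SNE}} &= \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}.
\end{align}
```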
2.3 Attraction and Repulsion
Minimizing the cost function over every possible pair of points in the data set would be computationally expensive. UMAP and more recent variants of t-SNE both use shortcuts to bypass much of that computation. In UMAP, those shortcuts are directly advantageous to batch-wise training in a neural network.
The primary intuition behind these shortcuts is that the cost function of both t-SNE and UMAP can be broken out into a mixture of attractive forces between locally connected embeddings and repulsive forces between nonlocally connected embeddings.
2.3.1 Attractive Forces
Both UMAP and t-SNE use a similar strategy to minimize the computational cost of attractive forces: they rely on an approximate nearest neighbors graph.2 The intuition behind this approach is that elements that are far apart in data space have very small edge probabilities, which can effectively be treated as zero. Thus, edge probabilities and attractive forces need to be computed only over nearest neighbors; non-nearest neighbors can be treated as having an edge probability of zero. Because exact nearest-neighbor graphs are themselves computationally expensive, approximate nearest neighbor algorithms (Dong, Moses, & Li, 2011), which produce effectively similar results, are used instead.
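A minimal sketch of this graph-construction step using pynndescent, the approximate nearest neighbor library that the UMAP implementation builds on; the data here are random stand-ins, and the keyword arguments reflect its documented interface.

```python
import numpy as np
from pynndescent import NNDescent  # approximate k-NN (Dong et al., 2011)

# Random stand-in for a data set of 10,000 points in 50 dimensions.
X = np.random.RandomState(0).normal(size=(10_000, 50)).astype(np.float32)

# Build an approximate 15-nearest-neighbor graph (UMAP's default k);
# exact k-NN search would scale poorly with data set size.
index = NNDescent(X, n_neighbors=15, metric="euclidean", random_state=0)
knn_indices, knn_dists = index.neighbor_graph  # each of shape (10000, 15)
```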
2.3.2 Repulsive Forces
Because most data points are not locally connected, we do not need to waste computation on most pairs of embeddings.
UMAP takes a shortcut motivated by the language model word2vec (Mikolov et al., 2013) and performs negative sampling over embeddings. Each training step iterates over positive, locally connected edges and randomly samples edges from the remainder of the data set, treating their edge probabilities as zero to compute cross-entropy. Because most data points are not locally connected and have a very low edge probability, these negative samples are, on average, correct, allowing UMAP to sample only sparsely over edges in the data set.
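A sketch of the negative-sampling objective (simplified; the actual implementation samples positive edges in proportion to their probabilities rather than weighting them explicitly):

```python
import numpy as np

def negative_sampling_loss(q_pos, q_neg, eps=1e-4):
    """Cross-entropy computed with negative sampling: positive (locally
    connected) edges contribute an attractive term -log(q_ij), while the
    randomly sampled 'negative' edges are treated as having p_ij = 0 and
    contribute only a repulsive term -log(1 - q_ij)."""
    attraction = -np.log(np.clip(q_pos, eps, 1.0))
    repulsion = -np.log(np.clip(1.0 - q_neg, eps, 1.0))
    return attraction.sum() + repulsion.sum()
```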
In t-SNE, repulsion arises from the normalization of $q_{ij}$. A few methods have been developed to reduce the computation needed for repulsion. The first is the Barnes-Hut tree algorithm (van der Maaten, 2014), which bins the embedding space into cells so that repulsive forces can be computed over cells rather than over the individual data points within them. Similarly, the more recent interpolation-based t-SNE (FIt-SNE; Linderman, Rachh, Hoskins, Steinerberger, & Kluger, 2017, 2019) divides the embedding space into a grid and computes repulsive forces over the grid rather than the full set of embeddings.
2.4 Parametric UMAP
To summarize, both t-SNE and UMAP rely on the construction of a graph and a subsequent embedding that preserves the structure of that graph (see Figure 1). UMAP learns an embedding by minimizing cross-entropy sampled over positively weighted edges (attraction) and using negative sampling randomly over the data set (repulsion), allowing minimization to occur over sampled batches of the data set. t-SNE, meanwhile, minimizes a KL divergence loss function normalized over the entire set of embeddings in the data set using different approximation techniques to compute attractive and repulsive forces.
Because t-SNE optimization requires normalization over the distribution of embeddings in projection space, gradient descent can only be performed after computing edge probabilities over the entire data set. Projecting an entire data set through a neural network between each gradient descent step would be too computationally expensive to optimize, however. The trick Parametric t-SNE proposes for this problem is to split the data set into large batches (e.g., 5000 data points in the original paper) that are used to compute separate graphs, which are independently normalized and held constant throughout training, meaning that relationships between elements in different batches are not explicitly preserved. Conversely, a parametric form of UMAP, using negative sampling, can in principle be trained on batch sizes as small as a single edge, making it suitable both for the minibatch training needed for memory-expensive neural networks trained on the full graph over large data sets and for online learning.
Given these design features, UMAP loss can be applied as a regularization in typical stochastic gradient descent deep learning paradigms without requiring the batching trick that Parametric t-SNE relies on. Despite this, a parametric extension to the UMAP learning algorithm has not yet been explored. Here, we explore the performance of a parametric extension to UMAP relative to current embedding algorithms and perform several experiments further extending Parametric UMAP to novel applications.3
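To make the batch-wise formulation concrete, below is a minimal TensorFlow/Keras sketch of a single Parametric UMAP training step under assumed shapes and simplified settings ($a = b = 1$, positive edges and negative samples drawn outside this function); it is illustrative rather than the exact training loop of the released implementation.

```python
import tensorflow as tf

# Any Keras model mapping (flattened) data to d-dimensional embeddings.
encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(2),
])
optimizer = tf.keras.optimizers.Adam(1e-3)

def q_similarity(z_a, z_b, a=1.0, b=1.0):
    # Low-dimensional UMAP similarity q = 1 / (1 + a * d^(2b)).
    sq_dist = tf.reduce_sum(tf.square(z_a - z_b), axis=1)
    return 1.0 / (1.0 + a * tf.pow(sq_dist, b))

@tf.function
def train_step(x_to, x_from, x_neg):
    """One SGD step over a minibatch of positive edges (x_to, x_from) and
    negative samples x_neg: the UMAP cross-entropy is minimized over the
    encoder's weights rather than over free embedding coordinates."""
    with tf.GradientTape() as tape:
        z_to, z_from, z_neg = encoder(x_to), encoder(x_from), encoder(x_neg)
        attraction = -tf.math.log(q_similarity(z_to, z_from) + 1e-4)
        repulsion = -tf.math.log(1.0 - q_similarity(z_to, z_neg) + 1e-4)
        loss = tf.reduce_mean(attraction) + tf.reduce_mean(repulsion)
    grads = tape.gradient(loss, encoder.trainable_variables)
    optimizer.apply_gradients(zip(grads, encoder.trainable_variables))
    return loss
```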
3 Related Work
Beyond Parametric t-SNE and Parametric UMAP, a number of recent parametric dimensionality reduction algorithms use structure-preserving constraints but were not compared here. This work is relevant to ours and is mentioned here to clarify the current state of parametric, topologically motivated, and structure-preserving dimensionality reduction algorithms.
Moor et al. (2020; topological autoencoders) and Hofer, Kwitt, Niethammer, and Dixit (2019; connectivity-optimized representation learning) apply an additional topological structure-preserving loss using persistent homology over minibatches to the latent space of an autoencoder. Jia, Sun, Gao, Song, and Shi (2015; Laplacian autoencoders) similarly define an autoencoder with a local structure-preserving regularization. Mishne, Shaham, Cloninger, and Cohen (2019; Diffusion Nets) define an autoencoder extension based on diffusion maps that constrains the latent space of the autoencoder. Ding et al. (2018; scvis) and Graving and Couzin (2020; VAE-SNE) describe VAE-derived dimensionality reduction algorithms based on the ELBO objective. Duque, Morin, Wolf, and Moon (2020; geometry-regularized autoencoders) regularize an autoencoder with the PHATE (potential of heat-diffusion for affinity-based trajectory embedding) embedding algorithm (Moon et al., 2019). Szubert et al. (2019; ivis) and Robinson (2020; differential embedding networks) make use of Siamese neural network architectures with structure-preserving loss functions to learn embeddings. Pai, Talmon, Bronstein, and Kimmel (2019; DIMAL) similarly uses Siamese networks constrained to preserve geodesic distances for dimensionality reduction. Several of these parametric approaches indirectly condition neural networks (e.g., autoencoders) on nonparametric embeddings rather than directly on the loss of the algorithm, which can be applied to arbitrary embedding algorithms. We contrast indirect and direct parametric embeddings in section 5.6.
4 UMAP as a Regularization
4.1 Autoencoding with UMAP
Autoencoders (AEs) are by themselves a powerful dimensionality reduction method (Hinton & Salakhutdinov, 2006). Thus, combining them with UMAP may yield additional benefits in capturing latent structure. We used an autoencoder as an additional regularization to Parametric UMAP (see Figure 2C). A UMAP/AE hybrid is simply the combination of the UMAP loss and a reconstruction loss, both applied over the network. VAEs have similarly been used in conjunction with Parametric t-SNE to capture structure in animal behavioral data (Graving & Couzin, 2020); because t-SNE likewise emphasizes local structure, combining it with AEs aids in capturing more global structure over the data set (Graving & Couzin, 2020; van der Maaten & Hinton, 2008).
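A hypothetical sketch of the hybrid objective, assuming encoder and decoder Keras models and edge minibatches as in the earlier training-step sketch; the reconstruction term here is a simple mean squared error, and the loss weighting is illustrative.

```python
import tensorflow as tf

def hybrid_loss(x_to, x_from, x_neg, encoder, decoder, recon_weight=1.0):
    """Hypothetical UMAP/AE objective: the UMAP edge cross-entropy (as in
    the training-step sketch above) plus a mean-squared reconstruction
    error from the decoder, computed over the same minibatch."""
    z_to, z_from, z_neg = encoder(x_to), encoder(x_from), encoder(x_neg)
    q_pos = 1.0 / (1.0 + tf.reduce_sum(tf.square(z_to - z_from), axis=1))
    q_neg = 1.0 / (1.0 + tf.reduce_sum(tf.square(z_to - z_neg), axis=1))
    umap_term = (-tf.reduce_mean(tf.math.log(q_pos + 1e-4))
                 - tf.reduce_mean(tf.math.log(1.0 - q_neg + 1e-4)))
    recon_term = tf.reduce_mean(tf.square(decoder(z_to) - x_to))
    return umap_term + recon_weight * recon_term
```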
4.2 Semisupervised Learning
Parametric UMAP can be used to regularize supervised classifier networks, training the network on a combination of labeled data with the classifier loss and unlabeled data with UMAP loss (see Figure 2D). Semisupervised learning refers to the use of unlabeled data to jointly learn the structure of a data set while labeled data are used to optimize the supervised objective function, such as classifying images. Here, we explore how UMAP can be jointly trained as an objective function in a deep neural network alongside a classifier.
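A hypothetical sketch of the joint objective, assuming a shared encoder, a classifier head, and a UMAP edge loss function like the one sketched earlier; the names and loss weighting are illustrative rather than those of our implementation.

```python
import tensorflow as tf

cce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def semisupervised_loss(x_lab, y_lab, x_to, x_from, x_neg,
                        encoder, classifier_head, umap_loss_fn,
                        umap_weight=1.0):
    """Hypothetical joint objective: labeled examples pass through the
    shared encoder and a classifier head, contributing a classification
    cross-entropy; unlabeled examples, sampled as UMAP graph edges,
    contribute the UMAP loss (umap_loss_fn, e.g., the edge cross-entropy
    sketched earlier), regularizing the shared encoder."""
    class_term = cce(y_lab, classifier_head(encoder(x_lab)))
    umap_term = umap_loss_fn(encoder(x_to), encoder(x_from), encoder(x_neg))
    return class_term + umap_weight * umap_term
```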
4.3 Preserving Global Structure
An open issue in dimensionality reduction is how to balance local and global structure preservation (Becht et al., 2019; De Silva & Tenenbaum, 2003; Kobak & Linderman, 2021). Algorithms that rely on sparse nearest neighbor graphs like UMAP focus on capturing the local structure present between points and their nearest neighbors, while global algorithms, like multidimensional scaling (MDS), attempt to preserve all relationships during embedding. Local algorithms are both computationally more efficient and capture structure that is lost in global algorithms (e.g., the clusters corresponding to digits found when projecting MNIST with UMAP). While local structure preservation captures more application-relevant structure in many data sets, the ability to additionally capture global structure is still often desirable. The approach used by nonparametric t-SNE and UMAP is to initialize embeddings with a global structure-preserving method such as PCA or Laplacian eigenmaps. In Parametric UMAP, we explore a different tactic, imposing global structure by jointly training on a global structure preservation loss directly.
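A sketch of such a global structure preservation term, assuming it is computed batch-wise as the (negative) Pearson correlation between pairwise distances in data space and in embedding space; the exact formulation used in our experiments may differ in detail.

```python
import tensorflow as tf

def pairwise_distances(x):
    # Euclidean distances between all rows of a 2D batch.
    sq = tf.reduce_sum(tf.square(x), axis=1, keepdims=True)
    d2 = sq - 2.0 * tf.matmul(x, x, transpose_b=True) + tf.transpose(sq)
    return tf.sqrt(tf.maximum(d2, 1e-12))

def global_correlation_loss(x_batch, z_batch):
    """Negative Pearson correlation between pairwise distances in data
    space and in embedding space over a minibatch; weighting this term
    against the UMAP loss trades off local versus global structure."""
    x_flat = tf.reshape(x_batch, [tf.shape(x_batch)[0], -1])  # flatten, e.g., images
    dx = tf.reshape(pairwise_distances(x_flat), [-1])
    dz = tf.reshape(pairwise_distances(z_batch), [-1])
    dx_c = dx - tf.reduce_mean(dx)
    dz_c = dz - tf.reduce_mean(dz)
    return -tf.reduce_sum(dx_c * dz_c) / (tf.norm(dx_c) * tf.norm(dz_c) + 1e-12)
```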
5 Experiments
Experiments were performed comparing Parametric UMAP and a UMAP/AE hybrid to several baselines: nonparametric UMAP, nonparametric t-SNE (FIt-SNE; Linderman et al., 2019; Poličar, Stražar, & Zupan, 2019), Parametric t-SNE, an AE, a VAE, and PCA projections. As additional baselines, we compared PHATE (nonparametric), scvis (parametric), and ivis (parametric), which we described in section 3. We also compared a second nonparametric UMAP implementation that shares the same underlying code as Parametric UMAP but optimizes over embeddings directly rather than over neural network weights. This comparison provides a bridge between the UMAP-learn implementation and Parametric UMAP, controlling for any potential implementation differences. Parametric t-SNE, Parametric UMAP, the AE, the VAE, and the UMAP/AE hybrid use the same neural network architectures and optimizers within each data set (described in the supplemental materials).
We used the common machine learning benchmark data sets MNIST, FMNIST, and CIFAR-10 alongside two real-world data sets in areas where UMAP has proven a useful tool for dimensionality reduction: a single-cell retinal transcriptome data set (Macosko et al., 2015) and a bioacoustic data set of Cassin's vireo song, recorded in the Sierra Nevada mountains (Hedley, 2016a, 2016b).
5.1 Embeddings
5.1.1 Trustworthiness
5.1.2 Area under the Curve (AUC) of $R_{NX}(K)$
To compare embeddings across scales (both local and global neighborhoods), we computed the AUC of $R_{NX}(K)$ for each embedding (Lee, Peluffo-Ordóñez, & Verleysen, 2015), which captures the agreement across K-ary neighborhoods while weighting nearest neighbors as more important than farther neighbors. In 2D, we find that Parametric and nonparametric UMAP perform similarly, while t-SNE has the highest AUC. At 64D, Parametric and nonparametric UMAP again perform similarly, with PCA having the highest AUC.
5.1.3 KNN-Classifier
A KNN classifier is used as a baseline to measure supervised classification accuracy based on local relationships in embeddings. We find that KNN-classifier performance largely reflects trustworthiness (see Figure 5, Supplementary Figures 3 and 4, and Supplementary Tables 4 and 5). In 2D, we observe broadly similar performance between UMAP and t-SNE variants, each of which is substantially better than the PCA, AE, or VAE projections. At 64 dimensions, UMAP projections perform similarly but in some data sets (FMNIST, CIFAR-10) slightly underperform PCA, AE, VAE, and Parametric t-SNE.
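An illustrative evaluation sketch with synthetic stand-ins for the embeddings `Z` and labels `y`; in practice, `Z` would be each method's 2D or 64D projection of the benchmark data, and the number of neighbors is an assumption.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-ins: `Z` plays the role of a method's embeddings and
# `y` the ground-truth class labels.
rng = np.random.default_rng(0)
Z = rng.normal(size=(1000, 2))
y = rng.integers(0, 10, size=1000)

knn = KNeighborsClassifier(n_neighbors=10)
scores = cross_val_score(knn, Z, y, cv=5)
print(f"k-NN accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```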
5.1.4 Silhouette Score
Silhouette score measures how well clustered a set of embeddings is given ground-truth labels. In 2D, across data sets, we tend to see a better silhouette score for UMAP and Parametric UMAP projections than for t-SNE and Parametric t-SNE, all of which are more clustered than PCA in every case except CIFAR-10, which shows little difference from PCA (see Figure 5, Supplementary Figure 5, and Supplementary Table 5). The clustering of each data set can also be observed in Figure 4, where t-SNE and Parametric t-SNE are more spread out within clusters than UMAP. In 64D projections, we find the silhouette score of Parametric t-SNE is near or below that of PCA, which is in turn lower than that of UMAP-based methods. We note, however, that the poor performance of Parametric t-SNE may reflect setting the degrees of freedom ($\alpha$) of the Student-t distribution at $d-1$, which is only one of three parameterization schemes that van der Maaten (2009) suggests. A learned degrees-of-freedom parameter might improve performance for Parametric t-SNE at higher dimensions.
5.1.5 Clustering
5.2 Training and Embedding Speed
5.2.1 Training Speed
Optimization in nonparametric UMAP is not influenced by the dimensionality of the original data set; the dimensionality comes into play only in computing the nearest-neighbors graph. In contrast, training speeds for Parametric UMAP are variable based on the dimensionality of data and the architecture of the neural network used. The dimensionality of the embedding does not have a substantial effect on speed. In Figure 6, we show the cross-entropy loss over time for Parametric and nonparametric UMAP, for the MNIST, Fashion MNIST, and Retina data sets. Across each data set, we find that nonparametric UMAP reaches a lower loss more quickly than Parametric UMAP but that Parametric UMAP reaches a similar cross-entropy within an order of magnitude of time. Thus, Parametric UMAP can train more slowly than nonparametric UMAP, but training times remain within a similar range, making Parametric UMAP a reasonable alternative to nonparametric UMAP in terms of training time.
5.2.2 Embedding and Reconstruction Speed
A parametric mapping allows embeddings to be inferred directly from data, resulting in faster embedding than nonparametric methods. Embedding speed is especially important in signal processing paradigms where near-real-time embedding is necessary. For example, in brain-machine interfacing, bioacoustics, and computational ethology, fast embedding methods like PCA are needed for real-time analyses and manipulations, and deep neural networks are increasingly being used for this purpose (Brown & De Bivort, 2018; Pandarinath et al., 2018; Sainburg, Thielk, & Gentner, 2019). Here, we compare the embedding speed of a held-out test sample for each data set, as well as the speed of reconstruction of the same held-out test samples.
5.3 Capturing Additional Global Structure in Data
Broadly, with Parametric UMAP, we can observe the trade-off between captured global and local structure as the weight of the global correlation loss varies (light blue line in each panel of Figure 9). We observe that adding this loss can increase the amount of global structure captured while preserving much of the local structure, as indicated by the distance to the top-right corner of each panel in Figure 9, which reflects the simultaneous capture of global and local relationships relative to each other embedding algorithm.
5.4 Autoencoding with UMAP
The ability to reconstruct data from embeddings can aid in understanding the structure of nonlinear embeddings and allows for manipulation and synthesis of data based on the learned features of the data set. We compared reconstruction accuracy across each method that has inverse-transform capabilities, as well as the reconstruction speed of the neural-network-based implementations against the nonparametric implementations and PCA. In addition, we performed latent algebra on Parametric UMAP embeddings both with and without an autoencoder regularization and found that reconstructed data can be linearly manipulated in complex feature space.
5.4.1 Reconstruction Accuracy
5.4.2 Latent Features
Previous work shows that parametric embedding algorithms such as AEs (e.g., variational autoencoders) linearize complex data features in latent space—for example, the presence of a pair of sunglasses in pictures of faces (Radford, Metz, & Chintala, 2015; Sainburg et al., 2018; White, 2016). Here, we performed latent-space algebra and reconstructed manipulations on Parametric UMAP latent space to explore whether UMAP does the same.
We find that complex latent features are linearized in latent space when the network is trained with UMAP loss alone as well as when the network is trained with AE loss. For example, in the third set of images in Figure 10, a pair of glasses can be added or removed from the projected image by adding or subtracting its corresponding latent vector.
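A hypothetical sketch of this latent-space manipulation, with random arrays standing in for the embeddings of images with and without glasses; reconstruction assumes access to a decoder such as the UMAP/AE hybrid's.

```python
import numpy as np

def attribute_vector(z_with, z_without):
    """Difference between the mean embedding of examples that have an
    attribute (e.g., glasses) and the mean embedding of examples that do
    not; adding this vector to another embedding and decoding it adds
    the attribute to the reconstructed output."""
    return z_with.mean(axis=0) - z_without.mean(axis=0)

# Synthetic stand-ins for embeddings of images with / without glasses.
rng = np.random.default_rng(0)
z_glasses = rng.normal(size=(50, 64))
z_plain = rng.normal(size=(60, 64))
z_query = rng.normal(size=(1, 64))

z_manipulated = z_query + attribute_vector(z_glasses, z_plain)
# x_manipulated = decoder.predict(z_manipulated)  # decode back to data space (decoder assumed)
```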
5.5 Semisupervised Learning
Real-world data sets often comprise a small amount of labeled data and a large amount of unlabeled data. Semisupervised learning (SSL) uses the unlabeled data to learn the structure of the data set while the labeled data are used to optimize the supervised objective function, such as classifying images. Current state-of-the-art approaches in many areas of supervised learning, such as computer vision, rely on deep neural networks, and semisupervised learning approaches accordingly modify supervised networks with a structure-learning loss over unlabeled data. Parametric UMAP, being a neural network that learns structure from unlabeled data, is well suited to semisupervised applications. Here, we determine the efficacy of UMAP for semisupervised learning by comparing a neural network jointly trained on classification and UMAP (see Figure 2D) with a network trained on classification alone, using data sets with varying numbers of labeled examples.
We compared data sets ranging from highly structured (MNIST) to unstructured (CIFAR-10) under UMAP with a naive distance metric in data space (e.g., Euclidean distance over images). For the image data sets, we used a deep convolutional neural network (CNN) based on the CNN13 architecture commonly used in SSL (Oliver, Odena, Raffel, Cubuk, & Goodfellow, 2018), which achieves relatively high classification accuracy when fully supervised (see Supplementary Table 8). For the birdsong data set, we used a BLSTM network, and for the retina data set, we used a densely connected network.
5.5.1 Naive UMAP Embedding
5.5.2 Consistency Regularization and Learned Invariance Using Data Augmentation
5.5.3 Learning a Categorically Relevant UMAP Metric Using a Supervised Network
We find that in all three data sets, without augmentation, the addition of the learned UMAP loss confers little to no improvement in classification accuracy over the data (see Figure 13, right, and Supplementary Table 8). When we look at nonparametric projections of the graph over latent activations, we see that the learned graph largely conforms to the network's categorical decision making (e.g., Figure 14 predictions versus ground truth). In contrast, with augmentation, the addition of the UMAP loss improves performance in each data set, including CIFAR-10. This contrast in improvement demonstrates that training the network to learn a distribution in a categorically relevant space that is already intrinsic to the network does not confer any additional information that the network can use in classification. Training the network to be invariant toward augmentations in the data, however, does aid in regularizing the classifier, more in line with directly training the network on consistency in classifications (Sajjadi et al., 2016).
5.6 Comparisons with Indirect Parametric Embeddings
In principle, any embedding technique can be implemented parametrically by training a parametric model (e.g., a neural network) to predict embeddings from the original high-dimensional data (as in Duque et al., 2020). However, such a parametric embedding is limited in comparison to directly optimizing the algorithm's loss function. Parametric UMAP optimizes directly over the structure of the graph with respect to the architecture of the network as well as additional constraints (e.g., additional losses). In contrast, training a neural network to predict nonparametric embeddings does not take additional constraints into account.
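A sketch of this indirect strategy for contrast, using umap-learn to produce fixed target coordinates and a small Keras regressor trained with mean squared error; the data are random stand-ins and the architecture is illustrative.

```python
import numpy as np
import tensorflow as tf
import umap  # umap-learn

# The 'indirect' baseline: compute a nonparametric UMAP embedding, then
# regress a neural network onto those fixed coordinates. The network can
# only imitate the precomputed embedding; it cannot incorporate extra
# losses or constraints the way a directly optimized Parametric UMAP
# model can.
X = np.random.RandomState(0).normal(size=(2000, 50)).astype(np.float32)
Z_target = umap.UMAP(n_components=2).fit_transform(X)

regressor = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(2),
])
regressor.compile(optimizer="adam", loss="mse")
regressor.fit(X, Z_target, epochs=10, batch_size=256, verbose=0)
```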
6 Discussion
In this letter, we propose a novel parametric extension to UMAP. This parametric form of UMAP produces embeddings similar to those of nonparametric UMAP, with the added benefit of a learned mapping between data space and embedding space. We demonstrated the utility of this learned mapping on several downstream tasks. We showed that parametric relationships can be used to improve inference times for embeddings and reconstructions by orders of magnitude while maintaining similar embedding quality to nonparametric UMAP. Combined with a global structure preservation loss, Parametric UMAP captures additional global relationships in data, outperforming methods where global structure is only imposed upon initialization (e.g., initializing with PCA embeddings). Combined with an autoencoder, UMAP improves reconstruction quality and allows for the reconstruction of high-dimensional UMAP projections. We also showed that Parametric UMAP projections linearize complex features in latent space. Parametric UMAP can be used for semisupervised learning, improving classification accuracy on data sets where only small numbers of labeled exemplars are available. We showed that UMAP loss applied to a classifier improves semisupervised learning in real-world cases where UMAP projections carry categorically relevant information (such as stereotyped birdsongs or single-cell transcriptomes), but not in cases where categorically relevant structure is not present (such as CIFAR-10). We devised two downstream approaches based around learned categorically relevant distances and consistency regularization that show improvements on these more complex data sets. Parametric embedding also makes UMAP feasible in fields where dimensionality reduction of continuously generated signals plays an important role in real-time analysis and experimental control.
A number of future directions and extensions to our approach have the potential to further improve upon our results in dimensionality reduction and its various applications. For example, to improve global structure preservation, we jointly optimized over the Pearson correlation between data and embeddings. Using notions of global structure beyond pairwise distances in data space (such as global UMAP relationships or higher-dimensional simplices) may capture additional structure in data. Similarly, one approach we used to improve classifier accuracy relied on obtaining a categorically relevant metric, defined as the Euclidean distance between activation states of the final layer of a classifier. Recent work (e.g., as discussed and proposed in Schulz, Hinder, & Hammer, 2019) has explored methods for more directly capturing class information in the computation of distance, such as using the Fisher metric to capture category- and decision-relevant structure in classifier networks. Similar metrics may further improve semisupervised classification with Parametric UMAP.
Notes
Google Colab walkthrough.
UMAP requires substantially fewer nearest neighbors than t-SNE: t-SNE generally requires three times the perplexity hyperparameter (defaulted to 30 here), whereas UMAP computes only 15 neighbors by default, making graph construction less computationally costly.
See code implementations: Experiments: https://github.com/timsainb/ParametricUMAP_paper. Python package: https://github.com/lmcinnes/umap.
Where possible. In contrast with UMAP, Parametric UMAP, and Parametric t-SNE, Barnes-Hut t-SNE can only embed in two or three dimensions (van der Maaten, 2014), and while FIt-SNE can in principle scale to higher dimensions (Linderman et al., 2019), embedding in more than two dimensions is unsupported in both the official implementation (KlugerLab, 2020) and openTSNE (Poličar et al., 2019).
Acknowledgments
This work was supported by NIH 5T32MH020002-20 to T.S. and 5R01DC018055-02 to T.G. We also thank Kyle McDonald for making available his translation of Parametric t-SNE to Tensorflow/Keras, which we used as a basis for our own implementation.