Abstract
Invoking the manifold assumption in machine learning requires knowledge of the manifold's geometry and dimension, and theory dictates how many samples are required. However, in most applications, the data are limited, sampling may not be uniform, and the manifold's properties are unknown; this implies that neighborhoods must adapt to the local structure. We introduce an algorithm for inferring adaptive neighborhoods for data given by a similarity kernel. Starting with a locally conservative neighborhood (Gabriel) graph, we sparsify it iteratively according to a weighted counterpart. In each step, a linear program yields minimal neighborhoods globally, and a volumetric statistic reveals neighbor outliers likely to violate manifold geometry. We apply our adaptive neighborhoods to nonlinear dimensionality reduction, geodesic computation, and dimension estimation. A comparison against standard algorithms using, for example, k-nearest neighbors, demonstrates the usefulness of our approach.
1 Introduction
A starting point for many algorithms in data science—from clustering to manifold inference—is knowing the neighbor relationships among data points. Clustering, for example, often begins with a “k-nearest neighbor graph,” while manifold inference involves a kernel, that is, a measure of similarity between data points. In the first case, the neighborhoods are local and discrete; in the second, they are global and continuous, with concentration of influence controlled by the kernel bandwidth, or scale. Such neighbor relationships are fundamental to defining a topology. Moreover, dimensionality may be estimated based on the rate of change in the density of points within a ball, that is, within a neighborhood, with respect to its radius. All of this is helped by having a large number of data points, a requirement that grows with dimensionality; asymptotic analysis is therefore often favored by theoreticians.
In practice, we rarely have enough data points to satisfy asymptotic bounds. Nor are we given the precise number of neighbors, k, that each point should have. We often make the manifold assumption—that the data points are drawn randomly from (or near) a manifold—but rarely try to assess the basic properties of the manifold assumed by theorists: its dimensionality, sampling density, curvature, medial axis, or reach (defined in the next section). All of these could influence k.
Instead, we rely on different visualization algorithms, such as Isomap, diffusion maps, t-SNE, and many others, to find a pleasing organization of the data. This is dangerous, of course, because these algorithms have free parameters. In particular, and central to this article, most require specifying the number of neighbors, k (or its equivalent): changing k or other parameters changes the result. Unless one knows the answer, one is caught in a conundrum: imposing a prior belief amounts to “fixing” the solution (examples of changing k are shown later in the article).
We present an algorithm to estimate an effective neighborhood—the immediate neighbors, or scale of a similarity kernel—around each point. We seek to identify those nearest neighbors that are “correct” in the sense that they support dimensionality and volume estimates, and manifold inference in general, without covering holes or filling in concavities. It is inspired by the philosophical position that views discrete and continuous mathematics as “two sides of the same,” as argued by Lovász (2010), and iterates between them.
Our algorithm builds from a conservative initial estimate of neighbors (based on a discrete construct, the Gabriel graph) toward a refined one, based on continuous estimates from a multiscale gaussian kernel. The discrete and continuous volume estimates must be consistent, however, and this provides the glue for our iteration. Since not all of the initial putative neighbors may actually be closest neighbors, neighbors that violate the volume relationship are pruned, and the process repeats until the two perspectives agree. Our algorithm thus can be considered an iterative graph sparsification.
Technically, it involves two different graphs: a discrete one, which links only putative nearest neighbors (pairs of points defining the diameter of an otherwise empty ball), and a weighted one, structured by a multiscale gaussian kernel, whose individual scales must cover the neighborhood given in the discrete graph. Keeping the two graphs consistent is another way to think about our iteration. Each resulting graph can be applied to many different algorithms for data visualization, dimensionality reduction, and manifold inference.
An overview of the article is as follows. In the next section, we review the background in some detail, covering both the zoo of similarity kernels that exist, plus several relevant notions, such as the reach of a manifold, that are well studied in the theory literature. The discussion is organized to emphasize the centrality of scale, or neighborhood, in all of the references. In section 3, we provide an overview of our algorithm. It includes a brief sketch of both graphs we work with, plus the connection back to manifolds. Pseudocode for the algorithm is given in algorithm 1, which also includes pointers to where each of its steps is developed.
We then expand on the algorithm. In section 3.3, we study the Gabriel graph and putative neighbors. Two features are emphasized: scale-free neighborhoods and the relationship between node degree and local dimensionality. A structural criterion is revealed, showing how putative edges between neighbors fill “volumes” that block others from being neighbors. This graph serves as an initialization. It is then refined iteratively in several steps. First, continuous kernel scales are computed based on the discrete, putative neighbors. A linear program relaxation bridges local scales to a global cover, in which each node's weighted degree is comparable to the number of its neighbors. In other words, each neighborhood radius should not cover too many outside points. If it does, then it indicates that the neighborhood itself should be refined. That is, some putative scales are likely wrong, in the sense that their neighborhood contains an extreme outlier. This leads directly to a volumetric statistic (see section 3.5.1), and to a pruning technique for sparsifying edges from the discrete graph. The process iterates until there are no more outliers.
In section 4, we evaluate the results for estimating manifold low-dimensional embeddings, geodesics, and local intrinsic dimensionality. Comparisons against popular algorithms, such as UMAP and t-SNE, illustrate the power of the approach. In the end, we demonstrate that it is possible to infer data-driven local neighborhoods that remain consistent with geometric and topological properties of manifolds.
Code for our algorithm will be available at github.com/dyballa/IAN.
2 Background
Manifold learning is a vast area of machine learning where high-dimensional data are analyzed based on the assumption that they were sampled from a low-dimensional manifold, M (Fefferman, Mitter, & Narayanan, 2016), in which case geodesic distances over M provide a better description of the relationships between data points than Euclidean distances in ambient space (Belkin & Niyogi, 2004). The manifold assumption finds applications in nonlinear dimensionality reduction (van der Maaten, Postma, & van den Herik, 2009), denoising (Hein & Maier, 2006), interpolation (Bregler & Omohundro, 1994), dimensionality estimation (Camastra & Staiano, 2016), computational geometry (Crane, Weischedel, & Wardetzky, 2013), and more.
Since M locally resembles Euclidean space, it is standard to use a similarity kernel to define (possibly weighted) neighborhoods around each point in terms of other points. This naturally leads to a graph having data points as nodes and similarity values as edge weights. Then, by computing the graph Laplacian, one can apply a variety of methods from spectral graph theory (see, e.g., Spielman, 2012). Formal analysis involves the limit as the number of data points grows large; the practical success of such methods depends on how well graph neighborhoods capture the topology and geometry of M.
We here review the many approaches to specifying a similarity kernel or a local neighborhood. Let M be a d-dimensional manifold in ambient space ℝ^m. When only pairwise distances are known, an intuitive approach is to define the neighbors of a point, x_i, as those within a certain distance threshold, or, equivalently, inside an m-dimensional ball around x_i. A kernel function assumes the role of this ball by assigning values to neighboring points as a function (discrete or continuous) of how close they are to x_i. The question becomes, What kernel size should be used for each point?
2.1 Similarity Kernels
Consider a set of n points X = {x_1, . . . , x_n} ⊂ ℝ^m. Typically, a symmetric, positive semidefinite similarity kernel (Schölkopf & Smola, 2002) is chosen to determine weighted connections between data points based on the ambient Euclidean distances between them. For each pair of data points x_i, x_j, it returns a number between 0 and 1, which determines how close, or strongly connected, they are. This effectively defines a neighborhood around each point.
2.1.1 Discrete Kernels
2.1.2 Continuous Kernels: Global Scale
One would like the global scale parameter to be just large enough to capture local manifold patches. There are several heuristics for finding such a scale: the median of all pairwise distances in X (or another percentile), the mean (or median) of the distances to each point's kth nearest neighbor (Lafon, 2004), or a scalar multiple of the maximal distance from a point to its nearest neighbor in the data (Keller, Coifman, Lafon, & Zucker, 2009). Also common is to choose a scale so that each data point is sufficiently connected to at least one other point (Lafon, Keller, & Coifman, 2006).
2.1.3 Continuous Kernels: Multiscale
A more localized strategy is to use a multiscale kernel, where each point has an individual scale, or bandwidth. Instead of a single, global scale, there are now n parameters. The advantage is that if the scale selection is adequate, the kernel may capture the characteristics of more complex data sets and manifolds that have nonuniform sampling and geometry.
2.1.4 Adaptive Neighborhood Size Methods
Other methods attempt to automatically determine optimal neighborhoods. Most of these are based on determining an optimal k for a k-nearest neighbors (k-NN) graph; this can be done either globally or by selecting a local neighborhood size, k_i, around each point x_i, known as adaptive neighborhood selection (van der Maaten et al., 2009).
Some approaches optimize a global k based on its performance in a specific embedding algorithm. For instance, the method from Samko, Marshall, and Rosin (2006) is tailored to Isomap (Tenenbaum, de Silva, & Langford, 2000), while others (Kouropteva, Okun, & Pietikäinen, 2002; Álvarez-Meza, Valencia-Aguirre, Daza-Santacoloma, & Castellanos-Domínguez, 2011) apply to LLE (Roweis & Saul, 2000). In Álvarez-Meza et al. (2011), a local method is additionally proposed that produces a nearest-neighbor graph with variable k, under the assumption that the manifold is connected.
Others are based on first estimating the local tangent space around each point, then setting k_i to include as neighbors those points that are close to that tangent space. Such methods (Wang, Zhang, & Zha, 2004; Mekuz & Tsotsos, 2006) typically work with positional information for the tangent space computation (usually via SVD).
Also available are methods that are not based on the nearest-neighbors concept. In computational geometry, the idea of refining an initial estimate of connectivity from a simplicial mesh has been used before, usually specific to the case when d = 2 and m = 3, that is, surfaces in 3-D space (Amenta, Bern, & Kamvysselis, 1998; Amenta & Bern, 1999; Bernardini, Mittleman, Rushmeier, Silva, & Taubin, 1999; Belkin, Sun, & Wang, 2008). Other approaches extend this idea to arbitrary dimension (Belkin, Sun, & Wang, 2009; Boissonnat, Guibas, & Oudot, 2009) but still require knowledge of d. Most of the algorithms in this class use point clouds as input, so they can exploit positional information to decide on the appropriate neighborhood/connectivity.
Among the myriad ways of estimating neighborhoods, there is little agreement on which is most successful (see Lindenbaum, Salhov, Yeredor, & Averbuch, 2020, for a review). Before proceeding to our algorithm, then, it is helpful to first understand what makes this such a hard problem. How can it fail, and what requirements must it fulfill in order to properly capture the topology and geometry of M? This brings us to the geometry of manifolds.
2.2 Reach and the Geometry of Manifolds
The neighborhoods implied by a kernel should agree with M, or at least approximate a tubular neighborhood of it. As exemplified in Figure 1, if neighborhoods are too small, the implied manifold may become disconnected (i.e., falsely divided into disjoint submanifolds or clusters; Samko et al., 2006); if too large, they may cause the estimated manifold to self-intersect, collapsing bottlenecks or curved regions, or cause “smoothing” or “folding.” Such shortcomings are well known in the manifold inference literature; while the former case typically occurs due to nonuniform sampling, the latter is mainly caused by an incompatibility between the sampling rate and the reach of M (Federer, 1959; Thäle, 2008). We now expand on these points.
When the reach is positive, it provides a measure of the “local distortion” (Block et al., 2021); the larger it is, the easier inference becomes. Some authors (e.g., Narayanan & Mitter, 2010; Fefferman et al., 2016) assume large reach in order to test the manifold hypothesis and find bounds on the required sample size. In Block et al. (2021), the reach is used when establishing bounds on the quality of an intrinsic dimensionality estimation based on k-nearest neighbors.
Obtaining a good representation of M therefore requires consideration of its reach. In terms of our problem of finding an appropriate kernel, this effectively means that no neighborhood radius should cross the medial axis of M.
Sampling is a further complication and essentially what makes this a hard problem: when it is nonuniform and sparse (common in real-life data sets), it is not always clear whether the space between points constitutes an undersampled piece of M, a hole, or a gap between disjoint submanifolds (cf. Figure 2). The latter two conditions, of course, relate to reach. Narayanan and Mitter (2010) prove that the number of required samples depends polynomially on curvature, exponentially on intrinsic dimension, and linearly on intrinsic volume. Aspects of our algorithm address each of these during the iteration process.
In all such cases, choosing a globally fixed radius is likely to be problematic. While defining neighborhood size based on a fixed number of neighbors, k, can be helpful to deal with nonuniform density (since the neighborhood radius adapts to the local pairwise distances), it is bound to violate the reach if k is too large. It will also be a problem when the intrinsic dimensionality is not constant throughout M, as higher dimensions require exponentially more neighbors.
Mekuz and Tsotsos (2006) point out the lack of a principled way for setting this parameter, which in practice is often tuned empirically based on prior knowledge of the desired output. As put by Wang et al. (2004), the effectiveness of manifold learning algorithms depends on how nearby neighborhoods overlap and on the interplay between the curvature of the manifold and sampling density.
In terms relevant to this article, the neighborhood radius should be smaller than the local feature size but large enough to account for sampling variability and local dimensionality. We propose an iterative approach to developing a kernel, so that it can adapt appropriately to the neighborhood characteristics around each point.
3 The Algorithm
We here overview our algorithm for finding the neighborhood scale around each point in a manner that makes it globally consistent as a covering of the data points. As is common in manifold learning, we start with a pairwise distance matrix, not the points themselves. The first step is to build a graph in which each datum is connected to an appropriate neighborhood containing other data points. This data graph defines a topology; we refer to it as the neighborhood graph.
As we reviewed above, in the discrete case, one might choose k-nearest neighbors, while in the continuous kernel case, there is a bandwidth parameter that effectively defines a “ball of influence” around each point. Scale is the radius of such a ball—a level set of the kernel function that essentially contains those neighbors whose weights are nontrivial. Our goal, then, is to find those scales—or neighborhoods—that support nonlinear dimensionality reduction, geodesic estimation, and, in general, manifold inference from the given pairwise distances. We do not have sampling guarantees, so we will develop a statistic to check whether reach and curvature constraints might be violated.
3.1 Subtleties of Scale
Since scale may not be constant across the data set, we argue that it should be the first property to be inferred from the data. We start by imposing the manifold assumption, but from an empirical perspective. Unlike most theoretical studies, we do not assume the manifold is pure (i.e., that it has constant dimension). In a simple case, the data may be drawn from a union of different manifolds whose dimensions are not known a priori; such data sets have been considered infrequently, although exceptions exist (e.g., Haro, Randall, & Sapiro, 2008; Little et al., 2017).
Second, we do not know the sampling rate, or density. Rather, we build it up, conservatively, with putative nearest neighbors to each data point, by imposing a necessary (but not sufficient) condition. These putative neighbors will be refined, as the algorithm iterates, to achieve sufficiency. While the manifold assumption does imply the existence of local neighborhoods, their size may vary over the data set; we require that the sampling be nearly constant over each of them. In effect, the density of points must be determined locally while respecting the global manifold geometry.
We submit that such situations occur in real data sets and, since the data are fixed, we cannot appeal to knowing the sampling density or the manifold dimensions and reach. Instead, we address the interplay between manifold reach and sampling density pragmatically. Consider, for example, a data set consisting of a three-dimensional ball with a one-dimensional spike attached (cf. Figure 4). Along the spike, the data appear to be 1D; in the ball, 3D. We seek a neighborhood graph that supports these inferences, so “most” points enjoy a neighborhood that agrees with their apparent dimension. At the join (or high-curvature neck), it is unclear. Moving from the spike to the ball suggests that dimension should be increasing; from the ball to the spike, it should be decreasing. For the neighborhood graph, most points along the spike should see about 2 neighbors, and most points in the ball should see about 8 neighbors; the problematic points should see something intermediate. Such results will be shown to follow from our algorithm.
We claim that either of the alternatives is worse; one should not impose an apparent dimensionality (or connectivity in the neighborhood graph) globally. If small numbers of neighbors (appropriate for the spike) are enforced on the ball, then holes are likely to be introduced. Or if too many neighbors are enforced on the spike, it will collapse on itself. Both change the topology drastically (these situations are illustrated later, in Figures 24 and 25).
3.2 Overview of the Algorithm
Let the data set, X, be a sampling of a (possibly nonpure) manifold M, with the dimension of each component denoted by d. It consists of n points in ambient space ℝ^m, where d < m. The manifold may have a boundary, and the number of components is not known a priori.
We work with two graphs: the first unweighted, and the second with edge weights given by a kernel. Our strategy is to begin with a conservative estimate of the unweighted graph and extend it to a global weighted graph that suggests an estimated manifold covering. The validity of this extension is evaluated by a measure of volume in both graphs; an iterative algorithm is used to infer individual local scales σ_i for each point x_i. Before presenting the algorithm, we introduce the two graphs.
Let the unweighted graph be G = (V, E), with |V| = n and adjacency matrix A with entries a_ij ∈ {0, 1}, where to each point x_i is associated a node v_i. We denote its initial estimate by G_0; successive refinements are indicated as G_t until convergence (yielding the final graph, G*).
Given a set of individual point scales σ_i (sometimes collected into the vector σ), we define a second, weighted graph, Ĝ, as the complete graph on all pairs of data points in X. Its weighted adjacency matrix, W, has entries w_ij given by the multiscale kernel in equation 3.1.
While the unweighted graph will be related to nearest neighbors and computational geometry, the weighted graph will be related to spectral methods on manifold inference. In particular, we expect the Laplacian of Ĝ to approximate the Laplace-Beltrami operator on M, subject to the number of data points and their sampling.
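To fix ideas, the following sketch builds such a weighted adjacency matrix from a pairwise-distance matrix and a vector of per-point scales. It assumes the multiscale gaussian form w_ij = exp(−d_ij²/(σ_i σ_j)) as a stand-in for equation 3.1, whose exact normalization may differ.

```python
import numpy as np

def multiscale_kernel_matrix(D, sigma):
    """Weighted adjacency W from pairwise distances D and per-point scales sigma,
    assuming the multiscale gaussian form w_ij = exp(-d_ij**2 / (sigma_i * sigma_j))."""
    D = np.asarray(D, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    W = np.exp(-(D ** 2) / np.outer(sigma, sigma))  # pairwise scale product
    np.fill_diagonal(W, 0.0)                        # no self-loops
    return W
```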
The algorithm is initialized by computing a coarse estimate of G. As described in section 3.3, this is achieved by exploiting the geometry of medial balls between pairs of points to produce a Gabriel graph (Gabriel & Sokal, 1969; Matula & Sokal, 1980). A Gabriel graph is one in which there is an edge between two points x_i and x_j if and only if no other point is as close to the midpoint of the line segment joining them as they are. The main advantages of using a Gabriel graph as a starting point are that (1) it is scale invariant, so a prespecified ε-neighborhood (equation 2.1) is not required; (2) there is no global constant k (node degree can vary); and (3) neighbors are not limited to the k closest points in ambient space. Thus, it allows for connections to “jump across” sampling gaps while keeping the data graph sparse.
However, as described in section 2.2, obtaining a good inference of M amounts to finding reasonable estimates of its reach and local feature size. For that to occur, no edge segment between two points x_i and x_j should cross the medial axis of M. As the examples that follow will show, there are several cases in which the Gabriel graph will violate this. Therefore, additional steps are necessary to refine it. The Gabriel graph provides a necessary condition (all the correct connections are present, but possibly others as well); our refinement moves toward sufficiency.
In order to estimate Ĝ—the weighted counterpart of G—we will use the weights that are obtained by applying a continuous kernel (see equation 3.1) over the points in X. Such a kernel requires scales, or bandwidths, that must be estimated from G. These will be obtained from an optimization procedure that finds the smallest such scales ensuring that all discrete edges have a minimum kernel value as weight. At this point, a weighted graph Ĝ can be obtained from W.
It is now helpful to articulate the geometry more carefully. Figure 5 depicts how the discrete connectivity relates to the manifold geometry. In particular, for a real data set, the few closest points surrounding x_i are the best candidates for “nearest” neighbors—this is all that can be asserted locally. Let x̃_i and x̃_j be the projections of two neighbors x_i and x_j onto M, respectively. Then any point along the geodesic between x̃_i and x̃_j should be closer to no sampled point other than x_i or x_j. By further assuming x_i ≈ x̃_i and x_j ≈ x̃_j, or at least that ‖x_i − x̃_i‖ and ‖x_j − x̃_j‖ are small, the segment between x_i and x_j approximates the geodesic when the curvature between x̃_i and x̃_j is small. Equivalently, the line segment between x_i and x_j lies approximately in the tangent space at μ, where μ is the midpoint between x_i and x_j (see Figure 5). The existence of a geodesic follows from identifying the tangent plane that includes the points with the exponential map of the manifold around them.
Such an “edge-centric” approach connects differential geometry to the underlying graph. This is illustrated in Figure 5, where the kernel values are shown as shading in the tangent plane. Notice how x_i and its neighbor x_j both fall under the bright kernel values; they are very similar (in this measure) to each other. Stated in geometric terms, we assume that the neighbors lie within the injectivity radius around x_i. In fact, we will show (in Figure 14) that the value of a multiscale kernel between two data points is equivalent to that of a rescaled, single-scale kernel centered at the midpoint between those two points.
The optimized scales can be used to evaluate the current approximation and identify the edges in G that are “too expensive,” that is, are likely to violate the local feature size. We proceed by computing successive refinements of both G and Ĝ in an iterative manner until no further change is observed. We then return the final version of the discrete and weighted graphs (denoted by G* and Ĝ*, respectively).
One can view the computation of Ĝ as a relaxation of the discrete connectivity in G. In fact, as we shall see in section 3.5, a relaxation statistic—a volume ratio—will be used to prune discrete edges that produce a poor approximation. More specifically, when a node v_i with degree k_i in G has a volume ratio close to 1, it means v_i has retained approximately the same degree in Ĝ, only continuously spread as a gaussian around it.
Each of these steps is listed in algorithm 1 and will be described in detail. We begin with the discrete connectivity rule (Gabriel graph); then the scale optimization is developed, followed by the edge-pruning step. Figure 6 illustrates the results of our algorithm on data sets for which the Gabriel graph alone cannot infer a good approximation of the manifold connectivity.
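As a schematic, the iteration can be organized as follows. All function names and signatures here are ours; each helper is sketched, under stated assumptions, in the sections indicated in the comments, and for simplicity this sketch re-tunes the covering constant at every iteration.

```python
def iterated_adaptive_neighborhoods(D, threshold_sd=4.5):
    """Schematic of algorithm 1. D is an n-by-n matrix of pairwise distances;
    returns the pruned discrete graph (G*), its weighted counterpart (kernel
    matrix), and the optimized scales."""
    A = gabriel_graph(D)                               # section 3.3: initial graph G_0
    while True:
        delta, sigma = tune_delta(D, A)                # sections 3.4.2 and 3.5.2
        ratios = volume_ratios(D, A, sigma)            # section 3.5.1: per-node statistic
        outliers = outlier_nodes(ratios, threshold_sd) # section 3.5.2: robust threshold
        if len(outliers) == 0:                         # section 3.5.3: convergence
            break
        A = prune_farthest_edges(D, A, outliers)       # drop edge to farthest neighbor
    return A, multiscale_kernel_matrix(D, sigma), sigma
```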
3.3 Neighbors in a Gabriel Graph
We begin by defining, for each point x_i, a set of putative neighboring points using the connectivity rule found in a Gabriel graph (Gabriel & Sokal, 1969; Matula & Sokal, 1980). It directly incorporates the observation that closest neighbors should have no points “between” them.
Two points, x_i and x_j, are Gabriel-nearest neighbors to each other if and only if they both touch the same closed ball, B_ij, that is empty except for x_i and x_j.
Note that B_ij is therefore a medial ball—a ball whose center is a point of the medial axis (with respect to the set of sampled points). Thus, this connectivity criterion can be restated as creating an edge for all those medial balls, and only those, touching exclusively two points (to be clear, if a third point touches B_ij, no edge shall be formed between x_i and x_j). Hence, to each edge is associated a medial ball B_ij centered at the midpoint between x_i and x_j with radius ‖x_i − x_j‖/2 (see Figure 7). This is furthermore equivalent to the following alternative definitions:
Points x_i and x_j are Gabriel-nearest neighbors if and only if any point along the line segment between x_i and x_j has either x_i or x_j (or both) as its only closest point(s) among the sampled points.
In terms of the Voronoi diagram (Fortune, 1995) of X (with the cell around x_i denoted by V_i), x_i and x_j are neighbors when the segment between them crosses a single Voronoi hyperplane (namely, that between the cells V_i and V_j), and the midpoint between x_i and x_j is in V_i ∩ V_j.
As a concrete example (refer to Figure 7), consider two points x_i and x_j at a distance d_ij from each other, with midpoint μ. Assume the region in the manifold between them is uniformly sampled. Now consider the ball centered at μ with radius d_ij/2, therefore touching x_i and x_j. If there are no points in its interior, we say x_i and x_j are nearest neighbors. Conversely, if it contains other points in its interior, under our assumption of uniform density, this means that there is at least one other point “between” x_i and x_j. So we say that x_i and x_j are not nearest neighbors, in the sense that connecting x_i and x_j directly would be “crossing over” that intermediate point; this implies that an edge in the resulting graph would be a poor approximation to a geodesic in M (i.e., if M is “locally uniformly sampled,” the segment between x_i and x_j would be passing outside M). Note that even when the input to the algorithm is solely a distance matrix (i.e., with no position information), this connectivity criterion may still be evaluated by considering the triangle formed by x_i, x_j, and a third point x_k and using Apollonius's theorem to compute the length of the median from x_k to μ (see Figure 7D).
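Since the criterion needs only pairwise distances, a direct, if naive (O(n³)), implementation is possible. The squared-distance test below is the standard Gabriel condition that results from the median-length computation just described.

```python
import numpy as np

def gabriel_graph(D):
    """Boolean adjacency of the Gabriel graph from a pairwise-distance matrix D:
    x_i and x_j are neighbors iff no third point x_k has
    d_ik**2 + d_jk**2 < d_ij**2 (no point inside the diametral ball)."""
    D2 = np.asarray(D, dtype=float) ** 2
    n = D2.shape[0]
    A = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1, n):
            others = np.delete(np.arange(n), [i, j])
            if np.all(D2[i, others] + D2[j, others] >= D2[i, j]):
                A[i, j] = A[j, i] = True
    return A
```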
The Gabriel graph is a subgraph of the Delaunay graph (Dyer, Zhang, & Möller, 2009) and enjoys a number of key properties (Matula & Sokal, 1980). We emphasize that (1) it is scale invariant, that is, there is no prespecified threshold on the diameter of medial balls that can form connections; (2) it is guaranteed to connect points to their true nearest neighbors when M is uniformly sampled as a grid (shown in Figure 9); and (3) it provides a locally adapted neighborhood size, k_i, for each point x_i, based on the local geometry. Crucially, it does not require an initial guess of the number of neighbors, of the intrinsic dimensionality, or of a maximum neighborhood radius.
Nevertheless, the neighborhoods given by the Gabriel graph are not sufficient. We now expand on a few of their properties that will be useful in motivating the rest of the algorithm.
3.3.1 Closing Triangles
A triangle will be formed by edges in a Gabriel graph only when it is acute (see Figure 8A).
3.3.2 Maximum Curvature
3.3.3 Degree Distribution in Gabriel Graphs
We now study the above connectivity rule starting with flat, uniformly sampled manifolds (i.e., “regular grids”) to illustrate how Gabriel graphs naturally adapt to their geometry and dimensionality. As shown in Figure 9A, in such ideal cases, the degree of an interior node in the Gabriel graph agrees with the true number of (literal) nearest neighbors: two for collinear points, four for a square grid, and six for a triangular grid.
Node degree appears to grow with dimension as 2^d, except for the triangular grid (which, in some sense, looks too “nongeneric”). Adding noise (gaussian, with standard deviation equal to half the spacing between neighboring points) supports this conjecture, as the degree then approaches 2^d regardless of the original grid structure. This holds in higher dimensions as well, for both normal and uniform sampling at random (see Figures 9B, 9C, and 12).
The expected number of neighbors in a Gabriel graph approximately follows a distribution centered at 2^d (where d is the intrinsic dimension of the data) for a variety of sampling strategies (see Figure 9C).
Importantly, because Gabriel graphs are inherently scale invariant, this degree distribution is largely independent of sampling density.
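This regularity is easy to probe numerically. The sketch below (reusing gabriel_graph from the earlier sketch) estimates the mean Gabriel degree for points drawn uniformly at random in d dimensions and compares it with 2^d; boundary points bias the mean downward, increasingly so in higher dimensions and for small samples.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def mean_gabriel_degree(d, n=500, seed=0):
    """Mean Gabriel-graph degree for n points sampled uniformly in [0, 1]^d."""
    rng = np.random.default_rng(seed)
    X = rng.random((n, d))
    A = gabriel_graph(squareform(pdist(X)))   # sketched in section 3.3
    return A.sum(axis=1).mean()

for d in range(1, 5):
    print(d, 2 ** d, round(mean_gabriel_degree(d), 2))  # mean degree tracks 2**d
```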
How to explain such remarkable regularity despite the randomness of sampling? A complementary geometric view of the Gabriel graph connectivity rule is illuminating: each edge between data points implies an “occluding hyperplane” that blocks other points from becoming neighbors (see Figure 10). For example, when d = 1, two points necessarily occlude any additional connections and every nonboundary point must have two neighbors. Now, using the diagrams in Figure 11 as reference, we find that when d = 2, on average about 4 points are sufficient to occlude a point from all sides. For d = 3 this number is doubled again, and the expected number of neighbors becomes about 8, revealing the trend. Every additional dimension adds a new coordinate axis along which the previous constraints are duplicated, roughly doubling the average number of directions available from which neighbors can connect. Once such balls are “attached” to x_i, the remaining space is greatly reduced, and so is the probability of drawing a sample point from inside the region enclosed by the hyperplanes.
When the neighbors are regularly spread around x_i, by construction this region is equivalent to a d-dimensional orthoplex (or cross-polytope). A d-orthoplex has 2^d facets (or (d − 1)-faces) and is one of the three finite, regular, convex polytopes that exist in dimensions higher than 4 (the other two being hypercubes and simplices). Naturally, when sampling is not uniform, we should find irregular orthoplexes instead.
While this geometric construction supports our empirical results and implies they should hold in higher dimensions, it also suggests the following:
Our experiments with randomly sampled points in increasing dimension agree with a model in which Gabriel neighbors lie approximately in the facets of an orthoplex.
We later use the additional observation that the dual polytope of a d-orthoplex (a d-hypercube) is obtained by placing a vertex (i.e., a neighbor) in each of its facets.
The Gabriel graph enjoys many attractive properties and provides the starting point for our algorithm. The above arguments show how the space is largely filled by “Gabriel balls” within the manifold, but such balls may also fill space across holes and bottlenecks; curvature must be dealt with. Examples were given in Figure 6, where we showed that Gabriel connections can arise incorrectly and must be removed. To do so, one must “look” in every direction (of the tangent plane) and past immediate neighbors. For this, we now develop the weighted graph counterpart to the Gabriel graph, exploiting the kernel to extend local information globally. This begins to connect the graph construction more directly to manifold properties.
3.4 Multiscale Optimization
We now begin to develop the iteration in algorithm 1, given the initial Gabriel neighborhood graph, G_0. Assuming (temporarily) that this gives correct local neighborhoods, what should the corresponding scales be for a gaussian kernel? In effect this is an extension of G into a weighted counterpart, Ĝ. From Figure 5, this weighted graph is also a type of approximation of (aspects of) the continuous manifold. Because density is not necessarily uniform, different points might have different neighborhood radii, so a multiscale gaussian similarity kernel (see equation 3.1) is used. Each point x_i has its own associated scale, σ_i. To develop the computation of such scales, we now move into the continuous domain and exploit the geometric notion of a cover.
3.4.1 Covering Criterion
We flip this around by using a different but related construction: consider gaussians now centered at the midpoints (i.e., not on data points) to indicate whether nearby points should be connected, not separated (Figure 5 illustrates this construction directly). Furthermore, because we use a multiscale kernel (see equation 3.1), the (nonnormalized) gaussian density becomes a function of both scales, σ_i and σ_j. Hence, we obtain a criterion for what we term δ-connectivity (equation 3.7).
The constant δ plays a role in normalizing for unknown density; it will be developed in section 3.5.2. For now, we illustrate its role in the connection from graphs to manifolds. Figure 13 shows the graph over a set of data points and the local scales obtained (by the algorithm below) for different values of δ. Choosing δ too large yields scales (and therefore gaussians) that are too large, that is, their overlap has peaks. Choosing it too small yields scales that introduce holes. When it is chosen correctly, the gaussians form a covering of the manifold that approximates a partition of unity. Such partitions of unity are used in differential geometry to extend local information (in our case, the scales) to global information (a covering of the manifold).
We say a δ-covering is attained when every pair of neighbors in G is δ-connected (see equation 3.7). Additionally, when the spacing between neighboring points is approximately uniform locally, the pointwise summation over all gaussian kernel bumps given by the individual scales provides an (unnormalized) partition of unity of M.
We now use the covering constraints to solve for the set of scales, σ. It is desirable that the scales be small (respecting the reach) while at the same time maintaining the connectivity in Ĝ close to that of G. Thus, one idea is to find scales such that the sum of edge weights in Ĝ incident to a node v_i from its neighbors in G approximates the degree of v_i in G, for all i, while at the same time ensuring a δ-covering. This, however, amounts to a nonconvex problem in which the cost function involves a summation of multiscale kernel values. We are unable to solve this efficiently. Instead, we find the smallest individual scales such that our covering criterion is satisfied for all edges (a “minimal covering”) and later address the quality of the relaxation by using a statistical pruning (edge sparsification). This can be transformed into a convex, linear program with linear constraints by which all scales can be solved for simultaneously, as we show next. (We also present, in the appendix, a greedy approach to this optimization that may be convenient when dealing with very large data sets.)
3.4.2 Linear Program Relaxation
To achieve a minimal covering, one might minimize the sum of the scales (or, equivalently, the 1-norm of the vector σ, since scales are positive) subject to the covering constraint. This suggests the following optimization problem:
Due to the hyperbolas, this amounts to a nonlinear, nonconvex set of constraints. However, we can convexify the feasible set by considering, for each edge (i, j), the line(s) passing through the hyperbola's vertex (the point at which σ_i = σ_j) and the points where the hyperbola intersects the bounding box. The four possibilities are shown in Figure 15. The feasible region for each edge therefore is bounded by a convex envelope given by such line(s) and those defined by the upper bounds to σ_i and to σ_j. Such envelopes for all edges, combined, define the boundaries of a convex polytope. Note that this convexification is conservative in the sense that only the objective is relaxed—the feasible scales are always at least as large as required by the original nonconvex problem; therefore, our covering requirement is not relaxed. (Because of the presence of a later pruning stage in the algorithm, it is better to overconnect here than to inadvertently disconnect nodes that should otherwise be connected.)
Hence the problem now amounts to a convex, linear program (LP) with linear constraints.
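The sketch below illustrates one way such an LP can be assembled and solved with an off-the-shelf solver. It is not the article's exact formulation: we assume the covering criterion for each edge reduces to a hyperbolic constraint σ_i σ_j ≥ c_ij with c_ij proportional to d_ij² (the constant standing in for δ and the kernel normalization), we take the distance to each point's farthest putative neighbor as the upper bound on its scale, and we handle only the simplest of the four envelope cases of Figure 15.

```python
import numpy as np
from scipy.optimize import linprog

def optimize_scales(D, A, c_factor=1.0):
    """Minimal-covering scales via a linear program (sketch of section 3.4.2).
    Assumptions (ours): each edge (i, j) must satisfy sigma_i * sigma_j >=
    c_factor * d_ij**2; each sigma_i is bounded above by the distance to x_i's
    farthest neighbor in A; only the simplest envelope case is handled."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    u = np.where(A.any(axis=1), (D * A).max(axis=1), D.max())  # upper bounds
    A_ub, b_ub = [], []

    def above_line(i, j, p, q):
        """Constrain (sigma_i, sigma_j) to lie on or above the line through p and q."""
        (x1, y1), (x2, y2) = p, q
        if abs(x2 - x1) < 1e-12:
            return
        m = (y2 - y1) / (x2 - x1)
        row = np.zeros(n)
        row[i], row[j] = m, -1.0            # m*sigma_i - sigma_j <= m*x1 - y1
        A_ub.append(row)
        b_ub.append(m * x1 - y1)

    for i, j in zip(*np.triu_indices(n, k=1)):
        if not A[i, j]:
            continue
        c = c_factor * D[i, j] ** 2          # hyperbola sigma_i * sigma_j = c
        v = np.sqrt(c)                       # its vertex (sigma_i = sigma_j)
        above_line(i, j, (c / u[j], u[j]), (v, v))  # secant: box edge -> vertex
        above_line(i, j, (v, v), (u[i], c / u[i]))  # secant: vertex -> box edge
    res = linprog(c=np.ones(n), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=list(zip(np.full(n, 1e-9), u)), method="highs")
    return res.x
```

Because the secants lie above the hyperbola, the resulting scales are never smaller than the original constraints require, matching the conservative character of the convexification described above.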
3.5 Sparsification
Summarizing what we have seen so far, the Gabriel graph provides an initial estimate of connectivity, while the LP optimization provides minimal scales for a continuous kernel to cover those connections. However, since the initial estimate of the discrete graph might contain incorrect connections, its resulting optimal scales might also be inadequate. An example of this can be seen in Figure 19: initially, two pairs of nodes are connected across the central gap since a Gabriel ball exists between them. This will require very large scales to “cover” these edges. Furthermore, the Gabriel graph is based on a local connectivity rule; however, as illustrated in Figure 16, decisions about connecting nodes across a gap should not be local. We here address both of these issues by introducing a global statistic based on how frequently such a gap occurs in the data. In terms of algorithm 1, we are now at steps 6 and 7.
3.5.1 Volume Ratio
Because incorrect connections can be given by Gabriel balls lying in the free space between parts of a manifold (i.e., across the medial axis), it is tempting to simply prune the longest connections. Note, however, that the size of a scale by itself is not necessarily important. In both examples shown in Figure 6, the nonuniform density causes scale sizes to vary considerably, and even the largest ones are appropriate, that is, they are still consistent with the distances to neighboring points.
Conversely (and importantly), a scale that is excessively large will likely cover “too many” points. That is, it will cover neighbors in excess of the number of discrete neighbors of its corresponding node in G. We quantify this notion by observing that an individual scale, σ_i, should produce kernel values whose sum is comparable to the discrete degree, k_i, of node v_i in G. As will be shown, after proper normalization, this also means σ_i shall relate to a local volume element around x_i, or the inverse of the local density. Since each connection in G can be seen as having unit weight, a gaussian kernel around x_i with scale σ_i should distribute that same amount, k_i, only continuously over ambient space.
We start our derivation with a definition:
An individual-scale gaussian kernel is needed to correctly assess the impact of σ_i on the relaxation from the perspective of v_i alone. The multiscale kernel here might artificially increase the weighted degree of v_i when other nodes (even nonneighbors of v_i!) have incorrect scales. (Nevertheless, as discussed below, a corresponding ratio using the actual weights in Ĝ may eventually be used for convergence purposes.)
Nodes whose degree deviates from exactly 2^d will, likewise, under- or overestimate the local dimension; reasonable volume estimates are still obtained regardless. However, in order to avoid a dimension less than 1 for connected nodes, in practice when k_i = 1, we replace it with 2.
Thus, we expect a volume ratio close to 1 for points whose neighborhoods respect the manifold geometry and whose sampling is locally uniform. Crucially, points for which these conditions are not met (those having “wrong” neighbors in the original Gabriel graph, G_0) will depart from this by having a ratio well above 1. In the next section, we use this fact to guide a sparsification of edges in G based on the volume ratio.
Therefore, the volume ratio can be thought of as the product between kernel scale and local density. When σ_i is optimal, it should be approximately equal to the inverse of the local density, so the ratio is close to 1.
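Because the defining equation is developed above only in words, the following is one plausible instantiation of the statistic, stated purely for concreteness: the mass of an individual-scale gaussian centered at x_i, divided by the (floored) discrete degree k_i. Any multiplicative normalization left unspecified here is absorbed by the tuning of δ described in the next section.

```python
import numpy as np

def volume_ratios(D, A, sigma):
    """Per-node statistic: mass of an individual-scale gaussian at x_i divided
    by the discrete degree k_i (floored at 2, as discussed above). This is a
    plausible stand-in for the article's definition, up to normalization."""
    D = np.asarray(D, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    K = np.exp(-(D ** 2) / sigma[:, None] ** 2)   # row i uses sigma_i only
    np.fill_diagonal(K, 0.0)
    k = np.maximum(A.sum(axis=1), 2)              # discrete degrees, floored at 2
    return K.sum(axis=1) / k
```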
3.5.2 Uniformity of Sampling and Edge Pruning
Since the volume ratio is evaluated for every node v_i, we can collect it across nodes and view it as a statistic. This has two consequences: (1) it can be used to enforce consistency in sampling, and (2) outliers in this statistic are likely candidates for edge pruning. We address consistency of sampling first.
We have several times stated that sampling is required to be locally uniform, although its rate may change over the manifold. Examples of this were shown earlier (e.g., Figure 12, where the sampling was denser in the center of the gaussian distribution than in the periphery). This example differs from the regular grids in which all nearest neighbors had exactly the same distance. Putting this together, we have:
Locally uniform sampling: Let node v_i have k_i neighbors in G. Among these, let d_i^max denote the distance from x_i to its farthest neighbor, and d_i^min that to its nearest neighbor. When d_i^max ≈ d_i^min for all i, we say the sampling is locally uniform.
This is useful because a departure from the assumption that sampling is locally uniform will cause the volume ratio to be on average greater than 1 throughout the data set. To see this, when sampling is not uniform, we have d_i^max > d_i^min. Now, since σ_i is optimized to cover all of v_i's neighbors, it will have in most cases the same order of magnitude as d_i^max (minus some possible slack due to the multiscale interaction). Therefore, the higher the variability in the neighbors' distances, the larger the difference between d_i^max and d_i^min will be, making σ_i, in turn, be larger than the distance to most neighbors of v_i. Ultimately, this will increase the ratio beyond what we would have in a uniform-sampling scenario (in which σ_i ≈ d_i^max ≈ d_i^min).
When data are acquired using a global sampling strategy, this variability in the neighbors' distances should be roughly constant throughout the data set (rather than the distances themselves). So we use the scalar parameter, δ, from equation 3.7 to correct for this “bias” and bring the median of the distribution of volume ratios close to 1.
Let the tuned δ be that which causes the median volume ratio to be closest to 1.
Typically, the tuned δ is less than 1, which, in the scale optimization procedure, means that the covering constraints (equation 3.10) are being relaxed using the distribution of volume ratios as a guide (see Figure 18). Note that although the tuning of δ is not necessary for finding candidates for sparsification, it attributes a quantitative meaning to the value of the ratio, so any value much greater than 1 is guaranteed to indicate the need for edge pruning. Such tuning should be performed at iteration t = 0 and repeated as needed over the iterations whenever the median deviates too much from unity (which may happen after several edges have been pruned). Most commonly, we find that the initial tuning suffices.
Thus, we have a data-driven way of finding an appropriate value for δ. Because it is a global constant applied to all connection constraints, it shifts the distribution of volume ratios to have a median around 1 without changing its general shape.
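A minimal sketch of this tuning, under the assumptions of the earlier LP sketch: the covering constant (standing in for δ) is chosen from a small candidate grid, which is ours, so that the median volume ratio lands closest to 1.

```python
import numpy as np

def tune_delta(D, A, candidates=(0.25, 0.5, 0.75, 1.0)):
    """Pick the covering constant (standing in for delta) whose volume-ratio
    distribution has median closest to 1; returns it with its optimized scales."""
    best = None
    for c in candidates:
        sigma = optimize_scales(D, A, c_factor=c)          # re-solve the LP
        med = np.median(volume_ratios(D, A, sigma))
        if best is None or abs(med - 1.0) < abs(best[0] - 1.0):
            best = (med, c, sigma)
    return best[1], best[2]
```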
This leads us to the second use of our statistic: any node whose normalized volume ratio is much greater than the median of the population should be identified as an outlier. Such nodes will have a neighbor considerably farther than their other neighbors (relative to the median variability of such neighboring distances throughout the data) and are candidates for the sparsification step.
Nodes that are robust outliers according to the statistic have an overly distant neighbor (relative to the other neighbors for that node) and hence are likely to be in violation of reach or other geometric constraints. These relatively distant neighbors are candidates for having an edge pruned.
Given the distribution of normalized volume ratios, statistical models can be used to define a threshold for identifying outliers (see Figures 19 to 22). It is likely that data sets with a large number of problematic connections will exhibit a distribution with a heavy tail or that looks like a mixture of two distributions (see the example in Figure 22), so using the distribution's quartiles may give a more robust result. One option that seems to work particularly well is to use estimates of the sample mean and standard deviation from the quartiles, as in Wan, Wang, Liu, and Tong (2014) (throughout, we make use of the C3 method derived there, setting the threshold to 4.5 standard deviations above the mean thus estimated). Still, we found that results are typically quite invariant to this particular choice, especially in real-life data sets. Finally, we note that our algorithm can be run interactively, so the user can analyze the histogram of the distribution after each iteration to judge whether the choice of threshold is reasonable and thus be confident in the results.
Nodes with a normalized volume ratio above the threshold should have their connection to their farthest neighbor deleted. Ideally, only one such connection is pruned after each iteration; however, should that become impractical with large data sets, a compromise is to limit the pruning, at each iteration, to a single edge from each node that is above the threshold (giving the chance for its value to be updated before the next pruning).
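A sketch of the outlier test and of the pruning compromise just described (one edge per outlying node per iteration). The quartile-based mean and standard-deviation estimates below are common normal-theory approximations standing in for the C3 estimator of Wan et al. (2014); the 4.5-standard-deviation threshold follows the text.

```python
import numpy as np

def outlier_nodes(ratios, threshold_sd=4.5):
    """Nodes whose volume ratio exceeds a quartile-based robust threshold."""
    q1, med, q3 = np.percentile(ratios, [25, 50, 75])
    mean_est = (q1 + med + q3) / 3.0        # quartile estimate of the mean
    sd_est = (q3 - q1) / 1.35               # quartile estimate of the SD
    return np.flatnonzero(ratios > mean_est + threshold_sd * sd_est)

def prune_farthest_edges(D, A, outliers):
    """For each outlying node, delete the edge to its farthest remaining neighbor."""
    A = A.copy()
    for i in outliers:
        nbrs = np.flatnonzero(A[i])
        if nbrs.size:
            j = nbrs[np.argmax(D[i, nbrs])]
            A[i, j] = A[j, i] = False
    return A
```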
3.5.3 Convergence
The algorithm converges at iteration T when no point has an outlying volume ratio (i.e., greater than the statistical threshold). This implies that no edges will be pruned, so the graph remains unchanged, and therefore no further changes can occur to either G or Ĝ. Note that convergence is guaranteed: since at every iteration an edge is removed, the algorithm must necessarily reach an iteration at which all outliers (if there were any to begin with) have been pruned.
A complementary ratio can be defined analogously to equation 3.25 but using the weights from Ĝ directly. Since the multiscale kernel takes into account the interaction of individual scales, the distribution of this ratio will typically be tighter than that of the individual-scale version (i.e., some of the excessively large scales might be compensated by small neighboring scales). Therefore, one may wish to allow for an earlier convergence when there are no remaining outliers in the distribution of this complementary ratio.
Finally, in applications where it is required that G* be connected, pruning can simply be stopped before disconnection. Naturally, Ĝ is always connected up to machine precision or some numerical tolerance.
In closing this section, we return to one of our introductory examples and show, in Figure 23, the resulting graphs for the “Swiss cheese” sampling patterns (from Figure 2). When sampling is too sparse (bottom), there is only so much that can be inferred, and not all holes are free of edges after convergence. As sampling gets denser, however, the algorithm correctly identifies that edges across holes should be pruned (middle). When it is very dense (top), even the initial Gabriel graph is able to correctly infer the true holes.
3.6 Comparison with Other Kernel Methods
We now compare the data graphs obtained using our iterated adaptive neighborhoods (IAN) with those from other popular manifold learning methods. In Figure 24, a synthetic “stingray” data set exhibits a transition of apparent dimension from 2 (body) to 1 (tail), a variation of the scenario explored in Figure 4. Points were uniformly sampled, with 20% deleted at random.
Our converged, unweighted graph, G* (see the top row in Figure 24), can be compared with the traditional k-nearest neighbors graph (bottom row), used in a variety of methods, including Isomap (Tenenbaum et al., 2000). In the latter, when k = 2, the tail exhibits perfect connectivity, but the body is too sparse. If k = 4, the body is more properly connected, but the tail becomes overly connected, and “folding” or “short circuits” start to appear. Finally, for k = 8, the connectivity is inappropriate as the tip of the tail connects directly to the body. In contrast, G* manages to retain a minimally connected tail while covering the body almost everywhere, creating appropriate edges across many of the sampling gaps (compare with the holes that remain in the k-NN graph with k = 4, some of which are present even when k = 8).
Our weighted graph, Ĝ*, can be compared against methods that use a gaussian-like kernel and where each point has an individual scale. Some of these methods were described in section 2.1: t-SNE (van der Maaten & Hinton, 2008), UMAP (McInnes et al., 2018), self-tuning (Zelnik-Manor & Perona, 2004), and variable bandwidth (Berry et al., 2015; Berry & Harlim, 2016); their resulting connectivity can be visualized in Figure 24, where edges have intensity proportional to their weight.
In Figure 25, we visualize the individual scales resulting from each of these methods. Each is represented, around each point x_i, as the level set corresponding to a (single-scale) kernel value of 0.75. At the top, we see that the scales found by our kernel seem to nicely conform to the space between each point and its neighbors. Especially illuminating is what happens along the tail, where scales either “expand” or “shrink” so as to minimally cover the spaces between neighboring points; this illustrates what our scale optimization achieves. Among the other methods, with few exceptions, the scales seem to cover either too much (collapsing the tail on itself) or too little (leaving holes in the body).
The weighted graphs in Figure 24 reveal the result of the interaction between these individual scales (namely, the edge weights). Our Ĝ* (top right) manages to cover almost the entire body with edges, while keeping the tail minimally connected—in fact, resembling the unweighted version in G*, and therefore respecting the original curvature and reach. Other methods, in contrast, have a hard time achieving both things with a global value for k. In t-SNE, the scales over the body are much smaller when k = 4, so its weighted graph looks too sparse; for k = 8, the scales over the tail become too large, and therefore strong edges appear, connecting it to the body. In UMAP, the scales do not grow as much with increasing k, but at k = 4, the body in the weighted graph is still too sparse, while for k = 8, the tail is strongly connected to the body. With the self-tuning kernel, scales seem to grow faster with k, while with variable bandwidth, this growth is somewhat counteracted by the action of their global scale (see equation 2.7). In fact, the graph that most resembles our own is the one using the variable bandwidth kernel with k = 2, the main difference being that the big sampling gap near the tip of the body is poorly connected, while in our case, it is slightly overly connected (due to connections in G* crossing that gap).
4 Applications
We now provide examples of application of our kernel to three different manifold learning tasks: dimensionality reduction, by means of a nonlinear embedding algorithm; geodesic estimation, which typically finds application in computational geometry, vision, and graphics; and local intrinsic dimensionality estimation.
4.1 Low-Dimensional Embeddings
Dimensionality reduction is now ubiquitous in visualization of high-dimensional data. Several methods exist (Saul, Weinberger, Sha, Ham, & Lee, 2006; Lee & Verleysen, 2007; Goldberg & Ritov, 2009; van der Maaten et al., 2009), and most of the nonlinear methods are manifold based (Roweis & Saul, 2000; Tenenbaum et al., 2000; Roweis, Saul, & Hinton, 2001; Hinton & Roweis, 2002; Belkin & Niyogi, 2003; Donoho & Grimes, 2003; Zhang & Zha, 2004; Coifman & Lafon, 2006; Weinberger & Saul, 2006; van der Maaten & Hinton, 2008; Tang, Liu, Zhang, & Mei, 2016; McInnes et al., 2018; Moon et al., 2019). Given a collection of points in high-dimensional space sampled from a low-dimensional manifold M, the goal is to find a good parameterization for the data in terms of intrinsic coordinates over M, which in turn can be used to produce a low-dimensional embedding.
In surveying the literature, it is common to find a heuristic, or a range of values, suggested for choosing the neighborhood size (see section 2.1), but rarely do we see examples of the sensitivity of the results to that choice. In this section, we ran a few of the most popular methods using a wide range of values for the kernel scale parameter, k, and compared their results to those using our own kernel.
4.1.1 Diffusion Maps with Self-Tuning Kernel
Diffusion maps are based on the spectral properties of the random walk matrix (normalized graph Laplacian) over the weighted data graph; integration over all paths in the graph makes diffusion distances, in principle, more robust to “short-circuiting” than graph geodesics. For better comparison with IAN, instead of the standard single-scale gaussian kernel, we use the self-tuning approach of Zelnik-Manor and Perona (2004) from equation 2.6. Our kernel was applied to diffusion maps by directly using Ĝ* as the similarity matrix (weighted adjacency matrix). We use the diffusion map parameters α = 1 and t = 1 (cf. Coifman & Lafon, 2006).
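For reference, a minimal diffusion-map construction that accepts any weighted adjacency matrix (such as the IAN kernel matrix) and uses the same parameter values, α = 1 and t = 1; this is the standard construction rather than a reproduction of our exact experimental code.

```python
import numpy as np

def diffusion_map(W, n_components=2, alpha=1.0, t=1):
    """Diffusion-map coordinates from a weighted adjacency W, with density
    normalization alpha and diffusion time t (cf. Coifman & Lafon, 2006)."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)
    W_a = W / np.outer(d ** alpha, d ** alpha)      # alpha-normalization
    d_a = W_a.sum(axis=1)
    S = W_a / np.sqrt(np.outer(d_a, d_a))           # symmetric conjugate of the walk matrix
    evals, evecs = np.linalg.eigh(S)
    evals, evecs = evals[::-1], evecs[:, ::-1]      # sort eigenpairs in descending order
    psi = evecs / np.sqrt(d_a)[:, None]             # right eigenvectors of the random walk
    return psi[:, 1:n_components + 1] * evals[1:n_components + 1] ** t
```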
With the stingray data set (see Figure 26), we see that the fully extended tail at k = 2 becomes progressively more folded and compressed as k increases. The body appears contracted at k = 2 but expands with larger k. Using our own Ĝ*, although we obtain excellent embeddings of both body and tail (rightmost column), they are represented by separate sets of coordinates (two for the body and a third for the tail), which happens due to the change in dimensionality.
When applying self-tuning to the spiral data set (see Figure 27), only k = 2 and k = 4 were able to prevent folding. The bent plane (see Figure 28) was more tolerant, with good results for all k except k = 64, for which the plane remained folded. When using IAN, a good parameterization was obtained for both data sets.
4.1.2 Variable Bandwidth Diffusion Embedding
We also tested a variant of diffusion maps using the variable bandwidth kernel of Berry et al. (2015), in which a distinct type of multiscale kernel is proposed, along with a specific normalization of the weighted graph Laplacian. Because it computes another global scale based on the individual scales, in order to apply our algorithm to this method we replaced the density estimates (see equation 2.7) with the inverse of our optimal scales. We used parameter values of 0 and −1/2, as recommended in Berry and Harlim (2016); eigenvectors were scaled by the square root of the inverse of their respective eigenvalues (Saerens, Fouss, Yen, & Dupont, 2004; Noé, Banisch, & Clementi, 2016), following the implementation in Banisch, Thiede, and Trstanova (2017).
This method produced good embeddings for the stingray, especially for k = 8 (see Figure 26). For the spiral (see Figure 27), using k = 8 caused some points to drift apart, and although it returned basically the original curve when k = 16 or 32, a spectral algorithm such as this is expected to “unroll” the spiral, finding a good (1D) parameterization of it. The same happened with the bent plane (see Figure 28), which could not be embedded into two coordinates for any choice of k. Using our scales, however, the algorithm managed to find appropriate parameterizations for all three data sets.
4.1.3 Isomap
Isomap applies classical multidimensional scaling (MDS) to geodesic distances computed as shortest paths over a k-nearest neighbors graph (see equation 2.2). Because the graph is unweighted, this method is particularly sensitive to the choice of k. Our kernel was applied to Isomap by directly replacing the k-NN graph with G*.
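The substitution is straightforward; a minimal sketch, assuming a connected neighborhood graph, computes graph geodesics over the given adjacency and then applies classical MDS.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap_from_graph(D, A, n_components=2):
    """Isomap with a precomputed neighborhood graph: geodesic distances are
    shortest paths over the graph (edge lengths from D), followed by classical
    MDS. Assumes the graph given by A is connected."""
    G = np.where(A, D, 0.0)                          # zero entries = no edge
    geo = shortest_path(G, method="D", directed=False)
    n = geo.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n              # centering matrix
    B = -0.5 * J @ (geo ** 2) @ J                    # double-centered Gram matrix
    evals, evecs = np.linalg.eigh(B)
    idx = np.argsort(evals)[::-1][:n_components]     # leading components
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0.0))
```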
With the stingray (see Figure 26), Isomap produced a good embedding with k = 4. The result with k = 2 was completely wrong (an additional tail appears), and with k = 8, the tip of the tail was disconnected. With k = 16 and 32, it essentially returned the original data, without any dimensionality reduction. Our G* improved on the result of k = 4 by making the points in the body more uniformly spread.
4.1.4 t-SNE and UMAP
t-SNE and UMAP are related methods that have gained popularity in recent years (Becht et al., 2019). Both compute similarities between data points using individual scales based on k (section 2.1) and adopt a secondary kernel for computing similarities between embedded points: t-SNE uses a Student t-distribution (Cauchy kernel), while UMAP uses a nonnormalized variant requiring a hyperparameter, min_dist. In t-SNE, embedding coordinates are initialized at random, while UMAP adopts the strategy of refining an initial spectral embedding. Both then optimize their embeddings by running gradient descent on an information-theoretic cost function between similarities in input space versus embedded space: t-SNE minimizes the KL-divergence; UMAP uses a variant of cross-entropy.
Alternative initializations are typically used with t-SNE (e.g., PCA) to improve results (Kobak & Berens, 2019; Linderman, Rachh, Hoskins, Steinerberger, & Kluger, 2019; Kobak & Linderman, 2021); in our experiments, for better comparison with UMAP, we used a spectral embedding initialization computed from its own symmetrized similarity matrix (see equation 2.10). The IAN kernel was applied to t-SNE by replacing the individual scales (see equation 2.8) with those in ; with UMAP, because a different kernel function is used, we directly replaced the weighted graph (with adjacencies given by in equation 2.12) with .
We executed t-SNE assigning the various values to the perplexity parameter, leaving the remaining parameters to their defaults in the scikit-learn implementation (Pedregosa et al., 2011). We used the Barnes-Hut method (van der Maaten, 2014) for the cylinder data set and the “exact” method for all others. In UMAP, the parameter was set to , with remaining parameters using default values (in particular, 0.1). Because of the stochastic nature of both algorithms (even when using a fixed initialization), different runs will produce slightly different results. Therefore, in order to avoid cherry-picking, both algorithms were executed a single time, using the same random seed.
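For reference, a sketch of this setup follows; the parameter names are those of scikit-learn and umap-learn, and the data matrix X and the precomputed spectral initialization init are assumed to be given.

```python
import numpy as np
from sklearn.manifold import TSNE
from umap import UMAP

def embed_both(X, k, init, seed=0):
    """Run t-SNE and UMAP once each with a shared initialization and seed.

    X: (n, D) data; k: neighborhood-size value under test; init: (n, 2)
    precomputed spectral initialization (hypothetical; see the text).
    """
    Y_tsne = TSNE(n_components=2, perplexity=k, init=init,
                  method="exact", random_state=seed).fit_transform(X)
    Y_umap = UMAP(n_neighbors=k, min_dist=0.1, init=init,
                  random_state=seed).fit_transform(X)
    return Y_tsne, Y_umap
```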
Results for the stingray (see Figure 26) were quite analogous between the two algorithms: both produced artificial clustering for k = 8, while for k = 16, the tail began to fuse with the body. The gaps in sampling within the body were accentuated by both algorithms, even at k = 32, where we see a large hole in the UMAP embedding; in t-SNE, the body almost breaks into two pieces (despite the large neighborhood size). This example illustrates how much an embedding algorithm based on attractive versus repulsive forces can exaggerate nonuniform sampling.
The spiral (see Figure 27) was disconnected by t-SNE for all values of k except k = 8. UMAP produced reasonable results for k between 4 and 8; for k = 2, however, a multitude of clusters was obtained, and for k = 16, the curve twisted over itself. Using our kernel (right column) produced a connected, non-self-intersecting curve. Neither algorithm, however, was capable of returning a good arc-length parameterization of the spiral.
With the bent plane (see Figure 28), although both algorithms succeeded in unfolding it, t-SNE produced a fully two-dimensional plane (with no gaps) only when setting k = 32 (not shown) or k = 64, while UMAP required k = 16. Both gave reasonable results using our kernel.
4.1.5 A Higher Dimensional Example
Because all of the examples above have low intrinsic dimension, we also tested our kernel on a higher-dimensional manifold, namely, a 5-dimensional cylinder with radius 1 and length 3, sampled uniformly at random (n = 8403 points in a 6-dimensional ambient space). Here we used a pure, connected manifold with no bottlenecks and low curvature in order to simplify interpretation.
Both t-SNE and UMAP produced similar or better results when using the IAN kernel (we set k = 27 based on the mean degree found in our graph, compatible with an intrinsic dimension of 5; results were robust to this particular choice). Despite their current popularity (e.g., Wattenberg, Viégas, & Johnson, 2016; Arora, Hu, & Kothari, 2018; Chan, Rao, Huang, & Canny, 2018; Dimitriadis, Neto, & Kampff, 2018; Becht et al., 2019; Kobak & Berens, 2019; Fujiwara, Ida, Kanai, Kumagai, & Ueda, 2021; Kobak & Linderman, 2021; Wang, Huang, Rudin, & Shaposhnik, 2021), both produced considerably jittered outputs, implying that the original neighborhoods were not preserved. This appears to be caused by an attempt to reproduce the spherical shape of the cylinder's base along the main axis, so that different "slices" ended up projected on top of one another. However, UMAP produced jittered results even when set to return six components (as in the original space) instead of two.
Diffusion maps using IAN resulted in little mixing except near the boundaries, so neighborhoods were better preserved. Running it with either the self-tuning or the variable-bandwidth kernel using k = 27 gave comparable results; Isomap also produced excellent results, with τ = 0.98 (not shown).
4.2 Geodesic Computation
Using our unweighted graph, one may immediately compute graph geodesics (shortest paths using distances in ambient space as edge lengths) as an estimate of the geodesics over the manifold. The latter are likely to be underestimated by the former when sampling is sparse (Bernstein, De Silva, Langford, & Tenenbaum, 2000), even when the graph connectivity is correct, for example, due to curvature (cf. section 3.3.2). It seems a good idea, then, to incorporate the continuous kernel values present in its weighted counterpart as a means of possibly improving geodesic estimation.
We propose to use the heat method for geodesic computation of Crane et al. (2013). It consists of solving a Poisson equation to find a function whose gradient follows a unit vector field pointing along geodesics; that field is obtained by normalizing the temperature gradient produced by a diffusion process in which heat is allowed to diffuse from the source for a short time. Although this method is tailored to applications where positional information and dimensionality are known (in particular, surfaces in 3D), here we apply it to our weighted graph, since discrete versions of the operators used (Laplacian, gradient, and divergence) can be readily defined on a weighted graph (see Desquesnes, Elmoataz, & Lézoray, 2013).
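As a rough sketch of how such graph operators can be combined (not the authors' implementation; sign and normalization conventions are chosen here for illustration and may differ from those used for the figures), the following computes heat-method distances from a single source over a weighted graph W using the combinatorial Laplacian; the diffusion time t is a free parameter, as noted below.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def heat_geodesics(W, source, t=1e-2):
    """Heat-method distance estimates on a weighted graph (sketch).

    W: symmetric sparse matrix of kernel weights; source: index of the source
    node; t: diffusion time (tuned per data set in the text).
    """
    W = sp.csr_matrix(W)
    n = W.shape[0]
    L = sp.diags(np.asarray(W.sum(axis=1)).ravel()) - W     # graph Laplacian (D - W)

    # 1. Diffuse a unit of heat from the source for a short time: (I + tL) u = delta.
    delta = np.zeros(n)
    delta[source] = 1.0
    u = spla.spsolve((sp.eye(n) + t * L).tocsr(), delta)

    # 2. Per-edge gradient of u, normalized per node into a unit "vector field" X.
    C = W.tocoo()
    rows, cols, sw = C.row, C.col, np.sqrt(C.data)
    grad = sw * (u[cols] - u[rows])                         # graph gradient on edges
    node_norm = np.sqrt(np.bincount(rows, grad ** 2, minlength=n))
    X = -grad / np.maximum(node_norm[rows], 1e-12)          # points away from the source

    # 3. Node-wise divergence of X, then solve the Poisson equation L phi = -div X.
    div = np.bincount(rows, sw * X, minlength=n)
    phi = spla.lsqr(L, -div)[0]                             # L is singular; least squares
    return phi - phi[source]                                # distances relative to source
```

The discrete gradient and divergence above are one plausible choice among those defined for weighted graphs; only relative distances matter, so constant factors are unimportant.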
In Figure 30, heat geodesics computed from our weighted graph for the bent plane data set approximate the true geodesics over the manifold well, and graph geodesics obtained from the unweighted graph follow closely. Comparison with those from a naive k-NN graph illustrates that the choice of k is critical (compare with the bottom row of Figure 28).
In Figure 31, we compare the results using weighted graphs from various kernels on the stingray data set; interestingly, heat geodesics computed from our weighted graph hold up reasonably well even in the face of a continuous change in dimensionality. (The diffusion-time parameter used by the heat method was optimized for each data set.)
4.3 Local Dimensionality Estimation
Intrinsic dimensionality (ID) estimation is tightly associated with dimensionality-reduction tasks, especially in manifold learning, where knowledge of the intrinsic dimension can help, among other things, to determine the appropriate number of embedding dimensions. Informally, ID may be seen as the minimum number of parameters required to accurately describe the data. In the context of manifold learning, it is typically equivalent to the topological dimension of the manifold (e.g., a general space curve has dimensionality 1 since it requires a single parameter, arc length).
There are many different ways to estimate it (Camastra, 2003; Camastra & Staiano, 2016); global approaches are typically divided into two groups. The first group is based on some variant of PCA (e.g., Fukunaga & Olsen, 1971; Little et al., 2017) and uses the number of significant eigenvalues to infer dimensionality; these may be applied globally or by combining local estimates. The second group of methods, termed geometric (or fractal, when a noninteger ID is computed), exploits geometric relationships in the data, such as distances between neighboring points. Some are based on estimating packing numbers (Kégl, 2002) or on distances to nearest neighbors (Trunk, 1976; Pettis, Bailey, Jain, & Dubes, 1979; Verveer & Duin, 1995; Costa, Girotra, & Hero, 2005; Facco, d'Errico, Rodriguez, & Laio, 2017; Block et al., 2021).
Among the most popular are the correlation dimension methods (Camastra & Vinciarelli, 2002; Grassberger & Procaccia, 2004; Hein & Audibert, 2005), a variant of which has been specifically applied in the context of determining an appropriate kernel width for manifold learning (see Coifman et al., 2008; Berry et al., 2015; Haghverdi et al., 2015). The dimension is computed as the slope of a log-log plot of the number of neighboring points versus neighborhood radius (see section 2.1). A recent variation is that of Kleindessner and Luxburg (2015); others address the difficult case of high ID (Camastra & Vinciarelli, 2002; Rozza, Lombardi, Ceruti, Casiraghi, & Campadelli, 2012).
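As a concrete illustration of the slope computation (a sketch, not any of the referenced implementations), the global correlation dimension can be estimated from the pairwise-distance distribution; here radii is a hypothetical grid of neighborhood radii spanning the scaling region.

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(X, radii):
    """Slope of log C(r) versus log r, where C(r) is the fraction of
    point pairs within distance r of each other (sketch)."""
    d = pdist(X)                                   # all pairwise distances
    C = np.array([np.mean(d < r) for r in radii])  # correlation integral
    keep = C > 0                                   # avoid log(0) at tiny radii
    slope, _ = np.polyfit(np.log(np.asarray(radii)[keep]), np.log(C[keep]), 1)
    return slope
```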
In our scenario, since we do not assume a pure manifold (section 3.1), we focus on local (i.e., pointwise) ID estimation approaches, namely, those in which dimension is estimated within a neighborhood around each data point (e.g., Farahmand, Szepesvári, & Audibert, 2007; He, Ding, Jiang, Li, & Hu, 2014). This notion can be formalized as the local Hausdorff dimension (Young, 1982; Camastra & Staiano, 2016), and a global estimate is typically found by averaging over local values.
Notice that our kernel can be readily used with this type of method by simply replacing the k-NN graph with our unweighted graph, thereby summing over the nodes in a point's graph neighborhood instead of over its k nearest neighbors. Additionally, we propose a correlation-dimension-based method that allows for local estimates. We describe it next and then compare its results with those from the MLE method.
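Before turning to that, the following minimal sketch illustrates the substitution just described for a pointwise maximum-likelihood (Levina-Bickel-style) dimension estimate, in which each point's k nearest neighbors are replaced by its graph neighborhood; neighbors[i] is a hypothetical list of node i's graph neighbors (excluding i).

```python
import numpy as np

def mle_local_dimension(X, neighbors):
    """Pointwise MLE of intrinsic dimension over graph neighborhoods (sketch).

    X: (n, D) data matrix; neighbors[i]: indices of node i's graph neighbors.
    Points with fewer than two neighbors are skipped (returned as nan).
    """
    m = np.full(len(X), np.nan)
    for i, nbrs in enumerate(neighbors):
        if len(nbrs) < 2:
            continue
        T = np.sort(np.linalg.norm(X[list(nbrs)] - X[i], axis=1))
        # inverse of the mean log-ratio of the farthest neighbor distance
        # to the remaining neighbor distances (Levina-Bickel estimator)
        m[i] = 1.0 / np.mean(np.log(T[-1] / T[:-1]))
    return m
```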
4.3.1 Algorithm: Neighborhood Correlation Dimension
However, because we assume that the intrinsic dimension may vary over the manifold, global averages cannot work in general. Moreover, nonuniform density, curvature, or multiple connected components may all create multiple peaks in the slope of the log-log plot, so its inspection cannot easily be automated.
Therefore, we modify this approach to use individual curves for each data point. To keep the summation local, points are restricted to those in the neighborhood of the point in our unweighted graph. Here, it is advantageous to work with an extended neighborhood (e.g., by also including neighbors-of-neighbors) because of the theoretical limit on the dimension that can be accurately estimated from a given number of points (Eckmann & Ruelle, 1992), roughly twice the base-10 logarithm of that number. In fact, if the dimension is large compared to the neighborhood size, even additional hops away from the point may be considered. Because such an extension is done by following edges in the graph (as opposed to naively expanding a ball in ambient space), we obtain a larger (approximately tubular) neighborhood around the point without ever leaving the manifold. We denote this the extended neighborhood, as opposed to the immediate neighborhood; throughout this section, both include the node itself.
Our algorithm involves the following steps (a simplified code sketch follows the final step below):
- For each data point and its extended neighborhood, define, as a function of the scale σ, the sum of gaussian kernel values between the point and the members of its extended neighborhood (equation 4.5).
- Analogous to equation 4.4, taking logarithms shows that the slope of this sum's logarithm with respect to log σ (equation 4.6) is an estimate of the dimension around the point as a function of σ. Computationally, it is preferable to use the closed-form expression for this slope (equation 4.7) for accuracy.
A region of stability of the slope (i.e., a local maximum) is then an estimate of the dimension around the point.
A local maximum ("peak") in the slope can be interpreted as follows: as a ball around the point is expanded, the rate at which neighbors are encountered stops increasing and must eventually decrease with larger σ, since no additional neighbors can be found once the ball encompasses all points in the extended neighborhood. Underlying this is the assumption that the extended neighborhood is sufficiently representative of the manifold around the point. If the neighbors are approximately uniformly distributed and the dimensionality is constant within it, then the slope should remain constant over some appreciable range of σ, whence the notion of "stability."
Even though we work with a subset of the data, there may still be multiple maxima in the slope, for example, when the neighbors of a point are far from uniformly distributed around it. So operationally, we use the global maximum, as this takes into account the information given by the majority of neighboring points. Now, because the kernel sum approaches a constant both as σ → 0 and as σ → ∞, the slope must approach 0 at both extremes; thus, the global maximum must also be a relative one (a "peak").
We now proceed to avoid boundary effects by recentering neighborhoods. The boundary of a manifold (when present) has dimensionality one less than that of its interior (Lee, 2010). The correlation-integral approach often fails for such points; it typically returns this lower dimension for points on the boundary, since they have roughly half the number of neighbors of interior points. For the same reason, it also tends to underestimate the dimension for points near the boundary. Since we work locally over a graph, we can regularize the computation by moving the focus to a more central, nearby point (thus regularizing over sampling artifacts as well):
- Letting the candidate set be the point's immediate neighborhood (its adjacent nodes in the graph, including the point itself), define the recentered node as the member of that set with the smallest median squared distance to all points in the extended neighborhood (equation 4.8). It is, in effect, the most central node in the point's immediate neighborhood.8
Use this central node as the point from which kernel values are computed, replacing the original point in equation 4.5 and thereby shifting the center of estimation. This assumes that the dimension does not change abruptly across neighboring points. We refer to the result as the recentered estimate.
- As with the MLE method (see section 4.3), we may obtain a smoother estimate by averaging the recentered estimates over immediate neighbors in the graph (equation 4.9).
Finally, recall from section 3.5 that we also obtain a degree-based dimension estimate when computing volume ratios (see equation 3.24); we can use this information to further improve our results. A final estimate is then obtained as follows:
- To avoid overestimating the true dimension, compute an average over the neighborhood as in equation 4.10.
- Compute the optimal estimate as given by equation 4.11.
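The following sketch summarizes the core of the steps above under simplifying assumptions: the kernel sum is taken to be a plain gaussian sum (the normalization in equations 4.5 to 4.7 may differ, but constant factors do not affect the slope), and the recentering, smoothing, and combination with the degree-based estimate (equations 4.8 to 4.11) are omitted. Here ext_neighbors[i] is a hypothetical list holding node i's extended neighborhood (including i), and sigmas a grid of candidate scales.

```python
import numpy as np

def ncd_estimate(X, ext_neighbors, sigmas):
    """Neighborhood correlation dimension, simplified (sketch).

    For each point, compute the slope of log C(sigma) versus log sigma in
    closed form, where C(sigma) sums gaussian kernel values over the extended
    neighborhood, and return the peak slope as the local dimension estimate.
    """
    sigmas = np.asarray(sigmas, dtype=float)
    d_hat = np.empty(len(X))
    for i, nbrs in enumerate(ext_neighbors):
        r2 = np.sum((X[list(nbrs)] - X[i]) ** 2, axis=1)     # squared distances
        K = np.exp(-r2[None, :] / sigmas[:, None] ** 2)      # kernel values per scale
        slope = (2.0 * r2[None, :] / sigmas[:, None] ** 2 * K).sum(1) / K.sum(1)
        d_hat[i] = slope.max()                               # "peak" = local dimension
    return d_hat
```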
Application of this technique and comparison with other methods are given next.
4.3.2 Experimental Results
Using IAN, we obtained near-optimal results for the stingray and the bent plane. For the 5-dimensional cylinder, the dimension was underestimated (mean estimate 4.6). Methods based on correlation dimension are known to underestimate the true dimension when the sample size is not sufficiently large (Camastra & Staiano, 2016). In these cases, the method of Camastra and Vinciarelli (2002) can be applied a posteriori to improve results.
For the MLE method, using large values of k tended to improve results, but only when the dimension was constant (as in the bent plane and cylinder data sets). For the stingray, however, no value of k gave correct results: small values of k inflated the dimension estimates due to bias, and large values tended to produce a uniform value throughout (thus giving better estimates only when the dimension is constant). We found that computing the neighborhood averages using the correction of MacKay and Ghahramani (2005), which averages the inverses of the estimators to reduce bias when k is small, gave slightly better results. (We did not use the final smoothing procedure, which involves choosing two additional neighborhood-size parameters.)
5 Summary and Conclusion
In theory, applying the manifold assumption requires prior knowledge about the manifold: its geometry, topology, and how it was sampled. In practice, however, these properties are rarely known. Instead, one typically imposes an assumption about the manifold's dimension, which in turn suggests that some fixed number of nearest neighbors, k, should suffice. This is how many—most!—of the data graphs underlying manifold inference and nonlinear dimensionality reduction are built. Since it is difficult to know whether this assumption about dimension is accurate, it is common practice to test a few values of k and choose among the results.
Apart from the subjective nature of this choice, there are more general problems. Manifolds may not have a fixed dimension, they may be curved or have boundary, and sampling may be nonuniform. Where the intrinsic dimension varies across the data, so should the number of neighbors; in such cases, a single compromise value of k may be far from ideal. We suggest a different approach: build the nearest-neighbor graph, and hence the graph-Laplacian approximation, in as data-driven a manner as possible while remaining aware of the manifold's properties.
Our algorithm of iterated adaptive neighborhoods (IAN) starts with a conservative assumption: that nearest neighbors should have no "nearer" neighbors between them. We then alternate between a discrete and a continuous view of neighborhood graphs and use a volumetric statistic to check for outliers. A linear program keeps the scales minimal while providing a global cover. This optimization is convex, so results are deterministic; other approaches, such as t-SNE and UMAP, are stochastic and so depend critically on the initialization.
Our kernel has been applied successfully to a variety of data sets and compared against some of the most popular algorithms available; in all cases, its performance dominates. Furthermore, IAN can be incorporated directly into many embedding algorithms, including diffusion maps, Isomap, UMAP, and t-SNE, improving their results. Most of these algorithms involve several free parameters; ours has none other than the robust criterion for declaring an outlier.
Other popular embedding algorithms, such as LLE (Roweis & Saul, 2000), approximate the tangent space over a local neighborhood around each point. Although not explored here, using our graph to automatically provide such neighborhoods is straightforward (analogous to what was done in section 4.3 to estimate local dimensionality). Applications to clustering remain to be explored.
Our weighted graph has also been applied to geodesic estimation, achieving results comparable to those obtained from graph geodesics. In contrast, the graphs obtained from other similarity kernels produced suboptimal results.
Our unweighted graph has found application in local dimensionality estimation. Our proposed algorithm, neighborhood correlation dimension (NCD), takes advantage of the graph's adaptive connectivity to improve correlation-dimension estimates, namely, by restricting the correlation integral to an approximately tubular neighborhood around each point. As a result, we obtained accurate estimates of local dimension in data sets where it is not uniform.
Several theoretical bounds are implied throughout this article; these need to be proved. Multiscale kernels, such as those from equations 2.6 and 2.7, are known to approximate Laplacian operators asymptotically (Ting, Huang, & Jordan, 2010; Berry & Harlim, 2016). Using our application examples as evidence, we conjecture that our version also results in good approximations.
In conclusion, understanding the interplay between manifold geometry, topology, and sampling lies at the heart of many data science applications. We have taken a first step to illustrate how discrete relates to continuous, how local estimates relate to global ones, and how uncertainties in data gathering relate to both. Applying data science in a way that leads to rigorous, scientifically appropriate conclusions must take all of these into account.
Appendix: Greedy Splitting
As an alternative to the optimization of section 3.4 (which can be expensive when the number of edges in the graph is very large, typically due to high dimensionality), we have developed a greedy approach in which scales that "cover" each edge are assigned in decreasing order of edge length (the Euclidean distance between the edge's endpoints in ambient space). We call this algorithm greedy splitting.
Starting with the edge of largest length, set both endpoint scales to the common value that satisfies the covering constraint with equality: we say the edge length is evenly "split" between its two nodes. Moreover, the upper-bound constraints on both scales can be verified to hold in this case.
Continue with the edge that has the next largest length. Here we are met with three possible cases in which a (re)assignment of scales is needed:
If neither of the nodes has been assigned a scale yet, evenly split the edge length between the two nodes, as above.
If only one of the nodes does not yet have a scale, set that node's scale to the minimum value that ensures the edge is covered.
If both nodes have previously been assigned scales but the edge is not covered by their current values, compute the quotient by which the covering constraint is violated and update both scales by the same factor, thereby evenly splitting the quotient between the two nodes.
After cases 2 and 3, the updated scales might need to be "rebalanced" in order to meet their upper-bound constraints: the oversized scale is clamped to its bound, and the other scale is raised just enough to keep the edge covered. Only one of the two scales may exceed its upper bound: in case 2, this is trivially true, since only the newly assigned scale can have grown too large; in case 3, the inequalities already satisfied by the previously assigned scales guarantee that at most one of them can exceed its bound after the update. As a corollary, both scales meet their respective constraints after being rebalanced as above.
The above is repeated until all edges have been visited. By covering the longest edges first, we assign the largest, most constrained scales first, allowing the later, less constrained scales to be as small as possible. Because in most cases this tends to split edge lengths evenly between their two endpoints, the algorithm produces reasonable (but usually suboptimal) results compared with the linear program of section 3.4.2.
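The sketch below illustrates the procedure; the exact covering constraint is the one from section 3.4, which is not restated here, so the code assumes, for illustration only, a multiplicative form in which an edge of length delta is covered when the product of its endpoint scales is at least (delta/c)^2 (consistent with the "quotient" splitting above, but to be replaced by the actual constraint). Here edges is a list of node pairs, delta a mapping to their Euclidean lengths, and upper the per-node upper bounds; all are hypothetical names.

```python
import math

def greedy_splitting(edges, delta, upper, c=1.0):
    """Greedy splitting (sketch, under an assumed multiplicative covering rule).

    edges: list of (i, j) pairs; delta[(i, j)]: edge length; upper[i]: upper
    bound on node i's scale; returns a dict of per-node scales.
    """
    sigma = {}
    for i, j in sorted(edges, key=lambda e: -delta[e]):       # longest edges first
        target = (delta[(i, j)] / c) ** 2                     # assumed covering value
        if i not in sigma and j not in sigma:                 # case 1: even split
            sigma[i] = sigma[j] = delta[(i, j)] / c
        elif (i in sigma) != (j in sigma):                    # case 2: one scale missing
            a, b = (i, j) if i in sigma else (j, i)
            sigma[b] = target / sigma[a]                      # minimal covering scale
        elif sigma[i] * sigma[j] < target:                    # case 3: split the quotient
            q = math.sqrt(target / (sigma[i] * sigma[j]))
            sigma[i] *= q
            sigma[j] *= q
        # rebalance (needed only after cases 2 and 3): clamp an oversized scale
        # to its bound and raise the other just enough to keep the edge covered
        for a, b in ((i, j), (j, i)):
            if sigma[a] > upper[a]:
                sigma[a] = upper[a]
                sigma[b] = max(sigma[b], target / sigma[a])
    return sigma
```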
Notes
Throughout, when referring to a point's set of k-nearest neighbors, we shall not include the point itself (unless otherwise stated) and further assume that no two points are identical.
An orthoplex is a line segment in 1D, a square in 2D, a regular octahedron in 3D, a 16-cell in 4D, and so on.
Another possibility is to use a weighted sum while keeping the same constraints, thus still guaranteeing a covering. The weights add a bias to how the length of an edge is split between its two incident nodes (by balancing their individual scales). One interesting option is to set each weight to the ratio between a node's distance to its nearest nonneighboring point and its distance to its farthest neighbor.
That is assuming the natural choice of the covering constant. If for some reason one needs to allow a larger value, then the upper bounds must be scaled accordingly in order to ensure feasibility.
We abuse notation, therefore, when we speak of a manifold of a single, fixed dimension.
This agrees with our observation (in section 3.3.3) that the unoccluded region around a point is similar to an orthoplex: by placing a vertex (i.e., a neighbor) in each of its facets, we obtain a hypercube, the dual polytope of the orthoplex.
Although the mean typically gives smoother tuning curves, the median is more robust. This matters because of the possible outlying values.
Acknowledgments
This research was supported by NIH grant EY031059, NSF CRCNS grant 1822598, and the Swartz Foundation. This project derived from problems in manifold inference involving neuroscience data. We thank G. Field and M. Stryker for motivating discussions.