Abstract

In recent years, a wealth of dimension-reduction techniques for data visualization and preprocessing has been established. Nonparametric methods require additional effort for out-of-sample extensions, because they provide only a mapping of a given finite set of points. In this letter, we propose a general view on nonparametric dimension reduction based on the concept of cost functions and properties of the data. Based on this general principle, we transfer nonparametric dimension reduction to explicit mappings of the data manifold such that direct out-of-sample extensions become possible. Furthermore, this concept offers the possibility of investigating the generalization ability of data visualization to new data points. We demonstrate the approach based on a simple global linear mapping, as well as prototype-based local linear mappings. In addition, we can bias the functional form according to given auxiliary information. This leads to explicit supervised visualization mappings with discriminative properties comparable to state-of-the-art approaches.

1.  Introduction

Due to improved sensor technology, dedicated data formats, and rapidly increasing digitalization capabilities, the amount of electronic data has increased dramatically. As a consequence, manual inspection of digital data sets often becomes infeasible. Automatic methods that help users quickly scan through large amounts of data are desirable.

Many powerful nonlinear dimensionality-reduction techniques have been developed that provide a visualization of complex data sets. In this way, humans can rely on their cognitive capabilities for visual perception when extracting information from large data volumes. In data visualizations, structural characteristics can be captured almost instantly by humans independent of the number of displayed data points.

Over the years, many powerful dimension-reduction techniques have been proposed (Lee & Verleysen, 2007; van der Maaten, Postma, & van den Herik, 2009; van der Maaten & Hinton, 2008; Keim, Mansmann, Schneidewind, Thomas, & Ziegler, 2008; Venna, Peltonen, Nybo, Aidos, & Kaski, 2010). Basically, the task of dimensionality reduction is to represent data points contained in a high-dimensional data manifold by low-dimensional counterparts in two or three dimensions while preserving as much information as possible. Since it is not clear a priori which parts of the data are relevant to the user, this problem is inherently ill posed: depending on the specific data domain and the situation at hand, different aspects can be the focus of attention.

Therefore, a variety of methods has been proposed that try to preserve different properties of the data and impose additional regularizing constraints on the techniques. Spectral dimension-reduction techniques such as locally linear embedding (LLE) (Roweis & Saul, 2000), Isomap (Tenenbaum, de Silva, & Langford, 2000), and Laplacian eigenmaps (Belkin & Niyogi, 2003) rely on the spectrum of the neighborhood graph of the data and preserve important properties of this graph. In general, a unique algebraic solution of the corresponding mathematical objective can be formalized. To arrive at unimodal costs, these methods are often based on very simple affinity functions such as gaussians. As a consequence, their results can be flawed when it comes to boundaries, disconnected manifolds, or holes. Using more complex affinities such as geodesic distances or local neighborhood relations, techniques such as Isomap or maximum variance unfolding (Weinberger & Saul, 2006) can partially avoid these problems at the price of higher computational costs. Many highly nonlinear techniques have been proposed as an alternative, but they often suffer from the existence of local minima: they do not yield unique solutions and require numerical optimization techniques. Yet due to their greater complexity, their visualization properties may be superior, as demonstrated in Hinton and Roweis (2003), van der Maaten and Hinton (2008), and Carreira-Perpiñán (2010).

All of the methods noted map a given finite set of data points to low dimensions. Additional effort is required to include new points in the mapping and arrive at out-of-sample extensions. Usually novel points are mapped to the projection space by minimizing the underlying cost function of the visualization method while keeping the projections of the prior data points fixed. In this way, novel coordinates depend on all given data points, and the effort to map new data depends on the size of the training set. Moreover, no explicit mapping function is available, and the generalization ability of the techniques to new data is not clear.

As an alternative, some approaches derive an explicit function that maps the given data to low dimensions. In this way, an immediate extension to novel data becomes possible. Linear techniques such as standard principal component analysis (PCA) and Fisher discriminant analysis provide an explicit mapping. Auto-encoder networks can be seen as a nonlinear extension of PCA, which directly aims at the inference of a nonlinear mapping function and its approximate inverse. In a similar way, topographic mapping such as a self-organizing map or generative topographic mapping (Kohonen, 1995; Bishop & Williams, 1998) offers a mapping function to a low-dimensional space by simultaneous clustering and mapping based on prototypes. Nonlinear mapping functions have also been considered by Bae, Choi, Qiu, and Fox (2010), where only a few points are mapped using a dimensionality-reduction technique and an interpolation to all data is done by means of a k-nearest neighbor approach. For LLE, a similar extension, locally linear coordination (LLC), has been proposed based on locally linear functions by Roweis and Saul (2000). There, the function parameters are optimized directly using the LLE cost function. Similarly, t-distributed stochastic neighbor embedding (t-SNE) has been extended to a mapping given by deep encoder networks by van der Maaten (2009), relying on the t-SNE cost function to optimize the mapping function parameters. Suykens (2008) uses kernel mappings with a reference point to arrive at high-quality data visualization mappings and experimentally demonstrates the excellent generalization ability and visualization properties of the technique. Although these approaches constitute promising directions for arriving at explicit dimensionality-reduction mappings, many of the techniques have been developed for a specific setting and dimensionality-reduction technique only.

In this article, we propose a general principle to formalize nonparametric dimension-reduction techniques based on cost optimization. This general principle allows us to simultaneously extend popular nonparametric dimension-reduction methods to explicit mapping functions for which out-of-sample extensions are immediate. In this setting, the functional form of the mapping is fixed a priori, and function parameters are optimized within the dimension-reduction framework instead of the coordinates of single point projections. We demonstrate the suitability of this approach using two different types of functions: simple linear projections and locally linear functions. Interestingly, it can be shown that state-of-the-art dimensionality-reduction cost functions, as provided by t-SNE, for example, can even improve simple linear dimensionality-reduction functions as compared to classical PCA. Furthermore, the performance of state-of-the-art techniques such as presented by van der Maaten (2009) can be achieved using more complex locally linear functions.

Several benefits arise from an explicit dimension-reduction mapping: out-of-sample extensions are immediate and require only constant time depending on the chosen form of the mapping. Since an explicit mapping function is available, approximate inverse mapping is possible at least locally: locally linear functions, for example, can be inverted using the pseudo-inverse. This makes a deeper investigation of the structure of the projection possible. Depending on the form of the mapping function, only a few parameters need to be determined, and implicit regularization takes place. In consequence, only a few data points are necessary to determine these mapping parameters and generalize to novel data points. Hence, only a small subset of the full data is necessary for training, which yields enormously increased speed for large data sets: Instead of quadratic complexity to map the data, the mapping function can be determined in constant time complexity, and the full data set can be displayed in linear time complexity. This opens the way to feasible dimension reduction for very large data sets.

In this letter, we experimentally demonstrate the suitability of the approach and investigate the generalization ability in terms of several applications. Moreover, we substantiate the experimental findings with an explicit mathematical formalization of the generalization ability of dimensionality reduction in the framework of statistical learning theory. Although we are not yet able to provide good explicit generalization bounds, we argue that principled learnability can be guaranteed for standard techniques.

Another benefit of an explicit mapping function is the possibility of biasing the dimensionality-reduction mapping according to given prior knowledge. The task of dimension reduction is inherently ill posed, and which aspects of the data are relevant for a user depends on the situation at hand. One way to shape the ill-posed task of data visualization is by incorporating auxiliary information (as proposed, for example, by Kaski, Sinkkonen, & Peltonen, 2001).

A few classical dimension-reducing visualization tools take class labeling into account. Feature selection can be interpreted as a particularly simple form of discriminative dimensionality reduction (see Guyon & Elisseeff, 2003, for an overview). Classical Fisher linear discriminant analysis (LDA) as well as partial least squares regression (PLS) offer supervised linear visualization techniques based on the covariances of the classes; kernel techniques extend these settings to nonlinear projections (Ma, Qu, & Wong, 2007; Baudat & Anouar, 2000). The principle of adaptive metrics used for data projection according to the given auxiliary information has been introduced by Kaski et al. (2001) and Peltonen, Klami, and Kaski (2004). The obtained metric can be integrated into diverse techniques such as self-organizing maps, multidimensional scaling, or a recent information-theoretic model for data visualization (Kaski et al., 2001; Peltonen et al., 2004; Venna, Peltonen, Nybo, Aidos, & Kaski, 2010). An ad hoc metric adaptation is used in Geng, Zhan, and Zhou (2005) to extend Isomap to class labels. Alternative approaches change the cost function of dimensionality reduction (for examples, see Iwata et al., 2007; Memisevic & Hinton, 2005; Song, Smola, Borgwardt, & Gretton, 2008).

In this letter, we show that auxiliary information in the form of given class labels can be easily integrated into the dimension-reduction scheme by biasing the functional form accordingly. As a result, one obtains a discriminative dimensionality-reduction technique that is competitive with alternative state-of-the-art approaches.

We first briefly review several popular nonparametric dimensionality-reduction techniques. We put them into a general framework based on the notion of cost functions that compare the characteristics of the data and the projections. This general framework allows us to simultaneously extend the dimension-reduction techniques to explicit mappings that not only lead to a finite set of projection coordinates but employ an explicit projection function. We demonstrate this principle using a linear mapping function and locally linear maps, for example, derived by standard clustering techniques. We incorporate these functional forms into the cost function of t-SNE. Interestingly, the results are superior compared to standard linear techniques such as PCA and alternative mapping functions as presented, for example, by van der Maaten (2009). Furthermore, we demonstrate that the functional form can be biased toward auxiliary label information by choosing the functional form on top of supervised classification techniques. Finally, we argue that based on the notion of a mapping function, the generalization properties of dimension-reduction maps can be formalized in the framework of computational learning theory.

2.  Dimension Reduction as Cost Optimization

In this section we briefly review some popular nonparametric dimension-reduction methods. We assume high-dimensional data points are given: x_i ∈ IR^D where i = 1, …, n. These points are projected to a low-dimensional embedding space IR^d, with d < D, usually d ∈ {2, 3} for visualization. The coordinates of the points in the projection space are referred to as y_i ∈ IR^d for i = 1, …, n. Often visualization techniques refer to the distances or affinities of data in the data space and the projection space, respectively. The pairwise affinities are denoted by d^X_ij for the original high-dimensional data points and by d^Y_ij for the corresponding dissimilarities in the embedding space. Usually d^Y_ij is chosen as the Euclidean distance, while d^X_ij is chosen according to the data set at hand; for example, it is given by the Euclidean or the geodesic distance in the high-dimensional space. A mathematical formalization of dimensionality reduction can take place in different ways.

2.1.  Multidimensional Scaling and Extensions.

Multidimensional scaling (MDS; Torgerson, 1952) is probably one of the oldest dimension-reduction methods. It aims at preserving pairwise relations measured in the least squares sense. The original MDS measures the pairwise relations of the data in terms of dot products in the original and the embedding space, respectively, and minimizes the cost function
$$E_{\mathrm{MDS}} = \sum_{i<j}\big(x_i^t x_j - y_i^t y_j\big)^2. \qquad (2.1)$$
The advantage of this formulation is that an analytical solution is possible. In later approaches, the objective was changed to preserving distances,
$$E = \frac{1}{c}\sum_{i<j} w_{ij}\,\big(d^X_{ij} - d^Y_{ij}\big)^2, \qquad (2.2)$$
with Euclidean distances d^X_ij = ‖x_i − x_j‖ and d^Y_ij = ‖y_i − y_j‖ and a normalizing constant c (Lee & Verleysen, 2007). The weights can be chosen, for example, as w_ij = 1. In the well-known Sammon mapping (Sammon, 1969), they take the form w_ij = 1/d^X_ij, emphasizing the preservation of small distances. There, the constant c is set to the sum of all pairwise distances, and optimization takes place by a gradient descent procedure.
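
As an illustration of equation 2.2, the following short numpy sketch (our own illustration, not part of the original method descriptions) evaluates the Sammon variant of the weighted stress, using w_ij = 1/d^X_ij and c equal to the sum of the original distances; it assumes that all data points are distinct.

```python
import numpy as np
from scipy.spatial.distance import pdist

def sammon_stress(X, Y):
    """Weighted stress of equation 2.2 with Sammon weights.

    X: (n, D) data in the original space, Y: (n, d) candidate projections.
    Uses w_ij = 1 / d^X_ij and c = sum of the original pairwise distances.
    """
    dX = pdist(X)            # condensed vector of pairwise Euclidean distances
    dY = pdist(Y)
    w = 1.0 / dX             # Sammon weighting emphasizes small distances
    c = dX.sum()             # normalizing constant
    return np.sum(w * (dX - dY) ** 2) / c
```

Setting w_ij = 1 and c = 1 recovers the plain distance-preserving stress of equation 2.2.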

2.2.  Isomap.

Depending on the actual data, the Euclidean distance might not be appropriate to describe pairwise relations. Therefore, Isomap (Tenenbaum et al., 2000) is based on the approximation of geodesic distances, which measure the relations along the data manifold. A neighborhood graph is constructed using k neighborhoods or ε-balls, and the shortest path lengths in this graph define the pairwise affinities in the data space. Afterward, a standard MDS procedure is used.
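
A minimal sketch of the geodesic approximation (our own illustration, assuming scikit-learn and scipy are available; Isomap implementations differ in details such as the treatment of disconnected graphs) reads:

```python
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def geodesic_distances(X, k=10):
    """Approximate geodesic distances along the data manifold.

    Builds a symmetrized k-nearest-neighbor graph weighted by Euclidean
    distances and returns all-pairs shortest-path lengths; Isomap feeds
    this matrix into a standard MDS step.
    """
    G = kneighbors_graph(X, n_neighbors=k, mode='distance')
    G = G.maximum(G.T)                      # symmetrize the neighborhood graph
    return shortest_path(G, method='D', directed=False)
```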

2.3.  Locally Linear Embedding.

Locally linear embedding (LLE) (Roweis & Saul, 2000) aims at preserving local topologies defined by the reconstruction of a data point i by means of a linear combination of its neighbors j. We denote the property that j is a neighbor of i by i ∼ j. As for Isomap, local neighbors can be defined based on k nearest neighbors or ε-balls, respectively. To obtain the reconstruction weights, the objective ∑_i (x_i − ∑_{j: i∼j} w_ij x_j)² in the original space is minimized under the constraint ∑_j w_ij = 1 in order to ensure rotation and translation invariance of the output. Afterward, the projections are determined such that the local linear relationships are preserved as much as possible in a least squares sense: minimize ∑_i (y_i − ∑_{j: i∼j} w_ij y_j)² subject to the constraints of centered coordinates ∑_i y_i = 0 with unit covariance Y^t Y = I, where I is the d × d identity matrix. Here, the normalization of the reconstruction weights leads to a unique optimum of the system.
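
The reconstruction weights for a single point have a closed-form solution in terms of the local Gram matrix of its neighbors; the following sketch (our own illustration, with a small regularization term added for the case of more neighbors than dimensions) mirrors the constrained least-squares step described above.

```python
import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    """Weights w reconstructing x from its neighbors with sum(w) = 1.

    x: (D,) data point, neighbors: (k, D) matrix of its k nearest neighbors.
    Solving the local Gram system C w = 1 and rescaling minimizes
    ||x - sum_j w_j x_j||^2 under the constraint sum_j w_j = 1.
    """
    Z = neighbors - x                                  # center neighbors at x
    C = Z @ Z.T                                        # local Gram matrix
    C += reg * np.trace(C) * np.eye(len(neighbors))    # regularization if k > D
    w = np.linalg.solve(C, np.ones(len(neighbors)))
    return w / w.sum()                                 # enforce sum-to-one
```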

2.4.  Laplacian Eigenmaps.

Similar to LLE and Isomap, Laplacian eigenmaps (Belkin & Niyogi, 2003) are based on the construction of a local neighborhood graph given the k nearest neighbors or an ε-neighborhood, respectively. The connections are weighted by coefficients w_ij, for example, using a heat kernel. The projection is obtained by solving a generalized eigenvalue problem given the corresponding graph Laplacian and the degree matrix of the graph, picking the eigendirections corresponding to the smallest eigenvalues unequal to 0. This is equivalent to minimizing the embedding objective ∑_ij w_ij ‖y_i − y_j‖² with Euclidean distance, under the constraints Y^t D Y = I and Y^t D 1 = 0, where Y refers to the matrix of low-dimensional points and D is the degree matrix; these constraints remove scaling and translation factors.
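
For illustration, a dense (and therefore only small-scale) sketch of the Laplacian eigenmap computation could look as follows; the heat-kernel bandwidth t and the neighborhood size k are meta-parameters, and practical implementations use sparse eigensolvers instead.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

def laplacian_eigenmap(X, d=2, k=10, t=1.0):
    """Embed X into d dimensions using the graph Laplacian.

    Heat-kernel weights on a symmetrized k-NN graph; solves the generalized
    eigenproblem L y = lambda D y and discards the trivial constant eigenvector.
    """
    A = kneighbors_graph(X, n_neighbors=k, mode='distance').toarray()
    A = np.maximum(A, A.T)                         # symmetrize the graph
    W = np.where(A > 0, np.exp(-A ** 2 / t), 0.0)  # heat-kernel weights on edges
    D = np.diag(W.sum(axis=1))                     # degree matrix
    L = D - W                                      # graph Laplacian
    _, vecs = eigh(L, D)                           # generalized eigenvalue problem
    return vecs[:, 1:d + 1]                        # skip the eigenvalue-0 direction
```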

2.5.  Maximum Variance Unfolding.

Maximum variance unfolding (MVU) (Weinberger & Saul, 2006) is based on a neighborhood graph built from k nearest neighbors or ε-neighborhoods. Projections y_i are determined by maximizing the variance of the projection. This means that ∑_ij ‖y_i − y_j‖² is maximized subject to the constraint that all neighboring points x_i and x_j preserve their affinity, ‖y_i − y_j‖² = ‖x_i − x_j‖² for all neighbors i ∼ j, together with the normalization ∑_i y_i = 0. Considering the inner product matrix Y^t Y, a reformulation as a convex problem is possible, and a solution can be found in terms of a semidefinite program (SDP) (Vandenberghe & Boyd, 1994). Furthermore, if an exact preservation of neighboring distances is not possible, slack variables can be introduced.

2.6.  Stochastic Neighbor Embedding.

Stochastic neighbor embedding (SNE) (Hinton & Roweis, 2003) defines the characteristics of the data in terms of probabilities in the original and embedding space, respectively,
$$p_{j|i} = \frac{\exp\!\big(-\|x_i - x_j\|^2 / 2\sigma_i^2\big)}{\sum_{k \neq i} \exp\!\big(-\|x_i - x_k\|^2 / 2\sigma_i^2\big)}, \qquad (2.3)$$
$$q_{j|i} = \frac{\exp\!\big(-\|y_i - y_j\|^2\big)}{\sum_{k \neq i} \exp\!\big(-\|y_i - y_k\|^2\big)}, \qquad (2.4)$$
using Euclidean distances as the default. The objective is the sum of Kullback-Leibler divergences between the probability distributions in the original and the projection space, E_SNE = ∑_i ∑_j p_{j|i} log(p_{j|i} / q_{j|i}). The bandwidths σ_i are determined based on the so-called perplexity, which defines the effective size of local neighborhoods. A gradient descent procedure is used for optimization.

2.7.  T-Distributed Stochastic Neighbor Embedding.

T-distributed stochastic neighbor embedding (t-SNE) (van der Maaten & Hinton, 2008) modifies the SNE cost function such that the long-tailed student-t distribution is used in the embedding space instead of gaussians. The cost function
$$E_{t\text{-SNE}} = \sum_i \sum_j p_{ij} \log\frac{p_{ij}}{q_{ij}} \qquad (2.5)$$
uses symmetrized conditional probabilities
$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n} \qquad (2.6)$$
and student-t–based probabilities in the embedding space,
$$q_{ij} = \frac{\big(1 + \|y_i - y_j\|^2/\varsigma\big)^{-(\varsigma+1)/2}}{\sum_{k \neq l} \big(1 + \|y_k - y_l\|^2/\varsigma\big)^{-(\varsigma+1)/2}}, \qquad (2.7)$$
with n denoting the number of data points and the student-t distribution parameterized by its degrees of freedom ς, which is set to ς = 1 by default. Again, optimization is done in terms of a gradient method.
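
Spelled out in numpy, the cost of equation 2.5 for a given embedding can be computed as in the sketch below (our own illustration; for brevity, a single bandwidth sigma replaces the usual perplexity-based search for the σ_i):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def tsne_cost(X, Y, sigma=1.0):
    """Kullback-Leibler cost of equation 2.5 for a candidate embedding Y."""
    n = len(X)
    dX2 = squareform(pdist(X)) ** 2
    P_cond = np.exp(-dX2 / (2.0 * sigma ** 2))
    np.fill_diagonal(P_cond, 0.0)
    P_cond /= P_cond.sum(axis=1, keepdims=True)   # p_{j|i}, equation 2.3
    P = (P_cond + P_cond.T) / (2.0 * n)           # symmetrized p_ij, equation 2.6

    dY2 = squareform(pdist(Y)) ** 2
    Q = 1.0 / (1.0 + dY2)                         # student-t numerators, varsigma = 1
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum()                                  # q_ij, equation 2.7

    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))   # equation 2.5
```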

2.8.  Neighborhood Retrieval Visualizer.

Venna et al. (2010) propose a quality measure for dimension reduction derived from an information retrieval point of view. A new dimension-reduction technique based on the new objective accompanies this proposal: the neighborhood retrieval visualizer (NeRV). The cost function reads
$$E_{\mathrm{NeRV}} = \lambda \sum_i \sum_{j \neq i} p_{j|i} \log\frac{p_{j|i}}{q_{j|i}} + (1 - \lambda) \sum_i \sum_{j \neq i} q_{j|i} \log\frac{q_{j|i}}{p_{j|i}}, \qquad (2.8)$$
with probabilities as defined for SNE (see equations 2.3 and 2.4) and a weighting parameter λ ∈ [0, 1] that controls the influence of the two competing terms, which are related to the traditional information retrieval measures precision and recall. The extension t-NeRV is obtained in a straightforward way by considering symmetric pairwise probabilities just as in t-SNE (see equations 2.6 and 2.7) in this symmetrized version of the Kullback-Leibler divergence.

2.9.  A General View.

All of the methods summarized above obey one general principle. Assume a finite sample of points X = (x_i ∈ IR^D | i = 1, …, n) = (x_1, …, x_n) is given. These points should be mapped to a low-dimensional embedding space IR^d with d < D, where data point x_i is mapped to the projection y_i ∈ IR^d by means of a nonparametric mapping. The projections are referred to as Y = (y_i | i = 1, …, n) = (y_1, …, y_n). The sequence of tuples of data points and their projections is referred to as XY = ((x_1, y_1), …, (x_n, y_n)). We denote the set of all finite sequences of points in IR^D by S(IR^D); more generally, S(A) refers to all finite sequences of elements of a given set A. Given a sequence X = (x_1, …, x_n), its length is denoted by n = |X|.

For all methods, the coefficients yi are determined based on the same general principle, using the same basic ingredients, the characteristics derived from the original training set X for every data point, corresponding characteristics of its projection, and an error measure between these two characteristics. The latter is minimized during projection, possibly taking into account further constraints. More precisely, dimensionality reduction is characterized by the following ingredients:

  • A function c^X : S(IR^D) × IR^D → S(IR) is fixed that maps a data sequence X and a point x in the original space IR^D to a characteristic. Usually c^X(X, x) = (c^X_1(X, x), …, c^X_{|X|}(X, x)) collects the pairwise relations of x to all points of X, such as distances or probabilities.

  • A function c^Y : S(IR^D × IR^d) × (IR^D × IR^d) → S(IR) is fixed that maps a finite sequence XY of points and their projections, together with a given tuple (x, y) of a point and its projection, to a corresponding characteristic. Usually c^Y(XY, (x, y)) collects the pairwise relations of the projection y to the projections contained in XY.

  • An error measure is fixed that measures the difference of two such characteristics: error : S(IR) × S(IR) → IR.

  • Given a finite sequence X ∈ S(IR^D), dimensionality reduction takes place by determining the projection y_i of every x_i such that the costs
    $$E(Y) = \sum_i \mathrm{error}\big(c^X(X, x_i),\, c^Y(XY, (x_i, y_i))\big) \qquad (2.9)$$
    are minimized.
  • Possibly, additional constraints are imposed on the y_i to guarantee the uniqueness or invariance of the result. This can be formalized by a constraint function
    $$\mathrm{constraints} : S(\mathbb{R}^D \times \mathbb{R}^d) \to \mathbb{R}, \qquad (2.10)$$
    which is optimized simultaneously with the overall costs (see equation 2.9) and which can implement hard constraints by means of an indicator function or soft constraints by means of a real-valued function.
The methods differ in the definition of the data characteristics and in the way the error of the characteristics is defined. Furthermore, they differ in the (implicit or explicit) computation of the characteristics and the employed (analytical or numerical) optimization method. The objective (see equation 2.9) and the constraints (see equation 2.10) might be contradictory, and the way in which these two objectives are combined can be chosen differently.
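
The general principle can be summarized as a small computational template. The following sketch is our own illustration (the callables char_X, char_Y, and error are hypothetical names): it assembles the costs of equation 2.9 from the three ingredients and instantiates them with MDS-like characteristics.

```python
import numpy as np

def generic_costs(X, Y, char_X, char_Y, error):
    """Costs of equation 2.9: sum_i error(c^X(X, x_i), c^Y(XY, (x_i, y_i)))."""
    XY = list(zip(X, Y))
    return sum(error(char_X(X, x), char_Y(XY, (x, y))) for x, y in XY)

# MDS-like instantiation: the characteristic of a point is its vector of
# distances to all data points (projections, respectively), and the error
# is the plain squared difference of the two distance vectors.
char_X = lambda X, x: np.linalg.norm(np.asarray(X) - x, axis=1)
char_Y = lambda XY, xy: np.linalg.norm(np.asarray([y for _, y in XY]) - xy[1], axis=1)
error = lambda a, b: np.sum((a - b) ** 2)
```

Each row of Table 1 then corresponds to a different choice of these ingredients, possibly complemented by a constraint function.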

Table 1 summarizes the properties of the different optimization methods with respect to this general principle. We explain the formalization and the exact choice of the relevant functions in more detail in the following:

Table 1:
Dimensionality-Reduction Methods.

MDS. Data characteristics: Euclidean distances d^X_ij. Projection characteristics: Euclidean distances d^Y_ij. Error measure: minimize the (weighted) least squared error. Constraint: none.
Isomap. Data characteristics: geodesic distances d_geodesic(x_i, x_j). Projection characteristics: Euclidean distances d^Y_ij. Error measure: minimize the (weighted) least squared error. Constraint: none.
LLE. Data characteristics: weights w_ij such that ∑_i (x_i − ∑_{j: i∼j} w_ij x_j)² is minimal with ∑_j w_ij = 1. Projection characteristics: weights such that ∑_i (y_i − ∑_{j: i∼j} w_ij y_j)² is minimal. Error measure: identity. Constraints: ∑_i y_i = 0 and ∑_i y_i (y_i)^t = n · I.
Laplacian eigenmap. Data characteristics: weights w_ij for neighbors i ∼ j. Projection characteristics: distances ‖y_i − y_j‖² for neighbors i ∼ j. Error measure: minimize the dot product. Constraints: Y^t D Y = I and Y^t D 1 = 0.
MVU. Data characteristics: Euclidean distances ‖x_i − x_j‖² for neighbors i ∼ j. Projection characteristics: Euclidean distances ‖y_i − y_j‖² for neighbors i ∼ j. Error measure: enforce identity using slack variables. Constraint: maximize ∑_ij ‖y_i − y_j‖² with ∑_i y_i = 0.
SNE. Data characteristics: probabilities p_{j|i}. Projection characteristics: probabilities q_{j|i}. Error measure: minimize the Kullback-Leibler divergences. Constraint: none.
t-SNE. Data characteristics: probabilities p_ij. Projection characteristics: probabilities q_ij. Error measure: minimize the Kullback-Leibler divergence. Constraint: none.
NeRV. Data characteristics: probabilities p_{j|i} as for SNE. Projection characteristics: probabilities q_{j|i} as for SNE. Error measure: minimize the sum of Kullback-Leibler divergences with weight λ ∈ [0, 1]. Constraint: none.
t-NeRV. Data characteristics: probabilities p_ij as for t-SNE. Projection characteristics: probabilities q_ij as for t-SNE. Error measure: minimize the sum of Kullback-Leibler divergences with weight λ ∈ [0, 1]. Constraint: none.

Notes: These methods can be put into a general framework: characteristics of the data are extracted, and the projections lead to corresponding characteristics depending on their coefficients. These coefficients are determined such that an error measure between the two characteristics is minimized, possibly fulfilling additional constraints.

MDS. The characteristics are the pairwise Euclidean distances in the original and the embedding space, respectively:
$$c^X(X, x) = \big(\|x_1 - x\|, \ldots, \|x_n - x\|\big)$$
and
$$c^Y(XY, (x, y)) = \big(\|y_1 - y\|, \ldots, \|y_n - y\|\big).$$
In particular, the latter characteristic depends on the projections of the data only, and not on the original coefficients. The cost function is the least squared error,
$$\mathrm{error}\big((a_1, \ldots, a_n), (b_1, \ldots, b_n)\big) = \frac{1}{c} \sum_i w_i (a_i - b_i)^2$$
for a_i, b_i ∈ IR, where the weighting can be chosen, for example, according to the Sammon mapping. Note that only sequences of the same length are compared via this function. No constraints are imposed; the constraint function (see equation 2.10) is trivial.

Isomap. Isomap differs from MDS only in the characteristic c^X, which is given by the geodesic distances (d_geodesic(x_1, x), …, d_geodesic(x_n, x)). Geodesic distances are usually approximated on the data set by means of the following algorithm. A neighborhood graph is constructed from X = (x_1, …, x_n) and x by means of an ε-neighborhood or a k-nearest neighbor graph with vertices given by the x_i and x. Then all shortest paths from x to the x_i are computed within this graph. These path lengths constitute an approximation of the geodesic distances on the underlying data manifold.

LLE. The characteristics of the data are the local reconstruction weights of a point estimated from its neighborhood,
$$c^X(X, x) = \operatorname*{argmin}_{(w_1, \ldots, w_n),\ \sum_i w_i = 1} \Big\| x - \sum_i w_i\, \chi_{N(x, X)}(x_i)\, x_i \Big\|^2,$$
where χ_{N(x, X)} denotes the characteristic function of the neighbors of x in X, excluding x itself:
$$\chi_{N(x, X)}(x_i) = \begin{cases} 1 & \text{if } x_i \text{ is a neighbor of } x \text{ in } X, \\ 0 & \text{otherwise.} \end{cases}$$
The characteristic of a projection is the corresponding reconstruction error obtained with these weights,
$$c^Y(XY, (x, y)) = \Big\| y - \sum_i w_i\, \chi_{N(x, X)}(x_i)\, y_i \Big\|^2 \quad \text{with } (w_1, \ldots, w_n) = c^X(X, x).$$
This characteristic uses both the projections y_i and the data in the original space x_i, which define the reconstruction weights and the neighborhood graph. Since the characteristic already includes the approximation error, the error measure can be picked in a trivial way as the identity with respect to its second argument:
$$\mathrm{error}(a, b) = b.$$
Because of this definition, minimization of equation 2.9 is equivalent to a minimization of ∑_i ‖y_i − ∑_{j: i∼j} w_ij y_j‖², where the reconstruction weights w_ij and the neighborhood structure are taken from the original data space. Since this formulation is not well posed, 0 being an obvious global optimum, regularization is used. The constraints enforce that the projection coefficients are centered at the origin and that their correlation matrix is given by the unit matrix. Since these constraints can be fulfilled exactly, the indicator function
$$\mathrm{constraints}(XY) = \begin{cases} 0 & \text{if } \sum_i y_i = 0 \text{ and } \sum_i y_i y_i^t = n \cdot I, \\ \infty & \text{otherwise,} \end{cases}$$
can be used.
Laplacian eigenmap. The characteristics of the original data space are based on the local neighborhood structure and an appropriate weighting of the distances in this neighborhood, for example, according to the heat kernel:
$$c^X(X, x) = \Big( \chi_{N(x, X)}(x_i) \cdot \exp\!\big(-\|x - x_i\|^2 / t \big) \Big)_{i = 1, \ldots, n}.$$
The characteristics of the projections are similar but based on the standard Euclidean distance,
$$c^Y(XY, (x, y)) = \Big( \chi_{N(x, X)}(x_i) \cdot \|y - y_i\|^2 \Big)_{i = 1, \ldots, n}.$$
The cost function is given by the dot product,
$$\mathrm{error}(a, b) = \sum_i a_i b_i,$$
which is minimized. Since this formulation allows the trivial solution 0, constraints are imposed. Let D denote the degree matrix of the weighted neighborhood graph, D_ii = ∑_j w_ij; then an arbitrary scaling factor and translation of the solution are removed by imposing the constraint function
$$\mathrm{constraints}(XY) = \begin{cases} 0 & \text{if } Y^t D Y = I \text{ and } Y^t D \mathbf{1} = 0, \\ \infty & \text{otherwise.} \end{cases}$$
Maximum variance unfolding. Similarly,
$$c^X(X, x) = \Big( \chi_{N(x, X)}(x_i) \cdot \|x - x_i\|^2 \Big)_{i = 1, \ldots, n}$$
and
$$c^Y(XY, (x, y)) = \Big( \chi_{N(x, X)}(x_i) \cdot \|y - y_i\|^2 \Big)_{i = 1, \ldots, n},$$
with error
$$\mathrm{error}(a, b) = \begin{cases} 0 & \text{if } a = b, \\ \infty & \text{otherwise,} \end{cases}$$
and constraint
$$\mathrm{constraints}(XY) = \begin{cases} -\lambda \sum_{ij} \|y_i - y_j\|^2 & \text{if } \sum_i y_i = 0, \\ \infty & \text{otherwise,} \end{cases}$$
with a constant λ. The cost term defines a characteristic function that might not possess a feasible solution, because it is in general not possible to preserve all local distances exactly. Therefore, the cost function should be smoothed. In MVU, the characteristic functions are taken as constraints of an optimization problem, and slack variables are introduced.
SNE. Similarly,
$$c^X(X, x) = (p_1, \ldots, p_n) \quad \text{with } p_i = \frac{\exp\!\big(-\|x - x_i\|^2 / 2\sigma_x^2\big)}{\sum_{k : x_k \neq x} \exp\!\big(-\|x - x_k\|^2 / 2\sigma_x^2\big)},$$
where entries corresponding to x_i = x are set to 0, and
$$c^Y(XY, (x, y)) = (q_1, \ldots, q_n) \quad \text{with } q_i = \frac{\exp\!\big(-\|y - y_i\|^2\big)}{\sum_{k : y_k \neq y} \exp\!\big(-\|y - y_k\|^2\big)},$$
again setting entries for y_i = y to 0. The bandwidth parameter σ_x is determined such that the effective number of neighbors of x in X, as measured within an information-theoretic framework, equals a predefined value, the perplexity, which constitutes a metaparameter of the model. The error is given by the Kullback-Leibler divergence,
$$\mathrm{error}(a, b) = \sum_i a_i \log\frac{a_i}{b_i}.$$
No constraints are imposed.
t-SNE. Similar to SNE, we have
$$c^X(X, x) = (p_1, \ldots, p_n) \quad \text{with } p_i = \frac{p_{x_i|x} + p_{x|x_i}}{2\,|X \cup \{x\}|},$$
where the conditional probabilities are defined as in SNE and X ∪ {x} refers to the set of elements without duplicates, and
$$c^Y(XY, (x, y)) = (q_1, \ldots, q_n) \quad \text{with } q_i = \frac{\big(1 + \|y - y_i\|^2\big)^{-1}}{\sum_{k \neq l}\big(1 + \|y_k - y_l\|^2\big)^{-1}},$$
where the normalization runs over all pairs of projections including y, and entries corresponding to x_i = x are set to 0. Again, the Kullback-Leibler divergence is used, and no constraints are imposed.
NeRV. NeRV deviates from SNE only in the choice of the cost function, which is
$$\mathrm{error}(a, b) = \lambda \sum_i a_i \log\frac{a_i}{b_i} + (1 - \lambda) \sum_i b_i \log\frac{b_i}{a_i},$$
with an appropriate weighting λ ∈ [0, 1].

t-NeRV. Similarly, t-NeRV uses the same cost function as NeRV in the t-SNE setting.

These formalizations are summarized in Table 1. Note that some of the techniques, such as LLE, MVU, and Laplacian eigenmaps, allow an explicit algebraic solution or lead to a unique optimum, while others, such as SNE and its variants, require numerical optimization. For the latter, unique solutions usually do not exist, so multiple local optima may be found depending on the initialization of the parameters. Visualizations obtained in this way can differ significantly from one run to the next depending on the initialization strategy. However, as van der Maaten and Hinton (2008) argued, this fact is not necessarily a drawback of the technique. Usually high-dimensional data sets cannot be embedded into low dimensions without a loss of information. Often there exists more than one reasonable embedding of the data, so the task is inherently ambiguous. Different local optima of the projection techniques can correspond to different low-dimensional views of the data with the same quality (as measured, for example, using evaluation measures such as proposed by Lee & Verleysen, 2009; and Venna et al., 2010). This argument is in line with our experimental observation that dimension reduction based on t-SNE leads to qualitatively different behavior in different runs. However, the quality of the different results usually does not differ much when using the quality measure proposed by Lee and Verleysen (2009), for instance.

2.10.  Out-of-Sample Extensions.

One benefit of our general formulation is that the optimization steps are separated from the principled mathematical objective of the actual technique at hand. As an immediate consequence, a principled framework for out-of-sample extension can be formalized simultaneously for all techniques. Here, out-of-sample extension refers to the question of how to extend the projection to a novel point x ∈ IR^D if a set of points X is already mapped to projections Y. Assume that a dimension reduction for a given data set is given, characterized by the sequence of points and their projections XY. Assume that a novel data point x is considered. Then a reasonable projection y of this point can be obtained by means of the mapping x ↦ y such that the costs
$$\mathrm{error}\big(c^X(X, x),\, c^Y(XY \bullet (x, y),\, (x, y))\big)$$
are minimized. This term corresponds to the contribution of x and its projection y to the overall costs, equation 2.9, assuming that the projections Y of X are fixed. Simultaneously, the constraints
$$\mathrm{constraints}\big(XY \bullet (x, y)\big)$$
need to be optimized, where XY • (x, y) denotes the concatenation of the sequence XY of known points and their projections with the novel pair (x, y); the coefficients Y are kept fixed, and only the novel projection coordinates y are treated as free parameters. For simple constraints such as those given for MDS, Isomap, and SNE and its variants, this immediately yields a mathematical formalization of out-of-sample extensions. Numerical optimization such as gradient techniques can be used to obtain solutions. For LLE and Laplacian eigenmaps, the constraints are given by an indicator function; the same holds for the constraint ∑_i y_i = 0 for MVU. These constraints can no longer be fulfilled exactly and should be weakened to soft constraints. This has the consequence that, in general, explicit algebraic solutions of the optimization problem are no longer available.

Typically the complexity of this approach depends on the number n of the given data points. Hence, this procedure can be quite time-consuming depending on the data set. Moreover, this mapping leads to an implicit functional prescription in terms of an optimum of a complicated function, which may display local optima.
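
For concreteness, the following sketch (our own illustration, using the MDS-type stress of equation 2.2; other cost functions can be plugged in analogously) maps a single new point by numerically minimizing its contribution to the costs while keeping the given projections Y fixed:

```python
import numpy as np
from scipy.optimize import minimize

def out_of_sample(x, X, Y):
    """Project a new point x given an existing embedding (X, Y), MDS-style.

    Minimizes the contribution of (x, y) to the stress of equation 2.2
    with the projections Y kept fixed.
    """
    dX = np.linalg.norm(X - x, axis=1)          # characteristics of x

    def cost(y):
        dY = np.linalg.norm(Y - y, axis=1)      # characteristics of (x, y)
        return np.sum((dX - dY) ** 2)

    y0 = Y[np.argmin(dX)]                       # warm start: nearest training point
    return minimize(cost, y0, method='Nelder-Mead').x
```

Every such mapping requires time linear in n, which is exactly the dependence on the training set that the explicit mapping functions introduced next avoid.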

In the following, we will substitute the implicit form by an explicit functional prescription, the form of which is fixed a priori. We derive techniques to determine function parameters by means of the given optimization objectives. The fact that nonparametric dimensionality reduction is formalized using a general framework allows us to simultaneously extend all of these methods to explicit mapping functions in a principled way.

3.  Dimension-Reduction Mappings

Due to their dependence on pairwise dissimilarities, the computational effort of most dimensionality-reduction techniques scales quadratically with respect to the number of data points. This makes them infeasible for large data sets. Even linear techniques (such as presented in Bunte, Hammer, Wismüller, & Biehl, 2010) can reach their limits for very large data sets so that sublinear or even constant time techniques are required. Furthermore, it might be inadequate to display all data points given a large data set due to the limited resolution on screens or prints. Therefore, in the literature, often a random subsample of the full data set is picked as representative of the data (see the overviews van der Maaten et al., 2009; and Venna et al., 2010). If more points are added on demand, out-of-sample extension as specified above is necessary.

One crucial property of this procedure is the requirement that the mapping, which is determined from a small subsample, is representative for a mapping of the full data set. Hence, the generalization ability of dimensionality reduction to novel data points must be guaranteed. To our knowledge, the generalization ability of nonparametric dimension reduction has hardly been verified experimentally in the literature (one exception being presented by Suykens, 2008), and no exact mathematical treatments of the generalization ability exist.

Here, we take a different point of view and address the problem of dimensionality reduction by inferring an explicit mapping function. This has several benefits—for example, a mapping function allows an immediate extension to novel data points by simply applying the mapping. Hence, large data sets can be dealt with since the mapping function can be inferred from a small subset only in constant time (assuming constant size of the subset). Mapping all data points requires linear time only. The generalization ability of the mapping function can be addressed explicitly in experiments. We will observe an excellent generalization ability in several examples. Furthermore, the generalization ability can be treated in an exact mathematical way by referring to the mapping function. We will argue that for typical mapping functions, guarantees exist in the framework of statistical learning theory. An additional benefit is that the complexity of the mapping function and its functional form can be chosen a priori, such that auxiliary information (e.g., in terms of class labels), can be integrated into the system.

3.1.  Previous Work.

A few dimensionality-reduction techniques provide an explicit mapping of the data. Linear methods such as PCA and neighborhood-preserving projection optimize the information loss of the projection (Bishop, 2007; He, Cai, Yan, & Zhang, 2005). Extensions to nonlinear functions are given by autoencoder networks, which provide a function given by a multilayer feedforward network in such a way that the reconstruction error is minimized when projecting back with another feedforward network (van der Maaten et al., 2009). Typically training is done by standard backpropagation, directly minimizing the reconstruction error. Manifold charting connects local linear embeddings obtained by local PCA, for example, by minimizing the error on the overlaps (Brand, 2003; Teh & Roweis, 2003). This can be formulated in terms of a generalized eigenvalue problem. Topographic maps such as the self-organizing map and generative topographic mapping characterize data in terms of prototypes, which are visualized in low dimensions (Bishop & Williams, 1998; Kohonen, 1995). Due to clustering, new data points can be visualized by mapping them to the closest prototype or its visualization, respectively.

Some nonparametric dimension-reduction methods, as introduced above, have been extended to global dimension-reduction mappings. For example, locally linear coordination (LLC) (Teh & Roweis, 2003) extends LLE by assuming that local linear projections, such as local PCAs, are available and by combining these using affine transformations. The resulting points are inserted in the LLE cost function, and additional parameters are optimized accordingly. Kernel maps, based on the ideas of kernel eigenmap methods, provide out-of-sample extensions with excellent generalization (Suykens, 2008). Parametric t-SNE (van der Maaten, 2009) extends t-SNE toward an embedding given by a multilayer neural network. The network parameters are determined using backpropagation, where, instead of the mean squared error, the t-SNE cost function is taken as objective. These techniques, however, are often specifically tailored to the functional form of the mapping or the specific properties of the technique. In contrast, we propose a general principle to extend nonparametric dimension reduction to explicit mappings.

3.2.  A General Principle.

A dimension-reduction technique determines an implicit function f : IR^D → IR^d from the full data space to the projection space. A data point x is projected to a low-dimensional counterpart that minimizes the respective cost function and constraints. Depending on the method, f might have a complex form, and its computation might be time-consuming. This computational complexity can be avoided by defining an explicit dimension-reduction mapping function
$$f_W : \mathbb{R}^D \to \mathbb{R}^d, \qquad x \mapsto f_W(x), \qquad (3.1)$$
of fixed form parameterized by W. The general formalization of dimension reduction as a cost optimization allows us to extend nonparametric embedding to an explicit mapping function f_W as follows. We fix a parameterized function f_W : IR^D → IR^d. Instead of the projection coordinates y, we consider the images of the mapping and optimize the parameters W such that the costs
$$E(W) = \sum_i \mathrm{error}\big(c^X(X, x_i),\, c^Y(Xf_W(X),\, (x_i, f_W(x_i)))\big) \qquad (3.2)$$
become minimal under the constraints
$$\mathrm{constraints}\big(Xf_W(X)\big), \qquad (3.3)$$
where Xf_W(X) refers to the sequence ((x_1, f_W(x_1)), …, (x_n, f_W(x_n))).

This principle leads to a well-defined mathematical objective for the mapping parameters W for every dimension-reduction method summarized in Table 1. For out-of-sample extensions, however, hard constraints such as those imposed for LLE, MVU, and Laplacian eigenmaps can no longer be fulfilled exactly and should be relaxed to soft constraints. This has the consequence that the optimization problem differs from the one in the original method. A closed-form solution as given for, say, spectral methods might no longer be available for a general functional form fW and soft constraints.

The functional form fW needs to be specified a priori. It can be chosen as a global linear function, a combination of locally linear projections, a feedforward neural network, or any parameterized, possibly nonlinear, function. If gradient techniques are used for the optimization of the parameters W, fW has to be differentiable with respect to W. The functional form of fW defines the flexibility of the resulting dimensionality-reduction mapping. Naturally, restricted choices such as linear forms lead to less flexibility than universal approximators such as feedforward networks or general kernel maps.

Note that this provides a general framework that extends dimensionality-reduction techniques in order to obtain explicit mapping functions. The ingredients are formally defined for all methods specified in Table 1. This gives a mathematical objective for all functional forms of fW and all these methods, provided hard constraints of LLE and similar are softened in such a way that feasible solutions result. The objectives can be optimized directly using universal optimization techniques such as gradient methods or local search techniques. Explicit algebraic solutions as given for the original spectral techniques are no longer available, however. Furthermore, the numeric optimization task can be difficult in practice.

Since every possible dimension-reduction technique and every choice of the form fW leads to a different method, an extensive evaluation of all possible choices is beyond the scope of this letter. In the next section, we consider example algorithms for two specific mapping functions: a global linear one and a nonlinear mapping based on local linear projections in the t-SNE formalism. For the latter setting, we first demonstrate the feasibility of the results in the unsupervised setting for local linear maps in comparison to feedforward networks used for dimension reduction. Then we demonstrate the possibility of integrating supervised label information into the technique by means of a bias of the functional form of fW.

4.  Linear t-SNE Mapping

In this section we derive the formulation based on a linear hypothesis for the mapping, optimized according to the t-SNE cost function. In this case the mapping,
$$f_A : \mathbb{R}^D \to \mathbb{R}^d, \qquad x \mapsto A x, \qquad (4.1)$$
is expressed in terms of a rectangular matrix A, which defines a linear transformation from IR^D to IR^d. This matrix can be optimized by a stochastic gradient descent procedure using the gradient of the t-SNE cost function (see equation 2.5),
$$\frac{\partial E_{t\text{-SNE}}}{\partial A} = \sum_i \frac{\partial E_{t\text{-SNE}}}{\partial y_i} \cdot \frac{\partial y_i}{\partial A} \quad \text{with } y_i = A x_i.$$
Using the Euclidean distance d^Y_ij = ‖Ax_i − Ax_j‖ in the embedding space, it follows that
$$\frac{\partial \big(d^Y_{ij}\big)^2}{\partial A} = 2\,(A x_i - A x_j)(x_i - x_j)^t,$$
and, hence,
$$\frac{\partial E_{t\text{-SNE}}}{\partial A} = 2 \sum_i \sum_j (p_{ij} - q_{ij})\, \frac{(A x_i - A x_j)(x_i - x_j)^t}{1 + \|A x_i - A x_j\|^2}. \qquad (4.2)$$
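
The gradient of equation 4.2 translates into a few lines of numpy. The sketch below is our own illustration: it assumes that the symmetrized input probabilities P = (p_ij) have been computed beforehand (for example, with a standard perplexity-based search) and performs plain batch gradient descent instead of the stochastic procedure with adaptive learning rates used in the experiments.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def linear_tsne(X, P, d=2, n_iter=500, lr=0.5, seed=0):
    """Optimize a linear map A (equation 4.1) under the t-SNE cost.

    X: (n, D) data, P: (n, n) symmetrized input probabilities p_ij.
    Returns A such that the projections are y_i = A x_i.
    """
    rng = np.random.default_rng(seed)
    A = 1e-2 * rng.standard_normal((d, X.shape[1]))
    for _ in range(n_iter):
        Y = X @ A.T                                     # current projections
        num = 1.0 / (1.0 + squareform(pdist(Y)) ** 2)   # student-t kernel
        np.fill_diagonal(num, 0.0)
        Q = num / num.sum()                             # q_ij, equation 2.7
        G = (P - Q) * num                               # symmetric weights g_ij
        L = np.diag(G.sum(axis=1)) - G                  # graph-Laplacian form
        grad = 4.0 * (Y.T @ L @ X)                      # = 2 sum_ij g_ij (y_i - y_j)(x_i - x_j)^t
        A -= lr * grad                                  # gradient step, equation 4.2
    return A
```
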
An example result of this algorithm on a three-dimensional benchmark data set is compared to simple PCA. The data contains three gaussians arranged on top of each other (see the upper left panel of Figure 1). Because of the variance in the z-direction, PCA projects the modes onto each other, losing the cluster information (see the lower left panel in Figure 1). The linear mapping obtained by the optimization of the t-SNE cost function (referred to as DiReduct mapping), on the other hand, shows a much clearer separation of the original clusters (see the upper right panel of Figure 1). This is due to the preservation of local structures formulated in the t-SNE objective rather than the preservation of global variances as used in PCA.
Figure 1:

Simulation results on a three-class benchmark data set using PCA and a global linear map optimizing the t-SNE cost function, respectively. The latter leads to a better separation due to its local nature, which can be formally evaluated using the measure of intrusion and extrusion on the resulting mapping.

A quantitative evaluation of the two mappings is also included in the lower right panel of Figure 1, based on the quality measure proposed by Lee and Verleysen (2008, 2009). Basically, it relies on k-intrusions and k-extrusions; that is, it compares the k-ary neighborhoods given in the original high-dimensional space with those occurring in the low-dimensional space. Intrusion refers to samples that intrude into a k-ary neighborhood in the embedding space although they do not belong to it in the original space, while extrusion counts the number of samples missing from the projected k-ary neighborhoods. The overall quality measure Q measures the percentage of data that is neither k-intrusive nor k-extrusive. In the optimal case, all neighborhoods are exactly preserved, which results in a value of Q = 1. B measures the percentage of k-intrusions minus the percentage of k-extrusions in the projection and therefore shows the tendency of the mapping method: techniques with negative values for B are characterized by extrusive behavior, and those with positive values tend to be more intrusive.

Obviously DiReduct shows a superior quality, in particular, for small neighborhood ranges, since it preserves local structures of the data to a larger extent. Further, unlike PCA, which displays a trend toward highly intrusive behavior, it is rather neutral in the mapping character, being mildly extrusive for medium values of k.
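
The neighborhood-based evaluation can be approximated in a few lines; the sketch below computes the average fraction of preserved k-ary neighborhoods, a simplified proxy in the spirit of the quality measure of Lee and Verleysen (2009) rather than their exact definition.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def neighborhood_preservation(X, Y, k=10):
    """Average fraction of k nearest neighbors shared between both spaces.

    A value of 1 means every k-ary neighborhood is preserved exactly;
    intrusions and extrusions both lower the score.
    """
    def knn_sets(Z):
        D = squareform(pdist(Z))
        np.fill_diagonal(D, np.inf)                 # a point is not its own neighbor
        return [set(np.argsort(row)[:k]) for row in D]

    NX, NY = knn_sets(X), knn_sets(Y)
    return np.mean([len(a & b) / k for a, b in zip(NX, NY)])
```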

5.  Local Linear t-SNE Mappings

In this section, we consider nonlinear mapping functions obtained by the principles outlined above. Again, we employ the t-SNE cost function. The functional form fW is chosen in two different ways. First, we consider fW given by a multilayer feedforward network as proposed by van der Maaten (2009). The update equations for a feedforward network can be derived from the t-SNE cost function and are similar to standard back-propagation (see van der Maaten, 2009, for details).

Second, we consider a locally linear projection based on local mappings obtained by prototype-based techniques such as neural gas in combination with local PCA or mixtures of probabilistic PCA (Möller & Hoffmann, 2004). The latter techniques provide a set of prototypes w_k ∈ IR^D, dividing the data space into k receptive fields, and corresponding local projections Ω_k ∈ IR^{m×D} with m ≤ D. We assume that locally linear projections of the data points are derived from one of these techniques,
$$p_k(x_i) = \Omega_k\,(x_i - w_k), \qquad (5.1)$$
with local matrices Ω_k and prototypes w_k. We assume furthermore the existence of responsibilities r_ik of the local mapping p_k for data point x_i, where ∑_k r_ik = 1. In the following, we choose simple responsibilities based on the receptive fields:
$$r_{ik} = \begin{cases} 1 & \text{if } \|x_i - w_k\| \le \|x_i - w_l\| \text{ for all } l, \\ 0 & \text{otherwise.} \end{cases} \qquad (5.2)$$
More generally, a point x is associated with the responsibilities rk(x) in the same way. A global nonlinear mapping function combines these linear projections,
$$f_W : x \mapsto \sum_k r_k(x)\,\big(A_k\, p_k(x) + o_k\big), \qquad (5.3)$$
using local linear projections Ak and local offsets ok to align the local pieces.

The number of parameters W that have to be determined depends on the number of local projections k and their dimension m. Usually it is much smaller than the number of parameters when projecting all points y_i directly. Hence, it is sufficient to consider only a small part of the given training data in order to obtain a valid dimension reduction.
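
Applying the mapping of equation 5.3 with the crisp responsibilities of equation 5.2 amounts to a winner-takes-all lookup followed by an affine transformation; the sketch below is our own illustration of this forward pass (parameter names and shapes are chosen to match equations 5.1 to 5.3).

```python
import numpy as np

def local_linear_map(x, prototypes, Omegas, As, offsets):
    """Apply f_W(x) = sum_k r_k(x) (A_k p_k(x) + o_k) with crisp responsibilities.

    prototypes: (K, D), Omegas: (K, m, D), As: (K, d, m), offsets: (K, d).
    Only the receptive field of the closest prototype contributes (equation 5.2).
    """
    k = int(np.argmin(np.linalg.norm(prototypes - x, axis=1)))  # winning prototype
    p_k = Omegas[k] @ (x - prototypes[k])                       # local projection, eq. 5.1
    return As[k] @ p_k + offsets[k]                             # affine alignment, eq. 5.3
```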

We determine the parameters by a stochastic gradient descent based on the derivative of the t-SNE cost function:
$$\frac{\partial E_{t\text{-SNE}}}{\partial A_k} = 4 \sum_i \sum_j (p_{ij} - q_{ij})\, \frac{y_i - y_j}{1 + \|y_i - y_j\|^2}\; r_{ik}\; p_k(x_i)^t \qquad (5.4)$$
and
$$\frac{\partial E_{t\text{-SNE}}}{\partial o_k} = 4 \sum_i \sum_j (p_{ij} - q_{ij})\, \frac{y_i - y_j}{1 + \|y_i - y_j\|^2}\; r_{ik}, \qquad (5.5)$$
with y_i = f_W(x_i) and the probabilities p_ij and q_ij of equations 2.6 and 2.7,
assuming Euclidean distance in the projection space, as before.

As an example, we show the results obtained on the UCI image segmentation data set. It consists of seven classes and 2,310 instances of 3 × 3 regions randomly drawn from seven hand-segmented outdoor images. Three of the 19 features were not taken into account because they show no variance. We scaled the features by dividing by the maximal feature value in the data, followed by PCA reducing the dimension to 10.

For the locally linear projection, we run the neural gas algorithm (Martinetz & Schulten, 1991; Cottrell, Hammer, Hasenfuss, & Villmann, 2006) with 14 prototypes to get a division of the data space into receptive fields. PCA was applied to every receptive field to define local transformations Ωk. Together with the respective prototypes wk, this offers the corresponding data projections pk(xi) (see equation 5.1).

The transformations Ak ∈ IR2×10 were set as rectangular matrices to perform the dimension reduction from 10 to 2 dimensions. The offsets ok are vectors in IR2. The mapping parameters were initialized with small random values, and a stochastic gradient descent was performed with 300 epochs and a learning rate annealed from 0.3 to 0.01. The perplexity of t-SNE was set to 50.

For the neural network embedding, we use parametric t-SNE with default parameters as provided in the implementation (given by van der Maaten, 2009). An optimum network architecture was picked, varying the number of neurons from 50 to 2000 per hidden layer. The architecture is given by a [100 100 500 2]-layer neural network. The perplexity was optimized on the data and picked as 25.

For comparison, we present the result of a powerful parametric dimensionality-reduction technique, generative topographic mapping (GTM; Bishop & Williams, 1998). It obtains a prototype-based mapping by simultaneous clustering and topographic ordering during training. We use a 10 × 10 lattice for data visualization and 3 × 3 base functions for mapping the latent space to the data space. Convergence after 20 epochs was observed. The results for a locally linear t-SNE mapping and parametric t-SNE in comparison to GTM are shown in Figure 2. In both cases, we used a subset of roughly 10% for training, and we report the results of the mapping on the training set and on the full data set. Interestingly, as measured by the quality of the mapping, t-SNE combined with a locally linear mapping is superior to the neural network-based approach and to GTM. Since the data set is labeled, an evaluation of the projection in terms of the nearest-neighbor classification error is also possible. The 5-nearest-neighbor error for the whole preprocessed data after PCA to 10 dimensions is 0.054. After further dimension reduction, this error increases due to the loss of information. For the locally linear mapping, the 5-nearest-neighbor error is 0.21 for the training set and 0.16 for the full data set; the corresponding projections are shown in the upper panels. The panels in the middle show the corresponding mappings achieved by parametric t-SNE (5-nearest-neighbor error: 0.5 in training and 0.32 for the whole set, respectively). The bottom panel contains the evaluation of the mappings using the quality measure proposed by Lee and Verleysen (2008, 2009).

Figure 2:

Projection of the image segmentation data set using parametric t-SNE, GTM, and DiReduct combining unsupervised clustering and the learning of a mapping. The result of the subsample used for training (left panels) and the full data set (right panels) are depicted. The intrusion/extrusion quality on the whole data set for all methods is shown in the bottom panel.

Interestingly, both functional forms show a good generalization ability in the sense that the error on the full data set resembles the error on the training set. However, the results of the locally linear mappings are superior to the feedforward mapping in both cases.

6.  Supervised Dimensionality-Reduction Mapping

Mapping high-dimensional data to low dimensions is connected to an information loss, and depending on the dimensionality-reducing visualization technique, different data visualizations are derived. Since many methods such as t-SNE do not yield a unique solution, it can happen that a data set is visualized in different ways by a single dimensionality-reducing visualization technique in different runs. It can be argued (see van der Maaten & Hinton, 2008) that this effect is desirable since it mirrors different possible views of the given data, reflecting the ill-posedness of the problem. Auxiliary information in the form of class labels can be useful to shape the problem in such settings and to resolve (parts of) the inherent ambiguities. Aspects of the data that are of particular relevance for the given class labels should be included in the visualization, while aspects that are less important for the given labeling can be neglected. Thus, additional information, such as label information, can improve the results of dimension reduction by reducing possible noise in the data and keeping the information essential to discriminate the classes.

This observation has led to the development of a variety of visualization techniques that take given labels into account. These methods still map the original data to low dimensions, but they do so using the additional information. Examples of such methods include linear discriminant analysis and variations, supervised NeRV, supervised Isomap, and multiple relational embedding; Venna et al. (2010), for example, give a recent overview and compare various methods for supervised data visualization. Here, we essentially repeat the experiments proposed in Venna et al. (2010) to demonstrate the suitability of our general method for incorporating auxiliary information into the data visualization.

In this section, we show some examples of the proposed method based on the t-SNE cost function, employing supervised local linear projections pk(xi) (see equation 5.1). Here, the parameters Ωk and wk are acquired by a supervised prototype-based classifier, limited rank matrix learning vector quantization (LiRaM LVQ; Bunte et al., 2010; Schneider, Biehl, & Hammer, 2009). We compare the results to alternative state-of-the-art techniques on the three data sets mimicking the experiments by Venna et al. (2010):

  • The letter recognition data set (referred to as Letter in the following) from the UCI Machine Learning Repository (Asuncion, Newman, Hettich, Blake, & Merz, 1998). It is a 16-dimensional data set of 4 × 4 images of the 26 capital letters of the alphabet. These 26 classes are based on 20 different distorted fonts. In total, 20,000 data points are given.

  • The Phoneme data set taken from LVQ-PAK (Kohonen, Hynninen, Kangas, Laaksonen, & Torkkola, 1996) consists of 20-dimensional feature vectors representing phoneme samples stemming from 13 different classes.

  • The Landsat satellite data set is contained in the UCI Machine Learning Repository. Each of the 6,435 36-dimensional vectors corresponds to a 3 × 3 satellite image measured in four spectral bands. The six classes indicate the terrain type in the image: red soil, cotton crop, gray soil, damp gray soil, soil with vegetation stubble, and very damp gray soil.

For these data sets, we consider a projection to two dimensions by means of a locally linear function, as before, characterized by the functional form of equation 5.3. Unlike the previous setting, this form is biased toward the given class information, because the local projections pk are determined by means of a supervised prototype-based projection method. We used LiRaM LVQ with the rank of the matrices limited to 10 (for Letter and Phoneme) and 30 (for Landsat), respectively. Based on this setting, the offsets ok are determined by means of the prototypes, and the initial projections Ωk are given directly by the square root of the matrices obtained by LiRaM LVQ to obtain good class separation. Correspondingly, the parameter matrices Ak map from 10 or 30 dimensions to 2 dimensions in this case. The supervised training of the initial functional form of the mapping function, equation 5.3, by means of LiRaM LVQ as well as the (unsupervised) training of the free parameters of the mapping function takes place using only a small subset of the full data set (7%–18%) while the evaluation takes into account the full data set.

The goal of supervised dimension reduction is the preservation of classification performance and is hence quite different from classical unsupervised dimension reduction. In consequence, the quality assessment of the final embedding should be done differently. Here, following the approach of Venna et al. (2010), we measure the 5-nearest-neighbor-classification error (5NN error) of the resulting visualizations achieved in a 10-fold cross-validation scheme. We compare the result obtained by locally linear projection based on the t-SNE cost function and a functional form biased by a discriminative prototype-based classifier as specified above to several state-of-the-art supervised nonlinear embedding methods taken from Venna et al. (2010):

  • Supervised NeRV (SNeRV; Venna et al., 2010), which uses input distances induced by the Fisher information from a nonparametric supervised classifier.

  • Multiple relational embedding (MRE; Memisevic & Hinton, 2005), an extension of SNE accommodating additional characteristics of the data space or subspaces provided as similarity relations known a priori to the user.

  • Colored maximum variance unfolding (Song et al., 2008), an extension of unsupervised maximum variance unfolding. It is also called maximum unfolding via Hilbert-Schmidt independence criterion (MUHSIC) because it maximizes the dependency between the embedding coordinates and the labels.

  • Supervised Isomap (S-Isomap; Geng et al., 2005), an extension of unsupervised Isomap in which the distances are modified to incorporate label information in an ad hoc manner.

  • Parametric embedding (PE; Iwata et al., 2007), which aims at preserving the topology of the original data by minimizing a sum of Kullback-Leibler divergences between a gaussian mixture model in the original space and one in the embedding space.

  • Neighborhood component analysis (NCA; Goldberger, Roweis, Hinton, & Salakhutdinov, 2004), which adapts a metric by finding a linear transformation of the original data such that the average leave-one-out k-nearest-neighbor classification performance is maximized in the transformed space.

These methods constitute representative supervised visualization techniques that enrich dimensionality reduction by incorporating given label information in various forms.

The error rates of the nearest-neighbor classification (using squared Euclidean distance) on the original high-dimensional data set and after dimension reduction with the different methods are shown in Figure 3. In contrast to our method, the other techniques were evaluated on only a small subset of the data sets (1500 sampled points), because they are based on embedding single points. For our approach, we train on a subsample of only 7% but also report the results on the full data set obtained by the explicit mapping. Note that the classification error obtained by an explicit mapping biased according to auxiliary information is smaller than that of the alternatives for all three data sets. It is remarkable that the error in the reduced space is also comparable to the error on the high-dimensional data for most data sets. For the Phoneme data set, supervised dimension reduction leads to even better separation of the classes than in the original space. Hence the proposed method displays excellent generalization, thereby offering an efficient technique to deal with large data sets by inferring a mapping on a small subset only. Example visualizations of the proposed method are displayed in Figure 4. A clear class structure is visible, especially for the data sets Letter and Phoneme. Interestingly, the Letter clusters arrange in a quite intuitive way: O, Q, G, and C stay close together, as do M, N, and H. The qualitative characteristics of the projections are the same for the training data and the full data sets, displaying the excellent generalization ability of the proposed method.
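For reference, the evaluation protocol can be summarized as in the following sketch. It assumes a hypothetical `train_mapping` routine (e.g., fitting the parameters of the sketch above under the t-SNE costs) and uses scikit-learn for the 10-fold 5NN error; it is not the code used for the reported experiments.

```python
# Hedged sketch of the evaluation protocol: learn the mapping on a small
# random subsample, embed the full data set with the explicit map, and report
# the 5-nearest-neighbor error in a 10-fold cross-validation.
# `train_mapping` is a placeholder (not part of any library) that returns a
# callable f_W projecting data to two dimensions.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def knn_error_after_mapping(X, y, train_mapping, subsample=0.07, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=int(subsample * len(X)), replace=False)
    f_W = train_mapping(X[idx], y[idx])          # fit on ~7% of the data only
    Y = f_W(X)                                   # explicit out-of-sample mapping
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), Y, y, cv=10)
    return 1.0 - scores.mean()                   # 5NN error on the full data set
```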

Figure 3: Five-nearest-neighbor errors of all methods on the three data sets.


Figure 4: Example visualizations of the data sets in two dimensions.


7.  Generalization Ability and Computational Complexity

The introduction of a general view on dimension reduction as cost optimization extends existing techniques to large data sets by subsampling. A mapping function fW based on a small data subset is obtained that extends the embedding to arbitrary points x coming from the same distribution as the training samples. In this context, it is of particular interest whether the procedure can be substantiated by mathematical guarantees concerning its generalization ability: does a mapping achieve good quality on arbitrary data, provided it yields satisfactory embeddings on the finite subset used to determine its parameters?

A formal evaluation measure of dimensionality reduction has been proposed by Lee and Verleysen (2009) and Venna et al. (2010), based on the measurement of local neighborhoods and their preservation under the projection of the data. Since these measures rely on a finite number of neighbors, they are not suitable as evaluation measures for arbitrary data distributions in $\mathbb{R}^N$. Furthermore, restrictions on the applicability of these quality measures to evaluate clustered data have been published recently by Mokbel, Gisbrecht, and Hammer (2010).

7.1.  A Possible Formalization.

As Lee and Verleysen (2009) pointed out, one alternative objective of dimension reduction is to preserve the available information as much as possible. This objective is rarely used to evaluate nonparametric dimensionality reduction because, lacking an explicit mapping, it cannot be evaluated. Given an explicit mapping, however, it can act as a valid evaluation measure. The error of a dimensionality-reduction mapping f is defined as
$$E(P) \;=\; \int \bigl\| x - f^{-1}(f(x)) \bigr\|^2 \, dP(x), \tag{7.1}$$
where P denotes the probability measure according to which the data x are distributed in $\mathbb{R}^N$ and $f^{-1}$ denotes an approximate inverse mapping of f. An exact inverse might not exist in general, but local inversion is usually possible apart from sets of measure zero. In practice, the full data manifold is not available, so this objective can be neither evaluated nor optimized given a finite data set. Rather, the empirical error,
$$\hat{E}_n(f) \;=\; \frac{1}{n} \sum_{i=1}^{n} \bigl\| x_i - f^{-1}(f(x_i)) \bigr\|^2, \tag{7.2}$$
can be computed based on given data samples xi. A dimension-reduction mapping shows good generalization ability iff the empirical error is representative of the true error E(P).
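For a concrete mapping and an approximate inverse, the empirical error of equation 7.2 reduces to a mean squared reconstruction error over the sample. The following minimal sketch treats both callables as placeholders, since the letter does not prescribe how the approximate inverse is constructed.

```python
# Minimal sketch of the empirical error of equation 7.2: the mean squared
# reconstruction error of a mapping f and an approximate inverse f_inv on a
# finite sample X (rows = data points). Both callables are assumptions.
import numpy as np

def empirical_error(X, f, f_inv):
    """(1/n) * sum_i ||x_i - f_inv(f(x_i))||^2 over the rows of X."""
    X_hat = f_inv(f(X))
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))
```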

If the form of f is fixed prior to training, we can specify a function class $\mathcal{F}$ with $f \in \mathcal{F}$ independent of the given training set. Assuming representative vectors xi are chosen independently and identically distributed (i.i.d.) according to P, the question is whether the empirical error allows us to bound the true error E(P) we are interested in. As usual, such bounds should hold simultaneously for all possible functions in $\mathcal{F}$ to circumvent the problem that the function f is chosen according to the given training data.

This setting can be captured in the classical framework of computational learning theory, as specified by Bartlett and Mendelson (2003). We can adapt their theorem 8 to our setting. We assume that the norm of the input data is limited to the unit ball. Prior normalization may be necessary, which would be mirrored by corresponding constants in the bounds. We consider the loss function
$$L(x, f) \;=\; \bigl\| x - f^{-1}(f(x)) \bigr\|^2. \tag{7.3}$$
Then, as Bartlett and Mendelson (2003, theorem 8) reported, assuming i.i.d. data according to P, for any confidence δ ∈ (0, 1) and every $f \in \mathcal{F}$, the relation
$$E(P) \;\le\; \hat{E}_n(f) \;+\; R_n(\mathcal{L}_{\mathcal{F}}) \;+\; \epsilon_n(\delta) \tag{7.4}$$
holds with probability at least 1 − δ, where
$$\epsilon_n(\delta) \;=\; \sqrt{\frac{8\,\ln(2/\delta)}{n}}, \tag{7.5}$$
$\mathcal{L}_{\mathcal{F}} = \{x \mapsto L(x, f) \mid f \in \mathcal{F}\}$ denotes the induced loss class, and $R_n$ refers to the so-called Rademacher complexity of the function class. The Rademacher complexity constitutes a quantity that, similar to the Vapnik-Chervonenkis dimension, estimates the capacity of a given function class (see Bartlett & Mendelson, 2003). The Rademacher complexity of many function classes (such as piecewise constant functions, piecewise linear functions with a fixed number of pieces, or polynomials of fixed degree) can be limited by a term that scales as $n^{-1/2}$. (See Bartlett & Mendelson, 2003, for structural results and explicit bounds for linear functions, and Schneider et al., 2009, for explicit bounds on piecewise constant functions as induced by prototype-based clustering.) This result implies that the generalization ability of dimensionality-reduction mappings as considered above can be guaranteed in principle, since the Rademacher complexity of the class $\mathcal{F}$ can be limited in our settings. It remains a subject of future research to find explicit good bounds.
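As a small, self-contained illustration (not taken from the setting above), the empirical Rademacher complexity of the class of linear functionals $x \mapsto \langle w, x \rangle$ with $\|w\| \le B$ admits the closed form $(B/n)\,\mathbb{E}_\sigma \bigl\| \sum_i \sigma_i x_i \bigr\|$, which can be estimated by Monte Carlo over the random sign vectors $\sigma$ and makes the $n^{-1/2}$ scaling visible:

```python
# Illustrative Monte Carlo estimate of the empirical Rademacher complexity of
# linear functionals x -> <w, x> with ||w|| <= B on a sample X (rows = points).
# For this class the supremum over w has the closed form (B/n) * ||sum_i s_i x_i||.
import numpy as np

def rademacher_linear(X, B=1.0, n_draws=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))   # random sign vectors
    sups = B * np.linalg.norm(sigma @ X, axis=1) / n     # sup over ||w|| <= B
    return sups.mean()
```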

7.2.  Complexity.

Assume a set X of points is given. Most dimensionality-reduction techniques are computationally quite demanding due to the form of the overall costs (see equation 2.9). Since the characteristics map sequences of points to sequences of real values of the same length, the computation of equation 2.9 scales at least as $\mathcal{O}(|X|^2)$. This is infeasible for large X. Out-of-sample extensions by means of an implicit mapping depend on a subset X0 ⊆ X only. If the principle derived in this letter is used, the corresponding complexity is $\mathcal{O}(|X_0|^2 + |X| \cdot |X_0|)$, since only the subset X0 is mapped using the original method; afterward, each remaining point x ∈ X is mapped by separately optimizing its costs with respect to X0, the latter requiring $\mathcal{O}(|X_0|)$ operations for every x. Thus, this approach substantially reduces the effort depending on the size of X0, but it does not easily allow control over the form of the mapping or the integration of prior label information. By choosing an explicit functional form, the complexity is further reduced to $\mathcal{O}(|X_0|^2 + |X| \cdot |W|)$, assuming an effort of $\mathcal{O}(|W|)$ to evaluate fW. Since usually |X| ≫ |X0| ≫ |W|, this constitutes a further considerable reduction of the time required to map all points.
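For concreteness, the three regimes can be compared with a back-of-the-envelope count; the sizes below are illustrative only (they roughly mirror a setting with 20,000 points, a training subsample of 1,500, and a modest number of mapping parameters) and are not figures reported in the letter.

```python
# Back-of-the-envelope comparison of the three cost regimes discussed above,
# using the counts |X|^2, |X0|^2 + |X|*|X0|, and |X0|^2 + |X|*|W| (up to constants).
# The concrete sizes are illustrative assumptions.
n_X, n_X0, n_W = 20000, 1500, 100    # full data, training subset, mapping parameters

full_embedding   = n_X ** 2                    # all pairwise characteristics
out_of_sample    = n_X0 ** 2 + n_X * n_X0      # embed X0, then each x against X0
explicit_mapping = n_X0 ** 2 + n_X * n_W       # embed X0, then evaluate f_W per x

print(full_embedding, out_of_sample, explicit_mapping)
# 400000000 vs. 32250000 vs. 4250000 elementary operations
```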

8.  Conclusion

We have reformulated dimension reduction as an optimization problem based on structural characteristics. As a consequence, many popular nonparametric dimension-reduction techniques can simultaneously be extended to learn an explicit mapping function. The optimization of a parameterized mapping function for dimension reduction is beneficial in several ways. Large data sets can be dealt with because the mapping function can be learned on a small random subset of the data. Furthermore, this framework allows us to consider the generalization ability of dimension reduction since an explicit cost function is available in terms of the reconstruction error. Interestingly, bounds as derived in the context of computational learning theory can be transferred to this setting.

We showed the suitability of the approach by integrating global linear and locally linear projections into the t-SNE dimension-reduction method on different data sets. Furthermore, we showed the integration of auxiliary (e.g., class) information into the framework.

The proposed general framework is very flexible and can be combined with every possible form of mapping function. The investigation of alternative dimension-reduction mappings based on other cost functions and other functional forms of the mapping, as well as the derivation of explicit bounds on the generalization ability, will be the subject of future work. Thus far, the settings have been restricted to Euclidean data due to the form of the mapping fW. Naturally, more general forms could be considered that can take more complex, non-Euclidean data as inputs, such as mappings based on a general dissimilarity characterization of the data. Since such data cannot, in general, be embedded in a Euclidean vector space, qualitatively different results could occur. A corresponding investigation is the subject of future research.

Acknowledgments

This work was supported by the Nederlandse organisatie voor Wetenschappelijke Onderzoek (NWO) under project 612.066.620 and by the German Science Foundation (DFG) under grant number HA-2719/4-1. We also gratefully acknowledge financial support from the Cluster of Excellence 277 Cognitive Interaction Technology funded in the framework of the German Excellence Initiative.

References

Asuncion, A., Newman, D. J., Hettich, S., Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. http://archive.ics.uci.edu/ml/.

Bae, S.-H., Choi, J. Y., Qiu, J., & Fox, G. C. (2010). Dimension reduction and visualization of large high-dimensional data via interpolation. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (pp. 203–214). New York: ACM.

Bartlett, P. L., & Mendelson, S. (2003). Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3, 463–482.

Baudat, G., & Anouar, F. (2000). Generalized discriminant analysis using a kernel approach. Neural Computation, 12, 2385–2404.

Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15, 1373–1396.

Bishop, C. (2007). Pattern recognition and machine learning. Berlin: Springer.

Bishop, C. M., & Williams, C. K. I. (1998). GTM: The generative topographic mapping. Neural Computation, 10, 215–234.

Brand, M. (2003). Charting a manifold (Tech. Rep. No. 15). Cambridge, MA: Mitsubishi Electric Research Laboratories.

Bunte, K., Hammer, B., Wismüller, A., & Biehl, M. (2010). Adaptive local dissimilarity measures for discriminative dimension reduction of labeled data. Neurocomputing, 73(7–9), 1074–1092.

Carreira-Perpiñán, M. Á. (2010). The elastic embedding algorithm for dimensionality reduction. In 27th Int. Conf. Machine Learning (ICML 2010) (pp. 167–174). Madison, WI: Omnipress.

Cottrell, M., Hammer, B., Hasenfuss, A., & Villmann, T. (2006). Batch and median neural gas. Neural Networks, 19(6–7), 762–771.

Geng, X., Zhan, D.-C., & Zhou, Z.-H. (2005). Supervised nonlinear dimensionality reduction for visualization and classification. IEEE Transactions on Systems, Man, and Cybernetics—Part B: Cybernetics, 35, 1098–1107.

Goldberger, J., Roweis, S., Hinton, G., & Salakhutdinov, R. (2004). Neighbourhood components analysis. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 513–520). Cambridge, MA: MIT Press.

Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.

He, X., Cai, D., Yan, S., & Zhang, H.-J. (2005). Neighborhood preserving embedding. In Tenth IEEE International Conference on Computer Vision (Vol. 2, pp. 1208–1213). Piscataway, NJ: IEEE.

Hinton, G., & Roweis, S. (2003). Stochastic neighbor embedding. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 833–840). Cambridge, MA: MIT Press.

Iwata, T., Saito, K., Ueda, N., Stromsten, S., Griffiths, T. L., & Tenenbaum, J. B. (2007). Parametric embedding for class visualization. Neural Computation, 19, 2536–2556.

Kaski, S., Sinkkonen, J., & Peltonen, J. (2001). Bankruptcy analysis with self-organizing maps in learning metrics. IEEE Transactions on Neural Networks, 12, 936–947.

Keim, D. A., Mansmann, F., Schneidewind, J., Thomas, J., & Ziegler, H. (2008). Visual analytics: Scope and challenges. In S. Simoff, M. H. Boehlen, & A. Mazeika (Eds.), Visual data mining: Theory, techniques and tools for visual analytics. Berlin: Springer.

Kohonen, T. (1995). Self-organizing maps. Berlin: Springer.

Kohonen, T., Hynninen, J., Kangas, J., Laaksonen, J., & Torkkola, K. (1996). LVQ PAK: The learning vector quantization program package (Tech. Rep. No. A30). Espoo, Finland: Helsinki University of Technology.

Lee, J., & Verleysen, M. (2007). Nonlinear dimensionality reduction. Berlin: Springer.

Lee, J. A., & Verleysen, M. (2008). Rank-based quality assessment of nonlinear dimensionality reduction. In 16th European Symposium on Artificial Neural Networks (pp. 49–54). Berlin: Springer.

Lee, J. A., & Verleysen, M. (2009). Quality assessment of dimensionality reduction: Rank-based criteria. Neurocomputing, 72(7–9), 1431–1443.

Ma, B., Qu, H., & Wong, H. (2007). Kernel clustering-based discriminant analysis. Pattern Recognition, 40(1), 324–327.

Martinetz, T., & Schulten, K. (1991). A "neural-gas" network learns topologies. Artificial Neural Networks, 1, 397–402.

Memisevic, R., & Hinton, G. (2005). Multiple relational embedding. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 913–920). Cambridge, MA: MIT Press.

Mokbel, B., Gisbrecht, A., & Hammer, B. (2010). On the effect of clustering on quality assessment measures for dimensionality reduction. In Neural Information Processing Systems Workshop on Challenges of Data Visualization. Cambridge, MA: MIT Press.

Möller, R., & Hoffmann, H. (2004). An extension of neural gas to local PCA. Neurocomputing, 62, 305–326.

Peltonen, J., Klami, A., & Kaski, S. (2004). Improved learning of Riemannian metrics for exploratory analysis. Neural Networks, 17, 1087–1100.

Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323–2326.

Sammon, J. W. (1969). A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, C-18(5), 401–409.

Schneider, P., Biehl, M., & Hammer, B. (2009). Adaptive relevance matrices in learning vector quantization. Neural Computation, 21(12), 3532–3561.

Song, L., Smola, A. J., Borgwardt, K. M., & Gretton, A. (2008). Colored maximum variance unfolding. In J. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.), Advances in neural information processing systems, 20. Cambridge, MA: MIT Press.

Suykens, J. A. K. (2008). Data visualization and dimensionality reduction using kernel maps with a reference point. IEEE Transactions on Neural Networks, 19(9), 1501–1517.

Teh, Y. W., & Roweis, S. (2003). Automatic alignment of local representations. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 841–848). Cambridge, MA: MIT Press.

Tenenbaum, J. B., de Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323.

Torgerson, W. (1952). Multidimensional scaling, I: Theory and method. Psychometrika, 17, 401–419.

van der Maaten, L. J. P. (2009). Learning a parametric embedding by preserving local structure. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS) (pp. 384–391). JMLR.

van der Maaten, L. J. P., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.

van der Maaten, L. J. P., Postma, E. O., & van den Herik, H. J. (2009). Dimensionality reduction: A comparative review (Tech. Rep. No. TiCC-TR 2009-005). Tilburg, Netherlands: Tilburg University.

Vandenberghe, L., & Boyd, S. (1994). Semidefinite programming. SIAM Review, 38, 49–95.

Venna, J., Peltonen, J., Nybo, K., Aidos, H., & Kaski, S. (2010). Information retrieval perspective to nonlinear dimensionality reduction for data visualization. Journal of Machine Learning Research, 11, 451–490.

Weinberger, K. Q., & Saul, L. K. (2006). An introduction to nonlinear dimensionality reduction by maximum variance unfolding. In Proceedings of the 21st National Conference on Artificial Intelligence. Boston: AAAI Press.