Abstract

Topographic maps such as the self-organizing map (SOM) or neural gas (NG) constitute powerful data mining techniques that allow simultaneously clustering data and inferring their topological structure, such that additional features, for example, browsing, become available. Both methods have been introduced for vectorial data sets; they require a classical feature encoding of information. Often data are available in the form of pairwise distances only, such as arise from a kernel matrix, a graph, or some general dissimilarity measure. In such cases, NG and SOM cannot be applied directly. In this article, we introduce relational topographic maps as an extension of relational clustering algorithms, which offer prototype-based representations of dissimilarity data, to incorporate neighborhood structure. These methods are equivalent to the standard (vectorial) techniques if a Euclidean embedding exists, while obviating the need to explicitly compute such an embedding. Extending these techniques to the general case of non-Euclidean dissimilarities makes possible an interpretation of relational clustering as clustering in pseudo-Euclidean space. We compare the methods to well-known clustering methods for proximity data based on deterministic annealing and discuss how far convergence can be guaranteed in the general case. Relational clustering is quadratic in the number of data points, which makes the algorithms infeasible for huge data sets. We propose an approximate patch version of relational clustering that runs in linear time. The effectiveness of the methods is demonstrated in a number of examples.

1.  Introduction

Topographic maps such as the self-organizing map (SOM) constitute a valuable tool for robust data inspection and data visualization, which has been applied in such diverse areas as telecommunication, robotics, bioinformatics, and business (Kohonen, 1995; Yin, 2008). A particular strength of SOM lies in the fact that it offers simultaneous data clustering, visualization, topological inference, and compression of data by means of prototypes such that diverse functionalities can be realized on top of SOM. Alternative methods such as neural gas (NG) (Martinetz, Berkovich, & Schulten, 1993) provide an efficient clustering and topographic mapping of data without fixing a prior lattice. In this way, subsequent visualization such as multidimensional scaling (Kruskal, 1964) can readily be applied, whereby no prior restriction of a fixed lattice structure as for SOM is necessary and the risk of topographic errors is minimized. For NG, an optimum (nonregular) data topology is induced such that browsing in a neighborhood becomes directly possible (Martinetz & Schulten, 1994). A very elegant fundamental treatment of vector quantization and extensions such as SOM and NG has been presented by Hofmann and Buhmann (1998) based on information-theoretic principles introduced in Luttrell (1989). In this framework, vector quantization is interpreted as an encoding mechanism with limited resources, where SOM can be derived as a robust model if channel noise is present, whereas NG accounts for the situation that certain channels are not available (e.g., due to varying bandwidth). This also gives some hints as to which situations the application of SOM or NG, respectively, is advisable from a theoretical model of the data, besides providing additional functionality compared to simple clustering such as k-means due to the additional (fixed or data optimum, respectively) neighborhood structure. Interestingly, as presented in Hofmann and Buhmann (1998), these approaches can be combined to yield models that are robust with respect to different types of noise.

Original SOM and NG, however, have been proposed for vectorial data such that their application is restricted to Euclidean space. In the past few years, extensions of these methods have been proposed to deal with more general data structures. This accounts for the fact that more general metrics have to be used for complex data such as microarray data or DNA sequences. Further, it might be the case that data are not embedded in a vector space at all; rather, pairwise similarities or dissimilarities are available.

Several extensions of classical SOM and NG to more general data have been proposed: a statistical interpretation of SOM (Graepel & Obermayer, 1999; Heskes, 2001; Seo & Obermayer, 2004; Tino, Kaban, & Sun, 2004) allows changing the generative model to alternative general data models. The resulting approaches are very flexible but also computationally quite demanding, such that proper initialization and metaheuristics (e.g., deterministic annealing) become necessary when optimizing statistical models. For specific data structures such as time series or recursive structures, recursive models have been proposed (as reviewed in Barreto, Araujo, & Kremer, 2003; Hammer, Micheli, Sperduti, & Strickert, 2004). However, these models are restricted to recursive data structures with Euclidean constituents. Online variants of SOM and NG have been extended to general kernels (e.g., in the approaches presented in Qin & Suganthan, 2004; Yin, 2006). However, these versions have been derived for (slow) online adaptation only.

The approach of Kohonen and Somervuo (2002) provides a fairly general method for large-scale application of SOM to nonvectorial data. It is assumed that pairwise similarities of data points are available. Then the batch optimization scheme of SOM can be generalized by means of the generalized median to a visualization tool for general similarity data. Thereby, prototype locations are restricted to data points. This method has been extended to NG in Cottrell, Hammer, Hasenfuss, and Villmann (2006), together with a general proof of the convergence of median versions of clustering. Further developments concern the efficiency of the computation (Conan-Guez, Rossi, & El Golli, 2005) and the integration of prior information if available to achieve meaningful visualization and clustering (Hammer, Hasenfuss, Schleif, & Villmann, 2006a, 2006b; Villmann, Hammer, Schleif, Geweniger, & Herrmann, 2006).

Median clustering has the benefit that it builds directly on the derivation of SOM and NG from a cost function. Thus, the resulting algorithms share the simplicity of batch NG and SOM, its mathematical background and convergence, as well as the flexibility to model additional information by means of an extension of the cost function. However, for median versions, prototype locations are restricted to the set of given training data, which constitutes a severe restriction, particularly for small data sets. Therefore, extensions that allow a smooth adaptation of prototypes have been proposed (e.g., in Hasenfuss, Hammer, Schleif, & Villmann, 2007). In this approach, a weighting scheme is introduced for the points, which represents a virtual prototype in the space spanned by the training data. This model has the drawback that it is not an extension of the standard Euclidean version.

Here, we use an alternative way to extend NG to relational data given by pairwise dissimilarities, which is similar to the relational dual of fuzzy clustering as derived in Hathaway and Bezdek (1994) and Hathaway, Davenport, & Bezdek (1989) and builds directly on fundamental work in the context of relational clustering as introduced in Laub, Roth, Buhmann, and Müller (2006) and Roth, Laub, Kawanabe, and Buhmann (2003). For a given dissimilarity matrix that stems from a (possibly high-dimensional and unknown) Euclidean space, it is possible to derive the relational dual of topographic map formation that expresses the relevant quantities in terms of the given matrix and leads to a learning scheme similar to standard batch optimization. This scheme provides identical results as the standard Euclidean version if an embedding of the given data points is known. In particular, it possesses the same convergence properties as the standard variants, thereby restricting the computation to known quantities that do not rely on explicit embedding. Since these relational variants rely on the same cost function, extensions to additional label information or magnification control (Hammer et al., 2006a, 2006b; Hammer, Hasenfuss, & Villmann, 2007) become readily available.

The methods can be applied directly to every possible non-Euclidean dissimilarity matrix, and, as we will see in a variety of experiments, they result in good performance in practical applications. The theory behind the case of general dissimilarity data, however, is less clear. We will show that a simple shift of the dissimilarity matrix as proposed in Laub et al. (2006), which makes data Euclidean and does not affect the location of the optima of the dual cost function, can severely affect the underlying numerics. As an alternative, we will link the proposed algorithm to clustering in pseudo-Euclidean space, such that an intuitive interpretation of the algorithm becomes possible also in the non-Euclidean setting. However, we show by counterexample that the algorithm need no longer converge to a fixed point of the dual cost function, although we have not observed this behavior in any real application. We show that popular alternatives such as deterministic annealing for pairwise data clustering or SOM share this property; counterexamples that show possible divergence can also be found for these two well-known clustering algorithms. We argue that relational neural gas is in fact related to popular pairwise data clustering in the sense that the former relaxes the standard quantization error in pseudo-Euclidean space, while the latter can be derived as deterministic annealing to optimize the standard quantization error in pseudo-Euclidean space. This provides a direct interpretation of these alternatives in terms of relational prototypes; inspection of the results becomes possible this way, and it explains why relational clustering shows remarkable results in practice that are comparable to results obtained by deterministic annealing while consuming less training time.

Relational clustering, as well as its deterministic annealing counterparts, displays quadratic complexity in the number of data points, owing to the size of the dissimilarity matrix. This makes the algorithms unsuitable for large data sets. Based on intuitive and powerful extensions of classical k-means and NG to large data sets by means of patch clustering (Alex, Hasenfuss, & Hammer, 2009; Farnstrom, Lewis, & Elkan, 2000; Bradley, Fayyad, & Reina, 1998), we propose an approximation of the algorithms that works in constant memory and linear time and is thus suited for large data sets. While we test the results for patch relational NG clustering, the underlying principle can successfully be applied to every clustering scheme that relies on relational prototypes; a direct transfer of the method to relational SOM and deterministic annealing variants of relational clustering becomes possible.

We first introduce batch learning algorithms for neural gas based on a cost function. Then we focus on a dissimilarity matrix that can be embedded in Euclidean space and derive the respective relational dual, resulting in a dual cost function and batch optimization schemes for the case of a given dissimilarity matrix of data. For the general non-Euclidean setting, we discuss the connection to an embedding in pseudo-Euclidean space. Based on this connection, a relation to well-established deterministic annealing schemes becomes possible. To make the algorithms suitable for large data sets, an approximation of prototypes is introduced that allows processing data subsequently in patches using only constant memory and linear time. The efficiency of relational clustering is demonstrated in a couple of benchmark situations, as well as an application for a text clustering task that involves almost 200,000 articles.

2.  Neural Gas

Neural clustering and topographic maps constitute effective methods for data clustering, inspection, and preprocessing. Classical variants deal with vectorial data $x \in \mathbb{R}^n$, which are distributed according to an underlying distribution $P$ in Euclidean space. The goal of prototype-based clustering algorithms is to distribute prototypes $w^i \in \mathbb{R}^n$, $i = 1,\dots,k$, among the data such that they represent the data as accurately as possible. A new data point $x$ is assigned to the winner $w^{I(x)}$, which refers to the prototype with the smallest distance $\|x - w^{I(x)}\|^2$. This separates the data space into the receptive fields of the prototypes.

Different popular variants of neural clustering have been proposed to learn prototype locations from given training data (Kohonen, 1995). Assume the number of prototypes is fixed to k. Simple k-means directly optimizes the quantization error,
$$E_{k\text{-means}}(\vec w) = \frac{1}{2}\sum_{i=1}^{k}\int \delta_{i,I(x)}\,\|x - w^i\|^2\;P(dx),$$
where $\delta_{i,j}$ denotes the Kronecker delta and $I(x)$ indicates the winner neuron for $x$. Unlike k-means, neural gas (NG) (Martinetz et al., 1993) and the self-organizing map (SOM) (Kohonen, 1995) incorporate the neighborhood of a neuron for adaptation. The cost function of NG is given by
$$E_{\mathrm{NG}}(\vec w) = \frac{1}{2}\sum_{i=1}^{k}\int h_\sigma\bigl(k_i(x,\vec w)\bigr)\,\|x - w^i\|^2\;P(dx),$$
where
$$k_i(x,\vec w) = \bigl|\{\,w^l : \|x - w^l\|^2 < \|x - w^i\|^2\,\}\bigr|$$
is the rank of prototype $w^i$, obtained by sorting all prototypes according to their distance from $x$, and $h_\sigma(t) = \exp(-t/\sigma)$ scales the neighborhood cooperation with the neighborhood range $\sigma > 0$.
The SOM itself does not possess a cost function, but a slight variation of it does, as proposed, for example, by Heskes (2001):
$$E_{\mathrm{SOM}}(\vec w) = \frac{1}{2}\sum_{i=1}^{k}\int \delta_{i,I^*(x)}\,\sum_{l} h_\sigma\bigl(nd(i,l)\bigr)\,\|x - w^l\|^2\;P(dx),$$
where $I^*(x)$ denotes the neuron with the smallest averaged distance $\sum_{l} h_\sigma(nd(i,l))\,\|x - w^l\|^2$ and nd(i, l) denotes a prior chosen neighborhood structure of neurons, often induced by a low-dimensional lattice structure.

The incorporation of a neighborhood structure into SOM and NG has several beneficial effects; additional functionality is achieved this way since the topological structure of the data is respected by the neurons, and browsing and, in the case of SOM, visualization become possible. It has been shown experimentally for benchmark data sets that by neighborhood integration, a method that is widely insensitive to initialization can be achieved (Martinetz, 1992). Unfortunately, this observation has not yet been substantiated by theoretical convergence guarantees. In fundamental work (Hofmann & Buhmann, 1998; Luttrell, 1989), a theoretical justification for this finding is given by linking NG and SOM, respectively, to information-theoretical concepts in the context of encoding in the presence of channel noise. In the following, we exemplarily consider NG, the argumentation for SOM and k-means being similar.

Often the NG cost function is optimized by means of an online stochastic gradient descent. Alternatively, if data $x^1,\dots,x^m$ are available, batch optimization can be done. The corresponding discrete cost function is given by
$$E_{\mathrm{NG}}(\vec w) = \frac{1}{2}\sum_{i=1}^{k}\sum_{j=1}^{m} h_\sigma\bigl(k_i(x^j,\vec w)\bigr)\,\|x^j - w^i\|^2.$$
This is optimized by an iterative optimization of assignments and prototypes until convergence in batch NG (see algorithm 1). Thereby, the neighborhood cooperation $\sigma$ is usually annealed to $0$ during training such that the quantization error is optimized in the limit of small neighborhood size. It has been shown in Cottrell et al. (2006) that this procedure converges after a finite number of steps to a local optimum of the NG cost function for fixed $\sigma$. In the following theoretical considerations, we always assume that a fixed and usually small neighborhood parameter $\sigma$ is chosen. This consideration approximately corresponds to the final stages of training.

Algorithm 1: Batch NG

input

   data $x^1,\dots,x^m \in \mathbb{R}^n$;

begin

   init $w^i$ randomly, $i = 1,\dots,k$;

   repeat

     set $k_{ij} := k_i(x^j,\vec w) = \bigl|\{\,l : \|x^j - w^l\|^2 < \|x^j - w^i\|^2\,\}\bigr|$;

     set $w^i := \sum_j h_\sigma(k_{ij})\,x^j \big/ \sum_j h_\sigma(k_{ij})$;

   until convergence;

   return $w^i$;

end.
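
To make the batch scheme concrete, the following minimal Python/NumPy sketch implements batch NG as described above; all identifiers (batch_ng, sigma0, and so on) are our own illustrative choices rather than part of the original formulation.

```python
import numpy as np

def batch_ng(X, k, n_epochs=50, sigma0=None, seed=0):
    """Sketch of batch NG for vectorial data X of shape (m, n)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    sigma0 = k / 2.0 if sigma0 is None else sigma0
    W = X[rng.choice(m, size=k, replace=False)].copy()      # initialize prototypes on data points
    for epoch in range(n_epochs):
        # anneal the neighborhood range sigma toward a small value
        sigma = sigma0 * (0.01 / sigma0) ** (epoch / max(n_epochs - 1, 1))
        dist = ((X[None, :, :] - W[:, None, :]) ** 2).sum(axis=2)   # squared distances, shape (k, m)
        k_ij = np.argsort(np.argsort(dist, axis=0), axis=0)         # rank of prototype i for point x^j
        h = np.exp(-k_ij / sigma)                                   # neighborhood weights h_sigma(k_ij)
        W = (h @ X) / h.sum(axis=1, keepdims=True)                  # w^i = sum_j h_ij x^j / sum_j h_ij
    return W
```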

In the following, we deal with the application of NG to settings where no Euclidean embedding of the data points is known. We more generally deal with data points xi characterized by pairwise proximities d(xi, xj) that are symmetric and 0 if xi = xj. For such general proximities, it can happen that no Euclidean embedding can be found. However, we will see that also for non-Euclidean dissimilarities, a vector space with symmetric bilinear form $\langle\cdot,\cdot\rangle$ (which need not be positive definite) and embeddings of the points xi exist such that $d(x^i, x^j) = \langle x^i - x^j, x^i - x^j\rangle$. We stress that the cost function of NG can be formalized for every real vector space that possesses a symmetric bilinear form $\langle\cdot,\cdot\rangle$, by substituting the term $\|x^j - w^i\|^2$ in the cost function by $\langle x^j - w^i, x^j - w^i\rangle$. The corresponding cost function becomes
$$E_{\mathrm{NG}}(\vec w) = \frac{1}{2}\sum_{i=1}^{k}\sum_{j=1}^{m} h_\sigma\bigl(k_i(x^j,\vec w)\bigr)\,\langle x^j - w^i, x^j - w^i\rangle,$$
where the ranks are determined based on the bilinear form. In the same way, using this extended definition of the ranks, the procedure of batch NG is well defined in every vector space equipped with a symmetric bilinear form. However, it is not guaranteed that this batch NG procedure optimizes the NG cost function in the general setting, that the procedure converges, or that this procedure is useful at all. We address these questions in more detail in the following sections.

3.  Dissimilarity Data

Relational data xi are not explicitly embedded in a Euclidean vector space; rather, pairwise similarities or dissimilarities are available. We assume that dissimilarities dij are given for every pair of data points x1,…, xm. In the Euclidean setting, data can be represented as vectors such that the equality $d_{ij} = \|x^i - x^j\|^2$ holds. In general, however, no such explicit embedding of data is available. Further, dij need not fulfill the requirements of a metric, that is, the triangle inequality can be violated. Such data can stem from a general distance measure such as alignment distance for DNA sequences, Levenshtein distance of words, general graph distances (e.g., in a web graph), or even empirical results of questionnaires or experiments.

This situation is rather common in practice, and an overview of supervised classification techniques for similarity data has recently been published (Chen, Garcia, Gupta, Rahimi, & Cazzani, 2009). Here we are interested in the unsupervised case. A variety of clustering algorithms for general dissimilarities has been proposed. This includes hierarchical clustering such as single or complete linkage, UPGMA, or neighbor joining (Hartigan, 1975; Zhong & Ghosh, 2003), which usually rely on heuristics, although guarantees of the correctness of the result can be derived in specific settings (e.g., for ultrametrics or additive metrics; Kaplansky, 1977). Alternatives rely on an appropriate cost function. Naturally, the standard kernel trick can be applied to many clustering algorithms or topographic mapping, such that a direct extension of these approaches to similarity data that is induced by a kernel results. Proposals of this approach can be found in Conan-Guez et al. (2005), Yin (2006) and Qin and Suganthan (2004), for example. Clustering of dissimilarities that are represented by a graph can be formalized as a graph cut problem. Spectral clustering can be interpreted as an efficient approximation of the NP-hard optimization of the normalized or ratio graph cut (von Luxburg, 2007). Alternatively, in median or exemplar-based clustering, the position of prototypes in the standard quantization error is restricted to locations in the set of given data points. This way, the quantization error can directly be transferred to the setting of general dissimilarities. Proposals that follow this line of argumentation include median clustering as proposed for SOM and NG (Kohonen & Somervuo, 2002; Cottrell et al., 2006) and exemplar-based clustering by formulating the problem in terms of a factor graph leading to affinity propagation (Frey & Dueck, 2007). By restricting the prototype locations, the set of potential solutions becomes restricted such that these proposals do not correspond to standard k-means or SOM in the Euclidean setting.

A few approaches try to directly generalize the standard Euclidean setting to general dissimilarity data. One proposal is to optimize a cost function for pairwise clustering that relies on the cluster assignments only (i.e., it can be formulated for pairwise dissimilarities), but is equivalent to the standard quantization error in the Euclidean setting. This cost function can be optimized, for example, by using some metaheuristic or deterministic annealing (Hofmann & Buhmann, 1997). However, in the general setting, the notion of prototypes is lost this way, and a clustering is represented in terms of assignments only. Similarly, the approach (Graepel & Obermayer, 1999) uses deterministic annealing for a reliable training of SOM for proximity data. The transfer from vectorial data is achieved by an interpretation of SOM as an encoder-decoder framework. We will later see that these methods can be interpreted as deterministic annealing of standard Euclidean k-means clustering and SOM, respectively, which are transferred to proximity data using relational clustering, as we will propose in this article.

A direct extension of k-means and fuzzy-k-means to dissimilarity data has been proposed in Hathaway and Bezdek (1994) and Hathaway et al. (1989). Here the assumption is made that a Euclidean embedding of data exists, but it is unknown. In this case, k-means can be equivalently reformulated in terms of dissimilarities only. The resulting methods are termed relational clustering since they can deal with dissimilarities that result from relations rather than vectors. In this contribution, we will follow the last approach and transfer these ideas to topographic mapping. First, we consider the situation that a Euclidean embedding of data exists and derive the relational neural gas in this setting. Then we discuss the setting for non-Euclidean situations.

3.1.  Euclidean Data.

We assume, as before, that training data x1,…, xm are given in terms of pairwise dissimilarities dij. One special setting is that data originate from a Euclidean distance measure, that is, we are able to find Euclidean points $x^i$ such that $d_{ij} = \|x^i - x^j\|^2$. Note that this notation includes a possibly nonlinear mapping (feature map) $x^i \mapsto \Phi(x^i)$ with $d_{ij} = \|\Phi(x^i) - \Phi(x^j)\|^2$. Since the Euclidean distance is related to the Euclidean dot product, this setting is referred to as the kernel trick for distances in the literature. The approach of Schölkopf (2000), as an example, investigates sufficient conditions for a metric to fulfill this property. Since an explicit formula for the embedding need not be known, we cannot directly apply batch NG in the embedding space. We want to derive a possibility for reformulating batch NG in such a way that an explicit embedding is not needed; batch NG can then be applied directly to this setting. The key observation is based on the fact that optimum prototype locations of batch NG can be expressed as a linear combination of the given data points. Therefore, the unknown dissimilarities $\|x^j - w^i\|^2$ can be expressed in terms of the known values dij. Correspondingly, the NG cost function can be substituted by a dual cost function in terms of ranks and cluster assignments only.

3.1.1.  Relational Clustering Algorithm.

In the following, we lay out the details of this procedure. We always assume that pairwise dissimilarities dij are given for $i, j = 1,\dots,m$, and D denotes the corresponding matrix of dissimilarities. We assume that for D, there exists a finite-dimensional real vector space X together with a symmetric bilinear form $\langle\cdot,\cdot\rangle$, and there exist vectors $x^i \in X$ with $d_{ij} = \langle x^i - x^j, x^i - x^j\rangle$. We will see later that this property is always guaranteed if D is symmetric with zero diagonal elements. For some theorems, we require X to be a Euclidean space, that is, we require positive definiteness of the bilinear form $\langle\cdot,\cdot\rangle$. This will be explicitly mentioned at corresponding places. Interestingly, the principled procedure as derived in this section, as well as a couple of results, hold for the more general case that X is an arbitrary (possibly non-Euclidean) vector space equipped with a symmetric bilinear form.

Prototypes will be restricted to convex combinations of given data points since optimum prototypes for NG have this form. For NG, we need to compute dissimilarities. It is possible to compute the dissimilarity of two such prototypes (or a prototype and a given data point) based on the given dissimilarity matrix only. More precisely, generalizing results as presented in Hathaway and Bezdek (1994):

Theorem 1. 
For every coefficient vector $\alpha = (\alpha_1,\dots,\alpha_m)^t$ with $\sum_j \alpha_j = 1$, the following equality holds:
$$\Bigl\langle x^l - \sum_j \alpha_j\,x^j,\; x^l - \sum_j \alpha_j\,x^j \Bigr\rangle = [D\alpha]_l - \frac{1}{2}\,\alpha^t D\,\alpha.$$

Proof. 
It holds that:
$$\Bigl\langle x^l - \sum_j \alpha_j\,x^j,\; x^l - \sum_j \alpha_j\,x^j \Bigr\rangle = \langle x^l, x^l\rangle - 2\sum_j \alpha_j\,\langle x^l, x^j\rangle + \sum_{j,j'} \alpha_j\,\alpha_{j'}\,\langle x^j, x^{j'}\rangle$$
and
$$d_{lj} = \langle x^l - x^j, x^l - x^j\rangle = \langle x^l, x^l\rangle - 2\,\langle x^l, x^j\rangle + \langle x^j, x^j\rangle.$$
Hence,
$$[D\alpha]_l - \frac{1}{2}\,\alpha^t D\,\alpha = \sum_j \alpha_j\,d_{lj} - \frac{1}{2}\sum_{j,j'} \alpha_j\,\alpha_{j'}\,d_{jj'} = \langle x^l, x^l\rangle - 2\sum_j \alpha_j\,\langle x^l, x^j\rangle + \sum_{j,j'} \alpha_j\,\alpha_{j'}\,\langle x^j, x^{j'}\rangle,$$
using $\sum_j \alpha_j = 1$, which coincides with the first expression.
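
The identity of theorem 1 is easy to check numerically in the Euclidean case; the following small sketch (illustrative only) compares the explicit squared distance with the expression $[D\alpha]_l - \frac{1}{2}\alpha^t D\alpha$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))                          # ten points in R^3
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)    # squared Euclidean dissimilarities d_ij
alpha = rng.random(10)
alpha /= alpha.sum()                                  # convex coefficients, sum_j alpha_j = 1
w = alpha @ X                                         # prototype as convex combination of the data
l = 4
explicit = ((X[l] - w) ** 2).sum()                    # <x^l - w, x^l - w>
relational = (D @ alpha)[l] - 0.5 * alpha @ D @ alpha # [D alpha]_l - 1/2 alpha^t D alpha
assert np.isclose(explicit, relational)
```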

Prototypes of NG can be written as a convex combination of data points: $w^i = \sum_j \alpha_{ij}\,x^j$ with $\alpha_{ij} \geq 0$ and $\sum_j \alpha_{ij} = 1$. Thus, because of theorem 1, we can compute the dissimilarity between a prototype and a given data point based on the coefficients $\alpha_{ij}$ of this convex combination and the dissimilarity matrix D. Further, it is not necessary to explicitly store the prototype locations; rather, we can alternatively represent prototypes by means of the coefficients $\alpha_{ij}$. These observations lead to the following equivalent formulation of batch NG, which relies not on an explicit vectorial representation of the data but on the matrix D only and extends work as presented in Hathaway and Bezdek (1994) to neighborhood cooperation:

Algorithm 2. Relational NG

input

   symmetric dissimilarity matrix with zero diagonal $D = (d_{ij}) \in \mathbb{R}^{m \times m}$;

begin

   init $\alpha_{ij}$ such that $\sum_j \alpha_{ij} = 1$;

   repeat

     compute $dist(x^j, w^i) := [D\alpha_i]_j - \frac{1}{2}\,\alpha_i^t D\,\alpha_i$;

     set $k_{ij} := \bigl|\{\,l : dist(x^j, w^l) < dist(x^j, w^i)\,\}\bigr|$;

     set $\alpha_{ij} := h_\sigma(k_{ij}) \big/ \sum_{j'} h_\sigma(k_{ij'})$;

   until convergence;

   return $\alpha_{ij}$;

end.
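
A corresponding Python/NumPy sketch of relational NG, operating on the dissimilarity matrix D only (identifiers are again our own illustration):

```python
import numpy as np

def relational_ng(D, k, n_epochs=50, sigma0=None, seed=0):
    """Sketch of relational NG for a symmetric (m, m) dissimilarity matrix D with zero diagonal."""
    rng = np.random.default_rng(seed)
    m = D.shape[0]
    sigma0 = k / 2.0 if sigma0 is None else sigma0
    A = rng.random((k, m))
    A /= A.sum(axis=1, keepdims=True)                      # coefficients alpha_ij, every row sums to 1
    for epoch in range(n_epochs):
        sigma = sigma0 * (0.01 / sigma0) ** (epoch / max(n_epochs - 1, 1))
        DA = A @ D                                         # row i equals (D alpha_i)^t for symmetric D
        dist = DA - 0.5 * np.einsum('ij,ij->i', DA, A)[:, None]   # dist(x^j, w^i) via theorem 1
        k_ij = np.argsort(np.argsort(dist, axis=0), axis=0)       # ranks based on relational distances
        h = np.exp(-k_ij / sigma)
        A = h / h.sum(axis=1, keepdims=True)               # alpha_ij = h_sigma(k_ij) / sum_j' h_sigma(k_ij')
    return A, k_ij
```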

Theorem 2. 

Relational NG as shown in algorithm 2 constitutes an equivalent formulation of batch NG by means of the identity $w^i = \sum_j \alpha_{ij}\,x^j$. Thereby, $\alpha_i$ refers to the vector $(\alpha_{i1},\dots,\alpha_{im})^t$ of coefficients of prototype $w^i$, and $[\cdot]_j$ refers to coordinate j. Equivalence means that if $w^i = \sum_j \alpha_{ij}\,x^j$ holds for the initialization of batch NG and relational NG, respectively, then the same identity holds for these values after every epoch of batch NG and relational NG, respectively.

Proof. 

The equivalence holds because of theorem 1: for prototypes $w^i = \sum_j \alpha_{ij}\,x^j$, the dissimilarity $\langle x^j - w^i, x^j - w^i\rangle$ used by batch NG coincides with the quantity $[D\alpha_i]_j - \frac{1}{2}\,\alpha_i^t D\,\alpha_i$ computed by relational NG, so the ranks $k_{ij}$ determined in both algorithms coincide; consequently, the prototype update of batch NG yields $w^i = \sum_j h_\sigma(k_{ij})\,x^j \big/ \sum_{j'} h_\sigma(k_{ij'}) = \sum_j \alpha_{ij}\,x^j$ with the coefficients $\alpha_{ij}$ computed by relational NG. Further, we can assume that prototypes are initialized in the convex hull of data points, that is, for the initialization, equivalent coefficients can be found.

Using the identity $w^i = \sum_j \alpha_{ij}\,x^j$, relational NG computes exactly the same prototype locations in every epoch as batch NG does. In relational NG, however, prototype locations are computed only indirectly by means of the coefficients $\alpha_{ij}$. It is not necessary to know the vectorial representation of the training vectors; rather, the knowledge of the pairwise dissimilarities dij is sufficient to compute the output of batch NG in terms of coefficients $\alpha_{ij}$. Hence, NG can be performed directly based on a given dissimilarity matrix D only. For every prototype, m coefficients are stored, with m denoting the number of training points. $\alpha_{ij}$ corresponds to the part that data point $x^j$ takes in representing prototype $w^i$. The larger $\alpha_{ij}$, the more $x^j$ contributes to the prototype location. In the limit of zero neighborhood cooperation $\sigma \to 0$, the prototypes converge to the mean of the data points in their receptive fields. As for batch NG, the neighborhood parameter $\sigma$ is usually annealed to 0 during training. For all theoretical considerations, we assume a fixed and possibly small parameter $\sigma$ corresponding to the final stages of training.

The space complexity of relational clustering is linear with regard to the number of training data because of the requirement to store $\alpha_{ij}$ for every prototype i and every data point j. The time complexity of one training epoch is quadratic with regard to the number of training points, since the computation of $[D\alpha_i]_j$ takes linear time for every j. The quantity $\alpha_i^t D\,\alpha_i$ does not depend on j and has to be computed only once for every i; thus, it also contributes quadratic complexity. Since embedding data into a vector space based on the distance matrix D has cubic complexity depending on the number of training points, as we will see later, this procedure offers a more efficient alternative for NG if an embedding of data is not given previously. Obviously the algorithm provided by relational NG can be applied to every setting where pairwise dissimilarities are known. Euclideanity of an underlying vector space is not required to apply the algorithm. However, in such cases, no guarantee of the convergence of relational NG or of a connection to a cost function is given, just as no guarantee is given for the vectorial counterpart batch NG in non-Euclidean settings. We will discuss later how far the procedure is reasonable in such situations.

3.1.2.  Dual Cost Function.

For the Euclidean setting, that is, if $d_{ij} = \|x^i - x^j\|^2$ holds for some Euclidean embedding of the data, convergence of relational NG toward a local optimum of the NG cost function is guaranteed, since relational NG is equivalent to standard batch NG; therefore, the same guarantees as for batch NG hold, provided the underlying space X is Euclidean (Cottrell et al., 2006). For a non-Euclidean setting, convergence of the algorithm toward a local optimum of the NG cost function is not guaranteed. For both cases, however, prototypes are obtained only indirectly this way, and the NG cost function can be evaluated only if an explicit embedding of the data is known. Therefore, it is interesting to see whether there exists an alternative to evaluate the cost function of NG based on the quantities given by relational NG and the distance matrix D.

We introduce the following function, which depends on the assignments kij and the dissimilarity matrix D only and extends the dual cost function of k-means to neighborhood cooperation (Hathaway & Bezdek, 1994):
$$E^{\vee}_{\mathrm{NG}}(k_{ij}) = \sum_{i=1}^{k} \frac{1}{4\sum_{j} h_\sigma(k_{ij})} \sum_{j,j'} h_\sigma(k_{ij})\,h_\sigma(k_{ij'})\,d_{jj'}.$$
The function has to be optimized with respect to the assignments kij under the constraint that k1j,…, kkj constitute a permutation of $\{0,\dots,k-1\}$ for every fixed j. The cost function measures the intracluster distances averaged over the local neighborhood as induced by the data.

We will see that this cost function is in some sense dual to the standard NG cost function. For fixed points found by relational or batch NG, respectively, the values of the NG cost function and the above function coincide, as we will see. Further, global optima of this cost function can be related to global optima of the standard NG cost function, as we will show. Therefore, we refer to this cost function as the dual NG cost function, which expresses the objective of NG in the dual variable kij instead of the prototypes $w^i$.

This cost function is independent of an embedding of data in a vector space. Rather, it evaluates the quality of a clustering based on the assignments kij, which determine the rank of cluster i for a data point $x^j$, and the pairwise dissimilarities $d_{jj'}$. This cost function constitutes a direct extension of the well-known cost function of pairwise data clustering, referred to as the dual k-means cost function in the literature, to neighborhood cooperation (Hathaway & Bezdek, 1994; Hofmann & Buhmann, 1997):
$$E^{\vee}_{k\text{-means}}(\chi_{ij}) = \sum_{i=1}^{k} \frac{1}{4\sum_{j}\chi_{ij}} \sum_{j,j'} \chi_{ij}\,\chi_{ij'}\,d_{jj'}.$$
This cost function is the dual function of the standard quantization error, and it has to be optimized for assignments $\chi_{ij} \in \{0, 1\}$ such that $\sum_i \chi_{ij} = 1$ for every j. It directly measures the intracluster distances of a given clustering. It is well known that optimum prototypes of the vector quantization error correspond to optimum cluster assignments of the dual cost function in the Euclidean space, both values computed in the course of k-means clustering (Hathaway & Bezdek, 1994; Hathaway et al., 1989). The dual cost function is more general in the sense that it offers a valid cost function for arbitrary dissimilarities dij, and alternative optimization methods such as deterministic annealing have been proposed for this setting (Hofmann & Buhmann, 1997).
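
Both dual cost functions can be evaluated directly from the dissimilarity matrix and the assignments; a small sketch following the formulas as given above (function names and the exact normalization are our own illustration):

```python
import numpy as np

def dual_ng_cost(D, k_ij, sigma):
    """Dual NG cost: sum_i (1 / (4 sum_j h_ij)) * sum_{j,j'} h_ij h_ij' d_jj'."""
    h = np.exp(-k_ij / sigma)                             # neighborhood weights, shape (k, m)
    return float((np.einsum('aj,jk,ak->a', h, D, h) / (4.0 * h.sum(axis=1))).sum())

def dual_kmeans_cost(D, labels, k):
    """Dual k-means cost: sum over clusters of intra-cluster dissimilarities / (4 |C_i|)."""
    total = 0.0
    for i in range(k):
        idx = np.flatnonzero(labels == i)
        if idx.size:
            total += D[np.ix_(idx, idx)].sum() / (4.0 * idx.size)
    return total
```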

Now we formally establish a relation of the NG cost function and its dual. For this purpose, we reformulate the dual cost function of NG for vectors in a vector space X, which allows us to relate the dual function depending on kij only to a mixed function depending on kij and $\vec w$, and, finally, the standard NG cost function.

Theorem 3. 
If $d_{ij} = \langle x^i - x^j, x^i - x^j\rangle$ holds for vectors $x^i$ in a vector space X with symmetric bilinear form, then the dual cost function of NG equals
$$E^{\vee}_{\mathrm{NG}}(k_{ij}) = \frac{1}{2}\sum_{i=1}^{k}\sum_{j=1}^{m} h_\sigma(k_{ij})\,\bigl\langle x^j - w^i(k),\; x^j - w^i(k)\bigr\rangle, \qquad \text{where } w^i(k) = \frac{\sum_{j} h_\sigma(k_{ij})\,x^j}{\sum_{j} h_\sigma(k_{ij})}.$$

Proof. 
We obtain
$$E^{\vee}_{\mathrm{NG}}(k_{ij}) = \frac{1}{2}\sum_{i,j} h_\sigma(k_{ij})\,\Bigl([D\tilde\alpha_i]_j - \frac{1}{2}\,\tilde\alpha_i^t D\,\tilde\alpha_i\Bigr) = \frac{1}{2}\sum_{i,j} h_\sigma(k_{ij})\,\bigl\langle x^j - w^i(k),\; x^j - w^i(k)\bigr\rangle$$
with $\tilde\alpha_{ij} = h_\sigma(k_{ij}) \big/ \sum_{j'} h_\sigma(k_{ij'})$, where the first equality follows by a direct computation and the second one from theorem 1.

This reformulation of the dual cost function allows us to link its value to the standard NG cost function for fixed points obtained by relational NG:

Theorem 4. 

Assume relational NG converges toward a fixed point of the algorithm, that is, coefficients $\alpha_{ij}$ and assignments $k_{ij}$ are found that remain constant after further adaptation steps. Define the vectors $w^i = \sum_j \alpha_{ij}\,x^j$. Then $E_{\mathrm{NG}}(\vec w) = E^{\vee}_{\mathrm{NG}}(k_{ij})$.

Proof. 

For a fixed point of relational NG, and thus a fixed point of batch NG, we find $w^i = \sum_j h_\sigma(k_{ij})\,x^j \big/ \sum_{j'} h_\sigma(k_{ij'}) = w^i(k)$ and $k_{ij} = k_i(x^j, \vec w)$; hence, $E^{\vee}_{\mathrm{NG}}(k_{ij}) = E_{\mathrm{NG}}(\vec w)$ because of theorem 3.

This result allows us to evaluate the outcome of relational NG based on the dissimilarity matrix D and the coefficients $\alpha_{ij}$ only. The value of the dual cost function corresponds to the standard cost function value of standard NG if fixed points of the algorithms are considered; hence, the quality of the solution can be judged independent of a concrete vectorial embedding of the data points. Obviously, this theorem also holds under the weaker condition that kij coincides with the ranks of data point j for the prototypes $w^i = \sum_j \alpha_{ij}\,x^j$. As we will see in the experiments, this condition is often already approximately fulfilled in early stages of training, such that both cost functions are approximately identical for relational NG during training.
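
As an illustration of this use of the dual cost function, and building on the relational_ng and dual_ng_cost sketches above (all names and data are illustrative), the quality of a relational NG solution can be assessed without any embedding:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))                            # hidden Euclidean data, used only to build D
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)       # dissimilarity matrix handed to the algorithm
A, k_ij = relational_ng(D, k=10)                         # clustering from the dissimilarities alone
print("dual NG cost:", dual_ng_cost(D, k_ij, sigma=0.01))  # 0.01 is the final sigma of the sketch
```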

It is interesting to investigate whether the structure of the NG cost function and its dual are in some sense equivalent. This refers to the question of whether local or global optima of these cost functions can be related to each other. We start by looking at global optima of the cost functions. The following holds:

Theorem 5. 
Define
$$E_{\mathrm{NG}}(\vec w, k_{ij}) = \frac{1}{2}\sum_{i=1}^{k}\sum_{j=1}^{m} h_\sigma(k_{ij})\,\langle x^j - w^i,\; x^j - w^i\rangle.$$
Then
$$\inf_{\vec w}\, E_{\mathrm{NG}}(\vec w, k_{ij}) \;\leq\; E^{\vee}_{\mathrm{NG}}(k_{ij}).$$
If $\langle\cdot,\cdot\rangle$ is positive definite, equality holds. Further, it holds that
$$E_{\mathrm{NG}}(\vec w) \;=\; \min_{k_{ij}\in P_j} E_{\mathrm{NG}}(\vec w, k_{ij})$$
for every real vector space X with symmetric bilinear form, where Pj denotes the space of permutation assignments, that is, $(k_{1j},\dots,k_{kj})$ constitutes a permutation of $\{0,\dots,k-1\}$ for every j.

Proof. 

The inequality is obvious because of the definition of the functions.

A positive definite bilinear form over a finite-dimensional vector space can be expressed as $\langle x, y\rangle = x^t A\,y$ with a symmetric positive-definite matrix A. We can compute the directional derivative of the cost function $E_{\mathrm{NG}}(\vec w, k_{ij})$ with respect to $w^i$ into the direction $\xi$ as $-\sum_j h_\sigma(k_{ij})\,\xi^t A\,(x^j - w^i)$. For local optima, this must be zero for every direction $\xi$; hence, we find $\sum_j h_\sigma(k_{ij})\,(x^j - w^i) = 0$ and, hence, $w^i = \sum_j h_\sigma(k_{ij})\,x^j \big/ \sum_j h_\sigma(k_{ij})$. The Hessian matrix of this cost function constitutes a matrix with diagonal blocks $\sum_j h_\sigma(k_{ij})\cdot A$, and 0 entries otherwise. This is positive definite because A is positive definite, and the factor $\sum_j h_\sigma(k_{ij})$ is positive; thus, this position constitutes a local optimum of the cost function. Since $E_{\mathrm{NG}}(\vec w, k_{ij})$ is a quadratic form with respect to $\vec w$, this local optimum must be the global optimum. Therefore, equality holds in this case.

Assume that the assignments kij do not coincide with the ranks $k_i(x^j, \vec w)$. Then, for some j, there exist two indices i and $i'$ with $k_{ij} < k_{i'j}$ and $\langle x^j - w^i, x^j - w^i\rangle > \langle x^j - w^{i'}, x^j - w^{i'}\rangle$. Because of the monotonicity of $h_\sigma$, we have $h_\sigma(k_{ij}) > h_\sigma(k_{i'j})$; thus, we can decrease the cost function by exchanging these two assignments. Therefore, optimum assignments kij are given by the ranks $k_i(x^j, \vec w)$; hence, the equality follows.

As a consequence of this theorem, we find equivalence of global optima of the NG cost function and its dual in the Euclidean setting:

Theorem 6. 
The following inequality is valid:
$$\inf_{\vec w}\, E_{\mathrm{NG}}(\vec w) \;\leq\; \inf_{k_{ij}\in P_j} E^{\vee}_{\mathrm{NG}}(k_{ij}).$$
If $\langle\cdot,\cdot\rangle$ is positive definite, equality holds.

Proof. 
Because of theorems 4 and 5, we find
$$\inf_{k_{ij}\in P_j} E^{\vee}_{\mathrm{NG}}(k_{ij}) \;\geq\; \inf_{k_{ij}\in P_j}\,\inf_{\vec w}\, E_{\mathrm{NG}}(\vec w, k_{ij})$$
with equality for positive definite bilinear form and
$$\inf_{\vec w}\, E_{\mathrm{NG}}(\vec w) \;=\; \inf_{\vec w}\,\min_{k_{ij}\in P_j} E_{\mathrm{NG}}(\vec w, k_{ij}).$$
Because of the finiteness of Pj we can exchange the infimum and minimum; hence, the theorem follows.

Thus, in the Euclidean case, the NG cost function and its dual coincide in the sense that we have a correspondence of the values of global optima. Since global optima are fixed points of batch NG and relational NG in the Euclidean setting, theorem 4 also gives a correspondence of the global optima themselves.

The question arises whether a more detailed correspondence of the functions according to their overall shape can be established. In particular, a connection of the values as obtained by batch and relational NG and their role (e.g., local or global optimum) with respect to the cost functions would be interesting. Batch NG repeatedly optimizes the assignments kij and prototype locations $w^i$ of the cost function if a Euclidean embedding of data can be found. The convergence proof as presented in Cottrell et al. (2006) relies on the fact that optimum values kij and $w^i$, respectively, are determined in every step in the Euclidean setting, as also computed in the above proofs. Thus, the cost function decreases in successive steps until convergence can be observed for a Euclidean or, more generally, positive definite symmetric bilinear form. Cottrell et al. (2006) showed that the obtained fixed point constitutes a local optimum of the cost function under mild conditions on the setting. The same holds for RNG if the pairwise distances stem from a Euclidean space because RNG is just an equivalent formulation of batch NG using the identity $w^i = \sum_j \alpha_{ij}\,x^j$. However, in both cases, it is not guaranteed that a global optimum of the cost functions is reached.

We already know that for fixed points of relational or batch NG, the values of the NG cost function and its dual coincide. Further, in the Euclidean setting, a local optimum of the NG cost function is reached, and, conversely, every local optimum of the NG cost function constitutes a fixed point of NG. Now the question arises as to how far local optima of the NG cost function can be related to local optima of the dual cost function. For this purpose, we have to determine a neighborhood structure for the solution space of the dual cost function, since kij constitute discrete values. We use the following simple structure. We define a neighborhood structure on the assignments such that kij and $k'_{ij}$ are neighbored if and only if they coincide for all but two indices (i, j). An assignment kij is called a local optimum of $E^{\vee}_{\mathrm{NG}}$ if $E^{\vee}_{\mathrm{NG}}(k_{ij}) \leq E^{\vee}_{\mathrm{NG}}(k'_{ij})$ for all $k'_{ij}$ in the neighborhood of kij. Using this definition, we obtain the following result:

Theorem 7. 

Assume $\langle\cdot,\cdot\rangle$ is positive definite. Then local optima of $E^{\vee}_{\mathrm{NG}}$ constitute fixed points of relational NG.

Proof. 

Assume kij are given, which do not constitute a fixed point of relational NG. Define $w^i = \sum_j h_\sigma(k_{ij})\,x^j \big/ \sum_{j'} h_\sigma(k_{ij'})$. Then, kij does not coincide with the ranks $k_i(x^j, \vec w)$. Thus, we can argue as in the proof of theorem 5 that exchanging two assignments $k_{ij} < k_{i'j}$ for which the corresponding distances fulfill $\langle x^j - w^i, x^j - w^i\rangle > \langle x^j - w^{i'}, x^j - w^{i'}\rangle$ leads to neighbored assignments $k'_{ij}$ with $E_{\mathrm{NG}}(\vec w, k'_{ij}) < E_{\mathrm{NG}}(\vec w, k_{ij})$. Setting $w'^i = \sum_j h_\sigma(k'_{ij})\,x^j \big/ \sum_{j'} h_\sigma(k'_{ij'})$, we obtain as in the proof of theorem 5 that $E_{\mathrm{NG}}(\vec w', k'_{ij}) \leq E_{\mathrm{NG}}(\vec w, k'_{ij})$. Because of theorem 3, this means $E^{\vee}_{\mathrm{NG}}(k'_{ij}) < E^{\vee}_{\mathrm{NG}}(k_{ij})$; thus, kij does not constitute a local optimum.

The converse, however, is not true:

Theorem 8. 

There exist fixed points of relational NG that do not constitute a local optimum of the dual cost function with respect to the neighborhood structure as defined above in the Euclidean setting.

Proof. 

We consider the limit case (i.e., crisp k-means). NG approximates this setting for small enough . Consider the Euclidean points in : , , and prototypes and . The corresponding crisp assignments (corresponding to for ) are , , . This is a fixed point of batch NG and, hence, relational NG. The alternative , , , however, is in the neighborhood of this solution and leads to the better prototypes , .

Thus, it is guaranteed that local optima of the NG cost function or its dual, respectively, lead to fixed points of batch NG or relational NG in the Euclidean setting. Conversely, every fixed point constitutes a local optimum of the NG cost function under mild conditions; for the dual cost function, however, the converse is not true: there exist settings where a fixed point of RNG can be further improved within the neighborhood structure as defined above for the dual NG cost function. Since a one-to-one connection of fixed points of batch NG and RNG exists, this leads to the consequence that the dual NG cost function possesses fewer local optima than the original NG cost function with respect to the neighborhood as defined above; in particular, there is no exact correspondence of the overall structure of the NG cost function and its dual.

3.1.3.  Out-of-Sample Extensions.

We conclude this section by addressing two issues relevant to clustering algorithms: Is it possible to extend a given clustering to new data points that are not contained in the training set (so-called out-of-sample extensions)? Is it possible to integrate prior knowledge (e.g., given by a partial labeling of the data) into the algorithms? Both questions can be answered satisfactorily in the context of prototype-based clustering.

Out-of-sample extensions of NG to a point $x$ can be defined based on the standard winner assignment; $x$ is mapped to the class represented by the prototype with the smallest dissimilarity to $x$. This simple scheme is one of the benefits of prototype-based clustering as opposed to alternatives such as assignment-based clustering. This scheme can be transferred directly to relational clustering because of its connection to batch NG in a vector space. The question arises whether this procedure can be transferred to a scheme that refers to pairwise dissimilarities only, without an explicit knowledge of the embedding of data or prototypes. The key ingredient is a scheme that allows computing the dissimilarity of a new data point and a prototype based on the coefficients $\alpha_{ij}$ and pairwise dissimilarities only:

Theorem 9. 
Assume a data point $x$ is contained in X, the dissimilarities of $x$ to the points $x^j$ are given as $d_j = \langle x - x^j, x - x^j\rangle$, and the corresponding vector is denoted by $D(x) = (d_1,\dots,d_m)^t$. Assume $w = \sum_j \alpha_j\,x^j$ with $\sum_j \alpha_j = 1$; $\alpha$ denotes the corresponding coefficient vector. Then
$$\langle x - w,\; x - w\rangle = D(x)^t\alpha - \frac{1}{2}\,\alpha^t D\,\alpha.$$

Proof. 

This follows directly from theorem 1, whereby we consider the extended set of points $x, x^1,\dots,x^m$ with its dissimilarity matrix and the convex combination with coefficients $(0, \alpha_1,\dots,\alpha_m)$, applied to the point $x$.

This allows us to compute out-of-sample extensions of the clustering based on the dissimilarities and prototype coefficients only.
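
A sketch of this out-of-sample assignment (illustrative names; A holds the coefficients $\alpha_{ij}$ obtained by relational NG, d_new the dissimilarities of the new point to the m training points):

```python
import numpy as np

def out_of_sample_winner(d_new, A, D):
    """Winner prototype for a new point with dissimilarity vector d_new of length m,
    using dist(x, w^i) = D(x)^t alpha_i - 1/2 alpha_i^t D alpha_i (theorem 9)."""
    dist = A @ d_new - 0.5 * np.einsum('ij,jk,ik->i', A, D, A)
    return int(np.argmin(dist))
```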

3.1.4.  Supervision.

Clustering constitutes an ill-posed problem since the objective of clustering is not clear a priori (see, e.g., Biehl, Hammer, Hochreiter, Kremer, & Villmann, 2009). If pairwise distances of data are given, the dual cost function of k-means constitutes one possible objective of clustering. Alternatively, when restricting to prototype-based algorithms, the standard quantization error can serve as an objective function. We have seen that these cost functions are equivalent for metric data, and neural gas and relational neural gas optimize a relaxation thereof. However, it is not clear a priori whether the outcome meets the intended result in practical problems. The possibility of including further information, if available, is very important to get meaningful results for unsupervised learning. This can help to prevent the garbage in–garbage out problem of unsupervised learning, as discussed in Kaski et al. (2003) and Kaski, Nikkilä, Savia, and Roos (2005).

Here we consider the situation that additional label information is available, which should be accounted for by clustering or visualization. Thereby, labels are embedded in $\mathbb{R}^d$ and can be fuzzy. We assume that the label attached to $x^j$ is denoted by $y^j$. We equip a prototype $w^i$ with a label $Y^i \in \mathbb{R}^d$, which is adapted during learning. The basic idea consists of substituting the standard dissimilarities by a mixture
$$(1-\beta)\cdot\|x^j - w^i\|^2 + \beta\cdot\|y^j - Y^i\|^2,$$
which takes the similarity of label assignments into account and where $\beta \in [0, 1]$ controls the influence of the label values. This procedure has been proposed in Hammer et al. (2006a, 2006b) and Villmann et al. (2006) for Euclidean and median clustering and online neural gas, respectively. One can use the same principles to extend relational clustering. The cost function of NG becomes
$$\tilde E_{\mathrm{NG}}(\vec w) = \frac{1}{2}\sum_{i=1}^{k}\sum_{j=1}^{m} h_\sigma\bigl(\tilde k_i(x^j,\vec w)\bigr)\,\Bigl((1-\beta)\,\|x^j - w^i\|^2 + \beta\,\|y^j - Y^i\|^2\Bigr),$$
where $\tilde k_i(x^j,\vec w)$ denotes the rank of neuron i measured according to the mixed dissimilarities above. Batch NG can be directly derived. The ranks are determined based on this extended cost term, which is accompanied by the adaptation $Y^i = \sum_j h_\sigma(\tilde k_{ij})\,y^j \big/ \sum_j h_\sigma(\tilde k_{ij})$ for the prototype labels for batch optimization.

Relational learning becomes possible in the same way as before. Assume pairwise dissimilarities of data dij are given instead of explicit data locations. Then supervised relational neural gas results as displayed in algorithm 3. Assume that a vector space and a symmetric bilinear form exist that induce the dissimilarities: $d_{ij} = \langle x^i - x^j, x^i - x^j\rangle$. In that case, supervised NG can be interpreted as an extension of standard NG applied to the concatenated data $(x^j, y^j)$, with the bilinear form weighted by $1-\beta$ in the first dimensions and $\beta$ in the remaining ones. In particular, the theorems proved in this section also hold for the supervised case.

3.2.  Non-Euclidean Data.

It often holds that data cannot be embedded in Euclidean space. As an example, discrete data such as DNA sequences or strings can be considered and pairwise dissimilarities stem from an alignment of data. This yields a metric, but not necessarily the Euclidean one. For data that stem from experimental measurements, it can even be the case

Algorithm 3. Supervised Relational NG

input

   symmetric dissimilarity matrix with zero diagonal $D = (d_{ij}) \in \mathbb{R}^{m \times m}$;

   label information $y^j \in \mathbb{R}^d$, $j = 1,\dots,m$;

begin

   init $\alpha_{ij}$ with $\sum_j \alpha_{ij} = 1$;

   repeat

     compute $dist(x^j, w^i) := [D\alpha_i]_j - \frac{1}{2}\,\alpha_i^t D\,\alpha_i$;

     set $k_{ij} :=$ rank of prototype i for $x^j$ measured by $(1-\beta)\,dist(x^j, w^i) + \beta\,\|y^j - Y^i\|^2$;

     set $\alpha_{ij} := h_\sigma(k_{ij}) \big/ \sum_{j'} h_\sigma(k_{ij'})$;

     set $Y^i := \sum_j \alpha_{ij}\,y^j$;

   until convergence;

   return $\alpha_{ij}$, $Y^i$;

end.
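
Supervision only changes the dissimilarity on which the ranks are based; a small sketch of the modified rank computation (illustrative names; labels is an (m, d) array of data labels, Y a (k, d) array of prototype labels obtained as $Y^i = \sum_j \alpha_{ij} y^j$):

```python
import numpy as np

def supervised_ranks(D, A, labels, Y, beta):
    """Ranks based on the mixed dissimilarity (1 - beta) * dist(x^j, w^i) + beta * ||y^j - Y^i||^2."""
    DA = A @ D
    dist = DA - 0.5 * np.einsum('ij,ij->i', DA, A)[:, None]           # relational part, shape (k, m)
    label_dist = ((labels[None, :, :] - Y[:, None, :]) ** 2).sum(-1)  # label part, shape (k, m)
    mixed = (1.0 - beta) * dist + beta * label_dist
    return np.argsort(np.argsort(mixed, axis=0), axis=0)
```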

that metric properties such as the triangle inequality are violated. In these cases, parts of the argumentation of the previous section no longer hold. Here we investigate whether relational NG can still be applied in this more general setting, whether it can be connected to a batch NG scheme in an appropriate vector space, and whether a connection to the NG cost function and its dual can be made. We also want to investigate properties such as convergence of the algorithm based on these findings. Further, we will put relational clustering as introduced in the last section into a wider context and relate it to two well-known and very powerful clustering schemes for proximity data based on deterministic annealing, the framework of pairwise data clustering as proposed by Hofmann and Buhmann (1997) and extended to neural gas schemes in Hofmann and Buhmann (1996), and the SOM for proximity data as introduced by Graepel and Obermayer (1999) and later extended to hyperbolic SOM (Saalbach et al., 2005). We will argue that relational clustering constitutes a simple way that allows us to derive deterministic annealing variants as proposed in Hofmann and Buhmann (1997) and Graepel and Obermayer (1999) directly from the corresponding deterministic annealing schemes for standard k-means and SOM, respectively, in the Euclidean setting. For general dissimilarities, relational clustering as well as deterministic annealing can be related to clustering algorithms in a vector space; in particular, an interpretation in terms of prototypes is possible for all algorithms. However, it can be shown that none of the algorithms guarantees convergence to a local optimum of the dual cost function or a related cost function, or even convergence of the algorithm at all.

3.2.1.  Relational Clustering and Batch Clustering in Pseudo-Euclidean Space.

We assume in the following that data are represented by pairwise dissimilarities dij. D denotes the corresponding dissimilarity matrix. This setting has been considered (e.g., in Hofmann & Buhmann, 1997; Puzicha, Hofmann, & Buhmann, 1999; Graepel & Obermayer, 1999). There, the dual cost functions of k-means or related costs are considered, which are well defined for general dissimilarities. The approaches (Hofmann & Buhmann, 1997; Puzicha et al., 1999; Graepel & Obermayer, 1999) propose an optimization of this discrete function using methods of statistical physics, resulting in deterministic annealing schedules for clustering. Note that approximations have to be used because optimization of this cost function constitutes an NP-hard problem (Brucker, 1978). In this section, we discuss how far relational NG can be seen as an alternative solution and in which cases optimization by relational NG as well as deterministic annealing fails. Note that relational NG as defined above can be applied as an algorithm to every setting where a dissimilarity matrix D is given. However, it is not clear how far this procedure leads to meaningful results.

In the following, we always make the reasonable assumption that the diagonal is zero: dii=0. Further, we assume symmetry of D: dij=dji. The latter does not constitute a restriction because it does not affect the cost function. More precisely, it has been shown (e.g., in Laub et al., 2006) that the dual cost function of k-means is invariant with respect to symmetric transformations. The same holds for the dual cost function of NG:

Theorem 10. 

Assume pairwise dissimilarities $d_{ij}$ are given. If we set $\tilde d_{ij} := (d_{ij} + d_{ji})/2$, then $E^{\vee}_{\mathrm{NG}}$ is not affected by this transform.

Proof. 
This equality is obvious because of the identity
$$\sum_{j,j'} h_\sigma(k_{ij})\,h_\sigma(k_{ij'})\,\frac{d_{jj'} + d_{j'j}}{2} = \sum_{j,j'} h_\sigma(k_{ij})\,h_\sigma(k_{ij'})\,d_{jj'}.$$

All finite data sets characterized by a symmetric dissimilarity matrix with zero diagonal can be embedded into a vector space that possesses a symmetric bilinear form as follows (see, e.g., Pekalska & Duin, 2005; Goldfarb, 1984). Define
$$J := I - \frac{1}{m}\,\mathbf{1}\mathbf{1}^t$$
with identity matrix I and the vector $\mathbf{1} = (1,\dots,1)^t \in \mathbb{R}^m$. Define
$$G := -\frac{1}{2}\,J\,D\,J.$$
Obviously, this matrix is symmetric, and, thus, it can uniquely be decomposed into the form
$$G = Q\,\Lambda\,Q^t$$
with orthonormal matrix Q and diagonal matrix of eigenvalues $\Lambda$ with p positive and q negative entries. Taking the square root of $|\Lambda|$ allows the alternative representation in the form
$$G = X^t\,I_{pq}\,X,$$
where Ipq constitutes a diagonal matrix with p entries 1 and q entries −1, that is, $X = |\Lambda_{p+q}|^{1/2}\,Q_{p+q}^t$, where only the p+q nonzero eigenvalues of G and the corresponding columns of Q are taken into account. We can define the symmetric bilinear form in $\mathbb{R}^{p+q}$:
$$\langle u, v\rangle_{pq} := u^t\,I_{pq}\,v = \sum_{l=1}^{p} u_l v_l - \sum_{l=p+1}^{p+q} u_l v_l.$$
Then the columns of X constitute vectors $x^i \in \mathbb{R}^{p+q}$ with pairwise dissimilarities $d_{ij} = \langle x^i - x^j, x^i - x^j\rangle_{pq}$. Hence we have found a vector space together with a symmetric bilinear form and an embedding of the points $x^i$, which yields these dissimilarities under the bilinear form. This embedding is referred to as the pseudo-Euclidean embedding of the data points. The values (p, q, m−p−q) are referred to as the signature of the pseudo-Euclidean space.
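
This construction translates directly into code; the following sketch (illustrative naming) recovers an explicit pseudo-Euclidean embedding and its signature from a symmetric dissimilarity matrix with zero diagonal:

```python
import numpy as np

def pseudo_euclidean_embedding(D, tol=1e-10):
    """Return an embedding X (columns are points), the signature (p, q), and the signs of the
    bilinear form such that d_ij is reproduced by the pseudo-Euclidean bilinear form."""
    m = D.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m          # centering matrix
    G = -0.5 * J @ D @ J                         # Gram-like matrix of the embedding
    evals, Q = np.linalg.eigh(G)                 # G = Q diag(evals) Q^t
    keep = np.abs(evals) > tol                   # discard numerically zero eigenvalues
    evals, Q = evals[keep], Q[:, keep]
    X = np.sqrt(np.abs(evals))[:, None] * Q.T    # columns of X are the embedded points
    p, q = int((evals > 0).sum()), int((evals < 0).sum())
    return X, (p, q), np.sign(evals)             # signs give the diagonal of the bilinear form
```

For a Euclidean dissimilarity matrix, q = 0 and the construction coincides with classical multidimensional scaling.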

Because of this fact, we can always assume that data points are given as vectors $x^i$ in pseudo-Euclidean space. However, this embedding need not correspond to standard Euclidean space. The dissimilarity matrix D stems from Euclidean points if and only if q=0, that is, the diagonal matrix that describes the bilinear form contains only positive diagonal entries (dropping the parts with entry 0 since they obviously do not carry any contribution to the dissimilarity). Otherwise, Euclideanity is violated.

The code of NG can be executed in every vector space that possesses a symmetric bilinear form such as pseudo-Euclidean space. Further, relational NG can be executed based on a dissimilarity matrix D only. Because of theorem 2, these two algorithms correspond to each other, that is, relational NG applied to a symmetric dissimilarity matrix with 0 diagonal is the same algorithm as standard batch NG in pseudo-Euclidean space using the correspondence $w^i = \sum_j \alpha_{ij}\,x^j$ for the embedding of data as given above. Thereby, relational NG does not rely on explicit coordinates of the data points. The above embedding depends on matrix diagonalization; thus, it has cubic complexity. Hence relational NG provides a more efficient scheme than standard NG for the Euclidean as well as the non-Euclidean setting. Obviously, out-of-sample extensions of the found assignments for new data are easily possible, as seen in theorem 9, and the extension to supervised settings as proposed in the last section is immediate.

3.2.2.  Connection to a Cost Function.

The question now is whether a relation of this procedure to the NG cost function and its dual can be made and whether convergence of the algorithm is guaranteed. If D corresponds to a Euclidean setting (i.e., q=0), the guarantees as given in the previous section hold. In particular, RNG converges to a local optimum of the NG cost function, and evaluation of these costs is possible based on the dual.

Unfortunately, these guarantees do not hold in general in pseudo-Euclidean space. This property is due to the fact that consecutive steps of batch optimization do not necessarily find optima in the respective step. While assignments of kij based on the ranks are still optimum in the non-Euclidean setting, prototype assignments $w^i = \sum_j h_\sigma(k_{ij})\,x^j \big/ \sum_j h_\sigma(k_{ij})$ are not. These values can constitute a saddle point or (in case of only negative entries of the bilinear form) even a local maximum of the corresponding part of the cost function. Because of this fact, the value of the NG cost function does not necessarily decrease in consecutive steps, and convergence is not guaranteed.

A proof that convergence in general cannot be guaranteed is given by the following theorem:

Theorem 11. 

Assume dij constitutes a symmetric matrix with diagonal elements 0. Then relational neural gas does not necessarily converge toward a fixed point.

Proof. 
Consider the two-dimensional pseudo-Euclidean space with signature (1, 1, 0). Consider the points , , , , , . The dissimilarity measure yields the dissimilarity matrix
formula
which is obviously symmetric with diagonal 0 and positive off-diagonal elements. We consider two classes: two prototypes of relational NG only. We assume that the neighborhood is chosen small enough (e.g., ) and start with the initialization , . Then we obtain a cyclic behavior that switches between the two cluster assignments (1, 1, 1, 2, 2, 2) and (2, 2, 1, 2, 1, 1) of points ,…, in subsequent steps. (See Figure 1.)

Figure 1:

Example of points in pseudo-Euclidean space for which relational clustering does not converge to a fixed point. Arrows indicate the points that cyclically change their cluster assignments.


This example demonstrates that a large contribution of coefficients in the negative axes of the pseudo-Euclidean space can prevent convergence. In this case, the dissimilarity of a data point and a prototype can even become negative, although pairwise dissimilarities of data are positive. Further, even if a fixed point is found by relational NG, it need not correspond to a local optimum of the cost function; rather, it can correspond to a saddle point. One example of this behavior is the following situation. Consider the three points , , in pseudo-Euclidean space with signature (1, 1, 0). The corresponding dissimilarity matrix given by the bilinear form is the matrix
formula
If we choose only one prototype , NG converges toward the mean vector . The quantization error for this solution yields the value 1.9167. If we choose , we obtain the better value of 1.25. Obviously, in this case, the quantization error is given by a quadratic form that possesses at most one local optimum, which then coincides with the global optimum. Thus, the found solution must be a saddle point. Note that the assignments are the same in this case, although the prototype is not located at an optimum position.

We will see in our experiments that although convergence is not always guaranteed, it can nevertheless be observed for most practical cases. In our experiments, convergence was always given. For fixed points of relational NG, the value of the dual cost function for the found solution corresponds to a value of the standard NG cost function because of theorem 4. Further, the values obtained in these experiments seem reasonably good, and the cost function of NG is decreased in most epochs, such that relational NG can be seen as a reasonable heuristic to arrive at a good solution in these cases. Note that nonconvex quadratic programming is NP hard (as shown, e.g., in Sahni, 1974 and Pardalos & Vavasis, 1991), such that an efficient algorithm that finds optimum prototypes in every epoch instead of the simple mean cannot easily be derived. We can restrict the search space of prototypes to positions given by convex combinations $\sum_j h_\sigma(k_{ij})\,x^j \big/ \sum_{j'} h_\sigma(k_{ij'})$ with permutation assignments kij if we are interested in an optimization of the dual cost function, because of theorem 3. However, these values kij need not coincide with rank assignments for optimum choices. Nevertheless, in practice, this compromise often constitutes a reasonable and efficiently computable trade-off.
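
In practice, possible non-convergence is easy to detect by monitoring the assignments at fixed $\sigma$; a self-contained sketch (illustrative, using the relational update rules described above):

```python
import numpy as np

def check_fixed_point(D, k, sigma, max_iter=200, seed=0):
    """Run relational NG updates at fixed sigma and report whether the rank assignments
    reach a fixed point or cycle (the latter is possible for non-Euclidean D)."""
    rng = np.random.default_rng(seed)
    m = D.shape[0]
    A = rng.random((k, m))
    A /= A.sum(axis=1, keepdims=True)
    seen = []
    for _ in range(max_iter):
        DA = A @ D
        dist = DA - 0.5 * np.einsum('ij,ij->i', DA, A)[:, None]
        k_ij = np.argsort(np.argsort(dist, axis=0), axis=0)
        key = k_ij.tobytes()
        if seen and key == seen[-1]:
            return "fixed point"
        if key in seen:
            return "cycle detected"
        seen.append(key)
        h = np.exp(-k_ij / sigma)
        A = h / h.sum(axis=1, keepdims=True)
    return "no fixed point within max_iter"
```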

3.2.3.  Connection to Deterministic Annealing Approaches.

The fact that convergence of relational NG is not guaranteed is shared by popular and successful alternatives that have been proposed in the literature in the context of clustering proximity data—namely, deterministic annealing for pairwise data clustering and SOM, respectively, as proposed in Hofmann and Buhmann (1997) and Graepel and Obermayer (1999)—when run in fast parallel update mode.1 Relational neural gas can be understood as a direct, simple derivation of the corresponding crisp clustering algorithm that results from the approaches as proposed in Hofmann and Buhmann (1997) and Graepel and Obermayer (1999) in the limit of zero temperature. Thus, because the deterministic annealing schemes (Hofmann & Buhmann, 1997; Graepel & Obermayer, 1999) constitute powerful and effective techniques for grouping and clustering proximities, relational approaches also can be seen as an effective compromise to arrive at reasonable solutions of the NP-hard optimization problem for general dissimilarity data. We will explain this link and give examples of the divergence of the deterministic annealing schemes in the following section.

Deterministic annealing for pairwise data clustering (DA) is introduced in Hofmann and Buhmann (1997) based on a normalization of the dual k-means cost function:
formula
where m denotes the number of data points, refers to the assignment of point xj to cluster i, and denotes the probability of cluster i. The additional summand yields a constant term that emphasizes the independence of the clustering cost function of the absolute dissimilarity scale. Obviously, up to constant factors and summands, the standard dual k-means cost function is obtained. In this approach, a clustering scheme is derived from this cost function by substituting crisp cluster assignments by expectation values for the assignments under a specified certainty level parameterized by a temperature T. Using methods of statistical physics pioneered in the context of clustering by Rose (1998) under the frame of deterministic annealing, the following algorithm is derived, which consists of a subsequent computation of expected assignments and potentials , as shown in algorithm 4. determines the decrease of the temperature .

Algorithm 4. Deterministic Annealing for Pairwise Data Clustering

input

   symmetric dissimilarity matrix with zero diagonal;

begin

   init , randomly;

   init temperature T=T0;

   repeat

     repeat

        ;

        ;

     until convergence

     ;

   until;

   return assignments

end.

The formulas as displayed in algorithm 4 have been derived under the assumption
formula
as shown in equation 38 in Hofmann and Buhmann (1997). If instead the equation
formula
is used, which directly results from the definition of pi, the update formula for is slightly altered to
formula
In the limit of many points , the differences between these two update formulas vanish. These latter updates correspond to the update rules of relational k-means in the limit of zero temperature where the averages become crisp assignments , and the potentials correspond to the dissimilarity distij of data point xj and prototype wi:
formula
Thus, in the limit, DA indirectly performs k-means clustering in pseudo-Euclidean space using the possibility of computing dissimilarities only indirectly by means of the formula provided in theorem 1. Prototypes are given indirectly by the coefficients and can be recovered as for an embedding of data in pseudo-Euclidean space . Instead of a derivation from the dual cost function for pairwise data, DA could alternatively be derived directly from the formulas of deterministic annealing in pseudo-Euclidean space by using theorem 1 and the ideas of the derivation of relational NG from batch NG. Thus, DA for pairwise proximities is an application of deterministic annealing to relational k-means to obtain better results by substituting crisp assignments by their expected values.
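To make this connection concrete, the following minimal Python sketch computes dissimilarities between data and relational prototypes and performs one crisp (zero-temperature) relational k-means step. It assumes that the identity of theorem 1 is the usual relational distance d(xj, wi) = [D alpha_i]_j − ½ alpha_i^T D alpha_i for convex coefficient vectors alpha_i; the function names are ours, not the authors'. The deterministic annealing variant with the altered update rule would, roughly, replace the crisp argmin by expected assignments proportional to exp(−dist/T) and use these (normalized) expectations as coefficients.

    import numpy as np

    def relational_distances(D, alpha):
        # dissimilarities between all data points and prototypes represented by
        # convex coefficient vectors (rows of alpha), using the assumed identity
        # d(x_j, w_i) = [D alpha_i]_j - 1/2 alpha_i^T D alpha_i
        cross = alpha @ D                                     # entry (i, j) = [D alpha_i]_j
        self_term = 0.5 * np.einsum('im,mn,in->i', alpha, D, alpha)
        return cross - self_term[:, None]

    def relational_kmeans_step(D, alpha):
        # one crisp (zero-temperature) update: assign every point to its closest
        # prototype and recompute coefficients as the mean of the receptive field
        dist = relational_distances(D, alpha)
        winners = dist.argmin(axis=0)
        new_alpha = np.zeros_like(alpha)
        for i in range(alpha.shape[0]):
            members = np.flatnonzero(winners == i)
            if members.size:
                new_alpha[i, members] = 1.0 / members.size
            else:
                new_alpha[i] = alpha[i]                       # keep empty prototypes unchanged
        return new_alpha, winners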
Similarly, DA for proximity SOM (DASOM) as proposed in Graepel and Obermayer (1999) optimizes the dual cost function of the standard SOM objective as proposed by Heskes (2001):
formula
The resulting algorithm repeats the following assignments in the inner loop,
formula
for the expected assignments of data to clusters and
formula
where
formula
denotes the coefficients of prototypes in pseudo-Euclidean space. The term corresponds to the distance of data points from prototype averaged over the local neighborhood provided by SOM. As before, these update formulas can be interpreted as relational versions of deterministic annealing of batch SOM in pseudo-Euclidean space or, alternatively, as deterministic annealing of relational SOM.

The deterministic annealing schemes are derived from their respective cost function using methods of statistical physics: cluster assignments are characterized by a Gibbs distribution depending on the cost terms. A mean field approximation is taken to approximate the Gibbs distribution in factorial form. Partial assignment costs can be derived by optimizing the Kullback-Leibler divergence between the Gibbs distribution and the factorial approximation. In both cases, optimal potentials (which relate to prototype positions in pseudo-Euclidean space) are determined based on the derivatives, that is, only the necessary condition for a local optimum is guaranteed in both cases. In the same way as for relational NG, the sufficient condition on a local optimum is not necessarily fulfilled if a non-Euclidean dissimilarity matrix D is considered since the Hessian is not globally definite in this case. In consequence, there exist situations where deterministic annealing does not converge in the same way as relational clustering.

As an example, we consider the same situation as provided in theorem 11. Since the update rules of DA (in changed form) and DASOM yield the standard relational k-means update rules, the same behavior as for relational k-means can be observed if initialized appropriately. More precisely, for a temperature at most 0.001 and initialization as
formula
a cyclic change of this state to the setting
formula
is observed. For the original DA rule, due to the slightly different update, a different behavior holds for this setting. We obtain the same cyclic changes if we start with four identical copies of every point, since a large enough number of data m causes essentially the same updates for both DA versions.
The fact that cycles can occur has already been pointed out in Hofmann and Buhmann (1997). Hofmann and Buhmann therefore recommend updating the quantities and sequentially by picking a random coefficient j (corresponding to a data point xj) and updating and for this j only in every inner loop. As a consequence, cycles that constitute only local traps of the update dynamics can partially be avoided, such as the example introduced above, which does not lead to cyclic behavior for online updates. However, cyclic behavior can still be present for online updates. As an example, we consider the points , , , in two-dimensional pseudo-Euclidean space with signature (1, 1, 0). The corresponding dissimilarity matrix is given as
formula
When started in
formula
and corresponding with temperature T at most 0.001, the points and change their expected assignments between and , while the other assignments remain constant for DA with altered update rule. Again, the original DA rule behaves slightly differently, but we obtain qualitatively the same cyclic behavior when using four identical copies of every point, as before. Thus, sequential update can only partially avoid cyclic behavior.

As a consequence of this argumentation, we can interpret relational clustering as the crisp limit case of very popular deterministic annealing variants for pairwise data clustering. Because it avoids the inner loop, it is considerably faster than DA schemes, at the price of possibly worse optima. In principle, however, the theoretical guarantees of both methods with respect to convergence are the same, since they show limit cycles in comparable situations for parallel updates (which turn out to be rare in practical applications; we did not observe this behavior in a single practical experiment). Cycles for DA with sequential update are even rarer (we could construct an example only with negative dissimilarities) and hardly occur in practice, but sequential updates come at the cost of considerably slower convergence. Relational NG could be run in sequential mode in analogy to DA, leading to convergence in most situations while still avoiding one inner loop in comparison to online DA, at the cost of possibly slightly worse optima.

The connection of relational clustering to DA schemes makes it possible to interpret DA schemes in the same way as the relational variants: DA schemes constitute standard clustering methods in pseudo-Euclidean space. Thus, the methods can be interpreted as prototype-based schemes where prototypes are represented indirectly by means of coefficients. In particular, interpretation of the methods in terms of the data points closest to the prototypes, as well as fast extensions to very large data sets in terms of patch clustering, which we introduce later in this article, become possible this way for the latter algorithms also.

3.2.4.  Spread Transformation.

As an alternative to a direct application of clustering in pseudo-Euclidean space, one can first change the dissimilarity matrix to make it Euclidean, such that convergence of the algorithm is guaranteed. This procedure has been proposed in the context of k-means (e.g., in the approaches of Laub et al., 2006, and Roth et al., 2003). In these approaches, it is demonstrated that a theoretical basis of this technique is given by the fact that the dual k-means cost function is invariant under additive shifts of the off-diagonal elements of the dissimilarity matrix. The approach of Roth et al. (2003) shows that this also holds for a few further popular cost functions (e.g., connected to multidimensional scaling or graph cut).

This fact is preserved by the dual cost function of NG for small neighborhood range. More precisely, we find the following result:

Theorem 12. 
Assume pairwise distances dij are given such that the corresponding matrix D is symmetric with zero diagonal. Assume . Consider the distances with shifted off-diagonal terms for where denotes the Kronecker delta. Assume the neighborhood range is chosen as . Then the distance of the dual cost function of neural gas for dij and can be estimated as
formula
where .

Proof. 
We find
formula
The latter term can be decomposed into the sum over all prototypes for which at least one kij=0 exists, and the remaining prototypes, which are not winners for a data point. We obtain for the first term,
formula
for , and for the second term,
formula
for . Therefore, substituting these terms by d0k/4 changes the result by at most .

Thus, such a transformation of the dissimilarity matrix changes neither the shape of the dual cost function nor the location of local and global optima for small . One can see that for large enough d0, a dissimilarity matrix results that stems from the squared Euclidean metric. This procedure has been introduced in the literature under the notion of spread transformation (e.g., in connection with relational fuzzy clustering; Hathaway & Bezdek, 1994) or constant shift embedding in connection with k-means clustering (Roth et al., 2003). The following result is well known in the literature and follows immediately from the embedding of data in pseudo-Euclidean space (e.g., Pekalska & Duin, 2005):

Theorem 13. 
Assume D is a symmetric dissimilarity matrix with zero diagonal. Then
formula
is squared Euclidean, that is, there exist Euclidean points with for all i, j, where denotes the Gram matrix used for the pseudo-Euclidean embedding of the points, and denotes the smallest eigenvalue of G. This shift denotes the smallest possible value to achieve this property.

The result is obvious because this shift corresponds to a shift of the Gram matrix of the form . This procedure has been proposed, for example, in Laub et al. (2006) and Roth et al. (2003). Since only the smallest eigenvalue of G is needed, this procedure is much more efficient than an explicit embedding and correction of the non-Euclideanity, which would require cubic effort. Further, unlike alternative ways of transforming a general dissimilarity matrix into squared Euclidean form as proposed in Chen et al. (2009), the spread transform does not affect the cost function. In fuzzy clustering, it has been proposed to use the spread transformation if and only if distances of data points and prototypes become negative (Hathaway & Bezdek, 1994).
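The minimal shift of theorem 13 only requires the smallest eigenvalue of the centered Gram matrix. The following minimal Python sketch illustrates such a constant shift ("spread") transform; the double centering and the factor −2λmin follow the standard constant shift embedding construction and are assumptions of this sketch, not a literal transcription of the theorem.

    import numpy as np

    def spread_transform(D):
        # minimal off-diagonal shift that renders a symmetric, zero-diagonal
        # dissimilarity matrix squared Euclidean (constant shift embedding sketch)
        m = D.shape[0]
        J = np.eye(m) - np.ones((m, m)) / m        # centering matrix
        G = -0.5 * J @ D @ J                       # Gram matrix of the pseudo-Euclidean embedding
        lam_min = np.linalg.eigvalsh(G).min()
        if lam_min >= 0:                           # already squared Euclidean
            return D.copy(), 0.0
        d0 = -2.0 * lam_min
        return D + d0 * (1.0 - np.eye(m)), d0      # shift off-diagonal entries only

Applied to the benchmark matrices, the returned shift d0 should correspond to the minimum spread values listed in Table 1.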

Unfortunately, it turns out that this procedure is partly of theoretical interest only for some practical problems. The reason is that the transformation can make the problem harder for standard methods such as relational NG and deterministic annealing of pairwise proximities. While the transform does not affect the location of local optima, it changes the relative differences between good and bad local optima of the cost function. This does not affect optimization methods that are controlled by a cost difference, such as the method proposed in Yom-Tov and Slonim (2009). As seen in our experiments, deterministic annealing variants for pairwise data clustering and relational neural gas are affected by the shift, such that these methods can no longer easily find good local optima. Hence, the direct application of the relational models to non-Euclidean data can give better results (although convergence is not guaranteed in theory, it can usually be observed in practice) than the application of the corresponding method in the shifted Euclidean space (although convergence is guaranteed in the latter case).

We demonstrate this effect on a standard benchmark data set. The cat cortex data originate from anatomical studies of cats' brains. The dissimilarity matrix displays the connection strength among 65 cortical areas (Graepel & Obermayer, 1999). For our purposes, a preprocessed version as presented in Haasdonk and Bahlmann (2004) was used. The matrix is symmetric with zero diagonal, but the triangle inequality does not hold. The signature of the related pseudo-Euclidean space is (41, 23, 1); that is, about one-third of the directions are associated with negative eigenvalues. The corresponding eigenspectrum is depicted in Figure 2. We applied the spread transform to this data set according to theorem 5. We trained (batch) deterministic annealing and relational neural gas on these data sets, using five prototypes in each case and default parameters (for deterministic annealing: number of epochs of the outer loop = 300, number of epochs of the inner loop = 50, start temperature = 100, noise level added to distances to avoid identical prototypes = 10−6; for relational neural gas: number of epochs = 100, start neighborhood range = 2.5). Figure 3 shows the quality, measured by the dual quantization error, of the optima found when repeatedly initializing the algorithms with small random values for the data assignments over 1000 runs. A curve with a well-expressed maximum results, representing the most prominent optimum. As expected, this optimum is a bit better for deterministic annealing than for relational neural gas due to the soft assignments of the data points, which result in a slightly better optimum at the cost of a slower convergence speed and the necessity of an additional inner loop. When considering the spread-transformed data, which correspond to a Euclidean representation, the optima are shifted to the right in both cases. Thereby, we evaluate the result based on the found assignments only, taking the original dissimilarities; that is, we evaluate the quality of the found clustering for the original clustering problem in both cases. Hence, although the algorithms deal with a Euclidean setting instead of a merely pseudo-Euclidean one in these cases, the optima found by the algorithms are worse than the original ones. This behavior is quite typical and can also be observed for other non-Euclidean data sets. Hence, a direct application of relational neural gas to the dissimilarity matrix instead of a prior spread transformation of the data can be advisable.

Figure 2:

Eigenspectrum of the cat cortex data set when embedded into pseudo-Euclidean space

Figure 3:

Local optima of the dual k-means cost function reached by repeated runs of relational neural gas (left column) and deterministic annealing (right column) on the cat cortex data set for the original cat cortex data set (top) and its spread transformation (bottom), respectively.

4.  Patch Clustering

The problem of mining large data sets has become one of the central issues of data mining. Roughly, the amount of electronically available data doubles every 20 months, reaching almost every area of daily life and science, such that people have to cope with massive data sets that cannot be scanned manually. Clustering and visualization offer fundamental techniques to adequately compress and preprocess such enormous data sets. In these cases, however, data no longer fit into main memory, such that batch processing methods that rely on all data at once become infeasible. Further, at most one scan through the data is still affordable, which also makes online alternatives such as online neural gas unsuitable. The situation is even worse for dissimilarity data, since the clustering methods rely on the quadratic dissimilarity matrix (i.e., they display at least quadratic time complexity) and, as in the case of relational clustering and deterministic annealing, linear space complexity for the classifier itself. Both issues make the methods slow for settings that reach 10,000 data points and entirely unsuitable for common desktop computers available today if more than 100,000 data points are involved.

Due to these problems, a variety of clustering algorithms for streaming data have been proposed in the literature, which ideally work in linear time and constant space. Most of these approaches have been proposed for Euclidean data. They include extensions of classical k-means or k-median clustering that partially incorporate approximation guarantees, such as those of Badoiu, Har-Peled, and Indyk (2002) and Kumar, Sabharwal, and Sen (2004), which reach linear time but whose space requirements depend on the number of points; heuristics that partially rely on sampling with corresponding statistical guarantees or on grid-based methods, such as popular algorithms dedicated to very large data sets like CURE, STING, and BIRCH (Domingos & Hulten, 2001; Guha, Rastogi, & Shim, 1998; Wang, Yang, & Muntz, 1997; Zhang, Ramakrishnan, & Livny, 1996); and iterative compression approaches that process only a fixed subset of the given data at a time (Bradley et al., 1998; Farnstrom et al., 2000; Alex et al., 2009).

A few proposals for large sets of general non-Euclidean dissimilarity data exist. Hierarchical clustering typically has squared complexity, which can partially be optimized; however, superlinear complexity remains, and these methods are typically not very stable with respect to noise (Murtagh, 2002). The method of Kohonen and Somervuo (2002) extends the standard SOM to dissimilarity data by means of the generalized median and tackles large data sets by simple subsampling of the data. Affinity propagation (Frey & Dueck, 2007) also relies on median clustering by restricting prototype locations to data points, and it can be used for large data sets if the connection matrix is sparse, since only existing connections are effectively used in this setting. However, both methods are restricted due to the restricted prototype locations, and they require several sweeps through the entire data set. Two approaches that require only one sweep are given in Bezdek, Hathaway, Huband, Leckie, and Kotagiri (2006) and Yom-Tov and Slonim (2009). Bezdek et al. rely on relational variants of fuzzy k-means clustering and extend these to large data sets by subsampling a characteristic part of the dissimilarity matrix, clustering this subpart, and extending the result to all data. This way, however, the full matrix has to be available in advance or, alternatively, data must be independent and identically distributed (i.i.d.) to obtain reliable extensions to new parts. Similarly, the Nyström approximation constitutes a popular vehicle to extend the results obtained on a part of a dissimilarity or kernel matrix to the full data set, where approximation bounds can be derived explicitly. The work of Belongie, Fowlkes, Chung, and Malik (2002) constitutes one proposal where this approach has been used in the context of graph clustering for general dissimilarity graphs. However, a representative sample has to be available that allows the extension of the clustering to the full data set. In contrast, the approach of Yom-Tov and Slonim (2009) proposes to process only parts of the given data on parallel processors by a direct optimization of pairwise clustering and to subsequently reach valid assignments of all data this way. However, the presented method no longer represents the solution in the form of interpretable prototypes.

Here we rely on patch clustering as introduced in Alex et al. (2009), Farnstrom et al. (2000), and Bradley et al. (1998) for Euclidean data sets to extend prototype-based clustering methods for dissimilarity data to large or streaming data sets. The basic idea is to iteratively process only a small part of the data using standard k-means or neural gas and to store these data in compressed form in terms of the prototypes and their multiplicities. This compressed form serves as sufficient statistics for further runs: subsequent runs cluster the next patch together with the already seen data represented in the form of the prototypes counted with their multiplicities. Note that the final shape of the clusters always depends on the full data set for patch clustering, and the statistics accumulated in every run are adapted according to the given data, such that patch clustering can better deal with data that are not presented in an i.i.d. fashion than the approaches noted above.

4.1.  Patch Definition.

To transfer this method to dissimilarity data, we assume the following setting. A (possibly large) set of points xi indexed i = 1, 2,…, m is given such that for all i and j, the dissimilarity dij between these points can be computed directly. D denotes the corresponding dissimilarity matrix, where we assume symmetry dij=dji and zero diagonal dii=0 as before. A typical example of this setting is a database of strings for which pairwise comparisons are given by alignment. For large m, it is in general infeasible to compute or store the full dissimilarity matrix in main memory due to the squared complexity. Patch processing relies on the principle of processing data in np patches of fixed size p = m/np. Thereby, we assume divisibility of m by np for simplicity; in practice, the last patch is simply of smaller size. For dissimilarity data, a patch Pt is then represented by the corresponding portion of the dissimilarity matrix D:
formula
which represents the dissimilarities of points .
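For illustration, the following minimal Python sketch cuts such a patch from a pairwise dissimilarity function diss(i, j) that is evaluated on demand; the function name and the 0-based indexing are assumptions of this sketch.

    import numpy as np

    def cut_patch(diss, t, p):
        # dissimilarity block of the t-th patch (0-based) of fixed size p;
        # only this p x p block is ever held in memory
        idx = list(range(t * p, (t + 1) * p))
        return np.array([[diss(i, j) for j in idx] for i in idx]), idx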

4.2.  K-Approximation.

The idea of original patch clustering is to add prototypes from the processing of the former patch Pt−1 counted with multiplicities according to the size of their receptive field as additional data points to the current patch Pt. These points play the role of a compressed representation of all already seen data points; they provide sufficient statistics of the information processed so far. This way, all data are processed without loss of essential information since the previous information is represented by the sufficient statistics.

A naive transfer of this method from the Euclidean case to relational clustering, however, is not possible for two reasons. Unlike patch processing for Euclidean data, prototypes correspond to a weighting of all points involved in the clustering. Thus, the dimensionality of the coefficient vectors is determined by the number of data points that have to be processed, which results in an infeasible linear space complexity for huge data sets. In addition, for further processing of the data, the dissimilarities between all prototypes and all data from a new patch have to be computed. Since prototypes are represented indirectly by the contributions of all data points seen so far, the distance of prototypes and a new patch relies on the distance of the new patch and all data seen so far. By induction, one can see that this processing is possible only if the full dissimilarity matrix D is available. Hence, this approach results in infeasible quadratic time and space complexity.

Because of this fact, an approximation scheme is introduced that substitutes the full vector of coefficients that characterize the prototypes by only the K most prominent ones, K being a fixed number. Every patch is clustered until convergence, that is, a small neighborhood range is used in the resulting prototype representations. Obviously, for the coefficients of the prototypes, the following is valid:
formula
where Ri denotes the receptive field of prototype i. Thus, the coefficient vector yields 1/|Ri| for all data points for which prototype i is the winner. Thereby, for simplicity, we assume that , which is usually the case for NG schemes. If , then is nonvanishing if prototype i is the second closest neuron of . If this set is also empty, the third closest points determine the nonzero entries of , and so on.

We approximate the prototype by the closest K points in Ri, that is, the points where kij=0 and which are among the K points with the smallest dissimilarity as computed in relational neural gas. Note that more complex alternatives would be possible that choose K points and coefficients such that the corresponding prototype location changes as little as possible, as described in Rossi, Hasenfuss, and Hammer (2007). However, the above simple approximation scheme leads to good results, as we will see in the experiments.

These considerations give rise to the definition of a K-approximation of a relational prototype. Assume a prototype is given with . A K-approximation refers to the K indices j1,…, jK corresponding to points ,…, with the smallest dissimilarity to . For optimum prototypes as computed by relational NG in the limit phase, these points are the K closest points in the receptive field of . Obviously, for these points, the coefficient is maximum. The K-approximation can be computed easily in relational NG since the dissimilarities of data points and neurons are readily available.

Note that by always restricting to the K closest data points, K being a number fixed a priori, prototypes can always be approximated in constant space while processing all data. Further, only a fixed portion of the global dissimilarity matrix D is necessary to compute dissimilarities between prototypes and further patches.
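As a rough illustration, a K-approximation can be read off directly from the quantities computed by relational NG. The following sketch assumes crisp winner assignments, a nonempty receptive field, and equal weights 1/K for the selected points; the function name and argument layout are ours.

    import numpy as np

    def k_approximation(dist_i, winners, i, K):
        # K-approximation of prototype i: the K points of its receptive field with
        # the smallest dissimilarity to it (dist_i: dissimilarities of all points
        # to prototype i; winners: crisp winner index per point)
        field = np.flatnonzero(winners == i)              # receptive field R_i, assumed nonempty
        closest = field[np.argsort(dist_i[field])[:K]]
        return closest, np.full(len(closest), 1.0 / len(closest))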

4.3.  Extended Patches.

More precisely, we describe the parts of the dissimilarity matrix that are needed for further processing of patches and K-approximated prototypes. Assume the current patch Pt is considered, and assume Nt−1 refers to the index set of the K-approximation of all prototypes obtained in the previous step. When k prototypes are considered, the size of this set is restricted by , and equality holds under the assumption that at least K points lie in every receptive field. For the next round of patch processing, dissimilarity clustering is applied to the points corresponding to the indices in Nt−1 and the data from the current patch; that is, we need the following part of the dissimilarity matrix,
formula
where denotes the interdissimilarities of points from the K-approximation, and denotes the dissimilarities of points in the K-approximation and the current patch. We refer to these enlarged blocks as extended patches.

4.4.  Patch-Relational Neural Gas.

Based on these data-handling techniques, patch-relational neural gas can be defined as iterative processing of patches enriched by the K-approximation of prototypes from the previous patch. The prototypes contribute to the new clustering task according to the sizes of their receptive fields; that is, a prototype is counted with multiplicity |Ri|. Correspondingly, every point in Nt−1 contributes according to the fraction |Ri|/K if it lies in the receptive field of , that is, kij=0. Hence, we set the multiplicity mj=|Ri|/K where lies in the receptive field of . It is straightforward to extend relational neural gas to deal with multiplicities mj of point corresponding to the underlying cost function . The only change concerns the update of the coefficients (see algorithm 5). This algorithm can be used as an inner loop for patch processing of dissimilarity data, as shown in algorithm 6. Thereby, prototypes W are always represented by an index vector corresponding to the data points that contribute to this prototype and a coefficient vector that specifies the strength of the contributions. Unlike the coefficients in full relational NG, the coefficient vector of patch NG is sparse and can be represented in constant space. Note that we assume that the relevant parts of the dissimilarity matrix can be computed on demand, such that the full dissimilarity matrix need not be computed or stored at the beginning of the algorithm. This constitutes a further advantage of patch processing, since it is sufficient to compute only a linear part of the dissimilarity matrix. Since, depending on the application scenario, the dissimilarity computation can be quite demanding (e.g., alignment of sequences in bioinformatics, or the normalized compression distance for text processing), this can result in drastic computational savings.

Algorithm 5: Relational NG with Multiplicities

input

   symmetric dissimilarity matrix

   multiplicities

begin

   init with ;

   repeat

     compute ;

     set ;

     set

   until convergence;

   return;

end.
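To make the update concrete, the following minimal Python sketch gives one possible reading of algorithm 5. It assumes the standard batch NG ingredients (rank-based neighborhood h(r) = exp(−r/lambda) with an exponentially decreasing neighborhood range) and the relational distance identity assumed above for theorem 1; the annealing schedule and all names are illustrative rather than a literal transcription.

    import numpy as np

    def relational_distances(D, alpha):
        # assumed identity: d(x_j, w_i) = [D alpha_i]_j - 1/2 alpha_i^T D alpha_i
        return alpha @ D - 0.5 * np.einsum('im,mn,in->i', alpha, D, alpha)[:, None]

    def relational_ng(D, mult, k, epochs=100, lam0=None, seed=0):
        # relational NG with multiplicities: coefficients are the rank-based
        # neighborhood weights, multiplied by the multiplicities and normalized
        m = D.shape[0]
        rng = np.random.default_rng(seed)
        alpha = rng.random((k, m)) * mult                            # random convex coefficients
        alpha /= alpha.sum(axis=1, keepdims=True)
        lam0 = k / 2.0 if lam0 is None else lam0
        for t in range(epochs):
            lam = lam0 * (0.01 / lam0) ** (t / max(epochs - 1, 1))   # annealed neighborhood range
            dist = relational_distances(D, alpha)
            ranks = np.argsort(np.argsort(dist, axis=0), axis=0)     # k_ij: rank of prototype i for x_j
            h = np.exp(-ranks / lam) * mult                          # neighborhood weights times multiplicities
            alpha = h / h.sum(axis=1, keepdims=True)
        return alpha, relational_distances(D, alpha)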

Algorithm 6: Patch Relational NG

begin

   cut the first patch P1;

   apply relational NG to P1, yielding prototypes W1;

   compute the K-approximation N1 of W1;

   update multiplicities mi of N1;

   set t=2;

   repeat

     cut the next patch Pt;

     construct extended patch using Pt and Nt−1;

     set multiplicities of points in Pt to mi=1;

     apply relational NG with multiplicities to the extended patch, yielding prototypes Wt;

     compute K-approximation Nt of Wt;

     update multiplicities mi of Nt;

     t := t + 1;

   until t = nP

   return prototypes ;

end.

After processing, a set of prototypes together with a reasonable K-approximation thereof is obtained that compresses the full data set. As before, an inspection of prototypes is easily possible by looking at the points closest to these prototypes.
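Putting the pieces together, a patch run could look roughly as follows. The sketch reuses the diss(i, j) and relational_ng helpers assumed in the earlier sketches, handles the K-approximation and the multiplicities |Ri|/K inline, and assumes for simplicity that the patch size p divides the number of data points m.

    import numpy as np

    def patch_relational_ng(diss, m, p, k, K):
        # iterative patch processing (a sketch of algorithm 6): each patch is
        # clustered together with the K-approximation of the previous prototypes
        approx_idx = np.array([], dtype=int)
        approx_mult = np.array([])
        for t in range(m // p):
            patch_idx = np.arange(t * p, (t + 1) * p)
            idx = np.concatenate([approx_idx, patch_idx])        # extended patch indices
            mult = np.concatenate([approx_mult, np.ones(p)])     # multiplicities
            D_ext = np.array([[diss(i, j) for j in idx] for i in idx])
            alpha, dist = relational_ng(D_ext, mult, k)
            winners = dist.argmin(axis=0)
            new_idx, new_mult = [], []
            for i in range(k):                                   # K-approximation per prototype
                field = np.flatnonzero(winners == i)
                if field.size == 0:
                    continue
                closest = field[np.argsort(dist[i, field])[:K]]
                new_idx.extend(idx[closest])
                new_mult.extend([mult[field].sum() / len(closest)] * len(closest))
            approx_idx = np.array(new_idx, dtype=int)
            approx_mult = np.array(new_mult)
        return approx_idx, approx_mult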

Note that the algorithm runs in constant space if the size p of the patches is chosen independent of the data set size m. Similarly, under this assumption, the fraction of the distance matrix that has to be computed for the procedure is of linear size , and the overall time complexity of patch clustering is of size , assuming constant p. Hence, a linear time and constant space algorithm for general dissimilarity data results that is suited for large data sets if constant patch size is taken. In the experiments, we will demonstrate an application to a data set of size almost 200,000, which corresponds to a full dissimilarity matrix for which the storage would require almost 251 GB, assuming double precision.
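The memory figure is easy to verify: storing the full matrix in double precision would require m² × 8 bytes = 183,546² × 8 bytes ≈ 2.7 × 10¹¹ bytes, that is, roughly 251 GB when 1 GB is read as 2³⁰ bytes, whereas a single precomputed patch block of 1,000 × 1,000 dissimilarities, as used in the text experiment below, occupies only about 8 MB.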

This way, a linear time and constant space clustering algorithm is obtained that can deal with general dissimilarity data. Since it relies on only a linear part of the full dissimilarity matrix, the complexity of data preprocessing (i.e., the computation of probably complicated pairwise dissimilarities such as alignment distances) is also greatly reduced. Further, the algorithm provides an explanation of the clustering in terms of prototypes that can be represented by a finite number of representative data points; hence, the result can be directly inspected by human experts.

5.  Experiments

We demonstrate the behavior of relational NG in a variety of experiments whereby we mainly focus on benchmark data sets that cannot be embedded in Euclidean space. Since convergence of relational NG or the monotonicity of the cost function during training is theoretically not guaranteed in such situations, we first look at the development of the cost function values in a typical setting. Then, we evaluate relational NG in comparison to deterministic annealing on a variety of benchmark data sets, showing the competitiveness of the algorithm. For patch NG, we first demonstrate the robustness of the algorithm with respect to the patch size, the quality of the K-approximation, and the order of the presentation of patterns in comparison to full batch clustering. Then we show an application to a huge text corpus, demonstrating the efficiency of the method for large data sets.

5.1.  Convergence.

For all data sets considered in the experiments, convergence of relational NG was observed. We depict the behavior of relational NG for the cat cortex data set as introduced above using 5 neurons and 100 epochs. Figure 4 (top) displays the value of the NG cost function and its dual based on the rankings kij and the prototypes , respectively, as computed by NG. Obviously the two cost functions are strictly monotonic and convergent, and, apart from the first steps, they coincide for the computed values. Similarly, the vector quantization cost function and its dual computed for the assignments and prototypes as given by relational neural gas are strictly monotonic and convergent, as can be seen in Figure 4 (bottom). Due to the fact that the quantization error is computed for the (in terms of k-means) suboptimum prototypes that incorporate neighborhood smoothing, while the dual costs are determined on the assignments only (implicitly assuming optimum positions of the prototypes in terms of k-means), the quantization error is worse compared to the value of the dual cost function for early stages of training, which display a large neighborhood cooperation.

Figure 4:

Cost function of NG and its dual (top) and standard quantization error and its dual (bottom) for the parameters as determined by relational NG.

5.2.  Experiments on Benchmark Data Sets.

In addition to the cat cortex data set, we consider the following benchmark data sets, which were symmetrized prior to training and linearly transformed from similarities to dissimilarities, if necessary:

  • • 

    Protein data: The protein data set as described in Mevissen and Vingron (1996) consists of 226 globin proteins that are compared based on their evolutionary distance. The samples originate from different protein families: hemoglobin-α, hemoglobin-β, myoglobin, and others. Here we distinguish five classes as proposed in Haasdonk and Bahlmann (2004): HA, HB, MY, GG/GP, and others. Unlike the other data sets considered here, the protein data set has a highly unbalanced class structure, with class distribution HA (31.86%), HB (31.86%), MY (17.26%), GG/GP (13.27%), and others (5.75%).

  • • 

    Copenhagen chromosomes data. The Copenhagen chromosomes data set is a benchmark from cytogenetics (Lundsteen, Phillip, & Granum, 1980). Forty-two hundred human chromosomes from 22 classes (the autosomal chromosomes) are represented by the gray levels of their images. These images are transformed into strings based on their thickness. The strings can be compared using the edit distance, which constitutes a typical dissimilarity measure for strings (Juan & Vidal, 2000). The substitution costs are thereby given by the difference of the entries, and insertion and deletion costs are set to 4.5 (Neuhaus & Bunke, 2006).

  • • 

    Aural sonar data. The aural sonar data set as described in Chen et al. (2009) consists of 100 returns from a broadband active sonar system, labeled in two classes, target-of-interest versus clutter. The dissimilarity is scored by two independent human subjects, each resulting in a dissimilarity score in .

  • • 

    Caltech data. The caltech data set consists of 8677 images from 101 object categories that are compared using the pyramid match kernel on SIFT features (see Chen et al., 2009).

  • • 

    Face recognition data. The face recognition data set consists of faces of 139 people. Dissimilarities are computed by means of cosine similarity between integral invariant signatures based on surface curves of the 3D faces (see Chen et al., 2009).

  • • 

    Patrol data. The patrol data set describes 241 members of seven patrol units and one class corresponding to people not in any unit. Dissimilarities are computed based on every person in the patrol units naming five other persons in their unit, whereby the responses were partially inaccurate. Every mention yields an entry of the dissimilarity matrix (see Chen et al., 2009). Data are sparse in the sense that most entries of the matrix correspond to the maximum dissimilarity, which we set to three.

  • • 

    Voting data. The voting data set describes a two-class classification problem incorporating 435 samples given by 16 categorical features with three different possible values each. The dissimilarity is determined based on the value difference metric (see Chen et al., 2009).

These data sets are non-Euclidean with signature as given in Table 1. Obviously, the protein data and the caltech data set are almost Euclidean, while at least one-third of the eigenvalues are negative for the other data sets.

Table 1:
Signature of the Benchmark Data Set and Minimum Absolute Value of the Spread Transform to Obtain Squared Euclidean Distances.
Data Set             Signature           Minimum Value of Spread Transform
cat cortex (41,23,1) 4.93 
chromosomes (1951,2206,43) 851.79 
protein data (218,4,4) 2.6 
aural sonar (54,45,1) 2.1 
caltech (8664,12,1) 1.67 
face recognition (311,310,324) 7.9 
patrol (173,67,1) 4.33 
voting (105,235,95) 3.2 

We performed a repeated cross-validation for all data sets, using 10 repeats. We report the results of relational neural gas (RNG) and supervised relational neural gas (SRNG) for these data sets. For relational neural gas, 100 epochs were used; for the supervised version, the supervision parameter equals . For comparison, we report the results of deterministic annealing (DA); here, 300 epochs were used for training. The number of prototypes used for every data set and the number of folds are reported in Table 2. Results are reported on the training and test set. The runs are evaluated by the classification accuracy obtained by posterior labeling on the training set. In addition, the quantization error and the value of the dual k-means cost function are reported. Thereby, relational prototypes allow an out-of-sample extension for both relational neural gas and deterministic annealing, as discussed previously. Since we can interpret deterministic annealing as annealing of clustering in pseudo-Euclidean space, out-of-sample extensions for deterministic annealing can be obtained in the same way.

Table 2:
Number of Neurons and Number of Folds Used for the Runs.
Data Set             Number of Neurons   Number of Folds
cat cortex           12                  2
chromosomes          60                  2
protein data         20                  10
aural sonar          10                  2
caltech              103                 2
face recognition     139                 10
patrol               24                  10
voting               20                  10

The results of the runs are reported in Table 3. Interestingly, the value of the dual cost function, the quantization error, and the classification accuracy of RNG are always competitive with the values obtained by DA, although the latter method requires more training time due to an additional inner loop. As expected, the values of SRNG for the unsupervised cost functions are a bit worse, since this method alters the objective to better take label information into account. This corresponds to an improvement of the classification accuracy on the training data in all cases but one.

Table 3:
Averaged Results of a Repeated Cross-Validation on the Data Sets.
                Dual Cost Function            Quantization Error            Classification Accuracy
                Training        Test          Training        Test          Training        Test
cat cortex  
   RNG 11.49 (0.35) 15.47 (0.56) 11.49 (0.35) 24.92 (0.32) 0.928 (0.028) 0.69 (0.076) 
   SRNG 11.96 (0.56) 15.16 (0.39) 12.02 (0.61) 24.29 (0.47) 0.994 (0.008) 0.724 (0.062) 
   DA 11.12 (2.58) 14.93 (2.07) 11.12 (2.58) 25.85 (0.64) 0.890 (0.105) 0.803 (0.083) 
chromosomes  
   RNG 22479.36 (54.42) 22864.64 (45.32) 22479.36 (54.42) 24275.99 (39.63) 0.893 (0.004) 0.897 (0.004) 
   SRNG 22470.29 (65.97) 22848.01 (35.06) 22475.19 (65.79) 24241.31 (30.11) 0.912 (0.003) 0.907 (0.004) 
   DA 22608.49 (43.41) 23090.59 (51.96) 22608.49 (43.41) 24374.89 (38.22) 0.910 (0.004) 0.908 (0.003) 
protein data  
   RNG 293.65 (0.73) 17.17 (0.54) 293.65 (0.73) 40.45 (0.93) 0.942 (0.002) 0.919 (0.016) 
   SRNG 291.28 (1.13) 17.10 (0.57) 291.99 (1.22) 40.08 (1.01) 0.980 (0.005) 0.944 (0.013) 
   DA 282.44 (0.60) 15.35 (0.82) 282.44 (0.60) 39.31 (0.83) 0.931 (0.003) 0.907 (0.008) 
aural sonar  
   RNG 4.17 (0.07) 5.04 (0.10) 4.17 (0.07) 6.90 (0.07) 0.892 (0.017) 0.834 (0.014) 
   SRNG 4.33 (0.09) 5.12 (0.11) 4.36 (0.10) 6.92 (0.14) 0.993 (0.006) 0.870 (0.032) 
   DA 3.99 (0.07) 4.99 (0.12) 3.99 (0.07) 6.99 (0.13) 0.897 (0.013) 0.856 (0.026) 
caltech  
   RNG 2435.72 (1.58) 2467.04 (2.52) 2435.72 (1.58) 2647.70 (2.34) 0.451 (0.002) 0.407 (0.006) 
   SRNG 2466.91 (1.85) 2508.07 (3.78) 2489.32 (2.10) 2640.61 (2.04) 0.791 (0.009) 0.364 (0.025) 
   DA 2440.09 (1.88) 2471.81 (2.35) 2440.09 (1.88) 2659.73 (1.81) 0.452 (0.003) 0.401 (0.004) 
face recognition  
   RNG 0.071 (0.003) 0.005 (0.001) 0.071 (0.003) 0.013 (0.001) 0.887 (0.003) 0.865 (0.002) 
   SRNG 0.250 (0.013) 0.021 (0.002) 0.472 (0.020) 0.058 (0.002) 0.830 (0.005) 0.811 (0.007) 
   DA 0.116 (0.007) 0.008 (0.001) 0.116 (0.007) 0.017 (0.001) 0.900 (0.002) 0.873 (0.004) 
patrol  
   RNG 120.77 (0.19) 7.88 (0.29) 120.77 (0.19) 17.34 (0.25) 0.835 (0.005) 0.665 (0.024) 
   SRNG 123.76 (0.19) 7.93 (0.43) 124.26 (0.22) 16.99 (0.14) 0.989 (0.001) 0.657 (0.022) 
   DA 127.34 (5.39) 10.39 (0.93) 127.34 (5.39) 17.13 (0.19) 0.713 (0.077) 0.521 (0.051) 
voting  
   RNG 8.86 (0.04) 0.615 (0.024) 8.86 (0.04) 1.18 (0.02) 0.953 (0.001) 0.950 (0.004) 
   SRNG 8.93 (0.03) 0.651 (0.033) 9.08 (0.02) 1.18 (0.03) 0.972 (0.001) 0.953 (0.004) 
   DA 8.84 (0.06) 0.626 (0.021) 8.84 (0.06) 1.15 (0.03) 0.955 (0.001) 0.951 (0.005) 

Note: The standard deviation is given in parentheses.

Since supervised label information is available for these data sets, it is possible to compare the results to the behavior of supervised training algorithms such as SVM or k-nearest neighbor. The last six data sets have recently been considered in Chen et al. (2009) as benchmarks where different supervised techniques to deal with dissimilarity data, including SVM with various preprocessings and kernels and k-nearest neighbor, have been compared. Interestingly, errors of 13.00% to 17.75% are reported for the aural sonar data set using the different methods, and errors of 4.89% to 5.8% are reported for the voting data set, placing SRNG among the methods with the best classification accuracy in both situations. For the other data sets, errors of 1.86% to 30.35% (protein data), 29.9% to 41.99% (caltech), 3.92% to 4.55% (face recognition), and 11.56% to 42.19% (patrol) are reported. Thus, SRNG is competitive for the protein data set, too. For caltech and patrol, it achieves considerable accuracy on the training set but does not generalize properly to the test set, while clustering seems unsuited for the supervised classification task specified by the face recognition data. Nevertheless, the aim of clustering as measured by the quantization error and the dual cost function gives reasonable results in all cases, clearly demonstrating the competitiveness of RNG methods with deterministic annealing schemes.

5.3.  Demonstration of Patch Clustering.

Patch clustering extends RNG toward huge data sets at the cost of decreased accuracy due to the compression of data from early patches in the form of prototypes and the approximation of relational prototypes by only their most important coefficients. Thereby, both the patch size and the number of coefficients used for the approximation are parameters that can be chosen to balance the accuracy of the results (which is high for large patch size) and the required space and speed of computation (which is small for small patch size and small K in the approximation). The effect of these choices is reported in Figure 5. The Copenhagen chromosomes data set was trained with patch RNG using 60 neurons for 100 epochs. Thereby, the number of patches was varied from 1 to 10, and the K used for the K-approximation was varied from 1 to 5. The reported result is the classification accuracy obtained on a test set for a repeated 10-fold cross-validation with 10 repetitions. Using 10 patches (corresponding to a speed-up of the computation by a factor of 10) and a K-approximation with K = 2 leads to a reduction of the hit rate of about 3.3%, and a K-approximation with K = 1 leads to a reduction of about 4.5%. Hence, patch approximation leads to only a slight decrease of the quality of the found solution.

Figure 5:

Accuracy of patch training depending on the size of the patches and the accuracy of the approximation.

For huge data sets, data can often be accessed only sequentially, such that data are not i.i.d. with respect to the underlying distribution. It is interesting to investigate whether this fact has consequences for the accuracy of patch clustering. As already demonstrated in Alex et al. (2009) for Euclidean settings, this is not the case, since prototypes accurately represent the already seen data and follow trends if necessary. In Alex et al., a data set with a strong trend was created and presented to patch clustering following the trend. The result of the overall clustering, which was achieved after iterative patch processing, was virtually indistinguishable from the result of NG when applied to all data at once or of patch NG with i.i.d. data, due to the ability of patch NG to use all previous information in terms of the sufficient statistics.

For the non-Euclidean setting, it is not as obvious how to create data with a strong trend to test the ability of patch clustering to deal with such settings. Therefore, we rely on auxiliary information as given by the data labels. We compare patch clustering with randomly presented samples to patch clustering where samples are sorted according to the label information, such that in early patches a highly unbalanced data set is presented. The results achieved for the chromosomes data set are reported in Table 4. Thereby, a two-fold cross-validation was repeated 10 times, using 10 patches (corresponding to about 420 data points per patch), 60 neurons, 100 epochs per patch, and an approximation of relational prototypes by three coefficients. For comparison, the results of full batch RNG using the same splits for the repeated cross-validation are reported. Obviously, the differences between the two presentation orders are negligible; that is, the order of the presentation does not severely influence the output in this case.

Table 4:
Results of Patch Clustering When Data Sets Are Represented in Different Order (i.i.d. resp. non i.i.d.) with Respect to the Labeling, in Comparison to Full Batch RNG.
                Dual Cost Function            Quantization Error            Classification Accuracy
                Training        Test          Training        Test          Training        Test
Random order  
   Patch RNG 23,589.6 (69.0) 23,898.9 (85.6) 30,157.0 (83.4) 31,623.0 (133.9) 0.873 (0.006) 0.868 (0.007) 
   RNG 22,307.3 (51.5) 22,777.7 (85.0) 22,309.8 (51.0) 24,255.2 (95.3) 0.898 (0.005) 0.900 (0.005) 
Ordered according to the labeling  
   Patch RNG 23,495.3 (80.5) 23,831.1 (89.3) 30,016.6 (131.8) 31,534.5 (124.3) 0.875 (0.007) 0.870 (0.009) 
   RNG 22,326.9 (39.6) 22,757.2 (35.6) 22,329.5 (39.9) 24,250.5 (31.6) 0.900 (0.007) 0.901 (0.007) 

5.4.  Processing Large Text Data.

To demonstrate the ability of patch clustering to deal with large dissimilarity data, a data set was generated in the same way as the 20 newsgroups data set from the UCI repository (Asuncion & Newman, 2007); 183,546 articles from 13 different news groups were taken, where the distribution of the articles is displayed in Table 5. The texts were preprocessed by removing stop words and applying word stemming (Porter, 1980). Two texts are compared using the normalized compression distance (NCD) as proposed by Cilibrasi and Vitanyi (2005). The NCD is an approximation of a universal distance for strings based on Kolmogorov complexity. It is defined as
NCD(x, y) = ( C(xy) − min{C(x), C(y)} ) / max{C(x), C(y)},
where x and y are the document strings, xy its concatenation, and C(x) denotes the size of a compression of the string x. For our experiments, the bzip2 compression method was used.
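As a minimal sketch (using Python's standard bz2 module; the function name is ours), the NCD of two preprocessed article strings can be computed as follows:

    import bz2

    def ncd(x: bytes, y: bytes) -> float:
        # normalized compression distance based on bzip2-compressed lengths
        cx, cy, cxy = len(bz2.compress(x)), len(bz2.compress(y)), len(bz2.compress(x + y))
        return (cxy - min(cx, cy)) / max(cx, cy)

    # example: dissimilarity between two (stemmed, stop-word-free) article strings
    d = ncd(b"stemmed words of article one", b"stemmed words of article two")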
Table 5:
News Groups Present in the Big Newsgroup Data Set.
Name                        Count      Percentage
alt.atheism                 72,068     39.26
comp.os.ms-windows.misc     178        0.10
comp.windows.x              182        0.10
rec.motorcycles             15,841     8.63
rec.sport.baseball          462        0.25
rec.sport.hockey            510        0.28
sci.crypt                   3,203      1.75
sci.med                     2,912      1.59
soc.religion.christian      341        0.19
talk.politics.guns          23,209     12.64
talk.politics.mideast       9,337      5.09
talk.politics.misc          54,683     29.79
talk.religion.misc          620        0.34

We would like to demonstrate the ability of our method to project large data sets into the Euclidean plane in reasonable time. Therefore, we report the result of a supervised relational hyperbolic SOM (HSOM), which offers particularly powerful visualization facilities for possibly complex data sets due to the flexibility of hyperbolic space (Ontrup & Ritter, 2001). Obviously, batch processing for HSOM can be transferred to a relational version in the same way as batch NG. The goal is to map the full data set to the hyperbolic lattice structure by means of supervised HSOM. Patch processing was applied using supervised relational HSOM with 85 neurons and a K-approximation of the prototypes with K = 3. Since the full dissimilarity matrix would occupy approximately 250 GB of space, it is not an option to process the data at once or to precalculate the full dissimilarity matrix. Instead, 183 patches of around 1,000 texts were taken, and the distances within these patches were precalculated, resulting in only around 12 MB of space. In addition, since patch processing requires the dissimilarities of the extended patches, additional dissimilarities had to be computed on demand for every patch. This way, the whole computation could be performed on a common desktop computer in reasonable time.

The outcome is depicted in Figure 6. Thereby, the hyperbolic lattice structure is mapped onto the Euclidean plane for visualization and data inspection. For interpretability, neurons were labeled posteriorly according to a majority vote. Clearly, the data arrange on the lattice according to their semantic meaning as mirrored by the corresponding news groups, such that visualization and browsing become possible.

Figure 6:

Visualization of 183,546 newsgroup articles using patch relational HSOM. Labeling of the neurons is based on majority vote, displaying the topographic arrangement of the biggest news groups on the map. In this case, patch processing reduces the required space of the full dissimilarity matrix from approximately 251 GB to only around 13 MB.

6.  Discussion

The focus of this article was on clustering algorithms for general dissimilarity data that can be derived from the standard vector quantization cost function and extensions thereof. For this purpose, we introduced relational clustering as an equivalent formulation of batch clustering for neural gas or self-organizing maps if Euclidean data are given in terms of pairwise dissimilarities only. While convergence can be guaranteed in Euclidean settings only, applicability of the algorithms is given for every possible symmetric dissimilarity measure. Further, a variety of properties remain valid, such as easy out-of-sample extensions and interpretability in terms of the dual cost function for fixed points of the algorithms. In this way, we arrived at relational neural gas and self-organizing maps for arbitrary dissimilarity data. We have seen that the algorithms can be interpreted as clustering in pseudo-Euclidean space, whereby a reasonable heuristic is taken to arrive at good prototype locations. The algorithms yield results competitive with alternatives such as deterministic annealing, and they often show better behavior than theoretical alternatives with guaranteed convergence, such as a spread transformation of the data. Thus, relational neural gas and its alternatives can be taken as simple and efficient prototype-based models for the clustering and topographic mapping of general dissimilarity matrices, with wide applicability to domains involving non-Euclidean data such as text processing, biological sequence analysis, and image processing.

One major problem of clustering algorithms for dissimilarity data lies in the fact that the dissimilarity matrix scales quadratically with the number of data points. This leads to at least quadratic time and space complexity, which is infeasible already for medium-sized problems. Therefore, an approximation scheme was introduced that processes data based on patches in linear time and constant space, provided the patch size is chosen fixed depending on the available resources. We demonstrated that approximate patch clustering leads to only a minor decrease in accuracy and that it is robust with respect to data ordering. In particular, it can deal with settings where data are not identically distributed, and it leads to results competitive with those obtained when all data are available prior to training, as demonstrated in the experiments. This way, a very simple and fast but powerful and flexible linear time and constant space clustering scheme for large dissimilarity data results. This patch scheme was proposed based on relational neural gas and relational self-organizing maps; however, alternative prototype-based schemes such as affinity propagation or median clustering could easily be extended accordingly. We demonstrated the applicability of the approach for a large text corpus. The evaluation and application in further large-scale scenarios are the subject of ongoing research.

Notes

1

Actually, we construct an example that shows that deterministic annealing for pairwise data clustering can also show oscillatory behavior for sequential updates if run with negative dissimilarities.

References

Alex, N., Hasenfuss, A., & Hammer, B. (2009). Patch clustering for massive data sets. Neurocomputing, 72(7–9), 1455–1469.
Asuncion, A., & Newman, D. J. (2007). UCI Machine Learning Repository. Irvine: University of California, School of Information and Computer Science.
Badoiu, M., Har-Peled, S., & Indyk, P. (2002). Approximate clustering via core-sets. In Proceedings of the 34th ACM Symposium on Theory of Computing (pp. 250–257). New York: ACM.
Barreto, G. de A., Araujo, A. F. R., & Kremer, S. C. (2003). A taxonomy for spatiotemporal connectionist networks revisited: The unsupervised case. Neural Computation, 15(6), 1255–1320.
Belongie, S., Fowlkes, C., Chung, F., & Malik, J. (2002). Spectral partitioning with indefinite kernels using the Nyström extension. In A. Heyden (Ed.), Proceedings of the European Conference on Computer Vision (pp. 531–542). Berlin: Springer-Verlag.
Bezdek, J. C., Hathaway, R. J., Huband, J. M., Leckie, C., & Kotagiri, R. (2006). Approximate data mining in very large relational data. In G. Dobbie & J. Bailey (Eds.), Database Technologies 2006, Proceedings of the 17th Australasian Database Conference (pp. 3–13). New York: ACM.
Biehl, M., Hammer, B., Hochreiter, S., Kremer, S. C., & Villmann, T. (2009). Similarity-based learning on structures. In Dagstuhl Seminar Proceedings 09081. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, Germany.
Bradley, P. S., Fayyad, U., & Reina, C. (1998). Scaling clustering algorithms to large data sets. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (pp. 9–15). Menlo Park, CA: AAAI Press.
Brucker, P. (1978). On the complexity of clustering problems. In M. Beckman & H. P. Kunzi (Eds.), Optimization and operations research (pp. 45–54). Berlin: Springer.
Chen, Y., Garcia, E. K., Gupta, M. R., Rahimi, A., & Cazzanti, L. (2009). Similarity-based classification: Concepts and algorithms. Journal of Machine Learning Research, 10, 747–776.
Cilibrasi, R., & Vitanyi, P. M. B. (2005). Clustering by compression. IEEE Transactions on Information Theory, 51(4), 1523–1545.
Conan-Guez, B., Rossi, F., & El Golli, A. (2005). A fast algorithm for the self-organizing map on dissimilarity data. In Workshop on Self-Organizing Maps (pp. 561–568). Berlin: Springer-Verlag.
Cottrell, M., Hammer, B., Hasenfuss, A., & Villmann, T. (2006). Batch and median neural gas. Neural Networks, 19, 762–771.
Domingos, P., & Hulten, G. (2001). A general method for scaling up machine learning algorithms and its application to clustering. In Proceedings of the International Conference on Machine Learning (pp. 106–119). San Francisco: Morgan Kaufmann.
Farnstrom, F., Lewis, J., & Elkan, C. (2000). Scalability for clustering algorithms revisited. SIGKDD Explorations, 2(1), 51–57.
Frey, B. J., & Dueck, D. (2007). Clustering by passing messages between data points. Science, 315, 972–976.
Goldfarb, L. (1984). A unified approach to pattern recognition. Pattern Recognition, 17(5), 575–582.
Graepel, T., & Obermayer, K. (1999). A stochastic self-organizing map for proximity data. Neural Computation, 11, 139–155.
Guha, S., Rastogi, R., & Shim, K. (1998). CURE: An efficient clustering algorithm for large datasets. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 73–84). New York: ACM Press.
Haasdonk, B., & Bahlmann, C. (2004). Learning with distance substitution kernels. In Pattern Recognition, Proceedings of the 26th DAGM Symposium. Berlin: Springer.
Hammer, B., Hasenfuss, A., Schleif, F.-M., & Villmann, T. (2006a). Supervised median neural gas. In C. Dagli, A. Buczak, D. Enke, A. Embrechts, & O. Ersoy (Eds.), Intelligent engineering systems through artificial neural networks (pp. 623–633). New York: ASME Press.
Hammer, B., Hasenfuss, A., Schleif, F.-M., & Villmann, T. (2006b). Supervised batch neural gas. In Proceedings of the Conference on Artificial Neural Networks in Pattern Recognition (pp. 33–45). Berlin: Springer.
Hammer, B., Hasenfuss, A., & Villmann, T. (2007). Magnification control for batch neural gas. Neurocomputing, 70, 1225–1234.
Hammer, B., Micheli, A., Sperduti, A., & Strickert, M. (2004). Recursive self-organizing network models. Neural Networks, 17(8–9), 1061–1086.
Hartigan, J. A. (1975). Clustering algorithms. Hoboken, NJ: Wiley.
Hasenfuss, A., Hammer, B., Schleif, F.-M., & Villmann, T. (2007). Neural gas clustering for dissimilarity data with continuous prototypes. In F. Sandoval, A. Prieto, J. Cabestany, & M. Grana (Eds.), Computational and Ambient Intelligence—Proceedings of the 9th Work-Conference on Artificial Neural Networks (pp. 539–546). Berlin: Springer.
Hathaway, R. J., & Bezdek, J. C. (1994). Nerf c-means: Non-Euclidean relational fuzzy clustering. Pattern Recognition, 27(3), 429–437.
Hathaway, R. J., Davenport, J. W., & Bezdek, J. C. (1989). Relational duals of the c-means algorithms. Pattern Recognition, 22, 205–212.
Heskes, T. (2001). Self-organizing maps, vector quantization, and mixture modeling. IEEE Transactions on Neural Networks, 12, 1299–1305.
Hofmann, T., & Buhmann, J. M. (1996). An annealed "neural gas" network for robust vector quantization. In Artificial Neural Networks ICANN'96 (pp. 151–156). Berlin: Springer.
Hofmann, T., & Buhmann, J. M. (1997). Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1), 1–14.
Hofmann, T., & Buhmann, J. M. (1998). Competitive learning algorithms for robust vector quantization. IEEE Transactions on Signal Processing, 46(6), 1665–1675.
Juan, A., & Vidal, E. (2000). On the use of normalized edit distances and an efficient k-NN search technique (k-AESA) for fast and accurate string classification. In Proceedings of the International Conference on Pattern Recognition. Piscataway, NJ: IEEE.
Kaplansky, I. (1977). Set theory and metric spaces. Providence, RI: American Mathematical Society (AMS Chelsea Publishing).
Kaski, S., Nikkilä, J., Savia, E., & Roos, C. (2005). Discriminative clustering of yeast stress response. In U. Seiffert, L. Jain, & P. Schweizer (Eds.), Bioinformatics using computational intelligence paradigms (pp. 75–92). Berlin: Springer.
Kaski, S., Nikkilä, J., Oja, M., Venna, J., Törönen, P., & Castren, E. (2003). Trustworthiness and metrics in visualizing similarity of gene expression. BMC Bioinformatics, 4, 48.
Kohonen, T. (1995). Self-organizing maps. Berlin: Springer.
Kohonen, T., & Somervuo, P. (2002). How to make large self-organizing maps for nonvectorial data. Neural Networks, 15, 945–952.
Kruskal, J. B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29, 1–27.
Kumar, A., Sabharwal, Y., & Sen, S. (2004). A simple linear time (1 + ε)-approximation algorithm for k-means clustering in any dimensions. In Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science (pp. 454–462). Piscataway, NJ: IEEE.
Laub, J., Roth, V., Buhmann, J. M., & Müller, K.-R. (2006). On the information and representation of non-Euclidean pairwise data. Pattern Recognition, 39, 1815–1826.
Lundsteen, C., Phillip, J., & Granum, E. (1980). Quantitative analysis of 6985 digitized trypsin G-banded human metaphase chromosomes. Clinical Genetics, 18, 355–370.
Luttrell, S. P. (1989). Self-organisation: A derivation from first principles of a class of learning algorithms. In Proceedings of the 3rd International Joint Conference on Neural Networks (Vol. 2, pp. 495–498). Norwell, MA: Kluwer.
Martinetz, T. (1992). Selbstorganisierende neuronale Netzwerkmodelle zur Bewegungssteuerung [Self-organizing neural network models for movement control]. Unpublished doctoral dissertation, Sankt Augustin.
Martinetz, T., Berkovich, S. G., & Schulten, K. J. (1993). "Neural-gas" network for vector quantization and its application to time-series prediction. IEEE Transactions on Neural Networks, 4, 558–569.
Martinetz, T., & Schulten, K. (1994). Topology representing networks. Neural Networks, 7, 507–522.
Mevissen, H., & Vingron, M. (1996). Quantifying the local reliability of a sequence alignment. Protein Engineering, 9, 127–132.
Murtagh, F. (2002). Clustering in massive data sets. In J. Abello, P. M. Pardalos, & M. G. C. Resende (Eds.), Handbook of massive data sets. Norwell, MA: Kluwer.
Neuhaus, M., & Bunke, H. (2006). Edit distance based kernel functions for structural pattern classification. Pattern Recognition, 39(10), 1852–1863.
Ontrup, J., & Ritter, H. (2001). Hyperbolic self-organizing maps for semantic navigation. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 1417–1424). Cambridge, MA: MIT Press.
Pardalos, P. M., & Vavasis, S. A. (1991). Quadratic programming with one negative eigenvalue is NP hard. Journal of Global Optimization, 1, 15–22.
Pekalska, E., & Duin, R. P. W. (2005). The dissimilarity representation for pattern recognition: Foundations and applications. Singapore: World Scientific.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137.
Puzicha, J., Hofmann, T., & Buhmann, J. (1999). A theory of proximity based clustering: Structure detection by optimization. Pattern Recognition Letters, 33(4), 617–634.
Qin, A. K., & Suganthan, P. N. (2004). Kernel neural gas algorithms with application to cluster analysis. In Proceedings of the International Conference on Pattern Recognition 2004 (Vol. 4, pp. 617–620). Washington, DC: IEEE Computer Society.
Rose, K. (1998). Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE (pp. 2210–2223). Piscataway, NJ: IEEE.
Rossi, F., Hasenfuss, A., & Hammer, B. (2007). Accelerating relational clustering algorithms with sparse prototype representations. In Proceedings of the 6th International Workshop on Self-Organizing Maps. New York: ACM.
Roth, V., Laub, J., Kawanabe, M., & Buhmann, J. M. (2003). Optimal cluster preserving embedding of nonmetric proximity data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(12), 1540–1551.
Saalbach, A., Twellmann, T., Nattkemper, T. W., Wismüller, A., Ontrup, J., & Ritter, H. (2005). A hyperbolic topographic mapping for proximity data. Artificial Intelligence and Applications, 2005, 106–111.
Sahni, S. (1974). Computationally related problems. SIAM Journal on Computing, 3(4), 262–279.
Schölkopf, B. (2000). The kernel trick for distances (Microsoft Tech. Rep. No. 2000-51). Redmond, WA: Microsoft.
Seo, S., & Obermayer, K. (2004). Self-organizing maps and clustering methods for matrix data. Neural Networks, 17, 1211–1230.
Tino, P., Kaban, A., & Sun, Y. (2004). A generative probabilistic approach to visualizing sets of symbolic sequences. In R. Kohavi, J. Gehrke, W. DuMouchel, & J. Ghosh (Eds.), Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD-2004 (pp. 701–706). New York: ACM Press.
Villmann, T., Hammer, B., Schleif, F., Geweniger, T., & Herrmann, W. (2006). Fuzzy classification by fuzzy labeled neural gas. Neural Networks, 19, 772–779.
von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395–416.
Wang, W., Yang, J., & Muntz, R. R. (1997). STING: A statistical information grid approach to spatial data mining. In Proceedings of the 23rd Conference on Very Large Data Bases (pp. 186–195). San Francisco: Morgan Kaufmann.
Yin, H. (2006). On the equivalence between kernel self-organising maps and self-organising mixture density networks. Neural Networks, 19(6), 780–784.
Yin, H. (2008). Self-organising maps: Background, theories, extensions and applications. In J. Fulcher & L. C. Jain (Eds.), Computational intelligence: A compendium (pp. 715–762). Berlin: Springer.
Yom-Tov, E., & Slonim, N. (2009). Parallel pairwise clustering. In Proceedings of the SIAM International Conference on Data Mining, SDM 2009 (pp. 745–755). Philadelphia: SIAM.
Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An efficient data clustering method for very large databases. In Proceedings of the 15th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (pp. 103–114). San Francisco: Morgan Kaufmann.
Zhong, S., & Ghosh, J. (2003). A unified framework for model-based clustering. Journal of Machine Learning Research, 4, 1001–1037.