## Abstract

Topographic maps such as the self-organizing map (SOM) or neural gas (NG) constitute powerful data mining techniques that allow simultaneously clustering data and inferring their topological structure, such that additional features, for example, browsing, become available. Both methods have been introduced for vectorial data sets; they require a classical feature encoding of information. Often data are available in the form of pairwise distances only, such as arise from a kernel matrix, a graph, or some general dissimilarity measure. In such cases, NG and SOM cannot be applied directly. In this article, we introduce relational topographic maps as an extension of relational clustering algorithms, which offer prototype-based representations of dissimilarity data, to incorporate neighborhood structure. These methods are equivalent to the standard (vectorial) techniques if a Euclidean embedding exists, while preventing the need to explicitly compute such an embedding. Extending these techniques for the general case of non-Euclidean dissimilarities makes possible an interpretation of relational clustering as clustering in pseudo-Euclidean space. We compare the methods to well-known clustering methods for proximity data based on deterministic annealing and discuss how far convergence can be guaranteed in the general case. Relational clustering is quadratic in the number of data points, which makes the algorithms infeasible for huge data sets. We propose an approximate patch version of relational clustering that runs in linear time. The effectiveness of the methods is demonstrated in a number of examples.

## 1. Introduction

Topographic maps such as the self-organizing map (SOM) constitute a valuable tool for robust data inspection and data visualization, which has been applied in such diverse areas as telecommunication, robotics, bioinformatics, and business (Kohonen, 1995; Yin, 2008). A particular strength of SOM lies in the fact that it offers simultaneous data clustering, visualization, topological inference, and compression of data by means of prototypes such that diverse functionalities can be realized on top of SOM. Alternative methods such as neural gas (NG) (Martinetz, Berkovich, & Schulten, 1993) provide an efficient clustering and topographic mapping of data without fixing a prior lattice. In this way, subsequent visualization such as multidimensional scaling (Kruskal, 1964) can readily be applied, whereby no prior restriction of a fixed lattice structure as for SOM is necessary and the risk of topographic errors is minimized. For NG, an optimum (nonregular) data topology is induced such that browsing in a neighborhood becomes directly possible (Martinetz & Schulten, 1994). A very elegant fundamental treatment of vector quantization and extensions such as SOM and NG has been presented by Hofmann and Buhmann (1998) based on information-theoretic principles introduced in Luttrell (1989). In this framework, vector quantization is interpreted as an encoding mechanism with limited resources, where SOM can be derived as a robust model if channel noise is present, whereas NG accounts for the situation that certain channels are not available (e.g., due to varying bandwidth). This also gives some hints as to which situations the application of SOM or NG, respectively, is advisable from a theoretical model of the data, besides providing additional functionality compared to simple clustering such as *k*-means due to the additional (fixed or data optimum, respectively) neighborhood structure. 
Interestingly, as presented in Hofmann and Buhmann (1998), these approaches can be combined to yield models that are robust with respect to different types of noise.

Original SOM and NG, however, have been proposed for vectorial data such that their application is restricted to Euclidean space. In the past few years, extensions of these methods have been proposed to deal with more general data structures. This accounts for the fact that more general metrics have to be used for complex data such as microarray data or DNA sequences. Further, it might be the case that data are not embedded in a vector space at all; rather, pairwise similarities or dissimilarities are available.

Several extensions of classical SOM and NG to more general data have been proposed: a statistical interpretation of SOM (Graepel & Obermayer, 1999; Heskes, 2001; Seo & Obermayer, 2004; Tino, Kaban, & Sun, 2004) allows changing the generative model to alternative general data models. The resulting approaches are very flexible but also computationally quite demanding, such that proper initialization and metaheuristics (e.g., deterministic annealing) become necessary when optimizing statistical models. For specific data structures such as time series or recursive structures, recursive models have been proposed (as reviewed in Barreto, Araujo, & Kremer, 2003; Hammer, Micheli, Sperduti, & Strickert, 2004). However, these models are restricted to recursive data structures with Euclidean constituents. Online variants of SOM and NG have been extended to general kernels (e.g., in the approaches presented in Qin & Suganthan, 2004; Yin, 2006). However, these versions have been derived for (slow) online adaptation only.

The approaches of Kohonen and Somervuo (2002) provide a fairly general method for large-scale application of SOM to nonvectorial data. It is assumed that pairwise similarities of data points are available. Then the batch optimization scheme of SOM can be generalized by means of the generalized median to a visualization tool for general similarity data. Thereby, prototype locations are restricted to data points. This method has been extended to NG in Cottrell, Hammer, Hasenfuss, and Villmann (2006), together with a general proof of the convergence of median versions of clustering. Further developments concern the efficiency of the computation (Conan-Guez, Rossi, & El Golli, 2005) and the integration of prior information if available to achieve meaningful visualization and clustering (Hammer, Hasenfuss, Schleif, & Villmann, 2006a, 2006b; Villmann, Hammer, Schleif, Geweniger, & Herrmann, 2006).

Median clustering has the benefit that it builds directly on the derivation of SOM and NG from a cost function. Thus, the resulting algorithms share the simplicity of batch NG and SOM, its mathematical background and convergence, as well as the flexibility to model additional information by means of an extension of the cost function. However, for median versions, prototype locations are restricted to the set of given training data, which constitutes a severe restriction, particularly for small data sets. Therefore, extensions that allow a smooth adaptation of prototypes have been proposed (e.g., in Hasenfuss, Hammer, Schleif, & Villmann, 2007). In this approach, a weighting scheme is introduced for the points, which represents a virtual prototype in the space spanned by the training data. This model has the drawback that it is not an extension of the standard Euclidean version.

Here, we use an alternative way to extend NG to relational data given by pairwise dissimilarities, which is similar to the relational dual of fuzzy clustering as derived in Hathaway and Bezdek (1994) and Hathaway, Davenport, & Bezdek (1989) and builds directly on fundamental work in the context of relational clustering as introduced in Laub, Roth, Buhmann, and Müller (2006) and Roth, Laub, Kawanabe, and Buhmann (2003). For a given dissimilarity matrix that stems from a (possibly high-dimensional and unknown) Euclidean space, it is possible to derive the relational dual of topographic map formation that expresses the relevant quantities in terms of the given matrix and leads to a learning scheme similar to standard batch optimization. This scheme provides identical results as the standard Euclidean version if an embedding of the given data points is known. In particular, it possesses the same convergence properties as the standard variants, thereby restricting the computation to known quantities that do not rely on explicit embedding. Since these relational variants rely on the same cost function, extensions to additional label information or magnification control (Hammer et al., 2006a, 2006b; Hammer, Hasenfuss, & Villmann, 2007) become readily available.

The methods can be applied directly to every possible non-Euclidean dissimilarity matrix, and, as we will see in a variety of experiments, they result in good performance in practical applications. The theory behind the case of general dissimilarity data, however, is less clear. We will show that a simple shift of the dissimilarity matrix as proposed in Laub et al. (2006), which makes data Euclidean and does not affect the location of the optima of the dual cost function, can severely affect the underlying numerics. As an alternative, we will link the proposed algorithm to clustering in pseudo-Euclidean space, such that an intuitive interpretation of the algorithm becomes possible also in the non-Euclidean setting. However, we show by counterexample that the algorithm need no longer converge to a fixed point of the dual cost function, although we have not observed this behavior in a single real application. We show that popular alternatives such as deterministic annealing for pairwise data clustering or SOM share this property; counterexamples that show possible divergence can also be found for these two well-known clustering algorithms. We argue that relational neural gas is in fact related to popular pairwise data clustering in the sense that the former relaxes the standard quantization error in pseudo-Euclidean space, while the latter can be derived as deterministic annealing to optimize the standard quantization error in pseudo-Euclidean space. This provides a direct interpretation of these alternatives in terms of relational prototypes; inspection of the results becomes possible this way, and it explains why relational clustering shows remarkable results in practice that are comparable to results obtained by deterministic annealing while consuming less training time.

Relational clustering, as well as its deterministic annealing counterparts, displays squared complexity according to the size of the dissimilarity matrix. This makes the algorithms unsuitable for large data sets. Based on intuitive and powerful extensions of classical *k*-means and NG to large data sets by means of patch clustering (Alex, Hasenfuss, & Hammer, 2009; Farnstrom, Lewis, & Elkan, 2000; Bradley, Fayyad, & Reina, 1998), we propose an approximation of the algorithms that can work in constant memory and linear time, that is, it is suited for large data sets. While we test the results for patch relational NG clustering, the principled method can successfully be applied to every clustering scheme that relies on relational prototypes; a direct transfer of the method to relational SOM and deterministic annealing variants of relational clustering becomes possible.

We first introduce batch learning algorithms for neural gas based on a cost function. Then we focus on a dissimilarity matrix that can be embedded in Euclidean space and derive the respective relational dual, resulting in a dual cost function and batch optimization schemes for the case of a given dissimilarity matrix of data. For the general non-Euclidean setting, we discuss the connection to an embedding in pseudo-Euclidean space. Based on this connection, a relation to well-established deterministic annealing schemes becomes possible. To make the algorithms suitable for large data sets, an approximation of prototypes is introduced that allows processing data subsequently in patches using only constant memory and linear time. The efficiency of relational clustering is demonstrated in a couple of benchmark situations, as well as an application for a text clustering task that involves almost 200,000 articles.

## 2. Neural Gas

Neural clustering and topographic maps constitute effective methods for data clustering, inspection, and preprocessing. Classical variants deal with vectorial data $x^1,\dots,x^m \in \mathbb{R}^n$, which are distributed according to an underlying distribution *P* in the Euclidean space. The goal of prototype-based clustering algorithms is to distribute prototypes $w^1,\dots,w^k \in \mathbb{R}^n$ among the data such that they represent the data as accurately as possible. A new data point $x$ is assigned to the winner $w^{I(x)}$, which refers to the prototype with the smallest distance $\|x - w^{I(x)}\|^2$. This separates the data space into the receptive fields of the prototypes.

Simple *k*-means directly optimizes the quantization error
$$E_{k\text{-means}}(\vec w) = \frac{1}{2}\sum_{i=1}^{k}\sum_{j=1}^{m} \delta_{i,I(x^j)}\,\|x^j - w^i\|^2,$$
where $\delta_{i,I(x^j)}$ with Kronecker delta $\delta$ indicates the winner neuron $I(x^j)$ for $x^j$. Unlike *k*-means, neural gas (NG) (Martinetz et al., 1993) and the self-organizing map (SOM) (Kohonen, 1995) incorporate the neighborhood of a neuron for adaptation. The cost function of NG is given by
$$E_{\mathrm{NG}}(\vec w) = \frac{1}{2}\sum_{i=1}^{k}\sum_{j=1}^{m} h_\sigma(k_{ij})\,\|x^j - w^i\|^2,$$
where $k_{ij} = |\{l \mid \|x^j - w^l\|^2 < \|x^j - w^i\|^2\}|$ is the rank of the prototypes sorted according to the distances and $h_\sigma(t) = \exp(-t/\sigma)$ scales the neighborhood cooperation with the neighborhood range $\sigma > 0$. The cost function of SOM, in contrast, relies on terms $\mathrm{nd}(i, l)$, which denote a prior chosen neighborhood structure of neurons, often induced by a low-dimensional lattice structure.

The incorporation of a neighborhood structure into SOM and NG has several beneficial effects; additional functionality is achieved this way since the topological structure of the data is respected by the neurons, and browsing and, in the case of SOM, visualization become possible. It has been shown experimentally for benchmark data sets that by neighborhood integration, a method that is largely insensitive to initialization can be achieved (Martinetz, 1992). Unfortunately, this observation has not yet been substantiated by theoretical convergence guarantees. In fundamental work (Hofmann & Buhmann, 1998; Luttrell, 1989), a theoretical justification for this finding is given by linking NG and SOM, respectively, to information-theoretical concepts in the context of encoding in the presence of channel noise. In the following, we consider NG as an exemplary case, the argument for SOM and *k*-means being similar.

**Algorithm 1:** Batch NG

**input**

data $x^1,\dots,x^m \in \mathbb{R}^n$;

**begin**

**init** prototypes $w^1,\dots,w^k$ randomly;

**repeat**

set $k_{ij} := |\{l \mid \|x^j - w^l\|^2 < \|x^j - w^i\|^2\}|$;

set $w^i := \sum_j h_\sigma(k_{ij})\, x^j \big/ \sum_j h_\sigma(k_{ij})$;

**until** convergence;

**return** $w^1,\dots,w^k$;

**end**.
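For concreteness, one realization of batch NG might look as follows in Python (NumPy). The annealing schedule and the initialization on data points are illustrative assumptions, not a prescription from the text:

```python
import numpy as np

def batch_ng(X, k=2, epochs=50, seed=0):
    """Sketch of batch NG: alternate rank computation and
    neighborhood-weighted prototype updates while annealing sigma."""
    rng = np.random.default_rng(seed)
    # init prototypes on randomly chosen data points (assumed init strategy)
    W = X[rng.choice(len(X), size=k, replace=False)].copy()
    for epoch in range(epochs):
        sigma = 2.0 * 0.9 ** epoch                  # assumed annealing schedule
        # dist[j, i] = ||x^j - w^i||^2
        dist = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
        # k_ij = number of prototypes closer to x^j than w^i
        ranks = np.argsort(np.argsort(dist, axis=1), axis=1)
        h = np.exp(-ranks / sigma)
        # w^i = sum_j h_sigma(k_ij) x^j / sum_j h_sigma(k_ij)
        W = (h.T @ X) / h.sum(axis=0)[:, None]
    return W
```

With two well-separated point clouds and `k=2`, the annealed updates end with one prototype near each cloud's mean.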

In the following, we consider data $x^i$ characterized by pairwise proximities $d(x^i, x^j)$ that are symmetric and 0 if $x^i = x^j$. For such general proximities, it can hold that no Euclidean embedding can be found. However, we will see that also for non-Euclidean dissimilarities, a vector space with symmetric bilinear form $\langle\cdot,\cdot\rangle$ (which need not be positive definite) and embeddings of the points $x^j$ exist such that $d(x^i, x^j) = \langle x^i - x^j, x^i - x^j\rangle$. We stress that the cost function of NG can be formalized for every real vector space that possesses a symmetric bilinear form $\langle\cdot,\cdot\rangle$, by substituting the term $\|x^j - w^i\|^2$ in the cost function by $\langle x^j - w^i, x^j - w^i\rangle$. The corresponding cost function becomes
$$E_{\mathrm{NG}}(\vec w) = \frac{1}{2}\sum_{i=1}^{k}\sum_{j=1}^{m} h_\sigma(k_{ij})\,\langle x^j - w^i, x^j - w^i\rangle,$$
where the ranks $k_{ij}$ are determined based on the bilinear form. In the same way, using this extended definition of the ranks, the procedure of batch NG is well defined in every vector space equipped with a symmetric bilinear form. However, it is not guaranteed that this batch NG procedure optimizes the NG cost function in the general setting, that the convergence of this procedure is guaranteed, or that this procedure is useful at all. We address these questions in more detail in the following sections.

## 3. Dissimilarity Data

Relational data $x^i$ are not explicitly embedded in a Euclidean vector space; rather, pairwise similarities or dissimilarities are available. We assume that dissimilarities $d_{ij} = d(x^i, x^j)$ are given for every pair of data points $x^1,\dots,x^m$. In the Euclidean setting, data can be represented as vectors and the equality $d_{ij} = \|x^i - x^j\|^2$ holds. In general, however, no such explicit embedding of data is available. Further, $d_{ij}$ need not fulfill the requirements of a metric, that is, the triangle inequality can be violated. Such data can stem from a general distance measure such as alignment distance for DNA sequences, Levenshtein distance of words, general graph distances (e.g., in a web graph), or even empirical results of questionnaires or experiments.

This situation is rather common in practice, and an overview of supervised classification techniques for similarity data has recently been published (Chen, Garcia, Gupta, Rahimi, & Cazzanti, 2009). Here we are interested in the unsupervised case. A variety of clustering algorithms for general dissimilarities has been proposed. This includes hierarchical clustering such as single or complete linkage, UPGMA, or neighbor joining (Hartigan, 1975; Zhong & Ghosh, 2003), which usually rely on heuristics, although guarantees of the correctness of the result can be derived in specific settings (e.g., for ultrametrics or additive metrics; Kaplansky, 1977). Alternatives rely on an appropriate cost function. Naturally, the standard kernel trick can be applied to many clustering algorithms or topographic mappings, such that these approaches extend directly to similarity data induced by a kernel. Proposals of this approach can be found in Conan-Guez et al. (2005), Yin (2006), and Qin and Suganthan (2004), for example. Clustering of dissimilarities that are represented by a graph can be formalized as a graph cut problem. Spectral clustering can be interpreted as an efficient approximation of the NP-hard optimization of the normalized or ratio graph cut (von Luxburg, 2007). Alternatively, in median or exemplar-based clustering, the position of prototypes in the standard quantization error is restricted to locations in the set of given data points. This way, the quantization error can directly be transferred to the setting of general dissimilarities. Proposals that follow this line of argumentation include median clustering as proposed for SOM and NG (Kohonen & Somervuo, 2002; Cottrell et al., 2006) and exemplar-based clustering by formulating the problem in terms of a factor graph, leading to affinity propagation (Frey & Dueck, 2007).
By restricting the prototype locations, the set of potential solutions becomes restricted such that these proposals do not correspond to standard *k*-means or SOM in the Euclidean setting.

A few approaches try to directly generalize the standard Euclidean setting to general dissimilarity data. One proposal is to optimize a cost function for pairwise clustering that relies on the cluster assignments only (i.e., it can be formulated for pairwise dissimilarities), but is equivalent to the standard quantization error in the Euclidean setting. This cost function can be optimized, for example, by using some metaheuristic or deterministic annealing (Hofmann & Buhmann, 1997). However, in the general setting, the notion of prototypes is lost this way, and a clustering is represented in terms of assignments only. Similarly, the approach (Graepel & Obermayer, 1999) uses deterministic annealing for a reliable training of SOM for proximity data. The transfer from vectorial data is achieved by an interpretation of SOM as an encoder-decoder framework. We will later see that these methods can be interpreted as deterministic annealing of standard Euclidean *k*-means clustering and SOM, respectively, which are transferred to proximity data using relational clustering, as we will propose in this article.

A direct extension of *k*-means and fuzzy-*k*-means to dissimilarity data has been proposed in Hathaway and Bezdek (1994) and Hathaway et al. (1989). Here the assumption is made that a Euclidean embedding of data exists, but it is unknown. In this case, *k*-means can be equivalently reformulated in terms of dissimilarities only. The resulting methods are termed relational clustering since they can deal with dissimilarities that result from relations rather than vectors. In this contribution, we will follow the last approach and transfer these ideas to topographic mapping. First, we consider the situation that a Euclidean embedding of data exists and derive the relational neural gas in this setting. Then we discuss the setting for non-Euclidean situations.

### 3.1. Euclidean Data.

We assume, as before, that training data $x^1,\dots,x^m$ are given in terms of pairwise dissimilarities $d_{ij}$. One special setting is that data originate from a Euclidean distance measure, that is, we are able to find Euclidean points $x^i$ such that $d_{ij} = \|x^i - x^j\|^2$. Note that this notation includes a possibly nonlinear mapping (feature map) $x^i \mapsto \Phi(x^i)$ with $d_{ij} = \|\Phi(x^i) - \Phi(x^j)\|^2$. Since the Euclidean distance is related to the Euclidean dot product, this setting is referred to as the kernel trick for distances in the literature. The approach of Schölkopf (2000), as an example, investigates sufficient conditions for a metric to fulfill this property. Since an explicit formula for the embedding need not be known, we cannot directly apply batch NG in the embedding space. We want to derive a possibility for reformulating batch NG in such a way that an explicit embedding is not needed; then batch NG can be applied directly to this setting. The key observation is based on the fact that optimum prototype locations of batch NG can be expressed as a linear combination of the given data points. Therefore, the unknown distances $\|x^j - w^i\|^2$ can be expressed in terms of the known values $d_{ij}$. Correspondingly, the NG cost function can be substituted by a dual cost function in terms of ranks and cluster assignments only.

#### 3.1.1. Relational Clustering Algorithm.

In the following section, we lay out the details of this procedure. We always assume that pairwise dissimilarities $d_{ij}$ are given, $d_{ii} = 0$, and $D$ denotes the corresponding matrix of dissimilarities. We assume that for $D$, there exists a finite-dimensional real vector space $X$ together with a symmetric bilinear form $\langle\cdot,\cdot\rangle$, and there exist vectors $x^1,\dots,x^m \in X$ with $d_{ij} = \langle x^i - x^j, x^i - x^j\rangle$. We will see later that this property is always guaranteed if $D$ is symmetric with zero diagonal elements. For some theorems, we require $X$ to be a Euclidean space, that is, we require positive definiteness of the bilinear form $\langle\cdot,\cdot\rangle$. This will be explicitly mentioned at the corresponding places. Interestingly, the principled procedure as derived in this section, as well as a couple of results, hold for the more general case that $X$ is an arbitrary (possibly non-Euclidean) vector space equipped with a symmetric bilinear form.
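The existence of such an embedding for symmetric $D$ with zero diagonal can be made concrete by classical double centering. The following sketch (function name ours) constructs a pseudo-Euclidean embedding from the eigendecomposition of the centered matrix and checks that the signed squared distances reproduce $D$:

```python
import numpy as np

def pseudo_euclidean_embedding(D):
    """Embed a symmetric zero-diagonal dissimilarity matrix D into a
    space with signature signs s: sum_k s_k (x_ik - x_jk)^2 = d_ij."""
    m = D.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m      # centering matrix
    G = -0.5 * J @ D @ J                      # "Gram" matrix of centered points
    eigvals, eigvecs = np.linalg.eigh(G)
    keep = np.abs(eigvals) > 1e-10            # drop the null directions
    X = eigvecs[:, keep] * np.sqrt(np.abs(eigvals[keep]))
    s = np.sign(eigvals[keep])                # +1: Euclidean, -1: correction dims
    return X, s

# a symmetric matrix of squared dissimilarities violating the triangle inequality
D = np.array([[0., 1., 16.],
              [1., 0., 1.],
              [16., 1., 0.]])
X, s = pseudo_euclidean_embedding(D)
recon = ((X[:, None, :] - X[None, :, :]) ** 2 * s).sum(axis=2)
assert np.allclose(recon, D)
assert (s < 0).any()   # a negative direction is needed: D is non-Euclidean
```

The negative signs in `s` correspond to the directions in which the bilinear form fails to be positive definite.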

Prototypes will be restricted to convex combinations of given data points since optimum prototypes for NG have this form. For NG, we need to compute the dissimilarities $\langle x^j - w^i, x^j - w^i\rangle$. It is possible to compute the dissimilarity of two such prototypes (or a prototype and a given data point) based on the given dissimilarity matrix only. More precisely, generalizing results as presented in Hathaway and Bezdek (1994), the following identity holds (theorem 1): for $w^i = \sum_j \alpha_{ij} x^j$ with $\sum_j \alpha_{ij} = 1$ and $\alpha_i = (\alpha_{i1},\dots,\alpha_{im})$,
$$\langle x^j - w^i, x^j - w^i\rangle = (D\alpha_i)_j - \tfrac{1}{2}\,\alpha_i^{\mathsf T} D\,\alpha_i.$$

Prototypes of NG can be written as a convex combination of data points: $w^i = \sum_j \alpha_{ij} x^j$ with $\sum_j \alpha_{ij} = 1$. Thus, because of theorem 1, we can compute the dissimilarity between a prototype and a given data point based on the coefficients $\alpha_{ij}$ in this convex combination and the dissimilarity matrix *D*. Further, it is not necessary to explicitly store the prototype locations; rather, we can alternatively represent prototypes by means of the coefficients $\alpha_{ij}$. These observations lead to the following equivalent formulation of batch NG, which relies not on an explicit vectorial representation of the data but on the matrix *D* only and extends work as presented in Hathaway and Bezdek (1994) to neighborhood cooperation:
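The relational identity from Hathaway and Bezdek (1994) — the dissimilarity between $x^j$ and a convex combination $w = \sum_l \alpha_l x^l$ equals $(D\alpha)_j - \frac{1}{2}\alpha^{\mathsf T} D \alpha$ for a squared-Euclidean matrix $D$ — can be verified numerically on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                               # known embedding in R^3
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)    # squared distances

alpha = rng.random(6)
alpha /= alpha.sum()            # convex coefficients, sum to 1
w = alpha @ X                   # prototype as convex combination of the points

# relational identity: ||x^j - w||^2 = (D alpha)_j - 1/2 alpha^T D alpha
relational = D @ alpha - 0.5 * alpha @ D @ alpha
direct = ((X - w) ** 2).sum(axis=1)
assert np.allclose(relational, direct)
```

The left-hand side uses only the dissimilarity matrix and the coefficients; the embedding `X` serves purely as a check.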

**Algorithm 2.** Relational NG

**input**

symmetric dissimilarity matrix $D \in \mathbb{R}^{m\times m}$ with zero diagonal;

**begin**

**init** $\alpha_{ij}$ such that $\sum_j \alpha_{ij} = 1$;

**repeat**

compute $\mathrm{dist}(x^j, w^i) := (D\alpha_i)_j - \frac{1}{2}\,\alpha_i^{\mathsf T} D\,\alpha_i$;

set $k_{ij} := |\{l \mid \mathrm{dist}(x^j, w^l) < \mathrm{dist}(x^j, w^i)\}|$;

set $\alpha_{ij} := h_\sigma(k_{ij}) \big/ \sum_{j'} h_\sigma(k_{ij'})$;

**until** convergence;

**return** $\alpha_{ij}$;

**end**.
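A compact NumPy sketch of relational NG follows; the annealing schedule and the random initialization are assumptions made for the example, not the authors' prescription:

```python
import numpy as np

def relational_ng(D, k=2, epochs=50, seed=0):
    """Sketch of relational NG: D is a symmetric zero-diagonal matrix of
    pairwise dissimilarities; prototypes are coefficient vectors alpha_i."""
    rng = np.random.default_rng(seed)
    m = D.shape[0]
    alpha = rng.random((k, m))
    alpha /= alpha.sum(axis=1, keepdims=True)   # rows are convex coefficients
    for epoch in range(epochs):
        sigma = 2.0 * 0.9 ** epoch              # assumed annealing schedule
        # dist[i, j] = (D alpha_i)_j - 1/2 alpha_i^T D alpha_i
        quad = np.einsum('im,mn,in->i', alpha, D, alpha)
        dist = alpha @ D - 0.5 * quad[:, None]
        ranks = np.argsort(np.argsort(dist, axis=0), axis=0)   # k_ij
        h = np.exp(-ranks / sigma)
        alpha = h / h.sum(axis=1, keepdims=True)               # new coefficients
    return alpha, dist.argmin(axis=0)           # coefficients, crisp assignments
```

Since only `D` enters the computation, the same code runs whether or not an explicit embedding of the data is known.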

*Relational NG as shown in algorithm 2 constitutes an equivalent formulation of batch NG by means of the identity $w^i = \sum_j \alpha_{ij} x^j$. Thereby, $\alpha_i$ refers to the vector $(\alpha_{i1},\dots,\alpha_{im})$ and $\alpha_{ij}$ refers to coordinate j. Equivalence means that if $w^i = \sum_j \alpha_{ij} x^j$ holds for the initialization of batch NG and relational NG, respectively, then the same identity holds for these values after every epoch of batch NG and relational NG, respectively*.

The equivalence holds because of theorem 1: a data point $x^j$ corresponds to the coefficient vector given by the *j*th unit vector $e_j$ because of $x^j = \sum_l (e_j)_l\, x^l$. Further, we can assume that prototypes are initialized in the convex hull of data points, that is, for the initialization, equivalent coefficients can be found.

Using the identity $w^i = \sum_j \alpha_{ij} x^j$, relational NG computes exactly the same prototype locations in every epoch as batch NG does. In relational NG, however, prototype locations are computed only indirectly by means of the coefficients $\alpha_{ij}$. It is not necessary to know the vectorial representation of the training vectors; rather, knowledge of the pairwise dissimilarities $d_{ij}$ is sufficient to compute the output of batch NG in terms of the coefficients $\alpha_{ij}$. Hence, NG can be performed directly based on a given dissimilarity matrix *D* only. For every prototype, *m* coefficients $\alpha_{ij}$ are stored, with *m* denoting the number of training points. The coefficient $\alpha_{ij}$ corresponds to the part that data point $x^j$ takes in representing prototype $w^i$. The larger $\alpha_{ij}$, the more $x^j$ contributes to the prototype location. In the limit of zero neighborhood cooperation $\sigma \to 0$, the prototypes converge to the mean of the data points in their receptive fields. As for batch NG, the neighborhood parameter $\sigma$ is usually annealed to 0 during training. For all theoretical considerations, we assume a fixed and possibly small parameter $\sigma$ corresponding to the final stages of training.

The space complexity of relational clustering is linear with regard to the number of training data because of the requirement to store $\alpha_{ij}$ for $i = 1,\dots,k$ and $j = 1,\dots,m$. The time complexity of one training epoch is quadratic with regard to the number of training points, since the computation of $(D\alpha_i)_j$ takes linear time for every *j*. The quantity $\alpha_i^{\mathsf T} D \alpha_i$ does not depend on *j* and has to be computed only once for every *i*; thus, it also contributes quadratic complexity. Since embedding data into a vector space based on the distance matrix *D* has cubic complexity depending on the number of training points, as we will see later, this procedure offers a more efficient alternative for NG if an embedding of data is not given previously. Obviously the algorithm provided by relational NG can be applied to every setting where pairwise dissimilarities are known. Euclideanity of an underlying vector space is not necessarily required to apply the algorithm. However, in such cases, neither convergence of relational NG nor a connection to a cost function is guaranteed, just as no guarantee is given for the vectorial counterpart batch NG in non-Euclidean settings. We will discuss later how far the procedure is reasonable in such situations.

#### 3.1.2. Dual Cost Function.

For the Euclidean setting— for some embedding holds—convergence of relational NG toward a local optimum of the NG cost function is guaranteed, since relational NG is equivalent to standard batch NG; therefore, the same guarantees as for batch NG hold, provided the underlying space *X* is Euclidean (Cottrell et al., 2006). For a non-Euclidean setting, convergence of the algorithm toward a local optimum of the NG cost function is not guaranteed. For both cases, however, prototypes are obtained only indirectly this way, and the NG cost function can be evaluated only when the data are known. Therefore, it is interesting to see whether there exists an alternative to evaluate the cost function of NG based on the quantities given by relational NG and the distance matrix *D*.

For this purpose, we consider the following function, which depends on the assignments $k_{ij}$ and the dissimilarity matrix *D* only and extends the dual cost function of *k*-means to neighborhood cooperation (Hathaway & Bezdek, 1994):
$$E^{\vee}_{\mathrm{NG}}(k_{ij}) = \sum_{i=1}^{k} \frac{1}{4\sum_{j''} h_\sigma(k_{ij''})} \sum_{j,j'=1}^{m} h_\sigma(k_{ij})\, h_\sigma(k_{ij'})\, d_{jj'}.$$
The function has to be optimized with respect to the assignments $k_{ij}$ under the constraint that $k_{1j},\dots,k_{kj}$ constitute a permutation of $\{0,\dots,k-1\}$ for every fixed *j*. The cost function measures the intracluster distances averaged over the local neighborhood as induced by the data.

We will see that this cost function is in some sense dual to the standard NG cost function. For fixed points found by relational or batch NG, respectively, the values of the NG cost function and the above function coincide, as we will see. Further, global optima of this cost function can be related to global optima of the standard NG cost function, as we will show. Therefore, we refer to this cost function as the dual NG cost function, which expresses the objective of NG in the dual variable $k_{ij}$ instead of the prototypes $w^i$.

The dual cost function depends on the assignments $k_{ij}$, which determine the rank of cluster *i* for a data point $x^j$, and the pairwise distances $d_{jj'}$. This cost function constitutes a direct extension of the well-known cost function of pairwise data clustering, referred to as the dual *k*-means cost function in the literature, to neighborhood cooperation (Hathaway & Bezdek, 1994; Hofmann & Buhmann, 1997):
$$E^{\vee}_{k\text{-means}}(\chi_{ij}) = \sum_{i=1}^{k} \frac{1}{4\sum_{j''} \chi_{ij''}} \sum_{j,j'=1}^{m} \chi_{ij}\,\chi_{ij'}\, d_{jj'}.$$
This cost function is the dual function of the standard quantization error, and it has to be optimized for assignments $\chi_{ij} \in \{0, 1\}$ such that $\sum_i \chi_{ij} = 1$ for every *j*. It directly measures the intracluster distances of a given clustering. It is well known that optimum prototypes of the vector quantization error correspond to optimum cluster assignments of the dual cost function in the Euclidean space, both values computed in the course of *k*-means clustering (Hathaway & Bezdek, 1994; Hathaway et al., 1989). The dual cost function is more general in the sense that it offers a valid cost function for arbitrary dissimilarities $d_{jj'}$, and alternative optimization methods such as deterministic annealing have been proposed for this setting (Hofmann & Buhmann, 1997).
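The duality between the quantization error (with mean prototypes) and the normalized intracluster distances can be checked numerically for crisp clusters and squared Euclidean dissimilarities; the data set and the balanced assignment are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # squared dissimilarities

labels = np.arange(20) % 3        # arbitrary crisp assignment to k=3 clusters

primal = 0.0                       # quantization error with optimum (mean) prototypes
dual = 0.0                         # dual cost: normalized intracluster distances
for i in range(3):
    members = labels == i
    mu = X[members].mean(axis=0)
    primal += 0.5 * ((X[members] - mu) ** 2).sum()
    dual += D[np.ix_(members, members)].sum() / (4 * members.sum())

assert np.isclose(primal, dual)    # the two cost values coincide
```

The underlying identity is $\sum_{j\in C}\|x^j-\mu\|^2 = \frac{1}{2|C|}\sum_{j,j'\in C} d_{jj'}$ per cluster, which removes the prototype from the formula entirely.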

Now we formally establish a relation of the NG cost function and its dual. For this purpose, we reformulate the dual cost function of NG for vectors in a vector space *X*, which allows us to relate the dual function depending on $k_{ij}$ only to a mixed function depending on $k_{ij}$ and $w^i$ and, finally, to the standard NG cost function.

This reformulation of the dual cost function allows us to link its value to the standard NG cost function for fixed points obtained by relational NG:

*Assume relational NG converges toward a fixed point of the algorithm, that is, coefficients $\alpha_{ij}$ and $k_{ij}$ are found, which remain constant after further adaptation steps. Define the vector $w^i = \sum_j \alpha_{ij} x^j$. Then $E^{\vee}_{\mathrm{NG}}(k_{ij}) = E_{\mathrm{NG}}(\vec w)$*.

For a fixed point of relational NG, and thus a fixed point of batch NG, we find $E^{\vee}_{\mathrm{NG}}(k_{ij}) = E_{\mathrm{NG}}(\vec w)$.

This result allows us to evaluate the outcome of relational NG based on the dissimilarity matrix *D* and the coefficients $\alpha_{ij}$ only. The value of the dual cost function corresponds to the standard cost function value of standard NG if fixed points of the algorithms are considered; hence, the quality of the solution can be judged independent of a concrete vectorial embedding of the data points. Obviously, this theorem also holds under the weaker condition that $k_{ij}$ coincides with the ranks of data point *j* for the prototypes $w^i$. As we will see in the experiments, this condition is often already approximately fulfilled in early stages of training, such that both cost functions are approximately identical for relational NG during training.

It is interesting to investigate whether the structure of the NG cost function and its dual are in some sense equivalent. This refers to the question of whether local or global optima of these cost functions can be related to each other. We start by looking at global optima of the cost functions. The following holds:

The inequality is obvious because of the definition of the functions.

A positive definite bilinear form over a finite-dimensional vector space can be expressed as $\langle x, y\rangle = x^{\mathsf T} A\, y$ with a symmetric positive-definite matrix *A*. We can compute the directional derivative of the cost function in the direction $\xi$ as $\sum_j h_\sigma(k_{ij})\,(w^i - x^j)^{\mathsf T} A\, \xi$. For local optima, this must be zero for every $\xi$; hence, we find $A \sum_j h_\sigma(k_{ij})(w^i - x^j) = 0$ and, hence, $w^i = \sum_j h_\sigma(k_{ij})\, x^j \big/ \sum_j h_\sigma(k_{ij})$. The Hessian matrix of this cost function constitutes a matrix with diagonal blocks $\sum_j h_\sigma(k_{ij}) \cdot A$, and 0 entries otherwise. This is positive definite because *A* is positive definite, and the factor $\sum_j h_\sigma(k_{ij})$ is positive; thus, this position constitutes a local optimum of the cost function. Since the cost function is a quadratic form with respect to $\vec w$, this local optimum must be the global optimum. Therefore, equality holds in this case.

Assume the assignments do not coincide with the ranks. Then there exist two indices *i* and *i*′ for which the rank order and the distance order disagree. Because of the monotonicity of the neighborhood function, we can decrease the cost function by swapping these two assignments. Therefore, optimum assignments *k _{ij}* are given by the ranks; hence, the equality follows.

As a consequence of this theorem, we find equivalence of global optima of the NG cost function and its dual in the Euclidean setting:

Thus, in the Euclidean case, the NG cost function and its dual coincide in the sense that the values of their global optima correspond to each other. Since global optima are fixed points of batch NG and relational NG in the Euclidean setting, theorem 4 also gives a correspondence of the global optima themselves.

The question arises whether a more detailed correspondence of the functions according to their overall shape can be established. In particular, a connection of the values obtained by batch and relational NG and their role (e.g., local or global optimum) with respect to the cost functions would be interesting. Batch NG repeatedly optimizes the assignments *k _{ij}* and the prototype locations of the cost function if a Euclidean embedding of the data can be found. The convergence proof presented in Cottrell et al. (2006) relies on the fact that optimum assignments and prototypes, respectively, are determined in every step in the Euclidean setting, as also computed in the above proofs. Thus, the cost function decreases in successive steps until convergence, provided the setting is Euclidean or, more generally, given by a positive definite symmetric bilinear form. Cottrell et al. (2006) showed that the obtained fixed point constitutes a local optimum of the cost function under mild conditions on the setting. The same holds for RNG if the pairwise distances stem from a Euclidean space because RNG is just an equivalent formulation of batch NG. However, in both cases, it is not guaranteed that a global optimum of the cost functions is reached.

We already know that for fixed points of relational or batch NG, the values of the NG cost function and its dual coincide. Further, in the Euclidean setting, a local optimum of the NG cost function is reached, and, conversely, every local optimum of the NG cost function constitutes a fixed point of NG. Now the question arises as to how far local optima of the NG cost function can be related to local optima of the dual cost function. For this purpose, we have to determine a neighborhood structure for the solution space of the dual cost function, since the *k _{ij}* constitute discrete values. We use the following simple structure. We define a neighborhood structure on the set of permissible assignments such that two assignments *k _{ij}* and *k*′ _{ij} are neighbored if and only if they coincide for all but two index pairs (*i*, *j*). An assignment *k _{ij}* is called a *local optimum* of the dual cost function if its cost is not larger than the cost of any assignment in its neighborhood. Using this definition, we obtain the following result:

*Assume the underlying bilinear form is positive definite. Then local optima of the dual cost function constitute fixed points of relational NG*.

Assume assignments *k _{ij}* are given that do not constitute a fixed point of relational NG. Then the assignments do not coincide with the ranks induced by the corresponding prototypes. Thus, we can argue as in the proof of theorem 5: substituting two assignments for which the rank order is violated while the corresponding distances obey the opposite order leads to assignments with smaller cost. Since the new assignments differ from the old ones in only two positions, they lie in the neighborhood defined above, and because of theorem 3, the dual cost decreases as well; thus, *k _{ij}* does not constitute a local optimum.

The converse, however, is not true:

*There exist fixed points of relational NG that do not constitute a local optimum of the dual cost function with respect to the neighborhood structure as defined above in the Euclidean setting*.

We consider the limit case of vanishing neighborhood range (i.e., crisp *k*-means); NG approximates this setting for small enough neighborhood range. Consider three Euclidean points and two prototypes. The corresponding crisp assignments constitute a fixed point of batch NG and, hence, relational NG. An alternative assignment, however, is in the neighborhood of this solution and leads to better prototypes.

Thus, it is guaranteed that local optima of the NG cost function or its dual, respectively, lead to fixed points of batch NG or relational NG in the Euclidean setting. Conversely, every fixed point constitutes a local optimum of the NG cost function under mild conditions, whereas the corresponding statement for the dual cost function does not hold: there exist settings where a fixed point of RNG can be further improved within the neighborhood structure defined above for the dual NG cost function. Since a one-to-one correspondence of fixed points of batch NG and RNG exists, this has the consequence that the dual NG cost function possesses fewer local optima than the original NG cost function with respect to the neighborhood as defined above; in particular, there is no exact correspondence of the overall structure of the NG cost function and its dual.

#### 3.1.3. Out-of-Sample Extensions.

We conclude this section by addressing two issues relevant to clustering algorithms: Is it possible to extend a given clustering to new data points that are not contained in the training set (so-called out-of-sample extensions)? Is it possible to integrate prior knowledge (e.g., given by a partial labeling of the data) into the algorithms? Both questions can be answered satisfactorily in the context of prototype-based clustering.

Out-of-sample extensions of NG to a new point can be defined based on the standard winner assignment: the point is mapped to the class represented by the prototype with the smallest dissimilarity to it. This simple scheme is one of the benefits of prototype-based clustering as opposed to alternatives such as assignment-based clustering. The scheme can be transferred directly to relational clustering because of its connection to batch NG in a vector space. The question arises whether this procedure can be transferred to a scheme that refers to pairwise dissimilarities only, without explicit knowledge of the embedding of data or prototypes. The key ingredient is a scheme that allows computing the dissimilarity of a new data point and a prototype based on pairwise dissimilarities only:

This follows directly from theorem 1, whereby we consider the linear combinations of the data points with the respective coefficients.

This allows us to compute out-of-sample extensions of the clustering based on the dissimilarities and prototype coefficients only.
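This out-of-sample computation can be sketched as follows (function names are ours; the identity d(x, w) = d_new·α − ½ α·D·α follows theorem 1 under the convention that dissimilarities are squared Euclidean distances):

```python
import numpy as np

def rel_dist(alpha, D, d_new):
    """Dissimilarity between a new point and a prototype given by convex
    coefficients alpha (sum(alpha) == 1): d_new holds the dissimilarities
    of the new point to all training points, D is the training matrix."""
    return d_new @ alpha - 0.5 * alpha @ D @ alpha

def out_of_sample_winner(A, D, d_new):
    """Winner assignment for a new point: prototype (row of A) with the
    smallest dissimilarity, computed from pairwise data only."""
    return int(np.argmin([rel_dist(a, D, d_new) for a in A]))
```

For Euclidean data, `rel_dist` reproduces the squared distance to the embedded prototype, so the winner coincides with the standard vectorial assignment.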

#### 3.1.4. Supervision.

Clustering constitutes an ill-posed problem since the objective of clustering is not clear a priori (see, e.g., Biehl, Hammer, Hochreiter, Kremer, & Villmann, 2009). If pairwise distances of data are given, the dual cost function of *k*-means constitutes one possible objective of clustering. Alternatively, when restricting to prototype-based algorithms, the standard quantization error can serve as an objective function. We have seen that these cost functions are equivalent for metric data, and neural gas and relational neural gas optimize a relaxation thereof. However, it is not clear a priori whether the outcome meets the intended result in practical problems. The possibility of including further information, if available, is very important to get meaningful results for unsupervised learning. This can help to prevent the garbage in–garbage out problem of unsupervised learning, as discussed in Kaski et al. (2003) and Kaski, Nikkilä, Savia, and Roos (2005).

Assume each data point *x ^{j}* is equipped with a label. We equip every prototype *w ^{i}* with a label as well, which is adapted during learning. The basic idea consists of substituting the standard dissimilarities by a mixture that takes the similarity of label assignments into account, where a mixing parameter controls the influence of the label values. This procedure has been proposed in Hammer et al. (2006a, 2006b) and Villmann et al. (2006) for Euclidean and median clustering and online neural gas, respectively. One can use the same principles to extend relational clustering. The cost function of NG is extended accordingly, whereby the rank of neuron *i* is measured according to the mixed dissimilarities. Batch NG can be derived directly: the ranks are determined based on this extended cost term, accompanied by an adaptation step for the prototype labels in batch optimization.

Relational learning becomes possible in the same way as before. Assume pairwise dissimilarities of data *d _{ij}* are given instead of explicit data locations. Then supervised relational neural gas results as displayed in algorithm 3. Assume that a vector space and a symmetric bilinear form exist that induce the dissimilarities: . In that case, supervised NG can be interpreted as an extension of standard NG applied to the data with coefficients in the first dimensions and in the remaining ones. In particular, the theorems proved in this section also hold for the supervised case.
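For the relational setting, the mixture can be formed directly on the level of pairwise dissimilarities. A hedged sketch (crisp labels and a 0/1 label dissimilarity are our simplifying assumptions, not the article's general scheme):

```python
import numpy as np

def supervised_dissimilarities(D, labels, beta=0.5):
    """Mix data dissimilarities with label dissimilarities:
    (1 - beta) * D + beta * L, where L[i, j] = 0 iff the labels agree.
    Running relational NG on the mixed matrix corresponds to NG on
    extended vectors (data part, label part)."""
    y = np.asarray(labels)
    L = (y[:, None] != y[None, :]).astype(float)
    return (1.0 - beta) * D + beta * L
```

The mixed matrix remains symmetric with zero diagonal, so all relational machinery applies unchanged; `beta = 0` recovers the unsupervised case.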

### 3.2. Non-Euclidean Data.

It often holds that data cannot be embedded in Euclidean space. As an example, discrete data such as DNA sequences or strings can be considered and pairwise dissimilarities stem from an alignment of data. This yields a metric, but not necessarily the Euclidean one. For data that stem from experimental measurements, it can even be the case

**Algorithm 3.** Supervised Relational NG

**input**

symmetric dissimilarity matrix with zero diagonal;

label information ;

**begin**

**init** with ;

**repeat**

compute ;

set ;

set ;

set ;

**until** convergence;

**return** coefficients and prototype labels;

**end**.

that metric properties such as the triangle inequality are violated. In these cases, parts of the argumentation of the previous section no longer hold. Here we investigate whether relational NG can still be applied in this more general setting, whether it can be connected to a batch NG scheme in an appropriate vector space, and whether a connection to the NG cost function and its dual can be made. We also want to investigate properties such as convergence of the algorithm based on these findings. Further, we will put relational clustering as introduced in the last section into a wider context and relate it to two well-known and very powerful clustering schemes for proximity data based on deterministic annealing: the framework of pairwise data clustering as proposed by Hofmann and Buhmann (1997) and extended to neural gas schemes in Hofmann and Buhmann (1996), and the SOM for proximity data as introduced by Graepel and Obermayer (1999) and later extended to hyperbolic SOM (Saalbach et al., 2005). We will argue that relational clustering constitutes a simple way to derive the deterministic annealing variants as proposed in Hofmann and Buhmann (1997) and Graepel and Obermayer (1999) directly from the corresponding deterministic annealing schemes for standard *k*-means and SOM, respectively, in the Euclidean setting. For general dissimilarities, relational clustering as well as deterministic annealing can be related to clustering algorithms in a vector space; in particular, an interpretation in terms of prototypes is possible for all algorithms. However, it can be shown that none of the algorithms guarantees convergence to a local optimum of the dual cost function or a related cost function, or even convergence of the algorithm at all.

#### 3.2.1. Relational Clustering and Batch Clustering in Pseudo-Euclidean Space.

We assume in the following that data are represented by pairwise dissimilarities *d _{ij}*, and *D* denotes the corresponding dissimilarity matrix. This setting has been considered, for example, in Hofmann and Buhmann (1997), Puzicha, Hofmann, and Buhmann (1999), and Graepel and Obermayer (1999). There, the dual cost function of *k*-means or related costs are considered, which are well defined for general dissimilarities. These approaches propose an optimization of this discrete function using methods of statistical physics, resulting in deterministic annealing schedules for clustering. Note that approximations have to be used because optimization of this cost function constitutes an NP-hard problem (Brucker, 1978). In this section, we discuss how far relational NG can be seen as an alternative solution and in which cases optimization by relational NG as well as deterministic annealing fails. Note that relational NG as defined above can be applied as an algorithm in every setting where a dissimilarity matrix *D* is given. However, it is not clear how far this procedure leads to meaningful results.

In the following, we always make the reasonable assumption that the diagonal is zero: *d _{ii}* = 0. Further, we assume symmetry of *D*: *d _{ij}* = *d _{ji}*. The latter does not constitute a restriction because it does not affect the cost function. More precisely, it has been shown (e.g., in Laub et al., 2006) that the dual cost function of *k*-means is invariant with respect to symmetric transformations. The same holds for the dual cost function of NG:

*Assume pairwise dissimilarities d _{ij} are given. If we set d̃ _{ij} = (d _{ij} + d _{ji})/2, then the dual cost function is not affected by this transform*.

Consider the centering matrix *J* = *I* − (1/*m*)𝟙𝟙^{T}, where 𝟙 denotes the vector of ones. Define the Gram matrix *G* = −(1/2)*JDJ*. Obviously, this matrix is symmetric, and thus it can uniquely be decomposed into the form *G* = *Q*Λ*Q*^{T} with an orthonormal matrix *Q* and a diagonal matrix of eigenvalues Λ with *p* positive and *q* negative entries. Taking the square root of |Λ| allows the alternative representation *G* = *X*^{T}*I _{pq}X* with *X* = |Λ|^{1/2}*Q*^{T}, where *I _{pq}* constitutes a diagonal matrix with *p* entries 1 and *q* entries −1, and where only the *p* + *q* nonzero eigenvalues of *G* are taken into account. We can define the symmetric bilinear form in ℝ^{p+q}: ⟨*x*, *y*⟩_{pq} = *x*_{1}*y*_{1} + ⋯ + *x _{p}y _{p}* − *x*_{p+1}*y*_{p+1} − ⋯ − *x*_{p+q}*y*_{p+q}. Then the columns of *X* constitute vectors *x ^{i}* with pairwise dissimilarities *d _{ij}* = ⟨*x ^{i}* − *x ^{j}*, *x ^{i}* − *x ^{j}*⟩_{pq}. Hence we have found a vector space together with a symmetric bilinear form and an embedding of the points that yields these dissimilarities under the bilinear form. This embedding is referred to as the pseudo-Euclidean embedding of the data points. The values (*p*, *q*, *m* − *p* − *q*) are referred to as the signature of the pseudo-Euclidean space.

Because of this fact, we can always assume that data points are given as vectors. However, this embedding need not correspond to standard Euclidean space. The dissimilarity matrix *D* stems from Euclidean points if and only if *q* = 0, that is, the matrix *I _{pq}* that describes the bilinear form contains only positive diagonal entries (dropping the parts with entry 0, since they obviously do not contribute to the dissimilarity). Otherwise, the Euclidean property is violated.
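For moderately sized data sets, the embedding and its signature can be computed explicitly along the lines above. A sketch using an eigendecomposition of the Gram matrix (function name and tolerance are our choices):

```python
import numpy as np

def pseudo_euclidean_embedding(D, tol=1e-10):
    """Embed a symmetric dissimilarity matrix D with zero diagonal into
    pseudo-Euclidean space: G = -0.5 J D J with J = I - (1/m) 11^T,
    eigendecomposition G = Q Lambda Q^T, embedding X = |Lambda|^(1/2) Q^T."""
    m = D.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m
    G = -0.5 * J @ D @ J
    vals, Q = np.linalg.eigh(G)
    keep = np.abs(vals) > tol                 # drop (near-)zero eigenvalues
    vals, Q = vals[keep], Q[:, keep]
    X = (np.sqrt(np.abs(vals)) * Q).T         # columns = embedded points
    Ipq = np.sign(vals)                       # +1 / -1 per kept dimension
    signature = (int((vals > 0).sum()), int((vals < 0).sum()))
    return X, Ipq, signature
```

For Euclidean input (squared distances), the signature has *q* = 0 and the bilinear form reproduces *D* exactly; a negative part of the signature diagnoses non-Euclidean data.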

The code of NG can be executed in every vector space that possesses a symmetric bilinear form such as pseudo-Euclidean space. Further, relational NG can be executed based on a dissimilarity matrix *D* only. Because of theorem 2, these two algorithms correspond to each other, that is, relational NG applied to a symmetric dissimilarity matrix with 0 diagonal is the same algorithm as standard batch NG in pseudo-Euclidean space using the correspondence for the embedding of data as given above. Thereby, relational NG does not rely on explicit coordinates of the data points. The above embedding depends on matrix diagonalization; thus, it has cubic complexity. Hence relational NG provides a more efficient scheme than standard NG for the Euclidean as well as non-Euclidean setting. Obviously, out-of-sample extensions of the found assignments for new data are easily possible, as seen in theorem 9, and the extension to supervised settings as proposed in the last section is immediate.
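A compact sketch of relational NG operating on *D* only may look as follows (hyperparameters, schedules, and names are our illustrative choices, not prescriptions from the article):

```python
import numpy as np

def relational_ng(D, n_prototypes, n_epochs=100, sigma0=None, seed=0):
    """Sketch of relational batch NG: prototypes are kept as convex
    coefficient vectors alpha_i (rows of A); dissimilarities to the data
    are computed from D alone via the identity of theorem 1."""
    m = D.shape[0]
    rng = np.random.default_rng(seed)
    A = rng.random((n_prototypes, m))
    A /= A.sum(axis=1, keepdims=True)                # convex coefficients
    sigma0 = sigma0 if sigma0 is not None else n_prototypes / 2.0
    for t in range(n_epochs):
        sigma = sigma0 * (0.01 / sigma0) ** (t / max(n_epochs - 1, 1))
        # dist[i, j] = [D alpha_i]_j - 0.5 alpha_i^T D alpha_i
        dist = A @ D - 0.5 * np.einsum('ij,jk,ik->i', A, D, A)[:, None]
        ranks = np.argsort(np.argsort(dist, axis=0), axis=0)   # k_ij
        H = np.exp(-ranks / sigma)
        A = H / H.sum(axis=1, keepdims=True)         # alpha = h(k)/sum h(k)
    dist = A @ D - 0.5 * np.einsum('ij,jk,ik->i', A, D, A)[:, None]
    return A, dist
```

On Euclidean dissimilarities this reproduces batch NG in the embedding space; for general symmetric *D* it runs unchanged, with the caveats on convergence discussed below.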

#### 3.2.2. Connection to a Cost Function.

The question now is whether a relation of this procedure to the NG cost function and its dual can be made and whether convergence of the algorithm is guaranteed. If *D* corresponds to a Euclidean setting (i.e., *q*=0), the guarantees as given in the previous section hold. In particular, RNG converges to a local optimum of the NG cost function, and evaluation of these costs is possible based on the dual.

Unfortunately, these guarantees do not hold in general in pseudo-Euclidean space. This is due to the fact that consecutive steps of batch optimization do not necessarily find optima of the respective partial problem. While the assignments *k _{ij}* based on the ranks are still optimum in the non-Euclidean setting, the prototype locations given by the generalized means are not. These values can constitute a saddle point or (in the case of only negative entries of the bilinear form) even a local maximum of the corresponding part of the cost function. Because of this fact, the value of the NG cost function does not necessarily decrease in consecutive steps, and convergence is not guaranteed.

A proof that convergence in general cannot be guaranteed is given by the following theorem:

*Assume d _{ij} constitutes a symmetric matrix with diagonal elements 0. Then relational neural gas does not necessarily converge toward a fixed point*.

We will see in our experiments that although convergence is not always guaranteed, it can nevertheless be observed in most practical cases; in our experiments, convergence was always given. For fixed points of relational NG, the value of the dual cost function for the found solution corresponds to a value of the standard NG cost function because of theorem 4. Further, the values obtained in these experiments seem reasonably good, and the cost function of NG is decreased in most epochs, such that relational NG can be seen as a reasonable heuristic to arrive at a good solution in these cases. Note that nonconvex quadratic programming is NP-hard (as shown, e.g., in Sahni, 1974, and Pardalos & Vavasis, 1991), such that an efficient algorithm that finds optimum prototypes in every epoch instead of the simple mean cannot easily be derived. Because of theorem 3, we can restrict the search space of prototypes to positions given by convex combinations induced by assignments *k _{ij}* if we are interested in an optimization of the dual cost function. However, these values *k _{ij}* need not coincide with rank assignments for optimum choices. Nevertheless, in practice, this compromise often constitutes a reasonable and efficiently computable trade-off.

#### 3.2.3. Connection to Deterministic Annealing Approaches.

The fact that convergence of relational NG is not guaranteed is shared by popular and successful alternatives that have been proposed in the literature in the context of clustering proximity data, namely, deterministic annealing for pairwise data clustering and SOM, respectively, as proposed in Hofmann and Buhmann (1997) and Graepel and Obermayer (1999), when run in fast parallel update mode.^{1} Relational neural gas can be understood as a direct, simple derivation of the crisp clustering algorithm that results from these approaches in the limit of zero temperature. Because the deterministic annealing schemes (Hofmann & Buhmann, 1997; Graepel & Obermayer, 1999) constitute powerful and effective techniques for grouping and clustering proximities, relational approaches can thus also be seen as an effective compromise to arrive at reasonable solutions of the NP-hard optimization problem for general dissimilarity data. We explain this link and give examples of the divergence of the deterministic annealing schemes in the following.

The approach of Hofmann and Buhmann (1997) is based on a pairwise *k*-means cost function, where *m* denotes the number of data points, binary variables refer to the assignment of point *x ^{j}* to cluster *i*, and a further term denotes the probability of cluster *i*. An additional summand yields a constant term that emphasizes the independence of the clustering cost function of the absolute dissimilarity scale. Obviously, up to constant factors and summands, the standard dual *k*-means cost function is obtained. In this approach, a clustering scheme is derived from the cost function by substituting crisp cluster assignments by expectation values for the assignments under a specified certainty level parameterized by a temperature *T*. Using methods of statistical physics, pioneered in the context of clustering by Rose (1998) under the frame of deterministic annealing, the following algorithm is derived, which consists of a subsequent computation of expected assignments and potentials, as shown in algorithm 4. An annealing factor determines the decrease of the temperature *T*.

**Algorithm 4.** Deterministic Annealing for Pairwise Data Clustering

**input**

symmetric dissimilarity matrix with zero diagonal;

**begin**

init , randomly;

init temperature *T*=*T*_{0};

**repeat**

**repeat**

;

;

**until** convergence;

;

**until** the final temperature is reached;

**return** assignments;

**end**.

To account for the cluster probabilities *p _{i}*, the update formula for the potentials is slightly altered in the original approach. In the limit of many points, the differences between these two update formulas vanish. These latter updates correspond to the update rules of relational *k*-means in the limit of zero temperature, where the averages become crisp assignments and the potentials correspond to the dissimilarity dist _{ij} of data point *x ^{j}* and prototype *w ^{i}*. Thus, in the limit, DA indirectly performs *k*-means clustering in pseudo-Euclidean space, using the possibility of computing dissimilarities only indirectly by means of the formula provided in theorem 1. Prototypes are given indirectly by the coefficients and can be recovered for an embedding of the data in pseudo-Euclidean space. Instead of a derivation from the dual cost function for pairwise data, DA could alternatively be derived directly from the formulas of deterministic annealing in pseudo-Euclidean space by using theorem 1 and the ideas of the derivation of relational NG from batch NG. Thus, DA for pairwise proximities is an application of deterministic annealing to relational *k*-means, which obtains better results by substituting crisp assignments by their expected values.
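The E-step of such a deterministic annealing scheme and its zero-temperature limit can be sketched as follows (our notation; `dist` plays the role of the potentials):

```python
import numpy as np

def expected_assignments(dist, T):
    """Gibbs expectation values of the cluster assignments at temperature T:
    a softmax over the negative partial costs, per data point (column)."""
    logits = -dist / T
    logits -= logits.max(axis=0, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=0, keepdims=True)

# As T -> 0, the expectations become crisp winner assignments, recovering
# relational k-means as the zero-temperature limit of the annealing scheme.
```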

The deterministic annealing schemes are derived from their respective cost functions using methods of statistical physics: cluster assignments are characterized by a Gibbs distribution depending on the cost terms. A mean field approximation is taken to approximate the Gibbs distribution in factorial form. Partial assignment costs can be derived by optimizing the Kullback-Leibler divergence between the Gibbs distribution and the factorial approximation. In both cases, optimal potentials (which relate to prototype positions in pseudo-Euclidean space) are determined based on the derivatives; that is, only the necessary condition for a local optimum is guaranteed. As for relational NG, the sufficient condition for a local optimum is not necessarily fulfilled if a non-Euclidean dissimilarity matrix *D* is considered, since the Hessian is not globally definite in this case. In consequence, just as for relational clustering, there exist situations where deterministic annealing does not converge.

For parallel update of the deterministic annealing *k*-means rules, the same behavior as for relational *k*-means can be observed if initialized appropriately. More precisely, for a temperature of at most 0.001 and a suitable initialization, a cyclic change between two states is observed. For the original DA rule, due to the slightly different update, a different behavior holds for this setting. We obtain the same cyclic changes if we start with four identical copies of every point, since a large enough number of data points *m* causes essentially the same updates for both DA versions.

Cyclic behavior can partially be avoided by sequential (online) updates, which pick a single index *j* (corresponding to a data point *x ^{j}*) and update the assignments and potentials for this *j* only in every inner loop. As a consequence, cycles that constitute only local traps of the update dynamics can partially be avoided, such as the example introduced above, which does not lead to cyclic behavior for online updates. However, cyclic behavior can still be present for online updates. As an example, we consider four points in two-dimensional pseudo-Euclidean space with signature (1, 1, 0) and the corresponding dissimilarity matrix. When started in a suitable configuration with temperature *T* of at most 0.001, two of the points change their expected assignments cyclically, while the other assignments remain constant for DA with the altered update rule. Again, the original DA rule behaves slightly differently, but we obtain qualitatively the same cyclic behavior when using four identical copies of every point, as before. Thus, sequential update can only partially avoid cyclic behavior.

As a consequence of this argumentation, we can interpret relational clustering as the crisp limit case of very popular deterministic annealing variants for pairwise data clustering. Because it avoids the inner loop, it is considerably faster than DA schemes, but at the price of probably worse optima. In principle, however, the theoretical guarantees of both methods with respect to convergence are the same, since they show limit cycles in comparable situations for parallel updates (which turn out to be rare in practical applications; we did not observe this behavior in a single practical experiment). Cycles for DA with sequential update hardly occur in practice (we could find only an example with negative dissimilarities), although sequential update comes at the cost of considerably slower convergence. Relational NG could be run in sequential mode in analogy to DA, leading to convergence in most situations while still avoiding one inner loop in comparison to online DA, at the cost of probably slightly worse optima.

The connection of relational clustering to DA schemes allows us to interpret DA schemes in the same way as their relational variants: DA schemes constitute standard clustering methods in pseudo-Euclidean space. Thus, the methods can be interpreted as prototype-based schemes where prototypes are represented indirectly by means of coefficients. In particular, interpretation of the methods in terms of the data points closest to the prototypes, as well as fast extensions to very large data sets in terms of patch clustering, which we introduce later in this article, become possible this way for the latter algorithms also.

#### 3.2.4. Spread Transformation.

As an alternative to a direct application of clustering in pseudo-Euclidean space, one can first change the dissimilarity matrix to make it Euclidean, such that convergence of the algorithm is guaranteed. This procedure has been proposed in the context of *k*-means (e.g., in the approaches of Laub et al., 2006, and Roth et al., 2003). There, it is demonstrated that a theoretical basis of this technique is given by the fact that the dual *k*-means cost function is invariant under additive shifts of the off-diagonal elements of the dissimilarity matrix. Roth et al. (2003) show that this also holds for a few further popular cost functions (e.g., connected to multidimensional scaling or graph cut).

This fact is preserved by the dual cost function of NG for small neighborhood range. More precisely, we find the following result:

*Assume pairwise distances d _{ij} are given such that the corresponding matrix D is symmetric with zero diagonal. Consider the distances with off-diagonal terms shifted by d _{0}, that is, d̃ _{ij} = d _{ij} + d _{0}(1 − δ _{ij}), where δ _{ij} denotes the Kronecker delta. Assume the neighborhood range σ is chosen small enough. Then, up to a constant offset depending on d _{0}, the dual cost functions of neural gas for d _{ij} and d̃ _{ij} differ only by a term that vanishes with σ*.

The dual cost function decomposes into terms for the winner, for which the rank *k* = 0 holds, and terms for the remaining prototypes, which are not winners for a data point. Both types of terms can be bounded for small neighborhood range. Therefore, substituting these terms by *d*_{0}*k*/4 changes the result by at most the stated bound.

Thus, such a transformation of the dissimilarity matrix does not change the shape of the dual cost function or the location of local and global optima for small neighborhood range. One can see that for large enough *d*_{0}, a dissimilarity matrix results that stems from the squared Euclidean metric. This procedure has been introduced in the literature under the notion of spread transformation (e.g., in connection with relational fuzzy clustering; Hathaway & Bezdek, 1994) or constant shift embedding in connection with *k*-means clustering (Roth et al., 2003). The following result is well known in the literature and follows immediately from the embedding of data in pseudo-Euclidean space (e.g., Pekalska & Duin, 2005):

*Assume D is a symmetric dissimilarity matrix with zero diagonal. Then the matrix with entries d̃ _{ij} = d _{ij} − 2λ _{min}(G)(1 − δ _{ij}) is squared Euclidean; that is, there exist Euclidean points x ^{i} with d̃ _{ij} = ‖x ^{i} − x ^{j}‖² for all i, j. Here G denotes the Gram matrix used for the pseudo-Euclidean embedding of the points, and λ _{min}(G) denotes the smallest eigenvalue of G. This shift denotes the smallest possible value to achieve this property.*

The result is obvious because this shift corresponds to a shift of the Gram matrix of the form *G* − λ _{min}(*G*) · *I* on the subspace of centered vectors, which makes the Gram matrix positive semidefinite. This procedure has been proposed, for example, in Laub et al. (2006) and Roth et al. (2003). Since only the smallest eigenvalue of *G* is needed, this procedure is much more efficient than an explicit embedding and correction of the non-Euclideanity, which would require cubic effort. Further, unlike alternative methods that bring a general dissimilarity matrix into squared Euclidean form, as proposed in Chen et al. (2009), the spread transform does not affect the cost function. In fuzzy clustering, it has been proposed to use the spread transformation if and only if distances of data points and prototypes become negative (Hathaway & Bezdek, 1994).
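A sketch of the resulting spread transformation (our function name; the off-diagonal shift of −2λ_min follows the statement above):

```python
import numpy as np

def spread_transform(D):
    """Constant shift embedding sketch: add -2 * lambda_min(G) to all
    off-diagonal dissimilarities, making D squared Euclidean; G is the
    (double-centered) Gram matrix of the pseudo-Euclidean embedding."""
    m = D.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m
    G = -0.5 * J @ D @ J
    lam_min = np.linalg.eigvalsh(G)[0]        # smallest eigenvalue
    shift = max(-2.0 * lam_min, 0.0)          # no shift if already Euclidean
    return D + shift * (1.0 - np.eye(m))
```

Only the smallest eigenvalue of *G* is needed, so an iterative eigensolver could replace the full `eigvalsh` call for large matrices.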

Unfortunately, it turns out that this procedure is partially only of theoretical interest for practical problems. The reason lies in the fact that the transformation can make the clustering problem harder for standard methods such as relational NG and deterministic annealing of pairwise proximities. While the transform does not affect the location of local optima, it changes the relative differences between good and bad local optima of the cost function. This does not affect optimization methods that are controlled by a cost difference, such as the method proposed in Yom-Tov and Slonim (2009). As seen in our experiments, deterministic annealing variants for pairwise data clustering and relational neural gas are affected by the shift such that these methods can no longer easily find good local optima. Hence, the direct application of the relational models to non-Euclidean data can give better results (although convergence is not guaranteed in theory, it can usually be observed in practice) than the application of the corresponding method in the shifted Euclidean space (although convergence is guaranteed in the latter case).

We demonstrate this effect in a standard benchmark data set. The cat cortex data originate from anatomical studies of cats' brains. The dissimilarity matrix displays the connection strength among 65 cortical areas (Graepel & Obermayer, 1999). For our purposes, a preprocessed version as presented in Haasdonk and Bahlmann (2004) was used. The matrix is symmetric with zero diagonal, but the triangle inequality does not hold. The signature of the related pseudo-Euclidean space is (41, 23, 1); that is, about one-third of the directions are associated with negative eigenvalues. The corresponding eigenspectrum is depicted in Figure 2. We applied the spread transform to this data set according to theorem 5. We trained (batch) deterministic annealing and relational neural gas on these data sets, using five prototypes in each case and default parameters (for deterministic annealing: number of epochs of the outer loop = 300, number of epochs of the inner loop = 50, start temperature = 100, noise level added to distances to avoid identical prototypes = 10^{−6}; for relational neural gas: number of epochs = 100, start neighborhood range = 2.5). Figure 3 shows the quality, measured by the dual quantization error, of the optima found when repeatedly initializing the algorithms with small random values for the data assignments, over 1000 runs. A curve with a well-expressed maximum results, representing the most prominent optimum. As expected, this optimum is a bit better for deterministic annealing than for relational neural gas due to the soft assignments of the data points, which result in a slightly better optimum at the cost of a slower convergence speed and the necessity of an additional inner loop. When considering the spread-transformed data that correspond to a Euclidean representation, the optima are shifted to the right in both cases.
Thereby, we evaluate the result based on the found assignments only, taking the original dissimilarities; that is, we evaluate the quality of the found clustering for the original clustering problem in both cases. Hence, although the algorithms deal with a Euclidean setting instead of a merely pseudo-Euclidean one in these cases, the optima found by the algorithms are worse than the original ones. This behavior is quite typical and can also be observed for other non-Euclidean data sets. Therefore, a direct application of relational neural gas to the dissimilarity matrix, instead of a prior spread transformation of the data, can be advisable.

## 4. Patch Clustering

The problem of mining large data sets has become one of the central issues of data mining. Roughly, the amount of electronically available data doubles every 20 months, reaching almost every area of daily life and science, such that people have to cope with massive data sets that cannot be scanned manually. Clustering and visualization offer fundamental techniques to adequately compress and preprocess such enormous data sets. However, in these cases, data no longer fit into main memory, such that batch processing methods that rely on all data at once become infeasible. Further, at most one scan through the data is still affordable, which also makes online alternatives such as online neural gas unsuitable. The situation is still worse, since clustering methods for dissimilarity data rely on the quadratic dissimilarity matrix (i.e., they display at least quadratic complexity) and, as is the case for relational clustering and deterministic annealing, linear space complexity for the classifier. Both issues make the methods slow for settings that reach 10,000 data points and entirely unsuitable for common desktop computers available today if more than 100,000 data points are involved.

Due to these problems, a variety of clustering algorithms for streaming data have been proposed in the literature, which ideally work in linear time and constant space. Most of these approaches have been proposed for Euclidean data. This includes extensions of classical *k*-means or *k*-median clustering, partially incorporating approximation guarantees, such as those of Badoiu, Har-Peled, and Indyk (2002) and Kumar, Sabharwal, and Sen (2004), which reach linear time but whose space requirements depend on the number of points; heuristics that partially rely on sampling and corresponding statistical guarantees or on grid-based methods, such as the popular algorithms CURE, STING, and BIRCH dedicated to very large data sets (Domingos & Hulten, 2001; Guha, Rastogi, & Shim, 1998; Wang, Yang, & Muntz, 1997; Zhang, Ramakrishnan, & Livny, 1996); and iterative compression approaches that process only a fixed subset of the given data at a time (Bradley et al., 1998; Farnstrom et al., 2000; Alex et al., 2009).

A few proposals for large sets of general non-Euclidean dissimilarity data exist. Hierarchical clustering typically has squared complexity, which can partially be optimized. However, superlinear complexity remains, and these methods are typically not very stable with respect to noise (Murtagh, 2002). The method of Kohonen and Somervuo (2002) extends the standard SOM to dissimilarity data by means of the generalized median and tackles large data sets by simple subsampling of the data. Affinity propagation (Frey & Dueck, 2007) also relies on median clustering by restricting prototype locations to data points, and it can be used for large data sets if the connection matrix is sparse, since only existing connections are effectively used in this setting. However, both methods are limited by the restricted prototype locations, and they require several sweeps through the entire data set. Two approaches that require only one sweep are given in Bezdek, Hathaway, Huband, Leckie, and Kotagiri (2006) and Yom-Tov and Slonim (2009). Bezdek et al. rely on relational variants of fuzzy *k*-means clustering and extend this to large data sets by subsampling a characteristic part of the dissimilarity matrix, clustering this subpart, and extending the result to all data. This way, however, the full matrix has to be available in advance or, alternatively, data must be independent and identically distributed (i.i.d.) to obtain reliable extensions to new parts. Similarly, the Nyström approximation constitutes a popular vehicle to extend the results obtained on a part of a dissimilarity or kernel matrix to the full data set, where approximation bounds can be derived explicitly. The work of Belongie, Fowlkes, Chung, and Malik (2002) constitutes one proposal where this approach has been used in the context of graph clustering for general dissimilarity graphs. However, a representative sample has to be available that allows the extension of the clustering to the full data set.
In contrast, the approach of Yom-Tov and Slonim (2009) proposes to process only parts of the given data on parallel processors by a direct optimization of pairwise clustering and to subsequently reach valid assignments of all data this way. However, the presented method no longer represents the solution in the form of interpretable prototypes.

Here we rely on patch clustering as introduced in Alex et al. (2009), Farnstrom et al. (2000), and Bradley et al. (1998) for Euclidean data sets to extend prototype-based clustering methods for dissimilarity data to large or streaming data sets. The basic idea is to iteratively process only a small part of the data using standard *k*-means or neural gas and to store these data in compressed form in terms of the prototypes and their multiplicities. This serves as sufficient statistics for further runs. Subsequent runs cluster the compressed data points already seen, in the form of prototypes counted with multiplicities, in addition to the next patch. Note that the final shape of the clusters always depends on the full data set for patch clustering, and the statistics accumulated in every run are adapted according to the given data, such that patch clustering can better deal with data that are not presented in an i.i.d. fashion than the noted approaches.

### 4.1. Patch Definition.

Assume data points *x*_{i}, *i* = 1, 2, …, *m*, are given such that for all *i* and *j*, the dissimilarity *d*_{ij} between these points can be computed directly. *D* denotes the corresponding dissimilarity matrix, where we assume symmetry *d*_{ij} = *d*_{ji} and zero diagonal *d*_{ii} = 0 as before. A typical example of this setting is a database of strings for which pairwise comparisons are given by alignment. For large *m*, it is in general infeasible to compute or store the full dissimilarity matrix in main memory due to its squared complexity. Patch processing relies on the principle of processing the data in *n*_{p} patches of prior fixed size *p* = *m*/*n*_{p}. Thereby, we assume divisibility of *m* by *n*_{p} for simplicity; in practice, the last patch is of smaller size. For dissimilarity data, a patch *P*_{t} is then represented by the corresponding portion of the dissimilarity matrix *D*, which contains the pairwise dissimilarities of the points in the patch.
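As a sketch of this setup, assuming a user-supplied pairwise dissimilarity function (here a trivial stand-in for, e.g., an alignment distance), the patches and their dissimilarity blocks can be produced on demand without ever materializing the full matrix:

```python
import numpy as np

def patch_slices(m, p):
    """Split the indices 0..m-1 into patches of fixed size p;
    the last patch is smaller when p does not divide m."""
    return [range(s, min(s + p, m)) for s in range(0, m, p)]

def patch_matrix(dissim, idx):
    """Compute only the block of the dissimilarity matrix for one patch,
    calling the pairwise dissimilarity function entry by entry on demand."""
    idx = list(idx)
    return np.array([[dissim(i, j) for j in idx] for i in idx])

# Toy dissimilarity standing in for an expensive pairwise comparison.
d = lambda i, j: float((i - j) ** 2)

patches = patch_slices(10, 4)        # patch sizes 4, 4, 2
D1 = patch_matrix(d, patches[0])     # 4 x 4 block for the first patch
```

Only one such block needs to reside in memory at a time, which is the point of the patch decomposition.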

### 4.2. *K*-Approximation.

The idea of original patch clustering is to add prototypes from the processing of the former patch *P*_{t−1} counted with multiplicities according to the size of their receptive field as additional data points to the current patch *P _{t}*. These points play the role of a compressed representation of all already seen data points; they provide sufficient statistics of the information processed so far. This way, all data are processed without loss of essential information since the previous information is represented by the sufficient statistics.

A naive transfer of this method from the Euclidean case to relational clustering, however, is not possible for two reasons. Unlike patch processing for Euclidean data, prototypes correspond to a weighting of all points involved in the clustering. Thus, the dimensionality of the coefficient vectors is determined by the number of data points that have to be processed. This results in an infeasible linear space complexity for huge data sets. In addition, for further processing of the data, the dissimilarities between all prototypes and all data from a new patch have to be computed. Since prototypes are represented indirectly by the contributions of all data points seen so far, the distance between prototypes and a new patch relies on the distance between the new patch and all data seen so far. By induction, one can see that this processing is possible only if the full dissimilarity matrix *D* is available. Hence, this approach results in infeasible quadratic time and space complexity.
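Concretely, a relational prototype is given by a coefficient vector α_i with Σ_j α_{ij} = 1, so that implicitly w_i = Σ_j α_{ij} x_j, and its dissimilarity to the data follows the standard relational identity d(x_j, w_i) = (Dα_i)_j − ½ α_iᵀDα_i. A minimal NumPy sketch of this computation, which shows why the full matrix D enters in the naive transfer, is:

```python
import numpy as np

def relational_distances(D, alpha):
    """Dissimilarities between all data points and relational prototypes.

    D     : (m, m) symmetric dissimilarity matrix with zero diagonal
    alpha : (k, m) coefficient matrix, each row summing to 1, so that
            prototype i is implicitly w_i = sum_j alpha[i, j] * x_j
    Returns the (k, m) matrix with entries
            d(x_j, w_i) = (D alpha_i)_j - 1/2 * alpha_i^T D alpha_i.
    """
    Da = alpha @ D                                 # rows: (D alpha_i)^T
    self_term = 0.5 * np.sum(Da * alpha, axis=1)   # 1/2 * alpha_i^T D alpha_i
    return Da - self_term[:, None]

# Three points on a line (0, 1, 2); D holds squared distances.
D = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 1.0],
              [4.0, 1.0, 0.0]])
alpha = np.array([[1.0, 0.0, 0.0],        # prototype = first point
                  [1/3, 1/3, 1/3]])       # prototype = mean of all points
dist = relational_distances(D, alpha)
```

For the Euclidean example, dist[1] recovers the squared distances (1, 0, 1) of the three points to their mean, confirming the identity.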

Therefore, we restrict the coefficient vector of each prototype to the *K* most prominent entries, *K* being a fixed number. Every patch is clustered until convergence; that is, a small neighborhood range is used in the resulting prototype representations. Obviously, for the coefficients of the prototypes, the following is valid: the coefficient vector yields 1/|*R*_{i}| for all data points whose winner is prototype *i*, where *R*_{i} denotes the receptive field of prototype *i*. Thereby, for simplicity, we assume that the receptive field is nonempty, which is usually the case for NG schemes. If it is empty, a coefficient is nonvanishing if prototype *i* is the second-closest neuron of the corresponding point. If this set is also empty, the third-closest points determine the nonzero entries, and so on.

We approximate the prototype by the closest *K* points in *R*_{i}, that is, the points for which *k*_{ij} = 0 and which are among the *K* points with the smallest dissimilarity to the prototype as computed in relational neural gas. Note that more complex alternatives are possible that choose *K* points and coefficients such that the corresponding prototype location changes as little as possible, as described in Rossi, Hasenfuss, and Hammer (2007). However, the above simple approximation scheme leads to good results, as we will see in the experiments.

These considerations give rise to the definition of a *K*-approximation of a relational prototype. A *K*-approximation refers to the *K* indices *j*_{1}, …, *j*_{K} corresponding to the points with the smallest dissimilarity to the prototype. For optimum prototypes as computed by relational NG in the limit phase, these are the *K* closest points in the receptive field of the prototype, and for these points the coefficients are maximum. The *K*-approximation can be computed easily in relational NG since the dissimilarities of data points and neurons are readily available.

Note that by always restricting to the *K* closest data points, *K* being a number fixed a priori, prototypes can always be approximated in constant space while processing all data. Further, only a fixed portion of the global dissimilarity matrix *D* is necessary to compute dissimilarities between prototypes and further patches.
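A *K*-approximation can be sketched as follows, under a simple equal-weight reading in which the *K* retained points each receive coefficient 1/*K* (a modeling sketch; in relational NG the distances d(x_j, w_i) needed here are readily available):

```python
import numpy as np

def k_approximate(dist_row, K):
    """K-approximation of one relational prototype: keep the indices of
    the K data points with smallest dissimilarity to the prototype and
    weight each of them 1/K."""
    idx = np.argsort(dist_row)[:K]
    alpha = np.zeros_like(dist_row)
    alpha[idx] = 1.0 / K
    return set(idx.tolist()), alpha

# distances of four data points to one prototype
idx, alpha = k_approximate(np.array([3.0, 0.5, 2.0, 0.1]), K=2)
```

The sparse coefficient vector still sums to 1, so the approximated prototype remains a convex combination of data points.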

### 4.3. Extended Patches.

Patch processing relies on extended patches, which enrich the current patch by the *K*-approximated prototypes. Assume the current patch *P*_{t} is considered and *N*_{t−1} refers to the index set of the *K*-approximation of all prototypes obtained in the previous step. When *k* prototypes are considered, the size of this set is restricted by *k* · *K*; under the assumption that at least *K* points lie in every receptive field, equality holds. For the next round of patch processing, dissimilarity clustering is applied to the points corresponding to the indices in *N*_{t−1} and the data from the current patch; that is, we need the part of the dissimilarity matrix that contains the interdissimilarities of the points from the *K*-approximation, the dissimilarities between the points in the *K*-approximation and the current patch, and the dissimilarities within the current patch. We refer to these enriched patches as *extended patches*.
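The extended patch matrix can be assembled in the same on-demand fashion, stacking the *K*-approximation indices N_{t−1} in front of the current patch (a sketch with a toy dissimilarity function):

```python
import numpy as np

def extended_patch(dissim, approx_idx, patch_idx):
    """Dissimilarity matrix of an extended patch: the block matrix over
    the points of the K-approximation N_{t-1} followed by the points of
    the current patch P_t, computed entry by entry on demand."""
    idx = list(approx_idx) + list(patch_idx)
    E = np.array([[dissim(i, j) for j in idx] for i in idx])
    return E, idx

d = lambda i, j: float(abs(i - j))      # toy dissimilarity
E, idx = extended_patch(d, approx_idx=[7], patch_idx=[2, 3])
```

The upper-left block holds the interdissimilarities of the approximation points, the off-diagonal blocks the dissimilarities between approximation points and patch, and the lower-right block the patch itself.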

### 4.4. Patch-Relational Neural Gas.

Based on these data-handling techniques, patch-relational neural gas can be defined as the iterative processing of patches enriched by the *K*-approximation of the prototypes from the previous patch. The prototypes contribute to the new clustering task according to the sizes of their receptive fields; that is, a prototype is counted with multiplicity |*R*_{i}|. Correspondingly, every point in *N*_{t−1} contributes according to the fraction |*R*_{i}|/*K* if it lies in the receptive field of prototype *i*, that is, if *k*_{ij} = 0. Hence, we set the multiplicity *m*_{j} = |*R*_{i}|/*K* for every point *j* that lies in the receptive field of prototype *i*. It is straightforward to extend relational neural gas to deal with multiplicities *m*_{j} in the underlying cost function; the only change concerns the update of the coefficients (see algorithm 5). This algorithm can be used as an inner loop for patch processing for dissimilarity data, as shown in algorithm 6. Thereby, prototypes *W* are always represented by an index vector corresponding to the data points that contribute to the prototype and a coefficient vector that specifies the strength of the contributions. Unlike the coefficients in full relational NG, the coefficient vector of patch NG is sparse and can be represented in constant space. Note that we assume that the relevant parts of the dissimilarity matrix can be computed on demand, such that the full dissimilarity matrix need not be computed or stored at the beginning of the algorithm. This fact constitutes a further advantage of patch processing, since it suffices to compute only a linear part of the dissimilarity matrix. Since, depending on the application scenario, the dissimilarity computation can be quite demanding (e.g., alignment of sequences in bioinformatics, or the normalized compression distance for text processing), this can result in drastic computational savings.
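The only modification relative to plain relational NG is that each point enters the coefficient update weighted by its multiplicity. Assuming the usual exponential neighborhood function h_σ(k) = exp(−k/σ) over the ranks k_ij (the standard NG choice), the update can be sketched as:

```python
import numpy as np

def update_coefficients(ranks, mult, sigma):
    """Coefficient update of relational NG with multiplicities m_j:
    alpha_ij is proportional to m_j * h_sigma(k_ij), and every row is
    normalized to sum to 1 so that each prototype stays a convex
    combination of the data points."""
    h = np.exp(-ranks / sigma) * mult[None, :]
    return h / h.sum(axis=1, keepdims=True)

# ranks[i, j] = rank of prototype i with respect to point j
ranks = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
alpha = update_coefficients(ranks, mult=np.array([1.0, 1.0]), sigma=0.5)
```

Setting all multiplicities to 1 recovers the coefficient update of plain relational NG, which is why algorithm 5 subsumes the original algorithm.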

**Algorithm 5:** Relational NG with Multiplicities

**input**

symmetric dissimilarity matrix

multiplicities

**begin**

init with ;

**repeat**

compute ;

set ;

set

**until** convergence;

**return**;

**end**.

**Algorithm 6:** Patch Relational NG

**begin**

cut the first patch *P*_{1};

apply relational NG to *P*_{1}, yielding prototypes *W*_{1};

compute the *K*-approximation *N*_{1} of *W*_{1};

update multiplicities *m*_{i} of *N*_{1};

set *t* = 2;

**repeat**

cut the next patch *P*_{t};

construct the extended patch using *P*_{t} and *N*_{t−1};

set multiplicities of points in *P*_{t} to *m*_{i} = 1;

apply relational NG with multiplicities, yielding prototypes *W*_{t};

compute the *K*-approximation *N*_{t} of *W*_{t};

update multiplicities *m*_{i} of *N*_{t};

*t* := *t* + 1;

**until** *t* = *n*_{P};

**return** prototypes;

**end**.

After processing, a set of prototypes together with a reasonable *K*-approximation thereof is obtained that compresses the full data set. As before, an inspection of prototypes is easily possible by looking at the points closest to these prototypes.

Note that the algorithm runs in constant space if the size *p* of the patches is chosen independent of the data set size *m*. Similarly, under this assumption, the fraction of the distance matrix that has to be computed for the procedure is of linear size , and the overall time complexity of patch clustering is of size , assuming constant *p*. Hence, a linear time and constant space algorithm for general dissimilarity data results that is suited for large data sets if constant patch size is taken. In the experiments, we will demonstrate an application to a data set of size almost 200,000, which corresponds to a full dissimilarity matrix for which the storage would require almost 251 GB, assuming double precision.
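A back-of-the-envelope computation illustrates the savings; the corpus size below is an illustrative round number close to the one used in the experiments, with double precision (8 bytes per entry) and the small kK-column extension of each patch ignored:

```python
m = 183_000      # illustrative corpus size, close to the experimental one
p = 1_000        # fixed patch size, chosen independently of m

bytes_full = 8 * m * m                  # full dissimilarity matrix
n_p = m // p                            # number of patches
bytes_patch_total = 8 * n_p * p * p     # all p x p patch blocks together

full_gib = bytes_full / 2**30           # roughly 250 GiB
patch_gib = bytes_patch_total / 2**30   # roughly 1.4 GiB, and only one
                                        # 8 MB block resides in memory at a time
```

The total number of dissimilarity entries ever computed, 8·n_p·p² = 8·p·m bytes, grows only linearly in m for fixed patch size.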

This way, a linear time and constant space clustering algorithm is obtained that can deal with general dissimilarity data. Since it relies on only a linear part of the full dissimilarity matrix, the complexity of data preprocessing (i.e., the computation of probably complicated pairwise dissimilarities such as alignment distances) is also greatly reduced. Further, the algorithm provides an explanation of the clustering in terms of prototypes that can be represented by a finite number of representative data points; hence, the result can be directly inspected by human experts.

## 5. Experiments

We demonstrate the behavior of relational NG in a variety of experiments whereby we mainly focus on benchmark data sets that cannot be embedded in Euclidean space. Since convergence of relational NG or the monotonicity of the cost function during training is theoretically not guaranteed in such situations, we first look at the development of the cost function values in a typical setting. Then, we evaluate relational NG in comparison to deterministic annealing on a variety of benchmark data sets, showing the competitiveness of the algorithm. For patch NG, we first demonstrate the robustness of the algorithm with respect to the patch size, the quality of the *K*-approximation, and the order of the presentation of patterns in comparison to full batch clustering. Then we show an application to a huge text corpus, demonstrating the efficiency of the method for large data sets.

### 5.1. Convergence.

For all data sets considered in the experiments, convergence of relational NG was observed. We depict the behavior of relational NG for the cat cortex data set as introduced above, using 5 neurons and 100 epochs. Figure 4 (top) displays the value of the NG cost function and its dual based on the rankings *k*_{ij} and the prototypes, respectively, as computed by NG. Obviously the two cost functions are strictly monotonic and convergent and, apart from the first steps, coincide in the computed values. Similarly, the vector quantization cost function and its dual, computed for the assignments and prototypes as given by relational neural gas, are strictly monotonic and convergent, as can be seen in Figure 4 (bottom). Because the quantization error is computed for prototypes that are suboptimum in terms of *k*-means, since they incorporate neighborhood smoothing, while the dual costs are determined on the assignments only (implicitly assuming optimum prototype positions in terms of *k*-means), the quantization error is worse than the value of the dual cost function in early stages of training, which display a large neighborhood cooperation.

### 5.2. Experiments on Benchmark Data Sets.

In addition to the cat cortex data set, we consider the following benchmark data sets, which were symmetrized prior to training and linearly transformed from similarities to dissimilarities, if necessary:

- **Protein data**: The protein data set as described in Mevissen and Vingron (1996) consists of 226 globin proteins that are compared based on their evolutionary distance. The samples originate from different protein families: hemoglobin-α, hemoglobin-β, myoglobin, and others. Here we distinguish five classes as proposed in Haasdonk and Bahlmann (2004): HA, HB, MY, GG/GP, and others. Unlike the other data sets considered here, the protein data set has a highly unbalanced class structure, with class distribution HA (31.86%), HB (31.86%), MY (17.26%), GG/GP (13.27%), and others (5.75%).
- **Copenhagen chromosomes data**: The Copenhagen chromosomes data set is a benchmark from cytogenetics (Lundsteen, Phillip, & Granum, 1980). Forty-two hundred human chromosomes from 22 classes (the autosomal chromosomes) are represented by the gray levels of their images. These images are transferred to strings based on their thickness. These strings can be compared using the edit distance, which constitutes a typical dissimilarity measure for strings (Juan & Vidal, 2000). The substitution costs are thereby given by the difference of the entries, and insertion and deletion costs are set to 4.5 (Neuhaus & Bunke, 2006).
- **Aural sonar data**: The aural sonar data set as described in Chen et al. (2009) consists of 100 returns from a broadband active sonar system, labeled in two classes, target-of-interest versus clutter. The dissimilarity is scored by two independent human subjects, each providing a dissimilarity score.
- **Caltech data**: The caltech data set consists of 8677 images from 101 object categories that are compared using the pyramid match kernel on SIFT features (see Chen et al., 2009).
- **Face recognition data**: The face recognition data set consists of faces of 139 people. Dissimilarities are computed by means of the cosine similarity between integral invariant signatures based on surface curves of the 3D faces (see Chen et al., 2009).
- **Patrol data**: The patrol data set describes 241 members of seven patrol units and one class corresponding to people not in any unit. Dissimilarities are computed based on every person in the patrol units naming five other persons in their unit, whereby the responses were partially inaccurate. Every mention yields an entry of the dissimilarity matrix (see Chen et al., 2009). Data are sparse in the sense that most entries of the matrix correspond to the maximum dissimilarity, which we set to three.
- **Voting data**: The voting data set describes a two-class classification problem incorporating 435 samples given by 16 categorical features with three different possible values each. The dissimilarity is determined based on the value difference metric (see Chen et al., 2009).

These data sets are non-Euclidean with signature as given in Table 1. Obviously, the protein data and the caltech data set are almost Euclidean, while at least one-third of the eigenvalues are negative for the other data sets.

| Data Set | Signature | Minimum Spread Transform Value |
|---|---|---|
| cat cortex | (41,23,1) | 4.93 |
| chromosomes | (1951,2206,43) | 851.79 |
| protein data | (218,4,4) | 2.6 |
| aural sonar | (54,45,1) | 2.1 |
| caltech | (8664,12,1) | 1.67 |
| face recognition | (311,310,324) | 7.9 |
| patrol | (173,67,1) | 4.33 |
| voting | (105,235,95) | 3.2 |


We performed a repeated cross-validation for all data sets, using 10 repeats. We report the results of relational neural gas (RNG) and supervised relational neural gas (SRNG) for these data sets. For relational neural gas, 100 epochs were used; for the supervised version, the supervision parameter equals . For comparison, we report the results of deterministic annealing (DA); here, 300 epochs were used for training. The number of prototypes used for every data set and the number of folds are reported in Table 2. Results are reported on the training and test set. The runs are evaluated by the classification accuracy obtained by posterior labeling on the training set. In addition, the quantization error and the value of the dual *k*-means cost function are reported. Thereby, relational prototypes allow an out-of-sample extension for both relational neural gas and deterministic annealing, as discussed previously. Since we can interpret deterministic annealing as annealing of clustering in pseudo-Euclidean space, out-of-sample extensions for deterministic annealing can be obtained in the same way.
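The posterior labeling used here to score the clusterings can be sketched as follows: each prototype receives the majority class label of the training points in its receptive field, and accuracy is the fraction of points whose prototype label matches their own label (a minimal sketch, not the evaluation code used in the experiments):

```python
import numpy as np

def posterior_accuracy(assign, labels, n_classes):
    """Each cluster gets the majority label of its assigned training
    points; returns the fraction of points matching their cluster label."""
    k = int(assign.max()) + 1
    cluster_label = np.zeros(k, dtype=int)
    for i in range(k):
        members = labels[assign == i]
        if members.size:
            cluster_label[i] = np.bincount(members, minlength=n_classes).argmax()
    return float((cluster_label[assign] == labels).mean())

assign = np.array([0, 0, 1, 1])   # winning prototype per point
labels = np.array([0, 0, 1, 1])   # class label per point
acc = posterior_accuracy(assign, labels, n_classes=2)
```

For out-of-sample evaluation, the same cluster labels fixed on the training set are applied to the winning prototypes of the test points.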

| Data Set | Number of Neurons | Number of Folds |
|---|---|---|
| cat cortex | 12 | 2 |
| chromosomes | 60 | 2 |
| protein data | 20 | 10 |
| aural sonar | 10 | 2 |
| caltech | 1030 | 2 |
| face recognition | 1390 | 10 |
| patrol | 24 | 10 |
| voting | 20 | 10 |


The results of the runs are reported in Table 3. Interestingly, the value of the dual cost function, the quantization error, and the classification accuracy of RNG is always competitive to the value obtained by DA, although the latter method requires more training time due to an additional inner loop. As expected, the values of SRNG for the unsupervised cost functions are a bit worse, since this method alters the objective to better take label information into account. This corresponds to an improvement of the classification accuracy on the training data for all cases but one.

| | Dual Cost Function (Training) | Dual Cost Function (Test) | Quantization Error (Training) | Quantization Error (Test) | Classification Accuracy (Training) | Classification Accuracy (Test) |
|---|---|---|---|---|---|---|

cat cortex | ||||||

RNG | 11.49 (0.35) | 15.47 (0.56) | 11.49 (0.35) | 24.92 (0.32) | 0.928 (0.028) | 0.69 (0.076) |

SRNG | 11.96 (0.56) | 15.16 (0.39) | 12.02 (0.61) | 24.29 (0.47) | 0.994 (0.008) | 0.724 (0.062) |

DA | 11.12 (2.58) | 14.93 (2.07) | 11.12 (2.58) | 25.85 (0.64) | 0.890 (0.105) | 0.803 (0.083) |

chromosomes | ||||||

RNG | 22479.36 (54.42) | 22864.64 (45.32) | 22479.36 (54.42) | 24275.99 (39.63) | 0.893 (0.004) | 0.897 (0.004) |

SRNG | 22470.29 (65.97) | 22848.01 (35.06) | 22475.19 (65.79) | 24241.31 (30.11) | 0.912 (0.003) | 0.907 (0.004) |

DA | 22608.49 (43.41) | 23090.59 (51.96) | 22608.49 (43.41) | 24374.89 (38.22) | 0.910 (0.004) | 0.908 (0.003) |

protein data | ||||||

RNG | 293.65 (0.73) | 17.17 (0.54) | 293.65 (0.73) | 40.45 (0.93) | 0.942 (0.002) | 0.919 (0.016) |

SRNG | 291.28 (1.13) | 17.10 (0.57) | 291.99 (1.22) | 40.08 (1.01) | 0.980 (0.005) | 0.944 (0.013) |

DA | 282.44 (0.60) | 15.35 (0.82) | 282.44 (0.60) | 39.31 (0.83) | 0.931 (0.003) | 0.907 (0.008) |

aural sonar | ||||||

RNG | 4.17 (0.07) | 5.04 (0.10) | 4.17 (0.07) | 6.90 (0.07) | 0.892 (0.017) | 0.834 (0.014) |

SRNG | 4.33 (0.09) | 5.12 (0.11) | 4.36 (0.10) | 6.92 (0.14) | 0.993 (0.006) | 0.870 (0.032) |

DA | 3.99 (0.07) | 4.99 (0.12) | 3.99 (0.07) | 6.99 (0.13) | 0.897 (0.013) | 0.856 (0.026) |

caltech | ||||||

RNG | 2435.72 (1.58) | 2467.04 (2.52) | 2435.72 (1.58) | 2647.70 (2.34) | 0.451 (0.002) | 0.407 (0.006) |

SRNG | 2466.91 (1.85) | 2508.07 (3.78) | 2489.32 (2.10) | 2640.61 (2.04) | 0.791 (0.009) | 0.364 (0.025) |

DA | 2440.09 (1.88) | 2471.81 (2.35) | 2440.09 (1.88) | 2659.73 (1.81) | 0.452 (0.003) | 0.401 (0.004) |

face recognition | ||||||

RNG | 0.071 (0.003) | 0.005 (0.001) | 0.071 (0.003) | 0.013 (0.001) | 0.887 (0.003) | 0.865 (0.002) |

SRNG | 0.250 (0.013) | 0.021 (0.002) | 0.472 (0.020) | 0.058 (0.002) | 0.830 (0.005) | 0.811 (0.007) |

DA | 0.116 (0.007) | 0.008 (0.001) | 0.116 (0.007) | 0.017 (0.001) | 0.900 (0.002) | 0.873 (0.004) |

patrol | ||||||

RNG | 120.77 (0.19) | 7.88 (0.29) | 120.77 (0.19) | 17.34 (0.25) | 0.835 (0.005) | 0.665 (0.024) |

SRNG | 123.76 (0.19) | 7.93 (0.43) | 124.26 (0.22) | 16.99 (0.14) | 0.989 (0.001) | 0.657 (0.022) |

DA | 127.34 (5.39) | 10.39 (0.93) | 127.34 (5.39) | 17.13 (0.19) | 0.713 (0.077) | 0.521 (0.051) |

voting | ||||||

RNG | 8.86 (0.04) | 0.615 (0.024) | 8.86 (0.04) | 1.18 (0.02) | 0.953 (0.001) | 0.950 (0.004) |

SRNG | 8.93 (0.03) | 0.651 (0.033) | 9.08 (0.02) | 1.18 (0.03) | 0.972 (0.001) | 0.953 (0.004) |

DA | 8.84 (0.06) | 0.626 (0.021) | 8.84 (0.06) | 1.15 (0.03) | 0.955 (0.001) | 0.951 (0.005) |


Note: The standard deviation is given in parentheses.

Since supervised label information is available for these data sets, it is possible to compare the results to the behavior of supervised training algorithms such as SVM or *k*-nearest neighbor. The last six data sets have recently been considered in Chen et al. (2009) as benchmarks, where different supervised techniques for dissimilarity data, including SVM with various preprocessings and kernels as well as *k*-nearest neighbor, have been compared. Interestingly, errors of 13.00% to 17.75% are reported for the aural data set using the different methods, and errors of 4.89% to 5.8% are reported for the voting data set, placing SRNG among the methods with the best classification accuracy in both cases. For the other data sets, errors of 1.86% to 30.35% (protein data), 29.9% to 41.99% (caltech), 3.92% to 4.55% (face recognition), and 11.56% to 42.19% (patrol) are reported. Thus, SRNG is competitive for the protein data set too. For caltech and patrol, it achieves considerable accuracy on the training set but does not generalize properly to the test set, while clustering seems unsuited to the supervised classification task specified in the face recognition data. Nevertheless, with respect to the aim of clustering as measured by the quantization error and the dual cost function, the methods give reasonable results in all cases, clearly demonstrating the competitiveness of RNG methods with deterministic annealing schemes.

### 5.3. Demonstration of Patch Clustering.

Patch clustering extends RNG toward huge data sets at the cost of decreased accuracy due to the compression of data from early patches in the form of prototypes and the approximation of relational prototypes by only the most important coefficients. Thereby, both the patch size and the number of coefficients used for the approximation are parameters that balance the accuracy of the results (which is high for large patch size) against the required space and computation time (which are small for small patch size and small *K*). The effect of these choices is reported in Figure 5. The Copenhagen chromosomes data were trained with patch RNG and 60 neurons for 100 epochs. Thereby, the number of patches was varied from 1 to 10, and the *K* used for the *K*-approximation was varied from 1 to 5. The reported result is the classification accuracy obtained on a test set for a repeated 10-fold cross-validation with 10 repetitions. Using 10 patches (corresponding to a speed-up of 10 of the computation) and a *K*-approximation with *K* = 2 leads to a reduction of the hit rate of about 3.3%, and using a *K*-approximation with *K* = 1 leads to a reduction of the hit rate of about 4.5%. Hence, patch approximation leads to only a slight decrease of the quality of the found solution.

For huge data sets, data can often be accessed only sequentially, such that the data are not i.i.d. with respect to the underlying distribution. It is interesting to investigate whether this fact affects the accuracy of patch clustering. As already demonstrated in Alex et al. (2009) for Euclidean settings, this is not the case, since prototypes accurately represent the data seen so far and follow trends if necessary. In Alex et al., a data set with a strong trend was created and presented to patch clustering following the trend. The result of the overall clustering achieved after iterative patch processing was virtually indistinguishable from the result of NG applied to all data at once, or of patch NG with i.i.d. data, due to the ability of patch NG to use all previous information in terms of the sufficient statistics.

For the non-Euclidean setting, it is not as obvious how to create data with a strong trend to test the ability of patch clustering to deal with such settings. Therefore, we rely on auxiliary information as given by the data labels. We compare patch clustering with randomly presented samples to patch clustering where samples are sorted according to the label information, such that highly unbalanced data sets are presented in early patches. The results for the chromosomes data set are reported in Table 4. A twofold cross-validation was repeated 10 times, using 10 patches (corresponding to about 420 data points per patch), 60 neurons, 100 epochs per patch, and an approximation of relational prototypes by three coefficients. For comparison, the results of full batch RNG using the same splits for the repeated cross-validation are reported. Obviously, the differences are negligible; that is, the order of presentation does not severely influence the output in this case.
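The robustness to ordering rests on the sufficient statistics carried between patches: the prototypes of the previous patch enter the next patch as additional points, weighted by the number of data they represent. The following is a minimal Euclidean sketch of this scheme, using a plain weighted k-means step as a stand-in for the relational clustering of each patch (all names are illustrative, not the authors' implementation):

```python
import numpy as np

def weighted_kmeans(X, w, k, iters=50, seed=0):
    """Weighted k-means, standing in for the clustering step on a patch."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign every point to its closest center
        assign = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            mask = assign == j
            if mask.any():
                centers[j] = np.average(X[mask], axis=0, weights=w[mask])
    return centers, assign

def patch_clustering(patches, k):
    """Process patches sequentially; prototypes of the previous patch are
    added to the current one, weighted by the points they represent."""
    centers = np.empty((0, patches[0].shape[1]))
    counts = np.empty(0)
    for patch in patches:
        X = np.vstack([patch, centers])                    # extended patch
        w = np.concatenate([np.ones(len(patch)), counts])  # multiplicities
        centers, assign = weighted_kmeans(X, w, k)
        counts = np.array([w[assign == j].sum() for j in range(k)])
    return centers
```

Because only the k prototypes and their multiplicities are passed on, memory stays constant while every patch still reflects all data seen so far.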

Table 4: Dual cost function, quantization error, and classification accuracy of patch RNG and full RNG on the chromosomes data set.

|  | Dual Cost Function |  | Quantization Error |  | Classification Accuracy |  |
| --- | --- | --- | --- | --- | --- | --- |
|  | Training | Test | Training | Test | Training | Test |
| *Random order* |  |  |  |  |  |  |
| Patch RNG | 23,589.6 (69.0) | 23,898.9 (85.6) | 30,157.0 (83.4) | 31,623.0 (133.9) | 0.873 (0.006) | 0.868 (0.007) |
| RNG | 22,307.3 (51.5) | 22,777.7 (85.0) | 22,309.8 (51.0) | 24,255.2 (95.3) | 0.898 (0.005) | 0.900 (0.005) |
| *Ordered according to the labeling* |  |  |  |  |  |  |
| Patch RNG | 23,495.3 (80.5) | 23,831.1 (89.3) | 30,016.6 (131.8) | 31,534.5 (124.3) | 0.875 (0.007) | 0.870 (0.009) |
| RNG | 22,326.9 (39.6) | 22,757.2 (35.6) | 22,329.5 (39.9) | 24,250.5 (31.6) | 0.900 (0.007) | 0.901 (0.007) |


### 5.4. Processing Large Text Data.

The dissimilarity of two documents is computed by means of the normalized compression distance

$$d(x, y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}},$$

where *x* and *y* are the document strings, *xy* is their concatenation, and *C*(*x*) denotes the size of a compression of the string *x*. For our experiments, the bzip2 compression method was used.

| Name | Count | Percentage |
| --- | --- | --- |
| alt.atheism | 72,068 | 39.26 |
| comp.os.ms-windows.misc | 178 | 0.10 |
| comp.windows.x | 182 | 0.10 |
| rec.motorcycles | 15,841 | 8.63 |
| rec.sport.baseball | 462 | 0.25 |
| rec.sport.hockey | 510 | 0.28 |
| sci.crypt | 3,203 | 1.75 |
| sci.med | 2,912 | 1.59 |
| soc.religion.christian | 341 | 0.19 |
| talk.politics.guns | 23,209 | 12.64 |
| talk.politics.mideast | 9,337 | 5.09 |
| talk.politics.misc | 54,683 | 29.79 |
| talk.religion.misc | 620 | 0.34 |


We would like to demonstrate the ability of our method to project large data sets into the Euclidean plane in reasonable time. Therefore, we report the result of a supervised relational hyperbolic SOM (HSOM), which offers particularly powerful visualization facilities for possibly complex data sets due to the flexibility of hyperbolic space (Ontrup & Ritter, 2001). Obviously, batch processing for HSOM can be transferred to a relational version in the same way as batch NG. The goal is to map the full data set to the hyperbolic lattice structure by means of supervised HSOM. Patch processing was applied using supervised relational HSOM with 85 neurons and a 3-approximation of the prototypes. Since the full dissimilarity matrix would occupy approximately 250 GB of space, it is not an option to process the data at once or to precalculate the full dissimilarity matrix. Instead, 183 patches of around 1,000 texts each were taken, and the distances within these patches were precalculated, requiring only around 12 MB of space. In addition, since patch processing requires the dissimilarities of the extended patches, additional dissimilarities had to be computed on demand for every patch. In this way, the whole computation could be performed on a common desktop computer in reasonable time.

The outcome is depicted in Figure 6, where the hyperbolic lattice structure is mapped onto the Euclidean plane for visualization and data inspection. For interpretability, neurons were labeled a posteriori according to a majority vote. Clearly, the data arrange on the lattice according to their semantic meaning as mirrored by the corresponding newsgroup, such that visualization and browsing become possible.

## 6. Discussion

The focus of this article was on clustering algorithms for general dissimilarity data that can be derived from the standard vector quantization cost function and extensions thereof. For this purpose, we introduced relational clustering as an equivalent formulation of batch clustering for neural gas or self-organizing maps if Euclidean data are given in terms of pairwise dissimilarities only. While convergence can be guaranteed only in Euclidean settings, the algorithms are applicable to every symmetric dissimilarity measure. Further, a variety of properties remains valid, such as easy out-of-sample extensions and the interpretability of fixed points of the algorithms in terms of the dual cost function. In this way, we arrived at relational neural gas and self-organizing maps for arbitrary dissimilarity data. We have seen that the algorithms can be interpreted as clustering in pseudo-Euclidean space, whereby a reasonable heuristic is employed to arrive at good prototype locations. The algorithms yield results competitive with alternatives such as deterministic annealing, and they often behave better than theoretical alternatives with guaranteed convergence, such as the spread transformation of the data. Thus, relational neural gas and its variants can serve as simple and efficient prototype-based models for the clustering and topographic mapping of general dissimilarity matrices, with wide applicability to domains involving non-Euclidean data such as text processing, biological sequence analysis, and image processing.

One major problem of clustering algorithms for dissimilarity data lies in the fact that the dissimilarity matrix scales quadratically with the number of data points. This leads to at least quadratic time and space complexity, which becomes infeasible already for medium-sized problems. Therefore, an approximation scheme was introduced that processes data in patches in linear time and constant space, provided the patch size is fixed depending on the available resources. We demonstrated that approximate patch clustering leads to only a minor decrease in accuracy and that it is robust with respect to data ordering. In particular, it can deal with settings where data are not identically distributed, and it leads to results competitive with those obtained when all data are available prior to training, as demonstrated in our experiments. In this way, a very simple and fast yet powerful and flexible linear-time, constant-space clustering scheme for large dissimilarity data results. This patch scheme was proposed based on relational neural gas and relational self-organizing maps; however, alternative prototype-based schemes such as affinity propagation or median clustering could easily be extended accordingly. We demonstrated the applicability of the approach on a large text corpus. The evaluation and application in further large-scale scenarios are the subject of ongoing research.

## Notes

^{1} Actually, we construct an example that shows that deterministic annealing for pairwise data clustering can also show oscillatory behavior for sequential updates if run with negative dissimilarities.

## References

*UCI Machine Learning Repository*. Irvine: University of California, School of Information and Computer Science.