Abstract

A variety of modifications have been employed to learning vector quantization (LVQ) algorithms using either crisp or soft windows for selection of data. Although these schemes have been shown in practice to improve performance, theoretical understanding of the influence of windows has so far been limited. Here we rigorously analyze the influence of windows in a controlled environment of gaussian mixtures in high dimensions. Concepts from statistical physics and the theory of online learning allow an exact description of the training dynamics, yielding typical learning curves, convergence properties, and achievable generalization abilities. We compare the performance and demonstrate the advantages of various algorithms, including LVQ 2.1, generalized LVQ (GLVQ), Learning from Mistakes (LFM), and Robust Soft LVQ (RSLVQ). We find that the selection of the window parameter strongly influences the learning curves but, surprisingly, not the asymptotic performances of LVQ 2.1 and RSLVQ. Although the prototypes of LVQ 2.1 exhibit divergent behavior, the resulting decision boundary coincides with the optimal decision boundary, thus yielding optimal generalization ability.

1.  Introduction

Learning vector quantization (LVQ) constitutes a family of learning algorithms for nearest prototype classification of potentially high-dimensional data (Kohonen, 2001). The intuitive approach and computational efficiency of LVQ classifiers have motivated their application in various disciplines (see, e.g., Neural Networks Research Centre, 2002). Prototypes in LVQ algorithms represent typical features within a data set in the same feature space as the data, instead of the black box approach practiced in many other classifiers (e.g., feedforward neural networks or support vector machines). This makes them attractive to researchers outside the field of machine learning. Other advantages of LVQ algorithms are that they are easy to implement for multiclass classification problems and that the algorithm complexity can be adjusted during training as required.

Numerous variants of the original LVQ prescriptions have been proposed for achieving better performance, such as LVQ 2.1 (Kohonen, 1990, 2001), LVQ 3 (Kohonen, 1990, 2001), generalized LVQ (GLVQ) (Hammer & Villmann, 2002; Sato & Yamada, 1995), and Robust Soft LVQ (RSLVQ) (Seo & Obermayer, 2003). Common themes of these modifications are an additional parameter that controls the selection of the data to which the system is adapted and a variation of the magnitude of the prototype updates. We refer to these in general as window schemes. In the limiting case of hard or crisp learning schemes, updates are restricted to examples that fall into this window. For instance, LVQ 2.1 allows updates as long as the example is in the vicinity of the current decision boundary. Alternatively, learning schemes can implement a soft window (e.g., RSLVQ and GLVQ), which considers all examples but adapts the magnitude of the update according to their relative distances to the current decision boundary.

In general, the learning behavior of these strategies is not well understood. It is unclear how the convergence, stability, and achievable generalization ability compare across the different strategies. Fortunately, methods from statistical physics and the theory of online learning have recently allowed a systematic investigation of very large systems in the so-called thermodynamic limit. This approach has been successfully applied to, among others, feedforward neural networks, perceptron training, and principal component analysis (Biehl & Caticha, 2003; Engel & van den Broeck, 2001; Saad, 1999). A similar approach to LVQ-type algorithms, such as LVQ 1, unsupervised VQ, and rank-based neural gas, was taken in Biehl, Ghosh, and Hammer (2007) and Witoelar, Biehl, Ghosh, and Hammer (2008).

In this work, we closely examine the influence of window schemes for LVQ algorithms. Typical learning behavior is studied within a model situation of high-dimensional gaussian clusters and competing prototypes. From this analysis, we can observe typical learning curves and the convergence properties, that is, the asymptotic behavior in the limit of an arbitrarily large number of examples.

Typically the window parameters are selected either heuristically or derived from prior knowledge of the data and kept fixed during training. The optimal parameter settings are chosen according to a computationally expensive validation procedure. It is also possible to treat the hyperparameters as dynamic properties during learning by means of an annealing schedule (Seo & Obermayer, 2006) or a gradient-based optimization method (Bengio, 2000), for example. Using the model described in this letter, one can investigate the optimality of the parameters for both fixed and dynamic settings in representative model situations.

2.  Model

Throughout this letter, we study LVQ algorithms in a model situation: high-dimensional data are generated from a mixture of M gaussian clusters and presented to a system of two or three prototypes. We restrict ourselves to the analysis of isotropic and homogeneous clusters; each cluster σ generates only data with one of the class labels cσ ∈ {1, …, Nc}, where Nc is the number of classes. Examples ξ ∈ ℝ^N are drawn independently according to the probability density function,
P(ξ) = Σ_{σ=1..M} pσ P(ξ|σ),   P(ξ|σ) = (2π vσ)^(−N/2) exp[−(ξ − λBσ)² / (2vσ)],   (2.1)
where pσ are the cluster-wise prior probabilities with Σσ pσ = 1. The components of vectors ξ from cluster σ are random numbers with mean vector λBσ and variance vσ. The unit vectors Bσ determine the orientations of the cluster centers. Similar densities have been studied in Barkai, Seung, and Sompolinsky (1993), Biehl (1994), Biehl et al. (2007), and Meir (1995).

In this framework, we formally exploit the thermodynamic limit N → ∞, corresponding to very high-dimensional data. This has simplifying consequences that will be present throughout the letter. Note that on random low-dimensional projections, data from different clusters completely overlap and are not separable. The clusters become apparent at most only in the M-dimensional subspace spanned by the vectors Bσ. The nontrivial goal is to identify this subspace from the N-dimensional data.

We bring attention to the scaling of the model. The anisotropy of this data distribution is very weak: while the mean of cluster σ, given by λBσ, is a vector of length λ = O(1), the average squared length of the data vectors is of order ⟨ξ²⟩ = O(N).
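For concreteness, the following minimal Python sketch draws examples from the density of equation 2.1; the function name sample_mixture and the chosen priors, variances, and value of λ (lam) are illustrative assumptions, not part of the letter.

    import numpy as np

    def sample_mixture(n_samples, N, centers, priors, variances, lam=1.0, rng=None):
        # centers: (M, N) array of unit vectors B_sigma.
        # Each example from cluster sigma has mean lam * B_sigma and
        # isotropic variance v_sigma in every one of the N dimensions.
        rng = np.random.default_rng(rng)
        variances = np.asarray(variances, dtype=float)
        sigma = rng.choice(len(priors), size=n_samples, p=priors)  # cluster labels
        noise = rng.standard_normal((n_samples, N))
        xi = lam * centers[sigma] + np.sqrt(variances[sigma])[:, None] * noise
        return xi, sigma

    # Two clusters in N = 100 dimensions: the cluster means have length
    # lam = O(1), while the mean squared length of xi is of order N, so
    # the anisotropy is weak, as discussed above.
    N = 100
    B = np.eye(N)[:2]                                  # orthonormal B_+, B_-
    xi, sigma = sample_mixture(1000, N, B, priors=[0.7, 0.3], variances=[1.0, 1.0])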

Obviously this model is greatly simplified from practical situations. However, it represents an ideal scenario to analyze the learning algorithms considered here using gaussian mixture models, a common technique in many practical scenarios. We can expect that algorithms that do not perform well on this idealized model will also be inappropriate for real-life problems. Although more complex behaviors are expected in practical applications, the nontrivial effects already observed in this model will clearly influence the outcome under more general circumstances.

3.  Algorithms

We review LVQ algorithms and their corresponding window schemes. For the two-class model defined in section 2, we define an LVQ system as a set of K prototypes wS ∈ ℝ^N with class labels cS, S = 1, …, K. Classification is implemented through a nearest prototype scheme: a novel example is assigned to the class of the closest prototype according to a dissimilarity measure. Here we restrict the measure to the squared Euclidean distance dS = (ξ − wS)² for a given novel example ξ. In the online algorithm, examples are presented sequentially to the system, and the prototypes are adapted by the following update step,
wS^μ = wS^(μ−1) + (η/N) fS (ξ^μ − wS^(μ−1)),   (3.1)
where wS^μ denotes the prototype after presentation of μ examples and the learning rate η is rescaled with N. We use the shorthand fS for the modulation function that controls, along with the learning rate η, the magnitude of the update of wS toward or away from the current example. In this work, we investigate several LVQ prescriptions that include window schemes.
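A minimal sketch of this generic update loop follows; the callback interface (a modulation function receiving the distances, the example label, the prototype labels, and the example itself) is our own choice, and concrete callbacks are sketched in the subsections below.

    import numpy as np

    def online_lvq(examples, labels, w, proto_labels, eta, modulation):
        # w: (K, N) array of prototypes; proto_labels: their classes c_S.
        # modulation(...) returns the vector of modulation factors f_S.
        N = w.shape[1]
        for xi, y in zip(examples, labels):
            d = np.sum((xi - w) ** 2, axis=1)        # squared Euclidean distances
            f = modulation(d, y, proto_labels, xi, w)
            w += (eta / N) * f[:, None] * (xi - w)   # learning rate rescaled by N
        return w

Window or softness parameters of the individual prescriptions can be bound into the callback, for example, with functools.partial.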

3.1.  LVQ 2.1.

LVQ 2.1, proposed by Kohonen, aims at efficient separation between prototypes of different classes and has been shown to provide good classification results (Kohonen, 1990; Neural Networks Research Centre, 2002). Given an example ξ with class label σ, the two nearest prototypes wS and wT are updated if the following conditions are met: (1) the classes cS and cT are different, and (2) either cS or cT is equal to σ. The prototype with the correct class is moved toward the data, while the other is moved farther away: fS = +1, fT = −1 if cS = σ; fS = −1, fT = +1 otherwise.

It is well known that the learning rule has stability problems for unbalanced data sets, resulting in diverging prototypes with deteriorating performance (Kohonen, 1990). Therefore, LVQ 2.1 restricts updates to examples ξ that fall into a window around the decision boundary,
min(dS/dT, dT/dS) > s,   s = (1 − w)/(1 + w),   (3.2)
where w is a window parameter, and therefore 0 < s < 1. However, this window is ineffective for very high-dimensional data: in dS = ξ² − 2(ξ · wS) + wS², the O(N) term ξ² dominates the other O(1) terms, so the ratio of any two distances approaches unity. Consequently, this window definition does not work in very high dimensions, evidenced by
lim_{N→∞} min(dS/dT, dT/dS) = 1,   (3.3)
which implies that every example falls into the window. Therefore, in the following, we implement the constraint
|dS − dT| < k ξ²,   (3.4)
where k is a small, positive number. Note that the term ξ² cancels out on the left-hand side, while it dominates on the right-hand side for N → ∞. Thus, the right-hand side becomes k vσ N, and the condition is nontrivial only if k = O(1/N). We introduce the rescaled window parameter δ = kN so that the window scheme is |dS − dT| < δ ξ²/N; δ is positive. We describe these rules through the following modulation function,
fS = Σ_{T: cT ≠ cS} ησ^S Θ(δ ξ²/N − |dS − dT|) Π_{U ≠ S,T} Θ(dU − dS) Θ(dU − dT),   (3.5)
with ησ^S = +1 if cS = σ and ησ^S = −1 otherwise. Here Θ denotes the Heaviside function, Θ(x) = 1 for x > 0 and Θ(x) = 0 otherwise.

We sum over prototypes wT of the competing class, and the Θ term enforces the window condition. The product term singles out instances where wS and wT are the two closest prototypes. This form of fS allows for the analysis given in section 4.
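A sketch of this rule for use with the generic loop of section 3 follows; the δ ξ²/N threshold is our reading of equation 3.4, and the restriction to the two overall closest prototypes replaces the product term above.

    import numpy as np

    def lvq21_window_modulation(d, y, proto_labels, xi, w, delta=1.0):
        # Update the two closest prototypes if they carry different labels,
        # one of them matches the example label y, and the example lies in
        # the window |d_S - d_T| < delta * |xi|^2 / N of equation 3.4.
        f = np.zeros(len(d))
        S, T = np.argsort(d)[:2]                     # two closest prototypes
        labels_ok = (proto_labels[S] != proto_labels[T]
                     and y in (proto_labels[S], proto_labels[T]))
        in_window = abs(d[S] - d[T]) < delta * np.dot(xi, xi) / xi.shape[0]
        if labels_ok and in_window:
            f[S] = +1.0 if proto_labels[S] == y else -1.0
            f[T] = -f[S]
        return f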

3.2.  LFM-W.

A simple modification to overcome the stability problems of LVQ 2.1 is to restrict updates to misclassified examples. Analogous to perceptron learning, we term this update rule Learning from Mistakes (LFM). Here, the closest prototype wJ with the same class (correct winner) and the closest prototype wK with a different class (incorrect winner) are updated with fJ = +1 and fK = −1 if the example is misclassified. If the winning prototype is already correct, the configuration is left unchanged. This prescription can be interpreted as a limiting case of the cost-function-based Robust Soft LVQ (RSLVQ), which will be explained later in this section. Because the cost function of RSLVQ is bounded from below, stability can also be expected in LFM.

The performance of LFM can be improved by including data selection using the window rule in equation 3.4. We refer to this algorithm as LFM with a window (LFM-W), represented by the modulation function
fJ = +Θ(dJ − dK) Θ(δ ξ²/N − (dJ − dK)),   fK = −fJ,   (3.6)
with fS = 0 for all other prototypes, where wJ is the closest prototype with cJ = σ (the correct winner) and wK is the closest prototype with cK ≠ σ (the incorrect winner). The Θ terms single out misclassified examples (dJ > dK) that fall into the window.
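A corresponding sketch, sharing the interface of the LVQ 2.1 callback above (the window normalization is again our reading of equation 3.4):

    import numpy as np

    def lfm_w_modulation(d, y, proto_labels, xi, w, delta=1.0):
        # Same pair update as LVQ 2.1, but only for misclassified examples
        # (d_J > d_K) that also fall into the window.
        correct = [i for i, c in enumerate(proto_labels) if c == y]
        wrong = [i for i, c in enumerate(proto_labels) if c != y]
        J = min(correct, key=lambda i: d[i])         # correct winner w_J
        K = min(wrong, key=lambda i: d[i])           # incorrect winner w_K
        f = np.zeros(len(d))
        misclassified = d[J] > d[K]
        in_window = (d[J] - d[K]) < delta * np.dot(xi, xi) / xi.shape[0]
        if misclassified and in_window:
            f[J], f[K] = +1.0, -1.0
        return f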

3.3.  GLVQ.

Earlier LVQ prescriptions, including LVQ 2.1, were based on heuristic grounds. In contrast, a popular variant, the generalized LVQ, was proposed in Sato and Yamada (1995), which introduced the cost function
E = Σ_μ Φ(ν(ξ^μ)),   ν(ξ) = (dJ − dK) / (C (dJ + dK)),   (3.7)
where Φ is a (usually nonlinear) monotonically increasing function, wJ is the nearest correct prototype, and wK is the nearest incorrect prototype to the example ξ. We insert the scaling parameter C, which will be required for high dimensions. A stochastic gradient procedure on equation 3.7 yields the learning rule
wJ ← wJ + (η/N) Φ′(ν) [4 dK / (C (dJ + dK)²)] (ξ − wJ),
wK ← wK − (η/N) Φ′(ν) [4 dJ / (C (dJ + dK)²)] (ξ − wK).   (3.8)
Here the usefulness of selecting a nonlinear Φ becomes apparent. For instance, in Hammer and Villmann (2002) and Sato and Yamada (1995), the sigmoid function is chosen: Φ(x) = 1/(1 + e^(−x)). The form of the derivative Φ′, which has a single peak at ν = 0, can be interpreted as a soft window around the decision boundary.
In the high-dimensional limit, we notice that dJ + dK is dominated by the O(N) terms and effectively becomes a constant O(N) term: dJ + dK ≈ 2ξ² ≈ 2 vσ N. Therefore, the denominator term in equation 3.7 becomes constant:
lim_{N→∞} (dJ + dK)/N = 2 vσ.   (3.9)
To obtain a nonzero argument, C must also be of order O(1/N), and we rescale using C = vG/N, so that ν ≈ (dJ − dK)/(2 vσ vG). The parameter vG determines the softness of the window, provided that an appropriate nonlinear Φ is chosen. Note that GLVQ can be simplified to LVQ 2.1 without a window using the identity function Φ(x) = x. The cost function in equation 3.7 then becomes E = Σ_μ (dJ − dK)/(2 vσ vG), where vG could be set to 1 without changing the learning behavior. The modulation function is then reduced to fJ = +1, fK = −1.
In this letter, we choose the cumulative normal distribution
Φ(x) = ∫_{−∞}^{x} dz e^(−z²/2) / √(2π),   (3.10)
where Φ′(x) = e^(−x²/2)/√(2π). Note that this form implements a gaussian window similar to the sigmoidal cost described in Hammer and Villmann (2002) and Sato and Yamada (1995) and therefore produces qualitatively similar behavior (see Figure 1 for the comparison).
Figure 1:

(Left) The form of the Φ chosen in GLVQ, in comparison to the sigmoidal function. The derivatives Φ′ produce a soft window. (Middle and right) The RSLVQ modulation function fS for a class 1 prototype when presented with data from class 1. The figures display the difference between smaller vsoft (middle) and larger vsoft (right).

Plugging in the form of equation 3.9, we obtain the learning rules
wJ ← wJ + (η/N) [Φ′(ν)/(vG vσ)] (ξ − wJ),   wK ← wK − (η/N) [Φ′(ν)/(vG vσ)] (ξ − wK).   (3.11)
We can write the modulation function as
fJ = +Φ′(ν)/(vG vσ),   fK = −Φ′(ν)/(vG vσ),   fS = 0 otherwise,   (3.12)
with ν = (dJ − dK)/(2 vσ vG).
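A sketch of this soft-window modulation follows. It uses ν = N(dJ − dK)/(vG(dJ + dK)), the argument of equation 3.7 with C = vG/N, and estimates the cluster variance vσ by ξ²/N, which holds in the high-dimensional limit; both choices are our reading of the rescaling above.

    import numpy as np

    def glvq_modulation(d, y, proto_labels, xi, w, v_G=2.0):
        # Phi'(nu) is a standard gaussian acting as a soft window around
        # the decision boundary nu = 0; the 1/v_G prefactor ties the
        # softness to the overall learning rate (see section 5.3).
        N = xi.shape[0]
        correct = [i for i, c in enumerate(proto_labels) if c == y]
        wrong = [i for i, c in enumerate(proto_labels) if c != y]
        J = min(correct, key=lambda i: d[i])         # nearest correct prototype
        K = min(wrong, key=lambda i: d[i])           # nearest incorrect prototype
        nu = N * (d[J] - d[K]) / (v_G * (d[J] + d[K]))
        v_est = np.dot(xi, xi) / N                   # xi^2 / N -> v_sigma
        g = np.exp(-nu ** 2 / 2.0) / np.sqrt(2.0 * np.pi)   # Phi'(nu)
        f = np.zeros(len(d))
        f[J], f[K] = +g / (v_G * v_est), -g / (v_G * v_est)
        return f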

3.4.  RSLVQ.

The robust soft LVQ algorithm (Seo & Obermayer, 2003) was derived from a statistical model of the data and designed to overcome the stability problem of LVQ 2.1. RSLVQ introduces soft prototype assignments that act similarly to a soft window around the decision boundary. This algorithm minimizes a bounded cost function E = −ln(L), where L is based on a likelihood ratio function of a mixture model, described as
L = Π_μ [P(ξ^μ, σ^μ | W) / P(ξ^μ | W)],   P(ξ, σ|W) = Σ_{S: cS=σ} PS P(ξ|S),   P(ξ|W) = ΣS PS P(ξ|S),   (3.13)
where P(ξ|W) approximates the actual probability density (see equation 2.1). It is assumed that every component wS of the mixture generates examples that belong to one class cS. Nc is the number of classes, PS is the probability that examples are generated by the particular component wS, and P(ξ|S) is the conditional probability that wS generates the particular example ξ.
The learning rule is obtained by performing stochastic gradient descent on the cost function E with respect to wS. We examine it for a gaussian mixture ansatz as in Seo and Obermayer (2003), where P(ξ|S) = (2π vS)^(−N/2) exp[−(ξ − wS)²/(2vS)] is chosen. Furthermore, every component is assumed to have equal probability PS = 1/K and equal variance vS = vsoft for all S, where vsoft is called the softness hyperparameter. This gives the following modulation function,
fS = (1/vsoft) [δσcS Pσ(S|ξ) − P(S|ξ)],
with the assignment probabilities,
Pσ(S|ξ) = δσcS exp(−dS/(2vsoft)) / Σ_{T: cT=σ} exp(−dT/(2vsoft)),   P(S|ξ) = exp(−dS/(2vsoft)) / ΣT exp(−dT/(2vsoft)),   (3.14)
(see Seo & Obermayer, 2003, for the derivations). Pσ(S|ξ) describes the posterior probability that ξ is assigned to component S of the mixture, given that the example is generated by the correct class σ. P(S|ξ) describes the posterior probability that ξ is assigned to component S of the complete mixture using all classes. As vsoft becomes smaller, the updates become smaller for correctly classified examples and larger for incorrectly classified examples (see Figure 1).
Note that the limiting case vsoft → 0 is particularly simple. The assignments of equation 3.14 become hard assignments:
Pσ(S|ξ) → 1 if dS = min_{T: cT=σ} dT, 0 otherwise;   P(S|ξ) → 1 if dS = minT dT, 0 otherwise.   (3.15)
Plugging these hard assignments into the modulation function above, we obtain the learning rule of learning from mistakes (LFM), described in section 3.2.
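A sketch of the RSLVQ modulation under the gaussian ansatz of this section (equal priors and equal widths vsoft; the 1/vsoft prefactor is the overall learning rate factor discussed in section 5.4):

    import numpy as np

    def rslvq_modulation(d, y, proto_labels, xi, w, v_soft=1.0):
        # f_S = [delta_{y,c_S} P_y(S|xi) - P(S|xi)] / v_soft, with soft
        # assignments given by normalized gaussians of equal width v_soft.
        proto_labels = np.asarray(proto_labels)
        log_g = -d / (2.0 * v_soft)                  # log gaussian activations
        g = np.exp(log_g - log_g.max())              # numerically stabilized
        P_all = g / g.sum()                          # P(S|xi), complete mixture
        same = proto_labels == y
        P_y = np.where(same, g, 0.0)
        P_y = P_y / P_y.sum()                        # P_y(S|xi), correct class
        return (P_y - P_all) / v_soft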

4.  Analysis

In this section, we describe the methods to analyze the learning dynamics in LVQ algorithms. Following the lines of the theory of online learning (see, e.g., Biehl & Mietzner, 1993; Biehl & Schwarze, 1993; Engel & van den Broeck, 2001; Saad, 1999), the system can be fully described in terms of a few characteristic quantities, so-called order parameters, in the thermodynamic limit N → ∞. A suitable set of order parameters for the considered learning model is
RSσ = wS · Bσ,   QST = wS · wT.   (4.1)
Note that RSσ are the projections of the prototype vectors on the center vectors Bσ, and QST correspond to the self- and cross-overlaps of the prototype vectors. From the generic update rule defined above, equation 3.1, we can derive the following recursions in terms of the order parameters:
RSσ^μ − RSσ^(μ−1) = (η/N) fS (bσ^μ − RSσ^(μ−1)),
QST^μ − QST^(μ−1) = (η/N) [fS (hT^μ − QST^(μ−1)) + fT (hS^μ − QST^(μ−1))] + (η²/N²) fS fT (ξ^μ)²,   (4.2)
where the input data vectors enter the system only through their projections hS and bσ, defined as
hS^μ = wS^(μ−1) · ξ^μ,   bσ^μ = Bσ · ξ^μ.   (4.3)
In the limit N → ∞, subleading terms can be neglected and the order parameters self-average (Reents & Urbanczik, 1998) with respect to the random sequence of examples. This means that fluctuations of the order parameters vanish, and the system dynamics can be described exactly in terms of their mean values. Also for N → ∞, the rescaled quantity α = μ/N can be conceived as a continuous time variable. Accordingly, the dynamics can be described by a set of coupled ordinary differential equations (ODE) (Ghosh, Biehl, & Hammer, 2006) after performing an average over the sequence of input data:
dRSσ/dα = η (⟨bσ fS⟩ − RSσ ⟨fS⟩),
dQST/dα = η (⟨hS fT⟩ + ⟨hT fS⟩ − QST ⟨fS + fT⟩) + η² Σσ pσ vσ ⟨fS fT⟩σ,   (4.4)
where ⟨·⟩ and ⟨·⟩σ are the averages over the density P(ξ) and over P(ξ|σ), respectively. To simplify the last term of equation 4.4, we used

lim_{N→∞} ⟨fS fT ξ²/N⟩ = Σσ pσ vσ ⟨fS fT⟩σ,

which holds because ξ²/N → vσ for data from cluster σ.
In various sections of this letter, we investigate learning behaviors using small learning rates and neglect the η² terms in equation 4.4. Nontrivial behavior is then expected only by taking the simultaneous limit η → 0, α → ∞ and rescaling the learning time as α̃ = ηα in equation 4.4.
Exploiting the limit N → ∞ once more, the projections hS, bσ become correlated gaussian quantities by means of the central limit theorem. Therefore, they are fully specified by their first and second moments, detailed in appendix A:
⟨hS⟩σ = λ RSσ,   ⟨bτ⟩σ = λ ℓτσ,
⟨hS hT⟩σ − ⟨hS⟩σ⟨hT⟩σ = vσ QST,
⟨hS bτ⟩σ − ⟨hS⟩σ⟨bτ⟩σ = vσ RSτ,
⟨bρ bτ⟩σ − ⟨bρ⟩σ⟨bτ⟩σ = vσ ℓρτ,   (4.5)
where S, T are prototype indices; ρ, τ, σ are cluster indices; and ℓρτ = Bρ · Bτ is an overlap measure between clusters, which reduces to the Kronecker delta δρτ for orthonormal centers.

Thus, the above averages ⟨fS⟩, ⟨bτ fS⟩, and ⟨hT fS⟩ reduce to gaussian integrations in K + M dimensions and can be expressed in terms of the order parameters (see appendix B). For various algorithms and a system with two competing prototypes, the averages can be calculated analytically. For three or more prototypes, the mathematical treatment becomes more involved and requires multiple numerical integrations.
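When closed forms are unavailable, the averages can also be estimated by sampling the projections directly from the correlated gaussian of equation 4.5. The following sketch does so; the argument conventions (R indexed as R[prototype, cluster], variances indexable by the cluster index) are our own.

    import numpy as np

    def sample_projections(R, Q, sigma, variances, lam=1.0, ell=None,
                           n=100000, rng=None):
        # Draw x = (h_1..h_K, b_1..b_M) conditioned on cluster sigma.
        # Valid order parameters yield a positive semidefinite covariance.
        rng = np.random.default_rng(rng)
        K, M = R.shape
        if ell is None:
            ell = np.eye(M)                          # ell = B_rho . B_tau
        mean = lam * np.concatenate([R[:, sigma], ell[:, sigma]])
        cov = np.block([[Q, R], [R.T, ell]]) * variances[sigma]
        return rng.multivariate_normal(mean, cov, size=n)

Averages such as ⟨bτ fS⟩σ then follow by evaluating fS on these samples and averaging.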

Given the averages for a specific modulation function fS, we obtain a closed set of ODE. Using initial conditions RSσ(0), QST(0), we integrate this system for a given algorithm and obtain the evolution of the order parameters in the course of training, RSσ(α), QST(α). The generalization error (i.e., the probability of the closest prototype wS carrying an incorrect label) is determined by considering the contribution from each cluster separately:
εg = Σσ pσ ⟨Θ( min_{S: cS=cσ} dS − min_{T: cT≠cσ} dT )⟩σ,   (4.6)
which can be calculated from the order parameters RSσ, QST. For instance, for the simplest system with two clusters and prototypes w+ and w−, the generalization error is written explicitly in terms of order parameters as
εg = p+ Φ( [Q++ − Q−− − 2λ(R++ − R−+)] / [2 √(v+ QΔ)] ) + p− Φ( [Q−− − Q++ − 2λ(R−− − R+−)] / [2 √(v− QΔ)] ),   (4.7)
with QΔ = Q++ − 2Q+− + Q−− and Φ(z) = ∫_{−∞}^{z} dx e^(−x²/2)/√(2π), detailed in appendix D. The form of εg for systems with more prototypes is more involved, and we defer the final result of the calculations to appendix D. We obtain the learning curve εg(α), which quantifies the success of training. This method of analysis shows excellent agreement with Monte Carlo simulations of the learning system for dimensionality as low as N = 100, as demonstrated in Figure 2.
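The closed form of equation 4.7 translates directly into code; the index convention below (0 for the + quantities, 1 for the − quantities) is ours.

    import numpy as np
    from math import erf, sqrt

    def generalization_error(R, Q, p_plus, v_plus, v_minus, lam=1.0):
        # Equation 4.7 for two prototypes and two clusters.
        Phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))   # normal CDF
        Qd = Q[0, 0] - 2.0 * Q[0, 1] + Q[1, 1]             # |w_+ - w_-|^2
        arg_p = (Q[0, 0] - Q[1, 1] - 2.0 * lam * (R[0, 0] - R[1, 0])) \
                / (2.0 * np.sqrt(v_plus * Qd))
        arg_m = (Q[1, 1] - Q[0, 0] - 2.0 * lam * (R[1, 1] - R[0, 1])) \
                / (2.0 * np.sqrt(v_minus * Qd))
        return p_plus * Phi(arg_p) + (1.0 - p_plus) * Phi(arg_m)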
Figure 2:

(Left) Evolution of the order parameters for LVQ 2.1 with K = 2 and M = 2. Solid lines represent the results of the theoretical analysis, and bars represent the variance produced by Monte Carlo simulations for N = 100 over 100 independent runs. (Right) Influence of a window on LVQ 2.1 at a fixed learning time. Prototypes are projected on the (B+, B−) subspace for two settings of the window parameter δ and for unrestricted LVQ 2.1 (□). In the latter, one prototype strongly diverges. The resulting decision boundaries are indicated by chained lines; the origin and the cluster centers are also marked.

5.  A Simple Case: Two Prototypes, Two Clusters

In this section, we discuss in detail the results of the analysis for the simplest nontrivial problem: two-prototype LVQ 2.1, GLVQ, LFM-W, and RSLVQ systems and M = 2 with one gaussian cluster per class. The model data are given in section 2. For simplicity, we denote the two clusters as σ = + and σ = −, and without loss of generality we can choose orthonormal center vectors B±, that is, Bi · Bj = 1 if i = j and 0 otherwise.

We place an emphasis on the asymptotic behavior in the limit α → ∞—the achieved performance for an arbitrarily large number of examples. The asymptotic generalization error scales with the learning rate, analogous to minimizing a cost function in stochastic gradient descent procedures. For LVQ 2.1 and RSLVQ, the best achievable generalization error is obtained in the simultaneous limit of small learning rates, η → 0, and rescaled learning time α̃ = ηα → ∞. However, this limit is not meaningful for LFM, as will be explained later.

In this simple scenario, it is possible to exactly calculate the best linear decision boundary (BLD) by linear approximation of the Bayesian optimal decision boundary (see Biehl, Freking, Ghosh, & Reents, 2004, for the calculations). We compare the results from each algorithm to the best linearly achievable error εbld.

5.1.  LVQ 2.1.

We first examine two-prototype systems: K = 2. Figure 2 illustrates the evolution of order parameters under the influence of a window and the trajectories of the prototypes projected onto the (B+, B−) subspace. Without additional constraints, LVQ 2.1 with two prototypes displays a strongly divergent behavior in a system with unbalanced data, p+ ≠ p−. The repulsion factor dominates for the prototype representing the weaker cluster, here w2. The order parameters associated with this prototype increase exponentially with α. As α → ∞, w2 will be arbitrarily far away from the cluster centers, and the asymptotic generalization error reduces to that of trivial classification, where all data are assigned to the stronger class.

When the window scheme is implemented, w2 is repulsed until the data densities of both classes within the window become more balanced. Subsequently, the order parameters change with more balance between both prototypes. The repulsion factor still dominates its counterpart; therefore, both prototypes still diverge: the order parameters of both prototypes display a linear change with α at large α, but the decision boundary remains stable. Trivial classification is prevented (see the generalization error curves versus α in the left panel of Figure 3). Obviously, for smaller δ, a considerable amount of data is filtered out, and the initial learning stages slow significantly. Meanwhile, for large δ, εg becomes nonmonotonic and converges more slowly.

Figure 3:

Generalization error for LVQ 2.1. (Left) εg versus α for several settings of δ and without a window. Note the logarithmic scaling on the horizontal axis. The asymptotic errors for all settings of δ converge to the same value, indicated by the dotted line. (Right) εg at several fixed learning times (up to α = 20) as a function of δ.

Hence, the performance at finite α depends on δ, as displayed in Figure 3, and parameter settings are highly critical in practical applications. Given a learning time α, an optimal choice of fixed δ exists, which clearly depends on the properties of the data. With larger α, εg becomes less sensitive toward δ, and the optimal setting of δ is smaller. Surprisingly, δ influences only the convergence speed, while the nontrivial asymptotic generalization error is insensitive to the choice of δ and equals the best achievable error for each setting. This can be explained as follows. We can compare the asymptotic decision boundary to the BLD: the angle between them is equal to the angle θ between (w1 − w2) and (B+ − B−). This is calculated, using equation 4.1 and the orthonormality of B+ and B−, as
θ = arccos[ ((R1+ − R1−) − (R2+ − R2−)) / (√2 √(Q11 − 2Q12 + Q22)) ],   (5.1)
which is found to be zero for large α. Hence, the decision boundary becomes parallel to the BLD, and only its offset produces the difference between εg and εbld. In low dimensions, this offset oscillates around zero due to the window rule. In the thermodynamic limit, the fluctuations vanish, and the LVQ 2.1 decision boundary coincides with the BLD.
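The angle of equation 5.1 is easily monitored during the integration; a sketch with our index convention R[prototype, cluster], clusters ordered (+, −):

    import numpy as np

    def boundary_angle(R, Q):
        # Equation 5.1 for orthonormal centers: angle between (w1 - w2)
        # and (B_+ - B_-), expressed in the order parameters.
        num = (R[0, 0] - R[0, 1]) - (R[1, 0] - R[1, 1])  # (w1-w2).(B+ - B-)
        den = np.sqrt(2.0) * np.sqrt(Q[0, 0] - 2.0 * Q[0, 1] + Q[1, 1])
        return np.arccos(np.clip(num / den, -1.0, 1.0))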

5.2.  LFM-W.

The LFM scheme performs updates identical to LVQ 2.1 with the condition that the example is misclassified. A detailed investigation into the characteristics of K = 2 unrestricted LFM has been presented in Biehl et al. (2007). There, it was shown that LFM produces stable prototype configurations for finite learning rates η. The projection of the prototypes lies parallel to the symmetry axis of the cluster centers, as displayed in Figure 4. However, the prototypes w1 and w2 retain components orthogonal to the two-dimensional subspace spanned by the cluster centers, indicated by QST > RS+RT+ + RS−RT−, which implies

wS⊥ · wT⊥ = QST − (RS+RT+ + RS−RT−) > 0,

where wS⊥ denotes the part of wS orthogonal to span(B+, B−).
The asymptotic generalization error is suboptimal and insensitive to η: the asymptotic decision boundary remains at an angle θ from the optimal hyperplane (see equation 5.1) independent of η. The Euclidean distance between the prototypes is given by the quantity

|w1 − w2| = (Q11 − 2Q12 + Q22)^(1/2),   (5.2)
which is found to be proportional to η for η → 0. At η = 0, |w1 − w2| = 0 and the prototypes coincide, so this limit is not meaningful in LFM.
Figure 4:

LFM-W. (Left) Asymptotic prototype configuration for LFM and for LFM-W with window parameter δ = 4.5, projected on the (B+, B−) plane. Cluster centers are also indicated. The projection of w1, w2 lies parallel to the symmetry axis, although the prototypes retain components orthogonal to the subspace. (Right) The asymptotic generalization error as a function of the window size δ. The lines correspond to several learning rates η (up to η = 1.0).

In this analysis, we observe that window schemes can dramatically improve the performance of LFM. When a window is used, the tilt of the decision boundary away from the optimal hyperplane (θ in equation 5.1) is reduced, resulting in lower εg. We observe that θ decreases along with δ, as displayed in the right panel of Figure 4. However, a critical window size δc exists below which LFM unexpectedly becomes divergent and no stationary state exists. Smaller windows select examples that produce more repulsion along the orientation of the cluster centers, and the repulsive contribution grows asymptotically as δ decreases. This is clearly observed in Figure 4. Given a sufficiently small δ, it is possible that the repulsion factor entirely outweighs the attractive factor. Close to δc, LFM-W performs similarly to LVQ 2.1: the angle θ becomes zero, and εg is close to the best achievable error.

Unlike in the unrestricted case, the learning rate can influence the asymptotic performance. The learning rate and window size are indirectly related, as shown in the right panel of Figure 4. For example, learning with small learning rates requires smaller windows to achieve the optimal asymptotic error. Note that the influence of the window size depends heavily on the structure of the data. For various data models, efficient window settings may exist only in a very limited range, and window schemes may be ineffective at improving generalization performance while still maintaining stability.

5.3.  GLVQ.

Apart from the influence of vG on the overall learning rate, small vG corresponds to a sharp peak around the decision boundary, while large vG corresponds to a very large window. Figure 5 displays the prototype lengths while using GLVQ: the soft window slows the strong repulsion of the prototype of the weaker cluster, as opposed to unrestricted LVQ 2.1. While both prototypes still diverge because the cost function is not bounded from below (see equation 3.9), the asymptotic εg remains nontrivial (see Figure 5).

Figure 5:

(Left) Q11 and Q22 for GLVQ (solid lines), compared to unrestricted LVQ 2.1 (dashed lines). The soft window of GLVQ slows the repulsion of one prototype, but the prototypes remain divergent. (Right) Learning curves εg versus α for softness vG = 2, 5, and 50; note the logarithmic horizontal axis. The overall learning rate η/vG is kept fixed. Large vG produces better asymptotic generalization error but may exhibit nonmonotonic behavior and require very long learning times.

Note that vG directly relates to the overall learning rate (refer to equation 3.12), which influences the level of noise in stochastic gradient procedures. We compare results with respect to vG while maintaining an equal overall learning rate by keeping η/vG constant in Figure 5. Performance deteriorates at smaller vG, where training slows down at intermediate stages and converges at a higher error. However, very large vG allows strong repulsion of the weaker prototype, which results in nonmonotonic εg and long convergence times. Surprisingly, the soft GLVQ window is outperformed by the simple hard or crisp window of LVQ 2.1. This is caused by the long tail of the modulation function, which sums up into a large repulsion, whereas with the crisp window, only data near the decision boundary are considered.

Figure 6 displays the cost function during learning. In the initial learning stages, the minimization of the cost function E leads to a fast decrease of εg. However, while the cost function continues to decrease monotonically, εg behaves nonmonotonically. While many techniques have been developed to improve minimization procedures for E, it is important to evaluate the choice of E itself and its correlation with the desired generalization performance.

Figure 6:

The cost functions for GLVQ (left) and RSLVQ (right) decrease monotonically, corresponding to stochastic gradient descent.

5.4.  RSLVQ.

Finally in this section, we study the influence of the softness parameter vsoft in the RSLVQ algorithm. Note that in Seo and Obermayer (2003), the learning rate η and softness parameter vsoft are treated independently using separate annealing schedules. In this section, we assume that η decreases proportionally with vsoft; that is, a fixed overall learning rate η/vsoft is maintained.

We first investigate model scenarios with equal variance clusters v+ = v− and unbalanced data p+ ≠ p−. We observe the influence of vsoft on the learning curves, displayed in the left panel of Figure 7. The generalization error curve depends on vsoft: at large vsoft, εg may exhibit nonmonotonic behavior, reminiscent of LVQ 2.1. Because of this behavior, the learning process may require long learning times before reaching the asymptotic configuration. This is an important consideration for practical applications, which often use early stopping strategies to avoid overtraining. Meanwhile, the algorithm minimizes the cost function E of equation 3.13 monotonically (see Figure 6). Thus, a decrease in E does not always result in a decrease of εg.

Figure 7:

Learning curves for RSLVQ using softness parameter vsoft = 1, 2, 10, and 20. (Left) p+ = 0.7 and equal variance v+ = v− = 1 with a fixed overall learning rate. (Right) p+ = 0.6 and unequal variance v+ = 1, v− = 4 with small learning rates. The asymptotic errors are independent of vsoft at small learning rates, but at a suboptimal value. Note the logarithmic scale of α.

A major advantage of the RSLVQ algorithm is the convergence of the prototypes: a stationary configuration of order parameters exists for finite vsoft. The asymptotic configuration of prototypes is displayed in Figure 8. At α → ∞, the softness parameter controls only the distance between the two prototypes: |w+ − w−| as defined in equation 5.2 decreases linearly with vsoft. Note that under the conditions p+ = 0.5, vsoft = v+ = v−, and initialization of the prototypes on the symmetry axis, each prototype is located at its corresponding cluster center: the RSLVQ mixture model exactly matches the actual input density.

Figure 8:

Trajectories of the prototypes of the system in the left panel of Figure 7. Prototypes are projected on the space span(B+, B−) for vsoft = 1 (circle), 2 (triangle), and 5 (square).

Figure 7 compares the asymptotic errors in the cases of a finite overall learning rate (left) and small learning rates (right). In the former case, performance improves with large vsoft: at small vsoft, the system converges at a high εg similar to LFM, while at larger vsoft, it approaches the best linear decision. Meanwhile, at small learning rates, the asymptotic error becomes independent of vsoft. Therefore, given sufficiently small learning rates, RSLVQ becomes robust with regard to its softness parameter.

In the equal variance scenario, the asymptotic decision boundary always converges to the best linear decision boundary for all settings of vsoft, and RSLVQ outperforms both LFM and LVQ 2.1, as it provides robustness, stability, and low generalization error.

On the other hand, a scenario with unequal class variances presents an interesting case where RSLVQ with a global vsoft fails to match the model. RSLVQ remains robust; the decision boundary converges to identical configurations for all settings of vsoft (see Figure 8). However, the asymptotic results are suboptimal. While RSLVQ is insensitive to the priors of the clusters, its performance with regard to the best achievable error is sensitive to the cluster variances; for example, at highly unbalanced variances, RSLVQ generalizes poorly and is outperformed by the simpler LVQ 2.1. In practical applications, vsoft may be set locally for each prototype to accommodate such scenarios, but this case cannot be treated along the lines of the analysis here in a straightforward way.

6.  Optimal Window Schedules

We have observed in sections 5.1 and 5.2 the learning curves and asymptotics of LVQ 2.1 and LFM-W with regard to fixed window parameters. Although small windows allow optimal εg, their obvious disadvantages are slower initial learning and convergence speed. This suggests that online performance can be improved by adjusting the window along with the number of examples presented. In this section, we treat the window parameter as a dynamic property during learning, δ = δ(α), similar to the annealing method in Seo and Obermayer (2006) or the gradient-based optimization in Bengio (2000). In contrast to the proposed methods, we can formally minimize εg with respect to δ using the knowledge of the input density. Hence we calculate the locally optimal δ-schedule by finding the condition
∂/∂δ (dεg/dα) = ∂/∂δ Σ_O (∂εg/∂O)(dO/dα) = 0,   (6.1)
where we use the shorthand O for the set of order parameters. For a system with two prototypes and two clusters, differentiating equation 4.7, we obtain
formula
with the quantities θ and |w1 − w2| defined in equations 5.1 and 5.2 (see appendix D for the calculations).

We plug in dO/dα for the corresponding algorithm and numerically calculate δ(α) from equation 6.1 at each learning step. We find that the learning curve is improved with an initially large δ, which is decreased during training, following the curve in Figure 9. This suggests that practical schedules with gradual reduction of window sizes are indeed suitable for this particular learning problem.
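In code, the local optimization can be sketched as a grid search over δ at each step; the callables d_eps_d_order and d_order_d_alpha (returning dicts keyed by order parameter) are hypothetical interfaces standing in for appendix D and equation 4.4, respectively.

    import numpy as np

    def locally_optimal_delta(order_params, d_eps_d_order, d_order_d_alpha, deltas):
        # Grid-search version of condition 6.1: pick the delta that makes
        # d(eps_g)/d(alpha) = sum_O (d eps_g/dO)(dO/d alpha) most negative
        # at the current point of the dynamics.
        grads = d_eps_d_order(order_params)
        best_delta, best_rate = None, np.inf
        for delta in deltas:
            flow = d_order_d_alpha(order_params, delta)    # r.h.s. of eq. 4.4
            rate = sum(grads[k] * flow[k] for k in grads)  # chain rule
            if rate < best_rate:
                best_delta, best_rate = delta, rate
        return best_delta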

Figure 9:

(Left) Optimal window schedule δ(α) for LVQ 2.1 and LFM-W, obtained by formally minimizing dεg/dα with respect to δ. (Right) Optimal softness parameter schedule for RSLVQ with several fixed learning rates (up to η = 0.5).

While this approach locally minimizes the generalization error, the strategy does not always lead to minimization of εg over a time span (i.e., a globally optimal schedule), which requires calculations along the lines of variational optimization (see Biehl, 1994; Saad & Rattray, 1997, for its application to optimal learning rates in multilayered neural networks). Obviously, a priori knowledge of the input density is not available in practical situations. Nevertheless, this minimization technique provides a bound on the achievable performance of the learning scheme for a given model.

Figure 7 shows that although a large vsoft for RSLVQ allows faster initial learning, it can also yield nonmonotonic learning curves. We can avoid the nonmonotonic behavior and maximize the decrease of εg by applying a variational approach analogous to equation 6.1 in order to calculate the locally optimal softness parameter schedule vsoft(α). Fixing the value of η, we produce the locally optimal softness schedule in Figure 9, where vsoft is initially large and decreases to saturate at a constant value. Note that this value depends on the learning rate; for example, it decreases with η. In calculations with η → 0, we obtain the limit vsoft → 0, which is the clearly suboptimal LFM. Therefore, an analysis of optimal RSLVQ schedules requires finite learning rates.

7.  Three-Prototype Systems

In this section we look at more generic analyses of LVQ algorithms by extending the previous systems to K=3 prototypes and M clusters, requiring a much larger set of order parameters. This allows an initial study on two important issues concerning practical applications of LVQ: multiclass problems and the use of multiple prototypes within a class.

We first look at multiclass problems with Nc = 3 classes. An example is shown in Figure 10 for LVQ 2.1 with M = 6 clusters selected with random variances and random deviations from the original class centers. The clusters are separable only in M out of N dimensions. In all our observations, we find that the behaviors of K = 3 systems are qualitatively similar to K = 2 systems. For LVQ 2.1, the learning curves vary according to the window sizes, but the asymptotic generalization error is independent of δ. Due to the presence of other prototypes, the repulsion on a weaker class prototype is reduced. However, the prototypes remain divergent (e.g., Figure 10). Meanwhile, for LFM-W, the asymptotic performance is sensitive to δ, whose range of effective window sizes depends strongly on the learning parameters. For GLVQ, the prototypes are divergent with a higher asymptotic error than LVQ 2.1, and thus it performs poorly. Finally, for RSLVQ, the prototypes remain stable, and the asymptotic generalization performance is robust with regard to settings of vsoft, but it is outperformed by LVQ 2.1. Hence, the results are consistent with the K = 2 system, and the preceding analysis remains qualitatively valid at least for systems of M clusters and one prototype per class within the model restrictions.

Figure 10:

(Left) Snapshot of an LVQ 2.1 system at a fixed learning time, with K = 3 and M = 6 randomly generated isotropic clusters projected on the (B1, B3) subspace. The solid dot marks the initial position of all prototypes, and solid lines mark the trajectories of the prototypes. (Right) Solid lines represent, from bottom to top, the prototype vector lengths Q11, Q22, Q33 for LVQ 2.1. Dashed lines represent the result for RSLVQ with vsoft = 2.

To allow more complex decision boundaries, practical LVQ applications frequently employ several prototypes within a class. We investigate a two-class system using K = 3 prototypes and observe the nontrivial interaction between similarly labeled prototypes, here w1 and w2. While prototypes of different classes separate immediately in the initial training phase, the projections of prototypes of the same class remain identical in the M-dimensional subspace (see Figure 11). The latter prototypes differ only in dimensions that are not relevant for classification and produce a suboptimal decision boundary. This may proceed for a long learning period before these prototypes begin to specialize; each prototype then develops a larger overlap with a distinct group of clusters. The specialization phase produces a sudden decrease of εg, displayed in the right panel of Figure 11. This phenomenon is highly reminiscent of symmetry-breaking effects observed in unsupervised learning, such as winner-takes-all vector quantization (VQ) (Biehl, 1994; Witoelar et al., 2008) or multilayer neural networks (Saad & Solla, 1995).

Figure 11:

An unspecialized phase induces long learning plateaus, shown for LFM-W with an input density of M = 6 clusters. (Left) Several order parameters display a specialization phase between prototypes of the same class. (Right) Generalization error.

Learning parameters highly influence the nature of the transition; for example, large learning rates and smaller windows prolong the unspecialized phase, and therefore they are critical to the success of learning. Symmetry breaking may require exceedingly long learning times, resulting in learning plateaus that dominate the training process and present a challenge in practical situations with very high-dimensional data. In more extreme circumstances, the system may not escape the unspecialized state at all, and the optimal classification cannot be obtained. Details of the symmetry-breaking properties with regard to parameters will be investigated in future publications.

8.  Conclusion

We have investigated the learning behavior of LVQ 2.1, GLVQ, LFM-W, and RSLVQ using window schemes that work in high dimensions. The analysis is based on the theory of online learning on a model of high-dimensional isotropic clusters. Our findings demonstrate that the selection of proper window sizes is critical to efficient learning for all algorithms. Given more available data and allowance for costly learning times, parameter selection becomes much less important.

Our analysis demonstrates the influence of windows on the learning curves and the advantages and drawbacks of each algorithm within the model scenarios. A summary is given in Table 1. Asymptotically, LVQ 2.1 achieves optimal performance in all scenarios, but stability remains an issue in terms of diverging prototypes. LFM-W shows a remarkable improvement in performance over LFM. Unfortunately, the introduction of a window may also influence its stability, and therefore it is highly parameter sensitive; only a narrow range of window sizes improves the overall performance. GLVQ behaves similarly to LVQ 2.1. While GLVQ reduces the initial strong overshooting of LVQ 2.1, the prototypes remain divergent, and GLVQ produces higher generalization errors or long convergence times. RSLVQ attempts to combine the advantages of both LFM and LVQ 2.1 by providing stability and optimal performance. However, an important issue of RSLVQ lies in its approximation of the data structure; it performs well when the actual input density consists of isotropic gaussian clusters with equal variance. If its assumptions depart from the actual input density, the results become suboptimal, and RSLVQ can even be outperformed by the simpler LVQ 2.1 and LFM-W. In all scenarios, RSLVQ displays robustness of its classification behavior with respect to the softness parameter, given sufficiently low learning rates.

Table 1:
Asymptotic Properties of LVQ Algorithms.
                                         LVQ 2.1     LFM-W          GLVQ        RSLVQ
Stability                                Divergent   Convergent a   Divergent   Convergent
Sensitivity with regard to parameters    Robust      Dependent      Dependent   Robust
Generalization ability                   Optimal     Suboptimal     Suboptimal  Suboptimal

a. Under the condition that the window size δ exceeds the critical value δc (see section 5.2).

This analysis also allows a formal optimization of the window size during learning to ensure fast convergence. While in general various window sizes for LVQ 2.1 produce equal asymptotic errors, initial window sizes should be chosen large for faster convergence and decreased in the course of learning. Similarly, an optimal schedule for RSLVQ points to a gradual decrease of the softness parameter to a particular saturation value, which agrees well with many practical scheduling schemes. However, locally optimal schedules do not always lead to globally optimal schedules (see, e.g., Saad & Rattray, 1997). In further work, we will develop efficient dynamic parameter adaptations, that is, optimal window schedules during online training along the lines of variational optimization.

We show that the analysis remains valid for multiclass systems and arbitrary numbers of isotropic clusters. Additionally, using multiple prototype assignments within a class, we already observe the presence of learning plateaus in this highly simplified scenario. These phenomena carry over and could dominate the training process in practical situations with high degrees of freedom. Further investigations of more complex network architectures and nontrivial input distributions may also yield additional phenomena, such as competing stationary states of the system, and provide further insights into general LVQ behaviors.

Appendix A:  Statistics of the Projections

For convenience, we combine the projections hS and bσ defined in equation 4.3 into a D-dimensional vector, where D = K + M, as
x = (h1, …, hK, b1, …, bM)^T.   (A.1)
In our analysis of online learning, we assume that the new example ξ is statistically independent of wS because ξ is uncorrelated with all previous data. Therefore, we observe that hS and bσ become correlated gaussian random quantities following the central limit theorem and can be fully described by their first and second moments—the conditional averages ⟨x⟩σ and the conditional covariance matrix Cσ. We compute these averages in the following.

A.1.  First-Order Statistics.

We compute the averages of the components of x as follows:
⟨hS⟩σ = ⟨wS · ξ⟩σ = λ wS · Bσ = λ RSσ,   (A.2)
⟨bτ⟩σ = ⟨Bτ · ξ⟩σ = λ Bτ · Bσ = λ ℓτσ,   (A.3)
with ℓτσ = Bτ · Bσ. To a large extent, we use orthonormal cluster center vectors, Bτ · Bσ = δτσ, where δτσ is the Kronecker delta. The conditional first-order moments can be expressed in terms of order parameters as
⟨x⟩σ = λ (R1σ, …, RKσ, ℓ1σ, …, ℓMσ)^T.   (A.4)

A.2.  Second-Order Statistics.

To compute the conditional covariances, we first look at the average
⟨hS hT⟩σ = Σ_{i,j} (wS)i (wT)j ⟨ξi ξj⟩σ = vσ QST + λ² RSσ RTσ.   (A.5)
Here we exploit the second moments of the input components,

⟨ξi ξj⟩σ = vσ δij + λ² (Bσ)i (Bσ)j.
Hence we obtain the conditional second-order moment, from equations A.5 and A.2:

⟨hS hT⟩σ − ⟨hS⟩σ⟨hT⟩σ = vσ QST.   (A.6)
Analogously, we get the second-order statistics of b and the covariance as follows:

⟨bρ bτ⟩σ − ⟨bρ⟩σ⟨bτ⟩σ = vσ ℓρτ,   (A.7)

⟨hS bτ⟩σ − ⟨hS⟩σ⟨bτ⟩σ = vσ RSτ.   (A.8)
The conditional covariance matrix can be written in terms of order parameters as

Cσ = vσ ( Q   R ; R^T   ℓ ),   (A.9)

where Q = [QST], R = [RSτ], and ℓ = [ℓρτ] denote the K×K, K×M, and M×M blocks, respectively.

Appendix B:  Form of the Differential Equations

In order to integrate the ordinary differential equations described in equation 4.4, we need to plug in the values of

⟨fS⟩σ,   ⟨bτ fS⟩σ,   ⟨hT fS⟩σ,   and   ⟨fS fT⟩σ.   (B.1)

Note that ⟨fS fT⟩σ is not required in the limit η → 0, where terms proportional to η² can be neglected. We write the forms for the following algorithms: LVQ 2.1, LFM-W, GLVQ, and RSLVQ.

B.1.  LVQ 2.1.

The general modulation function for LVQ 2.1 is described in equation 3.5 as

fS = Σ_{T: cT ≠ cS} ησ^S Θ(δ ξ²/N − |dS − dT|) Π_{U ≠ S,T} Θ(dU − dS) Θ(dU − dT),

with ησ^S = +1 if cS = σ and ησ^S = −1 otherwise. We can rewrite the window term as

Θ(δ ξ²/N − |dS − dT|) = Θ(dS − dT + δ ξ²/N) − Θ(dS − dT − δ ξ²/N),   (B.2)

so that fS becomes a sum of products of Heaviside functions of the distances.
For two-prototype systems with prototypes wS and wT of different classes, the above simplifies to

fS = ησ^S [Θ(dS − dT + δ ξ²/N) − Θ(dS − dT − δ ξ²/N)].   (B.3)
And the required averages over the joint density, equation B.1, are calculated as
formula
B.4
The necessary gaussian averages are calculated in appendix C.

B.2.  LFM-W.

The general modulation function for LFM-W is described in equation 3.6 as
fJ = +Θ(dJ − dK) Θ(δ ξ²/N − (dJ − dK)),   fK = −fJ,   (B.5)
with fS = 0 for all other prototypes. With only two prototypes, wS and wT are the winners of their respective classes; thus, the winner-selecting factor equals 1, and the averages are
formula
B.6

B.3.  GLVQ.

The general modulation function for GLVQ is described in equation 3.12 as
fJ = +Φ′(ν)/(vG vσ),   fK = −Φ′(ν)/(vG vσ),   ν = (dJ − dK)/(2 vσ vG).   (B.7)
For two prototypes,
formula
B.8
formula
The necessary quantities are found in equation C.7 in appendix C.

B.4.  RSLVQ.

With one prototype representing each class, equation 3.14 becomes
Pσ(S|ξ) = δσcS,   (B.9)
where we defined the complete-mixture assignment

P(S|ξ) = exp(−dS/(2vsoft)) / ΣT exp(−dT/(2vsoft)).
Therefore, the RSLVQ modulation function becomes
fS = (1/vsoft) [δσcS − P(S|ξ)],   (B.10)
where δσcS is the Kronecker delta and
formula
B.11
We obtain the averages
formula
B.12
The required gaussian averages are supplied in appendix C.

Appendix C:  Gaussian Averages

C.1.  Two Prototypes.

For generic functions of the projections, the following gaussian averages are required:
formula
C.1
Rotating the coordinate system, we obtain
formula
C.2
Next we calculate the quantity
formula
C.3

C.1.1.  LVQ 2.1, LFM-W.

The following quantities are required for the two-prototype versions of LVQ 2.1 and LFM-W:
formula
C.4
formula
C.5

C.1.2.  GLVQ.

For GLVQ, the following quantities are required:
formula
C.6
Using a suitable substitution, we obtain
formula
C.7

C.1.3.  RSLVQ.

For RSLVQ, the following quantities are required:
formula
C.8
This one-dimensional integration has to be solved numerically:
formula
C.9
where
formula
C.10
Applying integration by parts with
formula
C.11
we obtain
formula
C.12
formula
C.13
After applying a rotation,
formula
C.14
which is also solved numerically.
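Such one-dimensional gaussian averages can be evaluated with standard quadrature; a generic sketch (the function name gaussian_average and the check value are illustrative):

    import numpy as np
    from scipy.integrate import quad

    def gaussian_average(g, mean, var):
        # <g> = int g(x) N(x; mean, var) dx, evaluated numerically as
        # required for the RSLVQ quantities above.
        norm = 1.0 / np.sqrt(2.0 * np.pi * var)
        integrand = lambda x: g(x) * norm * np.exp(-(x - mean) ** 2 / (2.0 * var))
        value, _ = quad(integrand, -np.inf, np.inf)
        return value

    # Example: average of a sigmoid under a unit gaussian, ~0.5 by symmetry.
    print(gaussian_average(lambda x: 1.0 / (1.0 + np.exp(-x)), 0.0, 1.0))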

C.2.  Three Prototypes.

For generic functions of the projections, the following quantities are required:
formula
C.15
Next we calculate the quantity
formula
C.16
The following quantities have been calculated in Witoelar et al. (2008):
formula
C.17
formula
C.18
With the addition of a window, these quantities are required:
formula
C.19
For LVQ 2.1, the following average is required:
formula
C.20
formula
C.21
formula
C.22
where
formula
C.23
formula
C.24
formula
C.25

Appendix D:  Generalization Error

D.1.  Two Prototypes.

We compute the generalization error from equation 4.6 as follows. For two prototypes w+ and w−, we calculate εg with
εg = p+ ⟨Θ(d+ − d−)⟩+ + p− ⟨Θ(d− − d+)⟩−,   (D.1)
with d± = (ξ − w±)². We refer to Biehl et al. (2004) for the calculations. Plugging in the values, we obtain
εg = p+ Φ( [Q++ − Q−− − 2λ(R++ − R−+)] / [2 √(v+ QΔ)] ) + p− Φ( [Q−− − Q++ − 2λ(R−− − R+−)] / [2 √(v− QΔ)] ),   (D.2)
with QΔ = Q++ − 2Q+− + Q−−.
By using Φ′(x) = e^(−x²/2)/√(2π) and the chain rule, we can calculate the derivative of the generalization error with respect to the order parameters as follows:
formula
D.3
Derivatives with respect to the order parameters yield
formula
D.4
In the special case of p+ = p− = 0.5 and v+ = v− = v, one obtains
formula
D.5

D.2.  Three Prototypes.

To compute the generalization error in systems with three prototypes wS, wT, wU, we require the quantity
formula
D.6
where the averages are written in equation C.17.

References

Barkai, N., Seung, H. S., & Sompolinsky, H. (1993). Scaling laws in learning of classification tasks. Phys. Rev. Lett., 70, 3167–3170.

Bengio, Y. (2000). Gradient-based optimization of hyperparameters. Neural Comput., 12(8), 1889–1900.

Biehl, M. (1994). An exactly solvable model of unsupervised learning. Europhysics Letters, 25, 391–396.

Biehl, M., & Caticha, N. (2003). The statistical mechanics of on-line learning and generalization. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 1095–1098). Cambridge, MA: MIT Press.

Biehl, M., Freking, A., Ghosh, A., & Reents, G. (2004). A theoretical framework for analysing the dynamics of LVQ: A statistical physics approach (Tech. Rep. 2004-9-02). Groningen, Netherlands: Mathematics and Computing Science, University of Groningen. Available online at http://www.cs.rug.nl/~biehl.

Biehl, M., Ghosh, A., & Hammer, B. (2007). Dynamics and generalization ability of LVQ algorithms. J. Mach. Learning Res., 8, 323–360.

Biehl, M., & Mietzner, A. (1993). Statistical mechanics of unsupervised learning. Europhysics Letters, 27, 421–426.

Biehl, M., & Schwarze, H. (1993). Learning drifting concepts with neural networks. Journal of Physics A: Mathematical and General, 26, 2651–2665.

Engel, A., & van den Broeck, C. (2001). The statistical mechanics of learning. Cambridge: Cambridge University Press.

Ghosh, A., Biehl, M., & Hammer, B. (2006). Performance analysis of LVQ algorithms: A statistical physics approach. Neural Networks, 19, 817–829.

Hammer, B., & Villmann, T. (2002). Generalized relevance learning vector quantization. Neural Networks, 15, 1059–1068.

Kohonen, T. (1990). Improved versions of learning vector quantization. In 1990 IJCNN International Joint Conference on Neural Networks (Vol. 1, pp. 545–550). Mahwah, NJ: Erlbaum.

Kohonen, T. (2001). Self organising maps (3rd ed.). Berlin: Springer.

Meir, R. (1995). Empirical risk minimization versus maximum-likelihood estimation: A case study. Neural Computation, 7, 144–157.

Neural Networks Research Centre, Helsinki. (2002). Bibliography on the self-organizing maps (SOM) and learning vector quantization (LVQ). Otaniemi: Helsinki University of Technology.

Reents, G., & Urbanczik, R. (1998). Self averaging and on-line learning. Phys. Rev. Lett., 80, 5445–5448.

Saad, D. (Ed.). (1999). Online learning in neural networks. Cambridge: Cambridge University Press.

Saad, D., & Rattray, M. (1997). Globally optimal parameters for on-line learning in multilayer neural networks. Phys. Rev. Lett., 79, 2578–2581.

Saad, D., & Solla, S. A. (1995). On-line learning in soft committee machines. Phys. Rev. E, 52, 4225–4243.

Sato, A., & Yamada, K. (1995). Generalized learning vector quantization. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 423–429). Cambridge, MA: MIT Press.

Seo, S., & Obermayer, K. (2003). Soft learning vector quantization. Neural Computation, 15, 1589–1604.

Seo, S., & Obermayer, K. (2006). Dynamic hyper parameter scaling method for LVQ algorithms. In International Joint Conference on Neural Networks. Piscataway, NJ: IEEE Press.

Witoelar, A., Biehl, M., Ghosh, A., & Hammer, B. (2008). Learning dynamics and robustness of vector quantization and neural gas. Neurocomputing, 71, 1210–1219.