## Abstract

A variety of modifications have been employed in learning vector quantization (LVQ) algorithms, using either crisp or soft windows for the selection of data. Although these schemes have been shown in practice to improve performance, theoretical studies of the influence of windows have so far been limited. Here we rigorously analyze the influence of windows in a controlled environment of gaussian mixtures in high dimensions. Concepts from statistical physics and the theory of online learning allow an exact description of the training dynamics, yielding typical learning curves, convergence properties, and achievable generalization abilities. We compare the performance and demonstrate the advantages of various algorithms, including LVQ 2.1, generalized LVQ (GLVQ), Learning from Mistakes (LFM), and Robust Soft LVQ (RSLVQ). We find that the selection of the window parameter strongly influences the learning curves but, surprisingly, not the asymptotic performances of LVQ 2.1 and RSLVQ. Although the prototypes of LVQ 2.1 exhibit divergent behavior, the resulting decision boundary coincides with the optimal decision boundary, thus yielding optimal generalization ability.

## 1. Introduction

Learning vector quantization (LVQ) constitutes a family of learning algorithms for nearest prototype classification of potentially high-dimensional data (Kohonen, 2001). The intuitive approach and computational efficiency of LVQ classifiers have motivated their application in various disciplines (see, e.g., Neural Networks Research Centre, 2002). Prototypes in LVQ algorithms represent typical features within a data set in the same feature space as the data, in contrast to the black box approach of many other classifiers (e.g., feedforward neural networks or support vector machines). This makes them attractive to researchers outside the field of machine learning. Other advantages of LVQ algorithms are that they are easy to implement for multiclass classification problems and that the algorithm complexity can be adjusted during training as required.

Numerous variants of the original LVQ prescriptions have been proposed for achieving better performance, such as LVQ 2.1 (Kohonen, 1990, 2001), LVQ 3 (Kohonen, 1990, 2001), generalized LVQ (GLVQ) (Hammer & Villmann, 2002; Sato & Yamada, 1995), and Robust Soft LVQ (RSLVQ) (Seo & Obermayer, 2003). Common themes of these modifications include an additional parameter that controls the selection of data to which the system is adapted and variation of the magnitude of prototype updates. We refer to these in general as *window schemes*. In the limiting case of hard or crisp learning schemes, updates are restricted to examples that fall into this window. For instance, LVQ 2.1 allows updates as long as the example is in the vicinity of the current decision boundary. Alternatively, learning schemes can implement a soft window (e.g., RSLVQ and GLVQ), which considers all examples but adapts the magnitude of the update according to their relative distances to the current decision boundary.

In general, the learning behavior of these strategies is not well understood. It is unclear how the convergence, stability, and achievable generalization ability compare across the different strategies. Fortunately, methods from statistical physics and the theory of online learning have recently allowed a systematic investigation of very large systems in the so-called thermodynamic limit. This has been successfully applied to, among others, feedforward neural networks, perceptron training, and principal component analysis (Biehl & Caticha, 2003; Engel & van den Broeck, 2001; Saad, 1999). A similar approach to LVQ-type algorithms, such as LVQ 1, unsupervised VQ, and rank-based neural gas, was taken in Biehl, Ghosh, and Hammer (2007) and Witoelar, Biehl, Ghosh, and Hammer (2008).

In this work, we closely examine the influence of window schemes for LVQ algorithms. Typical learning behavior is studied within a model situation of high-dimensional gaussian clusters and competing prototypes. From this analysis, we can observe typical learning curves and the convergence properties, that is, the asymptotic behavior in the limit of an arbitrarily large number of examples.

Typically the window parameters are selected either heuristically or derived from prior knowledge of the data and kept fixed during training. The optimal parameter settings are chosen according to a computationally expensive validation procedure. It is also possible to treat the hyperparameters as dynamic properties during learning by means of an annealing schedule (Seo & Obermayer, 2006) or a gradient-based optimization method (Bengio, 2000), for example. Using the model described in this letter, one can investigate the optimality of the parameters for both fixed and dynamic settings in representative model situations.

## 2. Model

Data are generated according to a mixture of *M* gaussian clusters and presented to a system of two or three prototypes. We restrict ourselves to the analysis of isotropic and homogeneous clusters; each cluster σ generates only data with one of the class labels 1, …, *N*_{c}, where *N*_{c} is the number of classes. Examples ξ are drawn independently according to the probability density function P(ξ) = ∑_{σ} *p*_{σ} P(ξ|σ), where the *p*_{σ} are the cluster-wise prior probabilities and ∑_{σ} *p*_{σ} = 1. The components of vectors ξ from cluster σ are random numbers with mean vectors λ**B**_{σ} and variance *v*_{σ}. The unit vectors **B**_{σ} determine the orientation of the cluster centers. Similar densities have been studied in Barkai, Seung, and Sompolinsky (1993), Biehl (1994), Biehl et al. (2007), and Meir (1995).

In this framework, we formally exploit the thermodynamic limit *N* → ∞, corresponding to very high-dimensional data. This has simplifying consequences that will be present throughout the letter. Note that on random subspace projections, data from different clusters completely overlap and are not separable. The clusters become apparent at most only in the *M*-dimensional space spanned by the vectors **B**_{σ}. The nontrivial goal is to identify this subspace from the *N*-dimensional data.

We bring attention to the scaling of the model. The anisotropy of this data distribution is very weak: while the mean of cluster σ, given by λ**B**_{σ}, is a vector of length λ of order 1, the average squared length of the data vectors is of order *N*.
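
As a concrete illustration of this scaling, the following sketch draws samples from the mixture density just described (the function name and the specific parameter values are ours, chosen only for illustration): isotropic gaussian noise of variance *v*_{σ} around cluster centers λ**B**_{σ}.

```python
import numpy as np

def sample_cluster_data(B, priors, variances, lam, n_samples, rng):
    """Draw examples from a mixture of isotropic gaussian clusters.

    Each example is lam * B[sigma] plus isotropic noise of variance
    variances[sigma]; the cluster index sigma is drawn with the given
    prior probabilities."""
    M, N = B.shape
    sigma = rng.choice(M, size=n_samples, p=priors)
    noise = rng.normal(size=(n_samples, N))
    noise *= np.sqrt(np.asarray(variances, dtype=float))[sigma, None]
    return lam * B[sigma] + noise, sigma

rng = np.random.default_rng(0)
N = 1000                                   # data dimension
B = np.zeros((2, N))
B[0, 0] = 1.0; B[1, 1] = 1.0               # orthonormal cluster centers
xi, sigma = sample_cluster_data(B, [0.6, 0.4], [1.0, 1.0], 1.5, 5000, rng)

# Weak anisotropy: the cluster means have length of order 1 while the
# data vectors have squared length of order N.
mean_sq_len = np.mean(np.sum(xi ** 2, axis=1))
```

Averaging the squared length over many samples confirms the scaling: the result is close to *N* plus a contribution of order 1 from the cluster means.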

Obviously this model is greatly simplified compared to practical situations. However, it represents an ideal scenario to analyze the learning algorithms considered here, since gaussian mixture models are a common modeling assumption in many practical scenarios. We can expect that algorithms that do not perform well on this idealized model will also be inappropriate for real-life problems. Although more complex behaviors are expected in practical applications, the nontrivial effects already observed in this model will clearly influence the outcome under more general circumstances.

## 3. Algorithms

The system consists of *K* prototypes **w**_{S} with class labels *c*_{S}, *S* = 1, …, *K*. Classification is implemented through a nearest prototype scheme: a novel example ξ is assigned to the class of the closest prototype according to a dissimilarity measure. Here we restrict the measure to the squared Euclidean distance *d*_{S} = (ξ − **w**_{S})^{2} for a given novel example ξ. In the online algorithm, examples are presented sequentially to the system, and the prototypes are adapted by the update step **w**_{S}^{μ} = **w**_{S}^{μ−1} + (η/*N*) *f*_{S} (ξ^{μ} − **w**_{S}^{μ−1}), where **w**_{S}^{μ} denotes the prototype after presentation of μ examples and the learning rate η is rescaled with *N*. We use the shorthand *f*_{S} for the modulation function that controls, along with the learning rate η, the magnitude of the update of **w**_{S} toward or away from the current example. In this work, we investigate several LVQ prescriptions that include window schemes.
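
The generic scheme above can be sketched in a few lines of code; the function names and the toy numbers are ours, and the modulation factors *f*_{S} are passed in as a precomputed vector, since each LVQ variant defines them differently:

```python
import numpy as np

def nearest_prototype(w, xi):
    """Index of the closest prototype under squared Euclidean distance."""
    d = np.sum((w - xi) ** 2, axis=1)
    return int(np.argmin(d))

def online_update(w, xi, f, eta):
    """One online step: move each prototype along (xi - w_S), scaled by
    the rescaled learning rate eta / N and the modulation factor f[S]."""
    N = xi.shape[0]
    return w + (eta / N) * f[:, None] * (xi - w)

# toy usage: two prototypes in N = 4 dimensions
w = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
xi = np.array([0.9, 0.1, 0.0, 0.0])
winner = nearest_prototype(w, xi)        # w[0] is closer to xi
w_new = online_update(w, xi, np.array([1.0, -1.0]), eta=0.5)
```

With *f* = (+1, −1), the first prototype is attracted toward the example and the second is repelled, the elementary moves all prescriptions below are built from.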

### 3.1. LVQ 2.1.

LVQ 2.1, proposed by Kohonen, aims at efficient separation between prototypes of different classes and has been shown to provide good classification results (Kohonen, 1990; Neural Networks Research Centre, 2002). Given an example ξ from cluster σ, the two nearest prototypes **w**_{S} and **w**_{T} are updated if the following conditions are met: (1) the classes *c*_{S} and *c*_{T} are different, and (2) either *c*_{S} or *c*_{T} is equal to σ. The prototype with the correct class is moved toward the data, while the other is moved further away, with *f*_{S} = +1, *f*_{T} = −1 if *c*_{S} = σ; *f*_{S} = −1, *f*_{T} = +1 else.

The original prescription restricts updates further to examples that fall into a window around the current decision boundary, defined through the ratio of the distances *d*_{S} and *d*_{T} (with *d*_{S} = (ξ − **w**_{S})^{2}). Consequently, this window definition does not work in very high dimensions: *d*_{S}/*d*_{T} → 1 for *N* → ∞, which implies that every example falls into the window. Therefore, in the following, we implement the constraint |*d*_{S} − *d*_{T}| < *k*(*d*_{S} + *d*_{T}), where *k* is a small, positive number. Note that the term ξ^{2} cancels out on the left-hand side, while it dominates on the right-hand side for *N* → ∞. Thus, the right-hand side becomes of order *kN*, and the condition is nontrivial only if *k* is of order 1/*N*. We introduce a rescaled window parameter δ proportional to *kN*, so that the window scheme remains well defined in the thermodynamic limit; δ is positive. We describe these rules as a modulation function with *f*_{S} = ±1 as above if the window condition is fulfilled and *f*_{S} = 0 else, written in terms of the Heaviside function Θ(*x*) = 1 if *x* > 0; 0 else.

The modulation function sums over all prototypes **w**_{T}, and Heaviside terms enforce the window condition. The product term singles out instances where **w**_{S} and **w**_{T} are the two closest prototypes. This form of *f*_{S} allows for the analysis given in section 4.
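
A crisp-window LVQ 2.1 step can be sketched as follows. The particular form of the window test, a threshold 2δ on |*d*_{S} − *d*_{T}|, is our simplified reading of the rescaled condition above, and all names are ours:

```python
import numpy as np

def lvq21_window_update(w, c, xi, label, eta, delta):
    """LVQ 2.1 step with a crisp window (sketch): the two nearest
    prototypes are adapted only if they carry different labels, one of
    them matches the example's label, and the squared distances differ
    by less than the window threshold."""
    d = np.sum((w - xi) ** 2, axis=1)
    S, T = np.argsort(d)[:2]                 # two nearest prototypes
    if c[S] == c[T] or label not in (c[S], c[T]):
        return w                             # conditions (1) or (2) fail
    if abs(d[S] - d[T]) >= 2.0 * delta:
        return w                             # example outside the window
    f = np.zeros(len(w))
    f[S], f[T] = (1.0, -1.0) if c[S] == label else (-1.0, 1.0)
    return w + (eta / xi.shape[0]) * f[:, None] * (xi - w)
```

A small δ leaves most examples untouched, which is exactly the data-filtering effect discussed in section 5.1.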

### 3.2. LFM-W.

A simple modification to overcome the stability problems of LVQ 2.1 is to restrict updates to misclassified examples only. Analogous to perceptron learning, we term this update rule Learning from Mistakes (LFM). Here, the closest prototype **w**_{J} with the same class (the correct winner) and the closest prototype **w**_{K} with a different class (the incorrect winner) are updated with *f*_{J} = +1 and *f*_{K} = −1 if the example is misclassified. On the contrary, if the winning prototype is already correct, the configuration is left unchanged. This prescription can be interpreted as a limiting case of the cost-function-based Robust Soft LVQ (RSLVQ), which will be explained later in this section. Because the cost function of RSLVQ is bounded from below, stability can also be expected in LFM.

In LFM with a window (LFM-W), updates are additionally restricted to misclassified examples that fall into a window around the decision boundary. The modulation function can be written, with **w**_{J} being the correct winner and **w**_{K} being the incorrect winner, as *f*_{J} = +1, *f*_{K} = −1 if this condition is fulfilled and *f* = 0 else, which singles out misclassified examples that fall into the window.
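
The LFM-W rule then amounts to a misclassification test combined with the same window test as before. The sketch below uses our own names, and the window form (a threshold 2δ on |*d*_{J} − *d*_{K}|) is again our simplified reading:

```python
import numpy as np

def lfm_w_update(w, c, xi, label, eta, delta):
    """Learning-from-mistakes with window (sketch): adapt the correct
    winner w_J and the incorrect winner w_K only if the example is
    misclassified AND lies within the window around the boundary."""
    d = np.sum((w - xi) ** 2, axis=1)
    correct = np.flatnonzero(c == label)
    wrong = np.flatnonzero(c != label)
    J = correct[np.argmin(d[correct])]       # closest correct prototype
    K = wrong[np.argmin(d[wrong])]           # closest incorrect prototype
    misclassified = d[K] < d[J]
    in_window = abs(d[J] - d[K]) < 2.0 * delta
    if not (misclassified and in_window):
        return w                             # configuration unchanged
    f = np.zeros(len(w))
    f[J], f[K] = 1.0, -1.0
    return w + (eta / xi.shape[0]) * f[:, None] * (xi - w)
```

Correctly classified examples leave the configuration untouched, which is the property that distinguishes LFM-type rules from LVQ 2.1.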

### 3.3. GLVQ.

GLVQ is derived from the minimization of a cost function (equation 3.7) based on the difference of the distances *d*_{J} and *d*_{K}, where **w**_{J} is the nearest correct prototype and **w**_{K} is the nearest incorrect prototype to the example ξ. We insert a scaling parameter *C*, which will be required for high dimensions. A stochastic gradient procedure on equation 3.7 yields the learning rule, and here the usefulness of selecting a nonlinear Φ becomes apparent. For instance, in Hammer and Villmann (2002) and Sato and Yamada (1995), a sigmoid function Φ with softness parameter *v*_{G} is chosen. The form of its derivative Φ′, which has a single peak at the decision boundary, can be interpreted as a soft window around the decision boundary.

In high dimensions, *C* must also be of order *N*, and we rescale the argument of Φ accordingly. The parameter *v*_{G} determines the softness of the window, provided that an appropriate nonlinear Φ is chosen. Note that GLVQ can be simplified to LVQ 2.1 without a window by using the identity function Φ(*x*) = *x*. The cost function in equation 3.7 then becomes linear in the distance difference, and *v*_{G} could be set to 1 without changing the learning behavior. The modulation function is then reduced to *f*_{J} = +1, *f*_{K} = −1.
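
A GLVQ-style update with a sigmoidal soft window can be sketched as follows. The cost argument (*d*_{J} − *d*_{K})/*C*, the sigmoid form, and the prefactors are our reading of the equations referenced above, not a verbatim reproduction, and all names are ours:

```python
import numpy as np

def soft_window(x, v_g):
    """Phi'(x) for the sigmoid Phi(x) = 1 / (1 + exp(-x / v_g)):
    a single peak at x = 0, i.e., at the decision boundary."""
    s = 1.0 / (1.0 + np.exp(-x / v_g))
    return s * (1.0 - s) / v_g

def glvq_update(w, c, xi, label, eta, v_g, C):
    """GLVQ-style step (sketch): the nearest correct prototype w_J is
    attracted, the nearest incorrect prototype w_K is repelled, both
    weighted by the soft window evaluated at (d_J - d_K) / C."""
    d = np.sum((w - xi) ** 2, axis=1)
    correct = np.flatnonzero(c == label)
    wrong = np.flatnonzero(c != label)
    J = correct[np.argmin(d[correct])]
    K = wrong[np.argmin(d[wrong])]
    weight = soft_window((d[J] - d[K]) / C, v_g)
    f = np.zeros(len(w))
    f[J], f[K] = weight, -weight
    return w + (eta / xi.shape[0]) * f[:, None] * (xi - w)
```

Unlike a crisp window, every example contributes, but the contribution decays smoothly with the distance from the decision boundary, and *v*_{G} controls how fast.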

### 3.4. RSLVQ.

RSLVQ minimizes a cost function *E* = −ln(*L*), where *L* is based on a likelihood ratio function of a mixture model that approximates the actual probability density (see equation 2.1). It is assumed that every component **w**_{S} of the mixture generates examples that belong to one class *c*_{S}. *N*_{c} is the number of classes, *P*_{S} is the probability that the examples are generated by a particular component **w**_{S}, and *P*(ξ|*S*) is the conditional probability that **w**_{S} generates a particular example ξ.

The learning rule follows from a stochastic gradient descent of *E* with respect to **w**_{S}. We examine it for a gaussian mixture ansatz as in Seo and Obermayer (2003). Furthermore, every component is assumed to have equal probability *P*(*S*) = 1/*K* and equal variance *v*_{S} = *v*_{soft}, where *v*_{soft} is called the softness hyperparameter. This gives a modulation function in terms of the assignment probabilities *P*_{y}(*S*|ξ) and *P*(*S*|ξ) (see Seo & Obermayer, 2003, for the derivations). *P*_{y}(*S*|ξ) describes the posterior probability that ξ is assigned to the component *S* of the mixture, given that the example is generated by the correct class. *P*(*S*|ξ) describes the posterior probability that ξ is assigned to the component *S* of the complete mixture using all classes. As *v*_{soft} becomes smaller, the updates become smaller for correctly classified examples and larger for incorrectly classified examples (see Figure 1).
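
The assignment probabilities can be computed directly, as in the following sketch (a minimal reading of the rule above with equal component priors and shared variance *v*_{soft}; the helper names are ours):

```python
import numpy as np

def rslvq_modulation(w, c, xi, label, v_soft):
    """Modulation factors f_S = P_y(S|xi) - P(S|xi) for a gaussian
    mixture ansatz with equal component priors and shared variance
    v_soft. P_y restricts the mixture to components of the correct
    class; P uses the complete mixture."""
    d = np.sum((w - xi) ** 2, axis=1)
    g = np.exp(-d / (2.0 * v_soft))            # unnormalized gaussians
    p_all = g / np.sum(g)                      # P(S | xi)
    mask = (c == label)
    p_y = np.where(mask, g, 0.0) / np.sum(g[mask])   # P_y(S | xi)
    return p_y - p_all

w = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
c = np.array([0, 0, 1])
f = rslvq_modulation(w, c, np.array([0.5, 0.0]), 0, v_soft=0.5)
# Correct-class components are attracted (f > 0), the incorrect one is
# repelled (f < 0), and the factors sum to zero.
```

For correctly classified examples the two posteriors nearly coincide and the factors shrink, while misclassified examples produce larger factors, matching the small-*v*_{soft} behavior described above.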

## 4. Analysis

The dynamics can be described by a set of order parameters: the projections *R*_{Sσ} = **w**_{S} · **B**_{σ} and the overlaps *Q*_{ST} = **w**_{S} · **w**_{T}, where *S*, *T* are prototype indices and σ, τ are cluster indices; δ is the Kronecker delta; and **B**_{σ} · **B**_{τ} is an overlap measure between clusters.

Thus, the above averages reduce to gaussian integrations in *K* + *M* dimensions and can be expressed in terms of the order parameters (see appendix B). For various algorithms and a system with two competing prototypes, the averages can be calculated analytically. For three or more prototypes, the mathematical treatment becomes more involved and requires multiple numerical integrations.

For a given modulation function *f*_{S}, we obtain a closed set of ordinary differential equations in the rescaled number of examples α = μ/*N*. Using initial conditions {*R*_{Sσ}(0), *Q*_{ST}(0)}, we integrate this system for a given algorithm and obtain the evolution of the order parameters in the course of training. The generalization error (i.e., the probability of the closest prototype **w**_{S} carrying an incorrect label) is determined by considering the contribution from each cluster separately and can be calculated from the order parameters. For instance, for the simplest system with two clusters and prototypes **w**_{+} and **w**_{−}, the generalization error can be written explicitly in terms of the order parameters, as detailed in appendix D. The form of the generalization error for systems with more prototypes is more involved, and we refer to appendix D for the final result of the calculations. We obtain the learning curve ε_{g}(α), which quantifies the success of training. This method of analysis shows excellent agreement with Monte Carlo simulations of the learning system for dimensionality as low as *N* = 100, as demonstrated in Figure 2.
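
Such a Monte Carlo cross-check is straightforward to reproduce at small scale. The sketch below assumes the standard definitions of the order parameters (*R*_{Sσ} = **w**_{S} · **B**_{σ}, *Q*_{ST} = **w**_{S} · **w**_{T}) and estimates the generalization error by counting misclassified fresh examples; all names and parameter values are ours:

```python
import numpy as np

def order_parameters(w, B):
    """R[S, sigma] = w_S . B_sigma and Q[S, T] = w_S . w_T."""
    return w @ B.T, w @ w.T

def mc_generalization_error(w, c, B, priors, variances, lam, n, rng):
    """Monte Carlo estimate of the generalization error: fraction of
    fresh examples whose nearest prototype carries the wrong label."""
    M, N = B.shape
    sig = rng.choice(M, size=n, p=priors)
    xi = lam * B[sig] + rng.normal(size=(n, N)) * np.sqrt(
        np.asarray(variances, dtype=float))[sig, None]
    d = ((xi[:, None, :] - w[None, :, :]) ** 2).sum(axis=2)
    winners = np.argmin(d, axis=1)
    return float(np.mean(c[winners] != sig))

rng = np.random.default_rng(1)
N = 50
B = np.zeros((2, N)); B[0, 0] = 1.0; B[1, 1] = 1.0
w = 3.0 * B                              # prototypes at the cluster centers
R, Q = order_parameters(w, B)
err = mc_generalization_error(w, np.array([0, 1]), B,
                              [0.5, 0.5], [0.25, 0.25], 3.0, 2000, rng)
```

With well-separated clusters and prototypes placed at the cluster centers, the estimated error is close to zero, and the empirical order parameters match the values predicted from the prototype positions.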

## 5. A Simple Case: Two Prototypes, Two Clusters

In this section, we discuss in detail the results of the analysis for the simplest nontrivial problem: two-prototype LVQ 2.1, GLVQ, LFM-W, and RSLVQ systems and *M* = 2 with one gaussian cluster per class. The model data are given in section 2. For simplicity, we denote the two clusters as σ = + and σ = − and, without loss of generality, choose orthonormal vectors **B**_{+} and **B**_{−}, that is, **B**_{i} · **B**_{j} = 1 if *i* = *j*; 0 else.

We place an emphasis on the asymptotic behavior—the achieved performance in the limit of an arbitrarily large number of examples. The asymptotic generalization error scales with the learning rate, analogous to minimizing a cost function in stochastic gradient descent procedures. For LVQ 2.1 and RSLVQ, the best achievable generalization error is obtained in the simultaneous limit of small learning rates and arbitrarily many examples. However, this limit is not meaningful for LFM, as will be explained later.

In this simple scenario, it is possible to exactly calculate the best linear decision boundary (BLD) by linear approximation of the Bayesian optimal decision boundary (see Biehl, Freking, Ghosh, & Reents, 2004, for the calculations). We compare the results from each algorithm to the best linearly achievable error.

### 5.1. LVQ 2.1.

We first examine two-prototype systems: *K* = 2. Figure 2 illustrates the evolution of the order parameters under the influence of a window and the trajectories of the prototypes projected onto the (**B**_{+}, **B**_{−}) subspace. Without additional constraints, LVQ 2.1 with two prototypes displays a strongly divergent behavior in a system with unbalanced priors, *p*_{+} ≠ *p*_{−}. The repulsion factor dominates for the prototype representing the weaker cluster, here **w**_{2}. The order parameters associated with this prototype increase exponentially with the number of examples. Asymptotically, **w**_{2} moves arbitrarily far away from the cluster centers, and the asymptotic generalization error is trivial: it equals the prior of the weaker cluster, min(*p*_{+}, *p*_{−}).

When the window scheme is implemented, **w**_{2} is repulsed until the data densities of both classes within the window become more balanced. Subsequently, the order parameters change with more balance between the two prototypes. The repulsion factor still dominates its attractive counterpart; therefore, both prototypes still diverge: the order parameters of both prototypes display a linear change with training time at late stages, but the decision boundary remains stable. Trivial classification is prevented (see the generalization error curves in the left panel of Figure 3). Obviously, for smaller window sizes, a considerable amount of data is filtered out, and the initial learning stages slow significantly. Meanwhile, for large window sizes, the generalization error becomes nonmonotonic and converges more slowly.

The divergence does not necessarily lead to poor classification. To see this, we consider the angle between (**w**_{1} − **w**_{2}) and (**B**_{+} − **B**_{−}). This angle is calculated using equation 4.1 and the orthonormality of **B**_{+} and **B**_{−}, and it is found to vanish for large training times. Hence, the decision boundary becomes parallel to the BLD, and only its offset produces the difference between the achieved and the best linearly achievable generalization error. In low dimensions, this offset oscillates around zero due to the window rule. In the thermodynamic limit, the fluctuations vanish, and the LVQ 2.1 decision boundary coincides with the BLD.

### 5.2. LFM-W.

The analysis of *K* = 2 unrestricted LFM has been presented in Biehl et al. (2007). There, it was shown that LFM produces stable prototype configurations for finite learning rates. The projection of the prototypes lies parallel to the symmetry axis of the clusters, as displayed in Figure 4. However, the prototypes **w**_{1} and **w**_{2} retain components orthogonal to the two-dimensional subspace spanned by the cluster centers, indicated by *Q*_{ST} > *R*_{S+}*R*_{T+} + *R*_{S−}*R*_{T−}. The asymptotic generalization error is suboptimal and insensitive to the learning rate: the asymptotic decision boundary remains at an angle from the optimal hyperplane (see equation 5.1), independent of the learning rate. The Euclidean distance between the prototypes, given by the quantity defined in equation 5.2, is found to be proportional to the learning rate in the limit of infinitely many examples. In the limit of vanishing learning rate, the prototypes coincide, so this limit is not meaningful in LFM.

In this analysis, we observe that window schemes can dramatically improve the performance of LFM. When a window is used, the tilt of the decision boundary away from the optimal hyperplane (the angle in equation 5.1) is reduced, resulting in a lower generalization error. We observe that the error decreases along with decreasing window size, as displayed in the right panel of Figure 4. However, a critical window size exists below which LFM unexpectedly becomes divergent and no stationary state exists. Smaller windows filter the examples such that more repulsion is produced in the direction of the cluster centers, and we observe an asymptotically larger distance between the prototypes as the window size decreases. This is clearly observed in Figure 4. For a sufficiently small window, it is possible that the repulsion factor entirely outweighs the attractive factor. At the critical window size, LFM-W performs similarly to LVQ 2.1: the angle becomes zero, and the generalization error is close to the best achievable error.

Unlike the unrestricted case, the learning rate can influence the asymptotic performance. The learning rate and window size are indirectly related, as shown in the right panel of Figure 4. For example, learning with small learning rates requires smaller windows to achieve the optimal asymptotic error. Note that the influence of the window size depends heavily on the structure of the data. For various data models, efficient window settings may exist only in a very limited range, and window schemes may fail to improve the generalization performance while still maintaining stability.

### 5.3. GLVQ.

Apart from its influence on the overall learning rate, small *v*_{G} corresponds to a sharp peak of the soft window around the decision boundary, while large *v*_{G} corresponds to a very large window. Figure 5 displays the prototype lengths while using GLVQ: the soft window slows the strong repulsion of the prototype of the weaker cluster, as opposed to unrestricted LVQ 2.1. While both prototypes still diverge, because the cost function is not bounded from below (see equation 3.9), the asymptotic generalization error remains nontrivial (see Figure 5).

Note that *v*_{G} directly relates to the overall learning rate (refer to equation 3.12), which influences the level of noise in stochastic gradient procedures. In Figure 5, we compare results with respect to *v*_{G} while maintaining an equal overall learning rate. Performance deteriorates at smaller *v*_{G}, where training slows down at intermediate stages and converges at a higher error. However, very large *v*_{G} allows strong repulsion of the weaker prototype, which results in a nonmonotonic generalization error and long convergence times. Surprisingly, the soft GLVQ window is outperformed by the simple hard or crisp window of LVQ 2.1. This is caused by the long tail of the modulation function, which sums up to a large repulsion, whereas in the crisp window only data near the decision boundary are considered.

Figure 6 displays the cost function during learning. In the initial learning stages, the minimization of the cost function *E* leads to a fast decrease of the generalization error. However, while the cost function continues to decrease monotonically, the generalization error behaves nonmonotonically. While many techniques have been developed to improve minimization procedures for *E*, it is equally important to evaluate the choice of *E* itself and its correlation with the desired generalization performance.

### 5.4. RSLVQ.

Finally in this section, we study the influence of the softness parameter *v*_{soft} in the RSLVQ algorithm. Note that in Seo and Obermayer (2003), the learning rate and the softness parameter *v*_{soft} are treated independently using separate annealing schedules. In this section, we assume that the learning rate decreases proportionally with *v*_{soft}; that is, a fixed overall learning rate is maintained.

We first investigate model scenarios with equal cluster variances, *v*_{+} = *v*_{−}, and unbalanced priors. We observe the influence of *v*_{soft} on the learning curves, displayed in the left panel of Figure 7. The generalization error curve depends on *v*_{soft}: at large *v*_{soft}, it may exhibit nonmonotonic behavior, reminiscent of LVQ 2.1. Because of this behavior, the learning process may require long learning times before reaching the asymptotic configuration. This is an important consideration for practical applications, which often use early stopping strategies to avoid overtraining. Meanwhile, the algorithm minimizes the cost function *E* in equation 3.13 monotonically (see Figure 6). Thus, a decrease in *E* does not always result in a decrease of the generalization error.

A major advantage of the RSLVQ algorithm is the convergence of the prototypes: a stationary configuration of the order parameters exists for finite *v*_{soft}. The asymptotic configuration of the prototypes is displayed in Figure 8. In the limit of arbitrarily many examples, the softness parameter controls only the distance between the two prototypes: the prototype distance as defined in equation 5.2 decreases linearly with *v*_{soft}. Note that under the conditions *p*_{+} = 0.5, *v*_{soft} = *v*_{+} = *v*_{−}, and initialization of the prototypes on the symmetry axis, each prototype is located at its corresponding cluster center: the RSLVQ mixture model exactly matches the actual input density.

Figure 7 compares the asymptotic errors for finite learning rates (left) and small learning rates (right). In the former case, performance improves with large *v*_{soft}: at small *v*_{soft}, the system converges at a high generalization error, similar to LFM, while at larger *v*_{soft}, it approaches the best linear decision boundary. Meanwhile, at small learning rates, the asymptotic error becomes independent of *v*_{soft}. Therefore, given sufficiently small learning rates, RSLVQ becomes robust with regard to its softness parameter.

In the equal variance scenario, the asymptotic decision boundary always converges to the best linear decision boundary for all parameter settings, and RSLVQ outperforms both LFM and LVQ 2.1, as it provides robustness, stability, and a low generalization error.

On the other hand, a scenario with unequal class variances presents an interesting case where RSLVQ with a global *v*_{soft} fails to match the model. RSLVQ remains robust; the decision boundary converges to identical configurations for all settings of *v*_{soft} (see Figure 8). However, the asymptotic results are suboptimal. While RSLVQ is insensitive to the priors of the clusters, its performance with regard to the best achievable error is sensitive to the cluster variances; for example, at highly unbalanced cluster variances, RSLVQ generalizes poorly and is outperformed by the simpler LVQ 2.1. In practical applications, *v*_{soft} may be set locally for each prototype to accommodate such scenarios, but this case cannot be treated along the lines of the analysis here in a straightforward way.

## 6. Optimal Window Schedules

The formalism also allows us to determine locally optimal window schedules, in the sense that at each step of training, the window parameter maximizes the decrease of the generalization error. We write **O** for the set of order parameters. For a system with two prototypes and two clusters, differentiating the generalization error of equation 4.7, we obtain equation 6.1, with the quantities defined in equation 5.2 (see appendix D for the calculations).

We plug in the modulation function of the corresponding algorithm and numerically calculate the locally optimal window parameter from equation 6.1 at each learning step. We find that the learning curve is improved by an initially large window that is decreased during training, following the curve in Figure 9. This suggests that practical schedules with a gradual reduction of window sizes are indeed suitable for this particular learning problem.
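
Numerically, the locally optimal choice can be mimicked by a one-step lookahead over candidate window sizes, as in the following toy stand-in for equation 6.1 (the functions here are illustrative placeholders, not the paper's quantities):

```python
import numpy as np

def greedy_window(step_fn, eval_fn, w, deltas):
    """One-step lookahead: among candidate window sizes, return the one
    whose update yields the smallest (estimated) generalization error."""
    errors = [eval_fn(step_fn(w, d)) for d in deltas]
    return deltas[int(np.argmin(errors))]

# toy illustration with a scalar "order parameter" relaxing toward 1
step = lambda w, d: w + d * (1.0 - w)
err = lambda w: (w - 1.0) ** 2
best = greedy_window(step, err, 0.0, [0.1, 0.5, 1.0])
```

In the analytical treatment, the evaluation step is replaced by the exact derivative of the generalization error with respect to the order parameters, so no sampling is needed.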

While this approach locally minimizes the generalization error, it does not always minimize the error over an extended time span (i.e., yield a globally optimal schedule), which would require calculations along the lines of the variational optimization applied to learning rates in multilayered neural networks (see Biehl, 1994; Saad & Rattray, 1997). Obviously, a priori knowledge of the input density is not available in practical situations. Nevertheless, this minimization technique provides a bound on the achievable performance of the learning scheme for a given model.

Figure 7 shows that although a large *v*_{soft} for RSLVQ allows faster initial learning, it can also yield nonmonotonic learning curves. We can avoid the nonmonotonic behavior and maximize the decrease of the generalization error by applying a variational approach analogous to equation 6.1 in order to calculate a locally optimal softness parameter schedule. While fixing the overall learning rate, we produce the locally optimal softness schedule in Figure 9, where *v*_{soft} is initially large and decreases to saturate at a constant value. Note that this value depends on the learning rate; for example, it decreases with the learning rate. In calculations with vanishing learning rate, we obtain the limit *v*_{soft} → 0, which corresponds to the clearly suboptimal LFM. Therefore, an analysis of the optimal RSLVQ schedule requires finite learning rates.

## 7. Three-Prototype Systems

In this section we look at more generic analyses of LVQ algorithms by extending the previous systems to *K*=3 prototypes and *M* clusters, requiring a much larger set of order parameters. This allows an initial study on two important issues concerning practical applications of LVQ: multiclass problems and the use of multiple prototypes within a class.

We first look at multiclass problems with *N*_{c} = 3 classes. An example is shown in Figure 10 for LVQ 2.1 with *M* = 6 clusters selected with random variances and random deviations from the original class centers. The clusters are separable only in *M* out of *N* dimensions. In all our observations, we find that the behavior of *K* = 3 systems is qualitatively similar to that of *K* = 2 systems. For LVQ 2.1, the learning curves vary according to the window sizes, but the asymptotic generalization error is independent of the window size. Due to the presence of other prototypes, the repulsion on a weaker class prototype is reduced. However, the prototypes remain divergent (e.g., Figure 10). Meanwhile, for LFM-W, the asymptotic performance is sensitive to the window size, and the range of effective window sizes depends strongly on the learning parameters. For GLVQ, the prototypes are divergent with a higher asymptotic error than LVQ 2.1, and thus it performs poorly. Finally, for RSLVQ, the prototypes remain stable, and the asymptotic generalization performance is robust with regard to the setting of *v*_{soft}, but it is outperformed by LVQ 2.1. Hence, the results are consistent with the *K* = 2 system, and the preceding analysis remains qualitatively valid at least for systems of *M* clusters and one prototype per class within the model restrictions.

To allow more complex decision boundaries, practical LVQ applications frequently employ several prototypes within a class. We investigate a two-class system using *K* = 3 prototypes, two of which carry the same class label, and observe a nontrivial interaction between the similarly labeled prototypes, here **w**_{1} and **w**_{2}. While prototypes of different classes separate immediately in the initial training phase, prototypes of the same class remain identical in the *M*-dimensional subspace (see Figure 11). The latter prototypes differ only in dimensions that are not relevant for classification and produce a suboptimal decision boundary. This may proceed for a long learning period before these prototypes begin to specialize; each prototype then develops a larger overlap with a distinct group of clusters. The specialization phase produces a sudden decrease of the generalization error, displayed in the right panel of Figure 11. This phenomenon is highly reminiscent of symmetry-breaking effects observed in unsupervised learning, such as winner-takes-all vector quantization (VQ) (Biehl, 1994; Witoelar et al., 2008), or in multilayer neural networks (Saad & Solla, 1995).

Learning parameters highly influence the nature of the transition; for example, large learning rates and smaller windows prolong the unspecialized phase, and therefore they are critical to the success of learning. Symmetry breaking may require exceedingly long learning times, resulting in learning plateaus that dominate the training process and present a challenge in practical situations with very high-dimensional data. In more extreme circumstances, the system may not escape the unspecialized state at all, and the optimal classification cannot be obtained. Details of the symmetry-breaking properties with regard to parameters will be investigated in future publications.

## 8. Conclusion

We have investigated the learning behavior of LVQ 2.1, GLVQ, LFM-W, and RSLVQ using window schemes that work in high dimensions. The analysis is based on the theory of online learning applied to a model of high-dimensional isotropic clusters. Our findings demonstrate that the selection of proper window sizes is critical to efficient learning for all algorithms. Only given abundant data and an allowance for costly learning times does parameter selection become much less important.

Our analysis demonstrates the influence of windows on the learning curves and the advantages and drawbacks of each algorithm within the model scenarios. A summary is given in Table 1. Asymptotically, LVQ 2.1 achieves optimal performance in all scenarios, but stability remains an issue in terms of diverging prototypes. LFM-W shows a remarkable improvement in performance over LFM. Unfortunately, the introduction of a window may also influence its stability, and it is therefore highly parameter sensitive; only a narrow range of window sizes improves the overall performance. GLVQ behaves similarly to LVQ 2.1. While GLVQ reduces the initial strong overshooting of LVQ 2.1, the prototypes remain divergent, and GLVQ produces higher generalization errors or long convergence times. RSLVQ attempts to combine the advantages of both LFM and LVQ 2.1 by providing stability and optimal performance. However, an important issue of RSLVQ lies in its approximation of the data structure; it performs well when the actual input density consists of isotropic gaussian clusters with equal variance. If the actual input density departs from these assumptions, the results become suboptimal, and RSLVQ can even be outperformed by the simpler LVQ 2.1 and LFM-W. In all scenarios, RSLVQ displays robustness of its classification behavior with respect to the softness parameter, given sufficiently low learning rates.

| | LVQ 2.1 | LFM-W | GLVQ | RSLVQ |
|---|---|---|---|---|
| Stability | Divergent | Convergent^{a} | Divergent | Convergent |
| Sensitivity with regard to parameters | Robust | Dependent | Dependent | Robust |
| Generalization ability | Optimal | Suboptimal | Suboptimal | Suboptimal |


^{a}Under the condition that the window size exceeds the critical value (see section 5.2).

This analysis also allows a formal optimization of the window size during learning to ensure fast convergence. While in general various window sizes for LVQ 2.1 produce equal asymptotic errors, initial window sizes should be chosen large for faster convergence and decreased in the course of learning. Similarly, the optimal schedule for RSLVQ points to a gradual decrease of the softness parameter toward a particular saturation value, which agrees well with many practical scheduling schemes. However, locally optimal schedules do not always lead to globally optimal schedules (see, e.g., Saad & Rattray, 1997). In further work, we will develop efficient dynamic parameter adaptations, that is, optimal window schedules during online training, along the lines of variational optimization.

We show that the analysis remains valid for multiclass systems and arbitrary numbers of isotropic clusters. Additionally, using multiple prototype assignments within a class, we already observe the presence of learning plateaus in this highly simplified scenario. These phenomena carry over and could dominate the training process in practical situations with high degrees of freedom. Further investigations of more complex network architectures and nontrivial input distributions may also yield additional phenomena, such as competing stationary states of the system, and provide further insights into general LVQ behaviors.

## Appendix A: Statistics of the Projections

We collect the projections of an example ξ onto the prototypes and the cluster centers, *h*_{S} = **w**_{S} · ξ and *b*_{σ} = **B**_{σ} · ξ, into a *D*-dimensional vector, where *D* = *K* + *M*. In our analysis of online learning, we assume that ξ is statistically independent of **w**_{S}, because ξ is uncorrelated to all previous data. Therefore, the projections *h*_{S} and *b*_{σ} become correlated gaussian random quantities following the central limit theorem and can be fully described by their first and second moments—their conditional averages and conditional covariance matrix. We compute these averages in the following.

### A.1. First-Order Statistics.

### A.2. Second-Order Statistics.

## Appendix B: Form of the Differential Equations

### B.1. LVQ 2.1.

For two prototypes **w**_{S} and **w**_{T}, we can simplify the above expressions; the required averages over the joint density, equation B.1, then follow. The remaining quantities are calculated in appendix C.

### B.2. LFM-W.

Here, **w**_{S} and **w**_{T} are the winners of their respective classes, and the averages follow accordingly.

### B.3. GLVQ.

### B.4. RSLVQ.

## Appendix C: Gaussian Averages

### C.1. Two Prototypes.

#### C.1.1. LVQ 2.1, LFM-W.

#### C.1.2. GLVQ.

#### C.1.3. RSLVQ.

### C.2. Three Prototypes.

## Appendix D: Generalization Error

### D.1. Two Prototypes.

For a system with two prototypes **w**_{+} and **w**_{−}, we calculate the generalization error in terms of the order parameters; we refer to Biehl et al. (2004) for the calculations. Plugging in the values, we obtain the explicit expression quoted in section 5, and derivations with respect to the order parameters yield the gradients required in section 6. In the special case of *p*_{+} = *p*_{−} = 0.5 and *v*_{+} = *v*_{−} = *v*, the expressions simplify further.

### D.2. Three Prototypes.

For a system with three prototypes **w**_{S}, **w**_{T}, and **w**_{U}, we require an additional quantity, for which the averages are written in equation C.17.