A variety of modifications of learning vector quantization (LVQ) algorithms employ either crisp or soft windows for the selection of data. Although these schemes have been shown in practice to improve performance, theoretical studies of the influence of windows have so far been limited. Here we rigorously analyze the influence of windows in a controlled environment of gaussian mixtures in high dimensions. Concepts from statistical physics and the theory of online learning allow an exact description of the training dynamics, yielding typical learning curves, convergence properties, and achievable generalization abilities. We compare the performance and demonstrate the advantages of various algorithms, including LVQ 2.1, generalized LVQ (GLVQ), Learning from Mistakes (LFM), and Robust Soft LVQ (RSLVQ). We find that the choice of the window parameter strongly influences the learning curves but, surprisingly, not the asymptotic performances of LVQ 2.1 and RSLVQ. Although the prototypes of LVQ 2.1 exhibit divergent behavior, the resulting decision boundary coincides with the optimal decision boundary, thus yielding optimal generalization ability.
Learning vector quantization (LVQ) constitutes a family of learning algorithms for nearest-prototype classification of potentially high-dimensional data (Kohonen, 2001). The intuitive approach and computational efficiency of LVQ classifiers have motivated their application in various disciplines (see, e.g., Neural Networks Research Centre, 2002). Prototypes in LVQ algorithms represent typical features of a data set in the same feature space as the data themselves, in contrast to the black-box approach of many other classifiers (e.g., feedforward neural networks or support vector machines). This makes them attractive to researchers outside the field of machine learning. Other advantages of LVQ algorithms are that they are easy to implement for multiclass classification problems and that the complexity of the classifier can be adjusted during training as required.
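Throughout, classification follows the nearest-prototype rule. As a minimal illustration (the prototype positions and class labels below are invented for the example, not taken from the letter):

```python
import numpy as np

def nearest_prototype(xi, prototypes, labels):
    """Assign xi the class label of its closest prototype
    (squared Euclidean distance)."""
    dists = np.sum((prototypes - xi) ** 2, axis=1)
    return labels[int(np.argmin(dists))]

# Toy setup: one prototype per class in a two-dimensional feature space.
W = np.array([[1.0, 0.0], [-1.0, 0.0]])   # prototype positions
c = np.array([+1, -1])                     # their class labels
print(nearest_prototype(np.array([0.8, 0.3]), W, c))   # prints 1
```

All algorithms discussed below differ only in how the prototypes are moved during training, not in this classification rule.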
Numerous variants of the original LVQ prescriptions have been proposed for achieving better performance, such as LVQ 2.1 (Kohonen, 1990, 2001), LVQ 3 (Kohonen, 1990, 2001), generalized LVQ (GLVQ) (Hammer & Villmann, 2002; Sato & Yamada, 1995), and Robust Soft LVQ (RSLVQ) (Seo & Obermayer, 2003). Common themes of these modifications include an additional parameter that controls the selection of data to which the system is adapted and variation of the magnitude of prototype updates. We refer to these in general as window schemes. In the limiting case of hard or crisp learning schemes, updates are restricted to examples that fall into this window. For instance, LVQ 2.1 allows updates as long as the example is in the vicinity of the current decision boundary. Alternatively, learning schemes can implement a soft window (e.g., RSLVQ and GLVQ), which considers all examples but adapts the magnitude of the update according to their relative distances to the current decision boundary.
In general, the learning behavior of these strategies is not well understood. It is unclear how the convergence, stability, and achievable generalization ability of the different strategies compare. Fortunately, methods from statistical physics and the theory of online learning have recently allowed a systematic investigation of very large systems in the so-called thermodynamic limit. This approach has been successfully applied to, among others, feedforward neural networks, perceptron training, and principal component analysis (Biehl & Caticha, 2003; Engel & van den Broeck, 2001; Saad, 1999). A similar treatment of LVQ-type algorithms, such as LVQ 1, unsupervised VQ, and rank-based neural gas, was given in Biehl, Ghosh, and Hammer (2007) and Witoelar, Biehl, Ghosh, and Hammer (2008).
In this work, we closely examine the influence of window schemes for LVQ algorithms. Typical learning behavior is studied within a model situation of high-dimensional gaussian clusters and competing prototypes. From this analysis, we can observe typical learning curves and the convergence properties, that is, the asymptotic behavior in the limit of an arbitrarily large number of examples.
Typically the window parameters are selected heuristically or derived from prior knowledge of the data and are kept fixed during training. Optimal parameter settings are then chosen according to a computationally expensive validation procedure. It is also possible to treat the hyperparameters as dynamic properties during learning, for example, by means of an annealing schedule (Seo & Obermayer, 2006) or a gradient-based optimization method (Bengio, 2000). Using the model described in this letter, one can investigate the optimality of the parameters for both fixed and dynamic settings in representative model situations.
In this framework, we formally exploit the thermodynamic limit corresponding to very high-dimensional data. This has simplifying consequences that will be present throughout the letter. Note that in random subspace projections, data from different clusters overlap completely and are not separable. The clusters become apparent at most in the low, M-dimensional subspace spanned by the cluster center vectors. The nontrivial goal is to identify this subspace from the N-dimensional data.
We draw attention to the scaling of the model. The anisotropy of this data distribution is very weak: the mean of cluster σ is a vector whose length is of order 1, whereas the average squared length of the data vectors is of order N.
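To make this weak-anisotropy scaling concrete, a short numerical sketch may help; the dimension N, the unit variance, and the unit center length are illustrative choices only:

```python
import numpy as np

# Numerical sketch of the scaling argument above.
rng = np.random.default_rng(0)
N = 1000
B = np.zeros(N)
B[0] = 1.0                     # unit-length cluster center direction
ell = 1.0                      # O(1) length of the cluster mean

xi = ell * B + rng.normal(0.0, 1.0, size=N)   # one data vector

print(np.linalg.norm(ell * B))  # length of the mean: O(1)
print(xi @ xi)                  # squared length of the data: O(N)
```

Projections of the data onto randomly chosen directions thus look like pure noise; only the direction of the cluster mean carries the order-1 signal.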
Obviously this model is greatly simplified compared to practical situations. However, gaussian mixtures are a common modeling technique in many practical scenarios, and this setting represents an ideal environment in which to analyze the learning algorithms considered here. We can expect that algorithms that do not perform well on this idealized model will also be inappropriate for real-life problems. Although more complex behaviors are expected in practical applications, the nontrivial effects already observed in this model will clearly influence the outcome under more general circumstances.
3.1. LVQ 2.1.
LVQ 2.1, proposed by Kohonen, aims at efficient separation of prototypes of different classes and has been shown to provide good classification results (Kohonen, 1990; Neural Networks Research Centre, 2002). Given an example, the two nearest prototypes wS and wT are updated if the following conditions are met: (1) the classes cS and cT are different, and (2) either cS or cT equals the class of the example. The prototype with the correct class is moved toward the data, while the other is moved farther away: fS = +1, fT = −1 if cS equals the class of the example; fS = −1, fT = +1 otherwise.
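A single LVQ 2.1 step with a crisp window can be sketched as follows. The window test min(dS/dT, dT/dS) > (1 − δ)/(1 + δ) is Kohonen's standard relative-window form; the learning rate, the window width δ, and all variable names are illustrative assumptions:

```python
import numpy as np

def lvq21_step(xi, y, W, c, eta=0.1, delta=0.3):
    """One LVQ 2.1 step; returns True iff the example was used."""
    d = np.sum((W - xi) ** 2, axis=1)
    order = np.argsort(d)
    s, t = order[0], order[1]              # the two nearest prototypes
    if c[s] == c[t] or y not in (c[s], c[t]):
        return False                       # conditions (1) and (2)
    if min(d[s] / d[t], d[t] / d[s]) <= (1.0 - delta) / (1.0 + delta):
        return False                       # example outside the window
    f_s = 1.0 if c[s] == y else -1.0
    W[s] += eta * f_s * (xi - W[s])        # attract the correct winner
    W[t] -= eta * f_s * (xi - W[t])        # repel the wrong winner
    return True
```

Updates thus occur only for examples that fall near the current decision boundary between two differently labeled prototypes.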
We sum over all prototypes, and additional terms enforce the window condition. The product term singles out instances where wS and wT are the two closest prototypes. This form of fS allows for the analysis given in section 4.
A simple modification to overcome the stability problems of LVQ 2.1 is to restrict updates to misclassified examples. In analogy to perceptron learning, we term this update rule Learning from Mistakes (LFM). Here, the closest prototype wJ with the same class (correct winner) and the closest prototype wK with a different class (incorrect winner) are updated with fJ = +1 and fK = −1 if the example is misclassified. Conversely, if the winning prototype is already correct, the configuration is left unchanged. This prescription can be interpreted as a limiting case of cost-function-based Robust Soft LVQ (RSLVQ), which will be explained later in this section. Because the cost function of RSLVQ is bounded from below, stability can also be expected for LFM.
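Under the same conventions as the sketch above, an LFM step might look as follows; this is a hedged sketch of the rule as described, not an official implementation:

```python
import numpy as np

def lfm_step(xi, y, W, c, eta=0.1):
    """One LFM step: update only if the nearest prototype misclassifies."""
    d = np.sum((W - xi) ** 2, axis=1)
    if c[int(np.argmin(d))] == y:
        return False                       # correct winner: no update
    same = np.where(c == y)[0]
    other = np.where(c != y)[0]
    j = same[np.argmin(d[same])]           # closest correct prototype
    k = other[np.argmin(d[other])]         # closest incorrect prototype
    W[j] += eta * (xi - W[j])              # f_J = +1: attract
    W[k] -= eta * (xi - W[k])              # f_K = -1: repel
    return True
```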
Thus, the above averages reduce to gaussian integrations in K + M dimensions and can be expressed in closed form (see appendix B). For various algorithms and a system with two competing prototypes, the averages can be calculated analytically. For three or more prototypes, the mathematical treatment becomes more involved and requires multiple numerical integrations.
5. A Simple Case: Two Prototypes, Two Clusters
In this section, we discuss in detail the results of the analysis for the simplest nontrivial problem: two-prototype LVQ 2.1, GLVQ, LFM-W, and RSLVQ systems with M = 2 and one gaussian cluster per class. The model data are given in section 2. For simplicity, we denote the two clusters by σ = ± and, without loss of generality, choose the center vectors B+ and B− orthonormal, that is, Bi · Bj = 1 if i = j and 0 otherwise.
We place an emphasis on the asymptotic behavior, that is, the achieved performance in the limit of an arbitrarily large number of examples. The asymptotic generalization error scales with the learning rate, analogous to minimizing a cost function in stochastic gradient descent procedures. For LVQ 2.1 and RSLVQ, the best achievable generalization error is obtained in the simultaneous limit of small learning rates and correspondingly rescaled training times. However, this limit is not meaningful for LFM, as will be explained later.
In this simple scenario, it is possible to exactly calculate the best linear decision boundary (BLD) by linear approximation of the Bayes-optimal decision boundary (see Biehl, Freking, Ghosh, & Reents, 2004, for the calculations). We compare the results of each algorithm to the best linearly achievable error.
5.1. LVQ 2.1.
We first examine two-prototype systems: K = 2. Figure 2 illustrates the evolution of the order parameters under the influence of a window and the trajectories of the prototypes projected onto the (B+, B−) subspace. Without additional constraints, LVQ 2.1 with two prototypes displays strongly divergent behavior in a system with unbalanced data, p+ ≠ p−. The repulsion factor dominates for the prototype representing the weaker cluster, here w2. The order parameters associated with this prototype increase exponentially with the number of examples. Asymptotically, w2 lies arbitrarily far away from the cluster centers, and the generalization error becomes trivial.
When the window scheme is implemented, w2 is repelled until the data densities of both classes within the window become more balanced. Subsequently, the order parameters change with more balance between the two prototypes. The repulsion factor still dominates its counterpart; therefore, both prototypes still diverge: the order parameters of both prototypes change linearly with the number of examples at late stages of training, but the decision boundary remains stable. Trivial classification is prevented (see the generalization error curves in the left panel of Figure 3). Obviously, for smaller windows, a considerable amount of data is filtered out, and the initial learning stages slow down significantly. Meanwhile, for large windows, the generalization error becomes nonmonotonic and converges more slowly.
In this analysis, we observe that window schemes can dramatically improve the performance of LFM. When a window is used, the tilt of the decision boundary away from the optimal hyperplane (see equation 5.1) is reduced, resulting in a lower generalization error. We observe that the asymptotic error decreases along with the window size, as displayed in the right panel of Figure 4. However, a critical window size exists below which LFM unexpectedly becomes divergent and no stationary state exists. Smaller windows select examples that produce more repulsion in the direction of the cluster centers, and the repulsion grows asymptotically as the window decreases. This is clearly observed in Figure 4. Given a sufficiently small window, it is possible that the repulsive factor entirely outweighs the attractive factor. Close to the critical window size, LFM-W performs similarly to LVQ 2.1: the tilt angle becomes zero, and the generalization error is close to the best achievable error.
Unlike in the unrestricted case, the learning rate can influence the asymptotic performance. The learning rate and window size are indirectly related, as shown in the right panel of Figure 4. For example, learning with small learning rates requires smaller windows to achieve the optimal asymptotic error. Note that the influence of the window size depends heavily on the structure of the data. For various data models, efficient window settings may exist only in a very limited range, and window schemes may be ineffective at improving generalization performance while maintaining stability.
Apart from its influence on the overall learning rate, small vG corresponds to a sharp window peaked around the decision boundary, while large vG corresponds to a very wide window. Figure 5 displays the prototype lengths during GLVQ training: the soft window slows the strong repulsion of the prototype of the weaker cluster, as opposed to unrestricted LVQ 2.1. While both prototypes still diverge because the cost function is not bounded from below (see equation 3.9), the asymptotic generalization error remains nontrivial (see Figure 5).
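For concreteness, a GLVQ step with a sigmoidal cost can be sketched as below. The derivative of the sigmoid acts as the soft window: it is largest where the relative distance difference μ is near zero, that is, near the decision boundary. The choice Φ(μ) = 1/(1 + exp(−μ/vG)) and the exact prefactors are our assumptions; conventions vary in the literature:

```python
import numpy as np

def glvq_step(xi, y, W, c, eta=0.1, v_g=1.0):
    """One GLVQ step; returns the relative distance difference mu."""
    d = np.sum((W - xi) ** 2, axis=1)
    same = np.where(c == y)[0]
    other = np.where(c != y)[0]
    j = same[np.argmin(d[same])]           # closest correct prototype
    k = other[np.argmin(d[other])]         # closest incorrect prototype
    mu = (d[j] - d[k]) / (d[j] + d[k])     # in [-1, 1]; < 0 means correct
    phi = 1.0 / (1.0 + np.exp(-mu / v_g))  # sigmoidal cost Phi(mu)
    window = phi * (1.0 - phi) / v_g       # Phi'(mu): the soft window
    denom = (d[j] + d[k]) ** 2
    W[j] += eta * window * (2.0 * d[k] / denom) * (xi - W[j])
    W[k] -= eta * window * (2.0 * d[j] / denom) * (xi - W[k])
    return mu
```

In contrast to the crisp window of LVQ 2.1, every example contributes here, with a weight that decays smoothly with its distance from the boundary.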
Note that vG directly relates to the overall learning rate (see equation 3.12), which influences the level of noise in stochastic gradient procedures. In Figure 5, we compare results with respect to vG while maintaining an equal overall learning rate. Performance deteriorates at smaller vG, where training slows down at intermediate stages and converges at a higher error. Conversely, very large vG allows strong repulsion of the weaker prototype, which results in nonmonotonic learning curves and long convergence times. Surprisingly, the soft GLVQ window is outperformed by the simple hard, or crisp, window of LVQ 2.1. This is caused by the long tail of the modulation function, which accumulates into a large repulsion, whereas the crisp window considers only data near the decision boundary.
Figure 6 displays the cost function during learning. In the initial learning stages, the minimization of the cost function E leads to a fast decrease of the generalization error. However, while the cost function continues to decrease monotonically, the generalization error behaves nonmonotonically. While many techniques have been developed to improve minimization procedures for E, it is equally important to evaluate the choice of E itself and its correlation with the desired generalization performance.
Finally in this section, we study the influence of the softness parameter vsoft in the RSLVQ algorithm. Note that in Seo and Obermayer (2003), the learning rate and the softness parameter vsoft are treated independently using separate annealing schedules. In this section, we assume that the learning rate decreases proportionally with vsoft; that is, a fixed overall learning rate is maintained.
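The RSLVQ update (after Seo & Obermayer, 2003) can be sketched numerically as follows, assuming an isotropic gaussian mixture with a single global softness vsoft; the prefactor conventions are a simplifying assumption:

```python
import numpy as np

def rslvq_step(xi, y, W, c, eta=0.05, v_soft=1.0):
    """One RSLVQ step; assumes at least one prototype per class."""
    d = np.sum((W - xi) ** 2, axis=1)
    g = np.exp(-d / (2.0 * v_soft))        # unnormalized gaussian factors
    p_all = g / g.sum()                    # P(j | xi): full mixture
    g_cor = np.where(c == y, g, 0.0)
    p_cor = g_cor / g_cor.sum()            # P_y(j | xi): correct class only
    for j in range(len(W)):
        if c[j] == y:
            W[j] += (eta / v_soft) * (p_cor[j] - p_all[j]) * (xi - W[j])
        else:
            W[j] -= (eta / v_soft) * p_all[j] * (xi - W[j])
```

In the limit vsoft → 0, the posteriors become hard assignments and the rule reduces to LFM, consistent with the limiting case mentioned above.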
We first investigate model scenarios with equal cluster variances v+ = v− and unbalanced priors p+ ≠ p−. We observe the influence of vsoft on the learning curves, displayed in the left panel of Figure 7. The generalization error curve depends on vsoft: at large vsoft, it may exhibit nonmonotonic behavior, reminiscent of LVQ 2.1. Because of this behavior, the learning process may require long learning times before reaching the asymptotic configuration. This is an important consideration for practical applications, which often use early-stopping strategies to avoid overtraining. Meanwhile, the algorithm minimizes the cost function E in equation 3.13 monotonically (see Figure 6). Thus, a decrease in E does not always result in a decrease of the generalization error.
A major advantage of the RSLVQ algorithm is the convergence of the prototypes: a stationary configuration of order parameters exists for finite vsoft. The asymptotic configuration of prototypes is displayed in Figure 8. In the stationary state, the softness parameter controls only the distance between the two prototypes: the distance defined in equation 5.2 decreases linearly with vsoft. Note that under the conditions p+ = 0.5, vsoft = v+ = v−, and initialization of the prototypes on the symmetry axis, each prototype is located at its corresponding cluster center: the RSLVQ mixture model exactly matches the actual input density.
Figure 7 compares the asymptotic errors for a fixed finite learning rate (left) and for small learning rates (right). In the former case, performance improves with large vsoft: at small vsoft, the system converges at a high generalization error, similar to LFM, while at larger vsoft, it approaches the best linear decision boundary. Meanwhile, at small learning rates, the asymptotic error becomes independent of vsoft. Therefore, given sufficiently small learning rates, RSLVQ is robust with regard to its softness parameter.
In the equal variance scenario, the asymptotic decision boundary converges to the best linear decision boundary for all settings of vsoft, and RSLVQ outperforms both LFM and LVQ 2.1, as it provides robustness, stability, and low generalization error.
On the other hand, a scenario with unequal class variances presents an interesting case in which RSLVQ with a global vsoft fails to match the model. RSLVQ remains robust; the decision boundary converges to identical configurations for all settings of vsoft (see Figure 8). However, the asymptotic results are suboptimal. While RSLVQ is insensitive to the priors of the clusters, its performance with regard to the best achievable error is sensitive to the cluster variances; for example, at highly unbalanced variances, RSLVQ generalizes poorly and is outperformed by the simpler LVQ 2.1. In practical applications, vsoft may be set locally for each prototype to accommodate such scenarios, but this case cannot be treated in a straightforward way along the lines of the analysis here.
6. Optimal Window Schedules
We plug in the modulation function of the corresponding algorithm and numerically calculate the locally optimal window size from equation 6.1 at each learning step. We find that the learning curve is improved by an initially large window size that is decreased during training, following the curve in Figure 9. This suggests that practical schedules with gradual reduction of window sizes are indeed suitable for this particular learning problem.
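Equation 6.1 evaluates the locally optimal window from the model's order parameters, which are not observable in practice. As a loose, data-driven analogue (entirely our assumption, not the procedure used in the text), one could greedily pick, at each step, the candidate window whose update most reduces a validation-error estimate:

```python
import numpy as np

def val_error(W, c, X, Y):
    """Error of the nearest-prototype classifier (W, c) on a validation set."""
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return float(np.mean(c[np.argmin(d, axis=1)] != Y))

def greedy_window_step(step, xi, y, W, c, X_val, Y_val, deltas):
    """Try the windowed LVQ update `step` once per candidate window size
    and keep the prototypes that give the lowest validation error."""
    trials = []
    for delta in deltas:
        W_try = W.copy()
        step(xi, y, W_try, c, delta)       # user-supplied windowed update
        trials.append((val_error(W_try, c, X_val, Y_val), delta, W_try))
    err, delta, W_new = min(trials, key=lambda t: t[0])
    return delta, W_new
```

In the analytical treatment, the same greedy choice is made exactly through equation 6.1 rather than through a validation set.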
While this approach locally minimizes the generalization error, it does not always lead to minimization of the generalization error over an extended time span (i.e., a globally optimal schedule), which requires calculations along the lines of variational optimization (see Biehl, 1994; Saad & Rattray, 1997), as applied to optimal learning rates in multilayered neural networks. Obviously, a priori knowledge of the input density is not available in practical situations. Nevertheless, this minimization technique provides an upper bound on the achievable performance of the learning scheme for a given model.
Figure 7 shows that although large vsoft allows faster initial learning for RSLVQ, it can also yield nonmonotonic learning curves. We can avoid the nonmonotonic behavior and maximize the decrease of the generalization error by applying a variational approach analogous to equation 6.1 to calculate a locally optimal softness parameter schedule. Fixing the learning rate, we obtain the locally optimal softness schedule in Figure 9, where vsoft is initially large and decreases to saturate at a constant value. Note that this saturation value depends on the learning rate; for example, it decreases with the learning rate. In calculations with vanishing learning rate, we obtain the limit vsoft → 0, which is the clearly suboptimal LFM. Therefore, an analysis of the optimal RSLVQ schedule requires finite learning rates.
7. Three-Prototype Systems
In this section we extend the previous analysis to more generic LVQ systems with K = 3 prototypes and M clusters, which requires a much larger set of order parameters. This allows an initial study of two important issues concerning practical applications of LVQ: multiclass problems and the use of multiple prototypes within a class.
We first look at multiclass problems with Nc = 3 classes. An example is shown in Figure 10 for LVQ 2.1 with M = 6 clusters with randomly selected variances and random deviations from the original class centers. The clusters are separable only in M out of N dimensions. In all our observations, we find that the behavior of K = 3 systems is qualitatively similar to that of K = 2 systems. For LVQ 2.1, the learning curves vary according to the window sizes, but the asymptotic generalization error is independent of the window size. Due to the presence of other prototypes, the repulsion on a weaker-class prototype is reduced. However, the prototypes remain divergent (e.g., Figure 10). Meanwhile, for LFM-W, the asymptotic performance is sensitive to the window size, whose range of effective values depends strongly on the learning parameters. For GLVQ, the prototypes are divergent with a higher asymptotic error than LVQ 2.1, and thus it performs poorly. Finally, for RSLVQ, the prototypes remain stable, and the asymptotic generalization performance is robust with regard to the setting of vsoft, but it is outperformed by LVQ 2.1. Hence, the results are consistent with the K = 2 system, and the preceding analysis remains qualitatively valid at least for systems of M clusters and one prototype per class within the model restrictions.
To allow more complex decision boundaries, practical LVQ applications frequently employ several prototypes per class. We investigate a two-class system using K = 3 prototypes and observe the nontrivial interaction between similarly labeled prototypes, here w1 and w2. While prototypes of different classes separate immediately in the initial training phase, prototypes of the same class remain identical in the M-dimensional subspace (see Figure 11). The latter prototypes differ only in dimensions that are not relevant for classification and produce a suboptimal decision boundary. This may persist for a long learning period before these prototypes begin to specialize, with each prototype developing a larger overlap with a distinct group of clusters. The specialization phase produces a sudden decrease of the generalization error, displayed in the right panel of Figure 11. This phenomenon is highly reminiscent of symmetry-breaking effects observed in unsupervised learning, such as winner-takes-all vector quantization (VQ) (Biehl, 1994; Witoelar et al., 2008), and in multilayer neural networks (Saad & Solla, 1995).
Learning parameters highly influence the nature of the transition; for example, large learning rates and smaller windows prolong the unspecialized phase, and therefore they are critical to the success of learning. Symmetry breaking may require exceedingly long learning times, resulting in learning plateaus that dominate the training process and present a challenge in practical situations with very high-dimensional data. In more extreme circumstances, the system may not escape the unspecialized state at all, and the optimal classification cannot be obtained. Details of the symmetry-breaking properties with regard to parameters will be investigated in future publications.
We have investigated the learning behavior of LVQ 2.1, GLVQ, LFM-W, and RSLVQ with window schemes in high dimensions. The analysis is based on the theory of online learning applied to a model of high-dimensional isotropic clusters. Our findings demonstrate that the proper selection of window sizes is critical for efficient learning in all algorithms. Given more available data and an allowance for costly learning times, parameter selection becomes much less important.
Our analysis demonstrates the influence of windows on the learning curves and the advantages and drawbacks of each algorithm within the model scenarios. A summary is given in Table 1. Asymptotically, LVQ 2.1 achieves optimal performance in all scenarios, but stability remains an issue in terms of diverging prototypes. LFM-W shows a remarkable improvement in performance over LFM. Unfortunately, the introduction of a window may also affect its stability, making it highly parameter sensitive; only a narrow range of window sizes improves the overall performance. GLVQ behaves similarly to LVQ 2.1. While GLVQ reduces the initial strong overshooting of LVQ 2.1, the prototypes remain divergent, and GLVQ produces higher generalization errors or long convergence times. RSLVQ attempts to combine the advantages of both LFM and LVQ 2.1 by providing stability and optimal performance. However, an important issue of RSLVQ lies in its approximation of the data structure; it performs well when the actual input density consists of isotropic gaussian clusters with equal variance. If these assumptions depart from the input density, the results become suboptimal, and RSLVQ can even be outperformed by the simpler LVQ 2.1 and LFM-W. In all scenarios, RSLVQ displays robustness of its classification behavior with respect to the softness parameter, given sufficiently low learning rates.
| | LVQ 2.1 | LFM-W | GLVQ | RSLVQ |
| --- | --- | --- | --- | --- |
| Sensitivity with regard to parameters | Robust | Dependent | Dependent | Robust |
a. Under the condition of sufficiently small learning rates.
This analysis also allows a formal optimization of the window size during learning to ensure fast convergence. While in general various window sizes for LVQ 2.1 produce equal asymptotic errors, initial window sizes should be chosen large for faster convergence speed and decreased in the course of learning. Similarly, an optimal schedule for RSLVQ points to a gradual decrease of the softness parameter to a particular saturation value, which agrees well with many practical scheduling schemes. However, locally optimal schedules do not always lead to the globally optimal schedules (see, e.g., Saad & Rattray, 1997). In further work, we will develop efficient dynamic parameter adaptations, that is, optimal window schedules during online training along the lines of variational optimization.
We show that the analysis remains valid for multiclass systems and arbitrary numbers of isotropic clusters. Additionally, using multiple prototype assignments within a class, we already observe the presence of learning plateaus in this highly simplified scenario. These phenomena carry over and could dominate the training process in practical situations with high degrees of freedom. Further investigation of more complex network architectures and nontrivial input distributions may also reveal additional phenomena, such as competing stationary states of the system, and provide further insight into general LVQ behavior.