Abstract

Many pattern analysis problems require classification of examples into naturally ordered classes. In such cases, nominal classification schemes will ignore the class order relationships, which can have a detrimental effect on classification accuracy. This article introduces two novel ordinal learning vector quantization (LVQ) schemes, with metric learning, specifically designed for classifying data items into ordered classes. In ordinal LVQ, unlike in nominal LVQ, the class order information is used during training in selecting the class prototypes to be adapted, as well as in determining the exact manner in which the prototypes get updated. Prototype-based models in general are more amenable to interpretation and can often be constructed at a smaller computational cost than alternative nonlinear classification models. Experiments demonstrate that the proposed ordinal LVQ formulations compare favorably with their nominal counterparts. Moreover, our methods achieve competitive performance against existing benchmark ordinal regression models.

1.  Introduction

Most classification algorithms focus on predicting data labels from nominal (nonordered) classes. However, many pattern recognition problems involve classifying data into classes that have a natural ordering. This type of problem, known as ordinal classification or ordinal regression, is commonly seen in several real-life applications, including information retrieval (Chu & Keerthi, 2007) and medical analysis (Cardoso, Pinto da Costa, & Cardoso, 2005). In such problems, although it is still possible to use conventional (nominal) methods, the order relation among the classes will be ignored, which may affect the stability of learning and the overall prediction accuracy.

A lot of effort has already been devoted to the problem of ordinal classification in the machine learning literature. A simple approach is to convert ordinal regression into a set of nested binary classification problems that encode the ordering of the original ranks. The results of these nested binary classifications are then combined to produce the overall label predictions. For example, Frank and Hall (2001) employ binary classification tree learners, while Waegeman and Boullart (2006) use binary support vector machines (SVMs).

Another stream of ordinal regression research assumes that ordinal labels originate from coarse measurements of a continuous variable. The labels are thus associated with intervals on the real line. A group of algorithms known as threshold models focuses on two main issues:

  • How to find the optimal projection line, representing the assumed linear order of classes, onto which the input data will be projected

  • How to optimally position the thresholds defining the label intervals so that the margin of separation between neighboring classes is maximized

For example, in the SVM context, a class of models under the name of support vector ordinal regression (SVOR) has been developed. Shashua and Levin (2002) proposed two large-margin principles: the fixed-margin principle, in which the margin of the closest pair of classes is maximized, leading to equal margins between neighboring classes (an assumption that is too strict in most cases), and the sum-of-margins principle, which allows for different margins and maximizes only the sum of all K − 1 margins (assuming there are K ordered categories). However, the order on the K − 1 class thresholds was not imposed. This work was therefore extended in the SVOR with EXplicit ordering Constraints (SVOR-EXC) formulation (Chu & Keerthi, 2007), where the order of class thresholds is considered explicitly. Furthermore, Chu and Keerthi (2007) also presented an alternative SVOR model: SVOR with IMplicit ordering Constraints (SVOR-IMC).

Building on the SVOR-EXC and SVOR-IMC methods, Li and Lin (2007) and Lin and Li (2012) presented a framework that reduces ordinal regression to binary classification over extended examples. We refer to this model as REDuction-SVM (RED-SVM). This work was extended into another reduction framework known as weighted LogitBoost (Xia, Zhou, Yang, & Zhang, 2007).

However, the SVM-based algorithms all suffer from high computational complexity (in the number of training points) (Sun, Li, Wu, Zhang, & Li, 2010). Therefore, Sun et al. introduced a non-SVM model with lower computational complexity: kernel discriminant learning for ordinal regression (KDLOR).

In this article, we propose two novel learning vector quantization–based learning models specifically designed for classifying data into ordered classes. Learning vector quantization (LVQ), originally introduced by Kohonen (1986, 1998), constitutes a family of supervised multiclass classification algorithms. Classifiers are parameterized by a set of prototypical vectors representing the classes in the input space, together with a distance measure on the input data.1 In the classification phase, an unknown sample is assigned to the class represented by the closest prototype. Compared to SVM-type methods, prototype-based models in general are more amenable to interpretation and can be constructed at a smaller computational cost. The function of such classifiers can be more directly understood because of the intuitive classification of data points to the class of their closest prototype (under a given metric).

In particular, we extend the recently proposed modifications of LVQ, termed matrix LVQ (MLVQ) and generalized matrix LVQ (GMLVQ) (Schneider, Biehl, & Hammer, 2009; Schneider, 2010), to the case of ordinal classification. In MLVQ/GMLVQ, the prototype positions, as well as the (global) metric in the data space, can be modified. Unlike in nominal LVQ, in the proposed ordinal LVQ, the class order information is used during training in the selection of the class prototypes to be adapted, as well as in determining the exact manner in which the prototypes get updated. To the best of our knowledge, this article presents the first attempt at extending the LVQ model with metric learning to ordinal regression.

This article is organized as follows. Section 2 briefly introduces the LVQ-based methods related to this study. In section 3 we introduce two novel ordinal LVQ approaches for classifying data with ordered labels. Experimental results are presented and discussed in section 4. Section 5 concludes the study by summarizing the key contributions.

2.  Learning Vector Quantization and Its Extensions

Learning vector quantization (LVQ) constitutes a family of supervised learning algorithms that uses Hebbian online learning to adapt prototypes to the training data (Kohonen, 1998). The original version, named LVQ1, was introduced by Kohonen (1986).

Assume training data {(xi, yi)} ⊂ ℝm × {1, …, K}, i = 1, 2, …, n, is given, with m denoting the data dimensionality and K the number of different classes. A typical LVQ network consists of L prototypes wq ∈ ℝm, q = 1, 2, …, L, characterized by their location in the input space and their class label c(wq) ∈ {1, …, K}. Obviously, at least one prototype per class needs to be included in the model. The overall number of prototypes is a model (hyper)parameter, optimized, for example, in a data-driven manner through a validation process.

Given a distance measure d(xi, w) in ℝm (i.e., the distance of the input sample xi to the different prototypes w), classification is based on a winner-take-all scheme: a data point x is assigned the label c(wj) of the prototype wj with d(x, wj) < d(x, wq), ∀ q ≠ j.

Each prototype wj with class label c(wj) will represent a receptive field in the input space.2 Points in the receptive field of prototype wj will be assigned class c(wj) by the LVQ model. The goal of learning is to adapt prototypes automatically such that the distances between data points of class c ∈ {1, …, K} and the corresponding prototypes with label c (to which the data belong) are minimized.

In the training phase, for each data point xi with class label c(xi), the closest prototype with the same label is rewarded by pushing it closer to the training input, and the closest prototype with a different label is penalized by moving it away from the pattern xi.
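For illustration, the following minimal Python sketch shows the winner-take-all classification rule and a single LVQ1-style attract/repel update under the squared Euclidean distance. All names are illustrative and do not refer to the cited implementations.

```python
import numpy as np

def classify(x, prototypes, labels):
    """Winner-take-all: return the label of the closest prototype."""
    d = np.sum((prototypes - x) ** 2, axis=1)   # squared Euclidean distances
    return labels[np.argmin(d)]

def lvq1_step(x, y, prototypes, labels, eta=0.05):
    """One LVQ1 update: attract the winner if its label matches y, else repel it."""
    d = np.sum((prototypes - x) ** 2, axis=1)
    j = int(np.argmin(d))                       # index of the winning prototype
    sign = 1.0 if labels[j] == y else -1.0
    prototypes[j] += sign * eta * (x - prototypes[j])
```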

Several modifications of this basic learning scheme have been proposed, aiming to achieve a better approximation of decision boundaries or faster and more robust convergence (Sato & Yamada, 1996; Hammer & Villmann, 2002). Many LVQ variants use the squared Euclidean distance d2(x, w) = (x − w)T(x − w) as a distance measure between prototypes and feature vectors. However, the use of the Euclidean distance can be problematic in the case of high-dimensional, heterogeneous data sets, where different scalings and correlations of dimensions can be observed. Recently, special attention was paid to schemes for manipulating the input space metric used to quantify the similarity between prototypes and feature vectors (Schneider et al., 2009; Hammer & Villmann, 2002). Generalized relevance LVQ (GRLVQ), introduced in Hammer and Villmann (2002), proposed an adaptive diagonal matrix acting as the metric tensor of a (dis)similarity distance measure. This was extended in matrix LVQ (MLVQ) and generalized matrix LVQ (GMLVQ) (Schneider et al., 2009; Schneider, 2010), which use a fully adaptive metric tensor. Metric learning in the LVQ context has been shown to have a positive impact on the stability of learning and the classification accuracy (Schneider et al., 2009; Schneider, 2010).

2.1.  Matrix LVQ.

Matrix LVQ (MLVQ; Schneider, 2010) is a heuristic extension of the basic LVQ1 (Kohonen, 1986) with a full (i.e., not only diagonal) matrix tensor–based distance measure. Given an (m × m) positive-definite matrix Λ, the algorithm uses a generalized form of the squared Euclidean distance,
$$d_{\Lambda}(\mathbf{x}, \mathbf{w}) = (\mathbf{x} - \mathbf{w})^{T} \Lambda\, (\mathbf{x} - \mathbf{w}). \tag{2.1}$$

Positive definiteness of Λ can be achieved by substituting Λ = ΩTΩ, where Ω ∈ ℝl×m, 1 ⩽ l ⩽ m, is a full-rank matrix. Furthermore, Λ needs to be normalized after each learning step to prevent the algorithm from degeneration.

For each training pattern xi, the algorithm implements Hebbian updates for the closest prototype w and for the metric parameter Ω. If c(xi) = c(w), then w is attracted toward xi; otherwise w is repelled (for more details, see Schneider, 2010).
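A rough sketch of the adaptive distance of equation 2.1 and one heuristic MLVQ step is given below. The update directions mirror those listed in the OMLVQ pseudocode of section 3.3 with the weight factors dropped; names and learning rates are illustrative, not the reference implementation.

```python
import numpy as np

def d_lambda(x, w, Omega):
    """Generalized squared Euclidean distance (x - w)^T Omega^T Omega (x - w)."""
    diff = Omega @ (x - w)
    return float(diff @ diff)

def mlvq_step(x, y, prototypes, labels, Omega, eta_w=0.05, eta_o=0.005):
    """Attract or repel the closest prototype and adapt the metric accordingly."""
    d = np.array([d_lambda(x, w, Omega) for w in prototypes])
    j = int(np.argmin(d))
    sign = 1.0 if labels[j] == y else -1.0       # +1: correct winner, -1: incorrect
    diff = x - prototypes[j]
    Lam = Omega.T @ Omega                        # metric tensor Lambda
    prototypes[j] += sign * eta_w * Lam @ diff   # move prototype along the metric
    Omega -= sign * eta_o * Omega @ np.outer(diff, diff)   # shrink/grow d_Lambda
    Omega /= np.sqrt(np.trace(Omega.T @ Omega))  # normalization: sum_i Lambda_ii = 1
```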

2.2.  Generalized MLVQ.

Generalized matrix LVQ (GMLVQ; Schneider et al., 2009; Schneider, 2010) is a recent extension of the generalized LVQ (Sato & Yamada, 1996), which uses the adaptive input metric, equation 2.1. The model is trained in an online learning manner, minimizing the cost function
$$f_{\mathrm{GMLVQ}} = \sum_{i=1}^{n} \Phi\!\left( \frac{d_{\Lambda}^{+}(\mathbf{x}_i, \mathbf{w}^{+}) - d_{\Lambda}^{-}(\mathbf{x}_i, \mathbf{w}^{-})}{d_{\Lambda}^{+}(\mathbf{x}_i, \mathbf{w}^{+}) + d_{\Lambda}^{-}(\mathbf{x}_i, \mathbf{w}^{-})} \right) \tag{2.2}$$
based on the steepest descent method. Φ is a monotonic function (e.g., the logistic function or the identity Φ(ℓ) = ℓ, which we use throughout the article); dΛ+(xi, w+) is the distance of data point xi from the closest prototype w+ with the same class label yi = c(xi); and dΛ−(xi, w−) is the distance of xi from the closest prototype w− with a class label different from yi.

The numerator is smaller than 0 if the classification of the data point is correct. The smaller the numerator, the greater the security of the classification, that is, the difference between the distances to the closest incorrect and correct prototypes. The security of classification characterizes the hypothesis margin of the classifier. The larger this margin is, the more robust the classification of a data pattern is with respect to noise in the input or in the function parameters (Schneider et al., 2009; Hammer & Villmann, 2002).3 A large-margin generalization bound for the GMLVQ model was derived in Schneider et al. (2009). The bound represents a particularly strong result, since it is dominated by the margin size and the input space dimensionality does not explicitly occur in it.

The denominator scales the argument of Φ such that it falls in the interval [ − 1, 1]. The learning rules are derived from this cost function by taking the derivatives with respect to the prototype locations w and the distance metric parameters Ω.

Hebbian-like online updates are implemented for the closest correct prototype w+ (i.e., c(xi) = c(w+)) and the closest incorrect prototype w− (i.e., c(xi) ≠ c(w−)), along with the metric parameters Ω. While w+ is pushed toward the training instance xi, w− is pushed away from it (for more details, see Schneider et al., 2009).
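As an illustration, the following hedged sketch implements the prototype part of one GMLVQ step, using the standard gradient factors of the relative-distance cost above (constant factors can be absorbed into the learning rate). The Ω update, which follows the same chain rule, is omitted for brevity; d_lambda is the helper from the previous sketch, and labels is assumed to be a NumPy array.

```python
import numpy as np

def gmlvq_step(x, y, prototypes, labels, Omega, eta_w=0.05):
    """Update the closest correct / closest incorrect prototype pair for sample (x, y)."""
    d = np.array([d_lambda(x, w, Omega) for w in prototypes])
    same = (labels == y)
    jp = int(np.flatnonzero(same)[np.argmin(d[same])])     # closest correct prototype
    jm = int(np.flatnonzero(~same)[np.argmin(d[~same])])   # closest incorrect prototype
    dp, dm = d[jp], d[jm]
    gp = 2.0 * dm / (dp + dm) ** 2    # d(mu)/d(d+), positive
    gm = -2.0 * dp / (dp + dm) ** 2   # d(mu)/d(d-), negative
    Lam = Omega.T @ Omega
    prototypes[jp] += eta_w * gp * Lam @ (x - prototypes[jp])   # w+ attracted
    prototypes[jm] += eta_w * gm * Lam @ (x - prototypes[jm])   # w- repelled (gm < 0)
```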

All previous LVQ variants (with or without metric learning) were designed for nominal classification problems. However, the training examples may be labeled by classes with a natural order imposed on them (e.g., classes can represent rank). Pattern recognition problems of classifying examples into ordered classes, namely, ordinal classification, have received a great deal of attention in recent work. They lend themselves to many practical applications, as in Chu and Keerthi (2007) and Cardoso et al. (2005). In this article, we extend the LVQ framework to ordinal classification, since the existing LVQ models do not consider the ordinal label information explicitly during learning.

3.  The Proposed Ordinal LVQ Classifiers

This section presents two novel methodologies based on LVQ for classifying data with ordinal classes.

We assume that we are given training data {(xi, yi)} ⊂ ℝm × {1, 2, …, K}, where i = 1, 2, …, n, and K is the number of different classes. In the ordinal classification problem, it is assumed that the classes are ordered, yK > yK−1 > ⋅ ⋅ ⋅ > y1, where > denotes the order relation on the labels. As in LVQ models, the proposed classifier is parameterized with L prototype-label pairs:
$$\bigl\{ (\mathbf{w}_q, c(\mathbf{w}_q)) \in \mathbb{R}^{m} \times \{1, \ldots, K\} \bigr\}_{q=1}^{L}. \tag{3.1}$$
We assume that each class k ∈ {1, 2, …, K} may be represented by P prototypes collected in the set W(k),
$$W^{(k)} = \bigl\{ \mathbf{w}_q : c(\mathbf{w}_q) = k \bigr\}, \qquad |W^{(k)}| = P, \tag{3.2}$$
leading to a total number of L = K · P prototypes.4 The prototypes define a classifier by means of a winner-take-all rule, where a pattern is classified with the label of the closest prototype, c(xi) = c(wj) with j = arg minq dΛ(xi, wq), where dΛ denotes the metric, equation 2.1.

Whereas nominal versions of LVQ aim to position the class prototypes in the input space so that the overall misclassification error is minimized, the proposed ordinal LVQ models adapt the class prototypes so that the average absolute error of class mislabeling is minimized. Loosely speaking, this implies that some class mislabeling (e.g., claiming class c(wj) = (k + 1) instead of class c(xi) = k) will be treated as less serious than other ones (e.g., outputting c(wj) = K instead of c(xi) = 1), where the seriousness of misclassification will be related to |c(xi) − c(wj)|.5 In the next section, we describe the prototypes to be modified given each training input xi.

3.1.  Identification of Class Prototypes to Be Adapted.

The initial step for each training instance xi, i = 1, 2, …, n, focuses on detecting the correct and incorrect prototype classes (with respect to c(xi)) that will be modified. Subsequently, the correct prototypes will be pushed toward xi, whereas the incorrect ones will be pushed away from xi.

3.1.1.  Correct and Incorrect Prototype Classes.

Due to the ordinal nature of the labels, for each training instance xi and prototype wq, q = 1, 2, …, L, the correctness of the prototype's label c(wq) is measured through the absolute error loss function H(c(xi), c(wq)) (Dembczynski, Kotlowski, & Slowinski, 2008):
$$H(c(\mathbf{x}_i), c(\mathbf{w}_q)) = |c(\mathbf{x}_i) - c(\mathbf{w}_q)|. \tag{3.3}$$
Given a rank loss threshold Lmin, defined on the range of the loss function, the class prototypes wq with H(c(xi), c(wq)) ⩽ Lmin will be viewed as tolerably correct, while prototypes with H(c(xi), c(wq))>Lmin will be classified as incorrect.6 This is illustrated in Figure 1. The sets of correct and incorrect prototype classes for input xi hence read:
$$N(c(\mathbf{x}_i))^{+} = \bigl\{ k \in \{1, \ldots, K\} : H(c(\mathbf{x}_i), k) \leq L_{\min} \bigr\} \tag{3.4}$$
and
$$N(c(\mathbf{x}_i))^{-} = \bigl\{ k \in \{1, \ldots, K\} : H(c(\mathbf{x}_i), k) > L_{\min} \bigr\}, \tag{3.5}$$
respectively.
Figure 1:

Correct and incorrect prototype class estimation. The training pattern, with c(xi) = 2, is indicated with a square; the threshold is Lmin = 1. White circles are prototypes of correct classes with respect to c(xi); black circles indicate prototypes of incorrect classes.

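A small code sketch of this step (equations 3.3 to 3.5) follows. Figure 1 does not state the total number of classes, so K = 5 in the example is only an illustrative choice.

```python
import numpy as np

def rank_loss(c_x, c_w):
    """Absolute error loss H between the class of x_i and the class of a prototype."""
    return abs(c_x - c_w)

def correct_incorrect_classes(c_x, K, L_min):
    """Split the class labels 1..K into tolerably correct (N+) and incorrect (N-) sets."""
    classes = np.arange(1, K + 1)
    N_plus = classes[np.abs(classes - c_x) <= L_min]
    N_minus = classes[np.abs(classes - c_x) > L_min]
    return N_plus, N_minus

# With c(x_i) = 2 and L_min = 1 as in Figure 1 (and, say, K = 5):
# N+ = [1, 2, 3] and N- = [4, 5]
```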

3.1.2.  Prototypes to Be Adapted.

Given a training pattern xi, the nominal LVQ techniques adapt either the closest prototype or the closest pair of correct and incorrect prototypes. In our case, we need to deal with the class prototypes in a different way:

  • Correct prototypes with labels in N(c(xi))+: For correct prototypes, it makes sense to push toward xi only the closest prototype from each class in N(c(xi))+. The set of correct prototypes to be modified given input xi reads:

    $$W(\mathbf{x}_i)^{+} = \Bigl\{ \arg\min_{\mathbf{w} \in W^{(k)}} d_{\Lambda}(\mathbf{x}_i, \mathbf{w}) \;:\; k \in N(c(\mathbf{x}_i))^{+} \Bigr\}. \tag{3.6}$$
  • Incorrect prototypes with labels in N(c(xi))−: For incorrect prototypes, it is desirable to push away from xi all incorrect prototypes lying in the neighborhood of xi. In our case, the neighborhood will be defined as a sphere of radius D under the metric dΛ (a code sketch of this selection step follows the list):

    $$W(\mathbf{x}_i)^{-} = \Bigl\{ \mathbf{w}_q \;:\; c(\mathbf{w}_q) \in N(c(\mathbf{x}_i))^{-},\; d_{\Lambda}(\mathbf{x}_i, \mathbf{w}_q) \leq D \Bigr\}. \tag{3.7}$$
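The sketch below illustrates this selection step (equations 3.6 and 3.7), reusing d_lambda from the MLVQ sketch; prototypes, labels, and D stand for the array of prototype vectors, their class labels, and the neighborhood radius, respectively (names are illustrative).

```python
import numpy as np

def prototypes_to_adapt(x, prototypes, labels, Omega, N_plus, N_minus, D):
    """Return indices of prototypes to attract (W+) and to repel (W-)."""
    d = np.array([d_lambda(x, w, Omega) for w in prototypes])
    W_plus = []                                   # closest prototype of each correct class
    for k in N_plus:
        idx = np.flatnonzero(labels == k)
        if idx.size:
            W_plus.append(int(idx[np.argmin(d[idx])]))
    incorrect = np.isin(labels, N_minus)          # all incorrect prototypes within radius D
    W_minus = [int(q) for q in np.flatnonzero(incorrect) if d[q] <= D]
    return W_plus, W_minus
```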

3.2.  Prototype Weighting Scheme.

Unlike in nominal LVQ, we will need to adapt multiple prototypes, albeit to a different degree. Given a training input xi, the attractive and repulsive force applied to correct and incorrect prototypes w will decrease and increase, respectively, with growing H(c(xi), c(w)). In addition, for incorrect prototypes w, the repulsive force will diminish with increasing distance from xi. In the two following sections, we describe the prototype adaptation schemes in greater detail.

Given a training pattern xi, there are two distinct weighting schemes for the correct and incorrect prototypes w in W(xi)+ and W(xi)−, respectively:

  • Weighting correct prototypes w ∈ W(xi)+: We propose a gaussian weighting scheme,
    formula
    3.8
    where σ is the gaussian kernel width.

  • Weighting incorrect prototypes w ∈ W(xi)−: Denote by εmax the maximum rank loss error within the set W(xi)−,
    formula
    The weight factor α− for an incorrect prototype w ∈ W(xi)− is then calculated as
    formula
    3.9
    where σ′ is the gaussian kernel width for the distance factor in α−.

These weighting factors will be used in two prototype update schemes introduced in the next two sections.
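Since the exact kernels of equations 3.8 and 3.9 are not reproduced above, the sketch below only assumes simple Gaussian forms that satisfy the verbal description (α+ decays with the rank loss H; α− grows with H relative to εmax and decays with the distance from xi); the true expressions may differ.

```python
import numpy as np

def alpha_plus(H, sigma):
    # assumed Gaussian decay of the attractive weight with the rank loss H
    return np.exp(-H ** 2 / (2.0 * sigma ** 2))

def alpha_minus(H, eps_max, dist, sigma_prime):
    # assumed form: repulsive weight grows with H (relative to eps_max) and
    # decays with the distance of the incorrect prototype from x_i
    return (H / eps_max) * np.exp(-dist ** 2 / (2.0 * sigma_prime ** 2))
```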

3.3.  Ordinal MLVQ Algorithm.

In this section, we generalize the MLVQ algorithm in section 2.1 to the case of linearly ordered classes. We refer to this new learning scheme as ordinal MLVQ (OMLVQ). In particular, there are two main differences between MLVQ and OMLVQ:

  • In OMLVQ, the order information on classes is utilized to select multiple appropriate prototypes (rather than just the closest one, as in MLVQ) to be adapted.

  • The ordinal version of MLVQ realizes Hebbian updates for all prototypes in W(xi)+ and W(xi)−, using the assigned weights α±. Similar to MLVQ, each prototype update Δw is followed by a corresponding metric parameter update ΔΩ.

The OMLVQ training algorithm is outlined in greater detail below:

  1. Initialization: Initialize the prototype positions wq ∈ ℝm, q = 1, 2, …, L.7 Initialize the matrix tensor parameter Ω by setting it equal to the identity matrix (Euclidean distance).

  2. While a stopping criterion (in our case the maximum number of training epochs) is not reached do:

    1. Randomly select a training pattern xi, i ∈ {1, 2, …, n}, with class label c(xi).

    2. Determine the correct and incorrect classes for xi, N(c(xi))+ and N(c(xi))−, based on equations 3.4 and 3.5, respectively.

    3. Find the collections of prototypes W(xi)+ and W(xi)− to be adapted using equations 3.6 and 3.7.

    4. Assign weight factors α± to the selected prototypes, equations 3.8 and 3.9.8

    5. Update the prototypes from W(xi)+, W(xi)− and the distance metric Ω as follows:

      1. For all w ∈ W(xi)+ do:
         w = w + ηw · α+ · Λ · (xi − w)   (w dragged toward xi)
         Ω = Ω − ηΩ · α+ · Ω · (xi − w)(xi − w)T   (dΛ(xi, w) shrinks)

      2. For all w ∈ W(xi)− do:
         w = w − ηw · α− · Λ · (xi − w)   (w pushed away from xi)
         Ω = Ω + ηΩ · α− · Ω · (xi − w)(xi − w)T   (dΛ(xi, w) is increased)
         Here ηw, ηΩ are positive learning rates for prototypes and metric, respectively.9 They decrease monotonically with time as (Darken, Chang, & Moody, 1992):
        $$\eta_{g}(t) = \frac{\eta_{g}(0)}{1 + \tau\,(t - 1)}, \tag{3.10}$$
        where g ∈ {Ω, w}, τ>0 determines the speed of annealing and t indexes the number of training epochs done.10 Similar to the original MLVQ (see section 2.1), to prevent the algorithm from degeneration, Ω is normalized after each learning step so that ∑iΛii = 1 (Schneider et al., 2009; Schneider, 2010).

  End While

Note that unlike in the original MLVQ, during the training, adaptation of the prototypes is controlled by the corresponding weight factors α±, which reflect the class order (see equations 3.8 and 3.9), and the distance of incorrect prototypes from training inputs (see equation 3.9).
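Putting the pieces together, the sketch below walks through one OMLVQ epoch using the helpers from the earlier sketches (correct_incorrect_classes, prototypes_to_adapt, alpha_plus, alpha_minus, d_lambda). It is a rough illustration of the loop above, not the reference implementation, and the α kernels are the assumed forms from section 3.2.

```python
import numpy as np

def omlvq_epoch(X, y, prototypes, labels, Omega, K, L_min, D,
                sigma=1.0, sigma_prime=1.0, eta_w=0.05, eta_o=0.005):
    for i in np.random.permutation(len(X)):
        x, c_x = X[i], y[i]
        N_plus, N_minus = correct_incorrect_classes(c_x, K, L_min)
        W_plus, W_minus = prototypes_to_adapt(x, prototypes, labels, Omega,
                                              N_plus, N_minus, D)
        eps_max = max([abs(c_x - labels[q]) for q in W_minus] or [1])
        for q in W_plus:                          # attract correct prototypes
            a = alpha_plus(abs(c_x - labels[q]), sigma)
            Lam = Omega.T @ Omega
            diff = x - prototypes[q]
            prototypes[q] += eta_w * a * Lam @ diff
            Omega -= eta_o * a * Omega @ np.outer(diff, diff)
        for q in W_minus:                         # repel incorrect prototypes
            dist = np.sqrt(d_lambda(x, prototypes[q], Omega))
            a = alpha_minus(abs(c_x - labels[q]), eps_max, dist, sigma_prime)
            Lam = Omega.T @ Omega
            diff = x - prototypes[q]
            prototypes[q] -= eta_w * a * Lam @ diff
            Omega += eta_o * a * Omega @ np.outer(diff, diff)
        Omega /= np.sqrt(np.trace(Omega.T @ Omega))   # keep sum_i Lambda_ii = 1
    return prototypes, Omega
```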

3.4.  Ordinal GMLVQ Algorithm.

This section extends the update rules of the GMLVQ algorithm (in section 2.2) to the case of ordinal classes. The algorithm, referred to as ordinal GMLVQ (OGMLVQ), will inherit from GMLVQ its cost function, equation 2.2. There are two main differences between OGMLVQ and GMLVQ:

  • For each training pattern xi, the GMLVQ scheme applies a Hebbian update for the single closest prototype pair (with the same and different class labels with respect to the label c(xi) of xi; see section 2.2). On the other hand, in OGMLVQ there will be updates of r ⩾ 1 prototype pairs from W(xi)+ × W(xi)− (see equations 3.6 and 3.7). This is done in an iterative manner as follows:

    Set W± = W(xi)±, r = 0.

    While (W+ ≠ ∅ and W− ≠ ∅)
    1)  r ← r + 1.
    2)  Construct "the closest" prototype pair Rr = (wa, wb), where
    formula
    3.11
    3)  Update wa, wb and Ω (to be detailed later).

    4)  W+ ← W+ ∖ {wa}, W− ← W− ∖ {wb}.

    End While

  • In order to control prototype adaptation by the corresponding weight factors α± (see equations 3.8 and 3.9), OGMLVQ scales the metric, equation 2.1 (used in the original GMLVQ cost function, equation 2.2), as
    $$d_{\Lambda}^{\alpha}(\mathbf{x}_i, \mathbf{w}) = \alpha \cdot d_{\Lambda}(\mathbf{x}_i, \mathbf{w}), \tag{3.12}$$
    where α is the weight factor assigned to prototype w. The OGMLVQ cost function reads:
    $$f_{\mathrm{OGMLVQ}} = \sum_{i=1}^{n} \sum_{j=1}^{r} \Phi\bigl(\mu(\mathbf{x}_i, R_j)\bigr), \tag{3.13}$$
    where
    $$\mu(\mathbf{x}_i, R_j) = \frac{\alpha^{+} d_{\Lambda}(\mathbf{x}_i, \mathbf{w}_a) - \alpha^{-} d_{\Lambda}(\mathbf{x}_i, \mathbf{w}_b)}{\alpha^{+} d_{\Lambda}(\mathbf{x}_i, \mathbf{w}_a) + \alpha^{-} d_{\Lambda}(\mathbf{x}_i, \mathbf{w}_b)}.$$
    The cost function fOGMLVQ will be minimized with respect to the prototypes and the metric parameter Ω using the steepest descent method. Recall that α+ dΛ(xi, wa) is the scaled distance of the data point xi from the correct prototype wa, α− dΛ(xi, wb) is the scaled distance from the incorrect prototype wb, and Φ is a monotonic function set (as in GMLVQ) to the identity mapping.

To obtain the new adaptation rules for the OGMLVQ algorithm, we present derivatives of μ(xi, Rj) with respect to the prototype pair (wa, wb) = Rj, equation 3.11, and the metric parameter Ω:

  • Derivatives of μ(xi, Rj) with respect to the correct prototype wa,
    formula
    where
    formula
    3.14
    and
    formula
    3.15
  • Derivatives of μ(xi, Rj) with respect to the incorrect prototype wb,
    formula
    where
    formula
    3.16
    and
    formula
    3.17
  • Derivatives of μ(xi, Rj) with respect to the metric parameter Ω,
    formula
    3.18
    formula
    3.19
    using equations 3.14 and 3.16, then,
    formula
    3.20
    where
    formula
    3.21
    and
    formula
    3.22

Note that the OGMLVQ cost function, equation 3.13, is a sum of r weighted versions of the GMLVQ cost function (Schneider et al., 2009; see equation 2.2). The only difference is that the distances from data points to prototypes are linearly scaled by the factors α± (see equation 3.12). As such, the OGMLVQ cost function inherits all the discontinuity problems of the GMLVQ cost functional at receptive field boundaries of the prototypes. As Schneider et al. (2009) argued, the GMLVQ prototype and metric updates resulting from gradient descent on the GMLVQ cost function are valid whenever the metric is differentiable (see also Hammer & Villmann, 2002; Hammer, Strickert, & Villmann, 2005). Using the delta function (as the derivative of the Heaviside function), the argument can be made for cost functions rewritten with respect to full reasonable distributions on the input space (with continuous support) (Schneider et al., 2009). Since the weighting of distances in the individual GMLVQ cost functions that make up the OGMLVQ cost function preserves the differentiability of the metric, and because the OGMLVQ cost function is a sum of such individual weighted GMLVQ cost functions, the theoretical arguments made about updates from the GMLVQ cost function carry over to the OGMLVQ cost function.

We summarize the OGMLVQ algorithm:

  1. Initialization: Initialize the prototype positions wq ∈ ℝm, q = 1, 2, …, L. Initialize the matrix tensor parameter Ω by setting it equal to the identity matrix (Euclidean distance).

  2. While a stopping criterion (in our case the maximum number of training epochs) is not reached, do:

    1. Randomly select a training pattern xi, i ∈ {1, 2, …, n}, with class label c(xi).

    2. Determine the correct and incorrect classes for xi, N(c(xi))+ and N(c(xi))−, based on equations 3.4 and 3.5, respectively.

    3. Find the collections of prototypes W(xi)+ and W(xi)− to be adapted using equations 3.6 and 3.7.

    4. Assign weight factors α± to the selected prototypes, equations 3.8 and 3.9.

    5. Set W± = W(xi)±, r = 0.
       While (W+ ≠ ∅ and W− ≠ ∅)

      1. rr + 1.

      2. Construct “the closest” prototype pair Rr = (wa, wb) as in equation 3.11.

      3. Update the prototype positions:
        formula
        (wa dragged toward xi)
        formula
        (wb pushed away from xi)11
      4. Update the metric parameter Ω,
        formula
        where γ+ and γ− are given in equations 3.14 and 3.16, respectively. ηw and ηΩ are the learning rates for prototypes and metric, respectively; they decrease throughout the learning as given in equation 3.10. Each Ω update is followed by a normalization step as described in the OMLVQ algorithm (see section 3.3).
      5. W+ ← W+ ∖ {wa}, W− ← W− ∖ {wb}.

      End While

    End While

During the adaptation, distances between the training point xi and the correct prototypes in W+ are on average decreased, in line with the aim of minimizing the rank loss error. Conversely, the average distances between xi and the incorrect prototypes in W− are increased, so that the risk of a higher ordinal classification error (due to the high rank loss error of incorrect prototypes) is diminished.

Note that while OMLVQ is a heuristic extension of MLVQ, updating each prototype independently of the others, OGMLVQ is an extension of GMLVQ, with parameter updates following in a principled manner from a well-defined cost function. In OGMLVQ, the prototypes are updated in pairs, as explained above.
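For concreteness, a structural sketch of the OGMLVQ pair updates follows. It selects each pair under the α-scaled distances and reuses GMLVQ-style gradient factors on those scaled distances, which is only an approximation of the exact γ± terms of equations 3.14 to 3.22; the Ω update is omitted, and alphas is assumed to map a prototype index to its weight α±.

```python
import numpy as np

def ogmlvq_pair_updates(x, prototypes, labels, Omega, W_plus, W_minus, alphas, eta_w=0.05):
    """Iteratively update closest correct/incorrect prototype pairs, removing each pair."""
    W_p, W_m = list(W_plus), list(W_minus)
    Lam = Omega.T @ Omega
    while W_p and W_m:
        d = {q: alphas[q] * d_lambda(x, prototypes[q], Omega) for q in W_p + W_m}
        a = min(W_p, key=d.get)           # closest remaining correct prototype
        b = min(W_m, key=d.get)           # closest remaining incorrect prototype
        dp, dm = d[a], d[b]
        gp = 2.0 * dm / (dp + dm) ** 2    # factor for the correct prototype (positive)
        gm = -2.0 * dp / (dp + dm) ** 2   # factor for the incorrect prototype (negative)
        prototypes[a] += eta_w * gp * alphas[a] * Lam @ (x - prototypes[a])
        prototypes[b] += eta_w * gm * alphas[b] * Lam @ (x - prototypes[b])
        W_p.remove(a)                     # each prototype is used in at most one pair
        W_m.remove(b)
```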

4.  Experiments

We evaluated the performance of the proposed ordinal regression LVQ methods through a set of experiments conducted on two groups of data sets: eight benchmark ordinal regression data sets (Sun et al., 2010; Chu et al., 2007; Li et al., 2007; Lin et al., 2012; Xia et al., 2007) and two real-world ordinal regression data sets (Lin et al., 2012).12 The ordinal LVQ models, OMLVQ and OGMLVQ, were assessed against their nominal (nonordinal) counterparts, MLVQ and GMLVQ, respectively. The ordinal LVQ models were also compared with benchmark ordinal regression approaches.

The experiments used three evaluation metrics to measure the accuracy of the predicted class ŷ with respect to the true class y on a test set:

  1. Mean zero-one error (MZE): the misclassification rate, that is, the fraction of incorrect predictions,
     $$\mathrm{MZE} = \frac{1}{v} \sum_{i=1}^{v} \mathbb{I}\,[\hat{y}_i \neq y_i],$$
     where v is the number of test examples and 𝕀[·] denotes the indicator function returning 1 if the predicate holds and 0 otherwise.
  2. Mean absolute error (MAE): the average deviation of the prediction from the true rank,
     $$\mathrm{MAE} = \frac{1}{v} \sum_{i=1}^{v} |\hat{y}_i - y_i|.$$
  3. Macroaveraged mean absolute error (MMAE) (Baccianella, Esuli, & Sebastiani, 2009): the macroaveraged version of the mean absolute error, a weighted sum of the classification errors across classes,
     $$\mathrm{MMAE} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{v_k} \sum_{i : y_i = k} |\hat{y}_i - y_i|,$$
     where K is the number of classes and vk is the number of test points whose true class is k. The macroaveraged MAE is typically used in imbalanced ordinal regression problems as it emphasizes errors equally in each class.
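In code, the three measures reduce to a few NumPy lines; y_true and y_pred are assumed to be integer rank vectors, and the names are illustrative.

```python
import numpy as np

def mze(y_true, y_pred):
    return float(np.mean(y_true != y_pred))           # misclassification rate

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))    # average rank deviation

def mmae(y_true, y_pred):
    # per-class MAE averaged over classes, so each class contributes equally
    per_class = [np.abs(y_pred[y_true == k] - y_true[y_true == k]).mean()
                 for k in np.unique(y_true)]
    return float(np.mean(per_class))
```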

For comparison purposes, and with respect to the eight benchmark ordinal regression data sets, we applied the same preprocessing as described in Sun et al. (2010), Chu et al. (2007), Li et al. (2007), Lin et al. (2012), and Xia et al. (2007). Data labels were discretized into 10 ordinal quantities using equal-frequency binning. Hence, the eight benchmark ordinal regression data sets are balanced with respect to their class distribution. The input vectors were normalized to have zero mean and unit variance. Each data set was randomly partitioned into training and test splits as recorded in Table 1. The partitioning was repeated 20 times independently, yielding 20 resampled training and test sets. For these class-balanced data sets, the experimental evaluations were done using the MZE and MAE measures.
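As a rough illustration of this preprocessing (not the exact scripts used in the cited studies), equal-frequency binning and standardization can be sketched as follows; the handling of ties at bin boundaries is an implementation detail that may differ.

```python
import numpy as np

def equal_frequency_bins(y_cont, n_bins=10):
    """Discretize a continuous target into ordinal ranks 1..n_bins of roughly equal size."""
    edges = np.quantile(y_cont, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(y_cont, edges) + 1

def standardize(X):
    """Scale each input dimension to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```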

Table 1:
Ordinal Regression Data Sets Partitions.
Data Set     Dimension  Training  Testing
Pyrimidines  27         50        24
MachineCpu              150       59
Boston       13         300       206
Abalone                 1000      3177
Bank         32         3000      5182
Computer     21         4000      4182
California              5000      15,640
Census       16         6000      16,784
Cars                    1296      432
Redwine      11         1200      399

The two real-world ordinal ranking problems were represented by two data sets: cars and the red wine subset redwine of the wine quality set from the UCI machine learning repository (Hettich et al., 1998). For a fair comparison, we followed the same experimental settings as in Lin et al. (2012). We randomly split 75% of the examples for training and 25% for testing, as recorded in Table 1, and conducted 20 runs of such random splits. The cars problem ranks cars into four conditions (unacceptable, acceptable, good, very good), while the redwine problem ranks red wine samples into 11 different levels (between 0 and 10; however, the actual data contain samples only with ranks between 3 and 8). It is worth mentioning that the two data sets are highly imbalanced (with respect to their class distribution). In the cars data set, the class distribution (percentage of instances per class) is as follows: unacceptable, 70%; acceptable, 22%; good, 4%; and very good, 4%. The redwine data set has the following class distribution: 3, 1%; 4, 3%; 5, 43%; 6, 40%; 7, 12%; and 8, 1%. Real-world ordinal regression data sets are often severely imbalanced, with very different class populations across the class order, and (unlike in many previous ordinal classification studies) ordinal classification algorithms should be examined in both balanced and imbalanced class distribution cases.13 As Baccianella et al. (2009) showed, testing a classifier on imbalanced data sets using standard evaluation measures (e.g., MAE) may be insufficient. Therefore, along with the MZE and MAE evaluation measures, we examined our prototype-based models with the macroaveraged mean absolute error (MMAE; Baccianella et al., 2009), which is specially designed for evaluating classifiers operating on imbalanced data sets.

On each data set, the algorithm parameters were chosen through five-fold cross-validation on the training set. Test errors were obtained using the optimal parameters found for each data resampling and were averaged over the 20 trials (runs). We also report standard deviations across the 20 trials.

4.1.  Comparison with MLVQ and GMLVQ.

This section evaluates the performance of the proposed OMLVQ and OGMLVQ algorithms against their standard nominal versions MLVQ and GMLVQ. For the eight benchmark ordinal regression data sets, the MZE and MAE results, along with standard deviations (represented by error bars), across 20 runs are shown in Figures 2 and 3, respectively. Furthermore, the MZE, MAE, and MMAE results, along with standard deviations (represented by error bars) across 20 runs, for the two real-world ordinal regression data sets are presented in Figures 4a, 4b, and 4c, respectively.

Figure 2:

MZE results for the eight benchmark ordinal regression data sets.


Figure 3:

MAE results for the eight benchmark ordinal regression data sets.


Figure 4:

MZE, MAE, and MMAE results for the two real-world ordinal regression data sets.


The results in general confirm that the proposed ordinal LVQ models achieve better performance in terms of MZE, MAE, and MMAE rates than their standard (nominal) LVQ counterparts. On average, across the eight benchmark ordinal regression data sets, the OMLVQ algorithm outperforms the baseline MLVQ by relative improvements of 10% and 18% on MZE and MAE, respectively, and OGMLVQ achieves relative improvements over the baseline GMLVQ of 5% and 15% on MZE and MAE, respectively. For the two real-world ordinal regression data sets, on average the OMLVQ algorithm outperforms the baseline MLVQ by relative improvements of 41%, 48%, and 46% on MZE, MAE, and MMAE, respectively, and OGMLVQ achieves relative improvements over the baseline GMLVQ of 14%, 15%, and 8% on MZE, MAE, and MMAE, respectively.

4.2.  Comparison with Benchmark Ordinal Regression Approaches.

This section compares (in terms of MZE, MAE, and MMAE) the proposed ordinal LVQ approaches (OMLVQ and OGMLVQ) against five benchmark ordinal regression methods: two threshold SVM-based models (SVOR-IMC and SVOR-EXC with the gaussian kernel; Chu et al., 2007), two reduction frameworks (the SVM-based model RED-SVM with the perceptron kernel (Li et al., 2007; Lin et al., 2012) and the weighted LogitBoost (Xia et al., 2007)), and a non-SVM-based kernel discriminant learning for ordinal regression method (KDLOR; Sun et al., 2010).

The first comparison was conducted on eight benchmark ordinal ranking data sets used in Chu et al. (2007), Sun et al. (2010), Li et al. (2007), Lin et al. (2012), and Xia et al. (2007). We used the same data set preprocessing and experimental settings as those authors did.

MZE and MAE test results,14 along with standard deviations over 20 training and test resamplings, are listed in Tables 2 and 3, respectively.15 We use boldface to indicate the lowest average error value among the results of all algorithms.

Table 2:
Mean Zero-One Error (MZE) Results with Standard Deviations (±) Across 20 Training and Test Resampling.
Data Set     KDLOR           SVOR-IMC        SVOR-EXC        RED-SVM         OMLVQ           OGMLVQ
Pyrimidines  0.739 ± 0.050   0.719 ± 0.066   0.752 ± 0.063   0.762 ± 0.021   0.660 ± 0.060   0.645 ± 0.106*
MachineCPU   0.480 ± 0.010   0.655 ± 0.045   0.661 ± 0.056   0.572 ± 0.013   0.431 ± 0.079   0.415 ± 0.096*
Boston       0.560 ± 0.020   0.561 ± 0.026   0.569 ± 0.025   0.541 ± 0.009   0.532 ± 0.017*  0.534 ± 0.024
Abalone      0.740 ± 0.020   0.732 ± 0.007   0.736 ± 0.011   0.721 ± 0.002   0.545 ± 0.021   0.532 ± 0.049*
Bank         0.745 ± 0.0025  0.751 ± 0.005   0.744 ± 0.005*  0.751 ± 0.001   0.756 ± 0.016   0.750 ± 0.008
Computer     0.472 ± 0.020   0.473 ± 0.005   0.462 ± 0.005   0.451 ± 0.002*  0.535 ± 0.019   0.510 ± 0.010
California   0.643 ± 0.005   0.639 ± 0.003   0.640 ± 0.003   0.613 ± 0.001*  0.710 ± 0.018   0.680 ± 0.007
Census       0.711 ± 0.020   0.705 ± 0.002   0.699 ± 0.002   0.688 ± 0.001*  0.754 ± 0.154   0.735 ± 0.014

Notes: Results are given for the ordinal LVQ models (OMLVQ and OGMLVQ) and for the benchmark algorithms KDLOR, as reported in Sun et al. (2010), SVOR-IMC and SVOR-EXC (both with the gaussian kernel), as reported in Chu et al. (2007), and RED-SVM (with the perceptron kernel), as reported in Lin et al. (2012). The best result in each row is marked with an asterisk.

Table 3:
Mean Absolute Error (MAE) Results, Along with Standard Deviations (±) Across 20 Training and Test Resampling.
Data Set     KDLOR           SVOR-IMC        SVOR-EXC        RED-SVM         Weighted LogitBoost  OMLVQ           OGMLVQ
Pyrimidines  1.1 ± 0.100     1.294 ± 0.204   1.331 ± 0.193   1.304 ± 0.040   1.271 ± 0.205        1.004 ± 0.123   0.985 ± 0.169*
MachineCPU   0.690 ± 0.015   0.990 ± 0.115   0.986 ± 0.127   0.842 ± 0.022   0.800 ± 0.087        0.660 ± 0.291   0.630 ± 0.176*
Boston       0.700 ± 0.035*  0.747 ± 0.049   0.773 ± 0.049   0.732 ± 0.013   0.816 ± 0.056        0.742 ± 0.048   0.731 ± 0.050
Abalone      1.400 ± 0.050   1.361 ± 0.013   1.391 ± 0.021   1.383 ± 0.004   1.457 ± 0.014        0.732 ± 0.035   0.731 ± 0.068*
Bank         1.450 ± 0.020   1.393 ± 0.011*  1.512 ± 0.017   1.404 ± 0.002   1.499 ± 0.016        1.501 ± 0.025   1.462 ± 0.009
Computer     0.601 ± 0.025   0.596 ± 0.008   0.602 ± 0.009   0.565 ± 0.002*  0.601 ± 0.007        0.776 ± 0.018   0.698 ± 0.023
California   0.907 ± 0.004   1.008 ± 0.005   1.068 ± 0.005   0.940 ± 0.001   0.882 ± 0.009*       1.238 ± 0.048   1.208 ± 0.018
Census       1.213 ± 0.003   1.205 ± 0.007   1.270 ± 0.007   1.143 ± 0.002   1.142 ± 0.005*       1.761 ± 0.033   1.582 ± 0.018

Note: See the note to Table 2.

In comparison with other methods and with respect to the eight benchmark ordinal ranking data sets, OGMLVQ and OMLVQ algorithms achieve the lowest MZE results on four data sets, with OGMLVQ being lowest in Pyrimidines, MachineCPU, and Abalone data sets, and OMLVQ in the Boston data set. Furthermore, OGMLVQ and OMLVQ attain the lowest MAE for three data sets (Pyrimidines, MachineCPU, and Abalone), with OGMLVQ being slightly better than OMLVQ on all data sets. Note that on the Abalone data set, both ordinal LVQ models beat the competitors out of sample by a large margin. However, relative to the competitors, OMLVQ and OGMLVQ exhibit the worst performance on three data sets (Computer, California, and Census), and comparable performances on the remaining data sets, Boston and Bank. Note that on the three data sets where the ordinal LVQ methods were beaten by the competitors, the original LVQ methods performed poorly as well (see Figures 2 and 3). We hypothesize that the class distribution structure of those data sets may not be naturally captured by the prototype-based methods.

We also examined the performance of our prototype-based models, using the two real-world ordinal ranking problems, against two SVM-based ordinal regression approaches: SVOR-IMC (Chu et al., 2007) with the gaussian kernel and RED-SVM with the perceptron kernel (Li et al., 2007; Lin et al., 2012).16

The MZE and MAE test results of the cars and redwine data sets for the two compared algorithms were reported in Lin et al. (2012). MZE, MAE, and MMAE test results over 20 training and test random resamplings are listed in Table 4.17 We use bold type to indicate the lowest average error value among the results of all algorithms.

Table 4:
Mean Zero-One Error (MZE), Mean Absolute Error (MAE), and Macroaveraged Mean Absolute Error (MMAE) Results on the cars and redwine Data Sets, Along with Standard Deviations (±) Across 20 Training and Test Resampling.
Data Set  Algorithm  MZE            MAE            MMAE
Cars      SVOR-IMC   N/A            0.051 ± 0.002  N/A
          RED-SVM    0.064 ± 0.003  0.061 ± 0.003  N/A
          OMLVQ      0.035 ± 0.012  0.044 ± 0.016  0.069 ± 0.029
          OGMLVQ     0.111 ± 0.029  0.128 ± 0.035  0.281 ± 0.080
Redwine   SVOR-IMC   N/A            0.429 ± 0.004  N/A
          RED-SVM    0.327 ± 0.005  0.357 ± 0.005  N/A
          OMLVQ      0.358 ± 0.014  0.405 ± 0.016  0.535 ± 0.067
          OGMLVQ     0.331 ± 0.009  0.364 ± 0.014  0.555 ± 0.083

Note: The best results are in bold.

In comparison with SVOR-IMC (Chu et al., 2007) and RED-SVM (Li et al., 2007; Lin et al., 2012) on the two real-world ordinal regression data sets (cars and redwine), the prototype-based models for ordinal regression (OMLVQ and OGMLVQ) show competitive performance in MZE and MAE. For the cars data set, the OMLVQ model performs best among the compared algorithms with respect to the MZE and MAE results. For the redwine data set, RED-SVM yields the best MZE and MAE performance; the OMLVQ and OGMLVQ models are slightly worse than RED-SVM but better than the SVOR-IMC algorithm.

Table 5:
Mean Absolute Error (MAE) Results, with Standard Deviations (±) Across 20 Training and Test Resampling.
Data Set     K   Lmin  Algorithm  MAE (Lmin − 1)  MAE (Lmin)     MAE (Lmin + 1)
Cars         4   0     OMLVQ      N/A             0.044 ± 0.016  0.403 ± 0.027
                       OGMLVQ     N/A             0.128 ± 0.035  0.324 ± 0.034
Redwine      6   0     OMLVQ      N/A             0.405 ± 0.016  0.800 ± 0.080
                       OGMLVQ     N/A             0.364 ± 0.014  0.440 ± 0.019
Pyrimidines  10        OMLVQ      1.274 ± 0.177   1.004 ± 0.123  1.300 ± 0.168
                       OGMLVQ     1.162 ± 0.199   0.985 ± 0.169  1.062 ± 0.130
Abalone      10        OMLVQ      0.885 ± 0.082   0.732 ± 0.035  0.901 ± 0.104
                       OGMLVQ     0.740 ± 0.011   0.731 ± 0.068  0.886 ± 0.034

Notes: Obtained using varying values of the rank loss threshold (Lmin − 1, Lmin, and Lmin + 1) on four ordinal regression data sets. Note that the value of Lmin is determined using a cross-validation procedure on each of the four examined data sets. The best results are in bold.

4.3.  Sensitivity of the Ordinal LVQ Models to the Correct Region.

As specified in section 3.1, the rank loss threshold Lmin defines the sets of correct and incorrect prototype classes. Given classes 1, 2, …, K, the value of Lmin is defined on the range of the absolute error loss function, [0, K − 1].

The following experiment investigates the sensitivity of the presented models to the choice of the correct region, that is, to the value of Lmin.18 The experiment was conducted on four data sets with different numbers of classes K (Pyrimidines and Abalone with K = 10; cars and redwine with K = 4 and K = 6, respectively). Using the settings of the best-performing models from the previous experiments, we examined the sensitivity of the model performance with respect to varying Lmin in the range [L*min − 1, L*min + 1], where L*min denotes the optimal value of Lmin found using cross-validation as described above.

The MAE and MMAE19 results are presented in Tables 5 and 6, respectively. As expected, sensitivity with respect to variations in Lmin is much greater if the number of classes is small (e.g., cars and redwine). In such cases, setting the right value of Lmin is crucial. Not surprisingly, for a small number of classes, the selected value of Lmin was 0. Interestingly, OGMLVQ appears to be more robust to changes in Lmin than OMLVQ is. We speculate that this is so since OMLVQ in each training step updates all selected correct and incorrect prototypes independent of each other. On the other hand, OGMLVQ updates only the closest pair of correct and incorrect prototypes, affecting potentially a smaller number of prototypes.

Table 6:
Macroaveraged Mean Absolute Error (MMAE) Results, with Standard Deviations (±) Across 20 Training and Test Resampling.
Data Set  K  Lmin  Algorithm  MMAE (Lmin − 1)  MMAE (Lmin)    MMAE (Lmin + 1)
Cars      4  0     OMLVQ      N/A              0.069 ± 0.029  0.268 ± 0.036
                   OGMLVQ     N/A              0.281 ± 0.080  0.390 ± 0.062
Redwine   6  0     OMLVQ      N/A              0.535 ± 0.067  0.781 ± 0.145
                   OGMLVQ     N/A              0.555 ± 0.083  0.678 ± 0.071

Notes: Obtained using varying values of the rank loss threshold (Lmin − 1, Lmin, and Lmin + 1) on two ordinal regression data sets. Note that the value of Lmin is determined using a cross-validation procedure on each of the examined data sets. The best results are in bold.

4.4.  Discussion.

OGMLVQ slightly outperforms OMLVQ in almost all cases. This may be due to principled adaptation formulation through the novel cost function, equation 3.13. Interestingly enough, this is also reflected in the nominal classification case, where GLVQ (later extended to GMLVQ) has been shown to be superior to LVQ1 (later extended to MLVQ) (Sato & Yamada, 1996).

As expected, ordinal LVQ methods demonstrate stronger improvements over their nominal counterparts in terms of MAE rather than MZE. As an example, this is illustrated in Figure 5, obtained on a MachineCpu test set. The figure compares the true class labels in the selected test set (Figure 5a) against the predicted ones generated by MLVQ, OMLVQ, GMLVQ, and OGMLVQ (Figures 5b to 5e, respectively). Although there are several misclassifications by our ordinal LVQ methods (OMLVQ and OGMLVQ), their misclassifications deviate less from the true ordinal labels than those produced by MLVQ and GMLVQ. Clearly, the ordinal LVQ schemes efficiently use the class order information during learning, thus improving the MAE performance.

Figure 5:

Ordinal prediction results of a single example run in MachineCpu data set (a: true labels) obtained by (b) MLVQ, (c) OMLVQ, (d) GMLVQ, and (e) OGMLVQ.


Interestingly enough, we observed that reshaping the class prototypes in the ordinal LVQ methods by explicit use of the class order information stabilizes the training substantially when compared to the nominal LVQ methods. Provided the class distribution in the data space respects the class order, the class prototypes of ordinal LVQ will quickly reposition to reflect this order. Then most misclassifications that need to be acted on during training have a low absolute error; that is, most misclassifications happen on the border of receptive fields of ordered prototypes, with small absolute differences between the classes of data points and those of their closest prototypes. This stabilizes the training in that only relatively small prototype updates are necessary. In nominal LVQ, where the order of classes is not taken into account during training, larger jumps in absolute error can occur. For example, in Figures 6 and 7 we show the evolution of the MAE (as the training progresses, measured in training epochs) for a single run of (O)MLVQ and (O)GMLVQ on the Abalone and Boston data sets, respectively. The same training sample and similar experimental settings for MLVQ and OMLVQ, as well as for GMLVQ and OGMLVQ, were used.

Figure 6:

Evolution of MAE in the course of training epochs (t) in the Abalone training set obtained by the (a) MLVQ and (b) OMLVQ algorithms.


Figure 7:

Evolution of MAE in the course of training epochs (t) in the Boston training set obtained by the (a) GMLVQ and (b) OGMLVQ algorithms.


5.  Conclusion

This article introduced two novel prototype-based learning methodologies tailored for classifying data with ordered classes. Based on the existing nominal LVQ methods with metric learning, matrix LVQ (MLVQ) and generalized MLVQ (GMLVQ) (Schneider et al., 2009; Schneider, 2010), we proposed two new ordinal LVQ methodologies: ordinal MLVQ (OMLVQ) and ordinal GMLVQ (OGMLVQ).

Unlike in nominal LVQ, in ordinal LVQ, the class order information is used during training in selection of the class prototypes to be adapted, as well as in determining the exact manner in which the prototypes get updated. In particular, the prototypes are adapted so that the ordinal relations among the prototype classes are preserved, reflected in a reduction of the overall mean absolute error. Whereas in the OMLVQ approach, the prototypes are adapted independent of each other, in the OGMLVQ approach, the prototypes are updated in pairs based on minimization of a novel cost function.

Experimental results on eight benchmark data sets and two real-world imbalanced data sets empirically verify the effectiveness of our ordinal LVQ frameworks when compared with their standard nominal LVQ versions. The mean zero-one error (MZE), mean absolute error (MAE), and, in the case of the imbalanced data sets, macroaveraged mean absolute error (MMAE) rates of the proposed methods were considerably lower, with more pronounced improvements in MAE (for the balanced data sets) and in MAE and MMAE (for the imbalanced data sets) than in MZE. In addition, our ordinal models exhibit more stable learning behavior when compared to their nominal counterparts. Finally, in comparison with existing benchmark ordinal regression methods, our ordinal LVQ frameworks attained competitive performance in terms of the MZE and MAE measures.

Acknowledgments

We are grateful to the anonymous reviewers for helpful comments. P.T. was supported by BBSRC grant BB/H012508/1 (project code RRAE14541). S.F. was supported by the Islamic Development Bank (IDB), Merit Scholarship Programme.

References

Baccianella, S., Esuli, A., & Sebastiani, F. (2009). Evaluation measures for ordinal regression. In Proceedings of the Ninth International Conference on Intelligent Systems Design and Applications (pp. 283–287). San Mateo, CA: IEEE Computer Society.

Cardoso, J. S., Pinto da Costa, J. F., & Cardoso, M. J. (2005). Modelling ordinal relations with SVMs: An application to objective aesthetic evaluation of breast cancer conservative treatment. Neural Networks, 18, 808–817.

Chu, W., & Keerthi, S. S. (2007). Support vector ordinal regression. Neural Computation, 19, 792–815.

Darken, C., Chang, J., & Moody, J. (1992). Learning rate schedules for faster stochastic gradient search. In Proceedings of the 1992 IEEE Workshop: Neural Networks for Signal Processing (pp. 3–12). Piscataway, NJ: IEEE Press.

Dembczynski, K., Kotlowski, W., & Slowinski, R. (2008). Ordinal classification with decision rules. In Proceedings of the 3rd ECML/PKDD International Conference on Mining Complex Data (pp. 169–181). New York: Springer-Verlag.

Frank, E., & Hall, M. (2001). A simple approach to ordinal classification. In Proceedings of the 12th European Conference on Machine Learning (pp. 145–156). New York: Springer-Verlag.

Hammer, B., Strickert, M., & Villmann, T. (2005). On the generalization ability of GRLVQ networks. Neural Processing Letters, 21, 109–120.

Hammer, B., & Villmann, T. (2002). Generalized relevance learning vector quantization. Neural Networks, 15, 1059–1068.

Hettich, S., Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. http://archive.ics.uci.edu/ml/

Kohonen, T. (1986). Learning vector quantization for pattern recognition (Tech. Rep. No. TKKF-A601). Espoo, Finland: Laboratory of Computer and Information Science, Department of Technical Physics, Helsinki University of Technology.

Kohonen, T. (1998). Learning vector quantization. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 537–540). Cambridge, MA: MIT Press.

Li, L., & Lin, H. (2007). Ordinal regression by extended binary classification. In B. Schölkopf, J. C. Platt, & T. Hofmann (Eds.), Advances in neural information processing systems, 19 (pp. 865–872). Cambridge, MA: MIT Press.

Lin, H., & Li, L. (2012). Reduction from cost-sensitive ordinal ranking to weighted binary classification. Neural Computation, 24, 1329–1367.

Sato, A., & Yamada, K. (1996). Generalized learning vector quantization. In D. Touretzky, M. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 423–429). Cambridge, MA: MIT Press.

Schneider, P. (2010). Advanced methods for prototype-based classification. Unpublished doctoral dissertation, University of Groningen. http://irs.ub.rug.nl/ppn/327245379

Schneider, P., Biehl, M., & Hammer, B. (2009). Adaptive relevance matrices in learning vector quantization. Neural Computation, 21, 3532–3561.

Shashua, A., & Levin, A. (2002). Ranking with large margin principle: Two approaches. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 937–944). Cambridge, MA: MIT Press.

Sun, B. Y., Li, J., Wu, D. D., Zhang, X. M., & Li, W. B. (2010). Kernel discriminant learning for ordinal regression. IEEE Transactions on Knowledge and Data Engineering, 22, 906–910.

Waegeman, W., & Boullart, L. (2006). An ensemble of weighted support vector machines for ordinal regression. Transactions on Engineering, Computing and Technology, 12, 71–75.

Xia, F., Zhou, L., Yang, Y., & Zhang, W. (2007). Ordinal regression as multiclass classification. International Journal of Intelligent Control and Systems, 12, 230–236.

Notes

1. Different distance metric measures can be used to define the closeness of prototypes.

2. The receptive field of prototype w is defined as the set of points in the input space that pick this prototype as their winner.

3. We are grateful to the anonymous reviewer for pointing this out.

4. This imposition can be relaxed to a variable number of prototypes per class.

5. Of course, other order-related costs could be used.

6. In our case, [0, K − 1].

7. Following Schneider et al. (2009) and Schneider (2010), the means of P random subsets of training samples selected from each class k, where k ∈ {1, 2, …, K}, are chosen as initial states of the prototypes. Alternatively, one could run a vector quantization with P centers on each class.

8. For ease of presentation, we omit from the notation the classes of the prototypes and the training point.

9. The initial learning rates are chosen individually for every application through cross-validation. We imposed ηw > ηΩ, implying a slower rate of change of the metric when compared with the prototype modification. This setting has yielded better performance in other matrix relevance learning applications (Schneider et al., 2009; Schneider, 2010).

10. In our experiments, τ was set to 0.0001.

11. Note that unlike γ+, γ− is negative.

12. Regression data sets are available online at http://www.gatsby.ucl.ac.uk/~chuwei/ordinalregression.html.

13. We are grateful to the anonymous reviewer for pointing this out.

14. The underlying eight benchmark data sets are considered balanced (with respect to their class distribution). Thus, we did not examine their MMAE results.

15. MZE results of the weighted LogitBoost reduction model are not listed because only the MAE of this algorithm was recorded in Xia et al. (2007).

16. Unfortunately, we have not been able to obtain code for the two other ordinal regression algorithms considered in this study: weighted LogitBoost (Xia et al., 2007) and KDLOR (Sun et al., 2010).

17. MMAE results of the SVM-based models are not listed because only MZE and MAE of these algorithms were recorded in Lin et al. (2012). Furthermore, MZE of the SVOR-IMC (with gaussian kernel) algorithm was not reported in Lin et al. (2012).

18. We are grateful to the anonymous reviewer for suggesting this experiment.

19. The MMAE results of the Pyrimidines and Abalone data sets were not assessed because they are considered balanced data sets; hence, their MAE and MMAE results coincide.