Ordinal classification refers to classification problems in which the classes have a natural order imposed on them because of the nature of the concept studied. Some ordinal classification approaches perform a projection from the input space to a one-dimensional (latent) space that is partitioned into a sequence of intervals (one for each class). Class identity of a novel input pattern is then decided based on the interval its projection falls into. This projection is trained only indirectly as part of the overall model fitting. As with any other latent model fitting, direct construction hints one may have about the desired form of the latent model can prove very useful for obtaining high-quality models. The key idea of this letter is to construct such a projection model directly, using insights about the class distribution obtained from pairwise distance calculations. The proposed approach is extensively evaluated with 8 nominal and ordinal classification methods, 10 real-world ordinal classification data sets, and 4 different performance measures. The new methodology obtained the best results in average ranking for three of the performance metrics, although significant differences were found for only some of the methods. Also, after observing the internal behavior of other methods in the latent space, we conclude that their internal projections do not fully reflect the intraclass behavior of the patterns. Our method is intrinsically simple, intuitive, and easily understandable, yet highly competitive with state-of-the-art approaches to ordinal classification.
Ordinal classification or ordinal regression is a supervised learning problem of predicting categories that have an ordered arrangement. When the problem exhibits an ordinal nature, this order is expected to also be present in the data input space (Hühn & Hüllermeier, 2008). The samples are labeled by a set of ranks with an ordering among the categories. In contrast to nominal classification, there is an ordinal relationship among the categories; and it differs from regression in that the number of ranks is finite and exact amounts of difference between ranks are not defined. In this way, ordinal classification lies somewhere between nominal classification and regression.
Ordinal regression should not be confused with sorting or ranking. Sorting is related to ranking all samples in the test set, with a total order. Ranking is related to ranking with a relative order of samples and a limited number of ranks. Of course, ordinal regression can be used to rank samples, but its objective is to obtain good accuracy and, at the same time, good ranking.
Ordinal classification problems are important, since they are common in our everyday life, where many problems require classification of items into naturally ordered classes. Examples of these problems are teaching assistant evaluation (Lim, Loh, & Shih, 2000), car insurance risk rating (Kibler, Aha, & Albert, 1989), pasture production (Barker, 1995), preference learning (Arens, 2010), breast cancer conservative treatment (Cardoso, Pinto da Costa, & Cardoso, 2005), wind forecasting (Gutiérrez et al., 2013), and credit rating (Kim & Ahn, 2012).
A variety of approaches have been proposed for ordinal classification. For example, Raykar, Duraiswami, and Krishnapuram (2008) learn ranking functions in the context of ordinal regression and collaborative filtering data sets. Kramer, Widmer, Pfahringer, and de Groeve (2010) map the ordinal scale by assigning numerical values and then apply a regression tree model. The main problem with this simple approach is the assignment of a numerical value corresponding to each class, without a principled way of deciding the true metric distances between the ordinal scales. Also, representing all patterns in a class by the same value may not reflect the relationships among the patterns in a natural way. In this letter, we propose that the numerical values associated with different patterns may differ (even within the same class), and, most important, the value for each individual pattern is decided based on its relative localization in the input space.
Other simple alternatives that have appeared in the literature try to impose the ordinal structure through the use of cost-sensitive classification, where standard (nominal) classifiers are made aware of ordinal information by penalizing the misclassification error, commonly selecting a cost equal to the absolute deviation between the actual and the predicted ranks (Kotsiantis & Pintelas, 2004). This is suitable when the knowledge about the problem is sufficient to completely define a cost matrix. However, when this is not possible, this approach makes the strong assumption that the distances between adjacent labels are all equal, which may not be appropriate.
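As a small sketch, the commonly selected cost matrix just mentioned (absolute deviation between actual and predicted ranks) can be written in a few lines; the function name is ours, for illustration only:

```python
import numpy as np

def absolute_cost_matrix(Q):
    # cost of predicting rank j when the true rank is i: |i - j|
    # (each row is V-shaped around the true class)
    ranks = np.arange(Q)
    return np.abs(ranks[:, None] - ranks[None, :])

C = absolute_cost_matrix(4)
# the first row is [0, 1, 2, 3]: confusing class 1 with class 4
# costs three times more than confusing it with class 2
```

Note that every off-diagonal step costs exactly 1 more than the previous one, which is precisely the equal-distance assumption discussed above.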
The third direct alternative suggested in the literature is to transform the ordinal classification problem into a nested binary classification one (Frank & Hall, 2001; Waegeman & Boullart, 2009) and then to combine the resulting classifier predictions to obtain the final decision. It is clear that ordinal information allows ranks to be compared. For a given rank k, an associated question could be, “Is the rank of pattern x greater than k?” This question is exactly a binary classification problem, and ordinal classification can be solved by approaching each binary classification problem independently and combining the binary outputs into a rank (Frank & Hall, 2001). Another alternative (Waegeman & Boullart, 2009) imposes explicit weights over the patterns of each binary system in such a way that errors on training objects are penalized proportionally to the absolute difference between their rank and k. Binarization of ordinal regression problems can also be tackled from an augmented binary classification perspective, that is, the binary problems are not solved independently, but a single binary classifier is constructed for all the subproblems. For example, Cardoso and Pinto da Costa (2007) add more dimensions and replicate the data points through what is known as the data replication method. This augmented space is used to construct a binary classifier, and the projection onto the original one results in an ordinal classifier. A very interesting framework in this direction is that proposed by Li and Lin (2007) and Lin and Li (2012): reduction from cost-sensitive ordinal ranking to weighted binary classification (RED), which is able to reformulate the problem as a binary problem by using a matrix for extension of the original samples, a weighting scheme, and a V-shaped cost matrix.
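The combination of the Q−1 binary answers into a single rank can be illustrated as follows. This is a minimal sketch of the Frank and Hall (2001) combination rule based on the estimated probabilities P(y > k); the function name and the argmax decision over the reconstructed class probabilities are our illustrative choices:

```python
import numpy as np

def frank_hall_combine(p):
    """Combine binary 'is the rank greater than k?' probabilities into a rank.
    p has shape (N, Q-1); column k-1 holds the estimated P(y > k).
    P(y = 1) = 1 - P(y > 1); P(y = q) = P(y > q-1) - P(y > q);
    P(y = Q) = P(y > Q-1)."""
    N, Q_minus_1 = p.shape
    Q = Q_minus_1 + 1
    probs = np.zeros((N, Q))
    probs[:, 0] = 1.0 - p[:, 0]
    for q in range(2, Q):
        probs[:, q - 1] = p[:, q - 2] - p[:, q - 1]
    probs[:, Q - 1] = p[:, Q - 2]
    return probs.argmax(axis=1) + 1   # ranks 1..Q

# A pattern with P(y > 1) = 0.9, P(y > 2) = 0.7, P(y > 3) = 0.1
# yields class probabilities [0.1, 0.2, 0.6, 0.1] and hence rank 3.
```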
An attractive feature of this framework is that it unifies many existing ordinal ranking algorithms, such as perceptron ranking (Crammer & Singer, 2005) and support vector ordinal regression (Chu & Keerthi, 2007). Recently, Fouad and Tiňo (2012) adapted learning vector quantization (LVQ) to the ordinal case in the context of prototype-based learning. In that work, the order information is utilized to select the class prototypes to be adapted, improving the prototype updating process.
The vast majority of proposals addressing ordinal classification can be grouped under the umbrella of threshold methods (Verwaeren, Waegeman, & De Baets, 2012). These methods assume that the ordinal response is a coarsely measured latent continuous variable and model it as real intervals in one dimension. Based on this assumption, the algorithms seek a direction onto which the samples are projected and a set of thresholds that partition the direction into consecutive intervals representing ordinal categories (McCullagh, 1980; Verwaeren et al., 2012; Herbrich, Graepel, & Obermayer, 2000; Crammer & Singer, 2001; Chu & Keerthi, 2005). The proportional odds model (POM) (McCullagh, 1980) is a standard statistical approach in this direction, where the latent variable is modeled by using a linear combination of the inputs and a probabilistic distribution is assumed for the patterns projected by this function. Crammer and Singer (2001) generalized the online perceptron algorithm with multiple thresholds to perform ordinal ranking. Support vector machines (SVMs) (Cortes & Vapnik, 1995; Vapnik, 1999) were also adapted for ordinal regression, first by the large-margin algorithm of Herbrich et al. (2000). The main drawback of this first proposal was that the problem size was a quadratic function of the training data size. A related, more efficient approach was presented by Shashua and Levin (2002), who excluded the inequality constraints on the thresholds. However, this can result in undesirable solutions because the absence of constraints can lead to difficulties in imposing order on the thresholds. Chu and Keerthi (2005) explicitly and implicitly included the constraints in the model formulation (support vector for ordinal regression, SVOR), deriving the associated dual problem and the optimality conditions.
From another perspective, discriminant learning has been adapted to the ordinal setup by (apart from maximizing between-class distance and minimizing within-class distance) trying to minimize distance separation between projected patterns of consecutive classes (kernel discriminant learning for ordinal regression, KDLOR) (Sun, Li, Wu, Zhang, & Li, 2010). Finally, threshold models have also been estimated by using a Bayesian framework (gaussian processes for ordinal regression, GPOR) (Chu & Ghahramani, 2005), where the latent function is modeled using gaussian processes and then all the parameters are estimated by maximum likelihood optimization.
While threshold approaches offer an interesting perspective on the problem of ordinal classification, they learn the projection from the input space onto the one-dimensional latent space only indirectly, as part of the overall model fitting. As with any other latent model fitting, direct construction hints one may have about the desired form of the latent model can prove very useful for obtaining high-quality models. The key idea of this letter is to construct such a projection model directly, using insights about class distribution obtained from pairwise distance calculations. Indeed, our motivation stems from the fact that the order information should also be present in the data input space, and it could be interesting to take advantage of it to construct a useful variable for ordering the patterns using the ordinal scale. Additionally, regression is clearly the most natural way to approximate this continuous variable. As a result, we propose to construct the ordinal classifier in two stages: the input data are first projected into a one-dimensional variable by considering the relative position of the patterns in the input space, and then a standard regression algorithm is applied to learn a function to predict new values of this derived variable.
The main contribution of this work is the projection onto a one-dimensional variable, which is done by a guided projection process. This process exploits the ordinal distribution of patterns in the input space. A measure of how well a pattern is located within its corresponding class region is defined by considering the distances between patterns of the adjacent classes in the ordinal scale. Then a projection interval is defined for each class, and the centers of those intervals (for nonboundary classes) are associated with the best-located patterns of the corresponding classes (quantified by the measure mentioned above). For the boundary classes (first and last in the class order), the extreme end points of their projection intervals are associated with the most separated patterns of those classes. All the other patterns are assigned proportional positions in their corresponding class intervals, again according to their goodness values, expressing how well a pattern is located within its class. We refer to this projection as pairwise class distances (PCD) based. The behavior of this projection is evaluated over synthetic data sets, showing an intuitive response and good ability to separate adjacent classes even in nonlinear settings.
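The pairwise-distance idea can be sketched in a simplified form. This is an illustrative approximation, not the exact equations of section 3: the particular formulas we use for the position of a pattern inside its class interval are our own simplification of the goodness measure described above.

```python
import numpy as np

def pcd_projection_sketch(X, y, Q):
    """Map each pattern of class q into the interval ((q-1)/Q, q/Q),
    with its position inside the interval driven by its minimum
    distances to the adjacent classes q-1 and q+1."""
    z = np.zeros(len(y))
    for i in range(len(y)):
        xi, q = X[i], y[i]

        def min_dist(c):
            # minimum Euclidean distance from xi to the patterns of class c
            P = X[y == c]
            return np.sqrt(((P - xi) ** 2).sum(axis=1)).min() if len(P) else None

        d_prev, d_next = min_dist(q - 1), min_dist(q + 1)
        if d_prev is None:            # first class: far from class 2 -> low z
            pos = 1.0 / (1.0 + d_next)
        elif d_next is None:          # last class: far from class Q-1 -> high z
            pos = d_prev / (1.0 + d_prev)
        else:                         # interior class: relative closeness
            pos = d_prev / (d_prev + d_next)
        z[i] = (q - 1 + pos) / Q      # place inside the class interval
    return z
```

On a toy one-dimensional ordered data set, the resulting z values increase monotonically with the position of the patterns and stay inside their class intervals, which is the qualitative behavior the projection is designed to exhibit.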
Once the mapping is done, our framework allows the design of effective ordinal ranking algorithms based on well-tuned regression approaches. The final classifier constructed by combining PCD and a regressor is called the pairwise class distances ordinal classifier (PCDOC). In this contribution, PCDOC is implemented using ε-support vector regression (ε-SVR) (Schölkopf & Smola, 2001; Vapnik, 1999) as the base regressor, although any other properly handled regression method could be used.
We carry out an extensive set of experiments on 10 real-world ordinal regression data sets, comparing our approach with 8 state-of-the-art methods. Our method, a simple one, holds up very well. Under four complementary performance metrics, the proposed method obtained the best mean ranking for three of the four metrics.
The rest of the letter is organized as follows. Section 2 introduces the ordinal classification problem and performance metrics we use to evaluate the ordinal classifiers. Section 3 explains the proposed data projection method and the classification algorithm. It also evaluates the behavior of the projection using two synthetic data sets and the performance of the classification algorithm under situations that may hamper classification. Section 4 presents the experimental design, data sets, and alternative ordinal classification methods that will be compared with our approach and discusses the experimental results. Finally, the last section sums up key conclusions and points to future work.
2. Ordinal Classification
This section briefly introduces the mathematical notation and the ordinal classification performance metrics, including the threshold model formulation.
2.1. Problem Formulation.
In an ordinal classification problem, the purpose is to learn a mapping from an input space X to a finite set C = {C1, C2, …, CQ} containing Q labels, where the label set has an order relation C1 ≺ C2 ≺ ⋯ ≺ CQ imposed on it. The symbol ≺ denotes the ordering between different ranks. A rank for the ordinal label can be defined as O(Cq) = q, q = 1, …, Q. Each pattern is represented by a K-dimensional feature vector x ∈ X ⊆ R^K and a class label y ∈ C. The training data set T is composed of N patterns, T = {(xi, yi) : i = 1, …, N}, with xi ∈ X and yi ∈ C.
Given these definitions, an ordinal classifier should be constructed taking into account two goals. First, the nature of the problem implies that the class order is somehow related to the distribution of patterns in the space of attributes and also to the topological distribution of the classes. Therefore the classifier must exploit this a priori knowledge about the input space (Hühn & Hüllermeier, 2008). Second, when evaluating an ordinal classifier, the performance metrics must consider the order of the classes, so that misclassifications between adjacent classes are considered less important than those between nonadjacent classes, more separated in the class order. For example, given an ordinal data set of weather prediction with the natural order Very Cold ≺ Cold ≺ ⋯ ≺ Hot between classes, it is straightforward to think that predicting class Hot when the real class is Cold represents a more severe error than that associated with a Very Cold prediction. Thus, specialized measures are needed for evaluating ordinal classifier performance (Pinto da Costa, Alonso, & Cardoso, 2008; Cruz-Ramírez, Hervás-Martínez, Sánchez-Monedero, & Gutiérrez, 2011).
2.2. Ordinal Classification Performance Metrics.
In this work, we utilize four evaluation metrics quantifying the accuracy of the N predicted ordinal labels {ŷ1, …, ŷN} for a given data set with respect to the true targets {y1, …, yN}:
- Acc: The accuracy (Acc), also known as the correct classification rate, is the rate of correctly classified patterns, Acc = (1/N) Σi I(ŷi = yi), where ŷi is the predicted rank and I(c) is the indicator function, equal to 1 if c is true and 0 otherwise. Acc values range from 0 to 1, and they represent global performance on the classification task. Although Acc is widely used in classification tasks, it is not suitable for some types of problems, such as imbalanced data sets (Sánchez-Monedero, Gutiérrez, Fernández-Navarro, & Hervás-Martínez, 2011) (very different numbers of patterns per class) or ordinal data sets (Baccianella, Esuli, & Sebastiani, 2009).
- MAE: The mean absolute error (MAE) is the average absolute deviation of the predicted ranks from the true ranks (Baccianella et al., 2009), MAE = (1/N) Σi |O(yi) − O(ŷi)|, where O(Cq) = q is the rank of label Cq. MAE values range from 0 to Q−1. Since Acc does not reflect the category order, MAE is typically used in the ordinal classification literature together with Acc (Pinto da Costa et al., 2008; Agresti, 1984; Waegeman & De Baets, 2011; Chu & Keerthi, 2007; Chu & Ghahramani, 2005; Li & Lin, 2007). However, neither Acc nor MAE is suitable for problems with imbalanced classes. This is rectified in the average MAE (AMAE) (Baccianella et al., 2009), which measures the mean performance of the classifier across all classes.
- AMAE: This measure evaluates the mean of the MAEs across classes, AMAE = (1/Q) Σq MAEq, where MAEq is the MAE computed considering only the patterns of class Cq (Baccianella et al., 2009). It has been proposed as a more robust alternative to MAE for imbalanced data sets—a common situation in ordinal classification, where extreme classes (associated with rare situations) tend to be less populated.
- τb: Kendall's τb is a statistic used to measure the association between two measured quantities; specifically, it is a measure of rank correlation (Kendall, 1962). For each pair of patterns (i, j), c*ij is +1 if ŷi is greater than ŷj (in the ordinal scale), 0 if ŷi and ŷj are the same, and −1 if ŷi is lower than ŷj; cij is defined in the same way using yi and yj. τb values range from −1 (maximum disagreement between the prediction and the true label), through 0 (no correlation between them), to +1 (maximum agreement). τb has been advocated as a better measure for ordinal variables because it is independent of the values used to represent classes (Cardoso & Sousa, 2011), since it works directly on the set of pairs corresponding to different observations. One may argue that shifting all predictions by one class would keep the same τb value, whereas the quality of the ordinal classification would be lower. However, since there is a finite number of classes, shifting all predictions by one class would have a detrimental effect on the boundary classes and so would substantially decrease the performance, even as measured by τb. As a consequence, τb is an interesting measure for ordinal classification but should be used in conjunction with other ones.
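The four metrics above can be implemented directly. The sketch below assumes ranks encoded as consecutive integers 1, …, Q, every class present in the true targets, and not all labels tied (otherwise the τb denominator vanishes); the pairwise τb loop follows the pair-counting definition just given:

```python
import numpy as np

def acc(y, yp):
    # rate of correctly classified patterns, in [0, 1]
    return np.mean(y == yp)

def mae(y, yp):
    # average absolute deviation between true and predicted ranks, in [0, Q-1]
    return np.mean(np.abs(y - yp))

def amae(y, yp, Q):
    # mean of the per-class MAEs: robust to class imbalance
    return np.mean([np.abs(y[y == q] - yp[y == q]).mean()
                    for q in range(1, Q + 1)])

def kendall_tau_b(y, yp):
    # sum of pairwise agreement signs, with a ties correction
    # in the denominator
    n = len(y)
    num, ties_y, ties_yp = 0, 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            a, b = np.sign(y[i] - y[j]), np.sign(yp[i] - yp[j])
            num += a * b
            ties_y += (a == 0)
            ties_yp += (b == 0)
    n0 = n * (n - 1) // 2
    return num / np.sqrt((n0 - ties_y) * (n0 - ties_yp))
```

For example, with true labels [1, 1, 2, 3] and predictions [1, 2, 2, 3], Acc is 0.75 and MAE is 0.25, but AMAE is only 1/6 because the single error is averaged within its own class before averaging across the three classes.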
2.3. Latent Variable Modeling for Ordinal Classification.
In our proposal, it is assumed that a model can be found that links data items x with their latent space representation z. We place our proposal in the context of latent variable models for ordinal classification because of its similarity to these models. In contrast to other models employing a one-dimensional latent space, such as POM (McCullagh, 1980), we do not consider variable thresholds but impose fixed values for θ. However, suitable dimensionality reduction is given due attention: first, by trying to exploit the ordinal structure of the input space; and second, by explicitly putting external pressure on the margins between the classes in the latent space (see section 3.2).
3. Proposed Method
Our approach is different from the previous ones in that it does not implicitly learn latent representations of the training inputs. Instead, we impose how training inputs xi are going to be represented through their one-dimensional projections zi. Then this representation is generalized to the whole input space by training a regressor on the (xi, zi) pairs, resulting in a projection function f: X → R. To ease the presentation, we sometimes write training input patterns x as x(q) to explicitly reflect their class label rank q (i.e., the class label of x is Cq).
3.1. Pairwise Class Distance Projection.
Figure 1 shows the idea of minimum distances for each pattern with respect to the patterns of the adjacent classes. In this figure, patterns of the second class are considered. The example illustrates how the minimum distances are obtained for the pattern x(2) marked with a circle. Among all distances between x(2) and the class 1 patterns, the distance to the item x(1) is the smallest, so the minimum distance to the previous class is calculated by using this pattern. Similarly, the minimum distance to the next class is the distance between x(2) and x(3).
3.2. Analysis of the Proposed Projection in Synthetic Data Sets.
For illustration purposes, we generated synthetic ordinal classification data sets in with four classes (Q=4). Figure 2 shows the patterns of a synthetic data set, SyntheticLinearOrder, with a linear order between classes, and Figure 3 shows the SyntheticNonLinearOrder data set, with a nonlinear ordinal relationship between classes. Points at SyntheticLinearOrder were generated by adding a uniform noise to points of a line. Points in SyntheticNonLinearOrder were generated by adding a gaussian noise to points on a spiral. In both figures, points belonging to different classes are marked with different colors and symbols. Besides the points, the figures also illustrate basic concepts of the proposed method on example points (surrounded by gray circles). For these points, the minimum distances are illustrated with lines of the corresponding class color. The minimum distances of a point to the previous and next class patterns are marked with dashed and solid lines, respectively. For selected points, we show the value of the PCD projection (calculated using equation 3.5).
In Figure 2, the z value increases for patterns of the higher classes, and this value varies depending on the position of the pattern x(q) in the space with respect to the patterns x(q−1) and x(q+1) of the adjacent classes. The extreme values, z=0.0 and z=1.0, correspond to the most separated patterns of classes 1 and Q, respectively (those with maximum minimum distance to their adjacent class). SyntheticNonLinearOrder in Figure 3 is designed to demonstrate that the PCD projection is suitable for more complex ordinal topologies of the data. That is, for any topology in an ordinal data set, it is expected that patterns of classes q−1 and q+1 are always the closest ones to the patterns of class q, and PCD will take advantage of this situation to decide the relative order of a pattern within its class, even when this order is produced in a nonlinear manner.
Figures 4a and 4b show histograms of the PCD projections for the synthetic data sets in Figures 2 and 3, respectively. The thresholds θ that divide the z values of the different classes are also included. Observe that the z values of the different classes are clearly separated and that they are compacted within a range that is always smaller than the range initially indicated by the thresholds. This is due to the scaling of the z values in equation 3.2, where the minimum distance value cannot be zero, so a pattern can never be located close to the boundary separating intervals of adjacent classes.
3.3. Algorithm for Ordinal Classification.
Once the PCD projections have been obtained for all training inputs, we construct a new training set Tz = {(xi, zi) : i = 1, …, N}. Any generic regression tool can be trained on Tz to obtain the projection function f: X → R. In this respect, our method is quite general, allowing the user to choose his or her favorite regression method or any other improved regression tool introduced in the future. The resulting algorithm, pairwise class distances for ordinal classification (PCDOC), is described in two steps in Figures 5 and 6.
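A minimal sketch of the two PCDOC steps follows, under two assumptions of ours: a k-nearest-neighbor regressor as a stand-in for the generic regression tool (the letter itself uses support vector regression), and fixed equal-width class intervals on [0, 1] for mapping a predicted z value back to a class:

```python
import numpy as np

def knn_regress(X_train, z_train, X_test, k=3):
    # stand-in regressor: mean z of the k nearest training patterns
    preds = []
    for x in X_test:
        d = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        preds.append(z_train[np.argsort(d)[:k]].mean())
    return np.array(preds)

def z_to_class(z, Q):
    # fixed equal-width thresholds on [0, 1]: class q covers ((q-1)/Q, q/Q]
    return np.clip(np.ceil(np.asarray(z) * Q), 1, Q).astype(int)
```

Prediction for a new pattern is then `z_to_class(knn_regress(X_train, z_train, x_new), Q)`: regress first, then read off the class interval the prediction falls into.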
It is expected that formulating the problem as a regression problem would help the model to capture the ordinal structure of the input and output spaces and their relationship. In addition, due to the nature of the regression problem, it is expected that the performance of the classification task will be improved regarding metrics that consider the difference between the predicted and actual classes within the linear class order, such as MAE or AMAE, or the correlation between the target and predicted values, such as . Experimental results confirm this hypothesis in section 4.3.
3.4. PCDOC Performance Analysis in Some Controlled Experiments.
3.4.1. Analysis of the Influence of Dimensionality and Class Overlapping.
This section analyzes the performance of the PCDOC algorithm under situations that may hamper classification: class overlapping and large dimensionality of the data. For this purpose, different synthetic data sets have been generated by sampling random points from Q gaussian distributions, where Q is the number of classes, so that each class point is a random sample of the corresponding gaussian distribution. In order to easily control the overlap of the classes, the variance (σ²) is kept constant, independent of the number of dimensions (K). In addition, the Q centers (means μ1, …, μQ) are set up in order to keep a distance of 1 between two adjacent class means, independent of K. Under this setting, each coordinate of adjacent class means is separated by 1/√K, so that ‖μ2 − μ1‖ = 1, ‖μ3 − μ2‖ = 1, and so on.
Several values of the input space dimensionality (K) and of the distribution width (σ) were tested, so that 18 data sets were generated. The number of patterns for each class from 1 to 4 was 10, 100, 100, and 5, respectively. Figure 7 shows two of these data sets, generated with different variance values for K=2.
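The sampling scheme just described can be sketched as follows (the function name and seed handling are ours; the per-coordinate offset of 1/√K makes adjacent class means lie exactly at Euclidean distance 1 for any K):

```python
import numpy as np

def make_gaussian_ordinal(K, sigma, n_per_class=(10, 100, 100, 5), seed=0):
    """Sample Q gaussian classes whose adjacent means are exactly
    distance 1 apart regardless of the dimensionality K."""
    rng = np.random.default_rng(seed)
    Q = len(n_per_class)
    X, y = [], []
    for q in range(Q):
        # each coordinate of consecutive means differs by 1/sqrt(K),
        # so ||mu_{q+1} - mu_q|| = sqrt(K * (1/sqrt(K))**2) = 1
        mu = np.full(K, q / np.sqrt(K))
        X.append(rng.normal(mu, sigma, size=(n_per_class[q], K)))
        y.append(np.full(n_per_class[q], q + 1))
    return np.vstack(X), np.concatenate(y)
```

Varying `sigma` while keeping the unit distance between adjacent means is what controls the class overlap independently of K.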
For these experiments, our approach uses the support vector regression (SVR) algorithm as the model for the z variable (the method will be referred to as SVR-PCDOC). We have also included three methods as baseline methods: the C-support vector classification (SVC) (Cortes & Vapnik, 1995; Vapnik, 1999), the support vector ordinal regression with explicit constraints (SVOREX) (Chu & Keerthi, 2005, 2007), and the kernel discriminant learning for ordinal regression (KDLOR) (Sun et al., 2010). As in the next experimental section (section 4), the experimental design includes 30 stratified random splits (with 75% of patterns for training and the remaining for generalization). The mean MAE and AMAE generalization results are used for comparison purposes in Figure 8 (for further details about experimental procedure, methods description, and hyperparameter optimization, refer to section 4.2).
From the results depicted in Figure 8, we can generally conclude that the three methods other than KDLOR show a similar MAE performance degradation with the increase of class overlapping and dimensionality. Figure 8a shows that SVR-PCDOC has a slightly worse performance than SVC and SVOREX. However, in experiments with higher K (see Figures 8c and 8e), the performance of the three ordinal methods varies in a similar way. In particular, in Figure 8e we can observe that SVC performance decreases with high overlapping and high dimensionality, whereas the ordinal methods maintain similar performance. From the analysis of the AMAE performance, we can conclude that KDLOR outperforms the rest of the methods in cases of low class overlapping. Regarding our method, its AMAE performance is worse than that of the other methods in the case of low class overlap. However, in general, our method seems more robust when the class overlap increases.
3.4.2. Analysis of the Influence of Data Multimodality.
This section extends the experiments to the case of multimodal data. The data sets are generated with K=2 and a fixed variance, and the number of modes per class is varied. Figure 9a presents the unimodal case. The data sets with more modes per class are generated in the following way. A gaussian distribution is set up as in the previous section, with center μq. For each class, each additional gaussian distribution is centered at a random location within the hypersphere with center μq and radius 0.75. Then patterns are sampled from each distribution. For each class, we considered different numbers of modes, from one to four. The number of patterns generated for each mode was 36, 90, 90, and 24 for classes 1, 2, 3, and 4, respectively, using the same number for all modes of a class. An example of the bimodal case (two gaussian distributions per class) is shown in Figure 9b, having 72, 180, 180, and 48 patterns for classes 1, 2, 3, and 4, respectively.
Experiments were carried out as in the previous section, and MAE and AMAE generalization results are depicted in Figure 10. Regarding MAE, Figure 10a reveals that the four methods perform similarly on data sets with one and four modes per class, but they differ in performance for those with two and three modes. Considering only MAE, SVR-PCDOC has the worst performance in the two- and three-mode cases. Nevertheless, considering the AMAE results in Figure 10b, SVR-PCDOC and KDLOR achieve the best results. The different behavior of the methods depending on the performance measure can be explained by observing the nature of the bimodal data set (see Figure 9b), where the majority of the patterns are from classes 2 and 3. In this context, the optimization done by SVOREX and SVC can move the decision thresholds to better classify patterns of these two classes at the expense of misclassifying class 1 and 4 patterns, especially patterns placed on the class boundaries (see Figure 9b).
4. Experiments

In this section we report on extensive experiments that were performed to check the competitiveness of the proposed methodology. The source code of the proposed method, the synthetic data sets analysis code, and the real ordinal data set partitions used for the experiments are available at a public website (http://www.uco.es/grupos/ayrna/neco-pairwisedistances).
4.1. Ordinal Classification Data Sets and Experimental Design.
To the best of our knowledge, there are no public data set repositories specifically devoted to real ordinal classification problems. The ordinal regression benchmark data set repository provided by Chu and Ghahramani (2005) is the most widely used in the literature. However, these data sets are not real ordinal classification data sets but regression ones. To turn regression into ordinal classification, the target variable was discretized into Q different bins (representing classes) with equal frequency or equal width. However, there are potential problems with this approach. If equal-frequency labeling is considered, the data sets do not exhibit some characteristics of typical complex classification tasks, such as class imbalance. On the other hand, severe class imbalance can be introduced by using the same binning width. Finally, as the actual target regression variable exists with observed values, the classification problem can be simpler than on those data sets where the variable z is really unobservable and has to be modeled.
We have therefore decided to use a set of real ordinal classification data sets publicly available at the UCI (Asuncion & Newman, 2007) and mldata.org (Sonnenburg, 2011) repositories (see Table 1 for data description). All of them are ordinal classification problems, although one can find literature where the ordering information is discarded. The nature of the target variable is now analyzed for two example data sets. The bondrate data set is a classification problem where the purpose is to assign the right ordered category to bonds, with the category labels C1 = AAA, C2 = AA, C3 = A, C4 = BBB, and C5 = BB. These labels represent the quality of a bond and are assigned by credit rating agencies, AAA being the highest quality and BB the worst. In this case, classes AAA, AA, and A are more similar to one another than classes BBB and BB, so no assumptions should be made about the distance between classes in either the input or the latent space. The other example is the eucalyptus data set; in this case, the problem is to predict which eucalyptus seedlots are best for soil conservation in a seasonally dry hill country. The classes are C1 = none, C2 = low, C3 = average, C4 = good, and C5 = best; it cannot be assumed that each class occupies an equal width in the latent space.
|Data Set|N|K|Q|Ordered Class Distribution|
Note: N is the number of patterns, K is the number of attributes, and Q is the number of classes.
Regarding the experimental setup, 30 random splits of the data sets have been considered, with 75% and 25% of the instances in the training and test sets, respectively. The partitions were the same for all compared methods, and since all of them are deterministic, one model was obtained and evaluated (on the test (generalization) set) for each split. All nominal attributes were transformed into as many binary attributes as the number of categories. All the data sets were properly standardized.
4.2. Existing Methods Used for Comparisons.
For comparison purposes, different state-of-the-art methods have been included in the experimentation:
Gaussian processes for ordinal regression (GPOR) (Chu & Ghahramani, 2005) is a probabilistic kernel approach to ordinal regression based on gaussian processes, where a threshold model generalizing the probit function is used as the likelihood function for ordinal variables. In addition, Chu and Ghahramani apply the automatic relevance determination (ARD) method proposed by Mackay (1994) and Neal (1996) to the GPOR model. When GPOR is used with ARD feature selection, we will refer to the algorithm as GPOR-ARD.
Support vector ordinal regression (SVOR) (Chu & Keerthi, 2005, 2007) comprises two support vector approaches for ordinal regression. In both, multiple thresholds are optimized in order to define parallel discriminant hyperplanes for the ordinal scales. The first approach, with explicit inequality constraints on the thresholds, derives the optimality conditions for the dual problem and adapts the SMO algorithm for the solution; we will refer to it as SVOREX. In the second approach, the samples in all the categories are allowed to contribute errors to each threshold, so there is no need to include the inequality constraints in the problem. This approach is named SVOR with implicit constraints (SVORIM).
RED-SVM (Li & Lin, 2007) applies the reduction from cost-sensitive ordinal ranking to weighted binary classification (RED) framework to SVM. The RED method can be summarized in three steps. First, all training samples are transformed into extended samples by using a coding matrix, and these extended samples are weighted with a cost matrix. Second, all the extended examples are jointly learned by a binary classifier with confidence outputs, aiming at a low weighted 0/1 loss. Finally, the binary outputs are converted to a rank. In this letter, the coding matrix considered is the identity, and the cost matrix is the absolute value matrix, applied to the standard binary soft-margin SVM.
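Under the identity coding matrix and the absolute cost matrix used here, the reduction can be sketched as follows (an illustrative Python sketch; the helper names are ours, and with the absolute cost every extended sample receives unit weight):

```python
def red_extend(X, y, Q):
    """Step 1 of RED: each sample (x, y) with rank y in {1..Q} becomes Q-1
    extended samples ((x, k), sign(y > k)) for k = 1..Q-1. With the absolute
    cost matrix, all extended samples get weight 1."""
    extended = []
    for x, rank in zip(X, y):
        for k in range(1, Q):
            extended.append((x, k, +1 if rank > k else -1, 1.0))
    return extended

def red_rank(binary_outputs):
    """Step 3 of RED: the predicted rank is one plus the number of positive
    binary decisions (assuming the outputs are consistently ordered)."""
    return 1 + sum(1 for b in binary_outputs if b > 0)
```

For example, a pattern of rank 3 in a 4-class problem yields the binary targets (+1, +1, -1) for the three extended samples.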
A simple approach to ordinal regression (ASAOR) (Frank & Hall, 2001) is a general method that enables standard classification algorithms to make use of the ordering information in ordinal attributes. For training, the method transforms the Q-class ordinal problem into Q−1 binary class problems: the ordinal attribute with ordered values is converted into Q−1 binary attributes. The class of a new instance is predicted by estimating the probability of its belonging to each of the Q classes with the Q−1 models. In the current work, the C4.5 method available in Weka (Hall et al., 2009) is used as the underlying classification algorithm, since this is the one initially employed by the authors of ASAOR. Accordingly, the algorithm is identified as ASAOR(C4.5).
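The probability reconstruction of Frank and Hall can be sketched as follows (a minimal Python sketch; `p_greater[k-1]` is assumed to hold the binary estimate of P(y > C_k)):

```python
def asaor_probs(p_greater):
    """Combine the Q-1 binary estimates P(y > C_k) into Q class probabilities:
    P(C_1) = 1 - P(y > C_1),
    P(C_k) = P(y > C_{k-1}) - P(y > C_k) for 1 < k < Q,
    P(C_Q) = P(y > C_{Q-1})."""
    probs = [1.0 - p_greater[0]]
    probs += [p_greater[k - 1] - p_greater[k] for k in range(1, len(p_greater))]
    probs.append(p_greater[-1])
    return probs
```

The predicted class is then the one with maximal reconstructed probability.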
The proportional odds model (POM) is one of the first models specifically designed for ordinal regression (McCullagh, 1980). The model is based on the assumption of stochastic ordering of the input space. Stochastic ordering is satisfied by a monotonic function (the model) that defines a probability density function over the class labels for a given feature vector x. Because of the thresholds that divide the monotonic function values corresponding to different classes, this method was the first to be named a threshold model. The main problem associated with this model is that the projection is a linear combination of the inputs (a linear projection), which hinders its performance. For the POM model, the corresponding Matlab implementation has been used.
Kernel discriminant learning for ordinal regression (KDLOR) (Sun et al., 2010) extends the kernel discriminant analysis (KDA) using a rank constraint. The method looks for the optimal projection that maximizes the separation between the projection of the different classes and minimizes the intraclass distance as in traditional discriminant analysis for nominal classes. Crucially, the order of the classes in the resulting projection is also considered. The authors claim that compared with the SVM-based methods, the KDA approach takes advantage of the global information of the data and the distribution of the classes and also reduces the computational complexity of the problem.
The support vector machine (SVM) nominal classifier (Cortes & Vapnik, 1995; Vapnik, 1999) is included in the experiments in order to establish a nominal performance baseline. C-support vector classification (SVC), available in libSVM 3.0 (Chang & Lin, 2011), is used as the SVM implementation. To deal with the multiclass case, a 1-versus-1 scheme has been considered, following the recommendations of Hsu and Lin (2002).
In our approach, the support vector regression (SVR) algorithm is used as the model for the z variable, and the resulting method is referred to by the acronym SVR-PCDOC. The ε-SVR implementation available in libSVM is used. The authors of GPOR, SVOREX, SVORIM, and RED-SVM provide publicly available software implementations of their methods.4 In the case of KDLOR, the method has been implemented by the authors in Matlab (Perez-Ortiz et al., 2011).
Model selection is an important issue and involves selecting the best hyperparameter combination for each of the compared methods. All the methods were configured to use the gaussian kernel. For the support vector algorithms (SVC, RED-SVM, SVOREX, SVORIM, and SVR-PCDOC), the corresponding hyperparameters (the regularization parameter C and the width σ of the gaussian functions) were adjusted by a grid search over each of the 30 training sets with fivefold nested cross-validation. For SVR-PCDOC, the additional ε parameter of the ε-insensitive loss also has to be adjusted. For KDLOR, the width of the gaussian kernel and the regularization parameter u (used to avoid the singularity problem) were adjusted in the same way. The POM and ASAOR(C4.5) methods have no hyperparameters. Finally, GPOR-ARD has no hyperparameters to fix, since the method optimizes the associated parameters itself.
For all the methods, the MAE measure is used as the performance metric guiding the grid search, to be consistent with the authors of the different state-of-the-art methods. The grid search procedure of SVC in libSVM has been modified to use MAE as the criterion for hyperparameter selection.
4.3. Performance Results.
Table 2 summarizes the results through the mean and standard deviation (SD) of AccG, MAEG, AMAEG, and the correlation coefficient across the 30 holdout splits, where the subindex G indicates that the results were obtained on the (holdout) generalization fold. As a summary, Table 3 shows, for each performance metric, the mean values across all the data sets and the mean ranking values obtained when comparing the different methods (R=1 for the best-performing method and R=9 for the worst one). To enhance readability, in Tables 2 and 3 the best and second-best results are in boldface and italics, respectively.
| Method/Data Set | automobile | bondrate | contact-lenses | eucalyptus | newthyroid | pasture | squash-stored | squash-unstored | tae | winequality-red |
Notes: The mean and standard deviation (SD) of the generalization results are reported for each data set. The best statistical result is in boldface and the second-best result in italics.
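For reference, the label-based metrics can be computed as follows (a Python/NumPy sketch, assuming classes encoded as integers 1..Q; AMAE averages the per-class MAE so that minority classes count as much as majority ones):

```python
import numpy as np

def acc(y_true, y_pred):
    """Fraction of correctly classified patterns."""
    return float(np.mean(y_true == y_pred))

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted class indices."""
    return float(np.mean(np.abs(y_true - y_pred)))

def amae(y_true, y_pred):
    """Average MAE: the MAE is computed separately for each class present
    in y_true, and the per-class values are then averaged."""
    classes = np.unique(y_true)
    return float(np.mean([mae(y_true[y_true == q], y_pred[y_true == q])
                          for q in classes]))
```

On a perfectly balanced data set MAE and AMAE coincide, which is why both measures are identical for pasture in Table 2; under class imbalance they can differ sharply.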
Regarding Table 2, it can be seen that the majority of methods are very competitive. The best-performing method depends on the performance metric considered, as can be seen from the mean rankings. This is also true when each data set is considered separately; the performance for some data sets varies noticeably if AMAEG is considered instead of MAEG (see bondrate, contact-lenses, eucalyptus, squash-unstored, and winequality-red). In the case of winequality-red, the second-worst method in MAEG, ASAOR(C4.5), is the second-best one in AMAEG. It is worth mentioning that for the pasture data set the mean MAEG and AMAEG are the same, because pasture is a perfectly balanced data set (see section 2.2). In the case of tae, MAEG and AMAEG are very similar, since the pattern distribution across classes is nearly uniform. Regarding the correlation coefficient, it is interesting to highlight that a value close to zero reveals that the classifier predictions are not related to the real values; that is, the classifier performs similarly to a trivial classifier. This happens for the GPOR method in the bondrate, squash-stored, and tae data sets and for POM in the eucalyptus data set.
From Table 3, it can be observed that the best mean value across the different data sets does not always translate into the best mean ranking. We now analyze the results in greater detail, highlighting the best and second-best performances. When AccG is considered, SVC is clearly the best method, in both average performance and ranking. KDLOR and SVR-PCDOC are the second-best methods in average value and ranking, respectively. However, the results are very different for all the other measures, where the order is included in the evaluation. The best method in average MAEG and in ranking of MAEG is SVR-PCDOC, and the second-best ranks are for KDLOR and RED-SVM, which have similar mean MAEG. AMAE is a better alternative than MAE when the distribution of patterns is not balanced, and this is clearly the case for several data sets (see Table 1). The best values for mean AMAEG and mean ranking are obtained by SVR-PCDOC, and the second-best ones are those reported by KDLOR. Finally, the correlation coefficient reveals the clearest differences: when this metric is used, the best mean values and ranks are reported by SVR-PCDOC, followed by KDLOR.
4.4. Statistical Comparisons Between Methods.
To quantify whether a statistical difference exists between any of these algorithms, a procedure for comparing multiple classifiers over multiple data sets is employed (Demšar, 2006). First, Friedman's nonparametric test (Friedman, 1940) with a significance level of α = 0.05 was carried out to determine the statistical significance of the differences in the mean ranks of Table 3 for each measure. The test rejected the null hypothesis, indicating that the differences in the mean rankings of AccG, MAEG, AMAEG, and the correlation coefficient obtained by the different algorithms were statistically significant (at α = 0.05). Specifically, with this number of data sets and algorithms, the confidence interval is C0 = (0, F(α=0.05)), and the corresponding F-values for each metric were 3.257 ∉ C0, 4.821 ∉ C0, 4.184 ∉ C0, and 5.099 ∉ C0, respectively.
On the basis of this rejection, the Nemenyi post hoc test is used to compare all classifiers to one another (Demšar, 2006). This test considers the performance of any two classifiers to be significantly different if their mean ranks differ by at least the critical difference (CD), which depends on the number of data sets and methods. A 5% significance level (α = 0.05) was considered to obtain this CD, and the results can be observed in Figure 11, which shows CD diagrams as proposed by Demšar (2006). Each method is represented as a point on a ranking scale corresponding to its mean ranking performance. CD segments are included to indicate the separation needed between methods in order to assert statistical differences. The horizontal lines in the figures define sets of algorithms with no statistical differences in mean ranking performance. Table 3 should also be considered when interpreting this graph.
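The critical difference used in these diagrams has a closed form, CD = q_α · sqrt(k(k+1)/(6N)), where k is the number of methods, N the number of data sets, and q_α a studentized-range value from Demšar's (2006) tables (for k = 9 methods, the α = 0.05 values are approximately 3.102 for the Nemenyi test and 2.724 for the Bonferroni-Dunn test). A minimal sketch:

```python
import math

def critical_difference(k, n_datasets, q_alpha):
    """CD = q_alpha * sqrt(k (k + 1) / (6 n)): two methods are deemed
    significantly different if their mean ranks differ by at least CD."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))
```

With k = 9 methods and N = 10 data sets, the Bonferroni-Dunn value 2.724 yields CD ≈ 3.34, matching the critical value reported with Table 4.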
Figure 11a shows that SVC, the nominal classifier, has the best performance in Acc, where the order of the label prediction errors is not considered, and SVR-PCDOC has the second-best one. RED-SVM, KDLOR, and SVORIM have similar performance here. In Figure 11b, the best mean ranking is for SVR-PCDOC, while SVORIM, KDLOR, and RED-SVM have similar performances. However, when AMAE is considered, it can be seen in Figure 11c that the distance in mean ranking between SVR-PCDOC and the other methods increases, specifically with respect to RED-SVM and SVORIM. Finally, Figure 11d shows the mean rank CD diagram for the correlation coefficient, where SVR-PCDOC still has the best mean performance.
The Nemenyi approach of comparing all classifiers to one another in a post hoc test is not as sensitive as comparing all classifiers to a given classifier, a control method (Demšar, 2006). The Bonferroni-Dunn test allows this latter type of comparison, and in our case it is done using the proposed method as the control for the four metrics. The results of the Bonferroni-Dunn test are in Table 4, where the corresponding critical values are included. From the results of this test, it can be concluded that SVR-PCDOC does not report a statistically significant difference with respect to the SVM ordinal regression methods, KDLOR, and ASAOR(C4.5), but it does when compared to POM for all the metrics and to GPOR for the ordinal metrics. Moreover, there are significant differences with respect to SVC when AMAE and the correlation coefficient are considered.
Note: Bonferroni-Dunn test critical difference: CD = 3.336.
a. Statistically significant difference at α = 0.05.
From the experiments, we can conclude that the reference (baseline) nominal classifier, SVC, is improved upon with statistically significant differences when ordinal classification measures are considered. Regarding ASAOR(C4.5), SVOREX, SVORIM, KDLOR, and RED-SVM, although the general performance of our method is slightly better, there are no statistically significant differences favoring any of the methods.
Two important conclusions can be drawn about the performance measures. When unbalanced data sets are considered, AccG clearly omits important aspects of ordinal classification, and so does MAEG. If comparative performance is taken into account, KDLOR and SVR-PCDOC appear to be very good classifiers when the objective is to improve AMAEG and the correlation coefficient. The best mean ranking performance is obtained by the method we propose in this letter.
4.5. Latent Space Representations of the Ordinal Classes.
In the previous section, we showed that our simple and intuitive methodology can compete on an equal footing with established, more complex or less direct methods for ordinal classification. In this section, we complement this performance-based comparison with a deeper analysis of the main ingredient of our approach and of related approaches to ordinal classification: the projection onto the one-dimensional (latent) space that naturally represents the ordinal organization of the classes. In particular, we study how the nonlinear latent variable models SVR-PCDOC, KDLOR, SVOREX, and SVORIM organize their one-dimensional latent space projections. For comparison purposes, the latent variable values of the training and generalization data of the first fold of the tae data set are shown (see Figure 12). Both histograms and individual latent variable values are plotted so that the behavior of the models can be analyzed. In the case of SVR-PCDOC, the PCD projection is also included, to check whether the regressor model is close to the PCDOC projection. The histograms represent the relative frequency of the projections. SVORIM histograms and latent variable values are not presented, since they are similar to the SVOREX ones on the selected data set.
We first analyze the SVR-PCDOC method. From the PCD projections in Figure 12a, we deduce that classes C1 and C2 contain patterns that are very close in the input space: the projection of some patterns from C2 lies near the threshold that divides the values for the two classes. An analogous comment applies to classes C2 and C3. The regressor seems to have learned the imposed projection reasonably well, since the predicted latent values have a histogram similar to that of the training PCD projection. The generalization PCD projections (see Figure 12c) have similar characteristics to the training ones.5 Note the concentration of predicted values within the C2 interval for the generalization set. This concentration is due to the incorrect prediction of class C1 and C3 patterns, which were both assigned to C2. This behavior can be better seen in Figures 12e and 12f, where the modeled latent value for each pattern is shown together with its class label. Indeed, during training, some C1 and C2 patterns were mapped to positions near the thresholds. This is probably caused by noise or by overlapping class distributions in the input space.
Figure 13 presents latent variable values of KDLOR. The KDLOR method projects the data onto the latent space by minimizing the intraclass distance while maximizing the interclass distance of the projections. As a result, the latent representations of the data are quite compact for each class (see the training projection histogram in Figure 13a). While this philosophy often leads to superior classification results, the projections do not reflect the structure of patterns within a single class, that is, the ordinal nature of the data is not fully captured by the model. In addition, KDLOR projections occur in the incorrect bins more often than in the case of SVR-PCDOC (see the generalization projections in Figure 13d).
Finally, Figure 14 presents latent representations of patterns by the SVOREX model. As in the KDLOR case, (except for a few patterns) the training latent representations are highly compact within each class. Again, the relative structure of patterns within their classes is lost in the projections.
In both models, KDLOR and SVOREX, there is pressure in the model construction phase to find one-dimensional projections of the data that result in compact classes while maximizing the interclass separation. In the case of KDLOR, this is explicitly formulated in the objective function. On the other hand, the key idea behind SVM-based approaches is margin maximization. Data projections that maximize interclass margins implicitly make the projected classes compact. We hypothesize that the pressure for compact within-class latent projections can lead to poorer generalization performance, as illustrated in Figure 14d. In the case of overlapping classes, the drive for compact class projections can result in locally highly nonlinear projections of the overlapping regions, over which we do not have direct control (unlike in the case of PCDOC, where the nonlinear projection is guided by the relative positions of points with respect to the other classes). Having such highly expanding projections can result in test points being projected to incorrect classes in an arbitrary manner. Although we provide detailed analysis for one data set and one fold only, the observed tendencies were quite general across the data sets and holdout folds.
This letter addresses ordinal classification by proposing a projection of the input data into a one-dimensional variable, based on the relative position of each pattern with respect to the patterns of the adjacent classes. Our approach is based on a simple and intuitive idea: instead of implicitly inducing a one-dimensional data projection into a series of class intervals (as done in threshold-based methods), construct such projections explicitly and in a controlled manner. Threshold methods crucially depend on such projections, and we propose that it might be advantageous to have direct control over how the projection is done rather than having to rely on its indirect induction through a one-stage ordinal classification learning process.
Applying this one-dimensional projection to the training set yields data on which generalized projection can be trained using any standard regression method. The generalized projection in turn can be applied to new instances, which are then classified based on the interval into which their projection falls.
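The second phase (classification from the regressor output) reduces to locating the predicted latent value inside the fixed, equal-width class intervals (a Python sketch; it assumes the latent space is the unit interval partitioned into Q bins, which matches the fixed thresholds of equation 3.3 up to scaling):

```python
def interval_label(z, Q):
    """Return the class (1..Q) whose equal-width interval on [0, 1]
    contains the predicted latent value z; out-of-range predictions
    are clipped to the extreme classes."""
    z = min(max(z, 0.0), 1.0 - 1e-12)  # clip into [0, 1)
    return int(z * Q) + 1
```

Any standard regressor can be plugged into the first phase; only this interval lookup is specific to the two-phase scheme.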
We construct the projection by imposing that the best-separated pattern of each class (i.e., the pattern most distant from the adjacent classes) should be mapped to the center of the interval representing that class (or to the interval extremes for the extreme, that is, the first and the last, classes). All the other patterns are proportionally positioned in their corresponding class intervals around these centers. We designed a projection method having such desirable properties and empirically verified its appropriateness on data sets with linear and nonlinear class ordering topologies.
We extensively evaluated our method on 10 real-world data sets, using 4 performance metrics and a measure of statistical significance, and compared it with 8 alternative methods, including the most recent proposals for ordinal regression and a baseline nominal classifier. In spite of the intrinsic simplicity and straightforward intuition behind our proposal, the results are competitive with the state of the art in the literature. The mean ranking performance of our method was particularly strong when robust ordinal performance metrics were considered, such as the average mean absolute error or the correlation coefficient. Moreover, we studied in detail the latent space organization of the projection-based methods considered in this letter. We suggest that the pressure for compact within-class latent projections, while making training sample projections compact well within classes, can lead to poorer generalization performance overall.
We also identify some interesting discussion points. First, the latent space thresholds are fixed by the projection with an equal width. This may be interpreted as an assumption of equal widths for each class, which is not always true for all the problems. This would indeed be a problem if we used a linear regressor from the data space to the projection space. However, we employ nonlinear projections, and the adjustment for unequal widths of the different classes can be naturally achieved within such nonlinear mapping from the data to the projection space. From the model-fitting standpoint, having fixed-width class regions in the projection space is desirable. Allowing for variable widths would increase the number of free parameters and would make the free parameters dependent in a potentially complicated manner (flexibility of projections versus class widths in the projection space). This may have a harmful effect on model fitting, especially if the data set is of limited size. Having fewer free parameters is also advantageous from the point of view of computational complexity.
The second discussion point is the possible undesirable influence of outliers on the PCD projection. One possible solution is to place each pattern in the projection by considering more classes than just the adjacent ones. However, this should be done carefully in order not to diminish the role of ordinal information in the projection. A direct alternative is to use a k-NN-like scheme in equation 3.1, where instead of taking the minimum distance to a point of a class, the average distance to the k closest points of that class is used. This would be a generalization of the current scheme, which computes distances with k=1. Nevertheless, the inclusion of k would add a new free parameter to the training process.
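The variant discussed above amounts to replacing the minimum with a k-nearest average in the class distance (an illustrative Python sketch of the discussed generalization, not code from the letter; k = 1 recovers the original minimum-distance scheme):

```python
import numpy as np

def class_distance(x, class_patterns, k=1):
    """Distance from pattern x to a class: with k = 1, this is the minimum
    distance (as in equation 3.1); with k > 1, it is the average distance to
    the k nearest patterns of the class, which damps the influence of outliers."""
    d = np.sort(np.linalg.norm(np.asarray(class_patterns) - np.asarray(x), axis=1))
    return float(d[:k].mean())
```

A single outlier close to x shifts the k = 1 distance arbitrarily, while for larger k its effect is diluted by the remaining neighbors.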
In conclusion, the results indicate that our two-phase approach to ordinal classification is a viable and simple-to-understand alternative to the state of the art. The projection constructed in the first phase consistently extracts useful information for ordinal classification. As such, it can be used not only as the basis for classifier construction but also as a starting point for devising measures able to detect and quantify possible ordering of classes in any data set. This is a matter for our future research.
This work has been partially subsidized by the TIN2011-22794 project of the Spanish Ministerial Commission of Science and Technology (MICYT), FEDER funds, and the P11-TIC-7508 project of the Junta de Andalucía (Spain). The work of P.T. was supported by BBSRC grant BB/H012508/1.
Acc is referred to as mean zero-one error when expressed as an error.
This does not in any way hamper generality, as our regressors defining g will be smooth nonlinear functions.
Recall that the threshold set θ delimiting class intervals is defined in equation 3.3.
GPOR (http://www.gatsby.ucl.ac.uk/~chuwei/ordinalregression.html), SVOREX and SVORIM (http://www.gatsby.ucl.ac.uk/~chuwei/svor.htm), and RED-SVM (http://home.caltech.edu/~htlin/program/libsvm/).
There are many fewer patterns in the holdout set than in the training set, making direct comparison of the two histograms problematic.