Abstract

We present a reduction framework from ordinal ranking to binary classification. The framework consists of three steps: extracting extended examples from the original examples, learning a binary classifier on the extended examples with any binary classification algorithm, and constructing a ranker from the binary classifier. Based on the framework, we show that a weighted 0/1 loss of the binary classifier upper-bounds the mislabeling cost of the ranker, both error-wise and regret-wise. Our framework allows not only the design of good ordinal ranking algorithms based on well-tuned binary classification approaches, but also the derivation of new generalization bounds for ordinal ranking from known bounds for binary classification. In addition, our framework unifies many existing ordinal ranking algorithms, such as perceptron ranking and support vector ordinal regression. When compared empirically on benchmark data sets, some of our newly designed algorithms enjoy advantages in terms of both training speed and generalization performance over existing algorithms. In addition, the newly designed algorithms lead to better cost-sensitive ordinal ranking performance, as well as improved listwise ranking performance.

1.  Introduction

We work on a supervised learning problem, ordinal ranking, which is also referred to as ordinal regression (Chu & Keerthi, 2007) or ordinal classification (Frank & Hall, 2001). For instance, the rating that a customer gives to a movie might be one of “★,” “★★,” “★★★,” “★★★★,” and “★★★★★.” Those ratings are called the ranks, which can be represented by ordinal class labels like 1, 2, 3, 4, 5. The ordinal ranking problem is closely related to multiclass classification and metric regression. It is different from the former because of the ordinal information encoded in the ranks and different from the latter because of a lack of the metric distance between the ranks. Since rank is a natural representation of human preferences, the problem lends itself to many applications in the social sciences and information retrieval (Liu, 2009).

Many ordinal ranking algorithms have been proposed from a machine learning perspective in recent years. For instance, Herbrich, Graepel, and Obermayer (2000) designed an approach with support vector machines based on comparing training examples in a pairwise manner. Har-Peled, Roth, and Zimak (2003) proposed a constraint classification approach that works with any binary classifier based on the pairwise comparison framework. Nevertheless, such a pairwise comparison perspective may not be suitable for large-scale learning because the size of the associated optimization problem can be large. In particular, for an ordinal ranking problem with N examples, if at least two of the ranks are supported by Ω(N) examples (which is quite common in practice), the size of the pairwise learning problem is quadratic in N.

There are some other approaches that do not lead to such a quadratic expansion. For instance, Crammer and Singer (2005) generalized the online perceptron algorithm with multiple thresholds to do ordinal ranking. In their approach, a perceptron maps an input vector to a latent potential value, which is then thresholded to obtain a rank. Shashua and Levin (2003) proposed new support vector machine (SVM) formulations to handle multiple thresholds, and some other SVM formulations were studied by Rajaram, Garg, Zhou, and Huang (2003), Chu and Keerthi (2007), and Cardoso and da Costa (2007). All of these algorithms share a common property: they are modified from well-known binary classification approaches. Still other approaches fall into neither of the two perspectives above, such as Gaussian process ordinal regression (Chu & Ghahramani, 2005).

Since binary classification is much better studied than ordinal ranking, a general framework to systematically reduce the latter to the former can introduce two immediate benefits. First, well-tuned binary classification approaches can be readily transformed into good ordinal ranking algorithms, which saves a great deal of effort in design and implementation. Second, new generalization bounds for ordinal ranking can be easily derived from known bounds for binary classification, which saves tremendous effort in theoretical analysis.

In this letter, we propose such a reduction framework. The framework is based on extended binary examples, which are extracted from the original ordinal ranking examples. The binary classifier trained from the extended binary examples can then be used to construct a ranker. We prove that the mislabeling cost of the ranker is bounded by a weighted 0/1 loss of the binary classifier. Furthermore, we prove that the mislabeling regret of the ranker is bounded by the regret of the binary classifier as well. Hence, binary classifiers that generalize well could introduce rankers that generalize well. The advantages of the framework in algorithmic design and in theoretical analysis are both demonstrated in the letter. In addition, we show that our framework provides a unified view for many existing ordinal ranking algorithms. The experiments on some benchmark data sets validate the usefulness of the framework in practice for improving cost-sensitive ordinal ranking performance and helping improve other ranking criteria.

The letter is organized as follows. We introduce the ordinal ranking problem in section 2. Some related work is discussed in section 3. We illustrate our reduction framework in section 4. The algorithmic and theoretical usefulness of the framework is shown in section 5. Finally, we present experimental results in section 6 and conclude in section 7.

A short version of the letter appeared in the 2006 Conference on Neural Information Processing Systems (Li & Lin, 2007b). The paper was then enriched by the more general cost-sensitive setting as well as the deeper theoretical results that were revealed in the 2009 Preference Learning Workshop (Lin & Li, 2009). For completeness, selected results from an earlier conference work (Lin & Li, 2006) are included in section 5.2. These publications are also part of the first author's Ph.D. thesis (Lin, 2008). In addition to the results that have been published in the conferences, we point out some important properties of ordinal ranking in section 2, have added a detailed discussion of the literature in section 3, show deeper theoretical results on the equivalence between ordinal ranking and binary classification in section 4, clarify the differences among different SVM-based ordinal ranking algorithms in section 5, and strengthen the experimental results to emphasize the usefulness of cost-sensitive ordinal ranking in section 6.

2.  Ordinal Ranking Setup

The ordinal ranking problem aims at predicting the rank y of some input vector x, where x is in an input space $\mathcal{X}$ and y is in a label space $\mathcal{Y} = \{1, 2, \ldots, K\}$. A function $r\colon \mathcal{X} \to \mathcal{Y}$ is called an ordinal ranker, or a ranker for short. We shall adopt the cost-sensitive setting (Abe, Zadrozny, & Langford, 2004; Lin, 2008), in which a cost vector $c \in \mathbb{R}^K$ is generated with (x, y) from some fixed but unknown distribution $P(x, y, c)$ on $\mathcal{X} \times \mathcal{Y} \times \mathbb{R}^K$. The kth element c[k] of the cost vector represents the penalty when predicting the input vector x as rank k. We naturally assume that c[k] ≥ 0 and c[y] = 0. Thus, $y = \operatorname{argmin}_{1 \le k \le K} c[k]$. An ordinal ranking problem comes with a given training set $S = \{(x_n, y_n, c_n)\}_{n=1}^{N}$, whose elements are drawn independent and identically distributed (i.i.d.) from $P(x, y, c)$. The goal of the problem is to find a ranker r such that its expected test cost,
$$E(r) \equiv \mathop{\mathbb{E}}_{(x, y, c) \sim P} c[r(x)],$$
is small.

The setting looks similar to that of a cost-sensitive multiclass classification problem (Abe et al., 2004) in the sense that the label space $\mathcal{Y}$ is a finite set. Therefore, ordinal ranking is also called ordinal classification (Frank & Hall, 2001; Cardoso & da Costa, 2007). Nevertheless, in addition to representing the nominal categories (as the usual classification labels), the labels $y \in \mathcal{Y}$ also carry the ordinal information. That is, two different labels in $\mathcal{Y}$ can be compared by the usual < operation. Thus, the labels $y \in \mathcal{Y}$ are called the ranks to distinguish them from the usual classification labels.

Ordinal ranking is also similar to regression (for which $y \in \mathbb{R}$ instead of $\mathcal{Y}$), because the real values in $\mathbb{R}$ can be ordered by the usual < operation. Therefore, ordinal ranking is also popularly called ordinal regression (Herbrich et al., 2000; Shashua & Levin, 2003; Chu & Ghahramani, 2005; Chu & Keerthi, 2007; Xia, Zhou, Yang, & Zhang, 2007). Nevertheless, unlike the real-valued regression labels $y \in \mathbb{R}$, the discrete ranks $y \in \mathcal{Y}$ do not carry metric information. For instance, we cannot say that a five-star movie is 2.5 times better than a two-star one; we also cannot compute the exact distance (difference) between a five-star movie and a one-star movie. In other words, the rank serves as a qualitative indication rather than a quantitative outcome. Thus, any monotonic transform of the label space should not alter the ranking performance. Nevertheless, many regression algorithms depend on the assumed metric information and can be highly affected by monotonic transforms of the label space (which are equivalent to change-of-metric operations). Thus, those regression algorithms may not always perform well on ordinal ranking problems.

The ordinal information carried by the ranks introduces the following two properties, which are important for modeling ordinal ranking problems:

  • Closeness in the rank space $\mathcal{Y}$: The ordinal information suggests that the mislabeling cost depends on the closeness of the prediction. For example, predicting a two-star movie as a three-star one is less costly than predicting it as a five-star one. Hence, the cost vector c should be V-shaped with respect to y (Li & Lin, 2007b), that is,
    $$c[k-1] \ge c[k] \ \text{for } 2 \le k \le y; \qquad c[k+1] \ge c[k] \ \text{for } y \le k \le K-1. \tag{2.1}$$
    A V-shaped cost vector says that a ranker needs to pay more if its prediction on x is farther away from y. We shall assume that every cost vector c generated from $P(c \mid x, y)$ is V-shaped with respect to $y = \operatorname{argmin}_{1 \le k \le K} c[k]$. In other words, one can decompose $P(x, y, c) = P(c \mid x, y)\,P(x, y)$, where every c generated from $P(c \mid x, y)$ is V-shaped and satisfies $y = \operatorname{argmin}_{1 \le k \le K} c[k]$.
    In some of our results, we need a stronger condition: the cost vectors should be convex (Li & Lin, 2007b), which is defined by the condition
    $$c[k+1] - c[k] \ge c[k] - c[k-1] \ \text{for } 1 < k < K. \tag{2.2}$$
    When using convex cost vectors, a ranker needs to pay increasingly more if its prediction on x is farther away from y. Provably, any convex cost vector c is V-shaped with respect to $y = \operatorname{argmin}_{1 \le k \le K} c[k]$.
    The V-shaped and convex cost vectors are general choices that can be used to represent the ordinal nature of $\mathcal{Y}$. One popular cost vector that has been frequently used for ordinal ranking is the absolute cost vector, which accompanies (x, y) as
    $$c^{(y)}[k] = |y - k|.$$

    Because the absolute cost vectors come with the median function as their population minimizer (Dembczyński, Kotłowski, & Słowiński, 2008), they appear to be a natural choice for ordinal ranking, similar to how the traditional 0/1 loss is the most natural choice for classification. Nevertheless, our work aims at studying more flexible possibilities (costs) beyond the natural choice, similar to the more flexible weighted loss beyond the 0/1 one in weighted classification (Zadrozny, Langford, & Abe, 2003). As we shall show in section 6, the flexible costs can be used to embed the desired structural information in $\mathcal{Y}$ for better test performance.

  • Comparability in the input space $\mathcal{X}$: Note that the classification cost vectors
    $$c^{(\ell)}[k] = [\![ \ell \ne k ]\!],$$
    where $[\![ \cdot ]\!]$ is 1 if the inner condition is true and 0 otherwise, check whether the predicted rank k is exactly the same as the desired rank ℓ and are also V-shaped. If those cost vectors are used, an immediate question is: What distinguishes ordinal ranking and common multiclass classification?
    Let $r_*$ denote the optimal ranker with respect to $P(x, y, c)$. Note that $r_*$ introduces a total preorder in $\mathcal{X}$ (Herbrich et al., 2000), that is,
    $$x \preceq x' \iff r_*(x) \le r_*(x').$$

    The total preorder allows us to naturally group and compare vectors in the input space $\mathcal{X}$. For instance, a two-star movie is “worse than” a three-star one, which is in turn “worse than” a four-star one; movies of less than three stars are “worse than” movies of at least three stars.

    The simplicity of the grouping and the comparison distinguishes ordinal ranking from multiclass classification. For instance, when classifying movies, it is difficult to group and compare with , , but “movies of less than three stars” can be naturally compared with “movies of at least three stars.”

    The comparability property connects ordinal ranking to monotonic classification (Sill, 1998; Kotłowski & Słowiński, 2009), which is also referred to as ordinal classification with the monotonicity constraints and is an important problem on its own. Monotonic classification models the ordinal ranking problem by assuming that an explicit order in the input space (such as the value-order of one particular feature) can be used to directly (and monotonically) infer the order of the ranks in the output space (y ≤ y′). In other words, monotonic classification allows putting thresholds on the explicit order to perform ordinal ranking. The comparability property shows that there is an order (total preorder) introduced by the ranks. Nevertheless, the order is not always explicit in general ordinal ranking problems. Therefore, many of the existing ordinal ranking approaches, such as the thresholded model, which is discussed in section 5, seek the implicit order through transforming the input vectors before respecting the monotonic nature between the implicit order and the order of the ranks.
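The cost-vector conditions above can be made concrete with a small sketch. The helper names below (`absolute_cost`, `classification_cost`, `is_v_shaped`, `is_convex`) are hypothetical and chosen only for illustration; ranks are 1-based as in the text, while Python lists are 0-based, so entry `c[k - 1]` of a list stores the cost c[k].

```python
from typing import List


def absolute_cost(y: int, K: int) -> List[float]:
    """Absolute cost vector: the cost of predicting rank k is |y - k|."""
    return [float(abs(y - k)) for k in range(1, K + 1)]


def classification_cost(y: int, K: int) -> List[float]:
    """Classification cost vector: 0 for the desired rank, 1 otherwise."""
    return [0.0 if k == y else 1.0 for k in range(1, K + 1)]


def is_v_shaped(c: List[float], y: int) -> bool:
    """Check condition 2.1: nonincreasing up to rank y, nondecreasing after."""
    K = len(c)
    # c[k-1] >= c[k] for 2 <= k <= y (1-based ranks)
    left = all(c[k - 2] >= c[k - 1] for k in range(2, y + 1))
    # c[k+1] >= c[k] for y <= k <= K-1 (1-based ranks)
    right = all(c[k] >= c[k - 1] for k in range(y, K))
    return left and right


def is_convex(c: List[float]) -> bool:
    """Check condition 2.2: successive differences never decrease."""
    return all(c[i + 1] - c[i] >= c[i] - c[i - 1] for i in range(1, len(c) - 1))


if __name__ == "__main__":
    for name, c, y in [
        ("absolute", absolute_cost(2, 5), 2),
        ("classification", classification_cost(3, 5), 3),
    ]:
        print(name, c, "V-shaped:", is_v_shaped(c, y), "convex:", is_convex(c))
```

For K = 5, the absolute cost vector passes both checks, whereas the classification cost vector with y = 3 is V-shaped but fails the convexity check.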

In Table 1, we summarize four different learning problems in terms of their comparability and closeness properties. As discussed, usual ordinal ranking problems come with strong closeness in $\mathcal{Y}$ (which is represented by V-shaped cost vectors) and simple comparability in $\mathcal{X}$. The classification cost vectors can be viewed as degenerate V-shaped cost vectors, and hence introduce degenerate ordinal ranking problems.

Table 1: Properties of Different Learning Problems.

| Comparability | Weak closeness (classification cost vectors) | Strong closeness (other V-shaped cost vectors) |
|---|---|---|
| Comparable | Degenerate ordinal ranking | Usual ordinal ranking |
| Not comparable | Multiclass classification | Special cases of cost-sensitive classification |

Multiclass classification problems, on the other hand, do not allow examples of different classes to be naturally grouped and compared. If we want to use cost vectors other than the classification ones, we move to special cases of cost-sensitive classification. For instance, in the attempt to recognize digits {0, 1, …, 9} for written checks, a possible cost is the absolute one (to represent monetary differences) rather than simply right or wrong (the classification cost). The absolute cost is V-shaped and convex. Nevertheless, the digits intuitively cannot be grouped and compared, and hence the recognition problem belongs to cost-sensitive multiclass classification rather than ordinal ranking (Lin, 2008).

A good ordinal ranking algorithm should appropriately use the comparability property. In section 4, we show how the property serves as a key to derive our proposed reduction framework.

3.  Related Literature

The analysis of ordinal data has been studied in statistics by defining a suitable link function that models the underlying probability for generating the ordinal labels (Anderson, 1984). For instance, one popular model is the cumulative link model (Agresti, 2002), which we discuss in section 5. Similar models can be traced back to the work of McCullagh (1980). Much of the earlier work in statistics, which usually focuses on the effectiveness and efficiency of the modeling, influences ordinal ranking studies in machine learning (Herbrich et al., 2000), including our work. Another related area that studies the analysis of ordinal data is operations research, especially in the subarea of multicriteria decision analysis (Greco, Słowiński, & Matarazzo, 2000; Figueira, Greco, & Ehrgott, 2005), which contains work that focuses on reasonable decision making with ordinal preference scales. Our work tackles ordinal ranking problems from the machine learning perspective (improving the test performance) and is different from work that takes the perspective of statistics or operations research.

In machine learning (and information retrieval), there are three major families of ranking algorithms: pointwise, pairwise, and listwise (Liu, 2009). The ordinal ranking setup presented in section 2 belongs to pointwise ranking. Next, we discuss some representative algorithms in each family and relate them to the ordinal ranking setup. Then we compare the proposed reduction framework with other reduction-based approaches for ranking.

3.1.  Families of Ranking Algorithms.

3.1.1.  Pointwise Ranking.

Pointwise ranking aims at predicting the relevance of some input vector x using either real-valued scores or ordinal-valued ranks. It does not directly use the comparison nature of ranking.

The ordinal ranking algorithms studied in this letter focus on computing ordinal-valued ranks for pointwise ranking. For obtaining real-valued scores, a fundamental tool is traditional least-squares regression (Hastie, Tibshirani, & Friedman, 2001). As discussed in section 2, however, when the examples come with ordinal labels, the ordinal ranking algorithms studied in this letter can be more useful than traditional regression by taking the metricless nature of labels into account.

3.1.2.  Pairwise Ranking.

Pairwise ranking aims at predicting the relative order between two input vectors x and x′ and thus captures the local comparison nature of ranking. It is arguably one of the most widely used techniques in the ranking family and is usually cast as a binary classification problem of predicting whether x is preferred over x′. During training, such a problem translates to comparing all pairs of (xn, xm) based on their corresponding labels. One representative pairwise ranking algorithm is RankSVM (Herbrich et al., 2000), which trains an underlying support vector machine using those pairs. RankSVM was initially proposed for data sets that come with ordinal labels, but is also commonly applied to data sets that come with real-valued labels.

Note that even when all the labels take ordinal values, as long as two of the classes contain Ω(N) examples, there are Ω(N²) pairs. Such a quadratic number of pairs makes it difficult to scale up general pairwise ranking algorithms, except in special cases like the linear support vector machine (Joachims, 2006) or RankBoost (Freund, Iyer, Schapire, & Singer, 2003; Lin & Li, 2006). Thus, when the training set is large and contains ordinal labels, the ordinal ranking algorithms studied in this letter may serve as a useful alternative to pairwise ranking ones.

3.1.3.  Listwise Ranking.

Listwise ranking aims at ordering a whole finite set of input vectors $\{x_m\}_{m=1}^{M}$. In particular, the (listwise) ranker tries to minimize the inconsistency between the predicted permutation and the ground-truth permutation of $\{x_m\}_{m=1}^{M}$ (Liu, 2009). Listwise ranking captures the global comparison nature of ranking. One representative listwise ranking algorithm is ListNet (Cao, Qin, Liu, Tsai, & Li, 2007), which is based on an underlying neural network model along with an estimated distribution of all possible permutations (rankers). Nevertheless, there are M! permutations for a given $\{x_m\}_{m=1}^{M}$. Thus, listwise ranking can be computationally even more expensive than pairwise ranking.

Many listwise ranking algorithms try to alleviate the computational burden by keeping some internal pointwise rankers. For instance, ListNet uses the underlying neural network to score each instance (Cao et al., 2007) for the purpose of permutation. The use of internal pointwise rankers for listwise ranking further justifies the importance of better understanding pointwise ranking, including the ordinal ranking algorithms studied in this letter.

3.2.  Reduction Approaches for Ranking.

Because ranking is a relatively new and diverse problem in machine learning, many existing ranking approaches try to reduce the ranking problem to other learning problems. Next, we discuss some existing reduction-based approaches that are related to the framework proposed in this letter.

3.2.1.  From Pairwise Ranking to Binary Classification.

Balcan et al. (2007) propose a robust reduction from bipartite (i.e., ordinal with two outcomes) pairwise ranking to binary classification. The training part of the reduction works like usual pairwise ranking: learning a binary classifier on whether x is preferred over x′. The prediction part of the reduction asks the underlying binary classifier to vote for each example in the test set in order to rank those examples. The reduction is simple but yields solid theoretical guarantees. In particular, for ranking M test examples, the reduction uses Ω(M²) calls to the binary classifier and transforms a binary classification regret of r to a bipartite ranking regret (measured by the so-called AUC criterion) of at most 2r.

Ailon and Mohri (2008) improve the reduction of Balcan et al. (2007) and propose a more efficient reduction from general pairwise ranking to binary classification. The prediction part of the reduction operates by taking the underlying binary classifier as the comparison function of the popular QuickSort algorithm. In the special bipartite ranking case, for ranking M examples, the reduction uses O(M log M) calls to the binary classifier on average and transforms a binary classification regret of r to a bipartite ranking regret of at most r.

3.2.2.  From Listwise Ranking to Regression (Pointwise Ranking).

The subset ranking (Cossock & Zhang, 2008) algorithm can be viewed as a reduction from listwise ranking to regression. In particular, Cossock and Zhang (2008) prove that regression with various cost functions can be used to approximate a Bayes optimal listwise ranker. In other words, low-regret regressors can be cast as low-regret listwise rankers.

3.2.3.  From Listwise Ranking to Ordinal (Pointwise) Ranking.

McRank (Li, Burges, & Wu, 2008) is a reduction from listwise ranking to ordinal ranking with the classification cost. The main theoretical justification of the reduction shows that a scaled classification cost of an ordinal ranker can upper-bound the regret of the associated listwise ranker. That is, low-error ordinal rankers can be cast as low-regret listwise rankers. Li et al. (2008) empirically verified that McRank can perform better than the subset ranking algorithm (Cossock & Zhang, 2008).

3.2.4.  From Ordinal Ranking to Binary Classification.

The proposed framework in this letter and in the associated shorter version (Li & Lin, 2007b) is a reduction from ordinal ranking to binary classification. We will show that the reduction is both error and regret preserving. That is, low-error binary classifiers can be cast as low-error ordinal rankers; low-regret binary classifiers can be cast as low-regret ordinal rankers.

The data replication method, which was independently proposed by Cardoso and da Costa (2007), is a similar but more restricted case of the reduction framework. The data replication method essentially considers the absolute cost. In addition, the focus of the data replication method (Cardoso & da Costa, 2007) is on explaining the training procedure of the reduction. The proposed framework in this letter is more general than the data replication method in terms of the cost considered, as well as the deeper theoretical analysis on both the training and the test performance of the reduction.

The proposed reduction framework for pointwise ranking and existing reductions in pairwise ranking (Balcan et al., 2007; Ailon & Mohri, 2008) take very different views on the ranking problem and consider different evaluation criteria. As a consequence, when learning N examples and ranking (predicting on) M instances with K ordinal scales, the proposed framework results in a transformed training set of size O(KN) and a prediction procedure with time complexity O(KM). Both the size of the training set and the time complexity of the prediction procedure are smaller than those of the state-of-the-art reduction from pairwise ranking to binary classification (Ailon & Mohri, 2008), as shown in Table 2.

Table 2: Comparison of General Reductions from Ranking to Binary Classification.

| Reduction | Size of Transformed Set During Training | Number of Calls to Binary Classifiers During Prediction | Evaluation Criterion |
|---|---|---|---|
| The proposed framework | O(KN) | O(KM) | Ranking cost |
| Balcan et al. (2007) | O(N²) | O(M²) | AUC |
| Ailon and Mohri (2008) | O(N²) | O(M log M) | AUC |
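To give a feel for the gap between the O(KN) and O(N²) training-set sizes, the following sketch counts both quantities for a given list of ranks. It assumes that a pairwise method forms one training pair for every unordered pair of examples with different ranks, and that the proposed framework forms K − 1 extended examples per original example; the function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def num_pairs(labels: Sequence[int]) -> int:
    """Number of unordered example pairs whose ranks differ (assumed pair count)."""
    n = len(labels)
    all_pairs = n * (n - 1) // 2
    # Pairs within the same rank carry no preference and are excluded.
    same_rank = sum(m * (m - 1) // 2 for m in Counter(labels).values())
    return all_pairs - same_rank


def num_extended(labels: Sequence[int], K: int) -> int:
    """Number of extended binary examples: one per example and per k = 1..K-1."""
    return len(labels) * (K - 1)


if __name__ == "__main__":
    K = 5
    labels = [k for k in range(1, K + 1) for _ in range(200)]  # N = 1000, balanced
    print("pairs:", num_pairs(labels))
    print("extended examples:", num_extended(labels, K))
```

With N = 1000 examples spread evenly over K = 5 ranks, this gives 400,000 pairs against 4,000 extended examples.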

Note that the work of Li et al. (2008) revealed an opportunity to use the discrete nature of ordinal-valued labels to improve the listwise ranking performance over subset ranking when using a heuristic ordinal ranking algorithm. The proposed framework is a more rigorous study on ordinal ranking that can be coupled with McRank to yield a reduction from listwise ranking to binary classification, which allows state-of-the-art binary classification algorithms to be efficiently used for listwise ranking. We demonstrate the use of this opportunity in section 6.4.

4.  Reduction Framework

We first introduce the details of our proposed reduction framework. Then we demonstrate its theoretical guarantees. Consider, for instance, that we want to know how good a movie x is. Using the comparability property of ordinal ranking, we can then ask the associated question, “Is the rank of x greater than k?”

For a given k, such a question is exactly a binary classification problem, and the rank of x can be determined by asking multiple questions for k = 1, 2, …, K − 1. The questions are the core of the dominance-based rough set approach in operations research for reasoning from ordinal data (Słowiński, Greco, & Matarazzo, 2007). From the machine learning perspective, Frank and Hall (2001) proposed to solve each binary classification problem independently and combine the binary outputs into a rank. Although their approach is simple, the generalization performance using the combination step cannot be easily analyzed.

The proposed reduction framework works differently. First, a simpler step is used to convert the binary outputs to a rank, and generalization analysis can immediately follow. Moreover, all the binary classification problems are solved jointly to obtain a single binary classifier.

Assume that g(x, k) is the single binary classifier that provides answers to all the associated questions above. Consistent answers would be g(x, k) = +1 (“yes”) for k = 1 until (ℓ − 1) for some ℓ, and −1 (“no”) afterward. Then a reasonable ranker based on the binary answers is $r_g(x) = \ell = 1 + \max\{k \colon g(x, k) = +1\}$. Equivalently,
$$r_g(x) \equiv 1 + \sum_{k=1}^{K-1} [\![ g(x, k) > 0 ]\!]. \tag{4.1}$$
The binary classifier g that produces only consistent answers would be called rank-monotonic.
For any ordinal example (x, y, c), we can define the extended binary examples $(X^{(k)}, Y^{(k)})$ with weights $W^{(k)}$ as
$$X^{(k)} = (x, k), \qquad Y^{(k)} = 2\,[\![ k < y ]\!] - 1, \qquad W^{(k)} = (K - 1) \cdot \bigl| c[k] - c[k+1] \bigr|. \tag{4.2}$$
The extended input vector $X^{(k)}$ represents the associated question, “Is the rank of x greater than k?” The binary label $Y^{(k)}$ represents the desired answer to the question; the weight $W^{(k)}$ represents the importance of the question and will be used in our theoretical analysis. Here $X^{(k)}$ stands for an abstract pair, and we discuss its practical encoding in section 5. If $g(X^{(k)}) \equiv g(x, k)$ makes no errors on all the associated questions, $r_g(x)$ equals y by equation 4.1. That is, $c[r_g(x)] = 0$. In the following theorem, we further connect $c[r_g(x)]$ to the amount of error that g makes.
Theorem 1
(per example cost bound). For any ordinal example (x, y, c), where c is V-shaped and c[y] = 0, consider its associated extended binary examples $(X^{(k)}, Y^{(k)}, W^{(k)})$ in equation 4.2. Assume that the ranker $r_g$ is constructed from a binary classifier g using equation 4.1. If $g(X^{(k)})$ is rank-monotonic or if c is convex, then
$$c[r_g(x)] \le \frac{1}{K-1} \sum_{k=1}^{K-1} W^{(k)} \, [\![ Y^{(k)} \ne g(X^{(k)}) ]\!]. \tag{4.3}$$
Proof.
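Because g is rank-monotonic, $g(X^{(k)}) = +1$ for $k < r_g(x)$ and $g(X^{(k)}) = -1$ for $k \ge r_g(x)$. Thus, the cost that the ranker $r_g$ needs to pay is
$$c[r_g(x)] = \sum_{k=r_g(x)}^{K-1} \bigl( c[k] - c[k+1] \bigr) + c[K] = \sum_{k=1}^{K-1} \bigl( c[k] - c[k+1] \bigr) [\![ g(X^{(k)}) < 0 ]\!] + c[K]. \tag{4.4}$$
Because the cost vector c is V-shaped, $Y^{(k)}$ equals the sign of (c[k] − c[k + 1]) if the latter is not zero. Continuing from equation 4.4 with c[y] = 0, and applying the same telescoping identity to c[y] (for which $Y^{(k)} < 0$ exactly when $k \ge y$),
$$\begin{aligned}
c[r_g(x)] &= \sum_{k=1}^{K-1} \bigl( c[k] - c[k+1] \bigr) [\![ g(X^{(k)}) < 0 ]\!] + c[K] - c[y] \\
&= \sum_{k=1}^{K-1} \bigl( c[k] - c[k+1] \bigr) \Bigl( [\![ g(X^{(k)}) < 0 ]\!] - [\![ Y^{(k)} < 0 ]\!] \Bigr) \\
&= \sum_{k=1}^{K-1} \bigl| c[k] - c[k+1] \bigr| \, [\![ Y^{(k)} \ne g(X^{(k)}) ]\!] \\
&= \frac{1}{K-1} \sum_{k=1}^{K-1} W^{(k)} \, [\![ Y^{(k)} \ne g(X^{(k)}) ]\!].
\end{aligned} \tag{4.5}$$
When g is not rank-monotonic but the cost vector c is convex, equation 4.5 becomes an inequality that could be alternatively proved by replacing equation 4.4 with
$$c[r_g(x)] \le \sum_{k=1}^{K-1} \bigl( c[k] - c[k+1] \bigr) [\![ g(X^{(k)}) < 0 ]\!] + c[K].$$
The inequality above holds because (c[k] − c[k + 1]) is nonincreasing in k due to the convexity, and there are exactly $(r_g(x) - 1)$ zeros and $(K - r_g(x))$ ones in the values of $[\![ g(X^{(k)}) < 0 ]\!]$ according to equation 4.1. The sum is therefore smallest when the ones occupy the last $K - r_g(x)$ positions, where it telescopes to $c[r_g(x)] - c[K]$.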
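The per example cost bound can be checked by brute force on small cost vectors. The sketch below enumerates every possible pattern of K − 1 binary answers, builds the rank with equation 4.1, and compares the cost with the weighted error written in the normalization-free form Σ_k |c[k] − c[k+1]| · [[Y(k) ≠ g(X(k))]]. The function names are hypothetical, and the code is an illustration rather than part of the proof.

```python
from itertools import product
from typing import List, Sequence


def rank_from_answers(answers: Sequence[int]) -> int:
    """Equation 4.1: the rank is 1 plus the number of +1 answers."""
    return 1 + sum(1 for a in answers if a > 0)


def weighted_error(c: List[float], y: int, answers: Sequence[int]) -> float:
    """Sum of |c[k] - c[k+1]| over the questions k that are answered wrongly."""
    K = len(c)
    total = 0.0
    for k in range(1, K):                    # questions k = 1, ..., K-1
        label = 1 if k < y else -1           # desired answer Y(k)
        if answers[k - 1] != label:
            total += abs(c[k - 1] - c[k])    # |c[k] - c[k+1]| with 1-based ranks
    return total


def is_rank_monotonic(answers: Sequence[int]) -> bool:
    """True if no +1 answer follows a -1 answer."""
    return all(not (a < 0 and b > 0) for a, b in zip(answers, answers[1:]))


def bound_holds(c: List[float], y: int, monotonic_only: bool = False) -> bool:
    """Check c[r_g(x)] <= weighted error over all (or all monotonic) answers."""
    K = len(c)
    for answers in product((-1, 1), repeat=K - 1):
        if monotonic_only and not is_rank_monotonic(answers):
            continue
        cost = c[rank_from_answers(answers) - 1]
        if cost > weighted_error(c, y, answers) + 1e-12:
            return False
    return True


if __name__ == "__main__":
    convex = [1.0, 0.0, 1.0, 2.0, 3.0]       # absolute cost, y = 2
    v_only = [1.0, 1.0, 0.0, 1.0, 1.0]       # classification cost, y = 3
    print(bound_holds(convex, 2))                        # all answer patterns
    print(bound_holds(v_only, 3, monotonic_only=True))   # rank-monotonic only
    print(bound_holds(v_only, 3))                        # all answer patterns
```

The last call fails for the classification cost: the answers (−1, +1, −1, −1) err only on a zero-weight question yet yield rank 2 at cost 1, which illustrates why the theorem asks for either rank-monotonicity or convexity.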

We call equation 4.3 the per example cost bound, which says that if g makes only a small amount of error on the extended binary examples (X(k), Y(k), W(k)), then rg is guaranteed to pay only a small amount of the cost on the original example (x, y, c). The bound allows us to derive the following reduction method, which is composed of three stages: preprocessing, training, and prediction.

Algorithm 1: Reduction to Extended Binary Classification

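  1. Preprocessing: For each original training example $(x_n, y_n, c_n) \in S$ and for each k = 1, 2, …, K − 1, generate an extended training example $(X_n^{(k)}, Y_n^{(k)}, W_n^{(k)})$ and include it in $S_E$, where
    $$X_n^{(k)} = (x_n, k), \qquad Y_n^{(k)} = 2\,[\![ k < y_n ]\!] - 1, \qquad W_n^{(k)} = (K - 1) \cdot \bigl| c_n[k] - c_n[k+1] \bigr|.$$
  2. Training: Use a binary classification algorithm on $S_E$ and get a binary classifier g on a concrete encoding (discussed in section 5) of the extended inputs $X^{(k)}$. Let $g(x, k) \equiv g(X^{(k)})$.

  3. Prediction: For any $x \in \mathcal{X}$, estimate its rank with equation 4.1.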
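The preprocessing and prediction stages of the algorithm above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the extended input is kept as the abstract pair (x, k) rather than a concrete encoding, the weights carry a (K − 1) factor so that uniform sampling of k matches the expected cost, and the training stage is left to whatever weighted binary learner is available. The names `extend_examples`, `make_ranker`, and `threshold_classifier` are hypothetical; the last one is a hand-written stand-in for a trained classifier.

```python
from typing import Callable, List, Sequence, Tuple

OrdinalExample = Tuple[Sequence[float], int, Sequence[float]]    # (x, y, c)
ExtendedExample = Tuple[Tuple[Sequence[float], int], int, float]  # ((x, k), Y, W)
BinaryClassifier = Callable[[Sequence[float], int], int]          # g(x, k) in {-1, +1}


def extend_examples(S: List[OrdinalExample], K: int) -> List[ExtendedExample]:
    """Preprocessing: turn each (x, y, c) into K-1 weighted binary examples."""
    extended = []
    for x, y, c in S:
        for k in range(1, K):
            label = 1 if k < y else -1                    # "is the rank greater than k?"
            weight = (K - 1) * abs(c[k - 1] - c[k])       # assumed (K-1)|c[k]-c[k+1]|
            extended.append(((x, k), label, weight))
    return extended


def make_ranker(g: BinaryClassifier, K: int) -> Callable[[Sequence[float]], int]:
    """Prediction: rank = 1 + number of questions answered +1."""
    def ranker(x: Sequence[float]) -> int:
        return 1 + sum(1 for k in range(1, K) if g(x, k) > 0)
    return ranker


def threshold_classifier(thresholds: Sequence[float]) -> BinaryClassifier:
    """Hypothetical stand-in for a trained g: compare x[0] with threshold k."""
    def g(x: Sequence[float], k: int) -> int:
        return 1 if x[0] > thresholds[k - 1] else -1
    return g


if __name__ == "__main__":
    K = 4
    S = [([0.5], 2, [1.0, 0.0, 1.0, 2.0])]
    for example in extend_examples(S, K):
        print(example)
    ranker = make_ranker(threshold_classifier([0.0, 1.0, 2.0]), K)
    print("predicted rank:", ranker([0.5]))
```

Because the thresholds in this stand-in are ordered, the resulting classifier is rank-monotonic by construction; a general binary learner trained on the extended set carries no such guarantee.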

4.1.  Cost Bound of the Reduction Framework.

Consider the following probability distribution $P_b(X, Y, W)$ that generates the extended binary examples:
  1. Draw a tuple (x, y, c) independently from $P(x, y, c)$, and draw k uniformly from the set {1, 2, …, K − 1}.

  2. Generate $(X^{(k)}, Y^{(k)}, W^{(k)})$ by equation 4.2.

The extended training set $S_E$ contains examples that are equivalent (in terms of expectation) to examples drawn independently from $P_b(X, Y, W)$. For any given binary classifier g, define its out-of-sample error with respect to $P_b$ as
$$E_b(g) \equiv \mathop{\mathbb{E}}_{(X, Y, W) \sim P_b} W \cdot [\![ Y \ne g(X) ]\!].$$
Using the definitions above, we can prove the first theoretical guarantee of the reduction framework.
Theorem 2

(cost bound of the reduction framework). Consider a ranker $r_g$ that is constructed from a binary classifier g using equation 4.1. Assume that c is V-shaped and c[y] = 0 for every tuple (x, y, c) generated from $P(x, y, c)$. If g(x, k) is rank-monotonic or if every cost vector c is convex, then $E(r_g) \le E_b(g)$.

Proof.
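From equation 4.3,
$$c[r_g(x)] \le \frac{1}{K-1} \sum_{k=1}^{K-1} W^{(k)} \, [\![ Y^{(k)} \ne g(X^{(k)}) ]\!].$$
Take the expectation over $P(x, y, c)$ on both sides and use $\sim_u$ to mean the uniform sampling:
$$E(r_g) \le \mathop{\mathbb{E}}_{(x, y, c) \sim P} \; \mathop{\mathbb{E}}_{k \sim_u \{1, \ldots, K-1\}} W^{(k)} \, [\![ Y^{(k)} \ne g(X^{(k)}) ]\!] = \mathop{\mathbb{E}}_{(X, Y, W) \sim P_b} W \cdot [\![ Y \ne g(X) ]\!] = E_b(g).$$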

4.2.  Regret Bound of the Reduction Framework.

Theorem 2 indicates that if there exists a decent binary classifier g, we can obtain a decent ranker $r_g$. Nevertheless, it does not guarantee how good $r_g$ is in comparison with other rankers. In particular, if we consider the optimal binary classifier $g_*$ under $P_b(X, Y, W)$ and the optimal ranker $r_*$ under $P(x, y, c)$, does a small regret $E_b(g) - E_b(g_*)$ in binary classification translate to a small regret $E(r_g) - E(r_*)$ in ordinal ranking? Furthermore, is $E(r_{g_*})$ close to $E(r_*)$? Next, we introduce the reverse-reduction technique, which helps to answer these questions.

The reverse-reduction technique works on the binary classification problems generated by the reduction method. It goes through the preprocessing and the prediction stages of the reduction method in the opposite direction. In the preprocessing stage, instead of starting with ordinal examples (xn, yn, cn), reverse reduction deals with weighted binary examples (X(k)n, Y(k)n, W(k)n). It first combines each set of binary examples sharing the same xn to an ordinal example by
formula
4.6
It is easy to verify that equation 4.6 is the exact inverse transform of equation 4.2 on the training examples under the assumption that c[y] = 0. These ordinal examples are then given to an ordinal ranking algorithm to obtain a ranker r. In the prediction stage, reverse reduction works by decomposing the prediction r(x) to K − 1 binary predictions, each as if coming from a binary classifier:
formula
4.7
Then a lemma on the out-of-sample cost of gr immediately follows (Lin & Li, 2009).
Lemma 1.

With the definitions of and in theorem 2, for every ordinal ranker r, E(r) = Eb(gr).

Proof.

Because gr is rank-monotonic by construction, the same proof for the first part of theorem 2 leads to the desired result.
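The reverse-reduction steps and lemma 1 can be illustrated on a single example. The sketch below assumes that equation 4.6 recovers y by counting positive labels and recovers c[k] by summing the weights of the binary questions lying between y and k, and that equation 4.7 sets gr(x, k) = +1 exactly when k < r(x). Weights are left unscaled, and the function names are hypothetical.

```python
def to_binary(y, c):
    """Forward transform (assumed equation 4.2, without the K-1 scaling)."""
    K = len(c)
    Y = [1 if k < y else -1 for k in range(1, K)]
    W = [abs(c[k - 1] - c[k]) for k in range(1, K)]
    return Y, W

def to_ordinal(Y, W):
    """Reverse transform in the spirit of equation 4.6."""
    K = len(Y) + 1
    y = 1 + sum(1 for label in Y if label == 1)
    # c[k] sums the weights of the binary questions between y and k.
    c = [sum(W[l - 1] for l in range(min(y, k), max(y, k)))
         for k in range(1, K + 1)]
    return y, c

def g_from_ranker(rank, K):
    """In the spirit of equation 4.7: answer 'is the rank greater than k?'."""
    return [1 if k < rank else -1 for k in range(1, K)]

# A V-shaped cost vector with c[y] = 0 for y = 2 and K = 4.
y, c = 2, [3, 0, 1, 4]
Y, W = to_binary(y, c)
recovered = to_ordinal(Y, W)

# Lemma 1 on this example: the cost of predicting rank r equals the
# weighted binary error of the decomposed classifier g_r.
costs_match = all(
    c[r - 1] == sum(w for yk, gk, w in zip(Y, g_from_ranker(r, len(c)), W)
                    if yk != gk)
    for r in range(1, len(c) + 1)
)
```

The round trip recovers (y, c) exactly here because the cost vector is V-shaped with c[y] = 0, which is the assumption under which the inverse is stated.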

The stages of reduction and reverse reduction are illustrated in Figure 1. Next, we show how the reverse-reduction technique allows us to draw a strong theoretical connection between ordinal ranking and binary classification. By the definition of r* and g*, for any ranker r and any binary classifier g,
formula
4.8
Then the reverse-reduction technique yields a simple proof of the regret bound.
Theorem 3
(regret bound of the reduction framework). If g(x, k) is rank-monotonic or if every cost vector c is convex, then
formula
4.9
Figure 1:

Reduction (top) and reverse reduction (bottom).

Proof.
formula

The cost bound (theorem 2) and the regret bound (theorem 3) provide different guarantees for the reduction method. The former describes how the ordinal ranking cost is upper-bounded by the binary classification error in an absolute sense, and the latter describes the upper bound in a relative sense.
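For readers who want the steps spelled out, the regret bound can be recovered from facts already stated. The following is a sketch of one possible argument and not necessarily the exact proof of theorem 3. It assumes that the theorem states E(rg) − E(r*) ≤ Eb(g) − Eb(g*), and it writes g_{r_*} for the binary classifier obtained from r* by reverse reduction.

```latex
\begin{align*}
E(r_g) - E(r_*)
  &\le E_b(g) - E(r_*)        && \text{(theorem 2)} \\
  &=   E_b(g) - E_b(g_{r_*})  && \text{(lemma 1)} \\
  &\le E_b(g) - E_b(g_*)      && \text{(optimality of } g_* \text{)}
\end{align*}
```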

4.3.  Equivalence Between Ordinal Ranking and Binary Classification.

The results thus far suggest that ordinal ranking can be reduced to binary classification without any loss of optimality. That is, ordinal ranking is no harder than binary classification. Intuitively, binary classification is also no harder than ordinal ranking, because the former is a special case of the latter with K = 2. Next, we formalize the notion of hardness with the probably approximately correct (PAC) setup in computational learning theory (Kearns & Vazirani, 1994) and prove that ordinal ranking and binary classification are indeed equivalent in hardness. We use the following definition of PAC in our upcoming theorems (Valiant, 1984; Kearns & Vazirani, 1994).

Definition 1.

In cost-sensitive classification, a learning model is efficiently PAC-learnable (using the same representation class) if there exists a (possibly randomized) learning algorithm satisfying the following property: for every distribution being considered, where

formula
with some ; for all and , if is given access to an oracle generating examples (x, y, c) from , as well as inputs ε and δ, then outputs such that E(g) ≤ ε with probability at least 1 − δ, as well as with time polynomial in and .

The definition assumes that the target function g* is within the learning model and is of cost 0 (the minimum cost). In other words, it is the noiseless setup of learning. We shall focus on only this case while pointing out that similar results can also be proved for the noisy setup (Lin, 2008).

Theorem 4

(equivalence theorem of the reduction framework). Consider a learning model for ordinal ranking, its associated learning model for binary classification, and distributions such that all cost vectors c are V-shaped. Then is efficiently PAC-learnable if and only if is efficiently PAC-learnable.

Proof.

If is efficiently PAC-learnable using algorithm , we can convert to an efficient algorithm for ordinal ranking as follows:

  1. Transform the oracle that generates (x, y, c) from to an oracle that generates (X(k), Y(k), W(k)) by picking k uniformly and applying equation 4.2.

  2. Run with the transformed oracle until it outputs some g(X(k)).

  3. Return rg.

It is not hard to see that is as efficient as , and the cost guarantee comes from theorem 2 using the fact that gr are all rank-monotonic.

Now we consider the other direction. If is efficiently PAC-learnable using algorithm , we can convert to an efficient algorithm for binary classification:
  1. Transform the oracle that generates (X(k), Y(k), W(k)) from to an oracle that generates (x, y, c) by
    formula
    That is, x copies the 1st to the Dth elements of X(k). Let be the underlying distribution of the constructed oracle.
  2. Run with the transformed oracle until it outputs some r(x).

  3. Return gr.

Note that is as efficient as . In addition, we see that plugging into equation 4.2 introduces . Thus, if we take as the expected test cost with respect to , by lemma 1,
formula
Therefore, Eb(gr) < ε after running .

Theorem 4 demonstrates that ordinal ranking is theoretically as easy (hard) as the associated binary classification problem. Recall that we compare four different kinds of learning problems in Table 1. At first sight, theorem 4 appears to suggest that all four problems can be conquered with the reduction framework, because the only required assumption of the theorem is that the cost vectors are V-shaped. Nevertheless, note that the necessary and sufficient condition in the theorem is, “The associated learning model is efficiently PAC-learnable.” Then the different comparability properties of the different problems make a difference. In particular, for multiclass classification problems, the associated binary question, “Is the class of x greater than k?” can be complicated and is thus difficult to learn. In other words, the associated may not be efficiently PAC-learnable. Then more complicated binary questions (Abe et al., 2004; Beygelzimer, Dani, Hayes, Langford, & Zadrozny, 2005; Beygelzimer, Langford, & Ravikumar, 2007; Lin, 2008) are needed to reduce from the general (cost-sensitive) multiclass problem to binary classification ones.

On the other hand, for the special case of cost-sensitive ordinal ranking, in which is efficiently PAC-learnable, the reduction framework establishes a tight connection between the learnability of and —the ranking model of interest. The tight connection motivates us to design ordinal ranking algorithms from popular binary classification algorithms, as we show in the next section.

5.  Applications of Reduction Framework

So far the reduction works only by assuming that X(k) = (x, k) is an abstract pair understandable by the binary classification algorithm. With proper choices of the cost vectors, the encoding scheme of (x, k), and the binary classification algorithm, many existing ordinal ranking algorithms can be unified in our framework, and their theoretical justifications can immediately follow.

In this section, we briefly discuss some of those algorithms and their theoretical justifications. It happens that a simple encoding scheme for (x, k) via a coding matrix M of (K − 1) rows works for all the algorithms. To form X(k), the vector mk, which denotes the kth row of M, is appended after x. We mostly work with M = γ · IK−1, where γ is a positive scalar and IK−1 is the (K − 1) × (K − 1) identity matrix.
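The encoding can be made concrete with a short Python sketch. It appends row k of M = γ · IK−1 to x and shows that, with γ = 1, a linear classifier with weight vector (v, −θ) scores the extended input as ⟨v, x⟩ − θk. The function names and the numeric values are illustrative assumptions.

```python
def encode(x, k, K, gamma=1.0):
    """Return X^(k) = (x, m_k), where m_k is row k of M = gamma * I_{K-1}."""
    m_k = [gamma if j == k else 0.0 for j in range(1, K)]
    return list(x) + m_k

def linear_score(v, theta, X):
    """Inner product of the extended weight vector (v, -theta) with X."""
    w = list(v) + [-t for t in theta]
    return sum(wi * xi for wi, xi in zip(w, X))

# K = 4 ranks, hence three thresholds; <v, x> = 0.5 * 2.0 - 1.0 * 1.0 = 0.0.
x, v, theta = [2.0, 1.0], [0.5, -1.0], [-1.0, 0.0, 2.0]
scores = [linear_score(v, theta, encode(x, k, K=4)) for k in (1, 2, 3)]
```

Each score equals ⟨v, x⟩ minus the corresponding threshold, so a single linear classifier on the extended inputs carries one shared direction v and K − 1 thresholds.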

5.1.  Perceptron for Ordinal Ranking.

The perceptron ranking (PRank) algorithm proposed by Crammer and Singer (2005) is an online ordinal ranking algorithm that employs the thresholded linear model,
formula
where the thresholds θ1, θ2, …, θK−1, θK are ordered such that θ1 ≤ θ2 ≤ ⋅ ⋅ ⋅ ≤ θK−1 ≤ θK = ∞. Whenever a training example is not predicted correctly, the current v and θ are updated in a way similar to the perceptron learning rule (Rosenblatt, 1962). The algorithm was proved to keep the thresholds ordered along with a mistake bound (Crammer & Singer, 2005).
Let X(k) = (x, mk) with the simple encoding scheme M = IK−1. Then,
formula
Consider an ordinal ranking problem such that only generates examples (x, y, c(y)), where c(y) is the absolute cost vector with respect to y. We see that W(k) = K − 1 (a constant) for all the extended binary examples. Then we can simply interpret PRank as a specific instance of the reduction framework with a modified perceptron learning rule as the underlying binary classification algorithm. That is, PRank uses the perceptron learning rule to find a weight vector (v, − θ) for classifying the extended binary examples (x, mk).4 The mistake bound is a direct application of the well-known perceptron mistake bound (see Freund & Schapire, 1999). Our framework not only simplifies the derivation of the mistake bound, but also allows the use of other underlying perceptron algorithms, such as batch-mode algorithms (Li & Lin, 2007a) rather than online ones.
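To make the interpretation concrete, the sketch below runs the plain batch perceptron rule on extended examples (x, mk) with M = IK−1 and then builds a ranker by counting positive answers. It is not the PRank update itself, which modifies all thresholds within one online step. The toy data, the epoch cap, and the function names are assumptions made for illustration, and the constant weights of the absolute cost are omitted because they do not change the perceptron updates.

```python
def encode(x, k, K):
    """Extended input (x, m_k) with M = I_{K-1}."""
    return list(x) + [1.0 if j == k else 0.0 for j in range(1, K)]

def extend(data, K):
    """All (X^(k), Y^(k)) pairs, with Y^(k) = +1 iff k < y."""
    return [(encode(x, k, K), 1 if k < y else -1)
            for x, y in data for k in range(1, K)]

def train_perceptron(examples, max_epochs=1000):
    """Cycle through the examples until an epoch passes without mistakes."""
    w = [0.0] * len(examples[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for X, Y in examples:
            if Y * sum(wi * xi for wi, xi in zip(w, X)) <= 0:
                w = [wi + Y * xi for wi, xi in zip(w, X)]
                mistakes += 1
        if mistakes == 0:
            break
    return w

def predict_rank(w, x, K):
    """Rank = 1 + number of thresholds k with a positive extended score."""
    return 1 + sum(
        1 for k in range(1, K)
        if sum(wi * xi for wi, xi in zip(w, encode(x, k, K))) > 0
    )

# Toy one-dimensional problem with K = 3 that is separable in the extended space.
K = 3
data = [([0.0], 1), ([1.0], 2), ([2.0], 3)]
w = train_perceptron(extend(data, K))
predictions = [predict_rank(w, x, K) for x, _ in data]
```

The learned vector w plays the role of (v, −θ): its first coordinate is the direction v and the remaining coordinates are the negated thresholds.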

5.2.  Boosting for Ordinal Ranking.

In our earlier work (Lin & Li, 2006), we proposed the thresholded ensemble model
formula
5.1
for ordinal ranking. Each confidence function reflects a possibly imperfect ordering preference. Note that a special instance of the confidence function is a binary classifier . The ensemble linearly combines the ordering preferences with α. We allow αt to be any real value, which means that it is possible to reverse the ordering preference of ht in the ensemble when necessary.

Ensemble models in general have been successfully used for classification and regression (Meir & Rätsch, 2003). They not only introduce more stable predictions through the linear combination, but also provide sufficient power for approximating complicated target functions. The thresholded ensemble model extends existing ensemble models to ordinal ranking and inherits many useful theoretical properties from them. Next, we discuss one such property: large-margin bounds.

We first list the definition of the margins for a thresholded ensemble (Lin & Li, 2006). Intuitively, we expect the potential value HT(x) to be in the desired interval (θy−1, θy), and we want HT(x) to be far from the thresholds.

Definition 2.

Consider a given thresholded ensemble r in equation 5.1. The normalized margin is defined as

formula
Definition 2 is similar to the definition of the support vector machine (SVM) margin made by Shashua and Levin (2003) and is analogous to the definition of the ℓ1-margins in binary classification (Schapire, Freund, Bartlett, & Lee, 1998). A nonpositive indicates an incorrect prediction. We now define the Δ-absolute margin cost as
formula

Consider an ordinal ranking problem such that generates examples only with the absolute cost vectors. The associated binary classification problem would then be based on an underlying probability distribution that generates only W = K − 1 (a constant value). Then we can obtain a large-margin bound of E(r):

Theorem 5

(large-margin bound for thresholded ensemble rankers). Consider a negation complete5 set , which contains only binary classifiers and is of VC-dimension d. Assume that δ > 0, and N > d + K − 1 = dE. Then, for an ordinal ranking problem with the absolute cost vectors, with probability at least 1 − δ over the random choice of the training set S, every thresholded ensemble ranker defined from equation 5.1 satisfies the following bound for all Δ > 0:

formula
Proof.

See appendix  A.

The bound above can be generalized when contains confidence functions rather than binary classifiers using another existing result (Schapire et al., 1998, theorem 4) in the proof. The bound motivates us to design the ORBoost-All algorithm (Lin & Li, 2006), which can be viewed as coupling the reduction framework with a variant of the popular AdaBoost algorithm (Schapire et al., 1998; Schapire & Singer, 1999). ORBoost-All provably minimizes the term Ein(r, Δ) exponentially fast if the underlying base learner is strong enough. The proof can be made by applying the training error theorem of AdaBoost (Schapire et al., 1998, theorem 5) on SE, another application of the reduction framework.

5.3.  SVM for Ordinal Ranking.

SVM is a popular binary classification algorithm (Vapnik, 1995; Schölkopf & Smola, 2002). It maps the feature vector x to φ(x) in a possibly higher-dimensional space and implicitly computes the inner products with a kernel function .

Using a similar set of notations for perceptrons (see section 5.1), we denote the parallel hyperplanes in the higher-dimensional space by (v, − θ) with an additional bias term b. Now, if we encode (x, k) with the matrix M = γ · IK−1, we can then compute the inner products of the extended examples (X(k), Y(k)) by
formula
With the reduction framework, we can plug in and extended training examples into the standard SVM to obtain a hyperplane ranker,
formula
based on an optimal solution to
formula
5.2
If θ1 ≤ θ2 ≤ ⋅ ⋅ ⋅ ≤ θK−1 or if the cost vectors considered are convex, theorems 2 and 3 can guarantee the expected out-of-sample cost of r(x) based on the expected out-of-sample cost of the binary classifier:
formula
The oSVM approach of Cardoso and da Costa (2007) is an instance of equation 5.2 with the absolute cost vectors, in which all W(k)n are equal. The SVOR-IMC approach of Chu and Keerthi (2007) can also be thought of as a modified instance of the formulation with the absolute cost vectors, except that the term is dropped. Their SVOR-EXC approach is another modified instance using the classification cost vectors plus an additional constraint to guarantee that θ1 ≤ θ2 ≤ ⋅ ⋅ ⋅ ≤ θK−1.

Our proposed algorithm, reduction-to-SVM (RED-SVM), unifies the above algorithms under a generic formulation, equation 5.2, with the cost-sensitive reduction framework. RED-SVM can deal with any convex cost vectors by changing W(k)n and feeding the weighted binary examples to a standard SVM solver, regardless of whether θ1 ≤ θ2 ≤ ⋅ ⋅ ⋅ ≤ θK−1. Interestingly, our earlier work (Li & Lin, 2007b) proved that the ordering property always holds at the optimal SVM solution.
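As a hedged illustration of how extended examples can be fed to a kernel-based solver, the sketch below builds the kernel on extended inputs (x, k) that results from appending row k of M = γ · IK−1 to φ(x). The inner product of two such rows is γ² when k = k′ and 0 otherwise, so the extended kernel is the base kernel plus γ² · [k = k′]. The gaussian parametrization and all names are assumptions, and the weights W(k)n would be passed to the solver separately as per-example weights (not shown).

```python
import math

def gaussian_kernel(x, z, sigma=1.0):
    """Base kernel on the original inputs (one common parametrization)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))

def extended_kernel(base_kernel, gamma=1.0):
    """Kernel on extended inputs (x, k) induced by M = gamma * I_{K-1}."""
    def kernel(Xk, Zk):
        (x, k), (z, l) = Xk, Zk
        # <m_k, m_l> equals gamma^2 when k == l and 0 otherwise.
        return base_kernel(x, z) + (gamma ** 2 if k == l else 0.0)
    return kernel

kern = extended_kernel(gaussian_kernel, gamma=1.0)
same = kern(([0.0, 1.0], 2), ([0.0, 1.0], 2))       # base kernel plus gamma^2
different = kern(([0.0, 1.0], 2), ([0.0, 1.0], 3))  # base kernel only
```

A kernel of this form can be supplied to any SVM implementation that accepts a user-defined or precomputed kernel, which is one way to reuse a standard solver unchanged.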

On the other hand, if the cost vectors are ordinal but not convex, solving equation 5.2 is more complicated. We adopt a coordinate-descent procedure that switches between optimizing (v, b) (using the standard SVM solver) and optimizing θ under the constraints (a small quadratic programming problem with an analytic solution) in the experiments.

Chu and Keerthi (2007) empirically found that SVOR-EXC performed better in terms of the classification cost, and SVOR-IMC performed better in terms of the absolute cost. They explained this by noting that SVOR-EXC minimizes an in-sample loss function that upper-bounds the classification cost, while SVOR-IMC minimizes a loss function that upper-bounds the absolute cost. The explanation is echoed by the study of loss functions for ordinal ranking (Rennie & Srebro, 2005; Dembczyński et al., 2008). The proposed reduction framework offers a more direct explanation than the loss-based one. Because the binary SVM is designed to target decent out-of-sample binary classification error, reduction with the classification cost (SVOR-EXC) targets decent out-of-sample classification cost and reduction with the absolute cost (SVOR-IMC) targets decent out-of-sample absolute cost.

Note that Chu and Keerthi (2007) spent a great deal of effort in designing and implementing suitable optimizers for the modified formulation that does not contain the term. If we use the standard soft-margin SVM instead when considering the convex cost vectors like the absolute cost, we can directly and efficiently use the state-of-the-art SVM software to deal with the ordinal ranking problem. The formulation of Chu and Keerthi (2007) can be approximated by using a large γ. As we shall see in section 6, even a simple assignment of γ = 1 performs similarly to the approaches of Chu and Keerthi (2007) in practice.

In addition to the algorithmic benefits, the reduction framework can be used theoretically for SVM. For instance, we demonstrated how we can derive a novel large-margin absolute-cost bound of thresholded ensemble rankers in section 5.2. Next, we extend the bounds to SVM-based formulations and to a wider class of cost functions. While Shashua and Levin (2003) derived one such bound with a specific cost function, their bound is not data dependent and hence does not fully explain the out-of-sample performance of SVM-based rankers in reality (Bartlett & Shawe-Taylor, 1998). Our bound, on the other hand, is not only more general but also data dependent:

Theorem 6

(large-margin bound for SVM-based rankers). Consider a collection

formula
Let , , and β = Bmax/Bmin. If θ1 ≤ θ2 ≤ ⋅ ⋅ ⋅ ≤ θK−1, or if every c is convex, for any δ > 0, with probability at least 1 − δ, and for every f in , the associated ranker rg(x) with g(·) = sign(f(·)) satisfies
formula
Proof.

See appendix  B.

Thus, if the binary classifier g achieves large margins (≥Δ) on most of the extended training examples (X(k)n, Y(k)n, W(k)n), E(rg) is guaranteed to be small.

Theorem 6, which is based on the proposed reduction framework, is quite general and applies to a wide class of cost functions. In the special case of the absolute cost function (which results in W(k)n = 1 and β = 1), theorem 6 can be simplified to an order-wise comparable bound that has been independently derived by Agarwal (2008) using a similar proving technique.

Note that we can also choose to encode (x, k) differently. For instance, define
formula
That is, different kernels can be used for different binary classification sub-problems. Recently Chang, Chen, and Hung (2011) explored such a possibility and proposed the ordinal hyperplanes ranker that achieved promising performance on the age-estimation application. The ordinal hyperplanes ranker can be theoretically justified through the reduction framework using the choice of encoding above. The promising performance suggests the possibility of application opportunities within the proposed general framework.

5.4.  Summary.

We have briefly introduced several ordinal ranking algorithms that can be explained as special instances of the reduction framework. We have also derived new cost bounds of the ordinal ranking algorithms via reduction. There are some other existing algorithms that can be viewed as special instances of the reduction framework, as listed in Table 3.

Table 3:
Instances of the Reduction Framework.
| Ordinal ranking | Cost | Binary classification algorithm |
|---|---|---|
| PRank (Crammer & Singer, 2005) | Absolute | Modified perceptron rule |
| Kernel ranking (Rajaram et al., 2003) | Classification | Modified hard-margin SVM |
| SVOR-EXC (Chu & Keerthi, 2007) | Classification | Modified soft-margin SVM |
| SVOR-IMC (Chu & Keerthi, 2007) | Absolute | Modified soft-margin SVM |
| ORBoost-LR (Lin & Li, 2006) | Classification | Modified AdaBoost |
| ORBoost-All (Lin & Li, 2006) | Absolute | Modified AdaBoost |
| oSVM (Cardoso & da Costa, 2007) | Absolute | Standard soft-margin SVM |
| oNN (Cardoso & da Costa, 2007) | Absolute | Standard neural network |
| RED-C4.5 (Li & Lin, 2007b; Lin, 2008) | Any convex | Standard C4.5 |
| RED-AdaBoost (Li & Lin, 2007b; Lin, 2008) | Any convex | Standard AdaBoost |
| RED-SVM (Li & Lin, 2007b; Lin, 2008) | Any convex | Standard soft-margin SVM |
| RED-SVM (Li & Lin, 2007b; Lin, 2008) | Any V-shaped | Modified soft-margin SVM |
| AdaBoost.OR (Lin & Li, 2009) | Any V-shaped | Standard AdaBoost coupled with special base learners |
| CLM (Agresti, 2002) | Implicitly depends on assumed distribution | Maximum likelihood on assumed distribution |

Note that the thresholded linear model is commonly used in statistics for ordinal ranking (Agresti, 2002) and is called the cumulative link model (CLM), which assumes (〈v, x〉 − θk) to link to the cumulative probability . CLM can then be coupled with some more assumptions on the underlying probability distribution to reach a maximum likelihood solution. The proposed framework treats the thresholded linear model (CLM) as a rank-monotonic special case of the general prediction rule, equation 3.1. CLM and the proposed framework take very different views on modeling the ordinal ranking problem and hence reach different results. In particular, CLM focuses on deriving from the assumed underlying distribution appropriately, while the proposed framework focuses on using the given cost vectors appropriately.

6.  Experiments

We validate the proposed reduction framework by performing experiments with eight benchmark ordinal ranking data sets (Chu & Keerthi, 2007): . The data sets were constructed by quantizing some metric regression data sets with K = 10. We use the same training-to-test ratio and also average the results over 20 trials. Thus, we can fairly compare our results with those of SVOR-IMC and SVOR-EXC (Chu & Keerthi, 2007), the state-of-the-art algorithms.

Table 4:
Test Absolute Cost of Ordinal Ranking Algorithms.
| Data Set | Reduction to C4.5 | Reduction to AdaBoost-St | Reduction to SVM-Perc | SVOR-IMC (Gaussian) |
|---|---|---|---|---|
| | 1.565 ± 0.072 | 1.360 ± 0.054 | 1.304 ± 0.040 | 1.294 ± 0.046 |
| | 0.987 ± 0.024* | 0.875 ± 0.017* | 0.842 ± 0.022 | 0.990 ± 0.026 |
| | 0.950 ± 0.016 | 0.846 ± 0.015 | 0.732 ± 0.013 | 0.747 ± 0.011 |
| | 1.560 ± 0.006 | 1.458 ± 0.005 | 1.383 ± 0.004 | 1.361 ± 0.003 |
| | 1.700 ± 0.005 | 1.481 ± 0.002 | 1.404 ± 0.002 | 1.393 ± 0.002 |
| | 0.701 ± 0.003 | 0.604 ± 0.002 | 0.565 ± 0.002* | 0.596 ± 0.002 |
| | 0.974 ± 0.004* | 0.991 ± 0.003* | 0.940 ± 0.001* | 1.008 ± 0.001 |
| | 1.263 ± 0.003 | 1.210 ± 0.001 | 1.143 ± 0.002* | 1.205 ± 0.002 |

Notes: Those within one standard error of the lowest one are marked in bold. Those better than SVOR-IMC are marked with an asterisk.

6.1.  The Absolute Cost.

We first test the reduction framework with the absolute cost vectors, M = γ · IK−1 with γ = 1, and three different binary classification algorithms. The first binary algorithm is the C4.5 decision tree (Quinlan, 1986).6 The second is AdaBoost-St, which uses AdaBoost (Schapire et al., 1998) to aggregate 500 decision stumps. The third one is SVM-Perc, which is SVM (Vapnik, 1995) with the perceptron kernel (Lin & Li, 2008). The parameter κ of the soft-margin SVM is determined by a five-fold cross-validation procedure with log2κ ∈ {−17, − 15, …, 3} (Hsu, Chang, & Lin, 2003), and LIBSVM (Chang & Lin, 2001) is adopted as the solver.
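The parameter grid and a five-fold split can be written down in a few lines of Python. This is only a sketch of the selection procedure described above: the interleaved fold assignment assumes the examples have already been shuffled, and the function names are hypothetical.

```python
def kappa_grid():
    """Candidate values with log2(kappa) in {-17, -15, ..., 3}."""
    return [2.0 ** e for e in range(-17, 4, 2)]

def five_fold_indices(n, folds=5):
    """Split indices 0..n-1 into `folds` nearly equal validation folds.

    Fold i holds indices i, i + folds, i + 2 * folds, ...; the remaining
    indices form the corresponding training part.
    """
    return [list(range(i, n, folds)) for i in range(folds)]

grid = kappa_grid()
validation_folds = five_fold_indices(12)
```

Each candidate κ would then be scored by its average validation cost over the five folds, and the best candidate retained for training on the full training set.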

We list the mean and the standard error of the test absolute costs in Table 4, with entries within one standard error of the lowest one marked in bold.7 With the proposed reduction framework, all three binary learning algorithms, even the simplest C4.5 decision tree, could be better than SVOR-IMC with the gaussian kernel on some of the data sets. The results demonstrate that all the algorithms can achieve decent out-of-sample performance. Among the three algorithms, reduction to SVM-Perc is usually better than the other two.
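For reference, the mean and standard error over repeated trials can be computed as follows. The sketch assumes the usual definition of the standard error as the sample standard deviation (with the n − 1 denominator) divided by the square root of the number of trials. The cost values in the example are made up for illustration and are not experimental results.

```python
import math
import statistics

def mean_and_standard_error(costs):
    """Return (mean, standard error) of per-trial test costs."""
    mean = statistics.mean(costs)
    # statistics.stdev uses the n - 1 denominator (sample standard deviation).
    standard_error = statistics.stdev(costs) / math.sqrt(len(costs))
    return mean, standard_error

# Illustrative per-trial costs only; not taken from the experiments.
mean, standard_error = mean_and_standard_error([1.0, 2.0, 3.0, 4.0])
```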

Note, however, that Chu and Keerthi (2007) use the gaussian kernel rather than the perceptron kernel in their experiments. For a fair comparison, we implement SVOR-IMC with the perceptron kernel by modifying LIBSVM (Chang & Lin, 2001) and conduct experiments with the parameter selection procedure introduced earlier in this section. We also couple RED-SVM with the gaussian kernel and the parameter selection procedure of SVOR-IMC (Chu & Keerthi, 2007).

In addition, to examine the performance of different SVM-based approaches on real-world ordinal ranking problems, we include two more data sets: and the red wine subset () of the wine quality set from the UCI machine learning repository (Hettich, Blake, & Merz, 1998). The problem aims at ranking cars according to four conditions: { }; the problem ranks red wine samples into 11 different levels between 0 and 10, although the actual data contain only samples with ranks between 3 and 8. We randomly split 75% of the examples for training and 25% for testing, and conduct 20 runs of such a random split. The training input vectors are first scaled to [0, 1] linearly, and the test input vectors are scaled accordingly.
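The splitting and scaling steps can be sketched in Python. The sketch assumes that "scaled accordingly" means the test inputs reuse the per-feature minimum and range found on the training inputs, so test values may fall slightly outside [0, 1]. The function names, the rounding of the split point, and the handling of constant features are illustrative choices.

```python
import random

def split_75_25(examples, rng):
    """Randomly split examples into 75% training and 25% test."""
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(round(0.75 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def fit_minmax(train_inputs):
    """Per-feature minimum and range, computed on the training inputs only."""
    columns = list(zip(*train_inputs))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return lows, spans

def apply_minmax(inputs, lows, spans):
    """Linearly scale inputs; constant training features map to 0.0."""
    return [[(v - lo) / sp if sp > 0 else 0.0
             for v, lo, sp in zip(row, lows, spans)]
            for row in inputs]

train_inputs = [[0.0, 10.0], [4.0, 20.0]]
test_inputs = [[2.0, 25.0]]
lows, spans = fit_minmax(train_inputs)
scaled_train = apply_minmax(train_inputs, lows, spans)
scaled_test = apply_minmax(test_inputs, lows, spans)
train_part, test_part = split_75_25(list(range(8)), random.Random(0))
```

Fitting the scaling on the training part alone keeps information about the test inputs out of the training procedure.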

Table 5:
Test Absolute Cost of SVM-Based Ordinal Ranking Algorithms.
| Data Set | RED-SVM (Perceptron) | RED-SVM (Gaussian) | SVOR-IMC (Perceptron) | SVOR-IMC (Gaussian) |
|---|---|---|---|---|
| | 1.304 ± 0.040 | 1.277 ± 0.037 | 1.315 ± 0.039 | 1.294 ± 0.046 |
| | 0.842 ± 0.022 | 0.914 ± 0.026 | 0.814 ± 0.019 | 0.990 ± 0.026 |
| | 0.732 ± 0.013 | 0.752 ± 0.015 | 0.729 ± 0.013 | 0.747 ± 0.011 |
| | 1.383 ± 0.004 | 1.361 ± 0.003 | 1.386 ± 0.005 | 1.361 ± 0.003 |
| | 1.404 ± 0.002 | 1.395 ± 0.002 | 1.404 ± 0.002 | 1.393 ± 0.002 |
| | 0.565 ± 0.002 | 0.588 ± 0.001 | 0.565 ± 0.002 | 0.596 ± 0.002 |
| | 0.940 ± 0.001 | 0.945 ± 0.001 | 0.939 ± 0.001 | 1.008 ± 0.001 |
| | 1.143 ± 0.002 | 1.167 ± 0.002 | 1.143 ± 0.002 | 1.205 ± 0.002 |
| | 0.061 ± 0.003 | 0.050 ± 0.002 | 0.064 ± 0.003 | 0.051 ± 0.002 |
| | 0.357 ± 0.005 | 0.425 ± 0.004 | 0.357 ± 0.005 | 0.429 ± 0.004 |

Note: Those within one standard error of the lowest one are marked in bold.

Table 5 lists the results, which suggest that direct reduction to the standard SVM (RED-SVM) performs similarly to SVOR-IMC when using the same kernel. RED-SVM nevertheless is much easier to implement. In addition, RED-SVM is significantly faster than SVOR-IMC in training. The speed difference is illustrated in Figure 2 using the four largest data sets. We make a fair comparison by implementing both algorithms under the same code and data structure of LIBSVM. The CPU time was gathered on a 1.7 G Dual Intel Xeon machine with 1 GB RAM. After a careful comparison, we find that the main cause for the time difference is the speed-up of heuristics. While, to the best of our knowledge, not much has been done to improve the original SVOR-IMC algorithm, plenty of heuristics, such as shrinking and advanced working selection in LIBSVM, can be seamlessly adopted by RED-SVM because of the reduction framework. The newly designed SVOR-IMC does not enjoy the same advantage. The difference demonstrates an important property of the reduction framework: any improvements to the binary classification approaches can be immediately inherited by reduction-based ordinal ranking algorithms.

Figure 2:

Training time (including automatic parameter selection) of SVM-based ordinal ranking algorithms with the perceptron kernel.

6.2.  The Classification Cost.

We also test the reduction framework with the classification cost vectors. Because the classification cost vectors are V-shaped but not convex, the reduction framework guarantees to work only when the obtained binary classifier is rank-monotonic. Such a condition is not easily met by C4.5 or AdaBoost. Thus, we test the reduction framework only using a variant of RED-SVM that respects the constraint θ1 ≤ θ2 ≤ ⋅ ⋅ ⋅ ≤ θK−1 (see section 5.3), and compare the variant with SVOR-EXC.
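The distinction between V-shaped and convex cost vectors can be checked mechanically. The sketch below assumes the usual definitions: a cost vector is V-shaped with respect to y if it does not increase toward rank y and does not decrease away from it, and it is convex if its successive differences are nondecreasing. The function names are hypothetical.

```python
def is_v_shaped(c, y):
    """c[k - 1] is the cost of rank k; y is the desired rank (1-indexed)."""
    left = all(c[k - 1] >= c[k] for k in range(1, y))        # ranks 1..y
    right = all(c[k - 1] <= c[k] for k in range(y, len(c)))  # ranks y..K
    return left and right

def is_convex(c):
    """Successive differences c[k+1] - c[k] are nondecreasing."""
    diffs = [b - a for a, b in zip(c, c[1:])]
    return all(d1 <= d2 for d1, d2 in zip(diffs, diffs[1:]))

K, y = 5, 3
absolute = [abs(k - y) for k in range(1, K + 1)]              # [2, 1, 0, 1, 2]
classification = [0 if k == y else 1 for k in range(1, K + 1)]  # [1, 1, 0, 1, 1]
```

With these vectors, the absolute cost passes both checks, whereas the classification cost passes only the V-shaped check, which is why rank-monotonicity is needed for the guarantee in that case.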

We list the mean and the standard error of the test classification costs in Table 6, with entries within one standard error of the lowest one marked in bold. RED-SVM with the perceptron kernel is better than RED-SVM with the gaussian kernel on most of the benchmark data sets and redwine, while RED-SVM with the gaussian kernel is better on . RED-SVM with the gaussian kernel is in turn slightly better than SVOR-EXC with the gaussian kernel on most of the data sets. The results again justify the usefulness of the proposed reduction framework.8

Table 6:
Test Classification Cost of SVM-Based Ordinal Ranking Algorithms.
| Data Set | RED-SVM (Perceptron) | RED-SVM (Gaussian) | SVOR-EXC (Gaussian) |
|---|---|---|---|
| | 0.762 ± 0.021 | 0.787 ± 0.021 | 0.752 ± 0.014 |
| | 0.572 ± 0.013 | 0.637 ± 0.016 | 0.661 ± 0.012 |
| | 0.541 ± 0.009 | 0.565 ± 0.008 | 0.569 ± 0.006 |
| | 0.721 ± 0.002 | 0.708 ± 0.002 | 0.736 ± 0.002 |
| | 0.751 ± 0.001 | 0.746 ± 0.001 | 0.744 ± 0.001 |
| | 0.451 ± 0.002 | 0.461 ± 0.001 | 0.462 ± 0.001 |
| | 0.613 ± 0.001 | 0.612 ± 0.001 | 0.640 ± 0.001 |
| | 0.688 ± 0.001 | 0.686 ± 0.001 | 0.699 ± 0.000 |
| | 0.064 ± 0.003 | 0.050 ± 0.002 | 0.054 ± 0.002 |
| | 0.327 ± 0.005 | 0.392 ± 0.004 | 0.403 ± 0.004 |

Note: Those within one standard error of the lowest one are marked in bold.

6.3.  Other Costs.

Next we use different cost vectors for evaluation to demonstrate the power of the proposed cost-sensitive ordinal ranking framework. We consider two kinds of cost vectors. First, we define the asymmetric cost vector for rank ℓ as
formula
That is, for K = 10, an asymmetric cost vector for (x, 3) would be
formula
The asymmetric cost vector combines two different cost vectors. For instance, when , the asymmetric cost vector includes a fast-growing cost vector when k>ℓ and a slow-growing one when k ≤ ℓ. A potential use of the asymmetric cost vector is to tolerate the cases when k is on the “same side” of ℓ while penalizing the cases when k is far from ℓ.
Table 7:
Test Asymmetric Cost of SVM-Based Ordinal Ranking Algorithms.
| Data Set | RED-SVM (Asymmetric) | RED-SVM (Absolute) | RED-SVM (Classification) | SVOR-IMC | SVOR-EXC |
|---|---|---|---|---|---|
| | 1.716 ± 0.182 | 1.593 ± 0.118 | 4.522 ± 1.505 | 1.665 ± 0.140 | 2.309 ± 0.321 |
| | 0.873 ± 0.056 | 0.820 ± 0.034 | 0.814 ± 0.051 | 0.898 ± 0.046 | 1.011 ± 0.062 |
| | 0.762 ± 0.038 | 0.759 ± 0.030 | 0.750 ± 0.029 | 0.784 ± 0.049 | 0.822 ± 0.063 |
| | 1.992 ± 0.022 | 1.995 ± 0.018 | 2.700 ± 0.035 | 1.952 ± 0.015 | 2.580 ± 0.024 |
| | 1.937 ± 0.009 | 1.923 ± 0.009 | 2.558 ± 0.032 | 1.948 ± 0.010 | 2.490 ± 0.013 |
| | 0.492 ± 0.003 | 0.508 ± 0.002 | 0.507 ± 0.003 | 0.533 ± 0.002 | 0.535 ± 0.003 |
| | 1.183 ± 0.007 | 1.141 ± 0.005 | 1.223 ± 0.008 | 1.208 ± 0.007 | 1.318 ± 0.009 |
| | 1.587 ± 0.010 | 1.552 ± 0.007 | 1.778 ± 0.019 | 1.746 ± 0.019 | 2.023 ± 0.021 |

Note: Those within one standard error of the lowest one are marked in bold.

Another cost vector that we consider is called two-gaussian (2Gauss), which combines two (reverted) gaussian functions. The formal definition of the 2Gauss cost is
formula
Note that the 2Gauss cost vectors are V-shaped but not convex. They also penalize the two sides of cases differently.
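The two properties named above can be checked mechanically for any cost vector. The following Python sketch is only an illustration: it assumes the usual definitions (a cost vector is V-shaped around ℓ if it is nonincreasing up to ℓ and nondecreasing afterward, and convex if its successive differences are nondecreasing), and the helper names are hypothetical. Because the 2Gauss formula is not reproduced here, the sketch uses the absolute and classification cost vectors as test inputs.

```python
from typing import List, Sequence


def absolute_cost(ell: int, K: int) -> List[float]:
    """Absolute cost vector for true rank ell: c[k] = |k - ell|, k = 1..K."""
    return [float(abs(k - ell)) for k in range(1, K + 1)]


def classification_cost(ell: int, K: int) -> List[float]:
    """Classification cost vector for true rank ell: c[k] = [[k != ell]]."""
    return [0.0 if k == ell else 1.0 for k in range(1, K + 1)]


def is_v_shaped(c: Sequence[float], ell: int) -> bool:
    """Assumed definition: nonincreasing up to rank ell, nondecreasing after.

    c is stored 0-indexed, so c[k - 1] holds the cost of predicting rank k.
    """
    i = ell - 1
    left = all(c[k] >= c[k + 1] for k in range(0, i))
    right = all(c[k] <= c[k + 1] for k in range(i, len(c) - 1))
    return left and right


def is_convex(c: Sequence[float]) -> bool:
    """Assumed definition: c[k+1] - c[k] >= c[k] - c[k-1] for interior k."""
    return all(c[k + 1] - c[k] >= c[k] - c[k - 1] for k in range(1, len(c) - 1))
```

Under these assumed definitions, the absolute cost is both V-shaped and convex, while the classification cost is V-shaped but not convex, which is the same combination of properties stated for 2Gauss.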

Table 7 lists the mean and standard error of the test asymmetric costs. For RED-SVM, we consider three kinds of costs for training: asymmetric, absolute, and classification. We then compare the three variants of RED-SVM with the state-of-the-art SVOR-IMC and SVOR-EXC. First, RED-SVM with the asymmetric cost and RED-SVM with the absolute cost generally perform better than SVOR-IMC or SVOR-EXC, which demonstrates that the proposed cost-sensitive framework could achieve decent test performance in a cost-sensitive setting.
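To make the reported quantities concrete, the sketch below shows one way the entries of such a table could be computed: the average test cost of a ranker under a given cost vector, and the mean and standard error over repeated splits. It is a minimal illustration rather than the experimental code; the function names are hypothetical, and the absolute cost is used only because the asymmetric formula is not reproduced here.

```python
import math
from typing import Callable, List, Sequence, Tuple


def absolute_cost_vector(ell: int, K: int) -> List[float]:
    """Absolute cost vector for true rank ell: c[k] = |k - ell|, k = 1..K."""
    return [float(abs(k - ell)) for k in range(1, K + 1)]


def average_test_cost(
    predicted: Sequence[int],
    true: Sequence[int],
    K: int,
    cost_vector: Callable[[int, int], List[float]],
) -> float:
    """Average of c_y[r] over the test set, with ranks r and y in 1..K."""
    total = sum(cost_vector(y, K)[r - 1] for r, y in zip(predicted, true))
    return total / len(true)


def mean_and_standard_error(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error (sample standard deviation / sqrt(n))."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)
```

For example, `mean_and_standard_error` would be applied to the list of per-split values returned by `average_test_cost`, one value per random train/test split.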

The classification cost vectors are very different from the asymmetric ones, and thus RED-SVM with the classification cost tends to perform poorly when evaluated with the asymmetric cost vectors. The absolute cost vectors are closer to the asymmetric ones. In fact, Table 7 suggests that RED-SVM with the absolute cost is often better than RED-SVM with the asymmetric cost. Thus, when evaluating with convex cost vectors like the asymmetric ones, training with the absolute cost vectors could be a useful first choice.

Table 8 lists the mean and standard error of the test 2Gauss costs. For RED-SVM, we also consider three kinds of costs during training: 2Gauss, absolute, and classification. Note that RED-SVM with the 2Gauss cost is not only better than the state-of-the-art SVOR-IMC and SVOR-EXC but also better than other RED-SVM variants. One possible explanation is that RED-SVM with the absolute cost cannot perform well because the absolute cost is convex while 2Gauss is not; RED-SVM with the classification cost also cannot perform well because the classification cost is symmetric on both sides of the desired label ℓ while 2Gauss is not. The results justify the importance of the proposed cost-sensitive ordinal ranking framework.

Table 8: Test 2Gauss Cost of SVM-Based Ordinal Ranking Algorithms.

Data Set  RED-SVM (2Gauss)      RED-SVM (Absolute)  RED-SVM (Classification)  SVOR-IMC       SVOR-EXC
          0.760 ± 0.055         0.961 ± 0.040       1.025 ± 0.071             0.930 ± 0.047  0.932 ± 0.047
          0.456 ± 0.019         0.505 ± 0.019       0.466 ± 0.020             0.552 ± 0.024  0.540 ± 0.030
          0.383 ± 0.015         0.434 ± 0.013       0.435 ± 0.013             0.434 ± 0.014  0.425 ± 0.012
          0.935 ± 0.006         1.032 ± 0.004       0.943 ± 0.005             1.006 ± 0.004  0.936 ± 0.005
          0.912 ± 0.003         1.069 ± 0.003       1.015 ± 0.004             1.051 ± 0.003  0.997 ± 0.003
          0.227 ± 0.001         0.279 ± 0.002       0.263 ± 0.002             0.287 ± 0.001  0.274 ± 0.002
          0.542 ± 0.002         0.607 ± 0.001       0.577 ± 0.003             0.602 ± 0.002  0.578 ± 0.002
          0.713 ± 0.001         0.786 ± 0.002       0.737 ± 0.002             0.799 ± 0.002  0.765 ± 0.003

Note: Those within one standard error of the lowest one are marked in bold.

6.4.  Improving NDCG with Cost-Sensitive Ordinal Ranking

We demonstrate another useful characteristic of cost-sensitive ordinal ranking by designing a cost vector that could help improve the normalized discounted cumulative gain (NDCG), a criterion commonly used in listwise ranking (Liu, 2009). The design uses a bound from the McRank work of Li et al. (2008), who showed that for any set of test input vectors $\{x_m\}_{m=1}^{M}$ with ideal ranks $y_m$,
formula
6.1
Nevertheless, the original McRank algorithm was not designed with the bound above, but was derived by replacing c with $(2^K - 1)^2$ times the classification cost, a much looser upper bound. Next we examine whether we can use the tighter cost vector in equation 6.1 to achieve better (higher) NDCG performance.
Table 9: Test NDCG of Ordinal Ranking Algorithms.

Data Set  RED-SVM (ndcg)        RED-SVM (Absolute)  RED-SVM (Classification)  SVOR-IMC       SVOR-EXC
          0.924 ± 0.008         0.934 ± 0.008       0.917 ± 0.011             0.933 ± 0.008  0.922 ± 0.010
          0.973 ± 0.002         0.973 ± 0.002       0.976 ± 0.001             0.961 ± 0.003  0.956 ± 0.004
          0.958 ± 0.002         0.957 ± 0.002       0.957 ± 0.002             0.956 ± 0.003  0.953 ± 0.003
          0.872 ± 0.001         0.865 ± 0.001       0.869 ± 0.002             0.868 ± 0.001  0.864 ± 0.001
          0.902 ± 0.000         0.879 ± 0.000       0.881 ± 0.001             0.879 ± 0.000  0.880 ± 0.001
          0.961 ± 0.000         0.961 ± 0.000       0.962 ± 0.000             0.959 ± 0.000  0.959 ± 0.000
          0.934 ± 0.000         0.931 ± 0.000       0.932 ± 0.000             0.930 ± 0.000  0.931 ± 0.000
          0.929 ± 0.000         0.919 ± 0.000       0.922 ± 0.001             0.914 ± 0.000  0.917 ± 0.001

Note: Those within one standard error of the highest one are marked in bold.

We transform the benchmark data sets to listwise ranking by randomly generating 10,000 subsets of size 10. Then, we evaluate NDCG at the 10th position for each subset and report the average. Note that the original McRank algorithm (Li et al., 2008) is very similar to reduction with the absolute cost, with a possibly weaker underlying learner (boosting tree) and a slightly different rule for converting g to $r_g$ (see note 9). Thus, in addition to the ndcg cost, equation 6.1, we also couple the absolute cost (similar to the actual McRank algorithm) and the classification cost (similar to the theoretical backbone of McRank) with RED-SVM for comparison. We then compare the three variants with the state-of-the-art SVOR-IMC and SVOR-EXC. Table 9 lists the mean and standard error of the test NDCG on the eight data sets. We see that RED-SVM with the ndcg cost often achieves better NDCG performance than the other two variants of RED-SVM, including RED-SVM with the classification cost. Also, RED-SVM with the ndcg cost is often better than SVOR-IMC and SVOR-EXC. The results demonstrate the potential of cost-sensitive ordinal ranking for improving listwise ranking, which echoes the recent finding of a related work (Tsai et al., 2010) on the Yahoo! Learning to Rank Challenge.
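The evaluation protocol described above (random subsets of size 10, NDCG at the 10th position, averaged) can be sketched as follows. This is a hedged illustration and not the experimental code: it assumes the common NDCG convention of gain 2^y − 1 and a log2 position discount, it assumes that items are ordered by a real-valued score, and all function names are hypothetical.

```python
import math
import random
from typing import Sequence


def dcg_at(ranks_in_order: Sequence[int], position: int) -> float:
    """DCG with assumed gain 2^y - 1 and discount 1 / log2(i + 1), i = 1, 2, ..."""
    return sum(
        (2 ** y - 1) / math.log2(i + 2)
        for i, y in enumerate(ranks_in_order[:position])
    )


def ndcg_at(scores: Sequence[float], ideal_ranks: Sequence[int], position: int = 10) -> float:
    """NDCG of the ordering induced by sorting items by decreasing score."""
    order = sorted(range(len(scores)), key=lambda m: scores[m], reverse=True)
    dcg = dcg_at([ideal_ranks[m] for m in order], position)
    ideal = dcg_at(sorted(ideal_ranks, reverse=True), position)
    return dcg / ideal if ideal > 0 else 1.0


def average_ndcg(
    scores: Sequence[float],
    ideal_ranks: Sequence[int],
    n_subsets: int = 10000,
    subset_size: int = 10,
    seed: int = 0,
) -> float:
    """Average NDCG over randomly generated subsets of the test examples."""
    rng = random.Random(seed)
    indices = range(len(scores))
    total = 0.0
    for _ in range(n_subsets):
        subset = rng.sample(indices, subset_size)
        total += ndcg_at(
            [scores[m] for m in subset],
            [ideal_ranks[m] for m in subset],
            subset_size,
        )
    return total / n_subsets
```

How ties among equal predicted ranks are broken is not specified in the text; the sketch simply relies on the sort order of the scores it is given.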

7.  Conclusion

We presented the reduction framework from ordinal ranking to binary classification. The framework is accompanied by the flexibility to work with any reasonable cost vectors and any binary classifiers. We showed the theoretical guarantees of the framework, including the cost bound, the regret bound, and the equivalence between ordinal ranking and binary classification.

We also demonstrated the advantages of the framework in designing new algorithms, explaining existing ones, and deriving new generalization bounds for ordinal ranking. Furthermore, the usefulness of the framework was empirically validated by comparing the newly proposed algorithms constructed from the framework with the state-of-the-art SVOR-IMC and SVOR-EXC algorithms. In particular, the proposed cost-sensitive ordinal ranking algorithms were observed not only to improve over SVOR-IMC and SVOR-EXC when using common evaluation criteria like the absolute or the classification costs, but also to be superior to SVOR-IMC and SVOR-EXC when evaluated with other costs, as well as with the NDCG criterion for listwise ranking.

Appendix A:  Proof of Theorem 5

Consider the extended training set
formula
with N(K − 1) elements. If we directly draw from , each element is a possible outcome. Note, however, that these elements are not all independent outcomes. For example, $(X_n^{(1)}, Y_n^{(1)}, W_n^{(1)})$ and $(X_n^{(2)}, Y_n^{(2)}, W_n^{(2)})$ are dependent because they sprout from the same $(x_n, y_n, c_n)$. Thus, we cannot directly use the whole $S_E$ as a set of independent outcomes from .
Nevertheless, some subsets of $S_E$ contain independent outcomes from . One way to extract such a subset is to choose one $k_n$ uniformly and independently from 1, …, K − 1 for each training example $(x_n, y_n, c_n)$. The resulting subset would be
formula
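The sampling step just described can be sketched in Python as follows. The sketch is illustrative only. It assumes that the extended example for threshold k takes the form used by the reduction framework, namely label +1 if k < y and −1 otherwise, with weight |c[k] − c[k + 1]|; since the defining formulas are not reproduced here, this form and the function names should be read as assumptions.

```python
import random
from typing import List, Sequence, Tuple

ExtendedExample = Tuple[Tuple[Tuple[float, ...], int], int, float]


def extended_example(
    x: Sequence[float], y: int, c: Sequence[float], k: int
) -> ExtendedExample:
    """Build (X^(k), Y^(k), W^(k)) for a threshold k in 1..K-1.

    Assumed construction: Y^(k) = +1 if k < y else -1, and
    W^(k) = |c[k] - c[k+1]|, where c[k] is stored at index k - 1.
    """
    label = 1 if k < y else -1
    weight = abs(c[k - 1] - c[k])
    return (tuple(x), k), label, weight


def sample_independent_subset(
    examples: Sequence[Tuple[Sequence[float], int, Sequence[float]]],
    K: int,
    seed: int = 0,
) -> List[ExtendedExample]:
    """Keep one extended example per original example, with k_n uniform on 1..K-1."""
    rng = random.Random(seed)
    subset = []
    for x, y, c in examples:
        k_n = rng.randint(1, K - 1)  # inclusive on both ends
        subset.append(extended_example(x, y, c, k_n))
    return subset
```

Because each original example contributes exactly one extended example, and each k_n is drawn independently, the N resulting elements no longer share a common source example.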
We use a simple encoding scheme of $M = I_{K-1}$ to represent $X^{(k)} = (x, k)$. Consider a binary classification ensemble $g(X^{(k)})$ defined by a linear combination of the functions in
formula
A.1
Here $s(X^{(k)})$ is a decision stump on dimension d + ℓ (Holte, 1993). If the output space of s is {−1, 1}, it is not hard to show that the VC-dimension of is no more than $d_E = d + K - 1$. Since the proof of Schapire et al. (1998, theorem 2), which will be applied later, requires only a combinatorial counting bound on the possible outputs of s, we let
formula
to get a cosmetically cleaner proof. Some different versions of the bound can be obtained by considering $s(X^{(k)}) \in \{-1, 1\}$ or by bounding the number of possible outputs of s directly by a tighter term.
Without loss of generality, we normalize r such that ∑Tt=1t| + ∑K−1ℓ=1| is 1. Then consider an ensemble function,
formula
For every $(X^{(k)}, Y^{(k)}, W^{(k)})$ derived from the tuple (x, y, k), the term . Furthermore, we can easily see that $r = r_g$. Thus, by theorem 2,
formula
A.2
Because $S_I$ contains N independent outcomes from , the large-margin theorem (Schapire et al., 1998, theorem 2) states that with probability at least over the choice of $S_I$,
formula
A.3
Define a Boolean random variable:
formula
We see that $b_n$ comes with mean . Using Hoeffding's (1963) inequality, when each $b_n$ is chosen independently, with probability at least over the choice of $b_n$,
formula
A.4
The desired result can be proved by combining equations A.2 to A.4 with a union bound.
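For reference, the general one-sided form of Hoeffding's inequality for independent variables bounded in [0, 1] is recalled below. This is the standard textbook statement, written with a generic confidence parameter δ′ (an assumed symbol); it is not a reconstruction of equation A.4, whose exact constants are not reproduced here.

```latex
\[
\Pr\left[\frac{1}{N}\sum_{n=1}^{N} b_n - \frac{1}{N}\sum_{n=1}^{N}\mathbb{E}[b_n] \ge \epsilon\right]
\le \exp\left(-2N\epsilon^{2}\right)
\quad \text{for any } \epsilon > 0,
\]
\[
\text{so that, with probability at least } 1-\delta', \qquad
\frac{1}{N}\sum_{n=1}^{N} b_n \le \frac{1}{N}\sum_{n=1}^{N}\mathbb{E}[b_n] + \sqrt{\frac{\ln(1/\delta')}{2N}}.
\]
```

The same bound holds for the opposite tail, and a union bound over several such events adds their failure probabilities.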

Appendix B:  Proof of Theorem 6

For every example (x, y, c), by the same derivation as theorem 2,
formula
Note that
formula
sums to 1. Then, for each example (x, y, c) obtained from , we can randomly choose k according to $P^{(k)}$ and form an unweighted binary example $(X^{(k)}, Y^{(k)})$. The procedure above defines a probability distribution . Integrating over all (x, y, c), we get
formula
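The procedure of drawing k and forming an unweighted binary example can be sketched as follows. The sketch assumes that the probability of choosing k is proportional to the weight |c[k] − c[k + 1]| of the corresponding extended example, normalized to sum to 1, and that the label is +1 if k < y and −1 otherwise. Both are assumptions made for illustration, since the defining formulas are not reproduced here, and the function name is hypothetical.

```python
import random
from typing import Sequence, Tuple


def sample_unweighted_example(
    x: Sequence[float], y: int, c: Sequence[float], rng: random.Random
) -> Tuple[Tuple[Tuple[float, ...], int], int]:
    """Draw k with probability proportional to |c[k] - c[k+1]| (assumed form)
    and return the unweighted binary example (X^(k), Y^(k))."""
    K = len(c)
    # Weight of threshold k, for k = 1..K-1 (c[k] is stored at index k - 1).
    weights = [abs(c[k - 1] - c[k]) for k in range(1, K)]
    if sum(weights) == 0:
        # A constant cost vector gives no information about any threshold.
        raise ValueError("cost vector is constant; no threshold can be sampled")
    k = rng.choices(range(1, K), weights=weights, k=1)[0]
    label = 1 if k < y else -1
    return (tuple(x), k), label
```

Sampling in proportion to the weights is what allows the weighted extended examples to be replaced by unweighted ones: thresholds with larger weights are simply seen more often.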
When each $k_n$ is chosen independently according to $P_n^{(k)}$, we can generate N independent examples from and S. Then, using a cost bound for SVM in binary classification (Bartlett & Shawe-Taylor, 1998), with probability at least over the choice of ,
formula
Using the same technique as the proof of theorem 5 with and a union bound, with probability >1 − δ,
formula

Acknowledgments

We thank Yaser S. Abu-Mostafa, Amrit Pratap, John Langford, and the anonymous reviewers for valuable discussions and comments. When this project was initiated, L. L. was supported by the Caltech SISL Graduate Fellowship, and H.-T. L. was supported by the Caltech EAS Division Fellowship. The continuing work was supported by the National Science Council of Taiwan via grant NSC 98-2221-E-002-192.

References

Abe, N., Zadrozny, B., & Langford, J. (2004). An iterative method for multi-class cost-sensitive learning. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 3-11). New York: ACM.
Agarwal, S. (2008). Generalization bounds for some ordinal regression algorithms. In Proceedings of the 19th Conference on Algorithmic Learning Theory (pp. 7-21). Berlin: Springer.
Agresti, A. (2002). Categorical data analysis (2nd ed.). Hoboken, NJ: Wiley.
Ailon, N., & Mohri, M. (2008). An efficient reduction of ranking to classification. In Learning Theory: 21st Annual Conference on Learning Theory (pp. 87-98). Berlin: Springer.
Anderson, J. A. (1984). Regression and ordered categorical variables. Journal of the Royal Statistical Society, Series B, 46, 1-30.
Balcan, M.-F., Bansal, N., Beygelzimer, A., Coppersmith, D., Langford, J., & Sorkin, G. B. (2007). Robust reductions from ranking to classification. In Learning Theory: 20th Annual Conference on Learning Theory (pp. 604-619). Berlin: Springer.
Bartlett, P. L., & Shawe-Taylor, J. (1998). Generalization performance of support vector machines and other pattern classifiers. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 43-54). Cambridge, MA: MIT Press.
Beygelzimer, A., Dani, V., Hayes, T., Langford, J., & Zadrozny, B. (2005). Error limiting reductions between classification tasks. In Machine Learning: Proceedings of the 22nd International Conference (pp. 49-56). Madison, WI: Omnipress.
Beygelzimer, A., Langford, J., & Ravikumar, P. (2007). Multiclass classification with filter trees. Available online at http://hunch.net/~jl.
Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., & Li, H. (2007). Learning to rank: From pairwise approach to listwise approach. In Machine Learning: Proceedings of the 24th International Conference (pp. 129-136). Madison, WI: Omnipress.
Cardoso, J. S., & da Costa, J. F. P. (2007). Learning to classify ordinal data: The data replication method. Journal of Machine Learning Research, 8, 1393-1429.
Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: A library for support vector machines. National Taiwan University. Available online at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Chang, K.-Y., Chen, C.-S., & Hung, Y.-P. (2011). Ordinal hyperplanes ranker with cost sensitivities for age estimation. In Proceedings of the 24th IEEE Conference on Computer Vision and Pattern Recognition (pp. 585-592). Piscataway, NJ: IEEE.
Chu, W., & Ghahramani, Z. (2005). Gaussian processes for ordinal regression. Journal of Machine Learning Research, 6, 1019-1041.
Chu, W., & Keerthi, S. S. (2007). Support vector ordinal regression. Neural Computation, 19, 792-815.
Cossock, D., & Zhang, T. (2008). Statistical analysis of Bayes optimal subset ranking. IEEE Transactions on Information Theory, 54, 5140-5154.
Crammer, K., & Singer, Y. (2005). Online ranking by projecting. Neural Computation, 17, 145-175.
Dembczyński, K., Kotłowski, W., & Słowiński, R. (2008). Ordinal classification with decision rules. In Proceedings of the 3rd International Workshop on Mining Complex Data (pp. 169-181). Berlin: Springer.
Figueira, J., Greco, S., & Ehrgott, M. (Eds.). (2005). Multiple criteria decision analysis: State of the art surveys. Berlin: Springer-Verlag.
Frank, E., & Hall, M. (2001). A simple approach to ordinal classification. In Machine Learning: Proceedings of the 12th European Conference on Machine Learning (pp. 145-156).
Freund, Y., & Schapire, R. E. (1999). Large margin classification using the perceptron algorithm. Machine Learning, 37(3), 277-296.
Greco, S., Słowiński, R., & Matarazzo, B. (2000). Extension of the rough set approach to multicriteria decision support. European Journal of Operational Research, 38, 161-195.
Har-Peled, S., Roth, D., & Zimak, D. (2003). Constraint classification: A new approach to multiclass classification and ranking. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 365-379). Cambridge, MA: MIT Press.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference, and prediction. Berlin: Springer-Verlag.
Herbrich, R., Graepel, T., & Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. In A. J. Smola, P. Bartlett, B. Schölkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 115-132).
Hettich, S., Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. Available online at http://www.ics.uci.edu/~mlearn/MLRepository.html.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301), 13-30.
Holte, R. C. (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11(1), 63-91.
Hsu, C.-W., Chang, C.-C., & Lin, C.-J. (2003). A practical guide to support vector classification (Tech. Rep.). Taipei: National Taiwan University.
Joachims, T. (2006). Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 217-226). New York: ACM.
Kearns, M. J., & Vazirani, U. V. (1994). An introduction to computational learning theory. Cambridge, MA: MIT Press.
Kotłowski, W., & Słowiński, R. (2009). Rule learning with monotonicity constraints. In Machine Learning: Proceedings of the 26th International Conference (pp. 537-544). Berlin: Springer.
Li, L., & Lin, H.-T. (2007a). Optimizing 0/1 loss for perceptrons by random coordinate descent. In Proceedings of the 2007 International Joint Conference on Neural Networks (pp. 749-754). Piscataway, NJ: IEEE.
Li, L., & Lin, H.-T. (2007b). Ordinal regression by extended binary classification. In B. Schölkopf, J. Platt, & T. Hoffman (Eds.), Advances in neural information processing systems, 19 (pp. 865-872). Cambridge, MA: MIT Press.
Li, P., Burges, C., & Wu, Q. (2008). McRank: Learning to rank using multiple classification and gradient boosting. In J. C. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.), Advances in neural information processing systems, 20 (pp. 897-904).
Lin, H.-T. (2008). From ordinal ranking to binary classification. Unpublished doctoral dissertation, California Institute of Technology.
Lin, H.-T., & Li, L. (2006). Large-margin thresholded ensembles for ordinal regression: Theory and practice. In Proceedings of the 17th Conference on Algorithmic Learning Theory (pp. 319-333). Berlin: Springer.
Lin, H.-T., & Li, L. (2008). Support vector machinery for infinite ensemble learning. Journal of Machine Learning Research, 9, 285-312.
Lin, H.-T., & Li, L. (2009). Combining ordinal preferences by boosting. In Preference Learning Workshop at ECML/PKDD. Berlin: Springer.
Liu, T.-Y. (2009). Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3), 225-331.
McCullagh, P. (1980). Regression models for ordinal data. Journal of the Royal Statistical Society, Series B, 42, 109-142.
Meir, R., & Rätsch, G. (2003). An introduction to boosting and leveraging. In O. Bousquet, U. von Luxburg, & G. Rätsch (Eds.), Advanced lectures on machine learning (pp. 119-184). Berlin: Springer.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106.
Rajaram, S., Garg, A., Zhou, X. S., & Huang, T. S. (2003). Classification approach towards ranking and sorting problems. In Machine Learning: Proceedings of the 14th European Conference on Machine Learning (pp. 301-312). Berlin: Springer.
Rennie, J. D. M., & Srebro, N. (2005). Loss functions for preference levels: Regression with discrete ordered labels. In Proceedings of the IJCAI Multidisciplinary Workshop on Advances in Preference Handling (pp. 180-186). Norwell, MA: Kluwer.
Rosenblatt, F. (1962). Principles of neurodynamics: Perceptrons and the theory of brain mechanisms. N.p.: Spartan Books.
Schapire, R. E., Freund, Y., Bartlett, P. L., & Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5), 1651-1686.
Schapire, R. E., & Singer, Y. (1999). Improved boosting algorithms: Using confidence-rated predictions. Machine Learning, 37(3), 297-336.
Schölkopf, B., & Smola, A. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Shashua, A., & Levin, A. (2003). Ranking with large margin principle: Two approaches. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 937-944). Cambridge, MA: MIT Press.
Sill, J. (1998). Monotonic networks. In M. Jordan, M. Kearns, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 661-667). Cambridge, MA: MIT Press.
Słowiński, R., Greco, S., & Matarazzo, B. (2007). Dominance-based rough set approach to reasoning about ordinal data. In Proceedings of the International Conference on Rough Sets and Intelligent Systems Paradigms (pp. 5-11). Berlin: Springer.
Tsai, M.-F., Chen, S.-T., Chen, Y.-N., Ferng, C.-S., Wang, C.-H., Wen, T.-Y., et al. (2010). An ensemble ranking solution to the Yahoo! learning to rank challenge (Tech. Rep.). Taipei: National Taiwan University.
Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134-1142.
Vapnik, V. N. (1995). The nature of statistical learning theory. Berlin: Springer-Verlag.
Xia, F., Zhou, L., Yang, Y., & Zhang, W. (2007). Ordinal regression as multiclass classification. International Journal of Intelligent Control and Systems, 12(3), 230-236.
Zadrozny, B., Langford, J., & Abe, N. (2003). Cost sensitive learning by cost-proportionate example weighting. In Proceedings of the 3rd IEEE International Conference on Data Mining (pp. 435-442). Piscataway, NJ: IEEE.

Notes

1

When connecting the points (k, c[k]) from a convex cost vector c by line segments, it is not difficult to prove that the resulting curve is convex for k ∈ [1, K].

2

[[·]] is 1 if the inner condition is true, and 0 otherwise.

3

Although equation 4.1 can be flexibly applied even when g is not rank monotonic, a rank-monotonic g is usually desired in order to introduce a good ranker .

4

To precisely replicate the PRank algorithm, the (K − 1) binary examples sprouted from the same ordinal example should be considered together when updating the perceptron weight vector.

5

is negation complete if and only if , where (−h)(x) = −(h(x)) for all x.

6

C4.5 can directly take the extended input vector (x, k) without encoding. We choose to still encode (x, k) by the matrix $M = \gamma \cdot I_{K-1}$ to make a simple and fair comparison with the other two algorithms that need the encoding.

7

Note that the results from Chu and Keerthi (2007) include the standard deviation; here we compute the standard error instead.

8

We do not have the results of SVOR-EXC with the perceptron kernel because it is difficult to use LIBSVM to implement and compare SVOR-EXC fairly with RED-SVM.

9

Although McRank is designed from the classification cost, a closer inspection from the reduction perspective reveals that the algorithm can be interpreted better by the absolute cost.