## Abstract

Linear rankSVM is one of the widely used methods for learning to rank. Although its performance may be inferior to nonlinear methods such as kernel rankSVM and gradient boosting decision trees, linear rankSVM is useful for quickly producing a baseline model. Furthermore, following its recent development for classification, linear rankSVM may give competitive performance for large and sparse data. A great deal of work has studied linear rankSVM, with a focus on computational efficiency when the number of preference pairs is large. In this letter, we systematically study existing works, discuss their advantages and disadvantages, and propose an efficient algorithm. We discuss different implementation issues and extensions with detailed experiments. Finally, we develop a robust linear rankSVM tool for public use.

## 1. Introduction

Learning to rank is an important supervised learning technique because of its
application to search engines and online advertisement. According to Chapelle and
Chang (2011) and others, state-of-the-art
learning to rank models can be categorized into three types. Pointwise methods
(e.g., decision tree models and linear regression) directly learn the relevance
score of each instance; pairwise methods like rankSVM (Herbrich, Graepel, &
Obermayer, 2000) learn to classify preference
pairs; and listwise methods such as LambdaMART (Burges, 2010) try to optimize the measurement for evaluating the
whole ranking list. Some methods lie between two categories; for example, GBRank
(Zheng, Chen, Sun, & Zha, 2007) combines
pointwise decision tree models and pairwise loss. Among these, rankSVM, as a pairwise
approach, is one commonly used method. It is extended from the standard support
vector machine (SVM; Boser, Guyon, & Vapnik, 1992; Cortes & Vapnik, 1995). In the SVM literature, it is well known that linear (i.e., data are
not mapped to a different space) and kernel SVMs are suitable for different
scenarios, where linear SVM is more efficient, but the more costly kernel SVM may
give higher accuracy.^{1} The same situation may occur for rankSVM. In this letter, we study
large-scale linear rankSVM.

$K$ is the set of possible relevance levels with $|K| = k$, and $Q$ is the set of queries. By defining the set of preference pairs as

$$P \equiv \{(i, j) \mid q_i = q_j,\ y_i > y_j\}, \tag{1.1}$$

L1-loss linear rankSVM minimizes the sum of training losses and a regularization term:

$$\min_{w}\ \frac{1}{2} w^T w + C \sum_{(i,j) \in P} \max\bigl(0,\ 1 - w^T(x_i - x_j)\bigr), \tag{1.2}$$

where $C > 0$ is a regularization parameter. If L2 loss is used, the optimization problem becomes

$$\min_{w}\ \frac{1}{2} w^T w + C \sum_{(i,j) \in P} \max\bigl(0,\ 1 - w^T(x_i - x_j)\bigr)^2. \tag{1.3}$$

In prediction, for any test instance $x$, a larger $w^T x$ implies that $x$ should be ranked higher. In Table 1, we list the notation used in this letter.

| Notation | Explanation |
|---|---|
| $w$ | The weight vector obtained by solving problem 1.2 or 1.3 |
| $x_i$ | The feature vector of the $i$th training instance |
| $y_i$ | Label of the $i$th training instance |
| $q_i$ | Query of the $i$th training instance |
| $K$ | The set of relevance levels |
| $Q$ | The set of queries |
| $P$ | The set of preference pairs; see equation 1.1 |
| $l$ | Number of training instances |
| $k$ | Number of relevance levels |
| $p$ | Number of preference pairs |
| $n$ | Number of features |
| $\bar{n}$ | Average number of nonzero features per instance |
| $l_q$ | Number of training instances in a given query |
| $k_q$ | Number of relevance levels in a given query |
| $T$ | An order-statistic tree |


If, on average, $l/k$ instances are with the same relevance level, the number of pairs in $P$ is

$$\frac{k(k-1)}{2}\left(\frac{l}{k}\right)^2 = \frac{(k-1)\,l^2}{2k}.$$

The large number of pairs becomes the main difficulty in training rankSVM. Many existing studies have attempted to address this difficulty. By taking the special structure of the pairs into account, it is possible to avoid the $O(l^2)$ complexity of going through all pairs in calculating the objective function, gradient, or other information needed in the optimization procedure. Interestingly, although existing works apply different optimization methods, their approaches to avoid considering the $O(l^2)$ pairs are very related. Next, we briefly review some recent results.

Joachims (2006) solves problem 1.2 by a cutting plane method in which a method costing

$$O(l\bar{n} + l\log l + lk + n) \tag{1.6}$$

per iteration is proposed to calculate the objective value and a subgradient of problem 1.2. The $O(l\bar{n})$ term is for calculating $w^Tx_i$ for all instances, where $\bar{n}$ is the average number of nonzero features per training instance; $O(l\log l + lk)$ is for the sum of training losses in problem 1.2; and $O(n)$ is for the regularization term $w^Tw/2$. This method is efficient if $k$ (the number of relevance levels) is small but becomes inefficient when $k = O(l)$. Airola, Pahikkala, and Salakoski (2011) improve on Joachims's work by reducing the complexity to

$$O(l\bar{n} + l\log l + l\log k + n).$$

The main breakthrough is that they employ order-statistic trees, which are extended from balanced binary search trees, so the $O(lk)$ term in equation 1.6 is reduced to $O(l\log k)$.

Another type of optimization method considered is the truncated Newton method for solving problem 1.3, in which the main computation is on Hessian-vector products. Chapelle and Keerthi (2010) showed that if $k = 2$ (i.e., only two relevance levels), the cost of each function, gradient, or Hessian-vector product evaluation is $O(l\bar{n} + l\log l + n)$. Their method is related to that of Joachims (2006) because we can see that the $O(lk)$ term in equation 1.6 can be removed if $k = 2$. For $k > 2$, they decompose the problem into $k - 1$ subproblems, each with only two relevance levels. Therefore, similar to Joachims's approach, Chapelle and Keerthi's approach may not be efficient for larger $k$. Regarding optimization methods, an advantage of Newton methods is the faster convergence. However, they require the differentiability of the objective function, so L2 loss must be used. In contrast, cutting plane methods are applicable to both L1 and L2 losses.

Although linear rankSVM is an established method, it is known that gradient boosting decision trees (GBDT) by Friedman (2001) and its variant, LambdaMART (Burges, 2010), give competitive performance on web search ranking data. In addition, random forests (Breiman, 2001) are also reported in Mohan, Chen, and Weinberger (2011) to perform well. In fact, all the winning teams of the Yahoo Learning to Rank Challenge (Chapelle & Chang, 2011) used decision-tree-based ensemble models. Note that GBDT and random forests are nonlinear pointwise methods, and LambdaMART is a nonlinear listwise method. Their drawback is the longer training time. We will conduct experiments to compare the performance and training time between linear and nonlinear ranking methods.

In this letter, we consider Newton methods for solving problem 1.3 and present the following results:

- We give a clear overview of past work on the efficient calculation over all preference pairs and the connections among these works.

- We investigate several order-statistic tree implementations and show their advantages and disadvantages.

- We provide an efficient implementation that is faster than existing works for linear rankSVM.

- We compare linear rankSVM with linear and nonlinear pointwise methods, including GBDT and random forests, in detail.

- We release a public tool for linear rankSVM.

This letter is organized as follows. Section 2 introduces methods for the efficient calculation of all relevance pairs. Section 3 discusses past studies of linear rankSVM
and compares them with our method. Various types of experiments are shown in section 4. In section 5 we discuss another possible algorithm for rankSVMs when *k* is large. Section 6 concludes. A supplementary file including additional analysis and experiments is
available online at http://www.mitpressjournals.org/doi/suppl/10.1162/NECO_a_00571.

## 2. Efficient Calculation over Relevance Pairs

A difficulty in training rankSVM is that the number of pairs in the loss term can be
as large as *O*(*l*^{2}). This difficulty occurs in any optimization method that needs to
calculate the objective value. Furthermore, other values used in optimization
procedures, such as subgradient, gradient, or Hessian-vector products, face the same
difficulty. In this section, we consider truncated Newton methods as an example and
investigate efficient methods for the calculation over pairs.

### 2.1. Information Needed in Optimization Procedures and an Example Using Truncated Newton Methods.

Many optimization methods employ gradient or even higher-order information at
each iteration of an iterative procedure. From problems 1.2 and 1.3, it is clear that the summation over the *p* pairs remains in the gradient and the Hessian. Therefore,
the difficulty of handling *O*(*l*^{2}) pairs occurs beyond the calculations of the objective value. Here
we consider the truncated Newton method as an example to see what kind of
information it requires.

Let

$$g_t(s) \equiv \nabla f(w^t)^T s + \frac{1}{2}\,s^T \nabla^2 f(w^t)\, s \tag{2.1}$$

be the second-order Taylor approximation of $f(w^t + s) - f(w^t)$. A truncated Newton method obtains a direction $s^t$ by approximately minimizing $g_t(s)$, usually by conjugate gradient (CG) iterations, whose main computation is a sequence of Hessian-vector products. Thus, from $w^t$ to $w^{t+1}$ there are inner CG iterations. Algorithm 1 gives the framework of truncated Newton methods.

The discussion shows that in a truncated Newton method, the calculations of
objective value, gradient, and Hessian-vector product all face the difficulty of
handling *p* pairs. In the rest of this section, we discuss the
efficient calculation of these values.
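As a concrete sketch of this framework (our own illustration, not the implementation developed later in this letter), the following minimal Newton-CG routine uses only gradient and Hessian-vector product callbacks, so the Hessian is never formed explicitly:

```python
import numpy as np

def truncated_newton(grad, hess_vec, w0, max_outer=50, cg_tol=0.1, tol=1e-6):
    """Minimize a convex f given grad(w) and hess_vec(w, v);
    the inner CG loop solves H s = -g only approximately."""
    w = w0.astype(float).copy()
    for _ in range(max_outer):
        g = grad(w)
        gnorm = np.linalg.norm(g)
        if gnorm < tol:
            break
        s = np.zeros_like(w)
        r = -g.copy()              # residual of H s = -g
        d = r.copy()
        rr = r @ r
        # Truncated CG: stop once the residual is small relative to ||g||.
        while np.sqrt(rr) > cg_tol * gnorm:
            Hd = hess_vec(w, d)
            alpha = rr / (d @ Hd)
            s += alpha * d
            r -= alpha * Hd
            rr, rr_old = r @ r, rr
            d = r + (rr / rr_old) * d
        # A full implementation guards this step by a line search or a
        # trust region, as in section 4.1; we take the raw step here.
        w += s
    return w
```

On a strictly convex quadratic, each outer iteration reduces the gradient norm by at least the factor `cg_tol`, so the routine converges in a few outer iterations.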

### 2.2. Efficient Function and Gradient Evaluation and Matrix-Vector Products.

In the rest of this letter, $f(w)$ represents the objective function of problem 1.3. We define $A$ to be a $p$ by $l$ matrix in which each row corresponds to a preference pair: if $(i, j) \in P$, then the corresponding row in $A$ has the $i$th entry 1, the $j$th entry $-1$, and all other entries zero. By this definition, the objective function in problem 1.3 can be written as

$$f(w) = \frac{1}{2}w^Tw + C\,(e - AXw)^T D_w (e - AXw),$$

where $e$ is a vector of ones, $X \equiv [x_1, \ldots, x_l]^T$, and $D_w$ is a $p$ by $p$ diagonal matrix with

$$(D_w)_{(i,j),(i,j)} = \begin{cases} 1 & \text{if } 1 - w^T(x_i - x_j) > 0,\\ 0 & \text{otherwise,} \end{cases}$$

for all $(i, j) \in P$. The gradient is

$$\nabla f(w) = w + 2C\,X^TA^TD_w(AXw - e). \tag{2.4}$$

However, $\nabla^2 f(w)$ does not exist because equation 2.4 is not differentiable. Following Mangasarian (2002) and Lin et al. (2008), we define a generalized Hessian matrix

$$\nabla^2 f(w) \equiv I + 2C\,X^TA^TD_wAX,$$

where $I$ is the identity matrix. The product between the generalized Hessian and a vector $v$ is then

$$\nabla^2 f(w)\,v = v + 2C\,X^T\bigl(A^T\bigl(D_w\bigl(A(Xv)\bigr)\bigr)\bigr). \tag{2.6}$$

Because $A$ and $D_w$ have $O(p)$ nonzero elements, the complexity of calculating equation 2.6 is $O(p + l\bar{n} + n)$. The right-to-left matrix-vector products in equation 2.6 are faster than if we obtain and store the matrix $X^TA^TD_wAX$.

If $p = O(l^2)$, not only is the cost of equation 2.6 still high, but also the storage of $A$ and $D_w$ requires a huge amount of memory. To derive a faster method, it is essential to explore the structure of the generalized Hessian. We define

$$SV(w) \equiv \{(i, j) \in P \mid 1 - w^T(x_i - x_j) > 0\}.$$

We show in appendix A that when problem 1.3 is treated as an SVM classification problem with feature vectors $x_i - x_j$ and labels being 1 for all $(i, j) \in P$, the set $SV(w)$ corresponds to the support vectors.

We can remove $D_w$ by defining a new matrix $A_w$ that includes the rows $(i, j)$ of $A$ such that $(i, j) \in SV(w)$. Thus, equation 2.6 becomes

$$\nabla^2 f(w)\,v = v + 2C\,X^T\bigl(A_w^T\bigl(A_w(Xv)\bigr)\bigr). \tag{2.9}$$

Observe that because each row of $A_w$ contains two nonzero elements, $(A_w^TA_w)_{ij} \ne 0$ occurs only under the following situations:

$$i = j,\qquad (i, j) \in SV(w),\qquad \text{or} \qquad (j, i) \in SV(w). \tag{2.10}$$

We define

$$SV_i^+(w) \equiv \{j \mid (j, i) \in SV(w)\},\qquad SV_i^-(w) \equiv \{j \mid (i, j) \in SV(w)\},$$

$$l_i^+(w) \equiv |SV_i^+(w)|,\qquad l_i^-(w) \equiv |SV_i^-(w)|.$$

Then from equation 2.10,

$$\bigl(A_w^TA_wXv\bigr)_i = \bigl(l_i^+(w) + l_i^-(w)\bigr)\,x_i^Tv - \Bigl(\sum_{j \in SV_i^+(w)} x_j^Tv + \sum_{j \in SV_i^-(w)} x_j^Tv\Bigr). \tag{2.11}$$

Therefore, if we already have the values of $l_i^+(w)$, $l_i^-(w)$, $\sum_{j \in SV_i^+(w)} x_j^Tv$, and $\sum_{j \in SV_i^-(w)} x_j^Tv$, the computation of the Hessian-vector product in equation 2.9 would just cost $O(l\bar{n} + n)$, where $O(l\bar{n})$ is for computing equation 2.11 and $O(n)$ is for the vector addition in equation 2.9.

For function and gradient evaluation, $A_w^TA_wXw$ can be calculated by equation 2.11 with $v = w$, so from equation 2.4,

$$\nabla f(w) = w + 2C\,X^T\bigl(A_w^TA_wXw - A_w^Te_w\bigr), \tag{2.12}$$

where $e_w$ is the vector of ones with $|SV(w)|$ components. We also have

$$\bigl(A_w^Te_w\bigr)_i = l_i^-(w) - l_i^+(w). \tag{2.13}$$

Thus, the computation of both equations 2.12 and 2.13 costs $O(l\bar{n} + n)$ as well.

Note that $l_i^+(w)$ and $l_i^-(w)$ are also needed for efficient function and subgradient evaluation (see more details in section 3.2). Therefore, regardless of the optimization method used, an important common task is to efficiently calculate $l_i^\pm(w)$ and $\sum_{j \in SV_i^\pm(w)} x_j^Tv$. In the supplementary materials, we discuss in detail a direct method that costs $O(l + k)$ space excluding the training data and

$$O(l\bar{n} + l\log l + lk + n) \tag{2.14}$$

time for one matrix-vector product. Although the cost is lower than that by equation 2.6, the $O(lk)$ complexity is still high if $k$ is large. Subsequently, we discuss methods to reduce the $O(lk)$ term to $O(l\log k)$.

### 2.3. Efficient Calculation by Storing Values in an Order-Statistic Tree.

Airola et al. (2011) calculate $l_i^+(w)$ and $l_i^-(w)$ by an order-statistic tree, so the $O(lk)$ term in equation 2.14 is reduced to $O(l\log k)$. The optimization method used is a cutting plane method (Teo, Vishwanathan, Smola, & Le, 2010), which calculates function and subgradient values. Our procedure here is an extension because in Newton methods, we further need Hessian-vector products. Notice that Airola et al. (2011) considered problem 1.2; we solve problem 1.3 and require the computation of $\sum_{j \in SV_i^+(w)} x_j^Tv$ and $\sum_{j \in SV_i^-(w)} x_j^Tv$ in addition.

For an easy description, we consider instances of a single query. To compute $l_i^+(w)$, we must count the cardinality of the following set:

$$SV_i^+(w) = \{j \mid y_j > y_i,\ w^Tx_j < w^Tx_i + 1\}.$$

The main difficulty is that both the order of $y_j$ and the order of $w^Tx_j$ are involved. We can first sort $w^Tx_i$ in ascending order. For an easy description, we assume that

$$w^Tx_1 \le w^Tx_2 \le \cdots \le w^Tx_l.$$

We observe that if elements in

$$\{j \mid w^Tx_j < w^Tx_i + 1\} \tag{2.16}$$

have been properly arranged in an order-statistic tree $T$ by the value of $y_j$, then $l_i^+(w)$ can be obtained in $O(\log k)$ time. Consider the example with $i = 1$ shown in Figure 1a. We construct a tree so that each node corresponds to a relevance level $y$ appearing in the set 2.16 and stores

$$\text{size}_y \equiv |\{j \in T \mid y_j = y\}| \quad\text{and}\quad \text{xv}_y \equiv \sum_{j \in T:\ y_j = y} x_j^Tv,$$

where nodes are arranged according to the keys (i.e., $y_j$ values). For each node, we ensure that its right child has a larger key than its left child and the node itself. By $j \in T$, we mean that the instance $j$ has been inserted into the tree $T$.

To compute $l_i^+(w)$, we traverse from the root of $T$ toward the node with key $y_i$ and accumulate the sizes of all subtrees whose keys are larger than $y_i$. Therefore, once a tree for the set 2.16 has been constructed, we can define the following recursive function on a node $d$:

$$\text{larger}(d, y) = \begin{cases} 0 & \text{if } d \text{ is empty},\\ \text{larger}(d.\text{right},\, y) & \text{if } y > d.\text{key},\\ \text{size}(d.\text{right}) & \text{if } y = d.\text{key},\\ \text{size}(d.\text{right}) + \text{size}_{d.\text{key}} + \text{larger}(d.\text{left},\, y) & \text{if } y < d.\text{key}, \end{cases} \tag{2.20}$$

where $\text{size}(\cdot)$ denotes the total size of a subtree. In the last case of equation 2.20, every node $t$ in the right subtree and the node $d$ itself have keys larger than $y$, so their sizes are all added, while the left subtree may still contain keys larger than $y$ and the recursion continues there. In the second case, all keys larger than $y$ reside in the right subtree, so nodes on the left are not considered. Using equation 2.20, we have

$$l_i^+(w) = \text{larger}(T.\text{root},\, y_i).$$

An example of traversing the tree to find $l_1^+(w)$ is in Figure 1b.

After $l_i^+(w)$ has been calculated, we insert the following instances into the tree:

$$\{j \mid w^Tx_j < w^Tx_{i+1} + 1\} \setminus \{j \mid w^Tx_j < w^Tx_i + 1\}.$$

Then $l_{i+1}^+(w)$ can be calculated in the same way. The calculation for $l_i^-(w)$ is similar, except that we start from $i = l$ and maintain a tree of the following set:

$$\{j \mid w^Tx_j > w^Tx_i - 1\}.$$

We then define a function similar to $\text{larger}$ to obtain $l_i^-(w)$. The sums $\sum_{j \in SV_i^\pm(w)} x_j^Tv$ are obtained in the same traversals by accumulating $\text{xv}$ instead of $\text{size}$.

Each insertion or query costs $O(\log k)$ (see more discussion in section 2.5). If $w^Tx_i$ have been sorted before the CG iterations, each matrix-vector product involves $O(l\bar{n} + l\log k + n)$ operations, which is smaller than equation 2.14 because the $lk$ term is reduced to $l\log k$. Therefore, the cost of truncated Newton methods using order-statistic trees is

$$O\Bigl(l\log l + \bigl(l\bar{n} + l\log k + n\bigr) \times \bigl(\text{number of CG iterations}\bigr)\Bigr) \tag{2.23}$$

per outer iteration, where the $O(l\log l)$ term is the cost of sorting.

Our algorithm constructs a tree for each matrix-vector product (or each CG iteration) because of the change of the vector $v$ in equation 2.11. Thus, an outer iteration of the truncated Newton method requires constructing several trees. If we store $\sum_j x_j$ instead of $\sum_j x_j^Tv$ at each node, only one tree independent of $v$ is needed at an outer iteration. However, because a vector is stored at each node, each update requires $O(\bar{n})$ cost. The total cost of maintaining the tree is $O(l\bar{n}\log k)$ because each insertion requires $O(\log k)$ updates. This is bigger than the $O(l\log k)$ cost of maintaining a tree that stores $\sum_j x_j^Tv$. Further, we need $O(ln)$ space to store the vectors.^{2} Because the number of matrix-vector products is often not large, storing $\sum_j x_j^Tv$ is more suitable.

Instead of sorting $w^Tx_i$ and using $y_i$ as the keys, we may alternatively sort $y_i$ such that $y_1 \le y_2 \le \cdots \le y_l$ and, for each $i$, maintain a tree of the following set:

$$\{j \mid y_j > y_i\},$$

with $w^Tx_j$ as the keys. Then we can apply the same approach as above. An advantage of this approach is that the $y_i$ are fixed and need to be sorted only once in the training procedure. However, the values $w^Tx_i$ become the keys of nodes and are in general all different, so the tree will eventually contain $O(l)$ rather than $O(k)$ nodes. Therefore, this approach is less preferred because maintaining a smaller tree is essential.

### 2.4. A Different Implementation by Storing Keys in Leaves of a Tree.

In this method, the $k$ leaf nodes from left to right correspond to the ascending order of relevance levels. At a leaf node, we record the size and xv of a relevance level. For each internal node, which is the root of a subtree, its size and xv are both the sum of that attribute of its children. For the same example considered in section 2.3, the tree at $i = 1$ is shown in Figure 2. To compute $l_i^+(w)$, we know that it is the total size of the leaves whose relevance levels are larger than $y_i$. Therefore, starting from the leaf of $y_i$ and walking up to the root, whenever a node $s$ on the path is a left child, we add in the size of its right sibling; summing these values gives $l_i^+(w)$. An illustration of finding $l_1^+(w)$ is in Figure 3. The procedures for obtaining $l_i^-(w)$, $\sum_{j \in SV_i^+(w)} x_j^Tv$, and $\sum_{j \in SV_i^-(w)} x_j^Tv$ are all similar.
*w*### 2.5. A Discussion on Tree Implementations.

For the method in section 2.3, where each node has a key, we can consider balanced binary search trees such as the AVL tree, red-black tree, and AA tree. AVL trees use more complicated insertion operations to remain more strictly balanced. Consequently, insertion is slower, but the order-statistic computation is usually faster compared with other order-statistic trees. In the comparison by Heger (2004), an AA tree tends to be more balanced and faster than a red-black tree. However, previous studies also consider node deletions, which are not needed here, so we conduct an experiment in section 4.3.

For the method in section 2.4 to store keys in leaves, the selection tree is a suitable data structure. Note that selection trees were mainly used for sorting, but using one as a balanced binary search tree is a straightforward adaptation. An implementation method introduced in Knuth (1973) is to map the $k$ possible $y_i$ values to the leaf indices $k, \ldots, 2k - 1$ and let the indices of the internal nodes be $1, \ldots, k - 1$. Then for any node $m$, its parent is the node $\lfloor m/2 \rfloor$. Moreover, if $m$ is an odd number, then it is a right child, and vice versa. By this method, we do not need to use pointers for constructing the tree, and thus the implementation is very simple. Another advantage is that this tree is fully balanced, so each leaf is of the same depth.
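The pointer-free layout above can be sketched in a few lines. The following is our own minimal illustration (we pad $k$ to the next power of two so every leaf has the same depth; the original indexing works for any $k$): leaves hold the per-level size and xv, each internal node $m$ aggregates its children $2m$ and $2m+1$, and a query walks from a leaf to the root, adding the contribution of every right sibling.

```python
class SelectionTree:
    """Array-based selection tree over relevance levels 1..k.

    Leaves K..2K-1 (K = next power of two >= k, padding added for
    simplicity) store per-level count and an accumulated value; internal
    node m aggregates its children 2m and 2m+1, so no pointers are needed.
    """
    def __init__(self, k):
        self.K = 1
        while self.K < k:
            self.K *= 2
        self.size = [0] * (2 * self.K)
        self.xv = [0.0] * (2 * self.K)

    def insert(self, y, val):
        """Insert an instance with relevance level y and value val
        (e.g., val = x_j^T v), updating the O(log k) nodes on its path."""
        m = self.K + y - 1
        while m >= 1:
            self.size[m] += 1
            self.xv[m] += val
            m //= 2

    def greater(self, y):
        """Return (count, sum of vals) over inserted items with level > y."""
        cnt, s = 0, 0.0
        m = self.K + y - 1
        while m > 1:
            if m % 2 == 0:           # left child: right sibling holds larger keys
                cnt += self.size[m + 1]
                s += self.xv[m + 1]
            m //= 2
        return cnt, s
```

In the sliding procedure of section 2.3, one inserts every instance $j$ with $w^Tx_j < w^Tx_i + 1$ and then `greater(y_i)` returns $l_i^+(w)$ together with $\sum_{j \in SV_i^+(w)} x_j^Tv$, both in $O(\log k)$ time.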

## 3. Comparison with Existing Methods

In this section, we introduce recent studies of linear rankSVM that are considered state of the art. Some of them were mentioned in section 2 in comparison with our proposed methods.

### 3.1. **PRSVM** and **PRSVM**+.

PRSVM (Chapelle & Keerthi, 2010) solves problem 1.3 by a truncated Newton method in which each Hessian-vector product directly goes through the preference pairs, so the cost includes an $O(p)$ term, which becomes dominant for large $p$. To reduce the cost, Chapelle and Keerthi (2010) proposed PRSVM+ for solving problem 1.3 by a truncated Newton method. They first consider the case of $k = 2$ (i.e., two relevance levels). The algorithm for calculating $l_i^+(w)$, $l_i^-(w)$, and the corresponding sums is related to Joachims (2005) and is a special case of a direct counting method discussed in the supplementary material. For the general situation, they observe that the loss can be decomposed as

$$\sum_{(i,j) \in P} \max\bigl(0,\ 1 - w^T(x_i - x_j)\bigr)^2 = \sum_{r=1}^{k-1}\ \sum_{(i,j):\ y_j = r,\ y_i > r} \max\bigl(0,\ 1 - w^T(x_i - x_j)\bigr)^2.$$

The inner sum is over a subset of data in two relevance levels ($r$ and $> r$). Then the algorithm for two-level data can be applied. When we replace the $O(lk)$ term in equation 2.14 with the sum of the sizes of the two-level sets, the complexity of each matrix-vector product is

$$O\Bigl(l\bar{n} + l\log l + \sum_{r=1}^{k-1}\bigl|\{i \mid y_i \ge r\}\bigr| + n\Bigr). \tag{3.1}$$

If each relevance level takes about the same amount of $O(l/k)$ data, equation 3.1 becomes

$$O\Bigl(l\bar{n} + l\log l + \frac{(k+1)\,l}{2} + n\Bigr), \tag{3.2}$$

which is larger than the cost of the approach of using order-statistic trees.

### 3.2. **TreeRankSVM**.

Joachims (2006) uses a cutting plane method to optimize problem 1.2. Airola et al. (2011) improve on Joachims's work and release a package, TreeRankSVM.

Problem 1.2 can be written as

$$\min_w\ \frac{1}{2}w^Tw + C\,L(w),$$

where $L(w)$ is the loss term. Let $w^t$ be the solution obtained at the $t$th iteration. The first-order Taylor approximation of $L(w)$ is used to build a cutting plane $a_t^Tw + b_t$ at $w = w^t$, where

$$a_t = \nabla L(w^t) \quad\text{and}\quad b_t = L(w^t) - a_t^Tw^t.$$

If $L(w)$ is nondifferentiable, then a subgradient is used for $a_t$. The cutting plane method maintains all planes $\{(a_s, b_s)\}_{s=1}^{t}$ to form a lower-bound function for $L(w)$ and obtains $w^{t+1}$ by solving

$$\min_w\ \frac{1}{2}w^Tw + C\,\max\Bigl(0,\ \max_{1 \le s \le t}\bigl(a_s^Tw + b_s\bigr)\Bigr). \tag{3.3}$$

For $l_i^+(w)$ and $l_i^-(w)$, Joachims (2006) uses a direct counting method; the complexity at each iteration is shown in equation 1.6. We leave the details to the supplementary materials. As mentioned in section 2.3, the main improvement Airola et al. (2011) made is to use order-statistic trees so that the $O(lk)$ term in calculating $l_i^+(w)$ and $l_i^-(w)$ is reduced to $O(l\log k)$. In particular, red-black trees were adopted in their work. The overall cost per iteration is

$$O(l\bar{n} + l\log l + l\log k + n).$$

### 3.3. **sofia-ml**.

Sculley (2009) proposed sofia-ml to solve problem 1.2. It is a stochastic gradient descent (SGD) method that randomly draws a preference pair from the training set at each iteration and uses a subgradient on this pair to update $w$. This method does not consider the special structure of the loss term. For going through the whole training data, the cost is $O(p\bar{n})$, which is worse than that of the other methods discussed. Therefore, we do not include this method in our experiments.

In contrast, SGD is one of the state-of-the-art methods for linear SVM classification (e.g., Shalev-Shwartz, Singer, & Srebro, 2007), where the number of loss terms is only $O(l)$ rather than $O(p)$. For rankSVM, special SGD methods that consider the structure of the pairwise loss term have not been available.
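For reference, the pair-sampling SGD update described above can be sketched as follows. This is our own minimal version with a fixed learning rate and explicit pair enumeration; sofia-ml itself samples pairs without materializing them and uses a decaying, Pegasos-style step size.

```python
import numpy as np

def sgd_ranksvm(X, y, q, C=1.0, lr=0.01, iters=20000, seed=0):
    """L1-loss rankSVM by pairwise SGD: draw a preference pair (i, j)
    and take a subgradient step on max(0, 1 - w^T(x_i - x_j))."""
    rng = np.random.default_rng(seed)
    # Enumerating all pairs takes O(p) memory; done here only for clarity.
    pairs = [(i, j) for i in range(len(y)) for j in range(len(y))
             if q[i] == q[j] and y[i] > y[j]]
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        i, j = pairs[rng.integers(len(pairs))]
        d = X[i] - X[j]
        g = w / len(pairs)            # regularization spread over the pairs
        if 1.0 - w @ d > 0.0:         # the sampled pair violates the margin
            g = g - C * d
        w -= lr * g
    return w

# Three instances of one query, listed in decreasing relevance.
X = np.array([[3.0], [2.0], [1.0]])
y = np.array([3, 2, 1])
q = np.array([1, 1, 1])
w = sgd_ranksvm(X, y, q)
scores = X @ w
```

On this toy query, the learned $w$ is positive, so ranking by $w^Tx$ reproduces the relevance order; the point of the sketch is that each update touches one pair, which is why a full pass costs $O(p\bar{n})$.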

## 4. Experiments

In this section, we begin by describing the details of a truncated Newton implementation of our approach. The first experiment is to evaluate methods discussed in section 2. In particular, the speed of different implementations of order-statistic trees is examined. Next, we compare state-of-the-art methods for linear rankSVM with the proposed approach. Then we conduct an investigation of the performance difference between linear rankSVM and pointwise methods. Finally, an experiment on sparse data is shown. (Programs used for experiments can be found online at http://www.csie.ntu.edu.tw/~cjlin/liblinear/exp.html.)

### 4.1. Implementation Using a Trust Region Newton Method.

In our implementation of the proposed approach, we consider a trust region Newton method (TRON), that is, a truncated Newton method discussed in section 2.1. For details of trust region methods, a comprehensive book is by Conn, Gould, and Toint (2000). Here we mainly follow the setting in Lin and Moré (1999) and Lin et al. (2008).

At the $t$th iteration, TRON obtains a direction $s^t$ by minimizing $g_t(s)$ in equation 2.1 over a region that we trust:

$$\min_s\ g_t(s) \quad \text{subject to}\quad \|s\| \le \Delta_t, \tag{4.1}$$

where $\Delta_t$ is the size of the trust region. After solving problem 4.1, it decides whether to apply the obtained direction $s^t$ according to the ratio between the real function decrease and the approximate reduction $g_t(s^t)$:

$$\rho_t = \frac{f(w^t + s^t) - f(w^t)}{g_t(s^t)}.$$

The iterate is updated only if $\rho_t$ is large enough:

$$w^{t+1} = \begin{cases} w^t + s^t & \text{if } \rho_t > \eta_0,\\ w^t & \text{otherwise,} \end{cases}$$

where $\eta_0$ is a prespecified parameter. TRON then adjusts the trust region according to $\rho_t$: when $\rho_t$ is too small, $\Delta_t$ is decreased; otherwise, when $\rho_t$ is large enough, $\Delta_t$ is increased. More specifically, $\Delta_{t+1}$ is chosen according to which of several prespecified intervals $\rho_t$ falls into.

For linear classification, Lin et al. (2008) apply the approach of Steihaug (1983) to run CG iterations until either a minimum of $g_t(s)$ is found or $s$ touches the boundary of the trust region. We consider the same setting in our implementation.

The stopping condition of TRON is

$$\|\nabla f(w^t)\| \le \epsilon\,\|\nabla f(w^0)\|, \tag{4.4}$$

where $w^0$ is the initial iterate and $\epsilon$ is the stopping tolerance given by users. By default, we set a small $\epsilon$ and let $w^0$ be the zero vector.

### 4.2. Experiment Setting.

We consider three sources of web search engine ranking data: LETOR (Qin, Liu, Xu, & Li, 2010), MSLR,^{3} and YAHOO (Chapelle & Chang, 2011). Both LETOR and MSLR are from Microsoft Research, while YAHOO is from the Yahoo Learning to Rank Challenge. From LETOR, we take four sets: MQ2007, MQ2008, MQ2007-list, and MQ2008-list. For MSLR, we take the set named 30k, which indicates the number of queries within it.^{4} Each set from LETOR or MSLR consists of five segmentations, and we take the first fold. YAHOO contains two sets, and both are considered. The details of these data sets are listed in Table 2. Each set comes with training, validation, and testing sets; we use the validation set only for selecting the parameters of each model. For preprocessing, we linearly scale each feature of the LETOR and MSLR data sets to the range [0, 1], while the features of the YAHOO data sets are already in this range.

| Data Set | $l$ | $n$ | $k$ | $\lvert Q \rvert$ | $p$ | Average $k_q/l_q$ over Queries |
|---|---|---|---|---|---|---|
| MQ2007 fold 1 | 42,158 | 46 | 3 | 1,017 | 246,015 | 0.0546 |
| MQ2008 fold 1 | 9,630 | 46 | 3 | 471 | 52,325 | 0.1697 |
| MSLR 30k fold 1 | 2,270,296 | 136 | 5 | 18,919 | 101,312,036 | 0.0492 |
| YAHOO set 1 | 473,134 | 519 | 5 | 19,944 | 5,178,545 | 0.2228 |
| YAHOO set 2 | 34,815 | 596 | 5 | 1,266 | 292,951 | 0.1560 |
| MQ2007-list fold 1 | 743,790 | 46 | 1,268 | 1,017 | 285,943,893 | 1 |
| MQ2008-list fold 1 | 540,679 | 46 | 1,831 | 471 | 323,151,792 | 1 |


Notes: All data sets are dense (i.e., $\bar{n} = n$). In the last column, $l_q$ and $k_q$ are the number of instances and the number of relevance levels in query $q$, respectively. See Table 1 for the meaning of other columns.

All experiments are conducted on a 64-bit machine with Intel Xeon 2.5 GHz CPU (E5504), 12 MB cache, and 16 GB memory.

### 4.3. A Comparison Between Methods in Section 2: A Direct Counting Method and Different Order-Statistic Trees.

We solve problem 1.3 using TRON and compare the following methods for calculating $l_i^+(w)$, $l_i^-(w)$, $\sum_{j \in SV_i^+(w)} x_j^Tv$, and $\sum_{j \in SV_i^-(w)} x_j^Tv$:

- direct-count: the direct counting method mentioned at the end of section 2.2 (see details in the supplementary material)

- y-rbtree: the red-black tree using $y_i$ as the key of nodes (see section 2.3)

- wx-rbtree: the red-black tree using $w^Tx_i$ as the key of nodes (see section 2.3)

- selectiontree: the selection tree that stores keys in leaf nodes (see section 2.4)

- y-avltree: the same as y-rbtree, except the order-statistic tree used is the AVL tree (see section 2.5)

- y-aatree: the same as y-rbtree, except the order-statistic tree used is the AA tree (see section 2.5)

We take four data sets and set $C = 1$. The results of training time versus function values are shown in Figure 4. We also draw a horizontal line in the figure to indicate that the default stopping condition of TRON, using the default $\epsilon$ in equation 4.4, has been satisfied. Experiments in section 4.4 will show that solutions obtained below this line have similar ranking performance to the optimum. From the figures, the method of direct counting is slow when $k$ (the number of relevance levels) is large. This result is expected following the complexity analysis in equation 2.14. In addition, although the implementations of order-statistic trees have slightly different running times in the end, they are very similar otherwise. Therefore, we choose selection trees in subsequent experiments because of their simplicity.

### 4.4. A Comparison Between Different Methods for Linear RankSVM.

We compare the following methods for linear rankSVM:

- Tree-TRON: our approach of using TRON with selection trees.

- PRSVM+ (Chapelle & Keerthi, 2010): this method was discussed in section 3.1. The authors did not release their code, so we make an implementation using the same framework as TRON. Therefore, we apply a trust region rather than a line search in their truncated Newton procedure.

- TreeRankSVM (Airola et al., 2011): this method was discussed in section 3.2. We download version 0.1 from http://staff.cs.utu.fi/~aatapa/software/RankSVM/. Although this package is mainly implemented in Python, computationally intensive procedures such as red-black trees and the minimization of equation 3.3 are written in C/C++ or Fortran.

Besides pairwise accuracy, we consider NDCG (normalized discounted cumulative gain), in which $m$ is a prespecified positive integer, $\pi^*$ is an ideal ordering with $y_{\pi^*(1)} \ge y_{\pi^*(2)} \ge \cdots \ge y_{\pi^*(l_q)}$, and $\pi$ is the ordering being evaluated, where $l_q$ is the number of instances in query $q$. Then

$$\text{NDCG@}m \equiv \frac{1}{N_m} \sum_{s=1}^{\min(m,\,l_q)} \frac{2^{y_{\pi(s)}} - 1}{\log_2(1 + s)}, \tag{4.5}$$

where $N_m$ is the score of an ideal ordering (the same sum computed with $\pi^*$); top-ranked instances are considered more important because of the larger weight $1/\log_2(1 + s)$. From equation 4.5, NDCG computes the relative score of the evaluated ordering to the ideal ordering. Regarding $m$, LETOR considers NDCG@10. For MSLR and YAHOO, we follow their recommendation to use mean NDCG:

$$\frac{1}{l_q} \sum_{m=1}^{l_q} \text{NDCG@}m.$$

We then report the average over all queries. Obtaining the orderings requires $O(l\log l)$ time by sorting, but our implementation uses a selection tree.^{5}
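The NDCG criterion can be computed directly from its definition; below is a small self-contained sketch (our own code, not the evaluation scripts of the benchmarks) using the standard $2^y - 1$ gain and $\log_2(1 + s)$ discount:

```python
import math

def ndcg_at_m(y_true, scores, m):
    """NDCG@m for one query: y_true holds relevance levels, scores the
    predicted w^T x values; ties in scores are broken arbitrarily."""
    order = sorted(range(len(y_true)), key=lambda i: -scores[i])
    ideal = sorted(y_true, reverse=True)
    m = min(m, len(y_true))
    # Position s is 0-based here, so the discount log2(1 + rank) = log2(s + 2).
    dcg = sum((2 ** y_true[order[s]] - 1) / math.log2(s + 2) for s in range(m))
    idcg = sum((2 ** ideal[s] - 1) / math.log2(s + 2) for s in range(m))
    return dcg / idcg if idcg > 0 else 1.0

def mean_ndcg(y_true, scores):
    """Mean of NDCG@1, ..., NDCG@l_q for one query."""
    lq = len(y_true)
    return sum(ndcg_at_m(y_true, scores, m) for m in range(1, lq + 1)) / lq
```

A perfect ordering gives exactly 1.0, and any inversion among instances with different relevance levels strictly lowers the value.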

For each evaluation criterion, we find the best regularization parameter by checking the validation set result over a range of $C$ values.^{6} The selected regularization parameter $C$ for each data set and each measurement is listed in Table 3. The results of comparing different approaches can be found in Figures 5 and 6. We present the relative difference to the optimal function value, pairwise accuracy, and (mean) NDCG.^{7} We also draw the horizontal lines of the default stopping condition of TRON in the figures.

| | Problem 1.2 Using L1 Loss | | Problem 1.3 Using L2 Loss | |
|---|---|---|---|---|
| Data Sets | Pairwise Accuracy | NDCG | Pairwise Accuracy | NDCG |
| MQ2007 | 2^{−1} | 2^{8} | 2^{−5} | 2^{−15} |
| MQ2008 | 2^{8} | 2^{−6} | 2^{7} | 2^{7} |
| MSLR 30k | NA | NA | 2^{3} | 2^{3} |
| YAHOO set 1 | NA | NA | 2^{−14} | 2^{1} |
| YAHOO set 2 | 2^{−7} | 2^{−4} | 2^{−10} | 2^{−10} |
| MQ2007-list | 2^{5} | NA | 2^{−12} | NA |
| MQ2008-list | 2^{−14} | NA | 2^{−14} | NA |


One can observe from the figures that the convergence speed of TreeRankSVM is slower than that of PRSVM+ and Tree-TRON. To rule out the implementation differences between Tree-TRON/PRSVM+ and TreeRankSVM, in the supplementary materials we check iterations versus relative function value and test performance.^{8} Results still show that TreeRankSVM is slower, so for linear rankSVM, methods using second-order information seem to be superior. Regarding Tree-TRON and PRSVM+, Figure 5 shows that they are similar when the average $k_q/l_q$ is small. However, from Figure 6, PRSVM+ is much slower if the number of relevance levels is large (i.e., large $k_q/l_q$). This result is expected following the complexity analysis in equations 2.23 and 3.2. Another observation is that the performances of Tree-TRON and PRSVM+ are stable after the default stopping condition is satisfied. Thus, tuning $\epsilon$ in equation 4.4 is generally not necessary.

This experiment also serves as a comparison between L1- and L2-loss linear rankSVM. Results show that their performances (NDCG and pairwise accuracy) are similar.

Instead of using the best *C* after parameter selection, we
investigate the training time under a fixed *C* for all methods.
We check the situations that *C* is large, medium, and small by
using *C*=100, *C*=1, and *C*=10^{−4}, respectively. The results
are shown in Figure 7. We can observe that Tree-TRON is always one of the fastest methods.

### 4.5. A Comparison Between Linear RankSVM, Linear Support Vector Regression, GBDT, and Random Forests.

We compare rankSVM using Tree-TRON with the following pointwise methods:

- •
Linear support vector regression (SVR) by Vapnik (1995). We check both L1-loss and L2-loss linear SVR provided in the package LIBLINEAR (version 1.92). Their implementation details can be found in Ho and Lin (2012). For L2-loss linear SVR, two implementations are available in LIBLINEAR, solving the primal and dual problems, respectively. We use the one that solves the primal problem by TRON.

- •
GBDT (Friedman, 2001). This is a nonlinear pointwise model that is known to be powerful for web search ranking problems. We use version 0.9 of the package pGBRT (Tyree, Weinberger, Agrawal, & Paykin, 2011) downloaded from http://machinelearning.wustl.edu/pmwiki.php/Main/Pgbrt.

- •
Random forests (Breiman, 2001). This is another nonlinear pointwise model that performs well on web search data. We use version 1.0 of Rt-Rank downloaded from https://sites.google.com/site/rtranking.

For linear SVR, we set the ε-insensitive parameter following Ho and Lin (2012), who showed that this setting often works well. We then conduct the
same parameter selection procedure as in section 4.4 to find the best regularization parameters *C* and list them in Table 4. The training
time, test NDCG, and test pairwise accuracy are shown in Table 5. We first observe that the performance of L1-loss
SVR is worse than that of L2-loss SVR. The reason might be that L1 loss imposes a
smaller training loss when the prediction error is larger than one. Regarding
L2-loss SVR and L2-loss rankSVM, their NDCG results are close, but rankSVM gives
better pairwise accuracy. This result seems reasonable because rankSVM
considers pairwise training losses. For training time, although the selected
regularization parameters differ and the results are therefore not fully
comparable, L2-loss SVR is generally faster. In summary, L2-loss SVR is
competitive in terms of NDCG and training time, but rankSVM may still be useful
if pairwise accuracy is our main concern.
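The pairwise accuracy reported in these tables is the fraction of preference pairs (instances of the same query with different relevance labels) that the model scores in the correct order. A minimal sketch with our own naming; the direct O(*l*^{2}) loop and the choice of counting tied scores as half correct are our assumptions, not the paper's implementation:

```python
def pairwise_accuracy(scores, labels, queries):
    """Fraction of preference pairs (same query, y_i > y_j) that the
    predicted scores order correctly; tied scores count as half."""
    correct = total = 0.0
    n = len(scores)
    for i in range(n):
        for j in range(n):
            # only pairs within one query whose labels differ (y_i > y_j)
            if queries[i] != queries[j] or labels[i] <= labels[j]:
                continue
            total += 1
            if scores[i] > scores[j]:
                correct += 1
            elif scores[i] == scores[j]:
                correct += 0.5
    return correct / total if total else 1.0
```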

| Data Set | L1-Loss Linear SVR: Pairwise Accuracy | L1-Loss Linear SVR: NDCG | L2-Loss Linear SVR: Pairwise Accuracy | L2-Loss Linear SVR: NDCG |
|---|---|---|---|---|
|  | 2^{8} | 2^{8} | 2^{−7} | 2^{−11} |
|  | 2^{8} | 2^{8} | 2^{−1} | 2^{−4} |
| 30k | 2^{−1} | 2^{2} | 2^{−2} | 2^{−2} |
| set 1 | 2^{−10} | 2^{−5} | 2^{−5} | 2^{−2} |
| set 2 | 2^{−3} | 2^{1} | 2^{−5} | 2^{4} |
|  | 2^{−9} | NA | 2^{−15} | NA |
|  | 2^{4} | NA | 2^{−7} | NA |


Training time and test NDCG:

| Data Set | L2-Loss RankSVM: Training Time (s) | L2-Loss RankSVM: NDCG | L1-Loss SVR: Training Time (s) | L1-Loss SVR: NDCG | L2-Loss SVR: Training Time (s) | L2-Loss SVR: NDCG |
|---|---|---|---|---|---|---|
|  | 0.5 | 0.5211 | 23.9^{a} | 0.4756^{a} | 0.5 | 0.5157 |
|  | 0.5 | 0.4571 | 3.4^{a} | 0.4153^{a} | 0.2 | 0.4450 |
| 30k | 1601.6 | 0.4945 | 461.6 | 0.4742 | 202.4 | 0.4946 |
| set 1 | 334.8 | 0.7616 | 10.8 | 0.7579 | 172.7 | 0.7642 |
| set 2 | 11.2 | 0.7519 | 47.6 | 0.7470 | 20.8 | 0.7578 |

Training time and test pairwise accuracy:

| Data Set | L2-Loss RankSVM: Training Time (s) | L2-Loss RankSVM: Pairwise Accuracy | L1-Loss SVR: Training Time (s) | L1-Loss SVR: Pairwise Accuracy | L2-Loss SVR: Training Time (s) | L2-Loss SVR: Pairwise Accuracy |
|---|---|---|---|---|---|---|
|  | 1.3 | 70.36% | 23.9^{a} | 64.06%^{a} | 0.7 | 68.56% |
|  | 0.5 | 82.70% | 3.4^{a} | 77.72%^{a} | 0.3 | 82.17% |
| 30k | 1601.6 | 61.52% | 65.4 | 60.11% | 202.4 | 60.49% |
| set 1 | 117.1 | 68.45% | 2.4 | 67.82% | 149.5 | 67.83% |
| set 2 | 11.2 | 69.74% | 3.3 | 68.37% | 14.5 | 69.39% |
|  | 38.7 | 80.71% | 1.0 | 79.82% | 5.0 | 79.70% |
|  | 16.6 | 82.11% | 1.1 | 81.65% | 6.7 | 81.85% |


^{a}Reached maximum iteration of .

Next, we check GBDT and random forests. Their training time is long, so we do not conduct parameter selection. We consider a small number of trees and fix the parameters as follows. For GBDT, we use a learning rate of 0.1, a tree depth of 4, and 100 trees. For random forests, we fix the number of features sampled for splitting at each node and use 40 trees. We further use eight cores to reduce the training time. The results are shown in Table 6. For the smaller data sets and set 2, we are able to train more trees in a reasonable time, so we present in Table 7 the results of using 1,000 trees.

| Data Set | Random Forests: Training Time (s) | Random Forests: Pairwise Accuracy | Random Forests: NDCG | GBDT: Training Time (s) | GBDT: Pairwise Accuracy | GBDT: NDCG |
|---|---|---|---|---|---|---|
|  | 14.8 | 66.16% | 0.4959 | 1.4 | 69.78% | 0.5182 |
|  | 2.3 | 80.36% | 0.4541 | 0.4 | 82.83% | 0.4706 |
| 30k | 5102.1 | 63.76% | 0.5598 | 1339.3 | 62.77% | 0.5375 |
| set 1 | 1672.2 | 70.69% | 0.7797 | 557.7 | 69.22% | 0.7707 |
| set 2 | 58.7 | 68.76% | 0.7629 | 11.3 | 71.21% | 0.7711 |
|  | 606.0 | 78.78% | NA | 106.8 | 79.85% | NA |
|  | 423.3 | 82.04% | NA | 59.3 | 82.43% | NA |


Note: Random forests: 40 trees; GBDT: 100 trees.

| Data Set | Random Forests: Training Time (s) | Random Forests: Pairwise Accuracy | Random Forests: NDCG | GBDT: Training Time (s) | GBDT: Pairwise Accuracy | GBDT: NDCG |
|---|---|---|---|---|---|---|
|  | 345.3 | 69.07% | 0.5221 | 13.7 | 67.58% | 0.4892 |
|  | 52.0 | 82.60% | 0.4675 | 3.6 | 79.78% | 0.4491 |
| set 2 | 1406.9 | 71.91% | 0.7801 | 108.7 | 71.70% | 0.7720 |


From Tables 6 and 7, GBDT and random forests generally perform well, though they are not always better than linear rankSVM. For set 2, random forests achieves 0.78 NDCG using 1,000 trees, which is much better than the 0.75 of linear rankSVM. This result is consistent with the fact that in the Yahoo Learning to Rank Challenge, all top performers used decision-tree-based methods. However, the training cost of GBDT and random forests is in general higher than that of linear rankSVM, so linear rankSVM is useful for quickly providing a baseline result. We also note that the performance of GBDT with more trees is not always better than with fewer trees. This seems to indicate that overfitting occurs and that parameter selection is important. In contrast, random forests is more robust. GBDT trains faster than random forests because random forests grows trees of unlimited depth, while we restrict GBDT trees to a depth of four.

Although pointwise methods perform well in this experiment, a potential problem is that they do not consider the grouping of instances into queries. It is unclear whether this situation may cause problems.

### 4.6. A Comparison Between Linear and Nonlinear Models on Sparse Data.

Recent research has shown that linear SVM is competitive with nonlinear SVM for classifying large and sparse data (Yuan et al., 2012). We conduct an experiment to check whether this property also holds for learning to rank. We consider rankSVM as the linear model for comparison, but for the nonlinear model, we use random forests rather than kernel rankSVM. One reason is that random forests was very robust in the previous experiment. We consider the following two CTR (click-through rate) estimation problems, which can be treated as regression or ranking problems:

- •
. This is a data set used in Ho and Lin (2012).

- •
. This is the processed data generated by the winning team (Wu et al., 2012) of KDD Cup 2012 track 2 (Niu et al., 2012). It contains about one-third of the original data. The task of this competition is online advertisement ranking evaluated by AUC, while the labels are the number of clicks and the number of views. Note that pairwise accuracy reduces to AUC when *k*=2. We transform the labels into CTR (i.e., the number of clicks over the number of views).
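The remark that pairwise accuracy reduces to AUC when *k*=2 can be checked directly: both quantities count the fraction of (positive, negative) pairs in which the positive instance receives the higher score. A small illustration assuming no tied scores; function names are ours:

```python
def auc_ranksum(scores, labels):
    """AUC via the Mann-Whitney rank-sum statistic (assumes no ties)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    ranked = sorted(scores)
    # sum of 1-based ranks of the positive instances
    rank_sum = sum(ranked.index(s) + 1 for s in pos)
    n_pos, n_neg = len(pos), len(neg)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def pairwise_acc_binary(scores, labels):
    """Pairwise accuracy over all (positive, negative) pairs (k = 2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return sum(p > q for p in pos for q in neg) / (len(pos) * len(neg))
```

On any tie-free score vector the two functions agree, which is exactly the k=2 equivalence noted above.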

The two data sets both contain a single query, and each comes with training and
testing sets. To reduce the training time and the memory cost of random forests,
we sample from the two data sets and condense the features. The details are
listed in Table 8. We use the same
parameters of random forests as in section 4.5. For a fair comparison, we fix *C*=1 for
rankSVM because the parameters of random forests are not well tuned. The results
are shown in Table 9. We first notice that the training time of random forests
is several thousand times longer than that of linear rankSVM on sparse data.
The difference is larger than in the dense case because the training cost of
random forests is linear in the number of features, while that of rankSVM is
linear in the average number of nonzero features per instance.
Regarding the performance, the difference is small for the two data sets, so
linear rankSVM is very useful to get competitive results quickly. However, more
experiments are needed to confirm these preliminary observations. We hope more
public sparse ranking data will be available in the near future.

| Data Set | *l* | *n* | avg. nonzeros | *k* | *p* |
|---|---|---|---|---|---|
|  | 11,382,195 | 22,510,600 | 22.6 | 93,899 | 46,191,724,381,879 |
|  | 68,019,906 | 79,901,700 | 35.3 | 6,896 | 198,474,800,029,148 |
|  (0.1%) | 11,382 | 73,581 | 22.6 | 1,087 | 46,020,848 |
|  (0.025%) | 17,005 | 74,026 | 35.3 | 26 | 12,704,393 |


Note: To reduce the training time, only a small subset of each problem is used.

| Data Set | Linear RankSVM: Training Time (s) | Linear RankSVM: Pairwise Accuracy | Linear RankSVM: Mean NDCG | Random Forests: Training Time (s) | Random Forests: Pairwise Accuracy | Random Forests: Mean NDCG |
|---|---|---|---|---|---|---|
|  (0.1%) | 4.3 | 60.83% | 0.4822 | 6343.3 | 60.45% | 0.4732 |
|  (0.025%) | 2.7 | 68.16% | 0.5851 | 5223.1 | 69.72% | 0.5982 |


Note: Random forests uses 40 trees.

## 5. Using Partial Pairs to Train Models

To avoid considering the *O*(*l*^{2}) pairs, a common practice in ranking is to use only a subset of pairs.
An example is Lin (2010), which uses pairs with close relevance levels (i.e., *y _{i}* close to *y _{j}*). The concept is similar to equation 1.5: if pairs with close relevance levels are ranked in the right order, those pairs with larger distances should also be ranked correctly. When *k*=*O*(*l*), this approach can reduce the number of pairs from *O*(*l*^{2}) to be as small as *O*(*k*)=*O*(*l*). However, if *k* is small, each pair is already formed by instances in two close relevance levels, so we cannot significantly reduce the number of pairs.

We take  and  to conduct experiments because these two data sets possess the property . Because in each *q* the values of *y _{i}* are , we use the pairs with *y _{i}*=*y _{j}*+1. This setting of using two adjacent relevance levels leads to *O*(*l*) pairs. Then we can directly consider problems 1.2 and 1.3 as classification problems with instances *x*_{i}−*x*_{j}. If Newton methods are considered for solving problem 1.3, by the approach in equation 2.6, each Hessian-vector product costs only . Therefore, we directly use the implementation for solving L2-loss SVM in  without applying any special method of section 2. After selecting the parameter *C*, we present pairwise accuracy in Table 10. It is observed that the selected *C* of using partial pairs is larger than that of using all pairs. This situation occurs because the sum of training losses in problem 1.3, taken over a smaller number of pairs, must be penalized by a larger *C*. For training time and pairwise accuracy, as expected, the approach of using partial pairs slightly sacrifices performance for faster training speed.
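The partial-pair construction above can be sketched as follows: within each query, keep only pairs whose relevance levels are adjacent (*y _{i}*=*y _{j}*+1) and turn each pair into a classification instance *x*_{i}−*x*_{j}. This is an illustrative sketch with our own naming and plain Python lists; an actual run would feed these difference vectors to any linear SVM solver:

```python
def adjacent_pairs(X, y, query):
    """Build classification instances x_i - x_j for pairs with
    y_i = y_j + 1 within the same query (the partial-pair setting)."""
    # index the instances of each (query, relevance level) once
    by_level = {}
    for idx, (q, rel) in enumerate(zip(query, y)):
        by_level.setdefault((q, rel), []).append(idx)
    diffs = []
    for (q, rel), idxs in by_level.items():
        # pair each instance with those one relevance level below it
        for i in idxs:
            for j in by_level.get((q, rel - 1), []):
                diffs.append([a - b for a, b in zip(X[i], X[j])])
    return diffs
```

When relevance values within a query are distinct (as in the data sets used here), each instance pairs with at most a constant number of neighbors, so only *O*(*l*) difference instances are produced.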

| | *C* | Training Time (s) | Pairwise Accuracy | *C* | Training Time (s) | Pairwise Accuracy |
|---|---|---|---|---|---|---|
| Partial pairs | 2^{−9} | 19.9 | 79.10% | 2^{−10} | 10.5 | 81.81% |
| All pairs | 2^{−12} | 38.7 | 80.71% | 2^{−14} | 16.6 | 82.11% |


Note: the solver for L2-loss SVM in  is used for the partial-pair setting, while  is used for all pairs.

Because of the only slightly lower pairwise accuracy, we may say that this approach, together with past work, is already enough to train large-scale linear rankSVM:

- •
If *k* is small, we can apply the direct method mentioned in section 2.2, which has an *O*(*lk*) term in its calculations.

- •
If *k* is large, we can use only *O*(*l*) pairs. Then any efficient method to train linear SVM can be applied.

However, a caveat is that two different implementations must be used. In contrast,
methods using order-statistic trees can simultaneously handle both small
and large *k*.

## 6. Conclusion

In this letter, we systematically reviewed recent approaches to linear rankSVM. We showed that regardless of the optimization method used, the computational bottleneck is calculating certain values over all preference pairs. Following Airola et al. (2011), we comprehensively investigated tree-based techniques for this calculation. Experiments show that our method is faster than existing implementations of linear rankSVM.

Based on this study, we release an extension of the popular linear classification/regression package for ranking. It is available online at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/.

## Appendix: The Dual Problem of Problems 1.1 and 1.2

In the dual problem, *D* is a diagonal matrix and *Q*=*AX*(*AX*)^{T}. For L1 loss, *U*=*C* and *D* is the zero matrix. For L2 loss,  and *D*=*I*/(2*C*), where *I* is the *p*-by-*p* identity matrix.
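With these quantities, the dual takes the standard SVM form; the following reconstruction is ours, written only to be consistent with the choices of *U* and *D* stated above (here *e* denotes the vector of ones and *p* the number of preference pairs):

```latex
\min_{\alpha} \quad \frac{1}{2}\,\alpha^{T}(Q+D)\,\alpha - e^{T}\alpha
\qquad \text{subject to} \quad 0 \le \alpha_{i} \le U, \; i = 1, \dots, p,
```

with the primal solution recovered as *w*=(*AX*)^{T}α.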

## Acknowledgments

This work was supported in part by the National Science Council of Taiwan (grant 101-2221-E-002-199-MY3). We thank Chia-Hua Ho for discussion and for inspiring the selection tree data structure. We also thank Hsuan-Tien Lin, Yuh-Jye Lee, and the anonymous reviewers for valuable comments.

## References

## Notes

^{1}

See, for example, Yuan, Ho, and Lin (2012) for more detailed discussion.

^{2}

Note that is likely to be dense even if each *x*_{j} is sparse.

^{4}

The number of queries shown in Table 2 is fewer because we report only the training set statistics.

^{5}

Here *l* represents the number of testing data.

^{6}

In the implementation of , the formulation is scaled so the regularization parameter is .

^{7}

For the function value, parameters selected using (validation) pairwise accuracy are considered, but results of using NDCG are similar.

^{8}

For /+, we use CG iterations rather than outer Newton iterations because each CG iteration has a complexity similar to that of a cutting-plane iteration.