## Abstract

Linear rankSVM is one of the widely used methods for learning to rank. Although its performance may be inferior to nonlinear methods such as kernel rankSVM and gradient boosting decision trees, linear rankSVM is useful for quickly producing a baseline model. Furthermore, following its recent development for classification, linear rankSVM may give competitive performance for large and sparse data. A great deal of work has studied linear rankSVM, with a focus on computational efficiency when the number of preference pairs is large. In this letter, we systematically study existing works, discuss their advantages and disadvantages, and propose an efficient algorithm. We discuss different implementation issues and extensions with detailed experiments. Finally, we develop a robust linear rankSVM tool for public use.

## 1. Introduction

Learning to rank is an important supervised learning technique because of its
application to search engines and online advertisement. According to Chapelle and
Chang (2011) and others, state-of-the-art
learning to rank models can be categorized into three types. Pointwise methods
(e.g., decision tree models and linear regression) directly learn the relevance
score of each instance; pairwise methods like rankSVM (Herbrich, Graepel, &
Obermayer, 2000) learn to classify preference
pairs; and listwise methods such as LambdaMART (Burges, 2010) try to optimize the measurement for evaluating the
whole ranking list. Some methods lie between two categories; for example, GBRank
(Zheng, Chen, Sun, & Zha, 2007) combines
pointwise decision tree models and pairwise loss. Among these, rankSVM, as a pairwise
approach, is one commonly used method. It is extended from the standard support
vector machine (SVM; Boser, Guyon, & Vapnik, 1992; Cortes & Vapnik, 1995). In the SVM literature, it is well known that linear (i.e., data are
not mapped to a different space) and kernel SVMs are suitable for different
scenarios, where linear SVM is more efficient, but the more costly kernel SVM may
give higher accuracy.^{1} The same situation may occur for rankSVM. In this letter, we study
large-scale linear rankSVM.

$K$ is the set of possible relevance levels with $|K| = k$, and $Q$ is the set of queries. By defining the set of preference pairs as

$$P \equiv \{(i, j) \mid q_i = q_j,\ y_i > y_j\}, \tag{1.1}$$

L1-loss linear rankSVM minimizes the sum of training losses and a regularization term:

$$\min_{w}\ \frac{1}{2} w^T w + C \sum_{(i,j) \in P} \max\bigl(0,\ 1 - w^T(x_i - x_j)\bigr), \tag{1.2}$$

where $C > 0$ is a regularization parameter. If L2 loss is used, the optimization problem becomes

$$\min_{w}\ \frac{1}{2} w^T w + C \sum_{(i,j) \in P} \max\bigl(0,\ 1 - w^T(x_i - x_j)\bigr)^2. \tag{1.3}$$

In prediction, for any test instance $x$, a larger $w^T x$ implies that $x$ should be ranked higher. In Table 1, we list the notation used in this letter.

| Notation | Explanation |
|---|---|
| $w$ | The weight vector obtained by solving problem 1.2 or 1.3 |
| $x_i$ | The feature vector of the $i$th training instance |
| $y_i$ | Label of the $i$th training instance |
| $q_i$ | Query of the $i$th training instance |
| $K$ | The set of relevance levels |
| $Q$ | The set of queries |
| $P$ | The set of preference pairs; see equation 1.1 |
| $l$ | Number of training instances |
| $k$ | Number of relevance levels |
| $p$ | Number of preference pairs |
| $n$ | Number of features |
| $\bar{n}$ | Average number of nonzero features per instance |
| $l_q$ | Number of training instances in a given query |
| $k_q$ | Number of relevance levels in a given query |
| $T$ | An order-statistic tree |


If, on average, $l/k$ instances are with the same relevance level, the number of pairs in $P$ is

$$\frac{k(k-1)}{2}\left(\frac{l}{k}\right)^2 = \frac{(k-1)\,l^2}{2k}.$$

The large number of pairs becomes the main difficulty in training rankSVM. Many existing studies have attempted to address this difficulty. By taking the special structure of the pairs into account, it is possible to avoid the $O(l^2)$ complexity of going through all pairs in calculating the objective function, gradient, or other information needed in the optimization procedure. Interestingly, although existing works apply different optimization methods, their approaches to avoid considering the $O(l^2)$ pairs are very related. Next, we briefly review some recent results.

Joachims (2006) solves problem 1.2 by a cutting plane method in which a method costing

$$O(l\bar{n} + l\log l + lk + n) \tag{1.6}$$

per iteration is proposed to calculate the objective value and a subgradient of problem 1.2. The $O(l\bar{n})$ term is for calculating $w^Tx_i$ for all instances, where $\bar{n}$ is the average number of nonzero features per training instance; $O(l\log l + lk)$ is for the sum of training losses in problem 1.2; and $O(n)$ is for the regularization term $w^Tw/2$. This method is efficient if $k$ (the number of relevance levels) is small but becomes inefficient when $k = O(l)$. Airola, Pahikkala, and Salakoski (2011) improve on Joachims's work by reducing the complexity to

$$O(l\bar{n} + l\log l + l\log k + n).$$

The main breakthrough is that they employ order-statistic trees, which are extended from balanced binary search trees, so the $O(lk)$ term in equation 1.6 is reduced to $O(l\log k)$.

Another type of optimization method considered is the truncated Newton method for solving problem 1.3, in which the main computation is on Hessian-vector products. Chapelle and Keerthi (2010) showed that if $k = 2$ (i.e., only two relevance levels), the cost of each function, gradient, or Hessian-vector product evaluation is $O(l\bar{n} + l\log l + n)$. Their method is related to that of Joachims (2006) because we can see that the $O(lk)$ term in equation 1.6 can be removed if $k = 2$. For $k > 2$, they decompose the problem into $k - 1$ subproblems, each with only two relevance levels. Therefore, similar to Joachims's approach, Chapelle and Keerthi's approach may not be efficient for larger $k$. Regarding optimization methods, an advantage of Newton methods is the faster convergence. However, they require the differentiability of the objective function, so L2 loss must be used. In contrast, cutting plane methods are applicable to both L1 and L2 losses.

Although linear rankSVM is an established method, it is known that gradient boosting decision trees (GBDT) by Friedman (2001) and its variant, LambdaMART (Burges, 2010), give competitive performance on web search ranking data. In addition, random forests (Breiman, 2001) are also reported in Mohan, Chen, and Weinberger (2011) to perform well. In fact, all the winning teams of the Yahoo Learning to Rank Challenge (Chapelle & Chang, 2011) used decision-tree-based ensemble models. Note that GBDT and random forests are nonlinear pointwise methods, and LambdaMART is a nonlinear listwise method. Their drawback is the longer training time. We will conduct experiments to compare the performance and training time between linear and nonlinear ranking methods.

In this letter, we consider Newton methods for solving problem 1.3 and present the following results:

- We give a clear overview of past work on the efficient calculation over all preference pairs and the connections among these works.

- We investigate several order-statistic tree implementations and show their advantages and disadvantages.

- We provide an efficient implementation that is faster than existing works for linear rankSVM.

- We compare linear rankSVM with linear and nonlinear pointwise methods, including GBDT and random forests, in detail.

- We release a public tool for linear rankSVM.

This letter is organized as follows. Section 2 introduces methods for the efficient calculation of all relevance pairs. Section 3 discusses past studies of linear rankSVM
and compares them with our method. Various types of experiments are shown in section 4. In section 5 we discuss another possible algorithm for rankSVMs when *k* is large. Section 6 concludes. A supplementary file including additional analysis and experiments is
available online at http://www.mitpressjournals.org/doi/suppl/10.1162/NECO_a_00571.

## 2. Efficient Calculation over Relevance Pairs

A difficulty in training rankSVM is that the number of pairs in the loss term can be
as large as *O*(*l*^{2}). This difficulty occurs in any optimization method that needs to
calculate the objective value. Furthermore, other values used in optimization
procedures, such as subgradient, gradient, or Hessian-vector products, face the same
difficulty. In this section, we consider truncated Newton methods as an example and
investigate efficient methods for the calculation over pairs.

### 2.1. Information Needed in Optimization Procedures and an Example Using Truncated Newton Methods.

Many optimization methods employ gradient or even higher-order information at
each iteration of an iterative procedure. From problems 1.2 and 1.3, it is clear that the summation over the *p* pairs remains in the gradient and the Hessian. Therefore,
the difficulty of handling *O*(*l*^{2}) pairs occurs beyond the calculations of the objective value. Here
we consider the truncated Newton method as an example to see what kind of
information it requires.

Let

$$g_t(s) \equiv \nabla f(w^t)^T s + \frac{1}{2}\,s^T \nabla^2 f(w^t)\, s \tag{2.1}$$

be the second-order Taylor approximation of $f(w^t + s) - f(w^t)$. A truncated Newton method obtains a direction $s^t$ by approximately minimizing $g_t(s)$, usually by conjugate gradient (CG) iterations, whose main computation is a sequence of Hessian-vector products. Thus, from $w^t$ to $w^{t+1}$ there are inner CG iterations. Algorithm 1 gives the framework of truncated Newton methods.

The discussion shows that in a truncated Newton method, the calculations of
objective value, gradient, and Hessian-vector product all face the difficulty of
handling *p* pairs. In the rest of this section, we discuss the
efficient calculation of these values.
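As a concrete sketch of this framework (our own illustration, not the implementation developed later in this letter), the following minimal Newton-CG routine uses only gradient and Hessian-vector product callbacks, so the Hessian is never formed explicitly:

```python
import numpy as np

def truncated_newton(grad, hess_vec, w0, max_outer=50, cg_tol=0.1, tol=1e-6):
    """Minimize a convex f given grad(w) and hess_vec(w, v);
    the inner CG loop solves H s = -g only approximately."""
    w = w0.astype(float).copy()
    for _ in range(max_outer):
        g = grad(w)
        gnorm = np.linalg.norm(g)
        if gnorm < tol:
            break
        s = np.zeros_like(w)
        r = -g.copy()              # residual of H s = -g
        d = r.copy()
        rr = r @ r
        # Truncated CG: stop once the residual is small relative to ||g||.
        while np.sqrt(rr) > cg_tol * gnorm:
            Hd = hess_vec(w, d)
            alpha = rr / (d @ Hd)
            s += alpha * d
            r -= alpha * Hd
            rr, rr_old = r @ r, rr
            d = r + (rr / rr_old) * d
        # A full implementation guards this step by a line search or a
        # trust region, as in section 4.1; we take the raw step here.
        w += s
    return w
```

On a strictly convex quadratic, each outer iteration reduces the gradient norm by at least the factor `cg_tol`, so the routine converges in a few outer iterations.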

### 2.2. Efficient Function and Gradient Evaluation and Matrix-Vector Products.

In the rest of this letter, $f(w)$ represents the objective function of problem 1.3. We define $A$ to be a $p$ by $l$ matrix in which each row corresponds to a preference pair: if $(i, j) \in P$, then the corresponding row in $A$ has the $i$th entry 1, the $j$th entry $-1$, and all other entries zero. By this definition, the objective function in problem 1.3 can be written as

$$f(w) = \frac{1}{2}w^Tw + C\,(e - AXw)^T D_w (e - AXw),$$

where $e$ is a vector of ones, $X \equiv [x_1, \ldots, x_l]^T$, and $D_w$ is a $p$ by $p$ diagonal matrix with

$$(D_w)_{(i,j),(i,j)} = \begin{cases} 1 & \text{if } 1 - w^T(x_i - x_j) > 0,\\ 0 & \text{otherwise,} \end{cases}$$

for all $(i, j) \in P$. The gradient is

$$\nabla f(w) = w + 2C\,X^TA^TD_w(AXw - e). \tag{2.4}$$

However, $\nabla^2 f(w)$ does not exist because equation 2.4 is not differentiable. Following Mangasarian (2002) and Lin et al. (2008), we define a generalized Hessian matrix

$$\nabla^2 f(w) \equiv I + 2C\,X^TA^TD_wAX,$$

where $I$ is the identity matrix. The product between the generalized Hessian and a vector $v$ is then

$$\nabla^2 f(w)\,v = v + 2C\,X^T\bigl(A^T\bigl(D_w\bigl(A(Xv)\bigr)\bigr)\bigr). \tag{2.6}$$

Because $A$ and $D_w$ have $O(p)$ nonzero elements, the complexity of calculating equation 2.6 is $O(p + l\bar{n} + n)$. The right-to-left matrix-vector products in equation 2.6 are faster than if we obtain and store the matrix $X^TA^TD_wAX$.

If $p = O(l^2)$, not only is the cost of equation 2.6 still high, but also the storage of $A$ and $D_w$ requires a huge amount of memory. To derive a faster method, it is essential to explore the structure of the generalized Hessian. We define

$$SV(w) \equiv \{(i, j) \in P \mid 1 - w^T(x_i - x_j) > 0\}.$$

We show in appendix A that when problem 1.3 is treated as an SVM classification problem with feature vectors $x_i - x_j$ and labels being 1 for all $(i, j) \in P$, the set $SV(w)$ corresponds to the support vectors.

We can remove $D_w$ by defining a new matrix $A_w$ that includes the rows $(i, j)$ of $A$ such that $(i, j) \in SV(w)$. Thus, equation 2.6 becomes

$$\nabla^2 f(w)\,v = v + 2C\,X^T\bigl(A_w^T\bigl(A_w(Xv)\bigr)\bigr). \tag{2.9}$$

Observe that because each row of $A_w$ contains two nonzero elements, $(A_w^TA_w)_{ij} \ne 0$ occurs only under the following situations:

$$i = j,\qquad (i, j) \in SV(w),\qquad \text{or} \qquad (j, i) \in SV(w). \tag{2.10}$$

We define

$$SV_i^+(w) \equiv \{j \mid (j, i) \in SV(w)\},\qquad SV_i^-(w) \equiv \{j \mid (i, j) \in SV(w)\},$$

$$l_i^+(w) \equiv |SV_i^+(w)|,\qquad l_i^-(w) \equiv |SV_i^-(w)|.$$

Then from equation 2.10,

$$\bigl(A_w^TA_wXv\bigr)_i = \bigl(l_i^+(w) + l_i^-(w)\bigr)\,x_i^Tv - \Bigl(\sum_{j \in SV_i^+(w)} x_j^Tv + \sum_{j \in SV_i^-(w)} x_j^Tv\Bigr). \tag{2.11}$$

Therefore, if we already have the values of $l_i^+(w)$, $l_i^-(w)$, $\sum_{j \in SV_i^+(w)} x_j^Tv$, and $\sum_{j \in SV_i^-(w)} x_j^Tv$, the computation of the Hessian-vector product in equation 2.9 would just cost $O(l\bar{n} + n)$, where $O(l\bar{n})$ is for computing equation 2.11 and $O(n)$ is for the vector addition in equation 2.9.

For function and gradient evaluation, $A_w^TA_wXw$ can be calculated by equation 2.11 with $v = w$, so from equation 2.4,

$$\nabla f(w) = w + 2C\,X^T\bigl(A_w^TA_wXw - A_w^Te_w\bigr), \tag{2.12}$$

where $e_w$ is the vector of ones with $|SV(w)|$ components. We also have

$$\bigl(A_w^Te_w\bigr)_i = l_i^-(w) - l_i^+(w). \tag{2.13}$$

Thus, the computation of both equations 2.12 and 2.13 costs $O(l\bar{n} + n)$ as well.

Note that $l_i^+(w)$ and $l_i^-(w)$ are also needed for efficient function and subgradient evaluation (see more details in section 3.2). Therefore, regardless of the optimization method used, an important common task is to efficiently calculate $l_i^\pm(w)$ and $\sum_{j \in SV_i^\pm(w)} x_j^Tv$. In the supplementary materials, we discuss in detail a direct method that costs $O(l + k)$ space excluding the training data and

$$O(l\bar{n} + l\log l + lk + n) \tag{2.14}$$

time for one matrix-vector product. Although the cost is lower than that by equation 2.6, the $O(lk)$ complexity is still high if $k$ is large. Subsequently, we discuss methods to reduce the $O(lk)$ term to $O(l\log k)$.

### 2.3. Efficient Calculation by Storing Values in an Order-Statistic Tree.

Airola et al. (2011) calculate $l_i^+(w)$ and $l_i^-(w)$ by an order-statistic tree, so the $O(lk)$ term in equation 2.14 is reduced to $O(l\log k)$. The optimization method used is a cutting plane method (Teo, Vishwanathan, Smola, & Le, 2010), which calculates function and subgradient values. Our procedure here is an extension because in Newton methods, we further need Hessian-vector products. Notice that Airola et al. (2011) considered problem 1.2; we solve problem 1.3 and require the computation of $\sum_{j \in SV_i^+(w)} x_j^Tv$ and $\sum_{j \in SV_i^-(w)} x_j^Tv$ in addition.

For an easy description, we consider instances of a single query. To compute $l_i^+(w)$, we must count the cardinality of the following set:

$$SV_i^+(w) = \{j \mid y_j > y_i,\ w^Tx_j < w^Tx_i + 1\}.$$

The main difficulty is that both the order of $y_j$ and the order of $w^Tx_j$ are involved. We can first sort $w^Tx_i$ in ascending order. For an easy description, we assume that

$$w^Tx_1 \le w^Tx_2 \le \cdots \le w^Tx_l.$$

We observe that if elements in

$$\{j \mid w^Tx_j < w^Tx_i + 1\} \tag{2.16}$$

have been properly arranged in an order-statistic tree $T$ by the value of $y_j$, then $l_i^+(w)$ can be obtained in $O(\log k)$ time. Consider the example with $i = 1$ shown in Figure 1a. We construct a tree so that each node corresponds to a relevance level $y$ appearing in the set 2.16 and stores

$$\text{size}_y \equiv |\{j \in T \mid y_j = y\}| \quad\text{and}\quad \text{xv}_y \equiv \sum_{j \in T:\ y_j = y} x_j^Tv,$$

where nodes are arranged according to the keys (i.e., $y_j$ values). For each node, we ensure that its right child has a larger key than its left child and the node itself. By $j \in T$, we mean that the instance $j$ has been inserted into the tree $T$.

To compute $l_i^+(w)$, we traverse from the root of $T$ toward the node with key $y_i$ and accumulate the sizes of all subtrees whose keys are larger than $y_i$. Therefore, once a tree for the set 2.16 has been constructed, we can define the following recursive function on a node $d$:

$$\text{larger}(d, y) = \begin{cases} 0 & \text{if } d \text{ is empty},\\ \text{larger}(d.\text{right},\, y) & \text{if } y > d.\text{key},\\ \text{size}(d.\text{right}) & \text{if } y = d.\text{key},\\ \text{size}(d.\text{right}) + \text{size}_{d.\text{key}} + \text{larger}(d.\text{left},\, y) & \text{if } y < d.\text{key}, \end{cases} \tag{2.20}$$

where $\text{size}(\cdot)$ denotes the total size of a subtree. In the last case of equation 2.20, every node $t$ in the right subtree and the node $d$ itself have keys larger than $y$, so their sizes are all added, while the left subtree may still contain keys larger than $y$ and the recursion continues there. In the second case, all keys larger than $y$ reside in the right subtree, so nodes on the left are not considered. Using equation 2.20, we have

$$l_i^+(w) = \text{larger}(T.\text{root},\, y_i).$$

An example of traversing the tree to find $l_1^+(w)$ is in Figure 1b.

After $l_i^+(w)$ has been calculated, we insert the following instances into the tree:

$$\{j \mid w^Tx_j < w^Tx_{i+1} + 1\} \setminus \{j \mid w^Tx_j < w^Tx_i + 1\}.$$

Then $l_{i+1}^+(w)$ can be calculated in the same way. The calculation for $l_i^-(w)$ is similar, except that we start from $i = l$ and maintain a tree of the following set:

$$\{j \mid w^Tx_j > w^Tx_i - 1\}.$$

We then define a function similar to $\text{larger}$ to obtain $l_i^-(w)$. The sums $\sum_{j \in SV_i^\pm(w)} x_j^Tv$ are obtained in the same traversals by accumulating $\text{xv}$ instead of $\text{size}$.

Each insertion or query costs $O(\log k)$ (see more discussion in section 2.5). If $w^Tx_i$ have been sorted before the CG iterations, each matrix-vector product involves $O(l\bar{n} + l\log k + n)$ operations, which is smaller than equation 2.14 because the $lk$ term is reduced to $l\log k$. Therefore, the cost of truncated Newton methods using order-statistic trees is

$$O\Bigl(l\log l + \bigl(l\bar{n} + l\log k + n\bigr) \times \bigl(\text{number of CG iterations}\bigr)\Bigr) \tag{2.23}$$

per outer iteration, where the $O(l\log l)$ term is the cost of sorting.

Our algorithm constructs a tree for each matrix-vector product (or each CG iteration) because of the change of the vector $v$ in equation 2.11. Thus, an outer iteration of the truncated Newton method requires constructing several trees. If we store $\sum_j x_j$ instead of $\sum_j x_j^Tv$ at each node, only one tree independent of $v$ is needed at an outer iteration. However, because a vector is stored at each node, each update requires $O(\bar{n})$ cost. The total cost of maintaining the tree is $O(l\bar{n}\log k)$ because each insertion requires $O(\log k)$ updates. This is bigger than the $O(l\log k)$ cost of maintaining a tree that stores $\sum_j x_j^Tv$. Further, we need $O(ln)$ space to store the vectors.^{2} Because the number of matrix-vector products is often not large, storing $\sum_j x_j^Tv$ is more suitable.

Instead of sorting $w^Tx_i$ and using $y_i$ as the keys, we may alternatively sort $y_i$ such that $y_1 \le y_2 \le \cdots \le y_l$ and, for each $i$, maintain a tree of the following set:

$$\{j \mid y_j > y_i\},$$

with $w^Tx_j$ as the keys. Then we can apply the same approach as above. An advantage of this approach is that the $y_i$ are fixed and need to be sorted only once in the training procedure. However, the values $w^Tx_i$ become the keys of nodes and are in general all different, so the tree will eventually contain $O(l)$ rather than $O(k)$ nodes. Therefore, this approach is less preferred because maintaining a smaller tree is essential.

### 2.4. A Different Implementation by Storing Keys in Leaves of a Tree.

In this method, the $k$ leaf nodes from left to right correspond to the ascending order of relevance levels. At a leaf node, we record the size and xv of a relevance level. For each internal node, which is the root of a subtree, its size and xv are both the sum of that attribute of its children. For the same example considered in section 2.3, the tree at $i = 1$ is shown in Figure 2. To compute $l_i^+(w)$, we know that it is the total size of the leaves whose relevance levels are larger than $y_i$. Therefore, starting from the leaf of $y_i$ and walking up to the root, whenever a node $s$ on the path is a left child, we add in the size of its right sibling; summing these values gives $l_i^+(w)$. An illustration of finding $l_1^+(w)$ is in Figure 3. The procedures for obtaining $l_i^-(w)$, $\sum_{j \in SV_i^+(w)} x_j^Tv$, and $\sum_{j \in SV_i^-(w)} x_j^Tv$ are all similar.
*w*### 2.5. A Discussion on Tree Implementations.

For the method in section 2.3, where each node has a key, we can consider balanced binary search trees such as the AVL tree, red-black tree, and AA tree. AVL trees use more complicated insertion operations to remain more strictly balanced. Consequently, insertion is slower, but the order-statistic computation is usually faster compared with other order-statistic trees. In the comparison by Heger (2004), an AA tree tends to be more balanced and faster than a red-black tree. However, previous studies also consider node deletions, which are not needed here, so we conduct an experiment in section 4.3.

For the method in section 2.4 to store keys in leaves, the selection tree is a suitable data structure. Note that selection trees were mainly used for sorting, but using one as a balanced binary search tree is a straightforward adaptation. An implementation method introduced in Knuth (1973) is to map the $k$ possible $y_i$ values to the leaf indices $k, \ldots, 2k - 1$ and let the indices of the internal nodes be $1, \ldots, k - 1$. Then for any node $m$, its parent is the node $\lfloor m/2 \rfloor$. Moreover, if $m$ is an odd number, then it is a right child, and vice versa. By this method, we do not need to use pointers for constructing the tree, and thus the implementation is very simple. Another advantage is that this tree is fully balanced, so each leaf is of the same depth.
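The pointer-free layout above can be sketched in a few lines. The following is our own minimal illustration (we pad $k$ to the next power of two so every leaf has the same depth; the original indexing works for any $k$): leaves hold the per-level size and xv, each internal node $m$ aggregates its children $2m$ and $2m+1$, and a query walks from a leaf to the root, adding the contribution of every right sibling.

```python
class SelectionTree:
    """Array-based selection tree over relevance levels 1..k.

    Leaves K..2K-1 (K = next power of two >= k, padding added for
    simplicity) store per-level count and an accumulated value; internal
    node m aggregates its children 2m and 2m+1, so no pointers are needed.
    """
    def __init__(self, k):
        self.K = 1
        while self.K < k:
            self.K *= 2
        self.size = [0] * (2 * self.K)
        self.xv = [0.0] * (2 * self.K)

    def insert(self, y, val):
        """Insert an instance with relevance level y and value val
        (e.g., val = x_j^T v), updating the O(log k) nodes on its path."""
        m = self.K + y - 1
        while m >= 1:
            self.size[m] += 1
            self.xv[m] += val
            m //= 2

    def greater(self, y):
        """Return (count, sum of vals) over inserted items with level > y."""
        cnt, s = 0, 0.0
        m = self.K + y - 1
        while m > 1:
            if m % 2 == 0:           # left child: right sibling holds larger keys
                cnt += self.size[m + 1]
                s += self.xv[m + 1]
            m //= 2
        return cnt, s
```

In the sliding procedure of section 2.3, one inserts every instance $j$ with $w^Tx_j < w^Tx_i + 1$ and then `greater(y_i)` returns $l_i^+(w)$ together with $\sum_{j \in SV_i^+(w)} x_j^Tv$, both in $O(\log k)$ time.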

## 3. Comparison with Existing Methods

In this section, we introduce recent studies of linear rankSVM that are considered state of the art. Some of them were mentioned in section 2 in comparison with our proposed methods.

### 3.1. **PRSVM** and **PRSVM**+.

PRSVM (Chapelle & Keerthi, 2010) solves problem 1.3 by a truncated Newton method in which each Hessian-vector product directly goes through the preference pairs, so the cost includes an $O(p)$ term, which becomes dominant for large $p$. To reduce the cost, Chapelle and Keerthi (2010) proposed PRSVM+ for solving problem 1.3 by a truncated Newton method. They first consider the case of $k = 2$ (i.e., two relevance levels). The algorithm for calculating $l_i^+(w)$, $l_i^-(w)$, and the corresponding sums is related to Joachims (2005) and is a special case of a direct counting method discussed in the supplementary material. For the general situation, they observe that the loss can be decomposed as

$$\sum_{(i,j) \in P} \max\bigl(0,\ 1 - w^T(x_i - x_j)\bigr)^2 = \sum_{r=1}^{k-1}\ \sum_{(i,j):\ y_j = r,\ y_i > r} \max\bigl(0,\ 1 - w^T(x_i - x_j)\bigr)^2.$$

The inner sum is over a subset of data in two relevance levels ($r$ and $> r$). Then the algorithm for two-level data can be applied. When we replace the $O(lk)$ term in equation 2.14 with the sum of the sizes of the two-level sets, the complexity of each matrix-vector product is

$$O\Bigl(l\bar{n} + l\log l + \sum_{r=1}^{k-1}\bigl|\{i \mid y_i \ge r\}\bigr| + n\Bigr). \tag{3.1}$$

If each relevance level takes about the same amount of $O(l/k)$ data, equation 3.1 becomes

$$O\Bigl(l\bar{n} + l\log l + \frac{(k+1)\,l}{2} + n\Bigr), \tag{3.2}$$

which is larger than the cost of the approach of using order-statistic trees.

### 3.2. **TreeRankSVM**.

Joachims (2006) uses a cutting plane method to optimize problem 1.2. Airola et al. (2011) improve on Joachims's work and release a package, TreeRankSVM.

Problem 1.2 can be written as

$$\min_w\ \frac{1}{2}w^Tw + C\,L(w),$$

where $L(w)$ is the loss term. Let $w^t$ be the solution obtained at the $t$th iteration. The first-order Taylor approximation of $L(w)$ is used to build a cutting plane $a_t^Tw + b_t$ at $w = w^t$, where

$$a_t = \nabla L(w^t) \quad\text{and}\quad b_t = L(w^t) - a_t^Tw^t.$$

If $L(w)$ is nondifferentiable, then a subgradient is used for $a_t$. The cutting plane method maintains all planes $\{(a_s, b_s)\}_{s=1}^{t}$ to form a lower-bound function for $L(w)$ and obtains $w^{t+1}$ by solving

$$\min_w\ \frac{1}{2}w^Tw + C\,\max\Bigl(0,\ \max_{1 \le s \le t}\bigl(a_s^Tw + b_s\bigr)\Bigr). \tag{3.3}$$

For $l_i^+(w)$ and $l_i^-(w)$, Joachims (2006) uses a direct counting method; the complexity at each iteration is shown in equation 1.6. We leave the details to the supplementary materials. As mentioned in section 2.3, the main improvement Airola et al. (2011) made is to use order-statistic trees so that the $O(lk)$ term in calculating $l_i^+(w)$ and $l_i^-(w)$ is reduced to $O(l\log k)$. In particular, red-black trees were adopted in their work. The overall cost per iteration is

$$O(l\bar{n} + l\log l + l\log k + n).$$

### 3.3. **sofia-ml**.

Sculley (2009) proposed sofia-ml to solve problem 1.2. It is a stochastic gradient descent (SGD) method that randomly draws a preference pair from the training set at each iteration and uses a subgradient on this pair to update $w$. This method does not consider the special structure of the loss term. For going through the whole training data, the cost is $O(p\bar{n})$, which is worse than that of the other methods discussed. Therefore, we do not include this method in our experiments.

In contrast, SGD is one of the state-of-the-art methods for linear SVM classification (e.g., Shalev-Shwartz, Singer, & Srebro, 2007), where the number of loss terms is only $O(l)$ rather than $O(p)$. For rankSVM, special SGD methods that consider the structure of the pairwise loss term have not been available.
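For reference, the pair-sampling SGD update described above can be sketched as follows. This is our own minimal version with a fixed learning rate and explicit pair enumeration; sofia-ml itself samples pairs without materializing them and uses a decaying, Pegasos-style step size.

```python
import numpy as np

def sgd_ranksvm(X, y, q, C=1.0, lr=0.01, iters=20000, seed=0):
    """L1-loss rankSVM by pairwise SGD: draw a preference pair (i, j)
    and take a subgradient step on max(0, 1 - w^T(x_i - x_j))."""
    rng = np.random.default_rng(seed)
    # Enumerating all pairs takes O(p) memory; done here only for clarity.
    pairs = [(i, j) for i in range(len(y)) for j in range(len(y))
             if q[i] == q[j] and y[i] > y[j]]
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        i, j = pairs[rng.integers(len(pairs))]
        d = X[i] - X[j]
        g = w / len(pairs)            # regularization spread over the pairs
        if 1.0 - w @ d > 0.0:         # the sampled pair violates the margin
            g = g - C * d
        w -= lr * g
    return w

# Three instances of one query, listed in decreasing relevance.
X = np.array([[3.0], [2.0], [1.0]])
y = np.array([3, 2, 1])
q = np.array([1, 1, 1])
w = sgd_ranksvm(X, y, q)
scores = X @ w
```

On this toy query, the learned $w$ is positive, so ranking by $w^Tx$ reproduces the relevance order; the point of the sketch is that each update touches one pair, which is why a full pass costs $O(p\bar{n})$.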

## 4. Experiments

In this section, we begin by describing the details of a truncated Newton implementation of our approach. The first experiment is to evaluate methods discussed in section 2. In particular, the speed of different implementations of order-statistic trees is examined. Next, we compare state-of-the-art methods for linear rankSVM with the proposed approach. Then we conduct an investigation of the performance difference between linear rankSVM and pointwise methods. Finally, an experiment on sparse data is shown. (Programs used for experiments can be found online at http://www.csie.ntu.edu.tw/~cjlin/liblinear/exp.html.)

### 4.1. Implementation Using a Trust Region Newton Method.

In our implementation of the proposed approach, we consider a trust region Newton method (TRON), that is, a truncated Newton method discussed in section 2.1. For details of trust region methods, a comprehensive book is by Conn, Gould, and Toint (2000). Here we mainly follow the setting in Lin and Moré (1999) and Lin et al. (2008).

At the $t$th iteration, TRON obtains a direction $s^t$ by minimizing $g_t(s)$ in equation 2.1 over a region that we trust:

$$\min_s\ g_t(s) \quad \text{subject to}\quad \|s\| \le \Delta_t, \tag{4.1}$$

where $\Delta_t$ is the size of the trust region. After solving problem 4.1, it decides whether to apply the obtained direction $s^t$ according to the ratio between the real function decrease and the approximate reduction $g_t(s^t)$:

$$\rho_t = \frac{f(w^t + s^t) - f(w^t)}{g_t(s^t)}.$$

The iterate is updated only if $\rho_t$ is large enough:

$$w^{t+1} = \begin{cases} w^t + s^t & \text{if } \rho_t > \eta_0,\\ w^t & \text{otherwise,} \end{cases}$$

where $\eta_0$ is a prespecified parameter. TRON then adjusts the trust region according to $\rho_t$: when $\rho_t$ is too small, $\Delta_t$ is decreased; otherwise, when $\rho_t$ is large enough, $\Delta_t$ is increased. More specifically, $\Delta_{t+1}$ is chosen according to which of several prespecified intervals $\rho_t$ falls into.

For linear classification, Lin et al. (2008) apply the approach of Steihaug (1983) to run CG iterations until either a minimum of $g_t(s)$ is found or $s$ touches the boundary of the trust region. We consider the same setting in our implementation.

The stopping condition of TRON is

$$\|\nabla f(w^t)\| \le \epsilon\,\|\nabla f(w^0)\|, \tag{4.4}$$

where $w^0$ is the initial iterate and $\epsilon$ is the stopping tolerance given by users. By default, we set a small $\epsilon$ and let $w^0$ be the zero vector.

### 4.2. Experiment Setting.

We consider three sources of web search engine ranking data: LETOR (Qin, Liu, Xu, & Li, 2010), MSLR,^{3} and YAHOO (Chapelle & Chang, 2011). Both LETOR and MSLR are from Microsoft Research, while YAHOO is from the Yahoo Learning to Rank Challenge. From LETOR, we take four sets: MQ2007, MQ2008, MQ2007-list, and MQ2008-list. For MSLR, we take the set named 30k, which indicates the number of queries within it.^{4} Each set from LETOR or MSLR consists of five segmentations, and we take the first fold. YAHOO contains two sets, and both are considered. The details of these data sets are listed in Table 2. Each set comes with training, validation, and testing sets; we use the validation set only for selecting the parameters of each model. For preprocessing, we linearly scale each feature of the LETOR and MSLR data sets to the range [0, 1], while the features of the YAHOO data sets are already in this range.

| Data Set | $l$ | $n$ | $k$ | $\lvert Q \rvert$ | $p$ | Average $k_q/l_q$ over Queries |
|---|---|---|---|---|---|---|
| MQ2007 fold 1 | 42,158 | 46 | 3 | 1,017 | 246,015 | 0.0546 |
| MQ2008 fold 1 | 9,630 | 46 | 3 | 471 | 52,325 | 0.1697 |
| MSLR 30k fold 1 | 2,270,296 | 136 | 5 | 18,919 | 101,312,036 | 0.0492 |
| YAHOO set 1 | 473,134 | 519 | 5 | 19,944 | 5,178,545 | 0.2228 |
| YAHOO set 2 | 34,815 | 596 | 5 | 1,266 | 292,951 | 0.1560 |
| MQ2007-list fold 1 | 743,790 | 46 | 1,268 | 1,017 | 285,943,893 | 1 |
| MQ2008-list fold 1 | 540,679 | 46 | 1,831 | 471 | 323,151,792 | 1 |


Notes: All data sets are dense (i.e., $\bar{n} = n$). In the last column, $l_q$ and $k_q$ are the number of instances and the number of relevance levels in query $q$, respectively. See Table 1 for the meaning of other columns.

All experiments are conducted on a 64-bit machine with Intel Xeon 2.5 GHz CPU (E5504), 12 MB cache, and 16 GB memory.

### 4.3. A Comparison Between Methods in Section 2: A Direct Counting Method and Different Order-Statistic Trees.

We solve problem 1.3 using TRON and compare the following methods for calculating $l_i^+(w)$, $l_i^-(w)$, $\sum_{j \in SV_i^+(w)} x_j^Tv$, and $\sum_{j \in SV_i^-(w)} x_j^Tv$:

- direct-count: the direct counting method mentioned at the end of section 2.2 (see details in the supplementary material)

- y-rbtree: the red-black tree using $y_i$ as the key of nodes (see section 2.3)

- wx-rbtree: the red-black tree using $w^Tx_i$ as the key of nodes (see section 2.3)

- selectiontree: the selection tree that stores keys in leaf nodes (see section 2.4)

- y-avltree: the same as y-rbtree, except the order-statistic tree used is the AVL tree (see section 2.5)

- y-aatree: the same as y-rbtree, except the order-statistic tree used is the AA tree (see section 2.5)

We take four data sets and set $C = 1$. The results of training time versus function values are shown in Figure 4. We also draw a horizontal line in the figure to indicate that the default stopping condition of TRON, using the default $\epsilon$ in equation 4.4, has been satisfied. Experiments in section 4.4 will show that solutions obtained below this line have similar ranking performance to the optimum. From the figures, the method of direct counting is slow when $k$ (the number of relevance levels) is large. This result is expected following the complexity analysis in equation 2.14. In addition, although the implementations of order-statistic trees have slightly different running times in the end, they are very similar otherwise. Therefore, we choose selection trees in subsequent experiments because of their simplicity.

### 4.4. A Comparison Between Different Methods for Linear RankSVM.

We compare the following methods for linear rankSVM:

- Tree-TRON: our approach of using TRON with selection trees.

- PRSVM+ (Chapelle & Keerthi, 2010): this method was discussed in section 3.1. The authors did not release their code, so we make an implementation using the same framework as TRON. Therefore, we apply a trust region rather than a line search in their truncated Newton procedure.

- TreeRankSVM (Airola et al., 2011): this method was discussed in section 3.2. We download version 0.1 from http://staff.cs.utu.fi/~aatapa/software/RankSVM/. Although this package is mainly implemented in Python, computationally intensive procedures such as red-black trees and the minimization of equation 3.3 are written in C/C++ or Fortran.

Besides pairwise accuracy, we consider NDCG (normalized discounted cumulative gain), in which $m$ is a prespecified positive integer, $\pi^*$ is an ideal ordering with $y_{\pi^*(1)} \ge y_{\pi^*(2)} \ge \cdots \ge y_{\pi^*(l_q)}$, and $\pi$ is the ordering being evaluated, where $l_q$ is the number of instances in query $q$. Then

$$\text{NDCG@}m \equiv \frac{1}{N_m} \sum_{s=1}^{\min(m,\,l_q)} \frac{2^{y_{\pi(s)}} - 1}{\log_2(1 + s)}, \tag{4.5}$$

where $N_m$ is the score of an ideal ordering (the same sum computed with $\pi^*$); top-ranked instances are considered more important because of the larger weight $1/\log_2(1 + s)$. From equation 4.5, NDCG computes the relative score of the evaluated ordering to the ideal ordering. Regarding $m$, LETOR considers NDCG@10. For MSLR and YAHOO, we follow their recommendation to use mean NDCG:

$$\frac{1}{l_q} \sum_{m=1}^{l_q} \text{NDCG@}m.$$

We then report the average over all queries. Obtaining the orderings requires $O(l\log l)$ time by sorting, but our implementation uses a selection tree.^{5}
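The NDCG criterion can be computed directly from its definition; below is a small self-contained sketch (our own code, not the evaluation scripts of the benchmarks) using the standard $2^y - 1$ gain and $\log_2(1 + s)$ discount:

```python
import math

def ndcg_at_m(y_true, scores, m):
    """NDCG@m for one query: y_true holds relevance levels, scores the
    predicted w^T x values; ties in scores are broken arbitrarily."""
    order = sorted(range(len(y_true)), key=lambda i: -scores[i])
    ideal = sorted(y_true, reverse=True)
    m = min(m, len(y_true))
    # Position s is 0-based here, so the discount log2(1 + rank) = log2(s + 2).
    dcg = sum((2 ** y_true[order[s]] - 1) / math.log2(s + 2) for s in range(m))
    idcg = sum((2 ** ideal[s] - 1) / math.log2(s + 2) for s in range(m))
    return dcg / idcg if idcg > 0 else 1.0

def mean_ndcg(y_true, scores):
    """Mean of NDCG@1, ..., NDCG@l_q for one query."""
    lq = len(y_true)
    return sum(ndcg_at_m(y_true, scores, m) for m in range(1, lq + 1)) / lq
```

A perfect ordering gives exactly 1.0, and any inversion among instances with different relevance levels strictly lowers the value.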

For each evaluation criterion, we find the best regularization parameter by checking the validation set result over a range of $C$ values.^{6} The selected regularization parameter $C$ for each data set and each measurement is listed in Table 3. The results of comparing different approaches can be found in Figures 5 and 6. We present the relative difference to the optimal function value, pairwise accuracy, and (mean) NDCG.^{7} We also draw the horizontal lines of the default stopping condition of TRON in the figures.

| | Problem 1.2 Using L1 Loss | | Problem 1.3 Using L2 Loss | |
|---|---|---|---|---|
| Data Sets | Pairwise Accuracy | NDCG | Pairwise Accuracy | NDCG |
| MQ2007 | 2^{−1} | 2^{8} | 2^{−5} | 2^{−15} |
| MQ2008 | 2^{8} | 2^{−6} | 2^{7} | 2^{7} |
| MSLR 30k | NA | NA | 2^{3} | 2^{3} |
| YAHOO set 1 | NA | NA | 2^{−14} | 2^{1} |
| YAHOO set 2 | 2^{−7} | 2^{−4} | 2^{−10} | 2^{−10} |
| MQ2007-list | 2^{5} | NA | 2^{−12} | NA |
| MQ2008-list | 2^{−14} | NA | 2^{−14} | NA |


One can observe from the figures that the convergence speed of TreeRankSVM is slower than that of PRSVM+ and Tree-TRON. To rule out the implementation differences between Tree-TRON/PRSVM+ and TreeRankSVM, in the supplementary materials we check iterations versus relative function value and test performance.^{8} Results still show that TreeRankSVM is slower, so for linear rankSVM, methods using second-order information seem to be superior. Regarding Tree-TRON and PRSVM+, Figure 5 shows that they are similar when the average $k_q/l_q$ is small. However, from Figure 6, PRSVM+ is much slower if the number of relevance levels is large (i.e., large $k_q/l_q$). This result is expected following the complexity analysis in equations 2.23 and 3.2. Another observation is that the performances of Tree-TRON and PRSVM+ are stable after the default stopping condition is satisfied. Thus, tuning $\epsilon$ in equation 4.4 is generally not necessary.

This experiment also serves as a comparison between L1- and L2-loss linear rankSVM. Results show that their performances (NDCG and pairwise accuracy) are similar.

Instead of using the best *C* after parameter selection, we
investigate the training time under a fixed *C* for all methods.
We check the situations that *C* is large, medium, and small by
using *C*=100, *C*=1, and *C*=10^{−4}, respectively. The results
are shown in Figure 7. We can observe that Tree-TRON is always one of the fastest methods.

### 4.5. A Comparison Between Linear RankSVM, Linear Support Vector Regression, GBDT, and Random Forests.

We compare rankSVM using Tree-TRON with the following pointwise methods:

- •
Linear support vector regression (SVR) by Vapnik (1995). We check both L1-loss and L2-loss linear SVR provided in the package LIBLINEAR (version 1.92). Their implementation details can be found in Ho and Lin (2012). For L2-loss linear SVR, two implementations are available in LIBLINEAR, solving the primal and dual problems, respectively. We use the one that solves the primal problem by TRON.

- •
GBDT (Friedman, 2001). This is a nonlinear pointwise model that is known to be powerful for web search ranking problems. We use version 0.9 of the package pGBRT (Tyree, Weinberger, Agrawal, & Paykin, 2011) downloaded from http://machinelearning.wustl.edu/pmwiki.php/Main/Pgbrt.

- •
Random forests (Breiman, 2001). This is another nonlinear pointwise model that performs well on web search data. We use version 1.0 of Rt-Rank downloaded from https://sites.google.com/site/rtranking.

For linear SVR, we set the ε-insensitive parameter following Ho and Lin (2012), who showed that this setting often works well. We then conduct the
same parameter selection procedure as in section 4.4 to find the best regularization parameters *C* and list them in Table 4. The training
time, test NDCG, and test pairwise accuracy are shown in Table 5. We first observe that the performance of L1-loss
SVR is worse than that of L2-loss SVR. The reason might be that L1 loss imposes a
smaller training loss when the prediction error is larger than one. Regarding
L2-loss SVR and L2-loss rankSVM, their NDCG results are close, but rankSVM gives
better pairwise accuracy. This result seems reasonable because rankSVM
considers pairwise training losses. For training time, although the selected
regularization parameters differ and the results are therefore not fully
comparable, L2-loss SVR is generally faster. In summary, L2-loss SVR is
competitive in terms of NDCG and training time, but rankSVM may still be useful
if pairwise accuracy is our main concern.
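The pairwise accuracy reported in these tables is the fraction of preference pairs (instances of the same query with different relevance labels) that the model scores in the correct order. A minimal sketch with our own naming; the direct O(*l*^{2}) loop and the choice of counting tied scores as half correct are our assumptions, not the paper's implementation:

```python
def pairwise_accuracy(scores, labels, queries):
    """Fraction of preference pairs (same query, y_i > y_j) that the
    predicted scores order correctly; tied scores count as half."""
    correct = total = 0.0
    n = len(scores)
    for i in range(n):
        for j in range(n):
            # only pairs within one query whose labels differ (y_i > y_j)
            if queries[i] != queries[j] or labels[i] <= labels[j]:
                continue
            total += 1
            if scores[i] > scores[j]:
                correct += 1
            elif scores[i] == scores[j]:
                correct += 0.5
    return correct / total if total else 1.0
```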

| Data Set | L1-Loss Linear SVR: Pairwise Accuracy | L1-Loss Linear SVR: NDCG | L2-Loss Linear SVR: Pairwise Accuracy | L2-Loss Linear SVR: NDCG |
|---|---|---|---|---|
|  | 2^{8} | 2^{8} | 2^{−7} | 2^{−11} |
|  | 2^{8} | 2^{8} | 2^{−1} | 2^{−4} |
| 30k | 2^{−1} | 2^{2} | 2^{−2} | 2^{−2} |
| set 1 | 2^{−10} | 2^{−5} | 2^{−5} | 2^{−2} |
| set 2 | 2^{−3} | 2^{1} | 2^{−5} | 2^{4} |
|  | 2^{−9} | NA | 2^{−15} | NA |
|  | 2^{4} | NA | 2^{−7} | NA |


Training time and test NDCG:

| Data Set | L2-Loss RankSVM: Training Time (s) | L2-Loss RankSVM: NDCG | L1-Loss SVR: Training Time (s) | L1-Loss SVR: NDCG | L2-Loss SVR: Training Time (s) | L2-Loss SVR: NDCG |
|---|---|---|---|---|---|---|
|  | 0.5 | 0.5211 | 23.9^{a} | 0.4756^{a} | 0.5 | 0.5157 |
|  | 0.5 | 0.4571 | 3.4^{a} | 0.4153^{a} | 0.2 | 0.4450 |
| 30k | 1601.6 | 0.4945 | 461.6 | 0.4742 | 202.4 | 0.4946 |
| set 1 | 334.8 | 0.7616 | 10.8 | 0.7579 | 172.7 | 0.7642 |
| set 2 | 11.2 | 0.7519 | 47.6 | 0.7470 | 20.8 | 0.7578 |

Training time and test pairwise accuracy:

| Data Set | L2-Loss RankSVM: Training Time (s) | L2-Loss RankSVM: Pairwise Accuracy | L1-Loss SVR: Training Time (s) | L1-Loss SVR: Pairwise Accuracy | L2-Loss SVR: Training Time (s) | L2-Loss SVR: Pairwise Accuracy |
|---|---|---|---|---|---|---|
|  | 1.3 | 70.36% | 23.9^{a} | 64.06%^{a} | 0.7 | 68.56% |
|  | 0.5 | 82.70% | 3.4^{a} | 77.72%^{a} | 0.3 | 82.17% |
| 30k | 1601.6 | 61.52% | 65.4 | 60.11% | 202.4 | 60.49% |
| set 1 | 117.1 | 68.45% | 2.4 | 67.82% | 149.5 | 67.83% |
| set 2 | 11.2 | 69.74% | 3.3 | 68.37% | 14.5 | 69.39% |
|  | 38.7 | 80.71% | 1.0 | 79.82% | 5.0 | 79.70% |
|  | 16.6 | 82.11% | 1.1 | 81.65% | 6.7 | 81.85% |


^{a}Reached maximum iteration of .

Next, we check GBDT and random forests. Their training time is long, so we do not conduct parameter selection. We consider a small number of trees and fix the parameters as follows. For GBDT, we use a learning rate of 0.1, a tree depth of 4, and 100 trees. For random forests, we fix the number of features sampled for splitting at each node and use 40 trees. We further use eight cores to reduce the training time. The results are shown in Table 6. For the smaller data sets and set 2, we are able to train more trees in a reasonable time, so we present in Table 7 the results of using 1,000 trees.

| Data Set | Random Forests: Training Time (s) | Random Forests: Pairwise Accuracy | Random Forests: NDCG | GBDT: Training Time (s) | GBDT: Pairwise Accuracy | GBDT: NDCG |
|---|---|---|---|---|---|---|
|  | 14.8 | 66.16% | 0.4959 | 1.4 | 69.78% | 0.5182 |
|  | 2.3 | 80.36% | 0.4541 | 0.4 | 82.83% | 0.4706 |
| 30k | 5102.1 | 63.76% | 0.5598 | 1339.3 | 62.77% | 0.5375 |
| set 1 | 1672.2 | 70.69% | 0.7797 | 557.7 | 69.22% | 0.7707 |
| set 2 | 58.7 | 68.76% | 0.7629 | 11.3 | 71.21% | 0.7711 |
|  | 606.0 | 78.78% | NA | 106.8 | 79.85% | NA |
|  | 423.3 | 82.04% | NA | 59.3 | 82.43% | NA |


Note: Random forests: 40 trees; GBDT: 100 trees.

| Data Set | Random Forests: Training Time (s) | Random Forests: Pairwise Accuracy | Random Forests: NDCG | GBDT: Training Time (s) | GBDT: Pairwise Accuracy | GBDT: NDCG |
|---|---|---|---|---|---|---|
|  | 345.3 | 69.07% | 0.5221 | 13.7 | 67.58% | 0.4892 |
|  | 52.0 | 82.60% | 0.4675 | 3.6 | 79.78% | 0.4491 |
| set 2 | 1406.9 | 71.91% | 0.7801 | 108.7 | 71.70% | 0.7720 |


From Tables 6 and 7, GBDT and random forests generally perform well, though they are not always better than linear rankSVM. For set 2, random forests achieves 0.78 NDCG using 1,000 trees, which is much better than the 0.75 of linear rankSVM. This result is consistent with the fact that in the Yahoo Learning to Rank Challenge, all top performers used decision-tree-based methods. However, the training cost of GBDT and random forests is in general higher than that of linear rankSVM, so linear rankSVM is useful for quickly providing a baseline result. We also note that the performance of GBDT with more trees is not always better than with fewer trees. This seems to indicate that overfitting occurs and that parameter selection is important. In contrast, random forests is more robust. GBDT trains faster than random forests because random forests grows trees of unlimited depth, while we restrict GBDT trees to a depth of four.

Although pointwise methods perform well in this experiment, a potential problem is that they do not consider the grouping of instances into queries. It is unclear whether this situation may cause problems.

### 4.6. A Comparison Between Linear and Nonlinear Models on Sparse Data.

Recent research has shown that linear SVM is competitive with nonlinear SVM for classifying large and sparse data (Yuan et al., 2012). We conduct an experiment to check whether this property also holds for learning to rank. We consider rankSVM as the linear model for comparison, but for the nonlinear model, we use random forests rather than kernel rankSVM. One reason is that random forests was very robust in the previous experiment. We consider the following two CTR (click-through rate) estimation problems, which can be treated as regression or ranking problems:

- •
. This is a data set used in Ho and Lin (2012).

- •
. This is the processed data generated by the winning team (Wu et al., 2012) of KDD Cup 2012 track 2 (Niu et al., 2012). It contains about one-third of the original data. The task of this competition is online advertisement ranking evaluated by AUC, while the labels are the number of clicks and the number of views. Note that pairwise accuracy reduces to AUC when *k*=2. We transform the labels into CTR (i.e., the number of clicks over the number of views).
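The remark that pairwise accuracy reduces to AUC when *k*=2 can be checked directly: both quantities count the fraction of (positive, negative) pairs in which the positive instance receives the higher score. A small illustration assuming no tied scores; function names are ours:

```python
def auc_ranksum(scores, labels):
    """AUC via the Mann-Whitney rank-sum statistic (assumes no ties)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    ranked = sorted(scores)
    # sum of 1-based ranks of the positive instances
    rank_sum = sum(ranked.index(s) + 1 for s in pos)
    n_pos, n_neg = len(pos), len(neg)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def pairwise_acc_binary(scores, labels):
    """Pairwise accuracy over all (positive, negative) pairs (k = 2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return sum(p > q for p in pos for q in neg) / (len(pos) * len(neg))
```

On any tie-free score vector the two functions agree, which is exactly the k=2 equivalence noted above.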

The two data sets both contain a single query, and each comes with training and
testing sets. To reduce the training time and the memory cost of random forests,
we sample from the two data sets and condense the features. The details are
listed in Table 8. We use the same
parameters of random forests as in section 4.5. For a fair comparison, we fix *C*=1 for
rankSVM because the parameters of random forests are not well tuned. The results
are shown in Table 9. We first notice that the training time of random forests
is several thousand times longer than that of linear rankSVM on sparse data.
The difference is larger than in the dense case because the training cost of
random forests is linear in the number of features, while that of rankSVM is
linear in the average number of nonzero features per instance.
Regarding the performance, the difference is small for the two data sets, so
linear rankSVM is very useful to get competitive results quickly. However, more
experiments are needed to confirm these preliminary observations. We hope more
public sparse ranking data will be available in the near future.

| Data Set | *l* | *n* | avg. nonzeros | *k* | *p* |
|---|---|---|---|---|---|
|  | 11,382,195 | 22,510,600 | 22.6 | 93,899 | 46,191,724,381,879 |
|  | 68,019,906 | 79,901,700 | 35.3 | 6,896 | 198,474,800,029,148 |
|  (0.1%) | 11,382 | 73,581 | 22.6 | 1,087 | 46,020,848 |
|  (0.025%) | 17,005 | 74,026 | 35.3 | 26 | 12,704,393 |


Note: To reduce the training time, only a small subset of each problem is used.

| Data Set | Linear RankSVM: Training Time (s) | Linear RankSVM: Pairwise Accuracy | Linear RankSVM: Mean NDCG | Random Forests: Training Time (s) | Random Forests: Pairwise Accuracy | Random Forests: Mean NDCG |
|---|---|---|---|---|---|---|
|  (0.1%) | 4.3 | 60.83% | 0.4822 | 6343.3 | 60.45% | 0.4732 |
|  (0.025%) | 2.7 | 68.16% | 0.5851 | 5223.1 | 69.72% | 0.5982 |


Note: Random forests uses 40 trees.

## 5. Using Partial Pairs to Train Models

To avoid considering the *O*(*l*^{2}) pairs, a common practice in ranking is to use only a subset of pairs.
An example is Lin (2010), which uses pairs with close relevance levels (i.e., *y _{i}* close to *y _{j}*). The concept is similar to equation 1.5: if pairs with close relevance levels are ranked in the right order, those pairs with larger distances should also be ranked correctly. When *k*=*O*(*l*), this approach can reduce the number of pairs from *O*(*l*^{2}) to be as small as *O*(*k*)=*O*(*l*). However, if *k* is small, each pair is already formed by instances in two close relevance levels, so we cannot significantly reduce the number of pairs.

We take  and  to conduct experiments because these two data sets possess the property . Because in each *q* the values of *y _{i}* are , we use the pairs with *y _{i}*=*y _{j}*+1. This setting of using two adjacent relevance levels leads to *O*(*l*) pairs. Then we can directly consider problems 1.2 and 1.3 as classification problems with instances *x*_{i}−*x*_{j}. If Newton methods are considered for solving problem 1.3, by the approach in equation 2.6, each Hessian-vector product costs only . Therefore, we directly use the implementation for solving L2-loss SVM in  without applying any special method of section 2. After selecting the parameter *C*, we present pairwise accuracy in Table 10. It is observed that the selected *C* of using partial pairs is larger than that of using all pairs. This situation occurs because the sum of training losses in problem 1.3, taken over a smaller number of pairs, must be penalized by a larger *C*. For training time and pairwise accuracy, as expected, the approach of using partial pairs slightly sacrifices performance for faster training speed.
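The partial-pair construction above can be sketched as follows: within each query, keep only pairs whose relevance levels are adjacent (*y _{i}*=*y _{j}*+1) and turn each pair into a classification instance *x*_{i}−*x*_{j}. This is an illustrative sketch with our own naming and plain Python lists; an actual run would feed these difference vectors to any linear SVM solver:

```python
def adjacent_pairs(X, y, query):
    """Build classification instances x_i - x_j for pairs with
    y_i = y_j + 1 within the same query (the partial-pair setting)."""
    # index the instances of each (query, relevance level) once
    by_level = {}
    for idx, (q, rel) in enumerate(zip(query, y)):
        by_level.setdefault((q, rel), []).append(idx)
    diffs = []
    for (q, rel), idxs in by_level.items():
        # pair each instance with those one relevance level below it
        for i in idxs:
            for j in by_level.get((q, rel - 1), []):
                diffs.append([a - b for a, b in zip(X[i], X[j])])
    return diffs
```

When relevance values within a query are distinct (as in the data sets used here), each instance pairs with at most a constant number of neighbors, so only *O*(*l*) difference instances are produced.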

| | *C* | Training Time (s) | Pairwise Accuracy | *C* | Training Time (s) | Pairwise Accuracy |
|---|---|---|---|---|---|---|
| Partial pairs | 2^{−9} | 19.9 | 79.10% | 2^{−10} | 10.5 | 81.81% |
| All pairs | 2^{−12} | 38.7 | 80.71% | 2^{−14} | 16.6 | 82.11% |


Note: the solver for L2-loss SVM in  is used for the partial-pair setting, while  is used for all pairs.

Because of the only slightly lower pairwise accuracy, we may say that this approach, together with past work, is already enough to train large-scale linear rankSVM:

- •
If *k* is small, we can apply the direct method mentioned in section 2.2, which has an *O*(*lk*) term in its calculations.

- •
If *k* is large, we can use only *O*(*l*) pairs. Then any efficient method to train linear SVM can be applied.

However, a caveat is that two different implementations must be used. In contrast,
methods using order-statistic trees can simultaneously handle both small
and large *k*.

## 6. Conclusion

In this letter, we systematically reviewed recent approaches to linear rankSVM. We showed that regardless of the optimization method used, the computational bottleneck is calculating certain values over all preference pairs. Following Airola et al. (2011), we comprehensively investigated tree-based techniques for this calculation. Experiments show that our method is faster than existing implementations of linear rankSVM.

Based on this study, we release an extension of the popular linear classification/regression package for ranking. It is available online at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/.

## Appendix: The Dual Problem of Problems 1.1 and 1.2

In the dual problem, *D* is a diagonal matrix and *Q*=*AX*(*AX*)^{T}. For L1 loss, *U*=*C* and *D* is the zero matrix. For L2 loss,  and *D*=*I*/(2*C*), where *I* is the *p*-by-*p* identity matrix.
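With these quantities, the dual takes the standard SVM form; the following reconstruction is ours, written only to be consistent with the choices of *U* and *D* stated above (here *e* denotes the vector of ones and *p* the number of preference pairs):

```latex
\min_{\alpha} \quad \frac{1}{2}\,\alpha^{T}(Q+D)\,\alpha - e^{T}\alpha
\qquad \text{subject to} \quad 0 \le \alpha_{i} \le U, \; i = 1, \dots, p,
```

with the primal solution recovered as *w*=(*AX*)^{T}α.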

## Acknowledgments

This work was supported in part by the National Science Council of Taiwan (grant 101-2221-E-002-199-MY3). We thank Chia-Hua Ho for discussion and for inspiring the selection tree data structure. We also thank Hsuan-Tien Lin, Yuh-Jye Lee, and the anonymous reviewers for valuable comments.

## References

## Notes

^{1}

See, for example, Yuan, Ho, and Lin (2012) for more detailed discussion.

^{2}

Note that is likely to be dense even if each *x*_{j} is sparse.

^{4}

The number of queries shown in Table 2 is fewer because we report only the training set statistics.

^{5}

Here *l* represents the number of testing data.

^{6}

In the implementation of , the formulation is scaled so the regularization parameter is .

^{7}

For the function value, parameters selected using (validation) pairwise accuracy are considered, but results of using NDCG are similar.

^{8}

For /+, we use CG iterations rather than outer Newton iterations because each CG iteration has a complexity similar to that of a cutting-plane iteration.