Abstract

Linear rankSVM is one of the widely used methods for learning to rank. Although its performance may be inferior to nonlinear methods such as kernel rankSVM and gradient boosting decision trees, linear rankSVM is useful to quickly produce a baseline model. Furthermore, following its recent development for classification, linear rankSVM may give competitive performance for large and sparse data. A great deal of work has studied linear rankSVM, with the focus on computational efficiency when the number of preference pairs is large. In this letter, we systematically study existing works, discuss their advantages and disadvantages, and propose an efficient algorithm. We discuss different implementation issues and extensions with detailed experiments. Finally, we develop a robust linear rankSVM tool for public use.

1.  Introduction

Learning to rank is an important supervised learning technique because of its application to search engines and online advertisement. According to Chapelle and Chang (2011) and others, state-of-the-art learning to rank models can be categorized into three types. Pointwise methods (e.g., decision tree models and linear regression) directly learn the relevance score of each instance; pairwise methods like rankSVM (Herbrich, Graepel, & Obermayer, 2000) learn to classify preference pairs; and listwise methods such as LambdaMART (Burges, 2010) try to optimize the measurement for evaluating the whole ranking list. Some methods lie between two categories; for example, GBRank (Zheng, Chen, Sun, & Zha, 2007) combines pointwise decision tree models and pairwise loss. Among them, rankSVM, as a pairwise approach, is one commonly used method. This method is extended from the standard support vector machine (SVM) by Boser, Guyon, and Vapnik (1992) and Cortes and Vapnik (1995). In the SVM literature, it is well known that linear (i.e., data are not mapped to a different space) and kernel SVMs are suitable for different scenarios, where linear SVM is more efficient, but the more costly kernel SVM may give higher accuracy.1 The same situation may occur for rankSVM. In this letter, we study large-scale linear rankSVM.

Assume we are given a set of training label-query-instance tuples (y_i, q_i, x_i), i = 1, …, l, where y_i ∈ K, q_i ∈ Q, and x_i ∈ R^n; K is the set of possible relevance levels with |K| = k, and Q is the set of queries. By defining the set of preference pairs as

P ≡ {(i, j) | q_i = q_j, y_i > y_j},    (1.1)
L1-loss linear rankSVM minimizes the sum of training losses and a regularization term,
min_w  (1/2) w^T w + C Σ_{(i,j)∈P} max(0, 1 − w^T(x_i − x_j)),    (1.2)
where C>0 is a regularization parameter. If L2 loss is used, the optimization problem becomes
min_w  (1/2) w^T w + C Σ_{(i,j)∈P} max(0, 1 − w^T(x_i − x_j))^2.    (1.3)
In prediction, for any test instance x, a larger wTx implies that x should be ranked higher. In Table 1, we list the notation used in this letter.
Table 1:
Notation.
Notation    Explanation
w The weight vector obtained by solving problem 1.2 or 1.3 
xi The feature vector of the ith training instance 
yi Label of the ith training instance 
qi Query of the ith training instance 
K The set of relevance levels 
Q The set of queries 
P The set of preference pairs; see equation 1.1 
l Number of training instances 
k Number of relevance levels 
p Number of preference pairs 
n Number of features 
n̄ Average number of nonzero features per instance 
lq Number of training instances in a given query 
kq Number of relevance levels in a given query 
T An order-statistic tree 
The sum of training losses can be written in the following separable form:
formula
Because each query involves an independent training subset, the outer summation over all queries can be easily handled. Therefore, in our discussion, we assume a single query in the training set. Hence, if on average l/k instances have the same relevance level, the number of pairs in P is
p = (k(k − 1)/2) (l/k)^2 = ((k − 1)/(2k)) l^2 = O(l^2).    (1.4)
The large number of pairs becomes the main difficulty in training rankSVM. Many existing studies have attempted to address this difficulty. By exploiting the property that
formula
1.5
it is possible to avoid the O(l^2) complexity of going through all pairs in calculating the objective function, gradient, or other information needed in the optimization procedure. Interestingly, although existing works apply different optimization methods, their approaches to avoid considering the O(l^2) pairs are closely related. Next, we briefly review some recent results. Joachims (2006) solves problem 1.2 by a cutting plane method in which an
O(ln̄ + l log l + lk + n)    (1.6)

method is proposed to calculate the objective value and a subgradient of problem 1.2. The O(ln̄) cost is for calculating Xw, where n̄ is the average number of nonzero features per training instance; O(l log l + lk) is for the sum of training losses in problem 1.2; and O(n) is for the regularization term w^T w/2. This method is efficient if k (number of relevance levels) is small but becomes inefficient when k=O(l). Airola, Pahikkala, and Salakoski (2011) improve on Joachims's work by reducing the complexity to
O(ln̄ + l log l + l log k + n).
The main breakthrough is that they employ order-statistic trees, which are extended from balanced binary search trees, so the O(lk) term in equation 1.6 is reduced to O(l log k). Another type of optimization method considered is the truncated Newton method for solving problem 1.3, in which the main computation is on Hessian-vector products. Chapelle and Keerthi (2010) showed that if k=2 (i.e., only two relevance levels), the cost of each function, gradient, or Hessian-vector product evaluation is O(ln̄ + l log l). Their method is related to that by Joachims (2006) because we can see that the O(lk) term in equation 1.6 can be removed if k=2. For k>2, they decompose the problem into k−1 subproblems, each with only two relevance levels. Therefore, similar to Joachims's approach, Chapelle and Keerthi's approach may not be efficient for larger k. Regarding optimization methods, an advantage of Newton methods is the faster convergence. However, they require the differentiability of the objective function, so L2 loss must be used. In contrast, cutting plane methods are applicable to both L1 and L2 losses.

Although linear rankSVM is an established method, it is known that gradient boosting decision trees (GBDT) by Friedman (2001) and its variant, LambdaMART (Burges, 2010), give competitive performance on web search ranking data. In addition, random forests (Breiman, 2001) are also reported in Mohan, Chen, and Weinberger (2011) to perform well. Actually all the winning teams of the Yahoo Learning to Rank Challenge (Chapelle & Chang, 2011) use decision-tree-based ensemble models. Note that GBDT and random forests are nonlinear pointwise methods, and LambdaMART is a nonlinear listwise method. Their drawback is the longer training time. We will conduct experiments to compare the performance and training time between linear and nonlinear ranking methods.

In this letter, we consider Newton methods for solving problem 1.3 and present the following results:

  1. We give a clear overview of past works on the efficient calculation over all relevance pairs and the connections among them.

  2. We investigate several order-statistic tree implementations and show their advantages and disadvantages.

  3. We provide an efficient implementation that is faster than existing linear rankSVM methods.

  4. We compare linear rankSVM and linear and nonlinear pointwise methods, including GBDT and random forests, in detail.

  5. We release a public tool for linear rankSVM.

This letter is organized as follows. Section 2 introduces methods for the efficient calculation of all relevance pairs. Section 3 discusses past studies of linear rankSVM and compares them with our method. Various types of experiments are shown in section 4. In section 5 we discuss another possible algorithm for rankSVMs when k is large. Section 6 concludes. A supplementary file including additional analysis and experiments is available online at http://www.mitpressjournals.org/doi/suppl/10.1162/NECO_a_00571.

2.  Efficient Calculation over Relevance Pairs

A difficulty in training rankSVM is that the number of pairs in the loss term can be as large as O(l^2). This difficulty occurs in any optimization method that needs to calculate the objective value. Furthermore, other values used in optimization procedures, such as subgradients, gradients, or Hessian-vector products, face the same difficulty. In this section, we consider truncated Newton methods as an example and investigate efficient methods for the calculation over pairs.

2.1.  Information Needed in Optimization Procedures and an Example Using Truncated Newton Methods.

Many optimization methods employ gradient or even higher-order information at each iteration of an iterative procedure. From problems 1.2 and 1.3, it is clear that the summation over the p pairs remains in the gradient and the Hessian. Therefore, the difficulty of handling O(l^2) pairs extends beyond the calculation of the objective value. Here we consider the truncated Newton method as an example to see what kind of information it requires.

A Newton method obtains a direction at the tth iteration by solving

min_s g_t(s),

where

g_t(s) ≡ ∇f(w_t)^T s + (1/2) s^T ∇²f(w_t) s,    (2.1)

and updates w_t by

w_{t+1} = w_t + s.
Note that g_t(s) is the second-order Taylor approximation of f(w_t + s) − f(w_t). If ∇²f(w_t) is invertible, step s is obtained by solving the following linear system:

∇²f(w_t) s = −∇f(w_t).    (2.2)
To ensure convergence, usually a line search or a trust region technique is applied to obtain a truncated Newton step. For machine learning applications, ∇²f(w_t) is often too large to be stored, so the conjugate gradient (CG) method is a common way to solve equation 2.2 by taking the special structure of ∇²f(w_t) into account (Keerthi & DeCoste, 2005; Lin, Weng, & Keerthi, 2008). At each CG iteration, we only need to calculate a Hessian-vector product,

∇²f(w_t) v,

and it can be performed without storing the Hessian matrix. Then the algorithm contains two levels of iterations. The outer level generates {w_t}, while from w_t to w_{t+1} there are inner CG iterations. Algorithm 1 gives the framework of truncated Newton methods.

The discussion shows that in a truncated Newton method, the calculations of objective value, gradient, and Hessian-vector product all face the difficulty of handling p pairs. In the rest of this section, we discuss the efficient calculation of these values.
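The two-level iterative structure above can be sketched as follows. This is a generic, minimal sketch rather than the letter's algorithm 1: a plain CG inner loop plus a simple backtracking step, where `hess_vec_at` stands for whatever Hessian-vector product routine is supplied; the function names and tolerances are illustrative assumptions.

```python
import numpy as np

def cg_solve(hess_vec, g, max_cg=250, tol=0.1):
    """Approximately solve (Hessian) s = -g using only Hessian-vector products."""
    s = np.zeros_like(g)
    r = -g.copy()                      # residual of the linear system at s = 0
    d = r.copy()
    rr = r @ r
    gnorm = np.linalg.norm(g)
    for _ in range(max_cg):
        if np.sqrt(rr) <= tol * gnorm:
            break
        Hd = hess_vec(d)
        alpha = rr / (d @ Hd)
        s += alpha * d
        r -= alpha * Hd
        rr_new = r @ r
        d = r + (rr_new / rr) * d
        rr = rr_new
    return s

def truncated_newton(fun, grad, hess_vec_at, w0, max_outer=100, eps=1e-6):
    """Outer Newton iterations; each inner solve uses CG (two levels of iterations)."""
    w = w0.copy()
    g0norm = np.linalg.norm(grad(w0))
    for _ in range(max_outer):
        g = grad(w)
        if np.linalg.norm(g) <= eps * max(g0norm, 1.0):
            break
        s = cg_solve(lambda d: hess_vec_at(w, d), g)
        step = 1.0                     # simple backtracking to ensure a decrease
        while fun(w + step * s) > fun(w) and step > 1e-12:
            step *= 0.5
        w = w + step * s
    return w
```

On a strictly convex quadratic, this sketch recovers the exact minimizer in a few outer iterations because CG solves the Newton system essentially exactly.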

2.2.  Efficient Function and Gradient Evaluation and Matrix-Vector Products.

We consider L2-loss linear rankSVM (see problem 1.3) and discuss details of calculating the function, gradient, and Hessian-vector products. These are the computational bottlenecks in Newton methods. In the rest of this letter, if not specified, f(w) represents the objective function of problem 1.3.
formula
To indicate the preference pairs, we define a p by l matrix:
formula
That is, if , then a corresponding row in A has that the ith entry is 1, the jth entry is −1, and other entries are all zeros. By this definition, the objective function in problem 1.3 can be written as
f(w) = (1/2) w^T w + C (e − AXw)^T D_w (e − AXw),    (2.3)

where e ∈ R^p is a vector of ones, X ≡ [x_1, …, x_l]^T, and D_w is a p-by-p diagonal matrix such that, for all (i, j) ∈ P,

(D_w)_{(i,j),(i,j)} ≡ 1 if 1 − w^T(x_i − x_j) > 0, and 0 otherwise.
The gradient is
∇f(w) = w + 2C X^T A^T D_w (AXw − e).    (2.4)
However, ∇²f(w) does not exist because equation 2.4 is not differentiable. Following Mangasarian (2002) and Lin et al. (2008), we define a generalized Hessian matrix,

∇²f(w) ≡ I + 2C X^T A^T D_w A X,    (2.5)

where I is the identity matrix.
The main computation at each CG iteration is a Hessian-vector product. For any vector v ∈ R^n, the truncated Newton method by Chapelle and Keerthi (2010) calculates

∇²f(w) v = v + 2C X^T (A^T (D_w (A (X v)))).    (2.6)
Because both A and D_w have O(p) nonzero elements, the complexity of calculating equation 2.6 is

O(ln̄ + p + n).    (2.7)
The right-to-left matrix-vector products in equation 2.6 are faster than if we obtain and store the matrix X^T A^T D_w A X.
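For reference, the right-to-left product of equation 2.6 can be written down naively by looping over the pairs; the variable names below are ours, and the constant 2C follows the L2-loss objective of problem 1.3. This O(p) version is exactly what the remainder of this section improves upon.

```python
import numpy as np

def pref_pairs(y, q):
    """P from equation 1.1: pairs (i, j) with the same query and y_i > y_j."""
    l = len(y)
    return [(i, j) for i in range(l) for j in range(l)
            if q[i] == q[j] and y[i] > y[j]]

def hess_vec_naive(X, pairs, C, w, v):
    """(I + 2C X^T A^T D_w A X) v, computed by looping over the O(p) pairs."""
    Xw, Xv = X @ w, X @ v
    Hv = v.astype(float).copy()
    for i, j in pairs:
        if 1.0 - (Xw[i] - Xw[j]) > 0.0:       # pair in SV(w): its D_w entry is 1
            Hv += 2.0 * C * (Xv[i] - Xv[j]) * (X[i] - X[j])
    return Hv
```

The loop body is the per-pair contribution of X^T A^T D_w A X v: each row of A contributes (x_i − x_j) times the scalar (Xv)_i − (Xv)_j when the pair is a support vector.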
However, if p = O(l^2), not only is the cost of equation 2.6 still high, but the storage of A and D_w also requires a huge amount of memory. To derive a faster method, it is essential to explore the structure of the generalized Hessian. We define

SV(w) ≡ {(i, j) ∈ P | 1 − w^T(x_i − x_j) > 0}.    (2.8)

We show in appendix A that when problem 1.3 is treated as an SVM classification problem with feature vectors x_i − x_j and labels being 1 for all (i, j) ∈ P, the set SV(w) corresponds to the support vectors.
We then remove the matrix D_w by defining a new matrix A_w such that

A_w^T A_w = A^T D_w A,

where A_w includes the rows of A corresponding to pairs (i, j) ∈ SV(w). Thus, equation 2.6 becomes

∇²f(w) v = v + 2C X^T (A_w^T (A_w (X v))).    (2.9)
Observe that
formula
2.10
Because each row of A_w contains two nonzero elements, a nonzero entry of A_w^T A_w occurs only under the following situations:
formula
We define
formula
Then from equation 2.10,
formula
Hence,
formula
Therefore,
formula
2.11
If we already have the values of l+_i(w), l−_i(w), and the related sums, the computation of the Hessian-vector product in equation 2.9 would cost just O(ln̄ + n), where O(ln̄) is for computing equation 2.11 and O(n) is for the vector addition in equation 2.9.
Similarly, function and gradient evaluations can be more efficient by reformulating equations 2.3 and 2.4 to the following forms, respectively:
formula
2.12
and
formula
2.13
where the vector of ones here has length |SV(w)|. In equations 2.12 and 2.13, A_w^T A_w X w can be calculated by equation 2.11. We also have
formula
Thus, the computation of both equations 2.12 and 2.13 costs O(ln̄ + n) as well.
Note that for solving problem 1.2 by cutting plane methods, Joachims (2006) and Airola et al. (2011) have identified that l+_i(w) and l−_i(w) are needed for efficient function and subgradient evaluation (see more details in section 3.2). Therefore, regardless of the optimization method used, an important common task is to efficiently calculate l+_i(w), l−_i(w), and the related sums. In the supplementary materials, we discuss in detail a direct method that costs O(l+k) space excluding the training data and
O(ln̄ + lk + n)    (2.14)

time for one matrix-vector product. Although this cost is lower than that of equation 2.6, the O(lk) complexity is still high if k is large. Subsequently, we discuss methods to reduce the O(lk) term to O(l log k).

2.3.  Efficient Calculation by Storing Values in an Order-Statistic Tree.

Airola et al. (2011) calculate l+_i(w) and l−_i(w) by an order-statistic tree, so the O(lk) term in equation 2.14 is reduced to O(l log k). The optimization method used is a cutting plane method (Teo, Vishwanathan, Smola, & Le, 2010), which calculates function and subgradient values. Our procedure here is an extension because in Newton methods, we further need Hessian-vector products. Notice that Airola et al. (2011) considered problem 1.2; we solve problem 1.3 and in addition require the computation of the sums used in equation 2.11.

To find l+_i(w), we must count the cardinality of the following set:

{j | y_j > y_i, w^T x_j < w^T x_i + 1}.

The main difficulty is that both the order of y_i and the order of w^T x_i are involved. We can first sort w^T x_i in ascending order. For an easy description, we assume that

w^T x_1 ≤ w^T x_2 ≤ ⋯ ≤ w^T x_l.    (2.15)
We observe that if elements in

{j | w^T x_j < w^T x_i + 1}    (2.16)

have been properly arranged in an order-statistic tree T by the value of y_j, then l+_i(w) can be obtained in O(log k) time. Consider the following example:
formula
When i=1, we assume
formula
2.17
We construct a tree in Figure 1a so that each node includes
formula
2.18
where and nodes are arranged according to the keys (i.e., yj values). For each node, we ensure that its right child has a larger key than its left child and the node itself. Clearly, for any node y,
formula
2.19
By j ∈ T, we mean that instance j has been inserted into the tree T.
Figure 1:

An illustration of using an order-statistic tree to calculate l+_i(w).

To find l+_i(w), we traverse from the root of T to the node y_i by observing that
formula
Therefore, once a tree for the set 2.16 has been constructed, we can define the following function:
formula
2.20
We explain that equation 2.20 uses the idea of equation 1.5 on y. In the last case of equation 2.20, for any node t in the left subtree of y,
formula
by the way we construct the tree. Then with
formula
we have
formula
so nodes on the left are not considered. Using equation 2.20, we have
formula
An example of traversing the tree to find l+_1(w) is in Figure 1b.
Once l+_i(w) has been calculated, we insert the following instances into the tree:
formula
Then, l+_{i+1}(w) can be calculated in the same way. The calculation of l−_i(w) is similar. Because
formula
we start from l and maintain a tree of the following set:
formula
We then define a function Smaller, similar to Larger, to obtain l−_i(w).
For the calculation of the related sums, notice that
formula
2.21
and
formula
At each node of the tree, we can store a value so that it follows a relation like equation 2.19:
formula
2.22
The function Larger, defined in equation 2.20, can be directly extended to output both l+_i(w) and the corresponding sum.
formula
Details of the overall procedure are presented in algorithm 2. For binary search trees, we can consider, for example, the AVL tree (Adelson-Velsky & Landis, 1962), the red-black tree (Bayer, 1972), and the AA tree (Andersson, 1993). These trees are reasonably balanced, so each insertion and each computation of the Larger/Smaller functions costs O(log k) (see more discussion in section 2.5). If w^T x_i have been sorted before CG iterations, each matrix-vector product involves
O(ln̄ + l log k + n)    (2.23)

operations, which is smaller than equation 2.14 because the lk term is reduced to l log k. Therefore, the cost of truncated Newton methods using order-statistic trees is
formula
where the O(llog l) term is the cost of sorting.

Our algorithm constructs a tree for each matrix-vector product (or each CG iteration) because of the change of the vector v in equation 2.11. Thus, an outer iteration of the truncated Newton method requires constructing several trees. If we instead store, at each node, vector sums that are independent of v, only one tree is needed per outer iteration. However, because a vector is stored at a node, each update requires O(n) cost. The total cost of maintaining the tree is O(ln log k) because each insertion requires O(log k) updates. This is bigger than O(l log k) for a tree storing scalar values. Further, we need O(ln) space to store the vectors.2 Because the number of matrix-vector products is often not large, storing scalar values is more suitable.

Besides, instead of sorting w^T x_i and using y_i as the keys, we may alternatively sort y_i such that
formula
2.24
and, for each i, maintain a tree T of the following set:
formula
Then we can apply the same approach as above. An advantage of this approach is that y_i are fixed and need to be sorted only once in the training procedure. However, w^T x_i become the keys of nodes and are in general all different, so the tree will eventually contain O(l) rather than O(k) nodes. Therefore, this approach is less preferred because maintaining a smaller tree is essential.
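To make the counting procedure concrete, here is a sketch that computes all l+_i(w) in O(l log l + l log k) time. A Fenwick (binary indexed) tree over the k relevance levels stands in for the balanced order-statistic tree, and we assume the definition l+_i(w) = |{j : y_j > y_i, w^T x_j < w^T x_i + 1}|, matching the insertion threshold described above; extending the nodes to also accumulate the sums needed for equation 2.11 is straightforward.

```python
import numpy as np

class Fenwick:
    """Counts per relevance level with O(log k) update and prefix query."""
    def __init__(self, k):
        self.t = [0] * (k + 1)
    def add(self, r):                        # relevance level r in 1..k
        while r < len(self.t):
            self.t[r] += 1
            r += r & (-r)
    def prefix(self, r):                     # count of inserted levels <= r
        s = 0
        while r > 0:
            s += self.t[r]
            r -= r & (-r)
        return s

def all_l_plus(scores, levels, k):
    """l+_i = |{j : y_j > y_i and s_j < s_i + 1}| for all i (assumed definition)."""
    l = len(scores)
    order = np.argsort(scores)               # ascending w^T x_i
    tree, res, j, inserted = Fenwick(k), [0] * l, 0, 0
    for i in order:                          # threshold s_i + 1 only grows, so the
        while j < l and scores[order[j]] < scores[i] + 1.0:   # pointer j never backtracks
            tree.add(levels[order[j]])
            j += 1
            inserted += 1
        res[i] = inserted - tree.prefix(levels[i])   # inserted instances with level > y_i
    return res
```

Instance i itself always enters the tree (its score is below its own threshold), but it never inflates the count because its level is not larger than y_i.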

2.4.  A Different Implementation by Storing Keys in Leaves of a Tree.

Although the method in section 2.3 successfully reduces the complexity, maintaining the trees makes the implementation more complicated. In this section, we consider a simpler method that doubles the size of the tree and stores all instances in leaf nodes. This setting is similar to a selection tree (Knuth, 1973). We ensure that the k leaf nodes from left to right correspond to the ascending order of relevance levels. At a leaf node, we record the size and the accumulated value (see equation 2.22) of a relevance level. For each internal node, which is the root of a subtree, its size and accumulated value are both the sums of those attributes of its children. For the same example considered in section 2.3, the tree at i=1 is shown in Figure 2. To compute l+_i(w), we know that
formula
Therefore, for any node s in the tree, we can define:
formula
and let
formula
An illustration of finding l+_1(w) is in Figure 3. The procedures for obtaining l−_i(w) and the related sums are all similar.
Figure 2:

An example of storing y_j in leaf nodes.


Figure 3:

An example of finding l+_1(w) when storing y_j in leaf nodes. Note that we assume y_1 = 4.


An advantage of this approach is that because all y_j are stored in leaves, it is easier to maintain the trees. (See more discussion in section 2.5.) In section 4.3, we experimentally compare this approach with the method in section 2.3.

2.5.  A Discussion on Tree Implementations.

For the method in section 2.3, where each node has a key, we can consider balanced binary search trees such as the AVL tree, the red-black tree, and the AA tree. AVL trees use more complicated insertion operations to ensure balance. Consequently, insertion is slower, but the order-statistic computation is usually faster compared with other order-statistic trees. In the comparison by Heger (2004), an AA tree tends to be more balanced and faster than a red-black tree. However, previous studies also consider node deletions, which are not needed here, so we conduct an experiment in section 4.3.

For the method in section 2.4 that stores keys in leaves, the selection tree is a suitable data structure. Note that selection trees were mainly used for sorting, but using one as a balanced binary search tree is a straightforward adaptation. An implementation method introduced in Knuth (1973) is to map the k possible y_i values to the leaves and index the nodes so that, for any node m, its parent is node ⌊m/2⌋. Moreover, if m is an odd number, then it is a right child, and vice versa. By this method, we do not need pointers for constructing the tree, and thus the implementation is very simple. Another advantage is that this tree is fully balanced, so each leaf is of the same depth.
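The pointer-free layout can be sketched directly; this is our illustrative rendering, padding the k levels to the next power of two so every leaf has the same depth. Node m's parent is m//2, and an odd index is a right child, so answering "how many inserted instances have level greater than r" means climbing from leaf to root and adding the count of each right sibling passed on the way up.

```python
class SelectionTree:
    """Array-based selection tree over k relevance levels (no pointers)."""
    def __init__(self, k):
        self.base = 1 << max(1, (k - 1).bit_length())  # leaves, padded to a power of two
        self.count = [0] * (2 * self.base)             # 1-based implicit heap layout
    def insert(self, r):                               # relevance level r in 1..k
        m = self.base + r - 1                          # leaf index of level r
        while m >= 1:
            self.count[m] += 1                         # every ancestor sums its children
            m //= 2
    def larger(self, r):
        """Number of inserted instances with level > r, in O(log k)."""
        m = self.base + r - 1
        c = 0
        while m > 1:
            if m % 2 == 0:                             # m is a left child, so its right
                c += self.count[m + 1]                 # sibling holds strictly larger levels
            m //= 2
        return c
```

A `smaller` query is symmetric (add the left sibling when the current node is a right child), and replacing the counters with (count, accumulated value) pairs supports the sums needed in section 2.3.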

3.  Comparison with Existing Methods

In this section, we introduce recent studies of linear rankSVM that are considered state of the art. Some of them were mentioned in section 2 in comparison with our proposed methods.

3.1.  PRSVM and PRSVM+.

We have discussed PRSVM by Chapelle and Keerthi (2010) at the beginning of section 2.2. The complexity shown in equation 2.7 has a term O(p), which becomes dominant for large p. To reduce the cost, Chapelle and Keerthi (2010) proposed PRSVM+ for solving problem 1.3 by a truncated Newton method. They first consider the case of k=2 (i.e., two relevance levels). The algorithm for calculating l+_i(w), l−_i(w), and the related sums is related to Joachims (2005) and is a special case of a direct counting method discussed in the supplementary material. For the general situation, they observe that
formula
The inner sum is for a subset of data in two relevance levels (r and >r). Then the algorithm for two-level data can be applied. When we replace the O(lk) term in equation 2.14 with O(size of each two-level set), the complexity of each matrix-vector product is
formula
3.1
If each relevance level contains about the same number, O(l/k), of instances, equation 3.1 becomes
formula
3.2
which is larger than the cost of the approach using order-statistic trees.

3.2.  TreeRankSVM.

Joachims (2006) uses a cutting plane method to optimize problem 1.2. Airola et al. (2011) improve on Joachims's work and release a package, TreeRankSVM.

Here we follow Teo et al. (2010, section 2) to describe the cutting plane method for minimizing a function
formula
where L(w) is the loss term. Let w_t be the solution obtained at the tth iteration. The first-order Taylor approximation of L(w) is used to build a cutting plane a_t^T w + b_t at w = w_t,
formula
where
formula
If L(w) is nondifferentiable, then a subgradient is used for a_t. The cutting plane method maintains all cutting planes obtained so far to form a lower-bound function for L(w),
formula
and obtains wt+1 by solving
formula
3.3
For rankSVM, if problem 1.2 is considered, a subgradient of its loss term is
formula
3.4
The function value also needs to be evaluated during the optimization procedure.
formula
For obtaining l+_i(w) and l−_i(w), Joachims (2006) uses a direct counting method; the complexity at each iteration is shown in equation 1.6. We leave the details to the supplementary materials. As mentioned in section 2.3, the main improvement Airola et al. (2011) made is to use order-statistic trees so that the O(lk) term in calculating l+_i(w) and l−_i(w) is reduced to O(l log k). In particular, red-black trees were adopted in their work. The overall cost is
formula

3.3.  sofia-ml.

Sculley (2009) proposed sofia-ml to solve problem 1.2. It is a stochastic gradient descent (SGD) method that randomly draws a preference pair from the training set at each iteration and uses a subgradient on this pair to update w. This method does not consider the special structure of the loss term. For going through the whole training set once, the cost is O(pn̄), which is worse than that of the other methods discussed. Therefore, we do not include this method in our experiments.

In contrast, SGD is one of the state-of-the-art methods for linear SVM (e.g., Shalev-Shwartz, Singer, & Srebro, 2007). The main reason is that special methods that consider the structure of the loss term have not been available there.
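For completeness, the pair-sampling scheme can be sketched as follows. This is our own minimal rendition, not sofia-ml's actual code; the step-size schedule and the per-step objective (regularizer plus C times one pair's hinge loss) are illustrative choices.

```python
import random
import numpy as np

def sgd_ranksvm(X, pairs, C, epochs=5, lr=0.1, seed=0):
    """Pairwise SGD sketch: sample one preference pair per step and take a
    subgradient step on the regularizer plus that pair's hinge loss."""
    random.seed(seed)
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs * len(pairs)):
        t += 1
        i, j = random.choice(pairs)            # (i, j) in P: x_i preferred over x_j
        d = X[i] - X[j]
        g = w.copy()                           # subgradient of (1/2) w^T w
        if 1.0 - w @ d > 0.0:                  # hinge loss active on this pair
            g -= C * d
        w -= (lr / np.sqrt(t)) * g             # decaying step size
    return w
```

Note that each step touches only one pair, so no structure of the loss over all p pairs is exploited, which is the drawback discussed above.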

4.  Experiments

In this section, we begin by describing the details of a truncated Newton implementation of our approach. The first experiment is to evaluate methods discussed in section 2. In particular, the speed of different implementations of order-statistic trees is examined. Next, we compare state-of-the-art methods for linear rankSVM with the proposed approach. Then we conduct an investigation of the performance difference between linear rankSVM and pointwise methods. Finally, an experiment on sparse data is shown. (Programs used for experiments can be found online at http://www.csie.ntu.edu.tw/~cjlin/liblinear/exp.html.)

4.1.  Implementation Using a Trust Region Newton Method.

In our implementation of the proposed approach, we consider a trust region Newton method (TRON), that is, a truncated Newton method as discussed in section 2.1. For details of trust region methods, a comprehensive book is by Conn, Gould, and Toint (2000). Here we mainly follow the settings in Lin and Moré (1999) and Lin et al. (2008).

TRON adopts the trust region technique to obtain the truncated Newton step. It finds the direction s by minimizing g_t(s) in equation 2.1 over a region that we trust:
min_s g_t(s)  subject to  ‖s‖ ≤ Δ_t,    (4.1)
where Δ_t is the size of the trust region. After solving problem 4.1, TRON decides whether to apply the obtained direction s_t according to the approximate function reduction g_t(s_t) and the real function decrease; that is,

w_{t+1} = w_t + s_t if ρ_t > η0, and w_{t+1} = w_t otherwise,    (4.2)
where η0 > 0 is a prespecified parameter and

ρ_t ≡ (f(w_t + s_t) − f(w_t)) / g_t(s_t).

TRON then adjusts the trust region according to ρ_t. When ρ_t is too small, Δ_t is decreased. Otherwise, if ρ_t is large enough, TRON increases the value of Δ_t. More specifically, the following rule is considered:
formula
4.3
where
formula

For linear classification, Lin et al. (2008) apply the approach of Steihaug (1983) to run CG iterations until either a minimum of gt(s) is found or s touches the boundary of the trust region. We consider the same setting in our implementation.

Regarding the stopping condition, we follow that of TRON in the package LIBLINEAR (Fan, Chang, Hsieh, Wang, & Lin, 2008) to check whether the gradient is relatively smaller than the initial gradient:
‖∇f(w)‖ ≤ ε_s ‖∇f(w_0)‖,    (4.4)
where w_0 is the initial iterate and ε_s is the stopping tolerance given by users. By default, we let w_0 be the zero vector.
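A textbook version of the acceptance test and radius update can be sketched as follows; the specific thresholds (0.25, 0.75) and factors (1/4, 2) are the classic choices from the trust region literature, not necessarily the exact constants of the rule in equation 4.3.

```python
def trust_region_step(f_old, f_new, model_decrease, delta, snorm,
                      eta0=1e-4, eta1=0.25, eta2=0.75,
                      shrink=0.25, expand=2.0):
    """Decide acceptance and the next radius from the ratio of actual to
    predicted reduction; model_decrease = -g_t(s_t) > 0 is the predicted one."""
    rho = (f_old - f_new) / model_decrease
    accept = rho > eta0                      # apply s_t only if f really decreased enough
    if rho < eta1:
        delta *= shrink                      # poor agreement between model and function
    elif rho > eta2 and snorm >= delta * (1 - 1e-12):
        delta *= expand                      # good full-length step: enlarge the region
    return accept, delta
```

A step that matches the model well and hits the boundary doubles the radius; a step whose actual decrease falls far short shrinks it, regardless of acceptance.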

4.2.  Experiment Setting.

We consider three sources of web search engine ranking data: LETOR 4.0 (Qin, Liu, Xu, & Li, 2010), MSLR,3 and Yahoo LTRC (Chapelle & Chang, 2011). Both LETOR 4.0 and MSLR are from Microsoft Research, while Yahoo LTRC is from the Yahoo Learning to Rank Challenge. From LETOR 4.0, we take four sets: MQ2007, MQ2008, MQ2007-list, and MQ2008-list. For MSLR, we take the set with the name 30k, which indicates the number of queries within it.4 Each set from LETOR 4.0 or MSLR consists of five segmentations, and we take the first fold. Yahoo LTRC contains two sets, and both are considered. The details of these data sets are listed in Table 2. Each set comes with training, validation, and testing sets; we use the validation set only for selecting the parameters of each model. For preprocessing, we linearly scale each feature of LETOR 4.0 and MSLR data sets to the range [0, 1], while the features of Yahoo LTRC data sets are already in this range.

Table 2:
Statistics of Training Data Sets.
Data Set            l          n    k      |Q|     p            Average kq/lq over Queries
MQ2007 fold 1       42,158     46   3      1,017   246,015      0.0546
MQ2008 fold 1       9,630      46   3      471     52,325       0.1697
MSLR 30k fold 1     2,270,296  136  5      18,919  101,312,036  0.0492
Yahoo LTRC set 1    473,134    519  5      19,944  5,178,545    0.2228
Yahoo LTRC set 2    34,815     596  5      1,266   292,951      0.1560
MQ2007-list fold 1  743,790    46   1,268  1,017   285,943,893
MQ2008-list fold 1  540,679    46   1,831  471     323,151,792

Notes: All data sets are dense (i.e., n̄ ≈ n). In the last column, lq and kq are the numbers of instances and of relevance levels in query q, respectively. See Table 1 for the meaning of other columns.

All experiments are conducted on a 64-bit machine with Intel Xeon 2.5 GHz CPU (E5504), 12 MB cache, and 16 GB memory.

4.3.  A Comparison Between Methods in Section 2: A Direct Counting Method and Different Order-Statistic Trees.

We solve problem 1.3 using TRON and compare the following methods for calculating l+_i(w), l−_i(w), and the related sums:

  • direct-count: the direct counting method mentioned at the end of section 2.2 (see details in the supplementary material)

  • y-rbtree: the red-black tree using y_i as the key of nodes (see section 2.3)

  • wTx-rbtree: the red-black tree using w^T x_i as the key of nodes (see section 2.3)

  • selectiontree: the selection tree that stores keys in leaf nodes (see section 2.4)

  • y-avltree: the same as y-rbtree, except the order-statistic tree used is the AVL tree (see section 2.5)

  • y-aatree: the same as y-rbtree, except the order-statistic tree used is the AA tree (see section 2.5)

The trust region Newton method, written in C/C++, is extended from the implementation in LIBLINEAR. To compare the convergence speed, we investigate the relative difference to the optimal function value,

(f(w) − f(w*)) / f(w*),

where w* is the optimum of problem 1.3. We run the optimization algorithm long enough to obtain an approximation of w*.

We take four data sets and set C=1. The results of training time versus function values are shown in Figure 4. We also draw a horizontal line in the figure to indicate that the default stopping condition of using equation 4.4 has been satisfied. Experiments in section 4.4 will show that solutions obtained below this line have ranking performance similar to that of the optimum. From the figures, the method of direct counting is slow when k (number of relevance levels) is large. This result is expected following the complexity analysis in equation 2.14. In addition, all implementations of order-statistic trees perform very similarly, with only slightly different running times in the end. Therefore, we choose selection trees in subsequent experiments because of their simplicity.

Figure 4:

Comparison between different order-statistic tree implementations and the direct counting method. We present training time and relative difference to the optimal function value. C=1 is set for all four data sets. The dotted horizontal line indicates the function value of using the default stopping tolerance in equation 4.4.


4.4.  A Comparison Between Different Methods for Linear RankSVM.

We compare the following methods for linear rankSVM:

  • Our approach: TRON with selection trees, as described in sections 2 and 4.1.

  • PRSVM+ (Chapelle & Keerthi, 2010): This method was discussed in section 3.1. The authors did not release their code, so we make an implementation using the same framework as our approach. Therefore, we apply trust region rather than line search in their truncated Newton procedure.

  • TreeRankSVM (Airola et al., 2011): This method was discussed in section 3.2. We download version 0.1 from http://staff.cs.utu.fi/~aatapa/software/RankSVM/. Although this package is mainly implemented in Python, computationally intensive procedures such as red-black trees and the minimization of equation 3.3 are written in C/C++ or Fortran.

Note that the package of Airola et al. (2011) solves L1-loss rankSVM, but the other two consider L2 loss, so they have different optimal function values. We separately obtain their own optima and compute the corresponding relative difference to the optimal function value. For prediction performance, we first check normalized discounted cumulative gain (NDCG), which is widely used for comparing ranked lists in information retrieval tasks (Järvelin & Kekäläinen, 2002). Several definitions of NDCG are available, so we follow the recommendation of each data source. Assume m is a prespecified positive integer, π* is an ideal ordering with

y_{π*(1)} ≥ y_{π*(2)} ≥ … ≥ y_{π*(l_q)},

and π is the ordering being evaluated, where l_q is the number of instances in query q. Then

NDCG(π, m) = (1/N_m) Σ_{i=1}^{min(m, l_q)} (2^{y_{π(i)}} − 1) / log2(i + 1),    (4.5)

where

N_m = Σ_{i=1}^{min(m, l_q)} (2^{y_{π*(i)}} − 1) / log2(i + 1)

is the score of the ideal ordering; top-ranked instances are considered more important because of the larger weight 1/log2(i + 1). From equation 4.5, NDCG computes the relative score of the evaluated ordering to the ideal ordering. The choice of m follows the recommendation of each data source; for some sources, we use mean NDCG, the average of equation 4.5 over all positions:

mean NDCG = (1/l_q) Σ_{m=1}^{l_q} NDCG(π, m).

We then report the average over all queries.
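The NDCG computation above can be sketched in a few lines of Python. This is an illustrative reimplementation assuming the standard (2^y − 1) gain and log2(i + 1) discount; the function names are ours, and this is not the code of any of the compared packages.

```python
import math

def dcg_at_m(labels_in_order, m):
    # DCG of the first m positions: sum of (2^y - 1) / log2(i + 1).
    return sum((2 ** y - 1) / math.log2(i + 1)
               for i, y in enumerate(labels_in_order[:m], start=1))

def ndcg_at_m(scores, labels, m):
    # NDCG@m: DCG of the ordering induced by the scores over the ideal DCG N_m.
    order = sorted(range(len(labels)), key=lambda i: -scores[i])
    n_m = dcg_at_m(sorted(labels, reverse=True), m)  # ideal-ordering score N_m
    if n_m == 0:
        return 1.0  # convention when every gain is zero
    return dcg_at_m([labels[i] for i in order], m) / n_m

def mean_ndcg(scores, labels):
    # Mean NDCG: average of NDCG@m over m = 1..l_q (quadratic; for illustration).
    l = len(labels)
    return sum(ndcg_at_m(scores, labels, m) for m in range(1, l + 1)) / l
```

A perfect ranking gives NDCG = 1; any misordering lowers the score toward 0.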
We further consider pairwise accuracy as a measurement because it is directly related to the loss term of rankSVM. Pairwise accuracy is widely adopted in areas such as statistics and medical data analysis, under the names concordance index or Kendall's τ (Kendall, 1938):

pairwise accuracy = #{(i, j) : y_i > y_j and w^T x_i > w^T x_j} / p,

where p is the number of preference pairs. We adopt the algorithm of Christensen (2005), which uses an AVL tree to compute pairwise accuracy in O(l log l) time, but our implementation uses a selection tree.5
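The O(l log l) counting idea can be illustrated with a Fenwick (binary indexed) tree standing in for the AVL or selection tree; this is a sketch under our own naming, not the implementation of Christensen (2005) or of our tool. Instances are processed level by level, and the tree reports how many already-inserted lower-level instances received a strictly smaller score.

```python
from bisect import bisect_left
from itertools import groupby

class Fenwick:
    # Binary indexed tree over score ranks; point insert and prefix count.
    def __init__(self, n):
        self.t = [0] * (n + 1)
    def add(self, i):          # insert rank i (1-based)
        while i < len(self.t):
            self.t[i] += 1
            i += i & (-i)
    def prefix(self, i):       # number of inserted ranks <= i
        s = 0
        while i > 0:
            s += self.t[i]
            i -= i & (-i)
        return s

def pairwise_accuracy(scores, labels):
    # Fraction of preference pairs (y_i > y_j) with s_i > s_j, in O(l log l).
    # Score ties count as incorrectly ordered in this sketch.
    ranks = sorted(set(scores))
    bit = Fenwick(len(ranks))
    correct = total = inserted = 0
    for _, group in groupby(sorted(zip(labels, scores)), key=lambda p: p[0]):
        group = list(group)
        for _, s in group:     # query before inserting this relevance level
            r = bisect_left(ranks, s) + 1
            correct += bit.prefix(r - 1)  # lower level, strictly smaller score
            total += inserted             # every lower-level instance is a pair
        for _, s in group:
            bit.add(bisect_left(ranks, s) + 1)
        inserted += len(group)
    return correct / total
```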

For each evaluation criterion, we find the best regularization parameter by checking the validation set result.6 The selected regularization parameter C for each data set and each measurement is listed in Table 3. The results of comparing different approaches can be found in Figures 5 and 6. We present the relative difference to the optimal function value, pairwise accuracy, and (mean) NDCG.7 We also draw horizontal lines in the figures to indicate the default stopping condition.

Figure 5:

A comparison between different linear rankSVM methods on function values, pairwise accuracy, and NDCG. The x-axis is in log scale.


Figure 6:

A comparison between different linear rankSVM methods on function values and pairwise accuracy. The two sets have large k (number of relevance levels). NDCG cannot be used because the 2^y terms in equation 4.5 overflow. The x-axis is in log scale.


Table 3:
Best Regularization Parameter for Each Data Set and Each Measurement.
           Problem 1.2 Using L1 Loss      Problem 1.3 Using L2 Loss
Data Set   Pairwise Accuracy   NDCG       Pairwise Accuracy   NDCG
           2^-1                2^8        2^-5                2^-15
           2^8                 2^-6       2^7                 2^7
30k        NA                  NA         2^3                 2^3
set 1      NA                  NA         2^-14               2^1
set 2      2^-7                2^-4       2^-10               2^-10
           2^5                 NA         2^-12               NA
           2^-14               NA         2^-14               NA

Notes: When problem 1.2 with L1 loss is used, the solver failed to finish the parameter selection procedure on 30k and set 1 after a long running time. NDCG cannot be used for the two data sets with large k because the 2^y terms in equation 4.5 overflow.

One can observe from the figures that the convergence speed of the method in section 3.2 is slower than that of the other two approaches. To rule out implementation differences, we check, in the supplementary materials, iterations versus relative function value and test performance.8 Results still show that it is slower, so for linear rankSVM, methods using second-order information seem to be superior. Regarding the two second-order methods, Figure 5 shows that they are similar when the average k_q/l_q is small. However, from Figure 6, the method in section 3.1 is much slower if the number of relevance levels is large (i.e., large k_q/l_q). This result is expected from the complexity analysis in equations 2.23 and 3.2. Another observation is that the performances of the two second-order methods are stable after the default stopping condition is satisfied. Thus, tuning the tolerance in equation 4.4 is generally not necessary.

This experiment also serves as a comparison between L1- and L2-loss linear rankSVM. Results show that their performances (NDCG and pairwise accuracy) are similar.

Instead of using the best C after parameter selection, we also investigate the training time under a fixed C for all methods. We check situations where C is large, medium, and small by using C=100, C=1, and C=10^-4, respectively. The results are shown in Figure 7. We can observe that our method is always one of the fastest.

Figure 7:

A comparison between different linear rankSVM methods on relative function value with the same C. We set C to be 100, 1, and 10^-4.


4.5.  A Comparison Between Linear RankSVM, Linear Support Vector Regression, GBDT, and Random Forests.

We compare rankSVM using with the following pointwise methods:

  • Linear support vector regression (SVR) by Vapnik (1995). We check both L1-loss and L2-loss linear SVR provided in the package (version 1.92). Their implementation details can be found in Ho and Lin (2012). For L2-loss linear SVR, two implementations are available, solving the primal and dual problems, respectively. We use the one that solves the primal problem.

  • GBDT (Friedman, 2001). This is a nonlinear pointwise model that is known to be powerful for web search ranking problems. We use version 0.9 of the package pGBRT (Tyree, Weinberger, Agrawal, & Paykin, 2011) downloaded from http://machinelearning.wustl.edu/pmwiki.php/Main/Pgbrt.

  • Random forests (Breiman, 2001). This is another nonlinear pointwise model that performs well on web search data. We use version 1.0 of Rt-Rank downloaded from https://sites.google.com/site/rtranking.

For linear SVR, we set the ε-insensitive parameter as suggested by Ho and Lin (2012), who showed that this setting often works well. We then conduct the same parameter selection procedure as in section 4.4 to find the best regularization parameter C for each case, listed in Table 4. The training time, test NDCG, and test pairwise accuracy are shown in Table 5. We first observe that the performance of L1-loss SVR is worse than that of L2-loss SVR. The reason might be that L1 loss imposes a smaller training loss when the prediction error is larger than 1. Regarding L2-loss SVR and L2-loss rankSVM, their NDCG results are close, but rankSVM gives better pairwise accuracy. This result seems reasonable because rankSVM considers pairwise training losses. For training time, although the selected regularization parameters differ and the results are therefore not fully comparable, L2-loss SVR is in general faster. In summary, L2-loss SVR is competitive in terms of NDCG and training time, but rankSVM may still be useful if pairwise accuracy is the main concern.

Table 4:
Best Regularization Parameter for Each Data Set and Each Measurement of SVR.
           L1-Loss Linear SVR             L2-Loss Linear SVR
Data Set   Pairwise Accuracy   NDCG       Pairwise Accuracy   NDCG
           2^8                 2^8        2^-7                2^-11
           2^8                 2^8        2^-1                2^-4
30k        2^-1                2^2        2^-2                2^-2
set 1      2^-10               2^-5       2^-5                2^-2
set 2      2^-3                2^1        2^-5                2^4
           2^-9                NA         2^-15               NA
           2^4                 NA         2^-7                NA
Table 5:
Comparison Between rankSVM and SVR.
           L2-Loss RankSVM        L1-Loss SVR            L2-Loss SVR
Data Set   Time (s)    NDCG       Time (s)    NDCG       Time (s)    NDCG
           0.5         0.5211     23.9a       0.4756a    0.5         0.5157
           0.5         0.4571     3.4a        0.4153a    0.2         0.4450
30k        1601.6      0.4945     461.6       0.4742     202.4       0.4946
set 1      334.8       0.7616     10.8        0.7579     172.7       0.7642
set 2      11.2        0.7519     47.6        0.7470     20.8        0.7578

Data Set   Time (s)    Accuracy   Time (s)    Accuracy   Time (s)    Accuracy
           1.3         70.36%     23.9a       64.06%a    0.7         68.56%
           0.5         82.70%     3.4a        77.72%a    0.3         82.17%
30k        1601.6      61.52%     65.4        60.11%     202.4       60.49%
set 1      117.1       68.45%     2.4         67.82%     149.5       67.83%
set 2      11.2        69.74%     3.3         68.37%     14.5        69.39%
           38.7        80.71%     1.0         79.82%     5.0         79.70%
           16.6        82.11%     1.1         81.65%     6.7         81.85%

a. Reached the maximum number of iterations.

Next, we check GBDT and random forests. Their training time is long, so we do not conduct parameter selection. We consider a small number of trees and fix the parameters as follows. For GBDT, we use learning rate = 0.1, tree depth = 4, and 100 trees. For random forests, we fix the number of features sampled for splitting each node and use 40 trees. We further use eight cores to reduce the training time. The results are shown in Table 6. For the three smaller data sets, including set 2, we are able to train more trees in a reasonable time, so we present in Table 7 the results of using 1000 trees.

Table 6:
Performance of GBDT and Random Forests with a Small Number of Trees.
           Random Forests                    GBDT
Data Set   Time (s)   Accuracy   NDCG       Time (s)   Accuracy   NDCG
           14.8       66.16%     0.4959     1.4        69.78%     0.5182
           2.3        80.36%     0.4541     0.4        82.83%     0.4706
30k        5102.1     63.76%     0.5598     1339.3     62.77%     0.5375
set 1      1672.2     70.69%     0.7797     557.7      69.22%     0.7707
set 2      58.7       68.76%     0.7629     11.3       71.21%     0.7711
           606.0      78.78%     NA         106.8      79.85%     NA
           423.3      82.04%     NA         59.3       82.43%     NA

Note: Random forests: 40 trees; GBDT: 100 trees.

Table 7:
Performance of GBDT and Random Forests with 1000 Trees.
           Random Forests                    GBDT
Data Set   Time (s)   Accuracy   NDCG       Time (s)   Accuracy   NDCG
           345.3      69.07%     0.5221     13.7       67.58%     0.4892
           52.0       82.60%     0.4675     3.6        79.78%     0.4491
set 2      1406.9     71.91%     0.7801     108.7      71.70%     0.7720

From Tables 6 and 7, GBDT and random forests generally perform well, though they are not always better than linear rankSVM. For set 2, random forests achieves 0.78 NDCG using 1000 trees, much better than the 0.75 of linear rankSVM. This result is consistent with the fact that in the Yahoo! Learning to Rank Challenge, all top performers used decision-tree-based methods. However, the training cost of GBDT and random forests is in general higher than that of linear rankSVM. Therefore, linear rankSVM is useful to quickly provide a baseline result. We also note that the performance of GBDT with more trees is not always better than with fewer trees. This result seems to indicate that overfitting occurs and that parameter selection is important. In contrast, random forests is more robust. GBDT trains faster than random forests because random forests grows trees of unlimited depth, while we restrict the depth of GBDT trees to four.

Although pointwise methods perform well in this experiment, a potential problem is that they do not distinguish between different queries. It is unclear whether this situation causes problems.

4.6.  A Comparison Between Linear and Nonlinear Models on Sparse Data.

Recent research has shown that linear SVM is competitive with nonlinear SVM for classifying large and sparse data (Yuan et al., 2012). We conduct an experiment to check whether this property also holds for learning to rank. We consider rankSVM as the linear model, but for the nonlinear model, we use random forests rather than kernel rankSVM. One reason is that random forests was very robust in the previous experiment. We consider the following two CTR (click-through rate) estimation problems, which can be treated as regression or ranking problems:

  • A data set used in Ho and Lin (2012).

  • The processed data generated by the winning team (Wu et al., 2012) of KDD Cup 2012 track 2 (Niu et al., 2012). It contains about one-third of the original data. The task of this competition is online advertisement ranking evaluated by AUC, while the labels are the numbers of clicks and views. Note that pairwise accuracy reduces to AUC when k=2. We transform the labels into CTR (i.e., number of clicks over number of views).
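The remark that pairwise accuracy reduces to AUC when k = 2 can be checked with a tiny brute-force sketch (O(l^2) and purely illustrative; the helper names are ours):

```python
def pairwise_accuracy(scores, labels):
    # Brute-force fraction of preference pairs (y_i > y_j) with s_i > s_j.
    pairs = [s_i > s_j
             for s_i, y_i in zip(scores, labels)
             for s_j, y_j in zip(scores, labels)
             if y_i > y_j]
    return sum(pairs) / len(pairs)

def auc(scores, labels):
    # AUC as the Mann-Whitney statistic: P(a positive outscores a negative).
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return sum(p > n for p in pos for n in neg) / (len(pos) * len(neg))

labels = [1, 0, 1, 0, 0]            # k = 2 relevance levels
scores = [0.9, 0.8, 0.3, 0.2, 0.7]  # predicted scores
assert pairwise_accuracy(scores, labels) == auc(scores, labels)
```

With binary labels, the only preference pairs are positive-versus-negative pairs, so the two quantities count exactly the same events.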

The two data sets both contain a single query, and each comes with training and testing sets. To reduce the training time and the memory cost of random forests, we sample from the two data sets and condense the features. The details are listed in Table 8. We use the same random forests parameters as in section 4.5. For a fair comparison, we fix C=1 for rankSVM because the parameters of random forests are not well tuned. The results are shown in Table 9. We first notice that the training time of random forests is over a thousand times longer than that of linear rankSVM on these sparse data. The difference is larger than in the dense case because the training cost of random forests is linear in the number of features, whereas that of rankSVM is linear in the average number of nonzero features. Regarding performance, the difference is small on both data sets, so linear rankSVM is very useful for quickly obtaining competitive results. However, more experiments are needed to confirm these preliminary observations. We hope more public sparse ranking data will become available in the near future.

Table 8:
Statistics of Sparse Training Data Sets.
Data Set   l            n            Avg. Nonzeros   k        p
           11,382,195   22,510,600   22.6            93,899   46,191,724,381,879
           68,019,906   79,901,700   35.3            6,896    198,474,800,029,148
(0.1%)     11,382       73,581       22.6            1,087    46,020,848
(0.025%)   17,005       74,026       35.3            26       12,704,393

Note: To reduce the training time, only a small subset of each problem is used.

Table 9:
Performance of Linear RankSVM and Random Forests on Sparse Data.

           Linear RankSVM                       Random Forests
Data Set   Time (s)   Accuracy   Mean NDCG     Time (s)   Accuracy   Mean NDCG
(0.1%)     4.3        60.83%     0.4822        6343.3     60.45%     0.4732
(0.025%)   2.7        68.16%     0.5851        5223.1     69.72%     0.5982

Note: Random forests uses 40 trees.

5.  Using Partial Pairs to Train Models

To avoid considering the O(l^2) pairs, a common practice in ranking is to use only a subset of pairs. An example is Lin (2010), which uses pairs with close relevance levels (i.e., y_i close to y_j). The concept is similar to equation 1.5: if pairs with close relevance levels are ranked in the right order, pairs with larger distances should also be ranked correctly. When k=O(l), this approach can reduce the number of pairs from O(l^2) to as small as O(k)=O(l). However, if k is small, each pair is already formed by instances in two close relevance levels, so we cannot significantly reduce the number of pairs.
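The pair-reduction idea can be sketched as follows, under our own naming: only pairs from adjacent relevance levels are kept, and each becomes one classification instance x_i − x_j. When the labels within a query are distinct (k = O(l)), this yields O(l) pairs.

```python
from collections import defaultdict

def adjacent_level_pairs(X, y):
    # Keep only preference pairs from adjacent relevance levels; each selected
    # pair (i, j) with y[i] one level above y[j] becomes the classification
    # instance x_i - x_j (label +1).
    by_level = defaultdict(list)
    for idx, label in enumerate(y):
        by_level[label].append(idx)
    levels = sorted(by_level)
    pairs = []
    for lo, hi in zip(levels, levels[1:]):   # adjacent levels only
        for i in by_level[hi]:
            for j in by_level[lo]:
                pairs.append([a - b for a, b in zip(X[i], X[j])])
    return pairs

X = [[1.0, 0.0], [0.5, 1.0], [0.0, 2.0], [2.0, 1.0]]
y = [0, 1, 2, 1]
# 4 of the 5 total preference pairs survive: 2 from levels 0-1, 2 from levels 1-2.
subset = adjacent_level_pairs(X, y)
```

The resulting instances can then be fed to any linear SVM trainer as a one-class classification problem.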

We take the two data sets with large k to conduct experiments. Because within each query q the y_i values form consecutive relevance levels, we use the pairs with y_i = y_j + 1. This setting of using two adjacent relevance levels leads to O(l) pairs. We can then directly treat problems 1.2 and 1.3 as classification problems with instances x_i − x_j. If Newton methods are used for solving problem 1.3, by the approach in equation 2.6, each Hessian-vector product becomes inexpensive. Therefore, we directly use an existing implementation of L2-loss SVM without applying any special method from section 2. After selecting the parameter C, we present pairwise accuracy in Table 10. Observe that the selected C when using partial pairs is larger than when using all pairs. This situation occurs because the sum of training losses in problem 1.3 over a smaller number of pairs must be penalized by a larger C. For training time and pairwise accuracy, as expected, the approach of using partial pairs slightly sacrifices performance for faster training.

Table 10:
A Comparison Between Using Partial Pairs and All Pairs to Train a Model.
                C       Time (s)   Pairwise Accuracy   C       Time (s)   Pairwise Accuracy
Partial pairs   2^-9    19.9       79.10%              2^-10   10.5       81.81%
All pairs       2^-12   38.7       80.71%              2^-14   16.6       82.11%

Note: The standard L2-loss SVM solver is used for the partial-pair setting, while our method is used for all pairs.

Because the pairwise accuracy is only slightly lower, we may say that this approach, together with past work, is already sufficient for training large-scale linear rankSVM:

  • If k is small, we can apply the direct method mentioned in section 2.2, whose cost includes an O(lk) term.

  • If k is large, we can use only O(l) pairs. Then any efficient method for training linear SVM can be applied.

However, a caveat is that two different implementations must be used. In contrast, methods using order-statistic trees handle small and large k simultaneously.

6.  Conclusion

In this letter, we systematically reviewed recent approaches for linear rankSVM. We showed that regardless of the optimization method used, the computational bottleneck is calculating some values over all preference pairs. Following Airola et al. (2011), we comprehensively investigated tree-based techniques for this calculation. Experiments show that our method is faster than existing implementations of linear rankSVM.

Based on this study, we release an extension of the popular linear classification/regression package for ranking. It is available online at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/.

Appendix:  The Dual Problems of Problems 1.2 and 1.3

Based on the original training data, we can construct a new set of p instances, one for each preference pair (i, j) with y_i > y_j:

x̄ = x_i − x_j, with label +1.

Using this training data set, problems 1.2 and 1.3 can be viewed as L1-loss and L2-loss SVM problems with only one class of data, respectively. Then the dual problems of problems 1.2 and 1.3 are both of the following form,

min_α  (1/2) α^T (Q + D) α − e^T α
subject to  0 ≤ α_i ≤ U,  i = 1, …, p,

where e is the vector of ones, D is a diagonal matrix, and Q = AX(AX)^T. For L1 loss, U = C and D is the zero matrix. For L2 loss, U = ∞ and D = I/(2C), where I is the p by p identity matrix.
Notice that the KKT condition for L2-loss SVM gives

α_i = 2C max(0, 1 − (AXw)_i),  i = 1, …, p,

which implies

α_i > 0 if and only if (AXw)_i < 1.

Thus the set defined in equation 2.8 corresponds to the set of support vectors.
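For completeness, the dual form stated above follows from a standard Lagrangian computation; below is a brief sketch for the L2-loss case in our notation (a routine derivation, not taken verbatim from any reference):

```latex
\begin{aligned}
&\min_{\mathbf{w},\,\boldsymbol{\xi}}\ \tfrac12\mathbf{w}^T\mathbf{w}
  + C\textstyle\sum_{i=1}^{p}\xi_i^2
  \quad\text{subject to}\quad (AX\mathbf{w})_i \ge 1-\xi_i,\ i=1,\dots,p,\\
&L(\mathbf{w},\boldsymbol{\xi},\boldsymbol{\alpha})
  = \tfrac12\|\mathbf{w}\|^2 + C\textstyle\sum_i \xi_i^2
  - \textstyle\sum_i \alpha_i\bigl((AX\mathbf{w})_i - 1 + \xi_i\bigr),
  \quad \boldsymbol{\alpha}\ge 0,\\
&\nabla_{\mathbf{w}}L=0 \;\Rightarrow\; \mathbf{w}=(AX)^T\boldsymbol{\alpha},
  \qquad
  \partial_{\xi_i}L=0 \;\Rightarrow\; \xi_i=\frac{\alpha_i}{2C},\\
&\Rightarrow\;
  \min_{\boldsymbol{\alpha}\ge 0}\
  \tfrac12\boldsymbol{\alpha}^T\Bigl(AX(AX)^T+\tfrac{I}{2C}\Bigr)\boldsymbol{\alpha}
  - \mathbf{e}^T\boldsymbol{\alpha}.
\end{aligned}
```

Substituting the stationarity conditions back into L yields exactly the quadratic form with Q = AX(AX)^T and D = I/(2C).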

Acknowledgments

This work was supported in part by the National Science Council of Taiwan (grant 101-2221-E-002-199-MY3). We thank Chia-Hua Ho for discussion and inspiring the selection tree data structure. We also thank Husan-Tien Lin, Yuh-Jye Lee, and the anonymous reviewers for valuable comments.

References

Adelson-Velsky, G. M., & Landis, E. M. (1962). An algorithm for the organization of information. Proceedings of the USSR Academy of Sciences, 146, 263–266.
Airola, A., Pahikkala, T., & Salakoski, T. (2011). Training linear ranking SVMs in linearithmic time using red–black trees. Pattern Recognition Letters, 32(9), 1328–1336.
Andersson, A. (1993). Balanced search trees made simple. In Proceedings of the Third Workshop on Algorithms and Data Structures (pp. 60–71).
Bayer, R. (1972). Symmetric binary B-trees: Data structure and maintenance algorithms. Acta Informatica, 1, 290–306.
Boser, B. E., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory (pp. 144–152). New York: ACM Press.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Burges, C. J. C. (2010). From RankNet to LambdaRank to LambdaMART: An overview (Technical Report MSR-TR-2010-82). Microsoft Research.
Chapelle, O., & Chang, Y. (2011). Yahoo! learning to rank challenge overview. In JMLR Workshop and Conference Proceedings: Workshop on Yahoo! Learning to Rank Challenge (Vol. 14, pp. 1–24).
Chapelle, O., & Keerthi, S. S. (2010). Efficient algorithms for ranking with SVMs. Information Retrieval, 13(3), 201–215.
Christensen, D. (2005). Fast algorithms for the calculation of Kendall's τ. Computational Statistics, 20, 51–62.
Conn, A. R., Gould, N. I. M., & Toint, P. L. (2000). Trust-region methods. Philadelphia: Society for Industrial and Applied Mathematics.
Cortes, C., & Vapnik, V. (1995). Support-vector network. Machine Learning, 20, 273–297.
Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., & Lin, C.-J. (2008). LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, 1871–1874.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232.
Heger, D. A. (2004). A disquisition on the performance behavior of binary search tree data structures. European Journal for the Informatics Professional, 5(5), 67–75.
Herbrich, R., Graepel, T., & Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. In P. J. Bartlett, B. Schölkopf, D. Schuurmans, & A. J. Smola (Eds.), Advances in large margin classifiers (pp. 115–132). Cambridge, MA: MIT Press.
Ho, C.-H., & Lin, C.-J. (2012). Large-scale linear support vector regression. Journal of Machine Learning Research, 13, 3323–3348.
Järvelin, K., & Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4), 422–446.
Joachims, T. (2005). A support vector method for multivariate performance measures. In Proceedings of the Twenty-Second International Conference on Machine Learning (ICML).
Joachims, T. (2006). Training linear SVMs in linear time. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Keerthi, S. S., & DeCoste, D. (2005). A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research, 6, 341–361.
Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30(1/2), 81–93.
Knuth, D. E. (1973). The art of computer programming. Reading, MA: Addison-Wesley.
Lin, C.-J., & Moré, J. J. (1999). Newton's method for large-scale bound constrained problems. SIAM Journal on Optimization, 9, 1100–1127.
Lin, C.-J., Weng, R. C., & Keerthi, S. S. (2008). Trust region Newton method for large-scale logistic regression. Journal of Machine Learning Research, 9, 627–650.
Lin, K.-Y. (2010). Data selection techniques for large-scale rankSVM. Master's thesis, National Taiwan University.
Mangasarian, O. L. (2002). A finite Newton method for classification. Optimization Methods and Software, 17(5), 913–929.
Mohan, A., Chen, Z., & Weinberger, K. (2011). Web-search ranking with initialized gradient boosted regression trees. In JMLR Workshop and Conference Proceedings: Workshop on Yahoo! Learning to Rank Challenge (Vol. 14, pp. 77–89).
Niu, Y., Wang, Y., Sun, G., Yue, A., Dalessandro, B., Perlich, C., & Hamner, B. (2012). The Tencent dataset and KDD-Cup12. In ACM SIGKDD KDD-Cup Workshop.
Qin, T., Liu, T.-Y., Xu, J., & Li, H. (2010). LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval, 13(4), 346–374.
Sculley, D. (2009). Large scale learning to rank. In NIPS 2009 Workshop on Advances in Ranking.
Shalev-Shwartz, S., Singer, Y., & Srebro, N. (2007). Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML).
Steihaug, T. (1983). The conjugate gradient method and trust regions in large scale optimization. SIAM Journal on Numerical Analysis, 20, 626–637.
Teo, C. H., Vishwanathan, S., Smola, A., & Le, Q. V. (2010). Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11, 311–365.
Tyree, S., Weinberger, K. Q., Agrawal, K., & Paykin, J. (2011). Parallel boosted regression trees for web search ranking. In Proceedings of the 20th International Conference on World Wide Web (pp. 387–396).
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Wu, K.-W., Ferng, C.-S., Ho, C.-H., Liang, A.-C., Huang, C.-H., Shen, W.-Y., … Lin, H.-T. (2012). A two-stage ensemble of diverse models for advertisement ranking in KDD Cup 2012. In ACM SIGKDD KDD-Cup Workshop. New York.
Yuan, G.-X., Ho, C.-H., & Lin, C.-J. (2012). Recent advances of large-scale linear classification. Proceedings of the IEEE, 100(9), 2584–2603.
Zheng, Z., Chen, K., Sun, G., & Zha, H. (2007). A regression framework for learning ranking functions using relative relevance judgments. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 287–294). New York: ACM.

Notes

1

See, for example, Yuan, Ho, and Lin (2012) for more detailed discussion.

2

Note that the resulting vector is likely to be dense even if each x_j is sparse.

4

The number of queries shown in Table 2 is fewer because we report only the training set statistics.

5

Here l represents the number of test instances.

6

In the implementation of , the formulation is scaled so the regularization parameter is .

7

For the function value, parameters selected using (validation) pairwise accuracy are considered, but results of using NDCG are similar.

8

For the two Newton-based methods, we use CG iterations rather than outer Newton iterations because each CG iteration has a complexity similar to that of a cutting-plane iteration.