Abstract

The area under the ROC curve (AUC) is a widely used performance measure in machine learning. Increasingly, however, in several applications, ranging from ranking to biometric screening to medicine, performance is measured not in terms of the full area under the ROC curve but in terms of the partial area under the ROC curve between two false-positive rates. In this letter, we develop support vector algorithms for directly optimizing the partial AUC between any two false-positive rates. Our methods are based on minimizing a suitable proxy or surrogate objective for the partial AUC error. In the case of the full AUC, one can readily construct and optimize convex surrogates by expressing the performance measure as a summation of pairwise terms. The partial AUC, on the other hand, does not admit such a simple decomposable structure, making it more challenging to design and optimize (tight) convex surrogates for this measure.

Our approach builds on the structural SVM framework of Joachims (2005) to design convex surrogates for partial AUC and solves the resulting optimization problem using a cutting plane solver. Unlike the full AUC, where the combinatorial optimization needed in each iteration of the cutting plane solver can be decomposed and solved efficiently, the corresponding problem for the partial AUC is harder to decompose. One of our main contributions is a polynomial time algorithm for solving the combinatorial optimization problem associated with partial AUC. We also develop an approach for optimizing a tighter nonconvex hinge loss–based surrogate for the partial AUC using difference-of-convex programming. Our experiments on a variety of real-world and benchmark tasks confirm the efficacy of the proposed methods.

1  Introduction

The receiver operating characteristic (ROC) curve plays an important role as an evaluation tool in machine learning and data science. In particular, the area under the ROC curve (AUC) is widely used to summarize the performance of a scoring function in binary classification problems and is often a performance measure of interest in bipartite ranking (Cortes & Mohri, 2004; Agarwal, Graepel, Herbrich, Har-Peled, & Roth, 2005). Increasingly, however, in several applications, the performance measure of interest is not the full area under the ROC curve but the partial area under the ROC curve between two specified false-positive rates (FPRs) (see Figure 1). For example, in ranking applications where accuracy at the top is critical, one is often interested in the left-most part of the ROC curve (Rudin, 2009; Agarwal, 2011; Rakotomamonjy, 2012); this corresponds to maximizing the partial AUC in a false-positive range of the form [0, β]. In biometric screening, where false positives are intolerable, one is again interested in maximizing the partial AUC in a false-positive range [0, β] for some suitably small β. In the KDD Cup 2008 challenge on breast cancer detection, performance was measured in terms of the partial AUC in a specific false-positive range deemed clinically relevant (Rao, Yakhnenko, & Krishnapuram, 2008).
Figure 1: Partial AUC in the false-positive range [α, β].

In this letter, we develop support vector machine (SVM)–based algorithms for directly optimizing the partial AUC between any two false-positive rates α and β. Our methods are based on minimizing a suitable proxy or surrogate objective for the partial AUC error. In the case of the full AUC, where the evaluation measure can be expressed as a summation of pairwise indicator terms, one can readily construct and optimize surrogates by exploiting this structure. The partial AUC does not admit such a decomposable structure, as the set of negative instances associated with the specified false-positive range can be different for different scoring models. As a result, it becomes more challenging to design and optimize convex surrogates for this measure.

For instance, a popular approach for constructing convex surrogates for the full AUC is to replace the indicator terms in its definition with a suitable pairwise convex loss such as the pairwise hinge loss; in fact, there are several efficient methods available to solve the resulting optimization problem (Herbrich, Graepel, & Obermayer, 2000; Joachims, 2002, 2005). This is not the case with the more complex partial AUC measure. Here, a surrogate constructed by replacing the indicators with the pairwise hinge loss is nonconvex in general. Even in the special case of FPR intervals of the form [0, β], where the hinge loss–based surrogate turns out to be convex, solving the resulting optimization problem is not straightforward.

In our approach, we construct and optimize convex surrogates on the partial AUC by building on the structural SVM formulation of Joachims (2005) developed for general complex performance measures. It is known that for the full AUC, this formulation recovers the corresponding hinge surrogate (Joachims, 2006). A direct application of this framework to the partial AUC results in a loose approximation to the performance measure (in a sense that we elaborate in later sections). Instead, we first rewrite the evaluation measure as a maximum of a certain term over subsets of negative instances and leverage the structural SVM setup to construct a convex approximation to the inner term. This yields a tighter surrogate, which, for the special case of partial AUC in the [0, β] range, is equivalent to the hinge surrogate obtained by replacing the indicators with the pairwise hinge loss. For general FPR intervals [α, β], the surrogate obtained can be seen as a convex relaxation of the (nonconvex) hinge surrogate.

We make use of the cutting plane method to optimize the proposed structural SVM surrogates. Each iteration of this solver requires a combinatorial search over subsets of instances and over binary matrices (representing relative orderings of positive and negative training instances) to find the currently most violated constraint. In the case of the full AUC (where the optimization is only over binary matrices), this problem decomposes neatly into one where each matrix entry can be chosen independently (Joachims, 2005). Unfortunately, for the partial AUC, a straightforward decomposition is not possible, again because the negative instances involved in the relevant false-positive range can be different for different orderings of instances.

One of our main contributions in this letter is a polynomial time algorithm for solving the corresponding combinatorial optimization within the cutting plane method for the partial AUC by breaking down the problem into smaller tractable ones. When the specified false-positive range is of the form [0, β], we show that after fixing the optimal subset of negatives to the top-ranked negatives, one can still optimize the individual entries of the ordering matrix separately. For the general case [α, β], we additionally formulate an equivalent optimization problem over a restricted search space, in which each row of the matrix can be optimized separately—and efficiently.

While the use of convex surrogates in this approach allows efficient optimization and guarantees convergence to the global surrogate optimum, it turns out that for the partial AUC in a general FPR interval [α, β], the previous nonconvex hinge surrogate (obtained by replacing the indicators with the pairwise hinge loss) is a tighter approximation to the original evaluation measure. Hence, as a next step, we also develop a method for directly optimizing this nonconvex surrogate using a popular nonconvex optimization technique based on difference-of-convex (DC) programming. Here we exploit the fact that the partial AUC in [α, β] can be written as a difference of (scaled) partial AUC values in [0, β] and [0, α].

We evaluate the proposed methods on a variety of real-world applications where partial AUC is an evaluation measure of interest and on benchmark data sets. We find that in most cases, the proposed methods yield better partial AUC performance compared to an approach for optimizing the full AUC, thus confirming the focus of our methods on a select false-positive range of the ROC curve. Our methods are also competitive with existing algorithms for optimizing partial AUC. For partial AUC in [α, β], we find that in some settings, the proposed DC programming method for optimizing the nonconvex hinge surrogate (despite having the risk of getting stuck at a locally optimal solution) performs better than the structural SVM method, though overall there is no clear winner.

In summary, we make the following additional contributions in this letter compared to the conference versions of this work (Narasimhan & Agarwal, 2013a, 2013b):

  1. We provide a more self-contained and comprehensive description of our structural SVM methods for partial AUC optimization. The surrogate construction is explained from the ground up and from a perspective that matches well with readers' intuition about existing surrogates for AUC. Complete proofs are given for all theorems.

  2. In the case of partial AUC in the [α, β] range, we develop a new method for optimizing a tighter nonconvex surrogate using DC programming.

  3. We derive a generalization bound for partial AUC using VC-dimension-based uniform convergence arguments.

  4. Our experiments are extensive and detailed, covering a range of applications and benchmark data sets.

1.1  Related Work

Much work has been done on developing algorithms to optimize the full AUC, mostly in the context of ranking (Herbrich et al., 2000; Joachims, 2002, 2005; Freund, Iyer, Schapire, & Singer, 2003; Burges et al., 2005). There has also been interest in the ranking literature in optimizing measures focusing on the left end of the ROC curve, corresponding to maximizing accuracy at the top of the list (Rudin, 2009). In particular, the infinite push ranking algorithm (Agarwal, 2011; Rakotomamonjy, 2012) can be viewed as maximizing the partial AUC in the range [0, 1/n], where n is the number of negative training examples.

While the AUC is widely used in practice, increasingly the partial AUC is being preferred as an evaluation measure in several applications in bioinformatics and medical diagnosis (Pepe & Thompson, 2000; Qi, Bar-Joseph, & Klein-Seetharaman, 2006; Rao et al., 2008; Hsu, Chang, & Hsueh, 2014), and more recently even in domains like computer vision (Paisitkriangkrai, Shen, & van den Hengel, 2013, 2014), personalized medicine (Majumder et al., 2015), and demand forecasting (Schneider & Gorr, 2015). The problem of optimizing the partial AUC in false-positive ranges of the form [0, β] has received some attention primarily in the bioinformatics and biometrics literature (Pepe & Thompson, 2000; Dodd & Pepe, 2003; Wang & Chang, 2011; Ricamato & Tortorella, 2011; Hsu & Hsueh, 2012); however, in most cases, the algorithms developed are heuristic in nature. The asymmetric SVM algorithm of Wu, Lin, Chen, and Chen (2008) also aims to maximize the partial AUC in a range [0, β] by using a variant of one-class SVM. The optimization objective used, however, does not directly approximate the partial AUC in the specified range, but instead seeks to indirectly promote good partial AUC performance through a fine-grained parameter tuning procedure. There has also been some work on optimizing the partial AUC in general false-positive ranges of the form [α, β], including the boosting-based algorithms pAUCBoost (Komori & Eguchi, 2010) and pU-AUCBoost (Takenouchi, Komori, & Eguchi, 2012).

Support vector algorithms have been extensively used in practice for various supervised learning tasks, with both standard and complex performance measures (Cortes & Vapnik, 1995; Crammer & Singer, 2002; Chu & Keerthi, 2007; Joachims, 2002; Tsochantaridis, Joachims, Hofmann, & Altun, 2005). The proposed methods are most closely related to the structural SVM framework of Joachims (2005) for optimizing the full AUC. To our knowledge, ours is the first work to develop principled support vector methods that can directly optimize the partial AUC in an arbitrary false-positive range [α, β].

1.2  Organization

We begin with the problem setting in section 2, along with background material on the previous structural SVM framework for full AUC maximization. In section 3, we consider two initial surrogates for the partial AUC—one based on the pairwise hinge loss and the other on a naive application of the structural SVM formulation—and point out drawbacks in each case. We then present a tight convex surrogate for the special case of the FPR range [0, β] in section 4 and for the general case of [α, β] intervals in section 5, along with cutting plane solvers for the resulting optimization problems. Subsequently, in section 6, we describe a DC programming approach for directly optimizing the nonconvex hinge surrogate for partial AUC in [α, β]. We provide a generalization bound for the partial AUC in section 7 and present our experimental results on real-world and benchmark tasks in section 8. All proofs are provided in the online appendix.

2  Preliminaries and Background

2.1  Problem Setting

Let X be an instance space, and let D_+ and D_- be probability distributions over positive and negative instances in X. We are given a training sample S = (S^+, S^-) containing m positive instances S^+ = (x_1^+, …, x_m^+) drawn independent and identically distributed (i.i.d.) according to D_+ and n negative instances S^- = (x_1^-, …, x_n^-) drawn i.i.d. according to D_-. Our goal is to learn from S a scoring function f: X → R that assigns higher scores to positive instances than to negative instances and, in particular, yields good performance in terms of the partial AUC between two specified false-positive rates α and β, where 0 ≤ α < β ≤ 1. In a ranking application, this scoring function can then be deployed to rank new instances accurately, while in a classification setting, the scoring function, along with a suitable threshold, serves as a binary classifier.

2.1.1  Partial AUC

Define, for a scoring function f: X → R and threshold t ∈ R, the true positive rate (TPR) of the binary classifier sign(f(·) − t) as the probability that it correctly classifies a random positive instance from D_+ as positive,

$$\mathrm{TPR}_f(t) \;=\; \mathbf{P}_{x \sim \mathcal{D}_+}\big(f(x) > t\big),$$

and the false-positive rate (FPR) of the classifier as the probability that it misclassifies a random negative instance from D_- as positive:

$$\mathrm{FPR}_f(t) \;=\; \mathbf{P}_{x' \sim \mathcal{D}_-}\big(f(x') > t\big).$$

The ROC curve for the scoring function f is then defined as the plot of TPR_f(t) against FPR_f(t) for different values of t ∈ R. The area under this curve can be computed as

$$\mathrm{AUC}_f \;=\; \int_0^1 \mathrm{TPR}_f\big(\mathrm{FPR}_f^{-1}(u)\big)\, du,$$

where FPR_f^{-1}(u) = inf{t ∈ R : FPR_f(t) ≤ u}. Assuming there are no ties, it can be shown (Cortes & Mohri, 2004) that the AUC can be written as

$$\mathrm{AUC}_f \;=\; \mathbf{P}_{x \sim \mathcal{D}_+,\, x' \sim \mathcal{D}_-}\big(f(x) > f(x')\big).$$

Our interest here is in the area under the curve between FPRs α and β. The (normalized) partial AUC of f in the range [α, β] is defined as

$$\mathrm{pAUC}_f(\alpha, \beta) \;=\; \frac{1}{\beta - \alpha}\int_\alpha^\beta \mathrm{TPR}_f\big(\mathrm{FPR}_f^{-1}(u)\big)\, du.$$

2.1.2  Empirical Partial AUC

Given a sample S as above, one can plot an empirical ROC curve corresponding to a scoring function f. Assuming there are no ties, this is obtained by using the empirical quantities

$$\widehat{\mathrm{TPR}}_f(t) \;=\; \frac{1}{m}\sum_{i=1}^m \mathbf{1}\big(f(x_i^+) > t\big), \qquad \widehat{\mathrm{FPR}}_f(t) \;=\; \frac{1}{n}\sum_{j=1}^n \mathbf{1}\big(f(x_j^-) > t\big)$$

instead of TPR_f and FPR_f, respectively (here 1(·) denotes the indicator function). The area under this empirical curve is given by

$$\widehat{\mathrm{AUC}}_f \;=\; \frac{1}{mn}\sum_{i=1}^m \sum_{j=1}^n \mathbf{1}\big(f(x_i^+) > f(x_j^-)\big). \tag{2.1}$$

Denoting j_α = ⌈nα⌉ and j_β = ⌊nβ⌋, the (normalized) empirical partial AUC of f in the FPR range [α, β] can then be written as (Dodd & Pepe, 2003)

$$\widehat{\mathrm{pAUC}}_f(\alpha, \beta) \;=\; \frac{1}{m(j_\beta - j_\alpha)}\sum_{i=1}^m \sum_{j = j_\alpha + 1}^{j_\beta} \mathbf{1}\big(f(x_i^+) > f\big(x_{(j)}^-\big)\big), \tag{2.2}$$

where x_{(j)}^- denotes the negative instance in S ranked in jth position (among the n negatives, in descending order of scores) by f.
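To make the definition in equation 2.2 concrete, here is a minimal sketch (our own code, using NumPy; the function name is ours) that computes the empirical partial AUC from arrays of scores on the positive and negative instances:

```python
import numpy as np

def empirical_partial_auc(pos_scores, neg_scores, alpha, beta):
    """Empirical partial AUC in the FPR range [alpha, beta] (equation 2.2)."""
    pos_scores = np.asarray(pos_scores, dtype=float)
    neg_scores = np.asarray(neg_scores, dtype=float)
    m, n = len(pos_scores), len(neg_scores)
    j_alpha, j_beta = int(np.ceil(n * alpha)), int(np.floor(n * beta))
    # Negatives ranked in descending order of scores; keep positions j_alpha+1 .. j_beta.
    ranked_negs = np.sort(neg_scores)[::-1][j_alpha:j_beta]
    # Count positive-negative pairs (within the range) that are correctly ordered.
    correct = sum((pos_scores > s).sum() for s in ranked_negs)
    return correct / (m * (j_beta - j_alpha))

# Example: pAUC of random scores in the FPR range [0, 0.1].
rng = np.random.default_rng(0)
print(empirical_partial_auc(rng.normal(1, 1, 50), rng.normal(0, 1, 500), 0.0, 0.1))
```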

2.1.3  Partial AUC versus Full AUC

It is important to note that for the AUC to take its maximum value of 1, a scoring function needs to rank the positive instances above all the negative instances. For the partial AUC in a specified interval [α, β] to take a value of 1, it is sufficient that a scoring function ranks the positives above a subset of the negative instances (specifically, above those in positions j_α + 1 to j_β in the ranking of negatives). Another key difference between the two evaluation measures is that the full AUC can be expressed as an expectation or sum of indicator terms over pairs of positive-negative instances (see equation 2.1), whereas the partial AUC does not have such a simple additive structure. This is clearly evident in the definition in equation 2.2, where the set of negatives corresponding to the FPR range [α, β] that appears in the inner summation is not fixed and can be different for different scoring functions.

We also stress that a scoring function with a high AUC value need not be optimal in terms of the partial AUC in a particular FPR range. This is illustrated in Figure 2, which shows the scores assigned by two scoring functions f1 and f2 on a hypothetical sample of four positive and six negative instances, along with the corresponding ROC curves. As can be seen, while f1 gives a higher AUC value, f2 has a higher partial AUC in the FPR range of interest. This motivates the need to design algorithms that are tailored to directly optimize the partial AUC.
Figure 2: ROC curves for scoring functions f1 and f2 (bottom) described in the table (top) on a sample containing four positive instances and five negative instances.
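The same phenomenon can be reproduced with a small hypothetical example (our own construction, not the data of Figure 2): one scoring function wins on the full AUC while the other wins on the partial AUC over a small top portion of the negatives.

```python
import numpy as np

def pauc_top(pos, neg, k):
    """Empirical pAUC in [0, k/n]: fraction of correctly ordered pairs among the top-k negatives."""
    neg_top = np.sort(np.asarray(neg, dtype=float))[::-1][:k]
    return float(np.mean([p > s for p in pos for s in neg_top]))

# Hypothetical scores on 4 positives and 6 negatives.
f1_pos, f1_neg = [8, 7, 6, 5], [10, 9, 4, 3, 2, 1]   # positives sit below the 2 top-scored negatives
f2_pos, f2_neg = [10, 9, 3, 2], [8, 7, 6, 5, 4, 1]   # two positives are at the very top

print(pauc_top(f1_pos, f1_neg, 6), pauc_top(f2_pos, f2_neg, 6))  # full AUC: 0.667 vs 0.583
print(pauc_top(f1_pos, f1_neg, 2), pauc_top(f2_pos, f2_neg, 2))  # pAUC over top 2 negatives: 0.0 vs 0.5
```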

2.2  Background on Structural SVM Framework for Full AUC

As a first step toward developing a method for optimizing the partial AUC, we provide some background on the popular structural SVM framework for maximizing the full AUC (Joachims, 2005). Unless otherwise specified, we assume that X ⊆ R^d for some d ∈ Z_+ and consider linear scoring functions of the form f_w(x) = w^⊤x for some w ∈ R^d. The methods described extend easily to nonlinear scoring functions and non-Euclidean instance spaces using kernels (Yu & Joachims, 2008).

2.2.1  Hinge Loss–Based Surrogate

Given a training sample S, our goal here is to find a scoring function f_w that yields maximum empirical AUC on S or, equivalently, minimizes one minus the empirical AUC, given as follows:

$$\widehat{R}_{\mathrm{AUC}}(w; S) \;=\; \frac{1}{mn}\sum_{i=1}^m \sum_{j=1}^n \mathbf{1}\big(w^\top x_i^+ \le w^\top x_j^-\big). \tag{2.3}$$

Owing to its discrete nature, minimizing this objective is a hard problem in general. One instead works with a convex proxy or surrogate objective for the above risk that is easier to optimize. A common approach is to replace the above indicator term with a pairwise loss such as the pairwise hinge loss, which for any model vector w and instances x, x' is defined as ℓ_h(w; x, x') = (1 − (w^⊤x − w^⊤x'))_+, where (a)_+ = max(0, a). This loss is clearly convex in w and upper-bounds the indicator 1(w^⊤x ≤ w^⊤x'). The following is then the well-known pairwise hinge surrogate for the AUC risk:

$$\widehat{R}^{\mathrm{hinge}}_{\mathrm{AUC}}(w; S) \;=\; \frac{1}{mn}\sum_{i=1}^m \sum_{j=1}^n \Big(1 - \big(w^\top x_i^+ - w^\top x_j^-\big)\Big)_+. \tag{2.4}$$
This surrogate is convex in w, upper-bounds the AUC risk, and is minimized by a scoring function that ranks positive instances above the negative instances with a sufficient margin of separation. It is also evident that one can minimize this objective over all model vectors w ∈ R^d using a standard convex optimization solver. In fact, several specialized methods are available to solve (a regularized form of) this optimization problem (Herbrich et al., 2000; Joachims, 2002, 2005). One popular approach is to solve the corresponding dual optimization problem using a coordinate descent–type method (Joachims, 2002).
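For a linear model, the surrogate in equation 2.4 is straightforward to evaluate; the vectorized sketch below is our own illustration (function and variable names are ours):

```python
import numpy as np

def auc_hinge_surrogate(w, X_pos, X_neg):
    """Pairwise hinge surrogate for the AUC risk (equation 2.4) for a linear model w."""
    margins = (X_pos @ w)[:, None] - (X_neg @ w)[None, :]   # m x n pairwise score differences
    return np.maximum(0.0, 1.0 - margins).mean()            # average pairwise hinge loss

rng = np.random.default_rng(1)
w = rng.normal(size=5)
X_pos, X_neg = rng.normal(0.5, 1, (20, 5)), rng.normal(-0.5, 1, (80, 5))
print(auc_hinge_surrogate(w, X_pos, X_neg))
```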

The partial AUC has a more complex structure as the subset of negatives relevant to the given FPR range can be different for different scoring models. As a result, a surrogate obtained by replacing the indicators with the pairwise hinge loss turns out to be nonconvex in general. The approach that we take for the partial AUC will instead make use of the structural SVM framework developed by Joachims (2005) for designing surrogate minimizing methods for general complex performance measures. For the full AUC, it has been shown that this formulation recovers the corresponding hinge surrogate in equation 2.4 (Joachims, 2006). We give the details for AUC below and in subsequent sections build on this formulation to construct and optimize convex surrogates for the partial AUC.

2.2.2  Structural SVM Formulation

For any ordering of the training instances, we represent (errors in) the relative ordering of the positive instances in S^+ and negative instances in S^- via a matrix π ∈ {0, 1}^{m×n} as follows:

$$\pi_{ij} \;=\; \begin{cases} 1 & \text{if } x_i^+ \text{ is ranked below } x_j^-,\\ 0 & \text{otherwise}. \end{cases}$$

Not all matrices in {0, 1}^{m×n} represent a valid relative ordering (due to transitivity requirements). We let Π_{m,n} denote the set of all matrices in {0, 1}^{m×n} that correspond to valid orderings.

Clearly, the correct relative ordering π* has π*_{ij} = 0 for all i, j. This corresponds to all positive training instances being ranked above the negative training instances.

For any π ∈ Π_{m,n}, we can define the AUC loss of π with respect to the correct ordering π* as

$$\Delta_{\mathrm{AUC}}(\pi^*, \pi) \;=\; \frac{1}{mn}\sum_{i=1}^m \sum_{j=1}^n \pi_{ij}. \tag{2.5}$$

It can be verified that for any π that is consistent with the scoring function f_w, Δ_AUC(π*, π) evaluates to the AUC risk in equation 2.3.
We also define a joint feature map between the input training sample S and an output ordering matrix π as

$$\phi(S, \pi) \;=\; \frac{1}{mn}\sum_{i=1}^m \sum_{j=1}^n (1 - \pi_{ij})\,\big(x_i^+ - x_j^-\big). \tag{2.6}$$

The expression in equation 2.6 evaluates to a (normalized) sum of feature vector differences over all pairs of positive-negative instances in S in which the positive instance is ordered by π above the negative instance. This choice of φ ensures that for any fixed w, maximizing w^⊤φ(S, π) over π ∈ Π_{m,n} yields an ordering matrix consistent with the scoring function f_w, and thus one for which the loss term Δ_AUC(π*, π) evaluates to R̂_AUC(w; S). The problem of optimizing the AUC now reduces to finding a w for which the maximizer over π of w^⊤φ(S, π) has minimum AUC loss. This is approximated by the following structural SVM–based relaxation of the AUC loss:
$$\widehat{R}^{\mathrm{struct}}_{\mathrm{AUC}}(w; S) \;=\; \max_{\pi \in \Pi_{m,n}} \Big\{ \Delta_{\mathrm{AUC}}(\pi^*, \pi) \,-\, w^\top\big(\phi(S, \pi^*) - \phi(S, \pi)\big) \Big\}. \tag{2.7}$$

Clearly, this surrogate is convex in w, as it is a maximum of linear functions in w. Moreover, it is also an upper bound on the empirical AUC risk R̂_AUC(w; S): if π̄ is the maximizer of w^⊤φ(S, π) over Π_{m,n}, then w^⊤φ(S, π̄) ≥ w^⊤φ(S, π*), and from the above definition, R̂^struct_AUC(w; S) ≥ Δ_AUC(π*, π̄) = R̂_AUC(w; S).

Interestingly, this surrogate can be shown to be equivalent to the hinge loss–based surrogate in equation 2.4.

Theorem 1.

(Joachims, 2006). For any w ∈ R^d and training sample S, R̂^struct_AUC(w; S) = R̂^hinge_AUC(w; S).
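As a quick numerical check of theorem 1 (our own sketch, not code from the paper), the maximization in equation 2.7 can be carried out in closed form—each entry π_ij of the maximizing ordering matrix is 1 exactly when its pairwise term 1 − w^⊤(x_i^+ − x_j^-) is positive—and the resulting value coincides with the hinge surrogate of equation 2.4:

```python
import numpy as np

def auc_struct_surrogate(w, X_pos, X_neg):
    """Structural SVM surrogate for the AUC (equation 2.7), computed in closed form:
    pi_ij = 1 iff its pairwise contribution 1 - w.(x_i^+ - x_j^-) is positive."""
    contrib = 1.0 - ((X_pos @ w)[:, None] - (X_neg @ w)[None, :])   # m x n pairwise terms
    pi = (contrib > 0).astype(float)                                # maximizing ordering matrix
    return (pi * contrib).mean()

def auc_hinge_surrogate(w, X_pos, X_neg):
    """Pairwise hinge surrogate for the AUC (equation 2.4)."""
    return np.maximum(0.0, 1.0 - ((X_pos @ w)[:, None] - (X_neg @ w)[None, :])).mean()

rng = np.random.default_rng(2)
w = rng.normal(size=4)
X_pos, X_neg = rng.normal(0.3, 1, (15, 4)), rng.normal(-0.3, 1, (60, 4))
assert np.isclose(auc_struct_surrogate(w, X_pos, X_neg), auc_hinge_surrogate(w, X_pos, X_neg))
```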

Thus, the problem of minimizing a pairwise hinge surrogate for the AUC can be cast as one of optimizing the above structural SVM surrogate (along with a suitable regularizer), which results in the following SVM-style convex (quadratic) program:

$$\min_{w,\, \xi \ge 0}\;\; \frac{1}{2}\|w\|^2 \,+\, C\,\xi \quad \text{s.t.}\;\; \forall \pi \in \Pi_{m,n}:\;\; w^\top\big(\phi(S, \pi^*) - \phi(S, \pi)\big) \;\ge\; \Delta_{\mathrm{AUC}}(\pi^*, \pi) \,-\, \xi,$$

where C > 0 is a regularization parameter.

2.2.3  Cutting Plane Method

While the above optimization problem contains an exponential number of constraints (one for each π ∈ Π_{m,n}), it can be solved efficiently using the cutting plane method (Tsochantaridis et al., 2005). Each iteration of this solver requires a combinatorial optimization over matrices in Π_{m,n}. By exploiting the simple structure of the AUC loss, this combinatorial problem can be decomposed into simpler ones, where each entry of the matrix can be optimized independently (Joachims, 2005). The cutting plane method is guaranteed to yield an ε-accurate solution in O(1/ε) iterations (Joachims, 2006); in the case of the AUC, each iteration requires O((m + n) log(m + n)) computational time. We elaborate on this solver in section 4 when we develop a structural SVM approach for the partial AUC.
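The overall loop can be sketched generically as below (our own illustrative names); the QP subroutine and the constraint oracle are left abstract, and the pAUC-specific oracles are developed in sections 4 and 5.

```python
import numpy as np

def cutting_plane(dim, find_most_violated, solve_qp_over, C, eps, max_iter=100):
    """Generic cutting plane loop, a sketch in the spirit of the method described above.

    find_most_violated(w) -> (delta, dpsi): loss and feature difference
        phi(S, pi*) - phi(S, pi) of the currently most violated constraint.
    solve_qp_over(constraints, C) -> (w, xi): minimizer of 0.5*||w||^2 + C*xi
        subject to  w.dpsi >= delta - xi  for every (delta, dpsi) in the working set."""
    w, xi, constraints = np.zeros(dim), 0.0, []
    for _ in range(max_iter):
        delta, dpsi = find_most_violated(w)        # combinatorial step
        if delta - w @ dpsi <= xi + eps:           # stop if violation is within tolerance
            break
        constraints.append((delta, dpsi))          # add the new cut
        w, xi = solve_qp_over(constraints, C)      # re-solve the restricted QP
    return w
```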

3  Candidate Surrogates for Partial AUC

As noted earlier, our goal in this letter is to design efficient methods for optimizing the partial AUC in a specified false-positive range. In particular, given a training sample S, we wish to find a scoring function f_w that maximizes the partial AUC in [α, β] or, equivalently, minimizes the following risk:

$$\widehat{R}_{\mathrm{pAUC}(\alpha,\beta)}(w; S) \;=\; \frac{1}{m(j_\beta - j_\alpha)}\sum_{i=1}^m \sum_{j = j_\alpha + 1}^{j_\beta} \mathbf{1}\big(w^\top x_i^+ \le w^\top x_{(j)_w}^-\big), \tag{3.1}$$

where x_{(j)_w}^- denotes the negative instance in S ranked in jth position (in descending order of scores) by f_w. As before, optimizing this quantity directly is computationally hard in general. Hence, we work with a continuous surrogate objective that acts as a proxy for the above risk. As first-cut attempts at devising surrogates for the partial AUC, we replicate the two approaches used above for constructing surrogates for the full AUC, namely, those based on the hinge loss and the structural SVM framework, respectively. As we shall see, the surrogates obtained in both cases have certain drawbacks that require us to use a somewhat different approach.

3.1  Hinge Loss–Based Surrogate

We begin by considering a hinge-style surrogate for the partial AUC obtained by replacing the indicator functions in the partial AUC risk with the pairwise hinge loss:
$$\widehat{R}^{\mathrm{hinge}}_{\mathrm{pAUC}(\alpha,\beta)}(w; S) \;=\; \frac{1}{m(j_\beta - j_\alpha)}\sum_{i=1}^m \sum_{j = j_\alpha + 1}^{j_\beta} \Big(1 - \big(w^\top x_i^+ - w^\top x_{(j)_w}^-\big)\Big)_+. \tag{3.2}$$

In the case of the full AUC (i.e., when the FPR interval is [0, 1]), the hinge surrogate is convex in w and hence can be optimized efficiently. However, the corresponding surrogate given above for the partial AUC turns out to be nonconvex in general. This is because the surrogate is defined on only a subset of negative instances relevant to the given FPR range, and this subset can be different for different scoring functions.

Theorem 2.

Let 0 < α < β ≤ 1. Then there exists a training sample S for which the surrogate R̂^hinge_pAUC(α,β)(w; S) is nonconvex in w.

Fortunately, for the case where α = 0 and β ∈ (0, 1], that is, for FPR intervals of the form [0, β], the hinge loss–based surrogate turns out to be convex. Here the surrogate is given by

$$\widehat{R}^{\mathrm{hinge}}_{\mathrm{pAUC}(0,\beta)}(w; S) \;=\; \frac{1}{m\, j_\beta}\sum_{i=1}^m \sum_{j=1}^{j_\beta} \Big(1 - \big(w^\top x_i^+ - w^\top x_{(j)_w}^-\big)\Big)_+, \tag{3.3}$$
and we have:
Theorem 3.

Let α = 0 and β ∈ (0, 1]. For any training sample S, the surrogate R̂^hinge_pAUC(0,β)(w; S) is convex in w.

Despite the hinge surrogate for FPR intervals of the form [0, β] being convex, it is not immediately clear how the resulting optimization problem can be solved efficiently. For instance, a common approach for optimizing the full AUC surrogate is to derive and solve the corresponding dual optimization problem. Since the surrogate for the partial AUC is defined on a subset of negative instances that can be different for different scoring functions, even deriving the dual problem for the hinge partial AUC surrogate turns out to be nontrivial.

3.2  Naive Structural SVM Surrogate

As an alternative to the hinge surrogate, we next consider constructing a convex surrogate for the partial AUC by a direct application of the structural SVM formulation described in the previous section for the AUC. Specifically, we consider the surrogate obtained by replacing the loss term in the structural SVM surrogate for the AUC in equation 2.7 with an appropriate loss for the partial AUC in [α, β]:

$$\Delta_{\mathrm{pAUC}(\alpha,\beta)}(\pi^*, \pi) \;=\; \frac{1}{m(j_\beta - j_\alpha)}\sum_{i=1}^m \sum_{j = j_\alpha + 1}^{j_\beta} \pi_{i,(j)_\pi},$$

where (j)_π denotes the index of the jth ranked negative instance by any fixed ordering of instances consistent with π. The resulting surrogate is given by

$$\widehat{R}^{\mathrm{naive}}_{\mathrm{pAUC}(\alpha,\beta)}(w; S) \;=\; \max_{\pi \in \Pi_{m,n}} \Big\{ \Delta_{\mathrm{pAUC}(\alpha,\beta)}(\pi^*, \pi) \,-\, w^\top\big(\phi(S, \pi^*) - \phi(S, \pi)\big) \Big\}. \tag{3.4}$$
As with the AUC, this surrogate serves as a convex upper bound on the partial AUC risk in equation 3.1. At first glance, it appears to be a good proxy for the partial AUC risk. A closer look, however, reveals that the surrogate has certain drawbacks, as explained below.

Recall that with the AUC, the structural SVM surrogate is equivalent to the corresponding hinge surrogate (see theorem 1). However, even for the special case of partial AUC in FPR intervals of the form [0, β] (where the hinge surrogate is convex), the above structural SVM surrogate turns out to be a looser convex upper bound on the partial AUC risk than the hinge surrogate in equation 3.2. In particular, in its simplified form, the above structural SVM surrogate for the partial AUC contains redundant terms that penalize misrankings of the scoring function with respect to negative instances outside the relevant FPR range, namely, those in positions j_β + 1 to n of the ranked list. These additional terms appear because the joint feature map in the surrogate is defined on all negative instances and not just the ones relevant to the given FPR range (see equation 2.6). Clearly, these terms disrupt the emphasis of the surrogate on the specified FPR interval. The details can be found in the earlier conference versions of this letter (Narasimhan & Agarwal, 2013a, 2013b) and are left out here to keep the exposition simple.

Thus a naive application of the structural SVM formulation yields a loose surrogate for the partial AUC. Of course, one could look at tightening the surrogate by restricting the joint feature map to only a subset of negative instances; however, it is not immediately clear how this can be done, as the subset of negatives relevant to the given FPR interval can be different for different scoring models, while the definition of the joint feature map in the structural SVM framework needs to be independent of w.

The approach that we take constructs a tighter surrogate for the partial AUC by making use of the structural SVM framework in a manner that suitably exploits the structure of the partial AUC performance measure. In particular, we first rewrite the partial AUC risk as a maximum of a certain term over subsets of negatives and compute a convex approximation to the inner term using the structural SVM setup. In the rewritten formulation, the joint feature maps need to be defined on only a subset of the negative instances. The resulting surrogate is convex and is equivalent to the corresponding hinge surrogate for [0, β] intervals in equation 3.3. For general FPR intervals [α, β], the proposed surrogate can be seen as a tighter convex relaxation of the partial AUC risk compared to the naive structural SVM surrogate.

A summary of the surrogates discussed here is given in Figure 3. Among the surrogates considered, the hinge surrogates serve as the tightest upper bounds on the partial AUC risk but are not necessarily convex; the naive structural SVM surrogates, on the other hand, are convex but yield looser upper bounds. The structural SVM-based surrogates proposed in this letter (highlighted in blue) are convex and also serve as tighter upper bounds compared to the naive surrogates; moreover, in the case of [0, β] intervals, the proposed surrogate is equivalent to the corresponding hinge surrogate.
Figure 3: Relationship between surrogates for (a) the AUC and the partial AUC in (b) [0, β] and (c) [α, β], for a model vector w and sample S. These inequalities follow from the narrative in section 3, together with the results in theorems 6 and 7. The relationship between the structural SVM surrogate and the hinge surrogate for [α, β] requires an additional assumption on the scores (see theorem 7). Those colored in blue are the tight convex structural SVM surrogates that we optimize using the cutting plane method (see sections 4 and 5); the one in red is the nonconvex hinge surrogate that we optimize using a DC programming method (see section 6).

We also provide a cutting plane method to optimize the prescribed surrogates. Unlike the full AUC, here the combinatorial optimization required in each iteration of the solver does not decompose easily into simpler problems. One of our main contributions is a polynomial time algorithm for solving this combinatorial optimization for the partial AUC. The details are provided for the [0, β] case in section 4 and for the [α, β] case in section 5. In addition to methods that optimize convex structural SVM surrogates on the partial AUC, we also develop a method for directly optimizing the nonconvex hinge surrogate for general [α, β] intervals using difference-of-convex programming. This approach is explained in section 6.

While the proposed methods optimize continuous approximations or surrogates to the partial AUC, on several real-world tasks, they were found to perform better in terms of the original performance measure compared to the state-of-the-art approaches (see section 8). Indeed it would be of interest to establish precise conditions under which optimizing the proposed surrogates would yield the optimal scoring function for the original partial AUC (i.e., under which the surrogates are statistically consistent). However, this is a generic question that is not very well understood for many applications of (structural) SVM-style surrogates (Joachims, 2005) and is left open for future work.

4  Structural SVM Approach for Partial AUC in [0, β]

We start by developing a method for optimizing the partial AUC in FPR intervals of the form [0, β], or equivalently, for minimizing the corresponding partial AUC risk:

$$\widehat{R}_{\mathrm{pAUC}(0,\beta)}(w; S) \;=\; \frac{1}{m\, j_\beta}\sum_{i=1}^m \sum_{j=1}^{j_\beta} \mathbf{1}\big(w^\top x_i^+ \le w^\top x_{(j)_w}^-\big). \tag{4.1}$$
We saw in the previous section that the hinge loss–based surrogate is convex in this case, but it was not immediate how this objective can be optimized efficiently. We also saw that a naive application of the structural SVM framework results in a surrogate that is a looser convex approximation to the partial AUC risk than the hinge surrogate.

Our approach makes use of the structural SVM formulation in a manner that allows us to construct a tighter convex surrogate for the partial AUC, which in this case is equivalent to the corresponding hinge surrogate. The key idea here is that the partial AUC risk in [0, β] can be written as a maximum, over subsets of negatives, of the full AUC risk evaluated on all positives and the given subset of negatives. The structural SVM formulation described in section 2 for the full AUC can then be leveraged to design a convex surrogate and to optimize it efficiently using a cutting plane solver.

4.1  Tight Structural SVM Surrogate for pAUC in [0, β]

For any subset of negatives Z ⊆ S^-, let R̂_AUC(w; S^+ ∪ Z) denote the full AUC risk of the scoring function f_w evaluated on a sample containing all the positives and the subset of negatives Z. Then the partial AUC risk of f_w in [0, β] is simply the value of this quantity on the top j_β ranked negatives. This can be shown to be equivalent to the maximum value of R̂_AUC(w; S^+ ∪ Z) over all subsets of negatives Z of size j_β.

Theorem 4.
For any w ∈ R^d and training sample S,

$$\widehat{R}_{\mathrm{pAUC}(0,\beta)}(w; S) \;=\; \max_{Z \subseteq S^-,\; |Z| = j_\beta} \widehat{R}_{\mathrm{AUC}}\big(w;\; S^+ \cup Z\big). \tag{4.2}$$

Having expressed the partial AUC risk in [0, β] in terms of the full AUC risk on a subset of instances, we can devise a convex surrogate for the evaluation measure by constructing a convex approximation to the full AUC term using the structural SVM formulation explained in section 2.2.

In particular, let us define truncated ordering matrices π ∈ {0, 1}^{m×j_β} for the m positive instances and any subset of negative instances Z = {z_1, …, z_{j_β}} ⊆ S^- as

$$\pi_{ij} \;=\; \begin{cases} 1 & \text{if } x_i^+ \text{ is ranked below } z_j,\\ 0 & \text{otherwise}. \end{cases}$$

The set of all valid orderings is denoted by Π_{m,j_β}, and the correct ordering π* is again given by π*_{ij} = 0 for all i, j. Also redefine the joint feature map for the m positives and j_β negatives: φ(S^+ ∪ Z, π) = (1/(m j_β)) Σ_i Σ_j (1 − π_{ij})(x_i^+ − z_j). The following is then a convex upper bound on the inner AUC term in equation 4.2:

$$\widehat{R}_{\mathrm{AUC}}\big(w;\; S^+ \cup Z\big) \;\le\; \max_{\pi \in \Pi_{m, j_\beta}} \Big\{ \Delta_{\mathrm{AUC}}(\pi^*, \pi) \,-\, w^\top\big(\phi(S^+ \cup Z, \pi^*) - \phi(S^+ \cup Z, \pi)\big) \Big\}.$$
Replacing the AUC term in equation 4.2 with the above expression gives us a surrogate that upper-bounds the partial AUC risk in [0, β]:

$$\widehat{R}^{\mathrm{struct}}_{\mathrm{pAUC}(0,\beta)}(w; S) \;=\; \max_{Z \subseteq S^-,\; |Z| = j_\beta}\;\; \max_{\pi \in \Pi_{m, j_\beta}} \Big\{ \Delta_{\mathrm{AUC}}(\pi^*, \pi) \,-\, w^\top\big(\phi(S^+ \cup Z, \pi^*) - \phi(S^+ \cup Z, \pi)\big) \Big\}, \tag{4.3}$$

where the indices i run over all positive instances and j over the negative instances in the subset Z chosen in the outer maximum.
Clearly, the prescribed surrogate objective is convex in w, as it is a maximum of convex functions in w. In fact, this surrogate is equivalent to the corresponding hinge surrogate for the partial AUC in [0, β] in equation 3.3. More specifically, we know from theorem 1 that the structural SVM expression used above to approximate the inner full AUC term is the same as the hinge surrogate for the AUC:

$$\widehat{R}^{\mathrm{struct}}_{\mathrm{pAUC}(0,\beta)}(w; S) \;=\; \max_{Z \subseteq S^-,\; |Z| = j_\beta}\; \frac{1}{m\, j_\beta}\sum_{i=1}^m \sum_{z \in Z} \Big(1 - \big(w^\top x_i^+ - w^\top z\big)\Big)_+. \tag{4.4}$$
At first glance, this appears different from the hinge surrogate for the [0, β] range in equation 3.3. However, as seen next, the above maximum is attained at the top j_β negatives according to w, which clearly implies that the two surrogates are equivalent.
Proposition 1.

Let Z̄ ⊆ S^- be the set of negative instances ranked in the top j_β positions (among all negative instances in S^-, in descending order of scores) by f_w. Then the maximum value of the objective in equation 4.4 (or equivalently in equation 4.3) is attained at Z = Z̄.

The following result then follows directly from proposition 1.

Theorem 5.

For any w ∈ R^d and training sample S, R̂^struct_pAUC(0,β)(w; S) = R̂^hinge_pAUC(0,β)(w; S).

Also notice that unlike the naive structural SVM surrogate in equation 3.4, the joint feature map in the proposed surrogate in equation 4.3 is defined not on all n negatives but only on a subset of j_β negatives. Consequently, this surrogate does not contain additional redundant terms and is tighter than the naive surrogate (Narasimhan & Agarwal, 2013b) (see Figure 3). We next develop a cutting plane method for optimizing this tighter surrogate.

4.2  Cutting Plane Method for Optimizing R̂^struct_pAUC(0,β)

We would like to minimize the proposed surrogate in equation 4.3 with an additional regularization term on w. This yields the (convex) quadratic program given below:

$$\min_{w,\, \xi \ge 0}\;\; \frac{1}{2}\|w\|^2 \,+\, C\,\xi \quad \text{s.t.}\;\; \forall Z \subseteq S^-,\, |Z| = j_\beta,\;\; \forall \pi \in \Pi_{m, j_\beta}:\;\; w^\top\big(\phi(S^+ \cup Z, \pi^*) - \phi(S^+ \cup Z, \pi)\big) \;\ge\; \Delta_{\mathrm{AUC}}(\pi^*, \pi) \,-\, \xi.$$

Notice that the optimization problem has an exponential number of constraints, one for each subset Z of negative instances of size j_β and each matrix π ∈ Π_{m,j_β}. As with the full AUC, we use the cutting plane method to solve this problem. The idea behind this method is that for any ε > 0, a small subset of the constraints is sufficient to find an ε-approximate solution to the problem (Joachims, 2006). In particular, the method starts with an empty constraint set and, in each iteration, adds the most violated constraint to the set and solves a tighter relaxation of the optimization problem in the subsequent iteration. This continues until no constraint is violated by more than ε (see algorithm 1).

[Algorithm 1: Cutting plane method for optimizing the structural SVM surrogate.]

It is known that for any fixed regularization parameter C and tolerance ε, the cutting plane method converges in O(1/ε) iterations and yields a surrogate value within ε of the minimum value (Joachims, 2006). Since in each iteration the quadratic program to be solved grows by only a single constraint, the primary bottleneck in the algorithm is the combinatorial optimization (over subsets of negatives and ordering matrices) required to find the most violated constraint (line 10).

4.2.1  Finding the Most Violated Constraint

The specific combinatorial optimization problem that we wish to solve can be stated as

$$(\bar{Z}, \bar{\pi}) \;\in\; \operatorname*{argmax}_{Z \subseteq S^-,\, |Z| = j_\beta;\;\; \pi \in \Pi_{m, j_\beta}} \Big\{ \Delta_{\mathrm{AUC}}(\pi^*, \pi) \,-\, w^\top\big(\phi(S^+ \cup Z, \pi^*) - \phi(S^+ \cup Z, \pi)\big) \Big\}.$$

In the case of the AUC, where j_β = n, the above argmax is only over ordering matrices in Π_{m,n} and can be easily computed by exploiting the additive form of the AUC loss; in particular, the problem neatly decomposes into one where each π_ij can be chosen independently (Joachims, 2005). In the case of the partial AUC in [0, β], the decomposition is not as straightforward, as the argmax is also over subsets of negatives.

4.2.2  Reduction to Simpler Problems

We know, however, from proposition 1 that the above argmax is attained at the subset Z̄ of top j_β negatives according to w, and all that remains is to compute the optimal ordering matrix in Π_{m,j_β}, keeping Z = Z̄ fixed. The optimization problem can then be decomposed easily. In particular, having fixed the subset Z̄ = {z̄_1, …, z̄_{j_β}} (where z̄_j denotes the negative instance ranked jth by f_w), the combinatorial optimization problem becomes equivalent to

$$\bar{\pi} \;\in\; \operatorname*{argmax}_{\pi \in \Pi_{m, j_\beta}} \;\; \frac{1}{m\, j_\beta}\sum_{i=1}^m \sum_{j=1}^{j_\beta} \pi_{ij}\Big(1 - w^\top\big(x_i^+ - \bar{z}_j\big)\Big). \tag{OP1}$$

Now consider solving a relaxed form of OP1 over all matrices in {0, 1}^{m×j_β}. The objective decomposes into a sum of terms involving individual elements π_ij and can be maximized by optimizing each term separately. The optimal matrix is then given by π̄_ij = 1(w^⊤x_i^+ − w^⊤z̄_j < 1). It can be seen that this optimal matrix is in fact a valid ordering matrix in Π_{m,j_β}. Notice that π̄ corresponds to an ordering of instances in which the positive instances are scored according to x ↦ w^⊤x − 1 and the negative instances are scored according to x ↦ w^⊤x. Since π̄ corresponds to an ordering resulting from a valid set of scores on the instances, it satisfies the transitivity requirements of a valid ordering matrix. Hence π̄ is also a solution to the original unrelaxed form of OP1 for fixed Z̄, and thus gives us the desired most violated constraint.
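Putting proposition 1 and the solution of OP1 together, the most violated constraint can be computed by sorting the negatives, keeping the top j_β of them, and thresholding each pairwise term. The sketch below (our own notation; it mirrors the role of algorithm 2 but is not the paper's code) returns the loss and the feature-map difference that the cutting plane solver needs:

```python
import numpy as np

def most_violated_pauc_0_beta(w, X_pos, X_neg, beta):
    """Most violated constraint for pAUC in [0, beta].

    Returns (delta, dpsi): the AUC loss of the maximizing truncated ordering matrix
    and the feature difference phi(S+ u Zbar, pi*) - phi(S+ u Zbar, pibar)."""
    m, n = X_pos.shape[0], X_neg.shape[0]
    j_beta = int(np.floor(n * beta))
    top = np.argsort(-(X_neg @ w))[:j_beta]          # top-ranked negatives Zbar (proposition 1)
    Z = X_neg[top]
    margins = (X_pos @ w)[:, None] - (Z @ w)[None, :]
    pi = (margins < 1.0).astype(float)               # pi_ij = 1  iff  1 - w.(x_i^+ - z_j) > 0
    delta = pi.mean()                                # AUC loss of pi on the truncated sample
    # Feature difference: (1/(m*j_beta)) * sum_ij pi_ij * (x_i^+ - z_j).
    dpsi = (pi.sum(axis=1) @ X_pos - pi.sum(axis=0) @ Z) / (m * j_beta)
    return delta, dpsi
```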

4.2.3  Time Complexity

A straightforward implementation to compute the above solution (see algorithm 2) takes O(n log n + m j_β) computational time (assuming score evaluations on instances can be done in unit time). Using a more compact representation of the orderings (Joachims, 2005), however, this can be reduced further (the details are in appendix B.1). Thus, computing the most violated constraint for the partial AUC in a small interval [0, β] is faster than the corresponding computation for the full AUC (Joachims, 2005). This is because the number of negative instances relevant to the given FPR range, over which the most violated constraint is computed, is smaller for the partial AUC. It turns out that in practice, the number of iterations required by the cutting plane method to converge (and, in turn, the number of calls to the given combinatorial optimization) is often higher for the partial AUC than for the AUC. (We elaborate on this when we discuss our experimental results in section 8.)

[Algorithm 2: Finding the most violated constraint for pAUC in [0, β].]

We have presented an efficient method for optimizing the structural SVM surrogate for the partial AUC in the [0, β] range, which we saw was equivalent to the hinge surrogate. We next proceed to algorithms for optimizing the partial AUC in a general [α, β] interval.

5  Structural SVM Approach for Partial AUC in [α, β]

Recall that the partial AUC risk in a general FPR interval [α, β] is given by

$$\widehat{R}_{\mathrm{pAUC}(\alpha,\beta)}(w; S) \;=\; \frac{1}{m(j_\beta - j_\alpha)}\sum_{i=1}^m \sum_{j = j_\alpha + 1}^{j_\beta} \mathbf{1}\big(w^\top x_i^+ \le w^\top x_{(j)_w}^-\big). \tag{5.1}$$
We have seen in section 3 that in this case, the simple hinge surrogate (obtained by replacing the indicator terms in the above risk by the pairwise hinge loss) is not necessarily convex. We have also seen that a naive application of the structural SVM formulation to the above risk yields a surrogate with redundant additional terms involving negative instances outside the specified FPR range. As with the [0, β] case, we now apply the structural SVM framework in a manner that yields a tighter convex surrogate for the partial AUC risk. Of course, in this case, the resulting convex surrogate is not equivalent to the nonconvex hinge surrogate, but as we explain later, it can be seen as a convex relaxation of the hinge surrogate.

Again, the main idea here is to rewrite the partial AUC risk as a maximum of a certain term over subsets of negatives and use the structural SVM formulation to compute a convex approximation to the inner term. We provide an efficient cutting plane method for solving the resulting optimization problem. Here, the combinatorial optimization step for finding the most violated constraint in the cutting plane solver does not admit a decomposition involving individual matrix entries. We show that by using a suitable reformulation of the problem over a restricted search space, the optimization can still be reduced to simpler ones, now involving individual rows of the ordering matrix. In section 6, we shall also look at an approach for directly optimizing the nonconvex hinge surrogate for general FPR ranges.

5.1  Tight Structural SVM Surrogate for pAUC in [α, β]

We begin by describing the construction of the tight structural SVM surrogate. Just as the partial AUC risk in [0, β] could be written as a maximum over subsets of negative instances of the full AUC risk evaluated on this subset (see theorem 4), the partial AUC risk in [α, β] can also be written as a maximum of a certain term over subsets of negative instances of size j_β.

Theorem 6.
For any w ∈ R^d and training sample S,

$$\widehat{R}_{\mathrm{pAUC}(\alpha,\beta)}(w; S) \;=\; \max_{Z \subseteq S^-,\; |Z| = j_\beta} Q_{\alpha,\beta}(w; Z),$$

where for any subset of negative instances Z = {z_1, …, z_{j_β}} that (w.l.o.g.) satisfy w^⊤z_1 ≥ … ≥ w^⊤z_{j_β},

$$Q_{\alpha,\beta}(w; Z) \;=\; \frac{1}{m(j_\beta - j_\alpha)}\sum_{i=1}^m \sum_{j = j_\alpha + 1}^{j_\beta} \mathbf{1}\big(w^\top x_i^+ \le w^\top z_j\big).$$

Note that when α = 0, the term Q_{α,β}(w; Z) is the full AUC risk on the sample S^+ ∪ Z, recovering our previous result in theorem 4. In that case, we directly made use of the structural SVM formulation for the full AUC to construct a convex approximation to this term. However, when α > 0, Q_{α,β}(w; Z) is more complex and can essentially be seen as (a scaled version of) the partial AUC risk in the FPR range [j_α/j_β, 1] defined on the subset of instances S^+ ∪ Z; we will hence have to rework the structural SVM formulation for this term, as described next.

5.1.1  Convex Upper Bound on Q_{α,β}

In particular, we describe how the structural SVM framework can be used to construct a convex upper bound on the inner term Q_{α,β}(w; Z) and thereby obtain a convex surrogate for the partial AUC risk in [α, β]. Restricting ourselves to truncated ordering matrices π ∈ Π_{m,j_β} defined for the m positives and a subset Z of j_β negatives, let us again use (j)_π to denote the index of the jth ranked negative instance by any fixed ordering of instances consistent with π (note that all such orderings yield the same value of (j)_π). We further define the following loss term for the truncated ordering matrices:

$$\Delta_{(\alpha,\beta)}(\pi^*, \pi) \;=\; \frac{1}{m(j_\beta - j_\alpha)}\sum_{i=1}^m \sum_{j = j_\alpha + 1}^{j_\beta} \pi_{i,(j)_\pi}. \tag{5.2}$$
Given that this loss is defined on a subset of instances, as noted above, it can be seen as the partial AUC loss in a scaled interval [j_α/j_β, 1]. The following is then a convex upper bound on Q_{α,β}(w; Z),

$$Q_{\alpha,\beta}(w; Z) \;\le\; \max_{\pi \in \Pi_{m, j_\beta}} \Big\{ \Delta_{(\alpha,\beta)}(\pi^*, \pi) \,-\, w^\top\big(\phi(S^+ \cup Z, \pi^*) - \phi(S^+ \cup Z, \pi)\big) \Big\},$$

and replacing Q_{α,β}(w; Z) in the rewritten partial AUC risk in theorem 6 with the above expression gives us the following upper-bounding surrogate:

$$\widehat{R}^{\mathrm{struct}}_{\mathrm{pAUC}(\alpha,\beta)}(w; S) \;=\; \max_{Z \subseteq S^-,\; |Z| = j_\beta}\;\; \max_{\pi \in \Pi_{m, j_\beta}} \Big\{ \Delta_{(\alpha,\beta)}(\pi^*, \pi) \,-\, w^\top\big(\phi(S^+ \cup Z, \pi^*) - \phi(S^+ \cup Z, \pi)\big) \Big\}. \tag{5.3}$$
The surrogate is a maximum of convex functions in w and is hence convex in w. Even here, it turns out that the above maximum over subsets is attained by the top j_β negatives according to w.
Proposition 2.

Let Z̄ ⊆ S^- be the set of instances in the top j_β positions in the ranking of negative instances (in descending order of scores) by f_w. Then the maximum value of the objective in equation 5.3 is attained at Z = Z̄.

Unlike the case of the partial AUC in [0, β], the above structural SVM surrogate is not equivalent to the nonconvex hinge surrogate in equation 3.2 for [α, β] intervals and is a looser upper bound on the partial AUC risk. However, compared to the naive structural SVM surrogate in equation 3.4 for the [α, β] range, the joint feature map here is defined on only a subset of negatives; as a result, the proposed surrogate is tighter and places more emphasis on good performance in the given FPR range (Narasimhan & Agarwal, 2013b; see Figure 3). This will become clear from the characterization provided below.

5.2  Characterization of R̂^struct_pAUC(α,β)

Before proceeding to develop a method for optimizing the proposed structural SVM surrogate for [α, β] intervals, we analyze how the surrogate is related to the original partial AUC risk in equation 5.1 and to the other surrogates discussed in section 3. These relationships were obvious for the [0, β] case, as the prescribed structural SVM surrogate there was exactly equivalent to the associated hinge surrogate. For the general [α, β] case, it is not immediately clear from the surrogate whether it closely mimics the partial AUC risk. We know so far that the proposed structural SVM surrogate for [α, β] intervals upper-bounds the partial AUC risk; below we give a more detailed characterization.

Theorem 7.
Let 0 < α < β ≤ 1. Then for any sample S and w ∈ R^d,

$$\widehat{R}^{\mathrm{hinge}'}_{\mathrm{pAUC}(\alpha,\beta)}(w; S) \,+\, \rho'(w; S) \;\le\; \widehat{R}^{\mathrm{struct}}_{\mathrm{pAUC}(\alpha,\beta)}(w; S) \;\le\; \widehat{R}^{\mathrm{hinge}}_{\mathrm{pAUC}(\alpha,\beta)}(w; S) \,+\, \rho(w; S),$$

where R̂^hinge_pAUC(α,β) is the hinge surrogate in equation 3.2 and R̂^hinge'_pAUC(α,β) is a version of this surrogate defined on a subset of the positive instances, while ρ and ρ' are nonnegative terms that penalize misrankings against negatives in the FPR range [0, α] with a margin of zero. Moreover, if the score difference w^⊤x_i^+ − w^⊤x_j^- for every pair of positive-negative training instances lies outside [0, 1], then the lower and upper bounds match, and we have

$$\widehat{R}^{\mathrm{struct}}_{\mathrm{pAUC}(\alpha,\beta)}(w; S) \;=\; \widehat{R}^{\mathrm{hinge}}_{\mathrm{pAUC}(\alpha,\beta)}(w; S) \,+\, \rho(w; S).$$

Note that in both the lower and upper bounds, certain positive-negative pairs are penalized with a larger margin than others. In particular, those involving the negative instances in positions j_α + 1 to j_β are penalized with a margin of one, while the rest are penalized with zero margin. This confirms the surrogate's focus on a select portion of the ROC curve in the range [α, β].

Further, if w is such that the difference in scores assigned to any pair of positive-negative training instances is either greater than 1 or less than 0 (which is indeed the case when w has a sufficiently large norm and the training instances are all unique), then the characterization is more precise. Here the structural SVM surrogate is exactly equal to the sum of two terms. The first is the nonconvex hinge surrogate, and the second is a positive term that penalizes misrankings with respect to negatives in positions 1 to j_α and can be seen as enforcing convexity in the surrogate.

While the proposed structural SVM surrogate is not equivalent to the hinge surrogate, it can clearly be interpreted as a convex approximation to the hinge surrogate for the [α, β] range. Also, a similar characterization for the naive structural SVM surrogate in equation 3.4 contains additional terms involving negative instances ranked in positions outside the specified FPR range (Narasimhan & Agarwal, 2013b). The proposed surrogate does not contain these terms and is therefore a tighter upper bound on the partial AUC risk (see also Figure 3).

5.3  Cutting Plane Method for Optimizing R̂^struct_pAUC(α,β)

Having constructed a tight structural SVM surrogate for [α, β] intervals, we now provide an efficient method to optimize it. In particular, we would like to minimize a regularized form of equation 5.3, which results in the following quadratic program:

$$\min_{w,\, \xi \ge 0}\;\; \frac{1}{2}\|w\|^2 \,+\, C\,\xi \quad \text{s.t.}\;\; \forall Z \subseteq S^-,\, |Z| = j_\beta,\;\; \forall \pi \in \Pi_{m, j_\beta}:\;\; w^\top\big(\phi(S^+ \cup Z, \pi^*) - \phi(S^+ \cup Z, \pi)\big) \;\ge\; \Delta_{(\alpha,\beta)}(\pi^*, \pi) \,-\, \xi.$$
Since the optimization problem has an exponential number of constraints, we once again employ the cutting plane method for solving it (see algorithm 1). Recall that the crucial step in the cutting plane solver is to efficiently compute the most violated constraint in each iteration. Below, we provide an algorithm for performing this combinatorial optimization within the cutting plane method in polynomial time.
The specific combinatorial optimization problem we wish to solve has the form

$$(\bar{Z}, \bar{\pi}) \;\in\; \operatorname*{argmax}_{Z \subseteq S^-,\, |Z| = j_\beta;\;\; \pi \in \Pi_{m, j_\beta}} \Big\{ \Delta_{(\alpha,\beta)}(\pi^*, \pi) \,-\, w^\top\big(\phi(S^+ \cup Z, \pi^*) - \phi(S^+ \cup Z, \pi)\big) \Big\}.$$
In the case of the full AUC or the partial AUC in [0, β], the corresponding combinatorial optimization problem decomposes into simpler problems involving the individual entries π_ij. For the partial AUC in a general [α, β] interval, solving this problem is, however, trickier, as the set of negatives involved in the summation in Δ_(α,β) is different for different ordering matrices. In this case, we will no longer be able to optimize each π_ij independently, as the resulting matrix need not correspond to a valid ordering. Nevertheless, we will be able to formulate an equivalent problem over a restricted search space of ordering matrices, where each row of the matrix can be optimized separately and efficiently.
To this end, we first observe from proposition 2 that it suffices to maximize the above optimization objective over ordering matrices in Π_{m,j_β}, fixing Z to the subset Z̄ of the top j_β ranked negative instances according to w (where w.l.o.g. we assume that z̄_j denotes the jth instance in this ranking of negatives):

$$\bar{\pi} \;\in\; \operatorname*{argmax}_{\pi \in \Pi_{m, j_\beta}} \Big\{ \Delta_{(\alpha,\beta)}(\pi^*, \pi) \,-\, w^\top\big(\phi(S^+ \cup \bar{Z}, \pi^*) - \phi(S^+ \cup \bar{Z}, \pi)\big) \Big\}. \tag{OP2}$$

5.3.1  Restricted Search Space of Ordering Matrices

Notice that among the j_β negative instances in Z̄, it is only the bottom-ranked j_β − j_α negatives (under the ordering represented by π) that appear in Δ_(α,β). As noted above, this subset of negative instances is different for different ordering matrices, and hence computing the argmax requires further reformulation. In particular, we next show that the above argmax can be equivalently computed over a restricted set of ordering matrices, given for any w as

$$\Pi^{w}_{m, j_\beta} \;=\; \Big\{ \pi \in \Pi_{m, j_\beta} \;:\; \pi_{ij} = 1,\; \pi_{ik} = 0 \;\Longrightarrow\; w^\top \bar{z}_j \ge w^\top \bar{z}_k, \;\; \forall\, i, j, k \Big\},$$

where, as before, z̄_j denotes the jth ranked negative instance in Z̄, or equivalently in S^-, when the instances are sorted (in descending order of scores) by f_w. This is the set of all ordering matrices in which any two negative instances that are separated by a positive instance are sorted according to w. We then have:
Theorem 8.

The solution to OP2 lies in Π^w_{m,j_β}.

It is further easy to see that for any π ∈ Π^w_{m,j_β}, (j)_π = j, as there always exists an ordering consistent with π in which the negatives are all sorted according to w (this follows from the definition of Π^w_{m,j_β}). As a result, equation OP2 can be equivalently framed as

$$\bar{\pi} \;\in\; \operatorname*{argmax}_{\pi \in \Pi^{w}_{m, j_\beta}} \; \sum_{i=1}^m \Bigg\{ \frac{1}{m(j_\beta - j_\alpha)} \sum_{j = j_\alpha + 1}^{j_\beta} \pi_{ij} \;+\; \frac{1}{m\, j_\beta} \sum_{j=1}^{j_\beta} \pi_{ij}\, \big(w^\top \bar{z}_j - w^\top x_i^+\big) \Bigg\}, \tag{5.4}$$
where π_i ∈ {0, 1}^{j_β} denotes the ith row of the ordering matrix π.

5.3.2  Reduction to Simpler Problems

With this reformulation, it turns out that each row π_i can be considered separately; moreover, the optimization over each π_i can be done efficiently. In particular, note that for each i, the ith row of the optimal ordering matrix for the above problem essentially corresponds to an interleaving of the lone positive instance x_i^+ with the list of negative instances z̄_1, …, z̄_{j_β} sorted according to w. Thus each optimal row π_i is of the form

$$\pi_{ij} \;=\; \begin{cases} 1 & \text{if } j \le q_i,\\ 0 & \text{otherwise}, \end{cases} \tag{5.5}$$

for some q_i ∈ {0, 1, …, j_β}. In other words, the optimization over the row π_i reduces to an optimization over q_i or, equivalently, to an optimization over the set T of rows of the above form, with |T| = j_β + 1. Clearly, Π^w_{m,j_β} = T × ⋯ × T (m times), and hence we can rewrite equation 5.4 as
$$\bar{\pi}_i \;\in\; \operatorname*{argmax}_{\pi_i \in T} \Bigg\{ \frac{1}{m(j_\beta - j_\alpha)} \sum_{j = j_\alpha + 1}^{j_\beta} \pi_{ij} \;+\; \frac{1}{m\, j_\beta} \sum_{j=1}^{j_\beta} \pi_{ij}\, \big(w^\top \bar{z}_j - w^\top x_i^+\big) \Bigg\}, \qquad i = 1, \ldots, m. \tag{OP3}$$

Since the objective given above decomposes into a sum of terms involving the individual rows π_i, equation OP2 can be solved by maximizing over each row π_i separately.

5.3.3  Time Complexity

In a straightforward implementation of this optimization, for each i, one would evaluate the objective for each of the j_β + 1 values of q_i (corresponding to the choices of π_i; see equation 5.5) and select the optimal among these. Each such evaluation takes O(j_β) time, yielding an overall time complexity of O(m j_β²) (after the initial sorting of the negatives). It turns out, however, that one can partition the values of q_i into two groups, {0, …, j_α} and {j_α + 1, …, j_β}, such that the optimization over q_i within each of these groups (after the negative instances have been sorted according to w) can be implemented with incremental updates in O(j_β) time per row. A description is given in algorithm 3, where the overall time complexity is O(m j_β + n log n).

[Algorithm 3: Finding the most violated constraint for pAUC in [α, β].]
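The following sketch illustrates the row-wise search; it is based on the decomposition reconstructed in OP3 above (so the normalization constants are assumptions carried over from that reconstruction), and it scans the interleaving positions q_i with incremental updates rather than reevaluating the objective from scratch:

```python
import numpy as np

def most_violated_pauc_alpha_beta(w, X_pos, X_neg, alpha, beta):
    """Row-wise search for the most violated constraint for pAUC in [alpha, beta].

    Each positive x_i^+ is interleaved at the best position q_i among the
    top-ranked negatives, scanning q_i = 0..j_beta in O(j_beta) time per row."""
    m, n = X_pos.shape[0], X_neg.shape[0]
    j_alpha, j_beta = int(np.ceil(n * alpha)), int(np.floor(n * beta))
    order = np.argsort(-(X_neg @ w))[:j_beta]              # Zbar: top j_beta negatives, sorted by w
    Z, s_neg = X_neg[order], (X_neg @ w)[order]
    s_pos = X_pos @ w
    pi = np.zeros((m, j_beta))
    for i in range(m):
        # Incremental gain from ranking x_i^+ below one more negative.
        gains = (s_neg - s_pos[i]) / (m * j_beta)
        gains[j_alpha:] += 1.0 / (m * (j_beta - j_alpha))   # margin reward only in positions j_alpha+1..j_beta
        cum = np.concatenate(([0.0], np.cumsum(gains)))     # objective value for q_i = 0..j_beta
        pi[i, : int(np.argmax(cum))] = 1.0                  # best interleaving position q_i
    delta = pi[:, j_alpha:].sum() / (m * (j_beta - j_alpha))
    dpsi = (pi.sum(axis=1) @ X_pos - pi.sum(axis=0) @ Z) / (m * j_beta)
    return delta, dpsi
```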

Again, using a more compact representation of the orderings, it is possible to further reduce the computational complexity of algorithm 3 (see appendix B.1 for details). Note that the time complexity for finding the most violated constraint for the partial AUC (with j_β < n) has a better dependence on the number of training instances than that for the usual AUC (Joachims, 2005). However, we find in our experiments in section 8 that the overall running time of the cutting plane method is often higher for the partial AUC than for the full AUC, because the number of calls made to the inner combinatorial optimization routine turns out to be higher in practice for the partial AUC.

We thus have an efficient method for optimizing a convex surrogate on the partial AUC risk in [α, β]. While the surrogate optimized by our method is a tighter approximation to the partial AUC risk than the naive structural SVM surrogate considered initially in section 3, we know from the characterization result in theorem 7 that the surrogate does contain terms involving negative instances in positions outside the specified FPR range. The hinge surrogate in equation 3.2 for [α, β] intervals, while being nonconvex, serves as a tighter approximation to the partial AUC risk (see Figure 3). This motivates us to next develop a method for directly optimizing the nonconvex hinge surrogate for [α, β] intervals. We will make use of a popular nonconvex optimization technique based on DC programming for this purpose.

6  DC Programming Approach for Partial AUC in [α, β]

As noted above, for general FPR intervals [α, β], the structural SVM-based surrogate optimized in the previous section is often a looser relaxation of the partial AUC risk than the nonconvex hinge loss–based surrogate considered in equation 3.2. We now develop an approach for directly optimizing the hinge surrogate. Here we resort to a popular difference-of-convex (DC) programming technique, exploiting the fact that the partial AUC in [α, β] is essentially a difference between (scaled) partial AUC values in [0, β] and [0, α]. The structural SVM algorithm developed in section 4 for false-positive ranges of the form [0, β] will be used as a subroutine here.

6.1  Difference-of-Convex Objective

We begin by rewriting the surrogate in equation 3.2 as a difference of hinge surrogates in the intervals [0, β] and [0, α], thus allowing us to write the optimization objective as a difference of two convex functions in w:

$$\widehat{R}^{\mathrm{hinge}}_{\mathrm{pAUC}(\alpha,\beta)}(w; S) \;=\; \frac{1}{j_\beta - j_\alpha}\Big( j_\beta\, \widehat{R}^{\mathrm{hinge}}_{\mathrm{pAUC}(0,\beta)}(w; S) \;-\; j_\alpha\, \widehat{R}^{\mathrm{hinge}}_{\mathrm{pAUC}(0,\alpha)}(w; S) \Big) \;=\; \frac{1}{j_\beta - j_\alpha}\Big( j_\beta\, \widehat{R}^{\mathrm{struct}}_{\mathrm{pAUC}(0,\beta)}(w; S) \;-\; j_\alpha\, \widehat{R}^{\mathrm{struct}}_{\mathrm{pAUC}(0,\alpha)}(w; S) \Big),$$

where, in the second step, we use theorem 5 to write the hinge surrogates for the partial AUC in [0, β] and [0, α] in terms of the tight structural SVM formulation (see equation 4.3).

6.2  Concave-Convex Procedure

The above difference-of-convex function can now be optimized directly using the well-known concave-convex procedure (CCCP) (Yu & Joachims, 2009; Yuille & Rangarajan, 2003). This technique works by successively computing a gradient-based linear upper bound on the concave (or negative convex) part of the objective and optimizing the resulting sum of convex and linear functions (see algorithm 4).

[Algorithm 4: Concave-convex procedure (CCCP) for optimizing the DC objective.]

In our case, each iteration of this technique maintains a model vector w_t and computes a supergradient of the concave term at w_t. Since this term is essentially the negative of a maximum of linear functions in w (see equation 4.3), one can obtain a supergradient (with respect to w) by computing the gradient of the linear function at which the maximum is attained (Bertsekas, 1999); specifically, if (Z̄, π̄) is the subset-matrix pair at which the maximum is attained at w_t (which can be computed efficiently using algorithm 2), a supergradient of this term is proportional to φ(S^+ ∪ Z̄, π*) − φ(S^+ ∪ Z̄, π̄), and the corresponding linear upper bound is obtained from the first-order expansion of the concave term around w_t. This gives us a convex upper bound on our original difference-of-convex objective, which can be optimized efficiently using a straightforward variant of the structural SVM method discussed in section 4. The CCCP method then simply alternates between the above-described linearization and convex upper-bound optimization steps until the difference in objective value across two successive iterations falls below a tolerance ε.

The CCCP method is guaranteed to converge only to a locally optimal solution or to a saddle point (Tao, 1997; Yuille & Rangarajan, 2003) and is computationally more expensive than the previous approach, as it requires solving an entire structural SVM optimization in each iteration. However, as we shall see in our experiments, this method yields higher partial AUC values than the structural SVM method in some cases.
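A high-level sketch of the CCCP loop is given below; the three callbacks (objective evaluation, linearization of the concave part, and the convex upper-bound solver) are left abstract, and all names are our own illustrative choices.

```python
def cccp(w0, objective, linearize_concave, solve_convex_upper_bound, tol=1e-4, max_iter=50):
    """Concave-convex procedure (CCCP) sketch for minimizing convex(w) - g(w), with g convex.

    objective(w): value of the full difference-of-convex objective at w.
    linearize_concave(w): a supergradient of the concave part -g at w; for the pAUC
        objective this comes from the maximizing subset-matrix pair (algorithm 2).
    solve_convex_upper_bound(grad): minimizer of convex(w) + grad.w (plus regularization),
        e.g., via a variant of the structural SVM solver of section 4."""
    w, prev = w0, objective(w0)
    for _ in range(max_iter):
        grad = linearize_concave(w)            # linear upper bound on the concave part at w
        w = solve_convex_upper_bound(grad)     # optimize the convexified objective
        cur = objective(w)
        if prev - cur < tol:                   # stop when the decrease falls below tolerance
            break
        prev = cur
    return w
```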

7  Generalization Bound for Partial AUC

We have so far focused on developing efficient methods for optimizing the partial AUC. In this section, we look at the generalization properties of this evaluation measure. In particular, we derive a uniform convergence generalization bound for the partial AUC risk, thus establishing that good training performance in terms of the partial AUC also implies good generalization performance. We first define the population or distribution version of the partial AUC risk for a general scoring function f:

$$R_{\mathrm{pAUC}(\alpha,\beta)}[f] \;=\; \frac{1}{\beta - \alpha}\, \mathbf{E}_{x \sim \mathcal{D}_+,\, x' \sim \mathcal{D}_-}\Big[ \mathbf{1}\big(f(x) \le f(x')\big)\; \mathbf{1}\big(f(x') \in \big(\mathrm{FPR}_f^{-1}(\beta),\, \mathrm{FPR}_f^{-1}(\alpha)\big]\big) \Big], \tag{7.1}$$
where 1(·) is an indicator function that is one if its argument is true and is zero otherwise. As before, we define the empirical partial AUC risk for a sample S:

$$\widehat{R}_{\mathrm{pAUC}(\alpha,\beta)}[f; S] \;=\; \frac{1}{m(j_\beta - j_\alpha)}\sum_{i=1}^m \sum_{j=1}^n \mathbf{1}\big(f(x_i^+) \le f(x_j^-)\big)\; \mathbf{1}\big(x_j^- \in \{x_{(j_\alpha + 1)}^-, \ldots, x_{(j_\beta)}^-\}\big), \tag{7.2}$$

where the second indicator term is turned on only when x_j^- lies in positions j_α + 1 to j_β in the ranking of all negative instances by f.

We would like to show that the above empirical risk is not too far from the population risk for any scoring function chosen from a given real-valued function class F of reasonably bounded capacity. In our case, the capacity of such a function class will be measured using the VC dimension of the class of thresholded classifiers obtained from scoring functions in F: T_F = {sign(f(·) − t) : f ∈ F, t ∈ R}. Then:

Theorem 9.
Let F be a class of real-valued functions on X, and let 0 ≤ α < β ≤ 1 and 0 < δ ≤ 1. Then with probability at least 1 − δ (over the draw of the sample S from D_+^m × D_-^n), we have for all f ∈ F,

$$R_{\mathrm{pAUC}(\alpha,\beta)}[f] \;\le\; \widehat{R}_{\mathrm{pAUC}(\alpha,\beta)}[f; S] \;+\; C\left( \sqrt{\frac{d \ln m + \ln(1/\delta)}{m}} \;+\; \frac{1}{\beta - \alpha}\sqrt{\frac{d \ln n + \ln(1/\delta)}{n}} \right),$$

where d is the VC dimension of T_F and C > 0 is a distribution-independent constant.

The above result provides a bound on the generalization performance of a learned scoring function in terms of its empirical (training) risk. Also notice that the tightness of this bound depends on the width of the FPR range of interest: the smaller the FPR interval, the looser the bound. Intuitively, this is because the partial AUC risk is normalized by the width of the interval, so a fixed deviation between empirical and population quantities is magnified when the interval is narrow.

The proof of this result differs substantially from that for the full AUC (Agarwal et al., 2005), as the more complex structure of the partial AUC precludes a direct application of standard concentration results such as Hoeffding's inequality. Instead, the difference between the empirical and population risks needs to be broken down into simpler additive terms that can each be bounded using standard arguments. We provide the details in appendix A.10.7

8  Experiments

In this section, we present an experimental evaluation of the proposed SVM-based methods for optimizing the partial AUC, both on a number of real-world applications in which the partial AUC is a performance measure of interest and on benchmark UCI data sets. The structural SVM algorithms were implemented using the publicly available structural SVM API of Tsochantaridis et al. (2005),8 while the DC programming method was implemented using the latent structural SVM API of Yu and Joachims (2009).9 In each case, two-thirds of the data set was used for training and the remaining third for testing, with results averaged over five such random splits. The tunable parameters were chosen using a held-out portion of the training set treated as a validation set. The specific parameter choices, along with details of data preprocessing, are given in appendix B.2. All experiments used a linear model.10
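As a rough sketch of the evaluation protocol just described (two-thirds/one-third train-test splits, validation-based parameter tuning, and averaging over random splits), the following Python snippet may be helpful; the callables train_fn and pauc_fn are hypothetical placeholders for a partial AUC learner and a partial AUC evaluator and are not part of the APIs mentioned above.

```python
import numpy as np

def evaluation_protocol(X, y, param_grid, train_fn, pauc_fn,
                        alpha, beta, n_splits=5, seed=0):
    """Repeated 2/3-1/3 train/test evaluation with validation-based tuning."""
    rng = np.random.RandomState(seed)
    scores = []
    for _ in range(n_splits):
        perm = rng.permutation(len(y))
        cut = (2 * len(y)) // 3
        train, test = perm[:cut], perm[cut:]
        # Hold out part of the training split as a validation set for tuning.
        val_cut = (2 * len(train)) // 3
        tr, val = train[:val_cut], train[val_cut:]
        best_param, best_val = None, -np.inf
        for param in param_grid:
            model = train_fn(X[tr], y[tr], param, alpha, beta)
            v = pauc_fn(model, X[val], y[val], alpha, beta)
            if v > best_val:
                best_param, best_val = param, v
        # Retrain on the full training split with the chosen parameter.
        model = train_fn(X[train], y[train], best_param, alpha, beta)
        scores.append(pauc_fn(model, X[test], y[test], alpha, beta))
    return np.mean(scores), np.std(scores)
```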

8.1  Maximizing Partial AUC in [0, β]

We begin with our results for FPR intervals of the form [0, β]. We considered two real-world applications where the partial AUC in such an interval is an evaluation measure of interest: a protein-protein interaction prediction task and a drug discovery task (see Table 1). Throughout, we refer to the proposed structural SVM-based method of section 4 simply as the structural SVM method. For comparison, we included the structural SVM algorithm of Joachims (2005) for optimizing the full AUC (the full-AUC method), as well as three existing algorithms for optimizing the partial AUC in [0, β]: asymmetric SVM (ASVM) (Wu et al., 2008), pAUCBoost (Komori & Eguchi, 2010), and a greedy heuristic method due to Ricamato and Tortorella (2011). For completeness, we also compared against a standard classification SVM that optimizes classification accuracy.11

Table 1:
Real Data Sets Used.
Data Set    Number of Instances    Number of Features
ppi         240,249                85
chemo       2142                   1021
kddcup08    102,294                117
kddcup06    4429                   116

8.1.1  Protein-Protein Interaction Prediction

In protein-protein interaction (PPI) prediction, the task is to predict whether a given pair of proteins interacts. Owing to the highly imbalanced nature of PPI data (e.g., only 1 in every 600 protein pairs in yeast is found to interact), the partial AUC in a small FPR range has been advocated as an evaluation measure for this application (Qi et al., 2006). We used the PPI data for Yeast from Qi et al. (2006), which contains 2865 protein pairs known to be interacting (positive) and a random set of 237,384 protein pairs assumed to be noninteracting (negative). Each protein pair is represented using 85 features.12 We evaluated the partial AUC performance of these methods on two FPR intervals of the form [0, β]. To compare the methods across training sample sizes, we report results in Figure 4 for varying fractions of the training set. As seen, the proposed method almost always yields higher partial AUC in the specified FPR intervals than the method optimizing the full AUC, confirming its focus on a select portion of the ROC curve. Interestingly, the difference in performance is more pronounced for smaller training sample sizes, implying that when one has limited training data, it is more beneficial to use the data to directly optimize the partial AUC than to optimize the full AUC. Also, in most cases, the proposed method performs comparably to or better than the other baselines; the pAUCBoost and Greedy-Heuristic methods perform particularly poorly on smaller training samples owing to their reliance on heuristics. As expected, the classification SVM performs even worse than the full-AUC maximizing approach, clearly showing that optimizing classification accuracy does not necessarily yield good performance on the ROC curve.
Figure 4: Partial AUC maximization in [0, β] on PPI data.

8.1.2  Drug Discovery

The next task we considered uses examples of chemical compounds that are active or inactive against a therapeutic target; the goal is to rank new compounds such that the active ones appear above the inactive ones. Here the interest is in good ranking quality in the top portion of the ranked list, and hence good partial AUC in a small FPR interval at the initial portion of the ROC curve is a performance measure of interest. In our experiments, we used a virtual screening data set from Jorissen and Gilson (2005); this contains 50 active or positive compounds (corresponding to reversible antagonists of the adrenoceptor) and 2092 inactive or negative ones, with each compound represented as a 1021-bit vector using the FP2 molecular fingerprint representation (as done in Agarwal, Dugar, & Sengupta, 2010). Figure 5 shows the partial AUC performance for varying fractions of the training set on two FPR intervals of the form [0, β]. Clearly, for the most part, the proposed method yields higher partial AUC values than the full-AUC method and performs comparably to or better than the other baseline algorithms.
Figure 5: Partial AUC maximization in [0, β] on drug discovery data.

8.2  Maximizing Partial AUC in [α, β]

We next move to our experiments on the partial AUC in a general interval [α, β]. We again refer to the proposed structural SVM method for maximizing the partial AUC in [α, β] as the structural SVM method, and to our DC programming approach for optimizing the nonconvex hinge surrogate as the DC programming method. As baselines, we included the full-AUC method, pAUCBoost (which can optimize the partial AUC over general FPR ranges), an extension of the greedy heuristic method of Ricamato and Tortorella (2011) that handles arbitrary FPR ranges, and the classification SVM. We first present our results on a real-world application where the partial AUC in [α, β] is a useful evaluation measure.

8.2.1  Breast Cancer Detection

We consider the task posed in the KDD Cup 2008 challenge, where one is required to predict whether a given region of interest (ROI) from a breast X-ray image corresponds to a malignant (positive) or a benign (negative) tumor (Rao et al., 2008). The data were collected from 118 malignant patients and 1594 normal patients. Four X-ray images are available for each patient; overall, there are 102,294 candidate ROIs selected from these X-ray images, of which 623 are positive, with each ROI represented by 117 features. In the KDD Cup challenge, performance was evaluated in terms of the partial area under the free-response operating characteristic (FROC) curve in a false-positive range deemed clinically relevant based on radiologist surveys. The FROC curve (Miller, 1969) effectively uses a scaled version of the false-positive rate in the usual ROC curve; for our purposes, the corresponding false-positive rate is obtained by rescaling by a factor equal to the total number of images divided by the total number of negative ROIs. Thus, the goal in our experiments was to maximize the partial AUC in the resulting clinically relevant FPR range [α, β]. Table 2 presents results for the proposed methods developed for FPR intervals of this form, as well as for the baseline methods; the best performance in this case is obtained by one of the proposed methods.
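To make the rescaling between the FROC and ROC false-positive axes concrete, here is a small Python sketch using the counts reported above; the helper name froc_to_fpr and the example usage are illustrative only, and the specific clinically relevant range is not reproduced here.

```python
# Counts taken from the description above.
num_patients = 118 + 1594            # malignant + normal patients
images_per_patient = 4
num_images = num_patients * images_per_patient

num_rois = 102_294                   # candidate ROIs
num_positive_rois = 623
num_negative_rois = num_rois - num_positive_rois

# An FROC operating point (false positives per image) maps to an ROC
# false-positive rate via the factor (total images) / (total negative ROIs).
scale = num_images / num_negative_rois

def froc_to_fpr(false_positives_per_image):
    """Convert false positives per image (FROC axis) to an FPR (ROC axis)."""
    return false_positives_per_image * scale

# A hypothetical FROC range [a, b] (false positives per image) thus
# corresponds to the FPR interval [froc_to_fpr(a), froc_to_fpr(b)].
```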

Table 2:
Partial AUC Maximization in [α, β] with KDD Cup 08 Data.
 0.638  (0.02)
 0.616  (0.02)
 0.612  (0.02)
pAUCBoost 0.603  (0.02)
Greedy-Heuristic 0.562  (0.02)
Classification SVM 0.484  (0.02)

Notes: The standard deviations of the reported values are enclosed in parentheses. Here [α, β] denotes the clinically relevant FPR range described in the text. The method performing best is highlighted in bold.

8.2.2  UCI Data Sets

To perform a more detailed comparison between the proposed structural SVM and DC programming methods for general intervals, we also evaluated the methods on a number of benchmark data sets obtained from the UCI machine learning repository (Frank & Asuncion, 2010; see Table 3). The results for the chosen FPR interval [α, β] are shown in Table 4. For completeness, we also report the performance of the baseline methods. Despite having to solve a nonconvex optimization problem (and hence running the risk of getting stuck at a locally optimal solution), the DC programming method does perform better than the structural SVM method in some cases, though between the two there is no clear winner. Also, on three of the five data sets, one of the two proposed methods yields the best overall performance. The strikingly poor performance of pAUCBoost and Greedy-Heuristic on the cod-rna data set arises because the individual features in this data set yield low partial AUC values; since these methods rely on local greedy heuristics, they fail to find a linear combination of the features that yields high partial AUC performance. Unsurprisingly, the classification SVM performs worse than the proposed methods in many cases.

Table 3:
UCI Data Sets Used.
Data Set    Number of Instances    Number of Features
a9a         48,842                 123
cod-rna     488,565
covtype     581,012                54
ijcnn1      141,691                22
letter      20,000                 16
Table 4:
Partial AUC Maximization in [α, β] on UCI Data Sets.
                      a9a           cod-rna       covtype       ijcnn1        letter
                      0.274 (0.07)  0.919 (0.00)  0.247 (0.00)  0.613 (0.03)  0.521 (0.02)
                      0.365 (0.04)  0.920 (0.00)  0.241 (0.09)  0.680 (0.00)  0.518 (0.02)
                      0.434 (0.01)  0.919 (0.00)  0.299 (0.01)  0.475 (0.01)  0.445 (0.02)
pAUCBoost             0.401 (0.01)  0.033 (0.00)  0.448 (0.00)  0.491 (0.06)  0.495 (0.05)
Greedy-Heuristic      0.342 (0.02)  0.033 (0.00)  0.239 (0.01)  0.120 (0.01)  0.289 (0.01)
Classification SVM    0.354 (0.02)  0.872 (0.00)  0.378 (0.01)  0.314 (0.01)  0.351 (0.02)

Notes: All methods except the full-AUC method and the classification SVM seek to optimize the partial AUC in the range [α, β]. The standard deviations of the reported values are in parentheses, and the best-performing methods are in bold.

8.3  Maximizing TPR at a Specific FPR Value

We have so far seen that the proposed methods are good at learning scoring functions that yield high partial AUC in a specified FPR range. In our next experiment, we demonstrate that the proposed methods can also be useful in applications where the requirement is to learn a classifier with specific true- or false-positive requirements. In particular, we consider the task described in the KDD Cup 2006 challenge of detecting pulmonary emboli in medical images obtained from CT angiography (Lane, Rao, Bi, Liang, & Salganicoff, 2006). Given a candidate region of interest (ROI) from the image, the goal is to predict whether it is a pulmonary embolus. A specific requirement here is that the classifier must have high TPR, with the FPR kept within a specified limit.

Indeed, if a classifier is constructed by thresholding a scoring function, the above evaluation measure (the TPR at a given FPR) can be seen as the partial AUC of the scoring function in an infinitesimally small FPR interval around the given FPR value. However, given the small size of the FPR interval concerned, maximizing this evaluation measure directly may not produce a classifier with good generalization performance, particularly with smaller training samples (see the generalization bound in section 7). Instead, we prescribe a more robust approach: use the methods developed in this letter to maximize the partial AUC in an appropriate larger FPR interval, and construct a classifier by suitably thresholding the scoring function thus obtained.
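As an illustration of this thresholding step, the following Python sketch (with a hypothetical helper tpr_at_fpr) picks a score threshold so that the empirical FPR on the negative sample stays within the allowed limit and reports the resulting TPR.

```python
import numpy as np

def tpr_at_fpr(scores_pos, scores_neg, fpr_limit=0.1):
    """Threshold a scoring function so that the FPR on the negative sample is
    at most fpr_limit, and return the resulting TPR and threshold."""
    neg_sorted = np.sort(scores_neg)[::-1]            # descending order
    k = int(np.floor(fpr_limit * len(neg_sorted)))    # max negatives allowed
    if k == 0:
        threshold = neg_sorted[0] + 1e-12             # reject all negatives
    else:
        threshold = neg_sorted[k - 1]
    tpr = float(np.mean(scores_pos > threshold))
    return tpr, threshold
```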

The data provided in the KDD Cup challenge consist of 4,429 ROIs represented by 116 features; 500 of these are positive. We considered a maximum allowable FPR limit of 0.1 (one of the values prescribed in the challenge). The proposed partial AUC maximization methods were used to learn scoring functions over two FPR intervals (one of the form [0, β] and one of the form [α, β]) that we expected would promote a high TPR at the given FPR of 0.1. The performance of a learned model was then evaluated based on the TPR it yields when thresholded at an FPR of 0.1. Table 5 contains results for the structural SVM method on both intervals and for the DC programming method on the [α, β] interval; we also included the full-AUC method for comparison. Interestingly, in this case, the structural SVM method on one of the two intervals performs the best, with the DC programming method on the [α, β] interval a close second, performing better than the structural SVM approach on the same interval.13

Table 5:
Partial AUC Maximization as a Proxy for Maximizing the TPR at a Specified FPR on KDD Cup 06 Data.
TPR at FPR = 0.1
 0.613
 0.584
 0.591
 0.584

Note: The best-performing method is highlighted in bold.

We also note that in follow-up work, the proposed approach was applied to a similar problem in personalized cancer treatment (Majumder et al., 2015), where the goal was to predict whether a given cancer patient will respond well to a drug treatment. In this application, one again requires high true-positive rates (fraction of cases where the treatment is effective and the model predicts the same), subject to the false-positive rate being within an allowable limit. As above, the problem was posed as a partial AUC maximization task, and the classifiers learned using our methods were found to yield higher TPR performance than standard approaches for this problem while not exceeding the allowed FPR limit.

8.4  Run-Time Analysis

8.4.1  Run-Time Comparison with Baseline Methods

In our final set of experiments, we compared the running times of the various partial AUC optimizing algorithms evaluated above. For completeness, we also include running times for the full-AUC maximizing method. Figure 6 shows the average training times (across five train-test splits) for partial AUC maximization tasks involving FPR intervals of the form [0, β] on two data sets. We also report the average time taken to tune the parameters of the proposed method, the full-AUC method, and ASVM using the validation set; the remaining two methods do not have tunable parameters. The running times for partial AUC maximization in a general range [α, β] are shown in Figure 7 for two data sets. All experiments were run on an Intel Xeon (2.13 GHz) machine with 12 GB RAM.
Figure 6: Partial AUC maximization in [0, β]. Comparison of average training and validation times between the proposed method and the baseline methods. The validation time is the time taken to tune the parameters of the learning algorithms using a held-out validation set; the training time is the time taken to run the learning algorithms on the entire training set using the chosen parameters.

Figure 7: Partial AUC maximization in [α, β]. Comparison of average training and validation times between the proposed methods and the baseline methods. The validation time is the time taken to tune the parameters of the learning algorithms using a held-out validation set; the training time is the time taken to run the learning algorithms on the entire training set using the chosen parameters.

Notice that, except for the full-AUC method, all the baselines require higher or similar training times compared to the proposed structural SVM method. Also, as expected, for the [α, β] intervals, the DC programming-based method (which solves an entire structural SVM problem in each iteration) requires more training time than the structural SVM method. The reason the full-AUC maximizing method is the fastest in all cases, despite its subroutine for finding the most violated constraint requiring more computational time than the corresponding subroutine for the partial AUC, is that the number of iterations required by the cutting plane solver is lower for the full AUC. This will become clear in our next set of experiments, where we report the number of iterations taken by the cutting plane solver under different settings.

8.4.2  Influence of α, β, and the Regularization Parameter on the Number of Cutting Plane Iterations

We next analyzed the number of iterations taken by the cutting plane method in the proposed solvers, that is, the number of calls made to the routine for finding the most violated constraint (algorithms 2 and 3), for different FPR intervals and regularization parameter values. Our results for FPR ranges of the form [0, β] are shown in Figure 8, where we plot the average number of cutting plane iterations (over five train-test splits) as a function of β for different values of the regularization parameter. The number of iterations generally increases as β decreases (i.e., as the FPR interval becomes smaller), suggesting that optimizing the partial AUC is harder for smaller intervals. This explains why the overall training time is lower for maximizing the full AUC (i.e., for β = 1) than for maximizing the partial AUC in a small interval. Similarly, the number of cutting plane iterations increases with the regularization parameter, as suggested by the method's convergence rate (see section 4).
Figure 8: Partial AUC in [0, β]. Average number of cutting plane iterations versus the length of the FPR interval for different values of the regularization parameter. Here β = 1 corresponds to the full-AUC optimizing method.

Our results for general FPR intervals [α, β] are shown in Figure 9, where we plot the average number of cutting plane iterations as a function of the interval length for different parameter settings. Again, the number of iterations is higher for smaller FPR intervals. Interestingly, the number of iterations also increases as the interval moves farther to the right.
Figure 9: Partial AUC in [α, β] on KDD Cup 2008 data. Average number of cutting plane iterations versus the length of the FPR interval for different parameter settings.

9  Conclusion and Open Questions

The partial AUC is increasingly used as a performance measure in several machine learning and data mining applications. We have developed support vector algorithms for optimizing the partial AUC between two given false-positive rates α and β. Unlike the full AUC, for which it is straightforward to develop surrogate-optimizing methods, even constructing a (tight) convex surrogate for the partial AUC turns out to be nontrivial. By exploiting the specific structure of the evaluation measure and extending the structural SVM framework of Joachims (2005), we have constructed convex surrogates for the partial AUC and developed an efficient cutting plane method for solving the resulting optimization problem. In addition, we have provided a DC programming method for optimizing a nonconvex hinge surrogate that is tighter for general false-positive ranges [α, β]. Our empirical results on several real-world and benchmark tasks indicate that the proposed methods indeed optimize the partial AUC in the desired false-positive range, often performing comparably to or better than existing baselines.

Subsequent to the conference versions of this letter, there have been a number of follow-up studies. One such work applied our algorithm to an important problem in personalized cancer treatment (Majumder et al., 2015), where the task was to predict clinical responses of patients to chemotherapy. Other studies include minibatch extensions of our structural SVM method for the [0, β] range to online and large-scale stochastic settings (Kar et al., 2014), as well as an ensemble-style version of this method with application to a problem in computer vision (Paisitkriangkrai et al., 2013, 2014).

Several questions remain open. First, it would be useful to understand the consistency properties of the proposed algorithms—conditions under which optimizing the proposed surrogates yields the optimal scoring function for the original partial AUC measure. Recently, consistency properties have been established for the method that optimizes the full AUC (Uematsu & Lee, 2015; Gao & Zhou, 2015), but these results do not directly extend to the partial AUC. Second, we observed in our experiments that the number of iterations required by the proposed cutting plane solvers to converge depends on the length of the specified FPR range, but this is not evident from the current convergence rate for the solver (see section 4) obtained from a result in Joachims (2006). It would be of interest to see if tighter convergence rates that match our empirical observation can be shown for the cutting plane solver. Finally, one could look at extensions of the proposed algorithms to multiclass classification and ordinal regression settings, where often there are different constraints on the error rates of a predictor on different classes. Again, there has been work on optimizing multiclass versions of the full AUC (Waegeman, De Baets, & Boullart, 2008; Clémençon et al.,