## Abstract

The area under the ROC curve (AUC) is a widely used performance measure in machine learning. Increasingly, however, in several applications, ranging from ranking to biometric screening to medicine, performance is measured not in terms of the full area under the ROC curve but in terms of the partial area under the ROC curve between two false-positive rates. In this letter, we develop support vector algorithms for directly optimizing the partial AUC between any two false-positive rates. Our methods are based on minimizing a suitable proxy or surrogate objective for the partial AUC error. In the case of the full AUC, one can readily construct and optimize convex surrogates by expressing the performance measure as a summation of pairwise terms. The partial AUC, on the other hand, does not admit such a simple decomposable structure, making it more challenging to design and optimize (tight) convex surrogates for this measure.

Our approach builds on the structural SVM framework of Joachims (2005) to design convex surrogates for partial AUC and solves the resulting optimization problem using a cutting plane solver. Unlike the full AUC, where the combinatorial optimization needed in each iteration of the cutting plane solver can be decomposed and solved efficiently, the corresponding problem for the partial AUC is harder to decompose. One of our main contributions is a polynomial time algorithm for solving the combinatorial optimization problem associated with partial AUC. We also develop an approach for optimizing a tighter nonconvex hinge loss–based surrogate for the partial AUC using difference-of-convex programming. Our experiments on a variety of real-world and benchmark tasks confirm the efficacy of the proposed methods.

## 1 Introduction


In this letter, we develop support vector machine (SVM)–based algorithms for directly optimizing the partial AUC between any two false-positive rates $\alpha$ and $\beta$. Our methods are based on minimizing a suitable proxy or surrogate objective for the partial AUC error. In the case of the full AUC, where the evaluation measure can be expressed as a summation of pairwise indicator terms, one can readily construct and optimize surrogates by exploiting this structure. The partial AUC does not admit such a decomposable structure, as the set of negative instances associated with the specified false-positive range can be different for different scoring models. As a result, it becomes more challenging to design and optimize convex surrogates for this measure.

For instance, a popular approach for constructing convex surrogates for the full AUC is to replace the indicator terms in its definition with a suitable pairwise convex loss such as the pairwise hinge loss; in fact, there are several efficient methods available to solve the resulting optimization problem (Herbrich, Graepel, & Obermayer, 2000; Joachims, 2002, 2005). This is not the case with the more complex partial AUC measure. Here, a surrogate constructed by replacing the indicators with the pairwise hinge loss is nonconvex in general. Even in the special case of FPR intervals of the form $[0, \beta]$, where the hinge loss–based surrogate turns out to be convex, solving the resulting optimization problem is not straightforward.

In our approach, we construct and optimize convex surrogates on the partial AUC by building on the structural SVM formulation of Joachims (2005) developed for general complex performance measures. It is known that for the full AUC, this formulation recovers the corresponding hinge surrogate (Joachims, 2006). A direct application of this framework to the partial AUC results in a loose approximation to the performance measure (in a sense that we elaborate in later sections). Instead, we first rewrite the evaluation measure as a maximum of a certain term over subsets of negative instances and leverage the structural SVM setup to construct a convex approximation to the inner term. This yields a tighter surrogate, which, for the special case of partial AUC in the $[0, \beta]$ range, is equivalent to the hinge surrogate obtained by replacing the indicators with the pairwise hinge loss. For general FPR intervals $[\alpha, \beta]$, the surrogate obtained can be seen as a convex relaxation of the (nonconvex) hinge surrogate.

We make use of the cutting plane method to optimize the proposed structural SVM surrogates. Each iteration of this solver requires a combinatorial search over subsets of instances and over binary matrices (representing relative orderings of positive and negative training instances) to find the currently most violated constraint. In the case of the full AUC (where the optimization is only over binary matrices), this problem decomposes neatly into one where each matrix entry can be chosen independently (Joachims, 2005). Unfortunately, for the partial AUC, a straightforward decomposition is not possible, again because the negative instances involved in the relevant false-positive range can be different for different orderings of instances.

One of our main contributions in this letter is a polynomial time algorithm for solving the corresponding combinatorial optimization within the cutting plane method for the partial AUC by breaking down the problem into smaller tractable ones. When the specified false-positive range is of the form $[0, \beta]$, we show that after fixing the optimal subset of negatives to the top-ranked negatives, one can still optimize the individual entries of the ordering matrix separately. For the general case $[\alpha, \beta]$, we further formulate an equivalent optimization problem over a restricted search space, where each row of the matrix can be optimized separately and efficiently.

While the use of convex surrogates in this approach allows efficient optimization and guarantees convergence to the global surrogate optimum, it turns out that for the partial AUC in a general FPR interval $[\alpha, \beta]$, the previous nonconvex hinge surrogate (obtained by replacing the indicators with the pairwise hinge loss) is a tighter approximation to the original evaluation measure. Hence, as a next step, we also develop a method for directly optimizing this nonconvex surrogate using a popular nonconvex optimization technique based on difference-of-convex (DC) programming. Here we exploit the fact that the partial AUC in $[\alpha, \beta]$ can be written as a difference of (scaled) partial AUC values in $[0, \beta]$ and $[0, \alpha]$.
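The last observation can be made concrete with a short worked identity. This is a sketch under the assumption that the partial AUC is normalized by the width of the FPR interval; the notation $\mathrm{pAUC}_f(a, b)$ for the partial AUC of a scoring function $f$ between FPRs $a$ and $b$ is introduced here only for illustration. Since the unnormalized area over $[\alpha, \beta]$ is the area over $[0, \beta]$ minus the area over $[0, \alpha]$:

```latex
(\beta - \alpha)\,\mathrm{pAUC}_f(\alpha, \beta)
  \;=\; \beta\,\mathrm{pAUC}_f(0, \beta) \;-\; \alpha\,\mathrm{pAUC}_f(0, \alpha)
```

Both terms on the right are partial AUCs over intervals that start at zero, the case in which the hinge surrogate is convex, which is what makes a difference-of-convex treatment natural.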

We evaluate the proposed methods on a variety of real-world applications where partial AUC is an evaluation measure of interest and on benchmark data sets. We find that in most cases, the proposed methods yield better partial AUC performance compared to an approach for optimizing the full AUC, thus confirming the focus of our methods on a select false-positive range of the ROC curve. Our methods are also competitive with existing algorithms for optimizing partial AUC. For partial AUC in $[\alpha, \beta]$, we find that in some settings, the proposed DC programming method for optimizing the nonconvex hinge surrogate (despite the risk of getting stuck at a locally optimal solution) performs better than the structural SVM method, though overall there is no clear winner.

In summary, this letter makes the following additional contributions compared to the conference versions of this work (Narasimhan & Agarwal, 2013a, 2013b):

- We provide a more self-contained and comprehensive description of our structural SVM methods for partial AUC optimization. The surrogate construction is explained from the ground up and from a perspective that matches well with readers' intuition about existing surrogates for AUC. Complete proofs are given for all theorems.
- In the case of partial AUC in the $[\alpha, \beta]$ range, we develop a new method for optimizing a tighter nonconvex surrogate using DC programming.
- We derive a generalization bound for partial AUC using VC-dimension-based uniform convergence arguments.
- Our experiments are extensive and detailed, covering a range of applications and benchmark data sets.

### 1.1 Related Work

Much work has been done on developing algorithms to optimize the full AUC, mostly in the context of ranking (Herbrich et al., 2000; Joachims, 2002, 2005; Freund, Iyer, Schapire, & Singer, 2003; Burges et al., 2005). There has also been interest in the ranking literature in optimizing measures focusing on the left end of the ROC curve, corresponding to maximizing accuracy at the top of the list (Rudin, 2009). In particular, the infinite push ranking algorithm (Agarwal, 2011; Rakotomamonjy, 2012) can be viewed as maximizing the partial AUC in the range $[0, 1/n]$, where $n$ is the number of negative training examples.

While the AUC is widely used in practice, increasingly the partial AUC is being preferred as an evaluation measure in several applications in bioinformatics and medical diagnosis (Pepe & Thompson, 2000; Qi, Bar-Joseph, & Klein-Seetharaman, 2006; Rao et al., 2008; Hsu, Chang, & Hsueh, 2014), and more recently even in domains like computer vision (Paisitkriangkrai, Shen, & van den Hengel, 2013, 2014), personalized medicine (Majumder et al., 2015), and demand forecasting (Schneider & Gorr, 2015). The problem of optimizing the partial AUC in false-positive ranges of the form $[0, \beta]$ has received some attention, primarily in the bioinformatics and biometrics literature (Pepe & Thompson, 2000; Dodd & Pepe, 2003; Wang & Chang, 2011; Ricamato & Tortorella, 2011; Hsu & Hsueh, 2012); however, in most cases, the algorithms developed are heuristic in nature. The asymmetric SVM algorithm of Wu, Lin, Chen, and Chen (2008) also aims to maximize the partial AUC in a range $[0, \beta]$ by using a variant of one-class SVM. The optimization objective used, however, does not directly approximate the partial AUC in the specified range, but instead seeks to indirectly promote good partial AUC performance through a fine-grained parameter tuning procedure. There has also been some work on optimizing the partial AUC in general false-positive ranges of the form $[\alpha, \beta]$, including the boosting-based algorithms pAUCBoost (Komori & Eguchi, 2010) and p*U*-AUCBoost (Takenouchi, Komori, & Eguchi, 2012).

Support vector algorithms have been extensively used in practice for various supervised learning tasks, with both standard and complex performance measures (Cortes & Vapnik, 1995; Crammer & Singer, 2002; Chu & Keerthi, 2007; Joachims, 2002; Tsochantaridis, Joachims, Hofmann, & Altun, 2005). The proposed methods are most closely related to the structural SVM framework of Joachims (2005) for optimizing the full AUC. To our knowledge, ours is the first work to develop principled support vector methods that can directly optimize the partial AUC in an arbitrary false-positive range $[\alpha, \beta]$.

### 1.2 Organization

We begin with the problem setting in section 2, along with background material on the previous structural SVM framework for full AUC maximization. In section 3, we consider two initial surrogates for the partial AUC (one based on the pairwise hinge loss and the other on a naive application of the structural SVM formulation) and point out drawbacks in each case. We then present a tight convex surrogate for the special case of FPR range $[0, \beta]$ in section 4 and for the general case of intervals $[\alpha, \beta]$ in section 5, along with cutting plane solvers for solving the resulting optimization problem. Subsequently, in section 6, we also describe a DC programming approach for directly optimizing the nonconvex hinge surrogate for partial AUC in $[\alpha, \beta]$. We provide a generalization bound for the partial AUC in section 7 and present our experimental results on real-world and benchmark tasks in section 8. All proofs are provided in the online appendix.

## 2 Preliminaries and Background

### 2.1 Problem Setting

Let $X$ be an instance space and $\mathcal{D}_+$ and $\mathcal{D}_-$ be probability distributions over positive and negative instances in $X$. We are given a training sample $S = (S_+, S_-)$ containing $m$ positive instances $S_+ = (x_1^+, \ldots, x_m^+)$ drawn independent and identically distributed (i.i.d.) according to $\mathcal{D}_+$ and $n$ negative instances $S_- = (x_1^-, \ldots, x_n^-)$ drawn i.i.d. according to $\mathcal{D}_-$. Our goal is to learn from $S$ a scoring function $f : X \to \mathbb{R}$ that assigns higher scores to positive instances compared to negative instances and, in particular, yields good performance in terms of the partial AUC between some specified false-positive rates $\alpha$ and $\beta$, where $0 \le \alpha < \beta \le 1$. In a ranking application, this scoring function can then be deployed to rank new instances accurately, while in a classification setting, the scoring function, along with a suitable threshold, serves as a binary classifier.

#### 2.1.1 Partial AUC

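For a scoring function $f$ and a threshold $t \in \mathbb{R}$, define the true-positive rate (TPR) of the classifier $\operatorname{sign}(f(x) - t)$ as the probability that it correctly classifies a random positive instance from $\mathcal{D}_+$ as positive: $\mathrm{TPR}_f(t) = \mathbf{P}_{x^+ \sim \mathcal{D}_+}[f(x^+) > t]$,^{2} and the false-positive rate (FPR) of the classifier as the probability that it misclassifies a random negative instance from $\mathcal{D}_-$ as positive: $\mathrm{FPR}_f(t) = \mathbf{P}_{x^- \sim \mathcal{D}_-}[f(x^-) > t]$. The ROC curve for the scoring function $f$ is then defined as the plot of $\mathrm{TPR}_f(t)$ against $\mathrm{FPR}_f(t)$ for different values of $t$. The area under this curve can be computed as

$$\mathrm{AUC}_f = \int_0^1 \mathrm{TPR}_f\big(\mathrm{FPR}_f^{-1}(u)\big)\, du, \quad \text{where } \mathrm{FPR}_f^{-1}(u) = \inf\{t \in \mathbb{R} : \mathrm{FPR}_f(t) \le u\}.$$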
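For reference, the partial AUC that the rest of the letter targets can be written in the same terms. The following is a sketch under assumed notation (the symbols $\mathrm{TPR}_f$, $\mathrm{FPR}_f$, and $\mathrm{pAUC}_f$ are chosen here for illustration): it is the area under the ROC curve of a scoring function $f$ between FPRs $\alpha$ and $\beta$, normalized by the interval width so that a perfect ranking attains the value 1.

```latex
\mathrm{pAUC}_f(\alpha, \beta)
  = \frac{1}{\beta - \alpha} \int_{\alpha}^{\beta}
      \mathrm{TPR}_f\!\big(\mathrm{FPR}_f^{-1}(u)\big)\, du,
\qquad
\mathrm{FPR}_f^{-1}(u) = \inf\{\, t \in \mathbb{R} : \mathrm{FPR}_f(t) \le u \,\}
```

Setting $\alpha = 0$ and $\beta = 1$ recovers the full AUC.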

#### 2.1.2 Empirical Partial AUC

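A minimal sketch of an empirical partial AUC computation is given below. It assumes, for simplicity, that $n\alpha$ and $n\beta$ are integers (passed as `j_alpha` and `j_beta`), that ties count as errors, and that the measure is the fraction of correctly ordered pairs formed by all positives and the negatives in positions $j_\alpha + 1$ to $j_\beta$ of the descending ranking of negatives. The function name `empirical_pauc` is hypothetical.

```python
from typing import Sequence


def empirical_pauc(pos_scores: Sequence[float],
                   neg_scores: Sequence[float],
                   j_alpha: int, j_beta: int) -> float:
    """Fraction of (positive, negative) pairs ranked correctly, counting only
    the negatives in positions j_alpha+1 .. j_beta of the descending ranking."""
    if not 0 <= j_alpha < j_beta <= len(neg_scores):
        raise ValueError("need 0 <= j_alpha < j_beta <= number of negatives")
    # Rank negatives from highest to lowest score and keep the relevant window.
    ranked_negs = sorted(neg_scores, reverse=True)
    window = ranked_negs[j_alpha:j_beta]
    # A pair is correct only if the positive strictly outscores the negative.
    correct = sum(1 for p in pos_scores for q in window if p > q)
    return correct / (len(pos_scores) * len(window))
```

With positives scored 0.9 and 0.4 and negatives scored 0.8, 0.5, 0.3, and 0.1, the full AUC (`j_alpha=0`, `j_beta=4`) is 0.75, whereas the partial AUC over the top two negatives (`j_alpha=0`, `j_beta=2`) is only 0.5. Note that the window of negatives depends on the scores themselves, which is the source of the non-decomposability discussed in this letter.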

#### 2.1.3 Partial AUC versus Full AUC

It is important to note that for the AUC to take its maximum value of 1, a scoring function needs to rank the positive instances above all the negative instances. For the partial AUC in a specified interval $[\alpha, \beta]$ to take a value of 1, it is sufficient that a scoring function ranks the positives above a subset of the negative instances (specifically, above those in positions $j_\alpha + 1$ to $j_\beta$ in the ranking of negatives, where $j_\alpha = \lceil n\alpha \rceil$ and $j_\beta = \lfloor n\beta \rfloor$). Another key difference between the two evaluation measures is that the full AUC can be expressed as an expectation or sum of indicator terms over pairs of positive-negative instances (see equation 2.1), whereas the partial AUC does not have such a simple additive structure. This is clearly evident in the definition in equation 2.2, where the set of negatives corresponding to the FPR range $[\alpha, \beta]$ that appear in the inner summation is not fixed and can be different for different scoring functions $f$.

### 2.2 Background on Structural SVM Framework for Full AUC

As a first step toward developing a method for optimizing the partial AUC, we provide some background on the popular structural SVM framework for maximizing the full AUC (Joachims, 2005). Unless otherwise specified, we assume that $X \subseteq \mathbb{R}^d$ for some $d \in \mathbb{N}$ and consider linear scoring functions of the form $f(x) = w^\top x$ for some $w \in \mathbb{R}^d$. The methods described extend easily to nonlinear functions and non-Euclidean instance spaces using kernels (Yu & Joachims, 2008).

#### 2.2.1 Hinge Loss–Based Surrogate

The partial AUC has a more complex structure as the subset of negatives relevant to the given FPR range can be different for different scoring models. As a result, a surrogate obtained by replacing the indicators with the pairwise hinge loss turns out to be nonconvex in general. The approach that we take for the partial AUC will instead make use of the structural SVM framework developed by Joachims (2005) for designing surrogate minimizing methods for general complex performance measures. For the full AUC, it has been shown that this formulation recovers the corresponding hinge surrogate in equation 2.4 (Joachims, 2006). We give the details for AUC below and in subsequent sections build on this formulation to construct and optimize convex surrogates for the partial AUC.
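To fix ideas, the pairwise hinge surrogate for the full AUC can be sketched as follows. This is an illustrative implementation that assumes a linear scoring function $f(x) = w^\top x$, a unit margin, and normalization by the number of positive-negative pairs; the names `dot` and `auc_hinge_surrogate` are hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(w: Vector, x: Vector) -> float:
    """Inner product, used as the linear score w^T x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def auc_hinge_surrogate(w: Vector,
                        pos: Sequence[Vector],
                        neg: Sequence[Vector]) -> float:
    """Average pairwise hinge loss over all positive-negative pairs.

    Each indicator 1(score_pos <= score_neg) in the AUC risk is replaced
    by the convex upper bound max(0, 1 - (score_pos - score_neg)).
    """
    total = 0.0
    for xp in pos:
        for xn in neg:
            margin = dot(w, xp) - dot(w, xn)
            total += max(0.0, 1.0 - margin)
    return total / (len(pos) * len(neg))
```

Because every pair enters through a fixed convex term, the sum is convex in `w`. For the partial AUC, the set of negatives entering the sum would itself depend on `w`, which is what breaks this argument.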

#### 2.2.2 Structural SVM Formulation

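Clearly, the correct relative ordering $\pi^*$ has $\pi^*_{ij} = 0$ for all $i, j$ (with $\pi_{ij} = 1$ indicating that the $i$th positive instance is ranked below the $j$th negative instance). This corresponds to all positive training instances being ranked above the negative training instances.^{4}

Interestingly, this surrogate can be shown to be equivalent to the hinge loss–based surrogate in equation 2.4.

Theorem 1 (Joachims, 2006). For any $w \in \mathbb{R}^d$ and training sample $S$, the structural SVM surrogate for the AUC is equal to the hinge surrogate in equation 2.4.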

#### 2.2.3 Cutting Plane Method

While the above optimization problem contains an exponential number of constraints (one for each ordering matrix $\pi \in \Pi_{m,n}$), it can be solved efficiently using the cutting plane method (Tsochantaridis et al., 2005). Each iteration of this solver requires a combinatorial optimization over matrices in $\Pi_{m,n}$. By exploiting the simple structure of the AUC loss, this combinatorial problem can be decomposed into simpler ones, where each entry of the matrix can be optimized independently (Joachims, 2005). The cutting plane method is guaranteed to yield an $\epsilon$-accurate solution in a bounded number of iterations (Joachims, 2006); in the case of the AUC, each iteration requires $O((m+n)\log(m+n))$ computational time. We elaborate on this solver in section 4 when we develop a structural SVM approach for the partial AUC.
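The entrywise decomposition for the full AUC can be sketched as below. This is a simplified illustration, not the sorting-based routine of Joachims (2005): it assumes that $\pi_{ij} = 1$ means positive $i$ is ranked below negative $j$, and that the constraint violation for a matrix $\pi$ takes the simplified form $\frac{1}{mn}\sum_{i,j} \pi_{ij}\,(1 - (w^\top x_i^+ - w^\top x_j^-))$. The function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(w: Vector, x: Vector) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def most_violated_auc(w: Vector,
                      pos: Sequence[Vector],
                      neg: Sequence[Vector]) -> List[List[int]]:
    """Choose each entry independently: pi[i][j] = 1 exactly when the pair
    (i, j) fails to achieve a margin larger than 1, so its term is >= 0."""
    pos_scores = [dot(w, x) for x in pos]
    neg_scores = [dot(w, x) for x in neg]
    return [[1 if sp - sn <= 1.0 else 0 for sn in neg_scores]
            for sp in pos_scores]


def violation(w: Vector,
              pos: Sequence[Vector],
              neg: Sequence[Vector],
              pi: Sequence[Sequence[int]]) -> float:
    """Value of the assumed simplified objective at ordering matrix pi."""
    total = 0.0
    for i, xp in enumerate(pos):
        for j, xn in enumerate(neg):
            total += pi[i][j] * (1.0 - (dot(w, xp) - dot(w, xn)))
    return total / (len(pos) * len(neg))
```

Under these assumptions, the maximal violation equals the average pairwise hinge loss, which is one way to see the equivalence between the structural SVM surrogate and the hinge surrogate for the full AUC.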

## 3 Candidate Surrogates for Partial AUC

where $x^-_{(j)_w}$ denotes the negative instance in $S_-$ ranked in $j$th position (among negatives, in descending order of scores) by $w$. As before, optimizing this quantity directly is computationally hard in general. Hence, we work with a continuous surrogate objective that acts as a proxy for the above risk. As first-cut attempts at devising surrogates for the partial AUC, we replicate the two approaches used above for constructing surrogates for the full AUC, namely, those based on the hinge loss and the structural SVM framework, respectively. As we shall see, the surrogates obtained in both cases have certain drawbacks that require us to use a somewhat different approach.

### 3.1 Hinge Loss–Based Surrogate

In the case of the full AUC (i.e., when the FPR interval is $[0, 1]$), the hinge surrogate is convex in $w$ and hence can be optimized efficiently. However, the corresponding surrogate given above for the partial AUC turns out to be nonconvex in general. This is because the surrogate is defined on only a subset of negative instances relevant to the given FPR range, and this subset can be different for different scoring functions.

Let $\alpha, \beta \in [0, 1]$ with $0 < \alpha < \beta$. Then there exists a training sample $S$ for which the hinge surrogate for the $[\alpha, \beta]$ range is nonconvex in $w$.

Let $\alpha = 0$. For any training sample $S$, the hinge surrogate for the $[0, \beta]$ range is convex in $w$.
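A small numerical illustration of both statements follows. It is a sketch that assumes the hinge surrogate averages the pairwise hinge loss over all positives and the negatives in positions $j_\alpha + 1$ to $j_\beta$ of the ranking induced by $w$; the one-dimensional sample and the function names are constructed here for illustration and do not come from the original.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(w: Vector, x: Vector) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def pauc_hinge_surrogate(w: Vector,
                         pos: Sequence[Vector],
                         neg: Sequence[Vector],
                         j_alpha: int, j_beta: int) -> float:
    """Average pairwise hinge loss restricted to the negatives ranked in
    positions j_alpha+1 .. j_beta (descending order of scores under w)."""
    pos_scores = [dot(w, x) for x in pos]
    neg_scores = sorted((dot(w, x) for x in neg), reverse=True)
    window = neg_scores[j_alpha:j_beta]
    total = sum(max(0.0, 1.0 - (sp - sn))
                for sp in pos_scores for sn in window)
    return total / (len(pos_scores) * len(window))


# One positive at the origin and two negatives at +1 and -1 (hypothetical data).
POS = [[0.0]]
NEG = [[1.0], [-1.0]]
```

With `j_alpha=1, j_beta=2`, only the lower-ranked negative counts, and the surrogate reduces to $\max(0, 1 - |w|)$: it is 0 at $w = -1$ and $w = 1$ but 1 at the midpoint $w = 0$, which violates convexity. With `j_alpha=0, j_beta=1`, only the top-ranked negative counts, and the surrogate becomes $1 + |w|$, which is convex.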

### 3.2 Naive Structural SVM Surrogate

Recall that with the AUC, the structural SVM surrogate is equivalent to the corresponding hinge surrogate (see theorem 1). However, even for the special case of partial AUC in FPR intervals of the form $[0, \beta]$ (where the hinge surrogate is convex), the above structural SVM surrogate turns out to be a looser convex upper bound on the partial AUC risk than the hinge surrogate in equation 3.2. In particular, in its simplified form, the above structural SVM surrogate for the partial AUC contains redundant terms that penalize misrankings of the scoring function with respect to negative instances outside the relevant FPR range, specifically those in positions $j_\beta + 1, \ldots, n$ of the ranked list. These additional terms appear because the joint feature map in the surrogate is defined on all negative instances and not just the ones relevant to the given FPR range (see equation 2.6). Clearly, these terms disrupt the emphasis of the surrogate on the specified FPR interval. The details can be found in the earlier conference versions of this letter (Narasimhan & Agarwal, 2013a, 2013b) and are left out here to keep the exposition simple.

Thus a naive application of the structural SVM formulation yields a loose surrogate for the partial AUC. Of course, one could look at tightening the surrogate by restricting the joint feature map to only a subset of negative instances; however, it is not immediate how this can be done, as the subset of negatives relevant to the given FPR interval can be different for different scoring models $w$, while the definition of the joint feature map in the structural SVM framework needs to be independent of $w$.

The approach that we take constructs a tighter surrogate for the partial AUC by making use of the structural SVM framework in a manner that suitably exploits the structure of the partial AUC performance measure. In particular, we first rewrite the partial AUC risk as a maximum of a certain term over subsets of negatives and compute a convex approximation to the inner term using the structural SVM setup. In the rewritten formulation, the joint feature maps need to be defined on only a subset of the negative instances. The resulting surrogate is convex and is equivalent to the corresponding hinge surrogate for $[0, \beta]$ intervals in equation 3.3. For general FPR intervals $[\alpha, \beta]$, the proposed surrogate can be seen as a tighter convex relaxation of the partial AUC risk compared to the naive structural SVM surrogate.

We also provide a cutting plane method to optimize the prescribed surrogates. Unlike the full AUC, here the combinatorial optimization required in each iteration of the solver does not decompose easily into simpler problems. One of our main contributions is a polynomial time algorithm for solving this combinatorial optimization for the partial AUC. The details are provided for the $[0, \beta]$ case in section 4 and for the $[\alpha, \beta]$ case in section 5. In addition to methods that optimize convex structural SVM surrogates on the partial AUC, we also develop a method for directly optimizing the nonconvex hinge surrogate for general FPR intervals using difference-of-convex programming. This approach is explained in section 6.

While the proposed methods optimize continuous approximations or surrogates to the partial AUC, on several real-world tasks, they were found to perform better in terms of the original performance measure compared to the state-of-the-art approaches (see section 8). Indeed it would be of interest to establish precise conditions under which optimizing the proposed surrogates would yield the optimal scoring function for the original partial AUC (i.e., under which the surrogates are statistically consistent). However, this is a generic question that is not very well understood for many applications of (structural) SVM-style surrogates (Joachims, 2005) and is left open for future work.

## 4 Structural SVM Approach for Partial AUC in $[0, \beta]$

Our approach makes use of the structural SVM formulation in a manner that allows us to construct a tighter convex surrogate for the partial AUC, which in this case is equivalent to the corresponding hinge surrogate. The key idea here is that the partial AUC risk in $[0, \beta]$ can be written as a maximum over subsets of negatives of the full AUC risk evaluated on all positives and the given subset of negatives. The structural SVM formulation described in section 2 for the full AUC can then be leveraged to design a convex surrogate and to optimize it efficiently using a cutting plane solver.

### 4.1 Tight Structural SVM Surrogate for pAUC in $[0, \beta]$

For any subset of negatives $Z \subseteq S_-$, let $\hat{R}_{\mathrm{AUC}}(f; Z)$ denote the full AUC risk of scoring function $f$ evaluated on a sample containing all the positives and the subset of negatives $Z$. Then the partial AUC risk of $f$ in $[0, \beta]$ is simply the value of this quantity on the $j_\beta$ top-ranked negatives, where $j_\beta = \lfloor n\beta \rfloor$. This can be shown to be equivalent to the maximum value of $\hat{R}_{\mathrm{AUC}}(f; Z)$ over all subsets $Z$ of negatives of size $j_\beta$.
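This equivalence can be checked by brute force on a small example. The sketch below assumes that the AUC risk is the fraction of positive-negative pairs in which the positive does not strictly outscore the negative; the function names are hypothetical, and the exhaustive search over subsets is for illustration only.

```python
from itertools import combinations
from typing import Sequence


def auc_risk(pos_scores: Sequence[float],
             neg_scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs that are misranked or tied."""
    wrong = sum(1 for p in pos_scores for q in neg_scores if p <= q)
    return wrong / (len(pos_scores) * len(neg_scores))


def risk_on_top_negatives(pos_scores: Sequence[float],
                          neg_scores: Sequence[float],
                          j_beta: int) -> float:
    """Partial AUC risk in [0, beta]: AUC risk on the j_beta top negatives."""
    top = sorted(neg_scores, reverse=True)[:j_beta]
    return auc_risk(pos_scores, top)


def max_risk_over_subsets(pos_scores: Sequence[float],
                          neg_scores: Sequence[float],
                          j_beta: int) -> float:
    """Maximum AUC risk over all subsets of negatives of size j_beta."""
    return max(auc_risk(pos_scores, subset)
               for subset in combinations(neg_scores, j_beta))
```

The two quantities agree because the number of positives that a negative outranks can only grow with that negative's score, so the highest-scoring negatives always form a maximizing subset.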

Having expressed the partial AUC risk in $[0, \beta]$ in terms of the full AUC risk on a subset of instances, we can devise a convex surrogate for the evaluation measure by constructing a convex approximation to the full AUC term using the structural SVM formulation explained in section 2.2.

Recall from theorem 1 that the structural SVM expression used above to approximate the inner full AUC term is the same as the hinge surrogate for the AUC. At first glance, this appears different from the hinge surrogate for the $[0, \beta]$ range in equation 3.3. However, as seen next, the above maximum is attained at the top $j_\beta$ negatives according to $w$, which clearly implies that the two surrogates are equivalent.

The following result then follows directly from proposition 5.

For any $w \in \mathbb{R}^d$ and training sample $S$, the tight structural SVM surrogate in equation 4.3 is equal to the hinge surrogate for the $[0, \beta]$ range in equation 3.3.

Also notice that unlike the naive structural SVM surrogate in equation 3.4, the joint feature map in the proposed surrogate in equation 4.3 is not defined on all negatives, only on a subset of negatives. Consequently, this surrogate does not contain additional redundant terms and is tighter than the naive surrogate (Narasimhan & Agarwal, 2013b) (see Figure 3). We next develop a cutting plane method for optimizing the tighter surrogate.

### 4.2 Cutting Plane Method for Optimizing the $[0, \beta]$ Surrogate

Notice that the optimization problem has an exponential number of constraints, one for each subset of negative instances of size $j_\beta$ and ordering matrix $\pi \in \Pi_{m, j_\beta}$. As with the full AUC, we use the cutting plane method to solve this problem. The idea behind this method is that for any $\epsilon > 0$, a small subset of the constraints is sufficient to find an $\epsilon$-approximate solution to the problem (Joachims, 2006). In particular, the method starts with an empty constraint set $\mathcal{C}$ and, on each iteration, adds the most violated constraint to $\mathcal{C}$ and solves a tighter relaxation of the optimization problem in the subsequent iteration. This continues until no constraint is violated by more than $\epsilon$ (see algorithm 1).

It is known that for any fixed regularization parameter $C > 0$ and tolerance $\epsilon > 0$, the cutting plane method converges in a bounded number of iterations and will yield a surrogate value within $\epsilon$ of the minimum value (Joachims, 2006). Since the quadratic program to be solved grows by only a single constraint in each iteration, the primary bottleneck in the algorithm is the combinatorial optimization (over subsets of negatives and ordering matrices) required to find the most violated constraint (line 10).

#### 4.2.1 Finding the Most Violated Constraint

In the case of AUC, where $\beta = 1$, the above argmax is only over ordering matrices in $\Pi_{m,n}$ and can be easily computed by exploiting the additive form of the AUC loss and, in particular, neatly decomposing the problem into one where each $\pi_{ij}$ can be chosen independently (Joachims, 2005). In the case of the partial AUC in $[0, \beta]$, the decomposition is not as straightforward, as the argmax is also over subsets of negatives.

#### 4.2.2 Reduction to Simpler Problems

It follows from proposition 5 that the above argmax is attained at the top $j_\beta$ negatives according to $w$, and all that remains is to compute the optimal ordering matrix in $\Pi_{m, j_\beta}$ while keeping this subset fixed. The optimization problem can then be decomposed easily. In particular, having fixed the subset, the combinatorial optimization problem reduces to a maximization over ordering matrices alone, which we refer to as OP1. Now consider solving a relaxed form of OP1 over all matrices in $\{0, 1\}^{m \times j_\beta}$. The objective decomposes into a sum of terms involving individual elements $\pi_{ij}$ and can be maximized by optimizing each term separately. The optimal matrix $\bar{\pi}$ is then given by $\bar{\pi}_{ij} = \mathbf{1}(w^\top x_i^+ - w^\top x_j^- \le 1)$, where $x_j^-$ ranges over the selected negatives. It can be seen that this optimal matrix is in fact a valid ordering matrix in $\Pi_{m, j_\beta}$. Notice that $\bar{\pi}$ corresponds to an ordering of instances where the positive instances are scored according to $w^\top x$ and the negative instances are scored according to $w^\top x + 1$. Since $\bar{\pi}$ corresponds to an ordering resulting from a valid set of scores on the instances, it satisfies the transitivity requirements of a valid ordering matrix. Hence $\bar{\pi}$ is also a solution to the original unrelaxed form of OP1 for the fixed subset, and thus gives us the desired most violated constraint.
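The two-step reduction described above (fix the top-ranked negatives, then choose each matrix entry independently) can be sketched as follows. This is a straightforward illustration rather than the compact sorting-based routine referred to in the text; it assumes a linear scoring function, a unit margin, and the convention that an entry equal to 1 marks a positive ranked below a negative. The function name `most_violated_0_beta` is hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(w: Vector, x: Vector) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def most_violated_0_beta(w: Vector,
                         pos: Sequence[Vector],
                         neg: Sequence[Vector],
                         j_beta: int) -> Tuple[List[int], List[List[int]]]:
    """Return (indices of the selected negatives, ordering matrix).

    Step 1: pick the j_beta negatives with the highest scores under w.
    Step 2: set each entry to 1 exactly when the positive fails to beat
            the selected negative by a margin larger than 1.
    """
    neg_scores = [dot(w, x) for x in neg]
    top = sorted(range(len(neg)), key=lambda j: neg_scores[j],
                 reverse=True)[:j_beta]
    pos_scores = [dot(w, x) for x in pos]
    pi = [[1 if sp - neg_scores[j] <= 1.0 else 0 for j in top]
          for sp in pos_scores]
    return top, pi
```

The resulting matrix is the ordering induced by scoring positives with $w^\top x$ and the selected negatives with $w^\top x + 1$, so it is automatically a valid ordering matrix.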

#### 4.2.3 Time Complexity

A straightforward implementation to compute the above solution (see algorithm 2) would take computational time (assuming score evaluations on instances can be done in unit time). Using a more compact representation of the orderings (Joachims, 2005), however, this can be further reduced to . (The details are in appendix B.1.) Thus, computing the most violated constraint for the partial AUC in a small interval is faster than that for the full AUC (Joachims, 2005). This is because the number of negative instances relevant to the given FPR range over which the most violated constraint is computed is smaller for the partial AUC. It turns out that in practice, the number of iterations required by the cutting plane method to converge (and, in turn, the number of calls to the given combinatorial optimization) is often higher for partial AUC compared to AUC. (We elaborate on this when we discuss our experimental results in section 8.)

We have presented an efficient method for optimizing the structural SVM surrogate for the partial AUC in the $[0, \beta]$ range, which we saw was equivalent to the hinge surrogate. We next proceed to algorithms for optimizing the partial AUC in a general $[\alpha, \beta]$ interval.

## 5 Structural SVM Approach for Partial AUC in $[\alpha, \beta]$

Again, the main idea here is to rewrite the partial AUC risk as a maximum of a certain term over subsets of negatives and use the structural SVM formulation to compute a convex approximation to the inner term. We provide an efficient cutting plane method for solving the resulting optimization problem. Here, the combinatorial optimization step for finding the most violated constraint in the cutting plane solver does not admit a decomposition involving individual matrix entries. We show that by using a suitable reformulation of the problem over a restricted search space, the optimization can still be reduced to simpler ones, but now involving individual rows of the ordering matrix. In section 6, we shall also look at an approach for directly optimizing the nonconvex hinge surrogate for general FPR ranges.

### 5.1 Tight Structural SVM Surrogate for pAUC in $[\alpha, \beta]$

We begin by describing the construction of the tight structural SVM surrogate. Just as the partial AUC risk in $[0, \beta]$ could be written as a maximum over subsets of negative instances of the full AUC risk evaluated on this subset (see theorem 4), the partial AUC risk in $[\alpha, \beta]$ can also be written as a maximum of a certain term over subsets of negative instances of size $j_\beta$.

Note that when $\alpha = 0$, the inner term is the full AUC risk on the sample $(S_+, Z)$, where $Z$ is the chosen subset of negatives, recovering our previous result in theorem 4. In that case, we directly made use of the structural SVM formulation for the full AUC to construct a convex approximation for this term. However, when $\alpha > 0$, the inner term is more complex and can essentially be seen as (a scaled version of) the partial AUC risk in the FPR range $[j_\alpha / j_\beta, 1]$ defined on the subset of instances $(S_+, Z)$; we will hence have to rework the structural SVM formulation for this term, as described next.
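The rewriting of the partial AUC risk as a maximum over subsets can again be verified exhaustively on a toy example. The sketch assumes that the inner term for a subset of $j_\beta$ negatives is the fraction of misranked pairs formed by all positives and the negatives in positions $j_\alpha + 1$ to $j_\beta$ within that subset; all function names are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def inner_term(pos_scores: Sequence[float],
               subset_scores: Sequence[float],
               j_alpha: int) -> float:
    """Misranking rate against the negatives ranked below position j_alpha
    within the given subset (descending order of scores)."""
    window = sorted(subset_scores, reverse=True)[j_alpha:]
    wrong = sum(1 for p in pos_scores for q in window if p <= q)
    return wrong / (len(pos_scores) * len(window))


def pauc_risk(pos_scores: Sequence[float],
              neg_scores: Sequence[float],
              j_alpha: int, j_beta: int) -> float:
    """Partial AUC risk: the inner term on the j_beta top-ranked negatives."""
    top = sorted(neg_scores, reverse=True)[:j_beta]
    return inner_term(pos_scores, top, j_alpha)


def max_inner_term_over_subsets(pos_scores: Sequence[float],
                                neg_scores: Sequence[float],
                                j_alpha: int, j_beta: int) -> float:
    """Maximum of the inner term over all subsets of size j_beta."""
    return max(inner_term(pos_scores, subset, j_alpha)
               for subset in combinations(neg_scores, j_beta))
```

Setting `j_alpha=0` makes `inner_term` the full AUC risk on the subset, which matches the special case discussed above.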

#### 5.1.1 Convex Upper Bound on the Inner Term

Replacing the inner term in theorem 7 with the above expression gives us an upper bounding surrogate. This surrogate is a maximum of convex functions in $w$ and is, hence, convex in $w$. Even here, it turns out that the above maximum is attained by the top $j_\beta$ negatives according to $w$.

Let $\bar{Z}$ be the set of instances in the top $j_\beta$ positions in the ranking of negative instances (in descending order of scores) by $w$. Then the maximum value of the objective in equation 5.3 is attained at $\bar{Z}$.

Unlike the partial AUC in $[0, \beta]$, the above structural SVM surrogate is not equivalent to the nonconvex hinge surrogate in equation 3.2 for $[\alpha, \beta]$ intervals and is a looser upper bound on the partial AUC risk. However, compared to the naive structural SVM surrogate in equation 3.4 for the $[\alpha, \beta]$ range, the joint feature map here is only defined on a subset of negatives; as a result, the proposed surrogate is tighter and lays more emphasis on good performance in the given range (Narasimhan & Agarwal, 2013b; see Figure 3). This will become clear from the characterization provided below.

### 5.2 Characterization for the $[\alpha, \beta]$ Surrogate

Before proceeding to develop a method for optimizing the proposed structural SVM surrogate for $[\alpha, \beta]$ intervals, we analyze how the surrogate is related to the original partial AUC risk in equation 5.1, and to the other surrogates discussed in section 3. These relationships were obvious for the $[0, \beta]$ case, as the prescribed structural SVM surrogate there was exactly equivalent to the associated hinge surrogate. For the $[\alpha, \beta]$ case, it is not immediately clear from the surrogate whether it closely mimics the partial AUC risk. We know so far that the proposed structural SVM surrogate for $[\alpha, \beta]$ intervals upper-bounds the partial AUC risk; below we give a more detailed characterization:^{5}

Note that in both the lower and upper bounds, certain positive-negative pairs are penalized with a larger margin than others. In particular, those involving the negative instances in positions $j_\alpha + 1$ to $j_\beta$ are penalized with a margin of one, while the rest are penalized with zero margin. This confirms the surrogate's focus on a select portion of the ROC curve in the range $[\alpha, \beta]$.

Further, if $w$ is such that the difference in scores assigned to any pair of positive-negative training instances is either greater than 1 or less than $-1$ (which is indeed the case when $w$ has a sufficiently large norm and the training instances are all unique), then the characterization is more precise. Here the structural SVM surrogate is exactly equal to the sum of two terms. The first is the nonconvex hinge surrogate, and the second is a positive term that penalizes misrankings with regard to negatives in positions 1 to $j_\alpha$ and can be seen as enforcing convexity in the surrogate.

While the proposed structural SVM surrogate is not equivalent to the hinge surrogate, it can clearly be interpreted as a convex approximation to the hinge surrogate for the $[\alpha, \beta]$ range. Also, a similar characterization for the naive structural SVM surrogate in equation 3.4 contains additional terms involving negative instances ranked in positions outside the specified FPR range (Narasimhan & Agarwal, 2013b). The proposed surrogate does not contain these terms and is therefore a tighter upper bound on the partial AUC risk (also see Figure 3).

### 5.3 Cutting Plane Method for Optimizing

#### 5.3.1 Restricted Search Space of Ordering Matrices

The solution to OP2 lies in .^{6}

#### 5.3.2 Reduction to Simpler Problems

#### 5.3.3 Time Complexity

In a straightforward implementation of this optimization, for each , one would evaluate the term for each of the values of (corresponding to the choices of ; see equation 5.5) and select the optimal among these. Each such evaluation takes time, yielding an overall time complexity of . It turns out, however, that one can partition the values of into two groups, and , such that the optimization over in each of these groups (after the negative instances have been sorted according to ) can be implemented in time. A description is given in algorithm 3, where the overall time complexity is .

Again, using a more compact representation of the orderings, it is possible to further reduce the computational complexity of algorithm 3 to (see appendix B.1 for details). Note that the time complexity for finding the most violated constraint for partial AUC (with ) has a better dependence on the number of training instances compared to that for the usual AUC (Joachims, 2005). However, we find in our experiments in section 8 that the overall running time of the cutting plane method is often higher for partial AUC compared to full AUC, because the number of calls made to the inner combinatorial optimization routine turns out to be higher in practice for partial AUC.

We thus have an efficient method for optimizing a convex surrogate on the partial AUC risk in . While the surrogate optimized by our method is a tighter approximation to the partial AUC risk than the naive structural SVM surrogate considered initially in section 3, we know from the characterization result in theorem ^{11} that the surrogate does contain terms involving negative instances in positions outside the specified FPR range. The hinge surrogate in equation 3.2 for intervals, while being nonconvex, serves as a tighter approximation to the partial AUC risk (see Figure 3). This motivates us to next develop a method for directly optimizing the nonconvex hinge surrogate for intervals. We will make use of a popular nonconvex optimization technique based on DC programming for this purpose.

## 6 DC Programming Approach for Partial AUC in

As noted above, for general FPR intervals , the structural SVM-based surrogate optimized in the previous section is often a looser relaxation to the partial AUC risk compared to the nonconvex hinge loss–based surrogate considered in equation 3.2. We now develop an approach for directly optimizing the hinge surrogate. Here we resort to a popular difference-of-convex (DC) programming technique, exploiting the fact that the partial AUC in is essentially a difference between (scaled) partial AUC values in and . The structural SVM algorithm developed in section 4 for false-positive ranges of the form will be used as a subroutine here.
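The difference structure mentioned above can be checked numerically. The following is a minimal sketch, not the letter's implementation: it writes the interval endpoints as `alpha` and `beta` (names chosen here for illustration), assumes that the number of negatives times each endpoint is an integer, assumes there are no tied scores, and uses the hypothetical helper `pauc_top_negatives` to count correctly ranked pairs over a block of top-ranked negatives.

```python
import numpy as np

def pauc_top_negatives(pos_scores, neg_scores, j_lo, j_hi):
    """Fraction of (positive, negative) pairs ranked correctly, restricted to
    the negatives in positions j_lo+1 .. j_hi of the descending score order."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.sort(np.asarray(neg_scores, dtype=float))[::-1]  # top-scoring negatives first
    selected = neg[j_lo:j_hi]
    correct = (pos[:, None] > selected[None, :]).sum()
    return correct / (len(pos) * len(selected))

# Synthetic scores, for illustration only.
rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, size=50)
neg = rng.normal(0.0, 1.0, size=200)

alpha, beta = 0.1, 0.3                    # assumed FPR interval endpoints
j_alpha = int(round(alpha * len(neg)))    # 20 negatives fall below the interval
j_beta = int(round(beta * len(neg)))      # 60 negatives fall at or below its upper end

# Partial AUC computed directly on the negatives in positions j_alpha+1 .. j_beta.
direct = pauc_top_negatives(pos, neg, j_alpha, j_beta)

# The same quantity as a difference of scaled partial AUC values on
# intervals that start at zero.
via_difference = (
    j_beta * pauc_top_negatives(pos, neg, 0, j_beta)
    - j_alpha * pauc_top_negatives(pos, neg, 0, j_alpha)
) / (j_beta - j_alpha)

print(direct, via_difference)
```

The two printed values agree up to floating-point rounding, because the correctly ranked pairs over the top `j_beta` negatives split into those over the top `j_alpha` negatives and those over the remaining block.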

### 6.1 Difference-of-Convex Objective

^{6}to write the hinge surrogate for partial AUC in in terms of the tight structural SVM formulation (see equation 4.3).

### 6.2 Concave-Convex Procedure

The above difference-of-convex function can now be optimized directly using the well-known concave-convex procedure (CCCP) (Yu & Joachims, 2009; Yuille & Rangarajan, 2003). This technique works by successively computing a gradient-based linear upper bound on the concave (or negative convex) part of the objective and optimizing the resulting sum of convex and linear functions (see algorithm 4).

In our case, each iteration of this technique maintains a model vector and computes a supergradient of the concave term at . Since this term is essentially the negative of the maximum of linear functions in (see equation 4.3), one can obtain a supergradient of this term (with regard to ) by computing the gradient of the linear function at which the maximum is attained (Bertsekas, 1999); specifically, if is the subset-matrix pair at which the maximum is attained (which can be computed efficiently using algorithm 2), a supergradient of this term is , with the corresponding linear upper bound given by . This gives us a convex upper bound on our original difference-of-convex objective, which can be optimized efficiently using a straightforward variant of the structural SVM method discussed in section 4. The CCCP method then simply alternates between the above described linearization and convex upper-bound optimization steps until the difference in objective value across two successive iterations falls below a tolerance .
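The alternation between linearization and convex minimization can be illustrated on a one-dimensional toy problem. This sketch is not the structural SVM variant used in the letter: the objective `x**4 - 2*x**2` and the function name `cccp` are assumptions chosen so that the convex subproblem has a closed-form solution.

```python
import math

def cccp(x0, tol=1e-12, max_iter=1000):
    """Concave-convex procedure for f(x) = u(x) - v(x), with u and v convex.

    Toy choice (an assumption for illustration): u(x) = x**4, v(x) = 2*x**2.
    """
    def objective(x):
        return x ** 4 - 2.0 * x ** 2

    x = x0
    previous = objective(x)
    for _ in range(max_iter):
        # Linearize the concave part -v at the current iterate: v'(x) = 4x.
        g = 4.0 * x
        # Minimize the convex upper bound u(x) - g*x.
        # Setting the derivative 4x^3 - g to zero gives x = cbrt(g / 4).
        x = math.copysign(abs(g / 4.0) ** (1.0 / 3.0), g)
        current = objective(x)
        # Stop once the objective decrease falls below the tolerance.
        if previous - current < tol:
            break
        previous = current
    return x

x_star = cccp(0.5)
print(x_star)
```

Starting from a positive point the iterates approach the local minimizer at 1, and starting from a negative point they approach the one at -1, which mirrors the local (not global) nature of the convergence guarantee.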

The CCCP method is guaranteed to converge only to a locally optimal solution or a saddle point (Tao, 1997; Yuille & Rangarajan, 2003) and is computationally more expensive than the previous approach, since it requires solving an entire structural SVM optimization in each iteration. However, as we shall see in our experiments, this method yields higher partial AUC values than the structural SVM method in some cases.

## 7 Generalization Bound for Partial AUC

We would like to show that the above empirical risk is not too far from the population risk for any scoring function chosen from a given real-valued function class of reasonably bounded capacity. In our case, the capacity of such a function class will be measured using the VC dimension of the class of thresholded classifiers obtained from scoring functions in the class: . Then:

The above result provides a bound on the generalization performance of a learned scoring function in terms of its empirical (training) risk. Notice also that the tightness of this bound depends on the size of the FPR range of interest: the smaller the FPR interval, the looser the bound.

The proof of this result differs substantially from that for the full AUC (Agarwal et al., 2005) as the complex structure of partial AUC forbids the direct application of standard concentration results like Hoeffding's inequality. Instead, the difference between the empirical and population risks needs to be broken down into simpler additive terms that can be bounded using standard arguments. We provide the details in appendix A.10.^{7}

## 8 Experiments

In this section, we present experimental evaluations of the proposed SVM-based methods for optimizing partial AUC on a number of real-world applications where the partial AUC is a performance measure of interest, and on benchmark UCI data sets. The structural SVM algorithms were implemented using a publicly available API from Tsochantaridis et al. (2005),^{8} while the DC programming method was implemented using an API for latent structural SVM from Yu and Joachims (2009).^{9} In each case, two-thirds of the data set was used for training and the remainder for testing, with the results averaged over five such random splits. The tunable parameters were chosen using a held-out portion of the training set treated as a validation set. The specific parameter choices, along with details of data preprocessing, are in appendix B.2. All experiments used a linear model.^{10}
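The splitting protocol described above can be sketched as follows. This is only an illustration under assumptions: the helper names `random_splits` and `mean_and_std` are hypothetical, the splits are drawn uniformly at random without stratification, and the population standard deviation is used; the letter does not specify these details.

```python
import numpy as np

def random_splits(n_instances, n_splits=5, train_fraction=2.0 / 3.0, seed=0):
    """Yield (train_indices, test_indices) for repeated random splits."""
    rng = np.random.default_rng(seed)
    n_train = int(round(train_fraction * n_instances))
    for _ in range(n_splits):
        permutation = rng.permutation(n_instances)
        yield permutation[:n_train], permutation[n_train:]

def mean_and_std(values):
    """Summarize per-split results as (mean, standard deviation)."""
    values = np.asarray(values, dtype=float)
    return float(values.mean()), float(values.std())

# Example: five splits of a data set with 30 instances.
for train_idx, test_idx in random_splits(30):
    print(len(train_idx), len(test_idx))
```

A validation set for parameter tuning could be carved out of each `train_idx` in the same way.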

### 8.1 Maximizing Partial AUC in

We begin with our results for FPR intervals of the form . We considered two real-world applications where the partial AUC in is an evaluation measure of interest: a protein-protein interaction prediction task and a drug discovery task (see Table 1). We refer to the proposed structural SVM-based method in section 4 as . We included for comparison the structural SVM algorithm of Joachims (2005) for optimizing the full AUC, which we shall call , as well as three existing algorithms for optimizing partial AUC in : asymmetric SVM (ASVM) (Wu et al., 2008), pAUCBoost (Komori & Eguchi, 2010), and a greedy heuristic method due to Ricamato and Tortorella (2011). For completeness, we also compare against a standard classification SVM method that optimizes classification accuracy.^{11}

#### 8.1.1 Protein-Protein Interaction Prediction

^{12} We evaluated the partial AUC performance of these methods on two FPR intervals, and . To compare the methods for different training sample sizes, we report results in Figure 4 for varying fractions of the training set. As seen, the proposed method almost always yields higher partial AUC in the specified FPR intervals compared to the method for optimizing the full AUC, thus confirming its focus on a select portion of the ROC curve. Interestingly, the difference in performance is more pronounced for smaller training sample sizes, implying that when one has limited training data, it is more beneficial to use the data to directly optimize the partial AUC rather than to optimize the full AUC. Also, in most cases, the proposed method performs comparably to or better than the other baselines; the pAUCBoost and Greedy-Heuristic methods perform particularly poorly on smaller training samples due to the use of heuristics. As expected, the classification SVM method performs even worse than the full AUC maximizing approach, clearly showing that optimizing the classification accuracy does not necessarily yield good performance on the ROC curve.

#### 8.1.2 Drug Discovery

### 8.2 Maximizing Partial AUC in

We next move to our experiments on partial AUC in a general interval. We refer to the proposed structural SVM method for maximizing partial AUC in again as , and our DC programming approach for optimizing the nonconvex hinge surrogate as . As baselines, we included ; pAUCBoost, which can optimize partial AUC over FPR ranges ; an extension of the greedy heuristic method in Ricamato and Tortorella (2011) to handle arbitrary FPR ranges; and the classification SVM method. We first present our results on a real-world application, where partial AUC in is a useful evaluation measure.

#### 8.2.1 Breast Cancer Detection

We consider the task stated in the KDD Cup 2008 challenge, where one is required to predict whether a given region of interest (ROI) from a breast X-ray image corresponds to a malignant (positive) or a benign (negative) tumor (Rao et al., 2008). The data provided are collected from 118 malignant patients and 1,594 normal patients. Four X-ray images are available for each patient; overall, there are 102,294 candidate ROIs selected from these X-ray images, of which 623 are positive, with each ROI represented by 117 features. In the KDD Cup challenge, performance was evaluated in terms of the partial area under the free-response operating characteristic (FROC) curve in a false-positive range deemed clinically relevant based on radiologist surveys. The FROC curve (Miller, 1969) effectively uses a scaled version of the false-positive rate in the usual ROC curve; for our purposes, the corresponding false-positive rate is obtained by rescaling by a factor of (this is the total number of images divided by the total number of negative ROIs). Thus, the goal in our experiments was to maximize the partial AUC in the clinically relevant FPR range . Table 2 presents results on algorithms and developed for FPR intervals of this form, as well as on the baseline methods. performs the best in this case.

| Method | Partial AUC |
|---|---|
| | 0.638 (0.02) |
| | 0.616 (0.02) |
| | 0.612 (0.02) |
| pAUCBoost | 0.603 (0.02) |
| Greedy-Heuristic | 0.562 (0.02) |
| Classification SVM | 0.484 (0.02) |

Notes: The standard deviations of the reported values are enclosed in parentheses. Here . The method performing best is highlighted in bold.
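The rescaling factor described in this section follows from the counts given in the text. The sketch below works through that arithmetic; the function name `froc_to_fpr` is hypothetical, and the input value passed to it in the last line is an arbitrary illustration, not the range used in the challenge.

```python
# Counts stated in the text for the KDD Cup 2008 data.
n_patients = 118 + 1594                      # malignant + normal patients
images_per_patient = 4
n_images = n_patients * images_per_patient   # total number of X-ray images

n_rois = 102_294
n_positive_rois = 623
n_negative_rois = n_rois - n_positive_rois

# Total number of images divided by total number of negative ROIs.
scale = n_images / n_negative_rois

def froc_to_fpr(false_positives_per_image):
    """Convert an FROC x-axis value (false positives per image) to an FPR."""
    return false_positives_per_image * scale

print(n_images, n_negative_rois, round(scale, 4))
print(froc_to_fpr(1.0))  # arbitrary illustrative input
```

With these counts there are 6,848 images and 101,671 negative ROIs, giving a factor of roughly 0.0674.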

#### 8.2.2 UCI Data Sets

To perform a more detailed comparison between the proposed structural SVM and DC programming methods for general intervals, we also evaluated the methods on a number of benchmark data sets obtained from the UCI machine learning repository (Frank & Asuncion, 2010; see Table 3). The results for the FPR interval are shown in Table 4. For completeness, we also report the performance of the baseline methods. Despite having to solve a nonconvex optimization problem (and hence running the risk of getting stuck at a locally optimal solution), does perform better than in some cases, though between the two, there is no clear winner. Also, on three of the five data sets, one of the two proposed methods yields the best overall performance. The strikingly poor performance of pAUCBoost and Greedy-Heuristic on the cod-rna data set is because the individual features in the data set yield low partial AUC values. Since these methods rely on local greedy heuristics, they fail to find a linear combination of the features that yields high partial AUC performance. Unsurprisingly, classification SVM performs worse than in many cases.

Data Set | Number of Instances | Number of Features |
---|---|---|
a9a | 48,842 | 123 |
cod-rna | 488,565 | 8 |
covtype | 581,012 | 54 |
ijcnn1 | 141,691 | 22 |
letter | 20,000 | 16 |

| Method | a9a | cod-rna | covtype | ijcnn1 | letter |
|---|---|---|---|---|---|
| | 0.274 (0.07) | 0.919 (0.00) | 0.247 (0.00) | 0.613 (0.03) | 0.521 (0.02) |
| | 0.365 (0.04) | 0.920 (0.00) | 0.241 (0.09) | 0.680 (0.00) | 0.518 (0.02) |
| | 0.434 (0.01) | 0.919 (0.00) | 0.299 (0.01) | 0.475 (0.01) | 0.445 (0.02) |
| pAUCBoost | 0.401 (0.01) | 0.033 (0.00) | 0.448 (0.00) | 0.491 (0.06) | 0.495 (0.05) |
| Greedy-Heuristic | 0.342 (0.02) | 0.033 (0.00) | 0.239 (0.01) | 0.120 (0.01) | 0.289 (0.01) |
| Classification SVM | 0.354 (0.02) | 0.872 (0.00) | 0.378 (0.01) | 0.314 (0.01) | 0.351 (0.02) |

Notes: All methods except and classification SVM seek to optimize partial AUC in the range . The standard deviations of the reported values are in parentheses, and the best-performing methods in bold.

### 8.3 Maximizing TPR at a Specific FPR Value

We have so far seen that the proposed methods are good at learning scoring functions that yield high partial AUC in a specified FPR range. In our next experiment, we demonstrate that the proposed methods can also be useful in applications where the requirement is to learn a classifier with specific true- or false-positive requirements. In particular, we consider the task described in the KDD Cup 2006 challenge of detecting pulmonary emboli in medical images obtained from CT angiography (Lane, Rao, Bi, Liang, & Salganicoff, 2006). Given a candidate region of interest (ROI) from the image, the goal is to predict whether it is a pulmonary embolus. A specific requirement here is that the classifier must have high TPR, with the FPR kept within a specified limit.

Indeed if a classifier is constructed by thresholding a scoring function, the above evaluation measure can be seen as the partial AUC of the scoring function in an infinitesimally small FPR interval. However, given the small size of the FPR interval concerned, maximizing this evaluation measure directly may not produce a classifier with good generalization performance, particularly with smaller training samples (see the generalization bound in section 7). Instead, we prescribe a more robust approach of using the methods developed in this letter to maximize partial AUC in an appropriate larger FPR interval and constructing a classifier by suitably thresholding the scoring function thus obtained.
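The thresholding step can be sketched as follows. This is a minimal illustration rather than the procedure used in the experiments: the function name `tpr_at_fpr` is hypothetical, and the threshold is chosen on the same sample on which the rates are reported, whereas in practice it would be selected on training or validation data.

```python
import numpy as np

def tpr_at_fpr(scores, labels, max_fpr):
    """Threshold a scoring function so that the FPR stays within max_fpr.

    Returns (tpr, fpr, threshold); an instance is predicted positive when
    its score is strictly greater than the threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    pos, neg = scores[labels], scores[~labels]

    # At most k negatives may be predicted positive.
    k = int(np.floor(max_fpr * len(neg)))
    neg_sorted = np.sort(neg)[::-1]          # highest-scoring negatives first
    # Use the (k+1)-th highest negative score as the threshold, so that at
    # most k negatives score strictly above it.
    threshold = neg_sorted[k] if k < len(neg) else -np.inf

    tpr = float(np.mean(pos > threshold))
    fpr = float(np.mean(neg > threshold))
    return tpr, fpr, threshold

# Small hand-made example: 4 positives and 10 negatives.
scores = [0.9, 0.8, 0.6, 0.5,
          0.95, 0.55, 0.45, 0.35, 0.3, 0.25, 0.2, 0.15, 0.1, 0.05]
labels = [1, 1, 1, 1] + [0] * 10
tpr, fpr, threshold = tpr_at_fpr(scores, labels, max_fpr=0.1)
print(tpr, fpr, threshold)
```

In this example one negative is allowed above the threshold, the threshold is the second-highest negative score (0.55), and three of the four positives score above it.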

The data provided in the KDD Cup challenge consist of 4,429 ROIs represented by 116 features; 500 are positive. We considered a maximum allowable FPR limit of 0.1 (which is one of the values prescribed in the challenge). The proposed partial AUC maximization methods were used to learn scoring functions for two FPR intervals, and that we expected would promote high TPR at the given FPR of 0.1. The performance of a learned model was then evaluated based on the TPR it yields when thresholded at an FPR of 0.1. Table 5 contains results for on both intervals and for on the interval. We also included for comparison. Interestingly, in this case, in performs the best. The DC programming method on the interval is a close second, performing better than the structural SVM approach for the same interval.^{13}

| Method | TPR at FPR |
|---|---|
| | 0.613 |
| | 0.584 |
| | 0.591 |
| | 0.584 |

Note: The best-performing method is highlighted in bold.

We also note that in follow-up work, the proposed approach was applied to a similar problem in personalized cancer treatment (Majumder et al., 2015), where the goal was to predict whether a given cancer patient will respond well to a drug treatment. In this application, one again requires high true-positive rates (fraction of cases where the treatment is effective and the model predicts the same), subject to the false-positive rate being within an allowable limit. As above, the problem was posed as a partial AUC maximization task, and the classifiers learned using our methods were found to yield higher TPR performance than standard approaches for this problem while not exceeding the allowed FPR limit.

### 8.4 Run-Time Analysis

#### 8.4.1 Run-Time Comparison with Baseline Methods

Notice that except for , all the baselines require higher or similar training times compared to . Also, as expected, for the intervals, the DC programming-based (which solves an entire structural SVM problem in each iteration) requires higher training time than . The reason for the full AUC-maximizing method being the fastest in all cases, despite the subroutine for finding the most violated constraint requiring higher computational time compared to that for partial AUC, is that the number of iterations required by the cutting plane solver is lower for AUC. This will become clear in our next experiments, where we report the number of iterations taken by the cutting plane solver under different settings.

#### 8.4.2 Influence of , , and on Number of Cutting Plane Iterations

## 9 Conclusion and Open Questions

The partial AUC is increasingly used as a performance measure in several machine learning and data mining applications. We have developed support vector algorithms for optimizing the partial AUC between two given false-positive rates and . Unlike the full AUC, where it is straightforward to develop surrogate optimizing methods, even constructing a (tight) convex surrogate for the partial AUC turns out to be nontrivial. By exploiting the specific structure of the evaluation measure and extending the structural SVM framework of Joachims (2005), we have constructed convex surrogates for the partial AUC and developed an efficient cutting plane method for solving the resulting optimization problem. In addition, we have provided a DC programming method for optimizing a nonconvex hinge surrogate that is tighter for general false-positive ranges . Our empirical results on several real-world and benchmark tasks indicate that the proposed methods indeed optimize the partial AUC in the desired false-positive range, often performing comparably to or better than existing baselines.

Subsequent to the conference versions of this letter, there have been a number of follow-up studies. One such work applied our algorithm to an important problem in personalized cancer treatment (Majumder et al., 2015), where the task was to predict clinical responses of patients to chemotherapy. Other studies include minibatch extensions of our structural SVM method for the range to online and large-scale stochastic settings (Kar et al., 2014), as well as an ensemble-style version of this method with application to a problem in computer vision (Paisitkriangkrai et al., 2013, 2014).

Several questions remain open. First, it would be useful to understand the consistency properties of the proposed algorithms—conditions under which optimizing the proposed surrogates yields the optimal scoring function for the original partial AUC measure. Recently, consistency properties have been established for the method that optimizes the full AUC (Uematsu & Lee, 2015; Gao & Zhou, 2015), but these results do not directly extend to the partial AUC. Second, we observed in our experiments that the number of iterations required by the proposed cutting plane solvers to converge depends on the length of the specified FPR range, but this is not evident from the current convergence rate for the solver (see section 4) obtained from a result in Joachims (2006). It would be of interest to see if tighter convergence rates that match our empirical observation can be shown for the cutting plane solver. Finally, one could look at extensions of the proposed algorithms to multiclass classification and ordinal regression settings, where often there are different constraints on the error rates of a predictor on different classes. Again, there has been work on optimizing multiclass versions of the full AUC (Waegeman, De Baets, & Boullart, 2008; Clémençon et al., 2013; Uematsu & Lee,