Abstract

While most proposed methods for solving classification problems focus on minimizing the classification error rate, we are interested in the receiver operating characteristic (ROC) curve, which provides more information about classification performance than the error rate does. The area under the ROC curve (AUC) is a natural measure for the overall assessment of a classifier based on the ROC curve. We discuss a class of concave functions for AUC maximization in which a boosting-type algorithm, including RankBoost, is considered, and the Bayes risk consistency and the lower bound given by the objective function are discussed. A procedure derived by maximizing a specific objective function is shown to be highly robust in the sense of gross error sensitivity. Additionally, we focus on the partial AUC, the partial area under the ROC curve. In medical screening, for example, a high true-positive rate at a fixed low false-positive rate is preferable, and thus the partial AUC corresponding to low false-positive rates is much more important than the remaining area. We extend the class of concave objective functions to partial AUC maximization with the boosting algorithm. We investigate the validity of the proposed methods through experiments with data sets from the UCI repository.

1.  Introduction

The receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) are widely employed performance measures of classifiers (Pepe & Thompson, 2000), especially in medical diagnosis and psychological studies. An important property of the AUC is that it can be used when the prior probability of each class is biased, whereas the error rate counts only the number of wrong classifications regardless of the class balance of the samples. When the goal of a problem is to find a discriminant function with a high AUC value, it is natural to use an algorithm that directly maximizes the AUC; however, direct construction of a discriminant function that maximizes the AUC value is in general difficult because the AUC is not differentiable. Brefeld and Scheffer (2005) proposed a support vector machine (SVM) type of algorithm (Vapnik, 1998) for maximizing the AUC. The optimization problem for this algorithm can be solved by standard QP solvers; however, its computational cost grows quadratically with the number of examples, and optimization is hard for large-scale problems. Boosting is a learning method that builds up a stronger classification machine from a set of base learning machines that can be easily learned, and AdaBoost (Freund & Schapire, 1997) is a typical implementation of boosting. Murata, Takenouchi, Kanamori, and Eguchi (2004) extended AdaBoost to the general U-Boost framework and discussed its statistical properties such as consistency and robustness. In this letter, we extend the concept of the AUC, propose a new performance measure along the lines of Murata et al. (2004), and derive a boosting-type algorithm that maximizes the measure. We also propose an algorithm maximizing the partial AUC (pAUC), a performance measure defined over a controlled range of the false-positive rate, and investigate the statistical properties and robustness of the algorithms. In other words, the algorithm is designed to attain a high true-positive rate within a specific range of the false-positive rate. This idea is common in medical screening, where the false-positive rate (the probability that healthy subjects are incorrectly judged to be positive) should be kept as low as possible, and in preference-ranking problems, where only the top of the ranking list is of interest to the user. Recently, Rudin and Schapire (2009) proposed an algorithm that focuses on the leftmost portion of the ROC curve based on RankBoost (Freund, Iyer, Schapire, & Singer, 2003). That method controls the weights on the exponential loss so that more importance is put on low values of the false-positive rate, in contrast with our approach, which directly restricts the range of the false-positive rate.

This letter is organized as follows. In section 2, we describe the basic settings and briefly review the conventional AUC. We formulate a new measure, U-AUC, and propose a boosting-type algorithm, U-AUCBoost. In section 3, we propose an algorithm maximizing the pAUC and investigate some of its statistical properties. The performance of the proposed methods is evaluated through experiments in section 4.

2.  U-AUC

In this section, we define a new performance measure, U-AUC, which is a lower bound of the AUC, and propose an algorithm, U-AUCBoost, that maximizes it. Adopting the framework of boosting makes it possible to deal with large-scale data sets. We also investigate the relationship with a statistical model and statistical properties of the algorithm such as consistency and robustness.

2.1.  Settings of the Problem.

Let x be a feature vector and y ∈ {+1, − 1} be a binary label of the feature vector x. A data set of labeled examples (xi, yi) is given. Function I(·) is an indicator function:
$$I(A) = \begin{cases} 1 & \text{if the event } A \text{ is true}, \\ 0 & \text{otherwise}. \end{cases}$$
For a discriminant function F(x), the conventional true-positive rate (TPR) and false-positive rate (FPR) at a threshold value c are defined as follows:
$$\mathrm{TPR}_F(c) = \int H\bigl(F(x) - c\bigr)\, p_{+1}(x)\, dx, \qquad (2.1)$$
$$\mathrm{FPR}_F(c) = \int H\bigl(F(x) - c\bigr)\, p_{-1}(x)\, dx, \qquad (2.2)$$
where H(z) is the Heaviside step function and py(x) is the conditional probability p(x|y) of x given the label y. The AUC is a performance measure of the discriminant function F(x) over all threshold values c and is defined by
$$\mathrm{AUC}(F) = \int\!\!\int H\bigl(F(x) - F(x')\bigr)\, p_{+1}(x)\, p_{-1}(x')\, dx\, dx'. \qquad (2.3)$$
Its empirical version is written as
$$\widehat{\mathrm{AUC}}(F) = \frac{1}{n_{+1}\, n_{-1}} \sum_{i:\, y_i = +1} \; \sum_{j:\, y_j = -1} H\bigl(F(x_i) - F(x_j)\bigr), \qquad (2.4)$$
where ny is the number of samples having label y in the data set.
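For illustration, equation 2.4 can be computed directly by counting correctly ordered positive–negative pairs. The following sketch (in Python; the function name and the tie-handling convention are ours, not part of the letter) makes the pairwise nature of the empirical AUC explicit:

```python
import numpy as np

def empirical_auc(scores_pos, scores_neg):
    """Empirical AUC of eq. 2.4: fraction of positive-negative pairs ordered
    correctly by the discriminant function F (ties counted as 1/2, a common
    convention)."""
    s_pos = np.asarray(scores_pos)[:, None]    # F(x_i) with y_i = +1
    s_neg = np.asarray(scores_neg)[None, :]    # F(x_j) with y_j = -1
    correct = (s_pos > s_neg).mean()           # Heaviside H(F(x_i) - F(x_j))
    ties = 0.5 * (s_pos == s_neg).mean()
    return correct + ties

# toy check: perfectly separated scores give AUC = 1
print(empirical_auc([2.0, 1.5], [0.3, -0.1, 0.0]))  # -> 1.0
```

The double sum over pairs is also what makes direct maximization of equation 2.4 awkward: the Heaviside comparisons are piecewise constant in F.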

A direct construction of F that maximizes equation 2.4 is difficult, because this equation is not continuously differentiable. Replacing the Heaviside function with a sigmoid function (Herschtal & Raskutti, 2004; Ghosh & Chaudhuri, 2005) or with the normal distribution function (Komori & Eguchi, 2010) resolves the nondifferentiability; however, it may not be possible to find the global maximum because the resulting objective is not concave. To overcome the difficulty, in this letter we consider a true positive utility (TPU) rather than the conventional TPR and propose a new performance measure, U-AUC, whose maximization approximately maximizes equation 2.4. A similar approach, RankBoost, was investigated by Freund et al. (2003) and Rudin and Schapire (2009), who considered combining preferences to solve ranking problems; RankBoost was later shown to be equivalent to maximization of the AUC value (Cortes & Mohri, 2004). Our proposed method includes RankBoost as a special case and has statistical properties such as consistency and robustness.
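For comparison, the smoothed surrogate used by the works cited above can be sketched as follows; this is only a minimal illustration of the sigmoid replacement (the bandwidth parameter h is our own choice), not the approach developed in this letter:

```python
import numpy as np

def sigmoid_auc(scores_pos, scores_neg, h=0.1):
    """Smoothed surrogate of the empirical AUC in the spirit of the works
    cited above: the Heaviside step is replaced by a sigmoid with bandwidth
    h (our parameterization).  The surrogate is differentiable but not
    concave in F, which is the drawback noted in the text."""
    d = (np.asarray(scores_pos)[:, None] - np.asarray(scores_neg)[None, :]) / h
    return (1.0 / (1.0 + np.exp(-d))).mean()
```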

2.2.  U-AUC.

In this section, we define the new performance measure U-AUC and derive the algorithm for maximizing it in the framework of boosting. First, we define a quantity TPU as follows:
$$\mathrm{TPU}_F(c) = \int U\bigl(F(x) - c\bigr)\, p_{+1}(x)\, dx, \qquad (2.5)$$
where U is a concave and monotonically increasing function, satisfying
$$U(z) \leq H(z) \quad \text{for all } z. \qquad (2.6)$$
Based on this quantity, we propose a new performance measure, U-AUC,
$$\mathrm{AU}(F) = \int\!\!\int U\bigl(F(x) - F(x')\bigr)\, p_{+1}(x)\, p_{-1}(x')\, dx\, dx', \qquad (2.7)$$
and denote its empirical version by
$$\widehat{\mathrm{AU}}(F) = \frac{1}{n_{+1}\, n_{-1}} \sum_{i:\, y_i = +1} \; \sum_{j:\, y_j = -1} U\bigl(F(x_i) - F(x_j)\bigr). \qquad (2.8)$$
Note that AU(F) is a lower bound of the AUC,
$$\mathrm{AU}(F) \leq \mathrm{AUC}(F), \qquad (2.9)$$
because of equation 2.6. Then maximization of AU(F) implies maximization of a lower bound of the AUC, which is our purpose. Also note that AU(F) is concave with respect to F, that is,
$$\mathrm{AU}\bigl(\lambda F_1 + (1-\lambda)F_2\bigr) \;\geq\; \lambda\,\mathrm{AU}(F_1) + (1-\lambda)\,\mathrm{AU}(F_2), \qquad 0 \leq \lambda \leq 1,$$
because of the concavity of U. The equality holds if and only if F1(x) = F2(x) + c, where c is an arbitrary constant. This property also gives the uniqueness (except for c) of the optimal function F*.
Remark 1. 

U-AUC is threshold free because AU(F + κ) = AU(F) holds for an arbitrary threshold value κ. In a situation where sample numbers of classes are biased, this threshold-free property is a virtue.

Remark 2. 
We can extend the FPR to FPU as
formula
where U2 is a concave and monotonically increasing function, and we define an associated performance measure as
formula
where U′2 is the derivative of U2.

The maximization of AU(F) is a concave optimization problem that can be solved efficiently when the problem is not too large (Boyd & Vandenberghe, 2004). Recently, however, there has been a great need for analyzing large-scale data sets with numerous, high-dimensional examples, and direct optimization of AU(F) is difficult for such large-scale problems. To deal with such problems, boosting has been proposed: a learning method that builds up a stronger classification machine from a set of base learning machines that can be easily learned. Many boosting-type algorithms have been proposed, and theoretical properties such as consistency and robustness have been intensively investigated. In this letter, to maximize the empirical U-AUC with respect to F for large-scale data sets, we derive a boosting-type algorithm, U-AUCBoost, as follows. Let a set of base classifiers be given. We then construct $F(x) = \sum_{t=1}^{T} \alpha_t f_t(x)$, where $f_t$ is a component of the set and $\alpha_t$ is a coefficient of $f_t$, as follows:

  1. Initialize a weight for each pair of examples and the discriminant function as F0(x) = 0.

  2. For t = 1, …, T,

    1. Find a base classifier as
      formula
      where
      formula
    2. Calculate a coefficient of the selected classifier:
      formula
    3. Update the weight as
      formula
      where $F_t(x) = F_{t-1}(x) + \alpha_t f_t(x)$ and the denominator is a normalization term.

  3. Output the discriminant function FT(x).

In this algorithm, the weight is defined for a pair of examples while the weight is defined for an example in the usual boosting algorithm such as AdaBoost.
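The following sketch illustrates the objective that U-AUCBoost maximizes. Instead of the weighted selection rule of the algorithm above (whose exact formulas are not reproduced here), it greedily searches stumps and coefficients in [0, 1] by brute force, which is far slower but makes the sequential maximization of the empirical U-AUC explicit; all function names are ours.

```python
import numpy as np

def u_rank(z):
    # exponential-type function from section 2.6; any concave, increasing U could be used
    return 1.0 - np.exp(-z)

def empirical_u_auc(F_pos, F_neg, U=u_rank):
    # empirical U-AUC: average of U(F(x_i) - F(x_j)) over positive-negative pairs
    return U(F_pos[:, None] - F_neg[None, :]).mean()

def stump_outputs(X, feature, threshold, sign):
    # decision stump ("stumps" of Friedman et al., 2000) with values in {-1, +1}
    return sign * np.where(X[:, feature] > threshold, 1.0, -1.0)

def u_auc_boost(X_pos, X_neg, T=10, U=u_rank, alphas=np.linspace(0.0, 1.0, 21)):
    """Greedy sketch of U-AUCBoost: at each round, pick the stump and the
    coefficient in [0, 1] that most increase the empirical U-AUC.  The letter
    selects the stump through pair weights instead; this brute-force search is
    only meant to make the maximized objective explicit."""
    F_pos, F_neg = np.zeros(len(X_pos)), np.zeros(len(X_neg))   # F_0 = 0
    X_all = np.vstack([X_pos, X_neg])
    model = []
    for _ in range(T):
        best = None
        for feature in range(X_all.shape[1]):
            for threshold in np.unique(X_all[:, feature]):
                for sign in (-1.0, 1.0):
                    f_pos = stump_outputs(X_pos, feature, threshold, sign)
                    f_neg = stump_outputs(X_neg, feature, threshold, sign)
                    for a in alphas:
                        val = empirical_u_auc(F_pos + a * f_pos, F_neg + a * f_neg, U)
                        if best is None or val > best[0]:
                            best = (val, feature, threshold, sign, a)
        _, feature, threshold, sign, a = best
        F_pos += a * stump_outputs(X_pos, feature, threshold, sign)
        F_neg += a * stump_outputs(X_neg, feature, threshold, sign)
        model.append((feature, threshold, sign, a))
    return model
```

The brute-force search reevaluates all pairs for every candidate stump and coefficient and is meant only as a reference for the objective; the pair-weight update in the algorithm above avoids this recomputation.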

A derivation of the algorithm is the same as for the usual boosting-type algorithms (Murata et al., 2004). When the discriminant function $F_{t-1}$ is given, we consider an update from $F_{t-1}$ to $F_{t-1} + \varepsilon f$, where ε is a small constant. Then we observe
formula
2.11
and maximization of this quantity corresponds to the selection of the base classifier. For the selected base classifier, we determine the coefficient as in step b. In the maximization in equation 2.10, the solution often diverges. In our experiments, therefore, for the purpose of regularization, we threshold α at 1 and optimize it in the range [0, 1]. This kind of implementation for dealing with numerical problems is also found in LogitBoost (Friedman, Hastie, & Tibshirani, 2000).
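For the coefficient step, the restriction to [0, 1] mentioned above can be implemented as a bounded one-dimensional search; a minimal sketch, assuming SciPy is available (the interface is ours):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def best_alpha(F_pos, F_neg, f_pos, f_neg, U):
    """Bounded line search for the coefficient of a fixed base classifier f,
    restricted to [0, 1] as described above.  F_pos/F_neg hold the current
    scores F_{t-1} on positive/negative examples, f_pos/f_neg the outputs of
    f; U is a concave increasing function.  Names are ours."""
    def negative_objective(a):
        margins = (F_pos + a * f_pos)[:, None] - (F_neg + a * f_neg)[None, :]
        return -U(margins).mean()       # minimize the negated empirical U-AUC
    result = minimize_scalar(negative_objective, bounds=(0.0, 1.0), method="bounded")
    return result.x
```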

In the following sections, we discuss statistical properties of U-AUCBoost: the empirical AUC attained by the algorithm (see section 2.3), a related statistical model and consistency (see section 2.4), robustness (see section 2.5), and the relation to conventional methods such as RankBoost and AdaBoost (see section 2.6).

2.3.  Evaluation of Empirical AUC.

Using the boosting-type algorithm just described, we can construct the discriminant function FT by linear combination of base classifiers. We investigate how the algorithm behaves by evaluating a lower bound of the empirical U-AUC for FT as follows:

Theorem 1. 
For the discriminant function $F_T(x) = \sum_{t=1}^{T} \alpha_t f_t(x)$, we observe that
formula
where γt is a constant satisfying
formula
and γt lies in [0, 1].1 Note that acct(ft) − errt(ft) ⩾ 0 holds for all t.
Proof. 

See appendix  A.

If acct(ft) − errt(ft) > 0 and γt > 0 hold for all t, the lower bound of the empirical U-AUC monotonically increases with T.

2.4.  Consistency of U-AUCBoost.

We consider the population maximizer of equation 2.7 and investigate its properties. We then observe the following theorem:

Theorem 2. 
The maximizer F* satisfies
formula
2.12
where
formula
2.13
Proof. 
Let η(x) be an arbitrary function of x and ε be a small constant, and we consider the functional derivative of AU(F). From
formula
we observe that the maximizer F* of the concave function AU(F) satisfies
formula
which is rewritten as
formula
2.14
The bracketed term in equation 2.14 becomes 0 for any x because η(x) is arbitrary.

Note that F* + c, where c is an arbitrary constant, also satisfies equation 2.12 because of the threshold-free property of U-AUC. In the following, we ignore the arbitrariness of the constant.

With the concavity of U and a simple calculation, we observe
formula
2.15
Thus Ψ is a monotonically increasing function. Then we obtain the following:
Remark 3. 
The optimal function F* is connected to the ratio p+1/p−1 in a one-to-one correspondence, and we observe that
formula
2.16
where πy is a marginal distribution of y.
The statistical properties of equation 2.16 depend on the function U. In the context of U-Boost (Murata et al., 2004), for example, a boosting algorithm with the function U(z) = (1 − η)exp(z) + ηz, where η is a tuning parameter, is closely related to a probabilistic model of mislabeling as
formula
where δ(x) is a probability of mislabeling, which can depend on the input x. Takenouchi, Eguchi, Murata, and Kanamori (2008) investigated a model of δ(x) where the input has a higher mislabeling probability when the decision for predicting the label of x is more difficult.
We focus on a class of functions U satisfying
formula
2.17
Then we observe the following theorem:
Theorem 3. 
If the function U satisfies equation 2.17, the maximizer F* of AU(F) is equal to
formula
2.18
except for the arbitrariness of the constant.
Proof. 
From relationship 2.17, it follows that
formula
which implies that the bracketed term in equation 2.14 becomes 0 with the F‡ for any x. Then from the concavity of AU(F), we conclude that F* = F‡ except for the arbitrariness of the constant.

We next focus on a special type of function U associated with the logistic model.

2.4.1.  Class of Function Associated with the Logistic Model.

Here we consider a family {F(x; α)} of discriminant functions parameterized by a vector α of parameters and the logistic model as
formula
2.19
where c is a constant. We assume that the true conditional distribution is represented by the logistic model with parameter α0. Under these conditions, we focus on estimating the parameter α. Let the estimator be the maximizer of equation 2.7, where the true joint distribution of (X, Y) is πypy(x).
Definition 1: 

Fisher consistency. When the estimator of the parameter vector maximizing equation 2.7 associated with the true distribution coincides with the true parameter α0, the estimator is said to be Fisher consistent.

Note that Fisher consistency implies
formula
2.20
and a class of function U associated with the logistic model is specified by the following theorem:
Theorem 4. 

Assume that F(·; α) is a one-to-one mapping with respect to the parameter α. If the function U(z) satisfies equation 2.17 with λ = 2, the estimator maximizing equation 2.7 is Fisher consistent.

Proof. 
A necessary and sufficient condition is that F(·; α0) satisfies equation 2.12. Under assumptions 2.19, we observe that
formula
where c0 is a constant associated with α0. From the relationship 2.17, we know that
formula
2.21
formula
2.22
Then we observe equation 2.12 and conclude the theorem.
Remark 4. 

These properties are derived from the concavity of the function U and do not hold for the sigmoid function or the normal distribution function.

We consider some examples of the function U that satisfy equation 2.17 with λ = 2.

  1. Exponential type: Urank(z) = 1 − exp(−z). We discuss some of its properties in section 2.6.

  2. Logistic type: .

  3. MadaBoost type:
    formula

The statistical properties of a boosting algorithm with this third function, called MadaBoost, are discussed in Kanamori, Takenouchi, Eguchi, and Murata (2007). MadaBoost is the most B-robust boosting algorithm for outliers in the sense of gross error sensitivity (Hampel, Rousseeuw, Ronchetti, & Stahel, 1986) when the true distribution is modeled by the logistic model. We next investigate the robustness of the loss function in the framework of U-AUC maximization. These functions are plotted in Figure 1.

Figure 1: Examples of U-functions.
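The exponential type is the only one of the three whose closed form appears above, and its defining properties (monotonicity, concavity, and the lower-bound relation to the Heaviside function used in equation 2.9) can be checked numerically; a small sketch:

```python
import numpy as np

def u_rank(z):
    # exponential-type function from the list above: U_rank(z) = 1 - exp(-z)
    return 1.0 - np.exp(-z)

z = np.linspace(-5.0, 5.0, 1001)
heaviside = (z >= 0).astype(float)

assert np.all(np.diff(u_rank(z)) > 0)           # monotonically increasing
assert np.all(np.diff(u_rank(z), 2) < 1e-12)    # concave: second differences nonpositive
assert np.all(u_rank(z) <= heaviside)           # U(z) <= H(z), hence AU(F) <= AUC(F)
```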

2.5.  Robustness of MadaBoost Type Loss.

We consider the class of functions U satisfying equation 2.17 and assume that the true distribution is written as
formula
where F† is a discriminant function. Let us consider an update of the discriminant function with a base classifier f ∈ {+1, − 1} as
formula
and a statistical model of the conditional distribution p(y|x) as
formula
2.23
Note that p(y|x) = p0(y|x) and from theorem 4, the estimator maximizing equation 2.7 is Fisher consistent,
formula
2.24
which implies that the estimated coefficient of the added classifier f is 0 at the population level.
The robustness of the estimator is measured by the gross error sensitivity defined by
formula
2.25
where δ(x, y) denotes the probability distribution with a point mass at the point (x, y).
Lemma 1. 
The gross error sensitivity of the estimator is written as
formula
2.26
where V(z) = 1 − U(−z).
Proof. 

See appendix  B.

Then we observe the following theorem:

Theorem 5. 
We assume that the base classifier f satisfies the following conditions,2
formula
2.27
Then the estimator associated with the MadaBoost-type function Umada is the most B-robust estimator, which minimizes the gross error sensitivity (see equation 2.26).
Proof. 
If the function V′ is not upper-bounded, the numerator of equation 2.26 diverges. Therefore we focus on the case where V′ is upper-bounded. Without loss of generality, we can assume
formula
2.28
because multiplication by a positive constant for V′ does not change the estimator. From assumptions 2.27, we observe
formula
2.29
formula
2.30
The gross error sensitivity is minimized when the denominator of equation 2.26 is maximized. The term is calculated over the region {x|F†(x′) ⩾ F†(x)}, and then we find that V′(z) = 2 should hold for z ⩾ 0. From equation B.8, we observe that the MadaBoost-type function minimizes the gross error sensitivity.

2.6.  Special Example of U: Exponential Type.

The algorithm maximizing U-AUC with the special function
$$U_{\mathrm{rank}}(z) = 1 - \exp(-z)$$
is equivalent to RankBoost, whose properties were investigated in Freund et al. (2003). A relationship between the algorithm and AUC maximization was discussed in Cortes and Mohri (2004). While the selection of the base classifier and the calculation of αt cannot be solved in closed form for a general function U, the maximization of the empirical U-AUC with Urank can be solved explicitly, with the corresponding steps of the algorithm modified as follows:
  • Selection of a base classifier:
    formula
    2.31
  • Calculation of a coefficient for the selected classifier:
    formula
    2.32

Also, as with theorem 1, theoretical evaluation of the empirical AUC value by the algorithm with the function Urank was analyzed in Freund et al. (2003).
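For the exponential type, the coefficient step has a closed form. The derivation below is ours (for base classifiers taking values in {−1, +1}) and may differ from equation 2.32 in constants or conventions; it is included only to show why no numerical line search is needed in this case:

```python
import numpy as np

def rankboost_alpha(D, f_pos, f_neg, eps=1e-12):
    """Closed-form coefficient for the exponential-type (RankBoost) case.
    For f in {-1, +1}, minimizing sum_{i,j} D[i, j] * exp(-a * (f(x_i) - f(x_j)))
    over a gives a = (1/4) * log(W_plus / W_minus), where W_plus (resp.
    W_minus) is the weight of pairs the stump orders correctly (resp.
    incorrectly).  D is the pair-weight matrix over positive-negative pairs."""
    diff = f_pos[:, None] - f_neg[None, :]        # values in {-2, 0, +2}
    w_plus = D[diff > 0].sum()
    w_minus = D[diff < 0].sum()
    return 0.25 * np.log((w_plus + eps) / (w_minus + eps))
```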

In the following, we investigate the relationship between RankBoost associated with Urank and AdaBoost. AdaBoost is the most popular boosting algorithm for classification problems and is derived from sequential minimization of the following exponential loss function:
formula
2.33
We observe the following:
Proposition 1. 

RankBoost is equivalent to AdaBoost with an optimized threshold value.

Proof. 
Here we consider a modified version of equation 2.33, which optimizes a threshold value of the discriminant function as
formula
2.34
where
formula
By plugging this optimal threshold into equation 2.34, we find
formula
2.35
and then minimization of equation 2.34, that is, AdaBoost with an optimized threshold, is equivalent to maximization of the empirical U-AUC associated with the function Urank, that is, to RankBoost.

2.7.  Related Work.

In this section, we briefly review related work to clarify our own contributions in this letter.

The AdaBoost procedure is derived from a sequential minimization of the loss function,
formula
2.36
and its statistical properties have been investigated in detail (Friedman et al., 2000). Lebanon and Lafferty (2001) pointed out a relationship between AdaBoost and maximum likelihood estimation, that is, minimization of the extended Kullback-Leibler divergence. The loss function (see equation 2.36) can also be extended with an arbitrary convex function (Mason, Baxter, Bartlett, & Frean, 1999). Based on these insights, Murata et al. (2004) extended the framework of AdaBoost with a statistical version of the Bregman divergence and proposed the U-Boost algorithm. Statistical properties of U-Boost such as consistency and robustness were investigated, and the information-geometrical structure (Amari & Nagaoka, 2000) of the U-Boost algorithm was discussed, which gives an intuitive interpretation of how U-Boost works. Kanamori et al. (2007) considered the robustness of boosting methods in the class of U-Boost and revealed that the MadaBoost-type loss function is the most robust loss function for coefficient estimation.

RankBoost is a boosting method that learns an accurate ranking or preference by combining multiple preferences, such as the results of different search engines (Freund et al., 2003), and it was pointed out that minimization of the RankBoost loss function is equivalent to maximization of a surrogate of the empirical AUC (Cortes & Mohri, 2004). While RankBoost optimizes a surrogate of the empirical AUC, which is one measure of classifier performance, another important performance measure is the margin of the classifier. RankBoost, however, does not always converge to a maximum margin solution, because its objective function is the empirical AUC rather than the margin. Rudin and Schapire (2009) proposed the smoothed margin ranking algorithm, a modified version of RankBoost that makes it possible to obtain a maximum ranking-margin solution. While RankBoost and its variants, including the proposed method, are based on the maximization of concave objective functions that are surrogates of the empirical AUC value, there are also methods that approximate the Heaviside function by the sigmoid function (Herschtal & Raskutti, 2004; Ghosh & Chaudhuri, 2005) or by the cumulative distribution function of the normal distribution (Komori & Eguchi, 2010). Such algorithms are derived with the usual coordinate-ascent method; however, the objective function is not concave, so the algorithm can be trapped in local maxima and requires regularization to mitigate this problem.

Our proposed method is a generalization of RankBoost to an arbitrary concave function for maximizing a surrogate of the empirical AUC, motivated by the extension of AdaBoost to U-Boost (Murata et al., 2004). We also investigate the relationship between the class of concave functions and the associated statistical model, and we propose a robust algorithm for AUC maximization (in section 2.5) based on the framework of Kanamori et al. (2007).

3.  Partial U-AUC

The AUC provides an overall assessment of classification based on FPR and TPR as described in equation 2.3, each of which expresses a different accuracy for discrimination performance. In some situations, especially in disease screening, a high TPR in a range of a low FPR is important (Baker, 2003; Qi, Joseph, & Seetharaman, 2006), and hence the partial area under the ROC curve (pAUC) has been gaining in popularity. Moreover, a specific value of TPR given a fixed FPR is also used as a summary measure of discrimination in medical screening (Pepe, Longton, Anderson, & Schummer, 2003; Pepe, 2003). Pepe and Thompson (2000) proposed a statistical method that maximizes the pAUC by linearly combining two predictors, and Komori and Eguchi (2010) made it possible to nonlinearly combine more than two predictors using the approximate pAUC. In this context, we consider the partial U-AUC as an extension of the U-AUC.

3.1.  Definition of Partial U-AUC.

We focus on a range of FPR between τ1 and τ2 that corresponds to the threshold values c1,F and c2,F:
$$\mathrm{FPR}_F(c_{1,F}) = \tau_1, \qquad \mathrm{FPR}_F(c_{2,F}) = \tau_2. \qquad (3.1)$$
Note that c2,F < c1,F if τ1 < τ2. For the range of FPR, partial AUC pA(F) is defined as
formula
3.2
To approximate the partial AUC, we define the partial U-AUC pAU(F) as
formula
3.3
The empirical version of pAU is given as
formula
3.4
where the empirical threshold values correspond to τ1 and τ2. In the empirical version, a threshold value c associated with a specific value τ is chosen from the midpoints of the sorted values F(xj) having class label yj = −1. Note that from the definition of the function U, we observe that
formula
3.5
and the following theorem:
Theorem 6. 

pAU(F, τ1, τ2) is concave with respect to F.

Proof. 

See appendix  C.

Note that the concavity of the empirical version is easily shown by the same argument as that in the proof of theorem 6.
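For reference, the (non-surrogate) empirical partial AUC over an FPR window can be computed by restricting the negative examples that contribute to the pairwise count; the following sketch is a standard nonparametric estimate in the spirit of equations 3.2 and 3.4 (names and tie handling are ours):

```python
import numpy as np

def empirical_pauc(scores_pos, scores_neg, tau1=0.0, tau2=0.1):
    """Nonparametric estimate of the partial AUC over FPR in [tau1, tau2]:
    only the negatives whose rank by decreasing score puts the FPR inside the
    window contribute.  This mirrors the choice of thresholds from the sorted
    negative scores described above, up to tie handling."""
    s_pos = np.asarray(scores_pos)
    s_neg = np.sort(np.asarray(scores_neg))[::-1]    # decreasing: FPR grows with rank
    n_pos, n_neg = len(s_pos), len(s_neg)
    lo, hi = int(np.floor(tau1 * n_neg)), int(np.ceil(tau2 * n_neg))
    window = s_neg[lo:hi]                            # negatives with FPR in [tau1, tau2]
    correct = (s_pos[:, None] > window[None, :]).sum()
    return correct / (n_pos * n_neg)

# example: the partial AUC over the window is at most tau2 - tau1
print(empirical_pauc([2.0, 1.5, 0.4], [0.3, -0.1, 0.0, 1.0], 0.0, 0.25))
```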

3.2.  Algorithm for Maximizing Partial U-AUC.

In this section, we consider a boosting-type algorithm to maximize the empirical partial U-AUC; however, its derivation is not straightforward because the threshold values c1,F and c2,F in definition 3.3 depend on the function F. If we update the discriminant function Ft−1 to Ft−1 + αf with a fixed f, the threshold values also change depending on the value of α, so we cannot use a derivation like equation 2.11. This implies that f and α cannot be determined independently, which is not computationally tractable. To overcome the difficulty, we assume that the coefficient α is in the range [0, 1] and consider the following lower bound of the difference of the partial U-AUC with fixed threshold values:
formula
3.6
formula
3.7
formula
3.8
formula
3.9
where the inequality in equation 3.8 is derived from the concavity shown in theorem 6. The approximate lower bound is defined with thresholds for the function Ft−1 + f and does not depend on α, which enables us to select f independently of αt and leads to a drastic reduction in computational cost. The algorithm maximizing the partial U-AUC, pU-AUCBoost, is written as follows:
  1. Initialize the discriminant function to F0(x).

  2. For t = 1, …, T,

    1. Find a base classifier,
      formula
      where
      formula
      formula
      and
      formula
      formula
    2. Calculate the coefficient of the selected classifier:
      formula
    3. Update the discriminant function:
      formula
  3. Output the discriminant function $F_T(x) = F_0(x) + \sum_{t=1}^{T} \alpha_t f_t(x)$.

Note that in the optimization in equation 3.15, the threshold values depend on the value of α; however, we can easily detect when the threshold values change, and then the optimization is not difficult. During the optimization procedure, the value of Zt−1(f) in the denominator of equation 3.13 often becomes exactly 0 because the threshold values restrict the number of samples; as a result, one base classifier is chosen repeatedly, and the algorithm does not move ahead. To avoid this problem, a base classifier that has been chosen is discarded in later steps.

3.3.  Consistency of the Algorithm for pAUC.

Similar to the consistency of AU, we have the following remark:

Remark 5. 
The maximizer F* of equation 3.3 is connected with the ratio p+1/p−1 in a one-to-one correspondence, and we observe
formula
3.17
where πy is the marginal distribution of y.
Proof. 

See appendix  D.

Note that this one-to-one correspondence between F* and p+1/p−1 holds only in the restricted range of x determined by the threshold values. For the complement set, which consists of the two regions outside this range, we have
formula
3.18
formula
3.19
Hence, we have a weak Bayes consistency in the sense that the consistency holds only on the restricted range.

3.4.  Evaluation of the Empirical pAU.

Regarding the bounds of the empirical pAU, we have the following theorem:

Theorem 7. 
For the discriminant function $F_T(x) = \sum_{t=1}^{T} \alpha_t f_t(x)$, the empirical partial U-AUC is bounded as
formula
3.20
where
formula
Proof. 

See appendix  E.

From relationship 3.5, we immediately observe the following corollary:

Corollary 1. 
For the discriminant function FT(x) constructed by the pAU boosting algorithm, the empirical partial AUC is lower-bounded as
formula
3.21
Note that this holds for all t (t = 1, …, T).

4.  Experiments

In this section, we examine the validity of the proposed methods by comparing them with existing binary classification methods using benchmark data sets from the UCI repository (Blake & Merz, 1998). Through the experiments, we investigated the performance of the proposed methods (U-AUCBoost and pU-AUCBoost) with three functions (Urank, Ulogit, Umada) and employed “stumps” (Friedman et al., 2000) as the base classifiers of the boosting-type algorithms. The proposed methods were compared with several boosting-type algorithms (AdaBoost, LogitBoost, and MadaBoost, defined by the functions Urank, Ulogit, and Umada, respectively) and with SVMs using an RBF kernel and a first-order polynomial (i.e., linear) kernel.

The coefficient α of the base classifiers was optimized in the range [0, 1] for the purpose of regularization. The initial function of pU-AUCBoost was set to a function constructed by RankBoost (U-AUCBoost with Urank), whose step number T was determined by the AUC or pAUC value on the validation data set; this is because pU-AUCBoost can use only the part of the examples associated with the focused range of FPR and therefore had difficulty working smoothly on these data sets. (See Komori & Eguchi, 2010, for the preprocessing of the data by a pAUC-based method in high dimensions.) We employed the SVM implementation in the e1071 package, an R (R Development Core Team, 2011) interface to LIBSVM (Chang & Lin, 2011), and used a first-order polynomial (i.e., linear) kernel and a radial basis function (RBF) kernel as the kernel functions (Schölkopf & Smola, 2001).

We divided the original data set into a training data set and a test data set with ratios of 80% and 20%, respectively, and evaluated the generalization performance on the test data set. We repeated this procedure 100 times with different divisions of the data set and observed the average performance. For the tuning of the hyperparameters of each method (i.e., the step number T of the boosting-type algorithms, or the regularization parameter C controlling the number of support vectors and the kernel parameter γ of the SVM),3 we subdivided the training data set into a training subset and a validation subset with ratios of 80% and 20%, respectively, and set the hyperparameters so as to maximize the AUC or pAUC value on the validation subset. Details of the data sets are summarized in Table 1. Examples with missing values were removed beforehand.
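The evaluation protocol just described can be summarized in the following sketch; the `fit` interface and the use of scikit-learn's `roc_auc_score` are ours and stand in for whichever learner and AUC routine are being compared (X and y are assumed to be NumPy arrays):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(fit, param_grid, X, y, n_trials=100, seed=0):
    """Repeated 80/20 train/test splits; inside each training set a further
    80/20 split selects the hyperparameter maximizing the validation AUC.
    `fit(X_train, y_train, param)` is assumed to return a scoring function
    x -> F(x) (hypothetical interface, not from the letter)."""
    rng = np.random.default_rng(seed)
    test_aucs = []
    for _ in range(n_trials):
        idx = rng.permutation(len(y))
        n_test = len(y) // 5
        test, train = idx[:n_test], idx[n_test:]
        n_val = len(train) // 5
        val, sub = train[:n_val], train[n_val:]
        best_param = max(
            param_grid,
            key=lambda p: roc_auc_score(y[val], fit(X[sub], y[sub], p)(X[val])),
        )
        scorer = fit(X[train], y[train], best_param)
        test_aucs.append(roc_auc_score(y[test], scorer(X[test])))
    return np.mean(test_aucs), np.std(test_aucs)
```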

Table 1:
Data Set Information.
Data Set           Total Number   Number of Attributes
Australian              690               14            −0.221
Breast Cancer           683                             −0.619
Diabetes                768                              0.624
German                 1000               24            −0.847
Heart                   270               13            −0.223
Ionosphere              351               34             0.580
Liver Disorders         345                              0.322
Pima                    768                             −0.624
Sonar                   208               60            −0.135

Figures 2 and 3 show box-whisker plots of the AUC and the partial AUC (τ1 = 0, τ2 = 0.1) of each method over 100 trials, respectively. We observe that the proposed methods (especially U-AUCBoost) worked well and performed better than the existing methods except for a few data sets; even for those data sets, the existing methods did not significantly outperform U-AUCBoost. Although the improvement of the AUC value by the proposed method seemed small, it is not negligible, because the AUC criterion is insensitive to improvements of a classifier (Pencina, D'Agostino, D'Agostino, & Vasan, 2008). On the other hand, pU-AUCBoost sometimes had degraded performance compared with that of U-AUCBoost. This may be because pU-AUCBoost focuses on a specific range of FPR, and its target function (see equation 3.4) is defined with fewer negative examples than that of U-AUCBoost; this restriction of examples made its optimization process less stable than that of U-AUCBoost. In summary, the algorithm pU-AUCBoost for maximizing the partial AUC can be formally defined, but we could not find a special situation or data set where it consistently works well. We recommend employing U-AUCBoost rather than pU-AUCBoost to maximize the partial AUC, because U-AUCBoost also performed well in the sense of the partial AUC.

Figure 2: Box plots of AUC values for each data set. (A) SVM with RBF kernel. (B) SVM with linear kernel. (C) Boosting with Urank. (D) Boosting with Ulogit. (E) Boosting with Umada. (F) U-AUCBoost with Urank. (G) U-AUCBoost with Ulogit. (H) U-AUCBoost with Umada. (I) pU-AUCBoost with Urank. (J) pU-AUCBoost with Ulogit. (K) pU-AUCBoost with Umada.

Figure 3: Box plots of partial AUC values for each data set. (A) SVM with RBF kernel. (B) SVM with linear kernel. (C) Boosting with Urank. (D) Boosting with Ulogit. (E) Boosting with Umada. (F) U-AUCBoost with Urank. (G) U-AUCBoost with Ulogit. (H) U-AUCBoost with Umada. (I) pU-AUCBoost with Urank. (J) pU-AUCBoost with Ulogit. (K) pU-AUCBoost with Umada.

5.  Conclusion

In this letter, we formulated a new performance measure, U-AUC, for AUC-optimal classification and proposed a boosting-type algorithm maximizing it, U-AUCBoost, which is formulated as sequential maximization of a concave function. The statistical properties of the maximizer of U-AUC were discussed, and we evaluated the robustness of the proposed algorithm. We also extended U-AUCBoost to pU-AUCBoost for maximizing U-AUC over a specific range of FPR. We experimentally investigated the performance of the proposed algorithms with real data sets and confirmed the validity of U-AUCBoost. pU-AUCBoost did not work as well because of the shortage of examples caused by the restriction of FPR. Nevertheless, the numerical experiments showed that U-AUCBoost can also be used to maximize the partial AUC.

Appendix A:  Proof of Theorem 1

For convenience, we introduce a function V(z) that satisfies U(z) = 1 − V(−z). The function V is nonnegative, convex, and monotonically increasing. With this function, maximization of equation 2.8 is equivalent to minimization of
formula
A.1
From the definition of γt, we observe that
formula
Then we observe that
formula
where
formula
A.2
By recursively considering this inequality, we obtain
formula
because F0(x) = 0 and V(0) = 1. From an inequality 1 + z ⩽ exp(z), we obtain
formula
A.3
From this inequality, we conclude the theorem.

Appendix B:  Proof of Lemma 1

By simple calculation, we observe that the conditional distribution of the contaminated distribution is written as
formula
B.1
The estimator with the contaminated distribution satisfies
formula
B.2
Note that, because of the Fisher consistency, the estimator coincides with its uncontaminated value when ε = 0. By considering the Taylor expansion, we observe that
formula
Here, this assumption can be made without loss of generality. Then we observe that
formula
B.3
Note that the first term of the above equation is zero. Then we observe that
formula
B.4
The same holds for the other case. Then the gross error sensitivity of the estimator is written as
formula
B.5
Because f(x) is in {+1, − 1}, the numerator of equation B.5 is rewritten as
formula
B.6
The denominator is rewritten as
formula
B.7
From the definition of V, we observe
formula
B.8
and then the above equation is rewritten as
formula
B.9
Equation B.9 comes from the following relationship:
formula
B.10
Then we conclude the lemma.

Appendix C:  Proof of Theorem 6

For an arbitrary function η(x) and a scalar ε, we have
formula
Note that the corresponding relations for the threshold values follow from equations 3.1. In the same way, the second derivative is given as
formula
C.1
because of the concavity of U(z). This indicates that pAU(F, τ1, τ2) is concave with respect to F.

Appendix D:  Proof of Remark 5

The maximizer F* satisfies
formula
D.1
for an arbitrary function η(x). From the same calculation as in the proof of theorem 6, we have
formula
D.2
Note that equation D.2 can attain 0 for any η only if x is in the region determined by the threshold values, and hence F* satisfies
formula
D.3
formula
D.4
where Ψ(z) is defined in equation 2.13. From equation 2.15, F*(x) has a one-to-one correspondence with p+1(x)/p−1(x), which directly leads to remark 5.

Appendix E:  Proof of Theorem 7

formula
E.1
formula
E.2
where 0 ⩽ α* ⩽ αT. From the same argument as in the proof of theorem 6, we have
formula
E.3
Hence, we have
formula
E.4
formula
E.5
where
formula
E.6
formula
E.7
Using notation such as
formula
E.8
formula
E.9
we have
formula
E.10
formula
E.11
formula
E.12
where F0(x) = 0. In the same way, we have
formula
E.13
formula
E.14
which concludes theorem 7.

Acknowledgments

We appreciate the valuable comments that the anonymous reviewers made.

References

Amari, S., & Nagaoka, H. (2000). Methods of information geometry. New York: Oxford University Press.
Baker, S. G. (2003). The central role of receiver operating characteristic (ROC) curves in evaluating tests for the early detection of cancer. Journal of the National Cancer Institute, 95, 511–515.
Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. Department of Information and Computer Science, University of California.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.
Brefeld, U., & Scheffer, T. (2005). AUC maximizing support vector learning. In Proceedings of the ICML 2005 Workshop on ROC Analysis in Machine Learning. Bonn, Germany.
Chang, C., & Lin, C. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), 27.
Cortes, C., & Mohri, M. (2004). AUC optimization vs. error rate minimization. In S. Thrün, L. K. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16 (pp. 313–320). Cambridge, MA: MIT Press.
Freund, Y., Iyer, R., Schapire, R., & Singer, Y. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 933–969.
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of online learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.
Friedman, J. H., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28, 337–407.
Ghosh, A., & Chaudhuri, P. (2005). On data depth and distribution-free discriminant analysis using separating surfaces. Bernoulli, 11(1), 1–28.
Hampel, F. R., Rousseeuw, P. J., Ronchetti, E. M., & Stahel, W. A. (1986). Robust statistics. New York: Wiley.
Herschtal, A., & Raskutti, B. (2004). Optimising area under the ROC curve using gradient descent. In Proceedings of the Twenty-First International Conference on Machine Learning (p. 49). New York: ACM.
Kanamori, T., Takenouchi, T., Eguchi, S., & Murata, N. (2007). Robust loss functions for boosting. Neural Computation, 19(8), 2183–2244.
Komori, O., & Eguchi, S. (2010). A boosting method for maximizing the partial area under the ROC curve. BMC Bioinformatics, 11, 314.
Lebanon, G., & Lafferty, J. (2001). Boosting and maximum likelihood for exponential models (Tech. Rep. No. CMU-CS-01-144). Pittsburgh, PA: School of Computer Science, Carnegie Mellon University.
Mason, L., Baxter, J., Bartlett, P., & Frean, M. (1999). Boosting algorithms as gradient descent in function space. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 512–518). Cambridge, MA: MIT Press.
Murata, N., Takenouchi, T., Kanamori, T., & Eguchi, S. (2004). Information geometry of U-boost and Bregman divergence. Neural Computation, 16(7), 1437–1481.
Pencina, M., D'Agostino Sr., R., D'Agostino Jr., R., & Vasan, R. (2008). Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. Statistics in Medicine, 27(2), 157–172.
Pepe, M. S. (Ed.). (2003). The statistical evaluation of medical tests for classification and prediction. New York: Oxford University Press.
Pepe, M. S., Longton, G., Anderson, G. L., & Schummer, M. (2003). Selecting differentially expressed genes from microarray experiments. Biometrics, 59, 133–142.
Pepe, M. S., & Thompson, M. L. (2000). Combining diagnostic test results to increase accuracy. Biostatistics, 1, 123–140.
Qi, Y., Joseph, Z. B., & Seetharaman, J. K. (2006). Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins: Structure, Function, and Bioinformatics, 63, 490–500.
R Development Core Team. (2011). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. http://www.R-project.org.
Rudin, C., & Schapire, R. (2009). Margin-based ranking and an equivalence between AdaBoost and RankBoost. Journal of Machine Learning Research, 10, 2193–2232.
Schölkopf, B., & Smola, A. J. (2001). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press.
Takenouchi, T., Eguchi, S., Murata, T., & Kanamori, T. (2008). Robust boosting algorithm against mislabeling in multi-class problems. Neural Computation, 20(6), 1596–1630.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Notes

1

From the concavity of the empirical U-AUC with respect to α, it follows that 0 ⩽ γt ⩽ 1.

2

The function F† is associated with the true distribution rather than that constructed by the boosting algorithm, and the condition is not too strong; F† can converge when p(+1|x) (or p(−1|x)) is nearly equal to 0.

3

We optimize C and γ in the ranges {2^{−2}, …, 2^{5}} and {2^{−3}, …, 2^{3}}, respectively.