Abstract

A wide variety of machine learning algorithms such as the support vector machine (SVM), minimax probability machine (MPM), and Fisher discriminant analysis (FDA) exist for binary classification. The purpose of this letter is to provide a unified classification model that includes these models through a robust optimization approach. This unified model has several benefits. One is that the extensions and improvements intended for SVMs become applicable to MPM and FDA, and vice versa. For example, we can obtain nonconvex variants of MPM and FDA by mimicking Perez-Cruz, Weston, Hermann, and Schölkopf's (2003) extension from convex ν-SVM to nonconvex Eν-SVM. Another benefit is to provide theoretical results concerning these learning methods at once by dealing with the unified model. We give a statistical interpretation of the unified classification model and prove that the model is a good approximation for the worst-case minimization of an expected loss with respect to the uncertain probability distribution. We also propose a nonconvex optimization algorithm that can be applied to nonconvex variants of existing learning methods and show promising numerical results.

1.  Introduction

There is a wide variety of machine learning algorithms for binary classification. The support vector machine (SVM) is one of the most successful classification algorithms in modern machine learning (Cortes & Vapnik, 1995; Schölkopf & Smola, 2002). It finds a hyperplane that separates training samples into different classes with the maximum margin for a linearly separable data set (Boser, Guyon, & Vapnik, 1992).

The SVM has been extended to deal with nonseparable data by trading off the margin size with the data separation error (Cortes & Vapnik, 1995). This soft margin formulation is commonly referred to as C-SVM, since the trade-off is controlled by the parameter C. C-SVM has been shown to work very well in a wide range of real-world applications (Schölkopf & Smola, 2002). An alternative formulation of the soft margin idea is $\nu$-SVM (Schölkopf, Smola, Williamson, & Bartlett, 2000). $\nu$-SVM involves another trade-off parameter $\nu \in (0, 1]$ that roughly specifies the fraction of support vectors (or sparseness of the solution). The $\nu$-SVM formulation provides a richer variety of interpretations in comparison with the original C-SVM formulation and is potentially more useful in real applications.

$\nu$-SVM has a limited usable range of $\nu$, namely $\nu \in (\nu_{\min}, \nu_{\max}]$ (see Crisp & Burges, 2000; Chang & Lin, 2001). Perez-Cruz, Weston, Hermann, and Schölkopf (2003) showed that $\nu$-SVM can be modified in a way that increases the range of $\nu$ to include zero by allowing the margin to take negative values. The resulting extended $\nu$-SVM (E$\nu$-SVM) is formulated as a nonconvex quadratic programming problem, and a local optimum search algorithm has been developed for it. Perez-Cruz et al. experimentally showed that the generalization performance of E$\nu$-SVM is often better than that of $\nu$-SVM.

The minimax probability machine (MPM; Lanckriet, Ghaoui, Bhattacharyya, & Jordan, 2002) also addresses the binary classification problem. The problem setting assumes that only the mean and covariance matrix of each class are known. The optimal hyperplane of MPM is determined by minimizing the worst-case (maximum) probability of misclassification of unseen test samples over all possible class-conditional distributions. Moreover, Nath and Bhattacharyya (2007) devised a maximum margin classifier model whose constraints restrict the worst-case probability of misclassification in each class. The model can be regarded as a generalized variant of MPM. We can apply Fisher discriminant analysis (FDA; Fukunaga, 1990) to the binary classification problem when the means and covariance matrices are known.

The purpose of this letter is to provide a unified framework for learning algorithms, including SVM, MPM, and FDA, from the viewpoint of robust optimization (Ben-Tal, El-Ghaoui, & Nemirovski, 2009). Robust optimization is an approach that handles optimization problems defined by uncertain inputs. A simple example of robust optimization is
$$\max_{w \in \mathcal{W}} \ \min_{x \in \mathcal{X}} \ w^\top x, \tag{1.1}$$

where $w$ is the parameter to be optimized under the constraint $w \in \mathcal{W}$ and $x$ is an uncertain input in the problem. The uncertainty set $\mathcal{X}$ represents the uncertainty of the input. Equation 1.1 is interpreted as a problem in which one determines the decision-making parameter $w$ that maximizes the benefit for the worst-case setup among $\mathcal{X}$. For binary classification, we regard the means $x_+$ and $x_-$ of the data points of each class as uncertain inputs and prepare uncertainty sets $\mathcal{X}_+$ and $\mathcal{X}_-$ of those uncertain inputs. We assume that $x$ exists in the Minkowski difference of $\mathcal{X}_+$ and $\mathcal{X}_-$, that is, $\mathcal{X} = \mathcal{X}_+ \ominus \mathcal{X}_-$. $\mathcal{W}$ is defined by $\mathcal{W} = \{w : \|w\| = 1\}$. This problem always seems to be nonconvex because of the norm equality constraint in $\mathcal{W}$. However, it reduces to a convex problem that includes a constraint $\|w\| \le 1$ instead of $\|w\| = 1$ when $\mathcal{X}_+$ and $\mathcal{X}_-$ do not intersect.
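As a toy numeric illustration of equation 1.1 (not from the letter): when $\mathcal{X}$ is a polytope given by finitely many vertices, the inner minimum of a linear function is attained at a vertex, so in two dimensions the robust value can be approximated by scanning unit directions. The scenario points below are hypothetical.

```python
import math

def robust_value(vertices, n_dirs=3600):
    """Approximate max_{||w||=1} min_{x in X} w.x in 2D by scanning unit
    directions; the min of a linear function over a polytope is at a vertex."""
    best_val, best_w = -float("inf"), None
    for k in range(n_dirs):
        t = 2.0 * math.pi * k / n_dirs
        w = (math.cos(t), math.sin(t))
        worst = min(w[0] * vx + w[1] * vy for vx, vy in vertices)  # worst-case benefit
        if worst > best_val:
            best_val, best_w = worst, w
    return best_val, best_w

# Three hypothetical scenarios for the uncertain input x.
scenarios = [(2.0, 0.0), (0.0, 2.0), (1.5, 1.5)]
val, w_opt = robust_value(scenarios)
# The worst case over the first two scenarios pushes w toward the diagonal.
```

Here the robust optimum balances the two extreme scenarios rather than committing to either one, which is the behavior the minimax formulation is designed to produce.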

In this letter, we show that robust optimization (see equation 1.1) reduces to the learning methods mentioned above, depending on a prescribed uncertainty set $\mathcal{X}$. For example, we show that MPM is a special case of this equation with an uncertainty set $\mathcal{X} = \mathcal{X}_+ \ominus \mathcal{X}_-$ that consists of two ellipsoidal sets such that $\mathcal{X}_+$ and $\mathcal{X}_-$ touch externally. When $\mathcal{X}_+$ and $\mathcal{X}_-$ are defined as reduced convex hulls (Bennett & Bredensteiner, 2000; Crisp & Burges, 2000), equation 1.1 reduces to E$\nu$-SVM if $\mathcal{X}_+$ and $\mathcal{X}_-$ are strictly intersecting (i.e., $\mathcal{X}_+ \cap \mathcal{X}_-$ has a positive volume), while it reduces to $\nu$-SVM if $\mathcal{X}_+ \cap \mathcal{X}_- = \emptyset$. The difference between these learning methods turns out to be only in the definition of $\mathcal{X}$ in robust optimization, equation 1.1.

The first contribution of handling the unified model, equation 1.1, is to obtain new learning methods. For example, $\nu_{\min}$, the lower limit of the usable range of $\nu$-SVM, can be computed by solving equation 1.1 with two touching reduced convex hulls $\mathcal{X}_+$ and $\mathcal{X}_-$. As far as we know, no one has tried to compute $\nu_{\min}$, although Crisp and Burges (2000) have devised means to compute $\nu_{\max}$. We can reformulate the problem for $\nu_{\min}$ by mimicking the formulation of MPM. Other examples of new learning methods are nonconvex variants of MPM and FDA under the assumption of $0 \in \operatorname{int}\mathcal{X}$ for the uncertainty set $\mathcal{X}$ of equation 1.1. We can obtain them by mimicking the extension of Perez-Cruz et al. (2003) from convex $\nu$-SVM to nonconvex E$\nu$-SVM.

The second contribution is to provide theoretical results concerning these learning methods at once by dealing with the unified model. Indeed, we provide a statistical interpretation for equation 1.1 on the basis of conventional statistical learning theory. We consider the worst-case minimization of expected loss functions under an uncertainty set of probabilities. Then, we show that equation 1.1, with some corresponding uncertainty set, is a good approximation for minimizing the worst-case expected loss.

We also provide a generalized local optimum search algorithm that is applicable to nonconvex variants of learning models. We prove theoretical results on the local optimum search algorithm and show promising numerical results.

The letter is organized as follows. In section 2, we elucidate the unified model, called the robust classification model (RCM), for classification. In section 3, we show RCM's connection with existing learning algorithms and obtain nonconvex variants for MPM and FDA in the same way as nonconvex E$\nu$-SVM (Perez-Cruz et al., 2003). In section 4, we give a statistical interpretation of RCM in terms of minimizing the upper and lower bounds of the worst-case expected loss. Section 5 discusses the implications of the nonconvex variants of RCM and proves that the nonconvexity is closely related to negative regularization. Furthermore, we describe a local optimum search algorithm for nonconvex RCM. We report on experimental results for RCM in section 6 and summarize our contributions and future work in section 7.

2.  Robust Classification Model

We start by introducing the problem setting and the notations used throughout the letter. Let $\mathbb{R}^d$ be the input domain and $\{+1, -1\}$ be the set of binary labels. The observed training samples are denoted as $(x_i, y_i) \in \mathbb{R}^d \times \{+1, -1\}$, $i = 1, \dots, m$. Let $M_+$ be the set of indices of training samples with the label +1 and $M_-$ be the set of indices of training samples with the label −1. Let $|M_+| = m_+$ and $|M_-| = m_-$, where $|M|$ shows the size of the set M.

A decision function $f: \mathbb{R}^d \to \mathbb{R}$ is estimated from the training samples. The label of the input x is predicted by the classifier $h(x) = \operatorname{sign}(f(x))$, where sign is the sign function, that is, $\operatorname{sign}(z) = +1$ if $z \ge 0$ and −1 otherwise. The goal of the classification task is to obtain a classifier that minimizes the prediction error rate for unseen test samples. For simplicity, we focus on linear classifiers: $f(x) = w^\top x + b$, where $w$ ($\in \mathbb{R}^d$) is a weight vector and $b$ ($\in \mathbb{R}$) is a bias parameter. Most of the discussion in this letter can be directly applied to kernel classifiers (Schölkopf & Smola, 2002). Concretely, changing the inner product $w^\top x$ to a kernel function makes the statements of section 2 to section 5.1 hold for kernel classifiers, while the algorithm in section 5.2 needs a small modification.

2.1.  Robust Optimization

Exact data are unavailable in many practical optimization problems. The data may include noise, the data may represent information about the future such as future product demand, and so on. Robust optimization (Ben-Tal et al., 2009) is an approach that can handle optimization problems defined by uncertain inputs. We assume that the uncertain data lie within a bounded set, which is called the uncertainty set $\mathcal{X}$. Here, we consider an optimization problem in which the objective function includes uncertain data. When the worst case in $\mathcal{X}$ matters, maximizing the minimum benefit on $\mathcal{X}$ is a valid strategy. Thus, robust optimization is formulated as equation 1.1. To transform this equation into a tractable problem such as a convex optimization problem, it is often assumed that the set $\mathcal{W}$ of w and the set $\mathcal{X}$ of x are bounded convex sets in Euclidean space. In the following, however, the convexity of $\mathcal{W}$ does not necessarily have to be assumed.

The way of constructing the uncertainty set $\mathcal{X}$ is an important issue in practice. If we set $\mathcal{X}$ too large, the optimal decision of equation 1.1 is very robust to the uncertain data x but too conservative. Moreover, if we define $\mathcal{X}$ with complicated functions, we cannot easily solve equation 1.1. Many robust optimization studies have used polyhedral sets and ellipsoidal sets as $\mathcal{X}$ for computational tractability.

Robust optimization has been used to make statistical learning able to handle uncertain observations (see, e.g., Trafalis & Gilbert, 2006; Xu, Caramanis, & Mannor, 2009). These previous studies assumed that each sample $x_i$ lies somewhere within an ellipsoid with the observation of $x_i$ as its center. All possible realizations within these ellipsoids are taken into account for all training samples $x_i$ when deriving a robust classifier or regressor.

2.2.  Unified Classification Model Based on Robust Optimization

We assume that the training samples are not reliable because of noise or measurement errors. To make a classification model less sensitive to noise in the training samples, we focus on representative points of each class, denoted by $x_+$ and $x_-$. These points are not necessarily individual samples but may be means or medians of the data points of each class. Since the training samples are not reliable, it is reasonable to assume that the representative points of each class, $x_+$ and $x_-$, will involve some uncertainty. The largest possible sets of $x_+$ and $x_-$ are denoted by $\mathcal{X}_+$ and $\mathcal{X}_-$, respectively, and these sets are defined on the basis of the training samples $x_i$, $i \in M_+$, and $x_i$, $i \in M_-$, respectively. Throughout this letter, we assume that both $\mathcal{X}_+$ and $\mathcal{X}_-$ are compact and convex and that they have interior points. We show examples of $\mathcal{X}_+$ and $\mathcal{X}_-$ in section 3.

We briefly describe the robust classification model.1 Suppose that $\mathcal{X}_+$ (or $\mathcal{X}_-$) is an uncertainty set of class +1 (or −1). The robust classification model (RCM) is defined as the following optimization problem:

$$\max_{w} \ \min_{x \in \mathcal{X}} \ w^\top x \quad \text{s.t.} \ \|w\| = 1, \tag{2.1}$$

where $\|\cdot\|$ is the Euclidean norm, and $\mathcal{X}$ is the Minkowski difference of $\mathcal{X}_+$ and $\mathcal{X}_-$:

$$\mathcal{X} = \mathcal{X}_+ \ominus \mathcal{X}_- = \{x_+ - x_- : x_+ \in \mathcal{X}_+, \ x_- \in \mathcal{X}_-\}.$$

Since $\mathcal{X}_+$ and $\mathcal{X}_-$ are compact convex sets and have interior points, their Minkowski difference $\mathcal{X}$ is necessarily compact convex and has a nonempty interior. In RCM, equation 2.1, the inner minimization on x gives the minimum discrepancy between $\mathcal{X}_+$ and $\mathcal{X}_-$ along the direction vector w. The optimal solution of equation 2.1 gives a direction vector w maximizing the minimum discrepancy between the two classes. Clearly, such a w is useful for predicting the class labels.
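When the two uncertainty sets are polytopes given by vertex lists, the Minkowski difference above can be formed explicitly: the pairwise differences of vertices generate it. A minimal 2D sketch (hypothetical vertex sets; the direction scan is a numerical approximation of equation 2.1):

```python
import math

def mink_diff(P, Q):
    # Vertices generating X+ ⊖ X-: all pairwise differences of vertices.
    return [(p[0] - q[0], p[1] - q[1]) for p in P for q in Q]

def rcm_value(diff_vertices, n_dirs=3600):
    # Equation 2.1 in 2D: scan unit directions w; the inner min over the
    # polytope X is attained at one of its generating vertices.
    best = -float("inf")
    for k in range(n_dirs):
        t = 2.0 * math.pi * k / n_dirs
        w = (math.cos(t), math.sin(t))
        best = max(best, min(w[0] * x + w[1] * y for x, y in diff_vertices))
    return best

X_plus = [(3.0, 0.0), (4.0, 1.0), (3.0, 2.0)]     # hypothetical class +1 vertices
X_minus = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]    # hypothetical class -1 vertices
val_disjoint = rcm_value(mink_diff(X_plus, X_minus))        # hulls disjoint: value > 0

X_minus_shifted = [(2.5, 0.0), (3.5, 1.0), (2.5, 2.0)]      # overlaps X_plus
val_overlap = rcm_value(mink_diff(X_plus, X_minus_shifted)) # 0 interior to X: value < 0
```

The sign change of the optimal value between the disjoint and overlapping configurations is exactly the case split analyzed in the rest of this section.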
Here we investigate equation 2.1 in terms of the dual representation of the convex set $\mathcal{X}$. A well-known corollary of the separation theorem of Mazur proves that the compact convex set $\mathcal{X}$ is completely determined by its support function $s_{\mathcal{X}}(w) = \max_{x \in \mathcal{X}} w^\top x$. More concretely,

$$\mathcal{X} = \bigcap_{\|w\| = 1} \{x : w^\top x \le s_{\mathcal{X}}(w)\}.$$

Noticing that problem 2.1 is equivalent to $-\min_{\|w\|=1} s_{\mathcal{X}}(-w)$, we can say that it is to find a solution w that achieves the minimum among all support function values of $\mathcal{X}$.

To geometrically interpret RCM, equation 2.1, Figure 1 shows the ellipsoidal uncertainty sets $\mathcal{X}_+$, $\mathcal{X}_-$ and their Minkowski difference. We can separate the problem into two cases, whether $\mathcal{X}_+$ and $\mathcal{X}_-$ have an intersection or not, which is equivalent to whether $\mathcal{X}$ includes the origin or not. As shown in theorem 1, there is a large difference in problem difficulty of RCM between these two cases because equation 2.1 is essentially a nonconvex problem when $0 \in \operatorname{int}\mathcal{X}$, but it becomes a convex one when $0 \notin \mathcal{X}$.

Figure 1:

Geometric interpretation of RCM, equation 2.1. (Left) Two uncertainty sets, $\mathcal{X}_+$ and $\mathcal{X}_-$, are disjoint ($0 \notin \mathcal{X}$). (Right) $\mathcal{X}_+$ and $\mathcal{X}_-$ are joint ($0 \in \operatorname{int}\mathcal{X}$). The uncertainty sets are ellipsoids around the sample means. The solid line stands for the optimal hyperplane, and the squares are the solutions $\bar{x}_+$ and $\bar{x}_-$ of the inner minimization on x in RCM. The bias term b in the decision function is defined such that the decision boundary passes through the midpoint of the squares. The optimal hyperplane for the Minkowski difference $\mathcal{X}$ is also depicted (dash-dotted line), and the optimal point $\bar{x}$ in $\mathcal{X}$ is shown by the filled circle. The arrow labeled $\gamma w$ indicates the normal direction of the hyperplane for some positive scalar $\gamma$. The other arrows indicate the optimal solution w.


Before giving an intuitive geometric interpretation of RCM in theorem 1, we start by proving two lemmas. The first lemma separates the case $0 \in \mathcal{X}$ into two cases: $\mathcal{X}$ includes the origin in its interior, $0 \in \operatorname{int}\mathcal{X}$, or on its boundary, $0 \in \partial\mathcal{X}$, and it classifies RCM into three cases in terms of the optimal value. In the geometric sense, $0 \notin \mathcal{X}$ holds when $\mathcal{X}_+$ and $\mathcal{X}_-$ are disjoint. $0 \in \mathcal{X}$ implies that $\mathcal{X}_+$ and $\mathcal{X}_-$ are joint. In particular, $0 \in \partial\mathcal{X}$ implies that $\mathcal{X}_+$ and $\mathcal{X}_-$ touch externally.

Lemma 1. 

The optimal value of RCM, equation 2.1, is positive if and only if $0 \notin \mathcal{X}$. It is zero if and only if $0 \in \partial\mathcal{X}$, and it is negative if and only if $0 \in \operatorname{int}\mathcal{X}$.

Proof. 

$0 \notin \mathcal{X}$ means that the origin and the compact convex set $\mathcal{X}$ are disjoint. The well-known strong separation theorem ensures that there exist $w$ with $\|w\| = 1$ and $c > 0$ such that $w^\top x \ge c$ for all $x \in \mathcal{X}$. Therefore, the optimal value of RCM is positive.

When $0 \in \operatorname{int}\mathcal{X}$, for any w with $\|w\| = 1$, we can find some $\varepsilon > 0$ such that $-\varepsilon w \in \mathcal{X}$ and $w^\top(-\varepsilon w) = -\varepsilon < 0$. This implies that $\min_{x \in \mathcal{X}} w^\top x < 0$ holds for any w with $\|w\| = 1$, and therefore the optimal value of RCM is negative.

$0 \in \partial\mathcal{X}$ implies that there exists a supporting hyperplane to $\mathcal{X}$ at the origin; that is, $\bar{w}$ with $\|\bar{w}\| = 1$ exists such that $\bar{w}^\top x \ge 0$ for all $x \in \mathcal{X}$. Therefore, $\min_{x \in \mathcal{X}} \bar{w}^\top x = 0$ holds, and its minimizer is $x = 0$. When the hyperplane $\{x : w^\top x = 0\}$ is not a supporting hyperplane to $\mathcal{X}$, the interior of $\mathcal{X}$ meets the hyperplane and $\min_{x \in \mathcal{X}} w^\top x < 0$ holds. Considering that $\min_{x \in \mathcal{X}} w^\top x \le 0$ holds for all w and that this bound is attained at $\bar{w}$, one has that the optimal value of RCM is zero.

In the above, we have proved that the position of the origin relative to $\mathcal{X}$ determines which case (the optimal value of RCM is positive, negative, or zero) occurs. By taking the contrapositive of these statements, we can ensure that the position of the origin relative to $\mathcal{X}$ ($0 \notin \mathcal{X}$, $0 \in \partial\mathcal{X}$, or $0 \in \operatorname{int}\mathcal{X}$) is determined from the optimal value of RCM.

Lemma 2. 
For the uncertainty sets $\mathcal{X}^1$ and $\mathcal{X}^2$ satisfying $\mathcal{X}^1 \subseteq \mathcal{X}^2$, the following inequality holds:

$$\max_{\|w\|=1} \ \min_{x \in \mathcal{X}^1} w^\top x \ \ge \ \max_{\|w\|=1} \ \min_{x \in \mathcal{X}^2} w^\top x.$$

Proof. 
Let $\theta(w; \mathcal{X})$ be the optimal value function of the inner-min in equation 2.1 for $\mathcal{X}$. Let $w_i$ be an optimal solution of $\max_{\|w\|=1} \theta(w; \mathcal{X}^i)$ for i=1, 2. Because $\mathcal{X}^1 \subseteq \mathcal{X}^2$, the inequality $\theta(w; \mathcal{X}^1) \ge \theta(w; \mathcal{X}^2)$ holds for arbitrary w. Hence, we obtain

$$\theta(w_1; \mathcal{X}^1) \ \ge \ \theta(w_2; \mathcal{X}^1) \ \ge \ \theta(w_2; \mathcal{X}^2).$$

Lemma 2 indicates that the optimal value of equation 2.1 is nonincreasing with respect to the inclusion relation of uncertainty sets. Let $\mathcal{X}(c)$ be a parameterized uncertainty set for RCM, equation 2.1, such that $\mathcal{X}(c) \subseteq \mathcal{X}(c')$ holds for $c \le c'$. Figure 2 plots the nonincreasing optimal value of RCM with respect to c. An uncertainty set $\mathcal{X}(\bar{c})$ might exist such that the optimal value of equation 2.1 becomes zero.
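Lemma 2 can be checked on a family for which the inner minimization is analytic. If, hypothetically, the class uncertainty sets are balls of radius c around the class means, their Minkowski difference is a ball of radius 2c around $\mu_+ - \mu_-$, so the optimal value of RCM is $\|\mu_+ - \mu_-\| - 2c$, nonincreasing in c as in Figure 2:

```python
import math

def rcm_value_balls(mu_plus, mu_minus, c):
    # X(c) = ball(mu+, c) ⊖ ball(mu-, c) = ball(mu+ - mu-, 2c); for unit-norm w
    # the inner min is w.(mu+ - mu-) - 2c, maximized by w along mu+ - mu-.
    delta = (mu_plus[0] - mu_minus[0], mu_plus[1] - mu_minus[1])
    return math.hypot(*delta) - 2.0 * c

mu_p, mu_m = (3.0, 4.0), (0.0, 0.0)            # hypothetical means, ||mu+ - mu-|| = 5
vals = [rcm_value_balls(mu_p, mu_m, c) for c in (0.0, 1.0, 2.5, 3.0)]
# positive (disjoint balls), ..., zero (touching at c = 2.5), negative (overlap)
```

The zero crossing at c = 2.5 is the touching configuration discussed around lemma 1.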

Figure 2:

Optimal value of RCM, equation 2.1, with uncertainty set $\mathcal{X}(c)$. We suppose that $\mathcal{X}(c)$ includes one parameter c and that $\mathcal{X}(c) \subseteq \mathcal{X}(c')$ holds for any $c \le c'$.


The following theorem shows that when $0 \notin \mathcal{X}$, the equality constraint $\|w\| = 1$ in equation 2.1 can be replaced by $\|w\| \le 1$ without changing the solution. Moreover, the constraint can be replaced by $\|w\| \ge 1$ when $0 \in \operatorname{int}\mathcal{X}$.

Theorem 1. 
For an uncertainty set $\mathcal{X}$ such that $0 \notin \mathcal{X}$, RCM, equation 2.1, is equivalent to

$$\max_{w} \ \min_{x \in \mathcal{X}} \ w^\top x \quad \text{s.t.} \ \|w\| \le 1. \tag{2.2}$$

Moreover, the problem is equivalent to

$$\min_{x \in \mathcal{X}} \ \|x\|. \tag{2.3}$$

An optimal w of equation 2.2 can be obtained from $w = \bar{x}/\|\bar{x}\|$ by using the optimal solution $\bar{x}$ of equation 2.3. For an uncertainty set $\mathcal{X}$ such that $0 \in \operatorname{int}\mathcal{X}$, RCM, equation 2.1, is equivalent to

$$\max_{w} \ \min_{x \in \mathcal{X}} \ w^\top x \quad \text{s.t.} \ \|w\| \ge 1. \tag{2.4}$$

Moreover, the problem is equivalent to

$$-\min_{x \in \operatorname{cl}(\mathcal{X}^c)} \ \|x\|, \tag{2.5}$$

where $\operatorname{cl}(\mathcal{X}^c)$ is the closure of the complement of the convex set $\mathcal{X}$. An optimal w of equation 2.4 can be obtained from $w = -\bar{x}/\|\bar{x}\|$ by using the optimal solution $\bar{x}$ of equation 2.5.

In the above, “equivalent” denotes that an optimal solution of RCM, equation 2.1, is obtained by solving equation 2.2 or 2.4. The optimal solutions, $\bar{x}$ and w, are illustrated in Figure 1.

Proof. 
Assume $0 \notin \mathcal{X}$. By applying the discussion on the minimum norm duality (Luenberger, 1969) to equation 2.2, we can confirm the equivalence of equation 2.2 and $\min_{x \in \mathcal{X}} \|x\|$ (i.e., equation 2.3) and the form $w = \bar{x}/\|\bar{x}\|$ of the optimal solution. Hence, it is enough to show that there exists an optimal solution $w^\ast$ of equation 2.2 such that $\|w^\ast\| = 1$, because the difference between equations 2.1 and 2.2 is only the norm constraint of w. Lemma 1 ensures that the optimal value of equation 2.2 is positive because

$$\max_{\|w\| \le 1} \ \min_{x \in \mathcal{X}} w^\top x \ \ge \ \max_{\|w\| = 1} \ \min_{x \in \mathcal{X}} w^\top x \ > \ 0.$$

Since the optimal solution $w^\ast$ of equation 2.2 satisfies $\min_{x \in \mathcal{X}} w^{\ast\top} x > 0$, the following inequalities hold:

$$\min_{x \in \mathcal{X}} \Big(\frac{w^\ast}{\|w^\ast\|}\Big)^{\!\top} x \ = \ \frac{1}{\|w^\ast\|} \min_{x \in \mathcal{X}} w^{\ast\top} x \ \ge \ \min_{x \in \mathcal{X}} w^{\ast\top} x \ \ge \ \min_{x \in \mathcal{X}} \Big(\frac{w^\ast}{\|w^\ast\|}\Big)^{\!\top} x.$$

The last inequality comes from the optimality of $w^\ast$. These inequalities imply that $w^\ast/\|w^\ast\|$ is also an optimal solution of equation 2.2 and that it has unit norm.

Next, we consider the case of $0 \in \operatorname{int}\mathcal{X}$. The equivalence of equation 2.4 and the minimum norm to the complement of $\mathcal{X}$ (i.e., equation 2.5) is proved from proposition 3.1 of Briec (1997) under the assumption that the convex set $\mathcal{X}$ has a nonempty interior. To prove the equivalence of equations 2.1 and 2.4, we show that an optimal solution $w^\ast$ of equation 2.4 exists such that $\|w^\ast\| = 1$. Lemma 1 implies that the optimal value of equation 2.4 is negative (i.e., $\min_{x \in \mathcal{X}} w^{\ast\top} x < 0$), because

$$\max_{\|w\| \ge 1} \ \min_{x \in \mathcal{X}} w^\top x \ \le \ \max_{\|w\| = 1} \ \min_{x \in \mathcal{X}} w^\top x \ < \ 0.$$

Hence, one has $\min_{x \in \mathcal{X}} (w^\ast/\|w^\ast\|)^\top x \ge \min_{x \in \mathcal{X}} w^{\ast\top} x$ because of $\|w^\ast\| \ge 1$ and the negativity of the optimal value. This implies that $w^\ast/\|w^\ast\|$ is also an optimal solution of equation 2.4 and that it has unit norm.

For $0 \in \operatorname{int}\mathcal{X}$, RCM, equation 2.1, is essentially a nonconvex problem, and we need to use nonconvex optimization methods to solve it. Section 5.2 describes the optimization algorithm for nonconvex problems of equation 2.1.
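Theorem 1 (the case $0 \notin \mathcal{X}$) can be checked numerically: the value of equation 2.1 should match the minimum-norm value of equation 2.3. A sketch for a hypothetical 2D polytope, using a direction scan for equation 2.1 and Frank-Wolfe for the minimum-norm point (both are approximations):

```python
import math

# Hypothetical vertices of X with the origin outside; the nearest point of
# conv(V) to the origin is (2, 0), so both problems should return the value 2.
V = [(3.0, 0.0), (2.0, -1.0), (3.0, -2.0), (4.0, 1.0), (2.0, 1.0)]

def rcm_value(vertices, n_dirs=3600):
    # Equation 2.1: max over unit w of min over vertices of w.x.
    best = -float("inf")
    for k in range(n_dirs):
        t = 2.0 * math.pi * k / n_dirs
        w = (math.cos(t), math.sin(t))
        best = max(best, min(w[0] * x + w[1] * y for x, y in vertices))
    return best

def min_norm(vertices, iters=2000):
    # Equation 2.3 by Frank-Wolfe on ||x||^2 over conv(vertices): the linear
    # oracle picks the vertex minimizing the inner product with the gradient 2x.
    x = list(vertices[0])
    for k in range(1, iters + 1):
        s = min(vertices, key=lambda p: x[0] * p[0] + x[1] * p[1])
        gamma = 2.0 / (k + 2.0)
        x = [(1 - gamma) * x[0] + gamma * s[0], (1 - gamma) * x[1] + gamma * s[1]]
    return math.hypot(*x)

v_rcm = rcm_value(V)    # value of equation 2.1
v_norm = min_norm(V)    # value of equation 2.3
```

The two values agree up to discretization error, illustrating the minimum norm duality invoked in the proof.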

3.  Equivalence to Existing Classifiers

We will show that RCM can be reduced to the support vector machine (SVM), minimax probability machine (MPM), or Fisher discriminant analysis (FDA) depending on the prescribed uncertainty set $\mathcal{X}$. In Table 1, × means that the corresponding cases never happen, and — means that there is no corresponding existing model as far as we know. The models indicated by — are the targets in this letter. In this section, we denote an optimal solution of equation 2.1 as $w^\ast$. Let $\bar{x}_+$ and $\bar{x}_-$ stand for the optimal solutions of the inner-min in equation 2.1 for $x_+ \in \mathcal{X}_+$ and $x_- \in \mathcal{X}_-$, respectively.

Table 1:
Correspondence with Existing Classifiers.

| Set $\mathcal{X}$ | $0 \in \operatorname{int}\mathcal{X}$ | $0 \in \partial\mathcal{X}$ | $0 \notin \mathcal{X}$ |
| Ellipsoid (equation 3.9) | — | MPM (Lanckriet et al., 2002) | MM-MPM (Nath & Bhattacharyya, 2007) |
| Ellipsoid (equation 3.17) | — | FDA (Fukunaga, 1990) | Sparse feature selection (Bhattacharyya, 2004) |
| Reduced convex hull | E$\nu$-SVM (Perez-Cruz et al., 2003) | — | $\nu$-SVM (Schölkopf et al., 2000) |
| Convex hull | × | × | Hard-margin SVM (Boser et al., 1992) |

3.1.  Hard-Margin SVM, ν-SVM, and Eν-SVM

Whenever a data set is linearly separable, many hyperplanes correctly classify all training samples. Vapnik-Chervonenkis theory (Vapnik, 1998) indicates that a large margin classifier has a small generalization error. Here, the margin of a linear classifier is defined as the signed distance of the closest sample to the decision boundary:

$$\rho(w, b) = \min_{i = 1, \dots, m} \ \frac{y_i (w^\top x_i + b)}{\|w\|}.$$

The problem of maximizing the margin can be transformed into a quadratic programming problem (Boser et al., 1992), and the classification method is called the hard-margin support vector machine (SVM).

Here we define the uncertainty sets as follows:

$$\mathcal{X}_+ = \operatorname{conv}\{x_i : i \in M_+\}, \quad \mathcal{X}_- = \operatorname{conv}\{x_i : i \in M_-\}, \tag{3.1}$$

where conv means the convex hull. The equivalence of hard-margin SVM and RCM, equation 2.1, is obvious for linearly separable samples (i.e., $0 \notin \mathcal{X}$). Indeed, Bennett and Bredensteiner (2000) showed the equivalence of hard-margin SVM and equation 2.3 by using the Wolfe duality. Theorem 1 shows the equivalence of RCM, equation 2.1, and equation 2.3 when $0 \notin \mathcal{X}$. Accordingly, we obtain the following corollary:
Corollary 1. 

Suppose that $0 \notin \mathcal{X}$ holds for uncertainty set 3.1. Then RCM, equation 2.1, with equation 3.1 and the bias term $b = -w^{\ast\top}(\bar{x}_+ + \bar{x}_-)/2$, provides an optimal solution to hard-margin SVM.

Hard-margin SVM has been extended to cope with nonseparable data. C-SVM (Cortes & Vapnik, 1995) and $\nu$-SVM (Schölkopf et al., 2000) are typical examples of soft-margin SVMs. In the standard C-SVM (Cortes & Vapnik, 1995), the parameters w and b in the linear classifier, $f(x) = w^\top x + b$, are estimated by solving the following convex optimization problem

$$\min_{w, b} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \max\{0, \ 1 - y_i(w^\top x_i + b)\}, \tag{3.2}$$

and in $\nu$-SVM (Schölkopf et al., 2000), they are estimated by

$$\min_{w, b, \rho} \ \frac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{m} \sum_{i=1}^{m} \max\{0, \ \rho - y_i(w^\top x_i + b)\}. \tag{3.3}$$

Here, $C > 0$ and $\nu \in (0, 1]$ are regularization parameters.
There is a correspondence between C-SVM and $\nu$-SVM. That is, the classifier estimated by C-SVM with a parameter C can be obtained from $\nu$-SVM with a corresponding parameter $\nu$, and vice versa (Schölkopf et al., 2000; Chang & Lin, 2001). Crisp and Burges (2000) showed $\nu_{\max} = 2\min(m_+, m_-)/m$ and gave a geometric interpretation for $\nu$-SVM. For $\nu > \nu_{\max}$, the optimization problem of $\nu$-SVM is unbounded, and for $\nu \le \nu_{\min}$, $\nu$-SVM provides a trivial solution ($w = 0$ and $b = 0$). Perez-Cruz et al. (2003) devised the extended $\nu$-SVM (E$\nu$-SVM) as a way of avoiding such a trivial solution:

$$\min_{w, b, \rho} \ -\nu\rho + \frac{1}{m} \sum_{i=1}^{m} \max\{0, \ \rho - y_i(w^\top x_i + b)\} \quad \text{s.t.} \ \|w\| = 1. \tag{3.4}$$

By forcing the norm of w to be unity, a nontrivial and meaningful solution is obtained for any $\nu \in (0, \nu_{\max}]$, but this comes at the expense of convexity. It furthermore provides the same solution as $\nu$-SVM for $\nu \in (\nu_{\min}, \nu_{\max}]$. In that sense, E$\nu$-SVM can be regarded as an extension of $\nu$-SVM. It was experimentally found in Perez-Cruz et al. (2003) that the generalization performance of E$\nu$-SVM is often better than that of $\nu$-SVM.
In order to connect (E)$\nu$-SVM with RCM, we need to define the uncertainty sets appropriately. For $\nu \in (0, \nu_{\max}]$, let us define $\mathcal{X}_+(\nu)$ as

$$\mathcal{X}_+(\nu) = \left\{ \sum_{i \in M_+} \lambda_i x_i \ : \ \sum_{i \in M_+} \lambda_i = 1, \ 0 \le \lambda_i \le \frac{2}{\nu m}, \ i \in M_+ \right\}. \tag{3.5}$$

The other uncertainty set $\mathcal{X}_-(\nu)$ is defined in the same way for the negative labels. The set 3.5 is essentially equal to a reduced convex hull or soft convex hull defined in Bennett and Bredensteiner (2000) and Crisp and Burges (2000); that is, $\mathcal{X}_+(\nu)$ is a polytope of the convex hull that has been shrunk toward the centroid of the class. The effect of changing the value of $\nu$ on the size of a reduced convex hull is shown in Mavroforakis and Theodoridis (2006). Note that $\mathcal{X}_+(\nu') \subseteq \mathcal{X}_+(\nu)$ holds for $\nu \le \nu'$. For a linearly nonseparable data set, $\mathcal{X}_+(\nu)$ and $\mathcal{X}_-(\nu)$ intersect when $\nu$ is small. We can also regard $\mathcal{X}_+(\nu)$ as the set of means under arbitrary discrete probability distributions $(\lambda_i)_{i \in M_+}$ satisfying $\lambda_i \le 2/(\nu m)$, $i \in M_+$, in addition to satisfying $\sum_{i \in M_+} \lambda_i = 1$ and $\lambda_i \ge 0$ for $i \in M_+$.
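The inner minimization of RCM over a reduced convex hull has a simple greedy solution because the constraints have a fractional-knapsack structure: sort the projections $w^\top x_i$ and load the maximal weight $2/(\nu m)$ onto the smallest ones until the weights sum to 1. A sketch with hypothetical points:

```python
def reduced_hull_min(points, w, cap):
    """min of w.x over {sum l_i x_i : sum l_i = 1, 0 <= l_i <= cap}
    (equation 3.5 with cap = 2/(nu*m)); requires cap * len(points) >= 1."""
    scores = sorted(w[0] * x + w[1] * y for x, y in points)
    total, value = 0.0, 0.0
    for s in scores:                    # greedily weight the smallest scores
        lam = min(cap, 1.0 - total)
        value += lam * s
        total += lam
        if total >= 1.0:
            break
    return value

pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]   # hypothetical class samples
w = (1.0, 0.0)
v_hull = reduced_hull_min(pts, w, cap=1.0)    # plain convex hull: nearest vertex
v_half = reduced_hull_min(pts, w, cap=0.5)    # average of the two smallest scores
v_mean = reduced_hull_min(pts, w, cap=0.25)   # fully reduced: the class mean
```

Decreasing the cap (i.e., increasing $\nu$) shrinks the hull toward the centroid, which is visible in the three returned values.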

Crisp and Burges (2000) pointed out that $\nu_{\min}$ is the largest $\nu$ such that the two reduced convex hulls $\mathcal{X}_+(\nu)$ and $\mathcal{X}_-(\nu)$ intersect. We can obtain it by the following linear programming (LP) problem,

$$\min_{\lambda, \delta} \ \delta \quad \text{s.t.} \ \sum_{i \in M_+} \lambda_i x_i = \sum_{i \in M_-} \lambda_i x_i, \quad \sum_{i \in M_+} \lambda_i = \sum_{i \in M_-} \lambda_i = 1, \quad 0 \le \lambda_i \le \delta, \tag{3.6}$$

with $\nu_{\min} = 2/(m \delta^\ast)$ for the optimal value $\delta^\ast$; the formulation follows the formulation of the MPM, equation 3.10, shown later. In other words, $\nu_{\min}$ is a value such that $\mathcal{X}_+(\nu_{\min})$ and $\mathcal{X}_-(\nu_{\min})$ touch. The model that finds $\nu_{\min}$ corresponds to the case of $0 \in \partial\mathcal{X}$ in the reduced convex hull row of Table 1.

Barbero, Takeda, and López (2012) transformed $\nu$-SVM, equation 3.3, and E$\nu$-SVM, equation 3.4, into the form of RCM, equation 2.1, in order to give them a geometric interpretation. In this case, the bias term b is defined so that the decision boundary passes through the midpoint of the margin support vectors of each class, which are strictly on the margin (i.e., vectors $x_j$ with $0 < \lambda_j < 2/(\nu m)$ in equation 3.5). We can relate $\nu$-SVM, E$\nu$-SVM, and RCM as follows:

Corollary 2 (Barbero et al., 2012). 

For $\nu \in (\nu_{\min}, \nu_{\max}]$, RCM, equation 2.1, with $\mathcal{X}_\pm = \mathcal{X}_\pm(\nu)$ and the above bias is equivalent to $\nu$-SVM, and for $\nu \in (0, \nu_{\min}]$, it is equivalent to E$\nu$-SVM.

3.2.  Minimax Probability Machine and Its Extension

In the minimax probability machine (MPM; Lanckriet et al., 2002), only the mean and covariance matrix of each class are used for classification tasks. Suppose that $x_+$ (or $x_-$) is a d-dimensional random vector with mean $\mu_+$ (or $\mu_-$) and covariance matrix $\Sigma_+$ (or $\Sigma_-$). We assume that $\mu_+ \ne \mu_-$ holds. The MPM controls the misclassification probabilities under the worst-case setting. The linear classifier is estimated by solving the following problem,

$$\max_{\alpha, w \ne 0, b} \ \alpha \quad \text{s.t.} \ \inf_{x_+ \sim (\mu_+, \Sigma_+)} \Pr\{w^\top x_+ + b \ge 0\} \ge \alpha, \quad \inf_{x_- \sim (\mu_-, \Sigma_-)} \Pr\{w^\top x_- + b \le 0\} \ge \alpha, \tag{3.7}$$

where the notation $x_+ \sim (\mu_+, \Sigma_+)$ refers to the class of distributions that have the prescribed mean $\mu_+$ and covariance $\Sigma_+$ but are otherwise arbitrary; the same holds for $x_-$. That is, the classifier minimizes the worst-case misclassification probability. In practice, the mean vectors and covariance matrices of each class are estimated from the training samples. MPM's appealing feature is that the worst-case bound of the classification error rate can be explicitly estimated.
As Lanckriet et al. (2002) showed, by applying a multivariate generalization of the Chebyshev-Cantelli inequality to the constraints of equation 3.7, we have the equivalent constraints $w^\top \mu_+ + b \ge \kappa(\alpha)\sqrt{w^\top \Sigma_+ w}$ and $-(w^\top \mu_- + b) \ge \kappa(\alpha)\sqrt{w^\top \Sigma_- w}$, where $\kappa(\alpha) = \sqrt{\alpha/(1-\alpha)}$. Hence, problem 3.7 can be represented as

$$\min_{w} \ \sqrt{w^\top \Sigma_+ w} + \sqrt{w^\top \Sigma_- w} \quad \text{s.t.} \ w^\top(\mu_+ - \mu_-) = 1. \tag{3.8}$$

The optimal solution $w^\ast$ of equation 3.8 is also optimal to 3.7, and the other optimal solutions $\alpha^\ast$ and $b^\ast$ of equation 3.7 can be calculated from $w^\ast$ and the optimal value of equation 3.8. Problem 3.8 is a convex optimization problem known as a second-order cone program.
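For isotropic covariances, equation 3.8 can be solved by a one-dimensional scan over the feasible line $\{w : w^\top(\mu_+ - \mu_-) = 1\}$, and the worst-case accuracy recovered through $\kappa(\alpha) = \sqrt{\alpha/(1-\alpha)}$. A sketch with hypothetical moments; with $\Sigma_\pm = \sigma_\pm^2 I$ the objective is simply $(\sigma_+ + \sigma_-)\|w\|$:

```python
import math

mu_p, mu_m = (4.0, 0.0), (0.0, 0.0)     # hypothetical class means
s_p, s_m = 1.0, 1.0                     # sigma+ and sigma- for Sigma = sigma^2 I

delta = (mu_p[0] - mu_m[0], mu_p[1] - mu_m[1])
d2 = delta[0] ** 2 + delta[1] ** 2
w0 = (delta[0] / d2, delta[1] / d2)     # particular solution of w.delta = 1
perp = (-delta[1], delta[0])            # direction along the feasible line

def objective(t):
    # sqrt(w'S+w) + sqrt(w'S-w) = (s_p + s_m) * ||w|| for isotropic covariances.
    w = (w0[0] + t * perp[0], w0[1] + t * perp[1])
    return (s_p + s_m) * math.hypot(*w)

best = min((objective(k / 1000.0 - 0.5), k / 1000.0 - 0.5) for k in range(1001))
kappa = 1.0 / best[0]                   # optimal value of the MPM dual
alpha = kappa ** 2 / (1.0 + kappa ** 2) # worst-case classification accuracy bound
```

As expected, the minimum lies at t = 0, that is, at $w \propto \mu_+ - \mu_-$; anisotropic covariances would shift it, which is the point of solving the cone program in general.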
Problem 3.8 has an interesting geometric interpretation that can be obtained via convex duality (Lanckriet et al., 2002). Let $\mathcal{X}_+(\kappa)$ and $\mathcal{X}_-(\kappa)$ be

$$\mathcal{X}_+(\kappa) = \{\mu_+ + \kappa\,\Sigma_+^{1/2} u : \|u\| \le 1\}, \quad \mathcal{X}_-(\kappa) = \{\mu_- + \kappa\,\Sigma_-^{1/2} u : \|u\| \le 1\}. \tag{3.9}$$

Then, the dual form of equation 3.8 leads to the problem,

$$\min_{\kappa \ge 0} \ \kappa \quad \text{s.t.} \ \mathcal{X}_+(\kappa) \cap \mathcal{X}_-(\kappa) \ne \emptyset; \tag{3.10}$$

that is, finding the smallest positive $\kappa$ such that the two ellipsoids intersect. This problem corresponds to the case of $0 \in \partial\mathcal{X}$ for the ellipsoidal uncertainty sets in Table 1.
The idea of MPM is combined with the idea of margin maximization in Nath and Bhattacharyya (2007). Given acceptable false-positive and false-negative rates, $\eta_+$ and $\eta_-$, the linear classifier can be estimated by solving

$$\min_{w, b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \ \inf_{x_+ \sim (\mu_+, \Sigma_+)} \Pr\{w^\top x_+ + b \ge 1\} \ge 1 - \eta_+, \quad \inf_{x_- \sim (\mu_-, \Sigma_-)} \Pr\{-(w^\top x_- + b) \ge 1\} \ge 1 - \eta_-. \tag{3.11}$$

In this letter, we call this model the margin-maximized MPM (MM-MPM). In the same way as in MPM, equation 3.11 can be transformed into

$$\min_{w, b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \ w^\top \mu_+ + b \ge 1 + \kappa_+\sqrt{w^\top \Sigma_+ w}, \quad -(w^\top \mu_- + b) \ge 1 + \kappa_-\sqrt{w^\top \Sigma_- w}, \tag{3.12}$$

with $\kappa_+ = \sqrt{(1-\eta_+)/\eta_+}$ and $\kappa_- = \sqrt{(1-\eta_-)/\eta_-}$. As Nath and Bhattacharyya (2007) showed, the dual form is

$$\min_{x_+, x_-} \ \|x_+ - x_-\| \quad \text{s.t.} \ x_+ \in \mathcal{X}_+(\kappa_+), \ x_- \in \mathcal{X}_-(\kappa_-). \tag{3.13}$$

This is a problem of finding the minimum distance between two ellipsoids. We define $\bar\kappa_+$ and $\bar\kappa_-$ as constants such that $\mathcal{X}_+(\bar\kappa_+)$ and $\mathcal{X}_-(\bar\kappa_-)$ touch. When $\kappa_+ < \bar\kappa_+$ and $\kappa_- < \bar\kappa_-$, equation 3.13 gives us a positive distance.

Corollary 3 relates MM-MPM and RCM.

Corollary 3. 
Let $\kappa_+$ and $\kappa_-$ be specified positive constants. RCM, equation 2.1, with $\mathcal{X}_+ = \mathcal{X}_+(\kappa_+)$, $\mathcal{X}_- = \mathcal{X}_-(\kappa_-)$, and the bias $b = -w^{\ast\top}(\bar{x}_+ + \bar{x}_-)/2$ can be transformed into

$$\max_{w} \ w^\top(\mu_+ - \mu_-) - \kappa_+\sqrt{w^\top \Sigma_+ w} - \kappa_-\sqrt{w^\top \Sigma_- w} \quad \text{s.t.} \ \|w\| = 1. \tag{3.14}$$

Furthermore, for $\kappa_+ < \bar\kappa_+$ and $\kappa_- < \bar\kappa_-$, equation 3.14 is equivalent to MM-MPM, equation 3.12.
Proof. 
The inner-min problem of equation 2.1 can be solved analytically:

$$\min_{x \in \mathcal{X}} w^\top x = w^\top(\mu_+ - \mu_-) - \kappa_+\sqrt{w^\top \Sigma_+ w} - \kappa_-\sqrt{w^\top \Sigma_- w},$$

where $\bar{x}_+ = \mu_+ - \kappa_+ \Sigma_+ w / \sqrt{w^\top \Sigma_+ w}$ and $\bar{x}_- = \mu_- + \kappa_- \Sigma_- w / \sqrt{w^\top \Sigma_- w}$ are used in the above min and max, respectively. Maximizing the above objective function subject to the constraint $\|w\| = 1$ is nothing but the optimization problem, equation 3.14. That equations 3.12 and 3.14 are equivalent can be proved by comparing the dual, equation 2.3, of equation 3.14 with the dual, equation 3.13, of equation 3.12.
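The analytic inner minimum used in the proof, $w^\top \mu - \kappa\sqrt{w^\top \Sigma w}$ for a single ellipsoid, can be sanity-checked by sampling boundary points of $\{\mu + \kappa\,\Sigma^{1/2} u : \|u\| \le 1\}$. The matrix A below plays the role of $\Sigma^{1/2}$ (so $\Sigma = A A^\top$); all numbers are hypothetical:

```python
import math
import random

random.seed(0)
mu = (1.0, 2.0)
A = ((1.0, 0.5), (0.0, 0.8))     # stands in for Sigma^{1/2}; Sigma = A A^T
kappa = 1.5
w = (0.6, 0.8)                   # a unit-norm direction

# Analytic value: w.mu - kappa * sqrt(w' Sigma w) = w.mu - kappa * ||A^T w||.
At_w = (A[0][0] * w[0] + A[1][0] * w[1], A[0][1] * w[0] + A[1][1] * w[1])
analytic = w[0] * mu[0] + w[1] * mu[1] - kappa * math.hypot(*At_w)

# Sampled value: min of w.x over random boundary points x = mu + kappa * A u.
sampled = analytic + 1.0
for _ in range(20000):
    t = random.uniform(0.0, 2.0 * math.pi)
    u = (math.cos(t), math.sin(t))
    x = (mu[0] + kappa * (A[0][0] * u[0] + A[0][1] * u[1]),
         mu[1] + kappa * (A[1][0] * u[0] + A[1][1] * u[1]))
    sampled = min(sampled, w[0] * x[0] + w[1] * x[1])
```

The sampled minimum can only approach the analytic value from above, since every sampled point lies in the ellipsoid.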

Besides the equivalence of equation 3.14 and MM-MPM, equation 3.12, we also can show that RCM, equation 3.14, with $\kappa_+ = \kappa_- = \kappa^\ast$ coincides with MPM, equation 3.8. Here, $\kappa^\ast$ is such that the optimal value of equation 3.14 with $\kappa_+ = \kappa_- = \kappa^\ast$ is zero. In addition, $\kappa^\ast$ is the optimal value of the MPM dual, equation 3.10.

The two parameters $\kappa_+$ and $\kappa_-$ of equation 3.14 make it possible to handle unbalanced data. However, finding the convexity thresholds $\bar\kappa_+$ and $\bar\kappa_-$ is more difficult than obtaining the single value $\kappa^\ast$.

3.3.  Fisher Discriminant Analysis and Its Extension

In Fisher discriminant analysis (FDA), as in MPM, equation 3.7, a discriminant hyperplane is computed from the means and covariances of the random vectors $x_+$ and $x_-$. The hyperplane is determined from the optimal solution to the following optimization problem (Fukunaga, 1990):

$$\max_{w \ne 0} \ \frac{\big(w^\top(\mu_+ - \mu_-)\big)^2}{w^\top \Sigma_+ w + w^\top \Sigma_- w}. \tag{3.15}$$

The intuition behind the problem is to find a direction that maximizes the distance between the projected class means (the numerator) while minimizing the class variances in this direction (the denominator). When FDA is applied to classification problems, the bias term b is determined by assuming a probabilistic model, such as the normal distribution, for the observed samples. For $\Sigma_+ + \Sigma_- \succ 0$, an optimal solution of equation 3.15 can be found by solving the linear system of equations, $(\Sigma_+ + \Sigma_-) w = \lambda (\mu_+ - \mu_-)$ and $w^\top(\mu_+ - \mu_-) = 1$, with respect to the variable w and the Lagrange multiplier $\lambda$.
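For a positive-definite $\Sigma_+ + \Sigma_-$, the FDA direction is $w \propto (\Sigma_+ + \Sigma_-)^{-1}(\mu_+ - \mu_-)$, which solves the linear system above. A 2D sketch with hypothetical moments, checking the Rayleigh-type objective of equation 3.15 against random directions:

```python
import math
import random

def fda_direction(mu_p, mu_m, S):
    # w = S^{-1}(mu+ - mu-) with S = Sigma+ + Sigma-, via the 2x2 inverse.
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    d = (mu_p[0] - mu_m[0], mu_p[1] - mu_m[1])
    return ((S[1][1] * d[0] - S[0][1] * d[1]) / det,
            (-S[1][0] * d[0] + S[0][0] * d[1]) / det)

def objective(w, mu_p, mu_m, S):
    # Equation 3.15: (w.(mu+ - mu-))^2 / (w' S w); scale-invariant in w.
    num = (w[0] * (mu_p[0] - mu_m[0]) + w[1] * (mu_p[1] - mu_m[1])) ** 2
    den = (w[0] * (S[0][0] * w[0] + S[0][1] * w[1])
           + w[1] * (S[1][0] * w[0] + S[1][1] * w[1]))
    return num / den

random.seed(1)
mu_p, mu_m = (2.0, 1.0), (0.0, 0.0)      # hypothetical means
S = ((2.0, 0.3), (0.3, 1.0))             # hypothetical Sigma+ + Sigma- (pos. def.)
w_star = fda_direction(mu_p, mu_m, S)
f_star = objective(w_star, mu_p, mu_m, S)
f_rand = max(objective((math.cos(t), math.sin(t)), mu_p, mu_m, S)
             for t in (random.uniform(0.0, 2.0 * math.pi) for _ in range(200)))
```

No random direction should beat the closed-form direction, since the maximum of the generalized Rayleigh quotient is attained at $S^{-1}(\mu_+ - \mu_-)$.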
As with MPM, FDA has a probabilistic interpretation under the worst-case scenario. Bhattacharyya (2004) shows that the following problem is equivalent to equation 3.15:

$$\max_{\alpha, w \ne 0} \ \alpha \quad \text{s.t.} \ \inf_{x \sim (\mu_+ - \mu_-, \ \Sigma_+ + \Sigma_-)} \Pr\{w^\top x \ge 0\} \ge \alpha. \tag{3.16}$$

The above constraint is expressed as a convex inequality, $w^\top(\mu_+ - \mu_-) \ge \kappa(\alpha)\sqrt{w^\top(\Sigma_+ + \Sigma_-) w}$, where $\kappa(\alpha) = \sqrt{\alpha/(1-\alpha)}$. Using the ellipsoidal uncertainty set $\mathcal{X}(\kappa)$ defined by

$$\mathcal{X}(\kappa) = \{\mu_+ - \mu_- + \kappa\,(\Sigma_+ + \Sigma_-)^{1/2} u : \|u\| \le 1\}, \tag{3.17}$$

equation 3.16 can be represented as

$$\min_{\kappa \ge 0} \ \kappa \quad \text{s.t.} \ 0 \in \mathcal{X}(\kappa). \tag{3.18}$$

That is, one finds the smallest positive $\kappa$ such that the ellipsoid includes the null vector.

FDA can be extended to RCM, equation 2.1, with the uncertainty set $\mathcal{X} = \mathcal{X}(\kappa)$ for a prescribed parameter $\kappa$. Let $\kappa^\ast$ be the optimal value of equation 3.18. Then, along the same lines as the MPM in section 3.2, we find that RCM with $\kappa = \kappa^\ast$ is equivalent to FDA:

Corollary 4. 
Let $\kappa$ be a chosen positive constant. RCM, equation 2.1, with the uncertainty set $\mathcal{X}(\kappa)$ is transformed into

$$\max_{w} \ w^\top(\mu_+ - \mu_-) - \kappa\sqrt{w^\top(\Sigma_+ + \Sigma_-) w} \quad \text{s.t.} \ \|w\| = 1. \tag{3.19}$$

Especially for $\kappa < \kappa^\ast$, the norm constraint in equation 3.19 is replaced with the convex constraint $\|w\| \le 1$ without changing the optimal solution, and equation 3.19 is reduced to maximizing $w^\top(\mu_+ - \mu_-) - \kappa\sqrt{w^\top(\Sigma_+ + \Sigma_-) w}$ subject to $\|w\| \le 1$.
We can prove corollary 4 similarly to corollary 3. Corollary 4 provides an FDA counterpart of MM-MPM, that is, the estimator found by solving

$$\min_{w} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \ w^\top(\mu_+ - \mu_-) \ge 1 + \kappa\sqrt{w^\top(\Sigma_+ + \Sigma_-) w}.$$

Here, MM-FDA refers to this estimator. Replacing the Euclidean norm $\|w\|$ with the L1-norm $\|w\|_1$ makes MM-FDA equivalent to a sparse feature selection model based on FDA (Bhattacharyya, 2004), as shown in Table 1.

RCM with $\mathcal{X} = \mathcal{X}(\kappa)$ is nonconvex for $\kappa > \kappa^\ast$. Section 5.2 describes a learning algorithm for nonconvex objectives.

4.  Statistical Interpretation for Robust Classification Models

We can give a statistical interpretation for RCM on the basis of statistical learning theory. We start by introducing a loss function $\ell$ that defines the loss of the decision function $f$ regarding the sample (x, y) as $\ell(y f(x))$. A convex and decreasing function is usually used as the loss function $\ell$. Some examples of loss functions are as follows.

Example 1: Loss functions. 

To evaluate the classification error rate, we can use the 0-1 loss: $\ell(z) = I[z \le 0]$ with $z = y f(x)$, where I is an indicator function; that is, I[P]=1 holds if the predicate P is true and I[P]=0 otherwise. The hinge loss is defined by $\ell(z) = \max\{0, 1 - z\}$; it is used in the soft margin SVM (Cortes & Vapnik, 1995). The truncated quadratic loss is defined as $\ell(z) = \max\{0, 1 - z\}^2$, and it is used in the 2-norm soft-margin SVM (Schölkopf & Smola, 2002). The logistic loss is defined by $\ell(z) = \log(1 + e^{-z})$, and it is used in the maximum likelihood estimator or the logitboost algorithm (Friedman, Hastie, & Tibshirani, 2000). The quadratic loss is defined by $\ell(z) = (1 - z)^2$. All loss functions except the 0-1 loss are convex, and all but the quadratic loss are nonincreasing functions.
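The losses of example 1, written directly as functions of the margin $z = y f(x)$ (the $z \le 0$ convention for the 0-1 loss is one common choice):

```python
import math

# Loss functions of example 1 as functions of the margin z = y * f(x).
def zero_one(z):   return 1.0 if z <= 0 else 0.0        # classification error
def hinge(z):      return max(0.0, 1.0 - z)             # soft-margin SVM
def trunc_quad(z): return max(0.0, 1.0 - z) ** 2        # 2-norm soft-margin SVM
def logistic(z):   return math.log(1.0 + math.exp(-z))  # logistic regression / logitboost
def quadratic(z):  return (1.0 - z) ** 2                # least squares

# hinge upper-bounds the 0-1 loss; hinge, trunc_quad, and logistic are
# nonincreasing in z, while quadratic is not (it grows again for z > 1).
margins = [-2.0, -0.5, 0.0, 0.5, 2.0]
```

A nonincreasing surrogate never penalizes a sample for being classified with a larger margin, which is why the quadratic loss is the odd one out in the list.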

A goal of the classification task is to obtain an accurate classifier. For this purpose, one often minimizes the expected loss,
formula
where is the expectation of the loss function with respect to the probability distribution of the samples. Let us define p(x|y), the conditional probability density of x, given the binary label y and and , as the marginal probabilities of the positive and negative labels, respectively. The expectation of the loss function is
\[ \mathbb{E}\bigl[\ell\bigl(y\,(w^{\top}x+b)\bigr)\bigr] \;=\; p_{+}\!\int \ell\bigl(w^{\top}x+b\bigr)\,p(x\,|\,{+1})\,dx \;+\; p_{-}\!\int \ell\bigl(-w^{\top}x-b\bigr)\,p(x\,|\,{-1})\,dx. \tag{4.1} \]

Since the true probability distribution is unknown, we cannot minimize the expected loss directly.

Suppose that the training samples are independently and identically distributed from the probability distribution. Then the expected loss function, equation 4.1, can be approximated by the empirical loss function,
\[ \frac{1}{m}\sum_{i=1}^{m} \ell\bigl(y_i\,(w^{\top}x_i+b)\bigr). \tag{4.2} \]
Indeed, in the large sample limit, the empirical loss, equation 4.2, converges in probability to its expectation, equation 4.1, by the law of large numbers. Hence, by minimizing the empirical loss with respect to the parameters w and b, one can estimate the decision function. This strategy is known as the empirical risk minimization (ERM) inductive principle (Vapnik, 1998). Under mild assumptions, the estimator minimizing the empirical loss, equation 4.2, converges in probability to the optimal solution of the expected loss, equation 4.1.
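As an illustration of the ERM principle, the following sketch minimizes the empirical logistic loss of a linear classifier by plain gradient descent; the synthetic two-Gaussian data, step size, and iteration count are our own assumptions, not taken from the letter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary data: two Gaussian classes with means +-1 in each coordinate.
X = np.vstack([rng.normal(+1.0, 1.0, size=(50, 2)),
               rng.normal(-1.0, 1.0, size=(50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])

def empirical_loss(w, b):
    # Empirical logistic loss (equation 4.2 with the logistic loss).
    return np.mean(np.log1p(np.exp(-y * (X @ w + b))))

# Gradient descent on the convex empirical loss in (w, b).
w, b = np.zeros(2), 0.0
for _ in range(500):
    z = y * (X @ w + b)
    g = -y / (1.0 + np.exp(z))      # derivative of the loss w.r.t. w^T x + b
    w -= 0.1 * (X.T @ g) / len(y)
    b -= 0.1 * np.mean(g)
```

Since the objective is convex in (w, b), this simple first-order scheme steadily decreases the empirical loss toward its minimum.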

Now let us consider the ambiguity of the probability distribution p(x|y). We can use the minimax decision rule for the uncertainty of p(x|y); that is, we can consider the worst-case minimization of the expected loss by taking the uncertainty of p(x|y) into account. The worst-case minimization problem is difficult to solve. Therefore, we propose to solve RCM, equation 2.1, since we can prove that RCM is a good approximation for minimizing the worst-case expected loss in the sense that it minimizes the upper and lower bounds of the worst-case expected loss at the same time.

Situations in which the probability distribution is ambiguous frequently occur in practice. Under the ERM setup, the probability distributions of the training and test samples are assumed to be exactly the same. In real-world data, however, the probability distribution of the test samples may differ somewhat from that of the training samples. Such a situation is referred to as dataset shift, and several approaches to learning problems affected by dataset shift have been investigated (Quiñonero-Candela, Sugiyama, Schwaighofer, & Lawrence, 2008). Here, we use robust optimization to deal with it. In what follows, we show that minimizing the expected loss under an ambiguous probability distribution leads to RCM with a corresponding uncertainty set.

4.1.  Upper and Lower Bounds of Expected Loss Functions

We shall derive upper and lower bounds of the expected loss function and discuss the relationship between these bounds and RCM. The bounds are expressed in the following theorem:

Theorem 2. 
We assume that (i) the loss function $\ell$ is convex, decreasing, and twice differentiable and that (ii) its second derivative is uniformly bounded for all z. Define
formula
where $\bar{x}_{+}$ and $\bar{x}_{-}$ are the means of the input vector x under the conditional probabilities p(x|+1) and p(x|−1), respectively, that is, $\bar{x}_{\pm}=\int x\,p(x\,|\,{\pm 1})\,dx$. In addition, we assume that the domain of the input vector x under the conditional probabilities is bounded. Then the following inequality holds for the weight vector w:
formula
Proof. 
The convexity of leads to the lower bound of the expected loss,
formula
The Taylor expansion of around yields
formula
where the second-order term is evaluated at some intermediate real number. Thus, we find that
formula
Similarly, we have
formula
As a result, we obtain the upper bound of the expected loss,
formula

The logistic loss and the truncated quadratic loss in example 1 satisfy the assumptions of theorem 2. According to theorem 2, minimizing this function is closely related to minimizing the expected loss.

4.2.  Worst-Case Expected Loss Minimization

Let us now consider the worst-case minimization of the expected loss. Let two sets of probability densities express the uncertainty of the conditional probabilities p(x|+1) and p(x|−1), respectively. Here, we assume that every probability distribution in these sets has a mean vector. Let us consider the following problem:
formula
4.3

This is the worst-case optimization of the weight vector w when the probability is uncertain; the bias term b is supposed to be optimally chosen. Note that RCM focuses on the weight vector w and that the bias term can be optimally chosen once the optimal solution is obtained. Hence, we can find a natural correspondence between equation 4.3 and RCM.

Suppose that the uncertainty sets of probability densities are both convex; that is, any mixture of two probability densities in a set also lies in that set. Then the sets of mean vectors defined by
formula
4.4
are convex. Theorem 2 indicates that the optimal solution of equation 4.3 is approximately the optimal solution of
formula
4.5

The following theorem connects RCM with problem 4.5:

Theorem 3. 

Suppose that $\ell$ is a nonincreasing function. An optimal solution of the RCM with the uncertainty sets in equation 4.4 is also optimal for problem 4.5.

Proof. 
For a fixed w and , one has
formula
The right-hand side of this expression is nonincreasing in its argument, since the objective in the middle expression is nonincreasing. Thus, there exists a nonincreasing function for which the equality holds. Examples 2 to 4 include explicit expressions for this function. Hence, one has
formula
As a result, the optimal solution of the RCM is also optimal for problem 4.5.

Theorem 2 places certain assumptions on the loss function in order to derive the bounds of the expected loss. In contrast, the only assumption needed in theorem 3 is the monotonicity of $\ell$. Hence, for general loss functions, problem 4.5 is equivalent to RCM.

Finally, we shall show that RCM, equation 2.2, is a good approximation for minimizing the worst-case expected loss.

Remark 1. 
We assume that all assumptions in theorem 2 are satisfied. From theorem 2, we obtain
formula
Using theorem 3, we can rewrite the above inequalities as
formula
which implies that the classifier of RCM, equation 2.1, is a good approximation of that of the worst-case minimization. The max-min problem of this function can be solved efficiently by the algorithm in section 5.2, whereas solving the min-max problem of the expected loss function or the empirical loss function directly would not be easy.

RCM, equation 2.1, corresponding to the worst case, has uncertainty sets defined by equation 4.4. The definition suggests that the uncertainty set can be interpreted as the uncertainty of the mean of the input vector for each class. Even if the uncertainty sets of the probability distributions have infinitely many parameters to specify each probability, the uncertainty set of the mean vector is defined in a finite-dimensional space. This can be regarded as a framework for semiparametric inference (Barnett, Powell, & Tauchen, 1991), and it can accommodate various descriptions of the uncertainty in the probability distribution. This interpretation is helpful in designing the uncertainty set.

There are various ways to estimate the bias term b. The simplest way is to use the hyperplane passing through the midpoint of the two worst-case mean vectors of the optimal solution of RCM; the two classes are then separated by this hyperplane. Another approach is to estimate b as the minimizer of the worst-case loss; indeed, in examples 2 to 4, b is computed in this way for several loss functions $\ell$. Yet another promising method is to construct an appropriate statistical model for the projected samples $w^{\top}x_i$. The projected samples are scattered in a one-dimensional space, and hence b can be easily estimated on the basis of the statistical model. We can also use the empirical loss function, equation 4.2, for computing an optimal b, such as
\[ \min_{b}\; \frac{1}{m}\sum_{i=1}^{m} \ell\bigl(y_i\,(\hat{w}^{\top}x_i+b)\bigr). \tag{4.6} \]
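The one-dimensional minimization over b can sometimes be done exactly. For instance, with the hinge loss the empirical loss is piecewise linear and convex in b, so a minimizer lies at a breakpoint b = y_i − wᵀx_i (this observation and the code are our own illustration, not the letter's):

```python
import numpy as np

def estimate_bias(w, X, y):
    """Minimize the empirical hinge loss over b for a fixed weight vector w
    (equation 4.6 with the hinge loss).  The loss is piecewise linear and
    convex in b, so some breakpoint b = y_i - w^T x_i attains the minimum."""
    s = X @ w                        # projected samples w^T x_i
    candidates = y - s               # breakpoints where 1 - y_i (s_i + b) = 0
    losses = [np.mean(np.maximum(0.0, 1.0 - y * (s + b))) for b in candidates]
    return candidates[int(np.argmin(losses))]

# Two well-separated 1-D samples: any b in [-3, -1] gives zero loss.
X = np.array([[0.0], [4.0]])
y = np.array([-1.0, 1.0])
b = estimate_bias(np.array([1.0]), X, y)
```

With both classes present, the empirical hinge loss is coercive in b, so scanning the finitely many breakpoints suffices.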

The following examples give explicit expressions for the function used in the proof of theorem 3.

Example 2: Logistic loss. 
The logistic loss is defined as $\ell(z)=\log(1+e^{-z})$, which is convex and decreasing. Its second derivative is upper-bounded by 1/4. Hence, the assumptions placed on the loss function in theorem 2 are satisfied. The optimization problem, equation 4.5, becomes
formula
since the second expression is decreasing in its argument. Hence, RCM provides an optimal solution to the above problem and, moreover, gives lower and upper bounds for the worst-case expected loss minimization, equation 4.3.
Example 3: Hinge loss. 
The hinge loss is defined as $\ell(z)=\max\{0,\,1-z\}$, which is convex and decreasing but not differentiable. Hence, the assumptions on the loss function in theorem 2 are not satisfied. The optimization problem, equation 4.5, is
formula
This function is nonincreasing in its argument. Hence, RCM provides an optimal solution to the above problem.
Example 4: Quadratic loss. 
Let the quadratic loss function be $\ell(z)=(1-z)^{2}$. Note that the quadratic loss function is not decreasing, and hence the assumption in theorem 3 is not satisfied. The optimization problem, equation 4.5, is
formula
In this case, the above problem is not equivalent to RCM.

5.  Nonconvex Robust Classification Models

Here, we deal with the nonconvex RCM, that is, equation 2.4. As Table 1 shows, the nonconvex RCM with an uncertainty set of reduced convex hulls corresponds to Eν-SVM (Perez-Cruz et al., 2003). We can also consider nonconvex variants of MPM or FDA, as shown in equation 3.14 or 3.19. As Perez-Cruz et al. (2003) noted, better classifiers can be obtained by extending our model to cover both convex and nonconvex RCMs. This section is devoted to providing a new interpretation of nonconvex RCM in terms of regularization (see section 5.1) and describing a nonconvex optimization algorithm (see section 5.2).

5.1.  Regularization in Nonconvex Robust Classification Models

Section 3 illustrated the relation between RCM and some existing learning methods. In particular, we showed that an RCM whose uncertainty set is the reduced convex hull of the data set is closely related to (E)ν-SVM, equation 3.4. In this section, we focus on the relation between RCM and C-SVM, equation 3.2, and show that the nonconvex RCM is closely related to negative regularization.

We shall deal with a problem essentially equivalent to C-SVM that has a norm regularization constraint instead of a regularization term in the objective function:
formula
5.1
for some R. The intensity of the regularization is adjusted by varying R. Using the Lagrangian function of equation 5.1,
formula
we obtain the dual of equation 5.1 as
formula
The Lagrange multiplier of the norm constraint corresponds to the regularization parameter of C-SVM, equation 3.2, implying that equation 5.1 can be identified with C-SVM. Note that several authors have shown the equivalence of C-SVM and ν-SVM (Chang & Lin, 2001; Schölkopf et al., 2000), and corollary 2 means that ν-SVM can be described as a convex RCM, equation 2.2, with the uncertainty set of equation 3.5. Hence, equation 5.1 is equivalent to the convex RCM, equation 2.2, with the uncertainty set of equation 3.5 for appropriate choices of the regularization parameters.
Now we study C-SVM with negative regularization:
formula
5.2
where the constant s in the constraint takes the value 1, 0, or −1 and determines the penalty of misclassification. When s=0 or s=−1, the penalty on misclassified data is smaller than when s=1.

Instead of imposing the norm constraint, we can replace the objective function of equation 5.2 by the loss minus an increasing, nonnegative function of the norm of w. Suppose an optimal solution exists. Then the optimal solution of this objective function subject to the remaining constraints satisfies the Karush-Kuhn-Tucker optimality conditions of equation 5.2. Hence, problem 5.2 is equivalent to minimizing the loss function with a negative regularization parameter.

The next theorem shows the relation between the negative regularized C-SVM, equation 5.2, and nonconvex RCM:

Theorem 4. 
Let be an optimal solution of the nonconvex RCM, equation 2.4, with an uncertainty set of the reduced convex hull, equation 3.5, such that . Let be the estimator of the bias term as in corollary 2. Let be an optimal solution of
formula
5.3
and be
formula
Then the linear classifier of the nonconvex RCM is the same as that of C-SVM with the negative regularization, equation 5.2, for appropriate choices of s and the regularization parameter.

Note that equation 5.3 corresponds to Eν-SVM, equation 3.4, with some variables fixed. Equation 5.3 has only one decision variable and is thus easy to solve.

Proof. 
Let an optimal solution of the nonconvex RCM with the uncertainty set, equation 3.5, and the corresponding estimated bias term be given. We can rewrite Eν-SVM, equation 3.4, as
formula
5.4
Let us define the Lagrangian function of the inner-min problem in equation 5.4 as
formula
which is convex in the primal variables and concave in the multipliers. Problem 5.4 can be rewritten as
formula
Hence, problem 5.4 is equivalent to the nonconvex RCM, equation 2.4. In addition, it is straightforward to see that parameters and of equation 5.3 are optimal for equation 5.4. Therefore, including , , is optimal for equation 5.4. Let us consider equation 5.4 with . If , the set of parameters is also an optimal solution of
formula
For , the set of parameters is optimal for the problem
formula
Therefore, problem 5.4, which is equivalent to the nonconvex RCM, equation 2.4, results in equation 5.2.

In the standard ν-SVM, a small ν corresponds to weak regularization. In the current setup, the regularization parameter in theorem 4 tends to be small when ν is small. Moreover, when it is positive and small, the norm of the optimal solution must be large. The effect is the same as having a negative regularization parameter. Now suppose that the parameter is negative and has a small absolute value. In this case, the negative margin increases the complexity of the model to fit the training data. The definition of the misclassification penalty in equation 5.2 implies that even if misclassification occurs, the penalty may still be zero; penalties are imposed only on sufficiently large misclassifications. Such a penalty does not yield a consistent estimator of the decision function (Bartlett, Jordan, & McAuliffe, 2006). However, it may work well in uncertain situations such as dataset shift, because a "soft" penalty allows small differences in the decision boundary between the training phase and the test phase.
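The "soft" penalty can be made concrete. Assuming the penalty takes the shifted-hinge form max{0, s − z} on the margin z = y(wᵀx + b), which is our reading of the constraint in equation 5.2, a slightly misclassified sample is penalized under s = 1 but not under s = −1:

```python
def shifted_penalty(z, s):
    # Penalty on the margin z = y * (w^T x + b), with shift s in {1, 0, -1}.
    return max(0.0, s - z)

# Margin z = -0.3: misclassified, but only slightly.
penalties = {s: shifted_penalty(-0.3, s) for s in (1, 0, -1)}
```

With s = 1 the penalty is 1.3; with s = 0 it shrinks to 0.3; with s = −1 it vanishes, so only misclassifications with z < −1 are penalized.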

5.2.  General Nonconvex Optimization Algorithm

RCM, equation 2.1, is an essentially nonconvex problem that includes a nonconvex constraint when the intersection of the two uncertainty sets is not empty. In this section, we propose a solution method generalized from the local optimum search algorithms of Perez-Cruz et al. (2003) and Takeda and Sugiyama (2008).

Suppose that we solve RCM, equation 2.1, with a one-parameter family of uncertainty sets. Let the threshold value of the parameter be the one at which the optimal value of equation 2.1 is zero. First, we need to compute this threshold in order to check whether problem 2.1 is convex. It is obtained from the optimal solution of the problem
formula
5.5
When the uncertainty sets are the ellipsoidal sets of equation 3.9, the problem reduces to MPM, equation 3.8. When the uncertainty set is the ellipsoidal set of equation 3.17, the problem reduces to FDA, equation 3.15, whose optimal solution is obtained by solving a linear system of equations. In (E)ν-SVM, the uncertainty sets corresponding to equation 3.5 are reduced convex hulls, and equation 5.5 reduces to the LP, equation 3.6.

If the input parameter is equal to the threshold, we have already obtained an optimal solution from equation 5.5. If the problem is convex at the given parameter, we next solve the convex problem, equation 2.3, by using standard optimization software. In the nonconvex regime, RCM, equation 2.1, is essentially equivalent to equation 2.4, which includes a nonconvex constraint, and we need to solve equation 2.1 as a nonconvex problem.

In the area of global optimization, nonconvex RCM, equation 2.4 (more precisely, the problem constructed by taking the dual of the inner minimization), is known as a reverse convex program (RCP), or canonical d.c. program. This differs from a conventional convex program only by the presence of a reverse convex constraint. When all functions are linear except for the reverse convex constraint, the problem is called a linear reverse convex program (LRCP). Eν-SVM is an LRCP, for which Perez-Cruz et al. (2003) proposed a local optimum search algorithm and Takeda and Sugiyama (2008) proposed a global optimum search algorithm.

Here, we present a local optimum search algorithm for nonconvex RCM, generalized from the local optimum search algorithms for Eν-SVM (Perez-Cruz et al., 2003; Takeda & Sugiyama, 2008):
formula

This algorithm is essentially the same as the local algorithm, algorithm 7, in Takeda and Sugiyama (2008) when the uncertainty set in g(w) is a reduced convex hull, equation 3.5. RCM, equation 2.1, requires maximizing g(w) subject to a nonconvex constraint. Instead of solving the nonconvex problem directly, we iteratively solve the relaxation problems, equation 5.6, in algorithm 1. Since g(w) is concave, equation 5.6 can be solved by using convex minimization techniques.

The nonconvex constraint of equation 2.1 is linearized at the current iterate in the algorithm, and the linear constraint is updated at every iteration. Note that the negativity of the optimal value of equation 5.6 is guaranteed.
formula
5.7
because the previous iterate remains feasible for the updated constraint. Note that the current iterate is a feasible solution for equation 5.6. Hence, if equation 5.6 has no solution better than the current iterate, the current iterate is returned as an optimal solution of the equation, and the algorithm terminates.
The computation of g(w) may be difficult for general uncertainty sets. However, we do not need an explicit formula for g(w) in equation 5.6. If the uncertainty set is convex, we can obtain a dual formulation (a max-problem) for g(w) and replace the max-min problem, equation 5.6, with a simple max-problem, that is, a one-level convex problem. Indeed, when the uncertainty set is a reduced convex hull, equation 3.5, of data points, we can take the dual and change equation 5.6 into Eν-SVM, equation 3.4, with its norm constraint linearized. Accordingly, algorithm 1 is essentially the same as the local optimization method (Takeda & Sugiyama, 2008) for Eν-SVM. When the algorithm is applied to an RCM having ellipsoidal uncertainty, we can obtain the optimal value g(w) analytically for any w. For example, in the case of ellipsoidal uncertainty, equation 3.9, we get
formula
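Algorithm 1 can be sketched end to end for an ellipsoidal case. Everything concrete below is an assumption for illustration, not the letter's exact formulation: we take an MPM-type concave objective g(w) = μᵀw − κ(‖A₊w‖ + ‖A₋w‖) with κ large enough that the optimal value is negative, and we handle the nonconvex constraint ‖w‖ ≥ 1 by linearizing it at the current iterate wₜ as wₜᵀw ≥ ‖wₜ‖, which implies ‖w‖ ≥ 1 by the Cauchy-Schwarz inequality:

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative problem data (assumptions, not from the letter).
mu = np.array([0.5, 0.2])                  # difference of class means
Ap = np.array([[1.0, 0.2], [0.2, 1.0]])    # square roots of class covariances
Am = np.array([[1.5, 0.0], [0.0, 0.7]])
kappa = 2.0                                # large kappa => g < 0 away from 0

def g(w):
    # Concave, positively homogeneous objective of MPM type.
    return mu @ w - kappa * (np.linalg.norm(Ap @ w) + np.linalg.norm(Am @ w))

# Algorithm 1 (sketch): maximize g(w) subject to ||w|| >= 1 by repeatedly
# linearizing the nonconvex constraint at the current iterate.
w = np.array([1.0, 0.0])                   # feasible start, ||w|| = 1
for _ in range(30):
    wt = w.copy()
    # Linearized constraint wt^T w >= ||wt||: a half-space, so each
    # subproblem is a convex program solvable by standard software.
    cons = {"type": "ineq",
            "fun": lambda v, wt=wt: wt @ v - np.linalg.norm(wt)}
    res = minimize(lambda v: -g(v), wt, constraints=[cons])
    if not res.success or g(res.x) <= g(wt) + 1e-10:
        break                              # no improvement: local optimum
    w = res.x
```

Each iterate stays feasible for the original constraint, and g(wₜ) is nondecreasing across iterations, mirroring inequality 5.7.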

The following theorem guarantees finite convergence of algorithm 1 with a positive stopping tolerance:

Theorem 5. 

For any positive stopping tolerance, algorithm 1 terminates in a finite number of iterations.

Proof. 
Let the optimal value of RCM, equation 2.1, be a negative value. Suppose that the stopping rule is not satisfied at any iteration; otherwise the algorithm terminates. Accordingly, the following inequalities hold:
formula
Hence, for all positive integers t, we obtain
formula
Since the sequence is increasing and bounded from above, it has a limit. As a result, we have
formula
The above inequality and the positivity of yield . Thus, we find that , and exists such that . Therefore,
formula
Hence, the stopping rule with a positive tolerance is satisfied in a finite number of iterations.

Below, we consider the case in which the algorithm with zero tolerance converges. When the uncertainty set is a polytope, the algorithm converges in a finite number of iterations (see theorem 6). Theorem 7 shows a sufficient condition for the local optimality of the solution obtained by algorithm 1 with zero tolerance.

Theorem 6. 

For an RCM, equation 2.1, with a convex polyhedral uncertainty set, algorithm 1 with zero tolerance terminates within a finite number of iterations.

Proof. 
For a convex polyhedral uncertainty set, equation 5.6 is written as
formula
5.8
by using the dual formulation for . Let be the optimal solution of the LP, equation 5.8, in the tth iteration. The feasible solution of RCM, equation 2.1, is
formula
Note that the optimal solution can be found at a corner of the feasible set of the LP, equation 5.8. Since this LP has a polyhedral cone as its feasible set when the linearized constraint is removed, the scaled solution is also a corner of the feasible set of RCM, equation 2.1.

Moreover, equation 5.7 implies that the algorithm finds a distinct corner of RCM, equation 2.1, in each iteration. Since the number of corners of the RCM is finite, the algorithm terminates within a finite number of iterations.

For theorem 7, we suppose that