Abstract

Financial risk measures have recently been used in machine learning. For example, ν-support vector machine (ν-SVM) minimizes the conditional value at risk (CVaR) of the margin distribution. The measure is popular in finance because of its subadditivity property, but it is very sensitive to a few outliers in the tail of the distribution. We propose a new classification method, extended robust SVM (ER-SVM), which minimizes an intermediate risk measure between the CVaR and the value at risk (VaR), with the expectation that the resulting model becomes less sensitive than ν-SVM to outliers. ER-SVM can be regarded as an extension of robust SVM, which uses a truncated hinge loss. Numerical experiments suggest that ER-SVM can achieve better prediction performance with a proper parameter setting.

1  Introduction

One important goal of classification methods is to construct classifiers with high prediction accuracy, that is, good generalization ability. The support vector machine (SVM) is designed to reduce the risk of misprediction on test data by combining a regularization term with a loss function that evaluates the fit to the training data. Risk minimization techniques are also studied in financial mathematics. Especially when considering long-term contracts, various risks are present in practice, and an effective way of hedging those risks is needed. One of the most widely used risk measures in finance is the value at risk (VaR), a quantile at a predefined probability level. Coherent risk measures such as the conditional VaR (CVaR) (Rockafellar & Uryasev, 2002; Artzner, Delbaen, Eber, & Heath, 1999) became popular because of the subadditivity property, which basically means that "a merger does not create extra risk," though the CVaR is very sensitive to a few outliers in the tail of the distribution.

Several works have studied financial risk measures from a machine learning perspective. For example, Xu, Caramanis, Mannor, and Yun (2009) proposed a comprehensive robust classification model that uses a discounted loss function depending on the data and investigated the relationship between comprehensive robustness and convex risk measures. Takeda and Sugiyama (2008) showed that ν-SVM (Schölkopf, Smola, Williamson, & Bartlett, 2000) and its extended model, Eν-SVM (Perez-Cruz, Weston, Hermann, & Schölkopf, 2003), minimize the CVaR of the margin distribution.

In this letter, we propose a new classification method, which we call ER-SVM, that minimizes a CVaR truncated by the VaR, with the expectation that the resulting model becomes less sensitive to outliers than Eν-SVM. Our model is closely related to robust SVM (Shen, Tseng, Zhang, & Wong, 2003; Xu, Crammer, & Schuurmans, 2006; Wu & Liu, 2007; Brooks, 2011), which minimizes a truncated hinge loss combined with a regularization term. We can say that ER-SVM is an extended variant of robust SVM (see Figure 1). More precisely, the permissible parameter range of robust SVM is included in that of ER-SVM, and therefore ER-SVM may achieve better prediction performance with a proper parameter setting.

Figure 1: The relation of ER-SVM to existing learning models shown in solid boxes.

ER-SVM is formulated as a nonconvex problem and, like existing robust SVMs, is difficult to solve exactly. We propose a heuristic algorithm that solves ER-SVM approximately and finds a feasible solution of ER-SVM in every iteration. Furthermore, while the algorithm runs, the hyperparameter μ of ER-SVM is set to an appropriate value. If μ is set to the ratio of outliers, ER-SVM achieves good prediction performance, but that ratio is hard to predict in practical problem settings. Therefore, the proposed algorithm, in which the parameter is tuned automatically, is very practical. In the algorithm, we repeat two steps: solving the optimization problem of ν-SVM or Eν-SVM and removing training samples with large losses. The solution method is easy to implement because existing tools and software for ν-SVM or Eν-SVM can be applied several times in order to solve ER-SVM.

Numerical experiments show the superior performance of ER-SVM over robust SVM, C-SVM, and Eν-SVM in the presence of outliers. Indeed, Figure 2 shows that CVaR minimization, which is equivalent to Eν-SVM, is sensitive to outliers, whereas our model, ER-SVM, is not. The feature of our model that ignores samples with large losses contributes to the superior performance over Eν-SVM.

Figure 2: Influence of an outlier, shown at the lower left, on Eν-SVM and ER-SVM.

The superior performance over robust SVM implies the effectiveness of ER-SVM's extended parameter range relative to that of robust SVM. Both problems, ER-SVM and robust SVM, can be regarded as a minimum distance problem to a set constructed from the training samples, and the hyperparameters of those problems control the size of that set. Robust SVM limits the range of the hyperparameter so that the set does not contain the origin, but ER-SVM removes this limitation.

The letter is organized as follows. Section 2 reviews several related support vector machine classifiers and risk measure minimization models in financial engineering. Section 3 presents the formulation of our model, ER-SVM, and a heuristic algorithm for it. Section 4 provides geometric interpretations of ER-SVM by showing its dual formulation. Those interpretations help us understand ER-SVM from the geometric viewpoint and recognize how ER-SVM differs from VaR- or CVaR-based methods. In section 5, our model is compared to several related models such as robust SVM (Xu et al., 2006; Wu & Liu, 2007), ν-SVM (Schölkopf et al., 2000), Eν-SVM (Perez-Cruz et al., 2003), and VaR-SVM (Tsyurmasto, Zabarankin, & Uryasev, 2014). Section 6 concludes the letter.

2  Background and Related Work

2.1  Support Vector Machine

The SVM has been widely used for classification in machine learning. Let us address the binary classification problem of learning a decision function based on training samples (x_i, y_i) ∈ ℝ^n × {−1, +1}, i ∈ I := {1, …, m}. We assume that the training samples are independent and identically distributed following an unknown probability distribution on ℝ^n × {−1, +1}. For simplicity, we focus on linear functions f(x) = w⊤x + b, but the discussions in this letter are directly applicable to nonlinear kernel classifiers (see Schölkopf & Smola, 2002).

2.1.1  Equivalence Between C-SVM and ν-SVM

One important goal of learning methods is to construct classifiers with high prediction accuracy, that is, good generalization ability. For that purpose, SVMs minimize an objective function that consists of a loss function and a regularization term (such as ||w||^2). Many learning methods use convex surrogate losses (e.g., the hinge loss):
\ell(w, b; x, y) := \bigl[\, 1 - y(w^\top x + b) \,\bigr]_+,
2.1
where [z]_+ := \max\{z, 0\} for z \in \mathbb{R}. A representative model using the hinge loss is C-SVM (Cortes & Vapnik, 1995),
\min_{w, b, \xi} \;\; \frac{1}{2}\|w\|^2 + C \sum_{i \in I} \xi_i \quad \text{s.t.} \;\; y_i(w^\top x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i \in I,
where C > 0 is a user-specified hyperparameter. Moreover, ν-SVM (Schölkopf et al., 2000) is another formulation with the hinge loss,
\min_{w, b, \rho, \xi} \;\; \frac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{m}\sum_{i \in I} \xi_i \quad \text{s.t.} \;\; y_i(w^\top x_i + b) \ge \rho - \xi_i, \;\; \xi_i \ge 0, \;\; i \in I,
where ν ∈ (0, 1] is a hyperparameter and m = |I| is the cardinality of the index set I of training samples. The margin ρ in the hinge loss is determined by an optimal solution of ν-SVM, and the resulting margin is nonnegative (see Crisp & Burges, 2000).

ν-SVM and C-SVM have the same optimal solution if we appropriately set the two parameters ν and C by using the optimal solution of ν-SVM (see Schölkopf et al., 2000). It is said that setting an appropriate ν for ν-SVM is easier and more intuitive than setting C for C-SVM because of the ν-properties of ν-SVM, namely that ν is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors.
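To make the two parameterizations concrete, the following sketch fits both formulations on the same toy data with scikit-learn's SVC and NuSVC (off-the-shelf C-SVM and ν-SVM solvers). The data and parameter values are illustrative assumptions; the exact (ν, C) correspondence would require the optimal solution, as noted above.

```python
# Minimal sketch (not from the letter): fitting the C- and nu-parameterized
# soft-margin SVMs on the same toy data with scikit-learn's SVC and NuSVC.
import numpy as np
from sklearn.svm import SVC, NuSVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y = np.hstack([np.ones(50), -np.ones(50)])

c_svm = SVC(kernel="linear", C=1.0).fit(X, y)
nu_svm = NuSVC(kernel="linear", nu=0.2).fit(X, y)

# nu-property: nu lower-bounds the fraction of support vectors
# and upper-bounds the fraction of margin errors.
print("fraction of support vectors (nu-SVM):", len(nu_svm.support_) / len(y))
print("C-SVM direction :", c_svm.coef_ / np.linalg.norm(c_svm.coef_))
print("nu-SVM direction:", nu_svm.coef_ / np.linalg.norm(nu_svm.coef_))
```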

However, the parameter ν of ν-SVM has a limited permissible range (see Crisp & Burges, 2000, for details). ν-SVM is unbounded when ν is larger than ν_max := 2 min{|I_+|, |I_-|}/m, where I_+ (resp. I_-) is the index set of the samples with the positive (resp. negative) label. ν-SVM produces a trivial solution (w = 0 and ρ = 0) when ν is smaller than some threshold ν_min.

2.1.2  Extension from ν-SVM to Eν-SVM

ν-SVM was extended to Eν-SVM (Perez-Cruz et al., 2003) by allowing the margin ρ to be negative and enforcing the norm of w to be unity,
\min_{w, b, \rho, \xi} \;\; -\nu\rho + \frac{1}{m}\sum_{i \in I} \xi_i \quad \text{s.t.} \;\; y_i(w^\top x_i + b) \ge \rho - \xi_i, \;\; \xi_i \ge 0, \;\; i \in I, \;\; w \in W,
2.2
where W := {w ∈ ℝ^n : ||w||^2 = 1}. It should be noted that the convex relaxation problem, which replaces the constraint w ∈ W of Eν-SVM (see equation 2.2) with w ∈ B := {w ∈ ℝ^n : ||w||^2 ≤ 1}, is equivalent to ν-SVM. Indeed, the dual problems of ν-SVM and the relaxation problem coincide (up to a scaling of variables). The nonconvex constraint w ∈ W prevents w from being zero for ν smaller than ν_min, and Eν-SVM gives a nontrivial solution satisfying the ν-properties for any ν in the permissible range.

The classifiers of Eν-SVM and ν-SVM are the same up to a scaling factor when ν is above ν_min, because the convex relaxation problem of Eν-SVM then attains an optimal solution with ||w|| = 1. In other words, W can be relaxed to B without changing the optimal solution. On the other hand, when ν is below ν_min, the relaxation problem gives a trivial solution (w = 0 and ρ = 0), as does ν-SVM. In that case, the constraint w ∈ W of Eν-SVM can be relaxed to ||w||^2 ≥ 1, but it is still nonconvex. Therefore, a nonconvex optimization method (Perez-Cruz et al., 2003; Takeda & Sugiyama, 2008) needs to be applied to Eν-SVM.

2.1.3  Robust SVM

The prediction performance of these classification models deteriorates in the presence of outliers. To overcome this drawback, various papers (e.g., Shen et al., 2003; Xu et al., 2006; Wu & Liu, 2007; Brooks, 2011) proposed robust SVMs, which use a truncated hinge loss. Truncation means that the loss does not increase the penalty beyond a certain point. Such a loss can improve the robustness to outliers at the expense of convexity. Using the η-hinge loss,1
formula
2.3
with additional auxiliary parameters, Xu et al. (2006) formulated robust SVM as
formula
2.4
Here C (> 0) is a user-specified positive parameter. The optimized η-hinge function is shown in Xu et al. (2006) to be equivalent to the truncated hinge loss:
formula
Xu et al. (2006) solved equation 2.4 by using semidefinite relaxation. Wu and Liu (2007) also expressed the truncated hinge loss as the difference of two hinge loss functions and solved the resulting SVM problem by applying the difference-of-convex (d.c.) algorithm through a sequence of convex subproblems.
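As a concrete illustration of the truncation, the following minimal Python sketch checks numerically that a hinge loss capped at a fixed level equals the difference of two ordinary hinge losses, which is the decomposition exploited by the d.c./CCCP-type algorithms; the cap parameter s and its value are assumptions for the sketch.

```python
# Minimal numerical sketch (assumption: the truncated/ramp hinge is capped at a
# level determined by a parameter s < 1, as in Collobert et al., 2006). It checks
# that the truncated hinge equals the difference of two ordinary hinge losses.
import numpy as np

def hinge(z, a=1.0):
    # ordinary hinge H_a(z) = max(0, a - z), where z = y * f(x)
    return np.maximum(0.0, a - z)

def truncated_hinge(z, s=-1.0):
    # ramp loss: hinge capped at the level 1 - s
    return np.minimum(1.0 - s, hinge(z))

z = np.linspace(-4, 4, 9)   # margins y * f(x)
s = -1.0
assert np.allclose(truncated_hinge(z, s), hinge(z, 1.0) - hinge(z, s))
print(np.c_[z, truncated_hinge(z, s)])
```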

2.2  Risk Measure Minimization

We define financial risk measures using the losses of the samples, r(w, b; x_i, y_i) := −y_i(w⊤x_i + b), i ∈ I, and show classification methods that minimize risk measures with respect to w and b.

Assuming that (x, y) is a random vector whose discrete support consists of the m training samples, let us consider the distribution of losses over these samples:
formula
For ν ∈ (0, 1], let α_ν(w, b) denote the 100(1 − ν) percentile of the loss distribution, known as the value at risk, or ν-VaR, in finance:
\alpha_\nu(w, b) := \min\bigl\{\, \alpha \in \mathbb{R} : P\bigl(r(w, b; x, y) \le \alpha\bigr) \ge 1 - \nu \,\bigr\}.
Thus, at most a fraction ν of the losses r(w, b; x_i, y_i), i ∈ I, exceeds the threshold α_ν(w, b) (see Figure 3). VaR minimization is formulated as follows:
\min_{w, b} \;\; \alpha_\nu(w, b) \quad \text{s.t.} \;\; w \in W.
2.5
Figure 3: An example of the distribution of losses over all training samples.

We define the ν-tail distribution of the loss r(w, b; x, y) as
formula
Let φ_ν(w, b) denote the mean of the ν-tail distribution of r(w, b; x, y) (see Figure 3):
formula
where the expectation is taken over the ν-tail distribution. φ_ν(w, b) is called the conditional VaR, or ν-CVaR. By definition, the CVaR, φ_ν(w, b), is always larger than or equal to the VaR, α_ν(w, b). Indeed, Rockafellar and Uryasev (2002) proved the following relation between the CVaR and the VaR:
formula
2.6
where the middle term uses the expectation over the original loss distribution. If there is no probability atom at α_ν(w, b), all three terms in equation 2.6 are equal, and φ_ν(w, b) is the expected loss given that the loss is greater than or equal to the VaR.
Minimizing the CVaR, φ_ν(w, b), subject to w ∈ W is shown in Rockafellar and Uryasev (2002) to be equivalent to
\min_{w, b, \alpha} \;\; \alpha + \frac{1}{\nu m} \sum_{i \in I} \bigl[\, r(w, b; x_i, y_i) - \alpha \,\bigr]_+ \quad \text{s.t.} \;\; w \in W;
2.7
the variable α at its optimal solution is almost equal to α_ν(w, b) (see Rockafellar & Uryasev, 2002).
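For intuition, the following sketch computes the empirical ν-VaR and ν-CVaR of a vector of sample losses; the quantile convention and the use of the VaR as the minimizing α in the Rockafellar-Uryasev representation are simplifying assumptions of the sketch.

```python
# Minimal sketch: empirical nu-VaR (a (1 - nu) quantile of the losses) and
# nu-CVaR computed via the Rockafellar-Uryasev representation with alpha
# fixed at the empirical VaR.
import numpy as np

def empirical_var_cvar(losses, nu):
    losses = np.sort(np.asarray(losses, dtype=float))
    m = len(losses)
    k = int(np.ceil((1.0 - nu) * m))            # position of the (1 - nu) quantile
    var = losses[max(k - 1, 0)]                 # nu-VaR: roughly nu*m losses exceed it
    cvar = var + np.mean(np.maximum(losses - var, 0.0)) / nu
    return var, cvar

rng = np.random.RandomState(0)
r = rng.randn(1000)                             # toy margin losses
print(empirical_var_cvar(r, nu=0.1))            # CVaR >= VaR always holds
```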

Note that equation 2.7 reduces to Eν-SVM, equation 2.2, by the change of variable ρ = −α. Consider an optimal solution of equation 2.7 and its optimal value. As Figure 4 shows, the optimal value is decreasing with respect to ν. We can compute the threshold of convexity, ν_min, above which the optimal value is negative, by solving a linear programming problem as shown in Takeda, Mitsugi, and Kanamori (2013). For ν above ν_min, equation 2.7 reduces to a convex problem with B instead of W. As discussed before, this convex problem is equivalent to ν-SVM (Schölkopf et al., 2000). A classification model minimizing the VaR with the convex constraint w ∈ B (i.e., the convex relaxation of equation 2.5 with respect to w) was studied in Tsyurmasto et al. (2014) as VaR-SVM.
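In the convex regime, the CVaR minimization problem with the relaxed norm constraint can be written directly in the Rockafellar-Uryasev form and handed to a generic convex solver. The following sketch uses cvxpy for this purpose; it is an illustrative reimplementation on toy data, not the authors' code.

```python
# Minimal sketch: convex CVaR minimization (2.7) with the relaxed constraint
# ||w||^2 <= 1, which is equivalent to nu-SVM for nu above the convexity
# threshold. cvxpy is used here purely as a generic convex solver.
import numpy as np
import cvxpy as cp

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + [1.5, 1.5], rng.randn(50, 2) - [1.5, 1.5]])
y = np.hstack([np.ones(50), -np.ones(50)])
m, n = X.shape
nu = 0.3

w, b, alpha = cp.Variable(n), cp.Variable(), cp.Variable()
losses = -cp.multiply(y, X @ w + b)                  # r_i(w, b) = -y_i (w'x_i + b)
cvar = alpha + cp.sum(cp.pos(losses - alpha)) / (nu * m)
prob = cp.Problem(cp.Minimize(cvar), [cp.sum_squares(w) <= 1])
prob.solve()
print("optimal nu-CVaR:", prob.value, " w:", w.value, " b:", b.value)
```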

Figure 4: Profiles of the minimum CVaR and minimum VaR as functions of ν. "Convex" means that the feasible region is B = {w : ||w||^2 ≤ 1}, while "nonconvex" means W.

3  Extended Robust SVM

The CVaR and VaR measures each have strong and weak points when applied to classification problems. Note that the VaR measure ignores samples with large losses. Minimizing the VaR gives a robust estimate of (w, b) that is insensitive to outliers in the data. However, the performance is sensitive to the input parameter ν, which controls the ratio of ignored samples: we can easily discard essential training samples by regarding them as outliers and obtain worse prediction results. The CVaR minimization model, including ν-SVM, has an excellent reputation for its performance in the standard setup. However, its estimator is sensitive to outliers.

3.1  ER-SVM Formulation

We propose a new classification model that uses both the CVaR and VaR measures so as to exploit their respective advantages,
formula
3.1
where μ and ν are hyperparameters described below. We call it the extended robust SVM (ER-SVM). It can be reformulated as
formula
3.2
Note that optimal solutions of equation 3.1 and of equation 3.2 correspond to each other.

We can regard ER-SVM as a mixture of the CVaR and VaR minimization models. The model has two input parameters, μ and ν, with 0 ≤ μ < ν ≤ 1. When μ becomes small enough that the index set J becomes empty, equation 3.1 is the same as CVaR minimization, equation 2.2. When ν becomes close enough to μ, the objective function in equation 3.1 reduces to the VaR term, and therefore equation 3.1 is the same as VaR minimization, equation 2.5 (this can be proved easily from corollary 9 in Rockafellar & Uryasev, 2002).

The measure used in equation 3.1 almost (in the sense of equation 2.6) equals
formula
3.3
We call this measure the truncated CVaR (trCVaR) because equation 3.3 adds an upper bound to the CVaR representation in equation 2.6. The measure is the average of the remaining large losses after the extremely large losses have been ignored.
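The following sketch illustrates this interpretation empirically: it averages the large losses that remain after the top fraction has been discarded; the quantile-based definition used here is our reading of equation 3.3, not a verbatim transcription.

```python
# Minimal sketch (assumption: the truncated CVaR averages the large losses that
# remain after the largest ~mu*m losses have been discarded, i.e., the losses
# ranked between the top mu and top nu fractions).
import numpy as np

def truncated_cvar(losses, nu, mu):
    """Average of the losses ranked between the top mu and top nu fractions."""
    assert 0.0 <= mu < nu <= 1.0
    r = np.sort(np.asarray(losses, dtype=float))[::-1]    # descending order
    m = len(r)
    lo, hi = int(np.floor(mu * m)), int(np.ceil(nu * m))  # skip top mu*m, keep up to nu*m
    return r[lo:hi].mean()

rng = np.random.RandomState(0)
r = np.append(rng.randn(95), [50, 60, 70, 80, 90])        # 5 gross outliers
print("nu-CVaR     :", truncated_cvar(r, nu=0.2, mu=0.0))   # sensitive to the outliers
print("trunc. CVaR :", truncated_cvar(r, nu=0.2, mu=0.05))  # outliers discarded first
```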
To ensure that equation 3.1 has an optimal solution, the range of ν is restricted depending on J,2 while μ can take any value in its range. The upper bound of ν is determined by
formula
When ν exceeds this upper bound, equation 3.1 becomes an unbounded problem. Note that the optimal value of equation 3.1, the minimum trCVaR, is decreasing with respect to ν. As in the case of Eν-SVM, we can compute a threshold such that the optimal trCVaR is negative for ν above it, and we call the corresponding range of ν the convex range.

We now modify ER-SVM, equation 3.1, by replacing the nonconvex set W with B = {w : ||w||^2 ≤ 1} and call the resulting problem convex ER-SVM.

Lemma 1.

As long as ν is in the convex range, ER-SVM, equation 3.1, is equivalent to convex ER-SVM.

Proof.
To prove the lemma, it is enough to show that the set W can be relaxed to B without changing the optimality as long as the optimal value of ER-SVM, equation 3.1, is negative. Suppose on the contrary that convex ER-SVM has an optimal solution with ||w|| < 1, although the optimal value of ER-SVM is negative. Then we have
formula
Note that the rescaled point is a feasible solution of convex ER-SVM. The above strict inequalities show that this rescaled solution achieves a smaller objective value, which contradicts the optimality for convex ER-SVM. Therefore, ||w|| = 1 holds, and this proves the lemma.

Lemma 1 implies that for ν in the convex range, convex ER-SVM gives an optimal solution of ER-SVM, equation 3.1. However, when ν is below the convex range, convex ER-SVM leads to a trivial solution with w = 0.

The following theorem implies that ER-SVM, equation 3.1, is an extension of robust SVM, equation 2.4, because ER-SVM equals robust SVM only when ν is in the convex range.

Theorem 1.
Convex ER-SVM, robust SVM (see equation 2.4), and robust ν-SVM
formula
3.4
are equivalent if the parameters of equation 2.4 and the parameters μ and ν of equation 3.1 are set appropriately by using an optimal solution of equation 3.4, with ν chosen from the convex range.
Proof.
We focus on the case where ν is chosen from the convex range, that is, the case where the trivial feasible solution whose auxiliary variables form the all-ones vector is not optimal for robust ν-SVM, equation 3.4. First, we show that the optimal solution of robust ν-SVM, equation 3.4, is also optimal for the problem
formula
3.5
If it were not optimal, equation 3.5 would have an optimal solution such that
formula
3.6
formula
3.7
However, the inequalities in equations 3.6 and 3.7 imply that this solution achieves a smaller objective value in equation 3.4, which contradicts its optimality for equation 3.4; therefore, the solution is optimal to equation 3.5. Moreover, it is optimal to
formula
3.8
which is the convex relaxation of ER-SVM, equation 3.2, that is, convex ER-SVM. This establishes the equivalence between convex ER-SVM, equation 3.8, and robust ν-SVM, equation 3.4.
Finally, we relate robust ν-SVM, equation 3.4, to robust SVM, equation 2.4. As long as ν is in the convex range, the optimal value of equation 3.5 is negative. Note that the optimal margin variable is also negative because it is less than or equal to the optimal value. If we fix this variable in equation 3.4 to its optimal value and minimize the objective function only over the remaining variables, nothing changes. By rescaling the remaining variables of equation 3.4 by the absolute value of this optimal margin, we can write the objective function of equation 3.4, divided by the same factor, as
formula
We can confirm that robust ν-SVM, equation 3.4, reduces to robust SVM, equation 2.4, by setting the parameters of equation 3.4 and of equation 2.4 appropriately.

The theorem shows that ER-SVM is equivalent to robust SVM when ν is in the convex range. If ν is in that range, the nonconvex constraint w ∈ W can be relaxed to ||w||^2 ≤ 1. When ν is below the convex range, convex ER-SVM yields a trivial solution with w = 0. In such a case, the nonconvex constraint of ER-SVM is equivalent to ||w||^2 ≥ 1, because the inequality constraint holds with equality at the optimal solution as long as the optimal value of ER-SVM is positive, similar to the properties of Eν-SVM shown in section 2.1.2. By extending the permissible range of ν beyond the convex range, the margin is allowed to be negative, as in Eν-SVM, equation 2.2, and as a result, ER-SVM can provide a nontrivial solution that robust SVM cannot attain.

3.2  ER-SVM Heuristic Algorithm

When we use ER-SVM, equation 3.1, for classification, it seems troublesome to handle the two parameters of ER-SVM, μ and ν. If μ is set to the ratio of outliers, ER-SVM will achieve good prediction performance, but that ratio is hard to predict in practical problem settings. In this section, we propose a heuristic algorithm for ER-SVM that automatically tunes the parameters μ and ν during execution. This makes parameter setting convenient when we cannot estimate the ratio of outliers in a data set.

The algorithm alternately minimizes the objective value of ER-SVM with respect to J and with respect to the other variables, gradually increasing the cardinality of J from 0. Following the idea of Larsen, Mausser, and Uryasev (2002), the algorithm observes the change of the solution as the cardinality of J increases and stops when the solution no longer changes much.

Larsen et al.'s (2002) efficient heuristic algorithm was proposed for the VaR minimization problem, equation 2.5; we modify it so that it solves the trCVaR minimization problem. Algorithm A1 in Larsen et al. (2002) approximately solves the ν-VaR minimization problem. The general line of thought behind the heuristic is simple. The algorithm systematically reduces the VaR by repeating two phases, starting with the full data set: (1) solve a CVaR problem using the current data set, and (2) remove samples with large losses in the CVaR problem and reset the CVaR level so that the number of samples with zero penalty (shown in the white area in Figure 3) stays the same. These phases are repeated until a prescribed number of samples have been removed.

formula
We propose a heuristic algorithm, algorithm 1, for ER-SVM by modifying the algorithm of Larsen et al. (2002).3 In addition to ν, we use a parameter that controls the number of samples discarded in each iteration. If we do not care about computation time, it is better to set this parameter to a small value. The algorithm keeps track of the number of data samples used in the kth CVaR problem, and the CVaR level in equation 3.9 is defined so that it indicates the ratio of support vectors to that number. As k becomes larger, the number of remaining samples becomes smaller, and therefore the algorithm terminates within finitely many iterations; more concretely, within at most
formula
If the iteration count k exceeded the above number, the termination condition would necessarily hold.

The quantity in the termination criterion measures how different the kth solution is from the previously obtained solution. When the solution does not change much even after samples with large losses are removed, the algorithm recognizes that all outliers have already been removed and stops automatically. The solution in the first iteration is optimal for ν-CVaR minimization, equation 2.2. When the algorithm iterates sufficiently long (e.g., with the stopping tolerance set to zero), the resulting solution approximates a VaR minimizer. Algorithm 1 therefore gives an intermediate solution between the VaR minimizer and the CVaR minimizer; it moves from the CVaR minimizer toward the VaR minimizer as the iterations proceed.
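A rough sketch of this two-phase procedure is given below; it is an interpretation of the description above rather than the exact algorithm 1 (for instance, it keeps the CVaR level fixed, uses scikit-learn's linear NuSVC as the inner ν-SVM solver, and the batch size and tolerance names are hypothetical).

```python
# Rough sketch of the heuristic (an interpretation, not the authors' exact
# Algorithm 1): repeatedly (1) solve a convex CVaR problem (approximated here
# by a linear NuSVC) on the remaining samples, and (2) drop a small batch of
# samples with the largest losses, until the decision direction stops changing.
import numpy as np
from sklearn.svm import NuSVC

def er_svm_heuristic(X, y, nu=0.3, batch=2, max_remove_frac=0.2, tol=1e-3):
    idx = np.arange(len(y))                      # indices of the samples kept so far
    w_prev = None
    while len(idx) >= (1 - max_remove_frac) * len(y):
        clf = NuSVC(kernel="linear", nu=nu).fit(X[idx], y[idx])
        w = clf.coef_.ravel() / np.linalg.norm(clf.coef_)
        if w_prev is not None and np.linalg.norm(w - w_prev) < tol:
            break                                # solution barely changed: stop
        w_prev = w
        losses = -y[idx] * clf.decision_function(X[idx])
        keep = np.argsort(losses)[:-batch]       # drop the `batch` largest losses
        idx = idx[keep]
    return clf, idx

# Toy usage: two Gaussian classes plus a few mislabeled outliers.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2], rng.randn(5, 2) + [4, 4]])
y = np.hstack([np.ones(50), -np.ones(50), -np.ones(5)])
clf, kept = er_svm_heuristic(X, y)
print("removed sample indices:", sorted(set(range(len(y))) - set(kept)))
```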

The algorithm automatically assigns suitable values to the parameters μ and ν of ER-SVM. The following theorem shows which values are assigned to μ and ν in every iteration of the algorithm.

Theorem 2.

A feasible solution of equation 3.9 in each iteration is also feasible for equation 3.1 with suitably chosen values of μ and ν.

Proof.

It is enough to check that equation 3.9 equals equation 3.1 with such μ and ν when J is fixed to the set of samples removed so far. With this identification, we see that feasible solutions of equation 3.9 are also feasible for equation 3.1.

In particular, if the index set Ik of equation 3.9 coincides with the complement of the optimal J of equation 3.1, the optimal solution of equation 3.9 is also optimal for equation 3.1. Since we define Ik as the set of samples whose losses were small in equation 3.9, there is no such guarantee, as is also the case for Larsen et al.'s (2002) algorithm. In general, it may not be easy to devise global, or even local, optimization algorithms for nonconvex problems such as ER-SVM that have nonconvexity in both the objective and the constraint functions. However, algorithm 1 works well at finding influential outliers for J, as the numerical experiments in Figure 7 show later.

To solve the CVaR minimization problem, equation 3.9, in the algorithm, we can use existing tools and software (Chang & Lin, 2001; Perez-Cruz et al., 2003; Takeda & Sugiyama, 2008) for ν-SVM or Eν-SVM. Therefore, the algorithm is easy to implement because its main component is solving equation 3.9. When the CVaR level falls below the convexity threshold, we need to deal with equation 3.9 under the nonconvex set W, to which we can apply a local solution algorithm as in Perez-Cruz et al. (2003) and Takeda and Sugiyama (2008). Algorithm 1 becomes more efficient with a warm-start strategy for equation 3.9 (i.e., by using the previous solution as the initial solution). The problem, equation 3.9, does not change much from one iteration to the next when only a few samples are removed per iteration, so only a small computational cost is needed to find the next solution with a warm start.

4  Geometric Interpretation for ER-SVM

We give geometric interpretations of ER-SVM, equation 3.1, by deriving its dual formulation. The dual of ER-SVM is formulated as a minimum distance problem to a set constructed from the samples (x_i, y_i), i ∈ I. From theorem 1, we can give the same interpretation to robust SVM, equation 2.4. This interpretation is not obvious from formulation 2.4, which uses truncated hinge losses. Furthermore, we also give a geometric interpretation of VaR minimization. Those interpretations help us understand ER-SVM from the geometric viewpoint and recognize how ER-SVM differs from VaR- and CVaR-based methods.

4.1  Dual of ER-SVM

We define the Minkowski difference of two sets by
formula
When the two sets are compact convex sets with interior points, their Minkowski difference is compact and convex and has a nonempty interior. When the two sets do not intersect, their Minkowski difference does not contain the origin; if they intersect, it contains the origin.
By taking a partial dual of ER-SVM, equation 3.1, it is transformed into
formula
4.1
where
formula
4.2
Consider its optimal solution. The bias b of equation 3.1 can be computed by using margin support vectors, which lie exactly on the margin. We can confirm the equivalence of ER-SVM, equation 3.1, to equation 4.1 by taking the dual of the inner minimization problem in equation 4.1.

For equation 4.1, the two points being compared can be viewed as representative points (or means) of each class. We can interpret the problem from the viewpoint of robust optimization (Ben-Tal, El-Ghaoui, & Nemirovski, 2009) by regarding them as uncertain inputs and preparing uncertainty sets in which they lie. Note that these uncertainty sets are convex sets constructed from the training samples of each class. ER-SVM, equation 4.1, finds a solution that is robust with respect to changes in the realization of the representative points.

Bishop (2006, section 4.1.4 on Fisher's linear discriminant) discusses the problem of maximizing the separation of the projected class means with respect to w. Although that problem selects a projection that maximizes the class separation, the resulting solution can induce considerable class overlap in the projected space. The difficulty arises when the samples of each class are generated from class distributions with strongly nondiagonal covariances. Fisher's linear discriminant finds a solution that gives a small variance within each class in addition to maximizing the class separation, thereby minimizing the class overlap. We can say that ER-SVM, equation 4.1, tries to overcome the difficulty in another way: by removing outliers through J and by considering the worst case over the uncertainty sets.

4.2  Transformation of Dual ER-SVM Depending on the Optimal Value

In section 3.1, we introduced convex ER-SVM, equation 3.8, in which the nonconvex constraint w ∈ W is replaced by the convex one, ||w||^2 ≤ 1, without changing the optimality of ER-SVM as long as ν is in the convex range. In that range, the optimal value of ER-SVM is negative (see Figure 4). We can relate the sign of the optimal value of ER-SVM to the position of the origin relative to the Minkowski-difference set defined by the optimal solution of ER-SVM. Indeed, the origin lies outside the set if and only if the optimal value of ER-SVM is negative, and the interior of the set includes the origin if and only if the optimal value of ER-SVM is positive. These facts can be proved by mimicking the proof of lemma 1 in Takeda et al. (2013).

Therefore, if the optimal value is negative, ER-SVM is equivalent to convex ER-SVM, equation 3.8. Otherwise, the constraint w ∈ W is essentially equivalent to the nonconvex inequality ||w||^2 ≥ 1. We cannot detect a priori which case occurs, because the optimal solution is needed for that judgment. However, we can give geometric interpretations of ER-SVM in each case. We now transform ER-SVM, equation 4.1, into two norm-minimization problems, depending on the sign of the optimal value.

Theorem 3.
Suppose we have the optimal solution of ER-SVM, equation 4.1. When the corresponding Minkowski-difference set does not contain the origin, ER-SVM is equivalent to
formula
4.3
When the set contains the origin, it is equivalent to
formula
4.4
where the feasible region in equation 4.4 is the closure of the complement of the convex set.
Proof.
For a bounded convex set, we define its support functional by
formula
ER-SVM, equation 4.1 is rewritten as
formula
4.5
Theorem 3.1 of Briec (1997) implies
formula
4.6
for any set satisfying the required conditions, together with the correspondence between the optimal solutions of equations 4.1 and 4.3. This leads to the equivalence of equation 4.5 to equation 4.3, because the constraint in equation 4.5 can be made convex owing to the negative optimal value of ER-SVM.
On the other hand, proposition 3.1 of Briec (1997) shows
formula
for any set satisfying the required conditions, together with the correspondence between the optimal solutions of equations 4.1 and 4.4. This leads to the equivalence of equation 4.5 to equation 4.4, because the constraint in equation 4.5 can be replaced by its reverse-convex counterpart owing to the positive optimal value of ER-SVM.

Figure 5 depicts the hyperplane given by ER-SVM, equation 4.1, as well as the resulting uncertainty sets obtained with the optimal J of equation 4.1. ER-SVM detected all outliers, shown with star marks, as members of J. In this case, the Minkowski-difference set does not contain the origin (equivalently, the optimal value is negative), and therefore ER-SVM, equation 4.1, equals the dual problem, equation 4.3, of convex ER-SVM. As theorem 3 implies, the hyperplane of equation 4.1 was obtained by maximizing, with respect to J, the minimum distance between the two sets.

Figure 5: The hyperplane given by ER-SVM, equation 4.1, and the resulting uncertainty sets (shown with solid lines) for its optimal solution. The circle marks (triangle marks) plot samples with label 1 (label −1, respectively). The removed samples (samples in J), which belong to class −1, are shown with stars.

By fixing J, equation 4.3 reduces to a convex problem, whereas equation 4.4 remains nonconvex because of its constraint. The former corresponds to convex ER-SVM with ν in the convex range. Indeed, the convex range was defined so that the optimal trCVaR is negative, but it can equivalently be defined so that the Minkowski-difference set excludes the origin in that range. In other words, as long as ν is in the convex range, ER-SVM, equation 4.1, is equivalent to convex ER-SVM, robust SVM, equation 2.4, and robust ν-SVM (see equation 3.4). As ν becomes smaller, the set becomes larger, and the nonconvex case, in which the set contains the origin, tends to occur.

4.3  Relation to CVaR and VaR Minimization Models

Takeda, Mitsugi, and Kanamori (2012) proposed a unified formulation, the unified classification model (UCM), for binary classification.4 UCM is formulated as
formula
4.7
Depending on the definition of the uncertainty sets, UCM embraces various classification methods such as the support vector machine (SVM) (Schölkopf & Smola, 2002), the minimax probability machine (Lanckriet, Ghaoui, Bhattacharyya, & Jordan, 2002), and Fisher discriminant analysis (Fukunaga, 1990).
Indeed, Takeda et al. (2012) showed that UCM is equal to CVaR minimization (i.e., Eν-SVM or ν-SVM) when the uncertainty sets are the reduced convex hulls defined in Bennett and Bredensteiner (2000) and Crisp and Burges (2000) as
formula
4.8
The sets are polytopes that shrink toward the class centers as ν becomes large. By taking the dual of the inner maximization in UCM with the above reduced convex hulls, we obtain problem 2.7, which minimizes the ν-CVaR with respect to (w, b) and is equivalent to Eν-SVM or ν-SVM.
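As an illustration of how the reduced convex hulls shrink with ν, the following sketch tests membership in a reduced convex hull by solving a small feasibility linear program; the coefficient cap 1/(νm) is an assumption following Crisp and Burges (2000).

```python
# Minimal sketch (assumption: a reduced convex hull with coefficient cap
# D = 1/(nu * m)). Membership is a feasibility LP: does p = sum_i lambda_i x_i
# with sum_i lambda_i = 1 and 0 <= lambda_i <= D? Larger nu shrinks the hull.
import numpy as np
from scipy.optimize import linprog

def in_reduced_hull(p, X, nu):
    m, n = X.shape
    D = 1.0 / (nu * m)
    A_eq = np.vstack([X.T, np.ones((1, m))])      # sum lambda_i x_i = p, sum lambda_i = 1
    b_eq = np.append(p, 1.0)
    res = linprog(c=np.zeros(m), A_eq=A_eq, b_eq=b_eq, bounds=[(0.0, D)] * m, method="highs")
    return res.status == 0                        # 0 means a feasible point was found

rng = np.random.RandomState(0)
X = rng.randn(30, 2)
p = X[np.argmax(np.linalg.norm(X - X.mean(0), axis=1))]   # an extreme sample
print("in hull (nu = 1/30):", in_reduced_hull(p, X, nu=1.0 / 30))  # cap D = 1: full hull
print("in hull (nu = 0.5) :", in_reduced_hull(p, X, nu=0.5))       # shrunk hull
```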

Figure 6 shows the reduced convex hulls of equation 4.8 for the two classes. Because of this choice of the sets, UCM, equation 4.7, equals Eν-SVM in this case. The hyperplane was obtained by solving Eν-SVM, equation 2.2. The CVaR-based model takes into account all losses induced by the training samples; therefore, the reduced convex hulls shown in the figure are influenced by outliers. Consequently, the hyperplane is also influenced by outliers, compared with the hyperplane of ER-SVM (see Figure 5).

Figure 6: The hyperplane given by Eν-SVM and its reduced convex hulls (shown with solid lines) from equation 4.8. The circle marks (triangle marks) plot samples with label 1 (label −1, respectively).

We can rewrite VaR minimization, equation 2.5, as
formula
where
formula
4.9
Note that the sets in equation 4.9 are generated from those of ER-SVM, equation 4.1, by deleting the upper bound on the coefficients.

From the definition (see equation 4.2) of the sets used in ER-SVM, we can again confirm that ER-SVM is a mixture of the CVaR and VaR minimization models: the sets have an upper bound on the coefficients, as in equation 4.8 for CVaR minimization, and a portion of the samples, indexed by J, is deleted, as in equation 4.9 for VaR minimization.

5  Numerical Results

5.1  Properties of ER-SVM

The intermediate model, ER-SVM, inherits strong points of both CVaR minimization and VaR minimization: a strong point of CVaR minimization is that the prediction is not very sensitive to the choice of ν, and a strong point of VaR minimization is that its optimal solution is robust to outliers. We compared the performance of ER-SVM, equation 3.1 (approximately solved by algorithm 1), VaR minimization, equation 2.5 (approximated by running the algorithm without early stopping), and CVaR minimization, equation 2.7. Recall that equation 2.7 equals Eν-SVM and, for ν above the convexity threshold, reduces to ν-SVM (Schölkopf et al., 2000). VaR minimization, equation 2.5, reduces to VaR-SVM (Tsyurmasto et al., 2014) for large ν (precisely, as long as the optimal VaR is negative; see Figure 4). ER-SVM equals robust SVM, equation 2.4, when ν is in the convex range (see lemma 1 and theorem 1).

We used synthetic data generated following Xu et al. (2006). We generated two-dimensional samples with labels +1 and −1 from two normal distributions with different mean vectors and the same covariance matrix, so the optimal hyperplane for the noiseless data set is known. We added outliers to the training set of one class only, drawing them uniformly from a half-ring with a specified center, inner radius, and outer radius in the input space. The training set contained 50 samples from each class (100 in total), including outliers. The ratio of outliers in the training set was set to one of the values from 0 to 5% (10% only for Figure 8). The test set has 500 samples from each class (1000 in total). We repeated all experiments 100 times, drawing new training and test sets in every repetition.
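A generator in the spirit of this setup is sketched below; the means, covariance, ring center, and radii are placeholder values, not the ones used in the experiments.

```python
# Illustrative data generator in the spirit of the setup above (the specific
# means, covariance, and ring radii are assumptions for this sketch).
import numpy as np

def make_synthetic(n_per_class=50, outlier_ratio=0.05, seed=0):
    rng = np.random.RandomState(seed)
    mean_pos, mean_neg, cov = np.array([2.0, 2.0]), np.array([-2.0, -2.0]), np.eye(2)
    X_pos = rng.multivariate_normal(mean_pos, cov, n_per_class)
    n_out = int(round(outlier_ratio * 2 * n_per_class))
    X_neg = rng.multivariate_normal(mean_neg, cov, n_per_class - n_out)
    # outliers: uniform on a half-ring (inner/outer radii 4 and 6) around the
    # positive-class mean, but labeled -1
    theta = rng.uniform(0, np.pi, n_out)
    radius = rng.uniform(4.0, 6.0, n_out)
    outliers = mean_pos + np.c_[radius * np.cos(theta), radius * np.sin(theta)]
    X = np.vstack([X_pos, X_neg, outliers])
    y = np.hstack([np.ones(n_per_class), -np.ones(n_per_class)])
    return X, y

X, y = make_synthetic()
print(X.shape, y.shape)
```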

The parameters of algorithm 1 for approximately solving ER-SVM were fixed to a basic setting used throughout the numerical experiments. This setting makes the number of removed samples unrelated to the choice of ν.

Figure 7 shows the influence of outliers on the performance of the three methods as a function of ν. The panels show the average test errors with their estimation errors (the standard deviations divided by the square root of the number of trials) over the 100 runs. Also shown is the maximum convexity threshold for CVaR minimization (Eν-SVM) among the 100 trials; ER-SVM with ν above this threshold is equivalent to robust SVM. Figures 7a to 7d imply that ER-SVM (especially with ν below the threshold) achieved better performance as the ratio of outliers increased, while Eν-SVM's performance became worse. We can confirm that the test-error curve of ER-SVM was less volatile with respect to ν than that of VaR minimization, and ER-SVM achieved the lowest test errors among the three models in the presence of outliers.

Figure 7: Test errors with respect to ν in the presence of 0 to 5% outliers.

We tested the precision and recall on synthetic data including 10% outliers (i.e., 10 outliers among 100 samples) generated from the half-ring described above. Precision is the number of outliers that algorithm 1 removed divided by the total number of removed samples, and recall is the number of outliers removed divided by the total number of outliers in the data set. Figure 8 (top) shows the precision-recall curve for different values of the stopping-criterion parameter of algorithm 1. The top panel demonstrates that a larger value produces a high precision rate, implying that the outliers are detected at an early stage (i.e., small k) of algorithm 1. Figure 8 (bottom) shows the test errors of the classifiers obtained by algorithm 1 with the corresponding parameter values. The basic parameter setting used in the numerical experiments stresses precision rather than recall, which leads to high prediction performance (i.e., small test errors).
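The two quantities can be computed directly from the removed index set and the ground-truth outlier set, as in the following sketch (the example indices are hypothetical).

```python
# Minimal sketch: precision and recall of outlier removal, as defined above,
# given the index set removed by the algorithm and the true outlier set.
def outlier_precision_recall(removed, true_outliers):
    removed, true_outliers = set(removed), set(true_outliers)
    hit = len(removed & true_outliers)
    precision = hit / len(removed) if removed else 0.0
    recall = hit / len(true_outliers) if true_outliers else 0.0
    return precision, recall

# Example: 10 true outliers (indices 90-99); the algorithm removed 12 samples.
print(outlier_precision_recall(removed=range(88, 100), true_outliers=range(90, 100)))
```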

Figure 8: (Top) The precision-recall curve for different values of the stopping-criterion parameter of algorithm 1. (Bottom) The test errors for the corresponding parameter values.

5.2  Comparison to Existing Models

We compared the performance of ER-SVM with the following existing models: C-SVM; Eν-SVM; robust SVM (solved by the concave-convex procedure, CCCP; Collobert, Sinz, Weston, & Bottou, 2006); and VaR minimization. C-SVM is a well-known classification method. Eν-SVM, equation 2.2, is an extension of ν-SVM (equivalent to C-SVM), and VaR minimization, equation 2.5, is an extension of VaR-SVM (Tsyurmasto et al., 2014). We can say that ER-SVM is a robust variant of Eν-SVM, whereas robust SVM is a robust variant of C-SVM. We used a fixed truncation level for robust SVM, equation 2.4.

5.2.1  Synthetic Data

Table 1 shows the results (average test error [%] with standard deviations over 100 trials) of comparing the five models on the synthetic data set of the previous section. We chose the best parameter setting from nine candidate values of ν for ER-SVM, Eν-SVM, and VaR minimization and from nine candidate values of C for robust SVM and C-SVM. Figure 9 depicts the results of Table 1: the average test error of each learning model as the ratio of outliers in the data set changes. ER-SVM, VaR minimization, and robust SVM were less influenced by outliers than standard SVM models such as C-SVM and Eν-SVM. Above all, the proposed model, ER-SVM, achieved good prediction performance.

Figure 9: Comparison of ER-SVM to other learning models for synthetic data.

Table 1: Synthetic Data.
Outlier Ratio  ER-SVM  Robust SVM  Eν-SVM  C-SVM  VaR

Notes: Average test error [%] with standard deviations over 100 trials for synthetic data. The minimum average test error among the five models for each outlier ratio is shown in bold.

5.2.2  UCI Data Sets

We generated contaminated data sets from the original UCI repository data sets (Blake & Merz, 1998) shown in Table 2 by adding outliers as follows. We scaled all attributes of the original data sets to the range −1.0 to 1.0, generated outliers uniformly from a ring with a fixed center and radius R, and assigned them wrong labels by using the optimal classifiers of Eν-SVM. The radius R for generating outliers was set so that the outliers have an impact on the test errors. In addition, the percentage of outliers (outlier ratio) was increased until the outliers had a large influence on the test errors.

Table 2: Properties of UCI Data Sets.
Data Set  n  m
liver-disorders  6  345  0.72  0.84
diabetes  8  768  0.52  0.70
breast cancer  10  683  0.06  0.70
heart  13  270  0.33  0.89
australian  14  690  0.29  0.89

We generated 10 contaminated data sets for each outlier ratio. Figure 10 shows the average of the test errors on the 10 contaminated data sets with the best parameter choice. The best parameter was chosen among nine candidate values of ν at equal intervals for ER-SVM, Eν-SVM, and VaR minimization and among nine candidate values of C for robust SVM and C-SVM. The markers on each line of ER-SVM, Eν-SVM, and VaR minimization indicate that the best parameter value was attained in the nonconvex range of ν; the absence of a marker indicates that it was attained in the convex range.

Figure 10: Comparison of prediction performances for contaminated UCI data sets. Outliers were uniformly generated from a ring with a fixed center and radius R. The markers on each line of ER-SVM, Eν-SVM, and VaR minimization indicate that the best parameter value was attained in the nonconvex range.

As the outlier ratio increases, ER-SVM tends to outperform robust SVM in most cases. Especially for the liver-disorders and diabetes data sets, nonconvex ER-SVM achieved good prediction performance compared with the other methods. For the other data sets, the best parameter of ER-SVM was chosen in the convex range for almost all outlier ratios; therefore, the differences in performance between convex ER-SVM and robust SVM lie only in the parameterizations ν and C, and ν may be easier to adjust than C. ER-SVM thus has the possibility of achieving better prediction performance with a proper parameter setting.

6  Conclusion

We proposed the extended robust SVM (ER-SVM), which minimizes an intermediate risk measure between the CVaR and the VaR, with the expectation that the resulting model becomes less sensitive to outliers than Eν-SVM. Our model, ER-SVM, is an extension of robust SVM (Xu et al., 2006; Wu & Liu, 2007). Indeed, if we set ν of ER-SVM to a value in the convex range, ER-SVM and robust SVM give the same classifier. Numerical experiments show the superior performance of our model over robust SVM, C-SVM, and Eν-SVM in the presence of outliers. The effectiveness of the extended parameter range of ER-SVM contributes to the superior performance over robust SVM, whereas ignoring samples with large losses contributes to the superior performance over Eν-SVM.

ER-SVM includes two parameters, μ and ν. We expect that if μ is set to the ratio of outliers, ER-SVM will achieve good prediction performance, but that ratio is hard to predict in practical problem settings. Therefore, in this letter, we proposed a heuristic algorithm for ER-SVM that automatically tunes these parameter values during execution.

Our algorithm has its own tuning parameters, including the discard-batch size and the stopping tolerance. The prediction performance is not significantly affected by most of them. However, some guidelines for the choice of parameters would be helpful in practice. If we do not care about computation time, it is better to set a small value for the discard-batch size. As for the stopping tolerance, the range examined in Figure 8 may be appropriate for scaled data sets. A challenging issue is to develop a reliable cross-validation method for these parameters in the presence of outliers.

In the future, we want to investigate a practical way to set the parameters μ and ν of ER-SVM directly. The parameter μ controls the number of ignored samples, and ν controls the thickness of the margin. The superior performance of ER-SVM over robust SVM is due to the extended permissible range for ν. With a small ν in the extended range, the margin of ER-SVM can be negative, unlike in robust SVM. We also need to identify the problem settings and features of data sets for which the negative margin works.

References

Artzner, P., Delbaen, F., Eber, J., & Heath, D. (1999). Coherent measures of risk. Mathematical Finance, 9, 203-228.

Ben-Tal, A., El-Ghaoui, L., & Nemirovski, A. (2009). Robust optimization. Princeton, NJ: Princeton University Press.

Bennett, K. P., & Bredensteiner, E. J. (2000). Duality and geometry in SVM classifiers. In Proceedings of the 17th International Conference on Machine Learning (pp. 57-64). San Francisco: Morgan Kaufmann.

Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.

Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. Irvine, CA: University of California, Irvine.

Briec, W. (1997). Minimum distance to the complement of a convex set: Duality result. Journal of Optimization Theory and Applications, 93(2), 301-319.

Brooks, J. P. (2011). Support vector machines with the ramp loss and the hard margin loss. Operations Research, 59(2), 467-479.

Chang, C. C., & Lin, C. J. (2001). LIBSVM: A library for support vector machines (Tech. Rep.). Department of Computer Science, National Taiwan University.

Collobert, R., Sinz, F., Weston, J., & Bottou, L. (2006). Trading convexity for scalability. In Proceedings of the 23rd International Conference on Machine Learning (pp. 129-136). New York: ACM.

Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273-297.

Crisp, D. J., & Burges, C. J. C. (2000). A geometric interpretation of ν-SVM classifiers. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 244-250). Cambridge, MA: MIT Press.

Fukunaga, K. (1990). Introduction to statistical pattern recognition. Boston: Academic Press.

Lanckriet, G. R. G., Ghaoui, L. E., Bhattacharyya, C., & Jordan, M. I. (2002). A robust minimax approach to classification. Journal of Machine Learning Research, 3, 555-582.

Larsen, N., Mausser, H., & Uryasev, S. (2002). Algorithms for optimization of value-at-risk. In P. Pardalos & V. K. Tsitsiringos (Eds.), Financial engineering, e-commerce and supply chain (pp. 129-157). Boston: Kluwer Academic Publishers.

Perez-Cruz, F., Weston, J., Hermann, D. J. L., & Schölkopf, B. (2003). Extension of the ν-SVM range for classification. In J. A. K. Suykens, G. Horvath, S. Basu, C. Micchelli, & J. Vandewalle (Eds.), Advances in learning theory: Methods, models and applications 190 (pp. 179-196). Amsterdam: IOS Press.

Rockafellar, R. T., & Uryasev, S. (2002). Conditional value-at-risk for general loss distributions. Journal of Banking and Finance, 26(7), 1443-1472.

Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.

Schölkopf, B., Smola, A., Williamson, R., & Bartlett, P. (2000). New support vector algorithms. Neural Computation, 12(5), 1207-1245.

Shen, X., Tseng, G. C., Zhang, X., & Wong, W. H. (2003). On ψ-learning. Journal of the American Statistical Association, 98(463), 724-734.

Takeda, A., Mitsugi, H., & Kanamori, T. (2012). A unified robust classification model. In Proceedings of the 29th International Conference on Machine Learning (pp. 129-136). Madison, WI: Omnipress.

Takeda, A., Mitsugi, H., & Kanamori, T. (2013). A unified classification model based on robust optimization. Neural Computation, 25(3), 759-804.

Takeda, A., & Sugiyama, M. (2008). ν-support vector machine as conditional value-at-risk minimization. In Proceedings of the 25th International Conference on Machine Learning (pp. 1056-1062). New York: ACM.

Tsyurmasto, P., Zabarankin, M., & Uryasev, S. (2014). Value-at-risk support vector machine: Stability to outliers. Journal of Combinatorial Optimization, 28, 218-232.

Wu, Y., & Liu, Y. (2007). Robust truncated hinge loss support vector machines. Journal of the American Statistical Association, 102(479), 974-983.

Xu, H., Caramanis, C., Mannor, S., &