## Abstract

Nonconvex variants of support vector machines (SVMs) have been developed for various purposes. For example, robust SVMs attain robustness to outliers by using a nonconvex loss function, while extended ν-SVM (Eν-SVM) extends the range of the hyperparameter ν by introducing a nonconvex constraint. Here, we consider an extended robust support vector machine (ER-SVM), a robust variant of Eν-SVM. ER-SVM combines two types of nonconvexity from robust SVMs and Eν-SVM. Because of the two nonconvexities, the existing algorithm we proposed needs to be divided into two parts depending on whether the hyperparameter value is in the extended range or not. The algorithm also heuristically solves the nonconvex problem in the extended range.

In this letter, we propose a new, efficient algorithm for ER-SVM. The algorithm deals with both types of nonconvexity while never entailing more computations than either Eν-SVM or robust SVM, and it finds a critical point of ER-SVM. Furthermore, we show that ER-SVM includes the existing robust SVMs as special cases. Numerical experiments confirm the effectiveness of integrating the two nonconvexities.

## 1 Introduction

The support vector machine (SVM), one of the most successful machine learning models, has many variants. The original form of SVM, C-SVM (Cortes & Vapnik, 1995), is popular because of its generalization ability and convexity. An alternative SVM formulation, ν-SVM (Schölkopf, Smola, Williamson, & Bartlett, 2000), is known to be equivalent to C-SVM, and extended ν-SVM (Eν-SVM) (Perez-Cruz, Weston, Hermann, & Schölkopf, 2003) is a nonconvex extension of ν-SVM. Eν-SVM introduces a nonconvex norm constraint instead of a regularization term in the objective function, and this nonconvex constraint makes it possible to extend the range of the hyperparameter ν. Eν-SVM includes ν-SVM as a special case, and Eν-SVM empirically outperforms ν-SVM owing to the extension (see Perez-Cruz et al., 2003). Furthermore, Takeda and Sugiyama (2008) showed that Eν-SVM minimizes the conditional value-at-risk (CVaR), a popular coherent risk measure in finance. However, CVaR is sensitive to tail risks, and the same holds true for Eν-SVM. Unfortunately, this also implies that such SVMs might not be sufficiently robust to outliers.

Various nonconvex SVMs have been studied with the goal of ensuring robustness to outliers. Indeed, there are many such models, which are called robust SVMs. In this letter, “robust SVMs” means any robust variant of SVM that uses a nonconvex loss function.^{1} Ramp-loss SVM (Collobert, Sinz, Weston, & Bottou, 2006; Xu, Crammer, & Schuurmans, 2006) is a popular robust SVM. The idea is to truncate the hinge loss and bound the value of the loss function by a constant. Moreover, any loss function, not only the hinge loss, can be truncated in the same way as the ramp loss. The framework of such truncated loss functions is studied, for example, in Shen, Tseng, Zhang, and Wong (2003) and Yu, Yang, Xu, White, and Schuurmans (2010).
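To make the truncation concrete, here is a minimal sketch (our illustration, not the authors' code) of the ramp loss written as a difference of two hinge losses, with a hypothetical truncation level `s`; this difference-of-convex form is exactly the structure exploited later by DC programming:

```python
def hinge(z, margin=1.0):
    """Hinge loss H_margin(z) = max(0, margin - z)."""
    return max(0.0, margin - z)

def ramp(z, s=0.0):
    """Ramp loss: the hinge loss truncated at the constant 1 - s.

    Writing it as hinge(z, 1) - hinge(z, s) exposes the
    difference-of-convex structure used by DC algorithms.
    """
    return hinge(z, 1.0) - hinge(z, s)

# The ramp loss coincides with the hinge loss for small losses and
# is capped at 1 - s for badly misclassified points (outliers).
print(ramp(2.0, s=0.0))   # correctly classified with margin: zero loss
print(ramp(0.5, s=0.0))   # inside the margin: same as the hinge loss
print(ramp(-5.0, s=0.0))  # far on the wrong side: bounded, unlike the hinge
```

The cap is what gives robustness: an outlier can contribute at most a constant to the objective, no matter how far it lies from the hyperplane.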

Xu et al. (2006) also proposed robust outlier detection (ROD) as a model explicitly identifying outliers. CVaR-()-SVM (Tsyurmasto, Uryasev, & Gotoh, 2013) has also been proposed as a robust SVM. However, we can prove that ROD and CVaR-()-SVM are equivalent to ramp-loss SVM with appropriately set parameters. On the other hand, Takeda, Fujiwara, and Kanamori (2014) recently proposed extended robust SVM (ER-SVM) as a robust variant of Eν-SVM, the nonconvex extension of ν-SVM. That is, while the existing robust SVMs are robust variants of C-SVM or ν-SVM, ER-SVM is a robust variant of Eν-SVM, as indicated in Table 1.

Table 1: SVM variants classified by their loss function and regularizer.

| | | Regularizer | |
|---|---|---|---|
| | | Convex | Nonconvex |
| Loss | Convex | C-SVM, ν-SVM | Eν-SVM |
| | Nonconvex | Robust outlier detection, ramp-loss SVM, CVaR-()-SVM | ER-SVM |


Note: The models in the right (resp. bottom) cell include the models in the left (resp. top) cell as special cases.

### 1.1 Nonconvex Optimization and DC Programming

The important issue regarding nonconvex variants is how to solve their difficult nonconvex problems. Difference of convex functions (DC) programming is a powerful framework for dealing with nonconvex problems. It is known that various nonconvex problems can be formulated as DC programs whose objective function is expressed as g − h by using two convex functions g and h (see Horst & Tuy, 1996).

The DC algorithm (DCA) introduced in Pham Dinh (1988) is one of the most efficient algorithms for DC programs. The basic idea behind it is to linearize the concave part and sequentially solve the convex subproblem. The local and global optimality conditions, convergence properties, and the duality of DC programs were studied using convex analysis (Rockafellar, 1970). For a general DC program, every limit point of the sequence generated by DCA is a critical point, which is also called a generalized Karush-Kuhn-Tucker (KKT) point. It is remarkable that DCA does not require differentiability in order to ensure its convergence. Furthermore, it is known that DCA converges quite often to a global solution (see Le Thi & Pham Dinh, 2005; Pham Dinh & Le Thi, 1997).
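As a toy illustration of this idea (our construction, not from the letter), consider DCA on the one-dimensional DC function f(x) = x² − |x|, with g(x) = x² and h(x) = |x|. Each iteration linearizes the concave part −|x| with a subgradient of h and minimizes the resulting convex majorizer in closed form:

```python
def dca(x0, iters=20):
    """DCA for f(x) = g(x) - h(x) with g(x) = x**2 and h(x) = abs(x).

    Each step: pick y in the subdifferential of h at the current point,
    then minimize the convex function g(x) - y*x, whose minimizer is y/2.
    """
    x = x0
    for _ in range(iters):
        # subgradient of h(x) = |x| (we pick 0 at the kink x = 0)
        y = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)
        x = y / 2.0  # closed-form argmin of x**2 - y*x
    return x

print(dca(0.3))   # converges to 0.5, a global minimizer of f
print(dca(-2.0))  # converges to -0.5, the symmetric global minimizer
print(dca(0.0))   # stays at 0.0, a critical point that is not a minimum
```

Note that h is not differentiable at 0, yet DCA is still well defined there; the run started at 0 also shows why convergence is guaranteed only to a critical point, not necessarily a local minimum.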

A similar method, the concave-convex procedure (CCCP) (Yuille & Rangarajan, 2003), has been studied in the machine learning literature. Indeed, it can be shown that if the concave part h of the objective function g − h is differentiable, then DCA exactly reduces to CCCP. Smola, Vishwanathan, and Hofmann (2005) proposed constrained CCCP to deal with DC constraints, while Sriperumbudur and Lanckriet (2012) studied the global convergence properties of (constrained) CCCP, proving that the sequence generated by CCCP converges to a stationary point under conditions such as differentiability and strict convexity. However, since our model is not differentiable, we will use DCA and take advantage of the theoretical results on DCA, such as those on its convergence properties.

### 1.2 Contributions

The main contribution of this letter is a new, efficient algorithm based on DCA for ER-SVM. ER-SVM has two nonconvexities: the nonconvexity in the objective function is due to the truncated loss function, and the nonconvexity in the constraint is due to the extension of the range of the hyperparameter ν. We express the truncated loss function of ER-SVM as the difference of two CVaRs, which are convex functions, and move the nonconvex term in its constraint to the objective by using an exact penalty. While being equivalent to ER-SVM, the resulting formulation allows us to apply DCA to ER-SVM and gives an intuitive interpretation of ER-SVM.

The previous algorithm proposed by Takeda et al. (2014) is heuristic and does not have a theoretical guarantee. Furthermore, it is not simple because it needs to use two different procedures depending on the value of the hyperparameter ν. Our algorithm instead works with any value of ν, and it can find a critical point of ER-SVM. Though ER-SVM inherits both of the nonconvexities of Eν-SVM and robust SVMs (e.g., ROD and ramp-loss SVM), our new algorithm is simple and comparable to their algorithms. Indeed, our algorithm is similar to the algorithm (Collobert et al., 2006) for ramp-loss SVM, which is known to be fast. Our code for solving ER-SVM by DCA is publicly available at Fujiwara, Takeda, and Kanamori (2016).

Furthermore, we clarify the relations among various robust SVMs, including ER-SVM. Though there are many robust variants of SVMs, their relations to each other have not been discussed, except that Takeda et al. (2014) showed that ER-SVM and ramp-loss SVM share the same global optimal solutions. In this letter, we prove that ROD and CVaR-()-SVM are equivalent to ramp-loss SVM and also show that ER-SVM includes ramp-loss SVM and ROD as special cases in the sense of having the same KKT points. More specifically, a special case of ER-SVM (whose range of the hyperparameter ν is limited), ramp-loss SVM, and ROD share all KKT points. Therefore, as Table 1 shows, ER-SVM can be regarded not simply as a robust variant of Eν-SVM but rather as a natural extension of ramp-loss SVM and ROD.

### 1.3 Outline of the Letter

This letter is organized as follows. Section 2 provides a preliminary description of the basic notions. In sections 2.1 and 2.2, we describe the existing SVMs and their variants. Section 2.3 briefly describes the definitions and properties of popular financial risk measures such as CVaR and VaR. Section 3 describes some important properties of ER-SVM. After showing a DC programming formulation of ER-SVM, which is a minimization of the difference of CVaRs, we discuss the relationship of ramp-loss SVM, ROD, and ER-SVM. Section 4 describes our new algorithm, DCA for ER-SVM. Numerical results are presented in section 5.

## 2 Preliminary

Our algorithm can be applied to nonlinear models by using a kernel technique. We describe a kernel variant of ER-SVM in section 3.4.

### 2.1 Support Vector Machines

#### 2.1.1 Convex SVMs

#### 2.1.2 Nonconvex SVMs

Both parameters are hyperparameters. The original formulation in Xu et al. (2006) defines ROD with an inequality constraint, but we can replace it with an equality because the replacement does not change the optimal value.

### 2.2 Extended Robust Support Vector Machine

where ν and μ are hyperparameters.

ER-SVM, equation 2.6, is obtained by relaxing the 0-1 integer constraints of the original formulation in Takeda et al. (2014) and replacing the inequality constraint of the original one with the equality variant. The relaxation does not change the problem when the fraction μ of removed samples corresponds to an integer number of samples; more precisely, in that case, ER-SVM, equation 2.6, has an optimal solution whose relaxed 0-1 variables are all integral.

In this case, ER-SVM, equation 2.6, removes the designated fraction of samples and applies Eν-SVM, equation 2.2, to the rest. Indeed, the heuristic algorithm of Takeda et al. (2014) for ER-SVM removes samples little by little until the total number of removed samples reaches the designated level. In this letter, we use formulation 2.6 and call it ER-SVM.

It can be easily seen that for fixed μ, the optimal value of equation 2.6 decreases as ν increases. Moreover, it is shown in Takeda et al. (2014) that the nonconvex constraint ‖w‖ = 1 can be relaxed to ‖w‖ ≤ 1 without changing the optimal solution so long as the optimal value is negative. Just like Eν-SVM, ER-SVM, equation 2.6, has a threshold of the hyperparameter ν (we denote it by ν_min) where the optimal value equals zero, and the nonconvex constraint is essential for ER-SVM with ν ≤ ν_min. The nonconvex constraint removes the lower threshold of ν and extends the admissible range of ν in the same way as Eν-SVM (see Table 2). We show in section 3.2 that a special case (case C in Table 2) of ER-SVM is equivalent to ROD, equation 2.5, and ramp-loss SVM, equation 2.4, in the same way that a special case (case C) of Eν-SVM is equivalent to ν-SVM. Hence, ER-SVM can be seen as a natural extension of robust SVMs such as ROD and ramp-loss SVM. ER-SVM, equation 2.6, also has an upper threshold ν_max, which, similar to Eν-SVM and ν-SVM, makes the problem a bounded one.

Table 2: Ranges of the hyperparameter ν for ν-SVM, Eν-SVM, and ER-SVM.

| | | Case N | Case C | |
|---|---|---|---|---|
| ν-SVM | Range of ν | 0 < ν ≤ ν_min | ν_min < ν ≤ ν_max | ν > ν_max |
| | Optimal value | 0 | Negative | Unbounded |
| | Optimal solution | – | Admissible | – |
| Eν-SVM | Range of ν | 0 < ν ≤ ν_min | ν_min < ν ≤ ν_max | ν > ν_max |
| | Optimal value | Nonnegative | Negative | Unbounded |
| | Optimal solution | Admissible | Admissible | – |
| | Constraint | ‖w‖ = 1 | ‖w‖ ≤ 1 | |
| ER-SVM | Range of ν | 0 < ν ≤ ν_min | ν_min < ν ≤ ν_max | ν > ν_max |
| | Optimal value | Nonnegative | Negative | Unbounded |
| | Optimal solution | Admissible | Admissible | – |
| | Constraint | ‖w‖ = 1 | ‖w‖ ≤ 1 | |


Notes: If ν is greater than the lower threshold ν_min (case C), the nonconvex constraint ‖w‖ = 1 of Eν-SVM and ER-SVM can be relaxed to the convex constraint ‖w‖ ≤ 1 without changing the optimal solutions. Case C of Eν-SVM is equivalent to ν-SVM.

### 2.3 Financial Risk Measures


## 3 Properties of Extended Robust SVM

### 3.1 Decomposition using Conditional Value-at-Risks

Here, we give an intuitive interpretation of ER-SVM, equation 2.6, using two CVaRs. Eν-SVM has been shown to minimize CVaR (Takeda & Sugiyama, 2008). On the other hand, ER-SVM, equation 2.6, ignores a fraction μ of the samples and solves Eν-SVM using the rest: that is, it minimizes CVaR over the remaining samples. Hence, ER-SVM can be regarded as minimizing the average loss over the gray area of Figure 2. The mean of the gray area in Figure 2 can be described using two CVaRs, as is done in Wozabal (2012):

The proof of proposition 3 is in appendix A.

Since CVaR is convex in (w, b) for a fixed level, the objective function in equation 2.9 can be described as the difference between two convex functions. A similar model, CVaR-()-SVM, which relaxes the nonconvex constraint of equation 3.1 to a convex one, was recently proposed (Tsyurmasto et al., 2013). As Table 2 shows, such a model is a special case (case C) of ER-SVM, and it is essentially equivalent to ramp-loss SVM or ROD.
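The decomposition can be checked numerically with empirical CVaRs. In this sketch (our notation; we assume the empirical CVaR at level α is the mean of the α·n largest losses and that μ·n and ν·n are integers), the mean loss over the "gray area" between the top-μ tail and the top-ν tail is a weighted difference of two CVaRs:

```python
def empirical_cvar(losses, alpha):
    """Mean of the alpha*n largest losses (alpha*n assumed integral)."""
    k = round(alpha * len(losses))
    return sum(sorted(losses, reverse=True)[:k]) / k

losses = [10, 8, 6, 4, 2, 1, 0, -1, -2, -3]
mu, nu = 0.1, 0.4  # discard the top 10% as outliers, average the next 30%

# Direct mean over the gray area: sorted ranks mu*n+1 through nu*n.
tail = sorted(losses, reverse=True)
direct = sum(tail[1:4]) / 3  # (8 + 6 + 4) / 3

# The same quantity via the weighted difference of two CVaRs.
via_cvars = (nu * empirical_cvar(losses, nu)
             - mu * empirical_cvar(losses, mu)) / (nu - mu)
print(direct, via_cvars)  # identical values
```

The identity holds because ν·n·CVaR_ν is the sum of the top ν·n losses; subtracting the sum of the top μ·n losses leaves exactly the gray-area sum.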

### 3.2 Relationship with Existing Models

Here, we discuss the relationship between ER-SVM, ROD, and ramp-loss SVM by using the KKT conditions shown in appendix B. We begin by showing the equivalence of ramp-loss SVM and ROD. While ROD was proposed (Xu et al., 2006) as a direct approach that, unlike ramp-loss SVM, explicitly incorporates outlier suppression, the following lemma implies their equivalence.

The proof of lemma 4 is in section B.1.

Theorem 5 shows the equivalence of ROD and the special case (case C in Table 2) of ER-SVM.

(relation between ER-SVM and ROD). Case C of ER-SVM (in Table 2), that is, equation B.3, and ROD, equation 2.5, share all KKT points in the following sense:

Let a point satisfy the KKT conditions of ROD, equation 2.5, and suppose its weight vector is nonzero. Then it satisfies the KKT conditions of case C of ER-SVM with a corresponding hyperparameter value.

Let a point satisfy the KKT conditions of case C of ER-SVM, and suppose its objective value is nonzero. Then it satisfies the KKT conditions of ROD with a corresponding hyperparameter value.

See section B.2 for the proof of theorem 5.

From lemma 4 and theorem 5, ramp-loss SVM and ROD can be regarded as a special case (case C in Table 2) of ER-SVM. As we discussed in section 3.1, CVaR-()-SVM (Tsyurmasto et al., 2013) is also equivalent to case C (in Table 2) of ER-SVM. Theorem 5 is similar to the relation between C-SVM and ν-SVM (which is a special case of Eν-SVM). It was shown that the sets of global solutions of C-SVM and ν-SVM correspond to each other when the hyperparameters are set properly (see Chang & Lin, 2001b, and Schölkopf et al., 2000). We used KKT conditions to show the relationship between the nonconvex models.

### 3.3 Motivation for Nonconvex Regularizer

The nonconvex constraint helps to remove the lower threshold of the hyperparameter ν, as Table 2 shows. The extension of the admissible range of ν gives us a chance of finding a better classifier that cannot be found by using a convex regularizer as in ν-SVM, equation 2.1, and robust SVMs, equation 2.3. Perez-Cruz et al. (2003) showed examples where Eν-SVM outperforms ν-SVM owing to the extended range of ν.

Besides this empirical evidence, there is theoretical evidence as to why the extended parameter range is needed for ν-SVM. More precisely, theorem 6 gives an explicit condition on data sets under which the admissible range of ν for ν-SVM is empty, and therefore ν-SVM and C-SVM obtain a trivial classifier with w = 0 for any hyperparameter value of ν or C. The condition also applies to robust SVMs after all outliers are removed. Indeed, Figure 4e shows a numerical example where ν-SVM, C-SVM, and ramp-loss SVM obtain trivial classifiers for any hyperparameter value, but ER-SVM obtains a nontrivial one.

Rifkin, Pontil, and Verri (1999) studied the conditions under which C-SVM obtains a trivial solution. We directly connect their statements to ν-SVM and strengthen them in theorem 6 by adding a geometric interpretation of the case in which the admissible range of ν is empty for ν-SVM.

The proof is in appendix C.

### 3.4 Kernelization

Linear learning models can be modified into more powerful nonlinear ones by using a kernel technique. Here, we briefly introduce a kernel variant of ER-SVM, equation 2.6. The input sample x is mapped to φ(x) in a high- (even infinite-) dimensional inner product space H, and a classifier of the form ⟨w, φ(x)⟩ + b is learned from the training samples, where ⟨·, ·⟩ denotes the inner product in H. The kernel function is defined as k(x, x′) = ⟨φ(x), φ(x′)⟩.
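For intuition, here is a minimal sketch (ours, not from the letter) verifying the defining identity k(x, x′) = ⟨φ(x), φ(x′)⟩ for the homogeneous polynomial kernel of degree 2 in two dimensions, whose explicit feature map is φ(x) = (x₁², √2·x₁x₂, x₂²):

```python
import math

def phi(x):
    """Explicit feature map of the degree-2 homogeneous polynomial kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def kernel(x, z):
    """k(x, z) = <x, z>**2, computed without ever forming phi."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
lhs = kernel(x, z)                               # (1*3 + 2*(-1))**2
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))  # inner product in H
print(lhs, rhs)  # identical values
```

The point of the kernel trick is the left-hand side: the learner only ever evaluates k(x, x′), so φ and H never need to be constructed explicitly, even when H is infinite-dimensional (as for the gaussian kernel).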

## 4 Algorithm

### 4.1 Simplified DCA

- The objective value is decreasing (i.e., g(x^(k+1)) − h(x^(k+1)) ≤ g(x^k) − h(x^k)).

- DCA has global convergence.

- Every limit point of the generated sequence is a critical point of g − h, which is also called a generalized KKT point.

A point is said to be a critical point of g − h if ∂g ∩ ∂h ≠ ∅ at that point. This implies that a critical point admits subgradients ξg ∈ ∂g and ξh ∈ ∂h such that ξg = ξh (equivalently, 0 ∈ ∂g − ∂h), a necessary condition for local minima. When equation 4.1 has a convex constraint set C, we can define the critical point by replacing g with g + χ_C, where χ_C is the indicator function equal to 0 on C and +∞ otherwise.

### 4.2 DCA for Extended Robust SVM

This reformulation is a special case of the exact penalty approach (see Pham Dinh & Le Thi, 1997; Pham Dinh, Le Thi, & Muu, 1999). There exists a threshold value of the penalty parameter such that equations 3.1 and 4.3 have the same set of optimal solutions for all penalty parameters above it. We can estimate an upper bound of this threshold in our case by invoking the following lemma.

The key point in the proof of lemma 7 is that CVaR has positive homogeneity (i.e., scaling its argument by any λ > 0 scales the CVaR by λ). This is a well-known property of coherent risk measures such as CVaR (e.g., Artzner, Delbaen, Eber, & Heath, 1999).
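Positive homogeneity is easy to verify on the empirical CVaR. In this sketch (ours; the empirical CVaR at level α is taken as the mean of the ⌈α·n⌉ largest losses), scaling all losses by λ > 0 scales the CVaR by exactly λ:

```python
import math

def empirical_cvar(losses, alpha):
    """Mean of the ceil(alpha * n) largest losses."""
    k = math.ceil(alpha * len(losses))
    return sum(sorted(losses, reverse=True)[:k]) / k

losses = [3.0, -1.0, 2.0, 0.5, -2.5, 4.0]
lam = 2.5  # any positive scaling factor

# Positive homogeneity: CVaR(lam * L) = lam * CVaR(L) for lam > 0.
scaled = empirical_cvar([lam * l for l in losses], 0.5)
print(scaled, lam * empirical_cvar(losses, 0.5))  # identical values
```

Scaling preserves the ordering of the losses, so the same samples form the tail before and after scaling; the mean of the tail then scales linearly, which is exactly the property exploited in the penalty-parameter bound.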

#### 4.2.1 Subdifferential of CVaR

Let I be the index set corresponding to the largest values among the losses. We can easily find an optimal solution by assigning 0 to the variables indexed by I and 1 to the others.
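The assignment described above can be sketched as follows (our code; the losses and the number k of discarded samples are illustrative placeholders):

```python
def outlier_indicator(losses, k):
    """Assign 0 to the k largest losses and 1 to the rest.

    This is the closed-form solution of the linear subproblem:
    putting weight 0 on the k largest losses minimizes the
    weighted sum of losses over the 0-1 box.
    """
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    eta = [1.0] * len(losses)
    for i in order[:k]:
        eta[i] = 0.0
    return eta

losses = [0.2, 3.5, -1.0, 2.8, 0.0]
print(outlier_indicator(losses, 2))  # zeros at the two largest losses
```

The sort makes the selection O(n log n); since only the top-k set matters, a partial selection would suffice for large n.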

#### 4.2.2 Efficient Update of

Condition 4.7 ensures that the optimal value of equation 4.6 is negative, since the solution from the previous iteration has a negative objective value. With such a penalty parameter, lemma 7 holds.

#### 4.2.3 Explicit Form of Subproblem

The steps of the algorithm are listed in algorithm 1. The algorithm also has nice convergence properties, such as linear convergence, as shown in section 4.1.

#### 4.2.4 CVaR-Function Decomposition

The decomposition of two CVaR functions enables us to interpret the algorithm and the resulting solution. The algorithm repeatedly solves subproblems by linearizing a concave part that corresponds to underestimating the sum of losses of outliers. The approximation becomes more accurate as the algorithm iterates.

There is another advantage to the decomposition. If the given parameter value is in case C, the concave penalty term disappears from the objective function in equation 4.3, and the proposed algorithm acts like a polyhedral DC algorithm, which has the nice property of finite convergence (see Pham Dinh & Le Thi, 1997). This property is due to the polyhedral convexity of the CVaR functions.

## 5 Numerical Results

We compared our approach, DCA for ER-SVM, with the heuristic algorithm (Takeda et al., 2014) for ER-SVM, CCCP (Collobert et al., 2006) for ramp-loss SVM, the Eν-SVM algorithm (Perez-Cruz et al., 2003), and C-SVM from the LIBSVM software package (Chang & Lin, 2001a). One hyperparameter of ramp-loss SVM, equation 2.3, was fixed to 1, and μ in ER-SVM, equation 2.6, was set to 0.05; all models thus had one hyperparameter to be tuned. A comparison with ROD, equation 2.5, using SDP relaxation (Xu et al., 2006) is omitted because ROD is equivalent to ramp-loss SVM (see lemma 4).

We implemented all methods in Python (version 2.7.6) and used IBM ILOG CPLEX (version 12.6) to solve the optimization problems involved in these four methods: DCA and the heuristic algorithm for ER-SVM, CCCP for ramp-loss SVM, and the Eν-SVM algorithm. We solved the primal formulation, equation 4.8, while running DCA in order to deal with data sets with large sample sizes and few features. In the CCCP algorithm for ramp-loss SVM, dual formulations of convex subproblems were solved sequentially (see algorithm 2 of Collobert et al., 2006). Almost all of the numerical experiments were done on a PC (CPU: Intel Core i5-3437U (1.90 GHz), Memory: 8 GB, OS: Ubuntu 14.04); a Linux server (CPU: Intel Xeon E5-2680 (2.70 GHz core), Memory: 64 GB, OS: Red Hat Enterprise Linux Server) was used for the large-scale data sets.

### 5.1 Accuracy for Synthetic Data Sets

We used synthetic data generated by following the procedure in Xu et al. (2006). We generated two-dimensional samples with labels +1 and −1 from two normal distributions with different mean vectors and the same covariance matrix. The optimal hyperplane for the noiseless data set is known by construction. We added outliers to only one class of the training set by drawing samples uniformly from a half-ring with fixed center, inner radius, and outer radius. The training set contained 50 samples from each class (100 in total), including outliers. The ratio of outliers in the training set was varied from 0 to 10%. The test set had 1000 samples from each class (2000 in total). We repeated the experiments 100 times, drawing new training and test sets in every repetition. We found the best parameter setting from nine candidates.
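The half-ring outlier generation can be sketched as follows (our reconstruction; the center and radii are placeholders, not the values used in the letter, and we sample uniformly in angle and radius rather than uniformly by area):

```python
import math
import random

def half_ring_outliers(n, center=(0.0, 0.0), r_in=5.0, r_out=7.0, seed=0):
    """Draw n points at random from a half-ring.

    Angles are restricted to [0, pi] (one half-plane) and radii to
    [r_in, r_out]; center and radii here are illustrative placeholders.
    """
    rng = random.Random(seed)
    pts = []
    for _ in range(n):
        theta = rng.uniform(0.0, math.pi)
        r = rng.uniform(r_in, r_out)
        pts.append((center[0] + r * math.cos(theta),
                    center[1] + r * math.sin(theta)))
    return pts

outliers = half_ring_outliers(5)
print(outliers[0])  # a point whose distance from the center lies in [r_in, r_out]
```

Restricting the angle to one half-plane places all outliers on one side of the separating hyperplane, which is what makes them pull a non-robust classifier away from the optimum.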

Figure 3a plots the test accuracy of each model versus the outlier ratio. Each error bar spans the 25th to 75th percentiles. ER-SVM by DCA achieved high average accuracy with small standard deviations, especially when the outlier ratio was large. DCA and the heuristic algorithm (Takeda et al., 2014) solved the same ER-SVM problem, but the heuristic one performed poorly; it seemed to get stuck in poor local solutions.

### 5.2 Accuracy on Real Data Sets

We used data sets from the UCI repository (Blake & Merz, 1998) and LIBSVM (Chang & Lin, 2001a; see Table 3). We scaled the original data sets so that all attributes had zero mean and unit variance. For most data sets, we generated outliers uniformly from a ring with fixed center and radius. The radius for generating outliers was set so that the outliers would have an impact on the test accuracy. When the feature size of a data set was large, we generated outliers in different ways: label flips for binary-class data sets or samples from a third class for multiclass data sets.

Table 3: Data sets used in the experiments.

| Data Set | Used Classes | Number of Features | Number of Samples | Outlier Type |
|---|---|---|---|---|
| svmguide1 | | 4 | 3089 | |
| cod-rna | | 8 | 53,581 | |
| diabetes | | 8 | 768 | |
| vehicle | Class 1 versus rest | 18 | 846 | |
| satimage | Class 6 versus rest | 36 | 4435 | |
| splice | | 60 | 1000 | |
| mushrooms | | 112 | 8124 | label-flip |
| adult | | 123 | 1605 | |
| dna | Class 2 versus class 3 | 180 | 3955 | class 1 |
| MNIST | Class 1 versus class 7 | 768 | 15,170 | class 9 |
| internet ad | | 1554 | 3279 | label-flip |


Notes: For most data sets, outliers were generated on a sphere with fixed center and radius. The radius was set so that the outliers would have an impact on the test accuracy. When the feature size of a data set is large, outliers were generated in different ways: label flips for binary-class data sets or samples from a third class for multiclass data sets. An entry for cod-rna could not be calculated because the algorithm ran out of memory.

The best parameter was chosen from nine candidates taken at equal intervals. In this experiment, we split each data set into a training set, validation set, and test set in the ratio 4:3:3. The hyperparameters were determined on the training and validation sets, both of which were contaminated. The performance was evaluated on the clean test set.

Table 4 shows the results of 30 trials for most data sets (only 3 or 10 trials for the large-scale data sets because of long computation times). "–" in Table 4 indicates an out-of-memory error. "N" in Table 4 means that ER-SVM (or Eν-SVM) achieved its best accuracy with ν in the extended range (case N) at least once, and "C" means that the best accuracy was achieved with ν in case C in all trials.

Table 4: Average test accuracy (ave.acc) and F1 score on real data sets.

| Data Set | OR | ER-SVM (DCA) | | | ER-SVM (heuristics) | | | Ramp-Loss SVM | | Eν-SVM | | | C-SVM | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | ave.acc | F1 | Case | ave.acc | F1 | Case | ave.acc | F1 | ave.acc | F1 | Case | ave.acc | F1 |
| svmguide1 | 0% | 0.954 | 0.965 | C | 0.956 | 0.966 | N | 0.959 | 0.968 | 0.954 | 0.964 | N | 0.954 | 0.964 |
| | 3% | 0.944 | 0.956 | C | 0.939 | 0.952 | N | 0.946 | 0.958 | 0.921 | 0.939 | N | 0.879 | 0.911 |
| | 5% | 0.948 | 0.959 | C | 0.935 | 0.950 | N | 0.946 | 0.958 | 0.909 | 0.931 | N | 0.758 | 0.846 |
| | 10% | 0.937 | 0.952 | N | 0.920 | 0.939 | N | 0.893 | 0.925 | 0.887 | 0.916 | N | 0.658 | 0.790 |
| cod-rna | 0% | 0.939 | 0.912 | C | 0.939 | 0.911 | N | – | – | 0.938 | 0.909 | N | 0.939 | 0.910 |
| | 3% | 0.937 | 0.908 | C | 0.937 | 0.908 | N | – | – | 0.935 | 0.903 | N | 0.914 | 0.863 |
| | 5% | 0.938 | 0.911 | C | 0.938 | 0.910 | N | – | – | 0.936 | 0.905 | N | 0.773 | 0.522 |
| | 10% | 0.936 | 0.904 | C | 0.936 | 0.906 | N | – | – | 0.932 | 0.897 | N | 0.667 | 0.500 |
| diabetes | 0% | 0.757 | 0.829 | N | 0.755 | 0.827 | N | 0.760 | 0.828 | 0.762 | 0.828 | N | 0.769 | 0.836 |
| | 3% | 0.756 | 0.823 | N | 0.742 | 0.819 | N | 0.752 | 0.820 | 0.754 | 0.821 | N | 0.749 | 0.824 |
| | 5% | 0.757 | 0.827 | N | 0.749 | 0.822 | N | 0.753 | 0.822 | 0.746 | 0.816 | N | 0.747 | 0.823 |
| | 10% | 0.750 | 0.822 | N | 0.728 | 0.811 | N | 0.742 | 0.818 | 0.732 | 0.811 | N | 0.728 | 0.812 |
| vehicle | 0% | 0.789 | 0.535 | N | 0.781 | 0.524 | N | 0.787 | 0.534 | 0.792 | 0.530 | N | 0.791 | 0.532 |
| | 3% | 0.773 | 0.517 | N | 0.762 | 0.480 | N | 0.777 | 0.492 | 0.767 | 0.488 | N | 0.780 | 0.465 |
| | 5% | 0.779 | 0.498 | N | 0.759 | 0.434 | N | 0.761 | 0.403 | 0.751 | 0.420 | N | 0.756 | 0.423 |
| | 10% | 0.768 | 0.461 | N | 0.749 | 0.326 | N | 0.757 | 0.129 | 0.714 | 0.304 | N | 0.745 | 0.386 |
| satimage | 0% | 0.908 | 0.792 | C | 0.905 | 0.788 | N | 0.908 | 0.793 | 0.909 | 0.798 | N | 0.911 | 0.801 |
| | 3% | 0.903 | 0.786 | N | 0.893 | 0.764 | N | 0.899 | 0.777 | 0.896 | 0.773 | N | 0.898 | 0.778 |
| | 5% | 0.899 | 0.775 | N | 0.889 | 0.752 | N | 0.898 | 0.775 | 0.892 | 0.764 | N | 0.891 | 0.762 |
| | 10% | 0.893 | 0.766 | N | 0.888 | 0.753 | N | 0.897 | 0.773 | 0.890 | 0.753 | N | 0.874 | 0.738 |
| splice | 0% | 0.784 | 0.784 | C | 0.772 | 0.778 | N | 0.783 | 0.781 | 0.794 | 0.790 | C | 0.783 | 0.786 |
| | 3% | 0.771 | 0.775 | N | 0.756 | 0.760 | N | 0.772 | 0.776 | 0.774 | 0.777 | N | 0.777 | 0.779 |
| | 5% | 0.769 | 0.765 | C | 0.745 | 0.747 | N | 0.756 | 0.756 | 0.761 | 0.761 | N | 0.755 | 0.765 |
| | 10% | 0.756 | 0.753 | N | 0.727 | 0.730 | N | 0.725 | 0.725 | 0.733 | 0.737 | N | 0.737 | 0.743 |


| Data Set | OR | ER-SVM (DCA) | | | ER-SVM (heuristics) | | | Ramp-Loss SVM | | Eν-SVM | | | C-SVM | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | ave.acc | F1 | Case | ave.acc | F1 | Case | ave.acc | F1 | ave.acc | F1 | Case | ave.acc | F1 |
| mushrooms | 0% | 0.982 | 0.981 | C | 0.983 | 0.982 | C | 1.000 | 1.000 | 0.999 | 0.999 | C | 1.000 | 1.000 |
| | 5% | 0.998 | 0.998 | C | 0.941 | 0.936 | C | 0.999 | 0.999 | 0.998 | 0.998 | C | 0.999 | 0.999 |
| | 10% | 0.998 | 0.998 | C | 0.917 | 0.904 | C | 0.998 | 0.998 | 0.998 | 0.998 | C | 0.998 | 0.997 |
| | 15% | 0.998 | 0.997 | C | 0.903 | 0.887 | C | 0.997 | 0.997 | 0.997 | 0.996 | C | 0.997 | 0.997 |
| adult | 0% | 0.820 | 0.588 | C | 0.808 | 0.438 | C | 0.813 | 0.598 | 0.817 | 0.596 | C | 0.809 | 0.599 |
| | 5% | 0.821 | 0.592 | C | 0.797 | 0.381 | N | 0.816 | 0.590 | 0.820 | 0.600 | C | 0.813 | 0.602 |
| | 10% | 0.820 | 0.597 | C | 0.790 | 0.435 | N | 0.817 | 0.585 | 0.817 | 0.601 | C | 0.819 | 0.615 |
| | 15% | 0.817 | 0.580 | C | 0.776 | 0.345 | N | 0.805 | 0.594 | 0.814 | 0.580 | C | 0.810 | 0.595 |
| dna | 0% | 0.971 | 0.955 | C | 0.971 | 0.955 | C | 0.976 | 0.963 | 0.976 | 0.963 | C | 0.974 | 0.960 |
| | 5% | 0.968 | 0.944 | C | 0.961 | 0.938 | N | 0.963 | 0.942 | 0.967 | 0.946 | C | 0.965 | 0.946 |
| | 10% | 0.957 | 0.933 | C | 0.954 | 0.931 | N | 0.958 | 0.934 | 0.963 | 0.942 | C | 0.960 | 0.939 |
| | 15% | 0.950 | 0.924 | C | 0.945 | 0.919 | N | 0.951 | 0.924 | 0.957 | 0.935 | C | 0.955 | 0.932 |
| MNIST | 0% | 0.988 | 0.989 | C | 0.987 | 0.988 | C | 0.993 | 0.994 | 0.992 | 0.992 | C | 0.994 | 0.994 |
| | 5% | 0.986 | 0.986 | C | 0.981 | 0.982 | C | 0.985 | 0.985 | 0.984 | 0.985 | C | 0.987 | 0.987 |
| | 10% | 0.989 | 0.989 | C | 0.976 | 0.976 | C | 0.976 | 0.976 | 0.981 | 0.981 | C | 0.961 | 0.963 |
| | 20% | 0.982 | 0.982 | C | 0.980 | 0.980 | C | 0.972 | 0.973 | 0.872 | 0.877 | C | 0.817 | 0.787 |
| internet ad | 0% | 0.944 | 0.758 | C | 0.921 | 0.619 | C | 0.963 | 0.857 | 0.964 | 0.860 | C | 0.966 | 0.868 |
| | 3% | 0.955 | 0.820 | C | 0.938 | 0.720 | C | 0.960 | 0.843 | 0.960 | 0.839 | C | 0.961 | 0.840 |
| | 5% | 0.960 | 0.837 | C | 0.948 | 0.788 | C | 0.958 | 0.834 | 0.959 | 0.842 | C | 0.961 | 0.834 |
| | 10% | 0.953 | 0.798 | C | 0.953 | 0.800 | C | 0.952 | 0.803 | 0.936 | 0.752 | C | 0.946 | 0.778 |


Notes: The ratio of outliers (OR) varies from 0% to 20%. A dash (–) indicates an out-of-memory error. N indicates that the objective value of ER-SVM (or Eν-SVM) was positive (that is, case N in Table 2 occurred and the nonconvex constraint worked effectively) at least once in the trials. Only 3 or 10 trials were used for the large-scale data sets because of their long computation times. The best scores among these methods are in bold.

The prediction performance of ER-SVM by DCA was better than that of the other methods and very stable as the outlier ratio increased. C-SVM often performs well when a data set contains no or few outliers, but it tended to be beaten by ER-SVM (DCA) as the outlier ratio increased. Our DCA algorithm could find a better solution, which led to ER-SVM by DCA achieving better prediction performance than the heuristic method. The table also shows that the nonconvex cases (N) of ER-SVM and Eν-SVM performed better than their convex cases on the difficult data sets (e.g., diabetes, vehicle, splice), where all linear SVMs performed poorly.

### 5.3 Comparison of CPU Times

Here, we show the trend in computation time of our method. Figure 3b plots the average CPU time within the 25th-to-75th-percentile range versus the outlier ratio for each method on the synthetic data set of Figure 3a. While the computation times of some algorithms, such as C-SVM and ramp-loss SVM, increased slightly as the outlier ratio grew, the time of ER-SVM (DCA) was not significantly affected by increasing the outlier ratio.
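The percentile-trimmed average used for Figure 3b can be sketched as follows; `interpercentile_mean` is an illustrative helper under our reading of the statistic, not code from the paper:

```python
import numpy as np

def interpercentile_mean(times, lo=25, hi=75):
    """Mean of the values lying between the lo-th and hi-th percentiles.

    A robust summary of per-trial CPU times: the fastest and slowest
    quartiles are discarded before averaging.
    """
    times = np.asarray(times, dtype=float)
    p_lo, p_hi = np.percentile(times, [lo, hi])
    inner = times[(times >= p_lo) & (times <= p_hi)]
    return inner.mean()

# The extreme trials 0.9 and 9.0 fall outside the interquartile range
# and are discarded, so the result is close to 1.1.
print(interpercentile_mean([0.9, 1.0, 1.1, 1.2, 9.0]))  # ≈ 1.1
```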

Figure 3c shows the CPU time of each method with respect to the sample size of the cod-rna data set. Subsets with the sample size indicated on the horizontal axis were randomly chosen 10 times. The error bars show the range from the 25th to 75th percentile.

Both figures imply that our method, ER-SVM by DCA, performed comparably to the other methods despite having two kinds of nonconvexity. Indeed, our method was faster than the other robust SVM approaches: the heuristics for ER-SVM and ramp-loss SVM. We omitted ramp-loss SVM from Figure 3c because out-of-memory errors occurred with 10,000 training samples when we tried to solve the dual-formulated subproblems of ramp-loss SVM using CPLEX (a state-of-the-art commercial optimization package). We can also see that the computation time of our method varied over a smaller range than that of LIBSVM for C-SVM as the parameter value was varied.

### 5.4 Detailed Observations on DCA

#### 5.4.1 Comparison of DCA and Heuristics (Figure 4a)

Table 4 reveals the advantage of DCA (ER-SVM (DCA)) over the heuristic method (ER-SVM (heuristics)) in terms of prediction accuracy. Here, we assess the quality of the solutions found by the two algorithms by comparing the objective values they attain on the liver data set (with the hyperparameters ν and μ fixed).

The initial solutions were drawn uniformly at random from the unit sphere, and the experiments were repeated 300 times. For DCA, we set the hyperparameter to the value automatically selected by the heuristic algorithm. We then counted how many times our algorithm (DCA) achieved a smaller objective value than the heuristics, tallying win, lose, and draw cases over the 300 trials (a win or loss is declared when the gap between the objective values exceeds 3%, allowing for numerical error). Figure 4a shows that our algorithm (DCA) tended to win or draw, especially for larger ν. This result supports the claim of Le Thi and Pham Dinh (2005) and Pham Dinh and Le Thi (1997) that DCA tends to converge to a global solution.
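The win/lose/draw tally described above can be sketched as follows; `tally` is an illustrative helper, and normalizing the 3% gap by the larger objective value is our assumption, not a detail stated in the paper:

```python
def tally(dca_vals, heur_vals, tol=0.03):
    """Count win/lose/draw over paired trials.

    A trial is a win for DCA when its objective value is smaller than the
    heuristic's by more than a `tol` relative gap (3% in the experiments),
    a loss in the symmetric case, and a draw otherwise.
    """
    win = lose = draw = 0
    for f_dca, f_heur in zip(dca_vals, heur_vals):
        scale = max(abs(f_dca), abs(f_heur), 1e-12)  # guard against zero objectives
        gap = (f_heur - f_dca) / scale  # positive gap favors DCA
        if gap > tol:
            win += 1
        elif gap < -tol:
            lose += 1
        else:
            draw += 1
    return win, lose, draw

print(tally([1.0, 2.0, 1.0], [1.2, 2.01, 0.9]))  # → (1, 1, 1)
```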

#### 5.4.2 Efficient Update of η (Figure 4b)

#### 5.4.3 Computation Time on Case N

Here, we investigate how the computation time changes with respect to the parameter ν. Figure 4c shows the computation time averaged over 100 trials. The vertical line indicates the estimated threshold value of ν; the extension of the parameter range corresponds to values of ν below this threshold, where the nonconvex constraint worked. ER-SVM in the extended range found classifiers that ROD and ramp-loss SVM did not. The figure shows that the computation time did not change much except near the threshold. The optimal margin variable ρ was zero around the threshold, which might make the problem numerically unstable.

#### 5.4.4 Kernelized ER-SVM

Figure 4d shows the test accuracy of ER-SVM with a linear kernel and with a polynomial kernel of the form k(x, x′) = (x·x′ + c)^d. The hyperparameters c and d of the polynomial kernel were fixed. We used 50% of the data set as the training set and the rest as the test set. The training set was contaminated by using a half-ring, as in section 5.2 and Xu et al. (2006). The test accuracy was evaluated as the mean over 100 trials for each ν. The markers in Figure 4d indicate that the objective value was positive (i.e., case N in Table 2 occurred and the nonconvex constraint worked effectively) at least once in the 100 trials. This result implies that the extension is effective not only for a linear kernel but also for a polynomial kernel, especially when the data set is contaminated. Furthermore, the polynomial kernel was more accurate than the linear one regardless of the outlier ratio, which indicates the effectiveness of kernelizing ER-SVM.
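A polynomial kernel of this form is evaluated through its Gram matrix. The sketch below uses placeholder values for the coefficient `c` and degree `d`; the values used in the experiment are not reproduced here:

```python
import numpy as np

def polynomial_gram(X1, X2, c=1.0, d=2):
    """Gram matrix K[i, j] = (x1_i . x2_j + c) ** d for a polynomial kernel."""
    return (X1 @ X2.T + c) ** d

# Two unit vectors along the axes: diagonal entries (1 + 1)^2 = 4,
# off-diagonal entries (0 + 1)^2 = 1.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
K = polynomial_gram(X, X, c=1.0, d=2)
print(K)  # [[4. 1.] [1. 4.]]
```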

### 5.5 Effectiveness of the Extension

Figure 4e verifies Theorem 6. When the number of samples in each class is unbalanced or the samples of the two classes overlap substantially, as in Figure 4e, C-SVM, ν-SVM, and ramp-loss SVM obtain trivial classifiers (w = 0), while ER-SVM obtains a nontrivial classifier. That is, this figure demonstrates the effectiveness of the nonconvex norm constraint.

Figure 4f shows how the test accuracy of each method is affected by the degree of overlap between the samples of the two classes. We used synthetic data sets with the same covariance and numbers of samples as in Figure 4e, changing only the mean of the distribution of the positive samples. The horizontal axis is the distance between the means of the positive and negative sample distributions. The hyperparameters of ER-SVM were fixed. The solid line is our method (DCA for ER-SVM), and the dashed line is its case C (which is equivalent to ramp-loss SVM), obtained by fixing the corresponding variable in equation 4.8 to zero. Figure 4f implies that case N occurred frequently when the distance was small, and the nonextended model had poor test accuracy in such cases because w = 0 is then an optimal solution.
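The synthetic setup above (two classes sharing a covariance, with only the positive-class mean shifted) can be sketched as follows; the covariance, sample sizes, and seed are placeholders, not the values used in the experiments:

```python
import numpy as np

def make_overlap_data(distance, n_pos=50, n_neg=200, dim=2, seed=0):
    """Two Gaussian classes with a shared covariance; only the positive mean moves.

    `distance` is the gap between the class means, controlling how much
    the two classes overlap (a small distance means heavy overlap).
    """
    rng = np.random.default_rng(seed)
    cov = np.eye(dim)          # placeholder covariance
    mu_neg = np.zeros(dim)
    mu_pos = np.zeros(dim)
    mu_pos[0] = distance       # shift the positive class along the first axis
    X = np.vstack([
        rng.multivariate_normal(mu_pos, cov, n_pos),
        rng.multivariate_normal(mu_neg, cov, n_neg),
    ])
    y = np.concatenate([np.ones(n_pos), -np.ones(n_neg)])
    return X, y

X, y = make_overlap_data(distance=1.5)
print(X.shape, y.shape)  # (250, 2) (250,)
```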

## 6 Conclusion

We theoretically analyzed ER-SVM, proving that it is a natural extension of robust SVMs, and discussed the conditions under which such an extension works. Furthermore, we proposed a new, efficient algorithm with theoretically good properties. Numerical experiments showed that our algorithm works efficiently.

In this letter, we focused on binary classification problems. Along the same line, we can formulate extended robust variants of learning algorithms for other statistical problems such as regression problems. The proposed simplified DCA with a similar CVaR-function decomposition is applicable to extended robust learning algorithms. In particular, Wu and Liu (2007) proposed a multiclass extension of binary SVM using the ramp loss. We may be able to formulate such a multiclass extension for ER-SVM and apply the proposed DCA to the resulting problem.

We now have to handle large-scale data sets arising from real-world problems. When the given data set has many features, solving the dual of subproblems 4.8 may speed up our algorithm (DCA) more than solving the primal subproblems does. An alternative practical approach to large-scale problems is to use a memory-efficient method such as the SMO algorithm (Platt, 1998) for solving the subproblems, although its worst-case computation time can be quite large. When the given data set has a large sample size, a stochastic variant of our method might be necessary. We would like to improve our method so that it works better on large-scale data sets.

Our method, ER-SVM by DCA, tends to perform better than the other methods especially on data sets with small feature sizes. For high-dimensional data sets, we recommend combining our method with feature selection.

## Appendix A: Proof of Proposition 3

Proposition 2 ensures that , which implies that holds for the indices having .

## Appendix B: KKT Conditions for ER-SVM, ROD, and Ramp-Loss SVM

We show differentiable formulations of case C of ER-SVM (equation 2.6), ROD (equation 2.5), and ramp-loss SVM (equation 2.4) in order to write the KKT conditions for each model.

The KKT conditions of the above problems are defined as follows.

### B.1 Proof of Lemma 4

### B.2 Proof of Theorem 5

Let us prove the first statement. Let be a KKT point of equation B.2 with hyperparameter and Lagrange multipliers . Suppose . Then is a KKT point of equation B.3 with Lagrange multipliers and .

## Appendix C: Proof of Theorem 6

When , . Then holds. Here, we show that the following statements for a training set are equivalent:

- c1. The training set satisfies .
- c2. has an optimal solution such that for all .
- c3. has an optimal solution such that for all .

Statements c2 and c3 imply that ν-SVM and C-SVM obtain a trivial solution (w = 0) for any hyperparameter value.