Abstract

Nonconvex variants of support vector machines (SVMs) have been developed for various purposes. For example, robust SVMs attain robustness to outliers by using a nonconvex loss function, while extended ν-SVM (Eν-SVM) extends the range of the hyperparameter ν by introducing a nonconvex constraint. Here, we consider the extended robust support vector machine (ER-SVM), a robust variant of Eν-SVM. ER-SVM combines the two types of nonconvexity from robust SVMs and Eν-SVM. Because of these two nonconvexities, the existing algorithm we proposed needs to be divided into two parts depending on whether the hyperparameter value is in the extended range or not. The algorithm also heuristically solves the nonconvex problem in the extended range.

In this letter, we propose a new, efficient algorithm for ER-SVM. The algorithm deals with the two types of nonconvexity while never entailing more computations than either Eν-SVM or robust SVM, and it finds a critical point of ER-SVM. Furthermore, we show that ER-SVM includes the existing robust SVMs as special cases. Numerical experiments confirm the effectiveness of integrating the two nonconvexities.

1  Introduction

The support vector machine (SVM), one of the most successful machine learning models, has many variants. The original form of SVM, C-SVM (Cortes & Vapnik, 1995), is popular because of its generalization ability and convexity. An alternative SVM formulation, ν-SVM (Schölkopf, Smola, Williamson, & Bartlett, 2000), is known to be equivalent to C-SVM, and extended ν-SVM (Eν-SVM) (Perez-Cruz, Weston, Herrmann, & Schölkopf, 2003) is a nonconvex extension of ν-SVM. Eν-SVM introduces a nonconvex norm constraint instead of a regularization term in the objective function, and this nonconvex constraint makes it possible to extend the range of the hyperparameter ν. Eν-SVM includes ν-SVM as a special case, and Eν-SVM empirically outperforms ν-SVM owing to the extension (see Perez-Cruz et al., 2003). Furthermore, Takeda and Sugiyama (2008) showed that Eν-SVM minimizes the conditional value-at-risk (CVaR), a popular coherent risk measure in finance. However, CVaR is sensitive to tail risks, and the same holds true for Eν-SVM. Unfortunately, this also implies that these SVMs might not be sufficiently robust to outliers.

Various nonconvex SVMs have been studied with the goal of ensuring robustness to outliers. Indeed, there are many such models, which are called robust SVMs. In this letter, “robust SVMs” means any robust variant of SVM that uses a nonconvex loss function.1 Ramp-loss SVM (Collobert, Sinz, Weston, & Bottou, 2006; Xu, Crammer, & Schuurmans, 2006) is a popular robust SVM. The idea is to truncate the hinge loss and bound the value of the loss function by a constant. Moreover, any loss function, not only the hinge loss, can be truncated in the same way as the ramp loss. The framework of such truncated loss functions is studied, for example, in Shen, Tseng, Zhang, and Wong (2003) and Yu, Yang, Xu, White, and Schuurmans (2010).

Xu et al. (2006) also proposed robust outlier detection (ROD), a model that explicitly identifies outliers. CVaR-(·,·)-SVM (Tsyurmasto, Uryasev, & Gotoh, 2013) has also been proposed as a robust SVM. However, we can prove that ROD and CVaR-(·,·)-SVM are equivalent to ramp-loss SVM with appropriately set parameters. On the other hand, Takeda, Fujiwara, and Kanamori (2014) recently proposed the extended robust SVM (ER-SVM), a robust variant of Eν-SVM. That is, while the existing robust SVMs are robust variants of C-SVM or ν-SVM, ER-SVM is a robust variant of Eν-SVM, the nonconvex extension of ν-SVM, as indicated in Table 1.

Table 1:
Relationships among Existing Models.
                              Regularizer
                  Convex                        Nonconvex
Loss  Convex      C-SVM, ν-SVM                  Eν-SVM
      Nonconvex   Robust outlier detection,     ER-SVM
                  ramp-loss SVM,
                  CVaR-(·,·)-SVM

Note: The models in the right (resp. bottom) cell include the models in the left (resp. top) cell as special cases.

1.1  Nonconvex Optimization and DC Programming

The important issue regarding these nonconvex variants is how to solve the resulting difficult nonconvex problems. Difference-of-convex-functions (DC) programming is a powerful framework for dealing with nonconvex problems. It is known that various nonconvex problems can be formulated as DC programs, whose objective function is expressed as g(x) − h(x) by using two convex functions g and h (see Horst & Tuy, 1996).

The DC algorithm (DCA) introduced in Pham Dinh (1988) is one of the most efficient algorithms for DC programs. The basic idea behind it is to linearize the concave part and sequentially solve the convex subproblem. The local and global optimality conditions, convergence properties, and the duality of DC programs were studied using convex analysis (Rockafellar, 1970). For a general DC program, every limit point of the sequence generated by DCA is a critical point, which is also called a generalized Karush-Kuhn-Tucker (KKT) point. It is remarkable that DCA does not require differentiability in order to ensure its convergence. Furthermore, it is known that DCA converges quite often to a global solution (see Le Thi & Pham Dinh, 2005; Pham Dinh & Le Thi, 1997).

A similar method, the concave-convex procedure (CCCP) (Yuille & Rangarajan, 2003), has been studied in the machine learning literature. Indeed, it can be shown that if the concave part of the objective function is differentiable, then DCA exactly reduces to CCCP. Smola, Vishwanathan, and Hofmann (2005) proposed constrained CCCP to deal with DC constraints, while Sriperumbudur and Lanckriet (2012) studied the global convergence properties of (constrained) CCCP, proving that the sequence generated by CCCP converges to a stationary point under conditions such as differentiability and strict convexity. However, since our model is not differentiable, we will use DCA and take advantage of the theoretical results on DCA, such as its convergence properties.

1.2  Contributions

The main contribution of this letter is a new, efficient algorithm based on DCA for ER-SVM. ER-SVM has two nonconvexities: the nonconvexity in the objective function is due to the truncated loss function, and the nonconvexity in the constraint is due to the extension of the range of the hyperparameter ν. We express the truncated loss function of ER-SVM as the difference of two CVaRs, which are convex functions, and move the nonconvex term in the constraint to the objective by using an exact penalty. While being equivalent to ER-SVM, the resulting formulation allows us to apply DCA to ER-SVM and gives an intuitive interpretation of ER-SVM.

The previous algorithm proposed by Takeda et al. (2014) is heuristic and does not have a theoretical guarantee. Furthermore, it is not simple, because it needs two different procedures depending on the value of the hyperparameter ν. Our algorithm instead works for any value of ν, and it finds a critical point of ER-SVM. Though ER-SVM carries both of the nonconvexities from Eν-SVM and robust SVMs (e.g., ROD and ramp-loss SVM), our new algorithm is simple and comparable to their algorithms. Indeed, our algorithm is similar to the algorithm of Collobert et al. (2006) for ramp-loss SVM, which is known to be fast. Our code for solving ER-SVM by DCA is publicly available at Fujiwara, Takeda, and Kanamori (2016).

Furthermore, we clarify the relations among various robust SVMs, including ER-SVM. Though there are many robust variants of SVM, their relations to each other have not been discussed, except that Takeda et al. (2014) showed that ER-SVM and ramp-loss SVM share the same global optimal solutions. In this letter, we prove that ROD and CVaR-(·,·)-SVM are equivalent to ramp-loss SVM and also show that ER-SVM includes ramp-loss SVM and ROD as special cases in the sense of having the same KKT points. More specifically, a special case of ER-SVM (whose range of the hyperparameter ν is limited), ramp-loss SVM, and ROD share all KKT points. Therefore, as Table 1 shows, ER-SVM can be regarded not simply as a robust variant of Eν-SVM but rather as a natural extension of ramp-loss SVM and ROD.

1.3  Outline of the Letter

This letter is organized as follows. Section 2 provides a preliminary description of the basic notions. In sections 2.1 and 2.2, we describe the existing SVMs and their variants. Section 2.3 briefly describes the definitions and properties of popular financial risk measures such as CVaR and VaR. Section 3 describes some important properties of ER-SVM. After showing a DC programming formulation of ER-SVM, which is a minimization of the difference of CVaRs, we discuss the relationship of ramp-loss SVM, ROD, and ER-SVM. Section 4 describes our new algorithm, DCA for ER-SVM. Numerical results are presented in section 5.

2  Preliminary

Here, we briefly review binary classification in supervised learning. Suppose we have a set of training samples {(x_i, y_i)}_{i∈M}, where x_i ∈ R^n, y_i ∈ {+1, −1}, and M = {1, …, m} is the set of indices of the training samples. M₊ (resp. M₋) is the index set for the label +1 (resp. −1), and we suppose that both |M₊| ≥ 1 and |M₋| ≥ 1, where |·| denotes the size of a set. SVM learns the decision function w⊤x + b and predicts the label of x as sign(w⊤x + b). We define
$$r_i(w, b) := -y_i\,(w^\top x_i + b), \qquad i \in M,$$
wherein the absolute value of r_i(w, b) is proportional to the distance from the hyperplane w⊤x + b = 0 to the sample x_i. r_i(w, b) is negative if the sample is classified correctly and positive otherwise.
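The following minimal snippet (my own illustration, not code from the letter) simply restates this definition numerically: r_i(w, b) is negative exactly when sample i is correctly classified by sign(w⊤x + b).

```python
import numpy as np

def margins(w, b, X, y):
    """r_i(w, b) = -y_i (w^T x_i + b) for rows x_i of X and labels y_i in {+1, -1}."""
    return -y * (X @ w + b)

rng = np.random.RandomState(0)
X = rng.randn(5, 2)
y = np.array([1, 1, -1, -1, 1])
w, b = np.array([0.8, -0.3]), 0.1

r = margins(w, b, X, y)
print(r)
print((r < 0) == (np.sign(X @ w + b) == y))   # negative r  <=>  correctly classified
```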

Our algorithm can be applied to nonlinear models by using a kernel technique. We describe a kernel variant of ER-SVM in section 3.4.

2.1  Support Vector Machines

2.1.1  Convex SVMs

C-SVM (Cortes & Vapnik, 1995), the most standard form of SVM, minimizes the hinge loss plus an ℓ₂ regularizer:
$$\min_{w, b}\ \ \frac{1}{2}\|w\|^2 + C \sum_{i \in M} \bigl[\,1 + r_i(w, b)\,\bigr]_+,$$
where [z]_+ := max{z, 0} and C > 0 is a hyperparameter. ν-SVM (Schölkopf et al., 2000) is formulated as
$$\begin{aligned}\min_{w, b, \rho, \xi}\ \ & \frac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{m}\sum_{i \in M} \xi_i\\ \text{s.t.}\ \ & \xi_i \ge \rho + r_i(w, b),\ \ \xi_i \ge 0,\ \ i \in M,\ \ \rho \ge 0,\end{aligned}$$
2.1
which is equivalent to C-SVM if ν and C are set appropriately. The hyperparameter ν has an upper threshold,
$$\bar{\nu} := \frac{2\min\{|M_+|,\, |M_-|\}}{m},$$
and a lower threshold, ν_min. The optimal solution is trivial if ν ≤ ν_min, and the optimal value is unbounded if ν > ν̄ (see Chang & Lin, 2001b). Therefore, we define the range of ν for ν-SVM as (ν_min, ν̄].

2.1.2  Nonconvex SVMs

Here, we introduce two types of nonconvexity for SVMs. The first is extended ν-SVM (Eν-SVM) (Perez-Cruz et al., 2003), an extended model of ν-SVM. Eν-SVM is formulated as
$$\begin{aligned}\min_{w, b, \rho, \xi}\ \ & -\nu\rho + \frac{1}{m}\sum_{i \in M} \xi_i\\ \text{s.t.}\ \ & \xi_i \ge \rho + r_i(w, b),\ \ \xi_i \ge 0,\ \ i \in M,\ \ \|w\|^2 = 1.\end{aligned}$$
2.2
Eν-SVM has the same set of optimal solutions as ν-SVM if ν ∈ (ν_min, ν̄], and it obtains nontrivial solutions even if ν ≤ ν_min, owing to the nonconvex constraint ‖w‖² = 1. Therefore, we define the range of ν for Eν-SVM as (0, ν̄]. Eν-SVM removes the lower threshold ν_min of ν-SVM and extends the admissible range of ν. It was empirically shown that Eν-SVM sometimes achieves high accuracy in the extended range of ν. We will mention other concrete advantages of Eν-SVM over ν-SVM in section 3.3.
The second is ramp-loss SVM, a robust variant of C-SVM. The resulting classifier is robust to outliers at the expense of the convexity of the hinge loss. The idea behind the ramp loss is to clip large losses at a constant s. Ramp-loss SVM is formulated as
$$\min_{w, b}\ \ \frac{1}{2}\|w\|^2 + C \sum_{i \in M} \min\bigl\{\,[\,1 + r_i(w, b)\,]_+,\ s\,\bigr\}.$$
2.3
C is also a hyperparameter. The ramp loss can be described as the difference of two hinge-type functions; therefore, the concave-convex procedure (CCCP), an effective algorithm for DC programming, can be applied to the problem (see Collobert et al., 2006, for details). Xu et al. (2006) gave another representation of ramp-loss SVM by using the η-hinge loss,
formula
2.4
and applied a semidefinite programming (SDP) relaxation to equation 2.4. To identify outliers explicitly, Xu et al. (2006) also proposed robust outlier detection (ROD):
formula
2.5

Both parameters appearing in equation 2.5 are hyperparameters. The original formulation in Xu et al. (2006) defines ROD with an inequality constraint, but we can replace it with the corresponding equality constraint because the replacement does not change the optimal value.
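As a quick illustration of why DC-programming methods such as CCCP and DCA apply to these truncated losses, the sketch below (my own code, assuming the truncated-hinge form min{[1 + r]_+, s} with clipping constant s) checks numerically that the ramp loss equals a difference of two convex hinge-type functions.

```python
import numpy as np

def hinge(z):
    """Convex hinge-type function [z]_+ = max(0, z)."""
    return np.maximum(0.0, z)

def ramp(r, s=1.0):
    """Truncated (ramp) loss: the hinge loss of 1 + r clipped at the constant s."""
    return np.minimum(hinge(1.0 + r), s)

def ramp_as_dc(r, s=1.0):
    """The same loss written as a difference of two convex functions."""
    return hinge(1.0 + r) - hinge(1.0 + r - s)

r = np.linspace(-3, 3, 601)          # stand-in for r_i(w, b) = -y_i (w^T x_i + b)
assert np.allclose(ramp(r), ramp_as_dc(r))
print("max |ramp - DC form| =", np.abs(ramp(r) - ramp_as_dc(r)).max())
```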

2.2  Extended Robust Support Vector Machine

Recently, Takeda et al. (2014) proposed extended robust SVM (ER-SVM):
formula
2.6

where ν and μ are hyperparameters.

Remark 1.

ER-SVM, equation 2.6, is obtained by relaxing the 0-1 integer constraints of the original formulation in Takeda et al. (2014) and replacing an inequality constraint of the original formulation by its equality variant. The relaxation does not change the problem if μm is an integer. More precisely, if μm is an integer, ER-SVM, equation 2.6, has an optimal solution whose outlier-indicator variables all take values in {0, 1}.

In this case, ER-SVM, equation 2.6, removes μm samples and applies Eν-SVM, equation 2.2, to the rest. Indeed, the heuristic algorithm of Takeda et al. (2014) for ER-SVM removes samples little by little until the total number of removed samples becomes μm. In this letter, we use formulation 2.6 and call it ER-SVM.

It can easily be seen that, for fixed μ, the optimal value of equation 2.6 decreases as ν increases. Moreover, it is shown in Takeda et al. (2014, lemma 4) that the nonconvex constraint ‖w‖² = 1 can be relaxed to ‖w‖² ≤ 1 without changing the optimal solution as long as the optimal value is negative. Just like Eν-SVM, ER-SVM, equation 2.6, has a lower threshold of the hyperparameter ν at which the optimal value equals zero, and the nonconvex constraint is essential for ER-SVM when ν is below this threshold. The nonconvex constraint removes the lower threshold of ν and extends the admissible range of ν in the same way as in Eν-SVM (see Table 2). We show in section 3.2 that a special case (case C in Table 2) of ER-SVM is equivalent to ROD, equation 2.5, and ramp-loss SVM, equation 2.4, in the same way that a special case (case C) of Eν-SVM is equivalent to ν-SVM. Hence, ER-SVM can be seen as a natural extension of robust SVMs such as ROD and ramp-loss SVM. ER-SVM, equation 2.6, also has an upper threshold on ν, which, as with Eν-SVM and ν-SVM, keeps the problem bounded.

Table 2:
Range of ν for ν-SVM, Eν-SVM, and ER-SVM.

                               Case N                   Case C
          Range of ν           ν ≤ lower threshold      lower threshold < ν ≤ upper threshold    ν > upper threshold
ν-SVM     Optimal value        Zero                     Negative                                 Unbounded
          Optimal solution     Trivial (w = 0)          Admissible                               –
Eν-SVM    Optimal value        Nonnegative              Negative                                 Unbounded
          Optimal solution     Admissible               Admissible                               –
          Constraint           ‖w‖² = 1                 ‖w‖² ≤ 1                                 –
ER-SVM    Optimal value        Nonnegative              Negative                                 Unbounded
          Optimal solution     Admissible               Admissible                               –
          Constraint           ‖w‖² = 1                 ‖w‖² ≤ 1                                 –

Notes: If ν is greater than the lower threshold (case C), the nonconvex constraint of Eν-SVM and ER-SVM can be relaxed to a convex constraint without changing the optimal solutions. Case C of Eν-SVM is equivalent to ν-SVM.

2.3  Financial Risk Measures

We define the financial risk measures as in Rockafellar and Uryasev (2002). Let us consider the cumulative distribution function of the losses r_i(w, b) over the training samples:
$$\Psi(w, b; c) := \frac{1}{m}\,\bigl|\{\, i \in M : r_i(w, b) \le c \,\}\bigr|.$$
For α ∈ (0, 1), let the α-percentile of this distribution be denoted α-VaR; it is known in finance as the value-at-risk (VaR). More precisely, α-VaR is defined as
$$\mathrm{VaR}_\alpha(w, b) := \min\{\, c : \Psi(w, b; c) \ge \alpha \,\},$$
and VaR⁺_α, which we call α-upper-VaR, is defined as
$$\mathrm{VaR}^+_\alpha(w, b) := \inf\{\, c : \Psi(w, b; c) > \alpha \,\}.$$
The difference between α-VaR and α-upper-VaR is illustrated in Figure 1.
Figure 1:

Difference between α-VaR and α-upper-VaR. The α-upper-VaR coincides with the α-VaR if the equation Ψ(w, b; c) = α has no solution (a).


Conditional value-at-risk (CVaR) is also a popular risk measure in finance because of its coherence and its computational properties. Formally, α-CVaR, denoted φ_α(w, b), is defined as
$$\phi_\alpha(w, b) := \int_{-\infty}^{\infty} c \; d\Psi_\alpha(w, b; c),$$
where the α-tail cumulative distribution function Ψ_α is defined, as in Rockafellar and Uryasev (2002), by
$$\Psi_\alpha(w, b; c) := \begin{cases} 0 & \text{for } c < \mathrm{VaR}_\alpha(w, b),\\[4pt] \dfrac{\Psi(w, b; c) - \alpha}{1 - \alpha} & \text{for } c \ge \mathrm{VaR}_\alpha(w, b). \end{cases}$$
The explicit mathematical form for calculating CVaR is given by the following theorem.
Theorem 2.
(Rockafellar & Uryasev, 2002). One has
$$\phi_\alpha(w, b) = \min_{c \in \mathbb{R}}\ F_\alpha(w, b, c),$$
2.7
where
$$F_\alpha(w, b, c) := c + \frac{1}{(1 - \alpha)\,m} \sum_{i \in M} \bigl[\, r_i(w, b) - c \,\bigr]_+.$$
Moreover,
$$\mathrm{VaR}_\alpha(w, b) \in \mathop{\mathrm{arg\,min}}_{c \in \mathbb{R}} F_\alpha(w, b, c) \quad \text{and} \quad \phi_\alpha(w, b) = F_\alpha\bigl(w, b, \mathrm{VaR}_\alpha(w, b)\bigr)$$
hold.
CVaR can also be described as a maximum of linear functions as follows:
$$\phi_\alpha(w, b) = \max_{q}\ \Bigl\{\, \sum_{i \in M} q_i\, r_i(w, b)\ :\ \sum_{i \in M} q_i = 1,\ \ 0 \le q_i \le \frac{1}{(1-\alpha)m},\ \ i \in M \,\Bigr\}.$$
2.8
Here, we have used the properties of CVaR for a discrete loss distribution (see proposition 8 of Rockafellar & Uryasev, 2002). The representation in equation 2.8 ensures the convexity of CVaR, φ_α(w, b), in (w, b). We will use the two representations of CVaR, equations 2.7 and 2.8, in the proof of proposition 3.
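The following self-contained check (my own code, not the letter's; it treats β = 1 − α as the tail fraction of the empirical loss distribution) confirms numerically that the minimization form, equation 2.7, and the sorted-tail form implied by equation 2.8 give the same empirical CVaR.

```python
import numpy as np

def cvar_min_form(r, alpha):
    """CVaR via min_c { c + sum_i [r_i - c]_+ / ((1 - alpha) m) }, equation 2.7.
    F is piecewise linear in c with breakpoints at the r_i, so evaluating it
    at those breakpoints suffices to find the minimum."""
    m = len(r)
    beta = 1.0 - alpha
    F = lambda c: c + np.maximum(r - c, 0.0).sum() / (beta * m)
    return min(F(c) for c in r)

def cvar_tail_form(r, alpha):
    """CVaR as the weighted average of the largest (1 - alpha) fraction of the
    losses, i.e., the optimal value of the linear maximization in equation 2.8."""
    m = len(r)
    k = (1.0 - alpha) * m                  # tail "mass" measured in samples
    r_sorted = np.sort(r)[::-1]            # losses in descending order
    full = int(np.floor(k))
    total = r_sorted[:full].sum()
    if full < m and k > full:              # fractional weight on the next loss
        total += (k - full) * r_sorted[full]
    return total / k

rng = np.random.RandomState(0)
r = rng.randn(200)                         # stand-in for the losses r_i(w, b)
for alpha in (0.5, 0.9, 0.95):
    print(alpha, cvar_min_form(r, alpha), cvar_tail_form(r, alpha))
```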

3  Properties of Extended Robust SVM

3.1  Decomposition using Conditional Value-at-Risks

Here, we give an intuitive interpretation of ER-SVM, equation 2.6, using two CVaRs. Eν-SVM has been shown to minimize CVaR (Takeda & Sugiyama, 2008). On the other hand, ER-SVM, equation 2.6, ignores a fraction of the samples and solves Eν-SVM using the rest: that is, it minimizes CVaR using the remaining samples. Hence, ER-SVM can be regarded as minimizing the average of the losses r_i(w, b) in the gray area of Figure 2. The mean of the gray area in Figure 2 can be described using two CVaRs, as is done in Wozabal (2012):

Figure 2:

Distribution of the losses r_i(w, b): ER-SVM minimizes the average of r_i(w, b) in the gray area. The two CVaR levels, corresponding to ν and μ, are shown in red and blue.


Proposition 3.
ER-SVM, equation 2.6, is described as a problem minimizing the difference of two convex functions by using the two CVaRs:
formula
3.1

The proof of proposition 3 is in appendix A.

Since CVaR is convex in (w, b) for a fixed level, the objective function in equation 3.1 can be described as the difference of two convex functions. A similar model, CVaR-(·,·)-SVM, which relaxes the nonconvex constraint ‖w‖² = 1 of equation 3.1 to ‖w‖² ≤ 1, was recently proposed (Tsyurmasto et al., 2013). As Table 2 indicates, such a model is a special case (case C) of ER-SVM, and it is essentially equivalent to ramp-loss SVM or ROD.

3.2  Relationship with Existing Models

Here, we discuss the relationship among ER-SVM, ROD, and ramp-loss SVM by using the KKT conditions shown in appendix B. We begin by showing the equivalence of ramp-loss SVM and ROD. While ROD was proposed (Xu et al., 2006) as a direct approach that, unlike ramp-loss SVM, explicitly incorporates outlier suppression, the following lemma implies their equivalence.

Lemma 4.

(relation between ROD and ramp-loss SVM). Ramp-loss SVM, equation 2.4, and ROD, equation 2.5, share all KKT points in the following sense.

  1. Let a point be a KKT point of ramp-loss SVM, equation 2.4. Then it is also a KKT point of ROD, equation 2.5, with a corresponding hyperparameter value.

  2. A KKT point of ROD, equation 2.5, whose Lagrange multiplier for the constraint on the outlier variables is positive is also a KKT point of ramp-loss SVM, equation 2.4, with a corresponding hyperparameter value.

The proof of lemma 4 is in section B.1.

Theorem 5 shows the equivalence of ROD and the special case (case C in Table 2) of ER-SVM.

Theorem 5.

(relation between ER-SVM and ROD). Case C of ER-SVM (in Table 2), that is, equation B.3, and ROD, equation 2.5, share all KKT points in the following sense:

  1. Let a point satisfy the KKT conditions of ROD, equation 2.5, and suppose w ≠ 0. Then it satisfies the KKT conditions of case C of ER-SVM with a corresponding hyperparameter value.

  2. Let a point satisfy the KKT conditions of case C of ER-SVM. Suppose w ≠ 0 and the objective value is nonzero. Then it satisfies the KKT conditions of ROD with a corresponding hyperparameter value.

See section B.2 for the proof of theorem 5.

From lemma 4 and theorem 5, ramp-loss SVM and ROD can be regarded as special cases (case C in Table 2) of ER-SVM. As we discussed in section 3.1, CVaR-(·,·)-SVM (Tsyurmasto et al., 2013) is also equivalent to case C (in Table 2) of ER-SVM. Theorem 5 is similar to the relation between C-SVM and ν-SVM (which is a special case of Eν-SVM): it was shown that the sets of global solutions of C-SVM and ν-SVM correspond to each other when the hyperparameters are set properly (see Chang & Lin, 2001b, and Schölkopf et al., 2000). We used KKT conditions to show the relationship between the nonconvex models.

3.3  Motivation for Nonconvex Regularizer

The nonconvex constraint ‖w‖² = 1 helps to remove the lower threshold on the hyperparameter ν, as Table 2 shows. The extension of the admissible range of ν gives us a chance of finding a better classifier that cannot be found by using a convex regularizer as in ν-SVM, equation 2.1, and robust SVMs, equation 2.3. Perez-Cruz et al. (2003) showed examples where Eν-SVM outperforms ν-SVM owing to the extended range of ν.

Besides this empirical evidence, there is theoretical evidence as to why the extended parameter range is needed for ν-SVM. More precisely, theorem 6 gives an explicit condition on data sets for which the admissible range of ν for ν-SVM is empty and, therefore, ν-SVM and C-SVM obtain a trivial classifier having w = 0 for any value of the hyperparameters ν and C. The condition also applies to robust SVMs after all samples identified as outliers are removed. Indeed, Figure 4e shows a numerical example where C-SVM, ν-SVM, and ramp-loss SVM obtain trivial classifiers for any hyperparameter value, but ER-SVM obtains a nontrivial one.

Rifkin, Pontil, and Verri (1999) studied the conditions under which C-SVM obtains a trivial solution. We directly connect their statements to ν-SVM and strengthen them in theorem 6 by adding a geometric interpretation of when the admissible range of ν for ν-SVM is empty.

Theorem 6.
Suppose, without loss of generality, that |M₊| ≤ |M₋|, and define the reduced convex hull (RCH) (Crisp & Burges, 2000):
formula
3.2
ν-SVM and C-SVM lead to the trivial classifier w = 0 for any hyperparameter values ν and C if and only if the training set satisfies a containment condition stated in terms of the RCH, equation 3.2. When |M₊| = |M₋|, the condition is modified accordingly.

The proof is in appendix C.

3.4  Kernelization

Linear learning models can be turned into more powerful nonlinear ones by using the kernel technique. Here, we briefly introduce a kernel variant of ER-SVM, equation 2.6. The input sample x is mapped to Φ(x) in a high- (even infinite-) dimensional inner product space F, and a classifier of the form ⟨w, Φ(x)⟩ + b is learned from the training samples, where ⟨w, Φ(x)⟩ is the inner product of w and Φ(x) in F. The kernel function is defined as k(x, x′) := ⟨Φ(x), Φ(x′)⟩.

Here, we show how the equality constraint ‖w‖² = 1 in ER-SVM, equation 2.6, is dealt with in the kernel method. Let W be the subspace of F spanned by Φ(x₁), …, Φ(x_m), and let W⊥ be the orthogonal complement of W. Then the weight vector can be decomposed as w = w_W + w_⊥, where w_W ∈ W and w_⊥ ∈ W⊥. The vector w_⊥ does not affect the value of ⟨w, Φ(x_i)⟩. The vector w_W can be expressed as a linear combination of Φ(x₁), …, Φ(x_m), such as w_W = Σ_{i∈M} α_i Φ(x_i). If w_⊥ = 0 holds, ‖w‖² = ‖w_W‖² should also hold, and the constraint ‖w‖² = 1 is equivalent to α⊤Kα = 1. When W⊥ ≠ {0} (i.e., the dimension of W⊥ is not zero), one can prove that the constraint ‖w‖² = 1 can be replaced with the convex constraint α⊤Kα ≤ 1, where one uses the fact that the Gram matrix K is positive semidefinite. Indeed, since the objective function depends on w only through the component w_W, the constraint on w can be replaced with its projection onto the subspace W. Thus, the above convex constraint is obtained unless W⊥ = {0}. With this constraint, the kernel variant of ER-SVM is
formula
3.3
where K := (k(x_i, x_j))_{i,j∈M} is the Gram matrix. When W⊥ = {0} holds, the inequality constraint α⊤Kα ≤ 1 should be replaced with the equality constraint α⊤Kα = 1.
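As a small illustration of the reformulated constraint (my own sketch with an assumed polynomial kernel and random data, not the letter's code): for w = Σ_i α_i Φ(x_i), the squared norm in feature space is α⊤Kα, so the constraint on ‖w‖² becomes a constraint on α through the Gram matrix K.

```python
import numpy as np

def poly_kernel(X, degree=3, coef0=1.0):
    """Gram matrix K[i, j] = (x_i^T x_j + coef0)^degree (assumed kernel parameters)."""
    return (X @ X.T + coef0) ** degree

rng = np.random.RandomState(0)
X = rng.randn(20, 3)                  # 20 training samples with 3 features
K = poly_kernel(X)

alpha = rng.randn(20)
w_norm_sq = alpha @ K @ alpha         # ||w||^2 for w = sum_i alpha_i Phi(x_i)
print("alpha^T K alpha =", w_norm_sq)

# Rescaling alpha enforces alpha^T K alpha = 1, the kernel counterpart of
# ||w||^2 = 1; the relaxed convex version of the constraint is alpha^T K alpha <= 1.
alpha_unit = alpha / np.sqrt(w_norm_sq)
print("after rescaling:", alpha_unit @ K @ alpha_unit)
```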

4  Algorithm

4.1  Simplified DCA

We begin with a brief introduction to the difference-of-convex-functions (DC) program and the DC algorithm (DCA). A DC program is formulated using lower semicontinuous proper convex functions g and h:
$$\min_{x \in \mathbb{R}^n}\ \ g(x) - h(x).$$
4.1
DCA is an efficient algorithm for equation 4.1, and it has been studied theoretically in Pham Dinh and Le Thi (1997). We shall use simplified DCA, which is the standard form of DCA. Simplified DCA sequentially linearizes the concave part −h in equation 4.1 and solves convex subproblems as follows:
$$x^{k+1} \in \mathop{\mathrm{arg\,min}}_{x}\ \bigl\{\, g(x) - \langle v^k, x \rangle \,\bigr\},$$
4.2
where v^k ∈ ∂h(x^k) is a subgradient of h at x^k. The sequence {x^k} generated by simplified DCA has the following good convergence properties:
  • The objective value is decreasing (i.e., g(x^{k+1}) − h(x^{k+1}) ≤ g(x^k) − h(x^k)).

  • DCA has global convergence.

  • Every limit point of the sequence {x^k} is a critical point of g − h, which is also called a generalized KKT point.

A point x* is said to be a critical point of g − h if ∂g(x*) ∩ ∂h(x*) ≠ ∅. This implies that a critical point x* admits u ∈ ∂g(x*) and v ∈ ∂h(x*) such that u = v, a necessary condition for local minima. When equation 4.1 has a convex constraint x ∈ S, we can define the critical point by replacing g with g + δ_S, where δ_S is the indicator function equal to 0 if x ∈ S and +∞ otherwise.
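To make the mechanics of simplified DCA concrete, here is a toy example of my own (not from the letter): f(x) = g(x) − h(x) with g(x) = x⁴ and h(x) = 8x², both convex. The linearized subproblem, equation 4.2, has a closed-form solution, the objective value decreases monotonically, and the iterates converge to the critical point x = 2 when started from x = 0.5.

```python
import numpy as np

g = lambda x: x**4
h = lambda x: 8.0 * x**2
f = lambda x: g(x) - h(x)            # nonconvex DC function

def simplified_dca(x0, iters=30, tol=1e-10):
    x = x0
    for k in range(iters):
        v = 16.0 * x                  # v = h'(x), a subgradient of h at x
        # Convex subproblem: min_x g(x) - v * x  (linearized concave part,
        # constants dropped), solved in closed form via 4 x**3 = v.
        x_new = np.cbrt(v / 4.0)
        print(k, x_new, f(x_new))     # the objective value is nonincreasing
        if abs(x_new - x) < tol:
            break
        x = x_new
    return x

x_star = simplified_dca(0.5)          # converges to x = 2, a critical point of f
```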

4.2  DCA for Extended Robust SVM

As shown in section 3.1, ER-SVM, equation 2.6, can be described as equation 3.1, whose objective function is a difference of CVaR functions. DCA is still not applicable to equation 3.1 because it has the nonconvex constraint ‖w‖² = 1. We further reformulate equation 3.1 into a problem with a convex constraint by using a sufficiently large penalty constant:
formula
4.3

This reformulation is a special case of the exact penalty approach (see Pham Dinh & Le Thi, 1997; Pham Dinh, Le Thi, & Muu, 1999). There exists a threshold such that equations 3.1 and 4.3 have the same set of optimal solutions for every penalty constant above it. We can estimate such a constant in our case by invoking the following lemma.

Lemma 7.

If the penalty constant in equation 4.3 is sufficiently large that the optimal value of equation 4.3 is negative, then equations 3.1 and 4.3 have the same set of optimal solutions.

The key point in the proof of lemma 7 is that CVaR is positively homogeneous (i.e., φ_α(cw, cb) = c φ_α(w, b) for all c ≥ 0). This is a well-known property of coherent risk measures such as CVaR (e.g., Artzner, Delbaen, Eber, & Heath, 1999).

Proof of Lemma 7.
Let (w*, b*) be an optimal solution of equation 4.3 and suppose, to the contrary, that ‖w*‖ < 1. Then the rescaled point (w*/‖w*‖, b*/‖w*‖) achieves a smaller objective value than (w*, b*), since
formula
However, this contradicts the optimality of (w*, b*). Therefore, the optimal solution of equation 4.3 satisfies ‖w*‖ = 1, which implies that it is also optimal to equation 3.1.
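A quick numerical check of the positive homogeneity used above (my own code; CVaR is computed here with the sorted-tail form from section 2.3): scaling all losses r_i by a factor c > 0, which is what rescaling (w, b) does, scales the CVaR by the same factor.

```python
import numpy as np

def cvar(r, alpha):
    """Empirical CVaR of the losses r at level alpha (tail fraction 1 - alpha)."""
    m = len(r)
    k = (1.0 - alpha) * m
    r_sorted = np.sort(r)[::-1]
    full = int(np.floor(k))
    total = r_sorted[:full].sum()
    if full < m and k > full:
        total += (k - full) * r_sorted[full]
    return total / k

rng = np.random.RandomState(2)
r = rng.randn(100)
for c in (0.5, 2.0, 10.0):
    print(c, np.isclose(cvar(c * r, 0.9), c * cvar(r, 0.9)))   # always True
```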
The following DC program represents equation 4.3:
formula
4.4
where
formula
Here, we apply simplified DCA to problem 4.4. Let the solution obtained in the (k − 1)th iteration of simplified DCA be given. At the kth iteration of simplified DCA, we solve the subproblem, equation 4.2, constructed by linearizing the concave part of equation 4.4 at the previous iterate. The subproblem of the kth iteration is thus described as
formula
4.5
where the linearization uses a subgradient, taken from the subdifferential of the concave part of equation 4.4, evaluated at the previous iterate. An optimal solution of equation 4.5 is taken as the kth iterate. Below, we show how to calculate such a subgradient and how to choose a sufficiently large penalty constant.

4.2.1  Subdifferential of CVaR

Here, we show how to calculate the subdifferential of CVaR, equation 2.8, by using the technique described in Wozabal (2012). The subdifferential of CVaR at the previous iterate is
formula
where the set on the right-hand side is the convex hull of the solution set of
formula
4.6

Let I be the index set of the largest values among r_i(w, b), i ∈ M, evaluated at the previous iterate. We can easily find an optimal solution of equation 4.6 by assigning 0 to the components whose indices are not in I and 1 to the others.

formula
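The construction just described can be sketched as follows (my own code and notation, not necessarily matching equation 4.6): sort the losses, give the largest ones the maximal weight allowed in equation 2.8, and map the resulting weights back to a subgradient with respect to (w, b) via r_i(w, b) = −y_i(w⊤x_i + b).

```python
import numpy as np

def cvar_weights(r, alpha):
    """One maximizer q of equation 2.8 for the losses r (ties broken by the sort)."""
    m = len(r)
    cap = 1.0 / ((1.0 - alpha) * m)       # upper bound on each q_i
    order = np.argsort(r)[::-1]           # indices of the losses, largest first
    q = np.zeros(m)
    mass = 1.0
    for i in order:                       # fill the largest losses up to the cap
        q[i] = min(cap, mass)
        mass -= q[i]
        if mass <= 0.0:
            break
    return q

def cvar_subgradient(w, b, X, y, alpha):
    """A subgradient of phi_alpha at (w, b), using r_i(w, b) = -y_i (w^T x_i + b)."""
    r = -y * (X @ w + b)
    q = cvar_weights(r, alpha)
    grad_w = -(q * y) @ X                 # since d r_i / d w = -y_i x_i
    grad_b = -(q * y).sum()               # since d r_i / d b = -y_i
    return grad_w, grad_b

rng = np.random.RandomState(1)
X = rng.randn(50, 2)
y = np.sign(rng.randn(50))
print(cvar_subgradient(np.array([1.0, -0.5]), 0.1, X, y, alpha=0.9))
```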

4.2.2  Efficient Update of the Penalty Constant

Updating the large penalty constant in each iteration makes our algorithm more efficient. We propose to use, in the kth iteration, a penalty constant such that
formula
4.7

Condition 4.7 ensures that the optimal value of the subproblem, equation 4.5, is negative, since the solution from the previous iteration has a negative objective value. With such a penalty constant, lemma 7 holds.

4.2.3  Explicit Form of Subproblem

Now we are ready to describe the subproblem, equation 4.5, explicitly. By using the above results and substituting equation 2.7 for the CVaR term, equation 4.5 results in
formula
4.8

The steps of the algorithm are listed in algorithm 1. The algorithm also has nice convergence properties, such as linear convergence, as shown in section 4.1.

4.2.4  CVaR-Function Decomposition

The decomposition into two CVaR functions enables us to interpret the algorithm and the resulting solution. The algorithm repeatedly solves subproblems obtained by linearizing the concave part, which corresponds to underestimating the sum of the losses of the outliers. The approximation becomes more accurate as the algorithm iterates.

There is another advantage to the decomposition. If the given parameter value is in case C, the penalty term disappears from the objective function in equation 4.3, and the proposed algorithm acts like a polyhedral DC algorithm, which has the nice property of finite convergence (see Pham Dinh & Le Thi, 1997). This property is due to the polyhedral convexity of the CVaR functions.

5  Numerical Results

We compared our approach, DCA for ER-SVM, with the heuristic algorithm (Takeda et al., 2014) for ER-SVM, CCCP (Collobert et al., 2006) for ramp-loss SVM, the Eν-SVM algorithm (Perez-Cruz et al., 2003), and ν-SVM from the LIBSVM software package (Chang & Lin, 2001a). The clipping hyperparameter s of ramp-loss SVM, equation 2.3, was set to 1, and μ in ER-SVM, equation 2.6, was set to 0.05; all models thus had one hyperparameter to be tuned. A comparison with ROD, equation 2.5, using the SDP relaxation (Xu et al., 2006) is omitted because ROD is equivalent to ramp-loss SVM (see lemma 4).

We implemented all methods in Python (version 2.7.6) and used IBM ILOG CPLEX (version 12.6) to solve the optimization problems involved in four of these methods: DCA and the heuristic algorithm for ER-SVM, CCCP for ramp-loss SVM, and the Eν-SVM algorithm. We solved the primal formulation, equation 4.8, while running DCA in order to deal with data sets with a large sample size and a small number of features. In the CCCP algorithm for ramp-loss SVM, dual formulations of the convex subproblems were solved sequentially (see algorithm 2 of Collobert et al., 2006). Almost all of the numerical experiments were done on a PC (CPU: Intel Core i5-3437U (1.90 GHz), Memory: 8 GB, OS: Ubuntu 14.04); a Linux server (CPU: Intel Xeon E5-2680 (2.70 GHz), Memory: 64 GB, OS: Red Hat Enterprise Linux Server) was used for the large-scale data sets.

5.1  Accuracy for Synthetic Data Sets

We used synthetic data generated by following the procedure in Xu et al. (2006). We generated two-dimensional samples with labels +1 and −1 from two normal distributions with different mean vectors and the same covariance matrix, so the optimal hyperplane for the noiseless data set is known. We added outliers only to the training samples of one class by drawing them uniformly from a half-ring with a fixed center and fixed inner and outer radii in the input space. The training set contained 50 samples from each class (100 in total), including the outliers. The ratio of outliers in the training set was varied from 0 to 10%. The test set had 1000 samples from each class (2000 in total). We repeated the experiments 100 times, drawing new training and test sets in every repetition. We found the best parameter setting from nine candidate values for each hyperparameter.
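The following sketch reproduces the flavor of this setup under assumed values: the class means, covariance, half-ring center, and radii below are my own choices, and I arbitrarily attach the outliers to the negative class, since the letter does not list these details here.

```python
import numpy as np

def make_synthetic(n_per_class=50, outlier_ratio=0.05, seed=0):
    rng = np.random.RandomState(seed)
    cov = np.eye(2)
    X_pos = rng.multivariate_normal([+1.0, +1.0], cov, n_per_class)
    n_out = int(round(outlier_ratio * 2 * n_per_class))
    X_neg = rng.multivariate_normal([-1.0, -1.0], cov, n_per_class - n_out)

    # Outliers: uniform on a half-ring (assumed center and inner/outer radii),
    # added only to one class so that the class sizes stay at 50 each.
    theta = rng.uniform(0.0, np.pi, n_out)
    radius = rng.uniform(3.0, 5.0, n_out)
    ring = np.c_[radius * np.cos(theta), radius * np.sin(theta)] + np.array([-1.0, -1.0])

    X = np.vstack([X_pos, X_neg, ring])
    y = np.hstack([np.ones(n_per_class), -np.ones(n_per_class)])
    return X, y

X_train, y_train = make_synthetic(outlier_ratio=0.10)
print(X_train.shape, y_train.shape)   # (100, 2) (100,)
```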

Figure 3a plots the test accuracy of each model versus the outlier ratio. Each error bar spans from the 25th to 75th percentiles. ER-SVM by DCA achieved high average accuracy with small standard deviations especially when the outlier ratio was large. DCA and the heuristic algorithm (Takeda et al., 2014) solved the same problem for ER-SVM, but the heuristic one had poor performance; it seemed to get stuck in poor local solutions.

Figure 3:

(a) The average test accuracy for the synthetic data set. (b, c) Computation times for the synthetic data set and cod-rna.


5.2  Accuracy on Real Data Sets

We used data sets from the UCI repository (Blake & Merz, 1998) and LIBSVM (Chang & Lin, 2001a; see Table 3). We scaled the original data sets so that all attributes had zero mean and unit variance. For most data sets, we generated outliers uniformly from a ring with a fixed center and radius. The radius used to generate outliers was set so that the outliers would have an impact on the test accuracy. When the feature size of the data set was large, we generated outliers in different ways: a label flip for binary-class data sets or samples from a third class for multiclass data sets.

Table 3:
List of UCI and LIBSVM Data Sets.
Data Set      Used Classes            Number of Features    Number of Samples    Outlier Type
svmguide1  3089   
cod-rna  53,581   
diabetes  768   
vehicle Class 1 versus rest 18 846   
satimage Class 6 versus rest 36 4435   
splice  60 1000   
mushrooms  112 8124  label-flip 
adult  123 1605   
dna Class 2 versus class 3 180 3955  class 1 
MNIST Class 1 versus class 7 768 15,170  class 9 
internet ad  1554 3279  label-flip 

Notes: For most data sets, outliers are generated on a sphere with a fixed center and radius. The radius was set so that the outliers would have an impact on the test accuracy. When the feature size of the data set is large, outliers are generated in different ways: a label flip for binary-class data sets or samples from a third class for multiclass data sets. One entry for cod-rna could not be calculated because the algorithm ran out of memory.

The best parameter was chosen from nine candidates for each model. In this experiment, we split the data set into a training set, validation set, and test set in the ratio 4:3:3. The hyperparameters were determined on the training and validation sets, both of which were contaminated. The performance was evaluated on the clean test set.

Table 4 shows the results of 30 trials for most data sets (only 3 or 10 trials for the large-scale data sets because of long computation times). "–" in Table 4 indicates an out-of-memory error. "N" in Table 4 means that ER-SVM (or Eν-SVM) achieved its best accuracy with a hyperparameter value in the extended range (case N) at least once, and "C" means that the best accuracy was always achieved in the nonextended range (case C).

Table 4:
Average Test Accuracy and F1 Score over 30 Trials for Most Data Sets: The Proposed Method, ER-SVM (DCA), Compared with the Others.
                     ER-SVM (DCA)      ER-SVM (heuristics)   Ramp-Loss SVM     Eν-SVM            ν-SVM
Data Set      OR     ave.acc   F1      ave.acc   F1          ave.acc   F1      ave.acc   F1      ave.acc   F1
svmguide1 0% 0.954 0.965 0.956 0.966 0.959 0.968 0.954 0.964 0.954 0.964 
 3% 0.944 0.956 0.939 0.952 0.946 0.958 0.921 0.939 0.879 0.911 
 5% 0.948 0.959 0.935 0.950 0.946 0.958 0.909 0.931 0.758 0.846 
 10% 0.937 0.952 0.920 0.939 0.893 0.925 0.887 0.916 0.658 0.790 
cod-rna 0% 0.939 0.912 0.939 0.911 – – 0.938 0.909 0.939 0.910 
 3% 0.937 0.908 0.937 0.908 – – 0.935 0.903 0.914 0.863 
 5% 0.938 0.911 0.938 0.910 – – 0.936 0.905 0.773 0.522 
 10% 0.936 0.904 0.936 0.906 – – 0.932 0.897 0.667 0.500 
diabetes 0% 0.757 0.829 0.755 0.827 0.760 0.828 0.762 0.828 0.769 0.836 
 3% 0.756 0.823 0.742 0.819 0.752 0.820 0.754 0.821 0.749 0.824 
 5% 0.757 0.827 0.749 0.822 0.753 0.822 0.746 0.816 0.747 0.823 
 10% 0.750 0.822 0.728 0.811 0.742 0.818 0.732 0.811 0.728 0.812 
vehicle 0% 0.789 0.535 0.781 0.524 0.787 0.534 0.792 0.530 0.791 0.532 
 3% 0.773 0.517 0.762 0.480 0.777 0.492 0.767 0.488 0.780 0.465 
 5% 0.779 0.498 0.759 0.434 0.761 0.403 0.751 0.420 0.756 0.423 
 10% 0.768 0.461 0.749 0.326 0.757 0.129 0.714 0.304 0.745 0.386 
satimage 0% 0.908 0.792 0.905 0.788 0.908 0.793 0.909 0.798 0.911 0.801 
 3% 0.903 0.786 0.893 0.764 0.899 0.777 0.896 0.773 0.898 0.778 
 5% 0.899 0.775 0.889 0.752 0.898 0.775 0.892 0.764 0.891 0.762 
 10% 0.893 0.766 0.888 0.753 0.897 0.773 0.890 0.753 0.874 0.738 
splice 0% 0.784 0.784 0.772 0.778 0.783 0.781 0.794 0.790 0.783 0.786 
 3% 0.771 0.775 0.756 0.760 0.772 0.776 0.774 0.777 0.777 0.779 
 5% 0.769 0.765 0.745 0.747 0.756 0.756 0.761 0.761 0.755 0.765 
 10% 0.756 0.753 0.727 0.730 0.725 0.725 0.733 0.737 0.737 0.743 
Table 4:
Continued.
                     ER-SVM (DCA)      ER-SVM (heuristics)   Ramp-Loss SVM     Eν-SVM            ν-SVM
Data Set      OR     ave.acc   F1      ave.acc   F1          ave.acc   F1      ave.acc   F1      ave.acc   F1
mushrooms 0% 0.982 0.981 0.983 0.982 1.000 1.000 0.999 0.999 1.000 1.000 
 5% 0.998 0.998 0.941 0.936 0.999 0.999 0.998 0.998 0.999 0.999 
 10% 0.998 0.998 0.917 0.904 0.998 0.998 0.998 0.998 0.998 0.997 
 15% 0.998 0.997 0.903 0.887 0.997 0.997 0.997 0.996 0.997 0.997 
adult 0% 0.820 0.588 0.808 0.438 0.813 0.598 0.817 0.596 0.809 0.599 
 5% 0.821 0.592 0.797 0.381 0.816 0.590 0.820 0.600 0.813 0.602 
 10% 0.820 0.597 0.790 0.435 0.817 0.585 0.817 0.601 0.819 0.615 
 15% 0.817 0.580 0.776 0.345 0.805 0.594 0.814 0.580 0.810 0.595 
dna 0% 0.971 0.955 0.971 0.955 0.976 0.963 0.976 0.963 0.974 0.960 
 5% 0.968 0.944 0.961 0.938 0.963 0.942 0.967 0.946 0.965 0.946 
 10% 0.957 0.933 0.954 0.931 0.958 0.934 0.963 0.942 0.960 0.939 
 15% 0.950 0.924 0.945 0.919 0.951 0.924 0.957 0.935 0.955 0.932 
MNIST 0% 0.988 0.989 0.987 0.988 0.993 0.994 0.992 0.992 0.994 0.994 
 5% 0.986 0.986 0.981 0.982 0.985 0.985 0.984 0.985 0.987 0.987 
 10% 0.989 0.989 0.976 0.976 0.976 0.976 0.981 0.981 0.961 0.963 
 20% 0.982 0.982 0.980 0.980 0.972 0.973 0.872 0.877 0.817 0.787 
internet ad 0% 0.944 0.758 0.921 0.619 0.963 0.857 0.964 0.860 0.966 0.868 
 3% 0.955 0.820 0.938 0.720 0.960 0.843 0.960 0.839 0.961 0.840 
 5% 0.960 0.837 0.948 0.788 0.958 0.834 0.959 0.842 0.961 0.834 
 10% 0.953 0.798 0.953 0.800 0.952 0.803 0.936 0.752 0.946 0.778 

Notes: The ratio of outliers (OR) is varied for each data set. – indicates an out-of-memory error. N implies that the objective value of ER-SVM (or Eν-SVM) was positive (that is, case N in Table 2 occurred and the nonconvex constraint worked effectively) at least once in the trials. Only 3 or 10 trials were used for the large-scale data sets because of the long computation times. The best scores among these methods are in bold.

The prediction performance of ER-SVM by DCA was better than that of the other methods and very stable as the outlier ratio increased. ν-SVM often performs well when data sets contain no or few outliers, but it tended to be beaten by ER-SVM (DCA) as the outlier ratio increased. Our DCA algorithm could find a better solution, and this led to ER-SVM by DCA having better prediction performance than the heuristic method. We can also see from the table that the nonconvex cases (N) of ER-SVM and Eν-SVM performed better than their convex cases on the difficult data sets (e.g., diabetes, vehicle, splice), where all linear SVMs performed poorly.

5.3  Comparison of CPU Times

Here, we show the trend of the computation time of our method. Figure 3b plots the average of the CPU times in the range from the 25th to the 75th percentile versus the outlier ratio for each method on the synthetic data set of Figure 3a. While the computation times of some algorithms, such as ν-SVM (LIBSVM) and ramp-loss SVM, slightly increased as the outlier ratio increased, the time of ER-SVM (DCA) was not significantly affected by increasing the outlier ratio.

Figure 3c shows the CPU time of each method with respect to the sample size of the cod-rna data set. Subsets with the sample size indicated on the horizontal axis were randomly chosen 10 times. The error bars show the range from the 25th to 75th percentile.

Both figures imply that our method, ER-SVM by DCA, performed comparably to the other methods despite having two kinds of nonconvexity. Indeed, our method was faster than the other robust SVM approaches: the heuristics for ER-SVM and ramp-loss SVM. We omitted ramp-loss SVM from Figure 3c because out-of-memory errors occurred for 10,000 training samples when we tried to solve the dual-formulated subproblems of ramp-loss SVM by using CPLEX (a state-of-the-art commercial optimization software package). We can also see that the computation time of our method varied over a smaller range than that of LIBSVM for ν-SVM as the parameter value was varied.

5.4  Detailed Observations on DCA

5.4.1  Comparison of DCA and Heuristics (Figure 4a)

Table 4 reveals the advantage of DCA (ER-SVM (DCA)) over the heuristic method (ER-SVM (heuristics)) in terms of prediction accuracy. Here, we assess the quality of the solutions found by these algorithms by comparing the objective values they attain on the liver data set.

Figure 4:

(a) The number of times ER-SVM (DCA) achieved smaller objective values than those of the heuristic algorithm in 300 trials. (b) Faster convergence with our update rule for the penalty constant in equation 4.7 than with a fixed constant. (c) Computation time of algorithm 1 for the liver data set. (d) Test accuracy of ER-SVM with the polynomial kernel for liver. (e) An example where ER-SVM obtained a nontrivial classifier, but C-SVM, ν-SVM, and ramp-loss SVM obtained trivial classifiers (w = 0) for any hyperparameter value. (f) The performance of DCA for ER-SVM and of its restricted case C (which is equivalent to ramp-loss SVM).


The initial solutions were selected from a uniform random distribution on the unit sphere, and the experiments were repeated 300 times. We set the hyperparameter ν for DCA to the value automatically selected by the heuristic algorithm. We counted how many times our algorithm (DCA) achieved smaller objective values than those of the heuristics, recording win, lose, and draw cases over the 300 trials (a win or loss is declared when the gap between the objective values is more than 3%, allowing for numerical error). Figure 4a shows that our algorithm (DCA) tended to achieve wins or draws, especially for larger values of ν. This result supports the claim of Le Thi and Pham Dinh (2005) and Pham Dinh and Le Thi (1997) that DCA tends to converge to a global solution.

5.4.2  Efficient Update of the Penalty Constant

We use the results for the liver data set to show the effectiveness of the automatic update rule, equation 4.7, for the penalty constant. Figure 4b implies that convergence with our update rule is much faster than with a fixed constant (e.g., 0.5).

5.4.3  Computation Time on Case N

Here, we investigate the change in computation time with respect to the parameter ν. Figure 4c shows the computation time averaged over 100 trials. The vertical line is the estimated value of the lower threshold of ν. The extension of the parameter range corresponds to values of ν below this threshold, where the nonconvex constraint worked. ER-SVM with such ν found classifiers that ROD and ramp-loss SVM did not. The figure shows that the computation time did not change much except around the threshold. The optimal margin variable ρ was zero around the threshold, which might make the problem numerically unstable.

5.4.4  Kernelized ER-SVM

Figure 4d shows the test accuracy of ER-SVM with a linear kernel or a polynomial kernel. The hyperparameters of the polynomial kernel were fixed. We used 50% of the data set as the training set and the rest as the test set. The training set was contaminated by using a half-ring, as in section 5.2 or Xu et al. (2006). The test accuracy was evaluated as the mean of 100 trials for each setting. The markers in Figure 4d indicate that the objective value was positive (i.e., case N in Table 2 occurred and the nonconvex constraint worked effectively) at least once in the 100 trials. This result implies that the extension is effective not only for the linear kernel but also for the polynomial kernel, especially when the data set is contaminated. Furthermore, the polynomial kernel was more accurate than the linear one regardless of the outlier ratio, which indicates the effectiveness of the kernelization of ER-SVM.

5.5  Effectiveness of the Extension

Figure 4e verifies theorem 6. When the number of samples in each class is unbalanced or the samples of the two classes overlap substantially, as in Figure 4e, C-SVM, ν-SVM, and ramp-loss SVM obtain trivial classifiers (w = 0), while ER-SVM obtains a nontrivial classifier. That is, this figure shows the effectiveness of the nonconvex constraint ‖w‖² = 1.

Figure 4f shows how the test accuracy of each method is affected by the degree of overlap between the samples of the two classes. We used synthetic data sets that had the same covariance and number of samples as in Figure 4e, but only the mean of the distribution of positive samples was changed. The horizontal axis is the distance between the means of the distributions of positive and negative samples. The hyperparameters of ER-SVM were fixed. The solid line is our method (DCA for ER-SVM), and the dashed line is its case C (which is equivalent to ramp-loss SVM), where the penalty constant in equation 4.8 is set to zero. Figure 4f implies that case N occurred frequently when the distance was small and that the nonextended model had poor test accuracy in such cases because it admits w = 0 as an optimal solution.

6  Conclusion

We theoretically analyzed ER-SVM, proving that ER-SVM is a natural extension of the existing robust SVMs, and discussed the conditions under which such an extension works. Furthermore, we proposed a new, efficient algorithm that has theoretically good properties. Numerical experiments showed that our algorithm works efficiently.

In this letter, we focused on binary classification problems. Along the same line, we can formulate extended robust variants of learning algorithms for other statistical problems such as regression problems. The proposed simplified DCA with a similar CVaR-function decomposition is applicable to extended robust learning algorithms. In particular, Wu and Liu (2007) proposed a multiclass extension of binary SVM using the ramp loss. We may be able to formulate such a multiclass extension for ER-SVM and apply the proposed DCA to the resulting problem.

We now have to handle large-scale data sets arising from real-world problems. When the given data set has a large feature size, solving the duals of the subproblems, equation 4.8, may speed up our algorithm, DCA, more than solving the primal subproblems does. An alternative practical approach to solving large-scale problems is to use a memory-efficient method such as the SMO algorithm (Platt, 1998) for solving the subproblems, although its worst-case computation time can be quite large. When the given data set has a large sample size, a stochastic variant of our method might be necessary. We would like to improve our method so that it works better on large-scale data sets.

Our method, ER-SVM by DCA, tends to perform better than the other methods especially on data sets with small feature sizes. For high-dimensional data sets, we recommend combining our method with feature selection.

Appendix A: Proof of Proposition 3

The objective function of ER-SVM, equation 2.6, under the constraints can be equivalently rewritten as
formula
The last equality is obtained by using a constraint of ER-SVM, equation 2.6.
Now we show that the term in the last equation,
formula
can be written as a CVaR by showing that the required equality holds at the optimal solution of equation 2.6. Here, we assume that the losses are sorted in descending order. Then the optimal values of the outlier-indicator variables should be
formula
Note that the optimal solution must also be an optimal solution of the following problem:
formula
This problem can be regarded as minimizing a CVaR for the truncated distribution in which the removed samples are excluded. Theorem 2 ensures that
formula
which implies that the required equality holds for the indices in question.
Therefore, we can rewrite the objective function of ER-SVM, equation 2.6, as
formula
A.1
By using equation 2.7 for the first two terms of equation A.1 and using equation 2.8 for the last term, we can further rewrite equation A.1 as
formula
which is the objective function of equation 3.1. This implies that ER-SVM, equation 2.6, can be described in the form of equation 3.1.

Appendix B: KKT Conditions for ER-SVM, ROD, and Ramp-Loss SVM

We show differentiable formulations of case C of ER-SVM, equation 2.6, ROD, equation 2.5, and ramp-loss SVM, equation 2.4, to write the KKT conditions for each model.

Continuous ramp-loss SVM
formula
B.1
Continuous ROD
formula
B.2
Continuous ER-SVM (limited to Case C in Table 2)
formula
B.3

The KKT conditions of the above problems are defined as follows.

KKT conditions of equation B.1, with its Lagrange multipliers:
formula
B.4a
formula
B.4b
formula
B.4c
formula
B.4d
formula
B.4e
formula
B.4f
formula
B.5a
formula
B.5b
formula
B.5c
formula
B.5d
formula
B.6
KKT conditions of equation B.2, with its Lagrange multipliers: equations B.4 and B.5, together with
formula
B.7
formula
B.8
KKT conditions of equation B.3, with its Lagrange multipliers: equations B.4 and B.8, together with
formula
B.9a
formula
B.9b
formula
B.9c
formula
B.9d
formula
B.9e
formula
B.9f
formula
B.9g

B.1 Proof of Lemma 4

The only difference between the KKT conditions of ramp-loss SVM, equation B.1, and those of ROD, equation B.2, lies in equations B.6 to B.8.

Note that a KKT point of ramp-loss SVM, equation B.1, satisfies the KKT conditions of ROD, equation B.2, with a corresponding hyperparameter value. On the other hand, a KKT point of equation B.2 whose Lagrange multiplier for the constraint on the outlier variables is positive satisfies the KKT conditions of ramp-loss SVM, equation B.1, with a corresponding hyperparameter value.

B.2 Proof of Theorem 5

Let us prove the first statement. Let a point be a KKT point of equation B.2 with its hyperparameters and Lagrange multipliers, and suppose w ≠ 0. Then it is a KKT point of equation B.3 with corresponding Lagrange multipliers and hyperparameter values.

Now let us prove the second statement. Let a point be a KKT point of equation B.3 with its Lagrange multipliers, and suppose w ≠ 0. If the Lagrange multiplier for the norm constraint is positive, the point satisfies the KKT conditions of equation B.2 with corresponding Lagrange multipliers and a corresponding hyperparameter value. We now prove this positivity using the assumption of a nonzero optimal value. Consider the following problem, which fixes w in equation B.3 to the w of the KKT point:
formula
B.10
The KKT conditions of equation B.10 are as follows:
formula
This point is also a KKT point of equation B.10 with the same Lagrange multipliers as for equation B.3. Moreover, it is not only a KKT point but also an optimal solution of equation B.10 because B.10 is a convex problem. Since the objective function of the dual problem of equation B.10 involves the multiplier in question, the multiplier is zero if and only if the objective value is zero. Then we can see that the multiplier is positive under the assumption of a nonzero objective value.

Appendix C: Proof of Theorem 6

Consider the dual problems of C-SVM and ν-SVM:
formula
Let us denote the optimal solutions of these two dual problems accordingly. Note that the optimal w of C-SVM and ν-SVM can be represented in terms of the respective dual solutions by using the KKT conditions of C-SVM and ν-SVM. Then w = 0 if and only if the corresponding condition holds for the optimal solutions of the dual problems.

Under this correspondence, we show that the following statements about a training set are equivalent.

  1. The training set satisfies the RCH condition of theorem 6.

  2. ν-SVM has an optimal solution such that w = 0 for all hyperparameter values ν.

  3. C-SVM has an optimal solution such that w = 0 for all hyperparameter values C.

Statements c2 and c3 imply that ν-SVM and C-SVM obtain a trivial solution such that w = 0 for any hyperparameter value.

The equivalence of statements c1 and c2 can be seen by giving a geometric interpretation of ν-SVM. From the result of Crisp and Burges (2000), the dual problem of ν-SVM can be described as
formula
C.1
By using an appropriate scaling, this problem and equation C.1 have the same set of optimal solutions. We denote the optimal solution of equation C.1 as