Abstract

We investigate an approach based on DC (Difference of Convex functions) programming and DCA (DC Algorithm) for online learning. The prediction problem of an online learner can be formulated as a DC program to which online DCA is applied. We propose two versions of the online DCA scheme, called complete and approximate, and prove sublinear and logarithmic regret bounds for them. Six online DCA-based algorithms are developed for online binary linear classification. Numerical experiments on a variety of benchmark classification data sets show the efficiency of our proposed algorithms in comparison with state-of-the-art online classification algorithms.

1  Introduction

Online learning can be seen as the process of predicting answers to the sequential arrival of questions based on the knowledge of the correct answers to previous questions and possibly other available information (Shalev-Shwartz, 2012). It plays a significant role in multiple contexts. For example, when the data samples are available over time, the predictions must be made in real time, the learner is required to dynamically adapt to new data patterns, or even learning over the entire data at once is impossible in the computational aspect. Applications range from online advertisement placement to online web ranking, online email categorization, and real-time recommendation (Shalev-Shwartz, 2007, 2012).

The learner makes a prediction in a sequence of consecutive rounds. On each online round, the learner receives an incoming question and must predict an answer to this question. After that, the correct answer is revealed, and the learner will suffer some loss. The whole process is summarized in the following protocol:

Online Learning

Input: a question space X, a possible answer space Y, a loss function ℓ.

for each step t = 1, 2, … do

  1. Receive a question x_t ∈ X.

  2. Predict an answer p̄_t ∈ Y.

  3. Receive the true answer y_t ∈ Y.

  4. Suffer loss ℓ(p̄_t, y_t).

end for
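The protocol above can be sketched as a short loop. The helper names (predict, update, loss) are hypothetical placeholders for whatever concrete learner is plugged in; this is an illustrative sketch, not an algorithm from the letter.

```python
def online_learning(questions, answers, predict, update, loss):
    """Generic online learning protocol; predict/update/loss are
    hypothetical callbacks supplied by the concrete learner."""
    cumulative_loss = 0.0
    for x_t, y_t in zip(questions, answers):  # rounds t = 1, 2, ...
        p_t = predict(x_t)                    # steps 1-2: receive x_t, predict
        cumulative_loss += loss(p_t, y_t)     # steps 3-4: receive y_t, suffer loss
        update(x_t, y_t)                      # learner adapts its model
    return cumulative_loss
```

The returned value is exactly the cumulative loss L_T used in the regret definition.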

It is assumed that all the answers can be given by a hypothesis h̄ : X → Y (Shalev-Shwartz, 2012). The set of possible hypotheses is denoted by H. Let us denote by L_T and L_{h̄,T}, respectively, the cumulative loss of the learner and the cumulative loss of the hypothesis h̄ after T prediction steps, i.e.,
L_T := Σ_{t=1}^T ℓ(p̄_t, y_t),  L_{h̄,T} := Σ_{t=1}^T ℓ(h̄(x_t), y_t).
The main goal for the online learner is to minimize the regret:
R_T := Σ_{t=1}^T ℓ(p̄_t, y_t) − min_{h̄∈H} Σ_{t=1}^T ℓ(h̄(x_t), y_t) = L_T − min_{h̄∈H} L_{h̄,T}.

Due to this goal, online learning is seen as a general class of techniques at the interface between machine learning and online optimization. Until now, most of the effective online learning algorithms have been derived from online convex optimization (Gentile, 2003; Zinkevich, 2003; Hazan, Agarwal, & Kale, 2007; Shalev-Shwartz, 2007, 2012; Shalev-Shwartz & Singer, 2007; Hazan, 2016). The paradigm of online convex optimization, where the prediction domain and the loss functions are convex, was introduced by Zinkevich (2003) and Shalev-Shwartz and Singer (2007). In online convex optimization, there is a common update rule to predict the answer at each step: follow-the-leader (Kalai & Vempala, 2005) and its regularized form (Shalev-Shwartz, 2007; Shalev-Shwartz & Singer, 2007). In the follow-the-leader rule, the learner minimizes the cumulative loss function over all past steps, whereas the regularized form adds a convex regularization function to the cumulative loss and thereby introduces stability into the predictions. From this update rule, most of the effective online convex algorithms have been derived, such as online gradient descent (with lazy or greedy projections; Zinkevich, 2003), exponentiated gradient (Kivinen & Warmuth, 1997, 2001; Azoury & Warmuth, 2001), p-norm (Gentile, 2003), and their variants.

One common difficulty in most practical applications is that loss functions to assess the predictions are often nonsmooth or nonconvex (e.g., 0-1 loss function returning 0 if the prediction is correct and 1 otherwise), or the domain of predictions is nonconvex (Chung, 1994). Hence, solving the resulting optimization problem becomes more intractable. The disadvantages of using online convex optimization approaches have been mentioned in several works (Cesa-Bianchi & Lugosi, 2006; Shalev-Shwartz, 2007, 2012). Thus, it is essential to resort to nonconvex optimization in online mode to overcome the difficulties.

Recently, some works have addressed online learning with nonconvex loss functions. Ertekin, Bottou, and Giles (2011) proposed a nonconvex online algorithm for support vector machine problems with the ramp loss function, based on a special version of DCA (Difference of Convex functions Algorithm) for smooth functions, while Gasso, Pappaioannou, Spivak, and Bottou (2011) presented an online algorithm for nonconvex Neyman-Pearson classification problems using the gradient method. These works, however, did not study the regret bounds of the online algorithms. Later, online algorithms with a submodular loss function were given in Hazan and Kale (2012), and an algorithm for online bandit learning problems with a nonconvex loss function (i.e., only the suffered loss is available) was proposed in Zhang, Yang, Jin, and Zhou (2015). Exponential weighting methods applied to online (nonconvex) learning problems were presented in Maillard and Munos (2010) and Krichene, Balandat, Tomlin, and Bayen (2015). Using a special grid layered structure on the decision set, Yang, Deng, Hajiesmaili, Tan, and Wong (2018) recently introduced some variants of an online recursive weighting algorithm in a full-information setting (the knowledge of the loss function is used). Gao, Li, and Zhang (2018) studied online nonconvex optimization with a new performance metric, the nonstationary regret, which measures the discrepancy between the cumulative losses of the online learner and the cumulative losses of the best possible responses. The authors also proposed an online normalized gradient descent algorithm and its bandit version in full- and partial-information settings.

We investigate DC (Difference of Convex functions) programming and DCA for online learning problems with nonconvex loss functions. The idea is to approximate the nonconvex loss function by a DC loss function and then apply online DCA to minimize the resulting DC loss function. At each iteration, online DCA consists of approximating the current DC loss function by a convex majorization and then solving the resulting convex subproblem. We propose variants of online DCA in which the convex subproblems are solved completely or approximately by a subgradient method, and we prove that these variants have a vanishing per-step regret. As an application, we apply the proposed online DCA algorithms to online binary linear classification using the 0-1 loss function. Thanks to the regret bounds of these algorithms, we derive bounds on their numbers of prediction mistakes. We show the efficiency of the proposed algorithms on a variety of benchmark classification data sets in comparison with five state-of-the-art online binary linear classification algorithms.

The rest of the letter is organized as follows. In section 2, we briefly introduce DC programming and DCA, then present an online DCA scheme in the context of online learning, and finally propose some variants of online DCA with their regret bounds. The development of these variants for online binary linear classification is given in section 3. Section 4 reports the numerical results on several test problems. Section 5 concludes the letter.

2  Online DCA for Online Learning

2.1  Outline of DC Programming and DCA

DC programming and DCA were introduced by Pham Dinh Tao in a preliminary form in 1985 and have been extensively developed by Le Thi Hoai An and Pham Dinh Tao since 1994 (Pham Dinh & Le Thi, 1997, 1998, 2014; Le Thi & Pham Dinh, 2005, 2018) to become now classic and increasingly popular. They address DC programs of the form
inf { f(w) := g(w) − h(w) : w ∈ R^n }   (P_dc)
where g, h ∈ Γ₀(R^n), the set of all lower semicontinuous proper convex functions on R^n. Such a function f is called a DC function, g − h is a DC decomposition of f, and g and h are the DC components of f.

The main idea of DCA is quite simple. It consists of approximating a DC program by a sequence of convex programs: each iteration l of DCA approximates the concave part −h by its affine majorization (which corresponds to taking z^l ∈ ∂h(w^l)) and minimizes the resulting convex function. The generic DCA scheme can be described as follows.

Generic DCA Scheme

Initialization. Choose an initial point w⁰. Set l ← 0.

Repeat

Step 1. Compute z^l ∈ ∂h(w^l).

Step 2. Compute w^{l+1} ∈ argmin { g(w) − ⟨w, z^l⟩ : w ∈ R^n }.

Step 3. l ← l + 1.

Until stopping condition is satisfied.
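As an illustration, the generic DCA scheme can be instantiated on a toy DC program whose convex subproblem has a closed form. The decomposition below (g = 0.5‖·‖², h = ‖·‖₁) is our own illustrative choice, not one from the letter; with it, step 2's subproblem min g(w) − ⟨w, z⟩ is solved by w = z.

```python
import numpy as np

# Toy DC program: f(w) = 0.5*||w||^2 - ||w||_1, with g = 0.5*||.||^2 and
# h = ||.||_1 (an assumed illustrative decomposition).
def dca(w0, max_iter=100, tol=1e-8):
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        z = np.sign(w)          # Step 1: z^l in the subdifferential of h at w^l
        w_next = z              # Step 2: argmin 0.5*||w||^2 - <w, z> is w = z
        if np.linalg.norm(w_next - w) <= tol:  # Steps 3-4: stop when stable
            return w_next
        w = w_next
    return w
```

Starting from any point with nonzero coordinates, the iterates jump to a vector of signs, which is a critical point of this toy DC program.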

In recent years, numerous DCA-based algorithms have successfully solved large-scale nonsmooth and nonconvex programs appearing in several application areas, especially machine learning, communication systems, biology, and finance (see, e.g., Phan & Le Thi, 2019; Le Thi, Ho, & Pham Dinh, 2019; Le Thi & Pham Dinh, 2018; Phan, Le, & Le Thi, 2018; Le Thi, Le, Phan, & Tran, 2017; Le Thi & Phan, 2017; Le Thi & Nguyen, 2017; Phan, Le Thi, & Pham Dinh, 2017; Le Thi, 2005). DCA has been proved to be a fast and scalable approach, which is, thanks to the effect of DC decompositions, more efficient than related methods. (For a comprehensive survey on 30 years of development of DCA, see Le Thi & Pham Dinh, 2018.)

2.2  Online DCA

The generic DCA scheme can be adapted as follows for solving online DC problems where the set of predictions is convex and the loss suffered at each step is a DC function. At each learning step, we have to minimize a DC loss function F_t over the set of predictions S. We are therefore faced with a (standard) DC program. As we are in the "online" context where data arrive sequentially, completely solving this DC program may not be imperative. Instead, we perform only one iteration of DCA.

Let us denote by T the number of online learning steps. The function F_t can be defined either as the cumulative loss Σ_{τ=1}^t f_τ or as the current loss f_t. In this letter, we define F_t simply as the current loss function f_t, say F_t = f_t := g_t − h_t with g_t and h_t convex functions. The online DCA scheme can be described as follows.

Online DCA for Online DC Programming

Input: a convex set S

Initialization: set an initial point w₀ and observe a DC loss function f₀ := g₀ − h₀.

for t = 1, 2, … until convergence

  1. Predict a vector w_t ∈ S by performing one iteration of DCA:

    1.1. Compute a subgradient z_t ∈ ∂h_{t−1}(w_{t−1}).

    1.2. Solve the convex program
      min { g_{t−1}(w) − ⟨z_t, w⟩ : w ∈ S }
      (2.1)
      to obtain w_t.

  2. Observe a DC loss function f_t := g_t − h_t.

  3. Suffer the loss f_t(w_t) and update the model.

end for

Output: {wt}.
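The scheme above can be sketched as a short driver loop in which the convex subproblem solver is left abstract. All callback names here are hypothetical stand-ins, not part of the letter's algorithms.

```python
import numpy as np

def online_dca(w0, rounds, grad_h, solve_subproblem, observe_loss):
    """One DCA iteration per online round (hypothetical helper names).

    grad_h(t, w) returns a subgradient of h_t at w; solve_subproblem(t, z)
    returns argmin_{w in S} g_t(w) - <z, w>; observe_loss(t, w) gives f_t(w).
    """
    w = np.asarray(w0, dtype=float)
    iterates, losses = [], []
    for t in range(1, rounds + 1):
        z = grad_h(t - 1, w)               # step 1.1: subgradient of h_{t-1}
        w = solve_subproblem(t - 1, z)     # step 1.2: convex program (2.1)
        iterates.append(w)
        losses.append(observe_loss(t, w))  # steps 2-3: observe f_t, suffer f_t(w_t)
    return iterates, losses
```

With the toy choice g_t(w) = 0.5w² and h_t(w) = w, the subproblem has the closed form w = z = 1 at every round.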

In the online DCA scheme, subproblem 2.1 can be solved by any solver for convex optimization problems. In this letter, we propose using a subgradient method (see, e.g., Shor, 1985). In particular, at step t, we need to compute, at each iteration k ∈ {0, 1, …}, a subgradient r_{t−1,k} ∈ ∂g_{t−1}(w_{t−1,k}) and then determine w_{t−1,k+1} ∈ S by
w_{t−1,k+1} = Proj_S(w_{t−1,k} − η_{t−1}(r_{t−1,k} − z_{t−1})),
with a step size η_{t−1} and w_{t−1,0} := w_{t−1}. Here Proj_S denotes the orthogonal projection onto S.

When the convex subproblems in online DCA are completely solved by the subgradient method, the corresponding DCA is called the complete version of online DCA and summarized in the following scheme:

ODCA: Complete Online DCA Scheme with Projected Subgradient Method

Perform the online DCA in which step 1.2 is replaced by

Set w_{t−1,0} = w_{t−1}, k = 0.

repeat

Compute a subgradient r_{t−1,k} ∈ ∂g_{t−1}(w_{t−1,k}).

Set s_{t−1,k} = r_{t−1,k} − z_{t−1}.

Compute
w_{t−1,k+1} = Proj_S(w_{t−1,k} − η_{t−1} s_{t−1,k}).
(2.2)

Set k = k + 1.

until ‖w_{t−1,k} − w_{t−1,k−1}‖₂ ≤ ε(‖w_{t−1,k−1}‖₂ + 1)

Set w_t = w_{t−1,k}.
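The inner loop of this complete scheme can be sketched as a projected subgradient solver with the stated relative stopping rule. The callback names (grad_g, proj) are hypothetical; the quadratic test case below is our own illustration.

```python
import numpy as np

def odca_step(w_prev, grad_g, z_prev, proj, eta, eps=1e-4, max_iter=5000):
    """Completely solve subproblem (2.1) by projected subgradient
    iterations (2.2), stopping as in the ODCA scheme."""
    w_old = np.asarray(w_prev, dtype=float)
    for _ in range(max_iter):
        r = grad_g(w_old)                 # r_{t-1,k} in dg_{t-1}(w_{t-1,k})
        s = r - z_prev                    # s_{t-1,k} = r_{t-1,k} - z_{t-1}
        w_new = proj(w_old - eta * s)     # projected subgradient update (2.2)
        if np.linalg.norm(w_new - w_old) <= eps * (np.linalg.norm(w_old) + 1):
            return w_new                  # relative stopping test
        w_old = w_new
    return w_old
```

For g(w) = 0.5‖w‖², z = 0, and S = R^n, each iteration halves w (with eta = 0.5), so the loop converges to (approximately) the minimizer 0.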

We observe that completely solving the subproblem in online DCA by the subgradient method may be computationally expensive. Thus, we propose the so-called approximate version of ODCA, named ODCAk, in which the convex subproblem 2.1 is solved approximately by one iteration of the subgradient method.

ODCAk: Approximate Online DCA Scheme with Projected Subgradient Method

Perform the online DCA scheme in which step 1.2 is replaced by

Compute a subgradient r_{t−1} ∈ ∂g_{t−1}(w_{t−1}).

Set s_{t−1} = r_{t−1} − z_{t−1}.

Compute w_t = Proj_S(w_{t−1} − η_{t−1} s_{t−1}).
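The approximate step is then a single projected subgradient update; the sketch below mirrors the three lines of the scheme (the helper names are hypothetical).

```python
import numpy as np

def odcak_step(w_prev, grad_g, z_prev, proj, eta):
    """One projected subgradient iteration approximating subproblem (2.1)."""
    s = grad_g(w_prev) - z_prev          # s_{t-1} = r_{t-1} - z_{t-1}
    return proj(w_prev - eta * s)        # w_t = Proj_S(w_{t-1} - eta_{t-1} s_{t-1})
```

On the same quadratic toy problem as before, one step with eta = 0.5 simply halves the iterate.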

2.3  Regret Bounds of ODCA and ODCAk

In this section, we analyze the regret bounds of the two online DCA schemes ODCA and ODCAk. We define the regret of an algorithm A until step T by
Regret_A^T = Σ_{t=1}^T f_t(w_t) − min_{w∈R^n} Σ_{t=1}^T f_t(w),
(2.3)
where the sequence {w_1, w_2, …, w_T} is generated by the algorithm A.

Now we prove that ODCA and ODCAk have a vanishing per-step regret (or a sublinear regret); that is, Regret_A^T grows sublinearly with the number of steps T: lim_{T→+∞} Regret_A^T / T = 0. Under an additional assumption, we can achieve a logarithmic regret bound O(log T) for ODCAk.

First, we state our assumptions on the DC functions f_t in assumption 1:

Assumption 1.

There exist positive parameters α, γ, a nonnegative parameter β, and a point u* ∈ S such that for all t ∈ {1, …, T}:

  a. u* ∈ argmin { f_t(w) : w ∈ S },

  b. (α/2)‖u* − w_t‖₂² ≤ g_t(w_t) − g_t(u*) − ⟨z_t, w_t − u*⟩,

  c. h_t(u*) − h_t(w_t) − ⟨z_t, u* − w_t⟩ ≤ (β/2)‖u* − w_t‖₂²,

  d. g_t(w_t) − g_t(u*) ≤ ⟨r_t, w_t − u*⟩ − (γ/2)‖u* − w_t‖₂², with r_t ∈ ∂g_t(w_t).

Next, theorem 1 gives the regret bounds of ODCA and ODCAk.

Let K_t be the number of iterations of the subgradient method at step t, let K = max_{t=1,…,T} K_t, and let L be a positive number satisfying
max_{t∈{1,…,T}} max{ ‖s_t‖₂, max_{k∈{1,…,K_t}} ‖s_{t,k}‖₂ } ≤ L,  max_{t∈{1,…,T}} ‖w_t − u*‖₂ ≤ max{K − 1, 1} L min_{t∈{1,…,T}} η_t.
(2.4)
Theorem 1.
Let {w_t}_{t=1,…,T} be the sequence generated by ODCA or ODCAk. If assumptions 1a to 1c are verified, then we have
Regret_ODCA^T ≤ 3L²(α + β)√T √(3K² − 4K + 2)/(2α),  Regret_ODCAk^T ≤ 3L²(α + β)√T/(2α).
In addition, if assumption 1d is also verified, then
Regret_ODCAk^T ≤ L²(α + β)(1 + log T)/(2αγ).
Proof.

First, we analyze the regret bound of ODCA.

From definition 2.3, we have
Regret_ODCA^T = Σ_{t=1}^T f_t(w_t) − min_{w∈S} Σ_{t=1}^T f_t(w) ≤ Σ_{t=1}^T [ f_t(w_t) − min_{w∈S} f_t(w) ].
(2.5)
It readily follows from assumption 1a that
f_t(w_t) − min_{w∈S} f_t(w) = f_t(w_t) − f_t(u*) = [ḡ_t(w_t) − ḡ_t(u*)] + [h_t(u*) − h_t(w_t) − ⟨z_t, u* − w_t⟩],
(2.6)
where ḡ_t := g_t − ⟨z_t, ·⟩ for t = 1, …, T.
From equations 2.5 and 2.6 and assumptions 1b and 1c, we obtain
Regret_ODCA^T ≤ (1 + β/α) Σ_{t=1}^T [ḡ_t(w_t) − ḡ_t(u*)] = (1 + β/α) Σ_{t=1}^T [ḡ_t(w_{t,0}) − ḡ_t(u*)]
(2.7)
= (1 + β/α) Σ_{t=1}^T { Σ_{k=0}^{K_t−2} [ḡ_t(w_{t,k}) − ḡ_t(w_{t,k+1})] + [ḡ_t(w_{t,K_t−1}) − ḡ_t(u*)] }
≤ (1 + β/α) Σ_{t=1}^T { Σ_{k=0}^{K_t−2} ⟨s_{t,k}, w_{t,k} − w_{t,k+1}⟩ + ⟨s_{t,K_t−1}, w_{t,K_t−1} − u*⟩ }.
(2.8)
The last inequality holds since s_{t,k} ∈ ∂(g_t − ⟨z_t, ·⟩)(w_{t,k}) = ∂ḡ_t(w_{t,k}) for k = 0, …, K_t − 1.
Similar to theorem 3.1 in Hazan (2016), we can derive from equation 2.2 an upper bound on ⟨s_{t,K_t−1}, w_{t,K_t−1} − u*⟩ as follows:
⟨s_{t,K_t−1}, w_{t,K_t−1} − u*⟩ ≤ (‖w_{t,K_t−1} − u*‖₂² − ‖w_{t,K_t} − u*‖₂²)/(2η_t) + (η_t/2)‖s_{t,K_t−1}‖₂².
Combining equation 2.4 with the fact that
‖w_{t,K_t−1} − u*‖₂ ≤ ‖w_{t,K_t−1} − w_{t,0}‖₂ + ‖w_{t,0} − u*‖₂ ≤ Σ_{k=0}^{K_t−2} ‖w_{t,k+1} − w_{t,k}‖₂ + ‖w_t − u*‖₂ ≤ Σ_{k=0}^{K_t−2} η_t‖s_{t,k}‖₂ + ‖w_t − u*‖₂ ≤ η_t(K − 1)L + ‖w_t − u*‖₂,
we obtain
‖w_{t,K_t−1} − u*‖₂² ≤ ‖w_t − u*‖₂² + 3η_t²(K − 1)²L².
This implies
⟨s_{t,K_t−1}, w_{t,K_t−1} − u*⟩ ≤ (‖w_t − u*‖₂² − ‖w_{t+1} − u*‖₂²)/(2η_t) + (η_t/2)(3K² − 6K + 4)L².
(2.9)
Similarly, we get
Σ_{k=0}^{K_t−2} ⟨s_{t,k}, w_{t,k} − w_{t,k+1}⟩ ≤ Σ_{k=0}^{K_t−2} [ ‖w_{t,k} − w_{t,k+1}‖₂²/(2η_t) + (η_t/2)‖s_{t,k}‖₂² ] ≤ Σ_{k=0}^{K_t−2} η_t‖s_{t,k}‖₂² ≤ η_t(K − 1)L².
(2.10)
We deduce from equations 2.8, 2.9, and 2.10 that
Regret_ODCA^T ≤ (1 + β/α) Σ_{t=1}^T [ (‖w_t − u*‖₂² − ‖w_{t+1} − u*‖₂²)/(2η_t) + (η_t/2)(3K² − 4K + 2)L² ]
≤ (1 + β/α) Σ_{t=1}^T [ ‖w_t − u*‖₂² (1/(2η_t) − 1/(2η_{t−1})) + (η_t/2)(3K² − 4K + 2)L² ],
where, by convention, 1/η₀ := 0.
Let us define η_t = 1/(√t √(3K² − 4K + 2)) for all t = 1, …, T. We have
Regret_ODCA^T ≤ ((α + β)√(3K² − 4K + 2)/α)( L²√T/2 + (L²/2)·2√T ) ≤ 3L²(α + β)√T √(3K² − 4K + 2)/(2α).
Setting K = 1, Regret_ODCAk^T is bounded by 3L²(α + β)√T/(2α).
When assumption 1d is also satisfied, we derive from equation 2.7 for ODCAk that
Regret_ODCAk^T ≤ (1 + β/α) Σ_{t=1}^T [ ⟨s_t, w_t − u*⟩ − (γ/2)‖u* − w_t‖₂² ] ≤ (1 + β/α) Σ_{t=1}^T [ ‖w_t − u*‖₂² (1/(2η_t) − 1/(2η_{t−1}) − γ/2) + (η_t/2)L² ].
Defining η_t = 1/(γt) for all t = 1, …, T, the bracketed coefficient vanishes and we obtain
Regret_ODCAk^T ≤ L²(α + β)(1 + log T)/(2αγ).

The proof of theorem 1 is established.

Remark 1.

From theorem 1, we see that the regret bound of ODCAk is √(3K² − 4K + 2) times smaller than the regret bound of ODCA. Thus, in most cases, the prediction of ODCAk is better than that of ODCA. This is confirmed by the numerical experiments in section 4.

In the sequel, we show how to develop these online DCA schemes for the problem of online binary classification in online learning.

3  Online DCA for Online Binary Linear Classification

Online binary linear classification (Shalev-Shwartz, 2012; Hoi, Wang, & Zhao, 2014; Ho, Le Thi, & Bui, 2016) is online learning with yes/no answers and predictions, in which the prediction set coincides with the correct answer set {−1, 1} and the loss ℓ(p̄_t, y_t) is the 0-1 loss function. Formally, at each step t, the learner receives an instance with n features, denoted x_t ∈ X = R^n, and tries to find a linear classifier w_t ∈ S = R^n in order to predict the corresponding binary label:
p̄_t = p_t(w_t) ∈ Y = {−1, 1},  p_t(w) := 1 if ⟨w, x_t⟩ ≥ 0, −1 otherwise.
(3.1)
After that, the correct label y_t ∈ Y is revealed and the learner suffers the loss ℓ(p̄_t, y_t), where the loss function is defined as
ℓ(p_t(w), y_t) := 1_{{p_t(w)≠y_t}}(w) = 1_{{y_t⟨w,x_t⟩≤0}}(w).
(3.2)
Here, 1_C is the indicator function of C (i.e., 1_C(x) = 1 if x ∈ C, 0 otherwise).

Obviously, the loss ℓ(p̄_t, y_t) = 0 when the prediction is correct (p_t(w) = y_t) and ℓ(p̄_t, y_t) = 1 when the prediction is wrong (p_t(w) ≠ y_t).
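For concreteness, the prediction rule 3.1 and the 0-1 loss 3.2 can be written as follows; note that, following equation 3.2, the boundary case y_t⟨w, x_t⟩ = 0 is counted as a mistake.

```python
import numpy as np

def predict(w, x):
    """Linear prediction rule (3.1): 1 if <w, x> >= 0, otherwise -1."""
    return 1 if np.dot(w, x) >= 0 else -1

def zero_one_loss(w, x, y):
    """0-1 loss (3.2): 1 exactly when y*<w, x> <= 0 (a mistake), else 0."""
    return 1 if y * np.dot(w, x) <= 0 else 0
```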

There exist many online classification algorithms such as perceptron (Novikoff, 1963; Rosenblatt, 1958; Van Der Malsburg, 1986), the approximate maximal margin classification algorithm (ALMA; Gentile, 2002), the relaxed online maximum margin algorithm (ROMMA; Li & Long, 2002), passive-aggressive learning algorithms (PA; Crammer, Dekel, Keshet, Shalev-Shwartz, & Singer, 2006), and their variants. Hoi et al. (2014) developed a library of scalable and efficient online learning algorithms for large-scale online classification tasks.

Now we apply online DCA to online binary linear classification with the 0-1 loss function, equation 3.2:
ℓ_t(w) := ℓ(p_t(w), y_t).
As the function 3.2 is not DC, to apply ODCA and ODCAk we have to approximate ℓ_t by a DC function f_t.

3.1  DC Approximation Functions

We propose DC approximation functions taking the form of piecewise-linear functions like ramp loss (Collobert, Sinz, Weston, & Bottou, 2006a, 2006b; Ho et al., 2016) and sigmoid-like function (Mason, Baxter, Bartlett, & Frean, 1999).

First, it is worth mentioning that, in practice, if the prediction using the classifier w_{t−1} is correct (i.e., ℓ_{t−1}(w_{t−1}) = 0), then it is not necessary to update the classifier (Shalev-Shwartz, 2012). Obviously, if the prediction at step t is correct, say ℓ_t(w_t) = 0, then we take f_t = 0, which is a DC function. We can see from equation 3.2 that ℓ_t(w_t) = 0 if and only if y_t⟨w_t, x_t⟩ > 0. Thus, we approximate ℓ_t by a DC function f_t in two cases: y_t⟨w_t, x_t⟩ < 0 and y_t⟨w_t, x_t⟩ = 0.

To ensure a bound on the number of prediction mistakes, f_t should be a surrogate function of ℓ_t at w_t (Shalev-Shwartz, 2012):
f_t(w_t) ≥ ℓ_t(w_t).
(3.3)

3.1.1  First Piecewise-Linear Approximation

We use the first DC approximation function proposed in Ho et al. (2016),
f_t^(1)(w) := max{0, ν₁ + min{−y_t⟨w_t, x_t⟩, −y_t⟨w, x_t⟩}/τ_t*},
(3.4)
where τ₁ is a positive parameter, and
τ_t* = min{τ₁, −y_t⟨w_t, x_t⟩}, ν₁ = 0 if y_t⟨w_t, x_t⟩ < 0;  τ_t* = τ₁, ν₁ = 1 if y_t⟨w_t, x_t⟩ = 0.

The following proposition gives us a suitable DC decomposition of ft(1).

Proposition 1.
Let a be a nonnegative constant and x a given vector. The function
f(w) = max{0, min{a, ⟨w, x⟩}}
(3.5)
is a DC function with DC components
g(w) = max{0, ⟨w, x⟩} and h(w) = max{0, ⟨w, x⟩ − a}.
Proof.
Obviously, the function min{a, ⟨w, x⟩} is DC with the following natural DC decomposition:
min{a, ⟨w, x⟩} = f₁(w) − f₂(w), where f₁(w) := a + ⟨w, x⟩, f₂(w) := max{a, ⟨w, x⟩}.
Consequently, the function f is clearly DC too:
f(w) = max{0, f₁(w) − f₂(w)} = max{f₁(w), f₂(w)} − f₂(w).
It is easy to see that max{f₁(w), f₂(w)} = g(w) + a and f₂(w) = h(w) + a, so the resulting DC decomposition of f is f = g − h.
From proposition 1, we get a DC decomposition of f_t^(1) as follows,
f_t^(1) = g_t^(1) − h_t^(1),
where
g_t^(1)(w) = max{0, ν₁ − y_t⟨w, x_t⟩/τ_t*},  h_t^(1)(w) = max{0, y_t⟨w_t − w, x_t⟩/τ_t*}.
(3.6)
According to the ODCA/ODCAk schemes, we need to compute the subgradients z_{t−1} ∈ ∂h_{t−1}^(1)(w_{t−1}) and r_{t−1,k} ∈ ∂g_{t−1}^(1)(w_{t−1,k}). Clearly, the functions g_t^(1) and h_t^(1) are each the maximum of two affine functions. Thanks to the rule for computing the subdifferential of a function
h(w) = max_{i=1,…,m} h_i(w),
where the h_i, i = 1, …, m, are convex functions (Valadier, 1969), we have
∂h(w) = co{ ∂h_i(w) : i ∈ argmax_{j∈{1,…,m}} h_j(w) }.
(3.7)
Here co(X) denotes the convex hull of a set of points X.
Applying equation 3.7 to the functions h_{t−1}^(1) and g_{t−1}^(1) with m = 2 and h₁, h₂ affine, we obtain
∂h_{t−1}^(1)(w) = {−y_{t−1}x_{t−1}/τ_{t−1}*} if y_{t−1}⟨w_{t−1} − w, x_{t−1}⟩ > 0;  [−y_{t−1}x_{t−1}/τ_{t−1}*, 0] if y_{t−1}⟨w_{t−1} − w, x_{t−1}⟩ = 0;  {0} otherwise;
∂g_{t−1}^(1)(w) = {−y_{t−1}x_{t−1}/τ_{t−1}*} if y_{t−1}⟨w, x_{t−1}⟩ < ν₁τ_{t−1}*;  [−y_{t−1}x_{t−1}/τ_{t−1}*, 0] if y_{t−1}⟨w, x_{t−1}⟩ = ν₁τ_{t−1}*;  {0} otherwise.
Here [a, b] (= co{a, b}) denotes the line segment between two points a and b.
In particular, we can take the subgradients z_{t−1} ∈ ∂h_{t−1}^(1)(w_{t−1}) and r_{t−1,k} ∈ ∂g_{t−1}^(1)(w_{t−1,k}) as follows:
z_{t−1} = 0 and r_{t−1,k} = −y_{t−1}x_{t−1}/τ_{t−1}* if y_{t−1}⟨w_{t−1,k}, x_{t−1}⟩ ≤ ν₁τ_{t−1}*, 0 otherwise.
(3.8)
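A sketch of this first approximation and the subgradient choice 3.8; the function names are ours, and only the case y_t⟨w_t, x_t⟩ ≤ 0 (where f_t^(1) is defined) is handled.

```python
import numpy as np

def pil1_params(w_t, x_t, y_t, tau1):
    """tau_t^* and nu_1 for the first approximation (defined for
    y_t*<w_t, x_t> <= 0, i.e., when the prediction was wrong)."""
    m = y_t * np.dot(w_t, x_t)
    if m < 0:
        return min(tau1, -m), 0.0  # tau_t^* = min{tau_1, -y_t<w_t,x_t>}, nu_1 = 0
    return tau1, 1.0               # m == 0: tau_t^* = tau_1, nu_1 = 1

def pil1_loss(w, w_t, x_t, y_t, tau1):
    """DC surrogate f_t^(1) of equation 3.4."""
    tau_star, nu1 = pil1_params(w_t, x_t, y_t, tau1)
    inner = min(-y_t * np.dot(w_t, x_t), -y_t * np.dot(w, x_t))
    return max(0.0, nu1 + inner / tau_star)

def pil1_subgradients(w_k, w_t, x_t, y_t, tau1):
    """z and r chosen as in equation 3.8 (z = 0 always)."""
    tau_star, nu1 = pil1_params(w_t, x_t, y_t, tau1)
    z = np.zeros_like(x_t, dtype=float)
    if y_t * np.dot(w_k, x_t) <= nu1 * tau_star:
        r = -y_t * x_t / tau_star
    else:
        r = np.zeros_like(x_t, dtype=float)
    return z, r
```

On a mistake step, f_t^(1)(w_t) = 1 = ℓ_t(w_t), so the surrogate property 3.3 holds with equality there.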

3.1.2  Second Piecewise-Linear Approximation

We propose another piecewise-linear approximation function. The idea is that at step t, we sometimes update the linear classifier w_t even when the prediction is correct (i.e., y_t⟨w_t, x_t⟩ > 0) and do not update w_t when y_t⟨w_t, x_t⟩ < −τ₂. In particular, we give the second DC approximation function,
f_t^(2)(w) := max{0, 1 + ν₂ min{τ₂, −y_t⟨w, x_t⟩}/(τ₃‖x_t‖₂)},
where τ₂, τ₃ are positive parameters and
ν₂ = 1 if −τ₂ ≤ y_t⟨w_t, x_t⟩ ≤ τ₃‖x_t‖₂;  ν₂ = 0 if y_t⟨w_t, x_t⟩ < −τ₂.
According to proposition 1, the DC components g_t^(2) and h_t^(2) of f_t^(2) are
g_t^(2)(w) = max{0, 1 − ν₂y_t⟨w, x_t⟩/(τ₃‖x_t‖₂)},  h_t^(2)(w) = max{0, −ν₂(τ₂ + y_t⟨w, x_t⟩)/(τ₃‖x_t‖₂)}.
Similar to the first piecewise-linear approximation, we can take the subgradients z_{t−1} ∈ ∂h_{t−1}^(2)(w_{t−1}) and r_{t−1,k} ∈ ∂g_{t−1}^(2)(w_{t−1,k}) as follows:
z_{t−1} = 0 and r_{t−1,k} = −ν₂y_{t−1}x_{t−1}/(τ₃‖x_{t−1}‖₂) if ν₂y_{t−1}⟨w_{t−1,k}, x_{t−1}⟩ < τ₃‖x_{t−1}‖₂, 0 otherwise.
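A corresponding sketch for the second approximation, using the scaling τ₃‖x_t‖₂ from the definition above (the function name is ours):

```python
import numpy as np

def pil2_loss_and_r(w, w_t, x_t, y_t, tau2, tau3):
    """Second surrogate f_t^(2) and the subgradient r of its g-part (z = 0)."""
    m = y_t * np.dot(w_t, x_t)
    scale = tau3 * np.linalg.norm(x_t)
    nu2 = 1.0 if -tau2 <= m <= scale else 0.0   # nu_2 = 0 when m < -tau2
    f = max(0.0, 1.0 + nu2 * min(tau2, -y_t * np.dot(w, x_t)) / scale)
    if nu2 * y_t * np.dot(w, x_t) < scale:
        r = -nu2 * y_t * x_t / scale
    else:
        r = np.zeros_like(x_t, dtype=float)
    return f, r
```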

3.1.3  Sigmoid Approximation

We propose the following sigmoid approximation function,
f_t^(3)(w) := max{1 − tanh(δ_t), 1 − tanh(κ_t y_t⟨w, x_t⟩)},
where tanh(s) = (e^s − e^{−s})/(e^s + e^{−s}) is the (increasing) hyperbolic tangent, κ_t = κ‖x_t‖₂ with κ > 0, δ_t = κ_t y_t⟨w_t, x_t⟩ − ln(m_t)/2, and m_t = (2e^{2κ_t y_t⟨w_t,x_t⟩} + 1)/(e^{2κ_t y_t⟨w_t,x_t⟩} + 1)².

It is easy to see that if y_t⟨w_t, x_t⟩ > 0, then 0 < tanh(κ_t y_t⟨w_t, x_t⟩) < 1. Thus, similar to the idea behind f_t^(2), we consider taking f_t^(3) in the case tanh(κ_t y_t⟨w_t, x_t⟩) < 1 − ε, where ε is a threshold in [0, 1).

It is known that the function 1 − tanh(s) = 2e^{−2s}/(e^{−2s} + 1) is a DC function with DC components g(s) = 2e^{−2s} and h(s) = 2e^{−4s}/(e^{−2s} + 1). Similar to the proof of proposition 1, we have
f_t^(3) = max{1 − tanh(δ_t), g(κ_t y_t⟨w, x_t⟩) − h(κ_t y_t⟨w, x_t⟩)} = max{1 − tanh(δ_t) + h(κ_t y_t⟨w, x_t⟩), g(κ_t y_t⟨w, x_t⟩)} − h(κ_t y_t⟨w, x_t⟩).
Thus, a DC decomposition of f_t^(3) is given by
f_t^(3) := g_t^(3) − h_t^(3),
where the function c_t : R^n → R is defined by c_t(w) := 2e^{−2κ_t y_t⟨w,x_t⟩},
h_t^(3)(w) := 2e^{−4κ_t y_t⟨w,x_t⟩}/(e^{−2κ_t y_t⟨w,x_t⟩} + 1),  g_t^(3)(w) := max{1 − tanh(δ_t) + h_t^(3)(w), c_t(w)}.
Since h_{t−1}^(3) is differentiable, we take z_{t−1} = ∇h_{t−1}^(3)(w_{t−1}) as follows:
z_{t−1} = −4κ_{t−1}y_{t−1}x_{t−1} e^{−2κ_{t−1}y_{t−1}⟨w_{t−1},x_{t−1}⟩}(2e^{2κ_{t−1}y_{t−1}⟨w_{t−1},x_{t−1}⟩} + 1)/(e^{2κ_{t−1}y_{t−1}⟨w_{t−1},x_{t−1}⟩} + 1)².
(3.9)
Similar to the first piecewise-linear approximation, the subgradient r_{t−1,k} ∈ ∂g_{t−1}^(3)(w_{t−1,k}) can be taken as
r_{t−1,k} = −2κ_{t−1}y_{t−1}x_{t−1}c_{t−1}(w_{t−1,k}) if δ_{t−1} > κ_{t−1}y_{t−1}⟨w_{t−1,k}, x_{t−1}⟩;  ∇h_{t−1}^(3)(w_{t−1,k}) otherwise.
(3.10)
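A sketch of the sigmoid surrogate and the gradient of its h-part; the function name is ours, and the formula for z follows equation 3.9.

```python
import numpy as np

def sig_loss_and_z(w, w_t, x_t, y_t, kappa):
    """Sigmoid surrogate f_t^(3) and z = grad of h_t^(3) at w (eq. 3.9)."""
    kt = kappa * np.linalg.norm(x_t)
    s_t = kt * y_t * np.dot(w_t, x_t)                 # margin at w_t
    m_t = (2 * np.exp(2 * s_t) + 1) / (np.exp(2 * s_t) + 1) ** 2
    delta_t = s_t - np.log(m_t) / 2
    s = kt * y_t * np.dot(w, x_t)                     # margin at the query point w
    f = max(1 - np.tanh(delta_t), 1 - np.tanh(s))
    e = np.exp(2 * s)
    z = -4 * kt * y_t * x_t * np.exp(-2 * s) * (2 * e + 1) / (e + 1) ** 2
    return f, z
```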

In section 3.2, we describe six online DCA algorithms, corresponding to the two versions ODCA and ODCAk applied to the three DC approximation functions above.

3.2  Proposed Online DCA Algorithms

The ODCA for the first piecewise-linear approximation is denoted by ODCA-PiL1 and summarized in algorithm 1.

Concerning ODCAk for this approximation, we take, at step t, w_{t−1,0} = w_{t−1}, and thus
s_{t−1,0} = −y_{t−1}x_{t−1}/min{τ₁, −y_{t−1}⟨w_{t−1}, x_{t−1}⟩} if y_{t−1}⟨w_{t−1}, x_{t−1}⟩ < 0;  s_{t−1,0} = −y_{t−1}x_{t−1}/τ₁ if y_{t−1}⟨w_{t−1}, x_{t−1}⟩ = 0.
Finally, the approximate ODCAk-PiL1 is given in algorithm 2.

Similar to ODCA-PiL1 and ODCAk-PiL1, we design the complete (resp. approximate) version of online DCA for the second piecewise-linear approximation, named ODCA-PiL2 (resp. ODCAk-PiL2), in algorithm 3 (resp. algorithm 4). ODCA-PiL2 differs from ODCA-PiL1 only in the way r_{t−1,k} is computed in step 2.1.2.2.1. As for ODCAk-PiL2, step 2.1.2.2 of ODCA-PiL2 is reduced to one iteration. In this case, since z_{t−1} = 0, we have s_{t−1,0} = r_{t−1,0}, and thus w_t = w_{t−1} − η_{t−1}r_{t−1,0}.

As for the sigmoid approximation, its complete version of ODCA, named ODCA-Sig, is described in algorithm 5 in which the steps for computing the subgradient zt-1 and rt-1,k are replaced by equations 3.9 and 3.10, respectively. Moreover, its approximate version, ODCAk-Sig, is given in algorithm 6 by performing one iteration of ODCA-Sig in step 2.1.2.2.

[Algorithms 1 to 4 appear here; their pseudocode is not reproduced.]
Remark 2.

Regarding the worst-case complexity, the three approximate (resp. complete) algorithms ODCAk-PiL1, ODCAk-PiL2, and ODCAk-Sig (resp. ODCA-PiL1, ODCA-PiL2, and ODCA-Sig) have complexity O(nT) (resp. O(nKT)). Here, T is the total number of instances, n is the number of features, and K is the maximum number of iterations of the subgradient method over all steps. The complexity of the proposed algorithms is similar to that of the state-of-the-art algorithms (see the appendix). This is confirmed by the numerical experiments in section 4.

[Algorithms 5 and 6 appear here; their pseudocode is not reproduced.]

In the sequel, we establish a mistake bound for the proposed algorithms, that is, a bound on the number of steps at which p̄_t ≠ y_t.

3.3  Mistake Bound of the Proposed Algorithms

First, we show in lemma 1 that the DC functions f_t satisfy assumption 1. Let us denote by M the set of steps at which a DC loss function is observed.

Lemma 1.
For the DC functions {f_t^(i)} (i = 1, 2, 3), if there is a vector u* ∈ R^n such that for all t = 1, …, T,
for i = 1: y_t⟨u*, x_t⟩ ≥ 2τ₁;  for i = 2: y_t⟨u*, x_t⟩ ≥ 2τ₃ max_{j=1,…,T}‖x_j‖₂;  for i = 3: κ_t y_t⟨u*, x_t⟩ ≥ max_{j=1,…,T} δ_j,
(3.11)
then there exist α, γ, u* such that assumptions 1a, 1b, and 1d are satisfied, and assumption 1c is satisfied for all β ≥ 0.
Proof.

First, we note that when x_t = 0, the algorithms do not update the linear classifier, and thus the results of lemma 1 are straightforward. Hence, we further assume that x_t ≠ 0 for t ∈ M. We derive from equation 3.11 that assumption 1a is satisfied for the DC functions f_t^(i) (i = 1, 2, 3). Moreover, for i = 1, we observe that u* − w_t ≠ 0 for all t, because if u* = w_t for some t, then y_t⟨u*, x_t⟩ ≤ 0, which contradicts equation 3.11. The argument is similar for i = 2, 3.

Next, we verify assumptions 1b to 1d for the function f_t^(1) in two cases: y_t⟨w_t, x_t⟩ < 0 and y_t⟨w_t, x_t⟩ = 0. We first consider the case y_t⟨w_t, x_t⟩ < 0.

Let us define the function ḡ_t^(1) := g_t^(1) − ⟨z_t, ·⟩. We have
ḡ_t^(1)(w_t) − ḡ_t^(1)(u*) = −y_t⟨w_t, x_t⟩/τ_t* > 0.
Thus, assumption 1b is satisfied with α ≤ min_{t∈M} −2y_t⟨w_t, x_t⟩/(τ_t*‖u* − w_t‖₂²).
Assumption 1c is also satisfied for all β ≥ 0 since we have
h_t^(1)(u*) − h_t^(1)(w_t) − ⟨z_t, u* − w_t⟩ = 0 ≤ (β/2)‖u* − w_t‖₂².
We have
g_t^(1)(u*) − g_t^(1)(w_t) − ⟨r_t, u* − w_t⟩ = y_t⟨u*, x_t⟩/τ_t* > 0.
Thus, assumption 1d is satisfied with γ ≤ min_{t∈M} 2y_t⟨u*, x_t⟩/(τ_t*‖u* − w_t‖₂²).

When y_t⟨w_t, x_t⟩ = 0, assumptions 1b to 1d are also verified if β ≥ 0, α ≤ min_{t∈M} 2/‖u* − w_t‖₂², and γ ≤ min_{t∈M} 2/‖u* − w_t‖₂².

Similarly, for the DC functions f_t^(2), assumptions 1a, 1b, and 1d are satisfied with the parameters α ≤ min_{t∈M} 2(τ₂‖x_t‖₂ − y_t⟨w_t, x_t⟩)/(τ₂‖x_t‖₂‖u* − w_t‖₂²) and γ ≤ min_{t∈M} 2/‖u* − w_t‖₂². Moreover, assumption 1c is satisfied for all β ≥ 0.

Finally, we check assumptions 1b to 1d for the DC function f_t^(3).

Let us define the function ḡ_t^(3) := g_t^(3) − ⟨z_t, ·⟩. We have
ḡ_t^(3)(w_t) − ḡ_t^(3)(u*) = c_t(w_t) − c_t(u*) + 2c_t(w_t)m_tκ_t y_t⟨w_t − u*, x_t⟩ = c_t(w_t)[1 − c_t(u* − w_t)/2 − 2m_tκ_t y_t⟨u* − w_t, x_t⟩] = c_t(w_t)[1 − e^{−2κ_t y_t⟨u* − w_t, x_t⟩} − 2m_tκ_t y_t⟨u* − w_t, x_t⟩].
Due to the definition of δ_t, it is easy to prove that ḡ_t^(3)(w_t) − ḡ_t^(3)(u*) > 0. By setting
α ≤ min_{t∈M} [c_t(w_t) − c_t(u*) + 2c_t(w_t)m_tκ_t y_t⟨w_t − u*, x_t⟩]/‖u* − w_t‖₂²,
(3.12)
we derive that assumption 1b is verified.
Since h_t^(3) is convex and differentiable, we have
h_t^(3)(u*) − h_t^(3)(w_t) − ⟨z_t, u* − w_t⟩ ≤ ⟨∇h_t^(3)(u*) − ∇h_t^(3)(w_t), u* − w_t⟩ ≤ ‖∇h_t^(3)(u*) − ∇h_t^(3)(w_t)‖₂ · ‖u* − w_t‖₂,
(3.13)
and
‖∇h_t^(3)(u*) − ∇h_t^(3)(w_t)‖₂ = 8κ_t‖x_t‖₂ | c_t(u*)[c_t(−u*) + 1]/[c_t(−u*) + 2]² − c_t(w_t)[c_t(−w_t) + 1]/[c_t(−w_t) + 2]² | ≤ (κ_t‖x_t‖₂/2) | [c_t(u*) + 4][c_t(−w_t) + 2]² − [c_t(w_t) + 4][c_t(−u*) + 2]² | ≤ (κ_t‖x_t‖₂/2) [ |c_t(u* − 2w_t) − c_t(w_t − 2u*)| + 4|c_t(u* − w_t) − c_t(w_t − u*)| + 2|c_t(u*) − c_t(w_t)| + 4|c_t(−2w_t) − c_t(−2u*)| + 8|c_t(−w_t) − c_t(−u*)| ].
(3.14)
For any x, y ∈ R^n, we have
c_t(x) − c_t(y) = c_t(x)[1 − e^{−2κ_t y_t⟨y − x, x_t⟩}] ≤ 2c_t(x)κ_t‖x_t‖₂‖y − x‖₂.
Thus, we readily derive that
|c_t(x) − c_t(y)| ≤ 2max{c_t(x), c_t(y)}κ_t‖x_t‖₂‖y − x‖₂.
(3.15)
Let us define K̄_t = 64e^{4κ_t(|⟨u* − w_t, x_t⟩| + max{|⟨u*, x_t⟩|, |⟨w_t, x_t⟩|})}. Combining equation 3.13 with equations 3.14 and 3.15, we have
h_t^(3)(u*) − h_t^(3)(w_t) − ⟨z_t, u* − w_t⟩ ≤ (κ_t²‖x_t‖₂²K̄_t/2)‖u* − w_t‖₂².
Hence, assumption 1c is satisfied with
β ≥ max_{t∈M} κ_t²‖x_t‖₂²K̄_t.
(3.16)
Moreover, we have
g_t^(3)(u*) − g_t^(3)(w_t) − ⟨r_t, u* − w_t⟩ = c_t(w_t)[c_t(u* − w_t)/2 − 1 + 2κ_t y_t⟨u* − w_t, x_t⟩] = c_t(w_t)[e^{−2κ_t y_t⟨u* − w_t, x_t⟩} − 1 + 2κ_t y_t⟨u* − w_t, x_t⟩].
It is clear that e^{−2κ_t y_t⟨u* − w_t, x_t⟩} + 2κ_t y_t⟨u* − w_t, x_t⟩ > 1. Thus, if
γ ≤ min_{t∈M} [c_t(u*) − c_t(w_t) + 2c_t(w_t)κ_t y_t⟨u* − w_t, x_t⟩]/‖u* − w_t‖₂²,
(3.17)
then assumption 1d is verified.

The proof of lemma 1 is established.

According to theorem 1 and lemma 1, we obtain the regret bounds of the six proposed algorithms in corollary 1.

Corollary 1.
Assume that ODCA-PiL1, ODCAk-PiL1, ODCA-PiL2, ODCAk-PiL2, ODCA-Sig, and ODCAk-Sig generate sequences {w_t}_{t=1,…,T}. Then we have
Regret_ODCA-PiL1^T ≤ 3L²√T √(3K² − 4K + 2),
(3.18)
Regret_ODCAk-PiL1^T ≤ L²(1 + log T)/γ,
(3.19)
where γ is a positive parameter satisfying
γ ≤ min_{t∈M} 2min{τ_t*, ψ(y_t⟨u*, x_t⟩)}/(τ_t*‖u* − w_t‖₂²),
(3.20)
and ψ is the real function defined by ψ(x) = x if x > 0, +∞ otherwise. Moreover, ODCA-PiL2 and ODCAk-PiL2 have the same regret bounds as equations 3.18 and 3.19, respectively, but with
γ ≤ min_{t∈M} 2/‖u* − w_t‖₂².
(3.21)
As for ODCA-Sig and ODCAk-Sig, we have
Regret_ODCAk-Sig^T ≤ L²(α + β)(1 + log T)/(2αγ),  Regret_ODCA-Sig^T ≤ 3L²(α + β)√T √(3K² − 4K + 2)/(2α),
where the parameters α, β, and γ satisfy equations 3.12, 3.16, and 3.17, respectively.

Thanks to the regret bounds of the six proposed algorithms, the following proposition provides a mistake bound for these algorithms.

Proposition 2.
(a) For w ∈ R^n, the number of prediction mistakes made by ODCAk-PiL1 (resp. ODCAk-PiL2) has an upper bound that is the larger root, x̄₁, of the equation
x − ā − b̄(1 + log(x)) = 0,
where ā = Σ_{t∈M} f_t(w), b̄ = L²/γ_PiL, x̄₁ ≥ b̄, γ_PiL ≤ min{γ, L²}, and γ is defined by equation 3.20 (resp. 3.21). Moreover, the mistake bound for ODCAk-Sig has the same form, but with b̄ = L²(α + β)/(2αγ_Sig), γ_Sig ≤ min{γ, L²(α + β)/(2α)}, and the parameters α, β, γ satisfying equations 3.12, 3.16, and 3.17, respectively.

(b) For w ∈ R^n, the number of prediction mistakes made by ODCA-PiL1 and ODCA-PiL2 is upper-bounded by (c̄ + √(c̄² + 4ā))²/4, where c̄ := 3L²√(3K² − 4K + 2) and ā = Σ_{t∈M} f_t(w). In addition, the mistake bound of ODCA-Sig has the same form, but with c̄ = 3L²(α + β)√(3K² − 4K + 2)/(2α), and α, β satisfying equations 3.12 and 3.16, respectively.
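The root x̄₁ in part a has no closed form, but since r is strictly convex with r(b̄) ≤ 0, it can be bracketed and found numerically; a sketch by bisection (assuming b̄ ≥ 1, as in the proposition):

```python
import math

def mistake_bound_root(a_bar, b_bar, tol=1e-10):
    """Bisection for the larger root x1 >= b_bar of
    r(x) = x - a_bar - b_bar*(1 + log(x)) = 0; assumes b_bar >= 1,
    which guarantees r(b_bar) <= 0."""
    r = lambda x: x - a_bar - b_bar * (1.0 + math.log(x))
    lo = b_bar
    hi = lo
    while r(hi) < 0:              # double until the sign flips
        hi *= 2.0
    while hi - lo > tol:          # shrink the bracket [lo, hi]
        mid = 0.5 * (lo + hi)
        if r(mid) < 0:
            lo = mid
        else:
            hi = mid
    return hi
```

For example, with ā = 0 and b̄ = 1 the larger root is x̄₁ = 1, matching r(1) = 0.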

Proof.

In this proof, we show only the mistake bound of ODCAk-PiL1 in part a and of ODCA-PiL1 in part b; the bounds for the other algorithms are obtained similarly.

(a) From inequality 3.3 and corollary 1, we derive that for any w ∈ R^n,
|M| ≤ Σ_{t∈M} f_t(w_t) ≤ Σ_{t∈M} f_t(w) + L²(1 + log(|M|))/γ_PiL.
Here, |M| is the number of steps in M. From the definitions of ā and b̄, it is evident that ā ≥ 0, b̄ ≥ 1, and the last inequality can be rewritten as
|M| ≤ ā + b̄(1 + log(|M|)).
Let us consider the strictly convex function r : (0, +∞) → R,
r(x) = x − ā − b̄(1 + log(x)).
Since lim_{x→0⁺} r(x) = lim_{x→+∞} r(x) = +∞ and r(b̄) ≤ 0, the equation r(x) = 0 has two roots x̄₁ and x̄₂ such that 0 < x̄₂ ≤ b̄ ≤ x̄₁.
(b) Similarly to part a, we obtain that for any w ∈ R^n,
|M| ≤ Σ_{t∈M} f_t(w_t) ≤ ā + 3L²√|M| √(3K² − 4K + 2).
This leads to the inequality |M| ≤ ā + c̄√|M|.

The proof of proposition 2 is established.

4  Numerical Experiments

Our numerical experiments consist of two parts. First, we study the performance of the two versions of the online DCA algorithms: the approximate ones, ODCAk-PiL1, ODCAk-PiL2, and ODCAk-Sig, and the complete ones, ODCA-PiL1, ODCA-PiL2, and ODCA-Sig. Second, we compare the most notable of them with five state-of-the-art online binary classification algorithms: perceptron (Novikoff, 1963; Rosenblatt, 1958; Van Der Malsburg, 1986), online gradient descent (OGD; Zinkevich, 2003), the relaxed online maximum margin algorithm (ROMMA; Li & Long, 2002), the approximate maximal margin classification algorithm (ALMA; Gentile, 2002), and passive-aggressive learning algorithms (PA; Crammer et al., 2006), which are described in detail in the appendix.

We test these comparative algorithms on a variety of benchmark classification data sets from the UCI Machine Learning Repository and the LIBSVM website. The data sets used in our experiments cover various areas (e.g., social sciences, biology, physics, life sciences) and are shown in Table 1.

Table 1:
Data Sets Used in Our Experiments.
Data Set  Name          Number of Instances (T)  Number of Features (n)
D1        a8a           32,561                   123
D2        cod-rna       271,617
D3        colon-cancer  62                       2000
D4        covtype       581,012                  54
D5        diabetes      768
D6        ijcnn1        141,691                  22
D7        magic04       19,020                   10
D8        splice        3175                     60
D9        svmguide1     7089
D10       w7a           49,749                   300
D11       gisette       6806                     5000
D12       duke          44                       7129
D13       susy1ᵃ        1,000,000                18
D14       susy2ᵃ        1,500,000                18

a. The data set consists of the first T instances of the susy data set.

All experiments were implemented in Matlab R2013b and performed on a PC with an Intel Xeon CPU E5-2630 v2 at 2.60 GHz and 32 GB of RAM. The open source Matlab package for the state-of-the-art algorithms is available in Hoi et al. (2014). The initial point of all algorithms is 0 ∈ R^n. Indeed, many numerical experiments on existing online classification algorithms start from the zero point (see Shalev-Shwartz, 2012; Zinkevich, 2003; and Hoi et al., 2014); thus, for a fair comparison, we used the zero point for all comparative algorithms. It is worth mentioning that we also tested several random initial points, and our proposed algorithms gave results similar to those furnished by the zero starting point. The default tolerance ε is set to 10⁻⁴, and the maximum number of iterations of the subgradient method at each step is 5000. The step size of our proposed algorithms is η_t = C/√T for all t. We evaluate the effectiveness of the proposed algorithms by two criteria: the mistake rate (defined as the ratio of the number of mistakes to the number of instances) and the CPU time (in seconds).
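As an illustration of the first criterion, the mistake rate can be computed as follows. This is a minimal sketch; the helper name `mistake_rate` is ours, not part of the Matlab package used in the experiments.

```python
# The first evaluation criterion, sketched in Python (the helper name
# `mistake_rate` is ours): number of mistakes divided by number of instances.
def mistake_rate(predictions, labels):
    mistakes = sum(1 for p, y in zip(predictions, labels) if p != y)
    return mistakes / len(labels)

print(mistake_rate([1, -1, 1, 1], [1, 1, 1, -1]))  # 2 mistakes out of 4: 0.5
```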

To choose the best parameters for the different algorithms, we follow the validation procedure described in Hoi et al. (2014). In particular, we first run each algorithm over one random permutation of the data set with different parameter values and then take the value corresponding to the smallest mistake rate. The ranges of the parameters of the state-of-the-art algorithms are described in detail in Hoi et al. (2014), while the best parameters τ₁, τ₂, τ₃, C, κ, and ε of our algorithms are searched in the ranges {2⁻⁴, 2⁻³, …, 2⁴}, {2⁻⁴, 2⁻³, …, 2⁴}, {1, 3, …, 9}, {2⁻⁴, 2⁻³, …, 2⁴}, {0.1, 0.2, …, 1}, and {0, 0.1, …, 0.9}, respectively. After the validation procedure, each algorithm is run over N different random permutations of each data set with the chosen parameters.
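The validation procedure can be sketched as follows, under the assumption that `run_algorithm(data, param)` performs one pass over the permuted data and returns its mistake rate; all names here are hypothetical.

```python
import random

# Sketch of the validation procedure of Hoi et al. (2014): run the algorithm
# over one random permutation of the data for each candidate parameter value
# and keep the value with the smallest mistake rate.  `run_algorithm` is a
# hypothetical callable (data, param) -> mistake rate; names are ours.
def validate(run_algorithm, data, params, seed=0):
    rng = random.Random(seed)
    permuted = list(data)
    rng.shuffle(permuted)          # one random permutation of the data set
    return min(params, key=lambda p: run_algorithm(permuted, p))
```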

4.1  Experiment 1: Comparison between Two Versions of Online DCA Algorithms

In this experiment, we compare the approximate algorithms ODCAk-PiL1, ODCAk-PiL2, and ODCAk-Sig with the complete algorithms ODCA-PiL1, ODCA-PiL2, and ODCA-Sig, respectively. We run the six algorithms on the 14 data sets over five runs (N = 5). The average mistake rates and CPU times over these five runs are reported in Tables 2 and 3, respectively.

Table 2:
Average Mistake Rate and Its Standard Deviation Obtained by Two Versions of Online DCA-Based Algorithms.
Data Set ODCAk-PiL1 ODCA-PiL1 ODCAk-PiL2 ODCA-PiL2 ODCAk-Sig ODCA-Sig 
D1 0.217± 0.016 0.221 ± 0.001 0.157± 0.000 0.170 ± 0.001 0.157± 0.000 0.240 ± 0.000 
D2 0.174± 0.001 0.222 ± 0.001 0.466± 0.183 0.466± 0.183 0.115± 0.000 0.333 ± 0.000 
D3 0.296± 0.039 0.303 ± 0.039 0.229± 0.035 0.290 ± 0.073 0.216± 0.058 0.219 ± 0.035 
D4 0.469± 0.001 0.476 ± 0.001 0.487± 0.000 0.487± 0.000 0.423 ± 0.000 0.417± 0.000 
D5 0.318± 0.013 0.323 ± 0.013 0.275± 0.009 0.305 ± 0.018 0.260± 0.001 0.268 ± 0.005 
D6 0.096± 0.005 0.207 ± 0.132 0.088 ± 0.039 0.064± 0.002 0.072 ± 0.004 0.059± 0.000 
D7 0.361± 0.006 0.384 ± 0.002 0.529± 0.163 0.529± 0.163 0.280± 0.001 0.529 ± 0.163 
D8 0.280± 0.020 0.302 ± 0.016 0.496± 0.021 0.496± 0.021 0.216± 0.003 0.519 ± 0.000 
D9 0.248± 0.005 0.278 ± 0.006 0.487± 0.070 0.487± 0.070 0.201± 0.002 0.487 ± 0.070 
D10 0.110± 0.002 0.110± 0.001 0.101 ± 0.001 0.100± 0.000 0.101 ± 0.001 0.098± 0.000 
D11 0.110± 0.002 0.113 ± 0.007 0.500± 0.001 0.500± 0.001 0.072± 0.002 0.499 ± 0.000 
D12 0.295± 0.058 0.295± 0.058 0.336 ± 0.057 0.318± 0.084 0.272± 0.023 0.318 ± 0.048 
D13 0.284± 0.006 0.359 ± 0.057 0.213± 0.000 0.215 ± 0.001 0.213± 0.000 0.458 ± 0.014 
D14 0.295± 0.018 0.354 ± 0.044 0.213± 0.000 0.214 ± 0.001 0.213± 0.000 0.457 ± 0.010 

Notes: The complete algorithms are ODCA-PiL1, ODCA-PiL2, and ODCA-Sig, and the approximate algorithms are ODCAk-PiL1, ODCAk-PiL2, and ODCAk-Sig on five runs. Bold values indicate the best results in each pair.

Table 3:
Average CPU Time and Its Standard Deviation in Seconds Obtained by Two Versions of Online DCA-Based Algorithms.
Data Set ODCAk-PiL1 ODCA-PiL1 ODCAk-PiL2 ODCA-PiL2 ODCAk-Sig ODCA-Sig 
D1 2.057± 0.014 2.192 ± 0.005 2.338± 0.011 5.138 ± 0.209 2.388± 0.013 2.716 ± 0.007 
D2 14.22± 0.773 15.52 ± 0.540 15.68± 0.902 15.97 ± 1.005 17.42± 0.910 19.09 ± 0.874 
D3 0.008± 0.000 0.009 ± 0.000 0.010± 0.000 0.011 ± 0.000 0.009± 0.000 0.170 ± 0.002 
D4 41.70± 0.700 45.76 ± 0.646 42.31± 0.080 42.50 ± 0.654 51.18± 0.222 204.6 ± 3.287 
D5 0.038± 0.000 0.157 ± 0.185 0.045± 0.000 0.201 ± 0.010 0.048± 0.000 1.846 ± 0.116 
D6 7.432± 0.135 14.05 ± 5.157 8.553± 0.308 17.96 ± 0.327 8.625± 0.175 23.75 ± 0.487 
D7 0.812± 0.016 0.895 ± 0.032 0.913± 0.025 0.949 ± 0.029 0.990± 0.048 1.122 ± 0.048 
D8 0.175± 0.001 0.422 ± 0.456 0.184± 0.000 0.191 ± 0.005 0.222± 0.000 0.230 ± 0.009 
D9 0.342± 0.002 0.391 ± 0.010 0.367± 0.002 0.378 ± 0.003 0.425± 0.003 0.638 ± 0.015 
D10 4.174± 0.072 3113 ± 1738 4.415± 0.073 7.194 ± 0.323 4.439± 0.061 9.593 ± 0.124 
D11 4.501± 0.069 4.618 ± 0.067 4.490± 0.066 4.563 ± 0.083 4.770± 0.144 5.240 ± 0.176 
D12 0.015± 0.002 0.016 ± 0.002 0.016± 0.002 0.021 ± 0.004 0.020± 0.003 1.074 ± 0.460 
D13 47.85± 5.261 70.45 ± 7.874 51.38± 3.939 53.55 ± 3.661 52.75± 3.882 68.86 ± 9.810 
D14 53.28± 0.931 78.94 ± 5.302 64.17± 1.983 64.64 ± 2.756 65.87± 2.465 89.40 ± 6.378 

Notes: The complete algorithms are ODCA-PiL1, ODCA-PiL2, and ODCA-Sig, and the approximate algorithms are ODCAk-PiL1, ODCAk-PiL2, and ODCAk-Sig, on five runs. Bold values indicate the best results in each pair.

The approximate version is more efficient than the complete version in both mistake rate and CPU time, especially on very large data sets. Indeed, the approximate algorithms ODCAk-PiL1, ODCAk-PiL2, and ODCAk-Sig run faster than the complete algorithms ODCA-PiL1, ODCA-PiL2, and ODCA-Sig, respectively: the ratio of gain varies from 1.03 to 745 times, from 1.00 to 4.46 times, and from 1.04 to 53.7 times, respectively. As for the mistake rate, the approximate version is better than the complete version: the ratio of gain of ODCAk-PiL1 versus ODCA-PiL1, ODCAk-PiL2 versus ODCA-PiL2, and ODCAk-Sig versus ODCA-Sig varies, respectively, from 1.4% to 53%, from 0.5% to 21%, and from 1.3% to 85% on most of the data sets. In particular, on data set D6, the complete algorithms ODCA-PiL2 and ODCA-Sig furnish mistake rates smaller than those of the approximate versions ODCAk-PiL2 and ODCAk-Sig, with gains of 27% and 18%, but they run more slowly, by ratios of 2.10 and 2.75 times, respectively.

4.2  Experiment 2: Comparison with State-of-the-Art Classification Algorithms

In this experiment, we compare the approximate algorithms ODCAk-PiL1, ODCAk-PiL2, and ODCAk-Sig with the five state-of-the-art binary linear classification algorithms mentioned above. The average results over 20 runs of all algorithms on data sets D1 to D10 are reported in Tables 4 and 5.

Table 4:
Average Mistake Rate and Its Standard Deviation Obtained by ODCAk-PiL1, ODCAk-PiL2, ODCAk-Sig and Perceptron, ROMMA, ALMA, OGD, PA on 20 Runs.
Data Set ODCAk-PiL1 ODCAk-PiL2 ODCAk-Sig Perceptron ROMMA ALMA OGD PA 
D1 0.2088 ± 0.001 0.1575± 0.001 0.1574± 0.001 0.2100 ± 0.001 0.2249 ± 0.002 0.1581 ± 0.001 0.1577 ± 0.001 0.2108 ± 0.002 
D2 0.1739 ± 0.001 0.1176± 0.001 0.1149± 0.000 0.1749 ± 0.001 0.1517 ± 0.065 0.1994 ± 0.001 0.1657 ± 0.000 0.2074 ± 0.001 
D3 0.3088 ± 0.039 0.2379± 0.046 0.2379± 0.050 0.3137 ± 0.043 0.3717 ± 0.086 0.4435 ± 0.056 0.3032 ± 0.060 0.2637 ± 0.041 
D4 0.4697 ± 0.001 0.4237± 0.001 0.4231± 0.000 0.4697 ± 0.001 0.4804 ± 0.011 0.4839 ± 0.001 0.4676 ± 0.001 0.4835 ± 0.000 
D5 0.3194 ± 0.015 0.2615± 0.008 0.2621 ± 0.007 0.3265 ± 0.013 0.3072 ± 0.015 0.2655 ± 0.010 0.2586± 0.007 0.3346 ± 0.016 
D6 0.1045 ± 0.024 0.0705± 0.001 0.0740 ± 0.018 0.1062 ± 0.000 0.1008 ± 0.001 0.0699± 0.001 0.0767 ± 0.001 0.1023 ± 0.001 
D7 0.3593 ± 0.007 0.2786± 0.002 0.2775± 0.001 0.3645 ± 0.002 0.3365 ± 0.034 0.3636 ± 0.003 0.3557 ± 0.003 0.3835 ± 0.003 
D8 0.2969 ± 0.056 0.2329 ± 0.006 0.2150± 0.003 0.2732 ± 0.004 0.2684 ± 0.009 0.2283 ± 0.006 0.2168± 0.004 0.2617 ± 0.007 
D9 0.2492 ± 0.007 0.2723 ± 0.116 0.2026± 0.002 0.2560 ± 0.004 0.3037 ± 0.032 0.2564 ± 0.004 0.2466± 0.010 0.3130 ± 0.005 
D10 0.1147 ± 0.012 0.1005± 0.000 0.1005± 0.000 0.1151 ± 0.000 0.1094 ± 0.001 0.1028 ± 0.001 0.1037 ± 0.001 0.1051 ± 0.000 

Note: Bold (resp. underlining) values indicate the first-best (resp. second-best) results.

Table 5:
Average CPU Time and Its Standard Deviation in Seconds Obtained by ODCAk-PiL1, ODCAk-PiL2, ODCAk-Sig and Perceptron, ROMMA, ALMA, OGD, PA on 20 Runs.
Data Set ODCAk-PiL1 ODCAk-PiL2 ODCAk-Sig Perceptron ROMMA ALMA OGD PA 
D1 1.113 ± 0.007 1.569 ± 0.020 1.571 ± 0.027 1.084± 0.014 1.169 ± 0.017 1.309 ± 0.016 1.527 ± 0.021 1.254 ± 0.016 
D2 8.088 ± 0.057 11.91 ± 0.060 11.79 ± 0.057 7.807± 0.042 8.160 ± 0.243 9.494 ± 0.111 11.09 ± 0.048 9.582 ± 0.045 
D3 0.003± 0.000 0.005 ± 0.000 0.004 ± 0.000 0.003± 0.000 0.004± 0.000 0.004 ± 0.000 0.003± 0.000 0.004 ± 0.000 
D4 22.72 ± 1.665 33.28 ± 2.467 33.68 ± 2.435 22.07± 1.760 24.36 ± 1.903 25.63 ± 1.942 29.51 ± 2.194 26.80 ± 2.039 
D5 0.028± 0.000 0.041 ± 0.000 0.042 ± 0.000 0.028± 0.000 0.029 ± 0.000 0.034 ± 0.000 0.041 ± 0.000 0.034 ± 0.000 
D6 5.024 ± 0.045 7.062 ± 0.046 7.097 ± 0.053 4.873± 0.019 5.070 ± 0.025 5.919 ± 0.044 6.982± 0.038 5.444 ± 0.028 
D7 0.706 ± 0.003 1.024 ± 0.004 1.036 ± 0.003 0.687± 0.002 0.730 ± 0.010 0.817± 0.002 0.960 ± 0.002 0.848 ± 0.002 
D8 0.121 ± 0.002 0.184 ± 0.001 0.178 ± 0.000 0.117± 0.000 0.124 ± 0.000 0.138 ± 0.000 0.167 ± 0.000 0.143 ± 0.001 
D9 0.255 ± 0.001 0.360 ± 0.007 0.373 ± 0.001 0.248± 0.003 0.269 ± 0.003 0.295 ± 0.001 0.348 ± 0.001 0.301 ± 0.001 
D10 2.403 ± 0.086 3.096 ± 0.121 3.111 ± 0.113 2.365± 0.092 2.430 ± 0.065 2.709 ± 0.089 3.115 ± 0.079 2.487 ± 0.050 

Note: Bold values indicate the best results.

In terms of mistake rate, we observe from Table 4 that ODCAk-Sig is the best algorithm, ODCAk-PiL2 is the second best, and ODCAk-PiL1 is slightly more efficient than the existing algorithms. In particular, ODCAk-Sig ranks first on 8 of 10 data sets, including the large data sets D2 (271,617 instances) and D4 (581,012 instances). The ratio of gain of ODCAk-Sig versus the other algorithms varies from 0.06% to 46.3%. ODCAk-PiL2 outperforms the existing algorithms on 8 of 10 data sets (ranking first on 2 and second on 6); the ratio of gain varies from 0.12% to 46.3%. In addition, the mistake rate of ODCAk-PiL2 is comparable to that of ODCAk-Sig on 6 of 10 data sets, with the ratio of gain of ODCAk-Sig versus ODCAk-PiL2 from 0% to 2.29%. OGD and ROMMA come next and are somewhat better than ODCAk-PiL1 on most data sets; the ratio of gain varies from 0.44% to 26.9% and from 3.54% to 12.7%, respectively. ODCAk-PiL1 is slightly more efficient than the existing algorithms perceptron, ALMA, and PA on 9 of 10, 5 of 10, and 6 of 10 data sets, respectively; the ratio of gain varies from 0.34% to 2.65%, from 1.18% to 30.3%, and from 0.94% to 20.3%, respectively.

Concerning CPU time, all algorithms run very fast and can be ranked as follows: perceptron and ODCAk-PiL1 are the fastest; ROMMA, ALMA, and PA come next; and finally ODCAk-Sig, ODCAk-PiL2, and OGD. More specifically, perceptron is the fastest on all data sets, while ODCAk-PiL1 is comparable to it: the ratio of gain of perceptron versus ODCAk-PiL1 varies from 1.00 to 1.03 times. As for ODCAk-Sig and ODCAk-PiL2, their CPU times remain small on all data sets; the ratio of gain of perceptron versus ODCAk-Sig (resp. ODCAk-PiL2) varies from 1.31 to 1.52 (resp. from 1.30 to 1.66) times.

5  Conclusion

We have intensively studied an online DCA-based approach to online learning. At each learning step, we have considered a DC program for which an online version of DCA has been investigated. We have proposed the complete and approximate versions of online DCA, in which the convex subproblem is solved completely (resp. approximately, by one iteration of the subgradient method). We have proved that the complete version achieves the logarithmic regret O(log(T)), while the approximate one has the sublinear regret O(√T). As an application, we have developed online DCA schemes for online classification with the 0-1 loss function. We have approximated the 0-1 loss function by DC functions and then proposed six online DCA-based algorithms: the three complete (resp. approximate) versions ODCA-PiL1, ODCA-PiL2, and ODCA-Sig (resp. ODCAk-PiL1, ODCAk-PiL2, and ODCAk-Sig). Thanks to the simple structure of piecewise-linear and sigmoid functions, natural DC decompositions have been considered whose DC components take the form of a maximum of affine functions or of smooth functions, which are differentiable almost everywhere; our DCA-based algorithms take advantage of this property. Numerical results on various benchmark classification data sets have shown that the approximate algorithms ODCAk-PiLs (resp. ODCAk-Sig) outperform the complete algorithms ODCA-PiLs (resp. ODCA-Sig) in both speed and quality of classification. Compared with the five state-of-the-art online classification algorithms, ODCAk-Sig is the best. Encouraged by these promising results, we are pursuing further developments of online DCA for online learning applications. In particular, extensions of the proposed approach to multiclass classification are ongoing.

Appendix:  Description of State-of-the-Art Binary Linear Classification

In this appendix, we give a detailed description of five state-of-the-art binary linear classification algorithms used in the numerical experiments: Perceptron (Novikoff, 1963; Rosenblatt, 1958; Van Der Malsburg, 1986), online gradient descent (OGD; Zinkevich, 2003), relaxed online maximum margin algorithm (ROMMA; Li & Long, 2002), approximate maximal margin classification algorithm (ALMA; Gentile, 2002), and passive-aggressive learning algorithms (PA; Crammer et al., 2006).

First, the perceptron algorithm is the earliest and simplest approach to online binary linear classification (Rosenblatt, 1958):

Perceptron

Initialization: let w_1 be an initial point.

for t = 1, 2, …, T do
    if y_t ⟨w_t, x_t⟩ ≤ 0 then
        w_{t+1} = w_t + y_t x_t
    else
        w_{t+1} = w_t
    end if
end for
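A minimal Python sketch of the perceptron loop above (the experiments in the text were run in Matlab; this version is ours, for illustration only):

```python
# A minimal sketch of the perceptron protocol: predict with the sign of
# <w_t, x_t> and update only on a mistake (y_t <w_t, x_t> <= 0).
def perceptron(stream, n):
    w = [0.0] * n                          # w_1 = 0, as in the experiments
    mistakes = 0
    for x, y in stream:                    # rounds t = 1, 2, ..., T
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        if margin <= 0:                    # mistake: w_{t+1} = w_t + y_t x_t
            mistakes += 1
            w = [wi + y * xi for wi, xi in zip(w, x)]
        # otherwise w_{t+1} = w_t (no change)
    return w, mistakes
```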

Second, the relaxed online maximum margin algorithm (Li & Long, 2002) is an incremental algorithm for classification using a linear threshold function. It can be seen as a relaxed version of the algorithm that searches for the separating hyperplane that maximizes the minimum distance from previous instances classified correctly:

ROMMA: Relaxed Online Maximum Margin Algorithm

Initialization: let w_1 be an initial point.

for t = 1, 2, …, T do
    if y_t ⟨w_t, x_t⟩ ≤ 0 then
        w_{t+1} = ((‖x_t‖² ‖w_t‖² − y_t ⟨w_t, x_t⟩) / (‖x_t‖² ‖w_t‖² − ⟨w_t, x_t⟩²)) w_t + ((‖w_t‖² (y_t − ⟨w_t, x_t⟩)) / (‖x_t‖² ‖w_t‖² − ⟨w_t, x_t⟩²)) x_t
    else
        w_{t+1} = w_t
    end if
end for
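The ROMMA update can be sketched as follows. The fallback for the degenerate case (e.g., w_t = 0, where the denominator vanishes) is our assumption, not part of the original statement; a useful sanity check is that the updated classifier satisfies ⟨w_{t+1}, x_t⟩ = y_t.

```python
# A sketch of the ROMMA update; the fallback when the denominator vanishes
# (e.g. w = 0) is our assumption, not part of the original pseudocode.
def romma_update(w, x, y):
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    wx, xx, ww = dot(w, x), dot(x, x), dot(w, w)
    denom = xx * ww - wx ** 2
    if denom == 0:
        return [y * xi for xi in x]        # perceptron-like step
    c = (xx * ww - y * wx) / denom         # coefficient of w_t
    d = ww * (y - wx) / denom              # coefficient of x_t
    return [c * wi + d * xi for wi, xi in zip(w, x)]
```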

Third, the approximate maximal margin classification algorithm (Gentile, 2002) consists of approximating the maximal margin hyperplane with respect to p-norm (p2) for a set of linearly separable data. The proposed algorithm in Gentile (2002) is called approximate large margin algorithm (ALMA):

ALMA: Approximate Large Margin Algorithm

Initialization: let w_1 be an initial point; parameters p ≥ 2, α ∈ (0, 1], C > 0, B = 1/α, k = 1.

for t = 1, 2, …, T do
    l_t = max{0, (1 − α) B √(p − 1)/√k − y_t ⟨w_t, x_t⟩/‖x_t‖}
    if l_t > 0 then
        w_{t+1} = (w_t + (C/(√(p − 1) √k)) y_t x_t/‖x_t‖) / max{1, ‖w_t + (C/(√(p − 1) √k)) y_t x_t/‖x_t‖‖}
        k = k + 1
    else
        w_{t+1} = w_t
    end if
end for
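One ALMA round can be sketched as follows for the case p = 2 (hence B = 1/α and √(p − 1) = 1); the default values of α and C are illustrative assumptions, not the tuned values of the experiments.

```python
import math

# A sketch of one ALMA round for p = 2 (hence B = 1/alpha and sqrt(p-1) = 1);
# the default values of alpha and C are illustrative assumptions.
def alma_step(w, x, y, k, alpha=0.9, C=2 ** 0.5, p=2):
    B = 1.0 / alpha
    norm = lambda v: math.sqrt(sum(vi * vi for vi in v))
    margin = y * sum(wi * xi for wi, xi in zip(w, x)) / norm(x)
    l = max(0.0, (1 - alpha) * B * math.sqrt(p - 1) / math.sqrt(k) - margin)
    if l > 0:
        step = C / (math.sqrt(p - 1) * math.sqrt(k))
        w_new = [wi + step * y * xi / norm(x) for wi, xi in zip(w, x)]
        scale = max(1.0, norm(w_new))      # project back into the unit ball
        return [wi / scale for wi in w_new], k + 1
    return w, k
```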

Fourth, the passive-aggressive (PA) learning algorithm (Crammer et al., 2006) computes the classifier based on an analytical solution to a simple constrained optimization problem that minimizes the distance from the current classifier w_t to the half-space of vectors attaining zero hinge loss on the current sample:

PA: Passive-Aggressive Learning Algorithm

Initialization: let w_1 be an initial point.

for t = 1, 2, …, T do
    l_t = max{0, 1 − y_t ⟨w_t, x_t⟩}
    if l_t > 0 then
        w_{t+1} = w_t + (l_t/‖x_t‖²) y_t x_t
    else
        w_{t+1} = w_t
    end if
end for
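One PA round can be sketched as follows; after an aggressive update, the new classifier attains zero hinge loss on (x_t, y_t), which is the property the closed-form solution guarantees.

```python
# A sketch of one PA round: passive when the hinge loss l_t is zero,
# aggressive (a closed-form margin-restoring update) otherwise.
def pa_step(w, x, y):
    dot = sum(wi * xi for wi, xi in zip(w, x))
    l = max(0.0, 1.0 - y * dot)            # hinge loss on (x_t, y_t)
    if l > 0:
        xx = sum(xi * xi for xi in x)      # ||x_t||^2
        w = [wi + (l / xx) * y * xi for wi, xi in zip(w, x)]
    return w
```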

Finally, the classic online gradient descent algorithm (Zinkevich, 2003) uses the gradient descent method for minimizing the hinge-loss function:

OGD: Online Gradient Descent

Initialization: let w_1 be an initial point; parameter C > 0.

for t = 1, 2, …, T do
    l_t = max{0, 1 − y_t ⟨w_t, x_t⟩}
    if l_t > 0 then
        w_{t+1} = w_t + (C/√t) y_t x_t
    else
        w_{t+1} = w_t
    end if
end for
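One OGD round on the hinge loss can be sketched as follows, assuming the decreasing step size η_t = C/√t of Zinkevich's analysis; this is an illustration, not the tuned Matlab implementation used in the experiments.

```python
import math

# A sketch of one OGD round on the hinge loss, assuming the step size
# eta_t = C / sqrt(t) used in Zinkevich's analysis.
def ogd_step(w, x, y, t, C=1.0):
    dot = sum(wi * xi for wi, xi in zip(w, x))
    if 1.0 - y * dot > 0:                  # positive hinge loss
        eta = C / math.sqrt(t)             # decreasing step size
        w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w
```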

References

Azoury, K., & Warmuth, M. (2001). Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3), 211–246.
Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, learning, and games. Cambridge: Cambridge University Press.
Chung, T. H. (1994). Approximate methods for sequential decision making using expert advice. In Proceedings of the Seventh Annual Conference on Computational Learning Theory (pp. 183–189). New York: ACM.
Collobert, R., Sinz, F., Weston, J., & Bottou, L. (2006a). Large scale transductive SVMs. Journal of Machine Learning Research, 7, 1687–1712.
Collobert, R., Sinz, F., Weston, J., & Bottou, L. (2006b). Trading convexity for scalability. In Proceedings of the 23rd International Conference on Machine Learning (pp. 201–208). New York: ACM.
Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., & Singer, Y. (2006). Online passive-aggressive algorithms. Journal of Machine Learning Research, 7, 551–585.
Ertekin, S., Bottou, L., & Giles, C. L. (2011). Nonconvex online support vector machines. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(2), 368–381.
Gao, X., Li, X., & Zhang, S. (2018). Online learning with non-convex losses and non-stationary regret. In A. Storkey & F. Perez-Cruz (Eds.), Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 84 (pp. 235–243). PMLR.
Gasso, G., Pappaioannou, A., Spivak, M., & Bottou, L. (2011). Batch and online learning algorithms for nonconvex Neyman-Pearson classification. ACM Transactions on Intelligent Systems and Technology, 2(3), 1–19.
Gentile, C. (2002). A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2, 213–242.
Gentile, C. (2003). The robustness of the p-norm algorithms. Machine Learning, 53(3), 265–299.
Hazan, E. (2016). Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3–4), 157–325.
Hazan, E., Agarwal, A., & Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2–3), 169–192.
Hazan, E., & Kale, S. (2012). Online submodular minimization. Journal of Machine Learning Research, 13(1), 2903–2922.
Ho, V. T., Le Thi, H. A., & Bui, D. C. (2016). Online DC optimization for online binary linear classification. In T. N. Nguyen, B. Trawiński, H. Fujita, & T.-P. Hong (Eds.), Proceedings of ACIIDS 2016 (pp. 661–670). Berlin: Springer.
Hoi, S. C. H., Wang, J., & Zhao, P. (2014). LIBOL: A library for online learning algorithms. Journal of Machine Learning Research, 15(1), 495–499.
Kalai, A., & Vempala, S. (2005). Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3), 291–307.
Kivinen, J., & Warmuth, M. K. (1997). Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1), 1–63.
Kivinen, J., & Warmuth, M. (2001). Relative loss bounds for multidimensional regression problems. Machine Learning, 45(3), 301–329.
Krichene, W., Balandat, M., Tomlin, C., & Bayen, A. (2015). The hedge algorithm on a continuum. In F. Bach & D. Blei (Eds.), Proceedings of the 32nd International Conference on Machine Learning, Vol. 37 (pp. 824–832). PMLR.
Le Thi, H. A., Ho, V. T., & Pham Dinh, T. (2019). A unified DC programming framework and efficient DCA based approaches for large scale batch reinforcement learning. Journal of Global Optimization, 73(2), 279–310.
Le Thi, H. A., Le, H. M., Phan, D. N., & Tran, B. (2017). Stochastic DCA for the large-sum of non-convex functions problem and its application to group variable selection in classification. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70 (pp. 3394–3403).
Le Thi, H. A., & Nguyen, M. C. (2017). DCA based algorithms for feature selection in multi-class support vector machine. Annals of Operations Research, 249(1), 273–300.
Le Thi, H. A., & Pham Dinh, T. (2005). The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of Operations Research, 133(1–4), 23–48.
Le Thi, H. A., & Pham Dinh, T. (2018). DC programming and DCA: Thirty years of developments. Mathematical Programming, Special Issue: DC Programming: Theory, Algorithms and Applications, 169(1), 5–68.
Le Thi, H. A., & Phan, D. N. (2017). DC programming and DCA for sparse Fisher linear discriminant analysis. Neural Computing and Applications, 28(9), 2809–2822.
Li, Y., & Long, P. (2002). The relaxed online maximum margin algorithm. Machine Learning, 46(1–3), 361–387.
Maillard, O.-A., & Munos, R. (2010). Online learning in adversarial Lipschitz environments. In Proceedings of the 2010 European Conference on Machine Learning and Knowledge Discovery in Databases: Part II (pp. 305–320). Berlin: Springer-Verlag.
Mason, L., Baxter, J., Bartlett, P., & Frean, M. (1999). Boosting algorithms as gradient descent. In Proceedings of the 12th International Conference on Neural Information Processing Systems (pp. 512–518). Cambridge, MA: MIT Press.
Novikoff, A. B. (1963). On convergence proofs for perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, Vol. 12 (pp. 615–622). Brooklyn, NY: Polytechnic Press.
Pham Dinh, T., & Le Thi, H. A. (1997). Convex analysis approach to DC programming: Theory, algorithms and applications. Acta Mathematica Vietnamica, 22(1), 289–355.
Pham Dinh, T., & Le Thi, H. A. (1998). DC optimization algorithms for solving the trust region subproblem. SIAM Journal on Optimization, 8(2), 476–505.
Pham Dinh, T., & Le Thi, H. A. (2014). Recent advances in DC programming and DCA. In N. T. Nguyen & H. A. Le Thi (Eds.), Transactions on Computational Intelligence XIII, Vol. 8342 (pp. 1–37). Berlin: Springer.
Phan, D. N., Le, H. M., & Le Thi, H. A. (2018). Accelerated difference of convex functions algorithm and its application to sparse binary logistic regression. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (pp. 1369–1375). Menlo Park, CA: AAAI.
Phan, D. N., & Le Thi, H. A. (2019). Group variable selection via ℓ_{p,0} regularization and application to optimal scoring. Neural Networks, 118, 220–234.
Phan, D. N., Le Thi, H. A., & Pham Dinh, T. (2017). Sparse covariance matrix estimation by DCA-based algorithms. Neural Computation, 29(11), 3040–3077.
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386–408.
Shalev-Shwartz, S. (2007). Online learning: Theory, algorithms, and applications. PhD diss., Hebrew University of Jerusalem.
Shalev-Shwartz, S. (2012). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2), 107–194.
Shalev-Shwartz, S., & Singer, Y. (2007). A primal-dual perspective of online learning algorithms. Machine Learning, 69(2–3), 115–142.
Shor, N. Z. (1985). Minimization methods for non-differentiable functions. Berlin: Springer-Verlag.
Valadier, M. (1969). Sous-différentiels d'une borne supérieure et d'une somme continue de fonctions convexes. C. R. Acad. Sci. Paris Sér. A–B, 268, A39–A42.
Van Der Malsburg, C. (1986). Frank Rosenblatt: Principles of neurodynamics: Perceptrons and the theory of brain mechanisms. In G. Palm & A. Aertsen (Eds.), Brain theory (pp. 245–248). Berlin: Springer.
Yang, L., Deng, L., Hajiesmaili, M. H., Tan, C., & Wong, W. S. (2018). An optimal algorithm for online non-convex learning. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2(2), 25:1–25:25.
Zhang, L., Yang, T., Jin, R., & Zhou, Z.-H. (2015). Online bandit learning for a special class of non-convex losses. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (pp. 3158–3164). Menlo Park, CA: AAAI Press.
Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (pp. 928–936). Menlo Park, CA: AAAI Press.