## Abstract

We investigate an approach based on DC (Difference of Convex functions) programming and DCA (DC Algorithm) for online learning techniques. The prediction problem of an online learner can be formulated as a DC program to which online DCA is applied. We propose two versions of the online DCA scheme, called the complete and the approximate version, and prove their sublinear and logarithmic regret bounds, respectively. Six online DCA-based algorithms are developed for online binary linear classification. Numerical experiments on a variety of benchmark classification data sets show the efficiency of our proposed algorithms in comparison with state-of-the-art online classification algorithms.

## 1  Introduction

Online learning can be seen as the process of predicting answers to a sequential arrival of questions based on the knowledge of the correct answers to previous questions and possibly other available information (Shalev-Shwartz, 2012). It plays a significant role in multiple contexts: for example, when the data samples arrive over time, when predictions must be made in real time, when the learner is required to adapt dynamically to new data patterns, or when learning over the entire data set at once is computationally infeasible. Applications range from online advertisement placement to online web ranking, online email categorization, and real-time recommendation (Shalev-Shwartz, 2007, 2012).

The learner makes a prediction in a sequence of consecutive rounds. On each online round, the learner receives an incoming question and must predict an answer to this question. After that, the correct answer is revealed, and the learner will suffer some loss. The whole process is summarized in the following protocol:

Online Learning

Input: a question space $X$, a possible answer space $Y$, a loss function $\ell$.

for each step $t = 1, 2, \ldots$ do

1. Receive a question $x_t \in X$.

2. Predict an answer $\bar{p}_t \in Y$.

3. Receive the true answer $y_t \in Y$.

4. Suffer loss $\ell(\bar{p}_t, y_t)$.

end for

It is assumed that all the answers can be given by a hypothesis $\bar{h} : X \to Y$ (Shalev-Shwartz, 2012). The set of possible hypotheses is denoted by $H$. Let us denote by $L_T$ and $L_{\bar{h},T}$, respectively, the cumulative loss of the learner and the cumulative loss of the hypothesis $\bar{h}$ after $T$ prediction steps, i.e.,
$L_T := \sum_{t=1}^{T} \ell(\bar{p}_t, y_t), \qquad L_{\bar{h},T} := \sum_{t=1}^{T} \ell(\bar{h}(x_t), y_t).$
The main goal of the online learner is to minimize the regret:
$R_T := \sum_{t=1}^{T} \ell(\bar{p}_t, y_t) - \min_{\bar{h} \in H} \sum_{t=1}^{T} \ell(\bar{h}(x_t), y_t) = L_T - \min_{\bar{h} \in H} L_{\bar{h},T}.$
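In code, the protocol and the regret read as the following minimal sketch; the learner interface, the $0$-$1$ loss, and the small finite hypothesis class used here are hypothetical illustrations (the best fixed hypothesis is found by enumeration), not part of the letter's algorithms.

```python
def online_protocol(questions, answers, predict, loss, hypotheses):
    """Run the online learning protocol; return (L_T, R_T).

    predict(t, x_t, history) -> learner's answer (hypothetical interface)
    loss(p, y)               -> suffered loss l(p, y)
    hypotheses               -> finite class H; best fixed h by enumeration
    """
    history, L_T = [], 0.0
    for t, (x_t, y_t) in enumerate(zip(questions, answers)):
        p_t = predict(t, x_t, history)   # 2. predict an answer
        L_T += loss(p_t, y_t)            # 4. suffer loss l(p_t, y_t)
        history.append((x_t, y_t))       # 3. the true answer is revealed
    best = min(sum(loss(h(x), y) for x, y in history) for h in hypotheses)
    return L_T, L_T - best               # R_T = L_T - min_h L_{h,T}

# toy run with the 0-1 loss and a learner that always answers +1
zero_one = lambda p, y: 0 if p == y else 1
xs, ys = [1, -1, 1, 1], [1, -1, 1, 1]    # here y_t = sign(x_t)
H = [lambda x: 1, lambda x: -1, lambda x: 1 if x >= 0 else -1]
L, R = online_protocol(xs, ys, lambda t, x, hist: 1, zero_one, H)
```

On this toy stream, the constant learner makes one mistake while the sign hypothesis makes none, so the regret is 1.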

Due to this goal, online learning is seen as a general class of techniques at the interface between machine learning and online optimization. Until now, most of the effective online learning algorithms have been derived from online convex optimization (Gentile, 2003; Zinkevich, 2003; Hazan, Agarwal, & Kale, 2007; Shalev-Shwartz, 2007, 2012; Shalev-Shwartz & Singer, 2007; Hazan, 2016). The paradigm of online convex optimization was introduced in Zinkevich (2003) and Shalev-Shwartz and Singer (2007), where the prediction domain and the loss functions are convex. In online convex optimization, there is a common update rule to predict the answer at each step: follow-the-leader (Kalai & Vempala, 2005) and its regularized form (Shalev-Shwartz, 2007; Shalev-Shwartz & Singer, 2007). In the follow-the-leader rule, the learner minimizes the cumulative loss function over all past steps, whereas the regularized form adds a convex regularization function to the cumulative loss function, which introduces stability into the predictions. From this update rule, most of the effective online convex algorithms have been derived, such as online gradient descent (with lazy or greedy projections; Zinkevich, 2003), exponentiated gradient (Kivinen & Warmuth, 1997, 2001; Azoury & Warmuth, 2001), $p$-norm (Gentile, 2003), and their variants.

One common difficulty in most practical applications is that loss functions to assess the predictions are often nonsmooth or nonconvex (e.g., $0-1$ loss function returning 0 if the prediction is correct and 1 otherwise), or the domain of predictions is nonconvex (Chung, 1994). Hence, solving the resulting optimization problem becomes more intractable. The disadvantages of using online convex optimization approaches have been mentioned in several works (Cesa-Bianchi & Lugosi, 2006; Shalev-Shwartz, 2007, 2012). Thus, it is essential to resort to nonconvex optimization in online mode to overcome the difficulties.

Recently, some works have been developed for online learning with nonconvex loss functions. Ertekin, Bottou, and Giles (2011) proposed a nonconvex online algorithm for support vector machine problems with the ramp loss function based on a special version of DCA (Difference of Convex functions Algorithm) for smooth functions, while Gasso, Pappaioannou, Spivak, and Bottou (2011) presented an online algorithm for nonconvex Neyman-Pearson classification problems using the gradient method. These works, however, did not study the regret bounds of the online algorithms. Later, online algorithms with a submodular loss function were given in Hazan and Kale (2012), and an algorithm for online bandit learning problems with a nonconvex loss function (i.e., only the suffered loss is available) was proposed in Zhang, Yang, Jin, and Zhou (2015). Exponential weighting methods applied to online (nonconvex) learning problems were presented in Maillard and Munos (2010) and Krichene, Balandat, Tomlin, and Bayen (2015). Using a special grid layered structure on the decision set, Yang, Deng, Hajiesmaili, Tan, and Wong (2018) recently introduced some variants of an online recursive weighting algorithm in a full information setting (the knowledge of the loss function is used). Gao, Li, and Zhang (2018) studied online nonconvex optimization with a new performance metric, the nonstationary regret, which measures the discrepancy between the cumulative losses of the online learner and the cumulative losses of the best possible responses. The authors also proposed an online normalized gradient descent algorithm and its bandit version in full and partial information settings.

We investigate DC (Difference of Convex functions) programming and DCA for online learning problems with nonconvex loss function. The idea is to approximate the nonconvex loss function by a DC loss function and then investigate online DCA for minimizing the resulting DC loss function. At each iteration, online DCA consists of approximating the current DC loss function by its convex majorization and then solving the resulting convex subproblem. We propose some variants of online DCA in which the convex subproblems are solved completely or approximately by a subgradient method and prove that these variants have the vanishing per step regret. As an application, we apply the proposed online DCA algorithms for online binary linear classification using the $0-1$ loss function. Thanks to the regret bound of these algorithms, we derive their bounds on the number of prediction mistakes. We show the efficiency of the proposed algorithms on a variety of benchmark classification data sets in comparison with five state-of-the-art online binary linear classification algorithms.

The rest of the letter is organized as follows. In section 2, we briefly introduce DC programming and DCA, then present an online DCA scheme in the context of online learning, and finally propose some variants of online DCA with their regret bounds. The development of these variants for online binary linear classification is given in section 3. Section 4 reports the numerical results on several test problems. Section 5 concludes the letter.

## 2  Online DCA for Online Learning

### 2.1  Outline of DC Programming and DCA

DC programming and DCA were introduced by Pham Dinh Tao in a preliminary form in 1985 and have been extensively developed by Le Thi Hoai An and Pham Dinh Tao since 1994 (Pham Dinh & Le Thi, 1997, 1998, 2014; Le Thi & Pham Dinh, 2005, 2018); they have now become classic and increasingly popular. They address DC programs of the form
$\inf\{f(w) := g(w) - h(w) : w \in \mathbb{R}^n\}, \qquad (P_{dc})$
where $g, h \in \Gamma_0(\mathbb{R}^n)$, the set of all lower-semicontinuous proper convex functions on $\mathbb{R}^n$. Such a function $f$ is called a DC function, $g - h$ a DC decomposition of $f$, and $g$ and $h$ DC components of $f$.

The main idea of DCA is quite simple: it consists of approximating a DC program by a sequence of convex programs. Each iteration $l$ of DCA approximates the concave part $-h$ by its affine majorization (which corresponds to taking $z^l \in \partial h(w^l)$) and minimizes the resulting convex function. The generic DCA scheme can be described as follows.

Generic DCA Scheme

Initialization. Choose an initial point $w^0$. Set $l \leftarrow 0$.

Repeat

Step 1. Compute $z^l \in \partial h(w^l)$.

Step 2. Compute $w^{l+1} \in \arg\min\{g(w) - \langle w, z^l \rangle : w \in \mathbb{R}^n\}$.

Step 3. $l \leftarrow l + 1$.

Until stopping condition is satisfied.
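As an illustration, the following sketch (ours, not from the letter) runs the generic DCA scheme on the toy DC program $\min_w w^4 - 2w^2$ with the DC decomposition $g(w) = w^4$, $h(w) = 2w^2$; here the convex subproblem of step 2 has the closed-form solution of $4w^3 = z^l$.

```python
import math

def dca_quartic(w0, tol=1e-10, max_iter=200):
    """Generic DCA on min f(w) = g(w) - h(w), g(w) = w**4, h(w) = 2*w**2.
    Step 1: z_l = h'(w_l) = 4*w_l.
    Step 2: w_{l+1} = argmin_w g(w) - z_l*w, i.e., solve 4*w**3 = z_l.
    """
    w = w0
    for _ in range(max_iter):
        z = 4.0 * w                                   # z_l in dh(w_l)
        w_next = math.copysign((abs(z) / 4.0) ** (1.0 / 3.0), z)
        if abs(w_next - w) <= tol * (abs(w) + 1.0):   # stopping condition
            return w_next
        w = w_next
    return w
```

Starting from any $w_0 > 0$ (resp. $w_0 < 0$), the iterates converge to the global minimizer $w = 1$ (resp. $w = -1$) of this toy program.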

In recent years, numerous DCA-based algorithms have successfully solved large-scale nonsmooth and nonconvex programs appearing in several application areas, especially machine learning, communication systems, biology, and finance (see, e.g., Phan & Le Thi, 2019; Le Thi, Ho, & Pham Dinh, 2019; Le Thi & Pham Dinh, 2018; Phan, Le, & Le Thi, 2018; Le Thi, Le, Phan, & Tran, 2017; Le Thi & Phan, 2017; Le Thi & Nguyen, 2017; Phan, Le Thi, & Pham Dinh, 2017; Le Thi, 2005). DCA has been proved to be a fast and scalable approach, which is, thanks to the effect of DC decompositions, more efficient than related methods. (For a comprehensive survey on 30 years of development of DCA, see Le Thi & Pham Dinh, 2018.)

### 2.2  Online DCA

The generic DCA scheme can be adapted as follows for solving online DC problems where the set of predictions is convex and the loss suffered at each step is a DC function. At each learning step, we have to minimize a DC loss function $F_t$ over the set of predictions $S$; we are therefore faced with a (standard) DC program. As we are in the “online” context, where data become available in sequential order, completely solving this DC program may not be imperative. Instead, we perform only one iteration of DCA.

Let us denote by $T$ the number of online learning steps. The function $F_t$ can be defined either from the cumulative loss $\sum_{i=1}^{t} f_i$ or from the current loss $f_t$. In this letter we define $F_t$ simply as the current loss function $f_t$, say, $F_t = f_t := g_t - h_t$, with $g_t$ and $h_t$ being convex functions. The online DCA scheme can be described as follows.

Online DCA for Online DC Programming

Input: a convex set $S$.

Initialization: set an initial point $w_0$ and observe a DC loss function $f_0 := g_0 - h_0$.

for $t = 1, 2, \ldots$ until convergence

1. Predict a vector $w_t \in S$ by performing one iteration of DCA:

1.1. Compute a subgradient $z_t \in \partial h_{t-1}(w_{t-1})$.

1.2. Solve the convex program to obtain $w_t$:
$\min\{g_{t-1}(w) - \langle z_t, w \rangle : w \in S\}.$
(2.1)

2. Observe a DC loss function $f_t := g_t - h_t$.

3. Suffer the loss $f_t(w_t)$ and update the model.

end for

Output: $\{w_t\}$.

In the online DCA scheme, subproblem 2.1 can be solved by any solver for convex optimization problems. In this letter, we propose using a subgradient method (see, e.g., Shor, 1985). In particular, at step $t$, we compute, at iteration $k \in \{0, 1, \ldots\}$, a subgradient $r_{t-1,k} \in \partial g_{t-1}(w_{t-1,k})$ and then determine $w_{t-1,k+1} \in S$ by
$w_{t-1,k+1} = \mathrm{Proj}_S\left(w_{t-1,k} - \eta_{t-1}(r_{t-1,k} - z_{t-1})\right),$
with a step size $\eta_{t-1}$ and $w_{t-1,0} := w_{t-1}$. Here $\mathrm{Proj}_S$ denotes the orthogonal projection onto $S$.

When the convex subproblems in online DCA are completely solved by the subgradient method, the corresponding DCA is called the complete version of online DCA and summarized in the following scheme:

ODCA: Complete Online DCA Scheme with Projected Subgradient Method

Perform the online DCA scheme in which step 1.2 is replaced by

Set $w_{t-1,0} = w_{t-1}$, $k = 0$.

repeat

Compute a subgradient $r_{t-1,k} \in \partial g_{t-1}(w_{t-1,k})$.

Set $s_{t-1,k} = r_{t-1,k} - z_{t-1}$.

Compute
$w_{t-1,k+1} = \mathrm{Proj}_S\left(w_{t-1,k} - \eta_{t-1} s_{t-1,k}\right).$
(2.2)

Set $k = k + 1$.

until $\|w_{t-1,k} - w_{t-1,k-1}\|_2 \leq \varepsilon \left(\|w_{t-1,k-1}\|_2 + 1\right)$

Set $w_t = w_{t-1,k}$.
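A minimal sketch of this inner loop (step 1.2 with the stopping test above); the callables `grad_g` and `proj`, and the toy subproblem below, are hypothetical stand-ins for $\partial g_{t-1}$, $\mathrm{Proj}_S$, and a concrete loss.

```python
import numpy as np

def odca_step(w_prev, grad_g, z, proj, eta, eps=1e-4, max_iter=5000):
    """Complete ODCA step: solve min_{w in S} g_{t-1}(w) - <z, w> by the
    projected subgradient method (equation 2.2), stopping when
    ||w_k - w_{k-1}||_2 <= eps * (||w_{k-1}||_2 + 1)."""
    w = np.asarray(w_prev, dtype=float)
    for _ in range(max_iter):
        s = grad_g(w) - z                    # s_{t-1,k} = r_{t-1,k} - z_{t-1}
        w_new = proj(w - eta * s)            # equation 2.2
        if np.linalg.norm(w_new - w) <= eps * (np.linalg.norm(w) + 1.0):
            return w_new
        w = w_new
    return w

# toy subproblem: min ||w - b||^2 / 2 over the unit ball, with z = 0;
# the solution is the projection of b onto the ball
b = np.array([2.0, 0.0])
proj_ball = lambda v: v / max(1.0, np.linalg.norm(v))
w_t = odca_step([0.0, 0.0], lambda w: w - b, np.zeros(2), proj_ball, 0.5)
```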

We observe that completely solving the subproblem in online DCA by the subgradient method may be computationally expensive. Thus, we propose the so-called approximate version of ODCA, named ODCA$k$, in which the convex subproblem 2.1 is approximately solved by one iteration of the subgradient method.

ODCA$k$: Approximate Online DCA Scheme with Projected Subgradient Method

Perform the online DCA scheme in which step 1.2 is replaced by

Compute a subgradient $r_{t-1} \in \partial g_{t-1}(w_{t-1})$.

Set $s_{t-1} = r_{t-1} - z_{t-1}$.

Compute $w_t = \mathrm{Proj}_S\left(w_{t-1} - \eta_{t-1} s_{t-1}\right)$.
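The approximate step is then a single projected subgradient iteration; a sketch with the same hypothetical interface as before:

```python
import numpy as np

def odcak_step(w_prev, grad_g, z, proj, eta):
    """Approximate ODCAk step: one projected subgradient iteration on
    min_{w in S} g_{t-1}(w) - <z, w>."""
    w_prev = np.asarray(w_prev, dtype=float)
    s = grad_g(w_prev) - z                   # s_{t-1} = r_{t-1} - z_{t-1}
    return proj(w_prev - eta * s)

# toy usage with S = R^2 (projection = identity), g(w) = ||w||^2 / 2, z = 0
w1 = odcak_step([1.0, -1.0], lambda w: w, np.zeros(2), lambda v: v, 0.5)
```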

### 2.3  Regret Bounds of ODCA and ODCA$k$

In this section, we analyze the regret bounds of the two online DCA schemes ODCA and ODCA$k$. We define the regret of an algorithm $A$ until step $T$ by
$\mathrm{Regret}_A^T = \sum_{t=1}^{T} f_t(w_t) - \min_{w \in \mathbb{R}^n} \sum_{t=1}^{T} f_t(w),$
(2.3)
where the sequence $\{w_1, w_2, \ldots, w_T\}$ is generated by the algorithm $A$.

Now, we prove that ODCA and ODCA$k$ have a vanishing per-step regret (or a sublinear regret); that is, $\mathrm{Regret}_A^T$ grows sublinearly with the number of steps $T$: $\lim_{T \to +\infty} \mathrm{Regret}_A^T / T = 0$. Moreover, a logarithmic regret bound $O(\log(T))$ can be achieved for ODCA$k$.

First, we state our assumptions on the DC functions $f_t$ in assumption 1:

Assumption 1.

There exist positive parameters $\alpha$, $\gamma$, a nonnegative parameter $\beta$, and a point $u^* \in S$ such that for $t \in \{1, \ldots, T\}$:

1. $u^* \in \arg\min\{f_t(w) : w \in S\}$,

2. $\frac{\alpha}{2}\|u^* - w_t\|_2^2 \leq g_t(w_t) - g_t(u^*) - \langle z_t, w_t - u^* \rangle$,

3. $h_t(u^*) - h_t(w_t) - \langle z_t, u^* - w_t \rangle \leq \frac{\beta}{2}\|u^* - w_t\|_2^2$,

4. $g_t(w_t) - g_t(u^*) \leq \langle r_t, w_t - u^* \rangle - \frac{\gamma}{2}\|u^* - w_t\|_2^2$, with $r_t \in \partial g_t(w_t)$.

Next, theorem 1 states the regret bounds of ODCA and ODCA$k$.

Let $K_t$ be the number of iterations of the subgradient method at step $t$, $K = \max_{t=1,\ldots,T} K_t$, and let $L$ be a positive number satisfying
$\max_{t \in \{1,\ldots,T\}} \max\left\{\|s_t\|_2,\ \max_{k \in \{1,\ldots,K_t\}} \|s_{t,k}\|_2\right\} \leq L, \qquad \max_{t \in \{1,\ldots,T\}} \|w_t - u^*\|_2 \leq \max\{K-1, 1\}\, L \min_{t \in \{1,\ldots,T\}} \eta_t.$
(2.4)
Theorem 1.
Let $\{w_t\}_{t=1,\ldots,T}$ be the sequence generated by ODCA or ODCA$k$. If assumptions 1a to 1c are verified, then we have
$\mathrm{Regret}_{ODCA}^T \leq \frac{3L^2(\alpha+\beta)\sqrt{T}\sqrt{3K^2-4K+2}}{2\alpha}, \qquad \mathrm{Regret}_{ODCAk}^T \leq \frac{3L^2(\alpha+\beta)\sqrt{T}}{2\alpha}.$
In addition, if assumption 1d is also verified, then
$\mathrm{Regret}_{ODCAk}^T \leq \frac{L^2(\alpha+\beta)(1+\log(T))}{2\alpha\gamma}.$
Proof.

First, we analyze the regret bound of ODCA.

From the definition, equation 2.3, we have
$\mathrm{Regret}_{ODCA}^T = \sum_{t=1}^{T} f_t(w_t) - \min_{w \in S} \sum_{t=1}^{T} f_t(w) \leq \sum_{t=1}^{T} \left[ f_t(w_t) - \min_{w \in S} f_t(w) \right].$
(2.5)
It readily derives from assumption 1a that
$f_t(w_t) - \min_{w \in S} f_t(w) = f_t(w_t) - f_t(u^*) = [\bar{g}_t(w_t) - \bar{g}_t(u^*)] + [h_t(u^*) - h_t(w_t) - \langle z_t, u^* - w_t \rangle],$
(2.6)
where the convex function $\bar{g}_t := g_t - \langle z_t, \cdot \rangle$ for $t = 1, \ldots, T$.
From equations 2.5 and 2.6 and assumptions 1b and 1c, we obtain
$\mathrm{Regret}_{ODCA}^T \leq \left(1 + \frac{\beta}{\alpha}\right) \sum_{t=1}^{T} [\bar{g}_t(w_t) - \bar{g}_t(u^*)] = \left(1 + \frac{\beta}{\alpha}\right) \sum_{t=1}^{T} [\bar{g}_t(w_{t,0}) - \bar{g}_t(u^*)]$
(2.7)
$\leq \left(1 + \frac{\beta}{\alpha}\right) \sum_{t=1}^{T} \left\{ \sum_{k=0}^{K_t-2} [\bar{g}_t(w_{t,k}) - \bar{g}_t(w_{t,k+1})] + [\bar{g}_t(w_{t,K_t-1}) - \bar{g}_t(u^*)] \right\} \leq \left(1 + \frac{\beta}{\alpha}\right) \sum_{t=1}^{T} \left\{ \sum_{k=0}^{K_t-2} \langle s_{t,k}, w_{t,k} - w_{t,k+1} \rangle + \langle s_{t,K_t-1}, w_{t,K_t-1} - u^* \rangle \right\}.$
(2.8)
The last inequality holds since $s_{t,k} \in \partial g_t(w_{t,k}) - z_t =: \partial \bar{g}_t(w_{t,k})$ for $k = 0, \ldots, K_t - 1$.
Similar to theorem 3.1 in Hazan (2016), we can derive from equation 2.2 an upper bound on $\langle s_{t,K_t-1}, w_{t,K_t-1} - u^* \rangle$ as follows:
$\langle s_{t,K_t-1}, w_{t,K_t-1} - u^* \rangle \leq \frac{\|w_{t,K_t-1} - u^*\|_2^2 - \|w_{t,K_t} - u^*\|_2^2}{2\eta_t} + \frac{\eta_t}{2}\|s_{t,K_t-1}\|_2^2.$
Combining equation 2.4 and the fact that
$\|w_{t,K_t-1} - u^*\|_2 \leq \|w_{t,K_t-1} - w_{t,0}\|_2 + \|w_{t,0} - u^*\|_2 \leq \sum_{k=0}^{K_t-2} \|w_{t,k+1} - w_{t,k}\|_2 + \|w_t - u^*\|_2 \leq \sum_{k=0}^{K_t-2} \eta_t \|s_{t,k}\|_2 + \|w_t - u^*\|_2 \leq \eta_t (K-1) L + \|w_t - u^*\|_2,$
we obtain
$\|w_{t,K_t-1} - u^*\|_2^2 \leq \|w_t - u^*\|_2^2 + 3\eta_t^2 (K-1)^2 L^2.$
It implies
$\langle s_{t,K_t-1}, w_{t,K_t-1} - u^* \rangle \leq \frac{\|w_t - u^*\|_2^2 - \|w_{t+1} - u^*\|_2^2}{2\eta_t} + \frac{\eta_t}{2}(3K^2 - 6K + 4)L^2.$
(2.9)
Similarly, we get
$\sum_{k=0}^{K_t-2} \langle s_{t,k}, w_{t,k} - w_{t,k+1} \rangle \leq \sum_{k=0}^{K_t-2} \left[ \frac{\|w_{t,k} - w_{t,k+1}\|_2^2}{2\eta_t} + \frac{\eta_t}{2}\|s_{t,k}\|_2^2 \right] \leq \sum_{k=0}^{K_t-2} \eta_t \|s_{t,k}\|_2^2 \leq \eta_t (K-1) L^2.$
(2.10)
We deduce from equations 2.8, 2.9, and 2.10 that
$\mathrm{Regret}_{ODCA}^T \leq \left(1+\frac{\beta}{\alpha}\right) \sum_{t=1}^{T} \left[ \frac{\|w_t - u^*\|_2^2 - \|w_{t+1} - u^*\|_2^2}{2\eta_t} + \frac{\eta_t}{2}(3K^2 - 4K + 2)L^2 \right] \leq \left(1+\frac{\beta}{\alpha}\right) \sum_{t=1}^{T} \left[ \|w_t - u^*\|_2^2 \left(\frac{1}{2\eta_t} - \frac{1}{2\eta_{t-1}}\right) + \frac{\eta_t}{2}(3K^2 - 4K + 2)L^2 \right],$
where, by convention, $\frac{1}{\eta_0} := 0$.
Let us define $\eta_t = \frac{1}{\sqrt{t}\sqrt{3K^2 - 4K + 2}}$ for all $t = 1, \ldots, T$. We have
$\mathrm{Regret}_{ODCA}^T \leq \frac{(\alpha+\beta)\sqrt{3K^2 - 4K + 2}}{\alpha} \left( \frac{L^2 \sqrt{T}}{2} + \frac{L^2}{2} \cdot 2\sqrt{T} \right) \leq \frac{3L^2(\alpha+\beta)\sqrt{T}\sqrt{3K^2 - 4K + 2}}{2\alpha}.$
Setting $K = 1$, $\mathrm{Regret}_{ODCAk}^T$ is bounded by $\frac{3L^2(\alpha+\beta)\sqrt{T}}{2\alpha}$.
When assumption 1d is also satisfied, we derive from equation 2.7 for ODCA$k$ that
$\mathrm{Regret}_{ODCAk}^T \leq \left(1+\frac{\beta}{\alpha}\right) \sum_{t=1}^{T} \left[ \langle s_t, w_t - u^* \rangle - \frac{\gamma}{2}\|u^* - w_t\|_2^2 \right] \leq \left(1+\frac{\beta}{\alpha}\right) \sum_{t=1}^{T} \left[ \|w_t - u^*\|_2^2 \left(\frac{1}{2\eta_t} - \frac{1}{2\eta_{t-1}} - \frac{\gamma}{2}\right) + \frac{\eta_t}{2}L^2 \right].$
Defining $\eta_t = \frac{1}{\gamma t}$ for all $t = 1, \ldots, T$, we obtain
$\mathrm{Regret}_{ODCAk}^T \leq \frac{L^2(\alpha+\beta)(1 + \log(T))}{2\alpha\gamma}.$

The proof of theorem 1 is complete.

Remark 1.

From theorem 1, we see that the regret bound of ODCA$k$ is $\sqrt{3K^2 - 4K + 2}$ times smaller than the regret bound of ODCA. Thus, in most cases, the prediction of ODCA$k$ is better than that of ODCA. This is confirmed by the numerical experiments in section 4.

In the sequel, we show how to develop these online DCA schemes for the problem of online binary linear classification.

## 3  Online DCA for Online Binary Linear Classification

Online binary linear classification (Shalev-Shwartz, 2012; Hoi, Wang, & Zhao, 2014; Ho, Le Thi, & Bui, 2016) is online learning with yes/no answers and predictions, in which the prediction set coincides with the correct answer set $\{-1, 1\}$ and the loss $\ell(\bar{p}_t, y_t)$ is the $0$-$1$ loss function. Formally, at each step $t$, the learner receives an instance with $n$ features, denoted $x_t \in X = \mathbb{R}^n$, and tries to find a linear classifier $w_t \in S = \mathbb{R}^n$ in order to predict the corresponding binary label:
$\bar{p}_t = p_t(w_t) \in Y = \{-1, 1\}, \qquad p_t(w) := \begin{cases} 1 & \text{if } \langle w, x_t \rangle \geq 0, \\ -1 & \text{otherwise.} \end{cases}$
(3.1)
After that, the correct label $y_t \in Y$ is revealed, and the learner suffers the loss $\ell(\bar{p}_t, y_t)$, where the loss function $\ell$ is defined as
$\ell(p_t(w), y_t) := 1_{\{p_t(w) \neq y_t\}}(w) = 1_{\{y_t \langle w, x_t \rangle \leq 0\}}(w).$
(3.2)
Here, $1_C$ is the indicator function of $C$ (i.e., $1_C(x) = 1$ if $x \in C$, and 0 otherwise).

Obviously, the loss $\ell(\bar{p}_t, y_t) = 0$ when the prediction is correct ($p_t(w) = y_t$), and $\ell(\bar{p}_t, y_t) = 1$ when the prediction is wrong ($p_t(w) \neq y_t$).
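In code, the prediction rule 3.1 and the $0$-$1$ loss 3.2 read as follows (a small sketch; the tie at $\langle w, x_t \rangle = 0$ is resolved to $+1$ as in equation 3.1):

```python
import numpy as np

def predict_label(w, x):
    """Prediction p_t(w) of equation 3.1: +1 if <w, x_t> >= 0, else -1."""
    return 1 if np.dot(w, x) >= 0 else -1

def zero_one_loss(w, x, y):
    """0-1 loss: 1 on a prediction mistake, 0 otherwise."""
    return 0 if predict_label(w, x) == y else 1

w = np.array([1.0, -2.0])
x = np.array([3.0, 1.0])          # <w, x> = 1 >= 0, so the prediction is +1
```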

There exist many online classification algorithms, such as the perceptron (Novikoff, 1963; Rosenblatt, 1958; Van Der Malsburg, 1986), the approximate maximal margin classification algorithm (ALMA) (Gentile, 2002), the relaxed online maximum margin algorithm (ROMMA) (Li & Long, 2002), passive-aggressive learning algorithms (PA) (Crammer, Dekel, Keshet, Shalev-Shwartz, & Singer, 2006), and their variants. Hoi et al. (2014) developed a library of scalable and efficient online learning algorithms for large-scale online classification tasks.

Now, we apply online DCA to online binary linear classification with the $0$-$1$ loss function, equation 3.2:
$\ell_t(w) := \ell(p_t(w), y_t).$
As the function 3.2 is not DC, in order to apply ODCA and ODCA$k$, we have to approximate $\ell_t$ by a DC function $f_t$.

### 3.1  DC Approximation Functions

We propose DC approximation functions taking the form of piecewise-linear functions like ramp loss (Collobert, Sinz, Weston, & Bottou, 2006a, 2006b; Ho et al., 2016) and sigmoid-like function (Mason, Baxter, Bartlett, & Frean, 1999).

First, it is worth mentioning that in practice, if the prediction using the classifier $w_{t-1}$ is correct (i.e., $\ell_{t-1}(w_{t-1}) = 0$), then it is not necessary to update the classifier $w_t$ (Shalev-Shwartz, 2012). Obviously, if the prediction at step $t$ is correct, say $\ell_t(w_t) = 0$, then we take $f_t = 0$, which is a DC function. We can see from equation 3.2 that $\ell_t(w_t) = 0$ if and only if $y_t \langle w_t, x_t \rangle > 0$. Thus, we approximate $\ell_t$ by a DC function $f_t$ in two cases: $y_t \langle w_t, x_t \rangle < 0$ and $y_t \langle w_t, x_t \rangle = 0$.

To ensure a bound on the number of prediction mistakes, $f_t$ should be a surrogate function of $\ell_t$ at $w_t$ (Shalev-Shwartz, 2012):
$f_t(w_t) \geq \ell_t(w_t).$
(3.3)

#### 3.1.1  First Piecewise-Linear Approximation

We use the first DC approximation function proposed in Ho et al. (2016),
$f_t^{(1)}(w) := \max\left\{0,\ \nu_1 + \frac{\min\{-y_t \langle w_t, x_t \rangle,\ -y_t \langle w, x_t \rangle\}}{\tau_t^*}\right\},$
(3.4)
where $\tau_1$ is a positive parameter, and
$\tau_t^* = \min\{\tau_1, -y_t \langle w_t, x_t \rangle\},\ \nu_1 = 0 \quad \text{if } y_t \langle w_t, x_t \rangle < 0; \qquad \tau_t^* = \tau_1,\ \nu_1 = 1 \quad \text{if } y_t \langle w_t, x_t \rangle = 0.$

The following proposition gives us a suitable DC decomposition of $ft(1)$.

Proposition 1.
Let $a$ and $x$ be a nonnegative constant and a given vector, respectively. The function
$f(w) = \max\{0, \min\{a, \langle w, x \rangle\}\}$
(3.5)
is a DC function with DC components
$g(w) = \max\{0, \langle w, x \rangle\} \quad \text{and} \quad h(w) = \max\{0, \langle w, x \rangle - a\}.$
Proof.
Obviously, the function $\min\{a, \langle w, x \rangle\}$ is DC with the following natural DC decomposition:
$\min\{a, \langle w, x \rangle\} = f_1(w) - f_2(w), \quad \text{where } f_1(w) := a + \langle w, x \rangle,\ f_2(w) := \max\{a, \langle w, x \rangle\}.$
Consequently, the function $f$ is clearly DC too:
$f(w) = \max\{0, f_1(w) - f_2(w)\} = \max\{f_1(w), f_2(w)\} - f_2(w).$
It is easy to see that $\max\{f_1(w), f_2(w)\} = g(w) + a$ and $f_2(w) = h(w) + a$, and the resulting DC decomposition of $f$ is $f = g - h$.
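The identity $f = g - h$ of proposition 1 can be checked numerically; below is a small sketch (the vector $x$, the constant $a$, and the sampling are arbitrary choices of ours):

```python
import random

def f(w, x, a):
    """f(w) = max{0, min{a, <w, x>}} of equation 3.5, with a >= 0."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, min(a, s))

def g(w, x):
    return max(0.0, sum(wi * xi for wi, xi in zip(w, x)))

def h(w, x, a):
    return max(0.0, sum(wi * xi for wi, xi in zip(w, x)) - a)

random.seed(0)
x, a = [1.5, -2.0, 0.5], 0.7
ok = all(
    abs(f(w, x, a) - (g(w, x) - h(w, x, a))) < 1e-12
    for w in ([random.uniform(-3, 3) for _ in range(3)] for _ in range(1000))
)
```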
From proposition 1, we get a DC decomposition of $f_t^{(1)}$ as follows:
$f_t^{(1)} = g_t^{(1)} - h_t^{(1)},$
where
$g_t^{(1)}(w) = \max\left\{0,\ \nu_1 - \frac{y_t \langle w, x_t \rangle}{\tau_t^*}\right\}, \qquad h_t^{(1)}(w) = \max\left\{0,\ \frac{y_t \langle w_t - w, x_t \rangle}{\tau_t^*}\right\}.$
(3.6)
According to the ODCA/ODCA$k$ scheme, we need to compute the subgradients $z_{t-1} \in \partial h_{t-1}^{(1)}(w_{t-1})$ and $r_{t-1,k} \in \partial g_{t-1}^{(1)}(w_{t-1,k})$. Clearly, the functions $g_t^{(1)}$ and $h_t^{(1)}$ are each the maximum of two affine functions. Thanks to the rule for computing the subdifferential of a function
$h(w) = \max_{i=1,\ldots,m} h_i(w),$
where $h_i$, $i = 1, \ldots, m$, are convex functions (Valadier, 1969), we have
$\partial h(w) = \mathrm{co}\left(\cup\left\{\partial h_i(w) : i \in \arg\max_{j \in \{1,\ldots,m\}} h_j(w)\right\}\right).$
(3.7)
Here $\mathrm{co}(X)$ denotes the convex hull of a set of points $X$.
Applying equation 3.7 to the functions $h_{t-1}^{(1)}$ and $g_{t-1}^{(1)}$ with $m = 2$ and $h_1, h_2$ being affine functions, we obtain
$\partial h_{t-1}^{(1)}(w) = \begin{cases} \left\{-\frac{y_{t-1} x_{t-1}}{\tau_{t-1}^*}\right\} & \text{if } y_{t-1} \langle w_{t-1} - w, x_{t-1} \rangle > 0, \\ \left[-\frac{y_{t-1} x_{t-1}}{\tau_{t-1}^*}, 0\right] & \text{if } y_{t-1} \langle w_{t-1} - w, x_{t-1} \rangle = 0, \\ \{0\} & \text{otherwise,} \end{cases} \qquad \partial g_{t-1}^{(1)}(w) = \begin{cases} \left\{-\frac{y_{t-1} x_{t-1}}{\tau_{t-1}^*}\right\} & \text{if } y_{t-1} \langle w, x_{t-1} \rangle < \nu_1 \tau_{t-1}^*, \\ \left[-\frac{y_{t-1} x_{t-1}}{\tau_{t-1}^*}, 0\right] & \text{if } y_{t-1} \langle w, x_{t-1} \rangle = \nu_1 \tau_{t-1}^*, \\ \{0\} & \text{otherwise.} \end{cases}$
Here $[a, b]$ $(= \mathrm{co}\{a, b\})$ denotes the line segment between two points $a$ and $b$.
In particular, we can take the subgradients $z_{t-1} \in \partial h_{t-1}^{(1)}(w_{t-1})$ and $r_{t-1,k} \in \partial g_{t-1}^{(1)}(w_{t-1,k})$ as follows:
$z_{t-1} = 0 \quad \text{and} \quad r_{t-1,k} = \begin{cases} -\frac{y_{t-1} x_{t-1}}{\tau_{t-1}^*} & \text{if } y_{t-1} \langle w_{t-1,k}, x_{t-1} \rangle \leq \nu_1 \tau_{t-1}^*, \\ 0 & \text{otherwise.} \end{cases}$
(3.8)
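A sketch of these choices in code (the helper names are ours; the case $y_t \langle w_t, x_t \rangle > 0$, where no approximation is built, is excluded by assumption):

```python
import numpy as np

def pil1_params(w_t, x_t, y_t, tau1):
    """tau_t^* and nu_1 of equation 3.4, for y_t<w_t, x_t> <= 0."""
    margin = y_t * np.dot(w_t, x_t)
    if margin < 0:
        return min(tau1, -margin), 0.0
    return tau1, 1.0                       # case y_t<w_t, x_t> = 0

def pil1_subgradients(w_k, w_t, x_t, y_t, tau1):
    """z_{t-1} and r_{t-1,k} of equation 3.8 at the inner iterate w_k."""
    tau_star, nu1 = pil1_params(w_t, x_t, y_t, tau1)
    z = np.zeros_like(np.asarray(x_t, dtype=float))
    if y_t * np.dot(w_k, x_t) <= nu1 * tau_star:
        r = -y_t * np.asarray(x_t, dtype=float) / tau_star
    else:
        r = np.zeros_like(z)
    return z, r

# example: wrong prediction with margin -1 and tau1 = 0.5, so tau_t^* = 0.5
z, r = pil1_subgradients(np.array([1.0, 0.0]), np.array([1.0, 0.0]),
                         np.array([-1.0, 0.0]), 1, 0.5)
```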

#### 3.1.2  Second Piecewise-Linear Approximation

We propose another piecewise-linear approximation function. The idea is that at step $t$, we sometimes update the linear classifier $w_t$ even when the prediction is correct (i.e., $y_t \langle w_t, x_t \rangle > 0$), and we sometimes do not update $w_t$ even when $y_t \langle w_t, x_t \rangle \leq 0$. In particular, we give the second DC approximation function,
$f_t^{(2)}(w) := \max\left\{0,\ 1 + \nu_2 \frac{\min\{\tau_2, -y_t \langle w, x_t \rangle\}}{\tau_3 \|x_t\|_2}\right\},$
where $\tau_2, \tau_3$ are positive parameters and
$\nu_2 = 1 \quad \text{if } -\tau_2 \leq y_t \langle w_t, x_t \rangle \leq \tau_3 \|x_t\|_2; \qquad \nu_2 = 0 \quad \text{if } y_t \langle w_t, x_t \rangle < -\tau_2.$
According to proposition 1, the DC components $g_t^{(2)}$ and $h_t^{(2)}$ of $f_t^{(2)}$ are
$g_t^{(2)}(w) = \max\left\{0,\ 1 - \frac{\nu_2 y_t \langle w, x_t \rangle}{\tau_3 \|x_t\|_2}\right\}, \qquad h_t^{(2)}(w) = \max\left\{0,\ -\frac{\nu_2 (\tau_2 + y_t \langle w, x_t \rangle)}{\tau_3 \|x_t\|_2}\right\}.$
Similar to the first piecewise-linear approximation, we can take the subgradients $z_{t-1} \in \partial h_{t-1}^{(2)}(w_{t-1})$ and $r_{t-1,k} \in \partial g_{t-1}^{(2)}(w_{t-1,k})$ as follows:
$z_{t-1} = 0 \quad \text{and} \quad r_{t-1,k} = \begin{cases} -\frac{\nu_2 y_{t-1} x_{t-1}}{\tau_3 \|x_{t-1}\|_2} & \text{if } \nu_2 y_{t-1} \langle w_{t-1,k}, x_{t-1} \rangle < \tau_3 \|x_{t-1}\|_2, \\ 0 & \text{otherwise.} \end{cases}$

#### 3.1.3  Sigmoid Approximation

We propose the following sigmoid approximation function,
$f_t^{(3)}(w) := \max\{1 - \tanh(\delta_t),\ 1 - \tanh(\kappa_t y_t \langle w, x_t \rangle)\},$
where $\tanh(s) = \frac{e^s - e^{-s}}{e^s + e^{-s}}$ is the increasing hyperbolic tangent function, $\kappa_t = \kappa / \|x_t\|_2$ with $\kappa > 0$, $\delta_t = \kappa_t y_t \langle w_t, x_t \rangle - \frac{\ln(m_t)}{2}$, and $m_t = \frac{2e^{2\kappa_t y_t \langle w_t, x_t \rangle} + 1}{\left(e^{2\kappa_t y_t \langle w_t, x_t \rangle} + 1\right)^2}$.

It is easy to see that if $y_t \langle w_t, x_t \rangle > 0$, then $0 < \tanh(\kappa_t y_t \langle w_t, x_t \rangle) < 1$. Thus, similar to the idea of $f_t^{(2)}$, we consider taking $f_t^{(3)}$ in the case $\tanh(\kappa_t y_t \langle w_t, x_t \rangle) < 1 - \varepsilon$, where $\varepsilon$ is a threshold in $[0, 1)$.

It is known that the function $1 - \tanh(s) = \frac{2e^{-2s}}{e^{-2s} + 1}$ is a DC function with DC components $g(s) = 2e^{-2s}$ and $h(s) = \frac{2e^{-4s}}{e^{-2s} + 1}$. Similar to the proof of proposition 1, we have
$f_t^{(3)} = \max\{1 - \tanh(\delta_t),\ g(\kappa_t y_t \langle w, x_t \rangle) - h(\kappa_t y_t \langle w, x_t \rangle)\} = \max\{1 - \tanh(\delta_t) + h(\kappa_t y_t \langle w, x_t \rangle),\ g(\kappa_t y_t \langle w, x_t \rangle)\} - h(\kappa_t y_t \langle w, x_t \rangle).$
Thus, a DC decomposition of $f_t^{(3)}$ is given by
$f_t^{(3)} := g_t^{(3)} - h_t^{(3)},$
where, with the function $c_t : \mathbb{R}^n \to \mathbb{R}$ defined by $c_t(w) := 2e^{-2\kappa_t y_t \langle w, x_t \rangle}$,
$h_t^{(3)}(w) := \frac{2e^{-4\kappa_t y_t \langle w, x_t \rangle}}{e^{-2\kappa_t y_t \langle w, x_t \rangle} + 1}, \qquad g_t^{(3)}(w) := \max\left\{1 - \tanh(\delta_t) + h_t^{(3)}(w),\ c_t(w)\right\}.$
Since $h_{t-1}^{(3)}$ is differentiable, we take $z_{t-1} = \nabla h_{t-1}^{(3)}(w_{t-1})$ as follows:
$z_{t-1} = -\frac{4\kappa_{t-1} y_{t-1} x_{t-1} e^{-2\kappa_{t-1} y_{t-1} \langle w_{t-1}, x_{t-1} \rangle} \left(2e^{2\kappa_{t-1} y_{t-1} \langle w_{t-1}, x_{t-1} \rangle} + 1\right)}{\left(e^{2\kappa_{t-1} y_{t-1} \langle w_{t-1}, x_{t-1} \rangle} + 1\right)^2}.$
(3.9)
Similar to the first piecewise-linear approximation, the subgradient $r_{t-1,k} \in \partial g_{t-1}^{(3)}(w_{t-1,k})$ can be taken as
$r_{t-1,k} = \begin{cases} -2\kappa_{t-1} y_{t-1} x_{t-1} c_{t-1}(w_{t-1,k}) & \text{if } \delta_{t-1} > \kappa_{t-1} y_{t-1} \langle w_{t-1,k}, x_{t-1} \rangle, \\ \nabla h_{t-1}^{(3)}(w_{t-1,k}) & \text{otherwise.} \end{cases}$
(3.10)
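The scalar derivative underlying equation 3.9 can be checked against finite differences; a small sketch of ours in terms of the margin $u = \kappa_t y_t \langle w, x_t \rangle$ (by the chain rule, $\nabla h_t^{(3)}(w)$ is this scalar derivative times $\kappa_t y_t x_t$):

```python
import math

def h3(u):
    """h(u) = 2*exp(-4u) / (exp(-2u) + 1): the DC component h_t^(3) as a
    function of the scalar margin u."""
    return 2.0 * math.exp(-4.0 * u) / (math.exp(-2.0 * u) + 1.0)

def dh3(u):
    """Closed form matching the scalar part of equation 3.9:
    h'(u) = -4*exp(-2u) * (2*exp(2u) + 1) / (exp(2u) + 1)**2."""
    e2u = math.exp(2.0 * u)
    return -4.0 * math.exp(-2.0 * u) * (2.0 * e2u + 1.0) / (e2u + 1.0) ** 2

# central finite-difference check at a few margins
eps = 1e-6
ok = all(
    abs(dh3(u) - (h3(u + eps) - h3(u - eps)) / (2.0 * eps)) < 1e-4
    for u in (-1.0, -0.3, 0.0, 0.4, 1.2)
)
```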

In section 3.2, we describe six online DCA algorithms, corresponding to the two versions ODCA and ODCA$k$ applied to the three DC approximation functions above.

### 3.2  Proposed Online DCA Algorithms

The ODCA for the first piecewise-linear approximation is denoted by ODCA-PiL1 and summarized in algorithm 1.

Concerning the ODCA$k$ for this approximation, we take, at step $t$, $w_{t-1,0} = w_{t-1}$, and thus
$s_{t-1,0} = \begin{cases} \dfrac{-y_{t-1} x_{t-1}}{\min\{\tau_1, -y_{t-1} \langle w_{t-1}, x_{t-1} \rangle\}} & \text{if } y_{t-1} \langle w_{t-1}, x_{t-1} \rangle < 0, \\ \dfrac{-y_{t-1} x_{t-1}}{\tau_1} & \text{if } y_{t-1} \langle w_{t-1}, x_{t-1} \rangle = 0. \end{cases}$
Finally, the approximate ODCA$k$-PiL1 is given in algorithm 2.

Similar to ODCA-PiL1 and ODCA$k$-PiL1, we design the complete (resp. approximate) version of online DCA for the second piecewise-linear approximation, named ODCA-PiL2 (resp. ODCA$k$-PiL2), in algorithm 3 (resp. algorithm 4). ODCA-PiL2 differs from ODCA-PiL1 only in the way $r_{t-1,k}$ is computed in step 2.1.2.2.1. As for ODCA$k$-PiL2, one reduces step 2.1.2.2 of ODCA-PiL2 to one iteration. In this case, since $z_{t-1} = 0$, we have $s_{t-1,0} = r_{t-1,0}$, and thus $w_t = w_{t-1} - \eta_{t-1} r_{t-1,0}$.

As for the sigmoid approximation, its complete version of ODCA, named ODCA-Sig, is described in algorithm 5 in which the steps for computing the subgradient $zt-1$ and $rt-1,k$ are replaced by equations 3.9 and 3.10, respectively. Moreover, its approximate version, ODCA$k$-Sig, is given in algorithm 6 by performing one iteration of ODCA-Sig in step 2.1.2.2.
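To make the schemes concrete, the following sketch (ours, not the letter's Matlab implementation; the step size and $\tau_1$ are arbitrary choices) runs an ODCA$k$-PiL1-style pass over a toy stream with $S = \mathbb{R}^n$, counting prediction mistakes:

```python
import numpy as np

def odcak_pil1_run(X, y, tau1=1.0, eta=0.1):
    """ODCAk-PiL1-style online pass: predict with sign<w, x_t>; when
    y_t<w, x_t> <= 0, take one subgradient step with z_{t-1} = 0 and
    r_{t-1,0} from equation 3.8 (S = R^n, so no projection is needed)."""
    w = np.zeros(X.shape[1])
    mistakes = 0
    for x_t, y_t in zip(X, y):
        p_t = 1 if np.dot(w, x_t) >= 0 else -1
        mistakes += int(p_t != y_t)
        margin = y_t * np.dot(w, x_t)
        if margin > 0:
            continue                          # correct prediction: no update
        tau_star = min(tau1, -margin) if margin < 0 else tau1
        w = w + eta * y_t * x_t / tau_star    # w - eta * r_{t-1,0}
    return w, mistakes

# toy separable stream: label is the sign of the first coordinate
X = np.array([[1.0, 0.0], [-1.0, 0.0]] * 50)
y = np.array([1, -1] * 50)
w, mistakes = odcak_pil1_run(X, y)
```

On this stream, a single update after the first (zero-margin) step suffices, and no prediction mistake is made.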

Remark 2.

Regarding the worst-case complexity, three approximate (resp. complete) algorithms ODCA$k$-PiL1, ODCA$k$-PiL2, and ODCA$k$-Sig (resp. ODCA-PiL1, ODCA-PiL2, and ODCA-Sig) have the complexity of $O(nT)$ (resp. $O(nKT)$). Here, $T$ is the total number of instances, $n$ is the number of features, and $K$ is the maximum number of iterations of subgradient methods at all steps. The complexity of the proposed algorithms is similar to the complexity of state-of-the-art algorithms (see the appendix). This is confirmed by the numerical experiment in section 4.

In the sequel, we show a mistake bound for the proposed algorithms, that is, a bound on the number of steps at which $\bar{p}_t \neq y_t$.

### 3.3  Mistake Bound of the Proposed Algorithms

First, we show in lemma 1 that the DC functions $f_t$ satisfy assumption 1. Let us denote by $M$ the set of steps at which a DC function is observed.

Lemma 1.
For the DC functions $\{f_t^{(i)}\}$ ($i = 1, 2, 3$), if there is a vector $u^* \in \mathbb{R}^n$ such that for all $t = 1, \ldots, T$,
$y_t \langle u^*, x_t \rangle \geq 2\tau_1 \text{ for } i = 1; \qquad y_t \langle u^*, x_t \rangle \geq 2\tau_3 \max_{j=1,\ldots,T} \|x_j\|_2 \text{ for } i = 2; \qquad \kappa_t y_t \langle u^*, x_t \rangle \geq \max_{j=1,\ldots,T} \delta_j \text{ for } i = 3,$
(3.11)
then there exist $\alpha$, $\gamma$, $u^*$ such that assumptions 1a, 1b, and 1d are satisfied, and assumption 1c is satisfied for all $\beta \geq 0$.
Proof.

First, we see that when $x_t = 0$, the algorithms do not update the linear classifier, and thus the results of lemma 1 are straightforward. Hence, we further assume that $x_t \neq 0$ for $t \in M$. We derive from equation 3.11 that assumption 1a is satisfied for the DC functions $f_t^{(i)}$ ($i = 1, 2, 3$). Moreover, for $i = 1$, we observe that $\|u^* - w_t\| \neq 0$ for all $t$, because if there were some $t$ such that $u^* = w_t$, then we would have $y_t \langle u^*, x_t \rangle \leq 0$, which contradicts equation 3.11. The argument is similar for $i = 2, 3$.

Next, we verify assumptions 1b to 1d for the function $f_t^{(1)}$ in two cases: $y_t \langle w_t, x_t \rangle < 0$ and $y_t \langle w_t, x_t \rangle = 0$. We first consider the case $y_t \langle w_t, x_t \rangle < 0$.

Let us define the function $\bar{g}_t^{(1)} := g_t^{(1)} - \langle z_t, \cdot \rangle$. We have
$\bar{g}_t^{(1)}(w_t) - \bar{g}_t^{(1)}(u^*) = \frac{-y_t \langle w_t, x_t \rangle}{\tau_t^*} > 0.$
Thus, assumption 1b is satisfied with $\alpha \leq \min_{t \in M} \frac{-2y_t \langle w_t, x_t \rangle}{\tau_t^* \|u^* - w_t\|_2^2}$.
Assumption 1c is also satisfied for all $\beta \geq 0$ since we have
$h_t^{(1)}(u^*) - h_t^{(1)}(w_t) - \langle z_t, u^* - w_t \rangle = 0 \leq \frac{\beta}{2}\|u^* - w_t\|_2^2.$
We also have
$g_t^{(1)}(u^*) - g_t^{(1)}(w_t) - \langle r_t, u^* - w_t \rangle = \frac{y_t \langle u^*, x_t \rangle}{\tau_t^*} > 0.$
Thus, assumption 1d is satisfied with $\gamma \leq \min_{t \in M} \frac{2y_t \langle u^*, x_t \rangle}{\tau_t^* \|u^* - w_t\|_2^2}$.

When $y_t \langle w_t, x_t \rangle = 0$, assumptions 1b to 1d are also verified if $\beta \geq 0$, $\alpha \leq \min_{t \in M} \frac{2}{\|u^* - w_t\|_2^2}$, and $\gamma \leq \min_{t \in M} \frac{2}{\|u^* - w_t\|_2^2}$.

Similarly, for the DC functions $f_t^{(2)}$, assumptions 1a, 1b, and 1d are satisfied if the parameters satisfy $\alpha \leq \min_{t \in M} \frac{2\left(\tau_3 \|x_t\|_2 - y_t \langle w_t, x_t \rangle\right)}{\tau_3 \|x_t\|_2 \|u^* - w_t\|_2^2}$ and $\gamma \leq \min_{t \in M} \frac{2}{\|u^* - w_t\|_2^2}$. Moreover, assumption 1c is satisfied for all $\beta \geq 0$.

Finally, we check assumptions 1b to 1d for the DC function $f_t^{(3)}$.

Let us define the function $\bar{g}_t^{(3)} := g_t^{(3)} - \langle z_t, \cdot \rangle$. We have
$\bar{g}_t^{(3)}(w_t) - \bar{g}_t^{(3)}(u^*) = c_t(w_t) - c_t(u^*) + 2c_t(w_t) m_t \kappa_t y_t \langle w_t - u^*, x_t \rangle = c_t(w_t)\left[1 - \frac{c_t(u^* - w_t)}{2} - 2m_t \kappa_t y_t \langle u^* - w_t, x_t \rangle\right] = c_t(w_t)\left[1 - e^{-2\kappa_t y_t \langle u^* - w_t, x_t \rangle} - 2m_t \kappa_t y_t \langle u^* - w_t, x_t \rangle\right].$
Due to the definition of $\delta_t$, it is easy to prove that $\bar{g}_t^{(3)}(w_t) - \bar{g}_t^{(3)}(u^*) > 0$. By setting
$\alpha \leq \min_{t \in M} \frac{c_t(w_t) - c_t(u^*) + 2c_t(w_t) m_t \kappa_t y_t \langle w_t - u^*, x_t \rangle}{\|u^* - w_t\|_2^2},$
(3.12)
we derive that assumption 1b is verified.
Since $h_t^{(3)}$ is convex and differentiable, we have
$h_t^{(3)}(u^*) - h_t^{(3)}(w_t) - \langle z_t, u^* - w_t \rangle \leq \langle \nabla h_t^{(3)}(u^*) - \nabla h_t^{(3)}(w_t), u^* - w_t \rangle \leq \|\nabla h_t^{(3)}(u^*) - \nabla h_t^{(3)}(w_t)\|_2 \cdot \|u^* - w_t\|_2,$
(3.13)
and
$\|\nabla h_t^{(3)}(u^*) - \nabla h_t^{(3)}(w_t)\|_2 = 8\kappa_t \|x_t\|_2 \left| \frac{c_t(u^*)\left[c_t(-u^*) + 1\right]}{\left[c_t(-u^*) + 2\right]^2} - \frac{c_t(w_t)\left[c_t(-w_t) + 1\right]}{\left[c_t(-w_t) + 2\right]^2} \right| \leq \frac{\kappa_t \|x_t\|_2}{2} \left| \left[c_t(u^*) + 4\right]\left[c_t(-w_t) + 2\right]^2 - \left[c_t(w_t) + 4\right]\left[c_t(-u^*) + 2\right]^2 \right| \leq \frac{\kappa_t \|x_t\|_2}{2} \left[ |c_t(u^* - 2w_t) - c_t(w_t - 2u^*)| + 4|c_t(u^* - w_t) - c_t(w_t - u^*)| + 2|c_t(u^*) - c_t(w_t)| + 4|c_t(-2w_t) - c_t(-2u^*)| + 8|c_t(-w_t) - c_t(-u^*)| \right].$
(3.14)
For any $x, y \in \mathbb{R}^n$, we have
$c_t(x) - c_t(y) = c_t(x)\left[1 - e^{-2\kappa_t y_t \langle y - x, x_t \rangle}\right] \leq 2 c_t(x) \kappa_t \|x_t\|_2 \|y - x\|_2.$
Thus, we readily derive that
$|c_t(x) - c_t(y)| \leq 2 \max\{c_t(x), c_t(y)\} \kappa_t \|x_t\|_2 \|y - x\|_2.$
(3.15)
Let us define $\bar{K}_t = 64 e^{4\kappa_t \left( |\langle u^* - w_t, x_t \rangle| + \max\{|\langle u^*, x_t \rangle|, |\langle w_t, x_t \rangle|\} \right)}$. Combining equation 3.13 with equations 3.14 and 3.15, we have
$h_t^{(3)}(u^*) - h_t^{(3)}(w_t) - \langle z_t, u^* - w_t \rangle \leq \frac{\kappa_t^2 \|x_t\|_2^2 \bar{K}_t}{2} \|u^* - w_t\|_2^2.$
Hence, assumption 1c is satisfied with
$\beta \geq \max_{t \in M} \kappa_t^2 \|x_t\|_2^2 \bar{K}_t.$
(3.16)
Moreover, we have
$g_t^{(3)}(u^*) - g_t^{(3)}(w_t) - \langle r_t, u^* - w_t \rangle = c_t(w_t)\left[\frac{c_t(u^* - w_t)}{2} - 1 + 2\kappa_t y_t \langle u^* - w_t, x_t \rangle\right] = c_t(w_t)\left[e^{-2\kappa_t y_t \langle u^* - w_t, x_t \rangle} - 1 + 2\kappa_t y_t \langle u^* - w_t, x_t \rangle\right].$
It is clear that $e^{-2\kappa_t y_t \langle u^* - w_t, x_t \rangle} + 2\kappa_t y_t \langle u^* - w_t, x_t \rangle > 1$. Thus, if
$\gamma \leq \min_{t \in M} \frac{c_t(u^*) - c_t(w_t) + 2 c_t(w_t) \kappa_t y_t \langle u^* - w_t, x_t \rangle}{\|u^* - w_t\|_2^2},$
(3.17)
then assumption 1d is verified.

The proof of lemma 1 is complete.

According to theorem 1 and lemma 1, we obtain the regret bounds of the six proposed algorithms in corollary 1.

Corollary 1.
Assume that ODCA-PiL1, ODCA$k$-PiL1, ODCA-PiL2, ODCA$k$-PiL2, ODCA-Sig, and ODCA$k$-Sig generate the sequence $\{w_t\}_{t=1,\ldots,T}$. Then we have
$\mathrm{Regret}_{ODCA\text{-}PiL1}^T \leq 3L^2\sqrt{T}\sqrt{3K^2 - 4K + 2},$
(3.18)
$\mathrm{Regret}_{ODCAk\text{-}PiL1}^T \leq \frac{L^2(1 + \log(T))}{\gamma},$
(3.19)
where $\gamma$ is a positive parameter satisfying
$\gamma \leq \min_{t \in M} \frac{2\min\{\tau_t^*,\ \psi(y_t \langle u^*, x_t \rangle)\}}{\tau_t^* \|u^* - w_t\|_2^2},$
(3.20)
and $\psi$ is the real function defined by $\psi(x) = x$ if $x > 0$, and $+\infty$ otherwise. Moreover, ODCA-PiL2 and ODCA$k$-PiL2 have the same regret bounds as equations 3.18 and 3.19, respectively, but with
$\gamma \leq \min_{t \in M} \frac{2}{\|u^* - w_t\|_2^2}.$
(3.21)
As for ODCA-Sig and ODCA$k$-Sig, we have
$\mathrm{Regret}_{ODCAk\text{-}Sig}^T \leq \frac{L^2(\alpha+\beta)(1 + \log(T))}{2\alpha\gamma}, \qquad \mathrm{Regret}_{ODCA\text{-}Sig}^T \leq \frac{3L^2(\alpha+\beta)\sqrt{T}\sqrt{3K^2 - 4K + 2}}{2\alpha},$
where the parameters $\alpha$, $\beta$, and $\gamma$ satisfy equations 3.12, 3.16, and 3.17, respectively.

Thanks to the regret bounds of the six proposed algorithms, the following proposition provides a mistake bound for these algorithms.

Proposition 2.
(a) For $w \in \mathbb{R}^n$, the number of prediction mistakes made by ODCA$k$-PiL1 (resp. ODCA$k$-PiL2) is upper-bounded by the root $\bar{x}_1$ of the equation
$x - \bar{a} - \bar{b}(1 + \log(x)) = 0,$
where $\bar{a} = \sum_{t \in M} f_t(w)$, $\bar{b} = \frac{L^2}{\gamma_{PiL}}$, $\bar{x}_1 \geq \bar{b}$, $\gamma_{PiL} \leq \min\{\gamma, L^2\}$, and $\gamma$ is defined by equation 3.20 (resp. 3.21). Moreover, the mistake bound for ODCA$k$-Sig has the same form, but with $\bar{b} = \frac{L^2(\alpha + \beta)}{2\alpha \gamma_{Sig}}$ and $\gamma_{Sig} \leq \min\left\{\gamma, \frac{L^2(\alpha + \beta)}{2\alpha}\right\}$, where the parameters $\alpha$, $\beta$, $\gamma$ satisfy equations 3.12, 3.16, and 3.17, respectively.

(b) For $w \in \mathbb{R}^n$, the number of prediction mistakes made by ODCA-PiL1 and ODCA-PiL2 is upper-bounded by $\frac{\left(\bar{c} + \sqrt{\bar{c}^2 + 4\bar{a}}\right)^2}{4}$, where $\bar{c} := 3L^2\sqrt{3K^2 - 4K + 2}$ and $\bar{a} = \sum_{t \in M} f_t(w)$. In addition, the mistake bound of ODCA-Sig has the same form as for ODCA-PiL1 and ODCA-PiL2, but with $\bar{c} = \frac{3L^2(\alpha + \beta)\sqrt{3K^2 - 4K + 2}}{2\alpha}$, where $\alpha$ and $\beta$ satisfy equations 3.12 and 3.16, respectively.

Proof.

In this proof, we only show the mistake bounds of the algorithms ODCA$k$-PiL1 in part a and ODCA-PiL1 in part b; the mistake bounds of the other algorithms are obtained similarly.

(a) From inequality 3.3 and corollary 1, we derive that for any $w \in \mathbb{R}^n$,
$|M| \leq \sum_{t \in M} f_t(w_t) \leq \sum_{t \in M} f_t(w) + \frac{L^2(1 + \log(|M|))}{\gamma_{PiL}}.$
Here, $|M|$ is the number of steps in $M$. From the definitions of $\bar{a}$ and $\bar{b}$, it is evident that $\bar{a} \geq 0$, $\bar{b} \geq 1$, and the last inequality can be rewritten as
$|M| \leq \bar{a} + \bar{b}(1 + \log(|M|)).$
Let us consider the strictly convex function $r : (0, +\infty) \to \mathbb{R}$,
$r(x) = x - \bar{a} - \bar{b}(1 + \log(x)).$
Since $\lim_{x \to 0^+} r(x) = \lim_{x \to +\infty} r(x) = +\infty$ and $r(\bar{b}) \leq 0$, the equation $r(x) = 0$ has two roots $\bar{x}_1$ and $\bar{x}_2$ such that $0 < \bar{x}_2 \leq \bar{b} \leq \bar{x}_1$. As $r(|M|) \leq 0$, it follows that $|M| \leq \bar{x}_1$.
(b) Similarly to part a, we obtain that for any $w \in \mathbb{R}^n$,
$|M| \leq \sum_{t \in M} f_t(w_t) \leq \bar{a} + 3L^2\sqrt{|M|}\sqrt{3K^2 - 4K + 2}.$
This leads to the inequality $|M| \leq \bar{a} + \bar{c}\sqrt{|M|}$, whose resolution gives the stated bound.

The proof of proposition 2 is complete.
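The root $\bar{x}_1$ of part a has no closed form, but since $r$ is strictly convex with $r(\bar{b}) \leq 0$ and $r(x) \to +\infty$, it is easy to compute numerically; a sketch of ours, with hypothetical values of $\bar{a}$ and $\bar{b}$:

```python
import math

def mistake_bound_root(a_bar, b_bar, tol=1e-10):
    """Larger root x1 >= b_bar of r(x) = x - a_bar - b_bar*(1 + log(x)),
    found by doubling to bracket the root and then bisecting.
    Assumes a_bar >= 0 and b_bar >= 1, so that r(b_bar) <= 0."""
    r = lambda x: x - a_bar - b_bar * (1.0 + math.log(x))
    lo = hi = b_bar
    while r(hi) < 0.0:            # r -> +infinity, so this terminates
        hi *= 2.0
    while hi - lo > tol * (1.0 + hi):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if r(mid) < 0.0 else (lo, mid)
    return hi

x1 = mistake_bound_root(a_bar=3.0, b_bar=2.0)   # hypothetical values
```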

## 4  Numerical Experiments

Our numerical experiments consist of two parts. First, we study the performance of the two versions of online DCA algorithms: the approximate ones, ODCA$k$-PiL1, ODCA$k$-PiL2, and ODCA$k$-Sig, and the complete ones, ODCA-PiL1, ODCA-PiL2, and ODCA-Sig. Second, we compare the most notable of them with five state-of-the-art online binary classification algorithms: the perceptron (Novikoff, 1963; Rosenblatt, 1958; Van Der Malsburg, 1986), online gradient descent (OGD; Zinkevich, 2003), the relaxed online maximum margin algorithm (ROMMA; Li & Long, 2002), the approximate maximal margin classification algorithm (ALMA; Gentile, 2002), and passive-aggressive learning algorithms (PA; Crammer et al., 2006), which are described in detail in the appendix.

We test these comparative algorithms on a variety of benchmark classification data sets from the UCI Machine Learning Repository1 and LIBSVM website.2 The data sets used in our experiments cover various areas (e.g., social sciences, biology, physics, life sciences) and are shown in Table 1.

Table 1:
Data Sets Used in Our Experiments.
| Data Set | Name | Number of Instances ($T$) | Number of Features ($n$) |
| --- | --- | --- | --- |
| D1 | a8a | 32,561 | 123 |
| D2 | cod-rna | 271,617 | |
| D3 | colon-cancer | 62 | 2000 |
| D4 | covtype | 581,012 | 54 |
| D5 | diabetes | 768 | |
| D6 | ijcnn1 | 141,691 | 22 |
| D7 | magic04 | 19,020 | 10 |
| D8 | splice | 3175 | 60 |
| D9 | svmguide1 | 7089 | |
| D10 | w7a | 49,749 | 300 |
| D11 | gisette | 6806 | 5000 |
| D12 | duke | 44 | 7129 |
| D13 | susy1$^a$ | 1,000,000 | 18 |
| D14 | susy2$^a$ | 1,500,000 | 18 |

$^a$Data set consists of the first $T$ instances of the susy data set.

All experiments were implemented in Matlab R2013b and performed on a PC with an Intel Xeon CPU E5-2630 v2 at 2.60 GHz and 32 GB of RAM. The open source Matlab package for the state-of-the-art algorithms is available in Hoi et al. (2014). The initial point of all algorithms is $0\in\mathbb{R}^n$. In fact, many numerical experiments of existing online classification algorithms start from the zero point (see Shalev-Shwartz, 2012; Zinkevich, 2003; and Hoi et al., 2014); thus, for a fair comparison, we used the zero point for all comparative algorithms. It is worth mentioning that we also tested several random initial points, and our proposed algorithms gave results similar to those furnished from the zero starting point. The default tolerance $\varepsilon$ is set to $10^{-4}$, and the maximum number of iterations of the subgradient method at each step is 5000. The step size for our proposed algorithms is $\eta_t=C/\sqrt{T}$ for all $t$. We are interested in the following criteria to evaluate the effectiveness of the proposed algorithms: the mistake rate (defined as the ratio of the number of mistakes to the number of instances) and the CPU time (in seconds).

To choose the best parameters for the different algorithms, we follow the validation procedure described in Hoi et al. (2014). In particular, we first run each algorithm over one random permutation of the data set with different parameter values and then take the value corresponding to the smallest mistake rate. The ranges of parameters for the state-of-the-art algorithms are described completely in Hoi et al. (2014), while the best parameters $\tau_1$, $\tau_2$, $\tau_3$, $C$, $\kappa$, and $\varepsilon$ in our algorithms are searched in the ranges $\{2^{-4},2^{-3},\ldots,2^{4}\}$, $\{2^{-4},2^{-3},\ldots,2^{4}\}$, $\{1,3,\ldots,9\}$, $\{2^{-4},2^{-3},\ldots,2^{4}\}$, $\{0.1,0.2,\ldots,1\}$, and $\{0,0.1,\ldots,0.9\}$, respectively. After the validation procedure, each algorithm is run over $N$ random permutations of each data set with the chosen parameters.
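The validate-then-evaluate protocol above can be sketched generically. The function name and the `run` callback below are placeholders for illustration, not part of LIBOL or of our Matlab code; `run(param, stream)` is assumed to execute one online pass and return the mistake rate.

```python
import random

def validate_then_evaluate(run, param_grid, data, n_runs=20, seed=0):
    """Sketch of the validation procedure of Hoi et al. (2014):
    pick the parameter with the smallest mistake rate on one random
    permutation, then average the mistake rate over n_runs fresh
    permutations with that parameter fixed."""
    rng = random.Random(seed)
    perm = list(data)
    rng.shuffle(perm)
    # Validation: one permutation, full parameter grid.
    best_param = min(param_grid, key=lambda p: run(p, perm))
    # Evaluation: n_runs new permutations with the chosen parameter.
    rates = []
    for _ in range(n_runs):
        rng.shuffle(perm)
        rates.append(run(best_param, perm))
    return best_param, sum(rates) / len(rates)
```

Because every algorithm sees the same permutations, the comparison isolates the effect of the update rule rather than of the data order.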

### 4.1  Experiment 1: Comparison between Two Versions of Online DCA Algorithms

In this experiment, we compare the approximate algorithms ODCA$k$-PiL1, ODCA$k$-PiL2, ODCA$k$-Sig with the complete algorithms ODCA-PiL1, ODCA-PiL2, ODCA-Sig, respectively. We run the 6 algorithms on 14 data sets over five runs ($N=5$). The average mistake rates and CPU times over these five runs are reported in Tables 2 and 3, respectively.

Table 2:
Average Mistake Rate and Its Standard Deviation Obtained by Two Versions of Online DCA-Based Algorithms.
| Data Set | ODCA$k$-PiL1 | ODCA-PiL1 | ODCA$k$-PiL2 | ODCA-PiL2 | ODCA$k$-Sig | ODCA-Sig |
| --- | --- | --- | --- | --- | --- | --- |
| D1 | **0.217 $±$ 0.016** | 0.221 $±$ 0.001 | **0.157 $±$ 0.000** | 0.170 $±$ 0.001 | **0.157 $±$ 0.000** | 0.240 $±$ 0.000 |
| D2 | **0.174 $±$ 0.001** | 0.222 $±$ 0.001 | **0.466 $±$ 0.183** | **0.466 $±$ 0.183** | **0.115 $±$ 0.000** | 0.333 $±$ 0.000 |
| D3 | **0.296 $±$ 0.039** | 0.303 $±$ 0.039 | **0.229 $±$ 0.035** | 0.290 $±$ 0.073 | **0.216 $±$ 0.058** | 0.219 $±$ 0.035 |
| D4 | **0.469 $±$ 0.001** | 0.476 $±$ 0.001 | **0.487 $±$ 0.000** | **0.487 $±$ 0.000** | 0.423 $±$ 0.000 | **0.417 $±$ 0.000** |
| D5 | **0.318 $±$ 0.013** | 0.323 $±$ 0.013 | **0.275 $±$ 0.009** | 0.305 $±$ 0.018 | **0.260 $±$ 0.001** | 0.268 $±$ 0.005 |
| D6 | **0.096 $±$ 0.005** | 0.207 $±$ 0.132 | 0.088 $±$ 0.039 | **0.064 $±$ 0.002** | 0.072 $±$ 0.004 | **0.059 $±$ 0.000** |
| D7 | **0.361 $±$ 0.006** | 0.384 $±$ 0.002 | **0.529 $±$ 0.163** | **0.529 $±$ 0.163** | **0.280 $±$ 0.001** | 0.529 $±$ 0.163 |
| D8 | **0.280 $±$ 0.020** | 0.302 $±$ 0.016 | **0.496 $±$ 0.021** | **0.496 $±$ 0.021** | **0.216 $±$ 0.003** | 0.519 $±$ 0.000 |
| D9 | **0.248 $±$ 0.005** | 0.278 $±$ 0.006 | **0.487 $±$ 0.070** | **0.487 $±$ 0.070** | **0.201 $±$ 0.002** | 0.487 $±$ 0.070 |
| D10 | **0.110 $±$ 0.002** | **0.110 $±$ 0.001** | 0.101 $±$ 0.001 | **0.100 $±$ 0.000** | 0.101 $±$ 0.001 | **0.098 $±$ 0.000** |
| D11 | **0.110 $±$ 0.002** | 0.113 $±$ 0.007 | **0.500 $±$ 0.001** | **0.500 $±$ 0.001** | **0.072 $±$ 0.002** | 0.499 $±$ 0.000 |
| D12 | **0.295 $±$ 0.058** | **0.295 $±$ 0.058** | 0.336 $±$ 0.057 | **0.318 $±$ 0.084** | **0.272 $±$ 0.023** | 0.318 $±$ 0.048 |
| D13 | **0.284 $±$ 0.006** | 0.359 $±$ 0.057 | **0.213 $±$ 0.000** | 0.215 $±$ 0.001 | **0.213 $±$ 0.000** | 0.458 $±$ 0.014 |
| D14 | **0.295 $±$ 0.018** | 0.354 $±$ 0.044 | **0.213 $±$ 0.000** | 0.214 $±$ 0.001 | **0.213 $±$ 0.000** | 0.457 $±$ 0.010 |

Notes: The complete algorithms are ODCA-PiL1, ODCA-PiL2, and ODCA-Sig; the approximate algorithms are ODCA$k$-PiL1, ODCA$k$-PiL2, and ODCA$k$-Sig. Results are averaged over five runs. Bold values indicate the best result in each complete/approximate pair.

Table 3:
Average CPU Time and Its Standard Deviation in Seconds Obtained by Two Versions of Online DCA-Based Algorithms.
| Data Set | ODCA$k$-PiL1 | ODCA-PiL1 | ODCA$k$-PiL2 | ODCA-PiL2 | ODCA$k$-Sig | ODCA-Sig |
| --- | --- | --- | --- | --- | --- | --- |
| D1 | **2.057 $±$ 0.014** | 2.192 $±$ 0.005 | **2.338 $±$ 0.011** | 5.138 $±$ 0.209 | **2.388 $±$ 0.013** | 2.716 $±$ 0.007 |
| D2 | **14.22 $±$ 0.773** | 15.52 $±$ 0.540 | **15.68 $±$ 0.902** | 15.97 $±$ 1.005 | **17.42 $±$ 0.910** | 19.09 $±$ 0.874 |
| D3 | **0.008 $±$ 0.000** | 0.009 $±$ 0.000 | **0.010 $±$ 0.000** | 0.011 $±$ 0.000 | **0.009 $±$ 0.000** | 0.170 $±$ 0.002 |
| D4 | **41.70 $±$ 0.700** | 45.76 $±$ 0.646 | **42.31 $±$ 0.080** | 42.50 $±$ 0.654 | **51.18 $±$ 0.222** | 204.6 $±$ 3.287 |
| D5 | **0.038 $±$ 0.000** | 0.157 $±$ 0.185 | **0.045 $±$ 0.000** | 0.201 $±$ 0.010 | **0.048 $±$ 0.000** | 1.846 $±$ 0.116 |
| D6 | **7.432 $±$ 0.135** | 14.05 $±$ 5.157 | **8.553 $±$ 0.308** | 17.96 $±$ 0.327 | **8.625 $±$ 0.175** | 23.75 $±$ 0.487 |
| D7 | **0.812 $±$ 0.016** | 0.895 $±$ 0.032 | **0.913 $±$ 0.025** | 0.949 $±$ 0.029 | **0.990 $±$ 0.048** | 1.122 $±$ 0.048 |
| D8 | **0.175 $±$ 0.001** | 0.422 $±$ 0.456 | **0.184 $±$ 0.000** | 0.191 $±$ 0.005 | **0.222 $±$ 0.000** | 0.230 $±$ 0.009 |
| D9 | **0.342 $±$ 0.002** | 0.391 $±$ 0.010 | **0.367 $±$ 0.002** | 0.378 $±$ 0.003 | **0.425 $±$ 0.003** | 0.638 $±$ 0.015 |
| D10 | **4.174 $±$ 0.072** | 3113 $±$ 1738 | **4.415 $±$ 0.073** | 7.194 $±$ 0.323 | **4.439 $±$ 0.061** | 9.593 $±$ 0.124 |
| D11 | **4.501 $±$ 0.069** | 4.618 $±$ 0.067 | **4.490 $±$ 0.066** | 4.563 $±$ 0.083 | **4.770 $±$ 0.144** | 5.240 $±$ 0.176 |
| D12 | **0.015 $±$ 0.002** | 0.016 $±$ 0.002 | **0.016 $±$ 0.002** | 0.021 $±$ 0.004 | **0.020 $±$ 0.003** | 1.074 $±$ 0.460 |
| D13 | **47.85 $±$ 5.261** | 70.45 $±$ 7.874 | **51.38 $±$ 3.939** | 53.55 $±$ 3.661 | **52.75 $±$ 3.882** | 68.86 $±$ 9.810 |
| D14 | **53.28 $±$ 0.931** | 78.94 $±$ 5.302 | **64.17 $±$ 1.983** | 64.64 $±$ 2.756 | **65.87 $±$ 2.465** | 89.40 $±$ 6.378 |

Notes: The complete algorithms are ODCA-PiL1, ODCA-PiL2, and ODCA-Sig; the approximate algorithms are ODCA$k$-PiL1, ODCA$k$-PiL2, and ODCA$k$-Sig. Results are averaged over five runs. Bold values indicate the best result in each complete/approximate pair.

The approximate version is more efficient than the complete version in both mistake rate and CPU time, especially on very large data sets. Indeed, the approximate algorithms ODCA$k$-PiL1, ODCA$k$-PiL2, and ODCA$k$-Sig run faster than the complete algorithms ODCA-PiL1, ODCA-PiL2, and ODCA-Sig, respectively: the ratio of gain varies from 1.03 to 745 times, from 1.00 to 4.46 times, and from 1.04 to 53.7 times, respectively. As for the mistake rate, the approximate version is better than the complete version on most of the data sets: the ratio of gain of ODCA$k$-PiL1 versus ODCA-PiL1, ODCA$k$-PiL2 versus ODCA-PiL2, and ODCA$k$-Sig versus ODCA-Sig varies, respectively, from $1.4\%$ to $53\%$, from $0.5\%$ to $21\%$, and from $1.3\%$ to $85\%$. In particular, on data set D6, the complete algorithms ODCA-PiL2 and ODCA-Sig furnish mistake rates smaller than those of the approximate versions ODCA$k$-PiL2 and ODCA$k$-Sig, with gains of $27\%$ and $18\%$, but they run more slowly, by factors of 2.10 and 2.75, respectively.

### 4.2  Experiment 2: Comparison with State-of-the-Art Classification Algorithms

In this experiment, we compare the approximate algorithms ODCA$k$-PiL1, ODCA$k$-PiL2, and ODCA$k$-Sig with the five state-of-the-art binary linear classification algorithms mentioned above. The average results over 20 runs ($N=20$) of all algorithms are reported in Tables 4 and 5.

Table 4:
Average Mistake Rate and Its Standard Deviation Obtained by ODCA$k$-PiL1, ODCA$k$-PiL2, ODCA$k$-Sig and Perceptron, ROMMA, ALMA, OGD, PA on 20 Runs.
| Data Set | ODCA$k$-PiL1 | ODCA$k$-PiL2 | ODCA$k$-Sig | Perceptron | ROMMA | ALMA | OGD | PA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| D1 | 0.2088 $±$ 0.001 | <u>0.1575 $±$ 0.001</u> | **0.1574 $±$ 0.001** | 0.2100 $±$ 0.001 | 0.2249 $±$ 0.002 | 0.1581 $±$ 0.001 | 0.1577 $±$ 0.001 | 0.2108 $±$ 0.002 |
| D2 | 0.1739 $±$ 0.001 | <u>0.1176 $±$ 0.001</u> | **0.1149 $±$ 0.000** | 0.1749 $±$ 0.001 | 0.1517 $±$ 0.065 | 0.1994 $±$ 0.001 | 0.1657 $±$ 0.000 | 0.2074 $±$ 0.001 |
| D3 | 0.3088 $±$ 0.039 | **0.2379 $±$ 0.046** | **0.2379 $±$ 0.050** | 0.3137 $±$ 0.043 | 0.3717 $±$ 0.086 | 0.4435 $±$ 0.056 | 0.3032 $±$ 0.060 | 0.2637 $±$ 0.041 |
| D4 | 0.4697 $±$ 0.001 | <u>0.4237 $±$ 0.001</u> | **0.4231 $±$ 0.000** | 0.4697 $±$ 0.001 | 0.4804 $±$ 0.011 | 0.4839 $±$ 0.001 | 0.4676 $±$ 0.001 | 0.4835 $±$ 0.000 |
| D5 | 0.3194 $±$ 0.015 | <u>0.2615 $±$ 0.008</u> | 0.2621 $±$ 0.007 | 0.3265 $±$ 0.013 | 0.3072 $±$ 0.015 | 0.2655 $±$ 0.010 | **0.2586 $±$ 0.007** | 0.3346 $±$ 0.016 |
| D6 | 0.1045 $±$ 0.024 | <u>0.0705 $±$ 0.001</u> | 0.0740 $±$ 0.018 | 0.1062 $±$ 0.000 | 0.1008 $±$ 0.001 | **0.0699 $±$ 0.001** | 0.0767 $±$ 0.001 | 0.1023 $±$ 0.001 |
| D7 | 0.3593 $±$ 0.007 | <u>0.2786 $±$ 0.002</u> | **0.2775 $±$ 0.001** | 0.3645 $±$ 0.002 | 0.3365 $±$ 0.034 | 0.3636 $±$ 0.003 | 0.3557 $±$ 0.003 | 0.3835 $±$ 0.003 |
| D8 | 0.2969 $±$ 0.056 | 0.2329 $±$ 0.006 | **0.2150 $±$ 0.003** | 0.2732 $±$ 0.004 | 0.2684 $±$ 0.009 | 0.2283 $±$ 0.006 | <u>0.2168 $±$ 0.004</u> | 0.2617 $±$ 0.007 |
| D9 | 0.2492 $±$ 0.007 | 0.2723 $±$ 0.116 | **0.2026 $±$ 0.002** | 0.2560 $±$ 0.004 | 0.3037 $±$ 0.032 | 0.2564 $±$ 0.004 | <u>0.2466 $±$ 0.010</u> | 0.3130 $±$ 0.005 |
| D10 | 0.1147 $±$ 0.012 | **0.1005 $±$ 0.000** | **0.1005 $±$ 0.000** | 0.1151 $±$ 0.000 | 0.1094 $±$ 0.001 | 0.1028 $±$ 0.001 | 0.1037 $±$ 0.001 | 0.1051 $±$ 0.000 |

Note: Bold (resp. underlining) values indicate the first-best (resp. second-best) results.

Table 5:
Average CPU Time and Its Standard Deviation in Seconds Obtained by ODCA$k$-PiL1, ODCA$k$-PiL2, ODCA$k$-Sig and Perceptron, ROMMA, ALMA, OGD, PA on 20 Runs.
| Data Set | ODCA$k$-PiL1 | ODCA$k$-PiL2 | ODCA$k$-Sig | Perceptron | ROMMA | ALMA | OGD | PA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| D1 | 1.113 $±$ 0.007 | 1.569 $±$ 0.020 | 1.571 $±$ 0.027 | **1.084 $±$ 0.014** | 1.169 $±$ 0.017 | 1.309 $±$ 0.016 | 1.527 $±$ 0.021 | 1.254 $±$ 0.016 |
| D2 | 8.088 $±$ 0.057 | 11.91 $±$ 0.060 | 11.79 $±$ 0.057 | **7.807 $±$ 0.042** | 8.160 $±$ 0.243 | 9.494 $±$ 0.111 | 11.09 $±$ 0.048 | 9.582 $±$ 0.045 |
| D3 | **0.003 $±$ 0.000** | 0.005 $±$ 0.000 | 0.004 $±$ 0.000 | **0.003 $±$ 0.000** | 0.004 $±$ 0.000 | 0.004 $±$ 0.000 | **0.003 $±$ 0.000** | 0.004 $±$ 0.000 |
| D4 | 22.72 $±$ 1.665 | 33.28 $±$ 2.467 | 33.68 $±$ 2.435 | **22.07 $±$ 1.760** | 24.36 $±$ 1.903 | 25.63 $±$ 1.942 | 29.51 $±$ 2.194 | 26.80 $±$ 2.039 |
| D5 | **0.028 $±$ 0.000** | 0.041 $±$ 0.000 | 0.042 $±$ 0.000 | **0.028 $±$ 0.000** | 0.029 $±$ 0.000 | 0.034 $±$ 0.000 | 0.041 $±$ 0.000 | 0.034 $±$ 0.000 |
| D6 | 5.024 $±$ 0.045 | 7.062 $±$ 0.046 | 7.097 $±$ 0.053 | **4.873 $±$ 0.019** | 5.070 $±$ 0.025 | 5.919 $±$ 0.044 | 6.982 $±$ 0.038 | 5.444 $±$ 0.028 |
| D7 | 0.706 $±$ 0.003 | 1.024 $±$ 0.004 | 1.036 $±$ 0.003 | **0.687 $±$ 0.002** | 0.730 $±$ 0.010 | 0.817 $±$ 0.002 | 0.960 $±$ 0.002 | 0.848 $±$ 0.002 |
| D8 | 0.121 $±$ 0.002 | 0.184 $±$ 0.001 | 0.178 $±$ 0.000 | **0.117 $±$ 0.000** | 0.124 $±$ 0.000 | 0.138 $±$ 0.000 | 0.167 $±$ 0.000 | 0.143 $±$ 0.001 |
| D9 | 0.255 $±$ 0.001 | 0.360 $±$ 0.007 | 0.373 $±$ 0.001 | **0.248 $±$ 0.003** | 0.269 $±$ 0.003 | 0.295 $±$ 0.001 | 0.348 $±$ 0.001 | 0.301 $±$ 0.001 |
| D10 | 2.403 $±$ 0.086 | 3.096 $±$ 0.121 | 3.111 $±$ 0.113 | **2.365 $±$ 0.092** | 2.430 $±$ 0.065 | 2.709 $±$ 0.089 | 3.115 $±$ 0.079 | 2.487 $±$ 0.050 |

Note: Bold values indicate the best results.

In terms of the mistake rate, we observe from Table 4 that ODCA$k$-Sig is the best algorithm, ODCA$k$-PiL2 is the second best, and ODCA$k$-PiL1 is slightly more efficient than the existing algorithms. In particular, ODCA$k$-Sig is the first best on 8 of 10 data sets, notably the large data sets D2 (271,617 instances) and D4 (581,012 instances). The ratio of gain of ODCA$k$-Sig versus the others varies from $0.06\%$ to $46.3\%$. ODCA$k$-PiL2 outperforms the existing algorithms on 8 of 10 data sets (2 for the first best and 6 for the second best); the ratio of gain varies from $0.12\%$ to $46.3\%$. In addition, the mistake rate of ODCA$k$-PiL2 is comparable to that of ODCA$k$-Sig on 6 of 10 data sets, with the ratio of gain of ODCA$k$-Sig versus ODCA$k$-PiL2 ranging from $0\%$ to $2.29\%$. OGD and ROMMA come next and are somewhat better than ODCA$k$-PiL1 on most data sets; the ratio of gain varies from $0.44\%$ to $26.9\%$ and from $3.54\%$ to $12.7\%$, respectively. ODCA$k$-PiL1 is slightly more efficient than perceptron, ALMA, and PA on 9 of 10, 5 of 10, and 6 of 10 data sets, respectively; the ratio of gain varies from $0.34\%$ to $2.65\%$, from $1.18\%$ to $30.3\%$, and from $0.94\%$ to $20.3\%$, respectively.

Concerning CPU time, all algorithms run very fast and can be classified as follows: perceptron and ODCA$k$-PiL1 are the fastest algorithms; three algorithms—ROMMA, ALMA, and PA—come next; and finally, ODCA$k$-Sig, ODCA$k$-PiL2, and OGD. More specifically, perceptron is the fastest on all data sets, while ODCA$k$-PiL1 is comparable to perceptron: the ratio of gain of perceptron versus ODCA$k$-PiL1 varies from 1.00 to 1.03 times. As for ODCA$k$-Sig and ODCA$k$-PiL2, their CPU time is fairly small and acceptable on all data sets; the ratio of gain of perceptron versus ODCA$k$-Sig (resp. ODCA$k$-PiL2) varies from 1.31 to 1.52 (resp. from 1.30 to 1.66) times.

## 5  Conclusion

We have intensively studied an online DCA-based approach to online learning. At each learning step, we have considered a DC program to which an online version of DCA is applied. We have proposed the complete and approximate versions of online DCA, in which the convex subproblem is solved completely by (resp. approximately, by one iteration of) the subgradient method. We have proved that the complete version has the sublinear regret $O(\sqrt{T})$, while the approximate one can achieve the logarithmic regret $O(\log(T))$. As an application, we have developed online DCA schemes for online classification with the 0-1 loss function. We have approximated the 0-1 loss function by DC functions and then proposed six online DCA-based algorithms: the three complete (resp. approximate) versions ODCA-PiL1, ODCA-PiL2, and ODCA-Sig (resp. ODCA$k$-PiL1, ODCA$k$-PiL2, and ODCA$k$-Sig). Thanks to the simple structure of piecewise-linear and sigmoid functions, natural DC decompositions have been considered in which the DC components take the form of a maximum of affine functions or of smooth functions that are differentiable almost everywhere; our DCA-based algorithms take advantage of this property. Numerical results on various benchmark classification data sets have shown that the approximate algorithms ODCA$k$-PiLs (resp. ODCA$k$-Sig) outperform the complete algorithms ODCA-PiLs (resp. ODCA-Sig) in both speed and quality of classification. Compared with the five state-of-the-art online classification algorithms, ODCA$k$-Sig is the best. Following these promising results, we are pursuing further developments of online DCA for online learning applications; in particular, extensions of the proposed approach to multiclass classification are ongoing.

## Appendix:  Description of State-of-the-Art Binary Linear Classification Algorithms

In this appendix, we give a detailed description of five state-of-the-art binary linear classification algorithms used in the numerical experiments: Perceptron (Novikoff, 1963; Rosenblatt, 1958; Van Der Malsburg, 1986), online gradient descent (OGD; Zinkevich, 2003), relaxed online maximum margin algorithm (ROMMA; Li & Long, 2002), approximate maximal margin classification algorithm (ALMA; Gentile, 2002), and passive-aggressive learning algorithms (PA; Crammer et al., 2006).

First, the perceptron algorithm is known as the earliest, simplest approach for online binary linear classification (Rosenblatt, 1958):

Perceptron

Initialization: let $w_1$ be an initial point.

for $t=1,2,\ldots,T$ do

if $y_t\langle w_t,x_t\rangle\le 0$ then

$w_{t+1}=w_t+y_tx_t$

else

$w_{t+1}=w_t$

end if

end for
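The perceptron loop above translates directly into a few lines. This is a generic sketch using plain Python lists and the zero initial point of our experiments, not the Matlab implementation from LIBOL.

```python
def perceptron(stream):
    """Online perceptron on a stream of (x, y) pairs with y in {-1, +1}.
    x is a list of floats; returns (final weights, number of mistakes)."""
    w, mistakes = None, 0
    for x, y in stream:
        if w is None:
            w = [0.0] * len(x)            # w_1 = 0, as in the experiments
        if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
            mistakes += 1
            # Mistake round: w_{t+1} = w_t + y_t x_t
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w, mistakes
```

Only mistake rounds change the weights, which is why the perceptron is the fastest baseline in Table 5.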

Second, the relaxed online maximum margin algorithm (Li & Long, 2002) is an incremental algorithm for classification using a linear threshold function. It can be seen as a relaxed version of the algorithm that searches for the separating hyperplane that maximizes the minimum distance from previous instances classified correctly:

ROMMA: Relaxed Online Maximum Margin Algorithm

Initialization: let $w_1$ be an initial point.

for $t=1,2,\ldots,T$ do

if $y_t\langle w_t,x_t\rangle\le 0$ then

$w_{t+1}=\dfrac{\|x_t\|^2\|w_t\|^2-y_t\langle w_t,x_t\rangle}{\|x_t\|^2\|w_t\|^2-\langle w_t,x_t\rangle^2}\,w_t+\dfrac{\|w_t\|^2(y_t-\langle w_t,x_t\rangle)}{\|x_t\|^2\|w_t\|^2-\langle w_t,x_t\rangle^2}\,x_t$

else

$w_{t+1}=w_t$

end if

end for
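A direct transcription of one ROMMA round might look as follows. The zero-denominator guard (falling back to a perceptron step when $w_t=0$ or $x_t$ is parallel to $w_t$) is an implementation assumption of this sketch, not part of the listing.

```python
def romma_step(w, x, y):
    """One ROMMA round (Li & Long, 2002): update only on mistakes.
    w, x: lists of floats; y in {-1, +1}."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    if y * dot > 0:
        return w                          # correct prediction: keep w
    nx2 = sum(xi * xi for xi in x)        # ||x_t||^2
    nw2 = sum(wi * wi for wi in w)        # ||w_t||^2
    denom = nx2 * nw2 - dot * dot
    if denom == 0.0:                      # degenerate case: perceptron-style step
        return [wi + y * xi for wi, xi in zip(w, x)]
    c = (nx2 * nw2 - y * dot) / denom     # coefficient of w_t
    d = nw2 * (y - dot) / denom           # coefficient of x_t
    return [c * wi + d * xi for wi, xi in zip(w, x)]
```

The pair $(c,d)$ is the closed-form solution of the relaxed maximum-margin projection, so the new $w$ classifies $(x_t,y_t)$ with margin at least 1.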

Third, the approximate maximal margin classification algorithm (Gentile, 2002) consists of approximating the maximal margin hyperplane with respect to $ℓp$-norm ($p≥2$) for a set of linearly separable data. The proposed algorithm in Gentile (2002) is called approximate large margin algorithm (ALMA):

ALMA: Approximate Large Margin Algorithm

Initialization: let $w_1$ be an initial point; parameters $p\ge 2$, $\alpha\in(0,1]$, $C>0$, $B=1/\alpha$, $k=1$.

for $t=1,2,\ldots,T$ do

$l_t=\max\left\{0,\,(1-\alpha)B\sqrt{\dfrac{p-1}{k}}-\dfrac{y_t\langle w_t,x_t\rangle}{\|x_t\|}\right\}$

if $l_t>0$ then

$w_{t+1}=\left(w_t+\dfrac{C}{\sqrt{(p-1)k}}\dfrac{y_tx_t}{\|x_t\|}\right)\Big/\max\left\{1,\left\|w_t+\dfrac{C}{\sqrt{(p-1)k}}\dfrac{y_tx_t}{\|x_t\|}\right\|\right\}$

$k=k+l_t$

else

$w_{t+1}=w_t$

end if

end for
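A minimal sketch of the ALMA loop above, assuming the Euclidean norm ($p=2$) and following the listing's update of $k$ by $l_t$; the default parameter values are illustrative only, not the validated ones from our experiments.

```python
import math

def alma(stream, p=2, alpha=0.5, C=1.0):
    """ALMA sketch (Gentile, 2002) with B = 1/alpha, as in the listing.
    stream yields (x, y) with y in {-1, +1}; returns the final weights."""
    B = 1.0 / alpha
    k = 1.0
    w = None
    for x, y in stream:
        if w is None:
            w = [0.0] * len(x)
        nx = math.sqrt(sum(xi * xi for xi in x)) or 1.0   # guard x = 0
        margin = y * sum(wi * xi for wi, xi in zip(w, x)) / nx
        l = max(0.0, (1 - alpha) * B * math.sqrt((p - 1) / k) - margin)
        if l > 0:
            eta = C / math.sqrt((p - 1) * k)
            w = [wi + eta * y * xi / nx for wi, xi in zip(w, x)]
            norm = math.sqrt(sum(wi * wi for wi in w))
            if norm > 1.0:
                w = [wi / norm for wi in w]   # project back onto the unit ball
            k += l                            # the listing advances k by l_t
    return w
```

The projection onto the unit ball is what lets ALMA approximate the maximal-margin hyperplane while keeping each round cheap.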

Fourth, the passive-aggressive (PA) learning algorithm (Crammer et al., 2006) computes the classifier based on an analytical solution to a simple constrained optimization problem that minimizes the distance from the current classifier $wt$ to the half-space of vectors of the zero hinge-loss on the current sample:

PA: Passive-Aggressive Learning Algorithm

Initialization: let $w_1$ be an initial point.

for $t=1,2,\ldots,T$ do

$l_t=\max\{0,1-y_t\langle w_t,x_t\rangle\}$

if $l_t>0$ then

$w_{t+1}=w_t+\dfrac{l_t}{\|x_t\|^2}\,y_tx_t$

else

$w_{t+1}=w_t$

end if

end for
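The PA update has a closed form, so one round can be written compactly. This sketch assumes dense vectors and $x_t\neq 0$.

```python
def pa_step(w, x, y):
    """One passive-aggressive round (Crammer et al., 2006)."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - y * dot)        # hinge loss on the current pair
    if loss == 0.0:
        return w                          # passive: margin already >= 1
    nx2 = sum(xi * xi for xi in x)        # ||x_t||^2
    tau = loss / nx2                      # aggressive: smallest step zeroing the loss
    return [wi + tau * y * xi for wi, xi in zip(w, x)]
```

The step size $\tau=l_t/\|x_t\|^2$ is exactly large enough that the updated classifier attains zero hinge loss on $(x_t,y_t)$, which is the constrained problem's analytical solution.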

Finally, the classic online gradient descent algorithm (Zinkevich, 2003) uses the gradient descent method for minimizing the hinge-loss function:

OGD: Online Gradient Descent

Initialization: let $w_1$ be an initial point; parameter $C>0$.

for $t=1,2,\ldots,T$ do

$l_t=\max\{0,1-y_t\langle w_t,x_t\rangle\}$

if $l_t>0$ then

$w_{t+1}=w_t+\dfrac{C}{\sqrt{t}}\,y_tx_t$

else

$w_{t+1}=w_t$

end if

end for
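A sketch of the OGD loop above. Reading the step size as $C/\sqrt{t}$ is an assumption on our part (the standard decaying rate for hinge-loss OGD in Zinkevich's setting), not a detail taken from the paper.

```python
import math

def ogd(stream, C=1.0):
    """Online gradient descent on the hinge loss, step size C/sqrt(t) (assumed).
    stream yields (x, y) with y in {-1, +1}; returns (weights, mistakes)."""
    w, mistakes = None, 0
    for t, (x, y) in enumerate(stream, start=1):
        if w is None:
            w = [0.0] * len(x)
        dot = sum(wi * xi for wi, xi in zip(w, x))
        if y * dot <= 0:
            mistakes += 1                  # sign prediction was wrong
        if 1.0 - y * dot > 0.0:            # positive hinge loss: gradient step
            eta = C / math.sqrt(t)
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w, mistakes
```

Unlike the perceptron, OGD also updates on correctly classified points with margin below 1, which explains its higher CPU time and often lower mistake rate in Tables 4 and 5.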

## References

Azoury, K., & Warmuth, M. (2001). Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3), 211–246.

Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, learning, and games. Cambridge: Cambridge University Press.

Chung, T. H. (1994). Approximate methods for sequential decision making using expert advice. In Proceedings of the Seventh Annual Conference on Computational Learning Theory (pp. 183–189). New York: ACM.

Collobert, R., Sinz, F., Weston, J., & Bottou, L. (2006a). Large scale transductive SVMs. Journal of Machine Learning Research, 7, 1687–1712.

Collobert, R., Sinz, F., Weston, J., & Bottou, L. (2006b). Trading convexity for scalability. In Proceedings of the 23rd International Conference on Machine Learning (pp. 201–208). New York: ACM.

Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., & Singer, Y. (2006). Online passive-aggressive algorithms. Journal of Machine Learning Research, 7, 551–585.

Ertekin, S., Bottou, L., & Giles, C. L. (2011). Nonconvex online support vector machines. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(2), 368–381.

Gao, X., Li, X., & Zhang, S. (2018). Online learning with non-convex losses and non-stationary regret. In A. Storkey & F. Perez-Cruz (Eds.), Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 84 (pp. 235–243).

Gasso, G., Pappaioannou, A., Spivak, M., & Bottou, L. (2011). Batch and online learning algorithms for nonconvex Neyman-Pearson classification. ACM Transactions on Intelligent Systems and Technology, 2(3), 1–19.

Gentile, C. (2002). A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2, 213–242.

Gentile, C. (2003). The robustness of the $p$-norm algorithms. Machine Learning, 53(3), 265–299.

Hazan, E. (2016). Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3–4), 157–325.

Hazan, E., Agarwal, A., & Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2–3), 169–192.

Hazan, E., & Kale, S. (2012). Online submodular minimization. Journal of Machine Learning Research, 13(1), 2903–2922.

Ho, V. T., Le Thi, H. A., & Bui, D. C. (2016). Online DC optimization for online binary linear classification. In T. N. Nguyen, B. Trawiński, H. Fujita, & T.-P. Hong (Eds.), Proceedings of ACIIDS 2016 (pp. 661–670). Berlin: Springer.

Hoi, S. C. H., Wang, J., & Zhao, P. (2014). LIBOL: A library for online learning algorithms. Journal of Machine Learning Research, 15(1), 495–499.

Kalai, A., & Vempala, S. (2005). Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3), 291–307.

Kivinen, J., & Warmuth, M. K. (1997). Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1), 1–63.

Kivinen, J., & Warmuth, M. (2001). Relative loss bounds for multidimensional regression problems. Machine Learning, 45(3), 301–329.

Krichene, W., Balandat, M., Tomlin, C., & Bayen, A. (2015). The hedge algorithm on a continuum. In F. Bach & D. Blei (Eds.), Proceedings of the 32nd International Conference on Machine Learning, Vol. 37 (pp. 824–832). PMLR.

Le Thi, H. A., Ho, V. T., & Pham Dinh, T. (2019). A unified DC programming framework and efficient DCA based approaches for large scale batch reinforcement learning. Journal of Global Optimization, 73(2), 279–310.

Le Thi, H. A., Le, H. M., Phan, D. N., & Tran, B. (2017). Stochastic DCA for the large-sum of non-convex functions problem and its application to group variable selection in classification. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70 (pp. 3394–3403).

Le Thi, H. A., & Nguyen, M. C. (2017). DCA based algorithms for feature selection in multi-class support vector machine. Annals of Operations Research, 249(1), 273–300.

Le Thi, H. A., & Pham Dinh, T. (2005). The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of Operations Research, 133(1–4), 23–48.

Le Thi, H. A., & Pham Dinh, T. (2018). DC programming and DCA: Thirty years of developments. Mathematical Programming, Special Issue: DC Programming—Theory, Algorithms and Applications, 169(1), 5–68.

Le Thi, H. A., & Phan, D. N. (2017). DC programming and DCA for sparse Fisher linear discriminant analysis. Neural Computing and Applications, 28(9), 2809–2822.

Li, Y., & Long, P. (2002). The relaxed online maximum margin algorithm. Machine Learning, 46(1–3), 361–387.

Maillard, O.-A., & Munos, R. (2010). Online learning in adversarial Lipschitz environments. In Proceedings of the 2010 European Conference on Machine Learning and Knowledge Discovery in Databases: Part II (pp. 305–320). Berlin: Springer-Verlag.

Mason, L., Baxter, J., Bartlett, P., & Frean, M. (1999). Boosting algorithms as gradient descent. In Proceedings of the 12th International Conference on Neural Information Processing Systems (pp. 512–518). Cambridge, MA: MIT Press.

Novikoff, A. B. (1963). On convergence proofs for perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, Vol. 12 (pp. 615–622). Brooklyn, NY: Polytechnic Press.

Pham Dinh, T., & Le Thi, H. A. (1997). Convex analysis approach to DC programming: Theory, algorithms and applications. Acta Mathematica Vietnamica, 22(1), 289–355.

Pham Dinh, T., & Le Thi, H. A. (1998). DC optimization algorithms for solving the trust region subproblem. SIAM Journal on Optimization, 8(2), 476–505.

Pham Dinh, T., & Le Thi, H. A. (2014). Recent advances in DC programming and DCA. In N. T. Nguyen & H. A. Le Thi (Eds.), Transactions on Computational Intelligence XIII, Vol. 8342 (pp. 1–37). Berlin: Springer.

Phan, D. N., Le, H. M., & Le Thi, H. A. (2018). Accelerated difference of convex functions algorithm and its application to sparse binary logistic regression. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (pp. 1369–1375). Menlo Park, CA: AAAI.

Phan, D. N., & Le Thi, H. A. (2019). Group variable selection via $\ell_{p,0}$ regularization and application to optimal scoring. Neural Networks, 118, 220–234.

Phan, D. N., Le Thi, H. A., & Pham Dinh, T. (2017). Sparse covariance matrix estimation by DCA-based algorithms. Neural Computation, 29(11), 3040–3077.

Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386–408.

Shalev-Shwartz, S. (2007). Online learning: Theory, algorithms, and applications. PhD diss., Hebrew University of Jerusalem.

Shalev-Shwartz, S. (2012). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2), 107–194.

Shalev-Shwartz, S., & Singer, Y. (2007). A primal-dual perspective of online learning algorithms. Machine Learning, 69(2–3), 115–142.

Shor, N. Z. (1985). Minimization methods for non-differentiable functions. Berlin: Springer-Verlag.

Valadier, M. (1969). Sous-différentiels d'une borne supérieure et d'une somme continue de fonctions convexes. C. R. Acad. Sci. Paris Sér. A–B, 268, A39–A42.

Van Der Malsburg, C. (1986). Frank Rosenblatt: Principles of neurodynamics: Perceptrons and the theory of brain mechanisms. In G. Palm & A. Aertsen (Eds.), Brain theory (pp. 245–248). Berlin: Springer.

Yang, L., Deng, L., Hajiesmaili, M. H., Tan, C., & Wong, W. S. (2018). An optimal algorithm for online non-convex learning. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2(2), 25:1–25:25.

Zhang, L., Yang, T., Jin, R., & Zhou, Z.-H. (2015). Online bandit learning for a special class of non-convex losses. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (pp. 3158–3164). Menlo Park, CA: AAAI Press.

Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (pp. 928–936). Menlo Park, CA: AAAI Press.