Abstract

We study the convergence of the online composite mirror descent algorithm, which involves a mirror map to reflect the geometry of the data and a convex objective function consisting of a loss and a regularizer possibly inducing sparsity. Our error analysis provides convergence rates in terms of properties of the strongly convex differentiable mirror map and the objective function. For a class of objective functions with Hölder continuous gradients, the convergence rates of the excess (regularized) risk under polynomially decaying step sizes have the order $O(T^{-1/2}\log T)$ after $T$ iterates. Our results improve the existing error analysis for the online composite mirror descent algorithm by avoiding averaging and removing boundedness assumptions, and they sharpen the existing convergence rates of the last iterate for online gradient descent without any boundedness assumptions. Our methodology mainly depends on a novel error decomposition in terms of an excess Bregman distance, refined analysis of self-bounding properties of the objective function, and the resulting one-step progress bounds.

1  Introduction

Gradient descent is a powerful classic method for optimization and numerical computation. To approximate a minimizer of a convex function $f$ on the Euclidean space $\mathbb{R}^d$, it defines a sequence of points $\{w_t\}_t$ iteratively by $w_{t+1} = w_t - \eta_t f'(w_t)$, where $f'(w_t)$ is a subgradient of $f$ at $w_t$ and $\eta_t > 0$ is a step size. Gradient descent is even more powerful in the era of big data and has been extended along different directions in various ways. Mirror descent is such an extension by relaxing the Hilbert space structure (Nemirovsky & Yudin, 1983; Beck & Teboulle, 2003) and allowing a Banach space norm on $\mathbb{R}^d$ such as the $p$-norm $\|\cdot\|_p$ with $1 < p \le 2$, where a mirror map is used for performing the gradient descent in the dual of the primal space.

As a first-order optimization procedure, mirror descent provides an efficient way to solve large-scale optimization problems in a Banach space by introducing a sequence of primal-dual variables $\{(w_t, \theta_t)\}_t$ to replace the sequence $\{w_t\}_t$ in the gradient descent algorithm, and it is induced by a mirror map $\Phi: \mathbb{R}^d \to \mathbb{R}$. We assume $\Phi$ to be Fréchet differentiable, meaning that at every $w \in \mathbb{R}^d$, there exists a bounded linear operator $A_w$ such that $\lim_{\|h\| \to 0} |\Phi(w + h) - \Phi(w) - A_w h| / \|h\| = 0$. Denote the operator $A_w$ as the gradient $\nabla\Phi(w)$ of $\Phi$ at $w$, which lies in the dual space. $\nabla\Phi$ is a map from the primal space to the dual space and is used to express the relationship $\theta = \nabla\Phi(w)$ for the primal-dual pair $(w, \theta)$. We also assume that $\Phi$ is $\sigma$-strongly convex with respect to the norm $\|\cdot\|$ for some $\sigma > 0$, meaning that
$$D_\Phi(w, w') := \Phi(w) - \Phi(w') - \langle \nabla\Phi(w'), w - w' \rangle \ge \frac{\sigma}{2}\|w - w'\|^2 \qquad \text{for all } w, w' \in \mathbb{R}^d,$$
where $\langle \theta, w \rangle$ denotes the dual element $\theta$ acting on the element $w$. We call $D_\Phi(w, w')$ the Bregman distance between $w$ and $w'$. Then the mirror descent algorithm applied to the minimization of a convex function $f$, with an initial point $w_1$ and a convex set $\Omega \subseteq \mathbb{R}^d$, produces a sequence of points $\{w_t\}_t$ iteratively as
$$w_{t+1} = \arg\min_{w \in \Omega} \Big\{ \langle f'(w_t), w - w_t \rangle + \frac{1}{\eta_t} D_\Phi(w, w_t) \Big\},$$
1.1
where $\{\eta_t\}_t$ is a sequence of step sizes. The strong convexity of $\Phi$ implies the invertibility of the map $\nabla\Phi$, making the point $w_{t+1}$ and the Bregman distance well defined. An important property of the mirror descent algorithm rests on its flexibility in choosing a mirror map to capture the geometry of the problem at hand, which is appealing for solving problems of high dimensions. We provide a class of specific mirror maps in example 1 below to illustrate their influence on the behavior of the algorithm.
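Before that, note that for $\Omega = \mathbb{R}^d$ the update 1.1 admits a standard two-step primal-dual reformulation (a well-known identity, stated here in our notation):
$$\theta_{t+1} = \nabla\Phi(w_t) - \eta_t f'(w_t), \qquad w_{t+1} = (\nabla\Phi)^{-1}(\theta_{t+1}).$$
For a general convex $\Omega$, the second step is followed by a Bregman projection of $(\nabla\Phi)^{-1}(\theta_{t+1})$ onto $\Omega$.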
Example 1.

Let $1 < p \le 2$ and $W = \mathbb{R}^d$ with the $p$-norm defined by $\|w\|_p = \big(\sum_{j=1}^d |w_j|^p\big)^{1/p}$ for $w = (w_1, \ldots, w_d)$. Then its dual space is $\mathbb{R}^d$ with the norm $\|\cdot\|_{p^*}$, where $p^* = \frac{p}{p-1}$. Take the $p$-norm divergence $\Phi(w) = \frac{1}{2}\|w\|_p^2$ as the mirror map. This mirror map, as shown in Ball, Carlen, and Lieb (1994), is $(p-1)$-strongly convex over $\mathbb{R}^d$ with respect to the norm $\|\cdot\|_p$. Take $\Omega = \mathbb{R}^d$. When $p = 2$, the primal and dual spaces coincide, and the mirror descent reduces to the gradient descent. When a minimizer of a convex function is sparse, the mirror descent method with the mirror map $\Phi(w) = \frac{1}{2}\|w\|_p^2$ and the specific choice $p = 1 + \frac{1}{\log d}$ yields a convergence bound with a logarithmic dependence on the dimension $d$, as proved in Duchi, Shalev-Shwartz, Singer, and Tewari (2010).
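To make the primal-dual correspondence of this example concrete, here is a minimal NumPy sketch of the link $\nabla\Phi$ and its inverse for $\Phi(w) = \frac{1}{2}\|w\|_p^2$ (the function and variable names are ours, for illustration):

```python
import numpy as np

def grad_pnorm_map(w, p):
    """Primal-to-dual link: gradient of Phi(w) = 0.5 * ||w||_p^2."""
    norm = np.linalg.norm(w, ord=p)
    if norm == 0.0:
        return np.zeros_like(w)
    # Componentwise sign(w_j) |w_j|^(p-1), rescaled by ||w||_p^(2-p).
    return np.sign(w) * np.abs(w) ** (p - 1) * norm ** (2 - p)

def grad_pnorm_map_inv(theta, p):
    """Dual-to-primal link: the conjugate of Phi is 0.5 * ||.||_{p*}^2,
    so the inverse map has the same form with p* = p / (p - 1)."""
    return grad_pnorm_map(theta, p / (p - 1))
```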

In many machine learning problems, the objective function often takes a composite form, $f + r$, with a data-fitting convex (loss) function $f$ and a regularization term $r$, which arises naturally in regularization schemes. For these composite optimization problems, the mirror descent directly applied to $f + r$, involving subgradients of $r$, would destroy some desirable effects suggested by the regularizer (Duchi & Singer, 2009), such as the sparsity promoted by the 1-norm. Instead, a variant of mirror descent, called composite mirror descent, was introduced in Lions and Mercier (1979) and Duchi et al. (2010). At the $t$th iteration, composite mirror descent updates $w_t$ by approximating the objective with, instead of its first-order approximation at $w_t$ used in the mirror descent scheme, the first-order approximation of $f$ at $w_t$ plus $r$:
$$w_{t+1} = \arg\min_{w \in \Omega} \Big\{ \langle f'(w_t), w - w_t \rangle + r(w) + \frac{1}{\eta_t} D_\Phi(w, w_t) \Big\}.$$
1.2
When the term $r$ vanishes, the composite mirror descent method coincides with the mirror descent method, equation 1.1, which can be seen from a reformulation of equation 1.2 in terms of two steps similar to equation 1.1 (Duchi et al., 2010). Another motivation to keep $r$ intact in equation 1.2 is that replacing it by its first-order approximation would slow down the convergence rate since $r$ can be nonsmooth while $f$ can be smooth. If we take the specific mirror map $\Phi(w) = \frac{1}{2}\|w\|_2^2$ and $\Omega = \mathbb{R}^d$, the composite mirror descent recovers the proximal gradient method or forward-backward splitting dating back to Lions and Mercier (1979) and Duchi and Singer (2009), where the update takes the form $w_{t+1} = \operatorname{prox}_{\eta_t r}(w_t - \eta_t f'(w_t))$ with $\operatorname{prox}_{\eta r}(v) = \arg\min_w \big\{ \frac{1}{2}\|w - v\|_2^2 + \eta r(w) \big\}$ the proximal operator. A typical choice of $f$ in machine learning is $f(w) = \frac{1}{n}\sum_{i=1}^n \ell(y_i, \langle w, x_i \rangle)$, where $\{(x_i, y_i)\}_{i=1}^n$ is a training sample and $\ell$ is a loss function used to measure the performance of the linear model $x \mapsto \langle w, x \rangle$ on the example $(x_i, y_i)$. When the sample size $n$ is large, composite mirror descent in online and stochastic settings is studied in Duchi et al. (2010), where the fixed objective function is replaced by a sequence with being either an instantaneous loss in the online setting or a stochastic estimate of the objective function in the stochastic setting.
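For the 1-norm regularizer $r(w) = \lambda\|w\|_1$, the proximal operator above is componentwise soft-thresholding; a minimal sketch (our names; `grad_f` is an assumed user-supplied gradient oracle for $f$):

```python
import numpy as np

def prox_l1(v, tau):
    """Proximal operator of tau * ||.||_1: componentwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proximal_gradient_step(w, grad_f, eta, lam):
    """One composite step for Phi = 0.5 * ||.||_2^2: a forward gradient step
    on the smooth part f, then a backward (proximal) step on r = lam * ||.||_1."""
    return prox_l1(w - eta * grad_f(w), eta * lam)
```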
In this letter, we study the online composite mirror descent algorithm with the aim of error analysis. Throughout the letter, the primal space is $\mathbb{R}^d$ with a norm $\|\cdot\|$, the dual space is $\mathbb{R}^d$ with the dual norm $\|\cdot\|_*$, and $\langle \theta, w \rangle$ denotes the action of the dual element $\theta$ on $w$. Take $\Omega$ to be $\mathbb{R}^d$. We assume a sequence of examples $z_t = (x_t, y_t)$, $t = 1, 2, \ldots$, to be independently drawn from a Borel probability measure $\rho$ defined over $Z = X \times Y$. We assume that the loss function $\ell$ is convex in the second argument and the regularizer $r$ is convex. Then the online composite mirror descent updates the sequence $\{w_t\}_t$ with $w_1 = 0$ by
$$w_{t+1} = \arg\min_{w \in \mathbb{R}^d} \Big\{ \big\langle \ell'(y_t, \langle w_t, x_t \rangle)\, x_t,\ w - w_t \big\rangle + r(w) + \frac{1}{\eta_t} D_\Phi(w, w_t) \Big\},$$
1.3
where $\ell'(y, a)$ denotes the left-side derivative of $\ell$ with respect to the second argument $a$. This strategy of processing one observation per iteration enjoys a great computational advantage when compared to the composite mirror descent, equation 1.2. For example, for the typical choice of $f$ as an empirical average over $n$ examples, evaluating one single gradient in equation 1.2 requires going through the whole data set. This gradient evaluation becomes prohibitively expensive in the big data era when faced with large amounts of data (Bach & Moulines, 2013). Below we list some examples covered in the framework of online composite mirror descent, equation 1.3.
Example 2.
If we take $\Phi(w) = \frac{1}{2}\|w\|_2^2$ and $r = 0$ in equation 1.3, the online composite mirror descent recovers the online gradient descent learning with the linear kernel,
$$w_{t+1} = w_t - \eta_t \ell'(y_t, \langle w_t, x_t \rangle)\, x_t.$$
For the least squares loss $\ell(y, a) = \frac{1}{2}(y - a)^2$, it further translates to the Kaczmarz algorithm.
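A minimal sketch of this special case (our names; we take the classical Kaczmarz relaxation $\eta_t = 1/\|x_t\|_2^2$, a choice the text does not fix):

```python
import numpy as np

def kaczmarz_step(w, x, y):
    """Online least squares update w <- w - eta_t * (<w, x> - y) x with the
    classical relaxation eta_t = 1 / ||x||_2^2, which projects w onto the
    hyperplane {v : <v, x> = y}."""
    return w - (w @ x - y) / (x @ x) * x
```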
Example 3.
If we take $\Phi(w) = \frac{1}{2}\|w\|_2^2$, the online composite mirror descent, equation 1.3, recovers the online proximal gradient descent algorithm,
$$w_{t+1} = \operatorname{prox}_{\eta_t r}\big(w_t - \eta_t \ell'(y_t, \langle w_t, x_t \rangle)\, x_t\big).$$
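The streaming structure of this update can be summarized in a short sketch (our names; `stream` is assumed to yield one fresh example per step, and `prox_r` and `eta` are user-supplied):

```python
import numpy as np

def online_proximal_gradient(stream, ell_prime, prox_r, eta, T, d):
    """Run T steps of the online update above, touching each example once.
    prox_r(v, s) is assumed to compute the proximal map of s * r at v."""
    w = np.zeros(d)                         # initial point w_1 = 0
    for t in range(1, T + 1):
        x, y = next(stream)                 # one fresh example per iteration
        g = ell_prime(y, w @ x) * x         # stochastic (sub)gradient of the loss
        w = prox_r(w - eta(t) * g, eta(t))  # forward step, then proximal step
    return w
```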
Example 4.
If we take $\Phi(w) = \frac{1}{2}\|w\|_p^2$ with $1 < p \le 2$ and $r(w) = \lambda\|w\|_1$ with $\lambda > 0$, the online composite mirror descent recovers the stochastic mirror descent algorithm made sparse (SMIDAS) proposed in Shalev-Shwartz and Tewari (2011):
$$\theta_{t+1} = S_{\eta_t\lambda}\big(\nabla\Phi(w_t) - \eta_t \ell'(y_t, \langle w_t, x_t \rangle)\, x_t\big), \qquad w_{t+1} = (\nabla\Phi)^{-1}(\theta_{t+1}),$$
where $S_\tau(\theta)_j = \operatorname{sgn}(\theta_j)\max\{|\theta_j| - \tau, 0\}$ denotes componentwise soft-thresholding.
Actually, in Duchi et al. (2010), the iterate defined above is shown to be equivalent to the composite mirror descent update, equation 1.3, with this choice of $\Phi$ and $r$. Different realizations of online composite mirror descent with the sparsity-inducing 1-norm regularizer have also been proposed and theoretically studied in Langford, Li, and Zhang (2008) and Shalev-Shwartz and Tewari (2011).
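A SMIDAS-style step can be sketched by composing the helpers from the previous sketches (`grad_pnorm_map`, `grad_pnorm_map_inv`, and `prox_l1`); the decomposition into a dual gradient step, a dual truncation, and an inverse link is our reading of the update above:

```python
def smidas_step(w, x, y, ell_prime, eta, lam, p):
    """One sketched SMIDAS step: a gradient step and a soft-thresholding
    truncation in the dual coordinates, then a return to the primal space."""
    theta = grad_pnorm_map(w, p) - eta * ell_prime(y, w @ x) * x
    theta = prox_l1(theta, eta * lam)   # dual truncation induced by the 1-norm
    return grad_pnorm_map_inv(theta, p)
```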
Our error analysis is carried out in terms of the generalization error (risk) of the linear function $x \mapsto \langle w, x \rangle$ associated with the vector $w \in \mathbb{R}^d$, defined by
$$\mathcal{E}(w) = \int_Z \ell(y, \langle w, x \rangle)\, d\rho(x, y).$$
We estimate the excess risk $\mathcal{E}(w_T) - \mathcal{E}(w^*)$ for the last iterate $w_T$ produced by equation 1.3, where $w^*$ is a vector attaining the minimal risk,
$$\mathcal{E}(w^*) = \min_{w \in \mathbb{R}^d} \mathcal{E}(w).$$
The algorithm and our analysis include three main ingredients: the loss function $\ell$, the regularizer $r$, and the mirror map $\Phi$. Our results are stated in terms of properties of the loss function and the regularizer in addition to the strong convexity of the differentiable mirror map and the boundedness of the probability measure $\rho$. To illustrate our ideas, we state learning rates, to be proved in section 4, for the case when $r$ is the (scaled) 1-norm.
Assumption 1.

We assume that the data are uniformly bounded in the sense that $\sup_{x \in X} \|x\|_* < \infty$ and $\sup_{y \in Y} |y| < \infty$.

The involved properties of the loss function $\ell$ are measured by the Hölder continuity of its (sub)gradient.

Assumption 2.
We assume that the loss function $\ell$ is convex in the second argument, and its (sub)gradient $\ell'(y, \cdot)$ is $\alpha$-Hölder continuous for some $\alpha \in [0, 1]$, meaning that there exists a constant $L > 0$ such that
$$|\ell'(y, a) - \ell'(y, b)| \le L|a - b|^\alpha \qquad \text{for all } a, b \in \mathbb{R},\ y \in Y.$$
1.4
Example 5.

In the two extreme cases $\alpha = 0$ and $\alpha = 1$, convex loss functions satisfying condition 1.4 include the hinge loss $\ell(y, a) = \max\{0, 1 - ya\}$ for classification with $y \in \{-1, 1\}$ ($\alpha = 0$), the least squares loss $\ell(y, a) = \frac{1}{2}(y - a)^2$, and the logistic loss $\ell(y, a) = \log(1 + \exp(-ya))$ (both with $\alpha = 1$). The intermediate case $\alpha \in (0, 1)$ includes the $q$-norm hinge loss $\ell(y, a) = (\max\{0, 1 - ya\})^q$ for classification (Chen, Wu, Ying, & Zhou, 2004) and the $q$th power absolute distance loss $\ell(y, a) = |y - a|^q$ for regression (Steinwart & Christmann, 2008), with $1 < q < 2$ and $\alpha = q - 1$.
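These losses can be written down directly (a sketch; the scaling conventions, e.g., whether the least squares loss carries the factor $\frac{1}{2}$, are our assumptions):

```python
import numpy as np

def hinge(y, a):                  # alpha = 0: subgradient changes are bounded
    return max(0.0, 1.0 - y * a)

def least_squares(y, a):          # alpha = 1: the gradient a - y is 1-Lipschitz
    return 0.5 * (y - a) ** 2

def logistic(y, a):               # alpha = 1: the gradient is 1/4-Lipschitz
    return np.log1p(np.exp(-y * a))

def q_norm_hinge(y, a, q=1.5):    # alpha = q - 1 for 1 < q < 2
    return max(0.0, 1.0 - y * a) ** q

def q_distance(y, a, q=1.5):      # alpha = q - 1 for 1 < q < 2
    return abs(y - a) ** q
```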

Denote by $\mathbf{1} = (1, \ldots, 1)$ the vector in $\mathbb{R}^d$ with all components being 1. A norm $\|\cdot\|$ on $\mathbb{R}^d$ is said to be monotonic if $\|x\| \le \|y\|$ whenever $x, y \in \mathbb{R}^d$ satisfy $|x_j| \le |y_j|$ for $j = 1, \ldots, d$.

Theorem 1.
Assume that the mirror map $\Phi$ is differentiable and $\sigma$-strongly convex for some $\sigma > 0$ and that the norm $\|\cdot\|$ is monotonic. Suppose that assumptions 5 and 6 hold. Consider the regularizer $r(w) = \lambda\|w\|_1$ for some $\lambda > 0$. If the step size is chosen to satisfy
formula
1.5
then we have
formula
where the constant factor is independent of $T$ or $d$, and the expectation is taken with respect to the sample $\{z_t\}_{t=1}^T$.

2  Main Results

This section presents our main results on the error analysis of the online composite mirror descent algorithm, equation 1.3, given in terms of the following properties of the regularizer $r$ in addition to those of the loss function $\ell$.

Assumption 3.
We assume that the convex regularizer $r$ satisfies $r(0) = 0$, and its (sub)gradient is $\beta$-Hölder continuous for some $\beta \in [0, 1]$, meaning that there exists a constant $\tilde{L} > 0$ such that
$$\|\nabla r(w) - \nabla r(w')\|_* \le \tilde{L}\|w - w'\|^\beta \qquad \text{for all } w, w' \in \mathbb{R}^d.$$
2.1
Example 6.

The (scaled) regularizer $r(w) = \lambda\|w\|_q^q$ with $q \in [1, 2]$ and $\lambda > 0$ satisfies condition 2.1 with $\beta = q - 1$ and the norm $\|\cdot\|_q$ (see lemma 27 in appendix C). In particular, the classical $\ell_1$-regularizer $r(w) = \lambda\|w\|_1$ satisfies equation 2.1 with $\beta = 0$ and $\tilde{L} = 2\lambda$.
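For the $\ell_1$ case, the verification is immediate (our notation): every subgradient of $\lambda\|\cdot\|_1$ has components in $[-\lambda, \lambda]$, so any two subgradients $v \in \partial r(w)$ and $v' \in \partial r(w')$ satisfy
$$\|v - v'\|_\infty \le 2\lambda = 2\lambda\,\|w - w'\|_1^{0},$$
which is condition 2.1 with $\beta = 0$, the dual norm of $\|\cdot\|_1$ being $\|\cdot\|_\infty$.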

Now we can state our main results, to be proved in section 4, on convergence rates of the excess regularized risk for the last iterate $w_T$ of equation 1.3, where $\mathcal{E}(w) + r(w)$ denotes the regularized risk of the linear function associated with $w$, with the minimizer $w_r^*$ defined by
$$w_r^* = \arg\min_{w \in \mathbb{R}^d} \big\{ \mathcal{E}(w) + r(w) \big\}.$$
Theorem 2.
Assume that the mirror map $\Phi$ is differentiable and $\sigma$-strongly convex for some $\sigma > 0$. Suppose that assumptions 5, 6, and 9 hold. Consider the step size $\eta_t = \eta t^{-\theta}$ with $\theta \in (0, 1)$ and
formula
2.2
Then we have
formula
2.3
where the constant factor, explicitly given in the proof, depends on the problem parameters but not on $T$. Specifically, if $\theta = 1/2$, we get
formula
2.4
The existing research work on the online (stochastic) composite mirror descent algorithm, equation 1.3, gives bounds on the regularized regret defined by
$$\sum_{t=1}^{T} \big( \ell(y_t, \langle w_t, x_t \rangle) + r(w_t) \big) - \min_{w \in \mathbb{R}^d} \sum_{t=1}^{T} \big( \ell(y_t, \langle w, x_t \rangle) + r(w) \big),$$
or the closely related excess regularized risk for the average of the iterates (Cesa-Bianchi, Conconi, & Gentile, 2004; Duchi, Shalev-Shwartz, Singer, & Tewari, 2010; Langford, Li, & Zhang, 2008; Shalev-Shwartz & Tewari, 2011; Duchi & Singer, 2009; Srebro, Sridharan, & Tewari, 2011). However, as shown in Rosasco, Villa, and Vũ (2014), Shamir and Zhang (2013), and Rakhlin, Shamir, and Sridharan (2012), averaging can have a detrimental effect in the sense that it can slow the convergence rates when the objective function is strongly convex or destroy the sparsity of the solution, which is often crucial for proper interpretations in many applications. Instead of studying regret bounds or the associated convergence rates for the average of iterates, we consider here the more challenging problem of the convergence of the last iterate, which would imply the convergence of the averaging scheme and would not destroy the sparsity. Our main results show that the excess regularized risk enjoys the convergence rate $O(T^{-1/2}\log T)$ with the step size $\eta_t \asymp t^{-1/2}$, matching (up to a logarithmic factor) the minimax rate of order $O(T^{-1/2})$ for stochastic approximation in the nonstrongly convex case (Agarwal, Bartlett, Ravikumar, & Wainwright, 2012). These results are established for a general class of objective functions with Hölder continuous (sub)gradients, including Lipschitz objective functions and smooth objective functions.

Furthermore, our analysis does not need any boundedness assumption on the iterates or on the (sub)gradients, as imposed in the literature (Duchi et al., 2010; Shamir & Zhang, 2013). For example, stochastic projected gradient descent is studied in Shamir and Zhang (2013) for nonsmooth optimization, which gives the convergence rate $O(T^{-1/2}\log T)$ for the last iterate. But their discussion requires the existence of a constant bounding the subgradients at all points of the projection domain $\Omega$, which holds only when $\Omega$ is compact, and thereby their algorithm requires an additional projection onto $\Omega$ per iteration. More recently, convergence of the last iterate for stochastic proximal gradient algorithms with $\Phi(w) = \frac{1}{2}\|w\|_2^2$ was studied in Rosasco et al. (2014), presenting a nonasymptotic bound in expectation in the strongly convex case and almost sure convergence in the general case, but their discussion still needs the assumption of the existence of a sequence and a constant satisfying for all , where and .

In deriving the almost optimal convergence rates, we also get the following convergence rate for , to be proved in appendix C.

Corollary 1.
Under the conditions of theorem 11 with $\theta = 1/2$ and the step size satisfying equation 2.2, we have
formula

To demonstrate our main results stated in theorem 11, we present explicit learning rates for some special cases in the following sections. It would be interesting to extend our results to nonconvex loss functions, including those from the minimum error entropy principle (Hu, Fan, Wu, & Zhou, 2015).

2.1.  Online Gradient Descent Learning

The first special case corresponds to $\Phi(w) = \frac{1}{2}\|w\|_2^2$ and $r = 0$. In this case, the online composite mirror descent algorithm, equation 1.3, recovers the unregularized online gradient descent algorithms for regression and classification by selecting concrete loss functions such as the $q$-norm hinge loss, the logistic loss, and the $q$th power absolute distance loss.

Convergence for the last iterate has been extensively studied for the online gradient descent algorithm in reproducing kernel Hilbert spaces in Smale and Zhou (2009), Ying and Zhou (2006, 2016), and Tarres and Yao (2014), where the regularizer is approximated by its first-order approximation when updating $w_t$. The unregularized least squares online gradient descent algorithm in reproducing kernel Hilbert spaces is studied in Ying and Pontil (2008), and convergence rates are derived there. For a class of loss functions with $\alpha$-Hölder continuous gradients with $\alpha \in (0, 1]$, unregularized online gradient descent learning is considered in Ying and Zhou (2017), which establishes the convergence rate
formula
with a polynomially decaying step size. This convergence rate can at most attain the order $O(T^{-1/2})$, reached when the loss function is smooth. Theorem 11 immediately implies the following convergence rate of the excess risk for unregularized online gradient descent algorithms. It is a significant improvement and thereby solves the open question of whether the rate without the boundedness assumption can be improved for the unregularized online gradient descent algorithm applied to general loss functions (Ying & Zhou, 2017).
Corollary 2.

Consider the mirror map $\Phi(w) = \frac{1}{2}\|w\|_2^2$ and the regularizer $r = 0$. Suppose assumptions 5 and 6 hold. For the step size satisfying equation 2.2 with $\theta = 1/2$, we have $\mathbb{E}[\mathcal{E}(w_T)] - \mathcal{E}(w^*) = O(T^{-1/2}\log T)$.

2.2.  Online Learning with Sparsity-Inducing Regularizer

The second special case is given by $\Phi(w) = \frac{1}{2}\|w\|_p^2$ with $1 < p \le 2$ and $r(w) = \lambda\|w\|_1$. In this case, the online composite mirror descent algorithm 1.3 recovers the SMIDAS proposed in Shalev-Shwartz and Tewari (2011), whose convergence follows as a direct corollary of theorem 11 by noting the $(p-1)$-strong convexity of $\Phi$ with regard to $\|\cdot\|_p$. Note that the dual norm of $\|\cdot\|_p$ is $\|\cdot\|_{p^*}$ with $p^* = \frac{p}{p-1}$.

Corollary 3.
Consider the mirror map $\Phi(w) = \frac{1}{2}\|w\|_p^2$ with $1 < p \le 2$ and the regularizer $r(w) = \lambda\|w\|_1$ with $\lambda > 0$. Suppose assumptions 5 and 6 hold. Then for the step size satisfying equation 1.5, we have
formula
2.5
Remark 1.
Consider the case . If we choose and
formula
with depending only on and , the constant hidden in the big-$O$ notation in equation 2.5 takes the form
formula
2.6
where is a constant depending only on , and . For the choice , the constant defined by equation 2.6 satisfies (note that for )
formula
For the choice , the constant in equation 2.6 translates to
formula
Therefore, for learning problems where the features are dense and the target vector is very sparse, the online composite mirror descent with $p$ close to 1 would enjoy a faster convergence rate compared to that for $p = 2$, especially in high-dimensional problems (Duchi et al., 2010).

2.3.  Online Smoothed Linearized Bregman Iteration

The last special case corresponds to the least squares loss $\ell(y, a) = \frac{1}{2}(y - a)^2$, $r = 0$, and the mirror map $\Phi_\varepsilon$, with a parameter $\varepsilon > 0$, defined by $\Phi_\varepsilon(w) = \frac{1}{2}\|w\|_2^2 + \lambda \sum_{j=1}^d h_\varepsilon(w_j)$, where $h_\varepsilon$ is a componentwise regularizer for robustness smoothing the 1-norm, given by
$$h_\varepsilon(u) = \begin{cases} \dfrac{u^2}{2\varepsilon}, & |u| \le \varepsilon, \\[1mm] |u| - \dfrac{\varepsilon}{2}, & |u| > \varepsilon. \end{cases}$$
In this case, with $\theta_t = \nabla\Phi_\varepsilon(w_t)$, the online composite mirror descent algorithm 1.3 can be reformulated as
$$\theta_{t+1} = \theta_t - \eta_t \big( \langle w_t, x_t \rangle - y_t \big) x_t, \qquad w_{t+1} = (\nabla\Phi_\varepsilon)^{-1}(\theta_{t+1}),$$
2.7
This is the online version of the linearized Bregman iteration (Cai, Osher, & Shen, 2009) modified by smoothing the 1-norm in an $\varepsilon$-neighborhood of the origin: the online version of the original linearized Bregman iteration proposed in Yin, Osher, Goldfarb, and Darbon (2008) corresponds to equation 2.7 with $\varepsilon = 0$, in which case $(\nabla\Phi_0)^{-1}$ is the componentwise soft-thresholding operator. The convergence of the iterates 2.7 is established in corollary 4 below, a direct corollary of theorem 11.
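A minimal sketch of iteration 2.7 under these choices (we assume the Huber-type smoothing $h_\varepsilon$ above, under which the inverse map $(\nabla\Phi_\varepsilon)^{-1}$ is the smoothed shrink below; the names are ours):

```python
import numpy as np

def shrink_eps(theta, lam, eps):
    """Componentwise inverse of u -> u + lam * h_eps'(u); as eps -> 0 it
    reduces to the soft-thresholding operator of the scaled 1-norm."""
    return np.where(np.abs(theta) <= lam + eps,
                    theta * eps / (lam + eps),
                    theta - lam * np.sign(theta))

def online_lin_bregman_step(theta, x, y, eta, lam, eps):
    """One step of iteration 2.7 for the least squares loss."""
    w = shrink_eps(theta, lam, eps)       # current primal iterate w_t
    return theta - eta * (w @ x - y) * x  # gradient step on the dual variable
```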
Corollary 4.

Let $\ell(y, a) = \frac{1}{2}(y - a)^2$, $r = 0$, and $\Phi = \Phi_\varepsilon$ with $\varepsilon > 0$ and $\lambda > 0$. Under assumption 5, with the step size satisfying equation 2.2 with $\theta = 1/2$, we have $\mathbb{E}[\mathcal{E}(w_T)] - \mathcal{E}(w^*) = O(T^{-1/2}\log T)$.

It would be interesting to extend the above result to the convergence of the original online linearized Bregman iteration without smoothing.

3  Ideas and Novelty in the Analysis

This section outlines the ideas and novelty in the proof of our main results. Our first novel point is a one-step progress bound established in equation 3.1, to be proved in the next section, showing that the excess regularized error can be controlled by the excess Bregman distance plus a term involving the regularized risk of the current iterate. Here the conditional expectation is taken given the $\sigma$-algebra generated by the examples observed so far. A notable property of the one-step progress bound, equation 3.1, is that it involves the regularized error rather than the dual norm of gradients encountered during the iterations, whose "boundedness" in expectation is established in equation 3.2. This boundedness allows us to avoid assumptions on the boundedness of gradients imposed in the literature (Shamir & Zhang, 2013; Duchi et al., 2010) and demonstrates the novelty of our analysis.

Lemma 1.
Under assumptions 5, 6, and 9, the sequence generated by equation 1.3 satisfies
formula
3.1
where and are two constants independent of or (explicitly given in the proof). If we take the step size satisfying equation 2.2 with , then for any , we have
formula
3.2
where is a constant independent of (explicitly given in the proof).

Our second novel point is to derive error bounds and convergence rates for the last iterate from the one-step progress measured by the Bregman distance in lemma 18. It refines the recent error decomposition method for gradient descent schemes in Lin, Rosasco, and Zhou (2016), Lin, Rosasco, Villa, and Zhou (2015), and Shamir and Zhang (2013), reformulating the last-iterate error as a summation of weighted average errors and moving weighted average errors (see equation B.1), and is proved in appendix B.

Lemma 2.
Let be a nonincreasing sequence. Let be a sequence of random variables such that is measurable with respect to . If
formula
3.3
for every , then we have
formula
3.4

Our last novel point is to get the boundedness stated in equation 3.2 by applying lemma 18 to the following one-step progress bound in terms of the excess Bregman distance and the dual norms of gradients, which can be controlled in terms of step sizes (see lemma 24). Lemma 19 improves lemma 17 in Duchi et al. (2010) in our situation. Unlike lemma 17 in Duchi et al. (2010), equation 3.6 in lemma 19 involves the excess Bregman distance in the associated one-step progress bound, which matches the form of equation 3.3 in lemma 18 and is thereby crucial for applying lemma 18 to get equation 3.2. As a comparison, lemma 17 in Duchi et al. (2010) could not yield a one-step progress bound of the form 3.3. The proof of lemma 19 is given in appendix A.

Lemma 3.
For any , the sequence generated by equation 1.3 satisfies
formula
3.5
and
formula
3.6
Remark 2.

It should be emphasized that a single application of lemma 18 with the one-step progress bound given in equation 3.5 can only yield a slower convergence rate. For the specific case $\Phi(w) = \frac{1}{2}\|w\|_2^2$ and $r = 0$, this convergence rate matches the rate established in Ying and Zhou (2017) within a logarithmic factor. The way we achieve the improvement to $O(T^{-1/2}\log T)$ rests on the following key observation due to a self-bounding property (see lemmas 21 and 22): although the iterates can only be shown to lie in a ball with an asymptotically diverging radius (see lemma 24), the expected norm of the associated gradient is always bounded since it is dominated by the regularized risk.

4  Proving Main Results

This section presents the proof of theorem 11, which yields the conclusion of theorem 8. Our proof consists of two parts. The first part applies lemma 18 and the one-step progress bound, equation 3.5, to establish a crude bound on the regularized risk, equation 3.2, based on which the second part applies lemma 18 and the one-step progress bound, equation 3.1, to derive the convergence rate, equation 2.3, for the last iterate of the online composite mirror descent.

We first provide some technical lemmas and inequalities used throughout the proof. It is clear that loss functions satisfying assumption 6 always enjoy the following growth behavior:
formula
4.1
Also, the regularizer satisfying assumption 9 meets the following growth condition,
formula
4.2
where we have used a fact that follows from the convexity of $r$ and the condition $r(0) = 0$. For , denote and if and if . Denote and .

The following two lemmas establish the self-bounding property for functions with Hölder continuous gradients (Srebro, Sridharan, & Tewari, 2010; Ying & Zhou, 2017), meaning that the gradients can be controlled by the function values. This self-bounding property allows us to transfer the one-step progress bound, equation 3.5, in terms of gradients to the one-step progress bound, equation 3.1, in terms of the regularized risk, and it is essential for us to avoid the boundedness assumptions imposed in the literature (Shamir & Zhang, 2013; Duchi et al., 2010). Lemma 21 can be found in Ying and Zhou (2017), while lemma 22 will be proved as a consequence of lemma 26 in appendix C.

Lemma 4.
If the nonnegative loss function satisfies equation 1.4 with , then for , we have
formula
4.3
Lemma 5.
If the gradient of the nonnegative regularizer satisfies equation 2.1 with , then for , we have
formula
4.4
Lemma 6.
For any , we have the following inequalities:
formula
4.5
and
formula
4.6

To apply lemmas 18 and 19, we need to estimate the growth behavior of the iterates and their gradients. This is achieved in lemma 24 by showing that the iterates always lie inside a ball, under the Bregman divergence, with a controllable radius. The proof of lemma 24 is given in appendix D.

Lemma 7.
Suppose that assumptions 5, 6, and 9 hold. If the step sizes satisfy equation 2.2, then the sequence generated by equation 1.3 satisfies
formula
4.7
and
formula
4.8
where and are constants given by
formula

We are now in a position to prove lemma 17. The proof of equation 3.1 requires the one-step progress bound, equation 3.5, and the self-bounding property established in lemmas 21 and 22, while the proof of equation 3.2 requires applying lemma 18 with the one-step progress bound, equation 3.5, coupled with the bounds on the gradients established in lemma 24.

Proof of Lemma 1.

We first use the self-bounding property established in lemmas 21 and 22 to control .

For the case , by and equation 4.3, the term can be controlled as
formula
where we have used Young’s inequality,
$$ab \le \frac{a^u}{u} + \frac{b^v}{v} \qquad \text{for all } a, b \ge 0 \text{ and } u, v > 1 \text{ with } \frac{1}{u} + \frac{1}{v} = 1.$$
4.9
This inequality holds obviously when . Moreover, according to equation 4.1, we have
formula
For the case , by and equation 4.4, the term can be controlled similarly by
formula
The above inequality holds obviously when . From equation 4.2, we also have that if .
Putting the above discussions together, we derive the following inequality:
formula
Plugging this inequality into equation 3.5 yields the following one-step progress bound for the online composite mirror descent,
formula
where the constants and are given explicitly as
formula
This proves the first desired estimate, equation 3.1.
We turn to the second desired estimate, equation 3.2. Plugging equation 4.8 into equation 3.5 immediately yields that
formula
which implies equation 3.3 with . Therefore, we can apply lemma 18 to obtain
formula
4.10
According to lemma 23, the definition of and the step size with , we have
formula
Furthermore, it follows from equation 4.5 that
formula
Plugging the above two bounds into equation 4.10, we see
formula
where we observe that if and only if . Note that can be equivalently written as .

When , we have .

When , then and the elementary inequality
formula
4.11
imply that
formula

When , is bounded by 1.

Combining the above three cases together, we know that
formula
The above inequality verifies the desired estimate, equation 3.2, with the constant given by
formula
where
formula

We are in a position to prove our main results.

Proof of Theorem 2.

We prove our conclusion in two cases according to different values of .

If , then equation 3.2 implies . According to equation 3.1, we can apply lemma 18 with and use the inequality to obtain
formula
where we have used equation 4.6 in the last inequality.
If , then equation 3.2 implies
formula
Analyzing analogously to the case yields
formula
where we have used the fact that followed from , and the last step uses the following inequalities due to equation 4.11:
formula
The following inequality follows for any :
formula
Analyzing analogously to the case by applying lemma 18 and equation 3.1 with bounded above and noting yields
formula
Combining the discussions of the two cases, we get
formula
4.12
where is the constant defined by
formula
This verifies the desired error bound, equation 2.3, with the constant
formula
Proof of Theorem 1.
According to the definition of we have
formula
4.13
For any , the monotonic property of implies and therefore satisfies equation 2.1 with and . It then follows from theorem 11 and equation 4.13 that
formula
4.14
where is a constant depending on . Equation 4.13, together with the inequality due to the definition of , implies and then . Furthermore, equation 4.13 and the assumption imply
formula
That is, both and can be upper-bounded by constants independent of or . Therefore, the constant in equation 4.14 is independent of or .

Appendix A:  Proof of Lemma 19

The first-order optimality condition for the minimization problem, equation 1.3, implies the existence of an such that
formula
Combining this with
formula
yields
formula
This, together with the convexity of and , and the -strong convexity of , gives
formula
A.1
The convexity of yields
formula
This, together with the elementary inequality , gives
formula
Plugging this estimate into equation A.1 gives
formula
This establishes equation 3.6. A reformulation, followed by taking conditional expectations on both sides (noting the measurability of the iterate with respect to the past), yields
formula

Appendix B:  Proof of Lemma 18

We use ideas from Lin and Zhou (2015) and Lin et al. (2015, 2016) to prove lemma 18.

Proof of Lemma 2.

We proceed with the proof in four steps.

Step 1: Error decomposition. The following identity (Shamir & Zhang, 2013; Lin et al., 2016) holds for any sequence ,
formula
Applying this to yields
formula
from which we derive
formula
The definition of implies , which, coupled with the fact that is nonincreasing, guarantees the nonpositivity of the last term in the above inequality and thereby implies
formula
B.1
The first and second terms on the right-hand side of the above inequality are called the weighted average errors and moving weighted average errors, respectively.
Step 2: Controlling weighted average errors. Applying the assumption in equation 3.3 and taking expectations over the remaining random variables implies
formula
Step 3: Controlling moving weighted average errors. Applying equation 3.3 (noting that the reference point there is measurable with respect to the required $\sigma$-algebra), followed by taking expectations over the remaining random variables, implies
formula
Step 4: Combining the above results. Plugging the error bounds in steps 2 and 3 into equation B.1 yields
formula
where the last inequality uses the following identity:
formula

Appendix C:  Proving Self-Bounding Properties

The following lemma is an extension of proposition 1 in Ying and Zhou (2017), which considers the case .

Lemma 8.
Let be a differentiable function. Suppose that there exist constants and such that
formula
C.1
Then, for any we have
formula
C.2
Proof.

Let be any real number. It suffices to consider the case . We proceed with the discussion by considering two cases according to the value of .

If , we take . Then . According to the mean value theorem, there exists between and such that . Therefore, by equation C.1 and the condition ,
formula
Hence,
formula
If , we take . Then . Analyzing analogously to the first case, we get
formula
Hence,
formula

Combining the above discussion together yields the inequality C.2.

Lemma 26 with the case was considered in Srebro et al. (2010).

Lemma 9.
Let be differentiable. Suppose that there exist constants and such that
formula
C.3
Then we have
formula
Proof.
Fix . For any such that , we define a function by
formula
It can be directly checked that
formula
and by equation C.3
formula
where in the last inequality we have used the inequality . So the function satisfies condition C.1, and thereby lemma 25 can be applied here to obtain
formula
from which it immediately follows that
formula
Proof of Corollary 1.
Define a map by . The definition of implies . For any , we have
formula
from which it follows that
formula
Applying equations 1.4 and 2.1 yields
formula
where . So condition C.3 is satisfied, and we can apply lemma 26 to get
formula
Setting and taking expectations on both sides, we find
formula
where we have used Jensen's inequality. Applying equation 2.4 of theorem 11 gives
formula
and thereby
formula

The following lemma provides a class of regularizers satisfying condition 2.1. For $u \in \mathbb{R}$, denote by $\operatorname{sgn}(u)$ the sign of $u$; that is, $\operatorname{sgn}(u) = 1$ if $u > 0$, $\operatorname{sgn}(u) = -1$ if $u < 0$, and $\operatorname{sgn}(u) = 0$ if $u = 0$.

Lemma 10.
The function $r(w) = \|w\|_q^q$ with $q \in [1, 2]$ defined on $\mathbb{R}^d$ satisfies
$$\|\nabla r(w) - \nabla r(w')\|_{q^*} \le q\, 2^{2-q}\, \|w - w'\|_q^{q-1} \qquad \text{for all } w, w' \in \mathbb{R}^d,$$
where $q^* = \frac{q}{q-1}$ is the conjugate exponent of $q$.
Proof.
If $q = 1$, then for any $w \in \mathbb{R}^d$, the components of the associated subgradient lie in $[-1, 1]$, from which we immediately derive
formula
If $q \in (1, 2]$, then the gradient of $r$ at $w$ can be calculated componentwise by $(\nabla r(w))_j = q \operatorname{sgn}(w_j)|w_j|^{q-1}$, from which we have
formula
where we use the following inequality stated in Lei, Ding, and Zhang (2015)
$$\big| |a|^{s}\operatorname{sgn}(a) - |b|^{s}\operatorname{sgn}(b) \big| \le 2^{1-s}|a - b|^{s} \qquad \text{for all } a, b \in \mathbb{R},\ s \in (0, 1].$$

Appendix D:  Proof of Lemma 24

We first prove equation 4.7. Applying equation 3.6 of lemma 19 with shows
formula
D.1
We now tackle the two terms on the right-hand side separately. We perform the deduction in three steps.

Step 1. We first bound according to different values of .

If , applying lemma 21 shows that
formula
where in the last step, we have used equation 2.2.
If , applying lemma 21 and Young’s inequality, equation 4.9, implies
formula
Since , this is bounded by
formula
where in the last step, we have used equation 2.2 and .
If , then from equation 4.1, we have
formula
where the last inequality follows from the assumption .
Combining the above discussions together, we have that for any ,
formula
D.2

Step 2. We now bound in three cases according to the value of .

If , from lemma 22 and equation 2.2, we have
formula
If , lemma 22, and Young’s inequality imply
formula
where in the last step we have used equation 2.2.
If , assumption implies
formula
According to the above deductions, we derive for any ,
formula
D.3
Step 3. Plugging equations D.2 and D.3 back into D.1 we get
formula
Taking a summation from to yields equation 4.7 as
formula
We then prove equation 4.8. The $\sigma$-strong convexity of $\Phi$, coupled with the inequality given by equation 4.7, implies
formula
from which we have
formula
D.4
and by equation 4.1,
formula
D.5
Also, it follows from the growth conditions 4.2 and D.4 that
formula
D.6
Combining equations D.5 and D.6 yields
formula
This proves equation 4.8.

Appendix E:  Proof of Lemma 23

Inequality 4.5 is obvious. Inequality 4.6 is a slight modification of lemma 2.6 in Lin et al. (2015).

Proof of Lemma 6.
We prove only equation 4.6 here. We split the sum into two parts as follows (we denote by $\lfloor u \rfloor$ the largest integer not larger than $u$):
formula
If , we have
formula
If , we have
formula
If , we have
formula
where we have used equation 4.11 in the second inequality.

The above bounds together can be written as equation 4.6.

Acknowledgments

We are grateful to the anonymous referees for their constructive comments. The work described in this letter is partially supported by the NSFC/RGC Joint Research Scheme (RGC Project No. N_CityU120/14 and NSFC Project No. 11461161006).

References

Agarwal, A., Bartlett, P. L., Ravikumar, P., & Wainwright, M. J. (2012). Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5), 3235–3249.

Bach, F., & Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 26 (pp. 773–781). Red Hook, NY: Curran.

Ball, K., Carlen, E. A., & Lieb, E. H. (1994). Sharp uniform convexity and smoothness inequalities for trace norms. Inventiones Mathematicae, 115(1), 463–482.

Beck, A., & Teboulle, M. (2003). Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3), 167–175.

Cai, J.-F., Osher, S., & Shen, Z. (2009). Linearized Bregman iterations for compressed sensing. Mathematics of Computation, 78(267), 1515–1536.

Cesa-Bianchi, N., Conconi, A., & Gentile, C. (2004). On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9), 2050–2057.

Chen, D.-R., Wu, Q., Ying, Y., & Zhou, D.-X. (2004). Support vector machine soft margin classifiers: Error analysis. Journal of Machine Learning Research, 5, 1143–1175.

Duchi, J. C., Shalev-Shwartz, S., Singer, Y., & Tewari, A. (2010). Composite objective mirror descent. In Proceedings of the 23rd Annual Conference on Learning Theory (pp. 14–26). Madison, WI: Omnipress.

Duchi, J., & Singer, Y. (2009). Efficient online and batch learning using forward backward splitting. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in neural information processing systems, 22 (pp. 495–503). Cambridge, MA: MIT Press.

Hu, T., Fan, J., Wu, Q., & Zhou, D.-X. (2015). Regularization schemes for minimum error entropy principle. Analysis and Applications, 13(4), 437–455.

Langford, J., Li, L., & Zhang, T. (2008). Sparse online learning via truncated gradient. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems, 21 (pp. 905–912). Cambridge, MA: MIT Press.

Lei, Y., Ding, L., & Zhang, W. (2015). Generalization performance of radial basis function networks. IEEE Transactions on Neural Networks and Learning Systems, 26(3), 551–564.

Lin, J., Rosasco, L., Villa, S., & Zhou, D.-X. (2015). Modified Fejér sequences and applications. arXiv:1510.04641.

Lin, J., Rosasco, L., & Zhou, D.-X. (2016). Iterative regularization for learning with convex loss functions. Journal of Machine Learning Research, 17(77), 1–38.

Lin, J., & Zhou, D.-X. (2015). Learning theory of randomized Kaczmarz algorithm. Journal of Machine Learning Research, 16, 3341–3365.

Lions, P.-L., & Mercier, B. (1979). Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6), 964–979.

Nemirovsky, A.-S., & Yudin, D.-B. (1983). Problem complexity and method efficiency in optimization. New York: Wiley.

Rakhlin, A., Shamir, O., & Sridharan, K. (2012). Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning (pp. 449–456).

Rosasco, L., Villa, S., & Vũ, B. C. (2014). Convergence of stochastic proximal gradient algorithm. arXiv:1403.5074.

Shalev-Shwartz, S., & Tewari, A. (2011). Stochastic methods for ℓ1-regularized loss minimization. Journal of Machine Learning Research, 12, 1865–1892.