## Abstract

We study the convergence of the online composite mirror descent algorithm, which involves a mirror map to reflect the geometry of the data and a convex objective function consisting of a loss and a regularizer possibly inducing sparsity. Our error analysis provides convergence rates in terms of properties of the strongly convex differentiable mirror map and the objective function. For a class of objective functions with Hölder continuous gradients, we derive explicit convergence rates of the excess (regularized) risk under polynomially decaying step sizes, stated in terms of the number of iterates. Our results improve the existing error analysis for the online composite mirror descent algorithm by avoiding averaging and removing boundedness assumptions, and they sharpen the existing convergence rates of the last iterate for online gradient descent without any boundedness assumptions. Our methodology mainly depends on a novel error decomposition in terms of an excess Bregman distance, a refined analysis of self-bounding properties of the objective function, and the resulting one-step progress bounds.

## 1 Introduction

Gradient descent is a powerful classic method for optimization and numerical computation. To approximate a minimizer of a convex function $f$ on a Euclidean space, it defines a sequence of points iteratively by $w_{t+1} = w_t - \eta_t f'(w_t)$, where $f'(w_t)$ is a subgradient of $f$ at $w_t$ and $\eta_t > 0$ is a step size. Gradient descent is even more powerful in the era of big data and has been extended in various directions. Mirror descent is such an extension, relaxing the Hilbert space structure (Nemirovsky & Yudin, 1983; Beck & Teboulle, 2003) and allowing a Banach space norm such as an $\ell_p$-norm, where a mirror map is used for performing the gradient descent in the dual of the primal space.
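As a concrete reference point for the iteration just described, here is a minimal Python sketch of (sub)gradient descent; the function names and the quadratic example are illustrative assumptions, not taken from the letter.

```python
import numpy as np

def subgradient_descent(subgrad, w0, step, T):
    """Plain (sub)gradient descent: w_{t+1} = w_t - eta_t * g_t.

    subgrad(w) returns a subgradient of the convex objective at w;
    step(t) returns the step size eta_t (t is 1-indexed).
    """
    w = np.asarray(w0, dtype=float)
    for t in range(1, T + 1):
        w = w - step(t) * subgrad(w)
    return w

# Minimize f(w) = 0.5 * ||w - c||^2, whose gradient at w is w - c.
c = np.array([1.0, -2.0, 3.0])
w_hat = subgradient_descent(lambda w: w - c, np.zeros(3),
                            lambda t: 1.0 / (t + 1), 200)
```

With these decaying step sizes the iterate contracts toward the minimizer `c` at every step, so the last iterate is already close to `c` without any averaging.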

Let $d \in \mathbb{N}$ and $1 < p \le 2$, with the $\ell_p$-norm on $\mathbb{R}^d$ defined by $\|w\|_p = \bigl(\sum_{j=1}^d |w_j|^p\bigr)^{1/p}$ for $w = (w_1, \ldots, w_d)$. Then its dual space is $(\mathbb{R}^d, \|\cdot\|_q)$ with $1/p + 1/q = 1$. Take the $\ell_p$-norm divergence $\Psi(w) = \frac{1}{2}\|w\|_p^2$ as the mirror map. This mirror map, as shown in Ball, Carlen, and Lieb (1994), is $(p-1)$-strongly convex over $\mathbb{R}^d$ with respect to the norm $\|\cdot\|_p$. When $p = 2$, the primal and dual spaces coincide, and the mirror descent reduces to the gradient descent. When a minimizer of a convex function is sparse, the mirror descent method with this mirror map and a specific choice of $p$ close to 1 (depending logarithmically on $d$) yields a convergence bound with a logarithmic dependence on the dimension, as proved in Duchi, Shalev-Shwartz, Singer, and Tewari (2010).
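The mirror map and its conjugate act as a pair of link functions between the primal and dual spaces. The sketch below assumes the mirror map $\frac{1}{2}\|w\|_p^2$ discussed above; the function names are mine.

```python
import numpy as np

def grad_psi(w, p):
    """Gradient of the mirror map Psi(w) = 0.5 * ||w||_p^2 (primal -> dual)."""
    norm = np.linalg.norm(w, p)
    if norm == 0.0:
        return np.zeros_like(w)
    return np.sign(w) * np.abs(w) ** (p - 1) * norm ** (2 - p)

def grad_psi_star(theta, p):
    """Gradient of the conjugate 0.5 * ||theta||_q^2 with 1/p + 1/q = 1
    (dual -> primal); it inverts grad_psi."""
    q = p / (p - 1)
    return grad_psi(theta, q)
```

A mirror descent step then reads `w_next = grad_psi_star(grad_psi(w, p) - eta * g, p)`; for `p = 2` both maps are the identity and the step reduces to plain gradient descent.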

We assume that the input data are uniformly bounded.

The involved properties of the loss function are measured by the Hölder continuity of its gradient.

In the two extreme cases $\alpha = 0$ and $\alpha = 1$, convex loss functions satisfying condition 1.4 include the hinge loss $\max\{0, 1 - y w^\top x\}$ for classification with $\alpha = 0$, the least square loss $(y - w^\top x)^2$, and the logistic loss $\log(1 + \exp(-y w^\top x))$ with $\alpha = 1$. The intermediate case $0 < \alpha < 1$ includes the $q$-norm hinge loss $(\max\{0, 1 - y w^\top x\})^q$ for classification (Chen, Wu, Ying, & Zhou, 2004) and the $q$th power absolute distance loss $|y - w^\top x|^q$ for regression (Steinwart & Christmann, 2008) with $1 < q < 2$ and $\alpha = q - 1$.
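For illustration, the losses just listed can be written down directly as functions of the margin $m = y\,w^\top x$ or the residual $r = y - w^\top x$; the function names and the choice $q = 1.5$ are mine, and $\alpha$ refers to the Hölder exponent of the (sub)gradient.

```python
import numpy as np

# Margin-based losses phi(m), m = y * <w, x>.
def hinge(m):                     # alpha = 0: only bounded subgradients
    return max(0.0, 1.0 - m)

def logistic(m):                  # alpha = 1: Lipschitz-continuous gradient
    return np.log1p(np.exp(-m))

def q_norm_hinge(m, q=1.5):       # alpha = q - 1, with 1 < q < 2
    return max(0.0, 1.0 - m) ** q

# Distance-based loss for regression, r = y - <w, x>.
def q_power_distance(r, q=1.5):   # alpha = q - 1
    return abs(r) ** q
```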

Denote by $\mathbf{1}$ the vector in $\mathbb{R}^d$ with all components being 1. A norm $\|\cdot\|$ on $\mathbb{R}^d$ is said to be monotonic if $\|u\| \le \|v\|$ whenever $u, v \in \mathbb{R}^d$ satisfy $|u_j| \le |v_j|$ for $j = 1, \ldots, d$.
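A quick numerical illustration of this monotonicity for the $\ell_p$-norms (a sanity check only; the helper name is mine):

```python
import numpy as np

def is_dominated(u, v):
    """Componentwise domination: |u_j| <= |v_j| for every coordinate j."""
    return bool(np.all(np.abs(u) <= np.abs(v)))

# The ell_p norms are monotonic: domination implies a smaller norm.
u = np.array([0.5, -1.0, 0.0])
v = np.array([1.0, 1.5, -2.0])
assert is_dominated(u, v)
for p in (1, 1.5, 2, np.inf):
    assert np.linalg.norm(u, p) <= np.linalg.norm(v, p)
```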

Suppose assumptions 5 and 6 hold. Consider the regularizer introduced above with an admissible parameter. If the step size decays polynomially, then we have the stated convergence rate, where the constant is independent of the number of iterates (explicitly given in the proof), and the expectation is taken with respect to the sample.

## 2 Main Results

This section presents our main results on the error analysis of the online composite mirror descent algorithm, equation 1.3, given in terms of the following properties of the regularizer in addition to those of the loss function.

Furthermore, our analysis does not need any boundedness assumption on the iterates or the gradients, as imposed in the literature (Duchi et al., 2010; Shamir & Zhang, 2013). For example, stochastic projected gradient descent is studied in Shamir and Zhang (2013) for nonsmooth optimization, which gives a convergence rate for the last iterate. Their discussion, however, requires the existence of a uniform bound on the gradients over the projected domain, which holds only when the domain is compact, and thereby their algorithm requires an additional projection per iteration. More recently, convergence of the last iterate for stochastic proximal gradient algorithms was studied in Rosasco et al. (2014), presenting a nonasymptotic bound in expectation in the strongly convex case and almost sure convergence in the general case, but their discussion still needs a uniform boundedness assumption on the gradients along the iterates.

In deriving the almost optimal convergence rates, we also get the following convergence rate, to be proved in appendix C.

To demonstrate our main results stated in theorem 11, we present explicit learning rates for some special cases in the following subsections. It would be interesting to extend our results to nonconvex loss functions, including those from the minimum error entropy principle (Hu, Fan, Wu, & Zhou, 2015).

### 2.1. Online Gradient Descent Learning

The first special case corresponds to the mirror map $\Psi(w) = \frac{1}{2}\|w\|_2^2$ and the zero regularizer. In this case, the online composite mirror descent algorithm, equation 1.3, recovers the unregularized online gradient descent algorithms for regression and classification by selecting concrete loss functions such as the $q$-norm hinge loss, the logistic loss, and the $q$th power absolute distance loss.

Theorem 11 immediately implies the following convergence rate of the excess risk for unregularized online gradient descent algorithms. It is a substantial improvement and thereby solves the open question of whether the known rate can be improved without any boundedness assumption for the unregularized online gradient descent algorithm applied to general loss functions (Ying & Zhou, 2017).

Consider the mirror map $\Psi(w) = \frac{1}{2}\|w\|_2^2$ and the zero regularizer. Suppose assumptions 5 and 6 hold. Then for the step size satisfying equation 2.2, we have the stated excess risk bound.
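A minimal sketch of the unregularized online gradient descent covered by this corollary, assuming the least square loss and polynomially decaying step sizes $\eta_t = \eta_1 t^{-\theta}$; the data stream and all parameter values are illustrative, not from the corollary.

```python
import numpy as np

def online_gradient_descent(data_stream, w0, eta1, theta):
    """Unregularized online gradient descent with eta_t = eta1 * t^{-theta}.

    data_stream yields (x, y) pairs one at a time; the least square loss
    (y - <w, x>)^2 is assumed here for illustration.
    """
    w = np.asarray(w0, dtype=float)
    for t, (x, y) in enumerate(data_stream, start=1):
        grad = 2.0 * (np.dot(w, x) - y) * x   # gradient of the squared loss
        w = w - eta1 * t ** (-theta) * grad
    return w

# Noise-free linear data: the last iterate should approach w_star.
rng = np.random.default_rng(0)
w_star = np.array([1.0, -2.0])
stream = [(x, np.dot(w_star, x)) for x in rng.standard_normal((2000, 2))]
w_T = online_gradient_descent(stream, np.zeros(2), eta1=0.05, theta=0.5)
```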

### 2.2. Online Learning with Sparsity-Inducing Regularizer

The second special case is given by the mirror map $\frac{1}{2}\|w\|_p^2$ with $1 < p \le 2$ and an $\ell_1$ regularizer. In this case, the online composite mirror descent algorithm 1.3 recovers the SMIDAS algorithm proposed in Shalev-Shwartz and Tewari (2011), whose convergence follows as a direct corollary of theorem 11 by noting the $(p-1)$-strong convexity of the mirror map with respect to $\|\cdot\|_p$. Note that the dual norm of $\|\cdot\|_p$ is $\|\cdot\|_q$ with $1/p + 1/q = 1$.

Suppose assumptions 5 and 6 hold. Then for the step size satisfying equation 1.5, we have the stated convergence rate.
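A minimal sketch of a SMIDAS-style update consistent with the description above: a gradient move and $\ell_1$ shrinkage in the dual space, mapped back to the primal through the conjugate mirror map. The decomposition into these helper functions is mine, not the exact pseudocode of Shalev-Shwartz and Tewari (2011).

```python
import numpy as np

def grad_psi(w, p):
    """Gradient of Psi(w) = 0.5 * ||w||_p^2; its inverse is grad_psi(., q)."""
    n = np.linalg.norm(w, p)
    if n == 0.0:
        return np.zeros_like(w)
    return np.sign(w) * np.abs(w) ** (p - 1) * n ** (2 - p)

def soft_threshold(theta, tau):
    """Componentwise shrinkage toward zero by tau (prox of the ell_1 norm)."""
    return np.sign(theta) * np.maximum(np.abs(theta) - tau, 0.0)

def smidas_step(w, g, eta, lam, p):
    """One SMIDAS-style step: gradient move and ell_1 shrinkage in the dual,
    then map back to the primal via the conjugate mirror map."""
    q = p / (p - 1)
    theta = grad_psi(w, p) - eta * g
    return grad_psi(soft_threshold(theta, eta * lam), q)
```

For `p = 2` the mirror maps are identities and the step reduces to a soft-thresholded gradient update, which makes the sparsity-inducing effect of the shrinkage easy to see.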

### 2.3. Online Smoothed Linearized Bregman Iteration

The convergence of this algorithm follows as a direct corollary of theorem 11.

Consider the setting of the online smoothed linearized Bregman iteration specified above. Under assumption 5, with the step size satisfying equation 2.2, we have the stated convergence rate.

It would be interesting to extend the above result to the convergence of the original online linearized Bregman iteration without smoothing.
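A minimal sketch of an online linearized Bregman-style iteration of the kind discussed in this subsection, using the mirror map $\lambda\|w\|_1 + \frac{1}{2}\|w\|_2^2$, whose conjugate gradient is soft-thresholding. The smoothing of the $\ell_1$ term is omitted here, and the interface is my own assumption.

```python
import numpy as np

def soft_threshold(theta, lam):
    """Componentwise shrinkage toward zero by lam."""
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

def online_linearized_bregman(grads, lam, etas, d):
    """Online linearized Bregman sketch with mirror map
    Psi(w) = lam * ||w||_1 + 0.5 * ||w||_2^2; its conjugate gradient is
    soft-thresholding. grads(t, w) returns a stochastic gradient at w."""
    theta = np.zeros(d)            # dual variable: accumulated gradient steps
    w = np.zeros(d)                # primal iterate, kept sparse by shrinkage
    for t, step in enumerate(etas, start=1):
        theta = theta - step * grads(t, w)
        w = soft_threshold(theta, lam)
    return w
```

Because the gradient steps accumulate in the dual variable while the primal iterate is obtained by shrinkage, small coordinates stay exactly zero along the trajectory.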

## 3 Ideas and Novelty in the Analysis

This section outlines the ideas and novelty in the proof of our main results. Our first novel point is a one-step progress bound established in equation 3.1, to be proved in the next section, showing that the excess regularized error can be controlled by the excess Bregman distance plus an additional step-size term. Here the conditional expectation is taken with respect to the σ-algebra generated by the data observed up to the current iterate. A notable property of the one-step progress bound, equation 3.1, is that it involves the regularized error rather than the dual norm of gradients encountered during the iterations, whose boundedness in expectation is established in equation 3.2. This boundedness allows us to avoid assumptions on the boundedness of gradients imposed in the literature (Shamir & Zhang, 2013; Duchi et al., 2010) and demonstrates the novelty of our analysis.

Under assumptions 5, 6, and 9, the sequence generated by equation 1.3 satisfies the stated bound, where the two constants are independent of the number of iterates (explicitly given in the proof). If we take the step size satisfying equation 2.2, then we have the stated convergence rate, where the constant is independent of the number of iterates (explicitly given in the proof).

Our second novel point is to derive error bounds and convergence rates for the last iterate from the one-step progress measured by the Bregman distance in lemma 18. It refines the recent error decomposition method for gradient descent schemes in Lin, Rosasco, and Zhou (2016), Lin, Rosasco, Villa, and Zhou (2015), and Shamir and Zhang (2013), reformulating the last-iterate error as a summation of weighted average errors and moving weighted average errors (see equation B.1); the lemma is proved in appendix B.

Our last novel point is to get the boundedness stated in equation 3.2 by applying lemma 18 to the following one-step progress bound in terms of the excess Bregman distance and the dual norms of gradients, which can be controlled in terms of step sizes (see lemma 24). Lemma 19 improves lemma 17 in Duchi et al. (2010) in our situation. Unlike lemma 17 in Duchi et al. (2010), equation 3.6 in lemma 19 involves a term that matches the form of equation 3.3 in lemma 18 and is thereby crucial for applying lemma 18 to get equation 3.2. As a comparison, lemma 17 in Duchi et al. (2010) could not yield a one-step progress bound of the form 3.3. The proof of lemma 19 is given in appendix A.

It should be emphasized that a single application of lemma 18 with the one-step progress bound given in equation 3.5 can only yield a suboptimal convergence rate. For the special case of online gradient descent, this rate matches the one established in Ying and Zhou (2017) up to a logarithmic factor. The improvement rests on the following key observation due to a self-bounding property (see lemmas 21 and 22): although the iterates can only be shown to lie in a ball with an asymptotically diverging radius (see lemma 24), the expected norm of the associated gradient is always bounded, since it is dominated by the regularized risk.

## 4 Proving Main Results

This section presents the proof of theorem 11, which yields the conclusion of theorem 8. Our proof consists of two parts. The first part applies lemma 18 and the one-step progress bound, equation 3.5, to establish a crude bound on the regularized risk, equation 3.2. Based on this, the second part applies lemma 18 and the one-step progress bound, equation 3.1, to derive the convergence rate, equation 2.3, for the last iterate of the online composite mirror descent.

The loss functions satisfying assumption 6 always enjoy the following growth behavior. Also, the regularizer satisfying assumption 9 meets the following growth condition, where we have used the convexity of the regularizer. We also introduce some notation, defined by cases according to the Hölder exponent, to be used in the proofs below.

The following two lemmas establish the self-bounding property for functions with Hölder continuous gradients (Srebro, Sridharan, & Tewari, 2010; Ying & Zhou, 2017), meaning that the gradients can be controlled by the function values. This self-bounding property allows us to transfer the one-step progress bound, equation 3.5, in terms of gradients to the one-step progress bound, equation 3.1, in terms of the regularized risk, and is essential for us to avoid the boundedness assumptions imposed in the literature (Shamir & Zhang, 2013; Duchi et al., 2010). Lemma 21 can be found in Ying and Zhou (2017), while lemma 22 will be proved as a consequence of lemma 26 in appendix C.
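To make the self-bounding property concrete, the classical smooth case reads as follows; this is the standard statement for nonnegative smooth functions, not the letter's exact lemma.

```latex
% Self-bounding property for a nonnegative L-smooth function f:
% from  f(y) \le f(w) + \langle \nabla f(w), y - w \rangle + \tfrac{L}{2}\|y - w\|_2^2,
% taking  y = w - \nabla f(w)/L  and using  f(y) \ge 0  gives
\[
  \|\nabla f(w)\|_2^2 \le 2 L\, f(w) \qquad \text{for all } w \in \mathbb{R}^d .
\]
```

Thus the gradient norm is dominated by the function value itself, which is exactly the mechanism that lets the analysis trade gradient norms for regularized risks.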

To apply lemmas 18 and 19, we need to estimate the growth behavior of the iterates and the gradients. This is achieved in lemma 24 by showing that the iterates always lie inside a ball, under the Bregman divergence, with a controllable radius. The proof of lemma 24 is given in appendix D.

We are now in a position to prove lemma 17. The proof of equation 3.1 requires the one-step progress bound, equation 3.5, and the self-bounding property established in lemmas 21 and 22, while the proof of equation 3.2 requires applying lemma 18 with the one-step progress bound, equation 3.5, coupled with the bounds on the gradients established in lemma 24.

We first use the self-bounding property established in lemmas 21 and 22 to control the gradient norms.

We then apply lemma 18 to obtain the intermediate bound. According to lemma 23, the definition of the step size, and its polynomial decay, we get a first estimate. Furthermore, it follows from equation 4.5 that a second estimate holds. Plugging these two bounds into equation 4.10 gives the claimed inequality, noting the equivalence between the two conditions on the step-size exponent.

In one range of the exponent, the stated bound holds directly; in the other, the relevant quantity is bounded by 1.

We are in a position to prove our main results.

We prove our conclusion in two cases according to the value of the step-size exponent.

We apply lemma 18 and an elementary inequality to obtain the bound, where we have used equation 4.6 in the last inequality.

Applying lemma 18 and equation 3.1, with the relevant quantity bounded above, yields the desired estimate.

It then follows from theorem 11 and equation 4.13 that the bound holds with a constant depending on the problem parameters. Equation 4.13, together with the definition of the step size, implies the required intermediate bounds. Furthermore, equation 4.13 and the stated assumption imply that both quantities can be upper-bounded by constants independent of the number of iterates. Therefore, the constant in equation 4.14 is independent of the number of iterates.

## Appendix A: Proof of Lemma 19

## Appendix B: Proof of Lemma 18

We proceed with the proof in four steps.

**Step 1: Error decomposition.** The following identity (Shamir & Zhang, 2013; Lin et al., 2016) holds for any sequence of reals. Applying it to the errors along the iterates, we derive an inequality whose last term is nonpositive because the step sizes are nonincreasing; this yields the decomposition in equation B.1. The first and second terms on the right-hand side are called the weighted average errors and the moving weighted average errors, respectively.
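One standard form of this identity (my reconstruction, with $S_k$ denoting the average of the last $k$ terms of a sequence $(a_t)_{t=1}^T$) is:

```latex
% With  S_k = \frac{1}{k}\sum_{t=T-k+1}^{T} a_t,  the relation
% k S_k = (k+1) S_{k+1} - a_{T-k}  rearranges to
\[
  S_k = S_{k+1} + \frac{1}{k(k+1)} \sum_{t=T-k}^{T} \bigl(a_t - a_{T-k}\bigr),
\]
% and summing over  k = 1, \dots, T-1  telescopes to
\[
  a_T = \frac{1}{T}\sum_{t=1}^{T} a_t
        + \sum_{k=1}^{T-1} \frac{1}{k(k+1)} \sum_{t=T-k}^{T} \bigl(a_t - a_{T-k}\bigr),
\]
```

which expresses the last term $a_T$ through a full average plus moving weighted averages of differences.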

**Step 2: Controlling weighted average errors.** Applying the assumption in equation 3.3 with a suitable comparison point and taking expectations over the remaining random variables implies the stated bound.

**Step 3: Controlling moving weighted average errors.** Applying equation 3.3 with a shifted comparison point (note that each iterate is measurable with respect to the σ-algebra generated by the preceding data) and then taking expectations over the remaining random variables implies the stated bound.

**Step 4: Combining the above results.** Plugging the error bounds in steps 2 and 3 into equation B.1 yields the claimed inequality, where the last step uses a standard summation identity.

## Appendix C: Proving Self-Bounding Properties

The following lemma is an extension of proposition 1 in Ying and Zhou (2017), which considers a special case of the Hölder exponent.

It suffices to consider one case by symmetry. We proceed with the discussion by considering two cases according to the value of the argument.

Combining the above cases yields inequality C.2.

A special case of lemma 26 was considered in Srebro et al. (2010).

We apply lemma 26 to get a pointwise bound. Taking expectations on both sides and using Jensen's inequality, we obtain the bound in expectation. Applying equation 2.4 of theorem 11 then gives the desired estimate.

The following lemma provides a class of regularizers satisfying condition 2.1. For $a \in \mathbb{R}$, denote by $\mathrm{sgn}(a)$ the sign of $a$; that is, $\mathrm{sgn}(a) = 1$ if $a > 0$, $\mathrm{sgn}(a) = -1$ if $a < 0$, and $\mathrm{sgn}(a) = 0$ if $a = 0$.

## Appendix D: Proof of Lemma 24

**Step 1.** We first bound the relevant quantity according to different values of the exponent.

**Step 2.** We then bound the remaining quantity in three cases according to the value of the exponent.

## Appendix E: Proof of Lemma 23

Inequality 4.5 is obvious. Inequality 4.6 is a slight modification of lemma 2.6 in Lin et al. (2015).

The above bounds together can be written as equation 4.6.

## Acknowledgments

We are grateful to the anonymous referees for their constructive comments. The work described in this letter is partially supported by the NSFC/RGC Joint Research Scheme (RGC Project No. N_CityU120/14 and NSFC Project No. 11461161006).