We study the convergence of the online composite mirror descent algorithm, which involves a mirror map to reflect the geometry of the data and a convex objective function consisting of a loss and a regularizer possibly inducing sparsity. Our error analysis provides convergence rates in terms of properties of the strongly convex differentiable mirror map and the objective function. For a class of objective functions with Hölder continuous gradients, we derive explicit polynomial convergence rates for the excess (regularized) risk under polynomially decaying step sizes in terms of the number of iterates. Our results improve the existing error analysis for the online composite mirror descent algorithm by avoiding averaging and removing boundedness assumptions, and they sharpen the existing convergence rates of the last iterate for online gradient descent without any boundedness assumptions. Our methodology mainly depends on a novel error decomposition in terms of an excess Bregman distance, a refined analysis of self-bounding properties of the objective function, and the resulting one-step progress bounds.
Gradient descent is a powerful classic method for optimization and numerical computation. To approximate a minimizer of a convex function $f$ on the Euclidean space $\mathbb{R}^d$, it defines a sequence of points $\{w_t\}$ iteratively by $w_{t+1} = w_t - \eta_t f'(w_t)$, where $f'(w_t)$ is a subgradient of $f$ at $w_t$ and $\eta_t > 0$ is a step size. Gradient descent is even more powerful in the era of big data and has been extended along different directions in various ways. Mirror descent is such an extension obtained by relaxing the Hilbert space structure (Nemirovsky & Yudin, 1983; Beck & Teboulle, 2003) and allowing a Banach space norm $\|\cdot\|$ on $\mathbb{R}^d$ such as the $\ell_p$-norm with $p \in (1, 2]$, where a mirror map $\Psi$ is used for performing the gradient descent in the dual of the primal space $(\mathbb{R}^d, \|\cdot\|)$.
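To fix ideas, the subgradient iteration described above can be sketched as follows (an illustrative sketch only: the objective, step-size schedule, and function names are our own choices, not the paper's):

```python
import numpy as np

def subgradient_descent(subgrad, w0, step, T):
    """Generic subgradient descent: w_{t+1} = w_t - eta_t * g_t,
    where g_t is a subgradient of the objective at w_t."""
    w = np.asarray(w0, dtype=float)
    for t in range(1, T + 1):
        g = subgrad(w)
        w = w - step(t) * g
    return w

# Toy objective f(w) = ||w - 1||_2^2 / 2, whose gradient at w is w - 1.
w_final = subgradient_descent(
    subgrad=lambda w: w - 1.0,
    w0=np.zeros(3),
    step=lambda t: 0.5 / np.sqrt(t),  # polynomially decaying step sizes
    T=2000,
)
```

With these decaying step sizes, the iterates approach the minimizer $(1, 1, 1)$ of the toy objective.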
Let $p \in (1, 2]$ and $w = (w_1, \ldots, w_d) \in \mathbb{R}^d$ with the $\ell_p$-norm defined by $\|w\|_p = \big(\sum_{i=1}^d |w_i|^p\big)^{1/p}$. Then the dual space of $(\mathbb{R}^d, \|\cdot\|_p)$ is $(\mathbb{R}^d, \|\cdot\|_q)$ with $q = \frac{p}{p-1}$. Take the $p$-norm divergence $\Psi(w) = \frac{1}{2}\|w\|_p^2$ as the mirror map. This mirror map, as shown in Ball, Carlen, and Lieb (1994), is $(p-1)$-strongly convex over $\mathbb{R}^d$ with respect to the norm $\|\cdot\|_p$. When $p = 2$, the primal and dual spaces coincide, and the mirror descent reduces to the gradient descent. When a minimizer of a convex function is sparse, the mirror descent method with the mirror map $\Psi$ and a specific choice of $p$ approaching $1$ as the dimension $d$ grows yields a convergence bound with a logarithmic dependence on $d$, as proved in Duchi, Shalev-Shwartz, Singer, and Tewari (2010).
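The mirror descent step induced by this mirror map can be sketched with the standard link functions $(\nabla\Psi(w))_i = \mathrm{sgn}(w_i)\,|w_i|^{p-1}\,\|w\|_p^{2-p}$ and its conjugate counterpart with $q = p/(p-1)$; a minimal sketch, with function names of our own:

```python
import numpy as np

def grad_psi(w, p):
    """Gradient of the mirror map Psi(w) = 0.5 * ||w||_p^2 (primal -> dual)."""
    norm = np.linalg.norm(w, ord=p)
    return np.sign(w) * np.abs(w) ** (p - 1) * norm ** (2 - p)

def grad_psi_star(theta, p):
    """Gradient of the conjugate Psi*, mapping the dual space back to the primal."""
    q = p / (p - 1)
    norm = np.linalg.norm(theta, ord=q)
    return np.sign(theta) * np.abs(theta) ** (q - 1) * norm ** (2 - q)

def mirror_descent_step(w, g, eta, p):
    """One mirror descent step: gradient step in the dual space, then map back."""
    theta = grad_psi(w, p) - eta * g
    return grad_psi_star(theta, p)
```

The two link functions are mutually inverse, and for $p = 2$ both reduce to the identity, recovering gradient descent.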
We assume that the data are uniformly bounded, in the sense that both the inputs and the outputs are bounded almost surely.
The involved properties of the loss function are measured by the Hölder continuity of its gradient.
In the two extreme cases $\alpha = 0$ and $\alpha = 1$, convex loss functions satisfying condition 1.4 include the hinge loss $\phi(y, a) = \max\{0, 1 - ya\}$ for classification with $y \in \{-1, 1\}$, the least squares loss $\phi(y, a) = (y - a)^2$, and the logistic function $\phi(y, a) = \log(1 + \exp(-ya))$. The intermediate case $\alpha \in (0, 1)$ includes the $q$-norm hinge loss $\phi(y, a) = (\max\{0, 1 - ya\})^q$ for classification (Chen, Wu, Ying, & Zhou, 2004) and the $\tau$th power absolute distance loss $\phi(y, a) = |y - a|^\tau$ for regression (Steinwart & Christmann, 2008) with $q, \tau \in (1, 2)$.
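These losses can be written down directly; a short sketch (function names and signatures are ours, with $y$ the label and $a$ the prediction, and the stated Hölder exponents of the gradients noted in comments):

```python
import numpy as np

def hinge(y, a):
    """Hinge loss: bounded but discontinuous (sub)gradient, alpha = 0."""
    return max(0.0, 1.0 - y * a)

def least_squares(y, a):
    """Least squares loss: Lipschitz continuous gradient, alpha = 1."""
    return (y - a) ** 2

def logistic(y, a):
    """Logistic loss: Lipschitz continuous gradient, alpha = 1."""
    return np.log(1.0 + np.exp(-y * a))

def qnorm_hinge(y, a, q):
    """q-norm hinge loss, q in (1, 2): Holder continuous gradient, alpha = q - 1."""
    return max(0.0, 1.0 - y * a) ** q
```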
Denote by $\mathbf{1}$ the vector in $\mathbb{R}^d$ with all components being 1. A norm $\|\cdot\|$ on $\mathbb{R}^d$ is said to be monotonic if $\|u\| \le \|v\|$ whenever $u, v \in \mathbb{R}^d$ satisfy $|u_i| \le |v_i|$ for $i = 1, \ldots, d$.
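The $\ell_p$-norms are monotonic in this sense, since they depend on each component only through its absolute value, increasingly. A quick numerical check (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 1.5
for _ in range(100):
    v = rng.normal(size=5)
    # Shrink each component toward zero, so |u_i| <= |v_i| for every i.
    u = v * rng.uniform(0.0, 1.0, size=5)
    assert np.linalg.norm(u, ord=p) <= np.linalg.norm(v, ord=p) + 1e-12
```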
2 Main Results
This section presents our main results on the error analysis of the online composite mirror descent algorithm, equation 1.3, given in terms of the following properties of the regularizer in addition to those of the loss function.
Furthermore, our analysis does not need any boundedness assumption on the iterates or the gradients as imposed in the literature (Duchi et al., 2010; Shamir & Zhang, 2013). For example, stochastic projected gradient descent is studied in Shamir and Zhang (2013) for nonsmooth optimization, which gives a convergence rate of order $O(\log T/\sqrt{T})$ for the last iterate. But their discussion requires the existence of a constant bounding the expected squared norms of the stochastic gradients at points of the projected domain, which holds only when the domain is compact, and thereby their algorithm requires an additional projection onto this domain per iteration. More recently, convergence of the last iterate for stochastic proximal gradient algorithms was studied in Rosasco et al. (2014), presenting a nonasymptotic bound in expectation in the strongly convex case and almost sure convergence in the general case, but their discussion still needs a boundedness assumption on the stochastic gradients along the iterates.
In deriving the almost optimal convergence rates, we also obtain the following convergence rate, to be proved in appendix C.
2.1. Online Gradient Descent Learning
The first special case corresponds to $\Psi(w) = \frac{1}{2}\|w\|_2^2$ and the zero regularizer. In this case, the online composite mirror descent algorithm, equation 1.3, recovers the unregularized online gradient descent algorithms for regression and classification obtained by selecting concrete loss functions such as the $q$-norm hinge loss, the logistic function, and the $\tau$th power absolute distance loss.
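This special case can be sketched as a streaming algorithm; a hedged illustration (the step-size schedule $\eta_t = \eta_1 t^{-\theta}$ follows the polynomially decaying form discussed above, while the toy data and function names are our own):

```python
import numpy as np

def online_gradient_descent(samples, eta1, theta):
    """Unregularized online gradient descent (Psi = 0.5*||.||_2^2, no regularizer)
    with polynomially decaying step sizes eta_t = eta1 * t^(-theta),
    here for the least squares loss phi(y, a) = (y - a)^2."""
    w = None
    for t, (x, y) in enumerate(samples, start=1):
        if w is None:
            w = np.zeros_like(x)
        eta = eta1 * t ** (-theta)
        grad = 2.0 * (np.dot(w, x) - y) * x  # gradient of (y - <w, x>)^2 in w
        w = w - eta * grad
    return w

# Toy noiseless stream: y = <w*, x> with w* = (1, -1).
rng = np.random.default_rng(0)
w_star = np.array([1.0, -1.0])
stream = [(x, float(np.dot(w_star, x)))
          for x in rng.normal(size=(5000, 2))]
w_hat = online_gradient_descent(stream, eta1=0.1, theta=0.5)
```

On this noiseless toy stream, the last iterate approaches $w^\ast$ without any averaging.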
2.2. Online Learning with Sparsity-Inducing Regularizer
The second special case is given by the mirror map $\Psi(w) = \frac{1}{2}\|w\|_p^2$ with $p \in (1, 2]$ and the sparsity-inducing regularizer $r(w) = \lambda\|w\|_1$. In this case, the online composite mirror descent algorithm 1.3 recovers the SMIDAS algorithm proposed in Shalev-Shwartz and Tewari (2011), whose convergence follows as a direct corollary of theorem 11 by noting the identity of the two schemes and the $(p-1)$-strong convexity of $\Psi$ with regard to $\|\cdot\|_p$. Note that the dual norm of $\|\cdot\|_p$ is $\|\cdot\|_q$ with $q = \frac{p}{p-1}$.
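A SMIDAS-style update can be sketched as a dual gradient step followed by soft-thresholding (the proximal step for $\lambda\|\cdot\|_1$) before mapping back to the primal space; a sketch under these assumptions, with function names of our own rather than the paper's notation:

```python
import numpy as np

def grad_psi(w, p):
    """Primal-to-dual link for Psi(w) = 0.5 * ||w||_p^2."""
    norm = np.linalg.norm(w, ord=p)
    if norm == 0.0:
        return np.zeros_like(w)
    return np.sign(w) * np.abs(w) ** (p - 1) * norm ** (2 - p)

def grad_psi_star(theta, p):
    """Dual-to-primal link; q = p/(p-1) is the dual exponent."""
    q = p / (p - 1)
    norm = np.linalg.norm(theta, ord=q)
    if norm == 0.0:
        return np.zeros_like(theta)
    return np.sign(theta) * np.abs(theta) ** (q - 1) * norm ** (2 - q)

def smidas_step(w, g, eta, lam, p):
    """One SMIDAS-style update: gradient step in the dual space followed by
    soft-thresholding at level eta*lam, then the map back to the primal space."""
    theta = grad_psi(w, p) - eta * g
    theta = np.sign(theta) * np.maximum(np.abs(theta) - eta * lam, 0.0)
    return grad_psi_star(theta, p)
```

The soft-thresholding step zeroes out small dual coordinates, which is what induces sparsity in the iterates.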
2.3. Online Smoothed Linearized Bregman Iteration
It would be interesting to extend the above result to the convergence of the original online linearized Bregman iteration without smoothing.
3 Ideas and Novelty in the Analysis
This section outlines the ideas and novelty in the proof of our main results. Our first novel point is a one-step progress bound established in equation 3.1 to be proved in the next section, showing that the excess regularized error can be controlled by the excess Bregman distance plus a remainder term. Here $\mathbb{E}_t$ denotes the conditional expectation given $\mathcal{F}_t$, the $\sigma$-algebra generated by the first $t$ samples. A notable property of the one-step progress bound, equation 3.1, is that it involves the regularized error rather than the dual norm of gradients encountered during the iterations, whose boundedness in expectation is established in equation 3.2. This boundedness allows us to avoid assumptions on the boundedness of gradients imposed in the literature (Shamir & Zhang, 2013; Duchi et al., 2010) and demonstrates the novelty of our analysis.
Our second novel point is to derive error bounds and convergence rates for the last iterate from the one-step progress measured by the Bregman distance in lemma 18. It refines the recent error decomposition method for gradient descent schemes in Lin, Rosasco, and Zhou (2016), Lin, Rosasco, Villa, and Zhou (2015), and Shamir and Zhang (2013), reformulating the error of the last iterate as a summation of weighted average errors and moving weighted average errors (see equation B.1), and is proved in appendix B.
Our last novel point is to obtain the boundedness stated in equation 3.2 by applying lemma 18 to the following one-step progress bound in terms of the excess Bregman distance and the dual norms of gradients, which can be controlled in terms of step sizes (see lemma 24). Lemma 19 improves lemma 17 in Duchi et al. (2010) in our situation. Unlike lemma 17 in Duchi et al. (2010), whose one-step progress bound involves a different distance term, equation 3.6 in lemma 19 involves the excess Bregman distance instead, which matches the form of equation 3.3 in lemma 18 and is thereby crucial for applying lemma 18 to obtain equation 3.2. As a comparison, lemma 17 in Duchi et al. (2010) could not yield a one-step progress bound of the form 3.3. The proof of lemma 19 is given in appendix A.
It should be emphasized that a single application of lemma 18 with the one-step progress bound given in equation 3.5 can only yield a suboptimal convergence rate under polynomially decaying step sizes. For the specific case $p = 2$ and $\alpha = 1$, this suboptimal rate matches the rate established in Ying and Zhou (2017) within a logarithmic factor. The way we achieve the improvement rests on the following key observation due to a self-bounding property (see lemmas 21 and 22): although the iterates can only be shown to lie in a ball with an asymptotically diverging radius (see lemma 24), the expected norm of the associated gradient is always bounded since it is dominated by the regularized risk.
4 Proving Main Results
This section presents the proof of theorem 11, which yields the conclusion of theorem 8. Our proof consists of two parts. The first part applies lemma 18 and the one-step progress bound, equation 3.5, to establish a crude bound on the regularized risk, equation 3.2, based on which the second part applies lemma 18 and the one-step progress bound, equation 3.1, to derive the convergence rate, equation 2.3, for the last iterate of the online composite mirror descent.
The following two lemmas establish the self-bounding property for functions with Hölder continuous gradients (Srebro, Sridharan, & Tewari, 2010; Ying & Zhou, 2017), meaning that the gradients can be controlled by the function values. This self-bounding property allows us to transfer the one-step progress bound, equation 3.5, stated in terms of gradients to the one-step progress bound, equation 3.1, stated in terms of the regularized risk, and is essential for us to avoid the boundedness assumptions imposed in the literature (Shamir & Zhang, 2013; Duchi et al., 2010). Lemma 21 can be found in Ying and Zhou (2017), while lemma 22 will be proved as a consequence of lemma 26 in appendix C.
To apply lemmas 18 and 19, we need to estimate the growth behavior of the iterates and their gradients. This is achieved in lemma 24 by showing that the iterates always lie inside a ball, under the Bregman divergence, with a controllable radius. The proof of lemma 24 is given in appendix D.
We are now in a position to prove lemma 17. The proof of equation 3.1 requires the one-step progress bound, equation 3.5, and the self-bounding property established in lemmas 21 and 22, while the proof of equation 3.2 requires applying lemma 18 with the one-step progress bound, equation 3.5, coupled with the bounds on the gradients established in lemma 24.
We distinguish two cases according to the value of the exponent involved; in the second case, the quantity of interest is bounded by 1.
We are in a position to prove our main results.
We prove our conclusion in two cases according to different values of the parameter involved.
Appendix A: Proof of Lemma 19
Appendix B: Proof of Lemma 18
We proceed with the proof in four steps.
Appendix C: Proving Self-Bounding Properties
The following lemma is an extension of proposition 1 in Ying and Zhou (2017), which considers the smooth case $\alpha = 1$.
Let $a$ be any real number. We proceed with the discussion by considering two cases according to the value of $a$.
Combining the above cases together yields inequality C.2.
The following lemma provides a class of regularizers satisfying condition 2.1. For $u \in \mathbb{R}$, denote by $\mathrm{sgn}(u)$ the sign of $u$, that is, $\mathrm{sgn}(u) = 1$ if $u > 0$, $\mathrm{sgn}(u) = -1$ if $u < 0$, and $\mathrm{sgn}(u) = 0$ if $u = 0$.
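As a concrete member of such a class, the $\ell_1$-regularizer $r(w) = \lambda\|w\|_1$ has $\lambda\,\mathrm{sgn}(w)$ as a subgradient; a sketch numerically checking the subgradient inequality $r(v) \ge r(w) + \langle g, v - w \rangle$ (function names are ours, and condition 2.1 itself is not reproduced here):

```python
import numpy as np

def l1(w, lam):
    """Sparsity-inducing regularizer r(w) = lam * ||w||_1."""
    return lam * np.sum(np.abs(w))

def l1_subgrad(w, lam):
    """A subgradient of r at w: lam * sgn(w), with sgn(0) = 0."""
    return lam * np.sign(w)

# Check the subgradient inequality r(v) >= r(w) + <g, v - w> on random points.
rng = np.random.default_rng(1)
lam = 0.3
for _ in range(200):
    w, v = rng.normal(size=4), rng.normal(size=4)
    g = l1_subgrad(w, lam)
    assert l1(v, lam) >= l1(w, lam) + np.dot(g, v - w) - 1e-12
```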
Appendix D: Proof of Lemma 24
Step 1. We first bound the relevant quantity according to different values of the parameter involved.
Step 2. We now bound the remaining term in three cases according to the value of the parameter.
Appendix E: Proof of Lemma 23
The above bounds together can be written as equation 4.6.
We are grateful to the anonymous referees for their constructive comments. The work described in this letter is partially supported by the NSFC/RGC Joint Research Scheme (RGC Project No. N_CityU120/14 and NSFC Project No. 11461161006).