We analyze the effect of synchronization on distributed stochastic gradient algorithms. By exploiting an analogy with dynamical models of biological quorum sensing, where synchronization between agents is induced through communication with a common signal, we quantify how synchronization can significantly reduce the magnitude of the noise felt by the individual distributed agents and their spatial mean. This noise reduction is in turn associated with a reduction in the smoothing of the loss function imposed by the stochastic gradient approximation. Through simulations on model nonconvex objectives, we demonstrate that coupling can stabilize higher noise levels and improve convergence. We provide a convergence analysis for strongly convex functions by deriving a bound on the expected deviation of the spatial mean of the agents from the global minimizer for an algorithm based on quorum sensing, the same algorithm with momentum, and the elastic averaging SGD (EASGD) algorithm. We discuss extensions to new algorithms that allow each agent to broadcast its current measure of success and shape the collective computation accordingly. We supplement our theoretical analysis with numerical experiments on convolutional neural networks trained on the CIFAR-10 data set, where we note a surprising regularizing property of EASGD even when applied to the non-distributed case. This observation suggests alternative second-order in time algorithms for non-distributed optimization that are competitive with momentum methods.
1 Introduction
Stochastic gradient descent (SGD) and its variants have become the de facto algorithms for large-scale machine learning applications such as deep neural networks (Bottou, 2010; Goodfellow, Bengio, & Courville, 2016; LeCun, Bengio, & Hinton, 2015; Mallat, 2016). SGD is used to optimize finite-sum loss functions, where a stochastic approximation to the gradient is computed using only a random selection of the input data points. Well-known results on almost-sure convergence rates to global minimizers for strictly convex functions and to stationary points for non-convex functions exist under sufficient regularity conditions (Bottou, 1998; Robbins & Siegmund, 1971). Classic work on iterate averaging for SGD (Polyak & Juditsky, 1992) and other more recent extensions (Bach & Moulines, 2013; Defazio, Bach, & Lacoste-Julien, 2014; Roux, Schmidt, & Bach, 2012; Schmidt, Le Roux & Bach, 2017) can improve convergence under a set of reasonable assumptions typically satisfied in the machine learning setting. Convergence proofs rely on a suitably chosen decreasing step size; for constant step sizes and strictly convex functions, the parameters ultimately converge to a distribution peaked around the optimum.
For large-scale machine learning applications, parallelization of SGD is a critical problem of significant modern research interest (Chaudhari et al., 2017; Dean et al., 2012; Recht & Ré, 2013; Recht, Ré, Wright, & Niu, 2011). Recent work in this direction includes the elastic averaging SGD (EASGD) algorithm, in which distributed agents coupled through a common signal optimize the same loss function. EASGD can be derived from a single SGD step on a global variable consensus objective with a quadratic penalty, and the common signal takes the form of an average over space and time of the parameter vectors of the individual agents (Boyd, Parikh, Chu, Peleato, & Eckstein, 2010; Zhang, Choromanska & LeCun, 2015). At its core, the EASGD algorithm is a system of identical, coupled, discrete-time dynamical systems. Indeed, the EASGD algorithm has exactly the same structure as earlier mathematical models of synchronization (Chung & Slotine, 2009; Russo & Slotine, 2010) inspired by quorum sensing in bacteria (Miller & Bassler, 2001; Waters & Bassler, 2005). In these models, which have typically been analyzed in continuous time, the dynamics of the common (quorum) signal can be arbitrary (Russo & Slotine, 2010), and in fact they may consist simply of a weighted average of individual signals. Motivated by this immediate analogy, we present here a continuous-time analysis of distributed stochastic gradient algorithms, of which EASGD is a special case. A significant focus of this work is the interaction between the degree of synchronization of the individual agents, characterized rigorously by a bound on the expected distance between all agents and governed by the coupling strength, and the amount of noise induced by their stochastic gradient approximations.
The effect of coupling between identical continuous-time dynamical systems has a rich history. In particular, synchronization phenomena in such coupled systems have been the subject of much mathematical (Wang & Slotine, 2005), biological (Russo & Slotine, 2010), neuroscientific (Tabareau, Slotine & Pham, 2010), and physical interest (Javaloyes, Perrin, & Politi, 2008). In nonlinear dynamical systems, synchronization has been shown to play a crucial role in protection of the individual systems from independent sources of noise (Tabareau et al., 2010). The interaction between synchronization and noise has also been posed as a possible source of regularization in biological learning, where quorum sensing–like mechanisms could be implemented between neurons through local field potentials (Bouvrie & Slotine, 2013). Given the significance of stochastic gradient (Y. Zhang, Saxe, Advani & Lee, 2018) and externally injected (Neelakantan et al., 2015) noise in regularization of large-scale machine learning models such as deep networks (Zhang, Bengio, Hardt, Recht & Vinyals, 2017), it is natural to expect that the interplay between synchronization of the individual agents and the noise from their stochastic gradient approximations is of central importance in distributed SGD algorithms.
Recently, there has been renewed interest in a continuous-time view of optimization algorithms (Betancourt, Jordan, & Wilson, 2018; Wibisono & Wilson, 2015; Wibisono, Wilson & Jordan, 2016; Wilson, Recht & Jordan, 2016). Nesterov's accelerated gradient method (Nesterov, 1983) was fruitfully analyzed in continuous time in Su, Boyd and Candes (2014), and a unifying extension to other algorithms can be found in Wibisono et al. (2016). Continuous-time analysis has also enabled discrete-time algorithm development through classical discretization techniques from numerical analysis (Zhang, Mokhtari, Sra & Jadbabaie, 2018). This article adds to this line of work by deriving new results with the mathematical tools afforded by the continuous-time view, such as stochastic calculus and nonlinear contraction analysis (Lohmiller & Slotine, 1998).
The article is organized as follows. In section 2, we provide some necessary mathematical preliminaries: a review of SGD in continuous time, a continuous-time limit of the EASGD algorithm, a review of stochastic nonlinear contraction theory, and a statement of some needed assumptions. In section 3, we demonstrate that the effect of synchronization of the distributed SGD agents is to reduce the magnitude of the noise felt by each agent and by their spatial mean. We derive this for an algorithm where all-to-all coupling is implemented through communication with the spatial mean of the distributed parameters, and we refer to this algorithm as quorum SGD (QSGD). The appendix presents a similar derivation with arbitrary dynamics for the quorum variable, of which EASGD is a special case. In section 4, we connect this noise reduction property with a recent analysis in Kleinberg, Li, and Yuan (2018), which shows that SGD can be interpreted as performing gradient descent on a smoothed loss in expectation. We use this derivation to garner intuition about the qualitative performance of distributed SGD algorithms as the coupling strength is varied, and we verify this intuition with simulations on model non-convex loss functions in low and high dimensions. In section 5, we provide new convergence results for QSGD, QSGD with momentum, and EASGD for a strongly convex objective. In section 6, we explore the properties of EASGD and QSGD for training deep neural networks and, in particular, test the stability and performance of variants proposed throughout the article. We also propose a new class of second-order in time algorithms motivated by the EASGD algorithm with a single agent, which consists of standard SGD coupled in feedback to the output of a nonlinear filter of the parameters. We provide some concluding remarks in section 7.
2 Mathematical Preliminaries
In this section, we provide a brief review of the necessary mathematical tools employed in this work.
2.1 Convex Optimization
For the convergence proofs in section 5 and for synchronization of momentum methods, we require a few standard definitions from convex optimization.
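A twice-differentiable function $f$ is $l$-strongly convex with $l > 0$ if its Hessian is uniformly lower bounded by $lI$ with respect to the positive semidefinite order, $\nabla^2 f(x) \succeq lI$, for all $x$.
A twice-differentiable function $f$ is $L$-smooth with $L > 0$ if its Hessian is uniformly upper bounded by $LI$ with respect to the positive semidefinite order, $\nabla^2 f(x) \preceq LI$, for all $x$.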
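As a small worked example of these two definitions, the sketch below computes the strong convexity and smoothness constants of a quadratic objective, whose Hessian is constant, as the smallest and largest Hessian eigenvalues. The objective and the names `sym2x2_eigenvalues`, `lower`, and `upper` are hypothetical choices made only for illustration.

```python
import math

def sym2x2_eigenvalues(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], smallest first."""
    center = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    return center - radius, center + radius

# Hypothetical objective f(x, y) = x**2 + x*y + 1.5*y**2,
# whose Hessian is the constant matrix [[2, 1], [1, 3]].
lower, upper = sym2x2_eigenvalues(2.0, 1.0, 3.0)

# The Hessian is bounded below by lower * I and above by upper * I, so f is
# strongly convex with constant `lower` and smooth with constant `upper`.
print(f"strong convexity constant: {lower:.4f}")
print(f"smoothness constant:       {upper:.4f}")
```

For a non-quadratic objective the Hessian varies with the parameters, and the constants would be the infimum and supremum of these eigenvalues over the whole domain.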
2.2 Stochastic Gradient Descent in Discrete-Time
2.3 Stochastic Gradient Descent in Continuous-Time
2.4 EASGD in Continuous-Time
2.5 Background on Nonlinear Contraction Theory
The main mathematical tool used in this work is nonlinear contraction theory, a form of incremental stability for nonlinear systems. In particular, we specialize to the case of time- and state-independent metrics (further details can be found in Lohmiller & Slotine, 1998).
In this work, we interchangeably refer to , the system, and the generalized Jacobian as contracting depending on the context. In particular, for stochastic differential equations, we refer to as contracting if the deterministic system is contracting. Two specific robustness results for contracting systems needed for the derivations in this work are summarized below.
Corollary 6 is obtained by following the proof of theorem 8 in Pham et al. (2009), with the restriction that one system is deterministic. To reduce the appearance of decaying exponential terms, in applications of theorem 5, corollary 6, and other related contraction-based bounds, we will simply state the final constant and the corresponding rate of exponential transients. The conditions of theorem 5 are worthy of their own definition.
We will commonly refer to the auxiliary system in theorem 8 as a virtual system, and is said to be partially contracting. Theorem 8 enables the application of contraction to systems that in themselves are not contracting but can be embedded in a virtual system that is.
We require two main assumptions about the objective function , both of which have been employed in previous work analyzing synchronization and noise in nonlinear systems (Tabareau et al., 2010). The first is an assumption on the nonlinearity of the components of the gradient.
Assume that the Hessian matrix of each component of the negative gradient has bounded maximum eigenvalue, for all .
The second assumption is a condition on the robustness of the distributed gradient flows studied in this work to small, potentially stochastic perturbations.
Continuous dependence of trajectories on the parameters of the dynamics in the sense of assumption 2 can be characterized for deterministic systems through continuity assumptions on the dynamics (see, e.g., section 3.2 in Khalil, 2002). Here we assume a natural stochastic extension. Assumption 2 has been verified for FitzHugh-Nagumo oscillators where is a white noise process (Tuckwell & Rodriguez, 1998) and validated in simulation for more complex nonlinear oscillators (Tabareau et al., 2010). We remark that implies that almost surely, and hence that almost surely.
3 Synchronization and Noise
In this section, we analyze the interaction between synchronization of the distributed QSGD agents and the noise they experience. We begin with a derivation of a quantitative measure of synchronization that applies to a class of distributed SGD algorithms involving coupling to a common external signal with no communication delays. We then present the section's primary contribution, which will serve as a basis for the theory in the remainder of the article, as well as for the intuition for various experiments.
3.1 A Measure of Synchronization
We now present a simple theorem on synchronization in the deterministic setting, which will allow us to prove a bound on synchronization in the stochastic setting using theorem 5.
This theorem motivates a definition.
We will say the agents in a distributed algorithm globally exponentially synchronize if they all converge to one another exponentially regardless of initial conditions.
Theorem 11 gives a simple condition on the coupling gain for synchronization of the individual agents in equation 3.1. Because can represent any input, theorem 11 applies to any dynamics of the quorum variable: with , it applies to the QSGD algorithm, and with , it applies to the EASGD algorithm. Under the assumption of a contracting deterministic system, we can use the stochastic contraction results in theorem 5 to bound the expected distance between individual agents in the stochastic setting.
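The effect of the coupling gain on synchronization can be illustrated with a minimal deterministic sketch that integrates a coupled gradient flow with forward Euler steps. The double-well loss, the initial conditions, the step size, and the names `simulate` and `quorum` are all assumptions made for illustration. The `"filtered"` branch assumes a quorum variable that relaxes toward the spatial mean at a rate proportional to the coupling gain and the number of agents, in the spirit of EASGD; it is not taken from the equations of this article.

```python
def grad(x):
    """Gradient of the hypothetical double-well loss f(x) = x**4 / 4 - x**2 / 2."""
    return x ** 3 - x

def simulate(k, quorum="mean", x0=(-0.5, 0.2, 0.9), dt=0.01, steps=2000):
    """Euler steps of dx_i/dt = -grad(x_i) + k * (z - x_i).

    Returns the final spread max(x) - min(x) of the agents.
    """
    xs = list(x0)
    z = sum(xs) / len(xs)  # quorum variable, initialized at the spatial mean
    for _ in range(steps):
        mean = sum(xs) / len(xs)
        if quorum == "mean":
            # QSGD-style coupling: the quorum variable is the spatial mean itself.
            z = mean
        new_xs = [x + dt * (-grad(x) + k * (z - x)) for x in xs]
        if quorum == "filtered":
            # EASGD-style coupling (assumed form): z low-pass filters the mean.
            z = z + dt * k * len(xs) * (mean - z)
        xs = new_xs
    return max(xs) - min(xs)

for gain in (0.0, 5.0):
    print(f"k = {gain}: final spread = {simulate(gain):.3e}")
print(f"k = 5.0 (filtered quorum): final spread = {simulate(5.0, 'filtered'):.3e}")
```

Without coupling, the agents settle into different wells and never synchronize. With a sufficiently large gain the differences between agents decay regardless of how the quorum variable evolves, because the common signal cancels in the difference between any two agents.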
We will refer to equation 3.4 as a synchronization condition.
3.2 Reduction of Noise Due to Synchronization
We now provide a mathematical characterization of how synchronization reduces the amount of noise felt by the individual QSGD agents. The derivation follows the mathematical procedure first employed in Tabareau et al. (2010) in the study of neural oscillators.
By assumption 2 and theorem 5, as and , the difference between trajectories of equation 3.11 and the unperturbed, noise-free system tends to zero almost surely, as the effects of both the stochastic disturbance and the additive noise term are eliminated in this simultaneous limit.
Theorem 14 demonstrates that for distributed SGD algorithms, roughly speaking, the noise strength is set by the ratio parameter at the expense of a distortion term, which tends to zero with synchronization. Whether this noise reduction is a benefit or a drawback for non-convex optimization depends on the problem at hand.
If the use of a stochastic gradient is purely as an approximation of the true gradient (e.g., due to single-node or single-GPU memory limitations), then synchronization can be seen as improving this approximation and eliminating undesirable noise while simultaneously parallelizing the optimization problem. The analysis in this section then gives rigorous bounds on the magnitude of noise reduction. The term could be measured in practice to understand the empirical size of the distortion, and could be increased until tends approximately to zero and the noise is reduced to a desired level.
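A minimal sketch of how these two quantities might be monitored is given below. It assumes that the distortion is the difference between the gradient evaluated at the spatial mean and the average of the agents' gradients (the sign convention is also an assumption), and it uses a hypothetical double-well loss and unit Gaussian noise in place of true stochastic gradient noise.

```python
import random
import statistics

def grad(x):
    """Gradient of the hypothetical double-well loss f(x) = x**4 / 4 - x**2 / 2."""
    return x ** 3 - x

def distortion(xs):
    """Assumed form: gradient at the spatial mean minus the mean of the gradients."""
    mean_x = sum(xs) / len(xs)
    mean_grad = sum(grad(x) for x in xs) / len(xs)
    return grad(mean_x) - mean_grad

def averaged_noise_std(p, trials=20000, seed=0):
    """Empirical standard deviation of the mean of p independent unit Gaussians."""
    rng = random.Random(seed)
    means = [sum(rng.gauss(0.0, 1.0) for _ in range(p)) / p for _ in range(trials)]
    return statistics.pstdev(means)

# The distortion shrinks as the agents synchronize around the same point.
loose = distortion([0.9, 1.1])
tight = distortion([0.99, 1.01])
print(f"distortion, loosely synchronized agents: {loose:.2e}")
print(f"distortion, tightly synchronized agents: {tight:.2e}")

# Averaging independent noise over p agents reduces its standard deviation.
std_1 = averaged_noise_std(1)
std_16 = averaged_noise_std(16)
print(f"noise std felt by the mean, p = 1:  {std_1:.3f}")
print(f"noise std felt by the mean, p = 16: {std_16:.3f}")
```

For this cubic gradient the distortion scales with the square of the spread between agents, so a tenfold reduction in spread reduces it by roughly a factor of one hundred.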
Many studies have reported the importance of stochastic gradient noise in deep learning, particularly in the context of generalization performance (Poggio et al., 2017; Zhu, Wu, Yu, Wu & Ma, 2018; Chaudhari & Soatto, 2018; Zhang et al., 2017). Furthermore, large batches are known to cause issues with generalization, and this has been hypothesized to be due to a reduction in the noise magnitude due to a higher in the ratio (Keskar, Mudigere, Nocedal, Smelyanskiy, & Tang, 2016). In this context, reduction of noise may be undesirable, and one may be interested only in the parallelization of the problem. Our analysis then suggests choosing high enough such that the quorum variable represents a meaningful average of the parameters, but low enough that the noise in the SGD iterations is not reduced. Indeed, in section 6, we will find the best generalization performance for low values of that still result in convergence of the quorum variable. For deep networks, the level of synchronization for a given value of will be both architecture and data set dependent.
The condition in theorem 11 is merely a sufficient condition for synchronization, and synchronization may occur for significantly lower values of than predicted by contraction in the Euclidean metric. However, independent of when synchronization exactly occurs, so long as there is a fixed upper bound as in equation 3.4, the results in this section will apply with the corresponding estimate of .
3.4 Extension to Multiple Learning Rates
3.5 Extension to Momentum Methods
Equations 3.18 and 3.19 also admit the as particular solutions, so that the agents globally exponentially synchronize with a rate . The lower bound on can be obtained by application of the result in Slotine (2003, example 3.8).
Hence, a bound similar to equation 3.4 can be derived just as in lemma 13. Because the dynamics are linear and because the dynamics are nonlinear only through the gradient of the loss, assumption 1 does not need to be modified. For , can be set to zero, so that coupling is only through the position variables.
4 An Alternative View of Distributed Stochastic Gradient Descent
In this section, we connect the discussion of synchronization and noise reduction with the analysis in Kleinberg et al. (2018), which interprets SGD as performing gradient descent on a smoothed loss in expectation. Specifically, we show that the reduction of noise due to synchronization can be viewed as a reduction in the smoothing of the loss function. This provides further geometrical intuition for the effect of synchronization on distributed SGD algorithms. It furthermore sheds light on why one may want to use low values of to prevent noise reduction in learning problems involving generalization, where optimization of the empirical risk rather than the expected risk introduces spurious defects into the loss function that may be removed by sufficient smoothing.
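The smoothing interpretation can be made concrete with a small numerical sketch. It convolves a hypothetical rugged one-dimensional loss with a Gaussian of standard deviation `sigma`, standing in for the effective noise level, and counts local minima on a grid. The loss, the grid, and all function names are assumptions chosen for illustration; the convolution is approximated by a normalized Riemann sum rather than computed exactly.

```python
import math

def loss(x):
    """Hypothetical rugged 1-D loss: a quadratic bowl plus fast oscillations."""
    return 0.5 * x * x + 0.5 * math.cos(10.0 * x)

def smoothed_loss(x, sigma, half_width=6.0, n=601):
    """Approximate E[loss(x - w)] for w ~ N(0, sigma**2) by a normalized Riemann sum."""
    total, weight_sum = 0.0, 0.0
    for j in range(n):
        w = sigma * half_width * (2.0 * j / (n - 1) - 1.0)
        weight = math.exp(-0.5 * (w / sigma) ** 2)
        total += weight * loss(x - w)
        weight_sum += weight
    return total / weight_sum

def count_local_minima(values):
    """Number of interior grid points strictly below both neighbours."""
    return sum(
        1
        for i in range(1, len(values) - 1)
        if values[i] < values[i - 1] and values[i] < values[i + 1]
    )

grid = [i / 100.0 - 2.0 for i in range(401)]
rough = count_local_minima([loss(x) for x in grid])
weakly_smoothed = count_local_minima([smoothed_loss(x, 0.05) for x in grid])
strongly_smoothed = count_local_minima([smoothed_loss(x, 0.5) for x in grid])
print("local minima, true loss:           ", rough)
print("local minima, weak smoothing:      ", weakly_smoothed)
print("local minima, strong smoothing:    ", strongly_smoothed)
```

In this toy setting a large smoothing width removes the spurious local minima, while a small width, corresponding to reduced noise, leaves the rugged structure of the true loss largely intact.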
4.1 The Effect of Synchronization on the Convolution Scaling
To better understand the interplay of synchronization and noise in SGD, we can consider several limiting cases. Consider a choice of corresponding to a fairly high noise level, so that the loss function is sufficiently smoothed for the iterates of SGD () to avoid local minima, saddle points, and flat regions, but so that the iterates would not reliably converge to a desirable region of parameter space, such as a deep and robust minimum.
For and sufficiently large, the quorum variable will effectively perform gradient descent on a minimally smoothed loss and will converge to a local minimum of the true loss function close to its initialization. Due to the strong coupling, the agents will likely get pulled into this minimum, leading to convergence as if a single agent had been initialized using deterministic gradient descent at , despite the high value of .
With an intermediate value of so that the agents remain in close proximity to each other, but not so strong that , the variables will be concentrated around the minima of the smoothed loss (the coupling will pull the agents together, but because , the smoothing will not be reduced in the sense of equation 4.2). The stationary distribution of SGD is thought to be biased toward concentration around degenerate minima of high volume (Banburski et al., 2019); the coupling force should thus amplify this effect and lead to an accumulation of agents in wider and deeper minima in which all agents can approximately fit. Eventually, if sufficiently many agents arrive in a single minimum, it will be extremely difficult for any one agent to escape, leading to a consensus solution chosen by the agents even at a high noise level.
4.3 Numerical Simulations in Non-Convex Optimization
In Figure 1a, there is no coupling and the distribution of final iterates for the agents is nearly uniform across the parameter space, with a slightly increased probability of convergence to the two deepest regions. The distribution of the quorum variable is sharply peaked around zero. As increases to in Figure 1b, the agents concentrate around the wide basins of the convolved loss function and avoid the sharp local minima of the true loss function. The distribution for the quorum variable is similar, but is too wide to imply reliable convergence to a minimum with loss near the global optimum.
As is increased further to in Figure 1c and in Figure 1d, performance increases significantly. The distribution of the agents is centered around the global optimum of the smoothed loss, and the distribution of the quorum variable is very sharp around the same minimum; this represents the regime in which the agents have chosen a consensus solution. As demonstrated by Figure 1a, this improved convergence is not possible with standard SGD. As is increased again in Figures 1e and 1f, the coupling force becomes too great, and performance decreases. There is no initial exploratory phase to find the deeper regions of the landscape, and convergence is simply near the initialization of .
These simulation results suggest a useful combination of high noise, coupling, and traditional learning rate schedules. High noise levels can lead to rapid exploration and avoidance of problematic regions in parameter space, such as local minima, saddle points, or flat regions, while coupling can stabilize the dynamics toward a distribution around a wide and deep minimum of the convolved loss. The learning rate can then be decreased to improve convergence to minima of the true loss that lie within the spread of the distribution. In the uncoupled case, similar levels of noise would lead to a random walk.
This intuition is supported by the simulation results in Figure 2. The same simulation parameters are used, except the learning rate is now decreased by a factor of two every 4000 iterations until , where it is fixed. In the uncoupled case in Figure 2a, the schedule only slightly improves convergence around minima of the smoothed loss when compared to Figure 1a. Figure 2b again reflects a mild improvement relative to Figure 1b. For the two best values of and in Figures 2c and 2d, convergence of the agents and the quorum variable around the deepest minimum of the true loss that lies within the distribution of the agents in Figures 1c and 1d is excellent. In the very high regime in Figures 2e and 2f, the coupling force is too strong to enable exploration, and convergence is again near the initialization of , but now to the minima of the true loss.
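The schedule described above, halving at fixed intervals down to a floor, can be sketched as follows. The initial rate and the floor `eta_min` are placeholder values, and the function name `halving_schedule` is hypothetical.

```python
def halving_schedule(eta0, iteration, period=4000, eta_min=1e-3):
    """Halve the learning rate every `period` iterations, never going below `eta_min`."""
    return max(eta0 * 0.5 ** (iteration // period), eta_min)

# Placeholder initial rate of 0.1; the rate is held at eta_min once it is reached.
for it in (0, 3999, 4000, 8000, 40000):
    print(it, halving_schedule(0.1, it))
```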
Figure 3a is identical to Figure 1a except for the difference in learning rate: the agents converge uniformly across the parameter space. As is increased to in Figure 3b, the distribution of the agents becomes more localized around the center of parameter space but not around any minima. When is increased to in Figure 3c, in Figure 3d, and in Figure 3e, the distributions of the agents and the quorum variable become localized on the two deepest minima of the convolved loss but are still too wide for reliable convergence. The value in Figure 3f leads to reliable convergence around the deep minimum on the right and would combine well with a learning rate schedule as in Figure 2. Overall, the trend is similar to the case without momentum, though much higher values of are tolerated before degradation in performance. Despite high values rapidly pulling the agent positions close to , significant differences in the velocities of the agents prevent convergence to a local minimum nearby in the high regime.
Equation 4.7 represents a separable sum of double-well loss functions with pairwise sinusoidal coupling between all parameters. We include 1000 agents in each of 250 simulations per value with . Each simulation runs for 10,000 steps. The parameters are updated according to the vector forms of equations 4.5 and 4.6 with and . No learning rate schedule is used. The agents are all randomly initialized uniformly in , and each experiences an i.i.d. noise term . is fixed at 50.
For visualization purposes, we plot the contours of a two-dimensional cross section of the loss function by evaluating the last coordinates at the value . This value was chosen to represent the bottom-left cluster apparent in Figures 5 and 6; it also lies close to the global minimum of the uncorrupted loss function . Visualization of high-dimensional loss functions is difficult, and using such a cross section has its drawbacks; in particular, a saddle point may show up as a local minimum, correctly as a saddle point, or as a local maximum depending on the cross section taken. Nevertheless, the employed cross sections enable qualitative visualization of the clustering of the quorum variable and the individual agents and provide assurance that the general phenomena seen in one dimension in Figures 1 to 3 generalize naturally to higher dimensions.
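A two-dimensional cross section of this kind can be computed as in the sketch below, which varies the first two coordinates on a grid and pins the remaining coordinates at a fixed value. The stand-in loss `toy_loss` is a plain separable sum of double wells without the sinusoidal coupling, so it is not equation 4.7; the dimension, the pinned value, and the grid are likewise hypothetical.

```python
def toy_loss(x):
    """Hypothetical stand-in loss: a separable sum of double wells (not equation 4.7)."""
    return sum((xi * xi - 1.0) ** 2 for xi in x)

def cross_section(loss, n, fixed_value, axis_values):
    """Evaluate `loss` on a 2-D grid over the first two coordinates.

    The remaining n - 2 coordinates are pinned at `fixed_value`.
    """
    tail = [fixed_value] * (n - 2)
    return [[loss([a, b] + tail) for b in axis_values] for a in axis_values]

axis = [i / 10.0 - 2.0 for i in range(41)]
section = cross_section(toy_loss, n=10, fixed_value=-1.0, axis_values=axis)

# Locate the grid point with the smallest loss in this slice.
best = min(
    (value, axis[i], axis[j])
    for i, row in enumerate(section)
    for j, value in enumerate(row)
)
print("lowest loss in the slice and its location:", best)
```

The resulting nested list can be passed directly to a contour plotting routine. As noted above, critical points of the slice need not be critical points of the full loss.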
The loss function itself is shown in Figure 4a, and the smoothed loss is shown in Figure 4b, which has significantly reduced complexity. Figure 4c displays the loss value of the quorum variable, averaged over all simulations, as a function of iteration number for a set of possible values. The results are much the same as those described qualitatively in one dimension. Low values of such as and do not successfully minimize the loss function as the agents are too spread out. Despite a significant ability to explore the loss landscape with such small coupling, the agents are not concentrated enough for to represent a meaningful average. As increases, the ability to optimize the loss function at first significantly improves. While better than and , still represents the regime of too little coupling. and obtain much lower loss values than and , with achieving the lowest loss of the displayed values. As is increased further, performance starts to degrade. performs worse than , and obtains similar performance to . Increasing to , , and continues to deteriorate the ability of the algorithm to minimize the loss. The optimum value represents, for the given noise level and loss function, the correct balance of exploration and resistance to noise.
As in the case of any algorithmic hyperparameter, it is natural to expect that there will be an optimum value of . To see that the manifestation of this optimum is precisely a high-dimensional analog of the qualitative behavior observed in the one-dimensional simulations in Figures 1 to 3, we visualize the final points found by the quorum variable and a random selection of 25 agents per simulation in Figures 5 and 6, respectively, for a representative subset of the values seen in Figure 4c.
Figure 5a shows that results in essentially uniform convergence of the agents across the parameter space to local minima and saddle points, and hence the quorum variable simply converges near the origin in Figure 6a. The small amount of coupling in Figure 5b leads to increased, but still insufficient, clustering of the agents. This manifests itself in Figure 6b as a shift of the ball of quorum convergence points toward the bottom left corner. and in Figures 5c and 5d have significantly improved convergence, with strong clustering of the agents in four balls around . These clusters are located near the minima of the uncorrupted loss function, which occur at .
and have similar quorum convergence plots in Figures 6c and 6d, though the value of the loss in Figure 4c is noticeably different at iteration 10,000. The differences in the loss function values for the quorum variables are likely hidden by the low-dimensional visualization method. Figures 5c and 5d show that has more “straggler” agents between the four corner clusters than , which may shift the quorum convergence points uphill. From a qualitative perspective, both are good choices for tracking minima of the uncorrupted or the non-smoothed loss functions and could be combined with a learning rate schedule to improve convergence from the cloud of “starting points” in Figures 5c and 5d.
As is increased further to , the coupling begins to grow too strong. The distinct agent clusters attempt to merge, as seen in Figure 5e. The result of this is seen in Figure 6e, where there are scattered quorum convergence points between the clusters. Finally, for , the coupling is too great, and both the agents and the quorum variable converge near the origin in Figures 5f and 6f, respectively.
Taken together, Figures 1 to 6 provide significant qualitative insight into the convergence of distributed SGD algorithms, both with and without momentum. In one-dimensional and high-dimensional simulations, there is an optimum level of coupling that represents an ideal balance between the ability of the agents to explore the loss function and the concentration of the distribution of final iterates. Pushing too high will lead to convergence near the initialization of and ultimately to reduced smoothing of the loss function, while setting too low will lead to poor convergence of the quorum variable due to a lack of clustering of the agents. Intermediate values of lead to concentration of the agents around deep and wide minima of the smoothed loss, which will generally lie close to the minima of the uncorrupted loss; convergence can be improved from here with a learning rate schedule.
The optimum value of is set by the size of the gradients in comparison to the noise level. In the simulation setup used here, this corresponds to a trade-off between the value of , which sets the gradient magnitudes, and the width of the noise distribution. By setting the width of the noise distribution very high, the optimum value can be shifted to a large value, so that numerical stability issues arise before performance begins to degrade. Similarly, with small width and small , the optimum value of can be very small. In section 6, we will see a manifestation of a similar phenomenon in deep networks for the testing loss.
5 Convergence Analysis
We now provide contraction-based convergence proofs for QSGD and EASGD in the strongly convex setting. In the original work on EASGD, rigorous bounds were found for multivariate quadratic objectives in discrete-time, and the analysis for a general strongly convex objective was restricted to an inequality on the iteration for several relevant variances (Zhang et al., 2015). The results in this section thus extend previously available convergence results for EASGD and contain new results for QSGD. We furthermore present convergence results for QSGD with momentum.
A significant theme of this section is that the general methodology of theorem 14 can be applied to produce bounds on the expected distance of the quorum variable from the global minimizer of a strongly convex function, again split into a sum of two terms: one based on the averaged noise and one based on bounding the distortion vector . We also demonstrate in this section that an optimality result obtained for EASGD in discrete time in Zhang et al. (2015) can be obtained through a straightforward application of stochastic calculus in continuous time, and that the same result applies for QSGD.
5.1 QSGD Convergence Analysis
We first present a simple lemma describing the convergence of deterministic distributed gradient descent with arbitrary coupling.
If is -strongly convex, will be contracting in the identity metric with rate .
If is locally -strongly convex, will be locally contracting in the identity metric with rate . For example, for a nonconvex objective with initializations in a strongly convex region of parameter space, we can conclude exponential convergence to a local minimizer for each agent.
If is strongly convex, the coupling between agents provides no advantage in the deterministic setting, as they would individually contract toward the minimum regardless. For stochastic dynamics, however, coupling can improve convergence. We now demonstrate the ramifications of the results in section 3 in the context of QSGD with the following theorem.
As in section 3, the bound 5.3 consists of two terms. The first term originates from a lack of complete synchronization and can be decreased by increasing . The second term comes from the additive noise and can be decreased by increasing the number of agents. Both terms can be decreased by decreasing , as this ratio sets the magnitude of the noise, and hence the size of both the disturbance and the noise term.
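The dependence of the noise term on the number of agents can be checked numerically in the simplest strongly convex case. The sketch below applies an Euler-Maruyama discretization to coupled agents on the hypothetical quadratic loss f(x) = x**2 / 2, where the gradient is linear and the distortion therefore vanishes, so only the averaged noise term remains. The noise scale `sigma`, the coupling gain, the step size, and the function name are assumptions for illustration, not the setup used in this article.

```python
import math
import random

def mean_square_of_quorum(p, k=1.0, sigma=1.0, dt=0.02,
                          steps=50000, burn_in=5000, seed=0):
    """Euler-Maruyama for dx_i = (-x_i + k * (mean - x_i)) dt + sigma dW_i.

    Returns the time average, after burn-in, of mean**2, the squared distance
    of the spatial mean from the minimizer at 0.
    """
    rng = random.Random(seed)
    xs = [0.0] * p
    scale = sigma * math.sqrt(dt)
    total = 0.0
    for step in range(steps):
        mean = sum(xs) / p
        xs = [
            x + dt * (-x + k * (mean - x)) + scale * rng.gauss(0.0, 1.0)
            for x in xs
        ]
        if step >= burn_in:
            total += mean * mean
    return total / (steps - burn_in)

single = mean_square_of_quorum(1)
many = mean_square_of_quorum(16)
print(f"p = 1:  mean squared deviation of the quorum variable = {single:.4f}")
print(f"p = 16: mean squared deviation of the quorum variable = {many:.4f}")
print(f"ratio = {single / many:.1f}")
```

For this linear gradient the coupling terms cancel in the spatial mean, so the reduction comes entirely from averaging independent noise across agents; for a general strongly convex loss the distortion term would also contribute.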
State- and time-dependent couplings of the form are also immediately applicable with the proof methodology above. For example, increasing over time can significantly decrease the influence of the first term in equation 5.3, leaving only a bound essentially equivalent to linear noise averaging. For non-convex objectives, this suggests choosing low values of in the early stages of training for exploration and larger values near the end of training to reduce the variance of around a minimum. By the synchronization and noise argument in section 3 and the considerations in section 4, this will also have the effect of improving convergence to a minimum of the true loss function rather than the smoothed loss. If accessible, local curvature information could be used to determine when is near a local minimum and therefore when to increase . Using state- and time-dependent couplings would change the duration of exponential transients, but the result in theorem 20 would still hold.
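As one hedged illustration of a time-dependent coupling, the sketch below ramps the gain linearly over training and measures the spread of the agents early and late in the run. The linear ramp, the gain values, the quadratic objective f(x) = x²/2, and the function names are all assumptions made for this example; other schedules fit the same pattern.

```python
import random

def coupling_schedule(step, total_steps, k_low=0.1, k_high=5.0):
    """Linearly ramp the coupling gain from k_low to k_high (hypothetical values)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return k_low + frac * (k_high - k_low)

def spread_early_and_late(p=8, eta=0.1, sigma=1.0, total_steps=5000, seed=0):
    """Run noisy coupled descent on f(x) = x**2 / 2 with the ramped gain.

    Returns the mean squared agent spread over the first and the last fifth
    of training.
    """
    rng = random.Random(seed)
    x = [1.0] * p
    window = total_steps // 5
    early, late = 0.0, 0.0
    for step in range(total_steps):
        k = coupling_schedule(step, total_steps)
        x_bar = sum(x) / p
        x = [xi - eta * (xi + sigma * rng.gauss(0.0, 1.0) + k * (xi - x_bar))
             for xi in x]
        x_bar = sum(x) / p
        spread = sum((xi - x_bar) ** 2 for xi in x) / p
        if step < window:
            early += spread / window
        elif step >= total_steps - window:
            late += spread / window
    return early, late
```

With a low gain early on, the agents remain spread out and can explore; as the gain increases, the spread contracts and the agents settle around a common point.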
5.2 EASGD Convergence Analysis
We now incorporate the additional dynamics present in the EASGD algorithm. First, we prove a lemma demonstrating convergence to the global minimum of a strongly convex function in the deterministic setting.
Theorem 22 demonstrates an explicit bound on the expected deviation of both the center of mass variable and the quorum variable from the global minimizer of a strongly convex function. As in the discussion after theorem 20, the results will still hold with state- and time-dependent couplings of the form , and the same ideas suggested for QSGD based on increasing over time can be used to eliminate the effect of the first term in the bound.
Theorem 22 is strictly weaker than theorem 20. The metric transformation used adds a factor of to the first quantity in the bound, and the assumption now depends on through the factor of in the top-left block of . Indeed, writing the matrix in block form, where as in theorem 20. Thus, the dependence of on is in general linear.
Because of this linear dependence on , the first term in the bound scales like , while the second is asymptotically independent of . This is not the case in theorem 20, where the first term is asymptotically independent of and the second term scales like . The unfavorable scaling of the bound in theorem 22 with implies that higher values of do not improve convergence for EASGD as they do for QSGD. These issues can be avoided by reformulating lemma 21 in the Euclidean metric, but this leads to the fairly strong restriction .
These observations highlight potential convergence issues for EASGD with large which are not present with QSGD. In line with these theoretical conclusions, we will empirically find stricter stability conditions on for EASGD when compared to QSGD for training deep networks in section 6. Nevertheless, in the context of non-convex optimization, higher values of can still lead to improved performance by affording increased parallelization of the problem and exploration of the landscape.
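The sensitivity of EASGD to the number of agents can also be seen in a toy discrete-time setting. The sketch below is an illustration under stated assumptions, not the analysis used in the theorems: it takes f(x) = x²/2 with no noise, starts all agents at the same point so that they remain identical, and assumes the center variable is pulled by the sum over all agents, as in the discrete-time updates of Zhang et al. (2015). The names `p`, `k`, and `eta` denote the number of agents, the coupling gain, and the step size.

```python
def easgd_symmetric(p, k=1.0, eta=0.1, steps=200):
    """Deterministic EASGD on f(x) = x**2 / 2 with p identical agents.

    By symmetry all agents share one value x, so only (x, center) is tracked.
    """
    x, center = 1.0, 1.0
    for _ in range(steps):
        new_x = x - eta * (x + k * (x - center))           # agent step
        new_center = center + eta * k * p * (x - center)   # pull from all p agents
        x, center = new_x, new_center
    return x, center

x_few, _ = easgd_symmetric(p=4)    # decays toward the minimizer at 0
x_many, _ = easgd_symmetric(p=64)  # same step size, but the iteration diverges
```

In this toy model the effective step of the center variable grows with the number of agents, so a step size that is stable for a few agents can become unstable for many.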
Less significantly, unlike in theorem 20, the bound in theorem 22 is applied to the combined vector rather than the quorum variable itself, and the contraction rate is used rather than in the virtual system bounds.8 Both of these facts weaken the result when compared to theorem 20. will in general be less than , as exemplified by the lower bound, equation 5.6.
5.3 QSGD with Momentum Convergence Analysis
We now present a proof of convergence for the QSGD algorithm with momentum. We first prove a lemma demonstrating convergence to the global minimum of a strongly convex, -smooth function. We consider the case of coupling only in the position variables; coupling additionally through the momentum variables is similar. We also restrict to the case of constant momentum coefficient for simplicity.
Note that in general, so long as is chosen to satisfy the lower bound of the preceding lemma, the QSGD with momentum system will be contracting in some metric. The given metric will depend on the value of —for example, chosen as suggested in the proof.
With lemma 23 in hand, we can now state a convergence result for QSGD with momentum.
Equation 5.10 is similar to the results for EASGD and QSGD. The bound is closer in spirit to the bound for QSGD without momentum, in that the two terms do not have poor dependencies on as they do for EASGD. However, the statement of the theorem is complicated by the expressions for the contraction rates and , the expressions for the minimum and maximum eigenvalues of the metric and , and the expression for in the metric transformation. Together, these four quantities create a more complex dependence of the bound on hyperparameters such as and . Nevertheless, the spirit is still the same as theorem 20, in that the first term originates from the disturbance and can be eliminated with synchronization, while the second term originates from the additive noise and can be eliminated by including additional agents.
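For concreteness, a discrete-time analogue of QSGD with momentum, coupled only through the positions, might look as follows. This is a sketch under assumptions: it uses a heavy-ball update, a noise-free quadratic objective f(x) = x²/2, and hypothetical values for the step size `eta`, momentum coefficient `delta`, and coupling gain `k`; the continuous-time system analyzed in the text may be parameterized differently.

```python
def qsgd_momentum(p=4, k=1.0, eta=0.1, delta=0.9, steps=500):
    """Heavy-ball descent on f(x) = x**2 / 2 with coupling in the positions only."""
    x = [float(i + 1) for i in range(p)]   # distinct initial positions
    v = [0.0] * p                          # momentum variables are not coupled
    for _ in range(steps):
        x_bar = sum(x) / p
        v = [delta * vi - eta * (xi + k * (xi - x_bar)) for xi, vi in zip(x, v)]
        x = [xi + vi for xi, vi in zip(x, v)]
    return x

final_positions = qsgd_momentum()  # all agents end close to the minimizer at 0
```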
5.4 Extensions to Other Distributed Structures
Similar results can be derived for many other possible distributed structures in an identical manner. We present one general formalism here, involving local state- and time-dependent couplings.
Individually state-dependent couplings of the form 5.11 or its quorum-mediated equivalent, equation 5.13, allow for individual gain schedules that depend on local cost values or other local performance measures. This can allow each agent to broadcast its current measure of success and shape the quorum variable accordingly. For example, the classification accuracy on a validation set for each agent could be used to select the current best parameter vectors and increase the corresponding coupling gains to pull other agents toward them.
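One way such a success-weighted quorum variable could be formed is sketched below. The softmax weighting, the `temperature` parameter, and the function name are hypothetical choices made for illustration and are not the specific rule used in the experiments; the sketch only assumes that each agent reports a validation accuracy and that the quorum variable is the gain-weighted average of the agents.

```python
import math

def success_weighted_quorum(params, val_accuracy, k_total=1.0, temperature=0.05):
    """Return per-agent gains and the gain-weighted quorum variable.

    Gains are a softmax of validation accuracy, scaled to sum to k_total
    (a hypothetical rule).
    """
    top = max(val_accuracy)
    raw = [math.exp((a - top) / temperature) for a in val_accuracy]
    total = sum(raw)
    gains = [k_total * r / total for r in raw]
    dim = len(params[0])
    quorum = [sum(g * x[d] for g, x in zip(gains, params)) / k_total
              for d in range(dim)]
    return gains, quorum

params = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]   # three agents, two parameters each
accuracy = [0.5, 0.6, 0.8]                      # agent 3 is currently the best
gains, quorum = success_weighted_quorum(params, accuracy)
```

Here the quorum variable lands close to the parameters of the most accurate agent, so the coupling term pulls the remaining agents toward it.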
5.5 Specialization to a Multivariate Quadratic Objective
In the original discrete-time analysis of EASGD in Zhang et al. (2015), it was proven that iterate averaging (Polyak & Juditsky, 1992) of leads to an optimal variance around the minimum of a quadratic objective. We now derive an identical result in continuous-time for the QSGD algorithm, demonstrating that this optimality is independent of the additional dynamics in the EASGD algorithm.
The assumption of state-independence can be justified in several ways. Theoretical analyses have demonstrated that the specific form of positive semidefinite does not affect the weak accuracy of the approximating stochastic differential equation 2.2 for SGD (Feng et al., 2018; Hu et al., 2017; Li et al., 2018), though it does affect the constant.9 For relevance to general non-convex optimization, we can assume that all agents have arrived sufficiently close to a minimum of the loss function that it can be approximately represented as a quadratic and that the noise covariance is approximately constant (Mandt et al., 2016, 2017). For deep networks, the noise covariance has been empirically shown to align with the Hessian of the loss (Sagun, Evci, Guney, Dauphin, & Bottou, 2017; Zhu et al., 2018), with theoretical justification for when this is valid provided in appendix A of Jastrzȩbski et al. (2017). For all agents in an approximately quadratic basin of a local minimum of a deep network, can then be taken to be constant such that , where is the approximately state-independent Hessian.
As in the discrete-time EASGD analysis, equation 5.15 is optimal in the sense of achieving the Fisher information lower bound and is independent of the coupling strength (Polyak & Juditsky, 1992; Zhang et al., 2015). The lack of dependence on the coupling is less surprising in this case, as it is not present in the dynamics. The optimality of this result, together with the comparison of theorems 20 and 22, suggests that the extra dynamics may not provide any benefit over coupling simply through the spatial average variable from the perspective of convex optimization. However, in section 6, we will show through numerical experiments on deep networks that EASGD tends to find networks that generalize better than QSGD. The benefits of EASGD must then go beyond basic optimization, and the extra dynamics may have a regularizing effect.
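The effect of iterate averaging on a quadratic objective can be checked with a short simulation. The sketch below is illustrative only: it assumes a single noisy sequence on f(x) = x²/2 (standing in for the spatial mean of the agents), a constant step size, and additive Gaussian noise, and it compares the last iterate with the running average of all iterates. All names and values are hypothetical.

```python
import random

def averaged_vs_last(eta=0.1, sigma=1.0, steps=2000, trials=20, seed=0):
    """Compare the last noisy iterate with the running (Polyak) average.

    Returns the mean squared error of each estimate of the minimizer at 0,
    averaged over independent trials.
    """
    rng = random.Random(seed)
    last_sq, avg_sq = 0.0, 0.0
    for _ in range(trials):
        x, avg = 0.0, 0.0
        for t in range(1, steps + 1):
            x -= eta * (x + sigma * rng.gauss(0.0, 1.0))
            avg += (x - avg) / t          # running mean of the iterates
        last_sq += x ** 2
        avg_sq += avg ** 2
    return last_sq / trials, avg_sq / trials

last_mse, averaged_mse = averaged_vs_last()
```

The averaged iterate has a much smaller mean squared error than the last iterate, since its variance decays with the length of the averaging window rather than saturating at a level set by the step size.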
6 Deep Network Simulations
We now turn to evaluate EASGD, QSGD, and one possible state-dependent variant of QSGD, equation 5.13, as learning algorithms for training deep neural networks on the CIFAR-10 data set. A significant goal of the section is to understand the role of synchronization and noise in training deep neural networks. We also seek to test the extensions proposed throughout this article, such as multiple learning rates, synchronization bounds allowing for independent initial conditions of the agents, and state-dependent coupling.
We obtain two primary results. The first is that less synchronization, when it still leads to reliable convergence of the quorum variable, results in the best generalization capabilities of the learned network. This is similar to the results of the model experiments performed in section 4.3, though those experiments revealed this to be true for general optimization rather than generalization. The observation of better generalization with reduced synchronization is in line with the comments of section 3.3 regarding noise and generalization in deep networks.
Our second primary result is the observation of an interesting regularizing property of EASGD, even in the single-agent case. Unlike QSGD with a single agent, EASGD does not reduce to standard SGD. We find that EASGD without momentum outperforms SGD with momentum and EASGD with momentum in the nondistributed setting.
6.1 Experimental Setup
We use a three-layer convolutional neural network based on the experiments in Zhang et al. (2015); each layer consists of a two-dimensional convolution, a ReLU nonlinearity, max-pooling with a stride of two, and BatchNorm (Ioffe & Szegedy, 2015) with batch statistics in both training and evaluation. The first convolutional layer has kernel size nine, the second has kernel size five, and the third has kernel size three. All convolutions use a stride of one and zero padding. Following the three convolutional layers, there is a single fully connected layer to which we apply dropout with a probability of 0.5. The input data are normalized to have mean zero and standard deviation one in each channel in both the training and test sets. Because we are interested in qualitative trends rather than state-of-the-art performance, we do not employ any data augmentation strategies. We use an 80/20 training/validation set split, and we use the cross-entropy loss. The stochastic gradient is computed using minibatches of size 128. The learning rate is set to initially unless otherwise specified. This value was chosen as the highest initial value of that remained stable throughout training for most values of , and the qualitative trends presented here were robust to the choice of learning rate (further simulations demonstrating this robustness are available in the supplemental information). We decrease the learning rate three times when the validation loss stalls:11 first by a factor of five, then by a factor of two the second and third times. This is done on a per-agent basis: the agents are allowed to maintain different learning rates. Because we are focused on the behavior of the algorithms rather than efficiency from the standpoint of a parallel implementation, the agents communicate with the quorum variable after each update.
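The per-agent learning rate rule described above can be sketched as a small helper, one instance per agent. This is a hedged reading of the procedure rather than the code used for the experiments: the class name, the relative tolerance `rel_tol`, and the initial learning rate in the usage lines are hypothetical, and the choice to restart the stall counter after each decrease is an assumption.

```python
class StallSchedule:
    """Decrease the learning rate when the validation loss stalls."""

    def __init__(self, lr, rel_tol=0.01, patience=5, factors=(5.0, 2.0, 2.0)):
        self.lr = lr
        self.rel_tol = rel_tol          # hypothetical relative change threshold
        self.patience = patience        # epochs the reference may stay unchanged
        self.factors = list(factors)    # divide by 5, then by 2, then by 2
        self.reference = None
        self.unchanged = 0

    def update(self, val_loss):
        """Record one epoch's validation loss and return the current learning rate."""
        if self.reference is None:
            self.reference = val_loss   # validation loss at the first epoch
            return self.lr
        if abs(val_loss - self.reference) > self.rel_tol * abs(self.reference):
            self.reference = val_loss   # loss moved enough: reset the reference
            self.unchanged = 0
        else:
            self.unchanged += 1
            if self.unchanged >= self.patience and self.factors:
                self.lr /= self.factors.pop(0)
                self.unchanged = 0      # assumption: restart the count after a decrease
        return self.lr

schedule = StallSchedule(lr=0.05)       # hypothetical initial learning rate
for loss in [1.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]:
    current_lr = schedule.update(loss)  # first decrease happens on the last epoch
```

Because each agent owns its own schedule object, the agents are free to end up with different learning rates.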
In each of the following simulations, the fully connected weights and biases are initialized randomly and uniformly where is the number of inputs. The convolutional weights use Kaiming initialization (He, Zhang, Ren, & Sun, 2015). In each comparison, the methods are initialized from the same points in parameter space, but the agents are not required to be initialized at the same location. In QSGD and SD-QSGD, the quorum variable is exponentially weighted with , and we test the convergence of . Note that because this variable is not coupled to the dynamics of the individual agents, this is still distinct from EASGD. Because we use momentum in nearly all experiments, we will refer simply to QSGD and EASGD. The non-momentum variant of EASGD, when used, will be referred to as EASGD-WM (EASGD without momentum).
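The exponentially weighted quorum variable used for evaluation can be sketched as follows. The weighting constant `gamma` and the function name are hypothetical, and scalar parameters are used for brevity. The point of the sketch is that the weighted variable is computed from the agents but never fed back into their updates, which is what distinguishes it from the center variable of EASGD.

```python
def ema_quorum(agent_history, gamma=0.9):
    """Exponentially weighted average of the spatial mean over time.

    agent_history: list over time steps of lists of per-agent scalar parameters.
    The result is used for evaluation only and does not affect the agents.
    """
    z = None
    for agents in agent_history:
        x_bar = sum(agents) / len(agents)          # spatial mean at this step
        z = x_bar if z is None else gamma * z + (1.0 - gamma) * x_bar
    return z

history = [[1.0, 3.0], [1.0, 3.0], [2.0, 4.0]]    # two agents, three time steps
smoothed = ema_quorum(history)
```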
6.2 Experimental Results
We first analyze the effect of on classification performance. We find that the best performance is obtained for the lowest possible fixed values of that still lead to convergence of the quorum variable. This is demonstrated in Figure 7 for the EASGD algorithm with initially and , where we observe the general trend that test accuracy improves as the coupling gain is decreased. Note that and , as well as (not shown), have too little synchronization for the quorum variable to reflect a meaningful average, and hence do not lead to good performance. Similar results hold for QSGD (not shown). We found not only the best performance for low, fixed but also the best scaling with the number of agents.13
There are several plausible explanations for the observation of improved generalization with reduced coupling. Lower values of allow for greater exploration of the optimization landscape, which intuitively should lead to better performance. As the measure of synchronization in Figure 7d tends to zero, the term in the dynamics will also tend to zero, and synchronization will begin to reduce the amount of noise felt by the individual agents. In neural networks, it is expected that this noise reduction will favor convergence to minima that do not generalize as well as those obtained with higher amounts of noise, as seen in Figure 7c.
Results for a comparison of QSGD and SD-QSGD are shown in Figure 8 for , and 64 with . QSGD is shown in solid lines, while SD-QSGD is shown in dashed; color indicates the number of agents (see the key in Figure 8a). Note that simply corresponds to SGD for both SD-QSGD and QSGD, as the coupling term vanishes for a single agent. In both cases, we see significant improvement in accuracy as the number of agents increases, most likely due to an improved ability of the agents to explore the landscape, along with a decrease in synchronization. The test loss and test error curves display interesting differences between the two algorithms; for and , the state-dependent formalism obtains mildly improved generalization relative to QSGD, as expected by the bias toward minima with lower validation loss. QSGD performs better for and ; SD-QSGD does not converge for .
We display a comparison of QSGD and EASGD in Figure 9, again for . QSGD tends to decrease the training loss further and more rapidly than EASGD; this is in line with earlier comments that, from an optimization perspective, the extra dynamics of the quorum variable offer no clear theoretical benefit. However, consistently across all experiments except for where it does not converge, EASGD generalizes better: the test loss is driven lower, and the test accuracy is higher than QSGD. A particularly interesting result is the single-agent case, where EASGD actually performs better than SGD with momentum.14 These observations suggest that the extra dynamics of the quorum variable may impose a form of implicit regularization that, to our knowledge, has not been observed before.
Motivated by this observation, we now compare EASGD with momentum, EASGD without momentum, and basic SGD with momentum in Figure 10 across a range of initial learning rates. Each algorithm is initialized from the same location, and each curve represents an average over three runs to reduce stochastic variability. The momentum algorithms use , and the two EASGD variants use . In general, EASGD with and without momentum (dashed and solid lines, respectively) both achieve higher test accuracy than SGD with momentum (dotted lines). Surprisingly, EASGD without momentum often performs better than EASGD with momentum.
To show that this trend is not an artifact of a poorly chosen momentum parameter, we have compiled additional data in Table 1 over a range of momentum parameters and learning rates. Each data point reported is again the result of an average over three independent runs, and each algorithm is initialized from the same location in each run. For simplicity, we report only the testing loss and testing error rather than the results on the training data. For all but one choice of and , EASGD-WM outperforms both EASGD and MSGD (SGD with momentum) in classification accuracy, demonstrating that the trend is robust to choice of learning rate and momentum value.
Table 1: Minimum test loss and minimum test error for EASGD-WM, EASGD, and MSGD over a range of learning rates and momentum parameters.
Notes: Each experiment was run three times, and the minimum was taken over the average trajectory. In each run, the algorithms were initialized from the same starting location. Surprisingly, EASGD-WM consistently achieves the lowest test error (all but one setting) and the lowest test loss (all but four settings) in comparison to EASGD and MSGD. For high learning rate and high , MSGD and EASGD eventually run into convergence issues, while EASGD-WM does not (error of 0.9 and test loss of 6.91 indicate convergence issues). Bold indicates the top performance of the three algorithms for each choice of and .
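The statistic described in the notes, the minimum of the run-averaged trajectory, differs from the average of the per-run minima. A small sketch with made-up numbers (the function name and the values are purely illustrative) makes the distinction concrete:

```python
def min_of_average(runs):
    """Average the runs epoch by epoch, then take the minimum over epochs."""
    averaged = [sum(vals) / len(vals) for vals in zip(*runs)]
    return min(averaged)

# Three hypothetical runs of three epochs each.
runs = [[3.0, 1.0, 2.0],
        [1.0, 3.0, 2.0],
        [2.0, 2.0, 2.0]]
reported = min_of_average(runs)                            # 2.0
mean_of_minima = sum(min(r) for r in runs) / len(runs)     # smaller: about 1.33
```

Taking the minimum after averaging avoids rewarding runs whose individual minima occur at different epochs.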
Much like SGD with momentum, single-agent EASGD-WM is a second-order system in time. It also maintains a similar computational complexity and requires storing only one extra set of parameters for the quorum variable.
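A minimal sketch of a single-agent EASGD-WM step is given below to make the storage and update pattern concrete. It assumes a plain Euler discretization with hypothetical values of the step size `eta` and coupling gain `k`, and uses the gradient of f(w) = |w|²/2 in the usage lines; the function name is also an assumption.

```python
def easgd_wm_step(x, center, grad, eta=0.1, k=1.0):
    """One single-agent EASGD step without momentum.

    x: current parameters; center: the one extra stored copy (quorum variable);
    grad: callable returning the (stochastic) gradient at x as a list.
    """
    g = grad(x)
    # The parameters follow the gradient and are pulled toward the center.
    new_x = [xi - eta * (gi + k * (xi - ci)) for xi, gi, ci in zip(x, g, center)]
    # The center relaxes toward the parameters, acting as a low-pass filter.
    new_center = [ci + eta * k * (xi - ci) for xi, ci in zip(x, center)]
    return new_x, new_center

x, center = [1.0, -2.0], [1.0, -2.0]
for _ in range(500):
    x, center = easgd_wm_step(x, center, grad=lambda w: list(w))
```

The pair (x, center) plays a role similar to the position and velocity pair in momentum methods: two coupled first-order updates that together form a second-order system in time.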
Returning to the distributed case, Figure 9d shows that EASGD and QSGD respond differently to the choice of .15 EASGD is less synchronized than QSGD in all cases. Hence, in the context of Figure 7, a possible explanation for the improved performance of EASGD when compared to QSGD is simply the observation that it tends to remain less synchronized.
To test this explanation, we use a scaling factor to roughly match the levels of synchronization between EASGD and QSGD. Results for are shown in Figure 11, and the synchronization curves are either approximately equal or EASGD remains more synchronized across all values of . Additional values of and are shown, and EASGD now converges for all attempted values of . QSGD continues to perform worse than EASGD on the test data due to an increased tendency to overfit. As the number of agents is increased, QSGD improves up to ; obtains roughly the same test performance. EASGD improves up to around and does not converge for (see Figure 11a; the curves in Figures 11b and 11d are covered by the insets, but EASGD obtains roughly 55% testing accuracy). In general, EASGD with agents obtains roughly the same performance as QSGD with agents. Interestingly, Figure 11d shows that the high stability issues for EASGD are not simply due to a lack of synchronization, as EASGD actually remains more synchronized than QSGD for for much of the training time. We offer a simple possible explanation for these stability issues in the supplemental information by analyzing discrete-time optimization of a one-dimensional quadratic objective. Another explanation is afforded by theorems 20 and 22, which reveal poor scaling with of both terms in the bound for EASGD when compared to QSGD. Together, these observations highlight stability issues in both continuous and discrete time.
As discussed in the text and the description of the experimental setup, our theory allows the agents to be initialized in different locations and to use distinct learning rates through individual learning rate schedules. In the original work on EASGD, it was postulated that starting the agents at different locations would break symmetry and lead to instability (Zhang et al., 2015). Similarly, a single learning rate was used for all agents. The above simulations demonstrate that starting from distinct locations and decreasing the learning rate on an individual basis is nonproblematic. We show in Figure 12 that starting from a single location leads to decreased performance. Surprisingly, Figure 12 also highlights that initializing the agents from multiple locations is critical for optimal improvement as the number of agents is increased.
7 Conclusion
In this article, we presented a continuous-time analysis of distributed stochastic gradient algorithms within the framework of stochastic nonlinear contraction theory. Through analogy with quorum-sensing mechanisms, we analyzed the effect of synchronization of the individual SGD agents on the noise generated by their stochastic gradient approximations. We demonstrated that synchronization can effectively reduce the noise felt by each of the individual agents and by their spatial mean. We further demonstrated that synchronization can be seen to reduce the amount of smoothing imposed by SGD on the loss function. Through simulations on model non-convex optimization problems, we provided insight into how the distributed and coupled setting affects convergence to minima of the smoothed loss and the true loss. We introduced a new distributed algorithm, QSGD, and proved convergence results for a strongly convex objective for QSGD, QSGD with momentum, and EASGD. We further introduced a state-dependent variant of QSGD and constructed one specific example of the algorithm to show how the formalism can be used to bias exploration. We presented experiments on deep neural networks and compared the properties of QSGD, SD-QSGD, and EASGD for generalization performance. We noted an interesting regularizing property of EASGD even in the single-agent case and compared it to basic SGD with momentum, showing that it can lead to improved generalization. Research into similar higher-order in time optimization algorithms formed as coupled dynamical systems is an interesting direction of future work.
Appendix: Interaction between Synchronization and Noise: Extra Quorum Dynamics
For the remainder of this article, unless otherwise specified, we will use to denote the 2-norm.
For example, say with a symmetric and uniformly positive-definite matrix. Then satisfies this restriction requirement. The system is also contracting in , as the symmetric part of the Jacobian uniformly. The system has Jacobian , which has a symmetric part with unknown definiteness without further assumptions on .
Indeed, the covariance , where and the and are with respect to the positive semidefinite order. The covariance tends to zero as , so that Gaussian random variables drawn from a distribution with this covariance will become increasingly concentrated around zero with increasing . Because the true covariance is smaller in the positive semidefinite order, random variables drawn from the true distribution will also become concentrated around zero as .
Kleinberg et al. (2018) group the factor of with the covariance of the noise.
We choose a relatively high value of so that the convolved loss will be qualitatively different from the true loss to a degree that is visible by eye. This enables us to distinguish convergence to true minima from convergence to minima of the convolved loss. An alternative and equivalent choice would be to choose smaller, with a correspondingly wider distribution of the noise.
Note that without coupling, each agent performs basic SGD. Hence, the results in Figure 1a are equivalent to single-agent SGD simulations, where is the total number of simulations and is the number of agents per simulation.
In section 3, the denominator contained the factor rather than . Strong convexity of was not assumed, so that the contraction rate of the coupled system was . In this proof, strong convexity of implies that the contraction rate of the coupled system is .
The factor of in the first term remains, as this factor originates in the derivation of the bound on , where the synchronization rate is .
The state-dependent version used earlier in this work has been empirically shown to have a lower constant (Li et al., 2018), and is closer to the approximating SDE, which is why it has been used up to this point.
A similar continuous-time analysis for the averaging scheme considered here was performed in Mandt et al. (2017) for the non-distributed case; the derivation here is simpler and provides asymptotic results.
More precisely, we keep track of the validation loss for each agent at a reference point, beginning with the validation loss at the first epoch. If the validation loss at the next epoch changes by greater than of the reference point, the reference loss is set to the newly computed validation loss. If the validation loss changes by less than , the reference point is unchanged. When the reference point has been unchanged for five epochs, we decrease the learning rate.
Another option would be to set when this is positive and zero otherwise. This ensures, outside of the initial spiking period, that the total sum of the is constant. We found similar empirical results with both choices.
The improvement in test accuracy and in the minimization of the test loss with increasing number of agents is demonstrated in later plots. We found that this trend was maximized with lower values of .
Note that unlike QSGD with a single agent, EASGD with a single agent is a different algorithm from basic SGD. It can be seen as SGD coupled in feedback to a low-pass filter of its output.
Figure 9d shows the distance from for EASGD. The distance from for EASGD is nearly identical.
N.B. was supported by a Department of Energy Computational Science Graduate Fellowship. We gratefully thank the reviewers for helpful feedback and for suggestions to improve the work.