Although the number of artificial neural network and machine learning architectures is growing at an exponential pace, more attention needs to be paid to theoretical guarantees of asymptotic convergence for novel, nonlinear, high-dimensional adaptive learning algorithms. When properly understood, such guarantees can guide the algorithm development and evaluation process and provide theoretical validation for a particular algorithm design. For many decades, the machine learning community has widely recognized the importance of stochastic approximation theory as a powerful tool for identifying explicit convergence conditions for adaptive learning machines. However, the verification of such conditions is challenging for multidisciplinary researchers not working in the area of stochastic approximation theory. For this reason, this letter presents a new stochastic approximation theorem for both passive and reactive learning environments with assumptions that are easily verifiable. The theorem is widely applicable to the analysis and design of important machine learning algorithms including deep learning algorithms with multiple strict local minimizers, Monte Carlo expectation-maximization algorithms, contrastive divergence learning in Markov fields, and policy gradient reinforcement learning.
Although the number of artificial neural network and machine learning architectures is growing at an exponential pace, more attention needs to be paid to theoretical guarantees of asymptotic convergence for novel, nonlinear, high-dimensional adaptive learning algorithms. When properly understood, such guarantees can guide the algorithm development and evaluation process and, in addition, provide theoretical validation for a particular algorithm design. For many decades, the machine learning community has widely recognized the importance of stochastic approximation theory as a powerful tool for identifying explicit convergence conditions for adaptive learning machines. However, the verification of such conditions is challenging for multidisciplinary researchers not working in the area of stochastic approximation theory. For this reason, the goal of this letter is to present a new stochastic approximation theorem with easily verifiable assumptions for characterizing the asymptotic behavior of a wide range of important machine learning algorithms.
The new stochastic approximation theorem presented here is applicable to the analysis of the asymptotic behavior of a wide range of learning algorithms including (1) deep learning algorithms (Bottou, 1991, 1998, 2004; Bengio, Courville, & Vincent, 2013; Sutskever, Martens, Dahl, & Hinton, 2013; Zhang, Choromanska, & LeCun, 2015), (2) variable metric (Jani, Dowling, Golden, & Wang, 2000; Paik, Golden, Torlak, & Dowling, 2006; Roux, Manzagol, & Bengio, 2008; Schraudolph, Yu, & Günter, 2007; Sunehag, Trumpf, Vishwanathan, & Schraudolph, 2009) and momentum-type stochastic approximation schemes (Pearlmutter, 1992; Roux, Schmidt, & Bach, 2012; Sutskever et al., 2013; Zhang et al., 2015), (3) reinforcement learning and adaptive control (Jaakkola, Jordan, & Singh, 1994; Baird & Moore, 1999; Williams, 1992; Sugiyama, 2015; Sutton & Barto, 1998; Balcan & Feldman, 2013; Mohri, Rostamizadeh, & Talwalkar, 2012), (4) expectation-maximization methods for latent variable and missing data problems (Carbonetto, King, & Hamze, 2009; Gu & Kong, 1998), and (5) contrastive divergence learning in Markov random fields (Yuille, 2005; Hinton, Osindero, & Teh, 2006; Tieleman, 2008; Swersky, Chen, Marlin, & de Freitas, 2010; Salakhutdinov & Hinton, 2012). A critical feature of the theorem is that its statement and proof are specifically designed to provide relatively easily verifiable assumptions and interpretable conclusions that can be understood and applied by researchers outside the field of stochastic approximation theory.
Stochastic approximation theorems have played a vital role in characterizing our understanding of adaptive learning algorithms from the very beginning of work in machine learning (e.g., Amari, 1967; Duda & Hart, 1973). White (1989a, 1989b), Benveniste, Metivier, and Priouret (1990), Bottou (1991), Bertsekas and Tsitsiklis (1996), Golden (1996), Borkar (2008), Swersky et al. (2010), and Mohri et al. (2012) provide useful discussions of the application of stochastic approximation methods to machine learning problems. Kushner (2010), a seminal contributor to the development of stochastic approximation theory, provides an excellent review of the theoretical stochastic approximation literature from its origins in the 1950s.
The generic form of a stochastic approximation algorithm is defined as follows. Consider a learning machine whose parameter values at iteration of the learning algorithm are interpretable as the realization of a -dimensional random vector . The learning machine is provided an initial guess for the parameter estimates at iteration , which is denoted as . Then the learning machine observes a realization of a random vector called the training stimulus which is then used to update the parameters of the learning machine.
In the initial stages of learning, the search time period, the step-size is typically chosen to be either constant or to increase in value. During this phase of the learning process, the adaptive learning machine's dynamics in equation 1.1 have the opportunity to sample the statistical environment. Ideally, this time period should be sufficiently long so that there is an opportunity for the learning machine to observe the different types of training stimuli in its environment for the purpose of extracting critical statistical regularities. For example, if there are distinct training stimuli that occur with approximately equal probability in the environment, then choosing the time period for learning to be would ensure that each training stimulus will be approximately observed by the learning machine about 10 times during the initial search phase. After the initial search phase, the step-size is decreased at an appropriate rate to ensure convergence. This latter phase is called the converge time period.
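The search and converge time periods described above can be sketched numerically. The schedule below is an illustrative assumption rather than the schedule used in this letter: the names `step_size`, `gamma0`, `t_search`, and `tau` are hypothetical, the step size is held constant during the search period, and it decays like 1/t afterward.

```python
import numpy as np


def step_size(t, gamma0=0.1, t_search=1000, tau=1000.0):
    """Hypothetical search-then-converge step size at iteration t (t >= 0).

    Constant during the search period, then decaying like 1/t during the
    converge period. All constants are illustrative assumptions.
    """
    if t < t_search:
        return gamma0  # search period: constant step size
    return gamma0 / (1.0 + (t - t_search) / tau)  # converge period


steps = np.array([step_size(t) for t in range(200_000)])

# Partial sums: the sum of the step sizes keeps growing (roughly like log t),
# while the sum of the squared step sizes levels off.
sum_steps = np.cumsum(steps)
sum_sq_steps = np.cumsum(steps ** 2)
print(sum_steps[[999, 19_999, 199_999]])
print(sum_sq_steps[[999, 19_999, 199_999]])
```

With these constants, a divergent sum of step sizes together with a convergent sum of squared step sizes is the classical pattern required of decreasing step-size sequences in stochastic approximation; a finite simulation can only suggest, not prove, that pattern.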
Different choices of the search direction vector in equation 1.1 realize different popular stochastic descent algorithms such as stochastic gradient descent (Bottou, 1991, 1998), normalized stochastic gradient descent (Hazan, Levy, & Shalev-Shwartz, 2015), modified Newton (Jani et al., 2000; Paik et al., 2006; Roux et al., 2008; Schraudolph et al., 2007; Sunehag et al., 2009), and momentum-type stochastic gradient descent methods (Pearlmutter, 1992; Roux et al., 2012; Sutskever et al., 2013; Zhang et al., 2015). A standard assumption is that the dot product of the expected value of the search direction with the gradient of the objective function is less than or equal to zero.
Assume the stochastic sequence of -dimensional random vectors modeling the training stimuli are independent and identically distributed with common data generating process (DGP) probability density . In other words, each time the learning machine updates its parameters, the likelihood of observing a particular training stimulus at iteration is given by . The goal of an adaptive learning machine is to estimate (learn) the global minimizer, , of a smooth risk function , which specifies the learning machine's optimal behavior. In addition, let a smooth function be defined such that is the penalty, or “loss,” incurred by the learning machine for choosing parameter value for training stimulus where .
Several prior publications in the machine learning literature (White, 1989a, 1989b; Bottou, 1991, 1998; Golden, 1996; Mohri et al., 2012; Toulis, Rennie, & Airoldi, 2014) have provided explicit convergence theorems by considering parameter update equations of the form of equation 1.1 and assuming that the risk function has the form of equation 1.4. That is, at each parameter update, the training stimulus is sampled from the statistical environment using the probability density . This assumption, unfortunately, is not directly relevant to many important problems in the areas of (1) contrastive divergence learning (Yuille, 2005; Younes, 1999; Hinton et al., 2006; Tieleman, 2008; Swersky et al., 2010; Salakhutdinov & Hinton, 2012); (2) learning in the presence of missing data or latent variables (Gu & Kong, 1998; Carbonetto et al., 2009; Vlassis & Toussaint, 2009); and (3) active learning and adaptive control (Jaakkola et al., 1994; Baird & Moore, 1999; Williams, 1992; Sugiyama, 2015; Sutton & Barto, 1998; Balcan & Feldman, 2013; Vlassis & Toussaint, 2009). Such problems typically require that the training stimulus is sampled from a statistical environment specified by the current parameter estimates so that rather than sampling from the density , one samples from the density , where is the current knowledge state of the learning machine. These latter problems can be viewed as learning within a reactive learning environment.
In the machine learning literature, most of the focus has been on investigating the rate of convergence of stochastic approximation algorithms (Roux et al., 2012; Mohri et al., 2012). Analyses in the machine learning literature (Yuille, 2005; Sunehag et al., 2009; Mohri et al., 2012) include theorems for handling reactive learning environments but do not explain in detail how such theorems handle the case where the data generating process density is functionally dependent on and do not explicitly characterize the asymptotic behavior of the state sequence . In addition, such analyses often lack a discussion regarding how a stochastic approximation convergence theorem can be applied to situations where the objective function has multiple minimizers, maximizers, and saddle points. Blum (1954), Benveniste et al. (1990), Gu and Kong (1998), Kushner (1981), Younes (1999), and Delyon, Lavielle, and Moulines (1999) have provided explicit assumptions and proofs of convergence theorems for stochastic reactive learning environments; however, these theorems and their assumptions may be difficult to apply in practice for readers without a background in stochastic approximation theory.
Clarity of understanding is important to ensure that such theorems can be properly and confidently applied in practice since the algorithms they describe are widely used in the field of machine learning. An important contribution of this letter is providing a relatively simple set of assumptions and a straightforward detailed discussion intended to support the mathematical analysis of a wide range of adaptive learning algorithms. Furthermore, it is hoped that as a result of the analyses presented here, the importance of prior contributions to the stochastic approximation theorem literature will be better appreciated and this analysis will serve as a stepping-stone to advanced study in this important area.
2 Overview of the New Convergence Theorem
The new stochastic approximation theorem that minimizes the reactive environment learning risk function in equation 1.5, as well as the passive learning risk function in equation 1.4, is similar to analyses by Andrieu, Moulines, and Priouret (2005), Blum (1954), Kushner (1981, theorem 1), White (1989a, 1989b), Benveniste et al. (1990; appendix to part II), Bertsekas and Tsitsiklis (1996, proposition 4.1, p. 141), Gu and Kong (1998), and Delyon et al. (1999, theorem 1). With respect to the machine learning literature, the theorem and its proof are most closely related to the analysis of Sunehag et al. (2009). However, the assumptions, conclusions, and proof of this theorem are specifically designed to be easily understood by machine learning researchers working outside the field of stochastic approximation theory. The accessibility of these theoretical results is fundamentally important for the development of the field of machine learning to ensure that such results are correctly applied in specific applications. In addition to having conditions that are easily verifiable, the stochastic approximation theorem introduced here is applicable to a wide range of situations commonly encountered in practical machine learning problems.
If the Hessian of the objective function is positive definite everywhere on the parameter space, the theorem provides conditions ensuring convergence to the unique strict global minimum of the objective function. However, if the objective function has multiple minima, maxima, and saddle points, then the new stochastic approximation theorem is still applicable. In this latter nonconvex optimization case, the theorem provides the weaker conclusion that the sequence of algorithm-generated parameter estimates will converge to the set of critical points with probability one or the algorithm will generate a sequence of parameter estimates that are not bounded with probability one.
Note the terminology that an event occurs “with probability one” means there is a zero probability that the event will not occur. For example, if the stochastic sequence converges to some set with probability one, this means that the probability of observing any realization that deterministically converges to is exactly equal to one and the probability of observing any realization that does not converge to is exactly equal to zero.
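Convergence to a set is weaker than convergence to a point, and a small numerical illustration may help. The sketch below uses a hypothetical deterministic sequence (it is not generated by any algorithm in this letter) that alternates between shrinking neighborhoods of two points `a` and `b`. The sequence has no limit, yet its distance to the set {a, b} goes to zero.

```python
import numpy as np

a, b = -1.0, 2.0  # two hypothetical points forming the limit set


def theta(t):
    """Alternate between neighborhoods of a and b that shrink like 1/t."""
    target = a if t % 2 == 0 else b
    return target + 1.0 / (t + 1)


def distance_to_set(x, points=(a, b)):
    """Distance from x to a finite set of points."""
    return min(abs(x - p) for p in points)


ts = np.arange(1, 10_001)
dist = np.array([distance_to_set(theta(t)) for t in ts])
jumps = np.array([abs(theta(t + 1) - theta(t)) for t in ts])

print(dist[-1])   # distance to {a, b} is tiny
print(jumps[-1])  # consecutive terms are still about |b - a| apart
```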
3 A Practical Convergence Analysis Recipe
In this section, a procedure for applying the new stochastic approximation theorem is provided. Section 5 provides a formal statement and proof of the theorem.
The assumption that a stochastic sequence is bounded means that there exists some finite number such that with probability one. Here, the random vector corresponds to an experiment that generates a training stimulus vector . If the random vector is a discrete random vector restricted to take on a finite number of values (e.g., a -dimensional binary random vector ), then this is a sufficient condition for the stochastic sequence to be bounded.
A sufficient condition for to be called a twice continuously differentiable random function is if is a continuous function of and the second derivative of , , is a continuous function on the -dimensional parameter space .
The conclusion of the convergence theorem states that the stochastic sequence of parameter estimates either (1) is not confined to a closed, bounded, and convex region, , of the parameter space with probability one, or (2) converges to the set of critical points in with probability one. For example, if the stochastic sequence of parameter estimates converges to a set of two critical points of such that it oscillates between these two points forever with probability one, then the stochastic sequence of parameter estimates is said to converge to this set of two critical points with probability one.
Step 1: Identify the statistical environment. A reactive statistical environment is modeled as a sequence of bounded, independent, and identically distributed -dimensional random vectors with common density where . The density is not functionally dependent on for passive statistical environments.
Step 2: Check is twice continuously differentiable with a lower bound. Since is assumed bounded and it will be assumed that is a bounded stochastic sequence, this assumption is satisfied provided that and are twice continuously differentiable random functions and defined such that for all : That is, where the expectation is taken with respect to . It is also assumed that has a lower bound on .
Step 3: Define the region of convergence. Let be a closed, bounded, and convex subset of .
Step 4: Check the annealing schedule. Define a sequence of step sizes that satisfies equations 5.1 and 5.2. In the context of adaptive learning, corresponds to the adaptive learning algorithm's “learning rate.” For example, the step-size schedule where and positive generates a sequence that satisfies special constraints on the step-size sequence specified by equations 5.1 and 5.2. This particular step-size schedule initially increases the step size and then eventually decreases it. The constant should be chosen to be large enough that the learning algorithm observes a sufficiently rich sample of its statistical environment to support learning. The constant should be the same order of magnitude as . So, for example, if the learning machine observes distinct training stimuli with approximately equal probability and only one training stimulus is observed per iteration, then might be chosen to be so that each training stimulus is observed approximately 10 times during both the search and the converge phases of the learning process.
Step 5: Identify the search direction function. Let be a piecewise continuous function on for each . Rewrite the learning rule for updating parameter estimates using the formula where the search direction random vector , and is a bounded stochastic sequence. A sufficient condition for to be a bounded stochastic sequence is that there exists a piecewise continuous function on a finite partition of such that for all since and are bounded stochastic sequences by assumption.
Step 6: Show the average search direction is downward. Assume there exists a series of functions such that . Show that there exists a positive number such that . For example, choosing (3.1) yields the standard stochastic gradient descent direction so that .
Step 7: Investigate asymptotic behavior. Let be the set of critical points in . Conclude that with probability one either (1) the stochastic sequence does not remain in Θ for all for some positive integer , or (2) as .
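The recipe can be walked through on a toy problem. The sketch below is an illustration under stated assumptions, not an example from this letter: it uses a passive environment with bounded uniform stimuli, a squared-error loss, plain stochastic gradient descent as the search direction, and a hypothetical search-then-converge step size. The names `sample_stimulus`, `loss_grad`, `step_size`, and `w_true` are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])  # hypothetical data generating parameters


def sample_stimulus(rng):
    """Step 1: bounded i.i.d. training stimulus (x, y), passive environment."""
    x = rng.uniform(-1.0, 1.0, size=2)
    y = x @ w_true + rng.uniform(-0.1, 0.1)
    return x, y


def loss_grad(theta, x, y):
    """Gradient of the loss 0.5 * (y - x @ theta)**2 with respect to theta."""
    return -(y - x @ theta) * x


def step_size(t, gamma0=0.5, t_search=500, tau=500.0):
    """Step 4: hypothetical search-then-converge schedule."""
    return gamma0 if t < t_search else gamma0 / (1.0 + (t - t_search) / tau)


# Step 6: check that the average search direction points downhill at a
# test point. For this model the risk gradient is (theta - w_true) / 3,
# because each element of x has variance 1/3 and the noise has mean zero.
theta_test = np.array([0.5, 0.5])
risk_grad = (theta_test - w_true) / 3.0
directions = np.array(
    [-loss_grad(theta_test, *sample_stimulus(rng)) for _ in range(20_000)]
)
downhill = directions.mean(axis=0) @ risk_grad
print(downhill, -(risk_grad @ risk_grad))  # both negative and nearly equal

# Steps 5 and 7: run the learning rule and inspect where it settles.
theta = np.zeros(2)
for t in range(20_000):
    x, y = sample_stimulus(rng)
    theta = theta + step_size(t) * (-loss_grad(theta, x, y))
print(theta)  # close to w_true, the only critical point of this risk
```

Because the risk in this toy problem is a strictly convex quadratic, the set of critical points contains a single point, so the recipe's conclusion reduces to convergence to that point whenever the estimates remain bounded.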
Consider the important special case where the Hessian of is positive definite on even though is multimodal. The region can contain no critical points, exactly one critical point, or multiple critical points. If contains exactly one critical point in its interior, then that critical point is the unique global minimizer of on the interior of . The region may also contain one or more critical points of on its boundary corresponding to saddle points or local maximizers of on . For example, suppose that a smooth objective function has a strict local minimum at the point , a saddle point at , and a strict local maximum at the point . The function is positive definite on the set but no critical points exist in . The function is positive definite on the set and has a unique strict local minimizer at . The function is positive definite on the set and has two critical points located at (strict local minimizer) and (critical point on boundary of ).
4 Adaptive Learning Algorithm Applications
In this section, we discuss several examples of adaptive learning algorithms that can be analyzed using the stochastic approximation theorem for reactive environments presented in section 5.
4.1 Adaptive Learning in Passive Statistical Environments
In this section, some adaptive learning strategies for passive statistical environments are discussed. In such environments, the objective function is defined as in equation 1.4. It should be noted, however, that these adaptive learning strategies are applicable for reactive learning statistical environments as well where the objective function is defined as in equation 1.5.
Assume the observations are independent and identically distributed with common density .
The above methodology can also be used to implement different stochastic approximation variants of momentum, conjugate gradient, and limited-memory Broyden-Fletcher-Goldfarb-Shanno descent algorithms (Schraudolph et al., 2007; Jani et al., 2000; Paik et al., 2006), natural gradient descent methods (Schraudolph et al., 2007), and normalized gradient methods (Hazan et al., 2015).
In practice, one would set , yielding a gradient descent step in situations where the magnitude of is less than some positive number .
A random block coordinate descent algorithm (Razaviyayn, Hong, Luo, & Pang, 2014) can be realized within this proposed framework as well. Let denote the Hadamard product (element-by-element vector multiplication) operator. Let the set of -dimensional binary vectors be denoted by . Let be a -dimensional binary vector whose th element is a one if the th element of the -dimensional random vector is updated with information about training pattern at learning trial .
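A minimal sketch of the masked update follows. It assumes a squared-error loss, noise-free targets, an independent coin flip per coordinate for the binary mask, and a hypothetical decreasing step size; the names `random_mask`, `p_update`, and `w_true` are illustrative and do not come from this letter.

```python
import numpy as np

rng = np.random.default_rng(1)
w_true = np.array([1.0, -2.0, 0.5, 3.0])  # hypothetical target parameters


def loss_grad(theta, x, y):
    """Gradient of 0.5 * (y - x @ theta)**2 with respect to theta."""
    return -(y - x @ theta) * x


def random_mask(q, rng, p_update=0.5):
    """Binary vector whose k-th element is 1 if coordinate k is updated."""
    return (rng.random(q) < p_update).astype(float)


theta = np.zeros(4)
for t in range(40_000):
    x = rng.uniform(-1.0, 1.0, size=4)
    y = x @ w_true
    gamma = 0.2 / (1.0 + t / 2000.0)  # hypothetical decreasing step size
    m = random_mask(theta.size, rng)
    # Hadamard (element-by-element) product of the mask with the gradient:
    # coordinates with a zero mask entry are left unchanged on this trial.
    theta = theta - gamma * (m * loss_grad(theta, x, y))
print(theta)  # close to w_true
```

Since the mask is drawn independently of the training stimulus in this sketch, the expected search direction is the negative risk gradient scaled by `p_update`, so it still points downhill on average.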
4.2 Normalization Constants and Contrastive Divergence
Equation 4.12 cannot, however, be immediately used to derive a stochastic gradient descent algorithm that minimizes for the following reasons. The first term on the right-hand side of equation 4.13 is usually relatively easy to evaluate. But the second term on the right-hand side of equation 4.13 is usually very difficult to evaluate because it involves a computationally intractable multidimensional integration.
Note that the statistical environment used to generate the data for the stochastic approximation algorithm in equation 4.16 is not a passive statistical environment since the parameters of the learning machine are updated at learning trial not only by the observation but also by the observations whose joint distribution is functionally dependent on the current parameter estimates . Thus, contrastive-divergence algorithms of this type can be analyzed approximately using the theorem presented in section 5.
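The reactive character of this kind of update can be seen in a deliberately tiny model. The sketch below is an assumption-laden illustration, not the algorithm of equation 4.16: the model is a one-parameter distribution on {0, 1} with probability proportional to exp(theta * x), and the model sample is drawn exactly rather than by a short Markov chain as in contrastive divergence proper. The names `p_data`, `x_data`, and `x_model` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
p_data = 0.7  # hypothetical probability that an observed stimulus equals 1


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


theta = 0.0
for t in range(50_000):
    gamma = 1.0 / (1.0 + t / 100.0)  # hypothetical decreasing step size
    x_data = float(rng.random() < p_data)           # drawn from the environment
    x_model = float(rng.random() < sigmoid(theta))  # drawn from the current model
    # Stochastic gradient of the negative log-likelihood of the model
    # p(x | theta) proportional to exp(theta * x), x in {0, 1}: the model
    # expectation of x is replaced by a single sample from the model.
    grad = -x_data + x_model
    theta = theta - gamma * grad
print(sigmoid(theta))  # close to p_data
```

The distribution of `x_model` depends on the current value of `theta`, which is exactly what makes the learning environment reactive rather than passive.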
4.3 Missing Data, Hidden Variables, and the EM Algorithm
In this section, the problems of hidden variables and missing data are considered. The presence of hidden variables is not only a characteristic feature of latent variable models and deep learning architectures but can also be considered equivalent to the presence of data that are always missing.
For convenience, the -dimensional random vector is partitioned such that where is the observable component of and is the unobservable component whose probability distribution is functionally dependent only on a realization of . The elements of correspond to the visible random variables, while the elements of correspond to the hidden random variables or the missing data. Note that the dimensionalities of and will typically vary as a function of the positive integer index variable .
Note that can be chosen equal to 1 or any positive integer. In the case where , the resulting algorithm approximates the deterministic generalized expectation-maximization (GEM) algorithm (see McLachlan & Krishnan, 1996, for a formal definition of a GEM algorithm) in which the learning machine uses its current probabilistic model to compute the expected downhill search direction, takes a downhill step, updates its current probabilistic model, and then repeats this process in an iterative manner.
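The idea of imputing the unobservable component from the current model and then taking a downhill step can be sketched as follows. This is an illustrative toy problem, not the algorithm of this section: the model is a two-dimensional Gaussian with identity covariance and unknown mean, the second element is missing on about half of the trials, and `m` (a hypothetical name) is the number of imputations per trial. Gaussian stimuli are used for brevity even though they are not bounded, so the example illustrates the update rule rather than the theorem's assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true = np.array([1.0, -2.0])  # hypothetical mean of the complete data
m = 5                            # imputations per learning trial (m >= 1)

mu = np.zeros(2)
for t in range(50_000):
    gamma = 1.0 / (1.0 + t / 100.0)   # hypothetical decreasing step size
    s = mu_true + rng.normal(size=2)  # complete-data stimulus
    missing = rng.random() < 0.5      # second element unobserved half the time
    if missing:
        # Impute the hidden element m times from the current model N(mu[1], 1).
        hidden = mu[1] + rng.normal(size=m)
        completed = np.column_stack([np.full(m, s[0]), hidden])
    else:
        completed = s[None, :]
    # Average gradient of the complete-data negative log-likelihood
    # 0.5 * |s - mu|**2 over the imputed (or fully observed) stimuli.
    grad = (mu - completed).mean(axis=0)
    mu = mu - gamma * grad
print(mu)  # close to mu_true
```

Larger values of `m` average out more of the imputation noise in each step, moving the update closer to an expected downhill direction at the cost of more sampling per trial.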
4.4 Policy Gradient Reinforcement Learning
In this section, the stochastic approximation theorem developed here is applied to the problem of investigating the convergence of a class of reinforcement learning algorithms called policy gradient reinforcement learning machines (Williams, 1992; Sutton & Barto, 1998; Sugiyama, 2015). Suppose that a learning machine experiences a collection of episodes. The episodes are assumed to be independent and identically distributed. In addition, the th episode is defined such that where is called the initial state of episode and is called the final state of episode. The probability density of when the learning machine is embedded within a passive statistical environment is specified by the density where specifies the likelihood that is observed by the learning machine in its statistical environment.
On the other hand, for a reactive learning environment, the probability that the learning machine selects action given the current state of the environment and the learning machine's current state of knowledge is expressed by the conditional probability mass function , . The statistical environment of the learning machine is characterized by the probability density , specifying the likelihood of a given initial state of an episode and the conditional density , which specifies the likelihood of a final state of an episode given the learning machine's action and the initial state of the episode .
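A minimal score-function (REINFORCE-style; Williams, 1992) sketch of such a learning machine is given below. The episode structure, the two-state environment, the logistic policy, and the names `run_episode`, `penalty`, and `s0` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def run_episode(theta, rng):
    """One hypothetical episode: initial state, action, and resulting penalty."""
    s0 = rng.integers(2)                        # initial state from the environment
    a = int(rng.random() < sigmoid(theta[s0]))  # action from the current policy
    penalty = 0.0 if a == s0 else 1.0           # environment penalizes a != s0
    return s0, a, penalty


theta = np.zeros(2)  # one policy parameter per initial state
for t in range(20_000):
    gamma = 0.5 / (1.0 + t / 5000.0)  # hypothetical decreasing step size
    s0, a, penalty = run_episode(theta, rng)
    # Score-function estimate of the gradient of the expected penalty:
    # penalty * d/dtheta log pi(a | s0, theta) for a logistic policy.
    grad = np.zeros(2)
    grad[s0] = penalty * (a - sigmoid(theta[s0]))
    theta = theta - gamma * grad
print(sigmoid(theta))  # probability of choosing action 1 in states 0 and 1
```

In this toy problem the expected penalty has no critical point at finite parameter values, so the estimates keep drifting outward while the policy improves; this is the situation covered by the first alternative in the theorem's conclusion rather than convergence to a critical point.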
5 Formal Convergence Analysis of Learning
In this section, a formal statement and proof are provided for the stochastic approximation theorem, which covers minimization of the reactive environment risk function in equation 1.5 as well as the passive environment risk function in equation 1.4.
Although the specific theorem and proof presented here are novel, the obtained results and method of proof are very similar to many existing results in the literature. In particular, the statement and proof of the theorem follow a combination of arguments by Blum (1954), the appendix of Benveniste et al. (1990), and Sunehag et al. (2009) using the well-known Robbins-Siegmund lemma (Robbins & Siegmund, 1971; see Benveniste et al., 1990, appendix to part 2, or Douc, Moulines, & Stoffer, 2014, lemma C2, for relevant reviews).
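For reference, a commonly quoted form of the Robbins-Siegmund (almost supermartingale) lemma is sketched below. The symbols are chosen here for illustration and are not taken from this letter.

```latex
\textbf{Lemma (Robbins and Siegmund, 1971).}
Let $V_t$, $\beta_t$, $\xi_t$, and $\zeta_t$ be nonnegative random variables
that are measurable with respect to an increasing sequence of
$\sigma$-fields $\mathcal{F}_t$, and suppose that
\[
  E\left[ V_{t+1} \mid \mathcal{F}_t \right]
  \le (1 + \beta_t) V_t + \xi_t - \zeta_t
  \quad \text{for all } t .
\]
Then, on the event
$\left\{ \sum_t \beta_t < \infty, \; \sum_t \xi_t < \infty \right\}$,
with probability one:
(1) $V_t$ converges to a finite random variable, and
(2) $\sum_t \zeta_t < \infty$.
```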
The results presented here are similar to those obtained by Andrieu et al. (2005, theorem 2.3), Benveniste et al. (1990, appendix to part 2, pp. 344–347), Bertsekas and Tsitsiklis (1996, proposition 4.1, p. 141), Douc et al. (2014, theorem C.7), Kushner (1981, theorem 1), Kushner and Yin (1997, theorem 4.1), Mohri et al. (2012, theorems 14.7 and 14.8), and White (1989a, 1989b, theorem 3.1).
The terminology that a function is bounded means that for all , there exists a finite number such that . The terminology that a stochastic sequence is bounded means that there exists a finite number such that for all : with probability one where .
Let be a convex, closed, and bounded subset of . Let be a twice continuously differentiable function.
Let the gradient of be denoted as . Let the Hessian of be denoted as .
Let be a closed, bounded, and convex subset of . Let be a twice continuously differentiable function with a finite lower bound. Let . Let .
Assume has Radon-Nikodým density with respect to a sigma-finite measure for each .
Assume a positive number exists such that for all , the random vector with density satisfies with probability one.
Let be a sequence of positive real numbers such that and (5.1) (5.2)
Let be a piecewise continuous function on a finite partition of for all . When it exists, let
Let be a -dimensional random vector. Let be a sequence of -dimensional random vectors defined such that for , where such that is less than some finite number for , and the distribution of is specified by the conditional density . (5.3)
Assume there exists a positive number such that for all , (5.4)
If there exists a positive integer such that for all with probability one, then converges with probability one to the set of critical points of contained in .
Proof. Let with realization . Let with realization . Let with realization .
Step 1: Expand using a second-order mean value expansion. Expand about and evaluate at using the mean value theorem to obtain with (5.5) where the random variable can be defined as a point on the chord connecting and . Substituting the relation (5.6) into equation 5.5 gives (5.7)
Step 2: Identify conditions required for the remainder term of the expansion to be bounded. Since, by assumption, is a bounded stochastic sequence and is continuous, this implies that the stochastic sequence is bounded. In addition, by assumption, is a bounded stochastic sequence. This implies there exists a number such that for all , with probability one. (5.8)
Step 3: Show the expected value of objective function decreases. Taking the conditional expectation of both sides of equation 5.7 with respect to the conditional density and evaluating at and yields (5.9) Substituting the assumption and the conclusion of step 2 that with probability one into equation 5.7 gives (5.10)
Step 4: Show a subsequence of converges to zero wp1. Since has a lower bound, is a finite positive number, and equation 5.1 holds by assumption, then the almost supermartingale lemma can be applied to equation 5.10 on the set where and are bounded with probability one to obtain the conclusion that with probability one. (5.11) For some positive integer , let . The sequence is nonincreasing with probability one and bounded from below by zero, which implies that this sequence is convergent with probability one to a random variable (see theorem 5.1.1(vii); Rosenlicht, 1968, p. 50).
Step 5: Show that the stochastic sequence converges to a random variable wp1. From conclusion (1) of the almost supermartingale lemma, the stochastic sequence of converges to some unknown random variable, which will be denoted as with probability one. Since is continuous, this is equivalent to the assertion that converges with probability one to some unknown random variable, which will be denoted as such that with probability one. By the assumption that with probability one, every trajectory is confined to the closed, bounded, and convex set , it follows that with probability one.
Step 6: Show the stochastic sequence converges to zero wp1. Since is a continuous function, it follows that converges with probability one to . This is equivalent to the statement that every subsequence of converges to with probability one. That is, for every possible sequence of positive integers the stochastic subsequence converges with probability one to .
From step 4, there exists a sequence of positive integers, such that the stochastic subsequence converges with probability one to zero. Thus, to avoid a contradiction, every subsequence of converges to a random variable with probability one and additionally with probability one or, equivalently, converges to 0 with probability one. Since is a continuous function and by the assumption that with probability one, it follows that converges with probability one to . That is, converges with probability one to the set of critical points of in .
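The skeleton of the argument can be summarized in two lines. The sketch below uses notation assumed here for illustration: $V$ for the objective function with gradient $g$ and Hessian $H$, $\theta_t$ for the parameter estimates, $\gamma_t$ for the step sizes, $d_t$ for the search direction with $\theta_{t+1} = \theta_t + \gamma_t d_t$, $\tilde{\theta}_t$ for a point on the chord connecting $\theta_t$ and $\theta_{t+1}$, and $K$, $C$ for positive constants. It assumes the downhill condition takes the form $E[d_t \mid \theta_t]^{T} g(\theta_t) \le -K |g(\theta_t)|^{2}$ and that the remainder term satisfies $\tfrac{1}{2} d_t^{T} H(\tilde{\theta}_t) d_t \le C$.

```latex
\begin{align*}
V(\theta_{t+1})
  &= V(\theta_t) + \gamma_t\, g(\theta_t)^{T} d_t
     + \tfrac{1}{2}\gamma_t^{2}\, d_t^{T} H(\tilde{\theta}_t)\, d_t , \\
E\left[ V(\theta_{t+1}) \mid \theta_t \right]
  &\le V(\theta_t) - \gamma_t K \left| g(\theta_t) \right|^{2}
     + \gamma_t^{2} C .
\end{align*}
```

Under these assumptions, when $V$ is bounded below and $\sum_t \gamma_t^{2} < \infty$, the Robbins-Siegmund lemma applied to the second line gives that $V(\theta_t)$ converges and that $\sum_t \gamma_t |g(\theta_t)|^{2} < \infty$ with probability one; if in addition $\sum_t \gamma_t = \infty$, a subsequence of $|g(\theta_t)|^{2}$ must converge to zero.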