## Abstract

Optimization based on *k*-step contrastive divergence (CD) has become a common way to train restricted Boltzmann machines (RBMs). The *k*-step CD is a biased estimator of the log-likelihood gradient relying on Gibbs sampling. We derive a new upper bound for this bias. Its magnitude depends on *k*, the number of variables in the RBM, and the maximum change in energy that can be produced by changing a single variable. The latter reflects the dependence of the bound on the absolute values of the RBM parameters. The magnitude of the bias is also affected by the distance in variation between the modeled distribution and the starting distribution of the Gibbs chain.

## 1. Training RBMs Using Contrastive Divergence

An RBM with *m* visible units *V* = (*V*_{1}, …, *V*_{m}) and *n* hidden units *H* = (*H*_{1}, …, *H*_{n}) is parameterized by weights *w*_{ij} and bias parameters *b*_{j} and *c*_{i} (*i* ∈ {1, …, *n*}, *j* ∈ {1, …, *m*}), jointly denoted as θ. The joint distribution of (*v*, *h*) is the Gibbs distribution p(*v*, *h*) = e^{−E(*v*, *h*)}/Z with energy function

E(*v*, *h*) = −∑_{i=1}^{n} ∑_{j=1}^{m} w_{ij} h_{i} v_{j} − ∑_{j=1}^{m} b_{j} v_{j} − ∑_{i=1}^{n} c_{i} h_{i}.

In the following, we restrict our considerations to RBMs with binary units, for which

p(H_{i} = 1 | *v*) = sigmoid(∑_{j=1}^{m} w_{ij} v_{j} + c_{i}) and p(V_{j} = 1 | *h*) = sigmoid(∑_{i=1}^{n} w_{ij} h_{i} + b_{j})

with sigmoid(*x*) = (1 + exp(−*x*))^{−1}. The gradient of the log likelihood of a single training pattern *v* with respect to θ is

∂ log p(*v*)/∂θ = −∑_{h} p(*h* | *v*) ∂E(*v*, *h*)/∂θ + ∑_{v′, h} p(*v*′, *h*) ∂E(*v*′, *h*)/∂θ. (1.1)

*k*-step contrastive divergence (CD-*k*) learning (Hinton, 2002) approximates the second term by a sample obtained by *k* steps of Gibbs sampling. Starting from an example *v*^{(0)} of the training set, the Gibbs chain is run for only *k* steps, yielding the sample *v*^{(k)}. Each step *t* consists of sampling *h*^{(t)} from p(*h* | *v*^{(t)}) and subsequently sampling *v*^{(t+1)} from p(*v* | *h*^{(t)}). The gradient (1.1) with respect to θ of the log likelihood for one training pattern is then approximated by

CD_{k}(θ, *v*^{(0)}) = −∑_{h} p(*h* | *v*^{(0)}) ∂E(*v*^{(0)}, *h*)/∂θ + ∑_{h} p(*h* | *v*^{(k)}) ∂E(*v*^{(k)}, *h*)/∂θ.
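The Gibbs chain and the resulting parameter update can be sketched in a few lines of Python. This is our own minimal illustration, not the authors' code; the function name `cd_k_update` is hypothetical, and the sums over *h* in CD_{k} are computed analytically as expected hidden activations:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(W, b, c, v0, k=1):
    """One CD-k gradient estimate for a binary RBM.

    W: (n, m) weight matrix, b: (m,) visible biases,
    c: (n,) hidden biases, v0: (m,) binary training pattern.
    """
    v = v0.copy()
    for _ in range(k):
        # sample h^(t) from p(h | v^(t))
        ph = sigmoid(W @ v + c)
        h = (rng.random(ph.shape) < ph).astype(float)
        # sample v^(t+1) from p(v | h^(t))
        pv = sigmoid(W.T @ h + b)
        v = (rng.random(pv.shape) < pv).astype(float)
    # positive and negative phases use the expected hidden activations,
    # i.e., the sums over h in CD_k are evaluated in closed form
    ph0 = sigmoid(W @ v0 + c)
    phk = sigmoid(W @ v + c)
    grad_W = np.outer(ph0, v0) - np.outer(phk, v)
    grad_b = v0 - v
    grad_c = ph0 - phk
    return grad_W, grad_b, grad_c
```

Note that every component of these gradient estimates lies in [−1, 1] for binary units, in line with the remark below on the magnitude of the log-likelihood derivative.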

Bengio and Delalleau (2009) show that CD-*k* is an approximation of the true log-likelihood gradient by finding an expansion of the gradient that considers the *k*th sample in the Gibbs chain and showing that CD-*k* is equal to a truncation of this expansion. Furthermore, they prove that the left-out term converges to zero as *k* goes to infinity:

CD-*k*, relating it to the magnitude of the RBM parameters. Here θ_{a} denotes a single parameter of the RBM, and the quantities α and β in their bound grow with the absolute values of the weights and bias parameters. But the bound gets loose very fast if the norm of the parameters increases. Note that the absolute value of ∂ log p(*v*)/∂θ_{a} is never larger than one for binary RBMs (this follows from the fact that the derivative is a difference of two expectations of ∂E/∂θ_{a}, whose absolute value is at most one; e.g., see Bengio & Delalleau, 2009, p. 1611, above equation 4.3), while the bound given above grows quickly with α and β and approaches 2^{m}, the number of configurations of the visible units.

## 2. Bounding the CD Approximation Error

We derive an upper bound on the bias of CD-*k* based on general results for the convergence rate of the Gibbs sampler (see Brémaud, 1999). The convergence rate depends on the distance between the distribution of the initial states μ (the starting distribution of the chain) and the stationary distribution. A measure of distance between two distributions α and β on a countable set Ω is the total variation distance, defined as

d_{V}(α, β) = (1/2) ∑_{ω ∈ Ω} |α(ω) − β(ω)| = (1/2) ‖α − β‖_{1}.

The total variation distance between two distributions is bounded by one. We make use of the following theorem:

**Theorem 2.** *A Markov random field with N random variables taking values in a finite set Ω and a Markov chain produced by periodic Gibbs sampling is given. Let* **P** *be the transition matrix, μ the starting distribution, and p the stationary distribution (i.e., the Gibbs distribution) of the Gibbs chain. It holds*

d_{V}(μ**P**^{k}, p) ≤ (1 − e^{−NΔ})^{k} d_{V}(μ, p),

*where* Δ = max_{l} δ_{l}, *with δ_{l} being the maximum change in E that can be produced by changing only the value of the l-th variable, and E denotes the energy function of the Gibbs distribution.*

A proof is given by Brémaud (1999).
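As a concrete illustration (a sketch of ours, not part of the original note; the function name is hypothetical), the total variation distance between two distributions on a finite set is straightforward to compute:

```python
import numpy as np

def total_variation(alpha, beta):
    """Total variation distance d_V(alpha, beta) = (1/2) * sum |alpha - beta|
    between two probability vectors over the same finite set."""
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    return 0.5 * np.abs(alpha - beta).sum()

# d_V is bounded by one; it equals one for distributions with disjoint support
assert total_variation([1.0, 0.0], [0.0, 1.0]) == 1.0
```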

Now we can state our main result:

*Given is an RBM (*

*V*_{1}, …,*V*,_{m}*H*_{1}, …,*H*) and a Markov chain produced by periodic Gibbs sampling starting from (). Let the initial states be distributed according to μ as defined in equation 2.2, and let_{n}*p*be the joint probability distribution of and of the RBM (i.e., the stationary distribution of the Markov chain). Then we can bound the error of the CD-*k*approximation of the log-likelihood derivative with regard to some RBM parameter θ_{a}(i.e., ) by*with*

*where*

*and*

*k*steps of Gibbs sampling and the stationary distribution of the chain.

The proof is based on theorem 2: we bound Δ = max{Δ_{v}, Δ_{h}} using Δ_{v} = max_{l ∈ {1,…,m}} ϑ_{l} and Δ_{h} = max_{s ∈ {1,…,n}} ξ_{s}. For the visible units, we have

ϑ_{l} = max_{v, v′, h} |E(*v*, *h*) − E(*v*′, *h*)|,

where we maximize over *v*, *v*′, and *h* under the constraint that ∀ *j* ∈ {1, …, *m*}, *j* ≠ *l*: v_{j} = v′_{j} (i.e., that only one unit changes its state). Thus,

ϑ_{l} = max_{v, v′, h} |(v′_{l} − v_{l})(∑_{i=1}^{n} w_{il} h_{i} + b_{l})| = max_{h} |∑_{i=1}^{n} w_{il} h_{i} + b_{l}| = max{ |∑_{i=1}^{n} I(w_{il} > 0) w_{il} + b_{l}| , |∑_{i=1}^{n} I(w_{il} < 0) w_{il} + b_{l}| },

where the indicator function *I* is one if its argument is true and zero otherwise. The second equality holds because (v′_{l} − v_{l}) is either −1 or 1 and can be pulled out as a common factor. The absolute value of the resulting sum is maximized if the h_{i} exclusively “select” either all positive or all negative w_{il}, which leads to the final expression. Analogously, we compute ξ_{s} for the hidden units.
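The quantities ϑ_{l}, ξ_{s}, Δ, and the resulting factor (1 − e^{−(m+n)Δ})^{k} are cheap to compute from the parameters. A minimal sketch (our illustration; the function name is hypothetical), assuming weights `W` of shape (n, m), visible biases `b`, and hidden biases `c`:

```python
import numpy as np

def delta_and_factor(W, b, c, k):
    """Compute Delta = max{Delta_v, Delta_h} for a binary RBM and the
    factor (1 - exp(-(m+n)*Delta))**k appearing in the bound.

    W: (n, m) weights, b: (m,) visible biases, c: (n,) hidden biases.
    """
    n, m = W.shape
    # theta_l = max over binary h of |sum_i w_il h_i + b_l|: h "selects"
    # either all positive or all negative weights in column l
    pos_v = W.clip(min=0).sum(axis=0) + b
    neg_v = W.clip(max=0).sum(axis=0) + b
    delta_v = np.maximum(np.abs(pos_v), np.abs(neg_v)).max()
    # xi_s analogously over the rows of W for the hidden units
    pos_h = W.clip(min=0).sum(axis=1) + c
    neg_h = W.clip(max=0).sum(axis=1) + c
    delta_h = np.maximum(np.abs(pos_h), np.abs(neg_h)).max()
    delta = max(delta_v, delta_h)
    factor = (1.0 - np.exp(-(m + n) * delta)) ** k
    return delta, factor
```

For W = 0, b = 0, c = 0 the factor is zero, reflecting that a uniform RBM mixes in a single step; as the parameter magnitudes grow, the factor approaches one and the bound loosens.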

The result for a single initial observed pattern is appropriate for online learning. It is straightforward to extend the theorem to batch learning in which the gradient and the CD-*k* approximation are averages over a set of observed patterns defining an empirical distribution:

**Corollary 1.** *Let p denote the marginal distribution of the visible units of an RBM, and let p_{e} be the empirical distribution defined by a set of samples *v*_{1}, …, *v*_{N}. Then an upper bound on the expectation of the error of the CD-k approximation of the log-likelihood derivative with regard to some RBM parameter θ_{a} is given by*

(1/2) ‖p_{e} − p‖_{1} (1 − e^{−(m+n)Δ})^{k}

*with Δ as defined in theorem 3.*
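A sketch of evaluating such a batch bound for given distributions over the visible configurations (our own hypothetical helper, assuming the total variation form (1/2)‖p_e − p‖_1 of the distance and a precomputed Δ):

```python
import numpy as np

def batch_bias_bound(p_e, p, delta, m, n, k):
    """Upper bound (1/2) * ||p_e - p||_1 * (1 - exp(-(m+n)*delta))**k
    on the expected CD-k bias in batch learning.

    p_e, p: probability vectors over the 2**m visible configurations,
    delta: maximum energy change from flipping a single unit.
    """
    dist = np.abs(np.asarray(p_e, dtype=float) - np.asarray(p, dtype=float)).sum()
    return 0.5 * dist * (1.0 - np.exp(-(m + n) * delta)) ** k
```

The two factors are the antagonistic terms discussed below: successful learning shrinks `dist`, while growing parameter magnitudes push the exponential factor toward one.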

Our bound shows that the bias is determined by two antagonistic terms. The dependence on ‖p_{e} − p‖_{1} is an important difference between our results and those derived by Bengio and Delalleau (2009). Since p_{e} is the target distribution for the RBM learning process, the variation distance between p_{e} and p should decrease during successful RBM learning. At the same time, the magnitudes of the parameters, if not controlled by weight decay, tend to increase in practice (see, e.g., Bengio & Delalleau, 2009; Fischer & Igel, 2009; Desjardins, Courville, Bengio, Vincent, & Delalleau, 2010). Thus Δ increases and (1 − e^{−(m+n)Δ})^{k} approaches one.

## 3. Experimental Results

We empirically studied the development of bias and bound during learning of the Diag and the 1DBall data sets described by Bengio and Delalleau (2009). Small RBMs with six visible and six hidden neurons were trained by batch learning based on the expected value of the CD-1 update. Their parameters were initialized with weights drawn uniformly from [−0.5, 0.5] and bias parameters set to c_{i} = b_{j} = 0 for all i and j. Each experiment was repeated 25 times with different initializations. The learning rates were set to 0.1, and no weight decay was used. The results are shown in Figure 1. The bias value plotted is the maximum over all parameters.

The results show the tightness of the new bound. Only in the initial phase of learning, when ‖p_{e} − p‖_{1} is large, was the bound rather loose (but always nontrivial, i.e., below one; this is not shown in the left plot). After 50,000 iterations, the differences between bound and bias value as defined above are ≈0.00138 and ≈0.02628 for Diag and 1DBall, respectively. In the beginning, the bias is small because models with weights close to zero mix fast (if the weights were all zero, the RBM would model a uniform distribution, which is sampled after a single Gibbs sampling step). We refer to Bengio and Delalleau (2009) for a detailed empirical analysis of CD-*k* learning of RBMs applied to the Diag and 1DBall benchmarks (e.g., showing the dependence on *k* and the dimensionality of the problem).

## 4. Discussion and Conclusion

We derived a new upper bound for the bias incurred when estimating the log-likelihood gradient by *k*-step contrastive divergence (CD-*k*) for training RBMs. It is considerably tighter than a recently published result, mainly because it incorporates the decrease of the bias as the distance between the modeled distribution and the starting distribution of the Gibbs chain shrinks.

Learning based on CD-*k* has been successfully applied to RBMs (e.g., Hinton, 2002, 2007; Hinton, Osindero, & Teh, 2006; Hinton & Salakhutdinov, 2006; Bengio, Lamblin, Popovici, Larochelle, & Montreal, 2007; Bengio & Delalleau, 2009). However, it need not converge to a maximum likelihood (ML) estimate of the model parameters (conditions for convergence with probability one are given by Yuille, 2005). Analytical counterexamples are presented by MacKay (2001). Carreira-Perpiñán and Hinton (2005) show that in general, CD learning does not lead to the ML solution. In their experiments, it reaches solutions that are close. However, empirical evidence for misled RBM learning using approximations of the true log-likelihood gradient is, for example, given by Fischer and Igel (2009, 2010), as well as Desjardins et al. (2010). Intuitively, the smaller the bias of the log-likelihood gradient estimation, the higher the chances of converging to an ML solution quickly. Still, even small deviations of a few gradient components can deteriorate the learning process.

Our bound for the bias increases with the maximum possible change in energy that can be produced by changing a single variable. This indicates the relevance of controlling the absolute values of the RBM parameters, for example, by using weight-decay (see the discussion by Fischer & Igel, 2010). Further, the bound increases with the number of RBM variables and decreases with increasing *k*. The latter underpins that larger values of *k* stabilize CD learning and that increasing *k* dynamically when the weights increase may be a good learning strategy (Bengio & Delalleau, 2009).

## Acknowledgments

We acknowledge support from the German Federal Ministry of Education and Research within the National Network Computational Neuroscience under grant number 01GQ0951 (Bernstein Fokus “Learning Behavioral Models: From Human Experiment to Technical Assistance”).