Optimization based on k-step contrastive divergence (CD) has become a common way to train restricted Boltzmann machines (RBMs). The k-step CD is a biased estimator of the log-likelihood gradient relying on Gibbs sampling. We derive a new upper bound for this bias. Its magnitude depends on k, the number of variables in the RBM, and the maximum change in energy that can be produced by changing a single variable. The latter reflects the dependence on the absolute values of the RBM parameters. The magnitude of the bias is also affected by the distance in variation between the modeled distribution and the starting distribution of the Gibbs chain.
1. Training RBMs Using Contrastive Divergence
Bengio and Delalleau (2009) show that CD-k is an approximation of the true log-likelihood gradient by finding an expansion of the gradient that considers the kth sample in the Gibbs chain and showing that CD-k is equal to a truncation of this expansion. Furthermore, they prove that the left-out term converges to zero as k goes to infinity:
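The CD-k estimator discussed above can be sketched for a binary RBM as follows. This is a minimal NumPy illustration under the standard conditional-sampling scheme, not the authors' code; the function names (`sample_h`, `cd_k`, etc.) are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h(v, W, c):
    # P(h_i = 1 | v) for a binary RBM with weights W (m x n) and hidden biases c
    p = sigmoid(c + v @ W)
    return (rng.random(p.shape) < p).astype(float), p

def sample_v(h, W, b):
    # P(v_j = 1 | h) with visible biases b
    p = sigmoid(b + h @ W.T)
    return (rng.random(p.shape) < p).astype(float), p

def cd_k(v0, W, b, c, k=1):
    """Biased CD-k estimate of the log-likelihood gradient w.r.t. W, b, c.

    Starts the Gibbs chain at the observed pattern v0 and runs k full
    sampling steps; the bias analyzed in the text stems from truncating
    the chain after the kth sample instead of sampling from the model.
    """
    _, ph0 = sample_h(v0, W, c)        # positive phase (data-driven)
    v = v0
    for _ in range(k):                 # k Gibbs steps
        h, _ = sample_h(v, W, c)
        v, _ = sample_v(h, W, b)
    _, phk = sample_h(v, W, c)         # negative phase (kth sample)
    dW = np.outer(v0, ph0) - np.outer(v, phk)
    db = v0 - v
    dc = ph0 - phk
    return dW, db, dc
```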
2. Bounding the CD Approximation Error
A proof is given by Brémaud (1999).
Now we can state our main result:
The result for a single initial observed pattern is appropriate for online learning. It is straightforward to extend the theorem to batch learning in which the gradient and the CD-k approximation are averages over a set of observed patterns defining an empirical distribution:
Our bound shows that the bias is determined by two antagonistic terms. The dependence on ‖p_e − p‖_1 is an important difference between our results and those derived by Bengio and Delalleau (2009). Since p_e is the target distribution for the RBM learning process, the variation distance between p_e and p should decrease during successful RBM learning. At the same time, the magnitudes of the parameters, if not controlled by weight decay, tend to increase in practice (see, e.g., Bengio & Delalleau, 2009; Fischer & Igel, 2009; Desjardins, Courville, Bengio, Vincent, & Delalleau, 2010). Thus Δ increases and (1 − e^{−(m+n)Δ})^k approaches one.
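The quantity Δ, the maximum change in energy obtainable by flipping a single variable, can be read off directly from the parameter magnitudes. The sketch below assumes the standard energy convention E(v, h) = −b·v − c·h − vᵀWh for a binary RBM with m visible and n hidden units; the helper names are ours, and the second function only evaluates the factor (1 − e^{−(m+n)Δ})^k appearing in the discussion above, not the full bound.

```python
import numpy as np

def max_energy_change(W, b, c):
    """Upper bound Delta on the energy change from flipping one unit.

    With E(v,h) = -b.v - c.h - v^T W h and binary units, flipping visible
    unit j changes E by at most |b_j| + sum_i |w_ji|, and flipping hidden
    unit i by at most |c_i| + sum_j |w_ji| (worst case over all other units).
    """
    per_visible = np.abs(b) + np.abs(W).sum(axis=1)  # W has shape (m, n)
    per_hidden = np.abs(c) + np.abs(W).sum(axis=0)
    return max(per_visible.max(), per_hidden.max())

def convergence_factor(W, b, c, k):
    """The factor (1 - e^{-(m+n)Delta})^k from the bound's discussion."""
    m, n = W.shape
    delta = max_energy_change(W, b, c)
    return (1.0 - np.exp(-(m + n) * delta)) ** k
```

With all parameters at zero, Δ = 0 and the factor vanishes; as the weights grow, the factor approaches one, matching the discussion above.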
3. Experimental Results
We empirically studied the development of bias and bound during learning of the Diag and 1DBall data sets described by Bengio and Delalleau (2009). Small RBMs with six visible and six hidden neurons were trained by batch learning based on the expected value of the CD-1 update. Their parameters were initialized with weights drawn uniformly from [−0.5, 0.5] and bias parameters set to c_i = b_j = 0 for all i and j. Each experiment was repeated 25 times with different initializations. The learning rate was set to 0.1, and no weight decay was used. The results are shown in Figure 1. The bias value plotted is the maximum over all parameters.
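The setup can be sketched as follows. This sketch uses the sampled CD-1 update rather than its expectation, and a random placeholder batch in place of the Diag/1DBall patterns; the remaining hyperparameters follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

m, n = 6, 6                              # six visible, six hidden neurons
W = rng.uniform(-0.5, 0.5, (m, n))       # weights uniform from [-0.5, 0.5]
b = np.zeros(m)                          # visible biases b_j = 0
c = np.zeros(n)                          # hidden biases c_i = 0
lr = 0.1                                 # learning rate, no weight decay

# Placeholder batch; the paper trains on the Diag / 1DBall patterns instead.
data = rng.integers(0, 2, (16, m)).astype(float)

for it in range(100):
    ph0 = sigmoid(c + data @ W)                       # positive phase
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(b + h0 @ W.T)                       # one Gibbs step (CD-1)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(c + v1 @ W)
    # batch-averaged CD-1 update (sampled; the paper uses its expectation)
    W += lr * (data.T @ ph0 - v1.T @ ph1) / len(data)
    b += lr * (data - v1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
```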
The results show the tightness of the new bound. Only in the initial phase of learning, when ‖p_e − p‖_1 is large, was the bound rather loose (but always nontrivial, i.e., below one; this is not shown in the left plot). After 50,000 iterations, the differences between the bound and the bias value as defined above are ≈0.00138 and ≈0.02628 for Diag and 1DBall, respectively. In the beginning, the bias is small because models with weights close to zero mix fast (if the weights were all zero, the RBM would model a uniform distribution, which is sampled after a single Gibbs sampling step). We refer to Bengio and Delalleau (2009) for a detailed empirical analysis of CD-k learning of RBMs applied to the Diag and 1DBall benchmarks (e.g., showing the dependence on k and the dimensionality of the problem).
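For RBMs of this size, the model distribution p and the variation distance ‖p_e − p‖_1 can be computed exactly by enumerating all visible states, which also makes the zero-weight observation easy to check. A minimal sketch, assuming the standard energy convention and with helper names of our choosing:

```python
import itertools
import numpy as np

def model_distribution(W, b, c):
    """Exact p(v) of a small binary RBM by enumerating all 2^m visible states.

    Uses the free energy F(v) = -b.v - sum_i log(1 + e^{c_i + (v W)_i}),
    so p(v) is proportional to e^{-F(v)}; feasible only for few units.
    """
    m = len(b)
    vs = np.array(list(itertools.product([0, 1], repeat=m)), dtype=float)
    free_energy = -(vs @ b) - np.log1p(np.exp(c + vs @ W)).sum(axis=1)
    logits = -free_energy
    logits -= logits.max()          # subtract max for numerical stability
    p = np.exp(logits)
    return vs, p / p.sum()

def variation_distance(p_e, p):
    """||p_e - p||_1 for two distributions over the same state ordering."""
    return np.abs(p_e - p).sum()
```

With W, b, and c all zero, `model_distribution` returns the uniform distribution over visible states, consistent with the fast-mixing observation above.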
4. Discussion and Conclusion
We derived a new upper bound for the bias incurred when estimating the log-likelihood gradient by k-step contrastive divergence (CD-k) for training RBMs. It is considerably tighter than a recently published result. The main reason is that it incorporates the decrease of the bias with decreasing distance between the modeled distribution and the starting distribution of the Gibbs chain.
Learning based on CD-k has been successfully applied to RBMs (e.g., Hinton, 2002, 2007; Hinton, Osindero, & Teh, 2006; Hinton & Salakhutdinov, 2006; Bengio, Lamblin, Popovici, Larochelle, & Montreal, 2007; Bengio & Delalleau, 2009). However, it need not converge to a maximum likelihood (ML) estimate of the model parameters (conditions for convergence with probability one are given by Yuille, 2005). Analytical counterexamples are presented by MacKay (2001). Carreira-Perpiñán and Hinton (2005) show that in general, CD learning does not lead to the ML solution. In their experiments, it reaches solutions that are close. However, empirical evidence for misled RBM learning using approximations of the true log-likelihood gradient is, for example, given by Fischer and Igel (2009, 2010), as well as Desjardins et al. (2010). Intuitively, the smaller the bias of the log-likelihood gradient estimation, the higher the chances of converging to an ML solution quickly. Still, even small deviations of a few gradient components can deteriorate the learning process.
Our bound on the bias increases with the maximum possible change in energy that can be produced by changing a single variable. This indicates the relevance of controlling the absolute values of the RBM parameters, for example, by using weight decay (see the discussion by Fischer & Igel, 2010). Further, the bound increases with the number of RBM variables and decreases with increasing k. The latter supports the view that larger values of k stabilize CD learning and that increasing k dynamically as the weights grow may be a good learning strategy (Bengio & Delalleau, 2009).
We acknowledge support from the German Federal Ministry of Education and Research within the National Network Computational Neuroscience under grant number 01GQ0951 (Bernstein Fokus “Learning Behavioral Models: From Human Experiment to Technical Assistance”).