## Abstract

We study an expansion of the log likelihood in undirected graphical models such as the restricted Boltzmann machine (RBM), where each term in the expansion is associated with a sample in a Gibbs chain alternating between two random variables (the visible vector and the hidden vector in RBMs). We are particularly interested in estimators of the gradient of the log likelihood obtained through this expansion. We show that its residual term converges to zero, justifying the use of a truncation—running only a short Gibbs chain, which is the main idea behind the contrastive divergence (CD) estimator of the log-likelihood gradient. By truncating even more, we obtain a stochastic reconstruction error, related through a mean-field approximation to the reconstruction error often used to train autoassociators and stacked autoassociators. The derivation is not specific to the particular parametric forms used in RBMs and requires only convergence of the Gibbs chain. We present theoretical and empirical evidence linking the number of Gibbs steps *k* and the magnitude of the RBM parameters to the bias in the CD estimator. These experiments also suggest that the sign of the CD estimator is correct most of the time, even when the bias is large, so that CD-*k* is a good descent direction even for small *k*.

## 1. Introduction

*w* (a vector) and *b* (a scalar), and where *s*(*a*) could be 1_{*a* > 0}, tanh(*a*), or sigm(*a*) = 1/(1 + e^{−*a*}), for example.

Here, we are particularly interested in the restricted Boltzmann machine (Smolensky, 1986; Freund & Haussler, 1994; Hinton, 2002; Welling, Rosen-Zvi, & Hinton, 2005; Carreira-Perpiñan & Hinton, 2005), a family of bipartite graphical models with hidden variables (the hidden layer) that are used as components in building deep belief networks (Hinton et al., 2006; Bengio et al., 2007; Salakhutdinov & Hinton, 2007; Larochelle et al., 2007). Deep belief networks have yielded impressive performance on several benchmarks, clearly beating the state of the art and other nonparametric learning algorithms in several cases. A very successful learning algorithm for training a restricted Boltzmann machine (RBM) is the contrastive divergence (CD) algorithm. An RBM represents the joint distribution between a visible vector *X*, which is the random variable observed in the data, and a hidden random variable *H*. There is no tractable representation of *P*(*X*, *H*), but the conditional distributions *P*(*H* ∣ *X*) and *P*(*X* ∣ *H*) can easily be computed and sampled from. CD-*k* is based on a Gibbs Markov chain Monte Carlo (MCMC) procedure starting at an example *X* = *x*_{1} from the empirical distribution and converging to the RBM's generative distribution *P*(*X*). CD-*k* relies on a biased estimator obtained after a small number *k* of Gibbs steps (often only one). Each Gibbs step is composed of two alternating substeps: sampling *h*_{t} ∼ *P*(*H* ∣ *X* = *x*_{t}) and sampling *x*_{t+1} ∼ *P*(*X* ∣ *H* = *h*_{t}), starting at *t* = 1.

The surprising empirical result is that even *k* = 1 (CD-1) often gives good results. An extensive numerical comparison of training with CD-*k* versus exact log-likelihood gradient has been presented in Carreira-Perpiñan and Hinton (2005). In these experiments, taking *k* larger than 1 gives more precise results, although very good approximations of the solution can be obtained even with *k* = 1. Here we present a follow-up to Carreira-Perpiñan and Hinton (2005) that brings further theoretical and empirical support to CD-*k*, even for small *k*.

*KL* is the Kullback-Leibler divergence, P̂ denotes the empirical distribution of the training data, and *P*(*X*_{2} = · ∣ *x*_{1}) denotes the distribution of the chain after one step. The term left out in the approximation of the gradient of the *KL* difference was empirically found to be small (Hinton, 2002). On the one hand, it is not clear how well aligned the log-likelihood gradient is with the gradient of the above *KL* difference. On the other hand, it would be nice to prove that the left-out terms are small in some sense. One of the motivations for this letter is to obtain the CD algorithm by a different route, by which we can prove that the term left out with respect to the log-likelihood gradient is small and converges to zero as *k* is made larger.

We show that the log likelihood and its gradient can be expanded by considering samples in a Gibbs chain. We show that when the gradient expansion is truncated to *k* steps, the remainder converges to zero at a rate that depends on the mixing rate of the chain. The inspiration for this derivation comes from Hinton et al. (2006): first, the idea that the Gibbs chain can be associated with an infinite directed graphical model (which here we associate to an expansion of the log likelihood and its gradient), and second, that the convergence of the chain justifies CD (since the *k*th sample from the Gibbs chain becomes equivalent to a model sample). However, our empirical results also show that the convergence of the chain alone cannot explain the good results obtained by CD, because this convergence becomes too slow as weights increase during training. It turns out that even when *k* is not large enough for the chain to converge (e.g., the typical value *k* = 1), the CD-*k* rule remains a good update direction to increase the log likelihood of the training data.

Finally, we show that when the series is truncated to a single substep, we obtain the gradient of a stochastic reconstruction error. A mean-field approximation of that error is the reconstruction error often used to train autoassociators (Rumelhart, Hinton, & Williams, 1986; Bourlard & Kamp, 1988; Hinton & Zemel, 1994; Schwenk & Milgram, 1995; Japkowicz, Hanson, & Gluck, 2000). Autoassociators can be stacked using the same principle used to stack RBMs into a deep belief network in order to train deep neural networks (Bengio et al., 2007; Ranzato et al., 2007; Larochelle et al., 2007). Reconstruction error has also been used to monitor progress in training RBMs by CD (Taylor, Hinton, & Roweis, 2007; Bengio et al., 2007), because it can be computed tractably and analytically, without sampling noise.

In the following, we drop the *X* = *x* notation and use shorthand such as *P*(*x* ∣ *h*) instead of *P*(*X* = *x* ∣ *H* = *h*). The *t* index is used to denote position in the Markov chain, whereas indices *i* or *j* denote an element of the hidden or visible vector, respectively.

## 2. Restricted Boltzmann Machines and Contrastive Divergence

### 2.1. Boltzmann Machines.

*x*, marginalizing over the values of the hidden units *h*, and where the joint distribution between hidden and visible units is associated with a quadratic energy function such that *P*(*x*, *h*) = e^{−energy(*x*, *h*)}/*Z*, with

energy(*x*, *h*) = −*b*′*x* − *c*′*h* − *h*′*Wx* − *x*′*Ux* − *h*′*Vh*,

where *Z* is a normalization constant (called the partition function) and (*b*, *c*, *W*, *U*, *V*) are parameters of the model. *b*_{j} is called the bias of visible unit *x*_{j}, *c*_{i} is the bias of hidden unit *h*_{i}, and the matrices *W*, *U*, and *V* represent interaction terms between units. Note that nonzero *U* and *V* mean that there are interactions between units belonging to the same layer (hidden layer or visible layer). Marginalizing over *h* at the level of the energy yields the so-called free energy, and the log likelihood can be rewritten accordingly. Differentiating it, the gradient of the log likelihood with respect to some model parameter θ can be written as a difference of two expectations. Computing the first (data-dependent) term is straightforward. Therefore, if sampling from the model were possible, one could obtain a stochastic gradient for use in training the model as follows. Two samples are necessary: *h* given *x* for the first term, which is called the *positive phase*, and an (*x*, *h*) pair from *P*(*x*, *h*) in what is called the *negative phase*. Note how the resulting stochastic gradient estimator has one term for each of the positive phase and the negative phase, with the same form but opposite signs. Let *u* = (*x*, *h*) be a vector with all the unit values. In a general Boltzmann machine, one can compute and sample from *P*(*u*_{i} ∣ *u*_{−i}), where *u*_{−i} is the vector of all the unit values except the *i*th. Gibbs sampling with as many substeps as there are units in the model has been used to train Boltzmann machines in the past, with very long chains, yielding correspondingly long training times.

### 2.2. Restricted Boltzmann Machines.

*U* = 0 and *V* = 0 in equation 2.2—that is, the only interaction terms are between a hidden unit and a visible unit, but not between units of the same layer. This form of model was first introduced under the name of *Harmonium* (Smolensky, 1986). Because of this restriction, *P*(*h* ∣ *x*) and *P*(*x* ∣ *h*) factorize and can be computed and sampled from easily. This enables the use of a two-substep Gibbs sampling procedure alternating between *h* ∼ *P*(*H* ∣ *X* = *x*) and *x* ∼ *P*(*X* ∣ *H* = *h*). In addition, the positive-phase gradient can be obtained exactly and efficiently because the free energy factorizes, where *W*_{i} is the *i*th row of *W* and *d*_{h} the dimension of *h*. Using the same type of factorization, one obtains the conditionals explicitly, for example, in the most common case where *h* is binary. The log-likelihood gradient for *W*_{ij} thus has a form in which *E*_{X} is an expectation over *P*(*X*). Samples from *P*(*X*) can be approximated by running an alternating Gibbs chain *x*_{1} ⇒ *h*_{1} ⇒ *x*_{2} ⇒ *h*_{2} ⇒ … Since the model *P* is trying to imitate the empirical distribution, it is a good idea to start the chain with a sample from the empirical distribution, so that we start the chain from a distribution close to the asymptotic one.
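The alternating chain can be sketched as follows; this is our own minimal illustration, assuming binary units and the standard factorized conditionals sigm(*c* + *Wx*) and sigm(*b* + *W*′*h*):

```python
import numpy as np

rng = np.random.default_rng(0)
sigm = lambda z: 1.0 / (1.0 + np.exp(-z))

d_x, d_h = 6, 4
W = rng.normal(scale=0.1, size=(d_h, d_x))  # hidden-visible weights
b = np.zeros(d_x)                            # visible biases
c = np.zeros(d_h)                            # hidden biases

def sample_h_given_x(x):
    # Factorized conditional: P(h_i = 1 | x) = sigm(c_i + W_i x)
    p = sigm(c + W @ x)
    return (rng.random(d_h) < p).astype(float)

def sample_x_given_h(h):
    # Factorized conditional: P(x_j = 1 | h) = sigm(b_j + (W'h)_j)
    p = sigm(b + W.T @ h)
    return (rng.random(d_x) < p).astype(float)

# Alternating Gibbs chain x1 => h1 => x2 => h2 => ...
x = rng.integers(0, 2, d_x).astype(float)  # start from a "data" vector
for t in range(50):
    h = sample_h_given_x(x)
    x = sample_x_given_h(h)
print(x)  # approximately a sample from P(X) once the chain has mixed
```

Each full Gibbs step costs only two matrix-vector products, which is what makes the two-substep chain so much cheaper than the unit-by-unit sweep of a general Boltzmann machine.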

In most uses of RBMs (Hinton, 2002; Carreira-Perpiñan & Hinton, 2005; Hinton et al., 2006; Bengio et al., 2007), both *h*_{i} and *x*_{j} are binary, but many extensions are possible and have been studied, including cases where hidden or visible units are continuous-valued (Freund & Haussler, 1994; Welling et al., 2005; Bengio et al., 2007).

### 2.3. Contrastive Divergence.

*k*-step contrastive divergence (CD-*k*) (Hinton, 1999, 2002) involves a second approximation besides the use of MCMC to sample from *P*. This additional approximation introduces some bias in the gradient: we run the MCMC chain for only *k* steps, starting from the observed example *x*. Using the same technique as in equation 2.6 to express the log-likelihood gradient, but keeping the sums over *h* inside the free energy, the CD-*k* update after seeing example *x* is taken proportional to the difference between the free-energy gradient at *x* and the free-energy gradient at a sample from our Markov chain after *k* steps. We know that when *k* → ∞, the samples from the Markov chain converge to samples from *P*, and the bias goes away. We also know that when the model distribution is very close to the empirical distribution, then when we start the chain from *x* (a sample from the empirical distribution), the MCMC samples have already converged to *P*, and we need fewer sampling steps to obtain an unbiased (albeit correlated) sample from *P*.
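For a binary RBM, the resulting CD-*k* parameter update can be sketched as follows. This is our own illustration (variable names are ours); following common practice, both phases use the hidden-unit probabilities rather than hidden samples:

```python
import numpy as np

rng = np.random.default_rng(1)
sigm = lambda z: 1.0 / (1.0 + np.exp(-z))

def cd_k_update(W, b, c, x1, k, rng):
    """One CD-k step for a binary RBM; returns parameter increments.

    Positive phase: E[h | x1] = sigm(c + W x1), computed exactly.
    Negative phase: the sample x_{k+1} after k full Gibbs steps from x1.
    """
    d_h, d_x = W.shape
    x = x1.copy()
    for _ in range(k):
        h = (rng.random(d_h) < sigm(c + W @ x)).astype(float)
        x = (rng.random(d_x) < sigm(b + W.T @ h)).astype(float)
    q1 = sigm(c + W @ x1)  # positive-phase hidden probabilities
    qk = sigm(c + W @ x)   # negative-phase hidden probabilities
    dW = np.outer(q1, x1) - np.outer(qk, x)
    db = x1 - x
    dc = q1 - qk
    return dW, db, dc

d_x, d_h = 6, 4
W = rng.normal(scale=0.01, size=(d_h, d_x))
b, c = np.zeros(d_x), np.zeros(d_h)
x1 = rng.integers(0, 2, d_x).astype(float)  # one training example
dW, db, dc = cd_k_update(W, b, c, x1, k=1, rng=rng)
W += 1e-3 * dW
b += 1e-3 * db
c += 1e-3 * dc
```

The two terms of `dW` mirror the positive and negative phases of the gradient: identical form, opposite signs, with the model sample replaced by the chain's state after *k* steps.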

## 3. Log-Likelihood Expansion via Gibbs Chain

In the following we consider the case where both *h* and *x* can take only a finite number of values. We also assume that there is no pair (*x*, *h*) such that *P*(*x* ∣ *h*) = 0 or *P*(*h* ∣ *x*) = 0. This ensures the Markov chain associated with Gibbs sampling is irreducible (one can go from any state to any other state), and there exists a unique stationary distribution *P*(*x*, *h*) the chain converges to.

*Consider the irreducible Gibbs chain x*_{1} ⇒ *h*_{1} ⇒ *x*_{2} ⇒ *h*_{2} ⇒ … *starting at data point x*_{1}*. The log likelihood can be written as follows at any step t of the chain:*

log *P*(*x*_{1}) = ∑_{s=1}^{t−1} [ log (*P*(*x*_{s} ∣ *h*_{s}) / *P*(*h*_{s} ∣ *x*_{s})) + log (*P*(*h*_{s} ∣ *x*_{s+1}) / *P*(*x*_{s+1} ∣ *h*_{s})) ] + log *P*(*x*_{t}),

*and since this is true for any path*,

log *P*(*x*_{1}) = *E* [ ∑_{s=1}^{t−1} ( log (*P*(*x*_{s} ∣ *h*_{s}) / *P*(*h*_{s} ∣ *x*_{s})) + log (*P*(*h*_{s} ∣ *x*_{s+1}) / *P*(*x*_{s+1} ∣ *h*_{s})) ) ∣ *x*_{1} ] + *E*[log *P*(*X*_{t}) ∣ *x*_{1}],

*where expectations are over Markov chain sample paths, conditioned on the starting sample x*_{1}.

Note that *E*_{X_t}[log *P*(*X*_{t}) ∣ *x*_{1}] is the negative entropy of the *t*th visible sample of the chain, and it does not become smaller as *t* → ∞. Therefore, it does not seem reasonable to truncate this expansion. However, the gradient of the log likelihood is more interesting. But first we need a simple lemma:

The lemma is clearly also true for conditional distributions with corresponding conditional expectations.

*x*_{1}, getting the expectation of the gradient term at the *t*th sample. In order to prove the convergence of the CD bias toward zero, we will use the assumed convergence of the chain, which can be written *P*(*X*_{t} = *x* ∣ *x*_{1}) = *P*(*x*) + ϵ_{t}(*x*), with ∑_{x} ϵ_{t}(*x*) = 0 and lim_{t→+∞} ϵ_{t}(*x*) = 0 for all *x*. Since *x* is discrete, ϵ_{t} = max_{x} |ϵ_{t}(*x*)| also verifies lim_{t→+∞} ϵ_{t} = 0. Then we can rewrite the last expectation accordingly. When lemma 2 is used, the first sum is equal to zero. Thus, we can bound this expectation by a quantity proportional to *N*_{x}ϵ_{t}, where *N*_{x} is the number of discrete configurations of the random variable *X*. This proves that the expectation converges to zero as *t* → +∞, since lim_{t→+∞} ϵ_{t} = 0.
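In the discrete case, this convergence can be checked numerically on a model small enough to enumerate. The sketch below (ours) builds the exact visible-to-visible transition matrix of the alternating Gibbs chain for a tiny binary RBM and tracks max_{x} |ϵ_{t}(*x*)|:

```python
import numpy as np
from itertools import product

sigm = lambda z: 1.0 / (1.0 + np.exp(-z))
rng = np.random.default_rng(0)

d_x, d_h = 3, 2
W = rng.normal(scale=0.5, size=(d_h, d_x))
b, c = np.zeros(d_x), np.zeros(d_h)

xs = [np.array(v, float) for v in product([0, 1], repeat=d_x)]
hs = [np.array(v, float) for v in product([0, 1], repeat=d_h)]

def p_h_given_x(h, x):
    q = sigm(c + W @ x)
    return np.prod(q**h * (1 - q)**(1 - h))

def p_x_given_h(x, h):
    q = sigm(b + W.T @ h)
    return np.prod(q**x * (1 - q)**(1 - x))

# Transition matrix of one full Gibbs step: T[x', x] = P(X_{t+1}=x' | X_t=x)
N = len(xs)
T = np.zeros((N, N))
for i, x in enumerate(xs):
    for j, x2 in enumerate(xs):
        T[j, i] = sum(p_x_given_h(x2, h) * p_h_given_x(h, x) for h in hs)

# Stationary distribution P(X): eigenvector of T for eigenvalue 1
w, V = np.linalg.eig(T)
pi = np.real(V[:, np.argmax(np.real(w))])
pi /= pi.sum()

p = np.zeros(N)
p[0] = 1.0  # chain started at x1 = first configuration
eps = []
for t in range(1, 11):
    eps.append(np.max(np.abs(p - pi)))  # max_x |eps_t(x)|
    p = T @ p
print(eps[0], eps[-1])  # the error shrinks toward zero with t
```

All entries of `T` are strictly positive here (no conditional probability is zero), which is exactly the irreducibility assumption guaranteeing a unique stationary distribution.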

*x* and *h* are no longer discrete but instead may take values in infinite (possibly uncountable) sets. We assume *P*(*x* ∣ *h*) and *P*(*h* ∣ *x*) are such that there still exists a unique stationary distribution *P*(*x*, *h*). Lemma 1 and its proof remain unchanged. On the other hand, lemma 2 holds only for distributions *P* for which the gradient and the integral can be exchanged. This can be guaranteed under additional “niceness” assumptions on *P*, and we assume it is the case for the distributions *P*(*X*), *P*(*x* ∣ *h*), and *P*(*h* ∣ *x*). Consequently, the gradient expansion, equation 3.3, in theorem 1 can be obtained in the same way as before. The key point in justifying further truncation of this expansion is the convergence toward zero of the bias. This convergence is not necessarily guaranteed unless we have convergence of *P*(*X*_{t} ∣ *x*_{1}) to *P*(*X*) in the sense that the conditional expectation of the truncated term converges to the corresponding expectation under the stationary distribution *P*. If the distributions *P*(*x* ∣ *h*) and *P*(*h* ∣ *x*) are such that equation 3.8 is verified, then this limit is also zero according to lemma 2, and it makes sense to truncate equation 3.3. Note, however, that equation 3.8 does not necessarily hold in the most general case (Hernández-Lerma & Lasserre, 2003).

## 4. Connection with Contrastive Divergence

### 4.1. Theoretical Analysis.

*t* steps, that is, ignoring the final term *E*[∂ log *P*(*X*_{t})/∂θ ∣ *x*_{1}], yielding the corresponding approximation of the log-likelihood gradient. Note how the expectation can be readily replaced by sampling *x*_{t} ∼ *P*(*X*_{t} ∣ *x*_{1}), giving rise to a stochastic update whose expected value is the above approximation. This is also exactly the CD-(*t* − 1) update (see equation 2.12).

*k* was introduced earlier (Carreira-Perpiñan & Hinton, 2005; Hinton et al., 2006). The bound in equation 3.5 explicitly relates the convergence of the chain (through the convergence of the error ϵ_{t} in estimating *P*(*X*) with *P*(*X*_{k+1} = *x* ∣ *x*_{1})) to the approximation error of the CD-*k* gradient estimator. When the RBM weights are large, it is plausible that the chain mixes more slowly because there is less randomness in each sampling step. Hence, it might be advisable to use larger values of *k* as the weights grow during training. It is thus interesting to study how fast the bias converges to zero as *t* increases, depending on the magnitude of the weights in an RBM. Markov chain theory (Schmidt, 2006) provides a geometric convergence rate for ϵ_{t} in the discrete case, in terms of *N*_{x}, the number of possible configurations of *x*, and *a*, the smallest element of the transition matrix of the Markov chain. In order to obtain a meaningful bound on equation 3.5, we also need to bound the gradient of the log likelihood. In the following, we thus consider the typical case of a binomial RBM, with θ being a weight *W*_{ij} between hidden unit *i* and visible unit *j*. Recall equation 2.10. For any *x*, both *P*(*H*_{i} = 1 ∣ *X* = *x*) and *x*_{j} are in (0, 1). Consequently, the expectation in equation 2.10 is also in (0, 1), which bounds the gradient. Combining this inequality with equation 4.2, we obtain a bound on equation 3.5. It remains to quantify *a*, the smallest term in the Markov chain transition matrix, each element of which is a product of the factorized conditionals. Since 1 − sigm(*z*) = sigm(−*z*), each visible conditional satisfies *P*(*x*_{j} ∣ *h*) ⩾ sigm(−α_{j}), where we denote α_{j} = ∑_{i} |*W*_{ij}| + |*b*_{j}| and β_{i} = ∑_{j} |*W*_{ij}| + |*c*_{i}|. We can obtain in a similar way that *P*(*h*_{i} ∣ *x*_{1}) ⩾ sigm(−β_{i}). In order to simplify notation (at the cost of a looser bound), let us denote by α and β the maxima of the α_{j} and β_{i}, respectively. Then, by combining equations 4.3 and 4.4, we finally obtain a bound on the CD bias in terms of sigm(−α), sigm(−β), *N*_{x} = 2^{d_x}, and *N*_{h} = 2^{d_h}. Note that although this bound is tight (and equal to zero) for any *t* ⩾ 2 when weights and biases are set to zero (since mixing is then immediate), it is likely to be loose in practical cases. Indeed, the bound approaches *N*_{x} fast as the two sigmoids decrease toward zero. However, the bound clarifies the importance of weight size for the bias of the CD approximation. It is also interesting to note that this bound on the bias decreases exponentially with the number of steps performed in the CD update, even though this decrease may look linear when the bound is loose (which is usually the case in practice): in such cases, the bound can be written *N*_{x}(1 − γ)^{t−1} with a small γ, and is thus close to *N*_{x}(1 − γ(*t* − 1)), a linear decrease in *t*.

If the CD update is considered a biased and noisy estimator of the true log-likelihood gradient, it can be shown that stochastic gradient descent converges (to a local minimum), provided that the bias is not too large (Yuille, 2005). On the other hand, one should keep in mind that for small *k*, there is no guarantee that CD converges near the maximum likelihood solution (MacKay, 2001). The experiments below confirm the above theoretical results and suggest that even when the bias is large and the weights are large, the sign of the CD estimator may be generally correct.
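On a model small enough to enumerate, the CD-*k* bias can be computed exactly rather than bounded. The sketch below (our illustration, for an arbitrary weight and starting configuration) evaluates the exact bias δ_{k} = ∑_{x} (*P*(*X*_{k+1} = *x* ∣ *x*_{1}) − *P*(*x*)) *E*[*h*_{0} ∣ *x*] *x*_{0} of the negative-phase statistic for increasing *k*:

```python
import numpy as np
from itertools import product

sigm = lambda z: 1.0 / (1.0 + np.exp(-z))
rng = np.random.default_rng(2)

d_x, d_h = 3, 3
W = rng.normal(scale=1.0, size=(d_h, d_x))
b, c = np.zeros(d_x), np.zeros(d_h)

xs = [np.array(v, float) for v in product([0, 1], repeat=d_x)]
hs = [np.array(v, float) for v in product([0, 1], repeat=d_h)]

def energy(x, h):
    return -(b @ x + c @ h + h @ W @ x)

# Exact marginal P(x) by brute force over all configurations
px = np.array([sum(np.exp(-energy(x, h)) for h in hs) for x in xs])
px /= px.sum()

def q_h(x):  # E[h | x] = sigm(c + W x)
    return sigm(c + W @ x)

def p_h_given_x(h, x):
    q = q_h(x)
    return np.prod(q**h * (1 - q)**(1 - h))

def p_x_given_h(x, h):
    q = sigm(b + W.T @ h)
    return np.prod(q**x * (1 - q)**(1 - x))

# Transition matrix of one full Gibbs step over visible configurations
N = len(xs)
T = np.zeros((N, N))
for i, x in enumerate(xs):
    for j, x2 in enumerate(xs):
        T[j, i] = sum(p_x_given_h(x2, h) * p_h_given_x(h, x) for h in hs)

# Exact CD-k bias for weight W_00, starting from x1 = xs[1]
g = np.array([q_h(x)[0] * x[0] for x in xs])  # negative-phase statistic
p = np.zeros(N)
p[1] = 1.0
bias = []
for k in range(1, 11):
    p = T @ p  # distribution of X_{k+1} given x1
    bias.append(abs((p - px) @ g))
print(bias)  # |delta_k| shrinks as k grows
```

The decay of `bias` with *k* is the exact counterpart of the bound above: geometric, with a rate set by how fast the chain mixes, hence slower for larger weights.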

### 4.2. Experiments.

*k* update relates to the gradient of the log likelihood. More specifically, in order to remove the variance caused by sampling noise, we are interested in comparing two quantities: Δ(*x*_{1}), the gradient of the log likelihood, equation 2.11, and Δ_{k}(*x*_{1}), its average approximation by CD-*k*, equation 4.1. The difference between these two terms is the bias δ_{k}(*x*_{1}) = Δ(*x*_{1}) − Δ_{k}(*x*_{1}), according to equation 3.3, which is bounded as shown in section 4.1. Note that our analysis differs from the one in Carreira-Perpiñan and Hinton (2005), where the solutions (after convergence) found by CD-*k* and by gradient descent on the negative log likelihood were compared, whereas we focus on the updates themselves.

In these experiments, we use two manually generated binary data sets:

- *1DBall*_{d} is a *d*-dimensional data set representing “balls” on a one-dimensional discrete line with *d* pixels. Half of the data examples are generated by first picking the position *b* of the beginning of the ball (among *d* possibilities), then its width *w*. Pixels from *b* to *b* + *w* − 1 (modulo *d*) are then set to 1, while the rest of the pixels are set to 0. The second half of the data set is generated by simply “reverting” its first half (switching zeros and ones).
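The *1DBall*_{d} generation can be sketched as follows. This is our own reconstruction; in particular, the admissible width range is elided in the text, so the `max_width` default below is a placeholder assumption:

```python
import numpy as np

def one_d_ball(d, max_width=None):
    """Generate 1DBall_d examples: "balls" on a circular line of d pixels.

    The admissible width range is not specified in the text; the default
    max_width = d // 2 is an assumption made for illustration.
    """
    if max_width is None:
        max_width = d // 2
    data = []
    for start in range(d):                 # position b of the ball start
        for w in range(1, max_width + 1):  # ball width (assumed range)
            x = np.zeros(d, dtype=int)
            for k in range(w):
                x[(start + k) % d] = 1     # wrap around modulo d
            data.append(x)
    data = np.array(data)
    # Second half of the data set: the first half with 0s and 1s switched
    return np.vstack([data, 1 - data])

X = one_d_ball(6)
print(X.shape)
```

Enumerating all (position, width) pairs rather than sampling them yields the same support deterministically, which is convenient for the exact computations used below.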

In order to be able to compute δ_{k}(*x*_{1}) exactly, only RBMs with a small number (fewer than 10) of visible and hidden units are used. We compute these quantities for all θ = *W*_{ij} (the weights of the RBM connections between hidden and visible units). The following statistics are then computed over all weights *W*_{ij} and all training examples *x*_{1}:

- The weight magnitude indicators α and β, as defined in equations 4.5 and 4.6
- The mean of the gradient bias |δ_{k}(*x*_{1})|, denoted by δ_{k} and called the *absolute bias*
- The median of |δ_{k}(*x*_{1})/Δ(*x*_{1})|, that is, the relative difference between the CD-*k* update and the log-likelihood gradient^{1} (we use the median to avoid numerical issues for small gradients), denoted by *r*_{k} and called the *relative bias*
- The sign error *s*_{k}, that is, the fraction of updates for which Δ_{k}(*x*_{1}) and Δ(*x*_{1}) have different signs
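Under one natural reading of these definitions (the exact formula for the relative bias is elided in the text, so the ratio used here is an assumption), the three statistics can be computed from arrays of exact gradients and expected CD-*k* updates, one entry per (weight, example) pair:

```python
import numpy as np

def bias_statistics(delta, delta_k):
    """Compute the three statistics above.

    delta:   exact log-likelihood gradients Delta(x1), one per
             (weight W_ij, training example x1) pair
    delta_k: the corresponding expected CD-k updates Delta_k(x1)
    (Array names and the relative-bias ratio are our assumptions.)
    """
    diff = delta - delta_k                                    # bias
    absolute_bias = np.mean(np.abs(diff))                     # delta_k
    relative_bias = np.median(np.abs(diff) / np.abs(delta))   # r_k
    sign_error = np.mean(np.sign(delta) != np.sign(delta_k))  # s_k
    return absolute_bias, relative_bias, sign_error

# Toy values for illustration
delta = np.array([0.5, -0.2, 0.1, -0.4])
delta_k = np.array([0.3, -0.1, -0.05, -0.35])
abs_b, rel_b, s = bias_statistics(delta, delta_k)
print(abs_b, rel_b, s)
```

The median in `relative_bias` avoids the blow-up of the ratio when a true gradient entry is close to zero, as noted above.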

The RBM is initialized with zero biases and small weights uniformly sampled in a small interval depending on *d*, the number of visible units. Note that even with such small weights, the bound from equation 4.7 is already close to its maximum value *N*_{x}, so it is not interesting to plot it in the figures. The number of hidden units is also set to *d* for simplicity. The RBM weights and biases are trained by CD-1 with a learning rate of 10^{−3}. Keep in mind that we are not interested in comparing the learning processes themselves, but rather in how the quantities above evolve for different kinds of RBMs, in particular as weights become larger during training. Training is stopped once the average negative log likelihood over training samples is within 5% relative difference of its lower bound, which here is log(*N*), where *N* is the number of training samples (which are all unique).

Figure 1 shows a typical example of how the quantities defined above evolve during training (β is not plotted as it exhibits the same behavior as α). As the weights increase (as shown by α), so does the absolute value of the left-out term in CD-1 (δ_{1}) and its relative magnitude compared to the log likelihood (*r*_{1}). In particular, we observe that most of the log-likelihood gradient is quickly lost in CD-1 (here after only 80,000 updates), so that CD-1 is no longer a good approximation of negative log-likelihood gradient descent. However, the RBM is still able to learn its input distribution, which can be explained by the fact that the “sign disagreement” *s*_{1} between CD-1 and the log-likelihood gradient remains small (less than 5% for the whole training period).

Figures 2 and 4 show how *r*_{k} and *s*_{k}, respectively, vary with the number of steps *k* performed in CD, on the *Diag*_{d} (left) and *1DBall*_{d} (right) data sets, for *d* ∈ {6, 7, 8, 9, 10}. All of these values are taken when our stopping criterion is reached (i.e., when we are close enough to the empirical distribution). It may seem surprising that *r*_{k} does not systematically increase with *d*, but remember that each RBM may be trained for a different number of iterations, leading to potentially very different weight magnitudes. Figure 3 shows the corresponding values of α and β (which reflect the magnitude of the weights). We can see, for instance, that α and β for data set *1DBall*_{6} are larger than for data set *1DBall*_{7}, which explains why *r*_{k} is also larger, as shown in Figure 2 (right). Figure 5 shows a “smoother” behavior of *r*_{k} with regard to *d* when all RBMs are trained for a fixed (small) number of iterations, illustrating how the quality of CD-*k* (as an approximation to negative log-likelihood gradient descent) decreases in higher dimension.

We observe in Figure 2 that the relative bias *r*_{k} becomes large not only for small *k* (which means the CD-*k* update is a poor approximation of the true log-likelihood gradient), but also for larger *k* in higher dimensions. As a result, increasing *k* moderately (from 1 to 10) still leaves a large approximation error (e.g., from 80% to 50% with *d* = 10 in Figure 2) in spite of a 10-fold increase in computation time. This suggests that when the goal is to obtain a more precise estimator of the gradient, alternatives to CD-*k* such as persistent CD (Tieleman, 2008) may be more appropriate. On the other hand, we notice from Figure 4 that the disagreement *s*_{k} between the two updates remains low even for small *k* in larger dimensions (in our experiments, it always remains below 5%). This may explain why CD-1 can successfully train RBMs even when connection weights become larger and the Markov chain no longer mixes fast. An intuitive explanation for this empirical observation is the popular view of CD-*k* as a process that, on one hand, decreases the energy of a training sample *x*_{1} (first term in equation 4.8) and, on the other hand, increases the energy of other nearby input examples (second term), thus leading to an overall increase of *P*(*x*_{1}).

## 5. Connection with Autoassociator Reconstruction Error

*h*_{1} as follows, taking the expectation with respect to *H*_{1} conditioned on *x*_{1}. When lemma 2 is used, the second term is equal to zero. If we truncate this expansion by removing the last term (as is done in CD), we obtain an average over *P*(*h*_{1} ∣ *x*_{1}) that could be approximated by sampling. Note that this is not quite the negated gradient of the stochastic reconstruction error (SRE). Let us consider a notion of mean-field approximation by which an average *E*_{X}[*f*(*X*)] over configurations of a random variable *X* is approximated by *f*(*E*[*X*]), that is, using the mean configuration. Applying such an approximation to the SRE, equation 5.3, gives the reconstruction error typically used in training autoassociators (Rumelhart et al., 1986; Bourlard & Kamp, 1988; Hinton & Zemel, 1994; Schwenk & Milgram, 1995; Japkowicz et al., 2000; Bengio et al., 2007; Ranzato et al., 2007; Larochelle et al., 2007), in which the mean-field output of the hidden units given the observed input *x*_{1} replaces a sampled *h*_{1}. If we apply the mean-field approximation to the truncation of the log likelihood given in equation 5.2, we obtain a deterministic criterion. It is arguable whether the mean-field approximation per se gives us license to include the effect of θ on the mean-field hidden output, but if we do so, then we obtain the gradient of the reconstruction error, equation 5.4, up to the sign (since the log likelihood is maximized while the reconstruction error is minimized).

*x*_{2} (as seen in section 2.3), ignoring the remainder, we see (using the fact that the second term of equation 5.1 is zero) that a reconstruction-error update truncates the chain expansion one step earlier (at *h*_{1}), ignoring its own remainder and working on a mean-field approximation instead of a stochastic approximation. The reconstruction error gradient can thus be seen as a more biased approximation of the log-likelihood gradient than CD-1. Comparative experiments between reconstruction-error training and CD-1 training confirm this view (Bengio et al., 2007; Larochelle et al., 2007): CD-1 updating generally has a slight advantage over the reconstruction error gradient.

However, reconstruction error can be computed deterministically and has been used as an easy method to monitor the progress of training RBMs with CD, whereas the CD-*k* itself is generally not the gradient of anything and is stochastic.
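A sketch (ours) of this deterministic mean-field reconstruction error for a binary RBM, using cross-entropy between the input and its reconstruction (one common choice of reconstruction error for binary inputs):

```python
import numpy as np

sigm = lambda z: 1.0 / (1.0 + np.exp(-z))

def reconstruction_error(W, b, c, x1):
    """Deterministic mean-field reconstruction error for one example.

    h_bar = sigm(c + W x1) is the mean-field hidden output; the input is
    "reconstructed" through the visible conditional evaluated at h_bar
    and scored by cross-entropy. No sampling is involved, so the value
    can be monitored without noise.
    """
    h_bar = sigm(c + W @ x1)
    x_hat = sigm(b + W.T @ h_bar)  # mean-field reconstruction of x1
    return -np.sum(x1 * np.log(x_hat) + (1 - x1) * np.log(1 - x_hat))

rng = np.random.default_rng(0)
d_x, d_h = 6, 4
W = rng.normal(scale=0.1, size=(d_h, d_x))
b, c = np.zeros(d_x), np.zeros(d_h)
x1 = rng.integers(0, 2, d_x).astype(float)
err = reconstruction_error(W, b, c, x1)
print(err)
```

Because the same forward quantities are already computed during a CD update, tracking this error adds almost no cost, which is why it is a convenient monitoring device.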

## 6. Conclusion

This letter provides a theoretical and empirical analysis of the log-likelihood gradient in graphical models involving a hidden variable *h* in addition to the observed variable *x*, and where the conditionals *P*(*h* ∣ *x*) and *P*(*x* ∣ *h*) are easy to compute and sample from. That includes the case of contrastive divergence (CD) for restricted Boltzmann machines (RBMs). The analysis justifies the use of a short Gibbs chain of length *k* to obtain a biased estimator of the log-likelihood gradient. Although our results do not guarantee that the bias decreases monotonically with *k*, we prove a bound that does, and we observe this decrease experimentally. Moreover, although this bias may be large when using only a few steps in the Gibbs chain (as is usually done in practice), our empirical analysis indicates that this estimator remains a good update direction compared to the true (but intractable) log-likelihood gradient.

The analysis also shows a connection between reconstruction error, log likelihood, and CD, which helps explain the better results generally obtained with CD and justifies the use of reconstruction error as a monitoring device when training an RBM by CD. The generality of the analysis also opens the door to other learning algorithms in which *P*(*h* ∣ *x*) and *P*(*x* ∣ *h*) do not have the parametric forms of RBMs.

## Notes

This quantity is more interesting than the absolute bias because it tells us what proportion of the true gradient of the log likelihood is “lost” by using the CD-*k* update.