Abstract

We study an expansion of the log likelihood in undirected graphical models such as the restricted Boltzmann machine (RBM), where each term in the expansion is associated with a sample in a Gibbs chain alternating between two random variables (the visible vector and the hidden vector in RBMs). We are particularly interested in estimators of the gradient of the log likelihood obtained through this expansion. We show that its residual term converges to zero, justifying the use of a truncation—running only a short Gibbs chain, which is the main idea behind the contrastive divergence (CD) estimator of the log-likelihood gradient. By truncating even more, we obtain a stochastic reconstruction error, related through a mean-field approximation to the reconstruction error often used to train autoassociators and stacked autoassociators. The derivation is not specific to the particular parametric forms used in RBMs and requires only convergence of the Gibbs chain. We present theoretical and empirical evidence linking the number of Gibbs steps k and the magnitude of the RBM parameters to the bias in the CD estimator. These experiments also suggest that the sign of the CD estimator is correct most of the time, even when the bias is large, so that CD-k is a good descent direction even for small k.

1.  Introduction

Motivated by the theoretical limitations of a large class of nonparametric learning algorithms (Bengio & Le Cun, 2007), recent research has focused on learning algorithms for so-called deep architectures (Hinton, Osindero, & Teh, 2006; Hinton & Salakhutdinov, 2006; Bengio, Lamblin, Popovici, & Larochelle, 2007; Salakhutdinov & Hinton 2007; Ranzato, Poultney, Chopra, & Le Cun, 2007; Larochelle, Erhan, Courville, Bergstra, & Bengio, 2007). These represent the learned function through many levels of composition of elements taken in a small or parametric set. The most common element type found in these papers is the soft or hard linear threshold unit, or artificial neuron,
$$s(w'x + b), \tag{1.1}$$
with parameters w (vector) and b (scalar), and where s(a) could be $\mathbf{1}_{a > 0}$, $\tanh(a)$, or $\mathrm{sigm}(a) = \frac{1}{1 + e^{-a}}$, for example.

Here, we are particularly interested in the restricted Boltzmann machine (Smolensky, 1986; Freund & Haussler, 1994; Hinton, 2002; Welling, Rosen-Zvi, & Hinton, 2005; Carreira-Perpiñan & Hinton, 2005), a family of bipartite graphical models with hidden variables (the hidden layer) that are used as components in building deep belief networks (Hinton et al., 2006; Bengio et al., 2007; Salakhutdinov & Hinton, 2007; Larochelle et al., 2007). Deep belief networks have yielded impressive performance on several benchmarks, clearly beating the state-of-the-art and other nonparametric learning algorithms in several cases. A very successful learning algorithm for training a restricted Boltzmann machine (RBM) is the contrastive divergence (CD) algorithm. An RBM represents the joint distribution between a visible vector X, which is the random variable observed in the data, and a hidden random variable H. There is no tractable representation of P(X, H), but the conditional distributions P(H ∣ X) and P(X ∣ H) can easily be computed and sampled from. CD-k is based on a Gibbs Markov chain Monte Carlo (MCMC) chain starting at an example X = x1 from the empirical distribution and converging to the RBM's generative distribution P(X). CD-k relies on a biased estimator obtained after a small number k of Gibbs steps (often only one step). Each Gibbs step is composed of two alternating substeps: sampling h_t ∼ P(H ∣ X = x_t) and sampling x_{t+1} ∼ P(X ∣ H = h_t), starting at t = 1.
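To make the alternating substeps concrete, here is a minimal NumPy sketch (ours, not the authors' code) of k such Gibbs steps for a binary RBM; the conditionals used, P(H_i = 1 ∣ x) = sigm(c_i + W_i x) and P(X_j = 1 ∣ h) = sigm(b_j + W'_j h), are the standard binomial-RBM conditionals made explicit in section 2, and all parameter values below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_step(x, W, b, c, rng):
    """One full Gibbs step: h_t ~ P(H | X = x_t), then x_{t+1} ~ P(X | H = h_t)."""
    h = (rng.random(c.shape) < sigm(c + W @ x)).astype(float)        # hidden substep
    x_next = (rng.random(b.shape) < sigm(b + W.T @ h)).astype(float)  # visible substep
    return h, x_next

# CD-k runs the chain for only k steps, starting at a training example x1.
d_x, d_h, k = 6, 4, 1
W = 0.01 * rng.standard_normal((d_h, d_x))
b, c = np.zeros(d_x), np.zeros(d_h)
x = rng.integers(0, 2, size=d_x).astype(float)   # stand-in for a data point x1
for _ in range(k):
    h, x = gibbs_step(x, W, b, c, rng)
```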

The surprising empirical result is that even k = 1 (CD-1) often gives good results. An extensive numerical comparison of training with CD-k versus exact log-likelihood gradient has been presented in Carreira-Perpiñan and Hinton (2005). In these experiments, taking k larger than 1 gives more precise results, although very good approximations of the solution can be obtained even with k = 1. Here we present a follow-up to Carreira-Perpiñan and Hinton (2005) that brings further theoretical and empirical support to CD-k, even for small k.

CD-1 was originally justified (Hinton, 2002) as an approximation of the gradient of $\mathrm{KL}(\hat{P}\,\|\,P) - \mathrm{KL}(P(X_2 = \cdot \mid x_1)\,\|\,P)$, where KL is the Kullback-Leibler divergence, $\hat{P}$ is the empirical distribution of the training data, and $P(X_2 = \cdot \mid x_1)$ denotes the distribution of the chain after one step. The term left out in the approximation of the gradient of the KL difference is (Hinton, 2002)
$$\frac{\partial P(X_2 = \cdot \mid x_1)}{\partial\theta}\;\frac{\partial\,\mathrm{KL}\big(P(X_2 = \cdot \mid x_1)\,\|\,P\big)}{\partial P(X_2 = \cdot \mid x_1)}, \tag{1.2}$$
which was empirically found to be small. On the one hand, it is not clear how well aligned the log-likelihood gradient is with the gradient of the above KL difference. On the other hand, it would be nice to prove that the left-out terms are small in some sense. One of the motivations for this letter is to obtain the CD algorithm from a different route, by which we can prove that the term left out with respect to the log-likelihood gradient is small and converges to zero as we take k larger.

We show that the log likelihood and its gradient can be expanded by considering samples in a Gibbs chain. We show that when the gradient expansion is truncated to k steps, the remainder converges to zero at a rate that depends on the mixing rate of the chain. The inspiration for this derivation comes from Hinton et al. (2006): first, the idea that the Gibbs chain can be associated with an infinite directed graphical model (which here we associate to an expansion of the log likelihood and its gradient), and second, that the convergence of the chain justifies CD (since the kth sample from the Gibbs chain becomes equivalent to a model sample). However, our empirical results also show that the convergence of the chain alone cannot explain the good results obtained by CD, because this convergence becomes too slow as weights increase during training. It turns out that even when k is not large enough for the chain to converge (e.g., the typical value k = 1), the CD-k rule remains a good update direction to increase the log likelihood of the training data.

Finally, we show that when the series is truncated to a single substep, we obtain the gradient of a stochastic reconstruction error. A mean-field approximation of that error is the reconstruction error often used to train autoassociators (Rumelhart, Hinton, & Williams, 1986; Bourlard & Kamp, 1988; Hinton & Zemel, 1994; Schwenk & Milgram, 1995; Japkowicz, Hanson, & Gluck, 2000). Autoassociators can be stacked using the same principle used to stack RBMs into a deep belief network in order to train deep neural networks (Bengio et al., 2007; Ranzato et al., 2007; Larochelle et al., 2007). Reconstruction error has also been used to monitor progress in training RBMs by CD (Taylor, Hinton, & Roweis, 2007; Bengio et al., 2007), because it can be computed tractably and analytically, without sampling noise.

In the following, we drop the X = x notation and use shorthand such as P(x ∣ h) instead of P(X = x ∣ H = h). The t index is used to denote position in the Markov chain, whereas indices i or j denote an element of the hidden or visible vector, respectively.

2.  Restricted Boltzmann Machines and Contrastive Divergence

2.1.  Boltzmann Machines.

A Boltzmann machine (Hinton, Sejnowski, & Ackley, 1984; Hinton & Sejnowski, 1986) is a probabilistic model of the joint distribution between visible units x, marginalizing over the values of hidden units h,
$$P(x) = \sum_h P(x, h), \tag{2.1}$$
and where the joint distribution between hidden and visible units is associated with a quadratic energy function
$$\mathcal{E}(x, h) = -b'x - c'h - h'Wx - x'Ux - h'Vh, \tag{2.2}$$
such that
$$P(x, h) = \frac{e^{-\mathcal{E}(x, h)}}{Z}, \tag{2.3}$$
where $Z = \sum_{x, h} e^{-\mathcal{E}(x, h)}$ is a normalization constant (called the partition function) and (b, c, W, U, V) are parameters of the model. Here bj is the bias of visible unit xj, ci is the bias of hidden unit hi, and the matrices W, U, and V represent interaction terms between units. Note that nonzero U and V mean that there are interactions between units belonging to the same layer (hidden layer or visible layer). Marginalizing over h at the level of the energy yields the so-called free energy:
$$\mathrm{FreeEnergy}(x) = -\log\sum_h e^{-\mathcal{E}(x, h)}. \tag{2.4}$$
We can rewrite the log likelihood accordingly:
$$\log P(x) = -\mathrm{FreeEnergy}(x) - \log Z, \qquad \text{with } Z = \sum_{\tilde{x}} e^{-\mathrm{FreeEnergy}(\tilde{x})}. \tag{2.5}$$
Differentiating the above, the gradient of the log likelihood with respect to some model parameter θ can be written as follows:
$$\frac{\partial \log P(x)}{\partial\theta} = -\sum_h P(h \mid x)\,\frac{\partial \mathcal{E}(x, h)}{\partial\theta} + \sum_{\tilde{x}, \tilde{h}} P(\tilde{x}, \tilde{h})\,\frac{\partial \mathcal{E}(\tilde{x}, \tilde{h})}{\partial\theta}. \tag{2.6}$$
Computing $\frac{\partial \mathcal{E}(x, h)}{\partial\theta}$ is straightforward. Therefore, if sampling from the model were possible, one could obtain a stochastic gradient for use in training the model as follows. Two samples are necessary: h given x for the first term, which is called the positive phase, and an $(\tilde{x}, \tilde{h})$ pair from P(x, h) in what is called the negative phase. Note how the resulting stochastic gradient estimator,
$$-\frac{\partial \mathcal{E}(x, h)}{\partial\theta} + \frac{\partial \mathcal{E}(\tilde{x}, \tilde{h})}{\partial\theta}, \tag{2.7}$$
has one term for each of the positive phase and negative phase, with the same form but opposite signs. Let u = (x, h) be the vector of all the unit values. In a general Boltzmann machine, one can compute and sample from P(u_i ∣ u_{−i}), where u_{−i} is the vector of all unit values except the ith. Gibbs sampling with as many substeps as there are units in the model has been used to train Boltzmann machines in the past, with very long chains, yielding correspondingly long training times.
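As an illustration, the following sketch evaluates the quadratic energy of equation 2.2 and the corresponding unnormalized joint probability; the dimensions and random parameter values are assumptions made only for the example.

```python
import numpy as np

def energy(x, h, b, c, W, U, V):
    """E(x, h) = -b'x - c'h - h'Wx - x'Ux - h'Vh (equation 2.2)."""
    return -(b @ x + c @ h + h @ W @ x + x @ U @ x + h @ V @ h)

rng = np.random.default_rng(0)
d_x, d_h = 3, 2
b, c = rng.standard_normal(d_x), rng.standard_normal(d_h)
W = rng.standard_normal((d_h, d_x))
U = rng.standard_normal((d_x, d_x))   # visible-visible interactions
V = rng.standard_normal((d_h, d_h))   # hidden-hidden interactions

x = np.array([1.0, 0.0, 1.0])
h = np.array([0.0, 1.0])
unnormalized_joint = np.exp(-energy(x, h, b, c, W, U, V))   # P(x, h) up to 1/Z
```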

2.2.  Restricted Boltzmann Machines.

In a restricted Boltzmann machine (RBM), U = 0 and V = 0 in equation 2.2—that is, the only interaction terms are between a hidden unit and a visible unit, but not between units of the same layer. This form of model was first introduced under the name of Harmonium (Smolensky, 1986). Because of this restriction, P(h ∣ x) and P(x ∣ h) factorize and can be computed and sampled from easily. This enables the use of a two-step Gibbs sampling alternating between h ∼ P(H ∣ X = x) and x ∼ P(X ∣ H = h). In addition, the positive phase gradient can be obtained exactly and efficiently because the free energy factorizes:
$$\mathrm{FreeEnergy}(x) = -b'x - \sum_{i=1}^{d_h}\log\sum_{h_i} e^{h_i (c_i + W_i x)},$$
where Wi is the ith row of W and dh the dimension of h. Using the same type of factorization, one obtains, for example, in the most common case where hi is binary,
$$P(h \mid x) = \prod_{i=1}^{d_h} P(h_i \mid x), \tag{2.8}$$
where
$$P(h_i = 1 \mid x) = \mathrm{sigm}(c_i + W_i x). \tag{2.9}$$
The log-likelihood gradient for Wij thus has the form
$$\frac{\partial \log P(x)}{\partial W_{ij}} = P(h_i = 1 \mid x)\,x_j - E_X\big[P(H_i = 1 \mid X)\,X_j\big], \tag{2.10}$$
where EX is an expectation over P(X). Samples from P(X) can be approximated by running an alternating Gibbs chain x1 ⇒ h1 ⇒ x2 ⇒ h2 ⇒ … Since the model P is trying to imitate the empirical distribution $\hat{P}$, it is a good idea to start the chain with a sample from $\hat{P}$, so that we start the chain from a distribution close to the asymptotic one.
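For small models, the expectation EX in equation 2.10 can be computed exactly by enumerating all visible configurations, as in the sketch below (our illustration, feasible only for small d_x, as in the experiments of section 4.2).

```python
import itertools
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def free_energy(x, W, b, c):
    # For binary h: FreeEnergy(x) = -b'x - sum_i log(1 + exp(c_i + W_i x))
    return -(b @ x) - np.sum(np.logaddexp(0.0, c + W @ x))

def exact_grad_W(x1, W, b, c):
    """d log P(x1) / dW following equation 2.10: positive phase minus E_X[...]."""
    d_x = len(b)
    xs = np.array(list(itertools.product([0.0, 1.0], repeat=d_x)))   # all 2^{d_x} configs
    log_p = np.array([-free_energy(x, W, b, c) for x in xs])
    p = np.exp(log_p - log_p.max())
    p /= p.sum()                                                     # exact P(X = x)
    positive = np.outer(sigm(c + W @ x1), x1)                        # P(h_i = 1 | x1) * x1_j
    negative = sum(pi * np.outer(sigm(c + W @ x), x) for pi, x in zip(p, xs))
    return positive - negative
```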

In most uses of RBMs (Hinton, 2002; Carreira-Perpiñan & Hinton, 2005; Hinton et al., 2006; Bengio et al., 2007), both hi and xj are binary, but many extensions are possible and have been studied, including cases where hidden or visible units are continuous-valued (Freund & Haussler, 1994; Welling et al., 2005; Bengio et al., 2007).

2.3.  Contrastive Divergence.

The k-step contrastive divergence algorithm (CD-k) (Hinton, 1999, 2002) involves a second approximation besides the use of MCMC to sample from P. This additional approximation introduces some bias in the gradient: we run the MCMC chain for only k steps, starting from the observed example x. Using the same technique as in equation 2.6 to express the log-likelihood gradient, but keeping the sums over h inside the free energy, we obtain
$$\frac{\partial \log P(x)}{\partial\theta} = -\frac{\partial\,\mathrm{FreeEnergy}(x)}{\partial\theta} + E_{\tilde{X}}\!\left[\frac{\partial\,\mathrm{FreeEnergy}(\tilde{X})}{\partial\theta}\right]. \tag{2.11}$$
The CD-k update after seeing example x is taken proportional to
$$-\frac{\partial\,\mathrm{FreeEnergy}(x)}{\partial\theta} + \frac{\partial\,\mathrm{FreeEnergy}(\tilde{x})}{\partial\theta}, \tag{2.12}$$
where $\tilde{x}$ is a sample from our Markov chain after k steps. We know that when k → ∞, the samples from the Markov chain converge to samples from P, and the bias goes away. We also know that when the model distribution is very close to the empirical distribution (i.e., $P \approx \hat{P}$), then when we start the chain from x (a sample from $\hat{P}$), the MCMC samples have already converged to P, and we need fewer sampling steps to obtain an unbiased (albeit correlated) sample from P.
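A minimal sketch of the resulting CD-k weight update for a binary RBM follows; it simply runs k Gibbs steps from x1 and takes the difference of the two free-energy gradient terms of equation 2.12 (helper names are ours, and the result is scaled by a learning rate in practice).

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd_k_update_W(x1, W, b, c, k, rng):
    """CD-k update direction for W (equation 2.12) in a binary RBM."""
    x_tilde = x1.copy()
    for _ in range(k):                                    # k Gibbs steps starting at x1
        h = (rng.random(c.shape) < sigm(c + W @ x_tilde)).astype(float)
        x_tilde = (rng.random(b.shape) < sigm(b + W.T @ h)).astype(float)
    positive = np.outer(sigm(c + W @ x1), x1)             # -dFreeEnergy(x1)/dW
    negative = np.outer(sigm(c + W @ x_tilde), x_tilde)   # -dFreeEnergy(x_tilde)/dW
    return positive - negative

# Usage: W += learning_rate * cd_k_update_W(x1, W, b, c, k=1, rng=np.random.default_rng(0))
```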

3.  Log-Likelihood Expansion via Gibbs Chain

In the following we consider the case where both h and x can take only a finite number of values. We also assume that there is no pair (x, h) such that P(x ∣ h) = 0 or P(h ∣ x) = 0. This ensures that the Markov chain associated with Gibbs sampling is irreducible (one can go from any state to any other state) and that there exists a unique stationary distribution P(x, h) to which the chain converges.

Lemma 1.
Consider the irreducible Gibbs chain x1 ⇒ h1 ⇒ x2 ⇒ h2… starting at data point x1. The log likelihood can be written as follows at any step t of the chain,
$$\log P(x_1) = \sum_{s=1}^{t-1}\left(\log\frac{P(x_s \mid h_s)}{P(h_s \mid x_s)} + \log\frac{P(h_s \mid x_{s+1})}{P(x_{s+1} \mid h_s)}\right) + \log P(x_t), \tag{3.1}$$
and since this is true for any path,
$$\log P(x_1) = \sum_{s=1}^{t-1} E\!\left[\log\frac{P(X_s \mid H_s)}{P(H_s \mid X_s)} + \log\frac{P(H_s \mid X_{s+1})}{P(X_{s+1} \mid H_s)}\;\Big|\;x_1\right] + E\big[\log P(X_t) \mid x_1\big], \tag{3.2}$$
where expectations are over Markov chain sample paths, conditioned on the starting sample x1.

Proof.
Equation 3.1 is obvious, while equation 3.2 is obtained by writing
$$\log P(x_1) = E\big[\log P(x_1) \mid x_1\big],$$
where the expectation is over the sample path (H_1, X_2, …, X_t) of the chain,
and substituting equation 3.1.

Note that $E_{X_t}[\log P(X_t) \mid x_1]$ is the negative entropy of the tth visible sample of the chain, and it does not become smaller as t → ∞. Therefore, it does not seem reasonable to truncate this expansion. However, the gradient of the log likelihood is more interesting. But first we need a simple lemma:

Lemma 2.
For any model P(Y) with parameters θ,
$$E_Y\!\left[\frac{\partial \log P(Y)}{\partial\theta}\right] = 0$$
when the expected value is taken according to P(Y).

Proof.
$$E_Y\!\left[\frac{\partial \log P(Y)}{\partial\theta}\right] = \sum_y P(y)\,\frac{1}{P(y)}\,\frac{\partial P(y)}{\partial\theta} = \frac{\partial}{\partial\theta}\sum_y P(y) = \frac{\partial 1}{\partial\theta} = 0.$$

The lemma is clearly also true for conditional distributions with corresponding conditional expectations.

Theorem 1.
Consider the converging Gibbs chain x1 ⇒ h1 ⇒ x2 ⇒ h2… starting at data point x1. The log-likelihood gradient can be written
$$\frac{\partial \log P(x_1)}{\partial\theta} = -\frac{\partial\,\mathrm{FreeEnergy}(x_1)}{\partial\theta} + E\!\left[\frac{\partial\,\mathrm{FreeEnergy}(X_t)}{\partial\theta}\,\Big|\,x_1\right] + E\!\left[\frac{\partial \log P(X_t)}{\partial\theta}\,\Big|\,x_1\right], \tag{3.3}$$
and the final term (which will be shown later to be the bias of the CD estimator) converges to zero as t goes to infinity.

Proof.
We take derivatives with respect to a parameter θ in the log-likelihood expansion in equation 3.1 of lemma 1:
formula
Then we take expectations with respect to the Markov chain conditional on x1, getting
formula
In order to prove the convergence of the CD bias toward zero, we will use the assumed convergence of the chain, which can be written
$$P(X_t = x \mid x_1) = P(x) + \epsilon_t(x), \tag{3.4}$$
with $\sum_x \epsilon_t(x) = 0$ and $\lim_{t\to+\infty}\epsilon_t(x) = 0$ for all x. Since x is discrete, $\epsilon_t := \max_x|\epsilon_t(x)|$ also verifies $\lim_{t\to+\infty}\epsilon_t = 0$. Then we can rewrite the last expectation as follows:
$$E\!\left[\frac{\partial \log P(X_t)}{\partial\theta}\,\Big|\,x_1\right] = \sum_x P(x)\,\frac{\partial \log P(x)}{\partial\theta} + \sum_x \epsilon_t(x)\,\frac{\partial \log P(x)}{\partial\theta}.$$
When lemma 2 is used, the first sum is equal to zero. Thus, we can bound this expectation by
$$\left|E\!\left[\frac{\partial \log P(X_t)}{\partial\theta}\,\Big|\,x_1\right]\right| \leq N_x\,\epsilon_t\,\max_x\left|\frac{\partial \log P(x)}{\partial\theta}\right|, \tag{3.5}$$
where Nx is the number of discrete configurations for the random variable X. This proves that the expectation converges to zero as t → +∞, since limt→+∞ϵt = 0.
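To illustrate the theorem, the following sketch computes the bias term E[∂ log P(X_t)/∂θ ∣ x1] exactly for a tiny binary RBM by enumerating states and powering the Gibbs transition matrix; sizes and parameter values are arbitrary, and the printed maximum absolute bias shrinks as t grows, as the proof predicts.

```python
import itertools
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
d_x, d_h = 4, 3
W = 0.5 * rng.standard_normal((d_h, d_x))
b, c = np.zeros(d_x), np.zeros(d_h)

xs = np.array(list(itertools.product([0.0, 1.0], repeat=d_x)))   # all visible configs
hs = np.array(list(itertools.product([0.0, 1.0], repeat=d_h)))   # all hidden configs

def cond_prob(v, probs_on):
    # probability of binary vector v under independent Bernoulli(probs_on) units
    return np.prod(np.where(v == 1.0, probs_on, 1.0 - probs_on))

# Transition matrix of the x-chain: T[next, current] = sum_h P(x_next|h) P(h|x_current)
P_h_given_x = np.array([[cond_prob(h, sigm(c + W @ x)) for x in xs] for h in hs])
P_x_given_h = np.array([[cond_prob(x, sigm(b + W.T @ h)) for h in hs] for x in xs])
T = P_x_given_h @ P_h_given_x

# Exact marginal P(x) from the free energy, and exact d log P(x)/dW (equation 2.10)
free_energy = -(xs @ b) - np.sum(np.logaddexp(0.0, xs @ W.T + c), axis=1)
p_x = np.exp(-free_energy - np.max(-free_energy)); p_x /= p_x.sum()
grad = np.array([np.outer(sigm(c + W @ x), x) for x in xs])       # positive phase per x
grad_logp = grad - np.tensordot(p_x, grad, axes=1)                # eq. 2.10 for every x

x1_index = 0                                   # chain started at the first configuration
p_t = np.zeros(len(xs)); p_t[x1_index] = 1.0
for t in range(1, 11):
    p_t = T @ p_t                              # P(X_{t+1} = x | x1)
    bias = np.tensordot(p_t, grad_logp, axes=1)    # E[d log P(X_{t+1})/dW | x1]
    print(t, np.abs(bias).max())               # shrinks toward 0 as the chain converges
```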

One may wonder to what extent the above results still hold in the situation where x and h are not discrete anymore but instead may take values in infinite (possibly uncountable) sets. We assume P(x ∣ h) and P(h ∣ x) are such that there still exists a unique stationary distribution P(x, h). Lemma 1 and its proof remain unchanged. On the other hand, lemma 2 is true only for distributions P such that
$$\int \frac{\partial P(y)}{\partial\theta}\,dy = \frac{\partial}{\partial\theta}\int P(y)\,dy = 0. \tag{3.6}$$
This equation can be guaranteed to be verified under additional "niceness" assumptions on P, and we assume it is the case for the distributions P(X), P(x ∣ h), and P(h ∣ x). Consequently, the gradient expansion, equation 3.3, in theorem 1 can be obtained in the same way as before. The key point to justify further truncation of this expansion is the convergence toward zero of the bias,
$$E\!\left[\frac{\partial \log P(X_t)}{\partial\theta}\,\Big|\,x_1\right]. \tag{3.7}$$
This convergence is not necessarily guaranteed unless we have convergence of P(Xt ∣ x1) to P(Xt) in the sense that
$$\lim_{t\to+\infty} E\!\left[\frac{\partial \log P(X_t)}{\partial\theta}\,\Big|\,x_1\right] = E\!\left[\frac{\partial \log P(X)}{\partial\theta}\right], \tag{3.8}$$
where the second expectation is over the stationary distribution P. If the distributions P(x ∣ h) and P(h ∣ x) are such that equation 3.8 is verified, then this limit is also zero according to lemma 2, and it makes sense to truncate equation 3.3. Note, however, that equation 3.8 does not necessarily hold in the most general case (Hernández-Lerma & Lasserre, 2003).

4.  Connection with Contrastive Divergence

4.1.  Theoretical Analysis.

Theorem 1 justifies truncating the series after t steps, that is, ignoring $E\!\left[\frac{\partial \log P(X_t)}{\partial\theta}\,\big|\,x_1\right]$, yielding the approximation
$$\frac{\partial \log P(x_1)}{\partial\theta} \simeq -\frac{\partial\,\mathrm{FreeEnergy}(x_1)}{\partial\theta} + E\!\left[\frac{\partial\,\mathrm{FreeEnergy}(X_t)}{\partial\theta}\,\Big|\,x_1\right]. \tag{4.1}$$
Note how the expectation can be readily replaced by sampling x_t ∼ P(X_t ∣ x_1), giving rise to the stochastic update
$$-\frac{\partial\,\mathrm{FreeEnergy}(x_1)}{\partial\theta} + \frac{\partial\,\mathrm{FreeEnergy}(x_t)}{\partial\theta},$$
whose expected value is the above approximation. This is also exactly the CD-(t − 1) update (see equation 2.12).
The idea that faster mixing yields a better approximation by CD-k was introduced earlier (Carreira-Perpiñan & Hinton, 2005; Hinton et al., 2006). The bound in equation 3.5 explicitly relates the convergence of the chain (through the convergence of the error ϵt in estimating P(X) with P(X_{k+1} = x ∣ x_1)) to the approximation error of the CD-k gradient estimator. When the RBM weights are large, it is plausible that the chain will mix more slowly because there is less randomness in each sampling step. Hence, it might be advisable to use larger values of k as the weights become larger during training. It is thus interesting to study how fast the bias converges to zero as t increases, depending on the magnitude of the weights in an RBM. Markov chain theory (Schmidt, 2006) ensures that in the discrete case,
$$\epsilon_t = \max_x\big|P(X_t = x \mid x_1) - P(x)\big| \leq (1 - N_x\,a)^{t-1}, \tag{4.2}$$
where Nx is the number of possible configurations for x, and a is the smallest element in the transition matrix of the Markov chain. In order to obtain a meaningful bound on equation 3.5, we also need to bound the gradient of the log likelihood. In the following, we will thus consider the typical case of a binomial RBM, with θ being a weight Wij between hidden unit i and visible unit j. Recall equation 2.10:
$$\frac{\partial \log P(x)}{\partial W_{ij}} = P(h_i = 1 \mid x)\,x_j - E_X\big[P(H_i = 1 \mid X)\,X_j\big].$$
For any x, both P(Hi = 1 ∣ X = x) and xj are in (0, 1). Consequently, the expectation above is also in (0, 1), and thus
$$\left|\frac{\partial \log P(x)}{\partial W_{ij}}\right| < 1.$$
Combining this inequality with equation 4.2, we obtain from equation 3.5 that
$$\left|E\!\left[\frac{\partial \log P(X_t)}{\partial W_{ij}}\,\Big|\,x_1\right]\right| \leq N_x\,(1 - N_x\,a)^{t-1}. \tag{4.3}$$
It remains to quantify a, the smallest term in the Markov chain transition matrix. Each element of this matrix is of the form
$$P(X_{t+1} = \tilde{x} \mid X_t = x) = \sum_h P(\tilde{x} \mid h)\,P(h \mid x).$$
Since 1 − sigm(z) = sigm(−z), we have
$$P(\tilde{x}_j \mid h) \geq \mathrm{sigm}\!\left(-\Big(\textstyle\sum_i |W_{ij}| + |b_j|\Big)\right).$$
Let us denote αj = ∑i|Wij| + |bj| and βi = ∑j|Wij| + |ci|. We can obtain in a similar way that P(hi ∣ x) ⩾ sigm(−βi). As a result, we have that
$$a \geq N_h \prod_{j=1}^{d_x}\mathrm{sigm}(-\alpha_j)\,\prod_{i=1}^{d_h}\mathrm{sigm}(-\beta_i), \tag{4.4}$$
where Nh is the number of configurations of the hidden vector h.
In order to simplify notations (at the cost of a looser bound), let us denote
$$\alpha = \max_j \alpha_j \tag{4.5}$$
$$\beta = \max_i \beta_i. \tag{4.6}$$
Then, by combining equations 4.3 and 4.4, we finally obtain
$$\left|E\!\left[\frac{\partial \log P(X_t)}{\partial W_{ij}}\,\Big|\,x_1\right]\right| \leq N_x\left(1 - N_x N_h\,\mathrm{sigm}(-\alpha)^{d_x}\,\mathrm{sigm}(-\beta)^{d_h}\right)^{t-1}, \tag{4.7}$$
where Nx = 2^{dx} and Nh = 2^{dh}. Note that although this bound is tight (and equal to zero) for any t ⩾ 2 when weights and biases are set to zero (since mixing is immediate), the bound is likely to be loose in practical cases. Indeed, the bound approaches Nx fast, as the two sigmoids decrease toward zero. However, the bound clarifies the importance of weight size in the bias of the CD approximation. It is also interesting to note that this bound on the bias decreases exponentially with the number of steps performed in the CD update, even though this decrease may become linear when the bound is loose (which is usually the case in practice): in such cases, it can be written Nx(1 − γ)^{t−1} with a small γ, and thus is close to Nx(1 − γ(t − 1)), a linear decrease in t.
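The following sketch computes α and β from equations 4.5 and 4.6 and evaluates the bound of equation 4.7 as written above; it is our illustration of the bound's behavior, not a statement about any particular trained model.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd_bias_bound(W, b, c, t):
    """Bound of equation 4.7 (as written above) on the CD bias after t - 1 Gibbs steps."""
    d_h, d_x = W.shape
    alpha = np.max(np.sum(np.abs(W), axis=0) + np.abs(b))   # max_j (sum_i |W_ij| + |b_j|)
    beta = np.max(np.sum(np.abs(W), axis=1) + np.abs(c))    # max_i (sum_j |W_ij| + |c_i|)
    N_x, N_h = 2.0 ** d_x, 2.0 ** d_h
    gamma = N_x * N_h * sigm(-alpha) ** d_x * sigm(-beta) ** d_h
    return N_x * (1.0 - gamma) ** (t - 1)

# With zero weights and biases the bound is exactly 0 for t >= 2 (immediate mixing),
# while even moderately large weights push it quickly toward its maximum value N_x.
```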

If the CD update is considered a biased and noisy estimator of the true log-likelihood gradient, it can be shown that stochastic gradient descent converges (to a local minimum), provided that the bias is not too large (Yuille, 2005). On the other hand, one should keep in mind that for small k, there is no guarantee that CD converges near the maximum likelihood solution (MacKay, 2001). The experiments below confirm the above theoretical results and suggest that even when the bias is large and the weights are large, the sign of the CD estimator may be generally correct.

4.2.  Experiments.

In the following series of experiments, we study empirically how the CD-k update relates to the gradient of the log likelihood. More specifically, in order to remove variance caused by sampling noise, we are interested in comparing two quantities:
$$\Delta(x_1) = \frac{\partial \log P(x_1)}{\partial\theta} \qquad\text{and}\qquad \Delta_k(x_1) = -\frac{\partial\,\mathrm{FreeEnergy}(x_1)}{\partial\theta} + E\!\left[\frac{\partial\,\mathrm{FreeEnergy}(X_{k+1})}{\partial\theta}\,\Big|\,x_1\right], \tag{4.8}$$
where Δ(x1) is the gradient of the log likelihood, equation 2.11, and Δk(x1) its average approximation by CD-k, equation 4.1. The difference between these two terms is the bias δk(x1), that is, according to equation 3.3,
$$\delta_k(x_1) = \Delta(x_1) - \Delta_k(x_1) = E\!\left[\frac{\partial \log P(X_{k+1})}{\partial\theta}\,\Big|\,x_1\right],$$
and, as shown in section 4.1, we have
$$|\delta_k(x_1)| \leq N_x\left(1 - N_x N_h\,\mathrm{sigm}(-\alpha)^{d_x}\,\mathrm{sigm}(-\beta)^{d_h}\right)^{k}.$$
Note that our analysis is different from the one in Carreira-Perpiñan and Hinton (2005), where the solutions (after convergence) found by CD-k and gradient descent on the negative log likelihood were compared, while we focus on the updates themselves.

In these experiments, we use two manually generated binary data sets:

  1. Diagd is a d-dimensional data set containing d + 1 samples as follows:
    formula
  2. 1DBalld is a d-dimensional data set whose samples represent "balls" on a one-dimensional discrete line with d pixels. Half of the data examples are generated by first picking the position b of the beginning of the ball (among d possibilities), then its width w (among a fixed number of possibilities). Pixels from b to b + w − 1 (modulo d) are then set to 1, while the rest of the pixels are set to 0. The second half of the data set is generated by simply inverting its first half (switching zeros and ones); a small generator sketch follows below.
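The following generator for 1DBall_d follows the description above; the range of widths (1 to d // 2) is our assumption, since the exact number of width values is not restated here.

```python
import numpy as np

def make_1dball(d, max_width=None):
    """Generate the 1DBall_d data set described above (width range is assumed)."""
    max_width = max_width if max_width is not None else d // 2
    first_half = []
    for start in range(d):                      # position b of the beginning of the ball
        for width in range(1, max_width + 1):   # assumed range of widths
            x = np.zeros(d)
            x[(start + np.arange(width)) % d] = 1.0   # pixels b .. b + w - 1, modulo d
            first_half.append(x)
    first_half = np.array(first_half)
    return np.concatenate([first_half, 1.0 - first_half])   # second half: inverted examples

data = make_1dball(10)
```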

In order to be able to compute δk(x1) exactly, only RBMs with a small number (fewer than 10) of visible and hidden units are used. We compute these quantities for all θ = Wij (the weights of the RBM connections between hidden and visible units). The following statistics are then computed over all weights Wij and all training examples x1:

  • The weight magnitude indicators α and β, as defined in equations 4.5 and 4.6
  • The mean of the gradient bias |δk(x1)|, denoted by δk and called the absolute bias
  • The median of |δk(x1)| / |Δ(x1)|, that is, the relative difference between the CD-k update and the log-likelihood gradient (see note 1; we use the median to avoid numerical issues for small gradients), denoted by rk and called the relative bias
  • The sign error sk, that is, the fraction of updates for which Δk(x1) and Δ(x1) have different signs

The RBM is initialized with zero biases and small weights sampled uniformly from a symmetric interval around zero whose width depends on the number d of visible units. Note that even with such small weights, the bound from equation 4.7 is already close to its maximum value Nx, so it is not informative to plot it in the figures. The number of hidden units is also set to d for simplicity. The RBM weights and biases are trained by CD-1 with a learning rate of 10^{-3}. Keep in mind that we are not interested in comparing the learning processes themselves, but rather in how the quantities above evolve for different kinds of RBMs, in particular as weights become larger during training. Training is stopped once the average negative log likelihood over training samples has less than 5% relative difference compared to its lower bound, which here is log(N), where N is the number of training samples (which are all unique).
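The summary statistics above can be computed from the exact gradients Δ(x1) and the expected CD-k updates Δk(x1) (obtained, e.g., by enumeration as in the earlier sketches); the helper below is our illustration of those definitions.

```python
import numpy as np

def cd_statistics(delta, delta_k):
    """Absolute bias, relative bias r_k, and sign error s_k from exact quantities.

    delta and delta_k are arrays of matching shape, one entry per (W_ij, x1) pair.
    """
    bias = delta - delta_k                                  # the bias delta_k(x1) = left-out term
    abs_bias = np.mean(np.abs(bias))                        # mean absolute bias
    rel_bias = np.median(np.abs(bias) / np.abs(delta))      # median relative bias r_k
    sign_err = np.mean(np.sign(delta) != np.sign(delta_k))  # sign disagreement s_k
    return abs_bias, rel_bias, sign_err
```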

Figure 1 shows a typical example of how the quantities defined above evolve during training (β is not plotted as it exhibits the same behavior as α). As the weights increase (as shown by α), so does the absolute value of the left-out term in CD-1 (δ1) and its relative magnitude compared to the log likelihood (r1). In particular, we observe that most of the log-likelihood gradient is quickly lost in CD-1 (here after only 80,000 updates), so that CD-1 is no longer a good approximation of negative log-likelihood gradient descent. However, the RBM is still able to learn its input distribution, which can be explained by the fact that the “sign disagreement” s1 between CD-1 and the log-likelihood gradient remains small (less than 5% for the whole training period).

Figure 1:

Typical evolution of weight magnitude α, gradient absolute bias δ1, relative bias r1, and sign error s1 as the RBM is being trained by CD-1 on 1DBall10. The size of weights α and the absolute bias δ1 are rescaled so that their maximum value is 1, while the relative bias r1 and the sign disagreement s1 naturally fall within [0, 1].

Figures 2 and 4 show how rk and sk, respectively, vary with the number of steps k performed in CD, on the Diagd (left) and 1DBalld (right) data sets, for d ∈ {6, 7, 8, 9, 10}. All of these values are taken when our stopping criterion is reached (i.e., when we are close enough to the empirical distribution). It may seem surprising that rk does not systematically increase with d, but remember that each RBM may be trained for a different number of iterations, leading to potentially very different weight magnitudes. Figure 3 shows the corresponding values of α and β (which reflect the magnitude of the weights). We can see, for instance, that α and β for data set 1DBall6 are larger than for data set 1DBall7, which explains why rk is also larger, as shown in Figure 2 (right). Figure 5 shows a "smoother" behavior of rk with respect to d when all RBMs are trained for a fixed (small) number of iterations, illustrating how the quality of CD-k as an approximation to negative log-likelihood gradient descent degrades in higher dimensions.

Figure 2:

Median relative bias rk between the CD-k update and the gradient of the log likelihood, for k from 1 to 10, with input dimension d ∈ {6, 7, 8, 9, 10}, when the stopping criterion is reached. (Left) On data sets Diagd. (Right) On data sets 1DBalld.

Figure 3:

Measures of weight magnitude α and β as the input dimension d varies from 6 to 10, when the stopping criterion is reached. (Left) On data sets Diagd. (Right) On data sets 1DBalld.

We observe in Figure 2 that the relative bias rk becomes large not only for small k (which means the CD-k update is a poor approximation of the true log-likelihood gradient), but also for larger k in higher dimensions. As a result, increasing k moderately (from 1 to 10) still leaves a large approximation error (e.g., from 80% to 50% with d = 10 in Figure 2) in spite of a 10-fold increase in computation time. This suggests that when the goal is to obtain a more precise estimator of the gradient, alternatives to CD-k such as persistent CD (Tieleman, 2008) may be more appropriate. On the other hand, we notice from Figure 4 that the disagreement sk between the two updates remains low even for small k in larger dimensions (in our experiments, it always remains below 5%). This may explain why CD-1 can successfully train RBMs even when connection weights become larger and the Markov chain no longer mixes fast. An intuitive explanation for this empirical observation is the popular view of CD-k as a process that, on the one hand, decreases the energy of a training sample x1 (first term in equation 4.8) and, on the other hand, increases the energy of other nearby input examples (second term), thus leading to an overall increase of P(x1).

Figure 4:

Average disagreement sk between the CD-k update and negative log-likelihood gradient descent, for k from 1 to 10, with input dimension d ∈ {6, 7, 8, 9, 10}, when the stopping criterion is reached. (Left) On data sets Diagd. (Right) On data sets 1DBalld.

Figure 5:

rk (left) and α and β (right) on data sets 1DBalld, after only 300,000 training iterations. rk systematically increases with d when weights are small (compared to Figures 2 and 3).

5.  Connection with Autoassociator Reconstruction Error

In this section, we relate the autoassociator reconstruction error criterion (an alternative to CD learning) to another similar truncation of the log-likelihood expansion. We can use the same approach as in theorem 1 to introduce the first hidden sample h1 as follows:
$$\frac{\partial \log P(x_1)}{\partial\theta} = \frac{\partial \log P(x_1 \mid h_1)}{\partial\theta} - \frac{\partial \log P(h_1 \mid x_1)}{\partial\theta} + \frac{\partial \log P(h_1)}{\partial\theta}.$$
Taking the expectation with respect to H1 conditioned on x1 yields
$$\frac{\partial \log P(x_1)}{\partial\theta} = E\!\left[\frac{\partial \log P(x_1 \mid H_1)}{\partial\theta}\,\Big|\,x_1\right] - E\!\left[\frac{\partial \log P(H_1 \mid x_1)}{\partial\theta}\,\Big|\,x_1\right] + E\!\left[\frac{\partial \log P(H_1)}{\partial\theta}\,\Big|\,x_1\right]. \tag{5.1}$$
When lemma 2 is used, the second term is equal to zero. If we truncate this expansion by removing the last term (as is done in CD), we obtain
$$\frac{\partial \log P(x_1)}{\partial\theta} \simeq E\!\left[\frac{\partial \log P(x_1 \mid H_1)}{\partial\theta}\,\Big|\,x_1\right], \tag{5.2}$$
which is an average over P(h1 ∣ x1) that could be approximated by sampling. Note that this is not quite the negated gradient of the stochastic reconstruction error,
$$\mathrm{SRE} = -E\big[\log P(x_1 \mid H_1) \mid x_1\big] = -\sum_{h_1} P(h_1 \mid x_1)\,\log P(x_1 \mid h_1). \tag{5.3}$$
Let us consider a notion of mean-field approximation by which an average EX[f(X)] over configurations of a random variable X is approximated by f(E[X]), that is, using the mean configuration. Applying such an approximation to SRE, equation 5.3, gives the reconstruction error typically used in training autoassociators (Rumelhart et al., 1986; Bourlard & Kamp, 1988; Hinton & Zemel, 1994; Schwenk & Milgram, 1995; Japkowicz et al., 2000; Bengio et al., 2007; Ranzato et al., 2007; Larochelle et al., 2007),
$$\mathrm{RE} = -\log P(x_1 \mid \hat{h}_1), \tag{5.4}$$
where $\hat{h}_1 = E[H_1 \mid x_1]$ is the mean-field output of the hidden units given the observed input x1. If we apply the mean-field approximation to the truncation of the log likelihood given in equation 5.2, we obtain
$$\frac{\partial \log P(x_1)}{\partial\theta} \simeq \frac{\partial \log P(x_1 \mid h)}{\partial\theta}\bigg|_{h = \hat{h}_1}.$$
It is arguable whether the mean-field approximation per se gives us license to also include the effect of θ on $\hat{h}_1$ in this gradient, but if we do so, then we obtain the gradient of the reconstruction error, equation 5.4, up to the sign (since the log likelihood is maximized while the reconstruction error is minimized).
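For a binary RBM, the mean-field reconstruction error of equation 5.4 reduces to a cross-entropy between x1 and its reconstruction, as in the following sketch (ours, with parameter names following section 2).

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def reconstruction_error(x1, W, b, c, eps=1e-12):
    """Mean-field reconstruction error -log P(x1 | h_hat) of equation 5.4."""
    h_hat = sigm(c + W @ x1)          # mean-field hidden units E[H1 | x1]
    x_hat = sigm(b + W.T @ h_hat)     # P(X_j = 1 | h_hat), the reconstruction
    # cross-entropy between the binary input and its reconstruction
    return -np.sum(x1 * np.log(x_hat + eps) + (1.0 - x1) * np.log(1.0 - x_hat + eps))
```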
As a result, whereas CD-1 truncates the chain expansion at x2 (as seen in section 2.3), ignoring
$$E\!\left[\frac{\partial \log P(X_2)}{\partial\theta}\,\Big|\,x_1\right],$$
we see (using the fact that the second term of equation 5.1 is zero) that a reconstruction update truncates the chain expansion one step earlier (at h1), ignoring
$$E\!\left[\frac{\partial \log P(H_1)}{\partial\theta}\,\Big|\,x_1\right],$$
and working on a mean-field approximation instead of a stochastic approximation. The reconstruction error gradient can thus be seen as a more biased approximation of the log-likelihood gradient than CD-1. Comparative experiments between reconstruction error training and CD-1 training confirm this view (Bengio et al., 2007; Larochelle et al., 2007): CD-1 updating generally has a slight advantage over reconstruction error gradient.

However, reconstruction error can be computed deterministically and has been used as an easy method to monitor the progress of training RBMs with CD, whereas the CD-k itself is generally not the gradient of anything and is stochastic.

6.  Conclusion

This letter provides a theoretical and empirical analysis of the log-likelihood gradient in graphical models involving a hidden variable h in addition to the observed variable x, and where the conditionals P(h ∣ x) and P(x ∣ h) are easy to compute and sample from. That includes the case of contrastive divergence (CD) for restricted Boltzmann machines (RBMs). The analysis justifies the use of a short Gibbs chain of length k to obtain a biased estimator of the log-likelihood gradient. Although our results do not guarantee that the bias decreases monotonically with k, we prove a bound that does, and we observe this decrease experimentally. Moreover, although this bias may be large when using only a few steps in the Gibbs chain (as is usually done in practice), our empirical analysis indicates that this estimator remains a good update direction compared to the true (but intractable) log-likelihood gradient.

The analysis also shows a connection between reconstruction error, log likelihood, and CD, which helps explain the better results generally obtained with CD and justifies the use of reconstruction error as a monitoring device when training an RBM by CD. The generality of the analysis also opens the door to other learning algorithms in which P(h ∣ x) and P(x ∣ h) do not have the parametric forms of RBMs.

Notes

1. This quantity is more interesting than the absolute bias because it tells us what proportion of the true gradient of the log likelihood is "lost" by using the CD-k update.

References

Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. In B. Schölkopf, J. Platt, & T. Hoffman (Eds.), Advances in neural information processing systems, 19 (pp. 153–160). Cambridge, MA: MIT Press.

Bengio, Y., & Le Cun, Y. (2007). Scaling learning algorithms towards AI. In L. Bottou, O. Chapelle, D. DeCoste, & J. Weston (Eds.), Large scale kernel machines. Cambridge, MA: MIT Press.

Bourlard, H., & Kamp, Y. (1988). Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59, 291–294.

Carreira-Perpiñan, M. A., & Hinton, G. E. (2005). On contrastive divergence learning. In R. G. Cowell & Z. Ghahramani (Eds.), Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (pp. 33–40). N.p.: Society for Artificial Intelligence and Statistics.

Freund, Y., & Haussler, D. (1994). Unsupervised learning of distributions on binary vectors using two layer networks (Tech. Rep. UCSC-CRL-94-25). Santa Cruz: University of California, Santa Cruz.

Hernández-Lerma, O., & Lasserre, J. B. (2003). Markov chains and invariant probabilities. Basel: Birkhäuser Verlag.

Hinton, G. E. (1999). Products of experts. In Proceedings of the Ninth International Conference on Artificial Neural Networks (ICANN) (Vol. 1, pp. 1–6). New York: IEEE.

Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14, 1771–1800.

Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.

Hinton, G. E., & Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks. Science, 313, 504–507.

Hinton, G. E., & Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Vol. 1: Foundations. Cambridge, MA: MIT Press.

Hinton, G. E., Sejnowski, T. J., & Ackley, D. H. (1984). Boltzmann machines: Constraint satisfaction networks that learn (Tech. Rep. TR-CMU-CS-84-119). Pittsburgh, PA: Carnegie Mellon University, Department of Computer Science.

Hinton, G. E., & Zemel, R. S. (1994). Autoencoders, minimum description length, and Helmholtz free energy. In D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 3–10). San Francisco: Morgan Kaufmann.

Japkowicz, N., Hanson, S. J., & Gluck, M. A. (2000). Nonlinear autoassociation is not equivalent to PCA. Neural Computation, 12(3), 531–545.

Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. In Z. Ghahramani (Ed.), Twenty-Fourth International Conference on Machine Learning (ICML'2007) (pp. 473–480). Madison, WI: Omnipress.

MacKay, D. (2001). Failures of the one-step learning algorithm. Unpublished manuscript.

Ranzato, M., Poultney, C., Chopra, S., & Le Cun, Y. (2007). Efficient learning of sparse representations with an energy-based model. In B. Schölkopf, J. Platt, & T. Hoffman (Eds.), Advances in neural information processing systems, 19. Cambridge, MA: MIT Press.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.

Salakhutdinov, R., & Hinton, G. (2007). Semantic hashing. In A. McCallum & S. Roweis (Eds.), Proceedings of the 2007 Workshop on Information Retrieval and Applications of Graphical Models (SIGIR 2007). Amsterdam: Elsevier.

Schmidt, V. (2006). Markov chains and Monte-Carlo simulation. Lecture notes, Summer 2006. Ulm: Ulm University, Department of Stochastics.

Schwenk, H., & Milgram, M. (1995). Transformation invariant autoassociation with application to handwritten character recognition. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 991–998). Cambridge, MA: MIT Press.

Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 194–281). Cambridge, MA: MIT Press.

Taylor, G., Hinton, G., & Roweis, S. (2007). Modeling human motion using binary latent variables. In B. Schölkopf, J. Platt, & T. Hoffman (Eds.), Advances in neural information processing systems, 19 (pp. 1345–1352). Cambridge, MA: MIT Press.

Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the likelihood gradient. In A. McCallum & S. Roweis (Eds.), Proceedings of the International Conference on Machine Learning (Vol. 25, pp. 1064–1071). Madison, WI: Omnipress.

Welling, M., Rosen-Zvi, M., & Hinton, G. E. (2005). Exponential family harmoniums with an application to information retrieval. In L. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17. Cambridge, MA: MIT Press.

Yuille, A. L. (2005). The convergence of contrastive divergences. In L. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 1593–1600). Cambridge, MA: MIT Press.