Abstract

We present a mathematical construction for the restricted Boltzmann machine (RBM) that does not require specifying the number of hidden units. In fact, the hidden layer size is adaptive and can grow during training. This is obtained by first extending the RBM to be sensitive to the ordering of its hidden units. Then, with a carefully chosen definition of the energy function, we show that the limit of infinitely many hidden units is well defined. As with the RBM, approximate maximum likelihood training can be performed, resulting in an algorithm that naturally and adaptively adds trained hidden units during learning. We empirically study the behavior of this infinite RBM, showing that its performance is competitive with that of the RBM, while not requiring the tuning of a hidden layer size.

1  Introduction

Over the years, machine learning research has produced a large variety of latent variable probabilistic models: mixture models, factor analysis models, latent dynamical models, and many others. Such models usually require that the dimensionality of the latent representation be specified and fixed during learning. Adapting this quantity is then considered a separate process that takes the form of model selection and is normally treated as an additional hyperparameter to tune.

For this reason, more recently, there has been a lot of work on extending these models such that the size of the representation can be treated as an adaptive quantity during training. These extensions, often referred to as infinite models, are nonparametric in nature where the latent space is infinite with probability 1 and can arbitrarily adapt their capacity to the training data (see Orbanz & Teh, 2010, for a brief overview).

While most latent variable models have been extended to one or more infinite variants, a notable exception is the restricted Boltzmann machine (RBM). The RBM is an undirected graphical model for binary vector observations, where the latent representation is itself a binary vector (i.e., hidden layer). The RBM and its extensions to nonbinary vectors have been successfully applied to a variety of problems and data, such as images (Ranzato, Krizhevsky, & Hinton, 2010), movie user preferences (Salakhutdinov, Mnih, & Hinton, 2007), motion capture (Taylor, Hinton, & Roweis, 2011), and text (Dahl, Adams, & Larochelle, 2012). One explanation for the lack of literature on RBMs with an adaptive hidden layer size comes from its undirected nature. Indeed, undirected models tend to be less amenable to a Bayesian treatment of learning, on which the majority of the literature on infinite models relies.

Our main contribution in this article is thus a proposal for an infinite RBM that can adapt the effective number of hidden units during training. While our proposal is not based on a Bayesian formulation, it does correspond to the infinite limit of a finite-sized model and behaves in such a way that it effectively adapts its capacity as training progresses.

First, we propose a finite extension of the RBM that is sensitive to the position of each unit in its hidden layer. This is achieved by introducing a random variable that represents the number of hidden units intervening in the RBM’s energy function. Then, thanks to the introduction of an energy cost for using each additional unit, we show that taking the infinite limit of the total number of hidden units is well defined. We describe an approximate maximum likelihood training algorithm for this infinite RBM, based on (persistent) contrastive divergence, which results in a procedure where hidden units are implicitly added as training progresses. Finally, we empirically report how this model behaves in practice and show that it can achieve performance competitive with a traditional RBM on the binarized MNIST and Caltech101 Silhouettes data sets, while not requiring the tuning of a hyperparameter for its hidden layer size.

2  Restricted Boltzmann Machine

We describe the basic RBM model, which we will build on to derive its ordered and infinite versions.

An RBM is a generative stochastic neural network composed of two layers: visible and hidden. These layers are fully connected to each other, while connections within a layer are not allowed. This means each visible unit v_i is connected to every hidden unit h_j via an undirected weighted connection (see Figure 1).

Figure 1:

Graphical model of the restricted Boltzmann machine. Inter-connections between visible units and hidden units using symmetric weights.


Given a binary RBM with D visible units and K hidden units, the set of visible vectors is {0, 1}^D, whereas the set of hidden vectors is {0, 1}^K. In an RBM model, each configuration (v, h) has an associated energy value defined by the following function:
E(\mathbf{v}, \mathbf{h}) = -\mathbf{h}^\top W \mathbf{v} - \mathbf{b}^\top \mathbf{v} - \mathbf{c}^\top \mathbf{h}
2.1
The parameters of this model are the weights W (a K x D matrix), the visible unit biases b (a D-dimensional vector), and the hidden unit biases c (a K-dimensional vector).
A probability distribution over visible and hidden vectors is defined in terms of this energy function,
p(\mathbf{v}, \mathbf{h}) = \frac{\exp(-E(\mathbf{v}, \mathbf{h}))}{Z}
2.2
with
Z = \sum_{\mathbf{v}' \in \{0,1\}^D} \sum_{\mathbf{h}' \in \{0,1\}^K} \exp(-E(\mathbf{v}', \mathbf{h}'))
2.3

We see from equation 2.3 that the partition function Z (normalizing constant) is intractable, as it requires summing over all 2^(D+K) possible configurations.
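To make this concrete, here is a minimal NumPy sketch (all sizes and parameter values are made up for illustration, not taken from the paper) that evaluates the energy of equation 2.1 and computes Z by brute-force enumeration, which is only feasible for a toy model:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
D, K = 4, 3                              # toy sizes: 2^(D+K) = 128 configurations
W = rng.normal(scale=0.1, size=(K, D))   # weights (K x D)
b = rng.normal(scale=0.1, size=D)        # visible biases
c = rng.normal(scale=0.1, size=K)        # hidden biases

def energy(v, h):
    # E(v, h) = -h^T W v - b^T v - c^T h  (equation 2.1)
    return -h @ W @ v - b @ v - c @ h

configs_v = [np.array(v, dtype=float) for v in product([0, 1], repeat=D)]
configs_h = [np.array(h, dtype=float) for h in product([0, 1], repeat=K)]

# Z sums exp(-E) over every (v, h) pair; the cost grows as 2^(D+K),
# which is why Z is intractable for realistic layer sizes.
Z = sum(np.exp(-energy(v, h)) for v in configs_v for h in configs_h)
```

Dividing exp(-E(v, h)) by this Z recovers the joint probabilities of equation 2.2, which sum to 1 over all configurations.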

The probability distribution of a visible vector is obtained by marginalizing over all configurations of hidden vectors. One property of the RBM is that the numerator of the marginal is tractable,
p(\mathbf{v}) = \sum_{\mathbf{h} \in \{0,1\}^K} p(\mathbf{v}, \mathbf{h}) = \frac{\exp(-F(\mathbf{v}))}{Z}
2.4
with
F(\mathbf{v}) = -\mathbf{b}^\top \mathbf{v} - \sum_{i=1}^{K} \mathrm{soft}_{+}\!\left( W_{i\cdot} \mathbf{v} + c_i \right)
2.5
where soft_+(x) = ln(1 + e^x) and the notation W_{i.} designates the ith row of W and, for columns, W_{.j}. This allows an equivalent definition of the RBM model in terms of what is known as the free energy F(v). However, the partition function still requires summing over all configurations of visible vectors, which is intractable even for moderate values of D.
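The equivalence between marginalizing h explicitly and the free energy of equation 2.5 can be checked numerically on a toy model. The sketch below (with made-up parameters) relies on the identity sum over h_i in {0,1} of e^(h_i x) = 1 + e^x, whose logarithm is soft_+(x):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
D, K = 4, 3                              # toy sizes, chosen for illustration
W = rng.normal(scale=0.1, size=(K, D))
b = rng.normal(scale=0.1, size=D)
c = rng.normal(scale=0.1, size=K)

def energy(v, h):
    # E(v, h) = -h^T W v - b^T v - c^T h  (equation 2.1)
    return -h @ W @ v - b @ v - c @ h

def free_energy(v):
    # F(v) = -b^T v - sum_i soft+(W_i. v + c_i)  (equation 2.5);
    # np.logaddexp(0, x) is a numerically stable ln(1 + e^x).
    return -b @ v - np.sum(np.logaddexp(0.0, W @ v + c))

# Marginalizing h by brute force must equal exp(-F(v)) for any v.
v = np.array([1.0, 0.0, 1.0, 0.0])
brute = sum(np.exp(-energy(v, np.array(h, dtype=float)))
            for h in product([0, 1], repeat=K))
```

Here `brute` and `np.exp(-free_energy(v))` agree up to floating-point error, which is exactly the tractable-numerator property stated above.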
RBMs can be learned as generative models, to assign high probability (i.e., low energy) to training observations and low probability otherwise. One approach is to minimize the average negative log likelihood (NLL) for a set of examples {v_n} for n = 1, ..., N:
\frac{1}{N} \sum_{n=1}^{N} -\ln p(\mathbf{v}_n)
2.6
The gradient of this objective has a simple form, which is often referred to as the combination of positive and negative phases:
\frac{\partial \left( -\ln p(\mathbf{v}_n) \right)}{\partial \theta} = \underbrace{\frac{\partial F(\mathbf{v}_n)}{\partial \theta}}_{\text{positive phase}} - \underbrace{\mathbb{E}_{\mathbf{v}}\!\left[ \frac{\partial F(\mathbf{v})}{\partial \theta} \right]}_{\text{negative phase}}
2.7
where
\frac{\partial F(\mathbf{v})}{\partial W} = -\mathbf{h}(\mathbf{v})\, \mathbf{v}^\top
2.8
\frac{\partial F(\mathbf{v})}{\partial \mathbf{b}} = -\mathbf{v}
2.9
\frac{\partial F(\mathbf{v})}{\partial \mathbf{c}} = -\mathbf{h}(\mathbf{v})
2.10
and where h(v) = sigm(Wv + c), with sigm(x) = 1/(1 + e^(-x)) being the sigmoid function applied element-wise. Derivation of the partial derivatives can be found in appendix A.
Intuitively, the positive phase pushes up the probability of examples coming from our training set, whereas the negative phase lowers the probability of examples generated by the model. Much like the partition function, the negative phase is intractable. To overcome this, we approximate the expectation under p(v) with an average over S samples drawn from p(v), that is, from the model:
\mathbb{E}_{\mathbf{v}}\!\left[ \frac{\partial F(\mathbf{v})}{\partial \theta} \right] \approx \frac{1}{S} \sum_{s=1}^{S} \frac{\partial F(\tilde{\mathbf{v}}_s)}{\partial \theta}, \qquad \tilde{\mathbf{v}}_s \sim p(\mathbf{v})
2.11
Moreover, mini-batch training is usually employed and consists of replacing the positive phase average by one over a small subset of the training set, different for every training update.
Sampling from p(v) can be achieved using block Gibbs sampling, alternating between sampling h given v and v given h. It can be done efficiently because RBMs have no connections within a layer, meaning that hidden units are conditionally independent given the visible units and vice versa. The conditional distributions of a binary RBM are Bernoulli distributions with parameters
p(h_i = 1 \mid \mathbf{v}) = \mathrm{sigm}\!\left( W_{i\cdot} \mathbf{v} + c_i \right)
2.12
p(v_j = 1 \mid \mathbf{h}) = \mathrm{sigm}\!\left( \mathbf{h}^\top W_{\cdot j} + b_j \right)
2.13

In theory, the Markov chain should be run until equilibrium before drawing a sample for every training update, which is highly inefficient. Thus, contrastive divergence (CD) learning is often employed, where we initialize the update’s Gibbs chains to the training examples and perform only T steps of Gibbs sampling (Hinton, 2002). Another approach, referred to as stochastic approximation or persistent CD (PCD) (Tieleman, 2008), is to not reinitialize the Gibbs chains between updates.
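The pieces above can be put together in a short training step. The following is a hypothetical minimal PCD implementation in NumPy; the sizes, learning rate, and random mini-batch are placeholders for illustration, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, S = 6, 4, 16               # visible units, hidden units, persistent chains
W = np.zeros((K, D))             # weights start at zero for simplicity
b = np.zeros(D)                  # visible biases
c = np.zeros(K)                  # hidden biases
lr = 0.05                        # placeholder learning rate

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v):
    # p(h_i = 1 | v) = sigm(W_i. v + c_i), independent across units (eq. 2.12)
    p = sigmoid(v @ W.T + c)
    return (rng.random(p.shape) < p).astype(float)

def sample_v_given_h(h):
    # p(v_j = 1 | h) = sigm(h^T W_.j + b_j)  (eq. 2.13)
    p = sigmoid(h @ W + b)
    return (rng.random(p.shape) < p).astype(float)

chains = rng.integers(0, 2, size=(S, D)).astype(float)  # persistent Gibbs chains

def pcd_update(batch, chains, T=10):
    global W, b, c
    # Negative phase: advance the persistent chains T Gibbs steps (not reset).
    for _ in range(T):
        h = sample_h_given_v(chains)
        chains = sample_v_given_h(h)
    pos_h = sigmoid(batch @ W.T + c)    # E[h | v] on the data
    neg_h = sigmoid(chains @ W.T + c)   # E[h | v] on the model samples
    # Gradient step: positive phase minus negative phase (eqs. 2.7-2.11).
    W += lr * (pos_h.T @ batch / len(batch) - neg_h.T @ chains / len(chains))
    b += lr * (batch.mean(axis=0) - chains.mean(axis=0))
    c += lr * (pos_h.mean(axis=0) - neg_h.mean(axis=0))
    return chains

batch = rng.integers(0, 2, size=(8, D)).astype(float)   # stand-in mini-batch
chains = pcd_update(batch, chains)
```

Replacing the persistent `chains` with copies of the current mini-batch (and reinitializing them at every update) would turn this into CD-T instead of PCD.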

3  Ordered Restricted Boltzmann Machine

The model we propose is a variant of the RBM where the hidden units are ordered from left to right, with this order being taken into account by the energy function. We refer to this model as an ordered RBM (oRBM). As shown in Figure 2, the oRBM takes hidden unit order into account by introducing a random variable z that can be understood as the effective number of hidden units contributing to the energy. Hidden units are selected starting from the left, and the selection of each hidden unit is associated with an incremental cost in energy.

Figure 2:

Illustration of the ordered RBM. Since z = 2, only the first two hidden units are selected.


Concretely, we define the energy function of the oRBM as
E(\mathbf{v}, \mathbf{h}, z) = -\mathbf{b}^\top \mathbf{v} - \sum_{i=1}^{z} \left( h_i \left( W_{i\cdot} \mathbf{v} + c_i \right) - \beta_i \right)
3.1
where z represents the number of selected hidden units and beta_i is an energy penalty for selecting the ith hidden unit. As we will see, carefully parameterizing the per-unit energy penalty will allow us to consider the case of an infinite pool of hidden units.

In our experiments, as we wanted the filters of each unit to be the dominating factor in a unit being selected, we parameterized it as beta_i = beta soft_+(c_i), where beta is a global hyperparameter (critically, as we discuss later, this hyperparameter does not actually require tuning, and a generic value for it works fine). Intuitively, the penalty term acts as a form of regularization, since it forces the model to avoid using more hidden units than needed, prioritizing smaller networks.

Moreover, having the penalty depend on the hidden biases also implies that the selection of a hidden unit (i.e., influencing the outcome of the random variable z) will be mostly controlled by the values taken by the connections W_{i.}. Higher values of the bias of a hidden unit will not increase its probability of being selected. In other words, for the model to increase its capacity and better fit the training data, it will have to learn better filters. Note that alternative parameterizations could certainly be considered.

As with the RBM, p(v, h, z) is defined in terms of the energy function. For this, we have to specify the set of legal values for v, h, and z. Since, for a given z, the value of the energy does not depend on the dimensions of h from z + 1 to K, we will assume they are set to 0. There is thus a coupling between the value of z and the legal values of h. Let H_z = {h in {0, 1}^K : h_i = 0 for all i > z} be the set of legal values of h for a given z. As for z, it can vary in {1, ..., K}, and v in {0, 1}^D as usual.

The joint probability over , , and z is thus
p(\mathbf{v}, \mathbf{h}, z) = \frac{\exp(-E(\mathbf{v}, \mathbf{h}, z))}{Z}
3.2
where
Z = \sum_{z=1}^{K} \sum_{\mathbf{v}' \in \{0,1\}^D} \sum_{\mathbf{h}' \in \mathcal{H}_z} \exp(-E(\mathbf{v}', \mathbf{h}', z))
3.3
As for the marginal distribution p(v) of the oRBM model, it can also be written in terms of a free energy. Indeed, in a derivation similar to the case of the RBM, we can show
p(\mathbf{v}, z) = \frac{\exp(-F(\mathbf{v}, z))}{Z}
3.4
F(\mathbf{v}, z) = -\mathbf{b}^\top \mathbf{v} - \sum_{i=1}^{z} \left( \mathrm{soft}_{+}\!\left( W_{i\cdot} \mathbf{v} + c_i \right) - \beta_i \right)
3.5
This gives us a free energy where only the hidden units have been marginalized. We can also derive a formulation where the free energy depends only on v:
F(\mathbf{v}) = -\ln \sum_{z=1}^{K} \exp(-F(\mathbf{v}, z))
3.6
It should be noted that in the oRBM, z does not correspond to the number of hidden units assumed to have generated all observations. Instead, the model allows for different observations having been generated by a different number of hidden units. Specifically, for a given v, the conditional distribution over the corresponding value of z is
p(z \mid \mathbf{v}) = \frac{\exp(-F(\mathbf{v}, z))}{\sum_{z'=1}^{K} \exp(-F(\mathbf{v}, z'))}
3.7
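A NumPy sketch of how p(z | v) could be computed for the oRBM, using the per-unit penalty beta_i = beta soft_+(c_i) and a cumulative sum over per-unit terms (all parameter values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, beta = 4, 5, 1.01          # toy sizes; beta is a placeholder value
W = rng.normal(scale=0.1, size=(K, D))
b = rng.normal(scale=0.1, size=D)
c = rng.normal(scale=0.1, size=K)

def softplus(x):
    return np.logaddexp(0.0, x)  # stable ln(1 + e^x)

def free_energy_vz(v):
    # F(v, z) = -b^T v - sum_{i<=z} (soft+(W_i. v + c_i) - beta_i),
    # with beta_i = beta * soft+(c_i). The cumulative sum yields all
    # z = 1..K at once: entry z-1 of the returned array holds F(v, z).
    per_unit = softplus(W @ v + c) - beta * softplus(c)
    return -b @ v - np.cumsum(per_unit)

def p_z_given_v(v):
    # p(z | v) = exp(-F(v, z)) / sum_z' exp(-F(v, z'))
    neg_F = -free_energy_vz(v)
    neg_F -= neg_F.max()         # shift for numerical stability
    w = np.exp(neg_F)
    return w / w.sum()

v = np.array([1.0, 0.0, 1.0, 0.0])
pz = p_z_given_v(v)              # a length-K probability vector over z
```

Since F(v, z) differs from F(v, z - 1) by a single per-unit term, one cumulative sum gives the whole distribution over z in a single pass.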
As for the conditional distribution over the hidden units given a value of z, it takes the same form as for the regular RBM, except for unselected hidden units, which are forced to zero. Similarly, the distribution of v given the hidden layer and z reflects that of the RBM:
p(h_i = 1 \mid \mathbf{v}, z) = \mathbb{1}_{i \le z}\, \mathrm{sigm}\!\left( W_{i\cdot} \mathbf{v} + c_i \right)
3.8
p(v_j = 1 \mid \mathbf{h}, z) = \mathrm{sigm}\!\left( \sum_{i=1}^{z} h_i W_{ij} + b_j \right)
3.9
To train the oRBM, we can also rely on CD or PCD for estimating the gradients based on equation 2.11, but using F(v) as defined in equation 3.6. Defining h(v) = sigm(Wv + c) with the sigmoid applied element-wise, the free energy gradients are then slightly modified as follows:
\frac{\partial F(\mathbf{v})}{\partial W_{i\cdot}} = -P(z \ge i \mid \mathbf{v})\, h_i(\mathbf{v})\, \mathbf{v}^\top
3.10
\frac{\partial F(\mathbf{v})}{\partial c_i} = -P(z \ge i \mid \mathbf{v}) \left( h_i(\mathbf{v}) - \beta\, \mathrm{sigm}(c_i) \right)
3.11
\frac{\partial F(\mathbf{v})}{\partial \mathbf{b}} = -\mathbf{v}
3.12
with P(z >= i | v) = sum over z' = i, ..., K of p(z' | v). Derivation of the partial derivatives can be found in appendix A.

Compared to the RBM, computing these gradients requires one additional quantity: the vector of cumulative probabilities (P(z >= 1 | v), ..., P(z >= K | v)). Fortunately, this quantity can be efficiently computed, in O(K), by first computing the vector of probabilities p(z = i | v) and then performing a (reversed) cumulative sum.
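As a quick illustration of this O(K) computation, assuming a made-up probability vector for p(z = i | v):

```python
import numpy as np

# Hypothetical vector of probabilities p(z = i | v) for i = 1..K (K = 5 here).
pz = np.array([0.1, 0.2, 0.4, 0.2, 0.1])

# P(z >= i | v) for every i at once: one reversed cumulative sum, O(K) overall.
p_z_geq = np.cumsum(pz[::-1])[::-1]
```

The first entry is always 1 (z is at least 1) and the vector decreases monotonically with i, which is the multiplicative factor that damps the updates of later hidden units.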

Sampling from p(v) differs slightly from the RBM, as we need to consider z in the Markov chain. With the oRBM, Gibbs steps alternate between sampling (h, z) given v and sampling v given (h, z). Sampling from p(h, z | v) is done in two steps: z ~ p(z | v), followed by h ~ p(h | v, z).
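One Gibbs step of this chain might then look as follows. This is a hedged sketch with toy parameters, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, beta = 4, 5, 1.01          # toy sizes; beta is a placeholder value
W = rng.normal(scale=0.1, size=(K, D))
b = rng.normal(scale=0.1, size=D)
c = rng.normal(scale=0.1, size=K)

def softplus(x): return np.logaddexp(0.0, x)
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

def sample_z_given_v(v):
    # p(z | v) proportional to exp(-F(v, z)), with beta_i = beta * soft+(c_i);
    # the cumulative sum gives -F(v, z) for all z in one pass.
    neg_F = np.cumsum(softplus(W @ v + c) - beta * softplus(c)) + b @ v
    w = np.exp(neg_F - neg_F.max())
    return 1 + rng.choice(K, p=w / w.sum())   # z takes values in {1, ..., K}

def gibbs_step(v):
    z = sample_z_given_v(v)                   # step 1: z ~ p(z | v)
    p_h = sigmoid(W @ v + c)
    h = (rng.random(K) < p_h).astype(float)   # step 2: h ~ p(h | v, z) ...
    h[z:] = 0.0                               # ... units past z forced to zero
    p_v = sigmoid(h @ W + b)                  # step 3: v ~ p(v | h, z)
    return (rng.random(D) < p_v).astype(float), z

v = rng.integers(0, 2, size=D).astype(float)
v, z = gibbs_step(v)
```

Because units past position z are zeroed, the visible reconstruction only sees the first z rows of W, matching equation 3.9.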

During training, what we observe is that the hidden units are each trained gradually, in sequence, from left to right. This effect is mainly due to the multiplicative term P(z >= i | v) in the hidden unit parameter updates of equations 3.10 and 3.11, which is monotonically decreasing in i. Effectively, the model is thus growing in capacity during training, until its maximum capacity of K hidden units.

4  Infinite Restricted Boltzmann Machine

The growing behavior of the oRBM begs the question, could we achieve a similar effect without having to specify a maximum capacity for the model? Indeed, while Montufar and Ay (2010) have shown that with 2^(D-1) - 1 hidden units, an RBM is a universal approximator, a variant of the RBM that could automatically increase its capacity until it is sufficiently high is likely to yield much smaller models in practice. It turns out that this is possible by taking the limit K -> infinity. For this reason, we refer to this model as the infinite RBM (iRBM; see Figure 3).

Figure 3:

Illustration of the infinite RBM. With z = 2, only the first two hidden units are currently selected. The dashed lines illustrate that the connections of the third hidden unit are trained (nonzero). All (infinitely many) hidden units after the third have zero-valued weights, which corresponds to l being equal to 3.


This limit is made possible thanks to two modeling choices. The first is the assumption that only a finite (but variable!) number of hidden units have nonzero weights and biases. This is trivial to ensure for any optimization procedure using any amount of any type of weight decay (e.g., L2 or L1 regularization) on all the weights and hidden biases. An infinite number of nonzero weights and biases would then correspond to an infinite penalty, so no proper optimization would ever diverge to this solution, no matter the initialization. This is guaranteed when using L1 regularization, thanks to its sparsity-inducing property. As for L2 regularization, while it could theoretically lead to an infinite number of hidden units (e.g., if the L2 norm of the parameters associated with each hidden unit decreases exponentially with respect to the position of the hidden unit), in practice finite floating-point precision would clip very small parameters to zero, thus maintaining a finite number of hidden units.

The second key choice is our parameterization of the per-unit energy penalty beta_i, which will ensure that the infinite sums required in computing probabilities are convergent. For instance, consider the conditional p(z | v):
p(z \mid \mathbf{v}) = \frac{\exp(-F(\mathbf{v}, z))}{\sum_{z'=1}^{\infty} \exp(-F(\mathbf{v}, z'))}
4.1
Let us note l the number of effectively trained hidden units (i.e., the position after which all hidden units have zero weights and biases). Such an l is guaranteed to exist thanks to the growing behavior that ensures hidden units are trained in order from left to right. Then we can split the normalization constant of equation 4.1 into two parts, split at l, as follows:
\sum_{z'=1}^{\infty} \exp(-F(\mathbf{v}, z')) = \sum_{z'=1}^{l} \exp(-F(\mathbf{v}, z')) + \exp(-F(\mathbf{v}, l)) \sum_{k=1}^{\infty} \left( 2^{1-\beta} \right)^{k}
4.2
where equation 4.2 is obtained by exploiting the fact that all weights and biases of hidden units at position l + 1 and higher are zero, so each such unit contributes soft_+(0) - beta soft_+(0) = (1 - beta) ln 2 to -F(v, z). By ensuring that beta > 1, the geometric series of equation 4.2 converges and can be analytically computed. This in turn implies that p(z | v) is tractable and can be sampled from. Following a similar reasoning, the global partition function Z can be shown to be finite (see appendix B), thus yielding a properly defined joint distribution for any configuration with a finite number of nonzero weights and hidden biases.
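The closed-form tail can be checked numerically. The sketch below (toy parameters; the values of l and beta are made up) compares the analytic geometric-series normalization against a long finite truncation of the infinite sum:

```python
import numpy as np

rng = np.random.default_rng(0)
D, l, beta = 4, 3, 1.5           # l trained units; beta > 1 makes the tail converge
W = rng.normal(scale=0.1, size=(l, D))   # only the first l units have nonzero params
b = rng.normal(scale=0.1, size=D)
c = rng.normal(scale=0.1, size=l)

def softplus(x): return np.logaddexp(0.0, x)

def log_norm_z(v):
    # log of sum_{z=1}^inf exp(-F(v, z)), using the closed-form tail:
    # beyond position l each unit contributes soft+(0) - beta*soft+(0)
    # = (1 - beta) ln 2 to -F, so the tail is geometric with ratio
    # r = 2^(1 - beta) < 1.
    neg_F = np.cumsum(softplus(W @ v + c) - beta * softplus(c)) + b @ v
    r = 2.0 ** (1.0 - beta)
    head = np.exp(neg_F).sum()               # z = 1 .. l
    tail = np.exp(neg_F[-1]) * r / (1.0 - r)  # z = l+1 .. infinity, in closed form
    return np.log(head + tail)

# Sanity check: truncating the infinite sum at a large depth should agree.
v = np.array([1.0, 0.0, 1.0, 0.0])
per_unit_tail = (1.0 - beta) * np.log(2.0)
neg_F = np.cumsum(softplus(W @ v + c) - beta * softplus(c)) + b @ v
trunc = np.exp(neg_F).sum() + sum(np.exp(neg_F[-1] + k * per_unit_tail)
                                  for k in range(1, 2000))
```

The analytic normalizer and the truncated sum agree to high precision, which is what makes p(z | v) tractable despite the infinite pool of units.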

One could think that, compared to a regular RBM, we have merely traded the hyperparameter of the hidden layer size for the hyperparameter beta. However, crucially, beta's role is only to ensure that the iRBM is properly defined, and the penalty it imposes in the energy function can be compensated by the learned parameters. The extent to which the parameters can grow enough to compensate for that penalty is then controlled by the strength of weight decay, a hyperparameter the iRBM shares with the RBM. We have thus effectively removed one hyperparameter. Moreover, we have indeed observed that results are robust to the choice of beta; that is, finely tuning beta was not necessary to ultimately achieve good performance. While the choice of beta can affect the number of epochs it takes for the weights to compensate for the penalty, the number of epochs is a quantity that must be tuned anyway, even in regular RBMs.

The question of the identifiability of the binary RBM is a complex one, which has been studied (Cueto, Morton, & Sturmfels, 2010). Unlike the RBM, the iRBM is sensitive to the ordering of its hidden units thanks to the penalty term. This means that permutations of iRBM’s hidden units do not correspond to the same distribution, making its parameterization more identifiable.

As for learning, it can mostly follow the procedure of the oRBM: minimizing the NLL with stochastic gradient descent, using (persistent) CD to approximate the gradients. One slight modification is required, however. Indeed, since the free energy gradients for the hidden weights and biases can be nonzero for all (infinitely many) hidden units, we cannot use the gradients of equations 3.10 and 3.11 for all hidden units.

To avoid this issue, we consider the following observation. Instead of using the derivative of F(v), we could instead use the derivative of F(v, z), where z is obtained by sampling from p(z | v):
\frac{\partial F(\mathbf{v}, z)}{\partial W_{i\cdot}} = -\mathbb{1}_{i \le z}\, h_i(\mathbf{v})\, \mathbf{v}^\top
4.3
\frac{\partial F(\mathbf{v}, z)}{\partial c_i} = -\mathbb{1}_{i \le z} \left( h_i(\mathbf{v}) - \beta\, \mathrm{sigm}(c_i) \right)
4.4

In this case, all weights and biases with an index greater than the sampled z have a gradient of zero (i.e., they do not require any update). Moreover, taking the expectation of these gradients under p(z | v) corresponds to taking the gradients of F(v), making them unbiased in this respect. This comes at the cost of higher variance in the updates. Thanks to this observation, we are justified in using a hybrid approach, where we use the gradients of F(v) only for the units with an index less than or equal to l and adopt the gradients of F(v, z) for the other units (i.e., leave them set to zero).

As previously mentioned, we use weight decay to ensure that the number of nonzero parameters cannot diverge to infinity. For practical reasons, our implementation also used a capacity-limiting heuristic: if the Gibbs sampling chain ever sampled a value for z greater than l, we clamped it to l + 1. Intuitively, this corresponds to adding a single hidden unit. This avoids filling all the memory in the (unlikely) event where we would draw a very large value for z. When a hidden unit is added, its associated weights and biases are initialized to zero.
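A possible sketch of this heuristic (the function name and array layout are ours, not the paper's):

```python
import numpy as np

def grow_if_needed(W, c, z):
    """Clamp a sampled z to l + 1 and, if needed, grow the parameters.

    Hypothetical helper: when z exceeds the current number l of trained
    units, a single zero-initialized hidden unit (one row of W, one entry
    of c) is appended, instead of allocating z - l units at once.
    """
    l = W.shape[0]
    if z > l:
        z = l + 1
        W = np.vstack([W, np.zeros((1, W.shape[1]))])  # new unit starts at zero
        c = np.append(c, 0.0)
    return W, c, z

# Example: l = 3 trained units, the chain samples z = 10.
W = np.zeros((3, 4))
c = np.zeros(3)
W, c, z = grow_if_needed(W, c, z=10)   # z gets clamped to l + 1 = 4
```

Growing one unit at a time keeps memory bounded while preserving the model's ability to expand whenever the sampler asks for more capacity.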

We emphasize that this heuristic was not required to avoid divergence (weight decay is sufficient); it merely ensured a practical and efficient implementation of the model on the GPU. Note also that when L1 regularization is used, l can decrease in value thanks to the sparsity-promoting property of the L1 norm. Again, we highlight that while a finite number of nonzero weights and biases is maintained, that number does vary and is learned, while the implicit number of hidden units is indeed infinite (infinitely many contribute to the partition function).

5  Related Work

This work falls within the research literature on extensions of the original RBM model to different contexts and objectives. Of note here is the implicit mixture of RBMs (Nair & Hinton, 2008). Indeed, the oRBM can be interpreted as a special case of an implicit mixture of RBMs. Writing p(v, h, z) as p(z) p(v, h | z), we see that the oRBM is an implicit mixture of K RBMs, where each RBM has a different number of hidden units (from 1 to K) and the weights are tied between RBMs. The prior p(z) represents the probability of using the zth RBM and is also derived from the energy function. However, as in the implicit mixture of RBMs, p(z) is intractable, as it would require the value of the partition function. That said, the work of Nair and Hinton (2008) is otherwise very different and did not address the question of having an RBM with adaptive capacity.

Another related work is that of the cardinality RBMs proposed by Swersky et al. (2012). They used a cardinality potential to control the sparsity of the RBM, limiting the number of hidden units that can be active. In the oRBM and the iRBM, z effectively acts as an upper bound on the number of hidden units h_i that can be equal to 1, since we are limiting h to be in H_z, a subset of {0, 1}^K. In their work, Swersky et al. (2012) use cardinality potentials that allow only configurations having at most k active hidden units. One difference with our work, however, is that their cardinality potential is order agnostic, meaning that the active hidden units can be positioned anywhere within the hidden layer while still satisfying the cardinality potential. In the oRBM, on the other hand, all units with an index higher than z must be set to zero, with only the preceding hidden units being allowed to be active. In addition, their parameter k is fixed during training, whereas our number of active hidden units z changes depending on the input.

The oRBM also bears some similarity to autoencoders trained by a nested version of dropout (Rippel, Gelbart, & Adams, 2014). Nested dropout works by stochastically selecting the number of hidden units used to reconstruct an input example at training time, independently for each update and example. Rippel et al. (2014) showed that this defines a learning objective that makes the solution identifiable and no longer invariant to hidden unit permutation. In addition to being concerned with a different type of model, this work does not discuss the case of an unbounded and adaptive hidden layer size.

Welling, Zemel, and Hinton (2003) proposed a self-supervised boosting approach that is applicable to the RBM and in which hidden units are sequentially added and trained. However, like boosting in general and unlike the iRBM, this procedure trains each hidden unit greedily instead of jointly, which could lead to much larger networks than necessary. Moreover, it is not easily generalizable to online learning.

While the work on unsupervised neural networks with adaptive hidden layer size is otherwise relatively scarce, there has been much more work in the context of supervised learning. There is the well-known work of Fahlman and Lebiere (1990) on cascade-correlation networks. More recently Zhou, Sohn, and Lee (2012) proposed a procedure for learning discriminative features with a denoising autoencoder (a model related to the RBM). The procedure is also applicable to the online setting. It relies on invoking two heuristics that either add or merge hidden units during training. We note that the iRBM framework could easily be generalized to discriminative and hybrid training as in Zhou et al. (2012). The corresponding mechanisms for adding and merging units would then be implicitly derived from gradient descent on the corresponding supervised training objective.

Finally, we highlight that our model is not based on a Bayesian formulation, as is most of the literature on infinite models. But it does correspond to the infinite limit of a finite-sized model and yields a model that can learn its size with training.

6  Experiments

We compare the performance of the oRBM and the iRBM with the classic RBM on two data sets: binarized MNIST (Salakhutdinov & Murray, 2008) and Caltech101 Silhouettes (Marlin, Swersky, Chen, & de Freitas, 2010). We aim to demonstrate that the iRBM effectively removes the need for tuning a hyperparameter for the hidden layer size while still achieving performance comparable to that of the standard RBM. The code to reproduce the experiments of the paper is available on GitHub (http://github.com/MarcCote/iRBM). Our implementation is done using Theano (Bastien et al., 2012; Bergstra et al., 2010).

For completeness, we mention that more sophisticated or deep models have reported results on one or both of these data sets (e.g. EoNADE, Uria, Murray, & Larochelle, 2014; DBNs, Murray & Salakhutdinov, 2008; deep autoregressive networks, Gregor, Mnih, & Wierstra, 2014; iterative neural autoregressive distribution estimator, Raiko & Bengio, 2014) that improve on the standard RBM. However, since our objective with the iRBM is to effectively remove a hyperparameter of the RBM instead of achieving improved performance, we focus our comparison on this baseline.

All NLL results of this section were obtained by estimating the log-partition function using annealed importance sampling (AIS) (Salakhutdinov & Murray, 2008) with 100,000 intermediate distributions and 5000 chains. As an additional validation step, samples were generated from the best models and visually inspected.

Each model was trained with mini-batch stochastic gradient descent using a batch size of 64 examples and using PCD with 10 Gibbs steps between parameter updates. We used the ADAGRAD stochastic gradient update (Duchi, Hazan, & Singer, 2011), a per-dimension learning rate method, to train the oRBMs and the iRBMs. We found that having different learning rates for different hidden units was very beneficial, since units positioned earlier in the hidden layer will approach convergence faster than units to their right and thus benefit from a learning rate that decays more rapidly. We tried several learning rates; ADAGRAD's epsilon parameter was kept at the same fixed value throughout.

We also tested different values for both the L1 and L2 regularization factors. Note that we allow the iRBM to shrink only if L1 regularization is used.

We did try varying the beta found in the penalty term beta_i and, as expected, found results to be robust to its value. Since beta must be greater than 1, we explored positive constants to add to 1 on a log scale (e.g., 1, 0.25, 0.1, 0.01, 0.001). We settled on a single value of beta for all experiments, as it provides a penalty high enough to obtain the growing behavior while requiring around 500 epochs for the weights to compensate for the penalty.

Finally, we note that improved performance could certainly have been achieved using an improved sampler (e.g., parallel tempering; Desjardins, Courville, Bengio, Vincent, & Delalleau, 2010) or parameterization (e.g., enhanced gradient parameterization; Cho, Raiko, & Ilin, 2013). However, these changes would equally improve the baseline RBM, so we decided to concentrate on this more common learning setup.

6.1  Binarized MNIST

The MNIST data set (http://yann.lecun.com/exdb/mnist) is composed of 70,000 images of size 28 x 28 pixels representing handwritten digits (0-9). Images have been stochastically binarized according to their pixel intensity as in Salakhutdinov and Murray (2008). We use the same split as in Larochelle and Murray (2011), corresponding to 50,000 examples for training, 10,000 for validation, and 10,000 for testing.

Each model was trained up to 5000 epochs, but we performed AIS evaluation every 1000 epochs and kept the model with the best NLL approximation on the validation set. We report the associated NLL approximations obtained on the test set. Following past studies assessing RBM results on binarized MNIST, we fixed the number of hidden units to 500 for the RBM and the oRBM. Best results for the RBM, oRBM, and iRBM are reported in Table 1. The oRBM and the iRBM models reach competitive performance compared to the RBM. Samples from all three models are illustrated in Figure 4.

Table 1:
Average NLL on Binarized MNIST Test Set for Best RBMs, oRBM, and iRBM.
Binarized MNIST

Model   Size   ln Z (AIS) [CI]              Average NLL
RBM     100    600.92 [600.88, 600.95]      98.17 ± 0.52
RBM     500    613.28 [613.24, 613.31]      86.50 ± 0.44
RBM     2000   1099.07 [1098.94, 1099.17]   85.03 ± 0.42
oRBM    500    40.06 [39.90, 40.19]         88.15 ± 0.46
iRBM    1208   40.32 [40.03, 40.54]         85.65 ± 0.44

Notes: Partition functions were estimated using AIS with 100,000 intermediate distributions and 5000 chains. The confidence interval on the average NLL assumes the estimate of ln Z has no variance and reflects the confidence of a finite sample average. Taking the uncertainty about the partition function into account would make the interval larger.

Figure 4:

Comparison between data from binarized MNIST and random samples generated from the three models by randomly initializing visible units and running 10,000 Gibbs steps. The RBM and oRBM both have 500 hidden units, whereas the iRBM final size is 1208 hidden units.


The best RBM (500 hidden units) was trained without any regularization and for 5000 epochs. We used our own implementation to train the RBM, which is why our result differs slightly from what is reported by Salakhutdinov and Murray (2008). The difference can be explained by the fact that they used the full 60,000 training images instead of a 50,000-example subset. Also, they used a custom schedule to gradually increase the number of CD steps during training. That said, the oRBM and the iRBM would probably also benefit from more training data and an improved sampling strategy.

The best oRBM (500 hidden units) was trained without any regularization and for 500 epochs. After 3000 epochs, the best iRBM had 1208 hidden units with nonzero weights. It was trained with L1 regularization.

To show that our best iRBM does find an appropriate number of hidden units, we compared it with two other RBMs having, respectively, 100 and 2000 hidden units. Both were trained for 5000 epochs without any regularization. Results are reported in Table 1, where we can see that the oRBM and the iRBM still achieve competitive results compared to the RBM with 2000 hidden units.

Figure 5 shows the ordering effect on the filters obtained with an iRBM. The ordering is even more apparent when observing the hidden unit filters during training. We generated a video of this visualization illustrating the filter values and the generated negative samples at epochs 1, 10, 50, and 100. (See http://youtu.be/zP-6DiwksNY)

Figure 5:

Comparing the filters of an RBM and an iRBM, both trained on binarized MNIST. The first 96 filters are shown starting from the top-left corner and incrementing across columns first.


Interestingly, we have observed that Gibbs sampling can mix much more slowly with the oRBM. The reason is that the addition of the variable z increases the dependence between states and thus hurts the convergence of Gibbs sampling. In particular, we observed that when the Gibbs chain is in a state corresponding to a noisy image without any structure, it can require many steps before stepping out of this region of the input space. Yet comparing the free energy of such random images with that of images resembling digits confirmed that the random images have significantly higher free energy (and thus are unlikely samples of the model). Figure 6 also confirms the high dependence between z and v: the distribution p(z | v) for the unstructured image is peaked at small values of z, while all digits prefer values of z greater than 250. To fix this issue, we found that simply initializing the Gibbs chain with a large value of z was sufficient. We used this when sampling from a trained oRBM model.

Figure 6:

Each row shows a plot of p(z | v), where v is a given example from the MNIST test set, displayed to the left. The first row illustrates the impact of a noisy image on sampling z. As explained in section 3, different input images are associated with different values for the number z of used units.

The iRBM does not seem to suffer as much from a low mixing rate and thus does not require the initialization heuristic for sampling. In fact, using the heuristic when sampling from an iRBM has almost no impact on the final samples after 10,000 Gibbs steps. This could be an artifact of the model being trained progressively: a hidden unit is added only when a value of z larger than the current hidden layer size l is sampled. Understanding how the lower mixing rate affects the proposed models, and whether a heuristic such as the one mentioned above could be used to improve training, is left for future work.

We have also investigated what kinds of inputs maximize p(z | v) for different values of z. Using our best iRBM model trained with L1 regularization, we generated Figure 7. It highlights the fact that z does capture some structure in the data, as the identity of the character with the highest p(z | v) varies across values of z.

Figure 7:

(Bottom) Top 10 inputs from the test set with the highest value of p(z | v) within different intervals for z. Interestingly, bolder inputs seem to be related to larger values for the number z of used units. Also, simpler characters (e.g., ones) tend to favor smaller values of z than more complex characters. (Top) Averages over the top 10 inputs. Low values highlight regions in the hidden layer where the hidden units are useful only when taken together with hidden units farther right in the layer.

6.2  CalTech101 Silhouettes

The CalTech101 Silhouettes data set (http://people.cs.umass.edu/marlin/data.shtml; Marlin et al., 2010) is composed of 8671 binary images of size 28 × 28 pixels, representing object silhouettes from 101 classes. The data set is divided into three subsets: 4100 examples for training, 2264 for validation, and 2307 for testing.

Following a protocol similar to the one used for MNIST, each model was trained up to 5000 epochs, and AIS evaluation was done every 1000 epochs. We report the NLL approximations obtained on the test set. Best results for the RBM, oRBM, and iRBM are reported in Table 2. Again, the oRBM and the iRBM models reach competitive performance compared to the RBM. Samples from all three models are illustrated in Figure 8.

Table 2:
Average NLL on CalTech101 Silhouettes Test Set Estimated Using AIS with 100,000 Intermediate Distributions and 5000 Chains.
CalTech101 Silhouettes

Model   Size   ln Ẑ [95% CI]                 Avg. NLL
RBM      100   2512.20 [2511.62, 2512.56]    177.37 ± 2.81
RBM      500   2385.91 [2385.68, 2386.10]    119.05 ± 2.27
RBM     2000   3353.47 [3349.85, 3354.15]    118.29 ± 2.25
oRBM     500   1782.96 [1782.88, 1783.02]    114.99 ± 1.97
iRBM     915   2000.08 [1999.93, 2000.22]    121.47 ± 2.07

Notes: The confidence interval on the average NLL assumes Ẑ has no variance and reflects the confidence of a finite sample average. Taking the uncertainty about the partition function into account would enlarge the interval.
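AIS-based NLL estimates like these can be sanity-checked on toy models whose partition function is tractable. A minimal sketch, using the standard RBM free energy; exact_avg_nll is a hypothetical helper, feasible only for a handful of visible units:

```python
import numpy as np
from itertools import product

def free_energy(v, W, b, c):
    """Standard RBM free energy: F(v) = -b.v - sum_i softplus(c_i + W_i @ v)."""
    return -(b @ v) - np.sum(np.logaddexp(0.0, c + W @ v))

def exact_avg_nll(data, W, b, c):
    """Exact average NLL via brute-force enumeration of all 2^D visible
    configurations; only feasible for tiny D, but exact where AIS is not."""
    D = W.shape[1]
    all_v = np.array(list(product([0.0, 1.0], repeat=D)))
    log_Z = np.logaddexp.reduce([-free_energy(v, W, b, c) for v in all_v])
    return float(np.mean([free_energy(v, W, b, c) + log_Z for v in data]))

# With all-zero parameters the RBM is uniform over {0,1}^D,
# so the NLL of any example is exactly D * log(2).
D, H = 4, 3
W, b, c = np.zeros((H, D)), np.zeros(D), np.zeros(H)
data = np.array([[0, 1, 0, 1], [1, 1, 1, 1]], dtype=float)
nll = exact_avg_nll(data, W, b, c)
```

Agreement between such an exact computation and the AIS estimate on small models gives confidence in the AIS settings before they are applied at full scale.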

Figure 8:

Comparison between data from CalTech101 Silhouettes and random samples generated from the three models by randomly initializing visible units and running 10,000 Gibbs steps. The RBM and oRBM both have 500 hidden units, whereas the iRBM's final size is 915 hidden units.

The best RBM (500 hidden units) was trained without any regularization for 3000 epochs, using our own implementation. The best oRBM (500 hidden units) was trained for 5000 epochs with L1 regularization. After 4000 epochs, the best iRBM had 915 hidden units with nonzero weights; it was also trained with L1 regularization.

Again, to show that our best iRBM does find an appropriate number of hidden units, we compared it with two other RBMs having, respectively, 100 and 2000 hidden units. Both were trained without any regularization, for 5000 and 2000 epochs, respectively. Results are reported in Table 2, where we can see that the oRBM and the iRBM still achieve competitive results compared to the RBM with 2000 hidden units.

7  Conclusion

We proposed a novel extension of the RBM, the infinite RBM, which obviates the need to specify the hidden layer size. The iRBM is derived from the ordered RBM by taking the infinite limit of its hidden layer size. We presented a training procedure, derived from contrastive divergence, such that training the iRBM yields a learning procedure where the effective hidden layer size can grow.

In future work, we are interested in generalizing the idea of a developing latent representation to structures other than a flat vector representation. We are currently exploring extensions of the RBM allowing for a tree-structured latent representation. We believe a similar construction, involving a similar z random variable, should allow us to derive a training algorithm that also learns the size of the latent representation.

Appendix A:  Partial Derivatives

A.1  Partial Derivatives Related to the RBM

Recall equation 2.5 representing the free energy of the RBM:
\[
F(\mathbf{v}) = -\mathbf{b}^\top \mathbf{v} - \sum_i \mathrm{soft}_{+}(c_i + \mathbf{W}_{i\cdot}\mathbf{v}).
\]
Taking the partial derivatives of F(\mathbf{v}) with regard to W_{ij}, b_j, and c_i, respectively, we obtain the following:
\[
\frac{\partial F(\mathbf{v})}{\partial W_{ij}} = -\sigma(c_i + \mathbf{W}_{i\cdot}\mathbf{v})\, v_j, \tag{A.1}
\]
\[
\frac{\partial F(\mathbf{v})}{\partial b_j} = -v_j, \tag{A.2}
\]
\[
\frac{\partial F(\mathbf{v})}{\partial c_i} = -\sigma(c_i + \mathbf{W}_{i\cdot}\mathbf{v}), \tag{A.3}
\]
where \sigma(c_i + \mathbf{W}_{i\cdot}\mathbf{v}) can be expressed as a conditional expectation over h_i using equation 2.12:
\[
\sigma(c_i + \mathbf{W}_{i\cdot}\mathbf{v}) = p(h_i = 1 \mid \mathbf{v}) = \mathbb{E}\left[ h_i \mid \mathbf{v} \right].
\]
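Derivatives of this kind are easy to verify numerically. Below is a small finite-difference check, assuming the standard RBM free-energy form; the parameter values are randomly drawn and purely illustrative:

```python
import numpy as np

def free_energy(v, W, b, c):
    """Standard RBM free energy: F(v) = -b.v - sum_i softplus(c_i + W_i @ v)."""
    return -(b @ v) - np.sum(np.logaddexp(0.0, c + W @ v))

rng = np.random.default_rng(1)
D, H = 5, 3
W = rng.normal(size=(H, D))
b, c = rng.normal(size=D), rng.normal(size=H)
v = rng.integers(0, 2, size=D).astype(float)

# Analytic derivatives: dF/dW_ij = -p(h_i=1|v) v_j, dF/db_j = -v_j,
# dF/dc_i = -p(h_i=1|v), with p(h_i=1|v) = sigmoid(c_i + W_i @ v).
h_prob = 1.0 / (1.0 + np.exp(-(c + W @ v)))
dW, db, dc = -np.outer(h_prob, v), -v, -h_prob

# Central finite differences for dF/dc_i.
eps = 1e-6
num_dc = np.empty(H)
for i in range(H):
    cp, cm = c.copy(), c.copy()
    cp[i] += eps
    cm[i] -= eps
    num_dc[i] = (free_energy(v, W, b, cp) - free_energy(v, W, b, cm)) / (2 * eps)
```

The numeric and analytic values of the hidden-bias gradient agree to within finite-difference error, confirming the derivation.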

A.2  Partial Derivatives Related to the oRBM and the iRBM

Recall equation 3.6 representing the free energy of the oRBM:
\[
F(\mathbf{v}, z) = -\mathbf{b}^\top \mathbf{v} - \sum_{i=1}^{z} \left[ \mathrm{soft}_{+}(c_i + \mathbf{W}_{i\cdot}\mathbf{v}) - \beta_i \right],
\]
where
\[
\mathrm{soft}_{+}(x) = \ln\left(1 + e^{x}\right).
\]
The partial derivatives of F(\mathbf{v}, z) with regard to W_{ij}, b_j, and c_i are similar to equations A.1 to A.3 from the RBM and are, respectively, given by
\[
\frac{\partial F(\mathbf{v}, z)}{\partial W_{ij}} = -H(z - i)\, \sigma(c_i + \mathbf{W}_{i\cdot}\mathbf{v})\, v_j, \tag{A.4}
\]
\[
\frac{\partial F(\mathbf{v}, z)}{\partial b_j} = -v_j, \tag{A.5}
\]
\[
\frac{\partial F(\mathbf{v}, z)}{\partial c_i} = -H(z - i)\, \sigma(c_i + \mathbf{W}_{i\cdot}\mathbf{v}), \tag{A.6}
\]
with the Heaviside step function denoted as
\[
H(x) = \begin{cases} 1 & \text{if } x \ge 0, \\ 0 & \text{otherwise.} \end{cases}
\]
Then the partial derivatives of F(\mathbf{v}) with regard to W_{ij}, b_j, and c_i are obtained by taking the expectation over p(z \mid \mathbf{v}); the derivative with regard to b_j is again -v_j, and for W_{ij} and c_i we have, respectively,
\[
\frac{\partial F(\mathbf{v})}{\partial W_{ij}} = -\sum_{z} p(z \mid \mathbf{v})\, H(z - i)\, \sigma(c_i + \mathbf{W}_{i\cdot}\mathbf{v})\, v_j, \tag{A.7}
\]
\[
\frac{\partial F(\mathbf{v})}{\partial c_i} = -\sum_{z} p(z \mid \mathbf{v})\, H(z - i)\, \sigma(c_i + \mathbf{W}_{i\cdot}\mathbf{v}). \tag{A.8}
\]

Observe that in equations A.7 and A.8, \sum_z p(z \mid \mathbf{v})\, H(z - i) corresponds to p(z \ge i \mid \mathbf{v}). This tail probability then appears as the per-unit weighting when deriving the gradients, as shown in equations 3.10 and 3.11.
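This collapse of the expectation of the Heaviside step into a tail probability p(z ≥ i | v), and the resulting hidden-bias gradient, can be checked numerically. A sketch under the assumption that -F(v, z) accumulates per-unit softplus terms minus penalties β_i (randomly drawn parameters, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n_units, D = 4, 5
W = rng.normal(size=(n_units, D))
c = rng.normal(size=n_units)
beta = np.abs(rng.normal(size=n_units))      # per-unit energy penalties
v = rng.integers(0, 2, size=D).astype(float)

def neg_F_vz(c_):
    # -F(v, z) up to the b.v term (constant in c, so it cancels below):
    # cumulative sums of softplus(c_i + W_i @ v) - beta_i, for z = 1..n_units.
    return np.cumsum(np.logaddexp(0.0, c_ + W @ v) - beta)

def F_v(c_):
    # F(v) = -log sum_z exp(-F(v, z)), again dropping the constant b.v term.
    return -np.logaddexp.reduce(neg_F_vz(c_))

# p(z | v) and its tail probabilities p(z >= i | v).
log_p = neg_F_vz(c)
p_z = np.exp(log_p - log_p.max())
p_z /= p_z.sum()
tail = np.cumsum(p_z[::-1])[::-1]            # tail[i] = p(z >= i+1 | v)

# Collapsed analytic gradient: dF(v)/dc_i = -p(z >= i | v) * sigmoid(c_i + W_i @ v).
sig = 1.0 / (1.0 + np.exp(-(c + W @ v)))
analytic = -tail * sig

# Central finite differences on F(v).
eps = 1e-6
numeric = np.empty(n_units)
for i in range(n_units):
    cp, cm = c.copy(), c.copy()
    cp[i] += eps
    cm[i] -= eps
    numeric[i] = (F_v(cp) - F_v(cm)) / (2 * eps)
```

The finite-difference gradient matches the tail-weighted analytic form, which is exactly the weighting that carries over to the gradient updates.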

Appendix B:  Convergence of the Partition Function for the iRBM

We show that the partition function Z of the iRBM is finite. To do so, we take the limit l \to \infty of equation 3.3:
\[
Z = \lim_{l \to \infty} \sum_{\mathbf{v}} \sum_{z=1}^{l} e^{-F(\mathbf{v}, z)} = \sum_{\mathbf{v}} \sum_{z=1}^{\infty} e^{-F(\mathbf{v}, z)}. \tag{B.1}
\]
Since the sum over all \mathbf{v} \in \{0, 1\}^D has finitely many terms and we know from equation 4.2 that \sum_{z=1}^{\infty} e^{-F(\mathbf{v}, z)} is finite for every \mathbf{v}, Z is also finite.

Acknowledgments

We thank NSERC for supporting this research, Nicolas Le Roux for discussions and comments, and Stanislas Lauly for making the iRBM’s training video.

References

Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., … Bengio, Y. (2012). Theano: New features and speed improvements. Paper presented at Deep Learning and Unsupervised Feature Learning, NIPS 2012 Workshop.

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., … Bengio, Y. (2010). Theano: A CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference. http://conference.scipy.org/proceedings/scipy2010/

Cho, K., Raiko, T., & Ilin, A. (2013). Enhanced gradient for training restricted Boltzmann machines. Neural Computation, 25, 805–831.

Cueto, M. A., Morton, J., & Sturmfels, B. (2010). Geometry of the restricted Boltzmann machine. In M. A. G. Viana & H. P. Wynn (Eds.), Algebraic methods in statistics and probability II: AMS Special Session. Providence, RI: American Mathematical Society.

Dahl, G. E., Adams, R. P., & Larochelle, H. (2012). Training restricted Boltzmann machines on word observations. In Proceedings of the 29th International Conference on Machine Learning (pp. 679–686). Madison, WI: Omnipress.

Desjardins, G., Courville, A., Bengio, Y., Vincent, P., & Delalleau, O. (2010). Parallel tempering for training of restricted Boltzmann machines. In Proceedings of AISTATS (Vol. 9, pp. 145–152). JMLR.org.

Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, 2121–2159.

Fahlman, S. E., & Lebiere, C. (1990). The cascade-correlation learning architecture. In D. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 524–532). San Mateo, CA: Morgan Kaufmann.

Gregor, K., Mnih, A., & Wierstra, D. (2014). Deep autoregressive networks. In Proceedings of the International Conference on Machine Learning (pp. 1242–1250). JMLR.org.

Hinton, G. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14, 1771–1800.

Larochelle, H., & Murray, I. (2011). The neural autoregressive distribution estimator. In Proceedings of AISTATS (Vol. 15, pp. 29–37). JMLR.org.

Marlin, B. M., Swersky, K., Chen, B., & de Freitas, N. (2010). Inductive principles for restricted Boltzmann machine learning. In Proc. Intl. Conference on Artificial Intelligence and Statistics (pp. 305–306).

Montufar, G., & Ay, N. (2010). Refinements of universal approximation results for deep belief networks and restricted Boltzmann machines. Neural Computation, 23, 1–12.

Murray, I., & Salakhutdinov, R. (2008). Evaluating probabilities under high-dimensional latent variable models. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems, 21 (pp. 1137–1144). Red Hook, NY: Curran.

Nair, V., & Hinton, G. (2008). Implicit mixtures of restricted Boltzmann machines. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems, 21. Red Hook, NY: Curran.

Orbanz, P., & Teh, Y. W. (2010). Bayesian nonparametric models. In Encyclopedia of machine learning. New York: Springer.

Raiko, T., & Bengio, Y. (2014). Iterative neural autoregressive distribution estimator. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems (pp. 1–9). Red Hook, NY: Curran.

Ranzato, M., Krizhevsky, A., & Hinton, G. E. (2010). Factored 3-way restricted Boltzmann machines for modeling natural images. Journal of Machine Learning Research—Proceedings Track, 9, 621–628.

Rippel, O., Gelbart, M., & Adams, R. (2014). Learning ordered representations with nested dropout. In Proceedings of the 31st International Conference on Machine Learning (pp. 1746–1754). JMLR.org.

Salakhutdinov, R., Mnih, A., & Hinton, G. (2007). Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning (pp. 791–798). New York: ACM.

Salakhutdinov, R., & Murray, I. (2008). On the quantitative analysis of deep belief networks. In A. McCallum & S. Roweis (Eds.), Proceedings of the 25th Annual International Conference on Machine Learning (pp. 872–879). Madison, WI: Omnipress.

Swersky, K., Sutskever, I., Tarlow, D., Zemel, R. S., Salakhutdinov, R. R., & Adams, R. P. (2012). Cardinality restricted Boltzmann machines. In P. L. Bartlett, F. C. N. Pereira, C. J. C. Burges, & L. Bottou (Eds.), Advances in neural information processing systems, 25 (pp. 3293–3301). Red Hook, NY: Curran.

Taylor, G. W., Hinton, G. E., & Roweis, S. T. (2011). Two distributed-state models for generating high-dimensional time series. Journal of Machine Learning Research, 12, 1025–1068.

Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning (pp. 1064–1071). New York: ACM.

Uria, B., Murray, I., & Larochelle, H. (2014). A deep and tractable density estimator. In Proceedings of the International Conference on Machine Learning. JMLR.org.

Welling, M., Zemel, R. S., & Hinton, G. E. (2003). Self supervised boosting. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 681–688). Cambridge, MA: MIT Press.

Zhou, G., Sohn, K., & Lee, H. (2012). Online incremental feature learning with denoising autoencoders. In N. D. Lawrence & M. A. Girolami (Eds.), Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (Vol. 22, pp. 1453–1461). JMLR.org.