## Abstract

We present a mathematical construction for the restricted Boltzmann machine (RBM) that does not require specifying the number of hidden units. In fact, the hidden layer size is adaptive and can grow during training. This is obtained by first extending the RBM to be sensitive to the ordering of its hidden units. Then, with a carefully chosen definition of the energy function, we show that the limit of infinitely many hidden units is well defined. As with RBM, approximate maximum likelihood training can be performed, resulting in an algorithm that naturally and adaptively adds trained hidden units during learning. We empirically study the behavior of this infinite RBM, showing that its performance is competitive to that of the RBM, while not requiring the tuning of a hidden layer size.

## 1 Introduction

Over the years, machine learning research has produced a large variety of latent variable probabilistic models: mixture models, factor analysis models, latent dynamical models, and many others. Such models usually require that the dimensionality of the latent representation be specified and fixed during learning. Adapting this quantity is then considered a separate process that takes the form of model selection and is normally treated as an additional hyperparameter to tune.

For this reason, more recently, there has been a lot of work on extending these models such that the size of the representation can be treated as an adaptive quantity during training. These extensions, often referred to as infinite models, are nonparametric in nature where the latent space is infinite with probability 1 and can arbitrarily adapt their capacity to the training data (see Orbanz & Teh, 2010, for a brief overview).

While most latent variable models have been extended to one or more infinite variants, a notable exception is the restricted Boltzmann machine (RBM). The RBM is an undirected graphical model for binary vector observations, where the latent representation is itself a binary vector (i.e., hidden layer). The RBM and its extensions to nonbinary vectors have been successfully applied to a variety of problems and data, such as images (Ranzato, Krizhevsky, & Hinton, 2010), movie user preferences (Salakhutdinov, Mnih, & Hinton, 2007), motion capture (Taylor, Hinton, & Roweis, 2011), and text (Dahl, Adams, & Larochelle, 2012). One explanation for the lack of literature on RBMs with an adaptive hidden layer size comes from its undirected nature. Indeed, undirected models tend to be less amenable to a Bayesian treatment of learning, on which the majority of the literature on infinite models relies.

Our main contribution in this article is thus a proposal for an infinite RBM that can adapt the effective number of hidden units during training. While our proposal is not based on a Bayesian formulation, it does correspond to the infinite limit of a finite-sized model and behaves in such a way that it effectively adapts its capacity as training progresses.

First, we propose a finite extension of the RBM that is sensitive to the position of each unit in its hidden layer. This is achieved by introducing a random variable that represents the number of hidden units intervening in the RBM’s energy function. Then, thanks to the introduction of an energy cost for using each additional unit, we show that taking the infinite limit of the total number of hidden units is well defined. We describe an approximate maximum likelihood training algorithm for this infinite RBM, based on (persistent) contrastive divergence, which results in a procedure where hidden units are implicitly added as training progresses. Finally, we empirically report how this model behaves in practice and show that it can achieve performance that is competitive to a traditional RBM on the binarized MNIST and Caltech101 Silhouettes data sets, while not requiring the tuning of a hyperparameter for its hidden layer size.

## 2 Restricted Boltzmann Machine

We describe the basic RBM model, which we will build on to derive its ordered and infinite versions.

An RBM is a generative stochastic neural network composed of two layers: visible and hidden. These layers are fully connected to each other, while connections within a layer are not allowed; that is, each visible unit $v_i$ is connected to every hidden unit $h_j$ via undirected weighted connections (see Figure 1). With $D$ visible units and $K$ hidden units, the set of visible vectors is $\{0,1\}^D$, whereas the set of hidden vectors is $\{0,1\}^K$. In an RBM model, each configuration $(\mathbf{v}, \mathbf{h})$ has an associated energy value defined by the following function:

$$E(\mathbf{v}, \mathbf{h}) = -\mathbf{b}^\top \mathbf{v} - \mathbf{c}^\top \mathbf{h} - \mathbf{h}^\top \mathbf{W} \mathbf{v}.$$

The parameters of this model are the weights $\mathbf{W}$ (a $K \times D$ matrix), the visible unit biases $\mathbf{b}$ (a $D$-dimensional vector), and the hidden unit biases $\mathbf{c}$ (a $K$-dimensional vector). The joint distribution is then $p(\mathbf{v}, \mathbf{h}) = \exp(-E(\mathbf{v}, \mathbf{h})) / Z$, with $Z = \sum_{\mathbf{v}, \mathbf{h}} \exp(-E(\mathbf{v}, \mathbf{h}))$.
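As a concrete illustration (a minimal sketch in our notation above, not the authors' implementation), the energy of a configuration can be computed directly:

```python
import numpy as np

# Sketch of the RBM energy E(v, h) = -b^T v - c^T h - h^T W v,
# with W a K x D weight matrix, b visible biases, c hidden biases.
def rbm_energy(v, h, W, b, c):
    return -(b @ v) - (c @ h) - (h @ W @ v)

rng = np.random.default_rng(0)
D, K = 4, 3
W = rng.normal(size=(K, D))
b = rng.normal(size=D)
c = rng.normal(size=K)
v = rng.integers(0, 2, size=D).astype(float)
h = rng.integers(0, 2, size=K).astype(float)
E = rbm_energy(v, h, W, b, c)
```

Note that the all-zero configuration always has zero energy under this parameterization, since every term vanishes.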

We see from equation 2.3 that the partition function *Z* (normalizing constant) is intractable, as it requires summing over all $2^{D+K}$ possible configurations $(\mathbf{v}, \mathbf{h})$.

Denote by $\mathbf{W}_{i\cdot}$ the $i$th row of $\mathbf{W}$ and, for columns, $\mathbf{W}_{\cdot j}$. Summing out the hidden units analytically allows an equivalent definition of the RBM model in terms of what is known as the free energy:

$$F(\mathbf{v}) = -\mathbf{b}^\top \mathbf{v} - \sum_{i=1}^{K} \mathrm{soft}_{+}(\mathbf{W}_{i\cdot}\mathbf{v} + c_i), \qquad \mathrm{soft}_{+}(x) = \ln(1 + e^{x}),$$

so that $p(\mathbf{v}) = \exp(-F(\mathbf{v}))/Z$. However, the partition function still requires summing over all configurations of visible vectors, which is intractable even for moderate values of $D$.
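The analytic marginalization over hidden units can be verified numerically. The sketch below (illustrative names, not the authors' code) compares the closed-form free energy against explicit brute-force summation over all $2^K$ hidden vectors:

```python
import numpy as np
from itertools import product

def softplus(x):
    return np.logaddexp(0.0, x)

# Closed form: F(v) = -b^T v - sum_i softplus(W_i. v + c_i).
def free_energy(v, W, b, c):
    return -(b @ v) - softplus(W @ v + c).sum()

# Brute force: F(v) = -ln sum_h exp(-E(v, h)) over all 2^K hidden vectors.
def free_energy_bruteforce(v, W, b, c):
    neg_E = [b @ v + np.dot(h, c) + np.dot(h, W @ v)
             for h in product([0, 1], repeat=len(c))]
    return -np.logaddexp.reduce(np.array(neg_E))

rng = np.random.default_rng(0)
D, K = 4, 3
W = rng.normal(size=(K, D))
b = rng.normal(size=D)
c = rng.normal(size=K)
v = rng.integers(0, 2, size=D).astype(float)
F = free_energy(v, W, b, c)
```

The agreement follows from $\sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})} = e^{\mathbf{b}^\top\mathbf{v}} \prod_i (1 + e^{\mathbf{W}_{i\cdot}\mathbf{v} + c_i})$.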

Training an RBM corresponds to minimizing the average negative log-likelihood (NLL) by stochastic gradient descent. The gradient decomposes into a positive phase, an average over the training data, and a negative phase, an expectation under the model distribution that is intractable and is instead approximated with an average over *S* samples drawn from $p(\mathbf{v})$, that is, the model. Moreover, mini-batch training is usually employed and consists of replacing the positive phase average by one over a small subset of the training set, different for every training update.

In theory, the Markov chain should be run until equilibrium before drawing a sample for every training update, which is highly inefficient. Thus, contrastive divergence (CD) learning is often employed, where we initialize the update’s Gibbs chains to the training examples and perform only *T* steps of Gibbs sampling (Hinton, 2002). Another approach, referred to as stochastic approximation or persistent CD (PCD) (Tieleman, 2008), is to not reinitialize the Gibbs chains between updates.
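The sketch below illustrates one CD-*T* update for a binary RBM (a minimal, illustrative implementation; step size and *T* are arbitrary, and for PCD the returned chain state would be kept between updates instead of reinitializing from data):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_update(v0, W, b, c, rng, T=1, lr=0.01):
    """One CD-T parameter update, starting the Gibbs chain at v0."""
    v = v0.copy()
    for _ in range(T):  # T steps of block Gibbs sampling
        h = (rng.random(c.shape) < sigmoid(c + W @ v)).astype(float)
        v = (rng.random(b.shape) < sigmoid(b + W.T @ h)).astype(float)
    # positive phase uses E[h | v0]; negative phase uses E[h | v_T]
    ph0 = sigmoid(c + W @ v0)
    phT = sigmoid(c + W @ v)
    W += lr * (np.outer(ph0, v0) - np.outer(phT, v))
    b += lr * (v0 - v)
    c += lr * (ph0 - phT)
    return v  # PCD would reuse this state for the next update

rng = np.random.default_rng(1)
D, K = 6, 4
W = 0.01 * rng.normal(size=(K, D))
b = np.zeros(D)
c = np.zeros(K)
v0 = rng.integers(0, 2, size=D).astype(float)
v_neg = cd_update(v0, W, b, c, rng, T=1)
```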

## 3 Ordered Restricted Boltzmann Machine

The model we propose is a variant of the RBM where the hidden units are ordered from left to right, with this order being taken into account by the energy function. We refer to this model as an ordered RBM (oRBM). As shown in Figure 2, the oRBM takes hidden unit order into account by introducing a random variable *z* that can be understood as the effective number of hidden units contributing to the energy. Hidden units are selected starting from the left, and the selection of each hidden unit is associated with an incremental cost in energy.

The oRBM's energy function is

$$E(\mathbf{v}, \mathbf{h}, z) = -\mathbf{b}^\top \mathbf{v} - \sum_{i=1}^{z} \left( h_i (\mathbf{W}_{i\cdot}\mathbf{v} + c_i) - \beta_i \right),$$

where $z$ represents the number of selected hidden units and $\beta_i$ is an energy penalty for selecting the $i$th hidden unit. As we will see, carefully parameterizing the per unit energy penalty will allow us to consider the case of an infinite pool of hidden units.

In our experiments, as we wanted the filters of each unit to be the dominating factor in a unit being selected, we parameterized the penalty as $\beta_i = \beta\, \mathrm{soft}_{+}(c_i)$, where $\beta$ is a global hyperparameter (critically, as we discuss later, this hyperparameter does not actually require tuning, and a generic value for it works fine). Intuitively, the penalty term acts as a form of regularization, since it forces the model to avoid using more hidden units than needed, prioritizing smaller networks.

Moreover, having the penalty depend on the hidden biases also implies that the selection of a hidden unit (i.e., influencing the outcome of the random variable *z*) will be mostly controlled by the values taken by its connections $\mathbf{W}_{i\cdot}$. Higher values of the bias of a hidden unit will not increase its probability of being selected. In other words, for the model to increase its capacity and better fit the training data, it will have to learn better filters. Note that alternative parameterizations could certainly be considered.

As with the RBM, $p(\mathbf{v}, \mathbf{h}, z)$ is defined in terms of the energy function. For this, we have to specify the set of legal values for $\mathbf{v}$, $\mathbf{h}$, and $z$. Since, for a given $z$, the value of the energy does not depend on the dimensions of $\mathbf{h}$ beyond position $z$, we will assume they are set to 0. There is thus a coupling between the value of $z$ and the legal values of $\mathbf{h}$. Let $\mathcal{H}_z = \{\mathbf{h} \in \{0,1\}^K : h_i = 0 \ \forall\, i > z\}$ be the legal values of $\mathbf{h}$ for a given $z$. As for $z$, it can vary in $\{1, \dots, K\}$, and $\mathbf{v} \in \{0,1\}^D$ as usual.

The joint distribution over $\mathbf{v}$, $\mathbf{h}$, and $z$ is thus $p(\mathbf{v}, \mathbf{h}, z) = \exp(-E(\mathbf{v}, \mathbf{h}, z))/Z$, where $Z = \sum_{z} \sum_{\mathbf{v}} \sum_{\mathbf{h} \in \mathcal{H}_z} \exp(-E(\mathbf{v}, \mathbf{h}, z))$. As for the marginal distribution $p(\mathbf{v})$ of the oRBM model, it can also be written in terms of a free energy. Indeed, in a derivation similar to the case of the RBM, we can show

$$F(\mathbf{v}, z) = -\mathbf{b}^\top \mathbf{v} - \sum_{i=1}^{z} \left( \mathrm{soft}_{+}(\mathbf{W}_{i\cdot}\mathbf{v} + c_i) - \beta_i \right).$$

This gives us a free energy where only the hidden units have been marginalized. We can also derive a formulation where the free energy depends only on $\mathbf{v}$:

$$F(\mathbf{v}) = -\ln \sum_{z=1}^{K} \exp(-F(\mathbf{v}, z)).$$

It should be noted that in the oRBM, $z$ does not correspond to the number of hidden units assumed to have generated all observations. Instead, the model allows for different observations having been generated by a different number of hidden units. Specifically, for a given $\mathbf{v}$, the conditional distribution over the corresponding value of $z$ is

$$p(z \mid \mathbf{v}) = \frac{\exp(-F(\mathbf{v}, z))}{\sum_{z'=1}^{K} \exp(-F(\mathbf{v}, z'))}.$$

As for the conditional distribution over the hidden units given a value of $z$, it takes the same form as for the regular RBM, except for unselected hidden units, which are forced to zero. Similarly, the distribution of $\mathbf{v}$ given a value of the hidden layer and $z$ reflects that of the RBM.

Compared to the RBM, computing these gradients requires one additional quantity: the vector of cumulative probabilities $\left( p(z \ge i \mid \mathbf{v}) \right)_{i=1}^{K}$. Fortunately, this quantity can be efficiently computed, in $O(K)$, by first computing the vector of probabilities $\left( p(z = i \mid \mathbf{v}) \right)_{i=1}^{K}$ and then performing a cumulative sum.
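To make these quantities concrete, here is a small NumPy sketch (illustrative names, assuming the penalty parameterization $\beta_i = \beta\,\mathrm{soft}_{+}(c_i)$ discussed above) of the per-$z$ free energy, computed with one cumulative sum, and of the resulting conditional over $z$:

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)

def free_energy_vz(v, W, b, c, beta):
    """Vector of F(v, z) for z = 1..K, built by accumulating per-unit terms."""
    per_unit = softplus(W @ v + c) - beta * softplus(c)  # soft+(W_i.v + c_i) - beta_i
    return -(b @ v) - np.cumsum(per_unit)

def p_z_given_v(v, W, b, c, beta):
    """p(z | v) proportional to exp(-F(v, z)), normalized over z = 1..K."""
    logits = -free_energy_vz(v, W, b, c, beta)
    logits -= logits.max()  # stabilize the softmax
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(2)
D, K = 5, 8
W = rng.normal(size=(K, D))
b = np.zeros(D)
c = rng.normal(size=K)
v = rng.integers(0, 2, size=D).astype(float)
pz = p_z_given_v(v, W, b, c, beta=1.01)
```

A cumulative sum of `pz` then gives the cumulative probabilities needed for the gradients.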

Sampling from $p(\mathbf{v})$ differs slightly from the RBM, as we need to consider $z$ in the Markov chain. With the oRBM, Gibbs steps alternate between sampling $\mathbf{v} \sim p(\mathbf{v} \mid \mathbf{h}, z)$ and $(\mathbf{h}, z) \sim p(\mathbf{h}, z \mid \mathbf{v})$. Sampling from $p(\mathbf{h}, z \mid \mathbf{v})$ is done in two steps: $z \sim p(z \mid \mathbf{v})$, followed by $\mathbf{h} \sim p(\mathbf{h} \mid \mathbf{v}, z)$.
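The two-step sampling of $(\mathbf{h}, z)$ can be sketched as follows (illustrative code; `pz` stands for a precomputed conditional $p(z \mid \mathbf{v})$, here replaced by a uniform placeholder for the demo):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_z_given_v(v, W, c, pz, rng):
    """First z ~ p(z|v); then the first z hidden units from their usual RBM
    conditionals, with all units beyond z clamped to zero."""
    z = int(rng.choice(len(pz), p=pz)) + 1     # z ranges over 1..K
    h = np.zeros(len(c))
    probs = sigmoid(W[:z] @ v + c[:z])
    h[:z] = (rng.random(z) < probs).astype(float)
    return h, z

rng = np.random.default_rng(3)
D, K = 5, 8
W = rng.normal(size=(K, D))
c = rng.normal(size=K)
v = rng.integers(0, 2, size=D).astype(float)
pz = np.full(K, 1.0 / K)                       # placeholder conditional
h, z = sample_h_z_given_v(v, W, c, pz, rng)
```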

During training, what we observe is that the hidden units are each trained gradually, in sequence, from left to right. This effect is mainly due to the multiplicative term $p(z \ge i \mid \mathbf{v})$ in the hidden unit parameter updates of equations 3.10 and 3.11, which is monotonically decreasing in $i$. Effectively, the model is thus growing in capacity during training, until its maximum capacity of *K* hidden units.

## 4 Infinite Restricted Boltzmann Machine

The growing behavior of the oRBM raises the question, Could we achieve a similar effect without having to specify a maximum capacity for the model? Indeed, while Montufar and Ay (2010) have shown that with enough hidden units, an RBM is a universal approximator, a variant of the RBM that could automatically increase its capacity until it is sufficiently high is likely to yield much smaller models in practice. It turns out that this is possible by taking the limit $K \rightarrow \infty$. For this reason, we refer to this model as the infinite RBM (iRBM; see Figure 3).

This limit is made possible thanks to two modeling choices. The first is the assumption that only a finite (but variable!) number of hidden units have nonzero weights and biases. This is trivial to ensure for any optimization procedure using any amount of any type of weight decay (e.g., L2 or L1 regularization) on all the weights and hidden biases. An infinite number of nonzero weights and biases would then correspond to an infinite penalty, so no proper optimization would ever diverge to this solution, no matter the initialization. This is guaranteed when using L1 regularization, thanks to its sparsity-inducing property. As for L2 regularization, while it could theoretically lead to an infinite number of nonzero hidden units (e.g., if the L2 norm of the parameters associated with each hidden unit decreases exponentially with the position of the hidden unit), in practice floating-point precision clips very small parameters to zero, leaving a finite number of hidden units.

Denote by $l$ the number of effectively trained hidden units (i.e., beyond position $l$, all hidden units have zero weights and biases). This is guaranteed to happen thanks to the growing behavior that ensures hidden units are trained from left to right. Then we can split the normalization constant of equation 4.1 into two parts, split at $l$, as follows:

$$\sum_{z=1}^{\infty} \exp(-F(\mathbf{v}, z)) = \sum_{z=1}^{l} \exp(-F(\mathbf{v}, z)) \;+\; \exp(-F(\mathbf{v}, l)) \sum_{k=1}^{\infty} \left( 2^{\,1-\beta} \right)^{k},$$

where equation 4.2 is obtained by exploiting the fact that all weights and biases of hidden units at position $l+1$ and higher are zero, so that each additional unit contributes a constant factor $\exp\!\big((1-\beta)\,\mathrm{soft}_{+}(0)\big) = 2^{\,1-\beta}$. By ensuring that $\beta > 1$, the geometric series of equation 4.2 is finite and can be analytically computed. This in turn implies that $p(z \mid \mathbf{v})$ is tractable and can be sampled from. Following a similar reasoning, the global partition function *Z* can be shown to be finite (see appendix B), thus yielding a properly defined joint distribution for any configuration with a finite number of nonzero weights and hidden biases.
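A quick numerical sanity check of the geometric tail (under the assumption that all units beyond $l$ have zero parameters, so the per-unit ratio is $r = 2^{1-\beta}$):

```python
# Once all units beyond position l have zero weights and biases, each further
# unit multiplies exp(-F(v, z)) by the constant ratio r = 2^(1 - beta), so the
# infinite tail is a geometric series that converges whenever beta > 1.
beta = 1.01
r = 2.0 ** (1.0 - beta)              # per-unit ratio, < 1 for beta > 1
tail_analytic = r / (1.0 - r)        # sum over k >= 1 of r^k

# Compare against an explicit truncation of the series.
tail_truncated = sum(r ** k for k in range(1, 20000))
```

For $\beta$ close to 1 the ratio is close to 1 and the tail is large but finite; larger $\beta$ shrinks it rapidly.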

One could think that, compared to a regular RBM, we have merely traded the hyperparameter of the hidden layer size for the hyperparameter $\beta$. However, crucially, $\beta$'s role is only to ensure that the iRBM is properly defined, and the penalty it imposes in the energy function can be compensated by the learned parameters. The extent to which the parameters can grow enough to compensate for that penalty is then controlled by the strength of weight decay, a hyperparameter the iRBM shares with the RBM. We have thus effectively removed one hyperparameter. Moreover, we have indeed observed that results are robust to the choice of $\beta$; that is, finely tuning $\beta$ was not necessary to ultimately achieve good performance. While the choice of $\beta$ can affect the number of epochs it takes for the weights to compensate for the penalty, the number of epochs is a quantity that must be tuned anyway, even in regular RBMs.

The question of the identifiability of the binary RBM is a complex one, which has been studied (Cueto, Morton, & Sturmfels, 2010). Unlike the RBM, the iRBM is sensitive to the ordering of its hidden units thanks to the penalty term. This means that permutations of iRBM’s hidden units do not correspond to the same distribution, making its parameterization more identifiable.

As for learning, it can be done mostly by following the procedure of the oRBM—minimizing the NLL with stochastic gradient descent using (Persistent) CD to approximate the gradients. One slight modification is required, however. Indeed, since the free energy gradient for the hidden weights and biases can be nonzero for all (infinite) hidden units, we cannot use the gradient of equations 3.10 and 3.11 for all hidden units.

In this case, all weights and biases with an index greater than the sampled *z* have a gradient of zero (i.e., they do not require any update). Moreover, taking the expectation of these gradients under $p(z \mid \mathbf{v})$ corresponds to taking the gradients of the free energy $F(\mathbf{v})$, making them unbiased in this respect. This comes at the cost of higher variance in the updates. But thanks to this observation, we are justified in using a hybrid approach, where we use the gradients of equations 3.10 and 3.11 only for the units with an index less than or equal to *l*, and adopt a zero gradient for the other units (i.e., leave them set to zero).

As previously mentioned, we use weight decay to ensure that the number of nonzero parameters cannot diverge to infinity. For practical reasons, our implementation also used a capacity-limiting heuristic: if the Gibbs sampling chain ever sampled a value for *z* greater than *l*, we clamped it to *l* + 1. Intuitively, this corresponds to adding a single hidden unit. This avoids filling all the memory in the (unlikely) event where we would draw a large value for *z*. When a hidden unit is added, its associated weights and biases are initialized to zero.
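The heuristic amounts to growing the parameter arrays by one zero-initialized row at a time; a minimal sketch (illustrative names, not the authors' implementation):

```python
import numpy as np

def maybe_grow(W, c, z, l):
    """If the sampled z exceeds the current number of trained units l,
    clamp z to l + 1 and append one zero-initialized hidden unit."""
    if z > l:
        z = l + 1                                       # add at most one unit
        W = np.vstack([W, np.zeros((1, W.shape[1]))])   # new weights start at zero
        c = np.append(c, 0.0)                           # new bias starts at zero
        l = l + 1
    return W, c, z, l

W = np.zeros((3, 5))
c = np.zeros(3)
l = 3
W, c, z, l = maybe_grow(W, c, z=10, l=l)   # a large sampled z gets clamped
```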

We emphasize that this heuristic was not required to avoid divergence (weight decay is sufficient); it merely ensured a practical and efficient implementation of the model on the GPU. Note also that when L1 regularization is used, *l* can decrease in value thanks to the sparsity-promoting property of the L1 norm. Again, we highlight that while a finite number of nonzero weights and biases is maintained, that number varies and is learned, while the implicit number of hidden units is indeed infinite (infinitely many contribute to the partition function).

## 5 Related Work

This work falls within the research literature on extending the original RBM model to different contexts and objectives. Of note here is the implicit mixture of RBMs (Nair & Hinton, 2008). Indeed, the oRBM can be interpreted as a special case of an implicit mixture of RBMs. Writing $p(\mathbf{v})$ as $\sum_{z=1}^{K} p(z)\, p(\mathbf{v} \mid z)$, we see that the oRBM is an implicit mixture of *K* RBMs, where each RBM has a different number of hidden units (from 1 to *K*) and the weights are tied between RBMs. The prior $p(z)$ represents the probability of using the *z*th RBM and is also derived from the energy function. However, as in the implicit mixture of RBMs, $p(z)$ is intractable, as it would require the value of the partition function. That said, the work of Nair and Hinton (2008) is otherwise very different and did not address the question of having an RBM with adaptive capacity.

Another related work is that of the cardinality RBMs proposed by Swersky et al. (2012). They used a cardinality potential to control the sparsity of the RBM, limiting the number of hidden units that can be active. In the oRBM and the iRBM, $z$ effectively acts as an upper bound on the number of hidden units $h_i$ that can be equal to 1, since we are limiting $\mathbf{h}$ to be in $\mathcal{H}_z$, a subset of $\{0,1\}^K$. In their work, Swersky et al. (2012) use cardinality potentials that allow only configurations having at most $k$ active hidden units. One difference with our work, however, is that their cardinality potential is order agnostic, meaning that the active hidden units can be positioned anywhere within the hidden layer while still satisfying the cardinality potential. On the other hand, in the oRBM, all units with an index higher than $z$ must be set to zero, with only the previous hidden units being allowed to be active. In addition, their parameter $k$ is fixed during training, whereas our number of active hidden units $z$ changes depending on the input.

The oRBM also bears some similarity to autoencoders trained by a nested version of dropout (Rippel, Gelbart, & Adams, 2014). Nested dropout works by stochastically selecting the number of hidden units used to reconstruct an input example at training time, doing so independently for each update and example. Rippel et al. (2014) showed that this defines a learning objective that makes the solution identifiable and no longer invariant to hidden unit permutation. In addition to being concerned with a different type of model, this work does not discuss the case of an unbounded and adaptive hidden layer size.

Welling, Zemel, and Hinton (2003) proposed a self-supervised boosting approach that is applicable to the RBM and in which hidden units are sequentially added and trained. However, like boosting in general and unlike the iRBM, this procedure trains each hidden unit greedily instead of jointly, which could lead to much larger networks than necessary. Moreover, it is not easily generalizable to online learning.

While the work on unsupervised neural networks with adaptive hidden layer size is otherwise relatively scarce, there has been much more work in the context of supervised learning. There is the well-known work of Fahlman and Lebiere (1990) on cascade-correlation networks. More recently Zhou, Sohn, and Lee (2012) proposed a procedure for learning discriminative features with a denoising autoencoder (a model related to the RBM). The procedure is also applicable to the online setting. It relies on invoking two heuristics that either add or merge hidden units during training. We note that the iRBM framework could easily be generalized to discriminative and hybrid training as in Zhou et al. (2012). The corresponding mechanisms for adding and merging units would then be implicitly derived from gradient descent on the corresponding supervised training objective.

Finally, we highlight that our model is not based on a Bayesian formulation, as is most of the literature on infinite models. But it does correspond to the infinite limit of a finite-sized model and yields a model that can learn its size with training.

## 6 Experiments

We compare the performance of the oRBM and the iRBM with the classic RBM on two data sets: binarized MNIST (Salakhutdinov & Murray, 2008) and CalTech101 Silhouettes (Marlin, Swersky, Chen, & de Freitas, 2010). We aim to demonstrate that the iRBM effectively removes the need for tuning a hyperparameter for the hidden layer size while still achieving performance comparable to that of the standard RBM. The code to reproduce the experiments of the paper is available on GitHub (http://github.com/MarcCote/iRBM). Our implementation is done using Theano (Bastien et al., 2012; Bergstra et al., 2010).

For completeness, we mention that more sophisticated or deep models have reported results on one or both of these data sets (e.g. EoNADE, Uria, Murray, & Larochelle, 2014; DBNs, Murray & Salakhutdinov, 2008; deep autoregressive networks, Gregor, Mnih, & Wierstra, 2014; iterative neural autoregressive distribution estimator, Raiko & Bengio, 2014) that improve on the standard RBM. However, since our objective with the iRBM is to effectively remove a hyperparameter of the RBM instead of achieving improved performance, we focus our comparison on this baseline.

All NLL results of this section were obtained by estimating the log-partition function using annealed importance sampling (AIS) (Salakhutdinov & Murray, 2008) with 100,000 intermediate distributions and 5000 chains. As an additional validation step, samples were generated from the best models and visually inspected.

Each model was trained with mini-batch stochastic gradient descent using a batch size of 64 examples and using PCD with 10 Gibbs steps between parameter updates. We used the ADAGRAD stochastic gradient update (Duchi, Hazan, & Singer, 2011), a per-dimension learning rate method, to train the oRBMs and the iRBMs. We found that having different learning rates for different hidden units was very beneficial, since units positioned earlier in the hidden layer approach convergence faster than units to their right and thus benefit from a learning rate that decays more rapidly. We tried several learning rates and kept ADAGRAD's epsilon parameter fixed across experiments.
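For reference, the per-dimension ADAGRAD update can be sketched as follows (generic illustration with arbitrary step size, not the paper's training code); each parameter's effective step size shrinks with its accumulated squared gradients, which is what lets early, already-converging hidden units decay faster:

```python
import numpy as np

def adagrad_step(param, grad, accum, lr=0.1, eps=1e-6):
    """ADAGRAD (Duchi et al., 2011): per-dimension adaptive step sizes."""
    accum += grad ** 2                               # accumulate squared gradients
    param -= lr * grad / (np.sqrt(accum) + eps)      # larger history -> smaller step
    return param, accum

param = np.array([1.0, 1.0])
accum = np.zeros(2)
# dimension 0 keeps receiving large gradients, so its step size shrinks fastest
for _ in range(3):
    param, accum = adagrad_step(param, np.array([1.0, 0.1]), accum)
```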

We also tested different values for the L1 and L2 regularization factors. Note that we allow the iRBM to shrink only if L1 regularization is used.

We did try varying the $\beta$ found in the penalty term and, as expected, found results to be robust to its value. Since $\beta$ must be greater than 1, we explored positive constants to add to 1 on a log scale (e.g., 1, 0.25, 0.1, 0.01, 0.001). We settled on a single value of $\beta$ for all experiments, as it provides a penalty high enough to induce the growing behavior while requiring around 500 epochs for the weights to compensate for the penalty.

Finally, we note that improved performances could certainly have been achieved using an improved sampler (e.g., parallel tempering; Desjardins, Courville, Bengio, Vincent, & Delalleau, 2010) or parameterization (e.g. enhanced gradient parameterization; Cho, Raiko, & Ilin, 2013). However, these changes would equally improve the baseline RBM, so we decided to concentrate on this more common learning setup.

### 6.1 Binarized MNIST

The MNIST dataset (http://yann.lecun.com/exdb/mnist) is composed of 70,000 images of size $28 \times 28$ pixels representing handwritten digits (0–9). Images have been stochastically binarized according to their pixel intensity as in Salakhutdinov and Murray (2008). We use the same split as in Larochelle and Murray (2011), corresponding to 50,000 examples for training, 10,000 for validation, and 10,000 for testing.
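The stochastic binarization is simply Bernoulli sampling with each pixel's gray-scale intensity as the success probability; a minimal sketch (the batch here is a random stand-in for actual MNIST intensities in [0, 1]):

```python
import numpy as np

def binarize(images, rng):
    """Sample each pixel as Bernoulli(p = gray-scale intensity in [0, 1])."""
    return (rng.random(images.shape) < images).astype(np.float32)

rng = np.random.default_rng(42)
fake_batch = rng.random((64, 784))      # stand-in for a batch of MNIST intensities
binary_batch = binarize(fake_batch, rng)
```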

Each model was trained for up to 5000 epochs, with AIS evaluation performed every 1000 epochs; we kept the model with the best NLL approximation on the validation set and report the associated NLL approximations obtained on the test set. Following past studies assessing RBM results on binarized MNIST, we fixed the number of hidden units to 500 for the RBM and the oRBM. Best results for the RBM, oRBM, and iRBM are reported in Table 1. The oRBM and the iRBM reach performance competitive with that of the RBM. Samples from all three models are illustrated in Figure 4.

Table 1: Results on binarized MNIST.

| Model | Size | $\ln \hat{Z}$ | CI of $\ln \hat{Z}$ | Average NLL |
|---|---|---|---|---|
| RBM | 100 | 600.92 | [600.88, 600.95] | 98.17 ± 0.52 |
| RBM | 500 | 613.28 | [613.24, 613.31] | 86.50 ± 0.44 |
| RBM | 2000 | 1099.07 | [1098.94, 1099.17] | 85.03 ± 0.42 |
| oRBM | 500 | 40.06 | [39.90, 40.19] | 88.15 ± 0.46 |
| iRBM | 1208 | 40.32 | [40.03, 40.54] | 85.65 ± 0.44 |

Notes: Partition functions were estimated using AIS with 100,000 intermediate distributions and 5000 chains. The confidence interval on the average NLL assumes the log-partition function estimate has no variance and reflects only the uncertainty of a finite sample average. Taking the uncertainty about the partition function into account would widen the interval.

The best RBM (500 hidden units) was trained without any regularization and for 5000 epochs. We used our own implementation to train the RBM, which is why our result differs slightly from what is reported by Salakhutdinov and Murray (2008). The difference can be explained by the fact that they used the full 60,000 training images instead of a 50,000-example subset. Also, they used a custom schedule to gradually increase the number of CD steps during training. That said, the oRBM and the iRBM would probably also benefit from having more training data and an improved sampling strategy.

The best oRBM (500 hidden units) was trained without any regularization and for 500 epochs. After 3000 epochs, the best iRBM had 1208 hidden units with nonzero weights; it was trained with L1 regularization.

To show that our best iRBM does find an appropriate number of hidden units, we compared it with two other RBMs having, respectively, 100 and 2000 hidden units. Both were trained for 5000 epochs without any regularization. Results are reported in Table 1, where we can see that the oRBM and the iRBM still achieve competitive results compared to the RBM with 2000 hidden units.

Figure 5 shows the ordering effect on the filters obtained with an iRBM. The ordering is even more apparent when observing the hidden unit filters during training. We generated a video of this visualization illustrating the filter values and the generated negative samples at epochs 1, 10, 50, and 100. (See http://youtu.be/zP-6DiwksNY)

Interestingly, we have observed that Gibbs sampling can mix much more slowly with the oRBM. The reason is that the addition of the variable *z* increases the dependence between states and thus hurts the convergence of Gibbs sampling. In particular, we observed that when the Gibbs chain is in a state corresponding to a noisy image without any structure, it can require many steps before stepping out of this region of the input space. Yet comparing the free energy of such random images with that of images resembling digits confirmed that the random images have significantly higher free energy (and thus are unlikely samples of the model). Figure 6 also confirms the high dependence between *z* and $\mathbf{v}$: the distribution $p(z \mid \mathbf{v})$ for the unstructured image is peaked at a low value of *z*, while all digits prefer values of *z* greater than 250. To fix this issue, we found that simply initializing the Gibbs chain with a large value of *z* was sufficient. We used this when sampling from a trained oRBM model.

The iRBM does not seem to suffer as much from a low mixing rate and thus does not require the initialization heuristic for sampling. In fact, using the heuristic when sampling from an iRBM has almost no impact on the final samples when running 10,000 Gibbs steps. This could be an artifact of the model being trained progressively; that is, we add one hidden unit only when sampling a value for *z* bigger than *l*. Understanding how the lower mixing rate affects the proposed models, and whether a heuristic such as the one we mentioned earlier could be used to improve training, is a topic left for future work.

We have also investigated which inputs maximize $p(z \mid \mathbf{v})$ for different values of *z*. Using our best iRBM model trained with L1 regularization, we generated Figure 7. It highlights the fact that *z* does capture some structure in the data, as the identity of the most probable character varies between different values of *z*.

### 6.2 CalTech101 Silhouettes

The CalTech101 Silhouettes data set (http://people.cs.umass.edu/marlin/data.shtml; Marlin et al., 2010) is composed of 8671 images of size $28 \times 28$ binary pixels, representing object silhouettes (101 classes). The data set is divided into three subsets: 4100 examples for training, 2264 for validation, and 2307 for testing.

Following a protocol similar to the one used for MNIST, each model was trained up to 5000 epochs, and AIS evaluation was done every 1000 epochs. We report the NLL approximations obtained on the test set. Best results for the RBM, oRBM, and iRBM are reported in Table 2. Again, the oRBM and the iRBM models reach competitive performance compared to the RBM. Samples from all three models are illustrated in Figure 8.

Table 2: Results on CalTech101 Silhouettes.

| Model | Size | $\ln \hat{Z}$ | CI of $\ln \hat{Z}$ | Average NLL |
|---|---|---|---|---|
| RBM | 100 | 2512.20 | [2511.62, 2512.56] | 177.37 ± 2.81 |
| RBM | 500 | 2385.91 | [2385.68, 2386.10] | 119.05 ± 2.27 |
| RBM | 2000 | 3353.47 | [3349.85, 3354.15] | 118.29 ± 2.25 |
| oRBM | 500 | 1782.96 | [1782.88, 1783.02] | 114.99 ± 1.97 |
| iRBM | 915 | 2000.08 | [1999.93, 2000.22] | 121.47 ± 2.07 |

Notes: The confidence interval on the average NLL assumes the log-partition function estimate has no variance and reflects only the uncertainty of a finite sample average. Taking the uncertainty about the partition function into account would widen the interval.

The best RBM (500 hidden units) was trained without any regularization and for 3000 epochs, using our own implementation. The best oRBM (500 hidden units) was trained with L1 regularization for 5000 epochs. After 4000 epochs, the best iRBM had 915 hidden units with nonzero weights; it was trained with L1 regularization.

Again, to show that our best iRBM does find an appropriate number of hidden units, we compared it with two other RBMs having, respectively, 100 and 2000 hidden units. Both were trained without any regularization, for 5000 epochs and 2000 epochs, respectively. Results are reported in Table 2, where we can see that the oRBM and the iRBM still achieve competitive results compared to the RBM with 2000 hidden units.

## 7 Conclusion

We proposed a novel extension of the RBM, the infinite RBM, which obviates the need to specify the hidden layer size. The iRBM is derived from the ordered RBM by taking the infinite limit of its hidden layer size. We presented a training procedure, derived from contrastive divergence, such that training the iRBM yields a learning procedure where the effective hidden layer size can grow.

In future work, we are interested in generalizing the idea of a developing latent representation to structures other than a flat vector representation. We are currently exploring extensions of the RBM allowing for a tree-structured latent representation. We believe a similar construction, involving a similar *z* random variable, should allow us to derive a training algorithm that also learns the size of the latent representation.

## Appendix A: Partial Derivatives

### A.1 Partial Derivatives Related to the RBM

### A.2 Partial Derivatives Related to the oRBM and the iRBM

## Appendix B: Convergence of the Partition Function for the iRBM

## Acknowledgments

We thank NSERC for supporting this research, Nicolas Le Roux for discussions and comments, and Stanislas Lauly for making the iRBM’s training video.

## References
