## Abstract

We improve recently published results about resources of restricted Boltzmann machines (RBM) and deep belief networks (DBN) required to make them universal approximators. We show that any distribution *p* on the set {0, 1}^{n} of binary vectors of length *n* can be arbitrarily well approximated by an RBM with *k* − 1 hidden units, where *k* is the minimal number of pairs of binary vectors differing in only one entry such that their union contains the support set of *p*. In important cases this number is half the cardinality of the support set of *p* (the number given in Le Roux & Bengio, 2008). We construct a DBN with 2^{n}/(2(*n* − *b*)), *b* ∼ log *n*, hidden layers of width *n* that is capable of approximating any distribution on {0, 1}^{n} arbitrarily well. This confirms a conjecture presented in Le Roux and Bengio (2010).

## 1. Introduction

This work rests on ideas presented in Le Roux and Bengio (2008, 2010). We positively resolve a conjecture that was posed in Le Roux and Bengio (2010). Before going into the details about this conjecture, we first recall some basic ideas.

The definition of restricted Boltzmann machines (RBMs) and deep belief networks (DBNs) that we use is the common one. (For details, see Le Roux & Bengio, 2008, 2010.) Here we give a short description. A Boltzmann machine consists of a collection of binary stochastic units, where any pair of units may interact. The unit set is divided into visible and hidden units. Correspondingly, the state is characterized by a pair (*v*, *h*) where *v* denotes the state of the visible units and *h* denotes the state of the hidden units. One is usually interested in distributions on the visible states *v* and would like to generate these as marginals of distributions on the states (*v*, *h*). In a general Boltzmann machine, the interaction graph is allowed to be complete. An RBM is a special type of Boltzmann machine, where the graph describing the interactions is bipartite: only connections between visible and hidden units appear. Two visible units or two hidden units may not interact with each other (see Figure 1). The distribution over the states of all RBM units has the form of the Boltzmann distribution *p*(*v*, *h*) ∝ exp(*hWv* + *Bv* + *Ch*), where *v* is a binary vector of length equal to the number of visible units and *h* is a binary vector with length equal to the number of hidden units. The parameters of the RBM are given by the matrix *W* and the two vectors *B* and *C*. A DBN consists of a chain of layers of units. Only units from neighboring layers are allowed to be connected; there are no connections within each layer. The last two layers have undirected connections between them, while the other layers have connections directed toward the first layer, which is visible. The general idea of a DBN is that the interaction structure is deep rather than shallow (i.e., each hidden layer is not very large compared to the visible layer, as shown in Figure 1).
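For very small networks, the Boltzmann distribution *p*(*v*, *h*) ∝ exp(*hWv* + *Bv* + *Ch*) and its visible marginal can be tabulated by brute force, which makes the definitions concrete. The following sketch uses arbitrary example weights; the exponential sums are feasible only for toy sizes of *n* and *m*:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 2                      # number of visible / hidden units
W = rng.normal(size=(m, n))      # interaction weights (hidden x visible)
B = rng.normal(size=n)           # visible bias weights
C = rng.normal(size=m)           # hidden bias weights

def states(k):
    return list(itertools.product([0, 1], repeat=k))

def z(v, h):
    """Unnormalized Boltzmann weight z(v, h) = exp(hWv + Bv + Ch)."""
    v, h = np.array(v), np.array(h)
    return np.exp(h @ W @ v + B @ v + C @ h)

# marginal distribution on the visible states: p(v) = sum_h z(v, h) / Z
Z = sum(z(v, h) for v in states(n) for h in states(m))
p = {v: sum(z(v, h) for h in states(m)) / Z for v in states(n)}

assert abs(sum(p.values()) - 1.0) < 1e-12
```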

A major difficulty in the use of Boltzmann machines has always been the slowness of learning. In order to overcome this problem, DBNs have been proposed as an alternative to classical Boltzmann machines. An efficient learning algorithm for DBNs was given by Hinton, Osindero, and Teh (2006).

The fundamental questions are the following: Does a DBN exist that is capable of approximating any distribution on the visible states through an appropriate choice of parameters? We will refer to such a DBN as a *universal DBN approximator* (similarly, *universal RBM approximator*). If universal DBN approximators exist, what is their minimal size?

Since DBNs are more difficult to study than RBMs, as a preliminary step, corresponding questions related to the representational power of RBMs have been addressed. Theorem 2 in Le Roux and Bengio (2008) shows that any distribution on {0, 1}^{n} with support of cardinality *s* is arbitrarily well approximated (with respect to the Kullback-Leibler divergence) by the marginal distribution of an RBM containing *s* + 1 hidden units:

*Any distribution on {0, 1}^{n} can be approximated arbitrarily well with an RBM with s + 1 hidden units, where s is the number of input vectors whose probability does not vanish.*

This theorem proves the existence of a universal RBM approximator. The existence proof of a universal DBN approximator is due to Sutskever and Hinton (2008). More precisely, Sutskever and Hinton explicitly constructed a DBN with ∼3 · 2^{n} hidden layers of width *n* + 1 that approximates any distribution on {0, 1}^{n}. Given that the existence problem of universal DBN approximators was positively resolved through this result, the efforts have been put into optimizing the size (i.e., reducing the number of parameters). This can be done by reducing the number of hidden layers involved in a DBN or by making the hidden layers smaller.

We want to deduce a theoretical lower bound on the number of layers of a universal DBN approximator. Observe that for a DBN to approximate any visible distribution on {0, 1}^{n} arbitrarily well, the number of parameters has to be at least equal to 2^{n} − 1, the dimension of the set of all distributions on {0, 1}^{n} (for completeness we give a formal proof of this statement in the appendix). We use a simple counting argument based on that observation: the number of free parameters in a DBN with layers of constant size is (*square of the width of each layer*) × (*number of hidden layers*) + (*number of units*), which for *k* hidden layers of width *n* is *k*(*n*^{2} + *n*) + *n*. The number of parameters needed to describe all distributions on {0, 1}^{n} is 2^{n} − 1. Therefore, a lower bound on the number of hidden layers of a universal DBN approximator is given by ⌈(2^{n} − 1 − *n*)/(*n*^{2} + *n*)⌉ (which yields at least 2^{n} − 1 free parameters). Otherwise, the number of parameters would not be sufficient. Asymptotically, this bound is of order 2^{n}/*n*^{2}.
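This counting argument is easily checked numerically; `min_hidden_layers` below is a hypothetical helper name for the bound ⌈(2^{n} − 1 − *n*)/(*n*^{2} + *n*)⌉:

```python
import math

def min_hidden_layers(n):
    """Smallest k with k(n^2 + n) + n >= 2^n - 1 (parameter-counting bound)."""
    return math.ceil((2**n - 1 - n) / (n**2 + n))

for n in [4, 8, 12, 16]:
    k = min_hidden_layers(n)
    assert k * (n**2 + n) + n >= 2**n - 1         # enough parameters
    assert (k - 1) * (n**2 + n) + n < 2**n - 1    # k is minimal
```

For large *n* the bound grows like 2^{n}/*n*^{2}, matching the asymptotic statement above.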

This result implies a positive answer to the following question, which Sutskever and Hinton (2008) raised: Given that a network with 2^{n}/*n*^{2} layers has about 2^{n} parameters, can it be shown that a deep and narrow (with width *n*+*c*) network of ≪2^{n}/*n*^{2} layers cannot approximate every distribution?

Since the architecture of DBNs imposes important restrictions on the way the parameters are used, the bound we derived above is not necessarily achievable. In particular, the approximation of a distribution through a DBN or RBM is not unique; several choices of the parameters produce the same marginal distribution. However, Le Roux and Bengio (2010) showed that a number of hidden layers of order 2^{n}/*n* is sufficient:

*If n = 2^{t}, a DBN composed of 2^{n}/n + 1 layers of size n is a universal approximator of distributions on {0, 1}^{n}.*

The optimality of the bound given in this theorem was left as an open problem in that paper. However, the authors' proof method suggests the sufficiency of fewer hidden layers, which they conjectured. The proof of theorem 4 in Le Roux and Bengio (2010) crucially depends on their earlier theorem 2 (Le Roux & Bengio, 2008). Our contribution is to sharpen that ingredient (see theorem 1, section 2.1), which allows us to exploit their method better (see lemma 1, section 2.2) and thereby confirm their conjecture (see theorem 2, section 2.2).

## 2. Results

### 2.1. Restricted Boltzmann Machines.

The following theorem 1 sharpens theorem 2 in Le Roux and Bengio (2008).

*Any distribution p on binary vectors of length n can be approximated arbitrarily well by an RBM with k − 1 hidden units, where k is the minimal number of pairs of binary vectors, such that the two vectors in each pair differ in only one entry and such that the support set of p is contained in the union of these pairs.*

The set {0, 1}^{n} corresponds to the vertex set of the *n*-dimensional cube. The edges of the *n*-dimensional cube correspond to pairs of binary vectors of length *n* that differ in exactly one entry. For the graph of the *n*-dimensional cube, there exist perfect matchings, that is, collections of disjoint edges that cover all vertices. Hence {0, 1}^{n} (the support set of an arbitrary distribution) can be covered by 2^{n−1} pairs of vectors that differ in only one entry. The minimal number of pairs sufficient to cover the support of a given *p* can be as small as |supp(*p*)|/2. This is, for example, the case when the support of *p* is of the form {*v* ∈ {0, 1}^{n} : *v*_{i} = *x*_{i} ∀*i* ≠ *i*_{1}, …, *i*_{b}} for any 1 ⩽ *b* ⩽ *n* and fixed *x*_{i} ∈ {0, 1} for all *i* ≠ *i*_{1}, …, *i*_{b}, that is, when supp(*p*) is the vertex set of a *b*-dimensional face of the cube. The union of the following 2^{b−1} pairs covers that set: for each of the 2^{b−1} assignments of the entries *i*_{2}, …, *i*_{b}, take the pair consisting of the two vectors that complete this assignment with a 0, respectively a 1, in entry *i*_{1}.

We therefore have the following corollary 1 (which will be used in the proof of our main result, theorem 2):

- • *Any distribution on {0, 1}^{n} can be approximated arbitrarily well by an RBM with 2^{n−1} − 1 hidden units.*
- • *An RBM with n hidden units can approximate any visible distribution p on n visible units arbitrarily well, given that the support of p is contained in the set of vertices of some log(2(n + 1))-dimensional face of the n-dimensional unit cube, for example, supp(p) = {(x_{1}, …, x_{b}, 0, …, 0) ∈ {0, 1}^{n} : x_{i} ∈ {0, 1}, 1 ⩽ i ⩽ b} for any b ⩽ log(2(n + 1)).*

The proof of theorem 1 given below is in the spirit of the proof of theorem 2 in Le Roux and Bengio (2008). The idea there consists of showing that given an RBM with some marginal visible distribution, the inclusion of an additional hidden unit allows us to increment the probability mass of one visible state vector while uniformly reducing the probability mass of all other visible vectors.

We show that the inclusion of an additional hidden unit in fact allows us to increase the probability mass of a pair of visible vectors in independent ratios, given that the two vectors differ in only one entry. At the same time, the probability of all other visible states is reduced uniformly. We also use the bias weights of the visible units to further improve the result.

1. Let *p* be the distribution on the states of visible and hidden units of an RBM. Its marginal probability distribution on *v* can be written as

p(*v*) = ∑_{h} *z*(*v*, *h*) / ∑_{v′,h} *z*(*v*′, *h*),   (2.1)

where *z*(*v*, *h*) = exp(*hWv* + *Bv* + *Ch*) and *W*, *B*, and *C* are the parameters corresponding to *p*. Denote by *p*_{w,c} the distribution arising through adding a hidden unit to the RBM connected with weights *w* = (*w*_{1}, …, *w*_{n}) to the visible units and with bias weight *c*. Its marginal distribution is then

*p*_{w,c}(*v*) = (1 + exp(*w* · *v* + *c*)) ∑_{h} *z*(*v*, *h*) / ∑_{v′,h} (1 + exp(*w* · *v*′ + *c*)) *z*(*v*′, *h*).   (2.2)

2. Given any vector *v* ∈ {0, 1}^{n}, we write *v*^{j,0} for the vector defined through *v*^{j,0}_{i} = *v*_{i}, ∀*i* ≠ *j*, and *v*^{j,0}_{j} = 0. Similarly, we write *v*^{j,1} for the vector with *v*^{j,1}_{i} = *v*_{i}, ∀*i* ≠ *j*, and *v*^{j,1}_{j} = 1. We also write **1** ≔ (1, …, 1) and **e**_{j} ≔ (0, …, 0, 1, 0, …, 0), where the 1 is in the *j*th entry.

3. Fix *j* ∈ {1, …, *n*} and an arbitrary visible vector *u*. Consider also *s* ≔ **1** · *u*^{j,0}, that is, the number of ones in vector *u*^{j,0}. Define arbitrary λ_{1}, λ_{2} ∈ ℝ and a parameter *a* > 0. For the weights

*w* ≔ *a*(2*u*^{j,0} − **1**) + (λ_{2} − λ_{1} + *a*)**e**_{j}   (2.3)

and

*c* ≔ −*as* + λ_{1},   (2.4)

we have:

*w* · *u*^{j,0} + *c* = λ_{1},  *w* · *u*^{j,1} + *c* = λ_{2},  *w* · *v* + *c* ⩽ max{λ_{1}, λ_{2}} − *a* ∀*v* ∉ {*u*^{j,0}, *u*^{j,1}},   (2.5)

and in the limit *a* → ∞, we get:

lim_{a→∞} *p*_{w,c}(*v*) = *p*(*v*) / (1 + e^{λ_{1}}*p*(*u*^{j,0}) + e^{λ_{2}}*p*(*u*^{j,1})) ∀*v* ∉ {*u*^{j,0}, *u*^{j,1}},   (2.6)

lim_{a→∞} *p*_{w,c}(*u*^{j,β}) = (1 + e^{λ_{β+1}}) *p*(*u*^{j,β}) / (1 + e^{λ_{1}}*p*(*u*^{j,0}) + e^{λ_{2}}*p*(*u*^{j,1})), β ∈ {0, 1}.   (2.7)

Hence, the probabilities of *u*^{j,0} and *u*^{j,1} can be increased independently by a multiplicative factor, while all other probabilities are reduced uniformly.
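This mechanism can be verified numerically at finite *a*. The sketch below assumes the weight choice *w*_{i} = *a*(2*u*^{j,0}_{i} − 1) for *i* ≠ *j*, *w*_{j} = λ_{2} − λ_{1}, and *c* = −*as* + λ_{1}; the pair and the values of λ_{1}, λ_{2} are arbitrary examples:

```python
import itertools
import numpy as np

n = 3
rng = np.random.default_rng(1)
B = rng.normal(size=n)           # an RBM with no hidden units: z(v) = exp(B.v)
z = {v: float(np.exp(B @ np.array(v)))
     for v in itertools.product([0, 1], repeat=n)}

j = 1                            # the entry in which the pair differs (0-indexed)
u0, u1 = (1, 0, 1), (1, 1, 1)    # the pair u^{j,0} and u^{j,1}
s = sum(u0)                      # number of ones in u^{j,0}
lam1, lam2 = np.log(0.3), np.log(0.6)
a = 40.0

w = np.array([a * (2 * u0[i] - 1.0) for i in range(n)])
w[j] = lam2 - lam1               # the one free connection weight
c = -a * s + lam1                # bias weight of the added hidden unit

# adding the hidden unit multiplies each unnormalized weight by (1 + e^{w.v + c})
q = {v: z[v] * (1 + np.exp(w @ np.array(v) + c)) for v in z}

assert np.isclose(q[u0] / z[u0], 1 + np.exp(lam1))   # boost factor for u^{j,0}
assert np.isclose(q[u1] / z[u1], 1 + np.exp(lam2))   # boost factor for u^{j,1}
others = [v for v in z if v not in (u0, u1)]
assert all(np.isclose(q[v] / z[v], 1.0) for v in others)
```

Already at *a* = 40 the factor multiplying all vectors outside the pair deviates from 1 only by terms of order e^{−a}.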

4. Now we explain how to start an induction from which the claim follows. Consider an RBM with no hidden units, RBM^{0}. Through a choice of the bias weights in every visible unit, RBM^{0} produces as visible distribution any arbitrary factorizable distribution *p*^{0}(*v*) ∝ exp(*B* · *v*) ∝ exp(*B* · *v* + *K*), where *B* is the vector of bias weights and *K* is a constant that we introduce for illustrative reasons and is not a parameter of RBM^{0} since it cancels out with the normalization of *p*^{0}. In particular, RBM^{0} can approximate arbitrarily well any distribution with support given by a pair of vectors that differ in only one entry. To see this, consider any pair of vectors *u*^{j,0} and *u*^{j,1} that differ in the entry *j*. Then the choice *B* ≔ *a*(2*u*^{j,0} − **1**) + (λ_{2} − λ_{1} + *a*)**e**_{j} and *K* ≔ −*as* + λ_{1} yields in the limit *a* → ∞ (similarly to equations 2.5) that lim_{a→∞}*p*^{0}(*v*) = 0 whenever *v* ≠ *u*^{j,1} and *v* ≠ *u*^{j,0}, while lim_{a→∞}*p*^{0}(*u*^{j,1})/*p*^{0}(*u*^{j,0}) = exp(λ_{2} − λ_{1}) can be chosen arbitrarily by modifying λ_{1} and λ_{2}. Hence, *p*^{0} can be made arbitrarily similar to any distribution with support {*u*^{j,1}, *u*^{j,0}}. Notice that *p*^{0} always remains strictly positive for *a* < ∞.

By the arguments described above in equations 2.7, every additional hidden unit allows increasing the probability of any pair of vectors that differ in one entry. Obviously it is possible to do the same for a single vector instead of a pair. Then, with every additional hidden unit, the support set of the distributions that can be approximated arbitrarily well can be enlarged by an arbitrary pair of vectors that differ in one entry. That is, RBM^{(i−1)} is an approximator of distributions with support contained in any union of *i* pairs of vectors that differ in exactly one entry.

We close this section with some remarks.

The possibility of independent change of the probability mass of two visible vectors is due to the usability of the following two parameters: (1) the bias input weight in the added hidden unit and (2) the weight of the connection between the added hidden unit and the visible unit where the pair of visible vectors differs (see item 3 in the proof).

The attempt to use a similar idea to increment the probability mass of three different vectors in independent ratios leads to a coupled change in the probability of a fourth vector. Three vectors differ in at least two entries, as do four vectors. Since only three parameters are available (the bias of the new hidden unit and two connection weights), the dependence arises.

It is worth noting that using exclusively a similar idea will not allow an extension of theorem 2 in Le Roux and Bengio (2010) to permit the flip of a certain bit with a certain probability (only) given one of three input vectors.

### 2.2. Deep Belief Networks.

In this section we implement our theorem 1, make a sensible modification of the construction used in the proof of theorem 4 in Le Roux and Bengio (2010) in our lemma 1, and prove our main result, theorem 2:

*Let n = b + 2^{b−1}, b ∈ N, b ⩾ 1. A DBN containing 2^{n}/(2(n − b)) hidden layers of width n is a universal approximator of distributions on {0, 1}^{n}.*

Before proving theorem 2, we develop some components of the proof.

An important idea of Sutskever and Hinton (2008) is that of sharing, by means of which in a part of a DBN the probability of a vector is increased while the probability of another vector is decreased and the probability of all other vectors remains nearly constant. This idea is refined in theorem 2 of Le Roux and Bengio (2010). The following is a slightly different formulation of that result:

*Consider two layers of units indexed by i ∈ {1, …, n} and k ∈ {1, …, n}, and denote by v and h state vectors in each layer. Denote by {w_{ik}}_{i,k=1,…,n} the connection weights and by {c_{k}}_{k=1,…,n} the bias weights in the second layer. Given any l and j, l ≠ j, let a be an arbitrary vector in {0, 1}^{n} and b another vector with b_{i} = a_{i} ∀i ≠ j, and a_{j} ≠ b_{j}. Then it is possible to choose weights w_{k,l}, k ∈ {1, …, n} and c_{l} such that the following equations are satisfied with arbitrary accuracy: P(v_{l} = 1|h) = h_{l} ∀h ∉ {a, b}, while P(v_{l} = 1|h = a) = p_{a} and P(v_{l} = 1|h = b) = p_{b} with arbitrary p_{a}, p_{b}.*

By this theorem, a sharing step can be accomplished in only one layer, whereby probability mass is transferred from a chosen vector to another vector differing in one entry. Furthermore, it demands adapting only the connection weights and bias weight of one single unit. Thereby, the overlay of a number of sharing steps in each layer is possible.

The main idea in Le Roux and Bengio (2010) was to exploit these circumstances using a clever sequence of probability transfers. The requirements for realizing sharing sequences using their theorem 2 can be summarized as properties of sequences of vectors. These properties are described in theorem 3 of Le Roux and Bengio (2010) and in items 2 and 3 of our lemma 1.

How theorem 2 in Le Roux and Bengio (2010) and our lemma 1 support the construction of a universal DBN approximator will become clearer following lemma 2.

*Let n = b + 2^{b−1}. There exist a ≔ 2^{b} = 2(n − b) sequences of binary vectors S_{i}, 0 ⩽ i ⩽ a − 1, each composed of 2^{n−b} vectors S_{i,1}, …, S_{i,2^{n−b}}, satisfying the following:*

- 1. *{S_{0}, …, S_{a−1}} is a partition of {0, 1}^{n}.*
- 2. *∀i ∈ {0, …, a − 1} and ∀k ∈ {1, …, 2^{n−b} − 1}: H(S_{i,k}, S_{i,k+1}) = 1, where H(·, ·) denotes the Hamming distance.*
- 3. *∀i, j ∈ {0, …, a − 1} such that i ≠ j and ∀k ∈ {1, …, 2^{n−b} − 1}: the bit switched between S_{i,k} and S_{i,k+1} and the bit switched between S_{j,k} and S_{j,k+1} are different, unless H(S_{i,k}, S_{j,k}) = 1.*
- 4. *The choice {S_{0,1}, …, S_{a−1,1}} = {(x, 0, …, 0) : x ∈ {0, 1}^{b}} (the vertices of a b-dimensional face of the n-cube) is possible.*

A Gray code for *m* bits is a matrix of size 2^{m} × *m*, where every element from {0, 1}^{m} appears exactly once as a row of the matrix and any two consecutive rows have Hamming distance one to each other. The collection of rows of a Gray code can be understood as a path visiting every vertex of the *m*-cube exactly once. Such paths always exist for any *m*, since the graph of any *m*-cube is Hamiltonian. Let *G*^{0}_{n−b} be any Gray code for (*n* − *b*) bits. Obviously, any permutation of the columns of a Gray code is again a Gray code. Let *G*^{i}_{n−b} be the cyclic permutation of the columns of *G*^{0}_{n−b} by *i* positions to the left. Now define

*S*_{i,k} ≔ (bin_{b}(*i*), *G*^{i}_{n−b,k}), 0 ⩽ *i* ⩽ *a* − 1, 1 ⩽ *k* ⩽ 2^{n−b},

that is, the first *b* entries of the row vector *S*_{i,k} contain the *b*-bit binary representation bin_{b}(*i*) of *i*. The remaining (*n* − *b*) entries of *S*_{i,k} contain the *k*th row *G*^{i}_{n−b,k} of the Gray code *G*^{i}_{n−b}, which is *G*^{0}_{n−b} with columns cyclically shifted *i* positions to the left. The cyclic shift means that two sequences of vectors *S*_{i} and *S*_{j}, *i* ≠ *j*, change the same bit in the same row only if *G*^{i}_{n−b} = *G*^{j}_{n−b} (in this case, both sequences change the same bit in every row), that is, only if *i* ≡ *j* mod (*n* − *b*), which means that *i* and *j* differ in a multiple of (*n* − *b*) = 2^{b}/2. In that case, the binary representations bin_{b}(*i*) and bin_{b}(*j*) differ in exactly the first entry, and hence H(*S*_{i,k}, *S*_{j,k}) = 1, as required in item 3. For the last item, *G*^{0}_{n−b} can be chosen such that the first row is (0, …, 0). This is always possible in view of the fact that any cyclic permutation of the rows of a Gray code again yields a Gray code. Thus, we have verified all claims.
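The construction can be checked directly for the smallest nontrivial case *b* = 2 (so *n* = 4); the reflected binary Gray code is used below as an example choice of *G*^{0}_{n−b}:

```python
import itertools

def gray_code(m):
    """Reflected binary Gray code for m bits: 2^m rows, consecutive rows
    at Hamming distance one, first row all zeros."""
    if m == 1:
        return [(0,), (1,)]
    prev = gray_code(m - 1)
    return [(0,) + r for r in prev] + [(1,) + r for r in reversed(prev)]

b = 2
n = b + 2**(b - 1)                 # n = 4; a = 2^b = 2(n - b) = 4 sequences
a, L = 2**b, 2**(n - b)            # number of sequences and their common length

G0 = gray_code(n - b)

def shift(row, i):                 # columns cyclically shifted i positions left
    i %= len(row)
    return row[i:] + row[:i]

S = {(i, k): tuple(int(x) for x in format(i, f"0{b}b")) + shift(G0[k - 1], i)
     for i in range(a) for k in range(1, L + 1)}

ham = lambda u, v: sum(x != y for x, y in zip(u, v))
flip = lambda u, v: next(t for t in range(n) if u[t] != v[t])

# item 1: the sequences partition {0,1}^n
assert sorted(S.values()) == sorted(itertools.product([0, 1], repeat=n))
# item 2: consecutive vectors in a sequence differ in exactly one entry
assert all(ham(S[i, k], S[i, k + 1]) == 1 for i in range(a) for k in range(1, L))
# item 3: same bit flipped in the same row only when H(S_ik, S_jk) = 1
for i, j in itertools.combinations(range(a), 2):
    for k in range(1, L):
        if flip(S[i, k], S[i, k + 1]) == flip(S[j, k], S[j, k + 1]):
            assert ham(S[i, k], S[j, k]) == 1
# item 4: the starting vectors form the b-dimensional face {(x, 0, ..., 0)}
assert {S[i, 1] for i in range(a)} == \
       {x + (0,) * (n - b) for x in itertools.product([0, 1], repeat=b)}
```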

Every two consecutive vectors *S*_{i,k} and *S*_{i,k+1} in a sequence *S*_{i} of lemma 1 differ in only one entry, and this entry can be located in almost any position of {1, …, *n*}. In contrast, for the sequences given in theorem 3 of Le Roux and Bengio (2010), that entry can be located only in a subset of {1, …, *n*} of cardinality *n*/2.

By construction, each of the last *n* − *b* entries is flipped by exactly two sequences. The choice of the relations between the number of sequences, the number of visible units, and the number of layers is not accidental and is somewhat intricate. It must take into account all the components that will be needed in the proof of theorem 2. The attempt to produce 2*n* instead of 2(*n* − *b*) sequences with properties 1 and 2 of lemma 1 (and flips in all entries) would correspond to the following: set the sequences *S*_{i} to be consecutive portions of a single Gray code for *n* bits, that is, the sequences to be overlaid are portions of the same Gray code. In this case, it is difficult to ensure that condition 3 is satisfied—that if *S*_{i} and *S*_{j} flip the same bit in the same row, then H(*S*_{i,k}, *S*_{j,k}) = 1. The property given in item 3, however, is essential for the use of theorem 2 of Le Roux and Bengio (2010). Most common Gray codes flip some entries more often than other entries and can be discarded. Other sequences, which are referred to as totally balanced Gray codes, flip all entries equally often and exist whenever *n* is a power of 2, but a strong cyclicity condition would still be required. Because of this, we say that the sequences given in our lemma 1 allow optimal use of theorem 2 in Le Roux and Bengio (2010).

Lemma 2 is a transcription of lemma 1 in Le Roux and Bengio (2010) with replacements of indices according to our construction. The proof is an obvious transcription, which we omit here. Denote by *h*^{i} a state vector of the units in the hidden layer *i*, and denote by *h*^{0} a visible state. The joint distribution on the states of all units in the case of *l* + 1 layers is of the form *P*(*h*^{0}, …, *h*^{l}) = *P*(*h*^{l}, *h*^{l−1}) *P*(*h*^{l−2} | *h*^{l−1}) ⋯ *P*(*h*^{0} | *h*^{1}).

*Let p* be an arbitrary distribution on {0, 1}^{n}, and write m ≔ 2^{n−b}. Consider a DBN with m + 1 layers and the following properties:*

- 1. *∀i ∈ {0, …, a − 1}, the top RBM, between h^{m} and h^{m−1}, assigns probability ∑_{k} p*(S_{i,k}) to S_{i,1}.*
- 2. *∀i ∈ {0, …, a − 1} and ∀k ∈ {1, …, m − 1}: P(h^{k−1} = S_{i,m−k+1} | h^{k} = S_{i,m−k}) = ∑_{t>m−k} p*(S_{i,t}) / ∑_{t⩾m−k} p*(S_{i,t}).*
- 3. *∀i ∈ {0, …, a − 1} and ∀k ∈ {1, …, m − 1}: P(h^{k−1} = v | h^{k} = v) = 1 for every v not of the form S_{i,m−k}; in particular, P(h^{k−1} = S_{i,m−k} | h^{k} = S_{i,m−k}) = p*(S_{i,m−k}) / ∑_{t⩾m−k} p*(S_{i,t}).*

*Such a DBN has p* as its marginal visible distribution.*
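The bookkeeping behind lemma 2 can be illustrated for a single sequence in isolation: mass deposited on *S*_{i,1} by the top RBM either stops at the current frontier vector or advances one Gray code step per layer, and the stopping probabilities reproduce *p**. A minimal sketch, assuming the stick-breaking form of the transfer probabilities:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 8                                  # length of one sequence S_1, ..., S_m
p_star = rng.random(m)
p_star /= p_star.sum()                 # target masses p*(S_1), ..., p*(S_m)

def tail(k):                           # sum_{t >= k} p*(S_t)
    return p_star[k - 1:].sum()

# mass[k]: probability of having stopped at S_k; `moving` is the mass still
# travelling along the sequence, sitting at the current frontier vector
mass = np.zeros(m + 1)
moving = p_star.sum()                  # the top RBM puts sum_t p*(S_t) on S_1
for k in range(1, m):
    stay = p_star[k - 1] / tail(k)     # probability of not advancing past S_k
    mass[k] = moving * stay
    moving *= 1 - stay                 # advance from S_k to S_{k+1}
mass[m] = moving                       # mass reaching the end stays at S_m

assert np.allclose(mass[1:], p_star)   # the visible marginal reproduces p*
```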

We conclude this section with the proof of theorem 2 and some remarks:

The proof follows the strategy of the proof of theorem 4 in Le Roux and Bengio (2010). We show the existence of a DBN with the properties of the DBN described in lemma 2.

In view of corollary 1, it is possible for the top RBM to assign arbitrary probabilities to a collection of vectors *S*_{i,1}, *i* ∈ {0, …, *a* − 1}, whenever they are contained in the set of vertices of a log(2(*n* + 1))-dimensional face of the *n*-cube. This requirement is met for the vectors *S*_{i,1}, *i* ∈ {0, …, *a* − 1}, of lemma 1, since we can choose {*S*_{i,1}}_{i} = {(*x*, 0, …, 0) ∈ {0, 1}^{n} : *x* ∈ {0, 1}^{b}}, which is the vertex set of a *b*-dimensional face, and *b* < log 2*n*.

At each subsequent layer, the first *b* bits of *h*^{k+1} are copied to the first *b* bits of *h*^{k} with probability arbitrarily close to one. The *n* − *b* remaining bits are potentially changed to move from one vector in a Gray code sequence to the next, with the correct probability as defined in lemma 2. These changes are possible because theorem 2 in Le Roux and Bengio (2010) can be applied to the sequences provided in lemma 1. The crucial difference to the proof of the previous result is that, by our definition of the sequences {*S*_{i}}, at each layer *n* − *b* bit flips (with the correct probabilities) occur instead of *n*/2.

Le Roux and Bengio (2010) overlaid *n* sequences of sharing steps (their theorem 3) for constructing a universal DBN approximator. In principle, an overlay of more such sequences is possible. This is what we exploit in our proof (the sequences given in lemma 2). Apparently one reason that the overlay of more sequences was not realized in that paper was that for the initialization of these sequences, the authors used theorem 2 of Le Roux and Bengio (2008), which allows assigning arbitrary probability only to *n* vectors. Our theorem 1 overcomes this difficulty and allows initializing up to 2(*n* + 1) sequences, which we use to obtain the first property of the DBN described in lemma 2.

## 3. Conclusion

We have shown that a DBN with 2^{n}/(2(*n* − *b*)), *b* ∼ log *n*, hidden layers of size *n* is capable of approximating any distribution on {0, 1}^{n} arbitrarily well as its marginal visible distribution. (This confirms a conjecture presented in Le Roux & Bengio, 2010.) The number of layers is of order 2^{n}/2*n*. This DBN has (2^{n}/(2(*n* − *b*)))(*n*^{2} + *n*) + *n* parameters, which is of order *n*2^{n−1}.

Furthermore, we have shown that an RBM with 2^{n−1} − 1 hidden units is capable of approximating any distribution on {0, 1}^{n} arbitrarily well as its marginal visible distribution. This RBM has (*n* + 1)2^{n−1} − 1 parameters, which is of order *n*2^{n−1}.
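These parameter counts can be tabulated for the admissible widths *n* = *b* + 2^{b−1} and compared with the lower bound 2^{n} − 1, the dimension of the set of distributions on {0, 1}^{n} (hypothetical helper names):

```python
def dbn_params(n, b):
    """Parameters of a DBN with 2^n/(2(n - b)) hidden layers of width n."""
    layers = 2**n // (2 * (n - b))
    return layers * (n**2 + n) + n

def rbm_params(n):
    """Parameters of an RBM with 2^(n-1) - 1 hidden units on n visible units."""
    m = 2**(n - 1) - 1
    return n * m + n + m           # connection weights + biases

for b in [2, 3, 4]:
    n = b + 2**(b - 1)             # admissible widths n = 4, 7, 12
    lower = 2**n - 1               # dimension of the set of distributions
    assert dbn_params(n, b) >= lower
    assert rbm_params(n) >= lower
```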

Our results improve all bounds known to date on the minimal size of universal DBN and RBM approximators. We still do not know whether our results represent the minimal sufficient size for universal DBN and RBM approximators. Our construction already exploits theorem 2 in Le Roux and Bengio (2010) exhaustively, and therefore a construction using only similar ideas will not lead to improvements. However, alternative constructions might exist that exploit the representational power of RBMs better. Whether further reductions of the size of a universal DBN approximator are possible is the subject of our ongoing research (Montufar, 2010).

## Appendix: Lower Bound on the Number of Parameters

Here we formally confirm the heuristic that a DBN can approximate any visible distribution on {0, 1}^{n} arbitrarily well only when the number of parameters of that DBN is not less than 2^{n} − 1.

Consider a DBN with *l* hidden layers, where hidden layer *k* = 1, …, *l* contains *n*_{k} units. We denote by *h*^{k} the state of the units in layer *k* (i.e., a binary vector of length *n*_{k}), and we denote by *h*^{0} ≡ *v* a state vector of the *n*_{0} ≡ *n* visible units in layer 0, the visible layer. Let *N* be the total number of units of the DBN and *d* the number of parameters: the connection weights *W*^{k+1}_{j,i} between the unit *j* in layer *k* and the unit *i* in layer *k* + 1, for all *j* and *i* and *k* = 0, …, *l* − 1, as well as the biases *b*^{k}_{j} for all *j* and *k* = 0, …, *l*. Denote by 𝒫_{N} the set of all distributions on {0, 1}^{N}, and by 𝒫_{n} the set of all distributions on {0, 1}^{n}.

The DBN defines a model with *d* parameters, parameterized by the function *Q* : ℝ^{d} → 𝒫_{N}, which takes the parameters {*W*^{k}_{j,i}}, {*b*^{k}_{j}} into a distribution *P* defined as follows (Sutskever & Hinton, 2008):

*P*(*h*^{0}, …, *h*^{l}) = *P*(*h*^{l}, *h*^{l−1}) ∏_{k=0,…,l−2} *P*(*h*^{k} | *h*^{k+1}),

where *P*(*h*^{l}, *h*^{l−1}) is the Boltzmann distribution of the top RBM and *P*(*h*^{k} | *h*^{k+1}) = ∏_{j} sigm(∑_{i} *W*^{k+1}_{j,i} *h*^{k+1}_{i} + *b*^{k}_{j})^{h^{k}_{j}} (1 − sigm(∑_{i} *W*^{k+1}_{j,i} *h*^{k+1}_{i} + *b*^{k}_{j}))^{1−h^{k}_{j}}, with sigm(*x*) = 1/(1 + e^{−x}).
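For a toy DBN, the distribution *Q*({*W*}, {*b*}) can be evaluated exhaustively and checked to be normalized. The sketch below uses arbitrary example weights and two hidden layers of width 2:

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
sizes = [2, 2, 2]                      # n_0 (visible), n_1, n_2: l = 2 hidden layers
Ws = [rng.normal(size=(sizes[k], sizes[k + 1])) for k in range(2)]
bs = [rng.normal(size=sz) for sz in sizes]

states = lambda k: [np.array(s) for s in itertools.product([0, 1], repeat=k)]
sigm = lambda x: 1.0 / (1.0 + np.exp(-x))

def top(h2, h1):
    """Unnormalized top RBM term exp(h^2 W^2 h^1 + b^2 h^2 + b^1 h^1)."""
    return np.exp(h2 @ Ws[1].T @ h1 + bs[2] @ h2 + bs[1] @ h1)

Ztop = sum(top(h2, h1) for h2 in states(sizes[2]) for h1 in states(sizes[1]))

def cond(h0, h1):
    """Directed layer P(h^0 | h^1) with factorizing sigmoid units."""
    prob_on = sigm(Ws[0] @ h1 + bs[0])
    return np.prod(prob_on**h0 * (1 - prob_on)**(1 - h0))

# total mass of the joint distribution P(h^0, h^1, h^2)
P = sum(top(h2, h1) / Ztop * cond(h0, h1)
        for h2 in states(sizes[2]) for h1 in states(sizes[1])
        for h0 in states(sizes[0]))
assert np.isclose(P, 1.0)              # the joint distribution normalizes
```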

The map *Q* is continuous everywhere, including at infinity (it converges to some distribution for any sequence of parameters escaping in any direction). Hence, the closure of *Q*(ℝ^{d}), the set of all joint distributions obtainable in the limits of parameters (also for parameters taking infinite values, including not strictly positive distributions), is contained in a compact set contained in a bounded manifold of dimension at most *d*.

Consider now the marginalization map M : 𝒫_{N} → 𝒫_{n}, *P* ↦ M(*P*) ≔ ∑_{h^{1},…,h^{l}} *P*(·, *h*^{1}, …, *h*^{l}). Since this is a linear map, its differential (the Jacobian of the natural extension of M to ℝ^{2^{N}}, restricted to the tangential space of 𝒫_{N}) is given by the same matrix: *d*_{p}M = M. The rank of the composite map M ∘ *Q* is not more than min{*d*, 2^{n} − 1}. The elements for which the differential *d*_{p}M is not a surjective map are called critical points, and for these *p*, the value M(*p*) is called a critical value. If *d* < 2^{n} − 1, then clearly all points in the image are critical values.

Sard's theorem (Sard, 1942) says that the set of critical values is a null set. This means that if *d* < 2^{n} − 1, then M(*Q*(ℝ^{d})) is a null set of 𝒫_{n}. The image of the closure of *Q*(ℝ^{d}) under M is also a null set, since *Q* can be extended to a domain that is a manifold containing that closure, the image of which is a null set of 𝒫_{n}.

Observe that a set approximates any element of 𝒫_{n} arbitrarily well exactly when it is dense in 𝒫_{n}, that is, when its closure equals 𝒫_{n}.

Since the map M is continuous and the closure of *Q*(ℝ^{d}) is compact, we have that the image of this closure under M is a compact subset of 𝒫_{n}. This means that the closure of M(*Q*(ℝ^{d})) is contained in that image. By the arguments above, this is a null set whenever *d* < 2^{n} − 1, in which case it obviously differs from 𝒫_{n}.

Hence, for a DBN to approximate any visible distribution on {0, 1}^{n} arbitrarily well, the number of parameters has to be at least equal to 2^{n} − 1, the dimension of the set of all distributions on {0, 1}^{n}.