Abstract

We improve recently published results about resources of restricted Boltzmann machines (RBM) and deep belief networks (DBN) required to make them universal approximators. We show that any distribution p on the set {0, 1}^n of binary vectors of length n can be arbitrarily well approximated by an RBM with k − 1 hidden units, where k is the minimal number of pairs of binary vectors differing in only one entry such that their union contains the support set of p. In important cases this number is half the cardinality of the support set of p (the number of hidden units given in Le Roux & Bengio, 2008). We construct a DBN with 2^n/(2(n − b)), b ∼ log n, hidden layers of width n that is capable of approximating any distribution on {0, 1}^n arbitrarily well. This confirms a conjecture presented in Le Roux and Bengio (2010).

1.  Introduction

This work rests on ideas presented in Le Roux and Bengio (2008, 2010). We positively resolve a conjecture that was posed in Le Roux and Bengio (2010). Before going into the details about this conjecture, we first recall some basic ideas.

The definition of restricted Boltzmann machines (RBMs) and deep belief networks (DBNs) that we use is the common one. (For details, see Le Roux & Bengio, 2008, 2010.) Here we give a short description. A Boltzmann machine consists of a collection of binary stochastic units, where any pair of units may interact. The unit set is divided into visible and hidden units. Correspondingly, the state is characterized by a pair (v, h) where v denotes the state of the visible units and h denotes the state of the hidden units. One is usually interested in distributions on the visible states v and would like to generate these as marginals of distributions on the states (v, h). In a general Boltzmann machine, the interaction graph is allowed to be complete. An RBM is a special type of Boltzmann machine, where the graph describing the interactions is bipartite: only connections between visible and hidden units appear. Two visible units or two hidden units may not interact with each other (see Figure 1). The distribution over the states of all RBM units has the form of the Boltzmann distribution p(v, h) ∝ exp(hWv + Bv + Ch), where v is a binary vector of length equal to the number of visible units and h is a binary vector with length equal to the number of hidden units. The parameters of the RBM are given by the matrix W and the two vectors B and C. A DBN consists of a chain of layers of units. Only units from neighboring layers are allowed to be connected; there are no connections within each layer. The last two layers have undirected connections between them, while the other layers have connections directed toward the first layer, which is visible. The general idea of a DBN is that the interaction structure is deep rather than shallow (i.e., each hidden layer is not very large compared to the visible layer, as shown in Figure 1).
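As a minimal illustration of this parameterization (a sketch of our own, assuming NumPy is available; the sizes and random parameters are arbitrary), the following script enumerates all states of a small RBM, computes the Boltzmann distribution p(v, h) ∝ exp(hWv + Bv + Ch) by brute force, and sums out the hidden states to obtain the visible marginal.

```python
import itertools
import numpy as np

def rbm_visible_marginal(W, B, C):
    """Enumerate p(v, h) ∝ exp(h·W·v + B·v + C·h) for a small RBM and return
    the induced marginal distribution over the visible states v."""
    m, n = W.shape                      # m hidden units, n visible units
    visibles = list(itertools.product([0, 1], repeat=n))
    hiddens = list(itertools.product([0, 1], repeat=m))
    z = np.array([[np.exp(np.array(h) @ W @ np.array(v)
                          + B @ np.array(v) + C @ np.array(h))
                   for h in hiddens] for v in visibles])
    joint = z / z.sum()                 # p(v, h), normalized over all states
    return visibles, joint.sum(axis=1)  # p(v) = sum_h p(v, h)

rng = np.random.default_rng(0)
n, m = 3, 2                             # 3 visible and 2 hidden units
W, B, C = rng.normal(size=(m, n)), rng.normal(size=n), rng.normal(size=m)
for v, pv in zip(*rbm_visible_marginal(W, B, C)):
    print(v, round(float(pv), 4))
```

This brute-force enumeration is only feasible for very small networks, but it makes explicit which distributions the parameters W, B, and C select.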

Figure 1:

(Left) A graph of interactions in an RBM. (Right) The corresponding graph for a DBN with n = 4 visible units (the lighter gray nodes correspond to the visible units). An arbitrary weight can be assigned to every edge. Besides the connection weights, every node carries an individual bias weight. Every node takes value 0 or 1 with a probability that depends on the weights. An RBM and a DBN with the architectures depicted in this figure can approximate any distribution on {0, 1}^4 arbitrarily well through an appropriate adjustment of parameters (Le Roux & Bengio, 2008, 2010, respectively). In this letter, we show that the number of hidden units in the RBM can be halved and the number of hidden layers in the DBN can be roughly halved.

A major difficulty in the use of Boltzmann machines has always been the slowness of learning. In order to overcome this problem, DBNs have been proposed as an alternative to classical Boltzmann machines. An efficient learning algorithm for DBNs was given by Hinton, Osindero, and Teh (2006).

The fundamental questions are the following: Does a DBN exist that is capable of approximating any distribution on the visible states through an appropriate choice of parameters? We will refer to such a DBN as a universal DBN approximator (similarly, universal RBM approximator). If universal DBN approximators exist, what is their minimal size?

Since DBNs are more difficult to study than RBMs, as a preliminary step, corresponding questions related to the representational power of RBMs have been addressed. Theorem 2 in Le Roux and Bengio (2008) shows that any distribution on {0, 1}^n with support of cardinality s is arbitrarily well approximated (with respect to the Kullback-Leibler divergence) by the marginal distribution of an RBM containing s + 1 hidden units:

Theorem 2 in Le Roux and Bengio (2008).

Any distribution on {0, 1}^n can be approximated arbitrarily well with an RBM with s + 1 hidden units, where s is the number of input vectors whose probability does not vanish.

This theorem proves the existence of a universal RBM approximator. The existence proof of a universal DBN approximator is due to Sutskever and Hinton (2008). More precisely, Sutskever and Hinton explicitly constructed a DBN with ∼3 · 2^n hidden layers of width n + 1 that approximates any distribution on {0, 1}^n. Given that the existence problem of universal DBN approximators was positively resolved through this result, efforts have since been directed toward optimizing the size (i.e., reducing the number of parameters). This can be done by reducing the number of hidden layers involved in a DBN or by making the hidden layers smaller.

We want to deduce a theoretical lower bound on the number of layers of a universal DBN approximator. Observe that for a DBN to approximate any visible distribution on {0, 1}^n arbitrarily well, the number of parameters has to be at least equal to 2^n − 1, the dimension of the set of all distributions on {0, 1}^n (for completeness, we give a formal proof of this statement in the appendix). We use simple counting arguments on that observation: the number of free parameters in a DBN with layers of constant size is the square of the layer width times the number of hidden layers, plus the total number of units, which for k hidden layers of width n is k(n^2 + n) + n. The number of parameters needed to describe all distributions on {0, 1}^n is 2^n − 1. Therefore, a lower bound on the number of hidden layers of a universal DBN approximator is given by ⌈(2^n − 1 − n)/(n^2 + n)⌉, the smallest k for which the DBN has at least 2^n − 1 free parameters. Otherwise, the number of parameters would not be sufficient. Asymptotically, this bound is of order 2^n/n^2.

This result implies a positive answer to the following question, which Sutskever and Hinton (2008) raised: Given that a network with 2^n/n^2 layers has about 2^n parameters, can it be shown that a deep and narrow (with width n + c) network of ≪ 2^n/n^2 layers cannot approximate every distribution?
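As a sanity check of this counting argument, the bound can be evaluated for a few values of n (a small illustrative script; the count k(n^2 + n) + n is the one used above):

```python
def min_hidden_layers(n):
    """Smallest k with k*(n**2 + n) + n >= 2**n - 1 (parameter-counting bound)."""
    need = 2 ** n - 1 - n
    return -(-need // (n ** 2 + n))      # integer ceiling division

for n in [4, 8, 12, 16, 20]:
    print(f"n={n:2d}   lower bound on hidden layers = {min_hidden_layers(n):5d}"
          f"   2^n/n^2 = {2 ** n / n ** 2:8.1f}")
```

For n = 16, for example, at least 241 hidden layers of width 16 are required, in line with the asymptotic order 2^16/16^2 = 256.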

Since the architecture of DBNs imposes important restrictions on the way the parameters are used, the bound we derived above is not necessarily achievable. In particular, the approximation of a distribution through a DBN or RBM is not unambiguous; several choices of the parameters produce the same marginal distribution. However, Le Roux and Bengio (2010) showed that a number of hidden layers of order 2^n/n is sufficient:

Theorem 4 in Le Roux and Bengio (2010).

If n = 2^t, a DBN composed of 2^n/n + 1 layers of size n is a universal approximator of distributions on {0, 1}^n.

The optimality of the bound given in this theorem was left as an open problem in that paper. However, the authors' proof method suggests that fewer hidden layers suffice, which they conjectured. The proof of theorem 4 in Le Roux and Bengio (2010) crucially depends on the authors' previous theorem 2 (Le Roux & Bengio, 2008). Our contribution is to sharpen that ingredient (see theorem 1, section 2.1), which allows us to exploit their method better (see lemma 1, section 2.2) and thereby confirm their conjecture (see theorem 2, section 2.2).

2.  Results

2.1.  Restricted Boltzmann Machines.

The following theorem 1 sharpens theorem 2 in Le Roux and Bengio (2008).

Theorem 1 (reduced RBMs that are universal approximators).

Any distribution p on binary vectors of length n can be approximated arbitrarily well by an RBM with k − 1 hidden units, where k is the minimal number of pairs of binary vectors, such that the two vectors in each pair differ in only one entry and such that the support set of p is contained in the union of these pairs.

The set {0, 1}^n corresponds to the vertex set of the n-dimensional cube. The edges of the n-dimensional cube correspond to pairs of binary vectors of length n, which differ in exactly one entry. For the graph of the n-dimensional cube, there exist perfect matchings, that is, collections of disjoint edges that cover all vertices.

Hence, any subset of {0, 1}^n (the support set of an arbitrary distribution) can be covered by 2^{n−1} pairs of vectors, which differ in only one entry. The minimal number sufficient to cover the support of any p can be as small as |supp(p)|/2. This is, for example, the case when the support of p is of the form {v ∈ {0, 1}^n : v_i = x_i for all i ≠ i_1, …, i_b} for any 1 ⩽ b ⩽ n and fixed x_i ∈ {0, 1} for all i ≠ i_1, …, i_b. The union of the following 2^{b−1} pairs covers that set:

{ {u, u + e_{i_b}} : u ∈ {0, 1}^n, u_{i_b} = 0, u_i = x_i for all i ≠ i_1, …, i_b },

where e_{i_b} denotes the vector with a 1 in entry i_b and 0 elsewhere.
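This covering is easy to write out explicitly. The following sketch (our own illustration; the helper names are ours) constructs the 2^{b−1} pairs for the face of the cube in which only the first b coordinates are free and checks that their union is exactly that face:

```python
import itertools

def face(n, b):
    """Vertices (x_1, ..., x_b, 0, ..., 0) of a b-dimensional face of the n-cube."""
    return {x + (0,) * (n - b) for x in itertools.product([0, 1], repeat=b)}

def pair_cover(n, b):
    """2^(b-1) pairs of vectors differing only in entry b whose union is the face."""
    pairs = []
    for x in itertools.product([0, 1], repeat=b - 1):
        tail = (0,) * (n - b)
        pairs.append((x + (0,) + tail, x + (1,) + tail))
    return pairs

n, b = 5, 3
support = face(n, b)
pairs = pair_cover(n, b)
assert {v for pair in pairs for v in pair} == support
assert len(pairs) == 2 ** (b - 1) == len(support) // 2
print(len(support), "support vectors covered by", len(pairs), "pairs")
```

For n = 5 and b = 3, the face has 8 vertices and is covered by 4 pairs, so theorem 1 yields an RBM with 3 hidden units for any distribution supported on it, while the count of Le Roux and Bengio (2008) is 8 + 1 = 9.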

We therefore have the following corollary 1 (which will be used in the proof of our main result, theorem 2):

Corollary 1.

  • Any distribution on {0, 1}^n can be approximated arbitrarily well by an RBM with 2^{n−1} − 1 hidden units.

  • An RBM with n hidden units can approximate any visible distribution p on n visible units arbitrarily well, given that the support of p is contained in the set of vertices of some log(2(n + 1))-dimensional face of the n-dimensional unit cube, for example, supp(p) = {(x_1, …, x_b, 0, …, 0) ∈ {0, 1}^n : x_i ∈ {0, 1}, 1 ⩽ i ⩽ b} for any b ⩽ log(2(n + 1)).

The proof of theorem 1 given below is in the spirit of the proof of theorem 2 in Le Roux and Bengio (2008). The idea there consists of showing that given an RBM with some marginal visible distribution, the inclusion of an additional hidden unit allows us to increment the probability mass of one visible state vector while uniformly reducing the probability mass of all other visible vectors.

We show that the inclusion of an additional hidden unit in fact allows us to increase the probability mass of a pair of visible vectors, in independent ratios, given that the two vectors of the pair differ in one entry. At the same time, the probability of all other visible states is reduced uniformly. We also use the bias weights of the visible units to further improve the result.

Proof of Theorem 1.
Let p be the distribution on the states of visible and hidden units of an RBM. Its marginal probability distribution on v can be written as
p(v) = ( ∑_h z(v, h) ) / ( ∑_{v′, h′} z(v′, h′) ),        (2.1)
where z(v, h) = exp(hWv + Bv + Ch) and W, B, and C are the parameters corresponding to p. Denote by p_{w,c} the distribution arising through adding a hidden unit to the RBM, connected with weights w = (w_1, …, w_n) to the visible units and with bias weight c. Its marginal distribution is then
p_{w,c}(v) = ( ∑_h z(v, h)(1 + exp(w · v + c)) ) / ( ∑_{v′, h′} z(v′, h′)(1 + exp(w · v′ + c)) ).        (2.2)

2. Given any vector v ∈ {0, 1}^n, we write v^{j,0} for the vector defined through (v^{j,0})_i = v_i for all i ≠ j and (v^{j,0})_j = 0. Similarly, we write v^{j,1} for the vector with (v^{j,1})_i = v_i for all i ≠ j and (v^{j,1})_j = 1. We also write 1̄ ≔ (1, …, 1) and e_j ≔ (0, …, 0, 1, 0, …, 0), where the 1 is in the jth entry.

3. Consider an arbitrary j ∈ {1, …, n} and an arbitrary visible vector u. Consider also s ≔ 1̄ · u^{j,0}, that is, the number of ones in the vector u^{j,0}. Define
formula
For the weights w and c defined in this way, we have:
formula
2.3
formula
2.4
and in the limit a → ∞, we get:
formula
2.5
Now we look at the denominator on the right-hand side of equation 2.2. For the parameters w and c defined above, we have:
formula
2.6
Inserting the terms of equations 2.5 and 2.6 into equation 2.2 and multiplying the numerator and denominator by 1/∑_{v′, h′} z(v′, h′) yields (see equation 2.1):

lim_{a→∞} p_{w,c}(v) = p(v) (1 + exp(λ_1) [v = u^{j,0}] + exp(λ_2) [v = u^{j,1}]) / (1 + exp(λ_1) p(u^{j,0}) + exp(λ_2) p(u^{j,1})),        (2.7)

where [·] equals 1 if the enclosed condition holds and 0 otherwise.
This means that the probability of u^{j,0} and u^{j,1} can be increased independently by a multiplicative factor, while all other probabilities are reduced uniformly.

4. Now we explain how to start an induction from which the claim follows. Consider an RBM with no hidden units, RBM_0. Through a choice of the bias weights in every visible unit, RBM_0 produces as its visible distribution any factorizable distribution p_0(v) ∝ exp(B · v) ∝ exp(B · v + K), where B is the vector of bias weights and K is a constant that we introduce for illustrative reasons and that is not a parameter of RBM_0, since it cancels out with the normalization of p_0. In particular, RBM_0 can approximate arbitrarily well any distribution with support given by a pair of vectors that differ in only one entry. To see this, consider any pair of vectors u^{j,0} and u^{j,1} that differ in the entry j. Then the choice B ≔ a(2u^{j,0} − 1̄ + e_j) + (λ_2 − λ_1)e_j yields, in the limit a → ∞ (similarly to equation 2.5), that lim_{a→∞} p_0(v) = 0 whenever v ≠ u^{j,1} and v ≠ u^{j,0}, while lim_{a→∞} p_0(u^{j,1})/p_0(u^{j,0}) = exp(λ_2 − λ_1) can be chosen arbitrarily by modifying λ_1 and λ_2. Hence, p_0 can be made arbitrarily similar to any distribution with support {u^{j,1}, u^{j,0}}. Notice that p_0 always remains strictly positive for a < ∞.

By the arguments described above (see equation 2.7), every additional hidden unit allows us to increase the probability of any pair of vectors that differ in one entry. Obviously, it is possible to do the same for a single vector instead of a pair. Hence, with every additional hidden unit, the support set of the distributions that can be approximated arbitrarily well can be enlarged by an arbitrary pair of vectors that differ in one entry. That is, an RBM with i − 1 hidden units is an approximator of distributions with support contained in any union of i pairs of vectors that differ in exactly one entry.
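The limiting behavior used in this argument can be checked numerically. The following sketch is our own illustration and uses one concrete choice of w and c that realizes the claimed limits (the exact parameterization in the proof above may differ): starting from an RBM with no hidden units, it adds a hidden unit and verifies that the probabilities of u^{j,0} and u^{j,1} grow by factors close to 1 + exp(λ_1) and 1 + exp(λ_2) relative to the common factor by which all other probabilities shrink, consistent with the discussion around equation 2.7.

```python
import itertools
import numpy as np

def visible_marginal(B, hidden=None):
    """Visible distribution of an RBM with visible biases B and, optionally, one
    hidden unit with weights w and bias c:  p(v) ∝ exp(B·v)·(1 + exp(w·v + c))."""
    n = len(B)
    vs = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    weight = np.exp(vs @ B)
    if hidden is not None:
        w, c = hidden
        weight = weight * (1.0 + np.exp(vs @ w + c))
    return vs, weight / weight.sum()

n, j = 4, 2                                  # the pair differs in entry j (0-based)
u0 = np.array([1.0, 0.0, 0.0, 1.0])          # u^{j,0}  (entry j is 0)
u1 = u0.copy(); u1[j] = 1.0                  # u^{j,1}  (entry j is 1)
lam1, lam2, a = 1.0, -0.5, 50.0
s = u0.sum()                                 # number of ones in u^{j,0}

w = a * (2.0 * u0 - 1.0)                     # strongly favor the entries of u^{j,0}
w[j] = lam2 - lam1                           # entry j controls the ratio of the pair
c = lam1 - a * s                             # bias of the added hidden unit

B = np.random.default_rng(1).normal(size=n)  # arbitrary RBM with no hidden units
vs, p = visible_marginal(B)
_, p_wc = visible_marginal(B, hidden=(w, c))

for v, r in zip(vs.astype(int), p_wc / p):
    mark = " <- u^{j,0}" if (v == u0).all() else " <- u^{j,1}" if (v == u1).all() else ""
    print(v, round(float(r), 4), mark)
# The two marked ratios exceed the common ratio of all other vectors by factors
# close to 1 + exp(lam1) and 1 + exp(lam2).
```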

We close this section with some remarks.

The possibility of changing the probability mass of two visible vectors independently is due to the availability of the following two parameters: (1) the bias weight of the added hidden unit and (2) the weight of the connection between the added hidden unit and the visible unit in which the pair of visible vectors differs (see item 3 in the proof).

The attempt to use a similar idea to increase the probability mass of three different vectors in independent ratios leads to a coupled change in the probability of a fourth vector. Three vectors differ in at least two entries, as do four vectors. Since only three parameters are available (the bias of the new hidden unit and two connection weights), the dependence arises.

It is worth noting that using exclusively a similar idea will not allow an extension of theorem 2 in Le Roux and Bengio (2010) to permit the flip of a certain bit with a certain probability (only) given one of three input vectors.

2.2.  Deep Belief Networks.

In this section we apply our theorem 1, make a sensible modification (our lemma 1) of the construction used in the proof of theorem 4 in Le Roux and Bengio (2010), and prove our main result, theorem 2:

Theorem 2 (reduced DBNs that are universal approximators).

Let n = b + 2^{b−1}, b ∈ N, b ⩾ 1. A DBN containing 2^n/(2(n − b)) hidden layers of width n is a universal approximator of distributions on {0, 1}^n.

Before proving theorem 2, we develop some components of the proof.

An important idea of Sutskever and Hinton (2008) is that of sharing, by means of which, in a part of a DBN, the probability of one vector is increased while the probability of another vector is decreased and the probability of all other vectors remains nearly constant. This idea is refined in theorem 2 of Le Roux and Bengio (2010). The following is a slightly different formulation of that result:

Theorem 2 in Le Roux and Bengio (2010).

Consider two layers of units indexed by i ∈ {1, …, n} and k ∈ {1, …, n}, and denote by v and h state vectors in each layer. Denote by {w_{ik}}_{i,k=1,…,n} the connection weights and by {c_k}_{k=1,…,n} the bias weights in the second layer. Given any l and j, l ≠ j, let a be an arbitrary vector in {0, 1}^n and b another vector with b_i = a_i for all i ≠ j and a_j ≠ b_j. Then it is possible to choose weights w_{k,l}, k ∈ {1, …, n}, and c_l such that the following equations are satisfied with arbitrary accuracy: P(v_l = h_l | h) = 1 for every h ∉ {a, b}, while P(v_l = 1 | h = a) = p_a and P(v_l = 1 | h = b) = p_b with arbitrary p_a, p_b.

By this theorem, a sharing step can be accomplished within a single layer, whereby probability mass is transferred from a chosen vector to another vector differing in one entry. Furthermore, it demands adapting only the connection weights and bias weight of one single unit. Thereby, the overlay of a number of sharing steps in each layer is possible.

The main idea in Le Roux and Bengio (2010) was to exploit these circumstances using a clever sequence of probability transfers. The requirements for realizing such sharing sequences by means of their theorem 2 can be summarized as properties of sequences of vectors. These properties are described in theorem 3 of Le Roux and Bengio (2010) and in items 2 and 3 of our lemma 1.

How theorem 2 in Le Roux and Bengio (2010) and our lemma 1 support the construction of a universal DBN approximator will become clearer after lemma 2.

Lemma 1.

Let n = b + 2^{b−1}. There exist a ≔ 2^b = 2(n − b) sequences of binary vectors S_i, 0 ⩽ i ⩽ a − 1, each composed of 2^{n−b} vectors, satisfying the following:

  • 1.

    {S_0, …, S_{a−1}} is a partition of {0, 1}^n.

  • 2.

    ∀ i ∈ {0, …, a − 1} and ∀ k ∈ {1, …, 2^{n−b} − 1}, we have H(S_{i,k}, S_{i,k+1}) = 1, where H(·, ·) denotes the Hamming distance.

  • 3.

    ∀ i, j ∈ {0, …, a − 1} such that i ≠ j and ∀ k ∈ {1, …, 2^{n−b} − 1}: the bit switched between S_{i,k} and S_{i,k+1} and the bit switched between S_{j,k} and S_{j,k+1} are different, unless H(S_{i,k}, S_{j,k}) = 1.

  • 4.

    The choice {S_{0,1}, …, S_{a−1,1}} = {(x, 0, …, 0) : x ∈ {0, 1}^b} (the vertices of a b-dimensional face of the n-cube) is possible.

Proof of Lemma 1.
A Gray code for m bits is a matrix of size 2^m × m, where every element from {0, 1}^m appears exactly once as a row of the matrix and any two consecutive rows have Hamming distance one to each other. The collection of rows of a Gray code can be understood as a path visiting every vertex of the m-cube exactly once. Such paths always exist for any m, since the graph of any m-cube is Hamiltonian. Let G^0_{n−b} be any Gray code for (n − b) bits. Obviously, any permutation of the columns of a Gray code is again a Gray code. Let G^i_{n−b} be the cyclic permutation of the columns of G^0_{n−b} by i positions to the left. Now define
S_{i,k} ≔ (i_1, …, i_b, (G^i_{n−b})_{k,1}, …, (G^i_{n−b})_{k,n−b}),      0 ⩽ i ⩽ a − 1,  1 ⩽ k ⩽ 2^{n−b};
that is, the first b entries of the row vector S_{i,k} contain the b-bit binary representation of i. The remaining (n − b) entries of S_{i,k} contain the kth row of the Gray code G^i_{n−b}, which is G^0_{n−b} with columns cyclically shifted i positions to the left. The cyclic shift means that two sequences of vectors S_i and S_j, i ≠ j, change the same bit in the same row only if their column shifts coincide (in this case, both sequences change the same bit in every row), that is, only if i ≡ j mod (n − b), which means that i and j differ in a multiple of (n − b) = 2^b/2. Since 0 ⩽ i, j ⩽ 2^b − 1, this difference must equal 2^{b−1}, so the binary representations of i and j differ exactly in the leading bit. This implies that S_{i,k} and S_{j,k} differ in exactly the first entry. For the last item, G^0_{n−b} can be chosen such that the first row is (0, …, 0). This is always possible in view of the fact that any cyclic permutation of the rows of a Gray code again yields a Gray code. Thus, we have verified all claims.
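The construction is straightforward to implement and test. The following sketch (our own code; it uses the standard reflected binary Gray code as one admissible choice of G^0_{n−b}, whose first row is already the zero vector) builds the sequences S_i for b = 3, n = 7 and checks properties 1 to 4 of lemma 1:

```python
import itertools

def gray_code(m):
    """Rows of the reflected binary Gray code for m bits; the first row is (0, ..., 0)."""
    return [tuple((k ^ (k >> 1)) >> (m - 1 - t) & 1 for t in range(m))
            for k in range(2 ** m)]

def lemma1_sequences(b):
    """The sequences S_0, ..., S_{a-1} of lemma 1 for n = b + 2^(b-1)."""
    n = b + 2 ** (b - 1)
    m = n - b                                  # width of the Gray code part
    g0 = gray_code(m)
    seqs = []
    for i in range(2 ** b):                    # a = 2^b = 2(n - b) sequences
        head = tuple(i >> (b - 1 - t) & 1 for t in range(b))  # binary representation of i
        shift = i % m                          # cyclic column shift of G^0_{n-b}
        seqs.append([head + row[shift:] + row[:shift] for row in g0])
    return n, seqs

def hamming(x, y):
    return sum(s != t for s, t in zip(x, y))

def flipped_bit(seq, k):
    return next(t for t in range(len(seq[k])) if seq[k][t] != seq[k + 1][t])

b = 3
n, seqs = lemma1_sequences(b)                  # n = 7, eight sequences of length 16
# 1. the sequences partition {0, 1}^n
assert {v for s in seqs for v in s} == set(itertools.product([0, 1], repeat=n))
assert sum(len(s) for s in seqs) == 2 ** n
# 2. consecutive vectors of a sequence differ in exactly one entry
assert all(hamming(s[k], s[k + 1]) == 1 for s in seqs for k in range(len(s) - 1))
# 3. two sequences flip the same bit in the same row only if the rows are neighbors
for i, j in itertools.combinations(range(len(seqs)), 2):
    for k in range(2 ** (n - b) - 1):
        if flipped_bit(seqs[i], k) == flipped_bit(seqs[j], k):
            assert hamming(seqs[i][k], seqs[j][k]) == 1
# 4. the first elements form a b-dimensional face of the n-cube
assert {s[0] for s in seqs} == {x + (0,) * (n - b) for x in itertools.product([0, 1], repeat=b)}
print("lemma 1 properties verified for n =", n)
```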

Every two consecutive vectors S_{i,k} and S_{i,k+1} in a sequence S_i of lemma 1 differ in only one entry, and this entry can be located in almost any position in {1, …, n}. In contrast, for the sequences given in theorem 3 of Le Roux and Bengio (2010), that entry can be located in only a subset of {1, …, n} of cardinality n/2.

In lemma 1, for any row, every one of n − b entries is flipped by exactly two sequences. The choice of the relations between the number of sequences, the number of visible units, and the number of layers is not accidental and is somewhat intricate. It must take into account all the components that will be needed in the proof of theorem 2. The attempt to produce 2n instead of 2(n − b) sequences with properties 1 and 2 of lemma 1 (and flips in all entries) would correspond to the following: Set
S_{i,k} ≔ G_n(i · 2^n/(2n) + k),      0 ⩽ i ⩽ 2n − 1,  1 ⩽ k ⩽ 2^n/(2n),

where G_n(r) denotes the rth row of a Gray code for n bits;
that is, the sequences to be overlaid are portions of the same Gray code. In this case, it is difficult to ensure that condition 3 is satisfied, namely that if S_i and S_j flip the same bit in the same row, then H(S_{i,k}, S_{j,k}) = 1. The property given in item 3, however, is essential for the use of theorem 2 of Le Roux and Bengio (2010). Most common Gray codes flip some entries more often than other entries and can be discarded. Other sequences, referred to as totally balanced Gray codes, flip all entries equally often and exist whenever n is a power of 2, but a strong cyclicity condition would still be required. Because of this, we say that the sequences given in our lemma 1 allow optimal use of theorem 2 in Le Roux and Bengio (2010).

Lemma 2 is a transcription of lemma 1 in Le Roux and Bengio (2010) with replacements of indices according to our construction. The proof is an obvious transcription, which we omit here. Denote by h^i a state vector of the units in hidden layer i, and denote by h^0 a visible state. The joint distribution on the states of all units in the case of m hidden layers is of the form P(h^0, …, h^m) = P(h^{m−1}, h^m) ∏_{k=0}^{m−2} P(h^k | h^{k+1}).

Lemma 2.

Let p* be an arbitrary distribution on {0, 1}^n. Consider a DBN with m ≔ 2^n/(2(n − b)) hidden layers of width n and the following properties:

  • 1.

    ∀ i ∈ {0, …, a − 1}, the top RBM, between h^m and h^{m−1}, assigns probability ∑_k p*(S_{i,k}) to S_{i,1}.

  • 2.
    ∀ i ∈ {0, …, a − 1} and ∀ k ∈ {1, …, 2^{n−b} − 1}:
    formula
  • 3.
    the DBN provides
    formula

Such a DBN has p* as its marginal visible distribution.

We conclude this section with the proof of theorem 2 and some remarks:

Proof of Theorem 2.

The proof follows the strategy of the proof of theorem 4 in Le Roux and Bengio (2010). We show the existence of a DBN with the properties of the DBN described in lemma 2.

In view of corollary 1, it is possible to achieve that the top RBM assigns arbitrary probabilities to a collection of vectors S_{i,1}, i ∈ {0, …, a − 1}, whenever they are contained in the set of vertices of a log(2(n + 1))-dimensional face of the n-dimensional cube. This requirement is met for the vectors S_{i,1}, i ∈ {0, …, a − 1}, of lemma 1, since we can choose {S_{i,1}}_i = {(x, 0, …, 0) ∈ {0, 1}^n : x ∈ {0, 1}^b}, which is a b-dimensional cube, and b < log(2n) < log(2(n + 1)).

At each subsequent layer, the first b bits of h^{k+1} are copied to the first b bits of h^k with probability arbitrarily close to one. The n − b remaining bits are potentially changed to move from one vector in a Gray code sequence to the next, with the correct probability as defined in lemma 2. These changes are possible because theorem 2 in Le Roux and Bengio (2010) can be applied to the sequences provided in lemma 2. The crucial difference to the proof of the previous result is that, by our definition of {S_i}, at each layer n − b bit flips (with correct probabilities) occur instead of n/2.

Le Roux and Bengio (2010) overlaid n sequences of sharing steps (their theorem 3) for constructing a universal DBN approximator. In principle, an overlay of more such sequences is possible. This is what we exploit in our proof (the sequences given in lemma 2). Apparently one reason that the overlay of more sequences was not realized in that paper was that for the initialization of these sequences, the authors used theorem 2 of Le Roux and Bengio (2008), which allows assigning arbitrary probability only to n vectors. Our theorem 1 overcomes this difficulty and allows initializing up to 2(n + 1) sequences, which we use to obtain the first property of the DBN described in lemma 2.

3.  Conclusion

We have shown that a DBN with 2^n/(2(n − b)), b ∼ log n, hidden layers of size n is capable of approximating any distribution on {0, 1}^n arbitrarily well as its marginal visible distribution. (This confirms a conjecture presented in Le Roux & Bengio, 2010.) The number of layers is of order 2^n/n. This DBN has 2^n(n^2 + n)/(2(n − b)) + n parameters, which is of order n2^n.

Furthermore, we have shown that an RBM with 2^{n−1} − 1 hidden units is capable of approximating any distribution on {0, 1}^n arbitrarily well as its marginal visible distribution. This RBM has (n + 1)2^{n−1} − 1 parameters, which is of order n2^n.
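For concreteness, these counts can be tabulated for the admissible widths n = b + 2^{b−1} (a back-of-the-envelope script of our own, using the parameter counts k(n^2 + n) + n for a DBN with k hidden layers of width n and nm + m + n for an RBM with m hidden units):

```python
for b in range(2, 6):
    n = b + 2 ** (b - 1)                 # admissible widths of theorem 2
    k = 2 ** n // (2 * (n - b))          # hidden layers of the DBN construction
    m = 2 ** (n - 1) - 1                 # hidden units of the RBM (corollary 1)
    dbn = k * (n ** 2 + n) + n
    rbm = n * m + m + n
    print(f"n={n:2d}   dim 2^n - 1 = {2 ** n - 1:9d}"
          f"   DBN parameters = {dbn:10d}   RBM parameters = {rbm:10d}")
```

Both constructions use a number of parameters within a factor of about n of the dimension 2^n − 1 of the set of all distributions.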

Our results improve all bounds known to date on the minimal size of universal DBN and RBM approximators. We still do not know whether our results represent the minimal sufficient size for universal DBN and RBM approximators. Our construction already exploits theorem 2 in Le Roux and Bengio (2010) exhaustively, and therefore a construction using only similar ideas will not lead to improvements. However, alternative constructions might exist that exploit the representational power of RBMs better. Whether further reductions of the size of a universal DBN approximator are possible is the subject of our ongoing research (Montufar, 2010).

Appendix:  Lower Bound on the Number of Parameters

Here we formally confirm the heuristic that a DBN can approximate any visible distribution on {0, 1}^n arbitrarily well only when the number of parameters of that DBN is not less than 2^n − 1.

Consider a DBN with l hidden layers, where hidden layer k = 1, …, l contains n_k units. We denote by h^k the state of the units in layer k (i.e., a binary vector of length n_k), and we denote by h^0 = v a state vector of the n_0 = n visible units in layer 0, the visible layer. Let N be the total number of units of the DBN and d the number of parameters: the connection weights W^{k+1}_{j,i} between unit j in layer k and unit i in layer k + 1, for all j and i and k = 0, …, l − 1, as well as the biases b^k_j for all j and k = 0, …, l. Denote by 𝒫_n the set of all distributions on {0, 1}^n, and by 𝒫_N the set of all distributions on {0, 1}^N.

The set of joint distributions on the states of all units of the DBN that arise through variation of the parameters is a manifold of dimension not more than d, parameterized by the function Q: ℝ^d → 𝒫_N, which takes the parameters {W^{k+1}_{j,i}}, {b^k_j} into a distribution P defined as follows (Sutskever & Hinton, 2008):

P(h^0, …, h^l) = P(h^{l−1}, h^l) ∏_{k=0}^{l−2} P(h^k | h^{k+1}),  where
P(h^{l−1}, h^l) ∝ exp( ∑_{j,i} W^l_{j,i} h^{l−1}_j h^l_i + ∑_j b^{l−1}_j h^{l−1}_j + ∑_i b^l_i h^l_i )  and
P(h^k | h^{k+1}) = ∏_j σ(∑_i W^{k+1}_{j,i} h^{k+1}_i + b^k_j)^{h^k_j} (1 − σ(∑_i W^{k+1}_{j,i} h^{k+1}_i + b^k_j))^{1 − h^k_j},  σ(x) ≔ 1/(1 + exp(−x)).        (A.1)
Q is continuous everywhere, including at infinity (it converges to some distribution for any sequence of parameters escaping in any direction). Hence, the closure of Q(ℝ^d), the set of all joint distributions (also for parameters taking infinite values, including not strictly positive distributions), is contained in a compact set contained in a bounded manifold of dimension d.
Restricting observations to the visible units corresponds to marginalizing the variables h^k, k = 1, …, l, which is applying the following continuous, linear map,

M: 𝒫_N → 𝒫_n, P ↦ MP,

for a matrix M with rows M_v, v ∈ {0, 1}^n, such that

(MP)(v) = ∑_{h^1, …, h^l} P(v, h^1, …, h^l).

Since this is a linear map, its differential (the Jacobian of the natural extension of M to ℝ^{2^N}, restricted to the tangential space of the manifold Q(ℝ^d)) is given by the same matrix: d_pM = M. The rank of this map is not more than d. The elements p for which the differential d_pM is not a surjective map are called critical points, and for these p, the value M(p) is called a critical value. If d < 2^n − 1, then clearly all points in the image are critical values.

Sard's theorem (Sard, 1942) says that the set of critical values is a null set. This means that if d < 2^n − 1, then M(Q(ℝ^d)) is a null set of 𝒫_n. The image of the closure of Q(ℝ^d) is also a null set, since M can be extended to a domain that is a manifold containing this closure, and the image of that manifold is a null set of the affine space containing 𝒫_n (i.e., a null set of 𝒫_n).

Observe that a set approximates any element of 𝒫_n arbitrarily well exactly when it is dense in 𝒫_n, that is, when its closure equals 𝒫_n.

Since the map M is continuous and the closure of Q(ℝ^d) is compact, we have that its image under M is a compact subset of 𝒫_n. This means that this image is closed and therefore equal to its own closure. By the arguments above, it is a null set whenever d < 2^n − 1, in which case it obviously differs from 𝒫_n.

Hence, for a DBN to approximate any visible distribution on {0, 1}^n arbitrarily well, the number of parameters has to be at least equal to 2^n − 1, the dimension of the set of all distributions on {0, 1}^n.

References

Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.

Le Roux, N., & Bengio, Y. (2008). Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation, 20, 1631–1649.

Le Roux, N., & Bengio, Y. (2010). Deep belief networks are compact universal approximators. Neural Computation, 22, 2192–2207.

Montufar, G. (2010). Mixture decomposition of distributions using a decomposition of the sample space. Unpublished manuscript.

Sard, A. (1942). The measure of the critical values of differentiable maps. Bulletin of the American Mathematical Society, 48, 883–890.

Sutskever, I., & Hinton, G. E. (2008). Deep narrow sigmoid belief networks are universal approximators. Neural Computation, 20, 2629–2636.