## Abstract

We generalize recent theoretical work on the minimal number of layers of narrow deep belief networks that can approximate any probability distribution on the states of their visible units arbitrarily well. We relax the setting of binary units (Sutskever & Hinton, 2008; Le Roux & Bengio, 2008, 2010; Montúfar & Ay, 2011) to units with arbitrary finite state spaces and the vanishing approximation error to an arbitrary approximation error tolerance. For example, we show that a q-ary deep belief network with layers of width for some can approximate any probability distribution on without exceeding a Kullback-Leibler divergence of . Our analysis covers discrete restricted Boltzmann machines and naive Bayes models as special cases.

## 1.  Introduction

A deep belief network (DBN) (Hinton, Osindero, & Teh, 2006) is a layered stochastic network with undirected bipartite interactions between the units in the top two layers and directed bipartite interactions between the units in all other subsequent pairs of layers, directed toward the bottom layer. The top two layers form a restricted Boltzmann machine (RBM) (Smolensky, 1986). The entire network defines a model of probability distributions on the states of the units in the bottom layer, the visible layer. When the number of units in every layer has the same order of magnitude, the network is called narrow. Depth refers to the number of layers. Deep network architectures are believed to play a key role in information processing of intelligent agents (see Bengio, 2009, for an overview of this exciting topic). DBNs were the first deep architectures to be envisaged together with an efficient unsupervised training algorithm (Hinton et al., 2006). Due to their restricted connectivity, it is possible to greedily train their layers one at a time and in this way identify remarkably good parameter initializations for solving specific tasks (see Bengio, Lamblin, Popovici, & Larochelle, 2007). The ability to train deep architectures efficiently has enabled a great number of applications in machine learning and in the booming field of deep learning.

The representational power of neural networks has been studied for several decades, whereby their universal approximation properties have received special attention. For instance, a well-known result (Hornik, Stinchcombe, & White, 1989) shows that multilayer feedforward networks with one exponentially large layer of hidden units are universal approximators of Borel measurable functions. Although universal approximation has a limited importance for practical purposes,1 it plays an important role as warrant for consistency and sufficiency of the complexity attainable by specific classes of learning systems. Besides the universal approximation question, it is natural to ask, “How well is a given network able to approximate certain classes of probability distributions?” This letter pursues an account of the ability of DBNs to approximate probability distributions.

The first universal approximation result for deep and narrow sigmoid belief networks is due to Sutskever and Hinton (2008). They showed that a narrow sigmoid belief network with 3(2^n − 1) + 1 layers can represent probability distributions arbitrarily close to any probability distribution on the set of length-n binary vectors. Their result shows that not only exponentially wide and shallow networks are universal approximators (Hornik et al., 1989); exponentially deep and narrow ones are as well. Subsequent work has studied the optimality question, “How deep is deep enough?” with improved universal approximation depth bounds by Le Roux and Bengio (2010) and Montúfar and Ay (2011), which we discuss in more detail in this letter. These articles focus on the minimal depth of narrow DBN universal approximators with binary units—that is, the number of layers that these networks must have in order to be able to represent probability distributions arbitrarily close to any probability distribution on the states of their visible units. This letter complements that analysis in two ways.

First, instead of asking for the minimal size of universal approximators, we ask for the minimal size of networks that can approximate any distribution to a given error tolerance, treating the universal approximation problem as the special case of zero error tolerance. This analysis gives a theoretical basis on which to balance model accuracy and parameter count. For comparison, universal approximation is a binary property that always requires an exponential number of parameters. As it turns out, our analysis also allows us to estimate the expected value of the model approximation errors incurred when learning classes of distributions, say, low-entropy distributions, with networks of given sizes.

Second, we consider networks with finite-valued units, called discrete or multinomial DBNs, including binary DBNs as special cases. Nonbinary units serve, obviously, to encode nonbinary features directly, which may be interesting in multichannel perception (e.g., color-temperature-distance sensory inputs). Additionally, the interactions between discrete units can carry much richer relations than those between binary units. In particular, within the nonbinary discrete setting, DBNs, RBMs, and naive Bayes models can be seen as representatives of the same class of probability models.

This letter is organized as follows. Section 2 gives formal definitions, before we state our main result, theorem 2, in section 3: a bound on the approximation errors of discrete DBNs. A universal approximation depth bound follows directly. After this, a discussion of the result is given with a sketch of the proof. The proof entails several steps of independent interest, developed in the next sections. Section 4 addresses the representational power and approximation errors of RBMs with discrete units. Section 5 studies the models of conditional distributions represented by feedforward discrete stochastic networks (DBN layers). Section 6 studies concatenations of layers of feedforward networks and elaborates on the patterns of probability sharing steps (transformations of probability distributions) that they can realize. Section 7 concludes the proof of the main theorem and gives a corollary about the expectation value of the approximation error of DBNs. The appendix presents an empirical validation scheme and tests the approximation error bounds numerically on small networks.

## 2.  Preliminaries

A few formal definitions are necessary before proceeding. Given a finite set , we denote the set of all probability distributions on . A model of probability distributions on is a subset . Given a pair of distributions , the Kullback-Leibler divergence from p to q is defined as when and otherwise. The divergence from a distribution p to a model is defined as . The divergence of any distribution on to is bounded by
We refer to as the universal or maximal approximation error of . The model is called a universal approximator of probability distributions on iff .
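As a concrete reference for these definitions, here is a minimal numerical sketch in Python (the function names `kl` and `divergence_to_model` are ours, and the model is represented by a finite sample of its distributions rather than as a continuous set):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p || q); +inf when supp(p) is not
    contained in supp(q), matching the convention in the text."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return float("inf")
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def divergence_to_model(p, model_points):
    """Divergence from p to a model, approximated as the minimum over a
    finite collection of model distributions."""
    return min(kl(p, q) for q in model_points)
```

A model is a universal approximator exactly when this divergence can be made zero (in the closure) for every target distribution.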

A discrete DBN probability model is specified by a number of layers (the depth of the network), the number of units in each layer (the width of each layer), and the state-space of each unit in each layer. Let L be the number of layers. We imagine these layers arranged as a stack with layer 1 at the bottom (this will be the visible layer) and layer L at the top (this will be the deepest layer). (See Figure 1.) For each , let be the number of units in layer l. For each , let , be the state-space of unit i in layer l. We denote the joint state-space of the units in layer l by and write for a state from . We call a unit q-valued or q-ary if its state-space has cardinality q and assume that q is a finite integer larger than one.

Figure 1:

Graphical representation of a discrete DBN probability model. Each node represents a unit with the indicated state-space. The top two layers have undirected connections; they correspond to the term pL−1,L described in equation 2.2. All other layers receive directed connections, corresponding to the terms pl, described in equation 2.3. Only the bottom layer is visible.


In order to proceed with the definition of the DBN model, we consider the mixed graphical model with undirected connections between the units in the top two layers L and L−1 and directed connections from the units in layer l+1 to the units in layer l for all . This model consists of joint probability distributions on the states of all network units, parameterized by a collection of real matrices and vectors . For each , the matrix contains the interaction weights between units in layers l and l+1. It consists of row blocks for all . For each , the row vector contains the bias weights of the units in layer l. It consists of blocks for all .

Note that the bias of a unit with state-space is a vector with entries, and the interaction of a pair of units with state-spaces and is described by a matrix of order . The number of interaction and bias parameters in the entire network adds to .

For any choice of these parameters, the corresponding probability distribution on the states of all units is
2.1
where
2.2
and
2.3
with factors given by
2.4

Here we use the following notation. Given a state vector of n units with joint state-space , x denotes the xth column of a minimal matrix of sufficient statistics for the independent distributions of these n units. To make this more concrete, we set x equal to a column vector with blocks x1, …, xn, where is the one-hot representation of xi without the first entry, for all . For example, if , then , with and .

The function
2.5
normalizes the probability distribution from equation 2.2. Likewise, the function
2.6
normalizes the probability distribution from equation 2.4 for each and .
The marginal of the distribution from equation 2.1 on the states of the units in the first layer is given by
2.7
The discrete DBN probability model with L layers of widths and state-spaces is the set of probability distributions expressible by equation 2.7 for all possible choices of the parameter . Intuitively, this set is a linear projection of a manifold parameterized by and may have self-intersections or other singularities.

The discrete DBN probability model with L=2 is a discrete RBM probability model. This model consists of the marginal distributions on of the distributions from equation 2.2 for all possible choices of , , and .

When L>2, the distributions on defined by the top two DBN layers can be seen as the inputs of the stochastic maps defined by the conditional distributions from equation 2.3. The outputs of these maps are probability distributions on that can be seen as the inputs of the stochastic maps defined by the next lower layer and so forth. The discrete DBN probability model can be seen as the set of images of a discrete RBM probability model by a family of sequences of stochastic maps.

The following simple class of probability models will be useful to study the approximation capabilities of DBN models. Let be a partition of a finite set . The partition model with partition is the set of probability distributions on that have constant value on each Ai. Geometrically, this is the simplex with vertices for all , where is the indicator function of Ai. The coarseness of is maxi|Ai|. Unlike many statistical models, partition models have a well-understood Kullback-Leibler divergence. If is a partition model of coarseness c, then . Furthermore, partition models are known to be optimally approximating exponential families in the sense that they minimize the universal approximation error among all closures of exponential families of a given dimension (see Rauh, 2013).
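The divergence bound for partition models can be checked numerically. The following sketch (function name ours) computes the best Kullback-Leibler approximation of a distribution within a partition model, which spreads each block's total mass uniformly over the block:

```python
import numpy as np

def project_to_partition(p, partition):
    """Best approximation of p within the partition model: each block's
    total probability mass is distributed uniformly over the block."""
    q = np.zeros(len(p))
    for block in partition:
        q[block] = p[block].sum() / len(block)
    return q
```

For any target p, the divergence to this projection is at most the logarithm of the coarseness (the maximal block size), in line with the formula cited from Rauh (2013).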

## 3.  Main Result

The starting point of our considerations is the following result for binary DBNs:

Theorem 1.

A deep belief network probability model with L layers of binary units of width n = 2^{k−1} + k (for some ) is a universal approximator of probability distributions on whenever .

Note that
3.1
This result is due to Montúfar and Ay (2011, theorem 2). It is based on a refinement of previous work by Le Roux and Bengio (2010), who obtained the bound when n is a power of two.

The main result of this letter is the following generalization of theorem 1. Here we make the simplifying assumption that all layers have the same width n and the same state-space. The result holds automatically for DBNs with wider hidden layers or hidden units with larger state-spaces.

Theorem 2.
Let DBN be a deep belief network probability model with , layers of width . Let the ith unit of each layer have state-space , , , for each . Let m be any integer with , and let . If for some , then the probability model DBN can approximate each element of a partition model of coarseness arbitrarily well. The Kullback-Leibler divergence from any distribution on to DBN is bounded by
In particular, this DBN probability model is a universal approximator whenever
When all units are q-ary and the layer width is n = q^{k−1} + k for some , then the DBN probability model is a universal approximator of distributions on whenever . Note that
3.2
The theorem is illustrated in Figure 2.
Figure 2:

Qualitative illustration of theorem 2. Shown is the large-scale behavior of the DBN universal approximation error upper bound as a function D of the layer width n and the logarithm of the number of layers log_q(L). Here it is assumed that the Kullback-Leibler divergence is computed in base q logarithm and that all units are q-ary. The number of parameters of these DBNs scales with L n^2 (q−1)^2.


### 3.1.  Remarks.

The number of parameters of a q-ary DBN with L layers of width n is (L−1)(n(q−1)+1)n(q−1)+n(q−1). Since the set of probability distributions on has dimension q^n − 1, the DBN model is full dimensional only if . This is a parameter-counting lower bound for the universal approximation depth. Theorem 2 gives an upper bound for the minimal universal approximation depth. The upper bound from the theorem surpasses the parameter-counting lower bound by roughly a factor n. We think that the upper bound is tight, up to sublinear factors, in consideration of the following. Probability models with hidden variables can have dimension strictly smaller than their parameter count (dimension defect). Moreover, in some cases, even full-dimensional models represent only very restricted classes of distributions, as has been observed, for example, in binary tree models with hidden variables. It is known that for any prime power q, the smallest naive Bayes model universal approximator of distributions on has q^{n−1}(n(q−1)+1) − 1 parameters (see Montúfar, 2013, theorem 13). Hence for these models, the number of parameters needed to achieve universal approximation surpasses the corresponding parameter-counting lower bound q^n/(n(q−1)+1) by a factor of order n.
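The parameter counts in this remark can be reproduced with a short script (function names are ours; `min_depth_by_param_count` computes only the necessary depth from dimension counting, not the sufficient depth given by theorem 2):

```python
def dbn_param_count(L, n, q):
    """Number of parameters of a q-ary DBN with L layers of width n,
    per the formula (L-1)(n(q-1)+1)n(q-1) + n(q-1) in the text."""
    return (L - 1) * (n * (q - 1) + 1) * n * (q - 1) + n * (q - 1)

def min_depth_by_param_count(n, q):
    """Smallest L whose parameter count reaches q**n - 1, the dimension
    of the probability simplex; a necessary (not sufficient) condition
    for universal approximation."""
    L = 2
    while dbn_param_count(L, n, q) < q**n - 1:
        L += 1
    return L
```

For example, for binary units the lower bound grows much more slowly in L than the simplex dimension grows in n, which is why universal approximation forces exponential depth at fixed width.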

Computing tight bounds for the maximum of the Kullback-Leibler divergence is a notoriously challenging problem. This is so even for simple probability models without hidden variables, for example, independence models with mixed discrete variables. The optimality of our DBN error bounds is not completely settled at this point, but we think that they give a close description of the large-scale approximation error behavior of DBNs. For the limiting case of a single layer with n independent q-ary units, it is known that the maximal divergence is equal to (n−1)log(q) (see Ay & Knauf, 2006), corresponding to the line log_q(L) = 0 in Figure 2. Furthermore, when our upper bounds vanish, they obviously are tight (corresponding to the points with value zero in Figure 2).

Discrete DBNs have many hyperparameters (the layer widths and the state-spaces of the units), which makes their analysis combinatorially intricate. Some of these intricacies are apparent from the floor and ceiling functions in our main theorem. This theorem tries to balance accuracy, generality, and clarity. In some cases, the bounds can be improved by exhausting the representational power gain per layer described in theorem 8. A more detailed and accurate account on the two-layer case (RBMs) is given in section 4. In section 7 we give results describing probability distributions contained in the DBN model (proposition 1) and addressing the expectation value of the divergence (corollary 1). The appendix contains an empirical discussion, together with the numerical evaluation of small models.

### 3.2.  Outline of the Proof.

We will prove theorem 2 by first studying the individual parts of the DBN: the RBM formed by the top two layers (see section 4); the individual units with directed inputs (see section 5); the probability sharing realized by stacks of layers (see section 6); and, finally, the sets of distributions of the units in the bottom layer (see section 7). The proof steps can be summarized as follows:

• Show that the top RBM can approximate any probability distribution with support on a set of the form arbitrarily well.

• For a unit with state-space receiving n directed inputs, show that there is a choice of parameters for which the following holds for each state of the nth input unit: if the input vector is , then the unit outputs with probability , where is an arbitrary distribution on for all .

• Show that there is a sequence of stochastic maps , each of which superposes nearly q^n probability multisharing steps, which maps the probability distributions represented by the top RBM to an arbitrary probability distribution on .

• Show that the DBN approximates certain classes of tractable probability distributions arbitrarily well and estimate their maximal approximation errors.

The superposition of probability-sharing steps is inspired by Le Roux and Bengio (2010), together with the refinements of that work devised in Montúfar and Ay (2011). By probability sharing, we refer to the process of transferring an arbitrary amount of probability from a state vector to another state vector . In contrast to the binary proofs, where each layer superposes about 2^n sharing steps, here each layer superposes about q^n multisharing steps, whereby each multisharing step transfers probability from one state to q−1 states (when the units are q-ary). With this, a more general treatment of models of conditional distributions is required. Further, additional considerations are required in order to derive tractable submodels of probability distributions that bound the DBN model approximation errors.
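A single multisharing step can be sketched as a row-stochastic matrix acting on distributions (a hypothetical illustration with names of our choosing; in the DBN this map is realized implicitly by a directed layer, not constructed explicitly):

```python
import numpy as np

def multisharing_map(num_states, source, targets, weights):
    """Row-stochastic matrix that leaves every state fixed except
    `source`, whose probability is shared among `targets` (which may
    include `source` itself) with the given nonnegative `weights`."""
    K = np.eye(num_states)
    K[source] = 0.0
    for t, w in zip(targets, weights):
        K[source, t] = w
    return K
```

Applying the map to a row vector p as `p @ K` transfers the chosen fraction of the probability of `source` to the target states and leaves all other states untouched; a DBN layer superposes many such steps at once.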

## 4.  Restricted Boltzmann Machines

We denote by the restricted Boltzmann machine probability model with hidden units taking states in and visible units taking states in . Recall the definitions made in section 2. In the literature, RBMs are defined by default with binary units; however, RBMs with discrete units have appeared in Welling, Rosen-Zvi, and Hinton (2005), and their representational power has been studied in Montúfar and Morton (2013). The results from this section are closely related to the analysis given in Montúfar and Morton (2013).

Theorem 3.

The model can approximate any mixture distribution arbitrarily well, where p0 is any product distribution and pi is any mixture of product distributions for all satisfying for all.

Here, a product distribution p is a probability distribution on that factorizes as for all , where pj is a distribution on for all . A mixture is a weighted sum with nonnegative weights adding to one. The support of a distribution p is .

Proof of Theorem 3.
Let denote the set of strictly positive product distributions of . Let denote the set of all mixtures of k product distributions from . The closure contains all mixtures of k product distributions, including those that are not strictly positive. Let denote the renormalized entry-wise product with for all . Let 1 denote the constant function on with value 1. The model can be written, up to normalization, as the set
4.1
Now consider any probability distributions , . If for all , then the product is equal to , up to normalization. Let and . Then . Hence the mixture distribution p is contained in the closure of the RBM model.
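The key identity behind this proof, that renormalized entrywise (Hadamard) products of product distributions are again product distributions, can be verified numerically (function names are ours):

```python
import numpy as np

def hadamard(p, q):
    """Renormalized entrywise product of two strictly positive
    distributions given as arrays over the same state-space."""
    r = np.asarray(p, float) * np.asarray(q, float)
    return r / r.sum()

def product_dist(factors):
    """Joint distribution of independent units from per-unit marginals,
    states ordered lexicographically."""
    joint = np.array([1.0])
    for f in factors:
        joint = np.outer(joint, f).ravel()
    return joint
```

The Hadamard product of two product distributions factorizes unit by unit, which is the mechanism by which each hidden unit of the RBM multiplies a mixture factor onto the represented distribution.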

RBMs can approximate certain partition models arbitrarily well:

Lemma 1.

Let be the partition model with partition blocks for all . If , then each distribution contained in can be approximated arbitrarily well by distributions from.

Proof.
Any point in is a mixture of the uniform distributions on the partition blocks. These mixture components have disjoint supports since the partition blocks are disjoint. They are product distributions, since they can be written as , where ui denotes the uniform distribution on . For any , any mixture of the form is also a product distribution, which factorizes as
4.2
Hence any point in is a mixture of product distributions of the form given in equation 4.2. The claim follows from theorem 3.

Lemma 1, together with the divergence formula for partition models given at the end of section 2, implies:

Theorem 4.
If for some , then
In particular, the model is a universal approximator whenever

When all units are q-ary, the RBM with (q^{n−1} − 1)/(q − 1) hidden units is a universal approximator of distributions on . Theorem 4 generalizes previous results on binary RBMs (Montúfar & Ay, 2011, theorem 1; Montúfar, Rauh, & Ay, 2011, theorem 5.1), where it is shown that a binary RBM with 2^{n−1} − 1 hidden units is a universal approximator of distributions on and that the maximal approximation error of binary RBMs decreases at least logarithmically in the number of hidden units. A previous result by Freund and Haussler (1991) shows that a binary RBM with 2^n hidden units is a universal approximator of distributions on . (See also Le Roux & Bengio, 2008, theorem 2.)
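A quick sketch of the sufficient hidden-layer size stated here (the function name is ours), with the binary case as a sanity check:

```python
def rbm_universal_hidden_units(n, q):
    """Hidden units sufficient for a q-ary RBM with n visible units to
    be a universal approximator: (q**(n-1) - 1) / (q - 1)."""
    return (q**(n - 1) - 1) // (q - 1)
```

For q = 2 this reduces to 2^{n−1} − 1, matching the binary result of Montúfar and Ay (2011).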

## 5.  The Internal Node of a Star

Consider an inward-directed star graph with leaf variables taking states in and an internal node variable taking states in . Denote by the set of conditional distributions on given the states of the leaf units, defined by this network. Each of these distributions can be written as
5.1
The distributions from equation 2.4 are of this form, with corresponding to .

A conditional distribution is naturally identified with the stochastic map defined by the matrix (p(x|y))y,x. The following lemma describes some stochastic maps that are representable by the model and that we will use to define a probability-sharing scheme in section 6.

Lemma 2.
Let , . Furthermore, let , and let be any distributions on . Then there is a choice of the parameters of for which
Proof.
Let for all , and . The set of strictly positive probability distributions on is an exponential family with d=r−1. For some , let be the parameter vector of a distribution that attains a unique maximum at v. Then for any fixed , we have
5.2

To see this, note that and hence . Furthermore, .

Without loss of generality, let . For each , let be such that for all . The matrix can be set as follows:
5.3
where contains the columns corresponding to yj in equation 5.1 and
5.4
The matrix maps to the parameter vectors with corresponding distributions . When are chosen such that , then for each , the vector is mapped to a parameter vector with arbitrarily close to .
Remark 1.

In order to prove lemma 2 for any subset , it is sufficient to show that the vectors are affinely independent and there is a linear map mapping into the zero vector and into the relative interior of the normal cone of at the vertex v=ym for all .

## 6.  Probability Sharing

### 6.1.  A Single Directed Layer.

Consider an input layer of units with bipartite connections directed toward an output layer of units . Denote by the model of conditional distributions defined by this network. Recall the definition from equation 2.3. Each conditional distribution defines a linear stochastic map from the simplex of distributions on to the simplex of distributions on . Here the parameter corresponds to the parameters of the conditional distributions from equation 2.3 for a given l.

For any and , we denote by y[j] the one-dimensional cylinder set . Similarly, for any , we denote by the cylinder set consisting of all arrays in with fixed values in the entries .

Applying lemma 2 to each output unit of shows:

Theorem 5.

Consider some . Let be a multiset and a set of indices from [m]. If the cylinder sets y(s)[js] are disjoint and is a subset of containing them, then the image of by the family of stochastic maps contains .

This result describes the image of a set of probability distributions by the collection of stochastic maps defined by a DBN layer for all choices of its parameters. In turn, it describes part of the DBN representational power contributed by a layer of units.

### 6.2.  A Stack of Directed Layers.

In the case of binary units, sequences of probability-sharing steps can be defined conveniently using Gray codes, as done in Le Roux and Bengio (2010). A Gray code is an ordered list of vectors, where each two subsequent vectors differ in only one entry. A binary Gray code can be viewed as a sequence of one-dimensional cylinder sets. In the nonbinary case, this correspondence is no longer given. Instead, motivated by theorem 5, we will use one-dimensional cylinder sets in order to define sequences of multisharing steps, as shown in Figure 3.
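For reference, the standard reflected binary Gray code, in which consecutive vectors differ in exactly one entry, can be generated recursively (this is the binary construction underlying Le Roux and Bengio's argument, not the nonbinary cylinder-set sequences introduced below):

```python
def binary_gray_code(n):
    """Reflected binary Gray code on {0,1}^n: list all 2**n vectors so
    that each vector differs from the next in exactly one coordinate."""
    if n == 0:
        return [[]]
    prev = binary_gray_code(n - 1)
    # prefix the code with 0, then its reversal with 1
    return [[0] + v for v in prev] + [[1] + v for v in reversed(prev)]
```

In the binary case, each pair of consecutive code words is a one-dimensional cylinder set, which is exactly the correspondence that fails for nonbinary units.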

Figure 3:

Three multisharing steps on .


Let be the cardinality of for , and let . The set can be written as the disjoint union of one-dimensional cylinder sets, as , where and .

In the following, each set y(s)[m+1] will be the starting point of a sequence of sharing steps. By theorem 5, a directed DBN layer maps the simplex of distributions surjectively to the simplex of distributions . The latter can be mapped by a further DBN layer onto a larger simplex and so forth. Starting with y(1)[m+1], consider the sequence
6.1
continued as shown in Table 1. We denote this sequence of cylinder sets by G1, and its lth row (a cylinder set) by G1(l). The union of the first K rows, with , is equal to .
Table 1:
Sequence of One-Dimensional Cylinder Sets.

Notes: Light gray indicates a free coordinate. Dark gray indicates a transition coordinate.

We define k sequences as follows. The first m coordinates of Gs are equal to a permutation of the first m coordinates of G1, defined by shifting each of these m columns cyclically s positions to the right. The last n − m coordinates of Gs are equal to .

We use the abbreviation . Within the first m columns, the free coordinate of the lth row of Gs is , where is the least integer with . Here the empty product is defined as 1. Let . We can modify each sequence Gs by repeating rows if necessary, such that the free coordinate of the lth row of the resulting sequence is , where is the least integer with . This does not depend on s.

The sequences for are all different from each other in the last n − m coordinates and have a different sharing free coordinate in each row. The union of cylinder sets in all rows of these sequences is equal to .

## 7.  Deep Belief Networks

Proposition 1.

Consider a DBN with layers of width n, each layer containing units with state-spaces of cardinalities . Let m be any integer with . The corresponding probability model can approximate a distribution p on arbitrarily well whenever the support of p is contained in.

Proof.

Note that . By theorem 3 the top RBM can approximate each distribution in the probability simplex on arbitrarily well. By theorem 5, this simplex can be mapped iteratively into larger simplices, according to the sequences from section 6.

Theorem 6.

Consider a DBN with L layers of width n, each layer containing units with state-spaces of cardinalities . Let m be any integer with and . If , then the DBN model can approximate each distribution in a partition model of coarseness arbitrarily well.

Proof.
When , the result follows from lemma 1. Assume therefore that , . We use the abbreviation . Let and . The top RBM can approximate each distribution from a partition model (on a subset of ) arbitrarily well, whose partition blocks are the cylinder sets with fixed coordinate values
for all , for all . After L−2 probability-sharing steps starting from , the DBN can approximate the distributions from the partition model arbitrarily well, whose partition blocks are the cylinder sets with fixed coordinate values
for all possible choices of , for all . The maximal cardinality of such a block is q1⋅⋅⋅qmr−1, and the union of all blocks equals .
Proof of Theorem 2.

The claim follows bounding the divergence of the partition models described in theorem 6.

As a corollary, we obtain the following bound for the expectation value of the divergence from distributions drawn from a Dirichlet prior, to the DBN model.

Corollary 1.
The expectation value of the divergence from a probability distribution p drawn from the symmetric Dirichlet distribution to the model DBN from theorem 2 is bounded by
where ψ is the digamma function and e is Euler's constant.
Proof.

This is a consequence of analytical work (Montúfar & Rauh, 2012) on the expectation value of Kullback-Leibler divergences of standard probability models, applied to the partition models described in theorem 2.

## Appendix:  Small Experiments

We run some computer experiments not with the purpose of validating the quality of our bounds in general, but with the purpose of giving a first empirical insight. It is important to emphasize that numerical experiments evaluating the divergence from probability models defined by neural networks are feasible only for small networks, since otherwise the model distributions are too hard to compute (see Long & Servedio, 2010). For large models, one still could try to sample the distributions and replace the divergence by a proxy, like the discrepancy of low-level statistics, but here we focus on small networks.

We generate artificial data in the following natural way. For a given visible state-space and the corresponding probability simplex , we draw a set of distributions from the Dirichlet distribution on . For our experiments, we choose the concentration parameter a in such a way that the Dirichlet density is higher for low-entropy distributions (most distributions in practice have relatively few preferred states and hence a small entropy). Next, for each , we generate N independent and identically distributed samples from pi, which results in a data vector with empirical distribution .

A network (with visible states ) is then tested on all data sets Xi, . For each data set, we train using contrastive divergence (CD) (Hinton, 2002; Hinton et al., 2006) and maximum likelihood (ML) gradient. This gives us a maximum likelihood estimate of Pi within . Finally, we compute the Kullback-Leibler divergence , the maximum value over all data sets , and the mean value over all data sets . We do not need cross-validation or , because we are interested in the representational power of rather than its generalization properties.
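The data-generation part of this scheme can be sketched as follows (a simplified illustration; the CD/ML training step is omitted, and the function names, concentration value, and seed are our choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_targets(num_states, k, a):
    """Draw k target distributions from a symmetric Dirichlet with
    concentration a; small a favors low-entropy targets, as in the text."""
    return rng.dirichlet([a] * num_states, size=k)

def empirical(p, N, num_states):
    """Empirical distribution of N i.i.d. samples drawn from p."""
    counts = np.bincount(rng.choice(num_states, size=N, p=p),
                         minlength=num_states)
    return counts / N
```

Each empirical distribution then serves as a training target, and the divergence from it to the trained model distribution is what the maximum and mean statistics in the text summarize.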

We note that the number of distributions with the largest divergence from is relatively small, and hence the random variable has a large variance (unless the number of data sets tends to infinity, ). Moreover, we note that it is hard to find the best approximations of a given target Pi. Since the likelihood function has many local maxima, the distribution is often not a global maximizer of , even if training is arranged with many parameter initializations. Many times the estimated value is a good local minimizer of the divergence, but sometimes it is relatively poor (especially for the larger networks). This contributes again to the variance of . The mean values , on the other hand, are more stable.

Figure 4 shows the results for small binary RBMs with three and four visible units, and Figure 5 shows the results for small constant-width binary DBNs with four visible units. In both figures, the maximum and mean divergence are captured relatively well by our theoretical bounds. The empirical maximum values show a well-recognizable discrepancy from the theoretical bound. This is explained by the large variance of the empirical maximum divergence, given the limited number of target distributions used in these experiments: finding a maximizer of the divergence (a data vector that is hardest to represent) is hard, and most target distributions can be approximated much better than the hardest ones. A second observation is that with increasing network complexity (more hidden units), finding the best approximations of the target distributions becomes harder (even with increased training effort). This can cause the empirical maximum divergence to actually surpass the theoretical bounds: although the models are in principle able to approximate the targets accurately, according to our theoretical bounds, in practice they may not because of the difficult training, and part of their capacity remains wasted. The empirical mean values have a much lower variance and are captured quite accurately by our theoretical bounds.

Figure 4:

Empirical evaluation of the representational power of small binary RBMs. The gray shading indicates the frequency at which the target distribution P_i had a given divergence from the trained RBM distribution (the darker a value, the more frequent). The lines with round markers show the mean divergence (dashed) and maximal divergence (solid) over all target distributions. The lines with square markers show the theoretical upper bounds of the mean divergence (dashed) and maximal divergence (solid) over the continuum of all possible target distributions drawn from the symmetric Dirichlet distribution.

Figure 5:

Empirical evaluation of the representational power of small binary DBNs. The details are as in Figure 4, whereby the theoretical upper bounds shown here for the maximal and mean divergence are a combination of our results for RBMs and DBNs.

## Acknowledgments

I was supported in part by DARPA grant FA8650-11-1-7145. I completed the revision of the original manuscript at the Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany.

## References

Ay, N., & Knauf, A. (2006). Maximizing multi-information. Kybernetika, 42, 517–538.

Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127.

Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. In B. Schölkopf, J. C. Platt, & T. Hoffman (Eds.), Advances in neural information processing systems, 19 (pp. 153–160). Cambridge, MA: MIT Press.

Freund, Y., & Haussler, D. (1991). Unsupervised learning of distributions of binary vectors using 2-layer networks. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 912–919).

Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14, 1771–1800.

Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527–1554.

Hornik, K., Stinchcombe, M. B., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.

Le Roux, N., & Bengio, Y. (2008). Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation, 20(6), 1631–1649.

Le Roux, N., & Bengio, Y. (2010). Deep belief networks are compact universal approximators. Neural Computation, 22(8), 2192–2207.

Long, P. M., & Servedio, R. A. (2010). Restricted Boltzmann machines are hard to approximately evaluate or simulate. In Proceedings of the 27th ICML (pp. 703–710). Omnipress.

Montúfar, G. (2013). Mixture decompositions of exponential families using a decomposition of their sample spaces. Kybernetika, 49(1), 23–39.

Montúfar, G., & Ay, N. (2011). Refinements of universal approximation results for deep belief networks and restricted Boltzmann machines. Neural Computation, 23(5), 1306–1319.

Montúfar, G., & Morton, J. (2013). Discrete restricted Boltzmann machines. In Online Proceedings of the First International Conference on Learning Representations.

Montúfar, G., & Rauh, J. (2012). Scaling of model approximation errors and expected entropy distances. In Proceedings of the Ninth Workshop on Uncertainty Processing (pp. 137–148). Prague: Faculty of Management, University of Economics.

Montúfar, G., Rauh, J., & Ay, N. (2011). Expressive power and approximation errors of restricted Boltzmann machines. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 24 (pp. 415–423). Red Hook, NY: Curran.

Rauh, J. (2013). Optimally approximating exponential families. Kybernetika, 49(2), 199–215.

Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 194–281). Cambridge, MA: MIT Press.

Sutskever, I., & Hinton, G. E. (2008). Deep narrow sigmoid belief networks are universal approximators. Neural Computation, 20(11), 2629–2636.

Welling, M., Rosen-Zvi, M., & Hinton, G. E. (2005). Exponential family harmoniums with an application to information retrieval. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 1481–1488). Cambridge, MA: MIT Press.

## Note

1. Where a more or less good approximation of a small set of target distributions is often sufficient, or where the goal is not to model data directly but rather to obtain abstract representations of data.