## Abstract

We generalize recent theoretical work on the minimal number of layers of narrow deep belief networks that can approximate any probability distribution on the states of their visible units arbitrarily well. We relax the setting of binary units (Sutskever & Hinton, 2008; Le Roux & Bengio, 2008, 2010; Montúfar & Ay, 2011) to units with arbitrary finite state spaces and the vanishing approximation error to an arbitrary approximation error tolerance. For example, we show that a *q*-ary deep belief network with layers of width for some can approximate any probability distribution on without exceeding a Kullback-Leibler divergence of . Our analysis covers discrete restricted Boltzmann machines and naive Bayes models as special cases.

## 1. Introduction

A deep belief network (DBN) (Hinton, Osindero, & Teh, 2006) is a layered stochastic network with undirected bipartite interactions between the units in the top two layers and directed bipartite interactions between the units in all other subsequent pairs of layers, directed toward the bottom layer. The top two layers form a restricted Boltzmann machine (RBM) (Smolensky, 1986). The entire network defines a model of probability distributions on the states of the units in the bottom layer, the visible layer. When the number of units in every layer has the same order of magnitude, the network is called narrow. Depth refers to the number of layers. Deep network architectures are believed to play a key role in information processing of intelligent agents (see Bengio, 2009, for an overview of this exciting topic). DBNs were the first deep architectures to be envisaged together with an efficient unsupervised training algorithm (Hinton et al., 2006). Due to their restricted connectivity, it is possible to greedily train their layers one at a time and in this way identify remarkably good parameter initializations for solving specific tasks (see Bengio, Lamblin, Popovici, & Larochelle, 2007). The ability to train deep architectures efficiently has pioneered a great number of applications in machine learning and in the booming field of deep learning.

The representational power of neural networks has been studied for several decades, whereby their universal approximation properties have received special attention. For instance, a well-known result (Hornik, Stinchcombe, & White, 1989) shows that multilayer feedforward networks with one exponentially large layer of hidden units are universal approximators of Borel measurable functions. Although universal approximation is of limited importance for practical purposes,^{1} it plays an important role as a warrant for consistency and sufficiency of the complexity attainable by specific classes of learning systems. Besides the universal approximation question, it is natural to ask, “How well is a given network able to approximate certain classes of probability distributions?” This letter pursues an account of the ability of DBNs to approximate probability distributions.

The first universal approximation result for deep and narrow sigmoid belief networks is due to Sutskever and Hinton (2008). They showed that a narrow sigmoid belief network with 3(2^{n}−1)+1 layers can represent probability distributions arbitrarily close to any probability distribution on the set of length *n* binary vectors. Their result shows that not only exponentially wide and shallow networks are universal approximators (Hornik et al., 1989); exponentially deep and narrow ones are as well. Subsequent work has studied the optimality question, “How deep is deep enough?” with improved universal approximation depth bounds by Le Roux and Bengio (2010) and Montúfar and Ay (2011), which we discuss in more detail in this letter. These articles focus on the minimal depth of narrow DBN universal approximators with binary units—that is, the number of layers that these networks must have in order to be able to represent probability distributions arbitrarily close to any probability distribution on the states of their visible units. This letter complements that analysis in two ways.

First, instead of asking for the minimal size of universal approximators, we ask for the minimal size of networks that can approximate any distribution to a given error tolerance, treating the universal approximation problem as the special case of zero error tolerance. This analysis gives a theoretical basis on which to balance model accuracy and parameter count. For comparison, universal approximation is a binary property that always requires an exponential number of parameters. As it turns out, our analysis also allows us to estimate the expected value of the model approximation errors incurred when learning classes of distributions, say, low-entropy distributions, with networks of given sizes.

Second, we consider networks with finite valued units, called discrete or multinomial DBNs, including binary DBNs as special cases. Nonbinary units serve, obviously, to encode nonbinary features directly, which may be interesting in multichannel perception (e.g., color-temperature-distance sensory inputs). Additionally, the interactions between discrete units can carry much richer relations than those between binary units. In particular, within the nonbinary discrete setting, DBNs, RBMs, and naive Bayes models can be seen as representatives of the same class of probability models.

This letter is organized as follows. Section 2 gives formal definitions, before we state our main result, theorem 2, in section 3: a bound on the approximation errors of discrete DBNs. A universal approximation depth bound follows directly. After this, a discussion of the result is given with a sketch of the proof. The proof entails several steps of independent interest, developed in the next sections. Section 4 addresses the representational power and approximation errors of RBMs with discrete units. Section 5 studies the models of conditional distributions represented by feedforward discrete stochastic networks (DBN layers). Section 6 studies concatenations of layers of feedforward networks and elaborates on the patterns of probability sharing steps (transformations of probability distributions) that they can realize. Section 7 concludes the proof of the main theorem and gives a corollary about the expectation value of the approximation error of DBNs. The appendix presents an empirical validation scheme and tests the approximation error bounds numerically on small networks.

## 2. Preliminaries

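The Kullback-Leibler divergence from a distribution *p* to a distribution *q* is defined as $D(p\|q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$ when $\operatorname{supp}(p) \subseteq \operatorname{supp}(q)$ and $D(p\|q) = \infty$ otherwise. The divergence from a distribution *p* to a model $\mathcal{M}$ is defined as $D(p\|\mathcal{M}) = \inf_{q \in \mathcal{M}} D(p\|q)$. The divergence of any distribution on $\mathcal{X}$ to $\mathcal{M}$ is bounded by $D_{\mathcal{M}} = \sup_{p} D(p\|\mathcal{M})$, where the supremum runs over all distributions *p* on $\mathcal{X}$. We refer to $D_{\mathcal{M}}$ as the universal or maximal approximation error of $\mathcal{M}$. The model $\mathcal{M}$ is called a universal approximator of probability distributions on $\mathcal{X}$ iff $D_{\mathcal{M}} = 0$.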
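The divergence used throughout can be made concrete with a minimal sketch. It assumes distributions are given as equal-length lists of probabilities over an enumerated finite state space. The function names and the finite list of candidate distributions, which stands in for a model, are hypothetical and chosen only for illustration.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence from p to q (natural logarithm)."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue  # states with p(x) = 0 contribute nothing
        if qx == 0.0:
            return math.inf  # the support of p is not contained in that of q
        total += px * math.log(px / qx)
    return total


def divergence_to_candidates(p, candidates):
    """Smallest divergence from p to a finite list of candidate distributions.

    A genuine model is usually an infinite set, so this is only a stand-in
    for the infimum over the model.
    """
    return min(kl_divergence(p, q) for q in candidates)
```

For a point mass and the uniform distribution on two states, `kl_divergence([1.0, 0.0], [0.5, 0.5])` evaluates to log 2, whereas swapping the arguments gives infinity because of the support condition.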

A discrete DBN probability model is specified by a number of layers (the depth of the network), the number of units in each layer (the width of each layer), and the state-space of each unit in each layer. Let , be the number of layers. We imagine these layers arranged as a stack with layer 1 at the bottom (this will be the visible layer) and layer *L* at the top (this will be the deepest layer). (See Figure 1.) For each , let be the number of units in layer *l*. For each , let , be the state-space of unit *i* in layer *l*. We denote the joint state-space of the units in layer *l* by and write for a state from . We call a unit *q-valued* or *q-ary* if its state-space has cardinality *q* and assume that *q* is a finite integer larger than one.

In order to proceed with the definition of the DBN model, we consider the mixed graphical model with undirected connections between the units in the top two layers *L* and *L*−1 and directed connections from the units in layer *l*+1 to the units in layer *l* for all . This model consists of joint probability distributions on the states of all network units, parameterized by a collection of real matrices and vectors . For each , the matrix contains the interaction weights between units in layers *l* and *l*+1. It consists of row blocks for all . For each , the row vector contains the bias weights of the units in layer *l*. It consists of blocks for all .

Note that the bias of a unit with state-space is a vector with entries, and the interaction of a pair of units with state-spaces and is described by a matrix of order . The number of interaction and bias parameters in the entire network adds to .

Here we use the following notation. Given a state vector of *n* units with joint state-space , **x** denotes the *x*th column of a minimal matrix of sufficient statistics for the independent distributions of these *n* units. To make this more concrete, we set **x** equal to a column vector with blocks **x**_{1}, …, **x**_{n}, where is the one-hot representation of *x*_{i} without the first entry, for all . For example, if , then , with and .
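The block encoding just described can be sketched as follows. The sketch assumes that the states of a unit with *q* values are labeled 0, …, *q*−1 and that the dropped first entry is the one for state 0. The function names are hypothetical.

```python
def reduced_one_hot(state, cardinality):
    """One-hot vector of `state` in {0, ..., cardinality - 1} without its first entry."""
    full = [1 if s == state else 0 for s in range(cardinality)]
    return full[1:]


def sufficient_statistics(states, cardinalities):
    """Concatenate the reduced one-hot blocks of all units into one vector."""
    blocks = []
    for x_i, q_i in zip(states, cardinalities):
        blocks.extend(reduced_one_hot(x_i, q_i))
    return blocks
```

Under these assumptions a binary unit contributes one entry and a ternary unit two, so the encoded vector has one entry per unit for each of its states other than state 0. This matches the length of the bias vectors discussed above.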

The discrete DBN probability model with *L* layers of widths and state-spaces is the set of probability distributions expressible by equation 2.7 for all possible choices of the parameter . Intuitively, this set is a linear projection of a manifold parameterized by and may have self-intersections or other singularities.

The discrete DBN probability model with *L*=2 is a discrete RBM probability model. This model consists of the marginal distributions on of the distributions from equation 2.2 for all possible choices of , , and .

When *L*>2, the distributions on defined by the top two DBN layers can be seen as the inputs of the stochastic maps defined by the conditional distributions from equation 2.3. The outputs of these maps are probability distributions on that can be seen as the inputs of the stochastic maps defined by the next lower layer and so forth. The discrete DBN probability model can be seen as the set of images of a discrete RBM probability model by a family of sequences of stochastic maps.

The following simple class of probability models will be useful to study the approximation capabilities of DBN models. Let be a partition of a finite set . The partition model with partition is the set of probability distributions on that have constant value on each *A*_{i}. Geometrically, this is the simplex with vertices for all , where is the indicator function of *A*_{i}. The coarseness of is max_{i} |*A*_{i}|. Unlike many statistical models, partition models have a well-understood Kullback-Leibler divergence. If is a partition model of coarseness *c*, then . Furthermore, partition models are known to be optimally approximating exponential families in the sense that they minimize the universal approximation error among all closures of exponential families of a given dimension (see Rauh, 2013).
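The role of the coarseness can be illustrated with a small sketch. It relies on the standard fact that the distribution in a partition model closest to a target *p* in Kullback-Leibler divergence spreads the mass that *p* assigns to each block uniformly over that block. The function names and the example partition are hypothetical.

```python
import math


def project_onto_partition_model(p, blocks):
    """Spread the mass that p assigns to each block uniformly over the block."""
    q = [0.0] * len(p)
    for block in blocks:
        mass = sum(p[x] for x in block)
        for x in block:
            q[x] = mass / len(block)
    return q


def kl_divergence(p, q):
    """Divergence from p to q, assuming q is positive wherever p is."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


# Hypothetical example: four states split into a block of size 3 and one of size 1.
blocks = [[0, 1, 2], [3]]
coarseness = max(len(block) for block in blocks)

# A point mass inside the largest block is a hardest target for this model.
point_mass = [1.0, 0.0, 0.0, 0.0]
closest = project_onto_partition_model(point_mass, blocks)
worst_case = kl_divergence(point_mass, closest)
```

In this example `worst_case` agrees with `math.log(coarseness)`: the divergence to a partition model is a weighted average of the divergences to the uniform distributions on the blocks, so it cannot exceed the logarithm of the largest block size.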

## 3. Main Result

The starting point of our considerations is the following result for binary DBNs:

*A deep belief network probability model with L layers of binary units of width n=2^{k−1}+k (for some ) is a universal approximator of probability distributions on whenever *.

The main result of this letter is the following generalization of theorem 1. Here we make the simplifying assumption that all layers have the same width *n* and the same state-space. The result holds automatically for DBNs with wider hidden layers or hidden units with larger state-spaces.

*Let DBN be a deep belief network probability model with , layers of width . Let the ith unit of each layer have state-space , , , for each . Let m be any integer with , and let . If for some , then the probability model DBN can approximate each element of a partition model of coarseness arbitrarily well. The Kullback-Leibler divergence from any distribution on to DBN is bounded by*

In particular, if all units are *q*-ary and the layer width is *n* = *q*^{k−1} + *k* for some , then the DBN probability model is a universal approximator of distributions on whenever . Note that . The theorem is illustrated in Figure 2.

### 3.1. Remarks.

The number of parameters of a *q*-ary DBN with *L* layers of width *n* is (*L*−1)(*n*(*q*−1)+1)*n*(*q*−1)+*n*(*q*−1). Since the set of probability distributions on has dimension *q*^{n}−1, the DBN model is full dimensional only if . This is a parameter-counting lower bound for the universal approximation depth. Theorem 2 gives an upper bound for the minimal universal approximation depth. The upper bound from the theorem surpasses the parameter-counting lower bound by roughly a factor *n*. We think that the upper bound is tight, up to sublinear factors, in consideration of the following. Probability models with hidden variables can have dimension strictly smaller than their parameter count (dimension defect). Moreover, in some cases, even full-dimensional models represent only very restricted classes of distributions, as has been observed, for example, in binary tree models with hidden variables. It is known that for any prime power *q*, the smallest naive Bayes model universal approximator of distributions on has *q*^{n−1}(*n*(*q*−1)+1)−1 parameters (see Montúfar, 2013, theorem 13). Hence for these models, the number of parameters needed to achieve universal approximation surpasses the corresponding parameter-counting lower bound *q*^{n}/(*n*(*q*−1)+1) by a factor of order *n*.
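The parameter-counting argument can be checked numerically. The sketch below evaluates the parameter count stated above and searches for the smallest depth at which it reaches the dimension *q*^{n}−1 of the set of distributions on the visible states. The function names are hypothetical, and the search starts at two layers on the assumption that a DBN has at least a visible and a hidden layer.

```python
def dbn_parameter_count(num_layers, width, q):
    """Parameters of a q-ary DBN with `num_layers` layers of width `width`."""
    units = width * (q - 1)  # free bias entries per layer
    return (num_layers - 1) * (units + 1) * units + units


def parameter_counting_depth(width, q):
    """Smallest number of layers whose parameter count reaches q**width - 1."""
    target = q ** width - 1
    num_layers = 2
    while dbn_parameter_count(num_layers, width, q) < target:
        num_layers += 1
    return num_layers
```

For binary units and width 10, for example, the parameter count first reaches 2^{10}−1 = 1023 at 11 layers. This is only a necessary condition for universal approximation, not a sufficient one.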

Computing tight bounds for the maximum of the Kullback-Leibler divergence is a notoriously challenging problem. This is so even for simple probability models without hidden variables, for example, independence models with mixed discrete variables. The optimality of our DBN error bounds is not completely settled at this point, but we think that they give a close description of the large-scale approximation error behavior of DBNs. For the limiting case of a single layer with *n* independent *q*-ary units, it is known that the maximal divergence is equal to (*n*−1)log(*q*) (see Ay & Knauf, 2006), corresponding to the line log_{q}(*L*)=0 in Figure 2. Furthermore, when our upper bounds vanish, they obviously are tight (corresponding to the points with value zero in Figure 2).

Discrete DBNs have many hyperparameters (the layer widths and the state-spaces of the units), which makes their analysis combinatorially intricate. Some of these intricacies are apparent from the floor and ceiling functions in our main theorem. This theorem tries to balance accuracy, generality, and clarity. In some cases, the bounds can be improved by exhausting the representational power gain per layer described in theorem 8. A more detailed and accurate account on the two-layer case (RBMs) is given in section 4. In section 7 we give results describing probability distributions contained in the DBN model (proposition 1) and addressing the expectation value of the divergence (corollary 1). The appendix contains an empirical discussion, together with the numerical evaluation of small models.

### 3.2. Outline of the Proof.

We will prove theorem 2 by first studying the individual parts of the DBN: the RBM formed by the top two layers (see section 4); the individual units with directed inputs (see section 5); the probability sharing realized by stacks of layers (see section 6); and, finally, the sets of distributions of the units in the bottom layer (see section 7). The proof steps can be summarized as follows:

- Show that the top RBM can approximate any probability distribution with support on a set of the form arbitrarily well.
- For a unit with state-space receiving *n* directed inputs, show that there is a choice of parameters for which the following holds for each state of the *n*th input unit: if the input vector is , then the unit outputs with probability , where is an arbitrary distribution on for all .
- Show that there is a sequence of stochastic maps , each of which superposes nearly *qn* probability multisharing steps, which maps the probability distributions represented by the top RBM to an arbitrary probability distribution on .
- Show that the DBN approximates certain classes of tractable probability distributions arbitrarily well and estimate their maximal approximation errors.
Show that the DBN approximates certain classes of tractable probability distributions arbitrarily well and estimate their maximal approximation errors.

The superposition of probability-sharing steps is inspired by Le Roux and Bengio (2010), together with the refinements of that work devised in Montúfar and Ay (2011). By *probability sharing*, we refer to the process of transferring an arbitrary amount of probability from a state vector to another state vector . In contrast to the binary proofs, where each layer superposes about 2*n* sharing steps, here each layer superposes about *qn* multisharing steps, whereby each multisharing step transfers probability from one state to *q*−1 states (when the units are *q*-ary). With this, a more general treatment of models of conditional distributions is required. Further, additional considerations are required in order to derive tractable submodels of probability distributions that bound the DBN model approximation errors.
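A single multisharing step can be pictured with a toy sketch. It assumes that a distribution is stored as a dictionary from state tuples to probabilities and that the targets differ from the source in one coordinate. The function name and the data layout are hypothetical, and the sketch only shows the effect of a step on a distribution, not how a DBN layer realizes it.

```python
def multisharing_step(p, source, shares):
    """Move the fraction shares[t] of the mass at `source` to each target t.

    The source is assumed not to appear among the targets; the mass that is
    not moved stays at the source.
    """
    moved = sum(shares.values())
    if moved < 0.0 or moved > 1.0:
        raise ValueError("the shares must be nonnegative and sum to at most one")
    result = dict(p)
    mass = p.get(source, 0.0)
    result[source] = mass * (1.0 - moved)
    for target, fraction in shares.items():
        result[target] = result.get(target, 0.0) + mass * fraction
    return result


# Two ternary units: move mass from (0, 0) to the q - 1 = 2 states that
# differ from it in the second coordinate.
before = {(0, 0): 1.0}
after = multisharing_step(before, (0, 0), {(0, 1): 0.25, (0, 2): 0.25})
```

In the binary case each step has a single target, which corresponds to the sharing steps of the earlier proofs.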

## 4. Restricted Boltzmann Machines

We denote by the restricted Boltzmann machine probability model with hidden units taking states in and visible units taking states in . Recall the definitions made in section 2. In the literature, RBMs are defined by default with binary units; however, RBMs with discrete units have appeared in Welling, Rosen-Zvi, and Hinton (2005), and their representational power has been studied in Montúfar and Morton (2013). The results from this section are closely related to the analysis given in Montúfar and Morton (2013).

*The model can approximate any mixture distribution arbitrarily well, where p_{0} is any product distribution and p_{i} is any mixture of product distributions for all satisfying for all*.

Here, a product distribution *p* is a probability distribution on that factorizes as for all , where *p*_{j} is a distribution on for all . A mixture is a weighted sum with nonnegative weights adding to one. The support of a distribution *p* is .

*k* product distributions from . The closure contains all mixtures of *k* product distributions, including those that are not strictly positive. Let denote the renormalized entry-wise product with for all . Let 1 denote the constant function on with value 1. The model can be written, up to normalization, as the set Now consider any probability distributions , . If for all , then the product is equal to , up to normalization. Let and . Then . Hence the mixture distribution *p* is contained in the closure of the RBM model.

RBMs can approximate certain partition models arbitrarily well:

*Let be the partition model with partition blocks for all . If , then each distribution contained in can be approximated arbitrarily well by distributions from*.

*u*_{i} denotes the uniform distribution on . For any , any mixture of the form is also a product distribution, which factorizes as Hence any point in is a mixture of product distributions of the form given in equation 4.2. The claim follows from theorem 3.

Lemma 1, together with the divergence formula for partition models given at the end of section 2, implies:

When all units are *q*-ary, the RBM with (*q*^{n−1}−1)/(*q*−1) hidden units is a universal approximator of distributions on . Theorem 4 generalizes previous results on binary RBMs (Montúfar & Ay, 2011, theorem 1; Montúfar, Rauh, & Ay, 2011, theorem 5.1), where it is shown that a binary RBM with 2^{n−1}−1 hidden units is a universal approximator of distributions on and that the maximal approximation error of binary RBMs decreases at least logarithmically in the number of hidden units. A previous result by Freund and Haussler (1991) shows that a binary RBM with 2^{n} hidden units is a universal approximator of distributions on . (See also Le Roux & Bengio, 2008, theorem 2.)
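The hidden-unit count quoted above is easy to tabulate. The sketch below evaluates (*q*^{n−1}−1)/(*q*−1) with integer arithmetic, which is exact because the quotient equals the geometric sum 1 + *q* + … + *q*^{n−2}. The function name is hypothetical.

```python
def universal_rbm_hidden_units(num_visible, q):
    """Hidden units (q**(n-1) - 1) / (q - 1) quoted above for a q-ary RBM."""
    return (q ** (num_visible - 1) - 1) // (q - 1)
```

For binary units the expression reduces to 2^{n−1}−1, the count from the earlier binary results. Four binary visible units, for example, give 7 hidden units, compared with the 2^{4} = 16 of the Freund and Haussler result.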

## 5. The Internal Node of a Star

A conditional distribution is naturally identified with the stochastic map defined by the matrix (*p*(*x*|*y*))_{y,x}. The following lemma describes some stochastic maps that are representable by the model and that we will use to define a probability-sharing scheme in section 6.
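The identification of a conditional distribution with a stochastic map can be sketched as follows. The sketch assumes that the matrix is stored as a list of rows indexed by the input state *y*, each row being a distribution over output states *x*. The function name is hypothetical.

```python
def apply_stochastic_map(p_in, kernel):
    """Push an input distribution through the kernel, with kernel[y][x] = p(x | y).

    Returns the output distribution with entries sum_y p_in[y] * kernel[y][x].
    """
    num_outputs = len(kernel[0])
    return [
        sum(p_in[y] * kernel[y][x] for y in range(len(p_in)))
        for x in range(num_outputs)
    ]


# Input state 0 always produces output 0; input state 1 produces either output
# with equal probability.
kernel = [[1.0, 0.0], [0.5, 0.5]]
p_out = apply_stochastic_map([0.5, 0.5], kernel)
```

Because every row of the kernel sums to one, the map sends probability distributions to probability distributions and is linear in the input distribution.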

*d* = *r*−1. For some , let be the parameter vector of a distribution that attains a unique maximum at *v*. Then for any fixed , we have

To see this, note that and hence . Furthermore, .

**y**_{j} in equation 5.1 and The matrix maps to the parameter vectors with corresponding distributions . When are chosen such that , then for each , the vector is mapped to a parameter vector with arbitrarily close to .

In order to prove lemma 2 for any subset , it is sufficient to show that the vectors are affinely independent and there is a linear map mapping into the zero vector and into the relative interior of the normal cone of at the vertex *v* = *y*_{m} for all .

## 6. Probability Sharing

### 6.1. A Single Directed Layer.

Consider an input layer of units with bipartite connections directed toward an output layer of units . Denote by the model of conditional distributions defined by this network. Recall the definition from equation 2.3. Each conditional distribution defines a linear stochastic map from the simplex of distributions on to the simplex of distributions on . Here the parameter corresponds to the parameters of the conditional distributions from equation 2.3 for a given *l*.

For any and , we denote by *y*[*j*] the one-dimensional cylinder set . Similarly, for any , we denote by the cylinder set consisting of all arrays in with fixed values in the entries .

Applying lemma 2 to each output unit of shows:

Consider some . Let be a multiset and a set of indices from [*m*]. If the cylinder sets *y*^{(s)}[*j*_{s}] are disjoint and is a subset of containing them, then the image of by the family of stochastic maps contains .

This result describes the image of a set of probability distributions by the collection of stochastic maps defined by a DBN layer for all choices of its parameters. In turn, it describes part of the DBN representational power contributed by a layer of units.

### 6.2. A Stack of Directed Layers.

In the case of binary units, sequences of probability-sharing steps can be defined conveniently using Gray codes, as done in Le Roux and Bengio (2010). A Gray code is an ordered list of vectors, where each two subsequent vectors differ in only one entry. A binary Gray code can be viewed as a sequence of one-dimensional cylinder sets. In the nonbinary case, this correspondence is no longer given. Instead, motivated by theorem 5, we will use one-dimensional cylinder sets in order to define sequences of multisharing steps, as shown in Figure 3.
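The correspondence between binary Gray codes and one-dimensional cylinder sets can be checked with a short sketch. It uses the reflected binary Gray code, which is one common construction and not necessarily the code used in the cited work. It also assumes that a one-dimensional cylinder set consists of all vectors that agree with a given vector outside one free coordinate. The function names are hypothetical.

```python
def binary_gray_code(length):
    """Reflected binary Gray code as a list of tuples of the given length."""
    codes = [()]
    for _ in range(length):
        codes = [(0,) + c for c in codes] + [(1,) + c for c in reversed(codes)]
    return codes


def differ_in_one_entry(u, v):
    """True if the two vectors differ in exactly one coordinate."""
    return sum(a != b for a, b in zip(u, v)) == 1


def cylinder_set(y, free_coordinate, q):
    """All vectors that agree with y outside `free_coordinate` (q-ary entries)."""
    j = free_coordinate
    return {y[:j] + (value,) + y[j + 1:] for value in range(q)}
```

In the binary case two subsequent code words make up exactly one such cylinder set. For *q* > 2 a cylinder set has *q* elements, which suggests why sequences of cylinder sets, rather than sequences of single vectors, are the natural object in the nonbinary setting.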

Let be the cardinality of for , and let . The set can be written as the disjoint union of one-dimensional cylinder sets, as , where and .

*y*^{(s)}[*m*+1] will be the starting point of a sequence of sharing steps. By theorem 5, a directed DBN layer maps the simplex of distributions surjectively to the simplex of distributions . The latter can be mapped by a further DBN layer onto a larger simplex and so forth. Starting with *y*^{(1)}[*m*+1], consider the sequence continued as shown in Table 1. We denote this sequence of cylinder sets by *G*^{1}, and its *l*th row (a cylinder set) by *G*^{1}(*l*). The union of the first *K* rows, with , is equal to .

Notes to Table 1: Light gray indicates a free coordinate. Dark gray indicates a transition coordinate.

We define *k* sequences as follows. The first *m* coordinates of *G*^{s} are equal to a permutation of the first *m* coordinates of *G*^{1}, defined by shifting each of these *m* columns cyclically *s* positions to the right. The last *n*−*m* coordinates of *G*^{s} are equal to .

We use the abbreviation . Within the first *m* columns, the free coordinate of the *l*th row of *G*^{s} is , where is the least integer with . Here the empty product is defined as 1. Let . We can modify each sequence *G*^{s} by repeating rows if necessary, such that the free coordinate of the *l*th row of the resulting sequence is , where is the least integer with . This does not depend on *s*.

The sequences for are all different from each other in the last *n* − *m* coordinates and have a different sharing free coordinate in each row. The union of cylinder sets in all rows of these sequences is equal to .

## 7. Deep Belief Networks

*Consider a DBN with layers of width n, each layer containing units with state-spaces of cardinalities . Let m be any integer with . The corresponding probability model can approximate a distribution p on arbitrarily well whenever the support of p is contained in*.

Note that . By theorem 3 the top RBM can approximate each distribution in the probability simplex on arbitrarily well. By theorem 5, this simplex can be mapped iteratively into larger simplices, according to the sequences from section 6.

*Consider a DBN with L layers of width n, each layer containing units with state-spaces of cardinalities . Let m be any integer with and . If , then the DBN model can approximate each distribution in a partition model of coarseness arbitrarily well.*

*L*−2 probability-sharing steps starting from , the DBN can approximate the distributions from the partition model arbitrarily well, whose partition blocks are the cylinder sets with fixed coordinate values for all possible choices of , for all . The maximal cardinality of such a block is *q*_{1}⋅⋅⋅*q*_{m−r−1}, and the union of all blocks equals .

The claim follows from bounding the divergence of the partition models described in theorem 6.

As a corollary, we obtain the following bound for the expectation value of the divergence from distributions drawn from a Dirichlet prior to the DBN model.

This is a consequence of analytical work (Montúfar & Rauh, 2012) on the expectation value of Kullback-Leibler divergences of standard probability models, applied to the partition models described in theorem 2.

## Appendix: Small Experiments

We run some computer experiments not with the purpose of validating the quality of our bounds in general, but with the purpose of giving a first empirical insight. It is important to emphasize that numerical experiments evaluating the divergence from probability models defined by neural networks are feasible only for small networks, since otherwise the model distributions are too hard to compute (see Long & Servedio, 2010). For large models, one still could try to sample the distributions and replace the divergence by a proxy, like the discrepancy of low-level statistics, but here we focus on small networks.

We generate artificial data in the following natural way. For a given visible state-space and the corresponding probability simplex , we draw a set of distributions from the Dirichlet distribution on . For our experiments, we choose the concentration parameter *a* in such a way that the Dirichlet density is higher for low-entropy distributions (most distributions in practice have relatively few preferred states and hence a small entropy). Next, for each , we generate *N* independent and identically distributed samples from *p*^{i}, which results in a data vector with empirical distribution .
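The data-generation scheme can be sketched with the Python standard library alone. The sketch assumes a symmetric Dirichlet distribution with a single concentration parameter and samples it by normalizing independent gamma variates. The function names, the seed, and the parameter values in the usage note are hypothetical and are not the settings of the experiments.

```python
import random
from collections import Counter


def sample_dirichlet(num_states, concentration, rng):
    """Draw one distribution from a symmetric Dirichlet via normalized gamma variates."""
    gammas = [rng.gammavariate(concentration, 1.0) for _ in range(num_states)]
    total = sum(gammas)
    return [g / total for g in gammas]


def empirical_distribution(p, num_samples, rng):
    """Draw i.i.d. samples from p and return their relative frequencies."""
    draws = rng.choices(range(len(p)), weights=p, k=num_samples)
    counts = Counter(draws)
    return [counts[x] / num_samples for x in range(len(p))]
```

A call such as `sample_dirichlet(8, 0.5, random.Random(0))` draws a target on the eight states of three binary units. A concentration below one places more density near the boundary of the simplex, that is, on low-entropy distributions.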

A network (with visible states ) is then tested on all data sets *X*^{i}, . For each data set, we train using contrastive divergence (CD) (Hinton, 2002; Hinton et al., 2006) and maximum likelihood (ML) gradient. This gives us a maximum likelihood estimate of *P*^{i} within . Finally, we compute the Kullback-Leibler divergence , the maximum value over all data sets , and the mean value over all data sets . We do not need cross-validation or , because we are interested in the representational power of rather than its generalization properties.

We note that the number of distributions with the largest divergence from is relatively small, and hence the random variable has a large variance (unless the number of data sets tends to infinity, ). Moreover, we note that it is hard to find the best approximations of a given target *P*^{i}. Since the likelihood function has many local maxima, the distribution is often not a global maximizer of , even if training is arranged with many parameter initializations. Many times the estimated value is a good local minimizer of the divergence, but sometimes it is relatively poor (especially for the larger networks). This contributes again to the variance of . The mean values , on the other hand, are more stable.

Figure 4 shows the results for small binary RBMs with three and four visible units, and Figure 5 shows the results for small constant-width binary DBNs with four visible units. In both figures, the maximum and mean divergences are captured relatively well by our theoretical bounds. The empirical maximum values have a well-recognizable discrepancy from the theoretical bound. This is explained by the large variance of , given the limited number of target distributions used in these experiments. Finding a maximizer of the divergence (a data vector that is hardest to represent) is hard. Most target distributions can be approximated much better than the hardest distributions. A second observation is that with increasing network complexity (more hidden units), finding the best approximations of the target distributions becomes harder (even with increased training effort). This causes the empirical maximum divergence to actually surpass the theoretical bounds. In other words, although the models are in principle able to approximate the targets accurately, according to our theoretical bounds, in practice they may not because of the difficult training, and their capacity remains wasted. The empirical mean values have a much lower variance and are captured quite accurately by our theoretical bounds.

## Acknowledgments

I was supported in part by DARPA grant FA8650-11-1-7145. I completed the revision of the original manuscript at the Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany.

## References

## Note

^{1} Where a more or less good approximation of a small set of target distributions is often sufficient, or where the goal is not to model data directly but rather to obtain abstract representations of data.