Abstract

In this letter, we propose a method to decrease the number of hidden units of the restricted Boltzmann machine while avoiding a decrease in the performance quantified by the Kullback-Leibler divergence. Our algorithm is then demonstrated by numerical simulations.

1  Introduction

The improvement of computer performance enables utilization of the exceedingly high representational powers of neural networks. Deep neural networks have been applied to various types of data (e.g., images, speech, and natural language) and have achieved great success (Bengio, Courville, & Vincent, 2013; Goodfellow et al., 2014; He, Zhang, Ren, & Sun, 2016; Oord et al., 2016; Vaswani et al., 2017) in both discrimination and generation tasks. Because the performance gains stem from the hierarchical structures of neural networks (Hestness et al., 2017), networks have grown larger and computational burdens have increased. Thus, demands for decreasing network size are growing. Various methods have been proposed for compressing the sizes of discriminative models (Cheng, Wang, Zhou, & Zhang, 2017; Guo, Yao, & Chen, 2016; Han, Pool, Tran, & Dally, 2015). However, compression of generative models (Berglund, Raiko, & Cho, 2015) has rarely been discussed.

Discriminative models provide the probabilities that data are classified into a particular class (Christopher, 2016), and in most cases, their learning requires a supervisor, namely, a data set with classification labels attached by humans. Thus, the outputs of discriminative models can be intuitively interpreted by humans. However, some data are difficult for humans to classify properly, and even when classification is possible, hand labeling is troublesome labor. In such cases, generative models with unsupervised learning are effective, since they automatically find the data structure without hand labels by learning the joint probabilities of data and classes. Therefore, it is expected that compression of generative models with unsupervised learning will be required in the future. Furthermore, if the system's performance can be preserved during compression, then the network size can be decreased while the system is in use. To approximately maintain performance throughout compression, we consider removing part of the system after decreasing its contribution to the overall performance. Our approach differs from that of previous studies (Berglund et al., 2015; Cheng et al., 2017; Guo et al., 2016; Han et al., 2015) that retrain systems after removing a part that contributes little to their performance.

In this letter, we deal with the restricted Boltzmann machine (RBM; Fischer & Igel, 2012; Smolensky, 1986). The RBM is one of the most important generative models with unsupervised learning, from the viewpoint not only of machine learning history (Bengio et al., 2013) but also of its wide applications: for example, generation of new samples, classification of data (Larochelle & Bengio, 2008), feature extraction (Hinton & Salakhutdinov, 2006), pretraining of deep neural networks (Hinton, Osindero, & Teh, 2006; Hinton & Salakhutdinov, 2006; Salakhutdinov & Larochelle, 2010), and solving many-body problems in physics (Carleo & Troyer, 2017; Tubiana & Monasson, 2017). The RBM consists of visible units that represent observables (e.g., pixels of images) and hidden units that express correlations between the visible units. An objective of the RBM is to generate plausible data by imitating the probability distribution from which the true data are sampled. In this case, the performance of the RBM is quantified by the difference between the probability distribution of the data and that of the visible variables of the RBM, which can be expressed by the Kullback-Leibler divergence (KLD). The RBM can exactly reproduce any probability distribution of binary data if it has a sufficient number of hidden units (Le Roux & Bengio, 2008). However, a smaller number of hidden units may be enough to capture the structure of the data. Therefore, in this letter, we aim to practically decrease the number of hidden units while avoiding an increase in the KLD between the model and data distributions (see Figure 1).

Figure 1:

Graphical model of an RBM. While approximately preserving the KLD, the target hidden unit and its edges (green) are removed from the main body of the RBM (from the left to right panel).


The outline of this letter is as follows. In section 2, we give a brief review of the RBM. In section 3, we evaluate the deviation of the KLD associated with node removal and propose a method that decreases the number of hidden units while avoiding an increase in the KLD. Numerical simulations are demonstrated in section 4, and we summarize this letter in section 5. The details of calculations are shown in appendixes.

2  Brief Introduction of the RBM

In this section, we briefly review the RBM, a Markov random field that consists of visible units, $v = (v_1, \ldots, v_M)^\top \in \{0,1\}^M$, and hidden units, $h = (h_1, \ldots, h_N)^\top \in \{0,1\}^N$. The joint probability that a configuration $(v,h)$ is realized, $p(v,h)$, is given by the energy function $E(v,h)$ as follows:
$$E(v,h) = -b^\top v - c^\top h - v^\top W h = -\sum_{i=1}^{M} b_i v_i - \sum_{j=1}^{N} c_j h_j - \sum_{i=1}^{M} \sum_{j=1}^{N} v_i w_{ij} h_j,$$
(2.1)
$$p(v,h) = \frac{e^{-E(v,h)}}{\sum_{v',h'} e^{-E(v',h')}},$$
(2.2)
where $b = (b_1, \ldots, b_M)^\top \in \mathbb{R}^M$ and $c = (c_1, \ldots, c_N)^\top \in \mathbb{R}^N$ are the biases of the visible and hidden units, respectively, and $W = (w_{ij}) \in \mathbb{R}^{M \times N}$ is the weight matrix.1 We abbreviate the full set of RBM parameters, $b$, $c$, and $W$, as $\xi$.
By properly tuning $\xi$, the probability distribution of the visible variables, $p(v) = \sum_h p(v,h)$, can approximate the unknown probability distribution that generates real data, $q(v)$. The performance of the RBM can be measured by the KLD of $p(v)$ from $q(v)$:
$$D_{\mathrm{KL}}(q \| p) = \sum_v q(v) \ln \frac{q(v)}{p(v)}.$$
(2.3)
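For an RBM small enough to enumerate, equation 2.3 can be evaluated exactly by brute force. The following sketch is illustrative: the sizes, parameter values, and the choice of $q(v)$ are all toy assumptions, not quantities from this letter.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
M, N = 4, 3                                   # tiny sizes, so exact sums are feasible
b, c = rng.normal(0, 0.1, M), rng.normal(0, 0.1, N)
W = rng.normal(0, 0.1, (M, N))

def energy(v, h):
    # E(v,h) = -b.v - c.h - v.W.h  (eq. 2.1)
    return -b @ v - c @ h - v @ W @ h

states_v = [np.array(s, float) for s in product([0, 1], repeat=M)]
states_h = [np.array(s, float) for s in product([0, 1], repeat=N)]

# summing e^{-E} over h gives p(v) up to the partition function (eq. 2.2)
pv = np.array([sum(np.exp(-energy(v, h)) for h in states_h) for v in states_v])
pv /= pv.sum()

# toy data distribution q(v): all mass on two patterns
q = np.zeros(len(states_v))
q[0] = q[-1] = 0.5

# D_KL(q||p) = sum_v q(v) ln(q(v)/p(v))  (eq. 2.3); 0 ln 0 terms are dropped
kld = sum(qi * np.log(qi / pi) for qi, pi in zip(q, pv) if qi > 0)
```

For realistic sizes the sums over $2^M$ and $2^N$ states are intractable, which is exactly why the sampling approximations discussed below are needed.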
Hence, learning of the RBM is performed by updating the RBM parameters $\xi$ so as to decrease the KLD. The gradient descent method is often employed to decrease the KLD as
$$\xi^{s+1} = \xi^{s} - \lambda\, \nabla_\xi D_{\mathrm{KL}}(q \| p)\big|_{\xi = \xi^s},$$
(2.4)
where $\xi^s$ and $\xi^{s+1}$ denote the RBM parameters at the $s$th and $(s+1)$th steps of the learning process, respectively. The learning rate is denoted by $\lambda\,(>0)$, and $\nabla_\xi D_{\mathrm{KL}}(q\|p)|_{\xi=\xi^s}$ denotes the gradient of the KLD with respect to $\xi$ at the $s$th step. The gradients with respect to $b_i$, $c_j$, and $w_{ij}$ can be written as
$$\frac{\partial D}{\partial b_i} = -\sum_v v_i\, q(v) + \langle v_i \rangle_p,$$
(2.5)
$$\frac{\partial D}{\partial c_j} = -\sum_v q(v)\, p(h_j = 1 | v) + \langle h_j \rangle_p,$$
(2.6)
$$\frac{\partial D}{\partial w_{ij}} = -\sum_v v_i\, q(v)\, p(h_j = 1 | v) + \langle v_i h_j \rangle_p,$$
(2.7)
where $D_{\mathrm{KL}}(q\|p)$ is abbreviated as $D$ and the expectation value with respect to $p(v,h)$ as $\langle \cdot \rangle_p$. The conditional probability, $p(h_j|v)$, is given by
$$p(h_j | v) = \frac{e^{(c_j + \sum_i v_i w_{ij}) h_j}}{1 + e^{c_j + \sum_i v_i w_{ij}}}.$$
(2.8)
If $D$ and $\nabla_\xi D$ could be obtained exactly, the RBM would reach some local minimum of the KLD through parameter updates. However, neither of them can be calculated, since they contain not only the unknown probability $q(v)$ but also sums over the large state space of the RBM. Thus, in equations 2.5 to 2.7, one approximates $q(v)$ by the empirical distribution or, more practically, by a mini-batch, that is, samples drawn from the empirical distribution. One also evaluates the expectation values with respect to $p(v,h)$, which are computationally expensive, by using realizations obtained from Gibbs sampling: for example, contrastive divergence (CD; Hinton, 2002), persistent CD (PCD; Tieleman, 2008), fast PCD (Tieleman & Hinton, 2009), and block Gibbs sampling with tempered transition (Salakhutdinov, 2009) or with parallel tempering (Cho, Raiko, & Ilin, 2010; Desjardins, Courville, Bengio, Vincent, & Delalleau, 2010). Block Gibbs sampling in the RBM efficiently updates the configuration $(v,h)$ by repeatedly using the conditional probabilities,
$$p(h|v) = \prod_j p(h_j|v) = \prod_j \frac{e^{(c_j + \sum_i v_i w_{ij}) h_j}}{1 + e^{c_j + \sum_i v_i w_{ij}}},$$
(2.9)
$$p(v|h) = \prod_i p(v_i|h) = \prod_i \frac{e^{(b_i + \sum_j h_j w_{ij}) v_i}}{1 + e^{b_i + \sum_j h_j w_{ij}}},$$
(2.10)
as transition matrices. In many cases, CD and PCD employ only a few block Gibbs sampling steps. In addition to $\nabla_\xi D$, the KLD itself, which represents the performance of the RBM, is also intractable. Therefore, in order to monitor learning progress, a different quantity that correlates with the KLD to a certain degree is employed instead: for example, the reconstruction error (Bengio, Lamblin, Popovici, & Larochelle, 2007; Hinton, 2012; Taylor, Hinton, & Roweis, 2007), the product of the two probability ratios (Buchaca, Romero, Mazzanti, & Delgado, 2013), and the likelihood of a validation set obtained by tracking the partition function (Desjardins, Bengio, & Courville, 2011).
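As a concrete illustration, a single CD-1 gradient update using the conditional distributions above can be sketched as follows. The mini-batch here is random stand-in data, and all sizes and rates are illustrative assumptions, not the settings used later in section 4.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, batch = 6, 4, 50
b, c = np.zeros(M), np.zeros(N)
W = rng.normal(0, 0.01, (M, N))

sig = lambda x: 1.0 / (1.0 + np.exp(-x))

def sample_h(v):                     # p(h|v), eq. 2.9
    p = sig(c + v @ W)
    return p, (rng.random(p.shape) < p).astype(float)

def sample_v(h):                     # p(v|h), eq. 2.10
    p = sig(b + h @ W.T)
    return p, (rng.random(p.shape) < p).astype(float)

v_data = (rng.random((batch, M)) < 0.5).astype(float)   # stand-in mini-batch

# positive phase: clamp the data; negative phase: one block Gibbs step (CD-1)
ph_data, h = sample_h(v_data)
_, v_model = sample_v(h)
ph_model, _ = sample_h(v_model)

grad_b = v_model.mean(0) - v_data.mean(0)                      # eq. 2.5, estimate
grad_c = ph_model.mean(0) - ph_data.mean(0)                    # eq. 2.6, estimate
grad_W = (v_model.T @ ph_model - v_data.T @ ph_data) / batch   # eq. 2.7, estimate

lam = 1e-2
b, c, W = b - lam * grad_b, c - lam * grad_c, W - lam * grad_W  # eq. 2.4
```

PCD differs only in that the negative-phase chain is persisted across updates rather than restarted from the data.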

3  Removal of Hidden Units

3.1  Removal Cost and Its Gradient

The goal of this letter is not to propose a new method for optimizing the KLD but to decrease the number of hidden units while avoiding an increase in the KLD. Suppose an RBM achieves, if not optimal, sufficient performance after the learning process with a fixed number of hidden units, $N$. Next, we remove the $k$th hidden unit of the RBM so as not to increase the KLD. In order to compare the performance of the two RBMs in which the $k$th hidden unit does or does not exist, we introduce $h_{\setminus k} = (h_1, \ldots, h_{k-1}, h_{k+1}, \ldots, h_N)^\top$ as a configuration of the hidden units other than $h_k$. The energy function and the probability distribution of the RBM after removal are given by
$$E_{\setminus k}(v, h_{\setminus k}) = -\sum_i b_i v_i - \sum_{j \neq k} c_j h_j - \sum_i \sum_{j \neq k} v_i w_{ij} h_j = E(v,h)\big|_{h_k = 0},$$
(3.1)
$$p_{\setminus k}(v, h_{\setminus k}) = \frac{e^{-E_{\setminus k}(v, h_{\setminus k})}}{\sum_{v', h'_{\setminus k}} e^{-E_{\setminus k}(v', h'_{\setminus k})}}.$$
(3.2)
Then we define a removal cost, $C_k$, as the difference between the KLDs after and before removing the $k$th hidden unit,
$$C_k \equiv D_{\mathrm{KL}}(q \| p_{\setminus k}) - D_{\mathrm{KL}}(q \| p) = \sum_v q(v) \ln \frac{q(v)}{p_{\setminus k}(v)} - \sum_v q(v) \ln \frac{q(v)}{p(v)} = -\sum_v q(v) \ln p(h_k = 0 | v) + \ln p(h_k = 0).$$
(3.3)
The details of the calculation, together with the removal cost for several hidden units, are given in appendix A. Thus, if $C_k \le 0$, the $k$th hidden unit can be removed without increasing the KLD.
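Equation 3.3 can be checked numerically on an RBM small enough for exact enumeration. The following sketch (toy sizes and a uniform $q(v)$ are illustrative assumptions) verifies that $C_k$ indeed equals the change in the KLD caused by removing unit $k$.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
M, N = 4, 3
b, c = rng.normal(0, 0.5, M), rng.normal(0, 0.5, N)
W = rng.normal(0, 0.5, (M, N))

states_v = [np.array(s, float) for s in product([0, 1], repeat=M)]
states_h = [np.array(s, float) for s in product([0, 1], repeat=N)]
pvh = np.array([[np.exp(b @ v + c @ h + v @ W @ h) for h in states_h]
                for v in states_v])
pvh /= pvh.sum()                                  # joint p(v,h), eq. 2.2
pv = pvh.sum(1)
q = np.full(len(states_v), 1.0 / len(states_v))   # toy data distribution

def removal_cost(k):
    # C_k = -sum_v q(v) ln p(h_k=0|v) + ln p(h_k=0)   (eq. 3.3)
    mask = np.array([h[k] == 0 for h in states_h])
    p_hk0_given_v = pvh[:, mask].sum(1) / pv
    return -(q * np.log(p_hk0_given_v)).sum() + np.log(pvh[:, mask].sum())

# cross-check against the definition C_k = D_KL(q||p_{\k}) - D_KL(q||p)
def kld(p):
    return (q * np.log(q / p)).sum()

mask0 = np.array([h[0] == 0 for h in states_h])
pv_removed = pvh[:, mask0].sum(1) / pvh[:, mask0].sum()
check = np.isclose(removal_cost(0), kld(pv_removed) - kld(pv))
```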
In most cases, however, there are no hidden units with nonpositive removal costs. Thus, before removing a hidden unit, we first decrease its removal cost without increasing the KLD.2 For this purpose, we naively determine the parameter update at the $s$th step of the removal process, $\Delta\xi^s$, so that both $C_k$ and the KLD decrease at $O(|\Delta\xi^s|)$ (see appendix B),
$$\Delta\xi_i^s = -\nu \cdot \theta\!\left( \frac{\partial D}{\partial \xi_i} \cdot \frac{\partial C_k}{\partial \xi_i} \right) \cdot \frac{\partial D}{\partial \xi_i} \bigg|_{\xi = \xi^s},$$
(3.4)
$$\theta(x) = \begin{cases} 1 & (x \ge 0) \\ 0 & (x < 0), \end{cases}$$
(3.5)
where $\nu\,(>0)$ is the parameter change rate and $\theta(x)$ is the step function. Evaluation of $\nabla_\xi D$ can be performed using equations 2.5 to 2.7, and $\nabla_\xi C_k$ can be written as
$$\frac{\partial C_k}{\partial b_i} = \langle v_i \rangle_{\bar{p}} - \langle v_i \rangle_p,$$
(3.6)
$$\frac{\partial C_k}{\partial c_j} = \sum_v q(v)\, p(h_k = 1 | v)\, \delta_{kj} + \langle h_j \rangle_{\bar{p}} - \langle h_j \rangle_p,$$
(3.7)
$$\frac{\partial C_k}{\partial w_{ij}} = \sum_v q(v)\, v_i\, p(h_k = 1 | v)\, \delta_{kj} + \langle v_i h_j \rangle_{\bar{p}} - \langle v_i h_j \rangle_p,$$
(3.8)
where $\delta_{kj}$ is the Kronecker delta and $\langle \cdot \rangle_p$ and $\langle \cdot \rangle_{\bar{p}}$ denote expectation values with respect to $p(v,h)$ and $\bar{p} \equiv p(v, h_{\setminus k} | h_k = 0)$, respectively. If $C_k \le 0$ is satisfied after the parameter updates, the $k$th hidden unit can be removed without increasing the KLD. When all of the RBM parameters satisfy $\partial D/\partial \xi_i \cdot \partial C_k/\partial \xi_i < 0$, $C_k$ cannot decrease without increasing the KLD, and the parameter update is stopped ($\Delta\xi = 0$).3
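The gated update of equations 3.4 and 3.5 amounts to a per-parameter mask. A minimal sketch with random stand-in gradients (the sizes and values are illustrative assumptions, not quantities computed from an actual RBM):

```python
import numpy as np

rng = np.random.default_rng(3)
n_params = 10
grad_D = rng.normal(size=n_params)    # stand-in for dD/dxi_i
grad_Ck = rng.normal(size=n_params)   # stand-in for dC_k/dxi_i
xi = rng.normal(size=n_params)
nu = 1e-2

# eq. 3.5: theta(x) = 1 for x >= 0, else 0; a parameter moves along -dD/dxi
# only when that move also decreases C_k to first order (eq. 3.4)
gate = (grad_D * grad_Ck >= 0).astype(float)
delta = -nu * gate * grad_D
xi = xi + delta

# first-order changes of D and C_k (appendix B) are then both nonpositive
dD_first_order = grad_D * delta
dCk_first_order = grad_Ck * delta
```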

Note two properties of $C_k$. First, $-C_k$ can be interpreted as the additional cost of a new node. Thus, it may be employed when new nodes are added to an RBM whose performance is insufficient. Second, equation 3.3 can be applied to the Boltzmann machine (BM; Ackley, Hinton, & Sejnowski, 1987), which is expressed as a complete graph consisting of visible and hidden units, and to a special case of the BM called the deep Boltzmann machine (DBM; Salakhutdinov & Hinton, 2009), which has hierarchical hidden layers with neighboring interlayer connections. However, in these cases, calculation of the conditional probability, $p(h_k = 0 | v)$, and of the gradients with respect to the model parameters is computationally expensive compared to the RBM.

3.2  Practical Removal Procedure

The removal process proposed in section 3.1 preserves the performance when $C_k$, $\nabla_\xi C_k$, and $\nabla_\xi D$ can be evaluated accurately. However, in most cases, $C_k$ and $\nabla_\xi C_k$ are approximated by Gibbs sampling, as is $\nabla_\xi D$. Thus, in order to account for the variance of Gibbs sampling, we change both the parameter update rule and the removal condition, equations 3.4 and 3.3, into more robust forms.

First, we modify the parameter update rule, equation 3.4, which may increase $D$ for two reasons: the inaccuracy of Gibbs sampling and the contribution of higher-order derivative terms of $O(|\Delta\xi|^2)$. These problems also arise in the learning process. However, even if $D$ increases there, it can decrease again through the update rule, equation 2.4. Since the only difference between equations 2.4 and 3.4 is the step function, similar behavior might be expected in the removal process. Unfortunately, equation 3.4 frequently increases $D$, for the following reason. Since the removal cost is defined as the change in the KLD upon node removal, it can be interpreted as the contribution of the node to the performance. Hence, when the performance increases, removal costs are expected to increase as well. This means that in the RBM parameter space, there are few directions along which both $D$ and $C_k$ decrease. Since the step function in equation 3.4 permits parameter updates solely along these few directions, there are few opportunities to decrease $D$. Therefore, once $D$ increases, it rarely decreases again under equation 3.4, and a successive increase of $D$ occurs. In order to maintain the performance, we probabilistically accept updates that increase $C_k$. That is, we change the step function in equation 3.4, which deterministically gives either 0 or 1, into a random variable, $z_i \in \{0,1\}$. Next, we determine the probability that $z_i$ takes the value 1, that is, the acceptance probability of an update. The modified update rule is required to reduce to equation 3.4 when the Gibbs sampling estimates are exact. For this purpose, we employ the ratio of the mean to the standard deviation and define the modified update rule by
$$\overline{\Delta\xi_i^s} = -\nu\, z_i\, \overline{\partial_i D}\big|_{\xi = \xi^s},$$
(3.9)
$$p(z_i = 1) = \mathrm{sig}\!\left( \frac{\sqrt{S}\, \overline{\partial_i D}}{\overline{\sigma_{D,i}}} \cdot \frac{\sqrt{S}\, \overline{\partial_i C_k}}{\overline{\sigma_{C,i}}} \right),$$
(3.10)
$$\mathrm{sig}(x) = \frac{e^x}{1 + e^x},$$
(3.11)
where $S$ is the number of Gibbs samples, and $\overline{\partial_i D}$ and $\overline{\partial_i C_k}$ represent the sample means of $\partial D/\partial \xi_i$ and $\partial C_k/\partial \xi_i$, respectively. The unbiased standard deviations of $\partial D/\partial \xi_i$ and $\partial C_k/\partial \xi_i$ are denoted by $\overline{\sigma_{D,i}}$ and $\overline{\sigma_{C,i}}$, respectively. As the number of samples increases, equation 3.9 reduces to equation 3.4.4
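The stochastic gate of equations 3.9 to 3.11 can be sketched as follows. The per-sample gradient estimates here are random stand-ins (an illustrative assumption; in practice they come from Gibbs sampling), and we read the sigmoid argument as the product of the two t-like statistics $\sqrt{S}\cdot$mean$/$deviation, so that its magnitude grows with $S$ and the gate approaches the step function of equation 3.4.

```python
import numpy as np

rng = np.random.default_rng(4)
S = 200                                   # number of Gibbs samples
# per-sample estimates of dD/dxi_i and dC_k/dxi_i for three parameters
samples_D = rng.normal(0.5, 1.0, (S, 3))
samples_Ck = rng.normal(-0.3, 1.0, (S, 3))

def sig(x):                               # eq. 3.11
    return 1.0 / (1.0 + np.exp(-x))

mean_D, sd_D = samples_D.mean(0), samples_D.std(0, ddof=1)
mean_Ck, sd_Ck = samples_Ck.mean(0), samples_Ck.std(0, ddof=1)

# eq. 3.10: product of the two t-like statistics; large |argument| makes the
# acceptance probability saturate at 0 or 1, recovering the deterministic gate
t_D = np.sqrt(S) * mean_D / sd_D
t_Ck = np.sqrt(S) * mean_Ck / sd_Ck
p_accept = sig(t_D * t_Ck)

nu = 1e-2
z = (rng.random(3) < p_accept).astype(float)   # z_i in {0,1}
delta = -nu * z * mean_D                       # eq. 3.9
```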
Second, we modify the removal condition, equation 3.3. Since node removal irreversibly decreases the representational power of the RBM, we must carefully verify that $C_k \le 0$ is satisfied. However, since the logarithmic function in the second term of equation 3.3 decreases steeply for $p(h_k = 0) < 1$, a small sampling error in $p(h_k = 0)$ results in a large error in $\ln p(h_k = 0)$, which makes it difficult to evaluate the removal cost accurately by Gibbs sampling. Therefore, we employ an upper bound of $C_k$ as an effective removal cost, $C_k'$:
$$C_k = -\sum_v q(v) \ln p(h_k = 0 | v) + \ln\left[ 1 - p(h_k = 1) \right] \le -\sum_v q(v) \ln p(h_k = 0 | v) - p(h_k = 1) \equiv C_k'.$$
(3.12)
Then consider the approximation of $C_k'$ by Gibbs sampling,
$$\overline{C_k'} \equiv -\frac{1}{S} \sum_{\alpha=1}^{S} \ln p(h_k = 0 | v^\alpha) - \frac{1}{S} \sum_{\alpha=1}^{S} h_k^\alpha,$$
(3.13)
where $\alpha$ is the sample index. Since the samplings from $q(v)$ and $p(v,h)$ are independent, the first and second terms of equation 3.13 are uncorrelated. Thus, when the sample size, $S$, is sufficiently large, the probability distribution of $\overline{C_k'}$ can be approximated by a normal distribution owing to the central limit theorem:
$$\overline{C_k'} \sim \mathcal{N}\!\left( C_k',\ \frac{\sigma_1^2}{S} + \frac{\sigma_2^2}{S} \right),$$
(3.14)
$$\sigma_1^2 = \sum_v q(v) \left[ \ln p(h_k = 0 | v) \right]^2 - \left[ \sum_v q(v) \ln p(h_k = 0 | v) \right]^2,$$
(3.15)
$$\sigma_2^2 = p(h_k = 1) - \left[ p(h_k = 1) \right]^2,$$
(3.16)
where $\mathcal{N}(\mu, \sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$. The unbiased standard deviation of $\overline{C_k'}$ is given by
$$\overline{\sigma_{C_k'}} = \sqrt{ \frac{\overline{\sigma_1^2} + \overline{\sigma_2^2}}{S} },$$
(3.17)
where $\overline{\sigma_1^2}$ and $\overline{\sigma_2^2}$ are the unbiased variances of $\ln p(h_k = 0 | v)$ and $h_k$, respectively. Using $\overline{C_k'}$ and $\overline{\sigma_{C_k'}}$, we change the removal criterion from $C_k \le 0$ into $\overline{C_k'} + a\,\overline{\sigma_{C_k'}} \le 0$, where $a$ tunes the confidence interval of $C_k'$. By increasing $a$, we can decrease the probability that a hidden unit is wrongly removed when its true removal cost is positive, $C_k > 0$. When $\overline{\sigma_{C_k'}}/D$ is not small, such an incorrect removal may harm the performance. Thus, a large $a$ is used to decrease the probability of an incorrect removal.
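The modified criterion can be sketched as follows. The samples of $p(h_k=0|v^\alpha)$ and $h_k^\alpha$ here are synthetic stand-ins (an illustrative assumption); in practice they come from data-clamped and model Gibbs chains, respectively.

```python
import numpy as np

rng = np.random.default_rng(5)
S, a = 1000, 3.0                      # sample size and confidence factor

# stand-in Gibbs samples: p(h_k=0|v^alpha) for data samples v^alpha, and
# binary h_k^alpha drawn from the model chain
p_hk0_given_v = np.clip(rng.beta(9.0, 1.0, S), 1e-12, 1.0)
h_k = (rng.random(S) < 0.1).astype(float)

log_term = -np.log(p_hk0_given_v)
Ck_eff = log_term.mean() - h_k.mean()      # eq. 3.13: estimate of C_k'
# eq. 3.17: unbiased standard deviation of the estimate
sd_Ck_eff = np.sqrt((log_term.var(ddof=1) + h_k.var(ddof=1)) / S)

# remove unit k only when the upper confidence bound is nonpositive
removable = bool(Ck_eff + a * sd_Ck_eff <= 0.0)
```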

In summary, our node removal procedure is as follows (see algorithm 1). First, we remove all hidden units that satisfy the modified removal condition. Then, at each parameter update step, we choose the hidden unit with the smallest removal cost and decrease its cost using equation 3.9 until it can be removed. The source code is available on GitHub at https://github.com/snsiorssb/RBM.
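The whole procedure can be sketched end to end on a toy RBM in which every expectation is computed by exact enumeration instead of Gibbs sampling, so the naive gated rule of equation 3.4 suffices in place of equation 3.9. The sizes, the data distribution $q(v)$, the rate $\nu$, and the step count are all illustrative assumptions.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(6)
M, N0, nu = 3, 4, 0.1
b, c = rng.normal(0, 0.3, M), rng.normal(0, 0.3, N0)
W = rng.normal(0, 0.3, (M, N0))
Vs = np.array(list(product([0, 1], repeat=M)), float)
q = np.ones(len(Vs)); q[[0, -1]] = 9.0; q /= q.sum()   # toy q(v)
sig = lambda x: 1.0 / (1.0 + np.exp(-x))

def step(b, c, W):
    n = len(c)
    Hs = np.array(list(product([0, 1], repeat=n)), float)
    pvh = np.exp((Vs @ b)[:, None] + (Hs @ c)[None, :] + Vs @ W @ Hs.T)
    pvh /= pvh.sum()
    pv, phv = pvh.sum(1), sig(c + Vs @ W)              # p(v); p(h_j=1|v), eq. 2.8
    # removal cost of every hidden unit (eq. 3.3); pick the smallest
    Ak = np.stack([pvh[:, Hs[:, j] == 0].sum(1) for j in range(n)], axis=1)
    C = -(q[:, None] * np.log(Ak / pv[:, None])).sum(0) + np.log(Ak.sum(0))
    k = int(C.argmin())
    if C[k] <= 0:                                      # removal condition
        return b, np.delete(c, k), np.delete(W, k, 1)
    # exact gradients of D (eqs. 2.5-2.7) and of C_k (eqs. 3.6-3.8)
    mask = Hs[:, k] == 0
    pbar = pvh[:, mask] / pvh[:, mask].sum()           # p(v, h_{\k} | h_k=0)
    hbar = pbar @ Hs[mask]
    gDb, gCb = Vs.T @ (pv - q), Vs.T @ (pbar.sum(1) - pv)
    gDc = phv.T @ (pv - q)
    gCc = hbar.sum(0) - pv @ phv
    gCc[k] += q @ phv[:, k]
    gDW = Vs.T @ ((pv - q)[:, None] * phv)
    gCW = Vs.T @ hbar - Vs.T @ (pv[:, None] * phv)
    gCW[:, k] += Vs.T @ (q * phv[:, k])
    # naive gated update, eq. 3.4
    b = b - nu * (gDb * gCb >= 0) * gDb
    c = c - nu * (gDc * gCc >= 0) * gDc
    W = W - nu * (gDW * gCW >= 0) * gDW
    return b, c, W

for _ in range(3000):
    b, c, W = step(b, c, W)
    if len(c) == 1:
        break
```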

4  Numerical Simulation

In this section, we show that the proposed algorithm does not spoil the performance of the RBM, using two different data sets. First, we used the 3×3 Bars-and-Stripes data set (MacKay, 2003; see Figure 2), which is small enough to allow calculation of the exact KLD during the removal processes. Next, we employed the MNIST data set of handwritten digit images (LeCun & Cortes, 1998) and verified that our algorithm also works for realistic-size RBMs.

Figure 2:

Examples of 3×3 Bars-and-Stripes images are shown, which are generated as follows. First, a white square of A×A pixels is prepared. Next, each column of the square is painted black with probability 1/2. Finally, the square is rotated 90 degrees with probability 1/2. For A=3, 14 distinct images can be created.


Since a parameter update after sufficient learning only slightly changes $p(v,h)$, short Markov chains can be considered sufficient for convergence to $p(v,h)$ after each parameter update. Thus, we used PCD (Tieleman, 2008) with $n$-step block Gibbs sampling (PCD-$n$) in both the learning and removal processes, except for the samplings immediately after a node removal. The change of $p(v,h)$ caused by a node removal is expected to be larger than that caused by a parameter update. Hence, PCD-$n$ with small $n$ may not converge to $p(v,h)$ and may fail to sample from it immediately after a node removal. Thus, we carefully performed Gibbs sampling using the tempered transition (Neal, 1996; Salakhutdinov, 2009) at these times. In the tempered transition, we linearly divided the inverse temperature from $\beta_0 = 1$ to $\beta_l = 0.9$ into $l = 100$ intervals. We did not use a validation set for early stopping or hyperparameter searches in the learning and removal processes.

4.1  Bars-and-Stripes

An artificial data set, Bars-and-Stripes, was used to demonstrate that our algorithm works effectively when the data distribution is completely known. Accordingly, we did not divide the data set into training and test sets. First, we trained the RBM with $M = 9$ visible units and $N = 30$ hidden units using PCD-5 and PCD-1 with a batch size of 100 and a fixed learning rate, $\lambda = 10^{-2}$. After 50,000 learning steps, we performed removal processes starting from the same trained RBM with a batch size of 1000 and a fixed parameter change rate, $\nu = 10^{-2}$. During the beginning of the removal process, the typical value of $\overline{\sigma_{C_k'}}/D$ was not small, that is, $\overline{\sigma_{C_k'}}/D \approx 0.1$. Thus, we employed a strict removal criterion, $\overline{C_k'} + 3\,\overline{\sigma_{C_k'}} \le 0$.
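The Bars-and-Stripes data set described in Figure 2 can be generated in a few lines (the sample count and seed are illustrative assumptions):

```python
import numpy as np

def bars_and_stripes(A, n_samples, rng):
    # paint each column black (1) with probability 1/2, then rotate the
    # whole square by 90 degrees with probability 1/2 (see Figure 2)
    cols = rng.integers(0, 2, (n_samples, 1, A))
    imgs = np.repeat(cols, A, axis=1)                     # vertical bars
    rotate = rng.random(n_samples) < 0.5
    imgs[rotate] = np.rot90(imgs[rotate], axes=(1, 2))    # horizontal stripes
    return imgs.reshape(n_samples, A * A)

rng = np.random.default_rng(7)
data = bars_and_stripes(3, 1000, rng)
n_distinct = len({tuple(v) for v in data})
# 2^3 bar patterns + 2^3 stripe patterns - 2 shared (all-black, all-white) = 14
```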

The results are shown in Figures 3, 4, and 5. We stopped the removal processes after 10 million steps in Figure 3 and after 5 million steps in Figures 4 and 5. The removal procedure employing PCD-5 slowly decreases $N$ with only small fluctuations of the KLD in all five trials (see Figure 3). In particular, the removal cost in Figure 3 shows that if the hidden unit with the smallest removal cost were removed before its cost was decreased, the KLD would increase approximately sevenfold. This result clearly shows that the update rule, equation 3.9, is useful for maintaining performance during the removal processes. The removal procedure employing PCD-1 decreases $N$ more rapidly while approximately preserving the KLD in six of eight trials (see Figure 4), although some sharp peaks appear in the KLD after node removals. However, two of the eight trials that employed PCD-1 failed to preserve the KLD (see Figure 5).

Figure 3:

The number of hidden units N (top), KLD (middle), and smallest removal cost (bottom) are shown as functions of the number of removal steps. The 3×3 Bars-and-Stripes data set was employed. PCD-5 was used for block Gibbs sampling. Each color corresponds to a different trial.


Figure 4:

The number of hidden units N (top) and the KLD (bottom) are shown as functions of the number of removal steps. The 3×3 Bars-and-Stripes data set was employed. PCD-1 was used for block Gibbs sampling. Each color corresponds to a different trial.


Figure 5:

Two trials failed to preserve the KLD in the case of PCD-1. Large fluctuations of the KLD appear immediately after node removal (green) and long after it (blue).


First, we discuss the sharp peaks observed in Figure 4, which result from inaccurate estimates of $C_k'$ or $\overline{\Delta\xi}$. In order to distinguish between these two causes, we enlarge the peaks in the change of the KLD (see Figure 6) and find that the peaks were caused by the failure of Gibbs sampling in the parameter updates immediately after node removals rather than by the node removals themselves. This behavior supports the assumption that the change of $p(v,h)$ caused by node removal can be large and can lead to a failure of Gibbs sampling. Nevertheless, owing to the tempered transition, most of the parameter updates after node removals produced only small peaks in Figure 4.

Figure 6:

The peaks after the three-millionth step (blue line) and before the four-millionth step (cyan line) in Figure 4 are enlarged. These figures show that the node removals themselves slightly decreased the KLD, whereas the parameter updates immediately following removal caused the increases in the KLD.


Next, we discuss the large fluctuations of the KLD in Figure 5. Failure of Gibbs sampling through parameter updates is expected to occur more frequently as the removal process continues, for the same reason as in the learning process (Desjardins et al., 2010; Fischer & Igel, 2010). The problem in the learning process can be understood as follows. At the beginning of the learning process, the RBM parameters are approximately zero, and $p(v)$ is almost a uniform distribution. As learning proceeds, each component of $\xi$ is expected to move away from zero in order to fit $p(v)$ to the data distribution, $q(v)$. In the removal process, the components of $\xi$ are also expected to move away from zero so that the remaining system compensates for the roles of the removed hidden units. As one can see from equations 2.9 and 2.10, the transition matrices used in MCMC, $p(h|v)$ and $p(v|h)$, take values close to either 0 or 1 in the region where $|\xi|$ is large. Therefore, block Gibbs sampling behaves almost deterministically. Hence, dependence on the initial condition persists for a long time or, equivalently, it takes a long time to converge to $p(v,h)$ even after a one-step parameter update in the large-$|\xi|$ region. Thus, the model distribution after a parameter update, from which we should sample, may be quite different from the probability distribution reached after a few block Gibbs sampling steps. As a result, the parameters are updated using inaccurate Gibbs samples. If these deviations are corrected by subsequent parameter updates, the KLD decreases again. However, if the failure of Gibbs sampling continues for a long time, the KLD fluctuates drastically. Figure 5 shows that such a drastic increase in the KLD can emerge not only immediately after a node removal (green line) but also much later (blue line).
Therefore, in order to prevent the problem resulting from a long convergence time of the block Gibbs sampling, the removal process should be stopped at some point as with the learning process.

4.2  MNIST

We used 60,000 out of the 70,000 MNIST images for the evaluation of $C_k$, $\nabla_\xi C_k$, and $\nabla_\xi D$ in the learning and removal processes. Each pixel value was probabilistically set to 1 in proportion to its intensity (Salakhutdinov & Murray, 2008; Tieleman, 2008). We first trained the RBM with $M = 784$ visible units and $N = 500$ hidden units using PCD-1 with a batch size of 1,000 and a fixed learning rate, $\lambda = 10^{-2}$. After 200,000 learning steps, we performed the removal processes starting from the same trained RBM with a batch size of 1,000 and a fixed parameter change rate, $\nu = 10^{-2}$. In this case, the typical value of $\overline{\sigma_{C_k'}}/D$ at the first removal step was small, that is, $\overline{\sigma_{C_k'}}/D \approx 10^{-4}$. Thus, we employed $\overline{C_k'} + \overline{\sigma_{C_k'}} \le 0$ as the removal criterion in order to remove hidden units quickly under the restriction that the removals do not drastically decrease the performance.

As mentioned in section 2, the KLD cannot be evaluated owing to the unknown probability $q(v)$ and the large state space of the RBM. Thus, we employed an alternative evaluation criterion, the KLD of $p(v)$ from the empirical distribution of samples generated from the test set, $q_d(v)$,
$$\tilde{D} \equiv D_{\mathrm{KL}}(q_d \| p) = \sum_v q_d(v) \ln q_d(v) + \ln Z - \sum_v q_d(v) \left[ \sum_i b_i v_i + \sum_j \ln\left( 1 + e^{c_j + \sum_i v_i w_{ij}} \right) \right],$$
(4.1)
where $Z$ is the normalization constant of $p(v)$, which we evaluated by annealed importance sampling (AIS; Neal, 2001). In the AIS, we used 100 samples and linearly divided the inverse temperature from $\beta = 0$ to $\beta = 1$ into 10,000 intervals. Since the evaluation of $Z$ by AIS takes a long time, we calculated $\tilde{D}$ only every 50,000 steps. Between evaluations of $\tilde{D}$, we employed another evaluation criterion, the reconstruction error, for reference. The reconstruction error, $R$, can be easily calculated and is widely used to roughly estimate the performance of the RBM (Bengio et al., 2007; Hinton, 2012; Taylor et al., 2007):
$$R = -\frac{1}{S} \sum_{\alpha=1}^{S} \sum_{i=1}^{M} \left[ v_i^\alpha \ln \tilde{v}_i^\alpha + (1 - v_i^\alpha) \ln(1 - \tilde{v}_i^\alpha) \right],$$
(4.2)
$$\tilde{v}_i^\alpha = \frac{e^{b_i + \sum_j w_{ij} \tilde{h}_j^\alpha}}{1 + e^{b_i + \sum_j w_{ij} \tilde{h}_j^\alpha}},$$
(4.3)
$$\tilde{h}_j^\alpha = \frac{e^{c_j + \sum_i v_i^\alpha w_{ij}}}{1 + e^{c_j + \sum_i v_i^\alpha w_{ij}}},$$
(4.4)
where $\alpha$ denotes the index within a mini-batch of size $S$ and $v^\alpha$ is a sample from the training set.
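The reconstruction error of equations 4.2 to 4.4 is cheap to compute. A sketch with random stand-in parameters and data (the sizes match the MNIST setting, but the values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
M, N, S = 784, 500, 100               # MNIST-like sizes, stand-in parameters
b, c = np.zeros(M), np.zeros(N)
W = rng.normal(0, 0.01, (M, N))

def sig(x):
    return 1.0 / (1.0 + np.exp(-x))

v = (rng.random((S, M)) < 0.5).astype(float)   # stand-in binary mini-batch

h_tilde = sig(c + v @ W)              # eq. 4.4: p(h_j=1|v^alpha)
v_tilde = sig(b + h_tilde @ W.T)      # eq. 4.3: reconstructed pixel means
eps = 1e-12                           # guard against log(0)
R = -np.mean(np.sum(v * np.log(v_tilde + eps)
                    + (1 - v) * np.log(1 - v_tilde + eps), axis=1))  # eq. 4.2
```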

The progress of the removal processes is shown in Figure 7, and samples of the visible variables at the beginning and the end of the removal processes are presented in Figure 8. From the behavior of $N$, $\tilde{D}$, and $R$ in Figure 7, it can be seen that also in a realistic-size RBM, our algorithm decreases the number of hidden units while avoiding a drastic increase in the KLD.5 We stopped the three removal processes after 800,000 steps, at which point the RBMs had been compressed to $N \approx 400$. The number of removal steps is much larger than the number of learning steps. However, this is not a defect of our algorithm, since our motivation is not to compress the RBM quickly but to preserve its performance during the removal process. As a reference for the performance of the compressed RBMs, we trained an RBM with $N = 400$ using the same settings employed in the learning of the RBM with $N = 500$. The performance of this RBM was $\tilde{D} = 78.0 \pm 0.3$ (where $\pm$ indicates the $1\sigma$ confidence interval), which is almost the same as that of the RBMs after the removal process. This result suggests that our algorithm does not harm the performance, although we did not highly optimize the learning processes for the RBMs with $N = 400$ and $N = 500$. The gradual increase of the upper envelope of $C_k'$ in Figure 7 supports our intuitive explanation that the contribution of the remaining hidden units to the performance increases in order to maintain the performance. Thus, in this case as well, an extremely long removal process can increase $|\xi|$ and may lead to the failure of Gibbs sampling. Hence, the removal process should be stopped before a successive increase in the KLD occurs.
Since the KLD cannot be evaluated in large-size RBMs, we recommend monitoring the change in performance by employing one of the evaluation criteria used for the learning process in previous studies, such as the reconstruction error (Bengio et al., 2007; Hinton, 2012; Taylor et al., 2007), the product of the two probability ratios (Buchaca et al., 2013), and the likelihood of a validation set obtained by tracking the partition function (Desjardins et al., 2011).6

Figure 7:

From top to bottom, the number of hidden units $N$, the KLD of $p(v)$ from $q_d(v)$, the reconstruction error $R$, and the effective removal cost are shown as functions of the number of removal steps. MNIST handwritten images were employed as the data set. Each color corresponds to a different trial. In the second panel from the top, the width of the KLD curve represents the $1\sigma$ confidence interval, and the negative log likelihood (NLL), $\mathrm{NLL} \equiv \tilde{D} - \sum_v q_d(v) \ln q_d(v)$, is also shown together with $\tilde{D}$ for the evaluation of the performance.


Figure 8:

MNIST images are shown at the start and ends of the removal processes. (a) Samples of visible configurations at the 0th step of the removal processes. (b–d) Samples of visible configurations at the 800,000th step of the blue, green, and red lines in Figure 7, respectively.


5  Summary and Discussion

In this letter, we aimed to decrease the number of hidden units of the RBM without affecting its performance. For this purpose, we introduced the removal cost of a hidden unit and proposed a method to remove it while avoiding a drastic increase in the KLD. We then applied the proposed method to two different data sets and showed that the KLD was approximately maintained during the removal processes. The increases in the KLD observed in the numerical simulations were caused by the failure of Gibbs sampling, which is also a problem in the learning process. The RBM has long faced the difficulty of accurately obtaining expectation values, which are computationally expensive. Several kinds of Gibbs sampling methods have been proposed (Cho et al., 2010; Desjardins et al., 2010; Hinton, 2002; Salakhutdinov, 2009; Tieleman, 2008; Tieleman & Hinton, 2009) that provide precise estimates and increase the performance of the RBM. However, more accurate Gibbs sampling methods require longer evaluation times. If expectation values can be evaluated precisely, our algorithm is expected to be more effective. We expect that physical implementations of the RBM (Dumoulin, Goodfellow, Courville, & Bengio, 2014) will become an accurate and fast means of evaluating them.

Finally, we comment on another application of the removal cost. If the representational power of the system is sufficient, an arbitrary hidden unit can be safely removed after decreasing its removal cost. Hence, by repeatedly adding and removing hidden units, all of the hidden units of a system can be replaced. Such a procedure may be useful for reforming physically implemented systems that are difficult to copy and must not be halted.

Appendix A:  Derivation of Equation 3.3

For convenience, we introduce two unnormalized probabilities, $p^*(v,h) = e^{-E(v,h)}$ and $p_{\setminus k}^*(v, h_{\setminus k}) = e^{-E(v,h)|_{h_k=0}}$. Then we can obtain $C_k$ as follows:
$$\begin{aligned}
C_k &= D_{\mathrm{KL}}(q \| p_{\setminus k}) - D_{\mathrm{KL}}(q \| p) \\
&= \sum_v q(v) \ln \frac{q(v)}{\sum_{h_{\setminus k}} p_{\setminus k}(v, h_{\setminus k})} - \sum_v q(v) \ln \frac{q(v)}{\sum_h p(v,h)} \\
&= -\sum_v q(v) \ln \sum_{h_{\setminus k}} p_{\setminus k}(v, h_{\setminus k}) + \sum_v q(v) \ln \sum_h p(v,h) \\
&= -\sum_v q(v) \ln \frac{\sum_{h_{\setminus k}} p_{\setminus k}^*(v, h_{\setminus k})}{\sum_{v', h'_{\setminus k}} p_{\setminus k}^*(v', h'_{\setminus k})} + \sum_v q(v) \ln \frac{\sum_h p^*(v,h)}{\sum_{v',h'} p^*(v',h')} \\
&= -\sum_v q(v) \ln \frac{\sum_{h_{\setminus k}} p^*(v,h)|_{h_k=0}}{\sum_{v', h'_{\setminus k}} p^*(v',h')|_{h_k=0}} + \sum_v q(v) \ln \frac{\sum_h p^*(v,h)}{\sum_{v',h'} p^*(v',h')} \\
&= -\sum_v q(v) \ln \frac{\sum_{h_{\setminus k}} p^*(v,h)|_{h_k=0}}{\sum_h p^*(v,h)} + \ln \frac{\sum_{v', h'_{\setminus k}} p^*(v',h')|_{h_k=0}}{\sum_{v',h'} p^*(v',h')} \\
&= -\sum_v q(v) \ln p(h_k = 0 | v) + \ln p(h_k = 0).
\end{aligned}$$
(A.1)
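As a sanity check of equation A.1 (not part of the original letter), the identity can be verified by brute-force enumeration on a toy RBM, assuming the standard {0,1} energy $E(v,h) = -b\cdot v - c\cdot h - v^{\top}Wh$ and arbitrary illustrative parameters:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 3                                    # toy sizes so everything enumerates exactly
b, c = rng.normal(size=M), rng.normal(size=N)
W = rng.normal(size=(M, N))
V = np.array(list(itertools.product([0, 1], repeat=M)))   # all 2^M visible configs

def exact_marginal(b, c, W):
    """Exact p(v) and joint p(v,h) of a {0,1} RBM by full enumeration."""
    H = np.array(list(itertools.product([0, 1], repeat=len(c))))
    P = np.exp((V @ b)[:, None] + (H @ c)[None, :] + V @ W @ H.T)  # unnormalized joint
    return P.sum(axis=1) / P.sum(), P / P.sum()

q = rng.dirichlet(np.ones(2 ** M))             # arbitrary target distribution q(v)
kl = lambda q, p: np.sum(q * np.log(q / p))

k = 1                                           # hidden unit to remove
p_v, P = exact_marginal(b, c, W)
# removing unit k = clamping h_k = 0 = dropping c_k and column k of W
pk_v, _ = exact_marginal(b, np.delete(c, k), np.delete(W, k, axis=1))
C_direct = kl(q, pk_v) - kl(q, p_v)            # left-hand side of (A.1)

# right-hand side of (A.1): -sum_v q(v) ln p(h_k=0|v) + ln p(h_k=0)
p_hk0_given_v = 1.0 / (1.0 + np.exp(c[k] + V @ W[:, k]))   # 1 - sigmoid(c_k + v.W_k)
H = np.array(list(itertools.product([0, 1], repeat=N)))
p_hk0 = P[:, H[:, k] == 0].sum()
C_formula = -np.sum(q * np.log(p_hk0_given_v)) + np.log(p_hk0)

assert np.isclose(C_direct, C_formula)
```

The check works because $p_k(v) = p(v)\,p(h_k=0\,|\,v)/p(h_k=0)$, which is exactly the relation the derivation establishes.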
Next, consider the simultaneous removal of several hidden units. Let $\boldsymbol{k} = (k_1, \ldots, k_r)$ denote the indices of the hidden units to be removed, and define $p_{\boldsymbol{k}}(v)$ as the probability distribution after removal of these hidden units. Following a calculation similar to the above, we obtain the removal cost for several hidden units:
\[
C_{\boldsymbol{k}} \equiv D_{\mathrm{KL}}(q\,\|\,p_{\boldsymbol{k}}) - D_{\mathrm{KL}}(q\,\|\,p) = -\sum_v q(v)\ln p(h_{k_1} = \cdots = h_{k_r} = 0\,|\,v) + \ln p(h_{k_1} = \cdots = h_{k_r} = 0).
\tag{A.2}
\]
In the case of the RBM, the hidden units are conditionally independent given $v$, so the first term of equation A.2 can be simplified as
\[
C_{\boldsymbol{k}} = D_{\mathrm{KL}}(q\,\|\,p_{\boldsymbol{k}}) - D_{\mathrm{KL}}(q\,\|\,p) = -\sum_v q(v)\sum_{\alpha=1}^{r}\ln p(h_{k_\alpha} = 0\,|\,v) + \ln p(h_{k_1} = \cdots = h_{k_r} = 0).
\tag{A.3}
\]
This removal cost can be used to minimize the size of the RBM. Suppose $D_0$ is the KLD to be preserved. If some set of parameters $\xi$ satisfies $C_{\boldsymbol{k}} = 0$ and $D = D_0$, then the hidden units whose indices are in $\boldsymbol{k}$ can be removed simultaneously without changing the KLD. Furthermore, if one can find a set $\xi$ that allows as many hidden units as possible to be removed, the size of the RBM is minimized. However, finding such a set of parameters is a difficult problem.
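The step from equation A.2 to A.3 rests on the conditional independence of the hidden units given $v$ in the RBM, $p(h_{k_1} = \cdots = h_{k_r} = 0\,|\,v) = \prod_{\alpha} p(h_{k_\alpha} = 0\,|\,v)$. A small enumeration sketch (hypothetical toy parameters, not from the letter) confirms this factorization:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
M, N = 3, 4
b, c, W = rng.normal(size=M), rng.normal(size=N), rng.normal(size=(M, N))
V = np.array(list(itertools.product([0, 1], repeat=M)))
H = np.array(list(itertools.product([0, 1], repeat=N)))

# exact joint p(v,h) of the {0,1} RBM by enumeration
P = np.exp((V @ b)[:, None] + (H @ c)[None, :] + V @ W @ H.T)
P /= P.sum()
p_v = P.sum(axis=1)

ks = [0, 2]                                    # hidden units removed simultaneously
mask = np.all(H[:, ks] == 0, axis=1)           # hidden configs with h_{k1} = h_{k2} = 0
p_joint0_given_v = P[:, mask].sum(axis=1) / p_v          # p(h_{k1} = h_{k2} = 0 | v)
prod = np.prod([1.0 / (1.0 + np.exp(c[k] + V @ W[:, k])) for k in ks], axis=0)

assert np.allclose(p_joint0_given_v, prod)     # conditional independence in the RBM
```

The second term of equation A.3 does not factorize, since the hidden units are not marginally independent; only the conditional term simplifies.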

Appendix B:  Change of D and Ck by the Naive Update Rule, Equation 3.4

In this appendix, we show that the naive update rule, equation 3.4, decreases both $D$ and $C_k$ at $O(|\Delta\xi|)$. The changes of $D$ and $C_k$ under equation 3.4 at $O(|\Delta\xi|)$ are given by
\[
\frac{\partial D}{\partial \xi_i}\Delta\xi_i = -\nu\,\theta\!\left(\frac{\partial D}{\partial \xi_i}\frac{\partial C_k}{\partial \xi_i}\right)\left(\frac{\partial D}{\partial \xi_i}\right)^{\!2},
\tag{B.1}
\]
\[
\frac{\partial C_k}{\partial \xi_i}\Delta\xi_i = -\nu\,\theta\!\left(\frac{\partial D}{\partial \xi_i}\frac{\partial C_k}{\partial \xi_i}\right)\frac{\partial D}{\partial \xi_i}\frac{\partial C_k}{\partial \xi_i},
\tag{B.2}
\]
where $\theta$ is the Heaviside step function. In the case of $\partial D/\partial \xi_i \cdot \partial C_k/\partial \xi_i \geq 0$, equations B.1 and B.2 become
\[
\frac{\partial D}{\partial \xi_i}\Delta\xi_i = -\nu\left(\frac{\partial D}{\partial \xi_i}\right)^{\!2} \leq 0,
\tag{B.3}
\]
\[
\frac{\partial C_k}{\partial \xi_i}\Delta\xi_i = -\nu\,\frac{\partial D}{\partial \xi_i}\frac{\partial C_k}{\partial \xi_i} \leq 0,
\tag{B.4}
\]
and in the case of $\partial D/\partial \xi_i \cdot \partial C_k/\partial \xi_i < 0$, equations B.1 and B.2 become
\[
\frac{\partial D}{\partial \xi_i}\Delta\xi_i = 0,
\tag{B.5}
\]
\[
\frac{\partial C_k}{\partial \xi_i}\Delta\xi_i = 0.
\tag{B.6}
\]
In both cases, equations B.1 and B.2 take nonpositive values. Thus, this update rule decreases both $D$ and $C_k$ at $O(|\Delta\xi|)$.
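Assuming equation 3.4 has the gated form $\Delta\xi_i = -\nu\,\theta(\partial D/\partial\xi_i \cdot \partial C_k/\partial\xi_i)\,\partial D/\partial\xi_i$ that the derivation above implies, the componentwise sign argument can be checked numerically on hypothetical gradient values:

```python
import numpy as np

def naive_update(grad_D, grad_C, nu=0.1):
    """Gated gradient step: move along -dD/dxi only where dD and dC_k agree in sign."""
    gate = (grad_D * grad_C >= 0).astype(float)   # Heaviside theta
    return -nu * gate * grad_D

rng = np.random.default_rng(2)
gD, gC = rng.normal(size=10), rng.normal(size=10)  # hypothetical gradients of D and C_k
dxi = naive_update(gD, gC)

# first-order changes of D and C_k are both nonpositive, component by component,
# matching equations B.3-B.6
assert np.all(gD * dxi <= 1e-12)
assert np.all(gC * dxi <= 1e-12)
```

Where the gate is open, $\partial D/\partial\xi_i\,\Delta\xi_i = -\nu(\partial D/\partial\xi_i)^2 \leq 0$ and the $C_k$ change shares the sign constraint of the gate; where it is closed, both changes vanish.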

Notes

1

The RBM whose visible and hidden units take values $v' \in \{-1,1\}^M$ and $h' \in \{-1,1\}^N$ can be related to the RBM whose units take values $v \in \{0,1\}^M$ and $h \in \{0,1\}^N$ by changing the parameters as $W' = W/4$, $b'_i = b_i/2 + \sum_j w_{ij}/4$, and $c'_j = c_j/2 + \sum_i w_{ij}/4$, where $b'$, $c'$, and $W'$ are the biases and the weight matrix of the RBM whose nodes take values in $\{-1,1\}$.
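The correspondence in this note can be verified by enumerating both models at a toy size. The sketch below (not from the letter; illustrative random parameters) checks that the two RBMs assign identical probabilities to corresponding configurations under $v \mapsto 2v - 1$, $h \mapsto 2h - 1$:

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
M, N = 3, 2
b, c, W = rng.normal(size=M), rng.normal(size=N), rng.normal(size=(M, N))

# converted parameters for the {-1,1}-valued RBM
Wp = W / 4
bp = b / 2 + W.sum(axis=1) / 4
cp = c / 2 + W.sum(axis=0) / 4

def dist(b, c, W, vals):
    """Exact distribution over all (v,h) configurations with the given node values."""
    V = np.array(list(itertools.product(vals, repeat=M)))
    H = np.array(list(itertools.product(vals, repeat=N)))
    P = np.exp((V @ b)[:, None] + (H @ c)[None, :] + V @ W @ H.T)
    return P / P.sum()

# itertools.product orders 0 before 1 and -1 before 1, so row i of one table
# corresponds to row i of the other under v -> 2v - 1
assert np.allclose(dist(b, c, W, [0, 1]), dist(bp, cp, Wp, [-1, 1]))
```

The equality holds because substituting $v = (v'+1)/2$, $h = (h'+1)/2$ into the energy changes it only by a configuration-independent constant, which cancels in the normalization.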

2

As explained in appendix A, minimizing the size of the RBM is a difficult problem. Thus, in this letter, hidden units are removed individually in a greedy fashion.

3

For $\partial D/\partial\xi = 0$, which seldom occurs in numerical simulations, we employ higher-order derivatives of $D$ and seek a direction along which both $C_k$ and $D$ decrease. By restricting the number of parameters to be updated, one can alleviate the computational cost caused by the large number of elements of the higher-order derivatives.

4

When zero divided by zero appeared owing to rounding error, we accepted the update by setting $z_i = 1$ in the numerical simulations in section 4.

5

Figure 7 shows that an increase in the reconstruction error does not imply an increase in the KLD. Nevertheless, the reconstruction error may be used as a stopping criterion because it can be easily calculated.

6

Tracking the partition function requires parallel tempering for Gibbs sampling instead of CD or PCD.

Acknowledgments

This research was supported by JSPS KAKENHI grant 15H00800.

References

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1987). A learning algorithm for Boltzmann machines. In M. A. Fischler & O. Firschein (Eds.), Readings in computer vision (pp. 522–533). Amsterdam: Elsevier.

Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798–1828.

Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. In B. Schölkopf, J. C. Platt, & T. Hoffman (Eds.), Advances in neural information processing systems, 19 (pp. 153–160). Cambridge, MA: MIT Press.

Berglund, M., Raiko, T., & Cho, K. (2015). Measuring the usefulness of hidden units in Boltzmann machines with mutual information. Neural Networks, 64, 12–18.

Buchaca, D., Romero, E., Mazzanti, F., & Delgado, J. (2013). Stopping criteria in contrastive divergence: Alternatives to the reconstruction error. arXiv:1312.6062.

Carleo, G., & Troyer, M. (2017). Solving the quantum many-body problem with artificial neural networks. Science, 355(6325), 602–606.

Cheng, Y., Wang, D., Zhou, P., & Zhang, T. (2017). A survey of model compression and acceleration for deep neural networks. arXiv:1710.09282.

Cho, K., Raiko, T., & Ilin, A. (2010). Parallel tempering is efficient for learning restricted Boltzmann machines. In Proceedings of the 2010 International Joint Conference on Neural Networks. Piscataway, NJ: IEEE.

Christopher, M. B. (2016). Pattern recognition and machine learning. New York: Springer-Verlag.

Desjardins, G., Bengio, Y., & Courville, A. C. (2011). On tracking the partition function. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 24 (pp. 2501–2509). Red Hook, NY: Curran.

Desjardins, G., Courville, A., Bengio, Y., Vincent, P., & Delalleau, O. (2010). Tempered Markov chain Monte Carlo for training of restricted Boltzmann machines. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (pp. 145–152).

Dumoulin, V., Goodfellow, I. J., Courville, A. C., & Bengio, Y. (2014). On the challenges of physical implementations of RBMs. In Proceedings of the 28th AAAI Conference on Artificial Intelligence (pp. 1199–1205).

Fischer, A., & Igel, C. (2010). Empirical analysis of the divergence of Gibbs sampling based learning algorithms for restricted Boltzmann machines. In Proceedings of the International Conference on Artificial Neural Networks (pp. 208–217). Berlin: Springer.

Fischer, A., & Igel, C. (2012). An introduction to restricted Boltzmann machines. In Proceedings of the Iberoamerican Congress on Pattern Recognition (pp. 14–36). Berlin: Springer.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 27 (pp. 2672–2680). Red Hook, NY: Curran.

Guo, Y., Yao, A., & Chen, Y. (2016). Dynamic network surgery for efficient DNNs. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems, 29 (pp. 1379–1387). Red Hook, NY: Curran.

Han, S., Pool, J., Tran, J., & Dally, W. (2015). Learning both weights and connections for efficient neural network. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems, 28 (pp. 1135–1143). Red Hook, NY: Curran.

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778). Piscataway, NJ: IEEE.

Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., … Zhou, Y. (2017). Deep learning scaling is predictable, empirically. arXiv:1712.00409.

Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771–1800.

Hinton, G. E. (2012). A practical guide to training restricted Boltzmann machines. In G. Montavon, G. B. Orr, & K.-R. Müller (Eds.), Neural networks: Tricks of the trade (pp. 599–619). New York: Springer.

Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527–1554.

Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507.

Larochelle, H., & Bengio, Y. (2008). Classification using discriminative restricted Boltzmann machines. In Proceedings of the 25th International Conference on Machine Learning (pp. 536–543). Madison, WI: Omnipress.

LeCun, Y., & Cortes, C. (1998). The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/

Le Roux, N., & Bengio, Y. (2008). Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation, 20(6), 1631–1649.

MacKay, D. J. (2003). Information theory, inference and learning algorithms. Cambridge: Cambridge University Press.

Neal, R. M. (1996). Sampling from multimodal distributions using tempered transitions. Statistics and Computing, 6(4), 353–366.

Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11(2), 125–139.

Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv:1609.03499.

Salakhutdinov, R. (2009). Learning in Markov random fields using tempered transitions. In Y. Bengio, D. Schuurmans, J. L. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in neural information processing systems (pp. 1598–1606). Red Hook, NY: Curran.

Salakhutdinov, R., & Hinton, G. E. (2009). Deep Boltzmann machines. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (vol. 3, p. 3).

Salakhutdinov, R., & Larochelle, H. (2010). Efficient learning of deep Boltzmann machines. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (pp. 693–700).

Salakhutdinov, R., & Murray, I. (2008). On the quantitative analysis of deep belief networks. In Proceedings of the 25th International Conference on Machine Learning (pp. 872–879). Madison, WI: Omnipress.

Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory (Tech. Rep.). Boulder: University of Colorado, Department of Computer Science.

Taylor, G. W., Hinton, G. E., & Roweis, S. T. (2007). Modeling human motion using binary latent variables. In B. Schölkopf, J. C. Platt, & T. Hoffman (Eds.), Advances in neural information processing systems, 19 (pp. 1345–1352). Cambridge, MA: MIT Press.

Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning (pp. 1064–1071). Madison, WI: Omnipress.

Tieleman, T., & Hinton, G. (2009). Using fast weights to improve persistent contrastive divergence. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 1033–1040). Madison, WI: Omnipress.

Tubiana, J., & Monasson, R. (2017). Emergence of compositional representations in restricted Boltzmann machines. Physical Review Letters, 118(13), 138301.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems, 30 (pp. 6000–6010). Red Hook, NY: Curran.