## Abstract

In this letter, we propose a method to decrease the number of hidden units of the restricted Boltzmann machine while avoiding a decrease in the performance quantified by the Kullback-Leibler divergence. Our algorithm is then demonstrated by numerical simulations.

## 1 Introduction

The improvement of computer performance enables utilization of the exceedingly high representational powers of neural networks. Deep neural networks have been applied to various types of data (e.g., images, speech, and natural language) and have achieved great success (Bengio, Courville, & Vincent, 2013; Goodfellow et al., 2014; He, Zhang, Ren, & Sun, 2016; Oord et al., 2016; Vaswani et al., 2017) in both discrimination and generation tasks. To increase performance, which stems from the hierarchical structures of neural networks (Hestness et al., 2017), networks have grown larger, and their computational burden has grown accordingly. Thus, the demand for decreasing network size is growing. Various methods have been proposed for compressing discriminative models (Cheng, Wang, Zhou, & Zhang, 2017; Guo, Yao, & Chen, 2016; Han, Pool, Tran, & Dally, 2015). However, compression of generative models (Berglund, Raiko, & Cho, 2015) has rarely been discussed.

Discriminative models provide the probabilities that data are classified into a particular class (Christopher, 2016), and in most cases, their learning requires a supervisor, namely, a data set with classification labels attached by humans. Thus, outputs of discriminative models can be intuitively interpreted by humans. However, some data are difficult for humans to classify properly, and even when classification is possible, hand labeling is laborious. In such cases, generative models with unsupervised learning are effective, since they automatically find the data structure without hand labels by learning the joint probabilities of data and classes. Therefore, it is expected that compression of generative models with unsupervised learning will be required in the future. Furthermore, if the system's performance can be preserved during compression, then the network size can be decreased while it is in use. To approximately maintain performance throughout compression, we consider removing part of the system after decreasing its contribution to the overall performance. Our approach differs from that of previous studies (Berglund et al., 2015; Cheng et al., 2017; Guo et al., 2016; Han et al., 2015), which retrain systems after removing a part that contributes little to their performance.

In this letter, we deal with the restricted Boltzmann machine (RBM; Fischer & Igel, 2012; Smolensky, 1986). The RBM is one of the most important generative models with unsupervised learning, from the viewpoints of not only machine learning history (Bengio et al., 2013) but also its wide applications—for example, generation of new samples, classification of data (Larochelle & Bengio, 2008), feature extraction (Hinton & Salakhutdinov, 2006), pretraining of deep neural networks (Hinton, Osindero, & Teh, 2006; Hinton & Salakhutdinov, 2006; Salakhutdinov & Larochelle, 2010), and solving many-body problems in physics (Carleo & Troyer, 2017; Tubiana & Monasson, 2017). The RBM consists of visible units that represent observables (e.g., pixels of images) and hidden units that express correlations between visible units. An objective of the RBM is to generate plausible data by imitating the probability distribution from which true data are sampled. In this case, the performance of the RBM is quantified by the difference between the probability distribution of data and that of visible variables of the RBM, and it can be expressed by the Kullback-Leibler divergence (KLD). The RBM can exactly reproduce any probability distribution of binary data if it has a sufficient number of hidden units (Le Roux & Bengio, 2008). However, a smaller number of hidden units may be enough to capture the structure of the data. Therefore, in this letter, we aim to practically decrease the number of hidden units while avoiding an increase in the KLD between the model and data distributions (see Figure 1).

The outline of this letter is as follows. In section 2, we give a brief review of the RBM. In section 3, we evaluate the deviation of the KLD associated with node removal and propose a method that decreases the number of hidden units while avoiding an increase in the KLD. Numerical simulations are presented in section 4, and we summarize this letter in section 5. The details of the calculations are given in the appendixes.

## 2 Brief Introduction of the RBM

^{1}We abbreviate all of the RBM parameters, $b$, $c$, and $W$, as $\xi $.
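As a point of reference, the following is a minimal sketch of a binary RBM with units in $\{0,1\}$, assuming the standard energy $E(v,h)=-b^\top v-c^\top h-v^\top Wh$ with visible biases $b$, hidden biases $c$, and weights $W$. The function names and the NumPy implementation are illustrative assumptions and are not taken from the authors' code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, c, W):
    """P(h_j = 1 | v) for every hidden unit; v has shape (batch, M)."""
    return sigmoid(c + v @ W)

def p_v_given_h(h, b, W):
    """P(v_i = 1 | h) for every visible unit; h has shape (batch, N)."""
    return sigmoid(b + h @ W.T)

def block_gibbs(v, b, c, W, n_steps, rng):
    """Run n_steps of block Gibbs sampling starting from visible states v."""
    h = None
    for _ in range(n_steps):
        # Sample all hidden units at once given the visible units ...
        h = (rng.random((v.shape[0], W.shape[1])) < p_h_given_v(v, c, W)).astype(float)
        # ... then all visible units at once given the hidden units.
        v = (rng.random(v.shape) < p_v_given_h(h, b, W)).astype(float)
    return v, h
```

Because the units within a layer are conditionally independent given the other layer, each half-step samples a whole layer in one vectorized operation.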

## 3 Removal of Hidden Units

### 3.1 Removal Cost and Its Gradient

^{2}For this purpose, we naively determine the parameter update at the $s$th step in a removal process, $\Delta \xi_s$, so that both $C_k$ and the KLD decrease at $O(|\Delta \xi_s|)$ (see appendix B),

^{3}

Note two properties of $C_k$. First, $-C_k$ can be interpreted as an additional cost of a new node. Thus, it may be employed when new nodes are added into an RBM whose performance is insufficient. Second, equation 3.3 can be applied to the Boltzmann machine (BM; Ackley, Hinton, & Sejnowski, 1987), which is expressed as a complete graph consisting of visible and hidden units, and a special case of the BM called the deep Boltzmann machine (DBM; Salakhutdinov & Hinton, 2009), which has hierarchical hidden layers with neighboring interlayer connections. However, in these cases, calculation of the conditional probability, $p(h_k=0|v)$, and gradients with respect to the model parameters are computationally expensive compared to the RBM.
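The role of $p(h_k=0|v)$ can be made concrete on a toy model. For units in $\{0,1\}$, deleting hidden unit $k$ leaves the same visible distribution as conditioning on $h_k=0$, so the exact change of the KLD is $\ln p(h_k=0)-\sum_v q(v)\ln p(h_k=0|v)$. The sketch below checks this identity by brute-force enumeration. The energy $E(v,h)=-b^\top v-c^\top h-v^\top Wh$, the function names, and the random toy parameters are assumptions for illustration, and the quantity is not claimed to coincide with the definition of $C_k$ in equation 3.3.

```python
import itertools
import numpy as np

def visible_distribution(b, c, W):
    """Exact p(v) of a {0,1} RBM, enumerating all 2^M visible states."""
    V = np.array(list(itertools.product([0.0, 1.0], repeat=len(b))))
    # Summing out hidden unit j gives a factor 1 + exp(c_j + sum_i v_i W_ij).
    log_unnorm = V @ b + np.log1p(np.exp(c + V @ W)).sum(axis=1)
    p = np.exp(log_unnorm - log_unnorm.max())
    return V, p / p.sum()

def kld(q, p):
    """Kullback-Leibler divergence sum_v q(v) ln(q(v) / p(v))."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

def kld_change_by_clamping(q, b, c, W, k):
    """ln p(h_k=0) - sum_v q(v) ln p(h_k=0|v), using only the full model."""
    V, p = visible_distribution(b, c, W)
    p_hk0_given_v = 1.0 / (1.0 + np.exp(c[k] + V @ W[:, k]))
    p_hk0 = np.sum(p * p_hk0_given_v)
    return float(np.log(p_hk0) - np.sum(q * np.log(p_hk0_given_v)))

# Toy model (hypothetical sizes and parameters).
rng = np.random.default_rng(0)
M, N, k = 3, 2, 1
b, c, W = rng.normal(size=M), rng.normal(size=N), rng.normal(size=(M, N))
q = rng.dirichlet(np.ones(2 ** M))  # arbitrary data distribution over visible states

_, p_full = visible_distribution(b, c, W)
_, p_removed = visible_distribution(b, np.delete(c, k), np.delete(W, k, axis=1))
direct = kld(q, p_removed) - kld(q, p_full)
via_clamping = kld_change_by_clamping(q, b, c, W, k)
print(direct, via_clamping)
```

The two printed numbers agree up to floating-point error, which shows that the effect of deleting a unit can be evaluated from expectations under the current model without retraining.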

### 3.2 Practical Removal Procedure

The removal process proposed in section 3.1 preserves the performance when $C_k$, $\nabla_\xi C_k$, and $\nabla_\xi D$ can be accurately evaluated. However, in most cases, $C_k$ and $\nabla_\xi C_k$ are approximated using Gibbs sampling, as with $\nabla_\xi D$. Thus, in order to reflect the variances of Gibbs sampling, we change both the parameter update rule and removal condition, equations 3.4 and 3.3, into more effective forms.

^{4}

In summary, our node removal procedure is as follows (see algorithm 1). First, we remove all hidden units that satisfy the modified removal condition. Then, at each parameter update step, we choose the smallest removal cost and decrease it using equation 3.9 until a hidden unit can be removed. The source code is available on GitHub at https://github.com/snsiorssb/RBM.
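The control flow summarized above can be sketched as a greedy loop. This is only a skeleton under stated assumptions: `removal_cost`, `can_remove`, and `decrease_cost` are hypothetical callables standing in for the estimated removal cost, the modified removal condition, and one parameter-update step of equation 3.9, none of which is reproduced here.

```python
from typing import Callable, List

def greedy_removal(
    hidden: List[int],
    removal_cost: Callable[[int], float],
    can_remove: Callable[[int], bool],
    decrease_cost: Callable[[int], None],
    max_steps: int,
) -> List[int]:
    """Return the indices of the hidden units that remain after the loop."""
    # First, drop every unit that already satisfies the removal condition.
    hidden = [k for k in hidden if not can_remove(k)]
    for _ in range(max_steps):
        if not hidden:
            break
        # Target the unit that is currently cheapest to remove.
        k = min(hidden, key=removal_cost)
        if can_remove(k):
            hidden.remove(k)
        else:
            # One parameter-update step that lowers the cost of unit k.
            decrease_cost(k)
    return hidden

# Toy stand-ins (hypothetical): costs shrink by 1 per update step.
costs = {0: -1.0, 1: 3.0, 2: 10.0}
remaining = greedy_removal(
    hidden=[0, 1, 2],
    removal_cost=lambda k: costs[k],
    can_remove=lambda k: costs[k] <= 0.0,
    decrease_cost=lambda k: costs.__setitem__(k, costs[k] - 1.0),
    max_steps=5,
)
print(remaining)
```

In a real run, `decrease_cost` would also change the costs of the other units, so the targeted unit may switch between steps; the toy stand-ins ignore this coupling.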

## 4 Numerical Simulation

In this section, we show that the proposed algorithm does not degrade the performance of the RBMs by using two different data sets. First, we used the $3\times 3$ Bars-and-Stripes data set (MacKay, 2003; see Figure 2), which is small enough to allow calculation of the exact KLD during the removal processes. Next, we employed the MNIST data set of handwritten images (LeCun & Cortes, 1998) and verified that our algorithm also works in realistic-size RBMs.
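For concreteness, the $3\times 3$ Bars-and-Stripes patterns can be enumerated as follows. This is a sketch of the usual construction (every row constant, or every column constant); the function name is illustrative, and the probabilities assigned to the patterns in the experiments are not specified here.

```python
import itertools
import numpy as np

def bars_and_stripes(n=3):
    """All distinct n x n patterns in which every row, or every column, is constant."""
    patterns = set()
    for bits in itertools.product([0, 1], repeat=n):
        same_rows = np.tile(np.array(bits), (n, 1))  # all rows identical: constant columns
        same_cols = same_rows.T                      # all columns identical: constant rows
        patterns.add(tuple(int(x) for x in same_rows.ravel()))
        patterns.add(tuple(int(x) for x in same_cols.ravel()))
    return np.array(sorted(patterns))

data = bars_and_stripes(3)
print(data.shape)
```

There are $2^3$ patterns of each orientation, and the all-zero and all-one images are shared, which leaves 14 distinct patterns out of $2^9=512$ possible visible states.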

Since a parameter update after sufficient learning changes $p(v,h)$ only slightly, short Markov chains can be considered sufficient for convergence to $p(v,h)$ after parameter updates. Thus, we used PCD (Tieleman, 2008) with $n$-step block Gibbs sampling (PCD-$n$) in both learning and removal processes, except for samplings immediately after a node removal. A change of $p(v,h)$ caused by node removal is expected to be larger than that caused by parameter updates. Hence, PCD-$n$ with small $n$ may not converge to $p(v,h)$ and may fail to sample from $p(v,h)$ immediately after node removals. Thus, we carefully performed Gibbs sampling using tempered transition (Neal, 1996; Salakhutdinov, 2009) at these times. In tempered transition, we linearly divided the inverse temperature from $\beta_0=1$ to $\beta_1=0.9$ into $l=100$ intervals. We did not use a validation set for early stopping or hyperparameter searches in the learning and removal processes.
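The inverse-temperature ladder described above can be written down directly. The variable names are illustrative; only the endpoints and the number of intervals come from the text.

```python
import numpy as np

beta_0, beta_1, l = 1.0, 0.9, 100

# l intervals require l + 1 grid points, from beta_0 down to beta_1.
betas = np.linspace(beta_0, beta_1, l + 1)
step = betas[1] - betas[0]
print(len(betas), step)
```

Each interval lowers the inverse temperature by 0.001, so neighboring distributions on the ladder differ only slightly.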

### 4.1 Bars-and-Stripes

An artificial data set, Bars-and-Stripes, was used to demonstrate that our algorithm effectively works when the data distribution is completely known. Thus, we did not divide the data set into training and test sets. First, we trained the RBM with $M=9$ visible units and $N=30$ hidden units using PCD-5 and PCD-1 with a batch size of 100 and a fixed learning rate, $\lambda=10^{-2}$. After 50,000 learning steps, we performed removal processes starting from the same trained RBM with a batch size of 1,000 and a fixed parameter change rate, $\nu=10^{-2}$. During the beginning of the removal process, the typical value of $\sigma_{\overline{C_k'}}/D$ was not small, that is, $\sigma_{\overline{C_k'}}/D\sim 0.1$. Thus, we employed a strict removal criterion, $\overline{C_k'}+3\sigma_{\overline{C_k'}}\leq 0$.
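A criterion of this form can be checked numerically as in the sketch below. It assumes that a set of independent estimates of the removal cost is available and that $\sigma_{\overline{C_k'}}$ is the standard error of their mean; both are assumptions made for illustration, as are the function name and the sample values.

```python
import numpy as np

def satisfies_removal_criterion(cost_estimates, z):
    """True if mean + z * (standard error of the mean) <= 0."""
    est = np.asarray(cost_estimates, dtype=float)
    mean = est.mean()
    sem = est.std(ddof=1) / np.sqrt(est.size)
    return bool(mean + z * sem <= 0.0)

# Hypothetical estimates whose mean is negative but noisy.
noisy = [-0.05, 0.10, -0.20, 0.02, -0.12]
print(satisfies_removal_criterion(noisy, z=0.0))
print(satisfies_removal_criterion(noisy, z=3.0))
```

A larger multiplier `z` makes removal more conservative: the same noisy estimates pass with `z=0` but are rejected with `z=3`, which is the trade-off behind choosing the stricter criterion when the relative uncertainty is large.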

The results are shown in Figures 3, 4, and 5. We stopped the removal processes after 10 million steps in Figure 3 and after 5 million steps in Figures 4 and 5. The removal procedure employing PCD-5 slowly decreases $N$ with small fluctuations of the KLD in all five trials (see Figure 3). In particular, the removal cost in Figure 3 shows that if the hidden unit with the smallest removal cost were removed before that cost had been decreased, the KLD would increase approximately sevenfold. This result clearly shows that the update rule, equation 3.9, is useful for maintaining performance during the removal processes. The removal procedure employing PCD-1 decreases $N$ more rapidly while approximately preserving the KLD in six of eight trials (see Figure 4), although some sharp peaks appear in the change of the KLD after node removals. However, two of the eight trials that employed PCD-1 fail to preserve the KLD (see Figure 5).

First, we discuss the sharp peaks observed in Figure 4, which resulted from inaccurate estimates of $C_k'$ or $\overline{\Delta\xi}$. In order to distinguish among them, we enlarge peaks in the change of the KLD (see Figure 6) and find that these peaks were caused by the failure of Gibbs sampling in parameter updates immediately after node removals rather than the node removals themselves. This behavior supports the assumption that the change of $p(v,h)$ caused by node removal can be large and can result in a failure of Gibbs sampling. Nevertheless, owing to the tempered transition, most of the parameter updates after node removal produced rather small peaks in Figure 4.

Next, we discuss large fluctuations of the KLD in Figure 5. Failure of Gibbs sampling through parameter updates is expected to occur more frequently as the removal process continues for the same reason as in the learning process (Desjardins et al., 2010; Fischer & Igel, 2010). The problem in the learning process can be considered to arise as follows. At the beginning of the learning process, the RBM parameters are approximately zero, and $p(v)$ is almost a uniform distribution. As learning proceeds, each component of $\xi$ is expected to move away from zero in order to adjust $p(v)$ to the data distribution, $q(v)$. In the removal process, components of $\xi$ are also expected to move away from zero so that the remaining system compensates for the roles of the removed hidden units. As one can see from equations 2.9 and 2.10, the transition matrices used in MCMC, $p(h|v)$ and $p(v|h)$, have entries close to either 0 or 1 in the region where $|\xi|$ is large. Therefore, block Gibbs sampling behaves almost deterministically. Hence, dependence on the initial condition remains for a long time or, equivalently, it takes a long time to converge to $p(v,h)$ even after a one-step parameter update in the large $|\xi|$ region. Thus, the model distribution after a parameter update, from which we should sample, may be quite different from the probability distribution reached after a few block Gibbs sampling steps. As a result, parameters are updated using inaccurate Gibbs samples. If these deviations are corrected by subsequent parameter updates, the KLD decreases again. However, if the failure of Gibbs sampling continues for a long time, then the KLD fluctuates drastically. Figure 5 shows that such a drastic increase in the KLD can emerge not only immediately after node removal (green line) but also later (blue line). Therefore, in order to prevent the problem resulting from a long convergence time of the block Gibbs sampling, the removal process should be stopped at some point, as with the learning process.

### 4.2 MNIST

We used 60,000 out of 70,000 MNIST images for the evaluation of $C_k$, $\nabla_\xi C_k$, and $\nabla_\xi D$ in the learning and removal processes. Each pixel value was probabilistically set to 1 proportional to its intensity (Salakhutdinov & Murray, 2008; Tieleman, 2008). We first trained the RBM with $M=784$ visible units and $N=500$ hidden units using PCD-1 with a batch size of 1,000 and a fixed learning rate, $\lambda=10^{-2}$. After 200,000 learning steps, we performed the removal processes starting from the same trained RBM with a batch size of 1,000 and a fixed parameter change rate, $\nu=10^{-2}$. In this case, the typical value of $\sigma_{\overline{C_k'}}/D$ at the first removal step is small, that is, $\sigma_{\overline{C_k'}}/D\sim 10^{-4}$. Thus, we employed $\overline{C_k'}+\sigma_{\overline{C_k'}}\leq 0$ as the removal criterion in order to quickly remove hidden units under the restriction that they do not drastically decrease the performance.
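The stochastic binarization of pixel values can be sketched as follows, assuming 8-bit intensities in the range 0 to 255. The function name and the toy image are illustrative.

```python
import numpy as np

def stochastic_binarize(images, rng):
    """Set each pixel to 1 with probability equal to its normalized intensity."""
    intensities = np.asarray(images, dtype=float) / 255.0
    return (rng.random(intensities.shape) < intensities).astype(np.uint8)

rng = np.random.default_rng(0)
toy_image = np.array([[0, 255], [0, 255]])
print(stochastic_binarize(toy_image, rng))
```

Pixels with intensity 0 always map to 0 and pixels with intensity 255 always map to 1; intermediate intensities give a different binary image on each draw.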

The progress of the removal processes is shown in Figure 7, and samples of visible variables at the beginning and the end of the removal processes are presented in Figure 8. The behavior of $N$, $\tilde{D}$, and $R$ in Figure 7 shows that in a realistic-size RBM, our algorithm decreases the number of hidden units while avoiding a drastic increase in the KLD.^{5} We stopped three removal processes after 800,000 steps, and the RBMs were compressed to $N\sim 400$. The number of removal steps is much larger than that of the learning steps. However, this is not a defect of our algorithm, since our motivation is not to quickly compress the RBM but to preserve its performance during the removal process. As a reference for the performance of the compressed RBMs, we trained the RBM with $N=400$ using the same setting employed in the learning of the RBM with $N=500$. The performance of this RBM was $\tilde{D}=78.0\pm 0.3$ (where $\pm$ indicates the $1\sigma$ confidence interval), which is almost the same as the performance of the RBMs after the removal process. This result suggests that our algorithm does not harm the performance, although we did not highly optimize the learning process for the RBMs with $N=400$ and $N=500$. The gradual increase of the upper side of $C_k'$ in Figure 7 supports our intuitive explanation that the contribution of the remaining hidden units to the performance increases in order to maintain the performance. Thus, also in this case, an extremely long removal process can increase $|\xi|$ and may lead to the failure of Gibbs sampling. The removal process should therefore be stopped before a successive increase in the KLD occurs. Since the KLD cannot be evaluated in large-size RBMs, we recommend monitoring the change in performance by employing some evaluation criterion used in the learning process in previous studies, such as the reconstruction error (Bengio et al., 2007; Hinton, 2012; Taylor et al., 2007), the product of the two probabilities ratio (Buchaca et al., 2013), and the likelihood of a validation set obtained by tracking the partition functions (Desjardins et al., 2011).^{6}
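As an example of such a monitor, the sketch below computes one common form of the reconstruction error: a single hidden-layer sample followed by a mean-field reconstruction of the visible layer. It assumes the standard $\{0,1\}$ RBM conditionals; the function name is illustrative, and the definition of $R$ used in Figure 7 may differ.

```python
import numpy as np

def reconstruction_error(v, b, c, W, rng):
    """Mean squared reconstruction error per sample after one v -> h -> v pass."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    # Sample the hidden layer given the data ...
    h = (rng.random((v.shape[0], W.shape[1])) < sigmoid(c + v @ W)).astype(float)
    # ... and reconstruct the visible layer by its conditional mean.
    v_rec = sigmoid(b + h @ W.T)
    return float(np.mean(np.sum((v - v_rec) ** 2, axis=1)))

# Untrained toy model (all parameters zero): every reconstruction is 0.5.
rng = np.random.default_rng(0)
v = (rng.random((10, 4)) < 0.5).astype(float)
err = reconstruction_error(v, np.zeros(4), np.zeros(3), np.zeros((4, 3)), rng)
print(err)
```

The error is cheap to evaluate on each batch, which makes it convenient as a stopping signal, but it is not a proxy for the KLD and should be read only as a warning indicator.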

## 5 Summary and Discussion

In this letter, we aimed to decrease the number of hidden units of the RBM without affecting its performance. For this purpose, we have introduced the removal cost of a hidden unit and have proposed a method to remove it while avoiding a drastic increase in the KLD. Then we have applied the proposed method to two different data sets and have shown that the KLD was approximately maintained during the removal processes. The increase in the KLD observed in the numerical simulations was caused by the failure of Gibbs sampling, which is also a problem in the learning process. A long-standing difficulty with the RBM is that accurate evaluation of expectation values is computationally expensive. Several kinds of Gibbs sampling methods have been proposed (Cho et al., 2010; Desjardins et al., 2010; Hinton, 2002; Salakhutdinov, 2009; Tieleman, 2008; Tieleman & Hinton, 2009), which provide precise estimates and increase the performance of the RBM. However, more accurate Gibbs sampling methods require a longer time for evaluations. If expectation values can be precisely evaluated, then our algorithm is expected to be more effective. We expect that physical implementation of the RBM (Dumoulin, Goodfellow, Courville, & Bengio, 2014) will become an accurate and fast method for their evaluation.

Finally, we comment on another application of the removal cost. If the representational power of the system is sufficient, an arbitrary hidden unit can be safely removed by decreasing its removal cost. Hence, by repeatedly adding and removing hidden units, all hidden units of a system can be replaced. Such a procedure may be useful for reforming physically implemented systems that are difficult to copy and must not be halted.

## Appendix A: Derivation of Equation 3.3

## Appendix B: Change of $D$ and $C_k$ by the Naive Update Rule, Equation 3.4

## Notes

^{1}

The RBM whose visible and hidden units take $v'\in\{-1,1\}^M$ and $h'\in\{-1,1\}^N$ can be related to the RBM that takes $v\in\{0,1\}^M$ and $h\in\{0,1\}^N$ by changing the parameters, $W'=W/4$, $b_i'=b_i/2+\sum_j w_{ij}/4$, and $c_j'=c_j/2+\sum_i w_{ij}/4$, where $b'$, $c'$, and $W'$ are the biases and weight matrix of the RBM whose nodes take $\{-1,1\}$.
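This parameter mapping can be checked numerically on a small model. The sketch below assumes the energy $E=-b^\top v-c^\top h-v^\top Wh$ for both conventions and the correspondence $v'=2v-1$, $h'=2h-1$; the function names and the random toy parameters are illustrative.

```python
import itertools
import numpy as np

def joint_distribution(b, c, W, values):
    """Exact p(v, h) by enumeration, with every unit taking one of `values`."""
    M, N = W.shape
    states = [
        (np.array(v), np.array(h))
        for v in itertools.product(values, repeat=M)
        for h in itertools.product(values, repeat=N)
    ]
    log_w = np.array([b @ v + c @ h + v @ W @ h for v, h in states])
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

def to_pm1(b, c, W):
    """Map {0,1} parameters to the equivalent {-1,1} parameters."""
    return b / 2 + W.sum(axis=1) / 4, c / 2 + W.sum(axis=0) / 4, W / 4

rng = np.random.default_rng(0)
b, c, W = rng.normal(size=3), rng.normal(size=2), rng.normal(size=(3, 2))
p01 = joint_distribution(b, c, W, values=(0, 1))
ppm = joint_distribution(*to_pm1(b, c, W), values=(-1, 1))
print(np.max(np.abs(p01 - ppm)))
```

Both enumerations list states in the same order, so the two probability vectors can be compared entry by entry; their maximum difference is at the level of floating-point error.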

^{2}

As explained in appendix A, minimizing the size of the RBM is a difficult problem. Thus, in this letter, hidden units are removed individually in a greedy fashion.

^{3}

For $\nabla_\xi D=0$, which seldom occurs in numerical simulations, we employ higher-order derivatives of $D$ and seek a direction along which both $C_k$ and $D$ decrease. By restricting the number of parameters to be updated, one can alleviate the computational cost caused by the large number of elements of the higher-order derivatives.

^{4}

When zero divided by zero appears owing to rounding error, we accepted the update by setting $z_i=1$ in the numerical simulations in section 4.

^{5}

Figure 7 shows that an increase in the reconstruction error does not necessarily imply an increase in the KLD. However, the reconstruction error is easy to calculate and may be used as a stopping criterion.

^{6}

Tracking the partition function requires parallel tempering for Gibbs sampling instead of CD or PCD.

## Acknowledgments

This research is supported by JSPS KAKENHI grant 15H00800.