Abstract
The full-span log-linear (FSLL) model introduced in this letter can be considered an $n$th-order Boltzmann machine, where $n$ is the number of all variables in the target system. Let $X = (X_1, \ldots, X_n)$ be finite discrete random variables that can take $|X| = |X_1|\cdots|X_n|$ different values. The FSLL model has $|X| - 1$ parameters and can represent arbitrary positive distributions of $X$. The FSLL model is a highest-order Boltzmann machine; nevertheless, we can compute the dual parameters of the model distribution, which play important roles in exponential families, in $O(|X|\ln|X|)$ time. Furthermore, using properties of the dual parameters of the FSLL model, we can construct an efficient learning algorithm. The FSLL model is limited to small probabilistic models, up to roughly 25 binary variables; however, in this problem domain, the FSLL model flexibly fits various true distributions underlying the training data without any hyperparameter tuning. The experiments showed that the FSLL model successfully learned six training data sets such that $|X| = 2^{20}$ within 1 minute with a laptop PC.
1 Introduction
The main purposes of this letter are as follows:
To introduce the full-span log-linear (FSLL) model and a fast learning algorithm
To demonstrate the performance of the FSLL model with experiments
One disadvantage of the Boltzmann machine is its insufficient ability to represent distributions. The Boltzmann machine has only $n(n+1)/2$ parameters, while the dimension of the function space spanned by the possible distributions of $X$ is $|X| - 1$, where $|X| = |X_1|\cdots|X_n|$ (the $-1$ comes from the constraint $\sum_x p(x) = 1$).
However, introducing higher-order terms leads to an enormous increase in computational cost because the $k$th-order Boltzmann machine has $\sum_{j=1}^{k}\binom{n}{j} = O(n^k)$ parameters.
The FSLL model introduced in this letter can be considered an $n$th-order Boltzmann machine,2 where $n$ is the number of all variables in the target system. The FSLL model has $|X| - 1$ parameters and can represent arbitrary positive distributions. Since the FSLL model is a highest-order Boltzmann machine, its learning might be expected to be very slow; however, we propose a fast learning algorithm. For example, this algorithm can learn a joint distribution of 20 binary variables within 1 minute with a laptop PC.
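To make the comparison concrete, here is a rough count for the setting used later in the experiments, that is, $n = 20$ binary variables (the $k$th-order count is the standard binomial sum, not a formula taken from this letter):

$$
\underbrace{\binom{20}{2} + 20}_{\text{2nd order (BM)}} = 210,
\qquad
\underbrace{\sum_{j=1}^{k}\binom{20}{j}}_{k\text{th order}},
\qquad
\underbrace{2^{20} - 1}_{\text{20th order (FSLL)}} = 1{,}048{,}575,
$$

where the last number equals the dimension of the space of positive distributions of 20 binary variables.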
Since the FSLL model has full degrees of freedom, a regularization mechanism to avoid overfitting is essential. For this purpose, we used a regularization mechanism based on the minimum description length principle (Rissanen, 2007).
The remainder of this letter is organized as follows. In section 2, we present the FSLL model and its fast learning algorithm. In section 3, we demonstrate the performance of the FSLL model by experiment. In section 4, we discuss the advantages and disadvantages of the FSLL model. In section 5, we present the conclusions and extensions of the letter.
2 Full-Span Log-Linear Model
Before introducing the FSLL model, we define the notations used in this letter. A random variable is denoted by a capital letter, such as $X$, and a value that $X$ takes is indicated by a lowercase letter, such as $x$. The symbol $X$ also denotes the set of values that the variable $X$ can take; thus, $|X|$ denotes the number of values that $X$ can take. $E_p[f]$ denotes the expectation of $f$ with distribution $p$, that is, $E_p[f] = \sum_x p(x)f(x)$. The differential operator $\partial/\partial\theta_y$ is abbreviated as $\partial_y$. $\mathcal{P}(X)$ denotes the set of all distributions of $X$, and $\mathcal{P}_+(X)$ denotes the set of all positive distributions of $X$.
2.1 Model Distribution
If $f_0, \ldots, f_{k-1}$ are linearly independent functions of $x_1$, and $g_0, \ldots, g_{l-1}$ are linearly independent functions of $x_2$, then the $kl$ products $f_i g_j$ are linearly independent functions of $(x_1, x_2)$.
The proof is provided in the appendix.
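As a concrete illustration (our own example, not taken from the letter): for two binary variables, take the local bases $\{1, f\}$ and $\{1, g\}$ with $f(x_1) = (-1)^{x_1}$ and $g(x_2) = (-1)^{x_2}$. Evaluating the four products $\{1, f, g, fg\}$ at the four joint states gives the matrix

$$
H = \begin{pmatrix}
1 & 1 & 1 & 1\\
1 & -1 & 1 & -1\\
1 & 1 & -1 & -1\\
1 & -1 & -1 & 1
\end{pmatrix},
$$

which is the $4 \times 4$ Hadamard matrix and satisfies $HH^\top = 4I$; hence its rows, that is, the four product functions, are linearly independent.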
In the FSLL model, we determine the local basis functions as follows: $\phi_i^0$ is the constant function $1$, and $\phi_i^0, \ldots, \phi_i^{|X_i|-1}$ are linearly independent functions of $x_i$ that take only the values $1$ and $-1$. Each basis function of the joint model is then a product $\Phi_y(x) = \prod_i \phi_i^{y_i}(x_i)$ with index vector $y = (y_1, \ldots, y_n)$.
Since $\phi_i^0 \equiv 1$ for every $i$, the basis function $\Phi_{0\cdots0}$ is the constant function, and an arbitrary value of $\theta_{0\cdots0}$ gives the same (normalized) model distribution. Therefore, we determine $\theta_{0\cdots0}$ so that the model distribution is normalized.
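For the all-binary case, the resulting basis functions are Walsh functions, and a minimal sketch of how to evaluate them is shown below (the index conventions and class names are ours, not the letter's; the code only assumes that each basis function is a product of $\pm1$-valued univariate factors, as stated above).

```java
/** Minimal sketch: full-span basis functions for n binary variables.
 *  Each basis function is indexed by a bit vector y and is the product of
 *  univariate factors phi_i^{y_i}(x_i), where phi_i^0 = 1 and phi_i^1(x_i) = +/-1.
 *  For binary variables this reduces to the Walsh function (-1)^<y,x>. */
final class FullSpanBasis {
    /** Value of the basis function indexed by y at the point x (both bit vectors). */
    static int value(int y, int x) {
        // parity of the bitwise AND gives the exponent of (-1)
        return (Integer.bitCount(y & x) % 2 == 0) ? 1 : -1;
    }

    public static void main(String[] args) {
        int n = 3;                       // three binary variables
        for (int y = 0; y < (1 << n); y++) {
            StringBuilder row = new StringBuilder();
            for (int x = 0; x < (1 << n); x++) {
                row.append(value(y, x) > 0 ? " +1" : " -1");
            }
            System.out.println("Phi_" + y + ":" + row);
        }
    }
}
```

Each row printed by this example is a row of the $8 \times 8$ Hadamard matrix, which connects the construction to the Walsh-Hadamard transform used in section 2.2.3.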
2.2 Learning Algorithm
Algorithm 1 presents the outline of the learning algorithm of the FSLL model. This algorithm is a greedy search for a local minimum point of the cost function, which monotonically decreases as the iteration progresses.
In line 7 of the algorithm, a candidate denotes a parameter vector derived from the current $\theta$ by applying one of the following modifications to its $y$th component (a schematic sketch of the resulting coordinate-wise update follows the list):
Candidate derived by appending $\theta_y$: If $\theta_y = 0$, then let $\theta_y$ take a new nonzero value.
Candidate derived by adjusting $\theta_y$: If $\theta_y \neq 0$, then let $\theta_y$ take a new nonzero value.
Candidate derived by removing $\theta_y$: If $\theta_y \neq 0$, then let $\theta_y = 0$.
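For intuition, here is a minimal sketch of why each such candidate has a closed-form optimum (our own derivation; it ignores the regularization term of equation 2.5 and only uses the fact that $\Phi_y$ takes the values $\pm1$). Writing $\hat\eta_y = E_{\hat p}[\Phi_y]$ and $\eta_y = E_{p_\theta}[\Phi_y]$, the value of $\theta_y$ that minimizes $D(\hat p\,\Vert\,p_\theta)$ with all other components fixed is obtained by the update

$$
\theta_y \;\leftarrow\; \theta_y + \frac{1}{2}\ln\frac{(1+\hat\eta_y)(1-\eta_y)}{(1-\hat\eta_y)(1+\eta_y)},
$$

so a candidate can be scored from the two scalars $\hat\eta_y$ and $\eta_y$ alone; this is why computing the expectation vectors quickly (sections 2.2.2 and 2.2.3) is the key to a fast learning algorithm.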
2.2.1 Cost Function
2.2.2 Fast Algorithm to Compute $\eta$
Here, $\eta$ denotes the vector of expectations $\eta_y = E_{p_\theta}[\Phi_y]$, that is, the dual parameters. We can use the same algorithm to obtain the empirical expectation vector $\hat\eta$ from the empirical distribution $\hat p$; in this case, we let the input be $\hat p$ instead of $p_\theta$.
2.2.3 Acceleration by Walsh-Hadamard Transform
In cases where $|X_i|$ is large, we can accelerate the computation of $\eta$ by using the Walsh-Hadamard transform (WHT) (Fino & Algazi, 1976).
Let us recall the 2D-DFT in section 2.2.2. If the two array dimensions are powers of two, we can use the fast Fourier transform (FFT) (Smith, 2010) for the DFT along each axis in equation 2.7 and can reduce the computational cost to $O(|X|\ln|X|)$. We can apply this principle to computing $\eta$.
Since the number of all combinations of the indices with the $i$th component omitted is $|X|/|X_i|$, we can apply the WHT along the $i$th component once for each of these combinations in $O(|X|\ln|X_i|)$ time. Moreover, repeating this for every component, the whole expectation vector $\eta$ is obtained from $p_\theta$ in $O(|X|\ln|X|)$ time.
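In the all-binary case, the connection is particularly simple (a sketch in our notation, not the letter's general algorithm): because $\Phi_y(x) = (-1)^{\langle y, x\rangle}$, the expectation vector $\eta_y = \sum_x p_\theta(x)\Phi_y(x)$ is exactly the Walsh-Hadamard transform of the probability table, computable in $O(|X|\ln|X|)$ time. A minimal in-place fast WHT might look as follows (illustrative code, not the authors' implementation):

```java
/** Minimal in-place fast Walsh-Hadamard transform (illustrative sketch).
 *  For n binary variables, applying it to the probability vector p (length 2^n,
 *  indexed by the joint state x as a bit pattern) yields the vector of
 *  expectations eta[y] = sum_x p[x] * (-1)^<y,x> in O(2^n * n) time. */
final class Wht {
    static void transform(double[] a) {
        int n = a.length;                      // must be a power of two
        for (int h = 1; h < n; h <<= 1) {      // butterfly stages
            for (int i = 0; i < n; i += h << 1) {
                for (int j = i; j < i + h; j++) {
                    double u = a[j], v = a[j + h];
                    a[j] = u + v;              // (+1) branch
                    a[j + h] = u - v;          // (-1) branch
                }
            }
        }
    }

    public static void main(String[] args) {
        // Tiny check with n = 2 variables: the uniform distribution has eta = (1, 0, 0, 0).
        double[] p = {0.25, 0.25, 0.25, 0.25};
        transform(p);
        System.out.println(java.util.Arrays.toString(p)); // [1.0, 0.0, 0.0, 0.0]
    }
}
```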
2.2.4 Algorithm to Evaluate Candidate
- Removing $\theta_y$: $\theta_y$ becomes 0. Therefore, the change of the cost caused by this candidate can be computed directly.
Among the three types of candidates—appending, adjusting, and removing—the group of candidates derived by appending contains almost $|X|$ candidates because $\theta$ is a very sparse vector. Therefore, it is important to reduce the computational cost of evaluating candidates derived by appending. Evaluating the cost reduction in equation 2.18 is an expensive task for a central processing unit (CPU) because it involves logarithm computation.6
Algorithm 2 presents the details of line 7 of algorithm 1. In line 5, if a cheap test shows that the candidate cannot outperform the best candidate found so far, the candidate is discarded and the evaluation of its exact cost reduction is skipped. This skipping effectively reduces the computational cost of evaluating candidates.7
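As a sketch of the general screening pattern behind this step (our own illustration; the bound function is a placeholder, not the letter's formula), a cheap, logarithm-free bound is checked before the expensive exact evaluation:

```java
/** Illustrative screening pattern for candidate evaluation (placeholder bound). */
interface Candidate {
    double cheapGainBound();   // logarithm-free upper bound on the achievable cost reduction
    double exactGain();        // exact cost reduction; involves expensive log computations
}

final class Screening {
    /** Returns the best candidate, skipping exact evaluation whenever the cheap bound
     *  already shows the candidate cannot beat the best gain found so far. */
    static Candidate best(java.util.List<Candidate> candidates) {
        Candidate best = null;
        double bestGain = 0.0;
        for (Candidate c : candidates) {
            if (c.cheapGainBound() <= bestGain) continue;  // skip: cannot be the best
            double g = c.exactGain();
            if (g > bestGain) { bestGain = g; best = c; }
        }
        return best;
    }
}
```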
2.2.5 Updating $p_\theta$ and $\eta$
Algorithm 3 presents the details of lines 10 and 11 in algorithm 1. This algorithm requires $O(|X|)$ time. It should be noted that the expensive exponential computation is not used inside the for-loop.
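A minimal sketch of such an update step in the all-binary case (variable names and the normalization step are our assumptions): because the basis function takes only the values $+1$ and $-1$, the two factors $e^{\pm\delta}$ can be precomputed, and the loop body contains no exponential.

```java
/** Sketch of updating the model distribution after changing one parameter by delta.
 *  Because the basis function Phi_y takes only the values +1 and -1, exp() is called
 *  exactly twice, outside the loop; the loop itself is multiply-and-add only. */
final class ModelUpdate {
    static void apply(double[] p, int y, double delta) {
        double up = Math.exp(delta), down = Math.exp(-delta);   // the only exp() calls
        double z = 0.0;
        for (int x = 0; x < p.length; x++) {
            boolean plus = (Integer.bitCount(y & x) & 1) == 0;  // Phi_y(x) == +1 ?
            p[x] *= plus ? up : down;
            z += p[x];
        }
        for (int x = 0; x < p.length; x++) p[x] /= z;           // renormalize
    }
}
```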
2.2.6 Memory Requirements
In the FSLL model, most of the memory consumption is dominated by four large tables, each storing $|X|$ floating-point numbers (such as the table of the model distribution $p_\theta$ and that of the expectation vector $\eta$). However, the parameter vector $\theta$ does not require a large amount of memory because it is a sparse vector.
For example, if the FSLL model is allowed to use 4 GB of memory, it can handle up to 26 binary variables, 16 three-valued variables, or 13 four-valued variables ($2^{26}$, $3^{16}$, and $4^{13}$ joint states, respectively).
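As a rough check of these figures (our own arithmetic, assuming 8-byte floating-point entries and the four tables of size $|X|$ mentioned above):

$$
4 \times 2^{26} \times 8\ \text{B} = 2\ \text{GiB} \le 4\ \text{GB},
\qquad
3^{16} \approx 4.3 \times 10^{7} < 2^{26},
\qquad
4^{13} = 2^{26}.
$$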
2.3 Convergence to Target Distribution
One major point of interest about algorithm 1 is whether the model distribution converges to a target distribution in the limit $N \to \infty$, where $N$ is the number of samples in the training data. As shown below, we can guarantee this convergence.
We first consider the case where the cost function $L(\theta)$ is a continuous function of $\theta$ with no regularization term. Here, if $L(\theta^*) \le L(\theta)$ holds for every $\theta$ that differs from $\theta^*$ in at most one component, we say that $\theta^*$ is an axis minimum of $L$. Then the following theorem holds (Beck, 2015):
Let $L$ be a continuous cost function whose level set is a bounded closed set. Then any accumulation point of the sequence $\{\theta_t\}$ generated by algorithm 1 is an axis minimum of $L$.
The proof is provided in the appendix. Here, if $L$ has a unique axis minimum at $\theta^*$, the following corollary is derived from theorem 2:
Let $L$ be a function satisfying the condition in theorem 2. If $L$ has a unique axis minimum at $\theta^*$, then $\theta^*$ is also the global minimum point of $L$, and $\lim_{t\to\infty}\theta_t = \theta^*$.
The proof is provided in the appendix.
Let $q$ be a positive distribution. By corollary 1, in the case where the cost function is $D(q\,\Vert\,p_\theta)$, where $q$ is a positive distribution,8 the equation $\lim_{t\to\infty} p_{\theta_t} = q$ holds.
Then we extend the cost function to the sum of a continuous function having a unique global minimum and the regularization term in equation 2.5. The convergence guarantee extends to this regularized cost as well (theorem 3), whose proof is provided in the appendix.
3 Experiments
In this section, to demonstrate the performance of the FSLL model, we compare a full-span log-linear model that we refer to as FL with two Boltzmann machines that we refer to as BM-DI and BM-PCD.
3.1 Full-Span Log-Linear Model FL
FL is the full-span log-linear model described in section 2. The model distribution of FL is given by equation 2.1, and the cost function by equation 2.5. The learning algorithm is algorithm 1 described in section 2.2. The threshold used to finish the cost minimization is fixed beforehand.
3.2 Boltzmann Machine BM-DI
BM-DI (Boltzmann machine with direct integration) is a fully connected Boltzmann machine having no hidden variables and no temperature parameter. To examine the ideal performance of the Boltzmann machine, we do not use the Monte Carlo approximation in BM-DI to evaluate equation 1.2. The model distribution of BM-DI is given by equation 1.1. The cost function of BM-DI is the Kullback-Leibler divergence from the empirical distribution to the model distribution and has no regularization term.
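For reference, the standard form of a fully connected Boltzmann machine without hidden variables, which we assume equations 1.1 and 1.2 correspond to, is

$$
p_\theta(x) \;=\; \frac{1}{Z(\theta)}\exp\Bigl(\sum_{i<j} w_{ij}\,x_i x_j + \sum_i b_i\,x_i\Bigr),
\qquad
Z(\theta) \;=\; \sum_x \exp\Bigl(\sum_{i<j} w_{ij}\,x_i x_j + \sum_i b_i\,x_i\Bigr),
$$

and BM-DI evaluates the expectations appearing in the gradient by direct summation over all $2^{20}$ states rather than by sampling.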
3.3 Boltzmann Machine BM-PCD
BM-PCD (Boltzmann machine with persistent contrastive divergence) is similar to BM-DI; however, BM-PCD uses the persistent contrastive divergence method (Tieleman, 2008), a popular Monte Carlo method in Boltzmann machine learning. BM-PCD has some hyperparameters. We tested various combinations of these hyperparameters and determined them as follows: learning rate 0.01, number of Markov chains 100, and length of Markov chains 10,000.
3.4 Training Data
We prepared six training data sets. These data sets are artificial; therefore, their true distributions are known. Each is an independent and identically distributed (i.i.d.) data set drawn from its true distribution.
3.4.1 Ising5×4S, Ising5×4L
3.4.2 BN20-37S, BN20-37L
3.4.3 BN20-54S, BN20-54L
3.5 Experimental Platform
All experiments were conducted on a laptop PC (CPU: Intel Core i7-6700K @4 GHz; memory: 64 GB; operating system: Windows 10 Pro). All programs were written in Java and executed on Java 8.
3.6 Results
Table 1 presents performance comparisons among FL, BM-DI, and BM-PCD. We evaluated the accuracy of the learned distribution by the Kullback-Leibler divergence $D(p^*\,\Vert\,p_\theta)$ from the true distribution $p^*$. Figure 3 illustrates the comparison of $D(p^*\,\Vert\,p_\theta)$.
Table 1: Performance Comparison between FL and BMs.
| Data | Model | $D(\hat p \,\Vert\, p_\theta)$ | $D(p^* \,\Vert\, p_\theta)$ | #Basis | Time |
|---|---|---|---|---|---|
| Ising5×4S | FL | 2.501 nat | 0.012 nat | 31 | 5 sec |
| | BM-DI | 2.424 | 0.087 | 210 | 13 |
| | BM-PCD | 2.504 | 0.094 | 210 | 3 |
| Ising5×4L | FL | 0.476 | 0.004 | 37 | 9 |
| | BM-DI | 0.473 | 0.002 | 210 | 12 |
| | BM-PCD | 0.528 | 0.053 | 210 | 3 |
| BN20-37S | FL | 4.355 | 0.317 | 39 | 5 |
| | BM-DI | 4.746 | 0.863 | 210 | 17 |
| | BM-PCD | 4.803 | 0.903 | 210 | 3 |
| BN20-37L | FL | 0.697 | 0.026 | 105 | 12 |
| | BM-DI | 1.422 | 0.750 | 210 | 19 |
| | BM-PCD | 1.477 | 0.806 | 210 | 3 |
| BN20-54S | FL | 3.288 | 0.697 | 41 | 5 |
| | BM-DI | 3.743 | 1.301 | 210 | 23 |
| | BM-PCD | 3.826 | 1.338 | 210 | 3 |
| BN20-54L | FL | 0.430 | 0.057 | 192 | 23 |
| | BM-DI | 1.545 | 1.166 | 210 | 21 |
| | BM-PCD | 1.620 | 1.242 | 210 | 3 |
Notes: $\hat p$: empirical distribution of the training data. $p^*$: true distribution. #Basis: number of basis functions used (those with nonzero coefficients). Time: CPU time for learning (median of three trials).
For Ising5×4S/L, the performance difference between FL and the BMs (BM-DI and BM-PCD) was not remarkable because both FL and the BMs could represent the true distribution $p^*$. The fact that $D(p^*\,\Vert\,p_\theta)$ stayed small implies that overfitting to $\hat p$ was successfully suppressed. FL used fewer basis functions than the BMs did, which implies that some basis functions of the BMs were useless for representing $p^*$. Regarding the accuracy of the model distribution, BM-PCD was less accurate than FL and BM-DI. This disadvantage becomes noticeable when the model distribution is close to the true distribution: even when large training data are given, some error remains in the model distribution of BM-PCD (e.g., Ising5×4L).
For BN20-37S/L and BN20-54S/L, FL outperformed the BMs because only FL could represent $p^*$. To fit $p^*$, FL adaptively selected 39 basis functions for BN20-37S and 105 basis functions for BN20-37L out of the $2^{20}$ available basis functions. This fact implies that FL constructed a more complex model to fit $p^*$ as the training data increased. Furthermore, a comparison of $D(p^*\,\Vert\,p_\theta)$ revealed that the accuracy of the model distribution improved remarkably as the size of the training data increased for FL. In contrast, the BMs could not fit $p^*$ even when a large training data set was supplied.
Figure 4 illustrates the CPU time to learn the training data sets. BM-PCD was the fastest, and FL was faster than BM-DI for five out of six training data sets. The learning time of BM-PCD is constant because we used a fixed length (10,000) of Markov chains. FL had $2^{20}$ available basis functions, while BM-DI had 210; nevertheless, FL was faster than BM-DI.
For BN20-54L, FL took a longer time to learn because it used 192 basis functions to construct the model distribution. Using these 192 basis functions, FL successfully constructed a model distribution that fit $p^*$, while the BMs failed to do so.
4 Discussion
The major disadvantage of the FSLL model is that it is not feasible for large problems due to its memory consumption and learning speed. If we use a typical personal computer, the problem size should be limited to roughly $|X| \le 2^{26}$ (see section 2.2.6). However, as long as we use the FSLL model in this problem domain, it has the following theoretical and practical advantages.
The first advantage is that the FSLL model can represent arbitrary positive distributions of $X$. Furthermore, it is guaranteed that the model distribution converges to the target distribution in the limit where the training data size goes to infinity.
Here, we view learning machines from an information geometry perspective. The dimension of the distribution space $\mathcal{P}_+(X)$ (all positive distributions of $X$) is $|X| - 1$, and a learning machine having $k$ parameters spans a $k$-dimensional manifold in $\mathcal{P}_+(X)$ to represent its model distribution (we refer to this manifold as the model manifold).
A learning machine with $k < |X| - 1$ parameters cannot represent arbitrary distributions in $\mathcal{P}_+(X)$. Moreover, if $k < |X| - 1$, there is no guarantee that the true distribution is close to the model manifold, and if the model manifold is remote from the true distribution, the machine's performance will be poor. This poor performance is not improved even if infinite training data are given.
The FSLL model extends the manifold's dimension to $|X| - 1$ by introducing higher-order factors. The model manifold becomes $\mathcal{P}_+(X)$ itself; thus, there is no room for further expansion. Therefore, we refer to the model as the full-span log-linear model. The main interest of this letter is that as long as the problem size stays in this range, the FSLL model is a feasible and practical model.
For example, suppose that we construct a full-span model by adding hidden nodes to a Boltzmann machine having 20 visible nodes. The number of parameters of the Boltzmann machine is $|E| + |V|$, where $|E|$ is the number of edges and $|V|$ is the number of nodes. Therefore, it is not practical to construct the full-span model for 20 visible nodes because it requires on the order of $2^{20}$ edges (a rough count is given below).
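A rough count under this reading (our own arithmetic): for the model manifold to cover the whole $(2^{20}-1)$-dimensional space of positive distributions, the parameter count must satisfy

$$
|E| + |V| \;\ge\; 2^{20} - 1 = 1{,}048{,}575,
$$

so on the order of $10^{6}$ edges are required, no matter how the hidden nodes are arranged.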
The second advantage is that the FSLL model has no hyperparameters; therefore, no hyperparameter tuning is needed. For example, if we use a Boltzmann machine with hidden nodes and learn the true distribution with contrastive divergence methods, we need to determine hyperparameters such as the learning rate, the mini-batch size, the number of hidden nodes, and the graphical structure of the nodes. The FSLL model, however, automatically learns the training data without human intervention.
5 Conclusion and Extension
Suppose that we let the FSLL model learn training data consisting of 20 binary variables. The dimension of the function space spanned by the possible positive distributions is $2^{20} - 1$. The FSLL model has $2^{20} - 1$ parameters and can fit arbitrary positive distributions. The basis functions of the FSLL model have the following properties:
Each basis function is a product of univariate functions.
The basis functions take only the values $1$ and $-1$.
The proposed learning algorithm exploited these properties and realized fast learning.
Our experiments demonstrated the following:
The FSLL model could learn the training data with 20 binary variables within 1 minute with a laptop PC.
The FSLL model successfully learned the true distributions underlying the training data, even when the true distributions contained higher-order terms depending on three or more variables.
The FSLL model constructed a more complex model to fit the true distribution as the training data increased; accordingly, the learning time became longer.
Appendix: Proofs and Derivations of Equations
A.1 Proof of Theorem 1
A.2 Derivation of Equation 2.6
A.3 Derivation of Equation 2.11
A.4 Derivation of Equation 2.13
A.5 Derivation of Equation 2.16
A.6 Derivation of Equation 2.17
A.7 Derivation of Equation 2.19
A.8 Proof of Theorem 2
Since the level set of $L$ is a bounded closed set, the sequence $\{\theta_t\}$ has one or more accumulation points in it.
A.9 Proof of Corollary 1
By theorem 2, any accumulation point of $\{\theta_t\}$ is an axis minimum of $L$; however, the axis minimum of $L$ is unique; therefore, $\theta^*$ is the unique accumulation point of $\{\theta_t\}$, and $\lim_{t\to\infty}\theta_t = \theta^*$.
A.10 Proof of Theorem 3
Acknowledgments
This research is supported by KAKENHI 17H01793.
Notes
Distributions $p$ such that $p(x) > 0$ for all $x$.
Furthermore, the FSLL model is not limited to binary variables.
We use “nat” as the description length unit; thus, we use “ln” instead of “log.”
Not 2D-FFT but 2D-DFT.
For small $|X_i|$, directly multiplying the Hadamard matrix is faster than the WHT in our environment; therefore, we use the WHT only in cases where $|X_i|$ is large.
Logarithm computation is 30 times slower than addition or multiplication in our environment.
In our environment, this skipping makes the evaluation of candidates more than 10 times faster.
Positivity is needed to keep the level set of the cost function bounded.