The full-span log-linear (FSLL) model introduced in this letter is considered an nth order Boltzmann machine, where n is the number of all variables in the target system. Let X=(X0,,Xn-1) be finite discrete random variables that can take |X|=|X0||Xn-1| different values. The FSLL model has |X|-1 parameters and can represent arbitrary positive distributions of X. The FSLL model is a highest-order Boltzmann machine; nevertheless, we can compute the dual parameter of the model distribution, which plays important roles in exponential families in O(|X|log|X|) time. Furthermore, using properties of the dual parameters of the FSLL model, we can construct an efficient learning algorithm. The FSLL model is limited to small probabilistic models up to |X|225; however, in this problem domain, the FSLL model flexibly fits various true distributions underlying the training data without any hyperparameter tuning. The experiments showed that the FSLL successfully learned six training data sets such that |X|=220 within 1 minute with a laptop PC.

The main purpose of this letter is as follows:

  • To introduce the full-span log-linear (FSLL) model and a fast learning algorithm

  • To demonstrate the performance of the FSLL model with experiments

Boltzmann machines (Ackley, Hinton, & Sejnowski, 1985) are multivariate probabilistic models that are widely used in the field of machine learning. Here, we consider a Boltzmann machine with n binary variables X=(X0,,Xn-1)(Xi=0,1). We address fully connected Boltzmann machines with no hidden variables and no temperature parameter. A Boltzmann machine represents the following distribution pθ, which we refer to as the model distribution:
pθ(x)=Z(θ)-1elθ(x),xi{0,1},Z(θ)=xXelθ(x),lθ(x)=0i<j<nθijxixj+0i<nθinxi.
(1.1)
In Boltzmann machines, learning is achieved by minimizing KL(pdpθ), where KL(**) is the Kullback-Leibler(KL-) divergence and pd is the empirical distribution of the training data. One straightforward method to minimize KL(pdpθ) is to use a gradient vector whose components are
KL(pdpθ)θij=XiXjpθ-XiXjpd
(1.2)
(Ackley et al., 1985) and to apply the gradient descent method or quasi-Newton method (Dennis & Moré, 1977). In evaluating equation 1.2, the computational cost to evaluate the term XiXjpθ is significant because we need to evaluate this term every time θ is modified.

One disadvantage of the Boltzmann machine is its insufficient ability to represent distributions. The Boltzmann machine has only n(n+1)/2 parameters, while the dimension of the function space spanned by the possible distributions of X is |X|-1 (-1 comes from the constraint xXp(x)=1).

One way to reduce this disadvantage is to introduce higher-order terms into the function lθ(X). For example, third-order Boltzmann machines represent the following distribution pθ (Sejnowski, 1986):
pθ(x)=Z(θ)-1elθ(x),xi{0,1},Z(θ)=xXelθ(x),lθ(x)=0i<j<k<nθijkxixjxk+0i<j<nθijnxixj+0i<nθinnxi.
Here, “order” means the number of variables on which a function depends. This definition of order is not limited to binary variables. For example, if f(x0,x1,x2) ignores x2, that is, x2 does not affect the value f(x0,x1,x2), the order of f is two. The kth-order Boltzmann machine has up to kth-order terms in lθ. Since the nth-order Boltzmann machine has arbitrary order of terms, it can represent arbitrary positive distributions1 of X.

However, introducing higher-order terms leads to an enormous increase in computational cost because the kth-order Boltzmann machine has i=1kni parameters.

The FSLL model introduced in this letter can be considered an nth order Boltzmann machine,2 where n is the number of all variables in the target system. The FSLL model has |X|-1 parameters and can represent arbitrary positive distributions. Since the FSLL model is a highest-order Boltzmann machine, the learning of FSLL is expected to be very slow. However, we propose a fast learning algorithm. For example, this algorithm can learn a joint distribution of 20 binary variables within 1 minute with a laptop PC.

Since the FSLL model has full degrees of freedom, a regularization mechanism to avoid overfitting is essential. For this purpose, we used a regularization mechanism based on the minimum description length principle (Rissanen, 2007).

The remainder of this letter is organized as follows. In section 2, we present the FSLL model and its fast learning algorithm. In section 3, we demonstrate the performance of the FSLL model by experiment. In section 4, we discuss the advantages and disadvantages of the FSLL model. In section 5, we present the conclusions and extensions of the letter.

Before introducing the FSLL model, we define the notations used in this letter. A random variable is denoted by a capital letter, such as X, and the value that X takes is indicated by a lowercase letter, such as x. X also denotes the set of values that the variable X can take; thus, |X| denotes the number of values that X can take. fp denotes the expectation of f(X) with distribution p(X), that is, fp=xXp(x)f(x). The differential operator /θy is abbreviated as y. P denotes the set of all distributions of X, and P+ denotes all positive distributions of X.

2.1  Model Distribution

The FSLL model is a multivariate probabilistic model designed for a system that has n discrete finite variables X0,,Xn-1, where Xi takes an integer value in [0,|Xi|). The FSLL model has parameters θ={θy}, where y=(y0,,yn-1) is a vector such that yiXi. The model distribution of the FSLL model is the following pθ:
pθ(x)=Z(θ)-1elθ(x),xiXi,Z(θ)=xXelθ(x),lθ(x)=yXθyΦy(x),Φy(x)=i=0n-1ϕyii(xi).
(2.1)
In equation 2.1, {ϕyii}(yiXi) are |Xi| linearly independent functions of Xi, which we refer to as the local basis functions, and {Φy}(yX) are |X| functions of X=(X0,,Xn-1), which we refer to as the (global) basis functions. Using the following theorem recursively, we can prove that the global basis functions are linearly independent functions of X.
Theorem 1.

If {fi}(iI) are linearly independent functions of X0, and {gj}(jJ) are linearly independent functions of X1, then {fi(X0)gj(X1)}(iI,jJ) are linearly independent functions of (X0,X1).

The proof is provided in the appendix.

In the FSLL model, we determine the local basis functions as follows:

Case |Xi|=2k:
H1=(1),H2k+1=H2kH2kH2k-H2k,ϕ0i(0)ϕ0i(|Xi|-1)ϕ|Xi|-1i(0)ϕ|Xi|-1i(|Xi|-1)=H2k
(2.2)
H2k above is referred to as the Walsh–Hadamard matrix (Pratt, Andrews, & Kane, 1969).
Else:
ϕ0i1,ϕji(l)=1(0<j=l)-1(0<jl)
(2.3)

Since Φ01, an arbitrary θ0 gives the same model distribution. Therefore, we determine as θ00.

2.2  Learning Algorithm

Algorithm 1 presents the outline of the learning algorithm of the FSLL model. This algorithm is a greedy search to find a local minimum point of cost(θ), which monotonically decreases as the iteration progresses.

graphic

In line 7 of the algorithm, candidate denotes θ' derived from θ by applying one of the following modifications on the yth component:

  • Candidate derived by appending θy: If θy=0, then let θy'=argminθycost(θ).

  • Candidate derived by adjusting θy: If θy0, then let θy'=argminθycost(θ).

  • Candidate derived by removing θy: If θy0, then let θy'=0.

2.2.1  Cost Function

We use a cost function based on the minimum description length principle (Rissanen, 2007). Suppose that we send the training data to a receiver by transmitting the parameters and compressed data. Since θ is a sparse vector, we transmit only indexes y, such that θy0, and the value of θy. Moreover, the index y=(y0,,yn-1) is a sparse vector because higher-order basis functions are rarely used in the model distribution due to their expensive cost. Therefore, we transmit only indexes i, such that yi0, and the values of yi[1,|Xi|) to transmit the sparse vector y. Then the description length3 to transmit the sparse vector y becomes
i:yi0lnntosendi[0,n)+ln(|Xi|-1)tosendyi[1,|Xi|)=i:yi0lnn(|Xi|-1),
and the description length to transmit all index vectors y, such that θy0, becomes
y:θy0i:yi0lnn(|Xi|-1).
The minimum description length to transmit k parameters and the compressed data is estimated as (Rissanen, 2007)
-Nlnpθpd+k2lnN,
where k is the number of nonzero parameters and N is the number of samples in the training data. The total description length to transmit y,θy and the compressed data becomes
-Nlnpθpd+k2lnN+y:θy0i:yi0lnn(|Xi|-1)=-Nlnpθpd+y:θy0lnN2+i:yi0lnn(|Xi|-1).
(2.4)
We divide equation 2.4 by N and add lnpdpd to create information-geometric quantity and obtain the following cost function:
cost(θ,N)=KL(pdpθ)+r(θ,N),r(θ,N)=y:θy0ry(N),ry(N)=1NlnN2+i:yi0lnn(|Xi|-1).
(2.5)

2.2.2  Fast Algorithm to Compute θ¯

The following vector θ¯, referred to as the dual parameter of θ, plays important roles in the exponential families (Amari, 2016):
θ¯y(θ)=ylnZ(θ)=Φypθ(seeappendixforderivation).
(2.6)
We identified an algorithm to compute θ¯ from pθ in O(|X|log|X|) time. This algorithm borrowed ideas from the multidimensional discrete Fourier transform (DFT) (Smith, 2010). Here, we consider a two-dimensional (2D-)DFT4 for |X0|×|X1| pixels of data. The 2D-DFT transforms f into F by the following equation:
F(y0,y1)=x0,x1f(x0,x1)Φy0y1(x0,x1),Φy0y1(x0,x1)=exp2πi|X0|y0x0exp2πi|X1|y1x1.
To derive the value of F(y0,y1) for a specific (y0,y1), O(|X0||X1|) time is required; therefore, it appears that O(|X0|2|X1|2) time is required to derive F(y0,y1) for all y0,y1. However, 2D-DFT is usually realized in the following dimension-by-dimension manner:
F(y0,y1)=x1x0f(x0,x1)exp2πi|X0|y0x0DFTbyX0exp2πi|X1|y1x1DFTbyX1.
(2.7)
In equation 2.7, the DFT by X0 for all (y0,x1)X0×X1 requires O(|X0||X1|) time, and the DFT by X1 for all y0X0,y1X1 requires O(|X0||X1|2) time. Therefore, the entire 2D DFT requires O(|X0||X1|(|X0|+|X1|)) time that is smaller than O(|X0|2|X1|2). The key here is that the basis function Φy0y1(x0,x1) is a product of two univariate functions as follows:
Φy0y1(x0,x1)=ϕy00(x0)ϕy11(x1),ϕy00(x0)=exp2πi|X0|y0x0,ϕy11(x1)=exp2πi|X1|y1x1.
We apply this principle to compute θ¯.
Here, we consider how to derive θ¯ from pθ by the following equation:
Φypθ=xpθ(x)Φy(x)=xpθ(x)iϕyii(xi)=xn-1x1x0pθ(x)ϕy00(x0)ϕy11(x1)ϕyn-1n-1(xn-1).
(2.8)
We evaluate the right side of equation 2.8 from the innermost parenthesis to the outermost parenthesis; that is, we determine the following function, gi:XR by the following recurrence sequence:
g0(x)=pθ(x),gi+1(y0,,yi-1,yi,xi+1,,xn-1)=xigi(y0,,yi-1,xi,xi+1,,xn-1)ϕyii(xi).
(2.9)
Then the equation θ¯y=gn(y) holds. Since O(|X||Xi|) time is required to obtain gi+1 from gi, we can obtain gn from g0 in O(|X|i|Xi|) time. In the case where |Xi|=k, the computational cost to derive gn from g0 becomes O(|X|kn). Moreover, considering k as a constant, we obtain the computational cost as O(|X|n)=O(|X|log|X|).

We can use the same algorithm to obtain the vector d¯ from pd. In this case, let g0=pd.

2.2.3  Acceleration by Walsh-Hadamard Transform

In cases where |Xi| is large, we can accelerate the computation of θ¯ by using Walsh-Hadamard transform (WHT) (Fino & Algazi, 1976).

Let us recall the 2D-DFT in section 2.2.2. If |X0|,|X1| are powers of two, we can use the fast Fourier transform (FFT) (Smith, 2010) for DFT by X0 and DFT by X1 in equation 2.7 and can reduce the computational cost to O(|X0||X1|log|X0||X1|). We can apply this principle to computing θ¯.

Here, we fix the values of y0,,yi-1,xi+1,,x|Xi|-1(only the ith component is omitted) in equation 2.9. Then we obtain
gi+1(,yii,)=xigi(,xii,)ϕyii(xi).
Using matrix notation, we obtain
gi+1(,0i,)gi+1(,|Xi|-1i,)=ϕ0i(0)ϕ0i(|Xi|-1)ϕ|Xi|-1i(0)ϕ|Xi|-1i(|Xi|-1)gi(,0i,)gi(,|Xi|-1i,).
(2.10)
We refer to this transform R|Xi|R|Xi| as the local transform. The local transform usually requires O(|Xi|2) time; however, if the matrix in equation 2.10 is a Walsh-Hadamard matrix (see equation 2.2), using Walsh-Hadamard transform (WHT) (Fino & Algazi, 1976), we can perform the local transform in O(|Xi|log|Xi|) time.5

Since the number of all combinations of y0,,yi-1,xi+1,,xn-1 (the ith component is omitted) is |X|/|Xi|, we can obtain the function gi+1 from gi in O(|X|log|Xi|) time. Moreover, gn is obtained from g0 in O(i|X|log|Xi|)=O(|X|log|X|) time.

2.2.4  O(1) Algorithm to Evaluate Candidate

In line 7 of algorithm 1, all three types of candidates—appending, adjusting, and removing—are evaluated for all yX. Here, we consider the following equation:
yθ¯y=Φy2pθ-θ¯y2(seetheappendixforthederivation)
(2.11)
=1-θ¯y2(Φy21).
(2.12)
Considering θ¯y as a univariate function of θy and equation 2.12 as an ordinary differential equation, we obtain the following general solution:
θ¯y(θy)=tanh(θy-c)(seetheappendixforthederivation),
(2.13)
where c is a constant determined by a boundary condition. For example, if θ¯y(θy0)=θ¯y0, then c is given by c=θy0-tanh-1θ¯y0 and equation 2.13 becomes
θ¯y(θy)=tanh(θy-θy0+tanh-1θ¯y0).
(2.14)
Here, we define the follwing line Ay(θ0) in P:
Ay(θ0)={θ|y'y,θy=θy0}.
(2.15)
Equation 2.14 demonstrates that if θ¯y0 is known, we can derive any θ¯y at a point on the line Ay(θ0) in O(1) time.
Here, the gradient vector of KL(pdpθ) is given by
yKL(pdpθ)=θ¯y-d¯y,d¯y=Φypd(seetheappendixforthederivation).
(2.16)
Therefore, for θAy(θ0), we can obtain KL(pdpθ)-KL(pdpθ0) by integrating equation 2.16 as follows:
KL(pdpθ)-KL(pdpθ0)=θy0θyθ¯y(u)-d¯ydu=1+d¯y2ln1+θ¯y01+θ¯y+1-d¯y2ln1-θ¯y01-θ¯y(seetheappendixforthederivation).
(2.17)
Then we can evaluate Δ(θ')=cost(θ')-cost(θ0) for three types of candidates θ' as follows:
  • Appending θy: By equation 2.16, KL(pdpθ) is minimized when θ¯y=d¯y. Therefore,
    Δ=1+d¯y2ln1+θ¯y01+d¯y+1-d¯y2ln1-θ¯y01-d¯y+ry.
    (2.18)
  • Adjusting θy: By equation 2.16, KL(pdpθ) is minimized when θ¯y=d¯y. Therefore,
    Δ=1+d¯y2ln1+θ¯y01+d¯y+1-d¯y2ln1-θ¯y01-d¯y.
  • Removing θy:θy becomes 0. Therefore,
    Δ=1+d¯y2ln1+θ¯y01+θ¯y(0)+1-d¯y2ln1-θ¯y01-θ¯y(0)-ry.

Among the three types of candidates—appending, adjusting, and removing—the group of candidates by appending θy has almost |X| candidates because θy is a very sparse vector. Therefore, it is important to reduce the computational cost to evaluate candidates by appending. Evaluating Δ in equation 2.18 is an expensive task for a central processing unit (CPU) because it involves logarithm computation.6

Δ in equation 2.18 has the following lower bound Δ̲:
ΔΔ̲=-(θ¯y0-d¯y)21-(θ¯y0)2+ry(seetheappendixforthederivation).
(2.19)
Evaluating Δ̲ is much faster than evaluating Δ. In the candidate evaluation, the candidate with the lower cost is the “winner.” If a candidate's lower bound is greater than the champion's cost—the lowest cost ever found—the candidate has no chance to win; therefore, we can discard the candidate without precise evaluation of Δ.

Algorithm 2 presents the details of line 7 of algorithm 1. In line 5, if Δ̲(θy')Δθ1, the candidate θ' is discarded and the evaluation of Δ(θy') is skipped. This skipping effectively reduces the computational cost of evaluating candidates.7

graphic

2.2.5  Updating θ and pθ

Let θ0 denote θ before updating and θ1 denote θ after updating in line 10 of algorithm 1. The θ1 differs from θ0 only at the y'th component. Therefore,
lθ1(x)=lθ0(x)+(θy'1-θy'0)Φy'(x)
and
pθ1(x)=Z(θ1)-1explθ1(x)=Z(θ1)-1explθ0(x)+(θy'1-θy'0)Φy'(x)=Z(θ0)Z(θ1)pθ0(x)exp(θy'1-θy'0)Φy'(x).
(2.20)
Summing equation 2.20 for all xX, we obtain
1=Z(θ0)Z(θ1)xpθ0(x)exp(θy'1-θy'0)Φy'(x).
Therefore,
pθ1(x)=Z(θ0)Z(θ1)pθ0(x)exp(θy'1-θy'0)Φy'(x),Z(θ0)Z(θ1)=xpθ0(x)exp(θy'1-θy'0)Φy'(x)-1.
(2.21)

Algorithm 3 presents the details of lines 10 and 11 in algorithm 1. This algorithm requires O(|X|) time. It should be noted that the expensive exponential computation exp(θy'1-θy'0)Φy'(x) is not used in the for-loop.

graphic

2.2.6  Memory Requirements

In the FSLL model, most of the memory consumption is dominated by four large tables for pθ, θ¯, d¯, and r, and each stores |X| floating point numbers. However, θ does not require a large amount of memory because it is a sparse vector.

For example, if the FSLL model is allowed to use 4 GB of memory, it can handle up to 26 binary variables, 16 three-valued variables(|Xi|=3), and 13 four-valued variables.

2.3  Convergence to Target Distribution

One major point of interest about algorithm 1 is whether the model distribution converges to a target distribution at the limit of N or not, where N is the number of samples in the training data. As a result, we can guarantee this convergence.

Let us define the following symbols:
A(θt)=yAy(θt)//Ayisdefinedbyequation2.15,my(θt,N)=minθAy(θt)fN(θ)//minimuminlineAym(θt,N)=minθA(θt)fN(θ)//minimuminalllinesAy=minymy(θt,N),
(2.22)
Ymin={y|my(θt,N)=m(θ,N)},argmy(θt,N)={θ|θAy(θt),fN(θ)=my(θt,N)},argm(θt,N)={θ|θA(θt),fN(θ)=m(θt,N)},=yYminargmy(θt,N).
(2.23)
Then, algorithm 1 iteratively selects θt+1 from argm(θt,N).

We first consider the case where the cost function is a continuous function of θ with no regularization term. Here, if and only if f(θ)=minθ'A(θ)f(θ'), we say that θ is an axis minimum of f. Then the following theorem holds (Beck, 2015):

Theorem 2.

Let f:R|X|R be a continuous cost function such that B={θ|f(θ)f(θ0)} is a bounded close set. Then any accumulation point of {θt}(t[0,)) is an axis minimum of f.

The proof is provided in the appendix. Here, if f(θ) has an unique axis minimum at θ=a, the following corollary is derived from theorem 2:

Corollary 1.

Let f be a function satisfying the condition in theorem 2. If f has a unique axis minimum at θ=θmin, then θmin is also the global minimum and limtθt=θmin.

The proof is provided in the appendix.

Let q be a positive distribution. By corollary 1, in the case where the cost function is f(θ)=KL(qpθ), where q is a positive distribution,8 the equation limtKL(qpθt)=0 holds.

Then we extend the cost function to fN(θ)=f(θ)+r(θ,N), where f:R|X|R is a continuous function having the unique global minimum at θ=θmin, and r(θ,N) is a regularization term in equation 2.5. The following theorem holds:

Theorem 3.
Let f be a function satisfying the conition in theorem 2. Then
limNlimtf(θt(N))=minθf(θ)
holds.
Here, do not confuse limNlimtf(θt(N)) with limtlimNf(θt(N)).
limtlimNf(θt(N))=minθf(θ)
is trivial by corollary 1, while
limNlimtf(θt(N))=minθf(θ)
needs a proof. The proof is provided in the appendix. In the case where pd is a positive distribution for sufficiently large N and f(θ)=KL(pd|pθ), theorem 3 guarantees
limNlimtKL(pdpθt(N))=0.

In this section, to demonstrate the performance of the FSLL model, we compare a full-span log-linear model that we refer to as FL with two Boltzmann machines that we refer to as BM-DI and BM-PCD.

3.1  Full-Span Log-Linear Model FL

FL is a full-span log-linear model that has been described in section 2. The model distribution of FL is given by equation 2.1. The cost function is given by equation 2.5. The learning algorithm is algorithm 1 described in section 2.2. The threshold to finish the cost minimization is determined as ε=10-4.

3.2  Boltzmann Machine BM-DI

BM-DI (Boltzmann machine with direct integration) is a fully connected Boltzmann machine having no hidden variables and no temperature parameter. To examine the ideal performance of the Boltzmann machine, we do not use the Monte Carlo approximation in BM-DI to evaluate equation 1.2. The model distribution of BM-DI is given by equation 1.1. The cost function of BM-DI is KL(pdpθ) and has no regularization term.

To minimize the cost function with fewer evaluations of the gradient vector, we used a pseudo-Newton method, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm (Nocedal & Wright, 2006), implemented in the Java statistical analysis tool (JSAT; Raff, 2017).

3.3  Boltzmann Machine BM-PCD

BM-PCD (Boltzman machine with persistent contrastive divergence method) is similar to BM-DI; however, BM-PCD uses the persistent contrastive divergence method (Tieleman, 2008), a popular Monte Carlo method in Boltzmann machine learning. BM-PCD has some hyperparameters. We tested various combinations of these hyperparameters and determined them as learning rate = 0.01, number of Markov chains = 100, and length of Markov chains = 10,000.

3.4  Training Data

We prepared six training data sets. These data sets are artificial; therefore, their true distributions p*(X) are known. Each is an independent and identically distributed (i.i.d.) data set drawn from its true distribution.

3.4.1  Ising5 × 4S, Ising5 × 4L

These data sets were drawn from the distribution represented by the 2D Ising model (Newman & Barkema, 1999) with 5×4 nodes (see Figure 1).
Figure 1:

Graphical structure of Ising5 × 4S/L.

Figure 1:

Graphical structure of Ising5 × 4S/L.

Close modal
Every Xi takes the value 0 or 1. Ising5 × 4S has 1000 samples, while Ising5 × 4L has 100,000 samples. The true distribution is represented as
p*(x)=Z-1exp12i,jsisj,Z=xexp12i,jsisj,si=1xi=1-1xi=0,
where i,j denotes the adjacent variables in Figure 1. Boltzmann machines can represent p*.

3.4.2  BN20-37S, BN20-37L

These data sets were drawn from the distribution represented by the Bayesian network with 20 nodes and 37 edges (see Figure 2).
Figure 2:

Graphical structure of BN20-37S/L.

Figure 2:

Graphical structure of BN20-37S/L.

Close modal
X0 has no parents, X1 has X0 as a parent, and the other Xi(i2) individually has two parents Yi0,Yi1{X0,,Xi-1}. The graphical structure and contents of the conditional distribution table of each node are determined randomly. Every Xi takes the value 0 or 1. BN20-37S has 1000 samples, while BN20-37L has 100,000 samples. The true distribution is represented as
p*(x)=ip*(xi|yi0,yi1)=expilnp*(xi|yi0,yi1).
Since lnp*(xi|yi0,yi1) are third-order terms, third-order Boltzmann machines can represent p*.

3.4.3  BN20-54S, BN20-54L

These data sets were drawn from the distribution represented by the following Bayesian network with 20 nodes and 54 edges. X0 has no parents, X1 has X0 as a parent, X2 has X0,X1 has parents, and the other Xi(i3) individually has three parents: Yi0,Yi1,Yi2{X0,,Xi-1}. The graphical structure and contents of the conditional distribution table of each node are determined randomly. Every Xi takes the value zero or one. BN20-54S has 1000 samples, and BN20-54L has 100,000 samples. The true distribution is represented as
p*(x)=ip*(xi|yi0,yi1,yi2)=expilnp*(xi|yi0,yi1,yi2).
Since lnp*(xi|yi0,yi1,yi2) are fourth-order terms, fourth-order Boltzmann machines can represent p*.

3.5  Experimental Platform

All experiments were conducted on a laptop PC (CPU: Intel Core i7-6700K @4 GHz; memory: 64 GB; operating system: Windows 10 Pro). All programs were written in and executed on Java 8.

3.6  Results

Table 1 represents performance comparisons between FL BM-DI and BM-PCD. We evaluated the accuracy of the learned distribution by KL(p*pθ). Figure 3 illustrates the comparison of KL(p*pθ).

Table 1:

Performance Comparison between FL and BM.

DataModelKL(pdpθ)KL(p*pθ)#BasisTime
Ising5 × 4S FL 2.501 nat 0.012 nat 31 5 sec 
 BM-DI 2.424 0.087 210 13 
 BM-PCD 2.504 0.094 210 
Ising5 × 4L FL 0.476 0.004 37 
 BM-DI 0.473 0.002 210 12 
 BM-PCD 0.528 0.053 210 
BN20-37S FL 4.355 0.317 39 
 BM-DI 4.746 0.863 210 17 
 BM-PCD 4.803 0.903 210 
BN20-37L FL 0.697 0.026 105 12 
 BM-DI 1.422 0.750 210 19 
 BM-PCD 1.477 0.806 210 
BN20-54S FL 3.288 0.697 41 
 BM-DI 3.743 1.301 210 23 
 BM-PCD 3.826 1.338 210 
BN20-54L FL 0.430 0.057 192 23 
 BM-DI 1.545 1.166 210 21 
 BM-PCD 1.620 1.242 210 
DataModelKL(pdpθ)KL(p*pθ)#BasisTime
Ising5 × 4S FL 2.501 nat 0.012 nat 31 5 sec 
 BM-DI 2.424 0.087 210 13 
 BM-PCD 2.504 0.094 210 
Ising5 × 4L FL 0.476 0.004 37 
 BM-DI 0.473 0.002 210 12 
 BM-PCD 0.528 0.053 210 
BN20-37S FL 4.355 0.317 39 
 BM-DI 4.746 0.863 210 17 
 BM-PCD 4.803 0.903 210 
BN20-37L FL 0.697 0.026 105 12 
 BM-DI 1.422 0.750 210 19 
 BM-PCD 1.477 0.806 210 
BN20-54S FL 3.288 0.697 41 
 BM-DI 3.743 1.301 210 23 
 BM-PCD 3.826 1.338 210 
BN20-54L FL 0.430 0.057 192 23 
 BM-DI 1.545 1.166 210 21 
 BM-PCD 1.620 1.242 210 

Notes: pd: empirical distribution of training data. p*: true distribution. #Basis: number of used (θy0) basis functions. Time: CPU time for learning(median of three trials).

For Ising5 × 4S/L, a performance difference between FL and BMs (BM-DI and BM-PCD) was not remarkable because both FL and BMs could represent the true distribution p*. The fact that KL(pdpθ)KL(p*pθ) implies that overfitting to pd was successfully suppressed. FL used fewer basis functions than BMs used, which implies that some basis functions of BM were useless to represent p*. Regarding the accuracy of the model distribution, BM-PCD is less accurate than FL and BM-DI. This disadvantage becomes noticeable when the model distribution is close to the true distribution. Even when large training data are given, some error remains in the model distribution of BM-PCD (e.g., Ising5 × 4L).

For BN20-37S/L and BN20-54S/L, FL outperformed BMs because only FL could represent p*. To fit p*, FL adaptively selected 39 basis functions for BN20-37S and 105 basis functions for BN20-37L from |X|-1=220-1 basis functions. This fact implies that FL constructed a more complex model to fit p* as the training data increased. Furthermore, a comparison of KL(p*pθ) revealed that the accuracy of the model distribution was remarkably improved as the size of training data increased in FL. In contrast, BMs could not fit p* even if a large training data set was supplied.

Figure 4 illustrates the CPU time to learn the training data sets. BM-PCD was the fastest, and FL was faster than BM-DI for five out of six training data sets. The learning time of BM-PCD is constant because we used a fixed length (10,000) of Markov chains. FL had |X|-1=220-1 basis functions, while BM-DI had n(n+1)/2=210 basis functions. Nevertheless, FL was faster than BM-DI.

For BN20-54L, FL takes a long time to learn because it uses 192 basis functions to construct the model distribution. Using the 192 basis functions, FL successfully constructed a model distribution that fit p*, while BMs failed.

Figure 3:

Comparing accuracy of model by KL(p*pθ) (lower is better).

Figure 3:

Comparing accuracy of model by KL(p*pθ) (lower is better).

Close modal
Figure 4:

Comparing CPU time to learn (lower is better).

Figure 4:

Comparing CPU time to learn (lower is better).

Close modal

The major disadvantage of the FSLL model is that it is not feasible for large problems due to memory consumption and learning speed. If we use a typical personal computer, the problem size should be limited as |X|225. However, as far as we use the FSLL model in this problem domain, the model has the following theoretical and practical advantages.

The first advantage is that the FSLL model can represent arbitrary distributions of X. Furthermore, it is guaranteed that the model distribution converges to any target distribution as the limit of the training data size is infinity.

Here, we view learning machines from an information geometry perspective. The dimension of the distribution space P+ (all positive distributions of X) is |X|-1, and a learning machine having M parameters spans an M-dimensional manifold in P+ to represent its model distribution (we refer to this manifold as the model manifold).

A learning machine with M<|X|-1 parameters cannot represent arbitrary distributions in P. Moreover, if M<|X|-1, there is no guarantee that the true distribution is close to the model manifold, and if the model manifold is remote from the true distribution, the machine's performance will be poor. This poor performance is not improved even if infinite training data are given.

The FSLL model extends the manifold's dimension to |X|-1 by introducing higher-order factors. The model manifold becomes P itself; thus, there is no more expansion. Therefore, we refer to the model as the full-Span log-linear model. The main interest of this letter is that as far as the problem size is |X|225, the FSLL model becomes a feasible and practical model.

For example, suppose that we construct a full-span model by adding hidden nodes into a Boltzmann machine having 20 visible nodes. The number of parameters of the Boltzmann machine is E+n, where E is the number of edges and n is the number of nodes. Therefore, it is not practical to construct the full-span model for 20 visible nodes because it requires 220 edges.

The second advantage is that the FSLL model has no hyperparameters; therefore, no hyperparameter tuning is needed. For example, if we use a Boltzmann machine with hidden nodes that learns the true distribution with contrastive divergence methods, we need to determine hyperparameters such as the learning rate, mini-batch size, and the number of hidden and graphical structure of nodes. The FSLL model, however, automatically learns the training data without human participation.

Suppose that we let the FSLL model learn training data consisting of 20 binary variables. The dimension of the function space spanned by possible positive distributions is 220-1. The FSLL model has 220-1 parameters and can fit arbitrary positive distributions. The basis functions of the FSLL model have the following properties:

  • Each basis function is a product of univariate functions.

  • The basis functions take values 1 or -1.

The proposed learning algorithm exploited these properties and realized fast learning.

Our experiments demonstrated the following:

  • The FSLL model could learn the training data with 20 binary variables within 1 minute with a laptop PC.

  • The FSLL model successfully learned the true distribution underlying the training data; even higher-order terms that depend on three or more variables existed.

  • The FSLL model constructed a more complex model to fit the true distribution as the training data increased; however, the learning time became longer.

In this letter, we have presented a basic version of the FSLL model; however, we can extend it as follows (Takabatake & Akaho, 2014, 2015):

  • Introducing L1 regularization (Andrew & Gao, 2007)

  • Introducing hidden variables

A.1  Proof of Theorem 1

For brevity and clarity of expression, we use predicate logic notation here. The statement “{fi(X0)gj(X1)}(iI,jJ) are linearly independent functions of (X0,X1)” is equivalent to the following proposition:
x0x1i,jaijfi(x0)gj(x1)=0ijaij=0.
This proposition is proved as follows:
x0x1i,jaijfi(x0)gj(x1)=0x0x1jiaijfi(x0)gj(x1)=0x0jiaijfi(x0)=0{gj}arelinearlyindependentijaij=0.{fi}arelinearlyindependent.

A.2  Derivation of Equation 2.6

θ¯y(θ)=ylnZ=1ZyZ=1Zxyelθ(x)=1ZxΦy(x)elθ(x)=xpθ(x)Φy(x)=Φypθ.

A.3  Derivation of Equation 2.11

yθ¯y=yΦypθ=xΦy(x)ypθ(x)=xΦy(x)pθ(x)ylnpθ(x)=xΦy(x)pθ(x)y(lθ(x)-lnZ)=xΦy(x)pθ(x)(Φy(x)-ylnZ)=xpθ(x)Φy(x)2-θ¯yxpθ(x)Φy(x)=Φy2pθ-θ¯y2.

A.4  Derivation of Equation 2.13

1=θ¯y1-θ¯y2(byequation2.12)=12yθ¯y1+θ¯y+yθ¯y1-θ¯y=12yln(1+θ¯y)-yln(1-θ¯y).
(A.1)
Integrating equation A.1, we obtain
12ln1+θ¯y1-θ¯y=1dθy=θy-c,
(A.2)
where c is a constant of integration. Since the left side of equation A.2 equals tanh-1(θ¯y), we obtain the equation θ¯y=tanh(θy-c).

A.5  Derivation of Equation 2.16

yKL(pdpθ)=ylnpdpd-lnpθpd=-ylnpθpd=-ylθ-lnZpd=-ylθpd+ylnZ=-Φypd+θ¯=θ¯y-d¯.

A.6  Derivation of Equation 2.17

θy0θyθ¯y(u)-d¯ydu=θy0θytanh(u-c)du-d¯y(θy-θy0)(byequation2.13)=lncosh(u-c)u=θy0θy-d¯y(tanh-1θ¯y-tanh-1θ¯y0)=12lncosh2(u-c)u=θy0θy-d¯y12ln1+θ¯y1-θ¯y-12ln1+θ¯y01-θ¯y0=12ln11-tanh2(u-c)u=θy0θy-d¯y2ln1+θ¯y1-θ¯y1-θ¯y01+θ¯y0=12ln11-θ¯y2-ln11-(θ¯y0)2-d¯y2ln1+θ¯y1-θ¯y1-θ¯y01+θ¯y0=12ln1+θ¯y01+θ¯y1-θ¯y01-θ¯y-d¯y2ln1+θ¯y1-θ¯y1-θ¯y01+θ¯y0=-1+d¯y2ln(1+θ¯y)-1-d¯y2ln(1-θ¯y)+1+d¯y2ln(1+θ¯y0)+1-d¯y2ln(1-θ¯y0)=1+d¯y2ln1+θ¯y01+θ¯y+1-d¯y2ln1-θ¯y01-θ¯y.

A.7  Derivation of Equation 2.19

Δ=1+d¯y2ln1+θ¯y01+d¯y+1-d¯y2ln1-θ¯y01-d¯y+ry=1+d¯y2-ln1+d¯y1+θ¯y0+1-d¯y2-ln1-d¯y1-θ¯y0+ry1+d¯y21-1+d¯y1+θ¯y0+1-d¯y21-1-d¯y1-θ¯y0+ry1+d¯y20,1-d¯y20,-lnx1-x=1+d¯y2θ¯y0-d¯y1+θ¯y0-1-d¯y2θ¯y0-d¯y1-θ¯y0+ry=θ¯y0-d¯y21+d¯y1+θ¯y0-1-d¯y1-θ¯y0+ry=θ¯y0-d¯y2(1+d¯y)(1-θ¯y0)-(1-d¯y)(1+θ¯y0)(1+θ¯y0)(1-θ¯y0)+ry=θ¯y0-d¯y22d¯y-2θ¯y01-(θ¯y0)2+ry=-(θ¯y0-d¯y)21-(θ¯y0)2+ry=Δ̲.

A.8  Proof of Theorem 2

Since B is a bounded closed set, {θt} has one or more accumulation point(s) in B.

As an assumption of a proof by contradiction, assume that a is an accumulation point of {θt} and the proposition
ybAy(a),f(b)<f(a)
holds. Let ηjAy(θj) be the point such that ηyj=by. Since f(a)-f(b)>0, f(θ) is continuous at θ=b, and a is an accumulation point of {θt}, the proposition
θj,f(ηj)<f(a)
holds ( in Figure 5, f(ηj)<f(a) for sufficiently small δ).
Figure 5:

Existence of θj such that f(ηj)<f(a).

Figure 5:

Existence of θj such that f(ηj)<f(a).

Close modal
Here, ηjAy(θj), and the inequality
f(θj+1)=minθyAy(θj)f(θ)f(ηj)
holds. Therefore, the inequality
f(θj+1)f(ηj)<f(a)
holds. Since f(θt) monotonically decreases as t increases, the proposition
tj+1,f(θt)f(ηj)<f(a)
holds. Since f(θ) is continuous at θ=a, no sub-sequence of {θt} can converge to a, that is, a is not an accumulation point of {θt}. This fact contradicts the assumption we made at the beginning of this proof.

A.9  Proof of Corollary 1

By theorem 2, any accumulation point of {θt} is an axis minimum of f; however, the axis minimum of f is unique; therefore, θ=a is a unique accumulation point of {θt}, and θta(t).

A.10  Proof of Theorem 3

Let us define the following symbols:
my(θt,)=minθAy(θt)f(θ),argmy(θt,)={θ|θAy(θt),f(θ)=my(θt,)}.
Figure 6 illustrates the sectional view of fN(θ) along the line Ay(θt). As shown in the figure, fN(θ) has a gap with depth ry (see equation 2.5) at θy=0. Here, we can ignore the gap at θy=0 for sufficiently large N, that is, the proposition
Figure 6:

View of fN(θ).

Figure 6:

View of fN(θ).

Close modal
N'N,N>N'argmy(θt,N)=argmy(θt,)
holds. Moreover, the proposition
N'Ny,N>N'argmy(θt,N)=argmy(θt,)
(A.3)
holds. By equation 2.23, the proposition
N'N,N>N'argm(θt,N)=argm(θt,)
holds. Here, we compare a sequence {θt(N)} with the cost function fN and a sequence {θt()} with the cost function f(=f). By corollary 1, limtf(θt())=minθf(θ), that is,
ε>0t't,tt'f(θt())<minθf(θ)+ε
(A.4)
holds. Here, let T(N) be the smallest t such that m(θt+1,N)m(θt+1,). Then limNT(N)=, that is,
t'N,T(N)t'
(A.5)
holds. By equations A.4 and A.5, the proposition
ε>0N,t,tT(N)f(θt())<minθf(θ)+ε
(A.6)
holds, and therefore the proposition
ε>0N,f(θT(N)())<minθf(θ)+ε
(A.7)
also holds. Here, by the definition of T(N), θT(N)()=θT(N)(N). Therefore, we can modify equation A.7 as
ε>0N,f(θT(N)(N))<minθf(θ)+ε.
(A.8)
Since f(θt(N)) monotonically decreases as t grows,
ε>0Nt,tT(N)f(θt(N))<minθf(θ)+ε.
Using the notation of lim, we obtain
limNlimtf(θt(N))=minθf(θ).

This research is supported by KAKENHI 17H01793.

1

Distributions such that x,p(x)>0.

2

Furthermore, the FSLL model is not limited to binary variables.

3

We use “nat” as the description length unit; thus, we use “ln” instead of “log.”

4

Not 2D-FFT but 2D-DFT.

5

For k4, directly multiplying the Hadamard matrix is faster than the WHT in our environment; therefore, we use the WHT only in cases where k5.

6

Logarithm computation is 30 times slower than addition or multiplication in our environment.

7

In our environment, this skipping makes the evaluation of candidates more than 10 times faster.

8

Positivity is needed to keep B bounded.

Ackley
,
D. H.
,
Hinton
,
G. E.
, &
Sejnowski
,
T. J.
(
1985
).
A learning algorithm for Boltzmann machines
.
Cognitive Science
,
9
(
1
),
147
169
.
Amari
,
S.
(
2016
).
Information geometry and its applications
.
Berlin
:
Springer
.
Andrew
,
G.
, &
Gao
,
J.
(
2007
).
Scalable training of L1-regularized log-linear models
. In
Proceedings of the 24th International Conference on Machine Learning
(pp.
33
40
).
New York
:
ACM
.
Beck
,
A.
(
2015
).
On the convergence of alternating minimization for convex programming with applications to iteratively reweighted least squares and decomposition schemes
.
SIAM Journal on Optimization
,
25
(
1
),
185
209
.
Dennis Jr.
,
J. E.
, &
Moré
,
J. J.
(
1977
).
Quasi-Newton methods, motivation and theory
.
SIAM Review
,
19
(
1
),
46
89
.
Fino
,
B. J.
, &
Algazi
,
V. R.
(
1976
).
Unified matrix treatment of the fast Walsh- Hadamard transform
.
IEEE Transactions on Computers
,
25
(
11
),
1142
1146
.
Newman
,
M.
, &
Barkema
,
G.
(
1999
).
Monte Carlo methods in statistical physic
.
New York
:
Oxford University Press
.
Nocedal
,
J.
, &
Wright
,
S.
(
2006
).
Numerical optimization
.
New York
:
Springer Science & Business Media
.
Pratt
,
W. K.
,
Andrews
,
H. C.
, &
Kane
,
J.
(
1969
).
Hadamard transform image coding
. In
Proceedings of the IEEE
,
57
(
1
),
58
68
.
Raff
,
E.
(
2017
).
JSAT: Java statistical analysis tool, a library for machine learning
.
Journal of Machine Learning Research
,
18
(
23
),
1
5
.
Rissanen
,
J.
(
2007
).
Information and complexity in statistical modeling
.
Berlin
:
Springer
.
Sejnowski
,
T. J.
(
1986
). Higher-order Boltzmann machines. In
AIP Conference Proceedings
(vol. 151, pp.
398
403
).
Melville, NY
:
American Institute for Physics
.
Smith
,
W. W.
(
2010
).
Handbook of real-time fast Fourier transforms
.
Piscataway, NJ
:
IEEE
.
Takabatake
,
K.
, &
Akaho
,
S.
(
2014
).
Basis functions for fast learning of log-linear models
.
IEICE Technical Report
,
114
(
306
),
307
312
. (in Japanese)
Takabatake
,
K.
, &
Akaho
,
S.
(
2015
).
Full-span log-linear model with L1 regularization and its performance
.
IEICE Technical Report
,
115
(
323
),
153
157
. (in Japanese)
Tieleman
,
T.
(
2008
).
Training restricted Boltzmann machines using approximations to the likelihood gradient
. In
Proceedings of the 25th International Conference on Machine Learning
(pp.
1064
1071
).
New York
:
ACM
.
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode