## Abstract

Maximum pseudo-likelihood estimation (MPLE) is an attractive method for training fully visible Boltzmann machines (FVBMs) due to its computational scalability and the desirable statistical properties of the resulting estimator. No published algorithm for the MPLE of FVBMs has been proven to be convergent or monotonic. In this note, we present an algorithm for the MPLE of FVBMs based on the block successive lower-bound maximization (BSLM) principle. We show that the BSLM algorithm monotonically increases the pseudo-likelihood values and that the sequence of BSLM estimates converges to the unique global maximizer of the pseudo-likelihood function. The relationship between the BSLM algorithm and the gradient ascent (GA) algorithm for MPLE of FVBMs is also discussed, and a convergence criterion for the GA algorithm is given.

## 1  Introduction

Let $\mathbf{X} = (X_1, \dots, X_d)^{\top}$ be a random vector taking values in $\{-1, 1\}^d$, with realization $\mathbf{x}$ and probability mass function

$$ \mathbb{P}(\mathbf{X} = \mathbf{x}) = \frac{1}{Z(\mathbf{b}, \mathbf{M})} \exp\!\left( \tfrac{1}{2} \mathbf{x}^{\top} \mathbf{M} \mathbf{x} + \mathbf{b}^{\top} \mathbf{x} \right), \tag{1.1} $$

where $Z(\mathbf{b}, \mathbf{M}) = \sum_{\boldsymbol{\xi} \in \{-1,1\}^d} \exp( \tfrac{1}{2} \boldsymbol{\xi}^{\top} \mathbf{M} \boldsymbol{\xi} + \mathbf{b}^{\top} \boldsymbol{\xi} )$ is the normalizing constant, $\mathbf{b} = (b_1, \dots, b_d)^{\top}$, and $\mathbf{M} = [m_{jk}]$ is a symmetric $d \times d$ matrix with $m_{jj} = 0$. We put the elements of $\mathbf{b}$ and the upper triangular elements of $\mathbf{M}$ (i.e., $m_{jk}$, where $1 \le j < k \le d$) into the parameter vector $\boldsymbol{\theta}$.

Mass functions of form 1.1 are known as fully visible Boltzmann machines (FVBMs), which are special cases of the Boltzmann machines of Ackley, Hinton, and Sejnowski (1985), with no latent variables. Recently there has been interest in training FVBMs via maximum pseudo-likelihood estimation (MPLE) due to the probabilistic consistency and asymptotic normality of the MPLE (see Hyvarinen, 2006, and Nguyen & Wood, in press, respectively; see Arnold & Strauss, 1991, for a general treatment regarding MPLE). The statistical properties of MPLEs allow for the construction of hypothesis tests and confidence intervals such as those in Nguyen and Wood (in press).
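For small $d$, mass functions of form 1.1 can be evaluated by brute-force enumeration of the $2^d$ support points. The following sketch uses hypothetical parameter values and assumes the standard parameterization $\exp(\tfrac{1}{2}\mathbf{x}^{\top}\mathbf{M}\mathbf{x} + \mathbf{b}^{\top}\mathbf{x})/Z(\mathbf{b},\mathbf{M})$ over $\mathbf{x} \in \{-1,1\}^d$; it checks that the probabilities sum to 1:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Hypothetical parameters: b is a d-vector, M is symmetric with zero diagonal.
b = rng.normal(size=d)
M = np.triu(rng.normal(size=(d, d)), k=1)
M = M + M.T  # symmetric, m_jj = 0

def unnormalized(x, b, M):
    """exp(x' M x / 2 + b' x) for a spin vector x in {-1, +1}^d."""
    return np.exp(0.5 * x @ M @ x + b @ x)

# Enumerate all 2^d spin configurations to form the normalizing constant Z.
support = [np.array(x, dtype=float) for x in itertools.product([-1, 1], repeat=d)]
Z = sum(unnormalized(x, b, M) for x in support)

def pmf(x, b, M):
    """FVBM probability mass function (equation 1.1)."""
    return unnormalized(x, b, M) / Z

total = sum(pmf(x, b, M) for x in support)  # should be 1
```

Enumeration is exponential in $d$, which is exactly why pseudo-likelihood (which avoids $Z$) is attractive for training.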

There are currently no published algorithms for MPLE that are proven to be convergent or monotonic. In their work, Hyvarinen (2006) and Nguyen and Wood (in press) used gradient ascent (GA) and the Nelder-Mead algorithm (Nelder & Mead, 1965), respectively, neither of which has known convergence results for the problem.

In this note, we present a block successive lower-bound maximization (BSLM) algorithm based on the principles of Razaviyayn, Hong, and Luo (2013). We show that the BSLM algorithm increases the pseudo-likelihood in each iteration and is convergent to the global maximum of the pseudo-likelihood function. Furthermore, we discuss the relationship between the BSLM and the GA algorithm of Hyvarinen (2006), and we provide some simulation results that show the monotonicity of the log-pseudo-likelihood sequences generated by the BSLM algorithm.

## 2  Maximum Pseudo-Likelihood Estimation and the BSLM Algorithm

Let $\mathbf{X}_1, \dots, \mathbf{X}_n$ be a random sample from an FVBM with some unknown parameter vector $\boldsymbol{\theta}_0$ (i.e., the parameter components $\mathbf{b}_0$ and $\mathbf{M}_0$ are unknown), and let $\mathbf{x}_i$ be the realization of $\mathbf{X}_i$ for each $i = 1, \dots, n$. Following Nguyen and Wood (in press), the log-pseudolikelihood function for the FVBM can be given as

$$ \ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \sum_{j=1}^{d} \left[ x_{ij} \mu_{ij} - \log\!\left( 2 \cosh \mu_{ij} \right) \right], \qquad \mu_{ij} = b_j + \mathbf{m}_j^{\top} \mathbf{x}_i, \tag{2.1} $$

where $\mathbf{m}_j$ is the $j$th column of $\mathbf{M}$. MPLE is conducted by maximizing equation 2.1 to obtain the MPLE $\hat{\boldsymbol{\theta}}_n = \arg\max_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta})$.
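The log-pseudolikelihood is cheap to evaluate in vectorized form. A minimal sketch, assuming the logistic conditional form $\mathbb{P}(X_j = x_j \mid \mathbf{x}_{-j}) = \exp(x_j \mu_j)/(2\cosh\mu_j)$ with $\mathrm{diag}(\mathbf{M}) = \mathbf{0}$:

```python
import numpy as np

def log_pseudolikelihood(b, M, X):
    """Sum over i, j of [x_ij * mu_ij - log(2 cosh(mu_ij))],
    where mu_ij = b_j + (M x_i)_j and diag(M) = 0."""
    mu = X @ M + b  # n x d matrix of conditional natural parameters mu_ij
    return np.sum(X * mu - np.log(2.0 * np.cosh(mu)))

# Usage on hypothetical data: n draws of d independent fair +/-1 coins
# corresponds to b = 0, M = 0, where every conditional probability is 1/2.
rng = np.random.default_rng(1)
n, d = 100, 3
X = rng.choice([-1.0, 1.0], size=(n, d))
ll0 = log_pseudolikelihood(np.zeros(d), np.zeros((d, d)), X)
# ll0 equals n * d * log(1/2)
```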

Under the BSLM paradigm, we construct an iterative algorithm in which, at each iteration and for each coordinate of the parameter vector, we maximize a simple, well-behaved lower-bounding approximation of the objective function (i.e., equation 2.1). The maximization occurs over blocks or subsets of the parameter vector (e.g., each coordinate) noncontemporaneously. In each iteration, all blocks are updated successively, taking into account previous updates.

The BSLM algorithm for computing $\hat{\boldsymbol{\theta}}_n$ first requires an initialization $\boldsymbol{\theta}^{(0)}$. Next, we define the $r$th iterate of the parameter vector as $\boldsymbol{\theta}^{(r)}$, and we compute the $(r+1)$th iterate in two steps, the $\mathbf{b}$ step and the $\mathbf{M}$ step. In the $\mathbf{b}$ step, we compute the updates

$$ b_j^{(r+1)} = \arg\max_{b_j} \, \bar{\ell}_j\big(b_j; \boldsymbol{\theta}_j^{(r)}\big) \tag{2.2} $$

in the order $j = 1, \dots, d$. Here

$$ \bar{\ell}_j\big(b_j; \boldsymbol{\theta}_j^{(r)}\big) = \ell\big(\boldsymbol{\theta}_j^{(r)}\big) + \frac{\partial \ell}{\partial b_j}\bigg|_{\boldsymbol{\theta} = \boldsymbol{\theta}_j^{(r)}} \big(b_j - b_j^{(r)}\big) - n \big(b_j - b_j^{(r)}\big)^2, \tag{2.3} $$

and $\boldsymbol{\theta}_j^{(r)}$ is the parameter vector containing the already-updated coordinates $b_1^{(r+1)}, \dots, b_{j-1}^{(r+1)}$ together with the remaining coordinates of $\boldsymbol{\theta}^{(r)}$. Definition 2.2 yields the updates

$$ b_j^{(r+1)} = b_j^{(r)} + \frac{1}{2n} \sum_{i=1}^{n} \left[ x_{ij} - \tanh \mu_{ij}^{(r)} \right], \tag{2.4} $$

where $\mu_{ij}^{(r)} = b_j^{(r)} + \mathbf{m}_j^{(r)\top} \mathbf{x}_i$.
Next, in the $\mathbf{M}$ step, in lexicographic order of the upper triangular elements of $\mathbf{M}$ (i.e., $m_{12}, m_{13}, \dots, m_{d-1,d}$), we compute the updates

$$ m_{jk}^{(r+1)} = \arg\max_{m_{jk}} \, \bar{\ell}_{jk}\big(m_{jk}; \boldsymbol{\theta}_{jk}^{(r)}\big), \tag{2.5} $$

where

$$ \bar{\ell}_{jk}\big(m_{jk}; \boldsymbol{\theta}_{jk}^{(r)}\big) = \ell\big(\boldsymbol{\theta}_{jk}^{(r)}\big) + \frac{\partial \ell}{\partial m_{jk}}\bigg|_{\boldsymbol{\theta} = \boldsymbol{\theta}_{jk}^{(r)}} \big(m_{jk} - m_{jk}^{(r)}\big) - n \big(m_{jk} - m_{jk}^{(r)}\big)^2, \tag{2.6} $$

and $\boldsymbol{\theta}_{jk}^{(r)}$ contains the updated vector $\mathbf{b}^{(r+1)}$, the already-updated upper triangular elements preceding $m_{jk}$ in lexicographic order, and the remaining elements of $\boldsymbol{\theta}^{(r)}$; the corresponding matrix is symmetric with $m_{jj} = 0$ and nondiagonal elements $m_{jk} = m_{kj}$, for $j < k$. Definition 2.5 yields the updates

$$ m_{jk}^{(r+1)} = m_{jk}^{(r)} + \frac{1}{2n} \sum_{i=1}^{n} \left[ 2 x_{ij} x_{ik} - x_{ik} \tanh \mu_{ij}^{(r)} - x_{ij} \tanh \mu_{ik}^{(r)} \right], \tag{2.7} $$

where $\mu_{ij}^{(r)}$ and $\mu_{ik}^{(r)}$ are evaluated at $\boldsymbol{\theta}_{jk}^{(r)}$.
The $\mathbf{b}$ and $\mathbf{M}$ steps are iterated until the algorithm converges, whereupon the final iterate is declared the MPLE $\hat{\boldsymbol{\theta}}_n$. Here, we define convergence in the sense that $\ell(\boldsymbol{\theta}^{(r+1)}) - \ell(\boldsymbol{\theta}^{(r)}) < \epsilon$ for some sufficiently small tolerance $\epsilon > 0$.
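The two steps can be sketched as a pair of coordinate sweeps. This is a minimal illustration, not the authors' code; it assumes the update forms with step size $(2n)^{-1}$ given above, and empirically checks that the log-pseudolikelihood never decreases:

```python
import numpy as np

def log_pl(b, M, X):
    """Log-pseudolikelihood of an FVBM (equation 2.1), diag(M) = 0."""
    mu = X @ M + b
    return np.sum(X * mu - np.log(2.0 * np.cosh(mu)))

def bslm_step(b, M, X):
    """One BSLM iteration: the b step (update 2.4) then the M step (update 2.7).
    Each coordinate update maximizes a quadratic lower bound of the log-PL."""
    n, d = X.shape
    b, M = b.copy(), M.copy()
    for j in range(d):  # b step, j = 1, ..., d
        mu_j = X @ M[:, j] + b[j]
        b[j] += np.sum(X[:, j] - np.tanh(mu_j)) / (2.0 * n)
    for j in range(d):  # M step, lexicographic order over (j, k), j < k
        for k in range(j + 1, d):
            mu_j = X @ M[:, j] + b[j]
            mu_k = X @ M[:, k] + b[k]
            grad = np.sum(2.0 * X[:, j] * X[:, k]
                          - X[:, k] * np.tanh(mu_j)
                          - X[:, j] * np.tanh(mu_k))
            M[j, k] += grad / (2.0 * n)
            M[k, j] = M[j, k]  # keep M symmetric; diagonal stays zero
    return b, M

# Synthetic +/-1 data, started from the zero parameter vector.
rng = np.random.default_rng(2)
X = rng.choice([-1.0, 1.0], size=(200, 4))
b, M = np.zeros(4), np.zeros((4, 4))
values = [log_pl(b, M, X)]
for _ in range(50):
    b, M = bslm_step(b, M, X)
    values.append(log_pl(b, M, X))
monotone = all(v2 >= v1 - 1e-9 for v1, v2 in zip(values, values[1:]))
```

Each sweep recomputes $\mu$ after every block update, reflecting the successive (rather than simultaneous) nature of the BSLM scheme.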

## 3  Convergence Results

For some initialization $\boldsymbol{\theta}^{(0)}$, if we let $r \to \infty$, then the sequence $\{\boldsymbol{\theta}^{(r)}\}$ tends to $\boldsymbol{\theta}^{*}$, where $\boldsymbol{\theta}^{*}$ is a limit point of the BSLM algorithm. Using theorem 2 of Razaviyayn et al. (2013), we can state the following convergence result.

Theorem 1.

Every limit point of the BSLM algorithm, defined by updates 2.4 and 2.7, is a stationary point of equation 2.1, and the sequence of log-pseudo-likelihood values is increasing.

Proof.

By theorem 2 of Razaviyayn et al. (2013), we obtain the result by checking that $\bar{\ell}_j$ and $\bar{\ell}_{jk}$ satisfy the following assumptions.

1. For each $j$, $\bar{\ell}_j(b_j; \boldsymbol{\theta}_j^{(r)}) \le \ell(\boldsymbol{\theta})$, viewed as functions of $b_j$ with all other coordinates held fixed, with equality if and only if $b_j = b_j^{(r)}$.

2. For each $j < k$, $\bar{\ell}_{jk}(m_{jk}; \boldsymbol{\theta}_{jk}^{(r)}) \le \ell(\boldsymbol{\theta})$, viewed as functions of $m_{jk}$ with all other coordinates held fixed, with equality if and only if $m_{jk} = m_{jk}^{(r)}$.

3. For each $j$, $\bar{\ell}_j(b_j; \boldsymbol{\theta}_j^{(r)})$ is quasi-concave and continuous in $b_j$, with a unique global maximizer.

4. For each $j < k$, $\bar{\ell}_{jk}(m_{jk}; \boldsymbol{\theta}_{jk}^{(r)})$ is quasi-concave and continuous in $m_{jk}$, with a unique global maximizer.

To show that assumptions A1 and A2 are satisfied, we use the quadratic bound principle (QBP; see equation 8.8 of Lange, 2013), which states that for any twice-differentiable real function $f$, if $f''(y) \ge -\gamma$ for some $\gamma > 0$ and all $y$, then

$$ f(y) \ge f(y_0) + f'(y_0)(y - y_0) - \frac{\gamma}{2}(y - y_0)^2, $$

where $y_0 \in \mathbb{R}$. By the QBP, assumption A1 is satisfied if $\partial^2 \ell / \partial b_j^2 \ge -2n$ for each $j$. Since

$$ \frac{\partial^2 \ell}{\partial b_j^2} = -\sum_{i=1}^{n} \operatorname{sech}^2 \mu_{ij}, $$

the result holds by noting that $0 < \operatorname{sech}^2 y \le 1$ for all $y \in \mathbb{R}$. Similarly, by the QBP, assumption A2 is satisfied if $\partial^2 \ell / \partial m_{jk}^2 \ge -2n$ for each $j$ and $k$, which can be confirmed by observing that

$$ \frac{\partial^2 \ell}{\partial m_{jk}^2} = -\sum_{i=1}^{n} \left[ x_{ik}^2 \operatorname{sech}^2 \mu_{ij} + x_{ij}^2 \operatorname{sech}^2 \mu_{ik} \right] \ge -2n. $$

Next, consider that $\bar{\ell}_j$ and $\bar{\ell}_{jk}$ are concave quadratic functions of $b_j$ and $m_{jk}$, respectively, which implies their continuity and the uniqueness of their maximizers. Furthermore, all concave functions are quasi-concave; hence, assumptions A3 and A4 are satisfied.
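The QBP minorization underlying assumption A1 can be spot-checked numerically for a single summand $f(y) = -\log(2\cosh y)$, whose second derivative $-\operatorname{sech}^2 y$ is bounded below by $-1$:

```python
import numpy as np

def f(y):
    """One summand of the log-pseudolikelihood: f(y) = -log(2 cosh(y))."""
    return -np.log(2.0 * np.cosh(y))

def surrogate(y, y0):
    """Quadratic lower bound from the QBP with gamma = 1, since f'' >= -1.
    Note f'(y0) = -tanh(y0)."""
    return f(y0) - np.tanh(y0) * (y - y0) - 0.5 * (y - y0) ** 2

ys = np.linspace(-5.0, 5.0, 1001)
holds = all(f(y) >= surrogate(y, y0) - 1e-12
            for y0 in (-2.0, 0.0, 1.5) for y in ys)
```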

As Nguyen and Wood (in press) noted, equation 2.1 is strictly concave with respect to $\boldsymbol{\theta}$. This fact can be established by observing that $d^2[-\log\cosh y]/dy^2 = -\operatorname{sech}^2 y < 0$ for all $y \in \mathbb{R}$; hence, $\ell$ must be strictly concave by composition with the affine maps $\boldsymbol{\theta} \mapsto \mu_{ij}$. By elementary calculus, we obtain the following corollary:

Corollary 1.

The limit point of the BSLM algorithm, defined by updates 2.4 and 2.7, is the unique global maximizer of equation 2.1.

## 4  Relation to Gradient Ascent

In Hyvarinen (2006), a GA algorithm was considered, where updates 2.4 and 2.7 were replaced with

$$ b_j^{(r+1)} = b_j^{(r)} + \gamma \sum_{i=1}^{n} \left[ x_{ij} - \tanh \mu_{ij}^{(r)} \right] \tag{4.1} $$

and

$$ m_{jk}^{(r+1)} = m_{jk}^{(r)} + \gamma \sum_{i=1}^{n} \left[ 2 x_{ij} x_{ik} - x_{ik} \tanh \mu_{ij}^{(r)} - x_{ij} \tanh \mu_{ik}^{(r)} \right], \tag{4.2} $$

respectively, for some learning rate $\gamma > 0$. Thus, the BSLM algorithm is the $\gamma = (2n)^{-1}$ case of the GA algorithm. From our convergence results and by the QBP, we can deduce that the GA algorithm with $\gamma \le (2n)^{-1}$ will yield an increasing sequence of pseudo-likelihood values that converges to the global maximum, whereas no guarantees can be made when $\gamma > (2n)^{-1}$.

Using the same argument as in theorem 1, we note that $\bar{\ell}_j \le \ell$ and $\bar{\ell}_{jk} \le \ell$ remain valid when the quadratic coefficient $n$ in equations 2.3 and 2.6 is replaced by $(2\gamma)^{-1}$, for any $\gamma \le (2n)^{-1}$. To obtain equation 4.1, it suffices to substitute $(2\gamma)^{-1}$ in place of $n$ in equation 2.3 and to solve the first-order condition (FOC). Similarly, to obtain equation 4.2, it suffices to substitute $(2\gamma)^{-1}$ in place of $n$ in equation 2.6 and to solve the FOC.
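For concreteness, under the quadratic surrogate form used here, the FOC computation behind equation 4.1 proceeds as follows (writing $g_j$ for $\partial \ell / \partial b_j$ evaluated at the current iterate):

```latex
\frac{\partial}{\partial b_j}
\left[ \ell + g_j \big( b_j - b_j^{(r)} \big)
       - \frac{1}{2\gamma} \big( b_j - b_j^{(r)} \big)^2 \right]
= g_j - \frac{1}{\gamma} \big( b_j - b_j^{(r)} \big) = 0
\quad \Longrightarrow \quad
b_j^{(r+1)} = b_j^{(r)} + \gamma \, g_j .
```

Setting $\gamma = (2n)^{-1}$ recovers update 2.4; the computation for equation 4.2 is identical with $g_{jk} = \partial \ell / \partial m_{jk}$ in place of $g_j$.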

## 5  Simulation Results

To demonstrate the increasing property of the BSLM sequence of log-pseudo-likelihood values, we performed a simulation, following the design of Hyvarinen (2006). In each of our four simulation scenarios, we simulated a single instance of $n$ observations from an FVBM with parameters $\mathbf{b}_0$ and $\mathbf{M}_0$. For all of the scenarios, the upper triangular values of $\mathbf{M}_0$ and the values of $\mathbf{b}_0$ are each generated from a normal distribution with mean zero and a common variance. The initialization $\boldsymbol{\theta}^{(0)}$ of the BSLM algorithm is simulated in the same manner, and the tolerance $\epsilon$ is set to a small positive value.

Using the BSLM algorithm, we obtained five sequences of log-pseudo-likelihood values for each scenario with the results shown in Figure 1. We observed that the log-pseudo-likelihood values are increasing in each simulation, as expected. Furthermore, most of the increase in log-pseudo-likelihood values occurs in early iterations, and the algorithm appears to converge rapidly.

Figure 1:

BSLM-obtained sequences of log-pseudo-likelihood values for the five repetitions of each of the four simulation scenarios. The solid lines are the log-pseudo-likelihood values (left axis), and the dashed lines are the increases in values (right axis) in each iteration (in log base 10). Each point shape indicates a different repetition.

We also calculated the average mean squared error (MSE) over the five repetitions of each scenario. Here, the average MSE is computed as $(5q)^{-1} \sum_{s=1}^{5} \| \hat{\boldsymbol{\theta}}_s - \boldsymbol{\theta}_0 \|_2^2$, where $\boldsymbol{\theta}_0$ and $\hat{\boldsymbol{\theta}}_s$ are the true parameter and MPL estimate, respectively, for repetitions $s = 1, \dots, 5$, and $q$ is the number of elements of the parameter vector. The average MSE values found were small in all four scenarios and conformed to the theoretical results of Nguyen and Wood (in press).
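The average MSE formula above is straightforward to compute; a sketch with toy values (for illustration only, not the simulation's actual estimates):

```python
import numpy as np

def average_mse(theta_hats, theta_0):
    """Average MSE over repetitions: (1 / (R * q)) * sum_s ||theta_hat_s - theta_0||^2,
    where R is the number of repetitions and q the number of parameters."""
    theta_hats = np.asarray(theta_hats, dtype=float)
    theta_0 = np.asarray(theta_0, dtype=float)
    R, q = theta_hats.shape
    return np.sum((theta_hats - theta_0) ** 2) / (R * q)

# Toy check: two repetitions of a q = 2 parameter vector.
mse = average_mse([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])  # -> 0.5
```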

## 6  Conclusion

In this note, we have presented a BSLM algorithm for the MPLE of the FVBM. Furthermore, we have shown that the pseudo-likelihood sequence generated by the algorithm converges monotonically to the unique global maximum. Using the convergence results for the BSLM algorithm, we have also deduced a convergence criterion for the GA algorithm of Hyvarinen (2006).

## References

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. *Cognitive Science*, 9, 147–169.

Arnold, B. C., & Strauss, D. (1991). Pseudolikelihood estimation: Some examples. *Sankhya B*, 53, 233–243.

Hyvarinen, A. (2006). Consistency of pseudolikelihood estimation of fully visible Boltzmann machines. *Neural Computation*, 18, 2283–2292.

Lange, K. (2013). *Optimization*. New York: Springer.

Nelder, J. A., & Mead, R. (1965). A simplex method for function minimization. *Computer Journal*, 7, 308–313.

Nguyen, H. D., & Wood, I. A. (in press). Asymptotic normality of the maximum pseudolikelihood estimator for fully visible Boltzmann machines. *IEEE Transactions on Neural Networks and Learning Systems*.

Razaviyayn, M., Hong, M., & Luo, Z.-Q. (2013). A unified convergence analysis of block successive minimization methods for nonsmooth optimization. *SIAM Journal on Optimization*, 23, 1126–1153.