We consider Bayesian inference problems with computationally intensive likelihood functions. We propose a Gaussian process (GP)–based method to approximate the joint distribution of the unknown parameters and the data, built on recent work (Kandasamy, Schneider, & Póczos, 2015). In particular, we write the joint density approximately as a product of an approximate posterior density and an exponentiated GP surrogate. We then provide an adaptive algorithm to construct such an approximation, where an active learning method is used to choose the design points. With numerical examples, we illustrate that the proposed method has competitive performance against existing approaches for Bayesian computation.
Bayesian inference is a popular method for estimating unknown parameters from data, and a major advantage of the method is its ability to quantify uncertainty in the inference results (Gelman et al., 2013). In this letter, we consider Bayesian inference problems where the likelihood functions are highly expensive to evaluate. A typical example of this type of problem is the Bayesian inverse problem (Tarantola, 2005), where the parameters of interest cannot be observed directly and need to be estimated from indirect data. Such problems arise in many real-world applications, ranging from carbon capture (Konomi, Karagiannis, Lai, & Lin, 2018) to chemical kinetics (Golightly & Wilkinson, 2011). In Bayesian inverse problems, the mappings from the parameters of interest to the observable quantities, often known as the forward models, are typically computationally intensive (e.g., involving the simulation of large-scale computer models).
Due to the high computational cost, common numerical implementations of Bayesian inference, such as Markov chain Monte Carlo (MCMC; Andrieu, De Freitas, Doucet, & Jordan, 2003) methods, can be prohibitively expensive. A simple idea to accelerate the computation of the posterior is to construct a computationally inexpensive surrogate or approximation of the posterior distribution with a limited number of likelihood function evaluations. To this end, a convenient choice for the surrogate is the gaussian process (GP) model (Williams & Rasmussen, 2006). The idea of using this model to approximate the posterior or the likelihood function dates back to the so-called Bayesian quadrature (or Bayesian Monte Carlo) approaches (O'Hagan, 1991; Rasmussen & Ghahramani, 2003; Rasmussen et al., 2003; Kennedy, 1998), which were designed to perform numerical integration in a Bayesian fashion (e.g., to compute the evidence in Bayesian inference problems; Osborne et al., 2012).
Unlike the Bayesian quadrature methods, the goal of this work is to construct an approximation of the posterior distribution. To this end, recent work (Kandasamy, Schneider, & Póczos, 2015) approximates the joint distribution of the unknown parameter and the data (which can also be viewed as the unnormalized posterior distribution) with an exponentiated GP model, where the design points (the points where the likelihood function is evaluated) are chosen with an active learning strategy. In particular, they determine the design points by sequentially maximizing the variance in the posterior approximation. Other ideas for using GP approximation to accelerate Bayesian computation can be found in Conrad, Marzouk, Pillai, and Smith (2016); Bilionis and Zabaras (2013), among others. The method presented in this letter also intends to approximate the unnormalized posterior distribution. The main contribution of the work is as follows. We write the unnormalized posterior distribution as a product of an approximate posterior density and an exponentiated GP surrogate. The intuition behind this formulation is that the GP model can be more effectively constructed if we factor out a good approximation of the posterior (see section 2.3 for a detailed explanation). As we may not know a good approximate posterior density in advance, we develop an algorithm to adaptively construct the product-form approximation of the unnormalized posterior distribution. Another difference between our method and that in Kandasamy et al. (2015) is the learning strategy for selecting the design points. We use the entropy rather than the variance as the selection criterion, which can better represent the uncertainty in the approximation. Numerical examples illustrate that the proposed method can substantially improve the performance of the GP approximation.
We note that other surrogate models, notably generalized polynomial chaos (gPC) expansions (Li & Marzouk, 2014; Marzouk & Xiu, 2009; Marzouk & Najm, 2009; Marzouk, Najm, & Rahn, 2007; Nagel & Sudret, 2016), have also been used to accelerate Bayesian computation. A detailed comparison of the two types of surrogates is beyond the scope of this work; those who are interested in this matter may consult O'Hagan (2013).
The rest of the letter is organized as follows. In section 2, we present the adaptive GP algorithm to construct the posterior approximation and the active learning method to determine the design points. In section 3, we give two examples to illustrate the performance of the proposed method. Section 4 provides some concluding remarks.
2 The Adaptive GP Method
2.1 Problem Setup
In what follows, we suppress the arguments of the likelihood function for simplicity. It is easy to see that each evaluation of the likelihood function requires evaluating the forward function. In practice, the forward function often represents a large-scale computer model, and thus its evaluation can be computationally demanding. Due to the high computational cost, brute-force Monte Carlo simulation cannot be used for such problems, and we resort to an alternative method to compute the posterior distribution, using a GP surrogate model. A brief description of the GP method is provided in the next section.
2.2 The GP Model
2.3 The Adaptive GP Algorithm
Now we discuss how to use the GP method to compute the posterior distribution in our problem. A straightforward idea is to construct the surrogate model directly for the log-likelihood function; such a method has been used in Osborne et al. (2012) and Kandasamy et al. (2015). A difficulty in this approach is that the target function can be highly nonlinear and fast-varying, and thus is not well described by a GP model. We present an adaptive scheme to alleviate this difficulty.
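As a minimal illustration of the basic building block (a sketch under assumptions, not the authors' code), the following fits a GP surrogate to a few evaluations of a hypothetical one-dimensional log-likelihood; exponentiating the predictive mean then gives a cheap approximation of the unnormalized posterior. The kernel choice, length scale, and toy target are illustrative.

```python
import numpy as np

def rbf_kernel(A, B, ell=1.0, sf2=1.0):
    # squared-exponential kernel k(x, x') = sf2 * exp(-|x - x'|^2 / (2 ell^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf2 * np.exp(-d2 / (2.0 * ell ** 2))

def gp_posterior(X, y, Xs, ell=1.0, sf2=1.0, noise=1e-6):
    # standard GP regression predictive equations (Williams & Rasmussen, 2006)
    K = rbf_kernel(X, X, ell, sf2) + noise * np.eye(len(X))
    Ks = rbf_kernel(Xs, X, ell, sf2)
    Kss = rbf_kernel(Xs, Xs, ell, sf2)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks @ alpha                       # predictive mean
    v = np.linalg.solve(L, Ks.T)
    var = np.diag(Kss - v.T @ v)          # predictive variance
    return mu, var

# toy design: evaluate a hypothetical log-likelihood at a handful of points
X = np.array([[-2.0], [-1.0], [0.0], [1.0], [2.0]])
y = -0.5 * X[:, 0] ** 2                   # stand-in for an expensive log-likelihood
grid = np.linspace(-3, 3, 61)[:, None]
mu, var = gp_posterior(X, y, grid)
unnorm_post = np.exp(mu)                  # surrogate for the unnormalized posterior
```

The variance returned alongside the mean is what the active learning strategy of section 2.4 exploits: it is small near the design points and grows away from them.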
Some remarks on the implementation of algorithm 1 are in order:
In line 6, we construct the GP model using the procedure described in section 2.2. The hyperparameters of the GP model are determined by maximizing the marginal likelihood function (Williams & Rasmussen, 2006).
In line 7, we resort to the MCMC method to draw a rather large number of samples from the approximate posterior distribution. This procedure does not require evaluating the true likelihood function, however, and is thus not computationally expensive.
In line 8, we need to compute the density function of a distribution from its samples, and here we use the gaussian mixture method (McLachlan & Peel, 2004) to estimate the density. We note that an important issue in the gaussian mixture method is how to choose the number of mixture components. In our numerical examples, we use a fixed number of components, but it is also possible to determine it with an adaptive procedure (Vlassis & Likas, 2002). Certainly there will be estimation errors in this procedure, and so we use a distinct notation for the estimated density to distinguish it from the true density.
In line 9, we find that it is rather costly to compute the KLD in terms of the exact densities. We instead use the KLD between the gaussian mixture estimates, which is much easier to compute as both distributions are available in closed form. This KLD can be efficiently computed with the Monte Carlo method, as we have stored the samples from the previous iteration.
In line 15, we need to determine the design points—the locations where we evaluate the true function. The choice of design points is critical to the performance of the proposed adaptive GP algorithm, and we use an active learning method to determine the points, which we present in section 2.4.
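Two of the remarks above (the gaussian mixture density estimate in line 8 and the sample-based KLD in line 9) can be sketched as follows. This is a minimal one-dimensional sketch under assumptions: the plain EM fit, the fixed component count, and all function names are illustrative, not the authors' implementation.

```python
import numpy as np

def fit_gmm_1d(x, k=2, iters=200):
    # plain EM for a 1-D gaussian mixture with a fixed number k of components
    w = np.full(k, 1.0 / k)
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # spread initial means over data
    var = np.full(k, np.var(x))
    for _ in range(iters):
        # E-step: responsibilities r[i, j] = P(component j | x_i)
        logp = (-0.5 * (x[:, None] - mu) ** 2 / var
                - 0.5 * np.log(2 * np.pi * var) + np.log(w))
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights, means, variances
        n = r.sum(axis=0)
        w, mu = n / len(x), (r * x[:, None]).sum(axis=0) / n
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n + 1e-10
    return w, mu, var

def gmm_logpdf(x, w, mu, var):
    comp = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return np.log(comp @ w)

def gmm_sample(n, w, mu, var, rng):
    idx = rng.choice(len(w), size=n, p=w)
    return rng.normal(mu[idx], np.sqrt(var[idx]))

def mc_kld(p, q, n=100_000, seed=0):
    # D_KL(p || q) ~ (1/n) sum_i [log p(x_i) - log q(x_i)] with x_i drawn from p
    rng = np.random.default_rng(seed)
    x = gmm_sample(n, *p, rng)
    return float(np.mean(gmm_logpdf(x, *p) - gmm_logpdf(x, *q)))
```

Because both densities are gaussian mixtures, the log densities inside the Monte Carlo average are cheap to evaluate, which is what makes this convergence check inexpensive relative to the true likelihood.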
2.4 Active Learning for the Design Points
In the GP literature, the determination of the design points is often cast as an experimental design problem; that is, we need to find the experimental parameters that provide the most information. The problem has received considerable attention, and a number of methods and criteria have been proposed to select the points: the mutual information criterion (Krause, Singh, & Guestrin, 2008), the integrated mean square error (IMSE; Sacks, Welch, Mitchell, & Wynn, 1989), the integrated posterior variance (IVAR; Gorodetsky & Marzouk, 2016), and the active learning MacKay (ALM) criterion (MacKay, 1992), to name just a few. Here we choose to use an active learning strategy that adds one design point at a time, primarily because it is easy to implement.
A common active learning strategy is to choose the point that has the largest uncertainty, and to this end we need a function that can measure or quantify the uncertainty in the reconstructed approximation. In the usual GP setting, the variance of the GP model is a natural choice for such a measure of uncertainty (which yields the ALM method), because the GP prediction at any point is gaussian. In our problems, however, the function of interest is the posterior approximation rather than the GP model itself, and thus we should measure the uncertainty in the posterior approximation. In Kandasamy et al. (2015), the variance of the posterior approximation is used as the measure function. However, since the posterior approximation is not gaussian, the variance may not provide a good estimate of the uncertainty. On the other hand, the entropy is a commonly used measure of the uncertainty in a random variable (Rényi, 1961; Shannon, 2001), and here we use it as our design criterion.
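To make the distinction concrete: since the posterior approximation is an exponentiated GP, its pointwise distribution is lognormal, for which both the entropy and the variance have closed forms. The sketch below (one plausible pointwise reading of the criterion, not necessarily the authors' exact formulation) shows that the two measures can rank candidate points differently, because the variance is dominated by the predictive mean while the entropy depends on the predictive spread only logarithmically.

```python
import numpy as np

def lognormal_entropy(m, s2):
    # entropy of exp(f) with f ~ N(m, s2): H = m + 0.5 * log(2*pi*e*s2)
    return m + 0.5 * np.log(2 * np.pi * np.e * s2)

def lognormal_variance(m, s2):
    # variance of exp(f) with f ~ N(m, s2): (e^{s2} - 1) * e^{2m + s2}
    return (np.exp(s2) - 1.0) * np.exp(2 * m + s2)

# two hypothetical candidate points: A has high mean and tiny spread,
# B has low mean and large spread
m_a, s2_a = 3.0, 0.01
m_b, s2_b = 0.0, 1.0
```

With these (illustrative) numbers the variance criterion would pick point B, whereas the entropy criterion picks point A, whose value is genuinely less certain on the log scale relative to its magnitude.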
Now suppose that we have a set of existing data points, and we want to choose new design points. We use the following scheme to sequentially choose the new points:
1. Construct a GP model using the current data set.
2. Find the point that maximizes the entropy criterion.
3. Evaluate the true function at this point and add the resulting data pair to the data set.
Note that the key in the adaptive scheme is step 2, where we seek the point that maximizes the entropy criterion. This is a challenging problem from an optimization perspective, because it may have multiple local maxima. However, in the numerical tests, we have found that our algorithm does not strictly require the optimality of the solution, and it performs well as long as a good design point can be found in each step. Thus, here we use a stochastic search method, the simulated annealing algorithm (Kirkpatrick, Gelatt, & Vecchi, 1983), to find the design point. We have tested other metaheuristic optimization algorithms, and the performances do not vary significantly.
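The stochastic search in step 2 can be sketched as a basic simulated-annealing loop. This is a generic textbook variant with an arbitrary linear cooling schedule and step size, shown here only to illustrate why near-optimal solutions suffice; the authors' exact annealing settings are not specified.

```python
import numpy as np

def simulated_annealing(score, bounds, iters=5000, t0=1.0, seed=0):
    # maximize `score` over a box via a basic simulated-annealing random walk
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, dtype=float).T
    x = rng.uniform(lo, hi)
    fx = score(x)
    best_x, best_f = x.copy(), fx
    for i in range(iters):
        t = t0 * (1.0 - i / iters) + 1e-12          # linear cooling schedule
        cand = np.clip(x + 0.1 * (hi - lo) * rng.standard_normal(len(lo)), lo, hi)
        fc = score(cand)
        # always accept uphill moves; accept downhill with Metropolis probability
        if fc > fx or rng.random() < np.exp((fc - fx) / t):
            x, fx = cand, fc
            if fx > best_f:
                best_x, best_f = x.copy(), fx
    return best_x, best_f

# illustrative multimodal acquisition surface with two equal maxima at x = +/-1
acq = lambda x: -(x[0] ** 2 - 1.0) ** 2
x_star, f_star = simulated_annealing(acq, [(-3.0, 3.0)], seed=1)
```

Because either of the two maxima is an equally good design point, the remark in the text applies: the scheme only needs the search to land near some good point, not the global optimum.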
3 Numerical Examples
3.1 The Rosenbrock Function
3.2 Genetic Toggle Switch
We use this example to make a detailed comparison of the proposed AGP algorithm with some other popular methods. Thus, we employ four methods to compute the posterior distribution in this example: the direct MCMC algorithm with the true likelihood function, the proposed AGP algorithm, the BAPE method (Kandasamy et al., 2015), and the spectral likelihood expansion (SLE) method (Nagel & Sudret, 2016), which constructs the gPC surrogate for the likelihood function using nonintrusive approaches.
We first consider the large-noise case. We draw samples from the true posterior distribution with a DRAM algorithm and use the results as the reference posterior distribution. We then apply the AGP method to approximate the posterior distribution, where we use initial design points randomly drawn from the prior and design points in each iteration. We also choose the termination parameters to be , , and . The algorithm terminates in 18 iterations, resulting in a total of 950 evaluations of the true likelihood function. We note that each evaluation of the likelihood function involves a full simulation of the underlying model described by equations 3.2a to 3.2c. After obtaining the approximate posterior distribution, we draw samples from it using a DRAM MCMC simulation.
We then compute the approximate posterior with the BAPE method using 950 likelihood evaluations and draw samples from it using a DRAM MCMC simulation. Finally, we approximate the posterior with the SLE-gPC method where the gPC expansion coefficients are computed using the least squares method with design points determined by the Sobol sequences–based quasi–Monte Carlo (QMC) method. The gPC degree is automatically determined by the algorithm using leave-one-out (LOO) cross-validation, and in the QMC scheme, we set the number of design points at 950.
We now compare the results of these methods. First, we estimate the posterior distributions of the six parameters with all four methods and show the results in Figure 6. One can see from the figure that the distributions computed by both the BAPE and the AGP methods are rather close to those of direct MCMC (which are regarded as the true posteriors), while the results of SLE-gPC deviate from the MCMC results, especially for the two parameters and . As for the comparison of BAPE and AGP, the figures show that both methods can produce reasonably good approximations of the posterior distributions in this case. For a quantitative evaluation of the performance of the methods, we compute the KLD from the approximate posterior distributions to the true posterior densities and show the results in Table 2. We can see from the table that the posterior distributions computed by the AGP method are closer (in terms of KLD) to the true posteriors than those of the other two methods for all six parameters.
We then consider the small-noise case, where we use the same implementation configurations as in the large-noise case. In this case, the AGP algorithm uses 1200 true likelihood evaluations; as before, we compute the posterior using the SLE and the BAPE methods with the same number of true likelihood evaluations. We then compute the posterior directly with MCMC samples and use the results as the true posterior. In Figure 7 we compare the marginal posterior distributions computed with the four methods. Similar to the large-noise case, the figure shows that the results of the SLE-gPC are of very low accuracy, while both the AGP and the BAPE methods yield rather good results. Once again, we show in Table 3 the KLD from the marginal posterior distributions computed with the three approximate methods to the true posterior (those computed by direct MCMC). These quantitative comparisons indicate that the AGP method yields better results than the other two methods in terms of the KLD. Thus, we can conclude that our AGP method has the best performance in both the large- and the small-noise cases.
3.3 The Human Body Sway Problem
Finally, we apply the proposed method to a human body sway problem. This problem has received considerable attention as body sway may provide information about a person's physiological status (Tietäväinen, Gutmann, Keskivakkuri, Corander, & Hæggström, 2017). Several mathematical models have been proposed to describe the sway motion, and here we consider the single-link inverted pendulum (SLIP) model proposed in Asai et al. (2009), which assumes that the body is maintained in an upright position by an active and a passive proportional-derivative controller.
In the numerical experiments, we use simulated data; in particular, the true parameter values are set to be Nm/rad, Nms/rad, s, Nm and , and the other parameter values are , kg, m, , Nm/rad, Nms/rad, and . The COM signal generated from the model, equation 3.3, with these parameter values is shown in Figure 8. In the inference, we impose a uniform distribution on each of the five parameters on the following intervals: , , , , and . Moreover, for each evaluation of the likelihood function , we use 10,000 simulations of equation 3.3, and as a result, a direct MCMC simulation of the posterior distribution is computationally infeasible. We apply our AGP method to compute the posterior distribution, and the algorithm parameters are the same as those in the second example. The algorithm terminates in 14 iterations, so the total number of true likelihood evaluations is 750. As a comparison, we also perform the BAPE method with the same number of true likelihood function evaluations. To compare the performance of the two methods, we plot the posterior marginals of the model parameters computed by BAPE in Figure 9 and those computed by AGP in Figure 10. We also show the true parameter values, as well as the 60% confidence interval in the figures. Here, one can see that for all the posteriors computed with the AGP method, the true parameter values fall in the 60% confidence intervals, while for the results of the BAPE method, the true values of and fall outside the 60% confidence intervals, which suggests that the posteriors computed by the AGP method may be more accurate and reliable than those by BAPE.
We have proposed an algorithm to construct GP-based approximations for the unnormalized posterior distribution. The method expresses the unnormalized posterior as a product of an approximate posterior density and an exponentiated GP model, and an adaptive scheme is presented to construct such an approximation. We also provide an active learning method that uses maximum entropy as the selection criterion to determine the sampling points. With numerical examples, we show that the method can obtain a rather good approximation of the posterior with a limited number of evaluations of the likelihood functions. We believe the proposed method can be useful in a wide range of practical Bayesian inference problems where the likelihood functions are difficult or expensive to evaluate.
Several issues of the proposed algorithm deserve further study. First, while our numerical experiments illustrate that the algorithm may converge in these examples, a rigorous convergence analysis of the algorithm is lacking. Second, for a posterior distribution with an unbounded domain, the resulting approximation may become improper, and thus certain modifications of the algorithm may be needed to address the issue. Finally, we note that selecting a good kernel function for the GP model is an important issue for GP-based methods, and to this end, an interesting question is how to choose kernel functions that are specifically suitable for the log posteriors. We plan to study these issues in future work.
The work was partially supported by the National Natural Science Foundation of China under grant 11771289.
J.L. is now at the University of Liverpool.