We propose a nonparametric procedure to achieve fast inference in generative graphical models when the number of latent states is very large. The approach is based on iterative latent variable preselection, where we alternate between learning a selection function to reveal the relevant latent variables and using this to obtain a compact approximation of the posterior distribution for EM. This can make inference possible where the number of possible latent states is, for example, exponential in the number of latent variables, whereas an exact approach would be computationally infeasible. We learn the selection function entirely from the observed data and current expectation-maximization state via gaussian process regression. This is in contrast to earlier approaches, where selection functions were manually designed for each problem setting. We show that our approach performs as well as these bespoke selection functions on a wide variety of inference problems. In particular, for the challenging case of a hierarchical model for object localization with occlusion, we achieve results that match a customized state-of-the-art selection method at a far lower computational cost.
Inference in probabilistic graphical models can be challenging in situations where there are large numbers of hidden variables, each of which may take on one of several state values. The expectation-maximization (EM) algorithm is widely applied to learn model parameters when hidden variables are present; however, inference can quickly become intractable as the dimensionality of hidden states increases. Consider, for instance, the floor of a nursery with different toys scattered on it and images of this floor large enough to contain a number of toys. A nursery easily contains a hundred different toys, and any subset of these hundred toys may appear in any image. For 100 toys, there is therefore a combinatorics of different combinations of toys that can make up an image. An inference task may now be to infer, for any given image, the toys it contains. If we approached this task using a probabilistic graphical model, we would define a basic such model using a set of 100 hidden variables (one for each toy). Given a specific image, inference would then take the form of computing the posterior probability for any combination of toys, and from this, for example, the probability of each toy to be in the image can be computed. If done exactly, this process needs to evaluate all different toy combinations, which easily exceeds currently available computational resources.
While there are also many tasks for which graphical models with few latent variables are sufficient, the requirement for many hidden variables (as in the toy example) is typical for visual, auditory, and many other types of data with very rich structures. Graphical models for such data are often a central building block for tasks such as denoising (Elad & Aharon, 2006; Titsias & Lázaro-Gredilla, 2011), inpainting (Mairal, Bach, Ponce, & Sapiro, 2009; Mairal, Bach, Ponce, Sapiro, & Zisserman, 2009; Titsias & Lázaro-Gredilla, 2011), classification (Raina, Battle, Lee, Packer, & Ng, 2007), and collaborative filtering (Titsias & Lázaro-Gredilla, 2011). Typically the performance in these tasks improves with the number of latent variables that can be used (and which is usually limited by computational demands).
Expectation truncation (ET) (Lücke & Eggert, 2010) is an approximate EM algorithm for accelerating inference and learning in graphical models with many latent variables. Its basic idea is to restrict the inference performed during the E-step to an interesting subset of states of the latent variables, chosen per data point according to a selection function. This subspace reduction can lead to a significant decrease in computational demand with very little loss of accuracy (compared with the full model). For example, considering the toy example, we could first analyze the colors contained in a given image. If the image did not contain the color red, we could already assume red toys or partly red toys to be absent. Only in a second step would we then consider the combinatorics (e.g., combinations of a restricted number of toys) of the remaining toys. More features and more refined features would allow a reduction to still smaller sets of toys, until the combinatorics of these selected toys becomes computationally tractable. The selection function of expectation truncation mathematically models the process of selecting the relevant hidden variables (the relevant toys), while truncated posterior distributions then model their remaining combinatorics.
In previous work, functions to select states of high posterior mass were derived individually for each graphical model of interest, for example, by taking upper bounds or noiseless limits (Lücke & Eggert, 2010; Shelton, Sterne, Bornschein, Sheikh, & Lücke, 2012; Bornschein, Henniges, & Lücke, 2013; Henniges, Turner, Sahani, Eggert, & Lücke, 2014; Sheikh, Shelton, & Lücke, 2014). The crucial underlying assumption remains that when EM has converged, the posterior mass is concentrated in small volumes of the latent state space (see, e.g., Lücke & Eggert, 2010; Sheikh et al., 2014, for discussions). We can expect the approximation to be accurate only if restricting the combinatorics does not miss large parts of posterior mass. This property is observed to hold, however, for many types of data in the auditory, visual, or general pattern recognition domains.
The definition of appropriate selection functions for basic graphical models such as the nursery floor example is already nontrivial. For models incorporating more detailed data properties, the definition of selections functions becomes still more demanding. For visual data, models that also capture mutual object occlusions (Henniges et al., 2014) or the object position (Dai & Lücke, 2014), the design of suitable selection functions is extremely challenging. It requires both expert knowledge on the problem domain and considerable computational resources to implement (indeed, the design of such functions for particular problems has been a major contribution in previous work on the topic).
In the work presented in this letter, we propose a generalization of the expectation truncation (ET) approach, where we completely avoid the challenge of problem-specific selection function design. Instead, we learn selection functions adaptively and nonparametrically from the data, while learning the model parameters simultaneously using EM. We emphasize that the selection function is used only to guide the underlying base inference algorithm to regions of high posterior probability; it is not itself used as an approximation to the posterior distribution. As such, the learned function does not have to be a completely accurate indication of latent variable predictivity as long as the relative importance of the latent states likely to contribute posterior probability mass is preserved. We use gaussian process regression (Rasmussen & Williams, 2005) to learn the selection function—by regressing the expected values of the latent variables onto the observed data—though other regression techniques could also be applied. The main advantage of GPs is that they do not need to be retrained when only the output changes, as long as the inputs remain the same. This makes adaptive learning of a changing target function (given fixed inputs) computationally trivial. We term this part of our approach GP-select. Our nonparametric generalization of ET may be applied as a black-box meta-algorithm for accelerating inference in generative graphical models, with no expert knowledge required.
Our approach is the first to make ET a general-purpose algorithm for discrete latent variables, whereas previously, ET had to be modified by hand for each latent variable model addressed. For instance, in section 5.3, we show that preselection is crucial for efficient inference in complex models. Although ET has already been successful in some models, this work shows that more complex models crucially depend on an improved selection step and focuses on automating this step.
For empirical evaluation, we have applied GP-select in a number of experimental settings. First, we considered the case of sparse coding models (binary sparse coding, spike-and-slab, nonlinear spike-and-slab), where the relationship between the observed and latent variables is known to be complex and nonlinear.1 We show that GP-select can produce results with equal performance to the respective manually derived selection functions. Interestingly, we find it can be essential to use nonlinear GP regression in the spike-and-slab case and that simple linear regression is not sufficiently flexible in modeling the posterior shape. Second, we illustrate GP-select on a simple gaussian mixture model, where we can provide intuition and explicitly visualize the form of the learned regression function. We find that even for a simple model, it can be be essential to learn a nonlinear mapping. Finally, we present results for a recent hierarchical model for translation-invariant occlusive components analysis (Dai & Lücke, 2014). The performance of our inference algorithm matches that of the complex hand-engineered selection function of previous work, while being straightforward to implement and having a far lower computational cost.
2 Related Work
The general idea of aiding inference in graphical models by learning a function that maps from the observed data to a property of the latent variables is quite old. Early work includes the Helmholtz machine (Dayan, Hinton, Neal, & Zemel, 1995) and its bottom-up connections trained using the wake-sleep algorithm (Hinton, Dayan, Frey, & Neal, 1995). More recently, the idea has surfaced in the context of learning variational distributions with neural networks (Kingma & Welling, 2014). A two-stage inference procedure has been discussed in the context of computer vision (Yuille & Kersten, 2006) and neural inference (Körner, Gewaltig, Körner, Richter, & Rodemann, 1999). Recently, researchers (Mnih & Gregor, 2014) have generalized this idea to learning in arbitrary graphical models by training an inference network that efficiently implements sampling from the posterior distribution.
GPs have recently been widely used to “learn” the results of complicated models in order to accelerate inference and parameter selection. GP approximations have been used in lieu of solving complex partial differential equations (Sacks, Welch, Mitchell, & Wynn, 1989; Currin, Mitchell, Morris, & Ylvisaker, 1991), to learn data-driven kernel functions for recommendation systems (Schwaighofer, Tresp, & Yu, 2004), and recently for quantum chemistry (Rupp, Tkatchenko, Müller, & von Lilienfeld, 2012). Other work has used GPs to simplify computations in approximate Bayesian computation (ABC) methods: namely, to model the likelihood function for inference (Wilkinson, 2014), aid in making Metropolis-Hastings (MH) decisions (Meeds & Welling, 2014), and to model the discrepancies between simulated/or observed data in parameter space simplification (Gutmann & Corander, 2015). Recently, instead of the typical choice of GPs for large-scale Bayesian optimization, neural networks have been used to learn an adaptive set of basis functions for Bayesian linear regression (Snoek et al., 2015).
Our work follows the same high-level philosophy in that we use GPs to approximate complex/or intractable probabilistic models. None of the cited prior work addresses our problem setting: the selection of relevant latent variables by learning a nonparametric relevance function, for use in ET.
3 Variable Selection for Accelerated Inference
3.1 Selection via Expectation Truncation in EM
Expectation maximization (EM) is an iterative algorithm to optimize the model parameters of a given graphical model (see, e.g., Dempster, Laird, & Rubin, 1977; Neal & Hinton, 1998). EM iteratively optimizes a lower bound on the data likelihood by inferring the posterior distribution over hidden variables given the current parameters (the E-step) and then adjusting the parameters to maximize the likelihood of the data averaged over this posterior (the M-step). When the number of latent states to consider is large (e.g., exponential in the number of latent variables), the computation of the posterior distribution in the E-step becomes intractable and approximations are required.
Expectation truncation (ET) is a meta-algorithm, that improves convergence of the EM algorithm (Lücke & Eggert, 2010). The main idea underlying ET is that the posterior probability mass is concentrated in a small subspace of the full latent space. This is the case, for instance, if, for a given data point , only a subset of the latent variables is relevant. Even when the probability mass is supported everywhere, it may still be largely concentrated on a small number of the latents.
3.2 ET with Affinity
Next, using all from the affinity function , we define to simultaneously sort the indices of the latent variables in descending order [of probability ] and reduce the sorted set to the highest (most relevant) variables' indices. thus returns the selected variable indices chosen by the affinity to be relevant to the th data point. To ensure that there is a nonzero probability of selecting each variable per EM iteration, of the indices are uniformly chosen from at random. This prevents the possible propagation of errors from continuously assigning small probabilities to a variable in early EM iterations. The rationale for this is that the optimization of in early iterations of EM starts from randomly initialized . If the affinity function itself is based on the posterior approximation (as it will be in the algorithm described in section 4), it has a tendency to not select indices that were previously not selected. Thus, in order for selection-based EM to not “get stuck,” it is important to select a few extra hidden indices randomly to give the algorithm an opportunity to evaluate possibly unused variables that might be relevant for .
Finally, using the indices from , we define to return an -dimensional subset of selected relevant latent states for each data point . All nonrelevant variable states for all variables are effectively set to 0 in equation 3.2 by not being present in the state set . For example, say that there are five , where . We consider the case where only and are selected. The function will then return zeros for , , and but will return both allowed possibilities 0 or 1 for and . Thus, a valid setting for the entire vector can be but not .
3.3 Inference in EM with Selection
In each iteration of EM, the following occurs. Prior to the E-step, the selection function in equation 3.5 is computed to select the most relevant states , which are then used to compute the truncated posterior distribution in equation 3.2. The truncated posterior can be computed using any standard inference method, such as exact inference or, for example, Gibbs sampling from if inference is still intractable or further computational acceleration is desired. The result of the E-step is then used to update the model parameters with maximum likelihood in the M-step.
In previous work, the selection function was a deterministic function derived individually for each model (see, e.g., Shelton et al., 2012; Dai & Lücke, 2012a, 2012b; Bornschein et al., 2013; Sheikh et al., 2014; Shelton, Sheikh, Bornschein, Sterne, & Lücke, 2015), specific examples of which will be shown in section 5.1. We now generalize the selection approach. Instead of predefining the form of for variable selection, we want to learn it in a blackbox and model-free way based on the data. We learn using gaussian process (GP) regression (Rasmussen & Williams, 2005), a flexible nonparametric model that scales cubicly2 with the number of data points but linearly with the number of latent variables . We define the affinity function as being drawn from a gaussian process model: , where is the covariance kernel, which can be flexibly parameterized to represent the relationship between variables. Again, we use to approximate the marginal posterior probability that . A nice property of gaussian processes is that the kernel matrix need only be computed once (until the kernel function hyperparameters are updated) to approximate for the entire set of latent variables .
Thus, prior to each E-step in each EM iteration, within each calculation of the selection function, we calculate the affinity using a GP to regress the expected values of the latent variables from the observed data . Specifically, we train on from the previous EM iteration (where is equal to ), for training data of , where we recall that is the approximate posterior distribution for in equation 3.2. Note that we do not use a sigmoid link; hence, this is clearly not a correct estimate of a probability (it can be negative or greater than one). From the selection perspective, however, it is not necessary to avoid these pathologies, as we want only an ordering of the variables. A correct GP classification approach with a properly defined likelihood will no longer have a marginal gaussian distribution, and we would no longer be able to trivially express the posterior means of different functions with the same inputs, without considerable extra computation.
In the first EM iteration, the expectations are initialized randomly. In each subsequent EM iteration, the expectations with regard to the -truncated posterior are used. The EM algorithm is run for iterations and the hyperparameters of the kernel are optimized by maximum likelihood every EM iterations.
Equation 4.1 can be efficiently implemented for all latent variables and all data points using matrix operations, thereby requiring only one kernel matrix inversion for the entire data set.
We apply our GP-select inference approach to five probabilistic generative models. First, we consider three sparse coding models (binary sparse coding, spike-and-slab, and nonlinear spike-and-slab), where the relationship between the observed and latent variables is known to be complex and nonlinear. Second, we apply GP-select to a simple gaussian mixture model to provide a functional intuition of approach and explicitly visualize the form of the learned regression function. Finally, we apply our approach to a recent hierarchical model for translation-invariant occlusive components analysis (Dai, Exarchakis, & Lücke, 2013; Dai & Lücke, 2012a, 2014).
5.1 Sparse Coding Models
Many types of natural data are composed of potentially many component types, but any data point often contains only a very small number of this potentially large set of components. For the example of toys on the nursery floor, for instance, there are many different toys that can potentially be in a given image, but typically only a relatively small number of toys actually appear in any one image. Another example is a sound played by a piano at a given time . While the sound can contain waveforms generated by pressing any of the 88 piano keys, only relatively few keys (typically much fewer than 10) actually generated the sound. Sparse coding algorithms model such data properties by providing a large number of hidden variables (potential data components) but assigning nonzero (or significantly different from zero) values to only a small subset of components (those actually appearing). Sparse coding algorithms are typically used for tasks such as denoising (Elad & Aharon, 2006; Mairal, Bach, Ponce, Sapiro et al., 2009), inpainting (Mairal, Bach, Ponce, & Sapiro, 2009; Mairal, Bach, Ponce, Sapiro, & Zisserman, 2009; Titsias & Lázaro-Gredilla, 2011), classification (LeCun, n.d.; Titsias & Lázaro-Gredilla, 2011; Raina et al., 2007, e.g., MNIST data set, http://yann.lecun.com/exdb/mnist/, and the flowers data set, http://www.robots.ox.ac.uk/∼vgg/data/flowers/), transfer learning (Raina et al., 2007), and collaborative filtering (Titsias & Lázaro-Gredilla, 2011) and are important models for neurosensory processing (Olshausen & Field, 1997; Zylberberg, Murphy, & Deweese, 2011; Bornschein et al., 2013; Sheikh et al., 2014, and many more). A variety of sparse coding models have been successfully scaled to high-dimensional latent spaces with the use of selection (Henniges, Puertas, Bornschein, Eggert, & Lücke, 2010); Bornschein et al., 2013; Sheikh et al., 2014) or selection combined with Gibbs sampling (Shelton et al., 2011, 2012, 2015) inference approaches. Latent variables were selected in these earlier works using selection functions that were individually defined for each model. In order to demonstrate our method of autonomously learned selection functions, we consider three of these sparse generative models and perform inference in EM with our GP-select approach instead of a handcrafted selection function. The models are relevant for different tasks such as classification (e.g., binary sparse coding), source separations and denoising (linear spike-and-slab sparse coding), or sparse encoding and extraction of interpretable image components (nonlinear sparse coding). Note that when it is obvious from context, we drop the notation referring to each data point in order to make the equations more concise.
The models and their parameters are:
where denotes the components or dictionary elements and parameterizes the sparsity (see, e.g., Henniges et al., 2010).
- •Spike-and-slab sparse coding, generates a spike-and-slab distributed variable () that has either continuous values or exact zero entries (e.g., Titsias & Lázaro-Gredilla, 2011; Goodfellow, Courville, & Bengio, 2013; Sheikh et al., 2014).
- •Nonlinear spike-and-slab sparse coding is the indicator function denoting the domain to integrate over, namely, where is the maximum. Using allows for the condensed expression of the update equations shown above. The mean of the gaussian for each is centered at , where is a nonlinearity that considers all latent components and takes the yielding the maximum value for (Lücke & Sahani, 2008; Shelton et al., 2012; Bornschein et al., 2013), instead of centering the data at the linear combination of .
In the above models, inference with the truncated posterior of equation 3.2 using handcrafted selection functions to obtain the subset of states of selected relevant variables , shown in equation 3.5, has yielded results as good or more robust performance than exact inference (converging less frequently to local optima than exact inference; see earlier references for details). For models A and C, the hand-constructed function approximating , for substitution in equation 3.5, was the cosine similarity between the weights (e.g., dictionary elements, components) associated with each latent variable and each data point : . For model B, the constructed affinity function was the data likelihood given a singleton state: , where represents a singleton state in which only the entry is nonzero. The goal of these experiments is to demonstrate the performance of GP-select and the effects or benefits of using different selection functions. To do this, we consider artificial data generated according to each sparse coding model, and thus with known ground-truth parameters. As discussed above, we could also apply the sparse coding models using GP-select to other application domains listed, but that is not the focus of these experiments. We generate data points consisting of observed dimensions and latent components according to each of the models A to C: images of randomly selected overlapping bars with varying intensities for models B and C and additive gaussian noise parameterized by ground-truth , and we choose (e.g., following the spike-and-slab prior). On average, each data point contains two bars, that is, ground truth is , and we choose . With this choice, we can select sufficiently many latents for virtually all data points.
For each of the models considered, we run 10 repetitions of each of the following set of experiments: (1) selection using the respective handcrafted selection function, (2) GP-select using a linear covariance kernel, (3) GP-select using an RBF covariance kernel, and (4) GP-select using a kernel composed by adding the following kernels: RBF, linear, bias, and white noise kernels, which we term the composition kernel. As hyperparameters of kernels are learned, the composition kernel experiment 4, can adapt itself to the data and only use the kernel components required. (See Rasmussen & Williams, 2005 for a discussion on kernel adaptation.) Kernel parameters were model-selected via maximum marginal likelihood every 10 EM iterations. For models A and B, inference was performed exactly using the truncated posterior, equaiton 3.2, but as exact inference is analytically intractable in model C, inference was performed by drawing Gibbs samples from the truncated space (Shelton et al., 2011, 2012, 2015). We run all models until convergence.
Results are shown in Figure 2. In all experiments, the GP-select approach was able to infer ground-truth parameters as well as the handcrafted function. For models where the cosine similarity was used (in A and C), GP regression with a linear kernel quickly learned the ground-truth parameters, and hence fewer iterations of EM were necessary. In other words, even without providing GP-select explicit weights as required for the handcrafted function, its affinity function using GP regression, equation 4.1 learned a similar enough function to quickly yield identical results. Furthermore, in the model with a less straightforward handcrafted function (in the spike-and-slab model of B), only GP regression with an RBF kernel was able to recover ground-truth parameters. In this case (model B), GP-select using an RBF kernel recovered the ground-truth bars in 7 of 10 repetitions, whereas the handcrafted function recovered the bases in 8 instances. For the remaining models, GP-select converged to the ground-truth parameters with the same average frequency as the handcrafted functions.
Finally, we have observed empirically that the composition kernel is flexible enough to subsume all other kernels: the variance of the irrelevant kernels dropped to zero in simulations. This suggests the composition kernel is a good choice for general use.
5.2 Gaussian Mixture Model
Next, we apply GP-select to a simple example, a gaussian mixture model, where the flexibility of the approach can be easily and intuitively visualized. Furthermore, the GMM's flexibility allow us to explicitly visualize the effect of different selection functions. A demonstration and code for the GMM application is provided in (Dai, 2016).
For training data for GP regression, we used , where the targets were the expected cluster responsibilities (posterior probability distribution for each cluster) for all data points, , and we use one-hot encoding for cluster identity. With this, we apply our GP-select approach to this model, computing the selection function according to equation 3.5 with affinity defined by GP regression equation 4.1, and following the approximate EM approach as in the previous experiments. In these experiments we consider two scenarios for EM learning of the data likelihood in equation 5.1: GP-select with an RBF covariance kernel and a linear covariance kernel. We do not include the composition kernel suggested (based on experiments) in section 4.1, as the goal of the current experiments is to show the effects of using the “wrong” kernel. These effects would further support the use of the flexible composition kernel in general, as it can subsume both kernels considered in the current experiments (RBF and linear).
To easily visualize the output, we generate two-dimensional observed data () from clusters—first with randomly assigned cluster means and then such that the means of the clusters lie roughly on a line. In the GP-select experiments, we select clusters from the full set and run 40 EM iterations for both kernel choices (linear and RBF). Note that for mixture models, the notation of selected clusters of the set is analogous to the selected latent variables from the full set, as described in the nonmixture model setting, and the GP-select algorithm proceeds unchanged. We randomly initialize the variance of the clusters and initialize the cluster means at randomly selected data points. Results are shown in Figure 3.
On these data, the linear GP regression prediction cannot correctly assign the data to their clusters (as seen in Figure 3B), but the nonlinear approach successfully and easily finds the ground-truth clusters (Figure 3A). Furthermore, even when both approaches were initialized in the optimal solution, the cluster assignments from GP-select with a linear kernel quickly wandered away from the optimal solution and were identical to random initialization, converging to the same result shown in iteration 20 of Figure 3B). The RBF kernel cluster assignments remained at the optimal solution even with the number of selected clusters set to .
These experiments demonstrate that the selection function needs to be flexible even for very simple models, and that nonlinear selection functions are an essential tool even in such apparently straightforward cases.
5.3 Translation-Invariant Occlusive Models
Now that we have verified that GP-select can be applied to various generative graphical models and converge to ground-truth parameters, we consider a more challenging model that addresses a problem in computer vision: translations of objects in a scene.
Translation-invariant models address the problem that, for example, visual objects can appear in any location of an image. Probabilistic models for translation invariance are particularly appealing as they allow separately inferring object positions and object type, making them very interpretable and powerful tools for image processing.
Translation-invariant models are particularly difficult to optimize, however, because they must consider a massive latent variable space: evaluating multiple objects and locations in a scene leads a latent space complexity of the number of locations exponentiated by the number of objects. Inference in such a massive latent space heavily relies on the idea of variable selection to reduce the number of candidate objects and locations. In particular, hand-engineered selection functions that consider translational invariance have been successfully applied to this type of model (Dai & Lücke, 2012b, 2014; Dai et al., 2013). The selection function used so far for reducing latent space complexity in this model has been constructed as follows. First, the candidate locations of all the objects in the model are predicted. Then a subset of candidate objects that might appear in the image is selected according to those predicted locations. Next, the subset of states is constructed according to the combinations of the possible locations and numbers of candidate objects. The posterior distribution is then computed following equation 3.2.
This selection system is very costly: the selection function has parameters that need to be hand-tuned (e.g., the number of representative features), and it needs to scan through the entire image, considering all possible locations, which becomes computationally demanding for large-scale experiments. To maximally exploit the capabilities of the GP-selection function, we directly use the GP regression model to predict the possible locations of a component without introducing any knowledge of translation invariance into the selection function. In this work, a GP regression model is fitted from the input image to marginal posterior probabilities of individual components appearing at all possible locations. Therefore, the input to the GP-selection function is the image to be inferred, and the output is a score for each possible location of each component in the model. For example, when learning 10 components in a pixel image patch, the output dimensionality of GP-select is 9000. This task is computationally feasible, since GP models scale linearly with output dimensionality. The inference of components' locations with GP-select is significantly faster than the selection function in the original work, as it avoids explicitly scanning through the image.
Although additional computations are necessary for an automatic selection function like GP-select, for instance, due to the adjustment of its parameters, there are many options to reduce computational costs. First, we may approximate the full gram matrix by an incomplete Cholesky approximation (Fine & Scheinberg, 2001), resulting in a cost of , where is the rank of the Cholesky approximation. Second, we may reduce the update frequency of the kernel hyperparameters to be computed only every EM iterations, where a represents a corresponding computation reduction. The combination of the Cholesky approximation plus infrequent updates will have the following benefits: a factor of five speed-up for infrequent updates and a factor of speed-up from incomplete Cholesky, where is the rank of the Cholesky approximation and is the number of original data points.
5.3.2 COIL Data Set
We apply our GP-selection function to the invariant exclusive component analysis (InvECA) model (Dai & Lücke, 2012b; Dai et al., 2013). For our experiments, we consider an image data set used in previous work. Data were generated using objects from the COIL-100 image data set (Nene et al., 1996), taking 16 different objects, downscaled to pixels and segmented out from the black background. A given image was generated by randomly selecting a subset of the 16 objects, where each object has a probability of 0.2 of appearing. The appearing objects were placed at random positions on a black image. When the objects overlap, they occlude each other with a different random depth order for each image. In total, images were generated in the data set (examples shown in Figure 4). The task of the InvECA model is to discover the visual components (i.e., the images of 16 objects) from the image set without any label information. We compare the visual components learned by using four different selection functions in the InvECA model: the handcrafted selection function used in the original work by Dai and Lücke (2012b), GP-select updated every iteration, GP-select updated every iterations, and GP-select with incomplete Cholesky decomposition updated every iteration, or (in this manner, we isolate the improvements due to Cholesky from those due to infrequent updates). In these experiments, the parameters of GP-select are optimized at the end of each EM iteration(s), using a maximum of 20, gradient updates. The number of objects to be learned is , and the algorithm preselects objects for each data point. The kernel used was the composition kernel, as suggested in section 4.1, although after fitting the hyperparameters, only the RBF kernel remained with large variance (i.e., a linear kernel alone would not have produced good variable selection; thus the flexible composition kernel was further shown to be a good choice).
All four versions of the InvECA model using each of the selection functions considered successfully recover each individual object in our modified COIL image set. The learned object representations with GP-select are shown in Figure 5. Four additional components developed into representations; however, these all had very low mask values, allowing them to easily be distinguished from other true components.
Next, we compare the accuracy of the four selection functions. For this, we collected the object locations (pixels) indicated by each selection function after all EM iterations, applied the selection functions (for the GP selection functions, this was using the final function learned after all EM iterations) to the entire image data set again, and then compared these results with the ground-truth location of all of the objects in the data set. The accuracy of the predicted locations was then computed by comparing the distance of all ground-truth object locations to the location of the top candidate locations from each selection function. (See Figure 6 for a histogram of these distances and the corresponding accuracy for all selection functions.) Note that the percentages in the histogram are plotted in log scale. Also, as a baseline verification, we computed and compared the pseudo-log-likelihood (Dai et al., 2013) of the original selection function to the three GP-select based ones. The pseudo-log-likelihood for all selection functions is shown in Figure 7. Figures 6 and 7 show that all four selection functions can accurately predict the locations of all the objects in the data set: the GP-select selection functions yield no loss in inference performance in comparison to the original hand-engineered selection function. Even those using speed-considerate approximations (incomplete Cholesky decomposition of the kernel matrix, GP IChol, and updating kernel hyperparameters only every five EM iterations, GP every5) have indistinguishable prediction accuracy on the task.
An analysis of the benefits indicates that as GP-select avoids explicitly scanning through the image, the time to infer the location of an object is significantly reduced compared to the handcrafted function. GP-select requires 22.1 seconds on a single CPU core to infer the locations of objects across the whole image set, while the handcrafted function requires 1830.9 seconds. In the original work, the selection function was implemented with GPU acceleration and parallelization. Although we must compute the kernel hyperparameters for GP-select, it is important to note that the hyperparameters need not perfectly fit each iteration. For the purposes of our approach, a decent approximation suffices for excellent variable selection. In this experiment, updating the parameters of GP-select with 10 gradient steps took about 390 seconds for the full-rank kernel matrix. When we compute the incomplete Cholesky decomposition while inverting the covariance matrix, computing time was reduced to 194 seconds (corresponding to the speed-up, where is the rank of the Cholesky approximation), with minimal loss in accuracy. Furthermore, when updating the GP-select hyperparameters only every five iterations, average computing time was reduced by another one-fifth, again without loss in accuracy.
We have proposed a means of achieving fast EM inference in Bayesian generative models by learning an approximate selection function to determine relevant latent variables for each observed variable. The process of learning the relevance functions is interleaved with the EM steps, and these functions are used in obtaining an approximate posterior distribution in the subsequent EM iteration. The functions themselves are learned via gaussian process regression and do not require domain-specific engineering, unlike previous selection functions. In experiments on mixtures and sparse coding models with interpretable output, the learned selection functions behaved in accordance with our expectations for the posterior distribution over the latents.
The significant benefit we show empirically is that by learning the selection function in a general and flexible nonparametric way, we can avoid using potentially expensive hand-engineered selection functions. Cost reduction is in terms of both required expertise in the problem domain and computation time in identifying the relevant latent variables. Inference using our approach required 22.1 seconds on a single CPU core, versus 1830.9 seconds with the original handcrafted function for the complex hierarchical model of Dai et al. (2013).
A major area where further performance gains might be expected is in improving computational performance, since we expect the greatest advantages of GP-select to occur for complex models at large scale. For instance, kernel ridge regression may be parallelized (Zhang, Duchi, & Wainwright, 2014), or the problem may be solved in the primal via random Fourier features (Le, Sarlos, & Smola, 2013). Furthermore, there are many recent developments regarding the scaling up of GP inference to large-scale problems—for example, sparse GP approximation (Lawrence, Seeger, & Herbrich, 2002), stochastic variational inference (Hensman, Fusi, & Lawrence, 2013; Hensman, Rattray, & Lawrence, 2012), using parallelization techniques and GPU acceleration (Dai, Damianou, Hensman, & Lawrence, 2014), or in combination with stochastic gradient descent (Bottou & Bousquet, 2008). For instance, for very large data sets where the main model is typically trained with mini-batch learning, stochastic variational inference can be used for GP fitting as in Hensman et al. (2013) and the kernel parameters can be efficiently updated each (or only every few) iteration with respect to a mini-batch.
Note that even when linear relations exist between the latents and outputs, a nonlinear regression may still be necessary in finding relevant variables as a result of explaining away.
If the scaling with is still too expensive, an incomplete Cholesky approximation is used, with cost linear in and quadratic in the rank of the approximation (see section 5.3 for details).
We acknowledge funding by the German Research Foundation (DFG) under grants LU 1196/4-2 (J.S.), LU 1196/5-1 (J.L.); by the Cluster of Excellence EXC 1077/1 “Hearing4all” (J.L.); and by the RADIANT and WYSIWYD (EU FP7-ICT) projects (Z.D.).