## Abstract

Computer vision has grown tremendously in the past two decades. Despite all efforts, existing attempts at matching parts of the human visual system's extraordinary ability to understand visual scenes lack either scope or power. By combining the advantages of general low-level generative models and powerful layer-based and hierarchical models, this work aims at being a first step toward richer, more flexible models of images. After comparing various types of restricted Boltzmann machines (RBMs) able to model continuous-valued data, we introduce our basic model, the masked RBM, which explicitly models occlusion boundaries in image patches by factoring the appearance of any patch region from its shape. We then propose a generative model of larger images using a field of such RBMs. Finally, we discuss how masked RBMs could be stacked to form a deep model able to generate more complicated structures and suitable for various tasks such as segmentation or object recognition.

## 1. Introduction

Despite much progress in the field of computer vision in recent years, interpreting and modeling the bewildering structure of natural images remains a challenging problem. The limitations of even the most advanced systems become strikingly obvious when contrasted with the ease, flexibility, and robustness with which the human visual system analyzes and interprets an image. Computer vision is a problem domain where the structure that needs to be represented is complex and strongly task dependent and the input data are often highly ambiguous. Against this background, we believe that rich, generative models are necessary to extract an accurate and meaningful representation of the world, detailed enough to make them suitable for a wide range of visual tasks. This work is a first step toward building such a general-purpose generative model able to perform varied high-level tasks on natural images. The model integrates concepts from computer vision that combine some very general knowledge about the structure of our visual world with ideas from deep unsupervised learning. In particular, it draws on ideas such as:

- •
The separation of shape and appearance and the explicit treatment of occlusions

- •
A generic, learned model of shapes and appearances

- •
The unsupervised training of a generative model on a large database, exploiting graphical models that foster efficient inference and learning

- •
The modeling of large images using a field of more local experts

- •
The potential for a hierarchical latent representation of objects

Some of these ideas have been explored independently of each other and in models that focused on particular aspects of images or that were applied to very limited (e.g., category-specific) data sets. Here we demonstrate how these techniques, in combination, give rise to a promising model of generic natural images.

One premise of the work described in this article is that generative models hold important advantages in computer vision. Their most obvious advantage over discriminative methods is perhaps that they are more amenable to unsupervised learning, which seems of crucial importance in a domain where labeled training data are often expensive while unlabeled data are now easy to obtain. Equally important, however, is that in vision, we are rarely interested in solving a single task such as object classification. Instead we typically need to extract information about different aspects of an image and at different levels of abstraction—for example, recognizing whether an object is present, identifying its position and those of its parts, and separating pixels belonging to the object from the background or occluding objects (segmentation). Many lower-level tasks, such as segmentation, are not even well defined without reference to more abstract structure (e.g., the object or part to be segmented), and information in natural images, especially when it is low level and local, is often highly ambiguous. These considerations strongly suggest that we need a model that can represent and learn a rich prior of image structure at many different levels of abstraction and also allow efficiently combining bottom-up (from the data) with top-down (from the prior) information during inference. Probabilistic, generative models naturally offer the appropriate framework for doing such inference. Furthermore, unlike in the discriminative case, they are trained not with respect to a particular task-specific label (which in most cases provides very little information about the complex structure present in an image) but rather to represent the data efficiently. This makes it much more likely that the required rich prior can ultimately be learned, especially if a suitable (e.g., a hierarchical) model structure is assumed. 
In this article we briefly review the most closely related works, even though such a review will necessarily be incomplete.

Some generative models can extract information about shape and appearance, illumination, occlusion and other factors of variation in an unsupervised manner (Frey & Jojic, 2003; Williams & Titsias, 2004; Kannan, Jojic, & Frey, 2005; Winn & Jojic, 2005; Kannan, Winn, & Rother, 2006). Though these models have successfully been applied to sets of relatively homogeneous images, such as images of particular object classes or movies of a small number of objects, they have limited scope and are typically not suitable for more heterogeneous data, let alone generic natural images.

Generic image structure is the domain of models such as the sparse coding approach by Olshausen and Field (1996; Lewicki & Olshausen, 1999; Hyvärinen, Hoyer, & Inki, 2001; Karklin and Lewicki, 2009) or the more recent work, broadly referred to as deep learning architectures (Osindero & Hinton, 2008; Lee, Ekanadham, & Ng, 2008). Unlike the models in the previous category, these models of generic image structure have very little built-in knowledge about the formation of natural images and are trained on large, unlabeled image databases. In particular, for the second group of models, the hope is that by learning increasingly deep (i.e., multilayered) representations of natural images, these models will capture structures of increasing complexity and at larger scales. Although this line of work has produced interesting results, so far the models are typically limited to small image patches (with some exceptions, see, e.g., Lee, Grosse, Ranganath, and Ng, 2009 and Raina, Madhavan, & Ng, 2009). Furthermore, most models so far, including hierarchical ones, appear to learn only very simple, low-level properties of natural images and are far from learning more abstract, higher-level concepts, suggesting that these models might still be too limited to capture the wealth of structure in natural images.

A large body of computer vision literature has focused on hierarchical image representations of various kinds, in particular on the recursive compositions of objects from parts, and many of these works employ generative (probabilistic) formulations of the hierarchy (see Bienenstock, Geman, & Potter, 1997; Jin & Geman, 2006; Fidler & Leonardis, 2007; Ommer & Buhmann, 2010; Zhu, Lin, Huang, Chen, & Yuille, 2008; Todorovic & Ahuja, 2008; Bouchard and Triggs, 2005; Zhu & Mumford, 2006, for some examples). The focus here is often less on modeling full images (in particular, not in such a manner that new images could be generated from these models) than on developing a representation for recognition or segmentation. Learning such models, in particular the structure of the hierarchy, can be challenging although progress has recently been made (e.g., Fidler & Leonardis, 2007; Ommer & Buhmann, 2010; Zhu et al., 2008; Todorovic & Ahuja, 2008). One important insight that has arisen from these compositional models of images, but also from tree-structured belief network models of images (e.g., Bouman & Shapiro, 1994; Luettgen & Willsky, 1995), is the notion that such a hierarchy needs to be flexible and allowed to vary in structure so as to match the underlying dependencies present in any particular image. This issue has been addressed in the work on dynamic trees (Williams & Adams, 1999; Storkey & Williams, 2003), and also in the credibility network model (Hinton, Ghahramani, & Teh, 2000), among others. However, these methods still fall short of being able to capture the complexity of natural images: for example, dynamic trees do not impose a depth ordering or learn an explicit shape model as a prior over tree structures.

Most of the work described in the previous paragraphs focuses on certain aspects of natural images. The question as to what kinds of models are suitable for comprehensively modeling the very different types of structure that typically co-occur in images has featured prominently in the work of Zhu and his coworkers (Guo, Zhu, & Wu, 2003, 2007; Tu, Chen, Yuille, & Zhu, 2005; Zhu & Mumford, 2006). Recently they proposed a generative model that combines submodels of different types for capturing the different kinds of structure occurring in natural images at different levels of abstraction and scale, ranging from low-level structures such as image textures to high-level part-based representations of objects and, ultimately, full visual scenes. This model appears to be one of the most comprehensive available to date, but due to its complexity, it currently fails to leverage one of the potential advantages of generative models in that unsupervised learning seems extremely difficult. Thus, training relies quite heavily on hand-labeled data, which are expensive to get.

In light of all these works, we aim at providing a unified probabilistic framework able to deal with generic, large images in an efficient manner from both a representation and an inference point of view.

The base component of our model is the restricted Boltzmann machine (Smolensky, 1986; Freund & Haussler, 1994), which is a Boltzmann machine (Ackley, Hinton, & Sejnowski, 1985) restricted to have bipartite connectivity. Section 2 presents and compares various RBMs able to model continuous-valued data, which will prove useful when we model appearances of objects. Section 3 presents the masked RBM, which extends the already rich modeling capacity of an RBM with a depth-ordered segmentation model. The masked RBM represents the shape and appearance of image regions separately, and it explicitly reasons about occlusion. The shape of objects is modeled by another RBM, introduced in section 4. This opens up new application domains (such as image segmentation and inpainting), and, importantly, leads to a much more efficient representation of image structure than standard RBMs, which can be learned in a fully unsupervised manner from training images. Despite its complexity and power, our model allows efficient approximate inference and learning. Section 5 is a thorough evaluation of this model's quality using both toy data and natural image patches, demonstrating how explicit incorporation of knowledge about natural image formation considerably increases the efficiency of the learned representation.

We then move from image patches to full images by introducing the field of masked RBMs in section 6, leveraging the modeling power we obtained at the patch level, before concluding in section 7.

Finally, as future work, we propose in section 8 a hierarchical formulation of the basic model that gives rise to a flexible, reconfigurable tree-structured representation that would allow us to learn image structures at different scales and levels of abstraction.

## 2. Binary and Continuous-Valued RBMs

In this section, we introduce the standard RBM, defined over binary variables, and then present several RBMs able to model continuous-valued data.

### 2.1. The Binary RBM.

An RBM with *n* hidden units is a parametric model of the joint distribution between binary hidden variables *h _{j}* (explanatory factors, collected in vector **h**) and binary observed variables *v _{i}* (the observed data, collected in vector **v**), of the form

$$P(\mathbf{v}, \mathbf{h}) = \frac{\exp(-E(\mathbf{v}, \mathbf{h}))}{Z} \quad \text{with} \quad E(\mathbf{v}, \mathbf{h}) = -\mathbf{h}^T W \mathbf{v} - \mathbf{b}^T \mathbf{v} - \mathbf{c}^T \mathbf{h},$$

with parameters θ = (*W*, **b**, **c**) and *v _{i}*, *h _{j}* ∈ {0, 1} (*Z* is the normalizing constant).^{1}

One can show that the conditional distributions *P*(**v**|**h**) and *P*(**h**|**v**) are factorial and thus easy to sample from (Hinton, 2002). Although the marginal distribution *P*(**v**) is not tractable, it can easily be computed up to a normalizing constant. The bipartite structure of an RBM allows both inference and learning to be performed efficiently using Gibbs sampling (Hinton, Osindero, & Teh, 2006).
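Concretely, the factorial conditionals are products of independent sigmoid Bernoullis, which makes block Gibbs sampling cheap. A minimal sketch with illustrative random parameters (not the article's trained models):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative parameters theta = (W, b, c) for a tiny RBM.
n_visible, n_hidden = 6, 4
W = rng.normal(scale=0.1, size=(n_hidden, n_visible))
b = np.zeros(n_visible)  # visible biases
c = np.zeros(n_hidden)   # hidden biases

def sample_h_given_v(v):
    # P(h_j = 1 | v) = sigmoid(c_j + sum_i W_ji v_i): factorial over j.
    p = sigmoid(c + W @ v)
    return (rng.random(n_hidden) < p).astype(float)

def sample_v_given_h(h):
    # P(v_i = 1 | h) = sigmoid(b_i + sum_j W_ji h_j): factorial over i.
    p = sigmoid(b + W.T @ h)
    return (rng.random(n_visible) < p).astype(float)

# One step of block Gibbs sampling starting from a random binary v.
v = (rng.random(n_visible) < 0.5).astype(float)
h = sample_h_given_v(v)
v = sample_v_given_h(h)
print(v)
```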

### 2.2. Modeling Continuous Values with an RBM.

Since we are building a generative model of RGB images, we will need to use generative models of (potentially bounded) real-valued vectors of the red, green, and blue channel values. Surprisingly little work has been done on designing efficient RBMs for real-valued data.

The general foundations for using RBMs to model distributions in the exponential family were laid in Welling, Rosen-Zvi, and Hinton (2005), where one particular instantiation of this family was investigated for modeling discrete data using continuous latent variables. To date, using other members of this family to learn data variance has not been explored.

Some authors have used RBMs in the context of continuous values, using a truncated exponential (Larochelle, Erhan, Courville, Bergstra, & Bengio, 2007), gaussians with fixed variance (Freund & Haussler, 1994; Lee et al., 2008), or rectified linear units (Nair & Hinton, 2010). In none of these cases is the variance learned. In the case of the truncated exponential, even though the variance does depend on the parameters, it is a deterministic function of the mean and cannot be separately optimized. We will thus refer to this model as having fixed variance.
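To see why the truncated exponential counts as fixed variance: for a density *p*(*v*) ∝ exp(*av*) on [0, 1], the mean and the variance are both deterministic functions of the single natural parameter *a*, so they cannot be adjusted separately. A quick numerical illustration (my own sketch, not code from the article):

```python
import numpy as np

def trunc_exp_moments(a, n=20001):
    # Discretize p(v) ∝ exp(a v) on [0, 1] and compute its moments.
    v = np.linspace(0.0, 1.0, n)
    w = np.exp(a * v)
    w = w / w.sum()
    mean = (v * w).sum()
    var = ((v - mean) ** 2 * w).sum()
    return mean, var

# Sweeping the single parameter a traces one curve in (mean, var) space:
# once the mean is chosen, the variance is pinned down.
for a in (-4.0, 0.0, 4.0):
    mean, var = trunc_exp_moments(a)
    print(f"a={a:+.1f}  mean={mean:.3f}  var={var:.4f}")
```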

We now present several kinds of RBMs able to model continuous-valued data.

### 2.3. Truncated Exponential.

### 2.4. Gaussian RBM with Fixed Variance.

The energy function of the gaussian RBM with fixed variance σ^{2} may be written

$$E(\mathbf{v}, \mathbf{h}) = \frac{(\mathbf{v}^2)^T \mathbf{e} - 2\,\mathbf{b}^T \mathbf{v}}{2\sigma^2} - \mathbf{c}^T \mathbf{h} - \frac{\mathbf{h}^T W \mathbf{v}}{\sigma^2},$$

where **v**^{2} is the vector whose *i*th element is *v*^{2}_{i} and **e** = [1, 1, …, 1]^{T}. This model is restricted to be a mixture of isotropic gaussians.

Choosing a fixed variance to use with this model is problematic: large variances make training very noisy, while small variances cause training to get stuck in local maxima. The heuristic approach aims at avoiding the problems of a large, fixed variance by using the mean of *P*(**v**|**h**), rather than a sample from it, during training. We will show the results obtained with the fixed variance model trained normally (Gaussian Fixed) and trained using this heuristic (Gaussian Heuristic).
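The distinction between the two training regimes can be made concrete. In the sketch below (illustrative stand-in parameters, with the conditional mean of the isotropic gaussian taken as **b** + *W*^{T}**h**), "Gaussian Fixed" training would propagate the noisy sample from *P*(**v**|**h**), while the heuristic propagates the conditional mean itself:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5                 # fixed standard deviation (illustrative; not learned)
n_visible, n_hidden = 8, 3
W = rng.normal(scale=0.1, size=(n_hidden, n_visible))
b = np.zeros(n_visible)

h = (rng.random(n_hidden) < 0.5).astype(float)
mu = b + W.T @ h            # mean of the isotropic gaussian P(v | h)

# "Gaussian Fixed": the top-down reconstruction is a noisy sample ...
v_sample = mu + sigma * rng.normal(size=n_visible)
# ... while "Gaussian Heuristic" uses the mean itself, removing the noise.
v_heuristic = mu

print(np.abs(v_sample - mu).mean(), np.abs(v_heuristic - mu).mean())
```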

### 2.5. Gaussian RBM with Learned Variance.

We now present an extension of the gaussian RBM model that allows the modeling of the variance. We consider two similar models: the first uses the same hidden units to model both the mean and the precision (Gaussian Joint), and the second uses different sets of hidden units for each (Gaussian Separate).

#### 2.5.1. Joint Modeling of Mean and Precision.

#### 2.5.2. Separate Modeling of Mean and Precision.

### 2.6. Beta RBM.

In the beta RBM, the conditional distributions *P*(**v**|**h**) are beta distributions whose means and variances are learned during training.

Before going any further, we would like to recall the link between RBMs and products of experts (for a detailed explanation, see Freund & Haussler, 1994). When we sum out over all possible values of **h** in the energy function of an RBM, the unnormalized probability of a state **x** is the product of as many experts as there are hidden units, each expert being a mixture of two distributions—one when the hidden unit is turned on, one when it is turned off.
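This identity is easy to check numerically for a small binary RBM: summing exp(−*E*(**v**, **h**)) over all 2^{n} hidden configurations equals a product of one expert per hidden unit, each mixing the unit's "off" contribution (a factor of 1) and its "on" contribution (a factor of exp(*c _{j}* + *W _{j}***v**)). A small verification sketch:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
n_visible, n_hidden = 4, 3
W = rng.normal(size=(n_hidden, n_visible))
b = rng.normal(size=n_visible)
c = rng.normal(size=n_hidden)

v = rng.integers(0, 2, size=n_visible).astype(float)

# Brute force: sum the unnormalized probability exp(-E(v, h)) over all h.
brute = sum(
    np.exp(h @ W @ v + b @ v + c @ h)
    for h in (np.array(bits, dtype=float) for bits in product([0, 1], repeat=n_hidden))
)

# Product of experts: one factor per hidden unit (off: 1, on: exp(c_j + W_j v)).
poe = np.exp(b @ v) * np.prod(1.0 + np.exp(c + W @ v))

assert np.isclose(brute, poe)
print(brute, poe)
```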

Our first formulation of the beta RBM used parameters **a**, **b**, *W*, and *U* (constraining **a**, **b**, *W*, and *U* to be positive resulted in too hard a constraint). We therefore moved to a formulation with weight matrices *W*_{1}, *W*_{2}, *U*_{1}, and *U*_{2} restricted to be positive (note that we no longer have the visible biases **a** and **b** as these may be included in the weight matrices). As beta distributions treat the boundary values (0 and 1) differently from the others, we extended their range to [ − λ, 1 + λ] with λ > 0.^{2}

### 2.7. Assessment of the Quality of Each RBM.

To choose the most appropriate RBM for the real-valued red, green, and blue channels, we compared all these models on natural image patches (of size 16 × 16) using three quantitative metrics: the reconstruction RMSE, the reconstruction log likelihood, and the imputation accuracy. The experiments were conducted on patches that were not seen during training.

#### 2.7.1. Experimental Setup.

All models were trained on a training set of 383,300 color image patches of size 16 × 16. Patches were extracted on a regular 16 × 16 grid from images from three different object recognition data sets: Pascal VOC, MSR Cambridge, and the INRIA horse data set.^{3} The red, green, and blue color channels are concatenated so that each model has 768 visible units. Where necessary, we used an appropriately sized validation set.

We trained the models using gradient descent with persistent contrastive divergence (Tieleman, 2008) and batches of size 20. We used a small weight decay and decreased the learning rate after every epoch (one run through all the training patches).

The hyperparameters were not treated equally:

- •
The weight decay and decrease constant were manually fixed to .0002 and .001, respectively.

- •
The learning rate was optimized using the validation set, taking the learning rate that gives the best log likelihood of the data given the inferred latent variables after one epoch.

- •
In the case of the beta RBM, to get an idea of the effect of the parameter λ, we tried three different values of λ for the case of 256 hidden units. We decided beforehand to report, for 512 and 1024 hidden units, only the results for a single value of λ.

Once the optimal learning rate was found, we trained each model for 20 epochs in batches of size 20 patches. Models were trained for three different sizes of the hidden layer: 256, 512, and 1024 hidden units.
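The training setup above can be sketched as follows for a binary RBM; the surviving "fantasy" chain is what distinguishes persistent contrastive divergence from ordinary CD-1. The weight decay (.0002) and decrease constant (.001) follow the text, while the data, layer sizes, and base learning rate are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Stand-in binary data in place of the real color patches.
n_visible, n_hidden, batch = 16, 8, 20
data = (rng.random((200, n_visible)) < 0.3).astype(float)

W = rng.normal(scale=0.01, size=(n_hidden, n_visible))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)
lr0, weight_decay, decrease = 0.05, 0.0002, 0.001  # decay values from the text
chain_v = data[:batch].copy()   # persistent ("fantasy") Gibbs chain state

for epoch in range(3):
    lr = lr0 / (1.0 + decrease * epoch)            # decreased every epoch
    for i in range(0, len(data), batch):
        v_pos = data[i:i + batch]
        h_pos = sigmoid(v_pos @ W.T + c)           # positive-phase statistics
        # Negative phase: advance the persistent chain one full Gibbs step.
        h_smp = (rng.random((batch, n_hidden)) < sigmoid(chain_v @ W.T + c)).astype(float)
        chain_v = (rng.random((batch, n_visible)) < sigmoid(h_smp @ W + b)).astype(float)
        h_neg = sigmoid(chain_v @ W.T + c)
        W += lr * ((h_pos.T @ v_pos - h_neg.T @ chain_v) / batch - weight_decay * W)
        b += lr * (v_pos - chain_v).mean(axis=0)
        c += lr * (h_pos - h_neg).mean(axis=0)

print(np.abs(W).mean())
```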

#### 2.7.2. Reconstruction RMSE.

This experiment is used to determine the ability of each RBM to correctly model the mean of the data. Reconstruction is performed as follows. Given a test patch **v**_{test}, we sample a configuration of the hidden states **h**^{⋆} from the conditional distribution *P*(**h**|**v**_{test}). Given this configuration **h**^{⋆}, we compute the average value of the visible states *E*[**v**|**h**^{⋆}]. This is called a mean reconstruction of the test patch. Note that this is not the true average reconstruction since we consider only one configuration of the hidden states, not the full conditional distribution. Finally, we compute the pixel-wise squared error between the reconstruction and the original patch.
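The procedure can be sketched for a binary RBM with illustrative parameters (the article's models are continuous-valued, but the recipe is identical): sample **h**^{⋆} from *P*(**h**|**v**_{test}), take the conditional mean, and compute the pixel-wise RMSE.

```python
import numpy as np

rng = np.random.default_rng(4)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden = 12, 6
W = rng.normal(scale=0.1, size=(n_hidden, n_visible))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)

def mean_reconstruction(v_test):
    # Sample one hidden configuration h* from P(h | v_test) ...
    h_star = (rng.random(n_hidden) < sigmoid(c + W @ v_test)).astype(float)
    # ... then return the conditional mean E[v | h*].
    return sigmoid(b + W.T @ h_star)

v_test = (rng.random(n_visible) < 0.5).astype(float)
recon = mean_reconstruction(v_test)
rmse = np.sqrt(np.mean((recon - v_test) ** 2))
print(rmse)
```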

RMSE reconstruction accuracies for the different models (with 1024 hidden units) are shown in Figure 1a where the accuracies have been averaged across all test patches. Note that because the RMSE measure uses only the mean of *P*(**v**|**h**^{⋆}), the accuracy of the variance of *P*(**v**|**h**^{⋆}) is not assessed in these plots. A selection of test patches and their mean reconstructions is shown in Figures 2a and 2b.

The truncated exponential does a reasonable job of reconstructing the patches, but it is exceeded in performance by all three of the learned-variance models. This leads to the counterintuitive result that models designed to capture data variance prove to be significantly better at representing the mean. An explanation is that these models learn where they are able to represent the data accurately (e.g., in untextured regions) and where they cannot (e.g., near edges) and hence are able to focus their modeling power on the former rather than the latter, leading to an overall improvement in RMSE. The overall best performer is the beta RBM, which not only has the best average RMSE but also shows much greater stability during training in comparison to the gaussian models (as may be seen in Figure 1a).

#### 2.7.3. Reconstruction Log Likelihood.

This experiment is a proxy for the true log probability of the data. To obtain the true probability of a test patch, one could start a Markov chain from this same patch, run it for an infinite amount of time, and compute the log probability of that patch under the final distribution (the choice of starting point would actually have no influence). Since this would be too expensive, we consider only an unbiased sample of the distribution obtained after one Markov step. We therefore perform the following experiment:

1. Given a test patch **v**_{test}, we sample a configuration of the hidden states **h**^{⋆} from the conditional distribution *P*(**h**|**v**_{test}).
2. Given this configuration of the hidden states, we compute the conditional probability of the test patch *P*(**v**_{test}|**h**^{⋆}), which is easily done given the factoriality of this distribution.
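Sketched here for an isotropic gaussian RBM with illustrative parameters: the factoriality of *P*(**v**|**h**^{⋆}) makes the log probability a simple sum of per-pixel gaussian log densities.

```python
import numpy as np

rng = np.random.default_rng(5)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

sigma = 0.3                 # illustrative fixed standard deviation
n_visible, n_hidden = 12, 6
W = rng.normal(scale=0.1, size=(n_hidden, n_visible))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)

def reconstruction_log_likelihood(v_test):
    # Sample h* from the gaussian RBM's hidden conditional P(h | v_test).
    h_star = (rng.random(n_hidden) < sigmoid(c + W @ v_test / sigma**2)).astype(float)
    mu = b + W.T @ h_star   # conditional mean of P(v | h*)
    # Factorial gaussian log density: one term per pixel.
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (v_test - mu) ** 2 / (2 * sigma**2))

v_test = rng.random(n_visible)
ll = reconstruction_log_likelihood(v_test)
print(ll)
```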

Results for all models are given in Figure 1b, again with 1024 hidden units. Unlike the RMSE reconstruction, the log likelihood jointly assesses the accuracy of the mean and variance of the model. Hence, differences from the RMSE reconstruction results indicate models where the variance is modeled more or less accurately. Unsurprisingly, the fixed variance models do very poorly on this metric since they have fixed, large variances. More interestingly, the joint gaussian model now achieves very similar performance to the beta, indicating that it is modeling the variance better than the beta (considering that it modeled the mean slightly worse). This may be due to the gaussian being light-tailed in comparison to the beta and hence able to put greater probability mass near the mean.

#### 2.7.4. Imputation Accuracy.

As a further investigation of the models’ abilities to represent the distribution over image patches, we assessed their performance at filling in missing pixels in test patches, a process known as imputation. We used the following experimental process:

1. Given a test patch, randomly select a region of 1 × 1, 2 × 2, or 4 × 4 pixels, and consider these pixels to be missing.
2. Initialize the missing pixels to the mean of the observed pixels.
3. Perform 16 bottom-up and top-down passes to impute the values of the missing pixels. In each top-down pass, the values of the observed pixels are fixed, while the values of the missing pixels are sampled from *P*(**v**|**h**). Enough passes are chosen to allow mixing to occur (bear in mind that we are sampling from the conditional distribution of the unobserved pixels given the observed pixels, which is highly concentrated).
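Sketched for a binary RBM with illustrative parameters, the imputation loop clamps the observed pixels and resamples only the missing ones on each top-down pass:

```python
import numpy as np

rng = np.random.default_rng(6)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden = 16, 8
W = rng.normal(scale=0.1, size=(n_hidden, n_visible))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)

v_true = (rng.random(n_visible) < 0.5).astype(float)
missing = np.zeros(n_visible, dtype=bool)
missing[[3, 4]] = True                   # e.g., a 2-pixel "region" is missing

v = v_true.copy()
v[missing] = v_true[~missing].mean()     # initialize to mean of observed pixels

for _ in range(16):                      # 16 bottom-up / top-down passes
    h = (rng.random(n_hidden) < sigmoid(c + W @ v)).astype(float)
    v_sample = (rng.random(n_visible) < sigmoid(b + W.T @ h)).astype(float)
    v[missing] = v_sample[missing]       # resample only the missing pixels
    v[~missing] = v_true[~missing]       # observed pixels stay clamped

rmse = np.sqrt(np.mean((v[missing] - v_true[missing]) ** 2))
print(rmse)
```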

The RMSEs between the imputed and true pixel values for the different models are shown in Figure 3a for models with differing numbers of hidden units and in Figure 3b for different-sized imputation regions. Again, the beta RBM leads to the best performance in all cases, with the less stable joint gaussian RBM typically coming in second.

### 2.8. Conclusion.

Across experiments, the beta RBM proved more robust and slightly more accurate than all the other types of RBM. We therefore decided to use it to model appearances. Nevertheless, one should bear in mind that there is room for improvement and other, higher-quality continuous-valued RBMs may exist.

## 3. The Masked Restricted Boltzmann Machine

An RBM will capture high-order interactions between visible units, to the limit of its representational power determined by the number of hidden units. If there are not enough hidden units to perfectly model the training distribution, one can observe a blurring effect: when two input variables are almost always similar to each other and sometimes radically different, the RBM will not capture this rare difference and will assign a mean value to both variables. When the appearance of image patches is being modeled, any two nearby pixels will exhibit this property (being different only when an edge is present between these two pixels), thus resulting in a poor generative model of image patches (as shown in the *K* = 1 case of Figure 6). To avoid this effect, a standard RBM would require a number of hidden units equal to the product of the number of possible locations for an edge and the number of possible appearances. Not only would that number be prohibitive, it would also be highly inefficient since the vast majority of hidden units would remain unused most of the time. A more efficient way to bypass this constraint of consistency within the data set is to have *K* appearance RBMs, each generating a latent image patch **v**_{k}, competing to explain each pixel in the patch. Whenever an edge is present, one RBM can explain the pixels on one side of the edge, while another RBM will explain pixels on the other side. We say that such a model has *K* **layers**. To determine which appearance RBM explains each pixel, we introduce a **mask** with one mask variable per pixel (*m _{i}*), which can take as many values as there are competing RBMs. The overall masked RBM is shown in Figure 4 and its associated factor graph is shown in Figure 5.

In the remainder of this article, we use the following notation:

- •
Since most of the equations will involve all the layers, we define a shortcut notation: for any variable *t _{k}* defined for each layer *k*, the set of variables {*t*_{1}, …, *t _{K}*} shall be replaced by *t*_{1..K}.

- •
**v** is the image patch.

- •
**v**_{k} is the *k*th latent patch.

- •
**h**^{(a)}_{k} is the hidden state of the *k*th layer. The (*a*) superscript stands for “appearance,” as we will introduce shape layers later on.

Given the mask **m**, the probability of a joint state (**v**, **v**_{1..K}, **h**^{(a)}_{1..K}) is equal to

$$P(\mathbf{v}, \mathbf{v}_{1..K}, \mathbf{h}^{(a)}_{1..K} \mid \mathbf{m}) \propto \prod_i \delta\!\left(v_i = v_{m_i,i}\right) \prod_k \mathrm{APP}\!\left(\mathbf{v}_k, \mathbf{h}^{(a)}_k\right),$$

where APP(**v**_{k}, **h**^{(a)}_{k}) is the joint probability of (**v**_{k}, **h**^{(a)}_{k}) under the chosen appearance RBM. The first term allows our model to assign infinite energy (and therefore zero probability) to configurations violating the constraint that if layer *k* is selected to explain pixel *i* (i.e., *m _{i}* = *k*), then we must have *v _{i}* = *v*_{k,i}.
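The masked RBM's pixel constraint (*v _{i}* = *v*_{k,i} whenever *m _{i}* = *k*) has a simple procedural reading: the visible patch is assembled by copying, at every pixel, the value of the latent patch that the mask selects. A minimal sketch (layer count and patch size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
K, n_pixels = 4, 16                        # K competing appearance layers

latent = rng.random((K, n_pixels))         # latent patches v_1..K, one per layer
mask = rng.integers(0, K, size=n_pixels)   # m_i = index of the explaining layer

# v_i = v_{m_i, i}: each pixel is copied from the layer the mask selects.
v = latent[mask, np.arange(n_pixels)]

# The constraint holds by construction.
assert all(v[i] == latent[mask[i], i] for i in range(n_pixels))
print(v.shape)
```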

To demonstrate the efficiency of using several masks, we infer the mask and hidden states of models with various *K* given an image and then reconstruct the image using the mask and these hidden states. The inference procedure is described in section A.1 of the appendix. For a fair comparison, we used the same total number of hidden variables for each value of *K* (accounting for the bits required to store the mask and the hidden units for each appearance model). The reconstruction with *K* = 4 thus used RBMs with many fewer hidden units (*n* = 128) than the one with *K* = 1 (*n* = 1024). From the results shown in Figure 6, we see that it is advantageous to assign a large number of bits to the mask rather than to the appearance. A more thorough evaluation of the masked RBM is presented in section 5.

## 4. Modeling Shape and Occlusion

Equation 3.1 defines a conditional distribution of the image, the latent patches, and the hidden states given the mask. To get a full probability distribution over the joint variables, we must also define a distribution over the mask. In this article, we consider three mask models: a uniform distribution over all possible masks, a multinomial RBM that we denote the softmax model, and a model that has been designed to handle occlusions, which we call the occlusion-based model. The latter two models will allow us to learn a model of the shapes present in natural images.

The learning and inference procedures in these models may be found in appendix A.

### 4.1. The Uniform Model.

The simplest mask model is the uniform distribution over **m**. In this model, no mask is preferred a priori, and the inferred masks are solely determined by the image. We use this model as a baseline.

### 4.2. The Softmax Model.

The softmax model comprises *K* binary RBMs with shared parameters competing to explain each mask pixel. Each RBM defines a joint distribution over its visible state **s**_{k}, which is a binary shape, and its binary hidden state **h**^{(s)}_{k} (the (*s*) superscript stands for “shape”). The *K* binary shapes **s**_{k} are then combined to form the mask **m**, which is a *K*-valued vector of the same size as the **s**_{k}’s. To determine the value of *m _{i}* given the *K* sets of hidden states **h**^{(s)}_{k}, one needs to compute a softmax over the *K* different inputs. The joint probability distribution of this model is

$$P(\mathbf{m}, \mathbf{s}_{1..K}, \mathbf{h}^{(s)}_{1..K}) \propto \prod_i \delta\!\left(\sum_k s_{k,i} = 1\right) \prod_i \delta\!\left(s_{m_i,i} = 1\right) \prod_k \mathrm{SHAPE}\!\left(\mathbf{s}_k, \mathbf{h}^{(s)}_k\right),$$

where SHAPE(**s**_{k}, **h**^{(s)}_{k}) is the joint probability of (**s**_{k}, **h**^{(s)}_{k}) under the chosen shape RBM (a binary RBM in our case). The right-hand side of the equation is unnormalized due to configurations violating the constraints (e.g., *s*_{k,i} = 0 for all *k*).

The first and second terms state that only one shape may be “on” at any given pixel and that the index of the selected shape is the value of the mask at that pixel. Inference is relatively straightforward in this model, but at the cost of poor handling of occlusion. Indeed, this model makes the implicit assumption that all the objects are at the same depth. This gives rise to two problems:

1. When object *A* is occluding object *B*, the shape of object *B* is considered absent in the occluded region rather than unobserved. As a consequence, the model is forced to learn the shape of the visible regions of occluded layers. For example, with a digit against a background, the model is required to learn the shape of the visible region of the background, in other words, the inverted digit shape.
2. There is no direct correspondence between the hidden states of any single layer and the corresponding object shape, since the observed shape will jointly depend on the *K* inputs. In an object recognition system, this would reduce the ability to recognize a partially occluded object by its shape.
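A sketch of how the softmax model turns per-layer top-down inputs into a mask: at each pixel, the *K* layer inputs are passed through a softmax and *m _{i}* is sampled from it, which forces exactly one shape to be "on" per pixel. All parameters below are illustrative stand-ins, not the trained shape RBM:

```python
import numpy as np

rng = np.random.default_rng(8)
K, n_pixels, n_hidden = 3, 16, 5

# Shared shape-RBM parameters; h holds the per-layer hidden states h^(s)_k.
W = rng.normal(scale=0.5, size=(n_hidden, n_pixels))
b = rng.normal(scale=0.1, size=n_pixels)
h = (rng.random((K, n_hidden)) < 0.5).astype(float)

# Top-down input of layer k at pixel i, then a softmax over the K layers.
inputs = h @ W + b                        # shape (K, n_pixels)
p = np.exp(inputs - inputs.max(axis=0))
p /= p.sum(axis=0)

# Sample m_i from the per-pixel softmax; s_{k,i} = 1 iff m_i = k.
m = np.array([rng.choice(K, p=p[:, i]) for i in range(n_pixels)])
s = (np.arange(K)[:, None] == m[None, :]).astype(float)
assert (s.sum(axis=0) == 1).all()         # exactly one shape "on" per pixel
print(m[:8])
```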

### 4.3. The Occlusion Model.

The occlusion model augments the shape layers with a relative depth ordering π (π(*k*) being the position in the relative depth ordering of layer *k*; that is, π(*k*) = 1 indicates that *k* is the front-most layer and π(*k*) = *K* indicates that *k* is the rear-most layer), where each layer contains a shape. For this shape to be visible, there must not be any other shape at the same location in the layers above. The joint probability distribution for this model is:

$$P(\mathbf{m}, \pi, \mathbf{s}_{1..K}, \mathbf{h}^{(s)}_{1..K}) \propto P(\pi) \prod_i \delta\!\left(s_{m_i,i} = 1\right) \prod_i \prod_{k' :\, \pi(k') < \pi(m_i)} \delta\!\left(s_{k',i} = 0\right) \prod_k \mathrm{SHAPE}\!\left(\mathbf{s}_k, \mathbf{h}^{(s)}_k\right).$$

The general factor graph corresponding to the masked RBM with nonuniform mask prior is shown in Figure 7. There are two main differences between the occlusion model and the softmax model:

1. We now have a prior *P*(π) over the depth ordering (which is chosen to be uniform).
2. If *m _{i}* = *k*, then we must have *s*_{k,i} = 1 (as in the softmax model), but we require only that *s*_{k′,i} = 0 for the layers *k*′ in front of the layer *k* (rather than for all the layers, as is the case in the softmax model). The values *s*_{k′,i} for layers *k*′ behind layer *k* are unobserved (occluded). This idea is illustrated in Figure 8.
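Under the occlusion model, the mask can be read off deterministically from the binary shapes and the depth ordering: *m _{i}* is the front-most layer whose shape is "on" at pixel *i*. A sketch (with an always-on background layer added so that every pixel is explained; sizes are illustrative, and depth position 0 plays the role of π(*k*) = 1):

```python
import numpy as np

rng = np.random.default_rng(9)
K, n_pixels = 3, 12

s = (rng.random((K, n_pixels)) < 0.5).astype(int)
s[K - 1] = 1                   # make the last layer a full background shape

pi = rng.permutation(K)        # pi[k] = depth position of layer k (0 = front)
order = np.argsort(pi)         # layers sorted from front to back

m = np.empty(n_pixels, dtype=int)
for i in range(n_pixels):
    for k in order:            # scan from the front-most layer down
        if s[k, i] == 1:
            m[i] = k           # first "on" shape wins; deeper ones are occluded
            break

# Consistency with the model's constraints:
assert all(s[m[i], i] == 1 for i in range(n_pixels))
for i in range(n_pixels):
    for k in range(K):
        if pi[k] < pi[m[i]]:   # any layer strictly in front of m_i ...
            assert s[k, i] == 0  # ... must have its shape off at pixel i
print(m)
```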

In the case of the occlusion model, there is a direct correspondence between the hidden states and the shape of the object (see Figure 10). Figure 9 specializes the general factor graph for the masked RBM with nonuniform mask prior from Figure 7 for the masked RBM with occlusion mask model and shows a schematic of the full model as a chain graph. The inference procedure for the depth ordering π is described in section B.1 in appendix B.

## 5. Inferring Appearance and Shape of Objects in Images

Our goal is to learn a good generative model of images by extracting a factorial latent representation (appearance and shape) of objects in natural images. To assess how well this goal is achieved, we seek to answer a set of questions:

- •
How visually similar are the samples from our model to samples coming from the same distribution as the training set? Although poor samples characterize a bad generative model, the converse is not true: samples too close to the training data indicate a lack of generalization, which is equally undesirable. Despite the flaws of this measure, we think it can provide meaningful insight into what has actually been learned.

- •
Do samples from our model exhibit the same statistics as those computed on test patches?

- •
Are test patches likely under our model?

- •
Did we really factor appearance and shape? Are the latent representations we extract meaningful? Are they independent of the depth ordering of the objects in the image? Are the depth orderings correct?

The first three questions relating samples from our model and test data can be answered on both a toy data set and a real data set of natural images. However, a toy data set offers the additional advantage of providing the ground truth objects from which the patches have been created, which makes it easier to assess the quality of the generative model.

The last questions are trickier to answer in the context of natural images since we have no control over the ordering of the objects. However, there are some natural patches for which there is little ambiguity over that ordering. If the model is able to infer a plausible answer in these cases, this should be a good indicator of the quality of its inference of the depth ordering of the objects (and thus a measure of the invariance of the inferred latent shapes to this ordering).

### 5.1. Training.

This section describes the training procedure for the masked RBM, as this model proved much more complicated to train than a standard RBM. Details on the data sets used are provided in the next sections. Additional details about the training procedure are contained in appendix D. The training was done in several stages of increasing complexity for efficiency reasons:

1. We first trained a single unmasked RBM until low-frequency filters appeared. This allowed us to quickly obtain a good initialization for the filters typically obtained in the masked RBMs (since the edges are captured by the masks, none of them are high frequency) while avoiding having to infer the mask at each iteration.

2. Initializing with the filters from the previous step, we then trained a masked RBM with a uniform mask model (which means we trained only the appearance RBM) and *K* = 2. Using a lower *K* allows us to speed up inference while still providing good initial filters for the final stage. *K* was then switched to 4 until the parameters converged. The reason we trained the appearance model in the context of a masked RBM was to avoid wasting capacity on modeling complicated shapes that will be handled by the mask.

3. We froze the parameters of the appearance RBM and inferred an initial segmentation (mask) of our training data. We used the binary region shapes extracted from the masks to pretrain the binary RBM of the shape model.

4. We trained the shape model in the context of the full masked RBM, performing joint inference of the shape, appearance, mask, and depth, with an occlusion shape model using the binary RBM trained in the previous step as initialization.

5. We fine-tuned both the appearance RBM and the shape RBM by performing joint inference of the parameters of both models (the masks being inferred at each iteration using the current state of the RBMs), using the correct shape model.

Bootstrapping allowed faster learning of this complex model. Also, experiments seemed to indicate that it helps to find a better global solution and avoid undesirable local minima.
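The RBMs in these stages are trained with contrastive divergence. As a sketch only, here is a minimal CD-1 update for a binary RBM (such as the shape RBM pretrained in stage 3); the function name, learning rate, and shapes are illustrative, and the continuous-valued appearance RBM would use beta conditionals instead (see appendix D for the paper's actual settings):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, b, c, v0, lr=0.01, rng=None):
    """One contrastive-divergence (CD-1) step for a binary RBM.

    W: (nv, nh) weights, b: (nv,) visible biases, c: (nh,) hidden
    biases, v0: (n, nv) batch of binary training vectors.
    Updates the parameters in place and returns them.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    ph0 = sigmoid(v0 @ W + c)                   # P(h = 1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + b)                 # one reconstruction step
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / n     # positive minus negative phase
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c
```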

### 5.2. Toy Masks Data Set.

The toy masks data set is composed of 4000 14 × 14 mask patches generated from the superposition of an MNIST digit (from class 3) and a shape (a circle, a square, or a triangle). In this data set, neither digits nor shapes are shown in isolation, and each digit example appears in exactly one image. Since the digit is in the background on half of the patches, half of the digit examples are only partially visible. Samples of this data set (which are masks) are shown in Figure 10a: each pixel can take three values (represented by the colors red, green, and blue)—one for each object in the patch (the background being the third object). Which color is assigned to each object is irrelevant (the actual values are not used to infer the depth ordering); it matters only that they are assigned different colors.

#### 5.2.1. Quality of the Generative Shape Model.

We trained our mask model using three layers (*K* = 3). Figure 10 shows samples from the occlusion model with 20 hidden units (b), the softmax model with 70 hidden units (c), and the softmax model with 20 hidden units (d). Samples from the occlusion model are drawn by sampling from the two RBMs governing the two front-most layers independently and then composing these samples, as prescribed by equation 4.2. One can see that when 20 hidden units are used, the samples drawn from the occlusion-based mask model are much more convincing than those drawn from the softmax model. Indeed, the latter generated samples with improper occlusions or deformed digits. It is also interesting to note that the occlusion model generalized to samples not seen in the training set, like the two MNIST digits that occlude each other. Furthermore, columns b2 and b3 show samples of the latent shapes, proving that the occlusion model learned a model of the individual shapes, despite the fact that it has never seen them in isolation.

In the softmax model, the layers cooperate to generate a particular image of occluding shapes. It is not possible to sample from the individual layers separately, but one can still inspect the inputs to the three layers of visible units that are tied together by the softmax. These inputs are shown in Figure 10 (c2, c3, and c4). It is clear that no shape is generated by a single layer but that all three layers have to interact. In the first row, for instance, all three inputs contain a 3 (with either positive or negative weights) although it is absent from the resulting sample. Though harmful (because they require additional modeling power), these cancellations are inevitable in the softmax model. While the occlusion model learns about the individual image elements, the softmax model has to represent all their possible arrangements explicitly, which is less efficient and thus requires a larger number of hidden units. This also leads to a set of hidden units that is far less indicative of the shape in the image than in the occlusion model.

#### 5.2.2. Sensitivity to Occlusion.

To assess the importance of the difference in representation between the softmax and the occlusion mask models, we created pairs of images containing one digit and one shape (the same digit and the same shape were used in both images of a pair). In the first image, the digit was in front of the shape, and in the second image, the shape was in front of the digit. We compared the inferred shape latent variables for the two cases and computed their root mean squared difference. Because our main motivation is to recognize objects whether or not they are occluded, we would like the shape latent variables to be as similar as possible in the two cases. Unsurprisingly, the occlusion-based mask model clearly outperforms the softmax model, as may be seen in Figure 11. Furthermore, in our experiments, the occlusion model inferred the correct ordering more than 95% of the time (chance being 17%, as there are three layers and six possible orderings).

This toy data set emphasizes the need for modeling occlusion when extracting a meaningful representation of the shapes present in images.
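The invariance measure used above can be sketched with a hypothetical helper (the paper does not give an implementation): given the shape latents inferred for the two depth orderings, the measure is simply their root mean squared difference.

```python
import numpy as np

def latent_rms_difference(h_front, h_back):
    """Root mean squared difference between the shape latents inferred
    when the object is in front and when it is partially occluded.
    Smaller values mean the latent shape is more invariant to depth."""
    h_front = np.asarray(h_front, dtype=float)
    h_back = np.asarray(h_back, dtype=float)
    return float(np.sqrt(np.mean((h_front - h_back) ** 2)))
```

For binary latents, this is the square root of the fraction of hidden units that flip between the two conditions.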

### 5.3. Natural Image Patches.

The experiments on toy data demonstrated that the occlusion model is able to learn and recognize shapes under occlusion and is able to perform depth inference given a mask image with occluding shapes. The second set of experiments on natural images assesses the joint model consisting of the shape and the appearance model. For this purpose, we trained the full model with *K* = 3 on 21,000 16 × 16 patches extracted from natural color images. The mask model used in all these experiments is the occlusion model. The appearance RBM had 128 hidden units, and the shape RBM had 384 hidden units. (Details on the training procedure can be found in section D.2 in appendix D.)

As outlined above, our criteria for assessing the model on this data set were:

- Whether samples from the model looked qualitatively similar to the natural image patches that we had trained the model on (see section 5.3.2)
- Whether samples from the model exhibited the same statistics as natural image patches (see section 5.3.3)
- Whether inference on natural image patches would give plausible results (see section 5.3.4)

#### 5.3.1. Sampling from a Confident Continuous-Valued RBM.

When learning the appearances of the objects with the beta RBM, each expert becomes extremely confident. This is even more striking in the masked context, where the noise model does not need to explain the sharp variations of appearance at the boundaries of objects. While this is a good thing from a generative point of view, it leads to a very poor mixing of the Gibbs chain. Indeed, as the conditional distributions *P*(**v**|**h**) become very peaked, so do the distributions *P*(**h**|**v**), and the relationship between **v** and **h** becomes quasi-deterministic. This makes it hard to:

- Learn the parameters in the final stage, as the samples from the negative chain are highly correlated between consecutive time steps
- Draw samples to assess the quality of the generative model
- Compute an accurate approximation to the partition function to estimate the log probability of test patches

The first issue was dealt with by using tempered transitions (Salakhutdinov, 2009) twice per sweep through the training set. To improve sampling, we trained a binary RBM on top of our beta RBM. Because such RBMs mix much more easily, we could draw samples by running a Gibbs chain in this top binary RBM before performing a top-down pass in the bottom beta RBM. Unfortunately, even then, annealed importance sampling (AIS) (Salakhutdinov & Murray, 2008) proved unreliable. We therefore decided not to include log-probability results whose validity we could not properly assess.

It is worth emphasizing that the inference of the hidden variables given the visible ones does not suffer from these issues (it is still fast and exact), nor does the optional learning of a layer above (since it will then deal with binary data).
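The peakedness effect can be illustrated directly: for binary hidden units, *P*(**h**|**v**) factorizes into sigmoids of the total input to each unit, so scaling up weights and biases (as happens when the experts grow confident) drives each conditional toward 0 or 1 and the Gibbs chain toward determinism. A minimal sketch, where `scale` is an illustrative knob rather than a model parameter:

```python
import numpy as np

def hidden_posterior(v, W, c, scale=1.0):
    """P(h_j = 1 | v) for the binary hidden units of an RBM whose
    weights W and biases c are multiplied by `scale`.  Larger scales
    give more peaked (more nearly deterministic) conditionals, which
    is what slows the mixing of the Gibbs chain."""
    return 1.0 / (1.0 + np.exp(-scale * (v @ W + c)))
```

Comparing `scale=1.0` with `scale=10.0` on the same input shows every posterior moving farther from 0.5, that is, toward a deterministic mapping between **v** and **h**.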

#### 5.3.2. Visual Assessment of the Samples.

Sampling from the mask model was performed by sampling the binary RBMs in the shape layers (15,000 steps of Gibbs sampling) and composing them according to a randomly chosen depth ordering. Masks were then combined with samples from the appearance model (5000 steps of Gibbs sampling). The full samples from the masked RBM are shown in Figure 12 (right). Although they do not exhibit as much structure as true natural image patches (see Figure 12, left), the presence of multiple sharp edges makes them look much more convincing than the typical blurred samples one may obtain from a single RBM. Moreover, the samples clearly capture important characteristics of the training patches (such as the dominance of homogeneous regions and the shape of the boundaries of these regions), despite the relative simplicity of the model and the fact that *K* was chosen to be small.

#### 5.3.3. Image Statistics.

We assess the quality of the samples from the masked RBM by comparing the statistics of responses of different types of filters (even and odd Gabor filters and random zero-mean filters) with the statistics of real image patches. Before computing the filter responses, we converted all the patches to grayscale. We compared four kinds of patches:

- Natural patches.
- Patches sampled from the masked RBM. The appearances and the shapes are true samples from the model. This model used *K* = 3 layers.
- Patches sampled from a single, unmasked RBM.
- Patches generated from gaussian noise with the same covariance as natural patches.

The results (displayed as log probability of each response value) are shown in Figure 13. For all filters, the response histograms of samples from the masked RBM (in blue) have much heavier tails than those for patches sampled from the unmasked RBM (in red) or the gaussian model (cyan), but they are similar to the responses obtained from real image patches (green). There is one systematic mismatch between natural image patches and the samples obtained from the masked RBM. Due to the pixel-independent noise model, the peak of the histograms at 0 is underestimated for the samples from the masked RBM (this is because nearby pixels have an extremely low probability of having the same value, unlike true image patches). However, if we replace samples from the appearance model with the mean activations of the visibles given the binary hiddens in the last step of the Gibbs chains^{4} and use those when composing the full, layered samples from the masked RBM, we get the filter responses shown in Figure 14 (only the region near the origin is shown). The tails remain the same, but the peak at 0 is more pronounced, closely matching the ones obtained with true image patches. We emphasize that the model has never been trained directly to match the statistics of natural images. Nevertheless, it reproduces some of their distinguishing features quite reliably. The improved matching, in particular the heavy tails, arose naturally with the use of a mask.
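The statistics in this comparison can be reproduced in outline with a short sketch (hypothetical helper names; the paper's exact filters and histogram settings are not specified here):

```python
import numpy as np

def random_zero_mean_filter(h, w, rng):
    """A random zero-mean filter, one of the filter types used for the
    response statistics (alongside even and odd Gabor filters)."""
    f = rng.standard_normal((h, w))
    return f - f.mean()

def response_log_hist(patches, filt, bins=41, lim=None):
    """Log-density histogram of filter responses over grayscale patches.

    patches: (n, H, W) array of grayscale patches; filt: (H, W) filter.
    Returns (log_density, bin_edges); heavy tails in log_density are the
    signature of natural-image-like statistics discussed in the text.
    """
    responses = np.tensordot(patches, filt, axes=([1, 2], [0, 1]))
    if lim is None:
        lim = np.abs(responses).max()
    density, edges = np.histogram(responses, bins=bins,
                                  range=(-lim, lim), density=True)
    with np.errstate(divide="ignore"):   # empty bins give log(0) = -inf
        return np.log(density), edges
```

Running this on the four kinds of patches and overlaying the resulting curves reproduces the kind of comparison shown in Figure 13.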

#### 5.3.4. Inference of Relative Depths Based on Shape.

The goal of this experiment is to investigate whether learning an efficient representation of the data leads to the model being able to reason about image regions and relative depths. For this purpose, we chose a simple scenario shown in Figure 15: patches that contained simple shape-based depth cues were extracted from an image (a). For each patch, the model inferred a segmentation mask with up to *K* = 3 regions (b.1), a relative depth ordering (front to back: red—green—blue), the potentially partially unobserved shapes of the two rear-most layers (b.2), and the appearances of the three layers. The inferred latent shapes allow removing the foreground shape and imputing the missing parts of the second layer shape (c.1 and c.2: segmentation mask with two layers and imputed image, respectively). For the examples shown, the model inferred segmentations, depth orderings, and latent shapes largely consistent with the full image.

Inferring relative depth using very local shape information only (such as provided by our 16 × 16 patches) is a highly ambiguous problem in the general case—not just for a computational model but also for human observers. The fact that the model is able to perform such a task at all might be surprising considering that it has been trained on only individual image patches without any built-in prior (e.g., about smooth boundary shapes) or additional information, such as the context (the larger shapes that the fragments in the patch are part of), stereo data, or temporal information. Nevertheless, there are at least two plausible cues acquired by the model during training that are driving the results in Figure 15. One relatively naive cue the model uses is that it prefers to place smaller regions in the foreground. More important, however, it also prefers to explain image patches in terms of extended, roughly horizontal or vertical shapes. This behavior is rather robust and observed for all five examples in Figure 15, particularly for patch 3. It allows the model to complete the occluded shapes in a plausible manner and thus drives depth inference. We provide an evaluation of this phenomenon on a larger data set in appendix E. Here, results cannot be easily explained in terms of region size, and we find the model to be in qualitative agreement with human observers (although we would not like to claim that the model matches human performance in general). This behavior seems reasonable given that such roughly horizontal and vertical shapes are particularly frequent in our training data so that representing, say, patch 3 in terms of such shapes is a likely explanation in light of these training data. Thus, learning an efficient representation of the data also has made the model pick up certain simple depth cues despite never having received any kind of depth information with the training data.

There are currently two main limitations to the model. First, the model has difficulties in correctly segmenting image patches that exhibit matting or shading since this is not accounted for by the model. Also, the model currently does not have a suitable prior over the number of regions, so it has a tendency to oversegment patches that have fewer than *K* coherent regions (such as the second patch in Figure 15). Incorporating such a prior effectively corresponds to model selection and is nontrivial since we cannot compute the normalization constant of either the appearance or shape RBM, but we are currently working on suitable approximations.

### 5.4. On the Use of an RBM for Modeling Appearance.

Looking at the very smooth latent patches of Figure 6, one may wonder if RBMs are the right model to use for appearances since they do not seem to be able to model complex textures.

First, we recall that some of the advantages of RBMs are the ease with which they can be trained, the speed of inference, the convenience of the distributed representation of the data, and their ability to be easily stacked into deeper structures, which will be important for the future hierarchical formulation of the model outlined in section 8. Also, provided that the number of hidden units is large enough, they can model more complicated structures, as shown in Figure 6 when *K* = 1. Thus, the choice of simpler RBMs stemmed from the observation that it is much more efficient (in terms of the quality of the reconstruction) to assign bits to the mask rather than to the appearance. Finally, many natural image patches in our data set were simple enough that one did not need to use four latent appearances, but since there is not yet any procedure to select *K* automatically, this resulted in an oversegmentation of these patches during training, yielding overly smooth patches.

Thus, although the RBMs we used did not model complex textures (which were then accounted for by the mask), this would not necessarily be the case in other models or with larger RBMs (in terms of the size of the hidden layer), resulting in the mask's capturing changes only in such textures.

## 6. Field of Masked RBMs

When modeling image patches of size 16 × 16, we made the assumption that they were composed of *K* patch models of size 16 × 16 fully aligned with each other and with the image patch. As a consequence, each pixel in a patch can be explained by any one of the *K* different patch models. In order to move from image patches to entire images, we could use a larger number of bigger patch models (which would be the size of the image rather than 16 × 16). However, that would be very expensive (especially the depth inference in the occlusion model) and inefficient since this would not model translation invariance. Instead, we will again use patch models of size 16 × 16, which will be laid out across the image, partially overlapping each other. Of course, in the case of large images, the total number of such patch models is much greater than the number of patch models any one pixel can be explained by.

A simple way of covering an entire image with these patch models is to tile it into a set of nonoverlapping image patches and model each such patch with a masked RBM, as in section 4. However, this approach leads to artifacts at the patch boundaries, since correlations between pixels on either side of these boundaries are ignored. These artifacts appear because the *K* patch appearance models that each pixel chooses between are aligned, so that their patch boundaries are in the same place. Moreover, and perhaps more important, the only translation invariance we get is very coarse (our model would be invariant to translations of 16 pixels or multiples thereof). A better solution is obtained if we spatially offset the patch models so that no two patches are fully aligned. One such arrangement is shown in Figure 16. Here, the image is tiled by *K* grids of patch models. In each grid, the patches are nonoverlapping and cover all pixels in the image. Across different grids, the patch boundaries are spatially offset horizontally or vertically by half the patch size so that no two patches are fully aligned. This model allows finer translational invariance. For instance, with *K* = 4 and a patch size of 16 × 16, the patch boundaries are offset by 8 pixels; thus, it is invariant to translations of 8 pixels or multiples thereof. It should be noted that although we colored all the patch models belonging to one grid with the same color in Figure 16, patch models belonging to the same grid are in no way more related than patch models belonging to different grids.
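The offset-grid layout can be sketched as follows; `covering_patches` is a hypothetical helper (not from the paper) that returns, for a given pixel, the top-left corner of the covering patch in each of the *K* = 4 grids:

```python
def covering_patches(x, y, patch=16):
    """Top-left corners of the K = 4 offset-grid patches covering pixel
    (x, y), mimicking the layout of Figure 16.  Grid offsets are
    (0, 0), (8, 0), (0, 8), (8, 8): half the patch size, so patch
    boundaries are staggered by 8 pixels.  Patches near the border may
    extend beyond the image, which this sketch does not clip."""
    half = patch // 2
    corners = []
    for g, (dx, dy) in enumerate([(0, 0), (half, 0), (0, half), (half, half)]):
        # Snap (x, y) down to the patch grid of grid g, then undo the offset.
        left = ((x - dx) // patch) * patch + dx
        top = ((y - dy) // patch) * patch + dy
        corners.append((g, left, top))
    return corners
```

Each pixel is thus covered by exactly one patch per grid, giving the *K* competing patch models described in the text, with no two of them fully aligned.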

Thus, the image is covered with partially overlapping appearance RBMs (and possibly corresponding shape models), arranged such that each pixel is covered by exactly *K* RBMs. Figure 16 shows a field of masked RBMs, with two of the overlapping appearance RBMs highlighted. The set of mask variables now forms a mask image with a value for each image pixel indicating which of the *K* overlapping models it is explained by (bear in mind that the same mask value *k* refers to different superpixels across the image, since at different positions in the image, we have different superpixels in the *k*th grid). It should be noted that this model is a mixture rather than a product of appearance RBMs, in contrast, for instance, to the field of experts model of Roth and Black (2005).

Inference is done in the same way as at the patch level, with the one difference being that the patch models competing for a particular pixel are no longer aligned. This introduces long-range dependencies between spatially separated patches, so that inference has to be performed on the entire image simultaneously. While this makes perfect sense from a probabilistic point of view (in the general case, one has to take the whole image into account to understand part of it), the result is slower learning and inference.

Figure 17 is the equivalent of Figure 6 for full images. It shows the reconstruction of an image (i.e., the image generated using the hidden states inferred from the original image) using various numbers of layers and a uniform mask model. As for the experiments depicted in Figure 6, we used the same number of hidden variables for each value of *K* (4 bits per pixel). The RBMs used in the appearance model with *K* = 4 thus have only 128 hidden units, whereas those used in the model with *K* = 1 have 1024 hidden units. The patch size is 16 × 16 pixels for all *K*.

Both shape models discussed in the previous section (the softmax as well as the occlusion model) can be used at the image level. Figure 18 shows that using such a shape model yields more coherent regions for the mask image without significant loss in reconstruction accuracy. The occlusion model leads to a particularly appealing interpretation at the image level: each patch model can be thought of as an independent expert modeling shape and appearance of an image patch. It consists of an appearance RBM that determines the color—or, more generally, texture—of a patch and a binary RBM that determines its shape, as is illustrated in Figure 16. An image is generated by covering it fully with such patches in an occluding manner. This generative process bears some resemblance to the dead-leaves model (Lee, Mumford, & Huang, 2001), although there are important differences (e.g., in the current formulation of our model, the number of occluding objects covering an image is fixed and their maximum size is restricted while we allow for complicated and diverse shapes and appearances of individual objects). In particular, and perhaps surprisingly, inference with the occlusion model can still be performed efficiently for full images: even though each image is explained by a potentially large number of patches, each individual patch overlaps with only a small number of neighbors (e.g., for *K* = 4 and the global patch layout shown in Figure 16, each patch overlaps with eight neighbors). Thus, instead of determining a global depth order of all patches (which would clearly be infeasible), it is sufficient to infer the depth of each patch relative to its neighbors. The depth of a particular patch given a fixed relative order of its neighbors can be determined following the principles described for image patches in section 4.3; the full local ordering of all patches covering the image is determined in an iterative manner by considering each patch in turn (see the appendix for details).

### 6.1. Evaluation on Shape Data Set.

Although the reconstruction of test data gives some information about the quality of a generative model, it has severe shortcomings. We thus repeat some experiments done at the patch level to show how the main properties of the algorithm have been preserved despite operating at the image level.

We start by assessing the validity of our model on toy data. We focus our attention on three components:

- The allocation of objects to masked RBMs. Namely, are objects fully captured by the RBM they are centered on? Are RBMs explaining only parts of objects?
- How robust is the depth inference between overlapping objects?
- How good are the shape and appearance models learned using entire images?

For this purpose, we trained our field of masked RBMs with *K* = 4 on 100 80 × 80 images (each image composed of 144 overlapping 16 × 16 patches) composed of five different shapes with varying colors placed randomly in an overlapping fashion against a uniform background (see Figure 19, left, for an example; note that shapes were aligned with the patch grid). We allowed 20 hidden units for the shape model.

After training, we verified whether the shape model had indeed learned about the shapes comprising the images by sampling from the binary RBM directly. A selection of random samples is shown in Figure 21. Indeed, even though most shapes are only partially visible in the training images (and have varying colors), the shape model has recovered the five template shapes correctly. Figure 19 (right) shows the segmentation inferred with the fully trained model for the image shown on the left. Yellow outlines show the boundaries of objects captured by each masked RBM (patch model). These boundaries indeed reflect the shapes comprising the image (note that the background is segmented in a largely arbitrary manner). Segmentation is obviously not a very difficult task given the image at hand. More interesting is the simultaneously inferred relative depth of the different image regions and the latent representation inferred for each patch model. The relative depths are shown for a subset of segmentation boundaries, which are double-marked with red and green lines. The red side of the boundary points toward the region that has been inferred to be in front and the green side toward the one that is inferred to be in the back. Figure 20 further shows the inferred latent shape and appearance for two of the patch models representing the image (indicated by the blue squares). In both cases, the true shapes (a gray star and a red triangle) are barely visible in the image (see also Figure 19). Nevertheless, the model correctly infers the appearance and, importantly, completes the partially occluded shape (see Figure 20). It is this ability to correctly complete occluded shapes that drives the depth inference.

### 6.2. Natural Images.

#### 6.2.1. Inference on Natural Images: Interpretation as a Superpixel Algorithm.

The field of masked RBMs learns to represent an image as a number of regions, each explained in terms of an appearance and a shape. These regions can be thought of as superpixels, although they differ from previous kinds of superpixels in that they are not required to be contiguous but merely constrained to lie within the boundary of a patch. Also, they have high-order shape priors that have the potential to capture complex shapes, such as digits or letters. Such noncontiguity makes particular sense when dealing with occlusion, since the same superpixel can be used to represent parts of an object on either side of a narrow occlusion.

This behavior is illustrated in Figure 22, which shows the equivalent of Figures 19 and 20 for the natural image in Figure 17. It shows the segmentation inferred by a field of masked RBMs with occlusion shape model (see section D.3 in appendix D for details on how the model has been trained) together with the corresponding latent representation of all 1218 patch models (of size 16 × 16 pixels) covering the image. For each superpixel, the combined latent shape and appearance are shown (cf. Figure 20). For the toy data considered in the previous section, the model was confident with respect to the shapes composing the image and with respect to their relative depth. In contrast, for real images such as the one considered in Figure 22, there is considerably more uncertainty as to what a suitable decomposition of the image would be. Not only are relevant regions typically significantly larger than the extent of the individual patch model, but there is also an enormous variability of shapes in natural images. With only the very local information available to the model, a decomposition in terms of high-level components of the scene cannot necessarily be expected. Nevertheless, the decomposition of the image that is inferred by the model appears largely sensible: in particular, it has a tendency to explain the image in terms of small shapes, especially thin horizontal and vertical structures, that appear in front of larger homogeneous backgrounds. This is very noticeable when focusing, for instance, on the representation of the various signs in the image (“Except for access,” “ral Service,” “TY Ltd,” and the “no parking” sign), where the letters have largely been separated out and are placed in front of mostly contiguous background superpixels. 
Note that due to the explicit representation of occlusions, superpixels in the rear do not have to model the cut-out shape of foreground superpixels (even though there are some counterexamples, e.g., the “x” and “c” in “Except” are being explained in terms of a black background of unspecific shape behind a light gray foreground that has the letter shape cut out). Other examples are the frames of signs and windows that have predominantly been explained in terms of thin horizontal and vertical structures with often larger superpixels in the rear. To facilitate the mapping between the two representations, we have color-coded superpixels in both subfigures, representing letters in red, superpixels representing the background of the signs in blue, and some of the superpixels explaining window frames in green.

The nature of this decomposition is the result of training the field of masked RBMs on a large data set of natural images (see section D.3 for some examples of the training data). Many of the training images are efficiently explained in terms of thin structures in front of larger “background” patches. Furthermore, thin horizontal and vertical structures are especially frequent in natural images, and accordingly, the model's preference for separating these into “foreground” patches is particularly robust.

To further illustrate the value of the shape model, we show the behavior of the model on a simple structure inpainting task in Figure 23. In several places, image pixels overlapping with region boundaries were removed and treated as unobserved during inference (there are seven such “unobserved” areas with an average size of more than 26 pixels; see Figure 23, left panel). The learned shape prior allows the model to continue region boundaries across the unobserved parts of the image, giving rise to a plausible reconstruction of the removed pixels (see Figure 23, middle panel). Inference is done by sampling, and there is some uncertainty with respect to the correct reconstruction. This is reflected in the mean reconstruction (see Figure 23, right panel) for which some of the filled-in boundaries are slightly blurred. The model is, however, relatively confident in most cases. Note that the ability of the model to perform such a task crucially depends on the shape model.

#### 6.2.2. Generating From the Field of Masked RBM.

The field of masked RBMs defines a generative model of natural images, and it is possible to draw samples from this model. Figure 24 shows images of size 80 × 80 pixels generated from a field of masked RBMs trained on natural images (the same model as the one used in the inference experiments in the previous section; see section D.3 for details). Samples are obtained by first sampling shape and appearance independently for each of the 144 patch models covering the image and then composing them according to a random depth order (as pointed out above, this generative process bears some resemblance to the “dead-leaves” model; Lee et al., 2001).
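The compositing step of this generative process can be sketched in a few lines. The array sizes and function name below are illustrative, not taken from the paper's implementation, and the shapes and appearances are drawn at random here where the real model would sample them from the shape and appearance RBMs:

```python
import numpy as np

rng = np.random.default_rng(0)

def compose_dead_leaves(shapes, appearances, order):
    """Composite K layers front to back: each pixel takes the appearance of
    the front-most layer whose binary shape is 'on' at that pixel."""
    K, H, W = shapes.shape
    image = np.zeros((H, W, 3))
    assigned = np.zeros((H, W), dtype=bool)
    for k in order:                              # front-most layer first
        visible = shapes[k].astype(bool) & ~assigned
        image[visible] = appearances[k][visible]
        assigned |= visible
    return image, assigned

# Toy example: two 4x4 layers composed in a random depth order.
K, H, W = 2, 4, 4
shapes = rng.integers(0, 2, size=(K, H, W))
appearances = rng.random((K, H, W, 3))
order = list(rng.permutation(K))
shapes[order[-1]] = 1        # rear-most shape is always fully on (appendix A)
img, covered = compose_dead_leaves(shapes, appearances, order)
```

Because the rear-most layer's shape is forced to be fully on (see appendix A), every pixel is guaranteed to be covered by some layer.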

The generated images contain many regions arising from partially overlapping 16 × 16 pixel square patches. This is to be expected considering that the training data contain large homogeneous regions that are well explained in terms of such almost completely filled superpixels (see also the discussion of the inferred latent representation in the previous section). In addition, the samples contain many regions with smooth, nonrectangular boundaries that cannot be explained in this manner. These reflect the shapes of boundaries found in natural images that have been learned by the shape model.

These characteristics of the samples (and also the nature of the latent representation inferred for real images discussed in the preceding section) suggest that the field of masked RBMs does indeed learn a sensible representation of natural images. At the same time, however, they also indicate one structural deficit of the model: individual patch models are assumed to be independent of each other. This is not necessarily a problem when performing inference since, in this case, the relevant longer-range dependencies are prescribed by the observed data (in fact, this independence assumption helps keep inference tractable). Yet when generating from the model, this means that shapes and appearances of neighboring patches are not required to be consistent with each other, giving rise to the more or less random patchwork of shapes and appearances observed in Figure 24. This makes it very unlikely that the model will generate images with homogeneous regions larger than the size of individual patches or regions that have smooth boundaries extending across multiple patches.

From a generative point of view, this is certainly a drawback of the model. However, the field of masked RBMs itself applied hierarchically does provide an elegant solution to this problem, which we outline in section 8.

## 7. Conclusion

The contributions of this article are as follows. First, we provided an empirical comparison of a range of RBMs able to model continuous data, showing that properly modeling the variance dramatically improves the quality of the model. We then introduced the masked RBM, a generative model built on the assumption that natural image patches are composed of objects occluding each other. In this model, each object is factored into an appearance and a shape, over which we made no prior assumptions. This proved to be a much more accurate model of image patches than the standard RBM while still allowing for efficient inference. We demonstrated how it was able to infer the depth of objects in natural scenes using only learned visual cues. We also showed that properly dealing with occlusion was essential for a good latent representation of objects. Finally, composing the masked RBMs into a field, we were able to extend our model to large images while retaining the properties observed at the patch level.

We believe the abilities to deal with occlusion, to model generic shapes and appearances, and the applicability to large images are central to a generative model suitable for a broad range of images. Inspired by previous work that dealt with a subset of these properties, we provided a unified, comprehensive probabilistic framework that, while powerful, remains computationally tractable (though still expensive). We hope that this will encourage the community to build richer, more powerful models, with the ultimate goal of approaching the capacity of the human visual system.

## 8. Future Work: The Deep Segmentation Network

We have shown how a field of masked RBMs is able to decompose an image into superpixels and model the shape and appearance of each superpixel using separate sets of hidden variables, even under occlusion (see Figure 22 for an example). The next stage of this research is to learn how these superpixels fit together into object parts and how object parts go together to form objects. To do this, we can follow the approach of deep belief nets and combine multiple fields of masked RBMs in a hierarchical model, which we call a deep segmentation network (DSN). The idea is to treat the superpixels learned by the first field of masked RBMs (see Figure 22) as input “pixels” for a higher-level field of masked RBMs. For example, the superpixels learned in the previous section are associated with patches laid out on a regular 8 × 8 grid. Hence, we can construct a new “image” one-eighth the size of the original image where the “pixels” are 512 bit feature vectors (384 shape + 128 appearance) rather than RGB values. We can train a second-level field of masked RBMs on a set of such images, where the appearance models are now binary RBMs, as shown in Figure 25. The overlapping patches of the second level cover multiple first-level superpixels and hence learn how the shape and appearance of nearby superpixels go together. Mask images will also be inferred for the second level, leading to second-level superpixels that merge a number of first-level superpixels. This process can be repeated by adding levels to the DSN until the entire image belongs to a single superpixel. This formulation gives rise to a tree-structured hierarchy in which each lower-level node (pixel) is connected to exactly one node in the next level. This hierarchy is, however, not fixed: since the mask determines to which superpixel pixels are associated, DSNs define an image-dependent parse tree of the input image, similar to Dynamic Trees (Williams & Adams, 1999; Storkey & Williams, 2003). 
However, DSNs are able to define richer and more complex priors over such parse trees than was possible with DTs. Preliminary results show that using deeper DSNs leads to meaningful higher-level superpixels while increasing accuracy on a segmentation task. We believe this is due to the capacity of the higher layers to capture longer-range dependencies, allowing parts, entire objects, and object context to be captured.
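The construction of the second-level input described above can be sketched as follows. The grid size and random codes here are placeholders for the binary hidden states that a trained first-level field would actually infer (an 80 × 80 image with patch spacing 8 is assumed to give a grid one-eighth its size):

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder first-level latent states: one 384-bit shape code and one
# 128-bit appearance code per superpixel, on a hypothetical 10x10 grid.
grid_h, grid_w = 10, 10
shape_bits, app_bits = 384, 128
shape_codes = rng.integers(0, 2, size=(grid_h, grid_w, shape_bits))
app_codes = rng.integers(0, 2, size=(grid_h, grid_w, app_bits))

# The second-level "image": each pixel is a concatenated 512-bit feature
# vector, which a binary (rather than beta) appearance RBM can model.
level2_image = np.concatenate([shape_codes, app_codes], axis=-1)
```

A second-level field of masked RBMs trained on such images learns how the shapes and appearances of nearby first-level superpixels co-occur.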

Deeper DSNs will require very large image training sets in order to learn about the range and variability of objects in natural images. Large-scale training of deep DSNs is a significant research and engineering challenge that will require extensive parallelization, in combination with novel methods for learning from vast image data sets. In the future, we will pursue this goal, with the aim of learning generative models that start to capture the daunting complexity of natural images.

## Appendix A: Inference and Learning in the Masked RBM

One of the strengths of RBMs is to have a factorial posterior distribution over the latent variables given the visible ones, making it extremely easy to perform inference. Unfortunately, this is not the case in our model: even when the mask is known, the latent images are only partially observed, resulting in a nonfactorial posterior distribution. Furthermore, the mask is not known for natural images and needs to be inferred as well. This section explains in detail how to infer all of these variables using Gibbs sampling. The mask model we consider here is the occlusion-based one, as the procedures for the other two can easily be deduced from it. We recall the definition of the model given by equations A.1 and A.2.
The right-hand side of equation A.2 is unnormalized because some configurations violate the constraints (*s*_{k,i} = 0 for all *k*, for instance). When generating a mask from the occlusion mask model, this could be dealt with by simply rejecting such invalid shape tuples. This would, however, mean that the shapes are no longer truly marginally independent (this corresponds to a renormalization of equation A.2). In practice we therefore take a different approach: when generating from the occlusion model, we do not draw the shape for the rear-most layer from the shape RBM but rather assume that this layer's shape is always on everywhere it is visible (for all pixels that are not covered by any of the other preceding shapes). This can be thought of as drawing the rear-most shape from a special shape model that puts all probability mass on the fully filled shape, and the generative model thus remains well defined. In this view, equation A.2 does not include the term SHAPE(**s**_{k}, **h**^{(s)}_{k}) for *k* = π^{−1}(*K*) (i.e., for the rear-most shape) and it is normalized, giving rise to the directed edges in Figure 9.

### A.1. Inference.

The joint distribution defined by equation A.3 exhibits several properties:

1. Given the latent images **ṽ**_{1..K}, the distribution over the appearance hidden states **h**^{(a)}_{1..K} is factorial (APP is an RBM).
2. Given the latent shapes **s**_{1..K}, the distribution over the shape hidden states **h**^{(s)}_{1..K} is factorial (SHAPE is an RBM).
3. Given the image patch **v**, the hidden states **h**^{(a)}_{1..K}, the hidden states **h**^{(s)}_{1..K}, and the ordering π, the marginal distribution over the mask **m** (when integrating out the latent images and the latent shapes) is factorial.
4. Given the image patch **v**, the mask **m**, and the hidden states **h**^{(a)}_{1..K}, the distribution over the latent images is factorial.
5. Given the mask **m**, the hidden states **h**^{(s)}_{1..K}, and the ordering π, the distribution over the latent shapes **s**_{1..K} is factorial.

The third property follows from equation A.4: given the image patch **v**, the hidden states **h**^{(a)}_{1..K}, the hidden states **h**^{(s)}_{1..K}, and the ordering π, the conditional distribution over the mask factorizes, where APP(**ṽ**_{k}|**h**^{(a)}_{k}) (resp. SHAPE(**s**_{k}|**h**^{(s)}_{k})) is the conditional probability of **ṽ**_{k} (resp. **s**_{k}) given **h**^{(a)}_{k} (resp. **h**^{(s)}_{k}) under the appearance RBM (resp. shape RBM).

Thus, for the mask *m*_{i} to be equal to *k*, we need that:

- the latent image of layer *k* matches the observation at pixel *i* (*ṽ*_{k,i} = *v*_{i}),
- *s*_{t,i} = 0 if π(*t*) < π(*k*),
- *s*_{k,i} = 1.

Since, in equation A.4, the distributions over *ṽ*_{k,i} and *s*_{k,i} are factorial and do not depend on the value of *m*_{i}, the resulting conditional distribution over *m*_{i} is also factorial.

This suggests the following Gibbs sampling scheme to infer all the hidden variables given an image **v**. Starting from a random mask **m**, we iterate over the following steps:

1. Given the mask **m**, sample the unobserved parts of the latent images using block Gibbs sampling (using properties 1 and 4).
2. Given the mask **m** and the ordering π, sample the unobserved parts of the latent shapes **s**_{1..K} using block Gibbs sampling (using properties 2 and 5).
3. Given the latent images, sample the appearance hidden units **h**^{(a)}_{1..K} (using property 1).
4. Given the latent shapes **s**_{1..K}, sample the shape hidden units **h**^{(s)}_{1..K} (using property 2).
5. Given the appearance hidden units **h**^{(a)}_{1..K}, the shape hidden units **h**^{(s)}_{1..K}, the image patch **v**, and the ordering π, sample a new mask **m** (using property 3).
6. Given the mask, infer the depth ordering as explained in section B.1 in appendix B.

This process is repeated until convergence of the mask. The sampling procedure directly implies that the mask may be different each time. However, in all our experiments, it consistently matched the structure of the shapes in the images.
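Steps 1 and 2 of this scheme both rely on block Gibbs sampling with clamped observations. A minimal sketch of this sub-step for a generic binary RBM (toy sizes and random weights, not the trained model) is:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def masked_gibbs_step(v, observed, W, b_vis, b_hid):
    """One block Gibbs sweep with partially observed visibles: sample hiddens
    from all visibles, then resample only the unobserved visibles, keeping
    the observed ones clamped to their values."""
    h = (rng.random(b_hid.shape) < sigmoid(v @ W + b_hid)).astype(float)
    v_new = (rng.random(b_vis.shape) < sigmoid(h @ W.T + b_vis)).astype(float)
    v_new[observed] = v[observed]        # observed pixels stay fixed
    return v_new, h

# Toy RBM: 6 visible and 4 hidden units; the first 3 visibles are observed.
n_vis, n_hid = 6, 4
W = rng.normal(scale=0.1, size=(n_vis, n_hid))
b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)
v = rng.integers(0, 2, size=n_vis).astype(float)
observed = np.array([True, True, True, False, False, False])
v0 = v.copy()

for _ in range(10):   # iterate; unobserved entries mix toward the conditional
    v, h = masked_gibbs_step(v, observed, W, b_vis, b_hid)
```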

### A.2. Learning.

Learning the parameters (such as *W*^{(s)} and θ) can be achieved using the above inference procedure. We need to compute the gradient of the log probability of an image patch **v** with respect to the parameters, that is, the gradient of log *p*(**v**). Since this cannot be computed exactly, we use an EM procedure (Dempster, Laird, & Rubin, 1977). We first derive a variational lower bound on log *p*(**v**) that holds for any distribution *Q*. The bound is tight when *Q* is the true posterior distribution. Since we cannot compute the sum over all masks, all latent images, all latent shapes, and all orderings, we replace it by a sample from the posterior distribution. The gradient direction we follow is therefore the gradient of the log joint probability evaluated at samples of the mask, the latent images, the latent shapes, and the ordering drawn from the posterior distribution (obtained using the method described in section A.1). Using more than one sample would reduce noise at the expense of extra computation. In our experiments, we used a single sample and found that learning worked well.
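In symbols, the bound and the sampled gradient direction described above take the standard variational form (the notation here is reconstructed to match the variables named in the text):

```latex
\log p(\mathbf{v})
\;\geq\;
\sum_{\mathbf{m},\,\tilde{\mathbf{v}},\,\mathbf{s},\,\pi}
  Q(\mathbf{m},\tilde{\mathbf{v}},\mathbf{s},\pi)\,
  \log\frac{p(\mathbf{v},\mathbf{m},\tilde{\mathbf{v}},\mathbf{s},\pi)}
           {Q(\mathbf{m},\tilde{\mathbf{v}},\mathbf{s},\pi)} ,
```

with equality when *Q* is the true posterior. Replacing the intractable expectation by a single posterior sample (**m**\*, **ṽ**\*, **s**\*, π\*) gives the followed direction ∂/∂θ log *p*(**v**, **m**\*, **ṽ**\*, **s**\*, π\*).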

## Appendix B: Depth Inference in the Occlusion Model

### B.1. Depth Inference for Image Patches.

To infer a depth ordering given a mask **m**, we consider each possible ordering of the *K* layers explicitly. The mask **m**, together with a particular occlusion order π, defines which shape pixels *s*_{k,i} are observed and which are unobserved. This is illustrated in Figure 8. The likelihood of a particular ordering π is then simply given as the likelihood of all the partially observed shapes **s**_{k} under the shape model (equation B.1). Here, *U*_{π,k}(**m**) is the set of all unobserved pixels for shape *k* given the mask **m** and the ordering π. The set of unobserved pixels *U*_{π,k}(**m**) will vary among different orderings π, and this is what drives the depth inference.

The sum over the unobserved shape pixels and the hidden states **h**^{(s)}_{k} cannot be computed exactly. We therefore replace the first sum by sampling the unobserved pixels {*s*_{k,i} : *i* ∈ *U*_{π,k}(**m**)} conditioned on the observed shape pixels for each *k* and π. Sampling can be done efficiently using several iterations of block Gibbs sampling. This results in “completed” shape images for which the unnormalized probability under the shape model can be computed efficiently,^{5} and we then obtain equation B.2. Note that the completed shape images are different for different π; for plausible orderings, the shape model will be able to “fill in” the unobserved pixels to give rise to a shape with a high likelihood, which in turn leads to a high probability of the respective ordering. It should further be noted that although considering each possible ordering π explicitly might seem expensive (the number of possible orderings is factorial in *K*), this remains feasible in practice for *K* ⩽ 4.

Given a depth ordering π and the latent states of the *K* shape RBMs {**h**^{(s)}_{k}}_{k=1…K}, the conditional probability of the mask is given by equation B.5. This probability can be combined with the signal from the appearance models as described in appendix A. The shape in the rear-most layer is largely determined by the preceding layers. For this reason, and as explained in appendix A, we treat the rear-most shape in a special manner. During depth inference, this means that we ignore the likelihood of the rear-most shape when computing the probability of a particular depth ordering π using equation B.4: the product there no longer includes a term for the rear-most layer. Similarly, equation B.5 becomes *P*(*m*_{i} = *t* | **h**^{(s)}_{1..K}, π) = ∏_{k:π(k)<π(t)} [1 − SHAPE(*s*_{k,i} = 1 | **h**^{(s)}_{k})] if *t* is the rear-most layer (i.e., if *t* = π^{−1}(*K*)), and the proportionality ∝ in equation B.5 becomes an equality for all other values of *t*.
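The enumeration over orderings can be sketched as follows. Here `score_completed_shape` is a hypothetical stand-in for the Gibbs completion and unnormalized log-probability evaluation described above, and the toy scorer exists only to make the example runnable:

```python
import numpy as np
from itertools import permutations

def depth_posterior(mask, score_completed_shape, K):
    """Enumerate all K! depth orderings and score each by the summed
    (unnormalized) log probability of its completed shapes, skipping
    the rear-most layer as described in the text."""
    orders = list(permutations(range(K)))
    log_scores = []
    for order in orders:                      # order[0] is the front-most layer
        total = sum(score_completed_shape(k, order, mask)
                    for k in order[:-1])      # ignore the rear-most shape
        log_scores.append(total)
    log_scores = np.array(log_scores)
    p = np.exp(log_scores - log_scores.max()) # softmax over orderings
    return orders, p / p.sum()

# Toy stand-in scorer that simply prefers layer 0 in front.
def toy_score(k, order, mask):
    return 1.0 if order[0] == 0 and k == 0 else 0.0

orders, probs = depth_posterior(mask=None, score_completed_shape=toy_score, K=3)
```

With *K* = 3 there are 3! = 6 orderings, and for *K* ⩽ 4 the enumeration stays cheap, as noted above.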

### B.2. Depth Inference for Images.

Depth inference at the image level, given a mask image, is performed by determining local depth orderings of overlapping patches. For this purpose, each patch is considered in turn and its depth relative to its neighbors is determined, keeping the ordering of its neighbors fixed. For instance, for the experiments with 16 × 16 pixel patches and *K* = 4, each patch model overlaps partially with eight neighboring patches (so that each pixel is covered by four competing patch models). Thus, for any given patch and a fixed ordering of its eight neighbors, nine different relative depths need to be considered. Each of these relative depths gives rise to a set of unobserved pixels, not only for the patch considered but also for its neighbors. The probability of the different relative depths can be computed in essentially the same way as described in section B.1 (approximating the sum over unobserved pixels by a sample and then efficiently computing the unnormalized log probability of the completed shape).

Note that for each neighboring patch, the set of unobserved pixels depends only on whether the patch under consideration is in front of or behind that neighbor; this considerably reduces the number of “shape completions” that need to be considered (two completions per neighboring patch and *N* + 1 for the central patch, where *N* is the number of neighbors).
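Enumerating the *N* + 1 relative depths amounts to inserting the central patch at every position of the fixed neighbor ordering; a minimal sketch (the patch identifiers are arbitrary labels):

```python
def candidate_orderings(neighbor_order, patch):
    """Insert the central patch at each of the N + 1 possible relative
    depths, keeping the neighbors' front-to-back order fixed."""
    n = len(neighbor_order)
    return [neighbor_order[:i] + [patch] + neighbor_order[i:]
            for i in range(n + 1)]

# Eight neighbors with a fixed ordering; "X" is the central patch.
cands = candidate_orderings(list("ABCDEFGH"), "X")
```

For eight neighbors this yields the nine relative depths mentioned in the text.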

In practice, given a mask, we perform one full sweep through the set of patch models, updating the relative depth (and the latent shapes) of each patch with respect to its neighbors once in a random order. Given the resulting depth ordering and the latent states of the shape models, the mask can then be updated as in the patch case (cf. equation B.5 above).

## Appendix C: Computing the Log Probability of Image Patches Under the Masked RBM

Due to the number of latent variables involved in the masked RBM, it is impossible to compute the exact log probability of natural image patches under this model. We may, however, derive a variational lower bound that would allow us to quantify the gains provided by the mask. Note, though, that all the techniques presented in this section require the use of AIS to yield an estimate, which is unusable in our setting (see section 5.3.1). Nevertheless, we believe them to be of interest if the limitations of AIS can be overcome.

### C.1. Uniform Mask Model.

We derive a variational lower bound on log *p*(**v**) that holds for any distribution *Q*, using Jensen's inequality. Let us first rewrite the sum inside the logarithm. The second term enforces the constraints described in equation A.1: all configurations of the latent images that do not match the observed patch for all *i* have zero probability. Therefore, we need only compute the sum over the configurations satisfying these constraints. Since these constraints are independent of the **h**^{(a)}_{k}, the distribution over the image given the latent images and the mask is fully concentrated on one point (given the latent images and the mask, there is only one valid image). This yields an expression in which *C* is the set of latent images **ṽ**_{k} matching the constraints imposed by **v** and **m** (as defined in equation A.1), that is, the set of all **ṽ**_{k} such that *ṽ*_{k,i} = *v*_{i} if *m*_{i} = *k*. Therefore, we need to sum the probabilities of all visible vectors with a subset of the units being fixed. This can be done using AIS (Salakhutdinov & Murray, 2008). Indeed, the conditional distribution over a subset of the visible units given the rest of the visible units is also an RBM (conditioning on some visible units only modifies the biases of the hidden layer). Given the strong constraint imposed by the observed pixels, the resulting RBM is likely to have a very peaked distribution, making its partition function easy to approximate.

Now that we know how to compute the joint probability for a given **m**, we need to find the optimal subset of masks to consider (that is, the distribution *Q*(**m**|**v**)).

Let *p*_{i} = *P*(**v**, **m**^{i}) for a certain mask configuration **m**^{i} and *q*_{i} = *Q*(**m**^{i}|**v**). We need to optimize the quantity *D* = ∑_{i} *q*_{i} log *p*_{i} − ∑_{i} *q*_{i} log *q*_{i} over the *q*_{i}'s, subject to the constraint ∑_{i} *q*_{i} = 1. The optimal solution is given by *q*_{i} = *p*_{i}/∑_{j} *p*_{j}, yielding *D* = log ∑_{i} *p*_{i}. We therefore need to find the **m**^{i}'s yielding the maximal *p*_{i}'s. Since *p*_{i} = *P*(**v**, **m**^{i}) = *P*(**v**)*P*(**m**^{i}|**v**), we need to find the modes of the posterior distribution of **m** given **v**. Due to the very constrained nature of the mask, the probability mass is heavily concentrated around a small number of modes, making it possible to achieve a tight bound on the log probability of an image patch with few masks.

A simpler explanation of this approximation is that we have replaced the quantity *p*(**v**) = ∑_{m}*p*(**v**, **m**) by a sum over a subset of the masks. It then becomes clear that this subset needs to include the masks **m** for which the quantity *p*(**v**, **m**) is maximized.
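This optimality can be checked numerically; the probabilities below are arbitrary stand-ins for the *p*_{i} = *P*(**v**, **m**^{i}) of a handful of high-probability masks:

```python
import numpy as np

# Arbitrary unnormalized joint probabilities p_i = P(v, m^i) for a few masks.
p = np.array([0.5, 0.3, 0.01, 0.002])

# Optimal weights q_i ∝ p_i maximize D = Σ q_i log p_i − Σ q_i log q_i ...
q = p / p.sum()
D = np.sum(q * np.log(p)) - np.sum(q * np.log(q))

# ... and the maximized bound equals log Σ_i p_i: the log of the probability
# mass captured by the selected masks (with all masks, exactly log P(v)).
assert np.isclose(D, np.log(p.sum()))

# Any other choice of q gives a smaller value (Jensen's inequality).
q_bad = np.ones(4) / 4
D_bad = np.sum(q_bad * np.log(p)) - np.sum(q_bad * np.log(q_bad))
```

This makes concrete why only the masks with the largest *p*_{i} matter: they dominate ∑_{i} *p*_{i} and hence the bound.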

To find the modes of *P*(**m**|**v**), we first do a few iterations (typically 20) of sampling as described in section A.1 and then replace the third sampling step by a maximization step for a few more iterations (typically 10). Maximization should not be performed from the beginning as this often results in finding a poor local optimum.

### C.2. Nonuniform Mask Model.

The same technique could in principle be applied with the nonuniform mask models. Unfortunately, AIS does not work well in these models due to the low variance of the conditional distributions, resulting in confident but wrong estimates of the partition function.

## Appendix D: Experimental Procedure

In this section, we describe in greater detail how the model was trained.

### D.1. Pretraining the Appearance RBM.

In our model, patches are of size 16 × 16, which means that the appearance RBM has 768 (16 × 16 × 3) visible units. We used a beta RBM of the form described in section 2.6, with 128 hidden units. In the first phase, the RBM was trained without using a mask and using stochastic approximation (Tieleman, 2008). We performed a few tens of thousands of parameter updates. Once the filters started converging, we continued training in the masked context (with a uniform mask model) with *K* = 2. The inferred masks were kept between epochs, and only one iteration of mask update was run for each patch. To compute this mask update, the unobserved pixels of the latent patches were initialized to the mean of the observed ones (for each color channel), and one up-down pass was performed to update these unobserved values. Once this was done, we sampled **h** given **v** to update the mask. Finally, we completed this training with *K* = 4.
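The initialization of unobserved latent pixels can be sketched as follows; the function name is illustrative, and the random patch and mask stand in for real data and an inferred mask:

```python
import numpy as np

rng = np.random.default_rng(4)

def init_unobserved(patch, mask, k):
    """Initialize the unobserved pixels of latent image k to the per-channel
    mean of its observed pixels (those where the mask selects layer k)."""
    latent = patch.copy()
    observed = mask == k
    for c in range(patch.shape[-1]):      # one mean per color channel
        latent[..., c][~observed] = patch[..., c][observed].mean()
    return latent

# Stand-ins for a real 16x16 RGB patch and an inferred two-layer mask.
patch = rng.random((16, 16, 3))
mask = rng.integers(0, 2, size=(16, 16))
latent0 = init_unobserved(patch, mask, k=0)
```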

This part of training does not critically depend on the number of parameter updates for each phase. If it is too low, the whole procedure will be slower, as inferring the mask is more expensive for higher values of *K*. If it is too high, the next phase will unlearn what has been previously learned, again slowing down the pretraining, but for a similar final result.

Also, using only one iteration to reinfer the mask proved to be enough to yield accurate results.

### D.2. Training the Occlusion-Based Shape Model for Image Patches.

The shape model for image patches was trained in two phases. In the first phase we pretrained the shape model directly on binary mask patches. For this purpose we inferred the mask (*K* = 3) for a large set of natural image patches (16 × 16 pixel RGB patches) using the uniform model as mask prior. For each patch, we performed 100 mask iterations. From each patch, we thus obtained three binary mask patches (due to the lack of a shape prior, many of these mask patches were very noisy). We then trained a binary RBM (384 hidden units, 256 visible units) directly on 95,000 binary mask patches. Training was performed with stochastic approximation (Tieleman, 2008) with a small learning rate of 0.0005, weight decay 0.0002, no momentum, and mini-batches of size 100. Training was performed for 10,000 epochs. The parameters of this binary RBM served as initialization for training of the shape model in the context of the full model. Pretraining took about 3.5 days using our Matlab implementation on a single-core machine.

In the second phase, we trained the shape model in the context of the full model (masked RBM with *K* = 3). The parameters of the shape RBM were initialized with the parameters obtained from phase 1. We used a training set of 21,000 RGB patches grouped into mini-batches of size 60. Learning was performed in alternation with inference. For each patch, we performed two iterations of full inference in the model (this includes the update of the appearance fantasies, the depth, the shape fantasies, and the mask) before updating the model parameters. Inference was performed as described in the main text. During inference in the mask model, we used 10 iterations of masked Gibbs sampling to update the shape fantasies. Before sampling, unobserved pixels in the shape fantasies were initialized with their state from the previous cycle. To prevent the model from hallucinating shapes into unused layers (which would slow down learning), we forced such layers to be in front of all visible layers and thus to be empty. Learning was performed using CD-10 with a learning rate of 0.001, a weight decay of 0.0002, and a momentum of 0.5. Training in the full model was performed for 550 epochs and took approximately two weeks using our unoptimized Matlab implementation on a single-core machine.

### D.3. Training the Occlusion-Based Shape Model for Natural Images.

As for image patches, the shape model for natural images (i.e., for the field of masked RBMs) was trained in two phases.

For pretraining we inferred the mask for natural images of size 80 × 80 pixels (RGB) extracted from the MSRC data set (see Figure 26) with a field of masked RBMs, using the uniform model as mask prior, and running 100 iterations of mask inference. From each image, we obtained 144 binary mask patches (using the superpixel layout described in the main text, each 80 × 80 pixel image is covered by four layers of 6 × 6 superpixels of size 16 × 16 pixels). We randomly selected 95,000 binary mask patches (excluding any mask patches from superpixels not fully overlapping with the images) and used those as training data for a binary RBM (256 visible units, 384 hidden units). Training was performed for 10,000 epochs using stochastic approximation, with a learning rate of 0.0005, no momentum, a weight decay of 0.0002, and mini-batches of size 100. The parameters of this binary RBM were used to initialize the shape model for training in the context of the full model.

We subsequently trained the occlusion-based shape model in the context of a field of masked RBMs, initializing the binary RBM for the shape model with the parameters obtained in phase 1. Our training set consisted of 1000 RGB images, and our “batches” consisted of individual images (144 superpixels are associated with each image). We alternated inference and the update of the model parameters. Two iterations of full inference (update of the appearance fantasies, shape fantasies, relative depth for all superpixels, as well as of the mask) were performed for each image before computing the gradient and updating the parameters. Inference in the mask model was performed in parallel for patches that did not share neighbors (i.e., for patches that were independent conditioned on the mask and the remaining nonoverlapping patches), and such sets of independent patches were treated sequentially but in a random order. Ten steps of masked Gibbs sampling were performed to update the shape fantasies. Completely unobserved superpixels were forced to be in front (i.e., their shape fantasies were required to be completely off) in order to prevent unconstrained hallucinations by the model. We used CD-15 for training, with a learning rate of 0.0025, weight decay of 0.0002, and momentum of 0.5. For the superpixel layout described in the main text, some superpixels overlap with the image boundaries: they are always only partially observed. To prevent the model from learning from largely unconstrained shapes (its own hallucinations), we did not include in the gradient the shapes of superpixels that overlapped with the image by less than 25%. Training was run for 100 iterations and took approximately three weeks using our unoptimized Matlab implementation on a single-core machine.
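The parallel update schedule can be sketched with a simple coloring of the patch grid. The stride of 3 below is one concrete way (an assumption, not stated in the text) to guarantee that patches in the same set neither overlap nor share a neighbor:

```python
import random

def independent_patch_sets(grid_h, grid_w, stride=3):
    """Partition grid positions into sets whose members are at Chebyshev
    distance >= stride from one another, so patches in a set neither
    overlap nor share a neighbor and can be updated in parallel."""
    groups = {}
    for i in range(grid_h):
        for j in range(grid_w):
            groups.setdefault((i % stride, j % stride), []).append((i, j))
    sets = list(groups.values())
    random.shuffle(sets)   # the sets themselves are visited in random order
    return sets

sets = independent_patch_sets(6, 6)
```

Each set is processed in parallel; the sets are then swept sequentially in a random order, matching the schedule described above.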

## Appendix E: Additional Analysis of the Model for Natural Image Patches

Inferring relative depth based on the information provided to the model in the experiment shown in Figure 15—using only very local shape information (from small, 16 × 16 image patches)—is a highly ambiguous problem in many cases, not just for our model but equally so for a human observer. Accordingly, the confidence of the model with respect to the relative depth of the regions in a patch can vary significantly between patches. For the examples shown in Figure 16, the model is rather confident with respect to the inferred depth for patches 1, 2, 4, and 5 but considerably less confident for patch 3 (inference is performed by sampling from the posterior distribution; Figure 15 shows the most likely depth ordering under the model for the five patches).

To evaluate the behavior of the model on a larger data set and demonstrate how learning of a shape prior can drive depth inference, we ran depth inference on 73 three-region mask patches, similar to patch 3 in Figure 15, extracted from the segmentation images provided with the Berkeley segmentation database.^{6} Depth inference was run for 8000 iterations, and the inferred depth after each iteration was recorded. For each patch, we determined which of the three mask regions was most frequently sampled to be the front-most region and which of the remaining two layers was most frequently chosen to be the middle layer. For the preferred middle-layer region, we then determined, for each patch, the average shape fantasy associated with that region being the middle layer. The results are shown in Figures 27a and 27b.

Although there is some variability, the model has a clear tendency to explain the mask patches in terms of extended shapes overlapping each other, in particular in terms of roughly horizontal or vertical shapes. This is consistent with the results shown in Figure 15 and a very plausible behavior given the training data, in which regions of such shapes occur frequently (note that these shapes also feature prominently in the samples shown in Figure 12). This behavior is also in rough agreement with the judgment of human observers. We showed the same 73 patches to five subjects and asked them to indicate, for each patch, which of the three regions they thought to be in front. The depth inferred by the model was consistent with the majority of human observers in 44 of 73 cases (60% of the patches), a considerably higher percentage than expected if the model selected the front-most region randomly (a random choice of the front-most region in this task would correspond to an agreement of 33%). At the same time, human subjects were in agreement with each other for only 32 of 73 patches (44%), highlighting the general difficulty and ambiguity of this task. Note that these results cannot be explained by a simple bias of the model to place smaller regions in the front: for 37 of 73 (51%) of the test patches, the region inferred to be in front by the model was in fact the largest of the three regions, while the smallest region was inferred to be in front in only 17 of 73 (23%) cases.

## Acknowledgments

We thank Chris Williams for his support and help and Iain Murray for insightful comments. N.H. is supported by an Engineering and Physical Sciences Research Council/Medical Research Council scholarship from the Neuroinformatics and Computational Neuroscience Doctoral Training Centre at the University of Edinburgh.

## Notes

^{1}

Throughout the article, we slightly abuse notation and use the variable *Z* for all partition functions, although they depend on the energy function.

^{2}

This choice of λ has the properties that log(λ) = −log(1 + λ) and log(1 + λ) − log(λ) ≈ 1. The first property ensures that the range of inputs to the hidden units is symmetric around 0, and the second ensures that log(**v** + λ), log(1 + λ − **v**), and **h** are approximately of the same amplitude when **v** lies in the interval [0, 1].
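For reference, the first stated property determines λ uniquely, and the second then follows:

```latex
\log\lambda = -\log(1+\lambda)
\;\Longleftrightarrow\; \lambda(1+\lambda) = 1
\;\Longleftrightarrow\; \lambda = \tfrac{\sqrt{5}-1}{2} \approx 0.618,
\qquad
\log(1+\lambda) - \log\lambda = -2\log\lambda \approx 0.96 \approx 1 .
```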

^{3}

Available online at http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2008/, http://research.microsoft.com/vision/cambridge/recognition/, and http://lear.inrialpes.fr/data, respectively.

^{4}

That is, we run a Gibbs chain in the appearance RBM for the same amount of time (5000 steps) sampling visibles and hidden units in each step. Only during the last step do we take the mean activation of the visible units given the binary hidden states (rather than a sample).

^{5}

It should be noted that equation B.2 is not an unbiased estimate of the unnormalized log probability. Overall, this estimator might give rise to a slight preference for depth orderings with fewer unobserved shape pixels. Nevertheless, in our experiments we found the estimator to work well. An unbiased estimator can also be constructed: let **s**_{O} denote the observed shape pixels and **s**_{U} the unobserved ones (for a given mask, depth ordering, and layer). In this notation, equation B.2 corresponds to evaluating the unnormalized log probability (after summing out **h**) at a single completion of **s**_{U} obtained through multiple iterations of Gibbs sampling; an unbiased estimator is obtained by suitably averaging over such completions of **s**_{U}.

^{6}

We used mask patches (i.e., patches for which the segmentation had already been provided) in order to separate depth inference from the segmentation problem. As explained in the main text the segmentation of a patch can be affected, for example, by matting or shading, which the appearance model does not currently handle well.

## References

## Author notes

Nicolas Le Roux and Nicolas Heess contributed equally to this article.