Capsule networks (see Hinton et al., 2018) aim to encode knowledge of and reason about the relationship between an object and its parts. In this letter, we specify a generative model for such data and derive a variational algorithm for inferring the transformation of each model object in a scene and the assignments of observed parts to the objects. We derive a learning algorithm for the object models, based on variational expectation maximization (Jordan et al., 1999). We also study an alternative inference algorithm based on the RANSAC method of Fischler and Bolles (1981). We apply these inference methods to data generated from multiple geometric objects such as squares and triangles (“constellations”) and to data from a parts-based model of faces. Recent work by Kosiorek et al. (2019) has used amortized inference via stacked capsule autoencoders to tackle this problem; where comparisons are possible (on the constellations data), our methods significantly outperform theirs.
