Figure 1: Showing the bottom-up, top-down, and same-level interactions among three adjacent levels of the proposed GLOM architecture for a single column. The blue and red arrows representing bottom-up and top-down interactions are implemented by two different neural networks that have several hidden layers. These networks can differ between pairs of levels, but they are shared across columns and across time steps. The top-down net should probably use sinusoidal units (Sitzmann, Martel, Bergman, Lindell, & Wetzstein, 2020). For a static image, the green arrows could simply be scaled residual connections that implement temporal smoothing of the embedding at each level. For video, the green connections could be neural networks that learn temporal dynamics based on several previous states of the capsule. Interactions between the embedding vectors at the same level in different columns are implemented by a nonadaptive, attention-weighted, local smoother, which is not shown.
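To make the described update concrete, here is a minimal sketch of one update step at a single level, combining the four contributions the caption names: a bottom-up net from the level below, a top-down net with sinusoidal units from the level above, a non-adaptive attention-weighted smoother over columns at the same level, and a scaled residual of the previous state (the green arrows, static-image case). Everything specific here is an assumption, not taken from the figure: the one-hidden-layer nets, the embedding dimension `D`, the softmax dot-product attention, and the equal averaging of the three inputs are all hypothetical placeholders for illustration.

```python
import numpy as np

D = 64       # embedding dimension at every level (assumed)
N_COLS = 16  # number of columns, one per image patch (assumed)

rng = np.random.default_rng(0)

def mlp(x, w1, w2, act):
    """One-hidden-layer net standing in for the multi-layer nets in the figure."""
    return act(x @ w1) @ w2

# Weights are shared across columns and time steps, but distinct for each
# pair of adjacent levels; random initialization here, untrained.
w_bu1, w_bu2 = rng.normal(0, 0.1, (D, D)), rng.normal(0, 0.1, (D, D))  # blue arrows
w_td1, w_td2 = rng.normal(0, 0.1, (D, D)), rng.normal(0, 0.1, (D, D))  # red arrows

def attention_smoother(same_level, beta=1.0):
    """Non-adaptive, attention-weighted local smoother over columns at one
    level. Weights here are a softmax of scaled dot products between column
    embeddings (a plausible choice, assumed rather than specified)."""
    logits = beta * same_level @ same_level.T          # (N_COLS, N_COLS)
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    return w @ same_level

def update_level(below, same, above, alpha=0.5):
    """New embeddings at level L for all columns: a scaled residual of the
    previous state (green arrows, static-image case) blended with the mean of
    the bottom-up, top-down, and same-level contributions."""
    bottom_up = mlp(below, w_bu1, w_bu2, np.tanh)      # from level L-1
    top_down  = mlp(above, w_td1, w_td2, np.sin)       # from level L+1, sinusoidal units
    lateral   = attention_smoother(same)               # same level, other columns
    return alpha * same + (1 - alpha) * (bottom_up + top_down + lateral) / 3.0

below = rng.normal(size=(N_COLS, D))
same  = rng.normal(size=(N_COLS, D))
above = rng.normal(size=(N_COLS, D))
print(update_level(below, same, above).shape)  # (16, 64)
```

In a full model this step would be iterated across all levels and time steps; for video, `alpha * same` would be replaced by a learned network over several previous states, as the caption suggests.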
