Network architecture. The model combines a VQ-VAE (the whole pipeline from left to right) with a PixelCNN. The encoder of the VQ-VAE, which consists of several convolutional layers, converts the input image into an array of feature vectors. Each feature vector is then assigned to the closest codebook vector, yielding an index matrix from which an array of the corresponding codebook vectors can be constructed. The decoder then reconstructs the original input from this quantized array. Selective attention is modeled by discarding consecutive entries in the lower part of the index matrix. The missing part is filled in by the PixelCNN in a recurrent process that performs semantic completion. The completion is plausible but not necessarily faithful; for example, some flowers in the background are missing here (white circle).
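The quantization step described above, in which each encoder feature vector is replaced by its nearest codebook entry, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the grid size, feature dimensionality, and codebook size are arbitrary placeholders.

```python
import numpy as np

# Hypothetical sizes: a 4x4 grid of 8-dimensional encoder features
# and a codebook of 16 vectors (all shapes are illustrative).
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 4, 8))   # encoder output for one image
codebook = rng.normal(size=(16, 8))     # learned codebook vectors

# Squared Euclidean distance from every feature vector
# to every codebook vector, shape (4, 4, 16).
dists = ((features[..., None, :] - codebook) ** 2).sum(axis=-1)

# Index matrix: each cell holds the id of its nearest codebook vector.
indices = dists.argmin(axis=-1)          # (4, 4) integer array

# Quantized array passed to the decoder:
# look up the selected codebook vectors.
quantized = codebook[indices]            # (4, 4, 8)
```

Masking the lower rows of `indices` and having an autoregressive model predict the missing entries corresponds to the selective-attention and semantic-completion steps in the figure.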