Network architecture. The model combines a VQ-VAE (the whole pipeline from left to right) with a PixelCNN. The encoder of the VQ-VAE, which consists of several convolutional layers, converts the input image into an array of feature vectors. Each feature vector is then assigned to the closest codebook vector, yielding an index matrix from which an array of the corresponding codebook vectors can be constructed. The decoder then reconstructs the original input from this quantized array. Selective attention is modeled by discarding consecutive entries in the lower part of the index matrix. The missing part is filled in by the PixelCNN in a recurrent process that performs semantic completion. The completion is plausible but not necessarily faithful; for example, some flowers in the background are missing here (white circle).
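The quantization step described above, in which each encoder feature vector is replaced by its nearest codebook entry, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the grid size, feature dimensionality, and codebook size are arbitrary placeholders.

```python
import numpy as np

# Hypothetical sizes: a 4x4 grid of 8-dimensional encoder features
# and a codebook of 16 vectors (all shapes are illustrative).
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 4, 8))   # encoder output for one image
codebook = rng.normal(size=(16, 8))     # learned codebook vectors

# Squared Euclidean distance from every feature vector
# to every codebook vector, shape (4, 4, 16).
dists = ((features[..., None, :] - codebook) ** 2).sum(axis=-1)

# Index matrix: each cell holds the id of its nearest codebook vector.
indices = dists.argmin(axis=-1)          # (4, 4) integer array

# Quantized array passed to the decoder:
# look up the selected codebook vectors.
quantized = codebook[indices]            # (4, 4, 8)
```

Masking the lower rows of `indices` and having an autoregressive model predict the missing entries corresponds to the selective-attention and semantic-completion steps in the figure.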