Figure 1:
Network architecture. The model combines a VQ-VAE (the whole pipeline from left to right) and a PixelCNN. The encoder of the VQ-VAE, which consists of several convolutional layers, converts the input image into an array z_e of w×h d-dimensional feature vectors. Each feature vector is then assigned to the closest codebook vector e_l to create the index matrix z_x containing the indices l, from which an array z_q of the w×h corresponding d-dimensional codebook vectors e_l can be constructed. The decoder then reconstructs the original input from the quantized array z_q. Selective attention is modeled by discarding consecutive entries in the lower part of the index matrix (transition from t1 to t2). The missing part is filled in by the PixelCNN in a recurrent process that performs semantic completion (transition from t2 to t3). The completion is plausible but not necessarily faithful; for example, some flowers in the background are missing here (white circle).
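
The quantization, masking, and completion steps the caption walks through can be summarized in a few lines of NumPy. This is a minimal sketch only: the grid size, codebook size, random codebook, and the uniform draw standing in for a trained PixelCNN are illustrative assumptions, not the paper's implementation.

    import numpy as np

    # Minimal sketch of the pipeline in the caption; all shapes and the
    # uniform stand-in for the PixelCNN are illustrative assumptions.
    h, w, d, K = 8, 8, 64, 512          # latent grid size, vector dim, codebook size
    rng = np.random.default_rng(0)

    z_e = rng.normal(size=(h, w, d))    # encoder output: w×h d-dimensional vectors
    codebook = rng.normal(size=(K, d))  # the K codebook vectors e_l (learned in practice)

    # Quantization: assign each feature vector to its closest codebook entry.
    dists = np.linalg.norm(z_e[:, :, None, :] - codebook[None, None], axis=-1)
    z_x = dists.argmin(axis=-1)         # (h, w) index matrix of indices l
    z_q = codebook[z_x]                 # (h, w, d) quantized array fed to the decoder

    # Selective attention (t1 -> t2): discard the lower part of the index matrix.
    mask_row = h // 2
    z_x[mask_row:, :] = -1              # -1 marks a missing entry

    # Semantic completion (t2 -> t3): refill the gap one index at a time in
    # raster order, as a PixelCNN would; a uniform draw stands in for the
    # trained model's conditional distribution to keep the sketch self-contained.
    for i in range(mask_row, h):
        for j in range(w):
            z_x[i, j] = rng.integers(K)  # placeholder for a PixelCNN sample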
