Generalization by learning is an essential cognitive competency for humans. For example, we can manipulate even unfamiliar objects and can generate mental images before enacting a preplan. How is this possible? Our study investigated this problem by revisiting our previous study (Jung, Matsumoto, & Tani, 2019), which examined the problem of vision-based, goal-directed planning by robots performing a task of block stacking. By extending the previous study, our work introduces a large network comprising dynamically interacting submodules, including visual working memory (VWMs), a visual attention module, and an executive network. The executive network predicts motor signals, visual images, and various controls for attention, as well as masking of visual information. The most significant difference from the previous study is that our current model contains an additional VWM. The entire network is trained by using predictive coding and an optimal visuomotor plan to achieve a given goal state is inferred using active inference. Results indicate that our current model performs significantly better than that used in Jung et al. (2019), especially when manipulating blocks with unlearned colors and textures. Simulation results revealed that the observed generalization was achieved because content-agnostic information processing developed through synergistic interaction between the second VWM and other modules during the course of learning, in which memorizing image contents and transforming them are dissociated. This letter verifies this claim by conducting both qualitative and quantitative analysis of simulation results.