Spatial recognition is the ability to recognize the environment and generate goal-directed behaviors, such as navigation. Animals develop spatial recognition by integrating their subjective visual and motion experiences. We propose a model that consists of hierarchical recurrent neural networks with multiple time scales, fast, medium, and slow, that shows how spatial recognition can be obtained from only visual and motion experiences. For high-dimensional visual sequences, a convolutional neural network (CNN) was used to recognize and generate vision. Our model, which was applied to a simulated mobile agent, was trained to predict future visual and motion experiences and generate goal-directed sequences toward destinations that were indicated by photographs. Due to the training, our model was able to achieve spatial recognition, predict future experiences, and generate goal-directed sequences by integrating subjective visual and motion experiences. An internal state analysis showed that the internal states of slow recurrent neural networks were self-organized by the agent’s position. Furthermore, such representation of the internal states was obtained efficiently as the representation was independent of the prediction and generation processes.