This letter proposes a novel predictive-coding-type neural network model, the predictive multiple spatiotemporal scales recurrent neural network (P-MSTRNN). The P-MSTRNN learns to predict visually perceived cyclic whole-body human movement patterns by exploiting multiscale spatiotemporal constraints on its dynamics, imposed by giving each layer a different receptive field size and a different time constant. After learning, the network can imitate target movement patterns by inferring, or recognizing, the corresponding intentions through regression of the prediction error. Results show that the network can develop a functional hierarchy, with a different type of dynamic structure forming at each layer. The letter also examines how model performance in pattern generation and in predictive imitation varies with the stage of learning. The number of limit cycle attractors corresponding to target movement patterns increases as learning proceeds, yet transient dynamics that develop early in the learning process already perform the pattern generation and predictive imitation tasks successfully. The letter concludes that exploiting transient dynamics facilitates successful task performance during early learning periods.
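The role of layer-specific time constants can be illustrated with a minimal sketch. It assumes a generic leaky-integrator update of the kind commonly used in multiple-timescale recurrent networks; the abstract does not give the P-MSTRNN equations, so the update rule, the function names (`leaky_update`, `simulate`), and the time constant values below are all hypothetical. The sketch shows only that a unit with a small time constant follows its input quickly, whereas a unit with a large time constant changes slowly.

```python
import math


def leaky_update(u_prev, drive, tau):
    """One leaky-integrator step (assumed form, not taken from the letter).

    A larger time constant tau keeps more of the previous internal state,
    so the unit changes more slowly.
    """
    return (1.0 - 1.0 / tau) * u_prev + (1.0 / tau) * drive


def simulate(tau, drive=1.0, steps=10):
    """Drive a single unit with a constant input and return its activations."""
    u = 0.0
    trace = []
    for _ in range(steps):
        u = leaky_update(u, drive, tau)
        trace.append(math.tanh(u))  # tanh activation, also an assumption
    return trace


# Hypothetical time constants for a fast (lower) and a slow (higher) layer.
fast = simulate(tau=2.0)
slow = simulate(tau=50.0)

print(f"fast unit after {len(fast)} steps: {fast[-1]:.3f}")
print(f"slow unit after {len(slow)} steps: {slow[-1]:.3f}")
```

Under these assumptions, the fast unit nearly saturates within ten steps while the slow unit has barely moved. This is one way a layered network could come to represent detailed movement in some layers and slowly varying structure in others.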