The present study examined the modality specificity and spatio-temporal dynamics of “what” and “where” preparatory processes in anticipation of auditory and visual targets, using ERPs and a cue–target paradigm. Participants were presented with an auditory (Experiment 1) or a visual (Experiment 2) cue that signaled them to attend to the identity or location of an upcoming auditory or visual target. In both experiments, participants responded faster in the location conditions than in the identity conditions. Multivariate spatio-temporal partial least squares (ST-PLS) analysis of the scalp-recorded data revealed supramodal “where” preparatory processes in the 300–600 msec and 600–1200 msec intervals at central and posterior parietal electrode sites in anticipation of both auditory and visual targets. Furthermore, preparation for pitch processing was captured at modality-specific temporal regions between 300 and 700 msec, and preparation for shape processing was detected at occipital electrode sites between 700 and 1150 msec. These spatio-temporal patterns were replicated when a visual cue signaled the upcoming response (Experiment 2). Preparation for pitch or shape processing exhibited modality-dependent spatio-temporal patterns, whereas preparation for target localization was associated with larger amplitude deflections at multimodal, centro-parietal sites preceding both auditory and visual targets. Using a novel paradigm, the study supports the notion of a division of labor in the auditory and visual pathways when auditory or visual cues signal preparation for an identity or location response to upcoming auditory or visual targets.