Recent findings on multisensory integration suggest that selective attention influences cross-sensory interactions from an early processing stage. Yet, in the field of emotional face–voice integration, the prevailing hypothesis is that facial and vocal emotional information interacts preattentively. Using ERPs, we investigated the influence of selective attention on the perception of congruent versus incongruent combinations of neutral and angry facial and vocal expressions. Attention was manipulated via four tasks that directed participants to (i) the facial expression, (ii) the vocal expression, (iii) the emotional congruence between the face and the voice, and (iv) the synchrony between lip movement and speech onset. Our results revealed early interactions between facial and vocal emotional expressions, manifested as modulations of the auditory N1 and P2 amplitudes by incongruent emotional face–voice combinations. Although audiovisual emotional interactions within the N1 time window were affected by the attentional manipulations, interactions within the P2 time window showed no such attentional influence. Thus, we propose that the N1 and P2 are functionally dissociated in terms of emotional face–voice processing and discuss evidence in support of the notion that the N1 is associated with cross-sensory prediction, whereas the P2 relates to the derivation of an emotional percept. Essentially, our findings put the integration of facial and vocal emotional expressions into a new perspective, one that regards the integration process as a composite of multiple, possibly independent subprocesses, some of which are susceptible to attentional modulation, whereas others may be influenced by additional factors.