Attention maintains task-relevant information in working memory (WM) in an active state. We investigated whether the attention-based maintenance of stimulus representations that were encoded through different modalities is flexibly controlled by top–down mechanisms that depend on behavioral goals. Distinct components of the ERP reflect the maintenance of tactile and visual information in WM. We concurrently measured tactile (tCDA) and visual contralateral delay activity (CDA) to track the attentional activation of tactile and visual information during multimodal WM. Participants simultaneously received tactile and visual sample stimuli on the left and right sides and memorized all stimuli on one task-relevant side. After 500 msec, an auditory retrocue indicated whether the sample set's tactile or visual content had to be compared with a subsequent test stimulus set. tCDA and CDA components that emerged simultaneously during the encoding phase were consistently reduced after retrocues that marked the corresponding (tactile or visual) modality as task-irrelevant. The absolute size of cue-dependent modulations was similar for the tCDA/CDA components and did not depend on the number of tactile/visual stimuli that were initially encoded into WM. Our results suggest that modality-specific maintenance processes in sensory brain regions are flexibly modulated by top–down influences that optimize multimodal WM representations for behavioral goals.