Multisensory integration of visual mouth movements with auditory speech is known to offer substantial perceptual benefits, particularly under challenging (i.e., noisy) acoustic conditions. Previous work characterizing this process has found that ERPs to auditory speech are of shorter latency and smaller magnitude in the presence of visual speech. Using EEG, we sought to determine how these effects depend on the temporal relationship between the auditory and visual speech streams. We found that the reduction in ERP latency and the suppression of ERP amplitude are maximal when the visual signal precedes the auditory signal by a small interval, and that increasing asynchrony reduces these effects in a continuous manner. Time–frequency analysis revealed that these effects are found primarily in the theta (4–8 Hz) and alpha (8–12 Hz) bands, with a central topography consistent with auditory generators. Theta effects also persisted in the lower portion of the band (3.5–5 Hz), and this late activity was more frontally distributed. Importantly, the magnitude of these late theta oscillations not only differed with the temporal characteristics of the stimuli but also predicted participants' task performance. Our analysis thus reveals that suppression of single-trial brain responses by visual speech depends strongly on the temporal concordance of the auditory and visual inputs. It further suggests that processes in the lower theta band, which we propose as an index of incongruity processing, may reflect the neural correlates of individual differences in multisensory temporal perception.