Incongruencies between auditory and visual signals negatively affect human performance and cause selective activation in neuroimaging studies; therefore, they are increasingly used to probe audiovisual integration mechanisms. An open question is whether the increased BOLD response reflects computational demands in integrating mismatching low-level signals or reflects simultaneous unimodal conceptual representations of the competing signals. To address this question, we explore the effect of semantic congruency within and across three signal categories (speech, body actions, and unfamiliar patterns) for signals with matched low-level statistics. In a localizer experiment, unimodal (auditory and visual) and bimodal stimuli were used to identify ROIs. All three semantic categories cause overlapping activation patterns. We find no evidence for areas that show greater BOLD response to bimodal stimuli than predicted by the sum of the two unimodal responses. Conjunction analysis of the unimodal responses in each category identifies a network including posterior temporal, inferior frontal, and premotor areas. Semantic congruency effects are measured in the main experiment. We find that incongruent combinations of two meaningful stimuli (speech and body actions) but not combinations of meaningful with meaningless stimuli lead to increased BOLD response in the posterior STS (pSTS) bilaterally, the left SMA, the inferior frontal gyrus, the inferior parietal lobule, and the anterior insula. These interactions are not seen in premotor areas. Our findings are consistent with the hypothesis that pSTS and frontal areas form a recognition network that combines sensory categorical representations (in pSTS) with action hypothesis generation in inferior frontal gyrus/premotor areas. We argue that the same neural networks process speech and body actions.