This experiment investigates the integration of gesture and speech from a multisensory perspective. In a disambiguation paradigm, participants were presented with short videos of an actress uttering sentences like “She was impressed by the BALL, because the GAME/DANCE….” The ambiguous noun (BALL) was accompanied by an iconic gesture fragment containing information to disambiguate the noun toward its dominant or subordinate meaning. We used four different temporal alignments between noun and gesture fragment: the identification point (IP) of the noun was either prior to (+120 msec), synchronous with (0 msec), or lagging behind the end of the gesture fragment (−200 and −600 msec). ERPs triggered to the IP of the noun showed significant differences for the integration of dominant and subordinate gesture fragments in the −200, 0, and +120 msec conditions. The outcome of this integration was revealed at the target words. These data suggest a time window for direct semantic gesture–speech integration ranging from at least −200 up to +120 msec. Although the −600 msec condition did not show any signs of direct integration at the homonym, significant disambiguation was found at the target word. An explorative analysis suggested that gesture information was directly integrated at the verb, indicating that there are multiple positions in a sentence where direct gesture–speech integration takes place. Ultimately, this would implicate that in natural communication, where a gesture lasts for some time, several aspects of that gesture will have their specific and possibly distinct impact on different positions in an utterance.