The formation of cross-modal object representations was investigated using a novel paradigm that was previously successful in establishing unimodal visual category learning in monkeys and humans. The stimulus set consisted of six categories of bird shapes and sounds that were morphed to create different exemplars of each category. Subjects learned new cross-modal bird categories using a one-back task. Over time, the subjects became faster and more accurate in categorizing the birds. After 3 days of training, subjects were scanned while passively viewing and listening to trained and novel bird types. Stimulus blocks consisted of bird sounds only, bird pictures only, matching pictures and sounds (cross-modal congruent), and mismatching pictures and sounds (cross-modal incongruent). fMRI data showed unimodal and cross-modal training effects in the right fusiform gyrus. In addition, the left STS showed cross-modal training effects in the absence of unimodal training effects. Importantly, for both the right fusiform gyrus and the left STS, the newly formed cross-modal representation was specific for the trained categories. Learning did not generalize to incongruent combinations of learned sounds and shapes; their response did not differ from the response to novel cross-modal bird types. Moreover, responses were larger for congruent than for incongruent cross-modal bird types in the right fusiform gyrus and STS, providing further evidence that categorization training induced the formation of meaningful cross-modal object representations.