Real-world scenes usually contain multiple objects that are cluttered yet contextually related. Here we used fMRI to investigate where and how multiple contextually related objects are represented in the human ventral visual pathway. Specifically, we measured responses in face-selective and body-selective regions along the ventral pathway when faces and bodies were presented either simultaneously or in isolation. We found that, in the posterior regions, the response to the face-body pair was a weighted average of the responses to faces and bodies presented in isolation. In contrast, the anterior regions encoded the face-body pair in a mutually facilitative fashion, with the response to the pair significantly higher than the response to either of its constituent objects. Furthermore, in the right fusiform face area, the face-body pair was represented as one inseparable object, possibly to reduce perceptual load and increase representational efficiency. Our study therefore suggests that the visual system uses a hierarchical representation scheme to process multiple objects in natural scenes: the averaging mechanism in the posterior regions helps retain information about individual objects in clutter, whereas the nonaveraging mechanism in the anterior regions uses contextual information to optimize the representation of multiple objects.
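The two response patterns described above can be illustrated with a minimal sketch. It is not the analysis used in the study: the function names, the least-squares fit, and the synthetic response values are all assumptions made for illustration, and the mean comparison is a simple check, not a significance test.

```python
import numpy as np

def fit_weighted_average(face, body, pair):
    """Least-squares weights (w_face, w_body) so that
    pair ~ w_face * face + w_body * body across voxels (hypothetical helper)."""
    design = np.column_stack([face, body])
    weights, *_ = np.linalg.lstsq(design, pair, rcond=None)
    return weights

def exceeds_constituents(face, body, pair):
    """True if the mean pair response is higher than the mean response
    to both constituents presented in isolation (hypothetical helper)."""
    return bool(pair.mean() > max(face.mean(), body.mean()))

# Synthetic per-voxel responses, for illustration only (not measured data).
rng = np.random.default_rng(0)
face = rng.normal(1.0, 0.3, size=200)
body = rng.normal(0.6, 0.3, size=200)

# "Posterior-like" pattern: the pair is a weighted average of the parts.
pair_average = 0.6 * face + 0.4 * body
weights = fit_weighted_average(face, body, pair_average)

# "Anterior-like" pattern: the pair exceeds both constituents.
pair_facilitated = face + body
```

Under these assumptions, `fit_weighted_average` recovers the weights used to build `pair_average`, and `exceeds_constituents` is true only for `pair_facilitated`.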