To exhibit social intelligence, animals have to recognize whom they are communicating with. One way to make this inference is to select among internal generative models of each conspecific who may be encountered. However, these models also have to be learned via some form of Bayesian belief updating. This induces an interesting problem: When receiving sensory input generated by a particular conspecific, how does an animal know which internal model to update? We consider a theoretical and neurobiologically plausible solution that enables inference and learning of the processes that generate sensory inputs (e.g., listening and understanding) and reproduction of those inputs (e.g., talking or singing), under multiple generative models. This is based on recent advances in theoretical neurobiology—namely, active inference and post hoc (online) Bayesian model selection. In brief, this scheme fits sensory inputs under each generative model. Model parameters are then updated in proportion to the probability that each model could have generated the input (i.e., model evidence). The proposed scheme is demonstrated using a series of (real zebra finch) birdsongs, where each song is generated by several different birds. The scheme is implemented using physiologically plausible models of birdsong production. We show that generalized Bayesian filtering, combined with model selection, leads to successful learning across generative models, each possessing different parameters. These results highlight the utility of having multiple internal models when making inferences in social environments with multiple sources of sensory information.
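As a minimal illustration of the evidence-weighted update described above, the following toy sketch assumes two hypothetical one-dimensional Gaussian "internal models" that differ only in their mean. It is not the generalized Bayesian filtering scheme or the birdsong production model used in this work. The likelihood of an input under each model's current point estimate stands in for the model evidence, and every name (`model_posteriors`, `update_means`, `run_demo`) and parameter value (learning rate, variance, priors) is an assumption made for illustration.

```python
import math
import random


def log_gaussian(x, mean, var):
    """Log density of a 1-D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def model_posteriors(x, means, var, log_priors):
    """Posterior probability of each model given input x.

    Assumption: the likelihood at the current point estimate is used as a
    stand-in for the model evidence (no marginalization over parameters).
    """
    scores = [log_gaussian(x, m, var) + lp for m, lp in zip(means, log_priors)]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def update_means(x, means, var, log_priors, rate=0.1):
    """Move each model's mean toward x, scaled by that model's posterior."""
    post = model_posteriors(x, means, var, log_priors)
    new_means = [m + rate * p * (x - m) for m, p in zip(means, post)]
    return new_means, post


def run_demo(seed=0, steps=2000):
    """Learn two internal models from inputs produced by two hidden sources."""
    rng = random.Random(seed)
    true_means = [-2.0, 2.0]   # two hypothetical "birds" (illustrative values)
    means = [-0.5, 0.5]        # initial internal models
    var = 0.25
    log_priors = [math.log(0.5)] * 2
    for _ in range(steps):
        source = rng.randrange(2)  # the learner never observes this label
        x = rng.gauss(true_means[source], math.sqrt(var))
        means, _ = update_means(x, means, var, log_priors)
    return means


if __name__ == "__main__":
    print(run_demo())
```

Because each update is scaled by the posterior probability of the corresponding model, an input that is well explained by one model leaves the other models almost unchanged. In this toy setting, that is what lets each model specialize on one source without being told which source produced each input.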