## Abstract

To exhibit social intelligence, animals have to recognize whom they are communicating with. One way to make this inference is to select among internal generative models of each conspecific who may be encountered. However, these models also have to be learned via some form of Bayesian belief updating. This induces an interesting problem: When receiving sensory input generated by a particular conspecific, how does an animal know which internal model to update? We consider a theoretical and neurobiologically plausible solution that enables inference and learning of the processes that generate sensory inputs (e.g., listening and understanding) and reproduction of those inputs (e.g., talking or singing), under multiple generative models. This is based on recent advances in theoretical neurobiology—namely, active inference and post hoc (online) Bayesian model selection. In brief, this scheme fits sensory inputs under each generative model. Model parameters are then updated in proportion to the probability that each model could have generated the input (i.e., model evidence). The proposed scheme is demonstrated using a series of (real zebra finch) birdsongs, where each song is generated by several different birds. The scheme is implemented using physiologically plausible models of birdsong production. We show that generalized Bayesian filtering, combined with model selection, leads to successful learning across generative models, each possessing different parameters. These results highlight the utility of having multiple internal models when making inferences in social environments with multiple sources of sensory information.

## 1 Introduction

One of the most notable abilities of biological creatures is their capacity to adapt their behavior to different contexts and environments (i.e., cognitive flexibility) (Mante, Sussillo, Shenoy, & Newsome, 2013; Dajani & Uddin, 2015) through learning. People can learn to call on various responses depending on the situation—for example, independently moving the right and left hands when playing an instrument, or speaking several different languages. Such multitasking abilities are particularly crucial in communication with several people, each of whom demands subtly different forms of interaction (Taborsky & Oliveira, 2012; Parkinson & Wheatley, 2015). In this kind of situation, one needs to infer who has generated a heard voice—and infer that person's mental state—to respond in an appropriate manner. This is a requirement for social intelligence, which usually denotes the ability of organisms to correctly recognize themselves and others and to behave adequately in a social environment with several conspecifics. Understanding how this is achieved is an important challenge. Experimental studies of primates have shown that the volumes of certain brain structures (e.g., the hippocampus) are correlated with the performance of cognitive and social tasks (Reader & Laland, 2002; Shultz & Dunbar, 2010) and that the ability to infer another's intentions increases with brain volume (Devaine et al., 2017). This speaks to a putative strategy for making inferences about several different conspecifics with a plurality of internal models, each associated with a particular member of the community or econiche.

This ability of biological creatures contrasts with current notions of artificial general intelligence. The development of a synthetic system as flexible as the biological brain remains a challenge (LeCun, Bengio, & Hinton, 2015; Hassabis, Kumaran, Summerfield, & Botvinick, 2017). Here, we tried to understand how the brain might entertain distinct generative models in a context-sensitive setting. To do this, we focus on a social task, communication through birdsong, in which the conversational partner may change. This induces the dual task of inferring the identity of a conspecific and learning about that conspecific at the same time. Crucially, this learning should be specific to each partner.

To address this problem, we appeal to generalized Bayesian filtering, a corollary of the free-energy principle (Friston, 2008, 2010). We illustrate the behavior of the proposed scheme using artificial birdsongs and natural zebra finch songs. We consider a synthetic bird (student), whose generative model is based on a physiologically plausible model of birdsong production, and present the student bird with a song generated by one of several (teacher) conspecifics. During the exchange, the student bird performs Bayesian model selection (Schwarz, 1978) to decide which teacher generated the heard song. Having accumulated sensory evidence under all hypotheses or models, the parameters of the generative models are updated in proportion to the evidence for each competing model.

We show that over successive interactions, our student is able to learn the individual characteristics of multiple teachers and recognize them with increasing confidence. Finally, possible neurobiological implementations of the proposed scheme are discussed. Despite our emphasis on birdsong, our interest (and expertise) is not in the theoretical neurobiology of songbirds. We use birdsong as a vehicle to introduce a computational perspective on perceptual categorization and learning in communication (of any sort) that inherits from Bayesian model selection. We hope the scheme we showcase may be useful in areas like voice recognition and in other domains of social exchange.

### 1.1 Concept of Modeling

In formulating the generative model, we have to contend with a mixture of random variables in continuous time (i.e., latent states of each singing bird) and categorical variables (i.e., the identity of the bird) that constitute a perceptual categorization problem. In short, the listening bird (i.e., student) has to make inferences in terms of beliefs over both continuous and discrete random variables in order to recognize who is singing and *what* they are singing. In a general setting, this would call on mixed generative models with a mixture of continuous and discrete states of the sort considered in Roweis and Ghahramani (1999) and, more recently, Friston, Parr, and de Vries (2017). A complementary way of combining categorical and continuous latent states is to work within a continuous generative model that includes switching variables that have a discrete (i.e., categorical) probability distribution, with an accompanying conjugate prior such as the Dirichlet distribution. The most common example of this would be a gaussian mixture model (see Roweis & Ghahramani, 1999, for details).

Heuristically, this means the generative model can be constructed in one of two ways. We can select a singing bird to generate a song, leading to a hierarchical model with a categorical latent variable at the top and a continuous model generating outcomes. Alternatively, we could generate continuous outcomes from all possible birds and then select one to constitute the actual stimulus. In the second (switching variable) case, the categorical variable plays the role of a switch, basically switching from one possible sensory “channel” to another.

In terms of model inversion and belief propagation, both generative models are isomorphic and lead to the same update equations via minimization of variational free energy. However, the way in which the generative models play out in terms of requisite message passing can have different forms. We could use a generative model with a single bird and try to infer which bird was singing (and, implicitly, the parameters of its generative process). Alternatively, the student may entertain all possible teachers “in mind” and then select the best hypothesis or explanation for the sensory input. This would correspond to the second form of generative model, in which the dynamics are conditioned on the categorical variable (i.e., a student bird predicts songs under all possible hypotheses) and the best explanation is then selected. In this sense, the expectation about the identity of the singing bird acquires two complementary interpretations. In the first formulation, it is the posterior expectation about the bird that has been selected to generate the song. In the second interpretation, it becomes an expectation about the switching variable. This means the student (i.e., listening bird) effectively composes a Bayesian model average over all hypotheses (i.e., singing birds) entertained in providing posterior predictions of the song.

We can appeal to both forms when interpreting the results that follow. However, the second interpretation has some interesting aspects from a cognitive neuroscience perspective. In essence, the gating or selection of top-down predictions complements the gating or selection of ascending prediction errors usually associated with attention (Luck, Woodman, & Vogel, 2000; Green & Bavelier, 2003; Awh, Belopolsky, & Theeuwes, 2012). In other words, selecting (switching to) the best explanation from available hypotheses, when predicting sensory input, becomes a covert form of (mental) action. Examples of such attentional switching can be found in bistable visual illusions (Eagleman, 2001). This is in the sense that descending predictions are contextualized and selected on the basis of higher-order beliefs (i.e., expectations) about the most plausible hypothesis or context in play. The unique aspect of this gating rests on the fact that there are a discrete (categorical) number of competing hypotheses that are mutually exclusive. This is reminiscent of equivalent architectures in motor control (e.g., the MOSAIC architecture) and related mixture of experts (Roweis & Ghahramani, 1999; Lee, Lewicki, & Sejnowski, 2000; Haruno, Wolpert, & Kawato, 2003). In our case, a simple perceptual categorization paradigm mandates a selection among different possible categories and enforces a form of mental action through optimization of an implicit switching variable.

In what follows, we present the results of perceptual learning and inference using this form of model selection or structure learning, predicated on an ensemble or repertoire of generative models (using synthetic birds and real birdsongs). Using this setup, we show that Bayesian model averaging provides a plausible account of how multiple hypotheses can be combined to predict the sensorium, while Bayesian model selection enables perceptual categorization and selective learning. Crucially, all of these unsupervised processes conform to the same normative principle: the minimization of (the path integral of) a variational free energy bound on model evidence.

## 2 Results

### 2.1 Multiple Generative Models and Attentional Switching

Organisms continuously infer the causes of their sensations (unconscious inference and the Bayesian brain hypothesis: Helmholtz, 1925; Knill & Pouget, 2004) and thereby predict what will happen in the immediate future (e.g., predictive coding; Rao & Ballard, 1999; Friston, 2005). This sort of perceptual inference rests on an internal generative model that expresses beliefs about how sensory inputs are generated, where perceptual inference is formulated as the minimization of surprise or prediction errors. These models typically assume that sensations are generated by latent or hidden (unobservable) causes in the external world. Such causes may themselves be generated by other causes in a hierarchical manner. In the setting of continuous state-space models, hierarchical Bayesian filtering can be used to perform inference under a hierarchical generative model (Friston, 2008; Friston, Trujillo-Barreto, & Daunizeau, 2008). This filtering uses variational message passing to furnish approximate posterior probability (recognition) densities over the hidden states. In what follows, we describe the process generating sensory inputs. We assume that the same generative structure is used by the brain as an internal generative model; however, the brain needs to learn the underlying model parameters to infer the values of hidden states (Dayan, Hinton, Neal, & Zemel, 1995; Friston, Kilner, & Harrison, 2006; George & Hawkins, 2009). A detailed description of the generative models used in this study is provided in section 4.
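To fix ideas, the following toy sketch illustrates the core of this kind of filtering: a gradient descent of a posterior expectation on precision-weighted prediction errors under a one-dimensional linear generative model. The mapping `g`, the precisions, and the learning rate are illustrative assumptions, not the generalized filtering scheme detailed in section 4.

```python
def filter_step(mu, s, g=2.0, lr=0.1, prior_mu=0.0, pi_s=1.0, pi_x=1.0):
    """One gradient step on variational free energy F with respect to mu,
    for the toy model s = g * x + noise, with x ~ N(prior_mu, 1/pi_x)."""
    eps_s = s - g * mu            # sensory prediction error
    eps_x = mu - prior_mu         # prior prediction error
    dF_dmu = -g * pi_s * eps_s + pi_x * eps_x
    return mu - lr * dF_dmu

mu = 0.0
for _ in range(200):              # iterate to convergence on a static input
    mu = filter_step(mu, s=4.0)
# the posterior expectation balances the likelihood against the prior:
# mu -> g * pi_s * s / (g**2 * pi_s + pi_x) = 1.6
```

The fixed point trades off sensory and prior precisions; increasing `pi_x`, for instance, pulls the estimate toward the prior expectation.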

Let us consider a generative model of birdsong. In brief, this model is a deep generative model with two levels, both levels based on attractor dynamics in the form of neural circuits (see section 4; see also Kiebel, Daunizeau, & Friston, 2008, and Friston & Kiebel, 2009, for details). The goal of an agent (student bird) is to learn about and categorize several different birdsongs, and hence reproduce particular songs depending on the currently heard song. Crucially, the state of a (slow) higher attractor, associated with neuronal dynamics in the high vocal center (HVC) in the songbird brain, provides a control parameter for a (fast) attractor at a lower level in the auditory hierarchy. The hidden or latent states of the lower attractor, associated with the robust nucleus of the archistriatum (RA), then drive fluctuations in the amplitude and frequency of birdsong. (For related songbird studies, see Laje & Mindlin, 2002; Long & Fee, 2008; Amador, Perl, Mindlin, & Margoliash, 2013; and Calabrese & Woolley, 2015.)
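The two-level architecture can be caricatured as a slow attractor whose state sets a control parameter of a fast attractor, which in turn generates fluctuations that drive the song. The sketch below uses Lorenz attractors in the spirit of Kiebel et al. (2008); the time constants and the coupling `24.0 + 0.2 * x_slow[0]` are illustrative assumptions, not the circuits or parameters used in this study.

```python
import numpy as np

def lorenz(x, rho, sigma=10.0, beta=8.0 / 3.0):
    """Lorenz vector field; rho serves as the (top-down) control parameter."""
    return np.array([sigma * (x[1] - x[0]),
                     rho * x[0] - x[1] - x[0] * x[2],
                     x[0] * x[1] - beta * x[2]])

def simulate(T=2000, dt=0.01, tau_slow=16.0):
    """A slow (HVC-like) attractor modulates a fast (RA-like) attractor."""
    x_slow = np.array([1.0, 1.0, 28.0])
    x_fast = np.array([1.0, 1.0, 28.0])
    song = []
    for _ in range(T):
        x_slow = x_slow + dt / tau_slow * lorenz(x_slow, rho=28.0)
        rho_fast = 24.0 + 0.2 * x_slow[0]      # top-down control parameter
        x_fast = x_fast + dt * lorenz(x_fast, rho=rho_fast)
        song.append((x_fast[1], x_fast[2]))    # amplitude- and frequency-like outputs
    return np.array(song)

song = simulate()
```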

In these equations (specified in full in section 4), the subscript $(ij)$ indicates the $j$th level of model $i$ (i.e., latent variables at level 2 generate latent variables at level 1, which in turn generate sensory input). The model specified by these equations can be interpreted in terms of a set of probability distributions (see Figure 1, middle left), and their product provides the $i$th generative model (see Figure 1, top).

In our task design, although every generative model is running simultaneously, only the signal generated by a specific bird is selected as the sensory input (i.e., a teacher song) that the student can actually hear—as an analogy to social communication with several distinct conspecifics. This selection is controlled by a switcher (see also Figure 1, right). Suppose the currently selected model is indexed by $c$. We represent the switcher state by a set of binary variables $P(i=c) = \gamma_i \in \{0,1\}$, where only $\gamma_c = 1$, while the remaining variables are zero to ensure $\sum_{i \in M} \gamma_i = 1$. Note that $\gamma_i$ indicates the probability of model $i$ being selected but takes only the values 0 or 1 by design. When this switching process is used, the probability of sensory input is $p(\tilde{s}|m_c) = \mathbb{E}_{P(i=c)}[p(\tilde{s}|m_i)] = \sum_{i \in M} \gamma_i\, p(\tilde{s}|m_i)$, where $p(\tilde{s}|m_i)$ is the conditional probability of the sensory input when model $i$ is selected (see Figure 1, bottom left). We suppose that the switcher $\gamma$ is sampled from a categorical prior distribution $P(\gamma) = \mathrm{Cat}(\Gamma)$ for each epoch of singing. In sum, all models ($m_i$) generate fluctuations in hidden states, while only the output from $m_c$ is selected as the sensory input.
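The switching process can be sketched in a few lines of Python; the uniform prior `Gamma` and the sinusoidal placeholder "songs" are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_sensory_input(model_outputs, Gamma):
    """Sample the switcher from Cat(Gamma); only the selected model's
    output reaches the student as sensory input."""
    M = len(model_outputs)
    c = rng.choice(M, p=Gamma)        # index of the currently singing teacher
    gamma = np.eye(M)[c]              # one-hot: gamma_c = 1, the rest are 0
    s = sum(g * out for g, out in zip(gamma, model_outputs))
    return s, gamma

# three placeholder "songs" generated by three models running in parallel
outputs = [np.sin(np.linspace(0.0, 4.0, 100) + i) for i in range(3)]
s, gamma = generate_sensory_input(outputs, Gamma=np.ones(3) / 3)
```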

### 2.2 Update Rules for Inference, Model Selection, and Learning

The inversion of a generative model corresponds to inferring the unknown variables and parameters, which we will treat as perceptual inference and learning, respectively. Formally, in variational Bayes, this rests on optimizing an approximate posterior belief over unknown quantities by minimizing the variational free energy (and its path integral) under each model. This comprises three steps, as shown at the top of Figure 2: (1) in the inference step, latent variables under all models are updated over an epoch of birdsong; (2) in the model selection step, a softmax function of the variational free action under each model gives the model posterior (i.e., model evidence); and (3) in the learning step, this posterior plays the role of an adaptive learning rate when updating model parameters using a descent on the variational free action. This ensures that only models that are likely to be generating the birdsong (sensory data) are updated, while the remaining models retain their current parameters. In what follows, we derive the associated update rules to illustrate their general form (see section 4 for details).
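Steps 2 and 3 can be caricatured numerically: a softmax of the negative free actions yields model plausibilities, which then scale each model's parameter update. The free actions and gradients below are placeholder numbers, not the quantities derived in section 4.

```python
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - np.max(x))         # subtract the max for numerical stability
    return e / e.sum()

def update_models(params, free_actions, grads, eta=0.1):
    """Plausibility-weighted learning: each model's gradient step is scaled
    by its posterior probability (softmax of negative free action)."""
    plaus = softmax(-np.asarray(free_actions))
    new_params = [p - eta * w * g for p, w, g in zip(params, plaus, grads)]
    return new_params, plaus

# model 0 explains the current song far better (lower free action),
# so nearly all of the update is applied to model 0
new_params, plaus = update_models(params=[1.0, 1.0],
                                  free_actions=[2.0, 10.0],
                                  grads=[0.5, 0.5])
```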

The posterior distribution of the switch, $Q(\gamma)$, can be considered an attentional filter (Luck et al., 2000; Green & Bavelier, 2003; Awh et al., 2012). According to this view, an attended generative model and its associated posterior beliefs correspond to the marginal distributions over models and posteriors (i.e., because the attended model is more plausible than all others, the posteriors conditioned on this model will approximate those obtained through a Bayesian model average over all models). Let $\bar{u}$ and $\bar{\theta}$ be the marginal beliefs over latent variables and parameters, respectively. These may be thought of as Bayesian model averages over each of the internal models. When each model has the same structure and dimensions, these marginals are given by $p(\tilde{s}, \bar{u}, \bar{\theta}) \equiv \sum_{i \in M} \gamma_i\, p(\tilde{s}, u_i = \bar{u}, \theta_i = \bar{\theta} \,|\, m_i)$, $q(\bar{u}) \equiv \sum_{i \in M} \gamma_i\, q(u_i = \bar{u})$, and $q(\bar{\theta}) \equiv \sum_{i \in M} \gamma_i\, q(\theta_i = \bar{\theta})$. Thus, our model is formally analogous to a gaussian-mixture-model version of a Bayesian filter. On a more anthropomorphic note, the marginal beliefs over latent variables $\bar{u}$ and parameters $\bar{\theta}$ are fictive (i.e., they do not exist in the external world). One could imagine that they underwrite some conscious inference, with several competing generative models (i.e., hypotheses) running at a subpersonal or unconscious level in the brain.
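Numerically, the Bayesian model average is just a plausibility-weighted mixture of the per-model posterior expectations; the values below are illustrative:

```python
import numpy as np

def model_average(gamma, expectations):
    """Marginal (model-averaged) expectation: sum_i gamma_i * mu_i."""
    return np.asarray(gamma) @ np.asarray(expectations)

gamma = np.array([0.9, 0.05, 0.05])     # posterior over models, Q(gamma)
mu = np.array([[1.0, 0.0],              # per-model posterior expectations
               [4.0, 2.0],
               [8.0, -2.0]])
mu_bar = model_average(gamma, mu)       # dominated by the attended model
```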

Interestingly, the above formulation can be applied to generative models that have different structures and dimensions, because there is no direct interaction between the generative models: the switcher receives only the output from each model. This property may be particularly pertinent for recognizing conspecifics, since different conspecifics may not be best modeled using the same generative model structure.

### 2.3 Demonstrations of Multiple Internal Models Using Artificial and Natural Birdsongs

A birdsong has a hierarchical structure that enables the expression of complicated narratives using a finite set of notes (Suzuki, Wheatcroft, & Griesser, 2016). Young songbirds are known to learn such a song by mimicking adult birds' songs (Tchernichovski, Mitra, Lints, & Nottebohm, 2001; Woolley, 2012; Lipkind et al., 2013; Yanagihara & Yazaki-Sugiyama, 2016; Lipkind et al., 2017). Previous studies have developed a songbird model that infers the dynamics of another's song based on a deep (two-layer) generative model (Kiebel et al., 2008; Friston & Kiebel, 2009). Perceptual inference requires an internal model of how the song was generated. However, in a social situation, several birds may produce different songs generated by different brain states (or generative models). In the simulations that follow, we consider a case where two birds (teachers 1 and 2) sang two different songs in turn, as illustrated in Figure 2 (left). A song $s = (s_1, s_2)^T$ is given as a 4 s sequence of a two-dimensional vector, where $s_1$ and $s_2$ represent the power and the mode of sound frequency, respectively—analogous to a physiological model of birdsong vocalizations (Laje, Gardner, & Mindlin, 2002; Perl, Arneodo, Amador, Goller, & Mindlin, 2011). Here, we supposed that the generative model had two layers of three-neuron circuits (or circuits comprising three neural populations) for birdsong generation—the so-called Laje-Mindlin-style model (Laje & Mindlin, 2002). In preliminary simulations, we confirmed that when a student with a single generative model heard their songs, it was unable to learn either teacher 1's or teacher 2's song (see Figure 6 in appendix A). This is because a single generative model cannot generate two songs. Thus, the student tried to learn a spurious intermediate model of the two songs and failed to learn either.
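As a minimal caricature of such a circuit, the sketch below runs a leaky three-population rate network and reads out two of its states as the song's power and frequency mode. The connectivity matrix, readout, and constants are assumptions for illustration, not the parameters used in the simulations.

```python
import numpy as np

def run_circuit(W, T=400, dt=0.05, tau=1.0):
    """Leaky rate dynamics tau * x' = -x + tanh(W x); two of the three
    states stand in for the song's power and frequency mode."""
    x = np.array([0.1, 0.0, -0.1])
    s = []
    for _ in range(T):
        x = x + dt / tau * (-x + np.tanh(W @ x))
        s.append((x[0], x[1]))        # (power-like, frequency-mode-like) readout
    return np.array(s)

W = np.array([[ 0.0, -2.0,  1.5],    # asymmetric coupling between the
              [ 2.0,  0.0, -1.5],    # three neural populations (illustrative)
              [-1.0,  1.0,  0.0]])
song = run_circuit(W)
```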

This limitation can be overcome using a repertoire of generative models (see section 4 for details). We found that a student with two generative models ($m_1$, $m_2$; see Figure 2, right) can solve this unsupervised learning problem efficiently. We trained the model by providing two (alternating) teacher songs and updating an unknown parameter of both generative models. Posterior densities over parameters (i.e., synaptic strengths) were updated over learning and successfully converged to the true values used in the simulation (see Figure 3A). As a result, the student was able to make perceptual inferences about the latent states generating both songs (see Figure 3B). In each session of training, the free action (i.e., average free energy) was computed for both internal models. The trajectories of free action evince a process of specialization, where each model becomes an expert for one of the two songs (see Figure 3C). At the beginning of each exposure, the probability of each model was around 0.5, which led to parameter updates in both models (see Figure 3D). Following learning, the difference in model plausibility became significantly larger—and only the most likely model updated its parameter following the appropriate song. In Figure 3, the hidden states of the teachers were reset at the beginning of each session, which made the song sequence periodic and easy to learn. When the hidden states of the teachers were not reset, the song sequence became chaotic and was more difficult to learn. However, even in this chaotic case, our model successfully learned from two distinct teachers (see Figure 7 in appendix A).

We next provided six distinct natural zebra finch songs to our model to see if it could learn and recognize six different teachers (see Figure 4; see also section 4 for details). Here we assumed a generative model with realistic song generation capacity comprising two layers of four-neuron circuits (or circuits comprising four neural populations), based on the Laje-Mindlin-style model (see section 4). The posteriors of the parameters of the student's internal models were randomized. However, for simplicity, the posteriors over hidden states at time $t=0$, and the time constants of the differential equations, were optimized a priori (i.e., initialized) to be consistent with one of the six teacher songs. Before training, we tested the responses of the student to the teacher songs as a reference (movie 1; see appendix B).

A student bird with six internal models inferred latent states (with a small update rate) and calculated the accompanying free energy and model evidence (see Figures 4A and 4B). After exposure, the student generated a song to predict (or imitate) the current teacher song by running the generative models in a forward or active mode. In this mode, the bird reproduces its predicted sensory input based on a Bayesian model average of the dynamics generating a particular song. This Bayesian model average is the mixture of model-specific predictions weighted by model evidence or plausibility. However, prior to learning, the student could not reproduce the teacher song because it had not yet learned the teachers' parameters and could not categorize the teachers. During training, we randomly provided one of the six teacher songs for 60 sessions (movie 2; see appendix B). The student listened to the song and evaluated model plausibility for each of its six internal models. It then learned (924-dimensional) unknown model parameters, with a learning rate determined by model plausibility, to ensure only plausible models were updated. These parameters controlled a nonlinear (polynomial) mapping from latent states expressing the dynamics of the deep generative models to fluctuations in amplitude and peak frequency of the sensory input.

We found that the student's internal models became progressively specialized for one of six teacher songs (movie 3; see appendix B). After learning, only the most plausible model (with veridical parameters) contributed to the Bayesian model average, so that the student could reproduce the teacher songs in a remarkably accurate way (see Figure 4C). These results are particularly pleasing because they also suggest that real songbirds (zebra finches) learn and generate songs (in their RA and HVC) using dynamics with the form we have assumed. Indeed, to compare inferred and true hidden states (and parameters), the real zebra finch songs were learned separately and regenerated under the appropriate model to provide stimuli. Learning success was further confirmed by a specialization of each model for a specific teacher (see Figure 4D) and a convergence of posterior parameter expectations, under each model, to the teacher-specific values (see Figure 4E). These results suggest that the proposed scheme works robustly, even with natural data and a large number of songs.

Finally, we illustrate how inference is affected by either the absence of attentional switching or by a discrepancy between the number of internal models and teacher songs presented (see Figure 5). A standard Bayesian filter, lacking attentional switching, failed to find optimal internal models—to track six teacher songs separately—even when equipped with six internal models. In Figures 5A and 5B, this is evidenced by the absence of free-energy reduction (black line). Conversely, the current scheme, with attentional switching, was able to reduce free energy (red line). This is due to the suppression of the learning rate in implausible models during model inversion. The resulting difference is especially evident when the generative model comprises four-neuron circuits. The mixture model learned six different songs with a high degree of accuracy, as shown in Figure 4, thereby reducing the free energy substantially. When there were equal numbers of internal models and teachers, only one of six internal models was plausible for each session (see Figure 5C). When the number of internal models was greater than that of teachers, several internal models came to represent a teacher song, while the superfluous models were never considered plausible (see Figure 5D), indicating the continued success of the agent in categorizing and learning multiple songs. However, the confidence in these categorizations diminished relative to the correct model. Conversely, when the number of internal models was fewer than that of teachers, each internal model came to represent a mixture of teacher songs, thereby failing to recognize distinct teacher songs (see Figure 5E). Collectively, these findings highlight the potential utility of equipping an adequate number of generative models with attentional selectivity for learning (and inverting) context-sensitive models of the social world.

## 3 Discussion

The brain may use multiple generative models and select the most plausible explanation for any given context. Findings from comparative neuroanatomy (Reader & Laland, 2002; Shultz & Dunbar, 2010; Devaine et al., 2017) suggest that as the brain becomes larger, it can entertain more hypotheses, or internal models, about how its sensations were caused. This strategy can be used to learn and recognize particular conspecifics in a communication or social setting. A key question here is how the brain separately establishes distinct generative models before it recognizes which model is fit for purpose, and vice versa. In this study, we introduce a novel learning scheme for updating the parameters of the multiple internal models that are themselves being used to filter continuous data. First, several alternative generative models run in parallel to explain the sensory input in terms of inferred latent variables; this enables the free action (under each model) and associated model plausibility to be evaluated; finally, the parameters of each model are updated with a learning rate that is proportional to the model plausibility. This ensures that only models with high model plausibility or evidence are informed by sensory experience. The proposed scheme allows an agent to establish and maintain several different generative models (or hypotheses) and to perform an adaptive online Bayesian model selection (i.e., switching) of generative models depending on the provided input.

The definition of *social intelligence* varies greatly. The term could be applied broadly to species able to engage in coordinated behaviors with other conspecifics (e.g., swarm behaviors or shoals of fish; Mann & Garnett, 2015). For an account of the sort of inferences required for a creature to “know its place” in this sort of society, Friston, Levin, Sengupta, and Pezzulo (2015) illustrate how this can be achieved in the absence of inferences about other individuals in an ensemble. The sort of intelligence we are interested in here is of a more sophisticated sort: the capacity to recognize oneself and others. We are interested in creatures that interact with their conspecifics at an individual level and can tailor their behavior to whomever they interact with. This requires not just (a minimal) theory of mind but a theory of multiple minds and is closer to the sorts of social intelligence thought to be impaired in conditions like autism (Happé & Frith, 1995).

The ideas presented here address a key challenge for social systems: that of disambiguating between and learning about other conspecifics or members of a society. Our hope is that this takes us a step closer to a formal theory of social intelligence. A complete formal theory would entail computational approaches to solving other aspects of social behavior, including those addressing behavioral economic and trust games (Moutoussis, Trujillo-Barreto, El-Deredy, Dolan, & Friston, 2014), and approaches to understanding the optimal depth of recursive sophistication for social interactions (Devaine, Hollard, & Daunizeau, 2014).

The Bayesian filtering or (sensory) evidence accumulation simulated in this letter offers proof of concept that biologically plausible schemes can be used to recognize the source of dynamically rich sensory streams. The simulations show how neuronal-like message passing can solve two key problems: (1) abstracting or deconvolving a time-invariant representation of how fluctuating sensations are generated and (2) disambiguating among alternative sources. The particular message passing used in this study and in a number of previous publications (Friston, Adams, Perrinet, & Breakspear, 2012; Friston & Frith, 2015b; Friston & Herreros, 2016) can be regarded as a generalization of predictive coding that has growing empirical support as a scheme that the brain might use (Kok, Rahnev, Jehee, Lau, & de Lange, 2012; Brodski-Guerniero et al., 2017; Heilbron & Chait, 2017). A review of the evidence for the basic architecture and ideas can be found in several papers (Bastos et al., 2012; Adams, Shipp, & Friston, 2013; Shipp, 2016). A more technical treatment based on message passing on factor graphs can be found in other publications (Friston, Parr et al., 2017). This letter pursues the biological plausibility of belief updating and, in particular, shows the formal similarities between neuronal message passing required under generative models of both discrete and continuous state spaces.

One might ask if there are alternative schemes that could perform equally well—for example, classification schemes from machine learning (LeCun et al., 2015). This is a potentially important question that would speak to different computational architectures and neurophysiological implementation. However, current machine learning approaches would probably converge on the Bayesian filtering scheme under the deep temporal models used above. This follows from the fact that high-end machine learning schemes use exactly the same (variational free energy) objective function used in Bayesian filtering (and generalized predictive coding). We have in mind here variational autoencoders based on a deep bottleneck architecture, for example (Suh, Chae, Kang, & Choi, 2016). Our model is a mixture model for generalized Bayesian filtering. In this sense, our model can be viewed as a time-domain extension of an autoencoder mixture model (Aljundi, Chakravarty, & Tuytelaars, 2017). A theoretical comparison also supports a close link between learning mechanisms in predictive coding and backpropagation (Whittington & Bogacz, 2017). At the present time, most variational autoencoders do not deal with time-varying data. The implication is that extending current deep learning and variational inference in machine learning to solve the inference problem in a dynamic setting will produce the same scheme as the one used in our simulations.

From a technical perspective, one of this study's key contributions is the evaluation of the evidence for competing hypotheses about the sources of sensory input. This evaluation can be cast in terms of Bayesian model selection. From a psychological perspective, this sheds light on our capacity for perceptual categorization, where the underlying selective processes may be cast in terms of attentional selection (Deubel & Schneider, 1996; Itti & Koch, 2001; Bosman et al., 2012). From the perspective of optimal control theory and machine learning, this has clear homologues with selection from mixtures of experts in motor control (Tani & Nolfi, 1999; Haruno et al., 2003) and, indeed, any selective process that involves mutual or lateral inhibition leading to winner-takes-all-like behavior (Zelinsky & Bisley, 2015).
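In practice, this post hoc model selection reduces to a softmax over each model's (negative) conditional free action, which lower-bounds that model's log evidence. The following sketch illustrates the computation; the function name and the default uniform prior are our assumptions, not details of the paper's implementation.

```python
import numpy as np

def model_plausibility(neg_free_actions, prior=None):
    """Posterior model plausibility via a softmax over (negative)
    conditional free actions, each of which lower-bounds the
    corresponding model's log evidence.

    neg_free_actions : sequence of -F_i, one entry per generative model.
    prior            : optional prior over models (uniform if None).
    """
    x = np.asarray(neg_free_actions, dtype=float)
    if prior is None:
        prior = np.ones_like(x) / x.size
    logits = x + np.log(prior)
    logits -= logits.max()          # subtract the max for numerical stability
    gamma = np.exp(logits)
    return gamma / gamma.sum()

# Example: model 0 has accumulated more evidence than models 1 and 2,
# so it dominates the posterior over models.
gamma = model_plausibility([-10.0, -14.0, -15.0])
```

Because the softmax is invariant to a common offset, only differences in free action matter, which is why learning rates remain near-uniform until the models' evidence diverges.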

Recent machine learning studies show that task-specific synaptic consolidation can protect a network from forgetting previously learned associations while it learns new ones (Kirkpatrick et al., 2017; Zenke, Poole, & Ganguli, 2017). We have focused on disambiguating between, and learning with, multiple internal models and therefore do not consider the explicit protection of previously learned associations. However, a combination of attentional switching and synaptic consolidation would be a potentially interesting extension, which we would like to address in future work.

When an agent encounters an environment that generates data in several possible ways, it can model the environment as either a single generative model (with distinct contextual levels) or multiple generative models. In contrast, when an agent encounters an environment with multiple conspecifics, explaining data with a single model is not straightforward because there are multiple sources of sensory data, calling for a mixture of generative models. In this sense, our model is particularly useful when an agent encounters several different agents in a social context.

Neurobiologically, our learning update rule might be implemented by associative (Hebbian) plasticity modulated by a third factor, a concept that has recently received attention (Pawlak, Wickens, Kirkwood, & Kerr, 2010; Frémaux & Gerstner, 2016; Kuśmierz, Isomura, & Toyoizumi, 2017). While Hebbian plasticity depends on the spike timings of pre- and postsynaptic neurons (Hebb, 1949; Bliss & Lømo, 1973; Markram, Lübke, Frotscher, & Sakmann, 1997; Bi & Poo, 1998; Froemke & Dan, 2002; Malenka & Bear, 2004; Feldman, 2012), recent studies have reported that various neuromodulators (Reynolds, Hyland, & Wickens, 2001; Seol et al., 2007; Zhang, Lau, & Bi, 2009; Salgado, Köhr, & Treviño, 2012; Yagishita et al., 2014; Johansen et al., 2014), GABAergic inputs (Paille et al., 2013; Hayama et al., 2013), and glial factors (Ben Achour & Pascual, 2010) can modulate Hebbian plasticity in various ways. Our learning update rule consists of the product of the (conditional) free action gradient, which provides a Hebbian-like term (see Friston, 2008, for details), and the posterior belief about the switcher state; the latter might be implemented by such additional neurobiological factors.
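To make the analogy concrete, a minimal sketch of such a three-factor rule is given below: a Hebbian outer-product term gated by the model plausibility. The names and the outer-product form are illustrative assumptions, not the exact free-action gradient used in the scheme.

```python
import numpy as np

def three_factor_update(W, pre, post, gamma, eta=0.01):
    """One plasticity step: a Hebbian term (pre/post coincidence)
    gated by a third factor `gamma` -- here, the posterior belief
    that the current model generated the input.
    """
    hebbian = np.outer(post, pre)      # classic pre-post coincidence term
    return W + eta * gamma * hebbian   # the third factor scales the update

# Illustrative step: high plausibility (gamma = 0.9) permits learning.
W = np.zeros((3, 2))
W = three_factor_update(W,
                        pre=np.array([1.0, 0.5]),
                        post=np.array([0.2, 0.0, 1.0]),
                        gamma=0.9)
```

When `gamma` is near zero (an implausible model), the same presynaptic and postsynaptic activity produces almost no weight change, which is the protective effect discussed above.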

Previous studies have modeled communication between agents in analogy with the mirror neuron system (Kilner, Friston, & Frith, 2007; Friston, Mattout, & Kilner, 2011; Friston & Frith, 2015a, 2015b). These simulations involve two birds that make inferences about each other, converging onto the same internal state and generating the same song. This has been used as a model of hermeneutics, cast in terms of generalized synchrony. Heuristically, both birds come to sing from the same “hymn sheet” and thereby come to “know each other” through knowing themselves. Such a synchronous exchange minimizes the joint free energy of both birds because both birds become mutually predictable. This might be related to the experimental observation that a birdsong can propagate emotional information to another bird and influence its behavior (Schwing, Nelson, Wein, & Parsons, 2017). This setup can be generalized when more than two birds are singing the same song. However, when several conspecifics generate different songs (or speak different languages), an agent with a single generative model is no longer fit for purpose. We addressed this limitation by equipping synthetic birds with alternative attractors or hypotheses. Interestingly, it has been reported that some songbirds display such learning flexibility—for example, white-crowned sparrows learn multiple songs during the vocal development stage and later switch between learned songs (Hough, Nelson, & Volman, 2000). In our case, generative models are explicitly decomposed and their learning rates are tuned by an attentional switch, allowing an agent to optimize a specific model for each context. In other words, inference about a particular correspondent's “state of mind” can be modeled by synchronous dynamics during conversation, where one of the listener's internal models converges to an attractor representing the speaker's.
In this view, empathic capacity may be quantified by how many attractors the agent can deploy and how well it can optimize each attractor. In future work, it will be interesting to consider the relationship between this conceptual model and recent experimental work that suggests that the human brain uses dissociable activity patterns to separately represent self and other (Ereira, Dolan, & Kurth-Nelson, 2018).

In terms of relating the dynamics of learning and inference to empirical observations, the ability to simulate learning and inference, implicit in our simulations, raises the possibility of using empirical data to constrain the scheme's parameters. In other words, there is, in principle, an opportunity to use the simulations of the birdsong recognition above as an observation model to explain the empirical time course of perceptual categorization, learning, and their neuronal correlates. For example, the time course of learning in Figures 3 to 5 suggests that a unit of time (the number of sessions) would correspond to a range from a few hours to a day, given the results reported in Tchernichovski et al. (2001). In our simulation, the first few sessions exhibited a small free-energy reduction because model plausibility was almost uniform and the associated learning rates were similarly small. After a difference in model evidence emerged, the rate of free-energy reduction reached a peak and then gradually returned to zero. This might correspond to the learning process of songbirds that learn the prototypes of songs early and later learn the details of respective songs (Tchernichovski et al., 2001). In the songbird brain, a population of neurons in the auditory association areas exhibits an experience-dependent selective response to one of several learned songs (Gentner & Margoliash, 2003), suggesting that neurons encode posterior expectations of individual songs based on experience. Moreover, the HVC plays an important role not only in song production but also in the formation of associations between a song and a conspecific that emits that song (Gentner, Hulse, Bentley, & Ball, 2000). Again, this is consistent with our generative model that is used for both generation and recognition. 
The existence of neurons with teacher-specific activity has been reported in the higher-level auditory cortex of the songbird, the caudomedial nidopallium (NCM; Yanagihara & Yazaki-Sugiyama, 2016). In our model, the switcher exhibits teacher-specific activity by accumulating model evidence. One can imagine that neurons in the NCM might encode the posterior belief about the switcher state. In addition, the accuracy of song recognition might be bounded by the memory capacity of the neural circuit encoding songs (Gentner, 2004), which is similar to the memory capacity of our model, as determined by the number of generative models that can be supported by the neuronal infrastructure. In short, these empirical observations support the neurobiological plausibility of our model and point to empirical tests.

While we have randomly interspersed the order of presentation from each teacher, it would be interesting to examine the influence of more systematic changes in presentation order. In a social neuroscience context, this could be important in understanding things like multiple language acquisition in children who speak to different family members in different languages. Specifically, there appear to be differences in bilingualism when languages are learned simultaneously (i.e., interspersed like the model here) or successively (Klein, Mok, Chen, & Watkins, 2014). This might imply different mechanisms for the latter compared to the former and require an extension of the generative model employed here.

In terms of the questions and challenges for empirical neuroscience, the picture that emerges from the current solution raises the following considerations. First, the computational anatomy in Figure 1 implies a deep (temporal) architecture in which multiple, competing attractor networks coexist in the brain. These effectively compete to explain the sensory data, and their ability to do so determines the rate of perceptual learning. In turn, one would predict distinct autonomous dynamics corresponding to competing hypotheses about the current dynamical form of sensory input; that is, there should be neuronal correlates of distinct pattern generators that are engaged contemporaneously during perceptual synthesis. Second, it suggests a convergence of descending projections to lower (e.g., primary) sensory systems. This brings an interesting and complementary perspective on the divergent neuroanatomy of descending backward connections in cortical hierarchies in the brain (Zeki & Shipp, 1988; Angelucci & Bressloff, 2006). We mean this in the sense that the asymmetry between convergent and divergent zones is usually interpreted in terms of things like extraclassical receptive field effects, particularly in the visual cortex (Angelucci & Bressloff, 2006). A complementary perspective is that the divergence of descending efferents can also be looked on as a convergence of descending afferents. This is precisely the architecture described in Figure 1. More interestingly, the factor graph representation of neuronal architectures speaks to a selective (Bayesian model selection) modulation of the messages converging on any given lower level. Physiologically, this means that there must be a neuromodulatory mechanism in play that can handle multiple convergent inputs to a postsynaptic neuron or population. In effect, this implies a winner-takes-all-like mechanism at the level of synaptic efficacy, as opposed to synaptic activity.
One could speculate about the neurotransmitter basis of this selection process—for example, appealing to the neuromodulatory effects of neurotransmitter systems targeting cholinergic and 5HT receptors (Everitt & Robbins, 1997; Collerton, Perry, & McKeith, 2005; Vossel, Bauer, & Mathys, 2014; Hedrick & Waters, 2015; Doya, 2002; Yu & Dayan, 2005; Dayan, 2012)—on inhibitory interneurons in superficial layers and deep pyramidal cells in deep layers. Many of these speculations have been rehearsed in relation to the deployment of attention in the context of predictive coding and usually implicate synchronous gain mechanisms via the action of inhibitory interneurons and their recurrent connections with pyramidal cells (Fries, 2005; Womelsdorf & Fries, 2006; Saalmann & Kastner, 2009; Feldman & Friston, 2010; Buschman & Kastner, 2015). Finally, one key message of this theoretical work is that the rate of perceptual learning is determined by the evidence for competing models of sensory input. In principle, this predicts that the rate of sensory learning under ambiguity should be sensitive to the relative probability ascribed to different explanations for sensory input. In turn, this relates to psychophysical studies of perceptual learning under ambiguity with, for example, ambiguous figures or other forms of multistable perception (Tani & Nolfi, 1999; Wurtz, 2008; Hohwy, Paton, & Palmer, 2016).

In summary, we have introduced a novel learning scheme that integrates Bayesian filtering and model selection to learn and deploy multiple generative models. We assumed that a switching variable selects a particular model to generate current sensory input (like switching to a particular radio channel from a repertoire of radio programs), while many alternative generative models are running in the background. To deal with the problem of context-sensitive learning, the proposed scheme calculates the model plausibility (i.e., model evidence) of each generative model based on conditional free actions and updates parameters only in models with a convincing degree of evidence. Our synthetic agents were able to both learn and recognize different artificial and natural birdsongs. These results highlight the potential utility of equipping agents with multiple generative models to make inferences in context-sensitive environments.

## 4 Methods

The proposed variational update scheme is described in section 2. Further details are provided in this section.

### 4.1 Generative Model

#### 4.1.1 Sensory Inputs

Interestingly, this definition of the multiple generative models and the switcher is slightly different from supposing one large generative model with switcher-dependent parameters. Rather, the idea is that an agent has a set of hypotheses (generative models) about sensory inputs from which to select in a given context. Each generative model runs independently of the others. The models interact only via sensory inputs, through the model switching scheme, and this interaction does not change their latent variables or parameters. Because of this conditional independence, each model could take a different form and dimension, although in this work, we assume models with the same structure and dimension but different latent variables and parameters.

#### 4.1.2 Free Energy and Free Action

#### 4.1.3 Posteriors

From the Laplace assumption, the posterior density of the latent variables $u^i$ is approximated as a gaussian distribution $q(u^i) = \mathcal{N}[u^i;\, \bar{u}^i, P_u^i]$ with an expectation (or mode) vector $\bar{u}^i$ and a precision matrix $P_u^i$. The posterior density of the parameters $\theta^i$ is approximated as a gaussian distribution $q(\theta^i) = \mathcal{N}[\theta^i;\, \bar{\theta}^i, P_\theta^i]$ with an expectation vector $\bar{\theta}^i$ and a precision matrix $P_\theta^i$. As described above, the posterior distribution of the switcher state (i.e., the model plausibility) has been defined as a categorical distribution $Q(c = i) = \gamma_i$ with $\sum_{i \in M} \gamma_i = 1$, which is equivalent to $Q(\gamma) = \mathrm{Cat}(\gamma)$.

### 4.2 Variational Update Rules

Updates of the posteriors of the latent variables, the switcher state, and the parameters are conducted in the inference, model selection, and learning steps, respectively. In the simulation, these three steps are repeated in order for each session. In what follows, we formally derive update rules from the minimization of free energy or free action.
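The three steps can be sketched as a per-session loop. This is a minimal illustration; the object interface (`infer`, `neg_free_action`, `update_parameters`) and the base learning rate are our assumptions, not the paper's implementation.

```python
import numpy as np

def run_session(models, sensory_input, base_rate=0.1):
    """One session: inference, model selection, then learning."""
    # 1. Inference: update latent variables in every model in parallel.
    for m in models:
        m.infer(sensory_input)

    # 2. Model selection: convert (negative) conditional free actions
    #    into a categorical posterior over models (model plausibility).
    logits = np.array([m.neg_free_action() for m in models])
    logits -= logits.max()                      # numerical stability
    gamma = np.exp(logits) / np.exp(logits).sum()

    # 3. Learning: scale each model's parameter update by its
    #    plausibility, so only plausible models learn appreciably.
    for g, m in zip(gamma, models):
        m.update_parameters(learning_rate=base_rate * g)
    return gamma
```

Repeating this loop across sessions reproduces the qualitative behavior reported above: learning rates stay small while the plausibilities are near-uniform and become winner-takes-all once the models' evidence diverges.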

#### 4.2.1 Inference (Neural Activity)

#### 4.2.2 Model Selection (Attentional Switch)

#### 4.2.3 Learning (Synaptic Plasticity)

Accordingly, we obtain posterior beliefs about the latent variables, the switcher state, and the parameters that minimize free action. The plausibility-dependent learning rate ensures that only the parameters of the most plausible models are updated, while the parameters of the remaining models are maintained, in a winner-takes-all manner; by contrast, the latent variables in all models are updated at a fixed rate. Therefore, inference occurs in all generative models, while learning occurs only in the most plausible ones. This mechanism enables the agent to perform inference and learning with several different generative models.

#### 4.2.4 Action

### 4.3 Songbird Model

A generative model for birdsong generation is defined as a two-layer hierarchical generative model, following previous studies (Kiebel et al., 2008; Friston & Kiebel, 2009), in which each layer has three or four hidden states that express the biological neural circuits for birdsong generation. For simulation purposes, several different teacher songs and the same number of internal models were used.

#### 4.3.1 For Figure 3

#### 4.3.2 For Figures 4 and 5

In the simulations, we used a leakage parameter $\rho = 0.1$ (to ensure stability), a small inhibitory synaptic weight $\epsilon = 0.2$ (which controlled the coupling strength between the first and second oscillators), and a large inhibitory synaptic weight $\lambda = 1.2$ (which determined the period of the second neural oscillator). We supposed that the hidden causes $v^{(i,1)}$ converged to $x^{(i,2)}$ to simplify the simulation. The definition of $g^{(i,1)}$ was chosen to ensure that it could express a general quintic function as a linear product of a $2 \times 462$ matrix and a 462-dimensional vector. These parameters were learned by a student without supervision. When updating the posterior belief about hidden states, we smoothed the posterior trajectory by adding small amounts of components of the prior (i.e., a trajectory of an attractor without perturbation) to avoid a divergence of variables induced by a large perturbation. Training was repeated for 60 sessions, each comprising a 10 s sequence, with a time resolution of $dt = 10^{-3}$ s.
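As a check on the dimensionality, 462 is exactly the number of monomials of degree at most 5 in six variables ($\binom{6+5}{5} = 462$). The sketch below enumerates such a basis, so that a general quintic output can be written as a linear map with a $2 \times 462$ parameter matrix; the six-variable assumption and the function name are ours.

```python
from itertools import combinations_with_replacement

import numpy as np

def quintic_basis(x):
    """All monomials of degree <= 5 in the entries of x, including the
    constant term, stacked as a feature vector. For a 6-dimensional x
    this yields C(6 + 5, 5) = 462 features.
    """
    feats = [1.0]                                 # degree-0 (constant) term
    for deg in range(1, 6):                       # degrees 1 through 5
        for idx in combinations_with_replacement(range(len(x)), deg):
            feats.append(np.prod([x[i] for i in idx]))
    return np.array(feats)

x = np.array([0.1, -0.2, 0.3, 0.0, 0.5, -0.4])
phi = quintic_basis(x)                            # 462-dimensional vector
# A general quintic output is then a linear map: y = Theta @ phi,
# with Theta a 2 x 462 parameter matrix (as in the text).
```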

The following procedure was applied before training: (1) the student's initial states $x^{(i,1)}, x^{(i,2)}$ at time $t = 0$ and the time constants $\tau_1, \tau_2$ [s] were optimized in relation to one of six teacher songs (these were in the ranges $-1 \le x_1^{(i,j)}, x_2^{(i,j)}, x_3^{(i,j)}, x_4^{(i,j)} \le 1$, $1/60 \le \tau_1 \le 1/40$, and $1/6 \le \tau_2 \le 1/4$); (2) the posterior expectation of the parameters $\theta^{(i,1)}$ was randomly generated; and (3) $\theta^{(i,1)}$ was modified by pretraining, in which each model randomly received one of six teacher songs, made an inference, and updated its parameters without model selection for 18 sessions, to ensure that each internal model initially represented an averaged song. The response songs of a student were then tested with different teacher songs (movie 1; see appendix B). For training, we randomly selected one of six teacher songs and provided it to a student (movie 2; see appendix B). Training was repeated for 60 sessions. After training, we tested the response songs again (see Figure 4 and movie 3; see appendix B).

### 4.4 Preprocessing for Natural Birdsong Data

The birdsong data used for Figures 4 and 5 and the supplementary movies were downloaded from http://ofer.sci.ccny.cuny.edu/song_database/zebra-finch-song-library-2015/view. This data set was recorded by the Tchernichovski group (see Tchernichovski et al., 2001). We treated the data as follows. First, we acquired a spectrogram of each song by performing a Fourier transform with a 23.2 ms time window. In analogy to a physiological model of the vocal organ that generates a birdsong from sequences of the power and tone (frequency) of the voice (Laje, 2002), we defined the leading frequency ($s_2$) and the amplitude ($s_1$) of a song as the mode of its frequency and the power at the mode frequency for each time step, respectively. These were normalized and introduced as sensory inputs $s = (s_1, s_2)^T$.
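A minimal sketch of this preprocessing is shown below, using SciPy's spectrogram routine. The function name, the normalization to unit maximum, and the default spectrogram window are our assumptions; only the 23.2 ms window and the definition of $s_1$ and $s_2$ come from the text.

```python
import numpy as np
from scipy.signal import spectrogram

def song_features(waveform, fs, window_s=0.0232):
    """Extract the two sensory channels described above: per time bin,
    the leading (mode) frequency s2 and the power s1 at that frequency.
    """
    nperseg = int(window_s * fs)                  # ~23.2 ms analysis window
    f, t, Sxx = spectrogram(waveform, fs=fs, nperseg=nperseg)
    peak = Sxx.argmax(axis=0)                     # mode-frequency index per bin
    s2 = f[peak]                                  # leading frequency
    s1 = Sxx[peak, np.arange(Sxx.shape[1])]       # power at that frequency
    # Normalize each channel to unit maximum (an assumption).
    s1 = s1 / (s1.max() + 1e-12)
    s2 = s2 / (s2.max() + 1e-12)
    return np.stack([s1, s2])                     # s = (s1, s2)^T per time bin
```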

## Appendix A: Supplementary Figures

## Appendix B: Supplementary Movies 1–3

Movies 1 to 3 are available online at https://www.mitpressjournals.org/doi/suppl/10.1162/neco_a_01239. These movies show the dynamics of teacher and student birds before, during, and after training. The details are described in the caption to Figure 4.

### Acknowledgments

This work was supported by RIKEN Center for Brain Science (T.I.) and Tateisi Science and Technology Foundation (T.I.). T.P. is supported by the Rosetrees Trust (award 173346). K.J.F. is funded by a Wellcome Trust Principal Research Fellowship (088130/Z/09/Z). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

### Author Contributions

K.J.F. conceptualized the free-energy principle; T.I. conceived and designed the method using the multiple internal models and performed the simulations. T.I., T.P., and K.J.F. wrote the paper.

## References

## Competing Interests

Competing Interests: The authors declare that they have no competing interests.