Humans in Love Are Singing Birds: Socially-Mediated Brain Activity in Language Production

Abstract This functional magnetic resonance imaging (fMRI) study investigated whether and how the human speech production circuit is mediated by social factors. Participants recited a poem in the MRI scanner while viewing pictures of their lover, unknown persons, or houses to simulate different social contexts. The results showed, as expected, the recruitment of the speech production circuit during recitation. However, for the first time, we demonstrated that this circuit is tightly linked to the network underlying social cognition. The socially relevant contexts (familiar and unfamiliar persons) elicited the recruitment of a widespread bilateral circuit including regions such as the amygdala, anterior cingulate, and orbitofrontal cortex, in contrast to the non-socially relevant context (houses). We also showed a neural gradient generated by the differences in the social relevance of affective and nonaffective contexts. This study opens up a novel line of research into socially mediated speech production, revealing drastic differences in brain activation when performing the same speech production task in different social contexts. Interestingly, the analogous avian anterior neural pathway in the zebra finch is also differentially activated when the bird sings facing a (potential) mate or alone. Thus, this study suggests that despite important phylogenetic differences, speech production in humans is based, as in songbirds, on a complex neural circuitry that is modulated by evolutionarily primordial aspects such as the social relevance of the addressee.


INTRODUCTION
Citation: Martin, C., Quiñones, I., & Carreiras, M. (2023). Humans in love are singing birds: Socially-mediated brain activity in language production. Neurobiology of Language, 4(3), 501-515. https://doi.org/10.1162/nol_a_00112

Vocal communication in vertebrates is a major cognitive function shared by phylogenetically distant social animal species, such as birds and primates (Hauser et al., 2014; Seyfarth & Cheney, 2010). Evolutionarily, this function allows mating, warning calls, localizing food resources, and social learning, among others (Jarvis, 2019). Considering human speech as the physical act of producing communicative sounds and sound patterns, and given that songbirds have been considered an ideal animal model to understand the neural basis of speech communication, here we will draw a parallel between brain mechanisms involved in bird vocalization and human speech production in the context of socially mediated communication (see Figures 1 and 2 in Simonyan et al., 2012). In this regard, the avian anterior neural pathway of songbirds, which is involved in sound pattern learning and production, is the analog of the mammalian cortical-striatal-thalamocortical loop (Alexander et al., 1986; Doupe et al., 2005; Jarvis, 2004, 2019; Jarvis et al., 2005). The present functional magnetic resonance imaging (fMRI) study will investigate whether and how the human speech production circuit is mediated, as in songbirds, by social factors (Jarvis, 2004, 2019; Jarvis et al., 2005). Importantly, previous studies have shown that several neural circuits are differentially engaged in socially mediated perception in humans, specifically, by tasks that demand inferences about or empathy for another person's mental or emotional states (Lieberman, 2010; Martin et al., 2016; Molnar, 2018; Molnar et al., 2015; Woumans et al., 2015).
By drawing a parallel between perception and production, we hypothesized that socially mediated speech production (i.e., taking the interlocutor into account during language processing) might produce an interaction between regions that have been related to speech production and some of the core areas of the right-lateralized social brain network (Lieberman, 2010).

Avian Anterior Neural Pathway Underlying Bird Vocalization
Male zebra finches produce both directed and undirected songs, using the same motif (i.e., syllabic ordering or ordering of elements) in two different social contexts (Dunn & Zann, 1996; Morris, 1954). A directed song is performed for a mate whereas an undirected song is produced during learning and solitary practice (see Zann, 1996, for a review). Directed songs have a stereotypical acoustic structure and ordering of syllables and syllabic patterns (Williams, 2004). Undirected songs have the same structure but are produced with larger variability and are more likely to include truncated patterns and atypical transitions (Jarvis et al., 1998; Williams, 2004). Directed and undirected vocalizations are underpinned by different patterns of neural activity along the avian anterior vocal pathway (Hessler & Doupe, 1999). This pathway runs from frontal cortical regions (the magnocellular nucleus of the anterior neostriatum; MAN) to the striatum (Area X), the dorsolateral thalamus, and back (Jarvis, 2004). Recent studies have specified the role of the different areas within the anterior pathway and their respective functions in socially mediated singing: Inactivation of the MAN region reduces variability in undirected song patterns to a stereotyped level similar to that in directed song patterns (Kojima et al., 2018; Stepanek & Doupe, 2010). Similarly, the inactivation of Area X reduces within-syllable variability (in fundamental frequency) typically observed in undirected singing (Kojima et al., 2018). Consistently, Singh Alvarado et al. (2021) recently showed that neural activity in Area X is strongly suppressed during directed song production.

Mammalian Cortical-Striatal-Thalamocortical Loop Underlying Speech Production
Speech production is a complex cognitive function that requires the processing of linguistic information as well as precise sensorimotor integration. This function is supported by a circuitry that includes the primary motor cortex, regions of the lateral inferior and medial frontal cortex, premotor cortex, supplementary motor cortex, cerebellum, and subcortical structures such as parts of the basal ganglia and the thalamus (Alexander et al., 1986; Doupe et al., 2005; Jarvis, 2004, 2019; Jarvis et al., 2005). There is evidence that this circuit has direct projections from the facial area of the primary motor cortex in Brodmann's area 4 (the so-called laryngeal motor cortex) to the nucleus ambiguus in the brainstem (Jarvis, 2019). Interestingly, Westermann et al. (2022) demonstrated that humans have an additional affective or innate limbic vocal production pathway that involves connections from the amygdala, orbitofrontal cortex, and anterior cingulate cortex to the periaqueductal gray in the brainstem (Hage & Nieder, 2016; Jarvis, 2019). This circuit supports, for example, threatening (Hage & Nieder, 2016) or laughing (Westermann et al., 2022) vocalizations used in emotional situations. However, so far there is no experimental evidence for the activation of this circuit during human language production or comprehension.

Present Study
In the present study, we draw a parallel between human and avian vocal production, exploring whether neural circuit activation in the human brain also differs depending on the social context of speech production. We created a socially mediated language production task in which participants had to recite a previously learned poem, in the MRI scanner, in various social conditions (see Figure 1A). In order to mimic as closely as possible directed and undirected singing in songbirds, we carried out an fMRI experiment in which participants recited a poem in one of two types of directed conditions (i.e., affective directed to their lover or nonaffective directed to unfamiliar persons) or an undirected condition (i.e., without an addressee). (See Table S1 of the Supporting Information available at https://doi.org/10.1162/nol_a_00112.) A social gradient was simulated by manipulating the amount of social information contained in each of the experimental conditions. As shown in Table 1, the image of the lover elicits an emotional charge associated with the face identity that makes its social relevance greater than that of the unfamiliar persons. The houses, on the other hand, do not contain social information, so they are used as a baseline equivalent to the undirected singing in songbirds.
Our main goal was to investigate whether socially mediated speech production directed toward a lover triggered differential activation in humans similar to that observed in singing birds. Since seduction is a trans-species universal behavior and the anterior avian vocal pathway has a mammalian analog (see Alexander et al., 1986; Doupe et al., 2005), we expected that brain activity during human speech production might also differ with social context. Specifically, based on previous research on songbirds, we hypothesized that the cortical-striatal-thalamocortical loop might be differentially activated in directed versus undirected poem recitation. Furthermore, we also explored whether the simulated social gradient elicits differential activation in brain areas involved in language processing and socially mediated behavior (see Table 1). To differentiate face-specific and face-familiarity neural effects from those related to socially mediated speech, the contrasts between faces versus houses and familiar versus unfamiliar faces were tested while the participants passively viewed the visual stimuli, without the linguistic task. This allowed us to interpret the potential results excluding the effects driven by the processing of the social context (i.e., the visual cue) rather than by socially mediated speech production in a directed way.

[Table 1. Recoverable entries: regions underlying the processing of the affective information (right-lateralized); speech production circuit (left-lateralized); socially relevant information marked with "+".]
Note. Faces constitute a socially relevant stimulus. However, familiar faces, unlike the faces of strangers, allow us to anchor the face to a personal identity, which implies access to semantic, biographical, and autobiographical information. In the present case, the lover's face also contains emotional information due to the existing affective bond. Therefore, the symbol "+" (in the third column) quantifies the socially relevant information associated with each contrast.

Participants
Forty-one native Spanish speakers (mean age = 28 years, SD = 5.37, range = 20-46, 21 females) took part in the study as paid volunteers. They were all right-hand dominant, had normal or corrected to normal vision, and reported no history of psychiatric or neurological disease, learning disabilities, or hearing impairments. The study protocol was conducted according to the Declaration of Helsinki guidelines and approved by the Ethics and Scientific Committees of the Basque Center on Cognition, Brain and Language (BCBL). After the experimental procedure and MRI cautions were explained, all participants signed an informed consent form and completed a standard MRI safety declaration before the scanning session.

Stimuli
The poem to be learned and recited was "Necesito de ti" (I need you) by the Spanish poet Rafael de León (see the poem and an English translation in Table S1). Gray-scale pictures were used for the experimental contexts. Image size (217 × 285 pixels) and resolution (72 pixels per inch) were controlled across conditions. For each participant, we selected 10 face pictures of their lover (same photo session conditions and neutral background); 10 face pictures of a young adult of the same gender as the participant's lover with a neutral expression for the same-gender unfamiliar face condition (see Figure 1 for an example); 10 face pictures of a young adult of a different gender from the participant's lover for the different-gender unfamiliar face condition; and 10 pictures of houses.
Pictures of young adult faces were selected after controlling for attractiveness in an earlier test. Twenty participants (not involved in the scanning session) were presented with 60 faces in random order and asked to rate each of them for attractiveness (1: unattractive, 10: highly attractive). We selected pictures of unfamiliar faces for both genders (10 men and 10 women) within the average range of these ratings (mean = 5.64, SD = 0.92), to avoid gender- and attractiveness-related effects in the unknown person condition.

Experimental Procedure
The complete fMRI session consisted of two different production modes: reciting the previously memorized poem or reading a previously unseen short poem displayed on screen. Only the poem recitation is presented in this report. Each participant completed four consecutive runs, consisting of four randomized repetitions of a block-design functional task (see Figure 1A). Each block lasted 55.8 s (31 volumes), starting with a 1.8 s picture (visual cue) display, followed by an audio signal cueing the participant to start producing the poem, 45 s of poem recitation (with continuing picture display), and a 9 s (5 volumes) fixation-cross display during the interblock interval. During the four runs, participants recited the memorized poem while viewing pictures of their lover, unfamiliar faces (same or different gender as the lover), or houses. Each block (i.e., picture-poem combination) was repeated 10 times for a total of 40 blocks (10 blocks/condition).
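The block timing reported above can be checked with simple arithmetic. The following is a minimal sketch, not part of the original analysis: the constant names are hypothetical, times are kept in integer milliseconds to avoid rounding issues, and the total duration assumes no additional time between blocks or runs, which the text does not state.

```python
# Timing values taken from the procedure description (in milliseconds).
TR_MS = 1800          # one functional volume every 1.8 s
CUE_MS = 1800         # picture shown before the audio cue
RECITATION_MS = 45000 # poem recitation with the picture still displayed
FIXATION_MS = 9000    # fixation cross during the interblock interval
N_BLOCKS = 40         # total number of blocks
N_CONDITIONS = 4      # lover, unfamiliar same gender, unfamiliar different gender, house

block_ms = CUE_MS + RECITATION_MS + FIXATION_MS
volumes_per_block = block_ms // TR_MS
fixation_volumes = FIXATION_MS // TR_MS
blocks_per_condition = N_BLOCKS // N_CONDITIONS

# Assumption: blocks follow each other with no extra gaps.
total_task_minutes = N_BLOCKS * block_ms / 60000

print(block_ms / 1000, volumes_per_block, fixation_volumes, blocks_per_condition)
print(total_task_minutes)
```

Under these assumptions the three block components add up to the reported 55.8 s (31 volumes), the fixation period spans 5 volumes, and each of the four conditions receives 10 blocks.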

Instructions
Before attending the scanner session, participants were asked to memorize a short poem by imitating a digital recording of a same-sex speaker. They were instructed to listen to and repeat the recording as often as needed until they could recite the poem fluently. This process of learning by listening to and mimicking a same-sex speaker was designed to approximate the learning experience of birds, who learn song patterns by imitating a same-sex bird singing (Slater et al., 1988).
During the scanning session, participants were instructed to recite the poem as if they were reciting it to their lover when a picture of the lover appeared on the screen; as if they were reciting it to an unknown person when a picture of an unknown person appeared on the screen; and as if they were reciting it to themselves/rehearsing when a picture of a house appeared on the screen.

MRI Acquisition
Structural and functional MRI sessions for each participant were acquired using a Siemens 3T MAGNETOM Prisma Fit scanner. High-resolution T1- and T2-weighted images were acquired with a 3D ultrafast gradient echo (MPRAGE) pulse sequence using a 64-channel head coil with the following acquisition parameters for T1: 176 contiguous sagittal slices; voxel resolution 1 × 1 × 1 mm³; repetition time (RT) = 2,530 ms; echo time (ET) = 2.36 ms; image columns = 256; image rows = 256; flip angle (flip) = 7°; and for T2: 176 contiguous sagittal slices; voxel resolution 1 × 1 × 1 mm³; RT = 3,390 ms; ET = 389 ms; image columns = 204; image rows = 256; flip = 120°. The origin of the T1- and T2-weighted images was set to the anterior commissure. The two structural images were then co-registered and used as priors for the segmentation. Functional volumes consisted of 618 T2*-weighted echoplanar images, acquired with the following multiband sequence specifications: RT = 1.8 s with no time gap; ET = 29 ms; number of axial slices = 72; isotropic voxel size = 2 mm³; percent phase field of view (FOV) = 100% (192 mm); number of phase encoding steps = 84; pixel bandwidth = 2170; flip angle = 73°; GRAPPA 4 in plane (see Table S2 in the Supporting Information for a full description of the sequence).

fMRI Data Analysis
Functional data were analyzed using SPM12 (Ashburner et al., 2021) and related toolboxes (https://www.fil.ion.ucl.ac.uk/spm). Raw functional volumes were slice-time corrected taking the middle slice as reference, spatially realigned, unwarped, co-registered with the anatomical T1, and normalized to MNI (Montreal Neurological Institute) space using the unified normalization segmentation procedure. Global effects were then removed using a voxel-level linear model of the global signal proposed by Macey et al. (2004). The detrended fMRI time series were then smoothed using an isotropic 8 mm Gaussian kernel. The resulting time series from each voxel were high-pass filtered (128 s cut-off period). Statistical parametric maps were generated by modeling a univariate general linear model (GLM), using a regressor obtained by convolving the canonical hemodynamic response function with delta functions at stimulus onsets for each stimulus type (i.e., Lover's face, Unfamiliar face same gender, Unfamiliar face different gender, and House), and also including the six motion-correction parameters as regressors. In order to increase the reliability of the first-level analysis, GLM parameters were estimated with the FAST model, which uses a dictionary of covariance components based on exponential covariance functions in the context of the restricted maximum likelihood estimation (Olszowy et al., 2019). Contrast images for each of the two critical conditions compared to Houses were entered into the second-level design. A grey matter inclusive mask was defined using individual grey matter segmentations. Only those voxels with a grey matter probability value higher than 0.6 in at least 50% of participants were included. Only peaks or clusters with a significant p value after correction for multiple comparisons using family-wise error rate (FWER [p < 0.05]; Nichols & Hayasaka, 2003) are reported.
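To illustrate how a condition regressor of the kind described above can be built, the sketch below convolves unit impulses at stimulus onsets with a double-gamma hemodynamic response function sampled once per volume. It is a simplified illustration and not the SPM12 implementation: the HRF shape parameters (gamma shapes 6 and 16, undershoot ratio 1/6) are commonly used defaults assumed here, the function names are hypothetical, and the onsets are invented for the example.

```python
import math

TR = 1.8  # seconds per volume, as reported in the acquisition section


def canonical_hrf(t):
    """Double-gamma HRF: an early positive response minus a small late undershoot.

    Shape parameters 6 and 16 and the 1/6 ratio are assumed defaults.
    """
    if t <= 0:
        return 0.0
    peak = t ** 5 * math.exp(-t) / math.gamma(6)
    undershoot = t ** 15 * math.exp(-t) / math.gamma(16)
    return peak - undershoot / 6.0


def build_regressor(onsets_s, n_volumes, hrf_length_s=32.0):
    """Convolve unit impulses at the onsets with the HRF, one sample per TR."""
    sticks = [0.0] * n_volumes
    for onset in onsets_s:
        sticks[round(onset / TR)] = 1.0  # assumes onsets align with volume starts
    kernel = [canonical_hrf(k * TR) for k in range(int(hrf_length_s / TR) + 1)]
    regressor = [0.0] * n_volumes
    for i, stick in enumerate(sticks):
        if stick == 0.0:
            continue
        for k, h in enumerate(kernel):
            if i + k < n_volumes:
                regressor[i + k] += stick * h
    return regressor


# Hypothetical onsets: three blocks of one condition, 55.8 s apart.
lover_onsets = [0.0, 55.8, 111.6]
lover_regressor = build_regressor(lover_onsets, n_volumes=93)
peak_volume = max(range(31), key=lambda i: lover_regressor[i])
print(peak_volume, round(peak_volume * TR, 1))
```

In a full model, one such regressor per stimulus type would be placed alongside the six motion parameters as columns of the design matrix; the estimation details (for example the FAST autocorrelation model) are outside the scope of this sketch.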

Affective Directed (Facing Their Lover) Versus Undirected (Facing a House) Recitation
Differential brain activation was found during lover-directed versus undirected recitation in a bilaterally distributed cortico-subcortical network. This network included subcortical areas similar to those differentially activated in directed versus undirected singing in songbirds, namely the dorsal striatum (putamen and caudate nuclei), globus pallidus, and thalamus. Most strikingly, the left perisylvian language-specific system was also differentially activated by social context: Significant differences emerged in the inferior frontal gyrus (IFG), insula, anterior and posterior middle temporal gyrus (MTG), and superior temporal gyrus (STG). Functional responses from additional areas such as the fusiform gyrus, angular gyrus, supramarginal gyrus, inferior and superior parietal gyri, and cerebellum also varied with social context (see Table 2 for a detailed list of regions and Figure 2 for a representation of the response pattern).

Figure 3 displays the superimposition of the contrasts for recitation facing lovers versus houses (represented in red) and recitation facing unfamiliar persons versus houses (represented in green; also see Table 2). Activation common to the two contrasts is represented in yellow. Interestingly, while some areas were common to both contrasts, others appeared only in the affective directed condition. The left and right IFG, left and right middle frontal gyrus, left and right posterior part of the STG, left MTG, and left inferior parietal gyrus (IPG) emerged as significant only for the affective directed condition.
Possible gender-related effects concerning the unfamiliar faces were tested by contrasting unfamiliar persons of the same gender as the participant's lover with unfamiliar persons of a different gender from the participant's lover. This contrast yielded no significant results, so the plots were restricted to the comparison between lovers' and same-gender unfamiliar faces.

Controlling for Face-Specific and Familiarity Effects Using the Pre-Recitation Time Window
To disentangle whether the differential activation in socially mediated speech circuitry detected by the critical comparisons was driven by the social context (i.e., exposure to different faces), we tested for face-specific and face-familiarity responses in the pre-recitation time window, while the participants passively viewed the visual stimuli, without the linguistic task. Face-specific voxels were defined by larger blood oxygen level dependent responses for faces than for houses (Table 3). The posterior fusiform and inferior occipital gyri presented regional maxima in both cerebral hemispheres, corresponding to the fusiform face area (FFA) and occipital face area, two face-specific areas previously described in other studies (Kanwisher et al., 1997). Responses to lovers' faces were larger than responses to unfamiliar faces in several regions, including the right FFA, right inferior temporal gyrus, and right parahippocampus, regions known to be related to face identity recognition (Gobbini & Haxby, 2007; Table 3). Strikingly, when these two statistical parametric maps were used as a restrictive mask for the contrasts between directed (affective and nonaffective) and undirected recitation, we did not find any significant voxels, which indicates that during poem recitation the speech production circuit interacted with the social brain network beyond the face-specific areas.

Note. Face-specific voxels were defined by larger blood oxygen level dependent responses for faces than for houses whereas brain areas related to face familiarity were defined by comparing lover and unfamiliar faces. Both contrasts were tested in the pre-recitation time window.

DISCUSSION
This study explored brain activation during language production in various socially mediated contexts, by testing participants reciting a poem in the MRI scanner while facing a picture of their lover, an unknown person, or a house. The results showed, as expected, the recruitment of the speech production circuit during recitation. However, for the first time, we demonstrated that this cortical-striatal-thalamocortical loop underlying speech production is linked to the network underlying social cognition. The socially relevant contexts (familiar and unfamiliar persons) elicited the recruitment of a widespread bilateral circuit including regions such as the amygdala, anterior cingulate, and orbitofrontal cortex, whereas the noninformative context (houses) showed no effect on the speech production network. We also showed a neural gradient generated by the differences in the social relevance of affective and nonaffective contexts.
While the exact function of striatal substructures in human language has not yet been clearly defined, their role in motor control has been repeatedly observed. Specifically, the putamen and caudate nuclei of the striatum and the thalamus have been shown to be linked to memory and language monitoring (Hamm & Mattfeld, 2019; Saalmann & Kastner, 2015). The putamen is also involved in the control of learned vocal patterns and the initiation and execution of speech (Price, 2010; Simonyan et al., 2012). Basal ganglia nuclei are also involved in the regulation of motor actions in response to cognitive activities and the processing of emotional and motivational stimuli (Heimer et al., 1982; Turner & Desmurget, 2010). Here, we demonstrate that the activation of this cortical-striatal-thalamocortical loop during production is also socially mediated. Interestingly, some of those regions which were differentially activated in language production facing a lover or a stranger (e.g., striatum and thalamus) have also been shown to be related to "intense love," in studies presenting participants with pictures of a long-term partner versus highly and low-familiar persons (Acevedo et al., 2012). Cortical brain areas that were differentially activated in the directed poem recitation included the core areas of the ventral and dorsal streams of the speech production system (Jarvis, 2019): the IFG, premotor cortex, and insula, which subserve articulatory operations (Basilakos et al., 2018; Price, 2010); the anterior MTG and angular gyrus, which work together to support lexicosemantic combinatorial processes (Hickok & Poeppel, 2007); and the superior temporal sulcus (STS), which codes phonological information as well as sensory-based representations of speech within the auditory-motor integration network (Hickok & Poeppel, 2007; Houde & Nagarajan, 2011).
However, importantly, differential activations were not only observed in the cortical-striatal-thalamocortical loop but also in regions that have been previously related to the social brain (Lieberman, 2010). This shows that while social context modulates analogous ancestral brain regions and circuits in humans and birds, additional cortical circuits are recruited in humans, reflecting their unique capacity for language. We also found differential activation for directed versus undirected speech production in the IPG, amygdala, orbitofrontal cortex, and anterior cingulate cortex. Since these areas have been proposed to constitute a core social cognition network in perception (Lieberman, 2010), this result reveals a striking parallel between the processing of socially mediated stimuli and socially oriented language production. In the same direction, previous studies have shown that nonhuman primates have an additional affective or innate vocal production pathway involving connections from the amygdala, orbitofrontal cortex, and anterior cingulate cortex to the periaqueductal gray in the brainstem (Hage & Nieder, 2016; Jarvis, 2019). This circuit has been identified in humans, but so far mainly in the context of threatening (Hage & Nieder, 2016) or laughing (Westermann et al., 2022) vocalizations used in emotional situations. Crucially, our results reveal the existence of a similar functional pathway in language production. The activation of primitive regions dedicated to the processing of social and emotional information during the recitation of a poem challenges theories of language modularity by revealing that the language network is closely related to our emotional ancestral brain (Allott & Smith, 2021; Mesulam et al., 2021).
Previous evidence for the social modulation of language production has come from behavioral studies showing that some acoustic aspects of adult speech depend on factors related to the addressee (see Cooke et al., 2014, for a review). Regarding the association between face perception and language production, previous studies have shown that faces can act as cues for language selection in bilinguals prior to speech production (Woumans et al., 2015). Here, we add novel insight into this research on socially mediated speech production, showing that neural activations differ during speech production of the same text, depending on the social context (i.e., directed to a lover versus an unfamiliar face or house).
In the present study, the responses of the human speech production network to the addressee (a lover versus the picture of a house) mimic modulations previously observed in the analogous pathway in songbirds. Thus, this study opens a novel link between research on animals and humans, suggesting that social context might modulate brain activation during human speech production in ways that are (at least partially) equivalent to those already established for birdsong.
Despite the interesting parallel between brain circuit activation in humans and songbirds identified in the present study, we acknowledge some limitations that require further investigation: Participants in our study learned a poem by listening to a recorded version of it, and we did not explore brain activation modulation during learning, which was performed at home prior to MRI testing.
Although we tried to approximate poem learning to song pattern learning as much as possible, we acknowledge the limitation that songbirds learn by imitating a physical conspecific and not via a tape recording (Chen et al., 2016; Derégnaucourt et al., 2013). It would be interesting to explore the learning process with MRI data acquisition and with participants mimicking a "tutor" present in the room. Several studies on songbirds have shown that the presence of a tutor has a strong influence on vocal learning: Juvenile zebra finches are quicker and more accurate in learning to mimic songs when interacting with a conspecific tutor than when passively listening to the song through a loudspeaker (Chen et al., 2016; Derégnaucourt et al., 2013).
Furthermore, it would be interesting to test participants reciting the poem to their lover, an unknown person, or alone, while the intended listener was physically present in the room instead of represented by pictures. Modulation of the socially mediated brain circuits would likely be larger in such a more naturalistic setting. However, the fact that we were able to detect significant differences in both social and language networks despite using simulated/virtual tutors and addressees points to the robustness of our findings.
To conclude, the current study (1) opens up a novel line of research on socially mediated language production, showing that cortical and subcortical regions are differentially activated depending on the addressee; (2) highlights the benefits of comparing sophisticated communication practices, such as birdsong and human speech production, to better understand human language; and (3) reveals the contribution of cortical and subcortical components and overall complexity of brain circuits underlying human language.