Processing Speech and Thoughts during Silent Reading: Direct Reference Effects for Speech by Fictional Characters in Voice-Selective Auditory Cortex and a Theory-of-Mind Network

Abstract Stories transport readers into vivid imaginative worlds, but understanding how readers create such worlds—populating them with characters, objects, and events—presents serious challenges across disciplines. Auditory imagery is thought to play a prominent role in this process, especially when representing characters' voices. Previous research has shown that direct reference to speech in stories (e.g., He said, “I'm over here”) may prompt spontaneous activation of voice-selective auditory cortex more than indirect speech [Yao, B., Belin, P., & Scheepers, C. Silent reading of direct versus indirect speech activates voice-selective areas in the auditory cortex. Journal of Cognitive Neuroscience, 23, 3146–3152, 2011]. However, it is unclear whether this effect reflects differential processing of speech or differences in linguistic content, source memory, or grammar. One way to test this is to compare direct reference effects for characters speaking and thinking in a story. Here, we present a multidisciplinary fMRI study of 21 readers' responses to characters' speech and thoughts during silent reading of short fictional stories. Activations relating to direct and indirect references were compared for both speaking and thinking. Eye-tracking and independent localizer tasks (auditory cortex and theory of mind [ToM]) established ROIs in which responses to stories could be tracked for individuals. Evidence of elevated auditory cortex responses to direct speech over indirect speech was observed, replicating previously reported effects; no reference effect was observed for thoughts. Moreover, a direct reference effect specific to speech was also evident in regions previously associated with inferring intentions from communication. Implications are discussed for the spontaneous representation of fictional characters and the potential roles of inner speech and ToM in this process.


INTRODUCTION
Stories can conjure complex imaginative worlds that offer immersion and transportation for the reader (Green, 2004; Green, Brock, & Kaufman, 2004; Ryan, 1999; Gerrig, 1993). Fictional characters in particular are sometimes experienced with a vividness and complexity that can linger beyond the page (Alderson-Day, Bernini, & Fernyhough, 2017; Maslej, Oatley, & Mar, 2017). Understanding how these experiences are created by the mind, often with apparent automaticity and spontaneity, is a challenge for a wide range of disciplines beyond psychology, including literary theory, narratology, philosophy of mind, and cognitive neuroscience (Herman, 2013). Far from passively "receiving" information from the writer, readers actively and creatively engage with fictional texts in a way that draws on multiple psychological resources (Polvinen, 2016; Caracciolo, 2014; Kukkonen, 2014; Oatley, 2011; Bortolussi & Dixon, 2003).
A good example of this is provided by texts involving direct speech. When direct reference is made to a character overtly speaking in a text (he said, "the cat is over there"), it is thought to evoke a more vivid experience of the storyworld than if the same overt speech is only indirectly referred to (he said that the cat is over there). It has been suggested that the purpose of such constructions is to demonstrate (and thus depict) a situation, rather than merely describe it (Clark & Gerrig, 1990). Evidence that this could resemble hearing an actual voice is provided by Yao et al. (2011), who compared fMRI responses in auditory cortex for participants silently reading short stories that contained either direct or indirect reference to speech. Although both kinds of speech activated auditory cortex, direct speech was associated with a greater response than indirect speech in voice-selective regions of the right superior and middle temporal lobe (as defined by a separate auditory localizer task; Belin, Zatorre, Lafaille, Ahad, & Pike, 2000). As the stories were very short (three to four sentences in total) and participants were not prompted to imagine the voices, characters, or stories in any specific way, this suggests that fairly minimal textual markers for direct speech can elicit a response in cortical regions that are selective for voice perception.
If direct speech in text can prompt this kind of reaction, a second question is why readers appear to respond in this way. In a separate study, Yao, Belin, and Scheepers (2012) observed similarly enhanced responses in voice-selective regions for direct speech quotations when they were being read by a monotonous voice. Building on Barsalou's theory of embodied cognition (Barsalou, 2008), they suggested that auditory cortex activation may have a role in constructing a perceptual simulation of the emotional prosody and intonation of the speaker's voice, given that such information is either absent or diminished in the case of both silent reading and monotonous listening. This would not rule out perceptual simulation during other kinds of silent reading but characterizes direct reference as a cue to simulate suprasegmental and communicative properties of speech from text (Yao et al., 2011, 2012). The effects of direct speech and its potential consequences for simulation can be questioned, however. If direct speech prompts more vivid imagery or provides more communicative information (e.g., tone or emotional content), this would plausibly be reflected in reader comprehension. However, in a series of behavioral experiments, Eerland, Engelen, and Zwaan (2013) reported inconsistent evidence for either perceptual or communicative information being more available to readers after direct speech quotations. Instead, they suggested that the use of direct quotations prompts better memory for the verbatim content of characters' utterances, whereas indirect speech assists the building of a situation model, that is, an overall "representation of the referential situation" (Eerland et al., 2013, p. 7; van Dijk & Kintsch, 1983).
Supporting this, source memory for characters' utterances is actually enhanced for indirect, not direct, speech quotations (Eerland & Zwaan, 2018), suggesting that the potential vividness of direct speech is not used for tracking information about who said what (or could even obstruct such tracking, when compared to indirect speech). Finally, the typographical and grammatical differences between direct and indirect speech make it difficult to clearly compare their specific consequences for mental simulation. Along with potentially alerting the reader to pay attention to text, direct sentences are typically shorter than indirect sentences, are syntactically simpler, and may be expected to prompt changes in reader perspective (Köder, Maier, & Hendriks, 2015; Coulmas, 2011; Clark & Gerrig, 1990). As such, the effect of direct speech on the reader, and its potential function in the imaginative response of reading, remains unclear.
One way to explore this topic, in a way that might begin to address some of the above concerns, is to compare references to characters' speech with another kind of representation that fictional narratives can involve: characters' thoughts. Although theories of mental simulation during reading emphasize various forms of sensory and embodied simulation (e.g., Kurby & Zacks, 2013; Zwaan, Madden, Yaxley, & Aveyard, 2004), fictional narratives have been proposed to place specific sociocognitive demands on the reader (Mar & Oatley, 2008; Zunshine, 2006). Typically, a reader must track the mental states of multiple characters, following their beliefs, intentions, and desires through a narrative, to make sense of actions, decisions, and responses to events in the storyworld (Spreng, Mar, & Kim, 2009; Herman, 2008; Palmer, 2004; Gerrig, Brennan, & Ohaeri, 2001), all of which imply a central role for theory of mind (ToM) in the reading process.
How might this shed light on direct speech? First, it provides a contrasting example of direct reference. Both indirect and direct references to thinking are used in narratives. Indirect thought, which is usually considered the representational norm (Leech & Short, 2007, p. 268), is more flexible and can be used to represent verbal, preverbal, and nonverbal mental processes from the perspective of the character (e.g., he thought that X; he felt that Y; he was willing to do Z). Direct thought (also referred to as "quoted monologue"; Cohn, 1978) is used to represent, verbatim, the linguistic silent articulation of verbal thoughts (He thought "this is so complicated!").
The verbalized nature of depicting characters' thoughts is almost identical in form and complexity to direct speech (i.e., when used in a basic form; indirect thoughts in more extended narratives can be used in highly complex ways). Contrasting these forms of speech and verbal thought can therefore provide a test of Yao et al.'s (2011) interpretation of direct reference effects by assessing how specific they might be to vocal information, while at the same time controlling for typographic features. If Yao et al.'s conjecture is correct, direct reference to speech, but not necessarily thoughts, may be expected to elicit recruitment of voice-selective regions of auditory cortex, to specifically simulate the perceptual qualities of characters speaking out loud in the storyworld. In contrast, if direct speech and direct thought elicit similar responses, then a voice-specific account of direct reference would be harder to maintain. It could be the case that both speech and thoughts elicit some form of perceptual simulation under direct reference, but they would be doing so despite clear dissimilarities in the auditory scenario (one is an external utterance, and the other is a form of internal monologue). Instead, showing that direct reference to speech and thoughts prompts a generally greater response in auditory cortex could support alternative interpretations: It may, for example, simply reflect a greater level of engagement that happens when quote marks prompt the reader to pay attention to verbatim content (Eerland et al., 2013). In that scenario, there would be nothing special about speech for understanding direct reference effects.
Second, exploring sociocognitive processing and contrasting how this works for speaking and thinking are potentially highly informative for understanding direct speech effects. ToM has multiple components, each with its own developmental trajectory (Fernyhough, 2008; Tomasello, Carpenter, Call, Behne, & Moll, 2005). Whereas some socio-cognitive skills are evident early in infancy, such as the ability to follow others' attentional cues (Behne, Liszkowski, Carpenter, & Tomasello, 2012; Woodward, 1998), the ability to cognitively represent others' mental states when incorrect is thought to emerge later in childhood (typically around 4 years old; Wellman, Cross, & Watson, 2001). Similarly, understanding pragmatic information and speaker intention from prosody shows a competence-performance gap, in which vocal cues to emotion are recognized very early in infancy but are only used consistently (in the face of, e.g., conflicting cues) in older children (Esteve-Gibert & Guellaï, 2018). ToM and social cognition more generally are associated with a canonical network of regions in the medial pFC (mPFC), precuneus, and TPJ bilaterally (Molenberghs, Johnson, Henry, & Mattingley, 2016; Schurz, Radua, Aichhorn, Richlan, & Perner, 2014; Saxe & Kanwisher, 2003; Fletcher et al., 1995). Of these, representing the thoughts and intentions of others in particular has been argued to localize to regions of the TPJ and precuneus (Schurz, Tholen, Perner, Mars, & Sallet, 2017; Saxe & Powell, 2006), whereas mPFC has been linked to the processing of more constant traits associated with self and other (van Overwalle, 2009).
If direct speech prompts a detailed simulation of suprasegmental vocal information (such as emotional tone or prosody), then this may also be reflected in social-cognitive regions, specifically in areas associated with interpreting or reasoning about a speaker's communicative intentions. For example, using a nonverbal, cartoon-based story task, Ciaramidaro et al. (2007) observed that bilateral TPJ regions in particular are associated with tracking different kinds of intent associated with sociocommunicative interactions. If direct speech prompted similar activation, this would support an extension of Yao et al.'s (2011) original theory to suggest that direct reference involves constructing a broader, socioperceptual simulation than merely how a voice sounds. Contrastingly, if tracking characters and their intentions is an ultimately separate process from simulating the perceptual features of characters' voices, then no direct effect for speech would necessarily be expected in ToM regions. Instead, it is possible that references to characters' mental states, but not their speech, would be most likely to engage such regions, irrespective of any direct reference effect.
To investigate this, we adapted Yao et al.'s (2011) paradigm to include direct and indirect references to characters' verbal thoughts and speech in a 2 × 2 design. We used eye tracking and an auditory localizer task to study cortical responses specific to each individual's reading times and voice-selective regions. To explore the broader effect of direct speech in regions commonly associated with inferring communicative intentions, we also included a version of Ciaramidaro et al.'s (2007) story task as a second localizer. Many standard ToM tasks use written short stories in which characters' false beliefs must be inferred from textual information, but using such stories could be expected to overlap considerably with other reading tasks (in terms of both stimuli and task demands). Instead, by using a wordless, cartoon-based ToM task, we could avoid this potential confound with the demands of our main direct/indirect story task. On the basis of the original findings of Yao et al. (2011, 2012), we hypothesized that (i) direct reference effects would be evident for speech but not thoughts in auditory cortex. In accordance with the claim that this facilitates prosodic and communicative processing of the utterance, we also predicted that (ii) the voice-specific effect of direct reference would extend to ToM-related regions. In contrast, no direct reference effects were expected for thoughts, in either network.

METHODS

Participants
An initial sample of 30 individuals took part in the full MRI procedure, but nine participants did not produce a full data set because of the following exclusions (one incidental finding, one insufficient accuracy [<60%] on the story task, two no clear voice-selective response on the auditory localizer task, five insufficient eye-tracking data; three male, six female). As such, analysis proceeded with a final sample of 21 (age: M = 23.49, SD = 6.63; three male, 18 female). All participants were right-handed, native English speakers, with normal or corrected-to-normal vision. All procedures were approved by a university ethics subcommittee.

Story Task
Following Yao et al. (2011), participants viewed a series of short stories containing two preparation sentences (Sentences 1 and 2) and a target sentence, containing a character either (i) speaking or thinking with (ii) direct or indirect reference. On each trial, participants viewed a fixation cross for 1-2 sec (jittered at random), followed by one slide per sentence, presented sequentially (see Figure 1). Viewing times per slide were determined using the following formula: (Words × 100 msec) + (Syllables × 50 msec) + 2000 msec. Mean presentation times were 5.61 and 5.72 sec for Sentences 1 and 2, respectively, and 5.95 and 6.22 sec for direct and indirect target sentences, respectively, reflecting the slightly longer length of indirect sentences on average (18.6 words per indirect sentence compared to 16.8 for direct sentences). To allow for sufficient trials in each condition, the number of stories was increased from the 90 trials used in Yao et al. (2011) to 120, split across two 20-min runs (additional stories were prepared by a narratologist, M. B., to follow the length, complexity, and style of the original stimuli and ensure balance across the four conditions). Each run also contained three 30-sec break periods, occurring every 20 trials. An attentional check (a simple comprehension question relating to factual content from the preceding story) was included after 25% of trials, with participants having 6 sec to respond. Four random orders of trials were generated, counterbalancing the combination of voice/thought and direct/indirect target sentences across participants. Eye-tracking timings were collected as an indicator of participants' reading responses for the two preparation sentences and the target sentence. Specifically, participants' first fixation (the beginning of the sentence) and last fixation (the final line of the target sentence) within the text area were used to define reading onsets and offsets of characters' speaking and thinking in the target sentence. These were then directly included in the fMRI model to account for individual differences in the reading response.
Figure 1. Adapted story task (A) with direct reference to voices and thoughts applied to auditory and ToM localizer regions (B). Figure 1B depicts left-sided sagittal view (rendered, p < .05, FWE); note that auditory and ToM regions were observed bilaterally (see Table 1).
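As a minimal sketch, the slide viewing-time formula given for the story task can be written as a short function. The function name and the example word and syllable counts are illustrative assumptions; they are not taken from the actual stimuli.

```python
def presentation_time_ms(n_words: int, n_syllables: int) -> int:
    """Slide viewing time in msec, following the formula in the text:
    (Words x 100 msec) + (Syllables x 50 msec) + 2000 msec."""
    return n_words * 100 + n_syllables * 50 + 2000


# Hypothetical sentence with 17 words and 24 syllables (illustrative counts only).
print(presentation_time_ms(17, 24))  # 1700 + 1200 + 2000 = 4900 msec
```

Because the formula is linear in both counts, each extra word adds 100 msec plus 50 msec per syllable it contains, which is consistent with the slightly longer presentation times reported for indirect target sentences.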

Auditory Localizer Task
The auditory localizer task was identical to that used in Yao et al. (2011). Participants listened to 20 blocks of vocal stimuli and 20 blocks of nonvocal stimuli, along with 20 silent blocks that were used as a baseline. The blocks were presented randomly. Each block was 8 sec long, and the task lasted 10 min. The contrasting brain activity in response to the vocal and nonvocal stimuli reliably localizes voice-selective areas of the auditory cortex (Yao et al., 2011; Belin et al., 2000).

ToM Task
The cartoon-based ToM task was adapted from a task used by Walter et al. (2004) and Ciaramidaro et al. (2007). Participants viewed a sequence of three cartoon story vignettes ("story" phase) and were required to indicate a logical end of each story based on the three presented images ("choice" phase). The story phase included either reasoning about characters' intentions when communicating with others (e.g., a man indicating whether a seat is free on a train) or physical reasoning (e.g., a water pipe bursting). The images were displayed sequentially for 3 sec in the story phase and for 7 sec in the choice phase. The intertrial intervals lasted between 7 and 11 sec. In total, 10 ToM stories and 10 physical reasoning stories were presented in a random order. Participants answered (A, B, or C) by a button press. The task took 9 min to complete. The contrasting brain activity in response to the ToM reasoning stories compared to physical reasoning stories has been observed to prompt activity in brain regions often associated with ToM, including the right TPJ, precuneus, and anterior paracingulate cortex (Alderson-Day et al., 2016;Ciaramidaro et al., 2007;Walter et al., 2004).

Data Acquisition
fMRI data were acquired at Durham University Neuroimaging Centre using a 3-T Magnetom Trio MRI system (Siemens Medical Systems) with standard gradients and a 32-channel head coil. T2*-weighted axial EPI scans were acquired with the following parameters: field of view = 212 mm, flip angle = 90°, repetition time = 2000 msec, echo time = 30 msec, number of slices = 32, slice thickness = 3.0 mm, interslice gap = 0.3 mm, and matrix size = 64 × 64. Story task data were collected across 2 × 20-min runs consisting of 600 volumes each; auditory and ToM tasks took roughly 10 min each and consisted of 300 and 281 volumes, respectively. The first three volumes of each EPI run were discarded to allow for equilibrium of the T2 response. For each participant, an anatomical scan was acquired using a high-resolution T1-weighted 3-D sequence (number of slices = 192, slice thickness = 1 mm, matrix size = 512 × 512, field of view = 256 mm, echo time = 2.52 msec, repetition time = 2250 msec, flip angle = 9°). Eye-tracking data were collected using a LiveTrack system (Cambridge Research Systems) with MATLAB 2016b (The Mathworks, Inc.).
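The reported volume counts can be related to the stated run lengths with simple arithmetic (volumes × repetition time). The sketch below assumes only the repetition time of 2000 msec and the volume counts given above; the function and variable names are illustrative.

```python
TR_SEC = 2.0  # repetition time reported above (2000 msec)


def run_duration_min(n_volumes: int, tr_sec: float = TR_SEC) -> float:
    """Approximate run length in minutes for a given number of EPI volumes."""
    return n_volumes * tr_sec / 60.0


# Volume counts as reported in the acquisition description.
for label, n_volumes in [("story run", 600),
                         ("auditory localizer", 300),
                         ("ToM localizer", 281)]:
    print(f"{label}: {run_duration_min(n_volumes):.2f} min")
```

Under these assumptions, 600 volumes corresponds to a 20-min story run, and 300 and 281 volumes correspond to roughly 10 and 9.4 min for the two localizers, in line with the durations stated in the text.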

Data Analysis
All MRI analyses were conducted using SPM Version 12 (Wellcome Department of Cognitive Neurology) implemented in MATLAB. Images were slice-time corrected before being realigned to the first image to correct for head movement. Volumes were then normalized into standard stereotaxic anatomical Montreal Neurological Institute space using the transformation matrix calculated from the first EPI scan of each participant and the EPI template. The default settings for normalization in SPM12 and the standard EPI template supplied with SPM12 were used. The normalized data with a resliced voxel size of 2 × 2 × 2 mm were smoothed with an 8-mm FWHM isotropic Gaussian kernel to accommodate intersubject anatomical variation. The time-series data were high-pass filtered with a high-pass cutoff of 1/128 Hz, and first-order autocorrelations of the data were estimated and corrected for. Movement parameters from the realignment phase were visually inspected for outliers and included as regressors for single-participant (first-level) analyses. ROI analyses were conducted using the Marsbar toolbox (Brett, Anton, Valabregue, & Poline, 2002). Individual ROIs were defined using p < .05 corrected for FWE at cluster level, in temporal cortical regions for the auditory localizer task and clusters in mPFC, precuneus, and TPJ regions for the ToM task. Where significant clusters were not evident for individual participants at this level, a more liberal threshold of p < .001 (uncorrected) was used to maximize sensitivity to individual differences; participants who showed no clusters in these regions even at the more liberal threshold were excluded from analyses (two auditory, six ToM). All whole-brain analyses are presented at p < .05, FWE, cluster-level corrected. All statistical analyses of mean beta values were conducted using R and jamovi; figures were generated using ggplot2 and MRIcroGL.
Effect sizes are reported as Cohen's d for pairwise comparisons and ηp² values for ANOVA main and interaction effects. ηp² values of .0099, .0588, and .1379 can be considered small, moderate, and large effects, respectively (Richardson, 2011; Cohen, 1969).
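For readers who wish to check reported effect sizes, partial eta squared can be recovered from an F statistic and its degrees of freedom using the standard identity ηp² = (F × df_effect) / (F × df_effect + df_error). The sketch below is illustrative only: the function name is an assumption, and this is not a description of how the values were computed in R or jamovi.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    numerator = f_value * df_effect
    return numerator / (numerator + df_error)


# Example: an effect reported as F(1, 20) = 7.10.
print(round(partial_eta_squared(7.10, 1, 20), 2))  # 0.26
```

The same identity applies to any of the F tests reported in the Results, provided the effect and error degrees of freedom are taken from the corresponding F(df_effect, df_error) notation.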

RESULTS
Accuracy on the task was generally high (M = 82.5%, SD = 7.4%), indicating that participants maintained attention despite the 40-min duration of the task. Repeated-measures ANOVA with a 2 × 2 (Form × Reference) design was used to compare behavioral responses for the four conditions (see Table 1). No main effects, interaction effects, or pairwise comparisons were significant for condition accuracy, although we observed a nonsignificant trend for participants to be slightly less accurate on speech trials compared with thought trials, F(1, 20) = 3.34, p = .082, ηp² = .14 (p > .14 for all other effects and comparisons). For the duration of reading times, the only effect close to significance was for direct compared with indirect reference, F(1, 20) = 3.57, p = .073, ηp² = .15, which likely reflected the slightly longer lengths of indirect sentences. All other effects and comparisons for duration were also nonsignificant (all ps > .15). Reading onsets, in contrast, showed a main effect of Form, F(1, 20) = 7.10, p = .015, ηp² = .26, such that readers were quicker to start reading speech trials; follow-up pairwise comparisons indicated that this was only significantly quicker for direct speech compared with indirect thought, t = 2.20, df = 36.24, p = .035 (uncorrected), d = 0.4 (all other ps > .10).
Whole-brain analyses (included here for descriptive purposes) indicated that the vocal > nonvocal contrast from the auditory localizer task was associated with significantly greater activation in bilateral auditory cortices, across the middle temporal gyrus (MTG) and superior temporal gyrus (see Figure 1B and Table 2). Compared with baseline, each of the four reading task conditions was associated with temporal activation bilaterally, with the largest clusters being observed along the dorsal bank of the left MTG (Table 3).

Responses to Characters' Speech and Thoughts in Voice-Selective Auditory Cortex
A repeated-measures ANOVA was used to compare mean beta values in auditory ROIs for story passages containing characters' speech or thoughts (i.e., Form) in direct or indirect reference, in a 2 × 2 design. No significant main effect of Form was evident, F(1, 20) = 0.31, p = .584, ηp² = .02, although a trend was observed for Reference in favor of direct quotation, F(1, 20) = 4.00, p = .059, ηp² = .17. The interaction of Form and Reference was significant, F(1, 20) = 7.08, p = .015, ηp² = .26. As displayed in Figure 2, this was largely driven by a specific direct reference effect for characters' speech, but not thoughts. Pairwise comparisons indicated that mean beta values for direct speech were significantly higher than those for indirect speech (p = .006, d = 0.84, Bonferroni corrected), but no other pairwise contrasts were significant (all ps > .25).

Responses to Speech and Thoughts in a ToM Network
We then applied the same analyses to responses in a ToM network identified via the cartoons task. As shown in Table 1, a range of typical regions were identified in the contrast between communicative inference reasoning and physical reasoning on the task, including the mPFC, precuneus, and TPJ bilaterally. Sixteen of the 21 individuals produced ToM networks with significant clusters in at least one of these regions, and their beta values were taken forward for ROI analysis (15/16, right TPJ; 12/16, left TPJ; 7/16, precuneus; 6/16, mPFC). When the mean beta values were compared in these areas in a repeated-measures ANOVA, no main effects of Form, F(1, 15) = 0.49, p = .493, ηp² = .03, or Reference, F(1, 15) = 1.74, p = .207, ηp² = .10, were observed, but a significant interaction was again evident, F(1, 15) = 9.39, p = .008, ηp² = .38. As Figure 3 shows, this too was driven by responses for direct speech (compared with indirect speech), and this was the only significant difference between the conditions (p = .016, d = 0.90, Bonferroni corrected).
We then conducted an exploratory whole-brain analysis to investigate any further potential differences for direct versus indirect speech. Significant increases in signal for direct over indirect speech were evident in three regions: right TPJ (encompassing right angular gyrus [AG] and MTG), left inferior frontal gyrus (IFG), and left superior parietal lobule (SPL; see Figure 4). Using the online meta-analytic tool Neurosynth (Yarkoni, Poldrack, Nichols, Van Essen, & Wager, 2011), the most common functional terms associated with these regions were "network DMN" for the right AG (posterior probability = 0.73), "theory mind" for the right MTG (p = .88), "semantic" for the left IFG (p = .88), and "imagery" for the left SPL (p = .78). Despite the apparent direct speech effect in voice-selective regions of the auditory cortex, no significant increase in signal was seen for this region when correcting across the whole brain for direct versus indirect speech (see Table 3). No regions were more active in the reverse contrast (indirect > direct speech).
Other exploratory whole-brain comparisons indicated few differences between conditions. Two exceptions were direct speech versus direct thought and direct reference versus indirect reference (i.e., with speech and thought sentences combined). Direct speech compared to direct thought was associated with greater activation in the right insula and anterior and middle cingulate, including regions bordering on the pre-SMA (see Table 4B). Direct reference was observed to predominantly activate occipital and parietal regions more than indirect reference (Table 4C). Their reverse contrasts (direct thought > direct speech; indirect > direct) produced no significant clusters, even at an uncorrected significance level (p < .001, uncorrected, k > 50). Similarly, no whole-brain differences were observed between voices and thoughts overall or between indirect forms of speech and thought, either at corrected or uncorrected levels.

DISCUSSION
The aim of this study was to explore further the effect of direct speech in the brains of readers. Our main finding replicates the original effect reported by Yao et al. (2011), namely, that direct speech in short stories is accompanied by elevated responses in voice-selective auditory regions of the brain, when compared with indirect speech. Our findings go further than those of Yao et al. in two key ways. First, by comparing direct and indirect references for speech and thoughts, our ROI results demonstrate a specific effect of reference for characters who are represented as speaking, but not when they are represented as thinking. Second, this direct speech effect appears to extend beyond voice-selective auditory cortex to also include regions that are used when making inferences about communicative intentions, based on a ToM localizer task (Ciaramidaro et al., 2007). This pattern of results, therefore, supports the earlier observation that readers spontaneously engage sensory cortices when faced with direct speech, but it also implicates higher-order processes associated with gauging character intention and meaning. Evidence of a direct speech effect in auditory cortex is consistent with previous findings that such regions are recruited during silent reading of characters' speech (Yao et al., 2011, 2012), which is in turn suggestive of auditory verbal imagery being used during this process. This aligns with behavioral evidence of phonologically detailed imagery being involved in silent reading of various kinds (Kurby & Zacks, 2013; Filik & Barber, 2011). There is debate around how specific any such voice representation would be: Kurby, Magliano, and Rapp (2009) have argued that such effects are specific to familiar voices only, whereas Petkov and Belin (2013) propose that any kind of voice simulation is likely to reflect a generic speaking voice.
Their argument for this is based on phonological information specific to voice identity usually being associated with anterior temporal cortex, whereas those associated with direct speech in Yao et al. (2011), for example, are more focused on posterior temporal regions (Petkov & Belin, 2013). Our findings cannot easily arbitrate between these two possibilities (general vs. specific voices), as voice-selective auditory regions were identified along the length of the superior temporal gyri bilaterally. However, we would speculate that any simulation of a generic or specific voice is likely to vary considerably across individuals. When asked, readers describe drawing upon a wide range of active and creative strategies to imagine the voices of characters, including other familiar voices and their own voice (Alderson-Day et al., 2017).
Perhaps more notable is the suggestion of direct speech effects also being present in cortical regions often associated with ToM in general and understanding others' intentions in particular. We chose a localizer task that aimed to minimize superficial overlaps with the primary task (using cartoons instead of a written story format) and focused specifically on assessing understanding of communicative intentions over other types of ToM reasoning, such as inferring false beliefs (Ciaramidaro et al., 2007; Walter et al., 2004). This produced a network that, in our sample, primarily centered around bilateral TPJ regions but also included precuneus and mPFC in subsets of participants. Evidence of a direct speech effect in these regions provides at least prima facie support for the idea that text presented in this way prompts engagement with what a character intends to say (Yao et al., 2011, 2012), despite the mixed behavioral evidence that direct reference primes any further communicative information about characters (Eerland et al., 2013). Moreover, our analysis suggests involvement of these regions at a comparable level to responses in auditory networks, as indicated by the lack of any interaction effect across the two ROIs.
Drawing strong conclusions about the role of these regions in processing direct speech is fraught with difficulty. The areas highlighted by our ToM task are often implicated in a range of attentional and cognitive processes (Spreng et al., 2009; Mitchell, 2008), and making broader claims based on the prior literature raises the risk of reverse inference (Poldrack, 2006). Using Neurosynth (Yarkoni et al., 2011), which provides at least a systematic approach to informal reverse inference (Poldrack, 2011), the strongest responses in the localizer task were in two regions where the most common associations in the literature are with "mind tom" and "theory mind" (with posterior probabilities of .87-.90). Similarly, in the exploratory whole-brain analysis, the right MTG peak in particular showed high z scores for tests of association (z = 12.00) and uniformity (z = 14.39) in a ToM meta-analytic map of 181 studies (Yarkoni et al., 2011).
These regions have also been observed in similar work examining sociocognitive responses to fiction reading by Tamir, Bricker, Dodell-Feder, and Mitchell (2016), although in their study, they observed preferential engagement of the mPFC for social content in stories (describing a person's mental content), with medial temporal cortex more closely indexing story vividness. In contrast, most of our participants (15/16) activated the right TPJ on our localizer task (compared with only six for mPFC), and this was the only ToM region to be identified in our whole-brain analysis comparing direct and indirect speech. The right TPJ cluster that we observed in this analysis included peaks in the right AG, extending dorsally and caudally from areas that are often linked to representing others' mental states (Bzdok et al., 2013). Both left and right AG have been associated with support for the default mode network, via the generation and processing of transmodal information in the absence of stimulus input (Murphy et al., 2018) and modality-independent contributions to imagery (Daselaar, Porat, Huijbers, & Pennartz, 2010). The right AG has also recently been implicated in making valence judgments from nonverbal cues: In a paradigm where participants were asked to judge the intentions of musical alien "signals," variations in the consonance and dissonance of the stimuli (roughly corresponding to positive and negative emotions) modulated this region specifically (Bravo et al., 2017). The broader extension of this cluster, therefore, may reflect the generation and maintenance of intention-related imagery, rather than representing characters' mental states, or social content more generally. 
An association with posterior ToM regions rather than mPFC would also be consistent with Van Overwalle's (2009) distinction between a posterior ToM subsystem supporting representation of temporary and perceptually based intentions and goals, versus an anterior pFC system that tracks and integrates enduring social information over time.
When taken together, these findings broadly support the interpretation of direct speech made by Yao et al. (2011). Recall that, for Yao and colleagues, direct speech prompts auditory imagery as a means of modeling speaker prosody (and, ultimately, communicative intent). A counterhypothesis, provided by Eerland, Zwaan, and colleagues, is that direct reference acts primarily as a cue to simulate verbatim linguistic content (in other words, emphasizing the words but, arguably, not the speaker; Eerland & Zwaan, 2018; Eerland et al., 2013). Our data suggest that direct reference has a specific effect for speech, and this extends to regions that would be consistent with inferring communicative intentions. Moreover, this can be distinguished from the overall effect of direct reference, which primarily shows greater engagement in visual areas of occipital and parietal cortex (see Table 4C).
A curious characteristic of our data is the apparently contradictory results for a direct speech effect in auditory regions, which was evident in the ROI analysis but not in the whole-brain contrast. This likely reflects (i) individual variability in the temporal voice area (Belin et al., 2000), (ii) the effect of the more conservative statistical correction required across the whole brain, and (iii) the fact that both direct and indirect speech activate a range of overlapping temporal regions, so that any subsequent difference in beta values is likely to be subtle. Nevertheless, it should be noted that prominent differences across the cortex were observed in the right TPJ (as discussed), left SPL, and left IFG, much more obviously than for regions of the auditory cortex. The involvement of the latter in particular is consistent with greater demand being placed on inner speech production to support the representation of direct speech, given the common association of Broca's area with silent articulation (Alderson-Day & Fernyhough, 2015; Kühn, Fernyhough, Alderson-Day, & Hurlburt, 2014; Simons et al., 2010; Shergill et al., 2001). Evidence from psycholinguistics research suggests that greater involvement of articulatory processes in silent speech results in more detailed acoustic properties being represented in auditory imagery (Oppenheim & Dell, 2010), and both external and internal speech have been shown to consistently modulate auditory cortical responses (Okada, Matchin, & Hickok, 2018; Ylinen et al., 2014; Shergill et al., 2002). In addition, two recent studies of inner speech have highlighted how right-hemisphere homologs of left-hemisphere language regions are recruited when speech of others must be imagined (Grandchamp et al., 2019; Alderson-Day et al., 2016).
A potential model, then, would be that a reader coming across direct speech in a text is prompted to generate a communicatively plausible perceptual simulation, via inner speech, which involves the left IFG and right TPJ working in concert to modulate voice-selective regions of the auditory cortex. This is not to suggest that inner speech (and other auditory imagery processes) would not be evidenced in each of the task conditions (given the widespread activation vs. baseline seen for all conditions; see Table 2 and Figure 4A) but rather that direct speech could place a specific demand on internal articulatory processes. In this scenario, direct reference effects in auditory cortex would plausibly not be the primary component of the reader's response but a secondary consequence of inner speech (and ToM) processes, which may explain their relative lack of prominence in our whole-brain results.
Although the present results appear to have much to say about how speech is treated by readers, they perhaps say less about what is happening for characters' thoughts. Despite having received early theoretical attention in stylistics (Sharvit, 2008; Sotirova, 2004), until now, qualitative differences between direct and indirect modes of speech and thought representation have scarcely been empirically investigated (for some exceptions using free indirect discourse, see Fletcher & Monterosso, 2015; Bray, 2007). A plausible assumption would be that thought presentation (in direct or indirect reference) would be more likely to engage ToM resources (that is, a main effect of thoughts) compared with speech. Why, then, was this not seen? Insights from contemporary cognitive narratology may be useful here, particularly in relation to the problem of "accessibility" of others' thoughts. Ordinarily, stories that are used to assess ToM require the reader to make inferences about the mental states of others; their actual beliefs are not made explicit and may even conflict with the literal and immediate content of what they say and how they act (e.g., Saxe & Kanwisher, 2003). Fictional narratives may sometimes exploit this "accessibility gap" (e.g., a suspect in a mystery could have hidden motives), but they are also notable because they can give us apparent access to other minds via direct and indirect references (Bernini, 2016; Cohn, 1978). Although the stimuli used in our experiment contained mental content, they did not necessarily make demands in terms of mental state inference: In the thought trials used in our experiment, the inner life of the character is laid bare (e.g., "He thought that he should go to the shop"). Direct speech, in contrast, does not signal the intonation, emotion, or intention of a character; they must be simulated or otherwise inferred by the reader, in a way that the ToM system is often considered to do (Saxe & Kanwisher, 2003).
As such, although counterintuitive, our findings are in line with common views about mental state inference (Spreng et al., 2009). It may also be the case that direct speech in general is more vivid and salient than direct reference to thinking, given the whole-brain differences between speaking and thinking seen in anterior insula and dorsal ACC (Uddin, 2016), and the quicker orienting times we observed for speech trials. Engagement with fictional storyworlds and characters is often argued to depend on the "experiential traces" the reader brings from his or her own life (Zwaan, 2008): The more we have access to an experience in the real world, the more it will be used to generate vivid and imaginative responses during reading. When one considers the diminished, quasi-perceptual phenomenology that verbal thoughts are often claimed to possess (Prinz, 2011; Jones & Fernyhough, 2007), it is perhaps no surprise that characters' thinking in a text did not produce patterns of activation that were as distinct as those for direct speech.
Another perspective, also provided by cognitive literary studies, is to consider how fictional minds may be differently represented from the outside and the inside. Kuzmičová (2013), for example, has suggested that we experience characters' speech in literary texts as either "outer reverberations" (when we read, as vicarious listeners, about a character overtly speaking) or "inner reverberations" (when we voice a character's words within his or her perspective). In parallel, Caracciolo (2014) has highlighted the contrast between attributing intentions to characters and the direct, inner enactment of a character's thoughts and fictional consciousnesses more broadly. These distinctions parallel the extensive literature on perspective-taking and how this is instantiated in the brain (e.g., Ruby & Decety, 2001). It could be the case that our different conditions prompted readers to adopt first- or third-person perspectives in response to speech compared with thoughts or direct compared with indirect reference. However, the direction of these shifts is not straightforward: Although it is sometimes assumed that direct speech necessarily prompts adopting a first-person perspective (speaking as the character), it is also understood as focusing the reader on what it would be like to hear the character speak to them (Clark & Gerrig, 1990). Similarly, thoughts could be seen to prime a first-person perspective (thinking "from the inside"), but this will likely depend on the position of the narrator, the reader's identification with the character, and the wider context of the narrative (Kuiken, Miall, & Sikora, 2004). As such, a key area for further exploration is to systematically examine how perspective shifts potentially interact with direct reference effects and speech/thought distinctions.
This study has a number of limitations. First, it was necessary to exclude some participants because of partial data from eye-tracking or either of the independent localizer tasks, limiting the overall sample size. This also further skewed our sex ratio, such that male participants are underrepresented in our eventual sample (as can often be the case for psychology studies recruited from university populations, e.g., Dickinson, Adelson, & Owen, 2012). Given the wide variability in individual differences for reading responses, we chose to deploy these measures to be as specific as possible about both participants' onset and offset times of reading target sentences and to allow for the use of individually specific cortical networks. This did not prohibit the recruitment of a larger sample than the original study we sought to replicate (Yao et al., 2011), but for a topic (imagery) with typically small effects and potentially large variation, replication in larger samples will be required for the exploration of individual differences in imagery production across different kinds of readers. Inner speech and imagery are highly susceptible to individual differences in day-to-day use (Alderson-Day et al., 2016), and effects of expertise (Borst, Niven, & Logie, 2011) and variation across readers seem highly likely.
Second, our use of direct reference for thoughts (such as he thought "I should have finished this paper by now") could be questioned in terms of its relative familiarity for readers. One of our aims for the study was to use a stimulus that could act as a typographical and grammatical control comparison for direct and indirect speech. Although use of quotation marks for thoughts does feature in narratives, indirect references might be thought of as many authors' default option when referring to characters' mental states (Leech & Short, 2007). An alternative form of reference, such as using italics to mark characters' thoughts, may have been more familiar to readers but would also have added further typographical differences to the original contrast of interest: direct versus indirect speech. The lack of any behavioral differences (in terms of accuracy or reading time) between the thought conditions, and the lack of any pairwise or whole-brain differences, would suggest that this had little effect on our participants. However, further careful behavioral (and, arguably, interdisciplinary) work, incorporating the valuable insights of cognitive literary studies, is clearly required to elucidate how readers interpret these kinds of text constructions when depicting characters' mental states.
Finally, a related point about generalizability concerns the fictional stories used in the experiment. For experimental purposes, we used very minimal stories that were unlikely to prompt extensive use of many of the processes thought to be relevant to a reader's experience of a text, whether that involves identification with characters, use of prior knowledge, management of expectations, or feelings of transportation (Miall, 2011; Green, 2004; Kuiken et al., 2004). As such, this is still a very artificial reading scenario for many participants. We cannot rule out the possibility that there was something about this situation in particular that may have posed unusual demands or biased readers' responses, such as encouraging them to pay attention to or engage more with specific aspects of the text (such as voices in particular). Our attentional checks would adjudicate against this interpretation, as no significant differences in accuracy were observed across the various task conditions, but, in considering ecological validity, the experiential gap between full stories and these experimental sketches must be borne in mind.
Notwithstanding these limitations, our findings have important implications for future research on fiction, reading, and imagination more generally. Our data broadly support social cognitive approaches to fiction (Oatley, 2016; Tamir et al., 2016), but in a complex and unexpected way. On the one hand, the potential involvement of ToM in simulating episodes of characters' speech opens a new avenue for research on fiction and mentalizing; on the other hand, our findings for representing characters' thoughts challenge the idea that engaging with the mental states of others via fiction necessarily involves (or could even enhance) ToM processes. Our findings also highlight how readers likely draw on multiple perceptuomotor resources to support a socially informed simulation of speech, where prompted by the text. This is, arguably, a creative and constructive process on the part of the reader, which will be contingent on their own imaginative skills and experience. Along with comparing individual differences in this process, contrasting forms of reference for speech and verbal thoughts offer a comparative methodology for exploring how readers track speaking and thinking through more complicated narratives. Free indirect discourse, as seen in many modernist texts, demands that the reader follow closely, or even make their own inference, about exactly who is speaking or thinking in a story (see Waugh, 2011, for a discussion of this topic). Here, reference or its absence could be considered as an experimental tool to challenge the readers and place them in situations of uncertainty about the speech and thoughts in a narrative (as in Fletcher & Monterosso, 2015). In this respect, more challenging texts offer an opportunity to push at the limits of readers' creative and imaginative capacities.

Conclusions
In conclusion, references to direct speech in fictional stories are associated with the recruitment of not only voice-selective auditory cortex but also regions that may implicate gauging of characters' communicative intentions. Moreover, this is a process that is apparently specific to speech. We cannot conclude on the basis of these findings that the function of this process is communicative inference per se, but we speculate that it goes beyond a purely perceptual simulation of voice and requires coordination between inner speech and ToM resources. To experience a character's voice in a story, in this sense, may be about not only what they say but also how they say it and what they intend.