Information-Restricted Neural Language Models Reveal Different Brain Regions’ Sensitivity to Semantics, Syntax, and Context

Abstract A fundamental question in neurolinguistics concerns the brain regions involved in syntactic and semantic processing during speech comprehension, both at the lexical (word processing) and supra-lexical levels (sentence and discourse processing). To what extent are these regions separated or intertwined? To address this question, we introduce a novel approach exploiting neural language models to generate high-dimensional feature sets that separately encode semantic and syntactic information. More precisely, we train a lexical language model, GloVe, and a supra-lexical language model, GPT-2, on a text corpus from which we selectively removed either syntactic or semantic information. We then assess to what extent the features derived from these information-restricted models are still able to predict the fMRI time courses of humans listening to naturalistic text. Furthermore, to determine the windows of integration of brain regions involved in supra-lexical processing, we manipulate the size of contextual information provided to GPT-2. The analyses show that, while most brain regions involved in language comprehension are sensitive to both syntactic and semantic features, the relative magnitudes of these effects vary across these regions. Moreover, regions that are best fitted by semantic or syntactic features are more spatially dissociated in the left hemisphere than in the right one, and the right hemisphere shows sensitivity to longer contexts than the left. The novelty of our approach lies in the ability to control for the information encoded in the models’ embeddings by manipulating the training set. These “information-restricted” models complement previous studies that used language models to probe the neural bases of language, and shed new light on its spatial organization.


Introduction
Understanding the neural bases of language processing has been one of the main research efforts in the neuroimaging community for the past decades (see, e.g., Friederici, 2011; Binder et al., 2009, for reviews). However, the complex nature of language makes it difficult to discern how the various processes underlying language processing are topographically and dynamically organized in the human brain, and therefore many questions remain open to this day.
One central open question is whether semantic and syntactic information are encoded and processed jointly or separately in the human brain. Language comprehension requires accessing word meanings (lexical semantics), but also composing these meanings to construct the meaning of entire sentences. In languages such as English, semantic composition strongly depends on word order: for example, 'The boy kissed the girl' has a different meaning from 'The girl kissed the boy', although both sentences contain exactly the same words. The brain constructs these different meanings conditionally on word order, which is the backbone of sentence processing, as it indicates how to combine the lexical meanings of a sentence's sub-parts. Importantly, the meaning of a new sentence would be constructed in roughly the same way as long as the structure of the sentence remains the same ('The X kissed the Y'), independently of the lexical meanings of the individual nouns ('boy' and 'girl'). This combinatorial property of language makes it possible to construct the meanings of sentences we have never heard before, and suggests that it might be computationally advantageous for the brain to have developed neural mechanisms for composition that are separate from those dedicated to the processing of lexico-semantic content. Such neural mechanisms for composition would be sensitive only to the abstract structure of sentences, and would implement the syntactic rules according to which sentence parts should be composed.
However, in parallel, an opposing view has argued that semantics and syntax are processed in a common distributed language processing system (Bates and MacWhinney, 1989; Dick et al., 2001; Bates and Dick, 2002). Recent work in support of this view has raised concerns regarding the replicability of some of the early results from the modular view (Siegelman et al., 2019) and provided evidence that semantic and syntactic processing in the language network might not be so easily dissociated from one another (Mollica et al., 2018; Fedorenko et al., 2020).
Neuroimaging studies cited to defend one or the other view have mainly relied on one of two methodological approaches: on the one hand, controlled experimental paradigms, which manipulate words or sentences (Mazoyer et al., 1993; Bottini et al., 1995; Stromswold et al., 1996; Caplan et al., 1998; Pallier et al., 2011), and, on the other hand, naturalistic paradigms, which make use of stimuli closer to what one encounters in a daily environment. The former approach probes linguistic dimensions in one of the following ways: varying the presence or absence of syntactic or semantic information (Friederici et al., 2003, 2009a), or varying the difficulty of the syntactic structure or of the semantic interpretation (e.g., Cooke et al., 2001; Friederici et al., 2009b; Kinno et al., 2007; Newman et al., 2010; Santi and Grodzinsky, 2010). However, the conclusions from such studies may be bound to the peculiarities of the task and setup used in the experiment (Nastase et al., 2020). To overcome these shortcomings, researchers have in recent years become increasingly interested in "Ecological Paradigms", in which participants are engaged in more natural tasks, such as conversation or story listening (Regev et al., 2013; Lerner et al., 2011; Wehbe et al., 2014; Nastase et al., 2021; Pasquiou et al., 2022; LeBel et al., 2022). This avoids task-induced biases and takes into consideration both the lexical and supra-lexical levels of syntactic and semantic processing. Integrating supra-lexical information is essential for understanding language processing in the brain, because the lexical-semantic information of a word, and the resulting semantic compositions, depend on its context.
More recently, following advances in natural language processing, neural language models have been increasingly employed in the analysis of data collected from ecological paradigms. Neural language models are models based on neural networks, which are trained to capture the joint probability distributions of words in sentences using next-word or masked-word prediction tasks (e.g., Elman, 1991; Pennington et al., 2014; Devlin et al., 2019; Radford et al., 2019). In doing so, the models have to learn semantic and syntactic relations among word tokens in the language. To study brain data collected from ecological paradigms, neural language models are presented with the same sentence stimuli; their activations (a.k.a. embeddings) are then extracted and used to fit and predict the brain data (Wehbe et al., 2014; Huth et al., 2016; Pasquiou et al., 2022; Caucheteux and King, 2022). This approach has led to several discoveries, such as the wide networks associated with semantic processing uncovered by Huth et al. (2016) using word embeddings (see also Pereira et al., 2018a), or the context-sensitivity maps discovered by Jain and Huth (2018) and Toneva and Wehbe (2019).
Despite these advances and extensive neuroscientific and cognitive explorations, the neural bases of semantics, syntax and the integration of contextual information remain debated. In particular, a central puzzle remains in the field: studies investigating syntax and semantics have found vastly distributed networks when using naturalistic stimuli (Fedorenko et al., 2020; Caucheteux et al., 2021), while others have found more localized activations for syntax, typically in inferior frontal and posterior temporal regions, when using constrained experimental paradigms (e.g., Pallier et al., 2011; Matchin et al., 2017). Thus, whether there is a hierarchy of brain regions integrating contextual information, and the extent to which syntactic information is processed independently of semantic information in at least some brain regions, remain largely debated to date.
So far, insights from neural language models into this central puzzle have also been rather limited. This is mostly due to the complexity of the models in terms of size, training and architecture. This complexity makes it difficult to identify how and what information is encoded in their latent representations, and how to use their embeddings to study brain function.
Caucheteux et al. (2021) used a neural language model, GPT-2, in a novel way to separate semantic and syntactic processing in the brain. Specifically, using a pre-trained GPT-2 model, they built syntactic predictors by averaging the embeddings of words from sentences that shared syntactic but not semantic properties, and used them to identify syntax-sensitive brain regions. They defined semantic-sensitive brain regions as those that were better predicted by GPT-2's embeddings computed on the original text than by the syntactic predictors. They observed that syntax and semantics, defined in this way, rely on a common set of distributed brain areas.
Jain and Huth (2018) used pre-trained LSTM models to study context integration. They varied the amount of context used to generate word embeddings, and obtained a map indicating brain regions' sensitivity to different sizes of context.
Here, we propose a new approach to tackle the questions of syntactic vs. semantic processing and contextual integration, by fitting brain activity with word embeddings derived from information-restricted models. By this, we mean models trained on text corpora from which specific types of information (syntactic, semantic, or contextual) were removed. We then assess the ability of these information-restricted models to fit brain activations, and compare it to the predictive performance of a neural model trained on the original dataset.
More precisely, we created a text corpus of novels from the Gutenberg Project (http://www.gutenberg.org) and used it to define three different sets of features: (i) Integral features, the full text from the corpus; (ii) Semantic features, the content words from the corpus; (iii) Syntactic features, where each word and punctuation sign from the corpus is replaced by its syntactic characteristics.
We then trained two types of models on each feature space: a non-contextual model, GloVe (Pennington et al., 2014), and a contextual model, GPT-2 (Radford et al., 2019) (see Fig. 1A). The text transcription of the audio-book to which participants listened in the scanner was then presented to the neural language models, from which we derived embedding vectors. After fitting these embedded representations to fMRI brain data with linear encoding models, we computed the cross-validated correlations between the encoding models' predicted time courses and the observed time series. In a first set of analyses, this allowed us to quantify the sensitivity to syntactic and semantic information in each voxel (Fig. 1B). In a second set of analyses, we identified brain regions integrating information beyond the lexical level. We first compared the contextual model (GPT-2) and the non-contextual model (GloVe), before investigating the brain regions processing short (5 words), medium (15 words) and long (45 words) contexts, using the non-contextual GloVe model as a 0-context baseline (see Fig. 1C).
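The encoding procedure can be illustrated with a minimal sketch: synthetic regressors standing in for HRF-convolved embeddings are fitted to a simulated voxel time course with ridge regression, and the cross-validated correlation between predicted and observed signal gives the per-voxel score. The dimensions, regularization strength and data below are hypothetical, not the study's actual settings.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Hypothetical data: regressors standing in for HRF-convolved word
# embeddings, and one voxel's BOLD time series (n_scans time points).
n_scans, n_dims = 300, 50
X = rng.standard_normal((n_scans, n_dims))
weights = rng.standard_normal(n_dims)
y = 0.1 * (X @ weights) + rng.standard_normal(n_scans)

# Cross-validated correlation between predicted and observed signal:
# fit a ridge regression on training folds, predict the held-out fold,
# and correlate the prediction with the measured time course.
scores = []
for train, test in KFold(n_splits=5).split(X):
    model = Ridge(alpha=100.0).fit(X[train], y[train])
    pred = model.predict(X[test])
    scores.append(np.corrcoef(pred, y[test])[0, 1])

r = float(np.mean(scores))  # the per-voxel score mapped onto the brain
```

In the study, this fit is computed separately for every voxel and every feature space, producing whole-brain correlation maps.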

Dissociation of syntactic and semantic information in embeddings
We first assessed the amount of syntactic and semantic information contained in the embedding vectors derived from GloVe and GPT-2 trained on the different sets of features.In order to do so, we trained logistic classifiers to decode either the semantic category or the syntactic category from the embeddings generated from the text of The Little Prince.
The decoding performances of the logistic classifiers are displayed in Fig. 2. The models trained directly on the integral features, that is, the intact texts, have relatively high performance on both tasks (75% on average for both GloVe and GPT-2). The models trained on the syntactic features performed well on the syntax decoding task (decoding accuracy > 95%), but were near chance level on the semantic decoding task (decoding accuracy around 25%, with a chance level of 16%). Similarly, the models trained on the semantic features displayed good performance on the semantic decoding task (decoding accuracy greater than 80%), but relatively poorer decoding accuracy on the syntax decoding task (45%; chance level: 16%). These results validate the experimental manipulation by showing that syntactic embeddings essentially encode syntactic information and semantic embeddings essentially encode semantic information.
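The decoding analysis can be sketched as follows, with invented embeddings and labels standing in for the real ones (the study decodes semantic and syntactic categories from GloVe/GPT-2 embeddings of The Little Prince; the data, dimensions and class structure here are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical setup: 600 word embeddings (50 dims) with, for each word,
# one of 6 category labels (so chance level is ~16%, as in the paper).
n_words, n_dims, n_classes = 600, 50, 6
labels = rng.integers(0, n_classes, n_words)
centers = rng.standard_normal((n_classes, n_dims))  # one cluster per class
X = centers[labels] + 0.5 * rng.standard_normal((n_words, n_dims))

# Cross-validated decoding accuracy of a logistic classifier: if the
# embeddings carry the category information, accuracy exceeds chance.
clf = LogisticRegression(max_iter=1000)
acc = cross_val_score(clf, X, labels, cv=5).mean()
```

An embedding that does not encode a given type of information (e.g., a syntactic embedding probed for semantic categories) would yield an accuracy close to the 16% chance level.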

Correlations of fMRI data with syntactic and semantic embeddings
Our objective was to evaluate how well the embeddings computed from GloVe and GPT-2 on the syntactic and semantic features fit the fMRI signal in various parts of the brain. For each model/features combination, we computed the increase in R score when the resulting embeddings were appended to a baseline model that comprised low-level variables (acoustic energy, word onsets and lexical frequency). This was done separately for each voxel. The resulting maps are displayed in Fig. 3A.
The maps reveal that semantic and syntactic feature-derived embeddings from GloVe or GPT-2 significantly explain the signal in a set of bilateral brain regions including frontal and temporal regions, as well as the Temporo-Parietal Junction, the Precuneus and the Dorso-Medial Prefrontal Cortex (dMPC). The classical left-lateralized language network, which includes the Inferior Frontal Gyrus (IFG) and the Superior Temporal Sulcus (STS), is entirely covered. Overall, a vast network of regions is modulated by both semantic and syntactic information.
Nevertheless, detailed inspection of the maps shows different R score distribution profiles (see Appendix 1, R Scores Distribution for GloVe and GPT-2 Trained on Semantic or Syntactic Features, Appendix 1-Fig. 4). For example, syntactic embeddings yield the highest fits in the Superior Temporal Lobe, extending from the Temporal Pole (TP) to the Temporo-Parietal Junction (TPJ), as well as in the Inferior Frontal Gyrus (IFG, BA-44 and 47), the Superior Frontal Gyrus (SFG), the Dorso-Medial Prefrontal Cortex (dMPC) and the posterior Cingulate cortex (pCC). Semantic embeddings, on the other hand, show peaks in the posterior Middle Temporal Gyrus (pMTG), the Angular Gyrus (AG), the Inferior Frontal Sulcus (IFS), the dMPC and the Precuneus/pCC.

Regions best fitted by semantic or syntactic embeddings
As noted above, despite the fact that the regions fitted by the semantic and syntactic embeddings largely overlap (Fig. 3A), the areas where each model reaches its highest R scores differ. To better visualize the maxima of these maps, we selected, for each of them, the 10% of voxels having the highest R scores. Thresholding at the 90th percentile of the distributions (threshold values displayed in Appendix 1-Fig. 4) produces the maps presented in Fig. 3B.
A first observation is that the number of supra-threshold voxels is quite similar in the left (19%) and right (21%) hemispheres, whether GPT-2 or GloVe is considered, showing that during the processing of natural speech, both syntactic and semantic features modulate activations in both hemispheres to a similar extent. The regions involved include, bilaterally, the TP, the STS, the IFG and IFS, the DMPC, the pMTG, the TPJ, the Precuneus and the pCC.
One noticeable difference between the two hemispheres, apparent in Fig. 3B, concerns the overlap between the semantic and syntactic peak regions: it is stronger in the right than in the left hemisphere. To assess this overlap, we computed the Jaccard indices (see Jaccard index) between voxels modulated by syntax and voxels modulated by semantics. The Jaccard indices were much larger in the right hemisphere (J_GloVe = 0.52 and J_GPT-2 = 0.60) than in the left (J_GloVe = 0.14). The left hemisphere displayed distinct peak regions for semantics and syntax: syntax involved the STS, the pSTG, the anterior TP, the IFG (BA-44/45/47) and the MFG, while semantics involved the pMTG, the AG, the TPJ and the IFS. We only observed overlap in the upper IFG (BA-44), the AG and the posterior STS. On the medial faces, semantics and syntax share peak regions in the Precuneus, the pCC and the DMPC. In the right hemisphere, syntax and semantics share the STS, the pMTG and most frontal regions, with syntax-specific peak regions only in the TP and SFG, and semantics-specific peak regions only in the TPJ.
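The Jaccard index used here has a direct implementation: the ratio of the intersection to the union of the two binary voxel selections. The masks below are toy stand-ins for the top-10% voxel maps.

```python
import numpy as np

def jaccard(mask_a, mask_b):
    """Ratio of intersection to union of two binary voxel masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union

# Toy masks: True where a voxel is in the top 10% of R scores for the
# semantic (resp. syntactic) model.
sem = np.array([1, 1, 0, 0, 1, 0], dtype=bool)
syn = np.array([1, 0, 0, 1, 1, 0], dtype=bool)
print(jaccard(sem, syn))  # 2 shared voxels / 4 selected voxels -> 0.5
```

An index of 1 would mean the semantic and syntactic peak regions coincide exactly; an index near 0 means they are disjoint, as observed in the left hemisphere.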
Overall, this shows that the neural correlates of syntactic and semantic features appear more separable in the left hemisphere than in the right.

Gradient of sensitivity to syntax or semantics
The analyses presented above revealed a large distributed network of brain regions sensitive to both syntax and semantics but with varying local sensitivity to both conditions.
We further investigated these differences by defining a specificity index that reflects, for each voxel, the logarithm of the ratio between the R scores derived from the semantic and the syntactic embeddings (see Specificity index). A score of s indicates that the voxel is 10^s times more sensitive to semantics than to syntax if s > 0 (green), and, conversely, 10^(-s) times more sensitive to syntax than to semantics if s < 0 (red). Voxels with specificity indexes close to 0 are colored in yellow and show equal sensitivity to both conditions. Specificity indexes are plotted on surface maps in Fig. 4. The top row shows the specificity index of voxels where there was a significant effect for syntactic or for semantic embeddings in Fig. 3A, while the bottom row shows group specificity indexes corrected for multiple comparisons using an FDR correction of 0.005 (N=51).
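As a sketch, the specificity index of a voxel is the base-10 logarithm of the ratio of its two R scores; the values below are made up for illustration.

```python
import numpy as np

def specificity_index(r_sem, r_syn):
    """log10 of the ratio between a voxel's semantic and syntactic R scores:
    s > 0 means the voxel is 10**s times more sensitive to semantics,
    s < 0 means it is 10**(-s) times more sensitive to syntax."""
    return np.log10(r_sem / r_syn)

print(specificity_index(0.2, 0.02))  # ~ 1.0  (semantics-preferring voxel)
print(specificity_index(0.02, 0.2))  # ~ -1.0 (syntax-preferring voxel)
```

The logarithm makes the index symmetric: swapping the two scores only flips its sign, so equal sensitivity maps to 0 rather than to 1.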
The top row of Fig. 4 shows that voxels more sensitive to syntax include, bilaterally, the anterior Temporal Lobes (aTL), the STG, the Supplementary Motor Area (SMA), the MFG and subparts of the IFG. Voxels more sensitive to semantics are located in the pMTG, the TPJ/AG, the IFS, the SFS and the Precuneus. Voxels sensitive to both types of features are located in the posterior STG, the STS, the dMPC, the CC, the MFG and the IFG.
More specifically, in the bottom row of Fig. 4, one can observe significantly low ratios (in favor of the syntactic embeddings) in the STG, aTL and pre-SMA, and significantly large ratios (in favor of the semantic embeddings) in the pMTG, the AG and the IFS. The specificity index maps are consistent with the maps of R score differences between semantic and syntactic embeddings for GloVe and GPT-2 (see Appendix 1-Fig. 5), but provide more insight into the relative sensitivity to syntax and semantics. These maps highlight that some brain regions show stronger responses to the semantic or to the syntactic condition even when they are sensitive to both.

Unique contributions of syntax and semantics
The previous analyses allowed us to quantify the amount of brain signal explained by the information encoded in the various embeddings. Yet, when two embeddings explain the same amount of signal, that is, when they have similar R scores, it remains unclear whether they rely on information represented redundantly in both embeddings or on information specific to each of them. To address this issue, we analyzed the additional information brought by each embedding on top of the other one. To this end, we evaluated the correlations uniquely explained by the semantic embeddings compared to the syntactic embeddings, and conversely.
To quantify the unique contribution of each feature space to the prediction of the fMRI signal, we first estimated the Pearson correlation explained by the embeddings learned from each individual feature space, e.g., using only syntactic embeddings or only semantic embeddings. We then assessed the correlation explained by the concatenation of the embeddings derived from the different feature spaces, e.g., concatenating the syntactic and semantic embedding vectors (de Heer et al., 2017).
Because it can identify single voxels whose responses can be partly explained by different feature spaces, this approach provides more information than simple subtractive analyses that estimate the R score difference per voxel (see Appendix 1-Fig.5).
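This variance-partitioning logic can be sketched with synthetic feature spaces that share some variance; the names, dimensions and the simulated voxel below are hypothetical. The unique contribution of one space is read off the drop from the joint model's score to the score of the other space alone.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Hypothetical feature spaces: a "syntactic" and a "semantic" embedding
# space sharing 5 dimensions of variance, plus a simulated voxel driven
# by the shared part and by one unique dimension of each space.
n = 400
shared = rng.standard_normal((n, 5))
X_syn = np.hstack([shared, rng.standard_normal((n, 10))])
X_sem = np.hstack([shared, rng.standard_normal((n, 10))])
y = X_syn[:, 5] + X_sem[:, 5] + shared[:, 0] + rng.standard_normal(n)

def cv_r(X, y):
    """Cross-validated correlation between ridge predictions and signal."""
    pred = cross_val_predict(Ridge(alpha=10.0), X, y, cv=5)
    return np.corrcoef(pred, y)[0, 1]

r_syn, r_sem = cv_r(X_syn, y), cv_r(X_sem, y)
r_joint = cv_r(np.hstack([X_syn, X_sem]), y)

unique_syn = r_joint - r_sem  # correlation explained by syntax alone
unique_sem = r_joint - r_syn  # correlation explained by semantics alone
```

If the two spaces were fully redundant for this voxel, the joint model would not improve on either space alone and both unique contributions would be near zero.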
Syntactic embeddings (Fig. 5A) uniquely explained brain data in localized regions, namely the STG, the TP, the pre-SMA and the IFG, with R score increases of about 5%.
Semantic embeddings (Fig. 5B) uniquely explained signal bilaterally in the same wide network of brain regions as the one highlighted in Fig. 3A, including frontal and temporo-parietal regions bilaterally as well as the Precuneus and pCC medially, with similar R score increases of around 5%.
This suggests that even though most of the brain is sensitive to both the syntactic and semantic conditions, syntax is preferentially processed in more localized regions, whereas semantics is widely distributed.

Synergy between syntax and semantics
To probe regions where the joint effect of syntax and semantics is greater than the sum of the contributions of these features, we compared the R scores of the embeddings derived from the integral features with the R scores of the encoding models concatenating the semantic and syntactic embeddings (see Fig. 5C).
For the embeddings obtained with GloVe, this analysis did not reveal any significant effect. For the embeddings obtained with GPT-2, significant effects were observed in most of the brain, but with higher effects in the semantic peak regions: the pMTG, TPJ, AG and frontal regions.

Integration of contextual information
To further examine the effect of context, we compared GPT-2, the supra-lexical model which takes context into account, to GloVe, a purely lexical model. The differences in R scores between the two models, trained on each of the three datasets, are presented in Fig. 6.
GPT-2 embeddings elicit stronger R scores than GloVe embeddings. The difference spreads over wider regions when the models were trained on the syntactic features than on the semantic ones (see Fig. 6, top left and right). The comparison for syntax led to significant differences bilaterally in the STS/STG, from the Temporal Pole to the TPJ, in superior, middle and inferior frontal regions, and medially in the pCC and dMPC. For semantics, the comparison only led to significant differences in the Precuneus, the right STS and the posterior STG. Fig. 6 (bottom left) shows the comparison between GPT-2 and GloVe when trained on the Integral features. Given that both semantic and syntactic contextual information were available to GPT-2, these maps reflect the regions that benefit from context during story listening. Showing that context has an effect is one thing, but different brain regions are likely to have integration windows of different sizes. To address this question, we developed a fixed-context-window training protocol to control the amount of contextual information used by GPT-2 (Fig. 1C). We trained models with short (5 tokens), medium (15 tokens) and long (45 tokens) context windows. This ensured that GPT-2 was not sampling out of the learnt distribution at inference time, and was not using more context than what was available in the context window.
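The bounded-context manipulation can be sketched as follows. Here `embed` is a deterministic stand-in for a contextual encoder (the study retrains GPT-2 itself with the fixed window); the only point illustrated is that a word's representation may depend on at most `context_size` tokens of history.

```python
import numpy as np

def embed(tokens):
    """Stand-in contextual encoder: a deterministic pseudo-random vector
    for the last token, conditioned on the whole visible context."""
    seed = abs(hash(tuple(tokens))) % (2**32)
    return np.random.default_rng(seed).standard_normal(8)

def bounded_context_embeddings(tokens, context_size):
    """The embedding of token i uses only tokens i-context_size+1 .. i."""
    return [embed(tokens[max(0, i + 1 - context_size): i + 1])
            for i in range(len(tokens))]

words = "the little prince lived on a very small planet".split()
short = bounded_context_embeddings(words, 5)    # short-range model
long_ = bounded_context_embeddings(words, 45)   # long-range model
# The first 5 words see identical contexts under both window sizes;
# later words differ, since the long model conditions on more history.
```

Contrasting the fits of the short- and long-window representations then isolates voxels whose responses depend on context beyond 5 tokens.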
Comparing GPT-2 with 5 tokens of context to GloVe (0-size context) highlighted a large network of frontal and temporo-parietal regions; medially, it included the Precuneus, the pCC and the DMPC (Fig. 6, short). Short-context sensitivity showed peak effects in the Supramarginal gyri, the pMTG and, medially, in the Precuneus and pCC. Counting the number of voxels showing significant short-context effects revealed an asymmetry between the hemispheres, with 1.6 times more significant voxels in the left hemisphere than in the right. Contrasting a GPT-2 model using 15 tokens of context (the average size of a sentence in The Little Prince) with a GPT-2 model using only 5 tokens yielded localized significant differences in the SFG/SFS, the TP, the MFG and the STG near Heschl's gyri, and medially in the Precuneus and pCC (Fig. 6, medium). The largest medium-context effects included the left MFG, the right SFG and DMPC, and, bilaterally, the Precuneus and pCC. Finally, contrasting models using respectively 45 and 15 tokens of context revealed 2.8 times as many significant differences in the right hemisphere as in the left. Significant effects were highest bilaterally and medially in the pCC, followed, in the right hemisphere, by the Precuneus, the DMPC, MFG, SFG, STS and TP (see Fig. 6, bottom). Taken together, our results show (1) that syntax dominantly determines the integration of contextual information, (2) that a bilateral network of frontal and temporo-parietal regions is modulated by short context, (3) that short-range context integration is preferentially located in the left hemisphere, (4) that the right hemisphere is involved in the processing of longer context sizes, and finally (5) that medial regions (the Precuneus and pCC) are core regions of context integration, showing context effects at all scales.

Discussion
Language comprehension in humans is a complex process which involves several interacting subcomponents (word recognition, processing of syntactic and semantic information to construct sentence meaning, pragmatic and discourse inference, etc.) (e.g., Jackendoff, 2002). Discovering how the brain implements these processes is one of the major goals of neurolinguistics. A lot of attention has been devoted, in particular, to the syntactic and semantic components (Friederici, 2017; Binder and Desai, 2011, for reviews), and the extent to which they are implemented in (practically) distinct or identical regions is still debated (e.g., Fedorenko et al., 2020). In Fig. 8, we present the outcome of a meta-analysis of the literature based on a search for the keywords 'syntactic' and 'semantic' in the Neurosynth database (see Meta-Analysis based on Neurosynth). This analysis, albeit somewhat simplistic, reveals the brain regions most often associated with syntax and semantics.
Figure 8. Association maps for the terms "semantic" and "syntactic" in a meta-analysis using Neurosynth (http://neurosynth.org). The association test map for syntactic (resp. semantic) displays voxels that are reported more often in articles that include the term syntactic (resp. semantic) in their abstracts than in articles that do not (FDR correction of 0.01).
It must be noted that a fair proportion of the studies included in the meta-analysis relied on controlled experimental paradigms with single words or sentences, based on the manipulation of complexity or violations of expectations. To study language processing in a more natural way, several recent studies have presented naturalistic texts to participants and analyzed their brain activations using artificial neural language models (e.g., Pereira et al., 2018a; Huth et al., 2016; Schrimpf et al., 2020; Pasquiou et al., 2022). These models are known to encode some aspects of semantics and syntax (e.g., Pennington et al., 2014; Hewitt and Manning, 2019; Lakretz et al., 2019).
In the current work, to further dissect brain activations into separate linguistic processes, we trained NLP models on a corpus from which we selectively removed syntactic, semantic or contextual information, and examined how well these information-restricted models could explain the fMRI signal recorded from participants who had listened to an audiobook. The rationale was to highlight brain regions representing syntactic and semantic information at the lexical and supra-lexical levels (comparing a lexical model, GloVe, and a contextual one, GPT-2). Additionally, by varying the amount of context provided to the supra-lexical model, we sought to identify the brain regions sensitive to different context sizes (see Jain and Huth (2018) for a similar analysis).
Whether models were trained on syntactic or on semantic features, they fit fMRI activations in a wide bilateral network which goes beyond the classic language network comprising the IFG and temporal regions: it also includes most of the dorsolateral and medial prefrontal cortex, the inferior parietal cortex, and, on the internal face, the Precuneus and posterior Cingulate cortex (see Fig. 3). Nevertheless, the regions best predicted by the syntactic features on the one hand, and by the semantic features on the other hand, are not exactly the same. While they overlap considerably in the right hemisphere, they are more dissociated in the left hemisphere (Fig. 3, panel B). In addition, the relative sensitivity to syntax and semantics varies from region to region, with syntax predominating in the temporal lobe (see Fig. 4). Eliminating the shared variance between the syntactic and semantic features confirmed that pure syntactic effects are restricted to the STG/STS bilaterally, the IFG, and the pre-SMA, while pure semantic effects occur throughout the network (Fig. 5A-B).
The comparison between the supra-lexical model (GPT-2) and the lexical one (GloVe) revealed brain regions involved in compositionality (Fig. 6), as well as a synergy between syntax and semantics that arises only at the supra-lexical level (Fig. 5C). Finally, analyses of the influence of the size of the context provided to GPT-2 when computing word embeddings show that (1) a bilateral network of fronto-temporo-parietal regions is sensitive to short context, (2) there is a dissociation between the left and right hemispheres, respectively associated with short-range and long-range context integration, and (3) the medial Precuneus and posterior Cingulate gyri show the highest effects at every scale, hinting at an important role in large-context integration (Fig. 7).

Models trained on semantic and syntactic features fit brain activity in a widely distributed network, but with varying relative degrees.
When trained on the integral corpus, that is, on the integral features, both the lexical (GloVe) and contextual (GPT-2) models captured brain activity in a large extended language network (Appendix 1-Fig. 3). This extended network goes beyond the core language network, that is, the left IFG and temporal regions, encompassing homologous areas in the right hemisphere, the dorsal prefrontal regions on both the lateral and medial surfaces, as well as the inferior parietal cortex, the Precuneus and the posterior Cingulate. This result is consistent with those of previous studies that have examined brain responses to naturalistic text, whether analysed with NLP models (e.g., Huth et al., 2016; Pereira et al., 2018b; Jain and Huth, 2018; Caucheteux et al., 2021) or not (Lerner et al., 2011; Chang et al., 2022).
The precuneus/pCC, inferior parietal and dorsomedial prefrontal cortex are part of the Default Mode Network (DMN) (Raichle, 2015). The same areas are actually also relevant to language and high-level cognition. For example, early studies examining the role of coherence during text comprehension pointed out the same regions (Ferstl and von Cramon, 2001; Xu et al., 2005): coherent discourses elicit stronger activations than incoherent ones. Recent work by Chang et al. (2022) has revealed that the DMN is the last stage in a temporal hierarchy of processing naturalistic text, integrating information on the scale of paragraphs and narrative events (see also Simony et al., 2016; Baldassano et al., 2017). These regions are not language-specific though, as they have been shown to be activated during various theory-of-mind tasks, relying on language or not, and have thus also been dubbed the "mentalizing network" (Mar, 2011; Baetens et al., 2014).
Models trained on the information-restricted semantic and syntactic features fit signal in this widely distributed network (Fig. 3A). This is in agreement with Caucheteux et al. (2021) and Fedorenko et al. (2020) who, using very different approaches, found that syntactic predictors modulated activity throughout the language network. Caucheteux et al. (2021) first constructed new texts that matched, as well as possible, the text presented to participants in terms of its syntactic properties. The lexical items being different, the semantics of the new texts bear little relation to the original text. Then, using a pre-trained version of GPT-2, the authors obtained embeddings from these new texts and averaged them to create syntactic predictors. They found that these syntactic embeddings fitted a network of regions (ibid., Fig. 5D) similar to the one we observed (Fig. 3A). Further, defining the effect of semantics as the difference between the scores obtained from the embeddings of the original text and the scores from the syntactic embeddings, Caucheteux et al. (2021) observed that semantics had a significant effect throughout the same network (ibid., Fig. 5G).
Should one conclude that syntax and semantics equally modulate the entire language network? Our results reveal a more complex picture. Figure 4 presents a semantics vs. syntax specificity index map, showing higher sensitivity to syntax in the STG and anterior temporal lobe, whereas the parietal regions are more sensitive to semantics, consistent with Binder et al. (2009). Another point to take into consideration is that the syntactic and semantic features are not perfectly orthogonal. Indeed, the logistic decoder trained on the embeddings from the semantic dataset was better than chance at recovering syntactic features (Fig. 2), and vice versa. This might be due, for example, to the fact that some features like gender or number are present in both datasets, explicitly in the syntactic dataset and implicitly in the semantic dataset. To focus on the unique contributions of syntax and semantics, we removed the shared variance between the syntactic and semantic models using model comparisons (Fig. 5).
"Pure" semantic but not "pure" syntactic features modulate activity in a wide set of brain regions.
The unique effect of semantics, when its shared component with syntax was removed, remains widespread (Fig. 5B). This is consistent with the notion that semantic information is widely distributed over the cortex, an idea popularized by embodiment theories (Hauk et al., 2004; Pulvermüller, 2013), but already supported by neuropsychological observations revealing domain-specific semantic deficits in patients (Damasio et al., 2004).
On the other hand, the "pure" effect of syntax shrank to the STG and aTL (bilaterally), the IFG (on the left) and the pre-SMA (Fig. 5A). The left IFG and STG/STS have previously been implicated in syntactic processing (e.g. Friederici, 2011, 2017), and this is confirmed by the new approach employed here. Note that we are not claiming that these regions are specialized for syntactic processing only. Indeed, they also appear to be sensitive to the "pure" semantic component (Fig. 5B).

The contributions of the right hemisphere.
A striking feature of our results is the strong involvement of the right hemisphere. The notion that the right hemisphere has some linguistic abilities is supported by split-brain studies (Sperry, 1961) and by the patterns of recovery of aphasic patients after lesions in the left hemisphere (Dronkers et al., 2017). Moreover, a number of brain imaging studies have confirmed the right hemisphere's involvement in higher-level language tasks, such as comprehending metaphors or jokes, generating the best endings to sentences, mentally repairing grammatical errors, and detecting story inconsistencies (see Jung-Beeman, 2005; Beeman and Chiarello, 2013). All in all, this suggests that the right hemisphere is adept at recognizing distant relations between words. This conclusion is further reinforced by our observation of long-range (paragraph-level) context effects in the right hemisphere (Fig. 7, Long).
The effects we observed in the right hemisphere are not simply the mirror image of the left hemisphere. Spatially, syntax and semantics dissociate more in the left hemisphere than in the right (see Fig. 3, panel B). Moreover, the regions of overlap correspond to the regions integrating long context (Fig. 7C, bottom row), suggesting that the left hemisphere is relatively more involved in the processing of local semantic or syntactic information, whereas the right hemisphere integrates both types of information at a larger, supra-sentential time scale.

Syntax drives the integration of contextual information.
The comparison between the predictions of the integral model trained on the intact texts and the predictions of the combined syntactic and semantic embeddings from the information-restricted models (Fig. 5C) highlights a striking contrast between GloVe and GPT-2. While the former, a purely lexical model, does not benefit from being trained on the integral text, GPT-2 shows clear synergetic effects of syntactic and semantic information: its embeddings fit brain activation better when syntactic and semantic information can contribute together. The fact that the regions benefiting most from this synergetic effect are high-level integrative regions, at the end of the temporal processing hierarchy described by Chang et al. (2022), suggests that the availability of syntactic information drives semantic interpretation at the sentence level.
These regions are quite similar to the semantic peak regions highlighted in Fig. 3A, and overlap with the regions showing context effects (Fig. 7). This replicates and extends the results of Jain and Huth (2018) who, varying the amount of context fed to LSTM models from 0 to 19 words, found shorter context effects in temporal regions (ibid., Fig. 4).

Limitations of our study
Two limitations of our study must be acknowledged.
First, the dissociation between syntax and semantics is not perfect. The way we created the semantic dataset, by removing function words, clearly impacts supra-lexical semantics. For example, removing instances of "and" and "or" prevents the NLP model from distinguishing between the meanings of "A or B" and "A and B". In other words, the logical form of sentences can be perturbed. This may partly explain the synergetic effect of syntax and semantics described above. Removing pronouns is also problematic, as this removed the arguments of some verbs. Ideally, one would like to find transformations of the sentences that keep the semantic information associated with function words like conjunctions or pronouns, but it is not clear how to do that.
A second limitation concerns potential confounding effects of prosody. One cannot exclude that the models' embeddings captured some prosodic variables correlated with syntax (Bennett and Elfner, 2019). For example, certain categories of words (e.g. determiners or pronouns) are shorter and less accented than others. Also, although the models are trained purely on written text, they acquire the capacity to predict the ends of sentences, which are more likely to be followed by pauses in the acoustic signal. We included acoustic energy and the words' offsets in the baseline models to try to diminish the impact of such factors, but such controls cannot be perfect. One way to address this issue would be to have participants read the text, presented at a fixed rate, which would effectively remove all low-level effects of prosody.

Conclusion
State-of-the-art Natural Language Processing models, such as transformers, trained on large enough corpora, can generate essentially flawless grammatical text, showing that they can acquire the grammar of the language. Using them to fit brain data has become a common endeavour, even if their architecture rules them out as plausible models of the brain. Yet, despite their low biological plausibility, their ability to build rich distributed representations can be exploited to study language processing in the brain. In this paper, we have demonstrated that restricting the information provided to a model during training can reveal which brain areas encode this information. Information-restricted models are powerful and flexible tools to probe the brain, as they can be used to investigate any chosen representational space, such as semantics, syntax or context. Moreover, once trained, these models can be applied directly to any dataset to generate information-restricted features for model-brain alignment. This approach is highly beneficial, both in terms of the richness of the features and of scalability, compared to classical approaches that use manually crafted features or focus on specific contrasts. In future experiments, finer-grained control of both the information given to the models and the models' representations will permit a more precise characterisation of the roles of the various regions involved in language comprehension.

Creation of datasets; Semantic, Syntactic and Integral features
We selected a collection of English novels from Project Gutenberg (www.gutenberg.org; data retrieved on February 21, 2016). This original dataset comprised 4.4 GB of text for training purposes and 1.1 GB for validation. From it, we created two information-restricted datasets: the semantic dataset and the syntactic dataset. In the semantic dataset, only content words were kept, while all grammatical (function) words and punctuation signs were filtered out. In the syntactic dataset, each token (word or punctuation sign) was replaced by an identifier encoding a triplet (POS, Morph, NCN), where POS is the part-of-speech computed using Spacy (Honnibal and Montani, 2017), Morph corresponds to the morphological features obtained from Spacy, and NCN stands for the number of closing nodes in the parse tree at the current token, computed using the Berkeley Neural Parser (Kitaev and Klein, 2018) available with Spacy.
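The NCN component can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration of how the number of closing nodes can be read off a bracketed constituency parse; the actual pipeline uses Spacy and the Berkeley Neural Parser, which are not reproduced here.

```python
import re

def closing_nodes_per_token(parse):
    """Return (token, NCN) pairs from a bracketed constituency parse.

    The NCN of a token is the number of constituents that close
    immediately after it, i.e. the run of ')' following the token.
    """
    # Split the parse into '(', ')' and bare symbols (labels or words).
    symbols = re.findall(r'[()]|[^\s()]+', parse)
    pairs = []
    i = 0
    while i < len(symbols):
        if symbols[i] == '(':
            # symbols[i+1] is the constituent label; a terminal word,
            # if present, is the symbol right after the label.
            if i + 2 < len(symbols) and symbols[i + 2] not in '()':
                word = symbols[i + 2]
                j = i + 3
                ncn = 0
                while j < len(symbols) and symbols[j] == ')':
                    ncn += 1  # one closing node per consecutive ')'
                    j += 1
                pairs.append((word, ncn))
                i = j
            else:
                i += 2
        else:
            i += 1
    return pairs
```

For instance, in "(S (NP (DT The) (NN cat)) (VP (VBZ sleeps)))", "cat" closes both the NN and NP constituents, so its NCN is 2.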
In this paper, we refer to the content of the original dataset as integral features, the content of the semantic dataset as semantic features, and the content of the syntactic dataset as syntactic features.Examples of integral, semantic and syntactic features are given in Appendix 1-Models training.

GloVe Training
GloVe (Global Vectors for Word Representation) relies on the co-occurrence matrix of words in a given corpus to generate fixed embedding vectors that capture the distributional properties of the words (Pennington et al., 2014). Using the open-source code provided by Pennington et al. (https://nlp.stanford.edu/projects/glove/), we trained GloVe on the three datasets (integral, semantic and syntactic), setting the context window size to 15 words, the embedding vectors' size to 768, and the number of training epochs to 23.
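As an illustration of the statistic GloVe is built on, the toy sketch below accumulates distance-weighted co-occurrence counts over a symmetric window (the 1/d weighting used by the reference implementation). It is not the actual training code, only the counting step that precedes the embedding fit.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=15):
    """Symmetric, distance-weighted co-occurrence counts: a pair of
    words at distance d contributes 1/d to each of its two cells."""
    counts = defaultdict(float)
    for i, w in enumerate(tokens):
        for d in range(1, window + 1):
            j = i + d
            if j >= len(tokens):
                break
            counts[(w, tokens[j])] += 1.0 / d
            counts[(tokens[j], w)] += 1.0 / d
    return counts
```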
GPT-2 Training
The GPT2LMHeadModel architecture is trained on a next-token prediction task using a cross-entropy loss and the PyTorch python package (Paszke et al., 2019). The training procedure can easily be extended to any feature type by adapting both the vocabulary size and the tokenizer to each vocabulary; indeed, the inputs given to GPT2LMHeadModel are ids encoding vocabulary items. All the analyses reported in this paper were performed with 4-layer models having 768 units per layer and 12 attention heads. As shown in Pasquiou et al. (2022), these 4-layer models fit brain data nearly as well as the usual 12-layer models. We presented the models with input sequences of 512 tokens and let the training run for 5 epochs; convergence assessments are provided in Appendix 1-Convergence of the language models during training (Appendix 1-Fig. 1).
For the GPT-2 trained on the semantic features, small modifications had to be made to the model architecture in order to remove all residual syntax. By default, GPT-2 encodes the absolute positions of tokens in sentences. When training GPT-2 on the semantic features, as word ordering might contain syntactic information, we had to make sure that position information could not be leveraged through the positional embeddings, while keeping information about word proximity, as it influences semantics. We modified the implementation so that the GPT-2 trained on semantic features follows these specifications (see Appendix 1-Removing absolute position information in GPT-2 trained on semantic features).

Stimulus: The Little Prince story
The stimulus used to obtain activations from humans and from NLP models was The Little Prince novella. Humans listened to an audio-book version, split into 9 tracks that lasted approximately 11 minutes each (see Li et al., 2022). In parallel, NLP models were provided with an exact transcription of this audio-book, enriched with punctuation signs from the written version of The Little Prince. The text comprised 15,426 words and 4,482 punctuation signs. The acoustic onsets and offsets of the spoken words were marked to align the audio recording with The Little Prince text.

Computing Embeddings from the Little Prince text
The tokenized versions of The Little Prince (one for each feature type) were run through GloVe and GPT-2 in order to generate embeddings that could be compared with fMRI data.
For GloVe, we simply retrieved, for each token, the fixed embedding vector learnt during training. For GPT-2, we retrieved the contextualized hidden-state (i.e. embedding) vector of the third layer for each token, so that its dimension is comparable to that of GloVe's embeddings (768 units). Layer 3 (out of 4) was selected because late middle layers of language models have been shown to best predict brain activity (Toneva and Wehbe, 2019; Jain and Huth, 2018).
The embedding built by GPT-2 for a given token relies on the past tokens (the past context). The larger the past context, the more reliable the token embedding will be. We designed the following procedure to ensure that the embedding of each token used a similar past context size: the input sequence was limited to a maximum of 512 tokens, the text was scanned with a sliding window of 512 tokens and a step of 1 token, and the embedding vector of the next-to-last token in the sliding window was retrieved. For the context-constrained versions of GPT-2 (denoted GPT-2−k), the input text was formatted as the training data (see Fig. 1C) in batches of input sequences of length k + 5 tokens (see Appendix 1-Context-limited models for examples), and only the embedding vector of the current token was retrieved. Embedding matrices were built by concatenating word embeddings. More precisely, calling d the dimension of the embeddings retrieved from a neural model (corresponding, in our case, to the number of units in one layer) and N the total number of tokens in the text, we obtained an embedding matrix X ∈ ℝ^(N×d) after the presentation of the entire text to the model.
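The sliding-window procedure can be sketched as follows. This is a simplified, hypothetical rendering: `embed_fn` stands in for a model forward pass returning one vector per input token, and for brevity we keep the last position's vector rather than the next-to-last one used in the actual analysis.

```python
import numpy as np

def sliding_window_embeddings(token_ids, embed_fn, window=512):
    """Embed each token with at most `window` tokens of past context,
    so that every token is contextualized by a comparable amount of
    context, and stack the per-token vectors into a matrix."""
    rows = []
    for t in range(len(token_ids)):
        start = max(0, t - window + 1)
        vectors = embed_fn(token_ids[start:t + 1])  # shape (len(seq), d)
        rows.append(vectors[-1])                    # vector of token t
    return np.stack(rows)                           # X of shape (N, d)

# toy stand-in: embeds token x as [x, length of its context window]
toy_embed = lambda seq: np.array([[float(x), float(len(seq))] for x in seq])
X = sliding_window_embeddings([1, 2, 3], toy_embed, window=2)
```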

Decoding of syntax and semantics categories from embeddings
We designed two decoding tasks: a syntax decoding task, in which we tried to predict the triplet (part-of-speech, morphological information and number of closing nodes) of each word from its embedding vector (355 categories), and a semantic decoding task, in which we tried to predict each word's semantic category from WordNet (https://wordnet.princeton.edu/) from its embedding vector (837 categories).
We used logistic classifiers, with the text of The Little Prince as train and test data, split using a 9-fold cross-validation over runs: for each split, we trained on 8 runs and evaluated on the remaining one. Dummy classifiers were fitted and used as estimates of chance level for each task and model. All classifier implementations were taken from Scikit-Learn (Pedregosa et al., 2011).
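The decoder-versus-chance comparison can be sketched with Scikit-Learn on synthetic data; this is a toy illustration (300 fake "embeddings" with an injected 3-way category), not the actual embeddings or the 9-fold run split.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_words, dim = 300, 20
labels = rng.integers(0, 3, size=n_words)   # toy word categories
emb = rng.normal(size=(n_words, dim))       # toy embedding vectors
emb[:, 0] += labels                         # make categories decodable

# Decoding accuracy of the logistic classifier, cross-validated ...
decoder_acc = cross_val_score(LogisticRegression(max_iter=1000),
                              emb, labels, cv=5).mean()
# ... against the dummy-classifier estimate of chance level.
chance_acc = cross_val_score(DummyClassifier(strategy="most_frequent"),
                             emb, labels, cv=5).mean()
```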

MRI data
We used the functional Magnetic Resonance Imaging (fMRI) data of 51 English-speaking participants who listened to an entire audio-book of The Little Prince for about an hour and a half. These data, available at https://openneuro.org/datasets/ds003643/versions/1.0.2, are described in detail by Li et al. (2022). In short, the acquisition used echo-planar imaging (TR = 2 s; resolution = 3.75 × 3.75 × 3.75 mm) with a multi-echo (3 echoes) sequence to optimize signal-to-noise ratio (Kundu et al., 2018). Preprocessing comprised multi-echo independent component analysis (ME-ICA) to denoise the data for motion, physiology and scanner artifacts, correction for slice-timing differences, and nonlinear alignment to the MNI template brain.
For each participant, there were 9 runs of fMRI acquisition, each representing about 10 minutes of brain activity. We re-sampled the preprocessed individual scans at 4 × 4 × 4 mm (to reduce the computational load) and applied linear detrending and standardization (mean removal and scaling to unit variance) to each voxel's time series.
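The per-voxel preprocessing amounts to the following minimal numpy sketch (the actual analysis relies on nilearn's implementation):

```python
import numpy as np

def detrend_and_standardize(ts):
    """Remove a least-squares linear trend from one voxel's time
    series, then standardize it (zero mean, unit variance)."""
    t = np.arange(len(ts))
    slope, intercept = np.polyfit(t, ts, deg=1)
    residual = ts - (slope * t + intercept)
    return (residual - residual.mean()) / residual.std()

# demo: a linear trend plus an oscillation; the output has zero mean
# and unit variance, with the trend removed
ts = 3.0 * np.arange(8) + np.tile([1.0, -1.0], 4)
clean = detrend_and_standardize(ts)
```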
Finally, we computed a global brain mask to keep only the voxels containing useful signal (using nilearn's compute_epi_mask function, which finds the least dense point of the total image histogram) across all runs for at least 50% of all participants. This global mask contained 26,164 voxels at 4 × 4 × 4 mm resolution. All analyses reported in this paper were performed within this global mask.

Jaccard index
The Jaccard index (computed using Scikit-Learn's jaccard_score function from the metrics module) for two sets A and B is defined as J(A, B) = |A ∩ B| / |A ∪ B|. It behaves as a similarity coefficient: when the two sets completely overlap, J = 1; when their intersection is empty, J = 0.
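For concreteness, the index on sets of voxel indices can be written directly (Scikit-Learn's jaccard_score operates on binary label vectors instead):

```python
def jaccard(a, b):
    """Jaccard similarity J(A, B) = |A ∩ B| / |A ∪ B| between two sets,
    e.g. the sets of voxels in the semantic and syntactic peak regions."""
    a, b = set(a), set(b)
    union = a | b
    # Convention chosen here: two empty sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0
```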

Specificity index
To quantify how much each voxel is influenced by semantic and syntactic embeddings, we defined a specificity index from two quantities: Δr_syn, the score increase relative to the baseline model for the syntactic embeddings, and Δr_sem, the score increase relative to the baseline model for the semantic embeddings.
In Fig. 4, the higher (greener) a voxel's specificity index, the more sensitive the voxel is to semantic embeddings compared to syntactic embeddings; the lower (redder) its index, the more sensitive it is to syntactic embeddings compared to semantic embeddings. An index close to 0 indicates equal sensitivity to syntactic and semantic embeddings.
Group-average specificity index maps were computed from each subject's map, and significance was assessed through one-sample t-tests applied to the spatially smoothed specificity maps (isotropic Gaussian kernel, FWHM = 6 mm). An FDR correction (p < 0.005) was used to correct for multiple comparisons.

Meta-Analysis based on Neurosynth
We used the Neurosynth database (https://github.com/neurosynth/neurosynth) to perform a meta-analysis of the brain regions reported in fMRI articles containing the words 'syntactic' or 'semantic' in their abstract. Using a frequency threshold of 0.05, the keyword 'semantic' yielded 626 articles, while 'syntactic' yielded 128 articles.
The meta.MetaAnalysis function from the neurosynth package was then used to create association test maps for syntax and semantics. These maps display voxels that are reported more often in articles that mention the keyword than in articles that do not; such association test maps indicate whether there is a non-zero association between activation of a voxel and the use of a particular term in a study. We fused the maps associated with 'syntactic' and 'semantic', thresholded with a False Discovery Rate of 0.01, to produce Fig. 8.

Removing absolute position information in GPT-2 trained on semantic features
For the GPT-2 model trained on the semantic features, small modifications had to be made to the model architecture in order to remove all residual syntax. By default, GPT-2 encodes the absolute positions of tokens in sentences. As word ordering might contain syntactic information, we had to make sure that it could not be leveraged by GPT-2 through its positional embeddings, while keeping information about word proximity, as it influences semantics. We achieved this by slightly modifying the architecture of GPT-2: we first removed the default positional embeddings, and then added embeddings encoding the relative positions between input tokens to the attention scores. Indeed, just removing the positional embeddings would have produced a bag-of-words model; by adding relative-position embeddings to the attention scores, a token weights the attention granted to another token depending on their distance. In this way, information about absolute and relative positions is kept out of the tokens' embeddings, as it is never added directly to the tokens' hidden states. The following explains how this operation was performed. Let c = (c_1, ..., c_n) be a sequence of tokenized content words. c is fed to a transformer with n_heads attention heads of dimension d_head, which first builds an embedding representation w_i, i = 1..n (of size d = n_heads × d_head), to which it appends (by default) a position embedding p_i, i = 1..n (of size d) for each token. To remove all syntactic content, the first step is to discard these positional embeddings p_i, i = 1..n. However, stopping here would only yield a bag-of-words model, in which a given token would be influenced similarly by an adjacent token and by one far away. As a consequence, we had to weight the attention score granted to a token depending on its relative distance.
The attention operation can be described as mapping a query (Q) and a set of key-value (K, V) pairs to an output, where the query, keys, values and output are all vectors (generally packed into matrices). The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. We thus modified the classical attention operation, Attention(Q, K, V) = Softmax(QKᵀ / √d_head) V, by adding the relative positional embedding W described above to the attention scores: Attention(Q, K, V) = Softmax((QKᵀ + W) / √d_head) V. To build W, we first defined, for each input sequence c, the matrix D = (n_ctx − 1 + j − i)_{i,j=1..n} ∈ ℝ^(n×n), encoding the number of tokens separating two tokens of the input sequence, shifted by n_ctx − 1 so that all entries are non-negative, where n_ctx is the maximal input size. D is then embedded using a lookup table that stores an embedding of size d_head for each possible value of D, giving U ∈ ℝ^(n×n×d_head).
Finally, the weights assigned to the value vectors are adjusted using the embedded relative distances between tokens, W ∈ ℝ^(n_heads×n×n), obtained from U. In this way, we were able to weight word interactions depending on their relative distance in the input sequence, while removing all absolute positional information from the tokens' hidden states.
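To make the mechanism concrete, the numpy sketch below implements a single attention head with a distance-dependent bias added to the scores. It is a deliberately simplified variant: the bias here is one scalar per (clipped) distance, whereas the paper embeds each distance into a d_head-dimensional vector. The point it illustrates is that position enters only through the attention scores, never through the hidden states.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_with_relative_bias(Q, K, V, rel_table):
    """Single-head attention Softmax((QK^T + W)/sqrt(d_head))V, where
    W[i, j] = rel_table[|i - j|] is a bias indexed by (clipped)
    relative distance between positions i and j."""
    n, d_head = Q.shape
    dist = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    W = rel_table[np.clip(dist, 0, len(rel_table) - 1)]  # (n, n) bias
    scores = (Q @ K.T + W) / np.sqrt(d_head)
    return softmax(scores) @ V

# demo: with a strongly negative bias for any distance > 0, each token
# attends only to itself, so the output reproduces V
Q = K = np.zeros((3, 2))
V = np.arange(6, dtype=float).reshape(3, 2)
out = attention_with_relative_bias(Q, K, V, np.array([0.0, -1e9]))
```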

Mapping NLM activations to brain data
Given two non-linear transformations f1 (the neural language model, which takes the sentence as input and from which we extract latent representations) and f2 (the brain, which takes the sentence as input and from which we extract voxel activations) and an input sequence w = (w_1, ..., w_M), we define X = f1(w) ∈ ℝ^(M×d) and Y = f2(w) ∈ ℝ^(N×V), and we aim at finding a linear transformation from X to Y, where d is the dimension of the model, V the number of brain voxels, N the number of fMRI scans acquired, and M the number of words. One issue is that X and Y do not have the same sampling frequency: X is defined at the word level, while Y is sampled at the fMRI acquisition frequency, every 2 seconds. To map X to Y, we first need to align them temporally, taking the dynamics of the fMRI BOLD signal into account, and then determine a linear spatial mapping between the convolved and re-sampled X and Y. Following the standard model-based encoding approach to modelling fMRI signals (Naselaris et al., 2011; Huth et al., 2016; Pasquiou et al., 2022), we first convolve each column of X with the SPM haemodynamic kernel K, which corresponds to the profile of the fMRI BOLD response following a Dirac stimulation, and then sub-sample the signal to match the sampling frequency of Y, giving X̃ = S(K ∗ X), with S the sub-sampling operator. Finally, we learn the linear spatial mapping between X̃ and Y using a nested cross-validated L2-regularized (i.e. Ridge) univariate linear encoding model. More precisely, for each voxel v, we learn a linear projection from X̃ to y_v whose general ridge solution is ŵ = (X̃ᵀX̃ + λI)⁻¹ X̃ᵀ y_v. This stage resulted, for each model and each run, in a design matrix of size N_run × d.
Given a neural language model, we fed the associated nine design matrices to a nested cross-validated L2-regularized univariate linear encoding model to fit the fMRI brain data (of size N × V). To evaluate model performance and the optimal regularization parameter λ*, we used a nested cross-validation procedure: we split each participant's dataset into training, validation and test sets, such that the training set included 7 of the 9 experimental runs, and the validation and test sets each contained one of the two remaining runs. We evaluated model performance using the Pearson correlation coefficient r, a measure of the linear correlation between the encoding models' predicted time courses and the actual time courses.
For each subject and each voxel, we first determined λ* by comparing r for 10 different values of λ, linearly spaced in log-scale between 10⁻³ and 10⁴. We then calculated r for λ*. Finally, we repeated this procedure 9 times, using cross-validation, which yielded 9 r values that were averaged to produce a single r map for the participant. The quality of the mapping for a subject in voxel v was thus evaluated using the Pearson correlation r(v) = cov(y_v, ŷ_v) / (σ(y_v) σ(ŷ_v)) between the actual and predicted time courses.
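A compact numpy rendering of the per-voxel fit and score follows, assuming the design matrix has already been convolved with the haemodynamic kernel and sub-sampled; the real analysis nests this inside the cross-validation loops described above.

```python
import numpy as np

def ridge_fit_and_score(X_train, y_train, X_test, y_test, lam):
    """Closed-form ridge solution w = (X^T X + lam I)^{-1} X^T y for
    one voxel, evaluated with the Pearson correlation r between the
    predicted and actual test time courses."""
    d = X_train.shape[1]
    w = np.linalg.solve(X_train.T @ X_train + lam * np.eye(d),
                        X_train.T @ y_train)
    pred = X_test @ w
    return np.corrcoef(pred, y_test)[0, 1]

# demo on a noiseless toy voxel: with a tiny penalty, the true linear
# mapping is recovered and r is near-perfect
rng = np.random.default_rng(0)
Xtr, Xte = rng.normal(size=(60, 4)), rng.normal(size=(30, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
r = ridge_fit_and_score(Xtr, Xtr @ w_true, Xte, Xte @ w_true, lam=1e-6)
```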

Figure 1 .
Figure 1. Experimental setup. A) A corpus of novels was used to create a dataset from which we extracted three different sets of features: (i) integral features, comprising all tokens (words + punctuation); (ii) semantic features, comprising only the content words; (iii) syntactic features, comprising syntactic characteristics (part-of-speech, morphological characteristics, number of closing nodes) of all tokens. GloVe and GPT-2 models were trained on each feature space. B) fMRI scans of human participants listening to an audio-book were obtained. The associated text transcription was input to the neural models, yielding embeddings that were convolved with a haemodynamic kernel and fitted to brain activity using Ridge regression. Brain maps of cross-validated correlation between the encoding models' predictions and fMRI time series were computed. C) To study sensitivity to context, GPT-2 models were trained and tested on input sequences of bounded context length (5, 15 and 45 tokens). The resulting representations were then used to predict fMRI activity.

Figure 2 .
Figure 2. Decoding syntactic and semantic information from word embeddings. For each dataset and model type (GloVe and GPT-2), logistic classifiers were set up to decode either the syntactic or the semantic categories of the words from the text of The Little Prince. Chance level was assessed using dummy classifiers and is indicated by black vertical lines.

Figure 3 .
Figure 3. Comparison of the ability of GloVe and GPT-2 to fit brain data when trained on either the semantic or the syntactic features. A) Significant increase in R scores relative to the baseline model for GloVe (a non-contextual model) and GPT-2 (a contextual model), trained either on the syntactic or on the semantic features (voxel-wise thresholded group analyses; N = 51 subjects; corrected for multiple comparisons with an FDR approach, p < 0.005; for each map, the indicated value is the significance threshold on the Z-scores). B) Bilateral spatial organisation of the highest R scores for syntax and semantics. Voxels whose R score belongs to the 10% highest R scores (in green for models trained on the semantic features, in red for models trained on the syntactic features) are projected onto brain surface maps for GloVe and GPT-2 (overlap in yellow, other voxels in grey). Jaccard scores are computed for each hemisphere, i.e. the ratio between the size of the intersection and the size of the union of the semantic and syntactic peak regions; the proportion of voxels of each category is displayed for each hemisphere and model.

Figure 4 .
Figure 4. Voxels' sensitivity to syntactic and semantic embeddings. Voxels' specificity indexes are projected onto brain surface maps, reflecting how much semantic information, compared to syntactic information, helps to better fit a voxel's time course; the greener, the more the voxel is categorized as semantic, and the redder, the more it is categorized as syntactic. Yellow regions are brain areas where semantic and syntactic information lead to similar R score increases. The top row displays specificity indexes in voxels where there was a significant effect of semantic or syntactic embeddings in Fig. 3A. The bottom row shows the voxel-wise thresholded group analyses; N = 51 subjects; corrected for multiple comparisons at p < 0.005 (for each map, the indicated value is the significance threshold on the Z-scores).

Figure 5 .
Figure 5. Correlation uniquely explained by each type of embedding. A) Increase in R scores relative to the semantic embeddings when concatenating semantic and syntactic embeddings in the encoding model. B) Increase in R scores relative to the syntactic embeddings when concatenating semantic and syntactic embeddings in the encoding model. C) Increase in R scores for the integral embeddings relative to the concatenated semantic and syntactic embeddings. These maps are voxel-wise thresholded group analyses; N = 51 subjects; corrected for multiple comparisons with an FDR approach, p < 0.005; for each map, the indicated value is the significance threshold on the Z-scores.

Figure 6 .
Figure 6. Comparison of lexical and supra-lexical processing levels. Brain regions that are significantly better predicted by GPT-2 (in red) than by GloVe, when trained on syntactic features (top left), semantic features (top right) and integral features (bottom left). Maps are voxel-wise thresholded group analyses; N = 51 subjects; corrected for multiple comparisons with an FDR approach, p < 0.005; for each map, the indicated value is the significance threshold on the Z-scores.

Figure 7 .
Figure 7. Integration of context at different levels of language processing. A) Per-hemisphere histograms of significant context effects after group analyses (N = 51 subjects); thresholded at p < 0.005 voxel-wise, corrected for multiple comparisons with the FDR approach. B) Uncorrected group-averaged surface brain maps representing R score increases when fitting brain data with models leveraging increasing sizes of contextual information. C) Corrected group-averaged surface brain maps representing R score increases when fitting brain data with models leveraging increasing sizes of contextual information; thresholded at p < 0.005 voxel-wise, corrected for multiple comparisons with the FDR approach (for each map, the indicated value is the significance threshold on the Z-scores). (Top row) Comparison of the model trained with 5 tokens of context (GPT-2−5) with the non-contextualized GloVe. (Middle row) Comparison of the models trained with 15 (GPT-2−15) and 5 (GPT-2−5) tokens of context. (Bottom row) Comparison of the models trained with 45 (GPT-2−45) and 15 (GPT-2−15) tokens of context.

Examples of context-limited input sequences given to GPT-2 for the analyses on context integration. Here, the context size is equal to 5.