Abstract
We investigate the extent to which modern neural language models are susceptible to structural priming, the phenomenon whereby the structure of a sentence makes the same structure more probable in a follow-up sentence. We explore how priming can be used to study the potential of these models to learn abstract structural information, which is a prerequisite for good performance on tasks that require natural language understanding skills. We introduce a novel metric and release Prime-LM, a large corpus where we control for various linguistic factors that interact with priming strength. We find that Transformer models indeed show evidence of structural priming, but also that the generalizations they learned are to some extent modulated by semantic information. Our experiments also show that the representations acquired by the models may not only encode abstract sequential structure but involve a certain level of hierarchical syntactic information. More generally, our study shows that the priming paradigm is a useful, additional tool for gaining insights into the capacities of language models and opens the door to future priming-based investigations that probe the model’s internal states.
1 Introduction
It has become increasingly clear that modern, neural language models (LMs) are capable of representing and learning a broad range of linguistic phenomena (Gulordava et al., 2018; Hewitt and Manning, 2019; Tenney et al., 2019a; Rogers et al., 2020; Warstadt et al., 2020). However, many open questions remain about the extent to which specific LMs have indeed acquired specific linguistic constructions, about whether these models encode an abstract notion of structure in their representations, and about the best ways to even assess the syntactic abilities of these models. A rich literature has emerged in the last few years addressing these questions, often taking inspiration from methodologies developed in theoretical linguistics, psycholinguistics, neurolinguistics, and language acquisition research (Futrell et al., 2019; Ettinger, 2020; Boleda, 2020; Gauthier et al., 2020; Baroni, 2022), where the same questions have been asked about the human mind/brain for centuries.
Building on this tradition, this paper turns to structural priming to investigate the degree to which LMs encode abstract structural information independent from the concrete words that make up sentences. This phenomenon refers to the fact that humans are more likely to produce—or to more easily comprehend—a sentence of a certain structure X (the target) when they have been exposed before to a sentence of a similar structure X (the prime), than if they had been prompted with a sentence of a different structure Y. For example, a native speaker of English will be more inclined to produce the target sentence with a prepositional object in (2-a) after having read sentence (1-a) instead of (1-b), and, vice versa, be more inclined to produce the double-object target sentence (2-b) after having read (1-b) instead of (1-a). Similar effects are also observed in language comprehension.
- (1)
a. A teacher cooked a chicken for a worker
b. A teacher cooked a worker a chicken
- (2)
a. The guest threw the pot to the lady
b. The guest threw the lady the pot
Evidence for structural priming—to the extent that it can be shown to be independent from lexical overlap and other confounds—is taken as evidence for a linguistic structural level of representation that abstracts away from the surface form of sentences. Thus whether or not language models display structural priming can provide insights into their structural awareness, which is necessary for downstream tasks requiring natural language understanding skills. Previous experiments designed to test structural encoding in LMs are inconclusive. On the one hand, studies on structural probing (Hewitt and Manning, 2019) and on syntactic evaluation tasks (Warstadt et al., 2020) have yielded evidence for its presence. On the other hand, other sets of experiments have indicated that current LMs are surprisingly indifferent to word order (Hessel and Schofield, 2021; Pham et al., 2021; Sinha et al., 2021a) and rely on superficial heuristics when solving downstream tasks (McCoy et al., 2019; Sinha et al., 2021b). Such unresolved tensions between results—and the active debate about them—highlight the need for developing additional methodologies that isolate structure from the lexico-semantic cues given to the model. In this paper, we leverage findings from structural priming in human language processing to develop a systematic experimental pipeline with the aim of assessing the extent to which pre-trained neural language models learn representations that encode structural information—a prerequisite for their good performance on natural language understanding tasks.
We use the term ‘structural priming’ (Pickering and Ferreira, 2008) rather than ‘syntactic priming’ (first described in Kathryn Bock’s Syntactic Persistence in Language Production, 1986) because it comprises priming of abstract structural information that is not restricted to syntactic hierarchical rules, such as the linear positions of semantic roles or the sequential order of parts of speech. In this paper, we focus mostly on the latter and touch upon syntactic rules in Section 7.4.
In Section 3, we define an efficient novel metric for measuring the effect of priming. For our experiments, we create Prime-LM, a large-scale corpus for examining structural priming consisting of ∼1.3M prime-target sentence pairs, as we describe in Section 4. Earlier work on priming in LMs by Prasad et al. (2019) operationalized priming as adaptation or implicit learning and thus fine-tuned the model weights in between prime and target. While our priming effect metric is compatible with priming as adaptation, our experiments in this paper concentrate on priming after recent exposure to linguistic context without updating the model weights. This allows us to assess the structural representational abilities acquired by the models during training and investigate to what extent such structural information remains active at inference time.
In Section 6 and 7 we use our corpus and priming paradigm to answer three main research questions: (1) Are modern neural language models susceptible to structural priming? (2) Which factors influence the strength of the priming effect? (3) What is the nature of the structural representations acquired by those models? Our results show that Transformer language models do exhibit structural priming. This finding provides evidence that abstract structural information is encoded by the models to some degree and persists as a model makes predictions about upcoming sentences. The strength of the priming effect is influenced by several factors, including the semantic similarity and the proximity between prime and target, as well as the amount of exposure to a given structure during prompting. Our final experiment moreover reveals that the structural representations encoded by the model may not only be sequential but involve a certain level of hierarchical syntactic structure.
2 Background
2.1 Structural Priming in Humans
Priming is the dominant paradigm in psycholinguistics for investigating the extent to which human language processing involves a level of structural representation independent from other types of linguistic knowledge. The rationale behind this paradigm is that if speakers are sensitive to sentence structure independently from sentence content, then it is reasonable to assume that such structural information is an integral part of the representations built during processing.
In human language processing, structural priming effects are well attested both in comprehension and production (Bock, 1986; Pickering and Branigan, 1998; Bock and Griffin, 2000; Pickering and Ferreira, 2008; Goldwater et al., 2011; Pickering et al., 2013; Reitter and Moore, 2014; Tooley and Bock, 2014, among others). Several studies have shown that the strength of the priming effect increases after repeated exposure to a given structure (Kaschak et al., 2011; Jaeger and Snider, 2013) and tends to decay if material intervenes between prime and target (Reitter et al., 2011). Other experiments have shown that ungrammatical and semantically incongruent sentences (e.g., the waitress brunks the book to the monk) lead to similar priming effects as well-formed sentences (Ivanova et al., 2012, 2017), which suggests that structural persistence effects are robust even in the absence of semantic and lexical cues.
Yet, structural priming has been found to be affected by various aspects. For example, priming effects are stronger with lower-frequency than higher-frequency constructions (e.g., Scheepers, 2003; Bernolet and Hartsuiker, 2010; Pickering et al., 2013). Similarly, some types of lexical repetition between prime and target have been shown to enhance structural priming, suggesting that there is a lexical component involved (Pickering and Branigan, 1998; Cleland and Pickering, 2003). Semantic relatedness between prime and target also has a boosting effect, albeit smaller than the lexical repetition boost (Cleland and Pickering, 2003; Mahowald et al., 2016).
In the present study, we take inspiration from this tradition to investigate the priming behaviour of neural language models, which in turn depends on them encoding structural information. Two (not necessarily exclusive) mechanisms have been proposed to account for structural priming in humans: short-term residual activation of structural information across utterances (e.g., Branigan et al., 1999; Wheeldon and Smith, 2003) and long-term adaptation or implicit learning involving changes in the probability of a given structure (Bock et al., 2007; Kaschak et al., 2011; Fine and Jaeger, 2013). Here we focus on the ability of large pre-trained LMs to encode structural information given in the preceding context, similarly to residual activation in humans.
2.2 Structural Sensitivity of Neural LMs
The increasing capacities of neural language models in recent years have led to a surge in research into their representation of language on a fine-grained linguistic level (Alishahi et al., 2019; Tenney et al., 2019a; Rogers et al., 2020, inter alia). A common approach to examining language models is to consider them as ‘psycholinguistic subjects’; by testing hypotheses derived from psycholinguistics we are able to determine to what extent language models process language similarly to humans (Futrell et al., 2019; Ettinger, 2020; Davis and van Schijndel, 2020; Lakretz et al., 2021).
To assess the linguistic knowledge of LMs, a range of tools have been deployed. For instance, by training auxiliary diagnostic classifiers on top of a model’s internal states (Hupkes et al., 2018), we can probe whether these states encode certain linguistic properties such as POS tags (Tenney et al., 2019b), syntactic dependencies (Hewitt and Manning, 2019; White et al., 2021), or constructional information (Madabushi et al., 2020; Li et al., 2022). Another common approach is the use of Targeted Syntactic Evaluations, in which the LM’s output probabilities are compared on a minimally different pair of a grammatical and an ungrammatical sentence (Linzen et al., 2016; Marvin and Linzen, 2018; Gauthier et al., 2020; Hu et al., 2020). This procedure makes it possible to investigate a model’s knowledge of specific linguistic phenomena without probing the model’s internal representations, such as negative polarity items (Warstadt et al., 2019; Jumelet et al., 2021), subject-verb agreement (Gulordava et al., 2018; Lakretz et al., 2019), and argument binding (Warstadt et al., 2020).
Taken together, results from probing, Targeted Syntactic Evaluations, and other existing evaluation paradigms can certainly be viewed as providing converging evidence that modern neural LMs learn non-trivial structural, linguistic knowledge, and do not just memorize fragments of texts from the data and simple sequential dependencies. However, although converging, the evidence is not yet conclusive: Each of these evaluation paradigms has also been found to occasionally produce false positives. In probing, for instance, a well-known risk is that probes pick up information represented in the internal states of the language model, but not causally involved in the predictions of the model (Voita and Titov, 2020). In Targeted Syntactic Evaluations, the strength of the evidence depends on the quality of the set of alternative explanations that is considered, which ultimately is a matter of judgements and differs for different linguistic constructions (Vamvas and Sennrich, 2021). Recent studies have provided new challenges, including studies pointing out the indifference of LMs towards word order (Sinha et al., 2021a, inter alia), their reliance on spurious heuristics (Lovering et al., 2021), and their difficulty in dealing with negation (Ettinger, 2020; Kassner and Schütze, 2020).
Hence, the debate about the abilities of language models to learn structural information in general, as well as their success in learning certain linguistic constructions specifically, is far from over. The research we present in this paper starts from the observation that structural priming may provide a much needed, complementary methodology that, like Targeted Syntactic Evaluations, examines the behavior of a model, but also, like probing, informs us about the nature of the internal states. We will assess a model’s representation of a sentence by measuring its consequences in processing the next sentence. Instead of examining how the model deals with specific syntactic properties within a sentence, such as number agreement, we measure its encoding of abstract structure at the overall sentence level and the consequences this has for upcoming sentences. In the next section we explain our approach in detail.
3 Measuring Priming
For a target sentence $T_X$ with structure $X$, we quantify priming as the average log-probability boost that structurally congruent primes $P_X$ give to the target, relative to their structurally incongruent counterparts $P_Y$:

$$\mathrm{PE}(T_X) = \frac{1}{|\mathcal{P}_{T_X}|} \sum_{P_X \in \mathcal{P}_{T_X}} \Big[ \log P(T_X \mid P_X) - \log P(T_X \mid P_Y) \Big]$$

where $\mathcal{P}_{T_X}$ denotes the set of prime sentences that can be matched with target $T_X$, and each $P_Y$ is the structural counterpart of the corresponding $P_X$. In our experiments, we report the mean of this metric, taken over large-scale corpora of semantically diverse sentences.
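Given the conditional log-probabilities of a target under its matched congruent and incongruent primes, the metric reduces to a mean of log-probability differences. A minimal sketch (the function and variable names are ours, not those of the released code):

```python
def priming_effect(logp_congruent, logp_incongruent):
    """Mean difference log P(T_X|P_X) - log P(T_X|P_Y) over the
    matched primes of a single target sentence."""
    assert len(logp_congruent) == len(logp_incongruent)
    diffs = [c - i for c, i in zip(logp_congruent, logp_incongruent)]
    return sum(diffs) / len(diffs)

def corpus_priming_effect(per_target_logps):
    """Mean PE over a corpus; `per_target_logps` holds one pair of
    (congruent, incongruent) log-probability lists per target."""
    pes = [priming_effect(c, i) for c, i in per_target_logps]
    return sum(pes) / len(pes)

# Toy numbers: two targets, two matched primes each.
print(corpus_priming_effect([
    ([-41.0, -42.0], [-42.0, -44.0]),   # PE = 1.5
    ([-40.0, -40.5], [-41.0, -41.5]),   # PE = 1.0
]))  # 1.25
```

A positive value means congruent primes systematically raise the target's probability.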
Our PE method is related to various other metrics that are used in the context of priming and statistics in general. When the conditional probabilities are close to 0—as is the case for our corpora with a mean sentence probability around $10^{-18}$—this metric approaches the log odds ratio that is used by Mahowald et al. (2016). This allows our scores to be directly comparable to their results on human priming. A more general connection can be made between our metric and Bayes factors (Jeffreys, 1961; Kass and Raftery, 1995), which determine the strength of evidence and are, similar to our metric, also defined as a log probability ratio.
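The closeness to the log odds ratio at these probability magnitudes is easy to verify numerically (the values below are illustrative, not taken from our corpora):

```python
import math

def log_prob_ratio(p1, p2):
    """Log ratio of two conditional probabilities, as in our PE metric."""
    return math.log(p1 / p2)

def log_odds_ratio(p1, p2):
    """Log odds ratio, as used by Mahowald et al. (2016)."""
    return math.log((p1 / (1 - p1)) / (p2 / (1 - p2)))

# At sentence probabilities around 1e-18, the (1 - p) corrections vanish
# and the two quantities coincide to within floating-point precision.
p_congruent, p_incongruent = 3e-18, 1e-18
print(log_prob_ratio(p_congruent, p_incongruent))   # log 3 ≈ 1.0986
print(log_odds_ratio(p_congruent, p_incongruent))
```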
Note that our PE metric could be applied to the priming-as-adaptation paradigm as well, by comparing the target sentence probabilities of two fine-tuned models. In the experiments presented in this paper, we focus on priming as residual activation and thus do not update the model weights, which makes the approach more computationally efficient.
4 The Prime-LM Corpus
We create a large-scale set of corpora designed to examine the priming behavior of LMs.
4.1 Syntactic Alternations
In the current experiments, we focus on two types of syntactic alternations, dative and transitive, which allow for the same content to be expressed by two different structures. The dative alternation includes ditransitive verbs whose complements can be expressed by a double-object (do) structure or a prepositional-object (po) structure (e.g., the boss gave the dog a bone vs. gave a bone to the dog). The transitive alternation includes transitive verbs within an active (act) or a passive (pass) structure (e.g., the actor followed the student vs. the student was followed by the actor).
- (3)
Dative
do: Dt N_agent V Dt N_recipient Dt N_patient
po: Dt N_agent V Dt N_patient Pr Dt N_recipient
- (4)
Transitive
act: Dt N_agent V Dt N_patient
pass: Dt N_patient Aux V by Dt N_agent
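As a sketch, the templates in (3) and (4) can be realized programmatically. This simplified version is our own (a single determiner for all nouns and no a/an agreement, unlike the actual corpus):

```python
def realize(structure, det, agent, verb, patient, recipient=None,
            prep="to", verb_pp=None, aux="was"):
    """Fill the dative (do/po) and transitive (act/pass) templates.
    Simplified sketch: one determiner throughout, no a/an agreement."""
    if structure == "do":    # Dt N_agent V Dt N_recipient Dt N_patient
        return f"{det} {agent} {verb} {det} {recipient} {det} {patient}"
    if structure == "po":    # Dt N_agent V Dt N_patient Pr Dt N_recipient
        return f"{det} {agent} {verb} {det} {patient} {prep} {det} {recipient}"
    if structure == "act":   # Dt N_agent V Dt N_patient
        return f"{det} {agent} {verb} {det} {patient}"
    if structure == "pass":  # Dt N_patient Aux V by Dt N_agent
        return f"{det} {patient} {aux} {verb_pp} by {det} {agent}"
    raise ValueError(f"unknown structure: {structure}")

print(realize("do", "the", "boss", "gave", "bone", recipient="dog"))
# the boss gave the dog the bone
```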
In the transitive case, the active structure is dominant in English (Bock, 1986; Merriam-Webster, 1989). The proportion of use between the two structures in the dative alternation is less marked, with different studies showing a preference for the double-object structure (e.g., Bock, 1986; Bresnan et al., 2007).
4.2 Corpus Construction
We construct a set of corpora by filling in the templates in (3) and (4) above. For the content words (nouns and verbs), we exploit the vocabulary present in the University of South Florida (USF) free association norms dataset (Nelson et al., 2004), which contains pairs of cue-target words with their association strength. This allows us to control for the degree of semantic association between prime and target sentences. To minimize any effects stemming from word frequency factors, we only include USF content words that appear in the top 5000 most common words according to the COCA corpus (Davies, 2009).
We identify transitive and ditransitive verbs using vocabulary lists targeted at English language learners, keeping those that are present in USF and meet the frequency constraints (around 80 verbs in total). The ditransitive verbs were manually labeled for the preposition to be used in the po structure (to/for) and the transitive verbs were annotated with their past participle form to be used in the passive construction. In addition, all verbs were manually labeled for some of the noun categories they can take as arguments (e.g., the transitive verb wash was annotated as accepting agents of category person and patients of category person or object). Following the same frequency constraints, a set of nouns fulfilling these categories was selected from USF using the WordNet closure categories of person, social_group, social_control, institution, physical_entity, and object, which we further hand split into non-edible, edible, and drinkable. This yielded 119 nouns in total.
From this vocabulary, we are able to generate many realizations of our sentence templates through sampling, respecting the grammaticality of the sentences produced. Three native speakers of English manually examined a subset of sentences for each verb and syntactic alternation to confirm that the sentences produced are well formed. This resulted in the elimination of a few ditransitive verbs for which the do structure was considered awkward. The final corpus contains 48 transitive and 16 ditransitive verbs.
Using this template-based method, we create a series of corpora that satisfy various semantic and lexical constraints. For each of these corpora we specify a corpus size of 15,000 prime-target pairs per syntactic target structure (do, po, act, pass), which are obtained by pairing 1,500 different target sentences with 10 semantically different primes. Overall, Prime-LM contains ∼1.3M prime-target pairs.
4.3 The Core Corpus
Prime-LM consists of a core corpus and a set of variants over this core. In the core corpus, we ensure that prime and target sentences (1) include different determiners, either a/an or the, (2) do not share any nouns nor verbs, and (3) only contain nouns and verbs that are not semantically associated across prime and target according to the USF free association norms dataset. For the po structure, we additionally make sure that prime and target differ in preposition (to vs. for), which makes all the prime and target sentences in the dative alternation lexically fully disjoint. For the transitive alternation, this is not possible because the preposition by must appear in the pass structure. Other than that, we minimize lexical overlap for transitive constructions by using alternate auxiliary verb forms (is vs. was) for the passive prime and target, and create their active counterparts by using the corresponding tense of the auxiliary to maintain semantic equivalence. All sentences in the dative alternation are in the past simple tense.
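These constraints are mechanical enough to express as a filter over candidate pairs. A hedged sketch (the dict fields and the `associated` lookup are stand-ins of ours, not the actual corpus-generation interface):

```python
def satisfies_core_constraints(prime, target, associated):
    """Check the core-corpus constraints on a prime-target pair.
    `prime`/`target` are dicts with 'det', 'nouns', 'verb', and an
    optional 'prep' field (dative only); `associated(w1, w2)` is a
    stand-in for a USF free-association lookup."""
    if prime["det"] == target["det"]:                    # (1) different determiners
        return False
    if set(prime["nouns"]) & set(target["nouns"]):       # (2) disjoint nouns ...
        return False
    if prime["verb"] == target["verb"]:                  # ... and verbs
        return False
    for w1 in prime["nouns"] + [prime["verb"]]:          # (3) no USF association
        for w2 in target["nouns"] + [target["verb"]]:
            if associated(w1, w2):
                return False
    if prime.get("prep") and prime.get("prep") == target.get("prep"):
        return False                                     # po: different prepositions
    return True

prime = {"det": "a", "nouns": ["pilot", "pie", "attorney"],
         "verb": "bought", "prep": "for"}
target = {"det": "the", "nouns": ["professor", "tea", "band"],
          "verb": "sent", "prep": "to"}
print(satisfies_core_constraints(prime, target, lambda a, b: False))  # True
```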
As an illustration, below we show two examples from the core corpus following the scheme in Figure 1, where P are the prime sentences and T the target:
- (5)
Ppo: A pilot bought a pie for an attorney
Pdo: A pilot bought an attorney a pie
Tpo: The professor sent the tea to the band
- (6)
Pact: The nurse purchased the beer
Ppass: The beer was purchased by the nurse
Tpass: An engine is wrapped by a colonel
We create different variants of the core corpus that isolate specific aspects shown to influence structural priming in human sentence processing. They are described in Section 7 together with the corresponding experiments. Example sentences for each of our corpora can be found in Table 1.
| Corpus | Condition | Prime (act) | Target (pass) |
|---|---|---|---|
| Core | — | The boy judged the adult. | A cousin is forgotten by a secretary. |
| Semantic Similarity | Verb Only | The chief struck the mayor. | A bishop was beaten by a hero. |
| | All Nouns | An actor broke a glass. | The bottle was wrapped by the actress. |
| | All Words | The student drank the wine. | A beer was prepared by a professor. |
| Lexical Overlap | Random Noun | The girl smelled the chicken. | A chicken was prepared by a pilot. |
| | Main Verb | A woman used a computer. | The iron was used by the father. |
| | Function Words | The soldier wanted the pie. | The book was carried by the manager. |
| | All Nouns | The king smelled the wine. | A wine was drunk by a king. |
| Implausible Prime | — | The newspaper grabbed the pot. | A key is removed by an attorney. |
| Structural Complexity | Prime Complex | A lady with a red bag chased a minister. | The juice was purchased by the child. |
| | Target Complex | The physician judged the leader. | A rich school was embraced by a business. |
| | Both Complex | The bad adult with the hat raised the knife. | A son was helped by an author from Cuba. |
5 Language Models
We focus our experiments on the class of auto-regressive LMs, which are trained to predict the next token, in line with human incremental language processing. Our methodology can be applied to masked LMs as well; we briefly reflect on this in the discussion (§8). Our analysis centers on Transformer models (Vaswani et al., 2017), which constitute the current state of the art in language modeling and have been shown to produce representations that correlate strongly with human brain signals (Schrimpf et al., 2020).
This is the set of models we consider:
GPT2, in its four sizes (small, medium, large, xl; Radford et al., 2019), and its distilled version (Sanh et al., 2019);
DialoGPT, three GPT2 models of increasing size that have been fine-tuned on dialogue data (Zhang et al., 2020);
GPT-Neo in three sizes (125m, 1.3b, 2.7b; Black et al., 2021), which is based on GPT3 (Brown et al., 2020).
All Transformer LMs are imported with the transformers library (Wolf et al., 2020). The extraction of the model probabilities is done using the diagNNose library (Jumelet, 2020), which provides support for efficient activation extraction. Our implementation allows our priming procedure to be efficiently tested on any kind of language model and to be easily reproducible. All our code and corpora are available at https://github.com/dmg-illc/prime-lm.
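Computing the PE only requires conditional target probabilities under each model. Abstracting away from any particular toolkit, scoring a target given a prime amounts to summing next-token log-probabilities over the target tokens; a framework-agnostic sketch (the `next_token_logprobs` callable is our abstraction, not the diagNNose API—with the transformers library it would be a log-softmax over the model's logits):

```python
import math

def target_log_prob(next_token_logprobs, prime_ids, target_ids):
    """log P(target | prime) for an autoregressive LM, summed over
    target tokens. `next_token_logprobs(prefix)` returns a dict
    mapping each candidate next token to its log-probability."""
    total = 0.0
    prefix = list(prime_ids)
    for tok in target_ids:
        total += next_token_logprobs(prefix)[tok]
        prefix.append(tok)
    return total

# Toy uniform LM over a 10-token vocabulary:
uniform = lambda prefix: {t: math.log(1 / 10) for t in range(10)}
print(target_log_prob(uniform, [1, 2], [3, 4, 5]))  # 3 * log(1/10) ≈ -6.91
```

Swapping in a real model only changes the callable, not the scoring loop.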
Why Should LMs Exhibit Structural Priming?
Since structural repetition is present in human language use and common in corpora (Dubey et al., 2008), LMs have, in theory, the potential to learn such structural dependencies during training. It is not reasonable to expect that models which have been trained on shuffled sentences will exhibit priming, however, because such models will not be able to adequately carry over a linguistic signal (structural or otherwise) from the prime sentence to the target. As mentioned in the Introduction and in Section 2.2, several studies have suggested that structural information is being encoded by large language models; yet, other studies showing that LMs are often insensitive to permutations in word order (e.g., Kodner and Gupta, 2020; Sinha et al., 2021b) cast doubt on these results. Thus, while there is potential for LMs pre-trained on unshuffled data to encode structural dependencies that are detectable with our priming paradigm, whether they will in fact do so remains an open question, since the language modeling objective (next word prediction) contains no explicit cues for structural information. This is precisely the question we address in this work.
Priming Behavior
To interpret our results we distinguish between three types of behaviour: (i) symmetrical priming occurs when a model obtains positive PEs for both constructions within an alternation: The model has fully picked up on the structural congruence between prime and target; (ii) asymmetrical priming occurs when a model obtains a positive PE for one construction, and a PE close to zero for its counterpart; and (iii) biased priming occurs when a model obtains a positive PE for one construction, but a negative PE for its counterpart. A priming bias indicates that a prime of the preferred structure is more likely to boost any subsequent target that we consider, regardless of its structural congruence with the prime. Hence, we take symmetrical and, to some extent, asymmetrical priming behavior to represent evidence for the structural priming effect we are interested in.
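The three behaviors can be operationalized as a simple decision rule over the two PEs of an alternation. The tolerance below is an illustrative choice of ours, not a threshold from the paper:

```python
def classify_priming(pe_x, pe_y, eps=0.05):
    """Label an alternation's priming behavior from the PEs of its two
    constructions. `eps` is an illustrative 'close to zero' tolerance."""
    if pe_x > eps and pe_y > eps:
        return "symmetrical"            # both constructions primed
    if min(pe_x, pe_y) < -eps and max(pe_x, pe_y) > eps:
        return "biased"                 # one positive, one negative PE
    if max(pe_x, pe_y) > eps and abs(min(pe_x, pe_y)) <= eps:
        return "asymmetrical"           # one positive, one near-zero PE
    return "none"

print(classify_priming(0.8, 0.6))   # symmetrical
print(classify_priming(0.8, 0.0))   # asymmetrical
print(classify_priming(0.8, -0.4))  # biased
```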
6 Core Priming Results across LMs
We initially test all LMs described in the previous section on our core corpus, designed to control for lexical overlap and semantic similarity. This provides a clean experimental setup, where the only element shared between prime and target is the abstract sequential structure. The results are reported in Figure 2, split by the structure type of the target sentence. It can be seen that across many models a positive PE is present. We will now discuss these results in more detail.
There are two models that exhibit symmetrical priming for both transitive and dative alternations: GPT2-large and GPT2-xl. The other GPT2 models exhibit symmetrical priming for transitive as well, but exhibit moderate asymmetrical priming behavior for dative, with priming occurring only for the double-object structure. DialoGPT-small exhibits biased priming for transitive constructions: a negative PE on active constructions, but a large positive PE for passive constructions. This shows that for this particular model a passive prime boosts the probability of an active target more than an active prime does, resulting in a negative effect.
Model Size
We can investigate the impact of model size by comparing the results of the different sizes of the models we consider. Larger models may have more potential for encoding finer-grained structural information (see, e.g., Hu et al., 2020). If model size were to have a positive effect on structural priming this might manifest itself in two ways: either (1) the PE increases for both structural alternatives, or (2) the priming bias towards one structure decreases. We do not see evidence of (1). As for (2) regarding bias, results differ between transitive and dative. For the GPT2 models the asymmetrical priming towards double objects decreases, resulting in symmetrical priming for both GPT2-large and GPT2-xl. For the DialoGPT results on transitive we can see that the severe bias towards passive decreases as model size increases, resulting in symmetrical priming behaviour for DialoGPT-large. For dative constructions, however, the larger model size gives rise to a priming bias towards double objects: in this case increasing model size actually has a detrimental effect on the model’s priming behaviour. From this we conclude that sensitivity to structural priming is partly driven by model size, but is likely to depend on a more intricate combination of factors related to model architecture and training data, which needs to be investigated further in future work.
Best Model
The models that exhibit the most susceptibility to structural priming across all four construction types are GPT2-large and GPT2-xl. For GPT2-large the congruent conditional probability P(Tx|Px) was larger than the incongruent one P(Tx|Py) 60.5% of the time for active, 81.0% for passive, 65.4% for prepositional object, and 72.1% for double object. In the subsequent experiments we will focus our analysis on GPT2-large and use more specialized experimental conditions within the priming paradigm to dig deeper into the potential of the model for encoding structural information.
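Percentages such as these are obtained by counting, per construction, how often the congruent prime wins; in code (names ours):

```python
def congruent_preference_rate(logp_congruent, logp_incongruent):
    """Fraction of prime-target pairs for which the congruent conditional
    probability P(T_x|P_x) exceeds the incongruent one P(T_x|P_y).
    Log-probabilities preserve the ordering, so we compare them directly."""
    wins = sum(c > i for c, i in zip(logp_congruent, logp_incongruent))
    return wins / len(logp_congruent)

# Toy example: 2 of 3 pairs prefer the congruent prime.
print(congruent_preference_rate([-41.0, -40.0, -43.0],
                                [-42.0, -41.0, -42.5]))
```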
7 Impact of Specific Factors
The next battery of experiments isolates various factors that have been shown to be of influence to priming in human language processing. For each experimental condition, we present a specialized corpus followed by an analysis of the priming effects exhibited by GPT2-large on this data, comparing them to the model’s behavior on the core corpus. Examples from the core and specialized conditions can be found in Table 1.
7.1 Lexical Dependence
In the core corpus, prime and target sentences are semantically unrelated, which ensures that priming effects cannot stem from the model assigning higher probabilities to words that are similar or identical to those present in the prime sentence. In the following two experiments we relax this constraint to investigate the extent to which lexical semantic similarity and lexical repetition across prime and target have an impact on structural priming effects.
7.1.1 Semantic Similarity
We create versions of the core corpus where prime and target sentences have different degrees of lexical semantic similarity. Concretely, a pair of words sharing the same semantic role in the prime and target is considered semantically similar if they (a) are associated according to the USF norms, and (b) have a cosine similarity (computed with embeddings from Fares et al., 2017) equal or higher than the 90%-percentile of the distribution of similarities in the core corpus.
In human experiments, semantic similarity has been found to boost priming (Goldwater et al., 2011), both in nouns (Cleland and Pickering, 2003) and in verbs (Pickering and Branigan, 1998). We isolate the effects of verb and noun similarity by creating conditions where (1) only the verb, (2) all nouns, or (3) all content words are semantically similar across prime and target sentences. These additional constraints result in a more limited set of possible sentence pairs for condition (3), and thus in a reduced corpus of 228 (transitive) and 1648 (dative) prime-target pairs rather than 15,000.
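The similarity criterion combines a binary association lookup with a cosine-similarity threshold. A sketch with stand-ins of ours for the USF norms and the pretrained embeddings (the nearest-rank percentile is likewise a simplification):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def percentile(values, q):
    """Nearest-rank q-th percentile of a list of similarities."""
    ordered = sorted(values)
    k = max(0, math.ceil(q / 100 * len(ordered)) - 1)
    return ordered[k]

def semantically_similar(w1, w2, embed, associated, threshold):
    """(a) USF-associated and (b) cosine similarity at or above the
    90%-percentile threshold of the core-corpus distribution.
    `embed` and `associated` are stand-ins for the real resources."""
    return associated(w1, w2) and cosine(embed(w1), embed(w2)) >= threshold

emb = {"wine": [1.0, 0.0], "beer": [0.9, 0.1]}
print(semantically_similar("wine", "beer",
                           lambda w: emb[w], lambda a, b: True, 0.9))  # True
```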
Results
We find greater PE across constructions in this setup compared to the core corpus, although this is less pronounced for the po structure. As can be seen in Figure 3A, a semantically similar verb in prime and target leads to an increase of the PE, comparable to the condition where all nouns are similar. With the exception of do, we do not observe an additive effect: When all content words are similar, the PE is not substantially higher than when only the verb is similar.
7.1.2 Lexical Overlap
Lexical overlap between prime and target in the core corpus was avoided in both content and function words. Here we systematically introduce lexical repetition across prime and target sentences. We create versions of the core corpus where lexical overlap takes place with respect to only (1) one of the nouns at random but with the same semantic role across prime and target (agent, patient, recipient, see §4.1), (2) all nouns, (3) the verb, and (4) all function words (i.e., any determiners, prepositions, and auxiliary verbs are shared across prime and target, without content words being shared).
Results
As can be seen in Figure 3B, overall the presence of lexical overlap greatly boosts structural priming effects. For all constructions, verb overlap leads to higher priming effects than repeating one noun or even all nouns. Surprisingly, overlap of function words has the highest boosting effect for act and do.13 To place these results into context, we calculate the PE when prime and target are identical sentences. Language models are known to be prone to repeatedly generating the same sentence (Foster and White, 2007; Fu et al., 2021); hence this value can be considered a ceiling. We obtain a PE of 2.5 for act, 7.2 for pass, 9.2 for po, and 10.1 for do constructions. None of the lexical overlap conditions we consider reaches the corresponding ceiling.
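As a rough illustration of the ceiling condition, the sketch below computes a priming-effect-style score where the congruent "prime" is the target sentence itself. The `log_p` scorer is a toy stand-in that rewards word overlap with the context (a real setup would query an LM such as GPT-2 for conditional log-probabilities), and the function names are ours, not the paper's.

```python
def log_p(sentence, context=""):
    """Toy stand-in for an LM's log-probability of `sentence` given
    `context`: a fixed per-word cost, offset by a bonus for word
    overlap with the context. Not a real language model."""
    overlap = len(set(sentence.split()) & set(context.split()))
    return -len(sentence.split()) + 0.5 * overlap

def priming_effect(targets, congruent_primes, incongruent_primes):
    """Mean log-probability boost a congruent prime gives the target,
    relative to an incongruent prime (our reading of the PE metric)."""
    diffs = [log_p(t, cp) - log_p(t, ip)
             for t, cp, ip in zip(targets, congruent_primes, incongruent_primes)]
    return sum(diffs) / len(diffs)

# Ceiling condition: the congruent "prime" is the target itself.
target = "the chef gave the book to the boy"
ceiling = priming_effect([target], [target], ["a dog sees a cat"])
```

Under this toy scorer the ceiling comes out positive, mirroring the paper's observation that identical prime-target pairs yield the largest effect; the actual ceiling values (2.5 to 10.1) of course come from real model probabilities.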
7.2 Semantic Implausibility
In this experiment, we test whether the effects found in the core corpus are robust to manipulations concerned with the semantic plausibility of the sentences used as stimuli. This helps to diagnose to what extent any structural information encoded by the model is autonomous from semantics. To this end, we construct a version of the corpus where the prime sentences are specifically designed to be semantically implausible. Gulordava et al. (2018) used a similar method in their study of long-distance agreement dependencies, finding that the RNNs' ability to predict number agreement was robust to nonsensical sentences. The authors interpret this result as evidence that the networks track abstract structure, in line with Chomsky's (1957) proposal that grammaticality is distinct from meaningfulness in the human language faculty. Here we further test this hypothesis by analyzing whether the LM is susceptible to structural priming effects when the prime sentence is nonsensical. As mentioned in §2.1, humans do exhibit structural priming effects when prompted with incongruent sentences (Ivanova et al., 2012, 2017). We construct semantically implausible primes by sampling nouns at random from noun categories that do not respect the verb's selectional restrictions. This results in grammatically correct, yet nonsensical sentences such as 'the iron threw the hero to the chocolate'. The same constraints present in the core corpus regarding the absence of semantic similarity and lexical overlap between prime and target apply here as well.
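The sampling procedure can be sketched as follows; the lexicon, category names, and role inventory are hypothetical stand-ins for the paper's WordNet-based noun categories and verb selectional restrictions.

```python
import random

# Hypothetical lexicon: which noun category each role of a verb selects
# (a stand-in for the paper's selectional restrictions).
RESTRICTIONS = {"threw": {"agent": "person", "patient": "object", "recipient": "person"}}
NOUNS = {"person": ["hero", "doctor", "teacher"],
         "object": ["iron", "chocolate", "ball"]}

def implausible_prime(verb, rng):
    """Build a grammatical but nonsensical PO-dative prime by sampling
    each noun from a category the verb does NOT select for that role."""
    def violating_noun(role):
        allowed = RESTRICTIONS[verb][role]
        bad_categories = [c for c in NOUNS if c != allowed]
        return rng.choice(NOUNS[rng.choice(bad_categories)])
    agent = violating_noun("agent")
    patient = violating_noun("patient")
    recipient = violating_noun("recipient")
    return f"the {agent} {verb} the {patient} to the {recipient}"

rng = random.Random(0)
sentence = implausible_prime("threw", rng)
```

With this toy lexicon, every sample violates all three role restrictions at once, producing sentences of the same shape as the paper's example 'the iron threw the hero to the chocolate'.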
Results
The results of this experiment are shown in Figure 3C. The PE disappears and becomes negative for the act and po constructions, while for pass and do it decreases compared to the results on the core corpus but remains positive. This asymmetry within each alternation suggests that an implausible prime can make one target structure more likely regardless of shared structural properties. While some degree of abstract structural information present in the nonsensical sentences may still be exploited to predict the target construction, the asymmetrical behaviour suggests that structural encoding is not fully independent of semantic plausibility.
7.3 Activation Strength
In the following two experiments, we test whether structural priming effects are affected by the proximity of prime to target and by increased exposure to the priming structure. We maintain the strict setting of our core corpus, where prime and target are semantically and lexically unrelated, thus testing to what extent the activation of abstract structural information across sentences is affected by recency and cumulativity factors.
7.3.1 Recency
Results
The results of this experiment, shown in Figure 4A, indicate that increasing the proximity between prime and target has a strongly positive impact on the strength of priming. Interestingly, the PE for the transitive cases is still relatively high even when the distance between prime and target is at its largest, whereas for the dative cases the PE drops drastically. This may indicate that the syntactic configuration of a transitive sentence is not corrupted as much by the intervening intransitive material as that of a dative sentence.
7.3.2 Cumulativity
Results
As shown in Figure 4B, for all constructions the PE increases monotonically as the number of congruent prime sentences increases. This resonates with the potential of large LMs for few-shot learning: The multiple priming sentences appear to act as "demonstrations" (in the sense of Brown et al., 2020) of a given structure, which presumably increases the activation of that type of structural information. This result is yet another indication of structural information being encoded by the model and remaining active across sentences, as the main feature that is repeated across the multiple primes is the shared abstract structure.
7.4 Structural Complexity
Finally, we test whether the priming effects present in the core corpus are robust to different degrees of structural complexity between prime and target. In our core corpus, congruent prime and target sentences are constructed from the same sequence of parts of speech (see §4.1). Results by Reitter and Keller (2007) suggest that, for humans, short-term priming via residual activation is better explained by assuming hierarchical representations. In this experiment, we test whether the structural information encoded by the model is limited to sequential abstract structure or rather involves hierarchical syntactic representations.
To gain more insight into the nature of the structural information represented by the model, we construct a version of the corpus where some of the noun phrases are more complex than simply "Dt N" (e.g., the awful tea from Spain). The rationale behind this manipulation is the following: If the structure of a sentence is represented in terms of something akin to a hierarchical phrase-structure rule such as VP → V NP NP or VP → V NP PP rather than as a sequence of part-of-speech categories, then it should not matter whether prime and target differ with respect to the internal structure of the sub-constituents: we should observe a similar degree of priming whether the noun phrases are complex or not. Evidence suggests that this is indeed the case for humans (Tree and Meijer, 1999; Pickering and Branigan, 1998; Branigan et al., 2006).
We create a version of the core corpus where the noun phrases may contain a prenominal adjective, a prepositional phrase, neither, or both, in order to introduce varying degrees of complexity. We use a total of 164 adjectives manually labeled for compatibility with the different noun categories. The prepositional phrases are constructed with either with or from. For the with case, we select a set of 27 suitable nouns within the WordNet categories of clothing, device, or container. This results in noun phrases such as "Dt (A) N with Dt (A) N". For the from case, we use 23 country names, resulting in noun phrases such as "Dt (A) N from N". All the additional vocabulary adheres to the same selection procedure as in §4, with prime and target being semantically unrelated. We test three conditions: (1) only the prime sentence has a complex NP, (2) only the target sentence does, (3) both prime and target have a complex NP, ensuring different NP structures across prime and target. In all three settings, any semantic role (agent, patient, or recipient) can be modified to become complex and there is at most one complex NP per sentence.
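The NP templates described above can be sketched as a small composition function; the vocabulary here is hypothetical, standing in for the paper's curated adjective, noun, and country lists.

```python
def noun_phrase(noun, adjective=None, pp=None):
    """Compose an NP: determiner, optional prenominal adjective,
    head noun, and an optional 'with'/'from' prepositional phrase."""
    np_ = f"the {adjective} {noun}" if adjective else f"the {noun}"
    if pp:
        np_ += f" {pp}"
    return np_

# The four complexity variants (bare, adjective, PP, both).
variants = [
    noun_phrase("tea"),                                      # Dt N
    noun_phrase("tea", adjective="awful"),                   # Dt A N
    noun_phrase("tea", pp="from Spain"),                     # Dt N from N
    noun_phrase("tea", adjective="awful", pp="from Spain"),  # Dt A N from N
]
```

Substituting one such variant for a single role's simple NP in a core-corpus sentence yields the prime-only, target-only, and both-complex conditions.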
Results
The results are shown in Figure 5. The first thing to note is that the presence of noun phrases of varying complexity across prime and target does not cancel out the PE: In all cases, the effect remains positive, although there is a decrease for several conditions. We also observe asymmetrical priming effects; for example, for the transitive alternation with a complex prime, active is unaffected whereas the PE for passive is clearly reduced. This suggests that some of the effects observed on the core corpus may be driven by the consistently simple sequential structures present in that data. Yet, the fact that the priming effect remains positive suggests that there is some degree of hierarchical structural information commonly encoded for both simple and complex NPs, which is carried over to influence the prediction of the target.
8 Discussion and Conclusions
In this paper, we investigated three main questions: (1) Are modern neural LMs susceptible to structural priming? (2) Which factors influence the strength of the priming effect? (3) What is the nature of the structural representations acquired by those models? To answer these questions, we designed a series of carefully curated large-scale corpora, proposed a metric to measure the degree to which a model is susceptible to priming, and ran a series of experiments on several Transformer LMs. This methodology constitutes a new way of assessing the representational abilities of LMs via examining their behavior in controlled setups, which complements tools like Targeted Syntactic Evaluations and the adaptation-based priming measure by Prasad et al. (2019).
Our results in Section 6 showed that on our core corpus, where we control for lexical overlap and semantic similarity between prime and target, most of the language models we test exhibit some degree of priming for most of the constructions we study. This is important, as it opens up the possibility of using priming to investigate what influences the learned representations of these language models.
In Section 7, we focused on GPT2-large to conduct a series of subsequent experiments to dig deeper into the impact of different factors on the model’s susceptibility to priming. In line with psycholinguistic accounts of residual activation, we found that the effects of priming decrease with the distance between prime and target and increase with the amount of exposure to a given structure. Our results indicate that the structural information being encoded is not fully autonomous from semantics: The Priming Effect is highly boosted by semantic similarity and lexical overlap between the words used in prime and target. Such boosting effects are well known to be present in human language processing as well. Furthermore, the Priming Effect partly disappears with semantically implausible prime sentences, suggesting that semantic plausibility is an important cue for the encoding of structure, arguably more so than in human language processing. Finally, we showed that priming effects remain positive in the presence of phrases with differing degrees of complexity across prime and target. This offers some insight into the nature of the representations learned by the model: It suggests that, in addition to abstract sequential structure, some degree of hierarchical syntactic information is being represented.
The current work does not reveal, for the various conditions tested, what the mechanics of the boosting or suppressing effects are. For example, we do not know whether the boosts from lexical overlap or semantic similarity are the result of an improved match with the same structural representations, or of independent factors that influence priming behaviour. Similarly, the precise interplay between semantic plausibility and structural encoding remains unclear. Overall, the pattern of results calls for further investigation using interpretability methods, such as probing and feature attributions, which we plan to pursue in future work.
An additional aspect that requires further study is the role of the training data and its statistics, for example, regarding the frequency of the different constructions under investigation and the impact this may have on priming asymmetries within an alternation, and on priming behaviour more generally. An important future step to disentangle the factors that may give rise to priming behavior would involve training a range of different model types on the same data. This way it becomes possible to interpret the role that model architecture, model size, training objective, and corpus statistics play in shaping the behavior of the model. An important class of models to include in such studies are Masked Language Models. We conducted a preliminary experiment on three such models, which resulted in biased priming behavior for all (see Figure 6). We postulate that these models may rely less on the structure of a prime because their bi-directional nature allows them to take the entire target sentence into account. However, in order to adequately determine that this is entirely due to their training objective, and not due to external factors stemming from corpus statistics, future work could control for this with newly trained models.
Our study reveals novel details about the potential of LMs to represent structural information and the persistence of this information when making predictions about upcoming sentences. But more generally, we believe our findings also demonstrate the usefulness of the priming paradigm for investigating such questions. Even more generally, they illustrate the benefits of repurposing experimental paradigms from psycholinguistics to investigate the knowledge acquired by large neural language models. In that sense, the current paper complements exciting recent work that borrows other paradigms from linguistics and psycholinguistics, including grammaticality judgments, few shot learning, and cloze tests (Gauthier et al., 2020; Brown et al., 2020; Baroni, 2022; Lovering et al., 2021). That is, while syntactic priming offers one window into abstract language representations in neural language models, linguistics offers a whole row of windows that are starting to reveal an exciting vista.
Acknowledgments
We would like to thank the anonymous reviewers for their extensive and thoughtful feedback and suggestions, which greatly improved our work, as well as the action editor for his helpful guidance. We would also like to thank members of the ILLC past and present for their useful comments and feedback, specifically, Dieuwke Hupkes, Mario Giulianelli, Sandro Pezzelle, and Ece Takmaz. Arabella Sinclair worked on this project while affiliated with the University of Amsterdam. The project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 819455).
Notes
Our code and data can be found at https://github.com/dmg-illc/prime-lm.
Corresponding to the percentage of human participants who produced the target word when asked to come up with words related to the cue (http://w3.usf.edu/FreeAssociation/).
To ensure compatibility with the indefinite article a/an (see Section 4.3), uncountable nouns were discarded.
The corpus size of 15,000 was determined based on Cochran’s Formula for sample size determination (Cochran, 1977), with a p-value and margin of error of 0.01.
The average cosine similarity across pairs of words in prime and target computed with word2vec embeddings by Fares et al. (2017) is 0.2 for both nouns and verbs.
Also known as causal or left-to-right language models, which predict the probability of the next token based solely on prior context.
This is analogous to, for example, subject-verb agreement: A model that always prefers a plural verb, regardless of the subject number, can't be said to understand the task. A model that scores 100% on plural verb prediction, but at chance on singular verbs, has an asymmetric understanding of the task.
Note that the different sizes of a model are trained on the same amount of data; only the number of parameters is affected.
This results in a cosine similarity threshold of ∼0.4.
In this case, to maximize the number of unique pairs, we allow a varying number of primes to target, rather than observing the 10-to-1 prime-target setup of the other corpora.
Author notes
Action Editor: Chris Brew
Equal contribution.