BLiMP: The Benchmark of Linguistic Minimal Pairs for English

Abstract
We introduce The Benchmark of Linguistic Minimal Pairs (BLiMP),1 a challenge set for evaluating the linguistic knowledge of language models (LMs) on major grammatical phenomena in English. BLiMP consists of 67 individual datasets, each containing 1,000 minimal pairs, that is, pairs of minimally different sentences that contrast in grammatical acceptability and isolate specific phenomena in syntax, morphology, or semantics. We generate the data according to linguist-crafted grammar templates, and human aggregate agreement with the labels is 96.4%. We evaluate n-gram, LSTM, and Transformer (GPT-2 and Transformer-XL) LMs by observing whether they assign a higher probability to the acceptable sentence in each minimal pair. We find that state-of-the-art models identify morphological contrasts related to agreement reliably, but they struggle with some subtle semantic and syntactic phenomena, such as negative polarity items and extraction islands.


Introduction
Current neural networks for sentence processing rely on unsupervised pretraining tasks like language modeling. Still, it is an open question how the linguistic knowledge of state-of-the-art language models (LMs) varies across the linguistic phenomena of English. Recent studies (e.g., Linzen et al., 2016; Marvin and Linzen, 2018) have explored this question by evaluating LMs' preferences between minimal pairs of sentences differing in grammatical acceptability, as in Example (1). However, each of these studies uses a different set of metrics and focuses on a small set of linguistic paradigms, severely limiting any possible big-picture conclusions.
(1) a. The cats annoy Tim. (grammatical) b. *The cats annoys Tim. (ungrammatical)
We introduce the Benchmark of Linguistic Minimal Pairs (shortened to BLiMP), a linguistically motivated benchmark for assessing the sensitivity of LMs to acceptability contrasts across a wide range of English phenomena, covering both previously studied and novel contrasts. BLiMP consists of 67 datasets automatically generated from linguist-crafted grammar templates, each containing 1,000 minimal pairs and organized by phenomenon into 12 categories. Validation with crowd workers shows that BLiMP faithfully represents human preferences.
1 https://github.com/alexwarstadt/blimp
We use BLiMP to study several pretrained LMs: Transformer-based LMs GPT-2 (Radford et al., 2019) and Transformer-XL (Dai et al., 2019), an LSTM LM trained by Gulordava et al. (2019), and an n-gram LM. We evaluate whether the LM assigns a higher probability to the acceptable sentence in each minimal pair to determine which grammatical distinctions LMs are sensitive to. This gives us indirect evidence about each model's linguistic knowledge and allows us to compare models in a fine-grained way. We conclude that current neural LMs appear to acquire robust knowledge of morphological agreement and some syntactic phenomena such as ellipsis and control/raising. They show weaker evidence of knowledge about argument structure, negative polarity item licensing, and the semantic properties of quantifiers. All models perform at or near chance on extraction islands. Overall, every model we evaluate falls short of human performance by a wide margin. GPT-2, which performs the best, performs 8 points below humans overall, though it does match or exceed human performance on specific phenomena.
In §6.3 we conduct additional experiments to investigate the effect of training size on the LSTM LM and Transformer-XL's performance on BLiMP. While we see steady improvements in overall performance, we find that LMs learn phenomenon-specific distinctions at different rates. In §6.4 we consider alternative well-motivated evaluation metrics on BLiMP, but find that they do not differ drastically from our method of comparing LM probabilities for full sentences.
Table 1: Minimal pairs from each of the twelve linguistic phenomenon categories covered by BLiMP. Differences are underlined. N is the number of 1,000-example minimal pair paradigms within each broad category.
We conclude that while models like GPT-2 appear to have significant linguistic knowledge, this knowledge is concentrated in some specific domains of English grammar. We use BLiMP to uncover several linguistic phenomena where even state-of-the-art language models clearly lack human-like knowledge, and to bring into focus those areas of grammar that future studies evaluating LMs should investigate in greater depth.

Language Models
The objective of a language model is to give a probability distribution over the strings of a language. Both neural network and non-neural network architectures are used to build LMs, and neural models can be trained in a self-supervised setting without the need for labeled data. Recently, variants of neural language modeling have been shown to be a strong pretraining task for natural language processing tasks (Howard and Ruder, 2018; Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019).
The last decade has seen two major paradigm shifts in the state of the art for language modeling. First, there was a movement from models based on local n-gram statistics (see Chen and Goodman, 1999) to neural sequence models such as LSTMs (Mikolov et al., 2010), which optimize on the task of predicting the next token. Subsequently, Transformer-based architectures employing self-attention (Vaswani et al., 2017) have outperformed LSTMs (e.g., Dai et al., 2019). Although these shifts have resulted in stronger LMs, perplexity on large benchmark datasets like WikiText-103 (Merity et al., 2016) has remained the primary performance metric, which cannot give detailed insight into these models' knowledge of grammar. Evaluation on benchmarks like GLUE (Wang et al., 2018, 2019a), which heavily adapt language models to perform downstream tasks, is more informative, but does not offer broad coverage of linguistic phenomena and does not necessarily reflect knowledge that is already present in the LMs.

Linguistic Knowledge of NNs
Many recent studies have searched for evidence that neural networks (NNs) learn representations that implicitly encode grammatical concepts. We refer to the ability to encode these concepts as linguistic knowledge. Some studies evaluate NNs' linguistic knowledge using probing tasks in which a classifier is trained to directly predict grammatical properties of a sentence (e.g. syntactic tree depth) or part of a sentence (e.g. part-of-speech) using only the NNs' learned representation as input (Shi et al., 2016; Adi et al., 2017; Conneau et al., 2018; Ettinger et al., 2018; Tenney et al., 2019). We follow a complementary approach that uses acceptability judgments to address the same question without the need for training data labeled with grammatical concepts. Acceptability judgments are the main form of behavioral data used in generative linguistics to measure human linguistic competence (Chomsky, 1965; Schütze, 1996). One branch of this literature uses minimal pairs to infer whether LMs detect specific grammatical contrasts. Table 2 summarizes linguistic phenomena studied in this work. For instance, Linzen et al. (2016) look closely at minimal pairs contrasting subject-verb agreement. Marvin and Linzen (2018) expand the investigation to negative polarity item and reflexive licensing. However, these and related studies cover a limited set of phenomena, to the exclusion of well-studied phenomena in linguistics such as control and raising, ellipsis, quantification, and countless others. This is likely due to the labor-intensive nature of collecting such targeted minimal pairs.
A related line of work evaluates neural networks on acceptability judgments in a more domain-general way. Corpora of sentences and their grammaticality are collected for this purpose in a number of studies (Heilman et al., 2014; Lau et al., 2017; Warstadt et al., 2019b). The most recent and comprehensive corpus is CoLA (Warstadt et al., 2019b), containing 10k sentences covering a wide variety of linguistic phenomena provided as examples in linguistics papers and books. CoLA, which is included in the GLUE benchmark (Wang et al., 2018), has been used to track advances in the sensitivity of reusable sentence encoding models to acceptability. Current models like BERT (Devlin et al., 2019) and T5 (Raffel et al., 2019) now learn to give acceptability judgments that approach or even exceed individual human agreement with CoLA.
While CoLA can provide evidence about phenomenon-specific knowledge of models, this method is limited by the need to train a supervised classifier on CoLA data prior to evaluation. This is because CoLA is designed for binary acceptability classification, and there is no generally accepted method for obtaining binary acceptability predictions from unsupervised models like LMs. Warstadt and Bowman (2019) measure phenomenon-specific performance on CoLA for several pretrained sentence encoding models: an LSTM, GPT (Radford et al., 2018), and BERT. However, the use of supervision prevents making strong conclusions about the sentence encoding component, since it is not possible to distinguish what the encoder knows from what is learned through supervised training on acceptability data. Evaluating LMs on minimal pairs avoids this problem, with the caveat that the LM probability of a sentence can only serve as a proxy for acceptability if confounding factors impacting a sentence's probability, such as length and lexical content, are controlled for. It is with these considerations in mind that we design BLiMP.

Data
BLiMP consists of 67 minimal pair paradigms, each with 1,000 sentence pairs in mainstream American English, grouped into 12 categories. We refer to minimal pair types as paradigms and categories as phenomena. Each paradigm is annotated for the unique contrast it isolates and the broader phenomena it is part of. We automatically generate the data from linguist-crafted grammar templates, and our automatic labels are validated with crowd-sourced human judgments.
While each minimal pair type corresponds to exactly one paradigm, a particular fact about English grammar may be illustrated by multiple paradigms. For instance, the fact that certain determiners and nouns agree can be illustrated by keeping the determiner the same and changing the number marking of the noun as in the example in Table 1, or by keeping the noun the same and changing the determiner (e.g. Rachelle had bought those chair.). With completeness in mind, we include such complementary paradigms in BLiMP whenever possible.

Data generation procedure
To create minimal pairs exemplifying a wide array of linguistic contrasts, we found it necessary to artificially generate all datasets. This ensures both that we have sufficient unacceptable examples, and that the data is fully controlled, allowing for repeated isolation of a single linguistic phenomenon (Ettinger et al., 2018). For each paradigm, we use a generation script to sample lexical items from a vocabulary of over 3,000 items according to a template specifying linear order of the phrases in the acceptable and unacceptable sentences in each minimal pair. Our data generation scripts are publicly available. We annotate these lexical items with the morphological, syntactic, and semantic features needed to enforce selectional restrictions and create grammatical and semantically felicitous sentences.
All examples in a paradigm are structurally analogous up to the point required for the relevant contrast but may vary in some ways. For instance, the template for NPI LICENSING, illustrated in Table 1, specifies that an arbitrary verb phrase needs to be generated. Accordingly, the generation script samples from the entire set of verbs and generates the required arguments on-the-fly. Thus, the structure of the sentence then depends on whether the sampled verb is transitive, clause-embedding, raising, etc., but that same verb phrase and its arguments are used in both pairs in the paradigm.
This generation procedure is not without limitations, and despite the very detailed vocabulary we use, implausible sentences are occasionally generated (e.g., Sam ran around some glaciers). In these cases, though, both the acceptable and unacceptable sentences will be equally implausible given world knowledge, so any difference in the probability assigned to them is still attributable to the intended grammatical contrast.
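As a rough illustration of template-based generation, the sketch below samples subject-verb agreement pairs from a fixed template. It is not the released generation code: the toy vocabulary, the function names, and the `sentence_good`/`sentence_bad` keys are assumptions made for this example, and the real vocabulary carries far richer feature annotations.

```python
import random

# Hypothetical toy vocabulary annotated with number features
# (the real vocabulary has over 3,000 items and many more features).
VOCAB = {
    "noun": [
        {"sg": "cat", "pl": "cats"},
        {"sg": "teacher", "pl": "teachers"},
        {"sg": "sister", "pl": "sisters"},
    ],
    "verb": [
        {"sg": "annoys", "pl": "annoy"},
        {"sg": "helps", "pl": "help"},
    ],
    "name": ["Tim", "Cheryl", "Ron"],
}


def generate_pair(rng):
    """Fill the template 'The N V Name.' once for each member of the pair."""
    noun = rng.choice(VOCAB["noun"])
    verb = rng.choice(VOCAB["verb"])
    obj = rng.choice(VOCAB["name"])
    number = rng.choice(["sg", "pl"])
    other = "pl" if number == "sg" else "sg"
    # Same lexical items in both sentences; only the verb's number differs.
    good = f"The {noun[number]} {verb[number]} {obj}."
    bad = f"The {noun[number]} {verb[other]} {obj}."
    return {"sentence_good": good, "sentence_bad": bad}


def generate_paradigm(n, seed=0):
    """Generate n minimal pairs reproducibly."""
    rng = random.Random(seed)
    return [generate_pair(rng) for _ in range(n)]


for pair in generate_paradigm(3):
    print(pair["sentence_good"], "|", pair["sentence_bad"])
```

Because both sentences reuse the same sampled items, each pair is equal in length and differs in exactly one word, which is the control the template is meant to provide.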

Coverage
The paradigms covered by BLiMP represent well-established contrasts in English morphology, syntax, and semantics. Each paradigm is grouped into one of 12 phenomena, shown in Table 1. Examples of all 67 paradigms appear in Table 4 of the Appendix. The paradigms are selected with the constraints that they can be characterized using templates as described above and illustrated with minimal pairs of sentences equal in length that differ in at most one vocabulary item. While this dataset has broad coverage, it is not exhaustive. It is not possible to include every grammatical phenomenon of English, and there is no agreed-upon set of core phenomena. However, we consider frequent inclusion of a phenomenon in a syntax/semantics textbook as an informal proxy for what linguists consider to be core phenomena. We survey several syntax textbooks (e.g., Sag et al., 2003; Adger, 2003; Sportiche et al., 2013), and find that nearly all of the phenomena in BLiMP are discussed in some source. Most of the topics that repeatedly appear in textbooks and can be represented with minimal pairs (e.g. agreement, control/raising, wh-extraction/islands, binding) are present in BLiMP. We characterize the 12 phenomena in BLiMP as follows:
• ANAPHOR AGREEMENT: the requirement that reflexive pronouns like himself (a.k.a. anaphora) agree with their antecedents in person, number, gender and animacy.
• ARGUMENT STRUCTURE: the ability of different verbs to appear with different types of arguments. For instance, different verbs can appear with a direct object, participate in the causative alternation, or take an inanimate argument.
• BINDING: the structural relationship between a pronoun and its antecedent. All paradigms illustrate aspects of Chomsky's (1981) Principle A. Since coindexation cannot be annotated in BLiMP, Principles B and C are not illustrated.
• CONTROL/RAISING: syntactic and semantic differences between various types of predicates that embed an infinitival VP. This includes control, raising, and tough-movement predicates.
• DETERMINER-NOUN AGREEMENT: number agreement between demonstrative determiners (e.g. this/these) and the associated noun.
• ELLIPSIS: the possibility of omitting expressions from a sentence. Since this is difficult to illustrate with sentences of equal length, our paradigms cover only special cases of noun phrase ellipsis that meet this constraint.
• FILLER-GAP: dependencies arising from phrasal movement in, e.g., wh-questions.
• IRREGULAR FORMS: irregular morphology on English past participles (e.g. broken). We are unable to evaluate models on non-existent forms like *breaked because such forms are out of the vocabulary for some LMs.
• ISLAND EFFECTS: restrictions on syntactic environments where the gap in a filler-gap dependency may occur.
• NPI LICENSING: restrictions on the distribution of negative polarity items like any and ever limited to, e.g., the scope of negation and only.
• QUANTIFIERS: restrictions on the distribution of quantifiers. We cover two such restrictions: superlative quantifiers (e.g., at least) cannot embed under negation, and definite quantifiers and determiners cannot be subjects in existential there constructions.
• SUBJECT-VERB AGREEMENT: subjects and present tense verbs must agree in number.

Comparison to Related Resources
With a vocabulary of over 3,000 words, BLiMP has by far the most lexical variation of any related generated dataset. It includes verbs with 11 different subcategorization frames, including verbs that select for PPs, infinitival VPs, and embedded clauses. By comparison, datasets by Ettinger et al. (2018) and Marvin and Linzen (2018) use vocabularies of under 200 items. Other datasets of minimal pairs that achieve more lexical and syntactic variety use data-creation methods that limit empirical scope and control. Linzen et al. (2016) construct a dataset of minimal pairs for subject-verb agreement by changing verbs' number marking in a subset of English Wikipedia, but this approach does not generalize beyond agreement phenomena. Lau et al. (2017) construct minimal pairs by passing sentences from the BNC through round-trip machine translation. The resulting sentences contain a wider variety of grammatical violations, but it is not possible to control the nature or quantity of violations in the resulting sentences.

Data validation
To verify that the generated sentences represent a real contrast in acceptability, we conduct human validation via Amazon Mechanical Turk.8 Twenty separate validators rated five pairs from each of the 67 paradigms, for a total of 6,700 judgments. We restricted validators to individuals currently located in the US who self-reported as native speakers of English. To ensure that our validators made a genuine effort on the task, each HIT included an attention check item and a hidden field question to catch bot-assisted humans. Validators were paid $0.25 for completing 5 judgments, which we estimate took 1-2 minutes. For each minimal pair, 20 individuals completed a forced-choice task mirroring the LMs' task; the human-determined acceptable sentence was calculated via majority vote of annotators. By this metric, we estimate aggregate human agreement with our annotations to be 96.4% overall. As a threshold of inclusion in BLiMP, the majority of validators needed to agree with BLiMP on at least 4/5 examples from each paradigm. All 67 paradigms in the public version of BLiMP passed this validation; only two additional paradigms were rejected on this criterion. We also estimate individual human agreement to be 88.6% overall using the approximately 100 annotations from each paradigm.9 Figure 3 reports individual human results (and model results) as a conservative measure of human agreement.
8 The full set of human judgments and a summary of the results for all 67 paradigms is in Table 4 in the Appendix.
9 A few had to be excluded due to ineligible annotators.
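The two agreement measures and the inclusion threshold can be made concrete with a short sketch. The vote data below is invented for illustration, the function names are ours, and the treatment of ties (a tie does not count as agreement) is an assumption rather than a documented detail of the validation.

```python
def majority_agrees(votes):
    """True if a strict majority of annotators chose the sentence labeled acceptable.

    Assumption: a tie is treated as disagreement.
    """
    return 2 * sum(votes) > len(votes)


def aggregate_agreement(votes_per_pair):
    """Share of pairs whose majority vote matches the automatic label."""
    return sum(majority_agrees(v) for v in votes_per_pair) / len(votes_per_pair)


def individual_agreement(votes_per_pair):
    """Share of single judgments that match the automatic label."""
    total = sum(len(v) for v in votes_per_pair)
    return sum(sum(v) for v in votes_per_pair) / total


def paradigm_passes(votes_per_pair, min_pairs=4):
    """Inclusion threshold: majority agreement on at least 4 of the 5 validated pairs."""
    return sum(majority_agrees(v) for v in votes_per_pair) >= min_pairs


# Made-up votes for one paradigm: 5 pairs x 20 validators,
# True = validator preferred the sentence labeled acceptable.
votes = [
    [True] * 18 + [False] * 2,
    [True] * 20,
    [True] * 9 + [False] * 11,
    [True] * 15 + [False] * 5,
    [True] * 12 + [False] * 8,
]

print(aggregate_agreement(votes))   # 0.8
print(individual_agreement(votes))  # 0.74
print(paradigm_passes(votes))       # True
```

In this toy paradigm, individual agreement is lower than aggregate agreement because majority voting absorbs minority disagreements on each pair. That is why the individual figure is the more conservative of the two.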


Results & Discussion
An LM's overall accuracy on BLiMP is simply the proportion of the 67,000 minimal pairs in which the model assigns a higher probability to the acceptable sentence. We report the results for all models and human evaluation in Table 3. GPT-2 achieves the highest accuracy and the 5-gram model the lowest. All models perform well below estimated human accuracy (as described in §3.4). The 5-gram model's poor performance, both overall and on every individual category, indicates that BLiMP is likely not solvable from local co-occurrence statistics alone. Because we evaluate pretrained models that differ in architecture and training data, we can only speculate about what drives these differences (though see §6.3 for a controlled ablation study on the LSTM LM). The results seem to indicate that access to training data is the main driver of performance on BLiMP for the neural models we evaluate. This may explain why Transformer-XL and the LSTM LM perform similarly in spite of differences in architecture, as both are trained on approximately 100M tokens of Wikipedia text. Relatedly, GPT-2's advantage may come from the fact that it is trained on roughly two orders of magnitude more data. Possibly, LSTMs trained on larger datasets could perform comparably to GPT-2, but such experiments are impractical due to the inefficiency of training LSTMs at this scale.
12 https://github.com/sheng-fu/colorlessgreenRNNs
13 https://github.com/kpu/kenlm
14 https://github.com/anhad13/blimp_ngram
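The accuracy computation itself is simple enough to sketch. The example below is only illustrative: it swaps in a tiny add-one-smoothed bigram model trained on three made-up sentences in place of the LMs evaluated here, and all names (`train_bigram`, `blimp_accuracy`) are assumptions of this sketch. Any function returning a sentence log-probability could be used instead.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Return a sentence log-probability function for an add-one-smoothed bigram model."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sent in corpus:
        toks = ["<s>"] + sent.lower().split() + ["</s>"]
        vocab.update(toks)
        contexts.update(toks[:-1])
        bigrams.update(zip(toks[:-1], toks[1:]))
    v = len(vocab) + 1  # one extra type reserves mass for unseen words

    def log_prob(sentence):
        toks = ["<s>"] + sentence.lower().split() + ["</s>"]
        return sum(
            math.log((bigrams[(a, b)] + 1) / (contexts[a] + v))
            for a, b in zip(toks[:-1], toks[1:])
        )

    return log_prob


def blimp_accuracy(pairs, log_prob):
    """Proportion of (acceptable, unacceptable) pairs where the acceptable sentence scores higher."""
    correct = sum(log_prob(good) > log_prob(bad) for good, bad in pairs)
    return correct / len(pairs)


# Toy training corpus and toy minimal pairs (not BLiMP data).
log_prob = train_bigram(["the cats annoy tim", "the cat annoys tim", "the dogs chase tim"])
pairs = [
    ("the cats annoy tim", "the cats annoys tim"),
    ("the cat annoys tim", "the cat annoy tim"),
]
print(blimp_accuracy(pairs, log_prob))  # 1.0
```

The toy model succeeds here only because subject and verb are adjacent and both bigrams were seen in training, which is the kind of local co-occurrence cue that does not carry over to the benchmark as a whole.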

Results & Discussion by Phenomenon
The results also give insight into how LMs' linguistic knowledge varies by domain. Models generally perform best and closest to human level on morphological phenomena. For instance, GPT-2 performs within 5 points of humans on ANAPHOR AGR., DET.-NOUN AGR., and SUBJ.-VERB AGR. The set of challenging phenomena is more diverse. ISLANDS are the hardest phenomenon for most models. Only GPT-2 performs well above chance, and it remains 12 points below humans. Some semantic phenomena, specifically those involving NPI LICENSING and QUANTIFIERS, are also challenging overall. All models perform relatively poorly on ARG. STRUCTURE. From these results we conclude that current SotA LMs robustly encode basic facts of English agreement. This does not mean that LMs will come close to human performance for all agreement phenomena. §6.1 discusses evidence that increased dependency length and the presence of agreement attractors reduce performance on agreement phenomena.
We find that LMs do represent long-distance wh-dependencies, but we also conclude that their representations differ fundamentally from humans'. While some models approach human performance in ordinary filler-gap dependencies, they are exceptionally poor at identifying island violations overall. This finding suggests that they reliably encode long-distance dependencies in general, but not the syntactic domains in which these dependencies are blocked, though GPT-2 does perform well above chance on some paradigms of ISLAND EFFECTS. However, strong conclusions about how these models represent wh-dependencies are not possible using the forced-choice task compatible with BLiMP, and a complete assessment of syntactic islands is best addressed using a factorial design that manipulates both the presence of an island and an attempt to extract from it, as in Kush et al. (2018).
In the semantic phenomena where models struggle (NPIS and QUANTIFIERS), violations are often attributed in semantic theories to a presupposition failure or contradiction arising from semantic composition or pragmatic reasoning (e.g., Chierchia, 2013;Ward and Birner, 1995;Geurts and Nouwen, 2007). These abstract semantic and pragmatic factors may be difficult for LMs to learn. Marvin and Linzen also find that LSTMs largely fail to recognize NPI licensing conditions. Warstadt et al. (2019a) find that BERT (which is similar in scale to GPT-2) recognizes these conditions inconsistently in an unsupervised setting.
The weak performance on ARG. STRUCTURE is somewhat surprising, since arguments and heads are usually, though not always, adjacent (e.g., subjects and direct objects are adjacent to the verb in default English word order). However, argument structure is closely related to semantic event structure (see Marantz, 2013), which may be comparatively difficult for LMs to learn. Also, judgments about argument structure are complicated by the possibility of coercing a frequently transitive verb to be intransitive and vice versa, as well as the existence of secondary meanings of verbs with different argument structures (e.g. normally intransitive boast has a transitive use as in The spa boasts 10 pools), which might make this domain somewhat more difficult for LMs. Even with these complications, though, humans detect the intended contrast 90% of the time. We note that the reported difficulty of these phenomena contradicts Warstadt and Bowman's (2019) conclusion that argument structure is one of the strongest domains for neural models. However, Warstadt and Bowman evaluate classifiers with supervision on CoLA, a large proportion of which consists of sentences related to argument structure.
Finally, we caution against interpreting positive results on a general phenomenon in BLiMP as proof of human-like knowledge. While it is unlikely that GPT-2 could reach human performance on the SUBJ.-VERB AGR. paradigms without acquiring a concept of number marking that abstracts away from specific lexical items, it is difficult to rule out this possibility without accumulating different forms of evidence, for instance by testing how it generalizes to nonce words. We take the paradigms in FILLER-GAP as a cautionary example (see Table 4). There are four paradigms that assess a model's sensitivity to the syntactic requirements of complementizer that versus a wh-word.
We observe that all models more or less succeed when the unacceptable sentence lacks a necessary gap, but fail when it contains an illicit gap. These results suggest the models' ability to accurately detect a contrast in whether a gap is filled following a wh-word is not clearly based on a generalization about the relationship between that wh-word and its gap, as such a generalization should extend to the cases where the models currently fail to detect the correct contrast. More generally, conclusions about a model's knowledge of a particular grammatical concept can only be reached by considering several paradigms.

Shallow Predictors of Performance
We also ask what factors besides linguistic phenomena affect model accuracy. Figure 2 shows how sentence length, perplexity (which does not depend on length), the probability of the good sentence (which does depend on length), and confidence affect model performance. The effect of perplexity is much weaker for GPT-2 than for other models, which indicates it is probably more robust to sentences with non-stereotypical syntax or describing unlikely scenarios. GPT-2 is the only model where accuracy increases largely monotonically with confidence. A similar relationship holds between confidence and agreement in human acceptability judgments.

Correlation of Model & Human Performance
We examine the extent to which models and humans succeed at detecting contrasts for the same linguistic phenomena. Figure 1 shows the Pearson correlations between the accuracies of the four LMs and humans on the 67 paradigms. The neural models correlate moderately with humans, with GPT-2 correlating most strongly. The n-gram model's performance correlates with humans relatively weakly. Neural models correlate with each other more strongly, suggesting neural networks share some biases that are not human-like. The high correlation of 0.9 between Transformer-XL and the LSTM possibly reflects their similar training data.
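For concreteness, the correlation in question is the ordinary Pearson coefficient computed over per-paradigm accuracies, one value per paradigm for each system. A minimal sketch follows. The accuracy vectors are invented for illustration and much shorter than the 67 paradigms, and the function name is ours.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Made-up per-paradigm accuracies (one entry per paradigm) for two systems.
human = [0.97, 0.90, 0.85, 0.78, 0.95]
model = [0.99, 0.80, 0.60, 0.55, 0.90]
print(round(pearson(human, model), 2))
```

A high coefficient says only that two systems find the same paradigms easy or hard. It is compatible with a large gap in absolute accuracy.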

Long-Distance Dependencies
The presence of intervening material can lower the ability of humans to detect agreement dependencies (Bock and Miller, 1991). We study how intervening material affects the LMs' sensitivity to mismatches in agreement in BLiMP. First, we test for sensitivity to determiner-noun agreement with and without an intervening adjective, as in Example (2). The results are plotted in Figure 3. The n-gram model is the most heavily impacted, performing on average 35 points worse. This is unsurprising, since the bigram consisting of a determiner and noun is far more likely to be observed than the trigram of determiner, adjective, and noun. For the neural models, we find a weak but consistent effect, with all models performing on average between 3 and 5 points worse when there is an intervening adjective.
(2) a. Ron saw that man/*men. b. Ron saw that nice man/*men.
Second, we test for sensitivity to mismatches in subject-verb agreement when an attractor noun of the opposite number intervenes. We compare attractors in relative clauses (3-b) and as part of a relational noun (3-c), following experiments by Linzen et al. (2016) and others. Again, we find that the n-gram model's performance is reduced significantly by this intervening material, suggesting the model is consistently misled by the presence of an attractor. All the neural models perform above chance with an attractor present, but GPT-2 and the LSTM perform 22 and 20 points worse when an attractor is present than when there is no attractor, while Transformer-XL's performance is reduced by only 5 points. Thus, we reproduce Linzen et al.'s finding that attractors significantly reduce LSTM LMs' sensitivity to mismatches in agreement and find evidence that this holds true of some Transformer LMs as well.
b. The sisters who met Cheryl bake/*bakes. c. The sisters of Cheryl bake/*bakes.

Regular vs. Irregular Agreement
In DET.-NOUN AGR. and SUBJ.-VERB AGR., we generate separate datasets for nouns with regular and irregular number marking, as in Example (4). All else being equal, only models with access to sub-word-level information should make any distinction between regular and irregular morphology.
(4) a. Ron saw that nice kid/*kids. (regular) b. Ron saw that nice man/*men. (irregular)
In fact, Figure 4 shows that the two sub-word-level models GPT-2 and Transformer-XL show little effect of irregular morphology: they perform less than 1.3 points worse on irregulars than regulars. Their high overall performance suggests they robustly encode number features without relying on segmental cues.
Figure 4: Models' performance on agreement phenomena between a determiner and noun and between a subject and verb, broken down by whether the noun/subject has a regular or irregular plural form.

Training size and BLiMP performance
We use BLiMP to track how a model's representation of particular phenomena varies with the quantity of training data. Using subsets of different sizes drawn from Gulordava et al.'s (2019) training data, we retrain the LSTM and Transformer-XL models and evaluate their performance on BLiMP. Figure 5 shows that different phenomena have notably different learning curves across different training sizes even if the full model trained on 83M tokens achieved equivalent accuracy scores. For example, the LSTM model ultimately performs well on both IRREGULAR and ANAPHOR AGR., but requires more training to reach this level of performance for ANAPHOR AGR. These learning curve differences show how BLiMP performance dissociates from perplexity on Wikipedia data, a standard measure of LM performance: although perplexity decreases with more training data, performance on different phenomena grows at varying rates.
We conjecture that there is a sigmoid relationship between the logarithm of training set size and BLiMP performance which appears to be roughly linear at this scale. We conduct linear regression analyses to estimate the rate of increase in performance in relation to the logarithm (base 2) of dataset size. Based on these values, we estimate that if log-linear improvement continues, the LSTM LM and Transformer-XL should require well over 10^20 tokens of training data to achieve human-like performance on these hardest phenomena.
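The extrapolation amounts to fitting accuracy = a · log2(tokens) + b and solving for the token count at which the line reaches a target accuracy. The sketch below shows the arithmetic on invented numbers. The sizes, accuracies, and target are not values from our experiments, and the function names are assumptions of this example.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def tokens_needed(sizes, accuracies, target):
    """Extrapolate the training size at which the fitted line reaches `target`.

    Assumes accuracy keeps growing linearly in log2(size), which is exactly
    the conjecture being illustrated and need not hold in practice.
    """
    slope, intercept = fit_line([math.log2(s) for s in sizes], accuracies)
    return 2 ** ((target - intercept) / slope)


# Made-up learning curve: +2 accuracy points per doubling of the training set.
sizes = [2**20, 2**21, 2**22, 2**23]
accuracies = [0.50, 0.52, 0.54, 0.56]
print(f"{tokens_needed(sizes, accuracies, target=0.60):.3g}")  # about 3.36e+07, i.e. 2**25
```

When the slope per doubling is small and the gap to the target is large, the required exponent grows quickly, which is how estimates on the order of 10^20 tokens can arise.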
We also find that increasing model size (number of parameters) is unlikely to improve performance: we evaluate four pretrained versions of GPT-2 with 117M to 1558M parameters trained on WebText. All models have overall BLiMP accuracy of 0.84 ± 0.01, and standard deviation among the models on each of the 12 phenomena does not exceed 0.03. This finding bolsters our earlier conclusion in §5 that amount of training data has the biggest impact on BLiMP performance.

Alternate Evaluation Methods
There are several other methods one can use to measure an LM's preference between two minimally different sentences. So far, we have considered only the full-sentence method, advocated for by Marvin and Linzen (2018), which compares LM likelihoods of full sentences. In a follow-up experiment, we use two prefix methods, each of which has appeared in related prior work, that evaluate a model's preferences by comparing its prediction at a key point of divergence between the sentences. Subsets of BLiMP data are designed to be compatible with multiple methods, allowing us to conduct the first direct comparison. We find that all methods give broadly similar results when aggregating over a set of paradigms. We see no strong argument against evaluating solely using the full-sentence method, though some results diverge for specific paradigms.

One-Prefix Method
In the one-prefix method, used by Linzen et al. (2016), a pair of sentences share the same initial portion of a sentence, but differ in a critical word that makes them differ in grammaticality (e.g., The cat eats mice vs. The cat eat mice). The model's prediction is correct if it assigns a higher probability to the grammatical token given the shared prefix.

Two-Prefix Method
In the two-prefix method, which has also appeared in related prior work, a pair of sentences differ in their initial string, and the grammaticality difference is only revealed when a shared critical word is included (e.g., The cat eats mice vs. The cats eats mice). For these paradigms, we evaluate whether the model assigns a higher probability to the critical word conditioned on the grammatical prefix than on the ungrammatical prefix.
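The two-prefix comparison can be sketched in the same way. As before, `TOY_PROBS` and the function names are hypothetical and serve only to illustrate the comparison; the critical word is held fixed while the prefix varies.

```python
import math

# Hypothetical probabilities P(word | previous word) for a shared critical word.
TOY_PROBS = {("cat", "eats"): 0.03, ("cats", "eats"): 0.002}

def toy_logprob(prefix, word):
    """Log-probability of `word` following `prefix` (toy lookup on the last word)."""
    prev = prefix[-1] if prefix else "<s>"
    return math.log(TOY_PROBS.get((prev, word), 1e-6))

def two_prefix_correct(good_prefix, bad_prefix, critical_word, logprob_fn=toy_logprob):
    """True if the critical word is more probable after the grammatical prefix."""
    return logprob_fn(good_prefix, critical_word) > logprob_fn(bad_prefix, critical_word)

result = two_prefix_correct(["the", "cat"], ["the", "cats"], "eats")
print(result)
```

Unlike the one-prefix case, the two probabilities compared here come from different conditional distributions, one per prefix.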
The prefix methods differ from the full-sentence method in two key ways: (i) they require that the acceptability of the sentence be unambiguously predictable from the critical word, but not sooner, and (ii) they are not affected by predictions made by the LM following the critical word. These predictions do affect the full-sentence method. For example, assuming that P(are numerous) ≫ P(is numerous), a model could predict that The cats are numerous is more likely than The cats is numerous without correctly predicting that P(are|the cats) > P(is|the cats). Using prefix probabilities allows us to exclude models' use of this additional information and evaluate how the models perform when they have just enough information to judge grammaticality.
Figure 6 shows that models have generally comparable accuracies across all three methods. However, there are some cases where we observe differences between these methods. For example, Transformer-XL performs much worse at BINDING, DET.-NOUN AGR., and SUBJ.-VERB AGR. in the full-sentence method, suggesting that the probabilities Transformer-XL assigns to the irrelevant part at the end of the sentence very often overturn the observed preference based on probability up to the critical word. On the other hand, GPT-2 benefits from reading the whole sentence for BINDING phenomena, as its performance is better in the full-sentence method than in the prefix method.
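The way material after the critical word can overturn a prefix-based preference can be made concrete with a toy example. The probabilities below are hypothetical and chosen purely so that the two methods disagree: the toy model wrongly prefers is after the cats, yet strongly prefers numerous after are, so the full-sentence comparison still favors the grammatical sentence.

```python
import math

# Hypothetical conditional probabilities, chosen so the two methods disagree.
TOY_PROBS = {
    ("cats", "are"): 0.04,
    ("cats", "is"): 0.05,        # toy model wrongly prefers "is" after "cats"
    ("are", "numerous"): 0.02,
    ("is", "numerous"): 0.0001,  # but strongly prefers "numerous" after "are"
}

def toy_logprob(context, token):
    """Log-probability of `token` given the preceding tokens (toy bigram lookup)."""
    prev = context[-1] if context else "<s>"
    return math.log(TOY_PROBS.get((prev, token), 1e-3))

def sentence_logprob(tokens):
    """Chain-rule log-probability of the whole sentence."""
    return sum(toy_logprob(tokens[:i], tok) for i, tok in enumerate(tokens))

good = ["the", "cats", "are", "numerous"]
bad = ["the", "cats", "is", "numerous"]
prefix = ["the", "cats"]

# Full-sentence method: compares total log-probabilities of both sentences.
full_sentence_correct = sentence_logprob(good) > sentence_logprob(bad)
# One-prefix method: compares only the prediction at the critical word.
one_prefix_correct = toy_logprob(prefix, "are") > toy_logprob(prefix, "is")

print(full_sentence_correct, one_prefix_correct)
```

In this constructed case the full-sentence method scores the pair as correct while the one-prefix method scores it as incorrect, mirroring the situation described above.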
We conclude that with a sufficiently diverse set of paradigms, the various metrics under consideration will give similar results. Thus, it is not problematic that BLiMP relies only on the full-sentence method, and doing so allows BLiMP to include many paradigms not compatible with either prefix method. Nonetheless, prefix methods are still valuable for detailed analysis or for studies making direct comparison to psycholinguistic theories.

Conclusion & Future Work
We have shown ways in which BLiMP can be used as a tool to gain evidence about both the overall and fine-grained linguistic knowledge of language models. Like the GLUE benchmark (Wang et al., 2018), BLiMP assigns a single overall score to an LM which summarizes its general sensitivity to minimal pair contrasts. It also provides a breakdown of LM performance by linguistic phenomenon, which can be used to draw more concrete conclusions about the kinds of grammatical features acquired by a given model. This kind of information is a linguistically motivated evaluation of LMs that can complement common metrics like perplexity.
Furthermore, the extent to which humans resemble data-driven learners like language models is debated in linguistics and cognitive science (see e.g., Chomsky, 1965; Reali and Christiansen, 2005). In some domains, we may require the aid of innate knowledge to acquire phenomenon-specific knowledge resembling that tested in BLiMP. By evaluating whether self-supervised learners like LMs acquire human-like grammatical acuity in a particular domain, we gather indirect evidence as to whether this phenomenon is a necessary component of humans' innate knowledge.
Another aim of BLiMP is to serve as a guide for future work on the linguistic evaluation of LMs. It is particularly interesting to better understand those empirical domains where current LMs appear to acquire some relevant knowledge, but still fall short of human performance. The results from BLiMP suggest that, in addition to relatively well-studied phenomena like filler-gap dependencies, NPIs, and binding, argument structure remains one area where there is much to uncover about what LMs learn. More generally, as language modeling techniques continue to improve, it will be useful to have large-scale tools like BLiMP to efficiently track changes in what these models do and do not know about grammar.