Abstract
We introduce Holmes, a new benchmark designed to assess language models’ (LMs’) linguistic competence—their unconscious understanding of linguistic phenomena. Specifically, we use classifier-based probing to examine LMs’ internal representations regarding distinct linguistic phenomena (e.g., part-of-speech tagging). As a result, we meet recent calls to disentangle LMs’ linguistic competence from other cognitive abilities, such as following instructions in prompting-based evaluations. Composing Holmes, we review over 270 probing studies and include more than 200 datasets to assess syntax, morphology, semantics, reasoning, and discourse phenomena. Analyzing over 50 LMs reveals that, aligned with known trends, their linguistic competence correlates with model size. However, surprisingly, model architecture and instruction tuning also significantly influence performance, particularly in morphology and syntax. Finally, we propose FlashHolmes, a streamlined version that reduces the computational load while maintaining high ranking precision.
1 Introduction
Linguistic competence is the unconscious understanding of language (Chomsky, 1965), like the syntactic structure of a sentence. As language models (LMs) are trained on simple tasks such as next word prediction (Brown et al., 2020), one might naturally wonder: What is the linguistic competence of LMs, and how do they differ? To answer such questions, contemporary benchmarks estimate cognitive abilities, as done for mathematical reasoning (Cobbe et al., 2021) or factual knowledge (Petroni et al., 2019b, 2020). However, such benchmarks rely on LMs’ use of language (textual responses), known as linguistic performance (Matthews, 2014). As a result, they conflate abilities tested with specific instructions, as done for syntactic phenomena in Blevins et al. (2023), with latent abilities like producing coherent text or following instructions. As this entanglement makes it infeasible to draw definitive conclusions (Hu and Levy, 2023; Liang et al., 2023; Perlitz et al., 2024), recent studies call for assessing LMs’ linguistic competence comprehensively and in isolation (Lu et al., 2023; Mahowald et al., 2024).
In this work, we introduce Holmes (Figure 2), a benchmark to assess the linguistic competence of LMs (Figure 7) regarding numerous linguistic phenomena. To disentangle LMs’ understanding of these phenomena from their linguistic performance, we assess the LMs’ internals using classifier-based probing (Tenney et al., 2019a; Hewitt and Manning, 2019; Belinkov, 2022). As illustrated in Figure 1 for probing the part-of-speech (POS) tags of words, we first train linear models (probes) on the internal representations of text inputs from the last model layer to predict specific aspects of the phenomenon. We then approximate the LMs’ grasp of these phenomena using the probes’ performance, rigorously verified using control tasks (Hewitt and Liang, 2019) and from an information theory perspective (Voita and Titov, 2020). With this particular and comprehensive scope, we thoroughly address the initially raised questions as follows:
In Holmes, we encode examples of probing datasets using frozen LMs. Then, we train probes (linear models) with labels representing the specific linguistic phenomenon under test. Finally, we use the results of testing the probes to approximate the LMs’ linguistic competence regarding the tested phenomena.
Overview of Holmes (left) with the five phenomena types (right) and an example of probing-based evaluations for part-of-speech: encoding the input tokens and predicting the POS tag for cucumber, here NN.
Meta-Study (§ 3)
The review of over 270 probing studies reveals a gap in comprehensively evaluating linguistic competence. Despite covering over 200 probing tasks and 150 LMs, individual studies focus on particular tasks and LMs. As a result, only three LMs were probed on over 20% of the tasks, and only one task (POS) was evaluated for more than 20% of the LMs. Notably, recent large LMs are significantly underrepresented.
Benchmark (§ 4)
Addressing these identified deficiencies, Holmes offers a structured way to assess LMs’ English linguistic competence comprehensively. It features 208 distinct datasets covering morphology, syntax, semantics, reasoning, and discourse phenomena, including previously underrepresented ones like negation or rhetoric.
Results and Analysis (§ 5)
From assessing 59 LMs, we find that no LM consistently excels over the others. Further, linguistic competence is more pronounced for morphology and syntax than the other types of phenomena, and LMs’ linguistic competence is fundamentally affected by model size, model architecture, and instruction tuning.
First, we generalize previous findings (Tenney et al., 2019b; Zhang et al., 2021) and show that LMs’ linguistic competence, particularly morphology and syntax, scales beyond 350 million parameters. Second, contrary to the prompting evaluations (Lu et al., 2023) and aligned with Waldis et al. (2024a) and Gautam et al. (2024), model architecture is critical. The linguistic competence of decoder-only LMs lags behind that of encoder-only ones. Not even 70-billion-parameter decoders produce representations for words with the same stability as encoders with 110 million parameters. Third, while instruction tuning (Ouyang et al., 2022; Touvron et al., 2023; Zhou et al., 2023) aims to align LMs with human interactions, we focus for the first time on its effect on linguistic competence. We find that instruction tuning improves morphology and syntax but has mixed effects on other phenomena types, hinting at a superficial alignment. Lastly, we compare Holmes with other benchmarks. While LM rankings on reasoning-intensive downstream tasks (Beeching et al., 2023) correlate with reasoning phenomena, explicitly prompting for linguistic phenomena (Liang et al., 2023) leads to unreliable results. While these results show that Holmes partly aligns with other benchmarks, its probing-based evaluation remains indispensable for explicitly testing LMs’ linguistic competence disentangled from their linguistic performance.
Efficiency (§ 6)
Finally, to mitigate the heavy computational burden of evaluating a new LM on Holmes, we form the streamlined version FlashHolmes by selectively excluding samples not significantly influencing overall rankings (Perlitz et al., 2023). Specifically, FlashHolmes approximates Holmes rankings with high precision while requiring only ∼3% of the computation.
Contributions
With Holmes, we introduce a comprehensive and thorough benchmark to assess LMs’ linguistic competence, providing ground to evaluate them more holistically. Extensive experiments on Holmes reveal that LMs’ linguistic competence is manifold and more pronounced for phenomena targeting words and syntactic structure than for semantic, reasoning, or discourse phenomena. LM properties like size or architecture crucially account for differences among LMs. Fostering further research, we provide interactive tools to explore Holmes and straightforward evaluation code for upcoming LMs with efficiency in mind.
2 Preliminaries
Language Models (LMs)
Language models compute probabilities for word sequences i, enabling tasks such as classifying i, textual comparisons between i and another sequence i′, and text generation based on i. We consider LMs as any model producing representations of i, regardless of their specific type: sparse like bag-of-words (Harris, 1954); static such as GloVe (Pennington et al., 2014); or contextualized transformers (Devlin et al., 2019; Raffel et al., 2020).
Linguistic Competence and Performance
For centuries (Robins, 2013), linguists have been fascinated by the processes of language learning, usage, and evolution. One specific discussion is the differentiation between knowing and using a language. de Saussure (1916) distinguished between language as a system of rules and words (langue) and its usage (parole), an ongoing, negotiated fulfillment of the societal need for communication. Similarly, Chomsky (1965) uses the term linguistic competence for the unconscious understanding of language and linguistic performance for using language in any utterance. In this work, we follow Chomsky’s terminology and treat LMs as static artifacts of a certain time, omitting the ongoing societal processes considered by de Saussure. Specifically, we focus on assessing the linguistic competence of LMs regarding specific linguistic phenomena like word dependencies and their distinct POS tags. In contrast, contemporary benchmarks (Cobbe et al., 2021; Petroni et al., 2019b, 2020) assess linguistic performance by providing textual instructions and verifying LMs’ textual responses. Note that this evaluation protocol can also verify an understanding of specific linguistic phenomena, as done in Blevins et al. (2023) or Liang et al. (2023) for syntactic structure. However, such evaluation protocols conflate LMs’ linguistic competence with latent abilities (like following instructions). Thus, Holmes’ unique evaluation perspective is indispensable for assessing linguistic phenomena in isolation and, in turn, LMs comprehensively.
Linguistic Phenomena
We define the linguistic competence of LMs as their ability to understand a diversity of linguistic phenomena. Specifically, we focus on five phenomena types: morphology, the structure of words; syntax, the structure of sentences; semantics, the meaning of words; reasoning, the use of words in logical deduction and other related phenomena like negation or speculation; discourse, the context in text like rhetorical structure. Following Mahowald et al. (2024), we categorize these phenomena types into two groups: morphology and syntax are formal phenomena, which include understanding grammatical rules and statistical patterns, while functional ones (semantics, reasoning, and discourse) focus on practical abilities like interpreting text sentiment or detecting the existence of speculation.
Datasets
We define a dataset as text examples and labels covering a specific aspect of a linguistic phenomenon, like words and their POS tags. Typically, these labels are unambiguous, enabling us to assess the specific aspect under test in isolation.
Probes
Using probes, we empirically assess the linguistic competence of LMs regarding the featured linguistic phenomena in Holmes. We design probing tasks using the widely recognized classifier-based probing method (Tenney et al., 2019a; Hewitt and Manning, 2019; Belinkov, 2022), also known as diagnostic classifiers (Veldhoen et al., 2016; Giulianelli et al., 2018). Running such a probing task involves training a probe (linear model) using the specific dataset to test a distinct aspect of a linguistic phenomenon in isolation. To do this, we encode the text examples of a dataset with a given LM and use them to train the probe regarding the specific labels representing the tested linguistic phenomenon. The probe’s performance is then used to approximate the LM’s understanding of the specific phenomenon. A higher score indicates that the LM internally captures patterns relevant to this phenomenon, which in turn benefits the probe’s accuracy (Tenney et al., 2019b).
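As a minimal sketch (not the exact Holmes implementation), the snippet below extracts the last-layer representation of the word cucumber from a frozen encoder LM; this vector would serve as the probe’s input for the POS example in Figure 1. The choice of bert-base-uncased and the word-to-subword mapping are illustrative assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative frozen encoder LM; Holmes covers many more model types.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

words = ["And", "then", ",", "the", "cucumber", "was", "hurled", "into", "the", "air", "."]
target = words.index("cucumber")
encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():  # the LM stays frozen; only the probe is trained
    hidden = model(**encoding).last_hidden_state[0]  # (num_subwords, hidden_dim)

# Average the subword representations belonging to "cucumber"; this vector is the probe input.
positions = [i for i, w in enumerate(encoding.word_ids()) if w == target]
cucumber_vector = hidden[positions].mean(dim=0)
```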
3 Meta-Study
3.1 Scope
We analyze 28k papers (P) from 2015 to August 2023 of major NLP conferences (TACL, ACL, AACL, COLING, EACL, EMNLP, NAACL, and the corresponding workshops) expanded with selected work from other venues such as ICLR. To identify relevant work, we employ a semi-automatic approach. First, we use automated filtering based on paper metadata and full text,1 grounded in the occurrence of established terminology related to the specific focus of Holmes, namely, disentangling the linguistic competence of LMs by studying their internal representations. This terminology, including probing and probe, is commonly found in influential literature surveys (Rogers et al., 2020; Belinkov, 2022) and diverse investigation settings, such as analyzing internal representations using linear classifiers (Tenney et al., 2019b; Conneau et al., 2018; Elazar et al., 2021) or masked-based approaches focusing on lexical knowledge of LMs (Petroni et al., 2019a; Talmor et al., 2020a; Kassner et al., 2021; Peng et al., 2022). Specifically, we define three criteria to identify relevant papers: P′ = {p ∈ P | p ∈ P1 ∨ p ∈ P2 ∨ p ∈ P3}, where:
P1: papers with probing or probe in the title.
P2: papers with probing or probe in the abstract and at least five occurrences in the main content.
P3: papers with probing or probe occurring at least ten times in the main content.
We identified 493 matching papers (P′) by applying these criteria. We then manually review the automatically generated candidate list (P′) and select studies that examined LMs with one or more specific linguistic phenomena as part of their analysis or as a primary contribution. This process involves filtering out papers using the term probing in other senses, such as probing hash tables in Bogoychev and Lopez (2016). Moreover, we supplement the candidates with a curated selection of highly relevant studies that do not meet the above criteria, for example, seminal works published before 2019 that employ terms like “diagnostic classifier” (Giulianelli et al., 2018; Hupkes and Zuidema, 2018), as well as other notable studies (Gupta et al., 2015; Shi et al., 2016). This comprehensive approach yields 274 relevant papers (Pr), which we further analyze subsequently.
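To make the criteria concrete, a minimal sketch of such a keyword filter is shown below; the field names (title, abstract, full_text) are hypothetical placeholders for the extracted paper metadata, not the actual pipeline.

```python
def matches_criteria(paper: dict) -> bool:
    """Return True if a paper satisfies P1, P2, or P3 (field names are assumed)."""
    terms = ("probing", "probe")
    title = paper["title"].lower()
    abstract = paper["abstract"].lower()
    body = paper["full_text"].lower()
    body_hits = sum(body.count(t) for t in terms)

    p1 = any(t in title for t in terms)                        # P1: term in the title
    p2 = any(t in abstract for t in terms) and body_hits >= 5  # P2: abstract + >=5 body occurrences
    p3 = body_hits >= 10                                       # P3: >=10 body occurrences
    return p1 or p2 or p3

# candidates = [p for p in papers if matches_criteria(p)]  # yields the candidate set P'
```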
3.2 Analysis
i) Scattered Evolution Calls for Consolidation.
We begin by examining the evolution of relevant studies in the field, illustrated in Figure 3. We analyze the citation patterns among these studies, distinguishing between probing citations (Cp), which represent citations between them, and general citations (Cg), which encompass all other citations. The colorized ratio visually relates these two measures. This analysis reveals that only a small fraction of the works have garnered broad recognition, with 16 papers exceeding 200 general citations. Furthermore, probing works cite each other relatively infrequently, with an average probing citation ratio of α = 0.1. This suggests that other fields have paid limited attention to LMs’ linguistic competence. The scattered citation patterns and lack of engagement with this topic underscore the need to consolidate existing resources and establish a solid foundation to bootstrap research in this area.
Citation analysis considering probing citations originating from the set of relevant work and every other citation (general citations). The color scale indicates the ratio (α) between them.
ii) Probing Work Prioritizes Tasks and Analytics over Methods.
We categorize the selected work according to their probing focus into three categories: methodological, which introduces new methods, such as control tasks (Hewitt and Liang, 2019) or minimum description length (Voita and Titov, 2020); task-focused, which assesses specific linguistic phenomena as main contributions, such as discourse relations in text (Koto et al., 2021); and analytical, which uses probing tasks to analyze LMs, such as the impact of pre-training data (Zhang et al., 2021). As shown in Figure 4, the majority of studies (51.8%) focus on specific probing tasks, such as numeric scales (Zhang et al., 2020), or morphosyntactic analysis (Shapiro et al., 2021). A significant proportion (35.7%) use probing as a supplementary analytical tool, for example, to analyze the effect of fine-tuning (Mosbach et al., 2020a; Zhu et al., 2022a). The remaining 12.5% address methodological problems related to probing (Wu et al., 2020; Immer et al., 2022; Zhu et al., 2022b).
Categorization of the selected studies by their focus and their conducted probing method.
iii) The Dominance of Classifier-based Probing.
Next, we analyze the specific employed probing method regarding four categories: (1) classifier-based probing, which uses linear or shallow models to probe internal representations of LMs; (2) mask-based probing, where LMs fill gaps to verify linguistic phenomena; (3) attention-based probing, which relies on attention patterns; and (4) other methods that do not fit into the previous three categories. Our analysis indicates that most studies (74%) utilize the classifier-based probing method, as exemplified in Tenney et al. (2019a). Additionally, 20% of studies conduct mask-based probing, as shown in Talmor et al. (2020b). In contrast, only a small portion of work (∼ 3%) considers attention patterns or other approaches, such as bridging (Pandit and Hou, 2021) or dimension selection (Torroba Hennigen et al., 2020).
iv) Tasks and LMs Are Barely Broadly Evaluated.
Finally, we examine the tasks and LMs investigated by the relevant studies. For example, Tenney et al. (2019b) explore BERT on various tasks, including POS tagging, semantic-role labeling (SRL), and others. Our analysis reveals that, collectively, these studies cover a remarkable 289 unique tasks and 161 distinct LMs, demonstrating a broad scope of investigation. Below, we delve into the details and highlight noteworthy findings.
We analyze how LMs and tasks are considered jointly in Figure 5. Despite the broad coverage, single studies, including fundamental ones, maintain a particular focus and consider only a fraction of LMs and tasks. For example, while most tasks (72%) were assessed on BERT, RoBERTa’s coverage already drops to 42%. Conversely, POS tagging, the most probed task, was only evaluated on 23% of the LMs, excluding prominent examples like BART (Lewis et al., 2020). In particular, more recently released, larger, and more powerful LMs, like Pythia (Biderman et al., 2023), UL2 (Tay et al., 2023), or LLAMA-2 (Touvron et al., 2023), as well as instruction-tuned LMs like FLAN-T5 (Chung et al., 2022) or LLAMA-2-Chat (Touvron et al., 2023), are missing almost entirely, with only a few recent exceptions (Hu and Levy, 2023; Waldis et al., 2024a). Again, these insights underscore the need to consolidate existing resources for more comprehensive coverage.
Overview of how many tasks single LMs cover and vice versa. Single examples are highlighted.
Figure 6 further highlights this point by sorting LMs and tasks according to their frequency of mention in relevant works and plotting their cumulative coverage. For example, considering all studies (red line), the top-10 most mentioned LMs account for 80% of all LM mentions (black dot), while the remaining 151 unique LMs account for only 40%. A comparison by paper focus reveals that methodological studies rely only on a limited set of 24 LMs and 36 tasks. In contrast, task-focused and analytical work cover a similar number of LMs (91 and 99, respectively). However, due to their distinct focus, task-focused studies cover a significantly larger number of tasks (202) than analytical ones (115).
Cumulative coverage of LMs and tasks, considering all relevant studies and their focus.
3.3 Summary
Our meta-study emphasizes the need to consolidate existing resources for a comprehensive assessment of the linguistic competence of LMs—a manifold ability that remains a blind spot in evaluation research. Apart from enabling more thorough evaluations, such a stimulus can significantly boost future research, as happened in computer vision with ImageNet (Deng et al., 2009) or in NLP with GLUE and SuperGLUE (Wang et al., 2019a, b).
4 Holmes Benchmark
With Holmes, we provide an extensive ground to tackle these identified deficiencies in the existing literature and comprehensively investigate the English linguistic competence of LMs. Specifically, Holmes features 208 datasets addressing distinct aspects of 66 phenomena covering morphology, syntax, semantics, reasoning, and discourse.
4.1 Datasets
We provide comprehensive coverage of linguistic phenomena with 208 unique datasets. We leverage existing and established resources like OntoNotes (Weischedel et al., 2013), English Web Treebank (Silveira et al., 2014), or BLiMP (Warstadt et al., 2020) to create datasets addressing phenomena like the POS of words, their dependencies, or the linguistic acceptability of sentences. Further, we include a range of less employed data, addressing contextualization of words (Klafka and Ettinger, 2020), reasoning (Talmor et al., 2020b), semantic decomposition (White et al., 2016; Rudinger et al., 2018a, b; Govindarajan et al., 2019; Vashishtha et al., 2019), grammatical knowledge (Huebner et al., 2021), bridging (Pandit and Hou, 2021), and rhetorical (Carlson et al., 2001) and discourse (Webber et al., 2019) structure in text. Finally, we cover rarely probed phenomena like negation (Szarvas et al., 2008; Konstantinova et al., 2012; Vahtola et al., 2022), or word complexity (Paetzold and Specia, 2016).
4.2 Structure
Apart from the comprehensive scope, Holmes provides a clear structure for specific evaluations on different levels of aggregation. We first group the datasets according to the linguistic phenomena addressed. Next, we categorize these phenomena into their previously defined five phenomena types (see § 2): morphology, like the agreement of subject and verb; syntax, such as the part-of-speech of words; semantics, like semantic roles of words; reasoning, such as detecting a negated sentence; and discourse, like selecting the correct following sentence. Table 1 provides examples for every type of phenomenon. Note that we rely on the categorization provided by the specific studies whenever given (more details in the Appendix § I.3). For example, Conneau et al. (2018) categorized the tense of the main clause as semantic. This phenomenon could also be categorized as syntax if we test the detection of incorrect formulations given a specific tense. However, we follow the authors’ suggestion and test the detection of the tense on a sentence level, which represents semantic aspects.
Example instance of Holmes datasets for every type of linguistic phenomena. The relevant part of the example for the specific label is underlined.
| Type | Phenomena | Example | Label |
|---|---|---|---|
| Morphology | Subject-Verb Agreement | And then, the cucumber was hurled into the air. | Correct |
| | | And then, the cucumber were hurled into the air. | Wrong |
| Syntax | Part-of-Speech | And then, the cucumber was hurled into the air. | NN (Noun Singular) |
| Semantic | Semantic Roles | And then, the cucumber was hurled into the air. | Direction |
| Reasoning | Negation | And then, the cucumber was hurled into the air. | No Negation |
| Discourse | Node Type in Rhetorical Tree | And then, the cucumber was hurled into the air. | Satellite |
4.3 Experimental Setup
Holmes’ evaluation follows the predominantly used classifier-based probing paradigm, as described in § 2, to analyze the internal representations of the last layer of LMs.2 Thereby, we maximally disentangle the understanding of distinct linguistic phenomena from each other and from other cognitive abilities, such as following textual instructions. Further, this method allows us to assess any LM type, including sparse, static, or contextualized ones. Depending on the specific dataset, we either select the embeddings of the specific input tokens (like single words for POS tagging) or average embeddings across a span or the whole sentence. We define a probing task as training a probe fp (a linear model without intermediate layers) using these embeddings as inputs and the dataset labels as training signals. If no splits are defined in the original data, we divide the dataset samples into train/dev/test splits with a ratio of 70/10/20. We repeat this procedure five times using different random seeds for a robust measurement.
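The following condensed sketch shows one probing run on pre-computed embeddings; the 70/10/20 split and the five seeds follow the setup above, while scikit-learn’s logistic regression stands in for the linear probe (the actual probes are trained with AdamW, see Appendix A.2).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768))  # placeholder for frozen-LM representations
labels = rng.integers(0, 5, size=1000)     # placeholder labels, e.g., POS tags

def run_probe(embeddings: np.ndarray, labels: np.ndarray, seed: int) -> float:
    """Train a linear probe on frozen-LM embeddings and return macro F1 on the test split."""
    X_train, X_rest, y_train, y_rest = train_test_split(
        embeddings, labels, train_size=0.7, random_state=seed)
    X_dev, X_test, y_dev, y_test = train_test_split(
        X_rest, y_rest, test_size=2 / 3, random_state=seed)  # 10% dev / 20% test
    # The dev split is kept for completeness (used for epoch selection in the full setup).
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return f1_score(y_test, probe.predict(X_test), average="macro")

# Repeat over five seeds for a robust measurement.
scores = [run_probe(embeddings, labels, seed) for seed in range(5)]
print(np.mean(scores), np.std(scores))
```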
4.4 Evaluations
We approximate how well an LM encodes specific linguistic phenomena using the absolute prediction performance of the probes. In addition, we rigorously evaluate the reliability of probing results using control tasks and from an information theory perspective (Voita and Titov, 2020; Hewitt and Liang, 2019). Unlike commonly used prompting assessments, this evaluation protocol avoids known pitfalls in which results and conclusions are sensitive to specific instructions (Mizrahi et al., 2024; Min et al., 2022) or few-shot examples (Lu et al., 2023).
Task Score Metric
Based on a dataset’s specific task type, we use a corresponding performance measure: macro F1 for classification or Pearson correlation for regression. In addition, we calculate the standard deviation σ of the probe across multiple seeds. A lower σ indicates a better encoding of a given linguistic phenomenon since the measurement is robust to noise. Further, we use the task score for a ranking-based evaluation of all evaluated LMs L = {l1,…, lm} within Holmes. We calculate the mean winning rate mwr (in percent), indicating how often an LM wins against the others (Liang et al., 2023). With a higher mwr, we assume an LM encodes the tested linguistic phenomena better than others.
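As a sketch, assuming the mean winning rate counts, over all datasets and pairwise comparisons, how often an LM’s task score exceeds that of another LM:

```python
import numpy as np

def mean_winning_rate(scores: np.ndarray) -> np.ndarray:
    """scores: (num_lms, num_datasets) task scores; returns the mwr per LM in percent."""
    num_lms, num_datasets = scores.shape
    wins = np.zeros(num_lms)
    for i in range(num_lms):
        for j in range(num_lms):
            if i != j:
                wins[i] += np.sum(scores[i] > scores[j])  # head-to-head wins per dataset
    comparisons = (num_lms - 1) * num_datasets            # comparisons per LM
    return 100 * wins / comparisons

scores = np.array([[0.8, 0.6, 0.7],   # LM 1
                   [0.5, 0.9, 0.4]])  # LM 2
print(mean_winning_rate(scores))      # [66.67, 33.33]
```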
Compression
Next, we evaluate the probes’ reliability from an information-theoretic perspective. Following Voita and Titov (2020), we use the compression I = (n · log2 K) / mdl, the ratio between the codelength of a uniform encoding of n instances with a label space of size K and the minimum description length mdl obtained with the probe. A higher I means fewer bits are needed to encode the instances and their labels, indicating that the given linguistic phenomenon is more clearly encoded in the internal representations of LMs.
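A minimal helper reflecting this definition, assuming the minimum description length has already been computed (e.g., via online coding as in Voita and Titov, 2020); the numbers in the example are illustrative:

```python
import math

def compression(mdl_bits: float, num_instances: int, num_labels: int) -> float:
    """Ratio between the uniform codelength n*log2(K) and the probe's MDL (higher is better)."""
    uniform_bits = num_instances * math.log2(num_labels)
    return uniform_bits / mdl_bits

# Example: 10k instances, 17 POS labels, probe MDL of 21,500 bits -> compression ~1.9
print(round(compression(21_500, 10_000, 17), 2))
```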
Selectivity
A reliable probe should grasp patterns relevant to the tested phenomena in the internal representations of LMs but should not be able to learn anything else. Therefore, we expect high performance when evaluating the specific dataset but low performance when we randomize the training signals. We check this using control tasks introduced in Hewitt and Liang (2019). Specifically, we calculate the selectivity S as the difference in performance between the probe trained on the original labels y and the control task, where we train the probe on randomly assigned labels y′. With a higher S, we assume the detected patterns are relevant for the specific phenomena under test, as random patterns do not lead to similar performance.
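A sketch of the control-task comparison, reusing a probing routine like run_probe above; note that the label permutation used here is a simplification of the word-type-based random assignment in Hewitt and Liang (2019):

```python
import numpy as np

def selectivity(run_probe, embeddings, labels, seed: int = 0) -> float:
    """Difference between probe performance on true labels and on shuffled (control) labels."""
    rng = np.random.default_rng(seed)
    control_labels = rng.permutation(labels)               # randomly re-assigned labels y'
    original = run_probe(embeddings, labels, seed)          # task score with y
    control = run_probe(embeddings, control_labels, seed)   # task score with y'
    return original - control                               # S = score(y) - score(y')
```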
5 Holmes Results
Using Holmes, we evaluate a diverse collection of 59 LMs.3 Using the results of these extensive experiments, we first answer the research question: What is the linguistic competence of LMs? In doing so, we discuss the reliability of results (5 i) and the linguistic competence of LMs concerning the unique structure of Holmes (5 ii). Subsequently, we examine how linguistic competence varies among LMs, as we find LMs prevailing for different types of linguistic phenomena (Figure 7) and delve into the effects of model architecture (5 iii), size (5 iv), and instruction tuning (5 v). Finally, we show how Holmes’ results relate to the linguistic performance of LMs by comparing them with the OpenLLM benchmark (5 vi) and further experiments with the HELM benchmark (5 vii).
A subset of Holmes rankings (↓) for various evaluated LMs. FLAN-UL2 outperforms the others overall, while different LMs prevail for the five distinct types of linguistic phenomena.
i) Holmes Results Are Reliable.
Figure 8 shows the reliability of probing-based evaluations using averaged results across random seeds and LMs. Single outliers are datasets that are too hard for all LMs, either because the sample size is too small or the linguistic phenomena under test are too complex. First, a low average deviation (σ = 0.02) across five seeds underscores the reliability of probing-based measures. These results also highlight the stability of probing results over prompting-based evaluations, where prompt paraphrasing leads to deviations of σ = 0.07, as reported in Mizrahi et al. (2024). Next, substantial compression (average I = 1.9) and selectivity (average S = 0.31) further confirm the probes’ reliability. Note that, for selectivity, we consider only base-sized models (10m–200m parameters) for computational efficiency. Interestingly, two parallel trends emerge. More challenging datasets with many labels, like POS tagging, are arranged around a selectivity of 0.1 to 0.4 and a task metric of 0.3. In contrast, for easier binary classification tasks (such as linguistic acceptability), we observe selectivity around 0.2 to 0.5 and a task metric of 0.6 to 0.9. Furthermore, our analysis reveals a statistically significant positive correlation (p < 0.05) between the task metrics and both compression (τ = 0.64) and selectivity (τ = 0.65). This finding provides strong evidence for the reliability of our task metric, thereby justifying its use as the primary evaluation measure in our study.
Reliability evaluation of Holmes results to ensure low deviation across random seeds, high information compression (log), and high selectivity. Every dot represents the averaged results of one probing dataset across LMs. The x-axis represents the task metrics (either Pearson correlation or macro F1).
ii) LMs’ Linguistic Competence is Manifold.
We focus on what Holmes tells us in general and regarding formal and functional phenomena, as defined in § 2. We report in Figure 9 the task metric, discriminability, and selectivity, averaged for every phenomena type. Note that discriminability (Rodriguez et al., 2021) quantifies how well the LM ranking on one specific dataset aligns with the overall ranking, measured with the Kendall tau correlation. Considering these three metrics, all tested LMs strongly encode formal phenomena (morphology and syntax), which often depend on the local neighborhood of words. Therefore, we assume that LMs approximate these co-occurrences during pre-training with high precision. For example, the specific POS tag of a word, like man (noun), primarily depends on its surroundings, such as the frequent predecessor the. In contrast, LMs encode less information about functional phenomena (semantics, reasoning, and discourse), as they show relatively low performance regarding the task metric. For these functional phenomena, we assume more complex co-occurrences are required to capture the broad context in language, such as the rhetorical relation of two distant text spans. Despite these differences between formal and functional phenomena types, they contribute to the benchmark in a balanced way. A low to medium discriminability indicates that none of these linguistic phenomenon types dominates the overall LM rankings.
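A sketch of how discriminability can be computed, assuming per-dataset LM scores are compared against the overall ranking (here represented by mean winning rates) with Kendall’s tau; the numbers are illustrative:

```python
from scipy.stats import kendalltau

def discriminability(dataset_scores, overall_scores) -> float:
    """Kendall tau between the LM ranking on one dataset and the overall Holmes ranking."""
    tau, _ = kendalltau(dataset_scores, overall_scores)
    return tau

# Example: scores of four LMs on one dataset vs. their overall mean winning rates.
print(discriminability([0.62, 0.55, 0.71, 0.40], [58.0, 51.0, 66.0, 35.0]))  # 1.0, identical ranking
```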
Average task metric, difficulty, and discriminability for each phenomena type. The dashed lines show the average measure over all datasets.
This balanced influence of the five phenomena types is further visible when considering their ranking correlations (Figure 10, left). A high average correlation of 68.4 ± 7.5 with the overall results (last column/row) hints that the phenomena types are facets of a broader ability that share common characteristics. Still, breaking them into categories is meaningful, as the phenomena types (first five columns/rows) are only moderately correlated with each other (average of 54.7 ± 13.9). Analyzing the results per phenomena type further highlights the value of this distinction. While the results of semantics and reasoning are similarly correlated with the overall results (73.9 and 75.6), their direct correlation (58.4) indicates their complementary nature. Further, discourse results show the lowest correlation with the others (44.4 ± 14.7), indicating a particular scope.
Kendall-tau correlation within Holmes (left) and compared to OpenLLM (right). Green stars indicate significant correlations (p < 0.05).
iii) Encoder Architecture Equips LMs with High Linguistic Competence.
Next, we discuss the impact of model architecture on the linguistic competence of LMs. In Figure 11 (left), we compare encoder and decoder LMs. Due to the absence of big encoder LMs, we consider five encoder and six decoder LMs with up to 220m parameters. Encoder LMs show a higher mwr of 52% than decoder LMs (21%). This observation is most pronounced for morphology and syntax, which encompass a variety of token-level phenomena, like part-of-speech. We assume that the missing bi-directional encoding of decoder LMs causes this lower performance because the available context of one token heavily depends on its position. Thus, even common tokens, like the, have different potential representations depending on whether they occur at the beginning or in the middle of a sentence. These instabilities are further evident in Figure 11 (right), which reports the accuracy for the top-20 most common POS tokens (such as the) based on the pos, xpos, and upos datasets. Given their high frequency, one expects stable prediction performance. Surprisingly, encoder LMs (BERT and RoBERTa) show higher median accuracy and lower deviations compared to the same-size decoder counterpart (GPT2). While scaling model size to 12B (Pythia) and 70B (Llama-2) improves accuracy and lowers deviations, decoder LMs do not match the encoder performance, even when up to 700 times larger.
Comparison of the phenomenon types for encoder and decoder LMs (left) and on the right, the accuracy of the top-20 most common tokens of the three part-of-speech probing datasets for BERT, RoBERTa, GPT2, Pythia, and Llama-2.
iv) More Parameters Improve LMs’ Linguistic Competence.
We discuss how the number of parameters influences the linguistic competence of LMs. Given the variety of LMs of different sizes, we focus on the Pythia (decoder-only) and T5 (encoder-decoder) families. From Figure 12, we observe for both Pythia and T5 that linguistic competence scales with model size, particularly after exceeding 0.5B (Pythia) and 1.0B (T5) parameters. Again, model architecture is crucial, as T5 LMs (encoder-decoder) exhibit a clearly higher mean winning rate of 40–70% than Pythia (decoder-only) ones with an mwr of 20–60%. Further, we find that formal phenomena evolve differently with increasing model size than functional ones. Specifically, morphology and syntax start at a lower level, with an apparent performance jump after 0.5B (Pythia) and 1.0B (T5) parameters, followed by slow but steady growth. In contrast, semantics, reasoning, and discourse start at a higher mwr, followed by continuous improvement as the model size grows. From these results, we assume that more parameters enable language models to better approximate simple word co-occurrences in nearby contexts. While they handle formal phenomena like word dependencies, they struggle with more distant and complex co-occurrences, such as rhetorical relations.
Effect of scaling LM parameters considering the T5 and Pythia model families providing eight and five different sizes. We address the overall scope (left) and the different types of linguistic phenomena (right).
v) Instruction-tuned LMs Get Better at Mimicking Language than Understanding it.
We focus on how instruction tuning affects LMs’ linguistic competence and compare tuned and pre-trained LMs, for example, FLAN-UL2 vs. UL2. Table 2 shows rather small effects on the overall scope but more pronounced ones for the five phenomenon types, again emphasizing the value of a structured and comprehensive evaluation of linguistic competence. On average, we find that instruction tuning has the highest effect on morphology (+10%), followed by syntax (+5%) and reasoning (+4%), and a negative effect on semantics (−3%) and discourse (−1%). These results confirm previous assumptions that instruction-tuning updates are often superficial (Yadav et al., 2023; Hershcovitch et al., 2024; Sharma et al., 2023) and that LMs get better at mimicking language (formal phenomena) than understanding it, measured with functional phenomena (Mahowald et al., 2024). Further, larger models benefit more from instruction tuning: Llama-2-70b-Chat and FLAN-UL2 gain up to +24% and +41% for morphology and +10% and +12% on average. When comparing LMs based on Llama-2-13B, we see that specific fine-tuning methods shape the LMs differently. The top-ranked 13B LM for Holmes and OpenLLM, Vicuna, was trained on fewer instructions than the others (125k) but of higher quality. This high quality seems important, as LMs trained on more but lower-quality instructions (Tülu with approx. 330k instructions) lose performance; the same holds for the 70B versions. Further, and aligned with the previous comparison with OpenLLM results, reasoning specialization (Orca-2) is reflected in the corresponding phenomena. These insights show again that, while providing a particular perspective, Holmes reveals clear differences between LMs and allows us to map them to methodological decisions.
The mixed effect of instruction tuning on the mean winning rate compared to the pre-trained LMs.
| Model | Morphology | Syntax | Semantics | Reasoning | Discourse | Overall |
|---|---|---|---|---|---|---|
| Comparison against Llama-2 with 7 billion parameters | | | | | | |
| Llama-2-Chat | −8% | +5% | −6% | −8% | −2% | −2% |
| Comparison against T5 with 11 billion parameters | | | | | | |
| FLAN-T5 | +9% | +1% | −3% | +6% | 0% | +1% |
| Comparison against Pythia with 12 billion parameters | | | | | | |
| Dolly-v2 | +4% | −1% | −9% | −2% | +2% | −3% |
| Comparison against Llama-2 with 13 billion parameters | | | | | | |
| Tülu-2 | +6% | +3% | −13% | +1% | −13% | −4% |
| Orca-2 | 0% | −4% | −6% | +3% | −2% | −3% |
| Llama-2-Chat | +9% | +6% | 0% | +7% | +1% | +4% |
| Vicuna-v1.5 | +26% | +9% | 0% | +8% | +2% | +7% |
| Comparison against UL2 with 20 billion parameters | | | | | | |
| FLAN-UL2 | +41% | +15% | +6% | +11% | −1% | +12% |
| Comparison against Mixtral with ∼47 billion parameters | | | | | | |
| Mixtral-Instruct | +6% | +4% | +1% | +9% | +3% | +4% |
| Comparison against Llama-2 with 70 billion parameters | | | | | | |
| Tülu-2 | +14% | 0% | −9% | −4% | +1% | −2% |
| Llama-2-Chat | +24% | +13% | +3% | +3% | +13% | +10% |
| Average | +10% | +5% | −3% | +4% | −1% | +2% |
vi) Internals of LMs are Partly Aligned with their Linguistic Performance.
We analyze the alignment of the probing-based LM rankings of Holmes with prompting-based rankings obtained from downstream evaluations of the LMs’ responses (linguistic performance). Specifically, we compare against OpenLLM (Beeching et al., 2023).4 Figure 10 (right) shows that the Holmes and OpenLLM rankings of jointly evaluated LMs are moderately correlated, hinting that LMs’ linguistic competence is partly reflected in their language utterances when solving concrete tasks. While syntax, semantics, and discourse show similar correlations (54.7 to 58.0), morphology and reasoning exhibit substantially higher ones of 65.3 and 77.5. These results suggest that LMs’ reasoning abilities are reflected in their internal representations when evaluating related phenomena like identifying the cause of negations. These correlation patterns are consistent across the three most meaningful OpenLLM datasets (MMLU, TruthfulQA, and GSM8K). As TruthfulQA shows lower correlations with the linguistic phenomena and the other datasets within OpenLLM, we presume this dataset captures distinctly different skills (possibly knowledge).
vii) Prompting is not a Substitute for Probing When Evaluating LMs’ Linguistic Competence.
Finally, we compare probing- and prompting-based LM rankings on the jointly evaluated BLiMP tasks (Warstadt et al., 2020) of Holmes and HELM (Liang et al., 2023). Results (Appendix, Figure 15) show clear discrepancies (rank correlation τ = 0.05) between evaluating LMs’ internal representations and evaluating their responses (linguistic performance) to HELM instructions. As most prompting-based results from HELM fall below the random baseline, only probing-based evaluation can effectively isolate the assessment of linguistic phenomena. In contrast, prompting-based methods mix this assessment with other abilities, such as instruction following. Similar to Hu and Levy (2023), these insights show the need for a more comprehensive comparison of different evaluation protocols like probing, prompting, or log-probabilities (used in HELM in Figure 33 on page 58 as a workaround for BLiMP). Nevertheless, probing provides a unified evaluation protocol that assesses the diversity of linguistic phenomena using representations of tokens, spans, or whole texts, beyond minimal-pair tasks that test whether correct sentences are preferred over wrong ones.
6 Efficiency
Seamless, easy, and cost-effective integration of new LMs is crucial for a benchmark to be widely adopted. As Holmes covers many datasets and examples, it is computationally heavy in encoding text and training the probes: it takes six GPU days to encode the 70 million tokens (230k pages) and two days to run the 208 probes for a 70B model. To address this issue, we introduce FlashHolmes, a streamlined version of Holmes, to evaluate new LMs with a fraction of the compute while maintaining evaluation integrity.
Besides excluding licensed data (18 probing datasets), we analyze the effect of discarding training instances. As a result, we reduce the computation for encoding and the actual probing simultaneously. We follow Perlitz et al. (2023) and calculate the rank resolution, the 95% confidence interval of the model rank difference. This measure indicates the maximum expected rank deviation from evaluating an LM on FlashHolmes compared to Holmes. For example, a rank resolution of one means that an LM evaluated on FlashHolmes and Holmes either keeps the same rank or switches places with its neighbors with a probability of 95%. Figure 13 shows the resulting rank resolution when training only on a fraction of the instances, from 1/2 to 1/512. Solely focusing on efficiency (1/512) still provides a decent rank resolution of ∼2.6. In contrast, considering 1/2 of the training data results in the best reliability of ∼0.9. To balance benchmark reliability and efficiency, we compose FlashHolmes using 1/32 of the training instances. Precisely, it reduces the computational expenses of evaluating LMs to ∼3% of what Holmes would have required while preserving a rank resolution of ∼1.5.
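A hedged sketch of the underlying idea, assuming the rank resolution is estimated as the 95th percentile of absolute rank differences between the full and the reduced benchmark (the exact procedure follows Perlitz et al., 2023); the scores below are illustrative:

```python
import numpy as np

def rank_resolution(full_scores: np.ndarray, reduced_scores: np.ndarray) -> float:
    """95th percentile of absolute rank differences between full and reduced benchmark."""
    full_ranks = np.argsort(np.argsort(-full_scores))      # rank 0 = best LM
    reduced_ranks = np.argsort(np.argsort(-reduced_scores))
    return float(np.percentile(np.abs(full_ranks - reduced_ranks), 95))

full = np.array([71.0, 63.5, 58.2, 44.1, 30.9])     # mean winning rates on Holmes
reduced = np.array([70.2, 58.9, 60.1, 43.0, 31.5])  # mean winning rates on FlashHolmes
print(rank_resolution(full, reduced))                # here: 1.0 (at most neighboring ranks swap)
```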
Analysis of the reliability vs. efficiency trade-off when reducing the number of training data.
7 Related Work
Benchmarking LMs
Benchmarks approximate LMs’ abilities like general language understanding (Wang et al., 2019a, b), out-of-distribution generalization (Yang et al., 2023; Waldis et al., 2024b), real-world knowledge contradiction (Hou et al., 2024), adversarial scenarios (Nie et al., 2020; Wang et al., 2021), or retrieval (Thakur et al., 2021; Muennighoff et al., 2023). With the recent advent of large LMs, the predominant method has shifted to evaluating the linguistic performance of LMs given textual instructions (Brown et al., 2020; Hendrycks et al., 2021; Srivastava et al., 2022). While LMs show substantial performance on application-oriented tasks (Liang et al., 2023) or mathematical reasoning (Cobbe et al., 2021), such evaluations are sensitive to the specific formulations (Mizrahi et al., 2024) or metrics (Schaeffer et al., 2023) employed. Thus, results of different benchmarks were found to disagree substantially (Yuan et al., 2024; Perlitz et al., 2024).
Assessing the Linguistic Competence of LMs
Analyzing LMs’ linguistic competence started with static word vectors (Köhn, 2015), sentence embeddings (Conneau et al., 2018; Adi et al., 2017), the internals of translation models (Shi et al., 2016; Bau et al., 2019), or contextualized LMs (Tenney et al., 2019b, a; Hewitt and Manning, 2019). Other methodological work addressed the validity of obtained results with control tasks (Hewitt and Liang, 2019) or from an information theory perspective (Voita and Titov, 2020; Pimentel et al., 2020), or studied causal effects (Elazar et al., 2021). While further studies focus on whether LMs follow human understanding of linguistic competence when solving downstream tasks (Belinkov, 2022; Aw et al., 2023; Mahowald et al., 2024), Mosbach et al. (2020b) and Waldis et al. (2024a) found that downstream task fine-tuning hurts the understanding of linguistic phenomena.
In contrast to prior studies, Holmes assesses the linguistic competence of an extensive set of contemporary LMs covering a comprehensive collection of linguistic phenomena. Unlike other work that evaluates linguistic phenomena with prompting (Blevins et al., 2023; Amouyal et al., 2024), which can lead to unreliable results (Liang et al., 2023), probing allows Holmes to reliably and comprehensively compare LMs regardless of architecture or pre-training. As a result, Holmes can address recent calls to thoroughly and explicitly evaluate linguistic phenomena (Hu and Levy, 2023; Lu et al., 2023; Mahowald et al., 2024).
8 Conclusion
Holmes marks the most up-to-date and extensive consolidation of existing resources addressing the need to assess the linguistic competence of LMs in isolation. Our experiments demonstrate that LMs’ linguistic competence is pronounced for formal phenomena but lags for functional ones, which require information about broader textual contexts, such as rhetorical structure. Simultaneously, size, architecture, and instruction tuning are crucial factors for differences among LMs. As LMs and linguistic resources constantly grow, we will actively extend Holmes with new datasets and upcoming LMs.
Ethical Considerations and Limitations
Language
Holmes as well as FlashHolmes solely assess linguistic phenomena for the English language. While we plan to expand the benchmark’s scope to multilingual data, we focus for now on English because of the widespread availability of resources, including curated corpora, and the diversity of available LMs.
Last Layer Internal Representation
Given the extensive scope of the analysis presented in this work, we focus on examining the internal representation of LMs through the output of their last layer. While this analysis provides valuable insights, it only partially captures the complexity inherent in LMs across all their layers. To facilitate further research into the comprehensive analysis of LMs, we see Holmes providing groundwork, including the release of the specific tasks in a unifying format and corresponding evaluation code, which can be easily adapted to investigate specific layers of LMs.
Coverage
We agree with Liang et al. (2023) that acknowledging incompleteness is a fundamental aspect of composing a benchmark. The covered linguistic phenomena, LMs, and underlying meta-studies are a subset of the variety of available resources. We consolidated them carefully to provide a comprehensive view of linguistic competence across various LMs. However, as benchmarks evolve as tools to assess LMs, we will further expand Holmes with both existing and upcoming LMs and data resources.
Data Availability
Linguistic annotations, in particular more complex ones targeting phenomena like discourse, are expensive in terms of money and time. Of the 208 datasets included in Holmes, 18 probing datasets are based on licensed resources and are not freely available. However, with FlashHolmes, we provide an effective and efficient alternative based on open-access resources. Furthermore, upon confirming the granted access, we are happy to share our probing datasets, including those based on the licensed resources.
Bias
As Holmes relies on existing resources, it inherits the biases embodied in these datasets. Examples of such bias concern gender equality and gender fairness, such as the handling of neopronouns like em (Lauscher et al., 2023).
Dataset Contamination
Holmes encompasses a large collection of established datasets, like OntoNotes (Weischedel et al., 2013). While we solely rely on LMs with open-sourced weights, the training or instruction-tuning data is not known for all of them, as for the Llama-2 (Touvron et al., 2023), Mixtral (Jiang et al., 2024), or Wizard (Xu et al., 2023) LMs. Therefore, we must expect that some texts were part of the LMs’ pre-training corpora and that specific tasks, such as named-entity recognition (NER), were used during instruction tuning. However, instruction tuning aligns LMs’ linguistic performance to produce coherent text responding to the specific textual instructions provided and does not explicitly align LMs’ internal representations (Brown et al., 2020; Touvron et al., 2023; Jiang et al., 2024). As Holmes evaluates linguistic competence using LMs’ internal representations, it retains its validity even under potential data contamination (Balloccu et al., 2024). Building upon our results, showing that downstream abilities are partly reflected in LMs’ internal representations, one could examine whether instruction tuning injects task-specific information into LMs’ internal representations, thereby detecting task contamination.
Notes
We use PyPDF2 v3.0.0 and the DBLP and Semantic Scholar APIs.
Please refer to Appendix § I.2 for a complete list.
Unlike other benchmarks like HELM (Liang et al., 2023), OpenLLM covers many open LMs, leading to a high overlap with Holmes.
Note that EMNLP-23 and AACL-23 proceedings were not published when conducting this meta-study.
References
A Additional Details of Holmes
A.1 Additional Details on the Evolution of Probing Literature
We analyze publication trends by year and venue, as shown in Table 3. Less work was published between 2015 and 2018 (earlier), focusing on LSTM-based (Linzen et al., 2016; Conneau et al., 2018) and static LMs (Köhn, 2015; Linzen et al., 2016; Belinkov et al., 2017; Conneau et al., 2018). With the release of BERT (Devlin et al., 2019) in 2019, we note increasing attention to analyzing linguistic abilities within LMs, with a peak of 90 papers in 2022. Considering the venue, more than half of the relevant work (149 papers) was published at major conferences (ACL and EMNLP), and 68 papers were published at AACL, EACL, NAACL, and COLING.5 In addition, we observe a constant contribution from TACL and various workshops, such as Analyzing and Interpreting Neural Networks for NLP or Representation Learning for NLP.
Evolution of probing studies. Note that EMNLP-23 and AACL-23 proceedings were not published when conducting this meta-study.
| | earlier | 2019 | 2020 | 2021 | 2022 | 2023 | Total |
|---|---|---|---|---|---|---|---|
| ACL | 2 | 10 | 12 | 9 | 34 | 25 | 92 |
| AACL | – | – | – | – | 1 | – | 1 |
| COLING | – | – | 10 | – | 9 | – | 19 |
| EACL | – | – | – | 7 | – | 15 | 22 |
| EMNLP | 2 | 4 | 13 | 17 | 21 | – | 57 |
| NAACL | – | 3 | – | 9 | 14 | – | 26 |
| TACL | 1 | 1 | 2 | 3 | 3 | 1 | 11 |
| Workshops | 4 | 4 | 10 | 10 | 7 | 1 | 36 |
| Other | 1 | 2 | 1 | 1 | 1 | 4 | 10 |
| Probing | 10 | 24 | 48 | 56 | 90 | 46 | 274 |
| All Papers | 8,056 | 3,111 | 3,822 | 4,294 | 5,133 | 3,647 | 28,063 |
A.2 Experimental Details
Probing Hyperparameters
Following previous work (Hewitt and Liang, 2019; Voita and Titov, 2020), we use fixed hyperparameters for training the probes: 20 epochs, selecting the best epoch on the dev instances; AdamW (Loshchilov and Hutter, 2019) as optimizer; a batch size of 64; a learning rate of 0.0005; a dropout rate of 0.2; a warmup rate of 10% of the steps; and random seeds [0, 1, 2, 3, 4].
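These settings can be summarized in a small configuration and probe sketch; the PyTorch module below is an illustrative reconstruction of the linear probe, not the released implementation:

```python
import torch

PROBE_CONFIG = {
    "epochs": 20,            # best epoch selected on the dev split
    "optimizer": "AdamW",
    "batch_size": 64,
    "learning_rate": 5e-4,
    "dropout": 0.2,
    "warmup_ratio": 0.1,
    "seeds": [0, 1, 2, 3, 4],
}

class LinearProbe(torch.nn.Module):
    """Linear probe without intermediate layers (illustrative reconstruction)."""
    def __init__(self, hidden_dim: int, num_labels: int, dropout: float = 0.2):
        super().__init__()
        self.dropout = torch.nn.Dropout(dropout)
        self.classifier = torch.nn.Linear(hidden_dim, num_labels)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.dropout(embeddings))

probe = LinearProbe(hidden_dim=768, num_labels=17)
optimizer = torch.optim.AdamW(probe.parameters(), lr=PROBE_CONFIG["learning_rate"])
```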
Hardware
We run all of our experiments using 12 Nvidia RTX A6000 GPUs. Every GPU provides 48GB of memory and 10752 CUDA Cores.
Considered LMs
Table 9 outlines the details of the LMs we evaluate on Holmes in this work.
A.3 Probing Datasets Categorization
We show in Table 4, Table 5, Table 8, Table 6, and Table 7 which resources Holmes uses to cover morphology, syntax, semantics, reasoning, and discourse phenomena. Further, we provide illustrative examples of the phenomena. We rely on 33 works providing the data, the specific linguistic phenomena, or both. For example, for readability, we use the data of Weischedel et al. (2013) and calculate the Flesch score (Flesch, 1948).
Overview of resources and linguistic phenomena mapping for morphology. We give an illustrative example for each phenomenon (*indicates the right option, if options are given) and the number of datasets for the phenomenon by dataset type.
Phenomena | Illustrative Example | Text | Text-Pair | Span | Span-Pair | Warstadt et al. (2020) | Huebner et al. (2021)
---|---|---|---|---|---|---|---
anaphor agreement | Katherine can’t help herself*/himself. | 3 | |||||
determiner noun agreement | Craig explored that grocery store*/stores. | 10 | |||||
irregular forms | Edward hid*/hidden the cats. | 3 | |||||
subject-verb agreement | A sketch of lights does not*/do not appear. | 10 |
Overview of resources and linguistic phenomena mapping for syntax. We give an illustrative example for each phenomenon (*indicates the right option, if options are given) and the number of datasets for the phenomenon by dataset type.
Phenomena | Illustrative Example | Text | Text-Pair | Span | Span-Pair | Weischedel et al. (2013) | Silveira et al. (2014) | Conneau et al. (2018) | Flesch (1948) | Klafka and Ettinger (2020) | Warstadt et al. (2020) | Huebner et al. (2021) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
argument-structure | Most cashiers are disliked*/flirted. | 20 | ||||||||||
bigram-shift | What are you*/you are doing out there? | 1 | ||||||||||
binding | Carlos said that Lori helped him*/himself. | 8 | ||||||||||
case-subjective-pronoun | He brought the pig this suit.*/The pig brought he this suit. | 1 | ||||||||||
constituent parsing | sees Bill ⇒ VP | 2 | 1 | |||||||||
control/raising | Julia wasn’t fun*/unlikely to talk to. | 5 | ||||||||||
deoncausative-inchoative alternation | The warden melted the ice.*/The warden bought the ice. | 1 | ||||||||||
dependency parsing | (into, air) ⇒ pobj | 1 | ||||||||||
ellipsis | He cleans one important book and Stacey cleans a few.*/He cleans one book and Stacey cleans a few important. | 3 | ||||||||||
filler-gap | Brett knew what many waiters find.*/Brett knew that many waiters find. | 9 | ||||||||||
island-effects | Which bikes is John fixing?*/Which is John fixing bikes? | 10 | ||||||||||
local attractor | Can the access work?*/ Can the access works? | 1 | ||||||||||
object-number | Oh gods! ⇒ Plural | 2 | ||||||||||
part-of-speech | cucumber ⇒ NN (Noun Singular) | 3 | ||||||||||
readability | Curriculums need selling points. ⇒ 50.5 (middle) | 1 | ||||||||||
sentence-length | Oh gods! ⇒ 3 words | 1 | ||||||||||
subject-number | Things are going to be noticed. ⇒ Plural | 2 ||||||||||
top-constituent | Did it all matter? ⇒ VBD NP VP | 1 ||||||||||
tree-depth | Where do you want it? ⇒ 6 | 1 |
Overview of resources and linguistic phenomena mapping for reasoning. We give an illustrative example for each phenomenon (*indicates the right option, if options are given) and the number of datasets for the phenomenon by dataset type.
Phenomena | Illustrative Example | Text | Text-Pair | Span | Span-Pair | Vahtola et al. (2022) | Szarvas et al. (2008) | Konstantinova et al. (2012) | Morante and Blanco (2012) | Talmor et al. (2020b) |
---|---|---|---|---|---|---|---|---|---|---|
age comparison | 21 years old is older than 35 years old.*/21 years old is younger than 35 years old. | 1 ||||||||
always-never | Horses have always*/never four legs. | 1 | ||||||||
antonym negation | It was not*/really hot, it was cold. | 1 | ||||||||
multi-hop composition | Comparing a 23, a 38 and a 31 year old, the last*/first is oldest. | 1 | ||||||||
negation | I don’t like bananas. ⇒ Negation | 3 | 1 | 2 | 2 | |||||
objects comparison | An airplane is bigger*/smaller than a pen. | 1 | ||||||||
property conjunction | A pen*/computer is usually located at hand and used for writing. | 1 | ||||||||
speculation | Just about every PC can be upgraded. ⇒ Speculation | 1 | 1 | 1 | ||||||
taxonomy connection | Ferry and floatplane are both boats*/airplanes. | 1 |
Overview of resources and linguistic phenomena mapping for discourse. We give an illustrative example for each phenomenon (*indicates the right option, if options are given) and the number of datasets for the phenomenon by dataset type.
Phenomena | Illustrative Example | Text | Text-Pair | Span | Span-Pair | Weischedel et al. (2013) | Pandit and Hou (2021) | Nie et al. (2019) | Narayan et al. (2018) | Webber et al. (2019) | Carlson et al. (2001) | Zeldes (2017) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
bridging | The disease and symptoms of advanced infection. ⇒ Valid Bridge | 1 | 1 | |||||||||
co-reference resolution | National Taiwan University opened the doors of five of its graduate schools. ⇒ Valid Co-Reference | 1 | ||||||||||
discourse connective | Leaning against his hip. He reclined with his feet up on the table. ⇒ when | 1 | ||||||||||
discourse representation theory | This is an old story. We’re talking about years ago. ⇒ Implicit Relation | 8 | ||||||||||
next-sentence prediction | Sentence A, Sentence B ⇒ Valid Next Sentence | 1 | ||||||||||
rhetorical structure theory | The statistics quoted by the “ new ” Census Bureau report ⇒ Elaboration | 6 | 8 |||||||||
sentence order | Given Sentence B, C, and D ⇒ C is at position 2 | 1 |
Overview of resources and linguistic phenomena mapping for semantics. We give an illustrative example for each phenomenon (*indicates the right option, if options are given) and the number of datasets for the phenomenon by dataset type.
Phenomena | Illustrative Example | Text | Text-Pair | Span | Span-Pair | Weischedel et al. (2013) | Conneau et al. (2018) | Klafka and Ettinger (2020) | Warstadt et al. (2020) | Huebner et al. (2021) | Hendrickx et al. (2010) | Rudinger et al. (2018a) | Rudinger et al. (2018b) | Govindarajan et al. (2019) | Gantt et al. (2022) | Vashishtha et al. (2019) | White et al. (2016) | Socher et al. (2013) | Mohler et al. (2016) | Birke and Sarkar (2006) | Steen et al. (2010) | Paetzold and Specia (2016) | Krasnowska-Kieraś and Wróblewska (2019) | Miller (1995) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
complex word identification | membrane ⇒ Complex, his ⇒ Simple | 1 | ||||||||||||||||||||||
coordination inversion | He knew it, and he deserved no answer. ⇒ Inversion | 1 | ||||||||||||||||||||||
event structure | Give them to a library or burn them. ⇒ Distributive | 4 | 2 | |||||||||||||||||||||
factuality | I ran across this item on the Internet. ⇒ Factual | 1 | ||||||||||||||||||||||
genericity | I assume you mean the crazy horse memorial. ⇒ Not Dynamic | 6 | ||||||||||||||||||||||
metaphor | After all, morons pay taxes, too. ⇒ Valid Metaphor | 4 | ||||||||||||||||||||||
named-entity labeling | Paris ⇒ City | 1 | ||||||||||||||||||||||
negative polarity item licensing | Only/Even Bill would ever complain. | 4 | ||||||||||||||||||||||
object-animacy | The rhino fined the pumpkin. ⇒ Animate | 1 | ||||||||||||||||||||||
object-gender | The princess uncovered the heiress. ⇒ Feminine | 1 | ||||||||||||||||||||||
passive | He is considered a European poet through and through. ⇒ Passive Sentence | 1 | ||||||||||||||||||||||
quantifiers | There aren’t many*/all lights darkening. | 6 | ||||||||||||||||||||||
semantic relation classification | Those cancers were caused by radiation exposures. ⇒ Cause-Effect | 1 | ||||||||||||||||||||||
semantic proto-roles | These look fine to me. ⇒ Exists as physical | 20 | ||||||||||||||||||||||
semantic odd man out | I wanted to know if it was real or a ploy. ⇒ Original | 1 | ||||||||||||||||||||||
semantic-role labeling | And what effect does their return have on campus? ⇒ ARGM-ADV | 1 ||||||||||||||||||||||
sentiment analysis | You ’ll probably love it. ⇒ Positive | 1 | ||||||||||||||||||||||
subject-animacy | The turtle betrayed the judge. ⇒ Animate | 1 | ||||||||||||||||||||||
subject-gender | The waitress betrayed the judge. ⇒ Feminine | 1 | ||||||||||||||||||||||
synonym-/antonym-detection | Is the degree really that important→unimportant to them? ⇒ Antonym Replacement | 1 | ||||||||||||||||||||||
tense | I quietly snuck up to him and pulled at his sleeve. → Present | 2 | ||||||||||||||||||||||
time | His mother was also killed in the attack. ⇒ Minutes | 1 | ||||||||||||||||||||||
verb-dynamic | The lawyer found the judge. ⇒ Dynamic Verb | 1 | ||||||||||||||||||||||
word content | You mean Alice. ⇒ Contains Word Alice | 1 | ||||||||||||||||||||||
word sense | His mother was also killed in the attack. ⇒ Supersense Noun Person | 1 |
Overview of the evaluated LMs covering the corresponding citation, model size, model architecture, pre-training objective & data, and the Huggingface model tag. Regarding the pre-training objective, we distinguish between masked language modeling (MLM), sentence order prediction (SOP), next sentence prediction (NSP), next word prediction (LM), instruction fine-tuning (IT), word denoising (DAE), and word probabilities from word co-occurrences (WP). For pre-training data, we report known numbers, either as the size of the corpora in gigabytes (GB), the number of pre-training tokens, the number of instructions for fine-tuning, or the number of tasks for instruction fine-tuning.
Model | Citation | Size | Pre-Training Objective | Pre-Training Data | Huggingface Tag |
---|---|---|---|---|---|
Encoder-Only Language Models | |||||
ALBERT | Lan et al. (2020) | 10 million | MLM+SOP | 16GB | albert-base-v2 |
BERT | Devlin et al. (2019) | 110 million | MLM+NSP | 16GB | bert-base-uncased |
DeBERTa | He et al. (2021) | 100 million | MLM | 80GB | microsoft/deberta-base |
DeBERTa-v3 | He et al. (2023) | 86 million | MLM+DISC | 160GB | microsoft/deberta-v3-base |
ELECTRA | Clark et al. (2020) | 110 million | MLM | 16GB | google/electra-base-discriminator |
RoBERTa | Liu et al. (2019) | 110 million | MLM+DISC | 160GB | roberta-base |
Decoder-Only Language Models | |||||
GPT2 | Radford et al. (2019) | 117 million | LM | 40GB | gpt2 |
Pythia-70m | Biderman et al. (2023) | 70 million | LM | 300 billion tokens | EleutherAI/pythia-70m |
Pythia-160m | Biderman et al. (2023) | 160 million | LM | 300 billion tokens | EleutherAI/pythia-160m |
Pythia-410m | Biderman et al. (2023) | 410 million | LM | 300 billion tokens | EleutherAI/pythia-410m |
Pythia-1b | Biderman et al. (2023) | 1 billion | LM | 300 billion tokens | EleutherAI/pythia-1B |
Pythia-1.4b | Biderman et al. (2023) | 1.4 billion | LM | 300 billion tokens | EleutherAI/pythia-1.4B |
Pythia-2.8b | Biderman et al. (2023) | 2.8 billion | LM | 300 billion tokens | EleutherAI/pythia-2.8B |
Pythia-6.9b | Biderman et al. (2023) | 6.9 billion | LM | 300 billion tokens | EleutherAI/pythia-6.9B |
Pythia-12b | Biderman et al. (2023) | 12 billion | LM | 300 billion tokens | EleutherAI/pythia-12B |
Pythia-70m-dedup | Biderman et al. (2023) | 70 million | LM | 207 billion tokens | EleutherAI/pythia-70m-deduped |
Pythia-160m-dedup | Biderman et al. (2023) | 160 million | LM | 207 billion tokens | EleutherAI/pythia-160m-deduped |
Pythia-410m-dedup | Biderman et al. (2023) | 410 million | LM | 207 billion tokens | EleutherAI/pythia-410m-deduped |
Pythia-1b-dedup | Biderman et al. (2023) | 1 billion | LM | 207 billion tokens | EleutherAI/pythia-1B-deduped |
Pythia-1.4b-dedup | Biderman et al. (2023) | 1.4 billion | LM | 207 billion tokens | EleutherAI/pythia-1.4B-deduped |
Pythia-2.8b-dedup | Biderman et al. (2023) | 2.8 billion | LM | 207 billion tokens | EleutherAI/pythia-2.8B-deduped |
Pythia-6.9b-dedup | Biderman et al. (2023) | 6.9 billion | LM | 207 billion tokens | EleutherAI/pythia-6.9B-deduped |
Pythia-12b-dedup | Biderman et al. (2023) | 12 billion | LM | 207 billion tokens | EleutherAI/pythia-12B-deduped |
Dolly-v2 | Conover et al. (2023) | 12 billion | LM+IT | 300 billion tokens + 15K instructions | databricks/dolly-v2-12b |
Llama-2-7b | Touvron et al. (2023) | 7 billion | LM | 2.4 trillion tokens | meta-llama/Llama-2-7b-hf |
Llama-2-13b | Touvron et al. (2023) | 13 billion | LM | 2.4 trillion tokens | meta-llama/Llama-2-13b-hf |
Llama-2-70b | Touvron et al. (2023) | 70 billion | LM | 2.4 trillion tokens | meta-llama/Llama-2-70b-hf |
Llama-2-7b-chat | Touvron et al. (2023) | 7 billion | LM+IT | 2.4 trillion tokens + 27.5K instructions | meta-llama/Llama-2-7b-chat-hf |
Llama-2-13b-chat | Touvron et al. (2023) | 13 billion | LM+IT | 2.4 trillion tokens + 27.5K instructions | meta-llama/Llama-2-13b-chat-hf |
Llama-2-70b-chat | Touvron et al. (2023) | 70 billion | LM+IT | 2.4 trillion tokens + 27.5K instructions | meta-llama/Llama-2-70b-chat-hf |
IBM-Merlinite | Sudalairaj et al. (2024) | 7 billion | LM+IT | 2.4 trillion tokens + 1400k instructions | ibm/merlinite-7b |
IBM-Labradorite | Sudalairaj et al. (2024) | 13 billion | LM+IT | 2.4 trillion tokens + 1400k instructions | ibm/labradorite-13b |
Vicuna-13b-v1.5 | Zheng et al. (2023) | 13 billion | LM+IT | 2.4 trillion tokens + 125k instructions | lmsys/vicuna-13b-v1.5 |
Orca-2-13b | Mitra et al. (2023) | 13 billion | LM+IT | 2.4 trillion tokens + 817K instructions | microsoft/Orca-2-13b |
Wizard-13B-v1.2 | Xu et al. (2023) | 13 billion | LM | unknown | WizardLM/WizardLM-13B-V1.2 |
Tülu-2-13b | Wang et al. (2023) | 13 billion | LM+IT | 2.4 trillion tokens + 330k instructions | allenai/tulu-2-13b |
Tülu-2-dpo-13b | Wang et al. (2023) | 13 billion | LM+IT | 2.4 trillion tokens + 330k instructions | allenai/tulu-2-dpo-13b |
Tülu-2-70b | Wang et al. (2023) | 70 billion | LM+IT | 2.4 trillion tokens + 330k instructions | allenai/tulu-2-70b |
Tülu-2-dpo-70b | Wang et al. (2023) | 70 billion | LM+IT | 2.4 trillion tokens + 330k instructions | allenai/tulu-2-dpo-70b |
Mistral-7b | Jiang et al. (2023) | 7 billion | LM | unknown | mistralai/Mistral-7B-v0.1 |
Mistral-7b-Inst | Jiang et al. (2023) | 7 billion | LM | unknown | mistralai/Mistral-7B-Instruct-v0.1 |
Mixtral-8×7b | Jiang et al. (2024) | 47 billion | LM | unknown | mistralai/Mixtral-8×7B-v0.1 |
Mixtral-8×7b-Inst | Jiang et al. (2024) | 47 billion | LM | unknown | mistralai/Mixtral-8x7B-Instruct-v0.1 |
Encoder-Decoder Language Models | |||||
BART | Lewis et al. (2020) | 121 million | DAE | 160GB | facebook/bart-base |
T5-small | Raffel et al. (2020) | 60 million | DAE | 800GB | google/t5-small-lm-adapt |
T5-base | Raffel et al. (2020) | 220 million | DAE | 800GB | google/t5-base-lm-adapt |
T5-large | Raffel et al. (2020) | 770 million | DAE | 800GB | google/t5-large-lm-adapt |
T5-xl | Raffel et al. (2020) | 3 billion | DAE | 800GB | google/t5-xl-lm-adapt |
T5-xxl | Raffel et al. (2020) | 11 billion | DAE | 800GB | google/t5-xxl-lm-adapt |
FLAN-T5-small | Raffel et al. (2020) | 60 million | DAE+IT | 800GB + 1.8k tasks | google/flan-t5-small |
FLAN-T5-base | Raffel et al. (2020) | 220 million | DAE+IT | 800GB + 1.8k tasks | google/flan-t5-base |
FLAN-T5-large | Raffel et al. (2020) | 770 million | DAE+IT | 800GB + 1.8k tasks | google/flan-t5-large |
FLAN-T5-xl | Raffel et al. (2020) | 3 billion | DAE+IT | 800GB + 1.8k tasks | google/flan-t5-xl |
FLAN-T5-xxl | Raffel et al. (2020) | 11 billion | DAE+IT | 800GB + 1.8k tasks | google/flan-t5-xxl |
TK-Instruct | Wang et al. (2022) | 11 billion | DAE+IT | 800GB + 1.6k tasks | allenai/tk-instruct-11b-def |
UL2 | Tay et al. (2023) | 20 billion | DAE | 800GB | google/ul2 |
FLAN-UL2 | Tay et al. (2023) | 20 billion | DAE+IT | 800GB + 100k instructions | google/flan-ul2 |
Static Language Models | |||||
Glove-6B | Pennington et al. (2014) | – | WP | 6 billion tokens | glove.6B.300d |
Glove-840B | Pennington et al. (2014) | – | WP | 840 billion tokens | glove.840B.300d |
Morphology
The first group of tasks covers the following morphology phenomena (Table 4): anaphor agreement, determiner noun agreement, irregular forms, and subject-verb agreement (Warstadt et al., 2020; Huebner et al., 2021).
Syntax
The second group of 75 tasks covers the following syntax phenomena: part-of-speech and constituent labeling (Weischedel et al., 2013); dependency labeling (Silveira et al., 2014); bigram-shift (whether two words were shifted), tree-depth (the depth of a sentence's constituency tree), top-constituent (the top constituency tag), and sentence-length (Conneau et al., 2018); subject- and object-number (singular/plural) and deoncausative-inchoative alternation (the interaction of a verb with its context) based on Klafka and Ettinger (2020); and binding, control/raising, negative polarity item licensing, island-effects, argument-structure, ellipsis, and filler-gap (Warstadt et al., 2020; Huebner et al., 2021).
Semantics
Third, we consider 67 datasets covering semantic phenomena: named-entity labeling and semantic-role labeling (Weischedel et al., 2013); tense, semantic odd man out, word content, and coordination inversion (Conneau et al., 2018); semantic relation classification (Hendrickx et al., 2010); semantic proto-roles (Rudinger et al., 2018a); factuality (whether a span is factual or not) (Rudinger et al., 2018b); genericity (whether a span is generic or not) (Govindarajan et al., 2019); event structure (Gantt et al., 2022); time (the time dimension of a span) (Vashishtha et al., 2019); word sense (White et al., 2016); sentiment analysis (Socher et al., 2013); object- and subject-animacy (whether an entity is animate, like humans, or inanimate, like cars), object- and subject-gender (male/female), verb-tense, and verb-dynamic (Klafka and Ettinger, 2020); metaphor (Mohler et al., 2016; Birke and Sarkar, 2006; Steen et al., 2010); complex word identification (whether a word is complex or not) (Paetzold and Specia, 2016); and passive (Krasnowska-Kieraś and Wróblewska, 2019). In addition, we derive a synonym-/antonym-detection dataset using WordNet (Miller, 1995) and the texts from OntoNotes v5 (Weischedel et al., 2013).
Reasoning
Fourth, 19 datasets cover reasoning phenomena: paraphrasticity with negation and antonyms (Vahtola et al., 2022); negation detection (Szarvas et al., 2008; Konstantinova et al., 2012; Morante and Blanco, 2012); negation-span classification (does a span cause a negation) (Szarvas et al., 2008; Konstantinova et al., 2012); negation-correspondence (the target span of a negation) (Szarvas et al., 2008; Konstantinova et al., 2012); speculation detection, speculation-span classification, and speculation-correspondence (the target span of a speculation) (Szarvas et al., 2008); and always-never, age comparison, objects comparison, antonym negation, property conjunction, taxonomy connection, and multi-hop composition (Talmor et al., 2020b).
Discourse
Finally, Holmes embodies 28 datasets addressing discourse phenomena: co-reference resolution (Weischedel et al., 2013); bridging (Hou, 2018, 2020; Pandit and Hou, 2021); discourse connective (Nie et al., 2019); sentence order and next-sentence prediction (Narayan et al., 2018); given a discourse tree, whether two nodes correspond (discourse correspondence), the correct order of two nodes (discourse order), the node-node relation (discourse relation), the distance between two nodes (discourse distance), and the class of explicit and implicit nodes (discourse explicit classes and discourse implicit classes) (Webber et al., 2019; Kurfalı and Östling, 2021); and, given a rhetorical tree, the number of child nodes (rst-count), the node depth (rst-depth), the distance between two nodes (rst-distance), the node-node relation (rst-relation), the node-node relation group (rst-relation-group), whether two nodes appear after each other (rst-successively), and the node type (rst-type) (Carlson et al., 2001; Koto et al., 2021; Kurfalı and Östling, 2021; Zeldes, 2017).
A.4 Details of Probing Dataset Composition
Whenever possible, we rely on established probing datasets and transform instances into a unified format: 1) an input x, which is either one or a pair of span(s) or sentence(s), including the string and, for span or span-pair classification tasks, optional start and end indices in the context c; 2) an optional textual context c used to encode x, for example the sentence in which a span occurs; and 3) a corresponding label y. Figure 14 shows the composition of the specific probing input x for these four task types using the internal representations of the LMs' last layer. Note that additional averaging operations are required when words are tokenized into multiple tokens so that one averaged vector represents one word, for example, when probing for the part-of-speech tag of a rare word.
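For illustration, the sketch below shows one way to pool last-layer hidden states into the probing input x for the four task types; the function names and the (start, end) token-index span format are our assumptions rather than the official Holmes interface.

```python
# Hedged sketch of composing the probing input x from last-layer hidden states.
import torch

def word_vector(hidden_states, token_span):
    """Average the sub-word token vectors belonging to one word or span."""
    start, end = token_span  # token indices into the encoded context, end exclusive
    return hidden_states[start:end].mean(dim=0)

def compose_probe_input(hidden_states, task_type, spans=None):
    # hidden_states: [num_tokens, hidden_dim] from the last LM layer
    if task_type == "text":          # e.g., sentence acceptability
        return hidden_states.mean(dim=0)
    if task_type == "text_pair":     # e.g., next-sentence prediction
        first, second = spans        # token ranges of the two sentences
        return torch.cat([word_vector(hidden_states, first),
                          word_vector(hidden_states, second)])
    if task_type == "span":          # e.g., part-of-speech of one word
        return word_vector(hidden_states, spans[0])
    if task_type == "span_pair":     # e.g., relation between two words
        return torch.cat([word_vector(hidden_states, s) for s in spans])
    raise ValueError(f"unknown task type: {task_type}")
```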
Overview of the composition of the probing input x from the given text for the four task types using concatenation and averaging. If a tokenizer splits one word into multiple tokens, we apply additional averaging operations, for example, when probing the part-of-speech phenomenon.
Detailed Holmes vs. HELM (Liang et al., 2023) comparison for 40 open decoder models and 22 BLiMP datasets covering quantifier, island-effects, irregular-forms, and binding phenomena. We use the evaluation code of HELM and run the prompting-based adaptation (multiple choice joint). These results show the advantage of disentangled evaluation (Holmes) over entangled evaluations (like HELM), which intertwine the understanding of specific linguistic phenomena with other abilities (like following instructions or answering precisely). Most HELM results are below the random baseline, underscoring the necessity to measure linguistic phenomena directly and in isolation within LMs.
If given, we use the original train/dev/test splits. If no such division exists, we split the data with a 70/10/20 ratio. Furthermore, we adapt the design of some data to match our dataset format. For example, for the oLMpics (Talmor et al., 2020b) datasets, we transform the mask-filling tasks into a binary classification where the label correct corresponds to a sentence with a correctly filled mask and incorrect to a sentence where the mask was filled wrongly.
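The snippet below sketches both adaptations, assuming a 70/10/20 split and hypothetical field names (stem, choices, answer) for the raw multiple-choice data; it is an illustration, not the exact oLMpics schema or Holmes code.

```python
import random

def make_splits(instances, ratios=(0.7, 0.1, 0.2), seed=0):
    """Create train/dev/test splits when no official split exists."""
    instances = list(instances)
    random.Random(seed).shuffle(instances)
    n_train = int(ratios[0] * len(instances))
    n_dev = int(ratios[1] * len(instances))
    return (instances[:n_train],
            instances[n_train:n_train + n_dev],
            instances[n_train + n_dev:])

def to_binary_instances(question):
    """Turn one multiple-choice mask-filling question into binary instances."""
    instances = []
    for choice in question["choices"]:           # hypothetical field names
        text = question["stem"].replace("[MASK]", choice)
        label = "correct" if choice == question["answer"] else "incorrect"
        instances.append({"text": text, "label": label})
    return instances
```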
OntoNotes
Following Tenney et al. (2019b, a), we use the OntoNotes (Weischedel et al., 2013) dataset to derive part-of-speech tagging, constituent labeling, named-entity labeling, semantic-role labeling, and co-reference resolution probing datasets. Further, we consider constituent maximum depth and constituent node length as additional properties of the constituent trees in OntoNotes.
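As an illustration, such constituency-tree properties can be read off a bracketed parse with NLTK; the toy parse string below is ours, and treating a constituent's subtree height and leaf count as "maximum depth" and "node length" is one plausible reading, not necessarily the exact Holmes extraction.

```python
from nltk import Tree

# Toy bracketed parse (placeholder, not an OntoNotes sentence).
parse = Tree.fromstring("(S (NP (PRP It)) (VP (VBD mattered) (ADVP (RB little))) (. .))")

for constituent in parse.subtrees():
    depth = constituent.height()        # height of the subtree rooted at this constituent
    length = len(constituent.leaves())  # number of words the constituent spans
    print(constituent.label(), depth, length)
```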
Dependency Corpus
We use the gold-standard dependency corpus of Silveira et al. (2014) to derive the dependency labeling probing dataset.
Context Probes
Based on the data presented in Klafka and Ettinger (2020), we compose nine datasets verifying LMs' knowledge about the context of words, for example, whether a word is animate (like animals or humans) or inanimate (like buildings or vehicles), or whether a verb is static or dynamic.
BLiMP Dataset
Using the data presented in the BLiMP benchmark (Warstadt et al., 2020), we derive 67 probing datasets verifying specific phenomena, like island-effects, covering morphology, syntax, and semantics. Unlike the original version, we compose a binary classification task for every phenomenon: either accepting a valid sentence or rejecting one that violates the given linguistic phenomenon.
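A minimal sketch of this conversion via the Hugging Face datasets loader follows; the configuration name and the sentence_good/sentence_bad column names follow the public BLiMP release, but the snippet is an illustration rather than the exact Holmes pipeline.

```python
from datasets import load_dataset

# Load one BLiMP paradigm of minimal pairs (public release only has a "train" split).
pairs = load_dataset("blimp", "irregular_past_participle_verbs", split="train")

# Turn each minimal pair into two binary acceptability instances.
instances = []
for pair in pairs:
    instances.append({"text": pair["sentence_good"], "label": "acceptable"})
    instances.append({"text": pair["sentence_bad"], "label": "unacceptable"})
```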
Zorro Dataset
As for the BLiMP tasks, we convert the 21 distinct Zorro tasks into binary classification tasks that decide whether a sentence satisfies or violates the given linguistic phenomenon.
SemEval-2010 Task 8
For semantic relation classification, we rely on the dataset of Hendrickx et al. (2010).
Decompositional Semantics Initiative
The Decompositional Semantics Initiative6 provides a large number of datasets to verify semantic phenomena. Apart from the commonly used semantic proto-roles (Rudinger et al., 2018a), we use their collection of works to compose probing datasets for factuality (Rudinger et al., 2018b), genericity (Govindarajan et al., 2019), event structure (Gantt et al., 2022), time (Vashishtha et al., 2019), and word sense (White et al., 2016).
Sentiment Analysis
We use the widely used dataset of Socher et al. (2013) to form a probing dataset targeting sentiment.
Metaphor
We form metaphor probing datasets using the annotations provided by Mohler et al. (2016), Birke and Sarkar (2006), and Steen et al. (2010).
Complex Word Identification
We consider word complexity for the first time, using the data presented in Paetzold and Specia (2016), which provides annotations for different complexity levels of words.
Passive
We use data from Krasnowska-Kieraś and Wróblewska (2019) to form a probing dataset assessing knowledge about passive language.
Synonym / Antonym Replacement
Using the texts of OntoNotes (Weischedel et al., 2013) and WordNet (Miller, 1995), we form a probing dataset to detect synonym and antonym replacements. Specifically, the binary classification task is: given two texts (the original and an updated one), was the updated one changed by replacing a word with its synonym or its antonym?
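The sketch below illustrates this construction with NLTK's WordNet interface; the naive single-word string replacement and the alphabetical choice of replacement are simplifications we assume for illustration, not the exact Holmes procedure.

```python
from nltk.corpus import wordnet as wn

def replacements(word):
    """Collect synonym and antonym candidates for a word from WordNet."""
    synonyms, antonyms = set(), set()
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            if lemma.name() != word:
                synonyms.add(lemma.name())
            for antonym in lemma.antonyms():
                antonyms.add(antonym.name())
    return synonyms, antonyms

def make_instance(sentence, word, label):
    """Build one text-pair instance: original sentence vs. replaced sentence."""
    synonyms, antonyms = replacements(word)
    pool = synonyms if label == "synonym" else antonyms
    if not pool:
        return None
    replaced = sentence.replace(word, sorted(pool)[0], 1)
    return {"text_pair": (sentence, replaced), "label": label}

# Example (hypothetical): make_instance("The soup is hot.", "hot", "antonym")
```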
Negation
With this work, we are the first to verify negation based on human-annotated datasets (Vahtola et al., 2022; Szarvas et al., 2008; Konstantinova et al., 2012). Specifically, we form the following probing datasets:
Is a text negated or not?
Given two text spans, does the negation within the first one correspond to the second one?
Given a text span, is it the cue or the scope of the negation?
oLMpics
We form probing datasets addressing different types of lexical reasoning using the data presented in Talmor et al. (2020b). As they provide multiple choices, we form correct instances by filling the gap with the correct option and wrong ones by filling in the other options. Specifically, we form datasets for always-never, age comparison, objects comparison, antonym negation, multi-hop composition, property conjunction, taxonomy conjunction, and encyclopedic composition.
Bridging
We rely on the data presented in Pandit and Hou (2021) and form two probing datasets: one verifies whether a text is linguistically plausible with respect to bridging (the antecedent matches the anaphor), and the second verifies whether a given antecedent and anaphor match.
Discourse Connective
Using data from Nie et al. (2019), we form a probing dataset to assess whether a given connective marker matches the discourse of the given text.
Sentence Order and Next Sentence Prediction
Following Narayan et al. (2018), we form two datasets: one to verify the position of a given sentence within its context (sentence order) and one to verify whether two sentences occur after each other (next-sentence prediction).
Discourse Representation Theory
We use data from Webber et al. (2019) to compose eight probing datasets addressing discourse representation theory:
Four probing datasets predicting the class of a given span, distinguishing between implicit, explicit, implicit-coarse, and explicit-coarse classes.
The absolute distance (number of words) between two spans in the text.
Whether the order of two spans is correct or not.
Whether two spans have discourse relation or not.
The specific discourse relation of two spans.
Rhetorical Structure Theory
Using annotations from Carlson et al. (2001) and Zeldes (2017), we compose 14 probing datasets addressing rhetorical structure theory. Specifically, we compose the following seven types of datasets for both works; a small sketch after the list illustrates how such labels can be read off a rhetorical tree:
The rhetorical type of a text span, either nucleus or satellite.
The number of children of a text span within the rhetorical tree of the text.
The depth of a text span within the rhetorical tree of the text.
The number of edges between two text spans within the rhetorical tree.
The specific rhetorical relation between two text spans, like conclusion.
The relation group of a specific rhetorical relation between two text spans, like evaluation for the relation conclusion.
Whether two text spans occur after each other in the rhetorical tree.
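The following sketch uses a simplified tree structure we assume for illustration (not the Holmes implementation) to show how several of these labels, such as rst-count, rst-depth, and rst-distance, can be derived from a rhetorical tree.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    span_id: int
    node_type: str                      # "nucleus" or "satellite" (rst-type)
    relation: Optional[str] = None      # rhetorical relation to the parent (rst-relation)
    children: List["Node"] = field(default_factory=list)

def child_count(node):                  # rst-count
    return len(node.children)

def depth(root, target, d=0):           # rst-depth: depth of a node within the tree
    if root is target:
        return d
    for child in root.children:
        found = depth(child, target, d + 1)
        if found is not None:
            return found
    return None

def path_to(root, target, path=()):
    path = path + (root,)
    if root is target:
        return path
    for child in root.children:
        found = path_to(child, target, path)
        if found:
            return found
    return None

def distance(root, a, b):                # rst-distance: number of edges between two nodes
    pa, pb = path_to(root, a), path_to(root, b)
    common = sum(1 for x, y in zip(pa, pb) if x is y)
    return (len(pa) - common) + (len(pb) - common)

# Example (hypothetical tree):
# root = Node(0, "nucleus", children=[Node(1, "satellite", "elaboration"),
#                                     Node(2, "nucleus", "conclusion")])
# child_count(root) == 2; depth(root, root.children[0]) == 1
# distance(root, root.children[0], root.children[1]) == 2
```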