oLMpics -- On what Language Model Pre-training Captures

Recent success of pre-trained language models (LMs) has spurred widespread interest in the language capabilities that they possess. However, efforts to understand whether LM representations are useful for symbolic reasoning tasks have been limited and scattered. In this work, we propose eight reasoning tasks, which conceptually require operations such as comparison, conjunction, and composition. A fundamental challenge is to understand whether the performance of a LM on a task should be attributed to the pre-trained representations or to the process of fine-tuning on the task data. To address this, we propose an evaluation protocol that includes both zero-shot evaluation (no fine-tuning) and a comparison of the learning curve of a fine-tuned LM to the learning curves of multiple controls, which paints a rich picture of the LM's capabilities. Our main findings are that: (a) different LMs exhibit qualitatively different reasoning abilities, e.g., RoBERTa succeeds in reasoning tasks where BERT fails completely; (b) LMs do not reason in an abstract manner and are context-dependent, e.g., while RoBERTa can compare ages, it can do so only when the ages are in the typical range of human ages; (c) on half of our reasoning tasks all models fail completely. Our findings and infrastructure can help future work on designing new datasets, models and objective functions for pre-training.


Introduction
Large pre-trained language models (LMs) have revolutionized the field of natural language processing in the last few years (Peters et al., 2018a; Devlin et al., 2019; Yang et al., 2019; Radford et al., 2019), leading to undeniable empirical gains in almost every benchmark. This has instigated research exploring what is captured by the contextualized representations that these LMs compute, revealing that they encode substantial amounts of syntax and semantics (Linzen et al., 2016b; Peters et al., 2018b; Tenney et al., 2019b; Goldberg, 2019; Hewitt and Manning, 2019; Tenney et al., 2019a; Lin et al., 2019; Coenen et al., 2019).

Figure 1: Overview of our experimental design. Two probes are evaluated using learning curves (including zero-shot). ROBERTA-L's (red squares, upper text in black) accuracy is compared to a NO LANGUAGE control (red circles, lower text in black), and to MLM-BASELINE, which is not pre-trained (green squares). Here, we conclude that the LM representations are well-suited for task A), whereas in task B) the model is adapting to the task during fine-tuning.
Despite these efforts, it remains unclear what skills are difficult to learn from a LM objective alone. In this paper, we propose a diverse set of probing tasks for types of symbolic reasoning that are potentially difficult to capture using a LM objective (see Table 1). Our intuition is that since a LM objective focuses on word co-occurrence, it will struggle with tasks that are considered to involve symbolic reasoning, such as determining whether a conjunction of properties holds for an object, or comparing the sizes of different objects. Understanding what is missing from current LMs may help design datasets and objectives that will endow models with the missing capabilities.

Table 1: Examples for all our reasoning probes. We use two types of experimental setups, explained in §2. The first answer (A) is the correct answer.
However, how does one verify whether pre-trained representations hold information that is useful for a particular task? Past work mostly resorted to fixing the representations and fine-tuning a simple, often linear, randomly initialized probe, to determine whether the representations hold relevant information (Ettinger et al., 2016; Adi et al., 2016; Belinkov and Glass, 2019; Hewitt and Manning, 2019; Wallace et al., 2019; Rozen et al., 2019; Warstadt et al., 2019; Richardson et al., 2019). However, it is difficult to determine whether success is due to the pre-trained representations or to fine-tuning itself (Hewitt and Liang, 2019). To handle this challenge, we include multiple controls that improve our understanding of the empirical results.
Our "purest" setup is zero-shot: we cast tasks in the masked LM format, and use a pre-trained LM without any fine-tuning. For example, given the statement "A cat is [MASK] than a mouse", a LM can decide whether the probability of "larger" is higher than that of "smaller" for the masked word (Figure 1). If a model succeeds without fine-tuning over many pairs of objects, then its representations are useful for this task. However, if it fails, the failure could be due to a mismatch between the language it was pre-trained on and the language of the task (which might be automatically generated and contain grammatical errors). Thus, we also compute the learning curve (Figure 1), by fine-tuning the already pre-trained masked language modeling (MLM) head, a 1-hidden-layer MLP on top of the model's contextualized representations, on increasing amounts of data. A model that can adapt from fewer examples arguably has better representations for the task.
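Concretely, the zero-shot decision reduces to comparing the masked-LM scores of the candidate tokens. The following is a minimal sketch; the toy vocabulary, logits, and function name are ours for illustration (in practice the logits come from a pre-trained LM at the [MASK] position):

```python
# Toy vocabulary and logits standing in for a masked LM's output.
VOCAB = ["larger", "smaller", "cat", "mouse"]

def zero_shot_choice(mask_logits, candidates):
    """Return the candidate with the highest masked-LM score.

    Softmax is monotone, so comparing the candidates' raw logits is
    equivalent to comparing their probabilities."""
    scores = {c: mask_logits[VOCAB.index(c)] for c in candidates}
    return max(scores, key=scores.get)

# "A cat is [MASK] than a mouse": a model that captures size comparison
# should assign "larger" a higher logit than "smaller".
toy_logits = [2.1, 0.3, -1.0, -2.0]
answer = zero_shot_choice(toy_logits, ["larger", "smaller"])  # "larger"
```

Accuracy over a probe is then simply the fraction of instantiated statements for which this choice matches the gold token.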
Moreover, to diagnose whether model performance is related to pre-trained representations or to fine-tuning, we add controls to every experiment (Figures 1, 3). First, we add a control that makes minimal use of language tokens, i.e., "cat [MASK] mouse" (NO LANGUAGE in Figure 1). If a model is agnostic to the presence of language, then performance can be attributed to fine-tuning rather than to pre-training. Similar logic is used to compare against baseline models that are not pre-trained at all (except for non-contextualized word embeddings). Overall, our setup provides a rich picture of whether LM representations are useful for solving a wide range of reasoning tasks.
We introduce eight tasks that test different types of reasoning, as shown in Table 1. We run experiments using several pre-trained LMs, based on BERT (Devlin et al., 2019) and ROBERTA. We find that there are clear qualitative differences between different LMs with similar architectures. For example, ROBERTA-LARGE (ROBERTA-L) can perfectly solve some reasoning tasks, such as comparing numbers, even in the zero-shot setup, while other models' performance is close to random. However, good performance is highly context-dependent. Specifically, we repeatedly observe that even when a model solves a task, small changes to the input quickly derail it to low performance. For example, ROBERTA-L can almost perfectly compare people's ages when the numeric values are in the expected range (15-105), but fails miserably if the values are outside this range. Interestingly, it is able to reliably answer, and reverse the order of "younger"/"older", when ages are specified through the year of birth in the range 1920-2000. This highlights that the LM's ability to solve this task is strongly tied to the specific values and linguistic context, and does not generalize to arbitrary scenarios. Last, we find that in four out of eight tasks, all LMs perform poorly compared to the controls.
Our contributions are summarized as follows:
• A set of probes that test whether specific reasoning skills are encoded in the representations of pre-trained LMs.
• An evaluation protocol for understanding whether a capability is encoded in pre-trained representations or is learned during fine-tuning.
• An analysis of the reasoning skills that current LMs possess. We find that LMs with similar architectures are qualitatively different, that their success is highly context-dependent, and that in many cases, all LMs fail.
• Code and infrastructure for designing new probes and testing them on a large set of pre-trained language models. The code and models will be available at http://github.com/alontalmor/oLMpics.

Models
We now turn to the architectures and loss functions used throughout the different probing tasks.

Pre-trained Language Models
All models in this paper take a sequence of tokens x = (x_1, ..., x_n), and compute contextualized representations with a pre-trained LM, that is, h = ENCODE(x) = (h_1, ..., h_n). Specifically, we consider: (a) BERT (Devlin et al., 2019), a transformer-based architecture (Vaswani et al., 2017) trained with a masked language modeling (MLM) objective, i.e., the model is trained to predict words that are masked from the input; this includes BERT-WHOLE-WORD-MASKING (BERT-WWM), which was trained using whole-word masking; and (b) ROBERTA, which has the same architecture as BERT, but was trained on more data and carefully optimized.

Probing setups
We probe the pre-trained LMs using two setups: multi-choice masked LM (MC-MLM) and multi-choice question answering (MC-QA). The MC-MLM setup is our default one, and is used for tasks where the answer set is small, consistent across the different questions, and each answer appears as a single item in the word-piece vocabulary. The MC-QA setup is used when the answer set varies substantially between questions, and many of the answers consist of more than one word or are not expected to appear in the word-piece vocabulary.

MC-MLM
The contextualized representation h_i of the masked token is passed through the MLM head to obtain a distribution over answer candidates:

p = softmax(m_l + FF_MLM(h_i)) ∈ R^|V|,

where V is the vocabulary, FF_MLM is a 1-hidden-layer MLP, and m_l ∈ {0, -∞}^|V| is an additive mask that is 0 for candidate tokens and -∞ elsewhere. Applying the mask m_l guarantees that the support of the probability distribution will be over exactly K candidate tokens: the correct one and K − 1 distractors. This is done to speed up the adaptation rate of the model, and allows reasonable accuracy to be achieved from limited amounts of data. Training minimizes the cross-entropy loss given the gold masked token. Figure 2 illustrates this setup. The input "[CLS] Cats [MASK] drink coffee [SEP]" is passed through the model, the contextualized representation of the masked token is passed through the MC-MLM head, and the final distribution is over the vocabulary words "always", "sometimes" and "never", where the gold token in this case is "never".
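The effect of the candidate mask can be sketched as follows (toy logits and function name are ours; m_l contributes 0 for candidate token ids and -inf elsewhere):

```python
import math

def mc_mlm_distribution(vocab_logits, candidate_ids):
    """Add the mask m_l (0 for candidate tokens, -inf elsewhere) to the
    MLM logits, then softmax: the support is exactly the K candidates."""
    neg_inf = float("-inf")
    masked = [logit if i in candidate_ids else neg_inf
              for i, logit in enumerate(vocab_logits)]
    z = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(l - z) if l != neg_inf else 0.0 for l in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 4-word vocabulary; the candidates are token ids 0 and 1.
p = mc_mlm_distribution([1.0, 2.0, 0.5, -3.0], {0, 1})
```

Non-candidate tokens receive exactly zero probability, so the cross-entropy loss only has to discriminate among the K candidates.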
A compelling advantage of this setup is that reasonable performance can be obtained without training, using the original LM representations and the already pre-trained MLM head weights.
MC-QA Here, we use the standard setup for answering multi-choice questions with pre-trained LMs (Talmor et al., 2019; Mihaylov et al., 2018; Zellers et al., 2018). Given a question q and candidate answers a_1, ..., a_K, we compute for each candidate answer a_k a representation h^(k) from the input tokens "[CLS] q [SEP] a_k [SEP]". Then the probability over answers is obtained using the multi-choice QA head:

p = softmax(FF_QA(h^(1)_1), ..., FF_QA(h^(K)_1)),

where FF_QA is a 1-hidden-layer MLP that is run over the [CLS] (first) token of an answer candidate, h^(k)_1, and outputs a single logit. Note that in this setup the parameters of FF_QA cannot be initialized using the original pre-trained LM.
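The scoring scheme can be sketched as follows; the encoder and FF_QA stand-ins below are toy functions of ours, used only to make the control flow concrete:

```python
import math

def mc_qa_probs(question, answers, encoder, ff_qa):
    """Encode "[CLS] q [SEP] a_k [SEP]" per candidate, apply FF_QA to the
    [CLS] representation to get one logit each, and softmax over the K
    logits. encoder/ff_qa are stand-ins for the LM and the QA head."""
    logits = [ff_qa(encoder(f"[CLS] {question} [SEP] {a} [SEP]"))
              for a in answers]
    z = max(logits)
    exps = [math.exp(l - z) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy stand-ins (ours): a "[CLS] vector" collapsed to one scalar.
toy_encoder = lambda text: float(len(text))
toy_ff_qa = lambda h: -h
probs = mc_qa_probs("What is located in a street and is related to octagon?",
                    ["street sign", "car", "math"], toy_encoder, toy_ff_qa)
```

With a real LM, `encoder` returns the [CLS] vector and `ff_qa` is the randomly initialized 1-hidden-layer MLP, which is why zero-shot evaluation is not meaningful in MC-QA.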

Baseline Models
To provide a lower bound on the performance of pre-trained LMs, we introduce two baseline models with only non-contextualized representations.
MLM-BASELINE This serves as a lower bound for the MC-MLM setup. In MC-MLM, the input to FF_MLM(·) is the hidden representation h ∈ R^1024 (for large models). To obtain a similar architecture with non-contextualized representations, we concatenate the first 20 tokens of each example, representing each token with a 50-dimensional GloVe vector (Pennington et al., 2014), and pass this 1,000-dimensional representation of the input through FF_MLM, exactly as in MC-MLM. In all probes, phrases are limited to 20 tokens. If there are fewer than 20 tokens in the input, we zero-pad it.
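The construction of the baseline's input features can be sketched as follows (the toy embedding stands in for GloVe; names are ours):

```python
def baseline_features(tokens, embed, dim=50, max_len=20):
    """Concatenate the (up to) first 20 token vectors, zero-padding short
    inputs, yielding the fixed 20 * 50 = 1,000-dimensional input to the
    baseline's MLP."""
    vectors = [embed(tok) for tok in tokens[:max_len]]
    vectors += [[0.0] * dim] * (max_len - len(vectors))
    return [value for vec in vectors for value in vec]

# Toy embedding standing in for GloVe (ours, for illustration).
toy_embed = lambda tok: [float(len(tok))] * 50
features = baseline_features("A cat is [MASK] than a mouse".split(), toy_embed)
```

The resulting vector has a fixed length regardless of the phrase length, so the same MLP architecture as in MC-MLM can be applied on top of it.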
MC-QA baseline This serves as a lower-bound for MC-QA. We use the ESIM architecture over GLOVE representations, which is known to provide a strong model when the input is a pair of text fragments (Chen et al., 2017). We adapt the architecture to the multi-choice setup using the procedure proposed by Zellers et al. (2018).

Controlled Experiments
We now elaborate on our experimental design, and the controls used to better interpret the empirical results. We will use the results of our first task, AGE-COMPARISON, as a running example, where we test for models' ability to compare the numeric value of ages.

Zero-shot Experiments with MC-MLM
Fine-tuning pre-trained LMs makes it difficult to disentangle what is captured by the original representations and what was learned during fine-tuning (Hewitt and Liang, 2019). Thus, ideally, one should test pre-trained LMs using the original weights directly, without any fine-tuning (Linzen et al., 2016a; Goldberg, 2019). The MC-MLM setup (§2) uses the pre-trained MLM head and thus achieves exactly that. One only needs to phrase the task as a statement with a single masked token and K possible output tokens. In AGE-COMPARISON, this is done using the phrase "A AGE-1 year-old person age is [MASK] than a AGE-2 year-old person.", where AGE-1 and AGE-2 are replaced with different integers, and the K = 2 possible answers are "younger" and "older". Other than that, no training is needed, and the original representations are tested. Figure 3A provides an example of such zero-shot evaluation. Different values are assigned to AGE-1 and AGE-2, and a pixel is colored when the model predicts "younger". Accuracy is measured as the proportion of cases where the model outputs the correct token. The performance of BERT-WWM is on the left, and of ROBERTA-L on the right (green). The results in Figure 3A and Table 2 show that ROBERTA-L consistently represents numbers and compares them correctly (96% accuracy), BERT-WWM achieves higher-than-random accuracy (63%), while BERT-LARGE (BERT-L) is roughly random (49%). ROBERTA-L errs when the difference between the numerical values is small, while BERT-WWM's errors are more scattered. The performance of MLM-BASELINE is random, as expected, since the MLP_MLM weights are randomly initialized.

Learning Curves
Despite the advantages of zero-shot evaluation, the performance of a model might be adversely affected by mismatches between the language the pre-trained LM was trained on and the language of the examples in our tasks. For example, one task that focuses on negation uses the template "It was [MASK] fat, it was really slim", with the candidate answers "not" and "very". Such sentences may derail a LM because of the unnatural language, rather than an inability to interpret negation.
To tackle this, we fine-tune models in the MC-MLM setup with a limited number of examples. We assume that if the LM representations are useful for a task, then the LM will require few examples to overcome the language mismatch described above and achieve high performance. In almost all cases, we train with N examples, where N ∈ {62, 125, 250, 500, 1K, 2K, 4K}. To account for optimization instabilities, we fine-tune multiple times with different random seeds for each value of N, and report the average accuracy (Dodge et al., 2019). The representations h are fixed at fine-tuning time, and we only train the already pre-trained parameters of MLP_MLM.

(Table 2, giving per-model zero-shot, learning-curve, and control statistics for AGE-COMPARISON, appears here.)
Learning-curve metrics Learning curves are informative, but inspecting many learning curves can be difficult to digest. Thus, we summarize each curve using two aggregate statistics: (a) MAX, i.e., the maximal accuracy on the learning curve; and (b) WS, a weighted average of the accuracies on the learning curve, where higher weights are given to points where N is small. WS emphasizes our focus on performance given little training data, as it highlights what was encoded by the model before fine-tuning. WS is related to the area under the accuracy curve, and to the online code metric proposed by Yogatama et al. (2019) and Blier and Ollivier (2018). For AGE-COMPARISON, the solid lines in Figure 3B illustrate the learning curves of ROBERTA-L and BERT-L, and Table 2 shows the aggregate statistics. We fine-tune the model by replacing AGE-1 and AGE-2 with values between 43 and 120, but test with values between 15 and 38, to guarantee that the model generalizes to values unseen at training time. Again, we see that the representations learned by ROBERTA-L are already equipped with the knowledge necessary for solving this task from few examples.
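The two aggregate statistics can be computed as follows; the weights are those reported in the paper (0.23 down to 0.07), while the learning curve below is a hypothetical example of ours:

```python
# One weight per training-set size N in {62, 125, 250, 500, 1K, 2K, 4K};
# smaller N gets a larger weight, emphasizing the low-data regime.
WS_WEIGHTS = [0.23, 0.2, 0.17, 0.14, 0.11, 0.08, 0.07]

def ws(curve, weights=WS_WEIGHTS):
    """WS: weighted average of learning-curve accuracies."""
    assert len(curve) == len(weights)
    return sum(w * acc for w, acc in zip(weights, curve))

def max_metric(curve):
    """MAX: the best accuracy anywhere on the learning curve."""
    return max(curve)

# Hypothetical accuracy per value of N (ours, for illustration).
curve = [0.96, 0.97, 0.98, 0.99, 1.0, 1.0, 1.0]
```

Since the weights sum to 1, WS stays on the same 0-100% scale as accuracy; a curve that is high already at small N scores close to its MAX.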

Controls
Comparing learning curves tells us which model learns from fewer examples. However, since MLPs can approximate any function, it is difficult to determine whether performance is tied to the knowledge acquired at pre-training time, or to the process of fine-tuning itself. We present controls that attempt to disentangle these two factors. (For WS, we use the decreasing weights W = (0.23, 0.2, 0.17, 0.14, 0.11, 0.08, 0.07).)

Are LMs sensitive to the language input? We are interested in whether pre-trained representations handle reasoning tasks over language examples. Thus, a natural control is to present the reasoning task without language and check whether performance drops. If the learning curve of a model does not change when the input is perturbed, or even mostly deleted, then the model shows low language sensitivity, and the pre-trained representations do not explain the probe performance. This approach is related to recent work by Hewitt and Liang (2019), who proposed a control task in which the learning curve of a model is compared to its learning curve when words are associated with random behaviour. We propose two controls.

NO LANGUAGE control We remove all input tokens except [MASK] and the arguments of the task, i.e., the tokens that are necessary for computing the desired output. In AGE-COMPARISON, an example is reduced to the phrase "24 [MASK] 55", where the candidate answers are the words "blah", for "older", and "ya", for "younger". If the learning curve is similar to the one obtained on full examples (low language sensitivity), then the LM is not strongly using the language input.
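For AGE-COMPARISON, constructing the NO LANGUAGE variant of an example can be sketched as (function name and implementation are ours):

```python
def no_language_control(age1, age2):
    """Strip an AGE-COMPARISON example down to its arguments and [MASK];
    the answers are mapped to nonsense tokens, so only fine-tuning can
    associate them with the outcome of the comparison."""
    statement = f"{age1} [MASK] {age2}"
    nonsense = {"older": "blah", "younger": "ya"}
    gold = nonsense["younger"] if age1 < age2 else nonsense["older"]
    return statement, gold

statement, gold = no_language_control(24, 55)  # ("24 [MASK] 55", "ya")
```

Because "blah" and "ya" carry no pre-trained meaning, any success on this variant must come from the fine-tuning examples themselves.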
The dashed lines in Figure 3C illustrate the learning curves in the NO LANGUAGE control. ROBERTA-L (green) shows high language sensitivity, while BERT-L (red) has low language sensitivity. This suggests BERT-L handles this task mostly through examples provided during fine-tuning. Table 2 paints a similar picture, where the metric we use is identical to WS, except that instead of averaging accuracies, we average the difference in accuracies between the standard model and the NO LANGUAGE control (rounding negative numbers to zero). For ROBERTA-L the value is almost 50, because ROBERTA-L reaches almost 100% accuracy in the presence of language, and is random (50% accuracy) without language.

PERTURBED LANGUAGE control A more targeted control is to replace words that are central for the reasoning task with nonsense words. For example, in the PROPERTY CONJUNCTION task we replace the word "and" with the word "blah", resulting in examples such as "What is located at hand blah used for writing?". If the learning curve of the PERTURBED LANGUAGE control is similar to that on the original examples, this indicates that the model does not utilize the pre-trained representation of "and" to solve the task, and may not capture its effect on the semantics of the statement.
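The PERTURBED LANGUAGE transformation is a simple token replacement; the sketch below (function name ours) shows it on examples from two probes:

```python
def perturb_language(statement, targeted_words, nonsense="blah"):
    """PERTURBED LANGUAGE control: replace only the probe's targeted
    words with a nonsense token, leaving the rest of the input intact."""
    return " ".join(nonsense if tok in targeted_words else tok
                    for tok in statement.split())

perturbed = perturb_language(
    "What is located at hand and used for writing?", {"and"})
```

The same function, applied with targeted words {"age", "than"}, produces the perturbed AGE-COMPARISON statements.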
Targeted words change from probe to probe. In AGE-COMPARISON, the perturbed statement is "A AGE-1 year old person blah is [MASK] blah a AGE-2 year old person." In this case the word "age" and the word "than", which add context and suggest a comparison operation, are replaced. Figure 3B shows the learning curves for ROBERTA-L and BERT-L, where solid lines correspond to the original examples and dashed lines to the PERTURBED LANGUAGE control. Despite this minor perturbation, the performance of ROBERTA-L substantially decreases, implying that the model requires the language input to achieve high accuracy, while BERT-L is not influenced by this perturbation.
Does a linear transformation suffice? In MC-MLM, the representations h are fixed, and only the pre-trained parameters of MLP_MLM are fine-tuned. As a proxy for measuring "how far" the representations are from solving a task, we also fix the weights of the first layer of MLP_MLM, and train only the final layer. Succeeding in this setup means that only a linear transformation of the representations is required for the probe. Table 2 and Figure 3D show (dashed lines) the performance of the linear setup (LINEAR), compared to MLP_MLM. We observe that ROBERTA-L can reach high performance in this setup, while BERT-L remains random.
Are LMs sensitive to the input distribution? In probes where the arguments of the symbolic reasoning can take a range of values, we can test whether models are robust to changes in the input distribution. In AGE-COMPARISON, we shift ages to values that are not within a human life span: 215-230. Figure 3E shows that the models are substantially affected by shifting the age values. ROBERTA-L partially recovers and achieves fair accuracy, but the drop in zero-shot performance illustrates that the ability of LMs to predict "younger" or "older" is tied to the natural distribution of ages, and that the models cannot abstractly reason about numbers in any context.

Multi-Choice Question Answering
Constructing a MC-MLM probe limits the answer candidates to a single token from the word-piece vocabulary. To relax this setup we also explore the MC-QA setup from §2.
In MC-QA, we phrase the task as a question, letting answer candidates be arbitrary strings, which provides ample expressivity (see Table 1). Figure 3F shows the learning curves. Because in MC-QA the network MLP_QA cannot be initialized with pre-trained weights, it is impossible to obtain meaningful zero-shot results, and more training examples are needed to train MLP_QA. Still, the trends observed in MC-MLM remain, with ROBERTA-L achieving the best performance with the fewest examples.

The oLMpic Games
We now present a series of challenges aimed at probing the symbolic reasoning abilities of pre-trained LMs.

Can LMs perform robust comparison?
Comparing two numeric values requires representing the values and performing the comparison operation. In §3 we saw the AGE-COMPARISON task, in which the ages of two people were compared. We found that ROBERTA-L, and to some extent BERT-WWM, were able to handle this task, performing well under the controls. We now expand to related comparison tasks and perturbations that assess the sensitivity of LMs to the particular context and to the numerical values.
Is ROBERTA-L comparing numbers or ages? ROBERTA-L obtained zero-shot accuracy of 96% in AGE-COMPARISON. But is it robust? We test this using perturbations to the task and present the results in Figure 4. Figure 4A corresponds to the experiment from §3, where we observed that ROBERTA-L predicts "younger" (blue pixels) and "older" (white pixels) almost perfectly.
To test whether ROBERTA-L can compare ages given the birth year rather than the age, we use the statement "A person born in YEAR-1 is [MASK] than me in age, if I was born in YEAR-2." Figure 4B shows that ROBERTA-L correctly flips "younger" to "older" (76% accuracy), reasoning that a person born in 1980 is older than a person born in 2000.
However, when evaluated on the exact same statement, but with values corresponding to typical ages instead of years (Figure 4D), ROBERTA-L obtains an accuracy of 12%, consistently outputting the wrong prediction. It seems that since the values are typical ages and not years, the model disregards the statement, performing the comparison based on the values alone and not the language. We will revisit this tendency in §4.4. Symmetrically, Figure 4C shows the results when the numeric values of ages are swapped with typical years of birth. ROBERTA-L is unable to handle this, always predicting "older". This emphasizes that the model is sensitive to the argument values.
Can Language Models compare object sizes? Comparing physical properties of objects requires knowledge of the numeric value of the property and the ability to perform comparison. Previous work has shown that such knowledge can be extracted from text and images (Bagherinezhad et al., 2016; Forbes and Choi, 2017; Yang et al., 2018a; Elazar et al., 2019). Can LMs do the same?

Figure 4: AGE-COMPARISON perturbations. Left side graphs are age comparison; right side graphs are age comparison by birth-year. In the bottom row, the values of ages are swapped with birth-years and vice versa. In blue pixels the model predicts "older", in white pixels "younger". The first answer (A) is the correct answer.

Probe Construction We construct statements of the form "The size of a OBJ-1 is usually much [MASK] than the size of a OBJ-2.", where the candidate answers are "larger" and "smaller". To instantiate the two objects, we manually sample a list of objects from two domains: animals (e.g., "camel", "dinosaur") and general objects (e.g., "pen", "sun"), and use the first domain for training and the second for evaluation. We bucket objects by the numerical value of their size, using the median value from DOQ (Elazar et al., 2019), and then manually fix any errors. Overall, we collected 127 and 35 objects for training and development, respectively. We automatically instantiate object slots using objects that are not in the same bucket.

Results ROBERTA-L excels in this task, starting from 84% accuracy in the zero-shot setup and reaching a MAX of 91% (Table 3). Other models start with random performance and are roughly on par with MLM-BASELINE. ROBERTA-L shows sensitivity to the language, suggesting that the ability to compare object sizes is encoded in it.

Analysis Table 4 shows the results of running ROBERTA-L in the zero-shot setup over pairs of objects, where we sampled a single object from each bucket. Objects are ordered by size, from small to large.
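Instantiation of the size-comparison probe can be sketched as follows; the buckets below are toy stand-ins of ours for the DOQ-derived ones, and pairs are drawn from different buckets so that the gold answer is determined by bucket order:

```python
# Toy size buckets, ordered from small to large (ours, for illustration).
BUCKETS = [["pen", "ant"], ["cat", "camel"], ["house", "sun"]]

def size_probe_pairs(buckets):
    """Instantiate the size-comparison template for object pairs drawn
    from different buckets; the gold answer follows from bucket order."""
    examples = []
    for i, b1 in enumerate(buckets):
        for j, b2 in enumerate(buckets):
            if i == j:
                continue  # same bucket: no clear gold answer
            gold = "smaller" if i < j else "larger"
            for o1 in b1:
                for o2 in b2:
                    stmt = (f"The size of a {o1} is usually much [MASK] "
                            f"than the size of a {o2}.")
                    examples.append((stmt, gold))
    return examples

examples = size_probe_pairs(BUCKETS)  # 6 bucket pairs x 4 object pairs = 24
```

Skipping same-bucket pairs keeps labels unambiguous, since objects within a bucket have roughly the same size.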
Overall, ROBERTA-L correctly predicts "larger" below the diagonal, and "smaller" above it. Interestingly, errors are concentrated around the diagonal, due to the more fine-grained differences in sizes, and in the last column, where objects are compared to "sun". A possible explanation is that the model ignores the other object in the comparison, simply asserting that the sun is large.


Do LMs know "always" from "often"?
Adverbial modifiers such as "always", "sometimes" or "never" tell us about the frequency of various world facts. Anecdotally, when ROBERTA-L predicts a completion for the phrase "Cats usually drink [MASK].", the top completion is "coffee" rather than "water". This is because coffee is a frequent drink in the literature the model was trained on. However, humans know that "Cats NEVER drink coffee".

The "Always-Never" task We present statements, such as "rhinoceros [MASK] have fur", with answer candidates, such as "never" or "always". To succeed, the model should represent the frequency of the fact, and map the appropriate meaning of the adverbial modifier to that representation.

Probe Construction We manually craft templates that contain one slot for a subject and another for an object.
Two example templates are "FOOD-TYPE is [MASK] part of a ANIMAL's diet." and "A ANIMAL [MASK] has a BODY-PART." (more examples are available in Table 6). The subject slot is instantiated with concepts of the correct semantic type, according to the isA predicate in CONCEPTNET. In the examples above, we find concepts of type FOOD-TYPE and ANIMAL. The object slot is then instantiated by forming masked templates of the form "meat is part of a [MASK]'s diet." and "cats have [MASK]." and letting BERT-L produce the top-20 completions. We filter out completions that do not have the correct semantic type according to the isA predicate. Finally, we crowdsource gold answers using Amazon Mechanical Turk. Annotators were presented with an instantiated template (with the masked token removed), such as "Chickens have horns.", and chose the correct answer from 5 candidates: "never", "rarely", "sometimes", "often" and "always". We collected 1,300 examples, with 1,000 used for training and 300 for evaluation.

Results Table 5 shows the results, where random accuracy is 20%, and the accuracy when taking a majority vote over the training set is 35.5%. In the zero-shot setup, accuracy is below random. In the MLP_MLM and LINEAR setups, accuracy reaches a maximum of 57% for BERT-L, but MLM-BASELINE obtains similar accuracy, implying that the task was mostly tackled at fine-tuning time, and the pre-trained representations did not contribute much. The language controls strengthen this hypothesis: performance hardly drops in the PERTURBED LANGUAGE control (1-2 tokens per template were replaced with "blah"), and drops only slightly in the NO LANGUAGE control. Figure 5A compares the learning curve of the best performing model, BERT-WWM, with that of its NO LANGUAGE control. MLM-BASELINE consistently outperforms the LMs, which display only minor language sensitivity, suggesting that pre-training is not effective for solving this task.
Analysis We generated predictions from the best model, BERT-WWM, and show the analysis results in Table 6. For reference, the majority-vote accuracy of human annotators is near 100%. Although the answers "often" and "rarely" are the gold answer in 19% of the training data, the LMs predict these answers in less than 1% of examples. In the template "A dish with FOOD-TYPE [MASK] contains FOOD-TYPE.", the LM always predicts "sometimes". Overall, we find that the models do not perform well. Reporting bias (Gordon and Van Durme, 2013) may play a role in LMs' inability to correctly determine cases such as "A rhinoceros NEVER has fur." Interestingly, behavioral research conducted on blind humans shows that they exhibit a similar bias (Kim et al., 2019).

Do LMs Capture Negation?
Ideally, the presence of the word "not" should affect the prediction of a masked token. However, it has been shown that the completion for both "A cat is a [MASK]." and "A cat is not a [MASK]." is "cat". Several recent works have indeed shown that LMs do not take the presence of negation in sentences into account (Ettinger, 2019; Kassner and Schütze, 2019). Richardson et al. (2019) showed that LMs are able to successfully apply knowledge of antonyms in a natural language inference setup. Here, we add to this literature by probing whether LMs can properly use negation in the context of synonyms vs. antonyms.

Do LMs Capture the Semantics of Antonyms?
We check whether LMs use modifiers appropriately in the presence of synonyms vs. antonyms. In the statement "He was [MASK] fast, he was very slow.", [MASK] should be replaced with "not", since "fast" and "slow" are antonyms. Conversely, in "He was [MASK] fast, he was very rapid", the LM should choose a word like "very" in the presence of the synonyms "fast" and "rapid". A LM that can correctly distinguish between "not" and "very", demonstrates knowledge of the taxonomic relations antonym and synonym, as well as the ability to reason about how negation should be used in this context.

Probe Construction
We sample synonym and antonym pairs from CONCEPTNET (Speer et al., 2017) and WORDNET (Fellbaum, 1998), and use Google Books Corpus to choose pairs that occur frequently in language. We make use of the statements introduced above. Half of the examples are synonym pairs and half antonyms, generating 4,000 training examples and 500 for evaluation.

How well do LMs handle conjunctions of facts?
We present two probes where a model should understand the reasoning expressed by the word "and".
Property conjunction CONCEPTNET is a knowledge base that describes the properties of millions of concepts through (subject, predicate, object) triples. We use CONCEPTNET to test whether LMs can find concepts for which a conjunction of properties holds. For example, we create a question like "What is located in a street and is related to octagon?", where the correct answer is "street sign". Because answers are drawn from CONCEPTNET, they often consist of more than one word-piece, so examples are generated in the MC-QA setup.

Probe Construction CONCEPTNET contains more than 34 million (subject, predicate, object) triples. To construct an example, we first choose a concept that has two properties in CONCEPTNET, where a property is a (subject, predicate) pair or a (predicate, object) pair. For example, street sign has the properties (atLocation, street) and (relatedTo, octagon).
Then, we create two distractor concepts for which only one property holds: car has the property (atLocation, street), and math has the property (relatedTo, octagon). Given the answer concept, the distractors, and the properties, we automatically generate pseudo-language questions and answers by mapping 15 CONCEPTNET predicates to natural language questions.
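The generation procedure can be illustrated with a toy triple store. The predicate-to-text mapping below is an assumption standing in for the paper's hand-crafted mapping of 15 predicates.

```python
# Toy triple store standing in for CONCEPTNET; the PRED_TO_TEXT mapping is
# an illustrative assumption (the probe hand-maps 15 predicates).
TRIPLES = [
    ("street sign", "atLocation", "street"),
    ("street sign", "relatedTo", "octagon"),
    ("car", "atLocation", "street"),
    ("math", "relatedTo", "octagon"),
]
PRED_TO_TEXT = {"atLocation": "is located in a", "relatedTo": "is related to"}

def build_question(answer, triples):
    """Build one conjunction question; distractors are concepts that
    satisfy exactly one of the answer concept's two properties."""
    triple_set = set(triples)
    props = [(p, o) for s, p, o in triples if s == answer]
    (p1, o1), (p2, o2) = props[:2]
    question = f"What {PRED_TO_TEXT[p1]} {o1} and {PRED_TO_TEXT[p2]} {o2}?"
    subjects = {s for s, _, _ in triples}
    distractors = [s for s in subjects if s != answer and
                   ((s, p1, o1) in triple_set) != ((s, p2, o2) in triple_set)]
    return {"question": question, "answer": answer, "distractors": distractors}
```

For the "street sign" concept, this yields the example question above, with "car" and "math" as the one-property distractors.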
We split examples such that concepts are disjoint between training and evaluation.

Results
In MC-QA, we fine-tune the entire network and do not freeze any representations. Zero-shot evaluation cannot be applied, since the weights of MLP QA are untrained. All LMs consistently improve as the number of examples increases, reaching a MAX of 66-80% (Table 8) and a WS of 37-47, substantially higher than the baselines (56% MAX and 37 WS). Language sensitivity is slightly higher than zero in some models. Overall, the results suggest that the LMs have some capability in this task, but it is hard to determine whether it existed before fine-tuning.

Taxonomy conjunction A different operation is to find properties that are shared by two concepts. Specifically, we test whether LMs can find the mutual hypernym of a pair of concepts, e.g., "A germ and a human are both a type of [MASK].", where the answer is "organism".

Probe Construction
We use CONCEPTNET and WORDNET to find pairs of concepts and their hypernyms, keeping only pairs that frequently appear in the GOOGLE BOOK CORPUS. The example template is "A ENT-1 and a ENT-2 are both a type of [MASK].", where ENT-1 and ENT-2 are replaced with entities that have a common hypernym, which is the gold answer. Distractors are concepts that are hypernyms of ENT-1 but not of ENT-2, or vice versa. For evaluation, we keep all examples related to food and animal taxonomies, e.g., "A beer and a ricotta are both a type of [MASK].", where the answer is "food" and the distractors are "cheese" and "alcohol". For training, we use examples from different taxonomic trees, such that the concepts in the training and evaluation sets are disjoint.
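The taxonomy-conjunction construction can be sketched with a toy hypernym tree; the tree below is an illustrative assumption, whereas the probe extracts real chains from CONCEPTNET and WORDNET.

```python
# Toy hypernym chains (child -> parent); real taxonomies come from
# CONCEPTNET and WORDNET as described in the probe construction.
PARENT = {"beer": "alcohol", "ricotta": "cheese",
          "alcohol": "food", "cheese": "food"}

TEMPLATE = "A {e1} and a {e2} are both a type of [MASK]."

def ancestors(concept):
    """Hypernyms of a concept, ordered from closest to most general."""
    out = []
    while concept in PARENT:
        concept = PARENT[concept]
        out.append(concept)
    return out

def build_example(e1, e2):
    """Gold answer: the lowest common hypernym; distractors: hypernyms
    of one entity but not of the other."""
    a1, a2 = ancestors(e1), ancestors(e2)
    gold = next(h for h in a1 if h in a2)
    distractors = [h for h in a1 + a2 if (h in a1) != (h in a2)]
    return {"statement": TEMPLATE.format(e1=e1, e2=e2),
            "answer": gold, "distractors": distractors}
```

For ("beer", "ricotta") this reproduces the example above: the gold answer "food" with distractors "alcohol" and "cheese".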
Results
Table 9 shows that models' zero-shot accuracy is substantially higher than random (33%) (Richardson et al., 2019), but even after fine-tuning, accuracy is at most 57%. However, we do observe language sensitivity in the NO LANGUAGE and PERTURBED LANGUAGE controls, suggesting that some models have pre-existing capabilities. Next, we characterize when the models err. A learning curve of the best performing model, ROBERTA-L, is compared to the controls in Figure 5C.

Analysis
Analyzing the errors of ROBERTA-L, we found that a typical error is predicting, for "A crow and a horse are both a type of [MASK].", the answer "bird" rather than "animal". Specifically, LMs prefer hypernyms that are closer in terms of edge distance on the taxonomy tree: a crow is first a bird, and only then an animal. We find that when a distractor is closer in the taxonomy tree to one of the entities in the statement than the gold answer, the models consistently (80%) choose the distractor, ignoring the second entity in the phrase.
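The edge-distance bias in this error analysis can be made concrete with a toy taxonomy; both the tree and the selection rule below are illustrative assumptions, not the paper's implementation.

```python
# Toy taxonomy (child -> parent) for illustrating the observed bias.
PARENT = {"crow": "bird", "bird": "animal",
          "horse": "mammal", "mammal": "animal"}

def edge_distance(concept, hypernym):
    """Edges from concept up to hypernym; None if not an ancestor."""
    d = 0
    while concept is not None:
        if concept == hypernym:
            return d
        concept = PARENT.get(concept)
        d += 1
    return None

def biased_pick(entity, candidates):
    """The observed failure mode: pick the candidate hypernym closest to
    ONE entity, ignoring the other entity in the statement."""
    scored = [(edge_distance(entity, h), h) for h in candidates
              if edge_distance(entity, h) is not None]
    return min(scored)[1]
```

Here `biased_pick("crow", ["animal", "bird"])` selects "bird" (distance 1) over "animal" (distance 2), even though only "animal" is a hypernym of both crow and horse, mirroring the 80% error pattern described above.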

Can language models perform multi-hop reasoning?
Questions that require multi-hop reasoning, such as "Who is the director of the movie about a WW2 pacific medic?", have recently drawn attention (Yang et al., 2018b; Welbl et al., 2017; Talmor and Berant, 2018) as a challenging task for contemporary models. But do pre-trained LMs have some internal mechanism to handle such questions? To address this question, we create two probes: one for compositional question answering, and one that uses a multi-hop setup, building upon our observation ( §3) that some LMs can compare ages.
Encyclopedic composition We construct questions such as "When did the band where John Lennon played first form?". Because answers require more than one word-piece, we use the MC-QA setup.

Probe Construction
We use the following three templates: (1) "when did the band where ENT played first form?"; (2) "who is the spouse of the actor that played in ENT?"; and (3) "where is the headquarters of the company that ENT established located?". We instantiate ENT using information from WIKIDATA (Vrandečić and Krötzsch, 2014), choosing challenging distractors. For example, for template 1 the distractor is a year close to the gold answer, and for template 3 it is a city in the same country as the gold answer city.
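A sketch of instantiating template 1 with near-miss year distractors follows. The mini fact store and distractor offsets are illustrative assumptions standing in for WIKIDATA and the paper's distractor selection.

```python
# Hypothetical mini knowledge source standing in for WIKIDATA.
FACTS = {
    "John Lennon": {"band": "The Beatles"},
    "The Beatles": {"formed": "1960"},
}
TEMPLATE_1 = "when did the band where {ent} played first form?"

def year_distractors(gold, offsets=(-2, 3)):
    """Challenging distractors for template 1: years close to the gold
    answer (offsets are illustrative)."""
    return [str(int(gold) + o) for o in offsets]

def build_example(ent):
    band = FACTS[ent]["band"]      # hop 1: entity -> band
    gold = FACTS[band]["formed"]   # hop 2: band -> formation year
    return {"question": TEMPLATE_1.format(ent=ent),
            "answer": gold,
            "candidates": [gold] + year_distractors(gold)}
```

Note that answering correctly requires chaining two lookups, which is exactly what the two-stage fine-tuning described next supplies as single-hop facts.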
To solve the complex question, the model must know all the single-hop facts required for answering it. Thus, the model is first fine-tuned on all such facts ("What company did Bill Gates establish? Microsoft", "Where is the headquarters of Microsoft? Seattle") from the training and evaluation sets, and then fine-tuned on the multi-hop compositions.
Results
Results are summarized in Table 10. All models achieve low accuracy in this task, and the baseline performs best, with a MAX of 54%. The language sensitivity of all models is small, suggesting that the LMs are unable to resolve compositional questions, and also struggle to learn to do so with some supervision. A learning curve of the best performing LM, BERT-WWM, is compared to the controls in Figure 5D.

Multi-hop Comparison
Multi-hop reasoning can be found in many common structures in natural language. In the phrase "When comparing a 83 year old, a 63 year old and a 56 year old, the [MASK] is oldest", one must first determine which person is oldest, and then refer to its position in the ordering: first, second, or third.

Probe Construction
We use the template above, treating the ages as arguments. The LM must choose between the candidates "first", "second", and "third". Age arguments are in the same ranges as in AGE-COMPARISON.

Results
The candidate answers appear among the models' top predictions, indicating that the model sees the answers as viable choices. However, although successful in AGE-COMPARISON, the performance of ROBERTA-L is poor in this probe (Table 12), with zero-shot accuracy that is almost random, WS slightly above random, MAX lower than MLM-BASELINE (52%), and close to zero language sensitivity. All LMs seem to be learning the task during probing. Although BERT-WWM was able to partially solve the task with a MAX of 74% when approaching 4,000 training examples, the models do not appear to show multi-step capability in this task.
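The probe construction above can be sketched as follows; the age-sampling range is an assumption meant to mirror the typical human ages used in AGE-COMPARISON.

```python
import random

TEMPLATE = ("When comparing a {a} year old, a {b} year old and a {c} year old, "
            "the [MASK] is oldest")
ORDINALS = ["first", "second", "third"]

def build_example(ages):
    """Label which of the three positions holds the oldest age."""
    answer = ORDINALS[ages.index(max(ages))]
    return {"statement": TEMPLATE.format(a=ages[0], b=ages[1], c=ages[2]),
            "candidates": list(ORDINALS), "answer": answer}

def sample_example(rng):
    # Distinct ages from an assumed typical-human-age range.
    return build_example(rng.sample(range(15, 105), 3))
```

Solving an example requires two steps: comparing the three ages, then mapping the winner to its ordinal position, which is the composition the models fail to perform.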

Medals
We summarize the results of the oLMpic Games in Table 11. For each task and LM, we summarize its success, taking into account the baseline models and controls. Interestingly, the table is mostly empty: the LMs have generally not been able to demonstrate strong pre-training capabilities in these symbolic reasoning tasks. BERT-WWM has shown partial success in a few tasks, whereas ROBERTA-L shows high performance in ALWAYS-NEVER, OBJECTS COMPARISON and ANTONYM NEGATION, and emerges as the most promising LM in these probes. However, when perturbed, ROBERTA-L has failed to demonstrate consistent generalization and abstraction.

Discussion
In this work we presented eight different tasks for evaluating the reasoning abilities of models, alongside an evaluation protocol for disentangling pre-training from fine-tuning.
Interestingly, we found that even models that have identical structure and objective functions differ not only quantitatively but also qualitatively. Specifically, ROBERTA-L has shown reasoning abilities that are absent from other models. Thus, with enough data and careful optimization, models can acquire, from a LM objective alone, skills that are intuitively surprising.
Another insight is that when current LMs succeed in a reasoning task, they do not seem to do so through abstraction and operation composition as humans perceive it. Their abilities are strongly context-dependent: if ages are compared, then the numbers should be typical human ages. Deviations from the training distribution quickly lead to large drops in performance. Last, the performance of LMs in many reasoning tasks is poor.
Our work sheds light on some of the blind spots of current LMs. We will release our code and data, and hope that this work will help researchers evaluate the reasoning abilities of models, aid the design of new probes, and guide future work on pre-training, objective functions and model design for endowing models with capabilities they currently lack.