What Does My QA Model Know? Devising Controlled Probes using Expert Knowledge

Open-domain question answering (QA) is known to involve several underlying knowledge and reasoning challenges, but are models actually learning such knowledge when trained on benchmark tasks? To investigate this, we introduce several new challenge tasks that probe whether state-of-the-art QA models have general knowledge about word definitions and general taxonomic reasoning, both of which are fundamental to more complex forms of reasoning and are widespread in benchmark datasets. As an alternative to expensive crowd-sourcing, we introduce a methodology for automatically building datasets from various types of expert knowledge (e.g., knowledge graphs and lexical taxonomies), allowing for systematic control over the resulting probes and for a more comprehensive evaluation. We find automatically constructing probes to be vulnerable to annotation artifacts, which we carefully control for. Our evaluation confirms that transformer-based QA models are already predisposed to recognize certain types of structural lexical knowledge. However, it also reveals a more nuanced picture: their performance degrades substantially with even a slight increase in the number of hops in the underlying taxonomic hierarchy, or as more challenging distractor candidate answers are introduced. Further, even when these models succeed at the standard instance-level evaluation, they leave much room for improvement when assessed at the level of clusters of semantically connected probes (e.g., all Isa questions about a concept).


Introduction
Automatically answering questions, especially in the open-domain setting (i.e., where minimal or no contextual knowledge is explicitly provided), requires bringing to bear a considerable amount of background knowledge and reasoning abilities. For example, knowing the answers to the two questions in Figure 1 requires identifying a specific ISA relation (i.e., that cooking is a type of learned behavior) as well as recalling the definition of a concept (i.e., that global warming is defined as a worldwide increase in temperature). In the multiple-choice setting, which is the variety of question answering (QA) that we focus on in this paper, there is also pragmatic reasoning involved in selecting optimal answer choices (e.g., while greenhouse effect might in some other context be a reasonable answer to the second question in Figure 1, global warming is a preferable candidate). Recent successes in QA, driven largely by the creation of new resources (Zellers et al., 2018; Talmor et al., 2019; Bhagavatula et al., 2019; Khot et al., 2020, etc.) and advances in model pre-training (Radford et al., 2018; Devlin et al., 2019), raise a natural question: do state-of-the-art multiple-choice QA (MCQA) models that excel at standard tasks really have basic knowledge and reasoning skills?
Most existing MCQA datasets are constructed through either expensive crowd-sourcing (Welbl et al., 2017) or hand engineering, with crowd-sourcing making it possible to collect large amounts of data at the cost of losing systematic control over the semantics of the target questions. Hence, a controlled experiment that answers such a question for QA is difficult to carry out given the lack of targeted challenge datasets.
Having definitive empirical evidence of model competence on any given phenomenon requires constructing a wide range of systematic tests. For example, in measuring competence of definitions, not only do we want to see that the model can handle individual questions such as Figure 1.1 inside of benchmark tasks, but that it can answer a wider range of questions that exhaustively cover a broad set of concepts and question perturbations (i.e., systematic adjustments to how the questions are constructed). The same applies to ISA reasoning; not only is it important to recognize in the question in Figure 1.1 that cooking is a learned behavior, but also that cooking is a general type of behavior or, through a few more inferential steps, a type of human activity.
In this paper, we look at systematically constructing such tests by exploiting the vast amounts of structured information contained in various types of expert knowledge such as knowledge graphs and lexical taxonomies. Our general methodology works as illustrated in Figure 1: given any MCQA model trained on a set of benchmark tasks, we systematically generate a set of synthetic dataset probes (i.e., MCQA renderings of the target information) from information in expert knowledge sources. We then use these probes to ask two empirical questions: 1) how well do models trained on benchmark tasks perform on these probing tasks, and 2) can such models be re-trained to master new challenges with minimal performance loss on their original tasks?
While our methodology is amenable to any knowledge source and set of models/benchmark tasks, we focus on probing state-of-the-art transformer models (Devlin et al., 2019; Liu et al., 2019b) in the domain of science MCQA. For sources of expert knowledge, we use WordNet, a comprehensive lexical ontology, and other publicly available dictionary resources. We devise probes that measure model competence in definition and taxonomic knowledge in different settings (including hypernymy, hyponymy, and synonymy detection, and word sense disambiguation). This choice is motivated by the fact that the science domain is considered particularly challenging for QA (Clark et al., 2013; Clark, 2015; Clark et al., 2019), and existing science benchmarks are known to involve widespread use of such knowledge (see Boratko et al. (2018) for analysis), which is also arguably fundamental to more complex forms of reasoning.
We show that accurately probing QA models via synthetic datasets is not straightforward, as unexpected artifacts can easily arise in such data. This motivates our carefully constructed baselines and close data inspection to ensure probe quality.
Our results confirm that transformer-based QA models have a remarkable ability to recognize certain types of knowledge captured in our probes, even without additional fine-tuning. Such models can even outperform strong task-specific models trained directly on our probing tasks (e.g., on definitions, our best model achieves 77% test accuracy without specialized training, as opposed to 51% for a task-specific LSTM-based model). We also show that the same models can be effectively fine-tuned on small samples (even 100 examples) of probe data, and that high performance on the probes tends to correlate with a smaller drop in the model's performance on the original QA task.
Our comprehensive assessment reveals several interesting nuances to the overall positive trend. For example, the performance of even the best QA models degrades substantially on our hyponym probes (by 8-15%) when going from 1-hop links to 2-hops. Further, the accuracy of even our best models on the WordNetQA probe drops by 14-44% under our cluster-based analysis, which assesses whether a model knows several facts about each individual concept, rather than just being good at answering isolated questions. State-of-the-art QA models thus have much room to improve even in some fundamental building blocks, namely definitions and taxonomic hierarchies, of more complex forms of reasoning.

Related Work
We follow recent work on constructing challenge datasets for probing neural models, which has primarily focused on the task of natural language inference (NLI) (Glockner et al., 2018;Naik et al., 2018;McCoy et al., 2019;Rozen et al., 2019;Warstadt et al., 2019). Most of this work looks at constructing data through adversarial generation methods, which have also been found useful for creating stronger models (Kang et al., 2018). There has also been work on using synthetic data of the type we consider in this paper (Poliak et al., 2018a;Geiger et al., 2019;Richardson et al., 2020). We closely follow the methodology of Richardson et al. (2020), who use hand-constructed linguistic fragments to probe NLI models and study model re-training using a variant of the inoculation by fine-tuning strategy of Liu et al. (2019a). In contrast, we focus on probing open-domain MCQA models (see Si et al. (2019) for a related study in the reading comprehension setting) as well as constructing data from much larger sources of structured knowledge.
Our main study focuses on probing the BERT model and fine-tuning approach of Devlin et al. (2019), and other variants thereof, which are all based on the transformer architecture of Vaswani et al. (2017). Related to our efforts, there have been recent studies into the types of relational knowledge contained in large-scale knowledge models (Petroni et al., 2019; Kassner and Schütze, 2019), which, similar to our work, probe models using structured knowledge sources. This prior work, however, primarily focuses on unearthing the knowledge contained in the underlying language models as-is, without further training, using simple (single-token) cloze-style probing tasks and templates (similar to what we propose in Section 3). In contrast, we focus on understanding the knowledge contained in language models after they have been trained for a QA end-task using benchmark datasets in which such knowledge is expected to be widespread. Further, our evaluation is done before and after these models are fine-tuned on our probe QA tasks, using a more complex set of QA templates and target inferences.
The use of lexical resources and knowledge graphs such as WordNet to construct datasets has a long history, and has recently appeared in work on adversarial attacks (Glockner et al., 2018; Jia and Liang, 2017) and general task construction (Pilehvar and Camacho-Collados, 2019; Pasupat and Liang, 2015). In the area of MCQA, there is related work on constructing questions from tuples (Jauhar et al., 2016; Talmor et al., 2019), both of which involve standard crowd annotation to elicit question-answer pairs (see also Seyler et al. (2017); Reddy et al. (2017)). In contrast to this work, we focus on generating data in an entirely automatic fashion, which obviates the need for expensive annotation and gives us the flexibility to construct much larger datasets that control a rich set of semantic aspects of the target questions.

Dataset Probes and Construction
Our probing methodology starts by constructing challenge datasets (Figure 1, yellow box) from a target set of knowledge resources. Each of our probing datasets consists of multiple-choice questions that include a question q and a set of answer choices or candidates {a_1, ..., a_N}. This section describes in detail the 5 different datasets we build, which are drawn from two sources of expert knowledge, namely WordNet (Miller, 1995) and the GNU Collaborative International Dictionary of English (GCIDE). We describe each resource in turn, and explain how the resulting dataset probes, which we call WordNetQA and DictionaryQA, are constructed. For convenience, we will describe each source of expert knowledge as a directed, edge-labeled graph G. The nodes of this graph are V = C ∪ W ∪ S ∪ D, where C is a set of atomic concepts, W a set of words, S a set of sentences, and D a set of definitions (see Table 1 for details for WordNet and GCIDE). Each edge of G is directed from an atomic concept in C to another node in V, and is labeled with a relation, such as hypernym or isa↑, from a set of relations R (see Table 1).
When defining our probe question templates, it will be useful to view G as a set of (relation, source, target) triples T ⊆ R×C×V. Due to their origin in an expert knowledge source, such triples preserve semantic consistency. For instance, when the relation in a triple is def, the corresponding edge maps a concept in C to a definition in D.
To construct probe datasets, we rely on two heuristic functions, defined below for each individual probe: GEN_Q(τ), which generates gold question-answer pairs (q, a) from a set of triples τ ⊆ T and question templates Q, and DISTR(τ′), which generates distractor answer choices {a′_1, ..., a′_{N−1}} based on another set of triples τ′ (where usually τ′ ⊂ T). For brevity, we will use GEN(τ) to denote GEN_Q(τ), leaving question templates Q implicit.
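As a concrete illustration, the GEN(τ) step for definition triples can be sketched as follows. The triple encoding, helper names, and toy data here are our own illustrative assumptions, not the authors' actual implementation.

```python
# A minimal sketch of GEN(tau) for definition probes: turn (relation,
# concept, word) triples plus example sentences and glosses into gold
# (question, answer) pairs. All names and data below are hypothetical.

def gen_definition_questions(triples, example_sents, glosses):
    """Produce (question, gold answer) pairs from 'def' triples.

    triples: iterable of (relation, concept, word) entries
    example_sents: concept -> an example sentence containing the word
    glosses: concept -> definition string (the gold answer)
    """
    qa_pairs = []
    for rel, concept, word in triples:
        if rel != "def":
            continue  # this sketch only renders definition questions
        sent = example_sents[concept]
        question = (f"In the sentence '{sent}', "
                    f"the word {word} is best defined as:")
        qa_pairs.append((question, glosses[concept]))
    return qa_pairs

# Toy data mirroring the 'nestled' example from Table 2.
triples = [("def", "nestle.v.01", "nestled")]
sents = {"nestle.v.01": "The baby nestled her head"}
glosses = {"nestle.v.01": "position comfortably"}
print(gen_definition_questions(triples, sents, glosses))
```

In the real pipeline a template would be sampled from Q rather than hard-coded, and the pair would then be combined with DISTR(τ′) output to form a full 5-way question.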

WordNetQA
WordNet is an English lexical database consisting of around 117k concepts, which are organized into groups of synsets that each contain a gloss (i.e., a definition of the target concept), a set of representative English words (called lemmas), and, in around 33k synsets, example sentences. In addition, many synsets have ISA links to other synsets that express complex taxonomic relations.  Table 1 summarizes how we formulate WordNet as a set of triples T of various types. These triples together represent a directed, edge-labeled graph G. Our main motivation for using WordNet, as opposed to a resource such as ConceptNet (Havasi et al., 2007), is the availability of glosses (D) and example sentences (S), which allows us to construct natural language questions that contextualize the types of concepts we want to probe.
Example Generation GEN(τ ). We build 4 individual datasets based on semantic relations native to WordNet (see Miller et al. (1990)): hypernymy (i.e., generalization or ISA reasoning up a taxonomy, ISA ↑ ), hyponymy (ISA ↓ ), synonymy, and definitions. To generate a set of questions in each case, we employ a number of rule templates Q that operate over tuples. A subset of such templates is shown in Table 2. The templates were designed to mimic naturalistic questions we observed in our science benchmarks.
For example, suppose we wish to create a question q about the definition of a target concept c ∈ C. We first select a question template from Q that introduces the concept c and its lemma l ∈ W in context using the example sentence s ∈ S, and then asks to identify the corresponding WordNet gloss d ∈ D, which serves as the gold answer a. The same is done for ISA reasoning; each question about a hypernym/hyponym relation between two concepts, c →↑/↓ c′ ∈ T_i (e.g., dog →↑/↓ animal/terrier), first introduces a context for c and then asks for an answer that identifies c′ (which is also provided with its gloss so as to contain all available context).

In the latter case, the rules (isa^r, c, c′) ∈ T_i in Table 2 cover only direct ISA links from c in direction r ∈ {↑, ↓}. In practice, for each c and direction r, we construct tests that cover the set HOPS(c, r) of all direct as well as derived ISA relations of c:

HOPS(c, r) = {(isa^r, c, c′) ∈ T_i} ∪ {(isa^r, c, c″) | (isa^r, c′, c″) ∈ HOPS(c′, r), for some (isa^r, c, c′) ∈ T_i}

This allows us to evaluate the extent to which models are able to handle complex forms of reasoning that require several inferential steps or hops.
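The recursive closure HOPS(c, r) can be sketched as follows, assuming a toy encoding of direct ISA links as a dictionary (the graph encoding and function names are illustrative, not the authors' code):

```python
# Compute HOPS(c, r): all concepts reachable from c along direct ISA
# links in one direction, optionally capped at a maximum number of hops.

def hops(c, direct_links, max_depth=None, _depth=0):
    """Return the set of concepts reachable from c via ISA links."""
    reachable = set()
    for nxt in direct_links.get(c, []):
        reachable.add(nxt)
        if max_depth is None or _depth + 1 < max_depth:
            reachable |= hops(nxt, direct_links, max_depth, _depth + 1)
    return reachable

# Toy hypernym chain: terrier -> dog -> animal -> organism
isa_up = {"terrier": ["dog"], "dog": ["animal"], "animal": ["organism"]}
print(hops("terrier", isa_up))               # all hypernyms, any hops
print(hops("terrier", isa_up, max_depth=1))  # direct (1-hop) links only
```

Varying `max_depth` is exactly what lets the probes separate 1-hop questions from multi-hop ones.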

Probe Type and Example Questions and Answers (q, a):

- Definitions: defining words in context.
  q. In the sentence The baby nestled her head, the word nestled is best defined as: a. position comfortably
- Hypernymy: [w] is best described as a type of [w′] defined as [d′].
  q. In The thief eluded the police, the word or concept eluded is best described as a type of: a. escape event, defined as to run away from...
- Hyponymy: Given the context [s], which of the following word or concept is a specific type of [w]?
  q. Given the context they awaited her arrival, which of the following word or concept is a specific type of arrival? a. crash landing, defined as an emergency landing under circumstances where...
- Synonymy: Related words.
  q. Which set of words best corresponds to the definition a grammatical category in inflected languages governing agreement between nouns and pronouns...? a. gender, ...

Table 2: Details of the GEN(τ) function used to construct gold question-answer pairs (q, a) from a triple graph G.
Distractor Generation: DISTR(τ′). An example of how distractors are generated is shown in Figure 2, which relies on similar principles as above. For each concept c, we choose 4 distractor answers that are close in the WordNet semantic space. For example, when constructing hypernymy tests for c from the set HOPS(c, ↑), we build distractors by drawing from HOPS(c, ↓) (and vice versa), as well as from the ℓ-deep sister family of c, defined as follows. The 1-deep sister family is simply c's siblings or sisters, i.e., the other children c̃ ≠ c of the parent node ĉ of c. For ℓ > 1, the ℓ-deep sister family also includes all descendants of each c̃ up to ℓ − 1 levels deep, denoted HOPS_{ℓ−1}(c̃, ↓). Formally:

SISTERS_ℓ(c) = {c̃ : (isa↓, ĉ, c̃) ∈ T, c̃ ≠ c} ∪ ⋃_{c̃} HOPS_{ℓ−1}(c̃, ↓)

For definitions and synonyms we build distractors from all of these sets (with a similar restriction on the depth of sister distractors as noted above). In doing this, we can systematically investigate model performance over a wide range of distractor sets.
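The ℓ-deep sister family can be sketched as below; the parent/children encoding and function names are hypothetical stand-ins for the WordNet graph:

```python
# Sketch of the l-deep sister family: c's siblings, plus the descendants
# of each sibling up to l-1 levels down. Encoding is illustrative only.

def descendants(c, children, depth):
    """Concepts below c, at most `depth` levels down (depth=0 -> none)."""
    found = set()
    if depth <= 0:
        return found
    for child in children.get(c, []):
        found.add(child)
        found |= descendants(child, children, depth - 1)
    return found

def sister_family(c, parent, children, l=1):
    """Siblings of c, extended with their descendants up to l-1 levels."""
    sisters = {s for s in children.get(parent[c], []) if s != c}
    out = set(sisters)
    for s in sisters:
        out |= descendants(s, children, l - 1)
    return out

# Toy taxonomy: canine -> {dog, wolf}; dog -> {terrier, poodle}
children = {"canine": ["dog", "wolf"], "dog": ["terrier", "poodle"]}
parent = {"dog": "canine", "wolf": "canine"}
print(sister_family("dog", parent, children, l=1))   # {'wolf'}
print(sister_family("wolf", parent, children, l=2))  # dog and its children
```

Sampling distractors from these sets (rather than at random) is what makes the hardest probe categories hard: the candidates are semantically adjacent to the gold answer.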

Perturbations and Semantic Clusters
Based on how we generate data, for each concept c (i.e., atomic WordNet synset) and probe type (i.e., definitions, hypernymy, etc.), we have a wide variety of questions related to c that manipulate 1) the complexity of reasoning that is involved (e.g., the number of inferential hops) and 2) the types of distractors (or distractor perturbations) that are employed. We call such sets semantic clusters. As we describe in the next section, semantic clusters allow us to devise new types of evaluation that reveal whether models have comprehensive and consistent knowledge of target concepts (e.g., evaluating whether a model can correctly answer several questions associated with a concept, as opposed to a few disjoint instances). Details of the individual datasets are shown in Table 3. From these sets, we follow Richardson et al. (2020) in allocating a maximum of 3k examples for training and reserve the rest for development and testing. Since we are interested in probing, having large held-out sets allows us to do detailed analysis and cluster-based evaluation.

DictionaryQA
The DictionaryQA dataset is created from the GCIDE dictionary, which is a comprehensive open-source English dictionary built largely from the Webster's Revised Unabridged Dictionary (Webster, 1913). Each entry consists of a word, its part-of-speech, its definition, and an optional example sentence (see Table 4). Overall, 33k entries (out of a total of 155k) contain example sentences/usages. As with the WordNet probes, we focus on this subset so as to contextualize each word being probed. In contrast to WordNet, GCIDE does not have ISA relations or explicit synsets, so we take each unique entry to be a distinct sense. We then use the dictionary entries to create a probe that centers around word-sense disambiguation, as described below.

GCIDE Dictionary Entries:
- word: gift, pos: n., definition: Anything given; anything voluntarily transferred by one person to another without compensation; a present. entry example: None.
- word: gift, pos: n., definition: A bribe; anything given to corrupt. entry example: None.
- word: gift, pos: n., definition: Some exceptional inborn quality or characteristic; a striking or special talent or aptitude; ... entry example: the gift of wit; a gift for speaking.

Table 4: Example GCIDE entries for the word gift.
Example and Distractor Generation. To generate gold questions and answers, we use the same generation templates for definitions exemplified in Table 2 for WordNetQA. To generate distractors, we simply take alternative definitions for the target words that represent a different word sense (e.g., the alternative definitions of gift shown in Table 4), as well as randomly chosen definitions if needed to create a 5-way multiple-choice question. As above, we reserve a maximum of 3k examples for training. Since we have only 9k examples in total in this dataset (see WordSense in Table 3), we also reserve 3k each for development and testing.
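A minimal sketch of this distractor strategy, under an assumed (hypothetical) encoding of GCIDE senses as a word-to-definitions dictionary:

```python
# Sketch of DictionaryQA distractor selection: prefer alternative senses
# of the target word, then pad with random other definitions to reach
# N-1 = 4 distractors. Data and helper names are illustrative.
import random

def make_distractors(word, gold_def, senses, all_defs, n=4, rng=None):
    """Return up to n distractor definitions for (word, gold_def)."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    distractors = [d for d in senses.get(word, []) if d != gold_def][:n]
    extras = [d for d in all_defs if d != gold_def and d not in distractors]
    while len(distractors) < n and extras:
        distractors.append(extras.pop(rng.randrange(len(extras))))
    return distractors

senses = {"gift": ["a present", "a bribe", "a special talent"]}
all_defs = senses["gift"] + [
    "an emergency landing",
    "a worldwide increase in temperature",
]
print(make_distractors("gift", "a present", senses, all_defs))
```

Note that this naive sampler reproduces the pitfall discussed below: if padding definitions are drawn only from entries without example sentences, the choice set itself leaks a bias.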
We note that initial attempts to build this dataset through standard random splitting gave rise to certain systematic biases that were exploited by the choice-only baseline models described in the next section, and hence inflated overall model scores.
After several efforts at filtering we found that, among other factors, using definitions from entries without example sentences as distractors (e.g., the first two entries in Table 4) had a surprising correlation with such biases. This suggests that possible biases involving differences between dictionary entries with and without examples can taint the resulting automatically generated MCQA dataset (for more discussion on the pitfalls involved with automatic dataset construction, see Section 5).

Probing Methodology and Modeling
Given the probes above, we can now start to answer the empirical questions posed at the beginning. Our main focus is on looking at transformer-based MCQA models trained in the science domain (using the benchmarks shown in Table 5). In this section, we provide details of MCQA and the target models, as well as several baselines that we use to sanity check our new datasets. To evaluate model competence, we look at a combination of model performance after science pre-training and after additional model fine-tuning using the lossless inoculation strategy of Richardson et al. (2020) (Section 4.2). In Section 4.3, we also discuss a cluster-level accuracy metric for measuring performance over semantic clusters.

Task Definition and Modeling
Given a dataset D = {(q^(j), {a^(j)_1, ..., a^(j)_N})}_j consisting of pairs of question stems q and answer choices a_i, the goal is to find the correct answer a_{i*} that correctly answers each q. Throughout this paper, we look at 5-way multiple-choice problems (i.e., where N = 5).
Question+Answer Encoder. To model this, our investigation centers around the use of the transformer-based (Vaswani et al., 2017) BERT encoder and fine-tuning approach of Devlin et al. (2019) (see also Radford et al. (2018)). For each question and individual answer pair (q^(j), a_i), we assume the following rendering of this input:

q^(j)_{a_i} := [CLS] q^(j) [SEP] a^(j)_i [SEP]

which is run through the pre-trained BERT encoder to generate a representation for q^(j)_{a_i} using the hidden state representation for CLS (i.e., the classifier token):

c^(j)_i = BERT(q^(j)_{a_i})

The probability of a given answer is then computed as p^(j)_i ∝ e^{v · c^(j)_i}, which uses an additional set of classification parameters v ∈ R^H that are optimized (along with the full transformer network) by taking the final loss to be the negative log-likelihood of each correct answer p^(j)_{i*} over all answer choices. We specifically use BERT-large uncased with whole-word masking, as well as the RoBERTa-large model from Liu et al. (2019b), which is a more robustly trained version of the original BERT model. Our system uses the implementations provided in AllenNLP and Huggingface (Wolf et al., 2019).
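The scoring and loss computation can be sketched in isolation, assuming the encoder has already produced one [CLS] vector per answer choice (the vectors and classification parameters below are made-up numbers, not trained values):

```python
# Sketch of multiple-choice scoring: softmax over v . c_i, one score per
# answer choice, with the loss as the NLL of the gold choice.
import math

def answer_probs(cls_vectors, v):
    """Softmax over dot-product scores v . c_i (numerically stabilized)."""
    scores = [sum(vk * ck for vk, ck in zip(v, c)) for c in cls_vectors]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def mc_loss(cls_vectors, v, gold_index):
    """Negative log-likelihood of the correct answer choice."""
    return -math.log(answer_probs(cls_vectors, v)[gold_index])

# One toy [CLS] vector per answer choice, plus classification parameters v.
cls_vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
v = [2.0, -1.0]
probs = answer_probs(cls_vectors, v)
print(probs, mc_loss(cls_vectors, v, gold_index=0))
```

In the actual models the `cls_vectors` come from the shared BERT/RoBERTa encoder and v is learned jointly with the full network; only the final scoring step is reproduced here.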
Baselines and Sanity Checks. When creating synthetic datasets, it is important to ensure that systematic biases, or annotation artifacts (Gururangan et al., 2018), are not introduced into the resulting probes and that the target datasets are sufficiently challenging (or good, in the sense of Hewitt and Liang (2019)). To test for this, we use several of the MCQA baseline models first introduced in Mihaylov et al. (2018), which take inspiration from the LSTM-based models used in Conneau et al. (2017) for NLI, as well as various partial-input baselines based on these models. Given an input sequence s, these models first compute token-level contextual states

h_s = BiLSTM(EMBED(s)) ∈ R^{|s| × 2h}

(where h is the dimension of the hidden state in each directional network, and EMBED(·) is an embedding function that assigns token-level embeddings to each token in s). A contextual representation r_s for each s is then built by applying an element-wise max operation over h_s:

r_s = max(h_s) ∈ R^{2h}

With these contextual representations, different baseline models can be constructed. For example, a Choice-Only model, which is a variant of the well-known hypothesis-only baseline used in NLI (Poliak et al., 2018b), scores each choice c_i independently of the question as α^(j)_i = W^T r_{c_i} for W ∈ R^{2h}, and assigns a probability to each answer p^(j)_i ∝ e^{α^(j)_i}. A slight variant of this model, the Choice-to-choice model, tries to single out a given answer choice relative to the other choices by scoring all choice pairs α^(j)_{i,i′} = ATT(r_{c_i}, r_{c_{i′}}) ∈ R using a learned attention mechanism ATT and finding the choice with the minimal similarity to the other options (for full details, see their original paper). In using these partial-input baselines, which we train directly on each target probe, we can check whether systematic biases related to answer choices were introduced into the data creation process.
A Question-to-choice model, in contrast, uses the contextual representations for each question and individual choice together with an attention model ATT to get a score α^(j)_i = ATT(r_{q^(j)}, r_{c_i}). Here we also experiment with using ESIM (Chen et al., 2017) to generate the contextual representations r, as well as a simpler VecSimilarity model that measures the average vector similarity between question and answer tokens: α^(j)_i = SIM(EMBED(q^(j)), EMBED(c_i)). In contrast to the models above, these sets of baselines are used to check for artifacts between questions and answers that are not captured in the partial-input baselines (see discussion in Feng et al. (2019)) and to ensure that the overall MCQA tasks are sufficiently difficult for our transformer models.

Inoculation and Pre-training
Using the various models introduced above, we train these models on benchmark tasks in the science domain and look at model performance on our probes with and without additional training on samples of probe data, building on the idea of inoculation from Liu et al. (2019a). Model inoculation is the idea of continuing to train models on new challenge tasks (in our case, separately for each probe) using only a small number of examples. Unlike in ordinary fine-tuning, the goal is not to learn an entirely re-purposed model, but to improve on (or vaccinate against) particular phenomena (e.g., our synthetic probes) that potentially deviate from a model's original training distribution (but that nonetheless might involve knowledge already contained in the model).
In the variant proposed in Richardson et al. (2020), for each pre-trained (science) model and architecture M_a we continue training the model on k new probe examples (with a maximum of k = 3k examples) under a set of different hyper-parameter configurations j ∈ {1, ..., J} and identify, for each k, the model M*_{a,k} with the best aggregate performance S on the original (orig) and new task:

M*_{a,k} = argmax_{j ∈ {1,...,J}} S(M^(j)_{a,k}),

where S averages performance over the original task and the new probe. As in Richardson et al. (2020), we found all models to be especially sensitive to different learning rates, and performed comprehensive hyper-parameter searches that also manipulate the number of iterations and random seeds used.
Using this methodology, we can see how much exposure to new data it takes for a given model to master a new task, and whether there are phenomena that stress particular models (e.g., lead to catastrophic forgetting of the original task). Given the restrictions on the number of fine-tuning examples, our assumption is that when models are able to maintain good performance on their original task during inoculation, the quickness with which they are able to learn the inoculated task provides evidence of prior competence, which is precisely what we aim to probe. To quantify the loss on the original task, we define a model's inoculation cost as the difference in the performance of this model on its original task before and after inoculation.
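The selection and cost bookkeeping can be sketched as follows; the scores are made-up numbers, and the aggregate (a simple average of old-task and probe accuracy) is our reading of the selection criterion, not the authors' exact code:

```python
# Sketch of inoculation model selection: for a given k, keep the
# hyper-parameter configuration with the best aggregate score over the
# original task and the new probe, then report the inoculation cost.

def best_config(runs):
    """runs: list of (orig_score, probe_score) per hyper-parameter config.
    Aggregate = average of old-task and new-task performance."""
    return max(runs, key=lambda r: (r[0] + r[1]) / 2.0)

def inoculation_cost(orig_before, orig_after):
    """Drop in original-task performance caused by inoculation."""
    return orig_before - orig_after

# Three hypothetical configs for one (model, k) pair.
runs = [(0.70, 0.90), (0.78, 0.85), (0.60, 0.95)]
orig_score, probe_score = best_config(runs)
print(orig_score, probe_score)
print(inoculation_cost(0.80, orig_score))
```

Note how the aggregate criterion rejects the third config despite its strong probe score, because it forgets too much of the original task.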
We pre-train on an aggregated training set of the benchmark science exams detailed in Table 5, and created an aggregate development set of around 4k science questions for evaluating overall science performance and inoculation costs. To handle the mismatch between the number of answer choices in these sets, we made all sets 5-way by adding empty answers as needed. We also experimented with a slight variant of inoculation, called add-some inoculation, which involves balancing the inoculation training sets with naturalistic science questions. We reserve the MCQL dataset in Table 5 for this purpose, and experiment with balancing each probe example with a science example (x1 matching) and adding twice as many science questions (x2 matching, up to 3k total) for each new probe example.

Evaluating Model Competence
The standard way to evaluate our MCQA models is by looking at the overall accuracy of the correct answer prediction, or what we call instance-level accuracy (as in Table 6). Given the nature of our data and the existence of semantic clusters as detailed in Section 3.1.1 (i.e., sets of questions and answers under different distractor choices and inference complexity), we also measure a model's cluster-level (or strict cluster) accuracy, which requires correctly answering all questions in a cluster. Example semantic clusters are shown in Table 7; in the first case, there are 6 ISA↑ questions (including perturbations) about the concept trouser.n.01 (e.g., involving knowing that trousers are a type of consumer good and garment/clothing), all of which a model must answer correctly in order to receive full credit.
Our cluster-based analysis is motivated by the idea that if a model truly knows the meaning of a given concept, such as the concept of trousers, then it should be able to answer arbitrary questions about this concept without sensitivity to varied distractors. While our strict cluster metric is simplistic, it takes inspiration from work on visual QA (Shah et al., 2019), and allows us to evaluate how consistent and robust models are across our different probes, and to get insight into whether errors are concentrated on a small set of concepts or widespread across clusters.
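Strict cluster accuracy is easy to state precisely in code; the cluster ids and per-question predictions below are illustrative:

```python
# Strict cluster-level accuracy: a cluster counts as solved only if
# every question in it is answered correctly.
from collections import defaultdict

def cluster_accuracy(predictions):
    """predictions: list of (cluster_id, is_correct) per question."""
    by_cluster = defaultdict(list)
    for cid, correct in predictions:
        by_cluster[cid].append(correct)
    solved = sum(1 for answers in by_cluster.values() if all(answers))
    return solved / len(by_cluster)

# Two toy clusters: one fully solved, one with a single error.
preds = [("trouser.n.01", True), ("trouser.n.01", False),
         ("poet_laureate.n.01", True), ("poet_laureate.n.01", True)]
print(cluster_accuracy(preds))  # 0.5: only one of two clusters solved
```

Note that instance-level accuracy on the same toy predictions would be 3/4, illustrating how the strict metric penalizes inconsistent knowledge of a concept.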

Results and Findings
In this section, we provide the results of the empirical questions first introduced in Figure 1, starting with the results of our baseline models.
Are our Probes Sufficiently Challenging? As shown in Table 6, most of our partial-input baselines (i.e., Choice-Only and Choice-to-Choice models) failed to perform well on our dataset probes across a wide range of models, showing that such probes are generally immune from biases relating to how distractors were generated. As already discussed in Section 3.2, however, initial versions of the DictionaryQA dataset had unforeseen biases partly related to whether distractors were sampled from entries without example sentences, which resulted in high Choice-Only-GloVe scores of around 56% accuracy before a filtering step was applied to remove these distractors. We had similar issues with the hypernymy probe which, even after a filtering step that used our Choice-to-Choice-GloVe model, still leads to high results on the BERT and RoBERTa choice-only models. Given that several attempts were made to entirely de-duplicate the different splits (both in terms of gold answers and distractor types), the source of these biases is not at all obvious, which shows how easy it is for unintended biases in expert knowledge to appear in the resulting datasets and the importance of having rigorous baselines. We also note the large gap in some cases between the BERT and RoBERTa versus GloVe choice-only models, which highlights the need for having partial-input baselines that use the best available models.
Using a more conventional set of Task-Specific QA models (i.e., the LSTM-based Question-to-Choice models trained directly on the probes), we can see that results are not particularly strong on any of the datasets, suggesting that our probes are indeed sufficiently challenging and largely immune from overt artifacts. The poor performance of the VecSimilarity (which uses pretrained Word2Vec embeddings without additional training) provides additional evidence that elementary lexical matching strategies are insufficient for solving any of the probing tasks.
How well do pre-trained MCQA models do? Science models that use non-transformer based encoders, such as the ESIM model with GloVe and ELMO, perform poorly across all probes, in many cases scoring near random chance, showing limits to how well they generalize from science to other tasks even with pre-trained GloVe and ELMO embeddings. In sharp contrast, the transformer models have mixed results, the most striking result being the RoBERTa models on the definitions and synonymy probes (achieving a test accuracy of 77% and 61%, respectively), which outperform several of the task-specific LSTM models trained directly on the probes. At first glance, this suggests that RoBERTa, which generally far outpaces even BERT across most probes, has high competence of definitions and synonyms even without explicit training on our new tasks.
Given the controlled nature of our probes, we can get a more detailed view of how well the science models perform across different reasoning and distractor types, as shown in the first column of Figure 3 for ESIM and RoBERTa. The ESIM science model without training has uniformly poor performance across all categories, whereas the performance of RoBERTa is more varied. Across all datasets and numbers of hops (i.e., the rows in the heat maps), RoBERTa's performance is consistently highest on examples with random distractors (i.e., the first column) and lowest on examples with distractors that are closest in WordNet space (e.g., sister and ISA, or up/down, distractors at distance k = 1). This is not surprising, given that random distractors are likely the easiest category (and distractors close in WordNet space the hardest), but it suggests that RoBERTa might only be getting the easiest cases correct.

[Figure 3 panels: ESIM+GloVe-Science and RoBERTa-Science, each with no training, 100 inoculation examples, and 3,000 inoculation examples.]
Model performance also clearly degrades for hypernymy and hyponymy across all models as the number of hops k increases (see red dashed boxes). For example, accuracy on problems that involve hyponym reasoning with sister distractors at distance k = 1 (i.e., the second column) degrades from 47% to 15% as the number of hops increases from 1 to 4. This general tendency persists even after additional fine-tuning, as we discuss next, and gives evidence that models are limited in their capacity for certain types of multi-hop inferences.
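The hop parameter k above is easy to make explicit at probe-generation time. The following is an illustrative sketch, using a tiny hand-made ISA graph rather than the actual WordNet pipeline, of how the hop count of a gold answer (and the hop distance of a distractor) can be computed so that probe difficulty is controlled systematically; the concept names are hypothetical examples.

```python
# Toy ISA taxonomy (child -> parent). The paper's probes are generated from
# WordNet, where each synset has hypernym chains; a small dict suffices here.
ISA = {
    "poet_laureate": "poet", "poet": "writer", "writer": "communicator",
    "communicator": "person", "novelist": "writer", "chemist": "person",
}

def hypernym_at_hop(concept, k):
    """Follow the ISA chain exactly k steps up; None if the chain is shorter.
    A k-hop probe question uses this node as its gold answer."""
    node = concept
    for _ in range(k):
        node = ISA.get(node)
        if node is None:
            return None
    return node

def hop_distance(a, b):
    """Number of ISA steps from a up to b, or None if b is not an ancestor.
    Useful for labeling how 'close' a distractor sits to the query concept."""
    node, steps = a, 0
    while node is not None:
        if node == b:
            return steps
        node = ISA.get(node)
        steps += 1
    return None
```

Binning generated questions by these hop values is what makes the row-wise degradation analysis in Figure 3 possible in the first place.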
As discussed by Petroni et al. (2019), the choice of generation templates can have a significant effect on model performance. The results so far should therefore be regarded as a lower bound on model competence. It is possible, for example, that model performance is high for definitions because the associated templates best align with the science training distribution (about which we know little). For this reason, the subsequent inoculation step is important: it gives the model an opportunity to learn about our target templates and to couple this learned knowledge with its general knowledge acquired during pre-training and science training (which is, again, what we aim to probe).

[Table 7 excerpt: an ISA cluster centered on poet_laureate.n.01 (gloss: a poet who is ... holding an honorary position ...), with questions such as "Given the fragment he is the poet laureate of Arkansas, poet laureate ... is best described as a type of ..." over candidates poet.n.01, communicator.n.01, and writer.n.01, alongside per-model cluster scores such as 1/6 and 6/6.]

Can Models Be Effectively Inoculated? Model performance after additional fine-tuning, or inoculation, is shown in the last 3 rows of Table 6, with learning curves for a selection of probes and models shown in Figure 4. In the former case, the reported performance represents the model (and inoculation amount) with the highest aggregate performance over the old task and the new probe. Here we again see that the transformer-based models outperform the non-transformer models, and that better models correlate with lower inoculation costs. For example, when inoculating on synonymy, the cost for ESIM is around 7% reduced accuracy on its original task, as opposed to < 1% for BERT and around 1% for RoBERTa. This shows the high capacity of transformer models to absorb new tasks at minimal cost, as also observed in Richardson et al. (2020) for NLI.
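The selection criterion mentioned above (the model and inoculation amount with the highest aggregate performance over the old task and the new probe) can be sketched in a few lines. This is a hedged reconstruction: the paper does not spell out the aggregation function here, so a simple average is assumed for illustration.

```python
def select_inoculation(results, agg=lambda old, new: (old + new) / 2):
    """Pick the inoculation sample size whose fine-tuned model maximizes
    an aggregate of old-task and probe accuracy.

    results: maps sample size -> (old_task_accuracy, probe_accuracy).
    Using an aggregate (rather than probe accuracy alone) ensures that
    probe gains are not bought with large losses on the original task.
    """
    return max(results, key=lambda n: agg(*results[n]))
```

With this criterion, a model that gains a lot on the probe while collapsing on science is penalized, which is exactly the trade-off the inoculation curves in Figure 4 visualize.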
As shown in Figure 4, transformer models tend to learn most tasks fairly quickly while keeping constant scores on their original tasks (i.e., the flat dashed lines observed in plots 1-4), which gives evidence of high competence. In both cases, add-some inoculation proves to be a cheap and easy way to 1) improve scores on the probing tasks (i.e., the solid black and blue lines in plot 1) and 2) minimize loss on science (e.g., the blue and black dashed lines in plots 2-4). The opposite is the case for ESIM (plots 5-6): models are generally unable to learn individual probes without degrading on their original task, and adding more science data during inoculation confuses models on both tasks.

As shown in Figure 3, RoBERTa is able to significantly improve performance across most categories even after inoculation with a mere 100 examples (the middle plot), which again provides strong evidence of prior competence. As an example, RoBERTa improves on 2-hop hyponymy inference with random distractors by 18% (from 59% to 77%). After 3k examples, the model has high performance on virtually all categories (the same score increases from 59% to 87%); however, results still tend to degrade as a function of hop and distractor complexity, as discussed above.
Despite the high performance of our transformer models after inoculation, performance on most probes (with the exception of Definitions) averages around 80% for our best models. This suggests that there is still considerable room for improvement, especially for synonymy and word sense, a topic we discuss further in Section 6.
Are Models Consistent across Clusters? Table 8 shows cluster-level accuracies for the different WordNetQA probes. As with performance across the different inference/distractor categories, these results are mixed. For some probes, such as definitions, our best models appear to be rather robust; e.g., our RoBERTa model has a cluster accuracy of 75%, meaning that it can answer all questions perfectly for 75% of the target concepts and that errors are concentrated on a small minority (25%) of concepts. On synonymy and hypernymy, both BERT and RoBERTa appear robust on the majority of concepts, showing that errors are similarly concentrated. In contrast, our best model on hyponymy has a cluster accuracy of only 36%, meaning that its errors are spread across many concepts, suggesting less robustness. Table 7 shows a selection of semantic clusters involving ISA reasoning, as well as model performance over different answers (shown symbolically) and perturbations. For example, in the second case, the cluster is based around the concept/synset oppose.v.06 and involves 4 inferences and a total of 24 questions (i.e., inferences with perturbations). Our weakest model, ESIM, answers only 5 of the 24 questions correctly, whereas RoBERTa gets 21/24. In the other cases, RoBERTa gets all clusters correct, whereas BERT and ESIM get none of them correct.
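The cluster accuracy described above (a concept counts as solved only if every question about it is answered correctly) can be computed directly from per-question results. The following is a minimal sketch; the `(cluster_id, is_correct)` pair format is an assumed representation, not the paper's actual data format.

```python
from collections import defaultdict

def cluster_accuracy(predictions):
    """Fraction of clusters in which *all* questions are answered correctly.

    predictions: iterable of (cluster_id, is_correct) pairs, one per question,
    where a cluster groups all inferences and perturbations about one concept.
    """
    solved = defaultdict(lambda: True)
    for cid, ok in predictions:
        solved[cid] = solved[cid] and ok
    return sum(solved.values()) / len(solved)
```

Because a single error fails an entire cluster, this metric is strictly harsher than instance-level accuracy, and, as noted above, it penalizes probes with large clusters (such as hyponymy) more heavily.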
We emphasize that these results provide only a crude look into model consistency and robustness. Recalling the details in Table 3, probes differ in the average size of their clusters; hyponymy, by virtue of having many more questions per cluster, might simply be a much more difficult dataset. In addition, such a strict evaluation does not take into account potential errors within clusters, an important issue that we discuss in the next section. We leave addressing such issues, and devising more insightful cluster-based metrics, for future work.

Discussion and Conclusion
We presented several new challenge datasets and a novel methodology for automatically building such datasets from knowledge graphs and taxonomies. We used these to probe state-of-the-art open-domain QA models (centering on models based on variants of BERT). While our general methodology is amenable to any target knowledge resource or QA model/domain, we focus on probing definitions and ISA knowledge using open-source dictionaries and MCQA models trained in the science domain.
We find, consistent with recent probing studies (Petroni et al., 2019), that transformer-based models have a remarkable ability to answer questions that involve complex forms of relational knowledge, both with and without explicit exposure to our new target tasks. In the latter case, a newer RoBERTa model trained only on benchmark science tasks is able to outperform several task-specific LSTM-based models trained directly on our probing data. When re-trained on small samples (e.g., 100 examples) of probing data using variations of the lossless inoculation strategy from Richardson et al. (2020), RoBERTa is able to master many aspects of our probes with virtually no performance loss on its original QA task.
These positive results suggest that transformer-based models, especially models additionally fine-tuned on small samples of synthetic data, can be used in place of task-specific models for querying relational knowledge, as has already been done for targeted tasks such as word-sense disambiguation (Huang et al., 2019). Since models seem to already contain considerable amounts of relational knowledge, our simple inoculation strategy, which tries to nudge models to bring out this knowledge explicitly, could serve as a cheaper alternative to recent attempts to build architectures that explicitly incorporate structured knowledge (Peters et al., 2019). We see many areas where our inoculation strategy could be improved for such purposes, including more complex loss functions that manage old and new information, as well as techniques that take network plasticity into account (Paik et al., 2019).
The main appeal of using automatically generated datasets is the ability to systematically manipulate and control the complexity of target questions, which allows for more controlled experimentation and new forms of evaluation. Despite the positive results described above, analyses that look directly at the effect of different types of distractors and the complexity of reasoning show that our best models, even after additional fine-tuning, struggle with certain categories of hard distractors and multi-hop inferences. For some probes, our cluster-based analysis also reveals that errors are widespread across concept clusters, suggesting that models are not always consistent and robust. These results, taken together with our findings about the vulnerability of synthetic datasets to systematic biases, suggest that there is much room for improvement and that the positive results should be taken with a grain of salt. Developing better ways of evaluating semantic clusters and model robustness would be a step in this direction.
We emphasize that using synthetic versus naturalistic QA data comes with important trade-offs. While we are able to generate large amounts of systematically controlled data at virtually no cost and with no need for manual annotation, it is much harder to validate the quality of such data at such a scale and across such varying levels of complexity. Conversely, with benchmark QA datasets, it is much harder to perform the kinds of careful manipulations and cluster-based analyses we report here. While we assume that the expert knowledge we employ, by virtue of being hand-curated by human experts, is generally correct, we know that such resources are fallible and error-prone. Initial crowd-sourcing experiments aimed at validating samples of our data show high agreement across probes, and that human scores correlate with model trends across the probe categories. More details of these studies are left for future work.