Abstract
Recent efforts to create challenge benchmarks that test the abilities of natural language understanding models have largely depended on human annotations. In this work, we introduce the “Break, Perturb, Build” (BPB) framework for automatic reasoning-oriented perturbation of question-answer pairs. BPB represents a question by decomposing it into the reasoning steps that are required to answer it, symbolically perturbs the decomposition, and then generates new question-answer pairs. We demonstrate the effectiveness of BPB by creating evaluation sets for three reading comprehension (RC) benchmarks, generating thousands of high-quality examples without human intervention. We evaluate a range of RC models on our evaluation sets, which reveals large performance gaps on generated examples compared to the original data. Moreover, symbolic perturbations enable fine-grained analysis of the strengths and limitations of models. Last, augmenting the training data with examples generated by BPB helps close the performance gaps, without any drop on the original data distribution.
1 Introduction
Evaluating natural language understanding (NLU) systems has become a fickle enterprise. While models outperform humans on standard benchmarks, they perform poorly on a multitude of distribution shifts (Jia and Liang, 2017; Naik et al., 2018; McCoy et al., 2019, inter alia). To expose such gaps, recent work has proposed to evaluate models on contrast sets (Gardner et al., 2020), or counterfactually-augmented data (Kaushik et al., 2020), where minimal but meaningful perturbations are applied to test examples. However, since such examples are manually written, collecting them is expensive, and procuring diverse perturbations is challenging (Joshi and He, 2021).
Recently, methods for automatic generation of contrast sets were proposed. However, current methods are restricted to shallow surface perturbations (Mille et al., 2021; Li et al., 2020), specific reasoning skills (Asai and Hajishirzi, 2020), or rely on expensive annotations (Bitton et al., 2021). Thus, automatic generation of examples that test high-level reasoning abilities of models and their robustness to fine semantic distinctions remains an open challenge.
In this work, we propose the “Break, Perturb, Build” (BPB) framework for automatic generation of reasoning-focused contrast sets for reading comprehension (RC). Changing the high-level semantics of questions and generating question-answer pairs automatically is challenging. First, it requires extracting the reasoning path expressed in a question, in order to manipulate it. Second, it requires the ability to generate grammatical and coherent questions. In Figure 1, for example, transforming Q, which involves number comparison, into Q1, which requires subtraction, leads to dramatic changes in surface form. Third, it requires an automatic method for computing the answer to the perturbed question.
Figure 1: An overview of BPB. Given a context (C), question (Q), and the answer (A) to the question, we generate new examples by (1) parsing the question into its QDMR decomposition, (2) applying semantic perturbations to the decomposition, (3) generating a question for each transformed decomposition, and (4) computing answers/constraints to the new questions.
Our insight is that perturbing question semantics is possible when modifications are applied to a structured meaning representation, rather than to the question itself. Specifically, we represent questions with QDMR (Wolfson et al., 2020), a representation that decomposes a question into a sequence of reasoning steps, which are written in natural language and are easy to manipulate. Relying on a structured representation lets us develop a pipeline for perturbing the reasoning path expressed in RC examples.
Our method (see Figure 1) has four steps. We (1) parse the question into its QDMR decomposition, (2) apply rule-based perturbations to the decomposition, (3) generate new questions from the perturbed decompositions, and (4) compute their answers. In cases where computing the answer is impossible, we compute constraints on the answer, which are also useful for evaluation. For example, for Q4 in Figure 1, even if we cannot extract the years of the described events, we know the answer type of the question (Boolean). Notably, aside from answer generation, all steps depend on the question only, and can be applied to other modalities, such as visual or table question answering (QA).
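For concreteness, the following sketch outlines the four steps as a single procedure; the function names are illustrative placeholders rather than the API of the released codebase, which wraps a QDMR parser, a rule library, a question generation model, and an answer generator.

```python
# Schematic sketch of the BPB pipeline (names are illustrative, not the released API).
def bpb(question, context, answer, qdmr_parser, perturbation_rules,
        question_generator, answer_generator):
    """Map one RC example <question, context, answer> to a set of perturbed examples."""
    decomposition = qdmr_parser(question)                    # step 1: q -> QDMR
    new_examples = []
    for rule in perturbation_rules:                          # step 2: perturb the QDMR
        for perturbed in rule(decomposition, answer):
            new_question = question_generator(perturbed)     # step 3: QDMR -> question
            new_answer, constraint = answer_generator(       # step 4: answer and/or constraint
                perturbed, new_question, context, answer)
            if new_answer is not None or constraint is not None:
                new_examples.append((new_question, context, new_answer, constraint))
    return new_examples
```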
Running BPB on the three RC datasets, DROP (Dua et al., 2019), HotpotQA (Yang et al., 2018), and IIRC (Ferguson et al., 2020), yields thousands of semantically rich examples, covering a majority of the original examples (63.5%, 70.2%, and 45.1%, respectively). Moreover, we validate examples using crowdworkers and find that ≥85% of generated examples are correct.
We demonstrate the utility of BPB for comprehensive and fine-grained evaluation of multiple RC models. First, we show that leading models, such as UnifiedQA (Khashabi et al., 2020b) and TASE (Segal et al., 2020), struggle on the generated contrast sets, with a decrease of 13-36 F1 points and low consistency (<40). Moreover, analyzing model performance per perturbation type and constraint reveals the strengths and weaknesses of models on various reasoning types. For instance, (a) models with specialized architectures are more brittle compared to general-purpose models trained on multiple datasets, (b) TASE fails to answer intermediate reasoning steps on DROP, (c) UnifiedQA fails completely on questions requiring numerical computations, and (d) models tend to do better when the numerical value of an answer is small. Last, data augmentation with examples generated by BPB closes part of the performance gap, without any decrease on the original datasets.
In summary, we introduce a novel framework for automatic perturbation of complex reasoning questions, and demonstrate its efficacy for generating contrast sets and evaluating models. We expect that imminent improvements in question generation, RC, and QDMR models will further widen the accuracy and applicability of our approach. The generated evaluation sets and codebase are publicly available at https://github.com/mega002/qdmr-based-question-generation.
2 Background
Our goal, given a natural language question q, is to automatically alter its semantics, generating perturbed questions for evaluating RC models. This section provides background on the QDMR representation and the notion of contrast sets.
Question Decomposition Meaning Representation (QDMR).
To manipulate question semantics, we rely on QDMR (Wolfson et al., 2020), a structured meaning representation for questions. The QDMR decomposition d = QDMR(q) is a sequence of reasoning steps s1,…,s|d| required to answer q. Each step si in d is an intermediate question that is phrased in natural language and annotated with a logical operation oi, such as selection (e.g., “When was the Madison Woolen Mill built?”) or comparison (e.g., “Which is highest of #1, #2?”). Example QDMRs are shown in Figure 1 (step 2). QDMR paves a path towards controlling the reasoning path expressed in a question by changing, removing, or adding steps (§3.2).
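For illustration, a QDMR can be represented as a list of operation-annotated steps; the sketch below (with step contents adapted from the examples above, and field names of our choosing) encodes a three-step comparison question.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class QDMRStep:
    text: str       # intermediate question; may reference earlier steps as "#i"
    operation: str  # logical operation, e.g., "selection", "comparison", "arithmetic"

# A number-comparison question decomposed into three steps (illustrative content).
decomposition: List[QDMRStep] = [
    QDMRStep("When was the Madison Woolen Mill built?", "selection"),
    QDMRStep("When was the second mill built?", "selection"),
    QDMRStep("Which is highest of #1, #2?", "comparison"),
]
```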
Contrast Sets.
Gardner et al. (2020) defined the contrast set of an example x with a label y as a set of examples with minimal perturbations to x that typically affect y. Contrast sets evaluate whether a local decision boundary around an example is captured by a model. In this work, given a question-context pair x = ⟨q,c⟩, we semantically perturb the question and generate examples ⟨q̃,c⟩ that modify the original answer a to a new answer ã.
3 BPB: Automatically Generating Semantic Question Perturbations
We now describe the BPB framework. Given an input x = ⟨q,c⟩ of question and context, and the answer a to q given c, we automatically map it to a set of new examples (Figure 1). Our approach uses models for question decomposition, question generation (QG), and RC.
3.1 Question Decomposition
The first step (Figure 1, step 1) is to represent q using a structured decomposition, d = QDMR(q). To this end, we train a text-to-text model that generates d conditioned on q. Specifically, we fine-tune BART (Lewis et al., 2020) on the high-level subset of the Break dataset (Wolfson et al., 2020), which consists of 23.8K ⟨q,d⟩ pairs from three RC datasets, including DROP and HotpotQA.1 Our QDMR parser obtains a 77.3 SARI score on the development set, which is near state-of-the-art on the leaderboard.2
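A minimal sketch of this setup, assuming the Hugging Face transformers library; data loading, linearization of the QDMR steps, and the training loop are omitted.

```python
from transformers import BartForConditionalGeneration, BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

def encode_pair(question: str, decomposition: str):
    """Encode one <q, d> pair: the model reads the question and is trained to
    emit the linearized sequence of QDMR steps."""
    inputs = tokenizer(question, truncation=True, return_tensors="pt")
    labels = tokenizer(decomposition, truncation=True, return_tensors="pt").input_ids
    return inputs, labels

# Training (e.g., with a standard seq2seq trainer) uses the hyperparameters from the
# notes: 10 epochs, learning rate 3e-5 with polynomial decay, batch size 32.
```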
3.2 Decomposition Perturbation
A decomposition d describes the reasoning steps necessary for answering q. By modifying d’s steps, we can control the semantics of the question. We define a “library” of rules for transforming d into a perturbed decomposition d̃, and use it to generate new questions q̃.
BPB provides a general method for creating a wide range of perturbations. In practice, though, deciding which rules to include is coupled with the reasoning abilities expected from our models. For example, there is little point in testing a model on arithmetic operations if it had never seen such examples. Thus, we implement rules based on the reasoning skills required in current RC datasets (Yang et al., 2018; Dua et al., 2019). As future benchmarks and models tackle a wider range of reasoning phenomena, one can expand the rule library.
Table 1 provides examples for all QDMR perturbations, which we describe next:
AppendBool: When the question q returns a numeric value, we transform its QDMR by appending a “yes/no” comparison step. The comparison is against the answer a of question q. As shown in Table 1, the appended step compares the previous step result (“#3”) to a constant (“is higher than 2”). AppendBool perturbations are generated for 5 comparison operators (>, <, ≤, ≥, ≠). For the compared values, we sample from a set based on the answer a, e.g., a − k and a + k for k ∈ {1,2,3}.
ChangeLast: Changes the type of the last QDMR step. This perturbation is applied to steps involving operations over two referenced steps. Steps of type arithmetic or comparison have their type changed to arithmetic or Boolean. Table 1 shows a comparison step changed to an arithmetic step, involving subtraction. Below it, an arithmetic step is changed to a yes/no question (Boolean).
ReplaceArith: Given an arithmetic step, involving either subtraction or addition, we transform it by flipping its arithmetic operation.
ReplaceBool: Given a Boolean step, verifying whether two statements are correct, we transform it to verify if neither are correct.
ReplaceComp: A comparison step compares two values and returns the highest or lowest. Given a comparison step, we flip its expression from “highest” to “lowest” and vice versa.
PruneStep: We remove one of the QDMR steps. Following step pruning, we prune all other steps that are no longer referenced. We apply only a single PruneStep per decomposition d. Table 1 displays a decomposition d̃ after its second step has been pruned.
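The sketch below illustrates two of these rules over a decomposition given as a list of step strings; it is a simplified rendering of the rule library (the actual rules also track operation metadata and renumber step references).

```python
from typing import List, Optional

def append_bool(steps: List[str], answer_value: float, k: int = 2) -> List[str]:
    """AppendBool (sketch): append a yes/no step comparing the final numeric result
    to a constant derived from the original answer. Only '>' is shown here;
    the full rule covers >, <, <=, >=, and !=."""
    return steps + [f"if #{len(steps)} is higher than {answer_value - k}"]

def replace_comp(steps: List[str]) -> Optional[List[str]]:
    """ReplaceComp (sketch): flip 'highest' <-> 'lowest' in the comparison step."""
    new_steps = list(steps)
    last = new_steps[-1]
    if "highest" in last:
        new_steps[-1] = last.replace("highest", "lowest")
    elif "lowest" in last:
        new_steps[-1] = last.replace("lowest", "highest")
    else:
        return None  # rule does not apply to this decomposition
    return new_steps
```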
3.3 Question Generation
At this point (Figure 1, step 3), we have parsed q into its decomposition d and altered its steps to produce the perturbed decomposition d̃. The new d̃ expresses a different reasoning process compared to the original q. Next, we generate the perturbed question q̃ corresponding to d̃. To this end, we train a QG model that generates questions conditioned on the input QDMR. Using the same ⟨q,d⟩ pairs used to train the QDMR parser (§3.1), we train a separate BART model for mapping d to q.
An issue with our QG model is that the perturbed decomposition d̃ may be outside the distribution the QG model was trained on; e.g., applying AppendBool to questions from DROP results in yes/no questions that do not occur in the original dataset. This can lead to low-quality questions q̃. To improve our QG model, we use simple heuristics that take ⟨q,d⟩ pairs from Break and generate additional pairs ⟨qaug,daug⟩. Specifically, we define 4 textual patterns associated with the perturbations AppendBool, ReplaceBool, and ReplaceComp. We automatically generate examples ⟨qaug,daug⟩ from ⟨q,d⟩ pairs that match a pattern. An example application of all patterns is in Table 2. For example, in AppendBool, the question qaug is inferred with the pattern “how many …did”. In ReplaceComp, qaug is generated by identifying the superlative in q and fetching its antonym.
Example application of all textual patterns used to generate questions qaug (perturbation type highlighted). Boldface indicates the pattern matched in q and the modified part in qaug. Decompositions d and daug omitted for brevity.

Overall, we generate 4,315 examples and train our QG model on the union of Break and the augmented data. As QG models have been rapidly improving, we expect future QG models will be able to generate high-quality questions for any decomposition without data augmentation.
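As an illustration of these heuristics, the sketch below implements a ReplaceComp-style pattern that flips the superlative in q and the comparison keyword in d; the antonym map is a hand-written example rather than the exact lexicon used in our patterns.

```python
from typing import List, Optional, Tuple

SUPERLATIVE_ANTONYMS = {  # illustrative, hand-written antonym map
    "longest": "shortest", "shortest": "longest",
    "largest": "smallest", "smallest": "largest",
    "most": "least", "least": "most",
}

def augment_replace_comp(question: str,
                         decomposition: List[str]) -> Optional[Tuple[str, List[str]]]:
    """Create an additional <q_aug, d_aug> training pair by flipping the superlative
    in the question and the 'highest'/'lowest' keyword in its comparison step."""
    for word, antonym in SUPERLATIVE_ANTONYMS.items():
        if word in question.lower().split():
            q_aug = question.replace(word, antonym)
            d_aug = [s.replace("highest", "lowest") if "highest" in s
                     else s.replace("lowest", "highest") for s in decomposition]
            return q_aug, d_aug
    return None  # no pattern matched
```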
3.4 Answer Generation
We converted the input question into a set of perturbed questions without using the answer or context. Therefore, this part of BPB can be applied to any question, regardless of the context modality. We now describe an RC-specific component for answer generation that uses the textual context.
To obtain complete RC examples, we must compute answers to the generated questions (Figure 1, step 4). We take a two-step approach: for some questions, we can compute the answer automatically based on the type of applied perturbation; if this fails, we compute the answer by answering each step of the perturbed QDMR d̃.
Answer Generation Methods.
Let ⟨q,c,a⟩ be the original RC example and denote by q̃ the generated question. We use the following per-perturbation rules to generate the new answer ã:
AppendBool: The transformed question q̃ asks whether the answer a and a numeric value v satisfy a comparison condition. As the values of a and v are given (§3.2), we can directly compute whether the answer ã is “yes” or “no”.
ReplaceArith: This perturbation converts an answer that is the sum (difference) of numbers into an answer that is the difference (sum). We can often identify the numbers by looking for numbers x,y in the context c such that a = x ± y, and flipping the operation: ã = x ∓ y. To avoid noise, we discard examples for which there is more than one pair of numbers that results in a, and cases where a < 10, as the computation may involve explicit counting rather than an arithmetic computation.
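A sketch of this heuristic is shown below, assuming the answer and the context numbers are plain numerals; number extraction and exact-match float comparison are simplified relative to the implementation.

```python
import re
from itertools import combinations
from typing import Optional

def replace_arith_answer(context: str, answer: float) -> Optional[float]:
    """ReplaceArith answer generation (sketch): find the unique pair x, y in the
    context with answer = x - y or answer = x + y, and flip the operation."""
    if answer < 10:  # small answers may come from counting, not arithmetic; discard
        return None
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", context)]
    candidates = set()
    for x, y in combinations(numbers, 2):
        if abs(x - y) == answer:
            candidates.add((max(x, y), min(x, y), "-"))
        if x + y == answer:
            candidates.add((max(x, y), min(x, y), "+"))
    if len(candidates) != 1:  # no supporting pair, or ambiguous; discard the example
        return None
    x, y, op = candidates.pop()
    return x + y if op == "-" else x - y  # flipped operation
```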
ReplaceBool: This perturbation turns the verification of whether two statements x,y are true into a verification of whether neither is true. Therefore, if a is “yes” (i.e., both x and y are true), ã must be “no”.
ReplaceComp: This perturbation takes a comparison question q that contains two candidate answers x,y, of which x is the answer a. We parse q with spaCy, identify the two answer candidates x,y, and return as ã the one that is not a.
QDMR Evaluator.
When our heuristics do not apply (e.g., arithmetic computations over more than two numbers, PruneStep, and ChangeLast), we use an RC model and the QDMR structure to directly evaluate each step of d̃ and compute ã. Recall that each QDMR step si is annotated with a logical operation oi (§2). To evaluate d̃, we go over it step by step, and for each step either apply the RC model, for operations that require querying the context (e.g., selection), or directly compute the output, for numerical/set-based operations (e.g., comparison). The answer computed for each step is then used to replace placeholders in subsequent steps. An example is provided in Figure 2.
We discard a generated example when the RC model predicts an answer that does not match the argument type expected by a subsequent step (e.g., when a non-numerical span predicted by the RC model is used as an argument for an arithmetic operation), and when the generated answer has more than 8 words. We also discard operations that, based on manual analysis, often produce noisy answers (e.g., project with a non-numeric answer).
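The sketch below summarizes this procedure, reusing the QDMRStep fields from the earlier sketch; rc_model stands for any single-step RC model callable as rc_model(question, context), and the arithmetic/comparison helpers are toy string-based versions of the operation executors (the discarding rules above are omitted).

```python
import re

def _numbers(text):
    return [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", text)]

def compute_comparison(text):
    """Toy executor for 'which is highest/lowest of X, Y' steps."""
    nums = _numbers(text)
    return max(nums) if "highest" in text else min(nums)

def compute_arithmetic(text):
    """Toy executor for 'sum/difference of X and Y' steps."""
    x, y = _numbers(text)[:2]
    return x + y if "sum" in text else x - y

def evaluate_qdmr(steps, context, rc_model):
    """Answer a perturbed QDMR step by step: query the RC model for steps that need
    the context, compute numeric operations directly, and feed each step's answer
    into later steps that reference it."""
    answers = []
    for step in steps:
        text = step.text
        for i, prev in enumerate(answers, start=1):
            text = text.replace(f"#{i}", str(prev))   # substitute earlier answers
        if step.operation in ("selection", "filter", "project"):
            ans = rc_model(text, context)             # context-dependent step
        elif step.operation == "comparison":
            ans = compute_comparison(text)
        elif step.operation == "arithmetic":
            ans = compute_arithmetic(text)
        else:
            return None                               # unsupported operation; discard
        answers.append(ans)
    return answers[-1]                                # answer of the final step
```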
3.5 Answer Constraint Generation
For some perturbations, even if we fail to generate an answer, it is still possible to derive constraints on it. Such constraints are valuable, as they can indicate cases of model failure. Therefore, in addition to the generated answers ã, we generate four types of answer constraints: Numeric, Boolean, ≥, and ≤.
When changing the last QDMR step to an arithmetic or Boolean operation (Table 1, rows 2-3), the new answer should be Numeric or Boolean, respectively. An example for a Boolean constraint is given in Q4 in Figure 1. When replacing an arithmetic operation (Table 1, row 4), if an answer that is the sum (difference) of two non-negative numbers is changed to the difference (sum) of these numbers, the new answer must not be greater (smaller) than the original answer. For example, the answer to the question perturbed by ReplaceArith in Table 1 (row 4) should satisfy the ≥ constraint.
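A sketch of how these constraints can be checked against a model prediction (string normalization is simplified; the ≥/≤ constraints compare the prediction to the original answer, as described above):

```python
def satisfies_constraint(prediction: str, constraint: str, original_answer=None) -> bool:
    """Check a prediction against the four constraint types: Numeric, Boolean, >=, <=."""
    def as_number(s):
        try:
            return float(str(s).replace(",", ""))
        except ValueError:
            return None

    pred_num = as_number(prediction)
    if constraint == "Numeric":
        return pred_num is not None
    if constraint == "Boolean":
        return str(prediction).strip().lower() in {"yes", "no"}
    if constraint == ">=" and pred_num is not None and original_answer is not None:
        return pred_num >= float(original_answer)
    if constraint == "<=" and pred_num is not None and original_answer is not None:
        return pred_num <= float(original_answer)
    return False
```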
4 Generated Evaluation Sets
We run BPB on the RC datasets DROP (Dua et al., 2019), HotpotQA (Yang et al., 2018), and IIRC (Ferguson et al., 2020). Questions from the training sets of DROP and HotpotQA are included in Break, and were used to train the decomposition and QG models. Results on IIRC show BPB’s generalization to datasets for which we did not observe ⟨q,d⟩ pairs. Statistics on the generated contrast and constraint sets are in Tables 3, 4, and 5.
Table 3: Generation and annotation statistics for the DROP, HotpotQA, and IIRC datasets.
| | DROP | HPQA | IIRC |
|---|---|---|---|
| development set size | 9,536 | 7,405 | 1,301 |
| # of unique generated perturbations | 65,675 | 10,541 | 3,119 |
| # of generated examples | 61,231 | 8,488 | 2,450 |
| # of covered development examples | 6,053 | 5,199 | 587 |
| % of covered development examples | 63.5 | 70.2 | 45.1 |
| Avg. contrast set size | 11.1 | 2.6 | 5.2 |
| Avg. # of perturbations per example | 1.2 | 1 | 1 |
| % of answers generated by the QDMR evaluator | 5.8 | 61.8 | 22.5 |
| # of annotated contrast examples | 1,235 | 1,325 | 559 |
| % of valid annotated examples | 85 | 89 | 90.3 |
Table 4: Per-perturbation statistics for generation and annotation of our datasets. Validation results are in bold for perturbations with at least 40 examples.
| | | DROP | HPQA | IIRC |
|---|---|---|---|---|
| AppendBool | contrast | 56,205 | 2,754 | 1,884 |
| | annotated | 254 | 200 | 198 |
| | % valid | 97.2 | 98 | 98 |
| ChangeLast | contrast | 85 | 408 | 43 |
| | annotated | 69 | 200 | 43 |
| | % valid | 55.1 | 84.5 | 76.7 |
| ReplaceArith | contrast | 390 | – | 1 |
| | annotated | 191 | – | 1 |
| | % valid | 79.6 | – | 0 |
| ReplaceBool | contrast | – | 127 | 1 |
| | annotated | – | 127 | 1 |
| | % valid | – | 97.6 | 100 |
| ReplaceComp | contrast | 1,126 | 362 | 14 |
| | annotated | 245 | 200 | 14 |
| | % valid | 90.2 | 88.5 | 71.4 |
| PruneStep | contrast | 3,425 | 3,777 | 507 |
| | annotated | 476 | 399 | 302 |
| | % valid | 82.4 | 85.8 | 88.4 |
Table 5: Statistics of constraint generation for the DROP, HotpotQA, and IIRC datasets.
| | DROP | HPQA | IIRC |
|---|---|---|---|
| # of constraints | 3,323 | 550 | 56 |
| % of constraints that cover examples without a contrast set | 8.9 | 26 | 21.4 |
| % of covered development examples | 22.5 | 7.4 | 4 |
| Numeric | 2,398 | – | – |
| Boolean | – | 549 | 52 |
| ≥ | 825 | – | 1 |
| ≤ | 100 | 1 | 3 |
Contrast Sets.
Table 3 shows that BPB successfully generates thousands of perturbations for each dataset. For the vast majority of perturbations, answer generation successfully produced a result—for 61K out of 65K in DROP, 8.5K out of 10.5K in HotpotQA, and 2.5K out of 3K in IIRC. Overall, 61K/8.5K examples were created from the development sets of DROP/ HotpotQA, respectively, covering 63.5%/70.2% of the development set. For the held-out dataset IIRC, not used to train the QDMR parser and QG model, BPB created a contrast set of 2.5K examples, which covers almost half of the development set.
Table 4 shows the number of generated examples per perturbation. The distribution over perturbations is skewed, with some perturbations (AppendBool) 100x more frequent than others (ReplaceArith). This is because the original distribution over operations is not uniform and each perturbation operates on different decompositions (e.g., AppendBool can be applied to any question with a numeric answer, while ReplaceComp operates on questions comparing two objects).
Constraint Sets.
Table 5 shows the number of generated answer constraints for each dataset. The constraint set for DROP is the largest, consisting of 3.3K constraints, 8.9% of which cover DROP examples for which we could not generate a contrast set. This is due to examples with arithmetic operations, for which it is easier to generate constraints. The constraint sets of HotpotQA and IIRC contain yes/no questions, for which we use the Boolean constraint.
Estimating Example Quality.
To analyze the quality of generated examples, we sampled 200-500 examples from each perturbation and dataset (unless fewer than 200 examples were generated) and let crowdworkers validate their correctness. We qualify 5 workers, and establish a feedback protocol where we review work and send feedback after every annotation batch (Nangia et al., 2021). Each generated example was validated by three workers, and is considered valid if approved by the majority. Overall, we observe a Fleiss Kappa (Fleiss, 1971) of 0.71, indicating substantial annotator agreement (Landis and Koch, 1977).
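For reference, agreement can be computed as follows (a toy sketch using the statsmodels implementation of Fleiss’ kappa; the ratings array stands in for the real annotation matrix):

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# ratings[i, j] = 1 if worker j marked example i as valid, else 0 (toy data).
ratings = np.array([
    [1, 1, 1],
    [1, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
])
table, _ = aggregate_raters(ratings)  # examples x categories count matrix
print(fleiss_kappa(table))
```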
Results are in Tables 3 and 4. The vast majority of generated examples (≥85%) were marked as valid, showing that BPB produces high-quality examples. Moreover (Table 4), we see variance across perturbations, where some perturbations reach >95% valid examples (AppendBool, ReplaceBool), while others (ChangeLast) have lower validity. Thus, overall quality can be controlled by choosing specific perturbations.
Manual validation of generated contrast sets is cheaper than authoring contrast sets from scratch: The median validation time per example is 31 seconds, roughly an order of magnitude faster than reported in Gardner et al. (2020). Thus, when a very clean evaluation set is needed, BPB can dramatically reduce the cost of manual annotation.
Error Analysis of the QDMR Parser.
To study the impact of errors by the QDMR parser on the quality of generated examples, we (the authors) took the examples annotated by crowdworkers, and analyzed the generated QDMRs for 60 examples per perturbation from each dataset: 30 that were marked as valid by crowdworkers, and 30 that were marked as invalid. Specifically, for each example, we checked whether the generated QDMR faithfully expresses the reasoning path required to answer the question, and compared the quality of QDMRs of valid and invalid examples.
For the examples that were marked as valid, we observed that the accuracy of QDMR structures is high: 89.5%, 92.7%, and 91.1% for DROP, HotpotQA, and IIRC, respectively. This implies that, overall, our QDMR parser generated faithful and accurate representations for the input questions. Moreover, for examples marked as invalid, the QDMR parser accuracy was lower but still relatively high, with 82.0%, 82.9%, and 75.5% valid QDMRs for DROP, HotpotQA, and IIRC, respectively. This suggests that the impact of errors made by the QDMR parser on generated examples is moderate.
5 Experimental Setting
We use the generated contrast and constraint sets to evaluate the performance of strong RC models.
5.1 Models
To evaluate our approach, we examine a suite of models that perform well on current RC benchmarks, and that are diverse in terms of their architecture and the reasoning skills they address:
TASE (Segal et al., 2020): A RoBERTa model (Liu et al., 2019) with 4 specialized output heads for (a) tag-based multi-span extraction, (b) single-span extraction, (c) signed number combinations, and (d) counting (up to 9). TASE obtains near state-of-the-art performance when fine-tuned on DROP.
UnifiedQA (Khashabi et al., 2020b): A text-to-text T5 model (Raffel et al., 2020) that was fine-tuned on multiple QA datasets with different answer formats (yes/no, span, etc.). UnifiedQA has demonstrated high performance on a wide range of QA benchmarks.
Reader (Asai et al., 2020): A BERT-based model (Devlin et al., 2019) for RC with two output heads for answer classification to yes/no/span/no-answer, and span extraction.
We fine-tune two TASE models, one on DROP and another on IIRC, which also requires numerical reasoning. Reader is fine-tuned on HotpotQA, while separate UnifiedQA models are fine-tuned on each of the three datasets. In addition, we evaluate UnifiedQA without fine-tuning, to analyze its generalization to unseen QA distributions. We denote by UnifiedQA the model without fine-tuning, and by UnifiedQAX the UnifiedQA model fine-tuned on dataset X.
We consider a “pure” RC setting, where only the context necessary for answering is given as input. For HotpotQA, we feed the model with the two gold paragraphs (without distractors), and for IIRC we concatenate the input paragraph with the gold evidence pieces from other paragraphs.
Overall, we study 6 model-dataset combinations, with 2 models per dataset. For each model, we perform a hyperparameter search and train 3-4 instances with different random seeds, using the best configuration on the development set.
5.2 Evaluation
We evaluate each model in multiple settings: (a) the original development set; (b) the generated contrast set, denoted by CONT; (c) the subset of CONT marked as valid by crowdworkers, denoted by CONTVal. Notably, CONT and CONTVal have a different distribution over perturbations. To account for this discrepancy, we also evaluate models on a sample from CONT, denoted by CONTRand, where sampling is according to the perturbation distribution in CONTVal. Last, to assess the utility of constraint sets, we enrich the contrast set of each example with its corresponding constraints, denoted by CONT +CONST.
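For clarity, the sketch below shows how consistency can be computed, following the definition of Gardner et al. (2020) as the fraction of contrast sets whose examples are all answered correctly; the example objects and the correctness criterion are placeholders.

```python
def consistency(contrast_sets, predictions, is_correct):
    """Percentage of contrast sets whose examples are all answered correctly.
    `contrast_sets` maps an original example id to its examples, `predictions`
    maps example ids to model outputs, and `is_correct` is the per-example
    criterion (e.g., exact match or an F1 threshold)."""
    n_consistent = sum(
        all(is_correct(predictions[ex.id], ex.answer) for ex in examples)
        for examples in contrast_sets.values()
    )
    return 100.0 * n_consistent / len(contrast_sets)
```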
Because yes/no questions do not exist in DROP, we do not evaluate TASEDROP on AppendBool examples, which have yes/no answers, as we cannot expect the model to answer those correctly.
5.3 Results
Results are presented separately for each model in Tables 6, 7, and 8. Comparing performance on the development sets (DEV F1) to the corresponding contrast sets (CONT F1), we see a substantial decrease in performance on the generated contrast sets across all datasets (e.g., 83.5 → 54.8 for TASEDROP, 82.2 → 49.9 for Reader, and 50.2 → 20.4 for UnifiedQAIIRC). Moreover, model consistency (CONT Cnst.) is considerably lower than the development scores (DEV F1); for example, TASEIIRC obtains a 69.9 F1 score but only 24.3 consistency. This suggests that, overall, the models do not generalize to perturbations in the reasoning path expressed in the original question.
Table 6: Evaluation results of TASE on DROP and IIRC. For each dataset, we compare the model trained on the original and augmented (marked with +) training data.
| | DEV F1 | CONTVal F1 | CONTRand F1 | CONT F1 | CONTVal Cnst. | CONT Cnst. | CONT+CONST Cnst. |
|---|---|---|---|---|---|---|---|
| TASEDROP | 83.5 ± 0.1 | 65.9 ± 1 | 57.3 ± 0.6 | 54.8 ± 0.4 | 55.7 ± 1.1 | 35.7 ± 0.5 | 33.7 ± 0.3 |
| TASEDROP+ | 83.7 ± 1.1 | 75.2 ± 0.5 | 68 ± 1 | 66.5 ± 0.5 | 66.3 ± 0.4 | 48.9 ± 0.6 | 45 ± 0.4 |
| TASEIIRC | 69.9 ± 0.5 | 45 ± 5 | 41.2 ± 3.8 | 33.7 ± 2.2 | 23.7 ± 4.7 | 24.3 ± 5.3 | 24.3 ± 5.3 |
| TASEIIRC+ | 68.8 ± 1.3 | 81.1 ± 4.6 | 78.2 ± 4.9 | 72.4 ± 5.7 | 50.4 ± 3.2 | 48.2 ± 2.5 | 48.2 ± 2.5 |
Table 7: Results of Reader on HotpotQA, when trained on the original and augmented (marked with +) data.
| | DEV F1 | CONTVal F1 | CONTRand F1 | CONT F1 | CONTVal Cnst. | CONT Cnst. | CONT+CONST Cnst. |
|---|---|---|---|---|---|---|---|
| Reader | 82.2 ± 0.2 | 58.1 ± 0.1 | 54.5 ± 0.7 | 49.9 ± 0.4 | 39.6 ± 0.6 | 43.1 ± 0.1 | 43 ± 0.1 |
| Reader+ | 82.7 ± 0.9 | 89.1 ± 0.4 | 86.6 ± 0.6 | 81.9 ± 0.3 | 65.6 ± 0.4 | 56.4 ± 0.4 | 56.3 ± 0.4 |
Table 8: Evaluation results of UnifiedQA on DROP, HotpotQA, and IIRC. We compare UnifiedQA without fine-tuning, and after fine-tuning on the original training data and on the augmented training data (marked with +).
| | DEV F1 | CONTVal F1 | CONTRand F1 | CONT F1 | CONTVal Cnst. | CONT Cnst. | CONT+CONST Cnst. |
|---|---|---|---|---|---|---|---|
| UnifiedQA | 28.2 | 38.1 | 35.1 | 34.9 | 5.3 | 4.4 | 2.2 |
| UnifiedQADROP | 33.9 ± 0.9 | 28.4 ± 0.8 | 26.9 ± 0.5 | 8.1 ± 3.8 | 12.2 ± 1.6 | 5.1 ± 0.7 | 4.4 ± 0.5 |
| UnifiedQADROP+ | 32.9 ± 1.2 | 37.9 ± 1.4 | 35.9 ± 2.5 | 10.5 ± 4.4 | 16.9 ± 0.2 | 9.6 ± 0.2 | 8 ± 0.5 |
| UnifiedQA | 68.7 | 68.2 | 52.9 | 65.2 | 29.8 | 38.4 | 37.6 |
| UnifiedQAHPQA | 74.7 ± 0.2 | 60.3 ± 0.8 | 58.7 ± 0.9 | 61.9 ± 0.7 | 35.6 ± 1.1 | 40.2 ± 0.1 | 39.9 ± 0.1 |
| UnifiedQAHPQA+ | 74.1 ± 0.2 | 60.3 ± 1.9 | 59.2 ± 1.5 | 62.3 ± 2.3 | 36.3 ± 0.7 | 41.6 ± 0.3 | 41.3 ± 0.4 |
| UnifiedQA | 44.5 | 61.1 | 57.2 | 36.5 | 21.6 | 28.1 | 28.1 |
| UnifiedQAIIRC | 50.2 ± 0.7 | 45.1 ± 2.1 | 42.5 ± 2.3 | 20.4 ± 2.9 | 24.9 ± 1.2 | 28.6 ± 0.8 | 28.5 ± 0.8 |
| UnifiedQAIIRC+ | 51.7 ± 0.9 | 62.9 ± 2.9 | 54.5 ± 3.9 | 40.8 ± 5.4 | 30.2 ± 2.7 | 32.1 ± 1.9 | 32.1 ± 1.9 |
Comparing the results on the contrast sets and their validated subsets (CONT vs. CONTVal), performance on CONTVal is better than on CONT (e.g., 58.1 versus 49.9 for Reader). These gaps are due to (a) the distribution mismatch between the two sets, and (b) bad example generation. To isolate the effect of bad example generation, we can compare CONTVal to CONTRand, which have the same distribution over perturbations, but CONTRand is not validated by humans. We see that performance on CONTVal is typically ≤10% higher than on CONTRand (e.g., 58.1 vs. 54.5 for Reader). Given that performance on the original development set is dramatically higher, it seems we can currently use automatically generated contrast sets (without verification) to evaluate robustness to reasoning perturbations.
Last, adding constraints to the generated contrast sets (CONT vs. CONT +CONST) often leads to a decrease in model consistency, most notably on DROP, where there are arithmetic constraints and not only answer type constraints.
For instance, consistency drops from 35.7 to 33.7 for TASE, and from 5.1 to 4.4 for UnifiedQADROP. This shows that the generated constraints expose additional flaws in current models.
5.4 Data Augmentation
Results in §5.3 reveal clear performance gaps in current QA models. A natural solution is to augment the training data with examples from the contrast set distribution, which can be done effortlessly, since BPB is fully automatic.
We run BPB on the training sets of DROP, HotpotQA, and IIRC. As BPB generates many examples, it can shift the original training distribution dramatically. Thus, we limit the number of examples generated by each perturbation with a threshold τ. Specifically, for a training set with n examples, we augment it with τ·n randomly generated examples from each perturbation (if fewer than τ·n examples were generated, we add all of them). We experiment with three values τ ∈ {0.03, 0.05, 0.1}, and choose the trained model with the best F1 on the contrast set.
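A sketch of this augmentation scheme (sampling logic only; model training is unchanged):

```python
import random

def augment_training_set(train_examples, generated_by_perturbation, tau=0.05, seed=0):
    """Add at most tau * n generated examples per perturbation type to a training
    set of n examples; if fewer were generated, add all of them."""
    rng = random.Random(seed)
    budget = int(tau * len(train_examples))
    augmented = list(train_examples)
    for perturbation, examples in generated_by_perturbation.items():
        sample = examples if len(examples) <= budget else rng.sample(examples, budget)
        augmented.extend(sample)
    return augmented
```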
Augmentation results are shown in Tables 6–8. Consistency (CONT and CONTVal) improves dramatically, with only a small change in DEV performance, across all models. We observe an increase in consistency of 13 points for TASEDROP, 24 for TASEIIRC, 13 for Reader, and 1-4 points for the UnifiedQA models. Interestingly, augmentation is less helpful for UnifiedQA than for TASE and Reader. We conjecture that this is because UnifiedQA was trained on examples from multiple QA datasets and is thus less affected by the augmented data.
Improvement on test examples sampled from the augmented training distribution is expected. To test whether the augmented data improves robustness on other distributions, we evaluate TASE+ and UnifiedQADROP+ on the DROP contrast set manually collected by Gardner et al. (2020). We find that training on the augmented training set does not lead to a significant change on the manually collected contrast set (F1 of 60.4 → 61.1 for TASE, and 30 → 29.6 for UnifiedQADROP). This agrees with findings that data augmentation with respect to a phenomenon may not improve generalization to other out-of-distribution examples (Kaushik et al., 2021; Joshi and He, 2021).
6 Performance Analysis
Analysis Across Perturbations.
We compare model performance on the original (ORIG) and generated examples (CONT and CONTVal) across perturbations (Figures 3, 4, 5). Starting with the models that have specialized architectures (TASE and Reader): except for ChangeLast (discussed later), their performance decreases on all perturbations. Specifically, TASE (Figures 3, 5) is brittle to changes in comparison questions (a 10-30 F1 decrease on ReplaceComp) and arithmetic computations (a ∼30 F1 decrease on ReplaceArith). The biggest decrease, of almost 50 points, is on examples generated by PruneStep from DROP (Figure 3), showing that the model struggles to answer intermediate reasoning steps.
Figure 3: Performance on DROP per perturbation: on the generated contrast set (CONT), on the examples from which CONT was generated (ORIG), and on the validated subset of CONT (CONTVal).
Figure 4: Performance on HotpotQA per perturbation: on the generated contrast set (CONT), on the examples from which CONT was generated (ORIG), and on the validated subset of CONT (CONTVal).
Figure 5: Performance on IIRC per perturbation: on the generated contrast set (CONT), on the examples from which CONT was generated (ORIG), and on the validated subset of CONT (CONTVal).
Reader (Figure 4) shows similar trends to TASE, with a dramatic performance decrease of 80-90 points on yes/no questions created by AppendBool and ReplaceBool. Interestingly, Reader obtains high performance on PruneStep examples, as opposed to TASEDROP (Figure 3), which has a similar span extraction head that is required for these examples. This is possibly due to the “train-easy” subset of HotpotQA, which includes single-step selection questions.
Moving to the general-purpose UnifiedQA models, they perform on PruneStep at least as well as on the original examples, showing their ability to answer simple selection questions. They also demonstrate robustness on ReplaceBool. Yet, they struggle on numeric comparison questions and arithmetic calculations: a ∼65-point decrease on ChangeLast on DROP (Figure 3), a 10-30 F1 decrease on ReplaceComp and AppendBool (Figures 3, 4, 5), and almost 0 F1 on ReplaceArith (Figure 3).
Performance on CONT and CONTVal.
Results on CONTVal are generally higher than on CONT, due to noise in example generation. However, whenever results on ORIG are higher than on CONT, they are also higher than on CONTVal, showing that the general trend can be inferred from CONT, given the large performance gap between ORIG and CONT. An exception is ChangeLast in DROP and HotpotQA, where performance on CONT is lower than on ORIG, but performance on CONTVal is higher. This is probably due to noise in generation, especially for DROP, where example validity is at 55.1% (see Table 4).
Evaluation on Answer Constraints.
Evaluating whether the model satisfies answer constraints can help assess the model’s skills. To this end, we measure the fraction of answer constraints satisfied by the predictions of each model (we consider only constraints with more than 50 examples).
Models typically predict the correct answer type; TASEDROP and UnifiedQA predict a number for ≥86% of the generated numeric questions, and Reader and TASEIIRC successfully predict a yes/no answer in ≥92% of the cases. However, fine-tuning UnifiedQA on HotpotQA and IIRC reduces constraint satisfaction (94.7 → 76.3 for UnifiedQAHPQA, 65.4 → 38.9 for UnifiedQAIIRC), possibly because yes/no questions constitute fewer than 10% of the examples (Yang et al., 2018; Ferguson et al., 2020). In addition, results on DROP for the ‘≥’ constraint are considerably lower than for ‘≤’ for both UnifiedQA and UnifiedQADROP, indicating a bias towards predicting small numbers.
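The constraint-satisfaction numbers above are simply the percentage of constraints whose check passes; a sketch, reusing a checker such as the satisfies_constraint example from §3.5:

```python
def constraint_satisfaction(constraints, predictions, satisfies):
    """Percentage of answer constraints satisfied by model predictions.
    `constraints` holds (example_id, constraint_type, original_answer) triples,
    `predictions` maps example ids to model outputs, and `satisfies` is a checker
    such as the satisfies_constraint sketch above."""
    hits = sum(satisfies(predictions[ex_id], ctype, orig)
               for ex_id, ctype, orig in constraints)
    return 100.0 * hits / len(constraints)
```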
7 Related Work
The evaluation crisis in NLU has led to wide interest in challenge sets that evaluate the robustness of models to input perturbations. However, most past approaches (Ribeiro et al., 2020; Gardner et al., 2020; Khashabi et al., 2020a; Kaushik et al., 2020) involve a human-in-the-loop and are thus costly.
Recently, more and more work has considered using meaning representations of language to automatically generate evaluation sets. Past work used an ERG grammar (Li et al., 2020) and AMR (Rakshit and Flanigan, 2021) to generate relatively shallow perturbations. In parallel to this work, Ross et al. (2021) used control codes over SRL to generate more semantic perturbations of declarative sentences. We generate perturbations at the level of the underlying reasoning process, in the context of QA. Last, Bitton et al. (2021) used scene graphs to generate examples for visual QA, but assumed a gold scene graph is given as input. Overall, this body of work represents an exciting new research program, where structured representations are leveraged to test and improve the blind spots of pre-trained language models.
More broadly, interest in automatic creation of evaluation sets that test out-of-distribution generalization has skyrocketed, whether using heuristics (Asai and Hajishirzi, 2020; Wu et al., 2021), data splits (Finegan-Dollak et al., 2018; Keysers et al., 2020), adversarial methods (Alzantot et al., 2018), or an aggregation of the above (Mille et al., 2021; Goel et al., 2021).
Last, QDMR-to-question generation is broadly related to work on text generation from structured data (Nan et al., 2021; Novikova et al., 2017; Shu et al., 2021), and to passage-to-question generation methods (Du et al., 2017; Wang et al., 2020; Duan et al., 2017) that, in contrast to our work, focused on simple questions not requiring reasoning.
8 Discussion
We propose the BPB framework for generating high-quality reasoning-focused question perturbations, and demonstrate its utility for constructing contrast sets and evaluating RC models.
While we focus on RC, our method for perturbing questions is independent of the context modality. Thus, porting our approach to other modalities only requires a method for computing the answer to perturbed questions. Moreover, BPB provides a general-purpose mechanism for question generation, which can be used outside QA as well.
We provide a library of perturbations that is a function of the current abilities of RC models. As future RC models, QDMR parsers, and QG models improve, we can expand this library to support additional semantic phenomena.
Last, we showed that constraint sets are useful for evaluation. Future work can use constraints as a supervision signal, similar to Dua et al. (2021), who leveraged dependencies between training examples to enhance model performance.
Limitations
BPB represents questions with QDMR, which is geared towards representing complex factoid questions that involve multiple reasoning steps. Thus, BPB cannot be used when questions involve a single step, for example, one cannot use BPB to perturb “Where was Barack Obama born?”. Inherently, the effectiveness of our pipeline approach depends on the performance of its modules—the QDMR parser, the QG model, and the single-hop RC model used for QDMR evaluation. However, our results suggest that current models already yield high-quality examples, and model performance is expected to improve over time.
Acknowledgments
We thank Yuxiang Wu, Itay Levy, and Inbar Oren for the helpful feedback and suggestions. This research was supported in part by The Yandex Initiative for Machine Learning, and The European Research Council (ERC) under the European Union Horizons 2020 research and innovation programme (grant ERC DELPHI 802800). This work was completed in partial fulfillment for the Ph.D. degree of Mor Geva.
Notes
We fine-tune BART-large for 10 epochs, using a learning rate of 3e−5 with polynomial decay and a batch size of 32.
We use the same hyperparameters as detailed in §3.1, except the number of epochs, which was set to 15.