Break, Perturb, Build: Automatic Perturbation of Reasoning Paths Through Question Decomposition

Recent efforts to create challenge benchmarks that test the abilities of natural language understanding models have largely depended on human annotations. In this work, we introduce the “Break, Perturb, Build” (BPB) framework for automatic reasoning-oriented perturbation of question-answer pairs. BPB represents a question by decomposing it into the reasoning steps that are required to answer it, symbolically perturbs the decomposition, and then generates new question-answer pairs. We demonstrate the effectiveness of BPB by creating evaluation sets for three reading comprehension (RC) benchmarks, generating thousands of high-quality examples without human intervention. We evaluate a range of RC models on our evaluation sets, which reveals large performance gaps on generated examples compared to the original data. Moreover, symbolic perturbations enable fine-grained analysis of the strengths and limitations of models. Last, augmenting the training data with examples generated by BPB helps close the performance gaps, without any drop on the original data distribution.


Introduction
Evaluating natural language understanding (NLU) systems has become a fickle enterprise. While models outperform humans on standard benchmarks, they perform poorly on a multitude of distribution shifts (Jia and Liang, 2017; Naik et al., 2018; McCoy et al., 2019, inter alia). To expose such gaps, recent work has proposed to evaluate models on contrast sets (Gardner et al., 2020), or counterfactually-augmented data (Kaushik et al., 2020), where minimal but meaningful perturbations are applied to test examples. However, since such examples are manually written, collecting them is expensive, and procuring diverse perturbations is challenging (Joshi and He, 2021).
Recently, methods for automatic generation of contrast sets were proposed. However, current methods are restricted to shallow surface perturbations (Mille et al., 2021), specific reasoning skills, or rely on expensive annotations (Bitton et al., 2021). Thus, automatic generation of examples that test high-level reasoning abilities of models and their robustness to fine semantic distinctions remains an open challenge.
In this work, we propose the ''Break, Perturb, Build'' (BPB) framework for automatic generation of reasoning-focused contrast sets for reading comprehension (RC). Changing the high-level semantics of questions and generating question-answer pairs automatically is challenging. First, it requires extracting the reasoning path expressed in a question, in order to manipulate it. Second, it requires the ability to generate grammatical and coherent questions. In Figure 1, for example, transforming Q, which involves number comparison, into Q1, which requires subtraction, leads to dramatic changes in surface form. Third, it requires an automatic method for computing the answer to the perturbed question.
Our insight is that perturbing question semantics is possible when modifications are applied to a structured meaning representation, rather than to the question itself. Specifically, we represent questions with QDMR (Wolfson et al., 2020), a representation that decomposes a question into a sequence of reasoning steps, which are written in natural language and are easy to manipulate. Relying on a structured representation lets us develop a pipeline for perturbing the reasoning path expressed in RC examples.
Our method (see Figure 1) has four steps. We (1) parse the question into its QDMR decomposition, (2) apply rule-based perturbations to the decomposition, (3) generate new questions from the perturbed decompositions, and (4) compute their answers. In cases where computing the answer is impossible, we compute constraints on the answer, which are also useful for evaluation. For example, for Q4 in Figure 1, even if we cannot extract the years of the described events, we know the answer type of the question (Boolean). Notably, aside from answer generation, all steps depend on the question only, and can be applied to other modalities, such as visual or table question answering (QA).
We demonstrate the utility of BPB for comprehensive and fine-grained evaluation of multiple RC models. First, we show that leading models, such as UNIFIEDQA (Khashabi et al., 2020b) and TASE (Segal et al., 2020), struggle on the generated contrast sets with a decrease of 13-36 F1 points and low consistency (<40). Moreover, analyzing model performance per perturbation type and constraint reveals the strengths and weaknesses of models on various reasoning types. For instance, (a) models with specialized architectures are more brittle compared to general-purpose models trained on multiple datasets, (b) TASE fails to answer intermediate reasoning steps on DROP, (c) UNIFIEDQA fails completely on questions requiring numerical computations, and (d) models tend to do better when the numerical value of an answer is small. Last, data augmentation with examples generated by BPB closes part of the performance gap, without any decrease on the original datasets.
In summary, we introduce a novel framework for automatic perturbation of complex reasoning questions, and demonstrate its efficacy for generating contrast sets and evaluating models. We expect that imminent improvements in question generation, RC, and QDMR models will further improve the accuracy and broaden the applicability of our approach. The generated evaluation sets and codebase are publicly available at https://github.com/mega002/qdmr-based-question-generation.

Background
Our goal, given a natural language question q, is to automatically alter its semantics, generating perturbed questions q̂ for evaluating RC models. This section provides background on the QDMR representation and the notion of contrast sets.
Question Decomposition Meaning Representation (QDMR). To manipulate question semantics, we rely on QDMR (Wolfson et al., 2020), a structured meaning representation for questions. The QDMR decomposition d = QDMR(q) is a sequence of reasoning steps s_1, . . . , s_{|d|} required to answer q. Each step s_i in d is an intermediate question that is phrased in natural language and annotated with a logical operation o_i, such as selection (e.g., ''When was the Madison Woolen Mill built?'') or comparison (e.g., ''Which is highest of #1, #2?''). Example QDMRs are shown in Figure 1 (step 2). QDMR paves the way towards controlling the reasoning expressed in a question by changing, removing, or adding steps (§3.2).
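To make the representation concrete, the following minimal sketch models a QDMR as a list of steps, each holding its intermediate question, its logical operation, and the previous steps it references (the class and field names are illustrative, not those of the BREAK codebase):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class QDMRStep:
    text: str        # intermediate question, phrased in natural language
    operation: str   # e.g., "selection", "comparison", "arithmetic"
    refs: List[int]  # 1-based indices of previous steps referenced as #1, #2, ...

# QDMR for "Which group is smaller for the county according to the census:
# people or households?"
qdmr: List[QDMRStep] = [
    QDMRStep("size of the people group in the county according to the census", "selection", []),
    QDMRStep("size of households group in the county according to the census", "selection", []),
    QDMRStep("which is smaller of #1, #2", "comparison", [1, 2]),
]
```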
Contrast Sets. Gardner et al. (2020) defined the contrast set C(x) of an example x with a label y as a set of examples with minimal perturbations to x that typically affect y. Contrast sets evaluate whether a local decision boundary around an example is captured by a model. In this work, given a question-context pair x = ⟨q, c⟩, we semantically perturb the question and generate examples x̂ = ⟨q̂, c⟩ ∈ C(⟨q, c⟩) that modify the original answer a to â.

BPB: Automatically Generating Semantic Question Perturbations
We now describe the BPB framework. Given an input x = ⟨q, c⟩ of question and context, and the answer a to q given c, we automatically map it to a set of new examples C(x) (Figure 1). Our approach uses models for question decomposition, question generation (QG), and RC.

Question Decomposition
The first step (Figure 1, step 1) is to represent q using a structured decomposition, d = QDMR(q).
To this end, we train a text-to-text model that generates d conditioned on q. Specifically, we fine-tune BART (Lewis et al., 2020) on the high-level subset of the BREAK dataset (Wolfson et al., 2020), which consists of 23.8K ⟨q, d⟩ pairs from three RC datasets, including DROP and HOTPOTQA.
Our QDMR parser obtains a 77.3 SARI score on the development set, which is near state-of-the-art on the leaderboard.
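At inference time, such a parser can be used roughly as follows (a sketch; the checkpoint path is a placeholder for a BART model fine-tuned on the BREAK high-level subset, and the decoding hyperparameters are illustrative):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hypothetical checkpoint: a BART model fine-tuned on BREAK (high-level).
MODEL_NAME = "path/to/bart-qdmr-parser"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

question = "How many years after Madrugada's final concert did Sunday Driver become popular?"
inputs = tokenizer(question, return_tensors="pt")
outputs = model.generate(**inputs, max_length=128, num_beams=4)
decomposition = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Expected output format: steps separated by ";", with "#i" references, e.g.,
# "year of Madrugada's final concert; year when Sunday Driver became popular;
#  the difference of #2 and #1"
print(decomposition)
```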

Decomposition Perturbation
A decomposition d describes the reasoning steps necessary for answering q. By modifying d's steps, we can control the semantics of the question. We define a ''library'' of rules for transforming d → d̂, and use it to generate questions from the perturbed decompositions, d̂ → q̂.
BPB provides a general method for creating a wide range of perturbations. In practice, though, deciding which rules to include is coupled with the reasoning abilities expected from our models. For example, there is little point in testing a model on arithmetic operations if it had never seen such examples. Thus, we implement rules based on the reasoning skills required in current RC datasets (Yang et al., 2018; Dua et al., 2019). As future benchmarks and models tackle a wider range of reasoning phenomena, one can expand the rule library. Table 1 provides examples for all QDMR perturbations, which we describe next (an illustrative code sketch for two of these rules appears after Table 1):
• AppendBool: When the question q returns a numeric value, we transform its QDMR by appending a ''yes/no'' comparison step. The comparison is against the answer a of question q. As shown in Table 1, the appended step compares the previous step result (''#3'') to a constant (''is higher than 2''). AppendBool perturbations are generated for 5 comparison operators (>, <, ≤, ≥, =). For the compared values, we sample from a set based on the answer a: {a + k, a − k, a/k, a × k} for k ∈ {1, 2, 3}.
• ChangeLast: Changes the type of the last QDMR step. This perturbation is applied to steps involving operations over two referenced steps. Steps with type {arithmetic, comparison} have their type changed to either {arithmetic, Boolean}. Table 1 shows a comparison step changed to an arithmetic step, involving subtraction. Below it, an arithmetic step is changed to a yes/no question (Boolean).
• ReplaceArith: Given an arithmetic step, involving either subtraction or addition, we transform it by flipping its arithmetic operation.
• ReplaceBool: Given a Boolean step, verifying whether two statements are correct, we transform it to verify whether neither is correct.
• ReplaceComp: A comparison step compares two values and returns the highest or lowest. Given a comparison step, we flip its expression from ''highest'' to ''lowest'' and vice versa.
• PruneStep: We remove one of the QDMR steps. Following step pruning, we prune all other steps that are no longer referenced. We apply only a single PruneStep per d. Table 1 displays d̂ after its second step has been pruned.

Append Boolean step
  Question: Kadeem Jack is a player in a league that started with how many teams?
  QDMR: (1) league that Kadeem Jack is a player in; (2) teams that #1 started with; (3) number of #2
  Perturbed QDMR: (1) league that Kadeem Jack is a player in; (2) teams that #1 started with; (3) number of #2; (4) if #3 is higher than 2
  Perturbed question: If Kadeem Jack is a player in a league that started with more than two teams?

Change last step (to arithmetic)
  Question: Which gallery was founded first, Hughes-Donahue Gallery or Art Euphoric?

Change last step (to Boolean)
  Question: How many years after Madrugada's final concert did Sunday Driver become popular?
  QDMR: (1) year of Madrugada's final concert; (2) year when Sunday Driver become popular; (3) the difference of #2 and #1

Replace comparison step
  Question: Which group is smaller for the county according to the census: people or households?
  QDMR: (1) size of the people group in the county according to the census; (2) size of households group in the county according to the census; (3) which is smaller of #1, #2
  Perturbed QDMR: (1) size of the people group in the county according to the census; (2) size of households group in the county according to the census; (3) which is highest of #1, #2
  Perturbed question: According to the census, which group in the county from the county is larger: people or households?

Prune step
  Question: How many people comprised the total adult population of Cunter, excluding seniors?
  QDMR: (1) adult population of Cunter; (2) #1 excluding seniors; (3) number of #2
  Perturbed QDMR: (1) adult population of Cunter; (2) number of #2
  Perturbed question: How many adult population does Cunter have?

Table 1: The full list of semantic perturbations in BPB. For each perturbation, we provide an example question and its decomposition. We highlight the altered decomposition steps, along with the generated question.
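The sketch below illustrates how two of these rules can be implemented over a decomposition represented as a list of step strings (illustrative code; the function names and the string-matching logic are simplifying assumptions, not our released implementation):

```python
from typing import List, Optional

def append_bool(steps: List[str], value: float, op: str = "higher than") -> List[str]:
    """AppendBool: append a yes/no step comparing the last step's result to a constant.

    The constant is sampled around the original answer a, e.g., from
    {a + k, a - k, a / k, a * k} for k in {1, 2, 3}.
    """
    return steps + [f"if #{len(steps)} is {op} {value:g}"]

def replace_comp(steps: List[str]) -> Optional[List[str]]:
    """ReplaceComp: flip 'highest' <-> 'lowest' in the final comparison step."""
    last = steps[-1]
    if "highest" in last:
        flipped = last.replace("highest", "lowest")
    elif "lowest" in last:
        flipped = last.replace("lowest", "highest")
    else:
        return None  # rule does not apply to this decomposition
    return steps[:-1] + [flipped]

qdmr = [
    "league that Kadeem Jack is a player in",
    "teams that #1 started with",
    "number of #2",
]
print(append_bool(qdmr, value=2))
# ['league that Kadeem Jack is a player in', 'teams that #1 started with',
#  'number of #2', 'if #3 is higher than 2']
```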

Question Generation
At this point (Figure 1, step 3), we have parsed q to its decomposition d and altered its steps to produce the perturbed decomposition d̂. The new d̂ expresses a different reasoning process compared to the original q. Next, we generate the perturbed question q̂ corresponding to d̂. To this end, we train a QG model that generates questions conditioned on the input QDMR. Using the same ⟨q, d⟩ pairs used to train the QDMR parser (§3.1), we train a separate BART model for mapping d → q.
An issue with our QG model is that the perturbed d̂ may be outside the distribution that the QG model was trained on; e.g., applying AppendBool on questions from DROP results in yes/no questions that do not occur in the original dataset. This can lead to low-quality questions q̂. To improve our QG model, we use simple heuristics to take ⟨q, d⟩ pairs from BREAK and generate additional pairs ⟨q_aug, d_aug⟩. Specifically, we define 4 textual patterns, associated with the perturbations AppendBool, ReplaceBool, and ReplaceComp. We automatically generate examples ⟨q_aug, d_aug⟩ from ⟨q, d⟩ pairs that match a pattern. An example application of all patterns is in Table 2. For example, in AppendBool, the question q_aug is inferred with the pattern ''how many . . . did''. In ReplaceComp, generating q_aug is done by identifying the superlative in q and fetching its antonym.
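For instance, the ReplaceComp pattern can be sketched as follows (the antonym list and the token-level matching are illustrative assumptions; the actual heuristic may differ):

```python
from typing import Optional

# Hand-picked superlative antonyms for illustration; the real pattern set may differ.
SUPERLATIVE_ANTONYMS = {
    "first": "last", "last": "first",
    "highest": "lowest", "lowest": "highest",
    "largest": "smallest", "smallest": "largest",
    "earliest": "latest", "latest": "earliest",
}

def replace_comp_augment(question: str) -> Optional[str]:
    """Create an augmented question by swapping the superlative with its antonym."""
    tokens = question.split()
    for i, tok in enumerate(tokens):
        key = tok.lower().strip("?,.")
        if key in SUPERLATIVE_ANTONYMS:
            tokens[i] = tok.lower().replace(key, SUPERLATIVE_ANTONYMS[key])
            return " ".join(tokens)
    return None

print(replace_comp_augment(
    "Which gallery was founded first, Hughes-Donahue Gallery or Art Euphoric?"
))
# Which gallery was founded last, Hughes-Donahue Gallery or Art Euphoric?
```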
Overall, we generate 4,315 examples and train our QG model on the union of BREAK and the augmented data. As QG models have been rapidly improving, we expect future QG models will be able to generate high-quality questions for any decomposition without data augmentation.

Answer Generation
We converted the input question into a set of perturbed questions without using the answer or context. Therefore, this part of BPB can be applied to any question, regardless of the context modality. We now describe an RC-specific component for answer generation that uses the textual context.
To get complete RC examples, we must compute answers to the generated questions (Figure 1, step 4). We take a two-step approach: for some questions, we can compute the answer automatically based on the type of applied perturbation. If this fails, we compute the answer by answering each step in the perturbed QDMR d̂.
Answer Generation Methods. Let ⟨q, c, a⟩ be the original RC example and denote by q̂ the generated question. We use the following per-perturbation rules to generate the new answer â:
• AppendBool: The transformed q̂ compares whether the answer a and a numeric value v satisfy a comparison condition. As the values of a and v are given (§3.2), we can compute whether the answer is ''yes'' or ''no'' directly.
• ReplaceArith: This perturbation converts an answer that is the sum (difference) of numbers to an answer that is the difference (sum). We can often identify the numbers by looking for numbers x, y in the context c such that a = x ± y, and flipping the operation: â = |x ∓ y|. To avoid noise, we discard examples for which there is more than one pair of numbers that result in a, and cases where a < 10, as the computation may involve explicit counting rather than an arithmetic computation.
• ReplaceBool: This perturbation turns a verification of whether two statements x, y are true into a verification of whether neither is true. Therefore, if a is ''yes'' (i.e., both x, y are true), â must be ''no''.
• ReplaceComp: This perturbation takes a comparison question q that contains two candidate answers x, y, of which x is the answer a. We parse q with spaCy, identify the two answer candidates x, y, and return the one that is not a.

QDMR Evaluator. When our heuristics do not apply (e.g., arithmetic computations over more than two numbers, PruneStep, and ChangeLast), we use an RC model and the QDMR structure to directly evaluate each step of d̂ and compute â. Recall that each QDMR step s_i is annotated with a logical operation o_i (§2). To evaluate d̂, we go over it step by step, and for each step either apply the RC model for operations that require querying the context (e.g., selection), or directly compute the output for numerical/set-based operations (e.g., comparison). The answer computed for each step is then used to replace placeholders in subsequent steps. An example is provided in Figure 2. We discard the generated example when the RC model predicts an answer that does not match the expected argument type of a subsequent step (e.g., when a non-numerical span predicted by the RC model is used as an argument for an arithmetic operation), or when the generated answer has more than 8 words. In addition, based on manual analysis, we discard operations that often produce noisy answers (e.g., project with a non-numeric answer).
For our QDMR evaluator, we fine-tune a ROBERTA-large model with a standard span-extraction output head on SQUAD (Rajpurkar et al., 2016) and BOOLQ (Clark et al., 2019). BOOLQ is included to support yes/no answers.
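The following is a simplified sketch of this step-by-step evaluation (illustrative code, not our released implementation; the data format, the operation handling, and the stub RC model below are assumptions, and the real evaluator supports more operations and type checks):

```python
import re
from typing import Callable, Dict, List

def evaluate_qdmr(steps: List[Dict], context: str,
                  rc_model: Callable[[str, str], str]) -> str:
    """Evaluate a perturbed QDMR step by step and return the final answer.

    Context-dependent steps (e.g., selection) are answered by a single-hop RC
    model; numerical/set-based steps (e.g., arithmetic, comparison) are computed
    directly. The answer of step i replaces the placeholder "#i" in later steps.
    """
    answers: List[str] = []
    for step in steps:
        # Substitute answers of previous steps for "#i" references.
        text = re.sub(r"#(\d+)", lambda m: answers[int(m.group(1)) - 1], step["text"])
        numbers = [float(n) for n in re.findall(r"-?\d+\.?\d*", text)]
        if step["op"] == "arithmetic":
            value = numbers[0] - numbers[1] if "difference" in text else sum(numbers)
            answers.append(str(int(value)) if value == int(value) else str(value))
        elif step["op"] == "comparison":
            # Simplified: returns the extreme value itself; a full evaluator can
            # return the entity associated with the highest/lowest value.
            value = max(numbers) if "highest" in text else min(numbers)
            answers.append(str(int(value)) if value == int(value) else str(value))
        else:
            # Selection and other context-dependent operations: query the RC model.
            answers.append(rc_model(text, context))
    return answers[-1]

def stub_rc(question: str, context: str) -> str:
    # Toy stand-in for the fine-tuned single-hop RC model, with made-up years.
    toy = {"year of Madrugada's final concert": "2005",
           "year when Sunday Driver became popular": "2007"}
    return toy.get(question, "")

steps = [
    {"text": "year of Madrugada's final concert", "op": "selection"},
    {"text": "year when Sunday Driver became popular", "op": "selection"},
    {"text": "the difference of #2 and #1", "op": "arithmetic"},
]
print(evaluate_qdmr(steps, context="...", rc_model=stub_rc))  # -> "2"
```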

Answer Constraint Generation
For some perturbations, even if we fail to generate an answer, it is still possible to derive constraints on the answer. Such constraints are valuable, as they indicate cases of model failure. Therefore, in addition to â, we generate four types of answer constraints: Numeric, Boolean, ≥, ≤. When changing the last QDMR step to an arithmetic or Boolean operation (Table 1, rows 2-3), the new answer should be Numeric or Boolean, respectively. An example of a Boolean constraint is given in Q4 in Figure 1. When replacing an arithmetic operation (Table 1, row 4), if an answer that is the sum (difference) of two non-negative numbers is changed to the difference (sum) of these numbers, the new answer must not be greater (smaller) than the original answer. For example, the answer to the question perturbed by ReplaceArith in Table 1 (row 4) should satisfy the ≥ constraint.
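A minimal sketch of checking a prediction against these constraints is shown below (illustrative; the constraint labels and parsing details are our own assumptions):

```python
from typing import Optional

def satisfies_constraint(prediction: str, constraint: str,
                         original_answer: Optional[float] = None) -> bool:
    """Binary 0-1 check of a predicted answer against an answer constraint.

    Supported constraints: "boolean", "numeric", ">=", "<=" (the last two compare
    the new numeric answer against the original answer).
    """
    pred = prediction.strip().lower()
    if constraint == "boolean":
        return pred in {"yes", "no"}
    try:
        value = float(pred.replace(",", ""))
    except ValueError:
        return False  # a numeric answer was expected
    if constraint == "numeric":
        return True
    if constraint == ">=":
        return value >= original_answer
    if constraint == "<=":
        return value <= original_answer
    raise ValueError(f"unknown constraint: {constraint}")

print(satisfies_constraint("48", ">=", original_answer=39))  # True
```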

Generated Evaluation Sets
We run BPB on the RC datasets DROP (Dua et al., 2019), HOTPOTQA (Yang et al., 2018), and IIRC (Ferguson et al., 2020). Questions from the training sets of DROP and HOTPOTQA are included in BREAK, and were used to train the decomposition and QG models. Results on IIRC show BPB's generalization to datasets for which we did not observe ⟨q, d⟩ pairs. Statistics on the generated contrast and constraint sets are in Tables 3, 4, and 5.
Contrast Sets. Table 4 shows the number of generated examples per perturbation. The distribution over perturbations is skewed, with some perturbations (AppendBool) being 100x more frequent than others (ReplaceArith). This is because the original distribution over operations is not uniform, and each perturbation operates on different decompositions (e.g., AppendBool can be applied to any question with a numeric answer, while ReplaceComp operates on questions comparing two objects). Manual validation of generated contrast sets is cheaper than authoring contrast sets from scratch: the median validation time per example is 31 seconds, roughly an order of magnitude faster than reported by Gardner et al. (2020). Thus, when a very clean evaluation set is needed, BPB can dramatically reduce the cost of manual annotation.

Constraint Sets.
Error Analysis of the QDMR Parser. To study the impact of errors by the QDMR parser on the quality of generated examples, we (the authors) took the examples annotated by crowdworkers, and analyzed the generated QDMRs for 60 examples per perturbation from each dataset: 30 that were marked as valid by crowdworkers, and 30 that were marked as invalid. Specifically, for each example, we checked whether the generated QDMR faithfully expresses the reasoning path required to answer the question, and compared the quality of QDMRs of valid and invalid examples.
For the examples that were marked as valid, we observed that the accuracy of QDMR structures is high: 89.5%, 92.7%, and 91.1% for DROP, HOTPOTQA, and IIRC, respectively. This implies that, overall, our QDMR parser generated faithful and accurate representations for the input questions. Moreover, for examples marked as invalid, the QDMR parser accuracy was lower but still relatively high, with 82.0%, 82.9%, and 75.5% valid QDMRs for DROP, HOTPOTQA, and IIRC, respectively. This suggests that the impact of errors made by the QDMR parser on generated examples is moderate.

Experimental Setting
We use the generated contrast and constraint sets to evaluate the performance of strong RC models.

Models
To evaluate our approach, we examine a suite of models that perform well on current RC benchmarks, and that are diverse in terms of their architecture and the reasoning skills they address:
• TASE (Segal et al., 2020): A ROBERTA model (Liu et al., 2019) with 4 specialized output heads for (a) tag-based multi-span extraction, (b) single-span extraction, (c) signed number combinations, and (d) counting (up to 9). TASE obtains near state-of-the-art performance when fine-tuned on DROP.
• UNIFIEDQA (Khashabi et al., 2020b): A text-to-text T5 model (Raffel et al., 2020) that was fine-tuned on multiple QA datasets with different answer formats (yes/no, span, etc.). UNIFIEDQA has demonstrated high performance on a wide range of QA benchmarks.
• READER: A BERT-based model (Devlin et al., 2019) for RC, with two output heads: one for answer classification (yes/no/span/no-answer) and one for span extraction.
We fine-tune two TASE models, one on DROP and another on IIRC, which also requires numerical reasoning. READER is fine-tuned on HOTPOTQA, while separate UNIFIEDQA models are fine-tuned on each of the three datasets. In addition, we evaluate UNIFIEDQA without fine-tuning, to analyze its generalization to unseen QA distributions. We denote by UNIFIEDQA the model without fine-tuning, and by UNIFIEDQA X the UNIFIEDQA model fine-tuned on dataset X.
We consider a ''pure'' RC setting, where only the context necessary for answering is given as input. For HOTPOTQA, we feed the model with the two gold paragraphs (without distractors), and for IIRC we concatenate the input paragraph with the gold evidence pieces from other paragraphs.
Overall, we study 6 model-dataset combinations, with 2 models per dataset. For each model, we perform a hyperparameter search and train 3-4 instances with different random seeds, using the best configuration on the development set.

Evaluation
We evaluate each model in multiple settings: (a) the original development set; (b) the generated contrast set, denoted by CONT; (c) the subset of CONT marked as valid by crowdworkers, denoted by CONT VAL . Notably, CONT and CONT VAL have a different distribution over perturbations. To account for this discrepancy, we also evaluate models on a sample from CONT, denoted by CONT RAND , where sampling is according to the perturbation distribution in CONT VAL . Last, to assess the utility of constraint sets, we enrich the contrast set of each example with its corresponding constraints, denoted by CONT +CONST .
In addition to F1, we measure consistency, namely, the fraction of contrast sets for which the model correctly answers all examples, where C(x) is the generated contrast set for example x (which includes x; with a slight abuse of notation, we overload the definition of C(x) from §2, such that members of C(x) include not just the question and context, but also an answer), and ŷ(x̂) is the model's prediction for example x̂. Constraint satisfaction is measured using a binary 0-1 score.
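One way to formalize consistency over an evaluation set D is given below (a sketch in our notation; the criterion for a ''correct'' prediction follows the standard metric of each dataset):

\[
\mathrm{Cnst} \;=\; \frac{100}{|D|} \sum_{x \in D} \mathbb{1}\Big[\, \forall \hat{x} \in C(x):\ \hat{y}(\hat{x}) \text{ is correct} \,\Big]
\]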
Because yes/no questions do not exist in DROP, we do not evaluate TASE DROP on AppendBool examples, which have yes/no answers, as we cannot expect the model to answer those correctly.

Results
Results are presented separately for each model, in Tables 6, 7, and 8. Comparing performance on the development sets (DEV F1) to the corresponding contrast sets (CONT F1), we see a substantial decrease in performance on the generated contrast sets, across all datasets (e.g., 83.5 → 54.8 for TASE DROP , 82.2 → 49.9 for READER, and 50.2 → 20.4 for UNIFIEDQA IIRC ). Moreover, model consistency (CONT Cnst.) is considerably lower than the development scores (DEV F1); for example, TASE IIRC obtains an F1 score of 69.9 but only 24.3 consistency. This suggests that, overall, the models do not generalize to perturbations in the reasoning path expressed in the original question.
Comparing the results on the contrast sets and their validated subsets (CONT vs. CONT VAL ), performance on CONT VAL is better than on CONT (e.g., 58.1 versus 49.9 for READER). These gaps are due to (a) the distribution mismatch between the two sets, and (b) bad example generation. To isolate the effect of bad example generation, we can compare CONT VAL to CONT RAND , which share the same distribution over perturbations, but CONT RAND is not validated by humans. We see that performance on CONT VAL is typically ≤10% higher than on CONT RAND (e.g., 58.1 vs. 54.5 for READER). Given that performance on the original development set is dramatically higher, it seems we can currently use automatically generated contrast sets (without verification) to evaluate robustness to reasoning perturbations.
Last, adding constraints to the generated contrast sets (CONT vs. CONT +CONST ) often leads to a decrease in model consistency, most notably on DROP, where there are arithmetic constraints and not only answer type constraints.
For instance, consistency drops from 35.7 to 33.7 for TASE, and from 5.1 to 4.4 for UNIFIEDQA DROP . This shows that the generated constraints expose additional flaws in current models.

Data Augmentation
Results in §5.3 reveal clear performance gaps in current QA models. A natural solution is to augment the training data with examples from the contrast set distribution, which can be done effortlessly, since BPB is fully automatic.
We run BPB on the training sets of DROP, HOTPOTQA, and IIRC. As BPB generates many examples, it can shift the original training distribution dramatically. Thus, we limit the number of examples generated by each perturbation by a threshold τ . Specifically, for a training set S with |S| = n examples, we augment S with τ * n randomly generated examples from each perturbation (if fewer than τ * n examples were generated we add all of them). We experiment with three values τ ∈ {0.03, 0.05, 0.1}, and choose the trained model with the best F 1 on the contrast set.
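A minimal sketch of this per-perturbation sampling is given below (illustrative; the data structures are assumed, not taken from our codebase):

```python
import random
from typing import Dict, List

def augment_training_set(train: List[dict],
                         generated: Dict[str, List[dict]],
                         tau: float) -> List[dict]:
    """Add at most tau * |train| generated examples per perturbation type."""
    budget = int(tau * len(train))
    augmented = list(train)
    for perturbation, examples in generated.items():
        sampled = examples if len(examples) <= budget else random.sample(examples, budget)
        augmented.extend(sampled)
    random.shuffle(augmented)
    return augmented

# tau is selected from {0.03, 0.05, 0.1} by the best F1 on the contrast set.
```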
Augmentation results are shown in Tables 6-8. Consistency (CONT and CONT VAL ) improves dramatically, with only a small change in the model's DEV performance, across all models. We observe an increase in consistency of 13 points for TASE DROP , 24 for TASE IIRC , 13 for READER, and 1-4 points for the UNIFIEDQA models. Interestingly, augmentation is less helpful for UNIFIEDQA than for TASE and READER. We conjecture that this is because UNIFIEDQA was trained on examples from multiple QA datasets and is thus less affected by the augmented data.
Improvement on test examples sampled from the augmented training distribution is expected. To test whether augmented data improves robustness on other distributions, we evaluate TASE+ and UNIFIEDQA DROP + on the DROP contrast set manually collected by Gardner et al. (2020). We find that training on the augmented training set does not lead to a significant change on the manually collected contrast set (F1 of 60.4 → 61.1 for TASE, and 30 → 29.6 for UNIFIEDQA DROP ). This agrees with findings that data augmentation with respect to a phenomenon may not improve generalization to other out-of-distribution examples (Kaushik et al., 2021; Joshi and He, 2021).
READER (Figure 4) shows similar trends to TASE, with a dramatic performance decrease of 80-90 points on yes/no questions created by AppendBool and ReplaceBool. Interestingly, READER obtains high performance on PruneStep examples, as opposed to TASE DROP (Figure 3), which has a similar span extraction head that is required for these examples. This is possibly due to the ''train-easy'' subset of HOTPOTQA, which includes single-step selection questions.
Moving to the general-purpose UNIFIEDQA models, they perform on PruneStep at least as well as on the original examples, showing their ability to answer simple selection questions. They also demonstrate robustness on ReplaceBool. Yet, they struggle on numeric comparison questions and arithmetic calculations: a ∼65-point decrease on ChangeLast on DROP (Figure 3), a 10-30 F1 decrease on ReplaceComp and AppendBool (Figures 3, 4, 5), and almost 0 F1 on ReplaceArith (Figure 3).
Performance on CONT and CONT VAL . Results on CONT VAL are generally higher than on CONT due to the noise in example generation. However, whenever results on ORIG are higher than on CONT, they are also higher than on CONT VAL , showing that the general trend can be inferred from CONT, due to the large performance gap between ORIG and CONT. An exception is ChangeLast in DROP and HOTPOTQA, where performance on CONT is lower than on ORIG, but on CONT VAL it is higher. This is probably due to the noise in generation, especially for DROP, where example validity is at 55.1% (see Table 4).

Evaluation on Answer Constraints
Evaluating whether the model satisfies answer constraints can help assess the model's skills. To this end, we measure whether model predictions satisfy the generated constraints. Models typically predict the correct answer type: TASE DROP and UNIFIEDQA predict a number for ≥ 86% of the generated numeric questions, and READER and TASE IIRC successfully predict a yes/no answer in ≥ 92% of the cases. However, fine-tuning UNIFIEDQA on HOTPOTQA and IIRC reduces constraint satisfaction (94.7 → 76.3 for UNIFIEDQA HPQA , 65.4 → 38.9 for UNIFIEDQA IIRC ), possibly since yes/no questions constitute fewer than 10% of the examples in these datasets (Yang et al., 2018; Ferguson et al., 2020). In addition, results on DROP for the constraint '≥' are considerably lower than for '≤' for UNIFIEDQA (83 → 67.4) and UNIFIEDQA DROP (81.8 → 65.9), indicating a bias towards predicting small numbers.

Related Work
The evaluation crisis in NLU has led to wide interest in challenge sets that evaluate the robustness of models to input perturbations. However, most past approaches (Ribeiro et al., 2020; Khashabi et al., 2020a; Kaushik et al., 2020) involve a human-in-the-loop and are thus costly.
Recently, more and more work has considered using meaning representations of language to automatically generate evaluation sets. Past work used an ERG grammar and AMR (Rakshit and Flanigan, 2021) to generate relatively shallow perturbations. In parallel to this work, Ross et al. (2021) used control codes over SRL to generate more semantic perturbations of declarative sentences. In contrast, we generate perturbations at the level of the underlying reasoning process, in the context of QA. Last, Bitton et al. (2021) used scene graphs to generate examples for visual QA; however, they assumed the existence of a gold scene graph at the input. Overall, this body of work represents an exciting new research program, where structured representations are leveraged to test and improve the blind spots of pre-trained language models.
Last, QDMR-to-question generation is broadly related to work on text generation from structured data (Nan et al., 2021; Novikova et al., 2017; Shu et al., 2021), and to passage-to-question generation methods (Du et al., 2017; Duan et al., 2017) that, in contrast to our work, focused on simple questions not requiring reasoning.

Discussion
We propose the BPB framework for generating high-quality reasoning-focused question perturbations, and demonstrate its utility for constructing contrast sets and evaluating RC models.
While we focus on RC, our method for perturbing questions is independent of the context modality. Thus, porting our approach to other modalities only requires a method for computing the answer to perturbed questions. Moreover, BPB provides a general-purpose mechanism for question generation, which can be used outside QA as well.
We provide a library of perturbations that is a function of the current abilities of RC models. As future RC models, QDMR parsers, and QG models improve, we can expand this library to support additional semantic phenomena.
Last, we showed that constraint sets are useful for evaluation. Future work can use constraints as a supervision signal, similar to Dua et al. (2021), who leveraged dependencies between training examples to enhance model performance.
Limitations

BPB represents questions with QDMR, which is geared towards representing complex factoid questions that involve multiple reasoning steps. Thus, BPB cannot be used when questions involve a single step; for example, one cannot use BPB to perturb ''Where was Barack Obama born?''. Inherently, the effectiveness of our pipeline approach depends on the performance of its modules: the QDMR parser, the QG model, and the single-hop RC model used for QDMR evaluation. However, our results suggest that current models already yield high-quality examples, and model performance is expected to improve over time.