Break It Down: A Question Understanding Benchmark

Understanding natural language questions entails the ability to break down a question into the requisite steps for computing its answer. In this work, we introduce a Question Decomposition Meaning Representation (QDMR) for questions. QDMR constitutes the ordered list of steps, expressed through natural language, that are necessary for answering a question. We develop a crowdsourcing pipeline, showing that quality QDMRs can be annotated at scale, and release the Break dataset, containing over 83K pairs of questions and their QDMRs. We demonstrate the utility of QDMR by showing that (a) it can be used to improve open-domain question answering on the HotpotQA dataset, (b) it can be deterministically converted to a pseudo-SQL formal language, which can alleviate annotation in semantic parsing applications. Last, we use Break to train a sequence-to-sequence model with copying that parses questions into QDMR structures, and show that it substantially outperforms several natural baselines.


Introduction
Recently, increasing work has been devoted to models that can reason and integrate information from multiple parts of an input. This includes reasoning over images (Antol et al., 2015;Johnson et al., 2017;Suhr et al., 2019;Hudson and Manning, 2019), paragraphs (Dua et al., 2019), documents (Welbl et al., 2018;Talmor and Berant, 2018;, tables (Pasupat and Liang, 2015), and more. Question answering (QA) is commonly used to test the ability to reason, where a complex natural language question is posed, and is to be answered given a particular context (text, image, etc.). Although questions often share structure across tasks and modalities, understanding the language of complex questions has thus far been addressed within each task in isolation. Consider the questions in Figure 1, all of which express operations such as fact chaining and counting. Additionally, humans can take a complex question and break it down into a sequence of simpler questions even when they are unaware of what or where the answer is. This ability, to compose and decompose questions, lies at the heart of human language (Pelletier, 1994) and allows us to tackle previously unseen problems. Thus, better question understanding models should improve performance and generalization in tasks that require multi-step reasoning or that do not have access to substantial amounts of data.
In this work we propose question understanding as a standalone language understanding task. We introduce a formalism for representing the meaning of questions that relies on question decomposition, and is agnostic to the information source. Our formalism, Question Decomposition Meaning Representation (QDMR), is inspired by database query languages (SQL; SPARQL), and by semantic parsing (Zelle and Mooney, 1996;Zettlemoyer and Collins, 2005;Clarke et al., 2010), in which questions are given full meaning representations.
We express complex questions via simple (''atomic'') questions that can be executed in sequence to answer the original question. Each atomic question can be mapped into a small set of formal operations, where each operation either selects a set of entities, retrieves information about their attributes, or aggregates information over entities. While this has been formalized in knowledge-base (KB) query languages (Chamberlin and Boyce, 1974), the same intuition can be applied to other modalities, such as images and text. QDMR abstracts away the context needed to answer the question, allowing in principle to query multiple sources for the same question.
In contrast to semantic parsing, QDMR operations are expressed through natural language, facilitating annotation at scale by non-experts. Figure 1 presents examples of complex questions on three different modalities. The middle box lists the natural language decompositions provided for each question, and the bottom box displays their corresponding formal queries.
QDMR serves as the formalism for creating BREAK, a question decomposition dataset of 83,978 questions over ten datasets and three modalities. BREAK is collected via crowdsourcing, with a user interface that allows us to train crowd-workers to produce quality decompositions ( §3). Validating the quality of annotated structures reveals 97.4% to be correct ( §4).
We demonstrate the utility of QDMR in two setups. First, we regard the task of open-domain QA over multi-hop questions from the HOTPOTQA dataset. Combining QDMR structures in BREAK with a reading comprehension (RC) model (Min et al., 2019b) improves F 1 from 43.3 to 52.4 ( §5). Second, we show that decompositions in BREAK possess high annotation consistency, which indicates that annotators produce high-quality QDMRs ( §4.3). In §6 we discuss how these QDMRs can be used as a strong proxy for full logical forms in semantic parsing.
We use BREAK to train a neural QDMR parser that maps questions into QDMR representations, based on a sequence-to-sequence model with copying (Gu et al., 2016). Manual analysis of generated structures reveals an accuracy of 54%, showing that automatic QDMR parsing is possible, though still far from human performance ( §7).
To conclude, our contributions are: • Proposing the task of question understanding and introducing the QDMR formalism for representing the meaning of questions ( §2) • The BREAK dataset, which consists of 83,978 examples sampled from 10 datasets over three distinct information sources ( §3) Figure 1: Questions over different sources share a similar compositional structure. Natural language questions from multiple sources (top) are annotated with the QDMR formalism (middle) and deterministically mapped into a pseudo-formal language (bottom).
• Showing how QDMR can be used to improve open-domain question answering ( §5), as well as alleviate the burden of annotating logical forms in semantic parsing ( §6) • A QDMR parser based on a sequence-tosequence model with copying mechanism ( §7) The BREAK dataset, models, and entire codebase are publicly available at: https://github. com/tomerwolgithub/Break.

Question Decomposition Formalism
In this section we define the QDMR formalism for domain agnostic question decomposition. QDMR is primarily inspired by SQL (Codd, 1970;Chamberlin and Boyce, 1974). However, while SQL was designed for relational databases, QDMR also aims to capture the meaning of questions over unstructured sources such as text and images. Thus, our formalism abstracts away from SQL by assuming an underlying ''idealized'' KB, which contains all entities and relations expressed in the question. This abstraction enables QDMR to be unrestricted to a particular modality, with its operators to be executed also against text  Table 1). A step, s i is a sequence of tokens, s i = (s i 1 , ..., s i m i ), where a token s i k is either a word from a predefined lexicon L x (details in §3) or a reference token, referring to the result of a previous step s j , where j < i. The last step, s n returns the answer to x.
Decomposition Graph QDMR structures can be represented as a directed acyclic graph (DAG), used for evaluating QDMR parsing models ( §7.1). Given QDMR, s = s 1 , ..., s n , each step s i is a node in the graph, labeled by its sequence of tokens and index i. Edges in the graph are induced by reference tokens to previous steps. Node s i is connected by an incoming edge (s j , That is, if one of the tokens in s i is a reference to s j . Figure 2 displays a sequence of QDMR steps, represented as a DAG. QDMR Operators A QDMR step corresponds to one of 13 query operators. We designed the operators to be expressive enough to represent the meaning of questions from a diverse set of datasets ( §3). QDMR assumes an underlying KB, K, which contains all of the entities and relations expressed in its steps. A relation, r, is a function mapping two arguments to whether r holds in K: [[r(x, y)]] K ∈ {true, false}. The operators operate over: (i) sets of objects S o , where objects o, are either numbers n, boolean values b, or entities e in K; (ii) a closed set of phrases w op , describing logical operations; and (iii) natural language phrases w, representing entities and relations in K. We assume the existence of grounding functions that map a phrase w to concrete constants in K. Table 2 describes the aforementioned constructs. In addition, we define the function map K (S e , S o ) which maps entity e ∈ S e to the set of corresponding objects from S o . Each o ∈ S o corresponds to an e ∈ S e by being contained in the result of a sequence of PROJECT and GROUP operations applied to e: 2 We now formally define each QDMR operator and provide concrete examples in Table 1.
• SELECT: Computes the set of entities in K corresponding to w: select(w) = ground e K (w).
2 The sequence of operations op 1 , . . . , op k is traced using the references to previous steps in the QDMR structure.

Function
Description agg Given a phrase wagg which describes an aggregate operation, agg denotes the corresponding operation. Either max, min, count , sum or avg . sup Given wsup describing a superlative, it denotes the corresponding function. Either arg max or arg min. com Given wcom describing a comparison, it denotes the corresponding relation out of: <, ≤, >, ≥, =, =. ari Given w ari describing an arithmetic operation, it denotes the corresponding operation out of: +, −, * , /. ground e K (w) Given a natural language phrase w, it returns the set of corresponding KB entities, Se. ground r K (w) Given a natural language phrase w, it returns the corresponding KB relation, r. • FILTER: Filters a set of objects so that it follows the condition expressed by w: where r = ground r K (w), e = ground e K (w)}.
• PROJECT: Computes the objects that relate to input entities S e with the relation expressed by w, where r = ground r K (w).
• AGGREGATE: The result of applying an aggregate operation: aggregate(w agg , S o ) = {agg (S o )}.
• GROUP: Receives a set of ''keys'', S e , and a set of corresponding ''values'', S o . It outputs a set of numbers, each corresponding to a key e ∈ S e . Each number results from applying aggregate, w agg to the subset of values corresponding to e.
• SUPERLATIVE: Receives entity set S e and number set S n . Each number n ∈ S n is the result of a mapping from an entity e ∈ S e . It returns a subset of S e for which the corresponding number is either highest/lowest as indicated by w sup .
• COMPARATIVE: Receives entity set S e and number set S n . Each n ∈ S n is the result of a mapping from an e ∈ S e . It returns a subset of S e for which the comparison with n ′ , represented by w com , holds.
• UNION: Denotes the union of object sets: • DISCARD: Denotes the set difference of two object sets: • INTERSECTION: Computes the intersection of its entity sets and returns all objects which relate to the entities with the relation expressed by w.
• SORT: Orders a set of entities according to a corresponding set of numbers. Each number n i is the result of a mapping from entity e i .
• BOOLEAN: Returns whether the relation expressed by w holds between the input objects: • ARITHMETIC: Computes the application of an arithmetic operation: arith(w ari , S 1 n , S 2 n ) = {ari(n 1 , n 2 )}, where S 1 n , S 2 n are singleton sets containing n 1 , n 2 respectively.

High-level Decompositions
In QDMR, each step corresponds to a single logical operator. In certain contexts, a less granular decomposition might be desirable, where sub-structures containing multiple operators could be collapsed to a single node. This can be easily achieved in QDMR by merging certain adjacent nodes in its DAG structure. When examining existing RC datasets Dua et al., 2019), we observed that long spans in the question often match long spans in the text, due to existing practices of generating questions Step #1 merges together SELECT and multiple FILTER steps. via crowdsourcing. In such cases, decomposing the long spans into multiple steps and having an RC model process each step independently, increases the probability of error. Thus, to promote the usefulness of QDMR for current RC datasets and models, we introduce high-level QDMR, by merging the following operators: • SELECT + PROJECT on named entities: For the question, ''What is the birthdate of Jane?'' its high-level QDMR would be ''return the birthdate of Jane'' as opposed to the more granular, ''return Jane; return birthdate of #1''.
• SELECT + FILTER: Consider the first step of the example in Figure 3. It contains both a SELECT operator (''return actress'') as well as two FILTER conditions (''that played...'', ''on the TV sitcom...'').
• FILTER + GROUP + COMPARATIVE: Certain high-level FILTER steps contain implicit grouping and comparison operations. E.g., ''return yard line scores in the fourth quarter; return #1 that both teams scored from''.
Step #2 contains an implicit GROUP of team per yard line and a COMPARATIVE returning the lines where exactly two teams scored.
We provide both granular and high-level QDMRs for a random subset of RC questions (see Table 3). The concrete utility of high-level QDMR to open-domain QA is presented in §5.

Data Collection
Our annotation pipeline for generating BREAK consisted of three phases. First, we collected complex questions from existing QA benchmarks. Second, we crowdsourced the QDMR annotation of these questions. Finally, we validated worker annotations in order to maintain their quality.  Question Collection Questions in BREAK were randomly sampled from ten QA datasets over the following tasks (Table 3): • Semantic Parsing: Mapping natural language utterances into formal queries, to be executed on a target KB (Price, 1990;Zelle and Mooney, 1996;Yu et al., 2018).
All questions collected were composed by human annotators. 3 HOTPOTQA questions were all sampled from the hard split of the dataset.

QDMR Annotation
A key question is whether it is possible to train non-expert annotators to produce high-quality QDMRs. We designed an annotation interface (Figure 4), where workers are first given explanations and examples on how to identify and phrase each of the operators in Table 1. Then, workers decompose questions into a list of 3 Except for COMPLEXWEBQUESTIONS (CWQ), where annotators paraphrased automatically generated questions. steps, where they are only allowed to use words from a lexicon L x , which contains: (a) words appearing in the question (or their automatically computed inflections), (b) words from a small pre-defined list of 66 function word such as, 'if', 'on', 'for each ', or (c) reference tokens that refer to the results of a previous step. This ensures that the language used by workers is consistent across examples, while being expressive enough for the decomposition. Our annotation interface presents workers with the question only, so they are agnostic to the original modality of the question. The efficacy of this process is explored in §4.2.
We used Amazon Mechanical Turk to crowdsource QDMR annotation. In each task, workers decomposed questions, paying them $0.40 per question, which amounts to an average pay of $12 per hour. Overall, we collected 83,978 examples using 64 distinct workers. The dataset was partitioned into train/development/test sets following the partitions in the original datasets. During partition, we made sure that development and test samples do not share the same context.

Worker Validation
To ensure worker quality, we initially published qualification tasks, open to all workers in the United States. The task required workers to carefully review the annotation instructions and decompose 10 example questions. The examples were selected so that each QDMR operation should appear in at least one of their decompositions (Table 1). In total, 64 workers were able to correctly decompose at least 8 examples and were qualified as annotators. To validate worker performance over time, we conducted random validations of annotations. Over 9K annotations were  reviewed by experts throughout the annotation process. Only workers who consistently produced correct QDMRs for at least 90% of their tasks were allowed to continue as annotators.

Dataset Analysis
This section examines the properties of collected QDMRs in BREAK and analyzes their quality.

Quantitative Analysis
Overall, BREAK contains 83,978 decompositions, including 60,150 QDMRs and 23,828 examples with high-level QDMRs, which are exclusive to text modalities. Table 3 shows that data is proportionately distributed between questions over structured (DB) and unstructured modalities (text, images). The distribution of QDMR operators is presented in Table 4, detailing the prevalence of each query operator 4 (we automatically compute this distribution, as explained in §4.3). SELECT and PROJECT are the most common operators. Additionally, at least 10% of QDMRs contain operators such as GROUP and COMPARATIVE, which entail complex reasoning, in contrast to high-level QDMRs, where such operations are rare. This distinction sheds light on the reasoning types required for answering RC datasets (highlevel QDMR) compared with more structured tasks (QDMR).  include 3-6 steps, whereas high-level QDMRs are much shorter, as a single SELECT often finds an entity described by a long noun phrase (see §2).

Quality Analysis
We describe the process of estimating the correctness of collected QDMR annotations. Similar to previous works (Yu et al., 2018;Kwiatkowski et al., 2019) we use expert judgments, where the experts had prepared the guidelines for the annotation task. Given a question and its annotated QDMR, (q, s) the expert determines the correctness of s using one of the following categories: • Correct (C): If s constitutes a list of QDMR operations that lead to correctly answering q.
• Granular (C G ): If s is correct and none of its operators can be further decomposed. 5 • Incorrect (I): If s is in neither C nor C G .
Examples of these expert judgments are shown in Figure 5. To estimate expert judgment of correctness, we manually reviewed a random sample of 500 QDMRs from BREAK. We classified 93.8% of the samples in C G and another 3.6% in C. Thus, 97.4% of the samples constitute a correct decomposition of the original question. Workers have somewhat struggled with decomposing superlatives (e.g., ''biggest sphere''), as evident from the first question in Figure 5. Collected QDMRs displayed similar estimates of C, C G , and I, regardless of their modality (DB, text, or image).

Annotation Consistency
As QDMR is expressed using natural language, it introduces variability into its annotations. We wish to validate the consistency of collected QDMRs, that is, whether we can correctly infer the formal QDMR operator (f i ) and its arguments from each step (s i ). To infer these formal representations, we developed an algorithm that goes over the QDMR structure step-by-step, and for each step s i , uses a set of predefined templates to identify f i and its arguments, expressed in s i . This results in an execution graph (Figure 2), where the execution result of a parent node serves as input to its child. Figure 1 presents three QDMR decompositions along with the formal graphs output by our algorithm (lower box). Each node lists its operator (e.g., GROUP), its constant input listed in brackets (e.g., count) and its dynamic input, which are the execution results of its parent nodes.
Overall, 99.5% of QDMRs had all their steps mapped into pseudo-logical forms by our algorithm. To evaluate the correctness of the mapping algorithm, we randomly sampled 350 logical forms, and examined the structure of the formulas, assuming that words copied from the question correspond to entities and relations in an idealized KB (see §2). Of this sample, 99.4% of its examples had all of their steps, s i , correctly mapped to the corresponding f i . Overall, 93.1% of the examples were of fully accurate logical forms, with errors being due to QDMRs that were either incorrect or not fully decomposed (I, C in §4.2). Thus, a rule-based algorithm can map more than 93% of the annotations into a correct formal representation. This shows that our annotators produced consistent and high-quality QDMRs. Moreover, it suggests that non-experts can annotate questions with pseudo-logical forms, which can be used as

QDMR for Open-domain QA
A natural setup for QDMR is in answering complex questions that require multiple reasoning steps. We compare models that exploit question decompositions to baselines that do not. We use the open-domain QA (''full-wiki") setting of the HOTPOTQA dataset : Given a question, the QA model retrieves the relevant Wikipedia paragraphs and answers the question using these paragraphs.

Experimental Setup
We compare BREAKRC, a model that utilizes question decomposition to BERTQA, a standard QA model, based on BERT , and present COMBINED, an approach that enjoys the benefits of both models.

BREAKRC Algorithm 1 describes the BREAKRC model, which uses high-level QDMR structures for answering open-domain multi-hop questions.
We assume access to an Information Retrieval (IR) model and an RC model, and denote by ANSWER(·) a function that takes a question as input, runs the IR model to obtain paragraphs, and then feeds those paragraphs as context for an RC model that returns a distribution over answers.
Given an input QDMR, s = s 1 , ..., s n , iterate over s step-by-step and perform the following. First, we extract the operation (line 4) and the previous steps referenced by s i (line 5). Then, we compute the answer to s i conditioned on the extracted operator. For SELECT steps, we simply run the ANSWER(·) function. For PROJECT steps, we substitute the reference to the previous step in s i with its already computed answer, and then run ANSWER(·). For FILTER steps, 6 we use a simple rule to extract a ''normalized question'',ŝ i from s i and get an intermediate answer ans tmp with ANSWER(ŝ i ). We then ''intersect'' ans tmp with the referenced answer by multiplying the probabilities provided by the RC model and normalizing. For COMPARISON steps, we compare, with a discrete operation, the numbers returned by the referenced steps. The final answer is the highest probability answer of step s n .
As our IR model we use bigram TF-IDF, proposed by Chen et al. (2017). Because the RC model is run on single-hop questions, we use the BERTbased RC model from Min et al. (2019b), trained solely on SQuAD (Rajpurkar et al., 2016).
BERTQA Baseline As BREAKRC exploits question decompositions, we compare it with a model that does not. BERTQA receives as input the original natural language question, x. It uses the same IR model as BREAKRC to retrieve paragraphs for x. For a fair comparison, we set its number of retrieved paragraphs such that it is identical to BREAKRC (namely, 10 paragraphs for each QDMR step that involves IR). Similar to BREAKRC, retrieved paragraphs are fed to a pretrained BERT-based RC model (Min et al., 2019b) to answer x. In contrast to BREAKRC, that is trained on SQUAD, BERTQA is trained on the target dataset (HOTPOTQA), giving it an advantage over BREAKRC.
A COMBINED Approach Last, we present an approach that combines the strengths of BREAKRC and BERTQA. In this approach, we use the QDMR decomposition to improve retrieval only. Given a question x and its QDMR s, we run BREAKRC on s, but in addition to storing answers, we also store all the paragraphs retrieved by the IR model. We then run BERTQA on the question x and the top-10 paragraphs retrieved by BREAKRC, sorted by their IR ranking. This approach resembles that of Qi et al. (2019).
The advantage of COMBINED is that we do not need to develop an answering procedure for each QDMR operator separately, which involves dif-6 INTERSECTION steps are handled in a manner similar to FILTER, but we omit the exact description for brevity.   Table 6 shows model performance on HOTPOTQA. We report EM and F 1 using the official HOTPOTQA evaluation script. IR measures the percentage of examples in which the IR model successfully retrieved both of the ''gold paragraphs'' necessary for answering the multi-hop question. To assess the potential utility of QDMR, we report results for BREAKRC G , which uses gold QDMRs, and BREAKRC P , which uses QDMRs predicted by a COPYNET parser ( §7.2).

Results
Retrieving paragraphs with decomposed questions substantially improves the IR metric from 46.3 to 59.2 (BREAKRC G ), or 52.5 (BREAKRC P ). This leads to substantial gains in EM and F 1 for COMBINED G (43.3 to 52.4) and COMBINED P (43.3 to 49.3). The EM and F 1 of BREAKRC G are only slightly higher than BERTQA because BREAKRC does not handle certain operators, such as BOOLEAN steps (9.4% of the examples).
The majority of questions in HOTPOTQA combine SELECT operations with either PROJECT (also called ''bridge'' questions), COMPARISON, or FILTER. PROJECT and COMPARISON questions ( Figure 6) were shown to be less susceptible   to reasoning shortcuts, i.e. they necessitate multistep reasoning (Chen and Durrett, 2019;Jiang and Bansal, 2019;Min et al., 2019a). In Table 7 we report BREAKRC results on these question types, where it notably outperforms BERTQA.
Ablations In BREAKRC, multiple IR queries are issued, one at each step. To examine whether these multiple queries were the cause for performance gains, we built IR-NP, a model that issues multiple IR queries, one for each noun phrase in the question. Similar to COMBINED, the question and union of retrieved paragraphs are given as input to BERTQA. We observe that COMBINED substantially outperforms IR-NP, indicating that the structure of QDMR, rather than multiple IR queries, has led to improved performance. 7 To test whether QDMR is better than a simple rule-based decomposition algorithm, we developed a model that decomposes a question by applying a set of predefined rules over the dependency tree of the question (full details in §7.2). COMBINED and BREAKRC were compared to COMBINED R and BREAKRC R , which use the rulebased decompositions. We observe that QDMR lead to substantially higher performance when compared to the rule-based decompositions. 7 Issuing an IR query over each ''content word'' in the question, instead of each noun phrase, led to poor results.

QDMR for Semantic Parsing
As QDMR structures can be easily annotated at scale, a natural question is how far are they from fully executable queries (known to be expensive to annotate). As shown in §4.3, QDMRs can be mapped to pseudo-logical forms with high precision (93.1%) by extracting formal operators and arguments from their steps. The pseudo-logical form differs from an executable query in the lack of grounding of its arguments (entities and relations) in KB constants. This stems from the design of QDMR as a domain-agnostic meaning representation ( §2). QDMR abstracts away from a concrete KB schema by assuming an underlying ''idealized'' KB, which contains all of its arguments.
Thus, QDMR can be viewed as an intermediate representation between a natural language question and an executable query. Such intermediate representations have already been discussed in prior work on semantic parsing. Kwiatkowski et al. (2013) and Choi et al. (2015) used underspecified logical forms as an intermediate representation. Guo et al. (2019) proposed a two-stage approach, separating between learning an intermediate text-to-SQL representation and the actual mapping to schema items. Works in the database community have particularly targeted the mapping of intermediate query representations into DB grounded queries, using schema mapping and join path inference (Androutsopoulos et al., 1995;Baik et al., 2019). We argue that QDMR can be used as an easy-to-annotate representation in such semantic parsers, bridging between natural language and full logical forms.

QDMR Parsing
We now present evaluation metrics and models for mapping questions into QDMR structures.
Task Definition Given a question x we wish to map it to its QDMR steps, s = s 1 , ... , s n . One can frame this as a sequence-to-sequence problem where x is mapped to a string representing its decomposition. We add a special separating token SEP , and define the target string to be s 1 1 , ... , s 1 m 1 , SEP , s 2 1 , ... , s 2 m 2 , SEP , ... , s n m n , where m 1 , ... , m n are the number of tokens in each decomposition step.

Evaluation Metrics
We wish to assess the quality of a predicted QDMR,ŝ to a gold standard, s. Figure 7 lists various properties by which question decompositions may differ, such as granularity (e.g., steps 1-3 of decomposition 1 are merged into the first step of decomposition 2), ordering (e.g., the last two steps are swapped) and wording (e.g., using ''from'' instead of ''on''). While such differences do not affect the overall semantics, the second decomposition can be further decomposed. To measure such variations, we introduce two types of evaluation metrics. Sequence-based metrics treat the decomposition as a sequence of tokens, applying standard text generation metrics. As such metrics ignore the QDMR graph structure, we also use graph-based metrics that compare the predicted graph Gŝ to the gold QDMR graph G s (see §2).
Sequence-based scores, where higher values are better, are denoted by ⇑. Graph-based scores, where lower values are better, are denoted by ⇓.
• SARI ⇑ (Xu et al., 2016): SARI is commonly used in tasks such as text simplification. Given s, we consider the sets of added, deleted, and kept n-grams when mapping the question x to s. We compute these three sets for both s andŝ using the standard of up to 4-grams, then average (a) the F 1 for added n-grams between s andŝ, (b) the F 1 for kept n-grams, and (c) the precision for the deleted n-grams.
• Graph Edit Distance (GED) ⇓: A graph edit path is a sequence of node and edge edit operations (addition, deletion, and substitution), where each operation has a predefined cost. GED computes the minimal-cost graph edit path required for transitioning from G s to Gŝ (and vice versa), normalized by max(|G s |, |Gŝ|). Operation costs are 1 for insertion and deletion of nodes and edges. The substitution cost of two nodes u, v is set to be 1−Align (u, v), where Align(u, v) is the ratio of aligned tokens between these steps.
• GED+ ⇓: Comparing the QDMR graphs in Figure 8, we consider the splitting and merging of graph nodes. We implement GED+, a variant of GED with additional operations to merge (split) a set of nodes (node), based on the A* algorithm (Hart et al., 1968). 8
• COPY: A model that copies the input question x, without introducing any modifications.
• RULEBASED: We defined 12 decomposition rules, to be applied over the dependency tree of the question, augmented with coreference relations. A rule is a regular expression over the question dependency tree, which invokes a decomposition operation when matched (Table 8). For example, the rule for relative clauses (relcl) breaks the question at the relative pronoun ''that'', while adding a reference to the preceding part of the sentence. A full decomposition is obtained by recursively applying the rules until no rule is matched.
• SEQ2SEQ: A sequence-to-sequence neural model with a 5-layer LSTM encoder and attention at decoding time.  • S2SDYNAMIC: SEQ2SEQ with a dynamic output vocabulary restricted to the closed set of tokens L x available to crowd-workers (see §3).
• COPYNET: SEQ2SEQ with an added copy mechanism that allows copying tokens from the input sequence (Gu et al., 2016). Table 9 presents model performance on BREAK. Neural models outperform the RULEBASED baseline and perform reasonably well, with COPYNET obtaining the best scores across all metrics. This can be attributed to most of the tokens in a QDMR parse being copied from the original question.

Error Analysis
To judge the quality of predicted QDMRs we sampled 100 predictions of COPYNET (Table 10) half of them being high-level QDMRs. For standard QDMR, 24% of the sampled predictions were an exact match, with an additional 30% being fully decomposed and semantically equivalent to the gold decompositions. For example, in the first row of Table 10, the gold decomposition first discards the number of cylinders, then counts the remaining objects. Instead, COPYNET opted to count both groups, then subtract the number of cylinders from the number of objects. This illustrates how different QDMRs may be equivalent.
For high-level examples (from RC datasets), as questions are often less structured, they require a deeper semantic understating from the decomposition model. Only 8% of the predictions were an exact match, with an additional 46% being semantically equivalent to the gold. The remaining 46% were of erroneous predictions (see Table 10).

Related Work
Question Decomposition Recent work on QA through question decomposition has focused mostly on single modalities (Gupta and Lewis, 2018;Guo et al., 2019;Min et al., 2019b). QA using neural modular networks has been suggested for both KBs and images by Andreas et al. (2016) and Hu et al. (2017). Question decomposition over text was proposed by Talmor and Berant (2018), however over a much more limited set of questions than in BREAK. Iyyer et al. (2017) have also decomposed questions to create a ''sequential question answering'' task. Their annotators viewed a web table and performed actions over it to retrieve the cells that constituted the answer. Conversely, we provided annotators only with the question, as QDMR is agnostic to the original context.
An opposite annotation cycle to ours was presented in Cheng et al. (2018). The authors generate sequences of simple questions which crowdworkers paraphrase into a compositional question. Questions in BREAK are composed by humans, and are then decomposed to QDMR.

Semantic Formalism Annotation
Labeling corpora with a semantic formalism has often been reserved for expert annotators (Dahl et al., 1994;Zelle and Mooney, 1996;Abend and Rappoport, 2013;Yu et al., 2018). Recent work has focused on cheaply eliciting quality annotations from nonexperts through crowdsourcing (He et al., 2016;Iyer et al., 2017;. FitzGerald et al. (2018) facilitated non-expert annotation by introducing a formalism expressed in natural language for semantic-role-labeling. This mirrors QDMR, as both are expressed in natural language.
(1) company that Kyle York is the Chief Strategy Officer of; (2) corporation in 2016 that #1 was acquired by. (1) MASH star that Dayton 's Devils had a cameo from; (2) role that #1 played on the show.
(1) the MASH that Dayton 's Devils had a cameo; (2) what role on the show star of #1 played. incorrect (46%) to facilitate annotations, while Dependencybased Compositional Semantics is centered on paralleling syntax. Domain-independent intermediate representations for semantic parsers were proposed by Kwiatkowski et al. (2013) and Reddy et al. (2016). As there is no consensus on the ideal meaning representation for semantic parsing, representations are often chosen based on the particular execution setup: SQL is used for relational databases (Yu et al., 2018), SPARQL for graph KBs (Yih et al., 2016), while other ad-hoc languages are used based on the task at hand. We frame QDMR as an easy-to-annotate formalism that can be potentially converted to other representations, depending on the task. Last, AMR (Banarescu et al., 2013) is a meaning representation for sentences. Instead of representing general language, QDMR represents questions, which are important for QA systems, and for probing models for reasoning.

Conclusion
In this paper, we presented a formalism for question understanding. We have shown it is possible to train crowd-workers to produce such representations with high quality at scale, and created BREAK, a benchmark for question decomposition with over 83K decompositions of questions from 10 datasets and 3 modalities (DB, images, text). We presented the utility of QDMR for both open-domain question answering and semantic parsing, and constructed a QDMR parser with reasonable performance. QDMR proposes a promising direction for modeling question understanding, which we believe will be useful for multiple tasks in which reasoning is probed through questions.