Understanding natural language questions entails the ability to break down a question into the requisite steps for computing its answer. In this work, we introduce a Question Decomposition Meaning Representation (QDMR) for questions. QDMR constitutes the ordered list of steps, expressed through natural language, that are necessary for answering a question. We develop a crowdsourcing pipeline, showing that quality QDMRs can be annotated at scale, and release the Break dataset, containing over 83K pairs of questions and their QDMRs. We demonstrate the utility of QDMR by showing that (a) it can be used to improve open-domain question answering on the HotpotQA dataset, (b) it can be deterministically converted to a pseudo-SQL formal language, which can alleviate annotation in semantic parsing applications. Last, we use Break to train a sequence-to-sequence model with copying that parses questions into QDMR structures, and show that it substantially outperforms several natural baselines.

Recently, increasing work has been devoted to models that can reason and integrate information from multiple parts of an input. This includes reasoning over images (Antol et al., 2015; Johnson et al., 2017; Suhr et al., 2019; Hudson and Manning, 2019), paragraphs (Dua et al., 2019), documents (Welbl et al., 2018; Talmor and Berant, 2018; Yang et al., 2018), tables (Pasupat and Liang, 2015), and more. Question answering (QA) is commonly used to test the ability to reason, where a complex natural language question is posed, and is to be answered given a particular context (text, image, etc.). Although questions often share structure across tasks and modalities, understanding the language of complex questions has thus far been addressed within each task in isolation. Consider the questions in Figure 1, all of which express operations such as fact chaining and counting. Additionally, humans can take a complex question and break it down into a sequence of simpler questions even when they are unaware of what or where the answer is. This ability, to compose and decompose questions, lies at the heart of human language (Pelletier, 1994) and allows us to tackle previously unseen problems. Thus, better question understanding models should improve performance and generalization in tasks that require multi-step reasoning or that do not have access to substantial amounts of data.

Figure 1:

Questions over different sources share a similar compositional structure. Natural language questions from multiple sources (top) are annotated with the QDMR formalism (middle) and deterministically mapped into a pseudo-formal language (bottom).


In this work we propose question understanding as a standalone language understanding task. We introduce a formalism for representing the meaning of questions that relies on question decomposition, and is agnostic to the information source. Our formalism, Question Decomposition Meaning Representation (QDMR), is inspired by database query languages (SQL; SPARQL), and by semantic parsing (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Clarke et al., 2010), in which questions are given full meaning representations.

We express complex questions via simple (“atomic”) questions that can be executed in sequence to answer the original question. Each atomic question can be mapped into a small set of formal operations, where each operation either selects a set of entities, retrieves information about their attributes, or aggregates information over entities. While this has been formalized in knowledge-base (KB) query languages (Chamberlin and Boyce, 1974), the same intuition can be applied to other modalities, such as images and text. QDMR abstracts away the context needed to answer the question, so that, in principle, multiple sources can be queried for the same question.

In contrast to semantic parsing, QDMR operations are expressed through natural language, facilitating annotation at scale by non-experts. Figure 1 presents examples of complex questions on three different modalities. The middle box lists the natural language decompositions provided for each question, and the bottom box displays their corresponding formal queries.

QDMR serves as the formalism for creating Break, a question decomposition dataset of 83,978 questions over ten datasets and three modalities. Break is collected via crowdsourcing, with a user interface that allows us to train crowd-workers to produce quality decompositions (§3). Validating the quality of annotated structures reveals 97.4% to be correct (§4).

We demonstrate the utility of QDMR in two setups. First, we consider the task of open-domain QA over multi-hop questions from the HotpotQA dataset. Combining QDMR structures in Break with a reading comprehension (RC) model (Min et al., 2019b) improves F1 from 43.3 to 52.4 (§5). Second, we show that decompositions in Break possess high annotation consistency, which indicates that annotators produce high-quality QDMRs (§4.3). In §6 we discuss how these QDMRs can be used as a strong proxy for full logical forms in semantic parsing.

We use Break to train a neural QDMR parser that maps questions into QDMR representations, based on a sequence-to-sequence model with copying (Gu et al., 2016). Manual analysis of generated structures reveals an accuracy of 54%, showing that automatic QDMR parsing is possible, though still far from human performance (§7).

To conclude, our contributions are:

• Proposing the task of question understanding and introducing the QDMR formalism for representing the meaning of questions (§2)

• The Break dataset, which consists of 83,978 examples sampled from 10 datasets over three distinct information sources (§3)

• Showing how QDMR can be used to improve open-domain question answering (§5), as well as alleviate the burden of annotating logical forms in semantic parsing (§6)

• A QDMR parser based on a sequence-to-sequence model with a copying mechanism (§7)

The Break dataset, models, and entire codebase are publicly available at: https://github.com/tomerwolgithub/Break.

In this section we define the QDMR formalism for domain-agnostic question decomposition.

QDMR is primarily inspired by SQL (Codd, 1970; Chamberlin and Boyce, 1974). However, while SQL was designed for relational databases, QDMR also aims to capture the meaning of questions over unstructured sources such as text and images. Thus, our formalism abstracts away from SQL by assuming an underlying “idealized” KB, which contains all entities and relations expressed in the question. This abstraction frees QDMR from any particular modality: its operators can also be executed against text and images, and, in principle, multiple modalities can be queried for the same question.1

### QDMR Definition

Given a question $x$, its QDMR is a sequence of $n$ steps, $s = \langle s^1, \ldots, s^n \rangle$, where each step $s^i$ corresponds to a single query operator $f^i$ (see Table 1). A step $s^i$ is a sequence of tokens, $s^i = (s^i_1, \ldots, s^i_{m_i})$, where a token $s^i_k$ is either a word from a predefined lexicon $L_x$ (details in §3) or a reference token, referring to the result of a previous step $s^j$, where $j < i$. The last step, $s^n$, returns the answer to $x$.
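The definition above can be made concrete with a minimal sketch. It assumes that steps are plain strings and that reference tokens are written `#j`, as in the examples of Table 1; the helper names `references` and `is_well_formed` are illustrative and not part of the Break codebase.

```python
import re
from typing import List

REF = re.compile(r"^#(\d+)$")  # a reference token such as "#1"

def references(step: str) -> List[int]:
    """Return the 1-based indices of the steps referenced by `step`."""
    refs = []
    for tok in step.split():
        m = REF.match(tok.strip(",;"))
        if m:
            refs.append(int(m.group(1)))
    return refs

def is_well_formed(steps: List[str]) -> bool:
    """Check that every reference in step i points to an earlier step j < i."""
    for i, step in enumerate(steps, start=1):
        if any(j < 1 or j >= i for j in references(step)):
            return False
    return True

# Decomposition of "How many touchdowns were scored overall?" (Table 1).
qdmr = ["return touchdowns", "return the number of #1"]
print(is_well_formed(qdmr))
print(is_well_formed(["return the number of #1"]))  # refers to a step that does not exist
```

The check only enforces the ordering constraint $j < i$; it says nothing about whether the steps are semantically correct.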

Table 1:
The 13 operator types of QDMR steps. Listed are the natural language template used to express the operator, the operator signature, and an example question that uses the query operator in its decomposition.

| Operator | Template | Signature | Question | Decomposition |
|---|---|---|---|---|
| Select | Return [entities] | $w \to S_e$ | How many touchdowns were scored overall? | 1. Return touchdowns 2. Return the number of #1 |
| Filter | Return [ref] [condition] | $S_o, w \to S_o$ | I would like a flight from Toronto to San Diego please. | 1. Return flights 2. Return #1 from Toronto 3. Return #2 to San Diego |
| Project | Return [relation] of [ref] | $w, S_e \to S_o$ | Who is the head coach of the Los Angeles Lakers? | 1. Return the Los Angeles Lakers 2. Return the head coach of #1 |
| Aggregate | Return [aggregate] of [ref] | $w_{agg}, S_o \to n$ | How many states border Colorado? | 1. Return Colorado 2. Return border states of #1 3. Return the number of #2 |
| Group | Return [aggregate] [ref1] for each [ref2] | $w_{agg}, S_o, S_e \to S_n$ | How many female students are there in each club? | 1. Return clubs 2. Return female students of #1 3. Return the number of #2 for each #1 |
| Superlative | Return [ref1] where [ref2] is [highest / lowest] | $S_e, S_n, w_{sup} \to S_e$ | What is the keyword, which has been contained by the most number of papers? | 1. Return papers 2. Return keywords of #1 3. Return the number of #1 for each #2 4. Return #2 where #3 is highest |
| Comparative | Return [ref1] where [ref2] [comparison] [number] | $S_e, S_n, w_{com}, n \to S_e$ | Who are the authors who have more than 500 papers? | 1. Return authors 2. Return papers of #1 3. Return the number of #2 for each of #1 4. Return #1 where #3 is more than 500 |
| Union | Return [ref1] , [ref2] | $S_o, S_o \to S_o$ | Tell me who the president and vice-president are? | 1. Return the president 2. Return the vice-president 3. Return #1, #2 |
| Intersection | Return [relation] in both [ref1] and [ref2] | $w, S_e, S_e \to S_o$ | Show the parties that have representatives in both New York state and representatives in Pennsylvania state. | 1. Return representatives 2. Return #1 in New York state 3. Return #1 in Pennsylvania state 4. Return parties in both #2 and #3 |
| Discard | Return [ref1] besides [ref2] | $S_o, S_o \to S_o$ | Find the professors who are not playing Canoeing. | 1. Return professors 2. Return #1 playing Canoeing 3. Return #1 besides #2 |
| Sort | Return [ref1] sorted by [ref2] | $S_e, S_n \to \langle e_1 \ldots e_k \rangle$ | Find all information about student addresses, and sort by monthly rental. | 1. Return students 2. Return addresses of #1 3. Return monthly rental of #2 4. Return #2 sorted by #3 |
| Boolean | Return [if / is] [ref1] [condition] [ref2] | $S_o, w, S_o \to b$ | Were Scott Derrickson and Ed Wood of the same nationality? | ... 3. Return the nationality of #1 4. Return the nationality of #2 5. Return if #3 is the same as #4 |
| Arithmetic | Return the [arithmetic] of [ref1] and [ref2] | $w_{ari}, n, n \to n$ | How many more red objects are there than blue objects? | ... 3. Return the number of #1 4. Return the number of #2 5. Return the difference of #3 and #4 |

### Decomposition Graph

QDMR structures can be represented as a directed acyclic graph (DAG), used for evaluating QDMR parsing models (§7.1). Given a QDMR $s = \langle s^1, \ldots, s^n \rangle$, each step $s^i$ is a node in the graph, labeled by its sequence of tokens and index $i$. Edges in the graph are induced by reference tokens to previous steps. Node $s^i$ is connected by an incoming edge $(s^j, s^i)$ if $\mathrm{ref}[s^j] \in (s^i_1, \ldots, s^i_{m_i})$, that is, if one of the tokens in $s^i$ is a reference to $s^j$. Figure 2 displays a sequence of QDMR steps, represented as a DAG.
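As an illustration of how reference tokens induce the graph, the sketch below extracts the edge set from a list of steps. It assumes steps are strings with `#j` reference tokens; `decomposition_graph` is a hypothetical helper name, and the example is the Comparative decomposition from Table 1.

```python
import re
from typing import List, Set, Tuple

def decomposition_graph(steps: List[str]) -> Set[Tuple[int, int]]:
    """Return edges (j, i), meaning step i contains a reference to step j."""
    edges = set()
    for i, step in enumerate(steps, start=1):
        for ref in re.findall(r"#(\d+)", step):
            edges.add((int(ref), i))
    return edges

steps = [
    "return authors",
    "return papers of #1",
    "return the number of #2 for each of #1",
    "return #1 where #3 is more than 500",
]
edges = decomposition_graph(steps)
print(sorted(edges))
# Every edge points from an earlier step to a later one, so the graph has no cycles.
print(all(j < i for j, i in edges))
```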

Figure 2:

QDMR of the question “Return the keywords which have been contained by more than 100 ACL papers.”, represented as a decomposition graph.


### QDMR Operators

A QDMR step corresponds to one of 13 query operators. We designed the operators to be expressive enough to represent the meaning of questions from a diverse set of datasets (§3). QDMR assumes an underlying KB, $K$, which contains all of the entities and relations expressed in its steps. A relation, $r$, is a function mapping two arguments to whether $r$ holds in $K$: $[\![r(x,y)]\!]_K \in \{\text{true}, \text{false}\}$. The operators operate over: (i) sets of objects $S_o$, where objects $o$ are either numbers $n$, boolean values $b$, or entities $e$ in $K$; (ii) a closed set of phrases $w_{op}$, describing logical operations; and (iii) natural language phrases $w$, representing entities and relations in $K$. We assume the existence of grounding functions that map a phrase $w$ to concrete constants in $K$. Table 2 describes the aforementioned constructs. In addition, we define the function $\mathrm{map}_K(S_e, S_o)$, which maps each entity $e \in S_e$ to the set of corresponding objects from $S_o$. Each $o \in S_o$ corresponds to an $e \in S_e$ by being contained in the result of a sequence of PROJECT and GROUP operations applied to $e$: 2
$\mathrm{map}_K(S_e, S_o) = \{\langle e, o \rangle \mid e \in S_e,\ o \in S_o,\ o \in op_k \circ \cdots \circ op_1(e)\}.$
Table 2:
Functions used for grounding natural language phrases in numerical operators or KB entities.

| Function | Description |
|---|---|
| $\mathrm{agg}$ | Given a phrase $w_{agg}$ which describes an aggregate operation, $\mathrm{agg}$ denotes the corresponding operation. Either max, min, count, sum or avg. |
| $\mathrm{sup}$ | Given $w_{sup}$ describing a superlative, it denotes the corresponding function. Either argmax or argmin. |
| $\mathrm{com}$ | Given $w_{com}$ describing a comparison, it denotes the corresponding relation out of: <, ≤, >, ≥, =, ≠. |
| $\mathrm{ari}$ | Given $w_{ari}$ describing an arithmetic operation, it denotes the corresponding operation out of: +, −, *, /. |
| $\mathrm{ground}^e_K(w)$ | Given a natural language phrase $w$, it returns the set of corresponding KB entities, $S_e$. |
| $\mathrm{ground}^r_K(w)$ | Given a natural language phrase $w$, it returns the corresponding KB relation, $r$. |

We now formally define each QDMR operator and provide concrete examples in Table 1.

• SELECT: Computes the set of entities in $K$ corresponding to $w$: $\mathrm{select}(w) = \mathrm{ground}^e_K(w)$.

• FILTER: Filters a set of objects so that it follows the condition expressed by $w$:
$\mathrm{filter}(S_o, w) = S_o \cap \{o \mid [\![r(e,o)]\!]_K \equiv \text{true}\},$
where $r = \mathrm{ground}^r_K(w)$, $e = \mathrm{ground}^e_K(w)$.

• PROJECT: Computes the objects that relate to input entities $S_e$ with the relation expressed by $w$,
$\mathrm{proj}(w, S_e) = \{o \mid [\![r(e,o)]\!]_K \equiv \text{true},\ e \in S_e\},$
where $r = \mathrm{ground}^r_K(w)$.

• AGGREGATE: The result of applying an aggregate operation: $\mathrm{aggregate}(w_{agg}, S_o) = \{\mathrm{agg}(S_o)\}$.

• GROUP: Receives a set of “keys”, $S_e$, and a set of corresponding “values”, $S_o$. It outputs a set of numbers, each corresponding to a key $e \in S_e$. Each number results from applying the aggregate $w_{agg}$ to the subset of values corresponding to $e$.
$\mathrm{group}(w_{agg}, S_o, S_e) = \{\mathrm{agg}(V_o(e)) \mid e \in S_e\},$
where $V_o(e) = \{o \mid \langle e, o \rangle \in \mathrm{map}_K(S_e, S_o)\}$.

• SUPERLATIVE: Receives entity set $S_e$ and number set $S_n$. Each number $n \in S_n$ is the result of a mapping from an entity $e \in S_e$. It returns a subset of $S_e$ for which the corresponding number is either highest or lowest, as indicated by $w_{sup}$.
$\mathrm{super}(S_e, S_n, w_{sup}) = \{\mathrm{sup}(\mathrm{map}_K(S_e, S_n))\}.$

• COMPARATIVE: Receives entity set $S_e$ and number set $S_n$. Each $n \in S_n$ is the result of a mapping from an $e \in S_e$. It returns a subset of $S_e$ for which the comparison with $n'$, represented by $w_{com}$, holds.
$\mathrm{comparative}(S_e, S_n, w_{com}, n') = \{e \mid \langle e, n \rangle \in \mathrm{map}_K(S_e, S_n),\ \mathrm{com}(n, n') \equiv \text{true}\}.$

• UNION: Denotes the union of object sets: $\mathrm{union}(S_o^1, S_o^2) = S_o^1 \cup S_o^2$.

• DISCARD: Denotes the set difference of two object sets: $\mathrm{discard}(S_o^1, S_o^2) = S_o^1 \setminus S_o^2$.

• INTERSECTION: Computes the intersection of its entity sets and returns all objects which relate to the entities with the relation expressed by $w$.
$\mathrm{intersect}(w, S_e^1, S_e^2) = \{o \mid e \in S_e^1 \cap S_e^2,\ [\![r(e,o)]\!]_K \equiv \text{true},\ r = \mathrm{ground}^r_K(w)\}.$

• SORT: Orders a set of entities according to a corresponding set of numbers. Each number $n_i$ is the result of a mapping from entity $e_i$.
$\mathrm{sort}(S_e, S_n) = \{\langle e_{i_1} \ldots e_{i_m} \rangle \mid \langle e_{i_j}, n_{i_j} \rangle \in \mathrm{map}_K(S_e, S_n),\ n_{i_1} \le \ldots \le n_{i_m}\}.$

• BOOLEAN: Returns whether the relation expressed by $w$ holds between the input objects: $\mathrm{boolean}(S_o^1, w, S_o^2) = \{[\![r(o_1, o_2)]\!]_K\}$, where $r = \mathrm{ground}^r_K(w)$ and $S_o^1$, $S_o^2$ are singleton sets containing $o_1$, $o_2$ respectively.

• ARITHMETIC: Computes the application of an arithmetic operation: $\mathrm{arith}(w_{ari}, S_n^1, S_n^2) = \{\mathrm{ari}(n_1, n_2)\}$, where $S_n^1$, $S_n^2$ are singleton sets containing $n_1$, $n_2$ respectively.
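To make a few of these definitions tangible, here is a minimal sketch of SELECT, PROJECT, AGGREGATE, GROUP and COMPARATIVE over a toy KB. Everything in it is an assumption made for illustration: the KB contents, the lookup tables that stand in for the grounding functions, and the choice to return GROUP results as a dictionary from entity to number, which keeps the pairing that $\mathrm{map}_K$ provides in the formal definitions (and is restricted here to a single relation hop).

```python
from typing import Callable, Dict, Set, Tuple

# Toy "idealized" KB: a set of (relation, subject, object) triples.
KB: Set[Tuple[str, str, str]] = {
    ("papers", "alice", "p1"), ("papers", "alice", "p2"),
    ("papers", "bob", "p3"),
}
# Hypothetical lookup tables standing in for the grounding functions.
ENTITIES: Dict[str, Set[str]] = {"authors": {"alice", "bob"}}
AGG: Dict[str, Callable] = {"number": len, "max": max, "min": min}
COM: Dict[str, Callable] = {"more than": lambda a, b: a > b,
                            "less than": lambda a, b: a < b}

def select(w: str) -> Set[str]:
    """select(w): the entities that the phrase w is grounded to."""
    return set(ENTITIES[w])

def project(w: str, s_e: Set[str]) -> Set[str]:
    """proj(w, S_e): objects o such that r(e, o) holds for some e in S_e."""
    return {o for (r, e, o) in KB if r == w and e in s_e}

def aggregate(w_agg: str, s_o: Set) -> Set:
    """aggregate(w_agg, S_o): a singleton set holding agg(S_o)."""
    return {AGG[w_agg](s_o)}

def group(w_agg: str, s_o: Set[str], s_e: Set[str]) -> Dict[str, int]:
    """group: apply agg to V_o(e), the values in S_o linked to each key e."""
    return {e: AGG[w_agg]({o for (_, e2, o) in KB if e2 == e and o in s_o})
            for e in s_e}

def comparative(s_e: Set[str], mapping: Dict[str, int],
                w_com: str, n_prime: int) -> Set[str]:
    """comparative: the entities whose mapped number satisfies com(n, n')."""
    return {e for e in s_e if COM[w_com](mapping[e], n_prime)}

# "Who are the authors who have more than 1 paper?" on the toy KB.
authors = select("authors")                 # 1. return authors
papers = project("papers", authors)         # 2. return papers of #1
counts = group("number", papers, authors)   # 3. return the number of #2 for each of #1
print(comparative(authors, counts, "more than", 1))  # 4. return #1 where #3 is more than 1
```

Executing a QDMR against real text or images would replace the triple lookups with a model, as §5 does with an RC model; the sketch only mirrors the set-theoretic definitions.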

### High-level Decompositions

In QDMR, each step corresponds to a single logical operator. In certain contexts, a less granular decomposition might be desirable, where sub-structures containing multiple operators could be collapsed to a single node. This can be easily achieved in QDMR by merging certain adjacent nodes in its DAG structure. When examining existing RC datasets (Yang et al., 2018; Dua et al., 2019), we observed that long spans in the question often match long spans in the text, due to existing practices of generating questions via crowdsourcing. In such cases, decomposing the long spans into multiple steps and having an RC model process each step independently, increases the probability of error. Thus, to promote the usefulness of QDMR for current RC datasets and models, we introduce high-level QDMR, by merging the following operators:

• SELECT + PROJECT on named entities: For the question, “What is the birthdate of Jane?” its high-level QDMR would be “return the birthdate of Jane” as opposed to the more granular, “return Jane; return birthdate of #1”.

• SELECT + FILTER: Consider the first step of the example in Figure 3. It contains both a SELECT operator (“return actress”) as well as two FILTER conditions (“that played...”, “on the TV sitcom...”).

• FILTER + GROUP + COMPARATIVE: Certain high-level FILTER steps contain implicit grouping and comparison operations. E.g., “return yard line scores in the fourth quarter; return #1 that both teams scored from”. Step #2 contains an implicit GROUP of team per yard line and a COMPARATIVE returning the lines where exactly two teams scored.
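The SELECT + PROJECT merge can be sketched as a purely textual operation on the step list: the referenced step is inlined into the step that refers to it, and the remaining references are renumbered. This is an illustrative sketch only; `inline_step` is a hypothetical helper, and it does not reproduce how the high-level QDMRs in Break were actually produced.

```python
import re
from typing import List

def inline_step(steps: List[str], j: int) -> List[str]:
    """Merge step j (1-based) into every step that references it, then renumber."""
    # Text of step j without its leading "return".
    body = re.sub(r"^return\s+", "", steps[j - 1], flags=re.IGNORECASE)

    def rewrite(match: "re.Match") -> str:
        k = int(match.group(1))
        if k == j:
            return body                            # inline the merged step
        return f"#{k - 1}" if k > j else f"#{k}"   # shift later indices down by one

    return [re.sub(r"#(\d+)", rewrite, step)
            for i, step in enumerate(steps, start=1) if i != j]

granular = ["return Jane", "return birthdate of #1"]
print(inline_step(granular, 1))
```

References inside the inlined text need no renumbering, because a step can only refer to steps that precede it.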

Figure 3:

Example of a high-level QDMR. Step #1 merges together SELECT and multiple FILTER steps.


We provide both granular and high-level QDMRs for a random subset of RC questions (see Table 3). The concrete utility of high-level QDMR to open-domain QA is presented in §5.

Table 3:
The QA datasets in Break. Lists the number of examples in the original dataset and in Break. Numbers of high-level QDMRs are denoted by high.

| Dataset | Example | Original | Break |
|---|---|---|---|
| Academic (DB) | Return me the total citations of all the papers in the VLDB conference. | 195 | 195 |
| ATIS (DB) | What is the first flight from Atlanta to Baltimore that serves lunch? | 5,283 | 4,906 |
| GeoQuery (DB) | How high is the highest point in the largest state? | 880 | 877 |
| Spider (DB) | How many transactions correspond to each invoice number? | 10,181 | 7,982 |
| CLEVR-humans (Images) | What is the number of cylinders divided by the number of cubes? | 32,164 | 13,935 |
| NLVR2 (Images) | If there are only two dogs pulling one of the sleds? | 29,680 | 13,517 |
| ComQA (Text) | What was Gandhi’s occupation before becoming a freedom fighter? | 11,214 | 5,520 |
| CWQ (Text) | Robert E Jordan is part of the organization started by whom? | 34,689 | 2,988; 2,991 (high) |
| DROP (Text) | Approximately how many years did the churches built in 1909 survive? | 96,567 | 10,230; 10,262 (high) |
| HotpotQA-hard (Text) | Benjamin Halfpenny was a footballer for a club that plays its home matches where? | 23,066 | 10,575 (high) |
| Total | | | 83,978 |

Our annotation pipeline for generating Break consisted of three phases. First, we collected complex questions from existing QA benchmarks. Second, we crowdsourced the QDMR annotation of these questions. Finally, we validated worker annotations in order to maintain their quality.

### Question Collection

Questions in Break were randomly sampled from ten QA datasets over the following tasks (Table 3):

• Semantic Parsing: Mapping natural language utterances into formal queries, to be executed on a target KB (Price, 1990; Zelle and Mooney, 1996; Li and Jagadish, 2014; Yu et al., 2018).

• Reading Comprehension (RC): Questions that require understanding of a text passage by reasoning over multiple sentences (Talmor and Berant, 2018; Yang et al., 2018; Dua et al., 2019; Abujabal et al., 2019).

• Visual Question Answering (VQA): Questions over images that require both visual and numerical reasoning skills (Johnson et al., 2017; Suhr et al., 2019).

All questions collected were composed by human annotators.3 HotpotQA questions were all sampled from the hard split of the dataset.

### QDMR Annotation

A key question is whether it is possible to train non-expert annotators to produce high-quality QDMRs. We designed an annotation interface (Figure 4), where workers are first given explanations and examples on how to identify and phrase each of the operators in Table 1. Then, workers decompose questions into a list of steps, where they are only allowed to use tokens from a lexicon $L_x$, which contains: (a) words appearing in the question (or their automatically computed inflections), (b) words from a small predefined list of 66 function words, such as ‘if’, ‘on’, and ‘for each’, and (c) reference tokens that refer to the results of a previous step. This ensures that the language used by workers is consistent across examples, while being expressive enough for the decomposition. Our annotation interface presents workers with the question only, so they are agnostic to the original modality of the question. The efficacy of this process is explored in §4.2.
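The lexicon constraint can be sketched as a simple membership check. The function-word set below is a small illustrative subset (the 66-word list itself is not reproduced here), inflections of question words are omitted, and the names `build_lexicon` and `out_of_lexicon` are hypothetical.

```python
import re
from typing import List, Set

# Illustrative subset only; the actual list contains 66 function words.
FUNCTION_WORDS = {"return", "the", "of", "number", "if", "on", "for", "each",
                  "where", "is", "from", "to"}

def build_lexicon(question: str) -> Set[str]:
    """Lexicon = words of the question plus the function words."""
    return set(re.findall(r"[a-z0-9]+", question.lower())) | FUNCTION_WORDS

def out_of_lexicon(step: str, lexicon: Set[str]) -> List[str]:
    """Return the tokens of `step` that are neither in the lexicon nor references."""
    bad = []
    for tok in step.lower().split():
        tok = tok.strip(",.;")
        if re.fullmatch(r"#\d+", tok):
            continue  # reference token to a previous step
        if tok and tok not in lexicon:
            bad.append(tok)
    return bad

lexicon = build_lexicon("How many touchdowns were scored overall?")
print(out_of_lexicon("return the number of #1", lexicon))   # all tokens allowed
print(out_of_lexicon("return total points", lexicon))       # words outside the lexicon
```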

Figure 4:

User interface for decomposing a complex question that uses a closed lexicon of tokens.


We used Amazon Mechanical Turk to crowdsource QDMR annotation. In each task, workers decomposed questions and were paid $0.40 per question, which amounts to an average pay of $12 per hour. Overall, we collected 83,978 examples using 64 distinct workers. The dataset was partitioned into train/development/test sets following the partitions in the original datasets. During partitioning, we made sure that development and test samples do not share the same context.
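As a quick consistency check on these figures (an inference from the two stated rates, not a number reported in the text), the implied average annotation throughput is:

$\frac{\$12 \text{ per hour}}{\$0.40 \text{ per question}} = 30 \text{ questions per hour}, \qquad \frac{60 \text{ minutes}}{30 \text{ questions}} = 2 \text{ minutes per question}.$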

### Worker Validation

To ensure worker quality, we initially published qualification tasks, open to all workers in the United States. The task required workers to carefully review the annotation instructions and decompose 10 example questions. The examples were selected so that each QDMR operation should appear in at least one of their decompositions (Table 1). In total, 64 workers were able to correctly decompose at least 8 examples and were qualified as annotators. To validate worker performance over time, we conducted random validations of annotations. Over 9K annotations were reviewed by experts throughout the annotation process. Only workers who consistently produced correct QDMRs for at least 90% of their tasks were allowed to continue as annotators.

This section examines the properties of collected QDMRs in Break and analyzes their quality.

### 4.1 Quantitative Analysis

Overall, Break contains 83,978 decompositions, including 60,150 QDMRs and 23,828 examples with high-level QDMRs, which are exclusive to text modalities. Table 3 shows that data is proportionately distributed between questions over structured (DB) and unstructured modalities (text, images).

The distribution of QDMR operators is presented in Table 4, detailing the prevalence of each query operator4 (we automatically compute this distribution, as explained in §4.3). SELECT and PROJECT are the most common operators. Additionally, at least 10% of QDMRs contain operators such as GROUP and COMPARATIVE, which entail complex reasoning, in contrast to high-level QDMRs, where such operations are rare. This distinction sheds light on the reasoning types required for answering RC datasets (high-level QDMR) compared with more structured tasks (QDMR).

Table 4:
Operator prevalence in Break. Lists the percentage of QDMRs where the operator appears.

| Operator | QDMR | High-level QDMR |
|---|---|---|
| SELECT | 100% | 100% |
| PROJECT | 69.0% | 35.6% |
| FILTER | 53.2% | 15.3% |
| AGGREGATE | 38.1% | 22.3% |
| BOOLEAN | 30.0% | 4.6% |
| COMPARATIVE | 17.0% | 1.0% |
| GROUP | 9.7% | 0.7% |
| SUPERLATIVE | 6.3% | 13.0% |
| UNION | 5.5% | 0.5% |
| ARITHMETIC | 5.4% | 11.2% |
| INTERSECTION | 2.7% | 2.8% |
| SORT | 0.9% | 0.0% |
| Total | 60,150 | 23,828 |

Table 5 details the distribution of QDMR sequence length. Most decompositions in QDMR include 3–6 steps, whereas high-level QDMRs are much shorter, as a single SELECT often finds an entity described by a long noun phrase (see §2).

Table 5:
The distribution over QDMR sequence length.

| Steps | QDMR | High-level QDMR |
|---|---|---|
| 1–2 | 10.7% | 59.8% |
| 3–4 | 44.9% | 31.6% |
| 5–6 | 27.0% | 7.9% |
| 7–8 | 10.1% | 0.6% |
| 9+ | 7.4% | 0.2% |

### 4.2 Quality Analysis

We describe the process of estimating the correctness of collected QDMR annotations. Similar to previous works (Yu et al., 2018; Kwiatkowski et al., 2019) we use expert judgments, where the experts had prepared the guidelines for the annotation task. Given a question and its annotated QDMR, (q,s) the expert determines the correctness of s using one of the following categories:

• Correct ($C$): If $s$ constitutes a list of QDMR operations that lead to correctly answering $q$.

• Granular ($C_G$): If $s$ is correct and none of its operators can be further decomposed.5

• Incorrect ($I$): If $s$ is in neither $C$ nor $C_G$.

Examples of these expert judgments are shown in Figure 5. To estimate expert judgment of correctness, we manually reviewed a random sample of 500 QDMRs from Break. We classified 93.8% of the samples in $C_G$ and another 3.6% in $C$. Thus, 97.4% of the samples constitute a correct decomposition of the original question. Workers struggled somewhat with decomposing superlatives (e.g., “biggest sphere”), as is evident from the first question in Figure 5. Collected QDMRs displayed similar estimates of $C$, $C_G$, and $I$, regardless of their modality (DB, text, or image).

Figure 5:

Examples and justifications of expert judgment on collected QDMRs in Break.


### 4.3 Annotation Consistency

As QDMR is expressed using natural language, it introduces variability into its annotations. We wish to validate the consistency of collected QDMRs, that is, whether we can correctly infer the formal QDMR operator ($f^i$) and its arguments from each step ($s^i$). To infer these formal representations, we developed an algorithm that goes over the QDMR structure step-by-step, and for each step $s^i$, uses a set of predefined templates to identify $f^i$ and its arguments, expressed in $s^i$. This results in an execution graph (Figure 2), where the execution result of a parent node serves as input to its child. Figure 1 presents three QDMR decompositions along with the formal graphs output by our algorithm (lower box). Each node lists its operator (e.g., GROUP), its constant input listed in brackets (e.g., count) and its dynamic input, which are the execution results of its parent nodes.
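A template-based identifier of this kind might look like the sketch below. The regular expressions are our own guesses derived from the natural language templates in Table 1, not the templates used by the authors; the list of aggregate phrases in `AGG` is likewise an assumption. Templates are tried from most to least specific, and the first match wins.

```python
import re
from typing import Optional

AGG = r"(?:number|sum|average|avg|maximum|max|minimum|min)"  # assumed aggregate phrases

# Ordered from most to least specific; the first matching template wins.
TEMPLATES = [
    ("ARITHMETIC",   r"^return the (?:sum|difference|multiplication|division) of #\d+ and #\d+$"),
    ("GROUP",        rf"^return (?:the )?{AGG} of #\d+ for each (?:of )?#\d+$"),
    ("AGGREGATE",    rf"^return (?:the )?{AGG} of #\d+$"),
    ("SUPERLATIVE",  r"^return #\d+ where #\d+ is (?:the )?(?:highest|lowest)$"),
    ("COMPARATIVE",  r"^return #\d+ where #\d+ .+$"),
    ("INTERSECTION", r"^return .+ in both #\d+ and #\d+$"),
    ("DISCARD",      r"^return #\d+ besides #\d+$"),
    ("SORT",         r"^return #\d+ sorted by #\d+$"),
    ("UNION",        r"^return #\d+(?: ?, ?#\d+)+$"),
    ("BOOLEAN",      r"^return (?:if|is) .+$"),
    ("PROJECT",      r"^return .+ of #\d+$"),
    ("FILTER",       r"^return #\d+ .+$"),
    ("SELECT",       r"^return [^#]+$"),
]

def identify_operator(step: str) -> Optional[str]:
    """Return the name of the first template matching `step`, or None."""
    text = re.sub(r"\s+", " ", step.strip().lower())
    for name, pattern in TEMPLATES:
        if re.match(pattern, text):
            return name
    return None

for s in ["Return authors", "Return papers of #1",
          "Return the number of #2 for each of #1",
          "Return #1 where #3 is more than 500"]:
    print(s, "->", identify_operator(s))
```

Extracting the arguments of each operator would require capture groups in the patterns; they are left out to keep the sketch short.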

Overall, 99.5% of QDMRs had all their steps mapped into pseudo-logical forms by our algorithm. To evaluate the correctness of the mapping algorithm, we randomly sampled 350 logical forms, and examined the structure of the formulas, assuming that words copied from the question correspond to entities and relations in an idealized KB (see §2). Of this sample, 99.4% of its examples had all of their steps, $s^i$, correctly mapped to the corresponding $f^i$. Overall, 93.1% of the examples were of fully accurate logical forms, with errors being due to QDMRs that were either incorrect or not fully decomposed ($I$, $C$ in §4.2). Thus, a rule-based algorithm can map more than 93% of the annotations into a correct formal representation. This shows that our annotators produced consistent and high-quality QDMRs. Moreover, it suggests that non-experts can annotate questions with pseudo-logical forms, which can be used as a cheap intermediate representation for semantic parsers (Yih et al., 2016), further discussed in §6.

A natural setup for QDMR is in answering complex questions that require multiple reasoning steps. We compare models that exploit question decompositions to baselines that do not. We use the open-domain QA (“full-wiki”) setting of the HotpotQA dataset (Yang et al., 2018): Given a question, the QA model retrieves the relevant Wikipedia paragraphs and answers the question using these paragraphs.

### 5.1 Experimental Setup

We compare BreakRC, a model that utilizes question decomposition, to BERTQA, a standard QA model based on BERT (Devlin et al., 2019), and present Combined, an approach that enjoys the benefits of both models.

#### BreakRC

Algorithm 1 describes the BreakRC model, which uses high-level QDMR structures for answering open-domain multi-hop questions. We assume access to an Information Retrieval (IR) model and an RC model, and denote by Answer(⋅) a function that takes a question as input, runs the IR model to obtain paragraphs, and then feeds those paragraphs as context for an RC model that returns a distribution over answers.

Given an input QDMR, $s = \langle s_1, \ldots, s_n \rangle$, we iterate over $s$ step-by-step and perform the following. First, we extract the operation (line 4) and the previous steps referenced by $s_i$ (line 5). Then, we compute the answer to $s_i$ conditioned on the extracted operator. For SELECT steps, we simply run the Answer(⋅) function. For PROJECT steps, we substitute the reference to the previous step in $s_i$ with its already computed answer, and then run Answer(⋅). For FILTER steps,⁶ we use a simple rule to extract a “normalized question”, $\hat{s}_i$, from $s_i$ and get an intermediate answer $\mathit{ans}_{tmp}$ with Answer($\hat{s}_i$). We then “intersect” $\mathit{ans}_{tmp}$ with the referenced answer by multiplying the probabilities provided by the RC model and normalizing. For COMPARISON steps, we compare, with a discrete operation, the numbers returned by the referenced steps. The final answer is the highest-probability answer of step $s_n$.
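The step-by-step procedure can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation of Algorithm 1: `answer_fn` stands in for Answer(⋅), `normalize` stands in for the unspecified rule that produces the normalized question, substituting the single top answer into PROJECT steps is an assumption, and COMPARISON steps are left out.

```python
import re
from typing import Callable, Dict, List

Distribution = Dict[str, float]  # answer string -> probability
REF = re.compile(r"#(\d+)")


def top_answer(dist: Distribution) -> str:
    """Highest-probability answer, or an empty string for an empty distribution."""
    return max(dist, key=lambda ans: dist[ans]) if dist else ""


def substitute_refs(step: str, answers: List[Distribution]) -> str:
    """Replace each '#k' with the top answer of step k (1-indexed)."""
    return REF.sub(lambda m: top_answer(answers[int(m.group(1)) - 1]), step)


def intersect(a: Distribution, b: Distribution) -> Distribution:
    """Multiply the probabilities of shared answers, then renormalize."""
    product = {ans: a[ans] * b[ans] for ans in a if ans in b}
    total = sum(product.values())
    return {ans: p / total for ans, p in product.items()} if total > 0 else {}


def run_break_rc(
    steps: List[str],
    operators: List[str],
    answer_fn: Callable[[str], Distribution],  # hypothetical stand-in for Answer(.)
    normalize: Callable[[str], str],           # hypothetical normalization rule
) -> str:
    answers: List[Distribution] = []
    for step, op in zip(steps, operators):
        if op == "SELECT":
            dist = answer_fn(step)
        elif op == "PROJECT":
            dist = answer_fn(substitute_refs(step, answers))
        elif op == "FILTER":
            ref = int(REF.search(step).group(1)) - 1
            dist = intersect(answer_fn(normalize(step)), answers[ref])
        else:
            raise NotImplementedError(f"operator not covered by this sketch: {op}")
        answers.append(dist)
    # The final answer is the highest-probability answer of the last step.
    return top_answer(answers[-1])
```

Passing the IR and RC components in as a single callable keeps the control flow over operators separate from the models that produce the answer distributions.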

As our IR model we use bigram TF-IDF, proposed by Chen et al. (2017). Because the RC model is run on single-hop questions, we use the BERT-based RC model from Min et al. (2019b), trained solely on SQuAD (Rajpurkar et al., 2016).

#### BERTQA Baseline

As BreakRC exploits question decompositions, we compare it with a model that does not. BERTQA receives as input the original natural language question, x. It uses the same IR model as BreakRC to retrieve paragraphs for x. For a fair comparison, we set its number of retrieved paragraphs such that it is identical to BreakRC (namely, 10 paragraphs for each QDMR step that involves IR). Similar to BreakRC, retrieved paragraphs are fed to a pretrained BERT-based RC model (Min et al., 2019b) to answer x. In contrast to BreakRC, which is trained on SQuAD, BERTQA is trained on the target dataset (HotpotQA), giving it an advantage over BreakRC.

#### A Combined Approach

Last, we present an approach that combines the strengths of BreakRC and BERTQA. In this approach, we use the QDMR decomposition to improve retrieval only. Given a question x and its QDMR s, we run BreakRC on s, but in addition to storing answers, we also store all the paragraphs retrieved by the IR model. We then run BERTQA on the question x and the top-10 paragraphs retrieved by BreakRC, sorted by their IR ranking. This approach resembles that of Qi et al. (2019).
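One way to picture the retrieval side of Combined is the short Python sketch below. It is only an interpretation: the text does not say how paragraphs retrieved at different steps are merged, so keeping each paragraph's best IR score and sorting by that score is an assumption, and the `(paragraph_id, score)` input format is hypothetical.

```python
from typing import Dict, List, Tuple


def top_paragraphs(per_step: List[List[Tuple[str, float]]], k: int = 10) -> List[str]:
    """Merge per-step retrieval results and keep the k best-ranked paragraphs.

    per_step holds one ranked list of (paragraph_id, ir_score) per QDMR step.
    Assumption: a paragraph retrieved at several steps keeps its best score.
    """
    best: Dict[str, float] = {}
    for ranked in per_step:
        for pid, score in ranked:
            if pid not in best or score > best[pid]:
                best[pid] = score
    # Sort by descending score; ties are broken by id to stay deterministic.
    ordered = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return [pid for pid, _ in ordered[:k]]
```

The resulting list would then be passed, together with the original question x, to the end-to-end reader.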

The advantage of Combined is that we do not need to develop an answering procedure for each QDMR operator separately, which involves different discrete operations such as comparison and intersection. Instead, we use BreakRC to retrieve contexts, and an end-to-end approach to learn how to answer the question directly. This can often handle operators not implemented in BreakRC, like BOOLEAN and UNION.

#### Dataset

To evaluate our models, we use all 2,765 QDMR-annotated examples of the HotpotQA development set found in Break. PROJECT and COMPARISON type questions account for 48% and 7% of examples, respectively.

### 5.2 Results

Table 6 shows model performance on HotpotQA. We report EM and F1 using the official HotpotQA evaluation script. IR measures the percentage of examples in which the IR model successfully retrieved both of the “gold paragraphs” necessary for answering the multi-hop question. To assess the potential utility of QDMR, we report results for BreakRCG, which uses gold QDMRs, and BreakRCP, which uses QDMRs predicted by a Copynet parser (§7.2).
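The IR measure described above amounts to a subset check per example. A minimal Python sketch, assuming paragraphs are identified by hypothetical string ids:

```python
from typing import List, Set


def ir_score(retrieved: List[Set[str]], gold: List[Set[str]]) -> float:
    """Percentage of examples whose retrieved set contains every gold paragraph."""
    if not gold:
        return 0.0
    hits = sum(1 for r, g in zip(retrieved, gold) if g <= r)  # g is a subset of r
    return 100.0 * hits / len(gold)
```

An example counts as a hit only when both gold paragraphs are present, so retrieving one of the two contributes nothing to the score.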

Table 6: Open-domain QA results on HotpotQA.

| Model | EM | F1 | IR |
|---|---|---|---|
| BERTQA | 33.6 | 43.3 | 46.3 |
| BreakRCP | 28.8 | 37.7 | 52.5 |
| BreakRCG | 34.6 | 44.6 | 59.2 |
| CombinedP | 38.3 | 49.3 | 52.5 |
| CombinedG | 41.2 | 52.4 | 59.2 |
| IR-NP | 31.7 | 41.2 | 40.8 |
| BreakRCR | 18.9 | 26.5 | 40.3 |
| CombinedR | 32.7 | 42.6 | 40.3 |

Retrieving paragraphs with decomposed questions substantially improves the IR metric from 46.3 to 59.2 (BreakRCG), or 52.5 (BreakRCP). This leads to substantial gains for CombinedG (EM from 33.6 to 41.2, F1 from 43.3 to 52.4) and CombinedP (EM from 33.6 to 38.3, F1 from 43.3 to 49.3). The EM and F1 of BreakRCG are only slightly higher than those of BERTQA because BreakRC does not handle certain operators, such as BOOLEAN steps (9.4% of the examples).

The majority of questions in HotpotQA combine SELECT operations with either PROJECT (also called “bridge” questions), COMPARISON, or FILTER. PROJECT and COMPARISON questions (Figure 6) were shown to be less susceptible to reasoning shortcuts, i.e. they necessitate multi-step reasoning (Chen and Durrett, 2019; Jiang and Bansal, 2019; Min et al., 2019a). In Table 7 we report BreakRC results on these question types, where it notably outperforms BERTQA.

Figure 6: Examples of PROJECT and COMPARISON questions in HotpotQA (high-level QDMR).

Table 7: Results on PROJECT and COMPARISON questions from HotpotQA development set.

| Model | Project EM | Project F1 | Project IR | Comparison EM | Comparison F1 | Comparison IR |
|---|---|---|---|---|---|---|
| BERTQA | 22.8 | 31.0 | 31.6 | 42.9 | 51.7 | 75.8 |
| BreakRCP | 25.4 | 33.7 | 52.9 | 34.7 | 50.4 | 68.9 |
| BreakRCG | 32.2 | 41.9 | 59.8 | 44.5 | 57.6 | 78.0 |

#### Ablations

In BreakRC, multiple IR queries are issued, one at each step. To examine whether these multiple queries were the cause for performance gains, we built IR-NP, a model that issues multiple IR queries, one for each noun phrase in the question. Similar to Combined, the question and union of retrieved paragraphs are given as input to BERTQA. We observe that Combined substantially outperforms IR-NP, indicating that the structure of QDMR, rather than multiple IR queries, has led to improved performance.⁷

To test whether QDMR is better than a simple rule-based decomposition algorithm, we developed a model that decomposes a question by applying a set of predefined rules over the dependency tree of the question (full details in §7.2). Combined and BreakRC were compared to CombinedR and BreakRCR, which use the rule-based decompositions. We observe that QDMR leads to substantially higher performance when compared to the rule-based decompositions.

As QDMR structures can be easily annotated at scale, a natural question is how far they are from fully executable queries (known to be expensive to annotate). As shown in §4.3, QDMRs can be mapped to pseudo-logical forms with high precision (93.1%) by extracting formal operators and arguments from their steps. The pseudo-logical form differs from an executable query in the lack of grounding of its arguments (entities and relations) in KB constants. This stems from the design of QDMR as a domain-agnostic meaning representation (§2). QDMR abstracts away from a concrete KB schema by assuming an underlying “idealized” KB, which contains all of its arguments.

Thus, QDMR can be viewed as an intermediate representation between a natural language question and an executable query. Such intermediate representations have already been discussed in prior work on semantic parsing. Kwiatkowski et al. (2013) and Choi et al. (2015) used underspecified logical forms as an intermediate representation. Guo et al. (2019) proposed a two-stage approach, separating between learning an intermediate text-to-SQL representation and the actual mapping to schema items. Works in the database community have particularly targeted the mapping of intermediate query representations into DB grounded queries, using schema mapping and join path inference (Androutsopoulos et al., 1995; Li et al., 2014; Baik et al., 2019). We argue that QDMR can be used as an easy-to-annotate representation in such semantic parsers, bridging between natural language and full logical forms.

We now present evaluation metrics and models for mapping questions into QDMR structures.

Given a question $x$, we wish to map it to its QDMR steps, $s = \langle s_1, \ldots, s_n \rangle$. One can frame this as a sequence-to-sequence problem where $x$ is mapped to a string representing its decomposition. We add a special separating token 〈SEP〉, and define the target string to be $s_1^1, \ldots, s_{m_1}^1, \langle\mathrm{SEP}\rangle, s_1^2, \ldots, s_{m_2}^2, \langle\mathrm{SEP}\rangle, \ldots, s_1^n, \ldots, s_{m_n}^n$, where $m_1, \ldots, m_n$ are the number of tokens in each decomposition step.
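The target-string construction can be sketched in a few lines of Python. Whitespace tokenization and the literal `<SEP>` string are assumptions made for illustration; the tokenizer used for the models is not specified in this section.

```python
from typing import List

SEP = "<SEP>"  # stands in for the special separating token


def encode_steps(steps: List[str]) -> List[str]:
    """Flatten QDMR steps into one token sequence with SEP between steps."""
    tokens: List[str] = []
    for i, step in enumerate(steps):
        if i > 0:
            tokens.append(SEP)
        tokens.extend(step.split())  # assumption: whitespace tokenization
    return tokens


def decode_steps(tokens: List[str]) -> List[str]:
    """Invert encode_steps: split the token sequence back into steps."""
    if not tokens:
        return []
    steps: List[str] = []
    current: List[str] = []
    for tok in tokens:
        if tok == SEP:
            steps.append(" ".join(current))
            current = []
        else:
            current.append(tok)
    steps.append(" ".join(current))
    return steps
```

Decoding a predicted sequence this way recovers the individual steps, which is what the graph-based metrics operate on.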

### 7.1 Evaluation Metrics

We wish to assess the quality of a predicted QDMR, $\hat{s}$, against a gold standard, $s$. Figure 7 lists various properties by which question decompositions may differ, such as granularity (e.g., steps 1–3 of decomposition 1 are merged into the first step of decomposition 2), ordering (e.g., the last two steps are swapped) and wording (e.g., using “from” instead of “on”). While such differences do not affect the overall semantics, the second decomposition can be further decomposed. To measure such variations, we introduce two types of evaluation metrics. Sequence-based metrics treat the decomposition as a sequence of tokens, applying standard text generation metrics. As such metrics ignore the QDMR graph structure, we also use graph-based metrics that compare the predicted graph $G_{\hat{s}}$ to the gold QDMR graph $G_s$ (see §2).

Figure 7: Differences in granularity, step order, and wording between two decompositions.

Sequence-based scores, where higher values are better, are denoted by ⇑. Graph-based scores, where lower values are better, are denoted by ⇓.

• Exact Match ⇑: Measures exact match between $s$ and $\hat{s}$, either 0 or 1.

• SARI ⇑ (Xu et al., 2016): SARI is commonly used in tasks such as text simplification. Given $s$, we consider the sets of added, deleted, and kept n-grams when mapping the question $x$ to $s$. We compute these three sets for both $s$ and $\hat{s}$ using the standard of up to 4-grams, then average (a) the F1 for added n-grams between $s$ and $\hat{s}$, (b) the F1 for kept n-grams, and (c) the precision for the deleted n-grams.

• Graph Edit Distance (GED) ⇓: A graph edit path is a sequence of node and edge edit operations (addition, deletion, and substitution), where each operation has a predefined cost. GED computes the minimal-cost graph edit path required for transitioning from $G_s$ to $G_{\hat{s}}$ (and vice versa), normalized by $\max(|G_s|, |G_{\hat{s}}|)$. Operation costs are 1 for insertion and deletion of nodes and edges. The substitution cost of two nodes $u, v$ is set to be $1 - \mathrm{Align}(u, v)$, where $\mathrm{Align}(u, v)$ is the ratio of aligned tokens between these steps.

• GED+ ⇓: Comparing the QDMR graphs in Figure 8, we consider the splitting and merging of graph nodes. We implement GED+, a variant of GED with additional operations to merge (split) a set of nodes (node), based on the A* algorithm (Hart et al., 1968).⁸
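The simpler ingredients of these metrics can be sketched in Python. The sketch covers Exact Match, the node substitution cost, and the GED normalization only; it does not search for the minimal-cost edit path. The `align` function is an assumption: the alignment procedure is not spelled out above, so shared tokens divided by the length of the longer step is used as a stand-in, and the graph sizes are taken as plain integers supplied by the caller.

```python
from collections import Counter
from typing import List


def exact_match(gold: List[str], pred: List[str]) -> int:
    """1 if the predicted steps equal the gold steps, else 0."""
    return int(gold == pred)


def align(u: str, v: str) -> float:
    """Assumed alignment ratio: shared tokens over the longer step's length."""
    tu, tv = u.split(), v.split()
    if not tu and not tv:
        return 1.0
    shared = sum((Counter(tu) & Counter(tv)).values())  # multiset intersection
    return shared / max(len(tu), len(tv))


def substitution_cost(u: str, v: str) -> float:
    """Cost of substituting node u with node v: 1 - Align(u, v)."""
    return 1.0 - align(u, v)


def normalize_ged(raw_cost: float, gold_size: int, pred_size: int) -> float:
    """Divide an edit-path cost by the size of the larger graph."""
    largest = max(gold_size, pred_size)
    return raw_cost / largest if largest > 0 else 0.0
```

For instance, substituting “the youngest of #1” with “youngest teacher” shares one token out of four, giving a cost of 0.75 under this assumed alignment.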

Figure 8: Graph edit operations between the graphs of the two QDMRs in Figure 7.

### 7.2 QDMR Parsing Models

We present models for QDMR parsing, built over AllenNLP (Gardner et al., 2017).

• Copy: A model that copies the input question $x$, without introducing any modifications.

• RuleBased: We defined 12 decomposition rules, to be applied over the dependency tree of the question, augmented with coreference relations. A rule is a regular expression over the question dependency tree, which invokes a decomposition operation when matched (Table 8). For example, the rule for relative clauses (relcl) breaks the question at the relative pronoun “that”, while adding a reference to the preceding part of the sentence. A full decomposition is obtained by recursively applying the rules until no rule is matched.

• Seq2Seq: A sequence-to-sequence neural model with a 5-layer LSTM encoder and attention at decoding time.

• S2SDynamic: Seq2Seq with a dynamic output vocabulary restricted to the closed set of tokens $L_x$ available to crowd-workers (see §3).

• Copynet: Seq2Seq with an added copy mechanism that allows copying tokens from the input sequence (Gu et al., 2016).

Table 8: The decomposition rules of RuleBased. Rules are based on dependency labels, part-of-speech tags and coreference edges.

| Structure | Example question | Decomposition |
|---|---|---|
| be-root | How many objects smaller than the matte object are silver | [objects smaller than the matte object, How many #1 silver] |
| be-auxpass | Find the average rating star for each movie that are not reviewed by Brittany Harris. | [Brittany Harris, the average rating star for each movie that not reviewed by #1] |
| do-subj | Year did the team with Baltimore Fight Song win the Superbowl? | [team with Baltimore Fight Song, year did #1 win the Superbowl] |
| subj-do-have | Which team owned by Malcolm Glazer has Tim Howard playing? | [team Tim Howard playing, #1 owned by Malcolm Glazer] |
| conjunction | Who trades with China and has a capital city called Khartoum? | [Who has a capital city called Khartoum, #1 trades with China] |
| how-many | How many metallic objects appear in this image? | [metallic objects appear in this image, the number of #1] |
| single-prep | Find the ids of the problems reported after 1978. | [the problems reported after 1978, ids of #1] |
| multi-prep | what flights from Tacoma to Orlando on Saturday | [flights, #1 from Tacoma, #2 to Orlando, #3 on Saturday] |
| relcl | Find all the songs that do not have a back vocal. | [all the songs, #1 that do not have a back vocal] |
| superlative | What is the smallest state bordering ohio | [state bordering ohio, the smallest #1] |
| acl-verb | Find the first names of students studying in 108. | [students, #1 studying in 108, first names of #2] |
| sent-coref | Find the claim that has the largest total settlement amount. Return the effective date of the claim | [the claim that has the largest total settlement amount, the effective date of #1] |

### 7.3 Results

Table 9 presents model performance on Break. Neural models outperform the RuleBased baseline and perform reasonably well, with Copynet obtaining the best scores across all metrics. This can be attributed to most of the tokens in a QDMR parse being copied from the original question.

Table 9: Performance of QDMR parsing models on the development and test set. GED+ is computed only for the subset of QDMR graphs with up to 5 nodes, covering 66.1% of QDMRs and 97.6% of high-level data.

| Data | Metric | Copy | RuleBased | Seq2Seq | S2SDynamic | Copynet | Copynet (test) |
|---|---|---|---|---|---|---|---|
| QDMR | Exact Match ⇑ | 0.001 | 0.002 | 0.081 | 0.116 | 0.154 | 0.157 |
| QDMR | SARI ⇑ | 0.431 | 0.508 | 0.665 | 0.705 | 0.748 | 0.746 |
| QDMR | GED ⇓ | 0.937 | 0.799 | 0.398 | 0.363 | 0.318 | 0.322 |
| QDMR | GED+ ⇓ | 1.813 | 1.722 | 1.424 | 1.137 | 0.941 | 0.984 |
| QDMRhigh | Exact Match ⇑ | 0.001 | 0.010 | 0.001 | 0.015 | 0.081 | 0.083 |
| QDMRhigh | SARI ⇑ | 0.501 | 0.554 | 0.379 | 0.504 | 0.722 | 0.722 |
| QDMRhigh | GED ⇓ | 0.793 | 0.659 | 0.585 | 0.468 | 0.319 | 0.316 |
| QDMRhigh | GED+ ⇓ | 1.102 | 1.395 | 1.655 | 1.238 | 0.716 | 0.709 |

#### Error Analysis

To judge the quality of predicted QDMRs, we sampled 100 predictions of Copynet (Table 10), half of them being high-level QDMRs. For standard QDMR, 24% of the sampled predictions were an exact match, with an additional 30% being fully decomposed and semantically equivalent to the gold decompositions. For example, in the first row of Table 10, the gold decomposition first discards the cylinders, then counts the remaining objects. Instead, Copynet opted to count both groups, then subtract the number of cylinders from the number of objects. This illustrates how different QDMRs may be equivalent.

Table 10: Manual error analysis of the Copynet model predictions. Lower examples are of high-level QDMRs.

| Question | Gold | Prediction (Copynet) | Analysis |
|---|---|---|---|
| “How many objects other than cylinders are there?” | (1) objects; (2) cylinders; (3) #1 besides #2; (4) number of #3. | (1) objects; (2) cylinders; (3) number of #1; (4) number of #2; (5) difference of #3 and #4. | sem. equiv. (30%) |
| “Where is the youngest teacher from?” | (1) teachers; (2) the youngest of #1; (3) where is #2 from. | (1) youngest teacher; (2) where is #1. | incorrect (46%) |
| “Kyle York is the Chief Strategy Officer of a company acquired by what corporation in 2016?” | (1) company that Kyle York is the Chief Strategy Officer of; (2) corporation that acquired #1 in 2016. | (1) company that Kyle York is the Chief Strategy Officer of; (2) corporation in 2016 that #1 was acquired by. | sem. equiv. (46%) |
| “Dayton’s Devils had a cameo from the ‘MASH’ star who played what role on the show?” | (1) MASH star that Dayton ’s Devils had a cameo from; (2) role that #1 played on the show. | (1) the MASH that Dayton ’s Devils had a cameo; (2) what role on the show star of #1 played. | incorrect (46%) |

For high-level examples (from RC datasets), as questions are often less structured, they require a deeper semantic understanding from the decomposition model. Only 8% of the predictions were an exact match, with an additional 46% being semantically equivalent to the gold. The remaining 46% were erroneous predictions (see Table 10).

### Question Decomposition

Recent work on QA through question decomposition has focused mostly on single modalities (Gupta and Lewis, 2018; Guo et al., 2019; Min et al., 2019b). QA using neural modular networks has been suggested for both KBs and images by Andreas et al. (2016) and Hu et al. (2017). Question decomposition over text was proposed by Talmor and Berant (2018), though over a much more limited set of questions than in Break. Iyyer et al. (2017) have also decomposed questions to create a “sequential question answering” task. Their annotators viewed a web table and performed actions over it to retrieve the cells that constituted the answer. Conversely, we provided annotators only with the question, as QDMR is agnostic to the original context.

An opposite annotation cycle to ours was presented in Cheng et al. (2018). The authors generate sequences of simple questions which crowd-workers paraphrase into a compositional question. Questions in Break are composed by humans, and are then decomposed to QDMR.

### Semantic Formalism Annotation

Labeling corpora with a semantic formalism has often been reserved for expert annotators (Dahl et al., 1994; Zelle and Mooney, 1996; Abend and Rappoport, 2013; Yu et al., 2018). Recent work has focused on cheaply eliciting quality annotations from non-experts through crowdsourcing (He et al., 2016; Iyer et al., 2017; Michael et al., 2018). FitzGerald et al. (2018) facilitated non-expert annotation by introducing a formalism expressed in natural language for semantic-role-labeling. This mirrors QDMR, as both are expressed in natural language.

### Relation to Other Formalisms

QDMR is related to Dependency-based Compositional Semantics (Liang et al., 2013), as both focus on question representations. However, QDMR is designed to facilitate annotations, while Dependency-based Compositional Semantics is centered on paralleling syntax. Domain-independent intermediate representations for semantic parsers were proposed by Kwiatkowski et al. (2013) and Reddy et al. (2016). As there is no consensus on the ideal meaning representation for semantic parsing, representations are often chosen based on the particular execution setup: SQL is used for relational databases (Yu et al., 2018), SPARQL for graph KBs (Yih et al., 2016), while other ad-hoc languages are used based on the task at hand. We frame QDMR as an easy-to-annotate formalism that can be potentially converted to other representations, depending on the task. Last, AMR (Banarescu et al., 2013) is a meaning representation for sentences. Instead of representing general language, QDMR represents questions, which are important for QA systems, and for probing models for reasoning.

In this paper, we presented a formalism for question understanding. We have shown it is possible to train crowd-workers to produce such representations with high quality at scale, and created Break, a benchmark for question decomposition with over 83K decompositions of questions from 10 datasets and 3 modalities (DB, images, text). We presented the utility of QDMR for both open-domain question answering and semantic parsing, and constructed a QDMR parser with reasonable performance. QDMR proposes a promising direction for modeling question understanding, which we believe will be useful for multiple tasks in which reasoning is probed through questions.

This work was completed in partial fulfillment for the PhD of Tomer Wolfson. This research was partially supported by The Israel Science Foundation (grants 942/16 and 978/17), The Yandex Initiative for Machine Learning, and the European Research Council (ERC) under the European Union Horizon 2020 research and innovation programme (grant ERC DELPHI 802800).

1. A system could potentially answer “Name the political parties of the most densely populated country”, by retrieving “the most densely populated country” using a database query, and “the political parties of #1” via an RC model.

2. The sequence of operations op1, …, opk is traced using the references to previous steps in the QDMR structure.

3. Except for ComplexWebQuestions (CWQ), where annotators paraphrased automatically generated questions.

4. Regarding the three merged operators of high-level QDMRs (§2), the first two operators are treated as SELECT, while the third is considered a FILTER.

5. For high-level QDMRs, the merged operators (§2) are considered to be fully decomposed.

6. INTERSECTION steps are handled in a manner similar to FILTER, but we omit the exact description for brevity.

7. Issuing an IR query over each “content word” in the question, instead of each noun phrase, led to poor results.

8. Because of its exponential worst-case complexity, we compute GED+ only for graphs with up to 5 nodes, covering 75.2% of the examples in the development set of Break.

Omri Abend and Ari Rappoport. 2013. Universal conceptual cognitive annotation (UCCA). In Association for Computational Linguistics (ACL).
Abdalghani Abujabal, Rishiraj Saha Roy, Mohamed Yahya, and Gerhard Weikum. 2019. ComQA: A community-sourced dataset for complex factoid question answering with paraphrase clusters. In North American Association for Computational Linguistics (NAACL).
Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Learning to compose neural networks for question answering. In Human Language Technology and North American Association for Computational Linguistics (HLT/NAACL).
Ion Androutsopoulos, Graeme D. Ritchie, and Peter Thanisch. 1995. Natural language interfaces to databases – an introduction. Journal of Natural Language Engineering, 1:29–81.
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question answering. In International Conference on Computer Vision (ICCV), pages 2425–2433.
Christopher Baik, Hosagrahar Visvesvaraya, and Yunyao Li. 2019. Bridging the semantic gap with SQL query logs in natural language interfaces to databases. 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 374–385.
Laura Banarescu, Claire Bonial, Shu Cai, Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In 7th Linguistic Annotation Workshop and Interoperability with Discourse.
Donald D. Chamberlin and Raymond F. Boyce. 1974. SEQUEL: A structured English query language. In Proceedings of the 1974 ACM SIGFIDET (now SIGMOD) Workshop on Data Description, Access and Control, pages 249–264. ACM.
Danqi Chen, Fisch, Jason Weston, and Antoine Bordes. 2017. In Association for Computational Linguistics (ACL).
Jifan Chen and Greg Durrett. 2019. Understanding dataset design choices for multi-hop reasoning. In Association for Computational Linguistics (ACL).
Jianpeng Cheng, Siva Reddy, and Mirella Lapata. 2018. Building a neural semantic parser from a domain ontology. ArXiv, abs/1812.10037. Version 1.
Eunsol Choi, Tom Kwiatkowski, and Luke Zettlemoyer. 2015. Scalable semantic parsing with partial ontologies. In Association for Computational Linguistics (ACL).
James Clarke, Dan Goldwasser, Ming-Wei Chang, and Dan Roth. 2010. Driving semantic parsing from the world’s response. In Computational Natural Language Learning (CoNLL), pages 18–27.
Edgar F. Codd. 1970. A relational model of data for large shared data banks. Communications of the ACM, 13(6):377–387.
Deborah A. Dahl, Bates, Michael Brown, William M. Fisher, Kate Hunicke-Smith, David S. Pallett, Christine Pao, Alexander I. Rudnicky, and Elizabeth Shriberg. 1994. Expanding the scope of the ATIS task: The ATIS-3 corpus. In Workshop on Human Language Technology, pages 43–48.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL).
Dheeru Dua, Yizhong Wang, Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Human Language Technology and North American Association for Computational Linguistics (HLT/NAACL).
Nicholas FitzGerald, Julian Michael, Luheng He, and Luke S. Zettlemoyer. 2018. Large-scale QA-SRL parsing. In Association for Computational Linguistics (ACL).
Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017. AllenNLP: A deep semantic natural language processing platform. arXiv, abs/1803.07640v2.
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O. K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Association for Computational Linguistics (ACL).
Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2019. Towards complex text-to-SQL in cross-domain database with intermediate representation. In Association for Computational Linguistics (ACL).
Nitish Gupta and Mike Lewis. 2018. Neural compositional denotational semantics for question answering. In Empirical Methods in Natural Language Processing (EMNLP).
Peter E. Hart, Nils J. Nilsson, and Bertram Raphael. 1968. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107.
Luheng He, Julian Michael, Mike Lewis, and Luke Zettlemoyer. 2016. Human-in-the-loop parsing. In Empirical Methods in Natural Language Processing (EMNLP).
Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. 2017. Learning to reason: End-to-end module networks for visual question answering. In International Conference on Computer Vision (ICCV).
Drew A. Hudson and Christopher D. Manning. 2019. GQA: A new dataset for real-world visual reasoning and compositional question answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Srini
Iyer
,
Ioannis
Konstas
,
Alvin
Cheung
,
Jayant
Krishnamurthy
, and
Luke
Zettlemoyer
.
2017
.
Learning a neural semantic parser from user feedback
. In
Association for Computational Linguistics (ACL)
.
Mohit
Iyyer
,
Wen-tau
Yih
, and
Ming-Wei
Chang
.
2017
.
Search-based neural structured learning for sequential question answering
. In
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
, pages
1821
1831
.
Yichen
Jiang
and
Mohit
Bansal
.
2019
.
Avoiding reasoning shortcuts: Adversarial evaluation, training, and model development for multi-hop QA
. In
Association for Computational Linguistics (ACL)
.
Justin
Johnson
,
Bharath
Hariharan
,
Laurens van der
Maaten
,
Li
Fei-Fei
,
C.
Lawrence Zitnick
, and
Ross B.
Girshick
.
2017
.
Clevr: A diagnostic dataset for compositional language and elementary visual reasoning
. In
Computer Vision and Pattern Recognition (CVPR)
.
Tom
Kwiatkowski
,
Eunsol
Choi
,
Yoav
Artzi
, and
Luke
Zettlemoyer
.
2013
.
Scaling semantic parsers with on-the-fly ontology matching
. In
Empirical Methods in Natural Language Processing (EMNLP)
.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466.
Fei Li and Hosagrahar Visvesvaraya Jagadish. 2014. NaLIR: An interactive natural language interface for querying relational databases. In International Conference on Management of Data, SIGMOD.
Fei Li, Tianyin Pan, and Hosagrahar Visvesvaraya Jagadish. 2014. Schema-free SQL. In International Conference on Management of Data, SIGMOD, pages 1051–1062.
Percy Liang, Michael I. Jordan, and Dan Klein. 2013. Learning dependency-based compositional semantics. Computational Linguistics, 39:389–446.
Julian Michael, Gabriel Stanovsky, Luheng He, Ido Dagan, and Luke Zettlemoyer. 2018. Crowdsourcing question–answer meaning representations. In North American Association for Computational Linguistics (NAACL).
Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019a. Compositional questions do not necessitate multi-hop reasoning. In Association for Computational Linguistics (ACL).
Sewon Min, Victor Zhong, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2019b. Multi-hop reading comprehension through question decomposition and rescoring. In Association for Computational Linguistics (ACL).
Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In Association for Computational Linguistics (ACL).
Francis Jeffry Pelletier. 1994. The principle of semantic compositionality. Topoi, 13(1):11–24.
P. J. Price. 1990. Evaluation of spoken language systems: The ATIS domain. In Proceedings of the Third DARPA Speech and Natural Language Workshop, pages 91–95.
Peng Qi, Xiaowen Lin, Leo Mehr, Zijian Wang, and Christopher D. Manning. 2019. Answering complex open-domain questions through iterative query generation. In Empirical Methods in Natural Language Processing (EMNLP), pages 2590–2602. Association for Computational Linguistics.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP).
Siva Reddy, Oscar Täckström, Michael Collins, Tom Kwiatkowski, Dipanjan Das, Mark Steedman, and Mirella Lapata. 2016. Transforming dependency structures to logical forms for semantic parsing. In Association for Computational Linguistics (ACL).
Alane Suhr, Stephanie Zhou, Iris Zhang, Huajun Bai, and Yoav Artzi. 2019. A corpus for reasoning about natural language grounded in photographs. In Association for Computational Linguistics (ACL).
Alon Talmor and Jonathan Berant. 2018. The web as knowledge-base for answering complex questions. In North American Association for Computational Linguistics (NAACL).
Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics, 6:287–302.
Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. 2016. Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4:401–415.
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan R. Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Empirical Methods in Natural Language Processing (EMNLP).
Wen-tau Yih, Matthew Richardson, Christopher Meek, Ming-Wei Chang, and Jina Suh. 2016. The value of semantic parse labeling for knowledge base question answering. In Association for Computational Linguistics (ACL).
Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir R. Radev. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Empirical Methods in Natural Language Processing (EMNLP).
John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Association for the Advancement of Artificial Intelligence (AAAI), pages 1050–1055.
Luke Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Uncertainty in Artificial Intelligence (UAI), pages 658–666.
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode