♫ MuSiQue: Multihop Questions via Single-hop Question Composition

Multihop reasoning remains an elusive goal as existing multihop benchmarks are known to be largely solvable via shortcuts. Can we create a question answering (QA) dataset that, by construction, requires proper multihop reasoning? To this end, we introduce a bottom–up approach that systematically selects composable pairs of single-hop questions that are connected, that is, where one reasoning step critically relies on information from another. This bottom–up methodology lets us explore a vast space of questions and add stringent filters as well as other mechanisms targeting connected reasoning. It provides fine-grained control over the construction process and the properties of the resulting k-hop questions. We use this methodology to create MuSiQue-Ans, a new multihop QA dataset with 25K 2–4 hop questions. Relative to existing datasets, MuSiQue-Ans is more difficult overall (3× increase in human–machine gap), and harder to cheat via disconnected reasoning (e.g., a single-hop model has a 30-point drop in F1). We further add unanswerable contrast questions to produce a more stringent dataset, MuSiQue-Full. We hope our datasets will help the NLP community develop models that perform genuine multihop reasoning.


Introduction
Multihop QA datasets are designed to support the development and evaluation of models that perform multiple steps of reasoning in order to answer a question. Recent work, however, shows that on existing datasets, models often need not even connect information across all supporting facts, because they can exploit reasoning shortcuts and other artifacts to find the correct answers and obtain high scores (Min et al., 2019a; Chen and Durrett, 2019; Trivedi et al., 2020). Such shortcuts arise from various factors, such as overly specific sub-questions, train-test leakage, and insufficient distractors. These factors allow models to circumvent connected reasoning: they need not read the context to find answers to previous sub-question(s) or use these answers to answer the later sub-questions that depend on them.
The left hand side of Fig. 1 illustrates an instance of this problem in an actual question (Q) taken from the HotpotQA dataset (Yang et al., 2018). This question has the over-specification issue. At first glance, it appears to require a model to identify Kurt Vonnegut as the author of Armageddon in Retrospect, and then use this information to answer the final question about the famous satire novel he authored. However, this framing of the question is insufficient to enforce that models must perform connected multihop reasoning to arrive at the correct answer. A model can, in fact, find the correct answer to this question from the context without finding the answer to Q1. This is because, even if a model does not know that A1 refers to Kurt Vonnegut, there happens to be only one person best known for a satirical novel mentioned in the context.
Contrast this with the question on the right (Q'), which cannot be answered by simply returning a novel that someone was best known for. There are three possible answers in the context, and choosing between them requires knowing which author is referenced. This is a desirable multihop question that requires connected reasoning.

Figure 1: Left: A HotpotQA question that would have been filtered out by our approach for not requiring connected reasoning; it can be answered using just Q2 without knowing the answer to Q1 (since there is only one person mentioned in the context as being best known for a satirical novel). Right: A connected question that forces models to reason through both intended hops (since there are multiple people mentioned in the context as being best known for some novel).
Prior work has characterized such reasoning, where a model arrives at the correct answer without using all supporting facts, as Disconnected Reasoning (Trivedi et al., 2020). While this characterization enables filtering or automatically transforming existing datasets (Trivedi et al., 2020), we ask a different question: How can we construct a new multihop dataset that, by design, enforces connected reasoning?
We make two main contributions towards this:

1) A new dataset construction approach:
We introduce a bottom-up process for building challenging multihop reading comprehension QA datasets by carefully selecting and composing single-hop questions obtained from existing datasets. The key ideas behind our approach are: (i) Composing multihop questions from a large collection of single-hop questions, which allows a systematic exploration of a vast space of candidate multihop questions. (ii) Applying a stringent set of filters that ensure no sub-question can be answered without finding the answer to the previous sub-questions it is connected to (a key property we formally define as part of the MuSiQue condition, Eqn. (2)). (iii) Reducing train-test leakage at the level of each single-hop question, thereby mitigating the impact of simple memorization tricks.
(iv) Adding distractor contexts that cannot be easily identified. (v) Creating unanswerable multihop questions at the sub-question level.
2) A new challenge dataset and empirical analysis: We build a new multihop QA dataset, MuSiQue-Ans (abbreviated as -Ans), with ∼25K 2-4 hop questions with six different composition structures (cf. Table 1). We demonstrate that -Ans is more challenging and less cheatable than two prior multihop reasoning datasets, HotpotQA (Yang et al., 2018) and 2WikiMultihopQA (Ho et al., 2020). In particular, it has 3× the human-machine gap, and a substantially lower disconnected reasoning (DiRe) score, which captures the extent to which a dataset can be cheated via disconnected reasoning (Trivedi et al., 2020). We also show how various features of our dataset construction pipeline help increase dataset difficulty and reduce cheatability. Lastly, by incorporating the notion of insufficient context (Rajpurkar et al., 2018; Trivedi et al., 2020), we also release a variant of our dataset, -Full, having ∼50K multihop questions that form contrasting pairs (Kaushik et al., 2019) of answerable and unanswerable questions.
-Full is even more challenging and harder to cheat on.
We hope our bottom-up multihop dataset construction methodology and our challenging datasets with a mixed number of hops will help develop proper multihop reasoning systems and decomposition-based models.

Related Work
Multihop QA.
-Ans is closest to HotpotQA (Yang et al., 2018) and 2WikiMultihopQA (Ho et al., 2020). HotpotQA was constructed by directly crowdsourcing 2-hop questions without considering the difficulty of composition, and has been shown to be largely solvable without multihop reasoning (Min et al., 2019a; Chen and Durrett, 2019; Trivedi et al., 2020). While 2WikiMultihopQA was also constructed via composition, it uses a limited set of hand-authored compositional rules, making it easy for large language models. We show that -Ans is harder and less cheatable than both of these. Other multihop datasets (Khashabi et al., 2018; Dua et al., 2019, inter alia) focus on different challenges, such as multiple modalities, open-domain QA (Geva et al., 2021), fact verification (Jiang et al., 2020), science explanations (Jansen et al., 2018), and relation extraction (Welbl et al., 2018), among others. Extending our ideas to these challenges is an interesting avenue for future work.
Unanswerable QA. Prior work has used unanswerable questions for robust reasoning in single-hop (Rajpurkar et al., 2018) and multihop (Ferguson et al., 2020; Trivedi et al., 2020) settings. IIRC (Ferguson et al., 2020) focuses on open-domain QA, where unanswerable questions are identified by crowdsourcing questions for which relevant knowledge could not be retrieved from Wikipedia. Our idea of making multihop questions unanswerable by removing support paragraphs is most similar to Trivedi et al. (2020). While they rely on (potentially incomplete) annotations to identify these support paragraphs, we can use the bridge entities to remove any potential support paragraphs (those containing the bridge entity) and better ensure unanswerability.

Question Decomposition and Composition.
Multihop QA datasets have been decomposed into simpler questions (Min et al., 2019b; Talmor and Berant, 2018) and special meaning representations (Wolfson et al., 2020). Our dataset creation pipeline naturally provides question decompositions, which can help develop interpretable models (Min et al., 2019b).
Recent work has also used bottom-up approaches to create multihop questions (Pan et al., 2021) using rule-based methods. However, their primary goal was data augmentation to improve performance on downstream datasets. The questions themselves have not been shown to be challenging or less cheatable.

Multihop Reasoning Desiderata
Multihop question answering can be seen as a sequence of inter-dependent reasoning steps leading to the answer. In its most general form, these reasoning steps and their dependencies can be viewed as a directed acyclic graph (DAG), G_Q. Each node q_i in this graph represents a reasoning step or a ''hop'', for example, a single-hop question in multihop QA or a KB relation traversal in graph-based KBQA. An edge (q_j, q_i) ∈ edges(G_Q) indicates that the reasoning step q_i relies critically on the output of the predecessor step q_j. For example, in Fig. 1, the single-hop question Q2 depends on the answer to Q1, and the graph G_Q is a linear chain Q1 → Q2.
Given this framing, a key desirable property for multihop reasoning is connected reasoning: Performing each step q i correctly should require the output of all its predecessor steps q j .
Analytical Intuition: Suppose a model M can answer each q_i correctly with probability p, and it can also answer q_i without the output of all its predecessor steps with probability r ≤ p. For simplicity, we assume these probabilities are independent across the various q_i. M can correctly answer a k-hop question Q by identifying and performing all its k reasoning steps. This will succeed with probability at most p^k. Alternatively, as an extreme case, it can ''cheat'' by identifying and performing only the last step q_k (the ''end question'') without considering the output of q_{k-1} (or other steps) at all. This could succeed with probability as much as r, which does not decrease with k and is thus undesirable when constructing multihop datasets. Our goal is to create multihop questions that enforce connected reasoning, that is, where r ≪ p and, in particular, r < p^k, so that models have an incentive to perform all k reasoning steps.
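This trade-off is easy to see numerically. The sketch below uses illustrative, assumed values of p and r (not measurements from any dataset) to show how connected-reasoning accuracy decays with k while the shortcut's accuracy does not:

```python
# Illustrative probabilities (assumed for demonstration, not measured):
p = 0.9  # chance of performing one reasoning step correctly
r = 0.7  # chance of "cheating" by answering only the last hop

for k in (2, 3, 4):
    connected = p ** k  # succeed at all k inter-dependent steps
    print(f"k={k}: connected={connected:.3f} vs shortcut={r}")

# Connected accuracy decays (0.810 -> 0.729 -> 0.656) while the shortcut
# stays flat at 0.7, so by k=4 cheating outperforms honest reasoning.
# Hence the goal: construct questions where r << p, ideally r < p**k.
```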
Not surprisingly, the connected reasoning property is often not satisfied by existing datasets (Min et al., 2019a; Chen and Durrett, 2019; Trivedi et al., 2020), and is never optimized for during dataset construction. As a consequence, models are able to exploit artifacts in existing datasets that allow them to achieve high scores while bypassing some of the reasoning steps, thus negating the main purpose of building multihop datasets. Prior work (Trivedi et al., 2020) has attempted to measure the extent of connected reasoning in current models and datasets. However, due to the design of existing datasets, this approach can only measure it by ablating the pre-requisites of each reasoning step, namely, the supporting facts. Rather than only measure, we propose a method to construct multihop QA datasets that directly optimizes for this condition.
Consider question Q on the left-hand side of Fig. 1. It can be answered in two steps, Q1 and Q2. However, the information in Q2 itself is sufficient to uniquely identify A2 from the context, even without considering A1. That is, while there is an intended dependency between Q1 and Q2, Q2 can be answered correctly without requiring the output of its predecessor question Q1. Our approach constructs multihop questions that prevent this issue, and thereby require the desired connected reasoning. Specifically, we carefully choose which single-hop questions to compose and what context to use such that each constituent single-hop question necessitates the answers from one or more previous questions.

Connected Reasoning via Composition
The central issue we want to address is ensuring connected reasoning. Our solution is to use a bottom-up approach where we compose multihop questions from a large pool of single-hop questions. As we show later, this approach allows us to explore a large space of multihop questions and carefully select ones that require connected reasoning. Additionally, with each multihop question, we will have associated constituent questions, their answers, and supporting paragraphs, which can help develop more interpretable models. Here we describe the high-level process; the specifics follow in the next section.

Multihop via Single-Hop Composition
As mentioned earlier, multihop questions can be viewed as a sequence of reasoning steps where the answer from one reasoning step is used to identify the next reasoning step. Therefore, we can use single-hop questions containing answers from other questions to construct potential multihop questions. For example, in Fig. 1, Q2' mentions A1', and hence single-hop questions Q1' and Q2' can be composed to create a DAG Q1' → Q2' and the multihop question Q' (right). Concretely, to create a multihop question from two single-hop questions, we use the following composability criterion: two single-hop question-answer tuples (q_1, a_1) and (q_2, a_2) are composable into a multihop question Q with a_2 as a valid answer if a_1 is a named entity and it is mentioned in q_2. See §5:S2 for detailed criteria.
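The composability criterion can be stated as a small predicate. This is an illustrative sketch (the entity set and example strings are hypothetical stand-ins for a real NER tagger's output):

```python
def composable(q1, a1, q2, a2, named_entities):
    """2-hop composability: a1 must be a named entity mentioned in q2,
    and (per step S2) a2 must not already appear in q1."""
    return a1 in named_entities and a1 in q2 and a2 not in q1

# Hypothetical example in the spirit of Fig. 1:
q1, a1 = "Who wrote Armageddon in Retrospect?", "Kurt Vonnegut"
q2, a2 = "What novel is Kurt Vonnegut best known for?", "Slaughterhouse-Five"
print(composable(q1, a1, q2, a2, named_entities={"Kurt Vonnegut"}))  # True
```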
This process of composing multihop questions can be chained together to form candidate reasoning graphs of various shapes and sizes (examples in Table 1). Formally, each multihop question Q has an underlying DAG G Q representing the composition of the single-hop questions q 1 , q 2 , . . . , q n , which form the nodes of G Q . A directed edge (q j , q i ) indicates that q i depends on the answer of the previous sub-question q j . a i is the answer to q i , and thereby, a n is the answer to Q.

Ensuring Connected Reasoning
Given the graph G_Q associated with a question Q, ensuring connected reasoning requires ensuring that for each edge (q_j, q_i) ∈ edges(G_Q), arriving at the answer a_i using q_i necessitates the use of a_j. In other words, without a_j, there is not sufficient information in q_i to arrive at a_i. The existence of such information can be probed by training a strong QA model M on subquestions (q_i) with the mention of their predecessor's answer (a_j) masked out (removed). If, on held-out data, the model can identify a subquestion's answer (a_i) without its predecessor's answer (a_j), we say the edge (q_j, q_i) is disconnected. Formally, we say Q requires connected reasoning if:

∀ (q_j, q_i) ∈ edges(G_Q): M(q_i^{m_j}, C) ≠ a_i    (1)

where q_i^{m_j} denotes the subquestion formed from q_i by masking out the mention of the answer a_j.
Consider the masked questions Q2 and Q2' in Fig. 1. While Q2 can easily be answered without answer A1, Q2' cannot be answered without A1', and Q' hence satisfies condition (1).
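The masked-question probe behind condition (1) can be sketched as follows. The model here is a toy stand-in callable, and the cue word and strings are hypothetical illustrations, not dataset content:

```python
def mask_answer(question, mention):
    """Form q_i^{m_j}: remove the mention of the predecessor answer a_j."""
    return " ".join(question.replace(mention, " ").split())

def edge_is_disconnected(model, q_i, a_i, a_j, context):
    """Edge (q_j, q_i) is disconnected if a strong model still recovers
    a_i from the masked question; `model` maps (question, context) -> answer."""
    return model(mask_answer(q_i, a_j), context) == a_i

# Toy model: succeeds whenever the over-specific cue "satire" survives masking.
def toy_model(question, context):
    return context.get("satire") if "satire" in question else None

q = "What famous satire novel did Kurt Vonnegut write?"
ctx = {"satire": "Slaughterhouse-Five"}
# True: the cue alone identifies the answer, so this edge would be filtered.
print(edge_is_disconnected(toy_model, q, "Slaughterhouse-Five",
                           "Kurt Vonnegut", ctx))
```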

Reading Comprehension Setting
While our proposed framework makes no assumptions about the choice of the model, and is applicable to the open-domain setting, we focus on the Reading Comprehension (RC) setting, where we have a fixed set of paragraphs as context, C.
In an RC setting, apart from requiring dependence between the reasoning steps, we also want the model to depend on the context to answer each question. While this requirement may seem unnecessary, prior work has shown that RC datasets often have artifacts that allow models to predict the answer without the context (Kaushik and Lipton, 2018), and that models can even memorize answers due to train-test leakage. As we will show later, previous multihop RC datasets can be cheated via such shortcuts. To ensure the dependence between the question and context, we modify the required condition in Eqn. (1) to:

∀ q_i ∈ nodes(G_Q): M(q_i, φ) ≠ a_i  and  ∀ (q_j, q_i) ∈ edges(G_Q): M(q_i^{m_j}, C) ≠ a_i    (2)

In summary, we want multihop reading comprehension questions that satisfy condition (2) for a strong trained model M. If a question does, we say that it satisfies the MuSiQue condition. Our dataset construction pipeline optimizes for this condition, as described next.

Dataset Construction Pipeline
The high-level schematic of the pipeline is shown in Fig. 2. We begin with a large set of RC single-hop questions from 5 English Wikipedia-based datasets, SQuAD (Rajpurkar et al., 2016), Natural Questions (Kwiatkowski et al., 2019), MLQA (en-en) (Lewis et al., 2020b), T-REx (ElSahar et al., 2018), and Zero Shot RE (Levy et al., 2017), where instances are of the form (q_i, p_i, a_i), referring to the question, the associated paragraph, and the answer, respectively. For Natural Questions, as the context is very long (an entire Wikipedia page), we use the annotated long answer (usually a paragraph) from the dataset as the context, and the annotated short answer as the answer. Then, we take the following steps:

S1. Find Good Single-Hop Questions. Even a tolerably small percentage of issues in single-hop questions can compound into an intolerably large percentage in the composed multihop questions. To mitigate this, we first remove questions that are likely annotation errors. Because manually identifying such questions at scale is laborious, we use a model-based approach. We remove the questions for which none of five large trained QA models can predict the associated answer with > 0 answer F1. Furthermore, we remove (i) erroneous questions where the answer spans are not in the context, (ii) questions with < 20 word contexts, as we found them to be too easy, and (iii) questions with > 300 word contexts, to prevent the final multihop question context from being too long for current long-range transformer models.
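The S1 filters can be summarized as a single predicate. This is a minimal sketch using the thresholds stated above; the per-model F1 scores are assumed inputs:

```python
def keep_single_hop(question, context, answer, model_answer_f1s):
    """S1 filters: keep a single-hop question only if (a) at least one of
    the five QA models finds the answer (F1 > 0), (b) the answer span is
    present in the context, and (c) the context is 20-300 words long."""
    n_words = len(context.split())
    return (max(model_answer_f1s) > 0
            and answer in context
            and 20 <= n_words <= 300)

ctx = " ".join(["word"] * 24) + " Paris"                     # 25-word context
print(keep_single_hop("Where?", ctx, "Paris", [0.0, 0.6]))   # True
print(keep_single_hop("Where?", "Paris only", "Paris", [0.9]))  # False: too short
```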

S2. Find Composable Single-Hop Pairs.
To create 2-hop questions, we first collect distinct single-hop question pairs with a bridge entity. Specifically, we find pairs (q_1, p_1, a_1) and (q_2, p_2, a_2) such that (i) a_1 is a named entity also mentioned in q_2, (ii) a_2 is not in q_1, and (iii) p_1 ≠ p_2. Such pairs can be combined to form a 2-hop question (Q, {p_1, p_2}, a_2). To ensure that the mentions (a_1 and its occurrence in q_2, denoted e_2) refer to the same entity, we check that: (1) the spaCy entity tagger (Honnibal et al., 2020) tags a_1 and e_2 as entities of the same type; (2) a Wikipedia search with a_1 and e_2 returns an identical first result; and (3) a state-of-the-art (SOTA) Wikification model (Wu et al., 2020) returns the same result for a_1 and e_2. At a later step (S7), when humans write composed questions from DAGs, they get to remove questions containing erroneous pairs. Only 8% of the pairs are pruned in that step, indicating that step S2 is quite effective.
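The three entity-identity checks amount to a conjunction of agreements between external tools. A sketch, with the tools abstracted as assumed callables (the real pipeline uses spaCy, a Wikipedia search, and a Wikification model; the tool outputs below are hypothetical):

```python
def same_entity(a1, e2, ner_type, wiki_top1, wikify):
    """S2 identity check: all three tools must agree that a1 and e2
    refer to the same entity."""
    return (ner_type(a1) == ner_type(e2)        # same NER type
            and wiki_top1(a1) == wiki_top1(e2)  # same top Wikipedia hit
            and wikify(a1) == wikify(e2))       # same Wikification result

# Hypothetical tool outputs for two mentions of the same person:
ner = {"Kurt Vonnegut": "PERSON", "Vonnegut": "PERSON"}.get
wiki = {"Kurt Vonnegut": "Kurt_Vonnegut", "Vonnegut": "Kurt_Vonnegut"}.get
print(same_entity("Kurt Vonnegut", "Vonnegut", ner, wiki, wiki))  # True
```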
S3. Filter Disconnected Single-Hop Pairs. We want connected 2-hop questions: questions that cannot be answered without using the answers of the constituent single-hop questions. The MuSiQue condition (2) states that for a 2-hop question to be connected, each sub-question q_i should not be correctly answerable without its context (M(q_i, φ) ≠ a_i), and the tail question q_2 should not be correctly answerable when a_1 is removed from it (M(q_2^{m_1}, C) ≠ a_2). Accordingly, we use a two-step filtering process to find connected 2-hop questions. For simplicity, and because the second condition already filters some tail questions, our current implementation enforces the first condition only on the head question, q_1. (The five QA models referenced in S1 are two random-seed variants of RoBERTa-large (Liu et al., 2019), two random-seed variants of Longformer-Large (Beltagy et al., 2020), and one UnifiedQA.)
Filtering Head Nodes: We collect all questions that appear at least once as the head (q_1) of a composable 2-hop question to create a set of head nodes. We create 5-fold train-test splits of this set and train two Longformer-Large models (different seeds) per split (training on three folds, validating and testing on one each). We generate answer predictions using the two models on their corresponding test splits, resulting in two predictions per question. We accept a head question if, on average, the predicted answers' word overlap (computed using answer F1) with the answer label is < 0.5.
Filtering Tail Nodes: We create a unique set of masked single-hop questions that occur as a tail node (q_2) in any composable 2-hop question. If the same single-hop question occurs in two 2-hop questions with different masked entities, both are added to the set. We combine the gold paragraph with 9 distractor paragraphs (retrieved using the question without the masked entities as the query). As before, we create 5-fold train-test splits and use two Longformer-Large models to obtain two answer and support predictions. We accept a tail question if either the mean answer F1 is ≤ 0.25, or it is ≤ 0.75 and the mean support F1 is < 1.0.
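The acceptance thresholds for the two filters reduce to the simple rules below (a sketch of the stated thresholds, with mean F1 scores as assumed inputs):

```python
def accept_head(mean_answer_f1):
    """Keep q1 only if models largely fail to answer it WITHOUT context."""
    return mean_answer_f1 < 0.5

def accept_tail(mean_answer_f1, mean_support_f1):
    """Keep masked q2 if models clearly fail (F1 <= 0.25), or partially
    fail (F1 <= 0.75) while also missing some supporting paragraph."""
    return (mean_answer_f1 <= 0.25
            or (mean_answer_f1 <= 0.75 and mean_support_f1 < 1.0))

print(accept_head(0.2), accept_tail(0.5, 0.9), accept_tail(0.5, 1.0))
# True True False
```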
The thresholds for head and tail node filtering were chosen by manually inspecting a few predictions in various ranges of the parameters and gauging at what F1 values the model's answer semantically matches the correct answer (e.g., ''Barack Obama'' and ''President Barack Obama'' overlap with 0.8 answer F1). Controlling these thresholds provides a way to trade off the degree of cheatability allowed in the dataset against the size of the final dataset. We aim to limit cheatability while retaining a reasonable dataset size.
Finally, only 2-hop questions for which both head and tail node are acceptable are kept. We call this process Disconnection Filtering.
S4. Build Multihop Questions. We now have a set of connected 2-hop questions, which form the directed edges of a graph. Any sub-DAG of this graph can be used to create a connected multihop question. We use 6 types of reasoning graphs with 2-4 hops, as shown in Table 1. To avoid very long questions, we limit single-hop questions to ≤ 10 tokens, the total question length of 2- and 3-hop compositions to ≤ 15 tokens, and that of 4-hop compositions to ≤ 20 tokens. To ensure diversity, we (1) cap the reuse of bridging entities and single-hop questions at 25 and 100 multihop questions, respectively, and (2) remove any n-hop question that is a subset of any m-hop question (m > n > 1).
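As a simplified illustration of S4, the sketch below chains connected 2-hop edges into linear-chain DAGs of up to four hops; it deliberately ignores the branched graph shapes of Table 1 and the length/diversity caps described above:

```python
def linear_chains(edges, max_hops=4):
    """Enumerate linear chains of 2..max_hops single-hop questions from
    a set of connected 2-hop edges (u, v): 'answer of u feeds v'."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
    chains, stack = [], [[u, v] for u, v in edges]
    while stack:
        chain = stack.pop()
        chains.append(tuple(chain))
        if len(chain) < max_hops:
            for nxt in succ.get(chain[-1], []):
                if nxt not in chain:  # keep it acyclic
                    stack.append(chain + [nxt])
    return chains

print(sorted(linear_chains([("q1", "q2"), ("q2", "q3")])))
# [('q1', 'q2'), ('q1', 'q2', 'q3'), ('q2', 'q3')]
```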
S5. Minimize Train-Test Leakage. We devise a procedure to create train, validation, and test splits such that models cannot achieve high scores via memorization enabled by train-test leakage, an issue observed in some existing datasets. Our procedure ensures that the training set has no overlap with the validation or test sets, and tries to keep the overlap between the validation and test sets minimal.
We consider two multihop questions Q_i and Q_j to overlap if any of the following are common between Q_i and Q_j: (i) a single-hop question, (ii) the answer to any single-hop question, or (iii) the paragraph associated with any single-hop question. To minimize such overlap, we take a set of multihop questions, greedily find a subset of a given size (S) that least overlaps with its complement (S'), and then remove overlapping questions from S' to get a train set (S) and a dev+test set (S'). We then split dev+test into dev and test similarly. We ensure that the distributions of source datasets of single-hop questions in train, dev, and test are similar, and also control the proportion of 2-4 hop questions.
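The overlap notion used for splitting can be sketched directly; the dictionary keys below are hypothetical field names for a question's decomposition, and the removal step shown is the final "remove overlapping questions from S'" part of the procedure:

```python
def overlap_keys(q):
    """A multihop question's leakage signature: its single-hop questions,
    their answers, and their associated paragraphs (S5 overlap criteria)."""
    return set(q["subqs"]) | set(q["answers"]) | set(q["paras"])

def remove_leaked(train, heldout):
    """Drop held-out questions that overlap the training set on any key."""
    seen = set()
    for q in train:
        seen |= overlap_keys(q)
    return [q for q in heldout if not (overlap_keys(q) & seen)]

train = [{"subqs": ["s1"], "answers": ["a1"], "paras": ["p1"]}]
held = [{"subqs": ["s2"], "answers": ["a2"], "paras": ["p1"]},   # leaks via p1
        {"subqs": ["s3"], "answers": ["a3"], "paras": ["p3"]}]
print(len(remove_leaked(train, held)))  # 1
```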
S6. Build Contexts for Questions. For an n-hop question, the context has 20 paragraphs containing: (i) the supporting paragraphs associated with its single-hop questions {p_1, p_2, ..., p_n}, and (ii) distractor paragraphs retrieved using a query that is a concatenation of the single-hop questions from which all intermediate answer mentions are removed. To make distractor paragraphs harder to identify, we retrieve them from the set of gold paragraphs for the filtered single-hop questions (S1).
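A sketch of the S6 context construction; `retrieve` is an assumed callable (query, k) -> ranked paragraphs standing in for the real retriever, and paragraphs are modeled as plain strings:

```python
def build_context(subqs, intermediate_answers, gold_paras, retrieve, n_total=20):
    """Context = gold paragraphs + hard distractors retrieved with a query
    that concatenates the sub-questions minus intermediate answer mentions."""
    query = " ".join(subqs)
    for ans in intermediate_answers:
        query = query.replace(ans, "")
    distractors = [p for p in retrieve(query, n_total) if p not in gold_paras]
    return gold_paras + distractors[: n_total - len(gold_paras)]

retrieve = lambda q, k: ["d1", "g1", "d2", "d3"]   # toy retriever
print(build_context(["q1", "q2"], ["a1"], ["g1", "g2"], retrieve, n_total=4))
# ['g1', 'g2', 'd1', 'd2']
```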

S7. Crowdsource Question Compositions.
We crowdsource question compositions on Amazon MTurk, where workers composed coherent questions from our final DAGs of single-hop questions. In the interface, workers could see a list of single-hop questions with their associated paragraphs and how they are connected via bridge entities. They were first asked to check whether all pairs of mentions of bridge entities indeed refer to the same underlying entity. If they answered 'yes' for each pair, they were asked to compose a natural language question, ensuring that information from all single-hop questions in the DAG is used and that the answer to the composed question is the same as that of the last single-hop question. If they answered 'no' for any of the pairs, we discarded that question. Our tutorial provided them with several handwritten good and bad examples for each of the 2-4 hop compositions. Workers were encouraged to write short questions and make implicit inferences when possible. They were allowed to split questions into two sentences if needed.
We carried out a qualification round where 100 workers participated to perform the aforementioned task on 20 examples each. We manually evaluated these annotations for correctness and coherence, and selected 17 workers to annotate the full dataset. To ensure dataset quality, we carried out crowdsourcing in 9 batches, reading 10-20 random examples from each worker after each batch and sending relevant feedback via email, if needed. Workers were paid 25, 40, and 60 cents for each 2-, 3-, and 4-hop question, amounting to ∼15 USD per hour, totaling ∼11K USD.
We refer to the dataset at this stage as MuSiQue-Ans or -Ans.

S8. Add Unanswerable Questions.
For each answerable multihop RC instance, we create a corresponding unanswerable multihop RC instance using a procedure similar to the one proposed in Trivedi et al. (2020). For a multihop question, we randomly sample one of its single-hop questions and make it unanswerable by ensuring the answer to that single-hop question does not appear in any of the paragraphs in the context (apart from this requirement, the context is built as described in S6). Because one of the single-hop questions is unanswerable, the whole multihop question is unanswerable.
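The S8 construction can be sketched in a few lines (paragraphs modeled as plain strings, which is a simplification; the example strings are hypothetical):

```python
import random

def make_unanswerable(context, sub_answers, rng=random):
    """Pick a random single-hop answer and drop every paragraph that
    mentions it, making that hop (and the whole question) unanswerable."""
    target = rng.choice(sub_answers)
    return [p for p in context if target not in p]

ctx = ["Vonnegut wrote the novel.", "The novel is satire.", "Unrelated text."]
reduced = make_unanswerable(ctx, ["Vonnegut"], random.Random(0))
print(reduced)  # ['The novel is satire.', 'Unrelated text.']
```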
The task now is to predict whether the question is answerable, and to predict the answer and support if it is. Since the questions in an answerable-unanswerable pair are identical and the contexts differ only marginally, models that rely on shortcuts find this new task very difficult. We call the dataset at this stage MuSiQue-Full or -Full, and both datasets together MuSiQue.

In summary, our construction pipeline allows us to produce a dataset with mixed hops, multiple types of reasoning graphs, and unanswerable sub-questions, all of which make for a more challenging and less cheatable dataset (as we will quantify in Section 8). Question decomposition, which is a natural outcome of our construction pipeline, can also be used to aid decomposition-based QA research (Min et al., 2019b).

Dataset Quality Assessment
Quality of -Ans. To assess the quality of -Ans, we first evaluate how well humans can answer questions in it. Note that we already have gold answers and supporting paragraphs from our construction pipeline. The goal is therefore not to determine gold labels, but rather to measure how well humans perform on the task, treating our gold labels as correct.
We sample 125 questions from the -Ans validation and test sets, and obtain 3 annotations (answer and supporting paragraphs) for each question. We used Amazon MTurk, selecting crowdsource workers as described in §7.3.
Workers were shown the question and all paragraphs in the context, and were asked to highlight the answer span and checkmark the supporting paragraphs. Our interface allowed for searching, sorting, and filtering the list of paragraphs easily with interactive text-overlap-based search queries.

We compute human performance by comparing against gold labels for answer and support in two ways: 1) Human Score: the most frequent answer and support among the three annotators, breaking ties at random (the strategy used by Rajpurkar et al. (2018)); and 2) Human Upper Bound (UB): the answer and support that maximize the score (as done by Yang et al. (2018)).
Furthermore, to assess how well humans agree with each other (ignoring our gold labels), we also compute the Human Agreement (Agr) score (Rajpurkar et al., 2016; Yang et al., 2018). Specifically, we treat one of the 3 annotations, chosen randomly, as predicted, and evaluate it against the rest of the annotations, which are treated as correct. Table 3 demonstrates that -Ans is a high-quality dataset. Furthermore, as we will discuss in §7.3, we also compare our human performance with two other similar datasets (HotpotQA and 2WikiMultihopQA), and show that -Ans is close to them under these metrics (§8).
Quality of -Full. We perform an additional manual validation to assess the dataset quality of -Full. Recall that -Full shares its answerable questions with -Ans; the only extra task in -Full is determining the answerability of a question from the given context. To assess the validity of this task, we sampled 50 random instances from -Full, and one of the authors determined the answerability of each question from its context. We found that in 45 of the 50 instances (90%) the human-predicted answerability matched the gold label, showing that -Full is also a high-quality dataset.
Multihop Nature of MuSiQue. Finally, we assess the extent to which -Ans satisfies the MuSiQue condition (Eqn. 2) for connected reasoning. To this end, we first estimate what percentage of head and tail questions in the validation set we would retain if we were to repeat our disconnection filtering procedure (S3) with models trained on the final training data. This captures the fraction of questions in -Ans that satisfy the MuSiQue condition. We then compare it with the respective numbers from the original step S3.
In the original disconnection filtering step, we retained only 26.5% of the tail questions, whereas we would have retained 79.0% of the tail questions had we filtered the final validation dataset. For head questions, we see a less dramatic but still significant effect-we originally retained 74.5% questions, and would now have retained 87.7% had we filtered the final validation set. This shows that vastly more questions in -Ans satisfy the MuSiQue condition than what we started with.
Experimental Setup

Datasets
We compare our datasets (MuSiQue-Ans and MuSiQue-Full) with two similar multihop RC datasets: the distractor setting of HotpotQA (Yang et al., 2018) and 2WikiMultihopQA (Ho et al., 2020). Both datasets have 10 paragraphs as context. HQ has 2-hop questions, and 2W has 2- and 4-hop questions. Additionally, HQ has sentence-level support and 2W has entity-relation tuple support, but we do not use these annotations in our training or evaluation, for a fair comparison.
HQ, 2W, and -Ans have 90K, 167K, and 20K training instances, respectively. For a fair comparison, we use equal sized training sets in all our experiments, obtained by randomly sampling 20K instances each from HQ and 2W, and referred to as HQ-20k and 2W-20k, respectively.
Notation. Instances in -Ans, HQ, and 2W are of the form (Q, C; A, P_s). Given a question Q and a context C consisting of a set of paragraphs, the task is to predict the answer A and identify the supporting paragraphs P_s ⊆ C. -Ans additionally has gold decompositions G_Q (§3), which can be leveraged during training. Instances in -Full are of the form (Q, C; A, P_s, S), where there is an additional binary classification task to predict S, the answerability of Q based on C, also referred to as context sufficiency (Trivedi et al., 2020).
Metrics. For -Ans, HQ, and 2W, we report the standard F1-based metrics for answer (An) and support identification (Sp); see Yang et al. (2018) for details. (For brevity, we use HQ, 2W, and -Ans/-Full to refer to HotpotQA, 2WikiMultihopQA, and MuSiQue-Ans/-Full, respectively.) To make a fair comparison across datasets, we use only paragraph-level support F1.
For -Full, we follow Trivedi et al. (2020) and combine the sufficiency prediction S with An and Sp, denoted An+Sf and Sp+Sf. Instances in -Full are evaluated in pairs. For each Q with a sufficient context C, there is a paired instance with Q and an insufficient context C′. For An+Sf, if a model incorrectly predicts context sufficiency (yes or no) for either of the instances in a pair, it gets 0 points on that pair. Otherwise, it gets the same An score on that pair as on the answerable instance in that pair. Scores are averaged across all pairs of instances in the dataset. Likewise for Sp+Sf.
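The paired An+Sf scoring rule can be written compactly. A sketch with assumed field names (each record represents one answerable/unanswerable pair):

```python
def an_plus_sf(pairs):
    """An+Sf over paired instances: a pair scores its answer F1 only if
    sufficiency is predicted correctly on BOTH instances (True for the
    answerable one, False for the unanswerable one); otherwise it scores 0."""
    total = 0.0
    for p in pairs:
        suff_ok = p["pred_suff_ans"] is True and p["pred_suff_unans"] is False
        total += p["answer_f1"] if suff_ok else 0.0
    return total / len(pairs)

pairs = [
    {"pred_suff_ans": True,  "pred_suff_unans": False, "answer_f1": 0.8},
    {"pred_suff_ans": True,  "pred_suff_unans": True,  "answer_f1": 1.0},
]
print(an_plus_sf(pairs))  # 0.4
```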

Models
Our models are Transformer-based (Vaswani et al., 2017) language models, implemented using PyTorch (Paszke et al., 2019), HuggingFace Transformers (Wolf et al., 2019), and AllenNLP (Gardner et al., 2017). We experiment with two types of models: (1) Multihop Models, which are in principle capable of the desired reasoning and have demonstrated competitive performance on previous multihop QA datasets; they help probe the extent to which a dataset can be solved by current models. (2) Artifact-based Models, which are restricted in some way that prevents them from performing the desired reasoning (discussed shortly); they help probe the extent to which a dataset can be cheated. Next, we describe these models for -Ans and -Full; for HQ and 2W, they work similarly to -Ans.

Multihop Models
End2End (EE) Model. This model takes (Q, C) as input, runs it through a transformer, and predicts (A, P_s) as the output for -Ans and (A, P_s, S) for -Full. We use Longformer-Large, as it's one of the few transformer architectures able to fit the full context, and follow Beltagy et al. (2020) for answer and support prediction. Answerability prediction is done via binary classification using the CLS token.
Note that our Longformer EE model is a strong model for multihop reasoning. When trained on full datasets, its answer F1 is 78.4 (within 3 pts of published SOTA [Groeneveld et al., 2020]) on HQ, and 87.7 (SOTA) on 2W.
Select+Answer (SA) Model. This model, inspired by Quark (Groeneveld et al., 2020) and SAE (Tu et al., 2020), has two parts. First, a selector ranks and selects the K most relevant paragraphs C_K ⊆ C, where K is a hyperparameter chosen from {3, 5, 7}. Specifically, given (Q, C) as input, it classifies every paragraph P ∈ C as relevant or not, and is trained with the cross-entropy loss. Second, for MuSiQue-Ans, the answerer predicts the answer and supporting paragraphs based only on C_K; for MuSiQue-Full, it additionally predicts answerability. Both components are trained individually using annotations available in the dataset. We implement the selector using RoBERTa-large (Liu et al., 2019) and the answerer using Longformer-Large.
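A minimal sketch of the SA pipeline, with the trained selector and answerer abstracted as opaque callables (names are illustrative, not from our implementation):

```python
def select_and_answer(question, context, selector, answerer, k=3):
    """SA pipeline sketch.

    `selector` maps (question, paragraph) -> relevance score;
    `answerer` maps (question, selected paragraphs) -> prediction.
    In the paper these are trained RoBERTa-large and Longformer-Large
    models; here they are arbitrary callables.
    """
    # Score every paragraph in the context independently.
    scores = [selector(question, p) for p in context]
    # Keep C_K: the K paragraphs the selector ranks highest.
    ranked = sorted(zip(context, scores), key=lambda t: t[1], reverse=True)
    selected = [p for p, _ in ranked[:k]]
    # The answerer sees only the selected paragraphs, not the full context.
    return answerer(question, selected)
```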
Step Execution (EX) Model. Similar to prior work (Talmor and Berant, 2018; Min et al., 2019b; Qi et al., 2021), this model performs explicit, step-by-step multihop reasoning: it first decomposes Q into a DAG G_Q of single-hop questions, and then repeatedly calls a single-hop model to execute this decomposition.
The decomposer is trained with gold decompositions, and is implemented with BART-large.
The executor takes C and the predicted DAG G_Q, and outputs (A, P_s) for MuSiQue-Ans and (A, P_s, S) for MuSiQue-Full. It calls a single-hop model M_s repeatedly while traversing G_Q along its edges, substituting earlier answers into later sub-questions.
Model M_s is trained only on single-hop instances, taking (q_i, C) as input and producing (A, P_i) for -Ans or (A, P_i, S_i) for -Full. Here, P_i refers to the supporting paragraph for q_i, and S_i to whether C is sufficient to answer q_i. For MuSiQue-Full, the answerer predicts Q as having sufficient context only if M_s predicts all q_i to have sufficient context. We implement two such single-hop models M_s, End2End and Select+Answer, abbreviated as EX(EE) and EX(SA), respectively. We don't experiment with this model on HQ, since it needs ground-truth decompositions and intermediate answers, which aren't available in HQ.
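The executor's traversal can be sketched as follows, assuming (for illustration) that a placeholder "#k" in a sub-question refers to the answer of the k-th sub-question, and with the trained single-hop model M_s abstracted as a callable:

```python
import re

def execute_decomposition(sub_questions, context, single_hop_model):
    """EX sketch: run sub-questions in an order consistent with G_Q's edges.

    `single_hop_model` maps (question, context) -> (answer, supporting
    paragraph); it stands in for the trained M_s.
    """
    answers, support = [], []
    for sub_q in sub_questions:
        # Substitute placeholders like '#1' with earlier predicted answers.
        resolved = re.sub(r"#(\d+)",
                          lambda m: answers[int(m.group(1)) - 1],
                          sub_q)
        ans, para = single_hop_model(resolved, context)
        answers.append(ans)
        support.append(para)
    # The multihop answer is the answer of the final hop; the supporting
    # paragraphs are those used across all hops.
    return answers[-1], support
```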
Baseline (RNN) Model. The filtering steps in our pipeline use transformer-based models, which could make MuSiQue particularly difficult for transformer-based models. A natural question, then, is: can a strong non-transformer model perform better on MuSiQue? To answer this, we evaluate our re-implementation of a strong RNN-based baseline (Yang et al., 2018) (see their original paper for details). To verify our implementation, we trained it on full HotpotQA and found its performance to be 64.0 An (answer F1) on the validation set, better than the 58.3 An reported by Yang et al. (2018). We thus use this model as a strong non-transformer baseline.

Artifact-based Models
The Q-Only Model takes only Q as input (no C) and generates A as output for -Ans and (A, S) for -Full. We implement this with BART-large (Lewis et al., 2020a).

The C-Only Model takes only C as input (no Q) and predicts (A, P_s) for -Ans and (A, P_s, S) for -Full. We implement this with an EE Longformer-Large model given an empty Q.

The 1-Para Model, like those of Min et al. (2019a) and Chen and Durrett (2019), is similar to the SA model with K = 1. Instead of training the selector to rank all of P_s the highest, we train it to rank any paragraph containing the answer A the highest. The answerer then takes as input one selected paragraph p ∈ P_s and predicts an answer to Q based solely on p. This model can't access the full supporting information, as all considered datasets have at least 2 supporting paragraphs.
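The three restricted input regimes can be sketched as follows (our own illustrative encoding; the 1-Para branch stands in for the trained K = 1 selector by picking any answer-bearing paragraph):

```python
def artifact_input(question, context, answer, variant):
    """Build the (reduced) input each artifact-based model actually sees.

    Q-Only gets the question with no context; C-Only gets the context with
    an empty question; 1-Para gets the question plus a single paragraph
    that contains the answer string.
    """
    if variant == "q_only":
        return {"question": question, "context": []}
    if variant == "c_only":
        return {"question": "", "context": context}
    if variant == "1_para":
        # Stand-in for the K=1 selector: any paragraph containing the answer.
        chosen = next(p for p in context if answer in p)
        return {"question": question, "context": [chosen]}
    raise ValueError(f"unknown variant: {variant}")
```

If any of these reduced inputs still supports high scores, the dataset admits shortcuts that bypass connected reasoning.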

Cheatability Score
We compute the DiRe score of all datasets, which measures the extent to which the datasets can be cheated by strong models via Disconnected Reasoning (Trivedi et al., 2020). We report scores based on the SA model because it performed the best.

Human Performance
Apart from assessing the human performance level on -Ans, as discussed in §6, we also obtain human performance on HQ and 2W. For a fair comparison, we use the same crowdsourcing workers, annotation guidelines, and interface across the 3 datasets. We sample 125 questions from each dataset, shuffle them all into one set, and obtain 3 annotations per question for answer and support.
To select workers, we ran a qualification round in which each worker was required to identify the answer and support for at least 25 questions. We then selected workers whose An and Sp scores exceeded 75 on all datasets. Seven of the 15 workers qualified for the rest of the validation.
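The qualification criterion amounts to a simple filter (a sketch; variable names are ours):

```python
def qualified_workers(scores, threshold=75):
    """Keep workers whose An and Sp scores exceed `threshold` on every dataset.

    `scores` maps worker id -> dataset name -> (An, Sp) from the
    qualification round.
    """
    return [worker for worker, per_dataset in scores.items()
            if all(an > threshold and sp > threshold
                   for an, sp in per_dataset.values())]
```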

Empirical Findings
We now discuss our findings, demonstrating that MuSiQue is a challenging multihop dataset that is harder to cheat on than existing datasets ( §8.1) and that the steps in the MuSiQue construction pipeline are individually valuable ( §8.2). Finally, we explore avenues for future work ( §8.3). For HQ and 2W, we report validation set performance. For -Ans and -Full, Table 5 reports test set numbers; all else is on the validation set.

MuSiQue is a Challenging Dataset
Compared to HQ and 2W, both variants of MuSiQue are less cheatable via shortcuts and have a larger human-to-model gap.
Higher Human-Model Gap. The top two sections of Table 4 show that -Ans has a significantly higher human-model gap (computed as the human score minus the best model score) than the other datasets, for both answer and supporting-paragraph identification. In fact, on both other datasets, model performance on supporting-paragraph identification has even surpassed the human score, whereas for -Ans there is a 14-point gap. Additionally, -Ans has a ∼27-point gap in answer F1, whereas HQ and 2W have gaps of only 10 and 5 points, respectively.

Table 5: -Full is harder (top row) and less cheatable (bottom row) than -Ans. Note: -Full has a stricter metric that operates over instance pairs (§7.1: metrics).
Our best model, EX(SA), scores 57.9, 47.9, and 28.1 answer F1 on the 2-, 3-, and 4-hop questions of -Ans, respectively. The EE model, on the other hand, stays around 42 answer F1 irrespective of the number of hops.
Lower Cheatability. The third section of Table 4 shows that the performance of artifact-based models (§7.2.2) is much higher on HQ and 2W than on -Ans. For example, the 1-Para model achieves 64.8 and 60.1 answer F1 on HQ and 2W, respectively, but only 32.0 on -Ans. Support identification in both datasets can be done to a surprisingly high degree (67.6 and 92.0 F1) even without the question (C-Only model), but fails on -Ans. 9 Similarly, the last row of Table 4 shows that the DiRe answer scores of HQ and 2W (68.8 and 63.4) are high, indicating that even disconnected reasoning (bypassing reasoning steps) can achieve such scores. In contrast, this number is significantly lower (37.8) for -Ans.
These results demonstrate that -Ans is significantly less cheatable via shortcut-based reasoning.
MuSiQue-Full: Even More Challenging. Table 5 shows that -Full is significantly more difficult and less cheatable than -Ans.
Intuitively, because the answerable and unanswerable instances are very similar yet have different labels, it's difficult for models to do well on both if they rely on shortcuts (Kaushik et al., 2019). All artifact-based models barely get any An+Sf or Sp+Sf score. For all multihop models, too, An drops by 14-17 points and Sp by 33-44 points.

Dataset Construction Steps are Valuable
Next, we show that the key steps of our dataset construction pipeline ( §5) are valuable.
Disconnection Filter (step 3). To assess the effect of the Disconnection Filter (DF), we ablate it from the pipeline, that is, we skip the step that filters composable 2-hop questions down to connected 2-hop questions. As we don't have human-written compositions for the resulting questions, we use a seq2seq BART-large model trained (using MuSiQue) to compose questions from an input decomposition DAG. For a fair comparison, we randomly subsample the train set from the ablated pipeline to match the size of the original train set. Table 6 shows that DF is crucial for increasing the difficulty and reducing the cheatability of the dataset. Without DF, both multihop and artifact-based models do much better on the resulting dataset.
Reduced Train-Test Leakage (step 5). To assess the effect of Reduced train-test Leakage (RL), we create a dataset the traditional way, with a random partition into train, validation, and test splits. For uniformity, we ensure that the distribution of 2-4 hop questions in the development sets resulting from both ablated pipelines remains the same as in the original development set. As in the DF ablation, we also normalize train set sizes. Table 6 shows that without a careful split, the dataset is highly solvable by multihop models (An = 87.3). Importantly, most of this high score can also be achieved by artifact-based models, 1-Para (An = 85.1) and C-Only (An = 69.5), revealing the high cheatability of such a split.

Table 7: Positive Distractors (PD) are more effective than using Full Wikipedia (FW) for choosing distractors, as shown by the lower model scores. The effect of using PD is more pronounced when combined with the use of 20 (rather than 10) distractor paragraphs.
Harder Distractors (step 7). To assess the effect of distractors in -Ans, we create 4 variations. Two vary the number of distractors: (i) 10 paragraphs and (ii) 20 paragraphs. Two vary the source: (i) full Wikipedia (FW) 10 and (ii) gold context paragraphs from the good single-hop questions from step 1. We refer to the latter setting as positive distractors (PD), as these paragraphs are likely to appear as supporting (positive) paragraphs in our final dataset. Table 7 shows that all models find PD significantly harder than FW. In particular, PD makes support identification extremely difficult for C-Only, whereas Table 4 showed that C-Only succeeds on HQ and 2W to a high degree (67.6 and 92.0 Sp). The same would have been true for -Ans (66.4 Sp) had we used Wikipedia as the distractor corpus, like HQ and 2W. This underscores the value of selecting the right corpus for distractor selection, ensuring that distributional shift can't be exploited to bypass reasoning. 11 Second, using 20 paragraphs instead of 10 makes the dataset more difficult and less cheatable. Interestingly, the effect is stronger with PD, indicating a synergy between the two approaches to creating challenging distractors.
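The PD setting can be sketched as follows (an illustrative simplification of step 7, not the exact pipeline code):

```python
import random

def positive_distractors(supporting, positive_pool, total=20, seed=0):
    """Build a context of `total` paragraphs in the PD setting.

    Pad the gold supporting paragraphs with paragraphs drawn from the pool
    of gold contexts of OTHER single-hop questions, so distractors come from
    the same distribution as true support. `positive_pool` is assumed to
    exclude this question's own supporting paragraphs.
    """
    rng = random.Random(seed)
    distractors = rng.sample(positive_pool, total - len(supporting))
    paragraphs = supporting + distractors
    rng.shuffle(paragraphs)  # don't leak support via position
    return paragraphs
```

Because every distractor could plausibly be a supporting paragraph elsewhere, a model can't identify support from surface distributional cues alone.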

Potential Avenues for Improvement
Better Decomposition. We train our EX(SA) model using ground-truth decompositions rather than predicted ones. On -Ans, (An, Sp) improve by (9.4, 7.3) points, and on -Full, (An+Sf, Sp+Sf) improve by (7.3, 6.9) points. The improvements for the EX(EE) model are slightly lower. This shows that while improving question decomposition will help, it's insufficient to reach human parity on the dataset.
Better Transformer. While Longformer can fit long contexts, there are arguably more effective pretrained transformers for shorter inputs, for example, T5. Moreover, since T5 uses relative position embeddings, it can be applied to longer text, albeit at a significant memory and computation cost. We managed to train SA with T5-large on MuSiQue, 12 but didn't use it for the rest of our experiments because of the high computational cost. Over the Longformer SA, the T5 SA showed a modest improvement of (6.1, 0.7) points on -Ans and (1.7, 2.0) on -Full.

Conclusion
Constructing multihop datasets is a tricky process; it can introduce shortcuts and artifacts that models exploit to circumvent the need for multihop reasoning. A bottom-up process of constructing multihop questions from single-hop questions allows systematic exploration of a large space of multihop candidates and greater control over which questions we compose. We showed how to use such a carefully controlled process to create a challenging dataset that, by design, requires connected reasoning, by reducing potential reasoning shortcuts, minimizing train-test leakage, and including harder distractor contexts. Empirical results show that -Ans has a substantially higher human-model gap and is significantly less cheatable via disconnected reasoning than previous datasets. The dataset also comes with unanswerable questions and question decompositions, which we hope will spur further work on developing models that get the right answers for the right reasons.