Reasoning over Public and Private Data in Retrieval-Based Systems

Abstract Users and organizations are generating ever-increasing amounts of private data from a wide range of sources. Incorporating private context is important to personalize open-domain tasks such as question-answering, fact-checking, and personal assistants. State-of-the-art systems for these tasks explicitly retrieve information that is relevant to an input question from a background corpus before producing an answer. While today’s retrieval systems assume relevant corpora are fully (e.g., publicly) accessible, users are often unable or unwilling to expose their private data to entities hosting public data. We define the Split Iterative Retrieval (SPIRAL) problem involving iterative retrieval over multiple privacy scopes. We introduce a foundational benchmark with which to study SPIRAL, as no existing benchmark includes data from a private distribution. Our dataset, ConcurrentQA, includes data from distinct public and private distributions and is the first textual QA benchmark requiring concurrent retrieval over multiple distributions. Finally, we show that existing retrieval approaches face significant performance degradations when applied to our proposed retrieval setting and investigate approaches with which these tradeoffs can be mitigated. We release the new benchmark and code to reproduce the results. 1


Introduction
The world's information is split between that which is publicly and privately accessible, and the ability to simultaneously reason over information from both scopes is useful to support personalized tasks. However, retrieval-based machine learning (ML) systems, which first collect relevant information to a user input from a background knowledge source before producing an output, do not consider retrieving from the private data that organizations and individuals aggregate locally. Retrieval systems are achieving impressive performance across open-domain applications such as language modeling [Borgeaud et al., 2021], question-answering [Voorhees, 1999, Chen et al., 2017], and dialogue [Dinan et al., 2019], and also benefit from practical properties such as updatability and a degree of interpretability. In this work, we focus on the underexplored question of how to personalize these systems while preserving privacy.
Consider the following examples that require a combination of public and private information: individuals could ask "With my GPA and SAT score, which universities should I apply to in the United States?" or "Is my blood pressure in the normal range for someone 55+?". In an organization, an ML engineer could ask: "How do I fine-tune a language model, based on public StackOverflow and our internal company documentation?", 2 or a doctor could ask "How are COVID-19 vaccinations affecting patients with type-1 diabetes based on our private hospital records and public PubMed reports?". 3 Currently, to answer such questions, users must manually cross-reference public and private information. A public knowledge source would not contain a user's medical record and a private knowledge source is not likely to include all medical statistics. In this work, we propose a framework for using public (or global) information to enhance our understanding of private (or local) information, which we refer to as PUBLIC-PRIVATE AUTOREGRESSIVE INFORMATION RETRIEVAL (PAIR).
Figure 1: Multi-hop retrieval systems use beam search to retrieve data from a background corpus: a document retrieved in round, or hop, i is used to retrieve in hop i+1. Thus, if private documents (e.g., medical records or emails) were retrieved in hop i, it would sacrifice privacy if they were used to retrieve public information in hop i+1, since the document would be exposed to the entity hosting the public data. Existing multi-hop systems do not consider retrieval from multiple privacy scopes, the focus of this work.
Given a user question, retrieval-based systems operate by collecting the most similar documents to the question from a massive corpus, and providing these to a separate model which can reason over the information to produce an answer [Chen et al., 2017]. Answering complex questions about public and private data requires reasoning over information that is distributed across multiple documents (e.g., public medical statistics and private health records), termed multi-hop reasoning [Welbl et al., 2018]. For popular benchmarks of multi-hop questions, we find that introducing multiple rounds of retrieval, where the initial question combined with the text of documents retrieved in round i is used to retrieve in round i + 1, provides upwards of 75% performance gains versus using a single round of retrieval (Section 3). Accordingly, iterative retrieval is the typical approach for multi-hop reasoning [Miller et al., 2016, Feldman and El-Yaniv, 2019, Asai et al., 2020, Xiong et al., 2021, Qi et al., 2021, Khattab et al., 2021].
Existing multi-hop systems assume retrieval is performed over one corpus, in a single privacy scope. However, data is often distributed across multiple parties with different privacy restrictions and in certain (e.g., government and medical) settings, data cannot be shared publicly. Broadly, users and organizations often do not want to expose all data to public entities, and it is unlikely that private parties can locally host terabyte-scale and constantly updating web data, naturally resulting in multiple corpora over which to retrieve.
Example To understand why multi-hop retrieval over distributed corpora implicates different privacy concerns, consider two questions from an employee standpoint. First, "Of the products our top competitor released this month, which are most similar to our unannounced upcoming products?". To answer this question, an existing multi-hop system likely (1) retrieves documents (e.g., news articles) about competitor releases from the public corpus, and (2) uses these to collect private documents (e.g., company emails and announcements) that detail upcoming products, so no private information is leaked. Meanwhile, "Have any companies ever released similar products to the one we are designing?" entails (1) retrieving private documents about the upcoming product, and (2) using the confidential product design documents to retrieve documents about public products. The latter retrieval reveals private company documents to the untrustworthy entity hosting the public corpus. An effective privacy model will preclude this private-then-public retrieval, preventing any possibility of leakage.
Guided by the constraint that in many situations users cannot or do not want to publicly expose their private information to public entities, we propose PAIR as a natural and effective privacy framework for complex QA. PAIR employs the classical Bell-LaPadula Model (BLP) (see Section 3 for details), a simple and efficient framework which guarantees no leakage of private data [Bell and LaPadula, 1976]. The framework was originally developed for, and widely used by, government agencies to successfully manage multi-security-level access control, which maps to our setting of open-access public and private user data.
Study and evaluation with PAIR We next address how to methodologically study and evaluate retrieval in the PAIR setting. We first propose an adaptation of one of the most popular multi-hop benchmarks, HotpotQA [Yang et al., 2018], which requires retrieval from Wikipedia data, though we show this adaptation is limited insofar as private and public documents are likely to come from different distributions. 4 We observe that all existing textual multi-hop benchmarks require retrieving from a single domain such as Wiki or Freebase. Thus, to more appropriately evaluate PAIR, we create and release the first textual multi-domain, multi-hop benchmark, called CONCURRENTQA, which spans Wikipedia in the public domain and an open source email collection in the private domain.
Finally, we implement a PAIR-preserving retrieval system which excitingly answers many questions spanning public and private data. However, we show multi-hop reasoning systems exhibit high sensitivity to document retrieval order, thus presenting a privacy-performance tradeoff. Models sacrifice upwards of 19% performance under PAIR constraints to protect document privacy and 57% under constraints to protect query privacy, when compared to a baseline system with standard, non-privacy aware retrieval mechanics.
In summary, we ask how to improve personalization while preserving privacy in retrieval-based systems and our particular contributions are:
• We define the problem of retrieving over public and private data, and introduce the PUBLIC-PRIVATE AUTOREGRESSIVE INFORMATION RETRIEVAL (PAIR) privacy framework based on the classical Bell-LaPadula Model.
• We create CONCURRENTQA, the first textual multi-domain, multi-hop benchmark. In the absence of privacy concerns, the benchmark allows studying multi-distribution retrieval in general.
• We demonstrate and quantify the privacy-performance tradeoff faced by existing multi-hop systems in PAIR and investigate challenges in mitigating the tradeoff.
We hope the framework, resources, and analysis we present encourage further research towards building privacy-preserving retrieval systems. 5

Background & Related Work
Retrieval-Based Systems Open-domain applications in NLP, such as open-domain QA [Voorhees, 1999, Chen et al., 2017], personal assistants [Dinan et al., 2019], and language modeling [Borgeaud et al., 2021] need to support inputs across a broad range of topics. Implicit-memory approaches for open-domain tasks focus on memorizing knowledge within model parameters, for example by taking a pretrained language model such as T5 or BART and fine-tuning it on question-answer training pairs [Roberts et al., 2020].
In contrast, open-domain systems typically have access to the information in a background corpus (e.g., Wikipedia) or knowledge graph (e.g., Wikidata). Systems which explicitly exploit this information, called retrieval-based systems, introduce a step to retrieve information that is relevant to the input from the background corpus, and provide this to a separate task model that produces the output. Retrieval-free approaches have not been shown to work convincingly in multi-hop settings [Xiong et al., 2021].

Multi-hop Open-Domain Question Answering
We concretely demonstrate the challenges in applying PAIR to existing systems by focusing on open-domain QA (ODQA), a classic application for retrieval-based systems. ODQA entails providing an answer a to a question q, expressed in natural language and without explicitly provided context from which to find the answer [Voorhees, 1999]. Prevailing methods for ODQA collect a large collection of documents D and follow a retrieve-and-read approach [Chen et al., 2017, inter alia.] where the retriever retrieves a small set of relevant documents from the collection, from which the reader model extracts an answer. The answer is typically a span from one or more of the retrieved passages.
Our setting is concerned with complex queries where the supporting evidence for the answer is distributed across multiple (public and private) documents, termed multi-hop reasoning [Welbl et al., 2018]. To collect the distributed evidence, existing multi-hop systems use multiple iterations of retrieval: representations of the passages retrieved in iteration i are used to retrieve passages in iteration i + 1. Beam search is a consistent backbone of multi-hop systems [Miller et al., 2016, Feldman and El-Yaniv, 2019, Asai et al., 2020, Wolfson et al., 2020, Xiong et al., 2021, Qi et al., 2021, Khattab et al., 2021]. Various datasets have been developed for multi-hop reasoning [Yang et al., 2018, Talmor and Berant, 2018]. We discuss the applicability of these benchmarks to the PAIR setting in Section 4.

Privacy Framework
The proposed privacy framework, PAIR, is designed after the classical Bell-LaPadula (BLP) privacy model [Bell and LaPadula, 1976], which has been widely and successfully used to manage access control between individuals of given clearance levels to objects of given classification levels. Our work instantiates BLP in the context of retrieval-based systems. Broadly, using freely available public resources, such as large models trained on public data and raw public data, locally, is a compelling setup because this paradigm incurs no privacy leakage whatsoever and can inject personal knowledge without requiring training. This is in contrast to the Federated Learning (FL) [McMahan et al., 2016] and Differential Privacy (DP) [Dwork et al., 2006] privacy frameworks, which do leak information [Shokri et al., 2017, Nasr et al., 2019]. Our setting resembles the FL setting insofar as data heterogeneity and distribution are properties of both, though FL has focused on collective model training across multiple parties, while we focus on information retrieval for a single individual or organization that owns private information. Overall, access control frameworks have been widely used in practice for many years [Hu et al., 2006].
Proposed cryptographic methods for retrieval privacy include obfuscating the query by interleaving real and fake queries [Gervais et al., 2014] or performing secure approximate nearest neighbor (ANN) search. Applications such as search and personal assistants require low latency, and especially for complex queries requiring multiple hops (i.e., the number of queries grows exponentially with the number of hops) over high dimensional vectors, the computational overhead of existing cryptographic methods is prohibitive [Zuber and Sirdey, 2021, Chen et al., 2019]. Even secure ANN approaches that tolerate some privacy leakage for better efficiency are too slow for our setting [Servan-Schreiber, 2021]. Other works propose fully on-device search engines, but scaling the amount of public data that can be hosted locally, not to mention updating at the rate of public data updates, remains challenging [Cao et al., 2019].
Ultimately, PAIR is a natural starting point in a rich decision space; we hope the resources we present facilitate research on alternate privacy models for public-private ODQA under differing cost-landscapes and privacy tradeoffs.

Privacy-Aware ODQA Framework
This section presents our novel retrieval setting and PUBLIC-PRIVATE AUTOREGRESSIVE INFORMATION RETRIEVAL privacy framework.

Preliminaries
Objective Given a multi-hop input q, an individual or organization's private documents p ∈ D_P, and public documents d ∈ D_G, the objective is to provide the user with the correct answer a, which is a span in one or more of the documents. Figure 1 (Right) provides a multi-hop reasoning example, and instances of private and public data (Left).
Standard, Non-Privacy Aware QA Standard non-private multi-hop ODQA involves answering q with the help of passages d ∈ D_G, using beam search. In the first iteration of retrieval, the k passages from the corpus, d_1, ..., d_k, that are most relevant to q are retrieved. The text of a retrieved passage is combined with q using a combination function f (e.g., concatenating the query and passage sequences) to produce q_i = f(q, d_i), for i ∈ [1..k]. Each q_i, which contains an explicitly retrieved document, is used to retrieve k more passages in the following iteration.
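As an illustration of the beam-search procedure described above, the following minimal sketch runs two hops over a toy corpus. The token-overlap scorer, the helper names (`overlap_score`, `retrieve`, `combine`, `multi_hop_retrieve`), and the choice to exclude passages already in a chain are assumptions made for this example; a real system would use a learned retriever.

```python
from typing import List, Tuple


def overlap_score(query: str, passage: str) -> float:
    """Toy relevance score (an assumption): number of shared lowercase tokens."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))


def retrieve(query: str, corpus: List[str], k: int) -> List[str]:
    """Return the k passages in the corpus that are most relevant to the query."""
    return sorted(corpus, key=lambda p: overlap_score(query, p), reverse=True)[:k]


def combine(query: str, passage: str) -> str:
    """Combination function f: here, plain concatenation of query and passage."""
    return query + " " + passage


def multi_hop_retrieve(q: str, corpus: List[str], k: int, hops: int = 2) -> List[List[str]]:
    """Expand every beam with its k best passages per hop; return chains by score."""
    beams: List[Tuple[str, List[str], float]] = [(q, [], 0.0)]
    for _ in range(hops):
        expanded = []
        for query, chain, total in beams:
            # Assumption: a passage is not retrieved twice within one chain.
            candidates = [p for p in corpus if p not in chain]
            for passage in retrieve(query, candidates, k):
                score = total + overlap_score(query, passage)
                expanded.append((combine(query, passage), chain + [passage], score))
        beams = expanded
    beams.sort(key=lambda beam: beam[2], reverse=True)
    return [chain for _, chain, _ in beams]


corpus = [
    "Houston Chronicle ran the Hines skyline story",
    "The Houston Chronicle bought its rival in 1995",
    "Bananas are yellow fruit",
]
chains = multi_hop_retrieve("Which paper ran the Hines skyline story", corpus, k=1)
```

With k passages kept per beam, two hops produce k^2 candidate chains, which is why the number of issued queries grows quickly with the number of hops.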
Are multiple hops useful? An important question is whether multiple hops are actually required for answering complex questions. Unlike Min et al. [2019a] and Chen and Durrett [2019], we consider this question in the open-domain setting. We observe that the performance on HotpotQA [Yang et al., 2018] improves by 26.4 EM (75%) when using two iterations versus using one iteration (Appendix 8).
Bell-LaPadula Model The Bell-LaPadula Model (BLP) manages the access of subjects with assigned clearance levels to objects of assigned security levels [Bell and LaPadula, 1976]. BLP is defined by three security rules: subjects cannot read data at higher security levels (Simple Security Property), subjects cannot write to data-stores at lower security levels (*-Property), and discretionary access to objects can be granted or revoked from subjects (Discretionary Security Property). We next present our privacy framework based on BLP.
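The two mandatory BLP rules can be sketched as simple level comparisons. The two-level lattice (`PUBLIC`, `PRIVATE`) and the function names below are assumptions chosen to mirror the public/private setting of this work; the Discretionary Security Property, which depends on an explicit access matrix, is not modeled.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical two-level lattice: a higher value means more sensitive."""
    PUBLIC = 0
    PRIVATE = 1


def can_read(subject: Level, obj: Level) -> bool:
    """Simple Security Property: a subject cannot read above its own level."""
    return subject >= obj


def can_write(subject: Level, obj: Level) -> bool:
    """*-Property: a subject cannot write to a store below its own level."""
    return subject <= obj


# A private-level user may read public data but may not write to a public store.
user_reads_public = can_read(Level.PRIVATE, Level.PUBLIC)
user_writes_public = can_write(Level.PRIVATE, Level.PUBLIC)
```

Under this reading, sending a query that embeds a private document to a public host corresponds to a write to a lower level, which the *-Property forbids.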

PUBLIC-PRIVATE AUTOREGRESSIVE INFORMATION RETRIEVAL Framework
In the private QA setting, users and organizations are classified and hold confidential private data, and unclassified services (e.g., cloud services) host public data. The user inputs to the public-private QA system are D_P and q. We now describe the PAIR framework and challenges in applying non-private retrieval methods to both D_P and D_G.
Constraint 1: Data is stored in two separate enclaves and personal documents p ∈ D_P cannot leave the user's enclave. PAIR requires introducing a second, private corpus over which to retrieve, since users do not want to publicly expose their data to create a single public corpus nor blindly write personal data to a public location. 6 Further, we assume it is infeasible to copy public data to produce a single local corpus for each user. This is because not only are there terabytes of public data, but public data is also constantly being updated. Thus, users host private data (D_P) and public (cloud) entities host open-access public data (D_G). Now given an input query q, the system must perform one retrieval over D_G and a second over D_P. The top-k retrieved passages for each iteration will include k_P private passages and k_G public passages; the top k of the k_P + k_G passages are used for the following iteration of retrieval.
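The per-hop merge across the two enclaves can be sketched as follows. This is a minimal illustration, assuming each enclave returns `(score, passage_id)` pairs whose scores are directly comparable; the function name `merge_top_k` and the example scores are hypothetical.

```python
import heapq
from typing import List, Tuple

Hit = Tuple[float, str]  # (retrieval score, passage id)


def merge_top_k(private_hits: List[Hit], public_hits: List[Hit], k: int) -> List[Tuple[float, str, str]]:
    """Keep the k best passages overall from up to k private and k public hits.

    Assumption: the merge runs inside the user's enclave, so private hits
    are never shown to the public host.
    """
    tagged = [(score, "private", pid) for score, pid in private_hits[:k]]
    tagged += [(score, "public", pid) for score, pid in public_hits[:k]]
    return heapq.nlargest(k, tagged, key=lambda hit: hit[0])


mixed = merge_top_k([(0.9, "p1"), (0.2, "p2")], [(0.8, "g1"), (0.7, "g2")], k=2)
all_public = merge_top_k([(0.1, "p1")], [(0.8, "g1"), (0.7, "g2")], k=2)
```

Requesting up to k passages from each enclave and pruning afterwards lets the private and public counts vary per question, at the cost of fetching more passages per hop than a single-corpus system.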
If the retrieval system stops after a single hop, there is no privacy risk since no p ∈ D_P is seen by public entities. 7 However, for multi-hop questions, if k_P > 0 for an initial round of retrieval, meaning there exists some p_i ∈ D_P which was in the top-k passages, in general it would sacrifice privacy if f(q, p_i) were to be used to perform the next round of retrieval from D_G. Thus to preserve the privacy of private documents, under PAIR, public retrievals precede private document retrievals.
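The ordering rule above can be expressed as a check on which corpora the next hop is allowed to query. This is a sketch under the assumption that each retrieved passage carries a `"public"` or `"private"` scope label; the function name `allowed_corpora` is hypothetical.

```python
from typing import List, Set


def allowed_corpora(chain_scopes: List[str]) -> Set[str]:
    """Return the corpora that the next hop may query for a given chain.

    chain_scopes lists the scope of each passage already folded into the
    query, in retrieval order. Once any private passage is part of the
    query, the query may no longer be sent to the public host.
    """
    if "private" in chain_scopes:
        return {"private"}
    return {"public", "private"}


first_hop = allowed_corpora([])
after_public = allowed_corpora(["public"])
after_private = allowed_corpora(["private"])
```

Only chains of the form public-then-private, public-then-public, or private-then-private remain reachable, which matches the requirement that public retrievals precede private ones.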
Constraint 2: Inputs that entirely rely on private information should not be revealed publicly. Given the two indices for D_P and D_G, q may be entirely answerable using multiple hops over the D_P index, in which case q would never need to leave the user device. For example, consider the hypothetical query from an employee standpoint: "Does the search team use any infrastructure tools that our personal assistant team does not use?", which is answerable purely through private company information. Prior work demonstrates that queries are very revealing of user interests, intents, and backgrounds [Xu et al., 2007, Gervais et al., 2014, Hill, 2012], and for users who are especially concerned about privacy, there is an observable difference in their search behavior [Zimmerman et al., 2019].
Adherence to the PAIR framework allows no possibility for data leakage and is simple to understand, without introducing inefficiencies over the non-private baselines. The framework is, however, conservative, which, as we shall see, can have performance implications. If a user does not mind revealing certain data or weakening these constraints, our approach can be extended with methods that manage such user-specified exceptions [Xu et al., 2007, Shou et al., 2014]. PAIR is a natural privacy framework, based on a widely used and successful foundation, BLP; however, we hope this work inspires broader research on privacy-preserving solutions under alternate performance-privacy cost models.

CONCURRENTQA for Multi-Domain Multi-Hop Reasoning
In this section, we develop a testbed for studying the PAIR framework. The key requirement is a set of questions spanning two corpora, D_P and D_G. We begin by considering the use of existing benchmarks and describing the limitations we encounter, motivating the creation of our new benchmark, CONCURRENTQA. Then we describe the benchmark collection process and provide an analysis of the contents.

Adapting Existing Benchmarks to Privacy-Preserving QA and Limitations
We first adapt the widely used benchmark, HotpotQA [Yang et al., 2018], to study our problem. HotpotQA contains multi-hop questions, which are each answerable by multiple Wikipedia passages. We create HotpotQA-PAIR by splitting the Wikipedia corpus into D_G and D_P by randomly assigning Wikipedia articles to one or the other. This results in questions entirely reliant on p ∈ D_P, entirely reliant on d ∈ D_G, or reliant on a mix of one private and one public document, allowing us to evaluate performance under the PAIR constraints.
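A random split of this kind can be sketched in a few lines. The even split (`private_fraction=0.5`), the fixed seed, and the helper names `split_corpus` and `hop_scopes` are assumptions for illustration; the text above does not state the proportion used.

```python
import random
from typing import Iterable, Sequence, Set, Tuple


def split_corpus(titles: Iterable[str], private_fraction: float = 0.5,
                 seed: int = 0) -> Tuple[Set[str], Set[str]]:
    """Randomly assign each article title to the private or the public corpus."""
    rng = random.Random(seed)
    private: Set[str] = set()
    public: Set[str] = set()
    for title in titles:
        if rng.random() < private_fraction:
            private.add(title)
        else:
            public.add(title)
    return private, public


def hop_scopes(gold_titles: Sequence[str], private: Set[str]) -> Tuple[str, ...]:
    """Label each gold passage of a question by the corpus it landed in."""
    return tuple("private" if title in private else "public" for title in gold_titles)


titles = [f"Article {i}" for i in range(100)]
private_titles, public_titles = split_corpus(titles)
```

Applying `hop_scopes` to the gold passages of each question groups the questions into the all-private, all-public, and mixed categories described above.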
Ultimately, however, D_P and D_G come from a single Wikipedia distribution in HotpotQA-PAIR. While it is possible that public and private data come from the same distribution (e.g., organizations routinely develop internal Wikis in the style of public Wikipedia), private and public data will intuitively often reflect different linguistic styles, structures, and topics that further evolve over time [Hawking, 2004]. We observe all existing textual multi-hop benchmarks focus on retrieving from a single distribution (Table 1). Additionally, we cannot combine existing benchmarks over two different corpora because this will not yield questions requiring one passage from each domain. Methodologically, in the PAIR setting we likely will not have access to training data from all downstream (private) domains. To evaluate with a realistically private set of information and PAIR setup, we create a new benchmark, CONCURRENTQA.
6 Following from the Simple Security Property and *-Property in the BLP model.
7 Single-hop can also avoid performance degradations arising from using two enclaves. Recall that a non-private system retrieves the top k overall passages, so if for example k_P = k/2 and k_G = k/2, such that k_P + k_G = k, the system may not retrieve the optimal k passages that the non-private system would have retrieved (e.g., consider when the overall top k passages for a question are in D_G). However, letting k_P ∈ [0..k], k_G ∈ [0..k] circumvents this challenge, at the cost of retrieving a few more passages per hop.
Hop 1 An email reports the list of investors in the fifth round for Exraprise.
Hop 2 An email reports the list of investors in the first round for JobsOnline.com.
What is the position of the person who sent an e-mail on 3/15/01 at 3:26 PM where the first listed recipient was Susan McCabe?
Hop 1 An email that forwards an original email sent by Julee Malinowski-Ball.
Hop 2 A different email from Julee Malinowski-Ball, which includes her position in the signature.
The paper that ran a story on 4/20/01 titled "Hines will add to skyline" bought out its long-time rival in what year to become its home city's primary newspaper?
Hop 1 An email includes a list of headlines, relevant to Enron, published by newspapers from 4/20/01. The article of interest was by the Houston Chronicle.

Hop 2 The Wikipedia passage about the Houston Chronicle describes the 1995 buy-out of the rival.
Table 2: Example queries constructed over Wikipedia (D_G) and emails (D_P).

CONCURRENTQA Overview
We create and release a new multi-hop QA dataset, CONCURRENTQA, which is designed to more closely resemble a practical use case for PAIR. CONCURRENTQA contains questions spanning Wikipedia documents as D_G and Enron employee emails [Klimt and Yang, 2004] as D_P. The email corpus is one of the only collections of real emails that has been publicly released for research use. 8 We imagine two evaluation settings for CONCURRENTQA: (1) performance under defined (either PAIR or future proposals) privacy restrictions (presented in Section 5), and (2) multi-domain question-answering in the absence of privacy concerns (presented in Section 6).
Contents The full set of information collected from the crowd worker includes: the question which requires reasoning over multiple documents, the answer to the question which is a span in one of the documents, and the specific supporting sentences in the documents which are necessary to arrive at the answer and can serve as useful supervision signals. Given an input question from our dataset, the QA system must extract a span of text from the contexts as the answer.
Ethics Statement The Enron Email Dataset is already widely used in NLP research [Heller, 2017]. That said, we acknowledge the origin of this data as collected and made public by the U.S. Federal Energy Regulatory Commission during their investigation of Enron. We note that many of the individuals whose emails appear in the dataset were not involved in any wrongdoing. We defer to using inboxes that are frequently used and well-studied in prior literature and that were not subject to redaction requests from affected employees, remaining freely available in the public domain.
Split   Total    Comparison   Bridge
Train   15,239   1,093        14,146
Dev     1,600    200          1,400
Test    1,600    200          1,400
Table 3: CONCURRENTQA Benchmark size statistics. The evaluation sets are balanced between questions for which the gold evidence passages are emails versus Wikipedia passages for Hop 1 and Hop 2 respectively.

Benchmark Design
As in HotpotQA, CONCURRENTQA is collected by showing crowd workers multiple supporting context documents and asking them to submit a question that requires reasoning over all the documents. We discuss the tradeoffs of our design choices in Section 4.5.
Passage Pairs As discussed in Yang et al. [2018], collecting a high-quality multi-hop QA dataset is challenging because it is important to provide reasonable pairs of supporting context documents to the worker, since not all article pairs are conducive to a good multi-hop question. There are four types of pairs we need to collect for the Hop 1 and Hop 2 passages: Private and Private, Private and Public, Public and Private, and Public and Public. We use the insight that we can obtain meaningful passage-pairs by showing workers passages that mention similar or overlapping entities. All crowdworker assignments contain unique passage pairs. We release all our code for creating the passage pairs from raw data and Algorithm 1 gives the full data collection procedure.
While entity-tags are readily available for Wikipedia passages, hyperlinks are not readily available for many unstructured data sources including emails. Personal data also contains both private and public (e.g., Wiki) entities. High precision entity-linking is critical for the quality of the benchmark: for evaluation purposes, a question assumed to require the retrieval of private passages, should not be unknowingly answerable by public passages. We use a combination of off-the-shelf entity recognition and linking tools, and post-processing to tag private emails (Additional details in Appendix 8). 9 For all passage-pairs shown to crowdworkers, we provide a hint that describes overlapping entities between the passages to assist with question generation.
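To make the entity-overlap idea concrete, the sketch below pairs passages that share at least one tagged entity and returns the shared entities, which could serve as the hint shown to workers. This is an illustrative assumption and not the released pipeline or Algorithm 1; the data layout and the name `candidate_pairs` are hypothetical, and pairs are unordered here, so assigning Hop 1 and Hop 2 roles would be a separate step.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

# passage id -> (scope, set of linked entity names); a hypothetical layout
Passages = Dict[str, Tuple[str, Set[str]]]


def candidate_pairs(passages: Passages) -> List[Tuple[str, str, List[str]]]:
    """Return each unique passage pair that shares at least one entity."""
    pairs = []
    for id_a, id_b in combinations(sorted(passages), 2):
        shared = passages[id_a][1] & passages[id_b][1]
        if shared:
            # The sorted shared entities double as the hint for the worker.
            pairs.append((id_a, id_b, sorted(shared)))
    return pairs


toy_passages: Passages = {
    "email_1": ("private", {"Houston Chronicle", "Enron"}),
    "wiki_1": ("public", {"Houston Chronicle"}),
    "wiki_2": ("public", {"Banana"}),
}
toy_pairs = candidate_pairs(toy_passages)
```

Because `combinations` visits each unordered pair once, no passage pair is proposed twice.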

Dataset Collection
The dataset collection proceeded in two stages, question generation and validation. Tasks were conducted through Amazon Mechanical Turk 10 using the Mephisto interface. 11 The end-to-end pipeline is in Figure 8.
The question generation stage began with an onboarding process in which we provided training videos, documents with examples and explanations, and a multiple-choice exam. Workers completing the onboarding phase were given access to pilot assignments, which we manually reviewed to identify individuals providing high-quality submissions. Finally we worked with the shortlisted individuals to collect the full dataset.
For validation, we manually reviewed over 2.5k queries to identify workers with high-quality submissions, and prioritized including manually-verified examples in the final test and dev splits. Through reviewing, we identified the key reasons to invalidate and exclude questions from the benchmark (e.g., if a question could be answered using one passage alone, has multiple plausible answers either in or out of the shown passages, or simply lacks clarity). Using these insights, we developed a second task to validate all generated queries. The validation task again involved onboarding and pilot steps, in which we manually reviewed performance. We shortlisted ∼20 crowdworkers with high quality submissions who collectively validated examples appearing in the final benchmark.

Benchmark Analysis
In this section we analyze the contents of CONCURRENTQA. The background corpora contain 47k email passages (D_P) and 5.2M Wikipedia passages (D_G), and the benchmark contains 18,439 total examples (Table 3). Table 2 includes examples of CONCURRENTQA queries.
Question Types We identify three main reasoning patterns required for CONCURRENTQA questions: (1) bridge questions require identifying an entity or fact in hop 1 on which the second retrieval is dependent, (2) attribute questions require identifying the entity that satisfies all attributes in the question, where attributes are distributed across multiple passages, and (3) comparison questions require comparing two similar entities, where each entity appears in a separate passage. We estimate the benchmark contains 80% bridge, 12% attribute, and 8% comparison questions.  Salient topic categories that may be more popular in CONCURRENTQA compared to purely Wikipedia-based benchmarks include questions related to investments in projects and companies, newspaper articles, executives or changes in company leadership (e.g., new board members, C-Suite), legal activity (e.g., introduction, voting, or opposition to proposed bills or lawsuits and court cases), email features (see Example 4 in Table 2), and political events.
Passage Types There is a distinct shift between the emails and Wikipedia passages. Format: Wikipedia passages for entities of the same type tend to be similarly structured, while Enron emails introduce many formats: for example, certain emails contain portions of forwarded emails, lists of articles, or spam advertisements. Noise: Wikipedia passages tend to be typo-free, while the emails contain several typos, URLs, and inconsistent capitalization (examples in Table 14). Entity Distributions: Wikipedia passages tend to focus on details about one entity, while a single email can cover multiple (possibly unrelated) topics. Information about Enron entities is also observationally more distributed across passages, whereas public entity-information tends to be localized to one Wikipedia passage. We observe that a private entity occurs 9 times on average in gold training data passages while a public entity appears 4 times on average. There are 22.6k unique private entities in the gold training data passages, and 12.8k unique public entities. Passage Length: Finally, the average length of Wikipedia passages is shorter than the average length of email passages (Table 12). 12 Figure 7 visually shows the distributions of questions and passages.
Answer Types CONCURRENTQA is a factoid QA task, so answers tend to be short spans of text containing nouns, or entity names and properties. Figure 2 shows the distribution of NER tags across answers and examples from each category.

Benchmark Limitations
CONCURRENTQA, like HotpotQA, faces the limitation that crowdworkers see the gold supporting passages when creating questions, which can result in textual overlap between substrings in the questions and passages [Trivedi et al., 2020]. We mitigate these effects through our validation task, and by limiting the allowable degree of overlap between passage pairs and questions through the frontend interface during the generation stage. Further, our questions are not organic user searches as in Kwiatkowski et al. [2019]; however, search logs do not contain questions over public and private data, and, to our knowledge, existing dialogue systems have not considered retrieval from a private corpus.
Additionally, Enron was a major public corporation and many entities discussed in Enron emails are public entities, so it is possible that public websites and news articles encountered during retriever and reader model pretraining, impact the distinction between public and private questions. We investigate the impact of dataset leakage further in Section 6.

Evaluation in the PAIR Setting
Research Question How do existing multi-hop ODQA systems perform under the PAIR framework on the HotpotQA-PAIR proxy and CONCURRENTQA benchmark described in Section 4?
Model We use the multi-hop QA method, multi-hop dense retrieval (MDR) [Xiong et al., 2021], as the baseline for evaluation, given its simplicity and competitive performance; however, the privacy analysis applies to all iterative multi-hop methods. MDR is a dense-passage-retrieval (DPR) bi-encoder model, consisting of a query encoder g(·) and passage encoder h(·) [Karpukhin et al., 2020]. The model is trained contrastively on tuples of queries, positive passages (containing the answer to the query), and negative passages. For fast retrieval, document embeddings are typically stored in an index designed for efficient similarity search and embedding clustering [Johnson et al., 2017]. In the first iteration of MDR, the embedding for query q is used to retrieve the k documents $d_1, \dots, d_k$ with the highest retrieval score according to a maximum inner product search over the dense corpus D:
$$d_1, \dots, d_k = \underset{d \in D}{\operatorname{top-}k}\; g(q)^\top h(d)$$
Retrieved documents are each appended to q, and the embedding of the resulting $q|d_i$ is used to collect k passage sets of k passages each for the following iteration. The resulting $k^2$ passages are ranked again, and the top k are presented to the reader model, which selects a candidate answer in each passage. The candidate with the highest reader score is output. The reader is a fine-tuned ELECTRA-Large model [Clark et al., 2020].
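The two-iteration retrieval loop described above can be sketched as follows. This is a minimal illustration, not the MDR implementation: `toy_encoder` is a hypothetical stand-in for the trained encoders, the same toy function plays the role of both g(·) and h(·), the `" | "` separator is arbitrary, and scoring a chain by the sum of its two hop scores is an assumption made for the sketch.

```python
import numpy as np

def toy_encoder(text, dim=16):
    """Hypothetical stand-in for a trained encoder: a deterministic bag-of-words hash."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[sum(map(ord, token)) % dim] += 1.0
    return vec

def top_k(query_vec, passage_matrix, k):
    """Indices and scores of the k passages with the largest inner product."""
    scores = passage_matrix @ query_vec
    order = np.argsort(-scores)[:k]
    return order, scores[order]

def two_hop_retrieve(encode_query, passages, passage_matrix, question, k):
    """Retrieve k first-hop passages, expand each into k chains, keep the top k of k^2."""
    hop1_ids, hop1_scores = top_k(encode_query(question), passage_matrix, k)
    chains = []
    for i, s1 in zip(hop1_ids, hop1_scores):
        expanded = question + " | " + passages[i]  # the query q|d_i for the second hop
        hop2_ids, hop2_scores = top_k(encode_query(expanded), passage_matrix, k)
        for j, s2 in zip(hop2_ids, hop2_scores):
            # Assumption: a chain is scored by the sum of its hop scores.
            chains.append((float(s1 + s2), int(i), int(j)))
    chains.sort(reverse=True)
    return chains[:k]

passages = [
    "the family man stars nicolas cage",
    "david weissman co-wrote the family man",
    "enron traded energy in california",
    "the spurs play basketball",
]
passage_matrix = np.stack([toy_encoder(p) for p in passages])
chains = two_hop_retrieve(toy_encoder, passages, passage_matrix,
                          "who co-wrote the family man", k=2)
```

Each returned tuple is `(chain_score, hop1_index, hop2_index)`; in the full system the passages of each chain would then be passed to the reader.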
Privacy-Performance Tradeoff Next we evaluate MDR within the PAIR framework. We use question-answering models trained on HotpotQA (i.e., Wikipedia) data to evaluate performance both on the in-distribution HotpotQA and mixed-distribution CONCURRENTQA evaluation data. The latter setting captures the intuition that public and private data will reflect different distributions, and training data is unlikely to be available for private distributions.
1. Single-Index Baseline Here we combine all the public and private documents in a single corpus, setting aside privacy concerns (Table 4 -"No Privacy Baseline"). This is the current standard.
2. Multiple-Indices We create two corpora and retrieve the top k from each in each iteration. Note that retrieving fewer than k documents per index may result in a performance drop vs. the single-index baseline if the global top-k are all in one corpus. However, by instead retrieving the top-k from each index and retaining the top k_P private and k_G public passages such that k_P + k_G = k, we can fully recover the performance of using a single index (Table 4 -"No Privacy Multi-Index"). The cost of this decision is that it introduces up to 2x as many queries per iteration, since each query is used to retrieve from both indices.
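To illustrate why retrieving the full top-k from each index recovers the single-index result, here is a small sketch over precomputed (score, id) pairs. The function names, the `per_index_k` parameter, and the toy scores are illustrative assumptions, not part of the released code.

```python
import heapq

def top_k(scored, k):
    """Return the k highest-scoring (score, doc_id) pairs, best first."""
    return heapq.nlargest(k, scored)

def multi_index_top_k(public_scored, private_scored, k, per_index_k=None):
    """Retrieve per_index_k candidates from each index, then keep the global top k."""
    per_index_k = k if per_index_k is None else per_index_k
    candidates = top_k(public_scored, per_index_k) + top_k(private_scored, per_index_k)
    return top_k(candidates, k)

public_scored = [(0.9, "w1"), (0.8, "w2"), (0.1, "w3")]
private_scored = [(0.7, "e1"), (0.2, "e2")]

single_index = top_k(public_scored + private_scored, 2)
full_per_index = multi_index_top_k(public_scored, private_scored, 2)
reduced_per_index = multi_index_top_k(public_scored, private_scored, 2, per_index_k=1)
```

In this toy case the global top 2 both live in the public index, so retrieving only one passage per index misses `w2`, while retrieving k from each index and merging matches the single-index ranking.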

Document Privacy
To maintain document privacy, we cannot use a private passage p retrieved in a prior retrieval iteration to subsequently retrieve from D_G. Restricting retrievals using p results in a clear performance drop (see Table 4 -"Document Privacy Baseline").

Query Privacy
The natural baseline to enforce query privacy is to only retrieve from D_P on each hop. This results in a significant performance drop (see Table 4 -"Query Privacy Baseline").
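The two baselines can be read as routing rules that decide which corpora a query may be sent to. The sketch below encodes those rules for a two-hop question; the mode names (`"none"`, `"document"`, `"query"`) and function names are illustrative assumptions.

```python
PUBLIC, PRIVATE = "public", "private"

def allowed_corpora(mode, previous_hop=None):
    """Corpora a query may be sent to.

    previous_hop is the privacy scope of the passage appended to the query
    (None for the first hop, where the query is the question alone).
    """
    if mode == "none":
        return {PUBLIC, PRIVATE}
    if mode == "document":
        # A private passage must never be used to retrieve from the public corpus.
        return {PRIVATE} if previous_hop == PRIVATE else {PUBLIC, PRIVATE}
    if mode == "query":
        # The question itself is private, so only the private corpus is queried.
        return {PRIVATE}
    raise ValueError(f"unknown privacy mode: {mode}")

def answerable(mode, hop1, hop2):
    """True if the gold path hop1 -> hop2 can be retrieved under the given mode."""
    return hop1 in allowed_corpora(mode) and hop2 in allowed_corpora(mode, hop1)
```

Under these rules, document privacy blocks only the Private → Public path, while query privacy leaves only Private → Private reachable.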
Encouragingly, we are able to answer many complex questions while maintaining privacy (see examples in Table 5 from CONCURRENTQA). At the same time, however, in maintaining document privacy, the end-to-end question-answering system achieves 9% worse performance for HotpotQA and 19% worse performance for CONCURRENTQA compared to the quality of the non-private system, and the degradation is even worse if questions are only posed to the private corpus. The performance degradation is undesirable for a deployed system, so our next focus is to investigate the key research challenges towards realizing privacy-preserving retrieval systems.

Challenges in Enabling Public-Private Retrieval
In this section, we investigate challenges towards improving the quality of public-private retrieval systems:
1. Research Question Can we predict whether a natural language question is unanswerable due to imposed privacy restrictions? We explore this in Selective Prediction, Section 6.1.
2. Research Question How do retrieval systems perform when public and private data distributions differ? We explore this in Multi-Distribution Retrieval, Section 6.2.

Selective Prediction
To mitigate the privacy-performance tradeoffs observed in Section 5, the first natural objective is to answer as many questions as possible (high coverage) under imposed privacy constraints, with as high performance as possible (low risk).
Design Primitives Ultimately, given a multi-hop query q, we need to classify between the cases for the Hop i → Hop i+1 supporting documents, where each Hop ∈ {Private, Public}. A question can be answered with PAIR-document-privacy so long as the supports are not Private → Public. A question can be answered with PAIR-query-privacy so long as the supports are Private → Private. To classify between these cases, the options are to use linguistic features and representations, or to use model outputs. Classifying using linguistic features (e.g., entities mentioned) alone is challenging, due to the diversity of user queries. Consider the following HotpotQA examples:
• For some queries, required entities are not mentioned by name in the query. E.g., answering "What screenwriter with credits for 'Evolution' co-wrote a film starring Nicolas Cage and Téa Leoni?" requires retrieving the document for "The Family Man" then for "David Weissman".
• For other queries, no entities are mentioned by name and only descriptions of entities are provided. E.g., "What company claims to manufacture one out of every three objects that provide a shelf life typically ranging from one to five years?"
Instead, selective prediction [Chow, 1957, El-Yaniv and Wiener, 2010, Geifman and El-Yaniv, 2017] is a common and general starting point to predict answerability (e.g., Rodriguez et al. [2019], Kamath et al. [2020], Lewis et al. [2021]).
In selective prediction, given an input x, and a model which outputs (ŷ, c), where ŷ is the predicted label and c ∈ R represents the model's confidence in the prediction, the system provides ŷ if c ≥ γ for some threshold γ ∈ R, and abstains otherwise. We evaluate using risk-coverage curves [El-Yaniv and Wiener, 2010], where the coverage is the proportion of queries the selective prediction method answers (i.e., examples for which c ≥ γ), and the risk is the error achieved on the covered queries. Intuitively, as γ increases, coverage and risk both tend to decrease. The QA model outputs an answer-string and score for the top k passage chains collected by the retriever, and we compute the softmax over these scores, using the top softmax score as c [Hendrycks and Gimpel, 2017]. These are the same reader scores as in Section 5; models are trained on HotpotQA data and applied to HotpotQA and CONCURRENTQA evaluation data. Recall that Document Privacy restricts Private → Public retrieval.
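As a concrete illustration of the procedure above, the following sketch computes the confidence c as the top softmax score and evaluates one point on a risk-coverage curve. It assumes per-query error values are available (for example, 1 minus F1); the function names and toy numbers are illustrative.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D sequence of reader scores."""
    z = np.asarray(scores, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def confidence(reader_scores):
    """Confidence c: the top softmax score over the k candidate answers."""
    return float(softmax(reader_scores).max())

def risk_coverage(confidences, errors, gamma):
    """Coverage is the fraction of queries with c >= gamma; risk is their mean error."""
    c = np.asarray(confidences, dtype=float)
    err = np.asarray(errors, dtype=float)
    covered = c >= gamma
    coverage = float(covered.mean())
    # With no covered queries the risk is undefined; 0.0 is used here by convention.
    risk = float(err[covered].mean()) if covered.any() else 0.0
    return coverage, risk

toy_confidences = [0.9, 0.6, 0.3]
toy_errors = [0.0, 0.0, 1.0]
curve = [risk_coverage(toy_confidences, toy_errors, g) for g in (0.0, 0.5, 0.95)]
```

Sweeping γ over a grid and plotting the resulting (coverage, risk) pairs gives the kind of curve discussed in this section.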
Takeaways Figure 3 shows the risk-coverage curves for predictions produced after either "No Privacy" or "Document Privacy" retrieval for HotpotQA (left) and CONCURRENTQA (right). The non-private score of 75.3 F1 for HotpotQA is achieved at 85.7% coverage and 53.0 F1 for CONCURRENTQA at 67.8% coverage.
Privacy restrictions appear to increase the selective prediction challenge. In Figure 3, the risk-coverage tradeoff is consistently worse (i.e., at a given coverage level, the risk is higher) for selective prediction methods applied to the Document Privacy compared to No Privacy baselines. In Figure 4, we break down the risk-coverage by the domains of the supporting passages required for Hop 1 → Hop 2 on each question. Recall that enforcing Document Privacy restricts Private → Public retrieval sequences. We observe the risk-coverage tradeoff worsens not only for Private → Public, but also for alternate unrestricted retrieval paths, such as Public → Private for CONCURRENTQA (right), under the Document Privacy vs. No Privacy baseline. Intuitively, if the reader receives low-quality passages for Private → Public questions, its confidence may be lower for similar Public → Private examples. We observe the reader model's softmax entropy is 38.4% higher across Public → Private and Private → Public examples in CONCURRENTQA when Document Privacy is imposed, compared to the No Privacy baseline. Privacy-restricted examples are essentially out-of-distribution, increasing the selective prediction challenge [Kamath et al., 2020].
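For reference, the softmax entropy mentioned above can be computed from the reader's k candidate scores as in the sketch below. The function names are illustrative, and how entropies are aggregated across examples before comparing the two settings is not specified here, so `relative_increase` is only one plausible way to express such a comparison.

```python
import numpy as np

def softmax_entropy(scores):
    """Shannon entropy (in nats) of the softmax distribution over reader scores."""
    z = np.asarray(scores, dtype=float)
    z = z - z.max()
    log_p = z - np.log(np.exp(z).sum())  # log-softmax, numerically stable
    p = np.exp(log_p)
    return float(-(p * log_p).sum())

def relative_increase(baseline, value):
    """Relative change of value with respect to baseline, e.g. 0.384 for 38.4% higher."""
    return (value - baseline) / baseline

uniform_entropy = softmax_entropy([0.0, 0.0, 0.0, 0.0])
peaked_entropy = softmax_entropy([10.0, 0.0, 0.0, 0.0])
```

A flatter score distribution (a less certain reader) yields higher entropy, which is the sense in which privacy-restricted examples look less confident.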
Selective prediction quality is much worse for certain sub-distributions of CONCURRENTQA. Independent of privacy concerns, Figure 4 shows worse performance at full-coverage and worse risk-coverage tradeoffs for questions involving private emails. Alongside improving predictions of answerability under privacy restrictions, there is significant room to improve retrieval quality even in the absence of privacy concerns, which we investigate next.

Multi-Distribution Retrieval
Progress on the more general multi-domain retrieval problem is an important step towards succeeding on CONCURRENTQA and enabling public-private retrieval, as well as retrieval over temporally-evolving data. While in the more common zero-shot retrieval setting [Guoa et al., 2021, Thakur et al., 2021] all k retrieved passages will be from the out-of-distribution (OOD) corpus for each retrieval, in the underexplored mixed-retrieval setting, it is possible to retrieve zero OOD passages in the top k. Each sub-distribution may further benefit from a different retrieval method.
Table 6: CONCURRENTQA results using four retrieval approaches, and oracle retrieval. On the right, we show performance (F1 scores) by the domains of the Hop 1 and Hop 2 gold passages for each question, where email is "E" and Wikipedia is "W". "EW" indicates the Hop 1 gold passage is an email, and Hop 2 gold passage is from Wikipedia.
Section 5 shows us that although HotpotQA and CONCURRENTQA are curated using the same data collection process, overall performance on CONCURRENTQA remains 18.8 F1 worse. Notably, applying models trained on HotpotQA to CONCURRENTQA, we observe similar performance on the subset of questions for which the Hop 1 and Hop 2 passages both come from Wikipedia.
Retrieval Baselines We evaluate on CONCURRENTQA using four retrieval baselines:
(1) CONCURRENTQA-MDR is a dense retriever (MDR, Section 5.1) trained on the CONCURRENTQA train set (15.2k examples); we use this to understand the value of in-domain training data for the task.
(2) HotpotQA-MDR is a dense retriever trained on the full HotpotQA train set (90.4k examples); we use this to understand how well a publicly trained model performs on the public-private mixed distribution.
(3) Subsampled HotpotQA-MDR is a dense retriever trained on subsampled HotpotQA data of the same size as the CONCURRENTQA train set; we use this to investigate the effect of dataset size.
(4) Finally, we consider BM25 sparse retrieval, as prior work indicates its strength in OOD retrieval [Thakur et al., 2021].
Results are in Table 6. For each method, k = 100, where k is the number of retrieved passages per hop. We include ablations for different values of k, as well as details about the models and experiments, in Appendix 8. The reader for all runs is an ELECTRA-Large model trained on the full HotpotQA training set.
Training Data Size Strong dense retrieval performance requires a large amount of training data. Comparing the CONCURRENTQA-MDR and Subsampled HotpotQA-MDR baselines, the former outperforms by 12.6 F1 points as it is evaluated in-domain. However, the HotpotQA-MDR baseline, trained on the full 90k HotpotQA training examples, performs nearly equal to CONCURRENTQA-MDR. Figure 5 shows the performance of Subsampled CONCURRENTQA-MDR and Subsampled HotpotQA-MDR for subsample sizes ∈ {1k, 2k, 4k, 8k, |CONCURRENTQA Train|} and the full HotpotQA training dataset. Next, we observe that the sparse method matches the zero-shot performance of the Subsampled HotpotQA model on CONCURRENTQA, but for larger dataset sizes (HotpotQA-MDR) and in-domain training data (CONCURRENTQA-MDR), dense retrieval outperforms sparse retrieval. Notably, it may be difficult to obtain training data for all incurred distributions, especially for private or temporally arising distributions [Hawking, 2004, Chirita et al., 2005].
Domain Specific Performance Each retrieval method excels in a different subdomain of the benchmark. Table 6 shows the retrieval performance of each method based on whether the gold supporting passages for the first and second hop of the multi-hop example are email (E) or Wikipedia (W) passages. The notation EW means the first hop is an email and the second is a Wikipedia passage. HotpotQA-MDR performance on WW questions far exceeds performance on questions where at least one supporting passage is an email. Notably, HotpotQA-MDR gives 81.3 EM for WW, but only 28.7 EM for EE. We also observe that the sparse retriever performs worse than the dense models on Wikipedia-based questions, but better on questions involving an email as Hop 2. When training on CONCURRENTQA, the performance on questions involving emails improves significantly, but remains lower than the performance on Wikipedia-based questions. The WW performance also decreases significantly using CONCURRENTQA-MDR. We discuss this further in Section 6.3.
Oracle QA performance using oracle retrieval, i.e., explicitly providing the gold supporting passages to the model, is also provided in Table 6. These results demonstrate significant room to improve retrieval; however, performance on EE questions also remains low, indicating room to improve the reader as well.
Dataset Leakage We use RoBERTa-Base for the retriever [Liu et al., 2019] and ELECTRA-Large for the reader [Clark et al., 2020]. As a simple test to investigate the effect of dataset leakage, which could arise if the pretrained language models viewed email data during pretraining, we consider performance using only the Wikipedia passages. The test score is 27.6 EM, where performance is 72.0 EM on WW and 3.3 EM on EE questions. Overall, it is possible that the models may have picked up general knowledge during pretraining that helps reason about Enron concepts. However, these results suggest that access to the private corpus remains important.

Error Analysis of Retrieval Methods
We conclude with a qualitative discussion of representative errors observed for each retrieval method.

Dense Retrievers
The primary failure modes we observe for HotpotQA-MDR are: (1) ignoring parts of the question to pick passages reflecting a subset of mentioned entities and details, (2) ignoring a short relevant substring within a long Hop 1 email and thus not retrieving the Hop 2 passage successfully, and (3)
Performance on WE questions is notably worse than EW questions and we hypothesize that two factors impact these results: (1) Wikipedia passages generally follow consistent structures, so it may be easier to retrieve Wikipedia passages on Hop 2 after retrieving Wikipedia on Hop 1, and (2) several emails discuss each Wikipedia-entity, which may increase the noise in Hop 2 (i.e., WE is a one-to-many hop, while for EW, W typically contains one valid entity-specific passage). The latter is intuitively because individuals owning private data truly care about a narrow set of public entities.
Sparse Retrievers First, we observe the sparse model often "cheats" by retrieving the Hop 2 passage, without the Hop 1 passage. For questions where BM25 retrieves the gold Hop 2 passage in the first hop, the score is 64.2 F1, and when this is not the case, the score is 18.3 F1.
Next, we observe BM25 performance is high on email-based questions: we compute that WW questions have an average length of 97 characters, while EE questions have an average length of 141 characters. Perhaps due to the nature of how the dataset is constructed, namely that crowdworkers can see the passages before they write the questions, we may be underestimating the need for skills dense models provide (e.g., fuzzy semantic matching) and overestimating the quality of sparse models that benefit from direct matching. We observe that several other benchmarks reported in Thakur et al. [2021] on which BM25 outperforms dense retrieval use similar annotation pipelines during question generation (e.g., Wadden et al. [2020], Yang et al. [2018]).

Discussion and Future Work
Privacy-Preserving Personalized Retrieval Systems We hope this work inspires interest in realizing the potential of privacy-preserving personalized open-domain systems. Potential future directions include decomposing multi-hop queries into public and private sub-queries to address query-privacy [Min et al., 2019b, Perez et al., 2020]. Additionally, reformulating queries by including personal keywords could help produce more meaningful retrievals [Carpineto and Romano, 2012]. For example, if an ML practitioner asks a question about "Michael Jordan" the ML professor, a naive public search may return many passages about the basketball player. Perhaps a reformulated query with keywords such as "ML" would yield more relevant results. Future work could study the tradeoff between providing additional personal context in the query versus the number of public passages one would need to retrieve. Overall, retrieval is an exciting direction for incorporating personal context without requiring any training.
Retriever Generalization While prior work considers zero-shot generalization [Thakur et al., 2021, Guoa et al., 2021], retrieval-based systems over personal or temporally-changing data will need to retrieve from a mixture of in-distribution (ID) and out-of-distribution (OOD) data. In the former setting, k of k retrieved passages will be OOD passages, while in the latter setting, it is possible that few (or 0) OOD passages are retrieved, for example if the retriever scores are distributionally higher for ID passages. It is also possible that domain labels do not exist for certain retrieval applications. In contrast to using a single retriever for ID and OOD data, a system that routes questions to different retrievers, depending on question attributes, is another possibility. We hope CONCURRENTQA facilitates further study of concurrent multi-domain retrieval.

Conclusion
This work asks how to personalize retrieval-based systems in a privacy-preserving way and identifies that arbitrary autoregressive retrieval over public and private data poses a privacy concern. In summary, we define the PAIR privacy framework, present a new multi-domain multi-hop benchmark called CONCURRENTQA for the novel retrieval setting, and demonstrate the privacy-performance tradeoffs faced by existing open-domain systems. We finally investigate two challenges towards realizing the potential of public-private retrieval systems: using selective prediction to manage the privacy-performance tradeoff and concurrently retrieving over multiple distributions. We hope this work inspires new privacy-preserving solutions for personalized retrieval-based systems.

A Model Details
This section provides details about the retrieval and reader models and experiment settings. Experiments are conducted on 8 NVIDIA A100 GPUs.

A.1 Dense Retrieval
We use the model implementations for the MDR (dense retriever) provided by Xiong et al. [2021]. For the non-private experiments, we use the base retrieval algorithm; we extend the base implementation for the private-retrieval modes described in Section 3 and release our implementation. We construct the dense passage corpus using FAISS [Johnson et al., 2017], and use exact inner product search as in the original implementation.
The retriever is trained with a contrastive loss as in Karpukhin et al. [2020], where each query is paired with a (gold annotated) positive passage and m negative passages to approximate the softmax over all passages. We consider two methods of collecting negative passages: first, we use random passages from the corpus that do not contain the answer (random), and second, we use one top-ranking passage from BM25 that does not contain the answer as a hard negative paired with remaining random negatives. We do not observe a large difference between the two approaches for CONCURRENTQA results (as also observed by Xiong et al. [2021]), and thus use random negatives for all experiments. We hope to experiment with additional methods of selecting negatives for CONCURRENTQA in future work.
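The contrastive objective described above can be written as the negative log-likelihood of the positive passage under a softmax over the positive and the m negatives. The sketch below computes that loss for a single query from precomputed embeddings; it is a NumPy illustration of the loss value only (no gradients or batching), with illustrative names and toy vectors.

```python
import numpy as np

def contrastive_loss(query_vec, positive_vec, negative_vecs):
    """-log softmax probability of the positive among 1 + m candidate passages."""
    candidates = np.vstack([positive_vec, negative_vecs])  # row 0 is the positive
    logits = candidates @ np.asarray(query_vec, dtype=float)
    logits = logits - logits.max()  # stabilise the exponentials
    log_prob_positive = logits[0] - np.log(np.exp(logits).sum())
    return float(-log_prob_positive)

query = [1.0, 0.0]
positive = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]  # m = 2 toy negatives
loss = contrastive_loss(query, positive, negatives)
```

Swapping one random negative for a BM25 hard negative changes only which rows appear in `negative_vecs`; the loss itself is unchanged.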
The number of retrieved passages per retrieval, k, is an important hyperparameter as increasing k tends to increase recall, but sacrifice precision. Using larger values of k is also less efficient at inference time. We use k = 100 for all experiments in the paper and Table 7 shows the effect of using different values of k on retrieval performance (HotpotQA-MDR, CONCURRENTQA eval data).
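The recall and precision tradeoff in k can be made concrete with a small helper that scores a ranked list against the gold passages. This is an illustrative sketch; the function name and toy ids are assumptions and do not correspond to the metrics reported in Table 7.

```python
def precision_recall_at_k(ranked_ids, gold_ids, k):
    """Precision and recall of the top-k retrieved ids against the gold passage ids."""
    hits = len(set(ranked_ids[:k]) & set(gold_ids))
    return hits / k, hits / len(gold_ids)

ranked = [3, 1, 4, 2]   # retrieved passage ids, best first
gold = {1, 2}           # gold supporting passages for a two-hop question
at_2 = precision_recall_at_k(ranked, gold, 2)
at_4 = precision_recall_at_k(ranked, gold, 4)
```

In this toy ranking, raising k from 2 to 4 recovers the second gold passage while precision stays at one half, because two non-gold passages are also kept.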
Inference-Only For the MDR experiments in Section 6, and the HotpotQA-MDR experiments in Section 5, we use an MDR model trained in the Wikipedia domain (i.e., HotpotQA training data) to retrieve passages for the HotpotQA-PAIR and CONCURRENTQA evaluation sets. For these experiments, we directly use the provided question encoder and passage encoder checkpoints.
Training and Inference For the CONCURRENTQA-MDR and Subsampled HotpotQA-MDR experiments, we train the MDR model from scratch, finding the hyperparameters in Table 8 work best.

A.2 Sparse Retrieval
For the sparse retrieval baseline, we use the Pyserini BM25 implementation using default parameters. We consider different values of k ∈ {1, 10, 25, 50, 100} per retrieval and report the retrieval performance in Table 9. We generate the second hop query by concatenating the text of the initial query and first hop passages.
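To show how the second-hop query is formed and scored, here is a self-contained BM25 sketch. It is not the Pyserini implementation: the whitespace tokenizer, the IDF variant, and the parameter values k1 = 0.9 and b = 0.4 are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (len(docs) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def second_hop_query(question, hop1_passages):
    """Concatenate the question with the retrieved first-hop passage text."""
    return " ".join([question] + list(hop1_passages))

docs = [
    "enron traded energy in california",
    "the family man stars nicolas cage",
    "david weissman co-wrote the family man",
]
hop1_scores = bm25_scores("film starring nicolas cage", docs)
hop2_query = second_hop_query("film starring nicolas cage", [docs[1]])
hop2_scores = bm25_scores(hop2_query, docs)
```

After concatenation, the second-hop query shares the terms "the family man" with the third document, which is how a sparse retriever can reach a passage the original question never names.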

A.3 QA Model
We use the provided ELECTRA-Large reader model checkpoint from Xiong et al. [2021] for all experiments. The model was trained on HotpotQA training data. Using the same reader is useful to understand how retrieval quality affects performance, in the absence of reader modifications.
Table 10: Here we ask how many hops are required to answer the benchmark multi-hop questions. We use the same MDR checkpoint trained on HotpotQA to retrieve hop 1 and hop 2 passages. We train the reader model on only hop 1 passages (Single-hop) and compare to the performance of using hop 1 and hop 2 passages (Multi-hop). We provide results using FiD with T5-base and T5-large. *Reported in [Xiong et al., 2021].

A.4 Are two hops necessary?
The document privacy challenge is a consequence of the autoregressive retrieval process. Here we ask whether two hops are in fact necessary to answer multi-hop benchmark questions.
In the Single-hop baseline, we use the MDR model to retrieve k = 50 passages, but stop after the first hop. We train and evaluate a Fusion-in-Decoder (FiD) model [Izacard and Grave, 2021] on the resulting contexts. To motivate the choice of reader, FiD combines information across multiple passages simultaneously, whereas ELECTRA searches for answer spans individually in each passage. We compare performance to the Multi-hop baseline, where we take the top 50 passage chains from the same MDR model, concatenate the two passages in each chain to obtain 50 contexts, and train and evaluate a Fusion-in-Decoder model on the resulting data (as in Xiong et al. [2021]). We observe that the multi-hop baseline performs 26.4 F1 points higher, indicating the benefit of multiple iterations. See results in Table 10.
We train the FiD models for 15000 steps, with a learning rate of 5e-05, per-GPU batch size of 1, maximum text length of 250 when using one passage and 512 for two passages, and maximum answer length of 20, for one random seed.

B Additional Details for PAIR Baselines
Figure 6 shows that there is a clear separation between the relevance score distributions from the email vs. Wikipedia corpus for questions based on Wikipedia (public) passages, but this is not the case for questions based on email passages. The relevance score distributions are not necessarily well-aligned in the mixed-distribution retrieval setting, contributing to the difficulty and difference vs. zero-shot retrieval.

C Additional CONCURRENTQA Analysis

Figure 7 (Left, Middle) shows the UMAP plots of CONCURRENTQA questions using BERT-base representations, split by whether the gold hop passages are both from the same domain (e.g., two Wikipedia or two email passages) or require one passage from each domain. The plots reflect a separation between Wiki-based and email-based questions and passages.
In Table 13, we provide statistics for the number of CONCURRENTQA questions that require gold supporting passages from each set of privacy scopes.
In Xiong et al. [2021] as public data. We use NLTK [Bird et al., 2009] for sentence tokenization; this is important for storing the indices of supporting sentences.

Private Enron Data Preprocessing
We download the May 7, 2015 version of the Enron Emails dataset distributed by CMU. We select the "Jeff Dasovich" inbox as the personal data source because the inbox is amongst the largest (28,234 emails) and the employee was a "Government Relation Executive", so the emails contain several public entities in addition to private entities.
We split each email into chunks of up to 150 words, resulting in 112k total passages. We deduplicate the emails prior to generating passage pairs shown to the crowd workers, resulting in the final set of 47k passages. Duplicates exist because the same email can appear in reply chains, forward chains, and in multiple inbox folders (e.g., sent and received email folders).
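The chunking and deduplication steps can be sketched as below. This is a simplified illustration, not the released preprocessing code: it assumes non-overlapping word windows and treats two passages as duplicates only when they match exactly after lowercasing and whitespace normalization.

```python
def chunk_words(text, max_words=150):
    """Split text into consecutive passages of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def deduplicate(passages):
    """Keep the first occurrence of each passage, comparing normalized text."""
    seen = set()
    unique = []
    for passage in passages:
        key = " ".join(passage.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(passage)
    return unique

email_body = " ".join(["word"] * 310)
passage_lengths = [len(p.split()) for p in chunk_words(email_body)]
kept = deduplicate(["Meeting at  noon", "meeting at noon", "Forwarded: budget"])
```

Replies and forwards usually quote earlier messages with small edits, so a real pipeline would likely need a fuzzier notion of duplicate than the exact match used here.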
Further Processing Since emails can be quite long, we use a sliding window approach to generate documents, given the sequence length limitations of the transformer architecture. Finally, we deduplicate the emails for the final private corpus. We release all the code for preprocessing, annotating, filtering, and deduplication along with the benchmark.
To initialize the Wikipedia hyperlink graph, we use the KILT KnowledgeSource resource [Petroni et al., 2021] to identify hyperlinks in each of the Wikipedia passages. To collect passages that share enough in common, we eliminate entities b which are too specific or vague, having many plausible correspondences across passages. For example, given a representing a "company", it may be challenging to write a question about its connection to the "business psychology" doctrine the company ascribes to (b is too specific) or to the "country" in which the company is located (b is too general). To determine which Wiki entities to permit for a and b pairings shown to the workers, we ensure that the entities come from a restricted set of entity-categories. The Wikidata knowledge base stores type categories associated with entities (e.g., "Barack Obama" is a "politician" and "lawyer"). We compute the frequency of Wikidata types across the 5.2 million entities and permit entities containing any type that occurs at least 1000 times. We also restrict to Wikipedia documents containing a minimum number of sentences and tokens. The intuition for this is that highly specific types of entities (e.g., a legal code or scientific fact) and highly general types of entities (e.g., countries) occur less frequently.
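The type-frequency filter described above can be sketched as follows, assuming the Wikidata types are already available as a mapping from entity name to a list of type strings. The mapping, function names, and the toy threshold in the usage lines are illustrative.

```python
from collections import Counter

def frequent_types(entity_types, min_count=1000):
    """Types that are attached to at least min_count distinct entities."""
    counts = Counter(t for types in entity_types.values() for t in set(types))
    return {t for t, c in counts.items() if c >= min_count}

def permitted_entities(entity_types, min_count=1000):
    """Entities carrying at least one sufficiently frequent type."""
    allowed = frequent_types(entity_types, min_count)
    return {e for e, types in entity_types.items() if allowed.intersection(types)}

toy_types = {
    "Barack Obama": ["politician", "lawyer"],
    "Dianne Feinstein": ["politician"],
    "Some Statute": ["legal code"],
}
# A threshold of 2 stands in for the 1000 used over the full entity set.
toy_permitted = permitted_entities(toy_types, min_count=2)
```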
Pairs with Private Emails Unlike Wikipedia, hyperlinks are not readily available for many unstructured data sources including the emails, and the non-Wikipedia data contains both private and public (e.g., Wiki) entities. Thus, we design the following approach to collect passage pairs involving a private passage.
We first collect entity occurrences in emails:
1. To annotate the public and private entity occurrences in the email passages, we collect candidate entities with the SpaCy NER tagger.
2. We split the full set into candidate public and candidate private entities by identifying Wikipedia linked entities amongst the spans tagged by the NER model. We annotate the text with the open-source SpaCy entity-linker, which links the text to entities in the Wiki knowledge base, to collect candidate occurrences of global entities. We use heuristic rules to filter remaining noise in the public entity list.
3. We post-process the private entity lists to improve precision. High precision entity-linking is critical for the quality of the benchmark: a query assumed to require the retrieval of private passages a and b should not be unknowingly answerable by public passages. After curating the private entity list, we restrict to candidates which occur at least 5 times in the deduplicated set of passages.
A total of 43.4k unique private entities and 8.8k unique public entities appear in the emails, and 1.6k private and 2.3k public entities occur at least 5 times across passages. We present crowd workers emails containing at least three total entities to ensure there is sufficient information to write the multi-hop question.
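The two frequency-based filters above (entities occurring at least 5 times, emails containing at least three entities) can be sketched as below. The sketch assumes entity mentions were already extracted per passage, and it counts an entity once per passage, which is one possible reading of "occurs at least 5 times"; names and toy data are illustrative.

```python
from collections import Counter

def frequent_entities(passage_entities, min_passages=5):
    """Entities that appear in at least min_passages distinct passages."""
    counts = Counter(e for entities in passage_entities for e in set(entities))
    return {e for e, c in counts.items() if c >= min_passages}

def candidate_passages(passage_entities, min_entities=3):
    """Indices of passages mentioning at least min_entities distinct entities."""
    return [i for i, entities in enumerate(passage_entities)
            if len(set(entities)) >= min_entities]

toy_passages = [
    ["Dasovich", "Enron", "CPUC"],
    ["Dasovich"],
    ["Dasovich", "Enron"],
]
# Smaller thresholds stand in for the values used on the full corpus.
toy_frequent = frequent_entities(toy_passages, min_passages=2)
toy_candidates = candidate_passages(toy_passages, min_entities=3)
```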
Private-Private Pairs are pairs of emails that mention the same private entity e. Private-Public and Public-Private Pairs consist of an email mentioning a public entity e and the Wikipedia passage for e. In both cases, we provide the hint that e is a potential anchor for the multi-hop question.
Comparison Questions For comparison questions, Wikidata types are readily available for public entities, and we use these to present the crowdworker with two passages describing entities of the same type. For private emails, there is no associated knowledge graph so we heuristically assigned types to private entities, by determining whether type strings occurred frequently alongside the entity in emails (e.g., if "politician" is frequently mentioned in the emails in which an entity occurs, assign the "politician" type).
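One way to realize the co-occurrence heuristic for private entity types is sketched below. The substring matching, the `min_fraction` threshold, and the function name are assumptions for illustration; the heuristic actually used to build the benchmark may differ.

```python
def assign_type(entity, emails, type_strings, min_fraction=0.5):
    """Assign the type string that co-occurs most often with the entity, if frequent enough."""
    mentioning = [e.lower() for e in emails if entity.lower() in e.lower()]
    if not mentioning:
        return None
    best_type, best_count = None, 0
    for type_string in type_strings:
        count = sum(type_string.lower() in email for email in mentioning)
        if count > best_count:
            best_type, best_count = type_string, count
    # Hypothetical rule: require the type in at least min_fraction of the mentioning emails.
    if best_count / len(mentioning) >= min_fraction:
        return best_type
    return None

toy_emails = [
    "Senator Smith, the politician, called about the bill",
    "the politician Smith voted yesterday",
    "lunch with Jones on Friday",
]
smith_type = assign_type("Smith", toy_emails, ["politician", "company"])
jones_type = assign_type("Jones", toy_emails, ["politician", "company"])
```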

D.3 Data Collection Procedure
Algorithm 1 and Figure 8 give the full data collection procedure for CONCURRENTQA. It is adapted from Algorithm 1 in Yang et al. [2018], which was used to produce HotpotQA.
Crowd Worker Interface We use the Mephisto framework to build our crowd worker interface. Figure 9 gives an example of the interface shown to workers.

E Additional Details for Selective Prediction
The QA model predicts an answer span in each of the top k passages by predicting the start and end tokens of the answer in each passage. The model outputs scores for each answer and the system outputs the top-scoring answer span. To obtain scores for Section 6.1, we computed the softmax over all k scores and selected the top score. The same model (trained on the full HotpotQA data) was used for all private and non-private runs in Section 6.1.
Algorithm 1 (excerpt): Uniformly sample public documents a, b, where entity e appears in both. In the Wikipedia setup, we take (a, b) ∈ G, for hyperlink graph G. Workers ask a question about documents a and b, given e (or e_a, e_b for comparison questions) as an optional anchor. 29: end while
Since multiple passages can be used to answer a question, especially as discussed in the case of email-based questions, we also tried identifying groups within the top k answer spans for which the model predicted the same answer; we then combined softmax scores at the group level and used the top group score as c. This resulted in 68.1% coverage at 53.0 F1 (non-private baseline) for CONCURRENTQA, but a much lower 71.4% coverage at 75.0 F1 for HotpotQA.
Figure 8: The end-to-end data collection pipeline includes (1) passage-pair generation: we identified entities appearing in emails and Wikipedia passages and categorized entities as public or private, (2) question generation: workers are asked to write a question-answer pair given two passages and a hint stating the common entities in the passages, (3) validation: workers answer a series of questions about each generated question-answer pair to filter out low-quality questions. During the entire process, we manually review questions and workers' understanding of the task.
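The answer-grouping variant of the confidence score described in this appendix can be sketched as follows. Summing softmax probabilities within a group and matching answers after lowercasing and stripping whitespace are assumptions; the text only states that softmax scores were combined at the group level.

```python
import math
from collections import defaultdict

def grouped_confidence(answers, scores):
    """Return the answer group with the largest combined softmax mass and that mass."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    groups = defaultdict(float)
    for answer, e in zip(answers, exps):
        groups[answer.strip().lower()] += e / total  # assumption: combine by summing
    best = max(groups, key=groups.get)
    return best, groups[best]

toy_answers = ["Enron", "enron", "Dynegy"]
toy_scores = [1.0, 1.0, 1.5]
best_answer, group_score = grouped_confidence(toy_answers, toy_scores)
```

In this toy case the single top-scoring span is "Dynegy", but the two matching "Enron" spans together carry more probability mass, so grouping changes both the prediction and the confidence c.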
Instructions: Below you are given two pieces of text. Please write a question that can only be answered if both of the passages are used together. People should not be able to confidently answer your question if they are given just one of the two passages, and do not assume that they know which passages you used to write your question.
Please submit:
- Your question in the box below.
- The answer to your question should be a sequence of words in paragraph 2. Highlight the correct answer to the question in paragraph 2.
- Click the checkboxes next to the sentences someone would need to see to answer your question. Leave unchecked any sentence that is not useful for your question.
Please note the following very important points about the person who will answer your question:
1. They have no other information besides the provided paragraphs.
2. Do not assume that they know which passages you used to write your question. Please add enough detail to your question so they can be reasonably confident about the answer, using just the given passages. Given the passages, they should be confident about the answer. E.g., given a passage about a Spurs basketball game on 12/12/2021, the question "Did the Spurs basketball team win the game?" is not detailed enough because the Spurs play many basketball games. We can't be confident *which* game the question is referring to and whether the correct answer is in the passage, so please try to be specific with your question, for example by asking "Did the Spurs basketball team win the game on 12/12/2021?"
3. Please try to write natural and grammatically correct questions someone might actually ask about these pieces of text!
4. Do not write questions such as "What is the name of the organization whose name starts with an 'H'?" You should be asking about the content of the passages, not the letters in the passages.
Click here to view examples of the completed task.
Click here to view a video example of how to complete the task.
Thank you for your help! If you submit high quality answers, we will invite you to submit many more tasks!

Paragraphs
Paragraph 1
"Here's our thesis," he told them.
"What are we missing?" Mr. Chanos came out of those meetings with a "heightened conviction that we were right." For one thing, he sensed frustration brewing about the level of trust required with Enron.
As the spring progressed, Mr. Chanos became increasingly confident, adding to his short position.
On a widely reported conference call in April, Jeffrey Skilling, then Enron's chief executive, responded to another short seller's criticism that Enron hadn't provided a balance sheet by calling him an "ah." For the first time, "I got a sense that the company was now getting tough questions and was not happy about it," Mr. Chanos says.
For their part, Wall Street analysts argue that they have limited time and resources for the in-depth research that Mr. Chanos prefers.
Many cover dozens of companies.
Still, some say they have learned lessons from Enron's fall from grace.
Salomon Smith Barney analyst Raymond Niles, for one, says he will "pursue warning signs relentlessly and go by gut instinct" when he senses a looming problem.

Paragraph 2
Jeffrey Keith "Jeff" Skilling (born November 25, 1953) is the former CEO of Enron Corporation.
In 2006, he was convicted of federal felony charges relating to Enron's collapse and is currently serving 14 years of a 24-year, four-month prison sentence at the During April 2011, a three-judge 5th Circuit Court of Appeals panel ruled that the verdict would have been the same despite the legal issues being discussed, and Skilling's conviction was confirmed; however, the court ruled Skilling should be resentenced.
Skilling appealed this new decision to the Supreme Court, but the appeal was denied.
In 2013, the United States Department of Justice reached a deal with Skilling, which resulted in ten years being cut from his sentence.

Question and Answer Input
Hint: Consider forming questions which use the entity 'Jeffrey Skilling', since it's mentioned in both passages! If you think the entity mentioned in the hint does not exist or does not refer to the same entity in both paragraphs, please click 'skip'.

Question
The sentence for the Enron executive who publicly called a short seller an "ah" in April was shortened due to a deal with which organization?

Answer
United States Department of Justice

[Submit] [Skip]

Figure 9: Mechanical Turk interface for CONCURRENTQA data collection. Crowdworkers select checkboxes for supporting passages, highlight the answer span, and write the question in the text box.