Abstract
Users and organizations are generating ever-increasing amounts of private data from a wide range of sources. Incorporating private context is important to personalize open-domain tasks such as question-answering, fact-checking, and personal assistants. State-of-the-art systems for these tasks explicitly retrieve information that is relevant to an input question from a background corpus before producing an answer. While today’s retrieval systems assume relevant corpora are fully (e.g., publicly) accessible, users are often unable or unwilling to expose their private data to entities hosting public data. We define the Split Iterative Retrieval (SPIRAL) problem involving iterative retrieval over multiple privacy scopes. We introduce a foundational benchmark with which to study SPIRAL, as no existing benchmark includes data from a private distribution. Our dataset, ConcurrentQA, includes data from distinct public and private distributions and is the first textual QA benchmark requiring concurrent retrieval over multiple distributions. Finally, we show that existing retrieval approaches face significant performance degradations when applied to our proposed retrieval setting and investigate approaches with which these tradeoffs can be mitigated. We release the new benchmark and code to reproduce the results.1
1 Introduction
The world’s information is split between publicly and privately accessible scopes, and the ability to simultaneously reason over both scopes is useful to support personalized tasks. However, retrieval-based machine learning (ML) systems, which first retrieve relevant information to a user input from background knowledge sources before providing an output, do not consider retrieving from private data that organizations and individuals aggregate locally. Neural retrieval systems are achieving impressive performance across applications such as language-modeling (Borgeaud et al., 2022), question answering (Chen et al., 2017), and dialogue (Dinan et al., 2019), and we focus on the under-explored question of how to personalize these systems while preserving privacy.
Consider the following examples that require retrieving information from both public and private scopes. Individuals could ask “With my GPA and SAT score, which universities should I apply to?” or “Is my blood pressure in the normal range for someone 55+?”. In an organization, an ML engineer could ask: “How do I fine-tune a language model, based on public StackOverflow and our internal company documentation?”, or a doctor could ask “How are COVID-19 vaccinations affecting patients with type-1 diabetes based on our private hospital records and public PubMed reports?”. To answer such questions, users manually cross-reference public and private information sources. We initiate the study of a retrieval setting that enables using public (global) data to enhance our understanding of private (local) data.
Modern retrieval systems typically collect documents that are most similar to a user’s question from a massive corpus, and provide the resulting documents to a separate model, which reasons over the information to output an answer (Chen et al., 2017). Multi-hop reasoning (Welbl et al., 2018) can be used to answer complex queries over information distributed across multiple documents, e.g., news articles and Wikipedia. For such queries, we observe that using multiple rounds of retrieval (i.e., combining the original query with retrieved documents at round i for use in retrieval at round i + 1) provides over 75% improvement in quality versus using one round of retrieval (Section 5). Iterative retrieval is now common in retrieval (Miller et al., 2016; Feldman and El-Yaniv, 2019; Asai et al., 2020; Xiong et al., 2021; Qi et al., 2021; Khattab et al., 2021, inter alia).
Existing multi-hop systems perform retrieval over a single privacy scope. However, users and organizations often cannot expose data to public entities. Maintaining terabyte-scale and dynamic data is difficult for many private entities, warranting retrieval from multiple distributed corpora.
To understand why distributed multi-hop retrieval implicates privacy concerns, consider two illustrative questions an employee may ask. First, to answer “Of the products our competitors released this month, which are similar to our unreleased upcoming products?”, an existing multi-hop system likely (1) retrieves public documents (e.g., news articles) about competitors, and (2) uses these to find private documents (e.g., company emails) about internal products, leaking no private information. Meanwhile, “Have any companies ever released similar products to the one we are designing?” entails (1) retrieving private documents detailing the upcoming product, and (2) performing similarity search for public products using information from the confidential documents. The latter reveals private data to an untrusted entity hosting a public corpus. An effective privacy model will minimize leakage.
We introduce the Split Iterative Retrieval (SPIRAL) problem. Public and private document distributions usually differ and our first observation is that all existing textual benchmarks require retrieving from one data-distribution. To appropriately evaluate SPIRAL, we create the first textual multi-distribution benchmark, ConcurrentQA, which spans Wikipedia in the public domain and emails in the private domain, enabling the study of two novel real-world retrieval setups: (1) multi-distribution and (2) privacy-preserving retrieval:
Multi-distribution Retrieval The ability for a model to effectively retrieve over multiple distributions, even in the absence of privacy constraints, is a precursor to effective SPIRAL systems, since it is unlikely for all private distributions to be reflected at train time. However, the typical retrieval setup requires retrieving over a single document distribution with a single query distribution (Thakur et al., 2021). We initiate the study of the real-world multi-distribution setting. We find that the SoTA multi-hop QA model trained on 90.4k Wikipedia examples underperforms the same model trained on the 15.2k ConcurrentQA (Wikipedia and Email) examples by 20.8 F1 points on questions based on Email passages. Further, we find the performance of the model trained on Wikipedia improves by 4.3% if we retrieve the top passages from each distribution rather than the overall top-k passages, which is the standard protocol.
Privacy-Preserving Retrieval We then propose a framework for reasoning about the privacy tradeoffs required for SoTA models to achieve as good performance on public-private QA as is achieved in public-QA. We evaluate performance when no private information is revealed, and models trained only on public data (e.g., Wikipedia) are utilized. Under this privacy standard, models sacrifice upwards of 19% performance under SPIRAL constraints to protect document privacy and 57% to protect query privacy when compared to a baseline system with standard, non privacy-aware retrieval mechanics. We then study how to manage the privacy-performance tradeoff using selective prediction, a popular approach for improving the reliability of QA systems (Kamath et al., 2020; Lewis et al., 2021; Varshney et al., 2022).
In summary: (1) We are the first to report on problems with applying existing neural retrieval systems to the public and private retrieval setting, (2) We create ConcurrentQA, the first textual multi-distribution benchmark to study the problems, and (3) We provide extensive evaluations of existing retrieval approaches under the proposed real-world retrieval settings. We hope this work encourages further research on private retrieval.
2 Background and Related Work
2.1 Retrieval-Based Systems
Open-domain applications, such as question answering and personal assistants, must support user inputs across a broad range of topics. Implicit-memory approaches for these tasks focus on memorizing the knowledge required to answer questions within model parameters (Roberts et al., 2020). Instead of memorizing massive amounts of knowledge in model parameters, retrieval-based systems introduce a step to retrieve information that is relevant to a user input from a massive corpus of documents (e.g., Wikipedia), and then provide this to a separate task model that produces the output. Retrieval-free approaches have not been shown to work convincingly in multi-hop settings (Xiong et al., 2021).
2.2 Multi-hop Retrieval
We focus on open-domain QA (ODQA), a classic application for retrieval-based systems. ODQA entails providing an answer a to a question q, expressed in natural language and without explicitly provided context from which to find the answer (Voorhees, 1999). A retriever collects relevant documents to the question from a corpus, then a reader model extracts an answer from selected documents.
Our setting is concerned with complex queries where supporting evidence for the answer is distributed across multiple (public and private) documents, termed multi-hop reasoning (Welbl et al., 2018). To collect the distributed evidence, systems use multiple hops of retrieval: Representations of the top passages retrieved in hop i are used to retrieve passages in hop i+1 (Miller et al., 2016; Feldman and El-Yaniv, 2019; Asai et al., 2020; Wolfson et al., 2020; Xiong et al., 2021; Qi et al., 2021; Khattab et al., 2021).2 Finally, we discuss the applicability of existing multi-hop benchmarks to our problem setting in Section 4.
2.3 Privacy Preserving Retrieval
Information retrieval is a long-standing topic spanning the machine learning, databases, and privacy communities. We discuss the prior work and considerations for our setup along three axes: (1) Levels of privacy. Prior private retrieval system designs guarantee privacy for different components across both query and document privacy. Our setting requires both query and document privacy. (2) Relative isolation of document storage and retrieval computation. The degree to which prior retrieval and database systems store or send private data to centralized machines (with or without encryption) varies. Our work structures dataflow to eliminate processing of private documents on public retrieval infrastructure. (3) Updatability and latency. Works make different assumptions about how a user will interact with the system, including whether high-latency responses are tolerable and whether corpora are static or changing. Our setting focuses on open-domain questions for interactive applications with massive, temporally changing corpora that require low-latency responses.
Isolated systems with document and query privacy but poor updatability. To provide the strongest possible privacy guarantee (i.e., no information about the user questions or passages is revealed), prior work considers when purely local search is possible (Cao et al., 2019) (i.e., search performed on systems controlled exclusively by the user). This setup exposes no attack surface, assuming that both the data (documents and queries) and the computation remain on infrastructure the user controls. However, scaling the amount of locally hosted data and keeping local corpora synchronized with quickly changing public data is challenging, making it difficult to rely on purely local search.
Public, updatable database systems providing query privacy. A distinct line of work explores how to securely perform retrieval such that the user query is not revealed to a public entity that hosts and updates databases. Private information retrieval (PIR) (Chor et al., 1998) in the cryptography community refers to a setup where users know the entry in a remote database that they want to retrieve (Corrigan-Gibbs and Kogan, 2020). The threat model is directly related to the cryptographic scheme used to protect queries and retrieval computation. Here, the document containing the answer is assumed to be known; leaking the particular corpus containing the answer may implicitly leak information about the query. In contrast, we focus on open-domain applications, where users ask about any topic imaginable and do not know which corpus item holds the answer. Our setting also considers document privacy, as discussed in Section 6.
Public, updatable but high-latency secure nearest neighbor search with document and query privacy. The next relevant line of work focuses on secure nearest neighbor search (NNS) (Murugesan et al., 2010; Chen et al., 2020a; Schoppmann et al., 2020; Servan-Schreiber, 2021), where the objective is to securely (1) compute similarity scores between the query and passages, and (2) select the top-k scores. The speed of the cryptographic tools (secure multi-party computation, secret sharing) used to perform these steps increases as the sparsity of the query and passage representations increases. Performing the secure protocol over dense embeddings can take hours per query (Schoppmann et al., 2020). As before, threats in this setting are related to vulnerabilities in cryptographic schemes or in actors gaining access to private document indices if not directly encrypted. Prior work relaxes privacy guarantees and computes approximate NNS; speeds, however, are still several seconds per query (Schoppmann et al., 2020; Chen et al., 2020a). This is prohibitive for iterative open-domain retrieval applications.
Partial query privacy via fake query augmentation for high-latency retrieval from public databases. Another class of privacy techniques for hiding the user’s intentions is query-obfuscation or k-anonymity. The user’s query is combined with fake queries or queries from other users to increase the difficulty of linking a particular query to the user’s true intentions (Gervais et al., 2014). This multiplies communication costs since nearest neighbors must be retrieved for each of the k queries; iterative retrieval worsens this cost penalty. Further, the private query is revealed among the full set of k; the threat of identifying the user’s true query remains (Haeberlen et al., 2011).
Finally, we note that our primary focus is on inference-time privacy concerns and note that during training time, federated learning (FL) with differential privacy (DP) is a popular strategy for training models without exposing training data (McMahan et al., 2016; Dwork et al., 2006).
Overall, despite significant interest in IR, there is limited attention towards characterizing the privacy risks as previously observed (Si and Yang, 2014). Our setting, which focuses on supporting open-domain applications with modern dense retrievers, is not well-studied. Further, the prior works do not characterize the privacy concerns associated with iterative retrieval. Studying this setting is increasingly important with the prevalence of API-hosted large language models and services. For instance, users may want to incorporate private knowledge into systems that make multiple calls to OpenAI model endpoints (Brown et al., 2020; Khattab et al., 2022). Code assistants, which may be extended to interact with both private repositories and public resources like Stack Overflow, are also seeing widespread use (Chen et al., 2021).
3 Problem Definition
Objective
Given a multi-hop input q, a set of private documents p ∈ DP, and public documents d ∈ DG, the objective is to provide the user with the correct answer a, which is contained in the documents. Figure 1 (Right) provides an example. Overall, the Split Iterative Retrieval (SPIRAL) problem entails maximizing quality, while protecting query and document privacy.
Standard, Non-Privacy Aware QA
Standard non-private multi-hop ODQA involves answering q with the help of passages d ∈ DG, retrieved using beam search. In the first iteration of retrieval, the k passages from the corpus, d1, …, dk, that are most relevant to q are retrieved. The text of a retrieved passage is combined with q using a function f (e.g., concatenating the query and passage sequences) to produce qi = f(q, di), for i ∈ [1..k]. Each qi (which contains di) is used to retrieve k more passages in the following iteration.
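A minimal sketch of this iterative retrieval loop follows; the `retrieve` function, the `[SEP]` concatenation, and the simple beam bookkeeping are illustrative assumptions, not the exact implementation used in the paper.

```python
from typing import Callable, List, Tuple

# Hypothetical retriever: maps a query string to the top-k (passage, score) pairs.
Retriever = Callable[[str, int], List[Tuple[str, float]]]

def multi_hop_retrieve(q: str, retrieve: Retriever, k: int = 4, num_hops: int = 2) -> List[str]:
    """Beam-search-style iterative retrieval: each hop expands every query in the beam."""
    beam = [(q, 0.0)]  # (query text, accumulated score)
    candidates: List[Tuple[str, float, str]] = []
    for _ in range(num_hops):
        candidates = []
        for query, score in beam:
            for passage, s in retrieve(query, k):
                # q_i = f(q, d_i): concatenate the original question with the passage.
                candidates.append((f"{q} [SEP] {passage}", score + s, passage))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = [(next_q, s) for next_q, s, _ in candidates[:k]]
    return [p for _, _, p in candidates[:k]]  # top-k passages after the final hop
```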
We now introduce the SPIRAL retrieval problem. The user inputs to the QA system are the private corpus DP and questions q. There are two key properties of the problem setting.
Property 1: Data is likely stored in multiple enclaves and personal documents p ∈ DP can not leave the user’s enclave.
Users and organizations own private data, and untrustworthy (e.g., cloud) services own public data. First, we assume users likely do not want to publicly expose their data to create a single public corpus nor blindly write personal data to a public location. Next, we also assume it is challenging to store global data locally in many cases. This is because not only are there terabytes of public data and user-searches follow a long tail (Bernstein et al., 2012) (i.e., it is challenging to anticipate all a user’s information needs), but public data is also constantly being updated (Zhang and Choi, 2021). Thus, DP and DG are hosted as separate corpora.
Now, given q, the system must perform one retrieval over DG and one over DP, rank the results such that the top-k passages include kP private and kG public passages, and use these for the following iteration of retrieval. If the retrieval system stops after a single hop, there is no document privacy risk, since no p ∈ DP is publicly exposed, and no query privacy risk, since the system used to retrieve from DP is assumed to be private. However, for multi-hop questions, if kP > 0 after an initial round of retrieval, meaning some pi ∈ DP appears in the top-k passages, it would sacrifice privacy if f(q, pi) were used to perform the next round of retrieval from DG. Thus, for the strongest privacy guarantee, public retrievals should precede private document retrievals. For less privacy-sensitive use cases, this strict ordering can be weakened.
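The following sketch illustrates one SPIRAL hop over the two indices under this document-privacy constraint; `retrieve_public` and `retrieve_private` are hypothetical per-index search functions rather than the paper's released code.

```python
from typing import Callable, List, Tuple

Retriever = Callable[[str, int], List[Tuple[str, float]]]  # query -> (passage, score) pairs

def spiral_hop(queries: List[Tuple[str, bool]], retrieve_public: Retriever,
               retrieve_private: Retriever, k: int, document_privacy: bool = True):
    """queries: (query text, True if the query was built from a private passage)."""
    results = []  # (passage, score, passage_is_private)
    for query, from_private in queries:
        if not (document_privacy and from_private):
            # Safe to contact the public index: the query contains no private content.
            results += [(p, s, False) for p, s in retrieve_public(query, k)]
        # The private index is assumed to run on trusted (local) infrastructure,
        # so every query may be sent to it.
        results += [(p, s, True) for p, s in retrieve_private(query, k)]
    results.sort(key=lambda r: r[1], reverse=True)
    return results[:k]  # top-k across both scopes, carrying the privacy flag forward
```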
Property 2: Inputs that entirely rely on private information should not be revealed publicly.
Given the multiple indices, DP and DG, q may be entirely answerable using multiple hops over the DP index, in which case q would never need to leave the user’s device. For example, an employee’s query, “Does the search team use any infrastructure tools that our personal assistant team does not use?”, is fully answerable with company information. Prior work demonstrates that queries are very revealing of user interests, intents, and backgrounds (Xu et al., 2007; Gervais et al., 2014). There is an observable difference in the search behavior of users with privacy concerns (Zimmerman et al., 2019), and an effective system will protect queries.
4 ConcurrentQA Benchmark
Here we develop a testbed for studying public-private retrieval. We require questions spanning two corpora, DP and DG. First, we consider using existing benchmarks and describe the limitations we encounter, motivating the creation of our new benchmark, ConcurrentQA. Then we describe the dataset collection process and its contents.
4.1 Adapting Existing Benchmarks
We first adapt the widely used benchmark, HotpotQA (Yang et al., 2018), to study our problem. HotpotQA contains multi-hop questions, which are each answered using two Wikipedia passages. We create HotpotQA-SPIRAL by splitting the Wikipedia corpus into DG and DP. This results in questions entirely reliant on p ∈ DP, entirely on d ∈ DG, or reliant on a mix of one private and one public document, allowing us to evaluate performance under SPIRAL constraints.
Ultimately, however, DP and DG come from a single Wikipedia distribution in HotpotQA-SPIRAL. Private and public data will often reflect different linguistic styles, structures, and topics. We observe all existing textual multi-hop benchmarks require retrieving from a single distribution. We cannot combine two existing benchmarks over two corpora because this will not yield questions that rely on both corpora simultaneously. To evaluate with a more realistic setup, we create a new benchmark: ConcurrentQA. We quantitatively demonstrate the limitations of using HotpotQA-SPIRAL in the experiments and analysis.
4.2 ConcurrentQA Overview
We create and release a new multi-hop QA dataset, ConcurrentQA, which is designed to more closely resemble a practical use case for SPIRAL. ConcurrentQA contains questions spanning Wikipedia documents as DG and Enron employee emails (Klimt and Yang, 2004) as DP.3 We propose two unique evaluation settings for ConcurrentQA: performance (1) conditioned on the sub-domains in which the question evidence can be found (Section 5), and (2) conditioned on the degree of privacy protection (Section 6).
Example questions from ConcurrentQA are included in Table 1. The corpora contain 47k emails (DP) and 5.2M Wikipedia passages (DG), and the benchmark contains 18,439 examples (Table 2). Questions require three main reasoning patterns: (1) bridge questions require identifying an entity or fact in Hop1 on which the second retrieval is dependent, (2) attribute questions require identifying the entity that satisfies all attributes in the question, where attributes are distributed across passages, and (3) comparison questions require comparing two similar entities, each appearing in a separate passage. We estimate the benchmark is 80% bridge, 12% attribute, and 8% comparison questions. We focus on factoid QA.
| Question | Hop 1 and Hop 2 Gold Passages |
|---|---|
| What was the estimated 2016 population of the city that generates power at the Hetch Hetchy hydroelectric dams? | Hop 1: An email mentions that San Francisco generates power at the Hetch Hetchy dams. Hop 2: The Wikipedia passage about San Francisco reports the 2016 census-estimated population. |
| Which firm invested in both the 5th round of funding for Extraprise and first round of funding for JobsOnline.com? | Hop 1: An email lists 5th-round Extraprise investors. Hop 2: An email lists round-1 investors for JobsOnline.com. |
Benchmark Design
Each benchmark example includes the question, which requires reasoning over multiple documents; the answer, which is a span of text from the supporting documents; and the specific supporting sentences in the documents which are used to arrive at the answer and can serve as supervision signals.
As discussed in Yang et al. (2018), collecting a high quality multi-hop QA dataset is challenging because it is important to provide reasonable pairs of supporting context documents to the worker—not all article pairs are conducive to a good multi-hop question. There are four types of pairs we need to collect for the Hop1 and Hop2 passages: Private and Private, Private and Public, Public and Private, and Public and Public. We use the insight that we can obtain meaningful passage-pairs by showing workers passages that mention similar or overlapping entities. All crowdworker assignments contain unique passage pairs. A detailed description of how the passage pairs are produced is in Appendix C and we release all our code for creating the passage pairs.
Benchmark Collection
We used Amazon Mechanical Turk for collection. The question generation stage began with an onboarding process in which we provided training videos, documents with examples and explanations, and a multiple-choice exam. Workers completing the onboarding phase were given access to pilot assignments, which we manually reviewed to identify individuals with high quality submissions. We worked with these individuals to collect the full dataset. We manually reviewed over 2.5k queries in the quality-check process and prioritized including the manually verified examples in the final evaluation splits.
In the manual review, examples of the criteria that led us to discard queries included: the query (1) could be answered using one passage alone, (2) had multiple plausible answers either in or out of the shown passages, or (3) lacked clarity. During the manual review, we developed a multiple-choice questionnaire to streamline the checks along the identified criteria. We then used this to launch a second Turk task to validate the generated queries that we did not manually review. Assembling the cohort of crowdworkers for the validation task again involved onboarding and pilot steps, in which we manually reviewed performance. We shortlisted ∼20 crowdworkers with high quality submissions who collectively validated examples appearing in the final benchmark.
4.3 Benchmark Analysis
Emails and Wiki passages differ in several ways. Format: Wiki passages for entities of the same type tend to be similarly structured, while emails introduce many formats—for example, certain emails contain portions of forwarded emails, lists of articles, or spam advertisements. Noise: Wiki passages tend to be typo-free, while the emails contain several typos, URLs, and inconsistent capitalization. Entity Distributions: Wiki passages tend to focus on details about one entity, while a single email can cover multiple (possibly unrelated) topics. Information about email entities is also often distributed across passages, whereas public-entity information tends to be localized to one Wiki passage. We observe that a private entity occurs 9× on average in gold training data passages while a public entity appears 4× on average. There are 22.6k unique private entities in the gold training data passages, and 12.8k unique public entities. Passage Length: Finally, emails are 3× longer than Wiki passages on average.4
Answer Types
ConcurrentQA is a factoid QA task, so answers tend to be short spans of text containing nouns, or entity names and properties. Figure 2 shows the distribution of NER tags across answers and examples from each category.
Limitations
As in HotpotQA, workers see the gold supporting passages when writing questions, which can result in lexical overlap between the questions and passages. We mitigate these effects through validation-task filtering and by limiting the allowed lexical overlap via the Turk interface. Next, our questions are not organic user searches; however, to our knowledge, existing search and dialogue logs do not contain questions over public and private data. Finally, Enron was a major public corporation; data encountered during pretraining could impact the distinction between public and private data. We investigate this in Section 5.
Ethics Statement
The Enron Dataset is already widely used in NLP research (Heller, 2017). That said, we acknowledge the origin of this data as collected and made public by the U.S. FERC during their investigation of Enron. We note that many of the individuals whose emails appear in the dataset were not involved in wrongdoing. We restrict ourselves to inboxes that have frequently been used in prior work.
In the next sections, we evaluate ConcurrentQA in the SPIRAL setting. We first ask how a range of SoTA retrievers perform in the multi-domain retrieval setting in Section 5, then introduce baselines for ConcurrentQA under a strong privacy guarantee in which no private information is revealed whatsoever in Section 6.
5 Evaluating Mixed-Domain Retrieval
Here we study SoTA multi-hop model performance on ConcurrentQA in the novel multi-distribution setting. The ability of models trained on public data to generalize to private distributions, with little or no labeled data, is a precursor to solutions for SPIRAL. In the commonly studied zero-shot retrieval setting (Guoa et al., 2021; Thakur et al., 2021), the retrieved top-k passages are from a single distribution; however, users often have diverse questions and documents.
We first evaluate multi-hop retrievers. Then we apply strong single-hop retrievers to the setting, to understand the degree to which iterative retrieval is required in ConcurrentQA.
5.1 Benchmarking Multi-Hop Retrievers
Retrievers
We evaluate the multi-hop dense retrieval model (MDR) (Xiong et al., 2021), which achieves SoTA on multi-hop QA, and a multi-hop implementation of BM25, a classical bag-of-words method, as prior work indicates its strength in OOD retrieval (Thakur et al., 2021).
MDR is a bi-encoder model consisting of a query encoder and a passage encoder. Passage embeddings are stored in an index designed for efficient retrieval (Johnson et al., 2017). In Hop1, the embedding for query q is used to retrieve the k passages d1, …, dk with the highest retrieval scores, computed via maximum inner product between the question and passage encodings. For multi-hop MDR, these retrieved passages are each appended to q and encoded, and each of the k resulting embeddings is used to collect k more passages in Hop2, yielding k^2 passages. The top-k passages after the final hop are inputs to the reader, ELECTRA-Large (Clark et al., 2020). The reader selects a candidate answer in each passage.5 The candidate with the highest reader score is output.
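As a rough illustration (not the released MDR code), the hop-wise maximum inner product search can be performed with FAISS; the embedding dimensionality, corpus, and encoder outputs below are placeholders.

```python
import numpy as np
import faiss

dim = 768
passage_embs = np.random.rand(10_000, dim).astype("float32")  # placeholder passage encodings
index = faiss.IndexFlatIP(dim)   # exact maximum-inner-product search
index.add(passage_embs)

def hop(query_embs: np.ndarray, k: int):
    """Retrieve the k passages with the highest inner product for each query embedding."""
    scores, ids = index.search(query_embs.astype("float32"), k)
    return scores, ids

# Hop1: one query embedding -> k passages.
q_emb = np.random.rand(1, dim).astype("float32")
scores1, ids1 = hop(q_emb, k=4)

# Hop2: each f(q, d_i) is re-encoded (mocked here with random vectors) and used
# to retrieve k more passages, yielding k^2 candidates in total.
hop2_query_embs = np.random.rand(4, dim).astype("float32")
scores2, ids2 = hop(hop2_query_embs, k=4)
```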
Baselines
We evaluate using four retrieval baselines: (1) ConcurrentQA-MDR, a dense retriever trained on the ConcurrentQA train set (15.2k examples), to understand the value of in-domain training data for the task; (2) HotpotQA-MDR, trained on HotpotQA (90.4K examples), to understand how well a publicly trained model performs on the multi-distribution benchmark; (3) Subsampled HotpotQA-MDR, trained on subsampled HotpotQA data of the same size as the ConcurrentQA train set, to investigate the effect of dataset size; and (4) BM25 sparse retrieval. Results are in Table 3. Experimental details are in Appendix I.6
| Retrieval Method | Overall EM | Overall F1 | EE | EW | WE | WW |
|---|---|---|---|---|---|---|
| ConcurrentQA-MDR | 48.9 | 56.5 | 49.5 | 66.4 | 41.8 | 68.3 |
| HotpotQA-MDR | 45.0 | 53.0 | 28.7 | 61.7 | 41.1 | 81.3 |
| Subsampled HotpotQA-MDR | 37.2 | 43.9 | 23.8 | 51.1 | 28.6 | 72.1 |
| BM25 | 33.2 | 40.8 | 44.2 | 30.7 | 50.2 | 30.5 |
| Oracle | 74.1 | 83.4 | 66.5 | 87.5 | 89.4 | 90.4 |

The EE, EW, WE, and WW columns report domain-conditioned performance, where the two letters denote the domains of the gold Hop1 and Hop2 passages (E: Email, W: Wikipedia).
Training Data Size
Strong dense retrieval performance requires a large amount of training data. Comparing ConcurrentQA-MDR and Subsampled HotpotQA-MDR, the former outperforms by 12.6 F1 points as it is evaluated in-domain. However, the HotpotQA-MDR baseline, trained on the full HotpotQA training set, performs nearly as well as ConcurrentQA-MDR. Figure 3 shows the performance as training dataset size varies. Next, we observe that the sparse method roughly matches the zero-shot performance of the Subsampled HotpotQA model on ConcurrentQA. For larger dataset sizes (HotpotQA-MDR) and in-domain training data (ConcurrentQA-MDR), dense retrieval outperforms sparse retrieval. Notably, it may be difficult to obtain training data for all private or temporally arising distributions.
Domain Specific Performance
Each retriever excels in a different subdomain of the benchmark. Table 3 shows the retrieval performance of each method based on whether the gold supporting passages for Hop1 and Hop2 are email (E) or Wikipedia (W) passages (EW is Email-Wiki for Hop1-Hop2). HotpotQA-MDR performance on WW questions is far better than on questions involving emails. The sparse retriever performs worse than the dense models on questions involving W, but better on questions with E in Hop2. When training on ConcurrentQA, performance on questions involving E improves significantly, but remains low on W-based questions. Finally, we explicitly provide the gold supporting passages to the reader model (Oracle). EE oracle performance also remains low, indicating room to improve the reader.
How well does the retriever trained on public data perform in the SPIRAL setting?
We observe the HotpotQA-MDR model is biased towards retrieving Wikipedia passages. On examples where the gold Hop1 passage is an email, no emails appear in the top-k Hop1 results 15% of the time; meanwhile, this occurs only 4% of the time when the gold Hop1 passage is from Wikipedia. On the slice of EE examples, 64% of Hop2 passages are E, while on the slice of WW examples, 99.9% of Hop2 passages are W. If we simply force equal retrieval from each domain on each hop, we observe a 2.3 F1 point (4.3%) improvement in ConcurrentQA performance compared to retrieving the overall top-k. Optimally selecting the allocation for each domain is an exciting question for future work.
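A minimal sketch of this fixed allocation policy follows; the even split and the merge rule are assumptions for illustration, and the optimal allocation is left open above.

```python
def merge_with_equal_allocation(public_hits, private_hits, k):
    """Each hits list contains (passage, score) pairs sorted by descending score.

    Instead of keeping the overall top-k, keep the top results from each domain so
    that neither domain is crowded out by the other's score distribution.
    """
    half = k // 2
    merged = public_hits[:half] + private_hits[:half]
    return sorted(merged, key=lambda h: h[1], reverse=True)
```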
Performance on WE questions is notably worse than on EW questions. We hypothesize that this is because several emails discuss each Wikipedia-entity, which may increase the noise in Hop2 (i.e., WE is a one-to-many hop, while for EW, W typically contains one valid entity-specific passage). The latter is intuitively because individuals refer to a narrow set of public entities in private discourse.
5.2 Benchmarking Single-Hop Retrieval
In Section 3, we identify that iterative retrieval implicates document privacy. Therefore, an important preliminary question is: to what degree are multiple hops actually required? We investigate this question using both HotpotQA and ConcurrentQA. We evaluate MDR using just the first-hop results and Contriever (Izacard et al., 2021), the SoTA single-hop dense retrieval model.
Results
In Table 4, we summarize the retrieval results from using three off-the-shelf models for HotpotQA: (1) the HotpotQA MDR model for one-hop, (2) the pretrained Contriever model, and (3) the MS-MARCO (Nguyen et al., 2016) fine-tuned variant of Contriever. We observe a sizeable gap between the one and two hop baselines. Strong single-hop models trained over more diverse publicly available data may help address the SPIRAL problem as demonstrated by Contriever fine-tuned on MS-MARCO.
| Method | Recall@10 |
|---|---|
| Two-hop MDR | 77.5 |
| One-hop MDR | 45.7 |
| Contriever | 52.7 |
| Contriever MS-MARCO | 64.3 |
However, when evaluating the one-hop baselines on ConcurrentQA, we find Contriever underperforms the two-hop baseline more significantly, as shown in Appendix Table 8. This is consistent with prior work that finds Contriever quality degrades on tasks that increasingly differ from the pretraining distribution (Zhan et al., 2022). By sub-domain, Contriever MS-MARCO returns the gold first-hop passage for 85% of questions where both gold passages are from Wikipedia, but for less than 39% of questions when at least one gold passage (Hop1 and/or Hop2) is an email. By hop, we find Contriever MS-MARCO retrieves the first-hop passage 49% of the time and second-hop passage 25% of the time.
Finally, to explore whether a stronger single-hop retriever may further improve the one-hop baseline, we continually fine-tune Contriever on ConcurrentQA. We follow the training protocol and use the code released in Izacard et al. (2021), and include these details in Appendix A. The fine-tuned model achieves 39.7 Recall@10 and 63.6 Recall@100, while two-hop MDR achieves 55.9 Recall@10 and 73.8 Recall@100 (Table 9 in the Appendix). We observe Contriever’s one-hop Recall@100 of 63.6 exceeds the two-hop MDR Recall@10 of 55.9, suggesting a tradeoff space between the number of passages retrieved per hop (which is correlated with cost) and the ability to circumvent iterative retrieval (which we identify implicates privacy concerns).
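For reference, a rough sketch of the Recall@k metric reported above, under the common multi-hop convention that a question counts as covered only when all of its gold passages appear in the top-k (the paper's exact scoring script may differ):

```python
def recall_at_k(ranked_ids_per_question, gold_ids_per_question, k):
    """ranked_ids_per_question: ranked passage ids per question;
    gold_ids_per_question: the gold passage ids per question."""
    covered = sum(
        1
        for ranked, gold in zip(ranked_ids_per_question, gold_ids_per_question)
        if set(gold).issubset(set(ranked[:k]))
    )
    return covered / len(gold_ids_per_question)
```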
6 Evaluation under Privacy Constraints
This section provides baselines for ConcurrentQA under privacy constraints. We concretely study a baseline in which no private information is revealed publicly whatsoever. We believe this is an informative baseline for two reasons:
The privacy setting we study is often categorized as an access-control framework: different parties have different degrees of access to different degrees of privileged information. While this setting is quite restrictive, the framework is widely used in practice, for instance in the government and medical fields (Bell and LaPadula, 1976; Hu et al., 2006).
There are many possible privacy constraints as users find different types of information to be sensitive (Xu et al., 2007). Studying these is an exciting direction that we hope is facilitated by this work. Because the appropriate privacy relaxations are subjective, we focus on characterizing the upper (Section 5) and lower bounds (Section 6) of retrieval quality in our proposed setting.
Setup
We use models trained on Wikipedia data to evaluate performance under privacy restrictions both in the in-distribution multi-hop HotpotQA-SPIRAL (an adaptation of the HotpotQA benchmark to the SPIRAL setting [Yang et al., 2018]) and multi-distribution ConcurrentQA settings. Motivating the latter setup, sufficient training data is seldom available for all private distributions. We use the multi-hop SoTA model, MDR, which is representative of the iterative retrieval procedure that is used across multi-hop solutions (Miller et al., 2016; Feldman and El-Yaniv, 2019; Xiong et al., 2021, inter alia).
We construct Hotpot-SPIRAL by randomly assigning passages to the private (DP) and public (DG) corpora. To enable a clear comparison, we ensure that the sizes of DP and DG, and the proportions of questions for which the gold documents are public and private in Hop1 and Hop2 match those in ConcurrentQA.
6.1 Evaluation
We evaluate performance when no private information (neither queries nor documents) is revealed whatsoever. We compare four baselines, shown in Table 6. (1) No Privacy Baseline: We combine all public and private passages in one corpus, ignoring privacy concerns. (2) No Privacy Multi-Index: We create two corpora and retrieve the top k from each index in each hop, and retain the top-k of these 2k documents for the next hop, without applying any privacy restriction. Note performance should match single-index performance. (3) Document Privacy: We use the process in (2), but cannot use a private passage retrieved in Hop1 to subsequently retrieve from public DG. (4) Query Privacy: The baseline to keep q entirely private is to only retrieve from DP.
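As a rough illustration of how the four baselines differ purely as retrieval policies (the policy names and the boolean flag below are our own shorthand for illustration, not the evaluation code):

```python
def allowed_indices(policy: str, query_derived_from_private: bool):
    """Which indices a given query may be sent to under each baseline."""
    if policy in ("no_privacy", "no_privacy_multi_index"):
        return ["public", "private"]            # (1)/(2): no restriction on dataflow
    if policy == "document_privacy":
        # (3): a query built from a private Hop1 passage never reaches the public index.
        return ["private"] if query_derived_from_private else ["public", "private"]
    if policy == "query_privacy":
        return ["private"]                      # (4): q never leaves the private enclave
    raise ValueError(f"unknown policy: {policy}")
```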
We can answer many complex questions while revealing no private information whatsoever (see Table 5). However, in maintaining document privacy, end-to-end QA performance degrades by 9% for HotpotQA and 19% for ConcurrentQA compared to the quality of the non-private system; the degradation is worse under query privacy. We hope the resources we provide facilitate future work under alternate privacy frameworks.
| Privacy Level | Sample Questions Answered under Each Privacy Level |
|---|---|
| Answered with No Privacy, but not under Document Privacy | Q1: In which region is the site of a meeting between Dabhol manager Wade Cline and Ministry of Power Secretary A. K. Basu located? Q2: What year was the state-owned regulation board that was in conflict with Dabhol Power over the DPC project formed? |
| Answered with Document Privacy | Q1: The U.S. Representative from New York who served from 1983 to 2013 requested a summary of what order concerning a price cap complaint? Q2: How much of the company known as DirecTV Group does GM own? |
| Answered with Query Privacy | Q1: Which CarrierPoint backer has a partner on SupplySolution’s board? Q2: At the end of what year did Enron India’s managing director responsible for managing operations for Dabhol Power believe it would go online? (All evidence is in private emails and not in Wikipedia.) |
| Model | HotpotQA-SPIRAL EM | HotpotQA-SPIRAL F1 | ConcurrentQA EM | ConcurrentQA F1 |
|---|---|---|---|---|
| No Privacy Baseline | 62.3 | 75.3 | 45.0 | 53.0 |
| No Privacy Multi-Index | 62.3 | 75.3 | 45.0 | 53.0 |
| Document Privacy | 56.8 | 68.8 | 36.1 | 43.0 |
| Query Privacy | 34.3 | 43.3 | 19.1 | 23.8 |
6.2 Managing the Privacy-Quality Tradeoff
Alongside improving the retriever’s quality, an important area of research for end-to-end QA systems is to avoid providing users with incorrect predictions, given existing retrievers. Significant work focuses on equipping QA-systems with this selective-prediction capability (Chow, 1957; El-Yaniv and Wiener, 2010; Kamath et al., 2020; Jones et al., 2021, inter alia). Towards improving the reliability of the QA system, we next evaluate selective prediction in our novel retrieval setting.
Setup
Selective prediction aims to provide the user with an answer only when the model is confident. The goal is to answer as many questions as possible (high coverage) with as high performance as possible (low risk). Given a query q and a model which outputs (â, c), where â is the predicted answer and c ∈ ℝ represents the model’s confidence in â, we output â if c ≥ γ for some threshold γ ∈ ℝ, and abstain otherwise. As γ increases, risk and coverage both tend to decrease. The QA model outputs an answer and score for each of the top-k retrieved passages; we compute the softmax over the top-k scores and use the top softmax score as c (Hendrycks and Gimpel, 2017; Varshney et al., 2022). Models are trained on HotpotQA, representing the public domain.
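A minimal sketch of this max-prob selective prediction rule and the corresponding risk/coverage computation (array shapes and names are assumptions for illustration):

```python
import numpy as np

def selective_answer(answers, reader_scores, gamma):
    """answers: k candidate spans; reader_scores: k reader scores; gamma: threshold."""
    scores = np.asarray(reader_scores, dtype=float)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                     # softmax over the top-k reader scores
    c = probs.max()                          # confidence of the top candidate
    if c >= gamma:
        return answers[int(probs.argmax())], c
    return None, c                           # abstain

def risk_coverage(confidences, correct, gamma):
    """confidences, correct: arrays over the eval set (correct is 0/1 per question)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    answered = confidences >= gamma
    coverage = answered.mean()
    risk = 1.0 - correct[answered].mean() if answered.any() else 0.0
    return risk, coverage
```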
Results
Risk-coverage curves for HotpotQA and ConcurrentQA are in Figure 4. Under Document Privacy, the “No Privacy” score of 75.3 F1 for HotpotQA and 53.0 F1 for ConcurrentQA are achieved at 85.7% and 67.8% coverage, respectively.
In the top plots, in the absence of privacy concerns, the risk-coverage trends are worse for ConcurrentQA vs. HotpotQA (i.e., quality degrades more quickly as coverage increases). Out-of-distribution selective prediction is actively studied (Kamath et al., 2020); however, this setting differs from the standard setup. The bottom plots show that on ConcurrentQA the risk-coverage trends differ widely based on the sub-domains of the questions; the standard retrieval setup typically has a single distribution (Thakur et al., 2021).
Next, privacy restrictions correlate with degradations in the risk-coverage curves on both ConcurrentQA and HotpotQA. Critically, HotpotQA is in-distribution for the retriever. Strategies beyond selective prediction via max-prob, the prevailing approach in NLP (Varshney et al., 2022), may be useful for the SPIRAL setting.
7 Conclusion
We ask how to personalize neural retrieval-systems in a privacy-preserving way and report on how arbitrary retrieval over public and private data poses a privacy concern. We define the SPIRAL retrieval problem, present the first textual multi-distribution benchmark to study the novel setting, and empirically characterize the privacy-quality tradeoffs faced by neural retrieval systems.
Our analysis motivated the creation of a new benchmark rather than repurposing existing benchmarks. We qualitatively identified differences between the public Wikipedia and private email distributions in Section 4.3, and quantitatively demonstrated the effects of applying models trained on one distribution (e.g., public) to the mixed-distribution (e.g., public and private) setting in Sections 5 and 6. Private iterative retrieval is underexplored, and we hope the benchmark resource and evaluations we provide inspire further research on this topic, for instance under alternate privacy models.
Acknowledgments
We thank Jack Urbaneck, Wenhan Xiong, and Gautier Izacard for their advice and feedback. We gratefully acknowledge the support of NIH under grant no. U54EB020405 (Mobilize), NSF under grant nos. CCF1763315 (Beyond Sparsity), CCF1563078 (Volume to Velocity), and 1937301 (RTML); US DEVCOM ARL under grant no. W911NF-21-2-0251 (Interactive Human-AI Teaming); ONR under grant no. N000141712266 (Unifying Weak Supervision); ONR N00014-20-1-2480: Understanding and Applying Non-Euclidean Geometry in Machine Learning; N000142012275 (NEPTUNE); NXP, Xilinx, LETI-CEA, Intel, IBM, Microsoft, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, Google Cloud, Salesforce, Total, the HAI-GCP Cloud Credits for Research program, the Stanford Data Science Initiative (SDSI), Stanford Graduate Fellowship, and members of the Stanford DAWN project: Facebook, Google, and VMWare. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of NIH, ONR, or the U.S. Government.
Notes
Note that beyond multi-hop QA, retrieval-augmented language models and dialogue systems also involve iterative retrieval (Guu et al., 2020).
The Enron Corpus includes emails written by 158 employees of Enron Corporation and is in the public domain.
Since information density is generally lower in emails than in Wiki passages, the longer email passages help crowdworkers generate meaningful questions. Passage lengths were chosen to fit within the model context window.
We check for dataset leakage stemming from the “public” models potentially viewing “private” email information in pretraining. Using the MDR and ELECTRA models fine-tuned on HotpotQA, we evaluate on ConcurrentQA using a corpus of only Wiki passages. Test scores are 72.0 and 3.3 EM for questions based on two Wiki and two email passages respectively, suggesting explicit access to emails is important.
A Experimental Details
The MDR retriever is trained with a contrastive loss as in Karpukhin et al. (2020), where each query is paired with a (gold annotated) positive passage and m negative passages to approximate the softmax over all passages. We consider two methods of collecting negative passages: first, we use random passages from the corpus that do not contain the answer (random); second, we use one top-ranking BM25 passage that does not contain the answer as a hard negative, paired with remaining random negatives. We do not observe much difference between the two approaches on ConcurrentQA (also observed in Xiong et al. [2021]), and thus use random negatives for all experiments.
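A minimal PyTorch sketch of this contrastive objective (a generic DPR-style loss under our naming assumptions, not the released training code):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb: torch.Tensor, pos_emb: torch.Tensor, neg_embs: torch.Tensor):
    """q_emb: (B, d) query encodings; pos_emb: (B, d) gold passages; neg_embs: (B, m, d)."""
    pos_scores = (q_emb * pos_emb).sum(-1, keepdim=True)        # (B, 1) inner products
    neg_scores = torch.einsum("bd,bmd->bm", q_emb, neg_embs)    # (B, m) inner products
    scores = torch.cat([pos_scores, neg_scores], dim=1)         # (B, 1 + m)
    labels = torch.zeros(scores.size(0), dtype=torch.long)      # gold passage at index 0
    return F.cross_entropy(scores, labels)
```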
The number of passages retrieved per hop, k, is an important hyperparameter; increasing k tends to increase recall but sacrifices precision. A larger k is also less efficient at inference time. We use k = 100 for all experiments in the paper, and Table 9 studies the effect of using different values of k.
We find the hyperparameters in Table 7 in the Appendix work best and train on up to 8 NVidia-A100 GPUs.
| Hyperparameter | Value |
|---|---|
| Learning rate | 5e-5 |
| Batch size | 150 |
| Maximum passage length | 300 |
| Maximum query length at initial hop | 70 |
| Maximum query length at 2nd hop | 350 |
| Warmup ratio | 0.1 |
| Gradient clipping norm | 2.0 |
| Training epochs | 64 |
| Weight decay | 0 |
| Model | Recall@10 |
|---|---|
| Two-hop MDR | 55.9 |
| Contriever | 12.1 |
| Contriever MS-MARCO | 36.9 |
Sparse Retrieval
For the sparse retrieval baseline, we use Pyserini with default parameters.7 We consider different values of k ∈{1,10,25,100} per retrieval, reported in Table 10 in the Appendix. We generate the second hop query by concatenating the text of the initial query and first hop passages.
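A minimal sketch of this two-hop BM25 procedure with Pyserini, assuming a prebuilt Lucene index at a hypothetical `index_dir` whose stored documents contain the raw passage text (import paths and document access can differ across Pyserini versions):

```python
from pyserini.search.lucene import LuceneSearcher

def two_hop_bm25(query: str, index_dir: str, k: int = 10):
    searcher = LuceneSearcher(index_dir)
    results = []
    for hit in searcher.search(query, k=k):                           # Hop1
        passage_text = searcher.doc(hit.docid).raw()
        # Second-hop query: concatenate the initial query and the Hop1 passage text.
        for hit2 in searcher.search(f"{query} {passage_text}", k=k):  # Hop2
            results.append((hit2.docid, hit2.score))
    return sorted(results, key=lambda r: r[1], reverse=True)[:k]
```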
QA Model
We use the provided ELECTRA-Large reader model checkpoint from Xiong et al. (2021) for all experiments. The model was trained on HotpotQA training data. Using the same reader is useful to understand how retrieval quality affects performance, in the absence of reader modifications.
Contriever Model
We use the code released by Izacard et al. (2021) for the zero-shot and fine-tuning implementation and evaluation.8 We perform a hyperparameter search over the learning rate ∈ {1e-4, 1e-5}, temperature ∈ {0.5, 1}, and number of negatives ∈ {5, 10}. We found a learning rate of 1e-5 with a linear schedule and 10 negative passages to be best. These hyperparameters are chosen following the protocol in Izacard et al. (2021).
B Additional Analysis
We include two figures to further characterize the differences between the Wikipedia and Enron distributions.
Figure 5 (Left, Middle) in the Appendix shows the UMAP plots of ConcurrentQA questions using BERT-base representations, split by whether the gold hop passages are both from the same domain (e.g., two Wikipedia or two email passages) or require one passage from each domain. The plots reflect a separation between Wiki-based and email-based questions and passages.
C ConcurrentQA Details
Here we compare ConcurrentQA to available textual QA benchmarks and provide additional details on the benchmark collection procedure.
C.1 Overview
ConcurrentQA is the first multi-distribution textual benchmark. Existing benchmarks in this category are summarized in Table 11 in the Appendix. We note that HybridQA (Chen et al., 2020b) and similar benchmarks include multi-modal documents. However, every question in those benchmarks requires exactly one passage from each modality, i.e., one table and one text passage. Our benchmark considers text-only documents, where questions can require arbitrary retrieval patterns across the distributions.
| Dataset | Size | Domain |
|---|---|---|
| WebQuestions (Berant et al., 2013) | 6.6K | Freebase |
| WebQSP (Yih et al., 2016) | 4.7K | Freebase |
| WebQComplex (Talmor and Berant, 2018) | 34K | Freebase |
| MuSiQue (Trivedi et al., 2021) | 25K | Wiki |
| DROP (Dua et al., 2019) | 96K | Wiki |
| HotpotQA (Yang et al., 2018) | 112K | Wiki |
| 2Wiki2MultiHopQA (Ho et al., 2020) | 193K | Wiki |
| Natural-QA (Kwiatkowski et al., 2019) | 300K | Wiki |
| ConcurrentQA | 18.4K | Email & Wiki |
C.2 Benchmark Construction
We need to generate Hop1 and Hop2 passage pairs of four types: two Wikipedia documents (Public, Public), a Wikipedia document and an email (Public, Private), an email and a Wikipedia document (Private, Public), and two emails (Private, Private).
Public-Public Pairs
For Public-Public pairs, we use a directed Wikipedia hyperlink graph G, where a node is a Wikipedia article and an edge (a, b) represents a hyperlink from the first paragraph of article a to article b. The entity associated with article b is mentioned in article a and described in article b, so b forms a bridge, or commonality, between the two contexts. Crowdworkers are presented the final public document pairs (a, b) ∈ G. We provide the title of b as a hint to the worker, as a potential anchor for the multi-hop question.
To initialize the Wikipedia hyperlink graph, we use the KILT KnowledgeSource resource (Petroni et al., 2021) to identify hyperlinks in each of the Wikipedia passages.9 To collect passages that share enough in common, we eliminate entities b which are too specific or vague, having many plausible correspondences across passages. For example, given a representing a “company”, it may be challenging to write a question about its connection to the “business psychology” doctrine the company ascribes to (b is too specific) or to the “country” in which the company is located (b is too general). To determine which Wiki entities to permit for the a and b pairings shown to the workers, we ensure that the entities come from a restricted set of entity categories. The Wikidata knowledge base stores type categories associated with entities (e.g., “Barack Obama” is a “politician” and “lawyer”). We compute the frequency of Wikidata types across the 5.2 million entities and permit entities containing any type that occurs at least 1000 times; the intuition is that highly specific types of entities (e.g., a legal code or scientific fact) and highly general types of entities (e.g., countries) occur less frequently. We also restrict to Wikipedia documents containing a minimum number of sentences and tokens.
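A minimal sketch of this type-frequency filter, assuming a precomputed mapping from each Wikipedia entity to its Wikidata type strings (a hypothetical input; the released code builds this information from KILT and Wikidata):

```python
from collections import Counter

def permitted_entities(entity_types: dict, min_type_freq: int = 1000):
    """entity_types: {entity title -> list of Wikidata type strings}."""
    type_counts = Counter(t for types in entity_types.values() for t in types)
    frequent_types = {t for t, c in type_counts.items() if c >= min_type_freq}
    # Keep entities with at least one sufficiently common type; rare types cover
    # both overly specific entities (e.g., a legal code) and overly general ones
    # (e.g., "country" labels only a few hundred of the 5.2M entities).
    return {e for e, types in entity_types.items()
            if any(t in frequent_types for t in types)}
```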
Pairs with Private Emails
Unlike Wikipedia, hyperlinks are not readily available for many unstructured data sources including the emails, and the non-Wikipedia data contains both private and public (e.g., Wiki) entities. Thus, we design the following approach to annotate the public and private entity occurrences in the email passages:
We collect candidate entities with SpaCy.10
We split the full set into candidate public and candidate private entities by identifying Wikipedia linked entities amongst the spans tagged by the NER model. We annotate the text with the open-source SpaCy entity-linker, which links the text to entities in the Wiki knowledge base, to collect candidate occurrences of global entities.11 We use heuristic rules to filter remaining noise in the public entity list.
We post-process the private entity lists to improve precision. High precision entity-linking is critical for the quality of the benchmark: A query assumed to require the retrieval of private passages a and b should not be unknowingly answerable by public passages. After curating the private entity list, we restrict to candidates which occur at least 5 times in the deduplicated set of passages.
A total of 43.4k unique private entities and 8.8k unique public entities appear in the emails, and 1.6k private and 2.3k public entities occur at least 5 times across passages. We present crowdworkers with emails containing at least three total entities to ensure there is sufficient information to write the multi-hop question.
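A minimal sketch of step (1) above, extracting candidate entity spans from emails with SpaCy and applying the occurrence threshold (the model name and the placement of the threshold are assumptions for illustration; the public/private split additionally requires the entity-linking and filtering steps described above):

```python
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")  # assumed model; any SpaCy NER pipeline works

def candidate_entities(email_passages, min_count: int = 5):
    counts = Counter()
    for text in email_passages:
        counts.update(ent.text for ent in nlp(text).ents)  # NER-tagged spans
    # Keep candidates that occur at least `min_count` times across passages.
    return {entity for entity, c in counts.items() if c >= min_count}
```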
Private-Private pairs are pairs of emails that mention the same private entity e. The Private-Public and Public-Private pairs consist of an email mentioning a public entity e and the Wikipedia passage for e. In both cases, we provide the hint that e is a potential anchor for the multi-hop question.
Comparison Questions
For comparison questions, Wikidata types are readily available for public entities, and we use these to present the crowdworker with two passages describing entities of the same type. For private emails, there is no associated knowledge graph, so we heuristically assign types to private entities by determining whether type strings occur frequently alongside the entity in emails (e.g., if “politician” is frequently mentioned in the emails in which an entity occurs, we assign the “politician” type).
Finally, crowdworkers are presented with a passage pair and asked to write a question that requires information from both passages. We use separate interfaces for bridge vs. comparison questions and guide the crowdworker to form bridge questions by using the passages in the desired order for Hop1 and Hop2.
Author notes
Equal contribution.
Work done at Meta.
Action Editor: Jacob Eisenstein