InSCIt: Information-Seeking Conversations with Mixed-Initiative Interactions

In an information-seeking conversation, a user may ask questions that are under-specified or unanswerable. An ideal agent would interact by initiating different response types according to the available knowledge sources. However, most current studies either fail to or artificially incorporate such agent-side initiative. This work presents InSCIt, a dataset for Information-Seeking Conversations with mixed-initiative Interactions. It contains 4.7K user-agent turns from 805 human-human conversations where the agent searches over Wikipedia and either directly answers, asks for clarification, or provides relevant information to address user queries. The data supports two subtasks, evidence passage identification and response generation, as well as a human evaluation protocol to assess model performance. We report results of two systems based on state-of-the-art models of conversational knowledge identification and open-domain question answering. Both systems significantly underperform humans, suggesting ample room for improvement in future studies.


Introduction
Recently, there has been increasing interest in developing conversational information-seeking systems (Choi et al., 2018; Adlakha et al., 2022; Saeidi et al., 2018; Feng et al., 2020), where the system assists users in finding information from knowledge sources (e.g., a text corpus) via multi-turn conversational interactions. One important advantage of conversational information-seeking systems is that users do not need to come up with a very descriptive query by themselves (Webb and Webber, 2009; Rieser and Lemon, 2009; Konstantinova and Orasan, 2013). Instead, as shown in Figure 1, they can start with a request that is under-specified or has no direct answer, and through conversational interactions, the agent can collaboratively guide users to refine (left) or relax their queries, and even proactively suggest relevant information (right) that may partially satisfy the user's information needs. However, existing information-seeking conversation datasets rarely contain such mixed-initiative interactions. For example, most existing conversational question answering (CQA) work focuses on user-initiative interactions, where the agent simply responds to user questions with direct answers, or outputs no answer for out-of-scope queries (Choi et al., 2018; Reddy et al., 2019; Adlakha et al., 2022). Other work studies clarification questions using artificially created data, failing to capture the envisioned information-seeking interactions (Saeidi et al., 2018; Feng et al., 2020; Aliannejadi et al., 2021; Guo et al., 2021).
In this paper, we introduce INSCIT, a dataset for Information-Seeking Conversations with mixed-initiative Interactions, where agents take various strategies, such as providing direct answers (72%), raising clarification questions (13%), and presenting relevant information as indirect answers (13%), to address users' information needs. It contains 805 natural human-human information-seeking conversations with 4.7K user-agent turns over diverse topics. To simulate realistic information-seeking scenarios, users write queries with minimal restriction, and human agents decide on different strategies to respond, after searching over Wikipedia and annotating the evidence paragraphs they use. Through the design of a scalable annotation pipeline and careful quality control, we collect high-quality data (over 96% of annotations pass our validation).
We formulate two tasks for the conversational agent system: (1) identify a set of evidence passages from Wikipedia, and (2) generate a response grounded in the evidence. Compared with previous studies on open-domain information-seeking conversations (Anantha et al., 2021; Adlakha et al., 2022), the key challenges in our tasks come from identifying and fusing information from multiple evidence passages to construct responses that reflect various strategies. Since handling queries with multiple evidence passages or no direct answer is more open-ended, we emphasize the need for human evaluation, and propose a more systematic human evaluation protocol that considers diverse aspects including coherence, factual consistency, and information comprehensiveness, with both the predicted evidence passages and the response provided as the evaluation input.
We present two strong baselines based on the state of the art in open-domain question answering (Karpukhin et al., 2020; Izacard and Grave, 2020) and conversational knowledge identification (Wu et al., 2021). Results indicate that, while these systems achieve substantial improvements over trivial baselines, there is still significant room for improvement, especially in scenarios requiring agent strategies other than providing a direct answer. Our analysis suggests that the key remaining challenges are improving passage identification and fusing information from multiple passages by leveraging different response strategies. We present a detailed discussion and avenues for future work.

Related Work
Information-Seeking Conversations The aim of information-seeking conversations is to address the user's initial and follow-up information needs with grounding in knowledge sources. Table 1 compares INSCIT with existing information-seeking conversation datasets. Early CQA work including QuAC (Choi et al., 2018) and CoQA (Reddy et al., 2019) requires the agent to answer each user question in a conversation by reading a short passage. DoQA (Campos et al., 2020), QReCC (Anantha et al., 2021) and TopioCQA (Adlakha et al., 2022) extend the task to an open-domain setting where the knowledge source is a large document corpus. These studies only consider limited scenarios where the agent provides a direct answer based on a short text span in a single passage, or simply outputs no answer if there is no direct answer.
A few other studies create artificial conversations to address ambiguous user questions. For instance, Qulac (Aliannejadi et al., 2019) and the data collected in their follow-up work (Aliannejadi et al., 2021) are based on user queries containing a multi-faceted entity and intentionally annotated agent clarification questions; ShARC (Saeidi et al., 2018), Doc2Dial (Feng et al., 2020) and MultiDoc2Dial (Feng et al., 2021) are rule-based information-seeking conversations in the social welfare domain that incorporate agent-side clarifications. Guo et al. (2021) create Abg-CoQA by rewriting conversations in the CoQA dataset to intentionally include ambiguous questions. Open-ended and ambiguous user queries have also been observed in single-turn question answering (Min et al., 2020; Zhang and Choi, 2021; Sun et al., 2022), which is usually addressed by training a model to predict multiple conditional answers without further interactions. In contrast, INSCIT consists of human-human conversations with natural information-seeking user requests and mixed agent initiative to address them.
Gustavo Penha and Hauff (2019) crawl conversations from Stack Exchange. These conversations mix information-seeking utterances with casual talk, and one grounding document is heuristically obtained for each conversation. In contrast, INSCIT consists of clean information-seeking dialogues with annotated and validated grounding knowledge for each agent turn.
Knowledge-Grounded Social Chat Different from information-seeking dialogues, the user intent in social chat is mostly to conduct casual talk. Knowledge-grounded social chat systems (Ghazvininejad et al., 2018; Dinan et al., 2019; Zhou et al., 2018; Moghe et al., 2018) incorporate external knowledge with the purpose of making conversations more engaging and informative. Rodriguez et al. (2020) collect a dataset for training a conversational agent to select knowledge to present based on the user's background, with the aim of maintaining the user's interest in the conversation.

Task Formulations
Before introducing our data, we first formulate two tasks for INSCIT, namely passage identification and response generation. These two tasks mimic how an agent responds to each information-seeking user request: the agent first searches for relevant information over the knowledge source and then constructs the response based on the gathered information. Compared with prior studies on open-domain information-seeking conversations (Anantha et al., 2021; Adlakha et al., 2022), the key challenges in our tasks come from identifying and fusing information from multiple evidence passages to construct responses using various strategies, rather than a single passage and a short answer.
At the n-th agent turn, both tasks have the same input: the dialogue context X = [u_1, a_1, u_2, a_2, ..., u_n], the corpus of all passage candidates C, as well as the previously used passages {P_1, P_2, ..., P_{n-1}}, where P_i is the set of passages used in the i-th agent turn a_i. C is defined as all textual paragraphs in the full Wikipedia dump. Instead of treating passage identification as a ranking problem (i.e., passage retrieval), we require the model to predict a set of passages P̂_n from C that are relevant to the current user request u_n in the dialogue context X and provide evidence for the response generation task, i.e., generating the next agent response â_n. Specifically, identifying the knowledge used in the response is important for model interpretability as well as for evaluating how well a model grounds response generation in the knowledge source. Ideally, all factual information contained in â_n should be consistent with P̂_n, and every passage in P̂_n should provide at least one unique piece of information as evidence for â_n.
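The input/output structure of the two tasks can be sketched as follows (a minimal illustration; the class and field names are our own and not from the paper's released code):

```python
from dataclasses import dataclass


@dataclass
class AgentTurnInput:
    # X = [u_1, a_1, ..., u_n]: alternating utterances, ending with user turn u_n
    dialogue_context: list
    # [P_1, ..., P_{n-1}]: sets of passage IDs used by each prior agent turn
    prev_evidence: list


@dataclass
class AgentTurnOutput:
    # Predicted evidence passage set P̂_n (passage identification task)
    evidence_passages: set
    # Generated agent response â_n (response generation task)
    response: str
```

Both tasks consume the same `AgentTurnInput` (plus the corpus C); the response in `AgentTurnOutput` should be grounded in, and only in, the predicted evidence set.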

Our Data: INSCIT
Now we introduce INSCIT, a new dataset of information-seeking conversations where the agent interprets the user intent and provides comprehensive information grounded in Wikipedia via natural human-human interactions. In the following, we first present our data collection pipeline (§ 4.1), and then explain how we define and control the data quality (§ 4.2). Lastly, we highlight characteristics of INSCIT that illustrate the diversity of user and agent turns, as well as interesting observations about agent initiatives and dialogue structures (§ 4.3).

Annotation Pipeline
We recruit user, agent, and validation workers to annotate user / agent turns and validate agent annotations, respectively. Due to the asymmetric time spent by the user and the agent workers in a conversation, we design a separate annotation task for each user or agent turn, following Wen et al. (2017). Thus, the annotation of each dialogue progresses in a pipeline, and no worker needs to wait for the other party. This pipeline (Wen et al., 2017) was introduced to collect human-human task-oriented dialogues and was inspired by the Wizard-of-Oz paradigm (Kelley, 1984). It has proved to be efficient while not hurting conversation coherence, as each worker is required to read all previous utterances before annotation.
Figure 2 gives an overview of the annotation pipeline. Each conversation starts with an initial user turn, where the worker asks a question after reading a text snippet from a seed document. Then, user and agent turns are annotated sequentially until the end of the conversation is reached.
Validation follows each user-agent turn. We collect up to 7 user-agent turns per conversation.
Seed Document Selection To diversify conversation topics, we select seed Wikipedia articles, used for triggering initial user requests, from 5 different topic categories: food and drink, hobby, historical events, geography, and weekly top-25 pages in Wikipedia. To further diversify the pool of seed documents, we leverage the top-down tree structure of Wikipedia categories and sample Wikipedia pages at various tree depths under each of the first 4 categories. Weekly top-25 pages are taken from Wikipedia's weekly top-25 reports of the year 2021. See Appendix A.1 for details. Figure 3 (left) shows the distribution of sampled seed documents under each category and their corresponding depths.
User Turn Here, a user worker is asked to write an initial query or a follow-up response to continue the existing conversation. To trigger each conversation (Figure 2 (a)), the user worker is presented with the leading paragraph of a seed Wikipedia article, and is instructed to ask a question they are interested in but whose answer they cannot find in the paragraph. The article content outline containing all section titles is also provided to help with question construction. The annotation for each following user turn (d) starts after the completion of the previous agent annotation (b) and the validation step (c), based on all previous utterances in the conversation (i.e., the dialogue context).
Agent Turn Different from the user worker, in addition to the dialogue context, each agent worker (Figure 2 (b)) is given all evidence paragraphs used in each previous agent turn as additional context.
The worker is then told to use the provided search engine to find answer(s) from Wikipedia for the current user request. They are asked to select all (up to 4) evidence paragraphs from Wikipedia that they use to construct their final response. Based on what they find, they choose one of four response strategies: {direct answer, clarification, relevant answer, no information}. In contrast to a direct and precise answer, we consider a response a relevant answer when the agent finds information that only partially satisfies the user's need (e.g., by relaxing a constraint in the request). For each agent turn, we collect two different annotations to increase reference diversity.
Validation After each agent turn, we send the two annotations to a validator, who checks whether each annotation is valid (details in § 4.2). If both are valid, the validator is asked to rate which one is more comprehensive. An agent response is considered more comprehensive if it contains more information relevant to the user request. The more comprehensive (or the only valid) annotation is then used to continue the conversation. The conversation is terminated if both annotations are invalid.

Quality Control
Worker Qualification As the agent annotation is challenging, to recruit agent workers we manually review more than 150 submissions of a qualification task and select 24 highly qualified agent workers who consistently produce valid annotations during the qualification. We use different qualification tasks to select 35 qualified users and 10 validators before the annotation (details in Appendix A.1).
Annotation Control To discourage users from chit-chatting or raising inappropriate requests (e.g., overly subjective ones), each agent worker can decide either to continue the conversation or to flag the previous user turn as incoherent or an invalid request.
Similarly, for each agent turn annotation, a validator (Figure 2 (c)) determines whether i) each selected evidence paragraph is properly used in the response; ii) the response is factually consistent with the evidence; iii) the response is coherent with the dialogue context; and iv) the labeled response strategy is faithfully reflected in the response. Only valid user and agent annotations are included in our final dataset.
To encourage workers to search extensively for information relevant to each user request, we assign a bonus to an agent worker if the validator labels their annotation as equally or more comprehensive than the other worker's.
We constantly monitor the annotation process and send feedback to workers. Our user and agent workers have average validation passing rates of over 99% and 96%, respectively. About 13% of agent annotations are marked as less comprehensive. See more annotation details in Appendix A.1.

Data Analysis
In this section, we first explain the data preparation and the overall statistics of INSCIT. Then, we dive into detailed discussions of the diversity of our user and agent turns (§ 4.3.1) and analyses of agent initiatives (§ 4.3.2).

Data Preparation & Overall Statistics
Table 2 shows the summary statistics of INSCIT. We collect 4712 user-agent turns from 805 conversations.
To develop information-seeking systems, we split the data into train/dev/test sets. The test set contains conversations triggered by seed documents from all 5 topic categories, while the training and dev sets only contain those from "food and drink", "hobby" and "top-25". This controlled topic distribution shift can be used to evaluate the robustness of developed models. There are additional distribution changes from training to dev / test sets. First, we remove agent responses flagged as less comprehensive from the reference sets for dev and test, while keeping all valid agent annotations, as well as their comprehensiveness comparison results, in the training set. This explains the variation among the different sets in the number of references per turn in Table 2. In addition, we adjust the worker incentives when collecting the dev / test sets, leading to the difference in average agent turn length. Also, we drop agent annotations if their corresponding evidence passages cannot be found in the corpus during post-processing. For word counting, we use the spaCy tokenizer.

Diversity of User and Agent Turns
User Request The middle and right treemaps in Figure 3 show the 7 most frequent leading unigrams of user utterances from conversations under the "food & drink" and "historical events" topic categories, respectively. "MISC" refers to utterances with less frequent leading unigrams. The size of each box in a treemap is proportional to its percentage in the data. As we can see, most user requests are "what" and "how" questions. There are also many user turns starting with words like "can" and "tell", most of which are responses to agent clarification questions. The user utterances are fairly long-tailed, as "MISC" accounts for a large portion (about 30%) of both treemaps. Rather than being mostly factoid questions, user requests in our dataset are found to be fairly open-ended.

Analysis of Agent Initiatives
Fine-Grained Categorization To understand diverse agent initiatives at a more fine-grained level, we randomly sample and analyze 100 clarification and 100 relevant answer responses. As shown in the upper half of Table 3, in most cases the agent raises a clarification when they either find a very long answer or too many answers (86%), or notice an ambiguous entity in the user request (13%). In 70% of relevant answer cases (bottom half of Table 3), the agent relaxes some constraint in the user request or provides evidence that no definite answer can be found. In the remaining cases (29%), they simply provide relevant but tangential or partial information. We also observe that in rare cases (1%), the agent points out a mistake (e.g., a false assumption) in the user request. We provide examples of such cases in Appendix A.2 to facilitate future investigations.
Clarification Occurrences We drill down to understand when agents are more likely to raise clarifications in a conversation. We find that clarification questions are more frequently seen at the very beginning of a conversation (example 2 in Table 3) than later on (18.8% vs. 11.5%). Furthermore, if a clarification is raised in the previous agent turn, the probability of the agent asking another clarification is 7.6% (Table 6), while the probability is 12.2% if the previous turn is a non-clarification (example 1 in Table 3).

Response Strategy Selection We also observe cases where agents take different strategies to respond to the user when they find the same evidence. While the selection of a response strategy can be subjective, Table 4 shows that more information pieces are more likely to trigger the agent to refine the user request by raising a clarification.

5 Experiment Setup

Systems
We build two systems for the tasks formulated in § 3. Both systems build on retriever-reader models, inspired by recent advances in open-domain single-turn and conversational question answering (Karpukhin et al., 2020; Izacard and Grave, 2020; Adlakha et al., 2022). The main function of the retriever is to gather a ranked set of top-k candidate evidence passages from the whole of Wikipedia, to facilitate passage selection and response generation by the subsequent reader model. Different from prior work where the reader only predicts short answer strings, we adapt the reader to perform both the evidence passage identification and response generation tasks based on the retrieved candidate set. We first describe the retrieval models used (§ 5.1.1), and then introduce the two reader models that perform the two main tasks based on the retrieval results (§ 5.1.2). We provide implementation and training details in Appendix C.

Retrieval Models
We experiment with two retrievers (BM25 and DPR) and choose the one with the best retrieval performance for the readers to perform the two main tasks. BM25 (Robertson and Zaragoza, 2009) uses sparse bag-of-words representations to rank passages with respect to each query; we use the Pyserini (Yang et al., 2017) implementation. DPR (Karpukhin et al., 2020) is a dense passage retriever, which we pretrain on TopioCQA and/or finetune on INSCIT (see Table 7).

Reader Models We experiment with two reader models: Fusion-in-Decoder (FiD) (Izacard and Grave, 2021) and DIALKI (Wu et al., 2021).
Fusion-in-Decoder (FiD) FiD is a generative reader model that can easily be adapted to generate different formats of task output. It first encodes all retrieved passages with a given query, and then decodes the task output (e.g., an answer string) by attending over all encoded passages. To adapt it to our tasks, we prepend a passage identifier (ID) to each of the top-k retrieved passages and concatenate each passage with the dialogue context to be encoded by FiD. To perform the two tasks, the decoder generates a sequence of evidence passage IDs (passage identification), followed by the final response (response generation). To incorporate previously used evidence passages, we simply add them to the top-k retrieved passages as augmented input to FiD.

DIALKI + FiD In contrast to the first system, which generates outputs for both tasks in an end-to-end fashion, the second system first uses DIALKI (Wu et al., 2021) to select evidence passages and then feeds the identified passages into FiD to generate the agent response. DIALKI is a state-of-the-art conversational knowledge identification model that incorporates dialogue structure in a multi-task learning framework to select an answer string span from a set of passages. DIALKI predicts a passage score for each input passage (e.g., each top-k retrieved passage). To adapt it to our passage identification task, we simply keep the evidence passages (up to 4) with ranking scores higher than γ as the multi-passage prediction; the hyperparameter γ is tuned on the dev set. We apply the same method as in FiD to incorporate previously used evidence passages into DIALKI.
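The FiD adaptation above can be sketched roughly as follows. The serialization template, the `<response>` separator token, and the function names are illustrative assumptions of ours, not the paper's actual implementation:

```python
def build_fid_inputs(dialogue_context, retrieved_passages, prev_evidence):
    """Build one encoder input per candidate passage, each tagged with an ID.

    retrieved_passages: top-k retrieved passage texts; prev_evidence: texts of
    passages used in earlier agent turns, appended to the candidate set.
    """
    candidates = retrieved_passages + prev_evidence
    history = " ".join(dialogue_context)
    return [f"id: {i} passage: {p} dialogue: {history}"
            for i, p in enumerate(candidates)]


def parse_fid_output(decoded):
    """Split a decoded sequence of the assumed form
    '<ids> <response> <text>' into passage IDs and the response string."""
    ids, _, response = decoded.partition("<response>")
    return [int(t) for t in ids.split()], response.strip()
```

Each string from `build_fid_inputs` is encoded independently; the decoder attends over all encodings and emits the evidence IDs followed by the response, which `parse_fid_output` then splits back into the two task outputs.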
Trivial Baselines We report the performance of three simple baselines. 1) Most Frequent: we use the most frequent evidence passage and agent response seen in the training set as the prediction.
2) Random Previous Turn: we randomly select a previous turn in the dialogue context and use its evidence and response as the prediction. For first-turn instances, we use the most frequent passage and response as in "Most Frequent". 3) Last Turn: we use the most recent agent turn in the dialogue context as the prediction, again backing off to "Most Frequent" for first-turn examples.
Human We collect one additional annotation for each agent turn in the test set and evaluate it as human performance. Note that these additional predictions do not go through the same validation step as those used as references.

Evaluation
Below, we describe both automatic metrics and a human evaluation protocol for the passage identification (PI) and response generation (RG) tasks in § 3, as well as a metric to assess the impact of the first-stage passage retrieval on the PI task.
Passage Retrieval We follow previous work (Karpukhin et al., 2020; Adlakha et al., 2022) and report HIT@K scores for retrieval performance, i.e., whether any reference passage appears among the top-K retrieved passages:

HIT@K = 1[R_K ∩ P ≠ ∅],

where R_K denotes the top-K retrieved passages and P = ∪_{i=1}^{m} P_i denotes the union of all passages across the m reference sets.
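Under this definition, HIT@K takes only a few lines to compute (a sketch assuming passages are identified by string IDs; the function name is ours):

```python
def hit_at_k(retrieved_ids, reference_sets, k):
    """HIT@K: 1 if any reference passage appears among the top-k retrieved.

    retrieved_ids: ranked list of retrieved passage IDs;
    reference_sets: the m reference passage sets P_1, ..., P_m.
    """
    gold = set().union(*reference_sets)  # P: union of all reference passages
    return int(any(pid in gold for pid in retrieved_ids[:k]))
```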
Passage Identification As discussed in § 3, we allow multiple evidence passages in INSCIT. We measure model performance by comparing the set of predicted evidence passages P̂ with the set of reference passage sets {P_1, ..., P_m}, where P_i denotes the i-th reference passage set. In INSCIT, m equals 1 or 2. We use the maximum F1 score (PI-F1) between P̂ and each P_i as the final score.
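PI-F1 can be sketched as below: compute a set-level F1 against each reference passage set and keep the maximum (function names are ours, and passages are again assumed to be identified by IDs):

```python
def set_f1(pred, ref):
    """F1 between a predicted and a reference set of passage IDs."""
    if not pred or not ref:
        return 1.0 if pred == ref else 0.0
    tp = len(set(pred) & set(ref))
    precision = tp / len(pred)
    recall = tp / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def pi_f1(pred, reference_sets):
    """PI-F1: maximum F1 over the m reference passage sets (m = 1 or 2)."""
    return max(set_f1(pred, ref) for ref in reference_sets)
```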
Human Evaluation As the two tasks in INSCIT are dependent on each other, decoupled automatic evaluations may not capture aspects like the factual consistency between the predicted passages and the generated response. In addition, handling queries with multiple evidence passages or no direct answer can be more open-ended. Therefore, we design a human evaluation protocol to evaluate model performance on both tasks (see the interface and details in Appendix B). Specifically, we focus on the evaluation of 4 dimensions: 1) evidence passage utility: how many of the predicted evidence passages are used in the generated response; 2) factual consistency between the predicted response and evidence; 3) response coherence with the dialogue context; and 4) response comprehensiveness: how much information, both relevant to the user request and factually consistent with the predicted evidence, is contained in the response. While most prior work on information-seeking dialogues relies only on automatic evaluation scores (Choi et al., 2018; Anantha et al., 2021; Adlakha et al., 2022), a few studies collect human ratings on dimensions like "coherence" and "informativeness" of each response (Gao et al., 2022; Feng et al., 2022). However, they do not require models to predict evidence and thus ignore the factual consistency between the response and the knowledge source.
We provide outputs for both tasks from two (or more) systems to a human judge and ask them to rate each system output on a 4- or 5-point Likert scale for the first 3 dimensions. Each evaluator also ranks the system responses by response comprehensiveness (ties are permitted). We have 3 raters for each agent turn and take the average rating scores or rank places on each dimension for each individual system. Since human evaluation can be time-consuming and costly, we run it on a sampled test subset with 50 conversations (290 examples in total) and encourage future studies to report on the same subset.
6 Experiment Results

Quantitative Results
Passage Retrieval We first report the performance of passage retrieval in Table 7. BM25 significantly underperforms the DPR models. We observe that pretraining (PT) alone already works better than finetuning (FT) alone. This can be explained by the fact that the pretraining data (TopioCQA) is more than 30 times larger than INSCIT; Karpukhin et al. (2020) make a similar observation on retrieval for single-turn question answering tasks. Finally, we achieve the best retrieval results by both pretraining and finetuning DPR (PT+FT), and we use the retrieval output from this setting for the reader models to perform passage identification and response generation throughout the remaining experiments.

Automatic Evaluation (PI & RG)
Table 8 shows the overall automatic evaluation results for our main tasks. The three simple baselines perform very poorly. DIALKI + FiD achieves much better performance than FiD on all evaluation metrics. The reason the pipelined approach beats the end-to-end FiD model may be the small training set size, i.e., not enough data to train an effective model that processes all inputs and predicts various outputs simultaneously. Both systems still substantially underperform humans. We also observe that incorporating previously used evidence passages leads to better performance for DIALKI + FiD, but not necessarily for FiD, indicating that some form of information summarization as performed by DIALKI is necessary to avoid the model being overwhelmed by the large amount of input information.

The reason for imperfect human performance on passage identification is two-fold. As discussed in § 4.3, due to the open-endedness of information-seeking queries in INSCIT and the large search space over Wikipedia, annotators may find different (but equally valid) sets of evidence passages. In addition, the annotations used for reporting human performance do not go through the validation process, which may lead to annotation mistakes or responses less comprehensive than the references.

Although Table 7 shows that TopioCQA can be leveraged to improve retrieval performance, it is not straightforward to leverage it for our two main tasks: it does not come with a passage identification task, and its agent responses are only short answers or no answer. We observe poor zero-shot response generation performance when training FiD on TopioCQA and evaluating on INSCIT, as reported in Appendix D. We therefore leave the question of how to leverage other existing datasets to improve main task performance on INSCIT for future exploration.
Human Evaluation (PI & RG) Tables 9 and 10 present our human evaluation results. We observe that overall, humans substantially outperform DIALKI+FiD, although the difference is much smaller in factual consistency. Compared with FiD, DIALKI+FiD has better evidence utility, factual consistency, and response comprehensiveness, while response coherence is similar for both systems.

Analysis
Seen / Unseen Topic Categories As explained in § 4, there is a topic distribution shift from the training to the test set. We carry out a breakdown evaluation here. Specifically, we divide test conversations into topic categories seen during training (food & drink, hobby, and top-25) and unseen ones (historical events, geography). Based on the automatic scores shown in Figure 4, both models generalize well to conversations from unseen topic categories.

Performance by Reference Response Strategy
Table 11 shows the model performance breakdown by reference response strategy. Specifically, we assign test examples to a response strategy category (i.e., direct answer, clarification, or relevant answer) if all of their reference responses use that strategy. We observe a bigger gap from humans for both FiD and DIALKI+FiD in the "clarification" and "relevant answer" categories, suggesting that existing state-of-the-art models still struggle to distill relevant information and communicate it effectively via generation.
Evidence Utility by # Evidence Passages Although DIALKI+FiD achieves a relatively high evidence utility score in the human evaluation, we show that it is very sensitive to the number of predicted evidence passages. Note that both the automatic systems and humans are limited to 4 passages, consistent with the data annotation. Table 12 illustrates that DIALKI+FiD responses fail to include information from all predicted evidence passages when there are more than two, while the same issue does not hold for humans. Our further analysis shows that DIALKI+FiD rarely constructs clarification questions or summarizes information from different passages as humans do, which indicates a remaining modeling challenge in fusing and presenting information from multiple paragraphs.

Conclusion & Future Work
To conclude, we introduce INSCIT, a new open-domain information-seeking conversational dataset grounded in Wikipedia, with mixed-initiative user-agent interactions. We formulate two tasks with INSCIT as well as a new human evaluation protocol to assess model performance, and present the results of two strong baselines.
Given the small size of our training data, an interesting future direction would be to address the above challenges by applying transfer learning from existing information-seeking conversation or question answering resources.

A Additional Details of INSCIT
A.1 Annotation

Figures 6 and 7 show examples of the user and agent annotation task interfaces. We provide more annotation details below.
Seed Document Selection We use PetScan (https://petscan.wmflabs.org/) to sample Wikipedia pages at various tree depths under each of the four categories: "food and drink", "hobby", "historical events", and "geography". We use different Wikipedia category keywords for each of them to find corresponding articles via PetScan, as shown in Table 13. We filter out pages that overlap between these categories, as well as pages with fewer than 150 outgoing links or fewer than 3000 content words.
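The filtering step above amounts to a simple predicate over candidate pages. This is a sketch only; the `outgoing_links` and `num_words` field names are hypothetical, not from any released tooling.

```python
# Thresholds stated in the paper: pages with fewer than 150 outgoing
# links or fewer than 3000 content words are filtered out.
MIN_LINKS, MIN_WORDS = 150, 3000

def keep_page(page, seen_titles):
    """Keep a candidate seed page only if it has not already been sampled
    under another category and is long/connected enough."""
    return (page["title"] not in seen_titles
            and page["outgoing_links"] >= MIN_LINKS
            and page["num_words"] >= MIN_WORDS)

pages = [
    {"title": "Wine", "outgoing_links": 900, "num_words": 12000},
    {"title": "Stub", "outgoing_links": 20, "num_words": 500},
]
kept = [p for p in pages if keep_page(p, seen_titles=set())]
```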
We sample articles from depths 1-3 for "food and drink" and "hobby". Since there are very few articles at depth 1 for "historical events" and "geography", we only sample articles from depths 2-3. For example, under the food and drink category, general pages like "Wine" and "Rice" usually appear at depth 1. More specific topics like "Hong Kong Cuisine" and "Dairy Farming" are at depth 2. Less well-known or related topics like "Callinectes sapidus" and "Chipotle" can be found at depth 3.

Additional Annotation Instructions After the initial user turns (Figure 2 (d)), each user worker is instructed to read the dialogue context and respond to the last agent turn by providing a clarification, asking a new or follow-up question, or raising concerns (e.g., if there is a misunderstanding of the previous request), given the same seed article. Each agent worker is asked to limit their responses to at most 2 sentences. For the validation task, we define "response comprehensiveness" in the instructions: an agent response is considered more comprehensive if it contains more information relevant to the user request. Additionally, we specifically ask validators to focus on the "information scope". For example, if there exist multiple answers (answer A and B) to the user query, a direct answer containing both A and B and a clarification like "Do you want to know about answer A or B?" should be considered equally comprehensive, while both of them are more comprehensive than a direct answer containing A only.
Qualification Before Annotation We restrict our annotation tasks to workers in English-speaking countries with more than 5000 HITs and at least a 98% acceptance rate. Depending on each worker's annotation speed, most of our workers are paid at an hourly rate of 15-20 USD. To recruit agent annotators, we manually review more than 150 submissions of a qualification task and select 24 highly qualified agent workers. Similarly, we select 35 qualified users and 10 validators before the annotation. We launch a separate user qualification task, and since the user annotation is a much easier task, we only filter out 7 spamming workers during the qualification. Most of our selected agent workers also participate in the user qualification task, and all of them pass. We instruct agent workers to avoid annotating their own user request turns, and we find that fewer than 5% of user-agent turns are annotated by the same worker. The 10 validators are selected from our qualified agents with a different qualification task. We instruct our validators to avoid validating their own annotations unless they forget, and we find that fewer than 5% of agent annotations are self-validated.
Annotation Quality Control Each user worker only receives a bonus and retains their qualification if the following agent decides to continue their conversation, as discussed in Section 4. Similarly, our agents need to pass validation in order to receive a bonus and remain in our annotation pool. Besides judging whether each agent annotation is valid, our validators are also instructed to fix typos in each agent response. At the same time, we allow agent workers to review mistakes flagged by validators, so that they can improve in later annotation tasks. In our pilot study with more than 500 examples, our agent validators reach over 90% agreement on whether an annotation is valid.

Table 14: An example showing that the agent finds evidence that contradicts an assumption made in the user query.
We also assign an additional bonus when agent workers generate responses that are marked as equally or more comprehensive than those of other workers. We collect the training data first, before collecting the development and test sets. This bonus structure was added and continually adjusted during training data collection, based on data monitoring and feedback from workers during the annotation process. As a result, we observe variations in the average agent response length among different data subsets.
Each validator rates the comprehensiveness comparison between 2 annotations on a 1-5 scale. After analysis, we notice that in most cases where the validator gives a score of 2-4, the 2 annotations actually find different sets of evidence passages, or have very similar or only slightly different comprehensiveness. Therefore, we only consider scores of 1 and 5 as a sufficient indicator that one annotation is more comprehensive than the other.

A.2 Data Analysis
User Turns In Table 2, we observe longer user turns than reported in previous datasets (Reddy et al., 2019; Choi et al., 2018; Anantha et al., 2021; Adlakha et al., 2022). This may be partly because, in our instructions, we encourage user workers to ask interesting questions, although we leave the definition of "interestingness" to users. Another reason is that, in some cases, user turns contain "reaction" sentences to the previous agent turn before raising a new request. For example, the user may say "Oh, sorry I should've been more specific." after the agent asks a clarification question.
Other Relevant Answer Cases Besides the two categories of relevant answer agent responses discussed in Table 3, we find about 1% of cases where the agent identifies some issue (e.g., a false assumption) in the user request by providing relevant information as evidence. Table 14 shows such an example. Although we do not find such situations to be common in INSCIT, they would be an interesting phenomenon for future studies.

A.2.1 Natural Conversation Topic Changes
Following Adlakha et al. (2022), we consider the topic(s) of each conversational turn to be the Wikipedia article(s) in which the agent finds evidence passages. Figure 5 shows the flow of topic switches for up to 6 turns in each conversation. For each conversational turn, every pair of (previous topic, current topic) contributes a flow in the diagram. We can see that INSCIT conversations contain frequent topic changes, and Table 15 shows an example illustrating the naturalness of such topic changes.
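The flow counts behind such a diagram can be sketched as below. This is an illustrative reconstruction (the data layout is assumed, not taken from the released analysis scripts): each conversation is reduced to one topic label per agent turn, and transitions between consecutive turns are counted.

```python
from collections import Counter

def topic_flows(conversations, max_turns=6):
    """Count (previous topic, current topic) transitions over the first
    max_turns turns of each conversation."""
    flows = Counter()
    for topics in conversations:  # topics: one topic label per agent turn
        for prev, cur in zip(topics, topics[1:max_turns]):
            flows[(prev, cur)] += 1
    return flows

# Two toy conversations; "T1"/"T2" index topics by first occurrence.
convs = [["T1", "T1", "T2"], ["T1", "T2", "T2"]]
flows = topic_flows(convs)
```

Each `(prev, cur)` count would then be rendered as the width of one flux in a Sankey-style diagram.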

B Additional Details of Human Evaluation
Figures 8 and 9 show the interfaces of our human evaluation. As mentioned in Section 5, response comprehensiveness is defined as: how much information, that is both relevant to and factually consistent with the predicted evidence, is contained in the response. Additionally, we specifically ask evaluators to focus on the "information scope". For example, if there exist multiple answers (answer A and B) to the user query, a direct answer containing both A and B and a clarification like "Do you want to know about answer A or B?" should be considered equally comprehensive, while both of them are more comprehensive than a direct answer containing A only.
C.1 Retriever Models

DPR We use the DPR training scripts released by Adlakha et al. (2022) for finetuning. We set the batch size to 48. The learning rate is set to 1e-6 with pretraining on TopioCQA and 1e-5 without. All other parameters are kept the same. We follow Feng et al. (2021) to create a hard negative for each example by sampling a top passage retrieved by BM25, using one of the gold evidence passages as the query.
We keep all other parameters in the original codebase unchanged. Each training process is run on 4 A100 or A40 GPUs.
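The hard-negative construction can be sketched as follows. This is a simplified stand-in, not the released pipeline: a plain token-overlap score replaces real BM25 scoring, and the corpus layout is hypothetical.

```python
def hard_negative(gold_passage, gold_ids, corpus):
    """Pick the highest-scoring non-gold passage, querying with a gold
    evidence passage. Token overlap stands in for BM25 here."""
    query = set(gold_passage.lower().split())
    best_id, best_score = None, -1.0
    for pid, text in corpus.items():
        if pid in gold_ids:
            continue  # never use a gold passage as its own hard negative
        score = len(query & set(text.lower().split()))
        if score > best_score:
            best_id, best_score = pid, score
    return best_id

corpus = {
    "p1": "oats and gluten related disorders",
    "p2": "history of the wine trade",
    "p3": "gluten free oat cultivars and disorders",
}
neg = hard_negative("oat gluten disorders", {"p1"}, corpus)
```

With real BM25, only the scoring line changes; the gold-exclusion and argmax logic stay the same.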

C.2 Reader Models
FiD We also follow the code released by Adlakha et al. (2022) for training FiD. We take the top 50 retrieved passages and encode each of them concatenated with the dialogue context, using a separator token [SEP]. In order to adapt FiD to our two tasks, we prepend a passage id (pid) in front of each passage, formatted as article title:passage position, where the passage position refers to the order in which the passage appears in the article. The model is trained to decode a sequence of evidence passage ids followed by the final response, in the format pid1 | pid2 . . . answer: response. During inference, we parse the decoded sequence to get all predicted passage ids and the final response. We remove passage ids that are duplicates or do not exist in the corpus.
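The inference-time parsing step can be sketched as below. This is an illustrative reconstruction of the described post-processing, not code from the released repository.

```python
def parse_fid_output(decoded, corpus_pids):
    """Split a decoded 'pid1 | pid2 ... answer: response' sequence into
    predicted passage ids and the response, dropping duplicate ids and
    ids that do not exist in the corpus."""
    ids_part, _, response = decoded.partition("answer:")
    pids, seen = [], set()
    for pid in (p.strip() for p in ids_part.split("|")):
        if pid and pid in corpus_pids and pid not in seen:
            seen.add(pid)
            pids.append(pid)
    return pids, response.strip()

corpus = {"Oat:3", "Wine:1"}  # valid "title:position" passage ids
pids, resp = parse_fid_output("Oat:3 | Oat:3 | Bogus:9 answer: Oats contain ...", corpus)
```

Here the duplicate "Oat:3" and the out-of-corpus "Bogus:9" are both discarded, leaving a single predicted passage id plus the response text.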
For each training example, we randomly select one reference agent turn as the target. We also experiment with always selecting the reference agent turn marked as more comprehensive than the other, which gives similar performance.
The maximum encoder input length is 384 and the decoder sequence length is 100. We train for 7 epochs with 50 warm-up steps and a batch size of 12. All other parameters in the original code are unchanged. Each training process takes about 2-3 hours on 4 A40 GPUs. We select the best model based on the PI-F1 score on the dev set.
DIALKI+FiD We use the public code released by Wu et al. (2021) for training DIALKI. We also take 50 passages and encode each of them concatenated with the dialogue context, using a separator token [SEP]. As DIALKI is designed to select only one positive passage, to create each training example we include one randomly sampled gold reference passage in the input along with 49 negative passages drawn from the top retrieved passages. The best model is then selected based on the single-passage selection accuracy. During inference, we simply feed the top 50 retrieved passages into the model. To perform multiple passage identification, we keep the evidence passages (up to 4) with ranking scores higher than γ. The hyperparameter γ is tuned on the dev set.
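The thresholded selection step is simple enough to sketch directly. This is an illustration of the rule described above (scores above γ, at most 4 passages), with made-up scores, not output from DIALKI itself.

```python
def select_evidence(scored_passages, gamma, max_k=4):
    """Keep up to max_k passages whose ranking score exceeds gamma,
    highest-scoring first."""
    ranked = sorted(scored_passages, key=lambda x: x[1], reverse=True)
    return [pid for pid, score in ranked if score > gamma][:max_k]

scores = [("p1", 0.9), ("p2", 0.2), ("p3", 0.75),
          ("p4", 0.6), ("p5", 0.55), ("p6", 0.5)]
picked = select_evidence(scores, gamma=0.4)
```

With these scores, five passages clear γ = 0.4, so the cap of 4 is what finally drops the lowest-scoring one.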
To train FiD in this two-step system, we create training examples by taking a dialogue context and each set of its reference evidence passages as the input. As a result, the maximum number of input passages is 4. FiD is trained to decode the agent response only. The best model is selected by the RG-F1 score on the dev set. During inference, we feed FiD with the passages predicted by DIALKI.
The maximum encoder sequence length is 384 for both DIALKI and FiD. We keep the original DIALKI parameters unchanged. Parameters in FiD are the same as for the first reader model, except that the number of input passages is 4 in this setting. Each training process is run on 4 A40 GPUs for DIALKI and 2 RTX6000 GPUs for FiD.
For each experiment, we observe similar performance and training curves across 2-3 runs and report numbers from a single run.

Table 16: Pearson and Spearman correlation coefficients between each human evaluation dimension and each automatic metric. Both "evidence passage utility" and "factual consistency" human ratings have poor correlations with all three automatic metrics. Response F1 scores have the best correlation with response "coherence" and "comprehensiveness". Except for the coefficients calculated for "factual consistency", all numbers have a p-value less than 0.05.

D Additional Details of Experiments
Metric Correlations We observe similar trends between the automatic and human evaluation results in terms of system comparison. However, as shown in Table 16, the Pearson and Spearman correlation coefficients between human-rated response coherence and comprehensiveness and the automatic scores are only weak to moderate. Among the three automatic metrics, response F1 scores have the highest correlation with the two human evaluation dimensions. Evidence utility and factual consistency have poor correlations with all three automatic scores. This indicates that existing grounded generation metrics fall short of truly reflecting human judgments.
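For reference, the two correlation coefficients can be computed as below. This is a dependency-free sketch (in practice one would use scipy.stats); the tie handling needed for real Likert ratings is omitted, and the sample data are made up.

```python
def pearson(xs, ys):
    """Pearson correlation: covariance over the product of std deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman = Pearson computed on ranks (no tie handling here)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(xs), ranks(ys))

human = [1.0, 2.0, 3.0, 4.0]    # e.g., human coherence ratings
metric = [10.0, 30.0, 20.0, 40.0]  # e.g., an automatic score
```

The low observed correlations simply mean that, over (human rating, metric score) pairs like these, both coefficients stay far from 1.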
Zero-Shot Performance As a DPR retriever followed by FiD is the state-of-the-art model on TopioCQA, we use the publicly released checkpoint (https://github.com/McGill-NLP/topiocqa) trained on TopioCQA to run inference on INSCIT. Since TopioCQA only considers the task of response generation, we skip the evaluation for passage identification.

Fixed Passage Setting We experiment with a setting where the pool of evidence passage candidates is much smaller than in the open-domain setting and is fixed throughout each conversation. Specifically, we gather the passages from all Wikipedia article sections that include at least one evidence passage used in a conversation as the candidate passages for that conversation. Table 18 shows that both systems achieve much higher scores than in the open-domain setting. Interestingly, in contrast to the open-domain setting, FiD slightly outperforms DIALKI+FiD.

Figure 9: Human evaluation interface 2: rating dimensions. Note that we also have a question for evaluating "fluency", which we do not include in the human evaluation results. This question is mostly created to help human raters decouple "coherence" with the dialogue context from language "fluency". Both systems in our experiments turn out to have similar fluency to humans.

Figure 1 :
Figure 1: The user often asks under-specified questions that require clarification (left), or asks questions with no direct answer but where providing relevant information can possibly fulfill their information needs (right).

Figure 2 :
Figure 2: Each conversation is annotated in a chain of user → agent → validator tasks. A single worker is dedicated to each user / validator task, but two workers work in parallel to complete the agent turn annotation (see discussion in §4.1). White and yellow boxes are inputs and outputs for each annotation task, respectively.

Figure 3 :
Figure 3: Left: seed document topic category breakdown (D → category depth). Middle and right: treemaps of the top 7 (and other: MISC) first tokens in user turns from conversations under the "food & drink" and "historical events" topic categories. For each figure, the size of each colored area is proportional to its percentage in the data.

Figure 4 :
Figure 4: PI-F1 and BLEU scores on test conversations with seen and unseen topic categories. RG-F1 has the same trend as BLEU.

Figure 5 :
Figure 5: Flow diagram of topic changes in up to 6 turns of INSCIT conversations. "T" stands for "Topic", and the topic indices indicate the order of their first occurrences in the conversation.

Table 1 :
Comparison of INSCIT with existing datasets of information-seeking conversations. IR, CLAR, REL, and H-H stand for Retrieval Needed, Clarification, No Direct but Relevant Answer, and Human-Human, respectively. A partial marker indicates that the property only applies to part of the dataset.
analysis of agent-side initiatives (§4.3.2). We include a discussion of natural topic changes in INSCIT conversations in Appendix A.2.
In the Battle of New Orleans, . . . the Americans had 13 dead, totaling 304 dead.
<User>: Were there any long-term consequences that came as a result of the War of 1812?
<Agent>: There were many! Would you like to know what they were for Bermuda, . . ., or Great Britain?

<User>: Washington University is classified as what for its high research activities?
<Agent>: Do you want to know about Washington University in St. Louis or in Baltimore, Maryland?

<User>: Was the Matrix franchise adapted from any other work?
<Agent>: While not explicitly adapted from another work, the Matrix did draw heavily on Jean . . .

. . .
<User>: Who authored the Torah?
<Agent>: It was originally thought that a majority of the Torah was written by . . . However, it's now thought that . . . though the number of authors is up for debate.

<User>: What countries have an ecological footprint under 5 hectares per person?
<Agent>: The world-average ecological footprint in 2013 was 2.8 global hectares per person . . . But I don't have a list of countries with an ecological footprint under 5 global hectares per person.

Table 3 :
Examples of clarification and no-answer-but-relevant-information agent responses. Prepending dots → omitted dialogue history. Factual information from evidence passages is italicized in agent responses.

Table 4 :
Agent response strategy statistics.
DIRECT, CLAR, REL, and NEI indicate direct answer, clarification, no direct but relevant answer, and not enough information, respectively.
<User>: What are some of the health consequences of oats?
<Agent> (clarification): Oat related health concerns include oat allergies, and gluten-related disorders. Which type of health issue would you like to learn more of? (evidence from Wikipedia page: Oat Sensitivity)
<Agent> (direct answer): Oat toxicity in people with gluten-related disorders depends on the oat cultivar consumed because the immunoreactivities of toxic prolamins are different among oat varieties. Furthermore, oats are frequently cross-contaminated with the other gluten-containing cereals . . . (evidence from Wikipedia page: Oat)

<Agent> (CLAR): Would you like to know PepsiCo's major competitor in the beverage market or the snack food market?
<Agent> (DIRECT): PepsiCo has two main competitors in different market sectors. The Coca-Cola Company is its largest competitor in the beverage market, and Kraft Foods is its largest competitor in the snack foods market.

Dialogue Context: . . . <User>: Were the directors involved in any controversy?
<Agent> (REL): I couldn't find anything about them themselves in any controversy but their film, V for Vendetta, had a controversial storyline and themes.
<Agent> (DIRECT): The Wachowskis' V for Vendetta was considered to have controversial storyline and themes, but have been both criticized and praised by sociopolitical groups.
While prior work mostly focuses on agent responses with either a direct answer or no answer to the current user request (Choi et al., 2018; Reddy et al., 2019; Adlakha et al., 2022), Table 4 shows the diversity of agent response strategies in INSCIT. When no direct or precise answer can be found, agents in INSCIT can respond to the user with a relevant answer (defined in § 4.1). If no direct or relevant answer is found, the agent can then respond with no information. As we can see from Table 4, the average response length and number of evidence passages differ dramatically across response strategies. Compared with direct or relevant answer cases, clarification responses tend to be slightly shorter and are more likely to occur when more evidence passages are present, which potentially requires some information summarization. We also observe that 90% of direct answer responses in INSCIT are longer than 15 words, while most previous datasets have agent turns with an average length shorter than that.

Different Evidence; Different Response Strategies

Table 5 :
Examples of two agent reference responses with different response strategies.DIRECT, CLAR and REL indicate direct answer, clarification, and no direct but relevant information, respectively.
. . . <User>: What kinds of regional varieties are there?
<Agent>: Would you like to know about East Asia, Southeast Asia, South Asia, or Europe?
<User>: Tell me about East Asia.
<Agent>: Sorry, but each country is detailed as well. Do you want to know more about congee in China, Japan, Korea or Taiwan?

Table 6 :
An example of consecutive clarifications.
Another interesting phenomenon we observe is that among examples where we obtain 2 agent references which the validator marks as equally comprehensive, 23% of them take different response strategies given the same dialogue context. The likely reason is that response strategy selection depends on what information (i.e., evidence) each agent finds. Due to the open-endedness of information-seeking queries in INSCIT and the large search space over Wikipedia, it is common for agents to land on different sets of evidence passages. The first example in Table 5 shows how different evidence triggers different agent response strategies. Additionally, even if two agents find the same evidence set, deciding whether it indicates an under-specified user request, a direct answer, or only a relevant answer can be subjective, and we leave it to workers' own judgment. The second and third examples in Table 5 illustrate this subjectivity.

Table 7 :
Passage retrieval performance. PT and FT refer to pretraining on TopioCQA and finetuning on INSCIT.

Table 8 :
Automatic evaluation for main tasks: passage identification (PI) and response generation (RG).

Table 9 :
Human evaluation scores on dimensions rated with Likert scales.

Table 10 :
Human evaluation on system comparison, where win/lose refers to DIALKI+FiD.

Table 11 :
Automatic metric scores by reference response strategy. Percentage values indicate the performance gap with humans.

Table 12 :
Human evaluation scores on evidence utility by the number of predicted evidence passages.

Table 13 :
Keywords used for collecting documents under each topic category from PetScan.

Table 15 :
Natural topic changes in a conversation.

Table 17 :
Results of DPR Retriever + FiD, trained with TopioCQA, on TopioCQA and INSCIT (zero-shot). Response exact match (EM) is a metric used in TopioCQA. Table 17 shows that the model generalizes poorly from TopioCQA to INSCIT, indicating that agent responses in INSCIT have a very different distribution from those in TopioCQA.

Table 18 :
Fixed passage setting results.