On the Robustness of Dialogue History Representation in Conversational Question Answering: A Comprehensive Study and a New Prompt-based Method

Most work on modeling the conversation history in Conversational Question Answering (CQA) reports a single main result on a common CQA benchmark. While existing models show impressive results on CQA leaderboards, it remains unclear whether they are robust to shifts in setting (sometimes to more realistic ones), training data size (e.g., from large to small sets) and domain. In this work, we design and conduct the first large-scale robustness study of history modeling approaches for CQA. We find that high benchmark scores do not necessarily translate to strong robustness, and that various methods can perform extremely differently under different settings. Equipped with the insights from our study, we design a novel prompt-based history modeling approach and demonstrate its strong robustness across various settings. Our approach is inspired by existing methods that highlight historic answers in the passage. However, instead of highlighting by modifying the passage token embeddings, we add textual prompts directly in the passage text. Our approach is simple, easy to plug into practically any model, and highly effective, thus we recommend it as a starting point for future model developers. We also hope that our study and insights will raise awareness of the importance of robustness-focused evaluation, in addition to obtaining high leaderboard scores, leading to better CQA systems.


Introduction
Conversational Question Answering (CQA) involves a dialogue between a user who asks questions and an agent that answers them based on a given document. CQA is an extension of the traditional single-turn QA task (Rajpurkar et al., 2016), with the major difference being the presence of the conversation history, which requires effective history modeling (Gupta et al., 2020). Previous work demonstrated that the straightforward approach of concatenating the conversation turns to the input is lacking (Qu et al., 2019a), leading to various proposals of architecture components that explicitly model the conversation history (Choi et al., 2018; Huang et al., 2019; Yeh and Chen, 2019; Qu et al., 2019a,b; Chen et al., 2020; Kim et al., 2021). However, there is no single agreed-upon setting for evaluating the effectiveness of such methods, with the majority of prior work reporting a single main result on a CQA benchmark, such as CoQA (Reddy et al., 2019) or QuAC (Choi et al., 2018).
While recent CQA models show impressive results on these benchmarks, such a single-score evaluation scheme overlooks aspects that can be essential in real-world use-cases. First, QuAC and CoQA contain large annotated training sets, which makes it unclear whether existing methods can remain effective in small-data settings, where the annotation budget is limited. In addition, the evaluation is done in-domain, ignoring the model's robustness to domain shifts, with target domains that may even be unknown at model training time. Furthermore, the models are trained and evaluated using a "clean" conversation history between two humans, while in reality the history can be "noisy" and less fluent, due to incorrect answers by the model (Li et al., 2022). Finally, these benchmarks mix the impact of advances in pre-trained language models (LMs) with conversation history modeling effectiveness.
In this work, we investigate the robustness of history modeling approaches in CQA. We ask whether high performance on existing benchmarks also indicates strong robustness. To address this question, we carry out the first large-scale robustness study using 6 common modeling approaches. We design 5 robustness-focused evaluation settings, which we curate based on 4 existing CQA datasets. Our settings are designed to evaluate efficiency in low-data scenarios, the ability to scale in a high-resource setting, as well as robustness to domain shift and to noisy conversation history. We then perform a comprehensive robustness study, where we evaluate the considered methods in our settings.
We focus exclusively on history modeling, as it is considered the most significant aspect of CQA (Gupta et al., 2020), differentiating it from the classic single-turn QA task. To better reflect the contribution of the history modeling component, we adapt the existing evaluation metric. First, to avoid differences which stem from the use of different pre-trained LMs, we fix the underlying LM for all the evaluated methods, re-implementing all of them. Second, instead of focusing on final scores on a benchmark, we focus on each model's improvement (∆%) compared to a baseline QA model that has no access to the conversation history.
Our results show that history modeling methods perform very differently in different settings, and that approaches that achieve high benchmark scores are not necessarily robust under low-data and domain-shift settings. Furthermore, we notice that approaches that highlight historic answers within the document by modifying the document embeddings achieve the top benchmark scores, but their performance is surprisingly lacking in low-data and domain-shift settings. We hypothesize that history highlighting yields high-quality representation, but since the existing highlighting methods add dedicated embedding parameters, specifically designed to highlight the document's tokens, they are prone to over-fitting.
These findings motivate us to search for an alternative history modeling approach with improved robustness across different settings. Following the latest trends w.r.t. prompting in NLP (Liu et al., 2021), we design MarCQAp, a novel prompt-based approach for history modeling, which adds textual prompts within the grounding document in order to highlight previous answers from the conversation history. While our approach is inspired by the embedding-based highlighting methods, it is not only simpler, but it also shows superior robustness compared to other evaluated approaches. As MarCQAp is prompt-based, it can be easily combined with any architecture, allowing any model with a QA architecture to be fine-tuned for the CQA task with minimal effort. Thus, we hope that it will be adopted by the community as a useful starting point, owing to its simplicity, as well as its high effectiveness and robustness. We also hope that our study and insights will encourage more robustness-focused evaluations, in addition to obtaining high leaderboard scores, leading to better CQA systems.

CQA Task Definition and Notations
Given a text passage $P$, the current question $q_k$ and a conversation history $H_k$ in the form of a sequence of previous questions and answers, $H_k = (q_1, a_1, \ldots, q_{k-1}, a_{k-1})$, a CQA model predicts the answer $a_k$ based on $P$ as a knowledge source. The answers can be either spans within the passage $P$ (extractive) or free-form text (abstractive).
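To make the notation concrete, the sketch below shows one way a single CQA example could be represented in code; the class and field names are ours, purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class CQAExample:
    """One CQA turn: answer q_k over passage P, given the history H_k."""
    passage: str                 # P
    question: str                # q_k
    history: List[Tuple[str, str]] = field(default_factory=list)  # H_k = [(q_1, a_1), ..., (q_{k-1}, a_{k-1})]

# e.g., the second turn of a conversation, with one (question, answer) pair already in the history
example = CQAExample(
    passage="Ronald Ross was noted to be eccentric and egocentric ...",
    question="why was that?",
    history=[("What was Ronald Ross known for?",
              "Ronald Ross was noted to be eccentric and egocentric")],
)
```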

CQA Datasets
Full dataset statistics are presented in Table 1.
QuAC (Choi et al., 2018) and CoQA (Reddy et al., 2019) are the two leading CQA datasets, with different properties. In QuAC, the questions are more exploratory and open-ended, with longer answers that are more likely to be followed up. This makes QuAC more challenging and realistic.

We follow the common practice in recent works (Qu et al., 2019b,a; Kim et al., 2021; Li et al., 2022), focusing on QuAC as our main dataset, using its training set for training and its validation set for in-domain evaluation (the test set is hidden, reserved for a leaderboard challenge). We use CoQA for additional pre-training or for domain-shift evaluation.

DoQA (Campos et al., 2020) is another CQA dataset with dialogues from the Stack Exchange online forum. Due to its relatively small size, it is typically used for testing transfer and zero-shot learning. We use it for domain-shift evaluation.

QuAC Noisy-History (QuAC-NH) is based on a dataset of human-machine conversations collected by Li et al. (2022), using 100 passages from the QuAC validation set. While Li et al. used it for human evaluation, we use it for automatic evaluation, leveraging the fact that the answers are labeled for correctness, which allows us to use the correct answers as labels.
In existing CQA datasets, each conversation $(q_1, a_1, \ldots, q_m, a_m)$ and the corresponding passage $P$ are used to create $m$ examples $E_k = (P, H_k, q_k)$, where $H_k = (q_1, a_1, \ldots, q_{k-1}, a_{k-1})$. $a_k$ is then used as a label for $E_k$. Since QuAC-NH contains incorrect answers, if $a_k$ is incorrect we discard $E_k$ to avoid corrupting the evaluation set with incorrectly labeled examples. We also filtered out invalid questions (Li et al., 2022) and answers that did not appear in $P$.
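The example-construction and filtering logic described above can be sketched as follows; the field names and correctness flags are illustrative assumptions, not the released data format.

```python
def build_examples(passage, turns):
    """Turn one conversation into per-turn CQA examples (sketch).

    `turns` is assumed to be a list of dicts with "question", "answer" and,
    for QuAC-NH, "is_correct"/"is_valid" flags; these names are illustrative.
    """
    examples, history = [], []
    for turn in turns:
        q_k, a_k = turn["question"], turn["answer"]
        keep = (
            turn.get("is_correct", True)   # discard E_k if a_k is an incorrect answer
            and turn.get("is_valid", True) # discard invalid questions
            and a_k in passage             # discard answers that do not appear in P
        )
        if keep:
            examples.append({
                "passage": passage,
                "history": list(history),  # H_k = all previous (q, a) pairs, noisy or not
                "question": q_k,
                "answer": a_k,             # label for E_k
            })
        history.append((q_k, a_k))         # the turn stays in the history seen by later turns
    return examples
```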

CQA Related Work
Conversation History Modeling is the major challenge in CQA (Gupta et al., 2020). Early work used recurrent neural networks (RNNs) and variants of attention mechanisms (Reddy et al., 2019; Choi et al., 2018; Zhu et al., 2018). Another trend was to use flow-based approaches, which generate a latent representation for the tokens in $H_k$, using tokens from $P$ (Huang et al., 2019; Yeh and Chen, 2019; Chen et al., 2020). Modern approaches, which are the focus of our work, leverage Transformer-based (Vaswani et al., 2017) pre-trained language models.
The simplest approach to model the history with pre-trained LMs is to concatenate $H_k$ with $q_k$ and $P$ (Choi et al., 2018; Zhao et al., 2021). Alternative approaches rewrite $q_k$ based on $H_k$ and use the rewritten questions instead of $H_k$ and $q_k$ (Vakulenko et al., 2021), or as an additional training signal (Kim et al., 2021). Another fundamental approach is to highlight historic answers within $P$ by modifying the passage's token embeddings (Qu et al., 2019a,b). Qu et al. also introduced a component that performs dynamic history selection after each turn is encoded. Yet, in our corresponding baseline we utilize only the historic answer highlighting mechanism, owing to its simplicity and high effectiveness. A contemporaneous work proposed a global history attention component, designed to capture long-distance dependencies between conversation turns (Qian et al., 2022).

History Modeling Study
In this work, we examine the effect of a model's history representation on its robustness. To this end, we evaluate different approaches under several settings that diverge from the standard supervised benchmark (§3.1). This allows us to examine whether the performance of some methods deteriorates more quickly than others in different scenarios. To better isolate the gains from history modeling, we measure performance compared to a baseline QA model which has no access to $H_k$ (§3.2), and re-implement all the considered methods using the same underlying pre-trained language model (LM) for text representation (§3.3).

Robustness Study Settings
We next describe each comparative setting in our study and the rationale behind it, as summarized in Table 2. Table 1 depicts the utilized datasets.
Standard. Defined by Choi et al. (2018), this setting is followed by most works. We use a medium-sized pre-trained LM for each method, commonly known as its base version, then fine-tune and evaluate the models on QuAC.
High-Resource. This setting examines the extent to which methods can improve their performance when given more resources.

Noisy-History. This setting examines robustness to noisy conversation history, where the answers are sometimes incorrect and the conversation flow is less fluent. To this end, we evaluate the models trained under the standard setting on the QuAC-NH dataset, consisting of conversations between humans and other CQA models (§2.2). We note that a full human-machine evaluation requires a human in the loop. We choose to evaluate against other models' predictions as a middle ground. This allows us to test the models' behaviour on noisy conversations with incorrect answers and less fluent flow, but without a human in the loop.

Evaluation Metric
The standard CQA evaluation metric is the average word-level F1 score (Rajpurkar et al., 2016; Choi et al., 2018; Reddy et al., 2019; Campos et al., 2020); we follow the calculation presented in Choi et al. (2018). Since we focus on the impact of history modeling, we propose to consider each model's improvement in F1 (∆%) compared to a baseline QA model that has no access to the dialogue history.
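The sketch below spells out this metric as we use it in the tables: a SQuAD-style word-level F1 (shown here without the answer normalization of the official scripts) and ∆%, read as the relative F1 improvement over the NO HISTORY baseline. The numbers in the final comment are illustrative.

```python
from collections import Counter

def word_f1(prediction: str, gold: str) -> float:
    """Word-level F1 between a predicted and a gold answer (simplified: whitespace tokens, no normalization)."""
    pred_tokens, gold_tokens = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_tokens), overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def delta_percent(model_f1: float, no_history_f1: float) -> float:
    """∆%: relative improvement of a history-aware model over the NO HISTORY baseline."""
    return 100.0 * (model_f1 - no_history_f1) / no_history_f1

# Illustration: a model at 69.8 F1 against a hypothetical no-history baseline at 60.4 F1
# gives delta_percent(69.8, 60.4) ≈ 15.6.
```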

Pre-trained LM
To control for differences which stem from the use of different pre-trained LMs, we re-implement all the considered methods using the Longformer (Beltagy et al., 2020), a sparse-attention Transformer designed to process long input sequences. It is therefore a good fit for handling the conversation history and the source passage as a combined (long) input. Prior work usually utilized dense-attention Transformers, whose input length limitation forced them to truncate $H_k$ and split $P$ into chunks, processing them separately and combining the results (Choi et al., 2018; Qu et al., 2019a,b; Kim et al., 2021; Zhao et al., 2021). This introduces additional complexity and diversity in the implementation, while with the Longformer we can keep the implementation simple, as this model can attend to the entire history and passage. We would also like to highlight RoR (Zhao et al., 2021), which enhances a dense-attention Transformer to better handle long sequences. Notably, the state-of-the-art result on QuAC was reported using ELECTRA+RoR with simple history concatenation (see CONCAT in §3.4). While this suggests that ELECTRA+RoR can outperform the Longformer, since our primary focus is on analyzing the robustness of different history modeling techniques rather than on long sequence modeling, we opt for a general-purpose, commonly-used LM for long sequences, which exhibits competitive performance.

Evaluated Methods
In our study we choose to focus on modern history modeling approaches that leverage pre-trained LMs. These models have demonstrated significant progress in recent years (§2.3).
NO HISTORY A classic single-turn QA model without access to $H_k$. We trained a Longformer for QA (Beltagy et al., 2020), using $q_k$ and $P$ as a single packed input sequence (ignoring $H_k$). The model then extracts the answer span by predicting its start and end positions within $P$.

In contrast to the rest of the evaluated methods, we do not consider this method as a baseline for history modeling, but rather as a reference for calculating our ∆% metric. As discussed in §3.2, we evaluate all history modeling methods for their ability to improve over this model.
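A minimal sketch of such a NO HISTORY model using the HuggingFace Longformer QA classes. The checkpoint name is the public base Longformer (our actual base LM was further pre-trained on SQuADv2, see Implementation Details), and the snippet shows the forward pass only, not fine-tuning.

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Public base checkpoint; the QA head is randomly initialized here and would be fine-tuned on QuAC.
name = "allenai/longformer-base-4096"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name)

question = "why was that?"                                             # q_k (no history is given)
passage = "Ronald Ross was noted to be eccentric and egocentric ..."   # P

inputs = tokenizer(question, passage, return_tensors="pt", truncation=True)

# Global attention on the question tokens (up to the first separator), following Beltagy et al. (2020).
global_attention_mask = torch.zeros_like(inputs["input_ids"])
first_sep = (inputs["input_ids"][0] == tokenizer.sep_token_id).nonzero()[0].item()
global_attention_mask[0, : first_sep + 1] = 1

with torch.no_grad():
    outputs = model(**inputs, global_attention_mask=global_attention_mask)

# Extract the answer span from the start/end position logits.
start = outputs.start_logits.argmax(dim=-1).item()
end = outputs.end_logits.argmax(dim=-1).item()
answer = tokenizer.decode(inputs["input_ids"][0][start : end + 1])
```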
CONCAT Concatenating $H_k$ to the input (i.e., to $q_k$ and $P$), which is (arguably) the most straightforward way to model the history (Choi et al., 2018; Qu et al., 2019a; Zhao et al., 2021). Other than the change to the input, this model's architecture and training are identical to NO HISTORY.

REWRITE This approach was proposed in Vakulenko et al. (2021). It consists of a pipeline of two models: question rewriting (QR) and question answering (QA). An external QR model first generates a rewritten question $\hat{q}_k$, based on $q_k$ and $H_k$. $\hat{q}_k$ and $P$ are then used as input to a standard QA model, identical to NO HISTORY, but trained with the rewritten questions. For the external QR model we follow Lin et al. (2020), Vakulenko et al. (2021), and Kim et al. (2021) and fine-tune T5-base (Raffel et al., 2020) on the CANARD dataset (Elgohary et al., 2019). We use the same QR model across all the settings in our study (§3.1), meaning that in the low-resource setting we limit only the CQA data, which is used to train the QA model.

REWRITE C Hypothesizing that there is useful information in $H_k$ on top of the rewritten question $\hat{q}_k$, we combine REWRITE and CONCAT, obtaining a model which is similar to CONCAT, except that it replaces $q_k$ with $\hat{q}_k$.

ExCorD LF Our implementation of the ExCorD approach, proposed in Kim et al. (2021). Instead of rewriting the original question $q_k$ at inference time (as in REWRITE), ExCorD uses the rewritten question only at training time, as a regularization signal when encoding the original question.
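To make the input formats concrete, here is a rough sketch of how the CONCAT query could be assembled (REWRITE would instead feed the rewritten question $\hat{q}_k$ with no history); the separator choice is an implementation detail that the description above does not specify.

```python
def build_concat_query(question: str, history, sep: str = " </s> ") -> str:
    """CONCAT (sketch): prepend the dialogue history H_k to the current question q_k.

    `history` is a list of (question, answer) pairs; the separator string is an
    illustrative choice, not one mandated by the paper.
    """
    turns = [f"{q} {a}" for q, a in history]
    return sep.join(turns + [question])

# The resulting query replaces q_k in the NO HISTORY input, e.g.:
#   tokenizer(build_concat_query(q_k, H_k), passage, return_tensors="pt")
```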
HAE LF Our implementation of the HAE approach proposed in Qu et al. (2019a), which highlights the conversation history within $P$. Instead of concatenating $H_k$ to the input, HAE highlights the historic answers $\{a_i\}_{i=1}^{k-1}$ within $P$ by modifying the passage token embeddings. HAE adds an additional dedicated embedding layer with two learned embedding vectors, denoting whether or not a token from $P$ appears in any historic answer.

PosHAE LF Our implementation of the PosHAE approach proposed in Qu et al. (2019b), which extends HAE by adding positional information. The embedding matrix is extended to contain a vector per conversation turn, each vector representing the turn in which the corresponding token appeared.
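The embedding-based highlighting of HAE/PosHAE can be pictured with the schematic module below: an extra embedding table whose vectors are added to the passage token embeddings. Sizes and the exact integration into the encoder are simplified relative to the original implementations.

```python
import torch
import torch.nn as nn

class AnswerHighlightEmbedding(nn.Module):
    """Schematic HAE/PosHAE-style highlighting (not the original code).

    HAE uses two ids (a passage token is / is not inside some historic answer);
    PosHAE extends the table to one id per conversation turn, so id t means
    "this token appears in the answer of turn t" and id 0 means "in no answer".
    """
    def __init__(self, hidden_size: int, max_turns: int = 12, positional: bool = True):
        super().__init__()
        num_ids = (max_turns + 1) if positional else 2
        self.highlight = nn.Embedding(num_ids, hidden_size)

    def forward(self, token_embeddings: torch.Tensor, highlight_ids: torch.LongTensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, hidden); highlight_ids: (batch, seq_len)
        return token_embeddings + self.highlight(highlight_ids)
```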

Implementation Details
We fine-tune all models on QuAC for 10 epochs, employ an accumulated batch size of 640, a weight decay of 0.01, and a learning rate of $3 \cdot 10^{-5}$. In the high-resource setup, we also pre-train on CoQA for 5 epochs. We use a maximum output length of 64 tokens. Following Beltagy et al. (2020), we set Longformer's global attention to all the tokens of $q_k$. We use the cross-entropy loss and the AdamW optimizer (Kingma and Ba, 2015; Loshchilov and Hutter, 2019). Our implementation makes use of the HuggingFace Transformers (Wolf et al., 2020) and PyTorch-Lightning libraries.

For the base LM (used in all settings except high-resource) we found that a Longformer that was further pre-trained on SQuADv2 (Rajpurkar et al., 2018) achieved consistently better performance than the base Longformer. Thus, we adopted it as our base LM. For the large LM (used in the high-resource setting) we used Longformer-large. In §5, we introduce a novel method (MarCQAp) and perform statistical significance tests (Dror et al., 2018, 2020), following Qu et al. (2019b).

In our re-implementation of the evaluated methods, we carefully followed their descriptions and implementation details as published by the authors in their corresponding papers and codebases. A key difference in our implementation is the use of a long-sequence Transformer, which removes the need to truncate $H_k$ and split $P$ into chunks (§3.3). This simplifies our implementation and avoids differences between methods. Table 3 compares our results with those reported in previous works. In almost all cases we achieved a higher score (probably since Longformer outperforms BERT), with the exception of ExCorD, where we achieved a comparable score (probably since Longformer is actually initialized using RoBERTa's weights (Beltagy et al., 2020)).
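For reference, the hyperparameters above map onto a configuration roughly like the following; the names follow common PyTorch conventions and the snippet is not the paper's training code.

```python
import torch

config = {
    "quac_epochs": 10,              # fine-tuning on QuAC
    "coqa_pretrain_epochs": 5,      # high-resource setting only
    "accumulated_batch_size": 640,  # per-device batch size x gradient accumulation steps
    "learning_rate": 3e-5,
    "weight_decay": 0.01,
    "max_answer_length": 64,        # tokens
}

def make_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # AdamW with the learning rate and weight decay listed above.
    return torch.optim.AdamW(model.parameters(),
                             lr=config["learning_rate"],
                             weight_decay=config["weight_decay"])
```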

Results and Analysis
We next discuss the takeaways from our study, where we evaluated the considered methods across the proposed settings. Table 4 presents the results of the standard, high-resource and low-resource settings. Table 5 further presents the domain-shift results. Finally, Table 6 depicts the results of the noisy-history setting. Each method is compared to NO HISTORY by calculating the ∆% (§3.2). The tables also present the results of our method, termed MarCQAp, which is discussed in §5.
We further analyze the effect of the conversation history length in Figure 1, evaluating models from the standard setting with different limits on the history length. For instance, when the limit is 2, we expose the model to up to the 2 most recent turns, by truncating $H_k$.

Key Findings A key goal of our study is to examine the robustness of history modeling approaches to setting shifts. This research reveals limitations of the single-score benchmark-based evaluation adopted in previous works (§4.1), as such scores are shown to be only weakly correlated with low-resource and domain-shift robustness. Furthermore, keeping in mind that history modeling is a key aspect of CQA, our study also demonstrates the importance of isolating the contribution of the history modeling method from other model components (§4.2). Finally, we discover that while existing history highlighting approaches yield high-quality input representations, their robustness is surprisingly poor. We further analyze the history highlighting results and provide possible explanations for this phenomenon (§4.3). This finding is the key motivation for our proposed method (§5).

High CQA Benchmark Scores do not Indicate Good Robustness
First, we observe some expected general trends: All methods improve on top of NO HISTORY, as demonstrated by the positive ∆% in the standard setting, showing that all the methods can leverage information from $H_k$. All methods scale with more training data and a larger model (high-resource), and their performance drops significantly when the training data size is reduced (low-resource) or when they are presented with noisy history. A performance drop is also observed when evaluating under domain shift, as expected in the zero-shot setting.

However, not all methods scale equally well and some deteriorate faster than others. This phenomenon is illustrated in Table 7, where the methods are ranked by their scores in each setting. We observe high instability between settings. For instance, PosHAE LF is top performing in 3 settings but is second worst in 2 others. REWRITE is second best in low-resource, but among the last ones in other settings. So is the case with CONCAT: second best in domain-shift but among the worst ones in others. In addition, while all the methods improve when they are exposed to longer histories (Figure 1), some saturate earlier than others.
We conclude that the winner does not take it all: no single method performs best across all settings.

The Importance of Isolating the History Modeling Contribution

In the high-resource setting, NO HISTORY reaches 65.6 F1, higher than many CQA results reported in previous works (Choi et al., 2018; Qu et al., 2019a,b; Huang et al., 2019). Since it is clearly ignoring the history, this shows that significant improvements can stem from simply using a better LM. Thus, comparing between history modeling methods that use different LMs can be misleading. This is further illustrated with HAE LF's and PosHAE LF's results. The score that Kim et al. reported for ExCorD is higher than Qu et al. reported for HAE and PosHAE. While both authors used a setting equivalent to our standard setting, Kim et al. used RoBERTa while Qu et al. used BERT as their underlying LM. It is therefore unclear whether ExCorD's higher score stems from better history representation or from choosing to use RoBERTa.

In our study, HAE LF and PosHAE LF actually outperform ExCorD LF in the standard setting. This suggests that these methods can perform better than reported, and demonstrates the importance of controlling for the choice of LM when comparing between history modeling methods.
As can be seen in Figure 1, CONCAT saturates at 6 turns, which is interesting since Qu et al. (2019a) reported saturation at 1 turn in a BERT-based equivalent. Furthermore, Qu et al. observed a performance degradation with more turns, while we observe stability. These differences probably stem from the history truncation in BERT, due to the input length limitation of dense-attention Transformers. This demonstrates the advantages of sparse-attention Transformers for history modeling evaluation, since the comparison against CONCAT can be more "fair". This comparison is important, since the usefulness of any method should be established by comparing it to the straightforward solution, which is CONCAT in the case of history modeling.

We would also like to highlight PosHAE LF's F1 scores in the noisy-history (60.1) and the 20% low-resource (60.9) settings, both lower than the 69.8 F1 in the standard setting. Do these performance drops reflect lower effectiveness in modeling the conversation history? Here the ∆% comes to the rescue. While the ∆% decreased between the standard and the 20% settings (15.6 → 9.9), it actually increased in the noisy-history setting (to 20.4). This indicates that even though the F1 decreased, the ability to leverage the history actually increased.

We conclude that our study results support the design choices we made in our effort to better isolate the contribution of the history representation. We recommend that future work compare history modeling methods using the same LM (preferably a long-sequence LM), and measure ∆% compared to a NO HISTORY baseline.

History Highlighting is Effective in Resource-rich Setups, but is not Robust

The most interesting results are observed for the history highlighting methods: HAE and PosHAE. First, when implemented using the Longformer, HAE LF and PosHAE LF perform better than reported in previous work, with 68.9 and 69.8 F1 respectively, compared to 63.9 and 64.7 reported by Qu et al. using BERT. The gap between HAE LF and PosHAE LF demonstrates the effect of the positional information in PosHAE LF. This effect is further observed in Figure 1: HAE LF saturates earlier since it cannot distinguish between different conversation turns, which probably yields conflicting information. PosHAE LF saturates at 9 turns, later than the rest of the methods, which indicates that it can better leverage long conversations.
PosHAE LF outperforms all methods in the standard, high-resource and noisy-history settings, demonstrating the high effectiveness of history highlighting. However, it shows surprisingly poor performance in the low-resource and domain-shift settings, with an extremely low average ∆% compared to other methods. The impact of the training set size is further illustrated in Figure 2. We plot the ∆% as a function of the training set size, and specifically highlight PosHAE LF in bold red. Its performance deteriorates significantly faster than others when the training set size is reduced. In the 1% setting it is actually the worst performing method.

This poor robustness could be caused by the additional parameters added in the embedding layer of PosHAE LF. Figure 2 demonstrates that properly training these parameters, in order to benefit from this method's full potential, seems to require large amounts of data. Furthermore, the poor domain-shift performance indicates that, even with enough training data, this embedding layer seems to be prone to overfitting to the source domain.

We conclude that history highlighting clearly yields a very strong representation, but the additional parameters of the embedding layer seem to require large amounts of data to train properly and are prone to over-fitting to the source domain. Is there a way to highlight historic answers in the passage without adding dedicated embedding layers?

In §5 we present MarCQAp, a novel history modeling approach that is inspired by PosHAE, adopting the idea of history highlighting. However, instead of modifying the passage embedding, we highlight historic answers by adding textual prompts directly in the input text. By leveraging prompts, we reduce model complexity and remove the need for training dedicated parameters, hoping to mitigate the robustness weaknesses of PosHAE.

MarCQAp
Motivated by our findings, we design MarCQAp, a novel prompt-based history modeling approach that highlights answers from previous conversation turns by inserting textual prompts in their respective positions within $P$. By highlighting with prompts instead of embedding vectors, we hope to encode valuable dialogue information while reducing the learning complexity incurred by the existing embedding-based methods. Thus, we expect MarCQAp to perform well not only in high-resource settings, but also in low-resource and domain adaptation settings, in which prompting methods have been shown to be particularly useful (Brown et al., 2020; Le Scao and Rush, 2021; Ben-David et al., 2022).

Prompting often refers to the practice of adding phrases to the input in order to encourage pre-trained LMs to perform specific tasks (Liu et al., 2021), yet it is also used as a method for injecting task-specific guidance during fine-tuning (Le Scao and Rush, 2021; Ben-David et al., 2022). MarCQAp closely resembles the prompting approach from Ben-David et al. (2022), since our prompts are: (1) discrete (i.e., the prompt is an actual text string), (2) dynamic (i.e., example-based), and (3) added to the input text, with the model then making predictions conditioned on the modified input. Moreover, as in Ben-David et al., in our method the underlying LM is further trained on the downstream task with prompts. However, in contrast to most prompting approaches, which predefine the prompt's location in the input (Liu et al., 2021), our prompts are inserted in different locations for each example. In addition, while most textual prompting approaches leverage prompts comprised of natural language, our prompts contain non-verbal symbols (e.g., "<1>", see Figure 3 and §5.1), which were proven useful for supervision of NLP tasks. For instance, Aghajanyan et al. (2022) showed the usefulness of structured pre-training by adding HTML symbols to the input text. Finally, to the best of our knowledge, this work is the first to propose a prompting mechanism for the CQA task.
Figure 3: The MarCQAp highlighting scheme: Answers to previous questions are highlighted in the grounding document, which is then provided as input to the model.

Method
MarCQAp utilizes a standard single-turn QA model architecture and input, with the input comprised of the current question $q_k$ and the passage $P$. For each CQA example $(P, H_k, q_k)$, MarCQAp inserts a textual prompt within $P$, based on information extracted from the conversation history $H_k$.

In extractive QA, the answer $a_k$ is typically a span within $P$. Given the input $(P, H_k, q_k)$, MarCQAp transforms $P$ into an answer-highlighted passage $\hat{P}_k$, by constructing a prompt $p_k$ and inserting it within $P$. $p_k$ is constructed by locating the beginning and end positions of all historic answers $\{a_i\}_{i=1}^{k-1}$ within $P$, and inserting a unique textual marker for each answer in its respective positions (see the example in Figure 3). The input $(\hat{P}_k, q_k)$ is then passed to the QA model, instead of $(P, q_k)$.

In abstractive QA, a free-form answer is generated based on an evidence span that is first extracted from $P$. Hence, the final answer does not necessarily appear in $P$. To support this setting, MarCQAp highlights the historical evidence spans (which appear in $P$) instead of the generated answers.
To encode positional dialogue information, the marker for $a_j \in \{a_i\}_{i=1}^{k-1}$ includes its turn index number in reverse order, i.e., $k-1-j$. This encodes relative historic positioning w.r.t. the current question $q_k$, allowing the model to distinguish between the historic answers by their recency.

MarCQAp highlights only the historic answers, since the corresponding questions do not appear in $P$. While this might lead to information loss, in §5.3 we implement MarCQAp variants that add the historic questions to the input, and show that the contribution of the historic questions to the performance is minor.

A CQA dialogue may also contain unanswerable questions. Before inserting the prompts, MarCQAp first appends a 'NO ANSWER' string to $P$. Each historical 'NO ANSWER' is then highlighted with prompts, similarly to ordinary historical answers (for an example, see $a_4$ in Figure 3).
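A minimal sketch of the MarCQAp prompt insertion described above. The marker format follows Figure 3, where "<1>" marks the most recent historic answer; overlap handling, repeated answer strings, and the exact 'NO ANSWER' bookkeeping are simplified.

```python
NO_ANSWER = "NO ANSWER"

def insert_marcqap_prompts(passage: str, history_answers) -> str:
    """Build the answer-highlighted passage by wrapping each historic answer with its marker.

    `history_answers` holds the historic answer strings a_1 ... a_{k-1} in dialogue order
    (for abstractive QA, the evidence spans; NO_ANSWER for unanswerable turns).
    """
    if any(a == NO_ANSWER for a in history_answers):
        passage = passage + " " + NO_ANSWER   # append a 'NO ANSWER' string before inserting prompts

    highlighted = passage
    k_minus_1 = len(history_answers)
    for j, answer in enumerate(history_answers, start=1):
        marker = f"<{k_minus_1 - j + 1}>"     # reverse-order index: the most recent answer gets "<1>"
        start = highlighted.find(answer)
        if start == -1:
            continue                          # answer span not found in this passage (sketch: skip it)
        end = start + len(answer)
        highlighted = highlighted[:start] + marker + highlighted[start:end] + marker + highlighted[end:]
    return highlighted

# The resulting passage plays the role of \hat{P}_k and is paired with q_k as the input
# to an otherwise standard QA model.
```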
MarCQAp has several advantages over prior approaches. First, since it is prompt-based, it does not modify the model architecture, which makes it easier to port across various models, alleviating the need for model-specific implementation and training procedures. Additionally, it naturally represents overlapping answers in $P$, which was a limitation in prior work (Qu et al., 2019a,b). Overlapping answers contain tokens which relate to multiple turns, yet the existing token-based embedding methods encode the relation of a token from $P$ only to a single turn from $H_k$. Since MarCQAp is span-based, it naturally represents overlapping historic answers (e.g., see $a_2$ and $a_3$ in Figure 3).

MarCQAp Evaluation
We evaluate MarCQAp in all our proposed experimental settings (§3.1). As presented in Tables 4, 5 and 6, it outperforms all other methods in all settings. In the standard, high-resource and noisy-history settings, its performance is very close to PosHAE LF's, indicating that our prompt-based approach is an effective alternative implementation for the idea of highlighting historical answers. Similarly to PosHAE LF, MarCQAp is able to handle long conversations and its performance gains saturate at 9 turns (Figure 1). However, in contrast to PosHAE LF, MarCQAp performs especially well in the low-resource and the domain-shift settings.
In the low-resource settings, MarCQAp outperforms all methods by a large margin, with an average ∆% of 13.6% compared to the best baseline with 6.3%. The dramatic improvement over PosHAE LF's average ∆% (1.5% → 13.6%) serves as a strong indication that our prompt-based approach is much more robust. This boost in robustness is best illustrated in Figure 2, which presents the ∆% as a function of the training set size, highlighting PosHAE LF (red) and MarCQAp (green) specifically. An example of MarCQAp's robustness in the low-resource setting is provided in Figure 4.

In the domain-shift settings, MarCQAp is the best performing method in 6 out of 8 domains (for the Travel domain, MarCQAp's improvement over ExCorD LF is not statistically significant). On the remaining two domains (Cooking & Movies), CONCAT is the best performing (the differences between CONCAT and MarCQAp for both domains are not statistically significant). Notably, MarCQAp's average ∆% (19.6%) is substantially higher compared to the next best method (14.1%). These results serve as additional strong evidence of MarCQAp's robustness.

MarCQAp's Performance Using Different LMs
In addition to the Longformer, we evaluated MarCQAp using RoBERTa (Liu et al., 2019) and BigBird (Zaheer et al., 2020) in the standard setting. The results are presented in Table 8. MarCQAp shows a consistent positive effect across different LMs, which further highlights its effectiveness.
We note that since RoBERTa is a dense-attention Transformer with an input length limitation of 512 tokens, longer passages are split into chunks. This may lead to some chunks containing only part of the historic answers, and therefore partial highlighting by MarCQAp. Our analysis showed that 51% of all examples in QuAC were split into several chunks, and 61% of the resulting chunks contained partial highlighting. MarCQAp's strong performance with RoBERTa suggests that it can remain effective even with partial highlighting.
Official QuAC Leaderboard Results For completeness, we submitted our best performing model (from the high-resource setting) to the official QuAC leaderboard, evaluating its performance on the hidden test set. Table 9 presents the results. MarCQAp achieves a very competitive score of 74.0 F1, very close to the published state-of-the-art (RoR by Zhao et al. (2021) with 74.9 F1), yet with a much simpler model.

Prompt Design
Recall that MarCQAp inserts prompts at the beginning and end positions of each historical answer within $P$ (Figure 3). The prompts are designed with predefined marker symbols and include the answer's turn index (e.g., "<1>"). This design builds on three main assumptions: (1) textual prompts can represent conversation history information, (2) the positioning of the prompts within $P$ facilitates highlighting of historical answers, and (3) indexing the historical answers encodes valuable information. We validate our design assumptions by comparing MarCQAp against ablated variants (Table 10).
To validate assumption (1), we compare MarCQAp to MARCQAP C, a variant which adds $H_k$ to the input, in addition to $\hat{P}_k$ and $q_k$. MARCQAP C is exposed to information from $H_k$ via two sources: the concatenated $H_k$ and the MarCQAp prompt within $\hat{P}_k$. We observe a negligible effect (the difference is not statistically significant), suggesting that MarCQAp indeed encodes information from the conversation history, since providing $H_k$ does not add useful information on top of $\hat{P}_k$.

To validate assumptions (2) and (3), we use two additional MarCQAp variants. Answer Pos inserts a constant predefined symbol ("<>") in each answer's beginning and end positions within $P$ (i.e., similar to MarCQAp, but without turn indexing). Random Pos inserts the same number of symbols but in random positions within $P$.

Answer Pos achieves a ∆% of 12.7%, while Random Pos achieves 1.7%. This demonstrates that the positioning of the prompts within $P$ is crucial, and that most of MarCQAp's performance gains stem from its prompt positioning w.r.t. the historical answers $\{a_i\}_{i=1}^{k-1}$. When the prompts are inserted at meaningful positions, the model seems to learn to leverage these positions in order to derive an effective history representation. Surprisingly, Random Pos leads to a minor improvement of 1.7% (this difference is statistically significant; we did not further investigate the reasons behind this particular result). Finally, MarCQAp's improvement over Answer Pos (a ∆% of 15.9% compared to 12.7%) indicates that answer indexing encodes valuable information, helping us validate assumption (3).

Finally, since textual prompts allow for easy injection of additional information, we make several initial attempts in this direction, injecting different types of information into our textual prompts. In Word from Q, the marker contains the first word from the historic answer's corresponding question, which is typically a wh-word (e.g., "<what>"). In Word from Q + Index we also add the historic answer's turn index (e.g., "<what_1>"). In Full Q, we inject the entire historic question into the prompt. Word from Q and Word from Q + Index achieved comparable scores, lower than MarCQAp's but higher than Answer Pos's. This suggests that adding semantic information is useful (since Word from Q outperformed Answer Pos), and that combining such information with the positional information is not trivial (since MarCQAp outperformed Word from Q + Index). This points at the effects of the prompt structure and the information it includes: "<1>" and "<what>" both outperform "<>", yet constructing a prompt by naively combining these signals ("<what_1>") does not lead to a complementary effect. Finally, Word from Q outperformed Full Q. We hypothesize that since the full question can be long, it might substantially interfere with the natural structure of the passage text. This provides evidence that the prompts should probably remain compact symbols with a small footprint within the passage. These initial results call for further exploration of optimal prompt design in future work.
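For reference, the marker formats of these variants can be summarized as simple string constructors (an illustrative sketch, not the exact implementation):

```python
def build_marker(variant: str, turn_index: int = 1, question: str = "what was it?") -> str:
    """Marker strings for the prompt-design variants discussed above."""
    first_word = question.split()[0].lower()
    if variant == "answer_pos":           # constant symbol, no turn information (also used by Random Pos)
        return "<>"
    if variant == "marcqap":              # reverse-order turn index, e.g. "<1>"
        return f"<{turn_index}>"
    if variant == "word_from_q":          # first word of the corresponding question, e.g. "<what>"
        return f"<{first_word}>"
    if variant == "word_from_q_index":    # e.g. "<what_1>"
        return f"<{first_word}_{turn_index}>"
    if variant == "full_q":               # the entire historic question
        return f"<{question}>"
    raise ValueError(f"unknown variant: {variant}")
```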

Case Study
Figure 5 presents an example of all evaluated methods in action, from the standard setting. The current question "Did he have any other critics?" has two correct answers: Alan Dershowitz or Omer Bartov. We first note that all methods predicted a name of a person, which indicates that the main subject of the question was captured correctly. Yet, the methods differ in their prediction of the specific person.

REWRITE and CONCAT predict a correct answer (Alan Dershowitz), yet CONCAT predicts it based on incorrect evidence. This may indicate that CONCAT did not capture the context correctly (just the fact that it needs to predict a person's name), and was lucky enough to guess the correct name.

Interestingly, REWRITE C predicts Daniel Goldhagen, which is different from the answers predicted by CONCAT and REWRITE. This shows that combining both methods can yield completely different results, and demonstrates an instance where REWRITE C performs worse than REWRITE and CONCAT (as also seen, for instance, in the 1% low-resource setting). This is also an example of a history modeling flaw, since Daniel Goldhagen was already mentioned as a critic in previous conversation turns.

This example also demonstrates how errors can propagate through a pipeline-based system. The gold rewritten question is "Did Norman Finkelstein have any other critics aside from Peter Novick and Daniel Goldhagen?" (as annotated in CANARD (Elgohary et al., 2019)), while the question rewriting model generated "Besides Peter Novick, did Norman Finkelstein have any other critics?", omitting Daniel Goldhagen. This makes it impossible for REWRITE to figure out that Daniel Goldhagen was already mentioned, making it a legitimate answer. This reveals that REWRITE might have also gotten lucky, and provides a possible explanation for the incorrect answer predicted by REWRITE C.
ExCorD LF , HAE LF and PosHAE LF not only predict a wrong answer, but also seem to fail to resolve the conversational coreferences, since the pronoun "he", in the current question "Did he have any other critics?", refers to Norman Finkelstein.
MarCQAp predicts a correct answer, Omer Bartov. This demonstrates an instance where MarCQAp succeeds while HAE LF and PosHAE LF fail, even though they are all history-highlighting methods. Interestingly, MarCQAp is the only model that predicts Omer Bartov, a non-trivial choice compared to Alan Dershowitz, since Omer Bartov appears later in the passage, further away from the historic answers.

Limitations
This work focuses on a single-document CQA setting, which is in line with the majority of the previous work on conversation history modeling in CQA (§2.3). Correspondingly, MarCQAp was designed for single-document CQA. Applying MarCQAp in a multi-document setting (Qu et al., 2020; Anantha et al., 2021; Adlakha et al., 2022) may result in a partial history representation, since the retrieved document may contain only part of the historic answers; therefore MarCQAp will only highlight the answers which appear in the document (we note that this limitation applies to all highlighting approaches, including HAE and PosHAE (Qu et al., 2019a,b)).

In §5.3 we showed initial evidence that MarCQAp prompts can encode additional information that can be useful for CQA. In this work we focused on the core idea behind prompt-based answer highlighting, as a proposed solution in light of our results in §4. Yet, we did not conduct a comprehensive exploration in search of the optimal prompt design, and leave this for future work.

Conclusion
In this work, we carried out the first comprehensive robustness study of history modeling approaches for Conversational Question Answering (CQA), including sensitivity to model and training data size, domain shift and noisy history input. We revealed limitations of the existing benchmark-based evaluation, by demonstrating that it cannot reflect the models' robustness to such changes in setting. In addition, we proposed evaluation practices that better isolate the contribution of the history modeling component, and demonstrated their usefulness.

We also discovered that highlighting historic answers via passage embeddings is very effective in standard setups, but suffers from substantial performance degradation in low-data and domain-shift settings. Following this finding, we designed a novel prompt-based history highlighting approach. We showed that highlighting with prompts, rather than with embeddings, significantly improves robustness, while maintaining overall high performance.

Our approach can be a good starting point for future work, due to its high effectiveness, robustness and portability. We also hope that the insights from our study will encourage evaluations with a focus on robustness, leading to better CQA systems.

Figure 1 :
Figure 1: F1 as a function of # history turns, for models from the standard setup. The first occurrence of the maximum F1 value (saturation point) is highlighted.

Figure 2 :
Figure 2: ∆% as a function of # training examples. Results taken from the standard and low-resource settings.

Figure 4 :
Figure 4: An example of MarCQAp's robustness in the low-resource setting. Even though ExCorD LF, HAE LF and PosHAE LF predict correct answers in the standard setting, they fail on the same example when the training data size is reduced to 10%. MarCQAp predicts a correct answer in both settings.

Figure 5 :
Figure 5: Our case study example, comparing answers predicted by each evaluated method in the standard setting.We provide a detailed analysis in §5.4.

Table 2 :
Summary of our proposed settings.

Table 4 :
In-domain F1 and ∆% scores on the full QuAC validation set, for the standard, high-resource and low-resource settings. We color-coded the ∆% for positive and negative numbers.

Table 5 :
F1 and ∆% scores for the domain-shift setting. We color-coded the ∆% for positive and negative numbers.

Table 6 :
F1 and ∆% scores for the noisy-history setting.

Table 7 :
Per-setting rankings of the methods in our study (top is best), excluding MarCQAp. C is CONCAT, R is REWRITE, R C is REWRITE C, Ex is ExCorD LF, H is HAE LF and PH is PosHAE LF.

Table 8 :
MarCQAp's standard setting performance across different Transformer-based pre-trained LMs.

Table 9 :
Results from the official QuAC leaderboard, presenting F1 scores for the hidden test set, for MarCQAp and other models with published papers.

BiDAF++ w/ 2-Context (Choi et al., 2018): 60.1
HAE (Qu et al., 2019a): 62.4
FlowQA (Huang et al., 2019): 64.1
GraphFlow (Chen et al., 2020): 64.9
HAM (Qu et al., 2019b): 65.4
FlowDelta (Yeh and Chen, 2019): 65.5
GHR (Qian et al., 2022): 73.7
RoR (Zhao et al., 2021): 74.9
MarCQAp (Ours): 74.0

Table 10 :
F1 and ∆% scores for MarCQAp's ablated variants, in the 10% setup of the low-resource setting.