Exploring Contrast Consistency of Open-Domain Question Answering Systems on Minimally Edited Questions

Abstract
Contrast consistency, the ability of a model to make consistently correct predictions in the presence of perturbations, is an essential aspect in NLP. While studied in tasks such as sentiment analysis and reading comprehension, it remains unexplored in open-domain question answering (OpenQA) due to the difficulty of collecting perturbed questions that satisfy factuality requirements. In this work, we collect minimally edited questions as challenging contrast sets to evaluate OpenQA models. Our collection approach combines both human annotation and large language model generation. We find that the widely used dense passage retriever (DPR) performs poorly on our contrast sets, despite fitting the training set well and performing competitively on standard test sets. To address this issue, we introduce a simple and effective query-side contrastive loss with the aid of data augmentation to improve DPR training. Our experiments on the contrast sets demonstrate that DPR's contrast consistency is improved without sacrificing its accuracy on the standard test sets.


Introduction
Contrast consistency (Gardner et al., 2020) is a crucial aspect for neural models in NLP. Models are expected to identify perturbations in the text input and decide whether such a semantic shift leads to a different label. To evaluate this consistency, contrast sets have been introduced in various tasks such as sentiment analysis (Wu et al., 2021), natural language inference (Ross et al., 2022), and reading comprehension (Longpre et al., 2021) by minimally modifying the original input to reverse the original label. However, to the best of our knowledge, there is no study on contrast consistency in open-domain question answering (OpenQA). In OpenQA, even a slight modification of a word or two can alter the meaning of the question, which leads to a completely different answer. To maintain contrast consistency, models are expected to predict the corresponding answer when such a semantic shift occurs.
Figure 1: Above: Trained on question $q_1$ but not on its contrast question $q_2$, DPR generated an embedding of $q_2$ that is overly similar to that of $q_1$ and thus falsely retrieved $p_1$. We aim to identify $q_2$ as a distinct question and retrieve $p_2$ instead. Below: The performance of DPR-based OpenQA models on the standard NQ question set and our contrast set of minimally edited questions (MEQs).
Studying contrast consistency in OpenQA poses unique challenges. Firstly, collecting appropriate contrast sets is difficult. While contrast sets have been developed for reading comprehension (Longpre et al., 2021; Li et al., 2022), they typically replaced an entity (e.g., Barack Obama was born in Hawaii) in the given context with another entity (e.g., Barack Obama was born in New York), leading to a different answer to the given question (e.g., Where was Barack Obama born?). Constructing such contrast sets does not necessitate the factuality of the perturbed context, as the answer depends solely on the context rather than world knowledge. However, in the absence of evidence context, the perturbed questions in OpenQA must be factually answerable in accordance with world knowledge, which is beyond what rule-based methods can do. Secondly, achieving contrast consistency is challenging for OpenQA models, which usually follow the "retrieve-then-read" pipeline (Lewis et al., 2020). In addition to the challenge of predicting answers from a contrast context as in reading comprehension, models also face the challenge of mapping the perturbed question to its corresponding evidence passage in a large corpus. The latter requires the retriever to distinguish the minimal semantic difference between embeddings of the perturbed question and the original question, which is ignored in typical retriever training.
To fill this gap in OpenQA, we propose to create contrast sets using Minimally Edited Questions (MEQs). Given a question $q$ and its answer $a$, an MEQ $q'$ is defined as a question that possesses high lexical and semantic similarity with $q$, while having a distinct answer $a'$ ($a' \neq a$). For example, in Figure 1, changing "Pet Sematary 2" to "Pet Sematary" generates an MEQ that resembles the original question but has a distinct answer ("Coweta County, Georgia" → "Maine"). We use the training set of an existing benchmark as the original questions because neural OpenQA models exhibit high performance on them. Thus, we are able to evaluate the models' ability to distinguish MEQs by measuring their performance on the MEQ contrast set. Specifically, we collect MEQs for training questions in the Natural Questions (NQ) benchmark (Kwiatkowski et al., 2019) from two sources, namely (1) InstructGPT-based question generation (Ouyang et al., 2022) followed by crowdsourced annotation and (2) the AmbigQA dataset (Min et al., 2020).
We find that the state-of-the-art OpenQA models which employ the dense passage retriever (DPR) (Karpukhin et al., 2020) struggle on our MEQ contrast sets. As shown in Figure 1, DPR-retrieved passages lead to 63% downstream QA accuracy on the training set and 43% on the standard test set. However, the accuracy drops to 20% to 25% on our MEQ contrast sets. The problem lies in the contrastive training process of DPR. The model is trained to optimize question embeddings to be closer to their positive passage embeddings than to negative passage embeddings. This paradigm does not provide explicit signals for understanding the relationships between questions, which causes the generated question embeddings to be insensitive to minimal discrepancies. As a result, the model generates overly similar embeddings for the MEQ and the original question, leading to incorrect passage retrieval for the MEQ. In fact, the overlap between the retrieved passages of the original question and those of its MEQ is as high as about 70%, which reflects DPR's limited ability to distinguish the questions. To overcome such limitations, it is necessary to complement DPR training with signals on inter-question relationships. Besides building the mapping between questions and passages, DPR needs to know which questions are the same and which are different.
In this pioneering study, we propose a simple and effective method based on a query-side contrastive loss to improve the performance of DPR on MEQs. Specifically, in order to learn inter-question relationships, DPR is trained to distinguish between paraphrase questions and semantically different questions. To achieve this, we obtain synthetic MEQs for training questions from the machine-created QA corpus, PAQ (Lewis et al., 2021), as augmented data. Experiments demonstrate that learning the query-side contrastive loss on the augmented MEQs improves the performance of DPR on contrast sets, without sacrificing its performance on standard open-domain questions in the NQ test set.

Open-Domain Question Answering
OpenQA is a task that aims to answer user questions without any specified context, thereby testing the ability of QA systems to retrieve, comprehend, and utilize world knowledge (Zhu et al., 2021). The state-of-the-art approach in OpenQA is a two-stage pipeline, consisting of evidence retrieval and answer prediction (Chen et al., 2017).
In the evidence retrieval stage, a retriever model finds evidence passages from a large corpus (e.g., Wikipedia) based on their relevance to the question. Traditional retrievers like BM25 (Robertson and Zaragoza, 2009) perform lexical matching to measure such relevance scores. Recently, DPR (Karpukhin et al., 2020) revolutionized the field by employing dual BERT (Devlin et al., 2019) encoders to compute embeddings for the question and the passage, respectively. It searches evidence passages based on the inner product of question and passage embeddings. Although subsequent approaches have sought to improve the architecture of the retriever by using fine-grained question-passage interactions (Khattab and Zaharia, 2020) or enhancing global embedding training (Gao and Callan, 2021), DPR remains the most widely used model due to its simplicity and efficiency. However, the capability of DPR in distinguishing contrastive information has not been thoroughly studied. In this work, we use MEQs as contrast sets and show that DPR has limited contrast consistency when solving MEQs.
In the answer prediction stage, a reader model encodes and fuses the representations of all passages, then predicts an answer by extracting a span (Kedia et al., 2022), generating a free-form sequence (Izacard and Grave, 2021), or using a hybrid approach (Fajcik et al., 2021). While answer prediction is also challenging on MEQs, our approach mainly focuses on the retrieval part, which is the bottleneck of solving the MEQs in OpenQA.

Contrast Sets
NLP benchmark datasets typically comprise i.i.d. examples that are randomly divided into training and test sets. Conversely, contrast sets refer to data created from small yet label-changing modifications to the existing examples (Gardner et al., 2020). Such characteristics make contrast sets an ideal testbed for evaluating contrast consistency. For example, Gardner et al. (2020) and Kaushik et al. (2020) employed humans to modify linguistic patterns on tasks like syntactic parsing, relation extraction, and claim verification. On sentiment analysis and language inference tasks, controlled text modification models could automatically generate contrast sets (Wu et al., 2021; Ross et al., 2022). In reading comprehension, rule-based algorithms created contrast sets by replacing the answer with another entity (Longpre et al., 2021; Ye et al., 2021; Li et al., 2022). In video-to-text matching, a pre-trained T5 model was used to find replacements for verbs and entities in the original caption (Park et al., 2022).
Nevertheless, building contrast sets to evaluate contrast consistency in OpenQA, where data collection must guarantee the factuality of MEQs, has not been explored yet. The most relevant work is Paranjape et al. (2022), which automatically generated perturbed questions for data augmentation on QA datasets. However, we focus on collecting challenging MEQs to evaluate model consistency instead of data augmentation. Moreover, their generated questions did not meet the requirements of MEQs. The limited accuracy of the question generation model would lead to considerable noise instead of perfect factuality. Also, their method did not ensure the minimality of edits. Therefore, their generated data cannot be used as challenging contrast sets to evaluate contrast consistency in OpenQA.

Problem Formulation
In this work, we study minimally edited questions (MEQs) as challenging contrast sets in OpenQA. Suppose we have two questions $q$ and $q'$ with answers $a$ and $a'$ respectively, where $q$ is the original question in the training set and $q'$ is an MEQ of $q$. In this study, the minimality of edits is measured in two aspects: lexical distance $d_\ell(q, q')$ and semantic distance $d_s(q, q')$. That is to say, $q'$ needs to satisfy $d_\ell(q, q') \le \epsilon_\ell$, $d_s(q, q') \le \epsilon_s$ and $a' \neq a$, where $\epsilon_\ell$ and $\epsilon_s$ are distance thresholds.
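As an illustration of the definition above, the following sketch checks the three MEQ conditions for a candidate pair. It is not the authors' implementation: the word-level Levenshtein distance used for the lexical distance, the use of one minus cosine similarity for the semantic distance, the function names, and the default thresholds are all assumptions made for this example (the semantic default of 0.05 mirrors the 0.95 cosine-similarity cutoff used later in MEQ filtering; the lexical default is a placeholder).

```python
import math


def word_edit_distance(q1: str, q2: str) -> int:
    """Levenshtein distance over whitespace tokens (assumed unit for d_l)."""
    a, b = q1.lower().split(), q2.lower().split()
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete a token
                            curr[j - 1] + 1,      # insert a token
                            prev[j - 1] + cost))  # substitute a token
        prev = curr
    return prev[-1]


def cosine_distance(u, v) -> float:
    """1 - cosine similarity of two embedding vectors (assumed form of d_s)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


def is_meq(q, a, q_new, a_new, h_q, h_q_new, eps_l=3, eps_s=0.05) -> bool:
    """True if (q_new, a_new) satisfies the MEQ conditions w.r.t. (q, a).

    h_q and h_q_new are precomputed sentence embeddings; eps_l is a
    hypothetical lexical threshold, eps_s a hypothetical semantic one.
    """
    return (word_edit_distance(q, q_new) <= eps_l
            and cosine_distance(h_q, h_q_new) <= eps_s
            and a_new.strip().lower() != a.strip().lower())
```

For the Figure 1 example, removing the token "2" from the original question gives a word-level edit distance of 1, so the pair passes the lexical condition under any threshold of at least 1.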

Evaluation Metrics
To evaluate DPR on MEQ contrast sets, we consider metrics on both ranking and retrieval evaluation. Besides, we run end-to-end QA experiments using the passages retrieved by DPR.
Ranking evaluation measures the model's ability to differentiate a positive passage from negative passages, by ranking a set of candidate passages based on the relevance score to the question. We collect 50 candidates for each question, including a positive passage, 30 hard negative passages and 19 random negative passages. Hard negatives are the top-ranked passages in BM25 retrieval that do not contain the answer. We report Mean Rank (MR) and Mean Reciprocal Rank (MRR) of the positive passage.
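The two ranking metrics can be sketched as follows. This is a minimal illustration, assuming each question comes with a list of relevance scores for its 50 candidates and the index of the positive passage; the function names and the tie-handling rule (ties count in favor of the positive passage) are choices made for this example.

```python
def rank_of_positive(scores, positive_index):
    """1-based rank of the positive passage; a higher score ranks earlier.

    Only strictly higher-scoring candidates push the positive passage down,
    so ties are resolved in its favor (an assumption of this sketch).
    """
    pos_score = scores[positive_index]
    return 1 + sum(1 for s in scores if s > pos_score)


def mean_rank_and_mrr(ranks):
    """Mean Rank (lower is better) and Mean Reciprocal Rank (higher is better)."""
    n = len(ranks)
    mean_rank = sum(ranks) / n
    mrr = sum(1.0 / r for r in ranks) / n
    return mean_rank, mrr
```

For instance, positive passages ranked 1st, 2nd and 4th for three questions give MR = 7/3 and MRR = (1 + 1/2 + 1/4) / 3.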
Retrieval evaluation tests the model's ability to retrieve passages relevant to answering the question from a large corpus. Our retrieval corpus contains ~21M passages from Wikipedia. We calculate Recall@k, the number of passages containing the answer in top-k retrieved passages.
End-to-end QA evaluation checks whether the retrieved passages contain useful information for predicting the correct answer. The retrieved passages are fed into a Fusion-in-Decoder (FiD) reader (Izacard and Grave, 2021) trained on NQ. We calculate Exact Match between model predictions and answers.
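A minimal sketch of the retrieval and QA metrics follows. It makes two assumptions that the text does not spell out: Recall@k is computed as the fraction of questions for which at least one of the top-k passages contains an answer string (the convention commonly used with DPR, which may differ from the exact normalization intended above), and answers are normalized by lowercasing and removing punctuation and English articles before matching. All function names are hypothetical.

```python
import re
import string

_PUNCT = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def recall_at_k(retrieved, answers, k):
    """Fraction of questions whose top-k passages contain any gold answer.

    retrieved: one ranked list of passage texts per question.
    answers:   one list of acceptable answer strings per question.
    """
    hits = 0
    for passages, golds in zip(retrieved, answers):
        top_k = [normalize_answer(p) for p in passages[:k]]
        if any(normalize_answer(g) in p for p in top_k for g in golds):
            hits += 1
    return hits / len(retrieved)


def exact_match(prediction, golds) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(g) for g in golds)
```

Substring matching on normalized text is a simplification; token-level matching would avoid spurious hits inside longer words.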

Dataset Construction
Based on the above evaluation metrics, we collect two MEQ contrast sets to evaluate models' contrast consistency. The first set, referred to as MEQ-GPT, is generated using InstructGPT (Ouyang et al., 2022) then manually filtered and annotated with answers by crowdsource workers. The second set, named MEQ-AmbigQA, is sourced from the AmbigQA dataset (Min et al., 2020). The construction of our contrast sets consists of four phases: question collection, MEQ filtering, answer annotation, and evidence passage annotation.

MEQ-InstructGPT
Generating answerable MEQs is very difficult for crowdsource workers who are not domain experts. It is hard for them to determine which modifications to the original question result in an answerable MEQ without extensive Internet searches. However, recent GPT-3 models have been shown to possess a vast amount of knowledge through massive pre-training (Brown et al., 2020). Therefore, we first utilize the InstructGPT model (text-davinci-002) to generate a set of MEQ candidates, and leave the answer annotation task to crowdsource workers. The input to InstructGPT is of the form $[I, x_1, \cdots, x_t, q, a]$, where $I$ is the instruction "Generate a similar question that has a different answer". $\{x_i\}_{i=1}^{t}$ are in-context demonstrations that are manually created, where each $x_i$ is a tuple $[q_i, a_i, q'_i, a'_i]$ ($q'_i$ is the MEQ of $q_i$). The original question $q$ and answer $a$ are appended to the input, prompting InstructGPT to generate a new question $q'$ and its answer $a'$ to complete the sequence. For each input $q$, we sample 10 completions from InstructGPT to generate a set of candidate MEQs.
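To make the input layout concrete, the sketch below assembles a prompt from the instruction, the demonstrations, and the query pair. Only the instruction string comes from the text; the field labels ("Question:", "Answer:", "Similar question:"), the line layout, and the function name are hypothetical, since the exact template is not given here. No model call is shown.

```python
INSTRUCTION = "Generate a similar question that has a different answer"


def build_prompt(demos, q, a):
    """Assemble instruction, in-context demonstrations, then (q, a).

    demos: list of (q_i, a_i, meq_i, meq_answer_i) tuples.
    The prompt ends with an open "Similar question:" slot so that the
    model completes it with a new question and its answer.
    """
    lines = [INSTRUCTION, ""]
    for q_i, a_i, meq_i, meq_a_i in demos:
        lines += [
            f"Question: {q_i}",
            f"Answer: {a_i}",
            f"Similar question: {meq_i}",
            f"Answer: {meq_a_i}",
            "",
        ]
    lines += [f"Question: {q}", f"Answer: {a}", "Similar question:"]
    return "\n".join(lines)
```

Sampling 10 completions for such a prompt then yields the candidate set that the filtering phase operates on.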

MEQ-AmbigQA
The AmbigQA dataset initially targeted a subset of NQ consisting of ambiguous questions. The dataset was introduced to decompose each ambiguous question into multiple disambiguated questions, each of which is a slight modification of the original question. For each NQ question covered in AmbigQA, its corresponding disambiguated questions are considered as its candidate MEQs and are delivered to the subsequent filtering phase (§4.1.2). However, such questions are limited as we set strict criteria for MEQs, so we need more data generated by InstructGPT for solid evaluation.

MEQ Filtering
To build challenging contrast sets, a series of criteria are applied to eliminate unqualified candidates and select MEQs based on the definition in §3.1.
1. Quality control: We discard $q'$ if $q$ and $q'$ differ in question words (e.g., how, what), or if the only word that $q'$ adds to $q$ falls into {first, last, new, next, original, not}. We have found that InstructGPT frequently adds these words to create MEQs, but they usually lead to unanswerable questions.

Semantic distance:
The cosine similarity of semantic embeddings is used to measure $d_s(q, q')$. We remove $q'$ if $\cos(h_q, h_{q'}) < 0.95$, which indicates non-negligible semantic discrepancy. The semantic embedding $h$ should be generated by a sentence embedding model. Here we use the question encoder of the unsupervised dense retrieval model Contriever (Izacard et al., 2021).
4. Paraphrase filtering: $q'$ is discarded if it is classified as a paraphrase of $q$ by a paraphrase detection model. Here we use a RoBERTa-large (Liu et al., 2019) fine-tuned on the Quora Question Pairs dataset (Wang et al., 2019) for paraphrase classification.

Answer Difference
For AmbigQA questions, since they are originally human-annotated, we ask human volunteers to check whether $a'$ and $a$ are aliases of the same entity. For GPT-generated questions, the inspection of answer difference is included in the answer annotation process, which we will elaborate on in §4.1.3.
[Table caption, partial: Semantic similarity is computed by Contriever (Izacard et al., 2021). For NQ-train and NQ-test, edit distance and semantic similarity are computed between random question pairs. For MEQ contrast sets, they are computed between the original question and its MEQ.]
Among GPT-generated questions, for a certain original question $q$, there may be multiple MEQ candidates that pass the above filtering. In such cases, the question that is generated most frequently across 10 samples is selected as the most confident MEQ by InstructGPT. This is similar to the self-consistency idea in Wang et al. (2022).
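The filtering and selection steps can be sketched as below. This is an illustration under stated assumptions rather than the authors' pipeline: it covers only the quality-control, semantic-similarity, and paraphrase criteria as listed in the text above, the set of question words is a guess extending the two examples given, and `cos_sim` and `is_paraphrase` are hypothetical callables standing in for the Contriever encoder and the paraphrase classifier.

```python
import re
from collections import Counter

BANNED_ADDED_WORDS = {"first", "last", "new", "next", "original", "not"}
# Assumed list; the text only names "how" and "what" as examples.
QUESTION_WORDS = {"what", "who", "whom", "whose", "when", "where",
                  "which", "why", "how"}


def _tokens(text):
    return re.findall(r"[a-z0-9']+", text.lower())


def passes_quality_control(q: str, q_new: str) -> bool:
    orig, new = _tokens(q), _tokens(q_new)
    # Reject candidates that change the question word.
    if set(orig) & QUESTION_WORDS != set(new) & QUESTION_WORDS:
        return False
    added = list((Counter(new) - Counter(orig)).elements())
    removed = list((Counter(orig) - Counter(new)).elements())
    # Reject candidates whose only change is adding one banned word.
    if not removed and len(added) == 1 and added[0] in BANNED_ADDED_WORDS:
        return False
    return True


def select_meq(q, samples, cos_sim, is_paraphrase, min_cos=0.95):
    """Filter sampled candidates, then return the most frequent survivor."""
    kept = [s for s in samples
            if passes_quality_control(q, s)
            and cos_sim(q, s) >= min_cos
            and not is_paraphrase(q, s)]
    if not kept:
        return None
    return Counter(kept).most_common(1)[0][0]
```

Counting exact string matches among the sampled completions is the simplest reading of "generated most frequently"; normalizing case or punctuation before counting would be a natural refinement.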

Answer Annotation
Due to the limited accuracy of InstructGPT in directly answering open-domain questions (Yu et al., 2023), we recruit crowdsource workers to annotate the answer of each candidate MEQ generated by InstructGPT. Before human annotation, we first check the answer generated by InstructGPT via Google Search. If Google Search returns a highlighted answer box which matches the InstructGPT-generated answer, we skip the subsequent human labeling step. For the remaining questions, we recruit human annotators from Surge AI for data labeling. We ask them the following questions:
Q1. Is $q'$ a good variation of $q$? Bad variations include being unanswerable or having the same answer as $q$, and are discarded from our dataset.
Q2. If $q'$ is deemed a good variation, find the answer $a'$ using search engines. If necessary, the question may have multiple answers.
Quality control: To ensure answer correctness, each question is answered by two different annotators. If the annotators disagree on the answer or if either annotator determines the question is a bad variation, the question is discarded. Since the answers are free-form responses, we manually check whether the answers given by two annotators are aliases of the same entity. To reduce costs, if the response of the first annotator matches exactly with the answer provided by InstructGPT, we do not recruit a second annotator.

Gold Evidence Passages
As mentioned in §3.2, ranking evaluation on MEQs needs gold evidence passages as positive examples, so we collect them from Wikipedia for our contrast sets. For MEQ-AmbigQA, we utilize the semi-oracle evidence documents provided by the original authors, dividing them into 100-word passages. Then, we identify the first passage that contains the gold answer. For MEQ-GPT, our initial step involves finding candidate evidence passages that include the gold answer. This is achieved by retrieving Wiki passages with BM25 and selecting the top 3 passages that contain the answer. Next, we recruit human annotators from Surge AI to assess whether any of these passages provide sufficient evidence for answering the question. The highest-ranked passage that passed human annotation is chosen as the gold evidence passage. Finally, both contrast sets have a subset of questions paired with a corresponding gold evidence passage.
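The MEQ-AmbigQA step above (split into 100-word passages, take the first one containing the gold answer) can be sketched as follows. Whitespace tokenization and case-insensitive substring matching are assumptions of this example, as are the function names.

```python
def split_into_passages(document: str, words_per_passage: int = 100):
    """Split a document into consecutive passages of at most N words."""
    words = document.split()
    return [" ".join(words[i:i + words_per_passage])
            for i in range(0, len(words), words_per_passage)]


def first_passage_with_answer(passages, answers):
    """Return the first passage containing any gold answer, else None."""
    for passage in passages:
        lowered = passage.lower()
        if any(ans.lower() in lowered for ans in answers):
            return passage
    return None
```

Questions for which this returns `None` would simply have no gold evidence passage, which is consistent with only a subset of each contrast set being paired with one.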

Dataset Analysis
The full dataset is composed of 3,343 MEQs (2,293 from InstructGPT and 1,050 from AmbigQA). Each of these MEQs has its original question in the NQ training set. Among them, 1,229 (53.6%) InstructGPT questions and 625 (59.5%) AmbigQA questions are paired with a gold evidence passage from Wikipedia. We use this subset in ranking evaluation and the full set in retrieval and end-to-end QA evaluation.

Data statistics
We summarize basic statistics of the MEQ contrast sets compared to the original NQ questions. Questions in MEQ-AmbigQA are longer because the original AmbigQA annotators usually added conditions to disambiguate the original NQ questions. Besides, AmbigQA does not impose a limit on the answer length, while we limit each answer in MEQ-GPT to at most 5 words, consistent with NQ. The number of answers per question is lower in MEQ-GPT than in MEQ-AmbigQA, because most answers are obtained through strict text matching on candidate answers from two sources. In addition, we observe that MEQ-GPT has a smaller edit distance and higher semantic similarity between $q$ and $q'$, making it hard for models to distinguish them.
[Table residue, example MEQ pair: Q: Where did the Titanic make its maiden voyage from? A: Southampton. Q: Where did the Titanic make its maiden voyage to? A: New York.]

Types of edits
We review and categorize different types of minimal edits that are used to create MEQs. Since MEQ-AmbigQA primarily consists of edits that add specifications to the original NQ question, we consider MEQ-GPT as a more natural representation of minimal edits. As shown in Table 2, the edits in MEQ-GPT involve nouns (28.0%), verbs (18.5%), adjectives (18.2%), numbers (14.2%), ordinals (9.2%), dates (6.6%), prepositions/conjunctions (2.9%) and others (2.4%). A word cloud of the edited words is given in Figure 2. We also observe that 22.5% of the total edits are antonym edits where a word in the original question is replaced by its antonym. Our dataset of diverse MEQs provides a comprehensive evaluation of contrast consistency.

Challenges of MEQ Contrast Sets
The collected MEQ contrast sets are challenging for the widely-used DPR-based OpenQA system, although these perturbed questions are only minimal edits to the well-learned training questions.
As shown in Figure 1, the model significantly underperforms on the contrast sets, where the passage ranking score of DPR decreases by 39% and 45% compared to NQ-train, and by 29% and 18% compared to NQ-test. This makes a substantial impact on the QA performance, with the accuracy being 69% and 60% lower on the two contrast sets compared to NQ-train, and 54% and 40% lower than NQ-test. The results show that the collected MEQs are much harder to solve than random test questions, which indicates our contrast sets can serve as testbeds for evaluating the contrast consistency of OpenQA.
[Figure caption, partial: Below: our improved DPR with the query-side contrastive loss, where $q^+$ and $q^-$ are obtained through data augmentation.]
The question encoder $E_Q$ and the passage encoder $E_P$ of DPR map the input sequence to a dense vector as its semantic representation. The relevance score $s(q, p)$ between a question $q$ and a passage $p$ is defined as the dot product of their representations:
$$s(q, p) = E_Q(q)^\top E_P(p).$$
DPR is trained via a contrastive loss. Given a positive passage $p^+$ and a set of negative passages $\{p_i^-\}_{i=1}^{n}$ for a certain question, the model is trained to maximize the relevance score between $q$ and $p^+$, while minimizing the relevance score between $q$ and each $p_i^-$. The loss function is:
$$L_{QP} = -\log \frac{\exp(s(q, p^+))}{\exp(s(q, p^+)) + \sum_{i=1}^{n} \exp(s(q, p_i^-))}.$$
The above training paradigm works well on retrieving passages for random test questions, but does not perform as effectively on MEQ contrast sets, as discussed in §1 and §4.3. The training loss $L_{QP}$ does not provide explicit signals for DPR to learn the relationships between questions. As a result, the question embeddings are insensitive to minimal discrepancies, which prevents the model from identifying the MEQ as a distinct question after seeing the original question in training. This causes DPR to generate an overly similar embedding for the MEQ, leading to a high overlap in the retrieved passages and low contrast consistency.
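As a small numerical illustration of the question-passage objective described above, the sketch below computes the loss for one question from precomputed embeddings. It assumes the softmax negative log-likelihood form commonly used for DPR and plain Python lists as vectors; the function names are hypothetical and no encoder is involved.

```python
import math


def dot(u, v):
    """Inner product, used as the relevance score s(q, p)."""
    return sum(x * y for x, y in zip(u, v))


def qp_loss(q_emb, pos_emb, neg_embs):
    """-log softmax probability of the positive passage among all candidates."""
    scores = [dot(q_emb, pos_emb)] + [dot(q_emb, n) for n in neg_embs]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[0]
```

Nothing in this objective compares two questions with each other, which is the gap the query-side loss is meant to fill: two nearly identical question embeddings incur the same loss as long as each scores its own positive passage highly during training.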

Proposed Method
We propose to improve the contrast consistency of DPR by introducing a query-side contrastive loss to distinguish between paraphrase questions and MEQs, which are positive and negative question examples for an original question, respectively. We devise a data augmentation approach to collect synthetic question examples to train this loss.

Data Augmentation
For a training question $q$, its positive example $q^+$ is a synthetic paraphrase question which is slightly different from $q$ and has the same answer; its negative question $q^-$ is a synthetic MEQ with a different answer.
To obtain $q^+$, we leverage back translation provided by the nlpaug package. The original question $q$ is translated to another language and then translated back to produce a new phrasing of $q$. We used translation models of 6 languages provided by Ng et al. (2019) and Tiedemann and Thottingal (2020). Questions that are identical to $q$ (i.e., edit distance = 0) or classified as "not paraphrase" by the paraphrase detection model used in §4.1.2 are eliminated. The remaining questions constitute a candidate set of positive questions from which a random $q^+$ is sampled in each epoch.
To obtain $q^-$, synthetic MEQs are retrieved from the machine-built QA corpus PAQ (Lewis et al., 2021). All questions in PAQ that are similar to $q$ are retrieved by the question retriever in the work of PAQ. Then, the MEQ requirements specified in §4.1.2 are applied to filter the retrieved synthetic questions. The remaining questions constitute a candidate set of negative questions from which a random $q^-$ is sampled in each epoch.
Apart from learning the relationships among $q$, $q^+$, and $q^-$, the loss $L_{QP}$ can be augmented to learn the relevance between synthetic questions and their corresponding passages. Because $q^+$ is a paraphrase question that maps to the same passages as $q$, it does not have to be involved in $L_{QP}$. To train on $q^-$, its positive passage is the Wikipedia passage that was used to generate the question during the construction of PAQ; its negative passages are collected from the top-ranked passages retrieved by BM25 which do not contain the answer.

Model Training
To provide more supervision signals and prevent overfitting, we randomly sample $q^+$, $q^-$, and $p^-$ for each training question $q$ in each epoch. This means that while the original training questions remain fixed, a different set of augmented questions is used in each epoch. For explicit supervision on inter-question relationships, given $q$, DPR is trained to assign higher relevance scores to its paraphrase question ($q^+$) and lower relevance scores to its MEQ ($q^-$). The relevance score of any pair of questions $(q_1, q_2)$ is calculated as the inner product of their embeddings: $s(q_1, q_2) = E_Q(q_1)^\top E_Q(q_2)$. Specifically, we consider three forms of query-side contrastive loss functions in experiments:
(1) InfoNCE Loss (van den Oord et al., 2018), which differentiates the positive question from a set of $m$ negative questions. Besides the synthetic MEQ, which is considered a hard negative, the other questions in the same batch are included as random negatives. The loss function is:
$$L_{QQ} = -\log \frac{\exp(s(q, q^+))}{\exp(s(q, q^+)) + \sum_{i=1}^{m} \exp(s(q, q_i^-))}.$$
(2) Dot Product Loss, which directly penalizes the relevance score between a sample question $q$ and its augmented MEQ:
$$L_{QQ} = s(q, q^-).$$
(3) Triplet Loss (Schroff et al., 2015), which trains the model to assign a higher relevance score to $q^+$ compared to $q^-$, enforced by a margin $\alpha$:
$$L_{QQ} = \max\left(0,\ \alpha - s(q, q^+) + s(q, q^-)\right).$$
The final training loss of our improved DPR is $L = L_{QP} + \lambda L_{QQ}$, where the hyperparameter $\lambda$ weights the trade-off between the loss terms.
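The three query-side losses and their combination with the question-passage loss can be sketched on precomputed question embeddings as follows. This is a minimal illustration, not the training code: vectors are plain Python lists, the InfoNCE form is assumed to be the standard softmax negative log-likelihood, the dot product loss is assumed to be the raw score between $q$ and its MEQ, and all function names are hypothetical.

```python
import math


def dot(u, v):
    """Inner product, used as the question-question score s(q1, q2)."""
    return sum(x * y for x, y in zip(u, v))


def info_nce_loss(q, q_pos, q_negs):
    """-log softmax probability of the paraphrase among paraphrase + negatives.

    q_negs holds the synthetic MEQ (hard negative) plus in-batch questions.
    """
    scores = [dot(q, q_pos)] + [dot(q, n) for n in q_negs]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[0]


def dot_product_loss(q, q_neg):
    """Penalize the raw relevance score between q and its MEQ."""
    return dot(q, q_neg)


def triplet_loss(q, q_pos, q_neg, margin):
    """Hinge loss: the paraphrase should outscore the MEQ by at least margin."""
    return max(0.0, margin - dot(q, q_pos) + dot(q, q_neg))


def total_loss(l_qp, l_qq, lam):
    """Weighted sum of the question-passage and query-side losses."""
    return l_qp + lam * l_qq
```

The dot product loss uses no positive question at all, while InfoNCE contrasts one positive against many negatives, so the two differ considerably in how much signal each training step carries.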

Experiments
In experiments, we compare our proposed training method against the original training setting of DPR. After training the models on the NQ training set, we test them on the standard NQ test set as well as two MEQ contrast sets that we collected in this work.

Models
We augment the training set with $M = 33$k synthetic MEQs and train DPR with both $L_{QP}$ and $L_{QQ}$. We consider the following baselines:
• Vanilla DPR. This is the original training setting of DPR, proposed by Karpukhin et al. (2020). The model is trained only with $L_{QP}$ on the standard NQ training set.
• DPR with random augmented questions. This model is trained only with $L_{QP}$, but we add $M$ random synthetic questions from PAQ to the training set. This is to rule out the effect of simply adding more synthetic data.
• DPR with augmented MEQs. This model uses the same set of $M$ synthetic MEQs retrieved from PAQ as data augmentation, but is trained only with $L_{QP}$. We use this variant to test if $L_{QQ}$ is necessary in model training.
Besides, we test the performance of BM25 on retrieval as a reference. Recent research has shown that larger retrievers may exhibit better generalization (Ni et al., 2022). Therefore, in addition to the standard DPR which is built on BERT-Base, we use BERT-Large as the backbone model to see: (1) whether MEQ contrast sets are still challenging for larger models and (2) whether our training method is still effective for larger models. We name the smaller model and larger model DPR-BASE and DPR-LARGE, respectively. We use the same set of basic hyper-parameters for each DPR model: a learning rate of $10^{-5}$, a batch size of 64 (32 for DPR-LARGE), 40 training epochs with 5% warmup steps. On ranking evaluation, our best setting uses the InfoNCE loss with $\lambda = 0.5$. On retrieval and QA evaluation, our best setting uses the dot product loss with $\lambda = 0.03$. Since we do not have a dev set for MEQs, we conduct ranking evaluation on MEQ contrast sets in a dev setting, where we select the highest score among all checkpoints. Then we use the checkpoint with the best ranking score to test its retrieval and QA performance. The scores on NQ-test are reported using the best checkpoint on NQ-dev.

Results
Experimental results on three datasets (NQ-test, MEQ-AmbigQA, MEQ-GPT) are presented in Tables 3 to 6. We have the following findings:
(1) Our proposed method improves DPR's ability to distinguish MEQs. As shown in Tables 3 and 5, on passage ranking and passage retrieval, the DPR trained with query-side contrastive loss outperforms the vanilla DPR on both contrast sets, showing improved contrast consistency on MEQs. This improvement is consistent across models of different sizes. For example, on MEQ-GPT, our model improves the vanilla DPR by 8% and 10% on ranking MRR for base and large versions respectively. On the choice of $L_{QQ}$, Table 4 demonstrates that all three loss functions improve performance over baselines, while the optimal setting may require tuning on the specific dataset.
(2) The query-side contrastive loss contributes the most to the improved contrast consistency. Although synthetic MEQs themselves bring more training signals, the model cannot consistently outperform the vanilla DPR without $L_{QQ}$. In fact, its performance is sometimes even lower than that of the vanilla DPR. In contrast, after including the query-side contrastive loss, we observe consistent improvements across all datasets, as shown in Tables 3 and 5. For example, on MEQ-AmbigQA, simply adding synthetic MEQs into the training set gives 12% lower recall@1 than the vanilla DPR, while training with $L_{QQ}$ outperforms the naive augmentation method by 18%.
(3) The improvement does not simply come from the increased amount of training data. There is no significant difference in performance between DPR augmented with random synthetic questions ("Random" in the "Augmentation" column) and the original DPR ("None" in the column) in Tables 3, 5, and 6. The average improvement of inserting random synthetic questions on all metrics is only 0.2% for DPR-BASE and 1.6% for DPR-LARGE, which indicates simply adding more synthetic data is not an effective solution.
(4) Improved retrieval performance leads to higher end-to-end QA accuracy. As shown in Table 6, our improved DPR provides more relevant information for answer prediction on MEQs. Even using only 1 retrieved passage, our improved DPR-LARGE outperforms its vanilla version by 12% and 11% on two contrast sets respectively.
(5) Our method does not sacrifice performance on standard test questions. After being jointly trained with the query-side contrastive loss and augmented with synthetic MEQs, our model still maintains its competitive performance on the standard NQ test set. Specifically, it outperforms all baselines in ranking evaluation (see Table 3), while performing on par with the best baseline in retrieval and QA scores (see Tables 5 and 6).
Summary: The results are consistent across ranking, retrieval, and end-to-end QA experiments, which demonstrates the solidity of the above findings. Nevertheless, the performance of DPR still has considerable room for improvement, and such a gap is observed in both base and large versions of the model. Notably, DPR models perform significantly worse on MEQ contrast sets than on the standard test set, even though they are evaluated on the contrast sets under a development setting. This suggests that further research is still necessary to improve the contrast consistency of retrieval models on MEQs.

Passage overlap
One indication that DPR lacks the ability to distinguish the original question from its MEQ is the high overlap between the passages retrieved for each. Figure 4 illustrates that both synthetic data augmentation and the query-side contrastive loss can reduce passage overlap. The synthetic MEQ augmentation helps to train the question embeddings of MEQs closer to their positive passages. Moreover, the query-side contrastive loss explicitly trains the model to tell the original question and its MEQ apart. Nevertheless, a lower passage overlap does not always indicate better performance. For instance, our model with the dot product loss does not have the lowest passage overlap, but performs the best in retrieval evaluation.
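As an illustration of the overlap statistic discussed above, the sketch below measures how many of the top-k passages retrieved for a question are also retrieved for its MEQ. The function name `topk_overlap` and the normalization by k are assumptions; the exact definition behind Figure 4 may differ.

```python
def topk_overlap(passages_q, passages_meq, k=5):
    """Fraction of top-k passage ids shared by a question and its MEQ.

    Both arguments are ranked lists of passage ids (best first).
    Assumption: the overlap is normalized by k.
    """
    top_q = set(passages_q[:k])
    top_meq = set(passages_meq[:k])
    return len(top_q & top_meq) / k


# Hypothetical passage ids for an original question and its MEQ.
original = ["p1", "p2", "p3", "p4", "p5"]
edited = ["p2", "p9", "p1", "p7", "p8"]
print(topk_overlap(original, edited))  # two shared ids out of five
```

A value close to 1 means the retriever returns nearly the same passages for both questions, which is the failure mode described in this subsection.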

Identification of inter-question relationships
To further analyze model behavior after the query-side contrastive training, we test the models' ability to distinguish inter-question relationships. A model is considered successful in identifying the MEQ if the generated embedding of the original question is closer to its paraphrase question than to its MEQ. The paraphrase questions are separately generated using InstructGPT to avoid conflict with those used in data augmentation. As shown in Figure 5, training with the query-side contrastive loss leads to an improved ability to distinguish between paraphrase questions and different questions, which indicates that our models are better at identifying inter-question relationships. The model trained with the InfoNCE loss has the highest success rate in identifying inter-question relationships, because it receives more training signals, from a positive example and a set of negative examples, than the models trained with other types of loss.
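The success criterion above can be sketched as follows. The sketch assumes that "closer" is measured by dot-product similarity between question embeddings (the similarity DPR uses for retrieval); the helper names and the toy two-dimensional embeddings are hypothetical.

```python
def dot(u, v):
    """Dot-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def identifies_meq(q_emb, paraphrase_emb, meq_emb, sim=dot):
    """True if the original question is closer to its paraphrase than to its MEQ."""
    return sim(q_emb, paraphrase_emb) > sim(q_emb, meq_emb)


def success_rate(triples, sim=dot):
    """Share of (original, paraphrase, MEQ) embedding triples identified correctly."""
    hits = sum(identifies_meq(q, p, m, sim) for q, p, m in triples)
    return hits / len(triples)


# Toy embeddings: the first triple is a success, the second a failure.
triples = [
    ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]),
    ([0.0, 1.0], [0.2, 0.3], [0.1, 0.8]),
]
print(success_rate(triples))
```

Passing a different `sim` function (for example, cosine similarity) changes the notion of closeness without altering the rest of the evaluation.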

Conclusion
In this study, we addressed the gap in research on contrast consistency in OpenQA by collecting MEQs as challenging contrast sets to the popular NQ benchmark. Our findings reveal that DPR lacks contrast consistency on our contrast sets. To address this limitation, we introduced a query-side contrastive loss with the aid of data augmentation, which improved its ability to recognize inter-question relationships. Overall, our findings and data can pave the way for further exploring the role of contrast consistency in developing robust and effective OpenQA systems.

Figure 2: Word cloud of the edited words. Words in green and red are the deleted and added words, respectively. Larger font sizes indicate higher frequencies.

Figure 3: Above: the original contrastive training of DPR. Below: our improved DPR with the query-side contrastive loss, where $q^+$ and $q^-$ are obtained through data augmentation.

Figure 4: Overlap in top-5 retrieved passages between the original training question and its MEQ.

Figure 5: The ratio of successful MEQ identifications of different models on contrast sets, with paraphrase questions as distractors.

Table 1: Dataset statistics. Question lengths, answer lengths, and edit distances are all measured in words.

Table 1, MEQ-GPT is similar to NQ regarding the average length of questions and answers. Questions in MEQ-
Q: Who wrote the music for the national anthem? A: John Stafford Smith
Q: Who wrote the lyrics for the national anthem? A: Francis Scott Key

Table 2: Different MEQ edit types in MEQ-GPT with their proportions of antonym edits and examples. The remaining 2.4% of the instances are of miscellaneous types. The first line in each example is the original question and the second line is the MEQ. Words in green and red are the deleted and added words, respectively.
Negative Passage: It was Williams' first Grand Slam final at Wimbledon since 2009, her first Grand Slam...
5.1 Preliminary: DPR
As a dense retriever, DPR includes a question encoder $E_Q(\cdot)$ and a passage encoder $E_P(\cdot)$. Both

Table 3: Ranking evaluation results. MR and MRR stand for mean rank and mean reciprocal rank, respectively. A lower MR or higher MRR indicates better performance. BM25 is not listed because sampling hard negatives from top-ranked passages in BM25 retrieval lowers the ranking performance of BM25 in return.

Table 4: Ranking evaluation with different $\mathcal{L}_{QQ}$ functions on two MEQ contrast sets. All loss functions outperform the baselines in Table 3.

Table 5: Retrieval evaluation results. R@k stands for Recall@k.

Table 6: End-to-end QA results (Exact Match). 1P, 5P, and 20P indicate the number of passages read by the FiD reader.