Fact Checking with Insufficient Evidence

Abstract

Automating the fact checking (FC) process relies on information obtained from external sources. In this work, we posit that it is crucial for FC models to make veracity predictions only when there is sufficient evidence and otherwise indicate when it is not enough. To this end, we are the first to study what information FC models consider sufficient by introducing a novel task and advancing it with three main contributions. First, we conduct an in-depth empirical analysis of the task with a new fluency-preserving method for omitting information from the evidence at the constituent and sentence level. We identify when models consider the remaining evidence (in)sufficient for FC, based on three trained models with different Transformer architectures and three FC datasets. Second, we ask annotators whether the omitted evidence was important for FC, resulting in a novel diagnostic dataset, SufficientFacts, for FC with omitted evidence. We find that models are least successful in detecting missing evidence when adverbial modifiers are omitted (21% accuracy), whereas it is easiest for omitted date modifiers (63% accuracy). Finally, we propose a novel data augmentation strategy for contrastive self-learning of missing evidence by employing the proposed omission method combined with tri-training. It improves performance for Evidence Sufficiency Prediction by up to 17.8 F1 score, which in turn improves FC performance by up to 2.6 F1 score.


Introduction
Computational fact checking approaches typically use deep learning models to predict the veracity of a claim given background knowledge (Thorne et al., 2018; Leippold and Diggelmann, 2020; Augenstein, 2021). However, the necessary evidence is not always available, either due to incomplete knowledge sources, or because the claim has newly emerged and the relevant facts are not documented yet. In such cases, FC models should indicate that the information available is insufficient to predict the label, as opposed to making a prediction informed by spurious correlations.
Prior work shows that FC models can sometimes predict the correct veracity based on just the claim, ignoring the evidence, and that they can overly rely on features such as the word overlap between the evidence and the claim (Schuster et al., 2019, 2021), leading to biased predictions. However, there are no previous studies on what evidence an FC model considers to be enough for predicting a veracity label. To this end, this work introduces the novel task of Evidence Sufficiency Prediction, illustrated in Fig. 1, which we define as the task of identifying what information is sufficient for making a veracity prediction. This task is related to FC and can operate on instances and models from FC datasets, but is focused on evaluating the capability of models to detect missing important information in the provided evidence for a claim. The latter is usually not evaluated explicitly in current FC benchmarks, where joint scores disregard an FC model's prediction when insufficient evidence is retrieved.
We study the new task by, first, conducting a thorough empirical analysis of what models consider to be sufficient evidence for FC. Secondly, we collect human annotations for the latter, which results in a novel diagnostic dataset, SufficientFacts, for FC with omitted evidence.

Figure 1: An example from the VitaminC test set, where the number modifier has been omitted from the evidence. This results in there not being enough evidence for predicting its support for the claim, as judged by human annotators, while two of the models still find the remaining evidence to be sufficient.

Finally,
we employ the method introduced for the empirical analysis to improve the performance of models on the new task of Evidence Sufficiency Prediction, and show that considering it a component task of FC significantly improves FC performance. For the empirical analysis, we propose a new fluency-preserving method that occludes portions of evidence, automatically removing constituents or entire sentences, to create incomplete evidence. We provide those as input to an ensemble of Transformer-based FC models to obtain instances on which FC models agree vs. disagree to have (in)sufficient information. We perform extensive experiments with three models, BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and ALBERT (Lan et al., 2020), and three textual FC datasets with different types of claims: FEVER (Thorne et al., 2018), HoVer (Jiang et al., 2020), and VitaminC (Schuster et al., 2021).
To compare model behavior with human rationales for Evidence Sufficiency Prediction, we ask annotators to indicate if the occluded evidence texts still provide enough information for a fact-check. This results in a novel diagnostic test dataset, SufficientFacts, which contains information about the type of the omitted information, allowing for in-depth analyses of model behavior.
Finally, to improve model performance for detecting omitted important evidence and, in turn, FC, we propose to combine the proposed evidence omission method with tri-training (Zhou and Li, 2005), which utilises the agreement of three different machine learning models to label unlabeled training instances (§5). This results in a novel counterfactual data augmentation schema for learning of (in)sufficient information. We find that the proposed approach is highly effective in improving model performance by up to 17.8 F1 score on the newly introduced SufficientFacts. This also leads to improvements of up to 2.6 F1 score on the standard FC test sets for the corresponding datasets.

Related Work
Here, we study when models trained on existing FC datasets find evidence with omitted important information to still be sufficient for veracity prediction. Such cases might be considered vulnerabilities of the models and can be due to models' faulty reasoning, learned biases, etc. Hence, our work is mainly related to studies exploring potential biases learned by FC models and the vulnerabilities of FC models to adversarial attacks. We further propose a method for evidence omission, which creates counterfactual instances and is thus related to studies on input-level instance re-writing. We also use the proposed evidence omission method to collect counterfactually augmented data (CAD) and compare that to using the collected data in a contrastive learning (CL) loss to improve performance on Evidence Sufficiency Prediction and FC more generally. We thus discuss the relationship between our work and prior studies on CAD and CL. Finally, we compare our work, based on deep learning models, to FC performed against knowledge bases (KBs), where fact triples can also be missing.
Fact Checking Diagnostics. Previous work has exposed various biases of FC models. While FEVER (Thorne et al., 2018) is one of the largest datasets for FC, Schuster et al. (2019) point out that models trained on it can verify a claim solely based on the text of the claim, without considering the evidence. To this end, Schuster et al. (2019) introduce a new diagnostic dataset, Fever-Symmetric, of contrastively re-written claims and evidence. They show that the models fail to detect the contrastive changes in the text, leading to a drop of up to 57.46 F1 score, compared to 85.85 F1 score on the original FEVER development set. Furthermore, the claims in FEVER were manually written based on Wikipedia article sentences, and thus have a large token overlap between the evidence and the claim, especially for supporting evidence. Hence, Schuster et al. (2021) construct a new FC dataset, VitaminC, where they instruct the annotators to avoid using the same words as in the evidence. Ostrowski et al. (2021) further create PolitiHop, a dataset for claim verification of naturally occurring claims with evidence comprised of multiple hops over interconnected evidence chunks. They study how multi-hop vs. single-hop inference architectures reason over the evidence sets in PolitiHop. In addition, several works (Thorne et al., 2019; Niewinski et al., 2019; Hidey et al., 2020) have explored the vulnerability of FC models to adversarial attacks, e.g., by discovering universal trigger words that fool a model into wrongly changing its prediction (Atanasova et al., 2020). In contrast, we are interested in how much evidence is enough for veracity prediction, studying this with three different FC models trained on three different datasets by omitting information at the constituent and sentence levels and comparing it to human judgments.
Instance Re-Writing.
The above studies mainly perform re-writing or insertion operations for FC evidence. Here, we employ causal interventions on the evidence by omission to study when information is (in)sufficient for a model's prediction. Elazar et al. (2021) also use causal interventions that estimate the importance of a property by removing it from a representation. By comparison, even though text-level causal interventions are more intricate due to the discrete nature of text, we perform them on the text itself, by following linguistic rules for optional constituents to preserve the semantics and the fluency of the text. Thorne and Vlachos (2021) perform re-writing of claims by masking and then correcting separate words. They thus generate claims supported by the evidence, particularly for claims not supported before the factual correction. In a similar vein, Wright et al. (2022) decompose long, scientific claims into shorter, atomic claims. They then generate negative instances for those by masking single words in claims and replacing them with antonyms retrieved from a scientific knowledge base. In contrast, we perform omissions of evidence information at the sentence and constituent levels, and for the new task of Evidence Sufficiency Prediction.
Contrastive Learning (CL) and Counterfactual Data Augmentation (CAD). Most existing work on CL in NLP employs contrastive self-learning for model pre-training (Rethmeier and Augenstein, 2021). Contrary to this, Rethmeier and Augenstein (2022) propose for CL to be performed jointly with the supervised objective. We follow the latter to improve the performance of FC models in detecting when important information is missing from the evidence, by using the original evidence texts paired with evidence texts with omitted information as contrastive data points. We perform contrastive self-training jointly with the supervised objective, as we use the contrastive loss as unsupervised training for Evidence Sufficiency Prediction. In contrast, using it for pre-training followed by supervised training could lead to the models forgetting the information learned during pre-training, which is needed to improve the performance on SufficientFacts. An important factor for CL is the augmentation of negative and positive instances, which can be challenging due to the discrete nature of text. Related work explores augmentation through back-translation (Sennrich et al., 2016), masked word substitution with an LM (Wu et al., 2019), graph neighbourhood sampling (Ostendorff et al., 2022), mix-up (Chen et al., 2020), or a combination thereof (Qu et al., 2021). In a similar vein, automated approaches for CAD in NLP include paraphrasing (Iyyer et al., 2018) and controlled text generation (Madaan et al., 2021), which do not necessarily change the target label of an instance. CAD is found to improve model robustness to data artifacts (Kaushik et al., 2020; Teney et al., 2020) and to perform better out of domain (Samory et al., 2021). In contrast, we use evidence omission, combined with tri-training for contrastive negative evidence mining (§5).
Knowledge-Base Fact Checking. A relevant line of work conducts FC against knowledge bases (KBs) by finding fact triple chains that are (in)consistent with the claim (Kim and Choi, 2021). Discovering such missing triples could also be used to detect insufficient evidence information. As KBs can contain an incomplete set of fact triples, related work completes KBs from unstructured textual data on the Web (Distiawan et al., 2019) or with graph embedding techniques (Kim et al., 2018). Our work instead uses machine learning models that take textual evidence as input, without an intermediate step of completing a knowledge base with the needed fact triples.

Datasets
We employ three fact checking datasets (see Table 1) and use the gold evidence documents, i.e., we do not perform document or sentence retrieval (except for the ablation experiment in Section 6.4). Thus, we avoid potentially introducing biases into the veracity prediction models, which would otherwise have to learn to predict the correct support of the evidence for the claim given wrong evidence sentences. Hence, each of the three fact checking datasets D = {(x_i, y_i) | x_i = (c_i, e_i), i ∈ [1, |D|]} consists of instances with input x_i and veracity labels y_i. The input is comprised of a claim c_i and gold evidence e_i. The veracity label y_i ∈ {0=SUPPORTS, 1=REFUTES, 2=NEI} for FEVER and VitaminC, and y_i ∈ {0=SUPPORTING, 1=NOT SUPPORTING} for HoVer.
FEVER (Thorne et al., 2018) contains claim-evidence pairs, where the evidence consists of sentences from Wikipedia pages, and the claims are written manually based on the content of those Wikipedia pages. 87% of the claims have evidence consisting of one sentence. The dataset has a high ratio of token overlap between the claim and the evidence, where the overlap is naturally higher for supporting claims (69%) than for refuting (59%) and NEI (54%) claims. The high overlap ratio can create a bias for learning from token overlap, which can further prevent generalisation, as also noted in related work (Schuster et al., 2021).
VitaminC (Schuster et al., 2021) is a collection of sentences from Wikipedia containing factual edits. For each factual edit, annotators construct a claim that is SUPPORTED and one that is REFUTED with the old and the new version of the evidence. When the factual edit introduces/removes facts from the evidence, claims are constructed so that there is NOT ENOUGH INFORMATION (NEI) to support them. Due to its contrastive nature and reduced claim-evidence overlap, the authors demonstrate that models trained on the dataset gain a 10% accuracy improvement on adversarial fact verification.
HoVer (Jiang et al., 2020) is designed to collect claims that require several hops over Wikipedia evidence sentences to verify. The evidence contains between two and four sentences from different Wikipedia articles. As the test dataset is blind and we use the gold evidence, we use the development set for testing purposes and randomly select 10% of the training dataset for development.

Evidence Omission
To study what types of information in the evidence models consider important, we propose to conduct causal interventions on the evidence by omitting information from it. We hypothesise that removing information that is important for the model to predict the support of the evidence for a claim will cause a change in its original prediction, leading to the model indicating that there is missing information. If the removed information is not important for the model, though, removing it would not change the model's prediction. We then ask whether the information that is important for a model when predicting the support of the evidence text for a claim is actually important as judged by human annotators. The human annotations allow for a systematic study of common model errors, i.e., when the models still predict the correct label even though important evidence information has been removed, and when they consider the information to be insufficient even though only unrelated evidence has been removed.

Evidence Omission Generation
We omit information from the evidence text at the sentence and constituent level. Particularly, we aim to remove information from the evidence such that its stance towards the claim does not change from SUPPORTS to REFUTES, or vice versa, while preserving the grammatical correctness and fluency of the evidence. Following studies of linguistic sentence structure (Burton-Roberts, 2016; Börjars and Burridge, 2019), illustrated with examples in Table 2, we collect prepositional phrases, modifiers, and other optional sentence constructs, i.e., those constructs that can be removed from the sentence without impairing its grammatical correctness, and where the remaining text is semantically identical to the original one, except for the additional information from the removed construct (Garvin, 1958). We use the following optional sentence constructs:

Sentences (S). In FEVER and HoVer, the evidence can consist of more than one sentence. The separate sentences are supposed to contain information important for the fact check, which we further verify with manual annotations as explained in Section 4.2. VitaminC consists of single sentences only, and we thus only perform constituent-level omissions for it, as described next.
Prepositional Phrases (PP) are optional phrases that are not part of a Verb Phrase (VP), but are child nodes of the root sentence in the constituent tree (Brown et al., 1991). These usually function as adverbs of place and consist of more than one word.
Noun Modifiers (NOUNM) are optional elements of a phrase or clause structure (Huddleston and Pullum, 2005). A NOUNM can be a single noun or a group of nouns that modify another noun.
Adjective Modifiers (ADJM) are a single or a group of adjectives that modify a noun.
Adverb Modifiers (ADVM) are a single or a group of adverbs that modify verbs, adjectives, or other adverbs and typically express manner, place, time, etc.
Number Modifiers (NUMM) are a single or a group of words denoting cardinality that quantify a noun phrase.
Date Modifiers (DATEM) are a single or a group of words that express temporal reference. To preserve fluency, from a date expression consisting of a day, a month, and a year, we omit either the day, the day and the month, or the year.
Subordinate Clauses (SBAR) are introduced by a subordinating conjunction. Subordinate clauses depend on the main clause and complement its meaning. SBARs can be adverb clauses, adjective clauses, or noun clauses.
For the omission process, we use two pre-trained models with high performance from the Spacy library: a part-of-speech (PoS) tagger with an accuracy of 97.2 and a constituency parser (Kitaev and Klein, 2018) with an F1 score of 96.3 on the revised WSJ test set (Bies et al., 2015). During the omission process, we use the PoS tags to find nouns, adjectives, adverbs, and numbers, and use the constituency tags to select only the modifiers. Thus, we find the NOUNM, ADJM, ADVM, and NUMM constructs. We collect SBAR and PP constructs by finding their corresponding tags in the constituency tree. Finally, for dates, we use two regular expressions for common date templates used in Wikipedia articles, <month name, date, year> or <date, month name, year>, and remove those parts of the templates whose removal preserves coherency: <date>, <year>, <month name and date>, or <year and date>.
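As an illustration, the date-modifier omission for the first of the two templates above can be sketched with a regular expression. This is a hypothetical sketch: the function name, the pattern, and the exact set of produced variants are our own, not the authors' code.

```python
import re

# Matches Wikipedia-style dates of the form "<month name> <day>, <year>",
# e.g. "March 4, 2019". Group 1 = month, group 2 = day, group 3 = year.
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
DATE_RE = re.compile(rf"({MONTHS}) (\d{{1,2}}), (\d{{4}})")

def omit_date_parts(sentence):
    """Return fluent variants of `sentence` with parts of the date removed:
    the day, the day and the month, or the year (as described in the text)."""
    m = DATE_RE.search(sentence)
    if not m:
        return []
    month, day, year = m.groups()
    return [
        sentence.replace(m.group(0), f"{month} {year}"),  # drop the day
        sentence.replace(m.group(0), year),               # drop day and month
        sentence.replace(m.group(0), f"{month} {day}"),   # drop the year
    ]

print(omit_date_parts("The film premiered on March 4, 2019 in Paris."))
```

Each variant keeps the remaining evidence grammatical while withholding one piece of temporal information, which is the property the omission method relies on.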
Overall, in this work, we perform a study of insufficient evidence for FC by removing information from the gold evidence. As explained in Section 2, we perform causal interventions on the evidence by omission to study when information is (in)sufficient for a model's prediction. Replacement of words is another operation that could be applied to the evidence. We could, for example, replace different types of named entities with pronouns, and different parts of speech with demonstrative pronouns, to induce insufficient information. However, the replacement operation does not allow for direct causal conclusions, as any change of a word to another could potentially introduce confounding factors between the newly introduced word and the model's predictions. Note that some pronouns are used in the evidence when they refer to the person/object of the article. We do not treat such cases as insufficient information, as the title of the page with the name of the person/object is always prepended to the sentence, which allows for coreference resolution. Finally, another possible operation is the insertion of new information, which would lead to insufficient evidence when performed on the claim. The latter, however, requires the insertion of text that preserves the grammatical correctness and meaning of the claim, which is hard to achieve in an automated way.

Models.
We train three Transformer-based FC models: BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and ALBERT (Lan et al., 2020). BERT is pre-trained with masked language modeling and next sentence prediction objectives on the Toronto Book Corpus (Kiros et al., 2015) and the English Wikipedia. It is also the most widely used pre-trained Transformer model. RoBERTa improves upon BERT by optimising key hyper-parameters, and is trained without the next sentence prediction objective. RoBERTa is one of the top-performing models on the GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) benchmarks, which are comprised of various NLP tasks. The latter also holds for ALBERT, another Transformer architecture that improves upon BERT. It does so with parameter-reduction techniques, which lower the memory consumption of the model. ALBERT also employs a self-supervised pre-training loss for inter-sentence coherence. The latter is found to be beneficial for tasks with multiple sentences, and Schuster et al. (2021) report improved FC robustness with it on VitaminC compared to BERT.
We train each model on the respective training splits of each dataset with the claim c and the gold evidence e as input to predict the gold veracity label y: f(c, e) = ŷ. We optimise the supervised cross-entropy loss:

L_CE = −Σ_{j=1}^{m} y_{i,j} log ŷ_{i,j}

where m is the label space size.
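For a single instance, the supervised cross-entropy objective reduces to the negative log-probability the model assigns to the gold class. A minimal numeric sketch (the probability values are illustrative, not from the paper):

```python
import math

def cross_entropy(probs, gold):
    """probs: predicted distribution over the m veracity labels;
    gold: index of the gold label. Only the gold class contributes."""
    return -math.log(probs[gold])

# Illustrative prediction over {SUPPORTS, REFUTES, NEI}, gold = SUPPORTS.
probs = [0.7, 0.2, 0.1]
print(round(cross_entropy(probs, gold=0), 3))  # 0.357
```

A confident correct prediction (probability near 1 on the gold class) drives the loss toward zero, which is what the training objective rewards.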
We then use an ensemble of these three different Transformer-based FC models to collect predictions for our new task, Evidence Sufficiency Prediction, as we want to find instances with omitted information that are more broadly applicable (e.g., those on which the models agree). The (dis)agreements between the models also allow us to study the differences between them in detecting omitted information. Transformer language models are pre-trained on large datasets, the veracity of which can change over time (Schuster et al., 2021). This makes it important that the FC models take into account the facts in the given evidence. Provided with the differences and similarities in the three FC models' predictions, future work could also investigate the degree to which different Transformer-based FC models encode FC-relevant world knowledge that they default to in their predictions.
Annotation Task. Next, we collect evidence with removed information as described above. We then use the models to find which pieces of the omitted evidence they consider important, i.e., where the omission results in a prediction change to NEI. We consider instances from the original test splits of each of the datasets where all models predicted the veracity correctly before the evidence omission was performed, as these are the cases where we can observe whether evidence omission causes the veracity prediction to change to NEI. We collect instances with omitted evidence information where the models: (1) agree that the evidence is still sufficient; (2) agree that it is insufficient; and (3) disagree in their predictions. We collect a total of 400 instances at the sentence level and 600 instances at the constituent level from the test splits of the corresponding datasets, distributed equally among the above three groups.
We employ annotators on Amazon Mechanical Turk (https://www.mturk.com/). We first train potential annotators, presenting them with annotation guidelines and illustrative examples. We then select annotators using a qualification test with nine test annotations for our task. Each annotation cost $0.10, and annotators were paid $10 on average per hour. The annotation task is to determine whether the evidence is still sufficient for predicting the label without the omitted information. If the remaining evidence is still sufficient, we ask them for the reason: whether this is because the removed evidence is repeated in the remaining text or because the removed evidence is not relevant to the veracity of the claim. Following the annotation guidelines for FEVER and HoVer, we ask the annotators not to use any world knowledge or knowledge they might have about the claim. For more details on the annotation task and the guidelines, we will release the dataset with a detailed README file.

The resulting dataset, SufficientFacts = {(x_i, y_i)}, consists of test instances x_i with labels y_i. All of the instances in SufficientFacts are a subset of the instances in the test datasets of FEVER, VitaminC, and HoVer, with the following changes. The input x_i is comprised of the original claim c_i and the evidence with omitted information e_i', where the tokens of e_i' are a subset of the tokens of the original gold evidence e_i of the instance. To re-iterate, the label of the originally selected instances is either SUPPORTS or REFUTES, i.e., they have sufficient gold evidence information. After omitting information from the evidence, the new label y_i becomes NEI if the majority of the annotators selected that important information was removed, and otherwise remains the original label: SUPPORTS and REFUTES for FEVER and VitaminC, or SUPPORTING for HoVer.
The resulting inter-annotator agreement (IAA) for SufficientFacts is 0.81 Fleiss' κ from three annotators. Due to the novelty of the introduced task of Evidence Sufficiency Prediction, we do not have direct points of comparison for IAA. However, as a reference, the IAA reported for the related task of fact checking is 0.63 Fleiss' κ for the HoVer dataset and 0.68 Fleiss' κ for the FEVER dataset, where, for both datasets, the annotators were thoroughly trained and highly paid. The biggest challenges for our annotators, judging by their errors during the qualification test, were avoiding the use of common knowledge and assumptions in their annotations, and the general complexity of the task.

SufficientFacts Analysis.
Overall Agreement with Annotators. The statistics of the resulting dataset, SufficientFacts, are presented in Table 3. We find that all three models agree that the remaining evidence is still sufficient (EI Agree) even when it has become insufficient after omitting information needed for verifying the claim (NEI) in 430 out of 1000 instances. We assume that these failures of all three models to detect missing information for FC point to the models making predictions based only on patterns observed in claims, or to the models defaulting to world knowledge encoded in the pre-trained Transformer models. We further find that when the models disagree about whether the remaining information is still sufficient (Disagree), they disagree mostly about instances where the omitted evidence information is needed for veracity prediction (NEI): in 823 out of 1000 instances. By contrast, when the models agree that the remaining evidence is insufficient, they are correct in 972 out of 1000 of the instances.
Separate Dataset Agreement with Annotators. Looking at the separate datasets, it is hardest for the models to identify missing evidence information needed for the fact check (EI Agree vs. NEI) for HoVer, particularly with sentence omissions, and easiest for the VitaminC dataset with constituent omissions.

Table 3: Statistics of SufficientFacts, presenting the predictions of the models in the ensemble (Model Pred: Agree Enough Information (EI Agree), Agree Not Enough Information (NEI Agree), Disagree, and Total) vs. human annotations of the same (EI Irrelevant (EI_I), EI Repeated (EI_R), NEI). We present sentence (SENT) and constituent omission (CONST) dataset splits separately. We embolden/underline results of the datasets for predictions where the three models agree (NEI Agree, EI Agree) and have the highest/lowest agreement with human annotations about EI_I, EI_R, and NEI predictions. We use light blue/dark blue to denote where lower/higher results are better.

We hypothesise that the latter is due to the HoVer dataset having more complex claims and requiring cross-sentence reasoning, whereas VitaminC contains contrastive instances which, during training, guide the models to identify the parts of the evidence needed for FC. Overall, the models fail to detect missing information more often from sentences than from constituents. We hypothesise that this is partly because models struggle to conduct multi-hop reasoning over multiple sentences. Another possible reason is that the models could be better at verifying the type of information removed from a sentence constituent than the information removed with an entire sentence.
Performance by Omitted Evidence Type and Model. Figure 2 presents an analysis of the performance of the models for different types of omitted constituents. We observe that it is hardest to detect that the evidence is missing information for the prediction (Correctly Predicted NEI) when the removed information is an adverbial modifier (ADVM), followed by subordinate clauses (SBAR). By contrast, it is easiest to detect missing information when it is a date modifier (DATEM), followed by number modifiers (NUMM). BERT has the lowest rate of correctly detecting insufficient evidence of the three models, followed by RoBERTa, whereas ALBERT performs best. We conjecture that this is due to RoBERTa being an optimisation of BERT, and due to ALBERT including pre-training with an inter-sentence coherence objective, which has been shown to make the model more robust for factual verification (Schuster et al., 2021). Even though ALBERT contains fewer parameters than BERT, it is still better at detecting when the evidence is insufficient. Finally, we see a natural trade-off between correctly detecting sufficient and correctly detecting insufficient information. In particular, some models, such as ALBERT, have a higher number of correct predictions on instances without enough information (Fig. 2, left). However, on instances with sufficient evidence information (Fig. 2, right), ALBERT has the lowest number of correct predictions. In contrast, BERT has the worst performance on the NEI instances, but the best performance on EI instances.

Evidence Omission Detection
To improve the performance of models in recognising when the evidence is not enough for verifying a claim, we experiment with CAD (§5.2) and a CL loss (§5.1). Both methods use contrastive data augmented with the proposed evidence omission method (§4.1) in combination with tri-training, as illustrated in Fig. 3. We omit information from the original (anchor) evidence to collect potential negative instances, which are missing important evidence information compared to the original evidence (Fig. 3, right). From the resulting candidates, we select as negative only those predicted as having insufficient information by the other two supervised models from the ensemble (§4), e.g., RoBERTa and ALBERT predict NEI when we are training a model with a BERT Transformer architecture. We also collect positive instances that still have sufficient evidence information after applying a data augmentation operation. For each instance x_i, we find one distractor sentence from the document of the gold evidence that is the most similar to the claim by word overlap. We append the distractor sentence to the original evidence, which serves as a positive instance (Fig. 3, left). Finally, we include the distractor sentence on its own as a negative instance, as it does not constitute sufficient evidence, in contrast to both the positive and the anchor instances. We conjecture that the latter serves as a training signal for avoiding the bias towards overlap between the claim and the evidence.
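The distractor-sentence selection can be sketched as follows. This is an illustrative sketch rather than the authors' code; the function names and the simple whitespace tokenisation are our own assumptions.

```python
def word_overlap(a, b):
    """Number of shared lowercased word types between two texts."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def pick_distractor(claim, document_sentences):
    """Return the document sentence most similar to the claim by word overlap."""
    return max(document_sentences, key=lambda s: word_overlap(claim, s))

# Illustrative example (invented claim and document sentences).
claim = "Anne Rice was born in New Orleans."
sentences = [
    "She grew up in a Catholic family.",
    "Rice was born in New Orleans, where she spent most of her life.",
    "Her novels sold millions of copies.",
]
print(pick_distractor(claim, sentences))
```

Appending the selected sentence to the gold evidence yields the positive instance, while the sentence alone serves as an overlap-heavy but insufficient negative instance.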

Contrastive Learning
We study self-supervised learning to train FC models that recognise when the evidence is not enough for verifying a claim. In particular, we propose to use self-supervised contrastive learning (CL) jointly with the supervised learning of the model to predict the support of the evidence for a claim. Given an anchor instance x_i, a positive instance x_i^+, and negative instances x_{i,j}^-, the objective of CL is to make the anchor and the positive instance closer in the representation space, and the anchor and the negative instances further apart. The anchor, positive, and negative instances are collected and/or augmented from the training splits of the corresponding datasets as described above. Each model, g(x) = l(h(x)) = l(e) = ŷ, uses 12 encoding layers to encode an input instance, h(x) = e, and uses the encoding e of the last encoding layer to predict the veracity label with a linear layer: l(e) = ŷ. We encode the anchor, the positive, and the negative instances with the corresponding model g, resulting in the anchor e_i, the positive e_i^+, and the negative e_{i,j}^- representations, and minimise the following CL loss:

L_CL = −log σ(s(e_i, e_i^+) − τ) − Σ_{j=1}^{K^-} log σ(−(s(e_i, e_{i,j}^-) − τ))

where s is a similarity function between the representations of two instances (cosine similarity in our case), σ is the sigmoid function, τ is a temperature parameter subtracted from the cosine similarity (Ma and Collins, 2018), and K^- is the number of negatives. Note that the CL loss is the same as Noise Contrastive Estimation (Ma and Collins, 2018) expressed as a binary objective loss. The representation of each instance is obtained by mean pooling of the word representations of the instance in the last layer of the model.
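A minimal numeric sketch of this objective, assuming the binary-NCE formulation with a sigmoid over similarity minus temperature (the similarity values below are illustrative, and the helper is our own, not the paper's implementation):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def contrastive_loss(sim_pos, sims_neg, tau=0.5):
    """Binary-NCE-style contrastive loss on precomputed cosine similarities.
    sim_pos: similarity of (anchor, positive);
    sims_neg: similarities of (anchor, negative_j), one per negative."""
    loss = -math.log(sigmoid(sim_pos - tau))                 # pull positive in
    loss += -sum(math.log(sigmoid(-(s - tau))) for s in sims_neg)  # push negatives out
    return loss

# A well-separated batch (positive close, negatives far) yields a lower
# loss than a poorly separated one.
good = contrastive_loss(sim_pos=0.9, sims_neg=[0.1, 0.0])
bad = contrastive_loss(sim_pos=0.1, sims_neg=[0.9, 0.8])
print(good < bad)  # True
```

Minimising this quantity drives the anchor representation toward the original evidence and away from the evidence with omitted important information, which is exactly the separation the task requires.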
We include the contrastive self-learning loss for those instances that are not annotated as NEI, as we cannot construct contrastive negative evidence with insufficient information for instances that already lack enough information for verification. Finally, the CL loss is optimised jointly with the supervised cross-entropy loss:

L = -Σ_i Σ_{c=1}^{m} 1[y_i = c] log ŷ_{i,c} + L_CL,

where ŷ_i is the label prediction of the model, m is the label space size, and y_i is the gold label for instance x_i, with y_i ∈ {0=SUPPORTS, 1=REFUTES, 2=NEI} for FEVER and VitaminC, and y_i ∈ {0=SUPPORTING, 1=NOT SUPPORTING} for HoVer.
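A sketch of the joint objective, where the CL term is skipped for NEI-labelled instances (names and the batch layout are illustrative):

```python
import math

def cross_entropy(probs, gold):
    """Negative log-likelihood of the gold class."""
    return -math.log(probs[gold])

def joint_loss(batch, nei_label=2):
    """batch: list of (predicted_probs, gold_label, cl_term) triples.

    Every instance contributes a supervised cross-entropy term; the
    contrastive term is added only for non-NEI instances, since no
    insufficient-evidence negative can be built for them.
    """
    total = 0.0
    for probs, gold, cl_term in batch:
        total += cross_entropy(probs, gold)
        if gold != nei_label:
            total += cl_term
    return total / len(batch)
```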

Counterfactual Data Augmentation
We also experiment with counterfactually augmented evidence, using the negative and positive instances constructed as described above (§5 and Fig. 3). As the models have high accuracy when they agree that a piece of evidence with omitted information is not sufficient (see agreement with human annotations in Table 3), we conjecture that the counterfactually augmented instances serve as a good training signal for detecting (in)sufficient evidence information without incurring annotation costs for training data. The counterfactually augmented data is thus simply combined with the training instances of each dataset.
In particular, we include in the training set the claim and the original evidence (anchor) with the corresponding gold label y_i. We include the positive instance, i.e., the original evidence with the distractor sentence appended to it, with the original gold label y_i. The negative instances, i.e., those with insufficient evidence information, are included with gold label y_i = NEI for FEVER and VitaminC, and y_i = NOT SUPPORTING for HoVer. Each model, h(c, e) = ŷ, receives as input the original claim c and the augmented or original evidence e, and predicts the veracity label ŷ. We optimise a supervised cross-entropy loss as per Equation 3.
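The resulting augmentation amounts to a simple relabelling scheme, sketched here with hypothetical field names:

```python
def augment_with_counterfactuals(dataset, nei_label):
    """Expand each example into (claim, evidence, label) training triples:
    the anchor and the positive keep the gold label, while negatives with
    omitted or insufficient evidence are relabelled as NEI
    (NOT SUPPORTING for HoVer)."""
    augmented = []
    for ex in dataset:
        augmented.append((ex["claim"], ex["evidence"], ex["label"]))   # anchor
        augmented.append((ex["claim"], ex["positive"], ex["label"]))   # + distractor
        for neg in ex["negatives"]:
            augmented.append((ex["claim"], neg, nei_label))            # insufficient
    return augmented
```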

Baseline Ensemble
We include a simple ensemble consisting of the three models: BERT, RoBERTa, and ALBERT. Each ensemble contains only supervised models (§4.2), only models trained with CAD (§5.2), or only models trained with the CL loss (§5.1). We employ majority voting, where the final prediction is the most common class among the predictions of the three models on an instance, defaulting to the class with the highest predicted probability when no class receives a majority.
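The voting rule can be sketched as follows (illustrative function names):

```python
from collections import Counter

def ensemble_predict(predictions, probabilities):
    """Majority vote over the three models' predicted classes; when all
    models disagree, fall back to the single most confident prediction."""
    label, count = Counter(predictions).most_common(1)[0]
    if count > 1:
        return label
    # no majority: pick the prediction with the highest probability
    most_confident = max(range(len(predictions)), key=lambda i: probabilities[i])
    return predictions[most_confident]
```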

Experimental Details
All models are trained on the respective training splits of each dataset. We select the checkpoint with the highest macro F1-score on the dev sets and provide results on the test sets. We note that for the newly introduced task Evidence Sufficiency Prediction, we have an annotated test dataset, SufficientFacts, but no training dataset.
The training is performed on the original training splits of the corresponding datasets, which have a different label distribution from the introduced diagnostic test set. Hence, it is possible that some of the instances in SufficientFacts are outside the original training distribution, which gives this diagnostic dataset a rather adversarial nature.

Supervised Model Performance
We start by discussing the performance of models trained on the supervised splits of the corresponding datasets to predict labels for claims in the newly created dataset SufficientFacts for Evidence Sufficiency Prediction, presented in Table 4. Recall that the instances in SufficientFacts had correct predictions from all models before the evidence omission was performed (§4.2), i.e., the performance of the models on the instances in SufficientFacts was 100 F1-score before the evidence omission. Hence, the omission of information from the evidence results in a performance decrease from 100 down to 58 F1-score (BERT on the HoVer dataset), i.e., a decrease of up to 42 F1-score. Out of the three FC models, BERT has the lowest performance on SufficientFacts, whereas ALBERT has the highest. The latter corroborates that ALBERT is a more robust model for fact verification, as explained in more detail in Section 4.2.
Further, we observe the worst performance on SufficientFacts for the HoVer dataset (down to 58 F1-score), followed by FEVER, with the best performance on VitaminC. We suggest that the contrastive nature of the instances in VitaminC, which contain factual edits of the evidence that change the support of the evidence for the claim (as described in Section 3), can indeed provide a better learning signal to the models about which parts of the evidence are important for verifying the claim.

CL and Augmented Model Performance
Including a CL loss or CAD results in improvements for all models and datasets on SufficientFacts, by up to 17.2 F1-score. Note that the proposed technique does not incur additional annotation costs for Evidence Sufficiency Prediction training data. This corroborates that our proposed evidence omission approach combined with tri-training improves the recognition of (in)sufficient evidence. This, in turn, improves the performance on the original test sets by up to 3.6 F1-score. Comparing the CL loss with counterfactually augmented data, we see that CAD improves model performance on SufficientFacts in more cases, except for ALBERT on the FEVER dataset. This could be because the augmented data uses the raw labels obtained with tri-training, while the CL loss only drives the negative instances apart from the anchor in the representation space. Finally, we compare the performance of CAD and the CL loss, which rely on the agreement predictions of the supervised models, with the simple majority voting ensembles (§5.3). Single models trained with CAD and the CL loss still outperform the ensembles of the supervised models. A majority voting ensemble of the models trained with CAD and the CL loss improves the performance on the original and SufficientFacts test sets even further.

Comparison to Related Work
We further compare the performance of our models to existing systems on the used datasets (see Table 5). Note that we are particularly interested in veracity prediction, to study what evidence models consider sufficient for factuality prediction. Thus, in the base setting, we do not conduct evidence retrieval, as typically performed for the HoVer and FEVER datasets, but train models using gold (oracle) evidence. For FEVER, existing systems report results on the combined retrieval and verification task; hence, we can only compare to the veracity prediction results with oracle evidence reported in the FEVER dataset paper with a Decomposable Attention (DA) model (Parikh et al., 2016). For HoVer and VitaminC, the presented results are also from the dataset papers, for models trained with oracle evidence. As there are no other reported results on these datasets, they also represent the state of the art for these two datasets. To compare to them, we pick those of our models with the same Transformer architecture as used in the respective dataset papers, and the best-performing model architecture for FEVER. Note that we use the same training setting as in related work (§5.4) for all models and datasets. We find that our supervised models are close in performance to prior reported results. Furthermore, including counterfactual data augmentation and contrastive learning leads to improvements over prior results for all three datasets, by up to 5.77 F1-score.

Incorrect Evidence
So far, we have studied model performance on instances with information omitted from the gold evidence.
We now probe how well the models detect missing information given retrieved incorrect evidence, which does not contain sufficient information. The latter is possible in real-world scenarios: the evidence fed to the fact checking model depends on the preceding evidence retrieval step, which can retrieve gold evidence with varying performance. While the fact checking model is possibly trained on gold evidence to avoid learning spurious correlations, we want to evaluate its capability to recognise when the retrieval system has returned incorrect evidence as well. Note that current FC benchmarks do not consider the prediction of a veracity model when the correct evidence is not retrieved. However, in realistic situations, we do not know whether the evidence is correct, and FC models would still provide a veracity prediction for a claim. Hence, we further study the performance of models on incorrect evidence. For each instance in the original test splits, we retrieve incorrect evidence by selecting the closest evidence of another claim in the dataset by word overlap between the claim and the evidence candidates. We then use the retrieved instead of the original evidence. This results in a test set of claims with incorrect evidence of the same size as the original test split. Table 6 reports results on the test datasets with incorrect evidence. As all instances in the dataset have the new gold label NEI, we report accuracy, which corresponds to the ratio of instances with a predicted NEI label. We find that the performance of the models improves by as much as 27 accuracy points after training with CAD or CL, which is another indication of the effectiveness of the proposed training methods. We also find that CAD again brings larger performance gains than CL, except for HoVer, where the two approaches achieve very similar accuracy scores. The extended evaluation on incorrect evidence is an important complement to the study of missing evidence. However, the two are not necessarily directly comparable. First, the two test datasets in Table 4, the Original Test and SufficientFacts, both have instances with and without sufficient evidence, whereas the extended study on incorrect evidence in this section only has instances without sufficient evidence. This also results in our use of different measures to report results: accuracy in Table 6, which is the percentage of detected incorrectly retrieved evidence, and macro F1-score in Table 4, which combines the performance on up to three classes in a balanced way.
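The construction of the incorrect-evidence test set described above can be sketched as follows (illustrative names; the overlap measure is simplified):

```python
def overlap(claim, evidence):
    """Number of shared lower-cased tokens between claim and evidence."""
    return len(set(claim.lower().split()) & set(evidence.lower().split()))

def build_incorrect_evidence_split(instances):
    """instances: list of (claim, evidence) pairs from the original test split.

    For each claim, swap in the most word-overlapping evidence belonging to
    a *different* claim; every resulting pair gets the gold label NEI.
    """
    swapped = []
    for i, (claim, _) in enumerate(instances):
        best = max(
            (ev for j, (_, ev) in enumerate(instances) if j != i),
            key=lambda ev: overlap(claim, ev),
        )
        swapped.append((claim, best))
    return swapped
```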
However, it is worth addressing the high performance of the models on the irrelevant evidence dataset. We employ evidence that has word overlap with the claim but is not necessarily semantically similar to it. If the models were to rely only on features of the claim or on surface word overlap between the claim and the evidence, they would have low performance on the irrelevant evidence dataset. We train models to avoid such spurious correlations with CAD and the CL loss, which makes discovering missing evidence information in irrelevant evidence easier, leading to the observed high performance in Table 6.

Error Analysis
Lastly, we conduct an error analysis on the newly introduced SufficientFacts to understand whether known biases in models trained on FC datasets (§2) also affect predictions on SufficientFacts.
Claim-Only Prediction. Schuster et al. (2019) found that FC models often learn spurious correlations and can predict the correct label even when no evidence is provided, as they learn features of the claim only. We investigate whether this is also among the reasons for incorrect predictions of the models on the SufficientFacts dataset. We compute the percentage of instances in SufficientFacts where the models do not predict NEI when provided with the claim only. We find that for the HoVer dataset, the supervised BERT model does not predict an NEI label for 36% of the instances in SufficientFacts, whereas the respective number is 23% for RoBERTa and 14% for ALBERT. This indicates that supervised models trained on HoVer learn claim-only features for some instances. After training the models with CAD (§5.2) and the CL loss (§5.1), fewer than 1% of instances from SufficientFacts are predicted as having enough information by each of the three models when given only the claim. This indicates that training with CAD and the CL loss decreases the claim-only bias for the HoVer dataset. For FEVER and VitaminC, we find a lower percentage of instances (fewer than 4%) in the corresponding SufficientFacts splits that the supervised models predict as having enough information when given only the claim. We hypothesise that this is due to the larger amount of training data in both datasets and to the contrastive nature of VitaminC, which requires the models to learn features from the evidence as well. The percentage again decreases after training with CAD and CL (fewer than 1%). Finally, we find that the instances that are still not detected as having insufficient evidence after training with CAD/CL loss are those that the model could have gained world knowledge about during pre-training. One example of such a claim is given in Table 7, row 3.

Claim-Evidence Overlap. Schuster et al. (2021) also find that FC models are biased towards predicting the SUPPORTS class when the overlap between the claim and the evidence is high. We conjecture that this is another possible reason why the instances in SufficientFacts are hard for the models to recognise as missing important evidence information: their evidence still has high overlap with the claim. To probe this, we compute the average overlap between the claim and the evidence, disregarding stop words, for instances in SufficientFacts that are predicted as having insufficient information by the supervised models and by the models trained with CAD and the CL loss. For FEVER and HoVer, the instances predicted as NEI by the supervised models have lower overlap with the claim, and the overlap increases after training with CAD and the CL loss (61% to 68% for HoVer and 63% to 65% for FEVER). An example instance where the evidence has high overlap with the claim and is predicted as NEI only after training with CAD and the CL loss can be found in Table 7, row 1. The latter is an indication that training with CAD and the CL loss also reduces the overlap bias of FC models. We do not observe a change in the overlap ratio for VitaminC, where we assume that training with contrastive instances already prevents learning biases, including the overlap bias.
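The overlap statistic can be computed as sketched below (toy stop-word list for illustration; a full list, e.g. NLTK's, would be used in practice):

```python
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and", "to"}  # abbreviated

def claim_evidence_overlap(claim, evidence):
    """Share of non-stop-word claim tokens that also occur in the evidence."""
    claim_tokens = {t for t in claim.lower().split() if t not in STOP_WORDS}
    ev_tokens = {t for t in evidence.lower().split() if t not in STOP_WORDS}
    return len(claim_tokens & ev_tokens) / max(len(claim_tokens), 1)
```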
Spurious Patterns. Finally, we investigate whether the models learn other spurious patterns that could lead to low results on SufficientFacts. We already observed that for some instances, the supervised models predict that the evidence is not sufficient after removing irrelevant information (Table 3), which is one indication of learned spurious patterns. Further, when removing important information, the supervised models still predict the same label for some instances, as they rely on other parts of the input, which might not be important. Table 7 shows one example where the supervised models did not recognise that the evidence is missing important information (row 1), but after training with CAD or the CL loss, it was detected as NEI. However, there are still possible spurious correlations that the models learn even after training with CAD or the CL loss, e.g., the example in row 4. Another such example is in row 3, where even after training with CAD and the CL loss, the models still find the claim, without any provided evidence, sufficient for predicting a refuted claim. As this example relies on knowledge of common facts, we assume that the models rely on knowledge obtained during pre-training or fine-tuning instead. Finally, we find that CAD can prevent the model from learning spurious correlations more than the CL loss can. This leads to more instances having the correct prediction only after training with CAD, as in the example in row 2.

Conclusion
We propose a new task related to fact checking, namely detecting when evidence with omitted information is (in)sufficient. To this end, we conducted an in-depth empirical analysis with a newly introduced fluency-preserving method for omitting evidence information. We compared what Transformer-based models and humans find to be sufficient information for FC, resulting in a novel dataset, SufficientFacts. Finally, we showed that the proposed evidence omission method can be used for collecting contrastive examples for CL and CAD, which improved the performance of the studied models on the Evidence Sufficiency Prediction task and on veracity prediction.
The resulting models could be applied to detect emergent false claims, which gain popularity before any reputable source can refute them, as our proposed models can indicate when the provided input is insufficient for making a decision and whether to provide the user with a veracity prediction. Such models could also be used for detecting knowledge or evidence gaps that need to be filled to refute or support popular claims. Another possible future research direction would be to build FC models that indicate the particular part of the claim for which they are missing supporting evidence. Moreover, our proposed analysis and methods could be applied to other knowledge-intensive tasks, such as question answering.

Figure 2 :
Figure 2: SufficientFacts: fine-grained analysis by type of removed evidence information (§4.1) vs. proportion of correct predictions on NEI/EI instances. The proportion is computed for the separate models (BERT, RoBERTa, ALBERT) and for all three models agreeing on the correct NEI/EI label (All). The total number of NEI/EI instances of each type is provided under each type of removed evidence information. A higher proportion of correct predictions is better.

Table 1 :
Sizes and example instances for the studied fact checking datasets (see §3). Example claim: "Sindh borders Indian states and is in India." Evidence: "[Sindh] Sindh is home to a large portion of Pakistan's industrial sector and contains two of Pakistan's commercial seaports: Port Bin Qasim and the Karachi Port." Another example claim: "Westlife sold more than 1 m. video albums and made over 23.5 m. sales in the UK."

Table 2 :
Examples from the FEVER dataset of constituent types (§4.1) removed from the evidence for a claim with Label (L) one of SUPPORTS (S) or REFUTES (R).

Table 4 :
Macro F1-score test performance of models and an ensemble (Ens.) (§5.3) trained on the supervised training splits of each dataset (Supervised), and additionally with the contrastive objective (+CL) (§5.1) and the counterfactually augmented data (+CAD) (§5.2). Results are the average of three different seed runs. The highest results for a test dataset and a model are in bold, and the overall highest result of a model for a test dataset is additionally underlined.