Abstract
Fact-checking has become increasingly important due to the speed with which both information and misinformation can spread in the modern media ecosystem. Therefore, researchers have been exploring how fact-checking can be automated, using techniques based on natural language processing, machine learning, knowledge representation, and databases to automatically predict the veracity of claims. In this paper, we survey automated fact-checking stemming from natural language processing, and discuss its connections to related tasks and disciplines. In this process, we present an overview of existing datasets and models, aiming to unify the various definitions given and identify common concepts. Finally, we highlight challenges for future research.
1 Introduction
Fact-checking is the task of assessing whether claims made in written or spoken language are true. This is an essential task in journalism, and is commonly conducted manually by dedicated organizations such as PolitiFact. In addition to external fact-checking, internal fact-checking is also performed by publishers of newspapers, magazines, and books prior to publishing in order to promote truthful reporting. Figure 1 shows an example from PolitiFact, together with the evidence (summarized) and the verdict.
Fact-checking is a time-consuming task. To assess the claim in Figure 1, a journalist would need to search through potentially many sources to find job gains under Trump and Obama, evaluate the reliability of each source, and make a comparison. This process can take professional fact-checkers several hours or days (Hassan et al., 2015; Adair et al., 2017). Compounding the problem, fact-checkers often work under strict and tight deadlines, especially in the case of internal processes (Borel, 2016; Godler and Reich, 2017), and some studies have shown that less than half of all published articles have been subject to verification (Lewis et al., 2008). Given the amount of new information that appears and the speed with which it spreads, manual validation is insufficient.
Automating the fact-checking process has been discussed in the context of computational journalism (Flew et al., 2010; Cohen et al., 2011; Graves, 2018), and has received significant attention in the artificial intelligence community. Vlachos and Riedel (2014) proposed structuring it as a sequence of components—identifying claims to be checked, finding appropriate evidence, producing verdicts—that can be modeled as natural language processing (NLP) tasks. This motivated the development of automated pipelines consisting of subtasks that can be mapped to tasks well-explored in the NLP community. Advances were made possible by the development of datasets, consisting of either claims collected from fact-checking websites, for example Liar (Wang, 2017), or purpose-made for research, for example, FEVER (Thorne et al., 2018a).
A growing body of research explores the various tasks and subtasks necessary to automate fact-checking, and new methods continue to be developed to address emerging challenges. Early developments were surveyed in Thorne and Vlachos (2018), which remains the closest to an exhaustive overview of the subject. However, their proposed framework does not include work on determining which claims to verify (i.e., claim detection), nor does their survey cover the recent work on producing explainable, convincing verdicts (i.e., justification production).
Several recent papers have surveyed research focusing on individual components of the task. Zubiaga et al. (2018) and Islam et al. (2020) focus on identifying rumors on social media, Küçük and Can (2020) and Hardalov et al. (2021) on detecting the stance of a given piece of evidence towards a claim, and Kotonya and Toni (2020a) on producing explanations and justifications for fact-checks. Finally, Nakov et al. (2021a) surveyed automated approaches to assist fact-checking by humans. While these surveys are extremely useful in understanding various aspects of fact-checking technology, they are fragmented and focused on specific subtasks and components; our aim is to give a comprehensive and exhaustive bird's-eye view of the subject as a whole.
A number of papers have surveyed related tasks. Lazer et al. (2018) and Zhou and Zafarani (2020) surveyed work on fake news, including descriptive work on the problem, as well as work seeking to counteract fake news through computational means. A comprehensive review of NLP approaches to fake news detection was also provided in Oshikawa et al. (2020). However, fake news detection differs in scope from fact-checking, as the former focuses on assessing news articles, and includes labeling items based on aspects not related to veracity, such as satire detection (Oshikawa et al., 2020; Zhou and Zafarani, 2020). Furthermore, other factors—such as the audience reached by the claim, and the intentions and forms of the claim—are often considered. These factors also feature in the context of propaganda detection, recently surveyed by Da San Martino et al. (2020b). Unlike these efforts, the works discussed in this survey concentrate on assessing the veracity of general-domain claims. Finally, Shu et al. (2017) and da Silva et al. (2019) surveyed research on fake news detection and fact-checking with a focus on social media data, while this survey covers fact-checking across domains and sources, including newswire, science, etc.
In this paper, we present a comprehensive and up-to-date survey of automated fact-checking, unifying various definitions developed in previous research into a common framework. We begin by defining the three stages of our fact-checking framework—claim detection, evidence retrieval, and claim verification, the latter consisting of verdict prediction and justification production. We then give an overview of the existing datasets and modeling strategies, taxonomizing these and contextualizing them with respect to our framework. We finally discuss key research challenges that have been addressed, and give directions for challenges that we believe should be tackled by future research. We accompany the survey with a repository,1 which lists the resources mentioned in our survey.
2 Task Definition
Figure 2 shows an NLP framework for automated fact-checking consisting of three stages: (i) claim detection to identify claims that require verification; (ii) evidence retrieval to find sources supporting or refuting the claim; and (iii) claim verification to assess the veracity of the claim based on the retrieved evidence. Evidence retrieval and claim verification are sometimes tackled as a single task referred to as factual verification, while claim detection is often tackled separately. Claim verification can be decomposed into two parts that can be tackled separately or jointly: verdict prediction, where claims are assigned truthfulness labels, and justification production, where explanations for verdicts must be produced.
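To make this decomposition concrete, the sketch below expresses the three stages as plain Python interfaces. It is only an illustrative skeleton: the class and function names, and the placeholder heuristics inside them, are ours and do not correspond to any particular system surveyed here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Claim:
    text: str
    check_worthy: Optional[bool] = None


@dataclass
class FactCheck:
    claim: Claim
    evidence: List[str]
    verdict: str        # e.g., "supported", "refuted", "not enough info"
    justification: str  # human-readable explanation of the verdict


def detect_claims(sentences: List[str]) -> List[Claim]:
    """Stage 1: flag sentences that require verification (placeholder heuristic)."""
    return [Claim(text=s, check_worthy=True) for s in sentences if not s.endswith("?")]


def retrieve_evidence(claim: Claim, corpus: List[str], k: int = 5) -> List[str]:
    """Stage 2: return the k most relevant passages (placeholder: word overlap)."""
    overlap = lambda p: len(set(p.lower().split()) & set(claim.text.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]


def verify(claim: Claim, evidence: List[str]) -> FactCheck:
    """Stage 3: verdict prediction and justification production (placeholder)."""
    verdict = "not enough info" if not evidence else "supported"
    justification = f"Based on {len(evidence)} retrieved passage(s)."
    return FactCheck(claim=claim, evidence=evidence, verdict=verdict, justification=justification)
```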
2.1 Claim Detection
The first stage in automated fact-checking is claim detection, where claims are selected for verification. Commonly, detection relies on the concept of check-worthiness. Hassan et al. (2015) defined check-worthy claims as those for which the general public would be interested in knowing the truth. For example, “over six million Americans had COVID-19 in January” would be check-worthy, as opposed to “water is wet”. This can involve a binary decision for each potential claim, or an importance-ranking of claims (Atanasova et al., 2018; Barrón-Cedeño et al., 2020). The latter parallels standard practice in internal journalistic fact-checking, where deadlines often require fact-checkers to employ a triage system (Borel, 2016).
Another instantiation of claim detection based on check-worthiness is rumor detection. A rumor can be defined as an unverified story or statement circulating (typically on social media) (Ma et al., 2016; Zubiaga et al., 2018). Rumor detection considers language subjectivity and growth of readership through a social network (Qazvinian et al., 2011). Typical input to a rumor detection system is a stream of social media posts, whereupon a binary classifier has to determine if each post is rumorous. Metadata, such as the number of likes and re-posts, is often used as features to identify rumors (Zubiaga et al., 2016; Gorrell et al., 2019; Zhang et al., 2021).
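As a minimal illustration of framing claim detection as supervised classification over text plus metadata, the sketch below combines TF-IDF features with simple engagement counts. The toy posts, feature choices, and threshold-versus-ranking comment are illustrative assumptions, not a reproduction of any cited system.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data: post text plus [likes, reposts]; label 1 = check-worthy/rumorous.
posts = ["over six million Americans had COVID-19 in January", "good morning everyone!"]
metadata = np.array([[1200, 450], [3, 0]], dtype=float)
labels = [1, 0]

vectorizer = TfidfVectorizer()
text_features = vectorizer.fit_transform(posts)
features = hstack([text_features, csr_matrix(metadata)])  # text + engagement features

clf = LogisticRegression().fit(features, labels)

# Scores can be thresholded (binary detection) or used to rank claims for triage.
new_post = ["the new policy was announced by officials yesterday"]
new_features = hstack([vectorizer.transform(new_post), csr_matrix([[500.0, 120.0]])])
print(clf.predict_proba(new_features)[:, 1])
```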
Check-worthiness and rumorousness can be subjective. For example, the importance placed on countering COVID-19 misinformation is not uniform across every social group. The check-worthiness of each claim also varies over time, as countering misinformation related to current events is in many cases understood to be more important than countering older misinformation (e.g., misinformation about COVID-19 has a greater societal impact in 2021 than misinformation about the Spanish flu). Furthermore, older rumors may have already been debunked by journalists, reducing their impact. Misinformation that is harmful to marginalized communities may also be judged to be less check-worthy by the general public than misinformation that targets the majority. Conversely, claims originating from marginalized groups may be subject to greater scrutiny than claims originating from the majority; for example, journalists have been shown to assign greater trust and therefore lower need for verification to stories produced by male sources (Barnoy and Reich, 2019). Such biases could be replicated in datasets that capture the (often implicit) decisions made by journalists about which claims to prioritize.
Instead of using subjective concepts, Konstantinovskiy et al. (2021) framed claim detection as whether a claim makes an assertion about the world that is checkable, that is, whether it is verifiable with readily available evidence. Claims based on personal experiences or opinions are uncheckable. For example, “I woke up at 7 am today” is not checkable because appropriate evidence cannot be collected; “cubist art is beautiful” is not checkable because it is a subjective statement.
2.2 Evidence Retrieval
Evidence retrieval aims to find information beyond the claim—for example, text, tables, knowledge bases, images, relevant metadata—to indicate veracity. Some earlier efforts do not use any evidence beyond the claim itself (Wang, 2017; Rashkin et al., 2017; Volkova et al., 2017; Dungs et al., 2018). Relying on surface patterns of claims without considering the state of the world fails to identify well-presented misinformation, including machine-generated claims (Schuster et al., 2020). Recent developments in natural language generation have exacerbated this issue (Radford et al., 2019; Brown et al., 2020), with machine-generated text sometimes being perceived as more trustworthy than human-written text (Zellers et al., 2019). In addition to enabling verification, evidence is essential for generating verdict justifications to convince users of fact-checks.
Stance detection can be viewed as an instantiation of evidence retrieval, which typically assumes a more limited amount of potential evidence and predicts its stance towards the claim. For example, Ferreira and Vlachos (2016) used news article headlines from the Emergent project2 as evidence to predict whether articles supported, refuted, or merely reported a claim. The Fake News Challenge (Pomerleau and Rao, 2017) further used entire documents, allowing for evidence from multiple sentences. More recently, Hanselowski et al. (2019) filtered out irrelevant sentences in the summaries of fact-checking articles to obtain fine-grained evidence via stance detection. While both stance detection and evidence retrieval in the context of claim verification are classification tasks, what is considered evidence in the former is broader, including, for example, a social media post responding “@AJENews @germanwings yes indeed:-(.” to a claim (Gorrell et al., 2019).
A fundamental issue is that not all available information is trustworthy. Most fact-checking approaches implicitly assume access to a trusted information source such as encyclopedias (e.g., Wikipedia [Thorne et al., 2018a]) or results provided (and thus vetted) by search engines (Augenstein et al., 2019). Evidence is then defined as information that can be retrieved from this source, and veracity as coherence with the evidence. For real-world applications, evidence must be curated through the manual efforts of journalists (Borel, 2016), automated means (Li et al., 2015), or their combination. For example, Full Fact uses tables and legal documents from government organizations as evidence.3
2.3 Verdict Prediction
Given an identified claim and the pieces of evidence retrieved for it, verdict prediction attempts to determine the veracity of the claim. The simplest approach is binary classification, for example, labeling a claim as true or false (Nakashole and Mitchell, 2014; Popat et al., 2016; Potthast et al., 2018). When evidence is used to verify the claim, it is often preferable to use supported/refuted (by evidence) instead of true/false respectively, as in many cases the evidence itself is not assessed by the systems. More broadly, it would be dangerous to make such strong claims about the world given the well-known limitations of automated fact-checking (Graves, 2018).
Many versions of the task employ finer-grained classification schemes. A simple extension is to use an additional label denoting a lack of information to predict the veracity of the claim (Thorne et al., 2018a). Beyond that, some datasets and systems follow the approach taken by journalistic fact-checking agencies, employing multi-class labels representing degrees of truthfulness (Wang, 2017; Alhindi et al., 2018; Shahi and Nandini, 2020; Augenstein et al., 2019).
2.4 Justification Production
Justifying decisions is an important part of journalistic fact-checking, as fact-checkers need to convince readers of their interpretation of the evidence (Uscinski and Butler, 2013; Borel, 2016). Debunking purely by calling something false often fails to be persuasive, and can induce a “backfire” effect where belief in the erroneous claim is reinforced (Lewandowsky et al., 2012). This need is even greater for automated fact-checking, which may employ black-box components. When developers deploy black-box models whose decision-making processes cannot be understood, these artefacts can lead to unintended, harmful consequences (O’Neil, 2016). Developing techniques that explain model predictions has been suggested as a potential remedy to this problem (Lipton, 2018), and recent work has focused on the generation of justifications (see Kotonya and Toni’s [2020a] survey of explainable claim verification). Research so far has focused on justification production for claim verification, as the latter is often the most scrutinized stage in fact-checking. Nevertheless, explainability may also be desirable and necessary for the other stages in our framework.
Justification production for claim verification typically relies on one of three strategies. First, attention weights can be used to highlight the salient parts of the evidence, in which case justifications typically consist of scores for each evidence token (Popat et al., 2018; Shu et al., 2019; Lu and Li, 2020). Second, decision-making processes can be designed to be understandable by human experts, for example, by relying on logic-based systems (Gad-Elrab et al., 2019; Ahmadi et al., 2019); in this case, the justification is typically the derivation for the veracity of the claim. Finally, the task can be modeled as a form of summarization, where systems generate textual explanations for their decisions (Atanasova et al., 2020b). While some of these justification types require additional components, we did not introduce a fourth stage in our framework as in some cases the decision-making process of the model is self-explanatory (Gad-Elrab et al., 2019; Ahmadi et al., 2019).
A basic form of justification is to show which pieces of evidence were used to reach a verdict. However, a justification must also explain how the retrieved evidence was used, explain any assumptions or commonsense facts employed, and show the reasoning process taken to reach the verdict. Presenting the evidence returned by a retrieval system can as such be seen as a rather weak baseline for justification production, as it does not explain the process used to reach the verdict. There is furthermore a subtle difference between evaluation criteria for evidence and justifications: Good evidence facilitates the production of a correct verdict; a good justification accurately reflects the reasoning of the model through a readable and plausible explanation, regardless of the correctness of the verdict. This introduces different considerations for justification production, for example, readability (how accessible an explanation is to humans), plausibility (how convincing an explanation is), and faithfulness (how accurately an explanation reflects the reasoning of the model) (Jacovi and Goldberg, 2020).
3 Datasets
Datasets can be analyzed along three axes aligned with the three stages of the fact-checking framework (Figure 2): the input, the evidence used, and the verdicts and justifications that constitute the output. In this section we bring together efforts that emerged in different communities using different terminologies, but that could nevertheless be used to develop and evaluate models for the same task.
3.1 Input
We first consider the inputs to claim detection (summarized in Table 1), as their format and content influence the rest of the process. A typical input is a social media post with textual content. Zubiaga et al. (2016) constructed PHEME based on source tweets in English and German that attracted a number of retweets exceeding a predefined threshold. Derczynski et al. (2017) introduced the shared task RumourEval using the English section of PHEME; for the 2019 iteration of the shared task, this dataset was further expanded to include Reddit and new Twitter posts (Gorrell et al., 2019). Following the same annotation strategy, Lillie et al. (2019) constructed a Danish dataset by collecting posts from Reddit. Instead of considering only source tweets, subtasks in CheckThat (Barrón-Cedeño et al., 2020; Nakov et al., 2021b) viewed every post as part of the input. A set of auxiliary questions, such as “does it contain a factual claim?” and “is it of general interest?”, was created to help annotators identify check-worthy posts. Since an individual post may contain limited context, other works (Mitra and Gilbert, 2015; Ma et al., 2016; Zhang et al., 2021) represented each claim by a set of relevant posts, for example, the thread they originate from.
Table 1: Datasets for claim detection.

| Dataset | Type | Input | #Inputs | Evidence | Verdict | Sources | Lang |
|---|---|---|---|---|---|---|---|
| CredBank (Mitra and Gilbert, 2015) | Worthy | Aggregate | 1,049 | Meta | 5 Classes | Twitter | En |
| Weibo (Ma et al., 2016) | Worthy | Aggregate | 5,656 | Meta | 2 Classes | Twitter/Weibo | En/Ch |
| PHEME (Zubiaga et al., 2016) | Worthy | Individual | 330 | Text/Meta | 3 Classes | Twitter | En/De |
| RumourEval19 (Gorrell et al., 2019) | Worthy | Individual | 446 | Text/Meta | 3 Classes | Twitter/Reddit | En |
| DAST (Lillie et al., 2019) | Worthy | Individual | 220 | Text/Meta | 3 Classes | Reddit | Da |
| Suspicious (Volkova et al., 2017) | Worthy | Individual | 131,584 | ✗ | 2/5 Classes | Twitter | En |
| CheckThat20-T1 (Barrón-Cedeño et al., 2020) | Worthy | Individual | 8,812 | ✗ | Ranking | Twitter | En/Ar |
| CheckThat21-T1A (Nakov et al., 2021b) | Worthy | Individual | 17,282 | ✗ | 2 Classes | Twitter | Many |
| Debate (Hassan et al., 2015) | Worthy | Statement | 1,571 | ✗ | 3 Classes | Transcript | En |
| ClaimRank (Gencheva et al., 2017) | Worthy | Statement | 5,415 | ✗ | Ranking | Transcript | En |
| CheckThat18-T1 (Atanasova et al., 2018) | Worthy | Statement | 16,200 | ✗ | Ranking | Transcript | En/Ar |
| CitationReason (Redi et al., 2019) | Checkable | Statement | 4,000 | Meta | 13 Classes | Wikipedia | En |
| PolitiTV (Konstantinovskiy et al., 2021) | Checkable | Statement | 6,304 | ✗ | 7 Classes | Transcript | En |
The second type of textual input is a document consisting of multiple claims. For Debate (Hassan et al., 2015), professionals were asked to select check-worthy claims from U.S. presidential debates to ensure good agreement and a shared understanding of the assumptions. On the other hand, Konstantinovskiy et al. (2021) collected checkable claims from transcripts via crowd-sourcing, where workers labeled claims based on a predefined taxonomy. Unlike prior work focused on the political domain, Redi et al. (2019) sampled sentences containing citations from Wikipedia articles, and asked crowd-workers to annotate them based on citation policies.
Next, we discuss the inputs to factual verification. The most popular type of input to verification is textual claims, which is expected given that they are often the output of claim detection. These tend to be sentence-level statements, a practice common among fact-checkers in order to include only the context relevant to the claim (Mena, 2019). Many existing efforts (Vlachos and Riedel, 2014; Wang, 2017; Hanselowski et al., 2019; Augenstein et al., 2019) constructed datasets by crawling real-world claims from dedicated websites (e.g., PolitiFact) due to their availability (see Table 2). Unlike previous work that focuses on English, Gupta and Srikumar (2021) collected non-English claims from 25 languages.
Table 2: Datasets with real-world claims for factual verification.

| Dataset | Input | #Inputs | Evidence | Verdict | Sources | Lang |
|---|---|---|---|---|---|---|
| CrimeVeri (Bachenko et al., 2008) | Statement | 275 | ✗ | 2 Classes | Crime | En |
| Politifact (Vlachos and Riedel, 2014) | Statement | 106 | Text/Meta | 5 Classes | Fact Check | En |
| StatsProperties (Vlachos and Riedel, 2015) | Statement | 7,092 | KG | Numeric | Internet | En |
| Emergent (Ferreira and Vlachos, 2016) | Statement | 300 | Text | 3 Classes | Emergent | En |
| CreditAssess (Popat et al., 2016) | Statement | 5,013 | Text | 2 Classes | Fact Check/Wiki | En |
| PunditFact (Rashkin et al., 2017) | Statement | 4,361 | ✗ | 2/6 Classes | Fact Check | En |
| Liar (Wang, 2017) | Statement | 12,836 | Meta | 6 Classes | Fact Check | En |
| Verify (Baly et al., 2018) | Statement | 422 | Text | 2 Classes | Fact Check | Ar/En |
| CheckThat18-T2 (Barrón-Cedeño et al., 2018) | Statement | 150 | ✗ | 3 Classes | Transcript | En |
| Snopes (Hanselowski et al., 2019) | Statement | 6,422 | Text | 3 Classes | Fact Check | En |
| MultiFC (Augenstein et al., 2019) | Statement | 36,534 | Text/Meta | 2–27 Classes | Fact Check | En |
| Climate-FEVER (Diggelmann et al., 2020) | Statement | 1,535 | Text | 4 Classes | Climate | En |
| SciFact (Wadden et al., 2020) | Statement | 1,409 | Text | 3 Classes | Science | En |
| PUBHEALTH (Kotonya and Toni, 2020b) | Statement | 11,832 | Text | 4 Classes | Fact Check | En |
| COVID-Fact (Saakyan et al., 2021) | Statement | 4,086 | Text | 2 Classes | Forum | En |
| X-Fact (Gupta and Srikumar, 2021) | Statement | 31,189 | Text | 7 Classes | Fact Check | Many |
| cQA (Mihaylova et al., 2018) | Answer | 422 | Meta | 2 Classes | Forum | En |
| AnswerFact (Zhang et al., 2020) | Answer | 60,864 | Text | 5 Classes | Amazon | En |
| NELA (Horne et al., 2018) | Article | 136,000 | ✗ | 2 Classes | News | En |
| BuzzfeedNews (Potthast et al., 2018) | Article | 1,627 | Meta | 4 Classes | Facebook | En |
| BuzzFace (Santia and Williams, 2018) | Article | 2,263 | Meta | 4 Classes | Facebook | En |
| FA-KES (Salem et al., 2019) | Article | 804 | ✗ | 2 Classes | VDC | En |
| FakeNewsNet (Shu et al., 2020) | Article | 23,196 | Meta | 2 Classes | Fact Check | En |
| FakeCovid (Shahi and Nandini, 2020) | Article | 5,182 | ✗ | 2 Classes | Fact Check | Many |
Others extract claims from specific domains, such as science (Wadden et al., 2020), climate (Diggelmann et al., 2020), and public health (Kotonya and Toni, 2020b). Alternative forms of sentence-level inputs, such as answers from question answering forums, have also been considered (Mihaylova et al., 2018; Zhang et al., 2020). Other approaches consider a passage (Mihalcea and Strapparava, 2009; Pérez-Rosas et al., 2018) or an entire article (Horne et al., 2018; Santia and Williams, 2018; Shu et al., 2020) as input. However, these implicitly assume that every claim in the passage or article is either factually correct or incorrect, an assumption that is problematic and thus rarely adopted by human fact-checkers (Uscinski and Butler, 2013).
In order to better control the complexity of the task, efforts listed in Table 3 created claims artificially. Thorne et al. (2018a) had annotators mutate sentences from Wikipedia articles to create claims. Following the same approach, Khouja (2020) and Nørregaard and Derczynski (2021) constructed Arabic and Danish datasets, respectively. Another frequently considered option is subject-predicate-object triples, for example, (London, city_in, UK). The popularity of triples as input stems from the fact that they facilitate fact-checking against knowledge bases (Ciampaglia et al., 2015; Shi and Weninger, 2016; Shiralkar et al., 2017; Kim and Choi, 2020) such as DBpedia (Auer et al., 2007), SemMedDB (Kilicoglu et al., 2012), and KBox (Nam et al., 2018). However, such approaches implicitly assume the non-trivial conversion of text into triples.
Table 3: Datasets with artificially constructed claims for factual verification.

| Dataset | Input | #Inputs | Evidence | Verdict | Sources | Lang |
|---|---|---|---|---|---|---|
| KLinker (Ciampaglia et al., 2015) | Triple | 10,000 | KG | 2 Classes | Google/Wiki | En |
| PredPath (Shi and Weninger, 2016) | Triple | 3,559 | KG | 2 Classes | Google/Wiki | En |
| KStream (Shiralkar et al., 2017) | Triple | 18,431 | KG | 2 Classes | Google/Wiki/WSDM | En |
| UFC (Kim and Choi, 2020) | Triple | 1,759 | KG | 2 Classes | Wiki | En |
| LieDetect (Mihalcea and Strapparava, 2009) | Passage | 600 | ✗ | 2 Classes | News | En |
| FakeNewsAMT (Pérez-Rosas et al., 2018) | Passage | 680 | ✗ | 2 Classes | News | En |
| FEVER (Thorne et al., 2018a) | Statement | 185,445 | Text | 3 Classes | Wiki | En |
| HOVER (Jiang et al., 2020) | Statement | 26,171 | Text | 3 Classes | Wiki | En |
| WikiFactCheck (Sathe et al., 2020) | Statement | 124,821 | Text | 2 Classes | Wiki | En |
| VitaminC (Schuster et al., 2021) | Statement | 488,904 | Text | 3 Classes | Wiki | En |
| TabFact (Chen et al., 2020) | Statement | 92,283 | Table | 2 Classes | Wiki | En |
| InfoTabs (Gupta et al., 2020) | Statement | 23,738 | Table | 3 Classes | Wiki | En |
| Sem-Tab-Fact (Wang et al., 2021) | Statement | 5,715 | Table | 3 Classes | Wiki | En |
| FEVEROUS (Aly et al., 2021) | Statement | 87,026 | Text/Table | 3 Classes | Wiki | En |
| ANT (Khouja, 2020) | Statement | 4,547 | ✗ | 3 Classes | News | Ar |
| DanFEVER (Nørregaard and Derczynski, 2021) | Statement | 6,407 | Text | 3 Classes | Wiki | Da |
3.2 Evidence
A popular type of evidence is metadata, such as publication date, sources, user profiles, and so forth. While metadata offers information complementary to textual sources or structured knowledge, and is useful when the latter are unavailable (Wang, 2017; Potthast et al., 2018), it does not provide evidence grounding the claim.
Textual sources, such as news articles, academic papers, and Wikipedia documents, are one of the most commonly used types of evidence for fact-checking. Ferreira and Vlachos (2016) used the headlines of selected news articles, and Pomerleau and Rao (2017) used the entire articles instead as the evidence for the same claims. Instead of using news articles, Alhindi et al. (2018) and Hanselowski et al. (2019) extracted summaries accompanying fact-checking articles about the claims as evidence. Documents from specialized domains such as science and public health have also been considered (Wadden et al., 2020; Kotonya and Toni, 2020b; Zhang et al., 2020).
The aforementioned works assume that evidence is given for every claim, which is not conducive to developing systems that need to retrieve evidence from a large knowledge source. Therefore, Thorne et al. (2018a) and Jiang et al. (2020) considered Wikipedia as the source of evidence and annotated the sentences supporting or refuting each claim. Schuster et al. (2021) constructed VitaminC based on factual revisions to Wikipedia, in which evidence pairs are nearly identical in language and content, with the exception that one supports a claim while the other does not. However, these efforts restricted world knowledge to a single source (Wikipedia), ignoring the challenge of retrieving evidence from heterogeneous sources on the web. To address this, other works (Popat et al., 2016; Baly et al., 2018; Augenstein et al., 2019) retrieved evidence from the Internet, but the search results were not annotated. Thus, it is possible that irrelevant information is present in the evidence, while information that is necessary for verification is missing.
Though the majority of studies focus on unstructured evidence (i.e., textual sources), structured knowledge has also been used. For example, the truthfulness of a claim expressed as an edge in a knowledge base (e.g., DBpedia) can be predicted by the graph topology (Ciampaglia et al., 2015; Shi and Weninger, 2016; Shiralkar et al., 2017). However, while graph topology can be an indicator of plausibility, it does not provide conclusive evidence. A claim that is not represented by a path in the graph, or that is represented by an unlikely path, is not necessarily false. The knowledge base approach assumes that true facts relevant to the claim are present in the graph; but given the incompleteness of even the largest knowledge bases, this is not realistic (Bordes et al., 2013; Socher et al., 2013).
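A minimal sketch of topology-based checking over a knowledge graph is shown below, using networkx. The toy graph and the use of path existence and length as a plausibility score follow the general idea of the cited approaches rather than any specific algorithm; the caveat about incompleteness applies directly here.

```python
import networkx as nx

# Toy knowledge graph built from subject-predicate-object triples.
kg = nx.Graph()
kg.add_edge("London", "UK", relation="city_in")
kg.add_edge("UK", "Europe", relation="located_in")

def plausibility(subject: str, obj: str) -> float:
    """Score a claimed link between subject and object by graph proximity:
    1.0 for a direct edge, decaying with path length, 0.0 if no path exists.
    Absence of a path does not prove the claim false; the graph may be incomplete."""
    if not (kg.has_node(subject) and kg.has_node(obj)):
        return 0.0
    try:
        length = nx.shortest_path_length(kg, subject, obj)
    except nx.NetworkXNoPath:
        return 0.0
    return 1.0 if length == 0 else 1.0 / length

print(plausibility("London", "Europe"))  # plausible: connected via UK
print(plausibility("London", "Mars"))    # 0.0: no supporting path in this toy graph
```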
Another type of structured knowledge is semi-structured data (e.g., tables), which is ubiquitous thanks to its ability to convey important information in a concise and flexible manner. Early work by Vlachos and Riedel (2015) used tables extracted from Freebase (Bollacker et al., 2008) to verify claims retrieved from the web about statistics of countries, such as population, inflation, and so on. Chen et al. (2020) and Gupta et al. (2020) studied fact-checking textual claims against tables and info-boxes from Wikipedia. Wang et al. (2021) extracted tables from scientific articles and required evidence selection in the form of cells selected from tables. Aly et al. (2021) further considered both text and tables for factual verification, while explicitly requiring the retrieval of evidence.
3.3 Verdict and Justification
The verdict in early efforts (Bachenko et al., 2008; Mihalcea and Strapparava, 2009) was a binary label (i.e., true/false). However, fact-checkers usually employ multi-class labels to represent degrees of truthfulness (true, mostly-true, mixture, etc.),4 which were adopted by Vlachos and Riedel (2014) and Wang (2017). Recently, Augenstein et al. (2019) collected claims from different sources, where the number of labels varies greatly, ranging from 2 to 27. Due to the difficulty of mapping veracity labels onto the same scale, they did not attempt to harmonize them across sources. On the other hand, other efforts (Hanselowski et al., 2019; Kotonya and Toni, 2020b; Gupta and Srikumar, 2021) performed normalization by post-processing the labels with rules to simplify the veracity scheme. For example, Hanselowski et al. (2019) mapped mixture, unproven, and undetermined onto not enough information.
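Label harmonization of this kind is usually implemented as a simple rule-based mapping; the sketch below illustrates the idea. The first three entries follow the Hanselowski et al. (2019) mapping described above; the remaining entries are purely illustrative assumptions.

```python
# Collapse fine-grained source labels onto a common three-way scheme.
LABEL_MAP = {
    "mixture": "not enough information",        # Hanselowski et al. (2019)
    "unproven": "not enough information",       # Hanselowski et al. (2019)
    "undetermined": "not enough information",   # Hanselowski et al. (2019)
    "true": "supported",                        # illustrative
    "mostly true": "supported",                 # illustrative
    "false": "refuted",                         # illustrative
    "mostly false": "refuted",                  # illustrative
}

def normalize(label: str) -> str:
    """Map a source-specific veracity label onto the simplified scheme."""
    return LABEL_MAP.get(label.strip().lower(), "not enough information")

print(normalize("Mostly True"))  # -> "supported"
```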
Unlike prior datasets that only required outputting verdicts, FEVER (Thorne et al., 2018a) expected the output to contain both sentences forming the evidence and a label (e.g., support, refute, not enough information). Later datasets with both natural (Hanselowski et al., 2019; Wadden et al., 2020) and artificial claims (Jiang et al., 2020; Schuster et al., 2021) also adopted this scheme, where the output expected is a combination of multi-class labels and extracted evidence.
Most existing datasets do not contain textual explanations provided by journalists as justification for verdicts. Alhindi et al. (2018) extended the Liar dataset with summaries extracted from fact-checking articles. While originally intended as an auxiliary task to improve claim verification, these justifications have been used as explanations (Atanasova et al., 2020b). Recently, Kotonya and Toni (2020b) constructed the first dataset that explicitly includes gold explanations. These consist of fact-checking articles and other news items, which can be used to train natural language generation models to provide post-hoc justifications for the verdicts. However, relying on fact-checking articles is not realistic, as they are not available at inference time, so a system trained this way cannot provide justifications based on retrieved evidence.
4 Modeling Strategies
We now turn to surveying modeling strategies for the various components of our framework. The most common approach is to build separate models for each component and apply them in pipeline fashion. Nevertheless, joint approaches have also been developed, either through end-to-end learning or by modeling the joint output distributions of multiple components.
4.1 Claim Detection
Claim detection is typically framed as a classification task, where models predict whether claims are checkable or check-worthy. This is challenging, especially in the case of check-worthiness: Rumorous and non-rumorous information is often difficult to distinguish, and the volume of claims analyzed in real-world scenarios (e.g., all posts published to a social network every day) prohibits the retrieval and use of evidence. Early systems employed supervised classifiers with feature engineering, relying on surface features like Reddit karma and up-votes (Aker et al., 2017), Twitter-specific types (Enayet and El-Beltagy, 2017), named entities and verbal forms in political transcripts (Zuo et al., 2018), or lexical and syntactic features (Zhou et al., 2020).
Neural network approaches based on sequence- or graph-modeling have recently become popular, as they allow models to use the context of surrounding social media activity to inform decisions. This can be highly beneficial, as the ways in which information is discussed and shared by users are strong indicators of rumorousness (Zubiaga et al., 2016). Kochkina et al. (2017) employed an LSTM (Hochreiter and Schmidhuber, 1997) to model branches of tweets, Ma et al. (2018) used Tree-LSTMs (Tai et al., 2015) to directly encode the structure of threads, and Guo et al. (2018) modeled the hierarchy by using attention networks. Recent work explored fusing more domain-specific features into neural models (Zhang et al., 2021). Another popular approach is to use Graph Neural Networks (Kipf and Welling, 2017) to model the propagation behaviour of a potentially rumorous claim (Monti et al., 2019; Li et al., 2020; Yang et al., 2020a).
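The sketch below gives a minimal PyTorch stand-in for the sequence-based approaches: each post in a thread is represented by a fixed-size feature vector (e.g., pooled word embeddings plus metadata), the sequence is encoded with an LSTM, and the final hidden state is classified. The architecture, dimensions, and feature representation are simplified assumptions, not a reproduction of the cited models.

```python
import torch
import torch.nn as nn

class ThreadRumorClassifier(nn.Module):
    def __init__(self, post_dim: int = 64, hidden_dim: int = 32, num_classes: int = 2):
        super().__init__()
        self.encoder = nn.LSTM(post_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, thread: torch.Tensor) -> torch.Tensor:
        # thread: (batch, num_posts, post_dim), with posts ordered by time.
        _, (h_n, _) = self.encoder(thread)
        return self.classifier(h_n[-1])  # classify from the last hidden state

# One toy thread of 5 posts, each represented by a 64-dimensional feature vector.
model = ThreadRumorClassifier()
logits = model(torch.randn(1, 5, 64))
print(logits.shape)  # torch.Size([1, 2]): rumor vs. non-rumor logits
```

Graph-based variants replace the LSTM with a graph neural network over the reply/repost structure, but the overall encode-then-classify pattern is the same.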
Some works tackle claim detection and claim verification jointly, labeling potential claims as true rumors, false rumors, or non-rumors (Buntain and Golbeck, 2017; Ma et al., 2018). This allows systems to exploit specific features useful for both tasks, such as the different spreading patterns of false and true rumors (Zubiaga et al., 2016). Veracity predictions made by such systems are to be considered preliminary, as they are made without evidence.
4.2 Evidence Retrieval and Claim Verification
As mentioned in Section 2, evidence retrieval and claim verification are commonly addressed together. Systems mostly operate as a pipeline consisting of an evidence retrieval module and a verification module (Thorne et al., 2018b), but there are exceptions where these two modules are trained jointly (Yin and Roth, 2018).
Claim verification can be seen as a form of Recognizing Textual Entailment (RTE; Dagan et al., 2010; Bowman et al., 2015), predicting whether the evidence supports or refutes the claim. Typical retrieval strategies include commercial search APIs, Lucene indices, entity linking, or ranking functions like dot-products of TF-IDF vectors (Thorne et al., 2018b). Recently, dense retrievers employing learned representations and fast dot-product indexing (Johnson et al., 2017) have shown strong performance (Lewis et al., 2020; Maillard et al., 2021). To improve precision, more complex models—for example, stance detection systems—can be deployed as second, fine-grained filters to re-rank retrieved evidence (Thorne et al., 2018b; Nie et al., 2019b, a; Hanselowski et al., 2019). Similarly, evidence can be re-ranked implicitly during verification in late-fusion systems (Ma et al., 2019; Schlichtkrull et al., 2021). An alternative approach was proposed by Fan et al. (2020), who retrieved evidence using question generation and question answering via search engine results. Some work avoids retrieval by making a closed-domain assumption and evaluating in a setting where appropriate evidence has already been found (Ferreira and Vlachos, 2016; Chen et al., 2020; Zhong et al., 2020a; Yang et al., 2020b; Eisenschlos et al., 2020); this, however, is unrealistic. Finally, Allein et al. (2021) took into account the timestamp of the evidence in order to improve veracity prediction accuracy.
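A minimal sketch of the simplest retrieval strategy mentioned above—ranking documents by the dot product of TF-IDF vectors—is shown below, with the second-stage re-ranker left as a placeholder comment. The corpus and claim are toy examples chosen only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

corpus = [
    "Paris is the capital and most populous city of France.",
    "London is the capital of the United Kingdom.",
    "The Eiffel Tower is located in Paris.",
]
claim = "Paris is the capital of France."

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(corpus)
claim_vector = vectorizer.transform([claim])

# First stage: rank documents by TF-IDF dot product with the claim.
scores = linear_kernel(claim_vector, doc_vectors).ravel()
ranked = scores.argsort()[::-1]
top_k = [corpus[i] for i in ranked[:2]]

# Second stage (optional): a finer-grained model, e.g., a stance detector or
# dense re-ranker, could reorder top_k before verification.
print(top_k)
```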
If only a single evidence document is retrieved, verification can be directly modeled as RTE. However, both real-world claims (Augenstein et al., 2019; Hanselowski et al., 2019; Kotonya and Toni, 2020b), as well as those created for research purposes (Thorne et al., 2018a; Jiang et al., 2020; Schuster et al., 2021) often require reasoning over and combining multiple pieces of evidence. A simple approach is to treat multiple pieces of evidence as one by concatenating them into a single string (Luken et al., 2018; Nie et al., 2019a), and then employ a textual entailment model to infer whether the evidence supports or refutes the claim. More recent systems employ specialized components to aggregate multiple pieces of evidence. This allows the verification of more complex claims where several pieces of information must be combined, and addresses the case where the retrieval module returns several highly related documents all of which could (but might not) contain the right evidence (Yoneda et al., 2018; Zhou et al., 2019; Ma et al., 2019; Liu et al., 2020; Zhong et al., 2020b; Schlichtkrull et al., 2021).
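A minimal sketch of treating verification as textual entailment over concatenated evidence is shown below, using an off-the-shelf NLI checkpoint from the Hugging Face hub (roberta-large-mnli is one illustrative choice; any similar NLI model would do). Mapping entailment/contradiction/neutral onto supported/refuted/not-enough-info is a common simplification, not the approach of any single cited system.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"  # assumption: any NLI model with 3-way labels works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

claim = "Paris is the capital of France."
evidence = [
    "Paris is the capital and most populous city of France.",
    "France is a country in Western Europe.",
]

# Simplest aggregation: concatenate all evidence sentences into a single premise.
premise = " ".join(evidence)
inputs = tokenizer(premise, claim, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
pred = model.config.id2label[logits.argmax(dim=-1).item()]

# Map NLI labels onto fact-checking verdicts.
verdict = {"ENTAILMENT": "supported", "CONTRADICTION": "refuted", "NEUTRAL": "not enough info"}.get(pred, pred)
print(verdict)
```

More sophisticated aggregation components score or attend over the evidence pieces individually instead of flattening them into one string.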
Some early work does not include evidence retrieval at all, performing verification purely on the basis of surface forms and metadata (Wang, 2017; Rashkin et al., 2017; Dungs et al., 2018). Recently, Lee et al. (2020) considered using the information stored in the weights of a large pretrained language model—BERT (Devlin et al., 2019)—as the only source of evidence, as it has been shown competitive in knowledge base completion (Petroni et al., 2019). Without explicitly considering evidence such approaches are likely to propagate biases learned during training, and render justification production impossible (Lee et al., 2021; Pan et al., 2021).
4.3 Justification Production
Approaches for justification production can be separated into three categories, which we examine along the three dimensions discussed in Section 2.4—readability, plausibility, and faithfulness. First, some models include components that can be analyzed as justifications by human experts, primarily attention modules. Popat et al. (2018) selected evidence tokens that have higher attention weights as explanations. Similarly, co-attention (Shu et al., 2019; Lu and Li, 2020) and self-attention (Yang et al., 2019) were used to highlight the salient excerpts from the evidence. Wu et al. (2020b) further combined decision trees and attention weights to explain which tokens were salient, and how they influenced predictions. Recent studies have shown the use of attention as explanation to be problematic. Some tokens with high attention scores can be removed without affecting predictions, while some tokens with low (non-zero) scores turn out to be crucial (Jain and Wallace, 2019; Serrano and Smith, 2019; Pruthi et al., 2020). Explanations provided by attention may therefore not be sufficiently faithful. Furthermore, as they are difficult for non-experts and/or those not well-versed in the architecture of the model to grasp, they lack readability.
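As an illustration of the attention-based strategy, the sketch below runs a claim-evidence pair through a transformer with attention outputs enabled, averages the final-layer attention received by each token, and surfaces the top-scoring tokens as a justification. The model choice and pooling scheme are arbitrary, and, per the caveats above, the resulting highlights should not be assumed faithful.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

claim = "Paris is the capital of France."
evidence = "Paris is the capital and most populous city of France."

inputs = tokenizer(claim, evidence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Final-layer attention has shape (batch, heads, seq_len, seq_len). Average over
# heads and over query positions to get one "attention received" score per token.
attn = outputs.attentions[-1].mean(dim=1).mean(dim=1).squeeze(0)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())

# Highlight the highest-scoring non-special tokens as the (token-level) justification.
salient = sorted(
    (pair for pair in zip(tokens, attn.tolist()) if pair[0] not in tokenizer.all_special_tokens),
    key=lambda pair: -pair[1],
)[:5]
print(salient)
```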
Another approach is to construct decision-making processes that can be fully grasped by human experts. Rule-based methods use Horn rules and knowledge bases to mine explanations (Gad-Elrab et al., 2019; Ahmadi et al., 2019), which can be directly understood and verified. These rules are mined from a pre-constructed knowledge base, such as DBpedia (Auer et al., 2007). This limits what can be fact-checked to claims that are representable as triples, and to information present in the (often manually curated) knowledge base.
Finally, some recent work has focused on building models which—like human experts—can generate textual explanations for their decisions. Atanasova et al. (2020b) used an extractive approach to generate summaries, while Kotonya and Toni (2020b) adopted an abstractive approach. A potential issue is that such models can generate explanations that do not represent their actual veracity prediction process, but which are nevertheless plausible with respect to the decision. This is especially an issue with abstractive models, where hallucinations can produce very misleading justifications (Maynez et al., 2020). Also, the model of Atanasova et al. (2020b) assumes that fact-checking articles are provided as input at inference time, which is unrealistic.
5 Related Tasks
Misinformation and Disinformation
Misinformation is defined as a claim that contradicts or distorts common understandings of verifiable facts (Guess and Lyons, 2020). Disinformation, on the other hand, is the subset of misinformation that is deliberately propagated. The distinction is one of intent: disinformation is meant to deceive, while misinformation may be inadvertent or unintentional (Tucker et al., 2018). Fact-checking can help detect misinformation, but not distinguish it from disinformation. A recent survey (Alam et al., 2021) proposed integrating both factuality and harmfulness into a framework for multi-modal disinformation detection. Although misinformation and conspiracy theories overlap conceptually, conspiracy theories do not hinge exclusively on the truth value of the claims being made, as they are sometimes proved to be true (Sunstein and Vermeule, 2009). A related problem is propaganda detection, which overlaps with disinformation detection, but also includes identifying particular techniques such as appeals to emotion, logical fallacies, whataboutery, or cherry-picking (Da San Martino et al., 2020b).
Propaganda and the deliberate or accidental dissemination of misleading information has been studied extensively. Jowett and O’Donnell (2019) address the subject from a communications perspective, Taylor (2003) provides a historical approach, and Goldman and O’Connor (2021) tackle the related subject of epistemology and trust in social settings from a philosophical perspective. For fact-checking and the identification of misinformation by journalists, we direct the reader to Silverman (2014) and Borel (2016).
Detecting Previously Fact-checked Claims
While this survey focuses on methods that verify claims by finding evidence rather than relying on previously conducted fact-checks, misleading claims are often repeated (Hassan et al., 2017); it is therefore useful to detect whether a claim has already been fact-checked. Shaar et al. (2020) recently formulated this task as a ranking problem and constructed two datasets. The social media version of the task subsequently featured at the CheckThat! shared task (Barrón-Cedeño et al., 2020; Nakov et al., 2021b). Vo and Lee (2020) explored the task from a multi-modal perspective, matching claims about images against previously fact-checked claims. More recently, Sheng et al. (2021) and Kazemi et al. (2021) constructed datasets for this task in languages beyond English. Hossain et al. (2020) adopted a similar strategy to detect misinformation: if a tweet matched any known COVID-19-related misconception, it was classified as misinformation. Matching claims against previously verified ones is a simpler task that can often be reduced to sentence-level similarity (Shaar et al., 2020), which is well studied in the context of textual entailment. Nevertheless, new claims and evidence emerge regularly; previous fact-checks can be useful, but they may become outdated and potentially misleading over time.
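Since this matching step can often be reduced to sentence-level similarity, a minimal sketch with a sentence-embedding model is shown below, using the sentence-transformers library with the all-MiniLM-L6-v2 checkpoint as one illustrative choice. The claim database is a toy example, and production systems would typically add a re-ranking stage.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any sentence encoder works here

# Toy database of previously fact-checked claims.
fact_checked = [
    "5G networks do not spread COVID-19.",
    "The Great Wall of China is not visible from the Moon with the naked eye.",
]
new_claim = "COVID-19 is caused by 5G towers."

claim_emb = model.encode(new_claim, convert_to_tensor=True)
db_emb = model.encode(fact_checked, convert_to_tensor=True)

# Retrieve the most similar previously fact-checked claim by cosine similarity.
scores = util.cos_sim(claim_emb, db_emb)[0]
best = int(scores.argmax())
print(fact_checked[best], float(scores[best]))
```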
6 Research Challenges
Choice of Labels
The use of fine-grained labels by fact-checking organizations has recently come under criticism (Uscinski and Butler, 2013). In-between labels like “mostly true” often represent “meta-ratings” for composite claims consisting of multiple elementary claims of different veracity. For example, a politician might claim improvements to unemployment and productivity; if one part is true and the other false, a fact-checker might label the full statement “half true”. Noisy labels resulting from composite claims could be avoided by intervening at the dataset creation stage to manually split such claims, or by learning to do so automatically. The separation of claims into truth and falsehood can be too simplistic, as true claims can still mislead. Examples include cherry-picking, where evidence is chosen to suggest a misleading trend (Asudeh et al., 2020), and technical truth, where true information is presented in a way that misleads (e.g., “I have never lost a game of chess” is also true if the speaker has never played chess). A major challenge is integrating analysis of such claims into the existing frameworks. This could involve new labels identifying specific forms of deception, as is done in propaganda detection (Da San Martino et al., 2020a), or a greater focus on producing justifications to show why claims are misleading (Atanasova et al., 2020b; Kotonya and Toni, 2020b).
Sources and Subjectivity
Not all information is equally trustworthy, and sometimes trustworthy sources contradict each other. This challenges the assumptions made by most current fact-checking research relying on a single source considered authoritative, such as Wikipedia. Methods must be developed to address the presence of disagreeing or untrustworthy evidence. Recent work proposed integrating credibility assessment as a part of the fact-checking task (Wu et al., 2020a). This could be done, for example, by assessing the agreement between evidence sources, or by assessing the degree to which sources cohere with known facts (Li et al., 2015; Dong et al., 2015; Zhang et al., 2019). Similarly, check-worthiness is a subjective concept varying along axes including target audience, recency, and geography. One solution is to focus solely on objective checkability (Konstantinovskiy et al., 2021). However, the practical limitations of fact-checking (e.g., the deadlines of journalists and the time-constraints of media consumers) often force the use of a triage system (Borel, 2016). This can introduce biases regardless of the intentions of journalists and system-developers to use objective criteria (Uscinski and Butler, 2013; Uscinski, 2015). Addressing this challenge will require the development of systems allowing for real-time interaction with users to take into account their evolving needs.
Dataset Artefacts and Biases
Synthetic datasets constructed through crowd-sourcing are common (Zeichner et al., 2012; Hermann et al., 2015; Williams et al., 2018). It has been shown that models tend to rely on biases in these datasets, without learning the underlying task (Gururangan et al., 2018; Poliak et al., 2018; McCoy et al., 2019). For fact-checking, Schuster et al. (2019) showed that the predictions of models trained on FEVER (Thorne et al., 2018a) were largely driven by indicative claim words. The FEVER 2.0 shared task explored how to generate adversarial claims and build systems resilient to such attacks (Thorne et al., 2019). Alleviating such biases and increasing the robustness to adversarial examples remains an open question. Potential solutions include leveraging better modeling approaches (Utama et al., 2020a, b; Karimi Mahabadi et al., 2020; Thorne and Vlachos, 2021), collecting data by adversarial games (Eisenschlos et al., 2021), or context-sensitive inference (Schuster et al., 2021).
Multimodality
Information (either in claims or evidence) can be conveyed through multiple modalities such as text, tables, images, audio, or video. Though the majority of existing works have focused on text, some efforts also investigated how to incorporate multimodal information, including claim detection with misleading images (Zhang et al., 2018), propaganda detection over mixed images and text (Dimitrov et al., 2021), and claim verification for images (Zlatkova et al., 2019; Nakamura et al., 2020). Monti et al. (2019) argued that rumors should be seen as signals propagating through a social network. Rumor detection is therefore inherently multimodal, requiring analysis of both graph structure and text. Available multimodal corpora are either small in size (Zhang et al., 2018; Zlatkova et al., 2019) or constructed based on distant supervision (Nakamura et al., 2020). The construction of large-scale annotated datasets paired with evidence beyond metadata will facilitate the development of multimodal fact-checking systems.
Multilinguality
Claims can occur in multiple languages, often different from the one(s) evidence is available in, calling for multilingual fact-checking systems. While misinformation spans both geographic and linguistic boundaries, most work in the field has focused on English. A possible approach for multilingual verification is to use translation systems for existing methods (Dementieva and Panchenko, 2020), but relevant datasets in more languages are necessary for testing multilingual models’ performance within each language, and ideally also for training. Currently, there exist a handful of datasets for factual verification in languages other than English (Baly et al., 2018; Lillie et al., 2019; Khouja, 2020; Shahi and Nandini, 2020; Nørregaard and Derczynski, 2021), but they do not offer a cross-lingual setup. More recently, Gupta and Srikumar (2021) introduced a multilingual dataset covering 25 languages, but found that adding training data from other languages did not improve performance. How to effectively align, coordinate, and leverage resources from different languages remains an open question. One promising direction is to distill knowledge from high-resource to low-resource languages (Kazemi et al., 2021).
Faithfulness
A significant unaddressed challenge in justification production is faithfulness. As we discuss in Section 4.3, some justifications—such as those generated abstractively (Maynez et al., 2020)—may not be faithful. This can be highly problematic, especially if these justifications are used to convince users of the validity of model predictions (Lertvittayakumjorn and Toni, 2019). Faithfulness is difficult to evaluate for, as human evaluators and human-produced gold standards often struggle to separate highly plausible, unfaithful explanations from faithful ones (Jacovi and Goldberg, 2020). In the model interpretability domain, several recent papers have introduced strategies for testing or guaranteeing faithfulness. These include introducing formal criteria that models should uphold (Yu et al., 2019), measuring the accuracy of predictions after removing some or all of the predicted non-salient input elements (Yeh et al., 2019; DeYoung et al., 2020; Atanasova et al., 2020a), or disproving the faithfulness of techniques by counterexample (Jain and Wallace, 2019; Wiegreffe and Pinter, 2019). Further work is needed to develop such techniques for justification production.
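As an illustration of the deletion-style tests mentioned above, the sketch below computes a comprehensiveness-like score: remove the tokens an explanation marks as salient and measure how much the model's predicted probability drops. The classifier and explanation are placeholders, and the structure follows DeYoung et al. (2020) only loosely.

```python
from typing import Callable, List

def comprehensiveness(
    predict_proba: Callable[[str], float],  # probability of the predicted class for a text
    tokens: List[str],
    salient_indices: List[int],
) -> float:
    """Drop in predicted probability after removing the tokens the explanation
    claims were important; larger drops suggest a more faithful explanation."""
    full_text = " ".join(tokens)
    reduced = " ".join(t for i, t in enumerate(tokens) if i not in set(salient_indices))
    return predict_proba(full_text) - predict_proba(reduced)

# Toy example with a dummy "model" that keys on the word "refuted".
dummy = lambda text: 0.9 if "refuted" in text else 0.3
print(comprehensiveness(dummy, ["claim", "was", "refuted", "by", "evidence"], [2]))  # ≈ 0.6
```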
From Debunking to Early Intervention and Prebunking
The prevailing application of automated fact-checking is to discover and intervene against circulating misinformation, also referred to as debunking. Efforts have been made to respond quickly after the appearance of a piece of misinformation (Monti et al., 2019), but common to all approaches is that intervention takes place reactively, after misinformation has already been introduced to the public. NLP technology could also be leveraged in proactive strategies. Prior work has employed network analysis and similar techniques to identify key actors for intervention in social networks (Farajtabar et al., 2017); using NLP, such techniques could be extended to take into account the information shared by these actors, in addition to graph-based features (Nakov, 2020; Mu and Aletras, 2020). Another direction is to disseminate countermessaging before misinformation can spread widely; this is also known as pre-bunking, and has been shown to be more effective than post-hoc debunking (van der Linden et al., 2017; Roozenbeek et al., 2020; Lewandowsky and van der Linden, 2021). NLP could play a crucial role both in early detection and in the creation of relevant countermessaging. Finally, training people to create misinformation has been shown to increase resistance towards false claims (Roozenbeek and van der Linden, 2019). NLP could be used to facilitate this process, or to provide an adversarial opponent for gamifying the creation of misinformation. This could be seen as a form of dialogue agent for educating users; however, there are as yet no resources for the development of such systems.
7 Conclusion
We have reviewed and evaluated current automated fact-checking research by unifying the task formulations and methodologies across different research efforts into one framework comprising claim detection, evidence retrieval, verdict prediction, and justification production. Based on the proposed framework, we have provided an extensive overview of the existing datasets and modeling strategies. Finally, we have identified vital challenges for future research to address.
Acknowledgments
Zhijiang Guo, Michael Schlichtkrull, and Andreas Vlachos are supported by the ERC grant AVeriTeC (GA 865958). The latter is further supported by the EU H2020 grant MONITIO (GA 965576). The authors would like to thank Rami Aly, Christos Christodoulopoulos, Nedjma Ousidhoum, and James Thorne for useful comments and suggestions.
Author notes
Action Editor: Yulan He
The first two authors have contributed equally.