Benchmarking the Generation of Fact Checking Explanations

Abstract

Fighting misinformation is a challenging, yet crucial, task. Despite the growing number of experts being involved in manual fact-checking, this activity is time-consuming and cannot keep up with the ever-increasing amount of fake news produced daily. Hence, automating this process is necessary to help curb misinformation. Thus far, researchers have mainly focused on claim veracity classification. In this paper, instead, we address the generation of justifications (textual explanations of why a claim is classified as either true or false) and benchmark it with novel datasets and advanced baselines. In particular, we focus on summarization approaches over unstructured knowledge (i.e., news articles) and we experiment with several extractive and abstractive strategies. We employed two datasets with different styles and structures, in order to assess the generalizability of our findings. Results show that, in justification production, summarization benefits from the claim information, and, in particular, that a claim-driven extractive step improves abstractive summarization performance. Finally, we show that although cross-dataset experiments suffer from performance degradation, a unique model trained on a combination of the two datasets is able to retain style information in an efficient manner.


Introduction
The interaction between the modern media ecosystem and online social media has facilitated the rapid and nearly unrestricted spreading of news. While this has been a major achievement in terms of access to information, there is also an increasing need to counter the spread of misinformation, commonly conveyed through Fake News. Fake News is crafted with the intention to manipulate society towards a specific political, economic, or social outcome, lacking verifiable evidence and credible sources (Chen and Sharma, 2015). It can represent a threat to human health and safety, e.g. by disseminating false information on disease treatment (Van der Linden, 2022). Thus, verifying the accuracy of claims and presenting users with factual and impartial evidence to support their veracity is of utmost importance. Manual fact-checking, however, is a time-consuming activity (Hassan et al., 2015). Hence, Natural Language Processing has been suggested as an effective solution for automating this process. Thus far, the main strategies have involved classifying and flagging misleading information. However, a simple classification approach can generate a backfire effect in which the belief in false claims is further entrenched rather than hindered (Lewandowsky et al., 2012). For this reason, explaining why a claim is classified as either true or false can be a better solution. Fact-checking articles could represent a valuable resource towards this end; however, on online social media platforms they are ineffective, either because ordinary users are not prone to click on links to relevant resources (Glenski et al., 2017, 2020) or because these articles are so long that users would avoid reading them (Pernice et al., 2019). Indeed, effective explanations should be simple, and only a few arguments should be provided in order to avoid an "overkill" backfire effect (Lombrozo, 2007; Sanna and Schwarz, 2006).
Although the work of professional fact-checkers is crucial for countering misinformation (Wintersieck, 2017), it has been shown that disproof on social media platforms is mostly carried out by ordinary users (Micallef et al., 2020). Thus, automating the explanation generation process is deemed crucial, as an aid both for fact-checkers (to increase their online activity) and for social media users (to make their interventions more effective; He et al., 2023).
Still, few attempts to automatically generate explanations/justifications about claim veracity have been proposed so far (Kotonya and Toni, 2020a). Current methods for justification production include highlighting tokens with high attention weights (Popat et al., 2018; Yang et al., 2019; Lu and Li, 2020), utilizing knowledge graphs (Ahmadi et al., 2019), and modelling the task as either extractive or abstractive summarization (Atanasova et al., 2020; Kotonya and Toni, 2020b).
In this paper, we aim at benchmarking justification production as a summarization task, by providing an exhaustive study of the performance of extractive and abstractive approaches over two novel datasets. In particular, we consider several extractive and abstractive approaches in both supervised and unsupervised settings, where we generate a justification for a given claim using a fact-checking article as a knowledge source. We also experiment with hybrid approaches combining extractive and abstractive steps in a unique pipeline. Finally, we integrate the pipeline within an end-to-end claim-driven explanation generation framework. These approaches are tested in both in-domain and cross-domain configurations, by employing two different datasets. Each dataset has its own style and characteristics, but they both contain claim, verdict, and article triplets (see Figure 1).
The main findings from our experiments are: (i) if an extractive approach is employed for justification production, then the sentence selection must be driven by the claim information; (ii) if no training data is available, as in cross-domain experiments, extractive approaches can be better than abstractive ones for justification production; (iii) high-quality justifications can be obtained by combining extractive and abstractive summarization approaches in a unique pipeline (using simple off-the-shelf LMs), and by driving both sentence selection and justification generation with the claim information. Still, differently from previous studies, we found that the sentences extracted from the article must retain their order rather than being rearranged according to some notion of relevance; (iv) LMs for abstractive summarization should be selected according to article length, since there is no one-fits-all solution: for shorter articles, LMs with a 512-token input length provide better results, while models with a 1024-token input length are preferable for longer ones.

Related Work
The process of fact-checking a news story involves determining the truthfulness of a statement (Verdict Prediction) and generating a written rationale for the verdict (Justification Production). Claim veracity is usually assessed through a classification task, either binary (Nakashole and Mitchell, 2014; Potthast et al., 2018; Popat et al., 2018) or multi-class (Wang, 2017; Thorne et al., 2018), or through a multitask learning approach (Augenstein et al., 2019). Recently, researchers have been focusing on developing datasets and systems for evidence-based Verdict Prediction. Notable examples of relevant datasets include FEVER (Thorne et al., 2018), SciFact (Wadden et al., 2020), COVID-Fact (Saakyan et al., 2021), and PolitiHop (Ostrowski et al., 2021).
Justification Production has proven to be more challenging than Verdict Prediction. Several approaches have been suggested, including logic-based approaches (Gad-Elrab et al., 2019; Ahmadi et al., 2019) and deep-learning and attention-based techniques (Popat et al., 2018; Yang et al., 2019; Shu et al., 2019; Lu and Li, 2020). Nevertheless, casting Justification Production as a summarization task appears to be the most viable solution (Kotonya and Toni, 2020a). Thereby, explanations can be derived from manually written debunking articles either by selecting important sentences from the text (extractive approach; Atanasova et al., 2020) or by generating a new one (abstractive approach; Kotonya and Toni, 2020b). Both families of approaches still have problems: extractive explanations may lack sufficient context, while abstractive ones may lack faithfulness, given the tendency of these neural models to hallucinate (Kotonya and Toni, 2020a; Guo et al., 2022). Currently, abstractive summarization appears to be the most viable option for generating effective justifications. Nevertheless, it may not always be possible to acquire an adequate amount of training data or the computational resources necessary for highly demanding models. Thus, the purpose of this paper is twofold: (i) to provide SOTA results using simple off-the-shelf LMs, and (ii) to understand which approach is the most suitable for a given scenario.

Datasets
For our experiments, we collected two datasets with different structural and stylistic features. The first is LIAR++, a derivation of LIAR-PLUS (Alhindi et al., 2018), while the second is FullFact, a completely new dataset. Both datasets comprise claim, verdict, and article entries.
The claim is a short text consisting of a statement that is under inspection: it can be TRUE, partially TRUE, or FALSE. The verdict is usually a paragraph-long text that provides arguments to assess the truth value of the claim: in many cases, it corresponds to a debunking text. Finally, the article is a document that discusses the veracity of the claim in a journalistic style and contains the verifiable facts necessary to build the verdict. Figure 1 illustrates an example of each element. A detailed description of the employed datasets follows.

LIAR++ Dataset
We created LIAR++ (L++ henceforth) starting from the LIAR-PLUS dataset (Alhindi et al., 2018). This dataset contains articles from the POLITIFACT website spanning from 2007 to 2016 and covers various political topics, with a primary emphasis on verifying the accuracy of statements made by political figures. LIAR-PLUS contains some entries in which the verdict was artificially created by extracting the last five sentences from the body of the article. In all the other cases, verdicts were extracted from a specific section of the web pages, usually titled Our ruling or Summing up. Qualitative and quantitative analyses of the artificial verdicts against the gold ones showed that the former did not meet the expected quality. Therefore, we decided to discard them while creating L++. Differently from LIAR-PLUS, we also kept the whole verdict without removing the 'forbidden sentences' (i.e., sentences comprising any verdict-related word), such as "this statement is false". After this procedure, L++ comprises 6451 claim-article-verdict triples.

FullFact Dataset
With a procedure similar to that used for L++, we created a new dataset starting from the FULLFACT website (FF henceforth). This dataset contains data spanning from 2010 to 2021, and covers several different topics, such as health, economy, crime, law, and education. In FF the verdict is always present as a separate element on the web page, so there was no need to filter the data. This dataset accounts for 1838 claim-article-verdict triples.

Analysis of the Datasets
In this section, we focus on the main structural and stylistic differences between the two datasets, especially those that can have an impact on the experiments presented in the following sections. We mainly employed the ROUGE score (Lin, 2004) as evaluation metric to assess the quality of our datasets and of the generated summaries. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) counts the number of overlapping units between two texts. In the paper we report ROUGE-N (R-N, N=1,2), which counts the number of overlapping n-grams, and ROUGE-L (R-L), which takes into account the Longest Common Subsequence (LCS) between two texts.
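To make the metric concrete, a simplified version of the ROUGE recall computations used throughout this section can be sketched as follows (whitespace tokenization, no stemming; the official ROUGE implementation differs in these details):

```python
from collections import Counter

def rouge_n_recall(reference, candidate, n=1):
    """ROUGE-N recall: fraction of reference n-grams found in the candidate."""
    def ngrams(text, n):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    overlap = sum((ref & cand).values())  # clipped n-gram matches
    return overlap / max(sum(ref.values()), 1)

def rouge_l_recall(reference, candidate):
    """ROUGE-L recall: length of the Longest Common Subsequence over reference length."""
    a, b = reference.lower().split(), candidate.lower().split()
    # classic dynamic-programming LCS
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)] / max(len(a), 1)

print(rouge_n_recall("the claim is false", "we rate the claim false", n=1))  # 0.75
```

A high recall between verdict and article thus signals that verdict snippets appear in the article, which is exactly how the analyses below use it.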
Average Article and Verdict Length. Data length was computed in terms of the number of sentences, standard tokens, and BPE tokens (computed using the T5-large tokenizer). As shown in Table 1, FF articles and verdicts are much shorter than their L++ counterparts: 632 vs. 818 tokens and 30 vs. 114 tokens, respectively. On the contrary, claim lengths are essentially similar (18 vs. 15). Regarding BPE tokens, the average length of articles alone exceeds the fixed input length of the major Language Models (LMs), which is usually 512 or 1024 (see Table 1). Indeed, 98% and 54% of L++ articles are above the 512 and 1024 limits respectively, against 66% and 24% for FF. This implies that input reduction or truncation will be needed when processing the data during our experiments.

Presence of Verdict Snippets in the Article. We compared the two datasets in terms of the possibility of abstracting/extracting the verdict from the article. In particular, we considered ROUGE recall to highlight how many verdict snippets are present in the article. Results indicate that L++ has a more abstractive nature than FF (see Table 2). Indeed, the text of the verdict appears verbatim in the article more often for FF than for L++ (0.547 vs. 0.426 ROUGE-L recall). With ROUGE F1, on the other hand, we can observe how difficult it is to find verdict material in the article. Results show that FF articles contain very few pieces of FF verdicts. This can be explained in light of the much shorter length of the FF verdicts as compared to L++ ones (39 vs. 150 BPE tokens on average, see Table 1), while the difference in article length is negligible in this comparison.
To sum up, FF verdicts are much shorter than L++ verdicts and, even if they appear in longer verbatim sequences in the articles, these sequences are much more spread out across the document. Thus, we expect it to be harder to identify and extract FF verdicts.

Claim Repetition in Verdict. The possible presence of significant parts of the claim in the verdict positively affects ROUGE scores without necessarily indicating a better verdict quality. For example, a trivial baseline that, given a claim, outputs a verdict simply stating "It is not true that [claim]" would obtain a high ROUGE score without producing any meaningful explanation. Thus, we analysed the claim-verdict overlap and reported the results in Table 3. Considering ROUGE-L, on average 65% of a claim's subsequences are quoted verbatim in the verdict for the L++ dataset, but only 26% for FF. The frequent reference to the claim at the beginning of the verdict can explain this outcome (see the example in Table 13 in Appendix A). To check this hypothesis, we re-computed ROUGE scores after removing the first sentence of the verdict. We also repeated the test by removing the last sentence as a control condition. We observe that for L++ ROUGE scores drop when evaluating the overlap between claim and verdict without the first sentence (i.e. ROUGE-1 goes from 0.709 to 0.394). On the other hand, this is not the case with the removal of the last sentence (ROUGE-1 is 0.702), which corroborates our hypothesis.

To sum up, L++ verdicts comprise a good amount of claim information, usually reported in the first sentence. However, this does not apply to FF. Additionally, L++ verdicts end with a statement about claim veracity, usually in the form "We rate the claim [TRUTH LABEL]".
Article Adherence to the Claim. The amount of claim text included in the article is a proxy for understanding: (i) whether a simple summarization approach could provide a good verdict, even without explicitly providing the claim, and (ii) whether there is a preferable portion of the article to be selected for summarization in order to fit into LMs' input length.

Since we know that the articles are written to discuss the veracity of the corresponding claims, we expect each article to contain a certain amount of information related to the claim, including partial or even whole quotations of it. This assumption would be reflected in high ROUGE recall values between the claim and the article.

Results in Table 4 confirm our expectations. While the high ROUGE-1 values can be trivially explained by the claim and article sharing the same topic, the high ROUGE-2 and ROUGE-L recall values indicate that entire portions of the claim were inserted into the article. On average, 80% of the claim subsequences are quoted verbatim within the article in the two datasets. However, verbatim claim text is not concentrated in the first sentence of the article: its content is spread over the article, as can be seen from the small variation in ROUGE scores obtained by removing the first or last sentence of each document.

To sum up, the claim information is highly present within the article and spread over the entire text. For this reason, we expect extractive summarization approaches to be better than simple text truncation at selecting meaningful information from the article.

Experimental Setup
In this section, we present several experiments for the task of justification production. All the approaches can be traced back to the pipeline presented in Figure 2. Given an article, we tested several extractive approaches to select relevant material. Extractive summaries were either considered justifications per se or were fed to a Language Model (LM) pre-trained on an abstractive summarization objective. The LMs, in turn, were used with or without a fine-tuning step. Moreover, we selected different decoding mechanisms to drive the generation. Finally, we conducted a cross-domain experiment to evaluate the models' robustness to the style of each dataset.

Extractive Approaches
We first explored unsupervised extractive methods by comparing three different settings: article truncation, article-relevance extractive summarization (using the LexRank algorithm), and claim-driven extractive summarization (with SBERT). Each configuration represents a different assumption: (i) the main content (corresponding to a possible verdict) is introduced at the beginning or at the end of the article, within a specific section; (ii) a proper extractive summary or verdict contains the most relevant sentences of the article; (iii) a proper verdict comprises the article sentences most similar to the claim.
• Truncation is the most straightforward approach to "input reduction", i.e. cutting the input at a given threshold. This is the simplest procedure applied when using LMs on long texts.
• LexRank (Erkan and Radev, 2004) is an unsupervised approach for extractive text summarization which ranks the sentences of a document through graph-based centrality scoring.
• SBERT (Reimers and Gurevych, 2019) is a siamese network based on BERT (Devlin et al., 2019) employed for generating and ranking sentence embeddings with respect to a target sentence (i.e. the claim) using cosine-similarity.
All the reduction baselines were tested under two configurations: from the list of sentences they provide, we selected either the top or the bottom of the list. Furthermore, for LexRank and SBERT we rearranged the top or bottom sentences according to either article order or ranking order.
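As an illustration, the claim-driven selection with the article order configuration can be sketched as follows. A toy bag-of-words cosine similarity stands in for the SBERT sentence embeddings actually used in the experiments, and the example article and claim are invented:

```python
from collections import Counter
import math

def bow_cosine(a, b):
    """Cosine similarity between bag-of-words vectors (stand-in for SBERT embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def claim_driven_extract(claim, sentences, k=2, article_order=True):
    """Select the k article sentences most similar to the claim.
    article_order=True restores their original positions (the configuration
    that worked best in our experiments), instead of ranking order."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: bow_cosine(claim, sentences[i]),
                    reverse=True)
    top = ranked[:k]
    if article_order:
        top = sorted(top)
    return [sentences[i] for i in top]

article = [
    "The mayor announced a new budget on Monday.",
    "Critics say the vaccine caused 1,400 deaths.",
    "These figures come from a passive reporting scheme.",
    "Reported deaths are not proof the vaccine was the cause.",
]
claim = "The vaccine caused 1,400 deaths."
print(claim_driven_extract(claim, article, k=2))
# selects the two claim-related sentences, kept in article order
```

The number of sentences k would be set to the average verdict length of the target dataset (2 for FF, 6 for L++, see Table 5).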

Abstractive Approach
In the second part of our experimental design, we combined extractive and abstractive summarization for justification production. A reduced version of the text, obtained through truncation or extractive summarization, was used as input to various off-the-shelf Transformer-based models pre-trained on an abstractive summarization objective. In particular, we experiment with four Transformer-based summarization LMs trained on news-specific summarization datasets (we used the Huggingface Transformers library for our experiments: https://huggingface.co/transformers):

• T5: T5-large, 738M parameters, input size 512 (Raffel et al., 2020)

• Peg_xsum: Pegasus xsum, 570M parameters, input size 512 (Zhang et al., 2020)

• Peg_cnn: Pegasus cnn_dailymail, 570M parameters, input size 1024 (Zhang et al., 2020)

• dBart: DistilBart cnn-12-6, 305M parameters, input size 1024 (Shleifer and Rush, 2020)

All the models were tested under three main configurations: unsupervised, fine-tuned on a reduced version of the article (article), and fine-tuned on the concatenation of the claim and the reduced article (claim+article). Finally, four decoding mechanisms were employed for generating the verdicts: beam search (5 beams), top-k sampling (sampling pool limited to 40 words), nucleus sampling (probability set to 0.9), and typical sampling (probability set to 0.95; Meister et al., 2023). The fine-tuning details and hyperparameter settings can be found in Appendix B.
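For illustration, the filtering steps behind top-k and nucleus sampling can be sketched over a toy next-token distribution (a minimal version of the per-step filtering; during actual generation the LM applies this at every decoding step before sampling):

```python
def top_k_filter(probs, k=40):
    """Keep only the k most probable tokens and renormalise."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def nucleus_filter(probs, p=0.9):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

# toy next-token distribution over a 5-token vocabulary
probs = [0.5, 0.3, 0.1, 0.07, 0.03]
print(nucleus_filter(probs, p=0.9))  # keeps tokens 0, 1, 2 (0.5 + 0.3 + 0.1 covers 0.9)
```

Beam search, by contrast, is deterministic: it keeps the n most probable partial sequences at every step instead of sampling from a filtered distribution.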

Experimental Results
Our experimental design combines all the settings described in the previous sections. Extractive and abstractive approaches are concatenated in a unique pipeline tested on both L++ and FF. Additionally, we tested the generalisation capabilities of the pipeline in zero-shot experiments and by integrating the two datasets into a unique model. Although we tested the complete design (960 configurations, as depicted in Figure 2), we will discuss only the most relevant findings hereafter.
Claim-driven Extractive Summarization. If we focus on verdict generation as a pure unsupervised extractive summarization task, then the claim-driven approach through SBERT leads to better results in both datasets (see Table 5). The second best approach is LexRank, which focuses on sentence relevance within the article (rather than claim relevance). Simple truncation led to the lowest results when considering ROUGE-1. In Table 5 we report the best results, i.e. top selection with article order. Results for bottom selection and ranking order are reported in Table 14 in Appendix C.

Sentence Order for LM Input. An aspect that can have a significant impact on LMs' performance is the order of the sentences fed to the LMs. Results show that rearranging sentences according to ranking order, rather than article order, can hinder text coherence. As can be seen in Table 6, article order is generally better than ranking order for 1024 input-size LMs with L++. For FF, article order leads to higher ROUGE scores with 512 input-size LMs. Differences among datasets can be explained by the lengths of their articles: in particular, most of the articles from FF are shorter than 512 BPE tokens.
Note that the bottom approaches represent specific assumptions: (i) for truncation, the hypothesis is that the informative content lies in the last lines of the article (in the form of a "to sum up" paragraph); (ii) for SBERT, that the most similar sentences could be those that simply rephrase the claim but are not necessarily the most informative.

One major question when using LMs is whether the claim information is essential to drive the generation of the justification. Indeed, we should consider that (i) the sentences used as LM input are already selected according to the claim (SBERT), and (ii) we are using gold articles (i.e. articles specifically written to debunk the given claim). Results show that enriching the input with the claim information leads to ROUGE scores even higher than those obtained through simple fine-tuning on the articles only (see Table 7).
LM Input Length. Throughout the experiments, we saw that 1024 input-length models achieved higher results on L++, while on FF better performances were recorded with 512 input-length models (see Table 8). A possible explanation is that the differences in performance are due to the average length of articles in the two datasets (longer for L++, shorter for FF, see Table 1). In order to provide additional evidence for this hypothesis, we calculated ROUGE scores exclusively for articles with a length of 512 BPE tokens or less from both datasets. The results indicate that ROUGE scores for 512-input models were higher in both datasets than those obtained with 1024-input models, proving that article length is the key factor when selecting the proper model.
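This selection criterion amounts to a simple routing rule, sketched below (model names are illustrative and plain whitespace token counting stands in for the BPE tokenizer used in the paper):

```python
def pick_summarizer(article, tokenize=str.split):
    """Route an article to a 512- or 1024-input model based on its token length,
    following the finding that shorter articles are served better by 512-input LMs."""
    n_tokens = len(tokenize(article))
    return "pegasus-xsum (512)" if n_tokens <= 512 else "pegasus-cnn_dailymail (1024)"

print(pick_summarizer("short article " * 100))  # well under 512 tokens -> 512-input model
```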

Extractive vs Abstractive Summarization
In most cases, extractive summarization is better than unsupervised abstractive summarization (especially when claim-driven) in terms of ROUGE scores. Thus, if no training data is available, claim-driven extractive summarization is a viable solution. On the other hand, when training data is available, the best approach is to combine claim-driven extractive and abstractive summarization.

Cross-data and Mixed-data Experiments
Next, we explored the impact of the datasets' stylistic characteristics in several training/test configurations. First, we conducted a zero-shot cross-dataset experiment; then we investigated the effect of combining the two datasets for training a single model.
Cross-dataset Experiments. In these experiments, the models were fine-tuned on one dataset and tested on the other, i.e. fine-tuned on L++ and tested on FF (L++→FF) and vice-versa (FF→L++), both for the article and claim+article configurations. In particular, we employed the best-performing pipeline from the previous experiments, i.e. models from the SBERT, top, article order configuration. Results are reported in Table 10. Both for L++ and FF, the models show the trend highlighted previously: the claim+article configuration performs better than the article configuration. Furthermore, as expected, testing on a different dataset yielded lower results: in several cases, results for the article configuration were on par with or even worse than those obtained with the unsupervised LMs (compare with Table 7). This is particularly evident for the FF→L++ configuration.

Example from Table 9:

claim: There have been 1,400 deaths and one million injured from Covid-19 vaccinations in the UK.

gold verdict: These are deaths and potential side effects reported following the vaccine, not necessarily because of it.

SBERT: The front page of free newspaper 'The Light', shared on Facebook, claimed that there have been 1,400 deaths and a million injuries "from covid injections" in the UK. There had been just over 1,470 deaths following a Covid-19 vaccination in the UK, according to the Medicines and Healthcare products Regulatory Agency's (MHRA) Yellow Card reporting scheme, as of 7 July 2021, when the paper came out.

abstractive: This is technically correct, but the fact that a death is reported following a vaccination is no proof the vaccine was the cause of this death or injury.
The low ROUGE values can be attributed to the distinct styles of the datasets and not to any degradation in the generation quality.As can be seen from the examples in Table 11, models fine-tuned on L ++ , even when tested on FF, generate justifications mimicking L ++ style (claim in the first sentence and truthfulness statement at the end, see Appendix A), and vice versa (FF→L ++ ).
Mixed Data Experiments. Finally, we tested the effect of using both datasets to train a single LM. We focused on Peg_cnn as in the in-domain experiments it generally showed the best results, both quantitatively (see Table 7) and qualitatively (see Table 9). Results (see Table 12) are comparable to those obtained with the in-domain fine-tuning of distinct models for each dataset (see Table 7). Thus, if the datasets have peculiar styles, a more efficient way to tackle the task is to fine-tune a unique LM on all the data available rather than fine-tuning a different model for each dataset.

Conclusion

Curbing misinformation with NLP tools is a crucial task. Up to now, researchers have mainly focused on claim veracity classification. In this paper, instead, we focused on generating textual justifications with factual and objective information to support a verdict. We started by casting the problem as a news article summarization task, and subsequently we integrated summarization within an end-to-end claim-driven explanation generation framework, accounting for the several practical scenarios that can be encountered. To this end, we experimented with several extractive and abstractive approaches, leveraging pre-trained LMs under manifold configurations. In order to provide an exhaustive benchmark of the justification production task, we employed two novel datasets throughout the experiments. The main results show that summarization needs to be driven by the claim to obtain better performances, and that an extractive step before LM abstractive summarization further improves the results. Finally, we show that style information can be retained by a single model that is able to handle multiple datasets at once.

Limitations
LMs suffer from hallucination (Zellers et al., 2019; Solaiman et al., 2019) and, even if the phenomenon is reduced by the document-driven nature of the task, it is still present. In particular, some hallucinations are critical: we occasionally obtain the sentence "we rate this statement as false" even if the statement is true, since it is a very common sentence in the L++ training set. Moreover, the datasets used for this task (i) are restricted to the English language and (ii) assume that there is always a gold article for fact-checking. In real scenarios we might have the debunking material spread over several articles: in this case, we can expect that models not suffering from the input size limit would be most beneficial. Still, in preliminary experiments we conducted with two long-input LMs on our datasets, namely LED-Large (Beltagy et al., 2020) and BERTSUMEXTABS (Liu and Lapata, 2019) from Kotonya and Toni (2020b), results were worse also for articles exceeding the 1024 input limit.
Another aspect that should be addressed is an in-depth analysis of automatically generated verdicts and their persuasiveness. In fact, different versions of a verdict for the same claim can have different effects depending on the audience: e.g., for some people explanations comprising few arguments are more effective than longer explanations (Sanna and Schwarz, 2006). To this end, carefully designed human evaluation experiments are needed.

A Verdict Stylistic Features
Differently from FF, L++ verdicts show a peculiar and recurrent style: the first sentence comprises a reference to the claim, usually quoted verbatim (see Table 13). Moreover, verdicts end with a statement about the degree of truthfulness of the related claim, in a form similar to "We rate the claim [TRUTH LABEL]". The main justifications are presented in the body of the verdict. Examples are provided in Table 13.

claim: Clinton says "Hate crimes against American Muslims and mosques have tripled after Paris and San Bernardino."

verdict: Clinton said that "hate crimes against American Muslims and mosques have tripled after Paris and San Bernardino". Calculations by the director of an academic center found that the number did triple after those attacks. But it's worth noting that his data does not show whether or not they remained at that elevated level, or for how long, something that would be a reasonable interpretation of what Clinton said. The statement is accurate but needs clarification or additional information, so we rate it Mostly True.

claim: Trump says "I released the most extensive financial review of anybody in the history of politics....You don't learn much in a tax return."

verdict: Trump said that he has "released the most extensive financial review of anybody in the history of politics....You don't learn much in a tax return." Trump did release an extensive (and legally required) document detailing his personal financial holdings. However, experts consider that a red herring. Unlike all presidential nominees since 1980, Trump has not released his tax returns, which experts say would offer valuable details on his effective tax rate, the types of taxes he paid, and how much he gave to charity, as well as a more detailed picture of his income-producing assets. Trump's statement is inaccurate. We rate it False.

Table 13: Examples from L++: the claim is mostly present within the first sentence, and a truthfulness statement is reported at the end of the verdict.

B Fine-tuning Details
For the fine-tuning, each model underwent 5 epochs of training with a batch size of 4 and a seed set to 2022. To this end, the Huggingface Trainer has been employed, keeping its default hyperparameter settings, with the exception of the learning rate values and the optimisation method. The Adafactor stochastic optimisation method (Shazeer and Stern, 2018) has been used throughout the whole training phase. Learning rate values were set as follows: T5 3e-05, Peg_xsum 5e-05, Peg_cnn 3e-05, dBart 1e-05. For fine-tuning the models, we employed a single GPU, either a Tesla V100 or a Quadro RTX A5000. The checkpoint with minimum evaluation loss was used for testing.
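The setup described above corresponds roughly to the following Huggingface configuration (a sketch, not the exact training script: the output path is hypothetical, the learning rate shown is the one used for T5, and argument names follow Transformers 4.x, where recent versions rename evaluation_strategy to eval_strategy):

```python
from transformers import Seq2SeqTrainingArguments

# fine-tuning hyperparameters as described above (learning rate shown for T5)
args = Seq2SeqTrainingArguments(
    output_dir="checkpoints",          # hypothetical output path
    num_train_epochs=5,
    per_device_train_batch_size=4,
    seed=2022,
    learning_rate=3e-5,                # 5e-05 for Peg_xsum, 1e-05 for dBart
    optim="adafactor",                 # Adafactor (Shazeer and Stern, 2018)
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,       # keep the checkpoint with minimum eval loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
```

The resulting args object would then be passed to a Seq2SeqTrainer together with the model, tokenizer, and the claim+article or article training data.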

C Extractive Approach Results Details
The first set of experiments tested three main text reduction methodologies: text truncation, LexRank, and SBERT. In order to assess the informativeness of the summaries, the generated extractive output was compared to the gold verdicts through ROUGE metrics. For each methodology, two main configurations have been taken into account: top and bottom (or head and tail for text truncation). While in Table 5 we reported only the best configuration (head/top), in Table 14 we report the complete results for extractive summarization, including the bottom configuration for comparison purposes. These results are confirmed also when these approaches are used for text reduction before the abstractive step in our pipeline.

Figure 1 :
Figure 1: An article (top) used to generate a verdict (bottom) in response to a false claim (middle).

Table 1 :
Average length of each element of the datasets in terms of number of sentences (SENT), standard tokens (TOK) and BPE tokens (BPE).

Table 2 :
Verdict and article overlap measured in terms of ROUGE F1 and Recall scores.

Table 3 :
ROUGE recall scores between claims and verdicts. comp. indicates scores computed on the whole verdict, while no 1st and no last indicate the removal of the first and last sentence, respectively.

Table 4 :
ROUGE recall scores between claims and articles. comp. indicates scores computed on the whole article, while no 1st and no last indicate the removal of the first and last sentence, respectively.

Table 5 :
Comparison of extractive approaches. The number of sentences to be extracted is set to the average number of sentences per verdict in the corresponding dataset (2 for FF and 6 for L++).

Table 6 :
ROUGE F1 scores for each model in the SBERT top claim+article configuration. Verdicts were generated through the beam search decoding method (the best among the 4 decoding mechanisms tested). The input size for T5 and Peg_xsum is 512, while for dBart and Peg_cnn it is 1024.

Table 7 :
ROUGE F1 scores for each model in the SBERT top configuration. Results for both the unsupervised and the two fine-tuning settings (article and claim+article) are reported. Verdicts were generated through the beam search decoding method.

Table 8 :
Averaged results for models' input length. ROUGE scores for verdicts generated under the SBERT, article order, top, claim+article configuration (beam search decoding).

Table 9 :
FF example of generated verdicts. The first one is generated through extractive summarization only (with SBERT); the second is the output of the extractive-abstractive pipeline (SBERT, top, article order, claim+article configuration with Peg_cnn).

Table 10 :
Models fine-tuned on FF and tested on the L++ test set (FF→L++) and fine-tuned on L++ and tested on the FF test set (L++→FF), using the article or claim+article configurations (SBERT, top, article order).

Table 11 :
Examples from the cross-data experiments.

Table 12 :
F1 ROUGE scores of Peg cnn fine-tuned on a unique dataset and tested on L ++ and FF test sets.

Table 14 :
Pure extractive approach results for the head/tail and top/bottom configurations.The number of sentences to be extracted is set to the average number of sentences per verdict in the corresponding datasets (2 for FF and 6 for L++)