JustiLM: Few-shot Justification Generation for Explainable Fact-Checking of Real-world Claims

Justification is an explanation that supports the veracity assigned to a claim in fact-checking. However, the task of justification generation has been previously oversimplified as summarization of a fact-check article authored by fact-checkers. Therefore, we propose a realistic approach to generate justification based on retrieved evidence. We present a new benchmark dataset called ExClaim (for Explainable fact-checking of real-world Claims), and introduce JustiLM, a novel few-shot Justification generation based on retrieval-augmented Language Model by using fact-check articles as an auxiliary resource during training only. Experiments show that JustiLM achieves promising performance in justification generation compared to strong baselines, and can also enhance veracity classification with a straightforward extension. Code and dataset are released at https://github.com/znhy1024/JustiLM.


Introduction
Automated fact-checking typically encompasses several stages: identifying check-worthy claims, retrieving relevant evidence, determining the claim's veracity using the retrieved evidence, and generating a justification for the verdict on the veracity (Guo et al., 2022). Despite a wealth of research focusing on the first three stages, justification generation has remained under-explored. Justifications present the essential evidence and rationales used to arrive at a claim's veracity judgement, serving to convince readers and enhance the credibility of fact-checking systems. This explanatory process is of paramount importance in gaining users' trust in automated fact-checking (Kotonya and Toni, 2020a; Atanasova et al., 2020).
Several methods have attempted to generate the justification of a verdict by summarizing fact-check articles that were previously authored by human fact-checkers (Kotonya and Toni, 2020b; Atanasova et al., 2020; Russo et al., 2023). A fact-check article is itself manually written to justify the verdict of a given claim, with detailed presentation and reasoning over digested evidence and with reference to documents collected from multiple sources. Directly generating a summary from such a report as the justification therefore sidesteps the realistic challenges of evidence gathering and evidence-based reasoning for veracity assessment that we essentially face in the fact-checking task. More importantly, these existing methods are impractical because fact-check articles are not available for new claims that have yet to be checked (Guo et al., 2022). Table 1 shows an example illustrating the different types of information involved in fact-checking practice and their relationship. To justify the veracity of a claim, the source of information that can be used in practice ought to be the retrieved reference documents containing evidence rather than its fact-check article, which, as an outcome, has not yet been written during the checking process.
Table 1: An example claim along with the evidence documents, justification and veracity. The title of each evidence document is italicized. The sentences in the fact-check article referring to evidence documents are marked in the same color as the corresponding documents, and the sentences that directly entail the justification are in bold.
Claim: Biden: Gun manufacturers are "the only industry in the country" that have immunity from lawsuits.
Evidence Documents (References):
Doc1: No, you can't sue Pfizer or another manufacturer if you get a COVID-19 vaccine injury, but you can file for compensation. The Pfizer-BioNTech COVID-19 vaccine received full approval from the Food and Drug ...
Doc2: Remarks by President Biden on Gun Violence Prevention. THE PRESIDENT: Thank you, Kamala - Madam Vice President. Thank you very much. You know, we're joined ...
Doc3: Clinton: Gun industry is 'wholly protected' from all lawsuits. At the first Democratic debate of the 2016 presidential race, former Secretary of State Hillary Clinton criticized opponent ...
Doc4: Protection of Lawful Commerce in Arms Act. The Protection of Lawful Commerce in Arms Act (PLCAA) is a United States law which protects firearms manufacturers and dealers from being held liable ...
...
In this paper, we propose a more realistic approach to justification generation based on a language model, which follows the process of journalistic fact-checking used by well-known fact-check organizations such as PolitiFact. Our goal is to produce high-quality justifications, drawing upon evidence gathered from diverse sources. To this end, we construct a benchmark dataset for Explainable fact-checking of real-world Claims, named ExClaim, derived from a public dataset WatClaimCheck (Khan et al., 2022) containing newsworthy claims along with their fact-check articles and reference documents. ExClaim provides a large searchable corpus by mixing the reference documents from all claims in WatClaimCheck. Additionally, it curates the verdict justifications sourced from fact-check articles, typically located in a conclusive paragraph marked by cue phrases like "Our ruling" or "Our rating" for each claim.
Furthermore, we develop a Justification Language Model called JustiLM for generating the rationales behind veracity judgement within the context of few-shot learning. Presumably, few-shot fine-tuning can mitigate the training resource requirements and the dependence on high-end hardware, which is often financially prohibitive, while still enabling the model to achieve effectiveness comparable to state-of-the-art fully-trained models. JustiLM utilizes fact-check articles as auxiliary information during training only, by fine-tuning a pre-trained Retrieval-Augmented Generation (RAG) model on our curated justification dataset. Leveraging fact-check articles for training enhances the model's proficiency in generating rationales based on evidence and articulating them in its generated content. Our contributions are threefold:
• We propose JustiLM, the first realistic justification generation method based on a retrieval-augmented language model that is trained end-to-end for explainable fact-checking of real-world claims, leveraging fact-check articles as auxiliary information for model training only.
• We construct ExClaim, a new benchmark derived from the WatClaimCheck dataset (Khan et al., 2022) for the explainable fact-checking, which contains 6,951 real-world claims and their corresponding veracity labels and human-written justifications, together with a large searchable corpus of 957,949 chunk-level documents for fine-grained evidence retrieval.
• JustiLM outperforms In-Context Learning (ICL) enabled language models, including Flan-T5, Llama2, and the state-of-the-art few-shot RAG model Atlas.JustiLM also shows promising performance compared to the latest GPT-4 model.A straightforward extension of JustiLM for joint veracity prediction and justification generation improves the veracity prediction task with large margins.
2 Related Work

Explanations for fact-checking
Explanations for fact-checking claims have gained significant prominence in recent times, particularly due to the prevalent use of black-box models in automated fact-checking systems (Atanasova et al., 2020; Guo et al., 2022). Several methods have emerged to address this issue, utilizing various techniques to provide human-readable explanations.
One stream of research leverages attention weights to highlight salient parts in the retrieved evidence as explanations (Popat et al., 2018; Ma et al., 2019; Yang et al., 2019; Shu et al., 2019; Lu and Li, 2020). Another stream of work adopts logic-based rules, such as knowledge graphs and natural logic relations designed by human experts (Ahmadi et al., 2019; Gad-Elrab et al., 2019; Vedula and Parthasarathy, 2021; Krishna et al., 2022a), where explanations are obtained by tracing the rule path to reach the veracity of the claim. However, these explanations are not presented in natural language, rendering them less accessible to general users. Furthermore, these rule-based systems encounter challenges when dealing with real-world claims that may not conform to predefined rules. In contrast, our work places a strong emphasis on generating textual justifications that are readily understandable for users, avoiding manual rule definitions.
A few studies have attempted to automatically generate textual justifications by summarizing fact-check articles (Kotonya and Toni, 2020b; Atanasova et al., 2020; Russo et al., 2023). Atanasova et al. (2020) employ DistilBERT (Sanh et al., 2019) to extract sentences from fact-check articles to form justifications. Kotonya and Toni (2020b) propose a two-step process, initially utilizing Sentence-BERT (Reimers and Gurevych, 2019) to extract sentences from fact-check articles and subsequently using the BERTSUM model (Liu and Lapata, 2019) for abstractive justification generation based on the extracted sentences. Russo et al. (2023) explore several existing extractive summarization (Erkan and Radev, 2004; Reimers and Gurevych, 2019) and abstractive summarization (Raffel et al., 2020; Zhang et al., 2020; Shleifer and Rush, 2020) approaches for summarizing fact-check articles. These summarization methods come with inherent practical limitations, including complete reliance on fact-check articles (i.e., detailed human justifications) as input, which are rarely available at the time of deployment, and complete omission of automatic evidence search and evidence-based reasoning. Different from these approaches, our method only assumes that fact-check articles are available during model training and that the key evidence exists within a large searchable corpus. Therefore, our approach generates justifications by harnessing the information from retrieved reference documents during inference, which is a more realistic solution for real-world scenarios. Similarly, Khan et al. (2022) infer claim veracity based on retrieved textual references, while Yao et al. (2023a) retrieve evidence for multi-modal fact-checking and generate explanations for predicted veracity labels using the BART model (Lewis et al., 2020a); both approaches are stage-wise and trained on the full dataset. In contrast, we base our approach on the latest RAG framework that is trained end-to-end and generates justifications by using fact-check articles to distill supervisory signals for training.

Few-shot fact-checking
The need for few-shot learning is exacerbated by the continuous increase of computational and storage requirements for language model training. However, the specific application of few-shot learning techniques in the context of fact-checking has been relatively underexplored. Existing methods for few-shot fact-checking only focus on the so-called fact verification task (Lee et al., 2021; Zeng and Zubiaga, 2022; Zeng and Gao, 2023; Yue et al., 2023; Pan et al., 2023; Zhang and Gao, 2023) by feeding a few instances together with gold evidence into the model to predict the veracity of a claim. Different from these methods, our work primarily centers on generating justifications to substantiate the veracity of a claim based on the retrieved evidence. Importantly, we do not assume the availability of annotated evidence. Instead, we require the system to retrieve pertinent evidence, conforming to a more realistic and challenging scenario.

Retrieval-augmented language models
Equipping language models (LMs) with external memory has been shown to enhance their performance in knowledge-intensive NLP tasks (Chen et al., 2017; Thorne et al., 2018; Guu et al., 2020; Lewis et al., 2020b; Sachan et al., 2021; Izacard and Grave, 2021b; Borgeaud et al., 2022; Izacard et al., 2023). Typically, a retriever is used to retrieve relevant documents from a large corpus, which enriches the input of a language model and contributes to the final output. However, due to the high cost of acquiring query-document annotations and training retrievers, many implementations rely on off-the-shelf retrievers, such as TF-IDF and BM25 (Jones, 2004; Robertson et al., 1994), which use term-matching techniques. In this setup, only the parameters of LMs are fine-tuned.
Recent research has demonstrated the advantages of jointly training the retriever and the LM in an end-to-end manner, which leverages the supervision signals from the LM to train the retriever (Guu et al., 2020; Lewis et al., 2020b; Sachan et al., 2021; Izacard and Grave, 2021b; Izacard et al., 2023). Moreover, considering the remarkable performance of large language models (LLMs) in various few-shot NLP tasks, some studies suggest enhancing LLMs with retrievers or web search engines (Mallen et al., 2023; Si et al., 2023; Yu et al., 2023; Shi et al., 2023; Zhang and Gao, 2023). For example, REPLUG (Shi et al., 2023) optimizes the retriever by minimizing the KL divergence between the retrieval likelihood and the black-box LLM likelihood over retrieved documents. However, there exist inherent limitations in the interaction between the retriever and black-box LLMs, such as their restricted ability to provide or access specific information. We refer readers to a comprehensive survey of retrieval-augmented LMs (Mialon et al., 2023).

Task Formulation
Let C = {(x, z, y)} be a fact-checking dataset of real-world news claims associated with a textual knowledge corpus D. Each instance is composed of a claim x, its corresponding ground-truth justification y, and its fact-check article z. C is divided into a training set and a test set, and only instances in the training set are associated with fact-check articles, if available.
Given a claim x and the corpus D, the goal of justification generation is to produce a sequence of tokens, denoted as ŷ, that serves as an explanation for the veracity rendered on the claim using the evidence retrieved from the corpus. In the few-shot setting, we randomly select K instances from the training set, following a setup similar to that employed in previous studies for fact verification (Lee et al., 2021; Liu et al., 2022; Zeng and Gao, 2023), as this aligns with a more realistic scenario with limited data resources.
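The random selection of K training instances can be sketched as follows. This is a minimal illustration, not the released code: the function name `sample_few_shot`, the record fields, and the use of Python's `random.Random` with an explicit seed are all assumptions made for the example.

```python
import random


def sample_few_shot(train_set, k, seed):
    """Draw k distinct training instances uniformly at random, reproducibly."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return rng.sample(train_set, k)


# Toy training set of (claim, fact-check article, justification) records.
# The field names are hypothetical.
train_set = [
    {"claim": f"claim {i}", "article": f"article {i}", "justification": f"justification {i}"}
    for i in range(100)
]

shots = sample_few_shot(train_set, k=30, seed=0)
print(len(shots))
```

Fixing the seed makes the same K instances available to every model under comparison.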

ExClaim Dataset
The existing fact-checking datasets based on real-world claims have limitations for justification generation. This is because the provided evidence sources might not cover the evidence documents that fact-checkers actually rely on when writing justifications. For example, some datasets (Vlachos and Riedel, 2014; Wang, 2017; Alhindi et al., 2018) only provide metadata like speaker, party and date without a sizeable knowledge corpus for finding specific evidence. Some studies (Popat et al., 2016; Baly et al., 2018; Augenstein et al., 2019; Gupta and Srikumar, 2021; Yang et al., 2022; Hu et al., 2022) utilize web search to gather evidence documents, which may retrieve information from non-authoritative sources or leak the ground truth by inadvertently including articles from other organizations that verify the same claims or share the fact-check information (Khan et al., 2022). More notably, certain studies (Hanselowski et al., 2019; Kotonya and Toni, 2020a; Atanasova et al., 2020; Ostrowski et al., 2021; Russo et al., 2023) regard fact-check articles as the primary source of evidence, a practice that may not align with realistic fact-checking procedures.
We use the WatClaimCheck (Khan et al., 2022) dataset, which provides real-world claims along with the text of reference documents cited by fact-check articles. However, WatClaimCheck is constructed for veracity classification and does not provide ground-truth justifications. For our task, we construct ExClaim based on WatClaimCheck: following previous works (Alhindi et al., 2018; Augenstein et al., 2019; Kotonya and Toni, 2020a), we additionally extract justifications from fact-check articles based on cue phrases such as "Our ruling" or "Our rating" in the reports, and remove the instances that do not have such justification content. After extracting the justifications, we also remove them from the fact-check articles.
Table 2 presents summary statistics of the ExClaim dataset, with a total of 6,951 real-world claims and justifications (i.e., 5,964 for training and 987 for testing). The data pose some challenges: 1) A single reference document is generally much longer than a fact-check article, easily exceeding the context window of most text generation models (e.g., 512 tokens for T5 (Raffel et al., 2020) or 1,024 tokens for BART (Lewis et al., 2020a)). In particular, each claim may correspond to multiple reference documents from different sources, leading to excessively long text for evidence. 2) There is a lack of passage-/sentence-level annotation in reference documents and fact-check articles. Since fact-checkers generally refer to only a few pieces of text in reference documents when writing justifications, most information in a reference document tends to be irrelevant for generating the justifications. To address these issues, we split each document into disjoint 100-word chunks following previous works (Lee et al., 2019; Karpukhin et al., 2020; Lewis et al., 2020b; Izacard et al., 2023), resulting in a large textual knowledge corpus D comprising a total of 957,949 chunk-level documents from which systems can search for fine-grained evidence text. In the rest of the paper, we refer to these short text chunks as "reference documents" or simply "documents".
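The splitting of a document into disjoint 100-word chunks can be illustrated with a short sketch. It assumes whitespace tokenization, which may differ from the preprocessing actually used to build the corpus, and `split_into_chunks` is a hypothetical helper name.

```python
def split_into_chunks(text, chunk_words=100):
    """Split text into disjoint chunks of at most chunk_words whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[start:start + chunk_words])
        for start in range(0, len(words), chunk_words)
    ]


# A toy 250-word document yields two full chunks and one shorter final chunk.
document = " ".join(f"w{i}" for i in range(250))
chunks = split_into_chunks(document)
print([len(chunk.split()) for chunk in chunks])
```

The final chunk is simply shorter when the document length is not a multiple of the chunk size, so no words are dropped.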

Methodology
We base our approach on the retrieval-augmented generation (RAG) framework (Lewis et al., 2020b; Sachan et al., 2021; Izacard and Grave, 2021b; Izacard et al., 2023), which contains a retriever for fine-grained evidence retrieval and an LM for textual justification generation. As shown in Figure 1, the retriever takes the claim text as input and retrieves the top-N chunk-level documents from the textual knowledge corpus, and the LM conditions on these documents together with the claim to generate the justification. The retriever and LM can be jointly trained within a single RAG framework, which makes it possible to utilize fact-check articles as an auxiliary resource to provide supervisory signals during training, with the aim of enhancing the quality of the generated justification. We employ Atlas (Izacard et al., 2023) as our backbone model for two main reasons: 1) its strong few-shot learning ability in knowledge-intensive tasks when its retriever and LM are jointly trained; 2) its flexibility for incorporating fact-check articles in the training process.

Retriever
Given a claim x, the retriever should return the documents that help the LM generate a better justification. To enable the training of the retriever, Atlas utilizes a dense retriever named Contriever (Izacard et al., 2022), which is pre-trained using the MoCo contrastive loss (He et al., 2020). Contriever is a dual-encoder architecture in which the pre-trained query encoder $E_c$ and document encoder $E_d$ encode the claim x and each document $d_j \in D$, respectively. The embeddings of documents can be pre-computed to build an index using FAISS (Johnson et al., 2021) for fast retrieval. Documents are ranked by the similarity score $s(x, d_j) = E_c(x)^\top E_d(d_j)$, which is calculated by taking the dot product of the embeddings of the claim x and document $d_j$. To mitigate the burden of re-computing embeddings for all documents when training the retriever, Atlas (Izacard et al., 2023) only updates the parameters of the query encoder while freezing the document encoder, which still shows promising results in the few-shot setting. Therefore, we employ the document encoder for encoding reference documents and the query encoder for encoding other inputs. Since there is no direct supervision available to train the retriever, Atlas proposes a Perplexity Distillation loss to leverage the supervisory signals from the LM. The intuition is that documents that help the LM generate lower-perplexity outputs should be ranked higher (Izacard et al., 2023).
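The dot-product ranking described above can be sketched in a few lines. This is only an illustration on toy vectors: the embeddings below stand in for the outputs of the query and document encoders, the exhaustive scan stands in for a FAISS index, and `retrieve_top_n` is a hypothetical name.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_n(claim_emb, doc_embs, n):
    """Rank documents by dot-product similarity to the claim and return the top n.

    Returns a list of (document index, similarity score) pairs, best first.
    """
    scores = [dot(claim_emb, d) for d in doc_embs]
    ranked = sorted(range(len(doc_embs)), key=lambda j: scores[j], reverse=True)
    return [(j, scores[j]) for j in ranked[:n]]


# Toy embeddings (assumed values, not produced by Contriever).
claim_embedding = [1.0, 0.0]
document_embeddings = [[0.2, 0.9], [0.9, 0.1], [0.5, 0.5]]
print(retrieve_top_n(claim_embedding, document_embeddings, n=2))
```

Because the document encoder is frozen, the document embeddings can be computed once and reused while only the claim embedding changes during training.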

Language Model
The language model conditions on the top-N documents $D_N = \{d_j\}_{j=1}^{N}$ retrieved by the retriever, together with the claim x, to generate the justification. To aggregate evidence efficiently and effectively from multiple documents in the LM, Atlas employs a T5 encoder-decoder model (Raffel et al., 2020) with the Fusion-in-Decoder (FiD) (Izacard and Grave, 2021b) modification. Each retrieved document $d_j$ is encoded independently by the encoder, with the claim x prepended to it. All outputs of the encoder are then concatenated. The decoder takes this concatenation as input and performs cross-attention to fuse the evidence and generate outputs. The training objective is the standard language modeling loss that encourages the LM to assign higher probability to the target sequence y given the claim x and the top-N retrieved documents.
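The data flow of Fusion-in-Decoder can be sketched with a toy encoder. Everything concrete here is an assumption made for illustration: `toy_encode` is a stand-in that emits one vector per whitespace token rather than a T5 encoder, and the "claim: ... document: ..." input template is hypothetical.

```python
def toy_encode(text, dim=4):
    """Hypothetical stand-in for the encoder: one vector per whitespace token."""
    return [[float(len(token))] * dim for token in text.split()]


def fusion_in_decoder_inputs(claim, docs, encode=toy_encode):
    """Encode each (claim, document) pair independently, then concatenate.

    Returns the per-document encoder outputs and their concatenation along
    the sequence axis, which is what the decoder would cross-attend over.
    """
    encoded = [encode(f"claim: {claim} document: {doc}") for doc in docs]
    fused = [vector for sequence in encoded for vector in sequence]
    return encoded, fused


per_document, fused_sequence = fusion_in_decoder_inputs("a b", ["c d e", "f"])
print([len(seq) for seq in per_document], len(fused_sequence))
```

Encoding documents separately keeps the encoder cost linear in the number of documents, while the decoder still attends over all of them jointly.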

Distillation Techniques
Although directly summarizing fact-check articles z can generate justifications of reasonable quality in previous works (Kotonya and Toni, 2020a; Atanasova et al., 2020), z is by no means available during inference for new claims in real-world deployment, as we discussed in §1, making the previous methods impractical. We propose a realistic approach to address this limitation: distilling information from z as auxiliary supervisory signals for the training phase only. We introduce two types of techniques based on the granularity of distillation from fact-check articles. The first is article-level distillation, which utilizes aggregated information from the entire z. The second is chunk-level distillation, where we split each article z into multiple disjoint 100-word chunks $z = \{z_i\}_{i=1}^{M}$, where $M = \lceil |z| / 100 \rceil$. Chunk-level distillation utilizes the individual information of each chunk $z_i$. Both types of distillation techniques can be applied to train the retriever and the LM.

Article-level Distillation
Article-level distillation is performed on the entirety of a fact-check article, aiming to utilize the global-level alignment between the fact-check article z and the retrieved documents $D_N$ as supervisory signals for model training. The basic idea is that the more similar $D_N$ and z are, the easier it is for the LM to generate a justification based on $D_N$ that closely approximates the one generated based on z. This alignment serves two main purposes. Firstly, the similarity between $D_N$ and z can act as a supervisory signal, guiding the retriever to prioritize the ranking of documents in $D_N$ that resemble z. Secondly, the justification generated by the LM based on z can be used as a supervision signal to encourage the LM using $D_N$ to generate a justification as similar as possible to the one generated based on z. Next, we discuss two training losses that serve these two purposes.
Retrieval loss. The technique for training the retriever is based on the similarity between the entire fact-check article z and the retrieved documents $D_N$. However, the length of z is commonly larger than the maximum input length (i.e., 512 tokens) of the query encoder. Therefore, we use the trainable query encoder $E_c$ to represent z by aggregating the embeddings of all its chunks into a single embedding of z. The training objective is to minimize the mean-squared-error (MSE) loss between the embedding of z and the embedding of each retrieved document $d_j$.
Generation loss. The technique for training the LM generation is based on the distance between the justification generated using the retrieved documents $D_N$ and that generated directly using the fact-check article z. During training, the generation ŷ of the LM using z as input is regarded as a supervision signal to guide the model's learning. Let $p_L(y \mid x, D_N) = \prod_{k=1}^{|y|} p_L(t_k \mid x, D_N, t_{<k})$ be the LM probability of generating the ground-truth justification y conditioned on x and $D_N$, where $p_L(t_k \mid x, D_N, t_{<k})$ is the probability of each token $t_k$ assigned by the LM and $t_{<k}$ denotes the tokens generated prior to $t_k$. Similarly, the LM probability of generating y conditioned on z is $p_L(y \mid x, z)$. The training objective is to minimize the MSE loss between these two distributions, computed over the vocabulary V of the LM.
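The two article-level losses can be illustrated on toy values. This sketch rests on several assumptions that the text does not fix: chunk embeddings are aggregated by mean pooling, the retrieval loss averages the MSE over retrieved documents, and the generation loss averages a per-token MSE between next-token distributions over V. All function names are hypothetical.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors (assumed aggregation of chunk embeddings)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def article_level_retrieval_loss(chunk_embs, doc_embs):
    """MSE between the pooled article embedding and each document embedding, averaged."""
    article_emb = mean_pool(chunk_embs)
    return sum(mse(article_emb, d) for d in doc_embs) / len(doc_embs)


def article_level_generation_loss(probs_given_docs, probs_given_article):
    """Per-token MSE between two next-token distributions over V, averaged over tokens."""
    per_token = [mse(p, q) for p, q in zip(probs_given_docs, probs_given_article)]
    return sum(per_token) / len(per_token)


# Toy values: two article chunks, two retrieved documents, a two-word vocabulary.
print(article_level_retrieval_loss([[0.0, 0.0], [2.0, 2.0]], [[1.0, 1.0], [3.0, 1.0]]))
print(article_level_generation_loss([[0.5, 0.5], [1.0, 0.0]], [[0.5, 0.5], [0.0, 1.0]]))
```

In an actual training loop the distribution conditioned on z would presumably be treated as a fixed target, so that gradients flow only through the branch conditioned on the retrieved documents; that detail is also an assumption here.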

Chunk-level Distillation
Chunk-level distillation is performed at the granularity of each chunk of the fact-check article, leveraging the alignment between chunks $\{z_i\}_{i=1}^{M}$ and documents $\{d_j\}_{j=1}^{N}$ to provide supervisory signals for model training. The intuition is that different chunks of the fact-check article could be derived from rearranging or modifying specific text spans sourced from reference documents. Further, the chunks $\{z_i\}_{i=1}^{M}$ may correspond to certain parts of the ground-truth justification y. Thus, $\{z_i\}_{i=1}^{M}$ can be seen as the "connections" between $D_N$ and y. Aligning $\{d_j\}_{j=1}^{N}$ and $\{z_i\}_{i=1}^{M}$ intuitively aids the model in learning the mapping from $D_N$ to y, hence improving its performance. However, there is no chunk-level annotation available, which poses an important challenge for training. We design two training techniques to address this challenge for chunk-level distillation in both the retriever and the LM.
Retrieval loss. The technique for training the retriever is based on the relation between the similarity score and the LM perplexity, which is inspired by Izacard et al. (2023) and Shi et al. (2023). Intuitively, the more similar the text chunk $z_i$ is to the document $d_j$, the lower the LM perplexity of generating $z_i$ conditioned on $d_j$. We train the retriever to learn the alignment between $d_j$ and its most similar chunk $z_{j^*}$, where $j^* = \arg\max_{i \in [1, M]} s(z_i, d_j)$. This involves minimizing the KL-divergence between the similarity score $s(z_{j^*}, d_j)$ and the corresponding LM probability of $z_{j^*}$ conditioned on $d_j$ and x. Specifically, the retriever's distribution over documents is defined from the similarity scores $s(z_{j^*}, d_j)$, and the posterior distribution over documents according to the LM, $q_L$, is defined from the LM probability of $z_{j^*}$ conditioned on $d_j$ and x. The loss function for optimizing the retriever is the KL-divergence between these two distributions. This loss is exclusively used to optimize the retriever's parameters, without affecting the LM.
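The chunk-to-document alignment and the KL objective can be sketched on toy scores. The selection $j^* = \arg\max_i s(z_i, d_j)$ follows the text; the rest is assumed for illustration, namely that both distributions are obtained with a softmax over the retrieved documents (temperature 1) and that the divergence is taken from the LM posterior to the retriever distribution.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def most_similar_chunk(similarity_row):
    """Index i maximizing s(z_i, d_j) for one document d_j."""
    return max(range(len(similarity_row)), key=lambda i: similarity_row[i])


def chunk_level_retrieval_loss(sim, lm_logprob):
    """sim[j][i] = s(z_i, d_j); lm_logprob[j][i] = log p_L(z_i | x, d_j)."""
    best = [most_similar_chunk(row) for row in sim]
    retriever_dist = softmax([sim[j][best[j]] for j in range(len(sim))])
    lm_posterior = softmax([lm_logprob[j][best[j]] for j in range(len(sim))])
    return kl_divergence(lm_posterior, retriever_dist)


# Toy scores for two documents and two article chunks.
similarity = [[0.1, 0.9], [0.8, 0.2]]
lm_log_probabilities = [[-5.0, -0.1], [-3.0, -4.0]]
print(chunk_level_retrieval_loss(similarity, lm_log_probabilities))
```

The loss is zero when the retriever's ranking over documents already matches the LM posterior, and positive otherwise.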
Generation loss. Our technique for training the LM utilizes the attention scores of the LM to train the LM itself, which is inspired by previous works on open-domain QA that train a retriever by learning to approximate the attention scores of the reader (Izacard and Grave, 2021a; Izacard et al., 2023). The cross-attention scores between input and output can be used as a proxy for the usefulness of each input to the justification. We first average the decoder cross-attention scores over all attention heads, layers, and tokens for each retrieved document $d_j$, resulting in an averaged attention score $a(x \oplus d_j)$, where $\oplus$ denotes concatenation. The score that indicates the usefulness of $d_j$ is then obtained by applying the softmax operator, $p(d_j) = \frac{\exp(a(x \oplus d_j))}{\sum_{k=1}^{N} \exp(a(x \oplus d_k))}$, following Izacard et al. (2023). Similarly, the score for each chunk $z_i$ is $p(z_i)$, while the score of the chunk most similar to $d_j$ is $p(z_{j^*})$. The objective is to encourage the score of $d_j$ to approximate the score of its most similar chunk $z_{j^*}$. We then minimize the KL-divergence between the distributions of these two scores.

Experiments and Results

Evaluation Metrics
To assess the consistency of generated justifications with the ground truth, we employ a spectrum of metrics so that our evaluation balances factual accuracy and stylistic diversity of verbal expression. ROUGE (Lin, 2004) counts the number of overlapping units (e.g., n-grams and word sequences) between output justifications and ground truths. MAUVE (Pillutla et al., 2021) measures the divergence between output justifications and the ground truths, which can reflect whether the output is fluent and coherent with respect to the ground truth (Xie et al., 2023; Krishna et al., 2022b; Gao et al., 2023; Xu et al., 2023). SummaCC extends SummaC (Laban et al., 2022) to evaluate coverage and factual consistency by checking entailment between the output justifications and the ground truth. It sums the aggregated NLI scores over the pairs of the entire output justification and each sentence in the ground truth for coverage (Scialom et al., 2021; Gao et al., 2023), and, in reverse, over the pairs of the entire ground-truth justification and each sentence in the output justification for consistency (Laban et al., 2022).
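The n-gram overlap that ROUGE counts can be made concrete with a simplified ROUGE-N sketch. It is not the official ROUGE implementation: it lowercases and splits on whitespace, applies no stemming, and the name `rouge_n` is hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, F1) from clipped n-gram overlap."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # counts clipped to the smaller side
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if overlap == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


output_justification = "The claim is false"
ground_truth = "The claim is mostly false"
print(rouge_n(output_justification, ground_truth, n=1))
print(rouge_n(output_justification, ground_truth, n=2))
```

Surface overlap of this kind does not capture entailment, which is why the evaluation pairs it with MAUVE and the NLI-based SummaCC.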

Fallacy of Fact-Check Summarization
We investigate how the previous approach based on fact-check article summarization (Kotonya and Toni, 2020b; Atanasova et al., 2020) fails to generalize to the realistic setting in which retrieved evidence, rather than fact-check articles, is given as input.
Experimental Setup. 1) Full training: we include two existing models, ExplainMT (Atanasova et al., 2020) and ExplainerFC (Kotonya and Toni, 2020b). ExplainMT is an extractive model while ExplainerFC is extractive-abstractive. We partition the training set of ExClaim into 5,000 instances for training and 964 for validation. We train the two models to summarize fact-check articles, and test them by inputting fact-check articles versus evidence documents retrieved with BM25 (Robertson et al., 1994). 2) Few-shot training: we train the RAG model Atlas (Izacard et al., 2023) under few shots with fact-check articles as input and test it using fact-check articles versus documents retrieved by its pre-trained retriever Contriever. In this setting, Contriever is kept fixed during fine-tuning since the LM's input is fact-check articles. We use 30 shots randomly sampled from the training split, and report the results averaged over 3 trials based on different seeds.
Results. As shown in Table 3, for both settings, we observe that using retrieved documents as input dramatically degrades performance compared to inputting fact-check articles. This suggests that the fact-check article summarization approach struggles to generalize to the retrieved documents, especially in the few-shot setting, indicating the impracticality of previous approaches and the importance of the more realistic framework outlined in §3. That is, models need to generate justifications based on retrieved evidence instead of fact-check articles, which are not available for new claims during inference.

Few-shot Justification Generation

Baselines
1) Lead-4 (Nallapati et al., 2017) selects as justification the first sentence from each document among the top-4 documents retrieved by BM25.
2) Retriever + ICL-enabled LMs: We use BM25 as the sparse retriever and Contriever (Izacard et al., 2022) as the dense retriever, and choose Flan-T5 (11B) (Chung et al., 2022), Llama2 (70B) (Touvron et al., 2023) and GPT-4 (OpenAI, 2023) as the ICL-enabled LMs. We prompt the model to generate justifications by concatenating few-shot training instances along with a test instance.
3) Atlas (Izacard et al., 2023) is the SoTA RAG model with strong few-shot ability, which consists of a trainable dense retriever Contriever and an LM-adapted variant of T5 (Lester et al., 2021) with Fusion-in-Decoder (Izacard and Grave, 2021b) modified to increase the number of retrieved documents. We also include a non-joint training setting by replacing the retriever with BM25.
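The Lead-4 baseline above is simple enough to sketch directly. The sentence splitter here is a naive assumption (split after `.`, `!` or `?` followed by whitespace) and would mis-split abbreviations; the ranked document list is assumed to come from BM25, which is not implemented here, and both function names are hypothetical.

```python
import re


def first_sentence(text):
    """Return the text up to the first sentence-ending punctuation followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)[0]


def lead_k(ranked_docs, k=4):
    """Concatenate the first sentence of each of the top-k ranked documents."""
    return " ".join(first_sentence(doc) for doc in ranked_docs[:k])


# Toy ranked list standing in for BM25 output.
ranked = ["A one. A two.", "B one! B two.", "C one? C two", "D one. D two.", "E one."]
print(lead_k(ranked))
```

Since the output is a stack of evidence sentences with no connecting reasoning, it serves as a lower bound rather than a genuine justification.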

Experimental Setup
For our method JustiLM, we randomly sample 30 instances from the training set for fine-tuning. We use Atlas (Izacard et al., 2023) with its released pre-trained checkpoint of 3B parameters as our backbone model. Following the Atlas paper, we retrieve the top-20 documents for each instance. We set the number of training steps to 100, the batch size to 8, and the learning rate to $4 \times 10^{-5}$ with linear decay and 5 warmup steps for both the LM and the retriever.
For the distillation techniques to train the LM, we begin by fine-tuning the LM to take fact-check articles as auxiliary input and generate the justification, which provides a warmup for the LM. For BM25 + ICL-enabled LMs, we use the Pyserini toolkit to build the BM25 model. For Flan-T5, we use the code and pre-trained checkpoints from Hugging Face Transformers. We use the original code and pre-trained checkpoints of Llama2. We use the API service of GPT-4 from OpenAI. Given the different length constraints of these LMs, we adjust the number of shots and/or the number of retrieved documents to make maximal use of their input context windows. We first ensure that these models have access to as many of the top-20 retrieved documents as possible, because effective generation requires an adequate amount of information, with the secondary goal of maximizing the number of few-shot examples used. Specifically, we set 1-shot ICL with top-10 documents for Flan-T5, 2-shot ICL with top-20 documents for Llama2, and 3-shot ICL with top-20 documents for GPT-4.
For a fair and robust comparison, we run each experiment three times, with training instances sampled using different random seeds. We report the mean and standard deviation of each metric over the three runs in all experiments. The seeds and training instances are kept the same across different models. All experiments are run on a server with 8 NVIDIA Tesla V100 32GB GPUs.
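Aggregating a metric over the three seeded runs can be sketched as below. The scores are hypothetical placeholders, and the use of the sample standard deviation (rather than the population one) is an assumption, since the text does not say which variant is reported.

```python
from statistics import mean, stdev


def summarize_runs(scores: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) of one metric across runs."""
    return mean(scores), stdev(scores)


if __name__ == "__main__":
    # Hypothetical metric values from three runs with different seeds.
    m, s = summarize_runs([1.0, 2.0, 3.0])
    print(f"{m:.2f} ({s:.2f})")
```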

Main Results
The results of few-shot justification generation methods are reported in Table 4a. Lead-4, which directly presents the retrieved documents as justification, does not yield satisfactory results because it simply stacks evidence without generating a clear explanation of the rationale.
Both Flan-T5 and Llama2 outperform Lead-4, demonstrating the LMs' ability to generate justifications based on retrieved evidence. Flan-T5 performs comparably with Llama2 in ROUGE and SummaCC scores and better in MAUVE, despite having far fewer parameters. The reasons are likely two-fold: 1) Flan-T5's instruction fine-tuning on 1.8K tasks, which effectively enhances pre-trained language models (Sanh et al., 2022; Chung et al., 2022); 2) its fine-tuning on Chain-of-Thought (CoT) data (Wei et al., 2022), which aligns with the common presentation of ground-truth justifications that provide rationales to conclude the veracity, as exemplified in Table 1.
Combining ICL-enabled LMs with the dense retriever Contriever does not improve over using the sparse retriever BM25. Dense retrievers that are trained on extensive in-domain datasets such as MS-MARCO (Nguyen et al., 2016) are often surpassed by sparse retrievers when applied to new domains without large annotated datasets (Thakur et al., 2021; Izacard et al., 2022). While Contriever is a strong unsupervised retriever for bridging this gap, BM25 remains competitive (Izacard et al., 2022).
When only its LM is trained, Atlas demonstrates superior overall performance compared to Flan-T5 and Llama2, despite having far fewer parameters. This finding indicates that relying merely on the implicit knowledge of LMs without parameter updates is insufficient when the LM is not large enough. Joint training of the retriever and the LM leads to further performance gains, implying its benefits in the few-shot setting.
Compared to Atlas, JustiLM improves on several metrics, indicating that using fact-check articles as auxiliary training signals enhances justification quality. With our proposed distillation techniques, JustiLM considerably improves all ROUGE scores. Compared to Atlas, the combination of article-level distillation on the retriever and chunk-level distillation on the LM increases ROUGE-1, ROUGE-2, and ROUGE-L scores by 15.0%, 7.97% and 10.9%, respectively, suggesting that JustiLM can generate justifications that are more similar to those written by fact-checkers. Furthermore, 3 out of 4 combinations of distillation techniques outperform Atlas in MAUVE scores, with the highest gain being 45%. This suggests that JustiLM's justifications are more fluent and more coherent with the ground truths. This can be attributed to our distillation method allowing the model to learn from fact-check articles, which are much more informative and detailed than the explanatory justifications. Lastly, JustiLM effectively improves the SummaCC score, indicating better factual consistency of the generated justifications.
GPT-4 demonstrates exceptionally strong ability in providing factually consistent responses and outperforms the other ICL-enabled methods, Flan-T5 and Llama2, across all metrics. In comparison, JustiLM falls below GPT-4 in ROUGE-1 and SummaCC, but outperforms GPT-4 in ROUGE-2/L and MAUVE. This highlights JustiLM's effectiveness, especially considering its small model size and its independence from the intensive compute and storage resources required by very large models. Also, its ease of fine-tuning with more and new training data provides significant flexibility in addressing the ever-changing landscape of misinformation.

Generalization on New Claims
To address the concern of pre-trained LMs having potentially seen the evaluation data during their pre-training, we investigate how different methods perform on a new test set with new emerging claims made after their training.Since the WatClaimCheck dataset exclusively encompasses claims prior to July 2021 (Khan et al., 2022)

Ablation on Distillation Techniques
Table 5 reports the results of ablations on our distillation techniques. We observe that distillation during LM training yields greater improvements than distillation during retriever training. This is expected, considering that the LM benefits from direct supervision from ground-truth justifications during training, while the retriever relies on weak supervision from the LM and the distillation of fact-check articles. Additionally, the LM has far more parameters than the retriever: 3 billion for the LM compared to 110 million for the retriever. As a result, the LM tends to capture more knowledge from fact-check articles during the distillation process, leading to substantial improvements in performance.

Joint Veracity-Justification Performance
In this section, we demonstrate that JustiLM can be easily extended for joint veracity prediction and justification generation. We follow Khan et al. (2022). We make the LM generate the justification and the veracity label at the same time. For veracity label prediction, let y_{cls,i} be a veracity label; its predicted score, assigned by the LM conditioned on the claim and the retrieved documents, is defined following Liu et al. (2022). We then rank all classes by their predicted scores and select the top-ranked class. During training, we compute the prediction probabilities by applying the softmax function to the predicted scores, and use cross-entropy as the loss function.
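The ranking and training steps described above can be sketched in a few lines. The per-class scores here are hypothetical placeholders standing in for the LM-assigned scores; how those scores are computed is not reproduced, and the function names are invented for this example.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over per-class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def predict_and_loss(scores: list[float], gold_index: int) -> tuple[int, float]:
    """Pick the top-ranked class and compute the cross-entropy loss.

    `scores[i]` is the score of veracity label i (hypothetical values here).
    """
    probs = softmax(scores)
    pred = max(range(len(scores)), key=scores.__getitem__)
    loss = -math.log(probs[gold_index])
    return pred, loss


if __name__ == "__main__":
    # Hypothetical scores for three veracity labels; label 1 is the gold class.
    print(predict_and_loss([-1.2, -0.3, -2.0], gold_index=1))
```

At inference time only the ranking matters, so the softmax can be skipped; it is needed during training to turn the scores into a distribution for the cross-entropy loss.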
Table 6 presents the results. Atlas-CLS, which directly predicts the veracity label with Atlas, shows only a limited improvement in macro-F1 score over the Majority method. This suggests that predicting the veracity of real-world claims remains challenging for the original RAG model in a few-shot setting. When joint veracity prediction and justification generation is performed with the LM training, our method achieves a substantial boost in verdict prediction. Specifically, we achieve absolute improvements of 18.19 and 15.52 in macro-F1 using the article-level and chunk-level techniques, respectively. This indicates that justification generation can help veracity prediction by consolidating evidence from retrieved documents. We also find that jointly training JustiLM with the veracity prediction task does not improve the performance of justification generation, which is consistent with the findings of Atanasova et al. (2020). We conjecture that it remains challenging for the model to boost both tasks simultaneously with few-shot training instances. Potential solutions include leveraging a larger multi-task training dataset, such as T0 (Sanh et al., 2022), or using an independent veracity classifier that can be jointly trained with the retriever and the LM. However, both options require additional data and computational resources. We leave this for future studies.
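To make the macro-F1 comparison with the Majority method concrete, the sketch below computes macro-F1 from scratch and applies it to a majority-class predictor. The label names and the toy gold labels are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    f1s = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


def majority_baseline(train_labels: list[str], n: int) -> list[str]:
    """Predict the most frequent training label for every test instance."""
    top = Counter(train_labels).most_common(1)[0][0]
    return [top] * n


if __name__ == "__main__":
    gold = ["false", "false", "mixed", "true"]  # hypothetical labels
    pred = majority_baseline(gold, len(gold))
    print(round(macro_f1(gold, pred), 4))
```

Because macro-F1 weights every class equally, a majority predictor scores zero on all minority classes, which is why it serves as a low reference point.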

Case Study
Table 7 presents example justifications generated by JustiLM, the strong ICL baseline GPT-4, and the few-shot RAG model Atlas. The justification generated by Atlas captures that the GOP bill does not change the law, but fails to highlight the key point that women still have viable avenues to address pay discrimination. Both GPT-4 and JustiLM successfully refute the claim by providing that crucial point.
More specifically, Atlas falls short of delivering a convincing and comprehensive justification because it tends to produce incomplete and repetitive responses. In contrast, GPT-4, being the SoTA LLM, impresses with its ability to generate a well-rounded justification, but its output is lengthy and less focused. JustiLM, on the other hand, successfully highlights the key points for fact-checking the claim with a precise and refined justification. Although JustiLM, with its relatively small model size, may not always offer the same level of detail as GPT-4, it can produce concise and accurate justifications that closely resemble the ground truth, making it promising and valuable for users seeking quick and trustworthy fact-check explanations.

Discussion
There is no passage- or sentence-level annotation in the original long-form reference documents and fact-check articles, as such annotation is costly to obtain. We therefore do not have ground truths for training and evaluating an evidence retrieval model. Since these long documents bury specific evidence within them, directly using them for training introduces a considerable amount of irrelevant text. While we mitigate this challenge by splitting each original reference document into disjoint 100-word chunks for retrieval, we believe that acquiring fine-grained evidence annotations will benefit both training and evaluation.
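The chunking step mentioned above can be sketched as follows. Whitespace tokenization and the function name `chunk_words` are assumptions of this example; the actual preprocessing may define words differently.

```python
def chunk_words(text: str, size: int = 100) -> list[str]:
    """Split `text` into disjoint chunks of at most `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


if __name__ == "__main__":
    # A hypothetical 250-word document yields chunks of 100, 100 and 50 words.
    doc = " ".join(f"w{i}" for i in range(250))
    print([len(c.split()) for c in chunk_words(doc)])
```

Disjoint chunks keep each retrieval unit short, but a piece of evidence that straddles a chunk boundary is split across two units, which is one reason fine-grained annotations would help.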
In our experimental setup, evidence retrieval is conducted under the assumption that the evidence needed for fact-checking a given claim exists in the retrieval corpus. However, in real-world search scenarios, gold evidence may be absent from the retrieval corpus. It is therefore valuable to investigate how justification generation methods perform under this more challenging setting by varying the ratio of gold reference documents in the retrieval corpus. Additionally, while our experiments include the NLI-based metric SummaCC, which provides an automated evaluation of the factuality of generated justifications, we believe that a sound human evaluation involving professional fact-checkers is needed. Such an evaluation, currently not conducted, necessitates close collaboration with fact-checking organizations and requires particular networking and setup, such as integration with their existing workflow and the provision of incentives for them to participate. It could be warranted as a separate study by itself and is part of our future plan.
As the SoTA LLM, GPT-4 shows strong ability in generating factually consistent and informative justifications; therefore, developing justification methods based on such powerful API-based LLMs is beneficial. However, these black-box LLMs place strict constraints on access to their internal information, which poses important open challenges for interacting with them deeply and for providing supervision signals to the retriever.
In this work, we address the justification generation task with a realistic approach, which generates justifications based on the retrieved evidence using an end-to-end retrieval-augmented language model. Furthermore, incorporating our distillation techniques into the RAG model Atlas yields a marked improvement in performance. This affirms that utilizing fact-check articles during training to provide supervision signals can strongly enhance justification generation.

Conclusion and Future Work
We propose JustiLM, a justification generation language model for realistic fact-checking of real-world news claims, where justification generation is performed based on evidence retrieved from a large textual corpus, and introduce a new benchmark dataset, ExClaim, for this task. JustiLM leverages fact-check articles as auxiliary resources during training to distill article-level and chunk-level training signals that guide justification writing. Experimental results show that JustiLM outperforms ICL-enabled Flan-T5 and Llama2, as well as the SoTA few-shot RAG model Atlas. JustiLM also demonstrates comparable and promising performance when compared to GPT-4.
In the future, we will explore adapting various LLM-based reasoning methods (e.g., CoT (Wei et al., 2022), ToT (Yao et al., 2023b), and GoT (Besta et al., 2023)) to JustiLM to enhance its reasoning ability for justification generation. The aim is to help the LM provide better signals for guiding evidence retrieval and to improve reasoning over retrieved evidence during justification generation. We also plan to develop a human evaluation scheme involving fact-checking experts to provide a more comprehensive and efficient assessment of machine-generated justifications.

Figure 1: The architecture of JustiLM. Grey solid arrows present the inference process without fact-check article z. Red dashed arrows present the training process of the backbone model, where the ground-truth justification provides supervisory signals to train both the retriever and the LM. Blue dashed arrows present the training process with the distillation of z as supervisory signals. The document encoder is fixed during training, while other modules are trainable. QE: Query Encoder; DE: Document Encoder; Enc: Encoder; Dec: Decoder.

Table 2: Statistics of the ExClaim dataset. †: Note that fact-check articles in the test set are not used in our method, but exclusively utilized by baselines that rely on fact-check articles.

Table 3: Results of justification generation methods trained on Fact-check Article (F.C. Article) and tested on Fact-check Article / Retrieved Documents (Retr. Docs). Para.: Parameters. Standard deviation is in (.).
On the new test set with 348 claims published later than the claims from the WatClaimCheck dataset used for training.

Table 4: Few-shot justification generation results on test set (a) and new test set (b). Standard deviation is in (.).

Table 6: Results of joint veracity prediction and justification generation. Standard deviation is in (.).
and the newest pre-training data of Llama2 are cut off by September 2022, we gather a new set of claims made between October 2022 and September 2023, yielding a new test set comprising 348 instances, each with its associated reference documents and justification. Following the same steps detailed in §4, the newly collected reference documents are added to the corpus for model retrieval. As shown in Table 4b, all methods show a performance drop on the new test set. Nonetheless, the findings obtained on the original test set still hold for the new test data. Additionally, compared to the baseline methods, the relatively mild performance drop of JustiLM suggests stronger generalizability and robustness of our distillation techniques.

Table 7: An example of generated justifications by different methods compared to the ground-truth justification.
Claim: Says Wisconsin women facing pay discrimination can't "do something about it" under bill passed by Republicans.
Ground Truth: Falk said women facing pay discrimination can't "do something about it" under a bill passed by Wisconsin Republicans. But of the four legal options available under the current law, only one would be eliminated by the bill. We rate Falk's statement False.
Atlas: Falk said that under a Republican-backed bill, women facing pay discrimination can't "do something about it." Falk said that under the GOP bill, women facing pay discrimination can't "do something about it." But the bill doesn't change the law. It simply makes it harder for women to file lawsuits. Falk said that under the GOP bill, women facing pay discrimination can't "do something about it."
GPT-4: A statement by Kathleen Falk, a former Dane County executive, claims that women in Wisconsin would no longer have "a right to do something" about pay discrimination on the job if a bill is signed by Governor Walker. The bill in question would take away the right of women who suffer pay discrimination to sue in state court. However, there are still other avenues for employment discrimination victims, including filing a complaint with the state Equal Rights Division, filing a complaint with the federal Equal Employment Opportunity Commission, and filing a lawsuit in federal court. Therefore, while the bill may limit one avenue for action, it does not completely prevent women from taking action against pay discrimination.
JustiLM: Falk said that women facing pay discrimination can't "do something about it" under a bill passed by Republicans. The measure would make it harder on victims of employment discrimination because it is easier to sue in state court than in federal court. But eliminating the state lawsuit option wouldn't affect the three other avenues of legal recourse available to workplace discrimination victims. And Falk's claim was women facing pay discrimination would have no options at all.