Abstract
Fact verification on tabular evidence incentivizes the use of symbolic reasoning models where a logical form is constructed (e.g., a LISP-style program), providing greater verifiability than fully neural approaches. However, these logical forms typically rely on well-formed tables, restricting their use in many scenarios. An emerging symbolic reasoning paradigm for textual evidence focuses on natural logic inference, which constructs proofs by modeling set-theoretic relations between a claim and its evidence in natural language. This approach provides flexibility and transparency but is less compatible with tabular evidence since the relations do not extend to arithmetic functions. We propose a set-theoretic interpretation of numerals and arithmetic functions in the context of natural logic, enabling the integration of arithmetic expressions in deterministic proofs. We leverage large language models to generate arithmetic expressions by generating questions about salient parts of a claim which are answered by executing appropriate functions on tables. In a few-shot setting on FEVEROUS, we achieve an accuracy of 71.4, outperforming both fully neural and symbolic reasoning models by 3.4 points. When evaluated on TabFact without any further training, our method remains competitive with an accuracy lead of 0.5 points.
1 Introduction
Fact verification systems assess the veracity of claims based on evidence and provide an explanation for the prediction. In the case of tabular evidence, verification frequently relies on symbolic reasoning steps, such as the execution of arithmetic functions, to accurately predict whether a claim is supported by evidence (Herzig et al., 2020, inter alia). This incentivizes symbolic reasoning systems, where a logical representation of a claim and its tabular evidence (e.g., a LISP-style program) is executed to produce the veracity prediction (Chen et al., 2020; Cheng et al., 2023). Since the execution of these logical forms is deterministic, they serve as faithful explanations of the model’s reasoning (Jacovi and Goldberg, 2021). However, these systems typically rely on well-formed tables, constraining their use in many scenarios, such as reasoning over diverse tabular structures as typically found on Wikipedia. Consequently, the majority of recently proposed verification models focus on neural entailment models that latently execute arithmetic functions (Liu et al., 2022b; Gu et al., 2022) or generate a natural language explanation alongside its prediction (Wei et al., 2022, inter alia). While systems that produce natural language explanations are more flexible regarding the evidence format, they do not necessarily generate faithful explanations (Atanasova et al., 2023).
An emergent symbolic reasoning paradigm for textual evidence focuses on logical inference by directly comparing claim and textual evidence via natural logic inference (Angeli and Manning, 2014), achieving high prediction accuracy while maintaining faithful explanations (Krishna et al., 2022; Aly et al., 2023). However, current natural logic systems are unable to handle tabular evidence since the semantic relationship captured between aligned claim-evidence spans via natural logic’s set-theoretic operators does not extend to arithmetic functions (MacCartney and Manning, 2009). For instance, in Figure 1, no evidence in the table directly corresponds to the part of the claim that states three municipalities. Instead, arithmetic computation on the table beyond the expressiveness of natural logic’s set-theoretic operators is required (i.e., counting relevant cells).
To this end, we propose TabVer: Tabular Fact Verification, a natural logic inference system that adds arithmetic reasoning capabilities to reason over tabular evidence directly in natural language. We define a set-theoretic interpretation of comparisons between numerals in claim-evidence pairs, and extend that definition to executions of arithmetic functions via arithmetic expressions (ArithExps) to enable their integration into natural logic proofs. The proofs are executed deterministically on a finite state automaton (DFA) as defined in natural logic inference. ArithExps are produced by leveraging large language models (Brown et al., 2020, inter alia), generating questions about salient parts of the claim ci, which are answered via a rationale that produces an answer ai. As illustrated in Figure 1, TabVer generates a question such as “What is the total population of Ortegal in 2018?” to verify the part larger than 12,000 in the claim c. Answering this question on the evidence table produces a rationale with the expression SUM 12,238 as the final answer ai, indicating the execution of the function sum(3945, 1126, 1363, 5804) = 12238 over relevant evidence in E. The aligned pair (larger than 12,000, SUM 12,238) is then assigned a natural logic operator as part of a natural logic proof, with the predicted operator being consistent with our set-theoretic definitions (cf. Figure 3).
In a few-shot setting with 64 training instances on the tabular subset of the FEVEROUS dataset (Aly et al., 2021), TabVer outperforms previous symbolic reasoning systems, including LPA (Chen et al., 2020), SASP (Ou and Liu, 2022), Binder (Cheng et al., 2023), and a state-of-the-art natural logic system (Aly et al., 2023), by 10.5 accuracy points. Moreover, TabVer outperforms the highest-scoring neural entailment model by 3.4 accuracy points, including baselines such as TAPAS (Herzig et al., 2020), TAPEX (Liu et al., 2022b), PASTA (Gu et al., 2022), and large language models of similar size to TabVer. We confirm the tabular reasoning capabilities of TabVer in a domain transfer setting to TabFact (Chen et al., 2020) without further training annotations. Our system performs competitively, leading over the strongest baseline by 0.5 accuracy points. Our analysis reveals that TabVer’s reading of numerals is more sensitive to numerical inaccuracies and the pragmatic context of a claim (i.e., quantifiers and rounding) than a same-sized LLM baseline, reflecting the annotator guidelines of FEVEROUS more accurately. Finally, the arithmetic functions invoked in TabVer’s proofs are more accurate than the ones called in the logical forms of our symbolic reasoning baselines.1
2 Related Work
Symbolic reasoning systems for fact verification convert text into a logical form or an executable program (SQL- or LISP-style). They typically involve a neural component, either to rank viable candidate programs consisting of hand-crafted functions (Chen et al., 2020) or via neural-symbolic models that generate programs directly (Liang et al., 2017; Ou and Liu, 2022). These programs are faithful explanations since the program’s execution is the verdict. With the improved capabilities of large language models to generate code (Chen et al., 2021), Cheng et al. (2023) and Glenn et al. (2024) explore the use of SQL, Python, and FOL to faithfully fact-check tabular claims; however, they only use proprietary models consisting of hundreds of billions of parameters. We show that TabVer outperforms these approaches (when controlled for the language model), which we attribute to the suitability of natural logic to natural language in contrast to query languages like SQL.
The aforementioned symbolic executors stand in contrast to the more prominent approach of using programs as features for neural systems, typically complemented by the original claim and table. For instance, LISP-style programs are used as a latent signal for a graph neural network (Shi et al., 2020; Zhong et al., 2020; Yang et al., 2020; Gong et al., 2023), and SQL queries and their executions are used as features for an LLM serving as a verdict classifier (Kong et al., 2024; Zhang et al., 2024c; Wu and Feng, 2024). Wang et al. (2024) incrementally update an evidence table with LISP-style operations. As an alternative to symbolic integration into neural systems, Chen (2023) produces natural language explanations using chain-of-thought prompting (Wei et al., 2022), showing that a 175B-parameter GPT-3 model competes with fully supervised systems on tabular claims, yet its 6.7B variant performs only slightly above chance. This observation has been further confirmed by Zhang et al. (2024a) with Llama2-Chat-7B. Finally, large-scale instruction-tuning on tabular tasks has been explored (Zhuang et al., 2024; Zhang et al., 2024b; Liu et al., 2023); however, these models do not produce explanations. In summary, previous systems either rely on large proprietary models to achieve competitive performance or sacrifice prediction explainability.
In contrast to these explicit meaning representations, Angeli and Manning (2014) propose to use NatLog (MacCartney and Manning, 2007, 2009) for textual inference, operating directly on natural language by comparing texts in a premise with an associated hypothesis using set-theoretic relations. As a framework for flexible compositional inference, it thus circumvents the requirement to convert statements into rigid logical forms, which are typically constructed independently from one another. These favorable properties of natural logic inference have recently been explored for fact verification, resulting in accurate predictions while maintaining transparency with plausible explanations (Krishna et al., 2022; Aly et al., 2023). Aly et al. (2023) exploit natural logic’s operations on natural language by casting the operators into a question-answering framework to leverage recent advances in instruction-tuned language models. This paper is the first attempt to extend natural logic inference for fact verification to the tabular domain.
Finally, tabular question answering (Jin et al., 2022) is a common component for decomposing a claim and its reasoning process. Yang and Zhu (2021) supplement the evidence with answers to questions generated via decomposition templates, while Suadaa et al. (2021) supplement the evidence with information from a table-to-text model. More recently, Ye et al. (2023) use LLMs to decompose tables and questions. However, all three methods feed these modified tables into a pre-trained neural model (Herzig et al., 2020), ultimately producing veracity predictions without explanations. Moreover, even for textual evidence, most previous work that generates questions conditioned on the claim does not construct proofs from the answers (Rani et al., 2023; Fan et al., 2020; Jobanputra, 2019).
3 Method
Given a claim c and a set of evidence tables E, the task is to predict a veracity label from {Supports, Refutes, Not Enough Information (NEI)} and to accompany the prediction with an explanation. Since evidence might require arithmetic reasoning beyond the expressiveness of natural logic, as shown in Figure 1 with three municipalities, TabVer’s explanation is a proof P = m1, …, ml, consisting of quintuples mi = (ci, ei, qi, ai, oi), where oi describes the set-theoretic relation (NatOp) between a claim span ci and the result ai of arithmetic computations executed over relevant evidence ei. TabVer performs arithmetic reasoning steps in a question-answering framework, producing an arithmetic expression (ArithExp) whose answer ai to a question qi about a claim span ci is computed over evidence ei. The sequence of operators O = o1, …, ol is then the input to a finite state automaton that specifies the claim’s veracity label. We follow the DFA for textual entailment described in Angeli and Manning (2014), shown in Figure 2.
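To make the proof execution concrete, the sketch below runs an operator sequence O through a small automaton. The transition table here is a simplified illustration rather than a reproduction of the DFA in Figure 2: it only encodes the behavior explicitly described in this paper (equivalence and forward entailment preserve support, alternation and negation refute, independence yields NEI, and a second alternation leads to NEI); the remaining entries are assumptions.

```python
# Minimal sketch of executing a NatOp sequence o_1..o_l on a verdict automaton.
# The transition table is an illustrative simplification, not the exact DFA of
# Angeli and Manning (2014) shown in Figure 2; entries marked "assumed" are
# not taken from the paper.

SUP, REF, NEI = "Supports", "Refutes", "NEI"

TRANSITIONS = {
    SUP: {"≡": SUP, "⊏": SUP, "¬": REF, "⥯": REF, "#": NEI},
    REF: {"≡": REF, "⊏": REF, "¬": SUP, "⥯": NEI, "#": NEI},  # partially assumed
    NEI: {},  # NEI is absorbing
}

def execute_proof(natops, state=SUP):
    """Run the operator sequence left to right and return the final verdict."""
    for op in natops:
        if state == NEI:
            break
        state = TRANSITIONS[state].get(op, NEI)
    return state

# Running example of Figure 1: "larger than 12,000" is supported by SUM 12,238
# (an entailment NatOp), while "three municipalities" conflicts with the four
# municipalities in the table (alternation), so the claim is refuted overall.
print(execute_proof(["⊏", "⥯"]))  # -> Refutes
```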
To enable the assignment of NatOps o to ArithExps, we need to expand the set-theoretic definition of these operators. To this end, we first discuss the set-theoretic relationship for numerals that occur in claim and evidence without the need for further computation (Section 3.1). We subsequently expand this definition to ArithExps where arithmetic functions are applied to evidence, by mapping function executions on relevant evidence to numerical representations (Section 3.2). TabVer produces its quintuples (ci, ei, qi, ai, oi) by first generating a question qi about a claim span ci that contains salient information (Section 3.3). This question is answered using the evidence E by producing a rationale, consisting of extracted evidence ei, the execution of appropriate arithmetic functions on ei, and the final answer ai (Section 3.4). Finally, a proof generation model MP, trained on proofs containing ArithExps and associated NatOps following our set-theoretic definitions, assigns a NatOp oi to the claim-answer pair. TabVer follows QA-NatVer (Aly et al., 2023) for the proof generation process by selecting over multiple proof candidates.
3.1 A Set-theoretic Perspective on Numerals
We first define a set-theoretic interpretation of the relationship between numerals in claim spans and evidence (or answers calculated on the evidence with ArithExps), within the context of natural logic. Specifically, we consider five set-theoretic relationships (NatOps): equivalence (≡), forward entailment (⊏), reverse entailment (⊐), negation (¬), and alternation (⥯).2 Figure 3 shows examples of numerical expressions as evidence ei with the associated claim span ci for each NatOp. For instance, a claim span about a hundred goals would generally follow from the evidence 99 goals since the explicit adverbial modifier about widens the scope of the numeral a hundred to a larger set, including, e.g., 99 and 101. However, even bare numerals can carry implicit meaning beyond the utterance itself, referred to as scalar implicature (Grice, 1975, inter alia), and are subject to both semantics and pragmatics.
Linguistic approaches to numerals typically consider an upper-bounded (exact) and a lower-bounded (at least) reading, depending on several factors such as whether an environment is upward- or downward-entailing3 (Panizza et al., 2009). Suitably, the effect of these environments on the entailment relationship between claim and evidence is modelled explicitly in natural logic (MacCartney, 2009), enabling the integration of these different readings of numerals into a model of natural logic. Since the majority of claims appear in an upward-entailing environment, we focus here on the set-theoretic reading of numerals under an upper-bounded definition. We discuss a downward-entailing projection of numerals following an at least reading in Appendix A. In an upper-bounded reading, the terminology of natural logic can be extended such that evidence spans like 5 goals aligned to claim spans with a strictly smaller number like two goals are assigned the alternation NatOp (⥯), since an upper-bounded reading assumes that 2 goals and 5 goals are mutually exclusive without covering the entire universe, i.e., all natural numbers (cf. Appendix A).
Another component of a numeral’s reading to consider is its pragmatic halo (Lasersohn, 1999), where a number can represent a range of values due to the intended degree of approximation to the truth in a specific context. As seen earlier, a halo can be indicated explicitly with modifiers (cf. about), yet it is also often defined implicitly. For instance, a claim like “Messi scored a hundred goals in the 2010 season.” might be considered supported by evidence stating that he scored 101 goals in an environment with low requirements of numerical precision, e.g., on social media, since the communicated content ({100}) is weaker than the asserted content (101 ∈ {100} ∪ H100), with H100 being the pragmatic halo of 100 as a set of (integer) numbers.4 However, the evidence would lead to the claim’s refutation in an environment where exactness is required, i.e., when H100 = ∅,5 e.g., for statements in scientific articles. The size of the pragmatic halo typically increases with larger numbers, since it becomes pragmatically less necessary to be precise. Therefore, Vlachos and Riedel (2015) consider a fixed threshold on the absolute percentage error between numbers in a claim and evidence. Yet, in reality, the halo of a number is more dynamic: in languages with decimal number systems, such as English, multiples of ten and five generally have a larger pragmatic halo than other numbers due to the communicative tool of rounding (Woodin et al., 2024). For instance, the claim that Messi scored 100 goals while the evidence states he scored 101 is more likely to be accepted than the reverse, since 101 is not expected to be a rounded number, hence |H100| > |H101|. Conveniently, the pragmatic halo can be expressed in natural logic via a projection to the entailment NatOps (e.g., Frwd. Entailment in Figure 3) and is learned on annotated proof data (cf. Section 4.3).
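To illustrate the upper-bounded reading and the pragmatic halo, the sketch below assigns a NatOp to a claim numeral and an evidence numeral. The hard-coded halo heuristic (an explicit modifier or round numbers widening the claim's set) is an assumption made purely for illustration; in TabVer the effect of the halo is learned from annotated proof data rather than specified by rules.

```python
# Illustrative sketch of the upper-bounded reading of numerals (Section 3.1).
# The halo widths below are toy assumptions; TabVer learns the projection of
# the pragmatic halo to entailment NatOps from annotated proofs instead.

def pragmatic_halo(x: float, explicit_modifier: bool = False) -> float:
    """Half-width of the halo H_x around a claim numeral x."""
    if explicit_modifier:      # e.g. "about", "around", "approximately"
        return 0.05 * abs(x)
    if x % 10 == 0:            # round numbers tolerate more slack ...
        return 0.02 * abs(x)
    if x % 5 == 0:             # ... than other multiples of five ...
        return 0.01 * abs(x)
    return 0.0                 # ... while bare, non-round numbers read exactly

def numeral_natop(claim_num: float, evidence_num: float, modifier: bool = False) -> str:
    """Assign a NatOp between a claim numeral and an evidence numeral."""
    if claim_num == evidence_num:
        return "≡"             # equivalence
    if abs(claim_num - evidence_num) <= pragmatic_halo(claim_num, modifier):
        return "⊏"             # evidence falls within the claim's halo: entailment
    return "⥯"                 # disjoint sets under an upper-bounded reading: alternation

print(numeral_natop(100, 99, modifier=True))  # "about a hundred goals" vs "99 goals" -> ⊏
print(numeral_natop(2, 5))                    # "two goals" vs "5 goals" -> ⥯
```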
3.2 Arithmetic Expressions
Since evidence ei is often stated in terms different from those needed to verify a textual claim, as seen in Figure 1, we introduce ArithExps, which map tabular evidence to numerals by executing arithmetic functions. ArithExps are function executions that produce an answer ai for a question qi to an associated claim span ci over relevant evidence ei from the table E. For the computation of ai we consider functions that take as input evidence ei and output a single numeral. The answer of an ArithExp is represented as the result of the computation prepended by the function’s name, e.g., SUM 12,238 for sum(ei), with ei ⊆ E.6 Figure 1 shows the ArithExp SUM 12,238 (for the sum of the cells 3,945, 1,126, 1,363, and 5,804) as answer ai aligned to the claim span ci, larger than 12,000. To extend ArithExps to cover more complicated computations, we enable function composition, i.e., a function whose output serves as an input argument to another function. The ArithExp for a function composition is the final computation, i.e., the name and result of the outermost function.
The full list of permissible functions we consider is shown in Figure 4. In addition to the functions count, sum, diff, average, min, and max, we consider comparative functions as a separate function class. Comparatives could be modeled by the diff function, i.e., by subtracting the quantities of the relevant arguments. However, we represent them as a unique ArithExp since they serve a different semantic function in relation to a claim span ci. The comparative ArithExp can be used for both implicit (e.g., Person X had more votes than Person Y) and explicit comparisons (e.g., Person X had 5000 votes more than Person Y), since the difference in quantity is indicative of both polarity and magnitude. Finally, to cover the base case where all relevant information is already contained in ei (i.e., no computation is required, cf. Section 3.1), we consider a copy function.
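As a concrete illustration, the sketch below executes the permissible functions over relevant evidence cells and formats the answer as the function name prepended to the result, as in SUM 12,238 from Figure 1. The exact output formatting and the treatment of comparatives as a signed difference are simplifications; function composition is omitted.

```python
# Illustrative sketch of ArithExps (Section 3.2): execute an arithmetic
# function over the relevant evidence cells e_i and prepend its name to the
# result. The output formatting is a simplifying assumption.

FUNCTIONS = {
    "COUNT": lambda xs: len(xs),
    "SUM": lambda xs: sum(xs),
    "DIFF": lambda xs: xs[0] - xs[1],
    "AVERAGE": lambda xs: sum(xs) / len(xs),
    "MIN": min,
    "MAX": max,
    "COMP": lambda xs: xs[0] - xs[1],  # comparative: signed difference carries polarity and magnitude
    "COPY": lambda xs: xs[0],          # base case: no computation required
}

def arith_exp(name: str, cells: list) -> str:
    """Return the ArithExp answer a_i, e.g. 'SUM 12,238'."""
    value = FUNCTIONS[name](cells)
    value = int(value) if float(value).is_integer() else round(value, 2)
    return f"{name} {value:,}"

# Figure 1: populations of Ortegal's four municipalities in 2018.
cells = [3945, 1126, 1363, 5804]
print(arith_exp("SUM", cells))    # -> "SUM 12,238"
print(arith_exp("COUNT", cells))  # -> "COUNT 4"
```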
3.3 Question Generation
We generate questions that can be directly linked to salient parts of a claim ci, as seen in Figure 1. For instance, the question “What is the total population of Ortegal in 2018?” directly corresponds to the claim span larger than 12,000. We use a fine-tuned large language model MQG(c, T), which takes a claim c and a prompt template T as input and autoregressively generates a collection of questions q1…ql along with their corresponding targeted claim spans. The output is formatted as an enumerated list of questions and claim spans: 1. [q1] [c1] 2. [q2] [c2] ⋯. To ensure that each generated claim span occurs verbatim in the claim, we employ constrained decoding to restrict the sampling of ci to spans of the claim c (including c itself). This prevents the model from introducing words or phrases that are not present in the claim, a behavior we observed even after fine-tuning. Additionally, we use constrained decoding to enforce the enumeration format defined above when generating multiple questions jointly. By conditioning the generation of questions on previously generated ones, we improve the coverage of salient information in the claim and reduce the likelihood of redundant or repetitive questions (Fan et al., 2020).
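A light-weight way to consume the generator's output is to parse the enumerated format above and verify that every claim span occurs verbatim in the claim. The bracketed format and regular expression below are assumptions for illustration (a post-hoc check), not the constrained-decoding implementation itself, and the example claim paraphrases the running example of Figure 1.

```python
import re

# Sketch: parse the enumerated output "1. [q1] [c1] 2. [q2] [c2] ..." of the
# question generation model M_QG and keep only pairs whose claim span occurs
# verbatim in the claim. This mimics the constraint enforced by constrained
# decoding as a post-hoc check; the bracketed format is an assumption.

def parse_questions(output: str, claim: str) -> list:
    pairs = re.findall(r"\d+\.\s*\[(.*?)\]\s*\[(.*?)\]", output)
    return [(q, span) for q, span in pairs if span in claim]

claim = "In 2018, Ortegal had a population larger than 12,000 in its three municipalities."
output = ("1. [What is the total population of Ortegal in 2018?] [larger than 12,000] "
          "2. [How many municipalities does Ortegal have?] [three municipalities]")
print(parse_questions(output, claim))
```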
3.4 Tabular QA with ArithExps
If no function is considered relevant to further process the extracted evidence ei, the model MQA outputs N/A after the extraction of evidence and subsequently does not return an ArithExp. If the evidence tables do not contain any relevant information to answer q, the model instead returns N/A as the relevant evidence ei, which is mapped to the independence NatOp (#), leading to an NEI verdict prediction according to the DFA (cf. Figure 2). Parts of a claim that do not require separate questioning (such as In 2018 in Figure 1) are assumed to be contained in the evidence extracted for answering questions about claim c. QA-NatVer’s span alignment algorithm aligns these claim spans to the evidence ei extracted across all questions q1…ql for the claim c.
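The two fallback cases described above can be summarized as a small decision step; the sketch below is a simplified rendering of that logic, with the string conventions ("N/A", "FUNC value") assumed for illustration.

```python
# Sketch of the answer handling in Section 3.4: if no relevant evidence is
# found, the pair is mapped to the independence NatOp (#), which the DFA turns
# into an NEI verdict; if evidence was extracted but no function is needed,
# the evidence itself is compared against the claim span (copy case);
# otherwise the ArithExp answer "FUNC value" is aligned to the claim span.
# The string conventions are simplifying assumptions.

def interpret_answer(extracted_evidence: str, answer: str) -> dict:
    if extracted_evidence.strip() == "N/A":          # nothing relevant in the table
        return {"natop": "#", "verdict_hint": "NEI"}
    if answer.strip() == "N/A":                      # no arithmetic required
        return {"aligned_answer": extracted_evidence, "function": "COPY"}
    func, _, value = answer.partition(" ")           # e.g. "SUM 12,238"
    return {"aligned_answer": answer, "function": func, "value": value}

print(interpret_answer("3,945 | 1,126 | 1,363 | 5,804", "SUM 12,238"))
print(interpret_answer("N/A", "N/A"))
```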
4 Evaluation
4.1 Data
FEVEROUS
We train and evaluate models on the tabular subset of FEVEROUS (Aly et al., 2021), i.e., the claims where all evidence elements across all evidence sets are cells from tables. FEVEROUS consists of complex claims and tables with irregular structures. To focus on the natural logic-based tabular fact verification component of fact-checking, we use gold evidence tables (i.e., not ones selected via a retrieval system from a knowledge source) throughout our experiments. The resulting dataset consists of 2,011 claims, with 35%, 61.7%, and 3.2% being supported, refuted, and NEI claims, respectively (cf. Appendix Table 8). Out of the 2,011 claims, 521 are labelled as requiring numerical reasoning.
Models are trained on 64 FEVEROUS instances, selected uniformly from its training data. The veracity labels in the resulting training data are thus similarly imbalanced as the FEVEROUS development data. To train TabVer, we additionally manually annotated these training instances with rationales and natural logic proofs. These proofs contain ArithExps as defined in Section 3.2. The training distribution of arithmetic functions is also imbalanced. For details see Appendix B.
TabFact
We further evaluate models trained on FEVEROUS in a domain transfer scenario on TabFact (Chen et al., 2020), without further training on the latter. Contrary to FEVEROUS, TabFact only contains two veracity labels: Supported and Not Supported, the latter covering both refutations and NEI instances. TabFact contains only well-structured tables; the first row is always the table header. TabFact is designed to be evaluated on gold evidence tables E. We evaluate methods on its development set, consisting of 12,851 claims with evenly distributed labels, out of which 4,424 are simple (R1) and 8,427 complex claims (R2).
4.2 Baselines
We compare TabVer against strong baselines that can be categorized into two classes: (i) classifiers that predict a veracity label without symbolic mechanisms or the production of explanations, and (ii) symbolic reasoning models that produce faithful explanations.
Classification models.
DeBERTa+NLI is a DeBERTaV3 model (He et al., 2023) fine-tuned on multiple NLI tasks. PASTA (Gu et al., 2022) is a DeBERTaV3 model further pre-trained on different tabular operations. TAPAS (Herzig et al., 2020) is a transformer pre-trained on tabular data with additional table-aware positional embeddings. TAPEX (Liu et al., 2022b) is based on BART (Lewis et al., 2020), pre-trained as an SQL executor and fine-tuned on tabular data via table linearization. We follow typical encoder-only fine-tuning, where a linear transformation from embeddings to veracity labels is jointly optimized with the pre-trained model itself. Furthermore, we evaluate several LLMs, including Llama2-Chat-7B (Touvron et al., 2023) and MistralOrca-7B (Jiang et al., 2023), which we fine-tune via LoRA (Hu et al., 2022).
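The encoder-only fine-tuning setup described above can be sketched as follows; the checkpoint name is a placeholder and the table linearization and claim formatting are omitted for brevity.

```python
# Sketch of the classification baseline setup: a pre-trained encoder with a
# jointly optimized linear head mapping embeddings to the three veracity
# labels. The checkpoint is a placeholder; linearizing the evidence table into
# the second input segment is omitted.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["Supports", "Refutes", "NEI"]
checkpoint = "microsoft/deberta-v3-base"  # placeholder encoder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=len(LABELS))

inputs = tokenizer("claim text", "linearized evidence table", return_tensors="pt", truncation=True)
logits = model(**inputs).logits  # the linear head is fine-tuned jointly with the encoder
print(LABELS[logits.argmax(-1).item()])
```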
Symbolic Reasoning Models.
We compare against LPA (Chen et al., 2020), a LISP-style program synthesis algorithm with hand-crafted functions and trigger words to prune the search space; it incorporates a fine-tuned transformer to rank candidate programs. SASP (Ou and Liu, 2022) is built on top of neural symbolic machines (Liang et al., 2018) and considers both lexical and structural features to constrain program candidates, further using TaBERT (Yin et al., 2020) and an LSTM (Hochreiter and Schmidhuber, 1997) for program generation. We also consider Binder (Cheng et al., 2023), an approach that uses LLMs to map tabular claims to SQL queries and to execute specific API calls embedded in the queries on tables. To maintain comparability with TabVer, Binder uses MistralOrca-7B as its LLM. If no viable program can be found for a given claim, LPA, SASP, and Binder fall back to an NEI/Not Supported prediction. QA-NatVer (Aly et al., 2023) constructs natural logic inference proofs by casting natural logic operators into a question-answering framework; we linearize the evidence table and use the Flan-T5 3B backbone (Chung et al., 2024).
4.3 Implementation Details
Claim Decomposition.
To maintain comparability, we use the same claim decomposition for TabVer and all symbolic reasoning baselines we consider. Classification models instead use the original claim as input, since the impact of evidence incompleteness is expected to be minimal and decomposition can lead to error propagation. With the exception of Wang and Shu (2023), who represent a claim as a conjunction over subclaims, aggregations over the verdicts of parts of a claim are executed via neural mechanisms and thus do not guarantee faithfulness (Chen et al., 2022; Zhao et al., 2024).
Experimental Setup.
We do not consider a validation set for hyperparameter tuning, following the real-world few-shot learning setting of Alex et al. (2021). TabVer fine-tunes the question generation model MQG, the question answering model MQA, and the proof generation model MP on the hand-annotated rationales and proofs described in Section 4.1. MQG, MQA, and the claim decomposition model MD are MistralOrca-7B models, fine-tuned using LoRA (Hu et al., 2022). We use the proof generation model MP of Aly et al. (2023). Specifically, we fully fine-tune a 3B-parameter Flan-T5 model and a smaller BART0 model (406M parameters) (Lin et al., 2022) as MP to measure the accuracy of TabVer across model sizes. While it would be of interest to simplify TabVer by using MistralOrca-7B (or another powerful LLM) for all components, the implementation of Aly et al. (2023) currently only supports the training of encoder-decoder models, following Liu et al. (2022a). Furthermore, while MQG, MQA, and MD require language generation, the proof generation model MP of Aly et al. (2023) solves a discriminative task (answering binary/ternary questions), for which encoder-decoders have been shown to be competitive with decoder-only models at smaller scales (i.e., ≤ 11B parameters) (Chia et al., 2024). We leave the exploration of alternative model architectures and backbones for TabVer to future work. Implementation details and the prompts for all models are in Appendices C and A, respectively. Where indicated, results are averaged over five runs with the standard deviation reported; in all other cases, results are reported using the default seed 42.
5 Results
FEVEROUS
Results on FEVEROUS are shown in Table 1, reporting both accuracy and macro-averaged F1 due to the dataset’s label imbalance. TabVer outperforms all baselines both on the full dataset and on the numerical reasoning subset, by 3.4 and 5.6 accuracy points, respectively. We see similar differences in terms of F1, with a lead of 5.6 points. Except for the LLM baseline MistralOrca-7B, all classification models perform poorly in a few-shot scenario on FEVEROUS. The surprisingly poor performance of Llama2-Chat-7B confirms previous observations on few-shot tabular fact verification (Chen, 2023; Zhang et al., 2024a, b). In addition to being outperformed by TabVer, the classification baselines lack transparency and faithful explanations. To highlight TabVer’s data efficiency, we compare it against a fully supervised TAPAS classification model trained on 18,836 tabular FEVEROUS claims, which achieves an accuracy of 73.0, only 1.6 accuracy points better than TabVer.
Table 1: Results on the tabular subset of FEVEROUS (± denotes the standard deviation over five runs).

| | Model | Accuracy (Full) | Macro F1 (Full) | Accuracy (Num.) | Macro F1 (Num.) | Execution Found (%) |
|---|---|---|---|---|---|---|
| | Majority Baseline | 61.7 | 20.5 | 64.8 | 21.6 | – |
| Classific. | DeBERTaV3 | 53.9±0.7 | 36.8±0.4 | 55.6±0.6 | 36.0±1.3 | – |
| | PASTA | 54.6±2.8 | 34.1±0.4 | 55.3±4.3 | 32.6±1.4 | – |
| | TAPAS | 53.6±7.6 | 35.9±4.1 | 52.9±7.3 | 33.8±3.4 | – |
| | TAPEX | 53.6±1.5 | 34.0±0.9 | 52.8±3.4 | 32.9±2.1 | – |
| | Llama2-Chat-7B | 56.0±4.0 | 30.9±1.6 | 55.0±6.1 | 30.9±2.5 | – |
| | MistralOrca-7B | 68.0±1.1 | 45.4±4.4 | 64.5±3.2 | 43.6±3.0 | – |
| Symbolic | LPA (w/o decomp.) | 31.6±0.4 | 27.5±0.5 | 37.3±0.7 | 28.1±0.9 | 54% |
| | LPA | 21.8±0.1 | 21.4±0.2 | 22.3±0.4 | 21.3±0.4 | 41% |
| | SASP (w/o decomp.) | 52.9±2.6 | 29.8±1.8 | 55.1±3.4 | 29.3±1.9 | 98% |
| | SASP | 58.8±0.8 | 29.6±0.8 | 61.5±1.2 | 29.4±0.8 | 95.2% |
| | Binder (w/o decomp.) | 60.9±1.2 | 38.0±1.3 | 61.0±1.6 | 40.1±2.2 | 95.7% |
| | Binder | 62.7±1.4 | 37.3±1.3 | 63.7±1.8 | 39.3±1.6 | 95.4% |
| | QA-NatVer | 54.0±1.1 | 34.8±0.2 | 52.6±1.6 | 28.9±0.3 | 100% |
| TabVer | BART0 | 69.9±0.3 | 49.4±0.9 | 66.7±0.3 | 42.4±0.8 | 100% |
| | Flan-T5-xl | 71.4±0.5 | 51.0±0.5 | 70.1±1.3 | 45.8±0.3 | 100% |
While symbolic reasoning baselines provide faithful explanations, their performance is substantially worse than TabVer’s. Symbolic reasoning systems that construct semantic representations are unable to handle diverse and complex tabular structures (e.g., nested table headers) as present in FEVEROUS. For instance, the rule-based LPA approach finds a suitable program for only 41% of claims. The accuracy on claims where LPA does find a program is 55.8 points, an improvement of 25.6 points over its overall performance, but still substantially below TabVer. While the rate of executable programs is much higher for SASP and Binder, since their program generation is neural-guided, their overall performance is worse than TabVer’s, with a difference of 8.7 accuracy points for the best-performing symbolic baseline, Binder. Finally, QA-NatVer has a 100% execution rate due to its flexibility of operating on natural language, similarly to TabVer; however, the difficulty of aligning linearized evidence to claims and the lack of arithmetic reasoning capabilities result in low scores. Interestingly, the symbolic baselines perform better or comparably on the numerical subset relative to the full dataset, while we observe the opposite for the majority of classification models and the natural logic-based approaches, confirming the difficulty of modeling complex textual claims correctly with these meaning representations. Qualitative examples and representation limitations are discussed in Appendix Figures 6 and 7.
TabFact.
Results in the domain-transfer scenario without TabFact training data are shown in Table 2. TabVer remains competitive with our baselines, with an accuracy lead of 0.5 points over the best baseline (Binder) and an F1 score 0.3 points lower. The performance against the symbolic reasoning systems is particularly noteworthy since LPA and SASP have been designed specifically for TabFact, and Binder’s SQL parsing excels at well-structured tables. Consequently, LPA, SASP, and Binder find viable programs more frequently than on FEVEROUS, for 78%, 99.8%, and 100% of claims, respectively. Binder performs best out of all baselines, outperforming TabVer particularly on simple claims (R1) that do not require complex reasoning to predict correctly. Yet, on complex claims (R2) TabVer performs better than Binder. Binder’s performance discrepancy between FEVEROUS and TabFact is noteworthy, highlighting a fundamental limitation of previous approaches when applied to diverse tables (cf. Listing 4), which TabVer successfully addresses.
Table 2: Results on TabFact in the domain-transfer setting (± denotes the standard deviation over five runs); fully supervised models are trained on TabFact’s training data.

| | Model | Accuracy (Full) | Macro F1 (Full) | Accuracy (R1) | Macro F1 (R1) | Accuracy (R2) | Macro F1 (R2) |
|---|---|---|---|---|---|---|---|
| Full Supervision | LPA | 65.2 | 64.2 | 77.6 | 77.5 | 57.4 | 55.6 |
| | SASP | 74.7 | 74.7 | 86.1 | 86.1 | 68.9 | 68.9 |
| | TAPAS | 82.1 | 82.0 | 92.8 | 92.8 | 76.5 | 76.4 |
| Classific. | DeBERTaV3 | 50.7±0.4 | 49.7±1.3 | 50.8±0.7 | 49.5±1.7 | 50.6±0.2 | 49.8±1.1 |
| | PASTA | 50.4±0.6 | 46.1±5.6 | 50.6±1.1 | 46.4±6.1 | 50.4±0.5 | 45.9±5.4 |
| | TAPAS | 53.9±5.9 | 53.0±6.8 | 58.8±10.6 | 58.5±11.3 | 51.3±3.6 | 49.9±4.7 |
| | TAPEX | 49.7±4.3 | 44.3±5.1 | 49.5±3.4 | 47.6±3.8 | 49.8±2.9 | 43.3±2.9 |
| | Llama2-Chat-7B | 51.2±1.6 | 47.3±4.2 | 51.5±2.5 | 47.8±4.3 | 51.1±1.2 | 47.0±4.2 |
| | MistralOrca-7B | 60.6±3.1 | 58.1±6.0 | 67.2±4.2 | 65.9±5.8 | 57.2±2.6 | 53.3±6.4 |
| Symbolic | LPA | 59.4±1.4 | 57.9±1.4 | 70.4±2.5 | 70.3±2.5 | 53.8±0.9 | 50.2±1.0 |
| | SASP | 48.7±2.8 | 45.1±2.9 | 50.7±3.0 | 47.5±2.0 | 47.7±3.0 | 43.8±3.7 |
| | Binder | 65.1±1.0 | 65.1±1.0 | 76.9±0.6 | 76.9±0.6 | 59.1±1.3 | 59.1±1.3 |
| | QA-NatVer | 50.9±0.1 | 43.6±0.3 | 52.7±0.2 | 49.8±0.1 | 49.9±0.1 | 49.1±0.2 |
| TabVer | BART0 | 62.8±0.8 | 62.3±0.9 | 71.1±1.0 | 71.1±1.1 | 58.6±0.6 | 57.5±0.9 |
| | Flan-T5-xl | 65.6±0.3 | 64.8±0.6 | 72.6±0.5 | 72.2±0.6 | 62.1±0.4 | 60.8±0.9 |
Training classification baselines, such as TAPAS, on TabFact’s 92,283 training samples using the same experimental setup results in scores substantially outperforming all considered models (82.1 accuracy points). In contrast, TAPAS achieves a score barely above random in our transfer setting (53.9 accuracy points), since the small training size is insufficient for fine-tuning the model to the task and learning the linear transformation described in Section 4.2. This problem is exemplified by TAPEX: as it is pre-trained only on SQL queries, it requires substantial data during fine-tuning to learn a mapping to natural language. Compared to fully supervised symbolic systems, TabVer remains competitive with LPA, with an accuracy lead of 0.4 points, but falls substantially behind SASP, with a difference of 9.1 accuracy points.
Reading of Numerals.
We further analyze TabVer’s reading of numerals by isolating its ability to consider the context of numbers mentioned in a claim.7 We automatically construct a diverse probing dataset that considers variations of numbers in supported claims: adding numerical inaccuracies, rounding numbers, adding modifiers (i.e., approximately, about, around), and adding cardinal determiners (i.e., at most/least). We measure the proportion of claims correctly predicted as supported whose prediction remains supported after a numeric variation is inserted. The probing dataset consists of 1,638 claims. For a detailed description of the constructed variations see Appendix D.
Table 3 shows the results of the probe for TabVer, Binder, and the MistralOrca-7B classification baseline. TabVer is substantially more sensitive to small numeric inaccuracies: only for 36.3% of claims is the prediction maintained when adding 1 to the original number, compared to 63.4% and 57.7% for the classifier and Binder, respectively. This trend is also observed for relative numerical inaccuracies and rounded numbers. We argue that TabVer’s behavior is more representative of its training data, since FEVEROUS instances are annotated as refuted if numbers mentioned without a modifier do not match exactly, following the guidelines given to annotators (Aly et al., 2021). In contrast, when adding explicit modifiers, we observe that TabVer maintains its prediction more frequently than our baselines, with 57% versus 40.6% and 56.4% for the classifier and Binder, respectively. Finally, TabVer’s more nuanced reading of numerals is also seen for cardinals: while the classifier cannot differentiate between incorrect cardinal determiners (e.g., 12 being modified to at most 10, changing the veracity label) and correct ones (e.g., at most 15, preserving it), both TabVer and Binder differentiate between the two. Yet, Binder overall favours the prediction of supported, due to the answer-biased voting strategy deployed by Cheng et al. (2023).
Table 3: Proportion of correctly supported claims whose prediction remains Supported after each numeric variation.

| | Class. | Binder | TabVer |
|---|---|---|---|
| Inaccuracy Δ +1 | 63.4% | 57.7% | 36.3% |
| Inaccuracy Δ 2% | 42.7% | 38.4% | 29.1% |
| Inaccuracy Δ 10% | 37.8% | 38.4% | 26.3% |
| Inaccuracy Δ 25% | 31.7% | 51.9% | 23.6% |
| Rounding | 33.3% | 47.4% | 30.3% |
| Modifiers (e.g., about) | 40.6% | 56.4% | 57.0% |
| Cardinal (incorrect) | 32.9% | 53.8% | 31.8% |
| Cardinal (correct) | 37.8% | 78.8% | 49.1% |
Correctness of ArithExps.
To assess the quality of natural logic proofs with invoked ArithExps, we randomly select 160 FEVEROUS samples and annotate the arithmetic functions required to reach the correct verdict. We compare these annotations with the functions identified by TabVer in the final proof used to assess the claim’s veracity, using the programs of LPA and Binder as comparison baselines. As seen in Table 4, the overall accuracy of TabVer’s arithmetic function calls outperforms the LPA baseline, with 76.0 versus 43.8 accuracy points, and is comparable with Binder’s score of 76.5. The largest performance lead for Binder is observed for the count function, whereas TabVer is more accurate at comparisons.
Table 4: Accuracy of the arithmetic functions invoked on 160 annotated FEVEROUS samples.

| | LPA | Binder | TabVer |
|---|---|---|---|
| Overall | 43.8 | 76.5 | 76.0 |
| Filter/Copy | 41.7 | 85.6 | 90.4 |
| Comparisons | 33.3 | 0.0 | 25.0 |
| Count | 75.0 | 85.7 | 46.4 |
| Sum | 0.0 | 100.0 | 100.0 |
| Diff | 0.0 | 0.0 | 16.6 |
| Min/Max | 0.0 | 0.0 | 0.0 |
TabVer Ablation.
Table 5 shows an ablation study of TabVer’s components. We see a substantial performance decline when removing the generation of ArithExps, dropping accuracy on the numerical subset by 10.0 accuracy points. When additionally removing the extracted evidence ei from the rationale and instead falling back to table linearization, we observe performance comparable to QA-NatVer, as expected. In line with our expectations, a major accuracy drop is observed on the full FEVEROUS data, since the extraction and formatting of evidence is particularly useful for non-arithmetic claims. Finally, the removal of claim decomposition results in accuracy scores worse than the majority baseline, and we observe substantially more NEI predictions for longer claims, further discussed in Section 6.
Table 5: Ablation of TabVer’s components (accuracy on the full FEVEROUS data and the numerical subset).

| ArithExp | Constr. | Rationale | Decomp. | Full | Num. |
|---|---|---|---|---|---|
| ✓ | ✓ | ✓ | ✓ | 72.0 | 71.4 |
| ✓ | ✗ | ✓ | ✓ | 69.2 | 66.2 |
| ✗ | ✗ | ✓ | ✓ | 66.1 | 61.0 |
| ✗ | ✗ | ✗ | ✓ | 60.9 | 59.9 |
| ✓ | ✓ | ✓ | ✗ | 66.3 | 63.7 |
| ✗ | ✗ | ✗ | ✗ | 44.6 | 43.0 |
6 Limitations
While the addition of arithmetic reasoning capabilities addresses a vital limitation of natural logic-based systems, TabVer does not attempt to modify natural logic’s model of compositional entailment itself (i.e., the DFA in Figure 2). NatLog fails some inferences, such as De Morgan’s laws for quantifiers, and generally has less deductive power than first-order logic (MacCartney and Manning, 2014; Karttunen, 2015). In contrast, TabVer incorporates relevant reasoning processes in the generated proof either explicitly, e.g., via ArithExps and claim decomposition, or latently, e.g., via the assignment of NatOps between an aligned claim and evidence span. Moreover, inference rules that cannot be produced by NatLog affect the granularity of the proof: consider a natural-language instantiation of De Morgan’s law from MacCartney (2009), where the claim “Some birds do not fly” is entailed by the evidence “Not all birds fly”. Due to NatLog’s limitations, the most fine-grained correct proof would be to align (Not all, Some do not), (birds, birds), and (fly, fly); the reasoning between the negations and quantifiers in the aligned pair that is required to arrive at the set-theoretic relation is thus omitted from the proof itself. Therefore, the proofs of TabVer are not necessarily fully comprehensive explanations, as they do not fully explain the production of the proof.
Moreover, the proofs of TabVer do not allow assigning a sequence of NatOps to an individual claim span, which can be a limitation for multi-hop claims where multiple pieces of evidence from one or more tables have to be combined for a single span beyond arithmetic functions. Furthermore, proofs are produced and executed from left to right. However, NatLog does not impose such constraints and is instead non-deterministic by design. This can lead to inconsistencies, as the rearrangement of a NatOp sequence O can lead to differently informative veracity predictions (MacCartney and Manning, 2009; Angeli et al., 2016). For instance, consider a variation of the running example shown in Figure 5: “In 2018, Ortegal had three municipalities and a population larger than 12,000.” Assuming the same NatOp relations are assigned, an NEI verdict would be produced. TabVer mitigates this issue via two mechanisms: (i) using claim decomposition to avoid long proofs where such phenomena occur, and (ii) considering multiple proof candidates at different granularity levels, following Aly et al. (2023) (e.g., three municipalities and a population larger than 12,000 could be considered a single span with the ⥯ NatOp). As shown in Figure 5, by breaking the original claim into atomic units of information, the individual subclaim verdicts (via the respective DFA transitions for subclaims 1 and 2) aggregate into the correct overall verdict. Both mechanisms also help in dealing with complex, multi-clause claims, where multiple erroneous and independent facts can lead to NEI predictions (e.g., via a double ⥯), another weak point of natural logic’s nondeterministic composition of NatOps.
7 Conclusion
This paper presented TabVer, a natural logic inference system that adds arithmetic reasoning capabilities for few-shot fact verification on tables. We presented a set-theoretic definition of the relationship between numerals in a claim and answers calculated from evidence via ArithExps. We proposed a method for leveraging LLMs to generate ArithExps via claim-aware question generation and rationale-guided question answering with constrained decoding. TabVer outperforms all baseline systems on FEVEROUS and performs competitively in a domain-transfer setting on TabFact, highlighting our model’s generalizability. We show that TabVer has learned a nuanced understanding of numerals that is more sensitive to the context of a claim than our baselines. Future work will investigate natural logic for scalar implicature on diverse datasets with different requirements for numerical precision.
Acknowledgments
This work was supported by the Engineering and Physical Sciences Research Council Doctoral Training Partnership (EPSRC). Andreas Vlachos is supported by the ERC grant AVeriTeC (GA 865958). The authors would like to thank Sana Kidwai for helpful conversations on linguistic concepts and Chenxi Whitehouse for useful discussions and feedback on the paper. We further thank the anonymous reviewers and the action editor Kenji Sagae for their valuable and detailed feedback.
Notes
1. Code at https://github.com/Raldir/TabVer.
2. We do not define a mapping to the independence NatOp (#) since it is applied when none of the other operators are predicted. Similarly to Krishna et al. (2022) for textual relations, we observe that the cover NatOp occurs only very rarely and thus replace it with the independence NatOp (#).
3. Downward-entailing environments are, for instance, negative environments, antecedent clauses of conditional constructions, and restrictors of universal quantifiers (Spector, 2013). Example for upward (downward) entailment: Messi (has not) scored 50 goals in a season.
4. Note that this phenomenon is distinct from truth-conditional vagueness where modifiers are hidden. While a sentence like “Messi scored about 100 goals, he scored 102.” is semantically valid, “Messi scored 100 goals, he scored 102.” is not without explicitly correcting the previous statement with a modifier like actually; e.g., “Messi scored 100 goals, actually he scored 102.” (Lauer, 2012).
5. Lasersohn (1999) argues that the term exact also leaves room for pragmatic slack at times, e.g., in a statement such as “Mary arrives exactly at 5 o’clock”, where deviations by milliseconds are permissible in most situations. We ignore this notion for simplicity.
6. Despite the ArithExp’s treatment as a numeral, the function name as part of ai is important since the semantics of a numeral varies between arithmetic functions (e.g., COUNT 5 versus COMP 5) and thus affects the comparison against the claim span ci.
7. The ability of models to make pragmatic inferences has been explored by Jeretic et al. (2020); however, their dataset was constrained to a minimal scenario with four numbers (2, 3, 10, 100) and two quantifiers (some, all). Importantly, while their dataset focuses on correctness, our goal is instead to probe a model’s reading of numerals.
References
A Method Details
Table 6 shows the set-theoretic definitions of the NatOps. The effect of environments on the entailment relations is modelled in natural logic via projection functions (MacCartney and Manning, 2009). The upward-entailing environment is the default environment, with the projection function being the identity. Table 7 shows the projection function ρ↓ for downward-entailing environments. This projection function can be further modified in the context of a numeral’s reading in such an environment, following Panizza et al. (2009). Consider the following example: in an upward-entailing environment, the relationship 3 ⥯ 4 holds, i.e., the numbers 3 and 4 are assigned the alternation NatOp following the upper-bounded reading in Section 3.1. However, in a downward-entailing environment, “Everybody who scored 3 goals received a bonus” ⊏ “Everybody who scored 4 goals received a bonus” holds instead, since the numbers have an at least interpretation without further specification, following Panizza et al. (2009). The projection function ρnum↓ from an upward-entailing to a downward-entailing environment that results from such an at least reading of numerals is shown in Table 7. The prompt templates for MQG and MQA are shown in Listings 1 and 2, respectively.
Table 6: Set-theoretic definitions of the NatOps.

| NatOp | Set-theoretic definition |
|---|---|
| Equivalence (≡) | x = y |
| Frw. Entailment (⊏) | x ⊂ y |
| Rev. Entailment (⊐) | x ⊃ y |
| Negation (¬) | x ∩ y = ∅ ∧ x ∪ y = U |
| Alternation (⥯) | x ∩ y = ∅ ∧ x ∪ y ≠ U |
| Independence (#) | all other cases |
Training & Hyperparameters
We fine-tune the question generation model MQG and the question answering model MQA using default hyperparameters. Specifically, we use a learning rate of 2 × 10−4 and train for a total of 10 epochs across all models and experiments. The maximum generation length for MQG is set to 100 tokens for question generation, and the constrained answer selection permits spans of any length in c. For MQA, the maximum length of the extracted evidence Erel is 100 tokens between every generated number. We use AdamW (Loshchilov and Hutter, 2019) as the optimizer. We use a batch size of 1 during training with gradient accumulation, resulting in an effective batch size of 8. For LoRA, we use a rank r = 16 and apply it to the query and value vectors of the attention mechanism. During fine-tuning, we exclude prompt tokens that are not part of the gold answer from the loss computation, so the model is trained only on the answers that follow the instruction, not on the instructions themselves. For our proof generation model MP we use the default hyperparameters of QA-NatVer (Aly et al., 2023).
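For concreteness, the LoRA setup above might be expressed with the Hugging Face peft and transformers libraries roughly as below; the checkpoint identifier, lora_alpha, dropout, and target module names are assumptions not stated in the text, and the trainer setup is omitted.

```python
# Sketch of the LoRA fine-tuning configuration described above: rank 16 on the
# attention query/value projections, learning rate 2e-4, 10 epochs, batch size
# 1 with gradient accumulation to an effective batch size of 8, AdamW.
# lora_alpha, lora_dropout, the module names, and the checkpoint id are
# assumptions for illustration.
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Open-Orca/Mistral-7B-OpenOrca")  # assumed checkpoint
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,                        # not stated in the paper
    target_modules=["q_proj", "v_proj"],  # query and value projections
    lora_dropout=0.05,                    # not stated in the paper
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="tabver_finetune",
    learning_rate=2e-4,
    num_train_epochs=10,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,        # effective batch size of 8
    optim="adamw_torch",
)
```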
B Dataset Details
Quantitative characteristics of the tabular subset of FEVEROUS are shown in Table 8. The table further shows the statistics for the function annotations of 160 claims. The claims were sampled randomly and annotated by the authors of the paper; since the function annotations are made irrespective of any model, potential biases are limited. Note that the annotations are in a multi-label format since multiple functions can be required to verify a single claim.
Table 8: Statistics of the tabular FEVEROUS subset and of the function annotations on 160 sampled claims.

| Property | All | Numerical Subset |
|---|---|---|
| Number of claims | 2011 | 521 |
| Claims with more than 1 table | 129 | 36 |
| Supported claims | 704 (35%) | 178 (34.1%) |
| Refuted claims | 1242 (61.7%) | 338 (64.8%) |
| NEI claims | 65 (3.2%) | 5 (1%) |
| Avg. number of rows | 14.3 | 15.1 |
| Avg. number of columns | 4.82 | 5.9 |
| Avg. number of highlighted cells | 4.85 | 7.3 |
| Function annotations on 160 samples | | |
| Num. COPY | 143 | – |
| Num. COMPARATIVES | 12 | – |
| Num. COUNT | 28 | – |
| Num. SUM | 1 | – |
| Num. DIFF | 6 | – |
| Num. MIN/MAX | 3 | – |
C Implementation Details
The prompt template for the decomposition model MD is shown in Listing 3. We use the Huggingface checkpoints for Llama2-7B,8 MistralOrca-7B,9 TAPAS,10 and TAPEX.11 The PASTA checkpoint is taken from the associated repository.12 For constrained decoding, we use the library guidance-ai.13 The Mistral models are licensed under Apache 2.0 and Llama2 is licensed under the Llama license.14 Our research is consistent with the licenses’ intended use. The models are intended for use in English. All experiments are run on a single Quadro 8000 with 48GB memory. To fine-tune MP with a Flan-T5-3B backbone we use a single A100 80GB.
Baselines
We use the available implementations of LPA,15 SASP,16 and Binder.17 As on TabFact, all three models treat the first table row of a FEVEROUS table as the header row. LPA was trained for 20 epochs since the default number of training epochs (10) was not sufficient to reach convergence. Binder uses the default hyperparameters specified for TabFact, but we use 4 instead of 18 in-context examples, since MistralOrca produced empty answers very frequently with the full number of in-context examples. We hypothesise that MistralOrca generates the end-of-text token too early because the OpenOrca dataset, on which MistralOrca-7B has been instruction-tuned, consists of gold answers that are in 93% of cases shorter than 2.5K tokens. We train DeBERTa, PASTA, TAPEX, and TAPAS using the HuggingFace Trainer for 10 (100) epochs with full (few-shot) data and a learning rate of 1 × 10−5.18 To sanity check our training pipeline, we trained TAPAS in a full-supervision setting on TabFact’s 92,283 training instances, achieving a score of 82.1 accuracy points versus 81.59 points via the official model checkpoint.19 To fine-tune the LLM baselines with LoRA, we use Huggingface’s SFTTrainer.20
D Reading of Numerals Probe - Details
We construct the probing dataset by first filtering instances from the FEVEROUS evaluation data labeled as Supported that contain numbers, excluding dates (e.g., 1939) but including percentages and floating point numbers. After a further manual inspection, a total of 91 claims remain. For each claim, we generate 17 variations of a numeral x:
- (1) x + 1 (Adding one)
- (2) x + x * 0.02 (Adding 2%)
- (3) x − x * 0.02 (Subtracting 2%)
- (4) x + x * 0.1 (Adding 10%)
- (5) x − x * 0.1 (Subtracting 10%)
- (6) x + x * 0.25 (Adding 25%)
- (7) x − x * 0.25 (Subtracting 25%)
- (8) Rounding to the closest number to x that satisfies 10-ness
- (9) Rounding to the closest number to x that satisfies 5-ness
- (10) Rounding to the closest number to x that satisfies 2.5-ness
- (11) (About ∣ Around ∣ Approximately) + 10-ness (Modifier 10-ness)
- (12) (About ∣ Around ∣ Approximately) + 5-ness (Modifier 5-ness)
- (13) (About ∣ Around ∣ Approximately) + 2.5-ness (Modifier 2.5-ness)
- (14) ‘At most’ + (Subtracting 10%) (Cardinal at most, incorrect)
- (15) ‘At least’ + (Adding 10%) (Cardinal at least, incorrect)
- (16) ‘At most’ + (Adding 10%) (Cardinal at most, correct)
- (17) ‘At least’ + (Subtracting 10%) (Cardinal at least, correct)
Rounding to numbers that satisfy the 10-ness, 5-ness, and 2.5-ness properties follows the empirical observation by Jansen and Pollmann (2001) that round numbers satisfying these arithmetic properties occur more frequently than round numbers that do not. For instance, the number 1010 does not satisfy 10-ness, 5-ness, or 2.5-ness and would generally be considered an atypical way of rounding (1000 would most likely be more natural). We follow the terminology of Keenan (2017) and describe the modifiers at most and at least as cardinal determiners.
We categorize these variations into the numerical classes shown in Table 3 as follows:

- Inaccuracy Δ +1: Variation 1.
- Inaccuracy Δ 2%: Average of Variations 2 + 3.
- Inaccuracy Δ 10%: Average of Variations 4 + 5.
- Inaccuracy Δ 25%: Average of Variations 6 + 7.
- Rounding: Average of Variations 8 + 9 + 10.
- Modifiers: Average of Variations 11 + 12 + 13.
- Cardinal (incorrect): Average of Variations 14 + 15.
- Cardinal (correct): Average of Variations 16 + 17.
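A sketch of how a subset of the numeric variations above could be generated for a numeral x is given below; the rounding steps are simplified to multiples of 10 and 5 rather than the full 10-ness/5-ness/2.5-ness procedure, and re-inserting the varied number into the claim text is omitted.

```python
import random

# Sketch: generate a subset of the 17 variations listed above for a numeral x
# taken from a supported claim. "10-ness"/"5-ness" rounding is simplified to
# rounding to multiples of 10 and 5; inserting the result back into the claim
# text is omitted.

def variations(x: float) -> dict:
    v = {
        "add_one": x + 1,                    # (1)
        "plus_2pct": x + x * 0.02,           # (2)
        "minus_2pct": x - x * 0.02,          # (3)
        "plus_10pct": x + x * 0.1,           # (4)
        "minus_10pct": x - x * 0.1,          # (5)
        "plus_25pct": x + x * 0.25,          # (6)
        "minus_25pct": x - x * 0.25,         # (7)
        "round_10ness": round(x / 10) * 10,  # (8), simplified
        "round_5ness": round(x / 5) * 5,     # (9), simplified
    }
    modifier = random.choice(["About", "Around", "Approximately"])
    v["modifier_10ness"] = f"{modifier} {round(x / 10) * 10}"   # (11), simplified
    v["cardinal_at_most_incorrect"] = f"at most {x - x * 0.1}"  # (14)
    v["cardinal_at_least_correct"] = f"at least {x - x * 0.1}"  # (17)
    return v

print(variations(12238))
```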