FeTaQA: Free-form Table Question Answering

Existing table question answering datasets contain abundant factual questions that primarily evaluate a QA system's comprehension of the query and tabular data. However, restricted by their short-form answers, these datasets fail to include question–answer interactions that represent more advanced and naturally occurring information needs: questions that ask for reasoning over and integration of information pieces retrieved from a structured knowledge source. To complement the existing datasets and to reveal the challenging nature of the table-based question answering task, we introduce FeTaQA, a new dataset of 10K Wikipedia-based {table, question, free-form answer, supporting table cells} pairs. FeTaQA is collected from noteworthy descriptions of Wikipedia tables that contain information people tend to seek; generation of these descriptions requires advanced processing that humans perform on a daily basis: understanding the question and table; retrieving, integrating, and inferring over information pieces; and conducting text planning and surface realization to generate an answer. We provide two benchmark methods for the proposed task: a pipeline method built on semantic-parsing-based QA systems and an end-to-end method based on large pretrained text generation models, and show that FeTaQA poses a challenge for both.


Introduction
Question Answering (QA) is the task of producing answers to natural language questions based on knowledge resources (Burke et al., 1997; Yao and Van Durme, 2014; Chen et al., 2017). One of the primary goals of QA is to allow users to directly and efficiently interact with large-scale and heterogeneous knowledge sources. In the real world, knowledge sources take a variety of forms, including unstructured texts (documents, passages, or conversations), structured knowledge bases or databases, and semi-structured tables, each requiring dedicated modeling approaches.
For table-based QA, a common approach is to apply semantic parsing to the query and the table schema to generate a logical form (e.g., a SQL-like database query) that can be executed to retrieve the answer from the relevant portion of the table (Pasupat and Liang, 2015; Iyyer et al., 2017; Zhong et al., 2017; Yu et al., 2018). The answers are facts/entities extracted from the table, and are therefore usually short-form.
Though existing datasets have enabled significant progress for table QA, their limitations prevent them from reflecting the full challenge of the task. Users of QA systems tend to ask complex questions which require elaborate answers, often containing explanations, while existing datasets are limited to simple short-form answers.
To address these shortcomings we present FeTaQA, a Free-form Table Question Answering dataset which includes long, informative, and free-form answers. FeTaQA challenges QA systems with the following tasks: 1) retrieving multiple entities from tables based on the query; 2) reasoning over relations of these entities that are pertinent to the query and integrating these pieces of information into a coherent answer; 3) aggregating associated information into an explanation when the query is abstract or ambiguous; and 4) generating an informative, relevant, and faithful answer to the query. In addition, with tables sourced from Wikipedia, FeTaQA covers a diverse set of topics and includes semi-structured tables containing un-normalized text, including numbers, dates, and phrases. FeTaQA examples are presented in Figure 1 and differences between FeTaQA and other QA datasets are described in Table 1.
We formulate generative table question answering as a sequence-to-sequence learning problem to evaluate the performance of state-of-the-art models on FeTaQA. We propose two benchmark methods and provide experiment results for them. The first is an end-to-end model that integrates query and table comprehension, logical reasoning, and language generation by adapting T5 (Raffel et al., 2019). The other is a pipeline model that achieves content selection and surface realization in separate modules involving TAPAS (Herzig et al., 2020).
Through human studies, we evaluate answers generated by our proposed models as well as the reference answers based on fluency, correctness, adequacy (informativeness), and faithfulness. The results indicate the challenging nature of FeTaQA and that there is much room for improvement in QA systems. We make the dataset available online at https://github.com/Yale-LILY/FeTaQA.

Dataset
Here we introduce FeTaQA and describe the process and criteria for collecting the tables, questions, and answers. Some statistics of FeTaQA are shown in § 2.4.

Desiderata
We frame generative table question answering as the problem of generating an answer a to a question q based on a semi-structured table T and its metadata m. Our goal was to construct a table QA dataset {(q_i, a_i, T_i, m_i) | i = 1, ..., n} that includes a large number of tables on diverse topics. The tables should be intelligible, well-formed, and moderately sized to make retrieval challenging yet plausible. Each table pairs a question with an answer sentence. The question should require retrieval and reasoning over multiple sources of information in the table, and the answer should integrate both facts and inferences into a coherent sentence that answers the question. Both questions and answers should be natural and fully grounded in the context of the entire table and its metadata, such as the title.
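Concretely, each instance can be viewed as the following record (a minimal sketch; the field names are illustrative, not the released schema):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FeTaQAInstance:
    question: str                              # q: natural language question
    answer: str                                # a: free-form answer sentence
    table: List[List[str]]                     # T: 2-D array; first row is the header
    metadata: dict                             # m: e.g. {"table_title": ..., "page_title": ...}
    highlighted_cells: List[Tuple[int, int]]   # (row, col) denotations supporting the answer
```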

Data Collection Method
We start building the dataset by collecting data instances from ToTTo (Parikh et al., 2020), a recent large-scale Table-to-Text dataset that contains tables and table-grounded sentences obtained from a diverse variety of Wikipedia pages. Additionally, ToTTo comes with annotations of table cells that support the sentence: a sentence is supported by the cell contents if it is directly stated or can be logically inferred from them. ToTTo applied several heuristics to sample the tables and the candidate sentences from Wikipedia pages, and its annotators were asked to revise sentences and highlight the corresponding table regions so that the sentences still have the varied language and structure found in natural sentences while being grounded in the table. Sampling examples from the ToTTo dataset was conducted in multiple steps. We first sample tables whose sizes are within 3 to 34 rows and 3 to 7 columns (up to the 75th percentile of all ToTTo table sizes) to avoid truncation of the linearized table sequence for transformer-based models, whose default maximum input sequence length is 512. To ensure sentences contain several table entities, we further select tables whose annotated highlighted regions cover multiple rows. We also collect a subset of single-row highlighted regions whose content spans multiple rows or columns. Following this sampling procedure, we obtained 16,576 {table, metadata, highlighted region, sentence} instances, with which we conducted the annotation procedure described below. The flowchart of the sampling process is found in Figure 7 of the Appendix.
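A minimal sketch of this sampling filter, reconstructed from the criteria above (the handling of the single-row subset is omitted, and the field layout is our own assumption):

```python
def keep_instance(table, highlighted_cells):
    """Filter ToTTo instances by table size and highlighted-region coverage.

    table: 2-D list of cell strings (first row is the header).
    highlighted_cells: list of (row, col) indices annotated by ToTTo.
    """
    n_rows, n_cols = len(table), len(table[0])
    # Keep moderately sized tables (<= 75th percentile of ToTTo sizes) so the
    # linearized table fits a 512-token transformer input without truncation.
    if not (3 <= n_rows <= 34 and 3 <= n_cols <= 7):
        return False
    # Require the highlighted region to cover multiple rows, so the answer
    # sentence is grounded in several table entities.
    rows_covered = {r for r, _ in highlighted_cells}
    return len(rows_covered) >= 2
```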
We adopted these table-grounded sentences as the answers in our new QA dataset since they are long, natural sentences containing rich information and inferences over the corresponding table. We also exploit ToTTo's annotations of table cells (the highlighted table region) as weak supervision (denotations) for training models and as labels for evaluating model retrieval competency. We parsed the tables (originally in HTML format) into a 2-dimensional array, where the first row corresponds to the table header. We also processed merged cells by copying the cell content and cell highlighted region to all the individual cells that compose the original merged cell.
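A sketch of this parsing and merged-cell expansion, assuming the HTML is parsed with BeautifulSoup (our own illustration, not the released preprocessing code):

```python
from bs4 import BeautifulSoup

def html_table_to_grid(html):
    """Parse an HTML table into a 2-D array, copying the content of merged
    cells (colspan/rowspan) into every cell position they cover."""
    soup = BeautifulSoup(html, "html.parser")
    grid = {}
    for r, tr in enumerate(soup.find_all("tr")):
        c = 0
        for cell in tr.find_all(["th", "td"]):
            while (r, c) in grid:  # skip slots filled by an earlier rowspan
                c += 1
            rowspan = int(cell.get("rowspan", 1))
            colspan = int(cell.get("colspan", 1))
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = cell.get_text(strip=True)
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]
```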

Question Annotation
Question annotations were collected with the help of human judges in two phases: an internal phase conducted by on-site expert annotators, and an external phase conducted by crowd workers on Amazon Mechanical Turk. To streamline the process, we built a custom web interface to visualize table HTML and metadata, augmented with web widgets that allow table region highlighting and sentence editing. A screenshot of the annotation interface is shown in Figure 8 of the Appendix.
Provided the necessary context, the annotators were asked to write a question whose answer is the provided ToTTo sentence. The annotators were given the option to modify the sentence, the table cell content, and the highlighted region to better match the associated question.
Internal Annotations In the first phase of annotation, we enrolled 15 internal annotators who were provided with preliminary guidelines. In addition to the annotation task, they were asked to provide feedback regarding the task instructions and the user experience of the website, based on which we iteratively modified the guidelines and the website design.
External Annotations For external annotations, we hired MTurk workers who have completed at least 500 HITs, have a 97% approval rate, and are from English-speaking regions. To ensure that the MTurk annotators understood our task, we provided an instruction video on the use of the interactive annotation tool and FAQs that clarify the annotations we desire, along with good vs. bad annotation examples. We also created a Slack channel for crowd workers to ask questions and clarify doubts.
Annotation Evaluation To ensure FeTaQA is of high quality, we evaluated crowdsourced annotations as follows. First, we auto-rejected questions that fall outside the length range (4 to 25), as well as convoluted questions that contain more than two interrogatives (259 examples in total). For the remaining annotations, we built another web interface for evaluation and asked internal evaluators to label an annotation as "approve", "reject", or "questionable", and to score the annotation based on its fluency, faithfulness, and the extent to which the question needs the full sentence as the answer. Internal evaluators were also asked to modify the question annotations that were not approved. The annotator contributions to the final dataset are distributed as follows: 3,039 (30%) instances from internal annotators and 7,291 (70%) from MTurk workers. In total, our dataset contains 10,330 instances.
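The auto-rejection step can be sketched as follows, assuming length is counted in tokens and using our own approximation of the interrogative list:

```python
INTERROGATIVES = {"what", "which", "who", "whom", "whose",
                  "when", "where", "why", "how"}

def auto_reject(question):
    """Reject questions outside the 4-25 token length range or with more
    than two interrogatives (a proxy for convoluted questions)."""
    tokens = question.lower().split()
    if not (4 <= len(tokens) <= 25):
        return True
    n_interrogatives = sum(tok.strip("?,.") in INTERROGATIVES for tok in tokens)
    return n_interrogatives > 2
```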

Dataset Split
Randomly splitting the dataset may make the train, development, and test splits contain tables with similar contents (Finegan-Dollak et al., 2018; Lewis et al., 2020). Therefore, to increase the generalization challenge, we calculated the Jaccard similarity of two instances based on the set of tokens appearing in table headers and questions, and split the dataset in such a way that models are evaluated on test split instances that are least similar to those used for training. We first sampled 800 instances randomly as a seed split. Then we added instances with Jaccard similarity greater than 0.465 to the seed split. This process generates two splits of 70% and 30% of all instances; the former becomes the train split and the latter is randomly divided with a ratio of 1:2 to form the development and test splits. This results in 7,326/1,001/2,003 instances in the train/dev/test splits, respectively.
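A sketch of this similarity-driven split (seed size and threshold as stated above; the greedy assignment against the seed split is our own reconstruction):

```python
import random

def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def similarity_split(instances, seed_size=800, threshold=0.465, seed=42):
    """instances: list of token sets drawn from table headers and questions.
    Returns (train_indices, heldout_indices); the held-out part is later
    divided 1:2 into dev and test."""
    rng = random.Random(seed)
    indices = list(range(len(instances)))
    rng.shuffle(indices)
    seed_split = indices[:seed_size]
    train = set(seed_split)
    for i in indices[seed_size:]:
        # Instances similar to the seed split go to train, so the held-out
        # instances are the least similar to those used for training.
        if any(jaccard(instances[i], instances[j]) > threshold for j in seed_split):
            train.add(i)
    heldout = [i for i in indices if i not in train]
    return sorted(train), heldout
```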

Data Analysis and Statistics
Basic statistics of FeTaQA are shown in Table 2, and human evaluation scores and inter-evaluator agreements are reported in Table 3. A quantitative and qualitative analysis of FeTaQA shows that it contains abundant complex questions that require retrieval of multiple entities in the context, as shown by the human evaluation score for question complexity, and that the median number of highlighted cells (denotations) is 6, twice the corresponding number for ToTTo. These denotations are correct and adequate, as indicated by the corresponding high evaluation scores. The free-form answers have a median length of 18 tokens and are grounded in the table and the denotations, as also suggested by the high evaluation scores.
Topics Similar to ToTTo, we use Wikimedia Foundation's topic categorization model (Asthana and Halfaker, 2018) to investigate the topic distribution of FeTaQA. Although our dataset is limited to topics presented in ToTTo, we are able to sample instances that have evenly distributed topics, as shown in Figure 2. We found that most of the instances are related to biography, sports, and geographical regions. There are also abundant instances related to media, politics, and government.

Question Types FeTaQA has diverse and complex questions, as illustrated in Figure 3. A comparison of question type distributions with other table QA datasets is shown in Figure 9 of the Appendix. We found that in FeTaQA, a large percentage of what questions ask about entities in plural, or abstract entities such as outcome, result, margin, and percentage. In addition, there is a higher percentage of how questions that are not how many/much, compared to existing table QA datasets.

Models
To quantify the challenge posed by FeTaQA for state-of-the-art models, we used two modeling approaches that have been shown to be effective on existing table question answering datasets, with some modifications made to adjust to our task. Model configurations are shown in Figure 4.

Pipeline Model
Question answering over tables is usually seen as a semantic parsing task. Based on the table schema, a semantic parser maps the question to a logical form that can be used to retrieve the result from the table. The answer is usually a single entity that is either a table cell value or the aggregation result of multiple cell values (aggregation operators include COUNT, MAX, SUM, etc.). A table semantic parser is trained using logical forms as supervision, but due to their high annotation cost, an alternative approach is to use denotations for weak supervision. The denotations are usually the targets of existing table QA tasks. With this approach, the parser is able to retrieve denotations directly. However, in our task, targets are generated texts instead of retrieved denotations, suggesting that we also need a generator to integrate the retrieved information into a cogent sentence. Therefore, we propose a pipeline model with two separately trained modules, described below.

Weakly Supervised Table Semantic Parsing
The first module adopts a table semantic parser that is pre-trained with weak supervision. We start with TAPAS (Herzig et al., 2020), a state-of-the-art model for table QA, and fine-tune it on FeTaQA with our annotated denotations (highlighted table regions). We believe fine-tuning is crucial for our task because TAPAS is pre-trained on questions that require retrieval of limited denotations (a single entity or homogeneous entities that can be aggregated with a COUNT, SUM, or AVG operation), while FeTaQA questions require retrieval of multiple entities and complex aggregation operations. Details of the experiment results are provided in Section 4.3. Note that besides denotations, TAPAS also predicts an aggregation operation (chosen from COUNT, SUM, AVG, NONE) that is applied to the predicted denotations to obtain the final answer. However, we use NONE as the aggregation operation label for fine-tuning due to the lack of annotations, thereby leaving the inference of the aggregation operation to the second module.
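A minimal sketch of this weakly supervised fine-tuning step with the Hugging Face TAPAS interface (the exact loss inputs depend on the checkpoint configuration; this is an illustration under stated assumptions, not the authors' training code):

```python
import torch
import pandas as pd
from transformers import TapasTokenizer, TapasForQuestionAnswering

# Start from a TAPAS-base checkpoint fine-tuned on WikiTableQuestions,
# as described in the paper.
name = "google/tapas-base-finetuned-wtq"
tokenizer = TapasTokenizer.from_pretrained(name)
model = TapasForQuestionAnswering.from_pretrained(name)

def loss_on_example(table: pd.DataFrame, question: str, highlighted_cells):
    """Weakly supervised loss for one example: the highlighted cells serve
    as denotation labels; the aggregation label is fixed to NONE (index 0)
    because FeTaQA provides no aggregation annotations."""
    table = table.astype(str)  # TAPAS expects string-valued cells
    inputs = tokenizer(
        table=table,
        queries=[question],
        # (row, col) coordinates index the table body (header excluded)
        answer_coordinates=[highlighted_cells],
        answer_text=[[table.iat[r, c] for r, c in highlighted_cells]],
        padding="max_length",
        truncation="drop_rows_to_fit",
        return_tensors="pt",
    )
    outputs = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        token_type_ids=inputs["token_type_ids"],
        labels=inputs["labels"],
        aggregation_labels=torch.tensor([0]),  # 0 = NONE
    )
    return outputs.loss
```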
Data-to-Text As shown in Figure 5, we fine-tune T5 (Raffel et al., 2019) on DART (Nan et al., 2021) to obtain a Data-to-Text model as the second module of the pipeline, which performs surface realization of table cells (denotations in our case). We first convert the denotation prediction into the triple-set format with the following scheme: for each table cell in the highlighted region, we generate the triple [[TABLECONTEXT], column header, cell value], where column header is the cell's corresponding column name. Similar to DART, we use [TABLECONTEXT] as a special token for converting a table cell into a triple. We then incorporate the metadata into triples by replacing column header with the field name (TABLE TITLE, PAGE TITLE) and cell value with the metadata content (table title text, page title text). We end up with a triple-set containing all highlighted table cells and the metadata (the table title and the title of the Wikipedia page that includes the table). We further fine-tune the Data-to-Text model on ToTTo instances so that it adapts to our formulation of triple-set inputs. To avoid exposure to FeTaQA test instances, we fine-tune with a sample of 8K ToTTo instances that are not used for creating FeTaQA.
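A sketch of the triple-set construction described above (our own reconstruction of the scheme):

```python
def to_tripleset(table, highlighted_cells, table_title, page_title):
    """Convert denotations and metadata into DART-style triples.

    table: 2-D array whose first row is the header.
    highlighted_cells: list of (row, col) indices into the table body.
    """
    header = table[0]
    # One triple per highlighted cell: [TABLECONTEXT] as the special
    # subject token, the column header as the predicate.
    triples = [["[TABLECONTEXT]", header[c], table[r][c]]
               for r, c in highlighted_cells]
    # Metadata triples: the field name replaces the column header and the
    # metadata text replaces the cell value.
    triples.append(["[TABLECONTEXT]", "TABLE TITLE", table_title])
    triples.append(["[TABLECONTEXT]", "PAGE TITLE", page_title])
    return triples
```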
Figure 5: Weakly supervised fine-tuning of the table semantic parser on FeTaQA. We choose a checkpoint of TAPAS-base fine-tuned on WikiTableQuestions to start with. After fine-tuning, the table semantic parser predicts denotations, which are then converted to triples and sent to the Data-to-Text module.

End-to-End Model
In this approach, we model the task as a sequence-to-sequence learning problem by linearizing table T appended to question q as the source sequence, and treating the free-form answer a as the target sequence. We propose a simple linearization scheme as a baseline: table rows are concatenated with [SEP] tokens in between, and cells in each row are separated by spaces. Since the input sequence length may exceed the model limit, we prepend q to the table linearization T, using [CLS] tokens as prefixes for separation. We fine-tune models from the T5 family on the FeTaQA train set. The linearization scheme is visualized in Figure 6.
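A sketch of this linearization scheme (the exact token placement and spacing are our own assumptions):

```python
def linearize(question, table):
    """Prepend the question to the linearized table: rows joined by [SEP],
    cells within a row joined by spaces, [CLS] tokens as prefixes."""
    rows = [" ".join(str(cell) for cell in row) for row in table]
    return "[CLS] " + question + " [CLS] " + " [SEP] ".join(rows)
```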

Experiments
In this section, we explain the experimental setup and report automatic and human evaluations of the model outputs.

Experiment Setup
We first experiment with the pipeline model in a zero-shot setting, that is, without any fine-tuning on FeTaQA. We use a checkpoint of TAPAS-base that is fine-tuned on WikiTableQuestions (Pasupat and Liang, 2015) to perform table semantic parsing implicitly in order to produce a set of denotations, which is then converted to a triple-set as described in Section 3.1. We then employ a T5-large model (Raffel et al., 2019) that goes through two fine-tuning stages: in the first stage it is fine-tuned on the downstream Data-to-Text task with DART (Nan et al., 2021); in the second stage it is further fine-tuned on ToTTo instances to adapt to the triple-set formulation we proposed. We denote this setting as Pipeline-zeroshot in Table 4. Next we experiment with the pipeline model by fine-tuning the table semantic parser on FeTaQA. We further fine-tune the TAPAS-base checkpoint (WTQ fine-tuned) on the FeTaQA train set and select models based on their performance on the development set. We use the same Data-to-Text model as described in the zero-shot setting.
For the End-to-End model, we adapt Hugging Face's implementation (Wolf et al., 2020) of T5 (Raffel et al., 2019) for our task. We use a standard T5 tokenizer with additional [CLS] and [SEP] tokens, and the model vocabulary is resized accordingly. Since we expect the input sequence to be significantly longer than the target, we fine-tuned the models using T5's "summarize: " prefix. The motivation behind this is to avoid simple extraction from the table, since abstractive summarization is supposed to rephrase important details in the source. T5-small is trained on 4 Tesla K80 GPUs with a per-device batch size of 16 for 30 epochs (about 6,900 steps). T5-base is trained on 4 Tesla K80 GPUs with a per-device batch size of 4 (due to GPU memory constraints) for 80 epochs (about 36,640 steps). As for T5-large, we distributed the layers across 8 Tesla K80 GPUs to train with a batch size of 4 for 80 epochs (about 80K steps).
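A minimal sketch of this setup with Hugging Face Transformers (training loop and data loading omitted; illustrative, not the authors' code):

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Register the separator tokens used by the linearization scheme and
# resize the embedding matrix to match the enlarged vocabulary.
tokenizer.add_special_tokens({"additional_special_tokens": ["[CLS]", "[SEP]"]})
model.resize_token_embeddings(len(tokenizer))

def loss_on_example(source_sequence, answer):
    """source_sequence: question prepended to the linearized table."""
    # The "summarize: " task prefix encourages abstractive rephrasing
    # rather than simple extraction from the table.
    inputs = tokenizer("summarize: " + source_sequence,
                       max_length=512, truncation=True, return_tensors="pt")
    labels = tokenizer(answer, truncation=True, return_tensors="pt").input_ids
    return model(**inputs, labels=labels).loss
```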

Evaluation Metrics
We use a variety of automatic metrics and human evaluation (Section 4.4) to evaluate the quality of the generated answers. We report sacreBLEU (Post, 2018), ROUGE-{1, 2, L} (Lin, 2004), and METEOR (Banerjee and Lavie, 2005), which evaluate the n-gram match between generated and reference answers. Considering the limitations of these measures in evaluating the semantic meanings of sentences, we also report BERTScore (Zhang et al., 2020) and BLEURT (Sellam et al., 2020), which incorporate semantics using contextual embeddings. To evaluate the retrieval competency of table semantic parsers, we applied various set similarity metrics to the predicted and reference denotation lists. Specifically, we report Jaccard similarity, overlap coefficient, cosine similarity, and Dice similarity.
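The set similarity measures over predicted and reference denotation sets follow the standard definitions, e.g.:

```python
import math

def set_similarities(pred, ref):
    """Set similarity between predicted and reference denotations,
    each given as a collection of (row, col) cell coordinates."""
    pred, ref = set(pred), set(ref)
    inter = len(pred & ref)
    return {
        "jaccard": inter / len(pred | ref) if pred | ref else 0.0,
        "overlap": inter / min(len(pred), len(ref)) if pred and ref else 0.0,
        "cosine":  inter / math.sqrt(len(pred) * len(ref)) if pred and ref else 0.0,
        "dice":    2 * inter / (len(pred) + len(ref)) if pred or ref else 0.0,
    }
```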

Results and Discussions
Our experimental results on the FeTaQA test set are summarized in Table 4. The T5-large model using the End-to-End modeling approach achieves the highest performance scores on almost all evaluation metrics. Also, we observe a large performance gap between pipeline models and End-to-End models, even though the latter only adopt a simple linearization strategy for encoding tables.
We also see that after fine-tuning on FeTaQA with denotations as weak supervision, the pipeline model improves by almost 2 BLEU points. To further examine the source of this improvement, we report the evaluation of the table semantic parser's performance in Table 5, from which we also observe an improvement in retrieval capability. However, we note that compared with the reference denotations, which have a median of six highlighted table cells (shown in Table 2), our table semantic parser is only able to predict two table cells on average before fine-tuning on FeTaQA, and three table cells on average after. This indicates a large space for improvement. We suspect that the low performance of denotation prediction and the loss of relational information between denotations lead to the inadequate performance of the pipeline models.

Human Evaluation
To further evaluate the quality of the answers generated by different models compared to the references, we conduct a human evaluation based on four criteria: (1) fluency, whether an answer is natural and grammatical; (2) correctness, whether an answer is correct; (3) adequacy, whether an answer contains all the information that is asked for; and (4) faithfulness, whether an answer is faithful and grounded in the contents of the table and the highlighted region. Each evaluator is asked to examine an answer given the question and the full context (table, highlighted region, and metadata) and give a score on a scale of 1 to 5 for each of the criteria. We ask five internal annotators to evaluate 100 samples of FeTaQA instances. Each sample is paired with 3 answers: the reference, the pipeline model result, and the End-to-End model result.
Table 6 attests to the high quality of our annotations and the challenging nature of FeTaQA. Similar to the results of the automatic metrics, we observe a large gap between the pipeline model and the End-to-End model, with the latter significantly outperforming its counterpart in terms of answer correctness, adequacy, and faithfulness. Comparing the best performing End-to-End model outputs to human references, we see that there is still room for improvement in the future.

Related Work
Generative QA Generative question answering datasets such as NarrativeQA (Kočiskỳ et al., 2018), CoQA (Reddy et al., 2019), TriviaQA (Joshi et al., 2017), and MS MARCO (Nguyen et al., 2016) all have free-form answers that are generated based on the contexts of Wikipedia articles, books, movie scripts, dialogues, or web documents. These responses are mostly crowd-sourced and are reported to mostly contain copies of short text spans from the source. By contrast, ELI5 (Fan et al., 2019) is a long-form question answering dataset containing a diverse set of complex questions, each paired with a paragraph-long answer and 100 relevant web source documents (Petroni et al., 2020; Krishna et al., 2021). FeTaQA is the first dataset for generative question answering over tables. Unlike existing generative QA datasets that assess multi-document retrieval and abstraction capability, FeTaQA poses new challenges to the reasoning and integration capability of a system given a structured knowledge source.
Data-to-Text generation Recent neural end-to-end models tested on the WebNLG 2017 dataset (Gardent et al., 2017) have focused on incorporating pre-training and fine-tuning for specific generation tasks (Chen et al., 2020c; Kale, 2020) to improve performance and strengthen generalization ability. Meanwhile, recent models featuring separate content-planning and surface realization stages have exhibited improvements (Moryossef et al., 2019; Iso et al., 2020) over comparable baselines. TabFact (Chen et al., 2020d) is composed of Wikipedia tables coupled with statements labeled as either "ENTAILED" or "REFUTED" by the table. LogicNLG (Chen et al., 2020b) features statements logically entailed from tables. ToTTo (Parikh et al., 2020) is a large-scale open-domain dataset consisting of Wikipedia tables with a set of highlighted table cells and a sentence description of those highlighted cells. DART (Nan et al., 2021) is an open-domain Data-to-Text dataset that contains table-ontology-preserving data samples with a diverse predicate set occurring in Wikipedia tables.

Conclusion
In this paper, we introduced the task of generative table question answering with FeTaQA, a table QA dataset consisting of complex questions that require free-form, elaborate answers. We also proposed two modeling approaches: (1) a pipeline model that incorporates a table semantic parser and a Data-to-Text generator, and (2) an End-to-End model that integrates query comprehension, reasoning, and text generation. Our experimental results indicate that the End-to-End model with a simple table encoding strategy achieves much higher scores than the pipeline model that requires table semantic parsing. Furthermore, we show that FeTaQA introduces new challenges for table question answering that call for innovative model designs in the future.

Figure 1: Examples of FeTaQA instances. Only part of each original table is shown for better visualization. These examples are referred to as (a), (b), (c), (d) from upper left to bottom right in the paper.

Figure 3: FeTaQA questions by top 5 most frequent starting words, where box size represents frequency.

Figure 4: Pipeline model and End-to-End model diagrams.

Figure 6: Table linearization in the End-to-End model.
Table 1: Comparison of FeTaQA with other QA datasets.

Dataset                                      | Knowledge Source              | Answer Format     | Avg # Tokens in Answer
SQuAD (Rajpurkar et al., 2016)               | Wikipedia articles            | Text-span         | 3.2
HotpotQA (Yang et al., 2018)                 | Wikipedia articles            | Short-form entity | 2.2
NarrativeQA (Kočiskỳ et al., 2018)           | Stories, books, movie scripts | Free-form text    | 4.7
ELI5 (Fan et al., 2019)                      | Online forum texts            | Free-form text    | 130.6
WikiTableQuestions (Pasupat and Liang, 2015) | Wikipedia tables              | Short-form entity | 1.7
SequenceQA (Saha et al., 2018)               | Wikipedia tables              | Short-form entity | 1.2
HybridQA (Chen et al., 2020e)                | Wikipedia tables              | Short-form entity | 2.1
FeTaQA                                       | Wikipedia tables              | Free-form text    | 18.9

Table 3: Human evaluation over 100 samples of FeTaQA. 5 internal evaluators are asked to rate the samples on a scale of 1 to 5. We report the % of samples that have a score ≥ 4 to show the high quality of FeTaQA, and report percent agreement and Randolph's Kappa (Randolph, 2005) (with 95% CI) to show that our human evaluation has high inter-annotator agreement.

Table 4: Experiment results on the test split of FeTaQA.

Table 5: Evaluation of denotation prediction on the test split of FeTaQA. We report the performance of TAPAS in the zero-shot setting and fine-tuned with weak supervision.

Table 6: Human evaluation over 100 samples of model outputs and references. We report the percentage of outputs that have scores of 4 or 5.