Abstract
Existing table question answering datasets contain abundant factual questions that primarily evaluate a QA system’s comprehension of the query and the tabular data. However, restricted by their short-form answers, these datasets fail to include question–answer interactions that represent more advanced and naturally occurring information needs: questions that ask for reasoning over and integration of information pieces retrieved from a structured knowledge source. To complement the existing datasets and to reveal the challenging nature of the table-based question answering task, we introduce FeTaQA, a new dataset with 10K Wikipedia-based {table, question, free-form answer, supporting table cells} examples. FeTaQA is collected from noteworthy descriptions of Wikipedia tables that contain information people tend to seek; generating these descriptions requires advanced processing that humans perform on a daily basis: understanding the question and table, retrieving, integrating, and inferring over relevant information, and conducting text planning and surface realization to generate an answer. We provide two benchmark methods for the proposed task, a pipeline method based on semantic-parsing-based QA systems and an end-to-end method based on large pretrained text generation models, and show that FeTaQA poses a challenge for both.
1 Introduction
Question Answering (QA) is the task of producing answers to natural language questions based on knowledge resources (Burke et al., 1997; Yao and Van Durme, 2014; Chen et al., 2017). One of the primary goals of QA is to allow users to directly and efficiently interact with large-scale and heterogeneous knowledge sources. In the real world, knowledge sources take a variety of forms, including unstructured texts (documents, passages, or conversations), structured knowledge bases, and semi-structured tables, each requiring dedicated modeling approaches.
For QA over text, a sequence modeling approach is usually adopted to encode the query and the context, and answers are either categorical (Lai et al., 2017), extractive (Rajpurkar et al., 2016; Yang et al., 2018), or abstractive/generative (Kociský et al., 2017; Nguyen et al., 2016; Fan et al., 2019; Kwiatkowski et al., 2019). For QA over tables, a common approach is to apply semantic parsing to the query and the table schema to generate a logical form (e.g., a SQL-like database query) that can be executed to retrieve the answer from the relevant portion of the table (Pasupat and Liang, 2015; Iyyer et al., 2017; Zhong et al., 2017; Yu et al., 2018). The answers are facts or entities extracted from the table and are therefore usually short-form.
Though existing datasets have enabled significant progress on table QA, their limitations prevent them from reflecting the challenging nature of the task. The exchange of information between humans through questions and answers differs from the interactions presented in most existing QA datasets, in which questions are specific (sometimes contrived to test multi-hop reasoning) and provide most of the information, while answers are short-form and fill in the missing piece. In many cases, however, people seek more structured information content, as in “how”, “why”, and those “what” questions that ask about general concepts. A QA system should therefore also possess this structuring capability, which is best evaluated through text generation.
To complement the existing datasets with the absent QA interactions, we present FeTaQA, a Free-form Table Question Answering dataset that includes long, informative, and free-form answers. FeTaQA reveals the challenging nature of the table QA task: 1) retrieving multiple entities from tables based on the query; 2) aggregating and reasoning over the relations of these entities; and 3) structuring surface information and inferences into a coherent answer that is faithful to the table. We collect question–answer pairs from noteworthy descriptions of Wikipedia tables, which are high-quality sentences rich in structured information. We annotate questions that elicit such descriptions, taking care that each question is compatible with its answer and that the question annotations are not contrived. In addition, the FeTaQA tables cover a diverse set of topics and contain un-normalized text, including numbers, dates, and phrases. FeTaQA examples are presented in Figure 1, and the differences between FeTaQA and other QA datasets are described in Table 1.
| Dataset | Knowledge Source | Answer Format | Avg # Words in Answer |
|---|---|---|---|
| SQuAD (Rajpurkar et al., 2016) | Wikipedia articles | Text-span | 3.2 |
| HotpotQA (Yang et al., 2018) | Wikipedia articles | Short-form entity | 2.2 |
| NarrativeQA (Kociský et al., 2017) | Stories, books, movie scripts | Free-form text | 4.7 |
| ELI5 (Fan et al., 2019) | Online forum texts | Free-form text | 130.6 |
| WikiTableQuestions (Pasupat and Liang, 2015) | Wikipedia tables | Short-form entity | 1.7 |
| SequenceQA (Saha et al., 2018) | Wikipedia tables | Short-form entity | 1.2 |
| HybridQA (Chen et al., 2020d) | Wikipedia articles, Wikipedia tables | Short-form entity | 2.1 |
| FeTaQA | Wikipedia tables | Free-form text | 18.9 |
We formulate generative table question answering as a sequence-to-sequence learning problem. We propose two benchmark methods and provide experimental results for them. The first is an end-to-end model that integrates query and table comprehension, reasoning, and language generation by adapting T5 (Raffel et al., 2020). The other is a pipeline model that performs content selection and surface realization in separate modules built around TAPAS (Herzig et al., 2020), a recently proposed pre-trained model that jointly processes text and tabular data for semantic parsing.
Through human studies, we evaluate answers generated by our proposed models as well as the reference answer based on fluency, correctness, adequacy (informativeness), and faithfulness. The results indicate the challenging nature of FeTaQA and that there is much room for improvement in QA systems. We make the dataset and code available online.1
2 Dataset
Here we introduce FeTaQA and describe the process and criteria for collecting the tables, questions, and answers. Some statistics of FeTaQA are shown in § 2.4.
2.1 Desiderata
We frame generative table question answering as the problem of generating an answer a to a question q based on a table T and its metadata m. Our goal is to construct a table QA dataset {(qi, ai, Ti, mi) | i = 1…n} with a large number of instances and diverse topics. We want to collect questions that seek not just a specific fact but more structured information: ideally, they should require retrieving multiple, diverse facts and reasoning over them with varied aggregations. Answers should be well-structured information content, faithful to the tables, and presented as natural utterances.
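For concreteness, one instance can be represented roughly as the following container (a minimal sketch; the field names are illustrative rather than the dataset's exact schema):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FeTaQAInstance:
    """Illustrative container for one (q, a, T, m) instance; field names are hypothetical."""
    question: str                             # q: the free-form question
    answer: str                               # a: the long, free-form answer sentence
    table: List[List[str]]                    # T: 2-D array whose first row holds the column headers
    page_title: str                           # m: title of the source Wikipedia page
    section_title: str                        # m: title of the table's section
    highlighted_cells: List[Tuple[int, int]]  # (row, col) coordinates of supporting cells
```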
2.2 Data Collection Method
A natural way to collect a table-based QA pair is to ask annotators to first write a question given a table and then provide its answer. However, we found that it usually takes far more effort to ask how multiple facts are related or what they share in common than to ask about a single specific fact in the table; annotators spend much more time working out the relations between cell contents for question generation, and they also need time to write an answer. We found that ToTTo (Parikh et al., 2020), a recently proposed large-scale Table-to-Text dataset, is a desirable resource to start with. It contains textual descriptions that are naturally written and fully grounded in Wikipedia tables. Additionally, ToTTo comes with annotations of the table cells that support each sentence: a sentence is supported by the cell contents if it is directly stated by them or can be logically inferred from them. ToTTo applied several heuristics to sample the tables and candidate sentences from Wikipedia pages, and its annotators were asked to revise the sentences and highlight the corresponding table regions so that the sentences retain the varied language and structure found in natural text.
We first sample a subset of these sentences that already aggregate and reason over multiple facts in the table, which is the content that annotators would otherwise spend most of their time composing, thereby greatly reducing annotation time. More importantly, such sentences contain noteworthy information that users are interested in and likely to ask about given a table from Wikipedia. We sample ToTTo instances with the following considerations. First, we found that ToTTo’s annotation of highlighted cells is a reasonable indicator of how much information the answer requires from the table, which we aim to maximize. With this objective, probing ToTTo showed that tables of extreme size (too many or too few rows, columns, or both) resemble attribute–value pairs rather than tables with complex structure, and they tend to have few highlighted cells, making them not ideal for our dataset. As shown in Figures 9 and 10 in the Appendix, we removed all tables whose number of rows or columns exceeds the 75th percentile over all ToTTo tables, and also removed tables with a single row or column. We further select tables whose highlighted cells span more than a single row or column, to ensure that the sentences mention several table entities. We provide a flowchart of this sampling process in Figure 7 in the Appendix. This process gave us sufficient {table, metadata, highlighted region, sentence} instances from ToTTo, on which we conducted the annotation procedure described below.
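A minimal sketch of this filtering logic, assuming highlighted cells are given as (row, col) coordinates and the 75th-percentile thresholds are precomputed over all ToTTo tables; the details beyond the stated criteria are our interpretation:

```python
def keep_table(n_rows, n_cols, highlighted_cells, row_p75, col_p75):
    """Sampling heuristics: drop extreme sizes and single-row/column tables, and
    require the highlighted region to span more than one row or column.

    row_p75 / col_p75 are the 75th percentiles of row and column counts over all
    ToTTo tables; highlighted_cells is a list of (row, col) coordinates.
    """
    if n_rows > row_p75 or n_cols > col_p75:   # size above the 75th percentile
        return False
    if n_rows <= 1 or n_cols <= 1:             # single row or single column
        return False
    rows = {r for r, _ in highlighted_cells}
    cols = {c for _, c in highlighted_cells}
    return len(rows) > 1 or len(cols) > 1      # spans several table entities
```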
We adopted these table-grounded sentences as the answers in our new QA dataset and used ToTTo’s table cell annotations (the highlighted table region) as weak supervision labels (denotations) for training and evaluating the intermediate semantic parser. We processed each table (originally in HTML format) into a 2-dimensional array whose first row corresponds to the table header. We also handled merged cells by copying the cell content and its highlight status to every individual cell covered by the original merged cell.
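A sketch of the merged-cell expansion, assuming each HTML cell is available as a (row, col, rowspan, colspan, text) tuple; this input format is illustrative, not ToTTo's actual representation:

```python
def flatten_table(html_cells):
    """Flatten an HTML table with rowspan/colspan into a dense 2-D array of strings."""
    n_rows = max(r + rs for r, _, rs, _, _ in html_cells)
    n_cols = max(c + cs for _, c, _, cs, _ in html_cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for r, c, rs, cs, text in html_cells:
        for i in range(r, r + rs):         # copy the merged cell's content (and, analogously,
            for j in range(c, c + cs):     # its highlight status) into every covered cell
                grid[i][j] = text
    return grid
```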
2.2.1 Question Annotation
Question annotations were collected with the help of human judges in two phases: an internal phase conducted by on-site expert annotators, and an external phase conducted by crowd workers on Amazon Mechanical Turk. To streamline the process, we built a custom Web interface that visualizes the table HTML and metadata, augmented with widgets that allow table region highlighting as well as table content and sentence editing. A screenshot of the annotation interface is shown in Figure 8 in the Appendix.
Given the full context of a ToTTo instance, annotators were asked to write a question whose answer is the provided ToTTo sentence. We found that such questions arise naturally when the table cell contents are more semantically related. In addition, annotators were free to modify the sentence, the table cell contents, and the highlighted region so that these contents support a more natural question formulation and avoid any contrived effort. Table 2 reports how often annotators modified each ToTTo resource (✓ indicates a modification) to produce more compatible question–answer interactions.
| Highlighted Region | Cell Content | ToTTo Sentence | Percentage |
|---|---|---|---|
| ✗ | ✗ | ✗ | 62.45% |
| ✓ | ✗ | ✗ | 2.96% |
| ✗ | ✓ | ✗ | 0.66% |
| ✗ | ✗ | ✓ | 10.13% |
| ✓ | ✗ | ✓ | 22.62% |
| ✓ | ✓ | ✗ | 0.07% |
| ✗ | ✓ | ✓ | 0.49% |
| ✓ | ✓ | ✓ | 0.62% |
| Total | | | 100% |
Internal Annotations
In the first phase of annotation, we enrolled 15 internal annotators and provided them with preliminary guidelines. In addition to the annotation task, they were asked to give feedback on the task instructions and the user experience of the Web site, based on which we iteratively revised the guidelines and the Web site design.
External Annotations
For external annotations, we hired MTurk workers who had completed at least 500 HITs, had a 97% approval rate, and were from English-speaking regions. To ensure that the MTurk annotators understood our task, we provided an instructional video on using the interactive annotation tool, FAQs clarifying the annotations we desire, and examples of good versus bad annotations. We also created a Slack channel for crowdsourced workers to ask questions and clarify doubts.
Annotation Evaluation
To ensure that FeTaQA is of high quality, we evaluated the crowdsourced annotations as follows. We built another Web interface for evaluation and asked internal evaluators to approve each annotation (with modification if necessary) based on its grammatical correctness, relevance to the highlighted table cells, and compatibility with the answer. Evaluators modified question annotations that asked for only one of the many facts in the answer sentence, or for which a short-form answer would clearly be adequate; most of the modifications evaluators made fall into this category. We rejected an annotation when we could not modify it to meet the above standards within a reasonable time frame. The breakdown of the evaluation results is shown in Table 3. We approved most of the annotations and rejected only 12%, mostly cases where the original ToTTo instances were hard to generate questions for. These instances usually contain highlighted cells with no clear relation to one another, making it difficult to come up with questions. Among the approved annotations, only 16.7% were modified, so the crowdsourced annotations are not much affected by any bias of the internal evaluators.
| Decision Type | Percentage |
|---|---|
| Reject | 12.00% |
| Approve - no modification | 73.30% |
| Approve - only modify question | 7.66% |
| Approve - only modify highlighted region | 1.71% |
| Approve - modify question and highlighted region | 5.19% |
| Approve - other modification | 0.14% |
| Total | 100% |
The annotator contributions to the final dataset are distributed as follows: We have 3,039 (30%) instances from internal annotators and 7,291 (70%) from MTurk workers. In total, our dataset contains 10,330 instances.
2.3 Dataset Split
Randomly splitting the dataset may cause the train, development, and test splits to contain tables with similar contents (Finegan-Dollak et al., 2018; Lewis et al., 2021). Therefore, to increase the generalization challenge, we split FeTaQA so as to minimize the content/topic overlap (not necessarily the question/answer type overlap) between the train set and the dev/test sets, similar to ToTTo (Parikh et al., 2020). We measure the similarity of two instances by the Jaccard similarity of the tokens appearing in their questions and table column headers. We first randomly sampled 800 instances as a seed set, then gradually added an instance to it if it was similar to any instance already in the seed set. Once the seed set grows to 70% of all instances, the remaining 30% of instances are dissimilar to every instance in the seed set. The seed set then becomes the training set, and the remaining instances are divided to form the development and test sets. This results in 7,326/1,001/2,003 instances in the train/dev/test splits, respectively.
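A sketch of this split procedure; the tokens() helper and the greedy growth criterion are our simplification of the description above (the paper does not specify the exact similarity threshold or ordering):

```python
import random

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def similarity_split(instances, tokens, seed_size=800, train_ratio=0.7):
    """Grow the train split from a random seed so that dev/test stay topically distinct.

    tokens(inst) should return the tokens of the question plus the table column headers.
    """
    random.shuffle(instances)
    train, rest = instances[:seed_size], instances[seed_size:]
    target = int(train_ratio * len(instances))
    while len(train) < target and rest:
        # Greedily move over the remaining instance most similar to the current train set.
        best = max(rest, key=lambda x: max(jaccard(tokens(x), tokens(t)) for t in train))
        rest.remove(best)
        train.append(best)
    return train, rest   # rest is later divided into dev and test
```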
2.4 Data Analysis and Statistics
Basic statistics of FeTaQA are shown in Table 4. We also conducted a human evaluation over 100 FeTaQA instances along 7 dimensions; evaluation scores and inter-evaluator agreements are reported in Table 5. A quantitative and qualitative analysis of FeTaQA shows that it contains many questions judged complex by human evaluators. Note that an ideal measurement of question complexity would quantify the structural complexity of the information contained in the answer, but since this is a time-consuming process, we simply asked the evaluators to score based on their subjective judgment, which may explain the relatively low agreement. The median number of highlighted cells (denotations) is 6, twice the corresponding number for ToTTo, indicating that FeTaQA requires retrieving multiple entities from the table. These denotations are correct and adequate, as indicated by the corresponding high evaluation scores. The free-form answers have a median length of 18 tokens and are grounded in the table and the denotations, as the high evaluation scores also suggest.
| Property | Value |
|---|---|
| Unique Tables | 10,330 |
| Question Length (Median/Avg, tokens) | 12 / 13.2 |
| Answer Length (Median/Avg, tokens) | 18 / 18.9 |
| Rows per Table (Median/Avg) | 12 / 13.8 |
| Columns per Table (Median/Avg) | 5 / 5.9 |
| No. of Highlighted Cells (Median/Avg) | 6 / 8.0 |
| Percentage of Cells Highlighted (Median/Avg) | 10.7% / 16.2% |
| Page Title Length (Median/Avg, tokens) | 2 / 3.3 |
| Section Title Length (Median/Avg, tokens) | 2 / 1.9 |
| Training Set Size | 7,326 |
| Development Set Size | 1,001 |
| Test Set Size | 2,003 |
| Annotation Quality | Score ≥ 4 (%) | % Agreement | Randolph’s Kappa / 95% CI |
|---|---|---|---|
| Question Complexity | 52.6 | 0.65 | 0.48 / [0.41, 0.55] |
| Denotation Correctness | 89.0 | 0.88 | 0.82 / [0.76, 0.88] |
| Denotation Adequacy | 91.6 | 0.89 | 0.83 / [0.77, 0.89] |
| Answer Fluency | 95.0 | 0.92 | 0.89 / [0.84, 0.94] |
| Answer Correctness | 92.4 | 0.91 | 0.86 / [0.80, 0.92] |
| Answer Adequacy | 90.6 | 0.88 | 0.82 / [0.76, 0.88] |
| Answer Faithfulness | 95.6 | 0.93 | 0.89 / [0.84, 0.94] |
Topics
Similar to ToTTo, we use Wikimedia Foundation’s topic categorization model (Asthana and Halfaker, 2018) to investigate the topic distribution of FeTaQA, as shown in Figure 2. We found that most of the instances are related to biography, sports, and geographical regions. There are also abundant instances related to media, politics, and government.
Question Types
FeTaQA has diverse and complex questions, as illustrated in Figure 3. We found that in FeTaQA, a large percentage of what questions ask about plural entities or about abstract entities such as outcome, result, margin, and percentage. In addition, there is a higher percentage of how questions that are not how many/much, compared to existing table QA datasets.
3 Models
To quantify the challenge posed by FeTaQA for state-of-the-art models, we used two modeling approaches that have been shown to be effective on existing table question answering datasets, with some modifications made to adapt them to our task. Model configurations are shown in Figure 4.
3.1 Pipeline Model
Question answering over tables is usually treated as a semantic parsing task. A table semantic parser obtains representations of the question and the table schema and uses them to generate database-like queries. The generated queries are then executed to produce the final denotation(s), which suffice to answer the questions in previous datasets. There are two possible settings for training or fine-tuning a table semantic parser, as shown by the two diagrams on the left in Figure 4. The first is the supervised setting, which requires annotations of database-like queries. Because such annotations are costly, semantic parsers are usually trained in the latter, weakly supervised setting, which requires only labeled denotations: the parser learns to predict which table cells constitute the final answer. (Note that we use ToTTo’s highlighted table cells as these labels.)
However, in our task, targets are generated texts instead of retrieved denotations, suggesting that we also need a generator to integrate the retrieved information into a cogent sentence. Therefore, we propose a pipeline model with two separately trained modules, described below.
Weakly Supervised Table Semantic Parsing
The first module adopts a weakly supervised table semantic parser. Two recently proposed pre-trained models could serve this purpose: TAPAS (Herzig et al., 2020) and TaBERT (Yin et al., 2020a). Both are pre-trained models for the joint understanding of text and tabular data, and both can be integrated into semantic parsers for table-based QA tasks. However, we did not include TaBERT in our experiments because it provides table column representations based on no more than 3 rows of the table, selected by their n-gram overlap with the question. These representations are designed to help weakly supervised semantic parsers generate better database-like queries, so the method also depends on a reasonably designed domain-specific query language, as shown by TaBERT’s use with MAPO (Liang et al., 2018). In contrast, TAPAS provides representations for all table cells that let weakly supervised semantic parsers directly predict denotations in an end-to-end fashion, so it is easier to analyze our pipeline model without accounting for propagated errors.
We fine-tune TAPAS with FeTaQA’s label denotations (highlighted table regions). We believe fine-tuning is crucial for our task because TAPAS is pre-trained on questions that require retrieving limited denotations (a single entity, or homogeneous entities that can be aggregated with a COUNT, SUM, or AVG operation), whereas FeTaQA questions require retrieving multiple entities and performing complex aggregations. Detailed experimental results are provided in § 4.3. Note that besides denotations, TAPAS was pre-trained to explicitly predict an aggregation operation (chosen from COUNT, SUM, AVG, and NONE) applied to the predicted denotations to obtain the final answer. However, we argue that the aggregations required to solve FeTaQA instances are diverse and are not covered by a small list of human-predefined atomic operations. Instead, we use NONE as the aggregation operation label for fine-tuning TAPAS and let the second module (described next) produce latent aggregations, inferred from the question and the denotation predictions, when generating the answer sentence.
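A rough sketch of how a FeTaQA instance could be turned into weak-supervision inputs with the Hugging Face TAPAS tokenizer; the table and question values are made up, and exact argument requirements may vary across library versions and checkpoint configurations:

```python
import pandas as pd
from transformers import TapasTokenizer

tokenizer = TapasTokenizer.from_pretrained("google/tapas-base-finetuned-wtq")

# The table must be a pandas DataFrame of strings; these values are illustrative.
table = pd.DataFrame({"Year": ["2010", "2011"], "Result": ["Won", "Nominated"]}).astype(str)
inputs = tokenizer(
    table=table,
    queries=["What did the film achieve in 2010 and 2011?"],
    answer_coordinates=[[(0, 1), (1, 1)]],   # FeTaQA highlighted cells as denotation labels
    answer_text=[["Won", "Nominated"]],
    padding="max_length",
    return_tensors="pt",
)
# inputs["labels"] now marks the highlighted cells for the cell-selection loss;
# during fine-tuning we fix the aggregation label to NONE, as described above.
```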
Data-to-Text
As shown in Figure 5, we fine-tune T5 (Raffel et al., 2020) on DART (Nan et al., 2021) to obtain a Data-to-Text model as the second module of the pipeline, which infers aggregations and performs surface realization of the table cells (denotations in our case). We first convert the denotation prediction into the triple-set format with the following scheme: for each table cell in the highlighted region, we generate the triple [[TABLECONTEXT], column_header, cell_value], where column_header is the cell’s corresponding column name. Similar to DART, we use [TABLECONTEXT] as a special token for converting a table cell into a triple. We then incorporate the metadata into triples by replacing column_header with the field name (TABLE_TITLE, PAGE_TITLE) and cell_value with the metadata content (table title text, page title text). We end up with a triple-set containing all highlighted table cells and the metadata (the table title and the title of the Wikipedia page that includes the table). We further fine-tune the Data-to-Text model on ToTTo instances so that it adapts to our formulation of triple-set inputs. To avoid exposure to FeTaQA test instances, we fine-tune on a sample of 8K ToTTo instances that were not used for creating FeTaQA.
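A sketch of the triple-set conversion, assuming the table is the 2-D array from § 2.2 (first row = headers) and denotations are (row, col) coordinates:

```python
def denotations_to_triples(highlighted_cells, table, page_title, table_title):
    """Convert predicted denotations plus metadata into DART-style triples."""
    header = table[0]
    triples = [["[TABLECONTEXT]", header[c], table[r][c]] for r, c in highlighted_cells]
    # Metadata triples: the field name replaces the column header,
    # and the metadata content replaces the cell value.
    triples.append(["[TABLECONTEXT]", "PAGE_TITLE", page_title])
    triples.append(["[TABLECONTEXT]", "TABLE_TITLE", table_title])
    return triples
```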
3.2 End-to-End Model
In this approach, we model the task as a sequence-to-sequence learning problem: the linearized table T appended to the question q forms the source sequence, and the free-form answer a is the target sequence. We propose a simple linearization scheme as a baseline: table rows are concatenated with [SEP] tokens in between, and cells in each row are separated by spaces. We prepend q to the table linearization and use [CLS] tokens as prefixes for separation. We fine-tune models from the T5 family on the FeTaQA train set. The linearization scheme is visualized in Figure 6. We considered the alternative of integrating TaBERT into an end-to-end model but found it infeasible, since TaBERT provides contextual features only for the question and the table columns (rather than table cells, as in our table linearization), so a decoder generating the free-form answer would not have access to any table cell content. We therefore did not include TaBERT as a baseline end-to-end model.
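A sketch of this baseline linearization; the exact placement of the [CLS] prefixes follows our reading of Figure 6:

```python
def linearize(question, table):
    """'[CLS] question [CLS] row1 [SEP] row2 [SEP] ...' with cells joined by spaces."""
    rows = [" ".join(row) for row in table]
    return "[CLS] " + question + " [CLS] " + " [SEP] ".join(rows)

# Example:
# linearize("Who won in 2010?", [["Year", "Result"], ["2010", "Won"]])
# -> '[CLS] Who won in 2010? [CLS] Year Result [SEP] 2010 Won'
```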
4 Experiments
In this section, we explain the experiment settings and report the automatic and human evaluations on model outputs.
4.1 Experiment Setup
We first experiment with the pipeline model in a zero-shot setting, that is, without any fine-tuning on FeTaQA. We use a TAPAS-base checkpoint fine-tuned on WikiTableQuestions (Pasupat and Liang, 2015) to perform table semantic parsing implicitly and produce a set of denotations, which is then converted into a triple-set as described in § 3.1. We then employ a T5-large model (Raffel et al., 2020) that goes through two fine-tuning stages: in the first stage it is fine-tuned on the downstream Data-to-Text task with DART (Nan et al., 2021); in the second stage it is further fine-tuned on ToTTo instances to adapt to the triple-set formulation we proposed. We denote this setting as Pipeline - zeroshot in Table 6. Next, we experiment with the pipeline model by fine-tuning the table semantic parser on FeTaQA: we further fine-tune the TAPAS-base checkpoint (WTQ fine-tuned) on the FeTaQA train set and select models based on their performance on the development set. We use the same Data-to-Text model as in the zero-shot setting.
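For the zero-shot parser, denotation prediction roughly follows the usage documented for the Hugging Face TAPAS implementation (a sketch; the table and question are made up, and API details may differ across versions):

```python
import pandas as pd
from transformers import TapasForQuestionAnswering, TapasTokenizer

name = "google/tapas-base-finetuned-wtq"
tokenizer = TapasTokenizer.from_pretrained(name)
model = TapasForQuestionAnswering.from_pretrained(name)

table = pd.DataFrame({"Year": ["2010", "2011"], "Result": ["Won", "Nominated"]}).astype(str)
inputs = tokenizer(table=table, queries=["What did the film achieve?"],
                   padding="max_length", return_tensors="pt")
outputs = model(**inputs)
coords, _ = tokenizer.convert_logits_to_predictions(
    inputs, outputs.logits.detach(), outputs.logits_aggregation.detach())
denotations = [table.iat[r, c] for r, c in coords[0]]
# The denotations are then converted into a triple-set and verbalized by the Data-to-Text model.
```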
| Model | sacreBLEU | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR | BERTScore | BLEURT |
|---|---|---|---|---|---|---|---|
| Pipeline - zeroshot | 9.16 | 0.38 | 0.20 | 0.33 | 0.22 | 0.88 | −0.79 |
| Pipeline - fine-tuned | 11.00 | 0.40 | 0.22 | 0.35 | 0.24 | 0.91 | −0.35 |
| Pipeline - gold denotation | 31.63 | 0.67 | 0.43 | 0.53 | 0.50 | 0.91 | −0.23 |
| End-to-End - T5-small | 21.60 | 0.55 | 0.33 | 0.47 | 0.40 | 0.94 | 0.08 |
| End-to-End - T5-base | 28.14 | 0.61 | 0.39 | 0.51 | 0.47 | 0.96 | 0.31 |
| End-to-End - T5-large | 30.54 | 0.63 | 0.41 | 0.53 | 0.49 | 0.96 | 0.57 |
For the End-to-End model, we adapt Hugging Face’s implementation (Wolf et al., 2020) of T5 (Raffel et al., 2020) to our task. We use a standard T5 tokenizer with additional [CLS] and [SEP] tokens, and the model vocabulary is resized accordingly. Since we expect the input sequence to be significantly longer than the target, we fine-tuned the models using T5’s “summarize:” prefix. The motivation is to avoid simple extraction from the table, since abstractive summarization is supposed to rephrase important details in the source. T5-small was trained on 4 Tesla K80 GPUs with a per-device batch size of 16 for 30 epochs (about 6,900 steps), which took less than an hour. T5-base was trained on 4 Tesla K80 GPUs with a per-device batch size of 4 (due to GPU memory constraints) for 80 epochs (about 36,640 steps), which took around 3 hours. For T5-large, we distributed the layers across 8 Tesla K80 GPUs and trained with a batch size of 4 for 80 epochs (about 80k steps), which took 5 hours.
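A minimal sketch of this setup with the Hugging Face API (the toy instance and training-loop details are illustrative; only the added special tokens, the “summarize:” prefix, and the vocabulary resizing follow the description above):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
tokenizer.add_special_tokens({"additional_special_tokens": ["[CLS]", "[SEP]"]})
model = T5ForConditionalGeneration.from_pretrained("t5-base")
model.resize_token_embeddings(len(tokenizer))   # account for the added tokens

# Toy FeTaQA-style instance (values are made up).
question = "What did the film achieve in 2010 and 2011?"
table = [["Year", "Result"], ["2010", "Won"], ["2011", "Nominated"]]
answer = "The film won in 2010 and was nominated in 2011."

source = "summarize: [CLS] " + question + " [CLS] " + " [SEP] ".join(" ".join(r) for r in table)
inputs = tokenizer(source, truncation=True, return_tensors="pt")
labels = tokenizer(answer, truncation=True, return_tensors="pt").input_ids
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss   # optimized by the training loop of choice
```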
4.2 Automatic Evaluation Metrics
We use a variety of automatic metrics and human evaluation (§ 4.4) to assess the quality of the generated answers. We report sacreBLEU (Post, 2018), ROUGE-{1, 2, L} (Lin, 2004), and METEOR (Banerjee and Lavie, 2005), which evaluate the n-gram match between generated and reference answers. Considering the limitations of these measures in capturing sentence semantics, we also report BERTScore (Zhang et al., 2020) and BLEURT (Sellam et al., 2020), which incorporate semantics using contextual embeddings. To evaluate the retrieval competency of the table semantic parsers, we applied various set similarity metrics to the predicted and reference denotation lists. Specifically, we report Jaccard similarity, Overlap, Cosine similarity, and Dice similarity.
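The set similarity metrics reduce to simple set operations over the predicted and gold denotation lists; a sketch, assuming duplicates are ignored (i.e., the lists are treated as sets):

```python
import math

def denotation_set_metrics(pred, gold):
    """Jaccard, overlap coefficient, cosine, and Dice similarity between two denotation lists."""
    p, g = set(pred), set(gold)
    inter = len(p & g)
    return {
        "jaccard": inter / len(p | g) if (p | g) else 0.0,
        "overlap": inter / min(len(p), len(g)) if p and g else 0.0,
        "cosine":  inter / math.sqrt(len(p) * len(g)) if p and g else 0.0,
        "dice":    2 * inter / (len(p) + len(g)) if (p or g) else 0.0,
    }
```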
4.3 Results and Discussions
Our experimental results on the FeTaQA test set are summarized in Table 6. Among the models without access to gold denotations, the End-to-End T5-large model achieves the highest scores on all evaluation metrics. We also observe a large performance gap between the pipeline models and the End-to-End models, even though the latter adopt only a simple linearization strategy for encoding tables.
We also see that after fine-tuning on FeTaQA with denotations as weak supervision, the pipeline model improves by almost 2 BLEU points. To further examine the source of this improvement, we report the table semantic parser’s performance in Table 7, where we also observe improved retrieval capability. However, compared with the gold denotations, which have a median of six highlighted table cells (shown in Table 4), our table semantic parser predicts only two table cells on average before fine-tuning on FeTaQA, and three on average after. When gold denotations are used, the pipeline model performs better than the End-to-End model. This indicates that the low performance of denotation prediction and the loss of relational information between denotations cause the inadequate performance of the pipeline models, and it also indicates that the table semantic parser has large room for improvement. A final observation is that the End-to-End model is comparable to the model with access to the gold denotations, suggesting that the End-to-End model is effective at extracting denotations latently.
4.4 Human Evaluation
To further evaluate the quality of the answers generated by the different models compared to the references, we conduct a human evaluation based on four criteria: (1) fluency, whether an answer is natural and grammatical; (2) correctness, whether an answer is correct; (3) adequacy, whether an answer contains all the information that is asked for; and (4) faithfulness, whether an answer is faithful and grounded in the contents of the table and the highlighted region. Each evaluator is asked to examine an answer given the question and the full context (table, highlighted region, and metadata) and to give a score on a scale of 1 to 5 for each criterion. We asked five internal annotators to evaluate 100 sampled FeTaQA instances. Each sample is paired with 3 answers: the reference, the pipeline model output, and the End-to-End model output.
Table 8 attests to the high quality of our annotations and the challenging nature of FeTaQA. Similar to the results of the automatic metrics, we observe a large gap between the pipeline model and the End-to-End model, with the latter significantly outperforming its counterpart in terms of answer correctness, adequacy, and faithfulness. Comparing the best-performing End-to-End model outputs with the human references, we see that there remains considerable room for improvement.
| Source | Fluent (%) | Correct (%) | Adequate (%) | Faithful (%) |
|---|---|---|---|---|
| Pipeline | 85.2 | 25.4 | 8.4 | 23.6 |
| End-to-End | 94.6 | 54.8 | 48.4 | 50.4 |
| Reference | 95.0 | 92.4 | 90.6 | 95.6 |
5 Related Work
Generative QA
Generative question answering datasets such as NarrativeQA (Kociský et al., 2017), CoQA (Reddy et al., 2019), TriviaQA (Joshi et al., 2017), and MS MARCO (Nguyen et al., 2016) all have free-form answers that are generated based on the contexts of Wikipedia articles, books, movie scripts, dialogues, or Web documents. These responses are mostly crowdsourced and are reported to consist largely of copied short text spans from the source. By contrast, ELI5 (Fan et al., 2019) is a long-form question answering dataset containing a diverse set of complex questions, each paired with a paragraph-long answer and 100 relevant Web source documents (Petroni et al., 2021; Krishna et al., 2021). FeTaQA is the first dataset for generative question answering over tables. Unlike existing generative QA datasets that assess multi-document retrieval and abstraction capabilities, FeTaQA poses new challenges for the reasoning and integration capabilities of a system given a structured knowledge source.
QA over Tables and Semantic Parsing
Several datasets have been proposed for applying semantic parsing to tables, including WikiTableQuestions (Pasupat and Liang, 2015), SequentialQA (Iyyer et al., 2017), WikiSQL (Zhong et al., 2017), and Spider (Yu et al., 2018). With the development of pre-trained language models, recent work (Yin et al., 2020b; Herzig et al., 2020; Eisenschlos et al., 2020; Iida et al., 2021) jointly learns representations for natural language sentences and structured tables, and Yu et al. (2021a, b) use pre-training approaches for table semantic parsing. HybridQA (Chen et al., 2020d) and OTT-QA (Chen et al., 2021) have contexts of both structured tables and unstructured text. MultiModalQA (Talmor et al., 2021) contains complex questions over text, tables, and images. These datasets define a table QA task that is extractive in nature by restricting their answers to be short-form, whereas FeTaQA frames table QA as a generation task.
Data-to-Text Generation
Recent neural end-to-end models tested on the WebNLG 2017 dataset (Gardent et al., 2017) have focused on incorporating pre-training and fine-tuning for specific generation tasks (Chen et al., 2020b; Kale and Rastogi, 2020) to improve performance and strengthen generalization ability. However, recent models featuring separate content-planning and surface realization stages have exhibited improvements (Moryossef et al., 2019; Iso et al., 2020) over comparable baselines. TabFact (Chen et al., 2020c) is composed of Wikipedia tables coupled with statements labeled as either “ENTAILED” or “REFUTED” by the table. LogicNLG (Chen et al., 2020a) features statements logically entailed from tables. ToTTo (Parikh et al., 2020) is a large-scale open-domain dataset consisting of Wikipedia tables with a set of highlighted table cells and a sentence description of those highlighted cells. DART (Nan et al., 2021) is an open-domain Data-to-Text dataset that contains table-ontology-preserving data samples with a diverse predicate set occurring in Wikipedia tables.
6 Conclusion
In this paper, we introduced the task of generative table question answering with FeTaQA, a table QA dataset consisting of complex questions that require free-form, elaborate answers. We also proposed two modeling approaches: (1) a pipeline model that incorporates a table semantic parser and a Data-to-Text generator, and (2) an End-to-End model that integrates query comprehension, reasoning and text generation. Our experimental results indicate that the End-to-End model with a simple table encoding strategy achieves much higher scores than the pipeline model that requires table semantic parsing. Furthermore, we show that FeTaQA reveals the challenging nature of the table question answering task and calls for innovative model designs in the future.
Acknowledgments
The authors would like to thank the anonymous reviewers and the Action Editor for their valuable discussions and feedback.
A Appendix
The Appendix contains the supplementary figures referenced in the main text (Figures 7–10).
Notes
SacreBLEU signature: BLEU+case.lc+numrefs.1+smooth.exp+tok.13a+version.1.3.7.