Abstract
Language models, potentially augmented with tool usage such as retrieval, are becoming the go-to means of answering questions. Understanding and answering questions in real-world settings often requires retrieving information from different sources, processing and aggregating data to extract insights, and presenting complex findings in the form of structured artifacts such as novel tables, charts, or infographics. In this paper, we introduce TANQ,1 the first open-domain question answering dataset where the answers require building tables from information across multiple sources. We release the full source attribution for every cell in the resulting table and benchmark state-of-the-art language models in open, oracle, and closed book setups. Our best-performing baseline, Gemini Flash, reaches an overall F1 score of 60.7, lagging behind human performance by 12.3 points. We analyze baselines’ performance across different dataset attributes such as the different skills required for this task, including multi-hop reasoning, math operations, and unit conversions. We further discuss common failures in model-generated answers, suggesting that TANQ is a complex task with many challenges ahead.
1 Introduction
Understanding and solving problems in real-world scenarios often requires reasoning across multiple documents and data modalities. This includes (i) retrieving information from different sources, (ii) processing and aggregating data to extract insights, and (iii) presenting complex findings in a structured format, for example a table or infographics, to communicate them to readers.
Previous studies show that knowledge workers across domains, e.g., finance, science, and economics, spend around 20% of their time searching for and gathering information from different files into one document to extract insights and consequently answer information-seeking questions (Chui et al., 2012). As a result, one of the most challenging tasks for workers is to aggregate data and turn it into insights. Tables, as a structured representation of data, are ubiquitous in real-world sources and are commonly used to communicate complex information. Hence, they can be the perfect modality for answering complex questions.
Large language models (LLMs), often enhanced with external tools, have become a primary method for various applications such as answering questions. State-of-the-art question-answering (QA) systems integrate LLMs in various ways, from decomposing complex queries to retrieving documents using external tools or generating context data from knowledge acquired during model training. However, their evaluation is mostly limited to simple datasets, e.g., TabFact (Chen et al., 2020a) or HotpotQA (Yang et al., 2018), whose questions can be answered by reasoning over a single table or text document and generating a short text sequence as the answer. This limits the applicability of such systems to complex multi-step research explorations. Moreover, it differs from real-world needs where relevant information can be spread across documents and represented in different forms (e.g., text or tables). More often than not, generating a short text as the answer is not sufficient for complex information-seeking questions.
In this paper, we investigate LLMs’ capabilities in reasoning over multiple data sources and formats (i.e., text, tables, infoboxes) to answer entity-centric questions and generate table answers as structured artifacts. To address these challenges, we introduce TANQ, an open-domain, multi-hop QA dataset. TANQ requires retrieving and aggregating data from multiple documents to compile and communicate answers as tables. To solve TANQ, models require, in addition to data retrieval, different skills such as filtering, maths, and name normalization. We create TANQ by applying a five-step, automated data collection process. We use QAMPARI (Amouyal et al., 2022) as seed dataset and Wikidata as well as the Wikipedia corpus as data sources. For automated evaluation of different data collection and processing substeps, we use PaLM-2 (Anil et al., 2023b).
We evaluate several state-of-the-art LLMs on TANQ in closed book, oracle, and open book evaluation settings. Finally, we study model-generated answer tables and discuss common failure cases and challenges related to TANQ. Our evaluation of models across skills can further inform future tools and evaluation setups for LLMs to improve models for complex, information-seeking questions.
Our contributions are as follows:
- (a) We introduce TANQ, the first open-domain question-answering benchmark that requires building answers in the form of tables from multiple information sources.
- (b) We benchmark state-of-the-art language models in oracle, open, and closed book setups, reaching an overall F1 score of 60.7 with our best-performing (oracle) baseline.
- (c) We evaluate model performance across different dataset characteristics, and discuss challenges and common failure types.
2 Related Work
Various benchmarks for QA have been released in recent years. Each one addresses different challenges related to the task. Table 1 provides an overview and comparison of benchmarks.
Comparison of TANQ to related (table) question-answering datasets.
Dataset | Open Domain | Multi Doc | Answer Type | Doc: Text | Doc: Table | Doc: Infobox
---|---|---|---|---|---|---
InfoTabs (Gupta et al., 2020) | short text | ✓ | ||||
FeTAQA (Nan et al., 2022) | free form text | ✓ | ||||
FinQA (Chen et al., 2021b) | short text | ✓ | ✓ | |||
TATQA (Zhu et al., 2021) | short text | ✓ | ✓ | |||
MultiHiertt (Zhao et al., 2022) | numeric | ✓ | ✓ | |||
OTTQA (Chen et al., 2021a) | ✓ | short text | ✓ | |||
NQ-TABLES (Herzig et al., 2021) | ✓ | short text | ✓ | |||
HybridQA (Chen et al., 2020b) | ✓ | short text | ✓ | ✓ | ||
MultiTabQA (Pal et al., 2023) | ✓ | table | ✓ | |||
TANQ | ✓ | ✓ | table | ✓ | ✓ | ✓ |
QA with Text and/or Table Input.
A number of datasets use text and one or multiple tables for QA. While both text and tables have been considered as input modalities, the output of these datasets is mostly limited to short textual answers. HybridQA (Chen et al., 2020b), for example, is a multi-hop QA dataset that requires reasoning over one table and multiple Wikipedia passages related to entities occurring in the table. HybridQA answers are short texts, with location names being the most common answer type, followed by numbers, dates, and person names. Moreover, many QA benchmarks for reasoning over text and tabular context concentrate on the finance domain. For example, the MultiHiertt (Zhao et al., 2022) benchmark is created from financial reports. Questions require reasoning over texts and multiple tables. The answers are short numerical values with a focus on numerical reasoning. Other financial QA benchmarks are FinQA (Chen et al., 2021b) and TATQA (Zhu et al., 2021), where the context is one table and at least two paragraphs related to the table. TATQA answers are short texts consisting of either one or multiple text spans from context paragraphs/tables or are free-form answers. MultimodalQA is a multi-hop, open-domain QA dataset that takes one table and related images and text paragraphs as input, with answers similar to the previously described datasets.
Open-domain Benchmarks.
Most of the earlier described datasets have context provided in the form of text and/or tables, whereas open-domain QA datasets first require extracting the relevant context before answering the given question. The majority of open-domain QA datasets, such as WikiQA (Yang et al., 2015), TriviaQA (Joshi et al., 2017), and RobustQA (Han et al., 2023), are limited to textual context. NQ Tables (Herzig et al., 2021) extends table-QA to an open-domain setting where the top-k tables are first retrieved from a given corpus. These tables are processed by a reader component to generate the correct short-text answer. Built on HybridQA, the Open Table-and-Text Question Answering (OTT-QA) benchmark (Chen et al., 2021a) extends this setting by requiring the extraction of both tables and texts given multi-hop questions.
QA with Table Answers.
The work closest to ours with respect to input and output modalities is MultiTabQA (Pal et al., 2023). MultiTabQA is seeded on the Spider dataset, a text-to-SQL dataset containing SQL queries, database tables, and natural language translations of the queries. Pal et al. (2023) use table names occurring in SQL queries to extract the input tables and query Spider databases for answer table generation. While this dataset also generates answers as tables, it has certain limitations: (i) the benchmark input is limited to tables; (ii) the input and output tables are highly structured database tables, which differ from real-world scenarios where tables occur in documents and websites in various formats; (iii) the questions are limited to SQL-based queries.
3 Building the TANQ Benchmark
TANQ evaluates the capability to answer open-domain, multi-hop questions by aggregating data and generating answer tables.
3.1 Task Definition
A TANQ dataset instance is a triple (q, t, D), consisting of an entity-centric question q, a table answer t, and a document set D (see Figure 1). To answer the multi-hop question q, first multiple sub-answers are extracted from the document set D. The answer is generated in table form t = {ti,j | i ≤ n, j ≤ m}, consisting of n rows (i.e., one per extracted entity) and m columns. The documents in D provide supporting evidence for each cell ti,j of the answer table. D is either provided as input to models (oracle setting) or retrieved from the Wikipedia corpus as D′ (open book) and can consist of texts, tables, and infoboxes.
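For illustration, a TANQ instance and its answer table could be represented as in the following sketch; the class and field names are our own and are not the released dataset format.

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    """One supporting document snippet: a text sentence, table, or infobox entry."""
    page_title: str
    doc_type: str  # "text" | "table" | "infobox"
    content: str

@dataclass
class TanqInstance:
    """A TANQ triple (q, t, D): question, answer table, and supporting documents."""
    question: str                # q
    header: list[str]            # m column names of the answer table t
    rows: list[list[str]]        # n x m cell values t_{i,j}
    # D: supporting evidence attributed to each cell, keyed by (row index, column index).
    evidence: dict[tuple[int, int], list[Evidence]] = field(default_factory=dict)
```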
An example question in TANQ and its corresponding table answer. Supporting evidence from multiple pages in a Wikipedia snapshot is provided for each data point inside the table. We highlight the rationale inside each snippet in yellow. LLMs are evaluated with or without access to the evidence.
3.2 Preliminaries
QAMPARI.
We use QAMPARI, an open-domain QA dataset with lists of entities as answers (Amouyal et al., 2022), as our seed dataset. QAMPARI further includes Wikipedia text as supporting evidence for each entry of the answer list. Different from prior QA datasets with short textual answers, QAMPARI questions are natural questions that require a list of answers extracted from multiple sources. The dataset is semi-automatically created with Wikidata and Wikipedia as data sources and evaluated by human annotators.
WikiData.
Following Amouyal et al. (2022), we use Wikidata (WD) as a source for question generation. Wikidata (Erxleben et al., 2014) is a collaborative knowledge graph consisting of triples of entities and relations, i.e., (e1, r, e2). Entities ei are values (e.g., 1990) or items that represent a real-world concept, object, or topic.2 Relations are edges connecting entities, e.g., (DonaldTrump, instanceOf, human), where Donald Trump is the head entity and human the tail entity (Krishna et al., 2022). We follow the notation for formal queries over Wikidata introduced by Amouyal et al. (2022) for QAMPARI. Applying a relation r as a query over WD items e results in a set of (head) entities: [[r(e)]] = {ei | (ei, r, e) ∈ WD}.
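Read as code, [[r(e)]] is simply a filter over the triple set. Below is a minimal sketch, assuming WD is available as an in-memory set of (head, relation, tail) triples with toy entries; the function name is illustrative.

```python
# Toy triple set standing in for Wikidata (WD).
WD = {
    ("Donald Trump", "instance of", "human"),
    ("Hey Ram", "composer", "Ilaiyaraaja"),
    ("Hey Ram", "publication date", "2000"),
}

def formal_query(relation: str, entity: str) -> set[str]:
    """[[r(e)]] = {e_i | (e_i, r, e) in WD}: all head entities linked to `entity` via `relation`."""
    return {head for (head, rel, tail) in WD if rel == relation and tail == entity}

print(formal_query("instance of", "human"))  # {'Donald Trump'}
```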
Wikipedia.
We extract supporting evidence in the form of sentences, tables and infoboxes (entity-tables at the top right corner of Wikipedia articles) from Wikipedia (WP).
3.3 TANQ Benchmark Pipeline
Figure 2 provides an overview of the TANQ pipeline outlining all steps for data collection, processing, and evaluation. We use QAMPARI, Wikidata, and Wikipedia, as well as PaLM-2 (Anil et al., 2023b) for paraphrasing and validation.3
TANQ creation pipeline consisting of five steps: 1. Extending QAMPARI questions with additional relations based on Wikidata; 2. Evidence extraction (text, tables, infoboxes) from Wikipedia articles; 3. Evidence evaluation and (gold) answer table generation; 4. Rephrasing of the question from the first step; 5. Augmentation with additional skills to generate more complex questions. We include a running example at the bottom.
Step 1. Extending QAMPARI Questions
Starting with the QAMPARI questions and answer lists of entities, we extend each question q with additional WD relations using the answer entities ea. QAMPARI questions are classified as either simple, composition (e.g., “Who directed movies screen-written by Steven Spielberg?”), or intersection (e.g., the example in Figure 1). We first query the WD knowledge graph to extract additional relations rext linked to ea, i.e., r[ea] = {eext | (ea, rext, eext) ∈ WD}. Hence, each extension is a WD triple linking the QAMPARI answer ea (e.g., Hey Ram in Figure 1) through the relation rext (e.g., composer) to a new extended entity eext, i.e., Ilaiyaraaja for relation composer and answer entity Hey Ram in Figure 1. We only select relations that fulfil two conditions. First, a relation must be part of a predefined relation set R, i.e., rext ∈ R. We manually specify R based on the WD relations used to create QAMPARI questions. Second, the relation must exist for all answers ea of question q. Given n extension relations that fulfil these conditions, we extend the question q in a template-style fashion: “[q] and what is their [rext1], [rext2], [...] and [rextn].” For example, the question in Figure 1 was generated based on the initial question “Which Indian movies were both directed and written by Kamal Haasan?” through extension with the WD relations composer and release year. Additionally, we extract all extension entities eext of the extension triples.
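A condensed sketch of this extension step over a toy triple set; the relation set R, helper names, and template wording are illustrative rather than the released pipeline code.

```python
R = {"composer", "publication date", "genre"}  # hand-specified extension relations

def extend_question(question: str, answers: list[str], wd: set, allowed: set):
    """Extend `question` with every allowed relation that exists for *all* answer entities."""
    relations_per_answer = [{rel for (head, rel, _) in wd if head == a} for a in answers]
    shared = allowed.intersection(*relations_per_answer) if answers else set()
    extended = question.rstrip("?") + " and what is their " + ", ".join(sorted(shared)) + "?"
    # Extension entities e_ext, needed later for evidence search (Step 2).
    extensions = {
        (a, rel): {tail for (head, r, tail) in wd if head == a and r == rel}
        for a in answers for rel in shared
    }
    return extended, extensions
```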
Step 2. Evidence Extraction from Wikipedia
Next, for each extension triple (ea, rext, eext), we collect supporting evidence using Wikipedia as the evidence source. We search for supporting text, tables, and infoboxes in the WP articles of ea and eext. We apply simple heuristics and search for mentions of ea in the eext article and vice versa. Moreover, we extend our queries with additional (heuristic-based) query words, for example, considering different formats in which numbers and dates are represented.
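A rough sketch of the mention-matching heuristic for text evidence; the actual pipeline additionally searches tables and infoboxes and handles more surface forms of numbers and dates than this simplified variant.

```python
import re

def find_text_evidence(article_sentences: list[str], mention: str) -> list[str]:
    """Return sentences of a Wikipedia article that mention the query entity or value."""
    # Simple query variants, e.g., "1,000" vs. "1000"; dates and numbers get more variants in practice.
    variants = {mention, mention.replace(",", "")}
    pattern = re.compile("|".join(re.escape(v) for v in variants), re.IGNORECASE)
    return [sentence for sentence in article_sentences if pattern.search(sentence)]
```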
Step 3. Evidence Evaluation & Answer Table Generation
To evaluate the correctness of the previously collected evidence texts, tables, and infoboxes, we employ PaLM-2 as an evidence evaluator. We prompt the LLM to evaluate the extracted evidence in a natural language inference setting. For each extension triple (ea, rext, eext), we construct a template-based sentence s: “[ea] [rext] [eext]”, e.g., “Hey Ram composer Ilaiyaraaja” for row 2 in Figure 1. We query the LLM to label the sentence s as “supported”, “refuted”, or “not enough information” based on the provided evidence in the form of a sentence, table, or infobox entry. We then only consider the triples supported by at least one piece of evidence.4
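The NLI-style check can be sketched as follows; the prompt wording and the `call_llm` callable are placeholders rather than the exact PaLM-2 prompt used in the pipeline.

```python
def build_nli_prompt(triple: tuple[str, str, str], evidence: str) -> str:
    head, relation, tail = triple
    claim = f"{head} {relation} {tail}"  # e.g., "Hey Ram composer Ilaiyaraaja"
    return (
        "Given the evidence, label the claim as supported, refuted, "
        "or not enough information.\n"
        f"Evidence: {evidence}\nClaim: {claim}\nLabel:"
    )

def keep_triple(triple, evidence_list, call_llm) -> bool:
    """Keep a triple only if at least one piece of evidence supports it."""
    return any(
        call_llm(build_nli_prompt(triple, ev)).strip().lower() == "supported"
        for ev in evidence_list
    )
```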
Answer Table.
Each extended relation corresponds to a column in the answer table (e.g., column “Composer” in Figure 1). Since the question in Figure 1 is extended with two additional relations (i.e., composer and release year), the resulting answer table has three columns. Hence, each cell in the answer table corresponds to a WD triple. For example, (Hey Ram, composer, Ilaiyaraaja) for cell “Ilaiyaraaja”. Some generated answer tables contained multiple entries in a single cell, as seen in the genre column of Table 7. To generate realistic tables, we filtered out samples with more than five entries in any single cell, resulting in a test set of 1,074 TANQ samples for evaluation.
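A sketch of the table assembly and the cell-size filter described above; the helper names are illustrative.

```python
def build_answer_table(answers, extensions, relations, max_values_per_cell=5):
    """Assemble the gold answer table; return None if any cell is overly dense."""
    header = ["Entity"] + list(relations)
    rows = []
    for answer in answers:
        row = [answer]
        for relation in relations:
            values = sorted(extensions.get((answer, relation), set()))
            if len(values) > max_values_per_cell:
                return None  # mirror the filter applied to the TANQ test set
            row.append(", ".join(values))
        rows.append(row)
    return header, rows
```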
Step 4. Question Rephrasing
This step increases the naturalness of the template-based extension questions generated in Step 1. Similar to the earlier step, we prompt a PaLM-2 model in a few-shot setting. To ensure the question meaning is preserved during rephrasing, we add structured annotations in parentheses to questions with the name of each relation (e.g., “Which Indian movies were both directed (directed_by) and written by (written_by) Kamal Haasan?”). We run up to 5 iterations of rephrasing and stop if all relations are present, or discard the question otherwise.5
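A sketch of this rephrase-and-verify loop; the `paraphrase` callable stands in for the few-shot PaLM-2 prompt and is not the released implementation.

```python
def rephrase_question(annotated_question: str, relation_names: list[str], paraphrase, max_iters: int = 5):
    """Rephrase up to `max_iters` times; accept only if every relation name is still present."""
    for _ in range(max_iters):
        candidate = paraphrase(annotated_question)
        if all(rel.lower() in candidate.lower() for rel in relation_names):
            return candidate  # meaning-preserving paraphrase found
    return None  # discard the question
```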
Step 5. Augmenting with Skills
Finally, to generate more challenging questions requiring further reasoning capabilities beyond retrieval, we extend TANQ questions by asking for realistic post-processing steps. Overall, we augment questions with the following additional skills: (i) filtering of the answer table given a numeric, time, or attribute condition (rows 2–4 in Table 2); (ii) conversion of numbers, dates, locations, and corresponding units (rows 5–6 in Table 2); (iii) calculation and introduction of an additional table column based on time attributes (e.g., “lifespan” in row 7 in Table 2); and (iv) approximation of numbers to the nearest ten, hundred, etc. (last row of Table 2). We use up to three distinct skills for augmenting TANQ questions. Table 4 provides a breakdown of TANQ dataset samples across skills.
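As an example of how one augmentation skill transforms a sample, here is a numeric-filter sketch; the appended phrasing and helper names are illustrative.

```python
import re

def augment_with_numeric_filter(question, header, rows, column, threshold):
    """Append a numeric filter condition to the question and apply it to the gold table."""
    question += f" Filter the answer table for {column.lower()} equal to or larger than {threshold}."
    col = header.index(column)

    def numeric(cell: str) -> float:
        match = re.search(r"-?\d+(\.\d+)?", cell)  # e.g., "129 min" -> 129.0
        return float(match.group()) if match else float("-inf")

    filtered_rows = [row for row in rows if numeric(row[col]) >= threshold]
    return question, header, filtered_rows
```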
An overview of TANQ question types and skills we use for augmenting the questions for more complex reasoning in the final step of the TANQ pipeline in Figure 2.
Question Type | Example Question
---|---
1. Simple | Which Belgian Grand Prix did Michael Schumacher win, and when did that happen? |
2. Intersection | For which movie did Chris Columbus receive credits as both director and writer and what was their composer, publication year, duration, and genre? |
3. Composition | Who choregraphed a work that was produced by the Royal Ballet? What was their date of birth, place of birth, occupation, and which awards did they receive? |
Skill | |
1. No skill | What filmmaker directed a movie written by Val Guest and what is their place of birth, date of birth, occupation, date of death, and place of death? |
2. Filtering numeric | Which film was directed and produced by Mel Brooks and what was their composer and duration? Filter the answer table for duration equal to or larger than 88 minutes. |
3. Filtering time | Which Italian footballer transferred to Pro Sesto in summer of 2020 and what is their date of birth, place of birth? Filter the answer table for date of birth equal to or after 1985. |
4. Filtering entity | What pieces of writing did Gregory Benford edit? What were their publication dates and publishers? Filter the answer table for publisher equal to Bantam Books. |
5. Date-to-year conversion | Who were the members of Black Sabbath and what was their year of birth, genre, instrument, and occupation? |
6. Quantity conversion | What work did Michael Mann write and direct and what was their publication date, duration in hours, genre, director of photography? |
7. Time calculation | Which governor of Connecticut died while in office? What was their place of birth, occupation, date of death, date of birth, political party and how many years did they live? |
8. Approximation | What are the townships in Harper County, Kansas and what is their population rounded to the nearest ten? |
3.4 Dataset Statistics and Analysis
Overall, the generated TANQ dataset has 1,395 entries: 36.1% simple, 40.9% intersection, and 22.9% composition questions. We further plan to release a TANQ training set with approximately 42k samples. Approximately 72.4% of dataset instances (i.e., 1,010) require at least one additional skill to answer the question. See Table 4 for a breakdown of skills and question types in TANQ. TANQ questions have an average length of 21 tokens and require extracting information about three relations on average. On average, TANQ answer tables have 6.7 rows and 4 columns. See Table 3 for the WD entity types used for extending QAMPARI questions.
Types of Wikidata entities we use in the first pipeline step (see Figure 2) to extend QAMPARI and generate TANQ questions. We mostly use Wikidata (WD) items.
Type | # | % | Type | # | %
---|---|---|---|---|---
WD Item | 57.6k | 79.1 | Numeric | 3.2k | 4.4 |
Time | 11.9k | 16.4 | Text | 46 | 0.1 |
Question types and skills in TANQ. Simple denotes questions which require neither composition nor intersection. See Table 2 for exemplary questions.
Question Type | Count | Freq (%)
---|---|---
1. Simple | 428 | 39.9 |
2. Composition | 164 | 15.3 |
3. Intersection | 482 | 44.9 |
Skill | ||
1. No skill | 160 | 14.9 |
2. Filtering numeric | 66 | 6.1 |
3. Filtering time | 190 | 17.7 |
4. Filtering entity | 268 | 25.0 |
5. Date-to-year conversion | 230 | 21.4 |
6. Quantity conversion | 64 | 6.0 |
7. Time calculation | 23 | 2.1 |
8. Approximation | 73 | 6.8 |
Skills per question | ||
1. No skill | 300 | 27.9 |
2. One skill | 657 | 61.2 |
3. Two skills | 94 | 8.8 |
4. Three skills | 23 | 2.1 |
3.5 Manual Evaluation of the Pipeline
We manually evaluated 100 TANQ samples to assess whether noise is introduced through the dataset’s automated generation pipeline. The focus of the analysis was twofold: first, examining the initial questions sourced from the QAMPARI dataset and, second, evaluating the generation of TANQ questions over the different pipeline steps. We identified five types of issues, which are discussed below.
Propagation of QAMPARI Issues.
While occurring only for a small subset of questions, one observation was that some errors present in the initial QAMPARI questions were propagated through the pipeline without being corrected. In 2 out of the 100 samples, the original QAMPARI questions contained issues; one was carried into TANQ while the other was fixed. One example: “Robert Benton is the screenwriter and director of which software?” This question confuses a film with software, as we see by considering the requested attributes, i.e., screenwriter and director.
While such logical errors were carried forward into TANQ, grammatical errors, for example, were fixed during rephrasing with LLMs: “Black Sabbath had who as a member?” is an awkward phrasing that was later improved to “Who were the members of Black Sabbath [...]?”.
Partially Revealed Answers.
In 3 out of 100 evaluated questions, the question itself partially contained the answer, rendering the query less meaningful: “Eon Productions and Harry Saltzman produced what series and what was their genre, director of photography, duration in second, producer, country of origin, composer, publication date, narrative location, screenwriter, and director?”—in this case, the “producer” is already mentioned in the question.
Grammar and Expression Errors.
Minor grammatical issues introduced through the TANQ pipeline were found in 2 out of the 100 samples. In one case, the issue was fixed during the rephrasing process, while the other remained unresolved. For example, “+89 minute” was rephrased to “89 minutes” in the sentence, “Which film was directed and produced by Mel Brooks and what was their composer, genre, duration? Filter the answer table for duration equal to or larger than +89 minute.” However, “larger than 89 minutes” remained, although “longer than 89 minutes” would be more appropriate in this sentence.
Ordering of Relations.
While our evaluation metric is not sensitive to the ordering of columns within the answer table, the order of relations appeared confusing for some questions. This occurred for 7 out of the evaluated 100 examples, as shown here: “What filmmaker directed a movie written by Val Guest and what was their date of birth, place of death, date of death, and place of birth?”—the order of personal attributes (i.e., date and place of birth, date and place of death) seems unnatural, as date of birth would naturally be followed by place of birth.
Ambiguous or Unclear Relations.
In 4 out of the 100 samples, certain relations were ambiguous or unclear, causing confusion about what exactly the question was asking for, such as in “Who directed the movie which had the screenwriting done by Marc Norman and what was their date of birth, language spoken, written or signed, place of birth, sex or gender, country of citizenship, occupation?”—here, the attribute “language spoken, written or signed” is not clearly related to the director in question.
While we observed five different types of issues during the manual evaluation of the TANQ pipeline, each occurred only in a small number of the 100 evaluated samples. Thus, while these errors are present, they are not widespread enough to strongly impact the overall quality and utility of the TANQ dataset.
4 Baselines & Evaluation
We evaluate TANQ in (i) closed book, (ii) oracle, and (iii) open book setups. We give an overview of the different approaches in Figure 3.
Prompts for baseline evaluation on TANQ: closed book, oracle, and open book with augmented tools.
Closed Book.
The closed book setup evaluates LLMs’ capabilities in extracting relevant information acquired during training to answer TANQ questions. For all experiments, we use the PaLM-2 Unicorn model (Anil et al., 2023b), GPT4o (Brown et al., 2020), Gemini Pro and Flash (Anil et al., 2023a), as well as Gemma 9B (Rivière et al., 2024). We evaluate all baselines in a few-shot setting with three examples provided in the prompt (see top left prompt in Figure 3).
Oracle.
In the oracle setup, we provide models with the source attribution for every cell of the answer table in the form of oracle documents (i.e., text, tables, and infoboxes). Hence, the prompt consists of the TANQ question and multiple evidence sentences, infobox entries, and/or tables (see bottom left prompt in Figure 3). To provide further context, we include the Wikipedia page title (for text evidence also sub-/section titles) where the evidence was extracted. Moreover, we add randomly selected oracle documents from other questions with similar attributes. This makes the oracle setting more challenging and requires models to first filter the correct evidence from all provided documents. We exclude models with a context length of less than 4k from the oracle evaluation, since the evidence samples require a longer context to be fully included; otherwise, performance may decrease due to the input being cut off.
Open Book.
For the open book baseline, we extend the PaLM-2 Unicorn model with external tools for search and calculation. The Wikipedia-based search tools aim to mimic the human search approach on Wikipedia, similar to Yao et al. (2023). The agent model can access Wikipedia information through three tools: (1) WikiSearch(keywords): a keyword-based Wikipedia search that returns the Wikipedia article most relevant to the provided keywords; (2) FindEvidence(article, keywords): an article-specific search that returns matched sentences, tables, and infobox entries given keywords and a Wikipedia article; (3) GetIntro(article): returns the introduction section of a Wikipedia article. For calculations, we provide the model with a Python engine as a fourth external tool, (4) Python(calculation). We design the agent model to decompose and solve the TANQ task in multiple sub-tasks, consisting of (i) retrieving requested entities in an iterative manner (e.g., movies in Figure 1), (ii) searching for entity-related information (e.g., release year), (iii) post-processing information (filtering, calculation, etc.), and finally (iv) aggregating the information in the form of a table. For some sub-tasks (e.g., entity retrieval), a separate sub-agent augmented with the required tools (e.g., WikiSearch) is spawned.
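A condensed sketch of such a tool loop in ReAct style; the tool names follow the description above, but the step format, the `call_llm` callable, and the tool callables are placeholders rather than the actual agent implementation.

```python
def run_agent(question: str, call_llm, tools: dict, max_steps: int = 20) -> str:
    """Minimal ReAct-style loop: the LLM alternates thoughts, tool calls, and a final table.

    `tools` maps names (WikiSearch, FindEvidence, GetIntro, Python) to callables;
    `call_llm` returns the next model step given the transcript so far.
    """
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(transcript)  # e.g., "Action: WikiSearch[Kamal Haasan]"
        transcript += step + "\n"
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()  # the answer table
        if step.startswith("Action:"):
            name, _, argument = step.removeprefix("Action:").strip().partition("[")
            observation = tools[name.strip()](argument.rstrip("]"))
            transcript += f"Observation: {observation}\n"
    return ""  # no table produced within the step budget
```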
Human Baseline.
We compare model performance against human performance based on 100 answer tables generated by annotators. We provided the annotators the same input prompt as the models in the oracle setting (i.e., TANQ questions and evidence for each answer table cell).
Evaluation Metrics.
To evaluate answer tables, we adopted a version of the relative mapping similarity (RMS) metric introduced by Liu et al. (2023). RMS views tables as unordered collections of mappings from row/column headers to values. Hence, the metric is invariant to transpositions and column/row permutations. It tolerates small differences between the keys/values of target and reference tables using the Normalized Levenshtein Distance (Biten et al., 2019). The metric returns both precision and recall scores.
To evaluate the generated answer tables, we first converted table entries into a list of triplets. Each triplet consists of (i) the entity name (given in the first column of the answer table), (ii) the relation name (i.e., the column name), and (iii) the content of a table cell. If a table cell contains multiple values, such as in column Genre in Table 7, the cell content is split into multiple triplets with the same entity name and relation name but different values extracted from the cell. For example, given Table 7, the resulting triplet list contains multiple triplets for genre: {(Evita, duration, 129 min), (Evita, genre, biographical film), (Evita, genre, musical), …}.
After generating triplet lists for the gold table and the model-generated answer table, we calculate the similarity between them using relative distance for numbers and Normalized Levenshtein Distance for text across each field in the triplet. Each triplet in the target is greedily matched with its closest triplet in the prediction, and we then compute weighted precision and recall in the same manner as Liu et al. (2023). The evaluation code will be released.
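A simplified sketch of this triplet-based scoring; unlike the released evaluation code, it uses plain string similarity for all fields (no special relative-distance handling for numbers) and matches each triplet to its closest counterpart independently rather than one-to-one.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def similarity(x: str, y: str) -> float:
    """1 minus normalized Levenshtein distance."""
    return 1.0 if not x and not y else 1.0 - levenshtein(x, y) / max(len(x), len(y))

def table_to_triplets(header, rows):
    """(entity, relation, value) triplets; multi-value cells are split on commas."""
    return [
        (row[0], relation, value.strip())
        for row in rows
        for relation, cell in zip(header[1:], row[1:])
        for value in cell.split(",") if value.strip()
    ]

def precision_recall(pred_triplets, gold_triplets):
    """Score each triplet against its closest counterpart on the other side."""
    def score(t, u):
        return sum(similarity(a, b) for a, b in zip(t, u)) / 3
    recall = sum(max((score(g, p) for p in pred_triplets), default=0.0) for g in gold_triplets)
    precision = sum(max((score(p, g) for g in gold_triplets), default=0.0) for p in pred_triplets)
    return (
        precision / len(pred_triplets) if pred_triplets else 0.0,
        recall / len(gold_triplets) if gold_triplets else 0.0,
    )
```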
5 Results & Discussion
In this section, we address key research questions: (1) Is TANQ a challenging dataset for state-of-the-art models? (2) How do models perform in a closed book setting compared to using external context (oracle)? (3) How effective are tool-augmented models on TANQ? (4) What challenges arise from different TANQ specifications—question types, reasoning skills, etc.? (5) What are the common failure cases of the evaluated models?
5.1 Question Types
In Table 5, we compare the performance of all oracle, closed, and open book baselines. In the oracle setting, we find Gemini Flash consistently outperforming other models with an overall F1 score of 60.7, followed by GPT4o. However, a considerable gap to the human baseline of 73.0 remains. Both models experience a significant drop in performance in the closed book setting, falling behind the smaller, best-performing PaLM-2 model. Notably, PaLM-2 achieves higher recall scores, indicating that while the other models can generate some rows of the answer tables, they are slightly less accurate than PaLM-2, and their resulting tables are shorter, containing fewer entities. In contrast to Gemini Flash/Pro and Gemma, GPT4o and PaLM-2 struggle with more complex question types, i.e., composition and intersection questions. Moreover, the 9B-sized Gemma model demonstrates performance comparable to much larger models in closed book evaluation.
Baseline performance by question type. For all question types, we observe Gemini Flash (60.7 F1) and PaLM-2 (47.6 F1) to outperform other baselines in oracle and closed book setting respectively, lagging 12.3 and 25.4 points behind the human baseline of 73.0.
Model | simple P | simple R | simple F1 | composition P | composition R | composition F1 | intersection P | intersection R | intersection F1 | all P | all R | all F1
---|---|---|---|---|---|---|---|---|---|---|---|---
Oracle setting | ||||||||||||
Gemini Pro | 41.5 | 38.1 | 37.6 | 39.8 | 36.7 | 36.6 | 42.4 | 40.6 | 40.0 | 41.5 | 38.9 | 38.4 |
Gemini Flash | 66.4 | 59.0 | 60.4 | 67.4 | 58.4 | 60.4 | 66.8 | 59.3 | 61.1 | 66.8 | 59.0 | 60.7 |
GPT4o | 58.2 | 47.0 | 49.1 | 50.7 | 45.8 | 44.5 | 53.1 | 42.9 | 44.2 | 54.3 | 44.9 | 45.9 |
Human | 81.2 | 74.0 | 73.0 | |||||||||
Closed book setting | ||||||||||||
PaLM-2 | 55.5 | 48.1 | 49.6 | 50.6 | 45.7 | 46.6 | 52.1 | 45.5 | 46.6 | 52.9 | 46.4 | 47.6 |
Gemma | 38.9 | 26.7 | 27.9 | 37.4 | 25.8 | 26.6 | 39.3 | 26.2 | 27.7 | 38.8 | 26.3 | 27.5 |
Gemini Pro | 46.9 | 30.3 | 31.1 | 43.8 | 29.5 | 30.4 | 47.8 | 32.0 | 33.5 | 46.6 | 30.9 | 32.0 |
Gemini Flash | 37.4 | 29.0 | 31.3 | 36.2 | 27.2 | 29.6 | 36.6 | 28.8 | 30.7 | 36.8 | 28.5 | 30.7 |
GPT4o | 46.8 | 26.6 | 29.1 | 28.1 | 16.0 | 18.1 | 35.9 | 22.3 | 24.4 | 37.9 | 22.4 | 24.6 |
Open book setting | ||||||||||||
Tool LM | 24.2 | 20.3 | 18.4 | 43.6 | 33.0 | 36.8 | 50.8 | 44.2 | 46.6 | 39.7 | 33.4 | 34.5 |
5.2 Evaluation with Diverse Prompts
In addition to the evaluations discussed in Section 5, we further assessed the baselines to examine the impact of (i) instruction tuning and (ii) demonstration selection on the models’ performance on TANQ. Table 15 compares the performance of baselines across different numbers of demonstrations in the input prompt (i.e., 1-, 3-, and 5-shot), while Table 16 provides an overview of how different instruction styles affect the performance of Gemini Flash.
For almost all baselines, performance improves slightly when the number of input examples is increased from 1 to 3 in the prompt. However, no further improvements are observed with an increase to five examples, supporting our decision to evaluate models in a 3-shot setting. Except for Gemini Pro and GPT4o, manually selecting input examples does not yield enhanced performance.
When comparing the performance of the top-performing oracle baseline, Gemini Flash, across different instruction styles, we find that the detailed (B), step-by-step (C), and simple (D) instruction styles yield similar performance (see Figure 6). In contrast, using an empty instruction (A)—where we provide only input examples without additional details—naturally results in decreased performance (see Table 16). For our evaluation in Section 5, we used the simple instruction style, as the additional details in instructions B and C did not yield observable performance improvements for the evaluated models.
5.3 Reasoning Skills
Table 6 presents the performance of models on more challenging questions requiring further skills such as filtering and time conversion.
Baseline performance by skills required to answer the question: Filtering with numerical/datetime/entity conditions, date-to-year (Date2Year), quantity conversion, time calculation, and approximation. In the oracle setting Gemini Flash performs best across all skills, while PaLM-2 performs better in the closed book evaluation.
Model | Filter num P | Filter num R | Filter num F1 | Filter time P | Filter time R | Filter time F1 | Filter entity P | Filter entity R | Filter entity F1 | Date2Year P | Date2Year R | Date2Year F1 | Quantity conv P | Quantity conv R | Quantity conv F1 | Time calc P | Time calc R | Time calc F1 | Approx P | Approx R | Approx F1
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Oracle setting | |||||||||||||||||||||
Gemini Pro | 32.8 | 28.9 | 29.0 | 40.7 | 38.6 | 38.1 | 42.7 | 39.9 | 39.0 | 39.5 | 36.2 | 36.3 | 43.5 | 38.1 | 39.0 | 40.8 | 41.5 | 39.3 | 48.8 | 45.4 | 45.2 |
Gemini Flash | 60.6 | 53.9 | 55.3 | 68.6 | 60.8 | 62.6 | 68.3 | 59.4 | 61.6 | 66.0 | 58.5 | 60.1 | 71.6 | 60.8 | 63.5 | 68.1 | 58.2 | 59.6 | 69.3 | 65.2 | 65.5 |
GPT4o | 56.0 | 47.0 | 45.8 | 53.6 | 45.7 | 47.7 | 55.9 | 46.0 | 47.2 | 42.1 | 35.5 | 35.3 | 47.5 | 41.1 | 43.2 | 81.6 | 65.8 | 71.8 | 34.2 | 35.3 | 34.3 |
Closed book setting | |||||||||||||||||||||
PaLM-2 | 50.5 | 46.3 | 46.9 | 53.9 | 47.1 | 48.0 | 54.9 | 48.5 | 49.7 | 51.4 | 44.7 | 46.1 | 51.5 | 45.2 | 46.8 | 53.1 | 48.3 | 50.1 | 52.3 | 43.4 | 45.7 |
Gemma | 37.4 | 26.5 | 27.8 | 37.1 | 26.4 | 27.3 | 40.3 | 26.2 | 27.6 | 40.4 | 27.4 | 28.7 | 37.4 | 23.8 | 25.2 | 39.2 | 27.4 | 29.3 | 40.8 | 27.5 | 29.6 |
Gemini Pro | 50.2 | 28.3 | 29.3 | 48.5 | 31.4 | 32.4 | 46.3 | 31.2 | 32.4 | 46.1 | 30.6 | 31.9 | 48.0 | 31.4 | 33.5 | 51.4 | 33.9 | 35.3 | 42.8 | 30.2 | 31.4 |
Gemini Flash | 38.7 | 29.5 | 32.1 | 35.8 | 26.7 | 29.1 | 36.2 | 27.7 | 29.9 | 35.9 | 27.8 | 29.7 | 36.6 | 28.0 | 29.1 | 34.8 | 27.5 | 30.1 | 37.1 | 27.1 | 30.0 |
GPT4o | 30.7 | 21.8 | 24.5 | 37.8 | 24.0 | 24.9 | 46.5 | 25.6 | 28.7 | 31.2 | 18.9 | 20.6 | 25.5 | 12.7 | 16.1 | 48.3 | 25.3 | 26.6 | 24.0 | 17.6 | 19.6 |
Open book setting | |||||||||||||||||||||
Tool LM | 48.7 | 39.1 | 43.1 | 36.1 | 29.7 | 30.3 | 37.6 | 33.0 | 34.2 | 53.1 | 37.8 | 41.9 | 33.8 | 28.4 | 30.2 | 25.8 | 18.2 | 21.1 | 48.3 | 44.0 | 45.9 |
Gemini.
Overall, questions requiring filtering with numerical conditions pose the biggest challenge for Gemini models, resulting in a performance drop compared to other skills. For example, Gemini Flash’s performance drops from 60.7 F1 (overall) to 55.3 in oracle evaluation.
PaLM-2.
For PaLM-2, we observe that the model struggles particularly with questions requiring numeracy, i.e., approximation of numerical values, quantity conversion, calculations based on datetime attributes, and filtering with numerical conditions. This challenge does not persist in the open book baseline where PaLM-2 is augmented with a calculator tool. Despite these limitations, the PaLM-2 model outperforms all other baselines across all skills in the closed book setting.
GPT4o.
For GPT-4o, no clear patterns are observed regarding its limitations in specific reasoning skills. In both the oracle and closed book scenarios, the model struggles with date-to-year conversions and the approximation of numerical values. However, it outperforms other closed book baselines in time calculations.
Number of Skills.
Table 8 shows baseline performance by the number of skills required to generate the correct answer table. Across baselines, we observe stable performance as the number of skills increases and questions become more complex. The exception is GPT4o, which shows a significant performance decrease as the number of skills increases in the closed book setting, e.g., from 26.5 F1 (one skill) to 17.6 (two skills).
Baseline performance by number of skills required to successfully answer the TANQ questions.
Model | One skill P | One skill R | One skill F1 | Two skills P | Two skills R | Two skills F1 | Three skills P | Three skills R | Three skills F1
---|---|---|---|---|---|---|---|---|---
Oracle setting | |||||||||
Gemini Pro | 42.3 | 39.9 | 39.3 | 36.3 | 32.4 | 32.4 | 46.5 | 41.0 | 41.3 |
Gemini Flash | 66.8 | 58.6 | 60.6 | 67.4 | 59.4 | 60.8 | 74.5 | 68.8 | 70.1 |
GPT4o | 51.5 | 42.4 | 43.3 | 50.7 | 46.4 | 47.6 | 51.0 | 39.7 | 44.7 |
Closed book setting | |||||||||
PaLM-2 | 52.8 | 46.2 | 47.4 | 52.0 | 46.9 | 47.9 | 57.0 | 47.1 | 50.5 |
Gemma | 37.9 | 25.8 | 26.9 | 42.4 | 28.9 | 30.3 | 42.5 | 26.2 | 29.5 |
Gemini Pro | 46.9 | 30.4 | 31.5 | 46.4 | 32.8 | 34.4 | 50.1 | 30.2 | 31.0 |
Gemini Flash | 35.6 | 27.8 | 29.9 | 37.4 | 28.3 | 30.6 | 39.5 | 23.2 | 25.9 |
GPT4o | 39.4 | 24.7 | 26.5 | 32.5 | 14.5 | 17.6 | 9.4 | 1.8 | 3.0 |
Open book setting | |||||||||
Tool LM | 40.7 | 33.5 | 35.0 | 46.4 | 35.6 | 39.8 | − | − | − |
5.4 Relations in TANQ Questions
Table 9 shows baselines’ performance across the number of relations in the given question. For example, in Figure 1 two relations, i.e., composer and release year, are requested for the movies specified in the question. While multiple models show a performance decrease with an increasing number of relations, this gap is particularly significant for the tool-augmented model, where the F1 score decreases by almost twelve points when comparing questions with one relation to those with ten. Moreover, for most baselines (oracle/closed/open book) we observe only small performance drops when comparing questions with one relation to those with three—indicating that, up to a certain limit, models can successfully retrieve information for an increasing number of relations. However, the performance gaps increase for GPT4o (closed book) and the tool-augmented model as the number of relations increases further, i.e., from 3 to 5 and from 5 to 10.
Baseline performance by number of relations in TANQ questions (i.e., corresponds to number of columns in the answer table). Most baselines demonstrate a stable performance as relations increase.
Model | One Relation P | One Relation R | One Relation F1 | Three Relations P | Three Relations R | Three Relations F1 | Five Relations P | Five Relations R | Five Relations F1 | Ten Relations P | Ten Relations R | Ten Relations F1
---|---|---|---|---|---|---|---|---|---|---|---|---
Oracle setting | ||||||||||||
Gemini Pro | 41.6 | 39.2 | 38.6 | 42.2 | 40.7 | 39.5 | 38.5 | 36.2 | 35.0 | 35.8 | 36.0 | 35.8 |
Gemini Flash | 67.0 | 59.5 | 61.1 | 68.5 | 59.2 | 61.8 | 64.7 | 58.6 | 59.5 | 64.7 | 58.4 | 60.5 |
GPT4o | 55.1 | 44.7 | 45.8 | 63.2 | 53.6 | 53.5 | 74.9 | 43.3 | 50.2 | 23.8 | 19.4 | 20.5 |
Closed book setting | ||||||||||||
PaLM-2 | 53.7 | 47.4 | 48.4 | 51.8 | 45.2 | 46.0 | 54.2 | 47.7 | 48.8 | 62.4 | 47.7 | 48.9 |
Gemma | 38.5 | 26.3 | 27.6 | 35.6 | 25.6 | 26.5 | 47.6 | 30.5 | 31.1 | 42.0 | 35.8 | 37.8 |
Gemini Pro | 45.6 | 32.6 | 33.5 | 47.7 | 30.8 | 31.4 | 51.1 | 29.4 | 29.9 | 43.9 | 39.2 | 40.9 |
Gemini Flash | 36.6 | 28.6 | 30.7 | 37.7 | 28.4 | 31.0 | 35.3 | 26.9 | 28.6 | 36.1 | 26.1 | 29.4 |
GPT4o | 37.5 | 21.9 | 24.4 | 33.6 | 25.1 | 27.8 | 55.6 | 15.4 | 18.2 | 34.3 | 21.3 | 23.2 |
Open book setting | ||||||||||||
Tool LM | 31.6 | 32.0 | 30.2 | 42.4 | 33.7 | 36.7 | 34.0 | 32.9 | 33.0 | 18.4 | 17.8 | 18.1 |
5.5 Answer Table Length
Table 10 gives an overview of model performance across different answer table lengths: short answer tables (up to 3 rows), medium tables (up to 6 rows), and long tables (7 or more rows). Models that perform well on longer tables are GPT-4o, Gemini Flash, and PaLM-2. As table size increases, so does the number of rows the models can correctly retrieve, resulting in higher recall scores, while precision slightly drops for most oracle baseline models. Models appear to struggle with extracting correct information from the growing list of provided oracle documents as the table size increases. This is naturally not the case for the closed book evaluation, as no oracle documents are given as input.
Baseline performance by length of answer tables. Short tables have up to 3 rows, medium tables up to 6 rows, and long tables 7 or more rows. Most baselines show little performance variation across table sizes.
Model | Short P | Short R | Short F1 | Medium P | Medium R | Medium F1 | Long P | Long R | Long F1
---|---|---|---|---|---|---|---|---|---
Oracle setting | |||||||||
Gemini Pro | 47.0 | 41.2 | 39.9 | 42.3 | 38.1 | 37.8 | 41.1 | 39.3 | 38.8 |
Gemini Flash | 68.6 | 56.6 | 58.3 | 66.2 | 56.0 | 58.0 | 67.2 | 60.6 | 62.2 |
GPT4o | 62.9 | 31.6 | 36.3 | 53.1 | 37.4 | 39.5 | 55.0 | 49.7 | 50.0 |
Closed book setting | |||||||||
PaLM-2 | 56.4 | 43.9 | 43.7 | 54.5 | 45.0 | 45.9 | 51.9 | 47.3 | 48.5 |
Gemma | 35.8 | 25.8 | 25.6 | 37.6 | 25.1 | 25.9 | 39.4 | 27.0 | 28.4 |
Gemini Pro | 47.8 | 29.6 | 29.7 | 45.9 | 31.5 | 32.7 | 47.0 | 30.5 | 31.6 |
Gemini Flash | 32.9 | 25.7 | 25.6 | 36.3 | 27.9 | 29.4 | 37.1 | 28.9 | 31.4 |
GPT4o | 30.6 | 13.0 | 14.9 | 35.3 | 17.8 | 20.2 | 39.6 | 25.4 | 27.5 |
Open book setting | |||||||||
Tool LM | 23.2 | 29.2 | 23.0 | 33.3 | 31.7 | 30.9 | 43.3 | 34.3 | 36.5 |
5.6 Failures Types
We further study common failure cases of the oracle baselines (see Table 11). The aim is to understand challenges related to TANQ when the model has access to the necessary information in the form of oracle documents.
Common failure types: relations missing in answer table, no column header, filter issues (e.g., filter condition ignored), hallucinated relations, other types of hallucinations, missing entity (i.e., row), output not in table format, partial answers in one or multiple table cells, wrong answer given in one or multiple cells. Scores given as % of all annotated samples.
Failure Type | Gemini | PaLM-2 | GPT4o | Tool LM
---|---|---|---|---
Rel missing | 37.3 | 33.3 | 0 | 3.9 |
No header | 27.5 | 3.9 | 0 | 5.9 |
Filter issues | 15.7 | 15.8 | 4.0 | 5.9 |
Halluc. relations | 5.9 | 9.8 | 5.9 | 2.0 |
Halluc. other | 0 | 15.7 | 0 | 7.8 |
Missing entity | 80.4 | 58.8 | 27.5 | 78.4 |
Non-table | 31.4 | 21.6 | 0 | 15.7 |
Partial answers | 3.9 | 23.5 | 2.0 | 0 |
Wrong answer | 14.9 | 7.8 | 9.8 | 13.7 |
We manually evaluate a subset of model predictions to identify common failure cases, categorized as follows: 1.) Relation missing: At least one expected relation (column) is missing; 2.) No table header generated; 3.) Filtering: Conditions (attribute, numeric, dates) ignored or incorrectly applied; 4.) Hallucinated relations: Unrequested columns are added; 5.) Other hallucinations; 6.) Missing entity: Expected rows are missing; 7.) Non-table answers: Text or non-table format returned; 8.) Partial answers: Incomplete information for a given entity or attribute; 9.) Wrong answer (cell): Incorrect cell content.
Comparing open book and oracle baselines, we find that the model with the fewest failures is GPT4o (oracle). The only significant failure category we identify for GPT4o is missing entities in generated answer tables. The most prevalent issues we observe for Gemini Pro (oracle) are missing entities, missing relations (resulting in answer tables with only a subset of columns), non-table outputs, and missing headers in generated tables. Similarly, multiple aspects pose a challenge for PaLM-2 (oracle), including hallucinations, missing entities/relations, and applying filtering conditions correctly. For the tool-augmented model, missing entities are a frequent issue.
5.7 Detailed Evaluation of the Open Book Model
It is noteworthy that the open book, ReAct-based baseline performed worse than the other baselines, including the closed book ones. To better understand the limitations of the open book baseline, we manually evaluated the reasoning chains generated by the model for 50 TANQ samples. These chains outline the verbal reasoning processes and actions taken by the model in an interleaved manner. Our evaluation identified six different areas where the open book model struggles.
Retrieval of a Complete Entity List.
The most prevalent issue, found in 19 out of 50 samples, was that the answer list did not contain all entities of the target table. Our review of the ReAct chain logs showed that the model often terminated the search for further entities after finding a list of entities. For example, it might start searching in the introduction section of a Wikipedia article and identify some movies directed by a particular director, but then fail to continue searching the rest of the page, missing additional films.
Generation of Correctly Formatted Answer Tables.
We also found issues related to answer table formatting in 7 out of 50 samples. For instance, in one case, the name column was incorrectly replicated into a table row (i.e., “Building” below in Table 12).
Retrieval of Attributes.
In 6 out of 50 samples, the model retrieved entities but failed to find attributes for all entities, resulting in partly empty rows (see Table 14).
Answering Intersection Questions.
A similar issue arises with intersection questions, which ask for entities that fulfill multiple conditions. For example, “What science fiction films were both written and directed by Steven Spielberg? What is their genre, publication date, and duration?” The open book model often produced tables that fulfilled only one of these conditions, such as listing films written by Steven Spielberg without considering if he also directed them. This problem occurred in 5 out of the 50 evaluated examples.
Answering Composition Questions.
As the name suggests, composition questions require the model to first retrieve a list of intermediate entities, which are then used to compile the final entity list needed for the answer table. For example, consider the question: “Who directed the film that Jules Feiffer wrote? What is their occupation, date of death, date of birth, country of citizenship, place of birth, and award received?” The model should first retrieve the films as intermediate entities to identify the requested directors. However, in four out of the fifty evaluated samples, the model generated answer tables that included intermediate entities only instead of the requested entities. For instance, the answer table for the above question looked as demonstrated in Table 13. The first row correctly shows the directors, but the second and third rows contain the names of movies written by Jules Feiffer rather than their directors.
Example table for issues related to composition questions in the open book baseline. While the question asks for directors, the second and third rows contain movies.
| Name | Occupation | Date of Death | Date of Birth | Citizenship | Place of Birth | Award Received |
|---|---|---|---|---|---|---|
| Mike Nichols | director, producer | November 19, 2014 | November 6, 1931 | United States | Berlin, Germany | Academy Award [...] |
| Popeye | Sailor | January 17, 1929 | | | | |
| Munro | | | | | | |
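As an illustration of the intended two-hop procedure for composition questions (not the open book model's actual behaviour), the sketch below first collects the intermediate entities, i.e., the films written by the given author, and then maps them to the requested entities, i.e., their directors. The small look-up tables are hand-written stand-ins for retrieval results.

```python
# Hand-written stand-ins for retrieval results (hop 1: writer -> films,
# hop 2: film -> director); not data extracted from TANQ or Wikidata.
wrote = {
    "Jules Feiffer": ["Carnal Knowledge", "Popeye", "Little Murders"],
}
directed_by = {
    "Carnal Knowledge": "Mike Nichols",
    "Popeye": "Robert Altman",
    "Little Murders": "Alan Arkin",
}

def directors_of_films_written_by(writer: str) -> list[str]:
    films = wrote.get(writer, [])  # intermediate entities
    # The failure described above corresponds to emitting `films` directly
    # instead of mapping them to the requested entities below.
    return [directed_by[f] for f in films if f in directed_by]

print(directors_of_films_written_by("Jules Feiffer"))
# ['Mike Nichols', 'Robert Altman', 'Alan Arkin']
```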
Example table for issues related to missing attributes in answer tables. For the last row no relation other than director could be retrieved.
| | Director | Producer | Dir. of Photography | Composer | Genre | Publication |
|---|---|---|---|---|---|---|
| The Life and Death of Colonel Blimp | Michael Powell | E. Pressburger | Georges Périnal | Allan Gray | Drama, War | 1943 |
| A Canterbury Tale | Michael Powell | E. Pressburger | Erwin Hillier | Allan Gray | Drama, Romance | 1944 |
| A Matter of Life and Death | | | Jack Cardiff | Allan Gray | Fantasy, War | 1946 |
| The Tales of Hoffmann | Jacques Offenbach | | | | | |
Evaluation results of models with different prompt settings. For most baselines the 3-shot setting yields the best results.
| Model | One shot P | One shot R | One shot F1 | Three random shots P | Three random shots R | Three random shots F1 | Three manual shots P | Three manual shots R | Three manual shots F1 | Five shots P | Five shots R | Five shots F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Oracle setting | | | | | | | | | | | | |
| Gemini Pro | 41.8 | 38.2 | 37.8 | 41.5 | 38.9 | 38.4 | 50.8 | 40.0 | 39.0 | 41.8 | 38.2 | 37.8 |
| Gemini Flash | 54.7 | 56.8 | 50.8 | 66.8 | 59.0 | 60.7 | 67.1 | 57.7 | 59.4 | 54.7 | 58.2 | 53.3 |
| GPT4o | 55.4 | 44.9 | 46.6 | 54.3 | 44.9 | 45.9 | 39.9 | 34.8 | 35.4 | 48.8 | 40.1 | 40.7 |
| Closed book setting | | | | | | | | | | | | |
| PaLM-2 | 46.4 | 42.9 | 43.2 | 52.9 | 46.4 | 47.6 | 52.3 | 45.8 | 47.0 | 52.1 | 45.8 | 47.0 |
| Gemma | 28.0 | 18.2 | 19.2 | 38.8 | 26.3 | 27.5 | 40.9 | 30.6 | 31.8 | 38.5 | 33.1 | 34.6 |
| Gemini Pro | 37.9 | 33.6 | 34.3 | 46.6 | 30.9 | 32.0 | 44.9 | 26.8 | 30.5 | 39.2 | 28.7 | 29.9 |
| Gemini Flash | 21.3 | 25.9 | 18.6 | 36.8 | 28.5 | 30.7 | 33.5 | 30.2 | 30.1 | 29.9 | 28.6 | 23.9 |
| GPT4o | 37.2 | 23.6 | 26.3 | 37.9 | 22.4 | 24.6 | 66.1 | 15.7 | 16.2 | 38.1 | 18.9 | 20.4 |
Evaluation results of the three-random-shots prompt with different instruction styles on Gemini Flash. We observe a clear difference between prompting with and without an instruction, but almost identical results across the different instruction styles. Figure 7 shows the text of the different instruction variants.
| Setting | Empty instruction P | Empty instruction R | Empty instruction F1 | Detailed instruction P | Detailed instruction R | Detailed instruction F1 | Step-by-step instruction P | Step-by-step instruction R | Step-by-step instruction F1 | Simple instruction P | Simple instruction R | Simple instruction F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Oracle setting | 61.8 | 55.4 | 56.4 | 66.8 | 60.2 | 61.8 | 67.2 | 60.3 | 61.6 | 66.8 | 59.0 | 60.7 |
| Closed book setting | 34.2 | 25.7 | 28.5 | 37.5 | 27.0 | 29.6 | 36.1 | 29.1 | 31.2 | 36.8 | 28.5 | 30.7 |
Applying Filters to the Generated Table.
Finally, the model commonly ignored filtering conditions specified in the input question. We observed this in 10 out of 50 samples. For example, in the question: “Who was a mayor of Saint Paul, MN, and what was their occupation, position held, and date of birth? Filter the answer table for date of birth equal to or before 1829,” the resulting table included mayors born after 1829, failing to apply the requested filter.
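The expected filtering step itself is simple to express; the sketch below applies the date-of-birth condition from the example question to a few hypothetical rows (the names and dates are placeholders, not answers from TANQ).

```python
from datetime import date

# Hypothetical answer rows; names and dates are placeholders.
rows = [
    {"name": "Mayor A", "occupation": "lawyer",   "date_of_birth": date(1820, 5, 1)},
    {"name": "Mayor B", "occupation": "merchant", "date_of_birth": date(1845, 3, 12)},
]

# Keep only mayors born in or before 1829, as the question requests;
# the filter must be applied before the answer table is emitted.
filtered = [row for row in rows if row["date_of_birth"].year <= 1829]
print([row["name"] for row in filtered])  # ['Mayor A']
```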
6 Future Work
Our results on TANQ highlight the need for qualitative benchmarks, metrics, and modeling approaches to address current limitations.
We initially used the RMS metrics from Liu et al. (2023) to evaluate table generation. However, RMS was developed for the chart-to-table conversion task, which involves mostly numerical tables. In contrast, TANQ tables contain significant text content, leading to variations such as expressing “USA” as “United States.” Another difference is the more diverse table structure, where cells often contain lists of values. We adapted the RMS metrics for TANQ (see Section 5). Our experiments indicate that this version of RMS strikes a good balance of providing signal, being simple to explain and implement, and being fast to execute. However, further research on table evaluation is necessary to consider a broader range of formats.
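To illustrate the kind of relaxed, text-aware cell comparison this adaptation calls for (a sketch, not the exact implementation described in Section 5), one can normalize common aliases and treat list-valued cells as sets; the alias table below is an assumed example.

```python
# Sketch of relaxed cell matching: normalize aliases such as "USA" vs.
# "United States" and compare list-valued cells as sets. The alias map is an
# assumed example, not the normalization used in the paper.
ALIASES = {"usa": "united states", "uk": "united kingdom"}

def normalize(cell: str) -> set[str]:
    """Split a (possibly list-valued) cell and normalize each item."""
    items = [item.strip().lower() for item in cell.split(",")]
    return {ALIASES.get(item, item) for item in items if item}

def cells_match(predicted: str, target: str) -> bool:
    """Two cells match if their normalized item sets are identical."""
    return normalize(predicted) == normalize(target)

print(cells_match("USA", "United States"))              # True
print(cells_match("Drama, War", "war, drama"))          # True
print(cells_match("Michael Powell", "E. Pressburger"))  # False
```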
Interestingly, we found that smaller models like Gemma (9B parameters) performed surprisingly well. While Gemma was not among the top models for any specific skill or question type in the closed book setting, it did not fall far behind models that are significantly larger, such as PaLM-2, Gemini Pro, and GPT-4o. This suggests that, despite its smaller size, models like Gemma and Gemini Flash offer promising results, though there is still room for improvement. While the size of Gemini Flash is undisclosed, it is known to be smaller than Gemini Pro, further highlighting that scale alone does not guarantee better performance on complex reasoning tasks. This raises an important question for future research: How can we further enhance the performance of smaller models?
Additionally, while tool-based LMs are popular, our results suggest that they may not be the optimal choice in domains where up-to-date knowledge is not required, especially for answering simple questions. For more complex questions involving intersecting information, tool-augmented LMs perform comparably to other baselines, but for simpler questions, they lag behind both oracle and closed book baselines.
Numeracy remains a challenge for state-of-the-art models. Despite advancements in numerical reasoning with LLMs, many models still struggle with questions requiring simple numerical skills, such as filtering the content of a table based on a numerical condition. While progress has been made in recent years (Imani et al., 2023; Akhtar et al., 2023; Chen et al., 2022), challenges remain.
Another challenge is that all evaluated models tend to generate incomplete answer tables, often missing some entities (rows) and relations (columns) from the target table. Future work should focus on developing methods to address these limitations and generate answer tables that capture the entire requested information. Addressing these issues and improving table evaluation metrics are important areas for future research.
7 Conclusion
This paper introduces TANQ, the first open-domain question answering dataset where the answers require building tables from information across multiple sources. To create TANQ, we design and apply an automated dataset creation pipeline using large language models, with Wikipedia and Wikidata as knowledge sources. For each cell of the answer tables, we further release source attribution in the form of text, table, or infobox proofs. We evaluate state-of-the-art models on our dataset in three different settings: oracle (relevant documents provided), closed book, and open book. Our results and analysis suggest that TANQ is a complex task with many challenges ahead.
Notes
Dataset available at github.com/google-deepmind/tanq.
Prompts provided in the appendix (Figures J1 and J3).
See Figure J1 for the exact prompt we used.
See Figure J3 for the exact prompt we used.
References
A Modeling Details
For our experiments, we use the following GPT4o model available through the OpenAI API: gpt-4o-2024-08-06. The PaLM-2 model we used for the pipeline and evaluation purposes was the model variant with 340B parameters.
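For reference, the sketch below shows one way to query this model through the OpenAI Python client; the instruction text and sampling parameters are illustrative and do not reproduce the exact prompts from Figures J1 and J3.

```python
# Minimal sketch of querying gpt-4o-2024-08-06 via the OpenAI Python client.
# The system instruction and parameters are illustrative, not the paper's prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = (
    "What science fiction films were both written and directed by Steven "
    "Spielberg? What is their genre, publication date, and duration?"
)

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Answer with a markdown table only."},
        {"role": "user", "content": question},
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)
```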
B Prompt Templates
Figures 5 and 6 show the prompts used for evaluating the extracted evidence and rephrasing the TANQ questions to increase the naturalness of template-based questions. Figure 7 shows prompts for the four instruction variants we evaluated: empty, simple, detailed, and step-by-step.
Prompt used for evidence evaluation. We prompt a language model to evaluate the extracted evidence in a natural language inference setting. The LLM labels the input statements as “verifiable” or “not verifiable” based on evidence provided in form of a sentence, table, or infobox.
Rephrasing questions increases the naturalness of template-based extension questions. To ensure the question meaning is preserved during rephrasing, we add structured annotations in parentheses with the name of each relation.
Four instruction variants we evaluated: empty, simple, detailed, and step-by-step.
JSON schema of a TANQ dataset entry, outlining all components of a single entry.
Author notes
Equal contributions.
Work done during Google DeepMind internship.
Action Editor: Mark Steedman