Abstract
In the last few years, the natural language processing community has witnessed advances in neural representations of free text with transformer-based language models (LMs). Given the importance of knowledge available in tabular data, recent research efforts extend LMs by developing neural representations for structured data. In this article, we present a survey that analyzes these efforts. We first abstract the different systems according to a traditional machine learning pipeline in terms of training data, input representation, model training, and supported downstream tasks. For each aspect, we characterize and compare the proposed solutions. Finally, we discuss future work directions.
1 Introduction
Many researchers are studying how to represent tabular data with neural models for traditional and new natural language processing (NLP) and data management tasks. These models enable effective data-driven systems that go beyond the limits of traditional declarative specifications built around first-order logic and SQL. Examples of tasks include answering queries expressed in natural language (Katsogiannis-Meimarakis and Koutrika, 2021; Herzig et al., 2020; Liu et al., 2021a), performing fact-checking (Chen et al., 2020b; Yang and Zhu, 2021; Aly et al., 2021), doing semantic parsing (Yin et al., 2020; Yu et al., 2021), retrieving relevant tables (Pan et al., 2021; Kostić et al., 2021; Glass et al., 2021), understanding tables (Suhara et al., 2022; Du et al., 2021), and predicting table content (Deng et al., 2020; Iida et al., 2021). Indeed, tabular data contain an extensive amount of knowledge, necessary in a multitude of tasks, such as business (Chabot et al., 2021) and medical operations (Raghupathi and Raghupathi, 2014; Dash et al., 2019), hence the importance of developing table representations. Given the success of transformers in developing pre-trained language models (LMs) (Devlin et al., 2019; Liu et al., 2019), we focus our analysis on the extension of this architecture for producing representations of tabular data. Such architectures, based on the attention mechanism, have proven to be successful as well on visual (Dosovitskiy et al., 2020; Khan et al., 2021), audio (Gong et al., 2021), and time series data (Cholakov and Kolev, 2021). Indeed, pre-trained LMs are very versatile, as demonstrated by the large number of tasks that practitioners solve by using these models with fine-tuning, such as improving Arabic opinion and emotion mining (Antoun et al., 2020; Badaro et al., 2014, 2018a,b,c, 2019, 2020). Recent work shows that a similar pre-training strategy leads to successful results when language models are developed for tabular data.1
As depicted in Figure 1, this survey covers both (1) the transformer-based encoder for pre-training neural representations of tabular data and (2) the target models that use the resulting LM to address downstream tasks. For (1), the training data consist of a large corpus of tables. Once the representation for this corpus has been learned, it can be used in (2) for a target task on a given (unseen at pre-training) table, such as the population table, along with its context, namely, relevant text information such as the table header and captions. In most cases, the LM obtained from pre-training with large datasets in (1) is used in (2) by fine-tuning it with a labeled downstream dataset, for example, a table with cells annotated as the answer for the question in the context. Extensions to the typical transformer architecture are applied to account for the tabular structure, which is different and richer in some aspects than traditional free text.
While all proposed solutions make contributions to the neural representation of tabular data, there is still no systematic study to compare those representations given the different assumptions and target tasks. In this work, we aim to bring clarity to this space and to provide suitable axes for a classification that highlights the main trends and enables future work to clearly position new results. Our contributions are threefold:
We characterize the tasks of producing and consuming neural representations of tabular data in an abstract machine learning (ML) pipeline. This abstraction relies on five aspects that are applicable to all proposals and that are agnostic to methods’ assumptions and final application (Section 2).
We describe and compare the relevant proposals according to the characteristics of the datasets (Section 3), the processing of the input (Section 4), and the adaptations of the transformer architecture to handle tabular data (Section 5).
To show the impact of the proposed solutions, we present six downstream tasks where they have achieved promising results (Section 6) and discuss how tabular language models are used by systems in practice (Section 7).
We conclude the survey with a discussion of the limitations of existing works and future research directions (Section 8).
Related surveys.
Some recent surveys cover the use of deep learning on tabular data (Borisov et al., 2021; Gorishniy et al., 2021; Shwartz-Ziv and Armon, 2021). These surveys are different from ours in multiple aspects. First, they investigate how deep learning models (including non-transformer-based ones) compare against classical ML models (mainly gradient boosted trees) by providing empirical results without detailing the changes made to the models, while our work taxonomizes transformer-based models in depth. Second, those surveys focus on standard classification and regression tasks where the term “tabular data” denotes labeled training data, representing data points with columns consisting of features. Every row is an input to the model together with a label (e.g., predicting house prices) (Somepalli et al., 2021). In contrast, we do not consider tabular data necessarily as a training dataset with features and labels, but rather as an input containing the information needed for prediction by a target model, such as question answering on tables. For this reason, while they focus on data generation and classification, we relate design choices and contributions to a large set of downstream tasks. A recent survey (Dong et al., 2022) covers approaches for table pre-training and the usage of LMs in downstream tasks, with an overlap in terms of surveyed works. In contrast with their work, we provide a more detailed study of the extensions to the transformer architecture and analyze in detail the relevant datasets’ properties and pre-processing steps.
2 Terminology and Overview
We focus on tabular data that come in the form of a table, as depicted in the two examples in Table 1. Table kinds differ in terms of horizontal/vertical orientation and with respect to the presence of hierarchies. A relational table, which is the most common kind, consists of rows, or records, and columns that together identify cell values, or cells. Columns represent attributes for a given table, and each row represents an instance having those attributes. This can be seen as a vertical orientation, where all cells in a column share the same atomic type, for example, the relational table in Table 1. An entity table has the same properties, but with a horizontal structure with only one entity whose properties are organized as rows. Spreadsheets, or matrix tables, are the most general kind with information that can be organized both horizontally and vertically, possibly with hierarchies in the header and formatting metadata, such as in financial tables.
Table 1: Examples of a relational table (top) and an entity table (bottom).

Relational table:
Player | Team | FG%
---|---|---
Carter | LA | 56
Smith | SF | 55

Entity table:
Player | Carter
Team | LA
FG% | 55
Tables can have rich metadata, such as attribute types (e.g., DATE), domain constraints, functional dependencies across columns, and integrity constraints such as primary keys. In the table sample in Figure 1, column Country is a primary key. Most systems focus on a single table, with or without metadata. However, a few systems, such as GTR (Wang et al., 2021a), GraPPa (Yu et al., 2021), and DTR (Herzig et al., 2021), consume databases, which are collections of relational tables, possibly under referential constraints. We identify as input the table(s) and its context. The context is text associated with the table. Depending on the dataset and task at hand, it ranges from table metadata, the text surrounding the tables, or their captions, up to questions, expressed in natural language, that can be answered with the tabular data (Badaro and Papotti, 2022).
The main advantage of the transformer architecture is its ability to generate an LM, a large neural network, with self-supervised pre-training. This pre-trained LM is then usually followed by supervised fine-tuning to adapt it to the target task with a small amount of training data. While transformers have proven to be effective in modeling textual content, tables have a rich structure that comes with its own relationships, such as those across values in the rows and attributes. New solutions are therefore needed to jointly model the characteristics of the table, its text content, and the text in the table context. As shown in Figure 1, we distinguish two main phases to spell out these contributions. First, we focus on the development of tabular LMs by using transformer-based deep neural networks (1). Given a table and its context, the goal is to learn a pre-trained representation of the structured data (cell values, rows, attributes) in a continuous vector space. We then discuss the use of those representations in the downstream tasks (2).
Figure 1 shows the reference pipeline and the aspects that we propose to model existing systems.
Training Datasets (Sec. 3): the datasets used for pre-training and fine-tuning the models toward specific tasks; datasets for the latter case usually come with annotations and/or labels.
Input Processing (Sec. 4): the steps to prepare the data for the model processing, such as the transformation from the two-dimensional tabular space to a one-dimensional input.
Transformer-based Encoder (Sec. 5): the pre-training objectives and customization of the typical transformer-based deep learning architecture.
Downstream Task Model (Sec. 6): the models consuming the representations or fine-tuning them to tackle downstream tasks.
Tabular Language Model (Sec. 7): the output representations, at the token, row, column, and table levels, and their usage.
For the tasks consuming the models, we report on Table-based Fact-Checking (TFC), Question Answering (QA), Semantic Parsing (SP), Table Retrieval (TR), Table Metadata Prediction (TMP), and Table Content Population (TCP).
3 Training Datasets
We present both the datasets used for pre-training and for fine-tuning in the downstream tasks. Pre-training tables are not annotated, and in some cases scraped from the web, while data used for fine-tuning have task-dependent annotation labels. The datasets consist of tables and their context, such as table metadata, surrounding texts, claims, or questions. To construct large pre-training datasets and in an attempt to reduce bias, multiple sources can be used, independently of the target task at hand. For instance, unlike TaPas (Herzig et al., 2020), which only uses Wikipedia Tables for QA, and TabularNet (Du et al., 2021), which only uses Spreadsheets for TMP, TaBERT (Yin et al., 2020) uses Wikipedia Tables and WDC for SP; GraPPa uses Wikipedia Tables, Spider, and WikiSQL for SP; MMR (Kostić et al., 2021) uses NQ, OTT-QA, and WikiSQL for TR; MATE (Eisenschlos et al., 2021) uses Wikipedia Tables and HybridQA for QA; and TUTA (Wang et al., 2021b) uses Wikipedia Tables, WDC, and spreadsheets for TMP. In general, it is recommended to use different data sources for pre-training to cover different table kinds and content, and thus improve the scope of the representations. For instance, Wikipedia Tables contains a large number of relational tables (Bhagavatula et al., 2015), while WDC and Spreadsheets also include entity tables and spreadsheets with complex structure.
Table 2 summarizes the main characteristics of the most common datasets. We mark the tasks for which the dataset has been used by ✔ under the column “Task”. We note that the top four datasets are mostly used for pre-training, while the others can be used for fine-tuning as well, since they include annotations for the target task, for example, questions/answers for QA. The column “Large Tables” is a binary indicator, where ✔ and ✘ indicate whether or not the tabular corpus includes large tables and hence whether or not some pre-processing is needed to reduce the table content to meet the limits of the transformer architecture (512 input tokens in most cases). Some works, such as TaBERT, TaPEx (Liu et al., 2021a), and CLTR (Pan et al., 2021), apply filtering in any case to reduce noisy input. Finally, the “Context” column describes additional text that comes with the tables. This can be text describing the table, such as the caption or the title of the document containing the table; table metadata, such as table orientation, header row, and keys; or questions and claims that can be addressed with the table.
Dataset | Reference | Used for Task | Number of Tables | Large Tables | Context | Application Example
---|---|---|---|---|---|---|---|---|---|---|---
 | | TFC | QA | SP | TR | TMP | TCP | | | |
Wikipedia Tables | Wikipedia | ✔ | ✔ | ✔ | ✔ | ✔ | 3.2M | ✔ | Surrounding Text: table caption, page title, page description, segment title, text of the segment. Table Metadata: statistics about number of headings, rows, columns, data rows. | TaPas | |
WDC Web Table Corpus | (Lehmberg et al., 2016) | ✔ | ✔ | ✔ | 233M | ✔ | Table Metadata: Table orientation, header row, key column, timestamp before and after table. Surrounding Text: table caption, text before and after table, title of HTML page. | TaBERT | |||
VizNet | (Hu et al., 2019) | ✔ | ✔ | 1M | ✘ | Table Metadata: Column Types. | TABBIE | ||||
Spreadsheets | (Dong et al., 2019) | ✔ | 3,410 | ✘ | Table Metadata: Cell Roles (Index, Index Name, Value Name, Aggregation and Others). | TabularNet | |||||
NQ-Tables | (Herzig et al., 2021) | ✔ | ✔ | 169,898 | ✔ | Questions: 12K. | DTR | ||||
TabFact | (Chen et al., 2020b) | ✔ | 16K | ✘ | Textual Claims: 118K. | Deco | |||||
WikiSQL | (Zhong et al., 2017) | ✔ | ✔ | ✔ | 24,241 | ✘ | Questions: 80,654. | MMR | |||
TabMCQ | (Jauhar et al., 2016) | ✔ | ✔ | 68 | ✘ | Questions: 9,092. | CLTR | ||||
Spider | (Yu et al., 2018) | ✔ | 200 databases | ✘ | Questions: 10,181 Queries: 5,693. | GraPPa | |||||
WikiTable Question (WikiTQ) | (Pasupat and Liang, 2015) | ✔ | ✔ | 2,108 | ✘ | Questions: 22,033. | TaPEx | ||||
Natural Questions (NQ) | (Kwiatkowski et al., 2019) | ✔ | ✔ | 169,898* | ✔ | Questions: 320K. | MMR | ||||
OTT-QA | (Chen et al., 2021) | ✔ | ✔ | 400K | ✔ | Surrounding Text: page title,section title, section text limited to 12 first sentences. Questions: 45,841. | MMR | ||||
Web Query Table (WQT) | (Sun et al., 2019) | ✔ | 273,816 | ✘ | Surrounding Text: captions. Queries: 21,113. | GTR | |||||
HybridQA | (Chen et al., 2020c) | ✔ | 13K | ✘ | Questions: 72K. Surrounding Text: first 12 sentences per hyperlink in the table. | MATE | |||||
Feverous | (Aly et al., 2021) | ✔ | 28.8K | ✔ | Textual Claims: 87K. Surrounding Text: article title. Table Metadata: row and column headers. | Feverous |
4 Input Processing
As in the original text setting, transformers for tabular data require as input a sequence of tokens. However, in addition to the typical tokenization executed before feeding the text to the neural network (Lan et al., 2020), tabular data require additional steps to be processed correctly. Some requirements come from the nature of the transformer, such as the limitation on the input data size (Section 4.1). Other requirements are due to the nature of the tabular data, with structural information expressed in two dimensions that needs to be converted into a one-dimensional space (Section 4.2). Finally, given a table, its data and its context must be jointly fed to the transformer (Section 4.3).
4.1 Data Retrieval and Filtering
Filtering methods on the table content are applied to stay within the size limits of the transformer architecture, to reduce the model training time, and to eliminate potential noise in the representation.
TaBERT uses a content snapshot to keep the top-k most relevant rows in the table. Such content is identified as the tuples with the highest n-gram overlap with respect to the given context (question or utterance). TaPEx and the retrieval model in Feverous (Aly et al., 2021) randomly select rows to limit the input size, while RCI (Glass et al., 2021) down-samples rows using term frequency inverse document frequency (TF-IDF) scores; the frequency can also be used to summarize cells with long text (Li et al., 2020). In addition to keeping tables with a number of columns below a fixed threshold, TUTA and TabularNet split large tables into non-overlapping horizontal partitions. Every partition contains the same header row and is processed separately by the model. While effective, splitting is more demanding in terms of computation cost.
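To make the row selection concrete, the following is a minimal sketch of the content-snapshot idea in Python; the tokenization, n-gram size, and tie-breaking used by TaBERT differ, so the function names and defaults are illustrative assumptions.

```python
import re
from typing import List

def grams(text: str, n: int = 1) -> set:
    """Lowercased word n-grams of a string (n=1 gives plain word overlap)."""
    toks = re.findall(r"\w+", text.lower())
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def content_snapshot(rows: List[List[str]], context: str, k: int = 3, n: int = 1) -> List[List[str]]:
    """Keep the k rows whose cell text overlaps most with the context (question or utterance)."""
    ctx = grams(context, n)
    return sorted(rows, key=lambda r: len(ctx & grams(" ".join(r), n)), reverse=True)[:k]

rows = [["Australia", "Canberra", "25.69"],
        ["Bolivia", "La Paz", "11.67"],
        ["Canada", "Ottawa", "38.01"]]
print(content_snapshot(rows, "What is the capital of Bolivia?", k=1))  # [['Bolivia', 'La Paz', '11.67']]
```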
In terms of table selection, whereas for systems like GTR and DTR the objective is to retrieve tables that contain the answer to a given question, others, such as CLTR, use a ranking function, such as BM25 (Robertson et al., 1995), to retrieve relevant tables prior to training, or to generate negative examples, as in MMR. Regardless of the downstream task, most systems filter and reduce the size of the input data to meet the limits of transformer technology. However, frequency-driven sampling, such as in RCI, is more effective than random sampling, as it also reduces noise in the data representations.
To summarize, one can group the different content selection strategies based on the targeted downstream task. For TFC, QA, and SP, a selection strategy relying on n-gram overlap, such as content snapshot or TF-IDF, is recommended, while for TR, BM25 is endorsed. For TMP and TCP, where the full content of the table is required, such as for relation extraction or cell filling, the tabular data are split without any selection.
4.2 Table Serialization
A crucial step is the transformation from the two dimensions of a table to its serialized version consumable by the transformer. The methods for table serialization can be grouped into four main types. The first type consists of horizontally scanning the table by row. Most systems achieve this with a flattened table with value separators, for example, DTR, MMR, TURL (Deng et al., 2020), TaPas, MATE, and Deco (Yang and Zhu, 2021). For the table in Figure 1, it corresponds to a sequence such as [CLS] Population in Million by Country — Country — Capital — Population — Australia — Canberra — 25.69 …Bolivia — La Paz — 11.67. Another option is a flattened table with special token separators to indicate the beginning of a new row or cell (TaPEx, TUTA, ForTaP (Cheng et al., 2022)), as in the first example in Figure 2. Finally, a few methods flatten the table by representing each cell as a concatenation of column name, column type, and cell value (TaBERT), for example, Country — String — Australia [SEP]; or just the row of column headers (GraPPa).
The second linearization type scans the table by column, again either by simple concatenation of column values or by using special tokens as separators, as in Doduo (Suhara et al., 2022). A column serialization for the country population table is the second example in Figure 2.
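As an illustration of the first two linearization types, the sketch below flattens a table by row and by column; the separator tokens ([ROW], [CELL], [COL]) and the choice to prepend the column name to each cell are illustrative assumptions, not the exact conventions of any single system.

```python
def serialize_by_row(header, rows, context=""):
    """Flatten a table row by row with special separator tokens (one common variant)."""
    tokens = ["[CLS]"] + context.split() + ["[SEP]"]
    for row in rows:
        tokens.append("[ROW]")
        for col, cell in zip(header, row):
            # Some systems prepend the column name (and type) to every cell value.
            tokens += [col, ":", cell, "[CELL]"]
    return tokens

def serialize_by_column(header, rows, context=""):
    """Flatten a table column by column, as in column-wise encoders."""
    tokens = ["[CLS]"] + context.split() + ["[SEP]"]
    for j, col in enumerate(header):
        tokens += ["[COL]", col] + [row[j] for row in rows]
    return tokens

header = ["Country", "Capital", "Population"]
rows = [["Australia", "Canberra", "25.69"], ["Bolivia", "La Paz", "11.67"]]
print(serialize_by_row(header, rows, context="Population in Million by Country"))
```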
The third linearization type consists of combining the output from both serializations by using an element-wise product (RCI, CLTR), average pooling and concatenation (TabularNet), or the average of row and column embeddings (TABBIE (Iida et al., 2021)). In a context outside of our framework, which focuses on transformers, it has also been proposed to transform the input relational table into a graph and perform random walks on it to ultimately produce node embeddings (Cappuzzo et al., 2020).
The fourth type consists of using a text template to represent the tabular data as sentences (Chen et al., 2020b; Suadaa et al., 2021; Chen et al., 2020a). Instead of using predefined templates, natural sentences can be generated from the tabular data by fine-tuning sequence-to-sequence language models such as T5 (Raffel et al., 2020) or GPT2 (Radford et al., 2019), as in DRT (Thorne et al., 2021; Neeraja et al., 2021). The generation can rely on models for this table-to-text task, such as TOTTO (Parikh et al., 2020), Pythia (Veltri et al., 2022, 2023), TableGPT (Gong et al., 2020), Logic2Text (Chen et al., 2020d), UnifiedSKG (Xie et al., 2022), or other efforts (Suadaa et al., 2021; Chen et al., 2020a, e). However, most of these methods show limitations for tables with a prevalence of numerical attributes and with missing table context. Within the efforts for table-to-text, graph traversal algorithms have been explored for linearization, such as relation-biased breadth-first search (Li et al., 2021). While table-to-text generation is an important and challenging task, most methods rely on fine-tuning existing pre-trained language models.
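A minimal sketch of the predefined-template flavor of this fourth type is shown below; the template wording and the assumption that the first column identifies the entity are illustrative, and neural table-to-text generators replace this step entirely.

```python
def verbalize(header, rows):
    """Render every non-key cell of each row as a short sentence using a fixed template."""
    sentences = []
    for row in rows:
        key = row[0]  # assumption: the first column identifies the entity described by the row
        for col, val in zip(header[1:], row[1:]):
            sentences.append(f"The {col} of {key} is {val}.")
    return sentences

print(verbalize(["Country", "Capital", "Population"],
                [["Bolivia", "La Paz", "11.67"]]))
# ['The Capital of Bolivia is La Paz.', 'The Population of Bolivia is 11.67.']
```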
While it is still not clear which table serialization method should be used for a given task, a few papers performed ablation studies. For instance, TabFact (Chen et al., 2020b) does not report any significant difference in performance when comparing row and column encoding. TaBERT reports that both (i) adding type information for cell values and (ii) phrasing the input as a sentence improve results. The most promising approach is to incorporate several aspects by appending column headers to the cell content, combining row and column encoding, or adding structure-aware indicators such as the positional embeddings discussed in Section 5. Finally, there is evidence that using textual templates to represent table content is a valid solution when one directly fine-tunes existing pre-trained language models, without performing pre-training on tabular data (Suadaa et al., 2021).
4.3 Context and Table Concatenation
When available, the table context is concatenated with the table content. Most systems (TaBERT, TaPas, DTR, GraPPa, and Deco) combine the context by concatenating it in the serialization before the table data, while TaPEx appends it. The authors of UnifiedSKG found that placing the context before the tabular knowledge yields better results in their setting. In some cases, including CLTR, MMR, and RCI, the table and the context are encoded separately and then combined at a later stage in the system.
Some works do not include context in their pre-training input besides the column headers (TABBIE, TabularNet, Doduo, GraPPa). Typically, the decision is based on the target downstream task. A richer context is used when the tasks are closer to the corresponding NLP task applied on free text. For instance, all models for QA use table captions or descriptions as context. To summarize, including context is crucial for TFC, QA, SP, and TR, while it can be excluded for TMP and TCP. When the context is needed, prepending or appending it to the table content does not modify the performance of the model.
5 Adaptation of Transformers
To account for structured tables in the input, several pre-trained transformer-based LMs and systems have been developed. Vanilla LMs are customized to make the model more “data structure-aware”, thus producing a modified transformer-based encoder that can be utilized in other tasks. These encoders, depicted in part (1) of Figure 1, capture both structure and semantic information. To exploit such resources and build applications, as in part (2) of Figure 1, several systems build on top of the encoder, usually with more modules and fine-tuning. In a different approach, other systems use the encoder as part of a bigger architecture in a more task-oriented fashion rather than an encoder-oriented one. We first briefly review the vanilla transformer architecture, then discuss customizations to LMs.
5.1 Vanilla Transformer
The vanilla transformer (Vaswani et al., 2017) is a seq2seq model (Sutskever et al., 2014) consisting of an encoder and a decoder, each of which is a stack of N identical modules. The encoder block is composed of a multi-head self-attention module and a position-wise feed-forward network. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. Residual connections and layer-normalization modules are also used. Decoder blocks consist of cross-attention modules between the multi-head self-attention modules and the position-wise feed-forward networks, where masking is used to prevent each position from attending to subsequent positions.
The transformer architecture can be used as an encoder-decoder (Vaswani et al., 2017; Raffel et al., 2020), an encoder-only (Devlin et al., 2019; Liu et al., 2019), or decoder-only (Radford et al., 2019; Brown et al., 2020) model. The choice of the architecture depends on the final task. Encoder-only models are mainly used for classification and are the most popular choice for extensions for tabular data. In this case, pre-training is done with a masked language modeling (MLM) task, whose goal is to predict masked token(s) of an altered input. The encoder-decoder architecture is used for models that focus on sequence generation tasks (RPT, TaPEx).
5.2 LM Extensions for Tabular Data
To properly model the structure of data in tables, the vanilla transformers are extended and updated by modifying components at the (i) input, (ii) internal, (iii) output, and (iv) training-procedure levels. We discuss each of them in the following. A summary of the extensions is provided as a taxonomy in Figure 3. We note that since the encoder and decoder modules of a transformer share similar structure, most of these modifications can be applied to both.
5.2.1 Input Level
Modifications at the input level usually consist of additional positional embeddings that explicitly model the table structure. TaBERT and TaPas show how such embeddings improve performance on structure-related tasks. For example, embeddings that represent the position of the cell, indicated by its row and column IDs, are common for relational tables, for example, in TaBERT, TaPas, and TABBIE. TableFormer (Yang et al., 2022) drops row and column IDs to avoid potential row and column biases and instead uses per-cell positional embeddings, similar to MATE. For tables without a relational structure, such as entity tables and spreadsheets, including complex financial tables, TUTA introduces tree-based positional embeddings to encode the position of a cell using top and left embeddings of a bi-dimensional coordinate tree. Other supplementary embeddings include those that provide relative positional information for a token within a caption/header (TURL) or a cell (TUTA). For tasks such as QA, segment embeddings are used to differentiate between the different input types, question and table, for example, in RPT (Tang et al., 2021) and TaPas. Finally, TUTA introduces embeddings for numbers when discrete features are used.
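The sketch below shows, under simplifying assumptions, how row and column id embeddings can be added to the usual token and sequence-position embeddings; the dimensions, the reserved id 0 for context tokens, and the class name are illustrative and do not correspond to the exact implementation of any surveyed system.

```python
import torch
import torch.nn as nn

class TableEmbeddings(nn.Module):
    """Token embeddings enriched with row/column id embeddings (illustrative sketch)."""
    def __init__(self, vocab_size=30522, hidden=256, max_rows=64, max_cols=32, max_pos=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.pos = nn.Embedding(max_pos, hidden)   # ordinary sequence position
        self.row = nn.Embedding(max_rows, hidden)  # row id; 0 reserved for context tokens
        self.col = nn.Embedding(max_cols, hidden)  # column id; 0 reserved for context tokens

    def forward(self, token_ids, row_ids, col_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.tok(token_ids) + self.pos(positions)
                + self.row(row_ids) + self.col(col_ids))

emb = TableEmbeddings()
out = emb(torch.zeros(1, 8, dtype=torch.long),          # token ids (placeholder)
          torch.tensor([[0, 0, 1, 1, 1, 2, 2, 2]]),      # row ids
          torch.tensor([[0, 0, 1, 2, 3, 1, 2, 3]]))      # column ids
print(out.shape)  # torch.Size([1, 8, 256])
```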
While row/column positional embeddings can better map context and table content, as in QA or TFC tasks, they cannot overcome the challenge of empty cells, nested row headers, or descriptive cells in complex spreadsheets such as financial tables. In this case, tree-based positional embeddings are required. Aside from new positional embeddings encoding the structure of a token in a table, the original, vanilla positional embeddings can also be modified for a better representation of tokens within table cells. For example, TaBERT employs a pre-training task that explicitly uses positional embeddings, alongside a cell representation, with the goal of recovering a cell value. In this case, the original positional embeddings help better model multiple tokens in a cell.
5.2.2 Internal Level
Most of the modifications at the internal level are applied to make the system more “structure-aware”. Specifically, the attention module is updated to integrate the structure of the input table. For example, in TaBERT vertical self-attention layers are introduced to capture cross-row dependencies among cell values by applying attention in a vertical fashion. Empirical results in TUTA show that employing row-wise and column-wise attention, instead of additional positional embeddings for rows and columns, hurts model performance for cell-type classification tasks, but not for table-type classification tasks.
Other systems, such as TURL, employ a masked self-attention module, which attends to structurally related elements such as those in the same row or column, thus ignoring the other elements, unlike the traditional transformer where each element attends to all other elements in the sequence. Moreover, for categorical data, named entities, such as city names, can be identified in the cell values. In TURL and TUTA, masking such entities helps the models capture the factual knowledge embedded in the table content as well as the associations between table metadata and table content. Ablation studies showed the positive impact of these two modifications on model performance.
Other modifications address the input size constraint of attention modules, which often forces large tables to be neglected. Sparse attention methods are proposed to cope with this issue (Tay et al., 2020). For instance, MATE sparsifies the attention matrix to allow transformer heads to efficiently attend to either rows or columns.
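A simple way to picture such structure-aware attention is a visibility mask that lets a token attend only to tokens in its own row or column; the sketch below is an illustrative approximation (using id 0 for context tokens), not the exact masking scheme of TURL or MATE.

```python
import torch

def structural_attention_mask(row_ids, col_ids):
    """Boolean visibility mask: a token attends to tokens in its own row or column;
    context tokens (row id and column id 0) see and are seen by everything."""
    same_row = row_ids.unsqueeze(1) == row_ids.unsqueeze(0)
    same_col = col_ids.unsqueeze(1) == col_ids.unsqueeze(0)
    is_ctx = (row_ids == 0) & (col_ids == 0)
    return same_row | same_col | is_ctx.unsqueeze(0) | is_ctx.unsqueeze(1)

row_ids = torch.tensor([0, 1, 1, 2, 2])
col_ids = torch.tensor([0, 1, 2, 1, 2])
print(structural_attention_mask(row_ids, col_ids))  # True where attention is allowed
```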
Modifications at the input level, through additional input embeddings encoding cell positions (TaPas), help in tasks requiring information at the cell level, such as QA and cell type classification. However, tasks requiring an understanding of the table structure, such as table type classification, benefit more from modifying the attention module (TaBERT). Systems modifying both, such as TUTA, seem to be the best option. However, modifying the internal level is more effective, as removing such a modification leads to a larger decrease in performance.
5.2.3 Output Level
Additional layers can be added on top of the feed-forward networks (FFNs) of the LM depending on the task at hand. Tasks such as cell selection (TaPas), TFC (TabFact), TMP (Doduo), and QA (TaPas) require training one or more additional layers. Classification layers for aggregation operations and cell selection are used to support aggregating cell values (GraPPa, MATE, Doduo, Deco, RCI). In TaPEx, aggregation operations are also “learned” end-to-end in a seq2seq task.
5.2.4 Training-Procedure Level
Modifications on the training-procedure level can be attributed to the pre-training task and objective.
Pre-training Tasks. Systems are either trained end-to-end or fine-tuned after some pre-training procedure. Almost all pre-training tasks fall under the category of reconstruction tasks, where the idea is to reconstruct the correct input from a corrupted one. Rather than applying traditional token-level MLM on the serialized tabular data, effectively treating it as natural language sequences, most pre-training tasks are designed to consider the information about the table structure.
Typically, traditional masking of the tokens is used for both the textual context surrounding the table and the table cells (e.g., TaPas 15%, TURL 20%), with some novelties for table cells, such as masked columns (TaBERT) and masked entities (TURL). Specifically, TaPas masks the whole table cell whenever a token in that cell is randomly masked. Inspired by TaPas, TUTA masks 15% of the cells by randomly selecting those that consist mainly of text, rather than numerical values. Among these selected cells, 30% are masked completely, while for 70% only a single token within the cell is masked. Masking column names and data types for relational tables encourages the model to recover the metadata, while masking tokens and whole cell values ensures that data information is retained across the different vertical self-attention layers. Indeed, masking at the entity level enables the model to integrate the factual knowledge embedded in the table content and its context.
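The following sketch illustrates whole-cell masking for an MLM objective; the 15% rate mirrors the numbers reported above, but the span bookkeeping and function name are illustrative assumptions rather than the exact procedure of TaPas or TUTA.

```python
import random

def mask_whole_cells(tokens, cell_spans, mask_rate=0.15, mask_token="[MASK]"):
    """Whole-cell masking: pick cells at random and mask every token in them.
    cell_spans is a list of (start, end) token index pairs, one per cell."""
    tokens = list(tokens)
    labels = [None] * len(tokens)          # MLM targets (None = not predicted)
    for start, end in cell_spans:
        if random.random() < mask_rate:
            for i in range(start, end):
                labels[i] = tokens[i]
                tokens[i] = mask_token
    return tokens, labels

toks = ["Australia", "Canberra", "25.69", "Bolivia", "La", "Paz", "11.67"]
spans = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 6), (6, 7)]  # multi-token cell: "La Paz"
print(mask_whole_cells(toks, spans))
```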
In TABBIE and RPT, the pre-training tasks detect whether a cell/tuple is corrupted or not. While corruption has been shown to outperform MLM for standard textual LMs (Clark et al., 2020), it is not clear whether the benefit generalizes to the tabular setting. In TUTA, text headers of tables are used to learn representations of tables in a self-supervised manner, that is, a binary classification task where a positive example is the table and its associated header, and a negative example is any other header.
While most systems pre-train for the purpose of learning initial tabular representations, support for aggregations and numerical operations is usually added by fine-tuning a classifier for predicting cells and operations, for example, in TaPas. However, systems such as GraPPa and TaPEx pre-train with SQL queries. In GraPPa, the pre-training task enables the discovery of suitable fragments for SP by pre-training on a large number of queries, with SQL logic grounded in each table, which might help generalize to other similar queries. Empirical results show a decrease in performance without this pre-training task. TaPEx forces the LM to emulate a SQL engine by pre-training on sampled tables and synthesized queries. This pre-training task (i) gives the LM a deeper understanding of the tables, as the system encounters operations such as selection and aggregation, rendering its representations ‘aware’ of such operations, and (ii) improves the performance in QA.
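To give an idea of execution-based pre-training data, the sketch below builds (query, answer) pairs by running SQL over an in-memory table; the real systems sample queries from templates at much larger scale, so the helper name and the hand-written queries are illustrative assumptions.

```python
import sqlite3

def sql_execution_examples(header, rows, queries):
    """Build (query, answer) pre-training pairs by executing SQL over an in-memory table."""
    conn = sqlite3.connect(":memory:")
    cols = ", ".join(f'"{c}"' for c in header)
    conn.execute(f"CREATE TABLE t ({cols})")
    conn.executemany(f"INSERT INTO t VALUES ({', '.join('?' for _ in header)})", rows)
    return [(q, conn.execute(q).fetchall()) for q in queries]

header = ["Country", "Capital", "Population"]
rows = [("Australia", "Canberra", 25.69), ("Bolivia", "La Paz", 11.67)]
queries = ["SELECT Capital FROM t WHERE Country = 'Bolivia'",
           "SELECT COUNT(*) FROM t WHERE Population > 20"]
print(sql_execution_examples(header, rows, queries))
# [(..., [('La Paz',)]), (..., [(1,)])]
```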
According to Chang et al. (2020), a pre-training task should be cost-efficient, ideally not requiring additional human supervision. As manually annotating queries with tables is costly, using surrounding information, such as the table header or caption, as a query acts as a self-supervised signal, helping the system learn better representations for the tables. This is the case for TR systems, such as DTR and GTR, where performance improves when a pre-training task is used rather than training from scratch.
To summarize, token-level masking is the basis for all systems and can be sufficient for TFC and TR tasks, while cell or entity masking is also recommended for QA, TMP (such as cell role classification), and TCP (such as cell filling). Column masking is additionally suggested for SP since it helps in identifying the columns needed to formulate the logical queries.
Pre-training Objectives.
The objective of the majority of the systems is to minimize cross-entropy loss for a certain classification task. GTR utilizes a point-wise ranking objective for end-to-end training after pre-training, where multiple tables are ranked according to a relevance score given a certain query.
Table 3 reports a summary of the transformer customizations adopted by every system. Overall, a tabular LM requires modifications to the input level, through additional embeddings, and to the internal level, through adjustments of the attention module. Pre-training on table-related tasks, such as masking or corrupting table cells, also enhances encoding capabilities for structured data on fine-tuned tasks. Modifications on the output level are more task specific and have less impact on making the LM “understand” the data structure.
 | Module Level | (Pre-)Training |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
System | Input: | Internal: | Output: FFN | Task | Objective |
Additional Pos. Emb. | Attention | Encoding Level | Task-Oriented | Mask | Corrupt | Other | Cls | Rnk |
 | Ro | Co | Tr | Nu | Fr | Vrt | Sp | Tr | Vi | Ro | Co | Tb | Ce | To | CP | Nli | Sp | Ag | To | Ce | Co | TC | Tu | Ce | Nse | CE | PR
TUTA | X | X | X | X | X | X | X | X | X | ||||||||||||||||||
TURL | X | X | X | X | X | X | |||||||||||||||||||||
TaBERT | X | X | X | X | X | X | X | X | |||||||||||||||||||
TABBIE | X | X | X | X | X | X | X | ||||||||||||||||||||
MATE | X | X | X | X | X | X | X | X | X | X | X | ||||||||||||||||
RCI | X | X | |||||||||||||||||||||||||
GraPPa | X | X | X | ||||||||||||||||||||||||
TaPEx | X | X | X | X | |||||||||||||||||||||||
Doduo | X | X | X | ||||||||||||||||||||||||
TabFact | X | X | X | X | |||||||||||||||||||||||
RPT | X | X | X | X | X | X | |||||||||||||||||||||
TaPas | X | X | X | X | X | X | X | X | |||||||||||||||||||
GTR | X | X | X |
6 Downstream Tasks
Using neural representations for tabular data shows improvements in performance in several downstream tasks. In this section, we describe the tasks and define their input and output. While they all consume tables, settings can be quite heterogeneous, with systems exploiting different information, even for the same task. A summary of the covered tasks along with their input, output, and some representative systems addressing them is shown in Table 4. We detail next the mandatory input elements and the different contexts.
Task ID | Task Label | Tasks Coverage | Input | Output | System Examples
---|---|---|---|---|---
TFC | Table-based Fact-Checking | Fact-Checking, Text Refusal/Entailment | Table + Claim | True/False, Refused/Entailed (Data Evidence) | Deco, TabFact, TaPEx
QA | Question Answering | Retrieving the Cells for the Answer | Table + Question | Answer Cells | TaPas, MATE, DTR
SP | Semantic Parsing | Text-to-SQL | Table + NL Query | Formal QL | TaBERT, GraPPa, TaPEx
TR | Table Retrieval | Retrieving the Table that Contains the Answer | Tables + Question | Relevant Table(s) | GTR, CLTR
TMP | Table Metadata Prediction | Column Type Prediction; Table Type Classification; Header Detection; Cell Role Classification; Column Relation Annotation; Column Name Prediction | Table | Column Types; Table Types; Header Row; Cell Role; Relation between Two Columns; Column Name | Doduo, TabularNet, TURL, TUTA
TCP | Table Content Population | Cell Content Population | Table with Corrupted Cell Values | Table with Complete Cell Values | TURL, TABBIE, RPT
Table-based Fact-Checking (TFC):
Similar to text-based textual entailment (Dagan et al., 2013; Korman et al., 2018), checking facts with tables consists of verifying if a textual input claim is true or false against a trusted database (TaPEx, Deco, TabFact), also provided as input. Some fact-checking systems, such as Feverous, also output the cells used for verification as evidence (Nakov et al., 2021; Karagiannis et al., 2020).
Question Answering (QA):
In the free text setting, QA aims at retrieving passages that include the answer to a given question. In the tabular data setting, it consists of returning as output the cells that answer an input consisting of a question and a table. One can distinguish two levels of complexity. Simple QA involves lookup queries on tables (DTR, CLTR), while a more complex QA task involves aggregation operations and numerical reasoning (TaPas, MATE, RCI, DRT, TaPEx). Most of the systems in this survey aim at improving accuracy in QA with respect to hand-crafted embeddings.
Semantic Parsing (SP):
In the tabular data setting, given a question and a table as input, SP generates a declarative query in SQL over the table’s schema to retrieve the answer to the question. While in QA the interest is in directly getting the answer, SP produces the (interpretable) query to obtain it (TaBERT, GraPPa, TaPEx).
Table Retrieval (TR):
Given a question and a set of tables as inputs, TR identifies the table that can be used to answer the question. TR is helpful when trying to reduce the search space for a QA task (GTR, DTR, MMR). It is a challenging task given the limited input size of transformers, that is, their constraint to sequences of 512 tokens.
Table Metadata Prediction (TMP):
Given an input table with corrupted or missing metadata, the TMP objective is to predict inter-table metadata, such as column types and headers, cell types, table types, and intra-table relationships, such as equivalence between columns and entity linking/resolution. Relevant efforts focus both on spreadsheets (TUTA, TabularNet) and relational tables (TURL, Doduo).
Table Content Population (TCP):
Unlike TMP, where the table metadata is noisy or missing, TCP deals with corrupted cell content. Given an input table with missing cell values, the objective is to impute the respective values (RPT, TABBIE, TURL).
We observe that most tasks can be seen as traditional NLP problems where structured data replace free text, such as the case of QA where answers are located in tabular data instead of documents (Gupta and Gupta, 2012). TFC involves retrieving cells that entail or refute a given statement, whereas on free text the corresponding objective is to select sentences as evidence (Thorne et al., 2018). SP is the task of converting natural language utterance into a logical form (Berant and Liang, 2014), which in this setting is expressed as a declarative query over a relational table. TR on tabular data corresponds to passage retrieval on free text (Kaszkiel and Zobel, 1997). TCP is analogous to predicting missing words or values in a sentence (Devlin et al., 2019). Finally, TMP can be related to syntactic parsing in NLP (Van Gompel and Pickering, 2007), where relationships between different tokens are depicted.
We conclude this section with an analysis of the performance of the systems over the different downstream tasks. For every task, we selected datasets for which at least two systems have reported results. All datasets reported in Table 5 are described in Table 2, with the exception of WCC (Ghasemi-Gol and Szekely, 2018), which contains web tables annotated with their type (relational, entity, matrix, list, and non-data), and EntiTab (Zhang and Balog, 2017), which contains web tables annotated with possible header labels for the column population task. Table 5 also contains the size, expressed as number of parameters, of the largest model used by every system. As some systems are not comparable on any shared datasets, we report here their size: TabFact (110M), MMR (87M), TURL (314M), RPT (139M), and CLTR (235M). Larger models do not correlate with better performance across different systems for the same task, but a larger model always brings higher scores for the same system, as expected. Execution times for training and testing depend on the size of the model and the computing architecture.
System (size) | TFC | QA | SP | TR | TMP | TCP
---|---|---|---|---|---|---|---|---|---
 | TabFact | HybridQA | WikiTQ | WikiSQL | Spider | WQT | WCC | VizNet | EntiTab
 | accu. | accu./F1 | accu. | accu. | MAP | MAP | F1 | F1 | MAP
TaBERT (367M) | 53.5* | 70.9* | 65.2 | 63.4 | 83.6* | 97.2* | 33.1* | ||
TaPas (340M) | 81.2* | 62.7/70.0 | 48.8 | 86.4 | |||||
MATE (340M) | 81.4 | 62.8/70.2 | 51.5 | ||||||
TaPEx (406M) | 84.2 | 57.5 | 89.5 | ||||||
TABBIE (170M) | 96.9 | 37.9 | |||||||
RCI (235M) | 89.8 | ||||||||
GraPPa (355M) | 52.1* | 69.6 | |||||||
Doduo (110M) | 94.3 | ||||||||
Deco (1454 M) | 82.7 | ||||||||
TUTA (134M) | 87.6 | ||||||||
GTR (117M) | 73.7 |
The results in the table show that some tasks, such as TFC and TMP, can already be handled successfully by the systems, while other tasks, such as TCP, TR, and SP, are harder. QA is the task supported by most systems, and the quality of the results varies depending on the dataset at hand. Differences in performance can be explained by different improvements across the systems. For example, MATE performs better than TaPas in two tasks because of its mechanism to deal with larger input tables. Similarly, TUTA improves over TaBERT because it handles tabular data beyond relational tables. Finally, TaBERT and TaPas are the systems that show the most coverage in terms of tasks, with multiple papers using them as baselines in their experiments. For the systems that are not reported in Table 5, we note that TURL obtains similar F1 results for column type prediction, but on a dataset different from VizNet; TURL, TABBIE, and TaBERT also report comparable MAP results for the row population task (not in Table 5) over different datasets.
7 Using the Language Model
The initial neural representations of tabular data are the result of the pre-training. Systems use the output LM in different ways. It can be fine-tuned or used “as-is”—for example, by using its representations as features in traditional ML algorithms (Section 7.1). However, as pre-trained LMs can act as encoders of the input, they are also used as a module in bigger systems (Section 7.2).
7.1 Encoder Output and its Usage
The systems expose the output table representation as embeddings at different granularities. Almost all systems provide token and cell output embeddings. Most of them also expose column and row embeddings, while a few provide table and pairwise column (Doduo) or pairwise table (Deco) embeddings. When a special separator token separates the context and the table content, a representation of the context is also provided (TabFact, CLTR, DTR, MMR).
These representations can be used in an arbitrary ML program simply as features, or directly by the systems creating them to tackle a given task (Section 6). Indeed, as we discussed in Section 5.2.3, most systems use the pre-trained embeddings for further fine-tuning to tackle the tasks. We observe a relationship between the granularity of the output representations and the target downstream tasks. For instance, table representations are used for the TR task, while column-pair representations are used for the TMP task to support column relation annotation. Cell representations are used for the QA task, since the cells containing the answer should be returned. Finally, column representations are used for the SP task, as columns are needed to formulate the output query.
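As a concrete picture of the feature-based usage, the sketch below mean-pools the (frozen) token embeddings of a column into a feature vector and feeds it to an off-the-shelf classifier; the toy random data, embedding size, and binary column-type label are placeholders for the output of an actual tabular LM.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def column_feature(token_embeddings, col_token_idx):
    """Mean-pool the token embeddings belonging to one column into a single feature vector."""
    return np.asarray(token_embeddings)[list(col_token_idx)].mean(axis=0)

# Toy stand-in for LM output: 20 columns, each with 5 token embeddings of size 768,
# and a binary column-type label to predict.
rng = np.random.default_rng(0)
X = np.stack([column_feature(rng.normal(size=(5, 768)), range(5)) for _ in range(20)])
y = rng.integers(0, 2, size=20)
clf = LogisticRegression(max_iter=1000).fit(X, y)  # embeddings used as plain features
```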
Most pre-trained models, such as TaPas, are available out-of-the-box, while in some cases, such as TURL, MATE, and CLTR, users have to retrain the models. For such systems, the code and the dataset are available, but users need to run the training on their side to generate the representations (no checkpoints are available).
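As an example of out-of-the-box usage, the following sketch loads a fine-tuned TaPas checkpoint from the HuggingFace hub; the checkpoint name and the exact API are assumptions that may vary across library versions.

```python
import pandas as pd
from transformers import TapasForQuestionAnswering, TapasTokenizer

name = "google/tapas-base-finetuned-wtq"  # assumed public checkpoint fine-tuned for QA
tokenizer = TapasTokenizer.from_pretrained(name)
model = TapasForQuestionAnswering.from_pretrained(name)

table = pd.DataFrame({"Country": ["Australia", "Bolivia"],
                      "Capital": ["Canberra", "La Paz"],
                      "Population": ["25.69", "11.67"]})  # TaPas expects string-valued cells
inputs = tokenizer(table=table, queries=["What is the capital of Bolivia?"],
                   return_tensors="pt")
outputs = model(**inputs)  # logits over table cells (and aggregation operators)
```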
7.2 Encoder as a Module
Several systems, such as TUTA and TURL, add layers on top of the LMs, which are then fine-tuned for a task. While this is a common use of pre-trained LMs, other works employ LMs as components in a larger system. In these cases, the system not only needs to learn how to encode properly, but also to adjust its representations to a certain task by jointly training multiple components end-to-end, such as the ones that (individually) generate the embeddings for tables and text together with the one that scores the similarity of textual and tabular embeddings.
Most of these larger systems focus on the retrieval of tables from an input natural language query. DTR answers a natural language question from a corpus of tables in two steps. First, it retrieves a small set of candidate tables (TR), where the encodings of questions and tables are learned through similarity learning. The similarity score is obtained through an inner product between question and table embeddings. Then, it performs the standard answer prediction (QA) with each candidate table in the input. A recent study (Wang et al., 2022) claims that table-specific models may not be needed for accurate table retrieval and that fine-tuning a general text-based retriever leads to superior results compared to DTR. MMR studies a multi-modal version of DTR using both tables and text passages by proposing several bi-encoders and tri-encoders. Similarly, CLTR introduces an end-to-end system for QA over a table corpus, where the retrieval of candidate tables is performed by a coarse-grained BM25 module, followed by a transformer-based model that concatenates the question with the row/column and classifies whether the associated row/column contains the answer. Other systems, such as GTR, support the retrieval of tables represented as graphs. In this setting, stacked layers of a variant of Graph Transformers (Koncel-Kedziorski et al., 2019) are employed to obtain node features that are combined with query embeddings. These combined embeddings are then aggregated with the BERT embeddings of the table context and query, and a relevance score is finally obtained.
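The retrieve-then-read pattern boils down to ranking tables by the similarity of separately computed embeddings; a minimal sketch of this inner-product ranking (with placeholder embeddings standing in for the question and table encoders) is shown below.

```python
import numpy as np

def retrieve_tables(question_emb, table_embs, k=3):
    """Rank candidate tables by inner product with the question embedding (bi-encoder style)."""
    scores = np.asarray(table_embs) @ np.asarray(question_emb)
    return np.argsort(-scores)[:k]  # indices of the k highest-scoring tables

# Placeholder embeddings standing in for the outputs of the question and table encoders.
rng = np.random.default_rng(0)
question_emb = rng.normal(size=128)
table_embs = rng.normal(size=(50, 128))
print(retrieve_tables(question_emb, table_embs, k=3))
```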
8 Future Directions
Tabular LMs effectively address some of the challenges that arise with classical ML models (Borisov et al., 2021), such as transfer learning and self-supervision. However, several challenges remain unaddressed.
Interpretability.
Only a few systems expose a justification of their model output, for example, TaPas, CLTR, and MATE, thus model usage remains a black box. One direction is to use the attention mechanism to derive interpretations (Serrano and Smith, 2019; Dong et al., 2021). Looking at self-attention weights of particular layers and layer-wise propagation with respect to the input tokens, we can capture the influence of each cell value/tuple on the output through back-propagation (Huang et al., 2019). For instance, in TFC, providing explanation is crucial when the decision is derived from aggregating several cells, such as sum or average operation. In this case, a basic explanation would be to show all the cells that led to the final true/false decision.
Error Analysis.
Most studies focus on the downstream evaluation scores rather than going through a manual evaluation of errors. At the downstream task level, this analysis could trace misclassified examples back to evidence of issues in the tabular data representation. For example, in a QA task with wrong answers, there could be a pattern showing that these errors are all due to the system confusing two columns with similar meaning (e.g., max value and last value for a stock), thus returning the wrong cell value in several cases.
Complex Queries and Rich Tables.
Several systems, such as TaPEx and MATE, handle queries with aggregations by adding classification layers. However, such methods fall short with queries that join tables. As most works assume a single table as input, a much needed direction is to develop models that handle multiple tables, for example, with classification layers predicting when a join is required. Also, as tables might contain heterogeneous types and content, such as non-uniform units (e.g., kg and lbs), systems should be able to handle such differences, which are abundant in practice. Moreover, an interesting direction is to conduct a study to show where tabular LMs can be successfully applied and where they fail in querying data.
Model Efficiency.
Transformer-based approaches are computationally expensive and some approaches try to approximate the costly attention mechanism by using locality-sensitive hashing to replace it (Kitaev et al., 2020), approximating it by a low-rank matrix (Wang et al., 2020), or applying kernels to avoid its computational complexity (Katharopoulos et al., 2020; Choromanski et al., 2020). While there exist methods to make transformers more efficient for long context (Tay et al., 2020), all optimizations consider an unstructured textual input. We believe more traction is needed for efficient transformers on structured data, with ideas from the textual counterpart, such as limiting attention heads to rows/columns, which can be obtained naturally from the structured input (Eisenschlos et al., 2021), or exploring prompt learning and delta tuning (Ding et al., 2022).
Benchmarking Data Representations.
There are no common benchmark datasets where researchers can assess and compare the quality of their data representations on a level playing field. Current evaluation is extrinsic, that is, at the downstream task level, where each work has its own assumptions. Intrinsic methods to evaluate the quality of the representations, such as those for word embeddings (Bakarov, 2018), can include predicting a table caption given the table representation or identifying functional dependencies. A set of precise tests can be designed to assess data-specific properties, such as the ability of the transformer-based models, which are designed to model sequences, to capture that rows and attributes are sets in tables. Also, it is not clear whether the model representations are consistent with the table structure, effectively capturing its relationships. For example, given two cell values in the same row/column, are their embeddings closer than those of values coming from different rows/columns? For this, following the lines of CheckList for text LMs (Ribeiro et al., 2020), basic tests should be designed to measure the consistency of the data representation.
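One such basic test could be scripted as below: it checks how often two cells from the same row are closer in cosine similarity than a randomly sampled cell from a different row; the scoring protocol and function name are illustrative assumptions, not an established benchmark.

```python
import numpy as np

def row_consistency(cell_embeddings, row_ids, seed=0):
    """Fraction of same-row cell pairs that are closer (cosine) than a random different-row cell."""
    emb = np.asarray(cell_embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = emb @ emb.T
    rng = np.random.default_rng(seed)
    wins, total = 0, 0
    n = len(row_ids)
    for i in range(n):
        for j in range(i + 1, n):
            if row_ids[i] != row_ids[j]:
                continue                 # only score same-row pairs
            k = rng.integers(n)
            if row_ids[k] == row_ids[i]:
                continue                 # need a different-row distractor
            wins += int(sims[i, j] > sims[i, k])
            total += 1
    return wins / total if total else 0.0
```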
Data Bias.
It is recognized that LMs incorporate bias in the model parameters in terms of stereotypes, race, and gender (Nadeem et al., 2021; Vig et al., 2020). The bias is implicitly derived from the training corpora used to develop LMs. Therefore, there is a need to develop methods to overcome this drawback by, for instance, pre-filtering the training data or by correcting the tabular LMs, similarly to the ongoing efforts for text LMs (Liu et al., 2021b; Bordia and Bowman, 2019).
Green LMs.
The use of large-scale transformers for learning LMs requires considerable computation, which contributes to global warming (Strubell et al., 2020; Schwartz et al., 2020). Therefore, it is important to consider enhancements or potentially new techniques that limit the carbon footprint of tabular language models without a significant decrease in the performance in the downstream tasks. One enhancement can be at the level of the size of the training data by removing redundant, or less informative, tuples and tables. How to identify such data is a key challenge.
9 Conclusion
We conducted a survey on the efforts in developing transformer-based representations for tabular data. We introduced a high-level framework to categorize those efforts and characterized each step in terms of solutions to model structured data, with special attention to the extensions to the transformer architecture. As future work, we envision a generic system to perform an experimental study based on our framework. The first part of the system would develop tabular data representations with alternative design choices, while the second part would evaluate them in downstream tasks. This work would help identify the impact of alternative techniques on the performance in the final applications.
Acknowledgments
We would like to thank the action editor and the reviewers for their valuable inputs that helped in further improving the content and the presentation of our article. This work has been partially supported by the ANR project ATTENTION (ANR-21-CE23-0037) and by gifts from Google.
Notes
1. We refer to tabular “language” models with a slight abuse of the name, as the LM captures properties and relationships of the structured data rather than those of the language.
References
Author notes
Action Editor: Trevor Cohn