Abstract
Data-to-text (D2T) generation aims to transform structured data into natural language text. Data-to-text pre-training has proved to be powerful in enhancing D2T generation and yields impressive performance. However, previous pre-training methods either oversimplified structured data into a sequence without considering input structures or designed training objectives tailored for a specific data structure (e.g., table or knowledge graph). In this paper, we unify different types of structured data (i.e., table, key-value data, knowledge graph) into the graph format and cast different D2T generation tasks as graph-to-text generation. To effectively exploit the structural information of the input graph, we propose a structure-enhanced pre-training method for D2T generation by designing a structure-enhanced Transformer. Concretely, we devise a position matrix for the Transformer, encoding relative positional information of connected nodes in the input graph. In addition, we propose a new attention matrix to incorporate graph structures into the original Transformer by taking the available explicit connectivity structure into account. Extensive experiments on six benchmark datasets show the effectiveness of our model. Our source codes are available at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/unid2t.
1 Introduction
Data-to-text (D2T) generation, which aims to generate a target natural language text conditioned on source structured data, has attracted noticeable attention due to its many applications such as journalism (Rebuffel et al., 2020), medical diagnosis (Nishino et al., 2020), financial and weather reports (Liang et al., 2009), and sports broadcasting (Chen and Mooney, 2008). The input structured data can include tables of records, simulations of physical systems, spreadsheets, knowledge graphs, and so on. Transforming structured data into textual data can facilitate a wide range of users to understand and use the structured data, which is needed in many real-life scenarios.
Recently, large-scale pre-trained models have proved to be powerful in D2T generation and yield impressive performance (Kale and Rastogi, 2020; Xing and Wan, 2021; Liu et al., 2022), which benefit from the rich knowledge contained in large-scale pre-training corpora. Xing and Wan (2021) proposed a structure-aware table-to-text pre-training model, which devised three self-supervised training objectives tailored for modeling tables and their contexts. Ke et al. (2021) adopted a structure-aware semantic aggregation module to model the structure of an input graph at each Transformer layer, and explicitly learned graph-text alignments instead of directly fine-tuning text-to-text pre-trained models on graph-to-text corpora.
Although significant progress has been made in this field, there are still several technical challenges with existing D2T pre-training methods. Most prior studies made a cumbersome design tailored for a specific data structure such as tables (Liu et al., 2022) or knowledge graphs (Li et al., 2022), which could not effectively deal with diverse structured data in a unified framework. Kale and Rastogi (2020) were the first to study the “pre-train and fine-tune” strategy on several benchmarks spanning task-oriented dialogue, table-to-text, and graph-to-text. However, they oversimplified the input structured data into a flat string and adopted an original Transformer without capturing the structural information of source structured data.
In this paper, we unify the structured data into the graph format for data-to-text pre-training (denoted as UniD2T). We convert diverse types of structured data into a unified graph format, keeping the structural information of the structured data. We treat the items in the structured data as a set of nodes and connect the nodes according to the connectivity of the structured data. In this way, we can cast various D2T tasks as the graph-to-text generation task.
To effectively encode the graph structure, we propose a structure-enhanced pre-training model, which can be applied to various downstream D2T generation tasks. Our proposed D2T pre-training model is built upon the T5 model (Raffel et al., 2020). Since the T5 model is a text-to-text transfer Transformer framework and cannot effectively encode the graph structure, we propose a structure-enhanced Transformer to encode the structural information. Concretely, we propose an explicit position matrix for the Transformer, encoding the relative positional information of connected nodes in the input graph. In addition, we build a new attention matrix to replace the attention mask in self-attention of the original Transformer, which encodes graph structures and takes the available explicit connectivity structure into account.
Our main contributions are three-fold. (1) We unify diverse types of structured data into a graph format and cast all D2T tasks as the graph-to-text generation task taking a graph as input and producing a text as output. (2) We propose a structure-aware pre-training method for D2T generation based on the T5 model, which incorporates relative positional information and graph structures into the original Transformer via two new position and attention matrices, respectively. (3) We conduct extensive experiments on six D2T benchmarks and achieve substantially better performance than strong baselines. We believe that the release of our unified D2T pre-training model will advance the research in this area.
2 Related Works
2.1 Data-to-Text Generation
Data-to-text (D2T) generation aims to produce output texts from structured data and has attracted noticeable attention from the natural language processing (NLP) community (Reiter and Dale, 1997). Recently, neural D2T models (Song et al., 2018; Zhu et al., 2019) have become the mainstream for this task and made impressive progress. End-to-end neural models generate text directly from structured data with an encoder-decoder architecture (Sutskever et al., 2014). These studies usually focus on improving the encoder with attention mechanisms (Koncel-Kedziorski et al., 2019; Mehta et al., 2022) or graph neural networks (GNNs) (Philipp and Schütze, 2021; Ribeiro et al., 2021a, b). For example, Wang et al. (2020) proposed a graph-to-sequence model using a pairwise interaction function to obtain semantic relations between concepts. Puduppully et al. (2022) suggested a neural architecture that incorporated a planning module to manage high-level information in a logical and meaningful manner. Liu et al. (2018) proposed a structure-aware sequence-to-sequence architecture, which incorporated the field information as additional input to the table encoder. Song et al. (2018) introduced graph recurrent networks (GRNs) to encode the AMR nodes directly. Subsequently, Shi et al. (2020) adopted GNNs as the structural encoder, updating the representations of nodes based on their immediate neighbors. To integrate both local and non-local features and learn a better structural representation of a graph, Guo et al. (2019) introduced dense connections that allow deep GCNs. Different from this local information aggregation scheme, Cai and Lam (2020) proposed a graph transformer that used explicit relation encoding and allowed direct communication between two distant nodes.
2.2 Data-to-Text Pre-training Models
Recently, we have witnessed the remarkable success of pre-training methods in a wide range of NLP tasks (Kenton and Toutanova, 2019; Radford et al., 2018; Lan et al., 2019; Bi et al., 2020). Most pre-training models are initially designed for text-to-text generation and lack the ability to encode structural information. Recently, several pre-training models have been designed for D2T tasks (Chen et al., 2020b; Agarwal et al., 2021; Ke et al., 2021; Bai et al., 2022). For example, KGPT (Chen et al., 2020b) proposed a distantly supervised learning method to exploit large-scale unlabeled web text for data-to-text pre-training. However, these pre-training models consider only one specific data structure and cannot be applied to diverse downstream D2T tasks. Although Tang et al. (2022) proposed a multi-task supervised pre-training model (MVP) for a series of D2T generation tasks, they utilized the original Transformer to encode the linearized input data without considering the graph structures. UniLM (Dong et al., 2019) was a pre-trained universal language model, which incorporated modified self-attention masks to facilitate bidirectional encoding or unidirectional decoding. While UniLM offers the flexibility of bidirectional encoding, its encoding attention mask is designed primarily for processing unstructured text, thereby restricting its ability to capture the structural characteristics of input graphs.
Different from previous work, we propose a unified pre-training model that casts all D2T tasks as the graph-to-text generation task. In addition, we incorporate graph structures into the original Transformer via two new position and attention matrices to effectively model the structured input data.
3 Pre-training Data Construction
Previous data-to-text pre-training datasets are usually tailored to specific structured data. In this paper, we collect eight D2T datasets from previous works and aggregate these datasets into a large corpus for pre-training our model. The statistics of pre-training data are provided in Table 1.
Statistics of our pre-training data.
| Statistics | PreData | DownData |
|---|---|---|
| # Datasets | 2 | 6 |
| # Instances | 4,951,267 | 2,240,927 |
| Avg. input tokens | 84.1 | 63.7 |
| Avg. target tokens | 90.8 | 100.9 |
| Avg. nodes | 17.8 | 19.4 |
| Avg. edges | 112.3 | 103.1 |
3.1 Existing Pre-training Datasets (PreData)
We first collect the table-text dataset TaPas (Herzig et al., 2020) and the graph-text dataset KGTEXT (Chen et al., 2020b), which were originally designed for table-to-text and graph-to-text pre-training, respectively. TaPas contains 6.2M tables from Wikipedia, and KGTEXT consists of 1.8M hyperlinked sentences from Wikipedia with the corresponding knowledge subgraphs from WikiData. We further devise a rule-based data-cleaning strategy to guarantee data quality. Finally, we obtain 4.9M data-text pairs (called PreData).
3.2 Existing Downstream Datasets (DownData)
We also collect the training sets from six data-to-text datasets, including WebNLG (Gardent et al., 2017), DART (Nan et al., 2020), ToTTo (Parikh et al., 2020), WikiBio (Lebret et al., 2016), WikiTableT (Chen et al., 2021), and CoSQL (Yu et al., 2019). These datasets were designed for downstream data-to-text generation tasks. Concretely, WebNLG and DART are graph-to-text datasets; WikiBio and WikiTableT contain key-value pairs; ToTTo and CoSQL are table-based datasets. In total, there are about 2.2M instances (DownData). Notably, the test sets utilized for downstream tasks are expressly omitted from the pre-training data, ensuring the integrity of our experimental results by eliminating any potential data leakage.
3.3 Unifying Structured Data
As illustrated in Figure 1, we unify different structured data (knowledge graphs, tables, key-value pairs) into an unlabeled and connected graph G = (V, E), which consists of a set of nodes V and a set of unlabeled edges E. Next, we elucidate the process of transforming the three distinct types of data (i.e., knowledge graphs, tables, and key-value pairs) into the unified graph G.
(1) On the left side of Figure 1’s Graph Data section, a knowledge graph can be formally expressed as Gkg = (Vkg, Ekg), where nodes are denoted by Vkg and labeled edges are represented as (vs, r, vt) ∈ Ekg, with r signifying the relation type. To more effectively model the relationships between nodes within the knowledge graph without modifying the underlying model architecture, we transform it into its equivalent Levi graph, as shown on the right side of the Graph Data section in Figure 1, following similar methodologies as in prior studies (Ribeiro et al., 2021b; Li et al., 2022). A Levi graph is formally characterized as an unlabeled, connected bipartite graph, denoted as G = (V, E). Specifically, each relation r is treated as a new graph node and amalgamated with all nodes in Vkg to form the comprehensive node set V. Subsequently, each edge (vs, r, vt) labeled with a relation type is converted into two unlabeled edges (vs, r) and (r, vt). In addition, for each unlabeled edge, corresponding reverse edges (r, vs) and (vt, r) are introduced. For instance, given a labeled edge (Dance of the Seven Veils, GENRE, incidental music), this conversion results in four unlabeled edges (Dance of the Seven Veils, GENRE), (GENRE, Dance of the Seven Veils), (GENRE, incidental music), and (incidental music, GENRE), constituting the final Levi graph G.
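To make the conversion concrete, the following is a minimal Python sketch (not the authors' released code) of turning labeled triples into an unlabeled Levi graph with forward and reverse edges; the function and variable names are our own.

```python
def triples_to_levi_graph(triples):
    """triples: iterable of (subject, relation, object) strings."""
    nodes, edges = set(), set()
    for subj, rel, obj in triples:
        # Each relation becomes a graph node of its own.
        nodes.update([subj, rel, obj])
        # Forward edges (subj, rel), (rel, obj) plus their reverse counterparts.
        edges.update([(subj, rel), (rel, subj), (rel, obj), (obj, rel)])
    return nodes, edges

# The example triple from the text yields the four unlabeled edges listed above.
nodes, edges = triples_to_levi_graph(
    [("Dance of the Seven Veils", "GENRE", "incidental music")]
)
```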
(2) In the Table Data section of Figure 1, situated on the left side, tabular data is conventionally structured as numerous cells organized according to their respective roles and interrelations. A table can be formally represented as T = {vi,j | 1 ≤ i ≤ N, 1 ≤ j ≤ M}, where vi,j denotes a table cell, and N and M represent the number of rows and columns in the table, respectively. Inspired by recent studies (Wang et al., 2022; Li et al., 2023a), we use a heuristic rule to transform the tabular data into a unified graph G by introducing unlabeled edges between cells based on their roles and relationships. This structural transformation preserves the table content while articulating the relationships among cells in the table. More precisely, all cells in T are treated as graph nodes in G, forming the node set V. Furthermore, we establish the set of unlabeled edges E in accordance with two guiding principles. First, for any two cells vi,j and vi,z situated within the same row, we introduce a forward edge (vi,j, vi,z) along with a corresponding reverse edge (vi,z, vi,j) into E. Second, for any two cells vi,j and vz,j located in the same column, we append a forward edge (vi,j, vz,j) and its corresponding reverse edge (vz,j, vi,j) to E. For instance, considering the right Table Data section in Figure 1, the cell “Arthur III” is linked not only to the cells “1457” and “1458” in the same row but also to the cells “Name” and “Peter II” in the same column. This configuration is based on empirical observations and insights gained from data analysis; we acknowledge that diverse node connectivity settings remain to be explored in future research. Given that the ToTTo dataset exclusively generates text for highlighted data, only the highlighted cells are considered as nodes.
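A minimal sketch of the two row/column connection rules, assuming the table has already been reduced to the relevant (e.g., highlighted) cells; the names and data layout are illustrative, not the paper's implementation.

```python
def table_to_graph(table):
    """table: list of rows, each a list of cell strings (only relevant cells)."""
    # Nodes are indexed by their (row, column) coordinates.
    nodes = {(i, j): cell for i, row in enumerate(table) for j, cell in enumerate(row)}
    edges = set()
    for (i, j) in nodes:
        for (x, y) in nodes:
            if (i, j) != (x, y) and (i == x or j == y):
                # Same row or same column: add a forward and a reverse edge.
                edges.add(((i, j), (x, y)))
                edges.add(((x, y), (i, j)))
    return nodes, edges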
(3) For the Key-Value data in Figure 1, both keys and values are regarded as nodes within G. In addition to the requisite connection edges linking each (key, value) pair (e.g., the connection between the key name and the value walter extra), we extend our connectivity framework to include connections among the keys themselves (e.g., the connection between nationality and birth_date) and among the values themselves (e.g., the connection between walter extra and german), drawing inspiration from the graph construction methodology commonly employed in table data analysis. In line with tabular data, we introduce both forward and reverse edges for any connected nodes within G.
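The key-value case follows the same pattern; the sketch below (again with our own, illustrative names) connects each key to its value, keys to keys, and values to values, always in both directions.

```python
from itertools import combinations

def key_value_to_graph(pairs):
    """pairs: list of (key, value) string tuples."""
    nodes = {token for kv in pairs for token in kv}
    edges = set()

    def connect(a, b):
        edges.update([(a, b), (b, a)])  # forward and reverse edges

    for key, value in pairs:
        connect(key, value)              # each (key, value) pair
    for (k1, _), (k2, _) in combinations(pairs, 2):
        connect(k1, k2)                  # keys with each other
    for (_, v1), (_, v2) in combinations(pairs, 2):
        connect(v1, v2)                  # values with each other
    return nodes, edges
```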
To ensure clarity and context in the generated text, we introduce two specific prefixes before the actual input data: (1) a data-independent prefix that universally states “describe the following data”; (2) a data-specific prefix, tailored to the nature and structure of the data at hand. We provide the data-specific prefixes for the three data structures in Table 2. For example, the triple “Jens_Hartel — club — Berliner_AK_07” from the DART dataset is combined with the common prefix and its special prefix to form the input “[Prefix] describe the following data: [Prefix] The category of the DBpedia entities is: SportsTeam. [Node] Jens_Hartel [Node] club [Node] Berliner_AK_07”. We abbreviate the data-independent and data-specific prefixes as “[Prefix-I]” and “[Prefix-S]”, respectively. The final input sequence with connectivity information is shown in Figure 2.
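A small sketch of how such an input string can be assembled from the two prefixes and the node texts; the helper name is ours, and the special tokens follow the notation used in the example above.

```python
def build_input_sequence(specific_prefix, node_texts):
    parts = ["[Prefix] describe the following data:",   # data-independent prefix
             f"[Prefix] {specific_prefix}"]              # data-specific prefix
    parts += [f"[Node] {text}" for text in node_texts]   # one marker per graph node
    return " ".join(parts)

print(build_input_sequence(
    "The category of the DBpedia entities is: SportsTeam.",
    ["Jens_Hartel", "club", "Berliner_AK_07"],
))
```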
The data-specific prefixes that are tailored for different types of data. Here, A, B, and C can be replaced by the content of specific samples.
| Type | Dataset | Prefix-S |
|---|---|---|
| Table | ToTTo | The table page title is: A, The table section title is: B |
| Table | CoSQL | select A from B where C |
| Graph | DART | The source is: A |
| Graph | WebNLG | The category of the entities is: A, The number of RDF triples is: B |
| Key-Value | WikiBio | The article title is: A |
| Key-Value | WikiTableT | the document title is: A, the section title is: B |
4 Methodology
4.1 Problem Definition
We convert different structured data into a graph format and cast all data-to-text tasks as the graph-to-text (G2T) generation task. Formally, the G2T model takes a graph G = (V, E) as input and produces a text Y = {y1,…, yn} as output, where V represents the entity set, E represents the relations between entities, and n is the length of the output text. Following previous studies (Ribeiro et al., 2020), we convert the graph G into an input sequence X = {x1,…, xm} consisting of m tokens.
4.2 Model Architecture
Our model is built upon the pre-trained T5 model, given the impressive performance of T5 on text generation tasks. It is noteworthy that our pre-training strategy is model-agnostic and potentially applicable to any Transformer-based backbone network. The encoder of the Transformer is composed of a stack of blocks, each of which contains a self-attention layer followed by a feed-forward network. The decoder has a structure similar to the encoder, except that it adds a standard encoder-decoder attention mechanism after each self-attention layer.
Preliminary
Transformer blocks on the T5-encoder side. The relative position and attention matrices in the self-attention calculation will be replaced by two novel position and attention matrices.
The original attention mechanism, designed to process unstructured natural language text, proves inadequate for capturing the inherent structures of graphs. To better process our structured graph data, we replace the position embeddings Pemb and the attention mask Amask in Equation (2) with two new position and attention matrices, respectively, making them aware of the underlying graph structures. Next, we elaborate on the construction of the position and attention matrices.
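Since the preliminary equations are not reproduced here, the sketch below shows the self-attention form we assume behind Equation (2): scaled dot-product attention with an additive relative-position bias (Pemb) and an additive attention mask (Amask). The exact formulation in the paper may differ; shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def self_attention(q, k, v, p_emb, a_mask):
    # q, k, v: (batch, length, d); p_emb, a_mask: (batch, length, length).
    # a_mask is additive: 0 where attention is allowed, a large negative value
    # where it is blocked (a binary 0/1 matrix would be mapped to this form).
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    scores = scores + p_emb + a_mask          # structural signals enter additively
    return F.softmax(scores, dim=-1) @ v
```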
4.3 Structure-enhanced Transformer
T5 is based on an encoder-decoder Transformer, which does not necessarily capture graph structures. To address this issue, we propose a structure-enhanced Transformer, which builds new position and attention matrices on the T5 encoder side. As illustrated in Figure 3, we use new position embedding and attention mask matrices (denoted as Pembnew and Amasknew) to replace Pemb and Amask in Equation (2), respectively. Specifically, we devise a position matrix for the Transformer to encode the relative positional information of connected nodes in the original input graph G. In addition, we propose a new attention matrix to replace the attention mask in the self-attention, which takes the available explicit connectivity structure of the input graph into account.
4.3.1 Position Matrix Construction
Integrating relational information about the graph structure into the Transformer architecture is essential for graph-to-text generation. Nevertheless, most previous Transformer-based methods (Xing and Wan, 2021; Han and Shareghi, 2022) learned position embeddings automatically, instead of explicitly encoding the structural relationships. For the input graph, we should only consider the relative position between connected nodes but ignore the relative position between irrelevant nodes. To this end, we replace the positional embeddings of the original Transformer with a position matrix that only establishes the relative position between each relevant node pair (connected items). In this way, we can explicitly capture the relative positions of all relevant nodes precisely.
Specifically, we first establish an auxiliary position matrix for each pair of connected nodes, similar to the green and yellow boxes in Figure 4. No matter how physically distant two relevant nodes may be, the corresponding auxiliary position matrix only accounts for the relative distance between the internal tokens of these two nodes, disregarding the nodes situated between them. For example, consider the input nodes “[Node] club” and “[Node] Jens Hartel”: since “club” is 3 units to the right of “Jens”, the value of cell [Jens, club] is 3. Notably, we only compute the relative distance between each connected node pair, while the distances of nodes lacking direct connections are set to “±inf”, signifying an infinite distance between them. For instance, the value assigned to the cell [Jens, Berliner] is “+inf” due to the absence of a direct connection between “[Node] Jens Hartel” and “[Node] Berliner AK 07”.
We construct a new position matrix Pembnew to replace the original position matrix Pemb used in Equation (2). We first set an auxiliary matrix for each edge between two nodes, and then copy the content of the auxiliary matrix into the final position matrix. The distances of nodes lacking direct connections will be set to “±inf”. The lighter the color, the farther the distance is.
After obtaining the auxiliary position matrix for each pair of connected items, we construct the position matrix for the entire input sequence by copying the cell values from the corresponding auxiliary position matrices. It is noteworthy that we seek to endow the prefixes (denoted as “[Prefix-I]” and “[Prefix-S]”) embedded within the input with the capacity to encapsulate global information. Therefore, we assume that these prefixes establish direct connections with all other nodes in the input. Finally, we replace the positional embeddings Pemb of the original Transformer with the learned position matrix Pembnew, so as to effectively capture the explicit relative distance between each pair of connected items.
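A hedged sketch of this construction: for every connected node pair (with prefixes treated as connected to everything), the two nodes' token spans are laid out side by side and their local offsets are copied into the full matrix; unconnected positions keep an "infinite" sentinel. The span/edge data structures and the INF constant are our own illustrative choices.

```python
INF = 10_000  # stands in for the paper's "±inf" for unconnected positions

def build_position_matrix(node_spans, edges, prefix_nodes, seq_len):
    """node_spans: {node_id: (start, end)} token spans in the input sequence
    (end exclusive, prefixes included); edges: iterable of (node_id, node_id)."""
    all_edges = set(edges)
    for p in prefix_nodes:                      # prefixes connect to every node
        for n in node_spans:
            if n != p:
                all_edges.update([(p, n), (n, p)])
    pos = [[INF] * seq_len for _ in range(seq_len)]
    for a, b in all_edges:
        (s1, e1), (s2, e2) = node_spans[a], node_spans[b]
        for i in range(s1, e1):
            for j in range(s2, e2):
                # Auxiliary matrix: place node a's tokens directly before node
                # b's tokens and measure the offset there, ignoring whatever
                # actually lies between them in the sequence.
                pos[i][j] = ((e1 - s1) + (j - s2)) - (i - s1)
    for i in range(seq_len):
        pos[i][i] = 0           # each token is at distance 0 from itself
    return pos
```

With the spans of “[Node] Jens Hartel” and “[Node] club” from the running example, this assigns 3 to the cell [Jens, club], matching the description above.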
4.3.2 Attention Matrix Construction
The self-attention in the original Transformer processes the input sequence by replacing each element with a weighted average of the entire sequence. Without refining this attention mechanism, the input data would be perceived as a fully connected graph, hindering the extraction of the inherent structural information. For this reason, we construct a relation-aware attention matrix to replace the original attention mask in self-attention. Concretely, if two elements have a direct relationship, we set the value of the corresponding cell to 1; otherwise, the value is set to 0. For example, as illustrated in Figure 5, since the items “Jens Hartel” and “club” have a direct connection, the values of cells (Jens, club) and (Hartel, club) are set to 1; since “Jens Hartel” and “Berliner AK 07” have no direct connection, the values of the corresponding cells such as (Jens, Berliner) and (Jens, AK) are set to 0. Here, we want the prefixes (i.e., “[Prefix-I]” and “[Prefix-S]”) within the input to carry global information, so we let the prefixes attend to all other elements. After obtaining the attention matrix (denoted as Amasknew), we replace the attention mask Amask of self-attention in Equation (2) with our new attention matrix Amasknew, so as to effectively capture the graph structures, as shown in Figure 3.
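A minimal sketch of the binary attention matrix under the rules above (1 for tokens of the same node, of directly connected nodes, or involving a prefix; 0 otherwise); the data structures and names are illustrative assumptions.

```python
def build_attention_matrix(node_spans, edges, prefix_nodes, seq_len):
    """node_spans: {node_id: (start, end)}; edges: iterable of (node_id, node_id)."""
    mask = [[0] * seq_len for _ in range(seq_len)]

    def fill(span_a, span_b):
        for i in range(*span_a):
            for j in range(*span_b):
                mask[i][j] = 1

    for span in node_spans.values():
        fill(span, span)                              # self-connection within a node
    for a, b in edges:
        fill(node_spans[a], node_spans[b])            # directly connected nodes
    for p in prefix_nodes:
        for span in node_spans.values():
            fill(node_spans[p], span)                 # prefixes carry global attention
            fill(span, node_spans[p])
    return mask
```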
We construct a new attention matrix Amasknew to replace the attention mask Amask of self-attention in Equation (2). The values of the colored cells are set to 1, while the values of the uncolored cells are set to 0. The blue color represents global attention, the gray color represents the self-connection of nodes, and the green and yellow colors represent the two connected edges.
4.4 Pre-training Objectives
Similar to Andrejczuk et al. (2022), we first use the publicly available T5 checkpoints provided by Herzig et al. (2020) as the initialization. Then, we pre-train our model on our pre-training data. We employ two objectives to pre-train our model in a multi-task learning paradigm, including struct denoising and text generation objectives. In Table 3, we provide two specific training instances (input and output pairs) for the struct denoising and graph-to-text generation objectives.
The examples of input-output pairs for struct denoising and graph-to-text generation objectives.
| Task | Inputs | Targets |
|---|---|---|
| Struct Denoising | The category of the DBpedia entities is: <extra_id0>. ‘Bakewell pudding’, ‘dish variation’, ‘<extra_id1>’, ‘main ingredients’, ‘Ground almond, jam, butter, eggs’ | <extra_id0> Food <extra_id1> Bakewell tart |
| Graph-to-Text Generation | Describe the following data: The category of the DBpedia entities is: Food. ‘Bakewell pudding’, ‘dish variation’, ‘Bakewell tart’, ‘main ingredients’, ‘Ground almond, jam, butter, eggs’ | Bakewell tart is a variation of Bakewell pudding and some of the main ingredients are ground almonds, jam, butter and eggs. |
Struct Denoising Objective
We design a struct denoising strategy for table-like data, following the method used in T5, by training the model to predict a target sequence containing the missing or corrupted tokens in the input graph. We apply a noise function to construct a noisy input graph; in particular, the noise function masks 15% of the nodes while maintaining the related edges in the graph. The goal of the struct denoising objective is to reconstruct a target output that contains all the dropped-out nodes, delimited by sentinel tokens. This pre-training objective helps the UniD2T model capture relationships between neighboring nodes in the input graph.
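A rough sketch of the noise function described above: about 15% of the node texts are replaced by T5-style sentinel tokens, and the target lists each dropped node after its sentinel (edges are left untouched). The function and parameter names are our own.

```python
import random

def struct_denoise(node_texts, mask_rate=0.15, seed=0):
    rng = random.Random(seed)
    n_mask = max(1, int(len(node_texts) * mask_rate))
    masked_ids = sorted(rng.sample(range(len(node_texts)), n_mask))
    corrupted, target = list(node_texts), []
    for sent_id, idx in enumerate(masked_ids):
        sentinel = f"<extra_id{sent_id}>"
        target += [sentinel, node_texts[idx]]   # e.g., "<extra_id0> Food ..."
        corrupted[idx] = sentinel                # the node's edges are kept
    return corrupted, " ".join(target)
```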
Graph-to-Text Generation Objective
The graph-to-text generation objective trains the model to generate the target descriptive text conditioned on the linearized input graph, as illustrated by the second example in Table 3.
5 Experimental Setup
5.1 Tasks and Datasets
To verify the generality and effectiveness of UniD2T, we conduct experiments on three types of data-to-text datasets. In particular, WebNLG (Gardent et al., 2017) and DART (Nan et al., 2020) are used for evaluating graph-to-text generation; WikiBio (Lebret et al., 2016) and WikiTableT (Chen et al., 2021) are utilized for evaluating key-value-to-text generation; ToTTo (Parikh et al., 2020) and CoSQL (Yu et al., 2019) are used for evaluating table-to-text generation. Table 4 provides the statistics of these six datasets.
Statistics of downstream datasets.
| Dataset | Train | Valid | Test |
|---|---|---|---|
| ToTTo | 120,761 | 7,700 | 7,700 |
| CoSQL | 7,845 | 1,074 | − |
| WebNLG | 13,211 | 1,667 | 1,779 |
| DART | 62,659 | 6,980 | 12,552 |
| WikiBio | 582,657 | 72,831 | 72,831 |
| WikiTableT | 1,453,794 | 4,533 | 4,351 |
5.2 Implementation Details
In the pre-training stage, our model is initialized with T5-Large. We pre-train our UniD2T model on NVIDIA A100 GPUs. The maximum lengths of the input and target sequences are set to 1024 and 512 tokens, respectively. We set the batch size to 8. Gradient clipping is applied to the model with a maximum gradient value of 1. To alleviate overfitting, the maximum number of training steps is 500k, with a patience of 25k steps, i.e., if the evaluation metric does not improve within 25k steps, training is stopped early. We set the maximum learning rate to 1e-5.
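For reference, the stated hyperparameters can be summarized in a single configuration dictionary; optimizer and scheduler details beyond the learning rate are not specified in the text and are therefore omitted here.

```python
PRETRAIN_CONFIG = {
    "init_checkpoint": "t5-large",
    "max_input_length": 1024,       # tokens
    "max_target_length": 512,       # tokens
    "batch_size": 8,
    "gradient_clip_value": 1.0,
    "max_steps": 500_000,
    "early_stop_patience": 25_000,  # steps without metric improvement
    "learning_rate": 1e-5,
}
```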
6 Experimental Results
6.1 Table-to-Text Generation
We conduct experiments on two table-to-text datasets, including ToTTo and CoSQL. The SQL queries within CoSQL and the table header information from ToTTo are strategically positioned within the data-specific prefixes, denoted as “[Prefix-S]”, as illustrated in Table 2.
ToTTo
ToTTo is an open-domain table-to-text dataset in which crowd annotators highlight table cells and revise the corresponding natural language descriptions. We compare our UniD2T with several strong baselines, including BERT-to-BERT (Rothe et al., 2020), LATTICE (Wang et al., 2022), CoNT (An et al., 2022), PlanGen (Su et al., 2021), and TABT5 (Andrejczuk et al., 2022). TABT5 is a pre-trained model tailored for table-to-text generation. We adopt BLEU (Papineni et al., 2002) and PARENT (Dhingra et al., 2019) as the evaluation metrics. The experimental results on ToTTo are summarized in Table 5. Our model achieves substantially better performance than the compared methods on ToTTo under the overall, overlap, and non-overlap settings. First, our model shows an improvement over T5 and TABT5, especially in terms of PARENT. Second, our model also achieves better results than the strong downstream methods.
Results on the ToTTo test set.
| Models | Overall BLEU | Overall PARENT | Overlap BLEU | Overlap PARENT | Non-Overlap BLEU | Non-Overlap PARENT |
|---|---|---|---|---|---|---|
| ChatGPT (gpt-3.5-turbo) | 20.5 | 49.5 | 24.4 | 51.2 | 17.5 | 47.7 |
| BERT-to-BERT (Rothe et al., 2020) | 44.0 | 52.6 | 52.7 | 58.4 | 35.1 | 46.8 |
| LATTICE (Wang et al., 2022) | 48.4 | 58.1 | 56.1 | 62.4 | 40.4 | 53.9 |
| CoNT (An et al., 2022) | 49.1 | 58.9 | 56.7 | 63.2 | 41.3 | 54.6 |
| PlanGen (Su et al., 2021) | 49.2 | 58.7 | 56.9 | 62.8 | 41.4 | 54.2 |
| T5-3B | 49.5 | 58.4 | 57.5 | 62.6 | 41.4 | 54.2 |
| TABT5 (Andrejczuk et al., 2022) | 49.2 | 57.2 | − | − | 41.0 | 52.7 |
| UniD2T | 49.9 | 59.8 | 57.8 | 64.0 | 42.0 | 55.7 |
CoSQL
CoSQL serves as a prevalent benchmark for evaluating table-to-text models (Fang et al., 2022b; Li et al., 2023b). Each instance within CoSQL comprises an SQL query, the resultant table, and the corresponding response, where the SQL query gives explicit signals for models on what to generate. The generated description could provide a concise and easy-to-understand summary of the result table and help users verify whether the queried result is consistent with the original question. We compare our model with GraphWriter (Koncel-Kedziorski et al., 2019), BART-Base, T5-Large, and FALCON (Fang et al., 2022a) which is a faithful contrastive generation framework based on T5. We adopt BLEU (Papineni et al., 2002) and ROUGE-L (Lin, 2004) as evaluation metrics. Since CoSQL does not release the test set, we follow FALCON and report the experimental results on the development set in Table 6. Our UniD2T model achieves significantly better performance than baselines. The BLEU and ROUGE scores increase by 7.03 and 3.58, respectively, over the best-performing baseline FALCON.
Results on CoSQL development set.
| Models | BLEU | ROUGE-L |
|---|---|---|
| GraphWriter | 16.86 | 47.44 |
| FALCON | 25.65 | 57.89 |
| BART-Base | 24.60 | 57.39 |
| T5-Large | 25.25 | 57.54 |
| UniD2T | 32.68 | 61.47 |
6.2 Graph-to-Text Generation
We conduct experiments on two graph-to-text datasets, including DART and WebNLG.
DART
DART is a large dataset for open-domain text generation that treats the input as a set of RDF entity-relation triples. We compare our UniD2T model with several baselines, including the end-to-end Transformer, the pre-trained models BART and T5, and the state-of-the-art method CONTROL PREFIXES (Clive et al., 2021). BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), and TER (Snover et al., 2005) are adopted as evaluation metrics. As shown in Table 7, our model surpasses the best-performing model CONTROL PREFIXES by 3.0 BLEU points.
Evaluation results on DART test set. Results with † are taken from DART (Nan et al., 2020).
| Models | BLEU | METEOR | TER |
|---|---|---|---|
| End-to-End Transformer† | 27.24 | 0.25 | 0.65 |
| LSTM with Attention† | 29.66 | 0.27 | 0.63 |
| CONTROL PREFIXES | 51.95 | 0.41 | 0.43 |
| ChatGPT (gpt-3.5-turbo) | 40.51 | 0.37 | 0.53 |
| BART-Base† | 47.11 | 0.38 | 0.46 |
| BART-Large† | 48.56 | 0.39 | 0.45 |
| T5-Small† | 47.69 | 0.39 | 0.46 |
| T5-Base† | 49.21 | 0.40 | 0.44 |
| T5-Large† | 50.66 | 0.40 | 0.43 |
| UniD2T | 54.96 | 0.42 | 0.42 |
WebNLG
WebNLG (Zhou and Lampouras, 2020) consists of a set of triples collected from DBpedia and the corresponding manually annotated text. BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), chrF++ (Popović, 2015), TER (Snover et al., 2005), and BLEURT (Sellam et al., 2020) are adopted as evaluation metrics. We compare our method with both pre-trained language models and strong downstream baselines. The overall experimental results on WebNLG are shown in Table 8. Our model achieves the highest performance among all baseline models, including the graph pre-training model TRIPLE (Han and Shareghi, 2022).
Evaluation results on WebNLG test set. CP stands for CONTROL PREFIXES (Clive et al., 2021).
| Model | BLEU | METEOR | chrF++ | TER | BLEURT |
|---|---|---|---|---|---|
| CP | 54.97 | 41.7 | 69.3 | 39.8 | 0.62 |
| CP + DART | 55.41 | 41.9 | 69.8 | 39.2 | 0.63 |
| T5-Large | 51.74 | 40.3 | 66.9 | 41.7 | 0.61 |
| TRIPLE | 57.64 | 42.24 | − | 38.9 | − |
| UniD2T | 60.41 | 44.35 | 73.4 | 34.1 | 0.65 |
6.3 Key-Value-to-Text Generation
We conduct experiments on two key-value-based datasets, including WikiBio and WikiTableT.
WikiBio
WikiBio is designed to generate descriptions from a Wikipedia infobox and aims to generate the first sentence of a biography. We compare UniD2T with the previous state-of-the-art model CoNT (An et al., 2022), pre-trained models (T5-Large, KGPT), and the non-autoregressive model SANA (Wang et al., 2021) on WikiBio. BLEU and PARENT are adopted as evaluation metrics. The results are reported in Table 9. UniD2T outperforms the best baseline CoNT by 3.7% on BLEU.
Results on WikiBio and WikiTableT test sets.
| Models | WikiBio BLEU | WikiBio PARENT | WikiTableT BLEU | WikiTableT PARENT |
|---|---|---|---|---|
| Transformer | 44.3 | 74.0 | 19.5 | 42.8 |
| SANA | 45.7 | 76.9 | − | − |
| CoNT | 47.1 | − | − | − |
| KGPT | 45.1 | 76.3 | 31.8 | 48.5 |
| T5-Large | 48.6 | 77.5 | 31.4 | 47.6 |
| UniD2T | 50.4 | 79.8 | 33.7 | 50.7 |
WikiTableT
WikiTableT is collected from Wikipedia sections with their corresponding tabular data and contains millions of instances. We compare UniD2T with Transformer, T5-Large, and KGPT (Chen et al., 2020b). Experimental results in Table 9 show that UniD2T exceeds the best competitor KGPT by 1.9 points on BLEU and 2.2 points on PARENT.
6.4 Further Analysis
6.4.1 Ablation Study
We conduct experiments to investigate the impact of pre-training with the graph structure versus the linear structure. The ablation results are summarized in Table 10, which is divided into two parts: the first part shows the results of directly fine-tuning the pre-trained language model (i.e., T5-Large) on the downstream datasets (DownData), while the second part presents the results of incorporating additional pre-training data (PreData) on top of T5-Large. Through careful analysis, we observe that UniD2T (T5-Large + PGraph + FGraph) consistently outperforms T5-Large + FGraph across all six data-to-text datasets, resulting in a notable improvement of +20.6 in the total score. In addition, we observe that T5-Large + FGraph outperforms T5-Large + FLinear in terms of the total score by +6.1. This result clearly indicates that our method significantly improves over data-to-text models that linearize the structured data as input when fine-tuning on the downstream datasets. Finally, we delve into the effects of the pre-training datasets. By comparing P*Graph + FGraph with PGraph + FGraph, and P*Linear + FLinear with PLinear + FLinear, we observe that including the downstream datasets in pre-training improves the model’s performance and accelerates the pre-training process. It is noteworthy that pre-training on both PreData and DownData achieves the best performance across all the experimental datasets.
Ablation test results on six benchmark datasets. PLinear and PGraph represent models pre-trained with the linear structure and the graph structure, respectively. FLinear and FGraph represent models fine-tuned with the linear structure and the graph structure, respectively. P* stands for pre-training only with PreData; P indicates pre-training with both PreData and DownData.
| Model | ToTTo | CoSQL | DART | WebNLG | WikiBio | WikiTableT | Total Score |
|---|---|---|---|---|---|---|---|
| Only Fine-tuning | | | | | | | |
| Previous SOTA | 49.2 | 25.6 | 51.9 | 57.6 | 48.6 | 31.8 | − |
| T5-Large + FLinear | 48.1 | 25.2 | 50.6 | 51.7 | 48.6 | 31.4 | 255.6 |
| T5-Large + FGraph | 49.1 | 26.7 | 51.2 | 53.1 | 49.4 | 32.2 | 261.7 |
| With Additional Pre-training | | | | | | | |
| T5-Large + PGraph + FGraph (UniD2T) | 50.2 | 32.7 | 54.9 | 60.4 | 50.4 | 33.7 | 282.3 |
| T5-Large + P*Graph + FGraph | 49.3 | 27.9 | 53.6 | 54.7 | 50.1 | 32.4 | 268.0 |
| T5-Large + PLinear + FLinear | 48.7 | 25.8 | 53.1 | 56.7 | 49.1 | 31.7 | 265.1 |
| T5-Large + P*Linear + FLinear | 48.3 | 25.7 | 50.9 | 52.8 | 48.7 | 31.5 | 257.9 |
We also delve into the effects of two Transformer modifications (position and attention matrix construction). The results are illustrated in Table 11. From the results, we observe a significant performance drop when either the structure-aware position or attention matrices are removed, demonstrating the benefits of two Transformer modifications. It is no surprise that combining all the factors achieves the best performance. These findings collectively demonstrate the effectiveness of our proposed method, which explicitly models its graph structure through the use of structure-aware position and attention matrices.
Ablation test results on WebNLG test set.
| Models | BLEU | METEOR | chrF++ | TER | BLEURT |
|---|---|---|---|---|---|
| UniD2T | 60.4 | 44.4 | 73.4 | 34.1 | 0.65 |
| − attention | 58.6 | 42.7 | 70.3 | 37.2 | 0.64 |
| − position | 58.3 | 42.6 | 70.2 | 36.7 | 0.64 |
| − all | 56.7 | 42.3 | 69.8 | 37.8 | 0.63 |
6.4.2 Few-Shot Results
We conduct few-shot experiments on the E2ENLG dataset (Dušek et al., 2020), which is sourced from the restaurant domain rather than Wikipedia. This serves as an additional validation of the model’s generalization capabilities. The E2ENLG dataset, assembled through the CrowdFlower platform, contains details about restaurants and comprises over 50,000 combinations of dialogue-act-based meaning representations (MRs) with an average of 8.1 references. We fine-tune UniD2T using varying proportions of the training instances (i.e., 0.1%, 0.5%, 1%, 5%, and 10%) from E2ENLG. We compare UniD2T with several few-shot learning methods, including TGen (Dušek and Jurčíček, 2016), Template-GPT-2 (Chen et al., 2020a), and KGPT (Chen et al., 2020b). The experimental results are summarized in Table 12. UniD2T significantly outperforms all baselines in the various few-shot settings.
Few-shot results on the E2ENLG test set.
| Model | 0.1% | 0.5% | 1% | 5% |
|---|---|---|---|---|
| TGen | 3.6 | 27.9 | 35.2 | 57.3 |
| Template-GPT-2 | 22.5 | 47.8 | 53.3 | 59.9 |
| KGPT-Graph | 39.8 | 53.3 | 55.1 | 61.5 |
| KGPT-Seq | 40.2 | 53.0 | 54.1 | 61.1 |
| UniD2T | 45.6 | 57.3 | 57.6 | 64.8 |
6.4.3 Human Evaluation
We also conduct a human evaluation to analyze the generated sentences, following Chen et al. (2020b). It is worth noting that each evaluator is unaware of which model generated the text being evaluated, so as to avoid evaluation bias. Specifically, we choose 100 test samples from WebNLG and examine the factual consistency between the gold sentences and the generated sentences. We invite four NLP workers to assign each text a label from {Hallucination, Missing Fact, Accurate}, similar to Chen et al. (2020b). As shown in Figure 6, our UniD2T is less prone to hallucinating non-existent facts and can generate more accurate sentences.
Human evaluation of the factual consistency of different models on WebNLG samples.
6.4.4 Impact on Graph Sizes
To illustrate the effectiveness of the graph structure, we further investigate the performance of PLinear + FLinear and PGraph + FGraph with respect to different graph sizes on the WebNLG validation set. Experimental results in terms of BLEU are shown in Figure 7. When the graph structure is simple, the impact of the graph structure is limited. However, as the graph structure becomes more complex, the model with the graph structure (PGraph + FGraph) performs much better than the model with the linear structure (PLinear + FLinear). Thus, the structure-enhanced model UniD2T demonstrates greater stability and better performance on large-scale inputs compared to linear sequence models.
Comparing the BLEU score changes of PLinear + FLinear and PGraph + FGraph as the number of triples increases, on the seen and unseen splits of WebNLG.
6.4.5 Impact on Model Sizes
To investigate the influence of different model scales on the experimental results, we conduct experiments using FGraph on T5-Small, T5-Base, T5-Large, and T5-3B on the DART and ToTTo dev sets without pre-training. It is important to note that we conduct evaluations on the dev sets rather than the test sets. This decision is made due to the constraints imposed by the ToTTo dataset, where obtaining test results requires submitting predictions to the leaderboard and awaiting the evaluation process, which can be time-consuming. Therefore, to expedite our research and streamline the experimentation process, we rely on the readily available development sets for our evaluations. The results are presented in Table 13. Notably, the transition from T5-Large to T5-3B increases the number of parameters by approximately 3.9 times, yet the corresponding improvement in efficacy is less than 1%. This analysis sheds light on the limited impact of scaling up the model size beyond a certain threshold, given the marginal gains in performance despite the significant increase in parameter count.
The performance of T5 with different model scales on the dev sets of DART and ToTTo datasets, without performing any pre-training.
| Model | ToTTo BLEU | ToTTo PARENT | DART BLEU | DART METEOR | DART TER |
|---|---|---|---|---|---|
| T5-Small + FGraph | 45.5 | 53.3 | 48.8 | 0.39 | 0.45 |
| T5-Base + FGraph | 48.6 | 58.8 | 50.2 | 0.40 | 0.44 |
| T5-Large + FGraph | 49.1 | 59.4 | 51.2 | 0.40 | 0.43 |
| T5-3B + FGraph | 49.8 | 59.7 | 51.4 | 0.41 | 0.43 |
6.5 The Zero-shot Performance of ChatGPT
We conduct zero-shot experiments using ChatGPT on the ToTTo and DART datasets to establish additional baselines. The results of these experiments are presented in Table 5 and Table 7. The prompt for ChatGPT comprises two parts, and detailed examples of these prompts can be found in Table 14.
Input examples for ChatGPT on ToTTo and DART. Here, Prompt represents task description, and Structured Input represents data input with specific formats.
| Dataset | Input example |
|---|---|
| ToTTo | Prompt: Put the highlighted-table together to form a sentence: Structured Input: <page_title> List of Malayalam films of 1976 </page_title> <table> <cell> Surveykkallu <col_header> Film </col_header> </cell> <cell> Thoppil Bhasi <col_header> Director </col_header> </cell> </table> |
| DART | Prompt: Put the triples together to form a sentence: Structured Input: Mars Hill College: joined: 1973 — Mars Hill College: location: Mars Hill, North Carolina |
From the results, we observe that ChatGPT performs inconsistently across different measures. For instance, on the non-overlap subset of the ToTTo dataset, compared to BERT-to-BERT, the BLEU score decreases by 17.6 points, while the PARENT score exhibits a slight increase of 0.9 points. The divergence in BLEU indicates that ChatGPT generates responses with different word choices, leading to reduced word overlap with the reference, whereas the improvement in the PARENT score suggests enhanced structural and content-related aspects in the generated responses. These findings underscore the importance of employing multiple evaluation metrics to comprehensively assess sophisticated language generation systems in future work.
6.5.1 Impact on Edge Directionality
We examine the significance of edge directionality and present the experimental results of incorporating edge direction in Table 16. For UniD2Tdirected, we consider the input directed graph using only its original directed edges (uni-directional) and remove the reverse edges added by UniD2T (see Section 3.3 for details about the reverse edges). From Table 16, we observe that incorporating edge direction has a deleterious effect on the performance of the pre-trained models. Several factors may underlie these outcomes. (1) First, the pre-training models aim to learn general representations of structured data. However, due to the vast scale of multi-source data, it is often unfeasible to assign a direction to each data pair. For example, the tabular format constitutes a fundamental type of structured data, yet the absence of explicit edge directionality is a typical characteristic of the data pairs within this format. Therefore, we default to using bidirectional edges to signify mutual relationships between two entities. (2) Second, we anticipate that learning coarse relationships between entities through undirected graphs during the pre-training phase offers greater flexibility to accommodate various types of relationships in different fields. For instance, the directional link “Jay Chou→Common Jasmine Orange” conveys that Jay Chou released the album Common Jasmine Orange, while the reverse link “Common Jasmine Orange→Jay Chou” signifies that Common Jasmine Orange is one of Jay Chou’s albums. In most cases, it is unnecessary to describe the specific relationship in detail, as the data primarily needs to indicate that a connection exists.
6.6 Case Study
As illustrated in Figure 8, we further verify the effectiveness of UniD2T qualitatively by presenting sentences generated by UniD2T and T5-Large. Both UniD2T and T5-Large are capable of generating the main entities. However, there are notable differences in the quality and coherence of the generated sentences. Specifically, the sentences generated by T5-Large tend to exhibit shortcomings in including key information and in logical reasoning. For instance, in the first case, T5-Large fails to infer that the “Baltimore World Trade Center” is the tallest building. This illustrates the limitation of T5-Large in capturing and incorporating specific facts with logical reasoning. In contrast, UniD2T produces sentences that are more accurate and complete, and that encompass the main entities and logical information with greater precision. This highlights the advantages of UniD2T in generating more contextually appropriate and logically grounded sentences.
Examples of generated sentences. The main entity is highlighted in green, and the words that are not faithful to the input are in red. Important information common to both models is indicated in blue.
6.7 The Diversity of Generated Sentences
We evaluate the diversity of the sentences generated by UniD2T and compare it with strong baselines (i.e., T5-Large and ChatGPT). To quantify diversity, we utilize the Distinct-N metric (Li et al., 2016), which calculates the number of distinct N-grams divided by the total number of generated tokens. The experimental results are presented in Table 15. By analyzing the results, it is evident that UniD2T achieves notably higher Distinct-2/3/4 scores than T5-Large. This suggests that UniD2T generates sentences with a greater variety of unique bigrams, trigrams, and 4-grams than T5-Large, indicating a higher level of linguistic diversity in the output. However, ChatGPT achieves better diversity scores than UniD2T; it tends to generate more diverse words that are not included in our vocabulary, although these words may describe non-existent content.
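For completeness, a minimal sketch of the Distinct-N computation used here: the number of distinct n-grams divided by the total number of generated tokens, over the whole set of generated sentences.

```python
def distinct_n(sentences, n):
    ngrams, total_tokens = set(), 0
    for sent in sentences:
        tokens = sent.split()
        total_tokens += len(tokens)
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(ngrams) / max(total_tokens, 1)
```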
The results of diversity evaluation on DART test set.
| Models | Distinct-1 | Distinct-2 | Distinct-3 | Distinct-4 |
|---|---|---|---|---|
| ChatGPT | 7.56 | 18.93 | 28.33 | 35.75 |
| T5-Large | 6.94 | 13.94 | 19.00 | 23.00 |
| UniD2T | 6.58 | 14.72 | 21.22 | 26.38 |
The results of our models with undirected graphs (i.e., UniD2T) and directed graphs (denoted as UniD2Tdirected), respectively.
| Edge | WikiBio BLEU | WikiBio PARENT | WikiTableT BLEU | WikiTableT PARENT |
|---|---|---|---|---|
| UniD2T | 50.4 | 79.8 | 33.7 | 50.7 |
| UniD2Tdirected | 48.8 | 78.5 | 31.7 | 48.3 |
6.8 Limitations
Based on our empirical observations, we identify several limitations of this work, which fall into three primary categories. (1) Our pre-training data is limited: it contains only two existing pre-training datasets and six downstream datasets. In the future, we would like to collect more D2T datasets so as to construct a larger and more diverse pre-training corpus. (2) In this work, we unify different structured data into the graph format using a simple and direct method. We will attempt to exploit more advanced strategies to construct graphs from different structured data. (3) This study focuses on modeling graph structures and incorporating the structural information into the Transformer. However, the pre-training objectives could be refined to further improve representation learning.
7 Conclusion
In this paper, we proposed a unified data-to-text pre-training method, which could be applied to various downstream data-to-text generation tasks. Concretely, we first converted different types of structured data into graph format. Then, we devised a structure-enhanced Transformer to capture graph structures by introducing two new position and attention matrices to replace the position embedding and attention mask in the self-attention of the Transformer. Extensive experiments on six data-to-text benchmark datasets demonstrated that UniD2T achieved substantially better performance than strong baselines by enabling better information sharing and representation learning of data structures across diverse data-to-text datasets.
Acknowledgments
Min Yang was supported by National Key Research and Development Program of China (2022YFF0902100), National Natural Science Foundation of China (62376262), Shenzhen Science and Technology Innovation Program (KQTD20190929172835662), and Shenzhen Basic Research Foundation (JCYJ20210324115614039 and JCYJ20200109113441941). This work was supported by Alibaba Group through Alibaba Innovative Research Program.
References
Author notes
Work done during an internship at Alibaba DAMO Academy.
Action Editor: Benjamin Van Durme