Unifying Structured Data as Graph for Data-to-Text Pre-Training

Data-to-text (D2T) generation aims to transform structured data into natural language text. Data-to-text pre-training has proved to be powerful in enhancing D2T generation and yields impressive performance. However, previous pre-training methods either oversimplified structured data into a sequence without considering input structures or designed training objectives tailored for a specific data structure (e.g., table or knowledge graph). In this paper, we unify different types of structured data (i.e., table, key-value data, knowledge graph) into the graph format and cast different D2T generation tasks as graph-to-text generation. To effectively exploit the structural information of the input graph, we propose a structure-enhanced pre-training method for D2T generation by designing a structure-enhanced Transformer. Concretely, we devise a position matrix for the Transformer, encoding relative positional information of connected nodes in the input graph. In addition, we propose a new attention matrix to incorporate graph structures into the original Transformer by taking the available explicit connectivity structure into account. Extensive experiments on six benchmark datasets show the effectiveness of our model. Our source code is available at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/unid2t.


Introduction
Data-to-text (D2T) generation, which aims to generate a target natural language text conditioned on source structured data, has attracted noticeable attention due to its wide applications such as journalism (Rebuffel et al., 2020), medical diagnosis (Nishino et al., 2020), financial and weather reports (Liang et al., 2009), and sports broadcasting (Chen and Mooney, 2008). The input structured data can include tables of records, simulations of physical systems, spreadsheets, knowledge graphs, and so on. Transforming structured data into text helps a wide range of users understand and use the structured data, which is needed in many real-life scenarios.
Recently, large-scale pre-trained models have proved to be powerful in D2T generation and yield impressive performance (Kale and Rastogi, 2020; Xing and Wan, 2021; Liu et al., 2022), benefiting from the rich knowledge contained in large-scale pre-training corpora. Xing and Wan (2021) proposed a structure-aware table-to-text pre-training model, which devised three self-supervised training objectives tailored for modeling tables and their contexts. Ke et al. (2021) adopted a structure-aware semantic aggregation module to model the structure of an input graph at each Transformer layer, and explicitly learned graph-text alignments instead of directly fine-tuning text-to-text pre-trained models on graph-to-text corpora.
Although significant progress has been made in this field, there are still several technical challenges with existing data-to-text pre-training methods. Most prior studies adopted cumbersome designs tailored for a specific data structure such as tables (Liu et al., 2022) or knowledge graphs (Li et al., 2022), and thus could not effectively deal with diverse structured data in a unified framework. Kale and Rastogi (2020) were the first to study the "pre-train and fine-tune" strategy on several benchmarks spanning task-oriented dialogue, table-to-text, and graph-to-text. However, their approach oversimplified the input structured data into a flat string and adopted the original Transformer without capturing the structural information of the source structured data.
In this paper, we unify the structured data into the graph format for data-to-text pre-training (denoted as UniD2T). We convert diverse types of structured data into a unified graph format, keeping the structural information of the structured data. We treat the items in the structured data as a set of nodes and connect the nodes according to the connectivity of the structured data. In this way, we can cast various data-to-text tasks as the graph-to-text generation task.
To effectively encode the graph structure, we propose a structure-enhanced pre-training model, which can be applied to various downstream data-to-text generation tasks. Our proposed data-to-text pre-training model is built upon the T5 model (Raffel et al., 2020). Since the T5 model is a text-to-text transfer Transformer framework and cannot effectively encode the graph structure, we propose a structure-enhanced Transformer to encode the structural information. Concretely, we propose an explicit position matrix for the Transformer, encoding the relative positional information of connected nodes in the input graph. In addition, we build a new attention matrix to replace the attention mask in the self-attention of the original Transformer, which encodes graph structures and takes the available explicit connectivity structure into account.
Our main contributions are three-fold. (1) We unify diverse types of structured data into a graph format and cast all data-to-text tasks as the graph-to-text generation task, taking a graph as input and producing a text as output. (2) We propose a structure-aware pre-training method for D2T generation based on the T5 model, which incorporates relative positional information and graph structures into the original Transformer via two new position and attention matrices, respectively. (3) We conduct extensive experiments on six data-to-text benchmarks and achieve substantially better performance than strong baselines. We believe that the release of our unified data-to-text pre-training model will push forward the research in this area.
2 Related Works

2.1 Data-to-Text Generation
Data-to-text (D2T) generation aims to produce output texts from structured data and has attracted noticeable attention from the natural language processing (NLP) community (Reiter and Dale, 1997). Recently, neural D2T models (Song et al., 2018; Zhu et al., 2019) have become the mainstream for this task and made impressive progress. The end-to-end neural models generate text directly from structured data by using an encoder-decoder architecture (Sutskever et al., 2014). These works usually focus on improving the encoder structures based on attention mechanisms (Koncel-Kedziorski et al., 2019; Mehta et al., 2022) or graph neural networks (GNNs) (Philipp and Schütze, 2021; Ribeiro et al., 2021a,b). For example, Wang et al. (2020) proposed a graph-to-sequence model using a pairwise interaction function to obtain semantic relations between concepts. Puduppully et al. (2022) suggested a neural architecture that incorporated a planning module to manage high-level information in a logical and meaningful manner. Liu et al. (2018) proposed a structure-aware sequence-to-sequence architecture, which incorporated the field information as additional input to the table encoder. Song et al. (2018) introduced graph recurrent networks (GRNs) to encode the AMR nodes directly. Subsequently, Shi et al. (2020) proposed GNNs as the structural encoder, which updated the representations of nodes based on their immediate neighbors. To integrate both local and non-local features and learn a better structural representation of a graph, Guo et al. (2019) introduced dense connections and allowed deep GCNs. Different from the local information aggregation scheme, Cai and Lam (2020) proposed a graph transformer that used explicit relation encoding and allowed direct communication between two distant nodes.

Data-to-Text Pre-training Models
Recently, we have witnessed the remarkable success of pre-training methods in a wide range of NLP tasks (Kenton and Toutanova, 2019; Radford et al.; Lan et al., 2019; Bi et al., 2020). Most pre-training models are initially designed for text-to-text generation, lacking the ability to encode structural information. More recently, some pre-training models have been designed for data-to-text tasks (Chen et al., 2020b; Agarwal et al., 2021; Ke et al., 2021; Bai et al., 2022). For example, KGPT (Chen et al., 2020b) proposed a distantly supervised learning method to exploit large-scale unlabeled web text for data-to-text pre-training. However, these pre-training models consider only one specific data structure and cannot be applied to diverse downstream data-to-text tasks. Although (Tang et al., 2022)  considering the graph structures. UniLM (Dong et al., 2019) was a pre-trained universal language model, which incorporated modified self-attention masks to facilitate bidirectional encoding or unidirectional decoding. While UniLM offers the flexibility of bidirectional encoding, its encoding attention mask is designed primarily for processing unstructured text, thereby restricting its ability to capture the structural characteristics of input graphs.
Different from previous works, we propose a unified pre-training model that casts all data-to-text tasks as the graph-to-text generation task. In addition, we incorporate graph structures into the original Transformer via two new position and attention matrices to effectively model the structured input data.

Pre-training Data Construction
Previous data-to-text pre-training datasets are usually tailored to specific structured data. In this paper, we collect eight data-to-text datasets from previous works and aggregate these datasets into a large corpus for pre-training our model. The statistics of the pre-training data are provided in Table 1.

Existing Pre-training Datasets (PREDATA)
We first collect the table-text dataset TAPAS (Herzig et al., 2020) and the graph-text dataset KGTEXT (Chen et al., 2020b), which were originally designed for table-to-text and graph-to-text pre-training, respectively. TAPAS contains 6.2M tables from Wikipedia, while KGTEXT consists of 1.8M hyperlinked sentences from Wikipedia with the corresponding knowledge subgraphs from WikiData. We further devise a rule-based data-cleaning strategy to guarantee data quality. Finally, we obtain 4.9M data-text pairs (called PREDATA).

Unifying Structured Data
As illustrated in Figure 1, we unify different structured data (knowledge graph, table, key-value pairs) into an unlabeled and connected graph $G = (V, E)$, which consists of a set of nodes $v \in V$ and unlabeled edges $(v_i, v_j) \in E$. Next, we describe how the three types of data (i.e., knowledge graphs, tables, and key-value pairs) are transformed into a unified graph $G$.
(1) As shown on the left side of the Graph Data section in Figure 1, a knowledge graph can be formally expressed as $G_0 = (V_0, E_0, R_0)$, where nodes are denoted by $v \in V_0$, and labeled edges are represented as $(v_s, r, v_t) \in E_0$, with $r \in R_0$ signifying the relation type. To more effectively model the relationships between nodes within the knowledge graph $G_0$ without modifying the underlying model architecture, we transform it into its equivalent Levi graph, as shown on the right side of the Graph Data section in Figure 1, following prior studies (Ribeiro et al., 2021b; Li et al., 2022). A Levi graph is formally characterized as an unlabeled, connected bipartite graph, denoted as $G = (V, E)$. Specifically, each relation in $R_0$ is treated as a new graph node within $G$ and merged with all nodes in $V_0$ to form the full node set $V$. Subsequently, each edge $(v_s, r, v_t) \in E_0$ labeled with a relation type is converted into two unlabeled edges $(v_s, r), (r, v_t) \in E$. In addition, for each unlabeled edge, the corresponding reverse edges $(r, v_s), (v_t, r)$ are introduced. For instance, given a labeled edge (Dance of the Seven Veils, GENRE, incidental music), this conversion results in four unlabeled edges (Dance of the Seven Veils, GENRE), (GENRE, Dance of the Seven Veils), (GENRE, incidental music), and (incidental music, GENRE), comprising the final Levi graph $G$.
(2) As shown on the left side of the Table Data section in Figure 1, tabular data is conventionally structured as cells organized by their respective roles and interrelations. A table can be formally represented as $T = \{v_{i,j} \mid 1 \le i \le N,\ 1 \le j \le M\}$, where $v_{i,j}$ denotes a table cell, and $N$ and $M$ represent the number of rows and columns in the table, respectively. Inspired by recent studies (Wang et al., 2022; Li et al., 2023a), we employ a heuristic rule to transform the tabular data into a unified graph $G$ by introducing unlabeled edges between cells based on their roles and relationships. This structural transformation keeps the table content unchanged while explicitly expressing the relationships among cells in the table. More precisely, all cells within $T$ are considered as graph nodes in $G$, denoted as $V = T$. Furthermore, we establish the set of unlabeled edges $E$ in accordance with two guiding principles. First, for any two cells $v_{i,j}$ and $v_{i,z}$ situated within the same row, we introduce a forward edge $(v_{i,j}, v_{i,z})$ along with a corresponding reverse edge $(v_{i,z}, v_{i,j})$ into $E$. Second, for any two cells $v_{i,j}$ and $v_{z,j}$ located in the same column, we append a forward edge $(v_{i,j}, v_{z,j})$ and its corresponding reverse edge $(v_{z,j}, v_{i,j})$ to $E$. For instance, on the right side of the Table Data section in Figure 1, the cell "Arthur III" is linked not only to cells "1457" and "1458" in the same row but also to cells "Name" and "Peter II" in the same column. This configuration is based on empirical observations and insights gained from data analysis. However, we acknowledge that there is room for further exploration of other node connectivity settings in future research. Given that the ToTTo dataset exclusively generates text for highlighted data, only the highlighted cells are considered as nodes.
(3) For Key-Value data in Figure 1, both keys and values are regarded as nodes within $V$. In addition to the requisite edges linking each (key, value) pair (e.g., the connection between the key name and the value walter extra), we also connect keys with each other (e.g., the connection between nationality and birth_date) and values with each other (e.g., the connection between walter extra and german), drawing inspiration from the graph construction used for table data. In line with tabular data, we introduce both forward and reverse edges for any connected nodes within $V$.
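The three conversions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: the function names (`levi_graph`, `table_graph`, `key_value_graph`) are hypothetical, nodes are identified by their surface strings or cell coordinates, keys are assumed to be connected to all other keys (and values to all other values) by analogy with cells sharing a table column, and the column headers of the example table are invented for illustration.

```python
from itertools import combinations


def add_bidirectional(edges, a, b):
    """Insert a forward edge and its reverse edge."""
    edges.add((a, b))
    edges.add((b, a))


def levi_graph(triples):
    """Knowledge graph -> Levi graph: each relation becomes a node."""
    nodes, edges = [], set()
    for head, rel, tail in triples:
        for n in (head, rel, tail):
            if n not in nodes:
                nodes.append(n)
        add_bidirectional(edges, head, rel)
        add_bidirectional(edges, rel, tail)
    return nodes, edges


def table_graph(table):
    """Table -> graph: connect cells that share a row or a column."""
    cells = [(i, j) for i, row in enumerate(table) for j, _ in enumerate(row)]
    edges = set()
    for a, b in combinations(cells, 2):
        if a[0] == b[0] or a[1] == b[1]:
            add_bidirectional(edges, a, b)
    return cells, edges


def key_value_graph(pairs):
    """Key-value data -> graph: key-value, key-key and value-value edges.

    Assumption: every key is linked to every other key, and likewise
    for values, as if keys and values were two table columns.
    """
    keys = [("key", k) for k, _ in pairs]
    values = [("value", v) for _, v in pairs]
    edges = set()
    for k, v in zip(keys, values):
        add_bidirectional(edges, k, v)
    for group in (keys, values):
        for a, b in combinations(group, 2):
            add_bidirectional(edges, a, b)
    return keys + values, edges


# Example from the text: one labeled edge yields four unlabeled edges.
kg_nodes, kg_edges = levi_graph(
    [("Dance of the Seven Veils", "GENRE", "incidental music")]
)

# Hypothetical table; only "Name", "Peter II", "Arthur III", "1457"
# and "1458" come from the text, the other cells are placeholders.
table = [
    ["Name", "From", "Until"],
    ["Peter II", "1450", "1457"],
    ["Arthur III", "1457", "1458"],
]
cells, table_edges = table_graph(table)
```

Representing edges as a set of ordered pairs keeps forward and reverse edges explicit, which matches the description of adding a reverse edge for every connection.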
To ensure clarity and context in the generated text, we introduce two specific prefixes before the actual input data: (1) a data-independent prefix, which universally states "describe the following data"; and (2) a data-specific prefix, tailored according to the nature and structure of the data at hand. We provide the data-specific prefixes for the three

Model Architecture
Our model is built upon the pre-trained T5 model given the impressive performance of T5 on text generation tasks. It is noteworthy that our pre-training strategy is model-agnostic and potentially applicable to any Transformer-based backbone network.
The encoder of the Transformer is composed of a stack of blocks, each of which contains a self-attention layer followed by a feed-forward network. The decoder has a similar structure to the encoder except that it adopts a standard attention mechanism following a self-attention layer.

Preliminary
In the case of the T5 encoder, a "fully-visible" attention mask is employed, which permits the self-attention mechanism to consider all input entries when generating each output entry. In addition, T5 adopts a simplified form of position embeddings, where each embedding is a scalar. Formally, as illustrated in Figure 3, the attention calculation of the encoder can be expressed as: where $X$ is the input sequence, $\alpha$ is the attention weight between the query vector $Q$ and the key vector $K$, $d$ is the dimensionality of the hidden representations, $Z$ is the output of the attention module, $P_{emb}$ is the position embedding, and $A_{mask}$ is the attention mask.
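The symbols defined above are consistent with standard scaled dot-product attention extended with additive position and mask terms. A plausible form, offered as a sketch under that assumption rather than as the exact formulation, is shown below. The projection matrices $W^{Q}$, $W^{K}$, $W^{V}$ are hypothetical names that the text does not define, and the numbering is inferred from the later references to Equation (2).

```latex
\begin{align}
Q &= X W^{Q}, \quad K = X W^{K}, \quad V = X W^{V} \tag{1} \\
\alpha &= \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}} + P_{emb} + A_{mask}\right),
\quad Z = \alpha V \tag{2}
\end{align}
```

Under this reading, swapping in new position and mask terms changes only the additive quantities inside the softmax and leaves the query, key, and value projections untouched. If the mask is stored as a 0/1 matrix, it would presumably be mapped to 0 and $-\infty$ before being added; that mapping is also an assumption.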
The original attention mechanism, which is designed to process unstructured natural language text, proves inadequate for capturing the inherent structures within graphs. To better process our structured graph data, we replace the position embeddings $P_{emb}$ and the attention mask $A_{mask}$ in Equation (2) with two new position and attention matrices, respectively, making them aware of the underlying graph structures. Next, we elaborate on how the position and attention matrices are constructed.

Structure-enhanced Transformer
T5 is based on an encoder-decoder Transformer, which does not necessarily capture graph structures. To address this issue, we propose a structure-enhanced Transformer, which is built upon the new position and attention matrices on the T5 encoder side. As illustrated in Figure 3, we use new position embedding and attention mask matrices (denoted as $P^{new}_{emb}$ and $A^{new}_{mask}$) to replace $P_{emb}$ and $A_{mask}$ in Equation (2), respectively. Specifically, we devise a position matrix for the Transformer to encode the relative positional information of connected nodes in the original input graph $G$. In addition, we propose a new attention matrix to replace the attention mask in the self-attention, which takes the available explicit connectivity structure of the input graph into account.

[Figure 4: Construction of the position matrix. We first set an auxiliary matrix for each edge between two nodes, and then copy the content of the auxiliary matrix into the final position matrix. The distances of nodes lacking direct connections are set to "±inf". The lighter the color, the farther the distance.]

Position Matrix Construction
Integrating relational information about the graph structure into the Transformer architecture is essential for graph-to-text generation. Nevertheless, most previous Transformer-based methods (Xing and Wan, 2021; Han and Shareghi, 2022) learned position embeddings automatically, instead of explicitly encoding the structural relationships. For the input graph, we should only consider the relative position between connected nodes and ignore the relative position between irrelevant nodes. To this end, we replace the positional embeddings of the original Transformer with a position matrix that only establishes the relative position between each relevant node pair (connected items). In this way, we can explicitly capture the relative positions of all relevant nodes. Specifically, we first establish an auxiliary position matrix for each pair of connected nodes, similar to the green and yellow boxes in Figure 4. No matter how far apart the two relevant nodes are in the input sequence, the corresponding auxiliary position matrix solely takes into account the relative distance between these two nodes' internal tokens, disregarding the nodes situated between the two target nodes. For example, consider the input nodes "[Node] club" and "[Node] Jens Hartel": since "club" is 3 units to the right of "Jens", the value of cell [Jens, club] is 3. Notably, we only compute the relative distance between each connected node pair, while the distances of nodes lacking direct connections are set to "±inf", signifying an infinite distance between them. For instance, the value assigned to the cell [Jens, Berliner] is "+inf" due to the absence of a direct connection between "[Node] Jens Hartel" and "[Node] Berliner AK 07".
After obtaining the auxiliary position matrix for each pair of connected items, we can construct the position matrix for the entire input sequence by copying the cell values from the corresponding auxiliary position matrices. It is noteworthy that we want the prefixes (denoted as "[Prefix-I]" and "[Prefix-S]") embedded within the input to carry comprehensive global information. Therefore, we assume that these prefixes are directly connected to all other nodes within the input. Finally, we replace the positional embeddings $P_{emb}$ of the original Transformer with the learned position matrix $P^{new}_{emb}$, so as to effectively capture the explicit relative distance between each pair of connected items.
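The construction described above can be illustrated with a small pure-Python sketch. It is not the released implementation, and several details are assumptions: the function name `build_position_matrix` is hypothetical, each connected pair is ordered by its appearance in the input sequence before the auxiliary offsets are computed, tokens inside one node keep their ordinary offsets, and the sign of the infinite distance simply follows token order. How these raw offsets are then turned into learned embeddings is not covered here.

```python
import math


def build_position_matrix(nodes, edges):
    """nodes: list of token lists; edges: pairs of node indices."""
    starts, total = [], 0
    for toks in nodes:
        starts.append(total)
        total += len(toks)

    # Default: tokens of unconnected nodes are infinitely far apart.
    P = [[math.inf if j > i else -math.inf for j in range(total)]
         for i in range(total)]

    # Tokens inside the same node keep their ordinary relative offsets.
    for k, toks in enumerate(nodes):
        for i in range(len(toks)):
            for j in range(len(toks)):
                P[starts[k] + i][starts[k] + j] = j - i

    # Auxiliary matrix per connected pair: place the two nodes side by
    # side (ignoring anything between them) and copy the offsets over.
    for u, v in edges:
        if u == v:
            continue
        a, b = sorted((u, v))
        len_a = len(nodes[a])
        for i in range(len_a):
            for j in range(len(nodes[b])):
                offset = (len_a + j) - i
                P[starts[a] + i][starts[b] + j] = offset
                P[starts[b] + j][starts[a] + i] = -offset
    return P


# Hypothetical linearization order; the node contents come from the text.
nodes = [
    ["[Node]", "Jens", "Hartel"],
    ["[Node]", "Berliner", "AK", "07"],
    ["[Node]", "club"],
]
edges = [(0, 2), (2, 1)]  # Jens Hartel - club, club - Berliner AK 07
P = build_position_matrix(nodes, edges)
jens, berliner, club = 1, 4, 8  # token indices in the full sequence
print(P[jens][club])      # 3, although "club" is 7 tokens away in the sequence
print(P[jens][berliner])  # inf, because the two nodes are not connected
```

A prefix can be modeled in this sketch as one more node with an edge to every other node, which mirrors the assumption that prefixes are directly connected to all nodes.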

Attention Matrix Construction
Self-attention in the original Transformer processes the input sequence by replacing each element with a weighted average over the sequence. Without refining the conventional attention mechanism, the input data would be treated as a fully connected graph, potentially hindering the extraction of inherent structural information. For this reason, we construct a relation-aware attention matrix to replace the original attention mask in self-attention. Concretely, if two elements have a direct relationship, we set the value of the corresponding cell to 1; otherwise, the value is set to 0. For example, as illustrated in Figure 5, since the items "Jens Hartel" and "club" have a direct connection, the values of cells (Jens, club) and (Hartel, club) are set to 1; since "Jens Hartel" and "Berliner AK 07" have no direct connection, the values of the corresponding cells such as (Jens, Berliner) and (Jens, AK) are set to 0. Here, we want the prefixes (i.e., "[Prefix-I]" and "[Prefix-S]") within the input to carry global information, so we make the prefixes attend to all other elements. After obtaining the attention matrix (denoted as $A^{new}_{mask}$), we replace the attention mask $A_{mask}$ of self-attention in Equation (2) with our new attention matrix $A^{new}_{mask}$ so as to effectively capture the graph structures, as shown in Figure 3.
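A token-level version of this 0/1 matrix can be sketched as follows. The function name `build_attention_matrix` and the `prefix_len` parameter are hypothetical, tokens of the same node are assumed to be mutually visible, and prefix visibility is assumed to be symmetric (prefix tokens see everything and everything sees the prefix); none of these details is confirmed by the text.

```python
def build_attention_matrix(nodes, edges, prefix_len=0):
    """Return a 0/1 matrix; 1 means the two tokens may attend to each other."""
    # node_of[t] is the node index of token t, or -1 for prefix tokens.
    node_of = [-1] * prefix_len
    for k, toks in enumerate(nodes):
        node_of += [k] * len(toks)

    # Node-level connectivity, including self-connections.
    connected = {(k, k) for k in range(len(nodes))}
    for u, v in edges:
        connected.add((u, v))
        connected.add((v, u))

    total = len(node_of)
    A = [[0] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            is_prefix = node_of[i] == -1 or node_of[j] == -1
            if is_prefix or (node_of[i], node_of[j]) in connected:
                A[i][j] = 1
    return A


# Hypothetical linearization with two prefix tokens in front.
nodes = [
    ["[Node]", "Jens", "Hartel"],
    ["[Node]", "Berliner", "AK", "07"],
    ["[Node]", "club"],
]
edges = [(0, 2), (2, 1)]
A = build_attention_matrix(nodes, edges, prefix_len=2)
jens, hartel, berliner, ak, club = 3, 4, 6, 7, 10
print(A[jens][club], A[hartel][club])   # 1 1
print(A[jens][berliner], A[jens][ak])   # 0 0
```

Keeping the matrix at the token level means every token of a node inherits the connectivity of that node, which is how the cells (Jens, club) and (Hartel, club) both end up set to 1.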

Pre-training Objectives
Similar to Andrejczuk et al. (2022), we first use the publicly available T5 checkpoints provided by Herzig et al. (2020) as the initialization. Then, we pre-train our model on our pre-training data. We employ two objectives to pre-train our model in a multi-task learning paradigm: a struct denoising objective and a graph-to-text generation objective. In Table 3, we provide two specific training instances (input and output pairs) for the struct denoising and graph-to-text generation objectives.

Struct Denoising Objective
We design a struct denoising strategy for table-like data, following the method used in T5, by training the model to predict a target sequence containing the missing or corrupted tokens in the input graph. We apply a noise function to construct a noisy input graph. In particular, the noise function is implemented by masking 15% of the nodes while maintaining the related edges in the graph. The goal of the struct denoising objective is to reconstruct the target output that contains all the dropped-out nodes, delimited by sentinel tokens. This pre-training objective helps the UniD2T model capture relationships between neighboring nodes in the input graph.
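The noise function can be sketched as below. This is an illustration under assumptions, not the released code: the function name `struct_denoise` is hypothetical, masked nodes are chosen uniformly at random, and the sentinel format follows T5's `<extra_id_k>` convention, including a closing sentinel at the end of the target. Edges are left untouched because only node contents are replaced.

```python
import random


def struct_denoise(nodes, mask_ratio=0.15, seed=None):
    """Mask a fraction of nodes and build the denoising target.

    nodes: list of node strings. Returns (noisy_nodes, target_text).
    The edge set is unchanged, so it is not handled here.
    """
    if not nodes:
        return [], ""
    rng = random.Random(seed)
    n_mask = max(1, round(len(nodes) * mask_ratio))
    masked = sorted(rng.sample(range(len(nodes)), n_mask))

    noisy = list(nodes)
    target_parts = []
    for sentinel_id, idx in enumerate(masked):
        sentinel = f"<extra_id_{sentinel_id}>"
        noisy[idx] = sentinel                 # node content is hidden
        target_parts += [sentinel, nodes[idx]]  # and moved to the target
    target_parts.append(f"<extra_id_{len(masked)}>")  # closing sentinel
    return noisy, " ".join(target_parts)


example_nodes = [f"node{i}" for i in range(20)]
noisy_nodes, target = struct_denoise(example_nodes, seed=0)
```

With 20 nodes and a 15% ratio, three nodes are replaced by sentinels and the target lists those three nodes in order, each preceded by its sentinel.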
Graph-to-Text Generation Objective
Given the linearized graph $G_{linear}$ and its explicit connectivity structure $E$, the graph-to-text generation task is carried out to produce the appropriate text to describe the given graph in an auto-regressive manner. We adopt the standard negative log-likelihood loss $\mathcal{L}_{TG}$ for the graph-to-text generation task:

$$\mathcal{L}_{TG} = -\sum_{i=1}^{n} \log P(y_i \mid y_{<i}, G_{linear}, E)$$

where $n$ is the length of the target sequence $Y$ and $y_i$ denotes its $i$-th token.
5 Experimental Setup

Tasks and Datasets
To verify the generality and effectiveness of UniD2T, we conduct experiments on three types of data-to-text datasets. In particular, WebNLG (Gardent et al., 2017) and DART (Nan et al., 2020) are used for evaluating graph-to-text generation; WikiBio (Lebret et al., 2016) and WikiTableT (Chen et al., 2021) are used for evaluating key-value-to-text generation; ToTTo (Parikh et al., 2020) and CoSQL (Yu et al., 2019) are used for evaluating table-to-text generation. Table 4 provides the statistics of these six datasets.

Example instance (graph-to-text generation)
Input: Describe the following data: The category of the DBpedia entities is: Food. 'Bakewell pudding', 'dish variation', 'Bakewell tart', 'main ingredients', 'Ground almond, jam, butter, eggs'
Output: Bakewell tart is a variation of Bakewell pudding and some of the main ingredients are ground almonds, jam, butter and eggs.

Implementation Details
In the pre-training stage, our model is initialized with T5-Large. We pre-train our UniD2T model on NVIDIA A100 GPUs. The maximum sequence lengths of the input and target sequences are set to 1024 and 512, respectively. We set the batch size to 8. Gradient clipping is applied to the model with a maximum gradient value of 1. To alleviate overfitting, the maximum number of training steps is set to 500k. Moreover, the patience is set to 25k steps, i.e., if the evaluation metrics do not improve for 25k steps, training stops early. We set the maximum learning rate to 1e-5.
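For reference, the hyperparameters above and the early-stopping rule can be collected in a short sketch. The class and field names (`PretrainConfig`, `EarlyStopper`, and so on) are hypothetical, and the evaluation interval in the usage lines is invented for illustration; only the numeric values come from the text.

```python
from dataclasses import dataclass


@dataclass
class PretrainConfig:
    backbone: str = "T5-Large"
    max_input_len: int = 1024
    max_target_len: int = 512
    batch_size: int = 8
    grad_clip: float = 1.0        # maximum gradient value
    max_steps: int = 500_000
    patience_steps: int = 25_000  # early-stopping patience
    max_lr: float = 1e-5


class EarlyStopper:
    """Stop when the metric has not improved for `patience_steps` steps."""

    def __init__(self, patience_steps):
        self.patience_steps = patience_steps
        self.best_score = float("-inf")
        self.best_step = 0

    def should_stop(self, step, score):
        if score > self.best_score:
            self.best_score, self.best_step = score, step
            return False
        return step - self.best_step >= self.patience_steps


config = PretrainConfig()
stopper = EarlyStopper(config.patience_steps)
# Hypothetical evaluation history: (training step, validation score).
history = [(5_000, 40.0), (10_000, 41.0), (30_000, 40.5), (35_000, 40.9)]
decisions = [stopper.should_stop(step, score) for step, score in history]
```

In this example the best score is reached at step 10,000, so training stops at step 35,000, the first evaluation at least 25,000 steps later without improvement.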
6 Experimental Results

Table-to-Text Generation
We conduct experiments on two table-to-text datasets, including ToTTo and CoSQL. The SQL queries within CoSQL and the table header information from ToTTo are placed within the data-specific prefixes, denoted as "[Prefix-S]", as illustrated in Table 2.

ToTTo
ToTTo is an open-domain table-to-text dataset that uses crowd annotators to highlight the table cells and revise the corresponding natural language descriptions. We compare our UniD2T with several strong baselines, including BERT2BERT (Rothe et al., 2020), LATTICE (Wang et al., 2022), CoNT (An et al., 2022), PlanGen (Su et al., 2021) and TABT5 (Andrejczuk et al., 2022). TABT5 is a pre-trained model tailored for table-to-text generation. We adopt BLEU (Papineni et al., 2002) and PARENT (Dhingra et al., 2019) as the evaluation metrics. The experimental results on ToTTo are summarized in Table 5. Our model achieves substantially better performance than the compared methods on ToTTo in terms of overall, overlap, and non-overlap settings. First, our model shows an improvement over T5 and TABT5, especially in terms of PARENT. Second, our model also achieves better results than the strong downstream methods.
CoSQL
CoSQL serves as a prevalent benchmark for evaluating table-to-text models (Fang et al., 2022b; Li et al., 2023b). Each instance within CoSQL comprises an SQL query, the resultant table, and the corresponding response, where the SQL query gives explicit signals for models on what to generate. The generated description can provide a concise and easy-to-understand summary of the result table and help users verify whether the queried result is consistent with the original question. We compare our model with GraphWriter (Koncel-Kedziorski et al., 2019), BART-Base, T5-Large, and FALCON (Fang et al., 2022a), which is a faithful contrastive generation framework based on T5. We adopt BLEU (Papineni et al., 2002) and ROUGE-L (Lin, 2004) as evaluation metrics. Since CoSQL does not release the test set, we follow FALCON and report the experimental results on the development set in Table 6. Our UniD2T model achieves significantly better performance than the baselines. The BLEU and ROUGE scores increase by 7.03 and 3.58, respectively, over the best-performing baseline FALCON.

Graph-to-Text Generation
We conduct experiments on two graph-to-text datasets, including DART (Nan et al., 2020) and WebNLG.

DART
DART is a large dataset for open-domain text generation that treats the input as a set of RDF entity-relation triples. We compare our UniD2T model with several pre-training models including Transformer, BART, T5, and the state-of-the-art method CONTROL PREFIXES (Clive et al., 2021). BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), and TER (Snover et al., 2005) are adopted as evaluation metrics. As shown in DBpedia and the corresponding manually annotated text. BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), chrF++ (Popović, 2015), TER (Snover et al., 2005) and BLEURT (Sellam et al., 2020) are adopted as evaluation metrics. We compare our method with both pre-trained language models and strong downstream baselines.

Key-Value-to-Text Generation
We conduct experiments on two key-value-based datasets, including WikiBio and WikiTableT.
WikiBio
WikiBio is designed to generate descriptions from a Wikipedia infobox and aims to generate the first sentence of a biography. We compare UniD2T with the previous state-of-the-art model (i.e., CoNT (An et al., 2022)), pre-trained models (T5-Large, KGPT), and the non-autoregressive model SANA (Wang et al., 2021) on WikiBio. BLEU and PARENT are adopted as evaluation metrics. The results are reported in Table 9. UniD2T outperforms the best baseline CoNT by 3.7% on BLEU.
WikiTableT
WikiTableT is collected from Wikipedia sections with their corresponding tabular data and contains millions of instances. We compare UniD2T with Transformer, T5-Large and KGPT (Chen et al., 2020b). The experimental results in Table 9 show that UniD2T exceeds the best competitor KGPT by 1.9% on BLEU and 2.2% on PARENT.
6.4 Further Analysis

Ablation Study
We conduct experiments to investigate the impact of pre-training with graph structure and linear structure. The ablation results are summarized in Table 10, which is divided into two parts: the first part shows the results of directly fine-tuning the pre-trained language model (i.e., T5-Large) on the downstream datasets, referred to as DOWNDATA, while the second part presents the results of incorporating additional pre-training data, denoted as PREDATA, on top of T5-Large.
We observe that UniD2T (T5-Large + $P_{Graph}$ + $F_{Graph}$) consistently outperforms T5-Large + $F_{Graph}$ across all six data-to-text datasets, resulting in a notable improvement in the total score of +20.6. In addition, we observe that T5-Large + $F_{Graph}$ outperforms T5-Large + $F_{Linear}$ in terms of the total score by +6.1. This result indicates that our method substantially outperforms data-to-text models that linearize the structured data as input when fine-tuning on downstream datasets. Finally, we examine the effects of the pre-training datasets. By comparing the results of $P^{*}_{Graph} + F_{Graph}$ with $P_{Graph} + F_{Graph}$, and $P^{*}_{Linear} + F_{Linear}$ with $P_{Linear} + F_{Linear}$, we observe that the downstream datasets contribute to improving the model's performance and accelerating the pre-training process. It is noteworthy that pre-training on both PREDATA and DOWNDATA achieves the best performance across all the experimental datasets.
We also examine the effects of the two Transformer modifications (position and attention matrix construction). The results are shown in Table 11. We observe a significant performance drop when either the structure-aware position matrix or the attention matrix is removed, demonstrating the benefits of the two Transformer modifications.

Table 12: Few-shot results on the E2ENLG test set.

Few-Shot Results
We conduct few-shot experiments on the E2ENLG (Dušek et al., 2020) dataset, which comes from the restaurant domain rather than Wikipedia. This serves as an additional validation of the model's generalization capabilities. The E2ENLG dataset, assembled through the CrowdFlower platform, contains details about restaurants and comprises over 50,000 combinations of dialogue-act-based meaning representations (MR) with an average of 8.1 references. We fine-tune UniD2T using varying proportions of the training instances (i.e., 0.1%, 0.5%, 1%, 5%, and 10%) from E2ENLG. We compare UniD2T with several few-shot learning methods including TGen (Dušek and Jurčíček, 2016), Template-GPT-2 (Chen et al., 2020a), and KGPT (Chen et al., 2020b). The experimental results are summarized in Table 12. We can see that UniD2T significantly outperforms all baselines in various few-shot settings.

Human Evaluation
We also conduct a human evaluation to analyze the generated sentences following Chen et al. (2020b). Each evaluator is unaware of which model generated the text being evaluated, so as to avoid evaluation bias. Specifically, we choose 100 test samples from WebNLG and examine the factual consistency between the gold sentences and the generated sentences. We invite four NLP workers to assign each text a label from {Hallucination, Missing Fact, Accurate}, similar to Chen et al. (2020b). As shown in Figure 6, our UniD2T is less prone to hallucinating non-existing facts and can generate more accurate sentences.

Impact on Graph Sizes
To illustrate the effectiveness of the graph structure, we further investigate the performance of P Linear + F Linear and P * Graph + F Graph with respect to different graph sizes on the WebNLG validation set. Experimental results in terms of BLEU are shown in Figure 7. When the graph structure is simple, the impact of the graph structure is limited. However, as the graph structure becomes complex, the model with graph structure (P * Graph + F Graph) performs much better than the model with linear structure (P Linear + F Linear). Thus, the structure-enhanced model UniD2T demonstrates greater stability and better performance on large-scale inputs when compared to linear sequence models.
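An analysis by graph size of this kind can be sketched as follows. This is only an illustration under stated assumptions: graph size is taken to be the number of input triples, and `exact_match` is a toy stand-in for a corpus-level metric such as BLEU; all function names are hypothetical.

```python
from collections import defaultdict


def group_by_graph_size(examples):
    """Group (triples, hypothesis, reference) examples by number of triples."""
    buckets = defaultdict(list)
    for triples, hypothesis, reference in examples:
        buckets[len(triples)].append((hypothesis, reference))
    return dict(sorted(buckets.items()))


def score_per_graph_size(examples, corpus_metric):
    """Apply a corpus-level metric separately to each graph-size bucket."""
    scores = {}
    for size, pairs in group_by_graph_size(examples).items():
        hypotheses = [h for h, _ in pairs]
        references = [r for _, r in pairs]
        scores[size] = corpus_metric(hypotheses, references)
    return scores


def exact_match(hypotheses, references):
    """Toy metric: fraction of hypotheses identical to their reference."""
    return sum(h == r for h, r in zip(hypotheses, references)) / len(hypotheses)


examples = [
    ([("A", "r", "B")], "A r B", "A r B"),
    ([("A", "r", "B")], "A B", "A r B"),
    ([("A", "r", "B"), ("B", "s", "C")], "A r B s C", "A r B s C"),
]
scores = score_per_graph_size(examples, exact_match)
```

Scoring each bucket as its own corpus, rather than averaging sentence-level scores, matches how corpus-level metrics are normally reported.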

Impact on Model Sizes
To investigate the influence of different model scales on the experimental results, we conducted experiments using F Graph on T5-Small, T5-Base, T5-Large, and T5-3B on the DART and ToTTo dev sets without pre-training. Note that we evaluate on the dev sets rather than the test sets. This decision is due to the constraints imposed by the ToTTo dataset, where obtaining test results requires submitting predictions to the leaderboard and awaiting the evaluation process, which can be time-consuming. Therefore, to expedite experimentation, we relied on the readily available development sets. The results are presented in Table 13. Notably, the transition from T5-Large to T5-3B increases the number of parameters approximately 3.9-fold. However, the corresponding improvement in efficacy is less than 1%. This analysis sheds light on the limited impact of scaling up the model size beyond a certain threshold, given the marginal gains in performance despite the significant increase in parameter count.

The Zero-shot Performance of ChatGPT
We conducted zero-shot experiments using ChatGPT on the ToTTo and DART datasets to establish baselines for performance evaluation. The results of these experiments are presented in Table 5 and Table 7 as baselines. The prompt structure of ChatGPT comprises two parts, and detailed information regarding these prompts can be found in Table 14.
From the results, we observe that ChatGPT demonstrates consistent performance across various measures. For instance, in the non-overlap subset of the ToTTo dataset, when compared to BERT-to-BERT, the BLEU score shows a decrease of 17.6%, while the PARENT score exhibits a slight increase of 0.9%. This divergence in BLEU performance indicates that ChatGPT generates responses with different word choices, leading to reduced word overlap with the reference. However, the improvement in the PARENT score suggests enhanced structural and content-related aspects in the generated responses. These findings underscore the importance of employing multiple evaluation metrics to comprehensively assess the performance of sophisticated language generation systems in future work.

Impact on Edge Directionality
We examine the significance of edge directionality and present the experimental results of incorporating the edge direction in Table 16. For UniD2T directed, we consider the input directed graph using only its original directed edges (unidirectional) and remove the reverse edges added by UniD2T. Please refer to Section 3.3 for more details about the reverse edges. From Table 16, we can observe that the incorporation of edge direction has a deleterious effect on the performance of pre-trained models. There are several possible factors that may underlie these observed outcomes. (1) First, the pre-training models aim to learn the general representations of structured data. However, due to the vast scale of multi-source data, it is often infeasible to assign a direction to each data pair. For example, the tabular format constitutes a fundamental type of structured data; however, explicit edge directionality between individual data pairs is typically absent in this format. Therefore, we default to using bidirectional edges to signify mutual relationships between two entities.
(2) Second, we anticipate that learning the coarse relationships between two entities through undirected graphs during the pre-training phase offers greater flexibility to accommodate various types of relationships in different fields. For instance, the directional link "Jay Chou → Common Jasmine Orange" conveys that Jay Chou released the album Common Jasmine Orange, while the reverse link "Common Jasmine Orange → Jay Chou" signifies that Common Jasmine Orange is one of Jay Chou's albums. In most cases, it is unnecessary to provide elaborate descriptions of specific relationships, as the data primarily requires indicating connections.

Figure 8 (excerpt), Case #3. Input nodes: Bacon Explosion; Kansas City metropolitan area; United States; Sausage.
Gold: The Bacon Explosion comes from the Kansas city metro area in the U.S. The main ingredient in it is bacon and also includes sausage.
UniD2T: Bacon Explosion is from the Kansas City metropolitan area, in the United States. Its main ingredients are bacon and sausage.
T5-large: Bacon Explosion is from the Kansas City metropolitan area in the United States. It includes sausage as one of it's main ingredients and bacon as a main ingredient.

Figure 8 (excerpt), Case #4. Input (fragment): Release.
Gold: As a construction and management simulation game, Lego Creator: Knights Kingdom was released in 2000.
UniD2T: Lego Creator: Knights Kingdom was released in 2000 and is a construction and management simulation game.
T5-large: Knights Kingdom was released in 2000.
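The difference between the undirected setting (UniD2T) and the directed setting (UniD2T directed) can be illustrated with a minimal node-level sketch of a 0/1 attention mask. It is not the released implementation: the function `build_attention_mask`, the convention that row i attends to column j, and the omission of the global-attention entries are all simplifying assumptions.

```python
def build_attention_mask(num_nodes, edges, add_reverse_edges=True):
    """Build a node-level 0/1 mask; mask[i][j] == 1 lets node i attend to node j.

    Assumptions: nodes are indexed 0..num_nodes-1, `edges` holds
    (source, target) index pairs, and global attention is left out.
    """
    mask = [[0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        mask[i][i] = 1  # self-connection of every node
    for src, dst in edges:
        mask[src][dst] = 1  # original directed edge
        if add_reverse_edges:
            mask[dst][src] = 1  # reverse edge, making the link bidirectional
    return mask


nodes = ["Jay Chou", "Common Jasmine Orange"]
edges = [(0, 1)]  # "Jay Chou -> Common Jasmine Orange"

undirected = build_attention_mask(len(nodes), edges)
directed = build_attention_mask(len(nodes), edges, add_reverse_edges=False)
```

In the directed variant the second node cannot attend back to the first, whereas the undirected variant yields a symmetric mask. This symmetric form also suits data such as tables, where no edge direction is available.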

Case Study
As illustrated in Figure 8, we further verify the effectiveness of UniD2T qualitatively by demonstrating some generated sentences by UniD2T and T5-Large. Both UniD2T and T5-Large are capable of generating main entities. However, there are notable differences in the quality and coherence of the generated sentences. Specifically, the sentences generated by T5-Large tend to exhibit shortcomings in terms of including key information and logical reasoning. For instance, in the first case, T5-Large fails to infer that the "Baltimore World Trade Center" is the tallest building. This illustrates the limitation of T5-Large in capturing and incorporating specific facts with logical reasoning. In contrast, UniD2T can produce sentences that are more accurate, complete, and encompass the main entities and logical information with greater precision. This highlights the advantages of UniD2T in generating more contextually appropriate and logically grounded sentences.

The Diversity of Generated Sentences
We evaluate the diversity of the target sentences generated by UniD2T and compare it with strong baselines (i.e., T5-Large and ChatGPT). To quantify the diversity of the generated sentences, we employ the Distinct-N metric (Li et al., 2016), which calculates the number of distinct N-grams divided by the total number of generated tokens. The experimental results are presented in Table 15. UniD2T achieves notably higher Distinct-2/3/4 scores than T5-Large. This suggests that UniD2T generates sentences with a greater variety of unique bigrams, trigrams, and 4-grams than T5-Large, indicating a higher level of linguistic diversity in the output. However, ChatGPT achieves better diversity scores than UniD2T. It tends to generate more diverse words that are not included in our vocabulary, although these words may correspond to non-existent content.
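As a minimal sketch of the Distinct-N definition given above (distinct N-grams divided by the total number of generated tokens), assuming whitespace tokenization and N-grams that do not cross sentence boundaries; the function name `distinct_n` is illustrative and the tokenizer used for Table 15 may differ:

```python
def distinct_n(sentences, n):
    """Distinct-N: number of distinct n-grams / total number of generated tokens.

    Assumption: tokens are obtained by whitespace splitting, and n-grams
    are collected within each sentence only.
    """
    distinct_ngrams = set()
    total_tokens = 0
    for sentence in sentences:
        tokens = sentence.split()
        total_tokens += len(tokens)
        for i in range(len(tokens) - n + 1):
            distinct_ngrams.add(tuple(tokens[i:i + n]))
    # Guard against empty input to avoid division by zero.
    return len(distinct_ngrams) / total_tokens if total_tokens else 0.0


generated = ["the cat sat", "the cat ran"]
d1 = distinct_n(generated, 1)  # 4 distinct unigrams over 6 tokens
d2 = distinct_n(generated, 2)  # 3 distinct bigrams over 6 tokens
d3 = distinct_n(generated, 3)  # 2 distinct trigrams over 6 tokens
```

Because the denominator is the token count rather than the N-gram count, scores for larger N are bounded slightly below 1 even when every N-gram is unique.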

Limitations
Based on our empirical observations, we identify several limitations of this work, which can be divided into three primary categories. (1) Our pre-training data is limited, containing only two existing pre-training datasets and six downstream datasets. In the future, we would like to collect more data-to-text datasets so as to construct a large-scale, diverse pre-training corpus.
(2) In this work, we unify different structured data into the graph format in a simple and direct way. We will attempt to exploit more advanced strategies to construct graphs from different structured data.
(3) This study focuses on modeling the graph structures and incorporating the structural information into the Transformer. However, the pre-training objectives can be further improved so as to enhance representation learning.

Conclusion
In this paper, we proposed a unified data-to-text pre-training method, which could be applied to various downstream data-to-text generation tasks. Concretely, we first converted different types of structured data into graph format. Then, we devised a structure-enhanced Transformer to capture graph structures by introducing two new position and attention matrices to replace the position embedding and attention mask in the self-attention of the Transformer. Extensive experiments on six data-to-text benchmark datasets demonstrated that UniD2T achieved substantially better performance than strong baselines by enabling better information sharing and representation learning of data structures across diverse data-to-text datasets.

Figure 1: Unify data in three formats into one graph structure.

Figure 2: Simplified version of model input and connections between nodes.

Figure 3: Transformer blocks on the T5-encoder side. The relative position and attention matrices in the self-attention calculation will be replaced by two novel position and attention matrices.

Figure 4: We construct a new position matrix P^new_emb to replace the original position matrix P_emb used in Equation (2). We first set an auxiliary matrix for each edge between two nodes, and then copy the content of the auxiliary matrix into the final position matrix. The distances of nodes lacking direct connections will be set to "±inf". The lighter the color, the farther the distance is.

Figure 5: We construct a new attention matrix A^new_mask to replace the attention mask A_mask used in Equation (2), i.e., the attention mask of self-attention in the Transformer. The values of the cells with colors are set to 1, while the values of the cells without colors are set to 0. The blue color represents global attention, the gray color represents the self-connection of nodes, and the green and yellow colors represent the two connected edges.

Figure 7: BLEU scores of P Linear + F Linear and P Graph + F Graph as the number of triples increases, on the seen and unseen sets of WebNLG.

Figure 8: Examples of generated sentences. The main entity is highlighted in green, and the words that are not faithful to the input are in red. Important information common to both models is indicated in blue.

proposed a multi-task supervised pre-training model (MVP) for a series of data-to-text generation tasks, it utilized the original Transformer to encode the linearized input data without

Table 1: Statistics of our pre-training data.

Table 2: The data-specific prefixes that are tailored for different types of data. Here, A, B and C can be replaced by the content of specific samples.

Table 3: The examples of input-output pairs for struct denoising and graph-to-text generation objectives.

Table 4: Statistics of downstream datasets.

Table 5: Results on the ToTTo test set.

Table 6: Results on CoSQL development set.

Table 7: Evaluation results on DART test set. Results with † are taken from DART

Table 7, our model surpasses the best-performing model CONTROL PREFIXES by 3.0% BLEU.

Table 9: Results on WikiBio and WikiTableT test sets.

Table 10: Ablation test results on six benchmark datasets. P Linear and P Graph represent the models pre-trained with linear structure and graph structure, respectively. F Linear and F Graph represent the models fine-tuned with linear structure and graph structure, respectively. P * stands for pre-training only with PREDATA; P indicates pre-training with both PREDATA and DOWNDATA.
(Han and Shareghi, 2022) results on WebNLG are shown in Table 8. Our model achieves the highest performance among all baseline models, including the graph pre-training model TRIPLE (Han and Shareghi, 2022).

Table 11: Ablation test results on WebNLG test set.

Table 14: Input examples for ChatGPT on ToTTo and DART. Here, PROMPT represents task description, and STRUCTURED INPUT represents data input with specific formats.

Table 15: The results of diversity evaluation on DART test set.

Table 16: The results of our models with undirected graphs (i.e., UniD2T) and directed graphs (denoted as UniD2T directed), respectively.