Abstract
Recent graph-to-text models generate text from graph-based data using either global or local aggregation to learn node representations. Global node encoding allows explicit communication between two distant nodes, but it neglects the graph topology, as all nodes are directly connected. In contrast, local node encoding considers the relations between neighbor nodes, capturing the graph structure, but it can fail to capture long-range relations. In this work, we combine both encoding strategies, proposing novel neural models that encode an input graph using both global and local node contexts, in order to learn better contextualized node embeddings. In our experiments, we demonstrate that our approaches lead to significant improvements on two graph-to-text datasets, achieving BLEU scores of 18.01 on the AGENDA dataset and 63.69 on the WebNLG dataset for seen categories, outperforming state-of-the-art models by 3.7 and 3.1 points, respectively.[1]
1 Introduction
Graph-to-text generation refers to the task of generating natural language text from input graph structures, which can be semantic representations (Konstas et al., 2017) or knowledge graphs (KGs) (Gardent et al., 2017; Koncel-Kedziorski et al., 2019). Whereas most recent work (Song et al., 2018; Ribeiro et al., 2019; Guo et al., 2019) focuses on generating sentences, a more challenging and interesting scenario emerges when the goal is to generate multisentence texts. In this context, in addition to sentence generation, document planning needs to be handled: The input needs to be mapped into several sentences; sentences need to be ordered and connected using appropriate discourse markers; and inter-sentential anaphora and ellipsis may need to be generated to avoid repetition. In this paper, we focus on generating texts rather than sentences, where the outputs are short texts (Gardent et al., 2017) or paragraphs (Koncel-Kedziorski et al., 2019).
A key issue in neural graph-to-text generation is how to encode the input graphs. The basic idea is to incrementally compute node representations by aggregating structural context information. To this end, two main approaches have been proposed: (i) models based on local node aggregation, usually built upon Graph Neural Networks (GNNs) (Kipf and Welling, 2017; Hamilton et al., 2017), and (ii) models that leverage global node aggregation. Systems that adopt global encoding strategies are typically based on Transformers (Vaswani et al., 2017), using self-attention to compute a node representation based on all nodes in the graph. This approach enjoys the advantage of a large node context range, but neglects the graph topology by effectively treating every node as being connected to all the others in the graph. In contrast, models based on local aggregation learn the representation of each node based on its adjacent nodes as defined in the input graph. This approach effectively exploits the graph topology, and the graph structure has a strong impact on the node representation (Xu et al., 2018). However, encoding relations between distant nodes is challenging, as it requires more graph encoding layers, which can also propagate noise (Li et al., 2018).
For example, Figure 1a presents a KG, for which a corresponding text is shown in Figure 1b. Note that there is a mismatch between how entities are connected in the graph and how their natural language descriptions are related in the text. Some entities syntactically related in the text are not connected in the graph. For instance, in the sentence “For the link prediction task, first we learn node embeddings using DistMult method.”, although the entity mentions depend on the same verb, in the graph the node embeddings node has no explicit connection with the link prediction and DistMult nodes, which are in a different connected component. This example illustrates the importance of encoding distant information in the input graph. As shown in Figure 1c, a global encoder is able to learn a node representation for node embeddings that captures information from non-connected entities such as DistMult. By modeling distant connections between all nodes, we allow for these missing links to be captured, as KGs are known to be highly incomplete (Dong et al., 2014; Schlichtkrull et al., 2018).
In contrast, the local strategy refines the node representation with richer neighborhood information, as nodes that share the same neighborhood exhibit a strong homophily: Two similar entities are much more likely to be connected than at random. Consequently, the local context enriches the node representation with local information from KG triples. For example, in Figure 1a, the GAT node reaches the node embeddings node through the GNN node. This transitive relation can be captured by a local encoder, as shown in Figure 1d. Capturing this form of relationship can also support text generation at the sentence level.
In this paper, we investigate novel graph-to-text architectures that combine both global and local node aggregations, gathering the benefits from both strategies. In particular, we propose a unified graph-to-text framework based on Graph Attention Networks (GATs) (Veličković et al., 2018). As part of this framework, we empirically compare two main architectures: a cascaded architecture that performs global node aggregation before performing local node aggregation, and a parallel architecture that performs global and local aggregations simultaneously. The cascaded architecture allows the local encoder to leverage global encoding features, and the parallel architecture allows more independent features to complement each other. To further consider fine-grained integration, we additionally consider layer-wise integration of the global and local encoders.
Extensive experiments show that our approaches consistently outperform recent models on two benchmarks for text generation from KGs. To the best of our knowledge, we are the first to consider integrating global and local context aggregation in graph-to-text generation, and the first to propose a unified GAT structure for combining global and local node contexts.
2 Related Work
AMR-to-Text Generation.
Various neural models have been proposed to generate sentences from Abstract Meaning Representation (AMR) graphs. Konstas et al. (2017) provide the first neural approach for this task, linearizing the input graph as a sequence of nodes and edges. Song et al. (2018) propose a graph recurrent network to directly encode the AMR nodes, whereas Beck et al. (2018) develop a model based on gated GNNs. However, both approaches only use local node aggregation strategies. Damonte and Cohen (2019) combine graph convolutional networks and LSTMs in order to learn complementary node contexts. However, unlike Transformers and GNNs, LSTMs generate node representations that are influenced by the node order. Ribeiro et al. (2019) develop a model based on different GNNs that learns node representations which simultaneously encode top-down and bottom-up views of the AMR graph, whereas Guo et al. (2019) leverage dense connectivity in GNNs. Recently, Wang et al. (2020) proposed a local graph encoder based on Transformers, using separate attentions for incoming and outgoing neighbors. Recent methods (Zhu et al., 2019; Cai and Lam, 2020) also use Transformers, but learn globalized node representations, modeling graph paths in order to capture structural relations.
KG-to-Text Generation.
In this work, we focus on generating text from KGs. In comparison to AMRs, which are rooted and connected graphs, KGs do not have a fixed topology; their structure varies widely among different datasets, making the generation process more demanding. KGs are sparse structures that potentially contain a large number of relations. Moreover, we are typically interested in generating multisentence texts from KGs, which involves solving document planning issues (Konstas and Lapata, 2013).
Recent neural approaches for KG-to-text generation simply linearize the KG triples, thereby losing graph structure information. For instance, Colin and Gardent (2018), Moryossef et al. (2019), and Adapt (Gardent et al., 2017) utilize LSTMs/GRUs to encode WebNLG graphs. Castro Ferreira et al. (2019) systematically compare pipeline and end-to-end models for text generation from WebNLG graphs. Trisedya et al. (2018) develop a graph encoder based on LSTMs that captures relationships within and between triples. Previous work has also studied how to explicitly encode the graph structure using GNNs or Transformers. Marcheggiani and Perez Beltrachini (2018) propose an encoder based on graph convolutional networks that explicitly considers local node contexts, and show superior performance compared with LSTMs. Recently, Koncel-Kedziorski et al. (2019) proposed a Transformer-based approach that computes the node representations by attending over node neighborhoods following a self-attention strategy. In contrast, our models focus on distinct global and local message passing mechanisms, capturing complementary graph contexts.
Integrating Global Information.
There has been recent work that attempts to integrate global context in order to learn better node representations in graph-to-text generation. To this end, existing methods use an artificial global node for message exchange with the other nodes. This strategy can be regarded as extending the graph structure while using similar message passing mechanisms. In particular, Koncel-Kedziorski et al. (2019) add a global node to the graph and use its representation to initialize the decoder. Recently, Guo et al. (2019) and Cai and Lam (2020) also utilized an artificial global node with direct edges to all other nodes to allow global message exchange for AMR-to-text generation. Similarly, Zhang et al. (2018) add a global node to a graph recurrent network model for sentence representation. Different from the above methods, we consider integrating global and local contexts at the node level, rather than the graph level, by investigating model alternatives rather than graph structure changes. In addition, we integrate GAT and Transformer architectures into a unified global-local model.
3 Graph-to-Text Model
This section first describes (i) the graph transformation adopted to create a relational graph from the input (Section 3.1), and (ii) the graph encoders of our framework based on GAT (Veličković et al., 2018), which deal with both global (Section 3.3) and local (Section 3.4) node contexts. We adopt GAT because it is closely related to the Transformer architecture (Vaswani et al., 2017), which provides a convenient prototype for modeling global node context. Then, (iii) we propose strategies to combine the global and local graph encoders (Section 3.5). Finally, (iv) we describe the decoding and training procedures (Section 3.6).
3.1 Graph Preparation
We represent a KG as a multi-relational graph[2] G_e = (V_e, E_e, R) with entity nodes e ∈ V_e and labeled edges (e_h, r, e_t) ∈ E_e, where r ∈ R denotes the relation existing from the entity e_h to the entity e_t.[3]
Unlike other current approaches (Koncel-Kedziorski et al., 2019; Moryossef et al., 2019), we represent an entity as a set of nodes. For instance, the KG node "node embedding" in Figure 1 will be represented by two nodes, one for the token "node" and the other for the token "embedding". Formally, we transform each G_e into a new graph G = (V, E, R), where each token of an entity e ∈ V_e becomes a node v ∈ V. We convert each edge (e_h, r, e_t) ∈ E_e into a set of edges (with the same relation r) and connect every token of e_h to every token of e_t. That is, an edge (u, r, v) will belong to E if and only if there exists an edge (e_h, r, e_t) ∈ E_e such that u ∈ e_h and v ∈ e_t, where e_h and e_t are seen as sets of tokens. We represent each node v ∈ V with an embedding h_v, generated from its corresponding token.
The new graph increases the representational power of the models because it allows learning node embeddings at the token level instead of the entity level. This is particularly important for text generation, as it permits the model to be more flexible, capturing richer relationships between entity tokens. It also allows the model to learn relations and attention functions between source and target tokens. However, it has the side effect of removing the natural sequential order of multiword entities. To preserve this information, we use position embeddings (Vaswani et al., 2017); that is, h_v becomes the sum of the corresponding token embedding and the positional embedding for v.
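As a concrete illustration of this transformation, the sketch below expands entity-level triples into token-level edges, connecting every token of the head entity to every token of the tail entity with the same relation. The data structures, helper names, and example triple are illustrative assumptions rather than our actual implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TokenEdge:
    head: str      # token node of the head entity
    relation: str  # original KG relation r
    tail: str      # token node of the tail entity

def entity_tokens(entity: str):
    # "node embeddings" -> ["node embeddings/0:node", "node embeddings/1:embeddings"]
    # Prefixing with the entity and position keeps token nodes of different entities distinct.
    return [f"{entity}/{i}:{tok}" for i, tok in enumerate(entity.split())]

def to_token_graph(triples):
    """Expand entity-level triples (e_h, r, e_t) into token-level edges (u, r, v)."""
    edges = []
    for e_h, r, e_t in triples:
        for u in entity_tokens(e_h):
            for v in entity_tokens(e_t):
                edges.append(TokenEdge(u, r, v))
    return edges

# One KG triple from Figure 1 (the relation name is hypothetical) yields 2 x 2 token edges.
for edge in to_token_graph([("node embeddings", "used-for", "link prediction")]):
    print(edge)
```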
3.2 Graph Neural Networks (GNN)
A GNN computes the representation of a node v by iteratively aggregating the representations of its neighbors N(v) and combining the result with the node's representation from the previous layer:

h_{N(v)}^(l) = AGGR^(l)({h_u^(l-1) : u ∈ N(v)})
h_v^(l) = COMBINE^(l)(h_v^(l-1), h_{N(v)}^(l))

After L iterations, a node's representation encodes the structural information within its L-hop neighborhood. The choices of AGGR^(l)(.) and COMBINE^(l)(.) differ by the specific GNN model. An example of AGGR^(l)(.) is the sum of the representations of the nodes in N(v). An example of COMBINE^(l)(.) is a concatenation after the feature transformation.
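The sketch below instantiates one such layer with the two examples given above: sum aggregation for AGGR and concatenation followed by a linear transformation for COMBINE. It is a generic illustration of the abstraction, not the specific encoder used in our models.

```python
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """One message-passing layer: AGGR = neighbor sum, COMBINE = concat + linear."""

    def __init__(self, dim: int):
        super().__init__()
        self.combine = nn.Linear(2 * dim, dim)

    def forward(self, h: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # h: [num_nodes, dim]; edge_index: [2, num_edges] listing (source, target) pairs
        src, tgt = edge_index
        aggregated = torch.zeros_like(h)
        aggregated.index_add_(0, tgt, h[src])   # AGGR: sum over incoming neighbors N(v)
        # COMBINE: concatenate h_v^(l-1) with the aggregated neighborhood and transform
        return torch.relu(self.combine(torch.cat([h, aggregated], dim=-1)))

# Stacking L such layers gives every node an L-hop receptive field.
layer = SimpleGNNLayer(dim=16)
h = torch.randn(4, 16)                                 # 4 token nodes
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])      # edges 0->1, 1->2, 2->3
h = layer(h, edge_index)
```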
3.3 Global Graph Encoder
3.4 Local Graph Encoder
3.5 Combining Global and Local Encodings
Our goal is to implement a graph encoder capable of encoding global and local aspects of the input graph. We hypothesize that these two sources of information are complementary, and a combination of both enriches node representations for text generation. In order to test this hypothesis, we investigate different combined architectures.
Intuitively, there are two general methods for integrating the two types of representation. The first is to concatenate the vectors of the global and local contexts, which we call a parallel representation. The second is to form a pipeline, where a global representation is first obtained and then used as input for calculating refined representations based on the local node context. We call this approach a cascaded representation.
Parallel and cascaded integration can be performed at the model level, treating the global and local graph encoders as two representation learning units and disregarding their internal structures. However, because our model adopts a multilayer architecture, where each layer provides a level of abstraction in the representation, we can alternatively consider integration at the layer level, so that more interaction between global and local contexts may be captured. As a result, we present four architectures for integration, as shown in Figure 2. All models serve the same purpose, and their relative strengths should be evaluated empirically.
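To make the two model-level variants concrete, the sketch below wires a generic global encoder (attention over all nodes, ignoring edges) and a generic local encoder (message passing over edges) in parallel, concatenating their outputs, and in cascade, feeding the global output into the local encoder. The module interfaces and the concatenation-based fusion are simplifying assumptions, not our exact architecture; the layer-wise variants interleave the same two operations inside every layer.

```python
import torch
import torch.nn as nn

class ParallelGraphEncoder(nn.Module):
    """PGE-style integration: global and local contexts are computed independently."""

    def __init__(self, global_encoder: nn.Module, local_encoder: nn.Module, dim: int):
        super().__init__()
        self.global_encoder = global_encoder
        self.local_encoder = local_encoder
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, h: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        h_global = self.global_encoder(h)             # all-to-all context, topology-agnostic
        h_local = self.local_encoder(h, edge_index)   # neighborhood context over graph edges
        return self.fuse(torch.cat([h_global, h_local], dim=-1))

class CascadedGraphEncoder(nn.Module):
    """CGE-style integration: the local encoder refines globally contextualized nodes."""

    def __init__(self, global_encoder: nn.Module, local_encoder: nn.Module):
        super().__init__()
        self.global_encoder = global_encoder
        self.local_encoder = local_encoder

    def forward(self, h: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        return self.local_encoder(self.global_encoder(h), edge_index)
```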
Parallel Graph Encoding (PGE).
Cascaded Graph Encoding (CGE).
Layer-wise Parallel Graph Encoding.
Layer-wise Cascaded Graph Encoding.
3.6 Decoder and Training
Our decoder follows the core architecture of a Transformer decoder (Vaswani et al., 2017). Each time step t is updated by performing multi-head attention over the output of the encoder (node embeddings h_v) and over previously generated tokens (token embeddings). An additional challenge in our setup is to generate multisentence outputs. In order to encourage the model to generate longer texts, we employ a length penalty (Wu et al., 2016) to refine pure max-probability beam search.
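A sketch of the length penalty of Wu et al. (2016) used for rescoring beam hypotheses is shown below; the value of alpha is a tunable hyperparameter and is not taken from our configuration.

```python
def length_penalty(length: int, alpha: float = 0.6) -> float:
    """Wu et al. (2016): lp(Y) = (5 + |Y|)^alpha / (5 + 1)^alpha."""
    return ((5.0 + length) ** alpha) / ((5.0 + 1.0) ** alpha)

def rescore(sum_log_prob: float, length: int, alpha: float = 0.6) -> float:
    # Dividing the hypothesis score by the penalty counteracts the bias of
    # pure max-probability beam search toward shorter outputs.
    return sum_log_prob / length_penalty(length, alpha)
```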
The model is trained to optimize the negative log-likelihood of each gold-standard output text. We use label smoothing regularization to prevent the model from predicting the tokens too confidently during training and generalizing poorly.
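As an illustration, a minimal form of label smoothing mixes the one-hot gold distribution with a uniform distribution over the vocabulary; the epsilon value below is an assumption, not our exact setting.

```python
import torch
import torch.nn.functional as F

def label_smoothed_nll(logits: torch.Tensor, gold: torch.Tensor, epsilon: float = 0.1) -> torch.Tensor:
    # logits: [batch, vocab_size]; gold: [batch] gold token indices
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(1, gold.unsqueeze(1)).squeeze(1)  # -log p(gold token)
    uniform = -log_probs.mean(dim=-1)                         # cross-entropy w.r.t. a uniform target
    return ((1.0 - epsilon) * nll + epsilon * uniform).mean()
```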
4 Data and Preprocessing
Table 1: Statistics of the AGENDA and WebNLG datasets.

Dataset | #train | #dev | #test | #relations | avg #entities | avg #nodes | avg #edges | avg #CC | avg length
---|---|---|---|---|---|---|---|---|---
AGENDA | 38,720 | 1,000 | 1,000 | 7 | 12.4 | 44.3 | 68.6 | 19.1 | 140.3
WebNLG | 18,102 | 872 | 971 | 373 | 4.0 | 34.9 | 101.0 | 1.5 | 24.2
AGENDA.
In this dataset, KGs are paired with scientific abstracts extracted from the proceedings of 12 top AI conferences. Each instance consists of the paper title, a KG, and the paper abstract. Entities correspond to scientific terms that are often multiword expressions (co-referential entities are merged). We treat each token in the title as a node, creating a single graph with title and KG tokens as nodes. As shown in Table 1, the average output length is considerable, as the target outputs are multisentence abstracts.
WebNLG.
In this dataset, each instance contains a KG extracted from DBPedia. The target text consists of sentences that verbalize the graph. We evaluate the models on the test set with seen categories. Note that this dataset has a considerable number of edge relations (see Table 1). In order to avoid parameter explosion, we use regularization based on the basis function decomposition to define the model relation weights (Schlichtkrull et al., 2018). Also, as an alternative, we use the Levi Transformation to create nodes from relational edges between entities (Beck et al., 2018). That is, we create a new relation node for each edge relation between two nodes. The new relation node is connected to the subject and object token entities by two binary relations, respectively.
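A minimal sketch of the Levi transformation is given below: each labeled edge becomes a new relation node linked to its subject and object by two plain edges. The data structures and the example triple are illustrative; in our setup the entity strings are additionally split into token nodes as described in Section 3.1.

```python
def levi_transform(triples):
    """Turn (head, relation, tail) triples into an unlabeled graph with relation nodes."""
    nodes, edges = set(), []
    for i, (head, rel, tail) in enumerate(triples):
        rel_node = f"{rel}#{i}"          # one relation node per edge occurrence
        nodes.update([head, rel_node, tail])
        edges.append((head, rel_node))   # subject -> relation node
        edges.append((rel_node, tail))   # relation node -> object
    return nodes, edges

nodes, edges = levi_transform([("Turkey", "capital", "Ankara")])
print(sorted(nodes))  # ['Ankara', 'Turkey', 'capital#0']
print(edges)          # [('Turkey', 'capital#0'), ('capital#0', 'Ankara')]
```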
5 Experiments
We implemented all our models using PyTorch Geometric (PyG) (Fey and Lenssen, 2019) and OpenNMT-py (Klein et al., 2017). We use the Adam optimizer with β1 = 0.9 and β2 = 0.98. Our learning rate schedule follows Vaswani et al. (2017), with 8,000 and 16,000 warm-up steps for WebNLG and AGENDA, respectively. The vocabulary is shared between the node and target tokens. In order to mitigate the effects of random seeds, for the test sets we report averages over 4 training runs along with their standard deviations. We use byte pair encoding (Sennrich et al., 2016) to split entity words into smaller, more frequent pieces, so some nodes in the graph can be subwords. We also obtain subwords on the target side. Following previous work, we evaluate the results with the BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014), and CHRF++ (Popović, 2015) automatic metrics and also perform a human evaluation (Section 5.6). For layer-wise models, the number of encoder layers is chosen from {2, 4, 6}, and for PGE and CGE, the numbers of global and local layers are chosen from {2, 4, 6} and {1, 2, 3}, respectively. The hidden encoder dimensions are chosen from {256, 384, 448} (see Figure 3). Hyperparameters are tuned on the development sets of both datasets. We report the test results of the configuration with the best BLEU score on the dev set.
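For reference, the learning rate schedule of Vaswani et al. (2017) used above can be sketched as follows; d_model and the overall scaling follow the standard formulation, and the exact constants we used are not implied.

```python
def transformer_lr(step: int, d_model: int = 512, warmup: int = 8000) -> float:
    """Linear warm-up followed by inverse-square-root decay."""
    step = max(step, 1)
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup ** -1.5)

# WebNLG uses 8,000 warm-up steps and AGENDA 16,000 (see above).
print(transformer_lr(4000, warmup=8000), transformer_lr(40000, warmup=16000))
```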
5.1 Results on AGENDA
Table 2 shows the results, where we report the number of layers and attention heads utilized. We train models with only global or local encoders as baselines. Each model uses the parameter size that gives the best results on the dev set. First, the local encoder, which requires fewer encoder layers and parameters, performs better than the global encoder. This shows that explicitly encoding the graph structure is important for improving the node representations. Second, our approaches substantially outperform both baselines. CGE-LW outperforms Koncel-Kedziorski et al. (2019), a Transformer model that focuses on the relations between adjacent nodes, by a large margin, achieving a new state-of-the-art BLEU score of 18.01, 25.9% higher. We also note that KGs are highly incomplete in this dataset, with an average number of connected components of 19.1 (see Table 1). For this reason, the global encoder plays an important role in our models, as it enables learning node representations based on all connected components. The results indicate that combining the local node context, leveraging the graph topology, with the global node context, capturing macro-level node relations, leads to better performance. We find that, even though CGE has a smaller number of parameters than CGE-LW, it achieves comparable performance. PGE-LW has the worst performance among the proposed models. Finally, note that the cascaded architectures are more effective across different metrics.
Table 2: Results on the AGENDA test set. #L: number of encoder layers; #H: number of attention heads; #P: number of parameters (millions).

Model | #L | #H | BLEU | METEOR | CHRF++ | #P
---|---|---|---|---|---|---
Koncel-Kedziorski et al. (2019) | 6 | 8 | 14.30±1.01 | 18.80±0.28 | – | –
Global Encoder | 6 | 8 | 15.44±0.25 | 20.76±0.19 | 43.95±0.40 | 54.4
Local Encoder | 3 | 8 | 16.03±0.19 | 21.12±0.32 | 44.70±0.29 | 54.0
PGE | 6, 3 | 8, 8 | 17.55±0.15 | 22.02±0.07 | 46.41±0.07 | 56.1
CGE | 6, 3 | 8, 8 | 17.82±0.13 | 22.23±0.09 | 46.47±0.10 | 61.5
PGE-LW | 6 | 8, 8 | 17.42±0.25 | 21.78±0.20 | 45.79±0.32 | 69.0
CGE-LW | 6 | 8, 8 | 18.01±0.14 | 22.34±0.07 | 46.69±0.17 | 69.8
5.2 Results on WebNLG
We compare the performance of our most effective models (CGE, CGE-LW) with six state-of-the-art results reported on this dataset. Three systems are the best competitors in the WebNLG challenge for seen categories: UPF-FORGe, Melbourne, and Adapt. UPF-FORGe follows a rule-based approach, whereas the others use neural encoder-decoder models with linearized triple sets as input.
Table 3 presents the results. CGE achieves a BLEU score of 62.30, 8.9% better than the best model of Castro Ferreira et al. (2019), who use an end-to-end architecture based on GRUs. CGE using Levi graphs outperforms Trisedya et al. (2018), an approach that encodes both intra-triple and inter-triple relationships, by 4.5 BLEU points. Interestingly, their intra-triple and inter-triple mechanisms are closely related to our local and global encodings. However, they rely on encoding entities based on sequences generated by graph traversal algorithms, whereas we explicitly exploit the graph structure through local neighborhood aggregation.
Table 3: Results on the WebNLG test set with seen categories. #P: number of parameters (millions).

Model | BLEU | METEOR | CHRF++ | #P
---|---|---|---|---
UPF-FORGe (Gardent et al., 2017) | 40.88 | 40.00 | – | –
Melbourne (Gardent et al., 2017) | 54.52 | 41.00 | 70.72 | –
Adapt (Gardent et al., 2017) | 60.59 | 44.00 | 76.01 | –
Marcheggiani and Perez Beltrachini (2018) | 55.90 | 39.00 | – | 4.9
Trisedya et al. (2018) | 58.60 | 40.60 | – | –
Castro Ferreira et al. (2019) | 57.20 | 41.00 | – | –
CGE | 62.30±0.27 | 43.51±0.18 | 75.49±0.34 | 13.9
CGE (Levi Graph) | 63.10±0.13 | 44.11±0.09 | 76.33±0.10 | 12.8
CGE-LW | 62.85±0.07 | 43.75±0.21 | 75.73±0.31 | 11.2
CGE-LW (Levi Graph) | 63.69±0.10 | 44.47±0.12 | 76.66±0.10 | 10.4
CGE-LW with Levi graphs as inputs has the best performance, achieving 63.69 BLEU points, even though it uses fewer parameters. Note that this approach allows the model to handle new relations, as they are treated as nodes. Moreover, the relations become part of the shared vocabulary, making this information directly usable during the decoding phase. We outperform an approach based on GNNs (Marcheggiani and Perez Beltrachini, 2018) by a large margin of 7.7 BLEU points, showing that our combined graph encoding strategies lead to better text generation. We also outperform Adapt, a strong competitor that utilizes subword encodings, by 3.1 BLEU points.
5.3 Development Experiments
We report several development experiments in Figure 3. Figure 3a shows the effect of the number of encoder layers on the four encoding methods.[4] In general, the performance increases when we gradually enlarge the number of layers, with the best performance achieved with 6 encoder layers. Figure 3b shows the choices of hidden sizes for the encoders. The best performances for the global encoder and PGE are achieved with 384 dimensions, whereas the other models perform best with 448 dimensions. In Figure 3c, we evaluate the performance for different numbers of parameters.[5] When the models are smaller, the parallel encoders obtain better results than the cascaded ones. When the models are larger, the cascaded models perform better. We speculate that for some models the performance could be further improved with more parameters and layers. However, we do not attempt this owing to hardware limitations.
5.4 Ablation Study
In Table 4, we report an ablation study on the impact of each module used in the CGE model on the dev set of AGENDA. We also report the number of parameters used in each configuration.
Table 4: Ablation study for the CGE model on the AGENDA dev set. #P: number of parameters (millions).

Model | BLEU | CHRF++ | #P
---|---|---|---
CGE | 17.38 | 45.68 | 61.5
*Global Encoder* | | |
- Global Attention | 15.59 | 43.71 | 59.0
- FFN | 16.33 | 44.86 | 50.4
- Global Encoder | 15.17 | 43.30 | 45.6
*Local Encoder* | | |
- Local Attention | 16.92 | 45.97 | 61.5
- Weight Relations | 16.88 | 45.61 | 53.6
- GRU | 16.38 | 44.71 | 60.2
- Local Encoder | 14.68 | 42.98 | 51.8
- Shared Vocab. | 16.92 | 46.16 | 81.8
*Decoder* | | |
- Length Penalty | 16.68 | 44.68 | 61.5
Global Graph Encoder.
We start with an ablation on the global encoder. After removing the global attention coefficients, the performance of the model drops by 1.79 BLEU and 1.97 CHRF++ points. Results also show that using the FFN in the global COMBINE(.) function is important to the model, but less so than the global attention. However, when we remove the FFN, the number of parameters drops considerably (around 18%), from 61.5 to 50.4 million. Finally, without the entire global encoder, the result drops substantially, by 2.21 BLEU points. This indicates that enriching node embeddings with global context allows learning more expressive graph representations.
Local Graph Encoder.
We first remove the local graph attention, and the BLEU score drops to 16.92, showing that the neighborhood attention improves the performance. After removing the relation types, encoded as model weights, the performance drops by 0.5 BLEU points. However, the number of parameters is reduced by around 7.9 million. This indicates that we can obtain a more parameter-efficient model with only a slight drop in performance. Removing the GRU used in the COMBINE(.) function decreases the performance considerably. The worst performance occurs if we remove the entire local encoder, with a BLEU score of 14.68, which essentially makes the encoder similar to the global baseline.
Finally, we find that vocabulary sharing improves the performance, and the length penalty is beneficial as we generate multisentence outputs.
5.5 Impact of the Graph Structure and Output Length
The overall performance on both datasets suggests the strength of combining global and local node representations. However, we are also interested in estimating the models’ performance concerning different data properties.
Graph Size.
Figure 4a shows the effect of the graph size, measured in number of nodes, on the performance, measured using CHRF++ scores,[6] for AGENDA. We evaluate the global and local graph encoders, PGE-LW, and CGE-LW. We find that the score increases as the graph size increases. Interestingly, the gap between the local and global encoders widens as the graph size increases. This suggests that, because larger graphs may have very different topologies, modeling the relations between nodes based on the graph structure is more beneficial than allowing direct communication between all nodes while overlooking the graph structure. Also note that the cascaded model (CGE-LW) is consistently better than the parallel model (PGE-LW) over all graph sizes.
Table 5 shows the effect of the graph size, measured in number of triples, on the performance for WebNLG. Our model obtains better scores over all partitions. In contrast to AGENDA, the performance decreases as the graph size increases. This behavior highlights a crucial difference between AGENDA and the AMR and WebNLG datasets, in which models' performance generally decreases as the graph size increases (Gardent et al., 2017; Cai and Lam, 2020). In WebNLG, the graph and sentence sizes are correlated, and longer sentences are more challenging to generate than shorter ones. In contrast, AGENDA contains outputs of similar lengths,[7] and when the input is a larger graph, the model has more information to leverage during generation.
Table 5: Performance on the WebNLG test set by number of triples (#T), graph diameter (#D), and number of sentences (#S). #DP: number of datapoints.

#T | #DP | Melbourne | Adapt | CGE-LW
---|---|---|---|---
1-2 | 396 | 78.74 | 83.10 | 84.35
3-4 | 386 | 66.84 | 72.02 | 72.27
5-7 | 189 | 61.85 | 69.28 | 70.25

#D | #DP | Melbourne | Adapt | CGE-LW
---|---|---|---|---
1 | 222 | 82.27 | 87.54 | 88.04
2 | 469 | 69.94 | 74.54 | 75.90
≥3 | 280 | 62.87 | 69.30 | 69.41

#S | #DP | Melbourne | Adapt | CGE-LW
---|---|---|---|---
1 | 388 | 77.19 | 81.66 | 82.03
2 | 306 | 67.29 | 73.29 | 73.78
3 | 151 | 66.30 | 72.46 | 73.21
4 | 66 | 66.73 | 71.26 | 75.16
≥5 | 60 | 61.93 | 67.57 | 69.20
Graph Diameter.
Figure 4b shows the impact of the graph diameter[8] on the performance for AGENDA. Similarly to the graph size, the score increases as the diameter increases. As the global encoder is not aware of the graph structure, this module has the worst scores, even though it enables direct node communication over long distances. In contrast, the local encoder can propagate precise node information throughout the graph structure over k-hop distances, leading to better relative performance. Table 5 shows the models' performance with respect to the graph diameter for WebNLG. Similarly to the graph size, the score decreases as the diameter increases.
Output Length.
One interesting phenomenon to analyze is the length distribution (in number of words) of the generated outputs. We expect our models to generate texts with lengths similar to those of the reference texts. As shown in Figure 4c, the references are usually longer than the texts generated by all models for AGENDA. The texts generated by CGE-no-pl, a CGE model without the length penalty, are consistently shorter than the texts from the global and CGE models. The length of the generated texts increases when we use the length penalty (see Section 3.6). However, there is still a gap between the reference and generated text lengths. We leave further investigation of this aspect for future work.
Table 5 shows the models’ performances with respect to the number of sentences for WebNLG. In general, increasing the number of sentences reduces the performance of all models. Note that when the number of sentences increases, the gap between CGE-LW and the baselines becomes larger. This suggests that our approach is able to better handle complex graph inputs in order to generate multisentence texts.
Effect of the Number of Nodes on the Output Length.
Figure 5 shows the effect of the size of a graph, defined as the number of nodes, on the quality (measured in CHRF++ scores) and length of the generated text (in number of words) on the AGENDA dev set. We bin both the graph size and the output length into 4 classes. CGE consistently outperforms the global model, in some cases by a large margin. When handling smaller graphs (with ≤35 nodes), both models have difficulties generating good texts. However, for these smaller graphs, our model achieves a score 12.2% better when generating texts of length ≤75. Interestingly, when generating longer texts (>140) from smaller graphs, our model outperforms the global encoder by an impressive 21.7%, indicating that our model is more effective in capturing semantic signals from graphs with scarce information. Our approach also performs better when the graph size is large (>55) but the generation output is small (≤75), beating the global encoder by 9 points.
5.6 Human Evaluation
To further assess the quality of the generated text, we conduct a human evaluation on the WebNLG dataset.[9] Following previous work (Gardent et al., 2017; Castro Ferreira et al., 2019), we assess two quality criteria: (i) Fluency (i.e., does the text flow in a natural, easy-to-read manner?) and (ii) Adequacy (i.e., does the text clearly express the data?). We divide the datapoints into seven different sets by the number of triples. For each set, we randomly select 20 texts generated by Adapt, CGE with Levi graphs, and their corresponding human references (420 texts in total). Because the number of datapoints in each set is not balanced (see Table 5), this sampling strategy ensures that we have the same number of samples for the different triple sets. Moreover, the human references serve as an indicator of the sanity of the human evaluation experiment. We recruited human workers from Amazon Mechanical Turk to rate the text outputs on a 1–5 Likert scale. For each text, we collect scores from 4 workers and average them. Table 6 shows the results. We first note a similar trend as in the automatic evaluation, with CGE outperforming Adapt on both fluency and adequacy. In sets with fewer than 5 triples, CGE was the highest-rated system in fluency. Similarly to the automatic evaluation, both systems are better at generating text from graphs with smaller diameters. Note that larger diameters pose difficulties for the models, which achieve their worst performance for diameters ≥3.
Table 6: Human evaluation on WebNLG: fluency (F) and adequacy (A) for Adapt, CGE, and the human reference, grouped by number of triples (#T) and graph diameter (#D).

#T | Adapt (F) | Adapt (A) | CGE (F) | CGE (A) | Reference (F) | Reference (A)
---|---|---|---|---|---|---
All | 3.96C | 4.44C | 4.12B | 4.54B | 4.24A | 4.63A
1–2 | 3.94C | 4.59B | 4.18B | 4.72A | 4.30A | 4.69A
3–4 | 3.79C | 4.45B | 3.96B | 4.50AB | 4.14A | 4.66A
5–7 | 4.08B | 4.35B | 4.18B | 4.45B | 4.28A | 4.59A

#D | Adapt (F) | Adapt (A) | CGE (F) | CGE (A) | Reference (F) | Reference (A)
---|---|---|---|---|---|---
1–2 | 3.98C | 4.50B | 4.16B | 4.61A | 4.28A | 4.66A
≥3 | 3.91C | 4.33B | 4.03B | 4.43B | 4.17A | 4.60A
5.7 Additional Experiments
Impact of the Vocabulary Sharing and Length Penalty.
During the ablation studies, we noted that vocabulary sharing and the length penalty are beneficial for performance. To better estimate their impact, we evaluate the CGE-LW model and its variants without vocabulary sharing, without the length penalty, and without both mechanisms, on the test sets of both datasets. Table 7 shows the results. We observe that sharing the vocabulary is more important for WebNLG than for AGENDA. This suggests that vocabulary sharing is beneficial when the training data is small, as in WebNLG. On the other hand, the length penalty is more effective for AGENDA, as it has longer texts than WebNLG,[10] improving the BLEU score by 0.71 points.
Table 7: Impact of vocabulary sharing and the length penalty on the test sets of AGENDA and WebNLG.

*AGENDA*
Model | BLEU | CHRF++
---|---|---
CGE-LW | 18.17 | 46.80
- Shared Vocab | 17.88 | 47.12
- Length Penalty | 17.46 | 45.76
- Both | 17.24 | 46.14

*WebNLG*
Model | BLEU | CHRF++
---|---|---
CGE-LW | 63.86 | 76.80
- Shared Vocab | 63.07 | 76.17
- Length Penalty | 63.28 | 76.51
- Both | 62.60 | 75.80
How Far Does the Global Attention Look?
Following previous work (Voita et al., 2019; Cai and Lam, 2020), we investigate the attention distribution of each global layer of the CGE-LW graph encoder on the AGENDA dev set. In particular, for each node, we identify the global neighbor that receives the maximum attention weight and record the distance between them.[11] Figure 7 shows the averaged distances for each global layer. We observe that the global encoder mainly focuses on distant nodes, rather than the neighbors and closest nodes. This is very interesting and agrees with our intuition: Whereas the local encoder is concerned with the local neighborhood, the global encoder focuses on information from long-distance nodes.
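A sketch of this analysis is shown below, assuming an attention matrix per global layer with one row per node; for each node we take the node receiving its maximum attention weight and average the shortest-path distances between such pairs. All names and shapes are illustrative assumptions.

```python
import networkx as nx
import numpy as np

def avg_max_attention_distance(attn: np.ndarray, graph: nx.Graph) -> float:
    """attn: [num_nodes, num_nodes] attention weights of one global layer;
    graph nodes are assumed to be integer ids 0..num_nodes-1."""
    lengths = dict(nx.all_pairs_shortest_path_length(graph))
    distances = []
    for v in range(attn.shape[0]):
        u = int(attn[v].argmax())        # global neighbor with maximum attention
        if u in lengths.get(v, {}):      # skip pairs in different connected components
            distances.append(lengths[v][u])
    return float(np.mean(distances)) if distances else float("nan")
```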
Case Study.
Figure 6 shows examples of generated texts when the WebNLG graph is complex (7 triples). While CGE generates a factually correct text (it correctly verbalises all triples), Adapt's output is repetitive. The example also illustrates how the text generated by CGE closely follows the graph structure, whereby the first sentence verbalises the right-most subgraph, the second the left-most one, and the linking node Turkey makes the transition (using hyperonymy and a definite description, i.e., The country). The text created by CGE is also more coherent than the reference. As noted above, the input graph includes two subgraphs linked by Turkey. In natural language, such a meaning representation corresponds to a topic shift, with the first part of the text describing an entity from one subgraph, the second part an entity from the other subgraph, and the linking entity (Turkey) marking the topic shift. Typically, in English, a topic shift is marked by a definite noun phrase in the subject position. Although this is precisely the discourse structure generated by CGE (Turkey is realized in the second sentence by the definite description The country in the subject position), the reference fails to mark the topic shift, resulting in a text with weaker discourse coherence.
6 Conclusion
In this work, we introduced a unified graph attention network structure for graph-to-text models that combine global and local graph encoders in order to improve text generation. An extensive evaluation of our models demonstrated that the global and local contexts are empirically complementary, and that a combination of both can achieve state-of-the-art results on two datasets. In addition, cascaded architectures give better results than parallel ones.
We point out some directions for future work. First, it is interesting to study different fusion strategies to assemble the global and local encodings. Second, a promising direction is incorporating pre-trained contextualized word embeddings in graphs. Third, as discussed in Section 5.5, it is worth studying ways to diminish the gap between the reference and the generated text lengths.
Acknowledgments
We would like to thank Pedro Savarese, Markus Zopf, Mohsen Mesgar, Prasetya Ajie Utama, Ji-Ung Lee, and Kevin Stowe for their feedback on this work, as well as the anonymous reviewers for detailed comments that improved this paper. This work has been supported by the German Research Foundation as part of the Research Training Group Adaptive Preparation of Information from Heterogeneous Sources (AIPHES) under grant No. GRK 1994/1.
Notes
1. Code is available at https://github.com/UKPLab/kg2text.
2. In this paper, multi-relational graphs refer to directed graphs with labeled edges.
3. R contains relations both in the canonical direction (e.g., used-for) and in the inverse direction (e.g., used-for-inv), so that the models consider the differences between incoming and outgoing relations.
4. For CGE and PGE, the values refer to the global layers; the number of local layers is fixed to 3.
5. It was not possible to run the local model with a larger number of parameters because of memory limitations.
6. The CHRF++ score is used as it is a sentence-level metric.
7. As shown in Figure 4c, 82% of the reference abstracts have more than 100 words.
8. The diameter of a graph is defined as the length of the longest shortest path between two nodes. We convert the graphs into undirected graphs to calculate the diameters.
9. Because AGENDA is scientific in nature, we choose to crowd-source human evaluations only for WebNLG.
10. As shown in Table 1, AGENDA texts are 5.8 times longer than WebNLG texts on average.
11. The distance between two nodes is defined as the number of edges in a shortest path connecting them.