Abstract
Abstract meaning representation (AMR)-to-text generation is the challenging task of generating natural language texts from AMR graphs, where nodes represent concepts and edges denote relations. The current state-of-the-art methods use graph-to-sequence models; however, they still cannot significantly outperform the previous sequence-to-sequence models or statistical approaches. In this paper, we propose a novel graph-to-sequence model (Graph Transformer) to address this task. The model directly encodes the AMR graphs and learns the node representations. A pairwise interaction function is used for computing the semantic relations between the concepts. Moreover, attention mechanisms are used for aggregating the information from the incoming and outgoing neighbors, which helps the model capture the semantic information effectively. Our model outperforms the state-of-the-art neural approach by 1.5 BLEU points on LDC2015E86 and 4.8 BLEU points on LDC2017T10, achieving new state-of-the-art performance.
1 Introduction
Abstract meaning representation (AMR) is a semantic formalism that abstracts away from the syntactic realization of a sentence, and encodes its meaning as a rooted, directed, and acyclic graph. In the graph, the nodes represent the concepts, and edges denote the relations between the concepts. The root of an AMR binds its contents to a single traversable graph and serves as a rudimentary representation of the overall focus. The existence of co-references and control structures results in nodes with multiple incoming edges, called reentrancies, and causes an AMR to possess a graph structure, instead of a tree structure. Numerous natural language processing (NLP) tasks can benefit from using AMR, such as machine translation (Jones et al., 2012; Song et al., 2019), question answering (Mitra and Baral, 2016), summarization (Liu et al., 2015; Takase et al., 2016), and event extraction (Huang et al., 2016).
AMR-to-text generation is the task of recovering a text representing the same meaning as a given AMR graph. Because the function words and structures are abstracted away, the AMR graph can correspond to multiple realizations. Numerous important details are underspecified, including tense, number, and definiteness, which makes this task extremely challenging (Flanigan et al., 2016). Figure 1 shows an example AMR graph and its corresponding sentence.
Early works relied on grammar-based or statistical approaches (Flanigan et al., 2016; Pourdamghani et al., 2016; Lampouras and Vlachos, 2017; Gruzitis et al., 2017). Such approaches generally require alignments between the graph nodes and surface tokens, which are automatically generated and can lead to error accumulation. In recent research, the graphs are first transformed into linear sequences, and then the text is generated from the inputs (Konstas et al., 2017). Such a method may lose information from the graph structure. The current state-of-the-art neural methods are graph-to-sequence models and hybrid variants (Beck et al., 2018; Song et al., 2018; Damonte and Cohen, 2019). These methods use a graph state long short-term memory (LSTM) network, gated graph neural network (GGNN), or graph convolution network (GCN) to encode AMR graphs directly, and they can explicitly utilize the information provided by the graph structure. However, these graph encoders still cannot significantly outperform sequence encoders. The AMR-to-text generation task can be regarded as a special translation task, so drawing on off-the-shelf methods for neural machine translation can be helpful. The Transformer model (Vaswani et al., 2017) is a stacked attention architecture and has shown its effectiveness in translation tasks; however, applying it to AMR-to-text generation has a major problem: it can only deal with sequential inputs.
To address these issues, we propose a novel graph network (Graph Transformer) for AMR-to-text generation. Graph Transformer is an adaptation of the Transformer model, and it has a stacked attention-based encoder-decoder architecture. The encoder takes the AMR graph as the input and learns the node representations from the node attributes by aggregating neighborhood information. The global semantic information is captured by stacked graph attention layers, which allow a node to attend to the hidden states of its neighbor nodes and their corresponding relations. Stacking multiple graph attention layers enables the nodes to utilize the information of nodes that are not directly adjacent, allowing global information to propagate. Because an AMR graph is directed and the edge directions carry important information, we use two separate graph attentions in each layer to encode the information from the incoming and outgoing edges. Then we utilize a fusion layer to incorporate the information from the incoming and outgoing relations, followed by a feed-forward network. Residual connections are used for connecting adjacent layers. The final node representations are formed by concatenating the two individual representations encoded by multiple layers. The decoder is similar to the original decoder in Transformer, performing multi-head attention over the node representations from the encoder and self-attention over the hidden states of the decoder. On top of the decoder stack, we adopt a copy mechanism to generate the texts, which helps copy low-frequency tokens, such as named entities and numbers.
We perform experiments on two benchmark datasets (LDC2015E86 and LDC2017T10). Our model significantly outperforms the prior methods and achieves a new state-of-the-art performance. Without external data, our model improves the BLEU scores of the state-of-the-art and most recently proposed neural model (i.e., g-GCNSEQ [Damonte and Cohen, 2019]) by 1.5 points on LDC2015E86 and 4.8 points on LDC2017T10. When using the Gigaword corpus as additional training data, automatically labeled by a pre-trained AMR parser, our model achieves a BLEU score of 36.4 on LDC2015E86, which is the highest result on the dataset. The experimental results also show that the improved structural representation encoding by our proposed graph encoder is most useful when the amount of training data is small. We evaluate variations of our model to verify its robustness as well as the importance of the proposed modules. In addition, we study the performances of our model and the baselines under different structures of the input graphs.
Our contributions can be summarized as follows:
- For AMR-to-text generation, we propose Graph Transformer, a novel graph-to-sequence model based on the attention mechanism. Our model uses a pairwise interaction function to compute the semantic relations and applies separate graph attentions to the incoming and outgoing neighbors, which helps capture the semantic information provided in the graph. The code is available at https://github.com/sodawater/GraphTransformer.
- The experimental results show that our model achieves a new state-of-the-art performance on benchmark datasets.
2 Related Work
2.1 AMR-to-Text Generation
Early work on AMR-to-text generation focused on statistical methods. Flanigan et al. (2016) transformed AMR graphs into appropriate spanning trees and applied tree-to-string transducers to generate texts. Song et al. (2016) partitioned an AMR graph into small fragments and generated the translations for all the fragments, whose order was finally decided by solving an asymmetric generalized traveling salesman problem. Song et al. (2017) used synchronous node replacement grammar to parse AMR graphs and generate output sentences. Pourdamghani et al. (2016) adopted a phrase-based machine translation model on the input of a linearized graph. Recent works propose using neural networks for generation. Konstas et al. (2017) used a sequence-to-sequence model to generate texts, leveraging an LSTM to encode a linearized AMR structure. Graph-to-sequence models, including the graph state LSTM (Song et al., 2018) and the GGNN (Beck et al., 2018), outperform sequence-to-sequence models. The most recently developed hybrid neural model achieved state-of-the-art performance by applying a BiLSTM on the output of a GCN graph encoder, utilizing both structural and sequential information (Damonte and Cohen, 2019).
2.2 Neural Networks for Graphs
Neural network methods for processing the data represented in graph domains have been studied for several years. Graph neural networks (GNNs) have also been proposed, which are an extension of recursive neural networks and can be applied to most of the practically useful types of graphs (Gori et al., 2005; Scarselli et al., 2009). GCNs are the main alternatives for neural-based graph representations, and are widely used to address various problems (Bruna et al., 2014; Duvenaud et al., 2015; Kipf and Welling, 2017). Li et al. (2015) further extended a GNN and modified it to use gated recurrent units for processing the data represented in graphs; this method is known as a GGNN. Beck et al. (2018) followed their concept and applied a GGNN to string generation. Another neural architecture based on gated units is the graph state LSTM (Song et al., 2018), which uses an LSTM structure for encoding graph-level semantics. Our model is most similar to graph attention networks (GATs) (Velickovic et al., 2018); it incorporates the attention mechanism in the information aggregation.
2.3 Transformer Network
Recurrent neural networks (RNNs) and convolution neural networks (CNNs) have been widely used in NLP tasks because of their advantages in capturing long-term and local dependencies, respectively. Compared with these networks, models based solely on the attention mechanism show superiority in terms of parallelism and flexibility in modeling dependencies. Recently, RNN/CNN-free networks have attracted increasing interest. Vaswani et al. (2017) proposed a stacked attention architecture, the Transformer model, for neural machine translation. Gu et al. (2018) introduced a non-autoregressive translation model based on the Transformer. Zhang et al. (2018) integrated paraphrase rules with the Transformer model for sentence simplification. Devlin et al. (2018) proposed a language representation model called BERT, which achieved new state-of-the-art results on 11 NLP tasks.
3 Graph Transformer
The overall architecture of Graph Transformer is shown in Figure 2, with an example AMR graph and its corresponding sentence. We begin by providing the formal definition of the AMR-to-text generation and the notations we use, and then reviewing the Transformer model. Then we introduce the graph encoder and sentence decoder used in our model. Finally, we describe the training and decoding procedures.
3.1 Problem Formulation and Notations
Given an AMR graph, G, our goal is to generate a natural language sentence that represents the same meaning as G. Our model is trained to maximize the probability, P(S|G), where S is the gold sentence.
In the following, we define the notations used in this study. We assume a directed graph, G = (V, E), where V is a set of N nodes, E is a set of M edges, and N and M are the numbers of nodes and edges, respectively. Each edge in E can be represented as (i, j, l), where i and j are the indices of the source and target nodes, respectively, and l is the edge label. We further denote the incoming neighborhood of node vi ∈ V (i.e., the nodes that reach vi via an incoming edge) as Nin(i), and its outgoing neighborhood (i.e., the nodes reached from vi via an outgoing edge) as Nout(i). The corresponding sentence is S = {s1, s2, ..., sT}, where si is the i-th token of the sentence and T is the number of tokens.
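For concreteness, this notation can be mirrored by a small data structure. The sketch below is purely illustrative (the class and method names, such as AMRGraph, are not from the paper); it only records the edge triples and the two neighborhood maps defined above.

```python
from collections import defaultdict

class AMRGraph:
    """Minimal container for a directed, labeled AMR graph G = (V, E).

    Nodes are indexed 0..N-1; each edge is a triple (i, j, l) from
    source node i to target node j with label l.
    """

    def __init__(self, num_nodes, edges):
        self.num_nodes = num_nodes          # N
        self.edges = list(edges)            # E, as (i, j, l) triples
        self.incoming = defaultdict(list)   # j -> [(i, l), ...]
        self.outgoing = defaultdict(list)   # i -> [(j, l), ...]
        for i, j, l in self.edges:
            self.outgoing[i].append((j, l))
            self.incoming[j].append((i, l))

    def in_neighbors(self, v):
        """Neighbors of v reached via an incoming edge, with edge labels."""
        return self.incoming[v]

    def out_neighbors(self, v):
        """Neighbors of v reached via an outgoing edge, with edge labels."""
        return self.outgoing[v]


# Example: a tiny fragment "(w / want-01 :ARG0 (y / you))"
g = AMRGraph(num_nodes=2, edges=[(0, 1, ":ARG0")])
print(g.out_neighbors(0))   # [(1, ':ARG0')]
print(g.in_neighbors(1))    # [(0, ':ARG0')]
```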
3.2 Transformer
Our model is adapted from the Transformer model, which we briefly review here. The original Transformer network uses an encoder-decoder architecture, with each layer consisting of a multi-head attention mechanism and a feed-forward network.
To construct a deep network and for regularization, a residual connection and layer normalization are used to connect adjacent layers.
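As a reference for this review, the following is a minimal PyTorch sketch of one such layer: multi-head self-attention followed by a feed-forward network, each wrapped with a residual connection and layer normalization. The hyperparameter defaults mirror Section 5.2, but the feed-forward width is an assumed value and the module is illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """One encoder layer of the original Transformer: multi-head
    self-attention + position-wise feed-forward network, each followed
    by a residual connection and layer normalization."""

    def __init__(self, d_model=256, n_heads=2, d_ff=1024, dropout=0.3):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Sub-layer 1: multi-head self-attention, residual connection, layer norm.
        a, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.drop(a))
        # Sub-layer 2: feed-forward network, residual connection, layer norm.
        x = self.norm2(x + self.drop(self.ffn(x)))
        return x
```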
3.3 Graph Encoder
3.4 Sentence Decoder
3.5 Training and Decoding
We use beam search to generate the target sentence during the decoding stage.
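Beam search is standard; the sketch below illustrates the procedure in a model-agnostic way, assuming a hypothetical `step` function that returns scored next-token candidates for a decoded prefix. It omits length normalization and other refinements.

```python
def beam_search(step, bos_id, eos_id, beam_size=6, max_len=100):
    """Generic beam search: `step(prefix)` returns a list of
    (token_id, log_prob) candidates for the next token."""
    beams = [([bos_id], 0.0)]            # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:
                finished.append((seq, score))
                continue
            for tok, logp in step(seq):
                candidates.append((seq + [tok], score + logp))
        if not candidates:
            beams = []
            break
        # Keep the best `beam_size` partial hypotheses.
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_size]
    finished.extend(beams)
    # Return the highest-scoring completed (or truncated) hypothesis.
    return max(finished, key=lambda x: x[1])[0]
```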
4 Comparison to Prior Graph Encoders
In this section, we compare our proposed graph encoder with the existing ones presented in prior works.
Most models, including the GCN (Damonte and Cohen, 2019), GGNN (Beck et al., 2018), and GraphLSTM (Song et al., 2018), use a non-pairwise interaction function to represent the information to be aggregated from the neighborhoods. Specifically, they ignore the receiver node (i.e., the node to be updated) and operate only on the sender node (i.e., the neighbor node) and the edge attribute (Battaglia et al., 2018). They add a self-loop edge for each node so that its own information can be considered. In our model, we compute the pairwise interactions using Eq. (5); hence, no self-loop edge is required.
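Since Eq. (5) itself is not reproduced in this excerpt, the sketch below only illustrates the distinction between non-pairwise and pairwise messages with a generic formulation; the projections W_r, W_s, and W_e are placeholders, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn

d = 256
W_r = nn.Linear(d, d, bias=False)   # projects the receiver node
W_s = nn.Linear(d, d, bias=False)   # projects the sender (neighbor) node
W_e = nn.Linear(d, d, bias=False)   # projects the edge-label embedding

def non_pairwise_message(h_sender, e_label):
    """Message used by GCN/GGNN/GraphLSTM-style encoders: the receiver node
    is ignored, so a self-loop edge is needed to keep its own state."""
    return W_s(h_sender) + W_e(e_label)

def pairwise_message(h_receiver, h_sender, e_label):
    """Pairwise interaction: the message also depends on the receiver node,
    so no self-loop edge is required (generic form, not the paper's Eq. (5))."""
    return torch.tanh(W_r(h_receiver) + W_s(h_sender) + W_e(e_label))
```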
In our model, the graph attention mechanism is similar to that of GAT (Velickovic et al., 2018). The main difference is that GAT is designed for undirected graphs and considers neither the directions nor the labels of edges. We propose using two distinct representations (i.e., a head representation and a tail representation) for each node and applying separate graph attentions to the incoming and outgoing relations. Accordingly, the model can consider the differences between the incoming and outgoing relations, and the results presented in the next section verify the effectiveness of this modification. In addition, GAT adopts additive attention and averages the outputs of the multi-head attention in the final layer. In our model, we use scaled dot-product attention for all the attention layers.
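The following is a minimal, single-head sketch of the directional graph attention described above: scaled dot-product attention of a node over one set of neighbors (incoming or outgoing) and their edge labels, with the two directional contexts combined by a fusion layer. It is a reconstruction from the description in Sections 1 and 4, not the released code, and it omits multi-head attention, the feed-forward sub-layer, residual connections, and the separate head/tail representations.

```python
import math
import torch
import torch.nn as nn

class DirectionalGraphAttention(nn.Module):
    """Scaled dot-product attention of one node over a set of neighbors,
    where each key/value combines the neighbor state and its edge-label
    embedding (illustrative single-head form)."""

    def __init__(self, d_model=256):
        super().__init__()
        self.d = d_model
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(2 * d_model, d_model)   # neighbor state + edge label
        self.v = nn.Linear(2 * d_model, d_model)

    def forward(self, h_node, h_neighbors, e_labels):
        # h_node: (d,); h_neighbors, e_labels: (num_neighbors, d)
        q = self.q(h_node)                                   # (d,)
        kv_in = torch.cat([h_neighbors, e_labels], dim=-1)   # (n, 2d)
        k, v = self.k(kv_in), self.v(kv_in)                  # (n, d) each
        scores = (k @ q) / math.sqrt(self.d)                 # (n,)
        alpha = torch.softmax(scores, dim=-1)
        return alpha @ v                                     # (d,)


# Fusing the two directional contexts into a single node update (sketch only).
d = 256
attn_in, attn_out = DirectionalGraphAttention(d), DirectionalGraphAttention(d)
fuse = nn.Linear(2 * d, d)

def update_node(h_node, in_nbrs, in_labels, out_nbrs, out_labels):
    c_in = attn_in(h_node, in_nbrs, in_labels)       # attend over incoming relations
    c_out = attn_out(h_node, out_nbrs, out_labels)   # attend over outgoing relations
    return torch.relu(fuse(torch.cat([c_in, c_out], dim=-1)))
```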
5 Experiment
5.1 Data and Preprocessing
We used two standard AMR corpora (LDC2015E86 and LDC2017T10) as our experimental datasets. The LDC2015E86 dataset contains 16,833 instances for training, 1,368 for development, and 1,371 for testing. The LDC2017T10 dataset is the latest AMR corpus release; it contains 36,521 training instances and the same development and test instances as LDC2015E86. Most prior works evaluate their models on the former dataset. Because contemporaneous prior approaches report state-of-the-art performances on LDC2015E86 and LDC2017T10, respectively, we performed experiments on both datasets.
Following Konstas et al. (2017), we supplemented the gold data with large-scale external data. We used the Gigaword corpus1 released by Song et al. (2018) as the external data, which was automatically parsed by JAMR. For training on both the gold data and the automatically labeled data, we adopted the same training strategy as Konstas et al. (2017): fine-tuning the model on the gold data after each epoch of pre-training on the Gigaword data.
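This alternating schedule can be summarized as the short sketch below; `train_step` and the batch iterators are placeholders, not code from the paper.

```python
def train_with_external_data(model, gigaword_batches, gold_batches, epochs=20):
    """Schedule following Konstas et al. (2017): after every epoch of
    pre-training on automatically labeled Gigaword data, fine-tune on
    the gold AMR data (all function names are placeholders)."""
    for epoch in range(epochs):
        for batch in gigaword_batches:     # automatically parsed by JAMR
            train_step(model, batch)
        for batch in gold_batches:         # gold LDC training instances
            train_step(model, batch)
```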
5.2 Parameter Settings and Training Details
We set our model parameters based on preliminary experiments on the development set. dmodel is set to 256 and demb is set to 300. The number of attention heads is set to 2. The numbers of layers of the encoder and decoder (L1 and L2) are set to 8 and 6, respectively. The batch size is set to 64. We extract a vocabulary from the training set, which is shared by the encoder and the decoder. The word embeddings are initialized from GloVe word embeddings (Pennington et al., 2014). We use the Adam optimizer (Kingma and Ba, 2015) with lr = 0.0002, β1 = 0.9, β2 = 0.98, and ε = 10−9. The learning rate is halved every time the perplexity on the development set does not improve for two epochs. We apply dropout with a rate of Pdrop = 0.3 to the output of each attention sub-layer and to the input embeddings. Beam search with a beam size of 6 is used for decoding. During training, we filter out instances with more than 100 nodes in the graph or 100 words in the sentence to speed up training. Note that dmodel is set to 512, the number of heads is set to 4, and the learning rate is set to 0.0001 when training on both the gold data and the automatically labeled data.
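For reference, the settings above can be collected in a single configuration object; the dataclass below is purely illustrative and its field names are not from the paper.

```python
from dataclasses import dataclass

@dataclass
class GraphTransformerConfig:
    # Settings for training on gold data only (Section 5.2).
    d_model: int = 256
    d_emb: int = 300           # GloVe-initialized word embeddings
    n_heads: int = 2
    encoder_layers: int = 8    # L1
    decoder_layers: int = 6    # L2
    batch_size: int = 64
    lr: float = 2e-4           # Adam with beta1=0.9, beta2=0.98, eps=1e-9
    dropout: float = 0.3       # Pdrop
    beam_size: int = 6
    max_nodes: int = 100       # filter longer graphs during training
    max_words: int = 100       # filter longer sentences during training

# With additional automatically labeled data: d_model=512, n_heads=4, lr=1e-4.
big = GraphTransformerConfig(d_model=512, n_heads=4, lr=1e-4)
```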
5.3 Metrics and Baselines
Following existing works, we evaluate the results with the BLEU metric (Papineni et al., 2002). We also report the results using CHRF++ (Popovic, 2017), similar to Beck et al. (2018).
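Both metrics are available in the sacrebleu package (assuming sacrebleu 2.x, where CHRF with word_order=2 corresponds to CHRF++); the snippet below is a usage sketch, not the exact evaluation script used in the paper, and tokenization choices can shift scores.

```python
from sacrebleu.metrics import BLEU, CHRF

hyps = ["Work hard if you really want out ."]     # system outputs
refs = [["Work hard if you really want out ."]]   # one reference stream

bleu = BLEU().corpus_score(hyps, refs)
chrfpp = CHRF(word_order=2).corpus_score(hyps, refs)  # word_order=2 => chrF++
print(bleu.score, chrfpp.score)
```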
Our direct baseline is the original Transformer, which takes a linearized graph as the input. We use the same linearization as Konstas et al. (2017). We also compare our model with prior statistical approaches (PBMT, Tree2Str, and TSP), sequence-to-sequence approaches (S2S+Anon and S2S+Copy), the current state-of-the-art graph-to-sequence approaches (GraphLSTM and GGNN), and hybrid approaches (t-GCNSEQ and g-GCNSEQ). PBMT (Pourdamghani et al., 2016) adopts a phrase-based machine translation model with the input of a linearized AMR graph. Tree2Str (Flanigan et al., 2016) converts AMR graphs into trees by splitting the reentrancies and applies a tree-to-string transducer to generate text. TSP (Song et al., 2016) solves the generation problem as a traveling salesman problem. S2S+Anon (Konstas et al., 2017) is a multi-layer attention-based bidirectional LSTM model, which is trained with anonymized data. S2S+Copy (Song et al., 2018) is also an attention-based LSTM model, but it instead uses the copy mechanism. GGNN (Beck et al., 2018) uses a gated graph neural network to encode the AMR graph and an RNN-based decoder to generate the text. GraphLSTM (Song et al., 2018) utilizes a graph state LSTM as the graph encoder and uses the copy mechanism instead of anonymization. T-GCNSEQ (Damonte and Cohen, 2019) also splits the reentrancies and applies stacked encoders to encode the tree, in which BiLSTM networks are used on top of the GCN to utilize both structural and sequential information. G-GCNSEQ has the same architecture as t-GCNSEQ, but it directly encodes the graph rather than the tree. Tree2Str, TSP, S2S+Anon, S2S+Copy, and GraphLSTM were trained on LDC2015E86. PBMT was trained on a previous release of the corpus (LDC2014T12).2 Note that PBMT, Tree2Str, and TSP also train and use a language model based on an additional Gigaword corpus. GGNN was trained on LDC2017T10. T-GCNSEQ and g-GCNSEQ were trained on both LDC2015E86 and LDC2017T10.
5.4 Comparison Results
Table 1 summarizes the results of the models using LDC2015E86 as the gold training data. When trained only on the gold training data, our model achieves the best BLEU score (25.9) among all the neural models and outperforms S2S+Anon by 3.9 BLEU points. Compared with the graph-to-sequence model GraphLSTM, our model is 2.6 BLEU points higher, which shows the superiority of our proposed architecture. Our model also outperforms the hybrid models t-GCNSEQ and g-GCNSEQ by 2.0 points and 1.5 points, respectively. Comparing the two sequence-to-sequence neural models, Transformer underperforms the RNN-based model (S2S+Anon). This is in stark contrast to their performances in machine translation. The reason may be that the linearized AMR graph can be extremely long, making it difficult to perform self-attention and obtain a good context representation of each token with a small amount of training data. Our proposed Graph Transformer does not encounter this problem; it is significantly better than Transformer and improves the BLEU score by more than 8 points. This also shows that our proposed deep architecture is effective even with a small amount of training data. The results of the statistical approaches PBMT, Tree2Str, and TSP are not strictly comparable because they use an additional Gigaword corpus to train a language model. Our model still outperforms Tree2Str and TSP and performs close to PBMT.
Methods | BLEU |
---|---|
PBMT (Pourdamghani et al., 2016) | 26.9 |
Tree2Str (Flanigan et al., 2016) | 23.0 |
TSP (Song et al., 2016) | 22.4 |
S2S+Anon (Konstas et al., 2017) | 22.0 |
GraphLSTM (Song et al., 2018) | 23.3 |
t-GCNSEQ (Damonte and Cohen, 2019) | 23.9 |
g-GCNSEQ (Damonte and Cohen, 2019) | 24.4 |
Transformer | 17.7 |
Graph Transformer | 25.9 |
S2S+Anon (2M) (Konstas et al., 2017) | 32.3 |
S2S+Anon (20M) (Konstas et al., 2017) | 33.8 |
S2S+Copy (2M) (Song et al., 2018) | 31.7 |
GraphLSTM (2M) (Song et al., 2018) | 33.6 |
Transformer (2M) | 35.1 |
Graph Transformer (2M) | 36.4 |
Following the approach of Konstas et al. (2017), we also evaluate our model using automatically labeled Gigaword data as additional training data. When using external data, the performance of our model improves significantly. Utilizing 2M Gigaword instances, the performance of our model improves by 10.5 BLEU points. With the 2M additional data, our model achieves a new state-of-the-art BLEU score of 36.4 points, which is 4.7 and 2.8 points higher than those of S2S+Copy and GraphLSTM using the same training data, respectively. Transformer achieves a BLEU score of 35.1, which is much higher than that of the same model trained only on the gold data. This verifies the effectiveness of a deep neural model when the training dataset is sufficiently large. With 20M external instances, S2S+Anon obtains a BLEU score of 33.8, which is still much worse than that of our model. We speculate that the performance could be further improved with even more external data; however, we do not attempt this owing to hardware limitations. Note that the CHRF++ score is not reported for these approaches in previous works; therefore, we do not compare it in this experiment.
Table 2 lists the results of the models trained on LDC2017T10. Our model strongly outperforms GGNN, improving the BLEU score by 6.0 points and the CHRF++ score by 8.6 points. The hybrid models t-GCNSEQ and g-GCNSEQ achieve BLEU scores of 24.1 and 24.5, which are 5.2 and 4.8 points lower than that of our model, respectively. Compared with the same models trained on the smaller gold training data in Table 1, the BLEU score of our model improves by 3.4 points, whereas the scores of t-GCNSEQ and g-GCNSEQ improve by only 0.2 and 0.1 points, respectively. This indicates that the performance of our model can easily benefit from more gold training data. Beck et al. (2018) also reported the scores of a GGNN ensemble, which achieves a BLEU score of 27.5 and a CHRF++ score of 53.5; these scores are still much worse than those of our single model.
Methods | BLEU | CHRF++ |
---|---|---|
GGNN (Beck et al., 2018) | 23.3 | 50.4 |
GGNN(ensemble) (Beck et al., 2018) | 27.5 | 53.5 |
t-GCNSEQ (Damonte and Cohen, 2019) | 24.1 | – |
g-GCNSEQ (Damonte and Cohen, 2019) | 24.5 | – |
Transformer | 19.4 | 48.1 |
Graph Transformer | 29.3 | 59.0 |
5.5 Model Variations
To evaluate the importance of the different components of our proposed Graph Transformer, we vary our model and perform both hyper-parameter and ablation studies. We train the models on both LDC2015E86 and LDC2017T10 and measure the performance changes on the development set, and the results are listed in Table 3.
| | L1 | L2 | dmodel | dh | Pdrop | BLEU (LDC2015E86) | BLEU (LDC2017T10) |
|---|---|---|---|---|---|---|---|
| base | 8 | 6 | 256 | 2 | 0.3 | 25.5 | 28.8 |
| (A) | 2 | | | | | 20.4 | 24.6 |
| | 4 | | | | | 23.7 | 27.6 |
| | 6 | | | | | 24.3 | 28.3 |
| | 10 | | | | | 24.6 | 28.4 |
| (B) | | 4 | | | | 23.4 | 27.1 |
| | | 5 | | | | 24.6 | 28.0 |
| | | 7 | | | | 24.8 | 28.7 |
| (C) | | | 512 | | | 25.1 | 28.5 |
| (D) | | | | 1 | | 23.6 | 28.4 |
| | | | | 4 | | 23.0 | 28.7 |
| (E) | | | | | 0.1 | 22.7 | 26.9 |
| | | | | | 0.2 | 25.3 | 28.5 |
| | | | | | 0.4 | 24.7 | 27.9 |
| (F) | single representation for each node | | | | | 25.1 | 28.3 |
| | single representation, inseparate graph attention | | | | | 23.1 | 27.6 |
5.5.1 Hyper-Parameter Tuning
In Table 3(A), we vary the number of transition steps (i.e., the number of layers), L1, in the graph encoder. As we can see, the performance of our model increases as L1 increases; however, it starts decreasing gradually when L1 becomes larger than 8. Our model achieves the best performance when L1 equals 8. This shows that incorporating the information from distant nodes helps capture the global semantic information. The performance drop when L1 is larger than 8 may be attributed to over-fitting, because the amount of training data is not large. In addition, we also compare the BLEU scores of our model and GraphLSTM with the same number of transition steps. These models are only trained on LDC2015E86. Results on the development set are shown in Figure 3. Compared with GraphLSTM, our model performs consistently and significantly better as L1 varies from 1 to 10. This indicates that our proposed graph encoder has a stronger ability to utilize both local and global semantic information.
In Table 3(B), we vary the number of layers in the decoder, L2. Our model achieves the best performance when L2 equals 6, and its performance drops significantly when L2 decreases. With too few layers, the decoder might not be able to utilize the information provided by the graph encoder or generate fluent sentences. An extremely large L2 also adversely affects the performance, particularly when training on the smaller dataset.
In Table 3(C), we observe that larger models do not lead to better performance. We attribute this to the small number of training pairs. In Table 3(D), we observe that the models trained on the smaller dataset are extremely sensitive to the number of heads, dh. The single-head attention is 1.9 BLEU points worse than the best setting. The performance also decreases with too many heads. Using more training data, our model becomes more stable and insensitive to dh. In Table 3(E), we can see that a suitable dropout rate is extremely helpful for avoiding over-fitting.
5.5.2 Ablation Study
We further perform an ablation study on the two datasets to investigate the influence of the modules in the graph encoder. We fix the sentence decoder of our model because it is similar to the original one in Transformer. The modules in the graph encoder are tested in two ways: using a single representation for each node (i.e., the head representation and tail representation are updated with shared parameters), and using a single representation and performing an inseparate graph attention over the incoming and outgoing relations simultaneously (i.e., the attention in Eq. (7) is computed over the incoming and outgoing relations jointly, and the fusion layer is discarded). These modifications test the effectiveness of the separate graph attentions.
The results are presented in Table 3(F). We can see that using a single representation for each node results in a loss of 0.4 and 0.5 BLEU points on the two datasets, respectively. This indicates that learning separate head and tail representations for each node is helpful. We further observe that without separate graph attentions (i.e., with the inseparate graph attention), the performance of our model drops by 2.4 BLEU points on LDC2015E86 and 1.2 points on LDC2017T10. We consider that the relations represented by the incoming edges and the outgoing edges are different. Moreover, projecting them into the same space for the graph attention might cause confusion, particularly when the amount of training data is small. Separate graph attentions can help the model better capture the semantics.
Compared with the prior methods, there are two changes in our model: the graph encoder and the Transformer decoder. To study the influence of the different encoders and decoders, we implement three encoders (RNN encoder, Transformer encoder, and our graph encoder) and two decoders (RNN decoder and Transformer decoder) and study their combinations. Table 4 presents the results. We find an interesting phenomenon: simply mixing Transformer-based networks with RNNs can lead to a large decrease in performance. Whether we replace the RNN decoder with the Transformer decoder in S2S or replace the Transformer decoder with the RNN decoder in Transformer and Graph Transformer, the modified models perform much worse than the original ones. This indicates that there is a mismatch when using an RNN (or Transformer) to decode a sentence from representations encoded by Transformer-based networks (or RNNs). The superior performance of our model is due to the interplay of the Transformer-based graph encoder and the Transformer decoder.
Methods | BLEU (LDC2015E86) | BLEU (LDC2017T10) |
---|---|---|
RNN+RNN (S2S) | 21.2 | 22.3 |
RNN+TFM | 17.9 | 18.7 |
TFM+RNN | 16.0 | 17.9 |
TFM+TFM (Transformer) | 17.4 | 19.0 |
Graph+RNN | 21.1 | 25.2 |
Graph+TFM (Ours) | 25.5 | 28.8 |
5.6 Performance Against Size and Structure of AMR Graph
To further study the advantages of our proposed model over prior sequence-to-sequence models and graph models, we compare the results of the models on different sizes and structures of the AMR graphs. All the models are trained on LDC2015E86. We characterize the size and structure of a graph in three ways, defined in the following paragraphs: depth, number of edges, and number of reentrancies.
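These three statistics can be computed directly from the edge list; the sketch below assumes the illustrative AMRGraph container sketched in Section 3.1 and follows the definitions given in the paragraphs that follow (depth as the longest root-to-node distance, reentrancies as nodes with more than one incoming edge).

```python
from collections import deque

def graph_depth(graph, root=0):
    """Longest distance from the root to any node along outgoing edges;
    reentrancies may reach a node at several depths, so we keep the longest
    (AMR graphs are acyclic, so this relaxation terminates)."""
    best = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for child, _label in graph.out_neighbors(v):
            if best.get(child, -1) < best[v] + 1:
                best[child] = best[v] + 1
                queue.append(child)
    return max(best.values())

def num_edges(graph):
    return len(graph.edges)

def num_reentrancies(graph):
    """Number of nodes with more than one incoming edge."""
    return sum(1 for v in range(graph.num_nodes) if len(graph.in_neighbors(v)) > 1)
```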
The depth of an AMR graph is defined as the longest distance between an AMR node and the root. The deeper the graph, the longer the dependencies. Table 5 lists the counts of the AMR graphs with different depths in the test split and the results of the different models on these graphs. We can see that the graph models outperform the sequence-to-sequence model, but the gap narrows as the depth increases. GraphLSTM outperforms S2S by 3.9 points when the depth is less than 6, and the gap is only 0.7 points when the depth is larger than 10. Compared with GraphLSTM, Graph Transformer maintains better performance on deeper graphs, which shows that our model is more powerful at capturing long-term dependencies.
Model | Depth 1−5 | Depth 6−10 | Depth 11− |
---|---|---|---|
#count | 278 | 828 | 265 |
S2S | 37.2 | 21.2 | 19.3 |
GraphLSTM | +3.9 | +2.1 | +0.7 |
Graph Transformer | +6.0 | +4.5 | +3.7 |
The edges in an AMR graph represent the semantic relations between the concepts. The more edges in the graph, the more semantic information is represented and, usually, the larger the graph. Table 6 lists the counts of the AMR graphs with different numbers of edges in the test split and the corresponding BLEU scores of the different models. We observe that all the models perform much better on small graphs than on large ones. Similar to the trend observed for graph depth, our model shows a stronger ability to deal with more semantic relations than GraphLSTM.
Model | 1−10 edges | 11−20 edges | 21− edges |
---|---|---|---|
#count | 425 | 452 | 494 |
S2S | 30.8 | 21.4 | 19.7 |
GraphLSTM | +4.3 | +2.9 | +0.8 |
Graph Transformer | +6.2 | +4.6 | +3.8 |
Following Damonte and Cohen (2019), we also study the influence of the reentrancies in the graph. Reentrancies represent the co-references and control structures in AMR and make it a graph rather than a tree. A graph with more reentrancies is typically more complex. From Table 7, we can see that the performances of all the models drop significantly as the number of reentrancies increases. With more reentrancies, the lead of the graph-to-sequence models over the sequence-to-sequence model also narrows. We consider that this is because the reentrancies increase the complexity of the graph structure and make it difficult for the graph models to learn the semantic representation. Our model exhibits an extremely strong performance when the input degrades into a tree. This is because we use two graph attentions over the incoming and outgoing edges, respectively, and when each node has only one incoming edge, the model is easier to learn and train. In addition, our model outperforms S2S by 2.9 points when the input graphs have more than 3 reentrancies. In comparison, GraphLSTM achieves a result nearly identical to that of S2S, which indicates that our proposed encoder is also better at dealing with complex graphs than GraphLSTM.
5.7 Case Study
We perform case studies to provide a better understanding of the model performance. We compare the outputs of S2S, GraphLSTM, and our Graph Transformer trained on the gold data of LDC2015E86. We observe that there are several common error types in the outputs of these systems: 1) generating unnatural language or unreadable sentences; 2) missing information from the input graph; 3) generating words or tokens inconsistent with the given semantic representation (i.e., mistranslation of the nodes in the graph); 4) mixing the semantic relations between the entities (i.e., mistranslation of the edges in the graph).
To show explicitly how systematic these errors are, we manually evaluate 50 randomly sampled outputs from each compared system and count the ratio of outputs with each error type. Note that these four types of errors are not mutually exclusive. Table 8 lists the results. We can clearly see that all four types of errors occur in all three systems, and Graph Transformer performs the best by comparison. Compared with S2S and GraphLSTM, our Graph Transformer covers significantly more information from the input graph. All the models make more mistakes on the fluency aspect than on the other three aspects. This is because missing information from the input graph and mistranslation of the nodes and edges can both cause a generated sentence to be unnatural or unreadable.
Model | Unnatural Language % | Missing Information % | Node Mistranslation % | Edge Mistranslation % |
---|---|---|---|---|
S2S | 62.0 | 60.0 | 34.0 | 52.0 |
GraphLSTM | 54.0 | 54.0 | 28.0 | 46.0 |
Graph Transformer | 48.0 | 44.0 | 28.0 | 40.0 |
In addition, we present several example outputs in Table 9. AMR denotes the input graph and Ref denotes the reference output sentence.
AMR: (h / have-condition-91 |
:ARG1 (w / work-01 |
:ARG0 (y / you) |
:ARG1-of (h2 / hard-02)) |
:ARG2 (w2 / want-01 |
:ARG0 y |
:ARG2 (o / out) |
:mod (r / really))) |
Ref: Work hard if you really want out. |
S2S: If you really want to want out you are hard. |
GraphLSTM: If you really want out, you really work. |
Graph Transformer: If you really want out, then you ’ll work hard. |
AMR: (a / and |
:op1 (s / stand-11 |
:ARG0 (c / criminal-organization :wiki "Taliban" |
:name (n / name :op1 "Taliban")) |
:ARG1 (p / passive) |
:ARG2 (c2 / cultivate-01) |
:time (y / year |
:mod (t2 / this))) |
:op2 (p2 / pressure-01 |
:ARG0 c |
:ARG1 (p5 / person |
:mod (c3 / country :wiki "Afghanistan" |
:name (n2 / name :op1 "Afghanistan")) |
:ARG0-of (f / farm-01)) |
:ARG2 (p3 / plant-01 |
:ARG0 p5 |
:ARG1 (p4 / poppy |
:mod (o / opium))) |
:degree (l / less))) |
Ref: The Taliban this year are taking a passive stance toward cultivation and putting less pressure on Afghan farmers to plant opium poppy. |
S2S: the Taliban will take away with cultivation of cultivate this year and pressure on Afghan farmers less with opium poppy in them. |
GraphLSTM: The Taliban has been standing in the cultivation this year and less pressure from the Afghan farmers to plant opium poppies. |
Graph Transformer: The Taliban has stood passive in cultivation this year and has less pressured Afghan farmers to plant opium poppies. |
AMR: (p / participate-01 |
:ARG0 (p2 / person :quant 40 |
:ARG1-of (e / employ-01 |
:ARG0 (c / company |
:mod (o / oil) |
:mod (s / state))) |
:accompanier (s2 / soldier :quant 1200)) |
:ARG1 (e2 / exercise |
:ARG1-of (r / resemble-01)) |
:time (d / date-entity :year 2005 :month 6)) |
Ref: 40 employees of the state oil company participated in a similar exercise with 1200 soldiers in 050600. |
S2S: in June 2005 40 people’s state company with 1200 soldiers were part of a similar exercise. |
GraphLSTM: 40 state of oil companies using 1200 soldiers have participated in similar exercises in 6. |
Graph Transformer: In June 2005 40 oil companies have participated in similar exercises with 1200 soldiers. |
In the first case, we can see that S2S fails to generate a fluent sentence. It also omits the concept work, thereby disrupting the semantic relation between you and hard. GraphLSTM omits the adverb hard and attaches the adverb really to the verb work, although it is only supposed to modify the verb want. Graph Transformer generates a basically correct answer.
In the second case, the AMR graph is more complex. S2S mistranslates the concept stand as take away and omits the adjective passive. The verb plant is also omitted, which might be caused by the long distance between plant and pressure in the linearized input. Moreover, the entire sentence is unreadable owing to numerous grammar mistakes. GraphLSTM also omits the concept passive and fails to generate the clause headed by stand. In addition, it makes a mistake at the conjunction between pressure and farmers. Our model does not omit any concept in the graph, but the generated sentence is not highly fluent. It treats stand and pressure as predicates and fails to generate take and put because they are not explicitly given in the graph.
In the last case, all three models fail to capture the concept employ and disturb the relations between person, employ, and company. S2S omits the adjective oil and mistranslates the concept participate as part of. GraphLSTM is completely confused in this case and even fails to generate the word June from the relation :month 6. Our Graph Transformer correctly generates the sentence constituents other than the subject.
Overall, the four types of errors occur in all three models, particularly when the input graph is complex. Compared with S2S and GraphLSTM, our model is less likely to miss information from the input, and it generates sentences of higher quality in terms of fluency and fidelity to the input semantics.
6 Conclusion
In this study, we present a novel graph network (Graph Transformer) for AMR-to-text generation. Our model is based solely on the attention mechanism. Our proposed graph attentions over the neighbor nodes and the corresponding edges are used for learning the node representations and capturing global information. The experimental results show that our model significantly outperforms the prior neural models and achieves a new state-of-the-art performance on benchmark datasets.
In future work, we will incorporate BERT embeddings and multi-task learning to improve the performance further. We will also apply Graph Transformer to other related text generation tasks, such as MRS-to-text generation, data-to-text generation, and image captioning.
Acknowledgments
This work was supported by National Natural Science Foundation of China (61772036) and Key Laboratory of Science, Technology and Standard in Press Industry (Key Laboratory of Intelligent Press Media Technology). We thank the anonymous reviewers for their helpful comments. Xiaojun Wan is the corresponding author.
Notes
The LDC2014T12 dataset contains 10,313 instances for training and the same instances for development and testing as LDC2015E86.