Modeling Global and Local Node Contexts for Text Generation from Knowledge Graphs

Abstract
Recent graph-to-text models generate text from graph-based data using either global or local aggregation to learn node representations. Global node encoding allows explicit communication between two distant nodes, thereby neglecting graph topology as all nodes are directly connected. In contrast, local node encoding considers the relations between neighbor nodes capturing the graph structure, but it can fail to capture long-range relations. In this work, we gather both encoding strategies, proposing novel neural models that encode an input graph combining both global and local node contexts, in order to learn better contextualized node embeddings. In our experiments, we demonstrate that our approaches lead to significant improvements on two graph-to-text datasets, achieving BLEU scores of 18.01 on the AGENDA dataset, and 63.69 on the WebNLG dataset for seen categories, outperforming state-of-the-art models by 3.7 and 3.1 points, respectively.¹


Introduction
Graph-to-text generation refers to the task of generating natural language text from input graph structures, which can be semantic representations (Konstas et al., 2017), sub-graphs from knowledge graphs (KG) (Koncel-Kedziorski et al., 2019) or other forms of structured data (Konstas and Lapata, 2013). While many recent works (Song et al., 2018; Damonte and Cohen, 2019; Ribeiro et al., 2019; Guo et al., 2019) focus on generating sentence-level outputs, a more challenging and interesting scenario emerges when the goal is to generate longer multi-sentence text, such as a document or paragraph. In this context, the input graphs are much more diverse, representing knowledge from different domains and in different ways. The task is thus more demanding since it can be necessary to select relevant parts of the graph for generating a concise text, and to handle document planning issues such as order, coherence and discourse markers (Gardent et al., 2017).
¹ Code is available at https://github.com/UKPLab/kg2text
A key issue in neural graph-to-text generation is how to encode graphs. The basic idea is to incrementally calculate node representations by aggregating context information. To this end, two main approaches have been proposed: (i) models based on local node aggregation, usually based on Graph Neural Networks (GNN) (Ribeiro et al., 2019; Guo et al., 2019) and (ii) models that leverage global node aggregation. Systems based on the global encoding strategy are typically based on Transformer architectures (Zhu et al., 2019; Cai and Lam, 2020), using self-attention to compute a node representation based on all nodes in the graph. This approach enjoys the advantage of a large context range, but neglects the graph topology by effectively treating every node as being connected to all the others in the graph. In contrast, models based on local aggregation learn the representation of each node based on its adjacent nodes as defined in the input. This method effectively exploits the graph structure. However, encoding relations between distant nodes can be challenging because it requires more graph encoding layers, which can also propagate noise (Li et al., 2018).
For example, Figure 1a presents a KG, for which a corresponding text is shown in Figure 1b. The nodes GNN and DistMulti have relations with the nodes node embeddings and link prediction, respectively. Both relations are important for GNN and DistMulti during the text generation phase, but are in different connected components. As shown in Figure 1c, a global encoder can learn a node representation for DistMulti which captures information from indirectly connected entities such as node embeddings. Encoding such dependencies is important for KG verbalisation as KGs are known to be highly incomplete, often missing links between entities (Schlichtkrull et al., 2018). In addition, the global encoding can capture long-range complex dependencies between entities, supporting document planning. In contrast, the local strategy refines the node representation with richer neighborhood information, as nodes that share the same neighborhood exhibit a strong homophily: two entities belonging to the same topic in a KG are much more likely to be connected than at random. Consequently, the local context enriches the node representation with topic-related information from KG triples. For example, in Figure 1a, GAT reaches node embeddings through the GNN. This transitive relation can be captured by a local encoder, as shown in Figure 1d. Capturing this form of relationship can also support text generation at the sentence level.
In this paper, we investigate novel graph-to-text architectures that combine both global and local node aggregations, gathering the benefits from both strategies. In particular, we propose a unified graph-to-text framework based on Graph Attention Networks (GAT, Veličković et al., 2018). As part of this framework, we empirically compare two main architectures: a cascaded architecture that performs global node aggregation before performing local node aggregation, and a parallel architecture that performs global and local aggregation simultaneously, before concatenating the representations. While the cascaded architecture allows the local encoder to leverage global encoding features, the parallel architecture allows more independent features to complement each other. To further consider fine-grained integration, we additionally consider layer-wise integration of global and local encoders.
Extensive experiments show that our approaches consistently outperform recent models on two benchmarks for text generation from KGs, giving the best reported results so far. Compared with parallel structures, cascaded structures give better performance with smaller numbers of parameters. To the best of our knowledge, we are the first to consider integrating global and local context aggregation in graph-to-text generation, and the first to propose a unified GAT structure for integrating global and local aggregation.

Related Work
Early efforts for graph-to-text generation employ statistical methods (Flanigan et al., 2016; Pourdamghani et al., 2016; Song et al., 2017). Recently, several neural graph-to-text models have exhibited success by leveraging different encoder mechanisms based on GNN and Transformer architectures, learning effective latent graph representations.
AMR-to-Text Generation. Recent neural models have been applied to sentence-level generation from Abstract Meaning Representation (AMR) graphs. Konstas et al. (2017) provide the first neural approach for this task, by linearising the input graph as a sequence of nodes and edges. Song et al. (2018) propose the graph recurrent network (GRN) to directly encode the AMR nodes, whereas Beck et al. (2018) develop a model based on GGNNs (Li et al., 2016). However, both approaches only employ local node aggregation strategies. Damonte and Cohen (2019) and Ribeiro et al. (2019) develop models employing GNNs and LSTMs, in order to learn complementary node contexts. Recent methods (Zhu et al., 2019; Cai and Lam, 2020) […] (2019) introduce a systematic comparison between pipeline and neural end-to-end approaches for text generation from RDF graphs. Nevertheless, those approaches consider the triples as separate structures, not explicitly considering the graph topology. To explicitly encode the graph structure, Marcheggiani and Perez Beltrachini (2018) propose an encoder based on graph convolutional networks (GCN) and show superior performance compared to LSTMs. Our work is related to Koncel-Kedziorski et al. (2019), who propose a transformer-based approach that only focuses on the relations between directly connected nodes. In contrast, our models focus on both global and local node relations, capturing complementary graph contexts.

Graph-to-Text Model
In this section, we describe (i) the general concept of GNNs; (ii) the proposed local and global graph encoders; (iii) the graph transformation adopted to create a relational graph from the input; and (iv) the various combined global and local graph architectures.

Graph Neural Networks (GNN)
Formally, let $G = (V, E, R)$ denote a multi-relational graph with nodes $u, v \in V$ and labelled edges $(u, r, v) \in E$, where $r \in R$ represents the relation between $u$ and $v$. GNNs work by iteratively learning a representation vector $h_v$ of a node $v \in V$ based on both its context node neighbors and edge features, through an information propagation scheme. More formally, the $l$-th layer aggregates the representations of $v$'s context nodes:

$$h^{(l)}_{N(v)} = \mathrm{AGGR}^{(l)}\left(\left\{\left(h^{(l-1)}_u, r_{vu}\right) : u \in N(v)\right\}\right), \tag{1}$$

where $\mathrm{AGGR}^{(l)}(.)$ is an aggregation function, shared by all nodes on the $l$-th layer. $r_{vu}$ represents the relation between $v$ and $u$. $N(v)$ is a set of context nodes for $v$. In most GNNs, the context nodes are those adjacent to $v$.
The aggregated context representation $h^{(l)}_{N(v)}$ is used to update the representation of $v$:

$$h^{(l)}_v = \mathrm{COMBINE}^{(l)}\left(h^{(l-1)}_v, h^{(l)}_{N(v)}\right). \tag{2}$$

After $l$ iterations, a node's representation encodes the structural information within its $l$-hop neighborhood. The choices of $\mathrm{AGGR}^{(l)}(.)$ and $\mathrm{COMBINE}^{(l)}(.)$ differ by the specific GNN model. An example of $\mathrm{AGGR}^{(l)}(.)$ is the sum of the representations of $N(v)$. An example of $\mathrm{COMBINE}^{(l)}(.)$ is a concatenation after the feature transformation.
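As a minimal sketch of the aggregate-then-combine scheme described above, the following uses the two example choices named in the text: a sum for AGGR and a concatenation for COMBINE. The function names are hypothetical, the learned feature transformation is omitted, relation labels are carried but ignored, and treating the source node of each edge as a context node of the target is an assumption made for illustration.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Edge = Tuple[str, str, str]  # (u, relation, v)


def aggregate_sum(context: List[Vector], dim: int) -> Vector:
    """AGGR: element-wise sum over the context-node vectors."""
    total = [0.0] * dim
    for h_u in context:
        for i, value in enumerate(h_u):
            total[i] += value
    return total


def combine_concat(h_v: Vector, h_context: Vector) -> Vector:
    """COMBINE: concatenate a node's vector with its aggregated context."""
    return h_v + h_context


def gnn_layer(h: Dict[str, Vector], edges: List[Edge]) -> Dict[str, Vector]:
    """Apply one aggregation/combination step to every node."""
    dim = len(next(iter(h.values())))
    context: Dict[str, List[Vector]] = {v: [] for v in h}
    for u, _relation, v in edges:
        context[v].append(h[u])  # assumption: u is a context node of v
    return {v: combine_concat(h[v], aggregate_sum(context[v], dim)) for v in h}
```

Stacking such a layer l times lets information travel l hops, which is the property the paragraph above refers to.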

Global Graph Encoder
A global graph encoder aggregates a global context for updating each node, by treating the graph as fully connected (see Figure 1c). We use the attention mechanism as the message passing scheme, extending the self-attention network structure of Transformer (Vaswani et al., 2017) to a GAT structure. In particular, we compute a layer of the global convolution for a node $v \in V$, which takes the feature representations $h_v$ as input, adopting $\mathrm{AGGR}^{(l)}(.)$ as:

$$h_{N(v)} = \sum_{u \in V} \alpha_{vu} W_g h_u, \tag{3}$$

where $W_g \in \mathbb{R}^{d_v \times d_z}$ is a model parameter. The attention weight $\alpha_{vu}$ is calculated as:

$$\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{k \in V} \exp(e_{vk})},$$

where

$$e_{vu} = \frac{(W_q h_v)^\top (W_k h_u)}{d_z}$$

is the attention function which measures the global importance of node $u$'s features to node $v$. $W_q, W_k \in \mathbb{R}^{d_v \times d_z}$ are model parameters and $d_z$ is a scaling factor.
Multi-head Attention. To capture distinct relations between nodes, $K$ different global convolutions are calculated and concatenated:

$$\hat{h}_{N(v)} = \Big\Vert_{k=1}^{K} \sum_{u \in V} \alpha^{(k)}_{vu} W^{(k)}_g h_u. \tag{4}$$

Finally, we define $\mathrm{COMBINE}^{(l)}(.)$ employing layer normalization (LayerNorm) and a fully connected feed-forward network (FFN), in a similar way as the transformer architecture:

$$\hat{h}_v = \mathrm{LayerNorm}\left(\hat{h}_{N(v)} + h_v\right), \qquad h^{global}_v = \mathrm{FFN}(\hat{h}_v) + \hat{h}_{N(v)} + h_v. \tag{5}$$

This strategy creates an artificial complete graph with $O(n^2)$ edges. Note that the global encoder does not consider the edge relations between nodes. In particular, if the labelled edges were considered, the self-attention space complexity would increase to $\Theta(|R| n^2)$.
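The global aggregation step can be illustrated with a single attention head in plain Python. This is a sketch under stated assumptions rather than the authors' implementation: the helper names are hypothetical, weight matrices are stored with one row per output dimension, the scaling factor is passed in as `scale`, and LayerNorm, the FFN and multi-head concatenation are left out.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def global_attention(h: List[Vector], w_q: Matrix, w_k: Matrix,
                     w_g: Matrix, scale: float) -> List[Vector]:
    """One attention head in which every node attends to every node."""
    queries = [matvec(w_q, x) for x in h]
    keys = [matvec(w_k, x) for x in h]
    values = [matvec(w_g, x) for x in h]
    out = []
    for q in queries:
        # Scaled dot-product score of this node against all nodes in the graph.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        alpha = softmax(scores)
        out.append([sum(a * v[i] for a, v in zip(alpha, values))
                    for i in range(len(values[0]))])
    return out
```

Because the score list always spans all n nodes, the cost per layer grows quadratically in n, which matches the complete-graph observation in the paragraph above.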

Local Graph Encoder
The representation $h^{global}_v$ captures macro relationships from $v$ to all other nodes in the graph. However, this representation lacks both structural information regarding the local neighborhood of $v$ and the graph topology. Also, it does not capture typed relations between nodes (see Equations 1 and 3). In order to capture this crucial graph information and impose a strong relational inductive bias, we build a local graph encoder by employing a modified version of GAT augmented with relational weights. In particular, we compute a layer of the local convolution for a node $v \in V$, adopting $\mathrm{AGGR}^{(l)}(.)$ as:

$$h_{N(v)} = \sum_{u \in N(v)} \alpha_{vu} W_r h_u,$$

where $W_r$ is the weight matrix associated with the relation $r \in R$ between $u$ and $v$. The attention coefficient $\alpha_{vu}$ is computed as:

$$\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{k \in N(v)} \exp(e_{vk})},$$

where

$$e_{vu} = \sigma\left(a^\top \left[W_v h_v \,\Vert\, W_r h_u\right]\right)$$

is the attention function which calculates the relative importance of adjacent nodes, considering typed relations. $\sigma$ is an activation function, $\Vert$ denotes concatenation, and $W_v$ and $a$ are model parameters.
We employ multi-head attention to learn local relations from different perspectives, as in Equation 4, generating $\hat{h}_{N(v)}$. Finally, we define $\mathrm{COMBINE}^{(l)}(.)$ as:

$$h^{local}_v = \mathrm{RNN}\left(h_v, \hat{h}_{N(v)}\right),$$

where we employ as RNN a Gated Recurrent Unit (GRU) (Cho et al., 2014). The GRU facilitates information propagation between local layers. This choice is motivated by recent works (Xu et al., 2018; Dehmamy et al., 2019) that theoretically demonstrate that sharing information between layers helps the structural signals propagate. In a similar direction, AMR-to-text generation models employ LSTMs (Song et al., 2017) and dense connections (Guo et al., 2019) between GNN layers.
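The relation-aware neighborhood attention described in this section can be sketched for a single head as follows. Everything named here is an assumption for illustration: the function and argument names are hypothetical, LeakyReLU stands in for the unspecified activation, each relation type gets its own weight matrix in a dictionary, and the GRU-based COMBINE step is omitted.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def leaky_relu(x: float, slope: float = 0.2) -> float:
    """Assumed activation; the text only says 'an activation function'."""
    return x if x > 0 else slope * x


def local_context(h_v: Vector, neighbors: List[Tuple[Vector, str]],
                  w_self: Matrix, w_rel: Dict[str, Matrix],
                  a: Vector) -> Vector:
    """Attend over adjacent nodes only, using relation-specific weights."""
    target = matvec(w_self, h_v)
    if not neighbors:
        return [0.0] * len(target)
    # Each neighbor is transformed by the matrix of its relation type.
    messages = [matvec(w_rel[rel], h_u) for h_u, rel in neighbors]
    # Score = activation(a . [target || message]); list '+' is concatenation.
    scores = [leaky_relu(sum(a_i * x_i for a_i, x_i in zip(a, target + m)))
              for m in messages]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    alphas = [e / norm for e in exps]
    return [sum(al * m[i] for al, m in zip(alphas, messages))
            for i in range(len(target))]
```

Unlike the global case, the softmax here runs only over the neighbors of a node, so the cost follows the number of edges rather than the number of node pairs.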

Graph Preparation
We represent a KG as a multi-relational graph $G_e = (V_e, E_e, R)$ with entity nodes $e \in V_e$ and labeled edges $(e_h, r, e_t) \in E_e$, where $r \in R$ denotes the relation existing from the entity $e_h$ to $e_t$. Unlike other current approaches (Koncel-Kedziorski et al., 2019; Moryossef et al., 2019), we represent an entity as a set of nodes. Formally, we transform each $G_e$ into a new graph $G = (V, E, R)$, where each token of an entity $e \in V_e$ becomes a node $v \in V$. We convert each edge $(e_h, r, e_t) \in E_e$ into a set of edges (with the same relation $r$) and connect every token of $e_h$ to every token of $e_t$. That is, an edge $(u, r, v)$ will belong to $E$ if and only if there exists an edge $(e_h, r, e_t) \in E_e$ such that $u \in e_h$ and $v \in e_t$. We represent each node $v \in V$ with an embedding $h^0_v \in \mathbb{R}^{d_v}$, generated from its corresponding token.
The new graph $G$ increases the representational power of the model because it allows learning node embeddings at a token level, instead of entity level. This is particularly important for text generation as it permits the model to be more flexible, capturing richer relationships between entity tokens. This also allows the model to learn relations and attention functions between source and target tokens. However, it has the side effect of removing the natural sequential order of multi-word expressions such as entities. To preserve this information, we employ position embeddings (Vaswani et al., 2017), i.e., $h^0_v$ becomes the sum of the corresponding token embedding and the positional embedding for $v$.
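The entity-to-token transformation can be sketched as below. The function name is hypothetical and whitespace splitting is an assumption used only to keep the example small. Each node keeps its position inside the entity, which is the information the position embeddings are meant to preserve.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]       # (head entity, relation, tail entity)
TokenNode = Tuple[str, int]         # (token, position inside its entity)
TokenEdge = Tuple[int, str, int]    # (source node id, relation, target node id)


def to_token_graph(triples: List[Triple]) -> Tuple[List[TokenNode], List[TokenEdge]]:
    """Turn an entity-level KG into a token-level graph."""
    nodes: List[TokenNode] = []
    entity_nodes: Dict[str, List[int]] = {}
    for head, _relation, tail in triples:
        for entity in (head, tail):
            if entity in entity_nodes:
                continue
            ids = []
            # Assumption: whitespace tokenisation stands in for the real tokeniser.
            for position, token in enumerate(entity.split()):
                ids.append(len(nodes))
                nodes.append((token, position))
            entity_nodes[entity] = ids
    # Connect every token of the head entity to every token of the tail entity.
    edges = [(u, relation, v)
             for head, relation, tail in triples
             for u in entity_nodes[head]
             for v in entity_nodes[tail]]
    return nodes, edges
```

An edge between an entity with three tokens and one with two tokens expands to six token-level edges, all carrying the original relation.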

Combining Global and Local Encodings
Our goal is to implement a graph encoder capable of encoding global and local aspects of the input graph. We hypothesize that the two sources of information are complementary and that a combination of both enriches node representations for text generation. In order to test this hypothesis, we investigate four possible combination architectures. Figure 2 presents our proposed encoders.
Parallel Graph Encoding. In this setup, we compose global and local graph encoders in a fully parallel structure (Figure 2a). Note that each graph encoder can have different numbers of layers and attention heads. $h^0_v$ is the initial input for the first layer of both encoders. The final node representation is the concatenation of the local and global node representations:

$$h_v = \left[\,h^{global}_v \,\Vert\, h^{local}_v\,\right].$$

Cascaded Graph Encoding. We cascade local and global graph encoders as shown in Figure 2b, by first computing a global-contextual node embedding, and then refining it with the local context. $h^0_v$ is the initial input for the global encoder and $h^{global}_v$ is the initial input for the local encoder.
Layer-wise Parallel and Cascaded Graph Encoding. To allow fine-grained interaction between the two types of contextual information, we also combine the encoders in a layer-wise fashion. In particular, for each graph layer, we employ both the local and global encoders in a parallel structure as shown in Figure 2c. We also experiment with cascading the graph encoders layer-wise (Figure 2d).
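The difference between the parallel and cascaded compositions can be made concrete by treating each encoder as a function from node vectors to node vectors. The sketch below is illustrative only: `mean_encoder` and `double_encoder` are toy stand-ins for the global and local encoders, and `stack` shows one possible reading of the layer-wise cascaded variant as repeated cascaded blocks.

```python
from typing import Callable, List

Vector = List[float]
Encoder = Callable[[List[Vector]], List[Vector]]


def parallel(global_enc: Encoder, local_enc: Encoder) -> Encoder:
    """Run both encoders on the same input and concatenate per node."""
    def encode(h0: List[Vector]) -> List[Vector]:
        g, l = global_enc(h0), local_enc(h0)
        return [g_v + l_v for g_v, l_v in zip(g, l)]
    return encode


def cascaded(global_enc: Encoder, local_enc: Encoder) -> Encoder:
    """Feed the global encoder's output into the local encoder."""
    def encode(h0: List[Vector]) -> List[Vector]:
        return local_enc(global_enc(h0))
    return encode


def stack(blocks: List[Encoder]) -> Encoder:
    """Apply a list of blocks in sequence (a layer-wise arrangement)."""
    def encode(h0: List[Vector]) -> List[Vector]:
        h = h0
        for block in blocks:
            h = block(h)
        return h
    return encode


def mean_encoder(h: List[Vector]) -> List[Vector]:
    """Toy 'global' encoder: every node sees the mean of all nodes."""
    mean = [sum(v[i] for v in h) / len(h) for i in range(len(h[0]))]
    return [list(mean) for _ in h]


def double_encoder(h: List[Vector]) -> List[Vector]:
    """Toy 'local' encoder: rescales each node independently."""
    return [[2.0 * x for x in v] for v in h]
```

Note that the parallel output is twice as wide as either encoder's output, whereas the cascaded output keeps the local encoder's width.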

Decoder
Our decoder follows the same core architecture of the Transformer decoder. Each time step t is updated by interleaving multiple rounds of multi-head attention over the output of the encoder (node embeddings $h_v$) and attention over previously-generated tokens (token embeddings). An additional challenge in our setup is to generate multi-sentence outputs. In order to encourage the model to generate longer texts, we employ a length penalty (Wu et al., 2016) to refine the pure max-probability beam search.
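For readers unfamiliar with the length penalty of Wu et al. (2016), the sketch below shows how it rescales a hypothesis score during beam search: the log-probability is divided by ((5 + |Y|) / 6)^alpha. The function names and the example value of alpha are illustrative assumptions; the value used in our experiments is not stated here.

```python
def length_penalty(length: int, alpha: float) -> float:
    """Length penalty of Wu et al. (2016): ((5 + |Y|) / (5 + 1)) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha


def rescore(log_prob: float, length: int, alpha: float) -> float:
    """Beam score: sequence log-probability divided by the length penalty."""
    return log_prob / length_penalty(length, alpha)


# Illustrative hypotheses: a short one and a longer, less probable one.
short = (-4.0, 5)    # (log-probability, length in tokens)
long_ = (-6.0, 19)

# alpha = 0 reduces to plain max-probability search and prefers the short output.
plain = (rescore(*short, alpha=0.0), rescore(*long_, alpha=0.0))
# alpha = 1 (hypothetical setting) lets the longer hypothesis win.
penalised = (rescore(*short, alpha=1.0), rescore(*long_, alpha=1.0))
```

Since log-probabilities are negative, dividing by a penalty that grows with length moves long hypotheses closer to zero and so ranks them higher.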

Data and Preprocessing
We evaluate the effectiveness of our models on two datasets: AGENDA (Koncel-Kedziorski et al., 2019) and WebNLG (Gardent et al., 2017). Table 1 shows the statistics for both datasets.
AGENDA. In this dataset, KGs are paired with scientific abstracts extracted from proceedings of 12 top AI conferences. Each instance consists of the paper title, a KG and the paper abstract. Entities correspond to scientific terms which are often multi-word expressions (co-referential entities are merged). We treat each token in the title as a node, creating a unique graph with title and KG tokens as nodes. As shown in Table 1, the average output length is considerably large, as the target outputs are multi-sentence abstracts.
WebNLG. In this dataset, each instance contains a graph extracted from DBPedia. The target text consists of one or more sentences that verbalise the graph. We evaluate the models on the test set with seen categories. Note that this dataset has a considerable number of edge relations (see Table 1). In order to avoid parameter explosion, we use regularization based on basis function decomposition to define the model relation weights (Schlichtkrull et al., 2018). Also, as an alternative, we employ the Levi Transformation to create nodes from relational edges between entities (Beck et al., 2018). That is, we create a new relation node for each edge relation between two nodes. The new relation node is connected to the subject and object token entities by two binary relations, respectively.
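A minimal sketch of the Levi transformation described above is given next. It works at the entity level for brevity, and the two binary edge labels (`HEAD_TO_REL`, `REL_TO_TAIL`) as well as the function name are hypothetical placeholders, since the labels used in practice are not specified here.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Hypothetical names for the two binary relations that replace typed edges.
HEAD_TO_REL = "head-to-relation"
REL_TO_TAIL = "relation-to-tail"


def levi_transform(triples: List[Triple]):
    """Replace every labelled edge by a relation node and two unlabelled-type edges."""
    entities: List[str] = []
    relation_nodes: List[Tuple[str, str]] = []   # (node id, relation label)
    edges: List[Tuple[str, str, str]] = []
    for index, (head, relation, tail) in enumerate(triples):
        for entity in (head, tail):
            if entity not in entities:
                entities.append(entity)
        # One fresh node per edge, so repeated relations stay distinct.
        rel_id = f"{relation}#{index}"
        relation_nodes.append((rel_id, relation))
        edges.append((head, HEAD_TO_REL, rel_id))
        edges.append((rel_id, REL_TO_TAIL, tail))
    return entities, relation_nodes, edges
```

After the transformation the number of edge types no longer depends on the size of the relation inventory, which is why relation labels can be handled as ordinary vocabulary items.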

Experiments and Discussion
The models are trained for 30 epochs with early stopping based on the development BLEU score. We use Adam optimization with an initial learning rate of 0.5. The vocabulary is shared between the node and target tokens. In order to mitigate the effects of random seeds, for the test sets, we report the averages for 4 training runs along with their standard deviation. Hyperparameters are tuned on the development set of both datasets. Following previous work (Castro Ferreira et al., 2019), we employ byte pair encoding (BPE) to split entity words into smaller, more frequent pieces, so some nodes in the graph can be sub-words. We also obtain sub-words on the target side. We call our models PGE-LW (layer-wise parallel encoder), CGE-LW (layer-wise cascaded encoder), PGE (fully parallel encoder) and CGE (fully cascaded encoder). As a baseline, we use a standard version of the Transformer that takes a linearized version of the KG triples as input. Following previous works, we evaluate the results in terms of BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014) and sentence-level CHRF++ (Popović, 2015) scores. To better assess the quality of the generated texts, we also perform a human evaluation.

Results on WebNLG
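We compare the performance of our best model (CGE) with six state-of-the-art results of graph-to-text models reported for this dataset (Gardent et al., 2017; Trisedya et al., 2018; Marcheggiani and Perez Beltrachini, 2018; Castro Ferreira et al., 2019). Three systems are the best competitors in the challenge for seen categories: UPF-FORGe, Melbourne and Adapt. UPF-FORGe follows a rule-based approach, whereas Melbourne and Adapt employ encoder-decoder models with linearized triple sets. Table 3 presents the results.
Relations as Parameters. CGE-RP encodes relations as model parameters and achieves a BLEU score of 62.30, 8.9% better than the best model of Castro Ferreira et al. (2019), who employ an end-to-end architecture based on GRUs. CGE-RP also outperforms Trisedya et al. (2018), an approach that encodes both intra-triple and inter-triple relationships, by 4.5 BLEU points. Interestingly, their intra-triple and inter-triple mechanisms capture relationships within a triple and among triples, approaches closely related to our local and global encodings. However, they rely on encoding sequences of relations and entities based on graph traversal algorithms, whereas we explicitly exploit the graph structure through the local neighborhood aggregation.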
Relations as Nodes. CGE-LG uses Levi graphs as inputs and achieves the best performance, even though it uses fewer parameters. One advantage of this approach is that it allows the model to handle new relations, as they are treated as nodes. Moreover, the relations become part of the shared vocabulary, making this information directly usable during the decoding process. We outperform an approach based on GNNs (Marcheggiani and Perez Beltrachini, 2018) by a large margin of 7.2 BLEU points, showing that our graph encoding strategies lead to better text generation. We also outperform Adapt, a strong competitor that employs subword encodings, by 2.51 BLEU points.

Ablation Study
In Table 4, we report an ablation study on the impact of each module used in the CGE model on the development set of the AGENDA dataset.
Global Graph Encoder. We start with an ablation on the global encoder. After removing the global attention coefficients, the performance of the model drops by 1.77 BLEU and 1.72 CHRF++ points. Results also show that using the FFN in the global COMBINE(.) function is important to the model but less effective than the global attention. However, when we remove the FFN, the number of parameters drops considerably (around 19%) from 66.9 to 54.3 million. Finally, without the entire global encoder, the result drops substantially by 2.29 BLEU points. This indicates that enriching node embeddings with a global context allows learning more expressive graph representations.
Local Graph Encoder. We first remove the local graph attention and the BLEU score drops to 16.44, showing that the neighborhood attention improves the performance. After removing the relation types, encoded as model weights, the performance drops by 0.48 BLEU points. However, the number of parameters is reduced by around 10 million. This indicates that we can have a more efficient model, in terms of the number of parameters, with a slight drop in performance. Removing the GRU used in the COMBINE(.) function drops the performance considerably. The worst performance occurs if we remove the entire local encoder, with a BLEU score of 14.43, essentially making the encoder similar to the baseline. Finally, we note that the vocabulary sharing is critical to improve the performance, and the length penalty is beneficial as we generate multi-sentence outputs.

Comparing Encoding Strategies
The overall performance on both datasets suggests the superiority of combining global and local node representations. However, to have a better understanding of the positive and negative aspects of each proposed model, we introduce a systematic comparison between the encoding strategies.
Figure 3a shows the impact of graph diameter on the four encoding methods. The models perform on par for graphs with smaller diameters. Models based on layer-wise aggregations (PGE-LW and CGE-LW) have better performance when handling larger graph diameters. However, their overall performance is worse compared to the fully independent models because only 2% of the graphs on the AGENDA dev set have a diameter larger than or equal to 5. This indicates that the layer-wise encoders can better capture long-distance node dependencies. Moreover, the margin between PGE-LW and CGE-LW increases as the diameters increase, suggesting that PGE-LW can be a good option to encode graphs with larger diameter. Figure 3b shows the models' performance with respect to the number of triples. CGE achieves better results when the number of triples is large (≥ 9). On the other hand, PGE performs relatively worse when handling more information, that is, KGs with more triples.

Impact of the Graph Structure and Output Length
We investigate the performance of our best model (CGE) concerning different data properties.

Number of Triples.
In Table 5, we perform an inspection on the effect of the number of triples on the models' performance, measured using CHRF++ scores for the WebNLG dev set.
In general, our model obtains better scores over almost all partitions, showing that explicitly capturing structural information is beneficial for text generation. The performance decreases as the number of triples increases. However, when handling datapoints with more triples (7), Adapt and our model achieve higher performance. We hypothesize that this happens because the models receive a considerable amount of input data, giving more context to the text generation process, even though the graph structure is more complex.

Number of Nodes.
Figure 4a shows the effect of the graph size, measured in number of nodes, on the performance. Note that the score increases as the graph size increases. This trend is particularly interesting and contrasts with AMR-to-text generation, in which the models' general performance decreases as the graph size increases (Cai and Lam, 2020). In AMR benchmarks, the graph size is correlated with the sentence size, and longer sentences are more challenging to generate than shorter ones. On the other hand, AGENDA contains similar abstract lengths and when the input is a bigger graph, the model has more information to be leveraged during the generation. We also investigate the performance with respect to the number of local graph layers. The performances with 1 and 4 layers are similar, while the best performance, regardless of the number of nodes, is achieved with 3 layers.

Graph Diameter.
Figure 4b shows the impact of the graph diameter on the performance, when employing only global or local encoding modules or both, for the AGENDA dev set. Similarly to the graph size, the score increases as the diameter increases. As the global encoder is not aware of the graph structure, this module has the worst scores, even though it enables direct node communication over long distances. In contrast, the local encoder can propagate precise node information throughout the graph structure for k-hop distances, making the relative performance better. We also observe that the performance gap between the global and local encoders increases when the diameter is 1. In this case, the graph has many connected components; that is, the triples do not share entities. This reveals that computing node representations based on adjacent nodes, rather than based on the entire set of entities, leads to better performance. Table 5 shows the performances for our best model and others with respect to the graph diameter for the WebNLG dev set. In contrast to AGENDA, the score decreases as the diameter increases. This behavior highlights a crucial difference between the two datasets. Whereas in WebNLG the graph size is correlated with the output size, this is not the case for AGENDA. For WebNLG, higher diameters pose additional challenges to the models as they need to generate larger outputs.
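Since several analyses are binned by graph diameter, a small sketch of how such a value can be computed may help. The definition used here is an assumption: edges are treated as undirected and the diameter of a disconnected graph is taken as the largest shortest-path distance inside any connected component, which is consistent with a diameter of 1 for triples that share no entities. The function name is hypothetical.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def graph_diameter(nodes: Iterable[Hashable],
                   edges: Iterable[Tuple[Hashable, Hashable]]) -> int:
    """Longest shortest path within any connected component (undirected)."""
    adjacency: Dict[Hashable, Set[Hashable]] = {n: set() for n in nodes}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    best = 0
    for source in adjacency:
        # Breadth-first search gives hop distances from the source node.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for neighbour in adjacency[current]:
                if neighbour not in distance:
                    distance[neighbour] = distance[current] + 1
                    queue.append(neighbour)
        best = max(best, max(distance.values()))
    return best
```

Running one breadth-first search per node is quadratic in the worst case, which is acceptable for the small graphs considered here.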
Output Length. One interesting phenomenon to analyze is the length distribution (in number of words) of the generated outputs. We expect our models to generate texts with output lengths similar to those of the reference texts. However, as shown in Figure 4c, the reference texts are usually longer than the texts generated by all models. The texts generated by CGE-no-pl, a CGE model without length penalty, are consistently longer than those of the baseline. Also, note that we increase the length of the texts when we employ the length penalty (see Section 3.6). However, there is still a gap between the reference and the generated text lengths. We leave further investigation of this aspect for future work.
Effect of the Number of Nodes on the Output Length. Figure 5 shows the effect of the size of a graph, defined as the number of nodes, on the quality (measured in CHRF++ scores) and length of the generated text (in number of words) in the AGENDA dev set. We bin both the graph size and the output length in 4 classes. Our model consistently outperforms the baseline, in some cases by a large margin. When handling smaller graphs (with ≤ 35 nodes), both models have difficulties generating good summaries. However, for these smaller graphs, our model achieves a score 12.2% better when generating texts with length ≤ 75. Interestingly, when generating longer summaries (length > 140) from smaller graphs, our model outperforms the baseline by an impressive 21.7%, indicating that our model is more effective in capturing semantic signals from graphs with scarce information in order to generate better text. Our approach also performs better when the graph size is large (number of nodes > 55) but the generation output is small (≤ 75), beating the baseline by 9 points.

Human Evaluation
To further assess the quality of the generated text, we conduct a human evaluation on the WebNLG test set with seen categories. Following previous works (Gardent et al., 2017; Castro Ferreira et al., 2019), we assess two quality criteria: (i) Fluency (i.e., does the text flow in a natural, easy to read manner?) and (ii) Adequacy (i.e., does the text clearly express the data?). We divide the datapoints into seven different sets by the number of triples. For each set, we randomly select 20 texts generated by Adapt, CGE-LG and their corresponding human reference text (420 texts in total). Since the number of datapoints for each set is not balanced (see Table 5), this sampling strategy ensures that we have the same number of samples for the different triple sets. Moreover, having human references may serve as an indicator of the sanity of the human evaluation experiment. We recruited human workers from Mechanical Turk to rate the text outputs on a 1-5 Likert scale. For each text, we collect scores from 4 workers and average them. Table 6 shows the results. We first note a similar trend as in the automatic evaluation, with CGE-LG outperforming Adapt on both fluency and adequacy. In sets with fewer than 5 triples, CGE-LG was the highest rated system in fluency. Similarly to the automatic evaluation, both systems are better at generating text from graphs with smaller diameters. Also note that bigger diameters pose difficulties to the models, which achieve their worst performance for diameters ≥ 3.

Conclusion
We introduced a unified graph attention network structure for investigating graph-to-text architectures that combine global and local graph representations in order to improve text generation. An extensive evaluation of our models demonstrated that global and local contexts are empirically complementary, and that a combination can achieve state-of-the-art results on KG-to-text generation. In addition, cascaded architectures give better results compared with parallel architectures. To our knowledge, we are the first to consider both local and global aggregation in a graph attention network.

…
Figure 1: A graphical representation (a) of a scientific text (b). (c) A global encoder directly captures longer dependencies between any pair of nodes (blue and red arrows), but fails in capturing the graph structure. (d) A local encoder explicitly accesses information from the adjacent nodes (blue arrows) and implicitly captures distant information (dashed red arrows).

Figure 3: Comparison between different encoder architectures with respect to (a) graph diameter and (b) number of triples, for the dev set of the AGENDA dataset.

Figure 4: CHRF++ scores for the AGENDA dev set, with respect to (a) the number of nodes, and (b) the graph diameter. (c) Distribution of output length of the gold references and models' output for the AGENDA dev set.

Figure 5: Relation between the number of nodes and the length of the generated text, in number of words.

Table 1: Data statistics. Nodes and edges values are calculated after the graph transformation. Averages are computed per instance.

Table 2: Results on AGENDA test set. #L and #H are the numbers of layers and the attention heads in each layer, respectively. When more than one, the values are for the global and local encoders, respectively. #P stands for the number of parameters in millions (node embeddings included).

Table 3: Results on WebNLG test set with seen categories.
Three systems are the best competitors in the challenge for seen categories: UPF-FORGe, Melbourne and Adapt. UPF-FORGe follows a rule-based approach, whereas Melbourne and Adapt employ encoder-decoder models with linearized triple sets. Table 3 presents the results.
Relations as Parameters. CGE-RP encodes relations as model parameters and achieves a BLEU score of 62.30, 8.9% better than the best model of Castro Ferreira et al. (2019), who employ an end-to-end architecture based on GRUs. CGE-RP also outperforms Trisedya et al. (

Table 4: Ablation study for modules used in the encoder and decoder of the CGE model.
by a large margin of 7.2 BLEU points, showing our graph encoding strategies lead to a better text generation.We also outperform Adapt, a strong competitor that employs subword encodings, by 2.51 BLEU points.

Table 6: Fluency (F) and Adequacy (A) obtained in the human evaluation. #T refers to the number of input triples and #D to graph diameters. The ranking was determined by pair-wise Mann-Whitney tests with p < 0.05, and the difference between systems which have a letter in common is not statistically significant.

        Adapt              CGE-LG             Reference
#T      F        A         F        A         F        A
All     3.96 C   4.44 C    4.12 B   4.54 B    4.24 A   4.63 A
1-2     3.94 C   4.59 B    4.18 B   4.72 A    4.30 A   4.69 A
3-4     3.79 C   4.45 B    3.96 B   4.50 AB   4.14 A   4.66 A
5-7     4.08 B   4.35 B    4.18 B   4.45 B    4.28 A   4.59 A
#D
< 3     3.98 C   4.50 B    4.16 B   4.61 A    4.28 A   4.66 A
≥ 3     3.91 C   4.33 B    4.03 B   4.43 B    4.17 A   4.60 A