Recursive Non-Autoregressive Graph-to-Graph Transformer for Dependency Parsing with Iterative Refinement

We propose the Recursive Non-autoregressive Graph-to-Graph Transformer architecture (RNGTr) for the iterative refinement of arbitrary graphs through the recursive application of a non-autoregressive Graph-to-Graph Transformer and apply it to syntactic dependency parsing. We demonstrate the power and effectiveness of RNGTr on several dependency corpora, using a refinement model pre-trained with BERT. We also introduce Syntactic Transformer (SynTr), a non-recursive parser similar to our refinement model. RNGTr can improve the accuracy of a variety of initial parsers on 13 languages from the Universal Dependencies Treebanks, English and Chinese Penn Treebanks, and the German CoNLL2009 corpus, even improving over the new state-of-the-art results achieved by SynTr, significantly improving the state-of-the-art for all corpora tested.


Introduction
Self-attention models, such as Transformer (Vaswani et al., 2017), have been hugely successful in a wide range of natural language processing (NLP) tasks, especially when combined with language-model pre-training, such as BERT. These architectures contain a stack of self-attention layers that can capture long-range dependencies over the input sequence, while still representing its sequential order using absolute position encodings. Alternatively, Shaw et al. (2018) propose to define sequential order with relative position encodings, which are input to the self-attention functions.
Recently, Mohammadshahi and Henderson (2020) extended this sequence input method to the input of arbitrary graph relations via the self-attention mechanism, and combined it with an attention-like function for graph relation prediction, resulting in their proposed Graph-to-Graph Transformer architecture (G2GTr). They demonstrated the effectiveness of G2GTr for transition-based dependency parsing and its compatibility with pre-trained BERT. This parsing model predicts one edge of the parse graph at a time, conditioning on the graph of previous edges, so it is an autoregressive model.
The G2GTr architecture could be used to predict all the edges of a graph in parallel, but such predictions are non-autoregressive. They thus cannot fully model the interactions between edges. For sequence prediction, this problem has been addressed with non-autoregressive iterative refinement (Novak et al., 2016; Awasthi et al., 2019; Lichtarge et al., 2018). Interactions between different positions in the string are modeled by conditioning on a previous version of the same string.
In this paper, we propose a new graph prediction architecture that takes advantage of the full graph-to-graph functionality of G2GTr to apply a G2GTr model to refine the output graph recursively. This architecture predicts all edges of the graph in parallel, and is therefore non-autoregressive, but can still capture any between-edge dependency by conditioning on the previous version of the graph, like an auto-regressive model. This proposed Recursive Non-autoregressive Graph-to-Graph Transformer (RNGTr) architecture has three components. First, an initialization model computes an initial graph, which can be any given model for the task, even a trivial one. Second, a G2GTr model takes the previous graph as input and predicts each edge of the target graph. Third, a decoding algorithm finds the best graph given these edge predictions. The second and third components are applied recursively to do iterative refinement of the output graph until some stopping criterion is met. The final output graph is the graph output by the final decoding step.
The RNG Transformer architecture can be applied to any task with a sequence or graph as input and a graph over the same set of nodes as output. We evaluate RNGTr on syntactic dependency parsing because it is a difficult structured prediction task, state-of-the-art initial parsers are extremely competitive, and there is little previous evidence that non-autoregressive models (as in graph-based dependency parsers) are not sufficient for this task. We aim to show that capturing correlations between dependencies with non-autoregressive iterative refinement results in improvements, even in the challenging case of state-of-the-art dependency parsers.
The evaluation demonstrates improvements with several initial parsers, including previous state-of-the-art dependency parsers, and the empty parse. We also introduce a strong Transformer-based dependency parser pre-trained with BERT, called Syntactic Transformer (SynTr), using it both for our initial parser and as the basis of our refinement model. Results on 13 languages from the Universal Dependencies Treebanks, English and Chinese Penn Treebanks (Marcus et al., 1993; Xue et al., 2002), and the German CoNLL 2009 corpus (Hajič et al., 2009) show significant improvements over all initial parsers and the state-of-the-art. 1
In this paper, we make the following contributions:
• We propose a novel architecture for the iterative refinement of arbitrary graphs (RNGTr) that combines non-autoregressive edge prediction with conditioning on the complete graph.
• We propose an RNGTr model of syntactic dependency parsing.
• We demonstrate significant improvements over the previous state-of-the-art dependency parsing results on Universal Dependency Treebanks, Penn Treebanks, and the German CoNLL 2009 corpus.

Dependency Parsing
Syntactic dependency parsing is a critical component in a variety of natural language understanding tasks, such as semantic role labeling (Marcheggiani and Titov, 2017), machine translation (Chen et al., 2017), relation extraction (Zhang et al., 2018), and natural language interfaces (Pang et al., 2019). There are several approaches to compute the dependency tree. Transition-based parsers predict the dependency graph one edge at a time through a sequence of parsing actions (Yamada and Matsumoto, 2003; Nivre and Scholz, 2004; Titov and Henderson, 2007; Zhang and Nivre, 2011). As in our approach, transformation-based (Satta and Brill, 1996) and corrective modeling parsers use various methods (e.g., Knight and Graehl, 2005; Hall and Novák, 2005; Attardi and Ciaramita, 2007; Hennig and Köhn, 2017; Zheng, 2017) to correct an initial parse. We take a graph-based approach to this correction. Graph-based parsers (Eisner, 1996; McDonald et al., 2005a; Koo and Collins, 2010) compute scores for every possible dependency edge and then apply a decoding algorithm to find the highest scoring total tree. Typically, neural graph-based models consist of two components: an encoder that learns context-dependent vector representations for the nodes of the dependency graph, and a decoder that computes the dependency scores for each pair of nodes and then applies a decoding algorithm to find the highest-scoring dependency tree.
1 Our implementation is available at: https://github.com/idiap/g2g-transformer.
There are several approaches to capture correlations between dependency edges in graph-based models. In first-order models, such as Maximum Spanning Tree (MST) (Edmonds, 1967;Chu and Liu, 1965;McDonald et al., 2005b), the score for an edge must be computed without being sure what other edges the model will choose. The model itself only imposes the discrete tree constraint between edges. Higher-order models (McDonald and Pereira, 2006;Carreras, 2007;Koo and Collins, 2010;Ma and Zhao, 2012;Zhang and McDonald, 2012;Tchernowitz et al., 2016) keep some between-edge information, but require more decoding time.
In this paper, we apply first-order models, specifically the MST algorithm, and show that it is possible to keep correlations between edges without increasing the time complexity by recursively conditioning each edge score on a previous prediction of the complete dependency graph.

RNG Transformer
The RNG Transformer architecture is illustrated in Figure 1, in this case applied to dependency parsing. The input to an RNGTr model specifies the input nodes $W = (w_1, w_2, \ldots, w_N)$ (e.g., a sentence), and the output is the final graph $G^T$ (e.g., a parse tree) over this set of nodes. The first step is to compute an initial graph $G^0$ over $W$, which can be done with any model. Then each recursive iteration takes the previous graph $G^{t-1}$ as input and predicts a new graph $G^t$.
The RNGTr model predicts $G^t$ with a novel version of a Graph-to-Graph Transformer (Mohammadshahi and Henderson, 2020). Unlike in the work of Mohammadshahi and Henderson (2020), this G2GTr model predicts every edge of the graph in a single non-autoregressive step. As previously, the G2GTr first encodes the input graph $G^{t-1}$ in a set of contextualized vector representations $Z = (z_1, z_2, \ldots, z_N)$, with one vector for each node of the graph. The decoder component then predicts the output graph $G^t$ by first computing scores for each possible edge between each pair of nodes and then applying a decoding algorithm to output the highest-scoring complete graph.
The RNGTr model can be formalized in terms of an encoder $E^{RNG}$ and a decoder $D^{RNG}$:
$$Z^t = E^{RNG}(W, P, G^{t-1}), \qquad G^t = D^{RNG}(Z^t), \qquad t = 1, \ldots, T \tag{1}$$
where $W = (w_1, w_2, \ldots, w_N)$ is the input sequence of tokens, $P = (p_1, p_2, \ldots, p_N)$ is their associated properties, and $T$ is the number of refinement iterations.
In the case of dependency parsing, $W$ are the words and symbols, $P$ are their part-of-speech tags, and the predicted graph at iteration $t$ is specified as:
$$G^t = \{(i, j, l),\ j = 3, \ldots, N-1\}, \quad \text{where } 2 \le i \le N-1,\ l \in L \tag{2}$$
Each word $w_j$ has one head (parent) $w_i$ with dependency label $l$ from the label set $L$, where the parent can also be the ROOT symbol $w_2$ (see Section 3.1.1).
The following sections describe in more detail each element of the proposed RNGTr dependency parsing model.

Encoder
To compute the embeddings Z t for the nodes of the graph, we use the Graph-to-Graph Transformer architecture proposed by Mohammadshahi and Henderson (2020), including a similar mechanism to input the previously predicted dependency graph G t−1 to the attention mechanism. This graph input allows the node embeddings to include both token-level and relation-level information.

Input Embeddings
The RNGTr model receives a sequence of input tokens ($W$) with their associated properties ($P$) and builds a sequence of input embeddings ($X$). For compatibility with BERT's input token representation, the sequence of input tokens starts with CLS and ends with SEP symbols. For dependency parsing, it also adds the ROOT symbol to the front of the sentence to represent the root of the dependency tree. To build the token representations for a sequence of input tokens, we sum several vectors. For the input words and symbols, we sum the token embeddings of a pre-trained BERT model, $\mathrm{EMB}(w_i)$, and learned representations $\mathrm{EMB}(p_i)$ of their part-of-speech tags $p_i$. To keep the order information of the initial sequence, we add the position embeddings of pre-trained BERT, $F_i$, to our token embeddings. The final input representations are the sum of the position embeddings and the token embeddings:
$$x_i = F_i + \mathrm{EMB}(w_i) + \mathrm{EMB}(p_i), \qquad i = 1, 2, \ldots, N \tag{3}$$

Self-Attention Mechanism
Conditioning on the previously predicted output graph $G^{t-1}$ is made possible by inputting relation embeddings to the self-attention mechanism. This edge input method was initially proposed by Shaw et al. (2018) for relative position encoding, and extended to unlabeled dependency graphs in the Graph-to-Graph Transformer architecture of Mohammadshahi and Henderson (2020). We use it to input labeled dependency graphs, by adding relation label embeddings to both the value function and the attention weight function.
Transformers have multiple layers of self-attention, each with multiple heads. The RNGTr architecture uses the same architecture as BERT but changes the functions used by each attention head. Given the token embeddings $X$ at the previous layer and the input graph $G^{t-1}$, the values $A = (a_1, \ldots, a_N)$ computed by an attention head are:
$$a_i = \sum_j \alpha_{ij} \left( x_j W^V + r^{t-1}_{ij} W^L_2 \right) \tag{4}$$
where $r^{t-1}_{ij}$ is a one-hot vector that represents the labeled dependency relation between $i$ and $j$ in the graph $G^{t-1}$. As shown in the matrix in Figure 2, each $r^{t-1}_{ij}$ specifies both the label and the direction of the relation ($id_{label}$ for $i \rightarrow j$ versus $id_{label} + |L|$ for $i \leftarrow j$, where $|L|$ is the number of dependency labels), or specifies NONE (as 0). $W^L_2 \in \mathbb{R}^{(2|L|+1) \times d}$ are the learned relation embeddings. The attention weights $\alpha_{ij}$ are a Softmax applied to the attention function:
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}, \qquad e_{ij} = \frac{1}{\sqrt{d}} \, (x_i W^Q) \left( x_j W^K + \mathrm{LN}(r^{t-1}_{ij} W^L_1) \right)^{\top} \tag{5}$$
where $W^L_1 \in \mathbb{R}^{(2|L|+1) \times d}$ are different learned relation embeddings, and $W^Q$, $W^K$, and $W^V$ are the query, key, and value projection matrices of the attention head. $\mathrm{LN}(\cdot)$ is the layer normalization function, used for better convergence. Equations (4) and (5) constitute the mechanism by which each iteration of refinement can condition on the previous graph. Instead of the more common approach of hard-coding some attention heads to represent a relation (e.g., Ji et al., 2019), all attention heads can learn for themselves how to use the information about relations.
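The graph-conditioned attention head described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the released implementation: the function names are hypothetical, "i → j" is assumed to mean that i is the head of j, label ids are assumed to run from 1 to |L|, and the layer normalization omits the learned gain and bias. Multiplying a one-hot relation vector by a relation embedding matrix is implemented as a row lookup.

```python
import math


def relation_ids(heads, labels, num_labels):
    """Build the N x N matrix of relation ids for a labeled dependency graph.

    heads[j] is the index of token j's head (None if j has no head);
    labels[j] is the label id (1..num_labels) of the arc into j.
    Assumed convention: "i -> j" means i is the head of j.
    """
    n = len(heads)
    rel = [[0] * n for _ in range(n)]           # 0 encodes NONE
    for j, h in enumerate(heads):
        if h is None:
            continue
        rel[h][j] = labels[j]                   # i -> j : label id
        rel[j][h] = labels[j] + num_labels      # i <- j : label id + |L|
    return rel


def layer_norm(v, eps=1e-6):
    """Layer normalization without learned gain and bias (a simplification)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(x[k] * W[k][c] for k in range(len(x))) for c in range(len(W[0]))]


def attention_head(X, rel, Wq, Wk, Wv, WL1, WL2):
    """One attention head that adds relation embeddings to keys and values.

    WL1 and WL2 each have 2|L|+1 rows; row rel[i][j] plays the role of
    the one-hot relation vector times the relation embedding matrix.
    """
    d = len(Wq[0])
    q = [matvec(x, Wq) for x in X]
    k = [matvec(x, Wk) for x in X]
    v = [matvec(x, Wv) for x in X]
    out = []
    for i in range(len(X)):
        # Attention function: query against (key + normalized relation embedding).
        scores = []
        for j in range(len(X)):
            key = [a + b for a, b in zip(k[j], layer_norm(WL1[rel[i][j]]))]
            scores.append(sum(a * b for a, b in zip(q[i], key)) / math.sqrt(d))
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha = [e / total for e in exps]
        # Values: weighted sum of (value + relation embedding).
        a_i = [0.0] * d
        for j in range(len(X)):
            val = [a + b for a, b in zip(v[j], WL2[rel[i][j]])]
            for c in range(d):
                a_i[c] += alpha[j] * val[c]
        out.append(a_i)
    return out
```

Because the relation id indexes a row of the embedding matrices, a NONE relation (id 0) still contributes its own learned embedding, so the head can also learn from the absence of an edge.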

Decoder
The decoder uses the token embeddings Z t produced by the encoder to predict the new graph G t . It consists of two components, a scoring function, and a decoding algorithm. The graph found by the decoding algorithm is the output graph G t of the decoder. Here we propose components for dependency parsing.
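The scoring function first passes each token embedding $z_i^t$ through separate MLPs to obtain role-specific vectors, such as $z_i^{t,(arc\text{-}dep)} = \mathrm{MLP}^{(arc\text{-}dep)}(z_i^t)$ and $z_i^{t,(arc\text{-}head)} = \mathrm{MLP}^{(arc\text{-}head)}(z_i^t)$, where the MLPs are all one-layer feed-forward networks with LeakyReLU activation functions.
These token embeddings are used to compute probabilities for every possible dependency relation, both unlabeled and labeled, similarly to Dozat and Manning (2016). The distribution of the unlabeled dependency graph is estimated using, for each token $i$, a biaffine classifier over possible heads $j$ applied to $z_i^{t,(arc\text{-}dep)}$ and $z_j^{t,(arc\text{-}head)}$. Then for each pair $i, j$, the distribution over labels given an unlabeled dependency relation is estimated using a biaffine classifier applied to the corresponding label-specific vectors of $i$ and $j$.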
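As a rough illustration of the unlabeled arc scorer, the sketch below computes biaffine scores in the form used by Dozat and Manning (2016), namely a bilinear term plus a head-only bias term, followed by a softmax over candidate heads. The parameter names `U` and `u` and the function names are assumptions made for this example, and the exact parameterization in the RNGTr implementation may differ.

```python
import math


def biaffine_arc_scores(dep, head, U, u):
    """scores[i][j] = head[j]^T U dep[i] + head[j]^T u.

    dep[i] is the arc-dep vector of token i and head[j] is the arc-head
    vector of token j; scores[i][j] rates j as the head of i.
    """
    scores = []
    for d_i in dep:
        # U d_i + u, shared by every candidate head of token i.
        proj = [sum(U[r][c] * d_i[c] for c in range(len(d_i))) + u[r]
                for r in range(len(U))]
        scores.append([sum(h * p for h, p in zip(h_j, proj)) for h_j in head])
    return scores


def head_distribution(row):
    """Softmax over candidate heads for one dependent."""
    top = max(row)
    exps = [math.exp(s - top) for s in row]
    total = sum(exps)
    return [e / total for e in exps]
```

The label scorer follows the same pattern, with one bilinear form per dependency label instead of a single matrix.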

Decoding Algorithms
The scoring function estimates a distribution over graphs, but the RNGTr architecture requires the decoder to output a single graph G t . Choosing this graph is complicated by the fact that the scoring function is non-autoregressive. Thus the estimate consists of multiple independent components, and there is no guarantee that every graph in this distribution is a valid dependency graph.
We take two approaches to this problem, one for intermediate parses G t and one for the final dependency parse G T . To speed up each refinement iteration, we ignore this problem for intermediate dependency graphs. We build these graphs by simply applying argmax independently to find the head of each node. This may result in graphs with loops, which are not trees, but this does not seem to cause problems for later refinement iterations. 2 For the final output dependency tree, we use the maximum spanning tree algorithm, specifically the Chu-Liu/Edmonds algorithm (Chi, 1999;Edmonds, 1967), to find the highest scoring valid dependency tree. This is necessary to avoid problems when running the evaluation scripts. The asymptotic complexity of the full model is determined by the complexity of this algorithm. 3
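The intermediate decoding step can be illustrated with a short sketch: each token picks its best-scoring head independently, and a separate check shows that the result is not guaranteed to be a tree. The function names, the choice of index 0 as the root, and the exclusion of self-heads are assumptions made for this example. The final Chu-Liu/Edmonds decoding is not reproduced here.

```python
def argmax_heads(arc_scores, root=0):
    """Pick the best-scoring head for every non-root token independently.

    arc_scores[dep][head] is the score of head as the parent of dep.
    Self-heads are excluded (an assumption of this sketch).
    """
    heads = {}
    for dep, row in enumerate(arc_scores):
        if dep == root:
            continue
        candidates = [j for j in range(len(row)) if j != dep]
        heads[dep] = max(candidates, key=lambda j: row[j])
    return heads


def has_cycle(heads, root=0):
    """Return True if following head links from some token never reaches root."""
    for start in heads:
        seen = set()
        node = start
        while node != root:
            if node in seen:
                return True
            seen.add(node)
            node = heads[node]
    return False
```

With scores such as `[[0, 0, 0], [1, 0, 5], [1, 4, 0]]`, tokens 1 and 2 select each other as heads, which is the kind of loop that the maximum spanning tree decoding of the final iteration rules out.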

Training
The RNG Transformer model is trained separately on each refinement iteration. Standard gradient descent techniques are used, with cross-entropy loss for each edge prediction. Error is not backpropagated across iterations of refinement, because no continuous values are being passed from one iteration to another, only a discrete dependency tree.
Stopping Criterion: In the RNG Transformer architecture, the refinement of the predicted graph can be done an arbitrary number of times, since the same encoder and decoder parameters are used at each iteration. In the experiments below, we place a limit on the maximum number of iterations. But sometimes the model converges to an output graph before this limit is reached, simply copying this graph during later iterations. During training, to avoid multiple iterations where the model is trained to simply copy the input graph, the refinement iterations are stopped if the new predicted dependency graph is the same as the input graph. At test time, we also stop computation in this case, but the output of the model is not affected.
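The refinement loop with this stopping criterion can be sketched as follows. The names `refine`, `initial_parser`, and `predict_graph` are hypothetical stand-ins for the initial parser and for one encoder-decoder pass, and graphs are represented as dictionaries from dependent to head purely for illustration. The final maximum spanning tree decoding is left out.

```python
def refine(tokens, initial_parser, predict_graph, max_iterations=3):
    """Recursively refine a graph until it stops changing or the limit is hit.

    Returns the final graph and the number of refinement passes that ran.
    """
    graph = initial_parser(tokens)
    iterations = 0
    for _ in range(max_iterations):
        new_graph = predict_graph(tokens, graph)
        iterations += 1
        if new_graph == graph:      # converged: later passes would only copy
            break
        graph = new_graph
    return graph, iterations


# Toy usage: a stand-in "model" that corrects one wrong head per pass.
GOLD = {1: 0, 2: 1, 3: 1}


def toy_predict(tokens, graph):
    new_graph = dict(graph)
    for dep, head in GOLD.items():
        if new_graph.get(dep) != head:
            new_graph[dep] = head
            break
    return new_graph


def empty_parser(tokens):
    return {}


result, passes = refine(["ROOT", "a", "b", "c"], empty_parser, toy_predict,
                        max_iterations=10)
```

In this toy run the graph is complete after three passes, and the fourth pass detects that nothing changed and stops, so the iteration limit of ten is never reached.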

Initial Parsers
The RNGTr architecture requires a graph $G^0$ to initialize the iterative refinement. We consider several initial parsers to produce this graph. To leverage previous work on dependency parsing and provide a controlled comparison to the state-of-the-art, we use parsing models from the recent literature as both baselines and initial parsers. To evaluate the importance of the initial parse, we also consider a setting where the initial parse is empty, so the first complete dependency tree is predicted by the RNGTr model itself. Finally, the success of our RNGTr dependency parsing model leads us to propose an initial parsing model with the same design, so that we can control for the parser design in measuring the importance of the RNG Transformer's iterative refinement.
SynTr model: We call this initial parser the Syntactic Transformer (SynTr) model. It is the same as one iteration of the RNGTr model shown in Figure 1 and defined in Section 3, except that there is no graph input to the encoder. Analogously to (1), $G^0$ is computed as:
$$Z^0 = E^{SYNTR}(W, P), \qquad G^0 = D^{SYNTR}(Z^0) \tag{7}$$
where $E^{SYNTR}$ and $D^{SYNTR}$ are the SynTr encoder and decoder, respectively. For the encoder, we use the Transformer architecture of BERT and initialize with pre-trained parameters of BERT. The token embeddings of the final layer are used for $Z^0$. For the decoder, we use the same scoring function as described in Section 3.2, and apply the Chu-Liu/Edmonds decoding algorithm (Chi, 1999; Edmonds, 1967) to find the highest scoring tree. This SynTr parsing model is very similar to the UDify parsing model proposed by Kondratyuk and Straka (2019). One difference that seems to be important for the results reported in Section 6.2 is in the way BERT token segmentation is handled. When BERT segments a word into subwords, UDify seems only to encode the first segment, whereas SynTr encodes all segments and only decodes with the first segment, as discussed in Section 5.3. Also, UDify decodes with an attention-based mixture of encoder layers, whereas SynTr only uses the last layer.

Datasets
To evaluate our models, we apply them on several kinds of datasets, namely, Universal Dependency (UD) Treebanks, Penn Treebanks, and the German CoNLL 2009 Treebank. For our evaluation on UD Treebanks (UD v2.3), we select languages based on the criteria proposed in de Lhoneux et al. (2017), and adapted by Smith et al. (2018). This set contains several languages with different language families, scripts, character set sizes, morphological complexity, and training sizes and domains. For our evaluation of Penn Treebanks, we use the English and Chinese Penn Treebanks (Marcus et al., 1993; Xue et al., 2002). For English, we use the same setting as defined in Mohammadshahi and Henderson (2020). For Chinese, we apply the same setup as described in Chen and Manning (2014) (Nilsson and Nivre, 2008).

Baseline Models
For UD Treebanks, we compare to several baseline parsing models. We use the monolingual parser proposed by Kulmizev et al. (2019), which uses BERT and ELMo (Peters et al., 2018) embeddings as additional input features. In addition, we compare to the multilingual multitask models proposed by Kondratyuk and Straka (2019) and Straka (2018). UDify (Kondratyuk and Straka, 2019) is a multilingual multitask model. UDPipe (Straka, 2018) is one of the winners of the CoNLL 2018 Shared Task (Zeman et al., 2018). For a fair comparison, we report the scores of UDPipe from Kondratyuk and Straka (2019) using gold segmentation. UDify is on average the best performing of these baseline models, so we use it as one of our initial parsers in the RNGTr model.
For Penn Treebanks and the German CoNLL 2009 corpus, we compare our models with previous state-of-the-art transition-based and graph-based models, including the biaffine parser (Dozat and Manning, 2016), which includes the same decoder as our model. We also use the biaffine parser as an initial parser for the RNGTr model.

Implementation Details
The encoder is initialized with pre-trained BERT models with 12 self-attention layers. All hyper-parameters are provided in Appendix A.
Since the wordpiece tokenizer (Wu et al., 2016) of BERT differs from that used in the dependency corpora, we apply the BERT tokenizer to each corpus word and input all the resulting sub-words to the encoder. For the input of dependency relations, each dependency between two words is specified as a relationship between their first sub-words. We also input a new relationship between each non-first sub-word and its associated first sub-word as its head. For the prediction of dependency relations, only the encoder embedding of the first sub-word of each word is used by the decoder. 4 The decoder predicts each dependency as a relation between the first sub-words of the corresponding words. Finally, for proper evaluation, we map the predicted sub-word heads and dependents to their original word positions in the corpus.
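The sub-word handling described above can be sketched as a pair of index mappings. The function names and the example segmentation are hypothetical (the pieces below are not actual output of the BERT tokenizer), and only heads are mapped. The relation label that marks a non-first sub-word is omitted for brevity.

```python
def to_subword_graph(word_pieces, word_heads):
    """Map word-level heads onto sub-word positions.

    word_pieces[w] is the list of sub-words of word w;
    word_heads[w] is the index of the head word of w (None if it has no head).
    Returns (first, sub_heads), where first[w] is the position of the first
    sub-word of word w and sub_heads[s] is the head position of sub-word s.
    """
    first = []
    position = 0
    for pieces in word_pieces:
        first.append(position)
        position += len(pieces)
    sub_heads = [None] * position
    for w, pieces in enumerate(word_pieces):
        if word_heads[w] is not None:
            # A word-level arc links the first sub-words of the two words.
            sub_heads[first[w]] = first[word_heads[w]]
        for k in range(1, len(pieces)):
            # Every non-first sub-word attaches to its own first sub-word.
            sub_heads[first[w] + k] = first[w]
    return first, sub_heads


def to_word_heads(first, sub_heads):
    """Map predicted heads of first sub-words back to word positions."""
    sub_to_word = {s: w for w, s in enumerate(first)}
    return [None if sub_heads[s] is None else sub_to_word[sub_heads[s]]
            for s in first]


# Hypothetical segmentation: ROOT, "unbelievable" (3 pieces), "story".
pieces = [["ROOT"], ["un", "##believ", "##able"], ["story"]]
first, sub_heads = to_subword_graph(pieces, [None, 2, 0])
```

Mapping back with `to_word_heads(first, sub_heads)` recovers the original word-level heads, which is the step needed before running a word-level evaluation script.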

Results and Discussion
After some initial experiments to determine the maximum number of refinement iterations, we report the performance of the RNG Transformer model on the UD Treebanks, Penn Treebanks, and German CoNLL 2009 Treebank. 5 The RNGTr models perform substantially better than previously proposed models on every dataset, and RNGTr refinement improves over its initial parser for almost every dataset. We also perform various analyses to understand these results better.

The Number of Refinement Iterations
Before conducting a large number of experiments, we investigate how many iterations of refinement are useful, given the computational costs of additional iterations. We evaluate different variations of our RNG Transformer model on the Turkish Treebank (Table 1). 6 We use both SynTr and UDify as initial parsers. The SynTr model significantly outperforms the UDify model, so the errors are harder to correct by adding the RNGTr model (2.67% for SynTr versus 15.01% for UDify of relative error reduction in LAS after integration). In both cases, three iterations of refinement achieve more improvement than one iteration, but not by a large enough margin to suggest the need for additional iterations. The further analysis reported in Section 6.5 supports the conclusion that, in general, an additional iteration would neither help nor hurt accuracy. The results in Table 1 also show that it is better to include the stopping strategy described in Section 3.3. In subsequent experiments, we use three refinement iterations with the stopping strategy, unless mentioned otherwise.

UD Treebank Results
Results for the UD treebanks are reported in Table 2. We compare our models with previous state-of-the-art results (both trained monolingually and multilingually), based on labeled attachment score. 7 The results with RNGTr refinement demonstrate the effectiveness of the RNGTr model at refining an initial dependency graph. First, the UDify+RNGTr model achieves significantly better LAS performance than the UDify model in all languages. Second, although the SynTr model significantly outperforms previous state-of-the-art models on all these UD Treebanks, 8 the SynTr+RNGTr model achieves further significant improvement over SynTr in four languages, and no significant degradation in any language. Of the nine languages where there is no significant difference between SynTr and SynTr+RNGTr for the given test sets, RNGTr refinement results in higher LAS in eight languages and lower LAS in only one (Russian).
5 The number of parameters and run times of each model on the UD and Penn Treebanks are provided in Appendix B.
6 We choose the Turkish Treebank because it is a low-resource Treebank and there are more errors in the initial parse for RNGTr to correct.
The improvement of SynTr+RNGTr over SynTr is particularly interesting because it is a controlled demonstration of the effectiveness of the graph refinement method of RNGTr. The only difference between the SynTr model and the final iteration of the SynTr+RNGTr model is the graph inputs from the previous iteration (Equations (7) versus (1)). By conditioning on the full dependency graph, the SynTr+RNGTr model's final RNGTr iteration can capture any kind of correlation in the dependency graph, including both global and between-edge correlations both locally and over long distances. This result also further demonstrates the generality and effectiveness of the G2GTr architecture for conditioning on graphs (Equations (4) and (5)).
As expected, we get more improvement when combining the RNGTr model with UDify, because UDify's initial dependency graph contains more incorrect dependency relations for RNGTr to correct. But after refinement, there is surprisingly little difference between the performance of the UDify+RNGTr and SynTr+RNGTr models, suggesting that RNGTr is powerful enough to correct any initial parse.
To investigate the power of the RNGTr architecture to correct any initial parse, we also show results for a model with an empty initial parse, Empty+RNGTr. For this model, we run four iterations of refinement (T=4), so that the amount of computation is the same as for SynTr+RNGTr. The Empty+RNGTr model achieves competitive results with the UDify+RNGTr model (i.e., above the previous state-of-the-art), and close to the results for SynTr+RNGTr. This accuracy is achieved despite the fact that the Empty+RNGTr model has half as many parameters as the UDify+RNGTr model and the SynTr+RNGTr model since it has no separate initial parser. These Empty+RNGTr results indicate that the RNGTr architecture is a very powerful method for graph refinement.
7 Unlabeled attachment scores are provided in Appendix C. All results are computed with the official CoNLL 2018 shared task evaluation script (https://universaldependencies.org/conll18/evaluation.html).
8 In particular, SynTr significantly outperforms UDify, even though they are very similar models. In addition to the model differences discussed in Section 4, there are some differences in the way UDify and SynTr models are trained that might explain this improvement, in particular, that UDify is a multilingual multitask model, whereas SynTr is a monolingual single-task model.
Table 2: Labeled attachment scores on UD Treebanks for monolingual (Kulmizev et al. (2019) and SynTr) and multilingual (UDPipe (Straka, 2018) and UDify (Kondratyuk and Straka, 2019)) baselines, and the refined models (+RNGTr) pre-trained with BERT. The relative error reduction from RNGTr refinement is shown in parentheses. Bold scores are not significantly different from the best score in that row (with α = 0.01).

Penn Treebank and German Corpus Results
UAS and LAS results for the Penn Treebanks and the German CoNLL 2009 Treebank are reported in Table 3. We compare to the results of previous state-of-the-art models and SynTr, and we use the RNGTr model to refine both the biaffine parser (Dozat and Manning, 2016) and SynTr, on all Treebanks. 9 Again, the SynTr model significantly outperforms previous state-of-the-art models, with a 5.78%, 9.15%, and 23.7% LAS relative error reduction in English, Chinese, and German, respectively. Despite this level of accuracy, adding RNGTr refinement improves accuracy further under both UAS and LAS. For the Chinese Treebank, this improvement is significant, with a 5.46% LAS relative error reduction. When RNGTr refinement is applied to the output of the biaffine parser (Dozat and Manning, 2016), it achieves a LAS relative error reduction of 10.64% for the English Treebank, 16.05% for the Chinese Treebank, and 27.72% for the German Treebank. These improvements, even over such strong initial parsers, again demonstrate the effectiveness of the RNGTr architecture for graph refinement.
Table 3: Comparison of our models to previous state-of-the-art models on English (PTB) and Chinese (CTB5.1) Penn Treebanks, and German CoNLL 2009 shared task treebank. "T" and "G" specify "Transition-based" and "Graph-based" models. Bold scores are not significantly different from the best score in that column (with α = 0.01).

Error Analysis
To better understand the distribution of errors for our models, we follow McDonald and Nivre (2011) and plot labeled attachment scores as a function of dependency length, sentence length, and distance to root. 10 We compare the distributions of errors made by the UDify (Kondratyuk and Straka, 2019), SynTr, and refined models (UDify+RNGTr, SynTr+RNGTr, and Empty+RNGTr). Figure 3 shows the accuracies of the different models on the concatenation of all development sets of UD Treebanks. Results show that applying RNGTr refinement to the UDify model results in a substantial improvement in accuracy across the full range of values in all cases, and little difference in the error profile between the better performing models. In all the plots, the gains from RNGTr refinement are more pronounced for the more difficult cases, where a larger or more global view of the structure is beneficial. As shown in the leftmost plot of Figure 3, adding RNGTr refinement to UDify results in particular gains for the longer dependencies, which are more likely to interact with other dependencies. The middle plot illustrates the accuracy of models as a function of the distance to the root of the dependency tree, which is calculated as the number of dependency relations from the dependent to the root. When we add RNGTr refinement to the UDify parser, we get particular gains for the problematic middle depths, which are neither the root nor leaves. Here, SynTr+RNGTr is also particularly strong on these high nodes, whereas SynTr is particularly strong on low nodes. In the plot by sentence length, the larger improvements from adding RNGTr refinement (both to UDify and SynTr) are for the shorter sentences, which are surprisingly difficult for UDify. Presumably, these shorter sentences tend to be more idiosyncratic, which is better handled with a global view of the structure. (See Figure 5 for an example.)
In all these cases, the ability of RNGTr to capture any kind of correlation in the dependency graph gives the model a larger and more global view of the correct output structure.
To further analyze where RNGTr refinement is resulting in improvements, we compare the error profiles of the SynTr and SynTr+RNGTr models on the Chinese Penn Treebank, where adding RNGTr refinement to SynTr results in significant improvement (see Table 3). As shown in Figure 4, RNGTr refinement results in particular improvement on longer dependencies (left plot), and on middle and greater depth nodes (right plot), again showing that RNGTr does particularly well on the difficult cases with more interactions with other dependencies.

Refinement Analysis
To better understand how the RNG Transformer model is doing refinement, we perform several analyses of the trained UDify+RNGTr model. 11 An example of this refinement is shown in Figure 5, where the UDify model predicts an incorrect dependency graph, but the RNGTr model modifies it to build the gold dependency tree.

Refinements by Iteration:
To measure the accuracy gained from refinement at different iterations, we define the following metric:
$$REL^t = RER(LAS^{t-1}, LAS^t)$$
where RER is relative error reduction, and $t$ is the refinement iteration. $LAS^0$ is the accuracy of the initial parser, UDify in this case.
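The per-iteration metric can be computed as in the sketch below, assuming the usual definition of relative error reduction, in which the error is 100 minus the attachment score in percent. The function names are hypothetical, and the scores in the usage line are made-up numbers, not results from the paper.

```python
def relative_error_reduction(las_before, las_after):
    """Fraction of the remaining error removed when LAS moves from before to after.

    Scores are percentages, so the error is 100 minus the score.
    """
    error_before = 100.0 - las_before
    error_after = 100.0 - las_after
    return (error_before - error_after) / error_before


def refinement_gains(las_by_iteration):
    """Relative error reduction of each iteration over the previous one.

    las_by_iteration[0] is the score of the initial parser.
    """
    return [relative_error_reduction(before, after)
            for before, after in zip(las_by_iteration, las_by_iteration[1:])]


# Illustrative scores only: initial parser, then two refinement iterations.
gains = refinement_gains([80.0, 85.0, 85.0])
```

Because each iteration is compared with the one before it, a value near zero at a later iteration means that the iteration changed little, not that the total gain is small.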
To illustrate the refinement procedure for different dataset types, we split UD Treebanks based on their training set size into "Low-Resource" and "High-Resource" datasets. Table 4 shows the refinement metric ($REL^t$) after each refinement iteration:

Dataset Type     t = 1      t = 2      t = 3
Low-Resource     +13.62%    +17.74%    +0.16%
High-Resource    +29.38%    +0.81%     +0.41%

Table 5: Relative F-score error reduction of a selection of dependency types for each refinement step on the concatenation of UD Treebanks (with UDify as the initial parser).

In general, different numbers of iterations may be necessary for different datasets, allowing efficiency gains by not performing unnecessary refinement iterations.
Dependency Type Refinement: Table 5 shows the relative improvement of different dependency types for the UDify+RNGTr model at each refinement step, ranked and selected by the total relative error reduction. A huge amount of improvement is achieved for all these dependency types at the first iteration step, and then we have a considerable further improvement for many of the remaining refinement steps. The later refinement steps are particularly useful for idiosyncratic dependencies which require a more global view of the sentence, such as auxiliary (aux) and copula (cop). A similar pattern of improvements is found when SynTr is used as the initial parser, reported in Appendix A.
Refinement by Projectivity: Table 6 shows the relative improvement of each refinement step for projective and non-projective trees. Although the total gain is slightly higher for projective trees, non-projective trees require more iterations to achieve the best results. Presumably, this is because non-projective trees have more complex non-local interactions between dependencies, which requires more refinement iterations to fix incorrect dependencies. This seems to contradict the common belief that non-projective parsing is better done with factorized graph-based models, which do not model these interactions.

Conclusion
In this paper, we propose a novel architecture for structured prediction, Recursive Non-autoregressive Graph-to-Graph Transformer (RNG Transformer), to iteratively refine arbitrary graphs. Given an initial graph, RNG Transformer learns to predict a corrected graph over the same set of nodes. Each iteration of refinement predicts the edges of the graph in a non-autoregressive fashion, but conditions these predictions on the entire graph from the previous iteration. This graph conditioning and prediction are made with the Graph-to-Graph Transformer architecture (Mohammadshahi and Henderson, 2020), which can capture complex patterns of interdependencies between graph edges and can exploit BERT  pre-training. We evaluate the RNG Transformer architecture by applying it to the problematic structured prediction task of syntactic dependency parsing. In the process, we also propose a graph-based dependency parser (SynTr), which is the same as one iteration of our RNG Transformer model but without graph inputs. Evaluating on 13 languages of the Universal Dependencies Treebanks, the English and Chinese Penn Treebanks, and the German CoNLL 2009 shared task treebank, our SynTr model already significantly outperforms previous state-of-the-art models on all these treebanks. Even with this powerful initial parser, RNG Transformer refinement almost always improves accuracies, setting new state-of-the-art accuracies for all treebanks. RNG Transformer consistently results in improvement regardless of the initial parser, reaching around the same level of accuracy even when it is given an empty initial parse, demonstrating the power of this iterative refinement method. Error analysis suggests that RNG Transformer refinement is particularly useful for complex interdependencies in the output structure.
The RNG Transformer architecture is a very general and powerful method for structured prediction, which could easily be applied to other NLP tasks. It would especially benefit tasks that require capturing complex structured interdependencies between graph edges, without losing the computational benefits of a non-autoregressive model.